Design and Analysis of Non-Inferiority Trials
Mark D. Rothmann
Brian L. Wiens
Ivan S. F. Chan
This book contains information obtained from authentic and highly regarded sources. Reasonable efforts
have been made to publish reliable data and information, but the author and publisher cannot assume
responsibility for the validity of all materials or the consequences of their use. The authors and publishers
have attempted to trace the copyright holders of all material reproduced in this publication and apologize to
copyright holders if permission to publish in this form has not been obtained. If any copyright material has
not been acknowledged please write and let us know so we may rectify in any future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmit-
ted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented,
including photocopying, microfilming, and recording, or in any information storage or retrieval system,
without written permission from the publishers.
For permission to photocopy or use material electronically from this work, please access www.copyright.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood
Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and
registration for a variety of users. For organizations that have been granted a photocopy license by the CCC,
a separate system of payment has been arranged.
Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used
only for identification and explanation without intent to infringe.
Rothmann, Mark D.
Design and analysis of non-inferiority trials / Mark D. Rothmann, Brian L. Wiens,
Ivan S.F. Chan.
p. ; cm. -- (Chapman & Hall/CRC biostatistics series)
Includes bibliographical references and index.
ISBN 978-1-58488-804-8 (hardback : alk. paper)
1. Drugs--Testing. 2. Experimental design. 3. Therapeutics, Experimental. I. Wiens,
Brian L. II. Chan, Ivan S. F. III. Title. IV. Series: Chapman & Hall/CRC biostatistics
series.
[DNLM: 1. Clinical Trials as Topic. 2. Research Design. 3. Therapies, Investigational.
QV 771]
RM301.27.R68 2011
615.5’80724--dc22 2011005377
Preface
In recent years there has been frequent use of non-inferiority trial designs
to establish the efficacy of an experimental agent. There has also been a pro-
liferation of research articles on the design and analysis of non-inferiority
studies. Points to Consider documents involving non-inferiority trials have
been issued by the European Medicines Agency, and there is a draft guid-
ance on non-inferiority trials that has been issued by the U.S. Food and Drug
Administration. A typical non-inferiority trial randomizes subjects to an
experimental regimen or to a standard of care, which is often referred to
as an “active” control. A non-inferiority trial places a limit on the amount
an experimental therapy is allowed to be inferior to a standard of care to
still be considered worthwhile. This limit or non-inferiority margin should
be selected so that a loss of efficacy of less than the margin relative to the
standard of care implies that the experimental therapy has efficacy (relative
to a placebo) and its efficacy is not unacceptably worse than the standard
of care. A new treatment that offers a better safety profile or a more prefer-
able method of administration compared to standard treatment may be ben-
eficial even if somewhat less effective than standard treatment. There have
been many non-inferiority clinical trials in various medical areas, including
thrombolytic, oncology, cardiorenal, and anti-infective drugs, vaccines, and
medical devices.
Design and Analysis of Non-Inferiority Trials is not intended as a substitute
for regulatory guidances on non-inferiority trials, but as a complement to
such guidances. This text provides a comprehensive discussion on the pur-
pose and issues involved in non-inferiority trials and will assist the reader in
designing a non-inferiority trial and in assessing the quality of non-inferior-
ity comparisons done in practice.
Design and Analysis of Non-Inferiority Trials is intended for statisticians and
nonstatisticians involved in drug development. Although some sections are
technical and written for an audience of statisticians, most of the book is
nontechnical and written to be easily understood by a broad audience with-
out any prior knowledge of non-inferiority clinical trials. Additionally, every
chapter begins with a nontechnical introduction.
We have strived to provide a thorough discussion on the most important
aspects involved in the design and analysis of non-inferiority trials. The first
two chapters discuss the history of non-inferiority trials and the design and
conduct considerations for a non-inferiority trial. A first step in designing a
non-inferiority trial is evaluating the previous effect of the selected active
control treatment. Chapters 3 and 4 cover the strength of evidence of an effi-
cacy finding and evaluating the effect size of a treatment. The active con-
trol therapy is identified based on knowledge of its performance in previous
trials, not independent of the results of those previous trials. Thus, addi-
tional efforts are required to understand the effect size of the active control.
Chapter 5 presents the two main analysis methods frequently used in non-
inferiority trials, their variations, and their properties. Chapter 6 discusses
the gold standard non-inferiority design that additionally includes a placebo
group. Chapters 7 through 10 cover a variety of individual issues of non-infe-
riority trials, including multiple comparisons, missing data, analysis popu-
lation, the use of safety margins, the internal consistency of non-inferiority
inference, the use of surrogate endpoints, trial monitoring, and equivalence
trials. Chapters 11 through 13 provide specific issues and analysis methods
when the data are binary, continuous, and time to event, respectively. Design
and Analysis of Non-Inferiority Trials can be read fully in the order presented.
Individual chapters can also be understood or used directly as a reference
without reading the previous chapters. A reader with little prior exposure to non-infe-
riority trials should start with Chapters 1 through 6 in the order presented,
and cover the remaining material as needed. We have also included a discus-
sion on p values, confidence intervals, and frequentist and Bayesian analyses
in the appendix.
We appreciate the assistance of all the reviewers of this book and the book
proposal for their careful, insightful review. We are also indebted to so many
at Taylor & Francis Publishing, most notably David Grubbs for his guidance
and patience.
We thank all of those who have provided discussions and interactions on
non-inferiority trials, including David Brown, Kevin Carroll, Gang Chen,
George Chi, Ralph D’Agostino, Susan Ellenberg, Thomas Fleming, Paul Flyer,
Thomas Hammerstrom, Dieter Hauschke, Rob Hemmings, David Henry, Jim
Hung, Qi Jiang, Armin Koch, John Lawrence, Ning Li, Kathryn Odem-Davis,
Robert O’Neill, Stephen Snapinn, Greg Soon, Robert Temple, Ram Tiwari,
Yi Tsong, Hsiao-Hui Tsou, Thamban Valappil, and Sue Jane Wang. We are
particularly grateful to Dr. Ellenberg for providing slides on the history of
non-inferiority trials.
We are grateful for the support and encouragement provided by our fam-
ilies. Our deepest gratitude to our wives, Shiowjen (for MR), Marilyn (for
BW), and Lotus (for IC), for their patience and support during the writing of
this book.
than a standard of care. This may be the case when the experimen-
tal therapy is in the same “drug class” as the standard therapy. It
would be necessary to demonstrate that the experimental therapy
has efficacy either better than or not too much worse than the stan-
dard therapy.
Case 4: The experimental regimen replaces one drug in a standard regimen
of multiple drugs with the experimental drug. For the experi-
mental regimen to be considered as an alternative to the standard
regimen, it may be necessary for the experimental regimen to have
better efficacy than every drug, drug combination, and regimen for
which that standard combination is superior. If each component
of the drug combination for the experimental arm is regarded as
“active,” it may also be necessary for the experimental combina-
tion to have more efficacy than any subset of the drugs in that drug
combination. When a new standard of care demonstrates improved
survival over the previous standard of care, it may (or may not) be
unethical to give patients that previous standard of care for that
indication or line of therapy. Thus, it is important that any therapy
being considered for use has sufficient efficacy to be considered an
ethical therapy for the studied indication.
Case 5: The purpose of an experimental drug is to reduce the chance
of toxicities or side effects to patients caused by a standard therapy. It
is important to study whether the experimental drug interferes with
the effectiveness of the standard therapy; that is, to study the amount
of efficacy of the standard therapy that may be lost by additionally
providing patients with the experimental therapy. The standard
therapy with the experimental drug may be worthwhile to patients
relative to the standard therapy alone if the standard therapy plus the
experimental drug has fewer toxicities or side effects than the standard
therapy alone, despite having slightly less efficacy than the standard
therapy alone. However, a lower dose (or less frequent use) of the stan-
dard therapy may also have fewer toxicities or side effects than the reg-
ular dose of the standard therapy. While a trial comparing the regular
dose of the standard therapy with and without the experimental drug
provides efficacy and safety data on the two regimens, it may not
(depending on the results of the trial) provide evidence of the neces-
sity of the experimental drug, unless the dose–response relationship
on efficacy and safety is known for the standard therapy.
therapy (“me too drugs”) has been criticized.8,9 This is particularly problem-
atic when nonrigorous margins are used, potentially leading to a “biocreep,”
in which an inferior therapy is used as the control therapy for the next gen-
eration of non-inferiority trials.
There may not be an appropriate choice for the active comparator for a
non-inferiority trial of efficacy even when it may seem that a non-inferi-
ority trial is the appropriate choice. Per the International Conference on
Harmonization (ICH) E9 Guidance10: “A suitable active comparator could
be a widely used therapy whose efficacy in the relevant indication has been
clearly established and quantified in well-designed and well-documented
superiority trials and which can be reliably expected to have similar effi-
cacy in the contemplated active control trial.” Per the ICH-E10 guidelines,11
an active control can be used in a non-inferiority trial when its effect is
(1) of substantial magnitude compared to placebo or some other reference
therapy, (2) precisely estimated, and (3) relevant to the setting of the non-
inferiority trial. Due to the importance of the effect of the active control
(the motivation for conducting an active control trial), the non-inferiority
margin should be sufficiently small so that demonstrating non-inferiority
leads to the conclusion that the experimental therapy preserves a substan-
tial fraction of the active control effect and that the use of the experimental
therapy instead of the active control therapy will not result in a clinically
meaningful loss of effectiveness.12
For an active-controlled clinical trial, the efficacy requirements are less rig-
orous in a non-inferiority comparison (less needs to be statistically ruled out)
than in a superiority comparison. That is, when compared with an effective
standard therapy, it is easier to demonstrate that the experimental therapy
has noninferior efficacy than to demonstrate that it has superior efficacy.
Thus, when the efficacy of an experimental therapy must be determined
against an effective standard therapy, non-inferiority may be preferred as
the main objective instead of superiority. When the experimental therapy
has a small efficacy advantage over the standard therapy, a superiority trial
having the standard therapy as the control therapy would require a large
number of subjects to be adequately powered (e.g., at least 80% power). When
the experimental therapy has no efficacy advantage over the standard ther-
apy, it is impossible to design an adequately powered superiority trial that
has the standard therapy as the control therapy.
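To make the power comparison concrete, below is a minimal sketch (ours, not the book's) of the usual normal-approximation power calculation for comparing means, using symbols that mirror those defined later in Chapter 2 (Δa for the assumed true difference E − C, δ for the margin, σ for the common standard deviation). Setting δ = 0 gives the superiority test, whose power at Δa = 0 equals α no matter how large the trial is.

```python
import math
from scipy.stats import norm

def power_one_sided(delta_a, delta, sigma, n_per_arm, alpha=0.025):
    """Approximate power of the one-sided test that rules out a true
    difference in effects (E - C) of -delta or worse, when the true
    difference equals delta_a. delta = 0 gives a superiority test.
    Assumes equal allocation and a common standard deviation sigma."""
    se = sigma * math.sqrt(2.0 / n_per_arm)  # SE of the estimated difference
    return norm.cdf((delta_a + delta) / se - norm.ppf(1 - alpha))

# With no true advantage (delta_a = 0), a superiority test has power equal
# to alpha no matter how large the trial is ...
print(power_one_sided(0, 0, sigma=30, n_per_arm=10_000))  # 0.025
# ... while a non-inferiority test with margin delta = 10 can be well
# powered at that same alternative.
print(power_one_sided(0, 10, sigma=30, n_per_arm=200))    # about 0.92
```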
TABLE 1.1
Description of Each Type of Comparison

Type of Comparison   Description
Inferiority          The experimental arm is worse than the control arm.
Equivalence          The absolute difference between the experimental arm and
                     the control arm is smaller than a prespecified margin.
Non-inferiority      The experimental arm is either better than the control arm
                     or the experimental arm is inferior to the control arm by
                     less than some prespecified margin.
Superiority          The experimental arm is better than the control arm.
Difference           The study arms are not equal. Either the experimental arm
                     is worse than the control arm or the experimental arm is
                     better than the control arm.
arm is also noninferior to the control arm. When the equivalence margin cor-
responds to the non-inferiority margin and the experimental arm is “equiva-
lent” to the control arm, then the experimental arm is also noninferior to the
control arm.
Whether a specific relation can be concluded between the control and
experimental arms is often reduced to comparing a confidence interval for
the difference in effects with either zero, a non-inferiority margin of δ, or
equivalence limits of ±δ (for some δ > 0). For a prespecified confidence level,
a confidence interval does not contain those cases that have been ruled out
by the data. For a confidence level of 100(1 − α)%, the method for determining
the confidence interval is such that, before observing the data, there was a
100(1 − α)% chance (or greater) that the confidence interval would capture
the true value. As such, about 95% of all 95% confidence intervals capture the
true value of the parameter that is being estimated.
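This coverage property is easy to check by simulation. The sketch below (illustrative only, with σ treated as known for simplicity) builds 100,000 95% confidence intervals for a normal mean and counts how often they capture the true value.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(12345)
mu, sigma, n, n_trials = 5.0, 2.0, 50, 100_000

samples = rng.normal(mu, sigma, size=(n_trials, n))
means = samples.mean(axis=1)
half_width = norm.ppf(0.975) * sigma / np.sqrt(n)  # known-sigma interval

covered = (means - half_width <= mu) & (mu <= means + half_width)
print(covered.mean())  # close to 0.95
```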
In Figure 1.1, since confidence interval A contains only negative values for
the difference in the effects between the control and experimental therapies
(C–E), the experimental therapy is concluded to be superior to the control
therapy.

FIGURE 1.1
Relationship between different types of conclusions. (Confidence intervals
A through E for C – E are shown against the reference values –δ, 0, and δ;
values to the left of zero favor the experimental therapy and values to the
right favor the control.)

Because confidence interval B contains only values less than δ, the
experimental therapy is concluded to be noninferior to the control therapy
with respect to the margin δ. However, as confidence interval B contains both
positive and negative values for the difference in the effects, the experimental
therapy cannot be concluded to be superior or inferior to the control therapy.
As confidence interval C contains only values between –δ and δ, reflecting
a small absolute difference in the effects of the experimental and control
therapies, the experimental therapy and control therapy are concluded to
be “equivalent” or similar with respect to the limits ±δ. Since confidence
interval D contains only positive values for the difference in the effects that
are smaller than δ, the experimental therapy is concluded to be inferior and
noninferior to the control therapy. This would mean that the experimental
therapy is less effective than the control therapy, but not with unacceptably
worse efficacy. Since confidence interval E contains only positive values with
some of those values larger than δ, the experimental therapy is inferior to
the control therapy and cannot be concluded to be noninferior to the control
therapy.
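These decision rules can be written out directly. The sketch below (ours, not the book's) reports every conclusion a given confidence interval for C − E supports with respect to a margin δ > 0; note that in practice equivalence is often judged with a 90% interval while superiority and non-inferiority use a 95% interval.

```python
def supported_conclusions(lower, upper, delta):
    """Conclusions supported by a confidence interval (lower, upper) for
    C - E, the difference in effects between the control and experimental
    therapies, with margin delta > 0 (larger C - E favors the control)."""
    conclusions = []
    if upper < 0:
        conclusions.append("experimental superior to control")
    if lower > 0:
        conclusions.append("experimental inferior to control")
    if upper < delta:
        conclusions.append("experimental noninferior to control")
    if -delta < lower and upper < delta:
        conclusions.append("arms equivalent within limits +/- delta")
    return conclusions

# Interval D of Figure 1.1: entirely positive but below delta, so the
# experimental arm is concluded both inferior and noninferior (these
# limits also happen to fall within the equivalence limits).
print(supported_conclusions(0.2, 0.8, delta=1.0))
```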
The type of comparison that can be done depends on the type or scale of
the data. Data may be qualitative or quantitative. Qualitative data may be
nominal or ordinal. The scale is nominal when subjects’ outcomes are orga-
nized into unordered categories (e.g., gender, type of disease). The scale
is ordinal when subjects’ outcomes are organized into ordered categories.
Quantitative data may have an interval or ratio scale. The scale is interval
when differences have meaning (e.g., time of day and temperature in degrees
Celsius); two pairs of values with equal differences convey the same
meaning. The scale is ratio when ratios or quotients have meaning (e.g., time
to complete a task, survival time, temperature in kelvins). Data hav-
ing a ratio scale have a meaningful zero.
When the data have a nominal scale, the relevant parameters are the actual
relative frequencies or probabilities for each category. Since the categories
are unordered, comparisons between study arms of the distributions for
such measurements involve comparing for each category the similarity of
the respective relative frequencies. That the distributions are different or
that distributions are similar (an “equivalence” type of inference) are the
only possible type of inferences. Non-inferiority, superiority, and inferiority
inferences require that there is an order to the possible values. One measure
of the similarity of two distributions of nominal measurements is the sum
over all categories of the smaller relative frequencies between the two arms.
For all other scales of measurements, any type of inference (e.g., equivalence,
non-inferiority or superiority) can be made.
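The similarity measure described above for nominal data has a one-line implementation; the sketch below (ours) assumes each arm's distribution is given as relative frequencies over the same list of categories.

```python
def distribution_overlap(freqs_arm1, freqs_arm2):
    """Sum, over all categories, of the smaller relative frequency between
    the two arms: 1 for identical distributions, 0 for distributions that
    put their mass on disjoint categories."""
    return sum(min(f1, f2) for f1, f2 in zip(freqs_arm1, freqs_arm2))

print(distribution_overlap([0.5, 0.3, 0.2], [0.4, 0.4, 0.2]))  # 0.9
```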
For data having an ordinal scale, additional relevant parameters would
include the actual cumulative relative frequencies or cumulative probabili-
ties for each category. For a given category, its cumulative relative frequency
is the relative frequency of observations that either fall into that category or
any category having less value. For data that have an interval or ratio scale,
References
1. Rothmann, M. et al., Design and analysis of non-inferiority mortality trials in
oncology, Stat. Med., 22, 239–264, 2003.
2. Ellenberg, S.S. and Temple, R., Placebo controlled trials and active-control trials
in the evaluation of new treatments. Part 2: Practical issues and specific cases,
Ann. Intern. Med., 133, 464–470, 2000.
3. Freedman, B., Equipoise and the ethics of clinical research, N. Engl. J. Med., 317,
141–145, 1987.
4. Freedman, B., Placebo-controlled trials and the logic of clinical purpose, IRB:
Rev. Hum. Subj. Res., 12, 1–6, 1990.
5. D’Agostino, R.B., Massaro, J.M., and Sullivan, L.M., Non-inferiority trials:
Design concepts and issues—The encounters of academic consultants in statis-
tics, Stat. Med., 22, 169–186, 2003.
6. Ebbutt, A.F. and Frith, L., Practical issues in equivalence trials, Stat. Med., 17,
1691–1701, 1998.
7. Committee for Medicinal Products for Human Use (CHMP), Guideline on
the Choice of the Non-inferiority Margin, EMA, London, 2005, at http://www.ema.europa.eu/ema/pages/includes/document/open_document.jsp?webContentId=WC500003636.
8. Piaggio, G. et al., Reporting of non-inferiority and equivalence randomized tri-
als: An extension of the CONSORT statement, JAMA, 295, 1152–1160, 2006.
9. Fleming, T.R., Current issues in non-inferiority trials, Stat. Med., 27, 317–332,
2008.
10. International Conference on Harmonization of Technical Requirements for
Registration of Pharmaceuticals for Human Use (ICH) E9: Statistical Principles
for Clinical Trials, 1998, at http://www.ich.org/cache/compo/475-272-1.html#E4.
11. International Conference on Harmonization of Technical Requirements for
Registration of Pharmaceuticals for Human Use (ICH) E-10: Guidance on
Choice of Control Group in Clinical Trials, 2000, at http://www.ich.org/cache/compo/475-272-1.html#E4.
12. Fleming, T.R. and Powers, J.H., Issues in non-inferiority trials: The evidence in
community-acquired pneumonia, Clin. Infect. Dis., 47, S108–120, 2008.
13. Makuch, R. and Simon, R., Sample size requirements for evaluating a conservative therapy, Cancer Treat. Rep., 62, 1037–1040, 1978.
14. Lasagna, L., Placebos and controlled trials under attack, Eur. J. Clin. Pharmacol.,
15, 373–374, 1979.
15. Blackwelder, W.C., Proving the null hypothesis in clinical trials, Control. Clin.
Trials, 3, 345–353, 1982.
16. Temple, R., Government viewpoint of clinical trials, Drug Inf. J., 16, 10–17,
1982.
17. Fleming, T.R., Treatment evaluation in active control studies, Cancer Treat. Rep.,
71, 1061–1065, 1987.
18. Fleming, T.R., Evaluation of active control trials in acquired immune deficiency
syndrome, J. AIDS, 3, 82–87, 1990.
19. U.S. Food and Drug Administration Division of Anti-Infective Drug Products
Advisory Committee Meeting Transcript, February 19–20, 2002, at http://www.fda.gov/ohrms/dockets/ac/cder02.htm#Anti-Infective.
20. U.S. Food and Drug Administration. Guidance for Industry Antibacterial Drug
Products: Use of Non-inferiority Studies to Support Approval (draft guidance),
October 2007.
21. Greene, W.L., Concato, J., and Feinstein, A.R., Claims of equivalence in medical
research: Are they supported by the evidence?, Ann. Intern. Med., 132, 715–722,
2000.
22. Committee for Proprietary Medicinal Products, Points to Consider on Switching between Superiority and Non-inferiority, EMA, London, 2005, at http://www.ema.europa.eu/ema/pages/includes/document/open_document.jsp?webContentId=WC500003658.
23. Le Henanff, A. et al., Quality of reporting of non-inferiority and equivalence ran-
domized trials, JAMA, 295, 1147–1151, 2006.
24. U.S. Food and Drug Administration, Guidance for Industry: Non-inferiority
Clinical Trials (draft guidance), March 2010.
2.1 Introduction
The gold standard in evaluating the safety and efficacy of an experimental
agent is a placebo-controlled trial that is designed and conducted so that no
or little bias is introduced in the comparison of study arms. It is also neces-
sary for the clinical trial to have assay sensitivity—the ability to distinguish
an effective therapy from an ineffective therapy. The experimental condi-
tions of the clinical trial should also be such that the results are externally
valid.
Poor study conduct will either introduce a bias, favoring one treatment
over another, or obscure treatment differences. Obscuring treatment dif-
ferences makes it more difficult to show that one study arm is better than
another. However, for an active-controlled trial, obscuring treatment differ-
ences will make it easier to conclude both equivalence and non-inferiority
when the experimental therapy is not notably better than the active control.
Additionally, since the active control effect size is often assumed before con-
ducting the non-inferiority trial, poor study conduct can reduce the active
control effect in the setting of the non-inferiority trial, making it more dif-
ficult to distinguish whether an experimental therapy is effective. For a non-
inferiority comparison, it is important that the selection of the non-inferiority
margin and the effect of the control arm for the current trial are such that
a demonstration of noninferior efficacy by the experimental arm compared
with the control arm, along with an appropriate, fair study conduct, will
imply that the experimental therapy is effective and not unacceptably worse
than the active control.
In this chapter we will discuss external validity, assay sensitivity, the steps
and issues in designing a non-inferiority trial, including the setting of the
non-inferiority margin, the analysis population, and the sizing of a non-
inferiority trial. The last section of this chapter briefly discusses the early
history and experience of non-inferiority studies in anti-infective products.
3. Setting a margin
4. Conducting the trial
to ICH E10,2 indications where this has been a concern include depression,
anxiety, dementia, angina, symptomatic congestive heart failure, seasonal
allergies, and symptomatic gastroesophageal reflux disease. For these indica-
tions, therapies have been shown effective in multiple well-controlled trials.
However, because of the lack of sensitivity to having even a minimal effect
size, a non-inferiority margin cannot be established for which the effec-
tiveness of the experimental arm could be inferred from a non-inferiority
comparison. In such situations, it would be necessary to conduct trials for
the purpose of demonstrating superiority to either a placebo or a standard
therapy.
of the standard therapy can be assessed against the placebo arm within
the non-inferiority trial. Because only direct comparisons are needed to
evaluate the effectiveness of the experimental therapy, there will be fewer
issues involving the sensitivity of the trial to establish that the experimental
therapy has efficacy or adequate efficacy. For example, there are no similar
issues, as with a two-arm non-inferiority trial, about whether the results of
the previous trials are transferable to the non-inferiority trial. Additionally,
a three-arm trial allows for the control of both the precision of the estimated
effect of the experimental therapy versus placebo and the precision of the
estimated difference between the experimental and active control thera-
pies. This is usually an advantage, as the precision of the historical estimate of
the active control effect is fixed, possibly leading to imprecise indirect
estimates of the effect of the experimental therapy relative to placebo. If the
precision of the historically estimated effect of the active control is very low,
this historical estimation may not be useful in designing a two-arm, active-
controlled non-inferiority trial.
Recall that for an experimental drug that treats or prevents toxicity caused
by another drug, it may be important to study whether the use of the exper-
imental drug alters the benefits or likelihood of benefit from the original
therapy. Since a reduction of the dosage of the original drug should produce
less toxicity, for the experimental drug to be useful, it is important that the
use of the experimental drug with the studied dosage of original drug have
a benefit–toxicity profile that is as good as or better than the benefit–toxicity
profile of any given reduction of dosage of the original drug. Without know-
ing the dose–response relationship of the original drug on benefit and toxic-
ity, the demonstration of less toxicity and noninferior benefit of adding the
experimental drug to the standard dosage of the original drug may not be
sufficient to show that the experimental drug is absolutely necessary.
The assumption that the historical estimation of the effect of the control
therapy is unbiased for the setting of the non-inferiority trial has been called
the constancy assumption.
Because the evaluation of the active control effect is based on past studies,
it is often unclear whether these estimated effects apply to the non-inferiority
trial setting. Even when the historical studies show a fairly constant active
control effect, there may be factors that would alter the effect of the active
control in the setting of the non-inferiority trial. The non-inferiority trial
may be conducted in subjects with less responsive or more resistant disease;
subjects may now have access to better supportive care or different concomi-
tant interventions that may attenuate the active control effect, or there may
be lower adherence in the non-inferiority trial.4 The definitions of the pri-
mary endpoint and/or how the primary endpoint is measured may also vary
across studies. If it is believed that the effect size of the active control therapy
has diminished or otherwise will be smaller in the non-inferiority trial than
in the previous trials, the estimated effect of the control therapy should be
reduced when applied to the setting of the non-inferiority trial. If the active
control effect is smaller in the non-inferiority trial than in the historical tri-
als and this is not accounted for in the analysis, the assay sensitivity of the
trial will be low and there will be an increased risk of claiming an ineffec-
tive therapy as effective. The non-inferiority margin is often conservatively
chosen because of concerns that the effect of the standard therapy may have
diminished. As stated in the ICH-E10 guidance2: “The determination of the
margin in a non-inferiority trial is based on both statistical reasoning and
clinical judgment, and should reflect uncertainties in the evidence on which
the choice is based, and should be suitably conservative.”
As the effect of the active control depends on external experience, the
non-inferiority comparison is an across-trials comparison. As such, formal
cause-and-effect conclusions cannot be made from an across-trials comparison
without either making assumptions or providing evidence or arguments
that the conditions and conduct of the current trial and previous trial are
exchangeable, or the results are so marked that the lack of such exchange-
ability is not impactful.
Although many of these across-trials issues are shared with historically
controlled trials, other issues are different. Essentially, historically controlled
studies compare subject outcomes where subjects were not randomized.
Unaddressed imbalances between groups on known and unknown prog-
nostic factors can invalidate a historical comparison. Differences between
the historical trials used to evaluate the effect of the active control and the
non-inferiority trial in factors associated with the size of the active control
effect (effect modifiers) that are not accounted for in the analysis can invali-
date a non-inferiority comparison. More on effect modification is discussed
in Chapter 4 on evaluating the active control effect.
The non-inferiority margin should account for effect modifiers and also for
biases in the estimation of the active control effect. Biases in the estimation of
the historical effect of the active control can arise owing to selection biases in
choosing the historical studies and regression to the mean bias in identifica-
tion of the active control. If the historical trials were found through a literature
search, there may be a publication bias. If studies having unfavorable or less
favorable results were not published, and thus not included in the estimation
of the active control therapy’s effect, the historical active control effect will be
overestimated. Furthermore, the active control is likely selected on the basis
of outcome (i.e., positive results from previous trials) and thus the estimated
active control effect will be biased and greater than the true effect size.
Additionally, in the absence of the ability to estimate the between-trial vari-
ability of the effect of the active control therapy, some additional variability
may need to be added to the variance of the estimator of the control therapy’s
historical effect to account for potential unknown factors that influence the
effect of the active control. This would be particularly true if there were only
one or two previous studies that could be used to estimate the effect of the
control therapy and the disease of interest has a history of therapies having
between-trial variability in their effects.
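A minimal sketch of this inflation follows (ours, not the book's); here τ is an assumed between-trial standard deviation supplied by the analyst, not estimated from the data.

```python
import math
from scipy.stats import norm

def inflated_ci(effect_est, se_hist, tau, level=0.95):
    """CI for the historical active control effect after adding an assumed
    between-trial variance tau**2 to the within-trial variance se_hist**2."""
    se_total = math.sqrt(se_hist**2 + tau**2)
    z = norm.ppf(0.5 + level / 2)
    return effect_est - z * se_total, effect_est + z * se_total

print(inflated_ci(10.0, 2.0, tau=0.0))  # no inflation: about (6.08, 13.92)
print(inflated_ci(10.0, 2.0, tau=2.0))  # inflated:     about (4.46, 15.54)
```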
A constant or slightly varying effect size across studies for the control
therapy is more important when the effect size is always small or moderate.
The planned non-inferiority trial may still have assay sensitivity in dem-
onstrating that the experimental arm has adequate efficacy when there are
inconsistent, but all large, demonstrated effect sizes across studies. When
large effects have been demonstrated across studies, there may be little sta-
tistical uncertainty in the choice of the acceptable amount of loss of a control
therapy’s effect that an experimental therapy can have for the experimental
therapy to be noninferior.
The U.S. Food and Drug Administration (FDA) draft guidance1 discusses
an efficacy margin (M1), used in evaluating whether the experimental ther-
apy has any efficacy, and a clinical margin (M2), used to evaluate whether the
experimental therapy has unacceptably less efficacy than the active control.
The reason for considering a clinical margin that is smaller than the efficacy
margin is attributable to the importance of the effect of the active control
therapy. The importance of the active control effect is often the reason why a
placebo-controlled trial cannot be conducted.
As noted in an FDA advisory committee meeting for antibiotic drugs,5 for
some diseases (e.g., pneumonia), the reasons that make it unethical to do a
placebo-controlled trial are the same reasons attributed to the unwillingness
to have an experimental therapy that is much less effective than the standard
therapy. For clinical trials in such diseases, it may therefore be worthwhile
to consider how much less efficacious a new therapy could be compared with
an existing therapy when choosing the non-inferiority margin. Such margins
may be based on clinical practice guidelines, patient opinion, other sources,
and/or sound reasoning.
For endpoints of mortality or irreversible morbidity, it may be more difficult
or impossible to define any margin that is clinically acceptable. However, if
The potential for biocreep can be greatly reduced by using as the control
therapy the therapy (or one of the therapies) with the greatest demonstrated
effect.5,8
Fleming6 proposed that a clinical margin that takes into consideration the
perspective of the patient be determined by a team of clinical and statistical
researchers—that is, how much clinical benefit would a patient be willing to
exchange for greater ease of administration or less risk of adverse events.
Probably the most common choice of a non-inferiority margin is half of the
lower limit of the 95% CI of the effect of the control therapy based on a meta-
analysis of historical studies comparing that therapy with placebo. Different
individuals have agreed on using such a margin but have disagreed on its
interpretation. Some have viewed this margin as acceptable only for indi-
rectly concluding that the experimental treatment is better than placebo—
that is, the lower limit of the 95% CI of the effect of the control therapy is
used as an “estimate” of the historical effect of the control therapy and is
then decreased by 50% to apply it to the setting of the non-inferiority trial
(see Snapinn9). Others have viewed such a non-inferiority margin as using
the lower limit of the 95% CI of the effect of the control therapy as a con-
servative estimate of the control therapy’s effect for the non-inferiority trial, and
it is required that the experimental therapy retain 50% of the effect of the
control therapy.10 In both perspectives, how conservative such an approach
to selecting a margin is changes from case to case and is independent of the
concerns on how transferable the estimates based on historical trials are to
the non-inferiority trial.
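As a numerical sketch of this margin construction (the estimate and standard error below are hypothetical): starting from a meta-analytic estimate of the control effect versus placebo, take the lower limit of the 95% CI and halve it.

```python
from scipy.stats import norm

# Hypothetical meta-analytic estimate of the control effect vs. placebo
# (both numbers are made up for illustration).
effect_est, se = 10.0, 2.0

m1 = effect_est - norm.ppf(0.975) * se  # lower 95% limit, about 6.08
m2 = 0.5 * m1                           # half the lower limit, about 3.04

print(f"M1 = {m1:.2f}, candidate non-inferiority margin M2 = {m2:.2f}")
```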
Synthesis test procedures have also been used in testing non-inferiority.
Typically, the results from the active-controlled trial and the results from
estimating the historical effect of the control therapy are integrated through
a normalized test statistic. The goal is to demonstrate that the experimental
therapy retains a fraction of the control therapy’s effect greater than a pre-
specified fraction of the control therapy’s effect. Examples and discussion on
particular synthesis methods can be found in the papers of Hasselblad and
Kong,11 Holmgren,12 Simon,13 and Rothmann et al.10 The procedures used by
Rothmann et al.,10 Hasselblad and Kong,11 and Holmgren12 are designed to
maintain a desired type I error rate when the estimation of the effect of the
control therapy is unbiased for the setting of the non-inferiority trial. For a
synthesis test procedure, Wang, Hung, and Tsong14 examined how the type
I error rate changes in various cases when the historical estimation of the
control effect is used and the constancy assumption is false.
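A common form of the synthesis test statistic is sketched below under the constancy assumption (the exact formulations in the cited papers differ in details such as the scale of analysis); it combines the within-trial estimate of E − C with the historical estimate of C − P.

```python
import math
from scipy.stats import norm

def synthesis_test(d_ec, se_ec, d_cp, se_cp, fraction, alpha=0.025):
    """One-sided synthesis test that the experimental therapy retains more
    than `fraction` of the control effect.

    d_ec, se_ec : estimate and SE of E - C from the non-inferiority trial
    d_cp, se_cp : historical estimate and SE of C - P (control vs. placebo)
    The null hypothesis is E - C <= -(1 - fraction) * (C - P).
    """
    z = (d_ec + (1 - fraction) * d_cp) / math.sqrt(
        se_ec**2 + (1 - fraction)**2 * se_cp**2)
    return z, z > norm.ppf(1 - alpha)

# Illustrative numbers: observed E - C of -1 with SE 1.5, historical C - P
# of 10 with SE 2, requiring 50% retention of the control effect.
print(synthesis_test(-1.0, 1.5, 10.0, 2.0, fraction=0.5))  # (2.22, True)
```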
Efficacy and Clinical Margins. As stated in the FDA draft guidance,1 “Determining
the NI margin is the single greatest challenge in the design, conduct,
and interpretation of NI trials.” We discussed in Section 2.3 many of the
issues involved in selecting a non-inferiority margin. Temple and Ellenberg15
described three possible margins, M0, M1, and M2. M0 (or just zero) is the
margin used when the active control is not regularly superior to placebo. M1
is the efficacy margin used to determine whether the experimental therapy
has any efficacy and M2 is the clinical margin used to evaluate whether the
experimental therapy has unacceptably less efficacy than the active control.
The efficacy margin has been regarded as the assumed active control effect
size for the non-inferiority trial.1 The value for M1 is often based on previ-
ous trials evaluating the active control effect with appropriate adjustments
due to factors (effect modifiers) that may lead to a different effect size for the
active control in the setting of the non-inferiority trial. Since the quality of
the non-inferiority trial cannot be assessed beforehand,1 “the size of M1 can-
not be entirely specified until the NI study is complete.” As concluding that
an ineffective therapy is effective comes with a great cost, there is a tendency
to conservatively choose the assumed effect of the active control and the cor-
responding clinical margin.1
Clinical judgment is used in determining M2, which cannot be greater
than M1. M2 is often determined by taking a fraction of M1. This particular
fraction, the retention fraction, depends on the importance of the endpoint
and the size of the active control effect. The importance of the active control
effect is the motivation for choosing a non-inferiority trial as the basis of
demonstrating the effectiveness of the experimental therapy. Therefore, it
may be unacceptable for the experimental therapy to have an effect much
less than the active control. Situations that influence the retention fraction are
provided in the FDA draft guidance.1 If the active control has a large effect
in reducing the mortality rate, retaining a large fraction of that effect will be
desirable. If it is known that the experimental therapy is associated with a
lower incidence of serious adverse events or is more tolerable for patients, the
retention fraction may be lowered.
Statistical hypotheses involving such an M1 or M2 are surrogate hypoth-
eses. The intention or hope is that ruling out that the experimental therapy
has an effect that is less than the effect of the active control by M1 or more
will imply that the experimental therapy is effective. Likewise, the inten-
tion is that ruling out that the experimental therapy has an effect that is less
than the effect of the active control by M2 will imply that the experimental
therapy does not have unacceptably worse efficacy than the active control
therapy. When M2 is much smaller than M1, ruling out a difference in effects
between the experimental therapy and active control of M2 should provide
persuasive evidence that the experimental therapy is effective.
Assurance that the active control will have an effect at least the size of M1
in the setting of the non-inferiority trial is the “single most critical determi-
nation” in planning the non-inferiority trial.1 Whether the non-inferiority
trial will have assay sensitivity is based on whether the effect of the active
control will be at least M1 in the setting of the non-inferiority trial, and the
quality of the design and conduct of the non-inferiority trial.
The FDA draft guidance1 prefers basing M1 on the lower limit of a high-
percentage confidence interval for the active control effect (e.g., a 95% CI) from a
meta-analysis of clinical trials that evaluates the effect of the active control.
In ruling out that the experimental therapy is unacceptably worse than the
outcomes more similar between the study arms. It is important that the
results and/or conclusions of these analyses be similar. Any difference in the
results of the analyses may be indicative of influential, poor study quality.
Too much missing data may potentially introduce a large bias and invalidate
both analyses, even if the results are similar.
For a non-inferiority comparison, an ITT analysis need not be more conser-
vative than a PP analysis. For non-inferiority comparisons of anti-infective
products, in most studies evaluated by Brittain and Lin,17 the ITT analysis was
less conservative than the PP analysis. In fact, sloppiness due to poor study
conduct may introduce a bias that favors a particular treatment arm. Study 1
of Rothmann et al.18 in an advanced cancer setting had a high percentage of
subjects prematurely censored for progression-free survival. On the basis of
poorer prognosis for overall survival among subjects prematurely censored
for progression-free survival compared with those still under observation for
a progression-free survival event on the experimental arm, this premature
censoring appears to be highly informative, whereas the premature censor-
ing on the control arm does not appear to be informative. More on analysis
populations is discussed in Chapter 8.
Because biases can also occur in subtle, unknown ways, the robustness of
the results and primary conclusions should be evaluated19—that is, how sen-
sitive the conclusions are to the limitations of the data and the unverifiable
assumptions made. Open-label trials may be particularly vulnerable to bias.
The limitations of the data will likely not be known until they are analyzed.
It is important to keep missing data to a minimum. Proper sensitivity analy-
ses are important in addressing data limitations and the potential impact
of missing data. Sensitivity analyses should be prespecified to the extent
possible. While sensitivity analyses are recommended, they do not compensate
for poor trial conduct or poor adherence to protocol, and they do not rescue
the results of a poor-quality clinical trial.
It is thus important that the conduct of the non-inferiority trial be of high
quality so as not to compromise the non-inferiority comparison by either
obscuring differences in the effects of the study arms on the endpoint of
interest, or being so dissimilar to the study conduct of those previous trials
whose results were used to establish the non-inferiority criterion so as to
make the non-inferiority margin irrelevant.
The comparison of interest with the greatest real-world relevance is that
between a control arm of a standard therapy along with best medical manage-
ment and an experimental arm consisting of the experimental therapy along
with best medical management. This compares how all or many patients are
currently being treated with how the same group of patients could be treated
if the experimental drug becomes approved for that indication. Influences or
biases that interfere with having an unbiased comparison reduce the assay
sensitivity of the trial. However, this comparison of interest may not require
that all aspects of the trial conduct be equal between arms. Differences in the
tolerability and effectiveness of different study therapies may result in the
subjects in one study arm complying more frequently in taking their study
therapy than subjects in other study arms. This unevenness in taking the
assigned study therapy is an outcome of being on different arms and not a
bias to any comparison that will be made. The analysis should not adjust for
such unevenness. The potential subsequent therapies that are used and their
distribution of usage may naturally be different between study arms, and
such would be expected for that comparison of interest. If the control therapy
is available for subsequent use in practice, or would be available for subse-
quent use if the experimental therapy is approved (as part of its best medi-
cal management), then it may be natural for the control therapy to be made
available to subjects on the experimental arm for subsequent use. Although
this feature may make it more difficult to show that the experimental ther-
apy has better efficacy than the active control therapy, it can make it easier to
show non-inferiority or equivalence. Delayed use of the control therapy may
be noninferior to immediate use of the control therapy. In that case, an
experimental therapy that is noninferior to the control therapy in a trial where
many subjects cross in to the control therapy may not be distinguishable
from a placebo. Therefore, unless it was true for the historical stud-
ies used to establish the non-inferiority margin, allowing the control therapy
to be available to the subjects in the experimental arm can obscure a non-
inferiority comparison and the determination of effectiveness of the experi-
mental therapy. If it is unethical to deny the control therapy for later use to
subjects on the experimental arm, either the non-inferiority margin would
need to account for this cross-in to the control therapy or a superiority com-
parison may need to be required. In most instances, previous studies evalu-
ating the effect of that standard therapy (the active control therapy) would
probably not have subjects on the placebo arm later use the standard therapy.
This makes it difficult to evaluate the effect of the active control therapy in the
setting of the non-inferiority trial where cross-in to the control therapy may
be allowed.
The study conduct of the non-inferiority trial cannot be evaluated until
the trial has ended. It is only at that time an assessment or reassessment
can be made as to the transferability of the results of previous trials that
had been used to establish the non-inferiority margin. If the non-inferiority
margin was based on previous trials and the conduct of the non-inferiority
trial is not consistent with the required conduct, a reevaluation of the non-
inferiority margin may be necessary.
a Superiority is concluded if the lower bound of the two-sided 95% CI for Δ is greater than zero.
b Non-inferiority is concluded if the lower bound of the two-sided 95% CI for Δ is greater than −δ.
c Equivalence is concluded if the two-sided 90% CI for Δ lies within the interval (–δ,δ).
continuous data. It is assumed that outcomes with larger values are more
desirable. Here,
Δ is the true difference in the effects of the experimental arm and the
control arm (E − C).
Δa is the assumed difference in the effects of the experimental arm
and the control arm chosen to size the study.
δ is the non-inferiority (equivalence) margin.
α is the significance level.
1 − β is the power (the probability of making the respective conclu-
sion of superiority, non-inferiority, or equivalence) at Δa.
σ2 is the common population variance of the values for each study arm.
zγ is the 100(1 − γ)th percentile of a standard normal distribution.
π is the proportion of patients randomized to the control arm.
For time-to-event endpoints where effect sizes are measured with a log-
hazard ratio, formulas for the required number of events are obtained by
replacing σ with the numeral 1. Note that the sample size formula for a supe-
riority trial is just the sample size for a non-inferiority trial when δ = 0 (or
when δ → 0+). Note also that
a) For δ > 0, the same α, β, and σ, and the same alternative Δa, the required
sample size is smaller for a non-inferiority trial than for a superiority
trial.
b) For δ > 0 and the same α, β, and σ, the sample size for a superiority
trial powered at the alternative Δa equals the sample size for a non-
inferiority trial powered at the alternative Δa − δ.
c) For both superiority and non-inferiority trials, the required sample
size decreases as Δa increases within the alternative hypothesis.
d) For an equivalence trial, the required sample size decreases as
|Δa| decreases.
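The displayed formula itself is not reproduced in this excerpt. The sketch below is written from the symbol definitions above and is consistent with the sample sizes reported in Tables 2.2 and 2.3 up to rounding; treat it as an illustration rather than the book's exact formula.

```python
from scipy.stats import norm

def total_sample_size(alpha, beta, sigma, delta, delta_a, pi=0.5):
    """Total sample size for a one-sided non-inferiority comparison of
    means with margin delta > 0 at the alternative delta_a = E - C;
    delta = 0 gives a superiority trial. pi is the proportion of
    subjects randomized to the control arm. For a log-hazard-ratio
    analysis, setting sigma = 1 gives the required number of events."""
    z_sum = norm.ppf(1 - alpha) + norm.ppf(1 - beta)
    return z_sum**2 * sigma**2 * (1/pi + 1/(1 - pi)) / (delta_a + delta)**2

# alpha = 0.025, beta = 0.10, sigma = 30, as in Tables 2.2 and 2.3:
print(total_sample_size(0.025, 0.10, 30, delta=0, delta_a=10))  # ~378
print(total_sample_size(0.025, 0.10, 30, delta=5, delta_a=0))   # ~1513
print(total_sample_size(0.025, 0.10, 30, delta=1, delta_a=8))   # ~467
```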
arm have the same effect. This may be consistent with the thought that
“non-inferiority” is “one-sided equivalence.” When the active control has a
small effect (i.e., when the non-inferiority margin is small), a non-inferiority
trial powered at no difference in the effects of the experimental and control
therapies will generally require a rather large sample size. When the control
therapy has a very small effect versus a placebo, sizing a trial at the alterna-
tive where the experimental arm and the control arm have the same effect
means that the trial is being powered at an alternative where the experi-
mental arm has a very small effect versus a placebo. A placebo-controlled
clinical trial with such an experimental arm having a small effect relative to
placebo would require a large study size to be adequately powered to dem-
onstrate superiority. When the non-inferiority margin is much smaller than
the true effect of the control therapy versus placebo, a large sample size may
be needed.
As the active control therapy may represent the therapy having the best
effect among all therapies previously evaluated for a disease, it may also be
unrealistic to assume that the next arbitrary product for that disease is equal
in effect to the active control therapy. In these instances, it may be reason-
able to size the non-inferiority trial on the basis of an assumed difference
in effects where the experimental therapy is less effective than the active
control therapy.
In an active-controlled, substitution trial, a misconception frequently
arises about whether a non-inferiority trial or a superiority trial would
require more subjects. When comparing an experimental therapy with an
active control therapy, lesser values need to be statistically ruled out by the
CI for a non-inferiority comparison than for a superiority comparison. Thus,
for a fixed power, a larger sample size is required for a superiority compari-
son than for a non-inferiority comparison of the same two treatment arms.
The misconception arises from comparing sample size calculations based
on different assumed differences in effects. For the superiority analysis, the
calculated sample size has adequate power when the experimental arm has
greater efficacy than the control arm by some meaningful amount. For the
non-inferiority analysis, the calculated sample size has adequate power when
the experimental arm and the control arm have the same effect. The sample
sizes should be compared on the basis of a single assumed difference in the
effects of the experimental and active control therapies. When comparing an
experimental therapy to an active control therapy for the same assumed dif-
ference in effects and power, a non-inferiority comparison requires a smaller
sample size than a superiority comparison.
There is also a misconception that the more efficacious the control therapy,
the easier it is for an experimental therapy (E) to demonstrate non-inferiority.
Suppose that, in designing a non-inferiority trial, there are two candidates
that may be chosen as the active comparator of the trial, C1 and C2, where C2
is more effective than C1 (C2 > C1). It is easier for an experimental therapy
to demonstrate superiority (more probable or requires a smaller size) against
TABLE 2.2
Sample Sizes for Direct or Indirect Superiority Comparisons of Test
Therapy versus Placebo (α = 0.025, β = 0.10, σ = 30 in each case)

Type of trial                 Δa and δ                         N
Superiority of E vs. P        Δa = 10 = ΔE–P, δ = 0            378
Non-inferiority of E vs. C    Δa = 0 = ΔE–C, δ = 10 = ΔC–P     378
Non-inferiority of E vs. C    Δa = 5 = ΔE–C, δ = 5 = ΔC–P      378
TABLE 2.3
Sample Sizes for Non-Inferiority Comparisons of Test Therapy versus Control
Therapy where Greater Than 50% Retention of Control Therapy’s Effect Is Required
(α = 0.025, β = 0.10, σ = 30; each comparison is non-inferiority of E vs. C)

Δa                δ                        N       N/378
Δa = 0 = ΔE–C     δ = 5 = 0.5 × ΔC–P       1513    4
Δa = 5 = ΔE–C     δ = 2.5 = 0.5 × ΔC–P     673     1.78
Δa = 8 = ΔE–C     δ = 1 = 0.5 × ΔC–P       467     1.23
Because the margin was based on the larger cure rate, the margin used at
the time of analysis may be different from the anticipated margin at the time
of study design. If the anticipated control cure rate was 81%, the anticipated
margin is 15%. If the observed cure rate for the control arm is 79%, with a
lower observed cure rate for the experimental arm, a 20% margin would
TABLE 2.4
Possible Cases in Deciding whether Non-Inferiority Has Been Demonstrated

        Sample Size   Experimental Arm:     Control Arm:          95% CI for the Difference
Case    per Arm       Number Cured (Rate)   Number Cured (Rate)   in Cure Rates               Margin
1       150           112 (0.75)            121 (0.81)            (–0.154, 0.034)             –0.15
2       150           103 (0.69)            118 (0.79)            (–0.199, –0.001)            –0.20
TABLE 2.5
Example Showing Lack of Transitivity of a Non-Inferiority Conclusion

                 Arm A             Arm B             Arm C
Cure rate        122/150 (0.81)    115/150 (0.77)    103/150 (0.69)

                 B vs. A           C vs. B           C vs. A
95% CI           (–0.139, 0.045)   (–0.180, 0.020)   (–0.224, –0.030)
Margin           –0.15             –0.20             –0.20
TABLE 2.6
Potential Change in Observed Cure Rates When Experimental Arm for Each Study
Is Control Arm of the Next Study

          Sample Size   Experimental Arm:     Control Arm:          95% CI for Difference
Therapy   per Arm       Number Cured (Rate)   Number Cured (Rate)   in Cure Rates            Margin
A         150           118 (0.79)            —                     —                        —
B         150           103 (0.69)            118 (0.79)            (–0.199, –0.001)         0.20
C         150           90 (0.60)             103 (0.69)            (–0.195, 0.021)          0.20
D         150           78 (0.52)             90 (0.60)             (–0.192, 0.032)          0.20
E         150           66 (0.44)             78 (0.52)             (–0.193, 0.033)          0.20
F         150           53 (0.35)             66 (0.44)             (–0.197, 0.024)          0.20
G         150           39 (0.26)             53 (0.35)             (–0.197, 0.010)          0.20
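As a check, the intervals in Tables 2.4 through 2.6 are consistent with standard Wald confidence intervals for a difference in proportions. A minimal sketch (ours, not the book's):

```python
import math

def wald_ci(cured_e, cured_c, n=150, z=1.96):
    """95% Wald CI for the difference in cure rates, experimental
    minus control, with n subjects per arm."""
    p_e, p_c = cured_e / n, cured_c / n
    se = math.sqrt(p_e * (1 - p_e) / n + p_c * (1 - p_c) / n)
    d = p_e - p_c
    return round(d - z * se, 3), round(d + z * se, 3)

# Case 2 of Table 2.4: the experimental arm is noninferior (lower limit
# above -0.20) and yet also inferior (upper limit below 0).
print(wald_ci(103, 118))  # (-0.199, -0.001)
```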
clinical trials. In February 2002 the FDA held an advisory committee meet-
ing on the selection of the non-inferiority margin for antibiotic drugs.5 The
outcomes from the meeting8 included
Some PP populations exclude subjects who die before the primary end-
point assessment where the cause of death is not regarded as related to the
infection. Alternatively, these patients have been treated as nonfailures in
the PP population and treated as failures in the ITT population. True ITT
analyses where all subjects are followed to the endpoint or the end of study
rarely occur due to missing outcomes. Patients with missing outcomes are
often treated as failures in the cure rate analyses.
Brittain and Lin17 compared the PP and ITT analyses from 20 trials
that were presented to the FDA Anti-Infective Drug Products Advisory
Committee between October 1999 and January 2003. Each trial studied a spe-
cific infection. The characteristics of the trials and the results are summarized
as follows:
• The overall sample sizes ranged from 20 to 819 with a median of 400.
• The percentage of patients in the ITT population that were excluded
from the PP population ranged from 2% to 43% with a median of 22%.
• The estimated treatment effect was more favorable for the experi-
mental therapy in the ITT analysis for 13 of the 20 trials.
• The 95% CI for the difference in cure rates was wider for the ITT
analysis for 12 of the 20 trials.
• The absolute differences in the treatment effect between the PP and
ITT analyses ranged from 0.03% to 18.9% (the trial with the overall
sample size of 20) with a median of 1.3%. The second largest absolute
difference was 4.8%.
References
1. U.S. Food and Drug Administration, Guidance for industry: Non-inferiority
clinical trials (draft guidance), March 2010.
2. International Conference on Harmonization of Technical Requirements for
Registration of Pharmaceuticals for Human Use (ICH) E10: Guidance on
choice of control group in clinical trials, 2000, at http://www.ich.org/cache/compo/475-272-1.html#E4.
3. Wiens, B., Choosing an equivalence limit for non-inferiority or equivalence
studies, Control. Clin. Trials, 23, 2–14, 2002.
4. Fleming, T.R. and Powers, J.H., Issues in non-inferiority trials: The evidence in
community-acquired pneumonia, Clin. Infect. Dis., 47, S108–S120, 2008.
5. U.S. Food and Drug Administration Division of Anti-Infective Drug Products
Advisory Committee meeting transcript, February 19–20, 2002, at http://www.fda.gov/ohrms/dockets/ac/cder02.htm#Anti-Infective.
6. Fleming, T.R., Design and interpretation of equivalence trials, Am. Heart J., 139,
S171–S176, 2000.
3.1 Introduction
It is important to evaluate from the evidence in the data whether an observed
finding is real and can be reproduced in any similar or different relevant
setting. In evaluating the strength of the evidence in the data, the Kefauver–Harris amendment of 1962 to the Federal Food, Drug, and Cosmetic Act defines "substantial evidence" as "evidence consisting of adequate and well-controlled investigations."1 According to Huque,2 the U.S. Food and Drug Administration (FDA) has interpreted this as "the need to conduct at least two adequate
and well-controlled studies, each convincing on its own, as evidence of effi-
cacy of a new treatment of a given disease.” There are conditions where data
from a single adequate and well-controlled trial can be considered to con-
stitute substantial evidence.3 An interpretation of this is data from a large,
adequate, and well-controlled multicenter study with a sufficiently small
p-value that is internally consistent and clinically meaningful.2
Studying how the results of a clinical trial or an experiment would change
if it were repeated under the exact same conditions (i.e., the same time in
history with the same clinical investigators and potential patient pool, etc.)
is a study of the variability of the results. Studying how the results of a clini-
cal trial or an experiment would change if an additional study (or studies)
were performed under a different environment (e.g., using different clini-
cal investigators or a different potential pool of patients) is a study of the
reproducibility of the results. Understanding the variability of the results from
a clinical trial is necessary but usually not sufficient in understanding the
reproducibility of the results.
For some indications, it may often be reasonable that if the results of a
given clinical trial are quite marked and internally consistent—having a
large estimated effect size that is many times greater than its correspond-
ing standard error—a statistically significant result will be reproduced if an
additional clinical trial were conducted having the same number of subjects
as the earlier trial. The only way to truly know would be to conduct the addi-
tional clinical trial.
Given the results from the trial (i.e., “given the data”), the probability that
the experimental therapy is or is not effective can be determined on the basis
of some prior probability that a random experimental therapy is effective.
This differs from a p-value, which is a probability involving the likelihood
of observing the actual data (or data that would provide stronger evidence)
given that the null hypothesis is true. The p-value does not consider the like-
lihood that the null hypothesis is true. Suppose 5% of investigated agents
for a given indication are truly effective (and meaningfully so). Additionally,
when an agent is effective, there is 80% power to achieve a one-sided p-value
of less than 0.025. When an agent is ineffective, there is a 2.5% probability of
achieving a one-sided p-value less than 0.025. For a typical 100 cases, Table
3.1 gives the number of cases for each combination of whether the investi-
gated agent is truly effective and whether the observed one-sided p-value
is less than 0.025. From Table 3.1, in 2.4 out of 6.4 cases (37.5%) where the
one-sided p-value is less than 0.025, the agent was truly ineffective. Thus,
when 5% of the investigated agents for a given indication are truly effective,
simply achieving a one-sided p-value of less than 0.025 from a single clinical
trial may be suggestive, but far from convincing, evidence of effectiveness. An example similar to the one provided here can be found in the paper by Fleming.4
We will refer to the posterior probability that the experimental agent is truly
ineffective, given that the experimental therapy has been concluded as effec-
tive, as the Bayesian false-positive rate.
Suppose that two simultaneously conducted clinical trials are done per
investigational agent. An investigational agent is concluded as effective when
each of the two studies has a one-sided p-value less than 0.025. As before,
when an agent is effective, there is 80% power to achieve a one-sided p-value
less than 0.025 within a single clinical trial. For a typical 100 cases, Table 3.2
gives the number of cases for each combination of whether the investigated
agent is truly effective and whether the agent is concluded as effective by
achieving one-sided p-values less than 0.025 in both studies. From Table 3.2,
in 3.2 out of 3.26 cases (≈98.2%) where the conclusion was “effective,” the
agent was truly effective. Therefore, the Bayesian false-positive rate is about
1.8%. Thus, when 5% of the investigated agents for a given indication are
truly effective, achieving a one-sided p-value less than 0.025 from each of
two clinical trials is fairly convincing evidence of effectiveness.
TABLE 3.1
Number of Cases for Which the Agent Is Effective According to Observed p-Value

                         Truth
One-Sided p-Value    Effective    Ineffective    Total
<0.025               4            2.4            6.4
≥0.025               1            92.6           93.6
Total                5            95             100
TABLE 3.2
Number of Cases for Which the Agent Is Effective According to Conclusion Drawn from Two Studies

                             Truth
Conclusion of "Effective"    Effective    Ineffective    Total
Yes                          3.2          0.06           3.26
No                           1.8          94.94          96.74
Total                        5            95             100
In general, for a one-sided significance level of α/2, power 1 − β at the assumed effect, and probability η that a random agent is truly effective, the Bayesian false-positive rate equals

\alpha^* = \frac{(\alpha/2)(1-\eta)}{(\alpha/2)(1-\eta) + (1-\beta)\eta} \quad (3.1)
The Bayesian false-positive rate increases as the power or trial size decreases.
Thus, a group of small studies would have a larger Bayesian false-positive
rate than an analogous group of large studies. The Bayesian false-positive
rate also increases as the probability that a random agent is truly effective
decreases or as the significance level increases. The Bayesian false-negative rate, β*, the probability that an experimental therapy is effective given a nonfavorable test result, equals

\beta^* = \frac{\beta\eta}{\beta\eta + (1-\alpha/2)(1-\eta)}
The Bayesian false-negative rate increases as the power or trial size decreases.
Thus, a group of small studies would have a larger Bayesian false-negative
rate than an analogous group of large studies. The Bayesian false-negative
rate also increases as the probability that a random agent is truly effec-
tive increases or as the significance level increases (and the power remains
unchanged). Our notation of α* and β* for the Bayesian false-positive and
Bayesian false-negative rates is the reverse of the notation by Lee and Zelen.5
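To make these relations concrete, the following minimal Python sketch (our illustration, not from the original text; the function names are ours) evaluates the Bayesian false-positive rate of Equation 3.1 and the false-negative rate above, reproducing the single-trial and two-trial examples of Tables 3.1 and 3.2:

def bayes_false_positive(alpha_half, power, eta):
    # Equation 3.1: P(agent is truly ineffective | favorable result)
    return alpha_half * (1 - eta) / (alpha_half * (1 - eta) + power * eta)

def bayes_false_negative(alpha_half, power, eta):
    # P(agent is truly effective | nonfavorable result)
    beta = 1 - power
    return beta * eta / (beta * eta + (1 - alpha_half) * (1 - eta))

# Single trial: eta = 0.05, 80% power, one-sided level 0.025
print(bayes_false_positive(0.025, 0.80, 0.05))        # ~0.373 (2.4/6.4 in Table 3.1)
print(bayes_false_negative(0.025, 0.80, 0.05))        # ~0.011 (1/93.6 in Table 3.1)
# Two trials, each requiring p < 0.025: level 0.025^2, power 0.80^2
print(bayes_false_positive(0.025**2, 0.80**2, 0.05))  # ~0.018 (Table 3.2)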
Lee and Zelen5 examined 87 studies conducted by the Eastern Cooperative
Oncology Group. Most studies used a one-sided significance level of 0.05
and sized for 80–90% power. Twenty-five of those studies had “significant
outcomes.” On the basis of a model that considers only no effect and the
assumed effect to size the study as the possibilities for the true effect, it was
deduced that the true fraction of studies with effective experimental thera-
pies was between 0.28 and 0.32. Moreover, “on average,” 3 of the 25 studies
(12%) having significant outcomes are expected to have false-positive con-
clusions and 4–10% of the 62 nonpositive studies are expected to be false-
negative conclusions.
For fixed power and probability that a random trial uses an effective
experimental agent, the one-sided significance level (α/2) for a single trial
or overall level for two trials can be determined so as to lead to a desired
Bayesian false-positive rate. For fixed β, η, and α*,
\frac{\alpha}{2} = \frac{\alpha^*(1-\beta)\eta}{(1-\alpha^*)(1-\eta)} = \frac{\alpha^*}{1-\alpha^*} \cdot \frac{\eta}{1-\eta} \cdot (1-\beta) \quad (3.2)
The required significance level equals the product of the odds of a false-
positive result, the odds a random study has an effective experimental agent,
and the power at the assumed effect.
From Equation 3.2, for α* = 0.025 and 1 – β = 0.9, α/2 ≈ (0.0231)(η/(1 – η)). When η > 0.52, α/2 > α* = 0.025. For α* = 0.025, 0.01, and 0.000625 and 1 – β = 0.9, Table 3.3 gives the value of α/2 for various η values. When the probability that a random agent is truly effective is 0.2, a single-study significance level of 0.0058 leads to a Bayesian false-positive rate of 0.025, whereas a single-study significance level of 0.00014 leads to a Bayesian false-positive rate of 0.000625.
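As a quick check of Equation 3.2, the required one-sided significance level can be computed directly (a sketch of ours, not the authors' code):

def required_alpha_half(alpha_star, power, eta):
    # Equation 3.2: one-sided level that yields Bayesian false-positive rate alpha_star
    return (alpha_star / (1 - alpha_star)) * (eta / (1 - eta)) * power

print(required_alpha_half(0.025, 0.9, 0.2))      # ~0.0058
print(required_alpha_half(0.000625, 0.9, 0.2))   # ~0.00014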
For a superiority trial with 90% power at the assumed effect and a one-sided significance level of 0.025, an estimated effect with a one-sided p-value greater than 0.053 supports no effect more than the assumed effect. For a non-inferiority trial, an estimated difference in effects between the experimental and
active control arms of –δ/2 supports the null difference of –δ more than the
alternative of no difference in effects. When the study has 90% power at no
difference with a one-sided significance level of 0.025, an estimated differ-
ence in effects of –δ/2 corresponds to a one-sided p-value of 0.053.
Bayes Factor. Goodman6 proposes the use of the Bayes factor as a measure of
the strength of evidence instead of a p-value. The greater the data support the
alternative hypothesis relative to the null hypothesis, the more likely the alter-
native hypothesis is true. For testing two simple hypotheses, we have that

\text{posterior odds that } H_o \text{ is true} = \text{Bayes factor} \times \text{prior odds that } H_o \text{ is true} \quad (3.3)

where the Bayes factor equals the probability of the data given the null hypothesis divided by the probability of the data given the alternative hypothesis.
For testing the simple hypotheses Ho:θ = θo versus Ha:θ = θa, Equation 3.3
can be expressed as
\frac{g(\theta_o \mid x)}{g(\theta_a \mid x)} = \frac{h(\theta_o)\, f(x \mid \theta_o)}{h(\theta_a)\, f(x \mid \theta_a)}

where h denotes the prior density, g the posterior density, and f(x | θ) the likelihood of the data x.
\text{Minimum Bayes factor} = \frac{f(x \mid \theta_o)}{\sup_{\theta \in \Theta_a} f(x \mid \theta)} \quad (3.4)
which is also the generalized likelihood ratio that is often used as a frequen-
tist test statistic. In practice, the supremum in the denominator of Equation
3.4 occurs at the maximum likelihood estimate of θ.
In many applications where the maximum likelihood estimator has an
approximate normal distribution, the minimum Bayes factor is approxi-
mated by exp(–z2/2), where z is the number of standard errors the maximum
likelihood estimate is different from θo.6 Goodman evaluated the strength of
evidence for various “small” p-values and prior odds that the null hypoth-
esis is true. On the basis of a fairly pessimistic prior that the alternative
hypothesis is true, Goodman regarded a one-sided p-value of 0.05 as pro-
viding moderate evidence (at best) against the null hypothesis, a one-sided
p-value of 0.001–0.01 as at best moderate to strong evidence against the null
hypothesis, and a one-sided p-value of less than 0.001 as strong to very strong
evidence. Data leading to a p-value less than 0.001 yields posterior odds that
the null hypothesis is true that is less than 1/216 of the prior odds that the
null hypothesis is true.6
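For intuition, the approximation exp(–z²/2) is easy to tabulate. The sketch below (ours, not from the text; scipy assumed, with one-sided p-values converted to z via the normal quantile) prints the minimum Bayes factor and the corresponding posterior odds when the prior odds on the null are even:

from math import exp
from scipy.stats import norm

def min_bayes_factor(p_one_sided):
    # approximate minimum Bayes factor exp(-z^2/2), z = normal quantile of 1 - p
    z = norm.ppf(1 - p_one_sided)
    return exp(-z * z / 2)

prior_odds = 1.0  # even prior odds on the null
for p in (0.05, 0.01, 0.001):
    bf = min_bayes_factor(p)
    print(p, round(bf, 4), round(bf * prior_odds, 4))  # posterior odds = bf x prior odds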
When testing with two composite hypotheses (i.e., Ho:θ ∊ Θo vs. Ha:θ ∊ Θa), a
natural extension chooses the generalized likelihood ratio as that Bayes fac-
tor when determining the posterior odds that the null hypothesis is true. For
two composite hypotheses, Goodman6 proposes having the selected Bayes
factor based on a weight function. For a nonnegative function w defined on
the parameter space, the weight-based Bayes factor is given by
\frac{\displaystyle \int_{\Theta_o} w(\theta)\, f(x \mid \theta)\, d\theta \Big/ \int_{\Theta_o} w(\theta)\, d\theta}{\displaystyle \int_{\Theta_a} w(\theta)\, f(x \mid \theta)\, d\theta \Big/ \int_{\Theta_a} w(\theta)\, d\theta}
A weight function can also be used when testing a simple hypothesis against a composite hypothesis. When the weight function is the prior density function h, the posterior odds that the null hypothesis is true is given by

\frac{\displaystyle \int_{\Theta_o} h(\theta)\, f(x \mid \theta)\, d\theta}{\displaystyle \int_{\Theta_a} h(\theta)\, f(x \mid \theta)\, d\theta}
which is the posterior odds that the null hypothesis is true on the basis of the
posterior distribution for θ.
For fixed prior odds that the null hypothesis is true, Goodman6 notes that
the weights or prior densities for the possibilities in the alternative hypoth-
esis can be distributed to focus on whether the true difference or effect is
meaningful. When the observed effect is small and not meaningful, such a
weight-based Bayes factor would account for this and lead to unimpressive
posterior odds that the null hypothesis is true.
In practice, it may be better or more appropriate for the prior odds of the
null hypothesis to be based on a typical or random therapy for that indica-
tion, not on the prior belief involving the given experimental therapy. This
leads to a consistent criterion across all studies in that indication. Different
decisions from studies involving different experimental therapies would be
based on the differences in the study results.
3.3 Reproducibility
It is important that a finding in one laboratory by one investigator can be
reproduced in another laboratory by a different investigator. A finding that
fails to be reproduced when tried at different laboratories by different investi-
gators may not be of great consequence and may have been a fluke. Likewise,
it is important to know whether a positive finding from a clinical trial can be
reproduced from an independent clinical trial having different subjects and
different investigators. If a positive finding from a given clinical trial fails to
be reproduced by other conducted clinical trials, the finding will lack exter-
nal validity. Hung and O'Neill9 investigated the distribution for the p-value under the alternative hypothesis and the likelihood of reproducing a positive result in an identical, second trial when the true effect for the second trial is taken to be the observed effect from the first trial. When the observed one-sided
p-value in the first trial is 0.025, there is a 50% probability of achieving a one-
sided p-value less than 0.025 in the second trial when the true effect is the
observed effect from the first trial. When the observed one-sided p-value in
the first trial is 0.000625, there is a 90% probability of achieving a one-sided
p-value less than 0.025 in the second trial when the true effect is the observed
effect from the first trial.
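The reproducibility probabilities quoted above follow from treating the second trial's test statistic as normal with mean equal to the first trial's observed z value; a small sketch (our illustration; scipy assumed):

from scipy.stats import norm

def reproducibility(p1, alpha_half=0.025):
    # P(one-sided p < alpha_half in an identical second trial) when the true
    # effect is taken to be the first trial's observed effect: Z2 ~ N(z1, 1)
    z1 = norm.ppf(1 - p1)
    return norm.cdf(z1 - norm.ppf(1 - alpha_half))

print(reproducibility(0.025))     # 0.50
print(reproducibility(0.000625))  # ~0.90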
In practice, the reproducibility of a positive finding from a clinical trial
need not require clinical trials of identical designs. Separate positive find-
ings from clinical trials involving different stages of the same disease may
support each other and represent reproducibility of a positive finding for that
disease. Similarly, positive findings from a clinical trial using subjects who
were previously treated and also from a clinical trial using subjects who were
previously untreated for the disease may support each other and represent
reproducibility of a positive finding.
Reproducibility is also important for a non-inferiority efficacy claim. However, there may be differing views on what reproducibility should mean for
a non-inferiority inference,10 which depends jointly on both the compari-
son from the non-inferiority trial and the historical experience of the active
control. Conceptually, repeating the entire non-inferiority inference can be
considered as jointly repeating both the historical experience of the active
control and the non-inferiority trial.11 Alternatively, the reproducibility of a
non-inferiority inference can be viewed by separately assessing the repro-
ducibility in the estimated active control effect across the previous trials and
the reproducibility in the difference in effects between the active control and
the experimental therapy across multiple non-inferiority trials. When the
non-inferiority trial is based on an assumed active control effect size and
that effect size or a larger effect size is “regularly reproduced” across trials
studying the active control, the testing of non-inferiority will generally be
associated with a rather small false-positive rate (α*) for a conclusion that
the experimental therapy is effective and constitute substantial evidence of
any efficacy, provided that the active control effect is at least the size of the
assumed effect.
A consistent, reproduced conclusion of efficacy across trials not only
increases the likelihood that the finding of efficacy is real but also can
justify that a model used to estimate the active control effect may approx-
imately hold. Before observing the results of any study, the estimated treat-
ment effect or treatment difference is unbiased. However, as the decision to
evaluate the effect of a selected active control is dependent on the already
observed effects, the retrospective estimation of the active control effect is
biased. When a finding of efficacy across studies has reproducibility, this
bias should be small.
For indications where there is only one effective standard therapy that can
be difficult to tolerate, a second clinical trial comparing the experimental
therapy with a placebo can use subjects that do not tolerate the standard
therapy. A demonstration of effectiveness for that trial may involve dem-
onstrating superior efficacy to the placebo or some other therapy. In some
instances, the dose–response relationship of an experimental therapy may
provide supportive information on the efficacy of the experimental therapy.
Given observed data x1, x2, . . . , xn and the posterior density g(θ | x1, x2, . . . , xn) for the parameter θ over the parameter space Ω, the predictive density for a future observation xn+1 is

f^*(x_{n+1}) = \int_{\Omega} f(x_{n+1} \mid \theta)\, g(\theta \mid x_1, x_2, \ldots, x_n)\, d\theta \quad (3.5)
For example, consider a Jeffreys prior (a beta distribution with both parame-
ters equal to 0.5) for the probability that a random study subject will respond
to therapy. Suppose that 9 of 20 patients have responded to therapy. Solely
on the basis of these data, the predictive probability that a future random
study patient will respond to therapy is 19/42 (≈0.452). This value (19/42) was found by evaluating

f^*(1) = \int_0^1 u \cdot \frac{1}{B(9.5,\, 11.5)}\, u^{8.5} (1-u)^{10.5}\, du, \quad \text{where } B(9.5,\, 11.5) = \int_0^1 u^{8.5} (1-u)^{10.5}\, du.

Thus, with Bernoulli (dichotomous) data, the
posterior mean is the predictive probability that a future random observation
will be a success. Additionally, the predictive distribution for the number of
the next m subjects that will respond to therapy is a beta-binomial distribution (a binomial distribution with m trials whose success probability is drawn from the Beta(9.5, 11.5) posterior); each future subject individually has a marginal response probability of 19/42.
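A small numerical companion for this example (ours, not from the text; scipy's betabinom, available in scipy 1.4 and later, is assumed):

from scipy.stats import betabinom

a, b = 0.5 + 9, 0.5 + 11      # Beta(9.5, 11.5) posterior after 9 responders in 20
print(a / (a + b))            # posterior mean = predictive P(next responds) = 19/42

m = 10                        # predictive distribution for responders among next m
pred = betabinom(m, a, b)
print(pred.mean(), pred.pmf(5))  # mean m*19/42; P(exactly 5 of the next 10 respond)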
Example 3.1 illustrates determining the predictive probability of a favor-
able outcome from a second identical trial when the first clinical trial has a
favorable outcome.
Example 3.1
When it is believed that the true effect in the second trial is the same as the
true effect in the first trial, the posterior distribution for the common effect
based on the results in the first trial forms the prior distribution for the com-
mon effect in the second trial. The predictive probability of achieving statis-
tical significance can also be determined under a model for differing true
effects between clinical trials by adding variability into the prior distribu-
tion for the common effect in the second trial or from a hierarchical model.
Example 3.2
TABLE 3.4
Posterior and Predictive Distributions
Parameter Posterior Distribution
θC/P Normal distribution with mean –0.234 and standard deviation 0.0750
θE/C Normal distribution with mean –0.044 and standard deviation 0.0613
θE/P Normal distribution with mean –0.278 and standard deviation 0.0969
Estimator Predictive Distribution
xE/P Normal distribution with mean –0.278 and standard deviation 0.1300
For an estimator U observed in a completed trial and a corresponding estimator V in a future trial, a 100(1 − α)% prediction interval for V is given by

(u - z_{\alpha/2}\, \sigma_{V-U},\; u + z_{\alpha/2}\, \sigma_{V-U}) \quad (3.6)

where u is the observed value for U, 1 – Φ(z_{α/2}) = α/2, and σ_{V–U} is the standard deviation of V – U. In Example 3.1, U is the estimated log-hazard ratio from the first clinical trial based on 400 events, V is the estimator of the log-hazard ratio for the second clinical trial based on 400 events, and σ_{V–U} = √0.02 ≈ 0.141. From Equation 3.6 the 95% prediction interval for the observed log-hazard ratio in the second clinical trial, on the basis of 400 events, is (–0.565, –0.010) (i.e., the 95% prediction interval for the hazard ratio is (0.568, 0.990)). An observed hazard ratio less than 0.822 is needed for statistical significance at a one-sided 0.025 level. The one-sided 74.2% prediction interval for the observed hazard ratio in the second clinical trial is (0, 0.822), an analogous result to the Bayesian predictive probability.
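These quantities can be reproduced in a few lines (our sketch; scipy assumed; since the point estimate itself is not restated here, u is backed out as the midpoint of the reported interval):

from math import exp, log, sqrt
from scipy.stats import norm

se = 2 / sqrt(400)                  # SE of a log-hazard ratio with 400 events, 1:1 randomization
sigma_vu = sqrt(2) * se             # sigma_{V-U} = sqrt(0.02) ~ 0.141
u = (log(0.568) + log(0.990)) / 2   # observed log-hazard ratio (interval midpoint)

print(exp(u - 1.96 * sigma_vu), exp(u + 1.96 * sigma_vu))   # ~ (0.568, 0.990)
print(norm.cdf((log(0.822) - u) / sigma_vu))                # ~0.742, the one-sided coverage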
For Example 3.2, prediction limits can be determined for the comparison
of capecitabine with 5-FU from a hypothetical 5-FU–controlled trial. Let
UC/P and UE/C denote the estimators of the overall survival log-hazard ratio
for 5-FU + LV versus 5-FU (“placebo”) and capecitabine versus 5-FU + LV,
respectively. Let V denote the estimator of the overall survival log-hazard
ratio for capecitabine versus 5-FU based on 533 events from the hypotheti-
cal trial comparing capecitabine and 5-FU arms. We will assume that E(V) =
E(UC/P + UE/C) and use normal distributions for the sampling distributions.
The 95% prediction interval for V is (–0.533, –0.023). The corresponding 95% prediction interval for the hazard ratio based on 533 events is (0.587, 0.977). This interval includes 0.844, the threshold that the observed hazard ratio must fall below to achieve statistical significance, as well as larger values for the observed hazard ratio. A one-sided 79.7% prediction interval for the log-
hazard ratio based on 533 events is (–∞, –0.170), the analog to the result of the
Bayesian predictive analysis.
References
1. U.S. Food and Drug Administration. Statement regarding the demonstration of
effectiveness of human drug products and devices. Federal Register, 60, Docket
No. 9500230, 39180–39181, August 1, 1995.
2. Huque, M.F., Commentaries on statistical consideration of the strategy for dem-
onstrating clinical evidence of effectiveness—one larger vs two smaller pivotal
studies by Z. Shun, E. Chi, S. Durrleman and L. Fisher, Stat. Med., 24, 1639–1651,
2005.
3. U.S. Food and Drug Administration, Guidance for industry: Providing clini-
cal evi dence of effectiveness for human drug and biological products, 1998,
at http://www.fda.gov/downloads/Drugs/GuidanceComplianceRegulatory
Information/Guidances/UCM078749.pdf.
4. Fleming, T.R., Clinical trials: Discerning hype from substance, Ann. Intern. Med.,
153, 400–406, 2010.
5. Lee, S.J. and Zelen, M., Clinical trials and sample size considerations: Another
perspective, Stat. Sci., 15, 95–110, 2000.
6. Goodman, S.N., Toward evidence-based medical statistics 2: The Bayes factor,
Ann. Intern. Med., 130, 1005–1013, 1999.
7. Shun, Z. et al., Statistical consideration of the strategy for demonstrating clinical
evidence of effectiveness—one larger vs two smaller pivotal studies, Stat. Med., 24, 1619–1637, 2005.
8. Koch, G.G., Commentaries on statistical consideration of the strategy for dem-
onstrating clinical evidence of effectiveness—one larger vs two smaller pivotal
studies by Z. Shun, E. Chi, S. Durrleman and L. Fisher, Stat. Med., 24, 1639–1651,
2005.
9. Hung, H.M.J. and O’Neill, R.T. Utilities of the p-value distribution associated
with effect size in clinical trials, Biometrical J., 45, 659–669, 2003.
10. Rothmann, M.D., Issues to consider when constructing a non-inferiority analy-
sis, ASA Biopharm. Sec. Proc., 1–6, 2005.
11. Lawrence, J., Some remarks about the analysis of active control studies,
Biometrical J., 47, 616–622, 2005.
12. Tsong, Y. et al., Choice of λ-margin and dependency of non-inferiority trials,
Stat. Med., 27, 520–528, 2008.
13. Tsong, Y., Zhang, J., and Levenson, M., Choice of δ non-inferiority margin and
dependency of the non-inferiority trials, J. Biopharm. Stat., 17, 279–288, 2007.
14. FDA Medical-Statistical review for Xeloda (NDA 20-896), dated April 23, 2001.
15. FDA/CDER New and Generic Drug Approvals: Xeloda product labeling, at
http://www.fda.gov/cder/foi/label/2003/20896slr012_xeloda_lb1.pdf.
16. Rothmann, M. et al. Design and analysis of non-inferiority mortality trials in
oncology, Stat. Med., 22, 239–264, 2003.
4.1 Introduction
According to the U.S. Food and Drug Administration (FDA) Draft Guidance
on Non-inferiority Trials,1 “The first and most critical task in designing an
NI study is obtaining the best estimate of the effect of the active control in
the NI study (i.e., M1).” The FDA draft guidance on non-inferiority trials1
provides instances for which a non-inferiority margin can be defined in the
absence of controlled clinical trials evaluating the active control effect. The
circumstances are similar to those for which historically controlled trials can
provide persuasive evidence.2 For example, there should be a good under-
standing or estimate of the outcome (e.g., spontaneous cure rate) without
treatment, and the outcomes or cure rate for the active control from mul-
tiple historical experiences should be substantially different from those
seen without treatment (e.g., substantially different spontaneous cure rates).
The assumed effect of the active control in the setting of the non-inferiority trial would be conservatively chosen.
Usually, there are data on the effect of the active control therapy from other
clinical trials. It is a daunting task to determine whether the estimated effects
of the active control therapy from previous trials apply to the setting of the
non-inferiority trial. Differences in patient populations, in the natural his-
tory of the disease, and in supportive care are just some of the factors that can
alter the effect of a therapy from one clinical trial to another. Additionally,
bias may be introduced by identifying the active control therapy after its
effect has been estimated. Bias may also be introduced by selective, post hoc
determination of which studies to include in the evaluation of the active con-
trol effect. How to integrate results across trials and whether the integrated
results would apply to a future clinical trial are also key concerns. The poten-
tial heterogeneity in the active control effect across trials needs to be consid-
ered and investigated. For the setting of the non-inferiority trial, explained
heterogeneity should be accounted for in the estimated effect of the active
control and unexplained heterogeneity should be accounted for in the cor-
responding variance. These issues and topics are discussed in this chapter.
Determining the active control effect involves examining the effect of the active control from relevant previous tri-
als, adjusting for any potential biases, and understanding and adjusting for
any differences between the historical trials and the non-inferiority trial. The
between-trial variability that cannot be explained should also be considered
when modeling the uncertainty of an estimate of the active control effect.
When there are no historical studies providing relevant information on the
effect of the control therapy, a two-arm non-inferiority trial cannot be done.
When there are relevant randomized comparative studies, it may be possible
to assess the effect of the control therapy. In assessing the size of the active
control effect in the setting of the non-inferiority trial, relevant information
from previous trials needs to be considered, including the consistency of the
size of the estimated effects, consistency of any effect, and similarities and
differences in the designs of the trials (e.g., differences in patient popula-
tions, concurrent therapies, and subsequent therapies).
If there are concerns that the effect of the control therapy has diminished
by some fraction, ε, then the estimated control effect can also be reduced by
this fraction.
When there is one historical, randomized trial comparing the active con-
trol with placebo, between-trial variability cannot be assessed. It is also dif-
ficult to quantify the between-trial variability with just two historical trials.
The potential between-trial variability should be considered, particularly
when that disease or indication has a history of inconsistent estimated effects
across clinical trials investigating the effects of the same therapy.
“Likes” should be combined with “likes.” For example, it may be inappro-
priate to combine the results of observational studies with blinded, placebo-
controlled studies. Therefore, in a meta-analysis of a collection of studies, it
may be necessary to first divide the overall collection of studies into subsets
where, within each subset, the studies are fairly homogeneous on the most
important design and conduct features relative to the treatment effect and its
estimation. Then a meta-analysis is done for each subset of the studies. The
use of multiple definitions of the endpoint, differences in how the endpoint
is measured or how frequently the endpoint is monitored, differences in the
amount of follow-up on the endpoint, or meaningfully different patient pop-
ulations may be the basis for dividing the overall collection of studies into
subsets.
“unbiased” for the effect of the active control in the non-inferiority trial (i.e.,
E(\hat{\gamma}) = E(\hat{\gamma}_i) = \gamma). The difference is the variance that is attributed to \hat{\gamma}. Since E(\hat{\gamma} - \gamma_{k+1})^2 = E(\hat{\gamma} - \gamma)^2 + E(\gamma_{k+1} - \gamma)^2, the variance for the second case is larger.
While the estimator is unbiased for the active control effect in both cases, the
modeling of the uncertainty of the estimator and its sampling distribution
is different. In general, the constancy assumption is more than just having
an unbiased estimator of the active control effect in the non-inferiority trial,
but also correctly modeling or identifying the sampling distribution for the
estimator.
The constancy of effect may depend on the chosen metric. The benefit of a
therapy used to prevent a disease may depend on the placebo rate of getting
the disease. An experimental therapy that prevents occurrence of the disease
in one out of two subjects who would have otherwise acquired the disease
has an occurrence rate of 25% when the placebo rate is 50% (a difference of
25%). The occurrence rate would be 15% when the placebo rate is 30% (a dif-
ference of 15%). How to make adjustments for departures from constancy in
the active control effect is often a matter of judgment.1
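The arithmetic behind this example is worth making explicit; a tiny sketch (ours) contrasting a constant relative risk with a nonconstant risk difference:

# Constant relative risk (0.5) does not give a constant risk difference:
for placebo_rate in (0.50, 0.30):
    treated_rate = 0.5 * placebo_rate            # half of would-be cases are prevented
    print(placebo_rate, treated_rate, placebo_rate - treated_rate)
# placebo 50% -> treated 25% (difference 25%); placebo 30% -> treated 15% (difference 15%)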
When the relevant studies are known, this will be an easier task. In many instances, a literature
search is done to obtain known relevant studies that can be used to quantify
the effect of the active control therapy. There may be concern of a publication
bias, that only the results from studies indicative of a treatment effect will
be published or that such studies are more likely to be published (and thus
found in a literature search) than the results from those studies that did not
indicate a treatment effect. There are various techniques that can assist in
recognizing that a publication bias may exist. The techniques tend to assume
that the true treatment effect is constant across trials or that the true effect
is not dependent on the size of the trial. Hopefully, the recent creation of a
clinical trials data bank (i.e., clinicaltrials.gov) will reduce the possibility of
publication bias for estimating the effects of many future active controls.
A “funnel plot” is a graphical display often used to evaluate for potential
publication bias.9 Plotted for each study is the pair of the estimated effect and
a measure related to the associated variability in the estimate. The greater the
variability associated with the estimate (as with smaller studies), the more
spread there is in the observed estimates; thus, when there is no publication
or sampling bias, a funnel-like shape is expected.
Search strategies that attempt to find all relevant studies and minimize
bias should be used. The methods for abstracting estimates and the standard
error from summaries of the results should also be considered. For example,
it is easy and valid to derive the estimate and corresponding standard error from a 95% confidence interval that was based on a normal distribution. The estimated effect would be the midpoint of the confidence interval, whereas the standard error would equal the difference of the upper and lower limits of the 95% confidence interval divided by 3.92 (i.e., by 2 × 1.96). When the sub-
jects are monitored indefinitely for an event (e.g., death) and accrued over
time, it would probably be inappropriate to use the fraction of subjects who
had events in both arms to arrive at an estimate of the hazard ratio.
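For ratio-type endpoints this abstraction is done on the log scale; a minimal sketch (ours, using the TAX 317 confidence interval that reappears in Example 4.3):

from math import exp, log

lo, hi = log(0.35), log(0.88)       # reported 95% CI for a hazard ratio
estimate = (lo + hi) / 2            # log-scale midpoint
se = (hi - lo) / 3.92               # interval width / (2 x 1.96)
print(round(exp(estimate), 2), round(se, 3))   # ~0.56 and ~0.235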
The Timing of the Definitive Analysis Is Random. When a study includes one
or more interim analyses, the sample treatment effect is often regarded as the
estimated treatment effect when an efficacy boundary is first crossed, or as
the estimated treatment effect at the final planned analysis should no efficacy
boundary be crossed. For this definition, when the experimental therapy is
more effective than the control therapy or placebo, the sample treatment
effect is biased, having a mean or expected value greater than the true treat-
ment effect. If the study were replicated over and over, the long-run arithme-
tic average sample treatment effect would exceed the true treatment effect.
This long-run average gives equal weight to each replication. If each study
result was weighted by the amount of the information used at the definitive
analysis (analogous to a fixed-effects meta-analysis), then the corresponding
long-run weighted average sample treatment effect would converge almost
surely to the true treatment effect. On the basis of a sequence of independent,
identical clinical trials having at least one interim analysis, the cumulative
fixed-effects meta-analysis estimator is asymptotically unbiased (i.e., the bias
decreases towards zero) as the number of trials increases, whereas the bias
in the arithmetic average is constant in the number of trials. When a given
clinical trial necessarily continues to the final analysis, regardless of the
results of earlier analyses, the estimated treatment effect at the time of the
final planned analysis is an unbiased estimator of the true treatment effect,
provided no design or conduct changes occur on the basis of the results of
the interim analysis.
When there is zero treatment effect and parallel boundaries are used at
the interim analyses, the expected sample treatment effect will be zero. The
bias in the observed treatment effect at the time of the definitive analysis is
attributable to the randomness of the amount of information at the time of
that analysis. The amount of bias will depend on the true effect size, the α
allocation, and the timings of the analyses.
Adaptive designs having a sample size reestimation component also have
sample treatment effects that are biased with the mean sample effect greater
than the true treatment effect when the treatment is effective.
Random Highs. Random highs in the estimated effect of the active control
therapy in historical studies are a real issue. For example, “data dredging”
leads to estimates of treatment effects that tend to overstate the true effect.
Situations include selecting a subgroup retrospectively on the basis of the
estimated effect seen in that group. The estimates that generate hypothe-
ses for further studies are in themselves conditionally biased, tending to be
larger than the true effect size. Likewise, conditional bias estimates can occur
when a claim is limited to a subgroup either due to quite positive results seen
for that subgroup or for quite negative results seen for the complement sub-
group. There are various other scenarios when random highs are likely to
be more prevalent, which include the following: when the use of the control
therapy in the non-inferiority trial was predicated on the success of one or
two trials designed to study that therapy in the indication; the first trial in an
indication to yield a favorable statistically significant result after many other
trials (possibly based on different therapies) previously failed to do so; the
estimated effect is from an interim analysis that resulted in favorable statisti-
cal significance or is from a design having a sample size reestimation; and
a retrospective or nonprespecified analysis on a demographic or genomic
subgroup.
Statistical Significance Bias. Studies whose results are responsible for moti-
vating the use of a therapy as an active control introduce some bias or con-
ditional bias into the historical estimation of a treatment effect. Before a trial
that will be well conducted and well controlled is started, the estimated
treatment effect is unbiased for the true treatment effect of a population that
is represented by the subjects in the clinical trial. At the start of the trial, the
observed treatment effect will or will not wind up being large enough to
achieve statistical significance. Conditional on statistical significance being
achieved, the expected or mean sample treatment effect is greater than the
true treatment effect. Because active control therapies in a non-inferiority
trial are often selected because statistical significance was reached in one or
two trials that were designed to study that therapy in the indication, there
will be a tendency for the estimated active control effect to be greater than
the true effect. It is therefore necessary for the estimated control effect to
be either reproduced in multiple trials or be “adjusted” for this conditional
bias.
When a drug has demonstrated an effect in a clinical trial (e.g., one-sided
p-value < 0.025), it is more likely than not that the estimated effect in a sec-
ond trial of the same design will be smaller than that seen in the first trial.
Statistically, consider two normalized test statistics from two separate, iden-
tically designed clinical trials, Z1 and Z2, that have standard normal distribu-
tions when there is no difference in effects between treatments. For the ith
trial (i = 1,2), a one-sided p-value of less than 0.025 is equivalent to Zi > 1.96.
It can be shown that P(Z1 > Z2|Z1 > 1.96) > 0.5. In other words, given that the
first trial achieved statistical significance, it is more likely that the first trial
had a smaller p-value (and also a larger estimated effect) than the second
trial.
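This inequality is easy to check by simulation. In the sketch below (ours; numpy assumed), a common drift mu is added to both statistics, with mu = 0 corresponding to no treatment effect:

import numpy as np

rng = np.random.default_rng(0)
for mu in (0.0, 1.0, 2.8):   # common true drift; 2.8 gives ~80% power at one-sided 0.025
    z1, z2 = mu + rng.standard_normal((2, 1_000_000))
    significant = z1 > 1.96
    print(mu, (z1[significant] > z2[significant]).mean())   # always > 0.5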
Let σ denote the standard error for the estimated treatment effect. Consider
a study having 100γ% power at the actual treatment effect μ, which is based
on a large sample normalized test statistic with a one-sided significance level
of α. When statistical significance has been achieved (p-value < α), the conditional expected or mean estimated treatment effect is approximately μ + gU(γ)σ, where gU(γ) = ϕ(Φ⁻¹(γ))/γ and ϕ and Φ are the density and distribution functions for a standard normal distribution, respectively. Note that for a one-sided significance level of α, γ = Φ(μ/σ − z_α). When statistical significance is not reached, the conditional expected or mean estimated treatment effect is approximately μ − gL(γ)σ, where gL(γ) = ϕ(Φ⁻¹(γ))/(1 − γ). Because of the symmetry of ϕ about zero, gU(γ) = gL(1 − γ). Table 4.1 provides values of gU(γ) for various γ values. For 90% power at the actual treatment effect μ, based on a large sample normalized test, the conditional expected or mean estimated treatment effect given that statistical significance has been reached is approximately μ + 0.195σ. Thus, if the same study having 90% power was repeated over and over, where only the estimated effects from those replications having a one-sided p-value of < 0.025 are retained, the long-run average of the retained estimated effects would be approximately μ + 0.195σ, not μ.
TABLE 4.1
Number of Standard Errors of Bias in Achieving Statistical Significance at a One-Sided Level, by Power

γ       gU(γ)       γ       gU(γ)       γ       gU(γ)
0.05    2.06        0.30    1.16        0.75    0.42
0.10    1.75        0.40    0.97        0.80    0.35
0.15    1.55        0.50    0.80        0.85    0.27
0.20    1.40        0.60    0.64        0.90    0.195
0.25    1.27        0.70    0.50        0.95    0.11
Example 4.1
Consider a time-to-event endpoint compared between two arms after 400 events
in a placebo-controlled clinical trial having a 1:1 randomization. Suppose the true
experimental to placebo hazard ratio is 0.894, which provides 20% power to
achieve statistical significance at a one-sided α of 0.025 (which occurs when the
observed experimental to placebo hazard ratio is less than 0.822). Given that
statistical significance is achieved, the mean for the observed experimental to
placebo log-hazard ratio is –0.252 (=ln 0.894 – 1.40 × 0.1) from Table 4.1, which
corresponds to a hazard ratio of 0.777. In cases where the true power is 20%,
the typical observed experimental to placebo hazard ratio when statistical signifi-
cance is achieved will be 13% less than the true value.
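The conditional bias gU(γ) and the Example 4.1 numbers can be reproduced with a few lines (our sketch, not the authors' code; scipy assumed):

from math import exp, log, sqrt
from scipy.stats import norm

def g_upper(power):
    # gU(gamma) = phi(Phi^{-1}(gamma)) / gamma: conditional bias, in standard
    # errors, of the estimated effect given statistical significance
    return norm.pdf(norm.ppf(power)) / power

print(round(g_upper(0.20), 2), round(g_upper(0.90), 3))   # 1.40 and 0.195 (Table 4.1)

se = 2 / sqrt(400)                                        # 0.1, as in Example 4.1
print(round(exp(log(0.894) - g_upper(0.20) * se), 3))     # ~0.777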
Maximum and Regression to the Mean Biases. In baseball, the best rookie
player in each league receives the Rookie of the Year (ROY) award. The per-
formance of the winners in their next seasons tends to be worse than in their
rookie year. This is often referred to as the “sophomore jinx.” The ROY has
the maximum performance or outcome among all rookies in their league
their first year. Also, among the same group of players for the next year, the
ROY cannot do any better than having the maximum performance and can
do comparatively worse.
The sophomore jinx occurs because the ROY is identified, not at random,
but on the basis of having the maximum performance. The sophomore jinx
is an example of “regression to the mean.” The bias that occurs from using
the rookie outcome of the ROY to project their second (sophomore) year per-
formance or estimate their ability, while ignoring and not adjusting for the
fact that the ROY is being identified on the basis of maximum performance/
outcome (not identified at random), is an example of regression to the mean
bias. In a non-inferiority trial, the active control is usually identified on the
basis of past performance in clinical trials. Often, the active control will be
the therapy that performed best (or one therapy among the therapies that
performed the best) in previous clinical trials. Therefore, unless a proper
adjustment is made, including the outcomes from previous clinical trials
that were used to identify the active control will lead to biased estimation of
the active control effect with a tendency to overestimate the true effect even
when the true effect of the active control remains constant across previous
trials and the non-inferiority trial.
Regression to the mean refers to the phenomenon in a simple linear regression in which, when an observed value x of an explanatory variable X is k standard deviations away from its mean μX, the expected value μY|x of the response variable Y is ρk of its standard deviations away from its mean μY, where –1 < ρ < 1 is the correlation coefficient. Since |ρk| < |k|, in relative terms of respective standard deviations, μY|x is closer to μY than x is to μX.
For example, if ρ = 0.5 with standard deviations of σ X and σ Y, then when we
observe x = μX + 2σ X, the corresponding expected value of Y is μY|x = μY + σ Y.
For the sophomore jinx, X is the performance in the rookie year and Y is the
performance in the second year.
In statistics, when an outcome or estimate represents a maximum, the out-
come or estimate will tend to be greater than the true mean of the underlying
distribution with high probability. Thus, in the subject area of clinical tri-
als, when the estimated effect represents a maximum across studies and/or
subgroups, it is highly likely that the estimated effect is greater than the true
effect. Conditional bias is also introduced when the selection of historical
studies used to estimate the effect of the active control is outcome dependent.
For example, limiting the selected studies to a narrow indication where a
study achieves statistical significance and ignoring the results from related
indications will lead to a bias and an exaggerated estimate of the active con-
trol effect and potentially inflate the type I error rate for the non-inferiority
trial. The maximum of a random sample tends to be larger than the mean of
the underlying distribution (i.e., larger than the true effect). The bias of the
maximum in estimating the underlying mean increases as the number of
studies increases. When an observation represents a maximum, it should not
be evaluated as if it were an isolated, random observation.
Consider an investigational agent, A, being studied for a first-line meta-
static cancer in three large, equally sized, randomized clinical trials. Each
clinical trial compared the addition of agent A to a different standard che-
motherapy regimen with that standard chemotherapy regimen alone. The
three clinical trials used a different background standard chemotherapy
regimen (X1, X2, and X3). Suppose that the only trial that demonstrated
improved overall survival when agent A is added is the trial that used X1 as
the background chemotherapy. Now, a sponsor wants to study the addition
of the experimental agent B in a non-inferiority trial that compares X1 plus
B with X1 plus A. As the observed effect of adding A to X1 represents the
maximum observed effect across three trials, the observed effect of adding
A to X1 probably overestimates the true effect. Therefore, if the estimation of
the effect of the active control, A, only considers the previous trial that used
X1 as the background chemotherapy and ignores the fact that the observed
effect represents a maximum observed effect, the true effect of adding A to
X1 will tend to be overestimated. This may then lead to an inappropriately
large non-inferiority margin and an increase in the likelihood of conclud-
ing that an ineffective experimental therapy is effective. If only the results
from the clinical trial using X1 as the background therapy are used, the esti-
mated effect in that trial needs to be interpreted and modeled as represent-
ing a maximum observed effect. It is important to note that when improved
survival is not demonstrated, it does not mean that improved survival was
ruled out. If the other two trials had slightly favorable observed effects, their
failure to demonstrate a survival improvement does not mean that the effect
of adding agent A to chemotherapy is heterogeneous across background che-
motherapies. The observed effects across the studies may still be consistent
with homogeneous effects. Knowledge of the observed effects from the other
two studies is needed to correctly interpret the results from the study using
X1 as the background chemotherapy.
Similar situations would also arise when an investigational agent is stud-
ied in multiple lines of an advanced or metastatic cancer, when an investi-
gational agent is studied in separate trials in different disease settings, or
when the chosen dose for the active control in the non-inferiority trial is the
dose with the greatest estimated effect and only data on that dose is used to
estimate the effect of the active control. Treating a better or best finding as
coming from an isolated trial will tend to overstate the true effect. Treating
a better or best finding as a maximum or the relevant upper order statistic of a sample will be correct when the effects are homogeneous and will be conservative when the effects are heterogeneous. However, when the effects are homogeneous, the most reliable esti-
mate of the common effect integrates the estimated effects from all trials.
Dealing with Maximum Bias. For a random sample, the observed maximum
is not an appropriate estimator of the common true mean or treatment effect.
When assumptions are made on the shape of the underlying distribution
and/or the shape of the distribution of the maximum, the observed maxi-
mum can be used to make inferences on the common true mean or treatment
effect.
Let X1, … ,Xk be a random sample from a distribution with underlying
distribution function H. Let X(k) denote the maximum of X1, . . . ,Xk, and let
H(k) denote its distribution function. Then for –∞ < t < ∞, H_{(k)}(t) = (H(t))^k. The quantiles/percentiles for X(k) are given by H_{(k)}^{-1}(\gamma) = H^{-1}(\gamma^{1/k}) for 0 < γ < 1. The mean and variance for X(k) are given, respectively, as

\mu_{X_{(k)}} = \int_0^1 H^{-1}(x^{1/k})\, dx \quad \text{and} \quad \sigma^2_{X_{(k)}} = \int_0^1 \left( H^{-1}(x^{1/k}) - \mu_{X_{(k)}} \right)^2 dx.

When Z1, . . . , Zk is a random sample from a standard normal distribution with distribution function Φ and maximum Z(k), the quantiles/percentiles for Z(k) are given by

\Phi_{(k)}^{-1}(\gamma) = \Phi^{-1}(\gamma^{1/k}) \quad (4.1)

for 0 < γ < 1. The mean and variance for Z(k) are given, respectively, as

\mu_{Z_{(k)}} = \int_0^1 \Phi^{-1}(x^{1/k})\, dx \quad (4.2)

and

\sigma^2_{Z_{(k)}} = \int_0^1 \left( \Phi^{-1}(x^{1/k}) - \mu_{Z_{(k)}} \right)^2 dx \quad (4.3)

Table 4.2 provides the mean, standard deviation, and various percentiles based on Equations 4.1 through 4.3.
When the underlying distribution for X1, . . . ,Xk is a normal distribution
with mean μ and standard deviation σ, X(k) is equal in distribution to μ + Z(k)σ.
The behavior of the minimum treatment effect is analogous to that of the
maximum treatment effect, with the roles of the treatment arms reversed.
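Equations 4.1 through 4.3 are straightforward to evaluate numerically; the sketch below (ours; scipy assumed) reproduces several entries of Table 4.2:

from scipy.stats import norm
from scipy.integrate import quad

def max_normal_moments(k):
    # mean and SD of the maximum of k iid standard normals (Equations 4.2 and 4.3)
    mean, _ = quad(lambda x: norm.ppf(x ** (1.0 / k)), 0, 1)
    var, _ = quad(lambda x: (norm.ppf(x ** (1.0 / k)) - mean) ** 2, 0, 1)
    return mean, var ** 0.5

for k in (2, 5, 25):
    m, s = max_normal_moments(k)
    print(k, round(m, 2), round(s, 2))       # (0.56, 0.83), (1.16, 0.67), (1.97, 0.51)

print(round(norm.ppf(0.975 ** 0.2), 2))      # 97.5th percentile for k = 5: 2.57 (Equation 4.1)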
Example 4.2 illustrates using the distribution of the maximum in construct-
ing a confidence interval for the true common treatment effect for a time-to-
event endpoint.
Example 4.2
Suppose that there are five randomized, placebo-controlled clinical trials evaluat-
ing an experimental therapy with each trial based on a one-to-one randomization.
For each study, as in Example 4.1, the same time-to-event endpoint will be com-
pared after 400 events where the true experimental versus placebo hazard ratio is
0.894 (which provides 20% power to achieve statistical significance at a one-sided
α of 0.025) in every study. Then, using Table 4.2, the maximum observed treatment effect is represented by the minimum observed experimental versus placebo log-hazard ratio, which has mean –0.228 (= ln 0.894 – 1.16 × 2/√400), corresponding to a hazard ratio of 0.796. The standard deviation for the minimum log-hazard ratio is 0.067. The median minimum experimental versus placebo hazard ratio is 0.799 (= exp(ln 0.894 – 1.13 × 2/√400)). Using the 2.5th and 97.5th percentiles in Table 4.2, an equal-tailed 95% prediction interval for the minimum hazard ratio is (0.691, 0.899).
TABLE 4.2
Means, Standard Deviations, and Various Percentiles for the Maximum of a Random Sample from a Standard Normal Distribution

                            Percentiles
k     μZ(k)    σZ(k)    2.5th    25th    50th    75th    97.5th
2     0.56     0.83     –1.00    0       0.54    1.11    2.24
3     0.85     0.75     –0.55    0.33    0.82    1.33    2.39
5     1.16     0.67     –0.05    0.70    1.13    1.59    2.57
10    1.54     0.59     0.50     1.13    1.50    1.91    2.80
25    1.97     0.51     1.09     1.61    1.92    2.28    3.09
Suppose instead that the true common experimental versus placebo hazard ratio
is unknown and that only the best (minimum) observed hazard ratio is considered.
If the minimum observed hazard ratio is 0.75, then, based solely on that, a 95% equal-tailed confidence interval for the true common experimental versus placebo hazard ratio is (0.746, 0.970). This confidence interval is based on the 2.5th and
97.5th percentiles in Table 4.2 and the relation X(5) = μ + Z(5)σ in distribution, which
is applied to the placebo versus the experimental log-hazard ratio.
Once the estimates have been observed along with their respective order,
the confidence coefficients for the confidence intervals change. The confidence
coefficients will depend on the distributions (or conditional distributions) of
the order statistics. In Example 4.2, where there is a random sample of five
estimated effects, the confidence coefficient for the error symmetric 95% con-
fidence interval for that individual study that had the maximum (minimum)
estimated effect is now an error asymmetric 88.1% confidence interval when
the order of the estimated effects across all five studies is considered. The
confidence coefficient for the 95% confidence interval for that study having
the second largest (second smallest) estimated effect is 99.4% when the order
of the estimated effects is considered. The confidence coefficient for the 95%
confidence interval for that study having the median estimated effect is 99.97%
when the order of the estimated effects is considered. For a random sample of
estimated effects, the confidence coefficient for the 95% confidence interval
for the individual study that had the maximum (median) estimated effect
decreases (increases) toward zero (one) as the number of studies increases.
Simultaneous Confidence Bounds. Fairly analogous to having the inference
based on a maximum is requiring simultaneous one-sided confidence inter-
vals to maintain a desired overall coverage. For k studies and a probability
of 1 – α that every one-sided confidence interval will capture the respective
true effect, the common confidence coefficient for each confidence interval is
(1 − α)1/k. When the estimated effects across studies is a random sample (e.g.,
the studies are identical in design and conduct), the largest (smallest) of the
one-sided simultaneous lower (upper) confidence bounds each with confi-
dence coefficient (1 − α)1/k equals the lower confidence bound of coefficient
1 – α based solely on the maximum (minimum) observed effect. For example,
when k = 5 and α = 0.025, the confidence coefficient for each confidence inter-
val is 0.995 (=0.9750.2). Note that the formula for determining the common
confidence coefficient for each confidence interval is the same as the formula
for relating the (1 – α)th quantile for the maximum to the [(1 − α)1/k]th quantile
of the underlying distribution.
It is fairly common to use for the non-inferiority trial the lower limit of
a 95% confidence interval for the true active control effect (usually from a
meta-analysis) as a surrogate or substitute for the unknown true effect of the
active control. When only the result from the study that produced the larg-
est estimated effect among the k studies is considered, it seems a reasonable
analog to base the surrogate or substitute for the unknown true effect of the
active control for the non-inferiority trial as the (0.975)^{1/k} × 100% lower confidence bound calculated solely from that study.
More extensive modeling based on order statistics can also be done. For
example, suppose it is believed that there may be a specific number of small
studies that did not get published because of unfavorable results for the
treated arm. A model can be applied to the results from the known small
studies that assumes that those known results represent better-order statis-
tics from some samples of independent observations. Two approaches used
in Example 4.3 are based on the maximum of a sample of estimated effects
that are not a random sample. In Example 4.3 we consider various ways of
integrating the available information from two studies on the overall sur-
vival effect of docetaxel in second-line non-small cell lung cancer (NSCLC).
Example 4.3
The JMEI trial studied the use of pemetrexed against the active control of docetaxel
at a dose of 75 mg/m2 (D75) with subjects in second-line NSCLC. A non-inferiority
claim for pemetrexed versus docetaxel on overall survival was sought.10 Thus, it
would be necessary to understand the effect of docetaxel on overall survival in
second-line NSCLC. There have been several clinical trials studying the effects
of docetaxel in NSCLC and other cancers. For the sake of this example, only two
studies of docetaxel in second-line NSCLC (TAX 317 and TAX 320) will be consid-
ered. For the TAX 320 study, 373 subjects were randomized to either 100 mg/m2
docetaxel (D100), D75, or a control therapy (vinorelbine or ifosfamide, V/I). There
is little evidence that vinorelbine or ifosfamide extends life in a second-line setting
of NSCLC. For the TAX 317 study, 100 subjects were randomized to D100 or best
supportive care (BSC) in phase A of the study, and 104 subjects were randomized
to D75 or BSC in phase B of the study.
How the results are modeled or integrated will have a great impact on the
estimation of the relevant effect of D75. When an approach is selected retro-
spectively and dependent on the trial results, it will produce biased estimates.
Prespecification of an approach before the conduct of the TAX 320 and TAX 317
studies (or independent of their results) would be necessary to produce unbiased
estimates. Some possible approaches are listed below.
1. A naïve approach that uses only the results from phase B of the TAX 317 study.
2. The evidence from the TAX 320 study is not strong enough to rule out that the effects are equal between the docetaxel regimens. Therefore, estima-
tion of the active control effect based on the assumption that the effects
of the docetaxel regimens are equal and constant across studies can be
considered.
a. Use only the results from the TAX 317 study.
b. Use results from both studies treating the control arms of vinorelbine or
ifosfamide, and BSC as exchangeable.
3. An approach that integrates the results in the TAX 320 study of the com-
parison of D100 with D75, with the separate comparisons of each phase
of docetaxel to BSC from the TAX 317 study. The effects of each docetaxel
For each approach, Table 4.3 summarizes the estimated hazard ratio, the corre-
sponding 95% confidence interval for the true D75 versus BSC hazard ratio, and
the one-sided p-value for testing that D75 is superior to BSC. For approach 1, the
estimate of the D75 versus BSC hazard ratio from TAX 317 is 0.56 with the cor-
responding 95% confidence interval of 0.35–0.88.11 From the confidence interval,
the standard error for the log-hazard ratio estimator is approximately 0.235 and the
one-sided p-value for superiority of D75 versus BSC is approximated as 0.007.
From the overall survival results provided in the Statistical review of NDA 20449/S11 for TAX 317,12 with data cutoff date of April 12, 1999, the observed D100 versus BSC hazard ratio is either 0.96 or 1.04 = 1/0.96 (using the p-value and the number of events for each group) and the corresponding standard error for the log-hazard ratio estimator is 0.221 (= √(1/40 + 1/42)). For this example, we will use 0.96 as the observed hazard ratio. For approach 2a, applying a fixed-effects meta-analysis to the independent comparisons from phases A and B of TAX 317 leads to an estimated D75/D100 versus BSC hazard ratio of 0.743 (= exp([ln 0.56/(0.235)² + ln 0.96/(0.221)²]/[1/(0.235)² + 1/(0.221)²])) and the corresponding standard error for the log-hazard ratio estimator of 0.160 (= (1/(0.235)² + 1/(0.221)²)^(–1/2)).
For the TAX 320 study, there were 104, 97, and 110 deaths in the D100, D75,
and V/I treatment groups, respectively.12 The D75 versus V/I hazard ratio is pro-
vided in the product label for Taxotere11 as 0.82, and the D100 versus V/I hazard
ratio is determined to be either 0.99 or 1.01 = 1/0.99. For this example, we will use
0.99 as the observed hazard ratio.

TABLE 4.3
Estimates of D75 versus BSC Hazard Ratio by Approach

Approach    Estimate    95% Confidence Interval    One-Sided p-Value
1           0.56        (0.35, 0.88)               0.007
2a          0.743       (0.543, 1.018)             0.032
2b          0.842       (0.698, 1.015)             0.035
3           0.655       (0.466, 0.921)             0.007
4a          0.675       (0.524, 0.938)             0.011
4b          0.704       (0.536, 0.985)             0.021

The geometric mean of the two hazard ratio
estimates is 0.901, which will be used as the combined estimate of the D75/D100
versus V/I hazard ratio. The estimated standard error for the combined log-hazard
ratio estimator is 0.119 $\left(= 0.5\sqrt{1/104 + 1/97 + 4/110}\,\right)$. When the combined results
for TAX 320 and TAX 317 (determined for approach 2a) are integrated by a fixed-
effects meta-analysis, the estimated D75/D100 versus V/I/BSC hazard ratio is
0.842 $\left(=\exp\left\{\left[\ln 0.743/(0.160)^2 + \ln 0.901/(0.119)^2\right]\Big/\left[1/(0.160)^2 + 1/(0.119)^2\right]\right\}\right)$, with a
corresponding estimated standard error for the log-hazard ratio estimator of 0.095
$\left(= 1\Big/\sqrt{1/(0.160)^2 + 1/(0.119)^2}\,\right)$.
For approach 3, the D75 versus D100 hazard ratio from TAX 320 is determined
as 0.828 (= 0.82/0.99), with a corresponding standard error for the log-hazard
ratio of 0.141 $\left(=\sqrt{1/104 + 1/97}\,\right)$. This result is combined with the results of the D100
versus BSC, yielding an estimated hazard ratio of 0.795 and a corresponding stan-
dard error for the log-hazard ratio estimator of 0.262. This indirect comparison is
now integrated with the direct comparison of D75 versus BSC, yielding an overall
D75 versus BSC hazard ratio of 0.655 with a corresponding standard error for the
log-hazard ratio estimator of 0.175. This estimate of the D75 versus BSC hazard
ratio is the maximum likelihood estimate under the model where the log-hazard
ratio estimators have independent normal distributions with respective standard
deviations equal to the estimated standard errors and the true log-hazard ratios
of D75 versus BSC, BSC versus D100, and D100 versus D75 are required to sum
to zero.
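These combinations are simple inverse-variance (fixed-effects) calculations. Below is a minimal Python sketch, using the rounded published estimates quoted above as inputs; `combine` is a hypothetical helper name, and small discrepancies from the tabled values reflect rounding of the inputs.

```python
import math

def combine(log_hrs, ses):
    """Inverse-variance (fixed-effects) combination of log-hazard ratios."""
    weights = [1 / se ** 2 for se in ses]
    est = sum(w * x for w, x in zip(weights, log_hrs)) / sum(weights)
    return est, 1 / math.sqrt(sum(weights))

# Approach 2a: combine phases A and B of TAX 317.
est_2a, se_2a = combine([math.log(0.56), math.log(0.96)], [0.235, 0.221])
print(math.exp(est_2a), se_2a)  # approx. 0.74 and 0.16

# Approach 3: indirect D75 vs. BSC estimate (D75 vs. D100 combined with
# D100 vs. BSC), then integrated with the direct D75 vs. BSC comparison.
indirect = math.log(0.828) + math.log(0.96)        # approx. ln(0.795)
se_indirect = math.sqrt(0.141 ** 2 + 0.221 ** 2)   # approx. 0.262
est_3, se_3 = combine([indirect, math.log(0.56)], [se_indirect, 0.235])
print(math.exp(est_3), se_3)  # approx. 0.655 and 0.175
```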
Let β denote the common true D75/D100 versus BSC/V/I log-hazard ratio across
studies. Let $\hat\beta_1$, $\hat\beta_2$, $\hat\beta_3$, and $\hat\beta_4$ denote the estimated log-hazard ratios of D100 versus V/I,
D75 versus V/I in TAX 320, D100 versus BSC, and D75 versus BSC in TAX 317,
respectively. For approaches 4a and 4b, the distribution of the deviation between
the minimum observed log-hazard ratio (the maximum observed effect) and β was studied
by simulations. This deviation equals the minimum deviation across estimators
(i.e., $\min\{\hat\beta_1,\hat\beta_2,\hat\beta_3,\hat\beta_4\} - \beta = \min\{\hat\beta_1-\beta,\ \hat\beta_2-\beta,\ \hat\beta_3-\beta,\ \hat\beta_4-\beta\}$). Let (Z1, Z2),
Z3, Z4 be independent, where each Zi has a standard normal distribution and
(Z1, Z2) has a bivariate normal joint distribution with a correlation of 0.5. For the
comparisons of D100 versus V/I and D75 versus V/I in TAX 320, the respective
deviations $\hat\beta_1 - \beta$ and $\hat\beta_2 - \beta$ are modeled as 0.138Z1 and 0.138Z2 (0.138 repre-
sents the average of the two estimated standard errors). For the comparisons of
D100 versus BSC and D75 versus BSC in TAX 317, the respective deviations $\hat\beta_3 - \beta$
and $\hat\beta_4 - \beta$ are modeled as 0.221Z3 and 0.235Z4.
On the basis of 100,000 replications, the simulated mean minimum deviation is
−0.188 with simulated 2.5th and 97.5th percentiles of −0.515 and 0.067, respec-
tively. On the basis of retaining only the maximum observed effect of a hazard
ratio of 0.56, this leads to the estimated common hazard ratio of 0.675 (= exp(ln
0.56 + 0.188)) and limits of the corresponding 95% confidence interval of 0.524
(= exp(ln 0.56 – 0.067)) and 0.938 (= exp(ln 0.56 + 0.515)).
In 31,899 of the 100,000 replications, $\hat\beta_4 = \min\{\hat\beta_1,\hat\beta_2,\hat\beta_3,\hat\beta_4\}$. Conditioning on
$\hat\beta_4 = \min\{\hat\beta_1,\hat\beta_2,\hat\beta_3,\hat\beta_4\}$, the simulated mean minimum deviation is −0.229 with
simulated 2.5th and 97.5th percentiles of −0.565 and 0.043, respectively. On the
basis of retaining only the maximum observed effect of a hazard ratio of 0.56 and
that this maximum observed effect came from phase B of TAX 317, this leads to
the estimated common hazard ratio of 0.704 (= exp(ln 0.56 + 0.229)) and limits of
the corresponding 95% confidence interval of 0.536 (= exp(ln 0.56 – 0.043)) and
0.985 (= exp(ln 0.56 + 0.565)).
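A minimal sketch of this simulation under the stated model (with an arbitrary random seed) is given below; the printed summaries agree with the values above up to simulation error.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed
n = 100_000
z1 = rng.standard_normal(n)
z2 = 0.5 * z1 + np.sqrt(0.75) * rng.standard_normal(n)  # corr(Z1, Z2) = 0.5
z3, z4 = rng.standard_normal(n), rng.standard_normal(n)

# Deviations of the four estimators from the common log-hazard ratio.
devs = np.column_stack([0.138 * z1, 0.138 * z2, 0.221 * z3, 0.235 * z4])
min_dev = devs.min(axis=1)

# Approach 4a: unconditional distribution of the minimum deviation.
print(min_dev.mean(), np.percentile(min_dev, [2.5, 97.5]))  # approx. -0.19, (-0.52, 0.07)

# Approach 4b: condition on the minimum arising from D75 vs. BSC (phase B).
cond = min_dev[devs.argmin(axis=1) == 3]
print(cond.mean(), np.percentile(cond, [2.5, 97.5]))        # approx. -0.23, (-0.56, 0.04)
```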
From Table 4.3, the results vary across approaches, with the estimated D75
versus BSC hazard ratio ranging from 0.56 to 0.842. The upper limits of
the 95% confidence intervals range from 0.88 to 1.018. Additionally, the more
information used in an approach or the more restrictive the assumptions, the nar-
rower the 95% confidence interval. Two approaches failed to achieve statistical
significance at a one-sided 0.025 level. All in all, the results do not provide sub-
stantial evidence that the true D75 versus BSC hazard ratio for overall survival is
less than 1.
When the true within-study variances are known, the fixed-effects estimator of γ is
given by $\hat\gamma_{FE} = \sum_{i=1}^{k}(1/\sigma_i^2)\hat\gamma_i\Big/\sum_{i=1}^{k}(1/\sigma_i^2)$. The variance of $\hat\gamma_{FE}$ is $1\Big/\sum_{i=1}^{k}(1/\sigma_i^2)$.
Commonly, the true variances of the within-study sample effects are not
known, but are estimated. Let $s_i^2$ denote the estimated variance of $\hat\gamma_i$ for
i = 1, …, k. Then the standard fixed-effects estimator of γ is given by

$\hat\gamma_{FE} = \sum_{i=1}^{k}(1/s_i^2)\hat\gamma_i\Big/\sum_{i=1}^{k}(1/s_i^2)$

and the corresponding estimated variance is given by $s^2 = 1\Big/\sum_{i=1}^{k}(1/s_i^2)$.
Homogeneity of the effects can be tested on the basis of the statistic
$Q = \sum_{i=1}^{k}(1/s_i^2)(\hat\gamma_i - \hat\gamma_{FE})^2$ (see DerSimonian and Laird16). When the
effects are homogeneous, Q has an approximate χ2 distribution with k – 1
degrees of freedom. For 0 < α < 1, let $\chi^2_{k-1,\alpha}$ denote the upper αth percentile
from a χ2 distribution with k – 1 degrees of freedom. When $Q > \chi^2_{k-1,\alpha}$, a single
common effect is rejected. Formally, the conclusion is that there are at least
two different values among γ1, γ2, . . . , γk. The formal conclusion is neither
that there are k distinct values for γ1, γ2, . . . , γk nor that γ1, γ2, . . . , γk are
independent and identically distributed random variables. The test of
heterogeneity tends to have low power in most practical situations. A
formula for determining the power for testing heterogeneity on the basis of
the test statistic Q is given by Jackson.17
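A minimal sketch of this heterogeneity test, assuming estimated effects g with estimated standard errors s from k studies (`q_test` is a hypothetical helper name):

```python
import numpy as np
from scipy.stats import chi2

def q_test(g, s):
    """Test of homogeneity based on Q (DerSimonian and Laird)."""
    g, s = np.asarray(g, float), np.asarray(s, float)
    w = 1 / s ** 2
    g_fe = np.sum(w * g) / np.sum(w)   # fixed-effects estimate
    q = np.sum(w * (g - g_fe) ** 2)
    p = chi2.sf(q, df=len(g) - 1)      # approximate p-value under homogeneity
    return q, p
```

For the three trials of Table 4.5 below, for instance, q_test([7, 6, 37], [3, 2, 3.5]) gives Q of roughly 63 on 2 degrees of freedom, leaving little doubt about heterogeneity.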
For binary outcomes with the Peto (O − E) method, where $Z_i$ denotes the observed
minus expected number of events and $V_i$ the corresponding hypergeometric variance
in study i, the estimated common log-odds ratio is given by
$\hat\theta = \sum_{i=1}^{k}V_i\hat\theta_i\Big/\sum_{i=1}^{k}V_i = \sum_{i=1}^{k}Z_i\Big/\sum_{i=1}^{k}V_i$. The estimated variance of $\hat\theta$ is
$1\Big/\sum_{i=1}^{k}V_i$. Homogeneity of the study log-odds ratios can be tested on the
basis of the statistic $R = \sum_{i=1}^{k}Z_i^2/V_i - \left(\sum_{i=1}^{k}Z_i\right)^{2}\Big/\sum_{i=1}^{k}V_i$ (see Yusuf et al.18).
In a random-effects model, the true study effects γ1, γ2, . . . , γk are modeled as
a random sample from a distribution having mean γ and between-study variance τ2.
When σ1, . . . , σk and τ2 are known, the random-effects estimator
of γ is given by $\hat\gamma_{RE} = \sum_{i=1}^{k}(\tau^2+\sigma_i^2)^{-1}\hat\gamma_i\Big/\sum_{i=1}^{k}(\tau^2+\sigma_i^2)^{-1}$. The variance of $\hat\gamma_{RE}$
is $1\Big/\sum_{i=1}^{k}(\tau^2+\sigma_i^2)^{-1}$. In practice, σ1, . . . , σk and τ2 are not known. Then the
between-study variance, τ2, is estimated by

$\hat\tau^2 = \max\left\{0,\ \frac{\sum_{i=1}^{k}(1/s_i^2)(\hat\gamma_i - \hat\gamma_0)^2 - (k-1)}{\sum_{i=1}^{k}(1/s_i^2) - \sum_{i=1}^{k}(1/s_i^4)\Big/\sum_{i=1}^{k}(1/s_i^2)}\right\}$

where $\hat\gamma_0 = \sum_{i=1}^{k}(1/s_i^2)\hat\gamma_i\Big/\sum_{i=1}^{k}(1/s_i^2)$. Then the random-effects estimator
of γ is given by $\hat\gamma_{RE} = \sum_{i=1}^{k}(\hat\tau^2+s_i^2)^{-1}\hat\gamma_i\Big/\sum_{i=1}^{k}(\hat\tau^2+s_i^2)^{-1}$. The corresponding
estimated variance for $\hat\gamma_{RE}$ is $s^2 = 1\Big/\sum_{i=1}^{k}(\hat\tau^2+s_i^2)^{-1}$.
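A sketch of the corresponding computations follows; `dl_random_effects` is a hypothetical helper name.

```python
import numpy as np

def dl_random_effects(g, s):
    """DerSimonian-Laird random-effects estimate from effects g and standard errors s."""
    g, s = np.asarray(g, float), np.asarray(s, float)
    w = 1 / s ** 2
    g0 = np.sum(w * g) / np.sum(w)                 # fixed-effects estimate
    q = np.sum(w * (g - g0) ** 2)
    k = len(g)
    # Moment estimator of the between-study variance, truncated at zero.
    tau2 = max(0.0, (q - (k - 1)) / (np.sum(w) - np.sum(w ** 2) / np.sum(w)))
    w_re = 1 / (tau2 + s ** 2)
    g_re = np.sum(w_re * g) / np.sum(w_re)         # random-effects estimate
    return g_re, 1 / np.sqrt(np.sum(w_re)), tau2
```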
For the random-effects model, the units are studies, not subjects. Inference
is formally on studies. For the inference to apply at the subject level, either all
studies should have the same size or all individual estimated effects
should have the same standard error. Failing that, the study standard
error should not be correlated (not even spuriously correlated) with the study
effect size. It is difficult to evaluate and be certain that the study standard
error and study effect size are not correlated.
While the existence of heterogeneity invalidates the assumptions of a fixed-
effects meta-analysis, the existence of a correlation between the estimated
effects and the within-study standard errors invalidates the assumptions of
a random-effects meta-analysis. There are other circumstances that can also
invalidate the assumptions of a random-effects meta-analysis.
For these meta-analysis methods, the common or average effect in the models
reflects the expected value (or conditional expected value for a random-effects
model) of the estimated effects. This is important as study conduct, missing
data, or design features can introduce bias in estimating the true study effect.
Biggerstaff and Tweedie23 used results of Larholt, Tsiatis, and Gelber21 to determine the confidence intervals for τ2 and
the alternative confidence intervals for γ on the basis of the distribution of

$\hat\tau^2_{BT} = \frac{\sum_{i=1}^{k}(1/s_i^2)(\hat\gamma_i - \hat\gamma_0)^2 - (k-1)}{\sum_{i=1}^{k}(1/s_i^2) - \sum_{i=1}^{k}(1/s_i^4)\Big/\sum_{i=1}^{k}(1/s_i^2)}$
The type I error rate of the random-effects (RE) test increases as the between-trial
variance grows relative to the within-trial variance and the number of studies
decreases. For known and equal study-specific variances of σ2, an approximate
type I error rate for the RE test is

$2\left[1 - \Phi\!\left(\frac{z_{\alpha/2}}{\sqrt{1+\tau^2/\sigma^2}}\right)F_{k-1}\!\left(\frac{k-1}{1+\tau^2/\sigma^2}\right) - \int_{(k-1)/(1+\tau^2/\sigma^2)}^{\infty}\Phi\!\left(z_{\alpha/2}\sqrt{x/(k-1)}\right)f_{k-1}(x)\,\mathrm{d}x\right]$

where $F_{k-1}$ and $f_{k-1}$ are the distribution and density functions for a χ2 distribu-
tion with k – 1 degrees of freedom, respectively. As noted by Ziegler, Koch,
and Victor,22 for a fixed number of studies, the type I error rate is increas-
ing in τ 2/σ 2. Thus, for fixed k and τ 2, the type I error rate is decreasing in
σ 2 (increasing in the sample size/the number of events for a time-to-event
endpoint). For fixed τ 2 and σ 2, the type I error rate decreases as the number
of studies, k, increases. As σ2 → 0, the type I error rate converges to
$2(1 - G_{k-1}(z_{\alpha/2}))$, where $G_{k-1}$ is the distribution function for a t distribution with
k – 1 degrees of freedom.
For selected numbers of studies, Table 4.4 provides the limiting type I error
rates as σ 2 → 0 for α/2 = 0.025. When there are only three studies and the
within-trial variability is much less than the between-trial variability, the
type I error rate for the superiority test will be about 9.5%. The type I error
inflation may be quite large when the number of studies is small.
As σ2 → 0, the form of the asymptotic distribution function, $H_{k-1}$, for the
RE test statistic when γ = 0 is provided in the paper of Ziegler, Koch, and
Victor.22 They proposed using $H_{k-1}^{-1}(1-\alpha/2)$ as the critical value for the RE
test statistic when testing for effectiveness and as a multiplier when deter-
mining confidence intervals for γ. From their simulations, the new test either
maintains the approximate type I error rate or is conservative.
TABLE 4.4
Limiting Type I Error Rates as σ 2 → 0 for α/2 = 0.025
Number of Studies Limiting Type I Error Rate
3 0.095
10 0.041
25 0.031
50 0.028
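As σ2 → 0 the RE statistic behaves like a t statistic with k − 1 degrees of freedom, so the one-sided rates in Table 4.4 correspond to the tail probability P(t_{k−1} > z_{0.025}). A minimal check, assuming SciPy is available:

```python
from scipy.stats import norm, t

z = norm.ppf(0.975)  # z_{0.025}, approx. 1.96
for k in (3, 10, 25, 50):
    print(k, round(float(t.sf(z, df=k - 1)), 3))
# prints 0.095, 0.041, 0.031, 0.028, matching Table 4.4
```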
Example 4.4
In this example, there are three previous randomized, clinical trials comparing the
active control therapy to placebo on a continuous outcome. We will assume that
each within-trial estimator of the treatment effect is unbiased and has a normal
distribution. Table 4.5 provides the estimated active control effect along with the
corresponding standard deviation and 95% confidence interval for each trial and
for the fixed-effects and random-effects meta-analyses. Figure 4.1 displays the
corresponding 95% confidence intervals.
For a random-effects meta-analysis, Table 4.5 provides the 95% confi-
dence intervals based on both percentiles from a standard normal distribution
and percentiles from a t distribution with 2 degrees of freedom.
TABLE 4.5
Trial and Integrated Estimated Effects, Standard Deviations, and 95% Confidence
Intervals
Trial/Analysis Estimated Effect Standard Deviation 95% Confidence Interval
Trial 1 7 3 (1.1, 12.9)
Trial 2 6 2 (2.1, 9.9)
Trial 3 37 3.5 (30.2, 43.8)
Fixed effects 12.0 1.50 (9.0, 14.9)
Random effects 16.5 9.01 (–1.2, 34.2)a (–22.3, 55.3)b
a Based on the 2.5th and 97.5th percentiles of a standard normal distribution.
b Based on the 2.5th and 97.5th percentiles of a t distribution with 2 degrees of freedom.
FIGURE 4.1
Trial and meta-analysis 95% confidence intervals.
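Applying the dl_random_effects sketch given earlier to the three trials of Table 4.5 reproduces the tabled summaries (up to rounding):

```python
effects, sds = [7, 6, 37], [3, 2, 3.5]

w = [1 / s ** 2 for s in sds]
fe = sum(wi * gi for wi, gi in zip(w, effects)) / sum(w)
print(fe, 1 / sum(w) ** 0.5)   # approx. 12.0 and 1.50 (fixed effects)

re, se_re, tau2 = dl_random_effects(effects, sds)
print(re, se_re)               # approx. 16.5 and 9.01 (random effects)
```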
TABLE 4.6
Trial and Integrated Estimated Effects, Standard Deviations, and 95% Confidence
Intervals
Trial/Analysis Estimated Effect Standard Deviation 95% Confidence Interval
Trial 1 7 3 (1.1, 12.9)
Trial 2 6 2 (2.1, 9.9)
Trial 3 12 3.5 (5.2, 18.8)
Fixed effects 7.4 1.50 (4.4, 10.3)
Random effects 7.5 1.62 (4.3, 10.6)a (0.5, 14.4)b
a Based on the 2.5th and 97.5th percentiles of a standard normal distribution.
b Based on the 2.5th and 97.5th percentiles of a t distribution with 2 degrees of freedom.
dose of the active control therapy, and study conduct in the non-inferiority trial.
Differences in the bias on the estimated effects, due to differences in study con-
duct and study design, also contribute to heterogeneity in the estimated effects.
This heterogeneity is not properly dealt with by being treated as unexplained vari-
ability in treatment effects.
When there is convincing evidence that $\gamma_i$ and $\sigma_i^2$ are correlated, the assump-
tions for the random-effects model probably do not hold. That is, the assumption
that γ1, γ2, . . . , γk are identically distributed is probably false.
The fixed-effects estimator, $\hat\gamma_{FE}$, is also an unbiased estimator of γ when
the random-effects model holds. The variance for $\hat\gamma_{FE}$ under the random-
effects model is given by

$1\Big/\sum_{i=1}^{k}(1/\sigma_i^2) + \tau^2\sum_{i=1}^{k}(1/\sigma_i^4)\Big/\left(\sum_{i=1}^{k}(1/\sigma_i^2)\right)^{2}$

The variance of $\hat\gamma_{FE}$ under a fixed-effects model and the variances for
$\hat\gamma_{RE}$ and $\hat\gamma_{FE}$ under a random-effects model are respectively ordered as

$1\Big/\sum_{i=1}^{k}(1/\sigma_i^2)\ \le\ 1\Big/\sum_{i=1}^{k}(\sigma_i^2+\tau^2)^{-1}\ \le\ 1\Big/\sum_{i=1}^{k}(1/\sigma_i^2) + \tau^2\sum_{i=1}^{k}(1/\sigma_i^4)\Big/\left(\sum_{i=1}^{k}(1/\sigma_i^2)\right)^{2}$

The closer τ2 is to zero, the more similar the variances. When
the fixed effects and random effects estimated effect sizes are quite different,
this may indicate that the assumptions for the random effects model do not
hold.
When the effects are heterogeneous, Greenland and Salvan20 suggest mod-
eling the study differences instead of providing a single estimated effect.
In a case like that given in Table 4.5, where there is enormous heteroge-
neity of the estimated effects, it is likely that much of that heterogeneity is
explainable. Neither the fixed-effects nor the DerSimonian–Laird random-
effects meta-analysis is appropriate in such a case. It is important to investigate the
heterogeneity of the estimated effects. The potential bias in the estimates
should also be considered. The variability in the estimated effects that can be
explained should be used to estimate the active control effect in the setting
of the non-inferiority trial along with any further effect modification antici-
pated in the setting of the non-inferiority trial. The precision of the estimate
would be based on the within-trial variances of the estimated effects and
the unexplained between-trial variability in the estimated effects of active
control therapy.
The standard deviation for the resulting estimator of the active control effect
is larger than the corresponding standard deviation for the estimator not
based on a covariate adjustment.
In the absence of other biases previously discussed, the unbiased appli-
cation of this estimated effect to the non-inferiority trial requires the con-
ditional constancy assumption that for any given set of covariates the
conditional active control effect is constant across all studies (including the
non-inferiority trial) and that all effect modifiers have been accounted for.
There should be biological plausibility that a covariate is an effect modifier
with preferably reproduced results on the effect size of the active control.
Selection of a covariate should not be based on data dredging by selecting
an arbitrary covariate that just happens to have differing observed effects
across its subgroups.
Such a procedure is most relevant when multiple trials evaluating the active
control have demonstrated similar heterogeneous effects. When there are
no underlying differences in the effects across subgroups, there will always
be some anticipated difference in the estimated effects across subgroups.
Observing similar but small differences in the estimated effects within sub-
groups in two, three, or even four trials may not be strong evidence of het-
erogeneous effects (or at least meaningful heterogeneous effects that would
deserve attention). A conservative approach may select the smaller of the
margin (or the more conservative estimation of the active control effect) from
an approach that adjusts the estimated active control effect by the relative
frequencies of important subgroups and from an approach of homogeneous
effects across subgroups.
This problem of heterogeneous effects across subgroups cannot be solved
by using adjusted analyses within each study, as such analyses either assume
homogeneous effects across the corresponding subgroups or weight the
effects by the corresponding relative frequencies, which would differ
across trials (leading to heterogeneous effects across trials).
However, when the effect of the active control varies across important
subgroups, a non-inferiority or any efficacy conclusion overall is really a
conclusion on an overall or weighted-average result with the weights being
the relative frequencies of the important subgroups. A conclusion of non-
inferiority or any efficacy for every meaningful subgroup requires individ-
ual non-inferiority comparisons for each subgroup. Unless the results are
quite marked, it is very difficult to interpret subgroup analyses in a two-arm
non-inferiority trial.
For a Bayesian analog to a fixed-effects meta-analysis with known within-study
variances and a noninformative prior, the posterior distribution for the active
control effect γ is a normal distribution with mean equal to
$\sum_{i=1}^{k}(1/\sigma_i^2)\hat\gamma_i\Big/\sum_{i=1}^{k}(1/\sigma_i^2)$ and variance equal to $1\Big/\sum_{i=1}^{k}(1/\sigma_i^2)$. When the
constancy assumption holds, the derived posterior distribution for the effect
of the active control can validly be used in the setting of the non-inferiority
trial as the distribution for the effect of the active control. If appropriate, the
active control effect can be discounted when applied to the setting of the
non-inferiority trial.
For a Bayesian analog to a random-effects meta-analysis, the output can be
either a posterior distribution for the random within-study treatment effect,
γ k+1, or a posterior distribution for the mean treatment effect across studies
(i.e., the global mean), γ. Let ψ = τ 2. We will consider an improper prior distri-
bution for (γ,ψ) whose density depends only on the value of ψ (i.e., g(γ,ψ) = j(ψ)).
Conditional on (γ,ψ), the true within-study effects, γ1, γ 2, . . . , γ k, are assumed
to be a random sample from a normal distribution having mean γ and vari-
ance ψ. Conditional on γ1, γ2, . . . , γk, the estimators $\hat\gamma_1, \hat\gamma_2, \ldots, \hat\gamma_k$ are independently normally
distributed, where $\hat\gamma_i$ has a normal distribution with mean $\gamma_i$ and variance
$\sigma_i^2$ for i = 1, . . . , k. The variances $\sigma_1^2, \sigma_2^2, \ldots, \sigma_k^2$ may be regarded as known or as
having some prior distribution. In the known variances case, the joint poste-
rior distribution for (γ,ψ) is determined conditional on the observed values,
x1, x2, . . . , xk, of $\hat\gamma_1, \hat\gamma_2, \ldots, \hat\gamma_k$. The joint posterior density will factor into the
product of the marginal distribution for ψ and a normal conditional distri-
bution for γ given ψ having a mean equal to
$\sum_{i=1}^{k}(\sigma_i^2+\psi)^{-1}x_i\Big/\sum_{i=1}^{k}(\sigma_i^2+\psi)^{-1}$
and variance equal to $1\Big/\sum_{i=1}^{k}(\sigma_i^2+\psi)^{-1}$. The density for the marginal distri-
bution for ψ is proportional to $\exp[-(1/2)q(\psi)]\prod_{i=1}^{k}(\sigma_i^2+\psi)^{-1/2}\times j(\psi)$, where

$q(\psi) = \sum_{i=1}^{k}x_i^2(\sigma_i^2+\psi)^{-1} - \left(\sum_{i=1}^{k}x_i(\sigma_i^2+\psi)^{-1}\right)^{2}\Big/\sum_{i=1}^{k}(\sigma_i^2+\psi)^{-1}$

For simulating from the posterior distribution of γ, a value ψr is randomly drawn
from the marginal posterior distribution for ψ, and then a value for γ is randomly
drawn from the normal distribution having mean equal to
$\sum_{i=1}^{k}(\sigma_i^2+\psi_r)^{-1}x_i\Big/\sum_{i=1}^{k}(\sigma_i^2+\psi_r)^{-1}$
and variance equal to $1\Big/\sum_{i=1}^{k}(\sigma_i^2+\psi_r)^{-1}$.
Given ψ, the conditional posterior mean for γ is thus equal
to $\sum_{i=1}^{k}(\sigma_i^2+\psi)^{-1}x_i\Big/\sum_{i=1}^{k}(\sigma_i^2+\psi)^{-1}$.
Under the assumption that the same model applies for the next trial or the
non-inferiority trial (i.e., a form of the constancy assumption), the distribu-
tion for the treatment or active control effect in the next study, γ k+1, is based
on the posterior distribution for (γ,ψ) and that conditional on (γ,ψ), γ k+1 has a
normal distribution with mean γ and variance ψ. Thus, the distribution for
γ k+1 can be approximated by further taking a random value from the normal
distribution with mean equal to the simulated value for γ and variance equal
to the simulated value for ψ. As noted earlier, if appropriate the active control
effect can be discounted when applied to the setting of the non-inferiority
trial.
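A sketch of this two-stage simulation for known within-study variances and a flat prior j(ψ) = 1, with hypothetical inputs x and sigma2 and ψ discretized on a grid:

```python
import numpy as np

x = np.array([0.4, 0.6, 0.3])          # observed study effects (hypothetical)
sigma2 = np.array([0.04, 0.02, 0.05])  # known within-study variances (hypothetical)

# Discretized marginal posterior density for psi (flat prior j(psi) = 1).
psi_grid = np.linspace(1e-6, 1.0, 2000)
log_dens = []
for psi in psi_grid:
    v = sigma2 + psi
    q = np.sum(x ** 2 / v) - np.sum(x / v) ** 2 / np.sum(1 / v)
    log_dens.append(-0.5 * q - 0.5 * np.sum(np.log(v)))
dens = np.exp(np.array(log_dens) - max(log_dens))
dens /= dens.sum()

rng = np.random.default_rng(1)
psi_r = rng.choice(psi_grid, size=10_000, p=dens)  # draws of psi
v_r = sigma2[None, :] + psi_r[:, None]
mean_r = (x / v_r).sum(axis=1) / (1 / v_r).sum(axis=1)
var_r = 1 / (1 / v_r).sum(axis=1)
gamma = rng.normal(mean_r, np.sqrt(var_r))         # posterior draws of gamma
gamma_next = rng.normal(gamma, np.sqrt(psi_r))     # draws of gamma_{k+1}
```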
In the cases where the variances $\sigma_1^2, \sigma_2^2, \ldots, \sigma_k^2$ are unknown, the joint pos-
terior distribution for $(\gamma_i, \sigma_i^2)$ can be determined for each i = 1, 2, . . . , k. For
continuous data, Section 12.2.4 provides an example of a joint posterior dis-
tribution for the mean and variance. Approximating the posterior distribution
for γ then involves, for each replication, drawing the within-study variances from
their posterior distributions, drawing a value ψr from the corresponding marginal
posterior distribution for ψ, and then drawing a value for γ from the normal
distribution with mean $\sum_{i=1}^{k}(\sigma_i^2+\psi_r)^{-1}x_i\Big/\sum_{i=1}^{k}(\sigma_i^2+\psi_r)^{-1}$ and variance
$1\Big/\sum_{i=1}^{k}(\sigma_i^2+\psi_r)^{-1}$.
References
1. U.S. Food and Drug Administration, Guidance for industry: Non-inferiority
clinical trials (draft guidance), March 2010.
2. International Conference on Harmonization of Technical Requirements for
Registration of Pharmaceuticals for Human Use (ICH) E-10: Guidance on
choice of control group in clinical trials, 2000, at http://www.ich.org/cache/
compo/475-272-1.html#E4.
3. Prentice, R.L. et al., Combined postmenopausal hormone therapy and cardio-
vascular disease: Toward resolving the discrepancy between observational
studies and the Women’s Health Initiative clinical trial, Am. J. Epidemiol., 162,
404–414, 2005.
4. Fleming, T.R. et al., Some essential considerations in the design and conduct of
non-inferiority trials, submitted manuscript, 2010.
5. Wang, S.-J., Hung, H.M.J., and Tsong, Y., Utility and pitfalls of some statistical
methods in active controlled clinical trials, Control. Clin. Trials, 23, 15–28, 2002.
6. Hedges, L.V., Modeling publication selection effects in meta-analysis, Stat. Sci.,
7, 246–255, 1992.
7. Dear, K.B. and Begg, C.B., An approach for assessing publication bias prior to
performing meta-analysis, Stat. Sci., 7, 237–245, 1992.
8. Sterling, T.D., Rosenbaum, W.L., and Weinkam, J.J., Publication decisions revis-
ited: The effect of the outcome of statistical tests on the decision to publish and
vice-versa, Am. Stat., 49, 108–112, 1995.
9. Light, R.J. and Pillemer, D.B., Summing Up: The Science of Reviewing Research,
Harvard University Press, Boston, MA, 1984.
10. U.S. Food and Drug Administration Oncologic Drugs Advisory Committee
meeting, July 27, 2004, transcript, at http://www.fda.gov/ohrms/dockets/
ac/04/transcripts/2004-4060T1.pdf.
11. Product label for Taxotere, at http://www.accessdata.fda.gov/drugsatfda_
docs/label/2010/020449s059lbl.pdf.
12. Statistical review of NDA 20449/S11 dated December 15, 1999, at http://www
.accessdata.fda.gov/drugsatfda_docs/nda/99/20449-S011_TAXOTERE_statr
.pdf.
13. Glass, G.V., Primary, secondary and meta-analysis of research, Educ. Res., 5, 3–8,
1976.
14. Follmann, D.A. and Proschan, M.A., Valid inferences in random effects meta-
analysis, Biometrics, 55, 732–737, 1999.
15. Rothmann, M.D. et al., Missing data in biologic oncology products, J. Biopharm.
Stat., 19, 1074–1084, 2009.
16. DerSimonian, R. and Laird, N., Meta-analysis in clinical trials, Control. Clin.
Trials, 7, 177–188, 1986.
17. Jackson, D., The power of the standard test for the presence of heterogeneity in
meta-analysis, Stat. Med., 25, 2688–2699, 2006.
18. Yusuf, S. et al., Beta blockade during and after myocardial infarction: An over-
view of the randomized trials, Prog. Cardiovasc. Dis., 27, 335–371, 1985.
19. Peto, R., Why do we need systematic overviews of randomised trials? Stat. Med.,
6, 233–240, 1987.
20. Greenland, S. and Salvan, A., Bias in the one-step method for pooling study
results, Stat. Med., 9, 247–252, 1990.
21. Larholt, K., Tsiatis, A.A., and Gelber, R.D., Variability of coverage probabili-
ties when applying a random effects methodology for meta-analysis, Harvard
School of Public Health Department of Biostatistics, unpublished, 1990.
22. Ziegler, S., Koch, A., and Victor, N., Deficits and remedy of the standard random
effects methods in meta-analysis, Methods Inform. Med., 40, 148–155, 2001.
23. Biggerstaff, B.J. and Tweedie, R.L., Incorporating variability in estimates of het-
erogeneity in the random effects model in meta-analysis, Stat. Med., 16, 753–768,
1997.
24. Zhang, Z., Covariate-adjusted putative placebo analysis in active-controlled
clinical trials, Stat. Biopharm. Res., 1, 279–290, 2009.
5.1 Introduction
A non-inferiority analysis is frequently conducted based on the determi-
nation of a non-inferiority margin or threshold. The choice of the margin
should depend on prior experience of the estimated effect of the active con-
trol in adequate, well-controlled trials, and account for regression-to-the-
mean bias, effect modification, and clinical judgment. The non-inferiority
margin must be small enough to preclude a conclusion that a placebo (or a treatment that
is no better than placebo on a given endpoint) is noninferior to the active
control. Other concerns about the non-inferiority margin might make the
margin even smaller, but it should not be larger than the smallest anticipated
difference between a placebo and the active control in the setting of the non-
inferiority trial.
From experience there are basically two philosophies in constructing a
non-inferiority analysis. One philosophy involves making adjustments to
the estimation of the active control effect to account for biases, effect modi-
fication, and any additional uncertainty, and then using a test procedure that
targets a desired type I–like error rate. The other philosophy involves apply-
ing a conservative method of analysis (e.g., comparing the most conservative
limits of 95% confidence intervals) that includes the results from an unad-
justed estimation of the active control effect and from the non-inferiority
trial. The hope is that the conservative method will account for any biases in
the estimate of the active control effect and any deviation from the constancy
assumption.
There will be instances when the non-inferiority analysis will not be based
on either philosophy. For example, clinical judgment may deem the unaccept-
able amount of loss of the active control effect to be smaller than would be
determined from either philosophy. Another exception can occur when there
is great heterogeneity in the effect of the control in previous studies. If the
heterogeneity cannot be explained, the non-inferiority analysis may need to
consider this heterogeneity and how small the active control effect may need
to be in the non-inferiority trial. If the active control therapy has not regu-
larly shown efficacy in clinical trials, the non-inferiority margin may need to
91
be zero, meaning that the experimental therapy must show superiority to the
active control to be deemed effective.
In this chapter, we discuss two-confidence-interval and synthesis methods
for non-inferiority testing in an active-controlled trial. These methods are
compared in Section 5.4. Additionally, the type I error rates are also assessed
in Section 5.4, including under practical models where the estimation of the
active control effect is subject to regression-to-the-mean bias. In Section 5.5,
we compare the results of the two-confidence-interval and synthesis methods
with an example in oncology.
active control may have versus placebo in the setting of the non-inferiority
trial. If the active control has an effect of M1 in the non-inferiority trial, the
trial will have assay sensitivity in determining whether an experimental
therapy is effective or ineffective provided adequate study conduct. The non-
inferiority margin M2 is a fraction of M1 chosen to assure that the experi-
mental therapy retains at least some desired amount of the active control
effect. The margins of M1 and M2 are used respectively for two objectives:
(1) demonstrating that the experimental therapy is superior to placebo and
(2) demonstrating that the experimental therapy is not unacceptably worse
than the active control.
M2 has been treated as a fixed margin, despite often being based on or
influenced by the estimated active control effect. In this section, we will con-
sider testing involving statistical hypotheses that treat M2 as a fixed value.
In Section 5.4 on evaluating error rates, we will treat M1 and M2 as realized
values involving the estimated active control effect. We will consider com-
paring the treatment arms using metrics based on undesirable outcomes.
This includes the difference in means where the smaller the value the better,
differences in proportions on an undesirable event, the log-relative risk of
an undesirable event, and the log-hazard ratio where the longer the time the
better. Then the hypotheses of interest are expressed as

$H_o{:}\ \beta_N \ge M_2 \quad \text{versus} \quad H_a{:}\ \beta_N < M_2 \qquad (5.1)$
where β N is the experimental therapy versus the active control therapy (i.e.,
E–C or E/C) parameter of interest (i.e., the true treatment difference) in
the non-inferiority trial. The inequalities in the hypotheses in Equation 5.1
would be reversed for “positive” or desirable outcomes (e.g., cure,
prevention, time-to-relief). For these cases, each hypothesis is expressed by
multiplying each side of the inequality by –1 and defining the parameter in
terms of the active control therapy versus the experimental therapy.
Since M1 and M2 are based on the estimated effect of the active control,
they are realizations of random quantities, not constants fixed a priori. The hypotheses in
Equation 5.1 are surrogate hypotheses for whether the experimental ther-
apy is unacceptably worse than the active control in the setting of the non-
inferiority trial. The hope is that rejecting Ho in Expression 5.1 and concluding
Ha will imply that the experimental therapy is effective, with an effect that
is not unacceptably worse than the active control. The null hypothesis in
Expression 5.1 is rejected and non-inferiority is concluded when the upper
limit of a 100(1 – α)% confidence interval for βN is less than M2. Normal-
based confidence intervals are often used, so Ho would be rejected when
$\hat\beta_N + z_{\alpha/2}s_N < M_2$, where $\hat\beta_N$ is the estimated value for $\beta_N$ and $s_N$ is the esti-
mated standard deviation for $\hat\beta_N$.
Additionally, it is popular to define M1 as the lower limit of a 100(1 – γ)%
confidence interval for the historical active control effect, βH (expressed so that
positive values correspond to a beneficial effect).
Consistent with Hung, Wang, and O’Neill,5 a “Y–X method” will refer to a
two-confidence-interval procedure where a two-sided Y% confidence inter-
val is determined from the non-inferiority trial and the active control effect
is based on a two-sided X% confidence interval. The definitions of Y and
X are the reverse in the U.S. Food and Drug Administration (FDA) draft
guidance.6
Using 95% confidence intervals for both the historical effect of the active
control therapy and for the comparison of the experimental and active control
therapies in the non-inferiority trial is common. We will refer to this approach
as the 95–95 method or approach. This approach has been described as com-
paring the two statistically worst cases. For the 95–95 approach, Rothmann et
al.1 showed that when the constancy assumption holds, the one-sided type I
error rate is between 0.0027 and 0.025 in falsely concluding that an ineffective
therapy is effective. Sankoh7 called the two-confidence-interval approach
“uniformly ultraconservative,” preferring to use a fraction of the point
estimate instead of the lower bound of a confidence interval for the active
control effect. Although using a fraction of the lower bound of a confidence
interval may be conservative in many situations, it may not be conservative
(and certainly not uniformly ultraconservative) in all situations, particularly
in indications where regression-to-the-mean bias and/or effect modification
are major concerns.
Such a margin (the lower bound of the 95% confidence interval or some
fraction thereof) will be conservative when the constancy assumption holds.
However, in many cases, the constancy assumption does not hold, or cannot
be proven to hold. The use of the lower limit of the 95% confidence interval
for the estimated active control effect provides some adjustment for bias and
deviation from the constancy. Subjects enrolled in the current study may be
fundamentally different from subjects enrolled in the historical study, owing
to changes in diagnosis or standards of concomitant care since the histori-
cal study was completed; or the disease is fundamentally different (such as
infectious diseases, which are known to change over time as they adapt in
response to medications); or logistics differ (when a study is run in a dif-
ferent set of geographic sites than the historical comparison used). When
the constancy assumption may not hold, choosing a fraction of the lower
bound of a confidence interval for the historical treatment effect can provide
an allowance for deviation from the constancy assumption.
The width of the confidence interval for the historical effect of the active
control will depend on the sample sizes of the historical studies. A large
estimated effect for the active control therapy from large studies may pro-
duce a confidence interval with a lower bound that corresponds to a large effect, and thus
require a smaller sample size for the non-inferiority trial to rule out a differ-
ence of practical importance. Conversely, a single small study may produce
a confidence interval with a lower bound that corresponds to a small effect,
even if the point estimate of the active control effect was large, and thus
require a large sample size for the non-inferiority trial to rule out an appro-
priate non-inferiority margin. In such cases, it is tempting, although it may
not be possible, to increase the margin because of the lack of precision in the
estimate of the historical treatment effect. When warranted, the confidence
level for the historical effect of the active control therapy can be adjusted to
be higher for a more conservative, smaller margin or be adjusted lower for
a more liberal, larger margin. Hauck and Anderson8 suggested utilizing the
lower bound of a confidence interval with a confidence level of 68–90%. The
lower confidence level will lead to a larger lower bound, and hence a larger
non-inferiority margin.
Fixed-effects and random-effects meta-analyses have been used in deter-
mining the confidence interval for the historical active control effect.
Between-trial variability in the active control is a concern especially when
the heterogeneity in the active control effect cannot be explained. When
there is a single study, the heterogeneity in the active control effect cannot
be assessed. Also, with the lack of a reproduced effect size, the significant
or highly significant result from a single trial may have a large associated
regression-to-the-mean bias and thus greatly overstate the true active con-
trol effect. The existence of multiple studies that provide consistent estimates
of the active control effect gives assurance that the regression-to-the-mean
bias is small, and that the meta-analysis reliably estimates the active control
effect when the historical effect of the active control applies in the setting of
the non-inferiority trial. Concerns about applying the estimated active
control effect to the setting of the non-inferiority trial may lead to either dis-
counting the estimated active control effect (i.e., discounting the lower limit
of the confidence interval for the active control effect) or basing the non-
inferiority margin on a larger-level confidence interval for the active control
effect.
If the multiple historical comparisons of the active control to placebo
provide inconsistent estimates of the active control effect, confidence in a
common active control effect decreases. In such a case, the choice of non-
inferiority margin should consider the between-trial variability in the active
control effect. When a random-effects meta-analysis is used for the estima-
tion of the active control effect, Lawrence9 proposes using a 95% prediction
interval for the active control effect in the next (random) trial as a replacement
in the 95–95 method for the 95% confidence interval for the active control
effect.
Example 5.1 summarizes one of the first two-confidence-interval proce-
dures, which involved thrombolytic products.
Example 5.1
Example 5.2
To illustrate the use of the margins just discussed, we revisit Example 4.3, which
considered six approaches for estimating the effect of docetaxel versus best sup-
portive care (BSC) on overall survival in second-line NSCLC. Table 4.3 provided
the estimates and 95% confidence intervals for the docetaxel versus BSC haz-
ard ratios. Here the BSC versus docetaxel log-hazard ratio is the docetaxel effect
parameter, βH. For each of the six approaches using Table 4.3, Table 5.1 gives the
95% confidence interval for βH and the corresponding margins obtained from the
confidence interval, where M1 is the lower limit of the 95% confidence interval
for βH and M2 represents 50% of M1. For approaches 2a and 2b, a superiority
comparison to docetaxel would be required for a new investigational agent.
TABLE 5.1
95% Confidence Intervals for Docetaxel Effect and Corresponding Margins
Approach 95% Confidence Interval for βH Margins
1 (0.128, 1.050) M1 = 0.128, M2 = 0.064
2a (–0.018, 0.611) M0 = 0
2b (–0.015, 0.360) M0 = 0
3 (0.082, 0.764) M1 = 0.082, M2 = 0.041
4a (0.064, 0.646) M1 = 0.064, M2 = 0.032
4b (0.015, 0.624) M1 = 0.015, M2 = 0.0075
The JMEI trial studied the use of pemetrexed against the active control of doc-
etaxel at a dose of 75 mg/m2 in subjects with second-line NSCLC. From the FDA
Oncologic Drugs Advisory Committee meeting transcript,12 the 95% confidence
interval for the pemetrexed versus docetaxel hazard ratio in the JMEI study is
0.817–1.204. Taking the natural logarithms of each limit in the confidence interval
gives a 95% confidence interval for the pemetrexed versus docetaxel log-hazard
ratio, βN, of –0.202 to 0.186. The upper limit of the 95% confidence interval for βN
of 0.186 exceeds every margin specified in Table 5.1. Thus when a margin is based
on a 95% confidence interval for the docetaxel effect versus BSC, the results
from the JMEI trial fail to conclude non-inferiority to docetaxel, regardless of the
approach used to estimate the docetaxel effect.
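The margin arithmetic in Table 5.1 is mechanical: M1 is minus the natural logarithm of the upper limit of the docetaxel-versus-BSC hazard-ratio confidence interval, and M2 is taken here as 50% of M1. A minimal sketch:

```python
import math

hr_upper = {"1": 0.88, "2a": 1.018, "2b": 1.015,
            "3": 0.921, "4a": 0.938, "4b": 0.985}
for approach, upper in hr_upper.items():
    m1 = -math.log(upper)  # lower limit of the 95% CI for beta_H
    if m1 <= 0:
        print(approach, "M0 = 0 (superiority to docetaxel required)")
    else:
        print(approach, f"M1 = {m1:.3f}, M2 = {m1 / 2:.4f}")
```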
Reducing the Potential for Biocreep. In general, when possible, the seemingly
most effective, available standard of care should be used as the active control
in a non-inferiority trial. However, as the selected standard of care is the
therapy or regimen that has the best estimated effect, the estimation of the
effect of that standard of care will have a regression-to-the-mean bias. This
bias should be accounted for by either making an appropriate adjustment to
the estimation of its effect or including the estimated effects of all potential
candidates for a standard of care into the meta-analysis.
For a given indication, once the first non-inferiority trial has established
a criterion for non-inferiority, it may be reasonable that all future non-
inferiority trials have the same or more stringent criterion regardless of the
active control used in the trial. For example, suppose that a margin of 5 days
was used for the duration of an adverse event for the original active control
(A) in non-inferiority trials. If another therapy (B) is to be used as an active
control in a non-inferiority trial and the 5-day margin to A is still relevant,
the non-inferiority margin for B as a control, δ, should be such that it guar-
antees that if the experimental therapy (C) is noninferior to B with margin δ,
then C is noninferior to A with a margin of 5 days. Suppose B was previously
compared with A in randomized trials and the 95% confidence interval for
the difference in days of the mean durations was –0.8 to 1.6. Using the phi-
losophy of a 95–95 method, a margin of 5.0 – 1.6 = 3.4 days may be justified
for B as the control therapy in a non-inferiority trial. In practice, it may also
This provides a basis, for some indications, for requiring that a new therapy
have efficacy greater than some minimal threshold. In the typical synthesis
testing, that threshold is regarded as a prespecified fraction of the effect of
the active control. Snapinn and Jiang18 expressed concern that a requirement
that the experimental therapy in a non-inferiority trial retain more than some
fraction of the effect of the active control creates a higher bar for approval
than was required for the active control, and that such a requirement may
prevent the approval of superior treatments to the active control.
In this section we discuss definitions for the proportion of the active con-
trol effect that the experimental therapy retains, possible corresponding sets
of non-inferiority hypotheses that can be tested, frequentist and Bayesian
procedures, and respective issues.
$\lambda = \frac{\beta_H - \beta_N}{\beta_H - 1} \qquad (5.3)$
The definition of the proportion of the active control effect that is retained
by the experimental therapy in Equation 5.3 is referred to as an “arithmetic
definition” in Rothmann’s paper.1 The definition of the retention fraction has
been used for relative metrics—for example, a relative risk or a hazard ratio.
However, for relative metrics, how two different possible values (e.g., a and b)
for the metric compare (or statistically compare) depends on their ratio (i.e.,
a/b) not on their difference (i.e., a – b).
For undesirable outcomes (i.e., smaller probabilities of “success” are better)
with a prespecified retention fraction of λo, the null and alternative hypoth-
eses are expressed as

$H_o{:}\ \lambda \le \lambda_o \quad \text{versus} \quad H_a{:}\ \lambda > \lambda_o \qquad (5.4)$

For absolute metrics (where the null value of βH is 0 rather than 1), the
retention fraction is defined as

$\lambda = \frac{\beta_H - \beta_N}{\beta_H} \qquad (5.5)$
When it is assumed that βH > 0, the alternative hypothesis is that the experi-
mental therapy retains more than 100λo% of the historical effect of the active
control.
The null and alternative hypotheses have also been expressed simply, with
the null hypothesis (5.8) being the complement of the alternative hypothesis

$H_a{:}\ \{\beta_N - (1-\lambda_o)\beta_H < 0 \text{ and } \beta_H > 0\}\ \text{or}\ \{\beta_N - \beta_H < 0 \text{ and } \beta_H < 0\} \qquad (5.9)$
For relative metrics, the test can be based on the test statistic

$Z_1 = \frac{\log\hat\beta_N - \log\big(\lambda_o + (1-\lambda_o)\hat\beta_H\big)}{\sqrt{\widehat{\mathrm{Var}}(\log\hat\beta_N) + \big[(1-\lambda_o)\hat\beta_H\big/\big(\lambda_o + (1-\lambda_o)\hat\beta_H\big)\big]^{2}\,\widehat{\mathrm{Var}}(\log\hat\beta_H)}} \qquad (5.10)$

The test rejects the null hypothesis in Expression 5.4 and concludes non-
inferiority when $Z_1 < -z_{\alpha/2}$ for some 0 < α < 1. When $\hat\beta_N$ and $\hat\beta_H$ are in-
dependent, $Z_1$ having an approximate standard normal distribution when λ =
λo depends on $\hat\beta_N$ having a normal distribution, on whether $\log\big(\lambda_o + (1-\lambda_o)\hat\beta_H\big)$
has an approximate normal distribution with a variance reliably estimated by
$\big[(1-\lambda_o)\hat\beta_H\big/\big(\lambda_o + (1-\lambda_o)\hat\beta_H\big)\big]^{2}\,\widehat{\mathrm{Var}}(\log\hat\beta_H)$, and on whether
$\widehat{\mathrm{Var}}(\log\hat\beta_N)\Big/\Big\{\big[(1-\lambda_o)\hat\beta_H\big/\big(\lambda_o + (1-\lambda_o)\hat\beta_H\big)\big]^{2}\,\widehat{\mathrm{Var}}(\log\hat\beta_H)\Big\}$ is large.
The remainder of this section will focus on absolute metrics.
Hauck and Anderson8 proposed using

$\hat\beta_N - \hat\beta_H + 1.645\sqrt{s_N^2 + s_H^2} < 0 \qquad (5.11)$

as a one-sided test that the experimental therapy is better than placebo. When
the constancy assumption holds, the left-hand side in Expression 5.11 is the
upper limit of the two-sided 90% confidence interval for the difference
between the experimental therapy and a placebo. This test procedure can
be rewritten to compare the upper limit of a one-sided 95% confidence
interval for βN with $\delta^* = \hat\beta_H - cs_H$, where $c = 1.645\sqrt{1 + s_N^2/s_H^2} - 1.645\,s_N/s_H$.
Non-inferiority is concluded when the upper limit of the two-sided 90% con-
fidence interval for βN is less than δ*. A similar procedure can also be found
in papers by Fisher and colleagues.21,22
The use of δ* is contingent on the constancy assumption. Hauck and
Anderson recommended discounting δ* when it is believed that there may be
between-trial variability in the active control effect. As there may be disagree-
ment in the appropriate margin, Hauck and Anderson recommend reporting
the 90% or the 95% two-sided confidence interval for βN to allow each indi-
vidual to decide for themselves whether non-inferiority has been met.
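A sketch of the δ* rewriting with hypothetical inputs follows; the two prints agree, illustrating the algebraic equivalence of Expression 5.11 and the δ* comparison.

```python
import math

bH, sH = 0.40, 0.10  # hypothetical historical effect estimate and standard error
bN, sN = 0.05, 0.12  # hypothetical non-inferiority-trial estimate and standard error

c = 1.645 * math.sqrt(1 + sN ** 2 / sH ** 2) - 1.645 * sN / sH
delta_star = bH - c * sH

print(bN - bH + 1.645 * math.sqrt(sN ** 2 + sH ** 2) < 0)  # Expression 5.11
print(bN + 1.645 * sN < delta_star)                        # equivalent delta* form
```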
When Expression 5.11 and the constancy assumption hold, the left-hand
side provides the minimal effect (or a greater effect) versus placebo
that can be ruled out. For the experimental therapy to rule out the same
minimal effect versus placebo as the active control has ruled out on the basis
of a one-sided 95% confidence interval requires that

$\hat\beta_N < -1.645\left(\sqrt{s_N^2 + s_H^2} - s_H\right) < 0$
For a one-sided 100(1 – α/2)% confidence interval, 1.645 is replaced with zα/2.
The Standard Synthesis Method. The standard synthesis method is based on
the test statistic
$Z_2 = \frac{\hat\beta_N - (1-\lambda_o)\hat\beta_H}{\sqrt{s_N^2 + (1-\lambda_o)^2 s_H^2}} \qquad (5.12)$
The test rejects the null hypothesis and concludes non-inferiority when $Z_2 < -z_{\alpha/2}$
for some 0 < α < 1. After correcting for differences in notation, we see that
the test statistics $Z_1$ in Equation 5.10 and $Z_2$ in Equation 5.12 are equivalent
when λo = 0 or 1. When the estimators $\hat\beta_N$ and $\hat\beta_H$ are independent, normally
distributed, and unbiased, or approximately so, $Z_2$ will have a standard nor-
mal distribution or an approximate standard normal distribution when $\beta_N =
(1-\lambda_o)\beta_H$, with a type I error rate of approximately α/2 for falsely concluding
that an experimental therapy that retains 100λo% of the active control effect
retains more than 100λo% of the active control effect. When $\hat\beta_H$ tends to overes-
timate (underestimate) the effect of the active control in the setting of the non-
inferiority trial, the type I error rate will be inflated (deflated). If the sampling
distributions for $\hat\beta_N - \hat\beta_H$ and $\hat\beta_N - (1-\lambda_o)\hat\beta_H$ are not normal distributions, these
two tests can be modified to fit the appropriate sampling distributions.
A Fieller 100(1 – α)% confidence interval can be determined for λ.1 The
Fieller 100(1 – α)% confidence interval equals {λo: –zα/2 < Z2 < zα/2, –∞ ≤ λo ≤ ∞}.
The null hypothesis in Expression 5.6 is rejected whenever every value in the
Fieller 100(1 – α)% confidence interval exceeds the prespecified value for λo.
If it is believed that the effect of the active control may have decreased,
the historical estimated effect can be discounted by using 0 < θ < 1.1 The test
statistic $Z_2$ in Expression 5.6 would then be replaced with the test statistic $Z_2^*$
in Equation 5.13, where

$Z_2^* = \frac{\hat\beta_N - (1-\lambda_o)\theta\hat\beta_H}{\sqrt{s_N^2 + (1-\lambda_o)^2\theta^2 s_H^2}} \qquad (5.13)$
When the size of the non-inferiority trial depends on the estimated active
control effect, the synthesis test statistic takes the form

$\frac{\hat\beta_N - (1-\lambda_o)\hat\beta_H}{\sqrt{\hat\sigma^2_{\hat\beta_N}(\hat\beta_H) + (1-\lambda_o)^2 s_H^2}}$

where the true variance for the non-inferiority trial, $\sigma^2_{\hat\beta_N}(\hat\beta_H)$, is a random
variable that depends on the estimated active control effect, $\hat\beta_H$. Rothmann2
assessed the type I error probability for testing the hypotheses in Expression
5.6 for two confidence interval methods and the standard synthesis method
when the standard error from the non-inferiority trial depends on the esti-
mated historical active control effect.
Delta-Method Confidence Interval Approach. When λo = 0, Hasselblad and
Kong15 recommended testing the hypotheses in Expression 5.6 on the basis
of the test statistic $Z_2$ in Equation 5.12. However, when 0 < λo ≤ 1, Hasselblad
and Kong15 proposed a delta-method confidence interval test procedure. The
estimator of λ is given by $\hat\lambda = 1 - \hat\beta_N/\hat\beta_H$ and the estimated standard error is
given by $S_{\hat\lambda} = \sqrt{(\hat\beta_N/\hat\beta_H)^2\big(s_N^2/\hat\beta_N^2 + s_H^2/\hat\beta_H^2\big)}$. The null hypothesis in Expression
5.7 is rejected, and non-inferiority is concluded, when $\hat\lambda - z_{\alpha/2}S_{\hat\lambda} > \lambda_o$. The test
is equivalent to rejecting the null hypothesis in Expression 5.7 when

$Z_3 = \frac{\hat\beta_N - (1-\lambda_o)\hat\beta_H}{\hat\beta_H S_{\hat\lambda}} < -z_{\alpha/2} \qquad (5.14)$

As $Z_2$ in Equation 5.12 and $Z_3$ in Equation 5.14 have the same numerators, and
$Z_2$ has an approximate standard normal distribution when λ = λo, how close
the distribution of $Z_3$ is to a standard normal distribution may depend on
whether $R = \hat\beta_H S_{\hat\lambda}\Big/\sqrt{s_N^2 + (1-\lambda_o)^2 s_H^2}$ (i.e., $Z_2/Z_3$) tends to be close to 1.
When λo = 0, R will be greater than 1 with probability 1 and Z3 would have
a distribution more concentrated near zero than the distribution for Z2.23 It
was noted from simulations that fairly large sample sizes may be needed
for the ratio of two independent normally distributed quantities to have an
approximate normal distribution.23 In particular the ratio of the mean of β̂ H to
its (estimated) standard deviation should be greater than 8 for the test based
on the delta-method confidence interval to have approximately the desired
type I error rate when β̂ H unbiasedly estimates the effect of the active control
in the setting of the non-inferiority trial. We comment further on the distri-
bution of Z3 in Section 5.4.2 on comparing the different analysis methods.
In Example 5.3, synthesis procedures will be performed including the
determination of Fieller confidence intervals for the proportion of the active
control effect retained by the experimental therapy.
Example 5.3
To illustrate the use of some of the synthesis methods just discussed, we revisit
Example 4.3. The JMEI trial studied the use of pemetrexed against the active con-
trol of docetaxel at a dose of 75 mg/m2 in subjects with second-line NSCLC. The
endpoint of interest was overall survival. We use the result of approach 2b in
Example 4.3 for the estimation of the docetaxel effect. From that approach, the
estimated docetaxel versus BSC hazard ratio was 0.842, with a corresponding
standard error for the log-hazard ratio estimator of 0.095 (95% confidence interval
for the hazard ratio of 0.698, 1.015). From the FDA Oncologic Drugs Advisory
Committee meeting,12 the estimated pemetrexed versus docetaxel hazard ratio
was 0.992 in the JMEI study with corresponding standard error for the log-hazard
ratio estimator of 0.099, which is determined from the 95% confidence interval of
0.817–1.204. Then the indirect estimate of the pemetrexed versus BSC hazard ratio
is given by 0.992 × 0.842 = 0.835, with a standard deviation for the correspond-
ing log-hazard ratio estimator of $\sqrt{(0.099)^2 + (0.095)^2} = 0.137$. This leads to a 95%
confidence interval for the pemetrexed versus BSC hazard ratio of 0.638–1.093.

For λo = 0.5, we have $Z_2 = \dfrac{\ln 0.992 - (1-0.5)\ln(1/0.842)}{\sqrt{(0.099)^2 + (1-0.5)^2(0.095)^2}} = -0.856$, which would
correspond with a one-sided p-value of 0.195. Here, since the 95% confidence
interval for the docetaxel versus BSC hazard ratio includes 1 (the upper limit is 1.015), and the
pemetrexed versus docetaxel estimated hazard ratio is close to 1, the 95% Fieller
confidence interval for λ as defined in Equation 5.5 is –∞ to ∞. That is, the
95% confidence interval does not rule out any possibilities for λ. A 90% Fieller
confidence interval for λ is –1.01 to 3.55.
If the estimated docetaxel effect was discounted by 20% (i.e., θ = 0.8), the
resulting value of the test statistic would be Z2* = −0.724 with a corresponding
one-sided p-value of 0.23.
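A sketch reproducing the computations of Example 5.3 follows; the Fieller limits are found by a simple grid search and agree with the quoted interval up to rounding of the inputs.

```python
import numpy as np
from scipy.stats import norm

bN, sN = np.log(0.992), 0.099      # pemetrexed vs. docetaxel (JMEI)
bH, sH = np.log(1 / 0.842), 0.095  # BSC vs. docetaxel (approach 2b)

def z2(lam, theta=1.0):
    """Synthesis statistic Z2 (Z2* when theta < 1 discounts the control effect)."""
    return (bN - (1 - lam) * theta * bH) / np.sqrt(
        sN ** 2 + (1 - lam) ** 2 * theta ** 2 * sH ** 2)

print(z2(0.5), norm.cdf(z2(0.5)))                        # approx. -0.856 and 0.196
print(z2(0.5, theta=0.8), norm.cdf(z2(0.5, theta=0.8)))  # approx. -0.724 and 0.234

lam = np.linspace(-5, 5, 200_001)
inside = lam[np.abs(z2(lam)) < norm.ppf(0.95)]           # 90% Fieller interval
print(inside.min(), inside.max())                        # approx. -1.0 and 3.6
```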
$\hat\beta_H \pm z_{\alpha/2}\sqrt{\mathrm{Var}(\hat\beta_H) + \mathrm{Var}(\hat\beta_N)}$

is a 100(1 – α)% prediction interval for the observed value of $\hat\beta_N$ when the
experimental therapy has the same efficacy as a placebo. Non-inferiority (or
any efficacy in this case) is concluded when $\hat\beta_N < \hat\beta_H - z_{\alpha/2}\sqrt{\mathrm{Var}(\hat\beta_H) + \mathrm{Var}(\hat\beta_N)}$,
or in other words when the observed value for $\hat\beta_N$ is less than the lower limit
of the 100(1 – α)% prediction interval for the observed value of $\hat\beta_N$ (deter-
mined under the assumption that the experimental therapy has the same
efficacy as a placebo).
We will discuss synthesis methods as prediction interval methods for both
fixed-effects and random-effects models for the active control effect.
Fixed-Effects Model. Consider a fixed-effects model for the active control
effect where it is assumed that the active control effect is constant across all
trials, including in the setting of the non-inferiority trial. When the experi-
mental therapy retains 100λ% of the effect of the active control therapy,
a 100(1 – α)% prediction interval for the observed value of $\hat\beta_N$ is given by
$(1-\lambda)\hat\beta_H \pm z_{\alpha/2}\sqrt{(1-\lambda)^2\mathrm{Var}(\hat\beta_H) + \mathrm{Var}(\hat\beta_N)}$. Non-inferiority (i.e., the experi-
mental therapy retains more than 100λ% of the effect of the active control
therapy) is concluded when $\hat\beta_N < (1-\lambda)\hat\beta_H - z_{\alpha/2}\sqrt{(1-\lambda)^2\mathrm{Var}(\hat\beta_H) + \mathrm{Var}(\hat\beta_N)}$, or
in other words when the observed value for $\hat\beta_N$ is less than the lower limit of
the 100(1 – α)% prediction interval for the observed value of $\hat\beta_N$ (determined
under the assumption that the experimental therapy retains exactly 100λ% of
the effect of the active control therapy).
Random-Effects Model. Consider a random-effects model for the active con-
trol effect where it is assumed that the same random-effects model holds
for all trials, including in the setting of the non-inferiority trial. For the case
where a random-effects model is used for the effect of the active control,
the same notation will be used as in Section 4.3.3. Thus γ and γ k+1 will be
used in place of βH and βN, respectively. Parameters and random variables
for the non-inferiority trial will be subscripted by k + 1. When the experi-
mental therapy has the same effect as a placebo and the same random-effects
model that applies for the historical studies of the active control also applies
in the non-inferiority trial, we have $\gamma_{k+1} = \gamma + \eta_{k+1}$ and $\hat\gamma_{k+1} = \gamma + \eta_{k+1} + \varepsilon_{k+1}$,
where $\eta_{k+1} \sim N(0,\tau^2)$ and $\varepsilon_{k+1}|\eta_{k+1} \sim N(0,\sigma_{k+1}^2)$ are uncorrelated, γ is the global
mean active control effect across studies, and $\sigma_{k+1}^2 = \mathrm{Var}(\hat\gamma_{k+1}|\gamma_{k+1})$. If σ1,…, σk,
σk+1 and τ2 are known, then $\hat\gamma_{k+1} - \hat\gamma$ has a normal distribution with mean
equal to zero and variance $1\Big/\sum_{i=1}^{k}(\tau^2+\sigma_i^2)^{-1} + \tau^2 + \sigma_{k+1}^2$. In practice, σ1,…,
σk, σk+1, and τ2 are not known, and then a 100(1 – α)% prediction interval for
the observed value of $\hat\gamma_{k+1}$ is given by

$\hat\gamma \pm w_{\alpha/2}\sqrt{1\Big/\sum_{i=1}^{k}(\hat\tau^2+\hat\sigma_i^2)^{-1} + \hat\tau^2 + \hat\sigma_{k+1}^2}$

where $w_{\alpha/2}$ is the 100(1 – α/2) percentile (or an approximation thereof) of
the distribution for $(\hat\gamma_{k+1} - \hat\gamma)\Big/\sqrt{1\Big/\sum_{i=1}^{k}(\hat\tau^2+\hat\sigma_i^2)^{-1} + \hat\tau^2 + \hat\sigma_{k+1}^2}$. Under certain
assumptions or conditions, this standardized quantity may be approximated
by a t distribution. When the experimental therapy retains exactly 100λo% of the
effect of the active control therapy, the corresponding 100(1 – α)% prediction
interval for the observed value of $\hat\beta_N$ is

$(1-\lambda_o)\hat\gamma \pm w_{\alpha/2}\sqrt{(1-\lambda_o)^2\Big/\sum_{i=1}^{k}(\hat\tau^2+\hat\sigma_i^2)^{-1} + (1-\lambda_o)^2\hat\tau^2 + \hat\sigma_{k+1}^2}$

Non-inferiority (i.e., the experimental therapy retains more than 100λo% of the effect
of the active control therapy) is concluded when
$\hat\beta_N < (1-\lambda_o)\hat\gamma - w_{\alpha/2}\sqrt{(1-\lambda_o)^2\Big/\sum_{i=1}^{k}(\hat\tau^2+\hat\sigma_i^2)^{-1} + (1-\lambda_o)^2\hat\tau^2 + \hat\sigma_{k+1}^2}$.
When $w_{\alpha/2} \approx z_{\alpha/2}$, this test procedure can be expressed as comparing a
synthesis-like test statistic to $-z_{\alpha/2}$. The test statistic would be given by

$Z_4 = \frac{\hat\beta_N - (1-\lambda_o)\hat\gamma}{\sqrt{(1-\lambda_o)^2\Big/\sum_{i=1}^{k}(\hat\tau^2+\hat\sigma_i^2)^{-1} + (1-\lambda_o)^2\hat\tau^2 + \hat\sigma_{k+1}^2}}$
The appropriateness of using standard normal critical values may be influ-
enced by the sampling distribution for the estimated active control effect and
by whether the sizing of the non-inferiority trial depended on the estimation
of the active control effect.
μX will factor into the product of the marginal posterior densities. The joint
posterior density for μY and μX is given by

$\frac{h_X(\mu_X)h_Y(\mu_Y)\exp\{(-1/2)[(y-\mu_Y)^2/\sigma_Y^2 + (x-\mu_X)^2/\sigma^2(y)]\}}{\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} h_X(\mu_X)h_Y(\mu_Y)\exp\{(-1/2)[(y-\mu_Y)^2/\sigma_Y^2 + (x-\mu_X)^2/\sigma^2(y)]\}\,\mathrm{d}\mu_Y\,\mathrm{d}\mu_X}$

which factors into the product of

$\frac{h_Y(\mu_Y)\exp\{(-1/2)(\mu_Y-y)^2/\sigma_Y^2\}}{\int_{-\infty}^{\infty} h_Y(\mu_Y)\exp\{(-1/2)(\mu_Y-y)^2/\sigma_Y^2\}\,\mathrm{d}\mu_Y}$

and

$\frac{h_X(\mu_X)\exp\{(-1/2)(\mu_X-x)^2/\sigma^2(y)\}}{\int_{-\infty}^{\infty} h_X(\mu_X)\exp\{(-1/2)(\mu_X-x)^2/\sigma^2(y)\}\,\mathrm{d}\mu_X}$
The outcome is modeled as $y = \chi + \beta x + \gamma z + \varepsilon$, where x and z are treatment
indicators and ε is the random deviation from the mean. The indicator x = 1 if
the treatment is the control therapy; otherwise x = 0. The indicator z = 1 if the
treatment is the experimental therapy; otherwise z = 0. Per Simon’s setup, larger
values of y are better outcomes. The mean outcomes for the experimental therapy,
control therapy, and placebo are χ + γ, χ + β, and χ, respectively. The errors are
assumed to be independent and normally distributed with mean 0 and some
common variance. Let h denote the joint prior density for χ, β, and γ. Then for
the sample means from the non-inferiority trial, $\bar y_E$ and $\bar y_C$, the joint posterior
density satisfies

$g(\chi,\beta,\gamma|\bar y_E,\bar y_C) \propto f_E(\bar y_E|\chi,\gamma)\,f_C(\bar y_C|\chi,\beta)\,h(\chi,\beta,\gamma)$
When χ, β, and γ are modeled with independent prior distributions, h can be
replaced with the product of the marginal prior densities.
When the sample means are modeled as having normal distributions and
independent normal distributions are chosen for the prior distributions of
χ, β, and γ, the joint posterior distribution for (χ, β, γ) is a multivariate nor-
mal distribution. Various posterior probabilities can be determined (a simu-
lation sketch follows the list below). These include, for “positive” or desirable
outcomes (e.g., response):
(a) The probability that the experimental therapy is better than placebo
(i.e., P(γ > 0)).
(b) The probability that the experimental therapy is better than the con-
trol therapy (i.e., P(γ > β)).
(c) The probability that the experimental therapy is better than both the
control therapy and placebo (i.e., P(γ > β and γ > 0)).
(d) From Simon’s paper,14 the probability that the experimental therapy
retains more than 100k% of the control therapy’s effect and the con-
trol therapy is better than placebo (i.e., P(γ – kβ > 0 and β > 0)).
(e) The probability that the experimental therapy retains more than
100k% of the control therapy’s effect and the control therapy is better
than placebo, or the experimental arm is better than both the control
therapy and placebo (i.e., P(γ – kβ > 0 and β > 0) + P(γ > 0 and β < 0)).
Note that the probability statements in (a)–(e) do not involve χ. The inequali-
ties in the probability statements would be reversed for “negative” or unde-
sirable outcomes (e.g., adverse events, time-to-death/overall survival). In (e),
the experimental therapy may have adequate efficacy when the experimental
therapy retains more than some minimal fraction of the effect of the control
therapy when the control therapy is effective, or when the experimental ther-
apy is more effective than both the placebo and the control therapy when the
control therapy is not effective. Additional comments on (e) are given below.
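Probabilities (a) through (e) are easy to approximate by simulation when independent normal posteriors are used for β and γ. The following minimal sketch illustrates this; the posterior means and standard deviations are assumptions for illustration, not values from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Hypothetical independent normal posteriors for the control effect (beta)
# and the experimental effect (gamma), both vs. placebo; desirable outcome.
beta = rng.normal(0.30, 0.10, n)    # control vs. placebo effect
gamma = rng.normal(0.25, 0.12, n)   # experimental vs. placebo effect
k = 0.5                             # retention fraction of interest

p_a = np.mean(gamma > 0)                            # (a) better than placebo
p_b = np.mean(gamma > beta)                         # (b) better than control
p_c = np.mean((gamma > beta) & (gamma > 0))         # (c) better than both
p_d = np.mean((gamma - k * beta > 0) & (beta > 0))  # (d) retention + effective control
p_e = p_d + np.mean((gamma > 0) & (beta < 0))       # (e) (d), or effective when control is not
print(p_a, p_b, p_c, p_d, p_e)
```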
The posterior probabilities in (d) and (e) involve the experimental therapy retaining more than a minimal fraction of the control therapy's effect.
The definitions for the fraction of the control therapy’s effect retained by
the experimental therapy, the retention fraction, require that the control
therapy has a positive effect (β > 0). In such situations, the retention frac-
tion is a measure of the relative efficacy of the experimental therapy ver-
sus the control therapy. Since the parameter space for (γ, β) is –∞ < γ < ∞,
–∞ < β < ∞, and includes possibilities where the effect of the active control is
zero or negative, the retention fraction is not defined for some possible (γ, β).
In general, it can be problematic dealing with new parameters (a function
of the original parameters) that do not exist everywhere over the original/
underlying parameter space. This is particularly true when the estimator of
the new parameter is a function of estimators of the original parameters.
The variance for the original parameters and their sampling distributions
would incorporate possibilities for which the new parameter is not defined.
Inference on the new parameter should consider such issues. Here when
β ≤ 0 (the placebo is as effective or more effective than the control therapy),
the desired possibilities for (γ, β) may be that the experimental therapy has
any efficacy (i.e., γ > 0). When β > 0 and the experimental therapy has some
fixed advantage over placebo, the proportion of the control therapy’s effect
that is retained by the experimental therapy increases without bound as the
effect of the control therapy decreases toward zero. It is thus reasonable for
any fixed γ > 0 and –∞ < a < b < ∞ that the relative efficacy of the experimen-
tal therapy versus the control therapy is larger when β = a than when β = b,
even when a (and possibly also b) is negative. The probability in (e) would
consider any case of (γ, β) where γ > 0 and β ≤ 0 as providing greater relative
efficacy of the experimental therapy versus the control therapy than any case
of (γ, β) where γ > 0 and β > 0.
For undesirable outcomes as was used in the earlier definitions of βN and
βH, the probability statements (a)–(e) are given by
1. P(βN – βH < 0)
2. P(βN < 0)
3. P(βN < 0 and βN – βH < 0)
4. P(βN – (1 – k)βH < 0 and βH > 0)
5. P(βN – (1 – k)βH < 0 and βH > 0) + P(βN – βH < 0 and βH < 0)
Example 5.4
Consider the following hypothetical example for overall survival. The prior distribution for βH, the placebo versus control therapy log-hazard ratio, is modeled as a normal distribution with mean 0.2 and standard deviation 0.1. On the basis of a noninformative prior distribution and the study results comparing the experimental and control arms in the non-inferiority trial, the posterior distribution for βN, the experimental versus control log-hazard ratio, is modeled as a normal distribution with mean –0.10 and standard deviation 0.08. Then we have the following probabilities:

The probability that the experimental therapy is better than placebo: P(βN – βH < 0) = 0.990
The probability that the experimental therapy is better than the control therapy: P(βN < 0) = 0.894
The probability that the experimental therapy is better than both the control therapy and placebo: P(βN < 0 and βN – βH < 0) = 0.891
The probability that the experimental therapy retains more than 50% of the control therapy's effect and the control therapy is better than placebo: P(βN – βH/2 < 0 and βH > 0) = 0.964
The probability that the experimental therapy retains more than 50% of the control therapy's effect and the control therapy is better than placebo, or the experimental arm is better than both the control therapy and placebo: P(βN – βH/2 < 0 and βH > 0) + P(βN – βH < 0 and βH < 0) = 0.981

Additionally requiring that the control therapy is better than placebo, we have for λ = 1 – βN/βH that 0.95 = P(0.614 < λ < 9.90, βH > 0). Thus 0.614–9.90 is a 95% credible interval for λ, the proportion of the control therapy's effect that is retained by the experimental therapy, when additionally requiring that the control therapy has an effect.
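The probabilities in Example 5.4, and the joint probability behind the 0.614–9.90 credible interval, can be checked with a short Monte Carlo sketch under the stated normal models (an illustration, not the authors' computation):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2_000_000

betaH = rng.normal(0.20, 0.10, n)   # placebo vs. control log-hazard ratio
betaN = rng.normal(-0.10, 0.08, n)  # experimental vs. control log-hazard ratio

print(np.mean(betaN - betaH < 0))                         # ~0.990
print(np.mean(betaN < 0))                                 # ~0.894
print(np.mean((betaN < 0) & (betaN - betaH < 0)))         # ~0.891
print(np.mean((betaN - betaH / 2 < 0) & (betaH > 0)))     # ~0.964
print(np.mean((betaN - betaH / 2 < 0) & (betaH > 0))
      + np.mean((betaN - betaH < 0) & (betaH < 0)))       # ~0.981

# Joint probability behind the 0.614-9.90 credible interval for the
# retention fraction lambda = 1 - betaN/betaH (division warnings near
# betaH = 0 are harmless here).
lam = 1 - betaN / betaH
print(np.mean((betaH > 0) & (lam > 0.614) & (lam < 9.90)))  # ~0.95
```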
There are six possible orderings for the effects of the experimental therapy,
active control, and placebo. Table 5.2 provides the posterior probability for each
possible ordering on overall survival of the experimental therapy, control therapy,
and placebo. There is only a 0.005 posterior probability that the placebo is better
than both the active control and the experimental therapy. The bulk of the prob-
ability, 0.973, corresponds with the orderings of E > C > P and C > E > P.
Effect Retention Likelihood Plot. As a graphical tool for assessing the relative
efficacy of an experimental therapy to the active control therapy, Carroll26
proposed the use of an effect retention likelihood plot, which plots the pos-
terior probability that the experimental therapy retains more than a given
retention fraction against that given retention fraction between 0 (i.e., indirect
superiority to placebo) and 1 (i.e., superiority to the active control). According
to Carroll, the use of an effect retention likelihood plot would be part of
a stepwise approach where first the non-inferiority trial would be sized to
indirectly demonstrate that the experimental therapy is better than placebo;
when the data are analyzed, the posterior probability that the experimental
therapy is superior to placebo is determined, and if sufficiently high, then
the relative efficacy of the experimental therapy to the control therapy is
assessed using the effect retention likelihood plot.
Analogous plots to the effect retention likelihood plot can also be con-
structed of the posterior probability that the difference in effects of the
experimental therapy and the active control therapy (or the indirect effect
of the experimental therapy versus placebo) is greater than any prespecified
value. Additionally, when noninformative prior distributions are used for
the effects, the posterior probabilities will equal or approximately equal 1
minus the corresponding one-sided p-value. Therefore the one-sided p-values
can be substituted for the corresponding posterior probabilities in such plots.
Example 5.5 gives a modified version of Carroll's effect retention likelihood plot for approach 2b in Examples 4.3, 5.2, and 5.3.
TABLE 5.2
Posterior Probability for Each Possible Ordering

Order^a      Probability Statement         Posterior Probability
E > C > P    P(βN < 0, βH > 0)             0.874
E > P > C    P(βN – βH < 0, βH < 0)        0.017
C > E > P    P(βN > 0, βN – βH < 0)        0.099
C > P > E    P(βH > 0, βN – βH > 0)        0.004
P > E > C    P(βN – βH > 0, βN < 0)        0.003
P > C > E    P(βH < 0, βN > 0)             0.002
^a The ">" sign represents "better than" or "superior to."
FIGURE 5.1
Probability that the true effect retention exceeds a given value between 0 and 1. (The plot shows the posterior probability, ranging from about 0.5 to 0.9, against the retention fraction from 0.0 to 1.0.)
Example 5.5
We will revisit the previous example involving pemetrexed, docetaxel, and BSC in NSCLC based on approach 2b. Consider noninformative prior distributions on the pemetrexed versus docetaxel log-hazard ratio, βN, and the BSC versus docetaxel log-hazard ratio, βH. Then βN has a normal posterior distribution with mean –0.008 and standard deviation 0.099, and βH has an independent normal posterior distribution with mean 0.172 and standard deviation 0.095. Let Λ = 1 – βN/βH. Figure 5.1 provides a plot of P(Λ > λ, βH > 0) + P(βN – βH < 0, βH < 0) versus λ, which is a modified version of Carroll's effect retention likelihood plot. For λ = 0, 0.25, 0.5, 0.75, and 1, the respective probability is 0.905, 0.868, 0.803, 0.690, and 0.526.
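A sketch of how the plotted curve in Figure 5.1 can be computed by simulation under the stated posterior distributions (illustrative, not the authors' code):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1_000_000

betaN = rng.normal(-0.008, 0.099, n)  # pemetrexed vs. docetaxel log HR
betaH = rng.normal(0.172, 0.095, n)   # BSC vs. docetaxel log HR
Lam = 1 - betaN / betaH

for lam in (0.0, 0.25, 0.5, 0.75, 1.0):
    p = (np.mean((Lam > lam) & (betaH > 0))
         + np.mean((betaN - betaH < 0) & (betaH < 0)))
    print(lam, round(p, 3))  # ~0.905, 0.868, 0.803, 0.690, 0.526
```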
5.3.7 Application
Example 5.6 revisits the non-inferiority comparison of pemetrexed ver-
sus docetaxel discussed in Example 5.3 using all approaches discussed in
Example 4.3 in Section 4.2.4. This allows for a comparison of the results from
each approach in estimating the docetaxel effect. Additionally, for approaches 4a and 4b, which involve a nonnormal sampling distribution for the estimated docetaxel effect, a Bayesian analysis will be done.
Example 5.6
For each approach discussed in Section 4.2.4, Table 5.3 provides the estimates
and 95% confidence interval for the indirect pemetrexed versus BSC hazard ratio
along with the one-sided p-value for indirectly testing that pemetrexed is superior
to BSC and the one-sided p-value for testing that pemetrexed is noninferior to doc-
etaxel at 75 mg/m2 (D75) by retaining more than 50% of the effect of D75 versus
BSC. These calculations are based on the “constancy assumption” that the effects
are constant across trials. The indirect estimate of the pemetrexed versus BSC
TABLE 5.3
Estimates, Confidence Intervals, and p-Values of Pemetrexed versus BSC Hazard Ratio by Approach

                Pemetrexed vs. BSC Hazard Ratio                      One-Sided p-Value
Approach   Estimate   95% Confidence/Credible Interval   Pemetrexed Better than BSC   50% Retention
1          0.555      (0.337, 0.916)                     0.011                        0.026
2a         0.737      (0.510, 1.067)                     0.052                        0.110
2b         0.835      (0.638, 1.093)                     0.095                        0.195
3          0.650      (0.438, 0.963)                     0.016                        0.048
4a         0.670      (0.485, 0.974)                     0.019                        0.053
4b         0.699      (0.499, 1.027)                     0.033                        0.077
log-hazard ratio equals the pemetrexed versus D75 estimate from the JMEI study
plus the estimated D75 versus BSC log-hazard ratio from the particular approach
(hazard ratios provided in Table 4.3). For approaches 1, 2a, 2b, and 3, the stan-
dard error for the indirect log-hazard ratio estimator is the square root of the sum
of the variances. The corresponding p-values for approaches 1, 2a, 2b, and 3 are
based on synthesis test statistics. Results based on approach 2b are provided in
Example 5.3.
For approaches 4a and 4b, simulations were performed to determine the indi-
rect estimate and 95% credible interval for the pemetrexed versus BSC hazard
ratio and to determine the posterior probabilities for the respective one-sided null
hypotheses in testing that pemetrexed is superior to BSC and that pemetrexed is
noninferior to D75.
As in Example 4.3 in Section 4.2.4, let β denote the common true D75/D100 ver-
sus BSC/V/I log-hazard ratio (β = – β H). Also, define β̂1, β̂ 2 , β̂3 , and β̂ 4 , and (Z1, Z2),
Z3, Z4 as in Example 4.3. Let βˆ = min{βˆ1, βˆ 2 , βˆ3 , βˆ 4 } denote the minimum observed
D75/D100 versus BSC/V/I log-hazard ratio (maximum observed effect) and let
W = min{0.138Z1, 0.138Z2, 0.221Z3, 0.235Z4}. Then β̂ = β + W . Thus, β = βˆ − W ,
and because the distribution of W does not depend on the value of β, it makes
sense that the posterior distribution of β is the distribution of y –W given that β̂ = y .
This will be true when a flat, improper prior distribution is selected for β.
Let f W (·) denote the density for W and f(·|β) denote the density for β̂ given the
true value β. Then for –∞ < y < ∞, f(y|β) = f W (y – β). For a flat, improper prior
distribution for β (i.e., the “density” equals a positive constant over the parameter
space), the posterior density for β is given by g(β|y) = f W (y – β) for –∞ < β < ∞.
Thus, given the value of y for β̂ , the posterior distribution of β is simulated through
random values of W, where for each replication β = βˆ − W is calculated. On the
basis of the results of the JMEI study, the posterior distribution for βN, the pem-
etrexed versus D75 log-hazard ratio is modeled as having a normal distribution
with mean ln 0.992 and standard deviation 0.099. The pemetrexed versus BSC
log-hazard ratio is equal to β + βN (i.e., βN – β H). On the basis of 100,000 simula-
tions, the posterior distribution of β + βN has mean –0.400 (0.670 = exp(–0.400))
with 2.5th and 97.5th percentiles of –0.724 (0.485 = exp(–0.724)) and –0.026
(0.974 = exp(–0.026)), respectively. Zero was the 98.1st percentile of the posterior
distribution of β + βN, which leads to the “p-value” (i.e., the posterior probability
that pemetrexed is inferior to BSC) of 0.019 = 1 – 0.981. The retention fraction is
given by λ = 1 + βN/β when β < 0. Among the 100,000 replications, 94.7% had
λ > 0.5 and β < 0, or β + βN < 0 and β > 0. The posterior probability of the comple-
ment event of 5.3% is used in Table 5.3 as a one-sided p-value for testing for more
than 50% retention of the docetaxel effect.
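The following sketch illustrates the approach 4a simulation. Two assumptions are made for illustration: the observed minimum β̂ is a hypothetical placeholder (the actual value comes from Example 4.3), and Z1,…,Z4 are drawn as independent standard normals even though (Z1, Z2) are dependent in the actual example. Conditioning as in approach 4b would simply restrict to the replications where the fourth component attains the minimum.

```python
import numpy as np

rng = np.random.default_rng(3)
n_rep = 100_000

# The dependence between Z1 and Z2 from Example 4.3 is not reproduced here;
# all four statistics are drawn as independent standard normals.
scales = np.array([0.138, 0.138, 0.221, 0.235])
W = (scales * rng.standard_normal((n_rep, 4))).min(axis=1)

beta_hat = -0.35       # hypothetical placeholder for the observed minimum
beta = beta_hat - W    # posterior draws of beta under a flat, improper prior

# Posterior for the pemetrexed vs. D75 log-hazard ratio (JMEI study)
betaN = rng.normal(np.log(0.992), 0.099, n_rep)
total = beta + betaN   # indirect pemetrexed vs. BSC log-hazard ratio
print(np.exp(total.mean()), np.exp(np.quantile(total, [0.025, 0.975])))
print((total < 0).mean())  # posterior probability pemetrexed is better than BSC
```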
For approach 4b, 31,899 of the 100,000 replications had β̂ 4 = min{βˆ1, βˆ 2 , βˆ3 , βˆ 4 } .
Conditioning on β̂ 4 = min{βˆ1, βˆ 2 , βˆ3 , βˆ 4 } , the posterior distribution of β + βN has a
mean of –0.358 (0.699 = exp(–0.358)) with 2.5th and 97.5th percentiles of –0.695
(0.499 = exp(–0.695)) and 0.026 (1.027 = exp(0.026)), respectively. Zero was the
96.7th percentile of the posterior distribution of β + βN, which leads to a one-sided
p-value of 0.033 = 1 – 0.967. Among the 31,899 replications, 92.3% had λ > 0.5
and β < 0, or β + βN < 0 and β > 0. The posterior probability of the complement
event of 7.7% is used in Table 5.3 as a one-sided p-value for testing for more than
50% retention of the docetaxel effect.
In all, three of the six approaches provided one-study evidence that pemetrexed is more effective than BSC by having one-sided p-values less than 0.025. The one-sided p-values for each approach were greater than 0.025 for testing that pemetrexed retained more than 50% of the docetaxel effect.
$$n = 4\sigma^2/\sigma_N^2 \qquad (5.16)$$

$$r = 4/\sigma_N^2 \qquad (5.17)$$
Example 5.7
Consider a continuous variable where smaller values are more desirable and the
estimated mean difference between placebo and the active control is 4.5 with a
corresponding standard error of 0.6. An experimental therapy is required to dem-
onstrate better than 60% retention of the active control effect. For the standard
synthesis method or a Bayesian synthesis method, Table 5.4 provides the solutions
for σ N in Equation 5.15 and the overall sample size for a one-to-one randomization
based on Equation 5.16 when 90% power is desired for βN,a = –0.5, 0, or 0.5, pos-
sibilities where the experimental therapy is slightly more effective than the active
control, has the same effect as the active control, or is slightly less effective than
the active control, respectively; α = 0.05 and it is assumed that the population
variance in each arm is 100. For βN,a = –0.5, the solution to Equation 5.15 is σN = 0.685. Then applying Equation 5.16 gives a total sample size of n = 4(100)/(0.685)² ≈ 853.
TABLE 5.4
Sample Sizes by Assumed Mean Differences for Experimental and Active Control
Arms for 90% Power
Assumed Mean Difference in Non-Inferiority Trial (Exper.–Control)   Standard Error in Non-Inferiority Trial   Sample Size
–0.5 0.685 853
0 0.524 1459
0.5 0.357 3142
TABLE 5.5
Event Sizes by Assumed Experimental versus Active Control Hazard
Ratios for 80% Power
Assumed Hazard Ratio (Exper./Control)   Standard Error in Non-Inferiority Trial   Event Size
0.9 0.0885 511
1 0.0443 2037
1.1 No positive solution No event size can
provide 80% power
Example 5.8
Consider a time-to-event variable where longer values are more desirable. The
placebo versus active control hazard ratio is 1.40 with a corresponding standard
error for the log-hazard ratio of 0.1. The experimental therapy is required to dem-
onstrate better than 50% retention of the active control effect. For the standard
synthesis method or a Bayesian synthesis method, Table 5.5 provides the solutions
for σ N in Equation 5.15 and the overall required number of events for a one-to-
one randomization based on Equation 5.17 when 80% power is desired for βN,a =
ln 0.9, 0, or ln 1.1, possibilities where the experimental therapy has a 10% lower
instantaneous risk of an event than the active control, has the same instantaneous
risk as the active control, or has a 10% greater instantaneous risk of an event than
the active control, respectively (where α = 0.05). For βN,a = ln 0.9, the solution to Equation 5.15 is σN = 0.0885. Then applying Equation 5.17 gives a total of r = 4/(0.0885)² ≈ 511 events.
In Example 5.8, note that 80% power cannot be achieved at βN,a = ln 1.1, regardless of the sample size. This is attributable to the estimated active control effect, with its fixed, nonzero standard error, being known beforehand, whereas powering a comparison involving a difference of two parameters is usually based on an assumed difference for those two parameters where the standard error for the estimated difference can be chosen as any positive value. Here the known standard error for (1 – λo)β̂H establishes a positive lower bound for the standard error of β̂N – (1 – λo)β̂H.
The power at βN,a = ln 1.1 is maximized at approximately 9.5% for 441 events. When conditioned on the estimated active control effect and its corresponding standard error, the power for a given βN,a need not be monotone in the sample/event size.
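Given a solution σN of Equation 5.15, Equations 5.16 and 5.17 are direct to apply. A minimal sketch follows, using the rounded σN values from Tables 5.4 and 5.5; because of that rounding, some results differ slightly from the tabled values.

```python
import math

def total_sample_size(sigma2, sigma_N):
    """Equation 5.16: total N for a 1:1 randomized continuous endpoint."""
    return math.ceil(4 * sigma2 / sigma_N**2)

def total_events(sigma_N):
    """Equation 5.17: total events for a 1:1 randomized time-to-event endpoint."""
    return math.ceil(4 / sigma_N**2)

# sigma_N values as rounded in Tables 5.4 and 5.5; compare with the tabled
# sample sizes 853, 1459, 3142 and event sizes 511, 2037.
for s in (0.685, 0.524, 0.357):
    print(s, total_sample_size(100, s))
for s in (0.0885, 0.0443):
    print(s, total_events(s))
```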
For a one-to-one randomization when βN,a < (1 – λo)(β̂H – zα/2sH), Equations 5.16 and 5.17 will apply for determining the sample size for a continuous variable and the event size for a time-to-event variable, respectively. For βN,a < (1 – λo)(β̂H – zα/2sH), the sample size or event size can be determined that provides a given power greater than α/2. For βN,a > (1 – λo)(β̂H – zα/2sH), the power will always be less than α/2. Again, as previously stated, the term "power" in this context is truly "conditional power." As (1 – λo)(β̂H – zα/2sH) is an already observed value, βN < (1 – λo)(β̂H – zα/2sH) does not reflect an alternative hypothesis that specifies when the experimental therapy has acceptable efficacy. Likewise, for the standard synthesis method, whether the conditional power is greater than or less than α/2 does not necessarily correspond to exactly when the experimental therapy has acceptable efficacy.
$$Z_4 = \frac{\hat{\beta}_N - (1-\lambda_o)\hat{\beta}_H}{s_N + (1-\lambda_o)s_H} = Z_2/Q < -z_{\alpha/2} \qquad (5.19)$$

where

$$Q = \frac{s_N + (1-\lambda_o)s_H}{\sqrt{s_N^2 + (1-\lambda_o)^2 s_H^2}}$$

and Z2 is the standard synthesis method test statistic given in Equation 5.12. The value for Q is necessarily greater than or equal to 1, which follows from the triangle inequality applied to a right triangle having for the lengths of its legs sN and (1 – λo)sH, and thus having $\sqrt{s_N^2 + (1-\lambda_o)^2 s_H^2}$ for the length of the hypotenuse (see Figure 5.2). As the sum of the lengths of the legs is greater than the length of the hypotenuse, we have
FIGURE 5.2
Right triangle representation of standard errors, with legs of lengths sN and (1 – λo)sH.
that Q ≥ 1 (equality holding only when a given leg has length zero). The largest possible value for Q is $\sqrt{2} \approx 1.414$, which occurs when sN = (1 – λo)sH. Wiens28 expressed the ratio Q as a function of λo when sN = sH that is equivalent to

$$t(\lambda_o) = \frac{2-\lambda_o}{\sqrt{1+(1-\lambda_o)^2}}$$
$$Z_5 = \frac{\hat{\beta}_N - (1-\lambda_o)\hat{\beta}_H}{s_N} = R_{PE} \times Z_2 < -z_{\alpha/2} \qquad (5.20)$$

where $R_{PE} = \sqrt{1 + (1-\lambda_o)^2 s_H^2/s_N^2}$. The factor RPE increases from 1 to ∞ as $s_H^2/s_N^2$ increases from 0 to ∞.
Recall that the Hasselblad and Kong method concludes non-inferiority when

$$Z_3 = \frac{\hat{\beta}_N - (1-\lambda_o)\hat{\beta}_H}{\sqrt{s_N^2 + (\hat{\beta}_N/\hat{\beta}_H)^2 s_H^2}} = R_{HK} \times Z_2 < -z_{\alpha/2} \qquad (5.21)$$

where

$$R_{HK} = \sqrt{\frac{s_N^2 + (1-\lambda_o)^2 s_H^2}{s_N^2 + (\hat{\beta}_N/\hat{\beta}_H)^2 s_H^2}}$$

We see from Equation 5.21 that when βN = (1 – λo)βH and Var(β̂N/β̂H) is small, RHK ≈ 1 in distribution and Z3 will have an approximate standard normal distribution when Z2 has a standard normal distribution.
$$\hat{\beta}_N + z_{\alpha/2}s_N < (1-\lambda_o)\hat{\beta}_H - z_{\alpha/2}\left(\sqrt{s_N^2 + (1-\lambda_o)^2 s_H^2} - s_N\right) \qquad (5.22)$$

The right-hand side of Equation 5.22 is larger than the right-hand side of Equation 5.23 by

$$z_{\alpha/2}\left(s_N + (1-\lambda_o)s_H - \sqrt{s_N^2 + (1-\lambda_o)^2 s_H^2}\right) \qquad (5.24)$$

Expression 5.24, which can also be found in the Appendix of Fleming,29 can be thought of as the adjustment made by the two–confidence interval method compared with the synthesis method. Such an adjustment may be in hopes of addressing any bias in the estimation of the active control effect and any deviation from constancy. In terms of S = (1 – λo)sH, Expression 5.24 will be between 0 and zα/2S, which occur at sN = 0 and as sN → ∞, respectively.
Unified Test Statistic. For indirectly testing that the experimental therapy is
effective (i.e., more effective than placebo), Snapinn and Jiang16 compared the
power of synthesis and two–confidence interval methods through a unified
approach where the test statistic is given by
$$U(w,v) = \frac{\hat{\beta}_N - (1-w)\hat{\beta}_H}{\sqrt{\mathrm{Var}(\hat{\beta}_N) + (1-w)^2\,\mathrm{Var}(\hat{\beta}_H) + 2v(1-w)\sqrt{\mathrm{Var}(\hat{\beta}_N)\,\mathrm{Var}(\hat{\beta}_H)}}} \qquad (5.25)$$

where v ≥ 0 is a variance inflation factor and 0 ≤ w ≤ 1 is a discounting factor.
For a given (w, v), non-inferiority is concluded when U(w, v) < –1.96. We see
from Equation 5.25 that the standard synthesis statistic given in Equation
5.12 equals U(λo, 0), the two–confidence interval equivalent test statistic
in Equation 5.19 equals U(λo, 1), and U(1,v) provides the test statistic for a
superiority test of the experimental therapy to the active control, regardless
of the value of v. Snapinn and Jiang noted that failing to account for viola-
tions of assay sensitivity and constancy can lead to an inflated type I error
rate, which increases the risk of claiming an ineffective therapy as effective.
Departures from assay sensitivity are given by the amount a = E(β̂N) – βN,ideal and departures from constancy by the amount c = E(β̂H) – βC,P,N, where βN,ideal is the true treatment difference between the active control and the experimental arms under ideal trial situations and βC,P,N is the actual effect of
the active control in the non-inferiority trial. For a given pair of values for the
departures from assay sensitivity and constancy (a, c) and β = βH = βN and
fixed values for the variances of β̂ N and β̂ H , Snapinn and Jiang determined
the values wS and wF so that the calibrated synthesis test statistic U(wS, 0) and
the calibrated fixed margin approach test statistic U(wF, 1) maintain a 0.025
type I error rate. For various cases studied involving departures from assay
sensitivity and constancy, Snapinn and Jiang found that the calibrated syn-
thesis method based on the statistic U(wS, 0) had greater power than the cali-
brated fixed-margin approach based on U(wF, 1). The difference in power (or
in the determined sample sizes) became more profound as Var(βˆ H )/Var(βˆ N )
increased.
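A sketch of the unified statistic of Equation 5.25 and its special cases; the numerical inputs below are illustrative assumptions, not values from the text.

```python
import math

def U(betaN_hat, betaH_hat, varN, varH, w, v):
    """Unified test statistic of Equation 5.25."""
    num = betaN_hat - (1 - w) * betaH_hat
    den = math.sqrt(varN + (1 - w)**2 * varH
                    + 2 * v * (1 - w) * math.sqrt(varN * varH))
    return num / den

# Illustrative inputs (assumptions, not values from the text)
bN, bH, vN, vH, lam_o = -0.10, 0.30, 0.01, 0.01, 0.5
print(U(bN, bH, vN, vH, lam_o, 0))  # standard synthesis statistic (Eq. 5.12)
print(U(bN, bH, vN, vH, lam_o, 1))  # two-CI equivalent statistic (Eq. 5.19)
print(U(bN, bH, vN, vH, 1, 0))      # superiority vs. the active control
# Non-inferiority is concluded when the chosen U(w, v) < -1.96.
```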
$$z_{\alpha_N/2} = z_{\alpha/2}\sqrt{k} \quad\text{and}\quad z_{\alpha_H/2} = z_{\alpha/2}\sqrt{1-k} \qquad (5.26)$$
TABLE 5.6
zαN/2 and zαH/2 Based on Equation 5.26

       α/2 = 0.0005        α/2 = 0.005         α/2 = 0.025
k      zαN/2    zαH/2      zαN/2    zαH/2      zαN/2    zαH/2
1/2 2.327 2.327 1.822 1.822 1.386 1.386
2/3 2.687 1.900 2.103 1.487 1.600 1.132
3/4 2.850 1.645 2.231 1.288 1.697 0.980
1 3.291 0 2.576 0 1.960 0
$$z_{\alpha/2} = z_{\alpha_N/2}\sqrt{k} + z_{\alpha_H/2}\sqrt{1-k} \qquad (5.27)$$
For k = 1/2, 2/3, 3/4, and 1, Table 5.7 gives the values for zα/2 used in the standard synthesis test and the one-sided type I error rates (α/2) under the constancy assumption for the 95–95 approach and a 95–80 approach based on Equation 5.27. When k = 1/2, a two 95% confidence interval approach is equivalent to a standard synthesis method that targets a one-sided type I error rate of 0.0028. For the 95–80 method, zα/2 has an umbrella shape in k with its maximum at k = 0.7004 of z0.0096 = 2.342. For 0.1606 < k < 1, we have zα/2 > 1.96, and for k < 0.1606, we have zα/2 < 1.96. For k = 0, 1/4, and 1/3, the
TABLE 5.7
zα/2 Values Used in Standard Synthesis Test and One-Sided Type I Error Rates
αN/2 = αH/2 = 0.025 αN/2 = 0.025, αH/2 = 0.10
values for zα/2 and α/2 are the same as those in Table 5.7 for k = 1, 3/4, and
2/3, respectively. Rothmann et al.1 provided a graph of the type I error for the
95–95 approach by the ratio of the standard deviations (i.e., (1 – λo)sH/sN).
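Equations 5.26 and 5.27 are straightforward to evaluate. The sketch below reproduces the implied one-sided level of about 0.0028 for the 95–95 approach at k = 1/2 and the umbrella maximum of the 95–80 approach:

```python
from math import sqrt
from statistics import NormalDist

nd = NormalDist()

def z_synthesis(zN, zH, k):
    """Equation 5.27: synthesis critical value implied by a two-CI procedure."""
    return zN * sqrt(k) + zH * sqrt(1 - k)

# 95-95 approach at k = 1/2: implied one-sided type I error rate ~0.0028
z = z_synthesis(1.96, 1.96, 0.5)
print(z, 1 - nd.cdf(z))

# 95-80 approach (zH = z_0.10): umbrella shape in k, maximum near k = 0.7004
zH = nd.inv_cdf(0.90)
best = max(((k / 1000, z_synthesis(1.96, zH, k / 1000)) for k in range(1, 1000)),
           key=lambda t: t[1])
print(best, 1 - nd.cdf(best[1]))  # ~ (0.700, 2.342), one-sided level ~0.0096
```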
For a one-to-one randomization and a time-to-event endpoint, Equation 10
of Rothmann et al.1 gives the non-inferiority threshold for a 100(1 – α N)% con-
fidence interval of βN that is consistent with the standard synthesis method
when α N/2 = α/2. Non-inferiority is concluded at a targeted level of α/2
for the standard synthesis method when the upper limit of the two-sided
100(1 – α)% confidence interval for βN is less than

$$(1-\lambda_o)\left(\hat{\beta}_H - z_{\alpha/2}\,\frac{1-\sqrt{k}}{\sqrt{1-k}}\,s_H\right) \qquad (5.28)$$

where β̂H is the placebo versus active control estimate of the active control effect. For a two–confidence interval approach, the use of the threshold in Equation 5.28 is equivalent to choosing zαH/2 (αH/2) by

$$z_{\alpha_H/2} = z_{\alpha/2}\,\frac{1-\sqrt{k}}{\sqrt{1-k}} = z_{\alpha/2}\sqrt{\frac{1-\sqrt{k}}{1+\sqrt{k}}} \qquad (5.29)$$
For k = 0, 1/4, 1/3, 1/2, 2/3, 3/4, and 1, Table 5.8 gives the values for zα H /2 and
α H/2 for a two–confidence interval procedure where α N/2 = α/2, which is
equivalent to the standard synthesis method with critical value –zα/2, which
targets a one-sided type I error rate of α/2 under the constancy assump-
tion. When k = 1/2 and α/2 = 0.025, an approach that compares a two-sided
95% confidence interval for βN to a two-sided 58.3% confidence interval for
(1 – λo) βH is equivalent to the standard synthesis test with critical value –1.96
(α/2 = 0.025).
When zαH/2 = 0, the two–confidence interval method is referred to as the "point estimate method." The point estimate method would compare a confidence interval for βN from the non-inferiority trial with (1 – λo)β̂H.

TABLE 5.8
zαH/2 and αH/2 Based on Equation 5.29

       α/2 = 0.0005        α/2 = 0.005         α/2 = 0.025

For the point estimate method, by Equation 5.27,

$$z_{\alpha/2} = z_{\alpha_N/2}\sqrt{k} \qquad (5.30)$$

If instead a common level is desired for the two confidence intervals (αN/2 = αH/2), Equation 5.27 yields

$$z_{\alpha_N/2} = z_{\alpha_H/2} = z_{\alpha/2}\big/\big(\sqrt{k} + \sqrt{1-k}\big) \qquad (5.31)$$
For k = 1/2, 2/3, 3/4, and 1, and α/2 = 0.0005, 0.005, and 0.025, Table 5.9 gives
the common values for zα N /2 = zα H /2 and α N = α H based on Equation 5.31 that
yield two–confidence interval procedures equivalent to the standard synthe-
sis test with critical value –zα/2. For k = 1/2 and α/2 = 0.025, zα N /2 = zα H /2 = 1.386
and α N/2 = α H/2 = 0.0829, corresponding to a two 83.4% confidence interval
approach being equivalent to the standard synthesis method that targets a
one-sided type I error rate of 0.025. As the right-hand side of Equation 5.31 is
symmetric in k, the values for zα/2 and α/2 for k = 0, 1/4, and 1/3 are the same
as the values in Table 5.9 for k = 1, 3/4, and 2/3, respectively.
If it is desired to have equal-length confidence intervals, then for 0 < k < 1,

$$z_{\alpha_N/2} = \frac{z_{\alpha/2}}{2\sqrt{k}} \quad\text{and}\quad z_{\alpha_H/2} = \frac{z_{\alpha/2}}{2\sqrt{1-k}} \qquad (5.32)$$
For k = 1/2, 2/3, 3/4, and k → 1, and α/2 = 0.0005, 0.005, and 0.025, Table
5.10 gives the values for zα N /2 and zα H /2 (α N and α H) based on Equation 5.32
that provide an equal-length two confidence interval procedure equivalent
to the standard synthesis test with critical value –zα/2. For k = 3/4 and α/2 = 0.025, a 74.2–95 two–confidence interval procedure is based on equal-length confidence intervals.
TABLE 5.9
Values for zα N /2 = zα H /2 and α N = α H Based on Equation 5.31
α/2 = 0.0005 α/2 = 0.005 α/2 = 0.025
k zα N / 2 = zα H / 2 (αN/2 = αH/2) zα N / 2 = zα H / 2 (αN/2 = αH/2) zα N / 2 = zα H / 2 (αN/2 = αH/2)
1/2 2.327 (0.0100) 1.822 (0.0343) 1.386 (0.0829)
2/3 2.361 (0.0091) 1.848 (0.0323) 1.406 (0.0798)
3/4 2.409 (0.0080) 1.886 (0.0297) 1.435 (0.0757)
1 3.291 (0.0005) 2.576 (0.0050) 1.960 (0.0250)
TABLE 5.10
Values for zα N /2 , zα H /2 , α N/2 and α H/2 Based on Equation 5.32
α/2= 0.0005 α/2 = 0.005 α/2 = 0.025
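The calibrations in Equations 5.29, 5.31, and 5.32 (the last as reconstructed above) can be computed directly; a minimal sketch that reproduces several of the values cited for Tables 5.8 through 5.10:

```python
from math import sqrt
from statistics import NormalDist

nd = NormalDist()
z = nd.inv_cdf(0.975)  # 1.96 for alpha/2 = 0.025

for k in (0.5, 2 / 3, 0.75):
    zH_529 = z * (1 - sqrt(k)) / sqrt(1 - k)  # Equation 5.29 (alphaN/2 = alpha/2)
    z_531 = z / (sqrt(k) + sqrt(1 - k))       # Equation 5.31 (common alphas)
    zN_532 = z / (2 * sqrt(k))                # Equation 5.32 (equal lengths)
    zH_532 = z / (2 * sqrt(1 - k))
    print(k, round(zH_529, 3), round(z_531, 3), round(zN_532, 3), round(zH_532, 3))

# k = 1/2: zH_529 ~ 0.812 (a 58.3% two-sided CI for the control effect) and
# z_531 ~ 1.386 (two 83.4% CIs); k = 3/4: (zN_532, zH_532) ~ (1.132, 1.96),
# the 74.2-95 equal-length procedure.
```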
$$P\big(U < l(\hat{\beta}_H)\,\big|\,H_o{:}\,\theta = 0\big) = \int P\big(U < l(\beta_H)\,\big|\,H_o{:}\,\theta = 0;\ \beta_H\big)\,v(\beta_H)\,\mathrm{d}\beta_H$$
Odem-Davis24 compared across-trials type I error rates for the standard syn-
thesis method, the standard synthesis method with various discounting,
the 95–95 two–confidence interval method, and various bias-only adjusted
synthesis methods for cases where the effect of the active control in the non-
inferiority trial is smaller than the historical effect by some known fraction.
The 95–95 two–confidence interval method has a type I error rate less than
the analogous synthesis method. Odem-Davis observed that the type I error
rate for the 95–95 two–confidence interval method was more sensitive to
changes in the historical variance. Additionally, when the historical variance
is small and the estimator of the active control effect in the setting of the non-
inferiority trial has a small bias favoring overestimating the true effect, the
95–95 two–confidence interval method may have a type I error rate greater
than that of a synthesis method that accounts for the bias. A synthesis method
that does not closely account for the bias would be even more likely to lead to
a false positive than the 95–95 two–confidence interval method.
Wang, Hung, and Tsong3 examined the impact of deviations from con-
stancy on a range of procedures. Included are: (a) a synthesis method
indirectly testing superiority of the experimental therapy to placebo, (b) a
method that uses a non-inferiority margin or a random threshold based on
50% of the estimated control effect, (c) a random threshold based on 50% of
the lower limit or the 95% confidence interval for the control effect, (d) use
of a non-inferiority margin or a random threshold based on 20% of the
estimated control effect, (e) a random threshold based on 20% of the lower
limit of the 95% confidence interval for the control effect. For the meth-
ods based on random thresholds, non-inferiority would be concluded if the
95% confidence interval for the difference in the effects of the experimental
therapy and the active control lies entirely below the random threshold.
The method in (a) had the largest type I error rates, then (b), (c), (d), and (e)
in that order.
When there is a deviation in the effect of the active control in the non-inferiority trial relative to the historical effect of the active control by a and the estimator of the historical effect of the active control is normally distributed, the approximate type I error rate is given by (from Equations 5.12, 5.19, and 5.20, respectively)

$$\Phi\!\left(-z_{\alpha/2} + \frac{(1-\lambda_o)a}{\sqrt{\mathrm{Var}(\hat{\beta}_N) + (1-\lambda_o)^2\mathrm{Var}(\hat{\beta}_H)}}\right)$$

for the standard synthesis method,

$$\Phi\!\left(-z_{\alpha/2}Q + \frac{(1-\lambda_o)a}{\sqrt{\mathrm{Var}(\hat{\beta}_N) + (1-\lambda_o)^2\mathrm{Var}(\hat{\beta}_H)}}\right)$$

for a two 100(1 – α)% confidence interval method, and

$$\Phi\!\left(-z_{\alpha/2}/R_{PE} + \frac{(1-\lambda_o)a}{\sqrt{\mathrm{Var}(\hat{\beta}_N) + (1-\lambda_o)^2\mathrm{Var}(\hat{\beta}_H)}}\right)$$

for the point estimate method.
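A sketch evaluating these three approximate type I error rates as functions of the deviation a; the variances, retention fraction, and deviation used are illustrative assumptions:

```python
from math import sqrt
from statistics import NormalDist

nd = NormalDist()

def type1_rates(a, varN, varH, lam_o, z=1.96):
    """Approximate type I error rates under a constancy deviation a,
    per the three expressions above."""
    sN, sH = sqrt(varN), sqrt(varH)
    se = sqrt(varN + (1 - lam_o)**2 * varH)
    shift = (1 - lam_o) * a / se
    Q = (sN + (1 - lam_o) * sH) / se
    R_PE = sqrt(1 + (1 - lam_o)**2 * varH / varN)
    return (nd.cdf(-z + shift),          # standard synthesis method
            nd.cdf(-z * Q + shift),      # two 100(1 - alpha)% CI method
            nd.cdf(-z / R_PE + shift))   # point estimate method

# Illustrative assumptions: equal variances, 50% retention, deviation a = 0.1
print(type1_rates(0.1, 0.01, 0.01, 0.5))
```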
Hung et al.31 simulated the across-trials type I error rate for the point
estimate, 95–95, and standard synthesis methods based on proportions of
undesirable outcomes. For the simulations, the true active control rate is 14%
or 15%, which is less than the placebo rate by 4%; the sample size is 7500 or
10,000 per arm. The sample sizes for the active control and the placebo in the
historical comparison are assumed to be roughly equal to the sample sizes
in the non-inferiority trial. In each case, the 95–95 and the standard synthesis
methods had type I error rates of approximately 0.003 and 0.025, respectively.
The unconditional type I error rate for the point estimate method ranged
from 0.050 to 0.058.
For the 95–80, 95–85, 95–90, and 95–95 methods, Figure 1 of Hung et al.5
gives across-trials type I error rate curves as the ratio of the variance of the
historical estimate of the active control effect to the variance in the non-
inferiority trial ranges from 0.5 to 10.
Additionally, for both fixed-effects and random-effects meta-analyses,
Wang, Hung, and Tsong3 compared the type I error rates for 50% and 80%
retention based on log-relative risks under the constancy assumption of three
methods: one method using the lower limit of the 95% confidence interval for
the control effect as the true effect of the control therapy, a standard synthe-
sis method, and the Hasselblad and Kong procedure. For the random-effects
model, the true effect for the active control in the non-inferiority trial is
assumed to equal the true global mean effect across studies. The 95% lower
confidence interval method maintained type I error rates below the target of
0.025 in all studied cases. The standard synthesis method maintains a proper
type I error rate except for cases where the within-trial variability was much
smaller than the between-trial variability. The Hasselblad and Kong method
consistently had a slightly higher type I error rate than the synthesis method
in the random-effects cases. This increase above the desired type I error rate
may be largely due to using a normal sampling distribution for the estimated
active control effect when an appropriate t-distribution would be a better
choice.32,33
For the fixed-effects meta-analysis, the standard synthesis method
maintained approximately the desired type I error rate. When basing the
non- inferiority margin on the lower limit of the 95% confidence interval for
the active control effect, the simulated type I error rates ranged from 0.0015
to 0.0115 with the type I error rates increasing as the percentage of retention
increased. The Hasselblad and Kong method had simulated type I error rates
ranging from 0.0009 to 0.0386 with the type I error rates decreasing as the
percentage of retention increases.
For 0 < α < 1 and 0 < γ < 1, Ho is rejected by a two–confidence interval proce-
dure whenever X + zα/2σ X < (1 – λo)(Y – zγ/2 σY). For a given strategy σ X(•), the
probability of rejecting the null hypothesis at (μX, μY) is
$$\int_{-\infty}^{\infty} \Phi\!\left(-z_{\alpha/2} + \frac{(1-\lambda_o)(y - z_{\gamma/2}\sigma_Y) - \mu_X}{\sigma_X(y)}\right)\varphi\!\left(\frac{y-\mu_Y}{\sigma_Y}\right)\mathrm{d}y\Big/\sigma_Y \qquad (5.33)$$
Based on Expression 5.33, Rothmann2 showed that over all possible strategies
σ X(•), the supremum one-sided type I error probability is (α + γ)/2 – αγ/4 >
α/2. This supremum occurs as σX(y) → 0 for y > μY and as σX(y) → ∞ for y < μY.
Conversely, the infimum one-sided type I error probability across all possible
strategies is αγ/4. This infimum occurs as σX(y) → 0 for y < μY and as σX(y) →
∞ for y > μY. However, practical strategies have type I error probabilities
between those two extremes. For α = γ = 0.05, the infimum and supremum
type I error probabilities are 0.0006 and 0.0494, respectively. This is a wider
range than the range for the type I error rate when the design of the non-
inferiority trial is independent of the estimation of the active control effect
of 0.0028–0.025.
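Expression 5.33 can be evaluated by simple numerical integration for any candidate strategy σX(·). The sketch below uses an exaggerated two-valued strategy at the null boundary μX = (1 – λo)μY to show how strongly the rejection probability depends on the strategy; the theoretical extremes over all strategies are αγ/4 ≈ 0.0006 and (α + γ)/2 – αγ/4 ≈ 0.0494.

```python
from math import exp, pi, sqrt
from statistics import NormalDist

nd = NormalDist()

def reject_prob(sigma_X, mu_X, mu_Y, sigma_Y, lam_o,
                alpha=0.05, gamma=0.05, grid=4000, width=10):
    """Expression 5.33 by midpoint numerical integration; sigma_X is a
    strategy function of the observed y."""
    za, zg = nd.inv_cdf(1 - alpha / 2), nd.inv_cdf(1 - gamma / 2)
    total, dy = 0.0, 2 * width * sigma_Y / grid
    for i in range(grid):
        y = mu_Y - width * sigma_Y + (i + 0.5) * dy
        inner = -za + ((1 - lam_o) * (y - zg * sigma_Y) - mu_X) / sigma_X(y)
        density = exp(-0.5 * ((y - mu_Y) / sigma_Y)**2) / (sigma_Y * sqrt(2 * pi))
        total += nd.cdf(inner) * density * dy
    return total

# Exaggerated strategy: tiny sigma_X when the control estimate is large,
# huge sigma_X when it is small; smoother strategies fall in between.
lam_o, mu_Y, sigma_Y = 0.5, 1.0, 0.2
strategy = lambda y: 1e-6 if y > mu_Y else 1e6
print(reject_prob(strategy, (1 - lam_o) * mu_Y, mu_Y, sigma_Y, lam_o))  # ~0.037
```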
A common strategy sizes the non-inferiority trial to have a desired power (e.g., 80% or 90% power) when there is no difference in the effects of the experimental and active control therapies. For the synthesis method, Ho is rejected whenever $X + z_{\alpha/2}\sqrt{\sigma_X^2 + (1-\lambda_o)^2\sigma_Y^2} < (1-\lambda_o)Y$ for some 0 < α < 1. For a given strategy σX(•), the probability of rejecting the null hypothesis at (μX, μY) is

$$\int_{-\infty}^{\infty} \Phi\!\left(\frac{(1-\lambda_o)y - \mu_X - z_{\alpha/2}\sqrt{\sigma_X^2(y) + (1-\lambda_o)^2\sigma_Y^2}}{\sigma_X(y)}\right)\varphi\!\left(\frac{y-\mu_Y}{\sigma_Y}\right)\mathrm{d}y\Big/\sigma_Y$$
As with the two–confidence interval approach, Rothmann determines the
infimum and supremum type I error probabilities over all possible strategies
for σ X(•) for the synthesis approach. When α = 0.05, infimum and supremum
type I error probabilities are 0.0006 and 0.049, respectively. This is the same
range for the type I error probabilities as with the two–confidence interval
approach when α = γ = 0.05. A critical value for the test can be determined
that gives a maximum type I error probability of 0.025 over a likely range
of values for (1 – λo)μY or that leads to a type I error rate of 0.025 based on a
distribution for the possible values of (1 – λo)μY.
In the examined settings, Rothmann noted that strategies based on having
adequate power when there is no difference in effects between the experi-
mental and active control therapies have smaller type I error rates than the
type I error rate in an analogous case when the sizing of the non-inferiority
trial is independent of the estimation of the active control effect. This will
be likewise seen in Example 5.9 in Section 5.4.4.2, which examines the type I
error rate under a model that introduces regression-to-the-mean bias.
As noted in Section 5.3.6, for the Bayesian setting, the analysis does not
depend on whether the sizing of the non-inferiority trial is independent of or dependent on the estimation of the active control effect. In both cases, the joint
posterior distribution of the differences in effects of the experimental ther-
apy versus the active control therapy and the active control therapy versus
placebo factors into the product of the marginal posterior distributions when
independent prior distributions are used. In the frequentist setting, the joint
density function does not factor into the marginal density functions unless
the sizing of the non-inferiority trial does not depend on the estimation of
the active control effect.
Example 5.9
Consider that there are five previously studied therapies for an indication. For i = 1,
…, 5, let Xi denote the observed placebo versus active control log-hazard ratio
for the i-th therapy from either one clinical trial or a meta-analysis of clinical
trials. Suppose X1,…,X5 is a random sample from a normal distribution having
mean ln(4/3) and standard deviation of 0.1. A new experimental therapy is to be
compared in a clinical trial to that therapy among the five therapies that has the
largest observed effect. We will first assume that the sizing of the non-inferiority
trial is independent of the estimation of the active control effect. We are interested
in testing the hypotheses in Expression 5.6 when λo = 0 and when the true placebo
versus active control log-hazard ratio in the non-inferiority trial is ln(4/3). Table
5.11 provides simulated type I error rates for both the 95–95 and standard synthe-
sis methods when the true placebo versus experimental hazard ratio is 1 (i.e., the
experimental therapy is ineffective) for various standard errors for the experimen-
tal versus active control log-hazard ratio, as the standard error goes to 0 and as
the standard error goes to infinity. Note that the regression-to-the-mean bias and
the type I error rates depend on the common historical standard deviation and the
standard error in the non-inferiority trial and do not depend on the value for the
common effect of the five previously studied therapies.
As the standard error for the estimated difference in the non-inferiority trial
decreases, the simulated type I error rate for the standard synthesis method
increases from 0.025 toward about 0.12. The type I error rate for the 95–95 method
is “U-shaped” in the standard error from the non-inferiority trial. That is, as the
standard error in the non-inferiority trial decreases, the type I error rate decreases
to some minimum value at a standard error between 0.1 and 0.2, and increases
TABLE 5.11
Simulated Type I Error Rates—Independent Design
Simulated Type I Error Ratea
Non-Inferiority Trial Standard Error 95–95 Method Standard Synthesis Method
0 0.1194 0.1194
0.03 0.0359 0.1161
0.04 0.0260 0.1145
0.05 0.0200 0.1114
0.0707 0.0157 0.1066
0.1 0.0119 0.0920
0.2 0.0130 0.0638
→∞b 0.025 0.025
a Each type I error rate is based on 100,000 simulations.
b Type I error rate for the “→ ∞” case determined mathematically. The simulated type I error
rate for this case was 0.0245.
toward a value of about 0.12 as the standard error decreases toward zero. The
95–95 method maintains a type I error rate below 0.025 unless the standard error
in the trial is very small.
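The independent-design simulation in Example 5.9 can be sketched as follows (same model: five historical estimates with common mean ln(4/3) and standard deviation 0.1, the largest selected as the control effect; the experimental therapy is ineffective and λo = 0):

```python
import numpy as np

rng = np.random.default_rng(9)
n = 100_000
sH, mu = 0.1, np.log(4 / 3)

# Five previously studied therapies; the largest observed placebo vs.
# control log-hazard ratio is selected as the active control effect.
betaH_hat = (mu + sH * rng.standard_normal((n, 5))).max(axis=1)

for sN in (0.03, 0.05, 0.1, 0.2):
    # Ineffective experimental therapy with lambda_o = 0: the true
    # experimental vs. control log-hazard ratio equals ln(4/3).
    betaN_hat = mu + sN * rng.standard_normal(n)
    synthesis = (betaN_hat - betaH_hat) / np.sqrt(sN**2 + sH**2) < -1.96
    two_ci = betaN_hat + 1.96 * sN < betaH_hat - 1.96 * sH
    print(sN, two_ci.mean(), synthesis.mean())  # compare with Table 5.11
```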
We next consider the case where the sizing of the non-inferiority trial depends
on the estimated active control effect. For both analysis methods, Table 5.12 pro-
vides simulated type I error rates when the true placebo versus experimental haz-
ard ratio is 1, where the non-inferiority trial is sized based on a conditional power
ranging from 3% to 100% from using the 95–95 method under the assumptions
that the experimental and active control therapies have the same effect and the
true placebo versus active control log-hazard ratio in the non-inferiority trial is
ln(4/3).
Comparing Tables 5.11 and 5.12, when the sizing of the non-inferiority trial
depends on the estimated active control effect, the type I error rates tend to be
slightly smaller than when the sizing of the non-inferiority trial is independent of
the estimated active control effect. The 95–95 method achieved type I error rates
smaller than 0.025 for every practical choice for power. The simulated type I error
rates for the standard synthesis method exceeded 0.025 and were increasing in
the conditional power.
In general, the amount of regression-to-the-mean bias (and the increase to
the type I error rates of the standard synthesis method and the 95–95 method)
increases as there is an increase in the number of previously studied investiga-
tional agents that potentially could have produced results so as to be selected as
the active control and/or as the standard deviations for the estimated effects of the
previously studied investigational agents increases. When the standard errors for
the previously studied investigational agents are equal, the regression-to-the-mean
bias will be at its greatest when the effects of those previously studied investi-
gational agents are equal. Reproducibility across studies in the effect size of the
selected active control can provide assurance that the size of the regression-to-the-mean bias is small.
TABLE 5.12
Simulated Type I Error Rates—Dependent Design
Simulated Type I Error Ratea
Conditional Power (%) 95–95 Method Standard Synthesis Method
3 0.0225 0.0269
10 0.0138 0.0428
20 0.0113 0.0528
30 0.0101 0.0589
40 0.0095 0.0631
50 0.0094 0.0668
60 0.0092 0.0702
70 0.0090 0.0739
80 0.0091 0.0776
90 0.0095 0.0824
99.99 0.0143 0.0988
Just below 100b 0.1177 0.1193
a Each type I error rate is based on 100,000 simulations.
b Power of 1 – β where zβ = 1000.
TABLE 5.13
Simulated Type I Error Rates for a Bias-Corrected Synthesis
Test—Independent Design
Non-Inferiority Trial Standard Error Simulated Type I Error Ratea
0 0.0047
0.03 0.0053
0.04 0.0057
0.05 0.0066
0.0707 0.0078
0.1 0.0116
0.2 0.0193
→ ∞b 0.025
a Each type I error rate is based on 100,000 simulations.
b Type I error rate for the “→ ∞” case determined mathematically. The
simulated type I error rate for this case was 0.0245.
When the standard error for the non-inferiority trial equaled 0.1, the common standard error for the
previous studies, the type I error rates for the bias-corrected synthesis and 95–95
methods are similar (0.0116 and 0.0119, respectively).
Table 5.14 provides simulated type I error rates when the true placebo versus
experimental hazard ratio is 1 for a bias-corrected synthesis method where the
non-inferiority trial is sized on the basis of a conditional power ranging from 3% to
100% from using the 95–95 method under the assumptions that the experimental
and active control therapies have the same effect and the true placebo versus
active control log-hazard ratio in the non-inferiority trial is ln(4/3).
As in the independent design case, we note that the type I error rates in Table 5.14
for the bias-corrected synthesis method are all smaller than 0.025 and are much
smaller than those type I error rates given in Table 5.12 for the standard synthesis
method without a bias correction. Additionally, the type I error rates in Table 5.14
for the bias-corrected synthesis method are decreasing in the conditional power,
not increasing in the conditional power as in Table 5.12 for the standard synthesis
method without a bias correction. Also, in this example for 90% or greater con-
ditional power, the bias-corrected synthesis method had a smaller simulated type
I error rate than the 95–95 method without bias correction. For 80% or smaller
conditional power, the bias-corrected synthesis method had a greater simulated
type I error rate than the 95–95 method without bias correction.
TABLE 5.14
Simulated Type I Error Rates for a Bias Corrected Synthesis
Test—Dependent Design
Conditional Power (%) Simulated Type I Error Ratea
3 0.0240
10 0.0194
20 0.0166
30 0.0145
40 0.0134
50 0.0123
60 0.0113
70 0.0104
80 0.0095
90 0.0085
99.99 0.0065
Just below 100b 0.0048
a Each type I error rate is based on 100,000 simulations.
b Power of 1 – β, where zβ = 1000.
made by the 95–95 method in Expression 5.24 when the standard error in the
non-inferiority trial equals (1 – λo) multiplied by the common standard
error for the five previously studied therapies.
Example 5.10 evaluates the type I error rate of false conclusions of efficacy
of the experimental therapy under the previous model in Chapter 3, where
the likelihood that the active control is truly effective depends on the prob-
ability a random agent for that indication is truly effective and the power for
concluding effectiveness when the agent is effective.
Example 5.10
Often, when a therapy has been concluded as effective for an indication based
on one or more clinical trials (generally, at least two clinical trials), it may become
unethical to conduct further placebo-controlled trials for that indication. The ther-
apy concluded as effective would be given as an “active” control in future trials.
The active control may or may not be effective. As we have seen earlier, the likeli-
hood that the active control is truly effective depends on the statistical significance
of the results, the probability a random agent for that indication is truly effective,
and the power for concluding effectiveness when the agent is effective. Equation
3.1 gives the probability that the active control is ineffective. Under this paradigm,
we will evaluate the type I error rate of falsely concluding that the experimental
therapy has any efficacy when it has zero efficacy and the type I error rate of
falsely concluding that the experimental therapy has efficacy more than half the
efficacy of the active control when the experimental therapy has efficacy exactly
half that of the active control. For ease, it is assumed that the sizing of the non-
inferiority trial is independent of the estimation of the active control effect.
The active control for the non-inferiority trial will be evaluated based on achiev-
ing a favorable one-sided p-value less than 0.025 from a single trial. It is under-
stood that in practice, usually a more stringent criterion than a one-sided p-value
less than 0.025 would be used for having a therapy as an active control in future
clinical trials. The evaluation of the type I error rate in a non-inferiority trial can
also be done for a more stringent significance level than 0.025. At the beginning
of the historical trial, the estimated active control effect is assumed to have a
normal distribution with standard deviation sH. When the active control is truly
effective, the actual power is assumed to be 90%, which makes the true effect of
the active control approximately 3.24sH. Given that the p-value is less than 0.025,
the conditional mean, median, and standard deviation for the estimated effect are
3.44sH, 3.37sH, and 0.84sH, respectively. When the “active” control is truly inef-
fective (has zero effect), the conditional mean, median, and standard deviation
for the estimated effect given a p-value of less than 0.025 are 2.33sH, 2.24sH, and
0.34sH, respectively.
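The conditional moments quoted above follow from the normal distribution truncated at the significance cutoff; a quick Monte Carlo check (in units of sH):

```python
import numpy as np

rng = np.random.default_rng(10)
n = 2_000_000

# Estimated effects in units of sH; "significant" means one-sided p < 0.025.
for true_effect in (3.24, 0.0):  # 90%-power case, then a truly ineffective control
    est = true_effect + rng.standard_normal(n)
    kept = est[est > 1.96]
    print(true_effect, kept.mean(), np.median(kept), kept.std())
# ~ (3.44, 3.37, 0.84) when effective; ~ (2.33, 2.24, 0.34) when ineffective
```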
For the cases where the active control has its assumed effect and when the
active control has zero effect, the type I error rate for a conclusion of any efficacy
from a non-inferiority trial was evaluated for the standard synthesis method, the
standard synthesis method with 50% discounting, the 95–95 method, and the
95–95 method with 50% discounting under the assumption that the active control
effect has not changed. For these methods, Tables 5.15 and 5.16 provide the simu-
lated type I error rates for a false conclusion of any efficacy when the experimental
TABLE 5.15
Simulated Type I Error Rates Conditioned on Active Control Having Its Assumed
Effect with Historical Power of 90% and Significance Level of 0.025
Ratio of Variances   Synthesis Method (0% Retention)   Synthesis Method (0% Retention, 50% Discounting)   95–95 Method (0% Retention)   95–95 Method (0% Retention, 50% Discounting)
→0+ 0.025 0.025 0.025 0.025
0.1 0.0270 0.0075 0.0078 0.0033
0.25 0.0270 0.0031 0.0044 0.0008
0.5 0.0279 0.0011 0.0034 0.0002
1 0.0269 0.0003 0.0027 0.00001
2 0.0286 0.00003 0.0032 0
4 0.0271 0.00001 0.0045 0
10 0.0282 0 0.0076 0
→∞ 0.0278 0.0000001 0.0278 0.0000001
therapy has zero efficacy for various ratios of the variance of the estimated dif-
ference in effects from the non-inferiority trial to the variance of the historically
based estimate of the active control effect. For each case, 100,000 simulations
were used on the underlying distribution for the estimated effect. Thus, when the
active control is effective (is ineffective), there were typically 90,000 cases (2,500
cases) with a p-value of less than 0.025 that were further used to simulate the type
I error rate of falsely concluding that the experimental therapy has efficacy when
the experimental therapy has zero effect in the non-inferiority trial.
The simulated type I error rates have associated margins of error with some
simulated rates greater than the true rate. Some reversals appear to occur in the
order of the simulated type I error rates. For example, when the active control is
truly ineffective and a 95–95 method is used with 50% discounting, the simulated
type I error rates when the ratio of variances is 0.5 and 1 are 0.0452 and 0.0404,
TABLE 5.16
Simulated Type I Error Rates Conditioned on Active Control Being Ineffective
Ratio of Variances   Synthesis Method (0% Retention)   Synthesis Method (0% Retention, 50% Discounting)   95–95 Method (0% Retention)   95–95 Method (0% Retention, 50% Discounting)
→0+ 0.025 0.025 0.025 0.025
0.1 0.1033 0.0572 0.0347 0.0300
0.25 0.1619 0.0844 0.0457 0.0337
0.5 0.2524 0.1274 0.0611 0.0452
1 0.3353 0.1494 0.0646 0.0404
2 0.4651 0.2323 0.1051 0.0570
4 0.5796 0.3217 0.1466 0.0666
10 0.7221 0.5199 0.2913 0.1233
→∞ 1 1 1 1
respectively. Based on the general pattern for the simulated type I error rates for
this procedure, it is likely that the true type I error rates in these two cases are
ordered in the reverse direction. Additionally, it appears that when the active con-
trol is effective, the synthesis method has a type I error rate between 0.025 and
0.0278 and that the deviations out of this range by the simulated type I error rates
are due to random error.
When the active control is truly effective, the methods based on 50% discount-
ing appear to have type I error rates for concluding any efficacy that decreases as
the ratio of the variances increases. The type I error rates for the 95–95 method
without discounting was U-shaped in the ratio of the variances, with the type I
error rate exceeding 0.025 only when the ratio of variances is quite large. When
the active control is truly ineffective for all methods, the limiting type I error rate is
1 as the ratio of variances goes to infinity. In all cases, the limiting type I error rate
is 0.025 as the ratio of variances goes to zero.
In the cases studied when the active control was truly effective, the standard
synthesis method with 50% discounting had a smaller type I error rate for a false
conclusion of any efficacy when the experimental therapy has zero efficacy than
the 95–95 method without discounting. However, when the active control is truly
ineffective, the order was reversed with the 95–95 method without discount-
ing having the smaller type I error rate. For these two methods, when the likeli-
hood that the active control is truly effective is considered, the standard synthesis
method with 50% discounting will have the smaller type I error rate when it is
highly likely that the active control is effective, and the larger type I error rate
when it is highly unlikely that the active control is effective.
For the four analysis methods, Tables 5.17 and 5.18 provide the type I error rates
for a false conclusion of any efficacy when the experimental therapy has zero effi-
cacy based on various probabilities that the active control is ineffective when the
ratio of the variances is 1 and 4. The probability that the active control is ineffec-
tive is based on Equation 3.1 and the selected probability that a random therapy is
effective. The simulation-based type I error rate is then a convex combination of
the corresponding conditional type I error rates in Tables 5.15 and 5.16. When the
probability a random therapy is effective equals 0.25, the probability is 0.0769 that
a therapy that achieves favorable statistical significance at a one-sided 0.025 level
is ineffective. Then for the standard synthesis method when the ratio of variances
is 1, the simulation-based type I error rate for a false conclusion of any efficacy
TABLE 5.17
Simulation-Based Type I Error Rate for a False Conclusion of Any Efficacy When
Ratio of Variances = 1
Probability Random Therapy Is Effective   Probability Active Control Is Ineffective   Standard Synthesis Method   95–95 Method   Standard Synthesis Method with 50% Discounting   95–95 Method with 50% Discounting
0.1 0.2000 0.0886 0.0151 0.0301 0.0081
0.25 0.0769 0.0506 0.0075 0.0118 0.0031
0.5 0.0270 0.0352 0.0044 0.0043 0.0011
0.75 0.0092 0.0297 0.0033 0.0017 0.0004
0.9 0.0031 0.0278 0.0029 0.0008 0.0001
TABLE 5.18
Simulation-Based Type I Error Rate for a False Conclusion of Any Efficacy When
Ratio of Variances = 4
Probability Random Therapy Is Effective   Probability Active Control Is Ineffective   Standard Synthesis Method   95–95 Method   Standard Synthesis Method with 50% Discounting   95–95 Method with 50% Discounting
0.1 0.2000 0.1376 0.0329 0.0643 0.0133
0.25 0.0769 0.0696 0.0154 0.0248 0.0051
0.5 0.0270 0.0420 0.0083 0.0087 0.0018
0.75 0.0092 0.0322 0.0058 0.0030 0.0006
0.9 0.0031 0.0288 0.0049 0.0010 0.0002
when the experimental therapy has zero efficacy is 0.9231 × 0.0269 + 0.0769 ×
0.3351 = 0.0506. When the probability that a random therapy is effective is only
10% and the ratio of variances is 4, the simulation-based type I error rate for the
95–95 method without discounting is 0.0329, larger than a desired one-sided level
of 0.025. When the probability a random therapy is effective is small, the type I
error rate for a false conclusion of any efficacy for the 95–95 method without
discounting will be larger than an intended level of 0.025. It is thus important in
settings where “success” is rare to consider using a more stringent criterion than
a margin based on the lower limit of the 95% confidence interval of the active
control effect.
For cases where the active control has its assumed effect and when the active
control has zero effect, the simulated type I error rates for a conclusion of efficacy
of more than half the efficacy of the active control when the experimental therapy
has efficacy exactly half that of the active control are provided in Table 5.19 for
TABLE 5.19
Simulated Conditional Type I Error Rates for Testing for Better than 50% Retention of
Active Control Effect
                     Active Control Effective                                Active Control Ineffective
Ratio of Variances   Synthesis Method, 50% Retention   95–95 Method, 50% Retention   Synthesis Method, 50% Retention   95–95 Method, 50% Retention
→0+ 0.025 0.025 0.025 0.025
0.1 0.0269 0.0137 0.0572 0.0300
0.25 0.0273 0.0091 0.0844 0.0337
0.5 0.0279 0.0067 0.1274 0.0452
1 0.0272 0.0043 0.1494 0.0404
2 0.0282 0.0034 0.2323 0.0570
4 0.0268 0.0030 0.3217 0.0666
10 0.0280 0.0036 0.5199 0.1233
→∞ 0.0278 0.0278 1 1
the standard synthesis and 95–95 methods based on 50% retention. When the
active control has zero effect, half the active control effect equals zero effect.
As with Tables 5.15 and 5.16, for each case, 100,000 simulations were used in
the underlying distribution for the estimated effect. Simulations are based on no
change in the active control effect. For the standard synthesis method and the
95–95 method, the limiting conditional type I error rate is 0.0278 = 0.025/0.9 as
the ratio of variances goes to infinity when the active control is truly effective.
For both methods, the limiting conditional type I error rate is 0.025 as the ratio of
variances goes to zero, and when the active control is truly ineffective the limiting
type I error rate is 1 as the ratio of variances goes to infinity. When the probability
that a random therapy is effective equals 0.25 (leading to a probability of 0.0769
that the active control is ineffective from Table 5.17) and the ratio of variances is
4, the standard synthesis method for 50% retention has a simulation-based type I
error rate of 0.9231 × 0.0268 + 0.0769 × 0.3217 = 0.0495 and the 95–95 method
has a simulation-based type I error rate of 0.9231 × 0.0030 + 0.0769 × 0.0666 =
0.0079.
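The convex-combination arithmetic can be sketched directly from the tabled conditional rates; p_ineffective below reproduces the Equation 3.1 computation for a one-sided significance level of 0.025 and 90% power:

```python
# Probability the active control is ineffective given a significant result
# (the Equation 3.1 computation), then the unconditional type I error rates
# as convex combinations of the conditional rates in Tables 5.15, 5.16, and 5.19.
def p_ineffective(p_eff, power=0.90, alpha=0.025):
    return (1 - p_eff) * alpha / ((1 - p_eff) * alpha + p_eff * power)

pi = p_ineffective(0.25)
print(pi)                               # ~0.0769

print((1 - pi) * 0.0269 + pi * 0.3353)  # synthesis, any efficacy, ratio = 1 (~0.0506)
print((1 - pi) * 0.0268 + pi * 0.3217)  # synthesis, 50% retention, ratio = 4 (~0.0495)
print((1 - pi) * 0.0030 + pi * 0.0666)  # 95-95, 50% retention, ratio = 4 (~0.0079)
```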
TABLE 5.20
Summary of Results on a Meta-Analysis on Overall Survival Comparing
5-FU + LV with 5-FU
Log Hazard Ratioa Standard Error 95% Confidence Interval for Hazard Ratioa
0.2341 0.0750 (1.091–1.464)
a Hazard ratios are 5-FU + LV/5-FU.
Bolus 5-FU by itself has not demonstrated an effect on overall survival for
first-line metastatic colorectal cancer, whereas the addition of leucovorin to
bolus 5-FU appears to improve overall survival (see Table 5.20). There is an
assumption in the trials that the use of the fluoropyrimidine capecitabine
does not require the additional use of leucovorin to improve its effect. A sys-
tematic review was done to find clinical trials that compared regimens of
5-FU + LV similar to the Mayo Clinic regimen with the same
regimen minus leucovorin. For each capecitabine trial, capecitabine would
be regarded as noninferior to 5-FU + LV on overall survival if capecitabine
retains greater than 50% of the historical survival effect of 5-FU + LV relative
to 5-FU alone.
TABLE 5.21
Summary of Overall Survival Results for Two Capecitabine Clinical Trials

Study        Total Number    Log Hazard    Standard    95% Confidence Interval
             of Deaths       Ratioa        Error       for the Hazard Ratioa
Study 1      533             –0.0036       0.0868      (0.841, 1.181)
Study 2      533             –0.0844       0.0867      (0.775, 1.089)
Combinedb    1066            –0.0440       0.0613      (0.849, 1.079)
a Hazard ratios are capecitabine/5-FU + LV.
b A fixed-effects meta-analysis of studies 1 and 2.
Both a 95–95 approach and a synthesis method with a test statistic similar to
Equation 5.10 were used.
On the basis of the 95–95 approach, the non-inferiority efficacy threshold
for the capecitabine versus 5-FU + LV hazard ratio is 1.091. The upper limits of
the 95% confidence interval for the capecitabine versus 5-FU + LV hazard ratio
are 1.181 and 1.089 for studies 1 and 2, respectively. Thus, study 1 fails to meet
this criterion for determining efficacy, whereas study 2 barely succeeds. For
the fixed-effects meta-analysis of studies 1 and 2, the upper limit of the 95%
confidence interval for the capecitabine versus 5-FU + LV hazard ratio is 1.079,
satisfying the non-inferiority efficacy threshold as determined by the 95–95
approach. The non-inferiority threshold for the capecitabine versus 5-FU + LV
hazard ratio is 1.045 (=1 + (1.09 – 1)/2) based on the 95–95 approach and a reten-
tion fraction of 50%. Studies 1 and 2 and their combined analysis all fail to
satisfy this threshold with the upper limits of their respective 95% confidence
intervals for the capecitabine versus 5-FU + LV hazard ratio above 1.045.
When targeting a one-sided type I error rate of 0.025 and assuming that
Z1 in Equation 5.10 has an approximate standard normal distribution at the
boundary of the null hypothesis in Expression 5.4, the critical value for the
synthesis method is –1.96. For studies 1 and 2, the values for the test statistic
Z1 in Equation 5.10 for a retention fraction of 50% are –1.32 and –2.16, respec-
tively. From this synthesis method, study 1 fails to demonstrate that capecit-
abine retains more than 50% of the historical effect of 5-FU + LV relative to
5-FU (–1.32 > –1.96), whereas study 2 demonstrates that capecitabine retains
more than 50% of the historical effect of 5-FU + LV relative to 5-FU (–2.16 <
–1.96). For the combined analysis, Z1 = –2.26.
A Fieller lower confidence limit for the retention fraction can be deter-
mined by setting Z1 = –1.96 and solving for the unspecified retention fraction,
λo. From this, we see that study 1 (study 2) demonstrated that capecitabine
retains at least 10% (61%) of the historical effect of 5-FU + LV on overall sur-
vival. The combined analysis demonstrated that capecitabine retains at least
64% of the historical effect of 5-FU + LV on overall survival.
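The exact form of Equation 5.10 appears earlier in the chapter; as an illustration only, the sketch below assumes the common synthesis form Z1 = (β̂ − (1 − λo)β̂H)/√(SE² + (1 − λo)²SEH²) on the log hazard scale, where β̂H and SEH are the historical estimate and standard error from Table 5.20. Solving Z1 = −1.96 for the retention fraction gives lower limits close to, but not exactly, the 10%, 61%, and 64% quoted above; the small differences reflect the precise form of Equation 5.10 and rounding.

```python
import math
from scipy.optimize import brentq

HIST, HIST_SE = 0.2341, 0.0750  # historical 5-FU + LV effect (Table 5.20)

def z1(log_hr, se, retention):
    """Synthesis-type statistic for testing retention of the control effect."""
    loss = 1.0 - retention  # fraction of the effect allowed to be lost
    return (log_hr - loss * HIST) / math.sqrt(se**2 + (loss * HIST_SE)**2)

for label, est, se in [("Study 1", -0.0036, 0.0868),
                       ("Study 2", -0.0844, 0.0867),
                       ("Combined", -0.0440, 0.0613)]:
    # Fieller-type lower limit: the largest retention fraction with Z1 = -1.96
    lam = brentq(lambda r: z1(est, se, r) + 1.96, -5.0, 1.0)
    print(f"{label}: retention demonstrated to be at least {lam:.2f}")
```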
TABLE 5.22
Power Is Provided for Individual Studies and Combined Analyses at Various True
Hazard Ratios of Capecitabine versus 5-FU + LV

                      Methoda
                      Non-Inferiority                     Standard Synthesis Method with
                      Threshold of 1.044                  50% Retention of the Control Effect
True Hazard Ratiob    533 Deaths (%)    1066 Deaths (%)   533 Deaths (%)    1066 Deaths (%)
1.05                  2                 2                 9                 12
1.00                  7                 10                22                35
0.95                  19                34                42                67
0.90                  40                68                67                91
0.85                  66                92                86                99
a Based on the geometric definition of the proportion of effect retained.
b For overall survival of capecitabine versus 5-FU + LV.
As shown in Table 5.22, as the alternative becomes a more and more advantageous
effect for capecitabine, the power increases greatly. Also, for each alternative
and fixed number of deaths, the power is higher for the standard
synthesis method than for the 95–95 approach with its threshold of 1.044.
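The entries of Table 5.22 can be closely reproduced with normal-approximation power formulas, using the Schoenfeld variance 4/d for the log hazard ratio under 1:1 randomization and treating the historical estimate of Table 5.20 as fixed; a sketch:

```python
import math
from scipy.stats import norm

HIST, HIST_SE = 0.2341, 0.0750   # historical effect (Table 5.20)
Z_A = norm.ppf(0.975)            # one-sided 0.025 critical value

def power_fixed_margin(true_hr, deaths, margin=1.044):
    """Power to conclude HR(capecitabine/5-FU + LV) < margin."""
    se = math.sqrt(4.0 / deaths)  # var(log HR) ~ 4/d for 1:1 randomization
    return norm.cdf((math.log(margin) - math.log(true_hr)) / se - Z_A)

def power_synthesis(true_hr, deaths, retention=0.5):
    """Conditional power of the synthesis test for 50% retention."""
    se = math.sqrt(4.0 / deaths)
    crit = (1 - retention) * HIST - Z_A * math.sqrt(
        se**2 + ((1 - retention) * HIST_SE)**2)
    return norm.cdf((crit - math.log(true_hr)) / se)

for hr in (1.05, 1.00, 0.95, 0.90, 0.85):  # rows of Table 5.22
    row = [round(100 * f(hr, d)) for f in (power_fixed_margin, power_synthesis)
           for d in (533, 1066)]
    print(hr, row)  # e.g., 1.00 -> [7, 10, 22, 35]
```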
Conditioning on the results from the estimation of the historical effect
size of 5-FU + LV on overall survival, Table 5.23 provides, for a one-to-one
randomization, the number of events to have 90% power at various capecit-
abine versus 5-FU + LV hazard ratios for these two methods of analysis.
For each case in Table 5.23, as the alternative becomes a slightly more and
more advantageous effect for capecitabine, the number of events sharply
decreases. This helps illustrate the importance of a proper choice of the
alternative to size the non-inferiority trial. Also, for each hazard ratio, the
number of events is smaller for the test procedure that uses a normalized
test statistic (the synthesis method) than for the test procedure based on
the two 95% confidence interval (95–95) approach.
TABLE 5.23
For Each Method, Number of Events Is Provided to Have 90% Power for Various
True Hazard Ratios of Capecitabine versus 5-FU + LV (1:1 Randomization)

True Hazard Ratioa    Non-Inferiority Cutoff    Standard Synthesis Method with
                      of 1.044                  50% Retention of the Control Effect
1.00                  22,669                    7,192
0.95                  4,721                     2,113
0.90                  1,908                     1,030
0.85                  995                       606
a For overall survival of capecitabine versus 5-FU + LV.
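Under the same approximations, the event counts in Table 5.23 follow from inverting the power formulas: Schoenfeld's formula for the fixed threshold and a one-dimensional solve for the synthesis method; a sketch:

```python
import math
from scipy.stats import norm
from scipy.optimize import brentq

HIST, HIST_SE = 0.2341, 0.0750
Z_A, Z_B = norm.ppf(0.975), norm.ppf(0.90)  # one-sided 0.025, 90% power

def events_fixed_margin(true_hr, margin=1.044):
    """Schoenfeld: d = 4(z_a + z_b)^2 / (log margin - log true HR)^2 for 1:1."""
    return 4 * (Z_A + Z_B) ** 2 / (math.log(margin) - math.log(true_hr)) ** 2

def events_synthesis(true_hr, retention=0.5):
    """Number of events d making the conditional power of the synthesis test 90%."""
    def gap(d):
        se = math.sqrt(4.0 / d)
        crit = (1 - retention) * HIST - Z_A * math.sqrt(
            se**2 + ((1 - retention) * HIST_SE)**2)
        return (crit - math.log(true_hr)) / se - Z_B
    return brentq(gap, 10, 1e7)

for hr in (1.00, 0.95, 0.90, 0.85):
    print(hr, round(events_fixed_margin(hr)), round(events_synthesis(hr)))
# approximately reproduces Table 5.23: (22669, 7192), (4721, 2113), ...
```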
References
1. Rothmann, M. et al., Design and analysis of non-inferiority mortality trials in
oncology, Stat. Med., 22, 239–264, 2003.
2. Rothmann, M., Type I error probabilities based on design-stage strategies with
applications to non-inferiority trials, J. Biopharm. Stat., 15, 109–127, 2005.
3. Wang, S.-J., Hung, H.M.J., and Tsong, Y., Utility and pitfalls of some statisti-
cal methods in active controlled clinical trials, Control Clin. Trials, 23, 15–28,
2002.
4. Temple, R. and Ellenberg, S.S., Placebo-controlled trials and active-controlled
trials in the evaluation of new treatments: Part 1. Ethical and scientific issues,
Ann. Intern. Med., 133, 455–463, 2000.
5. Hung, H.M.J., Wang, S.-J., and O’Neill, R., Issues with statistical risks for test-
ing methods in non-inferiority trial without a placebo arm, J. Biopharm. Stat., 17,
201–213, 2007.
6. U.S. Food and Drug Administration, Guidance for industry: Non-inferiority
clinical trials (draft guidance), March 2010.
7. Sankoh, A.J., A note on the conservativeness of the confidence interval approach
for the selection of non-inferiority margin in the two-arm active-control trial,
Stat. Med., 27, 3732–3742, 2008.
8. Hauck, W.W. and Anderson, S., Some issues in the design and analysis of equiv-
alence trials, Drug Inf. J., 33, 109–118, 1999.
9. Lawrence, J., Some remarks about the analysis of active control studies, Biom. J.,
47, 616–622, 2005.
10. Gupta, G. et al., Statistical review experiences in equivalence testing at FDA/
CBER, Proc. Biopharm. Sec., American Statistical Association Alexandria, VA,
1999, 220–223.
11. The ASSENT-2 Investigators, Single bolus tenecteplase compared with front-
loaded alteplase in acute myocardial infarction: ASSENT-2 double-blind ran-
domised trial, Lancet, 354, 716–722, 1999.
12. U.S. Food and Drug Administration Oncologic Drugs Advisory Committee
meeting July 27, 2004, transcript at http://www.fda.gov/ohrms/dockets/
ac/04/transcripts/2004-4060T1.pdf.
13. Holmgren, E.B., Establishing equivalence by showing that a specified percent-
age of the effect of the active control over placebo is maintained, J. Biopharm.
Stat., 9, 651–659, 1999.
14. Simon, R., Bayesian design and analysis of active control clinical trials, Biometrics,
55, 484–487, 1999.
15. Hasselblad, V. and Kong, D.F., Statistical methods for comparison to placebo in
active-control trials, Drug Inf. J., 35, 435–449, 2001.
16. Snapinn, S. and Jiang, Q., Controlling the type 1 error rate in non-inferiority tri-
als, Stat. Med., 27, 371–381, 2008.
17. Clinton, B. and Gore, A., Reinventing regulation of drugs and medical devices,
National Performance Review, April 1995.
18. Snapinn, S. and Jiang, Q., Preservation of effect and the regulatory approval
of new treatments on the basis of non-inferiority trials, Stat. Med., 27, 382–391,
2008.
19. Hung, H.M.J., Wang, S.-J., and O’Neill, R.T., A regulatory perspective on choice
of margin and statistical inference issue in non-inferiority trials, Biom. J., 47,
28–36, 2005.
20. Chen, G., Wang, Y.-C., and Chi, Y.H.G., Hypotheses and type I error in active-
control non-inferiority trials, J. Biopharm. Stat., 14, 301–313, 2004.
21. Fisher, L.D., Active control trials: What about a placebo? A method illustrated
with clopidogrel, aspirin and placebo, J. Am. Coll. Cardiol., 31, 49A, 1998.
22. Fisher, L.D., Gent, M., and Büller, H.R., Active-control trials: How would a new
agent compare with placebo? A method illustrated with clopidogrel, aspirin,
and placebo, Am. Heart J., 141, 26–32, 2001.
23. Rothmann, M.D. and Tsou, H., On non-inferiority analysis based on delta-
method confidence intervals, J. Biopharm. Stat., 13, 565–583, 2003.
24. Odem-Davis, K.S., Current issues in non-inferiority trials, dissertation,
University of Washington, Department of Biostatistics, 2010.
25. Wang, S.-J. and Hung, H.M.J., TACT method for non-inferiority testing in active
controlled trials, Stat. Med., 22, 227–238, 2003.
26. Carroll, K.J., Active-controlled non-inferiority trials in oncology: Arbitrary lim-
its, infeasible sample sizes and uninformative data analysis. Is there another
way? Pharm. Stat., 5, 283–293, 2006.
27. Snapinn, S.M., Alternatives for discounting in the analysis of non-inferiority tri-
als, J. Biopharm. Stat., 14, 263–273, 2004.
28. Wiens, B., Choosing an equivalence limit for non-inferiority or equivalence
studies, Control Clin. Trials, 23, 2–14, 2002.
29. Fleming, T.R., Current issues in non-inferiority trials, Stat. Med., 27, 317–332,
2008.
30. Hettmansperger, T.P., Two-sample inference based on one-sample sign statis-
tics, Appl. Stat., 33, 45–51, 1984.
31. Hung, H.M.J. et al., Some fundamental issues with non-inferiority testing in
active controlled clinical trials, Stat. Med., 22, 213–225, 2003.
32. Follmann, D.A. and Proschan, M.A., Valid inferences in random effects meta-
analysis, Biometrics, 55, 732–737, 1999.
33. Larholt, K., Tsiatis, A.A., and Gelber, R.D., Variability of coverage probabili-
ties when applying a random effects methodology for meta-analysis, Harvard
School Public Health Department of Biostatistics, unpublished, 1990.
34. FDA Medical-Statistical review for Xeloda (NDA 20-896), dated April 23, 2001.
35. FDA/CDER New and Generic Drug Approvals: Xeloda product labeling, at
http://www.fda.gov/cder/foi/label/2003/20896slr012_xeloda_lb1.pdf.
6.1 Introduction
When designing a study to show that an experimental therapy is effective, it
is sometimes possible to include a third arm in the study to obtain data on
both a concurrent placebo control and an active control. Earlier in this book,
we considered two-arm non-inferiority trials having only an active control
when the use of a placebo control is unethical or problematic—for example,
if an effective treatment is available for a disease with obvious discomfort or
irreversible morbidity, it may be difficult to obtain permission from an ethi-
cal review board to include a placebo control and most likely impossible to
obtain informed consent from potential study subjects. Alternatively, when
a placebo control is ethical, the comparison of an experimental treatment to
a placebo control is the gold standard and inclusion of an active control is
generally not required. However, there are situations in which inclusion of
both a placebo control and an active control are ethically and scientifically
defensible.
A three-arm trial involving concurrent active and placebo controls may
evolve in one of two ways—an active control may be added to a placebo-
controlled trial where the objective is the demonstration of superior efficacy
of the experimental therapy relative to placebo or a placebo control is added
to a two-arm non-inferiority trial.
When the standard therapy has a large, important effect, use of placebo-
controlled trials without a concurrent standard therapy arm may allow
claims of effectiveness for drugs that are substantially less effective than
standard therapy. Also, failure to demonstrate superiority of an experimen-
tal therapy to placebo can either be due to the experimental therapy being
ineffective or due to the trial lacking assay sensitivity. Additional use of an
active control arm (i.e., a three-arm trial) can assist in determining whether
the study has assay sensitivity. If the active control is demonstrated to be
superior to placebo, then the trial has assay sensitivity. If neither the active
control nor the experimental therapy is demonstrated to be superior to pla-
cebo, the trial may have lacked assay sensitivity.
It may be rare that all these criteria are met. Potential situations include a dis-
ease that does not result in discomfort, mortality, or irreversible morbidity
(perhaps for short-term studies of chronic indications in which there are lim-
ited acute symptoms, or for acute diseases with relatively mild symptoms).
Another potential situation is when the active control has not been studied
in a clinical trial in the disease under investigation but is hypothesized to
work. A third potential situation is a disease area in which the active control
is believed to confer benefit, but for whatever reason does not always show
an advantage in direct comparisons to placebo (i.e., the studies lack assay
sensitivity). An example of the first situation is a study of mild infections.
If the comparisons of the active control to placebo and the experimental treatment to placebo are made at the
full α level, there is an inflated chance that at least one of these two treatments
will falsely be considered superior to placebo, if neither is. However, this
might not be sufficient to disregard such an analysis strategy. The hypoth-
eses can be structured such that the only hypothesis used for making a con-
clusion is the comparison of the experimental treatment to placebo, and the
comparison of active control to placebo is presented for descriptive purposes
and only interpreted in the event that the primary comparison is not positive.
Testing the experimental treatment versus placebo at the full α level will con-
trol the probability that an ineffective treatment is considered effective, even
though there is an inflated chance of rejecting at least one true null hypoth-
esis. Note that a non-inferiority comparison in this case is of relatively low
interest, and a non-inferiority margin δ might not even be proposed a priori.
If a margin is proposed, a fixed sequence test would be a useful approach
to multiplicity, with the first comparison being the experimental therapy to
placebo and the second comparison being the non-inferiority comparison of
the experimental therapy to the active control. Again, the comparison of the
active control to placebo would be used to establish assay sensitivity, but will
not be included in the α-preserving multiple comparison procedure.
Concluding superiority of the experimental treatment compared with the
active control can be a secondary objective of a trial in which the primary
comparison is the experimental treatment to placebo. The comparison of the
experimental treatment to the active control will generally follow the fixed-
sequence approach, so other hypotheses are first tested and, conditional on
demonstrating sufficient benefit, the experimental treatment is compared
with the active control using the full two-sided α level. As an example, the
first comparison could be a direct comparison of the experimental treatment
versus placebo at the full α level. If this comparison shows significance favor-
ing the experimental treatment, the second comparison would be the non-
inferiority comparisons, also at the full α (or one-sided α/2) level. If this again
is positive, the third comparison might be to demonstrate that experimental
treatment is superior to the active control, again at the full α level. This third
comparison would be two sided; thus, it is possible with this strategy to dem-
onstrate that the experimental treatment is worse or better than the active
control.
Another use of an active control as a third arm in a non-inferiority trial
might be to establish favorable risk–benefit of the experimental treatment
compared with the active control. When the active control consistently dem-
onstrates efficacy compared with placebo but is also associated with consid-
erable toxicity, a less efficacious but better tolerated experimental treatment
might be preferable and therefore should be considered so future patients
can choose among treatments with various levels of efficacy and tolerability.
Statistical methodology for combining measures of efficacy and toxicity is
not commonly used, although such methods have been proposed.4–6 Therefore, such
comparisons of risk–benefit will generally be made more informally.
Inferences on Means. For continuous outcomes, inference is made on the fraction
of the active control effect (relative to placebo) that is retained by the
experimental therapy,

λ = (μE − μP)/(μC − μP)    (6.1)
The best linear unbiased estimator for the unknown quantity in the hypotheses
of Expression 6.3 is found by substituting the sample means for the population
means: ψ(μ̂) = X̄E − λoX̄C − (1 − λo)X̄P, where μ̂ is the vector of sample means.
Under the null hypothesis, with the assumption of normality and homogeneous
variances, the test can use the statistic

TBLUE = [X̄E − λoX̄C − (1 − λo)X̄P] / [σ̂ √(1/nE + λo²/nC + (1 − λo)²/nP)]    (6.4)

where σ̂ is the pooled estimate of the standard deviation and nP, nC, and nE are
the corresponding sample sizes. Under Ho, TBLUE in Equation 6.4 follows a t distribution
with nE + nC + nP − 3 degrees of freedom.
Pigeot et al.3 discussed values of λo and suggested λo = 0.8 could be appro-
priate. This implies that the experimental treatment retains 80% or more of
the efficacy of the active control. Values of λo = 0.5, or even less, may be appro-
priate depending on the indication and the amount of efficacy and toxicity
of the active control.
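A minimal sketch of the test based on Equation 6.4, with simulated data and illustrative names (assuming numpy and scipy are available):

```python
import numpy as np
from scipy.stats import t

def t_blue(x_e, x_c, x_p, lam0):
    """Three-arm retention test of Equation 6.4 (normality, equal variances)."""
    arms = [np.asarray(a, dtype=float) for a in (x_e, x_c, x_p)]
    n_e, n_c, n_p = (len(a) for a in arms)
    df = n_e + n_c + n_p - 3
    pooled_var = sum((len(a) - 1) * a.var(ddof=1) for a in arms) / df
    est = arms[0].mean() - lam0 * arms[1].mean() - (1 - lam0) * arms[2].mean()
    se = np.sqrt(pooled_var * (1/n_e + lam0**2/n_c + (1 - lam0)**2/n_p))
    return est / se, df

# Simulated example: mu_P = 0, mu_C = 10, mu_E = 9, lambda_0 = 0.8
rng = np.random.default_rng(1)
stat, df = t_blue(rng.normal(9, 5, 100), rng.normal(10, 5, 100),
                  rng.normal(0, 5, 100), lam0=0.8)
print(f"T_BLUE = {stat:.2f}, one-sided p = {t.sf(stat, df):.4f}")
```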
Hasler, Vonk, and Hothorn7 considered the continuous case under the
assumption of unequal, unknown variances. The test statistic becomes
T = [X̄E − λoX̄C − (1 − λo)X̄P] / √(sE²/nE + λo²sC²/nC + (1 − λo)²sP²/nP)

with approximate degrees of freedom given by the Satterthwaite-type formula

ν = (sE²/nE + λo²sC²/nC + (1 − λo)²sP²/nP)² /
    [sE⁴/(nE²(nE − 1)) + λo⁴sC⁴/(nC²(nC − 1)) + (1 − λo)⁴sP⁴/(nP²(nP − 1))]
Koch and Tangen8 provided a sample size formula for three-arm trials. Pigeot
et al.3 also discussed optimal allocation ratios of active control to experimental
to placebo. In general, the active control and experimental treatment should
have the same sample size. With equal sample sizes for the first two treat-
ments, the optimal allocation ratio becomes 1:1:kP. Pigeot et al. showed that
kP = (1 − λo)√(2 + 2λo) / (1 + λo²)    (6.5)
was optimal for λo < 1. In other words, the optimal ratio depends only on the
value of λo. For λo = 0.8, the ratio is 1:1:0.23, or approximately 9:9:2. For λo = 0.5,
the optimal ratio is 1:1:0.69, or approximately 3:3:2. For λo = 0.3213, the ratio
is approximately 1:1:1; whereas for even smaller values of λo, more subjects
are required in the placebo group than in the other two groups. We have
from Equation 6.5 that kP is decreasing in λo for 0 < λo < 1 and as λo increases
toward 1, kP decreases toward 0. For λo < 0.5, we recommend equal sample
sizes in all three groups, as the power loss will not be large compared to the
complexity of unequal ratios and the potential ethical concern of enrolling
more subjects in the placebo group than the other groups. In the general case
(the experimental and active control arms need not have equal sample sizes),
as shown by Pigeot et al.,3 the optimal allocation across the experimental,
active control, and placebo arms is 1:λo:1 – λo. Optimal allocation ratios when
the variances are unequal among the three arms lead to larger allocation for
the study arms with larger variances.7 The procedure described in Hasler,
Vonk, and Hothorn’s paper7 required that the ratio of variances among treat-
ments be known, something that in practice is not known precisely. A two-
stage sample size recalculation procedure for the three-arm testing problem
is described by Schwartz and Denne.9 The optimal allocation ratios for the
two-stage procedure are the same as those for the fixed sample size case.
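For example, Equation 6.5 reproduces the allocation ratios quoted above:

```python
import math

def optimal_placebo_share(lam0):
    """k_P in the optimal 1:1:k_P allocation of Equation 6.5 (lambda_0 < 1)."""
    return (1 - lam0) * math.sqrt(2 + 2 * lam0) / (1 + lam0**2)

for lam0 in (0.8, 0.5, 0.3213):
    print(lam0, round(optimal_placebo_share(lam0), 2))
# 0.8 -> 0.23 (about 9:9:2); 0.5 -> 0.69 (about 3:3:2); 0.3213 -> 1.0 (1:1:1)
```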
Inferences on Proportions. Similar methodologies have been proposed for
demonstrating that a sufficient proportion of the effect of the active con-
trol was maintained when considering binomial data rather than continu-
ous data.10,11 Letting p̂P , p̂C, and p̂E be the observed success proportions of
desired outcomes for the placebo, active control, and experimental treatment
groups, respectively, inference is made on the quantity λ, modified from
Equation 6.1 for binary data: λ = (pE − pP)/(pC − pP). The hypotheses are
expressed analogously to Expression 6.3 (Expression 6.6). Tang and Tang10
investigated, by simulation, Wald-type tests and tests that use null restricted
maximum likelihood estimates of the proportions when estimating the standard error,
with allocations to the experimental, active control, and placebo groups of 1:1:1,
2:2:1, and 3:2:1, respectively. According to their results, the simulated one-sided
type I error probabilities ranged overall from 0.043 to 0.060 with the use of the
restricted maximum likelihood estimates maintaining the desired one-sided
type I error rate of 0.05 better than the Wald’s test. The power when using the
restricted maximum likelihood estimates to estimate the standard error was
consistently slightly less than the power when using the sample proportions.
Kieser and Friede11 also investigated the type I error rate for a three-arm
non-inferiority test on proportions when using the Wald-type test and the
analogous test based on the null restricted maximum likelihood estimates of
the true proportions when estimating the standard error. Their calculations
were based on the actual probabilities from the corresponding binomial dis-
tributions, not simulations, and differed from those of Tang and Tang.10 All
cases in Tang and Tang’s study10 were considered along with additional cases.
Desired one-sided levels of α = 0.025 and 0.05 were considered with λo = 0.6
and 0.8; pP = 0.05, 0.10, . . . , 0.50; pC – pP = 0.05, 0.10, . . . , 0.95 (only those cases
where pC ≤ 1); and pE = λopC + (1 – λo)pP . The overall sample sizes were 30, 60, 90,
120, 180, 240, and 300 with allocations to the experimental, active control, and
placebo groups of 1:1:1, 2:2:1, and 3:2:1. Both procedures tended to have actual
type I error rates above the desired rates of 0.025 and 0.05. The inflation was
more pronounced for the Wald-type test. Interestingly, cases that had the greatest
actual type I error rate for the Wald-type test (as high as 0.212) had the actual
type I error rate maintained under the desired level of 0.025 or 0.05 when using
the restricted maximum likelihood estimates to estimate the standard error.
Kieser and Friede further proposed sample size calculations to achieve a
given power. Because power estimates depend on the variances, which dif-
fer under the null and alternative hypotheses, Kieser and Friede proposed
several sample size formulae. The one with the best properties gives the
overall sample size N = (1 + kE + kC)(zατ0 + zβτ1)²/ψ1², where the allocation of
subjects to the experimental, active control, and placebo groups is kE:kC:1,
ψ1 = pE,1 − λopC,1 − (1 − λo)pP,1 is the value of the tested contrast under the
alternative, and τi² = (1 − λo)²pP,i(1 − pP,i) + (λo²/kC)pC,i(1 − pC,i) + (1/kE)pE,i(1 − pE,i),
where for i = 0 the proportions are under the null hypothesis and for i = 1 the
proportions are under the alternative hypothesis. However, even this formula can be incor-
rect, so simulations are advised to confirm the power before conducting the
study. In addition, the ratio k E:kC:1 does not in general have a unique point
that maximizes power, so investigation of various values (with kC > kE > 1
often holding) is advised.
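A sketch of this sample size formula (notation as above; the proportions and allocation here are purely illustrative and, per the caution above, the result should be checked by simulation):

```python
import math
from scipy.stats import norm

def kf_total_n(p_null, p_alt, lam0, k_e, k_c, alpha=0.025, power=0.9):
    """Total N for the three-arm binary test; p_null and p_alt are (pP, pC, pE)
    under the null and alternative; allocation is k_E:k_C:1 (placebo = 1)."""
    def tau2(pp, pc, pe):
        return ((1 - lam0)**2 * pp * (1 - pp)
                + lam0**2 / k_c * pc * (1 - pc)
                + pe * (1 - pe) / k_e)
    psi1 = p_alt[2] - lam0 * p_alt[1] - (1 - lam0) * p_alt[0]
    z_a, z_b = norm.ppf(1 - alpha), norm.ppf(power)
    n = (1 + k_e + k_c) * (z_a * math.sqrt(tau2(*p_null))
                           + z_b * math.sqrt(tau2(*p_alt)))**2 / psi1**2
    return math.ceil(n)

# Null proportions on the boundary p_E = lam0*p_C + (1 - lam0)*p_P
print(kf_total_n(p_null=(0.3, 0.5, 0.46), p_alt=(0.3, 0.5, 0.55),
                 lam0=0.8, k_e=2, k_c=2))  # about 1378 subjects in total
```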
Additionally, because the hypotheses in Expression 6.2 assume that the
active control is superior to placebo, Kieser and Friede11 recommended test-
ing first that the active control is superior to placebo at the full α, and if supe-
riority is concluded, proceed to testing for non-inferiority at the full α. They
further discussed how to size the trial under this testing sequence to achieve
the desired power of concluding non-inferiority.
The inconsistent calculations reported by Tang and Tang10 and Kieser
and Friede11 suggest additional caution against planning based on direct
calculations alone, without verification by simulation.
Inferences on Time-to-Event Endpoints. For time-to-event outcomes, the retention
fraction is defined as λ = βP/E/βP/C, where βP/E and βP/C denote the log hazard
ratios of placebo to the experimental therapy and of placebo to the active
control, respectively. The standard error of the estimator β̂P/E − λoβ̂P/C is
estimated by

√(1/rE + λo²/rC + (1 − λo)²/rP)    (6.9)

where rE, rC, and rP denote the number of events in the experimental, active
control, and placebo arms, respectively. From Expressions 6.8 and 6.9, we
have the test statistic

Z = (β̂P/E − λoβ̂P/C) / √(1/rE + λo²/rC + (1 − λo)²/rP)    (6.10)

The test rejects the null hypothesis in Expression 6.7 and concludes non-
inferiority when Z > zα/2.
A similar test statistic to Equation 6.10 was used by Mielke, Munk, and
Schacht13 under the assumption of underlying exponential distributions.
There, the estimator β̂ P/E (β̂ P/C ) is equal to the difference in the natural loga-
rithms of the maximum likelihood estimators of the means of the experi-
mental (active control) and placebo arms.
For all of these three-arm non-inferiority cases, a Fieller 100(1 – α) confi-
dence interval for λ can be found by treating λo as unknown and setting the
test statistic equal to ±zα/2 (or the analogous values from the appropriate t dis-
tribution) and solving for λo.3,8 If all the values in the confidence interval are
greater than zero, superiority of the experimental arm to the placebo arm is
concluded. If all the values in the confidence interval are greater than λo, non-
inferiority of the experimental arm to the active control arm is concluded. If
all the values in the confidence interval are greater than 1, superiority of the
experimental arm to the active control arm is concluded.
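A sketch of the survival-case computation, using the form of Equation 6.10 given above (event counts and log hazard ratios here are hypothetical):

```python
import math
from scipy.stats import norm

def three_arm_survival_z(b_pe, b_pc, r_e, r_c, r_p, lam0):
    """Equation 6.10: b_pe, b_pc are estimated log hazard ratios of placebo
    vs. experimental and placebo vs. active control; r_* are event counts."""
    se = math.sqrt(1/r_e + lam0**2/r_c + (1 - lam0)**2/r_p)
    return (b_pe - lam0 * b_pc) / se

z = three_arm_survival_z(b_pe=0.38, b_pc=0.41, r_e=200, r_c=200, r_p=120,
                         lam0=0.5)
print(f"Z = {z:.2f}; conclude non-inferiority if Z > {norm.ppf(0.975):.2f}")
```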
Capturing All Possibilities of Efficacy. In determining whether the experi
mental therapy is efficacious or has adequate efficacy, the possibility that
μ C ≤ μP < μE should be included, but is not included, in the non-inferiority
inference. For a two-arm non-inferiority trial, superiority of the experimen-
tal therapy to the active control therapy is intended to imply non-inferiority
and that the experimental therapy is effective (i.e., due to the assumption that
the control therapy is “active”). Although the possibility that μP < μ C < μE is
included in the alternative hypothesis in Expression 6.3 by having the over-
all assumption that μP < μ C and that μE – λoμ C – (1 – λo)μP > 0 for some pre-
specified 0 ≤ λo ≤ 1, the possibility that μ C ≤ μP < μE is excluded. The possibility
of μ C ≤ μP < μE accounts for one-sixth of the overall, unrestricted parameter
space for (μP, μ C, μE), and the estimation of μE – λoμ C – (1 – λo)μP that is done,
including the modeling of the uncertainty in that estimation, does not pre-
clude μ C ≤ μP < μE. Order restricted or constrained inference is not done. The
aforementioned test procedures do not estimate or model the estimation of
μE – λoμ C – (1 – λo)μP under the restriction that μP < μ C.
Having as the alternative hypothesis

Ha: {(μP, μC, μE): μE − λoμC − (1 − λo)μP > 0, μP < μC} ∪ {(μP, μC, μE): μC ≤ μP < μE}    (6.11)

would capture all of these possibilities of efficacy.
Under the Jeffreys prior, for a normal random sample x1, x2, ..., xn, the joint
posterior density of the mean μ and variance θ satisfies

g(μ, θ | x1, x2, ..., xn) ∝ θ^(−1/2) exp{−(μ − x̄)²/(2θ/n)} × θ^(−n/2−1) exp{−S/(2θ)},
where S = Σᵢ(xi − x̄)²    (6.13)

We see from Expression 6.13 that the joint density factors into the product
of an inverse gamma marginal distribution for θ and a normal conditional
distribution for μ given θ. The inverse gamma distribution has shape and
scale parameters equal to n/2 and S/2, respectively, with a mean equal to
S/(n − 2) and a variance equal to 2S²/[(n − 2)²(n − 4)]. Note that θ has an
inverse gamma distribution with parameters n/2 and S/2 if and only if 1/θ has
a gamma distribution with parameters n/2 and 2/S, with mean equal to n/S.
If the posterior probability exceeds the preselected threshold (or in the last case both prob-
abilities exceed the threshold), non-inferiority of the experimental therapy
would be concluded.
An alternative way of calculating posterior probabilities in this case is pro-
vided by Gamalo et al.16 They discussed the use of a generalized p-value (i.e.,
the posterior probability of a one-sided null hypothesis) and a generalized
confidence interval (i.e., a credible interval) for μE – λoμC – (1 – λo)μP when the
variances are unknown and are not assumed equal. For arm i, i = C, E, P, let
ni denote the number of subjects on that arm, x̄i denote the observed sample
mean, and si denote the observed standard deviation (i.e., si² has the form
Σⱼ(xj,i − x̄i)²/(ni − 1), summing over j = 1, ..., ni). The posterior distributions
of the means μC, μE, and μP are independent, where for i = C, E, P the posterior
distribution of μi is that of

x̄i + Ti si/√ni

where Ti has a t distribution with ni − 1 degrees of freedom.
Similar results are reported from applying the procedure described by
Gamalo et al.16 as with applying the procedure of Hasler, Vonk, and Hothorn7
in testing the hypotheses in Expression 6.3. The advantage of the Bayesian
procedure is that the uncertainty of whether the active control is superior to placebo
(i.e., μP < μC) and the possibility that μC ≤ μP < μE can be incorporated directly into
the testing procedure. That is, posterior probabilities like those in (e) and (f)
can be calculated. Gamalo et al.16 validated the type I error rate of their pro-
cedure in testing the hypotheses in Expression 6.3 with simulations based on
a model that includes modeling the variances.
The hypotheses given in Expression 6.12 can also be tested either based
on the posterior probability of μE > max{μP,μ C – δ} or based on min{P(μE > μP),
P(μE > μ C – δ)}. Comparing min{P(μE > μP), P(μE > μ C – δ)} to a threshold of 0.975
would be analogous to two separate one-sided tests at level 0.025 that the
experimental therapy is superior to placebo (i.e., μE > μP) and that the experi-
mental therapy is noninferior to the active control therapy (i.e., μE > μ C – δ).
The two-arm versions of these Bayesian approaches are discussed in
Section 12.2.4, along with examples that calculate posterior probabilities and
credible intervals for the difference in means.
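A sketch of the posterior simulation for means, using the t representation above; the summary statistics, threshold, and margin δ are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(7)
B = 100_000  # posterior draws

def mu_draws(xbar, s, n):
    """Posterior draws of mu_i: xbar + T * s / sqrt(n), T ~ t(n - 1)."""
    return xbar + rng.standard_t(n - 1, size=B) * s / np.sqrt(n)

mu_e = mu_draws(9.2, 5.1, 100)   # experimental (hypothetical summaries)
mu_c = mu_draws(10.0, 4.8, 100)  # active control
mu_p = mu_draws(1.1, 5.0, 50)    # placebo

lam0, delta = 0.8, 2.0
retained = mu_e - lam0 * mu_c - (1 - lam0) * mu_p > 0
prob_f = np.mean((retained & (mu_p < mu_c)) | ((mu_c <= mu_p) & (mu_p < mu_e)))
prob_two_tests = min(np.mean(mu_e > mu_p), np.mean(mu_e > mu_c - delta))
print(prob_f, prob_two_tests)  # compare each with a threshold such as 0.975
```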
Inferences on Proportions. For each arm we will consider a beta prior dis-
tribution for the probability of a success. For a random sample of n binary
observations where x are successes, a beta prior distribution with parameters
α and β for p, the probability of success, leads to a posterior distribution for
p with parameters α + x and β + n – x. A Jeffreys prior distribution has α =
β = 0.5.
For proportions where a “success” is desirable, the probability in (f) is the
posterior probability that pE – λopC – (1 – λo)pP > 0, pP < pC or pC ≤ pP < pE. If
that probability exceeds some threshold (e.g., 0.975), then the experimental
therapy is concluded to be noninferior to the active control therapy and to be
efficacious.
A noninformative prior for (θP/E, θP/C) leads to a joint posterior distribution
for (θP/E, θP/C) with mean (θ̂P/E, θ̂P/C), variances σ²P/E and σ²P/C, and correlation ρ.
Unlike the estimator of the standard error of a sample mean from a normal random
sample, the estimators of standard error for a log-hazard ratio are quite stable
in the vast majority of applications. Likewise, the estimator of ρ is fairly
stable. Therefore, for the Bayesian model, with slight crudeness, we use estimates
of σ²P/E and σ²P/C and of the correlation ρ as the true values. A parallel to using
the test statistic in Equation 6.10 would use 1/rP + 1/rE, 1/rP + 1/rC, and
1/√((1 + rP/rC)(1 + rP/rE)) as the values for σ²P/E, σ²P/C, and ρ. Then the posterior
probability of θP/E − λoθP/C ≤ 0 would equal exactly the one-sided p-value from
using the test statistic in Equation 6.10.
For time-to-event endpoints where the event is undesirable (i.e., longer
times are more desirable), the probability in (f) is the posterior probability
that θ P/E – λo θ P/C > 0, θ P/C > 0, or θ P/C ≤ 0 < θ P/E. If that probability exceeds
some threshold (e.g., 0.975), then the experimental therapy is concluded to
be noninferior to the active control therapy and to be efficacious. The prob-
ability in (e) is the posterior probability of the alternative hypothesis given in
Expression 6.7. As with means and proportions, the analogs of the frequen-
tist tests would be based on P(θ P/E – λo θ P/C > 0), or P(θ P/E – λo θ P/C > 0, θ P/C > 0),
or both P(θ P/C > 0) and P(θ P/E – λo θ P/C > 0). If the posterior probability exceeds
the preselected threshold (or in the last case both probabilities exceed the
threshold), non-inferiority of the experimental therapy would be concluded.
Example 6.1 illustrates the use of a frequentist and a Bayesian method in a
three-arm testing involving proportions.
Example 6.1
This example uses data from a three-arm trial of simethicone (the experimental
therapy), cisapride (the active control), and placebo that was analyzed by Tang and
Tang.10 The endpoint considered here is the reporting of adverse events, for which the
motivation for doing non-inferiority testing is not clear, and we do not recommend
doing non-inferiority testing in this fashion. For this example, we will assume that
the greater the reporting rate of adverse events the better. Adverse events were
reported in 7 of 61 patients randomized to placebo, 10 of 59 patients randomized
to cisapride (the active control), and 12 of 58 patients randomized to simethicone
(the experimental therapy). For testing the hypotheses in Expression 6.6 with λo =
0.8, the one-sided p-value = 0.234 for the Wald-type test as given by Tang and
Tang.10 It should be noted that the 95% confidence intervals for the difference in
rates between the simethicone and placebo arms, and between the cisapride and
placebo arm are –0.039 to 0.224 and –0.070 to 0.179, respectively. Thus, neither
the active control nor experimental therapy demonstrated a higher underlying
reporting rate of adverse events. Also, for every –∞ < λo < ∞, the value of the Wald-
type test statistic is between –1.96 and 1.96. Therefore, ignoring that the active
control was not demonstrated to be “superior” to placebo, the Fieller 95% confi-
dence interval for the retention fraction is –∞ to ∞.
Jeffreys’ prior distributions were used for each of pE, pC, and pP. Posterior prob-
abilities and credible intervals were approximated from 100,000 simulations. The
simulated 95% credible intervals for the difference in rates between the simethi-
cone and placebo arms, and between the cisapride and placebo arm were similar
to the Wald’s 95% confidence intervals and are (–0.038 to 0.224) and (–0.070
to 0.180), respectively. The simulated 95% credible interval for the difference in
rates between the simethicone and cisapride arms is (–0.104 to 0.178). Simulated
posterior probabilities of interest are given in Table 6.1.
Note that, in (f), pE > pP ≥ pC implies pE – 0.8pC – 0.2pP > 0. From Table 6.1, the
simulated posterior probability of pE – 0.8pC – 0.2pP > 0 and pC > pP, or pE > pP ≥
pC equals 0.738, which is far smaller than 0.975. The experimental arm has not
demonstrated the combination of non-inferiority and efficacy (i.e., adverse events
reporting rates greater than placebo). The direct analog of the one-sided p-value
of .234 of the Wald-type test in testing the hypotheses in Expression 6.6 is the
simulated posterior probability for pE – 0.8pC – 0.2pP > 0 of 0.763 (i.e., the simu-
lated posterior probability of the null hypothesis is 0.237 = 1 – 0.763). However, in
2.5% of the simulations, pE – 0.8pC – 0.2pP > 0 and pP > pE > pC. The uncertainty
that pP > pE > pC is not accounted for in the Wald-type test of the hypotheses in
Expression 6.6.
TABLE 6.1
Simulated Posterior Probabilities of Interest
Event Simulated Posterior Probability
(a) pE > pP 0.916
(b) pC > pP 0.806
(c) pE > pC 0.696
(d) pE > max{pC,pP} 0.667
(e) pE – 0.8pC – 0.2pP > 0 and pC > pP 0.589
(f) pE – 0.8pC – 0.2pP > 0 and pC > pP, or pE > pP ≥ pC 0.738
pE – 0.8pC – 0.2pP > 0 and pP > pE > pC 0.025
pE – 0.8pC – 0.2pP > 0 0.763
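The posterior probabilities in Table 6.1 can be approximated in a few lines; with a finite number of draws the third decimal will wobble slightly:

```python
import numpy as np

rng = np.random.default_rng(2011)
B = 100_000

# Jeffreys Beta(0.5, 0.5) priors give Beta(0.5 + x, 0.5 + n - x) posteriors
p_p = rng.beta(0.5 + 7, 0.5 + 54, B)   # placebo: 7 of 61
p_c = rng.beta(0.5 + 10, 0.5 + 49, B)  # cisapride: 10 of 59
p_e = rng.beta(0.5 + 12, 0.5 + 46, B)  # simethicone: 12 of 58

retained = p_e - 0.8 * p_c - 0.2 * p_p > 0
print("(a)", np.mean(p_e > p_p))                       # ~0.916
print("(e)", np.mean(retained & (p_c > p_p)))          # ~0.589
print("(f)", np.mean((retained & (p_c > p_p))
                     | ((p_e > p_p) & (p_p >= p_c))))  # ~0.738
```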
References
1. Koch, G.G., Comments on ‘current issues in non-inferiority trials’ by Thomas R.
Fleming, Stat. Med., 27, 333–342, 2008.
2. Temple, R. and Ellenberg S.S., Placebo-controlled trials and active-controlled tri-
als in the evaluation of new treatments: Part 1. Ethical and scientific issues, Ann.
Intern. Med., 133, 455–463, 2000.
3. Pigeot, I. et al., Assessing non-inferiority of a new treatment in a three-arm clini-
cal trial including a placebo, Stat. Med., 22, 883–899, 2003.
4. Letierce, A. et al., Two-treatment comparison based on joint toxicity and efficacy
ordered alternatives in cancer trials, Stat. Med., 22, 859–868, 2003.
5. Jennison, C. and Turnbull, B.W., Group sequential tests for bivariate response:
Interim analyses of clinical trials with both efficacy and safety endpoints,
Biometrics, 49, 741–752, 1993.
6. Thall, P.F. and Cheng, S.-C., Treatment comparisons based on two-dimensional
safety and efficacy alternatives in oncology trials, Biometrics, 55, 746–753, 1999.
7. Hasler, M., Vonk, R., and Hothorn, L.A., Assessing non-inferiority of a new
treatment in a three-arm trial in the presence of heteroscedasticity, Stat. Med.,
27, 490–503, 2008.
8. Koch, G.G. and Tangen, C.M., Nonparametric analysis of covariance and its role
in non-inferiority clinical trials, Drug Inf. J., 33, 1145–1159, 1999.
9. Schwartz, T.A. and Denne, J.S., A two-stage sample size recalculation procedure
for placebo- and active-controlled non-inferiority trials, Stat. Med., 25, 3396–3406,
2006.
10. Tang, M.-L. and Tang, N.-S., Tests of non-inferiority via rate difference for three-
arm clinical trials with placebo, J. Biopharm. Stat., 14, 337–347, 2004.
11. Kieser, M. and Friede, T., Planning and analysis of three-arm non-inferiority tri-
als with binary endpoints, Stat. Med., 26, 253–273, 2007.
12. Farrington, C.P. and Manning, G., Test statistics and sample size formulae for
comparative binomial trials with null hypothesis of non-zero risk difference or
non-unity relative risk, Stat. Med., 9, 1447–1454, 1990.
13. Mielke, M., Munk, A., and Schacht, A., The assessment of non-inferiority in a
gold standard design with censored, exponentially distributed endpoints, Stat.
Med., 27, 5093–5110, 2008.
14. Koch, A. and Röhmel, J., Hypothesis testing in the ‘gold standard’ design for
proving the efficacy of an experimental treatment relative to placebo and a refer-
ence, J. Biopharm. Stat., 14, 315–325, 2004.
15. Ghosh, P. et al., Assessing non-inferiority in a three-arm trial using the Bayesian
approach, Technical report, Memorial Sloan-Kettering Cancer Center, 2010.
16. Gamalo, M. et al., A generalized p-value approach for assessing non-inferiority
in a three-arm trial, Stat. Methods Med. Res. Published online February 7, 2011.
7.1 Introduction
Multiple comparisons pose a problem in any clinical trial. There are many
aspects to this, including exploring multiple treatment arms, multiple effi-
cacy measurement endpoints, and multiple timepoints, but all lead to the
same problem: the chance of falsely concluding efficacy is inflated without
proper recognition of multiplicity.
In non-inferiority testing, the roles of the null and alternative hypoth
eses are in some ways reversed, which can cause confusion at first glance. In
non-inferiority testing, the type I error is the probability of concluding non-
inferiority when the active control is markedly superior to the experimental
treatment; in superiority testing, the type I error is the probability of con-
cluding superiority of one treatment when the effects of the treatments are
identical. This may lead to some confusion about the interpretation of type I
and type II errors. However, when the type I error is properly recognized as
the probability of rejecting a null hypothesis that is true, and the type II error
is the probability of not rejecting a null hypothesis that is false, the confusion
dissipates. This is the same in non-inferiority or superiority testing.
Control of the type I error rate can be defined in different ways. The most
common for clinical trials is control of the familywise error (FWE) rate—the
probability of rejecting at least one true null hypothesis. In the case of testing
multiple endpoints for non-inferiority, this means concluding non-inferiority
at least one time when non-inferiority is not true. Control in the strong sense
requires that the FWE is controlled at the claimed α level or less for every
possibility in the parameter space (i.e., no matter which null hypotheses are
true and which are false). This is in contrast to control in the weak sense,
which requires that the FWE is controlled at or below the claimed level only
when all null hypotheses are true. Control of the FWE in the strong sense is
most commonly used in a regulatory setting. Thus, in the rest of this chapter,
we do not continually state “in the strong sense” when we refer to control of
the FWE, although this is implied. Other definitions of type I error rate can
also be considered, including the comparisonwise error rate (CWE: controlling
the type I error rate of each comparison separately, without adjustment for
multiplicity).
Example 7.1
With such an approach, one primary endpoint becomes the most important
primary endpoint, and other endpoints are not even considered unless non-
inferiority is demonstrated for the first endpoint. If non-inferiority is dem-
onstrated on an endpoint, the next endpoint (again from the prespecified
ordering) will be tested. It may seem illogical that a less important endpoint
that is called “primary” might not be tested at all, depending on the results
of the previous comparisons. If this is a concern, an alternative is to only
call the first primary endpoint “primary” and label other endpoints “sec-
ondary,” by placing them in a separate family of endpoints. This change in
labels has no impact on the testing process, the power, or the interpretation
of results.
An obvious alternative to the fixed sequence strategy, to avoid some of the
problems mentioned above, is to save some of the α for subsequent testing, as
in the fallback test described earlier. With the fallback, all comparisons can
be considered even if one or more endpoints do not result in a conclusion of
non-inferiority.
The Holm procedure, described earlier, can also be used for a single family
of comparisons, and will control the FWE.
Hochberg8 proposed a procedure based on the Holm procedure, but using
a step-up rather than a step-down approach. That is, the null hypothesis is
rejected and non-inferiority is concluded for the endpoint associated with
the largest p-value for which p(j) ≤ α/(J – j + 1), and for all endpoints with
smaller p-values. By definition, this will include all endpoints for which
the Holm procedure concludes non-inferiority, and maybe more; thus, the
Hochberg procedure is uniformly more powerful than the Holm procedure.
Again, caution must be used as the Hochberg procedure does not always
control the FWE in the strong sense.
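For a single family of one-sided non-inferiority p-values, the two step procedures can be sketched as follows (the p-values are illustrative):

```python
def holm(pvals, alpha=0.05):
    """Holm step-down: reject while p(j) <= alpha / (J - j + 1), then stop."""
    order = sorted(range(len(pvals)), key=lambda i: pvals[i])
    J, rejected = len(pvals), set()
    for j, i in enumerate(order, start=1):
        if pvals[i] > alpha / (J - j + 1):
            break
        rejected.add(i)
    return rejected

def hochberg(pvals, alpha=0.05):
    """Hochberg step-up: find the largest j with p(j) <= alpha / (J - j + 1)
    and reject that endpoint and all endpoints with smaller p-values."""
    order = sorted(range(len(pvals)), key=lambda i: pvals[i])
    J = len(pvals)
    for j in range(J, 0, -1):
        if pvals[order[j - 1]] <= alpha / (J - j + 1):
            return set(order[:j])
    return set()

p = [0.030, 0.013, 0.041]
print(holm(p), hochberg(p))  # Holm rejects {1}; Hochberg rejects {0, 1, 2}
```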
More generally, a multiple comparison procedure that is a closed test-
ing procedure will control the FWE in the strong sense.9 A closed testing
procedure considers all possible nonempty subsets of hypotheses. With J
endpoints, there will be Σₖ₌₁ᴶ (J choose k) = 2ᴶ − 1 subsets to consider. Within each
subset, the corresponding intersection null hypothesis is tested at level α. If a
Bonferroni test with weights α1, ..., αJ (summing to α) is used within each subset,
the error rate for a given subset will be bounded above by Σₖ₌₁ᴶ αₖ I(k), where
I(k) is an indicator function that equals 1 if endpoint k is in the subset and 0
otherwise.
subset containing all endpoints. For this reason, it is always possible to find
a procedure that is more powerful than the Bonferroni procedure for a single
family of hypotheses but still controls the FWE in the strong sense.
FIGURE 7.1
A testing sequence involving two endpoints.
After non-inferiority is concluded for the primary endpoint, superiority for the
primary endpoint and non-inferiority for the secondary endpoint are each
tested at a one-sided level of α/2 (or otherwise such that the levels sum to α).
If both are rejected, then superiority for the secondary endpoint can be tested
at the one-sided level of α; if only non-inferiority for the secondary endpoint
is concluded, then superiority for the secondary endpoint can be tested at the
one-sided level of α/2; otherwise, superiority for the secondary endpoint can-
not be tested. However, further testing would cease if non-inferiority for the
primary endpoint was not concluded, or if either superiority for the primary
endpoint or non-inferiority for the secondary endpoint was not concluded.
Proof of the control of FWE is shown by considering the closure of all
potential hypotheses. In the example, the closure contains 15 nonempty sub-
sets, but the shortcut illustrated in Figure 7.1 provides an equivalent test.
This method can be expanded to a larger number of endpoints, including
co-primary and co-secondary endpoints.
References
1. Berger, R.L. and Hsu, J.C., Bioequivalence trials, intersection–unions tests and
equivalence confidence sets, Stat. Sci., 11, 283–319, 1996.
2. Berger, R.L., Multiparameter hypothesis testing and acceptance sampling,
Technometrics, 24, 295–300, 1982.
3. Holm, S., A simple sequentially rejective multiple test procedure, Scand. J. Stat.,
6, 65–70, 1979.
4. Dunnett, C., New tables for multiple comparisons with a control, Biometrics, 20,
482–491, 1964.
5. Wiens, B.L., A fixed sequence Bonferroni procedure for testing multiple end-
points, Pharm. Stat., 2, 211–215, 2003.
6. Wiens, B.L. and Dmitrienko, A., The fallback procedure for evaluating a single
family of hypotheses, J. Biopharm. Stat., 15, 929–942, 2005.
7. O’Neill, R.T., Secondary endpoints cannot be validly analyzed if the primary
endpoint does not demonstrate clear statistical significance, Control. Clin. Trials,
18, 550–556, 1997.
8. Hochberg, Y., A sharper Bonferroni procedure for multiple tests of significance,
Biometrika, 75, 800–802, 1988.
9. Marcus, R., Peritz, E., and Gabriel, K.R., On closed testing procedures with
special reference to ordered analysis of variance, Biometrika, 63, 655–660, 1976.
8.1 Introduction
Issues involving missing data are often linked with issues involving the choice
of the proper analysis set. However, it is also true that missing data issues
are often confused with issues involving the proper choice of analysis set.
Consider a randomized, double-blind, two-arm clinical trial in which some
subjects drop out at randomization before undergoing study therapy or any
other therapy, and have no follow-up for the study endpoint. Because these
subjects should be fairly distributed between arms, they may be excluded
from the analysis without compromising the integrity of the randomization.
Whether to include such subjects in the analysis is an analysis set issue. If
these subjects were included in the analysis [i.e., as in an intent-to-treat (ITT)
analysis], the imputation or representation of their values for the endpoint
should not depend on treatment arms (since such subjects should have been
fairly distributed between arms) and should consider the actual adherence
or nonadherence to therapy. A variation of this “imputation under the null”
can be used for non-inferiority trials and will be discussed later.
According to the ITT principle, all subjects should be followed to the end-
point or the end of study with the comparisons based on the “as-randomized”
treatment groups (i.e., based on the ITT population). This allows for an
unbiased analysis. Missing data violate the ITT principle and can under-
mine both the integrity of the randomization and confidence in the results.
Additionally, selective follow-up of subjects can weaken the quality of the
data from the high quality expected from a randomized clinical trial to that
obtained from an observational study.
The purpose of accounting for missing data is not to retrospectively change
the design or objective of the clinical trial or the adherence to therapy or the
protocol of any subject. Rather, the purpose is to account for all subjects with
respect to the ITT principle. For a subject with a missing outcome, the objec-
tive is to adequately represent the missing outcome based on what would
have been the expected outcome had the outcome been measured.
Example 8.1
When there may be a relationship between the missingness and the observed
responses, efforts should focus on investigating this relationship.
Informal visual approaches can often provide a preliminary assessment
of the relationship between the missingness and the observed responses.
A graphical approach to assessing this relationship with longitudinal data
is to graph over time the outcomes for those who have complete data and
those who discontinue at various time points (including only data up to
the point at which the discontinuation occurred, of course). Subsetting the
subjects into a small number of groups (three to five groups often provide
informative results) based on meaningful study benchmarks will allow a
quick visual assessment of potential relationships. Such graphs can provide
assessment of whether the trajectory of response or baseline values differed
between those with complete data and those with incomplete data.
Inferential assessments of the relationship between missingness and the
observed data are also possible. Under the null hypothesis that data are
MCAR, various test statistics can be developed. Simplistically, testing for
differences in early response between subjects with complete data and sub-
jects with incomplete data can provide a test – albeit, perhaps, not an opti-
mal one. Incorporating covariates into this assessment can improve the test.
Assessing the correlation between the presence of outcome measurements and
baseline characteristics and, after accounting for this relationship, between the
presence of outcome measurements and the previous measurement will provide
insight. Logistic regression
can be used for the multivariate investigations. This will give information on
whether subjects with higher or lower outcomes, or a better or worse interme-
diate response, are more likely to have missing data at the subsequent assess-
ment, something allowed under MAR but not under MCAR. Other tests
specifically developed for missing data assessments are available as well.10
The ability of such tests to differentiate the type of missingness is unclear due
to lack of power, but they can be used to demonstrate the lack of MCAR.
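As a sketch of such a multivariate assessment (the data and variable names are hypothetical; assumes pandas and statsmodels):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({"baseline": rng.normal(50, 10, n),
                   "prev_response": rng.normal(45, 12, n)})
# Simulate missingness that depends on the previous observed response
# (missing at random, but not completely at random)
logit = -2.5 + 0.05 * (50 - df["prev_response"])
df["missing"] = (rng.uniform(size=n) < 1 / (1 + np.exp(-logit))).astype(float)

X = sm.add_constant(df[["baseline", "prev_response"]])
fit = sm.Logit(df["missing"], X).fit(disp=0)
print(fit.params)   # a clear prev_response effect argues against MCAR
print(fit.pvalues)
```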
Distinguishing between MAR and MNAR is more difficult. By defini-
tion, MNAR data are missing in part because of the unobserved value—but
without observing the value, it can be difficult to establish this association.
Heuristic arguments can be used, more convincingly if they argue for MNAR
than if they argue against MNAR. Other markers of disease progression or
regression can be used if collected after discontinuation. In the situation that
subjects who discontinue (or otherwise have missing data) tend to experi-
ence events at a higher or lower rate after discontinuation than subjects who
continue, one can conclude MNAR. Subjects who discontinue because of
adverse events might also be considered to have data that are MNAR, espe-
cially if the specific adverse events are related to or affect the outcome mea-
surement (such as when the outcome is quality of life). This can be handled
more easily in the analysis than MNAR in general, but if discontinuation due
to reported adverse events is common, it might also be common that discon-
tinuation occurs because of adverse events that are not reported, which is
harder to handle.
therapy has inferior or equal efficacy to the active control). Even when esti-
mation is unbiased, LOCF may understate the standard error surrounding
the point estimate, leading to test procedures that have an inflated type I
error rate.
Koch11 recommended incorporating imputation under the null hypothesis
for missing data in a non-inferiority trial. Therefore, a penalized imputa-
tion might be considered. When a subject discontinues, subsequent missing
observations can be imputed as the last observation plus or minus a small
amount to penalize the analysis for the missing data. For a non-inferiority
margin of δ > 0 with larger values being more preferred, possible ways of
performing imputation under the non-inferiority null hypothesis include
first performing an imputation under the assumption of no treatment differ-
ence and then either (1) subtract δ from each imputed outcome in the experi-
mental group, or (2) add δ to each imputed outcome in the control group, or
(3) subtract δ/2 from each imputed outcome in the experimental group and
add δ/2 to each imputed outcome in the control group (or some variation
thereof). Per Koch11 for binary data (i.e., “0” for a failure and “1” for a success)
with a fixed margin on the difference in proportions, the imputed outcomes
should be inclusively between 0 and 1. Imputation under the non-inferiority
null hypothesis is particularly important when the evaluation of the active
control effect and the selection of the non-inferiority margin did not consider
the possibility of missing data.
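A sketch of option (1) for a binary endpoint — impute under no treatment difference, then penalize imputed experimental-arm outcomes by δ, truncating to [0, 1] as noted above (names and data illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)

def impute_penalized(y, p_null, delta, penalize):
    """Impute missing binary outcomes (np.nan) at the pooled rate p_null;
    if penalize, subtract delta from imputed values and clip to [0, 1]."""
    y = np.asarray(y, dtype=float)
    miss = np.isnan(y)
    y[miss] = rng.binomial(1, p_null, miss.sum())
    if penalize:
        y[miss] = np.clip(y[miss] - delta, 0.0, 1.0)
    return y

y_exp = [1, 0, np.nan, 1, np.nan, 1]
y_ctl = [1, 1, 0, np.nan, 1, 0]
observed = [v for v in y_exp + y_ctl if not np.isnan(v)]
p_null = float(np.mean(observed))  # common success rate under the null
print(impute_penalized(y_exp, p_null, delta=0.10, penalize=True))
print(impute_penalized(y_ctl, p_null, delta=0.10, penalize=False))
```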
Incorporating an imputation under the non-inferiority null hypothesis can
help alleviate some problems inherent in disregarding missing data when
establishing the non-inferiority margin. Consider a continuous outcome
for an endpoint of interest where the conditional effect of the active control
among subjects for which the endpoint is measured is 50 and the conditional
effect of the active control is zero among subjects for which the endpoint is
not measured. If the endpoint is not measured in 20% of subjects in both
arms, then the true effect of the active control is 40. Suppose that the evalu-
ation of the active control effect in previous clinical trials treated missing
data as ignorable and M1, the efficacy margin, was set at 50. Suppose the
non-inferiority trial had 20% of subjects on each arm not measured for
the primary endpoint. The true effect of the active control in the setting of
the non-inferiority trial for the ITT population is 40. The efficacy margin
can be adjusted retrospectively to account for the missing data. However,
there may be a preference to pre-specify the margin without later chang-
ing it. Employing an imputation under the non-inferiority null hypothe-
sis (based on a true mean difference of 50) without altering the margin can
achieve the same or similar results as altering the margin and ignoring the
missing data.
Another simple imputation approach, with similar advantages and disad-
vantages to LOCF, is to use the mean value of all observed values (or other
estimate of central tendency) to replace missing values. This might be the
mean across an individual’s observed assessments, or the median across all
subjects with observed values at that assessment. Multiple imputation is
most helpful when the process used to impute data is different from the
process used to analyze data, such as by using different covariates to predict
missingness than are used in the analysis model, and most useful for large
studies. This could happen when information used to explain missingness
occurs after discontinuation, such as death predicting (backwards) study dis-
continuation within a short period before the death. Such information may
not be included in the analysis model, but can easily be incorporated into
the imputation procedure. If multiple imputation is planned, collection of
auxiliary data (such as other measures of efficacy or follow-up after discon-
tinuation, subject to ethical constraints) should be planned. Treating subjects
with missing data the same way as subjects with complete data may bias
the conclusion toward no difference. For a non-inferiority analysis, a penal-
ized imputation approach can also be considered with multiple imputation.
However, little is known about how this approach will affect an analysis in
the presence of non-ignorable missingness.
A trite but true piece of advice is that missing data are best handled by
prevention. Missing data are most easily handled when almost all data are
known. Most sensitivity analyses, even when missing data are handled or
imputed quite differently, will give very similar conclusions if the amount of
missing data is small. This requires planning before the study starts to allow
collection of auxiliary data, motivate study subjects and investigators to com-
ply with the protocol, and perhaps gain approval from ethics committees to
allow collection of data even after a subject “withdraws” from some study
procedures. Efforts should be directed toward obtaining data on all subjects,
even those who discontinue treatment, to best assess treatment effects when
subjects are noncompliant with certain aspects of the study plan.
In summary, there is not one clear method for handling missing data that
are MAR but not MCAR. Each method has promise, but each must be used
with caution. Simple methods that focus on unbiased estimation may inflate
the type I error rate by underestimating the variance. For example, in many
settings, dropouts are likely to have poorer outcomes than subjects who
remain under observation for outcomes. Then, for continuous outcomes,
treating the missing outcome as ignorable or using LOCF (and other single
imputation methods) treats their unknown outcomes as being more central-
valued, and thus will then lead to an underestimated subject-level standard
deviation. More complex methods can address the estimation of the variance
of the estimated treatment difference. Various methods should be consid-
ered for most studies, with one simple method pre-specified as primary and
the other methods pre-specified as sensitivity or exploratory analyses.
Example 8.2
Among subjects who did not complete the study, nearly equal numbers in the two
groups discontinued for administrative reasons such as inconvenient study visits.
Many more subjects in the experimental group discontinued because of adverse
events or lack of efficacy (14.0% vs. 1.3%). It is quite possible that subjects who discontinued because of
adverse events had a fairly large antihypertensive effect, enough to affect the group
mean by 1 mmHg. So, the discontinuation is related to outcome, and implies that
the missing data are not MCAR. In addition, subjects who do not complete the
study do not receive benefit at week 8, so the inclusion of such data in the primary
analysis via LOCF is questionable (for superiority or non-inferiority).
A pattern-mixture model composite approach, such as that suggested by Shih and
Quan,14 does not fully confirm the conclusions of the primary analysis, owing both to the
failure to conclude non-inferiority on the antihypertensive effect and to the increase
in discontinuations for adverse reasons. Therefore, the conclusion of the primary analysis
is called into question by the sensitivity analysis because of the impact of missing data.
Table 8.1
Summary of Trial Results and Reasons for Discontinuation

                                 Active Control    Experimental
Overall study population
  N                                   150               150
  Mean                               –8.0              –8.2
  Standard deviation                 14.2              13.8
  Confidence interval for the difference (active control minus experimental): (–3.0, 3.4)
Completers
  N                                   140               120
  Mean                               –8.0              –7.2
  Standard deviation                 14.2              13.8
  Confidence interval for the difference (active control minus experimental): (–4.2, 2.6)
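The intervals shown are consistent with normal-approximation confidence intervals for the difference in means (active control minus experimental) computed from the summary statistics; a small sketch, assuming a 95% level:

```python
import math

def diff_ci(m1, s1, n1, m2, s2, n2, z=1.96):
    """Normal-approximation CI for the difference in means (group 1
    minus group 2) computed from summary statistics."""
    d = m1 - m2
    se = math.sqrt(s1**2 / n1 + s2**2 / n2)
    return d - z * se, d + z * se

print(diff_ci(-8.0, 14.2, 150, -8.2, 13.8, 150))  # overall: about (-3.0, 3.4)
print(diff_ci(-8.0, 14.2, 140, -7.2, 13.8, 120))  # completers: about (-4.2, 2.6)
```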
The choice of analysis set in non-inferiority trials remains an evolving issue. In this section we will define several analysis sets used in
clinical trials and discuss their advantages and disadvantages.
Depending on the method of analysis, use of the ITT set can provide unbi-
ased estimation and comparisons of the endpoints in the setting of the clini-
cal trial. If the clinical trial setting represents medical practice, subjects in
the trial are randomly selected from subjects in medical practice, and full fol-
low-up is obtained, then use of the ITT set will provide unbiased estimates
and comparisons on the use of the treatments in practice. Limitations to this
include selecting subjects who are not representative of medical practice and
accounting for missing data – especially from subjects who discontinue the
trial but would have continued to be observed by a physician in medical
practice.
Part of the ITT principle is that subjects be followed until an event or end
of study. A subject is said to be lost to follow-up if the individual is not fol-
lowed to the endpoint or to the end of study. Loss to follow-up can rarely be
assumed to be random; that is, subjects lost to follow-up generally differ from
subjects not lost to follow-up, particularly with respect to the distribution
of the primary outcome. Even if the numbers that are lost to follow-up are
similar between arms, the unobserved outcomes may not be similar between
arms. It is much more important to keep loss-to-follow-up rates to a mini-
mum than to have similar numbers that are lost to follow-up. The greater the
amount of loss to follow-up, the greater the potential of substantial bias, even
if rates are similar between groups.
One rationale for using the PP analysis set is that it may more closely fol-
low the scientific hypothesis that a subject with the disease of interest, who
receives a particular treatment, will exhibit improvement compared with a
subject not receiving that treatment. If a subject does not have the disease
under study, or does not receive the treatment, the subject is not part of the
target population for examining the scientific hypothesis.
Another reason for the recommendation to use the PP set for non-inferiority
analyses is that deviations from the protocol in randomization, conduct or
evaluation might make the outcomes for the treatment groups more simi-
lar. In other words, sloppiness in trial conduct or other deviations from the
planned procedures may bias the results toward no difference between the
arms. In an extreme case, all subjects on both treatment arms could discon-
tinue treatment immediately upon randomization, resulting in all subjects
receiving the same treatment. Producing outcomes that are more similar
between the groups has the effect of making a superiority analysis con-
servative, since no difference between treatments is in the null hypothesis.
However, producing outcomes that are more similar between the groups
might have the effect of making a non-inferiority analysis anticonservative,
since no difference between treatments is in the alternative hypothesis.
Consider the following situations to illustrate the relative conservativeness
of the ITT and PP analysis sets:
Subjects are treated with study treatment that is not their randomized study
treatment. This can be caused by several kinds of errors in trial conduct.
Under the belief that the PP analysis set addresses these problems, the PP
analysis set has been commonly used instead of or in addition to the ITT
analysis set. A concern is whether the PP analysis set is always the appropri-
ate way to address these and other issues. Some authors have questioned
the wisdom of using the per protocol concept for analyzing non-inferiority
clinical trials. A discussion of several antibiotic non-inferiority clinical trials
concluded that use of the ITT analysis set does not systematically lead to
smaller estimates of treatment effect in these trials (see Section 2.5 for more
details).19 A hybrid ITT/PP analysis set, which excludes noncompliant subjects
as with the PP set while addressing the impact of missing data due only to
lack of efficacy with an ITT approach (based on maximum likelihood), was
also proposed as a compromise.20 More aggressively campaigning against
the use of the PP analysis sets, Hauck and Anderson21 noted that standards
required the use of the PP analysis set for the null hypothesis Ho: μC − μE > δ
versus Ha: μC − μE ≤ δ for any value of δ > 0, making the null hypothesis of
equality the only point at which the ITT analysis set is favored. Such a dis-
continuity is difficult to justify for only one point, making the assumption
faulty. Wiens and Zhao22 expanded this idea and concluded that the argu-
ments for using the ITT set for superiority analyses apply equally well to
non-inferiority analyses, and therefore the ITT analysis set should be consid-
ered. Furthermore, the PP approach is not the universally best choice for a
sensitivity analysis and therefore should not be a standard adjunct analysis,
much less the standard co-primary analysis.
Other authors have proposed basing the analysis on the treatment actu-
ally received, regardless of whether it was the randomized treatment. These
as-treated analyses, however, sacrifice the protection of randomization and
can introduce selection bias.
Example 8.3
In an antibiotic trial, suppose that identifying the pathogen takes 48 hours but treatment must be commenced immediately for
ethical reasons. If the pathogen is determined to be one that is not susceptible to
the study treatments based on preclinical results, or if no pathogen is found, it is
common to discontinue the subject from the study and ignore the subject in any
efficacy analyses. However, this might not be the best course of action. When the
new treatment is approved for marketing, it will be prescribed based on empiri-
cal symptoms rather than on cultures. Thus, it may be of interest for the subject
to be offered the chance to stay in the study and even continue to receive study
medication until symptoms resolve. It is likely that the informed subject will not
choose to remain on study medication if told that it will likely not impact the
symptoms. The subject who chooses to discontinue study medication and start a
different course of treatment should continue to receive follow-up evaluations in
accordance with the protocol. The next question becomes how to use the subject
in the analysis. For the primary analysis, it may be possible to remove the subject
since the exclusion was in place before randomization—even though it was not
known until after randomization.27 It may be of benefit to report the success rate
in the primary analysis and also among subjects treated empirically, to give the
physician information on success rates under clinical trial situations and under
practical situations.
No choice of analysis set can substitute for poor trial conduct or poor adherence
to the protocol, and none can salvage a poorly conducted clinical trial.
Although both the ITT and PP approaches have been criticized, we recommend
that non-inferiority analyses be performed on both the ITT and the PP
analysis sets. In most instances, the results should be quite similar.
Any notable difference in the results should be investigated and may be
indicative of poor study conduct or other reasons that must be thoroughly
investigated and explained. Similarity in the results of the ITT and PP anal-
yses, while reassuring, does not imply confidence in the results. A poorly
conducted trial likely introduces bias, which can be of a nearly equal size for
both the ITT and PP analyses.
References
1. European Medicines Agency, Guideline on Missing Data in Confirmatory
Clinical Trials, Committee for Medicinal Products for Human Use, 2009, at http://
www.ema.europa.eu/human/ewp/177699endraft.pdf.
2. Code of Federal Regulations 21 CFR 314.126.
3. Carroll, K.J., Analysis of progression-free survival in oncology trials: Some com-
mon statistical issues. Pharm. Stat., 6, 99–113, 2007.
4. Fleming, T.R., Rothmann, M.D., and Lu, H.L., Issues in using progression-free
survival when evaluating oncology products. J. Clin. Oncol., 27, 2874–2880,
2009.
5. International Conference on Harmonization of Technical Requirements for
Registration of Pharmaceuticals for Human Use (ICH), E9: Statistical principles for
clinical trials. 1998, at http://www.ich.org/cache/compo/475-272-1.html#E4.
6. Fleming, T.R., Addressing missing data in clinical trials. Ann. Intern. Med., 154,
113–117, 2011.
7. Jackson, J.B. et al., Intrapartum and neonatal single-dose nevirapine compared
with zidovudine for prevention of mother-to-child transmission of HIV-1 in
Kampala, Uganda: 18 months follow-up of the HIVNET 012 randomised trial.
Lancet 362, 859–868, 2003.
8. Little, R.J.A. and Rubin, D.B., Statistical Analysis with Missing Data, John Wiley,
New York, NY, 1987.
9. Little, R.J.A., Regression with missing X’s: a review, J. Am. Stat. Assoc., 87, 1227–
1237, 1992.
10. Little, R.J.A., A test for missing completely at random for multivariate data with
missing values, J. Am. Stat. Assoc., 83, 1198–1202, 1988.
11. Koch, G.G., Comments on ‘current issues in non-inferiority trials’ by Thomas R.
Fleming, Stat. Med., 27, 333–342, 2008.
12. Wiens, B.L., Randomization as a basis for inference in noninferiority trials,
Pharm. Stat., 5, 265–271, 2006.
13. Mallinckrodt, C.H. et al., Recommendations for the primary analysis of continu-
ous endpoints in longitudinal clinical trials, Drug Inf. J., 42, 303–319, 2008.
14. Shih, W.J. and Quan, H., Testing for treatment differences with dropouts present
in clinical trials—A composite approach, Stat. Med., 16, 1225–1239, 1997.
15. Hollis, S., A graphical sensitivity analysis for clinical trials with non-ignorable
missing binary outcome, Stat. Med., 21, 3755–3911, 2002.
16. Hill, A.B., Principles of Medical Statistics, 7th ed., Lancet, London, 1961.
17. Snapinn, S.M., Noninferiority trials. Curr. Control Trials Cardiovasc. Med. 1, 19–21,
2000.
18. International Conference on Harmonization of Technical Requirements for
Registration of Pharmaceuticals for Human Use (ICH) E-10: Guidance on
choice of control group in clinical trials, 2000, at http://www.ich.org/cache/
compo/475-272-1.html#E4.
19. Brittain, E. and Lin, D., A comparison of intent-to-treat and per-protocol results
in antibiotic non-inferiority trials, Stat. Med., 24, 1–10, 2005.
20. Sanchez, M.M. and Chen, X., Choosing the analysis population in non-inferiority
studies: Per protocol or intent-to-treat, Stat. Med., 25, 1169–1181, 2006.
21. Hauck, W.W. and Anderson, S., Some issues in the design and analysis of equiv-
alence trials, Drug Inf. J., 33, 177–224, 1999.
22. Wiens, B.L. and Zhao, W., The role of intention to treat in analysis of noninferi-
ority studies, Clin. Trials, 4, 286–291, 2007.
23. Stewart, W.H., Basing intention-to-treat on cause and effect criteria, Drug Inf. J.,
38, 361–369, 2004.
24. Robins, J.M., Correction for non-compliance in equivalence trials, Stat. Med., 17,
269–302, 1998.
25. Lee, Y.J., Ellenberg, J.H., Hirtz, D.G., and Nelson, K.B., Analysis of clinical trial
data by treatment actually received: Is it really an option? Stat. Med., 10, 1595–
1605, 1991.
26. Peto, R., Pike, M.C., Armitage, P., Breslow, N.E., Cox, D.R., Howard, S.V., Mantel,
N., McPherson, K., Peto, J., Smith, P.G., Design and analysis of randomized clin-
ical trials requiring prolonged observation of each patient. I. Introduction and
design, Brit. J. Cancer, 34, 585–612, 1976.
27. Gillings, D. and Koch, G., The application of the principle of intention-to-treat
to the analysis of clinical trials, Drug Inf. J., 25, 411–424, 1991.
9.1 Introduction
The statistical framework for ruling out a prespecified increased risk of
an adverse event is similar to that used in determining non-inferior
efficacy. Examples include establishing the safety of a test treatment
compared to placebo or establishing the safety of a test compound com-
pared to an active control, both with the objective of ruling out an important
increase in the rates of adverse events. Less common, but possible, is the com-
parison of a test compound to an active control with inference desired on the
event rate of the test compound compared to a putative placebo. Because the
design and analysis are dependent on the objectives, and the objectives can
vary, it is vital to prespecify and define the study objectives.
There may be uncertainty about the safety of a drug at the time of approval.
Some adverse events are infrequent or are long-term adverse outcomes
that may not be discovered during the clinical trials that led to approval.
Additionally, the risk–benefit profile can change based on changes in sup-
portive care, the nature of the disease, the standards or in the understanding
of the risks. A change to an unfavorable risk-benefit assessment may alter or
remove the indication or intended use. New evidence on safety may be suf-
ficient to provide caution in the use of the product but not sufficient to lead
to an unfavorable or uncertain risk–benefit profile. Some changes in risks
can be addressed through introduced changes in medical practice. Subjects
at greater risk of a particular known adverse event may either not be offered
the drug or may be monitored more closely for the adverse event while
receiving the drug.
The U.S. Food and Drug Administration (FDA) Amendments Act of 2007,1
which expanded the authority of the FDA during postmarketing, provides
situations in which a postapproval study on safety may be required. A post-
approval study on the safety of a drug may be required to assess a known
serious risk, or a signal of a serious risk, or to identify an unexpected serious
risk when data indicate the potential for a serious risk. The source of a safety
signal may be clinical trials, adverse event reports, postapproval studies, peer-
reviewed biomedical literature, postmarket data, or other scientific data.
Evaluating whether data suggest a safety signal that was not prespeci-
fied for the investigation may be associated with substantial error and bias.
Although the efficacy of a drug is based on the intended effects of the exper-
imental agent or regimen, its safety profile usually involves unintended,
harmful effects. If the rate of these unintended, harmful effects is too great,
the risk–benefit profile may be unfavorable. However, unlike efficacy analy-
ses that prespecify the endpoints to be tested and with the overall type I error
rate maintained at a desired level, standard safety analyses usually involve
multiple tests, sometimes on nonprespecified adverse events, without any
multiplicity adjustment. Thus there is an exploratory nature to the standard
safety analyses that are conducted in a clinical trial. Additionally, owing to
the multiple testing, the most impressive differences between arms in an
adverse event will tend to be randomly high, and more likely than not will
have a smaller observed difference in a subsequent identically designed and
conducted clinical trial. Therefore, when a safety signal is evaluated on the
basis of ongoing or previous trials, any meta-analysis used to formally test
whether an unacceptable increased risk can be ruled out should not include
the results from the clinical trial that identified the potential safety risk: that
analysis is conditionally biased and potentially represents a random high.
Retrospective meta-analysis may be used to identify safety signals. If
random-effects meta-analyses are done, the results should be viewed with
care. Increasing the variability and altering the weighting of the studies can
obscure the determination of a safety signal, or of the subgroup in which a
safety signal may be present.
There are three criteria or questions to be considered when assessing the
reliability of an exploratory safety analysis2:
When the comparison is expressed as a difference in the rates of the adverse event, the hypotheses take the form

Ho: pE − pC ≥ δ versus Ha: pE − pC < δ

for an appropriate δ > 0. In other words, the null hypothesis is that the investigational
treatment increases the event rate by at least some difference δ, and the
alternative hypothesis is that the investigational treatment increases the rate
by less than δ (or has no effect, or decreases the event rate). The safety margin
for increased risk or harm may depend on the benefit of the product. The
parallels to non-inferiority testing for efficacy are immediately obvious. A
possible disadvantage of expressing the hypotheses in terms of a risk difference
is a lack of robustness to an incorrect estimate of the rate in the control
group: an increase of 5 percentage points may seem inconsequential when
the placebo event rate is 30%, but not when the placebo event rate is 3%. This
disadvantage can be exacerbated when the patient population changes over
time.
Alternatively, the hypotheses can be expressed in terms of a ratio of event rates (risk ratio):

Ho: pE/pC ≥ δ versus Ha: pE/pC < δ

for an appropriate δ > 1. The relative risk is often perceived as being more consistent
across different patient populations with different event rates than the risk
difference.
difference. However, the risk ratio is not robust to a change in event rates, par-
ticularly for fairly rare events. A 50% increase when the placebo event rate is 1%
(which affects 1 out of every 200 subjects) is quite different from a 50% increase
when the placebo event rate is 10% (which affects 1 out of every 20 subjects).
If time to an undesirable event is of interest, then the hypotheses can be
expressed in terms of a hazard ratio, θ. With the margins of 1.8 and 1.3 considered in
the FDA guidance for antidiabetic therapies,3 the hypotheses are

Ho1: θ ≥ 1.8 versus Ha1: θ < 1.8

and

Ho2: θ ≥ 1.3 versus Ha2: θ < 1.3

where θ is the hazard ratio of the experimental therapy (in the numerator) to
the control therapy (either a placebo or an active comparator, if appropriate).

[FIGURE 9.1: Some possible outcomes for a safety evaluation, displayed on the risk ratio scale.]
If the first null hypothesis, Ho1, is not rejected, the possibility of an important
safety signal cannot be ruled out, and approval is unlikely. If both null
hypotheses are rejected, an important safety signal is unlikely and approval
is possible, given that efficacy and other safety data support such approval.
If only the first null hypothesis (Ho1) is rejected and the second (Ho2) is not,
then further study is required. In this situation, other safety and efficacy
data may allow approval of the product, but the sponsor will be obligated to
conduct further study to rule out an important increase of 30% in cardiovas-
cular risk.
With multiple hypotheses being tested, and possibly multiple attempts at
testing, it is necessary to consider the type I error rate. We consider testing
on the basis of two-sided 95% confidence intervals. When θ ≥ 1.8, the prob-
ability is at most 0.025 that θ ≥ 1.8 is rejected on the basis of the data used
for the consideration of the approval. The probability of concluding that θ <
1.3 is much less than 0.025. In the event that it is falsely concluded that θ <
1.8 but θ < 1.3 is not concluded, it is quite unlikely that a later safety study,
if properly designed and conducted, would conclude θ < 1.3 (the probability
being much less than 0.025).
When 1.3 < θ < 1.8, concluding that θ < 1.8 is not an error and is also not
automatic. Any given test of θ ≥ 1.3 versus θ < 1.3 would maintain the desired
type I error rate or a smaller rate. Because there would likely be two oppor-
tunities (the pre- and post-approval analyses) to conclude that θ < 1.3, the
overall type I error rate would be a little less than 0.05 (when θ is slightly
larger than 1.3) or smaller.
In calculating confidence intervals on risk ratios, the number of events
(particularly for rare events) must be adequate to result in a sufficiently nar-
row confidence interval. Observing few events will result in a confidence
interval that is wide, which may result in not being able to rule out an impor-
tant increase in events even if the observed rates are similar. To obtain an
adequate number of events, the guidance document recommends enroll-
ing patients at increased risk of the event. This serves a second purpose,
which is to study patients who are at higher risk, since some such patients
will inevitably receive the drug once it is approved for marketing. However,
enrolling patients at increased risk of the event can also make the results less
generalizable: enrolling patients with different risk levels than the general
population may provide results that do not easily extrapolate to the general
population, whether those rates are higher or lower.
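To make the margin checks concrete, here is a minimal sketch assuming a log-normal approximation for an estimated hazard ratio, with se(log θ̂) ≈ √(1/dE + 1/dC) for event counts dE and dC; the estimate and event counts are hypothetical:

```python
import math

def hazard_ratio_ci(hr_hat, events_exp, events_ctl, z=1.959964):
    """Approximate two-sided 95% CI for a hazard ratio, using the
    log-normal approximation se(log HR) ~ sqrt(1/d_E + 1/d_C)."""
    se = math.sqrt(1.0 / events_exp + 1.0 / events_ctl)
    lo = math.exp(math.log(hr_hat) - z * se)
    hi = math.exp(math.log(hr_hat) + z * se)
    return lo, hi

lo, hi = hazard_ratio_ci(1.10, 120, 115)   # hypothetical estimate and event counts
print(f"95% CI: ({lo:.2f}, {hi:.2f})")
print("rules out 1.8:", hi < 1.8)          # premarketing criterion
print("rules out 1.3:", hi < 1.3)          # criterion for avoiding further study
```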
We direct the reader to Chapter 13 in this book for determining confi-
dence intervals for non-inferiority testing of time-to-event endpoints and to
Chapter 7 on multiple testing.
References
1. Food and Drug Administration Amendments Act of 2007, at http://frwebgate.access.gpo.gov/cgi-bin/getdoc.cgi?dbname=110_cong_public_laws&docid=f:publ085.110.
2. Fleming, T.R., Identifying and addressing safety signals in clinical trials, New
Engl. J. Med., 359, 1400–1402, 2008.
3. Guidance for industry: Diabetes mellitus—evaluating cardiovascular risk in
new antidiabetic therapies to treat type 2 diabetes, United States Food and Drug
Administration, Silver Spring, MD, 2008.
10.1 Introduction
In this chapter, we discuss additional topics that may be involved in the design,
analysis, and interpretation of the results of a non-inferiority trial. Many of these
topics are well developed for superiority trials but less understood for non-
inferiority trials. We discuss issues involving the consistency of non-inferiority
across subgroups in Section 10.2. The relationship between non-inferiority
inferences on a surrogate endpoint and those on the corresponding clinical benefit
endpoint is discussed in Section 10.3. Adaptive designs (mostly involving trial
monitoring) and group sequential trials are discussed in Section 10.4. Section
10.5 provides a brief discussion on equivalence comparisons.
The effects of therapies (e.g., the active control and experimental therapies
in a non-inferiority trial) may vary across meaningful subgroups. A non-
inferiority inference involves concluding that the effectiveness of the experi-
mental therapy is both superior to placebo and not unacceptably worse than
the active control. To formally make such an inference, or to check for con-
sistency in those inferences, across subgroups would require an understand-
ing of the effect of the active control relative to placebo in the investigated
subgroups along with the estimated difference in effects between the active
control and experimental therapies from the non-inferiority trial(s). There
are various scenarios in which the effects (relative to placebo) of the active
control therapy and/or the experimental therapy, as well as the differences
in their effects, may vary across subgroups. Different subgroups may also
have different “non-inferiority margins” because of varying effects of the
active control.
A surrogate endpoint is an endpoint used as a substitute for a clinical
benefit endpoint. The objective of using a surrogate endpoint is that specific
inferences on the surrogate endpoint imply specific inferences on the clinical
benefit endpoint. It is therefore important that treatment effects on the surro-
gate endpoint are related to treatment effects on the clinical benefit endpoint.
In a superiority trial with a rather good surrogate endpoint that represents the
sole pathway toward clinical benefit, superiority on the surrogate endpoint
would imply superiority on the clinical benefit endpoint. For a non-inferiority
trial, such an implication additionally depends on the relationship between
effect sizes on the surrogate endpoint and effect sizes on the clinical benefit
endpoint.
[FIGURE 10.1: Exploratory plot to informally look for qualitative or quantitative interaction. Subgroup estimates with per-arm sample sizes are displayed for gender, etiology (1–3), and geographic region (Europe, Asia, Australia, Japan) against the non-inferiority margin δ.]
For estimated treatment differences D1, …, DI across I subgroups, the test statistic is

W = \sum_{i=1}^{I} \frac{(D_i - \bar{D})^2}{\sigma_i^2}

where σi is the standard deviation within the ith stratum and \bar{D} is the mean of the Di values weighted by the inverses of the subgroup variances (standard errors squared):

\bar{D} = \left(\sum_{i=1}^{I} \frac{D_i}{\sigma_i^2}\right)\left(\sum_{i=1}^{I} \frac{1}{\sigma_i^2}\right)^{-1}
Under the null hypothesis, W has a central χ2 distribution with I – 1 degrees
of freedom.2 Rejection of Ho implies that an understanding of the interaction
is required to fully interpret the results of the clinical trial. However,
failure to reject Ho does not imply anything about the magnitude or mean-
ingfulness of the differences in effects across subgroups. Tests for interac-
tion effects tend to have little power at meaningful alternatives, and thus
the interaction effects may be rather large even if Ho is not rejected. These
limitations will also be true for tests of a qualitative interaction when com-
paring the experimental and active control therapies. The test procedure for
an interaction is basically the same as the test procedure for heterogeneous
effects across studies provided in Section 4.3.1.
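A minimal sketch of this heterogeneity test, assuming subgroup estimates and standard errors are available; the inputs below are hypothetical:

```python
import numpy as np
from scipy.stats import chi2

def interaction_test(d, se):
    """Heterogeneity test of treatment differences across I subgroups:
    W = sum_i (D_i - Dbar)^2 / sigma_i^2, with Dbar the inverse-variance
    weighted mean; W ~ chi-square with I - 1 df under homogeneity."""
    d = np.asarray(d, float)
    se = np.asarray(se, float)
    w = 1.0 / se**2
    d_bar = np.sum(w * d) / np.sum(w)
    W = np.sum((d - d_bar) ** 2 / se**2)
    p = chi2.sf(W, df=len(d) - 1)
    return W, p

# hypothetical subgroup differences and their standard errors
print(interaction_test([0.4, 1.2, -0.1], [0.50, 0.60, 0.55]))
```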
Wiens and Heyse proposed several likelihood ratio–type tests for a qualitative
interaction, where a common non-inferiority margin across a partition
of subgroups is used, on the basis of quantities QLR, QLR−, and QLR+ defined as
follows:

Q_{LR}^{+} = \sum_{i=1}^{I} \frac{(D_i - \delta)^2}{\sigma_i^2}\, I(D_i > \delta)

Q_{LR}^{-} = \sum_{i=1}^{I} \frac{(D_i - \delta)^2}{\sigma_i^2}\, I(D_i < \delta)

Q_{LR} = \min\left(Q_{LR}^{+},\, Q_{LR}^{-}\right)
In these calculations, I(•) is an indicator function that equals 1 if the argument
is true and 0 otherwise.1 The test statistic QLR+ tests the null hypothesis
that non-inferiority exists in all strata versus the alternative hypothesis that
non-inferiority does not exist in at least one stratum. Such hypotheses can
be written as

HoQ+: δi ≤ δ for all i versus HaQ+: δi > δ for at least one i

where δi denotes the true treatment difference in the ith stratum. The test statistic QLR tests the null hypothesis that either non-inferiority
exists in all strata or the experimental treatment is markedly inferior to the
active control in all strata; the alternative is that this is true in some strata but
not in others. Such hypotheses can be written as

HoQ: δi ≤ δ for all i or δi ≥ δ for all i versus
HaQ: δi > δ for at least one i and δi < δ for at least one i
Thus, both QLR and QLR+ are testing for the existence of qualitative interaction.
For non-inferiority, the test statistic QLR does not seem to make much
sense because the additional area in the null region is an area in which it
will not be possible to conclude non-inferiority, but in equivalence analyses
QLR might be appropriate. Critical values for both tests are given in Table 1
of Gail and Simon,2 who discussed these tests, without the “–δ” in the formulae
for QLR+ and QLR−, for superiority trials.
When the point estimate of treatment effect in every subgroup has the
same directional relationship to the non-inferiority margin, either QLR+ or QLR−
will be zero (and therefore QLR will be zero). Using the data from Figure 10.1,
the test statistic QLR+ will be zero when testing interaction between treatment
and gender and between treatment and etiology, but neither QLR+ nor QLR− will
be zero when testing interaction between treatment and geographic region.
Therefore, QLR will be zero in the interaction test for treatment and gender
and treatment and etiology, but not for the interaction test of treatment and
geographic region.
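A minimal sketch computing QLR+, QLR−, and QLR from subgroup estimates, standard errors, and a common margin; the resulting statistics would be compared with the critical values in Table 1 of Gail and Simon, and all inputs here are hypothetical:

```python
import numpy as np

def q_lr(d, se, delta):
    """Wiens-Heyse likelihood-ratio-type quantities for qualitative
    interaction relative to a common non-inferiority margin delta."""
    d = np.asarray(d, float)
    se = np.asarray(se, float)
    z2 = (d - delta) ** 2 / se**2
    q_plus = np.sum(z2[d > delta])    # strata observed beyond the margin
    q_minus = np.sum(z2[d < delta])   # strata observed within the margin
    return q_plus, q_minus, min(q_plus, q_minus)

# hypothetical subgroup differences, standard errors, and margin
print(q_lr([0.2, -0.5, 1.6], [0.5, 0.6, 0.7], delta=1.0))
```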
Note that in both cases, the tests start with the null hypothesis of no quali-
tative interaction and conclude qualitative interaction only if there is strong
evidence that it exists. In both cases, observed qualitative interaction (i.e., at
least one stratum with Di > δ and at least one stratum with Di < δ) is neces-
sary to reject the null hypothesis and conclude the existence of qualitative
interaction. An alternative test, based on the “min test,” assumes the exis-
tence of qualitative interaction unless there is strong evidence to the con-
trary.1 However, a test based on the min test will require a conclusion of
non-inferiority in each stratum.3 Hence, this test will have little power
for the typical non-inferiority trial (and thus great uncertainty) and is not
recommended.
An alternative to the likelihood ratio test is the standardized range test.
For an appropriate critical value C, the standardized range test considers
the hypotheses in Expression 10.1—analogous to QLR—with the test statistics
max_i((D_i − δ)/σ_i) and min_i((D_i − δ)/σ_i). The test can be written as
Q_SR = min[max_i((D_i − δ)/σ_i), −min_i((D_i − δ)/σ_i)]; HoQ is rejected if Q_SR > C. Alternatively, the range
test considers the hypotheses in Expression 10.2—analogous to QLR+—with
the test statistic min_i(−(D_i − δ)/σ_i); HoQ+ is rejected if this statistic is less than C′. Critical
values for C and C′ are given in Table 1 of Piantadosi and Gail.4 Furthermore,
the range test is more powerful when the effect is reversed in very few
subgroups, whereas the likelihood ratio test is more powerful when a few
subgroups have an effect in one direction and a few in the other. For non-
inferiority purposes, it is unlikely to achieve a conclusion of non-inferiority
if there are many subgroups for which the true effect is against a conclu-
sion of non-inferiority, which argues for use of the standardized range test.
However, this test has not been studied extensively in the non-inferiority lit-
erature and therefore should be approached with caution. In addition, with
few strata, the difference in the performance of the tests is minor, so the
likelihood ratio test should perform well.4
Similar criteria can be established for other types of endpoints (e.g., continu-
ous or binary endpoints). The second criterion requires that the effect of a
treatment on the clinical benefit endpoint is completely mediated through
the surrogate endpoint. Various researchers recommend verifying the sec-
ond criterion through meta-analysis of relevant trials studying the (poten-
tial) surrogate and clinical benefit endpoints.9,13,15
For regular approval, the type I error rate for drawing conclusions on the
clinical benefit endpoint from testing on the surrogate endpoint should be
maintained at the desired level that would be used in a test directly on the
clinical benefit endpoint. In a superiority trial, if testing at a one-sided type
I error rate of 0.025 on the clinical benefit endpoint is desired, then the sur-
rogate endpoint must be such that if the experimental therapy has zero effect
on the clinical benefit endpoint, the probability is 0.025 of demonstrating
superiority on the surrogate endpoint.
For an active-controlled trial, if the aim of the trial is to demonstrate any
efficacy, the non-inferiority margin for the surrogate endpoint should assure
that when the experimental therapy has zero effect on the clinical benefit
endpoint, the probability that the experimental arm demonstrates non-
inferiority to the control arm on the surrogate endpoint is at most 0.025 or
whatever level is prespecified. In this setting, it is concluded that the experi-
mental therapy has a positive effect on the clinical benefit endpoint when-
ever it is concluded that the experimental therapy has a noninferior effect
on the surrogate endpoint. When the surrogate endpoint is acceptable for
regular approval, it is sufficient to choose a non-inferiority margin that addi-
tionally guarantees that the experimental therapy has an effect on the sur-
rogate endpoint (i.e., the non-inferiority margin is less than or equal to the
effect on the surrogate endpoint that the active control can be assumed to
have in the non-inferiority trial). When the aim of the trial is to demonstrate
adequate efficacy (e.g., the experimental therapy retains at least some mini-
mal amount or fraction of the active control effect), more precise information
is needed on the relationship between effect sizes on the surrogate endpoint
and effect sizes on the clinical benefit endpoint. Such precise information
on the relationship of the effect sizes may not be known. Uncertainty on the
precise relationship may lead to a more conservatively selected margin for
the surrogate endpoint or invalidate the use of the surrogate endpoint in a
non-inferiority trial setting.
For fixed non-inferiority margins, the requirements for the surrogate and
clinical benefit endpoints have an appearance similar to the mathematical
requirement for a function to be uniformly continuous. Consider the case
of using means, where μE,S and μC,S are the true means for the surrogate end-
point and μE,CB and μC,CB are the true means for the clinical benefit endpoint
for the experimental and active control arms, respectively. The use of the sur-
rogate endpoint for regular approval with an associated type I error rate or
significance level of 2.5%, where the non-inferiority margin on the surrogate
endpoint is δ > 0 and the non-inferiority margin on the clinical benefit end-
point is ε > 0, requires 97.5% certainty that μ C,S – μE,S < δ to imply 97.5% cer-
tainty that μ C,CB – μE,CB < ε. The value for ε would represent either the entire
effect of the control therapy (vs. placebo) on the clinical benefit endpoint or
the amount of the effect that a therapy can be worse than the active control
therapy but still have adequate efficacy. It is unlikely that 97.5% certainty that
μC,S – μE,S < δ will be equivalent to 97.5% certainty that μC,CB – μE,CB < ε. A surrogate
endpoint would still be useful and conservative when 97.5% certainty
that μC,S – μE,S < δ was equivalent to a greater than 97.5% certainty that μC,CB –
μE,CB < ε. Example 10.1 describes the approval of peg-filgrastim based on non-
inferiority comparisons from two clinical trials on a surrogate endpoint.
Example 10.1
The registration clinical trials comparing peg-filgrastim with filgrastim are exam-
ples of non-inferiority trials on the surrogate endpoint of the duration of severe
neutropenia, which led to the regular approval of peg-filgrastim. Filgrastim was
approved on the basis of a demonstrated improvement (reduction) in the clinical
benefit endpoint of the incidence of febrile neutropenia.16 Filgrastim also dem-
onstrated an improvement in the duration of severe neutropenia during the first
cycle of chemotherapy.16 The duration of severe neutropenia in the first cycle is
correlated with the chance of getting febrile neutropenia. It is also biologically
plausible that reducing the duration of severe neutropenia decreases the likeli-
hood of experiencing febrile neutropenia.
In each of the two non-inferiority registration trials comparing peg-filgrastim
with filgrastim, the non-inferiority margin for the mean duration of severe neu-
tropenia during the first cycle of chemotherapy was 1 day.17 Study 1, which ran-
domized 157 subjects, used fixed-dose peg-filgrastim, whereas study 2, which
randomized 310 subjects, used a weight-adjusted dose of peg-filgrastim. The
95% CIs for the difference in the mean duration of severe neutropenia between
peg-filgrastim and filgrastim were (–0.2 to 0.6) and (–0.2 to 0.4) for studies 1 and
2, respectively. Both studies succeeded in demonstrating that peg-filgrastim was
noninferior to filgrastim in the mean duration of severe neutropenia during the
first cycle of chemotherapy, leading to the conclusion that peg-filgrastim is effec-
tive (relative to placebo) in reducing the incidence of febrile neutropenia.
final analysis. Proponents of adaptive designs believe that they can be more
efficient than standard designs in either reducing the expected trial size for a
given power or in increasing the study power for a given trial size.
Adaptive designs should not be a means to alleviate the burden of rigorous
planning. Changes in the design of ongoing trials are not recommended.18
When substantial changes are made to the design of the trial, the primary
analysis may need to stratify by whether subjects were randomized before or
after the change.18 There may not be a way to correct an analysis for adapta-
tions that affect subjects already in the trial.
When an adaptation involves external information, that external informa-
tion tends to be available to the study subjects. However, this is not true
when adaptations are made based on internal information. As such, there
may be ethical concerns when adaptations are made based on internal data.
If a sponsor deems the results important enough to make design modifi-
cations during the trial, then information learned from the study should be
important for subjects to learn.19 However, subjects and investigators may
prejudge the results if provided information on the relative treatment effect.
Properties of adaptive designs are not well understood for many poten-
tial adaptations in non-inferiority clinical trials. Properties of non-inferiority
group sequential trials have been studied in greater detail, so our discussion
focuses on such designs. In these designs, the study can be terminated early
at prespecified time points on the basis of accumulating evidence of efficacy
or of lack of efficacy. For reasons that will be discussed later, such designs
are not as common in non-inferiority trials, but can easily be implemented
if desired in a particular situation. Other adaptations can also be considered
for non-inferiority designs, but much less experience is available with which
to evaluate them. Adding or dropping a treatment arm is applicable when
multiple treatment arms are tested (notably in dose-ranging trials that com-
pare several doses of a single drug), but such designs are not commonly ana-
lyzed as non-inferiority trials. Changing the primary endpoint or primary
test statistic is a difficult proposition for superiority trials, and not much is
known about the effects on non-inferiority trials. Changing the sample size
is possible in a non-inferiority trial, usually as a result of insufficient infor-
mation being available before the start of the study to appropriately power
the study.
For the rest of this section, we will focus on group sequential methods and
the use of sample size reestimation based on interim results. For other issues
involving adaptive designs, see the U.S. Food and Drug Administration
(FDA) draft guidance.20
One approach avoids estimating treatment effects at the interim analysis, using only blinded estimates of the variability
from the internal pilot study to reestimate sample size.29 For an interim analysis,
consider calculating the one-sample standard deviation at a prespecified
interim analysis (i.e., the internal pilot study) according to the usual method:

\hat{\sigma} = \sqrt{\frac{1}{n_1 - 1}\sum_{i}(X_i - \bar{X})^2}

This estimate is then used with a priori estimates of the treatment difference to recompute the sample size.
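A minimal sketch of the resulting blinded reestimation for a two-arm continuous endpoint, assuming the usual normal-approximation sample size formula; the blinded SD and the assumed difference are hypothetical planning inputs:

```python
import math
from scipy.stats import norm

def blinded_ssr(pooled_sd, assumed_diff, alpha=0.025, power=0.9):
    """Per-arm sample size for a two-arm comparison, recomputed with the
    blinded (one-sample) SD from the internal pilot and the a priori
    assumed difference: n = 2 * sd^2 * (z_{1-alpha} + z_{1-beta})^2 / diff^2."""
    z_a = norm.ppf(1 - alpha)
    z_b = norm.ppf(power)
    n = 2 * (pooled_sd * (z_a + z_b) / assumed_diff) ** 2
    return math.ceil(n)

# hypothetical: blinded SD of 14.0 from the pilot, assumed difference of 5
print(blinded_ssr(14.0, 5.0))
```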
Estimated treatment effects from the first stage of the trial may not represent the comparisons between
arms on the same endpoint for the second stage of the trial. For a time-to-event
endpoint, the duration of follow-up is different for the two analyses.
Short-term effects or relative effects may not translate to long-term effects.
The effects of therapy may be greater for subjects with better prognosis than
those with worse prognosis. The patients with worse prognosis will have a
lopsided contribution to the total number of events at the interim analysis
and thus may yield an estimated effect that would tend to be smaller than
that obtained at the final analysis. Additionally, sample size or event size
reestimation allows for the possibility of back calculating, gaining an idea of
the estimated effect at the interim analysis. Such knowledge could be used
in a manner that reduces the integrity of the trial.
Alternatively, the trial can be sized for a minimally meaningful effect size
and a group sequential testing procedure can be implemented. This is gen-
erally more efficient and provides a more natural relative weighting of data
generated before and after any interim look.
Wang et al.32 compared the power, type I error rate, and sample size for two
group sequential approaches for testing non-inferiority and superiority with a
two-stage adaptive approach. In the two-stage adaptive approach, if the trial
continues after stage 1, the specific primary objective for the end of trial is cho-
sen between non-inferiority and superiority based on the results in stage 1.
Thresholds for decision making in such two-stage adaptive approaches were
also considered by Shih, Quan, and Li33 and Koyama, Sampson, and Gleser.34
The validity of the TOST approach has been documented by Berger,37 and this
approach is general enough to apply to continuous, discrete, or time-to-event
data. The testing approach is equivalent to comparing an appropriate-level
confidence interval for μE/μC with the interval (δ1, δ2). If the confidence interval
lies entirely within (δ1, δ2), then “equivalence” is concluded. Otherwise,
equivalence is not shown. For example, performing the standard sets of tests
of the respective sets of hypotheses in Expression 10.4, each at a significance
level of α/2, is equivalent to comparing a 100(1 – α)% two-sided confidence
interval for μE/μC with the interval (δ1, δ2). As the two tests are simultaneously
performed at a significance level of α/2 and both null hypotheses need to
be rejected to conclude equivalence, the type I error rate is maintained at a
level of α/2 or less.
This TOST approach is recommended in the International Conference on
Harmonization of Technical Requirements for Registration of Pharmaceutic
als for Human Use E9,27 which states that “Operationally, this (equivalence
test) is equivalent to the method of using two simultaneous one-sided tests
to test the (composite) null hypothesis that the treatment difference is outside
the equivalence margins.”
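A minimal sketch of the TOST logic for a difference in means, assuming normally distributed test statistics and hypothetical margins; each one-sided test is performed at level α/2, matching the 100(1 − α)% confidence interval formulation above:

```python
from scipy.stats import norm

def tost(diff_hat, se, lower, upper, alpha=0.05):
    """Two one-sided tests for equivalence of a difference: reject
    Ho1: diff <= lower and Ho2: diff >= upper, each at level alpha/2;
    equivalent to the 100(1 - alpha)% CI lying within (lower, upper)."""
    p_lower = norm.sf((diff_hat - lower) / se)  # test against lower margin
    p_upper = norm.sf((upper - diff_hat) / se)  # test against upper margin
    return max(p_lower, p_upper) < alpha / 2

# hypothetical estimate, standard error, and margins
print(tost(0.1, 0.2, -0.5, 0.5))
```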
For discrete data with k possible outcomes, the parameter of interest is
\sum_{i=1}^{k} \min\{p_{E,i}, p_{C,i}\}, where pE,i and pC,i denote the probabilities
of the ith outcome for the experimental and control arms. It is easy to see that
this parameter does not retain any information on any ordered relationship
among the observations—that is, the possible outcomes are treated as nominal,
unordered categories.
For continuous data, Rom and Hwang38 defined the PSR to be the over-
lap under the density curves between the two treatments. It measures the
degree of overlap (similarity) of the two distributions. More formally, the
PSR is given by
PSR = \int_{-\infty}^{\infty} \min\{f_E(x), f_C(x)\}\, dx
where f E and fC are the underlying density functions for the experimental
and control arms, respectively. A PSR close to 1 indicates similar distribu-
tions of outcomes between the two arms. In practice, when the PSR is far
from 1, the means, medians, and/or variances of the two distributions will
be quite different.
We will consider the special case of normal distributions having equal
standard deviations. Let μE and μ C denote the underlying means for the
experimental and control arms, and let σ denote the common standard devi-
ation. Then the PSR can be expressed as a decreasing function of the absolute
standardized difference in the means (|DS|). That is,
PSR(DS) = 2Φ(–|DS|/2)
where DS = (μE – μC)/σ and Φ is the distribution function for a standard normal
distribution. Inferences on PSR then reduce to inferences on |μE – μC|/σ,
the absolute number of standard deviations separating the means. When σ
is known, the inference reduces to an inference on |μE – μC| with hypoth-
eses tested like those in Expressions 10.3 and 10.4. The analysis can then
be based on a confidence interval for μE – μ C or a TOST on the difference in
means. When σ 2 is unknown, the t statistic can be used to make inference
since its noncentrality parameter is a monotone function of the standardized
difference in means.38 Rom and Hwang38 have also derived the PSR as a
function of the means and standard deviations of two normal distributions,
allowing the standard deviations to be different and unknown. They showed
in this general normal case that the PSR measure provides a better tool for
comparing treatments than the standard t test, which only focuses on a dif-
ference in means and not a difference in the standard deviations.
In terms of equivalence margin for PSR, there is no universally agreed-
upon value. Rom and Hwang38 suggested that a PSR of at least 0.7 (70% over-
lap) could be used to judge whether two treatments are equivalent. Values
of 0.8 or 0.9 have also been suggested. As with other designs, if one is inter-
ested in using the PSR to analyze equivalence trials, it is important to pre-
specify the equivalence margin and discuss its properties before the start of
the study.
For two normal distributions with a common standard deviation, Table
10.1 gives the corresponding PSR for different values of |DS|. To further
interpret these values of |DS| and PSR(DS), the probability that a random
observation from the smaller distribution is greater than a random observa-
tion from the larger distribution and the percentile of the value of the smaller
mean in the larger distribution are also provided in Table 10.1. When the two
means differ by half a standard deviation (i.e., |DS| = 0.5), the PSR is 0.80,
and the probability that a random observation from the smaller distribution
is greater than a random observation from the larger distribution is 0.36.
Also, since a value of a half standard deviation below the mean is the 31st
percentile of a normal distribution, the smaller mean is the 31st percentile of
the larger normal distribution.
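The entries of Table 10.1 follow from standard normal calculations; a small sketch that reproduces them, using P(X1 > X2) = Φ(−|DS|/√2) for independent normals with a common standard deviation:

```python
from math import sqrt
from scipy.stats import norm

def psr_normal(ds):
    """PSR for two normals with common SD: 2 * Phi(-|ds|/2)."""
    return 2 * norm.cdf(-abs(ds) / 2)

print("  |DS|   PSR  P(greater)  percentile")
for ds in [0, 0.25, 0.5, 0.75, 1]:
    psr = psr_normal(ds)
    p_greater = norm.cdf(-abs(ds) / sqrt(2))  # smaller-mean obs exceeds larger-mean obs
    pct = 100 * norm.cdf(-abs(ds))            # percentile of smaller mean in larger dist.
    print(f"{ds:6} {psr:5.2f} {p_greater:10.2f} {pct:10.0f}")
```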
A nonparametric estimate of PSR using kernel density estimates was
proposed by Heyse and Stine.39 This nonparametric estimate avoids strong
assumptions on the shape of the populations, such as normality or equal vari-
ance. Through empirical studies, they showed that nonparametric estimates
of PSR are accurate for a variety of normal and nonnormal distributions. The
sampling variance from the kernel-based estimate of PSR is only slightly
larger than that of the normal maximum likelihood estimated variance for
TABLE 10.1
Comparative Characteristics of Two Normal Distributions Having a Common
Standard Deviation

Number of Standard     Proportion of   Probability that Smaller     Percentile of Smaller
Deviations Difference  Similar         Distribution Has Greater     Mean in the Larger
in Means               Responses       Random Value                 Distribution
0                      1               0.50                         50
0.25                   0.90            0.43                         40
0.5                    0.80            0.36                         31
0.75                   0.71            0.30                         23
1                      0.62            0.24                         16
normal data, and the kernel-based estimate may have less bias in analyzing
nonnormal data.
In a pure nonparametric setting where no assumptions are made about the
underlying distributions, the amount of overlap in the densities also treats
the data as having a nominal scale as in the discrete case. A relationship to
order in the outcomes is introduced when the densities are expressed involv-
ing a parameter for which order makes sense (e.g., the mean), as in the afore-
mentioned normal case.
Kolmogorov–Smirnov Approach. When order makes sense (i.e., the data have
an ordinal, interval, or a ratio scale), a Kolmogorov–Smirnov type of statistic
is one of the possibilities for an equivalence comparison. For distribution
functions FE and FC of the experimental and control arms, respectively, the
hypotheses can be expressed as

Ho: \max_x |F_E(x) - F_C(x)| \ge \delta versus Ha: \max_x |F_E(x) - F_C(x)| < \delta

for some margin δ > 0. The Kolmogorov–Smirnov statistic is given by \max_{-\infty<x<\infty} |\hat{F}_E(x) - \hat{F}_C(x)|, where
\hat{F}_E and \hat{F}_C are the corresponding estimated distribution functions. As equality
of the distributions is in the alternative hypothesis, the common scaled version
of the Kolmogorov–Smirnov test statistic would not apply for equivalence
testing. Bootstrapping or simulations may be useful in studying the behavior
of the Kolmogorov–Smirnov statistic in equivalence testing. Alternatively, the
hypotheses in the above expression could be tested on the basis of simulta
neous confidence bounds for FE(x) – FC(x). Some rank-based tests of equiva-
lence are provided by Wellek.35
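A minimal sketch of the unscaled Kolmogorov–Smirnov distance with a simple bootstrap, as suggested above; the margin against which the upper quantile would be compared, and the simulated data, are hypothetical:

```python
import numpy as np

def ks_distance(x, y):
    """Maximum absolute difference between the two empirical CDFs."""
    grid = np.sort(np.concatenate([x, y]))
    fe = np.searchsorted(np.sort(x), grid, side="right") / len(x)
    fc = np.searchsorted(np.sort(y), grid, side="right") / len(y)
    return np.max(np.abs(fe - fc))

rng = np.random.default_rng(2)
x, y = rng.normal(0, 1, 200), rng.normal(0.1, 1, 200)
boot = [ks_distance(rng.choice(x, 200), rng.choice(y, 200))
        for _ in range(1000)]
print(ks_distance(x, y), np.quantile(boot, 0.95))  # compare to margin delta
```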
(1) The pooled estimate of the within-lot variance is often used when it
may not be appropriate; its use assumes that the within-lot variances
are equal, when it may be important to additionally demonstrate
that the within-lot variances are similar in order to reliably conclude
that the production process will consistently produce lots that have
similar biological effects.
(2) Normality is assumed for the distribution for the mean and this may
not be an appropriate assumption in many cases.
(3) The type I error rate may be much less than 0.05 and is dependent
on the number of lots compared; the larger the number of lots com-
pared, the smaller the type I error rate.
A min test over all pairwise comparisons of the lot means can be based on

Z_{\min} = \min_{1 \le i < j \le k} \frac{\delta - |\bar{X}_i - \bar{X}_j|}{\sqrt{\sigma_i^2/n_i + \sigma_j^2/n_j}}
If Zmin > Zα*, then equivalence is concluded, where the critical value is calcu-
lated from the distribution of the range statistic for the means.46 When the
standard errors are equal (e.g., equal lot sizes and the lots are assumed to
have the same within-lot variance), the min test is equivalent to the range
test of Giani and Finner.44
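A minimal sketch of the Zmin computation over all pairwise lot comparisons; the lot summaries and margin are hypothetical, and the critical value Zα* would come from the distribution of the range statistic as described above:

```python
import math
from itertools import combinations

def z_min(means, sds, ns, delta):
    """Min-test statistic over all pairwise comparisons of k lot means:
    the smallest standardized (margin minus |difference|) across pairs."""
    stats = []
    for i, j in combinations(range(len(means)), 2):
        se = math.sqrt(sds[i] ** 2 / ns[i] + sds[j] ** 2 / ns[j])
        stats.append((delta - abs(means[i] - means[j])) / se)
    return min(stats)

# hypothetical lot means (log titers), SDs, lot sizes, and margin
print(z_min([2.1, 2.0, 2.2], [0.60, 0.50, 0.55], [100, 100, 100], 0.4))
```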
When the within-lot variances or the lot sizes are not equal, the min test can
be quite conservative (i.e., the min test has a type I error rate much smaller
than the desired significance level). Wiens and Iglewicz46 suggested using
an adjusted critical value, Zα**. The smallest within-lot standard error for
the mean across all lots is used as the common within-lot standard error in
determining the value for Zα** . Wiens and Iglewicz46 showed that the result-
ing test is still conservative, but since Zα** ≤ Zα*, the test is both less conserva-
tive and more powerful than the original min test.
Ng47 proposed hypotheses and an equivalence test on the basis of the
between-lot variability of the means for common lot sizes and common
within-lot variances. Here, the null hypothesis is
H_o: \left(\sum_{i=1}^{k} (\mu_i - \bar{\mu})^2\right)^{1/2} \ge \delta
for some margin on the variability, δ. The test statistic is the standard F sta-
tistic for testing for any between-lot variability. On the boundary of the null
hypothesis, the test statistic has a noncentral F distribution with k – 1 and
k(n – 1) degrees of freedom and noncentrality parameter nδ²/σ², where n is
the common lot size and σ² is the common within-lot variance. The critical
value depends on the value of σ². When σ² must be estimated, Ng provides
an iterative method for finding the critical value. The test procedure assumes
that the data are normally distributed.
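A minimal sketch of the critical value calculation when σ² is treated as known, using the noncentral F distribution described above; the inputs are hypothetical, and Ng's iterative method would be needed when σ² is estimated:

```python
from scipy.stats import ncf

def ng_critical_value(k, n, delta, sigma2, alpha=0.05):
    """Critical value for Ng's equivalence test with known sigma2:
    reject Ho (between-lot variability >= delta) when the usual
    between-lot F statistic falls below this alpha-quantile of the
    noncentral F with k-1 and k(n-1) df and ncp n*delta^2/sigma2."""
    return ncf.ppf(alpha, k - 1, k * (n - 1), n * delta**2 / sigma2)

print(ng_critical_value(k=3, n=100, delta=0.4, sigma2=1.0))
```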
When the means and variances of the distributions exist and differences
in values make sense, the expected difference of the average squared dif-
ference between random observations from any two distributions relative
to the same expected difference when the random observations are drawn
from the same distribution is an appropriate measure of the amount of dif-
ference between the two distributions. For many distributions, this would
mean taking a random observation from each distribution and measuring
the variability in their values. Let X1, …, Xk be independent but not identically
distributed random variables with \bar{X} = \sum_{i=1}^{k} X_i/k. Then, if we randomly
select two of the k distributions and randomly draw an observation from
each selected distribution,

E\left[\sum_{i<j}(X_i - X_j)^2 \Big/ \binom{k}{2}\right] = E\left[\frac{2}{k}\sum_{i=1}^{k}(X_i - \bar{X})^2\right]

represents the expected distance squared between the two observations. Let
the respective means be denoted by μ1, …, μk and the respective variances
by \sigma_1^2, \ldots, \sigma_k^2. Then the above expected value equals
\frac{2}{k}\sum_{i=1}^{k}\sigma_i^2 + \frac{1}{k}\sum_{i=1}^{k}(\mu_i - \bar{\mu})^2 \quad (10.6)
For each i = 1, …, k, let W1i and W2i be independent and identically distributed
with variance \sigma_i^2. Then, if we select one of the k distributions at random,

E\left[\sum_{i=1}^{k}(W_{1i} - W_{2i})^2 \Big/ k\right] = \frac{2}{k}\sum_{i=1}^{k}\sigma_i^2 \quad (10.7)
represents the expected square distance between two random observations
taken from that randomly selected distribution. The difference between the two
expectations in Expressions 10.6 and 10.7 equals

\frac{1}{k}\sum_{i=1}^{k}(\mu_i - \bar{\mu})^2
A related measure of divergence among the k distributions, with densities f1, …, fk, is

\gamma = \frac{1}{k}\int_{-\infty}^{\infty}\max_{1 \le j \le k} f_j(x)\, dx
There have been different approaches as to the meaning of lot consistency. For any
given approach, the equivalence margins will vary from case to case, and
may be dependent on the indication and the efficacy of the product.
References
1. Wiens, B.L. and Heyse, J.F., Testing for interaction in studies of non-inferiority,
J. Biopharm. Stat., 13, 103–115, 2003.
2. Gail, M. and Simon, R., Testing for qualitative interactions between treatment
effects and patient subsets, Biometrics, 41, 361–376, 1985.
3. Laska, E.M. and Meisner, M.J., Testing whether an identified treatment is best,
Biometrics, 45, 1139–1151, 1989.
4. Piantadosi, S. and Gail, M.H., A comparison of the power of two tests for quali-
tative interactions, Stat. Med., 12, 1239–1248, 1993.
5. U.S. Code of Federal Regulations, Title 21, Sec. 314.500-560 and Sec. 601.40–46.
6. Fleming, T.R. and Powers, J.H., Issues in non-inferiority trials: The evidence in
community-acquired pneumonia, Clin. Infect. Dis., 47, S108–120, 2008.
7. Bruzzi, P. et al., Objective response to chemotherapy as a potential surrogate
endpoint of survival in metastatic breast cancer patients, J. Clin. Oncol., 23, 5117–
5125, 2005.
8. Fleming, T.R. and DeMets, D.L., Surrogate endpoints in clinical trials: Are we
being misled? Ann. Intern. Med., 125, 605–613, 1996.
9. Fleming, T.R., Surrogate endpoints and FDA’s accelerated approval process:
The challenges are greater than they seem, Health Aff., 24, 67–78, 2005.
10. Fleming, T.R., Objective response rate as a surrogate endpoint: A commentary, J.
Clin. Oncol., 23, 4845–4846, 2005.
11. Rothmann, M.D., Issues to consider when constructing a non-inferiority analy-
sis, ASA Biopharm. Sec. Pro., 1–6, 2005.
12. Fleming, T.R., Current issues in non-inferiority trials, Stat. Med., 27, 317–332,
2008.
13. Prentice, R.L., Surrogate endpoints in clinical trials: Discussion, definition and
operational criteria, Stat. Med., 8, 431–440, 1989.
14. Prentice, R.L., Surrogate and mediating endpoints: Current status and future
directions, J. Natl. Cancer Inst., 101, 216–217, 2009.
15. Baker, S.G., Surrogate endpoints: Wishful thinking or reality? J. Natl. Cancer
Inst., 98, 502–503, 2006.
16. Neupogen product labeling available at http://www.accessdata.fda.gov/Scripts/
cder/DrugsatFDA/index.cfm?fuseaction=Search.Label_ApprovalHistory.
17. Neulasta product labeling available at http://www.accessdata.fda.gov/Scripts/
cder/DrugsatFDA/index.cfm?fuseaction=Search.Label_ApprovalHistory.
18. Committee for Proprietary Medicinal Products. Reflection paper on method-
ological issues in confirmatory clinical trials with flexible design and analysis
plan, EMA, London, 2006.
19. Fleming, T.R., Standard versus adaptive monitoring procedures: A commentary,
Stat. Med., 25, 3305–3312, 2006.
20. Guidance for Industry: Adaptive design clinical trials for drugs and biologics
(draft guidance), February 2010.
21. Pocock, S.J., Group sequential methods in the design and analysis of clinical tri-
als, Biometrika, 64, 191–199, 1977.
22. O’Brien, P.C. and Fleming, T.R., A multiple testing procedure for clinical trials,
Biometrics, 35, 549–556, 1979.
23. Kim, K. and DeMets, D.L., Design and analysis of group sequential tests based
on the type I error spending rate function, Biometrika, 74, 149–154, 1987.
24. Jennison, C. and Turnbull, B.W., Repeated confidence intervals for group
sequential clinical trials, Control. Clin. Trials, 5, 33–45, 1984.
25. Jennison, C. and Turnbull, B.W., Sequential equivalence testing and repeated
confidence intervals, with application to normal and binary data, Biometrics, 49,
31–43, 1993.
26. Lawrence, J., Some remarks about the analysis of active control studies,
Biometrical J., 47, 616–622, 2005.
27. International Conference on Harmonization of Technical Requirements for
Registration of Pharmaceuticals for Human Use (ICH), E9: Statistical principles for
clinical trials, 1998, at http://www.ich.org/cache/compo/475-272-1.html#E4.
28. Wittes, J. and Brittain, E., The role of internal pilot studies in increasing the effi-
ciency of clinical trials, Stat. Med., 9, 65–72, 1990.
29. Friede, T. and Kieser, M., Blind sample size reassessment in non-inferiority and
equivalence trials, Stat. Med., 22, 995–1007, 2003.
30. Gao, P., Ware, J.H., and Mehta, C.R., Sample size re-estimation for adaptive
sequential design in clinical trials, J. Biopharm. Stat., 18, 1184–1196, 2008.
31. Cui, L., Hung, H.M.J., and Wang, S.-J, Modification of sample size in group
sequential clinical trials, Biometrics, 55, 321–324, 1999.
32. Wang, S.J. et al., Group sequential test strategies for superiority and non-inferiority
hypotheses in active controlled clinical trials, Stat. Med., 20, 1903–1912, 2001.
33. Shih, W.J., Quan, H., and Li, G., Two-stage adaptive strategy for superiority
and non-inferiority hypotheses in active controlled clinical trials, Stat. Med., 23,
2781–2798 2004.
34. Koyama, T., Sampson, A.R., and Gleser, L.J., A framework for two-stage adap-
tive procedures to simultaneously test non-inferiority and superiority, Stat.
Med., 24, 2439–2456, 2005.
35. Wellek, S., Testing Statistical Hypotheses of Equivalence, Chapman & Hall/CRC
Press, Boca Raton, FL, 2003.
36. Schuirmann, D., A comparison of the two one-sided tests procedure and the
power for assessing the equivalence of average bioavailability, J. Pharmacokinet.
Pharm., 15, 657–680, 1987.
37. Berger, R.L., Multiparameter hypothesis testing and acceptance sampling,
Technometrics, 24, 295–300, 1982.
38. Rom, D.M. and Hwang, E., Testing for individual and population equivalence
based on the proportion of similar responses, Stat. Med., 15, 1489–1505, 1996.
39. Heyse, J.F. and Stine, R., Use of the overlapping coefficient for measuring the
similarity of treatments, Am. Stat. Assoc. Proc. Biopharm. Sec., 29–32, 2000.
40. Guidance for Industry for the Evaluation of Combination Vaccines for Pre
ventable Diseases: Production, Testing, and Clinical Studies. U.S. Department
of Health and Human Services, Food and Drug Administration, Center for
Biologics Evaluation and Research, April 1997.
41. Lachenbruch, P.A., Rida, W., and Kou, J., Lot consistency as an equivalence
problem, J. Biopharm. Stat., 14, 275–290, 2004.
42. Wiens, B.L., Heyse, J.F., and Matthews, H., Similarity of three treatments, with
application to vaccine development, Am. Stat. Assoc. Proc. Biopharm. Sec., 203–
206, 1996.
43. Lieberman, J.M. et al., The safety and immunogenicity of a quadrivalent mea-
sles, mumps, rubella and varicella vaccine in healthy children: A study of man-
ufacturing consistency and persistence of antibody, Pediatr. Infect. Dis. J., 25,
615–622, 2006.
44. Giani, G. and Finner, H., Some general results on least favorable parameter con-
figurations with special reference to equivalence testing and the range statistic,
J. Stat. Plan. Infer., 28, 33–47, 1991.
45. Sasabuchi, S., A test of multivariate normal mean with composite hypotheses
determined by linear inequalities, Biometrika, 67, 429–439, 1980.
46. Wiens, B. and Iglewicz, B., On testing equivalence of three populations,
J. Biopharm. Stat., 9, 465–483, 1999.
47. Ng, T., Iterative chi-square test for equivalence of multiple treatment groups,
Am. Stat. Assoc. Proc. Biopharm. Sec., 2464–2469, 2002.
48. Cleveland, W. and Lachenbruch, P., A measure of divergence among several
populations, Commun. Stat., 33, 201–211, 1974.
11.1 Introduction
Many clinical trials use outcome variables that are binary in nature, that is,
there are two possible outcomes for each subject. Without loss of generality,
these two outcomes are called “success” and “failure.” Other terms used are
“with an event” and “without the event.” Usually there are qualitative differ-
ences between these two outcomes, in that one outcome is always preferred
over the other. Specifically excluded from this class of outcome variables are
outcomes of a time-to-event endpoint where we are interested in the length
of time until an event (as well as the occurrence of the event are of inter-
est), or outcomes where we are interested in the magnitude in gradations
between success and failure.
The simplest model of proportions is the binomial model, in which each
subject has the same probability of success, p. When a sample of n subjects is
taken, the expected number of successes is np and the variance is np(1 – p).
Often of most interest is the proportion of successes, p̂ = x/n, which then has
mean p and variance p(1 – p)/n. With large sample sizes, the normal approxi-
mation can be used to describe the distribution of p̂ with ease of calculations
and minimal loss of precision, as noted below. Since the variance estimate is
not independent of the mean estimate, extension of the normal approxima-
tion to more complex situations must be used with caution.
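As a quick numerical check of these formulas, the following minimal sketch (our illustration, not from the original text; NumPy and SciPy are assumed, and the values of n and p are arbitrary) compares the standard deviation of p̂ implied by p(1 − p)/n with the exact binomial value, and a normal-approximation tail probability with its exact counterpart.

```python
# Minimal sketch: mean/variance of p-hat and the normal approximation.
# n and p below are arbitrary illustrative values.
import numpy as np
from scipy.stats import binom, norm

n, p = 200, 0.3
sd_phat = np.sqrt(p * (1 - p) / n)       # sd of p-hat = sqrt(p(1-p)/n)
print(sd_phat, binom.std(n, p) / n)      # identical by construction

# normal approximation to P(p-hat <= 0.35) vs. the exact binomial value
print(norm.cdf(0.35, loc=p, scale=sd_phat), binom.cdf(int(0.35 * n), n, p))
```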
The experimental therapy is noninferior to the control therapy if the prob-
ability of a success outcome on the experimental arm is better than or not too
much worse than that of the control arm. When a “success” is a positive or
desirable outcome (as the word “success” suggests), this means that the prob-
ability of a success for the experimental arm is greater than or not too much
less than that for the control arm. When a “success” is a negative or undesir-
able outcome, this means that the probability of a success for the experimen-
tal arm is less than or not too much greater than that for the control arm. This
“not too much less than” or “not too much greater than” can be expressed
through a difference in the two probabilities of a success, in the ratio of the
two probabilities (i.e., a relative risk), or through an odds ratio.
For a desirable outcome, the non-inferiority hypotheses based on the difference in the probabilities of a success are

Ho: pE − pC ≤ −δ versus Ha: pE − pC > −δ.  (11.1)
That is, the null hypothesis is that the active control is superior to the experi-
mental treatment by at least a quantity of δ ≥ 0 that is prespecified. The alter-
native hypothesis is that the active control is superior by a smaller amount,
or the two treatments are identical, or the experimental treatment is superior.
When δ = 0, the hypotheses in Expression 11.1 reduce to classical one-sided
hypotheses for a superiority trial. The null hypothesis in Expression 11.1 is
rejected and the experimental therapy is concluded to be noninferior to the
control therapy when a decrease in the proportion of success of δ or greater
is statistically ruled out. If a “success” is an undesirable outcome, the roles
of pC and pE in the hypotheses in Expression 11.1 would be reversed (i.e., test
Ho: pC – pE ≤ –δ vs. Ha: pC – pE > –δ).
In the simplest application involving a desirable outcome, a confidence
interval can be calculated on the difference pE – pC, and non-inferiority is con-
cluded (the null hypothesis is rejected) if the lower bound is greater than –δ.
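A minimal sketch of this decision rule, assuming SciPy is available and using the simple Wald interval (whose limitations are discussed in Section 11.2.3); the counts are the hypothetical data of Example 11.4 below and the margin δ = 0.10 is illustrative.

```python
# Minimal sketch (not the authors' code): conclude non-inferiority when the
# lower Wald confidence bound for pE - pC exceeds -delta.
import numpy as np
from scipy.stats import norm

def wald_noninferiority(y, nE, x, nC, delta, alpha=0.05):
    pE, pC = y / nE, x / nC
    se = np.sqrt(pE * (1 - pE) / nE + pC * (1 - pC) / nC)
    lower = (pE - pC) - norm.ppf(1 - alpha / 2) * se
    return lower, lower > -delta          # lower bound, NI conclusion

# hypothetical data of Example 11.4: lower bound is about -0.098
print(wald_noninferiority(131, 150, 135, 150, delta=0.10))
```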
TABLE 11.1
Notation for Breakdown of Counts of Responses between Treatment Arms
Treatment Arm Response No Response Sample Size
Control x nC – x nC
Experimental y nE – y nE
Total s nC + nE – s nC + nE
Given a true difference (pE – pC = –δ) under the null hypothesis Ho, the
probability of observing an outcome (x, y) = (i, j) is given by
P(X = i, Y = j | Ho) = (nC choose i)(nE choose j)(p + δ)^i (1 − p − δ)^(nC−i) p^j (1 − p)^(nE−j),  (11.2)
where p = pE is the nuisance parameter with the domain A = [0, 1 – δ]. For the
classical null hypothesis of no difference (δ = 0), the marginal total (S = X + Y)
is the sufficient statistic for the nuisance parameter (p). To eliminate the effect
of p, an exact test can be constructed conditional on this sufficient statistic,
which yields the well-known Fisher’s exact test.
In the case where pE – pC = –δ (δ > 0), there is no simple sufficient statistic for
p (the numbers of successes from each group are jointly minimal sufficient
statistics). Therefore, the conditional argument will not simplify the problem
of the nuisance parameter in testing non-inferiority hypotheses. In general,
an exact test of non-inferiority can be developed on the basis of the null prob-
ability distribution given in Equation 11.2 using the unconditional sampling
space consisting of all possible 2 × 2 tables given the sample sizes (nC, nE).
The exact test procedure defines the tail region (TR) of the observed table
(i, j) as the region of those tables that are at least as extreme as the observed
table according to a predefined ordering criterion. Then the exact p-value is
defined as
p-value = max_{p∈A} P( (X, Y) ∈ TR(i, j) | Ho, p ).  (11.3)
The exact p-value calculation eliminates the nuisance parameter using the
maximization principle,2,3 which caters to the worst-case scenario. Because
the maximization involves a large number of iterations in evaluating sums
of binomial probabilities, the exact unconditional tests are computationally
intensive, particularly with large sample sizes.
A natural ordering criterion proposed by Chan4 used the Z-statistic based
on the constrained maximum likelihood estimate (MLE) of parameters
under the null hypothesis:
Z1 = Z(x, y) = (p̂E − p̂C + δ) / { p̃E(1 − p̃E)/nE + p̃C(1 − p̃C)/nC }^(1/2)  (11.4)

where p̂C and p̂E are the observed response rates for the control and experimental treatment groups, respectively. In addition, p̃C and p̃E are the MLEs
of pC and pE, respectively, under the constraint pE − pC = −δ given in the null
hypothesis. The closed-form solutions for p̃C and p̃E are given by Farrington
and Manning5 and are provided in Expression 11.6. Since large values of Z1
favor the alternative hypothesis, the tail region includes those tables whose
Z1 statistics are larger than or equal to the Z1 statistic associated with the
observed table (i, j), zobs. As a result, the exact p-value can be obtained as
p-value = max_{p∈A} P( {(X, Y): Z1 ≥ zobs} | Ho, p ).
1. Compute the Z1 statistic for all tables and order them. Let zobs be the
calculated value of Z1 for the observed table. The tail of the observed
table includes those tables whose Z1 statistics are larger than or equal
to zobs.
2. For a given value of the nuisance parameter p in A = [0, 1 – δ], cal-
culate the tail probability by summing up the probabilities of those
tables in the tail using the probability function (Equation 11.2).
3. Repeat step 2 for every value of p in its domain. Then the exact
p-value is the maximum of the tail probability over the domain of
p. Since the domain of p is continuous, a numerical grid search (e.g.,
more than 1000 points) over the domain can be done to obtain the
maximum tail probability. This should provide adequate accuracy
for most practical uses.
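The following sketch implements steps 1 through 3 for the difference measure. It is an illustration under stated assumptions rather than the authors' implementation: the restricted MLEs are obtained by a one-dimensional grid maximization of the likelihood instead of the closed-form Farrington–Manning solution of Expression 11.6, and the toy counts are hypothetical and kept small because the enumeration is computationally intensive.

```python
# Minimal sketch of the exact unconditional non-inferiority test (difference
# scale) following steps 1-3 above; SciPy assumed.
import numpy as np
from scipy.stats import binom

def z1_stat(y, x, nE, nC, delta):
    """Z1 of Equation 11.4; the restricted MLE (pE - pC = -delta) is found by
    a 1-D grid search instead of the closed form of Expression 11.6."""
    grid = np.linspace(1e-6, 1 - delta - 1e-6, 2000)
    loglik = binom.logpmf(y, nE, grid) + binom.logpmf(x, nC, grid + delta)
    pE_t = grid[np.argmax(loglik)]        # restricted MLE of pE
    pC_t = pE_t + delta                   # restricted MLE of pC
    se = np.sqrt(pE_t * (1 - pE_t) / nE + pC_t * (1 - pC_t) / nC)
    return (y / nE - x / nC + delta) / se

def exact_p_value(y_obs, x_obs, nE, nC, delta, n_grid=1000):
    # step 1: order all (nE+1) x (nC+1) tables by their Z1 statistics
    z = np.array([[z1_stat(j, i, nE, nC, delta) for i in range(nC + 1)]
                  for j in range(nE + 1)])
    tail = z >= z[y_obs, x_obs]           # tables at least as extreme
    # steps 2-3: maximize the tail probability over the nuisance parameter
    best = 0.0
    for p in np.linspace(1e-6, 1 - delta - 1e-6, n_grid):
        pr = np.outer(binom.pmf(np.arange(nE + 1), nE, p),
                      binom.pmf(np.arange(nC + 1), nC, p + delta))
        best = max(best, pr[tail].sum())
    return best

# toy data: 14/15 successes (E) vs. 13/15 (C), margin delta = 0.10
print(exact_p_value(y_obs=14, x_obs=13, nE=15, nC=15, delta=0.10))
```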
For a nominal α level test, we reject the null hypothesis if the exact p-value
is less than or equal to α. To obtain the true level of the exact test, we first
convert the test procedure to find the critical value given the nominal α level
and the sample sizes nC and nE. This critical value does not depend on any
specific value of the nuisance parameter, and the true level is the maximum
(over the domain of the nuisance parameter) null probability of those tables
of which the test statistics are less than or equal to the critical value. This
exact test has been implemented in commercial software.
When δ = 0, p̃C and p̃E both simplify to the pooled estimate of the response
rate among the two groups, and the Z1 statistic in 11.4 reduces to the Z-pooled
statistic for the classical null hypothesis of no difference. As a result, the exact
unconditional test of non-inferiority based on Z1 provides a generalization of
the unconditional test of the classical null hypothesis studied by Suissa and
Shuster6 and Haber.7
Other types of statistics may also be considered as ordering criteria. A few
examples include: (1) the observed difference Dobs = pˆ E − pˆ C, (2) a Z-statistic
with the variance (denominator of Z) estimated directly from the observed
proportions, (3) a Z-statistic with the variance estimated from fixed marginal
totals,8 and (4) a likelihood ratio statistic.9 Findings from empirical investiga-
tions show that the Z-statistic in 11.4 generally performs better than Dobs and
other Z-type statistics. Röhmel and Mansmann10 recommended an ordering criterion that satisfies Barnard's convexity condition.
The upper bound of the two-sided 100(1 − α)% confidence interval (ΔU) for Δ = pE − pC is obtained by considering the one-sided hypothesis Ho: Δ = Δo versus H1: Δ < Δo such that

ΔU = sup{ Δo : max_{pC∈A} P( Z1(X, Y; Δo) ≤ Z1(x, y; Δo) | Δo, pC ) > α/2 }.
Similarly, the lower bound of the two-sided 100(1 – α)% confidence interval
(ΔL) is obtained by considering the one-sided hypothesis Ho: Δ = Δo versus
H1: Δ > Δo such that
ΔL = inf{ Δo : max_{pC∈A} P( Z1(X, Y; Δo) ≥ Z1(x, y; Δo) | Δo, pC ) > α/2 }.
It was shown by Chan and Zhang9 that the exact confidence interval based
on the Z1 statistic is much better than the simple tail-based confidence inter-
val (see Santner and Snell14) as well as confidence intervals based on the
Z-unpooled and the likelihood ratio statistics. Also, since the exact confi-
dence interval based on Z1 is obtained by inverting two one-sided tests, it
controls the error rate of each side at the α/2 level, and hence provides con-
sistent inference with the p-value from the one-sided hypothesis. In other
words, if the null hypothesis in 11.1 is rejected at the one-sided α/2 level for
a specific δ, then the lower bound of the two-sided 100(1 – α)% confidence
interval for the difference pE – pC will be greater than –δ.
Exact confidence intervals have also been proposed by Agresti and Min15
and Chen16 by inverting the two-sided hypothesis Ho: Δ = Δo versus H1: Δ ≠
Δo based on the Z1 statistic. The resulting confidence interval generally has
a shorter width than the one obtained by inverting two one-sided tests, and
therefore is very useful if the hypothesis is two-sided in nature or if esti-
mation is of primary interest. Other methods (non-test-based) that are also
useful for estimation purposes have been proposed by Coe and Tamhane17
and Santner and Yamagami.18 By inverting a two-sided test, the confidence
interval controls the overall error rate at the α level but does not guarantee
control of the error rate of each side at the α/2 level. Consequently, it may
potentially produce results that are inconsistent with a one-sided hypothesis
test. Therefore if the criterion of showing non-inferiority is to require that the
lower bound of the two-sided confidence interval for the difference pE – pC
be greater than –δ, then controlling the one-sided type I error is essential,
and constructing the confidence interval by inverting two one-sided tests is
recommended.
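A minimal sketch of the lower bound ΔL by test inversion, reusing the exact_p_value function from the earlier sketch; only Δo ≤ 0 is scanned (which suffices for a non-inferiority margin), and coarse grids are used because each exact p-value is expensive to compute. The step size and toy data are illustrative.

```python
# Minimal sketch: lower confidence limit by inverting one-sided exact tests.
import numpy as np

def exact_lower_bound(y_obs, x_obs, nE, nC, alpha=0.05, step=0.02):
    """Approximate Delta_L of the two-sided 100(1-alpha)% interval for
    pE - pC, restricted to Delta_o <= 0."""
    for d_o in np.arange(-0.98, 1e-9, step):
        # smallest Delta_o whose one-sided exact test is not rejected
        if exact_p_value(y_obs, x_obs, nE, nC, delta=-d_o,
                         n_grid=200) > alpha / 2:
            return d_o
    return 0.0

print(exact_lower_bound(14, 13, 15, 15))
```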
Examples 11.1 and 11.2 provide p-values from applying some of these meth-
ods to the results of actual studies.
Example 11.1
Example 11.2
Z = (p̂E − p̂C + δ) / se(p̂E − p̂C)  (11.5)
where zα/2 is the 100(1 – α/2)th percentile of the standard normal distribution
and se( pˆ E − pˆ C ) is the standard error of the estimated difference in propor-
tions. The standard error is commonly estimated by the unrestricted MLE,
√( p̂E(1 − p̂E)/nE + p̂C(1 − p̂C)/nC ), which leads to a Wald's confidence interval for the
true difference in proportions. However, not all possible values of p̂E and p̂C can
be observed: in a study with nC and nE observations, only multiples of 1/nC and
1/nE, respectively, can occur. When there are large sample sizes, these (nE + 1)
(nC + 1) possible outcomes are fairly dense within the unit square, the param-
eter space of pE and pC. In cases with small sample sizes and/or probabilities
of success near 0 or 1, this simple confidence interval can have suboptimal
coverage probabilities and the associated test can reject the null hypothesis
less often (or more often) than desired. In addition, the unrestricted MLE of
the variance is inconsistent with the null hypothesis, which restricts the true
difference in proportions.
Hauck and Anderson22 considered confidence intervals of the form
pˆ E − pˆ C ± { zα/2 × se( pˆ E − pˆ C ) + CC}, where CC denotes continuity correction
and se( pˆ E − pˆ C ) may or may not be based on the unrestricted MLE of the
variance. Hauck and Anderson concluded that some adjustment is neces-
sary, either in estimating the standard error or through use of a continuity
correction or both, even if sample sizes are large. With minor restrictions
on sample size, Hauck and Anderson recommended the unbiased estimate
of standard error (i.e., using n − 1 in the denominators instead of n as in the
MLE) and also using a continuity correction of 1/{2 × min(nE,nC)}. We note
that this is based on two-sided coverage probabilities, not on the testing of a
where
v = b³/(3a)³ − bc/(6a²) + d/(2a),
u = sign(v)[ b²/(3a)² − c/(3a) ]^(1/2),
w = [π + cos⁻¹(v/u³)]/3,
and
a = 1 + k,
d = −p̂C δ(1 + δ).
The Farrington–Manning test statistic is

ZFM = Z1 = (p̂E − p̂C + δ) / { p̃E(1 − p̃E)/nE + p̃C(1 − p̃C)/nC }^(1/2),
Example 11.3
p̂ ± zα/2 √( p̂(1 − p̂)/n ).
Many authors have discussed the poor coverage properties (and mainte-
nance of a desired type I error probability) of Wald’s confidence interval in
both the single proportion and the difference in proportions settings.23–29 We
begin with some results involving a single proportion.
Agresti and Coull25 proposed adding two successes and two failures to the data before applying Wald's 95% interval, giving the interval p̃ ± z0.025 √( p̃(1 − p̃)/ñ ), where p̃ = (x + 2)/ñ and ñ = n + 4. Applying this idea to a 100(1 − α)% confidence interval for arbitrary α yields the interval p̃ ± zα/2 √( p̃(1 − p̃)/ñ ), where p̃ = (x + z²α/2/2)/ñ and ñ = n + z²α/2. When this interval is applied to no data (x = 0, n = 0), the resulting interval is [0, 1]. Agresti and Coull reported substantial improvement in the coverage probability of this interval over Wald's interval for small sample sizes.
Brown, Cai, and Dasgupta29 compared the probability coverage and inter-
val lengths of several methods for constructing a confidence interval for a
single proportion. They recommended the Wilson interval or the equal-tailed
Jeffreys credible interval for small sample sizes (n ≤ 40), and the interval of
Agresti and Coull25 for large sample sizes (n > 40). All of these intervals
have instances where the coverage probability of the 95% interval is below
95%. For success rates fairly close to 0 and 1, the Jeffreys interval had a very
small coverage probability. To improve the probability coverage in such
cases, a modified version of the Jeffreys interval was proposed by Brown,
Cai, and Dasgupta.29 When x = 0, define the upper limit of the interval by
1 – (α/2)1/n and when x = n define the lower limit of the interval by (α/2)1/n.
For the details and more on the Jeffreys credible interval, see Appendices
A.2.2 and A.3.
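The single-proportion intervals discussed above can be sketched as follows (our illustration, with SciPy assumed; the data of x = 2 successes in n = 25 trials are arbitrary).

```python
# Minimal sketch: Wald, Agresti-Coull, and the modified Jeffreys interval
# for a single proportion.
import numpy as np
from scipy.stats import beta, norm

def wald_single(x, n, alpha=0.05):
    p, z = x / n, norm.ppf(1 - alpha / 2)
    half = z * np.sqrt(p * (1 - p) / n)
    return p - half, p + half

def agresti_coull(x, n, alpha=0.05):
    z = norm.ppf(1 - alpha / 2)
    n_t = n + z ** 2                      # n-tilde
    p_t = (x + z ** 2 / 2) / n_t          # p-tilde
    half = z * np.sqrt(p_t * (1 - p_t) / n_t)
    return p_t - half, p_t + half

def jeffreys_modified(x, n, alpha=0.05):
    # equal-tailed Jeffreys credible interval with the boundary
    # modification of Brown, Cai, and Dasgupta described above
    lo = 0.0 if x == 0 else beta.ppf(alpha / 2, x + 0.5, n - x + 0.5)
    hi = 1.0 if x == n else beta.ppf(1 - alpha / 2, x + 0.5, n - x + 0.5)
    if x == 0:
        hi = 1 - (alpha / 2) ** (1 / n)
    if x == n:
        lo = (alpha / 2) ** (1 / n)
    return lo, hi

print(wald_single(2, 25), agresti_coull(2, 25), jeffreys_modified(2, 25))
```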
The choices for standard errors are the unrestricted MLE of the standard
error and a modified version that replaces ni with ni − 1 for i = E, C. The
possible corrections are: (1) no correction (CC = 0), (2) Yates correction (CC =
1/(2nE) + 1/(2nC)), (3) a correction of Schouten et al.31 (CC = 1/(2 max(nE,nC))),
and (4) a correction of Hauck and Anderson (CC = 1/(2 min(nE,nC))). The cases
considered had minimum expected cell counts ranging from 2 to 15 with the
smallest group size ranging from 6 to 100.
As mentioned in Section 11.2.3, Hauck and Anderson22 recommended
the use of the Hauck–Anderson correction with the modified version of the
standard error. When the desired confidence level was 90% or 95% and the
minimum expected cell count was at least 3, that method gave coverage
probabilities close to the desired level. Wald’s interval with a Yates correction
also performed reasonably well but was more conservative. Wald’s interval
without any correction did not provide adequate coverage at any sample size
studied, and Hauck and Anderson recommended against its use. When the
desired confidence level was 99% and the minimum expected cell count was
at least 5, their recommended method and Wald's interval with a Yates cor-
rection performed equally well. No method studied performed consis-
tently well when the minimum expected cell count was 2. Tu32 preferred
Wald’s interval with a Hauck–Anderson continuity correction for equiva-
lence testing.
Li and Chuang-Stein33 evaluated and compared the type I error rate in non-
inferiority testing of a difference of two proportions using Wald’s interval
with and without a Hauck–Anderson continuity correction. Their evaluation
was based on equal allocation for “sample sizes relevant to the confirmatory
trials.” The sample sizes were between 100 and 300. For the cases studied
where all the expected cell counts (successes and failures for both arms) were
greater than 15 and 2.5% is the one-sided targeted type I error rate, the
estimated type I error rate for the standard Wald’s interval was between 2.3%
and 2.75%. Wald’s interval with Hauck–Anderson continuity correction pro-
duced type I error rates consistently below 2.5%. For the cases studied where
some of the expected cell counts were less than 15 and 2.5% is the one-sided
targeted type I error rate, the estimated type I error rate for the standard
Wald’s interval could go beyond 2.75%. The inflation appeared to increase as
the smallest expected cell count approached 5. In these cases, Wald’s inter-
val with Hauck–Anderson continuity correction performed fairly well and
produced type I error rates below 2.75%. Li and Chuang-Stein33 recommend
using Wald’s interval without a continuity correction when the expected fre-
quency of all cell counts is at least 15. Otherwise, they recommend imple-
menting the Hauck–Anderson correction.
Newcombe28 compared the coverage probabilities and expected lengths
of 11 methods for determining a confidence interval for the difference in
proportions. A tail area profile likelihood-based method, and the methods
of Mee34 and Miettinen and Nurminen,35 which invert test statistics that
use standard errors restricted to the specified difference in proportions, all
performed well but were either difficult to compute or required a computer
program. Newcombe recommended a method that combined Wilson score
intervals for the two proportions either with or without a continuity cor-
rection. The Newcombe–Wilson 100(1 – α)% confidence interval without a
continuity correction is given by (L, U) where
lE and uE are the roots of pE − y/nE = zα/2 pE (1 − pE )/nE and lC and uC are
the roots of pC − x/nC = zα/2 pC (1 − pC )/nC . For the Newcombe–Wilson in
terval with a continuity correction, lE and uE are the limits of the interval
{ }
p : p − y/nE − 0.5/nE ≤ zα/2 p(1 − p)/nE and lC and uC are the limits of the
{
interval p : p − x/nC − 0.5/nC ≤ zα/2 p(1 − p)/nC . }
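A minimal sketch of the Newcombe–Wilson interval without a continuity correction, using the closed-form Wilson score limits for each arm; applied to the hypothetical data of Example 11.4 below, it reproduces the lower limit of about −0.1002 quoted there.

```python
# Minimal sketch: Newcombe-Wilson interval for pE - pC by combining the
# Wilson score limits of each arm (square-and-add rule).
import numpy as np
from scipy.stats import norm

def wilson_limits(x, n, alpha=0.05):
    z = norm.ppf(1 - alpha / 2)
    p = x / n
    center = (p + z ** 2 / (2 * n)) / (1 + z ** 2 / n)
    half = (z * np.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2))
            / (1 + z ** 2 / n))
    return center - half, center + half

def newcombe_wilson(y, nE, x, nC, alpha=0.05):
    pE, pC = y / nE, x / nC
    lE, uE = wilson_limits(y, nE, alpha)
    lC, uC = wilson_limits(x, nC, alpha)
    d = pE - pC
    return (d - np.sqrt((pE - lE) ** 2 + (uC - pC) ** 2),
            d + np.sqrt((uE - pE) ** 2 + (pC - lC) ** 2))

print(newcombe_wilson(131, 150, 135, 150))   # ~(-0.1002, 0.0465)
```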
Motivated by Agresti and Coull,25 Agresti and Caffo26 proposed to use the
corresponding Wald’s interval after adding one success and one failure to
each treatment group for the 95% confidence interval of the difference of pro-
portions. This addition of observations performed best on the basis of their
simulations using pairs of true probabilities selected randomly over the unit
square (uniform distribution) and group sizes selected randomly (uniform
distribution) over {10, 11, …, 30}. The resulting 95% confidence interval is p̃E − p̃C ± z0.025 √( p̃E(1 − p̃E)/ñE + p̃C(1 − p̃C)/ñC ), where, for i = E, C, p̃i = (xi + 1)/ñi and ñi = ni + 2 with xi denoting the observed number of successes in arm i. When this interval is applied to no data (x = y = 0 and ni = 0 for i = E, C), the resulting interval is [−1, 1]. Note that for a common sample size, the middle of the Agresti and Caffo interval is closer to zero than that of Wald's interval (i.e., |p̃E − p̃C| ≤ |p̂E − p̂C|). However, this need not be true for uneven sample sizes.
Santner et al.36 compared the small-sample probability coverage and
expected lengths of five methods for determining a 90% confidence inter
val for the difference of proportions. The methods include the asymptotic
method of Miettinen and Nurminen,35 which is based on the score statis-
tic, and the exact methods of Agresti and Min,15 Chan and Zhang,9 Coe and
Tamhane,17 and Santner and Yamagami.18 For seven pairs of sample sizes
(three cases of balanced allocation and four cases of unbalanced allocation
were examined), the average (exact) probability coverage was calculated
(based on binomial distributions) across 10,000 pairs of (pE,pC) selected evenly
across the unit square. The overall sample size ranged from 20 to 70. The
authors conclude that the exact method of Coe and Tamhane performed the
best, and the asymptotic method of Miettinen and Nurminen performed
the worst. The authors recommended the use of the Coe and Tamhane
method; when that method is not available, either the method of Agresti and
Min or the method of Chan and Zhang is recommended. The use of any of
these five methods was strongly recommended by the authors in the abstract
of the paper. However, in the conclusions of the paper, use of the methods of
Santner and Yamagami and Miettinen and Nurminen was discouraged.
the distribution of type I error rates of Wald's method. How the type
I error rates of the Agresti–Caffo method compare also depends on
the control success rate.
Dann and Koch37 seem to prefer Wald’s method and the Agresti–Caffo
method when the allocation ratio is large (3:2, 2:1, or 3:1), and the Farrington–
Manning method or the Newcombe–Wilson method for smaller allocation
ratios (1:2 or 1:1). It should be noted that their results do not directly apply to
control success rates (for positive outcomes) less than 0.5.
We close this section with Example 11.4, which applies the 95% confidence
intervals for the difference in proportions from various methods to the
results from a hypothetical clinical trial.
Example 11.4
Suppose there are 131 successes among 150 subjects in the experimental arm
and 135 successes among 150 subjects in the control arm. For various asymptotic
and Bayesian methods (see Section 11.5), the corresponding 95% two-sided confidence interval for pE − pC is determined. The results are provided in Table 11.2,
where the methods are listed in decreasing order with respect to the lower confidence limit. Four of the nine methods would lead to a non-inferiority conclusion if
δ = 0.10 (the lower limit of the Newcombe–Wilson interval is −0.1002).
TABLE 11.2
95% Confidence Intervals for Difference in Proportions
Method 95% Confidence Interval
Wald (–0.098, 0.045)
Zero prior Bayesiana (–0.099, 0.045)
Jeffreys (–0.099, 0.045)
Agresti–Caffo (–0.099, 0.046)
Newcombe–Wilson (–0.100, 0.047)
Newcombe–Wilson with CC (–0.100, 0.047)
Farrington–Manningb (–0.101, 0.047)
Wald with Hauck and Anderson CC (–0.102, 0.048)
Wald with Yates CC (–0.105, 0.052)
a Based on the resulting posterior distributions as the prior parameters α → 0 and β → 0 for each arm.
b Standard errors based on the null restricted estimates of the proportions.
that pair of probabilities closest to (p1, p2) satisfying p1′ – p2′ = –δ (when a
restricted estimate of the standard error is used), or p1′ = p1 and p2′ = p2 (when
an unrestricted estimate of the standard error is used). Then the power
is approximately

P( (p̂E − p̂C + δ)/σo > zα/2 ) = P( (p̂E − p̂C − Δa)/σa > (zα/2σo − δ − Δa)/σa ) ≈ Φ( (Δa + δ − zα/2σo)/σa ).
For a desired power of 1 – β, the right-hand term above is set equal to Φ(zβ).
Simplifying the equation leads to
nC = [ ( zβ √( p1(1 − p1)/k + p2(1 − p2) ) + zα/2 √( p1′(1 − p1′)/k + p2′(1 − p2′) ) ) / (Δa + δ) ]².  (11.8)
For the analyses proposed in previous papers,5,35 using p1′ = (p1 + p2 – δ)/2 and
p2′ = (p1 + p2 + δ)/2 or using p1′ = (kp1 + p2 – δ)/(1 + k) and p2′ = (kp1 + p2 + kδ)/(1 +
k) could also be appropriate, as they more closely match the analysis method.
Otherwise, ( p1′ , p2′ ) can be selected using some rule for determining the pair
of ( p1′ , p2′ ) in the null hypothesis that is the most difficult to reject when (p1, p2)
is the true pair of the probabilities of a success. The change in the estimated
sample size may not be dramatically affected unless δ is very large, or p1 and
p2 are close to 0 or 1. When δ = 0, corresponding to a superiority analysis, it is
common to use p1′ = p2′ = (p1 + p2)/2 or p1′ = p2′ = (kp1 + p2)/(1 + k). Otherwise,
the approaches of Farrington and Manning5 and Miettinen and Nurminen35
to obtain MLEs of the true success rates restricted to the null hypothesis can
be adapted to determine p1′ and p2′ . This is accomplished by treating p1 and
p2 as the observed success rates.
The use of the sample-size formula in Equation 11.8 is illustrated in
Example 11.5.
Example 11.5
Suppose δ = 0.10, p1 = p2 = 0.85, Δa = 0, p1′ = (p1 + p2 − δ)/2 = 0.80, p2′ = (p1 + p2 + δ)/2 = 0.90, z0.025 = 1.96, and z0.10 = 1.28. Then by 11.8, the required sample size per arm for a one-to-one randomization (k = 1) is calculated as 16.26² ≈ 264.5. Thus around 265 subjects should be randomized to each treatment group. For p1′ = p1 = 0.85 and p2′ = p2 = 0.85, the calculated sample size is 268 subjects per arm.
A new antibiotic might be developed to have a higher cure rate than currently available treatments. If the investigational antibiotic might cure 95% of all cases (p1 = 0.95, p2 = 0.85, Δa = 0.10, p1′ = (p1 + p2 − δ)/2 = 0.85, p2′ = (p1 + p2 + δ)/2 = 0.95), then the required sample size is only 46 per treatment arm by 11.8. This sample size may be low enough to cause concern about whether the difference in sample proportions has an approximately normal distribution. A sample size of 200 per group would provide 90% power to show that the experimental treatment is superior, with even greater power to show non-inferiority.
Alternatively, a new antibiotic may be expected to have a slightly lower cure rate than currently available treatments, but have some other advantage such as better tolerability. If the investigational antibiotic might cure 80% of all cases, the required sample size to show non-inferiority is about 1200 per treatment group (p1 = 0.80, p2 = 0.85, Δa = −0.05, p1′ = (p1 + p2 − δ)/2 = 0.775, p2′ = (p1 + p2 + δ)/2 = 0.875).
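The three calculations in Example 11.5 can be reproduced with the following minimal sketch of Equation 11.8 (the function and argument names are our own; p1p and p2p denote p1′ and p2′).

```python
# Minimal sketch of the per-arm sample-size formula in Equation 11.8.
import math
from scipy.stats import norm

def n_control(p1, p2, p1p, p2p, delta, Delta_a, k=1.0,
              alpha=0.05, beta=0.10):
    num = (norm.ppf(1 - beta) * math.sqrt(p1*(1 - p1)/k + p2*(1 - p2))
           + norm.ppf(1 - alpha/2) * math.sqrt(p1p*(1 - p1p)/k
                                               + p2p*(1 - p2p)))
    return (num / (Delta_a + delta)) ** 2

print(n_control(0.85, 0.85, 0.80, 0.90, 0.10, 0.0))    # ~264.5 -> 265 per arm
print(n_control(0.95, 0.85, 0.85, 0.95, 0.10, 0.10))   # ~46 per arm
print(n_control(0.80, 0.85, 0.775, 0.875, 0.10, -0.05))  # ~1200 per arm
```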
The total sample size, nC + nE = (1 + k)nC, corresponding to Equation 11.8 is

[ ( zβ √( p1(1 − p1)/k + p2(1 − p2) ) + zα/2 √( p1′(1 − p1′)/k + p2′(1 − p2′) ) ) / (Δa + δ) ]² (1 + k).  (11.9)
The optimal k that minimizes Expression 11.9 can be found using calculus
by taking a derivative of Expression 11.9 with respect to k, setting the result
equal to zero and then solving for k, or by a “grid search” by evaluating
Expression 11.9 for many candidates for k.
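A grid search for the optimal allocation ratio can be sketched as follows (our illustration; the inputs are the assumptions of Example 11.5, and the search range for k is arbitrary).

```python
# Minimal sketch: grid search for the k = nE/nC minimizing Expression 11.9.
import numpy as np
from scipy.stats import norm

def total_size(k, p1, p2, p1p, p2p, delta, Delta_a, alpha=0.05, beta=0.10):
    num = (norm.ppf(1 - beta) * np.sqrt(p1*(1 - p1)/k + p2*(1 - p2))
           + norm.ppf(1 - alpha/2) * np.sqrt(p1p*(1 - p1p)/k
                                             + p2p*(1 - p2p)))
    return (num / (Delta_a + delta)) ** 2 * (1 + k)

ks = np.linspace(0.5, 3.0, 251)
sizes = np.array([total_size(k, 0.85, 0.85, 0.80, 0.90, 0.10, 0.0)
                  for k in ks])
print(ks[sizes.argmin()], sizes.min())   # approximate optimal k, study size
```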
Example 11.6 determines the optimal allocation ratio using the assump-
tions in Example 11.5
Example 11.6
FIGURE 11.1
Relationship between null and selected alternative.
Ho: θ = pE/pC ≤ θo versus Ha: θ = pE/pC > θo, where 0 < θo ≤ 1.  (11.10)

That is, the null hypothesis is that the success rate of the experimental arm is
smaller than a prespecified fraction, θo, of the success rate of the control arm,
whereas the alternative is that the active control is superior by a smaller amount,
the two treatments are identical, or the experimental treatment is superior. When
θo = 1, the hypotheses in Expression 11.10 reduce to classical one-sided hypoth-
eses for a superiority trial. If a “success” is a negative outcome, the roles of pC and
pE in the hypotheses in Expression 11.10 would be reversed, leading to hypotheses that can be expressed as Ho: θ = pE/pC ≥ θo versus Ha: θ = pE/pC < θo, where θo ≥ 1.
In the simplest application with no covariates, a confidence interval can be
calculated for pE/pC and non-inferiority is concluded (the null hypothesis in
Expression 11.10 is rejected) if the lower bound of the confidence interval is
greater than θo.
Z2 = Z2(X, Y) = (p̂E − θo p̂C) / { p̃E(1 − p̃E)/nE + θo² p̃C(1 − p̃C)/nC }^(1/2)  (11.11)
where p̂C and p̂E are the observed response rates and p̃C and p̃E are the MLEs
of pC and pE, respectively, under the constraint pE = θopC given in the null
hypothesis in Expression 11.10. The closed-form solutions for p̃C and p̃E are
given in Farrington and Manning's study5 and in Expression 11.18 of this
text. Since large values of Z2 favor the alternative hypothesis, the tail region
includes tables whose Z2 values are greater than or equal to zobs, the Z2 value
for the observed table. Therefore, the exact p-value is calculated as

p-value = max_{p∈A} P( {(X, Y): Z2 ≥ zobs} | Ho, p ),  (11.12)

where p (= pE) is the nuisance parameter with the domain A = [0, θo], and the probability is evaluated using the following null probability function:

P(X = i, Y = j | Ho) = (nC choose i)(nE choose j) θo^(−i) p^(i+j) (1 − θo^(−1)p)^(nC−i) (1 − p)^(nE−j).
For a nominal α-level test, the critical region and true size can be calculated in a similar fashion as for the exact unconditional test using the difference measure
described in Section 11.2.2. In the special case where θo = 1, the hypotheses in
Expression 11.10 are those in Expression 11.1 with δ = 0. In this special case,
the Z1 and Z2 statistics are identical. Chan and Bohidar39 have studied the
utility of this exact unconditional test in designing clinical trials and found
that the empirical performance of this exact test compares very favorably
with its asymptotic counterpart (Z2 test) in terms of type I error rate, power,
and sample size under a wide range of true parameter values.
Example 11.7
Chan12 reanalyzed the data in Example 11.2 using the relative risk measure to show
non-inferiority based on the criterion requiring that the tumor response rate to the
chemotherapy treatment (pE) be greater than 90% of the response to the radiation
therapy (pC). This corresponds to a threshold of θo = 0.9 for the relative risk. Since
p̂E = 0.943 [83/88] and p̂C = 0.908 [69/76], the observed relative risk is θ̂ = 1.039.
The MLEs when pE/pC = 0.9 are p̃C = 0.946 and p̃E = 0.851, which gives Z2 = 2.835
from Equation 11.11. From Equation 11.12 the exact unconditional test based on Z2
yielded a p-value of 0.0028, compared with the asymptotic p-value of 0.0024. Both
tests strongly supported the conclusion of non-inferiority. At the one-sided 5% level, the size of the exact test is 4.82%, whereas the type I error rate of the asymptotic Z2 test is 5.32% when pC = 0.9, and the size of the asymptotic test is approximately 5.59%.
φ = λE/(λC + λE) = nE pE/(nC pC + nE pE) = θ/(θ + u),

where u = nC/nE.
Since φ is increasing in θ, the non-inferiority hypotheses in Expression
11.10 are equivalent to

Ho: φ ≥ φo versus Ha: φ < φo,  (11.13)

where φo = θo/(θo + u). Thus, inferences can be based on a simple exact test
involving a one-sample binomial distribution. Suppose yobs is the number
of disease cases observed in the experimental group, then the exact p-value
conditional on the total number of disease cases S = s is
p-value = Pr{Y ≤ yobs | Y ∼ Binomial(s, φo)} = Σ_{k=0}^{yobs} [ s!/(k!(s − k)!) ] φo^k (1 − φo)^(s−k).  (11.14)
For an α-level test, the critical value yα can be determined as the largest value
satisfying that Pr{Y ≤ yα|Y ∼ Binomial(s, ϕo)} is as close as possible to α from
below without exceeding it. The power conditional on S = s for testing the
hypotheses in Expression 11.13 against a specific alternative ϕ = ϕ1 < ϕo is
then calculated as Pr{Y ≤ yα|y ∼ Binomial(s, ϕ1)}, where ϕ1 = θ 1/(θ 1 + u).
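A minimal sketch of these conditional design calculations, giving the critical value yα and the conditional power for a fixed total number of cases s (SciPy assumed; the inputs in the example call are illustrative).

```python
# Minimal sketch: critical value and conditional power for the exact
# conditional test of Equation 11.14.
from scipy.stats import binom

def critical_value(s, phi_o, alpha=0.05):
    # largest y with P(Y <= y | Binomial(s, phi_o)) <= alpha
    y = int(binom.ppf(alpha, s, phi_o))
    while binom.cdf(y, s, phi_o) > alpha:
        y -= 1
    return y

def conditional_power(s, theta_o, theta_1, u, alpha=0.05):
    phi_o = theta_o / (theta_o + u)
    phi_1 = theta_1 / (theta_1 + u)
    return binom.cdf(critical_value(s, phi_o, alpha), s, phi_1)

# e.g., 60 total cases, margin theta_o = 1.5, true theta_1 = 0.5, u = 1
print(conditional_power(60, 1.5, 0.5, 1.0))
```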
Note that Equation 11.14 could also be evaluated via the F-distribution
using the following relationship (see, e.g., Johnson, Kotz, and Kemp,40 p. 110):

Σ_{k=0}^{y} [ s!/(k!(s − k)!) ] φ^k (1 − φ)^(s−k) = F_{ν1,ν2}( ν2φ/(ν1(1 − φ)) ).
The confidence limits for φ are

φL = ν1 F⁻¹_{ν1,ν2}(α/2) / ( ν2 + ν1 F⁻¹_{ν1,ν2}(α/2) ),

where
ν1 = 2yobs,
ν2 = 2(s − yobs + 1),
and

φU = ν1 F⁻¹_{ν1,ν2}(1 − α/2) / ( ν2 + ν1 F⁻¹_{ν1,ν2}(1 − α/2) ),

where
ν1 = 2(yobs + 1),
ν2 = 2(s − yobs).
Then, a 100(1 − α)% exact confidence interval (θL, θU) for the relative risk θ
is given by

θL = uφL/(1 − φL),  θU = uφU/(1 − φU).  (11.15)
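A minimal sketch of this exact conditional confidence interval via Equation 11.15 and the F-distribution form of the binomial (Clopper–Pearson) limits (SciPy assumed; the example inputs are arbitrary).

```python
# Minimal sketch: exact conditional CI for the relative risk (Eq. 11.15).
from scipy.stats import f as f_dist

def exact_conditional_rr_ci(y_obs, s, nE, nC, alpha=0.05):
    u = nC / nE
    if y_obs > 0:
        v1, v2 = 2 * y_obs, 2 * (s - y_obs + 1)
        q = f_dist.ppf(alpha / 2, v1, v2)
        phi_L = v1 * q / (v2 + v1 * q)
    else:
        phi_L = 0.0
    if y_obs < s:
        v1, v2 = 2 * (y_obs + 1), 2 * (s - y_obs)
        q = f_dist.ppf(1 - alpha / 2, v1, v2)
        phi_U = v1 * q / (v2 + v1 * q)
    else:
        phi_U = 1.0
    theta_L = u * phi_L / (1 - phi_L)
    theta_U = u * phi_U / (1 - phi_U) if phi_U < 1 else float("inf")
    return theta_L, theta_U

# e.g., 5 of 30 total cases on the experimental arm, equal group sizes
print(exact_conditional_rr_ci(y_obs=5, s=30, nE=500, nC=500))
```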
This exact conditional method can be applied to design a study with a goal to
obtain a fixed total number of events instead of running for a fixed duration.
Once the desired total number of events (S) is fixed, the power of the study
depends on incidence rates only through the relative risk (θ = pE/pC); thus,
one can avoid the situation potentially encountered in a fixed-duration trial
where the anticipated power is not achieved at the end of the trial because
the number of events is too few owing to unexpectedly low incidence rates.
Since the unconditional expected value of S is (nCpC + nEpE), the expected
number of subjects required for the study can be estimated on the basis of
the incidence rate in the control group (pC) and the relative risk (θ = θ 1) under
the alternative hypothesis:
nE ≈ s / ( (u + θ1) pC ).  (11.16)
Example 11.8
Chan12 used the exact conditional method to design a non-inferiority trial compar-
ing a new hepatitis A vaccine with immune globulin (IG, standard treatment C) in
postexposure prophylaxis of hepatitis A disease. IG is believed to have approxi-
mately 90% efficacy for postexposure prophylaxis. However, IG is a derived
blood product, and thus there are concerns about its safety and purity in addition
to its short-lived protection. In contrast, the hepatitis A vaccine has been demon-
strated to be safe and highly efficacious (≈100%) in preexposure prophylaxis,41 and
capable of inducing long-term protective immunity against hepatitis A in healthy
subjects.42 Recognizing the potential long-term benefit of the vaccine, investiga-
tors of this study intended to show that the vaccine is noninferior to IG in terms of
postexposure efficacy. Since we are dealing with negative outcomes (have the dis-
ease), the hypotheses to be tested are Ho: θ = pE/pC ≥ θo versus Ha: θ = pE/pC < θo,
where θo ≥ 1. If non-inferiority is established (θ < θo), one can infer that the new
vaccine has reasonable efficacy (π E) for postexposure prophylaxis on the basis of
the following indirect argument:
πE = 1 − pE/pU = 1 − (pE/pC)·(pC/pU) = 1 − θ(1 − πC) > 1 − θo(1 − πC)  (11.17)
and p̃C = p̃E/θo, where x and y are the observed numbers of successes in the
control and experimental arms, respectively, given in Table 11.1.
An alternative would be to base the inference on the log-relative risk,
log(pE/pC) = log pE – log pC. Katz et al.46 proposed the use of the standard
Taylor series method for determining a one-sided or two-sided confidence
interval for a relative risk.
By using the asymptotic standard error of log( pˆ E ) − log( pˆ C ) , a 100 (1 – α)%
confidence interval for log pE – log pC can be calculated as
log(p̂E) − log(p̂C) ± zα/2 √( (1 − p̂E)/(nE p̂E) + (1 − p̂C)/(nC p̂C) ).
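A minimal sketch of this Taylor series (Katz) interval; applied to the hypothetical data of Example 11.9 below, it reproduces the interval (0.895, 1.052) reported in Table 11.3.

```python
# Minimal sketch: Katz et al. (Taylor series) CI for the relative risk.
import numpy as np
from scipy.stats import norm

def katz_rr_ci(y, nE, x, nC, alpha=0.05):
    pE, pC = y / nE, x / nC
    se = np.sqrt((1 - pE) / (nE * pE) + (1 - pC) / (nC * pC))
    z = norm.ppf(1 - alpha / 2)
    log_rr = np.log(pE / pC)
    return np.exp(log_rr - z * se), np.exp(log_rr + z * se)

print(katz_rr_ci(131, 150, 135, 150))   # ~(0.895, 1.052), cf. Table 11.3
```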
A Wald-type statistic for testing the hypotheses in Expression 11.10 is

(p̂E − θo p̂C) / √( p̂E(1 − p̂E)/nE + θo² p̂C(1 − p̂C)/nC ),

and a Pearson-type chi-square statistic is

(y − nE p̃E)² / (nE p̃E(1 − p̃E)) + (x − nC p̃C)² / (nC p̃C(1 − p̃C)),

where p̃E and p̃C are the MLEs restricted to pE/pC = θo (see Equation 11.18 for the expressions for p̃E and p̃C). The test statistic is compared with the appropriate upper percentile of a χ² distribution with 1 degree of freedom. Bailey's confidence limits for the relative risk, based on a cube-root transformation, are

(p̂E/p̂C) · [ ( 1 ± zα/2 { (1 − p̂E)/y + (1 − p̂C)/x − z²α/2(1 − p̂E)(1 − p̂C)/(9xy) }^(1/2) / 3 ) / ( 1 − z²α/2(1 − p̂C)/(9x) ) ]³.

Coverage probabilities were compared for one-sided lower 95% confidence intervals for the same possible combinations as in Katz et al.'s46 study (nE = nC = 100; pC = 0.1, 0.2, or 0.4; and θo =
0.25, 0.5, 0.667, 1, 1.5, 2, and 4). Bailey’s method had coverage probabilities
much closer to the desired level than the Taylor series method and Pearson’s
method. For non-inferiority testing targeting a one-sided type I error rate
of 5%, the corresponding (one-sided) type I error rates ranged from 4.6% to
5.2% for Bailey’s method, from 4.3% to 5.8% for the Taylor series method, and
from 3.8% to 4.6% for Bailey’s method with a continuity correction.
Dann and Koch1 evaluated and compared several methods of construct-
ing a confidence interval for the relative risk based on the calculated limits,
power, type I error rate, and agreement/disagreement with other methods.
The methods included were classified into three categories: Taylor series
methods, solution to quadratic equation methods (Fieller-based methods
applied to normalized test statistics), and maximum likelihood–based meth-
ods (asymptotic likelihood ratio test and Pearson’s χ2 test).
The Taylor series methods are the standard Taylor series method, the mod-
ified Taylor series method of Gart and Nam50 that adds 0.5 to the number of
successes in each group and to the common sample size, and another modi-
fied method (adapted Agresti–Caffo) that adds 4θo/(1 + θo) successes and 4/
(1 + θo) failures to the experimental group and 4/(1 + θo) successes and 4θo/
(1 + θo) failures to the control group. In addition, a Taylor series adjusted
alpha method that uses z0.0225 = 2.005 instead of z0.025 = 1.96 for the standard
Taylor series method is investigated.
The quadratic methods studied included the standard quadratic method
(referred to as “F-M 1” by Dann and Koch1), an adapted version that divides
by one less the sample size when determining the standard error (referred to
as “the quadratic method” in the paper of Dann and Koch1), Bailey’s method,
and two variations of the quadratic method provided by Farrington and
Manning.5 One of the variations in Farrington and Manning’s study5 uses
the MLEs in Equation 11.18 in determining the standard error. The other
variation uses the approach of Dunnett and Gent8 in obtaining estimates of
the success rates for determining the standard error.
The maximum likelihood methods include the Pearson’s method pro-
posed by Koopman48 and the generalized likelihood ratio test (referred to as
the “deviance method” by Dann and Koch1). The deviance statistic is
2 ln[ L(p̂E, p̂C)/L(p̃E, p̃C) ], where p̃E and p̃C are the MLEs restricted to pE/pC = θo
provided in Section 11.3.3. The test statistic is compared with the appropriate
upper percentile of a χ2 distribution with 1 degree of freedom. A two-sided
100(1 – α)% confidence interval for the relative risk consists of those values that
can be specified as θo for which the respective one-sided or two-sided test of level
α fails to reject the null hypothesis. A one-sided confidence interval would extend
the appropriate side of a two-sided confidence interval to zero or infinity.
The assessments were based on one-sided 97.5% confidence intervals. The
trial sizes used were 100, 140, and 200 patients per group. Control success
rates (for undesirable outcomes) of 0.10, 0.15, 0.20, and 0.25 were examined
with null relative risks (experimental/control) of 0.667, 0.8, 1, 1.25, 1.5, and 2.
For each combination of sample size, control success rate, and relative risk,
100,000 simulations were performed. For the majority of methods, when the
experimental or control number of “successes” was three or fewer, the confi-
dence interval was replaced by the corresponding exact confidence interval
for the odds ratio. When there were no successes in the control group, the
upper limit was assigned the value of 100.
The simulated type I error rates were provided in the paper of Dann and
Koch1 for a null relative risk of 2. The quadratic method had the smallest simu-
lated type I error rate in every case. In all 12 of these cases, the Taylor series
adjusted alpha method and the deviance methods maintained the type I error
rate between 0.023 and the desired 0.025. Bailey's method maintained a type
I error rate between 0.0225 and 0.0275 in all cases. The adapted Agresti–Caffo
method had simulated type I error rates between 0.020 and 0.027. The other
methods did not maintain the targeted type I error rate as consistently as the
Taylor series adjusted alpha, deviance, adapted Agresti–Caffo, and Bailey
methods. In cases having a null relative risk of 0.667, 0.8, 1,
1.25, and 1.5, the power was determined for each method for a true relative risk
of 2. In those 60 cases, the Taylor series adjusted alpha, deviance, and Bailey’s
methods all produced extraordinarily similar simulated power.
We close with Example 11.9, which applies the 95% confidence intervals
for the relative risk from various methods to the results from a hypothetical
clinical trial.
Example 11.9
Suppose there are 131 successes among 150 subjects in the experimental arm
and 135 successes among 150 subjects in the control arm. For various asymptotic
and Bayesian methods (see Section 11.5), the corresponding 95% two-sided confidence interval for pE/pC is determined. The results are provided in Table 11.3
TABLE 11.3
95% Confidence Intervals for Relative Risk from Various Methods
Method 95% Confidence Interval
Taylor series (Katz) (0.895, 1.052)
Bailey (0.895, 1.052)
Standard quadratic (0.894, 1.052)
Zero prior Bayesiana (0.893, 1.052)
Taylor series adjusted alpha (0.893, 1.054)
Jeffreys (0.893, 1.053)
Deviance (0.892, 1.053)
Farrington–Manning (0.890, 1.055)
Koopman–Pearson (0.890, 1.055)
a Based on the resulting posterior distributions as the prior parameters α → 0 and β → 0 for each arm.
where the methods are listed in decreasing order with respect to the lower confi-
dence limit. The order of the lower limits can be fairly arbitrary and depends on
the general success rate, which of the arms had the greater rate, the sample size,
and the allocation ratio.
nC = [ ( zβ √( p1(1 − p1)/k + θo² p2(1 − p2) ) + zα/2 √( p1′(1 − p1′)/k + θo² p2′(1 − p2′) ) ) / (p1 − θo p2) ]².  (11.20)
Using p1′ = θo(p1 + p2)/(1 + θo) and p2′ = (p1 + p2)/(1 + θo) or using p1′ = θo(kp1 +
p2)/(1 + kθo) and p2′ = (kp1 + p2)/(1 + kθo) could be appropriate in many cases.
Otherwise, ( p1′ , p2′ ) can be selected using some rule for determining the pair
of ( p1′ , p2′ ) in the null hypothesis that is the most difficult to reject when (p1,
p2) is the true pair of the probabilities of a success. The change in the esti-
mated sample size may not be dramatically affected unless θo is very small,
or p1 and p2 are close to 0 or 1. It should be understood that p1′ = θo(kp1 + p2)/
(1 + kθo) and p2′ = (kp1 + p2)/(1 + kθo) need not both be between 0 and 1 (e.g.,
when p1 = p2 = 0.9, θo = 0.5 and k = 1, p2′ = 1.2).
Example 11.10 compares and contrasts the sample-size formulas in
Equations 11.19 and 11.20.
Example 11.10
We will first compare and contrast the sample-size formulas in Equations 11.19
and 11.20 at both 80% and 90% power for three cases based on a one-to-one
randomization (k = 1). The values for θo, p1, and p2 are provided for each case
below.
The values chosen for p1′ and p2′ will be based on the formulas p1′ = θo(p1 + p2)/
(1 + θo) and p2′ = (p1 + p2)/(1 + θo). The results are summarized in Table 11.4.
In all cases examined, the sample size was smaller using formula 11.20 than for-
mula 11.19. In each case, the respective calculated sample sizes from the formulas
are closer for 90% power than for 80% power. The calculated sample sizes using
formula 11.19 grew at a faster rate or at least a faster relative rate as the power
increased from 80% to 90%. This occurs because when the inference is based on
the distribution for pˆE − θ o pˆ C , σa /σo is larger than when the inference is based on
the distribution of the estimator of the log-relative risk. For case 1, there was very
little difference in the sample-size calculation. Case 2 had a smaller value for θo
TABLE 11.4
Calculated Sample Sizes for 80% and 90% Power for Cases 1 through 3
Sample Size per Arm
Power (%) θo (p1, p2) (p1′ , p2′ ) log( pˆ E / pˆ C )a pˆ E − θ o pˆ Cb
80 0.7 (0.4, 0.4) (0.329, 0.471) 192 190
90 0.7 (0.4, 0.4) (0.329, 0.471) 256 255
80 0.3 (0.04, 0.04) (0.018, 0.062) 336 284
90 0.3 (0.04, 0.04) (0.018, 0.062) 435 403
80 0.1 (0.04, 0.04) (0.007, 0.073) 168 90
90 0.1 (0.04, 0.04) (0.007, 0.073) 204 141
a Calculations based on Equation 11.19.
b Calculations based on Equation 11.20.
and values for p1 and p2 close to zero. The calculated sample size begins to devi-
ate between the two formulas. Deviation is larger despite the sample sizes being
smaller for case 3, which had an even smaller value for θo than case 2. When p1′ =
p1 and p2′ = p2, the results are summarized in Table 11.5.
For case 1, comparing Tables 11.4 and 11.5, there was only moderate change in
the calculated sample sizes using p1′ = p1 = 0.4 and p2′ = p2 = 0.4 instead of p1′ =
θo(p1 + p2)/(1 + θo) = 0.329 and p2′ = (p1 + p2)/(1 + θo) = 0.471. For cases 2 and 3,
there were dramatic changes in the calculated sample sizes. Although in Table
11.4, for all cases examined, the sample size was smaller using formula 11.20 than
formula 11.19, the reverse is seen in Table 11.5. As with Table 11.4, for cases 2 and
3, there were different calculated sample sizes between formulas 11.19 and 11.20.
For each case when the inference is based on the log-relative risk estimator, the
calculated sample size decreases when p1′ = p1 and p2′ = p2 is used instead of p1′ =
θo(p1 + p2)/(1 + θo) and p2′ = (p1 + p2)/(1 + θo). This is because when p1′ + p2′ is fixed,
(1− p1′)/p1′ + (1− p2′ )/p2′ = 1/p1′ + 1/p2′ – 2 becomes smaller when the probabilities
p1′ and p2′ become more similar.
Conversely, for an inference based on p̂E − θo p̂C, the calculated sample size
increases (p1′(1 − p1′) + θo² p2′(1 − p2′) increases) when p1′ = p1 and p2′ = p2 is used
TABLE 11.5
Calculated Sample Sizes when p1′ = p1 and p2′ = p2
Sample Size per Arm
Power (%) θo (p1, p2) ( p1′ , p2′ ) log( pˆ E / pˆ C )a pˆ E − θ o pˆ C b
80 0.7 (0.4, 0.4) (0.4, 0.4) 186 195
90 0.7 (0.4, 0.4) (0.4, 0.4) 248 261
80 0.3 (0.04, 0.04) (0.04, 0.04) 260 420
90 0.3 (0.04, 0.04) (0.04, 0.04) 348 561
80 0.1 (0.04, 0.04) (0.04, 0.04) 72 235
90 0.1 (0.04, 0.04) (0.04, 0.04) 96 315
a Calculations based on Equation 11.19.
b Calculations based on Equation 11.20.
instead of p1′ = θo(p1 + p2)/(1 + θo) and p2′ = (p1 + p2)/(1 + θo). When p1′ + p2′ = s for
fixed s and 0 ≤ θo ≤ 1, the maximum value of p1′(1 − p1′) + θo² p2′(1 − p2′) occurs when
p1′ = min{ s, (1 + 2sθo² − θo²)/(2θo² + 2) }.
Example 11.10 illustrates that the choice for p1′ and p2′ can have a small or
rather large effect on the calculated sample size depending on the value for θo
and the expected success probabilities. Whenever the calculated sample size
changes greatly as the choices for p1′ and p2′ change, simulations should be
used to find the appropriate sample size or validate a calculated sample size.
The total study size, (1 + k)nC, when the inference is based on log(p̂E/p̂C) is

[ ( zβ √( (1 − p1)/(kp1) + (1 − p2)/p2 ) + zα/2 √( (1 − p1′)/(kp1′) + (1 − p2′)/p2′ ) ) / (log θa − log θo) ]² (1 + k),  (11.21)

and when the inference is based on p̂E − θo p̂C it is

[ ( zβ √( p1(1 − p1)/k + θo² p2(1 − p2) ) + zα/2 √( p1′(1 − p1′)/k + θo² p2′(1 − p2′) ) ) / (p1 − θo p2) ]² (1 + k).  (11.22)
In either case, the optimal k that minimizes Equation 11.21 or 11.22 can be
found by using calculus or by a “grid search.” Example 11.11 compares and
contrasts the sample-size formulas in Equations 11.21 and 11.22.
Example 11.11
We will compare and contrast the results for the optimal overall study size based
on formulas 11.21 and 11.22 for cases 1 through 3 in Example 11.10 at both 80%
and 90% power when p1′ = θo(p1 + p2)/(1 + θo) and p2′ = (p1 + p2)/(1 + θo). The
results are summarized in Tables 11.6 and 11.7.
We see that the study size reduction is more prominent when the inference is
based on pˆE − θ o pˆ C . Within a case, the optimal k for 90% power is smaller than
that for 80% when the inference is based on the distribution of log( pˆE / pˆ C ) , but
larger when the inference is based on the distribution of pˆE − θ o pˆ C.
Tables 11.8 and 11.9 provide analogous results on the calculation of the optimal
allocation ratio, k, and the corresponding study size when p1′ = θo(kp1 + p2)/(1 +
kθo) and p2′ = (kp1 + p2)/(1 + kθo).
TABLE 11.6
Sample Sizes for Log-Relative Risk Based on Optimal Allocation
log( pˆ E / pˆ C )
Reduction in
Case Power (%) Ratio nE nC n Study Sizea
1 80 1.23 210 170 380 4 (1%)
1 90 1.20 277 231 508 4 (1%)
2 80 1.55 390 251 641 31 (5%)
2 90 1.47 499 341 840 30 (3%)
3 80 2.34 203 87 290 46 (13%)
3 90 2.11 246 117 363 45 (11%)
a Reduction is relative to the sample-size calculation in Table 11.4 using Equation 11.19.
For this scenario, the calculated sample sizes are more similar between formulas
11.19 and 11.20 than in the earlier scenarios. Although earlier, when p1′ = θo(p1 +
p2)/(1 + θo) and p2′ = (p1 + p2)/(1 + θo), the study reduction was more prominent
using an optimal allocation ratio when the inference is based on pˆE − θ o pˆ C , we see
that when p1′ = θo(kp1 + p2)/(1 + kθo) and p2′ = (kp1 + p2)/(1 + kθo), the study size
reduction is more prominent using an allocation ratio when the inference is based
on log( pˆE / pˆ C ). As before, within a case, the optimal k for 90% power is smaller
than that for 80% when the inference is based on the distribution of log( pˆE / pˆ C ) ,
but larger when the inference is based on the distribution of pˆE − θ o pˆ C . Compared
with when p1′ = θo(p1 + p2)/(1 + θo) and p2′ = (p1 + p2)/(1 + θo), the calculated study
size for the optimal allocation ratio is smaller when the inference is based on the
distribution of log( pˆE / pˆ C ) when p1′ = θo(kp1 + p2)/(1 + kθo) and p2′ = (kp1 + p2)/(1 +
kθo), but larger when the inference is based on the distribution of pˆE − θ o pˆ C . For
cases 1 through 3, the optimal allocation ratios when p1′ = θo(p1 + p2)/(1 + θo) and
p2′ = (p1 + p2)/(1 + θo) and the inference based on the distribution of log( pˆE / pˆ C )
were similar to the optimal allocation ratios when p1′ = θo(kp1 + p2)/(1 + kθo) and
p2′ = (kp1 + p2)/(1 + kθo) and the inference based on the distribution of pˆE − θ o pˆ C .
Likewise, the optimal allocation ratios when p1′ = θo(kp1 + p2)/(1 + kθo) and p2′ =
(kp1 + p2)/(1 + kθo) with the inference is based on the distribution of log( pˆE / pˆ C )
TABLE 11.7
Sample Sizes for pˆ E − θ o pˆ C Based on Optimal Allocation
pˆ E − θ o pˆ C
Reduction in
Case Power (%) Ratio nE nC n Study Sizea
1 80 1.37 214 156 370 10 (3%)
1 90 1.38 288 209 497 13 (3%)
2 80 2.10 337 161 498 70 (12%)
2 90 2.34 486 208 694 112 (14%)
3 80 4.53 105 23 128 52 (29%)
3 90 5.00 163 33 196 86 (30%)
a Reduction is relative to the sample-size calculation in Table 11.4 using Equation 11.20.
TABLE 11.8
Sample Sizes for Log-Relative Risk Based on Optimal Allocation
Reduction in
Case Power (%) ( p1′ , p2′ ) Ratio nE nC n Study Sizea
1 80 (0.342, 0.489) 1.53 222 146 368 16 (4%)
1 90 (0.340, 0.486) 1.44 293 203 496 16 (3%)
2 80 (0.024, 0.079) 2.40 394 165 559 113 (17%)
2 90 (0.023, 0.077) 2.13 515 242 757 113 (13%)
3 80 (0.016, 0.164) 5.24 154 30 184 152 (45%)
3 90 (0.015, 0.146) 4.19 206 50 256 152 (37%)
a Reduction is relative to the sample-size calculation in Table 11.4 using Equation 11.19.
were similar to the optimal allocation ratios when p1′ = θo(p1 + p2)/(1 + θo) and p2′ =
(p1 + p2)/(1 + θo) and the inference is based on the distribution of pˆE − θ o pˆ C.
These examples illustrate how the “optimal” allocation ratio depends on the
selection of p1′ and p2′ . Therefore, for the relative risk, the “optimal” allocation
ratio should be interpreted with caution. A moderate or even large change
in the allocation ratio often provides only a small change in the power (for a
fixed sample size) or sample size (for fixed power). Also, the allocation ratio
selected to maximize the power for the analysis of the primary efficacy end-
point may not be optimal or appropriate for the evaluation of secondary effi-
cacy endpoints and/or safety endpoints.
When the inference is based on log( pˆ E /pˆ C ) and p1′ = p1 = p2′ = p2, the optimal
allocation ratio will be k = 1 (and thus the sample sizes are those provided
in Table 11.5). When the inference is based on pˆ E − θ o pˆ C and p1′ = p1 = p2′ = p2,
the optimal allocation ratio will be k = 1/θo. The sample sizes are provided
in Table 11.10 for cases 1 through 3. Compared with Table 11.9 when p1′ =
θo(kp1 + p2)/(1 + kθo) and p2′ = (kp1 + p2)/(1 + kθo), the sample sizes for the
optimal allocation ratio are larger when p1′ = p1 and p2′ = p2, as are the cor-
responding calculated optimal allocation ratios.
TABLE 11.9
Sample Sizes for pˆ E − θ o pˆ C Based on Optimal Allocation
Reduction in
Case Power (%) ( p1′ , p2′ ) Ratio nE nC n Study Sizea
1 80 (0.338, 0.482) 1.32 212 160 372 8 (2%)
1 90 (0.338, 0.483) 1.34 286 214 500 10 (2%)
2 80 (0.020, 0.068) 1.43 325 226 551 17 (3%)
2 90 (0.021, 0.070) 1.59 471 296 767 39 (5%)
3 80 (0.011, 0.107) 2.27 106 47 153 27 (15%)
3 90 (0.012, 0.117) 2.71 169 63 232 50 (18%)
a Reduction is relative to the sample-size calculation in Table 11.4 using Equation 11.20.
TABLE 11.10
Sample Sizes for pˆ E − θ o pˆ C Based on Optimal Allocation
Reduction in
Case Power (%) ( p1′ , p2′ ) Ratio nE nC n Study Sizea,b
1 80 (0.4, 0.4) 1.43 223 156 379 1 (0%)
1 90 (0.4, 0.4) 1.43 298 209 507 3 (1%)
2 80 (0.04, 0.04) 3.33 500 150 650 –82 (–14%)
2 90 (0.04, 0.04) 3.33 669 201 870 –64 (–8%)
3 80 (0.04, 0.04) 10 256 26 282 –102 (–57%)
3 90 (0.04, 0.04) 10 343 34 377 –95 (–34%)
a Reduction is relative to the sample-size calculation in Table 11.4 using Equation 11.20.
b Reductions in sample sizes relative to Table 11.5, using Equation 11.20, are 11 (3%), 15 (3%),
190 (24%), 252 (22%), 188 (40%), 253 (40%), respectively.
ω = p/(1 − p),

and inversely

p = ω/(1 + ω).

When the odds ω = 2, p = 2/3, indicating 2 successes for every failure. For a
comparative study with two binomial parameters, pE and pC, the odds ratio
between the new treatment (E) and the control (C) is

ψ = [ pE/(1 − pE) ] / [ pC/(1 − pC) ].
It can be seen that the odds ratio = relative risk of a success ÷ relative risk of
a failure. When the probabilities of a success are very small (the relative risk
of a failure ≈ 1), the odds ratio is approximately equal to the relative risk of
a success.
When a success is a desirable outcome, the hypotheses for testing that the
experimental therapy is noninferior to the control therapy based on a pre-
specified threshold of ψo (0 < ψo ≤ 1) are

Ho: ψ ≤ ψo versus Ha: ψ > ψo.  (11.23)

Conditional on the total number of successes S = s, the number of successes Y in the experimental group has the probability function

P(Y = y | s, ψ) = (nE choose y)(nC choose s − y) ψ^y / Σ_k (nE choose k)(nC choose s − k) ψ^k,  (11.24)
where the permissible values of y and k consist of all integers within the
range max(0, s – nC) to min(nE, s). This is called the extended hypergeomet-
ric distribution, and more details can be found in the papers of Zelterman51
and Johnson et al.40 Note that for a classical null hypothesis of unity odds
ratio (Ho: ψ = 1), the probability function in Equation 11.24 will reduce to the
hypergeometric distribution under the null hypothesis.
Suppose yobs is the observed number of positive responses in the new treatment group. The exact p-value for testing the hypotheses in Expression 11.23 is then Σ_{i=yobs}^{B} P(Y = i|s, ψo), and exact confidence limits (ψL, ψU) for ψ satisfy

Σ_{i=A}^{yobs} P(Y = i|s, ψU) = α/2 and Σ_{i=yobs}^{B} P(Y = i|s, ψL) = α/2,

where A = max(0, s − nC) and B = min(nE, s). Agresti and Min15 have also discussed the construction of confidence intervals for the odds ratio by inverting a two-sided test.
The above method of analyzing the odds ratio in a single 2 × 2 table has been
extended to analyze a common odds ratio in a series of 2 × 2 tables.54–57
The sample estimator of the odds ratio is

ψ̂ = y(nC − x)/[x(nE − y)] = p̂E(1 − p̂C)/[p̂C(1 − p̂E)].

If there are zero counts, the following amended estimator has been shown to have good large-sample behavior58:

ψ̃ = (y + 0.5)(nC − x + 0.5)/[(x + 0.5)(nE − y + 0.5)].
The standard error of log ψ̂ is estimated by

σ̂ = [1/Y + 1/(nE − Y) + 1/X + 1/(nC − X)]^(1/2) = [1/(nE p̂E(1 − p̂E)) + 1/(nC p̂C(1 − p̂C))]^(1/2),

where Y and X denote the numbers of successes in the experimental and control arms.
To protect against having a zero cell count, 1 or 0.5 can be added to each cell
count in the estimator of standard error. On the basis of the asymptotic nor-
mality of log ψ̂, a two-sided 100 (1 – α)% Wald’s confidence interval for log ψ
is given by
log ψ̂ ± zα/2 σ̂
where zα/2 is the upper α/2 percentile of the standard normal distribution.
Therefore, a confidence interval (ψ L,ψ U) for the odds ratio (ψ) can be obtained
by exponentiating the above limits.
For testing the non-inferiority hypotheses in Expression 11.23, the test statistic is

Z = (log ψ̂ − log ψo)/σ̂,

where Z has an approximate standard normal distribution when ψ = ψo and Φ(•) denotes the standard normal distribution function, so that the p-value is 1 − Φ(Z). The null hypothesis will be rejected if Z > zα/2 or, equivalently, if the p-value is less than α/2. Equivalently, one can compare the lower limit (ψL) of the 100(1 − α)% confidence interval for ψ with ψo. The non-inferiority hypothesis will be rejected at the one-sided α/2 level if ψL > ψo.
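These calculations are simple to carry out directly. The following Python sketch is a minimal illustration: the function name is ours, the inputs in the final line are hypothetical, and the 0.5 cell correction is applied throughout (in both the amended point estimate and the standard error), which is one of the choices described above.

    import numpy as np
    from scipy.stats import norm

    def odds_ratio_wald(y, n_e, x, n_c, psi_o, alpha=0.05):
        """Wald test of Ho: psi <= psi_o and a two-sided 100(1 - alpha)% CI
        for the odds ratio, adding 0.5 to each cell (the amended estimator)
        to protect against zero counts."""
        psi_hat = ((y + 0.5) * (n_c - x + 0.5)) / ((x + 0.5) * (n_e - y + 0.5))
        se = np.sqrt(1 / (y + 0.5) + 1 / (n_e - y + 0.5)
                     + 1 / (x + 0.5) + 1 / (n_c - x + 0.5))
        z = (np.log(psi_hat) - np.log(psi_o)) / se
        z_crit = norm.ppf(1 - alpha / 2)
        ci = np.exp(np.log(psi_hat) + np.array([-1.0, 1.0]) * z_crit * se)
        return psi_hat, ci, z, 1 - norm.cdf(z)   # one-sided p-value

    # Hypothetical data: 40/50 successes (experimental), 35/50 (control), margin 0.5
    print(odds_ratio_wald(40, 50, 35, 50, 0.5))

Using the same corrected cells in the point estimate and the standard error keeps the test and the confidence interval consistent with each other.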
If there are covariates to be adjusted in the analysis, one can consider
performing a logistic regression with covariates or a log-linear model if all
covariates are categorical. Then the odds ratio between the treatments can be
estimated from the regression parameter.
For planning, let ψa denote the odds ratio at which power 1 − β is desired and k the allocation ratio. The control-arm sample size is given by

nC = {zβ[(kp1(1 − p1))^(−1) + (p2(1 − p2))^(−1)]^(1/2) + zα/2[(kp1′(1 − p1′))^(−1) + (p2′(1 − p2′))^(−1)]^(1/2)}²/(log ψa − log ψo)²   (11.25)

and the overall sample size by nC(1 + k):

n = {zβ[(kp1(1 − p1))^(−1) + (p2(1 − p2))^(−1)]^(1/2) + zα/2[(kp1′(1 − p1′))^(−1) + (p2′(1 − p2′))^(−1)]^(1/2)}²(1 + k)/(log ψa − log ψo)².

As before, the optimal k that minimizes the above expression can be found by using calculus or by a "grid search."
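A grid search is easy to sketch in Python. The function below encodes the total sample size (1 + k) times Equation 11.25 as reconstructed above, so it inherits that reconstruction's assumptions; the parameter values in the usage lines are hypothetical. For symmetric inputs (p1′ = p1 = p2′ = p2), the search returns k = 1.

    import numpy as np
    from scipy.stats import norm

    def total_size(k, p1, p2, p1p, p2p, psi_a, psi_o, alpha=0.05, power=0.80):
        # Total sample size (1 + k) times Equation 11.25 (as reconstructed above)
        z_beta, z_half = norm.ppf(power), norm.ppf(1 - alpha / 2)
        v = np.sqrt(1 / (k * p1 * (1 - p1)) + 1 / (p2 * (1 - p2)))
        vp = np.sqrt(1 / (k * p1p * (1 - p1p)) + 1 / (p2p * (1 - p2p)))
        return (1 + k) * (z_beta * v + z_half * vp) ** 2 / (np.log(psi_a) - np.log(psi_o)) ** 2

    ks = np.arange(0.25, 4.001, 0.005)                     # grid of allocation ratios
    sizes = [total_size(k, 0.4, 0.4, 0.4, 0.4, 1.0, 0.5) for k in ks]
    print(round(ks[int(np.argmin(sizes))], 3))             # prints 1.0 for symmetric inputs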
TABLE 11.11
Integrals Representing Posterior Probability of Non-Inferiority
Characteristic                Determination of the Posterior Probability
Difference in proportions     For any 0 ≤ k ≤ 1, P(pE − pC > −k) = 1 − ∫_k^1 ∫_0^{v−k} gE(u|y)gC(v|x) du dv.
Odds ratio                    For any 0 < k < 1, P([pE/(1 − pE)]/[pC/(1 − pC)] > k) = ∫_0^1 ∫_{kv/(1+(k−1)v)}^1 gE(u|y)gC(v|x) du dv.
Example 11.12
This example examines the sensitivity of the inferences to the choice of prior distributions for the probability of a response. For each case, independent beta
distributions are selected as the prior distribution for the probability of a response
for the control and the experimental arms. Table 11.12 summarizes the equal-
tailed 95% credible intervals for the difference, relative risk, and odds ratio of the
probabilities of a response under seven different pairs of the prior distributions.
Each credible interval was based on 1 million simulations from the corresponding
beta posterior distributions. The first case uses the limiting posterior distributions,
as the parameters (α and β) for each prior distribution tend toward zero. This
establishes a limiting beta posterior distribution where the inference is essentially
based entirely on the data (the mean of each posterior distribution is the respec-
tive observed proportion of responders). The second case has a Jeffreys prior
distribution for each probability of a response. The remaining cases are based
on having prior information on each probability of a response that is essentially
equivalent to having response data on 40 subjects. The third case is essentially
equivalent to beginning with 20 responders and 20 nonresponders for each arm.
When compared with case 1, this choice of a common prior distribution makes
the proportion of responders between arms more similar and closer to 0.5, while
not having a great impact on the variance of the posterior distributions. The fourth
case represents starting with 34 responders out of 40 subjects (85%, the same
as the observed proportion in the control arm) in each arm. When compared
with case 1, this choice of a common prior distribution makes the proportion of
responders between arms more similar and reduces the variance of each posterior
distribution. The fifth case represents starting with 34 responders out of 40 sub-
jects in the control arm and 32 responders out of 40 subjects in the experimen-
tal arm. When compared with case 1, this choice of prior distributions does not
change the proportion of responders in the respective arms while reducing the
variance of each posterior distribution.
The sixth case represents starting with 34 responders out of 40 subjects in the control arm and 28 responders out of 40 subjects in the experimental arm. For a non-inferiority margin of 15%, this case starts with observed proportions whose difference equals the margin.
TABLE 11.12
Equal-Tailed 95% Credible Intervals for a Difference, Relative Risk, and Odds Ratio
Case   Prior Parameters                             pE − pC            pE/pC            Odds Ratio
1      C: α→0, β→0; E: α→0, β→0                     (–0.155, 0.055)    (0.825, 1.070)   (0.329, 1.468)
2      C: α = 0.5, β = 0.5; E: α = 0.5, β = 0.5     (–0.155, 0.055)    (0.824, 1.070)   (0.336, 1.464)
3      C: α = 20, β = 20; E: α = 20, β = 20         (–0.139, 0.068)    (0.825, 1.098)   (0.487, 1.418)
4      C: α = 34, β = 6; E: α = 34, β = 6           (–0.123, 0.051)    (0.861, 1.064)   (0.406, 1.449)
5      C: α = 34, β = 6; E: α = 32, β = 8           (–0.139, 0.039)    (0.842, 1.048)   (0.372, 1.309)
6      C: α = 34, β = 6; E: α = 28, β = 12          (–0.170, 0.012)    (0.807, 1.015)   (0.317, 1.085)
7      C: α = 34, β = 6; E: α = 20, β = 20          (–0.231, –0.040)   (0.737, 0.950)   (0.238, 0.785)
If the results of a clinical trial are to stand alone, the α’s and β’s for the prior
distributions should be relatively small when compared to the sample size.
Otherwise, as would be done in a meta-analysis, the use of a beta prior dis-
tribution for each arm involves integrating prior successes and failures with
the successes and failures in the present clinical trial.
We note again that it is the size of the parameters for the prior distributions that can be influential rather than the prior probability of non-inferiority, inferiority, or superiority. Suppose for the experimental and control arms, the prior distributions for the probability of a response are beta distributions with α = 5 × 10⁻¹⁰ and β = 9.5 × 10⁻⁹ for the experimental arm and α = 9.5 × 10⁻⁹ and β = 5 × 10⁻¹⁰ for the control arm. Then the prior probability that pC > pE is greater than 0.9 (90%). However, these prior distributions lose any real impact once the response status is known for at least one patient in each arm.
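These posterior quantities are typically computed by simulation, as in Example 11.12. A minimal Python sketch follows, assuming hypothetical data (34 of 40 control responders, 32 of 40 experimental responders) and the Jeffreys priors of case 2; with a beta prior for each arm, the posterior for each arm is again a beta distribution.

    import numpy as np

    rng = np.random.default_rng(20110515)
    x, n_c = 34, 40   # hypothetical control data: 34 responders of 40
    y, n_e = 32, 40   # hypothetical experimental data: 32 responders of 40
    a, b = 0.5, 0.5   # Jeffreys prior for each arm (case 2 of Table 11.12)

    # With a beta prior, each posterior is beta(alpha + successes, beta + failures)
    p_c = rng.beta(a + x, b + (n_c - x), 1_000_000)
    p_e = rng.beta(a + y, b + (n_e - y), 1_000_000)

    print("P(pE - pC > -0.15):", (p_e - p_c > -0.15).mean())   # posterior prob. of NI
    for label, draws in [("difference", p_e - p_c), ("relative risk", p_e / p_c),
                         ("odds ratio", (p_e / (1 - p_e)) / (p_c / (1 - p_c)))]:
        print(label, np.percentile(draws, [2.5, 97.5]))        # equal-tailed 95% intervals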
In addition to using posterior probabilities for testing non-inferiority, Kim
and Xue60 discussed two other Bayesian approaches for non-inferiority test-
ing. The first alternative approach determines the 5% contour region defined
as those possible (pE,pC) whose joint posterior density is 5% of the joint den-
sity of the mode. When the 5% contour region lies entirely within the non-
inferiority region (the alternative hypothesis), non-inferiority is concluded.
The second alternative approach concludes non-inferiority whenever the 95%
credible set of highest posterior probability for (pE,pC) lies entirely within the
non-inferiority region (the alternative hypothesis).
TABLE 11.13
Distribution of Successes and Failures between Arms across Strata
Experimental Arm Control Arm
Strata Success Failure Success Failure
1 40 20 30 30
2 30 30 20 40
Total 70 50 50 70
Ideally, the same method used in determining the effect of the control therapy and the non-inferiority margin should also be used in comparing the experimental and control arms in the non-inferiority trial. If not, some "adjustment" may be needed to the non-inferiority margin.
A Cochran–Mantel–Haenszel procedure is a very common stratified anal-
ysis when testing for the difference (or superiority or inferiority) between
two treatment arms on a binary endpoint. Essentially, the numerator of the test statistic is the sum across strata, within one of the arms, of the differences between the observed and expected numbers of successes (the expectations computed assuming no difference in success rates between arms). The denominator is
an estimate of the corresponding standard deviation under the assumption
that the success rate is equal between arms within each stratum. This is one
of the primary ways of performing a stratified analysis.
Another type of stratified or adjusted analysis adjusts with respect to some
preset relative frequency of some characteristic or combinations of charac-
teristics in a target population. A particular subpopulation or stratum would
consist of subjects that have the same level for that chosen characteristic or
the same combination of levels of many characteristics. For valid compari-
sons, the same relative weights should be used for each arm.
Suppose the target population consists of k strata or subpopulations, where the ith subpopulation makes up a known proportion ai of the target population (Σ_{i=1}^k ai = 1). The estimated success rate for the target population is then given by p̂Target = Σ_{i=1}^k ai p̂i, where p̂i is some estimator of pi, the success rate for the ith subpopulation. If p̂i is an unbiased estimator of pi, then p̂Target will be an unbiased estimator of the true success rate for the target population. Each p̂i can be modeled as having a normal distribution with mean pi and variance pi(1 − pi)/bi, where bi is the number of Bernoulli trials observed for the ith subpopulation. Then p̂Target can be modeled as having a normal distribution with mean Σ_{i=1}^k ai pi and variance Σ_{i=1}^k ai² pi(1 − pi)/bi.
For a clinical trial, let p̂E,i and p̂C,i denote the observed proportions of "successes" in the ith stratum or subpopulation for the experimental and control arms, respectively. Then the respective estimators for the target population are given by p̂E = Σ_{i=1}^k ai p̂E,i and p̂C = Σ_{i=1}^k ai p̂C,i. The difference is given by

p̂E − p̂C = Σ_{i=1}^k ai(p̂E,i − p̂C,i).

Thus, the difference in the overall estimated rates is a weighted average (same weights) of the differences in the observed rates within each stratum or subpopulation. The overall relative risk is given by

p̂E/p̂C = Σ_{i=1}^k ai p̂E,i / Σ_{i=1}^k ai p̂C,i = Σ_{i=1}^k (ai p̂C,i)(p̂E,i/p̂C,i) / Σ_{i=1}^k ai p̂C,i.

Thus, the overall relative risk for the target population can be expressed as a weighted average, with random weights, of the relative risks within the strata. The overall odds ratio is given by

p̂E(1 − p̂C)/[p̂C(1 − p̂E)] = Σ_{i=1}^k Σ_{j=1}^k ai aj p̂E,i(1 − p̂C,j) / Σ_{i=1}^k Σ_{j=1}^k ai aj(1 − p̂E,i)p̂C,j.

Since this expression involves products of terms calculated from different strata, this odds ratio estimator cannot be expressed as a weighted average of the within-stratum odds ratios. Discussion and inferences about odds ratios provided later will instead be based on a common or "average" odds ratio.
There are many choices for how to do a stratified or adjusted non-inferiority
analysis of a binary endpoint. This goes beyond whether a difference in pro-
portions, a relative risk, or an odds ratio is chosen as the basis for making an
inference. When comparing two proportions or probabilities in a randomized,
stratified clinical trial, one method for comparing the difference in propor-
tions uses the overall strata sizes as the common weights for each arm. This
allows for a comparison of the two proportions with respect to a target popu-
lation that has the same breakdown for the strata levels as observed in the
study. This is also consistent with the one proportion problem in estimating a
common proportion across strata (or studies). When it is assumed that the true
probability of a success for an arm is constant across strata (or studies), the
MLE of the common probability of a success for that arm uses the total num-
ber of subjects in that stratum for just that arm (or the study size for that arm)
as weights. This will lead to the overall proportion of successes as the estimate
of the common probability of a success. However, it may not be reasonable to assume for a given arm that the true probability of a success is constant across strata. It should be noted that in a clinical trial, the most prognostic factors are usually selected as stratification factors; the success rate can thus be expected to vary greatly across the levels of a stratification factor.
Another adjusted analysis weights the observed within-stratum risk differences, estimating a common overall risk difference (Δi ≡ Δ) by

Δ̂w = Σ_{i=1}^k wi Δ̂i / Σ_{i=1}^k wi,

where wi = 1/[xE,i(nE,i − xE,i)/nE,i + xC,i(nC,i − xC,i)/nC,i]. Relative to the proportion of all observations that are in a stratum, this estimator downweights those strata where the observed success rates are near 0.5 and overweights those strata where the observed success rates are close to 0 or 1. For most clinical trials, the risk difference will not be constant or approximately constant across strata. A related analysis uses the harmonic mean of the numbers of subjects in the experimental and control arms within a stratum as the stratum weight, as in the Mantel–Haenszel estimator below.
The Mantel–Haenszel estimator of the common risk difference across strata is given by

Δ̂MH = Σ_{i=1}^k [(xE,i nC,i − xC,i nE,i)/Ni] / Σ_{i=1}^k nC,i nE,i/Ni = Σ_{i=1}^k wi Δ̂i / Σ_{i=1}^k wi,

where wi = nC,i nE,i/Ni = 1/(1/nC,i + 1/nE,i) and Δ̂i = xE,i/nE,i − xC,i/nC,i; the notation for the cell counts in the ith stratum is given in Table 11.14. Equivalently, the weight for a given stratum can be considered as (proportional to) the harmonic mean of the within-stratum sizes for the experimental and control arms.

TABLE 11.14
Notation for the Cell Counts in the ith Stratum
           Experimental    Control
Success    xE,i            xC,i
Failure    nE,i − xE,i     nC,i − xC,i
Total      nE,i            nC,i           Ni
Estimators of a common relative risk across strata (θi ≡ θ) include a weighted average of the within-stratum log-relative risks (i.e., an estimator of log θ), which is given by

log θ̂w = Σ_{i=1}^k wi log θ̂i / Σ_{i=1}^k wi,

where wi = (1/xE,i − 1/nE,i + 1/xC,i − 1/nC,i)^(−1). The weight for a stratum is the inverse of the asymptotic variance of the respective log-relative risk estimator.
The Mantel–Haenszel estimators of a common relative risk and of a common odds ratio are given by

θ̂MH = Σ_{i=1}^k xE,i nC,i/Ni / Σ_{i=1}^k xC,i nE,i/Ni = Σ_{i=1}^k wi θ̂i / Σ_{i=1}^k wi   (11.27)

ψ̂MH = Σ_{i=1}^k xE,i(nC,i − xC,i)/Ni / Σ_{i=1}^k xC,i(nE,i − xE,i)/Ni = Σ_{i=1}^k wi ψ̂i / Σ_{i=1}^k wi   (11.28)
An estimator of the variance of log ψ̂MH, due to Robins, Breslow, and Greenland,63 is given by

Var̂(log ψ̂MH) = Σ_{i=1}^k Pi Ri/(2R+²) + Σ_{i=1}^k (Pi Si + Qi Ri)/(2R+S+) + Σ_{i=1}^k Qi Si/(2S+²)   (11.29)

where Pi = (xE,i + nC,i − xC,i)/Ni, Qi = (xC,i + nE,i − xE,i)/Ni, Ri = xE,i(nC,i − xC,i)/Ni, Si = xC,i(nE,i − xE,i)/Ni, R+ = Σ_{i=1}^k Ri, and S+ = Σ_{i=1}^k Si. We will use this standard error estimate in an example to construct confidence intervals for the common or average odds ratio.
Mantel and Haenszel64 indicated their disbelief that the relative risk for an exposure factor would be constant across strata and suggested instead the use of an average relative risk or, rather, an average odds ratio.
Logistic Regression. A logistic regression model can be used to estimate a common log-odds ratio across all possibilities for a collection of covariates. For a logistic regression model, the log-odds of a success is a linear function of the covariate values of the given patient. A patient having baseline covariate values x1, …, xk (one or more of these covariates used to identify the treatment arm) has a log-odds of success of α + Σ_{i=1}^k βi xi. When the sample size is large, the MLEs will have approximate normal distributions. For fixed values of the baseline covariates, the treatment coefficient represents the common log-odds ratio between the experimental and control arms.
Example 11.13 illustrates the use of these methods.
Example 11.13
To illustrate these methods, data were simulated for a two-arm study having 200
subjects. One hundred subjects were randomized to each arm according to two
stratification factors having two levels each. The endpoint is a binary response.
Table 11.15 gives the subject breakdown according to treatment arm, stratification
factors, and response status.
From Equation 11.26, the Mantel–Haenszel estimates of pE and pC are 0.631
and 0.429, respectively. The respective estimates of the corresponding standard
deviations are 0.0479 and 0.0480. The approximate 95% confidence interval for
pE – pC is 0.069–0.335. From Equation 11.27, the Mantel–Haenszel estimate of the
relative risk pE /pC is 1.47. On the basis of Fieller’s method, the approximate 95%
confidence interval of 1.14–1.95 is found by solving for the values of x that satisfy
−1.96 < (0.631 − 0.429x)/[(0.0479)² + (0.0480x)²]^(1/2) < 1.96.
From Equation 11.28, the Mantel–Haenszel estimate of the common odds ratio
is 2.31 with approximate 95% confidence interval of 1.30–4.09 based on the
standard error estimate of the log-odds ratio of Robins, Breslow, and Greenland
in Equation 11.29.
For a logistic regression model using treatment arm and the two stratification
factors as factors in the model, the estimate of the common odds ratio is 2.31 with
corresponding approximate 95% confidence interval of 1.30–4.09, identical to the
Mantel–Haenszel estimate and the corresponding confidence interval. When an
interaction term for the stratification factors is added, the estimate of the common
odds ratio is 2.33 with a corresponding approximate 95% confidence interval of
1.31–4.15.
TABLE 11.15
Breakdown by Treatment Arm, Stratification Factors, and Response Status
Arm Factor 1 Factor 2 n Number of Responses
Experimental 0 0 30 21
0 1 25 15
1 0 23 15
1 1 22 12
Control 0 0 32 18
0 1 25 6
1 0 22 9
1 1 21 10
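A short Python sketch of the calculations in Example 11.13, using the cell counts of Table 11.15 (the four strata being the combinations of the two stratification factors), is given below. It should approximately reproduce the Mantel–Haenszel estimates and the confidence interval for the common odds ratio based on the Robins, Breslow, and Greenland standard error.

    import numpy as np

    # Table 11.15 cell counts; strata are the four factor-level combinations
    x_e = np.array([21., 15., 15., 12.]); n_e = np.array([30., 25., 23., 22.])
    x_c = np.array([18.,  6.,  9., 10.]); n_c = np.array([32., 25., 22., 21.])
    N = n_e + n_c

    rd_mh = ((x_e * n_c - x_c * n_e) / N).sum() / (n_c * n_e / N).sum()       # risk difference
    rr_mh = (x_e * n_c / N).sum() / (x_c * n_e / N).sum()                     # Equation 11.27
    or_mh = (x_e * (n_c - x_c) / N).sum() / (x_c * (n_e - x_e) / N).sum()     # Equation 11.28

    # Robins-Breslow-Greenland variance of log(or_mh), Equation 11.29
    P = (x_e + n_c - x_c) / N; Q = (x_c + n_e - x_e) / N
    R = x_e * (n_c - x_c) / N; S = x_c * (n_e - x_e) / N
    var_log = ((P * R).sum() / (2 * R.sum() ** 2)
               + (P * S + Q * R).sum() / (2 * R.sum() * S.sum())
               + (Q * S).sum() / (2 * S.sum() ** 2))
    ci = np.exp(np.log(or_mh) + np.array([-1.0, 1.0]) * 1.96 * np.sqrt(var_log))
    print(rd_mh, rr_mh, or_mh, ci)   # approx. 0.20, 1.47, 2.31, (1.30, 4.09)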
When the non-inferiority margin is a function δ(pC) of the control success rate, the hypotheses Ho: pE − pC ≤ −δ(pC) versus Ha: pE − pC > −δ(pC) (Expression 11.30) can be tested with the statistic

ZS = [p̂E − p̂C + δ(pC)] / [pE(1 − pE)/nE + pC(1 − pC)(1 − δ′(pC))²/nC]^(1/2),

where δ′ is the first derivative of δ. The values for pE and pC may correspond to the sample proportions or to the MLEs of the proportions under the null hypothesis. The hypotheses in Expression 11.30 can also be tested on the basis of the posterior probability that pE − pC > −δ(pC). Alternatively, the specific form for δ(pC) may dictate an appropriate method of analysis.
Various Variable Margins. For the risk difference and the relative risk, the
corresponding variable margin is a linear function of pC. Phillips66 pro-
posed the use of a linear function in pC, δ(pC) = a + bpC. A motivation was
to “fit” a line to the random margin provided in the U.S. Food and Drug
Administration (FDA) guidelines for anti-infective products67 by having that
margin based on pC. The value b < 0 in that fit is indicative of a margin that increases as pC decreases. Thus, when the success rate (and possibly the effect) of the active control appears smaller, the acceptable amount of inferiority becomes larger. This seems counterintuitive; it would make more sense to keep the margin the same or make it smaller as the perceived effect of the control therapy becomes smaller. When b > 0, the variable margin δ(pC) = a + bpC appropriately decreases as pC decreases. For a > 0 and 0 < b < 1, pC − (a + bpC) < 0 whenever pC < a/(1 − b). Thus, such a variable margin should be avoided whenever pC may be less than a/(1 − b).
Röhmel68,69 proposed various functions for δ(•). In one study, Röhmel69 proposed the use of δ(pC) = 0.223√(pC(1 − pC)) and δ(pC) = 0.333√(pC(1 − pC)) for the purpose of stabilizing the desired power and providing a variable margin fairly consistent with those provided in the FDA guidelines for anti-infective products.67 The power should be fairly stable among possibilities for pC that are not close to 0 or 1. Röhmel appears to have been recommending such a variable margin for real situations (e.g., antibiotics or anti-infective products) when the anticipated success rate is greater than 50%. For 0.50 < pC < 1, the margin δ(pC) = c√(pC(1 − pC)) increases as pC decreases. When it is anticipated that 0.50 < pC < 1, such a function for the margin is probably no longer appropriate for a non-inferiority registration trial. For 0 < pC < 0.50, the margin δ(pC) = c√(pC(1 − pC)) decreases as pC decreases, which is more appropriate. However, pC − c√(pC(1 − pC)) < 0 when pC is close to zero. Thus, such a variable margin should be avoided when pC may be close to zero.
By stabilizing the power for a given sample size, such a variable margin may be appropriate for a randomized phase 2 study to assist in making a go/no-go decision to phase 3. For a one-sided significance level of α/2 and power of 1 − β at pE = pC = p, where δ(pC) = c√(pC(1 − pC)) with c > 0, a crude sample-size calculation of the number of patients per arm is given by 2(zβ + zα/2)²/c² (derived from Equation 11.7 with k = 1, δ = c√(p(1 − p)), Δa = 0, and p1′ = p1 = p2 = p2′ = p).
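This crude calculation takes only a few lines of Python; a one-sided 0.025 level and 80% power are assumed, and c = 0.223 and c = 0.333 are the two choices of c noted above.

    from scipy.stats import norm

    def per_arm_size(c, alpha=0.05, power=0.80):
        """Crude per-arm sample size 2(z_beta + z_{alpha/2})^2 / c^2 for the
        variable margin delta(pC) = c * sqrt(pC(1 - pC))."""
        return 2 * (norm.ppf(power) + norm.ppf(1 - alpha / 2)) ** 2 / c ** 2

    print(per_arm_size(0.223), per_arm_size(0.333))   # the two values of c noted above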
For each control success rate considered and a sample size per arm of 150 or 300 subjects, 1000 trials were simulated, and the proportion in which non-inferiority was demonstrated with each approach was determined.
The second Bayesian approach (using a retrospective prior for the control
rate) and the exact likelihood ratio test maintained a type I error near 0.05 in
all cases. The Bayesian approach using independent uniform prior distribu-
tions slightly inflates the type I error rate in all cases. The observed event rate
procedure had an inflated type I error rate that was as high as 0.10 when the
control success rate was 15%. The inflation at all time points is mostly due to
the statistic ignoring the variability in δ ( pˆ C ). The particularly high inflated
type I error rate when the control rate was 15% appears to be due to the value
of δ(0.15) being smaller than what would be consistent with values of δ(pC)
for pC near 0.15. The value of δ (0.15) is 0.09, whereas from our interpolations
δ(pC) ≈ 0.04 + 0.4pC for pC near 0.15, which would lead to δ(0.15) ≈ 0.10. Thus,
although (pE,pC) = (0.24, 0.15) is on the boundary of the null hypothesis, the
general behavior of δ(pC) is such that (pE,pC) = (0.25, 0.15) is “expected” to be
on the boundary of the null hypothesis and (pE,pC) = (0.24, 0.15) is expected
to be just in the alternative hypothesis of non-inferiority. Hence, the added
inflation of the type I error rate when the control rate is 15%.
The second Bayesian approach tended to have the greatest power followed
by the observed event rate approach, and then followed by the first Bayesian
approach.
A type I error rate evaluation of using the testing procedure described in
the 1992 FDA Guidance67 is given in Example 11.14.
Example 11.14
Much of the work on variable margins for proportions has been motivated by experience in testing the efficacy of anti-infective products. The observed margin is given by δ̂ = 0.2 if p̂max < 0.8, δ̂ = 0.15 if 0.8 ≤ p̂max < 0.9, and δ̂ = 0.1 if p̂max ≥ 0.9, where p̂max = max{p̂E, p̂C}. Non-inferiority would be concluded if the lower limit of the 95% two-sided confidence interval for the experimental-versus-control difference in the cure rates is greater than the negative of the observed margin.
Formally, this test procedure does not perfectly correspond to a test of two specific statistical hypotheses. Statistical hypotheses can be specified for a test that is approximately equal to this non-inferiority test. The alternative hypothesis would be defined as the union of the sets {(pE, pC): pC < 0.8, pE > pC − 0.2}, {(pE, pC): 0.8 ≤ pC < 0.9, pE > pC − 0.15}, and {(pE, pC): pC ≥ 0.9, pE > pC − 0.1}. The null hypothesis is the complement. The variable margin is given by δ(pC) = 0.2 if pC < 0.8; δ(pC) = 0.15 if 0.8 ≤ pC < 0.9; and δ(pC) = 0.1 if pC ≥ 0.9.
For three possibilities in the null hypothesis and sample sizes of 150 and 300
per arm, the simulated probabilities of rejecting the null hypothesis (the type I
error rate) for testing these hypotheses by using the testing procedure described in
the 1992 FDA Guidance are provided in Table 11.16. The two-sided 95% Wald’s
confidence interval for the difference in proportions was used. Two possibilities
are located at or near where the variable margin changes. One million simulations
were used in each case. For the case where (pE, pC) = (0.65, 0.8) and the sample size is 150 per arm, only 124 simulations (about 1 in every 8000) had δ(p̂C) ≠ δ̂ = δ(max{p̂E, p̂C}). In all of these 124 simulations, non-inferiority was concluded using the smaller margin δ̂ = δ(max{p̂E, p̂C}). In all other studied cases for (pE, pC) and the sample size, observing δ(p̂C) ≠ δ̂ = δ(max{p̂E, p̂C}) was rarer and never influenced the conclusion on non-inferiority. Unless the sample size is small, non-inferiority will be demonstrated with respect to δ̂ = δ(max{p̂E, p̂C}) whenever p̂E > p̂C. Thus, for all practical purposes, the observed margin could have been regarded as δ(p̂C). When (pE, pC) = (0.65, 0.8), the type I error rate is
greatly inflated and tends to increase toward some value slightly larger than 0.5 as
the common sample size increases without bound. When (pE,pC) = (0.599, 0.799),
the type I error rate is slightly deflated and tends to increase toward 0.025 as the
common sample size increases without bound. When (pE,pC) = (0.70, 0.85), a
value on the boundary of the null hypothesis not very near a change point in the
variable margin, the type I error rate is inflated and tends to decrease toward 0.025
as the common sample size increases without bound.
TABLE 11.16
Type I Error Rates Consistent with Old FDA Guidelines
on Anti-Infective Products
Sample Size per Arm
In a matched-pair design, each pair yields two outcomes: for example, a sample may be split in two, with one part tested by a new assay (or diagnostic test) and the other by the standard method. When the outcome measure is dichotomous, risk differences and risk ratios are often used to compare treatments. In this section we describe the statistical methods for evaluating non-inferiority based on the difference and the ratio of two proportions in a matched-pair design. Methods appropriate for large samples and for small to moderate samples will be discussed.
Consider the matched-pair design in which two treatments (e.g., experi-
mental and control) are performed on the same n subjects. A “response” to a
treatment will be denoted by a “1,” whereas a “2” will denote “no response”
to the treatment. For any subject the possible outcomes are denoted by (1, 1),
(1, 2), (2, 1), and (2, 2), where the first (second) entry is the outcome to the experimental (control) treatment.
probabilities of the pairs and let a, b, c, and d (a + b + c + d = n) be the observed
numbers for the pairs (1, 1), (1, 2), (2, 1), and (2, 2), respectively. The observed
vector (a, b, c, d) is assumed to come from the usual multinomial distribution
model:
n!
P(( a, b , c , d) n,(q11 , q12 , q11 , q12 )) = a
q11 b
q12 c
q21 d
q22 .
a!b! c ! d!
Then pE = q11 + q12 and pC = q11 + q21 are the probabilities of a response to the experimental and control treatments, respectively. For a classical hypothesis test of no difference between the new and standard treatments, the McNemar test statistic71 is

ZM = (b − c)/(b + c)^(1/2).   (11.31)
For testing non-inferiority on the difference pE − pC with a margin δ, Tango72 derived the score test statistic

ZD = (b − c + nδ)/{n[2q̂21 − δ(δ + 1)]}^(1/2)   (11.33)

where the estimator q̂21 is the constrained MLE of q21 under pE − pC = −δ. Specifically, Tango72 showed that

q̂21 = [(B² − 4AC)^(1/2) − B]/(2A)   (11.34)

where the coefficients A, B, and C are functions of b, c, n, and δ given by Tango.72
The null hypothesis will be rejected at the one-sided α/2 level if ZD > zα/2, where zα/2 is the upper α/2 percentile of the standard normal distribution. Tango provided special cases for this test. From Equation 11.34, when δ = 0, corresponding to the test of no difference between the test and control treatments, q̂21 = (b + c)/(2n). Thus, the test statistic ZD in Equation 11.33 simplifies to the McNemar test statistic in Equation 11.31. When the off-diagonal cells are both zero (b = c = 0), q̂21 = δ and the test statistic reduces to
ZD = [nδ/(1 − δ)]^(1/2).
In other words, if there are no discordant pairs observed in the study, the non-inferiority hypothesis will be rejected at the one-sided α/2 level provided the sample size n is large enough:

n > zα/2²(1 − δ)/δ.
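As a quick Python sketch (the 10% margin in the usage line is illustrative only):

    import math
    from scipy.stats import norm

    def min_n_no_discordant(delta, alpha_half=0.025):
        """Smallest n rejecting the non-inferiority null when b = c = 0, i.e.,
        the smallest integer n with n > z^2 (1 - delta) / delta."""
        z = norm.ppf(1 - alpha_half)
        return math.floor(z ** 2 * (1 - delta) / delta) + 1

    print(min_n_no_discordant(0.10))   # a 10% margin requires n >= 35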
On the basis of the score test statistic ZD in Equation 11.33, a 100(1 − α)% confidence interval for the difference in proportions Δ = pE − pC = q12 − q21 can be constructed by solving for δ in the equations ZD = ±zα/2.
Example 11.15
In a crossover clinical trial,73 a chemical disinfection system was compared with a thermal disinfection system for soft contact lenses; the outcomes for the 44 subjects are summarized in Table 11.17.
It has been shown (see Tango72) that the score statistic ZD performs better in small samples than two other asymptotic tests, proposed by Lu and Bean74 and by Morikawa and Yanagawa.75 In particular, the type I error rate of the ZD test statistic is much closer to the nominal level than those of the other two statistics, while maintaining similar power. For a matched-pair design with
small sample sizes, an exact test of non-inferiority proposed by Hsueh, Liu,
and Chen76 can be used to guarantee control of the type I error rates. Sample-
size and power calculation methods have been developed in Lu and Bean74
and Nam.77 Nam77 showed that a method based on the score-type statistic
performed better than the method of Lu and Bean.74
TABLE 11.17
Outcomes of Disinfection Systems for Soft Contact Lenses
Thermal Disinfection
Chemical Disinfection Effective Ineffective Total
Effective 43 0 43
Ineffective 1 0 1
Total 44 0 44
For assessing non-inferiority based on the ratio of the two marginal response rates, the hypotheses are

Ho: pE/pC ≤ θo versus Ha: pE/pC > θo,   (11.35)

where 0 < θo < 1 is a prespecified acceptable threshold for the ratio of the two proportions. Rejection of the null hypothesis will lead to a conclusion of non-inferiority in the sense that the experimental treatment has a similar positive response rate compared with the standard treatment based on the marginal response rates. Extending the work of Tango,72 a score test statistic was derived by Tang, Tang, and Chan78 for testing the non-inferiority hypotheses in Expression 11.35:

ZR = [a + b − (a + c)θo]/{n[(1 + θo)q̂21 + (a + b + c)(θo − 1)/n]}^(1/2),   (11.36)
where q̂21 is the constrained MLE of q21 under the null hypothesis, given by the larger root of a quadratic equation; that is,

q̂21 = [(B² − 4AC)^(1/2) − B]/(2A)   (11.37)

where the coefficients A, B, and C are functions of the cell counts and θo given by Tang, Tang, and Chan.78
Tang, Tang, and Chan78 compared the performance of ZR in Equation
11.36 with several other potential test statistics, including one proposed by
Lachenbruch and Lynch.79 The empirical comparison showed that ZR was
the only test statistic that behaved satisfactorily in the sense that its empiri-
cal type I error rate was much closer to the desired, nominal level than those
for the other tests. The ZR statistic tends to be slightly conservative in cases
where pE and pC are large and the probability of discordance (q21) is low. In
addition, the empirical coverage probabilities of the confidence intervals
based on ZR were close to the nominal level, and the error rates of both tails
were generally similar. Sample sizes and power calculation formulas based
on the ZR statistic for both hypothesis testing and confidence interval estima-
tion were given by Tang et al.80
Example 11.16 illustrates the use of non-inferiority testing based on a rela-
tive risk in a matched-pair design.
Example 11.16
Tang, Tang, and Chan78 revisited the crossover clinical trial described in Example
11.15, where a chemical disinfection system was compared with a thermal dis-
infection system for soft contact lenses. Here suppose the interest is to assess
non-inferiority using the relative risk with a margin of 0.9 (requiring the response rate of the chemical method to be at least 90% of that of the thermal method). The observed risk ratio (chemical/thermal) is 0.977, with a 90% confidence interval based on ZR in Equation 11.36 of (0.904, 1.038). The p-value for testing the null hypothesis in Expression 11.35 is .044. The results indicate that the chemical method is non-inferior to the thermal method at a one-sided 0.05 level, but not at the one-sided 0.025 level.
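Since the coefficients A, B, and C of Equation 11.37 are given in Tang, Tang, and Chan,78 the Python sketch below instead obtains the constrained MLE of q21 by numerically maximizing the multinomial likelihood under Ho: pE = θo pC; this is an illustrative stand-in for the closed-form solution, not the authors' algorithm. Applied to the Table 11.17 data with θo = 0.9, it should approximately reproduce ZR ≈ 1.71 and the one-sided p-value of .044.

    import numpy as np
    from scipy.optimize import minimize
    from scipy.special import xlogy
    from scipy.stats import norm

    def z_ratio(a, b, c, d, theta_o):
        """ZR of Equation 11.36 with the constrained MLE of q21 obtained by
        numerically maximizing the multinomial likelihood under
        Ho: pE = theta_o * pC (a sketch standing in for Equation 11.37)."""
        n = a + b + c + d

        def negloglik(q):
            q11, q21 = q
            q12 = theta_o * (q11 + q21) - q11        # imposed by the null constraint
            q22 = 1.0 - q11 - q12 - q21
            if min(q11, q12, q21, q22) < -1e-12:
                return 1e12                           # outside the simplex: penalize
            probs = np.clip([q11, q12, q21, q22], 0.0, 1.0)
            return -sum(xlogy(cnt, p) for cnt, p in zip((a, b, c, d), probs))

        res = minimize(negloglik, x0=[0.5, 0.2], method="Nelder-Mead",
                       options={"xatol": 1e-12, "fatol": 1e-12})
        q21_hat = res.x[1]
        zr = (a + b - (a + c) * theta_o) / np.sqrt(
            n * ((1 + theta_o) * q21_hat + (a + b + c) * (theta_o - 1) / n))
        return zr, 1 - norm.cdf(zr)

    print(z_ratio(43, 0, 1, 0, 0.9))   # Table 11.17 data; approx. (1.71, 0.044)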
References
1. Dann, R.S. and Koch, G.G., Review and evaluation of methods for computing
confidence intervals for the ratio of two proportions and considerations for non-
inferiority clinical trials, J. Biopharm. Stat., 15, 85–107, 2005.
2. Barnard, G.A., Significance tests for 2 × 2 tables, Biometrika, 34, 123–138,
1947.
3. Basu, D., On the elimination of nuisance parameters, J. Am. Stat. Assoc., 72, 355,
1977.
4. Chan, I.S.F., Exact tests of equivalence and efficacy with a non-zero lower bound
for comparative studies, Stat. Med., 17, 1403–1413, 1998.
5. Farrington, C.P. and Manning, G., Test statistics and sample size formulae for
comparative binomial trials with null hypothesis of non-zero risk difference or
non-unity relative risk, Stat. Med., 9, 1447–1454, 1990.
6. Suissa, S. and Shuster, J.J., Exact unconditional sample sizes for the 2 × 2 bino-
mial trial, J. R. Stat. Soc. A, 148, 317–327, 1985.
7. Haber, M., An exact unconditional test for the 2 × 2 comparative trials, Psychol.
Bull., 99, 129–132, 1986.
8. Dunnett, C.W. and Gent, M., Significance testing to establish equivalence between treatments with special reference to data in the form of 2 × 2 tables, Biometrics, 33, 593–602, 1977.
9. Chan, I.S.F. and Zhang, Z., Test-based exact confidence intervals for the differ-
ence of two binomial proportions, Biometrics, 55, 1201–1209, 1999.
10. Röhmel, J. and Mansmann, U., Unconditional non-asymptotic one-sided tests
for independent binomial proportions when the interest lies in showing non-
inferiority and/or superiority, Biom. J., 41, 149–170, 1999.
11. Andres, A.M. and Mato, A.S., Choosing the optimal unconditional test for com-
paring two independent proportions, Comput. Stat. Data Anal., 17, 555–574,
1994.
12. Chan, I.S.F., Providing non-inferiority or equivalence of two treatments with
dichotomous endpoints using exact methods, Stat. Method. Med. Res., 12, 37–58,
2003.
13. Clopper, C.J. and Pearson, E.S., The use of confidence or fiducial limits illus-
trated in the case of the binomial, Biometrika, 26, 404–413, 1934.
14. Santner, T.J. and Snell, M.K., Small-sample confidence intervals for p1 – p2 and
p1/p2 in 2 × 2 contingency tables, J. Am. Stat. Assoc., 75, 386–394, 1980.
15. Agresti, A. and Min, Y., On small-sample confidence intervals for parameters in
discrete distributions, Biometrics, 57, 963–971, 2001.
16. Chen, X., A quasi-exact method for the confidence intervals of the difference
of two independent binomial proportions in small sample cases, Stat. Med., 21,
943–956, 2002.
17. Coe, P.R. and Tamhane, A.C., Small sample confidence intervals for the differ-
ence, ratio, and odds ratio of two success probabilities, Commun. Stat. B Simul.,
22, 925–938, 1993.
18. Santner, T.J. and Yamagami, S., Invariant small sample confidence intervals for
the difference of two success probabilities, Commun. Stat. B Simul., 22, 33–59,
1993.
19. Fries, L.F. et al., Safety and immunogenicity of a recombinant protein influenza
A vaccine in adult human volunteers and protective efficacy against wild-type
H1N1 virus challenge, J. Infect. Dis., 167, 593–601, 1993.
20. Boschloo, R.D., Raised conditional level of significance for the 2×2-table when
testing the equality of two probabilities, Stat. Neerl., 24, 1–35, 1970.
21. Rodary, C., Com-Nougue, C., and Tournade, M.F., How to establish equivalence
between treatments: A one-sided clinical trial in paediatric oncology, Stat. Med.,
8, 593–598, 1989.
22. Hauck, W.W. and Anderson, S., A comparison of large sample confidence inter-
val methods for the differences of two binomial probabilities, Am. Stat., 40, 318–
322, 1986.
23. Ghosh, B.K., A comparison of some approximate confidence intervals for the
binomial parameter, J. Am. Stat. Assoc., 74, 894–900, 1979.
24. Vollset, S.E., Confidence intervals for a binomial proportion, Stat. Med., 12, 809–
824, 1993.
25. Agresti, A. and Coull, B.A., Approximate is better than ‘exact’ for interval esti-
mation of binomial proportions, Am. Stat., 52, 119–126, 1998.
26. Agresti, A. and Caffo, B., Simple and effective confidence intervals for propor-
tions and differences of proportions result from adding two successes and two
failures, Am. Stat., 54, 280–288, 2000.
27. Newcombe, R.G., Two-sided confidence intervals for the single proportion:
comparison of seven methods, Stat. Med., 17, 857–872, 1998.
28. Newcombe, R.G., Interval estimation for the difference between independent
proportions: Comparison of seven methods, Stat. Med., 17, 873–890, 1998.
29. Brown, L.D., Cai, T., and Dasgupta, A., Interval estimation for a binomial pro-
portion (with discussion), Stat. Sci., 16, 101–133, 2001.
30. Wilson, E.B., Probable inference, the law of succession, and statistical inference.
J. Am. Stat. Assoc., 22, 209–212, 1927.
31. Schouten, H.J.A. et al., Comparing two independent binomial proportions by a
modified chi-square test, Biom. J., 22, 241–248, 1980.
32. Tu, D., A comparative study of some statistical procedures in establishing thera-
peutic equivalence of nonsystemic drugs with binary endpoints, Drug Inf. J., 31,
1291–1300, 1997.
33. Li, Z. and Chuang-Stein, C., A note on comparing two binomial proportions in
confirmatory non-inferiority trials, Drug Inf. J., 40, 203–208, 2006.
34. Mee, R.W., Confidence bounds for the difference between two probabilities,
Biometrics, 40, 1175–1176, 1984.
35. Miettinen, O.S. and Nurminen, M., Comparative analysis of two rates, Stat.
Med., 4, 213–226, 1985.
36. Santner, T.J. et al., Small-sample comparisons of confidence intervals for the dif-
ference of two independent binomial proportions, Comput. Stat. Data Anal., 51,
5791–5799, 2007.
37. Dann, R.S. and Koch, G.G., Methods for one-sided testing of the difference
between proportions and sample size considerations related to non-inferiority
clinical trials, Pharm. Stat., 7, 130–141, 2008.
38. Hilton, J.F., Designs of superiority and non-inferiority trials for binary responses
are noninterchangeable, Biom. J., 48, 934–947, 2006.
39. Chan, I.S.F. and Bohidar, N.R., Exact power and sample size for vaccine efficacy
studies, Commun. Stat. Theory, 27, 1305–1322, 1998.
40. Johnson, N.L., Kotz, S., and Kemp, A.W., Univariate Discrete Distributions, Wiley,
New York, NY, 1992.
41. Werzberger, A. et al., A controlled trial of a formalin-inactivated hepatitis A vac-
cine in healthy children, New Engl. J. Med., 327, 453–457, 1992.
42. Wiens, B.L. et al., Duration of protection from clinical hepatitis A disease after
vaccination with VAQTA®, J. Med. Virol., 49, 235–241, 1996.
43. Temple, R., Problems in interpreting active control equivalence trials, Acct. Res.,
4, 267–275, 1996.
44. Jones, B. et al., Trials to assess equivalence: The importance of rigorous meth-
ods, Br. Med. J., 313: 36–39, 1996.
45. Ebbutt, A.F. and Frith, L., Practical issues in equivalence trials, Stat. Med., 17,
1691–1701, 1998.
46. Katz, D. et al., Obtaining confidence intervals for the risk ratio in cohort studies,
Biometrics, 34, 469–474, 1978.
47. Thomas, D.G. and Gart, J.J., A table of exact confidence limits for differences and ratios of two proportions and their odds ratios, J. Am. Stat. Assoc., 72, 73–76, 1977.
48. Koopman, P.A.R., Confidence intervals for the ratio of two binomial propor-
tions, Biometrics, 40, 513–517, 1984.
49. Bailey, B.J.R., Confidence limits to the risk ratio, Biometrics, 43, 201–205, 1987.
50. Gart, J.J. and Nam, J., Approximate interval estimation of the ratio of binomial
parameters: A review and corrections for skewness, Biometrics, 44, 323–338,
1988.
51. Zelterman, D., Models for Discrete Data, Oxford University Press, Oxford, 1999.
52. Cornfield, J., A statistical problem arising from retrospective studies, in Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, Vol. IV, J. Neyman (ed.), 135–148, University of California Press, Berkeley, CA, 1956.
53. Gart, J.J., The comparison of proportions: A review of significance tests, confi-
dence intervals and adjustments for stratification, Rev. Inst. Int. Stat., 39, 148–
169, 1971.
54. Mehta, C.R., Patel, N.R., and Gray, R., Computing an exact confidence interval
for the common odds ratio in several 2 by 2 contingency tables, J. Am. Stat.
Assoc., 80, 969–973, 1985.
55. Vollset, S.E., Hirji, K.F., and Elashoff, R.M., Fast computation of exact confidence
limits for the common odds ratio in a series of 2×2 tables, J. Am. Stat. Assoc., 86,
404–409, 1991.
56. Mehta, C.R. and Walsh, S.J., Comparison of exact, mid-p, and Mantel–Haenszel
confidence intervals for the common odds ratio across several 2×2 contingency
tables, Am. Stat., 46, 146–150, 1992.
57. Emerson, J.D., Combining estimates of the odds ratio: The state of the art, Stat.
Methods Med. Res., 3, 157–178, 1994.
58. Gart, J.J. and Zweifel, J.R., On the bias of various estimators of the logit and its variance with application to quantal bioassay, Biometrika, 54, 181–187, 1967.
59. Cytovene product labeling available at www.fda.gov/cder/foi/label/2000/
20460s10lbl.pdf.
60. Kim, M.Y. and Xue, X., Likelihood ratio and a Bayesian approach were supe-
rior to standard non-inferiority analysis when the non-inferiority margin varied
with the control event rate, J. Clin. Epidemiol., 57, 1253–1261, 2004.
61. Grizzle, J.E., Starmer, C.F., and Koch, G.G., Analysis of categorical data by linear
models, Biometrics, 25, 489–504, 1969.
62. Gart, J.J., On the combination of relative risks, Biometrics, 18, 601–610, 1962.
63. Robins, J., Breslow, N., and Greenland, S., Estimators of the Mantel–Haenszel variance consistent in both sparse data and large-strata limiting models, Biometrics, 42, 311–323, 1986.
64. Mantel, N. and Haenszel, W., Statistical aspects of the analysis of data from retrospective studies of disease, J. Natl. Cancer Inst., 22, 719–748, 1959.
65. Zhang, Z., Non-inferiority testing with a variable margin, Biom. J., 48, 948–965,
2006.
66. Phillips, K.F., A new test of non-inferiority for anti-infective trials, Stat. Med., 22,
201–212, 2003.
67. U.S. Food and Drug Administration, Division of Anti-infective Drug Products,
Clinical Development and Labeling of Anti-Infective Drug Products. Points-to-
consider. U.S. Food and Drug Administration, Washington, DC, 1992.
68. Röhmel, J., Therapeutic equivalence investigations: Statistical considerations,
Stat. Med., 17, 1703–1714, 1998.
69. Röhmel, J., Statistical considerations of FDA and CPMP rules for the investiga-
tion of new antibacterial products, Stat. Med., 20, 2561–2571, 2001.
70. Tsou, H.H. et al., Mixed non-inferiority margin and statistical tests in active con-
trolled trials, J. Biopharm. Stat., 17, 339–357, 2007.
71. McNemar, Q., Note on the sampling error of the difference between correlated
proportions or percentages, Psychometrika 12, 153–157, 1947.
72. Tango, T., Equivalence test and confidence interval for the difference in propor-
tions for the paired-sample design, Stat. Med., 17, 891–908, 1998.
73. Miyanaga, Y., Clinical evaluation of the hydrogen peroxide SCL disinfection
system (SCL-D), Jpn. J. Soft Contact Lenses, 36, 163–173, 1994.
74. Lu, Y. and Bean, J.A., On the sample size for one-sided equivalence of sensitivi-
ties based upon McNemar’s test, Stat. Med., 14, 1831–1839, 1995.
75. Morikawa, T. and Yanagawa, T., Taiounoaru 2chi data ni taisuru doutousei
kentei (Equivalence testing for paired dichotomous data), P. Ann. Conf. Biometric
Soc. Jpn., 123–126, 1995.
76. Hsueh, H.M., Liu, J.P., and Chen, J.J., Unconditional exact tests for equiva-
lence or non-inferiority for paired binary endpoints, Biometrics, 57, 478–483,
2001.
77. Nam, J., Establishing equivalence of two treatments and sample size require-
ments in matched-pairs design, Biometrics, 53, 1422–1430, 1997.
78. Tang, N.S., Tang, M.L., and Chan, I.S.F., On tests of equivalence via non-unity
relative risk for matched-pairs design, Stat. Med., 22, 1217–1233, 2003.
79. Lachenbruch, P.A. and Lynch, C.J., Assessing screening tests: Extensions of
McNemar’s Test, Stat. Med., 17, 2207–2217, 1998.
80. Tang, N.S. et al., Sample size determination for establishing equivalence/non-
inferiority via ratio of two proportions in matched-pair design, Biometrics, 58,
957–963, 2002.
81. Chan, I.S.F. et al., Statistical analysis of non-inferiority trials with a rate ratio in
small-sample matched-pair designs, Biometrics, 59, 1170–1177, 2003.
12.1 Introduction
This chapter discusses non-inferiority based on the underlying means or
medians when there are no missing data or censored observations. Means
and medians are often used to describe the typical value or the central loca-
tion of a distribution. Medians are preferred when the data are skewed. The
outcomes may be continuous or discrete. For continuous outcomes where
larger outcomes are more desirable and differences between outcomes have
meaning (i.e., the data have an interval or ratio scale), the difference in the
means of the experimental and control arms in a randomized trial repre-
sents the average benefit across trial subjects from being randomized to the
experimental arm instead of the control arm. A difference in the medians
does not have any analogous interpretation unless additional assumptions
are made on the underlying distributions (e.g., that the shapes of the under-
lying distributions are equal).
For discrete outcomes, such as scores, the value for the mean will prob-
ably not be a possible value and may not be interpretable. In such a case,
inferences based on means may be difficult to interpret without additional
assumptions (e.g., the distributions, when different, are ordered). In these
situations, testing should not be based on the mean. For binary data, the
mean is the proportion of 1s or successes, which is interpretable.
When the difference in means (medians) defines the benefit or loss of bene-
fit, non-inferiority testing should be based on the difference in means (medi-
ans). When the data are positive and relative changes are most important, it
may be more appropriate to base non-inferiority testing on the ratio of the
means (medians). The mean for the control group may be needed to under-
stand and interpret a ratio of means.
A normal model is frequently used for inferences on the mean when the
sample size is large. The sample mean is assumed to be a random value from
an approximate normal distribution, with mean equal to the true mean or
population mean and variance equal to σ 2/n, where σ 2 is the population vari-
ance and n is the sample size. Inferences on a median are often based on the
behavior of the order statistics (see Section 12.4).
The hypotheses of interest are

Ho: μC − μE ≥ δ versus Ha: μC − μE < δ.   (12.1)

That is, the null hypothesis is that the mean in the active control group is superior to the mean in the experimental treatment group by at least a quantity δ, whereas the alternative is that the active control is superior by a smaller amount, or the two treatments are identical, or the experimental treatment is superior.
After the non-inferiority margin δ is added to each response in the experimental group, a test of superiority of the investigational treatment over the active control by any method, including a parametric test, will be equivalent to a test of non-inferiority for the original values. A permutation test for superiority can easily be used to test the null hypothesis after transformation.2
A valid permutation test requires that the residuals be exchangeable. That is, if the distribution of Xi − μi (an observed value minus its treatment group mean) is identical for the two treatment groups, the permutation test is valid. This requirement is often assumed to be correct (at least mostly correct) but rarely checked in a detailed manner.
Obvious examples of situations where the residuals are not exchangeable
include when one treatment produces a unimodal distribution and the other
produces a bimodal distribution, or when one treatment produces responses
with a larger variance than those produced by the other treatment. In such
cases, the permutation test will not be appropriate.3
A sufficient condition for the permutation test to be valid is that each sub-
ject would have a response, if assigned to receive the active control, that
exceeds that subject’s response, if assigned to receive the experimental
treatment, by exactly δ. This condition guarantees that the necessary condi-
tion from the previous paragraph is met, but this condition is not in itself
necessary.
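A Python sketch of such a permutation test follows; it assumes, as one common choice of transformation, that the margin δ is added to each experimental response, and the data in the usage lines are simulated for illustration.

    import numpy as np

    rng = np.random.default_rng(0)

    def perm_noninferiority_p(x_control, y_experimental, delta, n_perm=10_000):
        """One-sided permutation p-value for Ho: muC - muE >= delta, run as a
        superiority test after adding delta to each experimental response."""
        x = np.asarray(x_control, dtype=float)
        y = np.asarray(y_experimental, dtype=float) + delta
        pooled, n_c = np.concatenate([x, y]), len(x)
        observed = y.mean() - x.mean()
        count = 0
        for _ in range(n_perm):
            p = rng.permutation(pooled)
            count += (p[n_c:].mean() - p[:n_c].mean()) >= observed
        return count / n_perm

    x = rng.normal(50, 5, 25)    # simulated control responses (illustrative)
    y = rng.normal(49, 5, 30)    # simulated experimental responses
    print(perm_noninferiority_p(x, y, delta=4.0))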
the control and experimental groups, respectively. Again, with large sample
sizes, the relative difference in the two estimates will be negligible.
As an equivalent alternative to the confidence interval methodology, a test statistic can be calculated. If (x̄C − x̄E − δ)/se(X̄C − X̄E) is less than the critical value (e.g., less than −zα/2), non-inferiority is concluded. Alternatively, non-inferiority is concluded when the appropriate-level confidence interval for μC − μE contains only values less than δ. Using a test statistic has the advantage of being able to
calculate a p-value for the test of the null hypothesis. However, p-values are not
often calculated for such non-inferiority tests and, when they are calculated,
they are prone to misinterpretation as an indication of the existence of differ-
ences, not the rejection of the null hypothesis of a specific nonzero difference.
We will later compare the results of different analysis methods based on both
the calculated p-values (and an analogous posterior probability) for given vari-
ous margins and compare the calculated 95% confidence/credible intervals.
The sample variances for the control and experimental arms are given by SC² = Σ_{i=1}^{nC} (Xi − X̄)²/(nC − 1) and SE² = Σ_{j=1}^{nE} (Yj − Ȳ)²/(nE − 1). We will
consider three cases for testing the hypotheses in Expression 12.1: (1) large sample normal-based inference, (2) using Satterthwaite degrees of freedom,4 and (3) using a t statistic under the assumption of unknown but equal variances. Procedures (1) and (2) are two approaches to the Behrens–Fisher problem: making statistical inferences on the difference in the means of two normal distributions having unknown variances that are not assumed to be equal.
Large Sample Normal Inference. For large sample sizes, it follows from the central limit theorem that the test statistic

Z = (X̄ − Ȳ − δ)/(SC²/nC + SE²/nE)^(1/2)   (12.2)

has an approximate standard normal distribution when μC − μE = δ. The Satterthwaite approach refers the same statistic to a t distribution with estimated degrees of freedom

df = (SC²/nC + SE²/nE)²/[(SC²/nC)²/(nC − 1) + (SE²/nE)²/(nE − 1)].   (12.3)

Under the assumption of unknown but equal variances, with S² = [(nC − 1)SC² + (nE − 1)SE²]/(nC + nE − 2) denoting the pooled variance estimate, the test statistic

T = (X̄ − Ȳ − δ)/[S²(1/nC + 1/nE)]^(1/2)   (12.4)

has a t distribution with nC + nE − 2 degrees of freedom when μC − μE = δ.
Example 12.1
Suppose that nC = 25, nE = 30, x̄ = 40.5, ȳ = 39.1, sC² = 4, and sE² = 49, with a non-inferiority margin of δ = 4. The one-sided p-values and two-sided 95% confidence intervals for the three methods are provided in Table 12.1.
TABLE 12.1
Summary of One-Sided p-Values and 95% Confidence Intervals
          Large Sample Normal   Satterthwaite   Equal Variance
p-Value   0.026                 0.030           0.039
95% CI    (–1.23, 4.03)         (–1.32, 4.12)   (–1.51, 4.31)
For each method, the one-sided p-value is greater than 0.025 and each 95% confidence interval contains the non-inferiority margin of 4. Therefore, non-inferiority cannot be concluded. The upper limits of the 95% confidence intervals represent the smallest margin that could have been prespecified for which non-inferiority would have been concluded. The equal variance method has both the largest confidence interval upper limit of 4.31 and the largest p-value of 0.039. This is primarily due to the larger estimated standard error for X̄ − Ȳ used by the equal variance method. For the equal variance method, the estimated standard error for X̄ − Ȳ equals 1.45, whereas it equals 1.34 for the large sample normal and Satterthwaite methods. When the treatment group having the larger sample size has the larger (smaller) observed sample variance, the estimated standard error for X̄ − Ȳ used by the equal variance method will be larger (smaller) than that used by the large sample normal and Satterthwaite methods. When the sample sizes are equal, the same standard error for X̄ − Ȳ is used in all three methods. Note that the multipliers (i.e., the absolute values of the critical values) used for the confidence intervals were 1.960, 2.032, and 2.006.
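The three analyses of Example 12.1 can be reproduced with a few lines of Python; the summary statistics below are those inferred above for Example 12.1, and Equation 12.3 supplies the Satterthwaite degrees of freedom.

    import numpy as np
    from scipy.stats import norm, t

    n_c, xbar, s2_c = 25, 40.5, 4.0    # Example 12.1 summaries
    n_e, ybar, s2_e = 30, 39.1, 49.0
    delta = 4.0

    se = np.sqrt(s2_c / n_c + s2_e / n_e)
    z = (xbar - ybar - delta) / se
    print("normal:", norm.cdf(z), xbar - ybar + np.array([-1, 1]) * 1.96 * se)

    df = (s2_c / n_c + s2_e / n_e) ** 2 / (
        (s2_c / n_c) ** 2 / (n_c - 1) + (s2_e / n_e) ** 2 / (n_e - 1))   # Equation 12.3
    print("Satterthwaite:", t.cdf(z, df),
          xbar - ybar + np.array([-1, 1]) * t.ppf(0.975, df) * se)

    s2_pool = ((n_c - 1) * s2_c + (n_e - 1) * s2_e) / (n_c + n_e - 2)
    se_p = np.sqrt(s2_pool * (1 / n_c + 1 / n_e))
    print("equal variance:", t.cdf((xbar - ybar - delta) / se_p, n_c + n_e - 2),
          xbar - ybar + np.array([-1, 1]) * t.ppf(0.975, n_c + n_e - 2) * se_p)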
Example 12.2
We still have nC = 25, nE = 30, x = 40.5, and y = 39.1. However, now sC2 = 49 and
sE2 = 4. The one-sided p-values and two-sided 95% confidence intervals are pro-
vided in Table 12.2. From Equation 12.3 the degrees of freedom for the Satterwaite
method equal 27.
For each method, the one-sided p-value is greater than 0.025 and each 95% confidence interval contains the non-inferiority margin of 4. Therefore, non-inferiority cannot be concluded. The Satterthwaite method has both the largest confidence interval upper limit of 4.37 and the largest p-value of 0.042. The estimated standard errors used for X̄ − Ȳ are approximately reversed from Example 12.1. For the equal variance method, the estimated standard error for X̄ − Ȳ equals 1.34, whereas it equals 1.45 for the large sample normal and Satterthwaite methods. The multipliers used for the confidence intervals were 1.960, 2.052, and 2.006.
TABLE 12.2
Summary of One-Sided p-Values and 95% Confidence Intervals
          Large Sample Normal   Satterthwaite   Equal Variance
p-Value   0.036                 0.042           0.029
95% CI    (–1.44, 4.24)         (–1.57, 4.37)   (–1.28, 4.08)
TABLE 12.3
Summary of One-Sided p-Values and 95% Confidence Intervals
          Large Sample Normal   Satterthwaite   Equal Variance
p-Value   0.015                 0.016           0.017
95% CI    (–1.76, 2.75)         (–1.79, 2.78)   (–1.83, 2.82)
Example 12.3
Example 12.3 considers larger samples, with nC = 50 and nE = 55 (the data are revisited in Example 12.4); the one-sided p-values and two-sided 95% confidence intervals are provided in Table 12.3.
In the Bayesian treatment of these hypotheses, the population variances may be known or unknown. Both cases are provided to illustrate the similarities and differences in applying the methods. As in Section 12.2.3.1, only the case where the population variances are unknown will be carried forward and compared in revisited examples with the methods in Section 12.2.3.1.
Variances Known. For a random sample of n from a normal distribution with mean μ and known variance σ², a normal prior distribution for μ that has mean υ and variance τ² leads to a normal posterior distribution for μ that has mean

(n x̄/σ² + υ/τ²)/(n/σ² + 1/τ²)

and variance

1/(n/σ² + 1/τ²).

When τ² is relatively large compared with σ²/n, the specific choice of τ² will have little impact. The Jeffreys prior has density h(μ) ∝ √(I(μ)) = 1/σ for −∞ < μ < ∞, which is not a proper density and is a noninformative prior for μ. When h(μ) = 1/σ is used, the resulting posterior density is that of a normal distribution having mean x̄ and variance σ²/n.
The parameters μ C and μE can be regarded as independent. Therefore, the
posterior distribution for μ C – μE is a normal distribution with mean equal to
the difference in the posterior means for μ C and μE and variance equal to the
sum of the posterior variances for μ C and μE.
When testing Ho: μ C – μE ≥ δ versus Ha: μ C – μE < δ, the null hypothesis
is rejected and non-inferiority is concluded when the posterior probability
of μ C – μE < δ exceeds some threshold (e.g., exceeds 1 – α/2) or alternatively
when the appropriate level credible interval for μ C – μE contains only values
less than δ.
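A minimal Python sketch of this known-variance analysis follows. It treats the Example 12.2 sample variances as if they were known population variances and uses a vague normal prior (τ² large relative to σ²/n), so the posterior essentially reproduces the large-sample frequentist answer.

    import numpy as np
    from scipy.stats import norm

    def posterior_of_mean(xbar, n, sigma2, nu=0.0, tau2=1e6):
        """Posterior mean and variance of mu for a normal sample with known
        variance sigma2 and a normal prior with mean nu and variance tau2."""
        precision = n / sigma2 + 1 / tau2
        return (n * xbar / sigma2 + nu / tau2) / precision, 1 / precision

    mC, vC = posterior_of_mean(40.5, 25, 49.0)   # control, Example 12.2 summaries
    mE, vE = posterior_of_mean(39.1, 30, 4.0)    # experimental
    # muC - muE is normal with mean mC - mE and variance vC + vE
    print(norm.cdf((4.0 - (mC - mE)) / np.sqrt(vC + vE)))   # P(muC - muE < 4) ~ 0.96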
Variances Unknown. In the frequentist setting, the analysis simplifies when
the additional assumption is made that the two underlying normal dis-
tributions have the same variance. In the Bayesian setting this additional
assumption complicates the analysis by leading to a joint posterior distribu-
tion where μ C and μE are not independent (μ C and μE are conditionally inde-
pendent given σ). We will discuss two Bayesian procedures, which will be
referred to as the Bayesian-γ and Bayesian-T procedures. These procedures
were introduced in Section 6.3 for three-arm non-inferiority trials. We repeat
the explanations of the procedures below.
Bayesian-γ Procedure. This procedure is similar to a procedure provided by
Ghosh et al.5 There are different choices on what function of the variance
(e.g., σ, σ 2, 1/σ, or 1/σ 2) to model. Applying the joint Jeffreys prior in each case
leads to joint posterior distributions that provide different posterior prob-
abilities. For this discussion, the variance will be modeled with σ 2, as this
will lead to a more convenient form for the joint posterior distribution. For
θ = σ², the density of the Jeffreys prior, h, satisfies h(μ, θ) ∝ θ^(−3/2) for −∞ < μ < ∞ and θ > 0. Then for X1, X2, …, Xn, a random sample from a normal distribution with mean μ and variance θ, where the prior density satisfies h(μ, θ) ∝ θ^(−3/2), the joint posterior density satisfies
g(μ, θ|x1, x2, …, xn) ∝ θ^(−1/2) exp[−(μ − x̄)²/(2θ/n)] × θ^(−n/2−1) exp[−Σ_{i=1}^n (xi − x̄)²/(2θ)].   (12.5)
We see from Expression 12.5 that the joint density factors into the product of an inverse gamma marginal density for θ and a normal conditional density for μ given θ. The inverse gamma distribution has shape and scale parameters equal to n/2 and Σ_{i=1}^n (xi − x̄)²/2, respectively, with mean equal to Σ_{i=1}^n (xi − x̄)²/(n − 2) and variance equal to 2[Σ_{i=1}^n (xi − x̄)²]²/[(n − 2)²(n − 4)]. Note that θ has an inverse gamma distribution with parameters n/2 and Σ_{i=1}^n (xi − x̄)²/2 if and only if 1/θ has a gamma distribution with parameters n/2 and 2/Σ_{i=1}^n (xi − x̄)², with mean equal to n/Σ_{i=1}^n (xi − x̄)². Given θ, μ has a normal distribution with mean equal to x̄ and variance equal to θ/n. Therefore, to simulate probabilities involving μ, a random value for 1/θ can be taken from the gamma distribution with parameters n/2 and 2/Σ_{i=1}^n (xi − x̄)², and then a random value for μ can be taken from a normal distribution having mean x̄ and variance θ/n.
Bayesian-T Procedure. Another approach, consistent with Gamalo et al.,6 addresses the problem of unknown variances by using translated t distributions for the posterior distributions. The mean of the control arm, μC, has a posterior distribution equal to the distribution of x̄ + TC sC/√nC, where TC has a t distribution with nC − 1 degrees of freedom. The mean of the experimental arm, μE, has a posterior distribution equal to the distribution of ȳ + TE sE/√nE, where TE has a t distribution with nE − 1 degrees of freedom.
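Both procedures are straightforward to simulate, as Example 12.4 below illustrates. The Python sketch that follows assumes the Example 12.2 summaries (nC = 25, x̄ = 40.5, sC² = 49; nE = 30, ȳ = 39.1, sE² = 4) and should roughly reproduce the Bayesian-γ and Bayesian-T entries of Table 12.5 (e.g., posterior probabilities of the null hypothesis near 0.039 and 0.044 at δ = 4).

    import numpy as np

    rng = np.random.default_rng(7)
    B = 100_000
    n_c, xbar, ss_c = 25, 40.5, 24 * 49.0   # ss = sum of squared deviations
    n_e, ybar, ss_e = 30, 39.1, 29 * 4.0

    # Bayesian-gamma: draw 1/theta ~ gamma(n/2, scale = 2/ss), then mu ~ N(xbar, theta/n)
    th_c = 1.0 / rng.gamma(n_c / 2, 2 / ss_c, B)
    th_e = 1.0 / rng.gamma(n_e / 2, 2 / ss_e, B)
    diff_g = rng.normal(xbar, np.sqrt(th_c / n_c)) - rng.normal(ybar, np.sqrt(th_e / n_e))
    print("gamma:", (diff_g >= 4).mean(), np.percentile(diff_g, [2.5, 97.5]))

    # Bayesian-T: mu has the distribution of xbar + T * s / sqrt(n), T ~ t(n - 1)
    mu_c = xbar + rng.standard_t(n_c - 1, B) * np.sqrt(ss_c / (n_c - 1)) / np.sqrt(n_c)
    mu_e = ybar + rng.standard_t(n_e - 1, B) * np.sqrt(ss_e / (n_e - 1)) / np.sqrt(n_e)
    print("T:", (mu_c - mu_e >= 4).mean(), np.percentile(mu_c - mu_e, [2.5, 97.5]))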
Example 12.4 revisits the examples in Section 12.2.3.1. We compare the pos-
terior probabilities of the null hypothesis for various margins and the cor-
responding 95% credible interval for the above Bayesian procedures with the
methods in Section 12.2.3.1.
Example 12.4
For the data of Example 12.1, 100,000 values for μC − μE were simulated for each Bayesian method. Table 12.4 provides a summary of the 95% confidence/credible intervals for the Bayesian methods and the three methods discussed in Section 12.2.3.1, along with the calculated p-values or simulated posterior probabilities of the null hypothesis in Expression 12.1 for various choices of a non-inferiority margin, δ.
TABLE 12.4
Summary of p-Values, Posterior Probabilities, and 95% Confidence/Credible Intervals
Non-Inferiority Margin, δ   Bayesian-γ      Bayesian-T      Large Sample Normal   Satterthwaite   Equal Variance
0                           0.850           0.846           0.865                 0.861           0.831
1                           0.618           0.618           0.617                 0.617           0.608
2                           0.327           0.330           0.327                 0.329           0.340
3                           0.117           0.121           0.116                 0.120           0.137
4                           0.029           0.030           0.026                 0.030           0.039
5                           0.005           0.005           0.004                 0.006           0.008
95% CI                      (–1.38, 4.08)   (–1.34, 4.12)   (–1.23, 4.03)         (–1.32, 4.12)   (–1.51, 4.31)
For Example 12.2, where the observed sample variances are reversed, we have that $2/\sum_{i=1}^{25}(x_i - \bar{x})^2 = 0.001701$ and $2/\sum_{j=1}^{30}(y_j - \bar{y})^2 = 0.01724$. One hundred thousand values for μC – μE were simulated for each Bayesian method.
For the Bayesian-γ procedure, the first step in simulating a value for μC (μE) is randomly selecting a value for $1/\sigma_C^2$ ($1/\sigma_E^2$) from a gamma distribution with parameters 12.5 and 0.001701 (15 and 0.01724). The second steps are the same as before. For the Bayesian-T procedure, a value for μC (μE) was selected at random from the distribution for $40.5 + 7T_C/5$ ($39.1 + 2T_E/\sqrt{30}$), where $T_C$ ($T_E$) has a t distribution with 24 (29) degrees of freedom.
Table 12.5 provides a summary of the 95% confidence/credible intervals for these Bayesian methods and the three methods discussed in Section 12.2.3.1, along with the calculated p-values or simulated probabilities of the null hypothesis in Expression 12.1 for various choices of a non-inferiority margin, δ. In this example the posterior probabilities and 95% credible intervals for the Bayesian methods are similar, respectively, to the p-values and 95% confidence intervals from the large sample normal and Satterthwaite methods.
For Example 12.3, we have that $2/\sum_{i=1}^{50}(x_i - \bar{x})^2 = 0.001748$ and $2/\sum_{j=1}^{55}(y_j - \bar{y})^2 = 0.000784$. These two values are the values of the scale parameters for simulating values for $1/\sigma_C^2$ and $1/\sigma_E^2$ from respective gamma distributions with values
for the shape parameters of 25 and 27.5. For both Bayesian methods, a value
for μC – μE is simulated in analogous fashion as above. Again, 100,000 values for
μ C – μE were simulated for each method. Table 12.6 provides a summary of the
95% confidence/credible intervals for the Bayesian methods and the three meth-
ods discussed in Section 12.2.3.1 along with the calculated p-values or simulated
probabilities of the null hypothesis in Expression 12.1 for various choices of a non-
inferiority margin, δ.
In all examples (Tables 12.4 through 12.6) the posterior probabilities and 95% credible intervals for the Bayesian methods are similar, respectively, to the p-values and 95% confidence intervals from the large sample normal and Satterthwaite methods. In each case the Bayesian-T and Satterthwaite methods gave quite similar results, with the Bayesian-T method producing slightly wider 95% confidence/
TABLE 12.5
Summary of p-Values, Posterior Probabilities, and 95% Confidence/Credible Intervals

Non-Inferiority Margin, δ   Bayesian-γ      Bayesian-T      Large Sample Normal   Satterthwaite   Equal Variance
0                           0.833           0.828           0.833                 0.829           0.850
1                           0.610           0.610           0.609                 0.608           0.617
2                           0.340           0.344           0.339                 0.341           0.328
3                           0.136           0.140           0.134                 0.139           0.118
4                           0.039           0.044           0.036                 0.042           0.029
5                           0.009           0.010           0.006                 0.010           0.005
95% CI                      (–1.49, 4.32)   (–1.59, 4.41)   (–1.44, 4.24)         (–1.57, 4.37)   (–1.28, 4.08)
TABLE 12.6
Summary of p-Values, Posterior Probabilities, and 95% Confidence/Credible Intervals

Non-Inferiority Margin, δ   Bayesian-γ      Bayesian-T      Large Sample Normal   Satterthwaite   Equal Variance
0                           0.666           0.662           0.666                 0.666           0.663
1                           0.331           0.329           0.331                 0.331           0.334
2                           0.096           0.096           0.096                 0.097           0.101
2.5                         0.042           0.042           0.041                 0.042           0.045
3                           0.016           0.016           0.015                 0.016           0.017
95% CI                      (–1.78, 2.77)   (–1.82, 2.79)   (–1.76, 2.75)         (–1.79, 2.78)   (–1.83, 2.82)
credible intervals. For the Bayesian-γ method, when the posterior probability of the null hypothesis was very small, the posterior probability lay between the p-values for the large sample normal method and the Satterthwaite method, and in each example the upper limit of the 95% credible interval for μC – μE lay between the upper limits of the 95% confidence intervals from the large sample normal and Satterthwaite methods.
Alternatively, when the control therapy is not terribly effective, the non-in-
feriority margin is small. Assuming no difference in the true means for the
experimental and control arms not only will lead to a large sample size but will
also reflect the belief that the experimental therapy is not terribly effective.
In deriving sample size formulas, we borrow from ideas in Kieser and Hauschke.7 Let μ1 and μ2 represent the assumed true means for the experimental and control arms, respectively, where μ2 – μ1 < δ. Let σ1 and σ2 denote the respective assumed underlying standard deviations, where l = σ1/σ2. Let k = nE/nC denote the allocation ratio. On the basis of the assumption of independent normal random samples and using a Satterthwaite-like approximation for the degrees of freedom, the test statistic Z in Expression 12.2 will be modeled as having an approximate noncentral t distribution with noncentrality parameter given by $(\mu_2 - \mu_1 - \delta)\big/\big(\sigma_2\sqrt{(l^2/k + 1)/n_C}\big)$ and degrees of freedom given by
$$\nu = \frac{(1 + l^2/k)^2}{1/(n_C - 1) + l^4/[k^2(kn_C - 1)]}.$$
We will use the approximate relation that the 100βth percentile of the noncentral t distribution is approximately equal to the noncentrality parameter plus the 100βth percentile of the t distribution having the same degrees of freedom. Then an iterative sample size formula for nC (nC appears on both sides of the equation, through ν) is given by
$$n_C = \frac{\sigma_2^2(l^2/k + 1)\big(t_{\alpha/2,\nu} + t_{\beta,\nu}\big)^2}{(\mu_2 - \mu_1 - \delta)^2}$$
for the frequentist procedures. For the Bayesian-T procedure, the required sample size should be similar to that calculated for the Satterthwaite-like procedure. Alternatively, for the frequentist or Bayesian methods, the required
sample size can be based on an assumed distribution over all possibilities
for (μ1, μ2).
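A small fixed-point iteration suffices to solve the relation above for nC. The sketch below assumes particular operating characteristics (one-sided α/2 = 0.025 and 80% power); the book's tables may rest on different choices, so the outputs here are illustrative only.

```python
import math
from scipy import stats

def n_control(mu1, mu2, delta, sigma2, l, k, alpha2=0.025, power=0.80):
    """Fixed-point iteration for n_C in the Satterthwaite-like relation above."""
    n_c = 10.0  # starting value
    for _ in range(200):
        nu = (1 + l**2 / k) ** 2 / (1 / (n_c - 1) + l**4 / (k**2 * (k * n_c - 1)))
        t_a = stats.t.ppf(1 - alpha2, nu)   # t_{alpha/2, nu}
        t_b = stats.t.ppf(power, nu)        # t_{beta, nu}
        n_new = sigma2**2 * (l**2 / k + 1) * (t_a + t_b) ** 2 / (mu2 - mu1 - delta) ** 2
        if abs(n_new - n_c) < 1e-9:
            break
        n_c = n_new
    return math.ceil(n_c)

# Example 12.5-style inputs: means 38 and 40, delta = 10, sigma2 = 12, l = 6/12.
print(n_control(38, 40, 10, 12, 0.5, 1.0), n_control(38, 40, 10, 12, 0.5, 2.0))
```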
The use of the sample size formulas will be illustrated in Example 12.5.
Example 12.5
Consider an endpoint that is the improvement from baseline in some score. The
non-inferiority margin for the difference in mean improvement is 10. Both a 1:1 and
a 2:1 experimental to control allocation are potentially being considered. The trial
will be sized on the basis of the following assumptions: the true mean improve-
ment is 38 and 40 for the experimental and control arms, respectively, and the
corresponding underlying standard deviations are 6 and 12, respectively. For the
equal variance approach, the common underlying variance will be assumed as
90 (the average of the variances for standard deviations of 6 and 12). Table 12.7
provides the determined sample sizes for the control arm.
For this example, the calculated sample sizes were smaller for the large sample
normal and equal variance methods. For the large sample normal method, when
the inference is based on the difference in means with equal and known underly-
ing variances, the overall sample size for a 2:1 allocation will be 12.5% greater
than the overall sample size for a 1:1 allocation. In this example, for each method,
the percentage increase in the calculated value for the overall sample size going
from a 1:1 allocation to a 2:1 allocation is greater than 12.5%. This is due to the
unequal variances and the iterative nature of the sample size equations for the Satterthwaite and equal variance methods.
For the large sample normal and Satterthwaite methods, the optimal allocation ratio is k = l = σ1/σ2. In this example, for the large sample normal and Satterthwaite methods, k = 0.5 is the allocation ratio that minimizes the calculated overall sample size. For the large sample normal method, the corresponding allocation is 16 subjects to the control arm and 8 subjects to the experimental arm. For the Satterthwaite method, the corresponding allocation is 18 subjects to the control arm and 9 subjects to the experimental arm. For the equal variance method, a 1:1 allocation ratio minimizes the calculated overall sample size with 15 subjects
allocated to each arm.
TABLE 12.7
Sample Sizes for the Control Arm

                      Allocation Ratio
Method                1:1    2:1    Percentage Change in Overall Sample Size (%)a
Large sample normal   14     12     35.1
Satterthwaite         15     14     38.6
Equal variance        15     11     15.9
a Based on the calculated value before rounding up.
subject was assigned to the control arm in the real study but to the experimental arm in the rerandomization, the subject's actual observed value is multiplied by δ for the rerandomization calculations; if a subject was assigned to the experimental arm in the real study but to the control arm in the rerandomization, the subject's actual observed value is divided by δ for the rerandomization calculations; and if a subject receives the same allocation in the rerandomization as in the real trial, the actual observed value is used. In such an approach, where outcomes must be positive, we are assuming the shapes of the distributions for the logarithms of the outcomes are identical between the control and experimental arms. That is, the shapes of the underlying distributions for the outcomes differ by a scale factor. Here the permutation test being valid means that the residuals of the logarithms are exchangeable. A sufficient condition for the permutation test to be valid is that each subject would have an outcome, if assigned to receive the experimental therapy, that is exactly δ times the outcome that the subject would have had if assigned to receive the control therapy.
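A rerandomization test of this kind is simple to simulate. A minimal sketch, assuming the log of the ratio of sample means as the test statistic (the text does not fix a particular statistic) and numpy arrays of positive outcomes as inputs:

```python
import numpy as np

rng = np.random.default_rng(3)

def rerandomization_pvalue(x, y, delta, n_rerand=10_000):
    """Rerandomization test sketch for the ratio hypotheses with margin delta.
    x: observed control outcomes; y: observed experimental outcomes (all > 0)."""
    values = np.concatenate([x, y])
    is_exp = np.concatenate([np.zeros(len(x), bool), np.ones(len(y), bool)])
    observed = np.log(y.mean() / x.mean())
    stats_ = np.empty(n_rerand)
    for b in range(n_rerand):
        new = rng.permutation(is_exp)
        v = values.copy()
        v[new & ~is_exp] *= delta   # control in trial -> experimental in rerandomization
        v[~new & is_exp] /= delta   # experimental in trial -> control in rerandomization
        stats_[b] = np.log(v[new].mean() / v[~new].mean())
    # One-sided p-value; small values support non-inferiority.
    return np.mean(stats_ >= observed)
```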
Let $X_1, X_2, \ldots, X_{n_C}$ and $Y_1, Y_2, \ldots, Y_{n_E}$ denote independent random samples for the control and experimental arms, respectively. Let μC and μE denote the means of the underlying distributions for the control and experimental arms and let $\sigma_C^2$ and $\sigma_E^2$ denote the respective variances. Let $S_C^2$ and $S_E^2$ denote the respective sample variances given by $S_C^2 = \sum_{i=1}^{n_C}(X_i - \bar{X})^2/(n_C - 1)$ and $S_E^2 = \sum_{j=1}^{n_E}(Y_j - \bar{Y})^2/(n_E - 1)$.
j=1
$$Z = \frac{\bar{Y} - \delta\bar{X}}{\sqrt{S_E^2/n_E + \delta^2 S_C^2/n_C}} \quad (12.9)$$
with Satterthwaite-like degrees of freedom given by
$$\frac{\big(\delta^2 s_C^2/n_C + s_E^2/n_E\big)^2}{\dfrac{\delta^4 s_C^4}{n_C^2(n_C - 1)} + \dfrac{s_E^4}{n_E^2(n_E - 1)}} \quad (12.10)$$
For the equal variance approach, the test statistic is
$$T = \frac{\bar{Y} - \delta\bar{X}}{\sqrt{S^2(1/n_E + \delta^2/n_C)}} \quad (12.11)$$
A 100(1 – α)% confidence interval for μE/μC by a Fieller approach is given by $\{\lambda: -t_{\alpha/2,n_C+n_E-2} < (\bar{y} - \lambda\bar{x})\big/\sqrt{s^2(1/n_E + \lambda^2/n_C)} < t_{\alpha/2,n_C+n_E-2}\}$.
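The Fieller set can be computed by a direct grid search over candidate values of λ. A minimal sketch, with the grid endpoints as assumptions that would need widening for a particular data set:

```python
import numpy as np
from scipy import stats

def fieller_ci(x, y, alpha=0.05, lam_grid=np.linspace(0.5, 1.5, 100_001)):
    """Grid-search version of the Fieller confidence set for mu_E/mu_C
    under the pooled (equal variance) model described above."""
    n_c, n_e = len(x), len(y)
    s2 = ((n_c - 1) * x.var(ddof=1) + (n_e - 1) * y.var(ddof=1)) / (n_c + n_e - 2)
    t_crit = stats.t.ppf(1 - alpha / 2, n_c + n_e - 2)
    t_vals = (y.mean() - lam_grid * x.mean()) / np.sqrt(s2 * (1 / n_e + lam_grid**2 / n_c))
    kept = lam_grid[np.abs(t_vals) < t_crit]
    return kept.min(), kept.max()  # assumes the set is an interval within the grid
```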
A Delta-Method Approach. A delta-method approach can be considered in
the testing of the hypotheses in Expression 12.7. This can be done using either
a test statistic based on the ratio of sample means or a confidence interval
for μE/μ C. Hasselblad and Kong8 considered a delta-method approach to the
retention fraction for relative risks, odds ratios, and hazard ratios. Rothmann
and Tsou9 evaluated the behavior of delta-method confidence interval procedures through the maintenance of a desired 0.025 one-sided type I error rate and the quality of the estimated standard error for the ratio of two normally distributed estimators.
The theorem behind the delta method can be found in many sources, such
as the book by Bishop, Fienberg, and Holland.10 For independent sequences
of random variables $\{U_n\}$ and $\{V_n\}$, we have from the delta-method theorem that if $\sqrt{n}(U_n - \mu_1) \xrightarrow{d} N(0, \sigma_1^2)$ and $\sqrt{n}(V_n - \mu_2) \xrightarrow{d} N(0, \sigma_2^2)$, then
$$\sqrt{n}\left(\frac{U_n}{V_n} - \frac{\mu_1}{\mu_2}\right) \xrightarrow{d} N\!\left(0, \frac{\sigma_1^2}{\mu_2^2} + \frac{\mu_1^2\sigma_2^2}{\mu_2^4}\right)$$
provided μ2 ≠ 0. It follows, as noted by Rothmann and Tsou,9 that
$$W_n = \sqrt{n}\left(\frac{U_n}{V_n} - \frac{\mu_1}{\mu_2}\right)\bigg/\sqrt{\frac{\sigma_1^2}{\mu_2^2} + \frac{\mu_1^2\sigma_2^2}{\mu_2^4}} = \sqrt{n}\left(U_n - V_n\frac{\mu_1}{\mu_2}\right)\bigg/\sqrt{\sigma_1^2 + \frac{\mu_1^2}{\mu_2^2}\sigma_2^2} \times (\mu_2/V_n) = Z_n \times (\mu_2/V_n)$$
This leads to the test statistic
$$W = \left(\frac{\bar{Y}}{\bar{X}} - \delta\right)\bigg/\sqrt{\frac{\bar{Y}^2}{\bar{X}^2}\left(\frac{S_E^2/n_E}{\bar{Y}^2} + \frac{S_C^2/n_C}{\bar{X}^2}\right)} \quad (12.12)$$
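Expression 12.12 is simple to evaluate from the arm-level data. A minimal sketch (the one-sided p-value convention for the non-inferiority alternative is an assumed choice):

```python
import numpy as np
from scipy import stats

def delta_method_w(x, y, delta):
    """Delta-method statistic of Expression 12.12 for the ratio of means."""
    n_c, n_e = len(x), len(y)
    xbar, ybar = x.mean(), y.mean()
    ratio = ybar / xbar
    se = np.sqrt(ratio**2 * (y.var(ddof=1) / n_e / ybar**2 +
                             x.var(ddof=1) / n_c / xbar**2))
    w = (ratio - delta) / se
    return w, stats.norm.sf(w)  # one-sided p-value for non-inferiority
```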
Example 12.6 revisits Examples 12.1 through 12.3. Here the non-inferiority
inference will be based on the ratio of the means. The posterior probabilities of the null hypothesis or p-values when δ = 0.9 and the 95% confidence/credible intervals will be determined using each Bayesian procedure and each procedure discussed in Section 12.3.3.1.
Example 12.6
We revisit Examples 12.1 through 12.3. Table 12.8 provides the p-value or the
posterior probabilities of the null hypothesis in Expression 12.7 (or Expression
12.8 if appropriate) when δ = 0.9 and the corresponding 95% confidence/credible
intervals for the six methods discussed in this section. The p-values or posterior
probabilities of the null hypothesis less than 0.025 are italicized. For the Bayesian
methods, the set of values for μ C and μE simulated earlier were used. All simulated
values for μC and μE were positive and far from zero.
TABLE 12.8
Summary of p-Values, Posterior Probabilities, and 95% Confidence Intervals

                      Example 12.1               Example 12.2               Example 12.3
Procedures            p-Value  95% CI            p-Value  95% CI            p-Value  95% CI
Large sample normal   0.0230   (0.901, 1.030)    0.0217   (0.902, 1.038)    0.0125   (0.910, 1.061)
Satterthwaite-like    0.0271   (0.899, 1.033)    0.0265   (0.899, 1.042)    0.0137   (0.909, 1.062)
Equal variance        0.0294   (0.898, 1.039)    0.0206   (0.903, 1.033)    0.0135   (0.909, 1.064)
Delta-method          0.0236   (0.901, 1.030)    0.0292   (0.898, 1.033)    0.0148   (0.908, 1.059)
Bayesian-γ a          0.0255   (0.900, 1.032)    0.0249   (0.900, 1.039)    0.0137   (0.909, 1.062)
Bayesian-T a          0.0270   (0.899, 1.033)    0.0283   (0.898, 1.042)    0.0144   (0.909, 1.063)
a Posterior probabilities of the null hypothesis are given under the p-value column.
parameter given by $(\mu_1 - \delta\mu_2)\big/\big(\sigma_2\sqrt{(l^2/k + \delta^2)/n_C}\big)$ and degrees of freedom given by
$$\nu = \frac{(\delta^2 + l^2/k)^2}{\delta^4/(n_C - 1) + l^4/[k^2(kn_C - 1)]}.$$
We will use the approximate relation that the 100βth percentile of the noncentral t distribution is approximately equal to the noncentrality parameter plus the 100βth percentile of the t distribution having the same degrees of freedom.
Example 12.7
Consider an endpoint that is the improvement from baseline in some score. The
non-inferiority threshold for the ratio of mean improvement is 0.75. Both a 1:1 and
a 2:1 experimental-to-control allocation are potentially being considered. As in
Example 12.5, the trial will be sized on the basis of true mean improvement of 38
and 40 for the experimental and control arms, respectively, with corresponding
underlying standard deviations of 6 and 12, respectively. For the equal variance
TABLE 12.9
Sample Sizes for the Control Arm

                      Allocation Ratio
Method                1:1    2:1    Percentage Change in Overall Sample Size (%)a
Large sample normal   20     17     26.9
Delta method          28     25     33.7
Satterthwaite         21     18     30.2
Equal variance        25     17     4.2
a Based on the calculated value before rounding up.
approach, the common underlying variance will be assumed as 90. Table 12.9
provides the determined sample sizes for the control arm.
For this example, the calculated sample sizes were smallest for the large sample
normal and equal variance methods. The delta method requires a larger sample
size than the large sample normal method since μ1/μ2 = 0.95 > 0.75.
When the inference is based on the difference in means with equal and known
underlying variances (i.e., when l/δ = 1), the overall sample size for a 2:1 alloca-
tion will be 12.5% greater than the overall sample size for a 1:1 allocation. In this
example for these methods, the percentage increase in the overall sample size
from a 1:1 allocation to a 2:1 allocation was quite different from 12.5% since l/δ (or
lμ2/μ1 when the delta-method is used) was different from 1. The iterative nature of
the Satterthwaite and equal variance methods also influences the particular percentage increase. For each of the large sample normal, Satterthwaite (l/δ = 2/3), and delta
methods (lμ2/μ1 ≈ 0.526), there was around a 30% increase in the overall sample
size from a 1:1 allocation to a 2:1 allocation. For the equal variance method l/δ =
4/3 and instead of a 12.5% increase in the calculated value for the overall sample
size going from a 1:1 allocation to a 2:1 allocation, there was a 4.2% increase.
The optimal allocation ratio is l/δ (or lμ2/μ1 when the delta method is used). In this example, for the large sample normal and the Satterthwaite methods, k = 2/3 is the allocation ratio that minimizes the calculated overall sample size. For the large sample normal method, this corresponds to 22 subjects in the control arm and 15 subjects in the experimental arm. For the Satterthwaite-like method, 25 subjects are allocated to the control arm and 16 subjects are allocated to the experimental arm. For the delta method, k ≈ 0.526 is the allocation ratio that minimizes the calculated overall sample size with 33 subjects allocated to the control arm and 17 subjects allocated to the experimental arm. For the equal variance method, k = 4/3 is the allocation ratio that minimizes the calculated overall sample size with 21 subjects allocated to the control arm and 29 subjects allocated to the experimental arm.
For a continuous random variable X, a median µ satisfies P(X ≤ µ) = 0.5 = P(X ≥ µ). We will consider only cases involving continuous distributions having unique medians. We will denote the medians for the underlying distributions for the experimental and control arms as µE and µC, respectively, and their difference by ∆ = µE − µC.
Medians are often used to describe the central location of a distribution
that is skewed. They are less frequently used for comparison purposes. A
difference in two means between an experimental and control arm in a ran-
domized trial represents the average benefit across trial subjects from being
randomized to the experimental arm instead of the control arm. A differ-
ence in the two medians does not have any analogous interpretation unless
additional assumptions are made on the underlying distributions (e.g., that
the shapes of the underlying distributions are equal). Non-inferiority testing involving a median is rare. There are, however, properties of non-inferiority testing that are distinct to medians relative to other metrics, and there is some common methodology that is more easily discussed in terms of medians.
For positive-valued variables, the median of the log-values is the log of the
median. Therefore, the ratio of the medians can be tested through the differ-
ence of the medians of log-values. Means do not have such a property (the
mean of the log-values is not the log of the mean).
For a sample, the sample median is the middle ordered value when the sample size is odd. When the sample size is even, any value between the two middle values is a sample median; however, it is common to use the average of the two middle values as the sample median. We will use this common convention in defining the sample median when the sample size is even.
Median Test. Let M denote the number of observations in the experimental arm that are greater than the median of the combined sample. Under the null hypothesis, M has a hypergeometric distribution, where
$$P(M = m) = \binom{N/2}{m}\binom{N/2}{n_E - m}\bigg/\binom{N}{n_E}$$
for m = 0, 1, 2, …, nE, where $\binom{a}{b} = 0$ whenever a < b. When Δ = 0, the distribution for M is symmetric about the mean nE/2 with variance nCnE/[4(N – 1)]. The test rejects Ho: Δ = 0 when M < d or M > nE − d + 1, where α/2 = P(M < d|Δ = 0) = P(M > nE − d + 1|Δ = 0). For large sample sizes, the value for d can be approximated by the greatest integer less than or equal to $n_E/2 - z_{\alpha/2}\sqrt{n_E n_C/[4(N - 1)]}$. When Δ = 0 (FC = FE), $(M - n_E/2)\big/\sqrt{\lambda n_E/4} \xrightarrow{d} N(0, 1)$ as nC, nE → ∞ and nC/N → λ > 0.
Let $X_{(1)} < X_{(2)} < \cdots < X_{(n_C)}$ and $Y_{(1)} < Y_{(2)} < \cdots < Y_{(n_E)}$ denote the respective order statistics. Without loss of generality, assume that N is an even number and that nC ≥ nE. The test statistic in Equation 12.14 can be reexpressed as
$$M = \sum_{j=1}^{n_E} I\big(Y_{(j)} - X_{(N/2-j+1)} > 0\big) \quad (12.15)$$
where 1 – αE and 1 – αC are the respective confidence coefficients for the individual confidence intervals for the medians of the experimental and control arms.
When $(L_E, U_E) = (Y_{(d_E)}, Y_{(n_E - d_E + 1)})$ and $(L_C, U_C) = (X_{(d_C)}, X_{(n_C - d_C + 1)})$ for some dE and dC, we have that for large nC and nE, from Theorem 2.2 of Hettmansperger,11 the confidence coefficients are related by
In addition, the asymptotic width of the confidence interval does not depend on the choice of αE and αC. From Theorem 2.3 of Hettmansperger,11 with probability 1,
where fC(0) is the common density at the median. Thus, there are many pairs of αE and αC that lead to a confidence interval for the difference in medians of (LE – UC, UE – LC) that has a confidence coefficient of approximately 1 – α and the same asymptotic width/efficiency. Note from Equation 12.17, when it is desired to have αE = αC, then set
is desired to have α E = α C, then set
(
zα E/2 = zα C/2 = zα/2 / λ + 1 − λ ) (12.18)
The overall assumption of the two underlying distributions having the same
shape (i.e., FC(y) = FE(y – Δ) for all y) is necessary for Equations 12.16 through
12.19 and other properties to hold. If the shapes of the two distributions
are quite different, then these methods may not produce confidence inter-
vals for the difference in medians of a desired level. Pratt13 noted that when the true medians are equal with underlying distributions having different shapes, then the two-sided level for the median test is asymptotically equal to $2[1 - \Phi(cz_{\alpha/2})]$, where $c = (1 - \lambda + \lambda\tau)\big/\sqrt{1 - \lambda + \lambda\tau^2}$, and τ is the ratio of the underlying densities (fC/fE) at the common median. It follows, for large sample sizes, that the desired significance level can be approximately maintained (as noted by Freidlin and Gastwirth15) when the underlying assumption of
equal shapes does not hold. Mathisen's Test. Let V denote the number of observations in the experimental arm that exceed the control sample median (nC odd). For large sample sizes, the value for d can be approximated by the greatest integer less than or equal to $n_E/2 - z_{\alpha/2}\sqrt{n_E(N + 1)/(4[n_C + 2])}$. When Δ = 0 (FC = FE), $(V - n_E/2)\big/\sqrt{n_E/(4\lambda)} \xrightarrow{d} N(0, 1)$ as nC, nE → ∞ and nC/N → λ > 0.
Asymptotic results can also be applied when nC is even and large. This can be shown from the relation
$$\sum_{j=1}^{n_E} I\big(Y_j > X_{(n_C/2+1)}\big) \le V \le \sum_{j=1}^{n_E} I\big(Y_j > X_{(n_C/2)}\big)$$
and that each bounding sum has the same asymptotic distribution as V when nC is odd. The probability distributions for these bounding sums are also easily obtained and can be used to obtain approximate critical values. Alternatively, for ease when nC is even, $\sum_{j=1}^{n_E} I(Y_j > X_{(n_C/2+1)})$ or $\sum_{j=1}^{n_E} I(Y_j > X_{(n_C/2)})$ can be used as the test statistic instead of V. Note that neither of these sums has a symmetric distribution when Δ = 0.
The test statistic in Equation 12.20 can be reexpressed as
$$V = \sum_{j=1}^{n_E} I\big(Y_{(j)} - X_{((n_C+1)/2)} > 0\big) \quad (12.21)$$
Therefore, from Equation 12.21, the value of V depends on the two samples through the ordered values of
$$Y_{(1)} - X_{((n_C+1)/2)} < Y_{(2)} - X_{((n_C+1)/2)} < \cdots < Y_{(n_E)} - X_{((n_C+1)/2)}$$
The decision from Mathisen’s test (rejecting or failing to reject Ho: Δ = 0) can
depend on which sample’s median is used. If the roles were switched for the
X’s and Y’s, the decision may change.
We provide a table similar to Table 1 of Hettmansperger.11 When α = 0.05
for each method, Table 12.10 provides the confidence coefficients for the
individual intervals for the control and experimental medians for allocation
ratios between 1 and 3, and as the allocation ratio goes to infinity. Equations
12.16, 12.18, 12.19, and 12.22 are used to determine the confidence coefficients.
Those common entries differ somewhat from those of Hettmansperger.11 The
TABLE 12.10
Confidence Coefficients for the Intervals for Control and Experimental Medians when α = 0.05

         Mathisen's Test     Median Test         Equal Coefficients   Equal Lengths
nC/nE    1 – αC   1 – αE     1 – αC   1 – αE     1 – αC = 1 – αE      1 – αC   1 – αE
1        0        0.994      0.834    0.834      0.834                0.834    0.834
1.5      0        0.989      0.785    0.871      0.836                0.879    0.794
2        0        0.984      0.742    0.890      0.840                0.910    0.770
2.5      0        0.980      0.705    0.902      0.845                0.933    0.754
3        0        0.976      0.673    0.910      0.849                0.950    0.742
→∞       0        0.95       0        0.95       0.95                 1        0.673
methods are arranged so that the confidence coefficient for the control arm
increases (experimental arm decreases) going from left to right. For the con-
fidence intervals based on Mathisen’s test, as the allocation ratio increased
from 1 to 3, the confidence coefficient of the confidence interval for the con-
trol median remained at zero and the confidence coefficient of the confi-
dence interval for the experimental median decreased from 0.994 to 0.976.
For the remaining three methods, the common confidence coefficient for
the individual confidence intervals for the medians was 0.834 when equal
sample sizes were used. For the confidence intervals based on the two sam-
ple median test, as the allocation ratio increased from 1 to 3, the confidence
coefficient of the confidence interval for the control median decreased from
0.834 to 0.673, whereas the confidence coefficient of the confidence interval
for the experimental median increased from 0.834 to 0.910. The common con-
fidence coefficient for the equal coefficients case was stable, varying from
0.834 to 0.849 as the allocation ratio ranged from 1 to 3. For the approach of
using equal asymptotic length confidence intervals, as the allocation ratio
increased from 1 to 3, the confidence coefficient of the confidence interval for
the control arm increased from 0.834 to 0.950, whereas the confidence coef-
ficient of the confidence interval for the experimental arm decreased from
0.834 to 0.742. For Mathisen’s test, the median test, and the equal coefficients
cases, the limiting confidence coefficient for the confidence interval for the
experimental median was 0.95 as the allocation ratio approached infinity. For
the equal-lengths approach, the limiting confidence coefficient of the confidence interval for the experimental median was 0.673 (i.e., $z_{\alpha_E/2} \to z_{0.025}/2$) as the allocation ratio approached infinity, whereas the limiting confidence coefficient of the confidence interval for the control median was 1.
When nC/nE is less than 1, except for Mathisen’s test, the confidence coef-
ficients can be found by reversing the roles of the control and experimen-
tal arms. For Mathisen’s test when α = 0.05, the confidence coefficient for
the individual confidence interval for the experimental median is greater
than 0.999 when nC/nE < 0.549. For Mathisen’s test, all the uncertainty in the
comparison of medians is reflected in the confidence interval for the experi-
mental median. As the uncertainty in the estimation of the control median
becomes larger relative to the uncertainty in the estimation of the experi-
mental median, a greater confidence coefficient is needed for the confidence
interval for the experimental median. A two–confidence interval procedure
based on Mathisen’s test is analogous to using the point estimate of the his-
torical effect of the active control therapy as the true effect of the active con-
trol in the non-inferiority trial (and thereby ignoring the uncertainty in the
estimate) when the constancy assumption holds.
Mann–Whitney–Wilcoxon Test. The Mann–Whitney–Wilcoxon test assesses the equality of two distributions. When the assumption is made that
the two distributions have the same shape (i.e., FC(y) = FE(y – Δ) for all y and
some Δ), inferences can be made on the shift parameter, which equals the
difference in the medians. Let X1, X2, …, X nC and Y1, Y2, …, YnE denote inde-
pendent random samples from distributions having respective distribution
functions FC and FE.
Consider testing Ho: Δ = 0 against Ha: Δ ≠ 0 under the assumption that the two distributions have the same shape. The Mann–Whitney–Wilcoxon test can be based on the sum of the ranks of the observations in one of the arms among the combined observations (i.e., $\sum_{i=1}^{n_E} R(Y_i)$) or equivalently based on the test statistic
$$W = \sum_{j=1}^{n_E}\sum_{i=1}^{n_C} I(Y_j - X_i > 0)$$
where I is an indicator function. Note that when Δ = 0, W has a symmetric
distribution about the mean nCnE/2 with variance nCnE(nC + nE + 1)/12. The
test rejects Ho: Δ = 0 when W < d or W > nCnE − d + 1, where α/2 = P(W < d|Δ
= 0) = P(W > nCnE − d + 1|Δ = 0). The Mann–Whitney–Wilcoxon test is the
locally most powerful rank test when FC has a logistic distribution.
For a non-inferiority margin of δ (for some δ > 0), the hypotheses can be
expressed as Ho: Δ ≤ –δ and Ha: Δ > –δ. The corresponding test statistic is
$$W_\delta = \sum_{j=1}^{n_E}\sum_{i=1}^{n_C} I(Y_j - X_i > -\delta)$$
The null hypothesis is rejected and non-inferiority is concluded whenever
Wδ > nCnE − d + 1, where α/2 = P(Wδ > nCnE − d + 1|Δ = −δ). Alternatively,
non-inferiority can be tested by finding the corresponding confidence inter-
val for the difference in medians, Δ, and comparing the interval with −δ.
It can be shown that a 100(1 − α)% confidence interval for Δ based on the Mann–Whitney–Wilcoxon test is given by $\big(Z_{(d)}, Z_{(n_C n_E - d + 1)}\big)$, where $Z_{(1)} < \cdots < Z_{(n_C n_E)}$ are the ordered differences of Yj – Xi for i = 1, …, nC and j = 1, …, nE. Non-inferiority is concluded when this confidence interval only contains values greater than −δ. For large sample sizes, the value for d can be approximated by the greatest integer less than or equal to $n_C n_E/2 - z_{\alpha/2}\sqrt{n_C n_E(n_C + n_E + 1)/12}$. The asymptotic behavior of the test can also be determined when FC ≠ FE. When FC ≠ FE, W has mean $n_C n_E p_1$ and variance
$$n_C n_E(p_1 - p_1^2) + n_C n_E(n_E - 1)(p_2 - p_1^2) + n_C n_E(n_C - 1)(p_3 - p_1^2),$$
where $p_1 = P(Y_1 > X_1)$, $p_2 = P(\min(Y_1, Y_2) > X_1)$, and $p_3 = P(Y_1 > \max(X_1, X_2))$ (see Theorem 3.5.1 of Hettmansperger17). When FC ≠ FE, it follows that as nC, nE → ∞ and nC/N → λ > 0, $P(W^* > z_{\alpha/2}) \to 0$ if $p_1 < 0.5$, $P(W^* > z_{\alpha/2}) \to 1 - \Phi\big(z_{\alpha/2}\big/\sqrt{12\{(1 - \lambda)(p_2 - 0.25) + \lambda(p_3 - 0.25)\}}\big)$ if $p_1 = 0.5$, and $P(W^* > z_{\alpha/2}) \to 1$ if $p_1 > 0.5$.
Since the test statistics are distribution free when Δ = 0, for small samples
d can either be found from mathematical calculations or by simulations. The
asymptotic approximate results can be used for large sample sizes.
These methods can be used to find confidence intervals for the ratio of medi-
ans when the two underlying distributions are positive-valued and “differ” by
a scale factor. In this case we have FC(0) = 0 and FC(y) = FE(θy) for all y > 0 and
some θ > 0, which represents the experimental to control ratio of the medians.
Then the underlying distributions for the logarithms of the observations have
the same shape (i.e., GC(y) = GE(y + logθ)) with the difference in medians of log
θ. These methods can be used to find a confidence interval for log θ that can be
converted to a confidence interval for the ratio of medians θ.
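The interval from the ordered pairwise differences is easy to compute directly for moderate sample sizes. A minimal sketch using the large-sample approximation for d:

```python
import numpy as np
from scipy import stats

def mww_shift_ci(x, y, alpha=0.05):
    """CI for the shift Delta from the ordered differences Y_j - X_i."""
    n_c, n_e = len(x), len(y)
    z = np.sort((y[:, None] - x[None, :]).ravel())
    sd = np.sqrt(n_c * n_e * (n_c + n_e + 1) / 12.0)
    d = int(np.floor(n_c * n_e / 2.0 - stats.norm.ppf(1 - alpha / 2) * sd))
    return z[d - 1], z[n_c * n_e - d]  # Z_(d) and Z_(nC nE - d + 1), 1-based

# For a ratio of medians with positive data, apply the same interval to the
# logarithms of the observations and exponentiate the endpoints.
```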
For a random sample of size n from a distribution with median µ and density f positive and continuous at µ, the sample median satisfies
$$\sqrt{n}(\tilde{X} - \mu) \xrightarrow{d} N\big(0, (4[f(\mu)]^2)^{-1}\big)$$
For the difference in the two sample medians, as nC, nE → ∞ and nC/N → λ > 0,
$$\sqrt{N}(\hat{\Delta} - \Delta) \xrightarrow{d} N\big(0, (4\lambda(1 - \lambda)[f(\mu)]^2)^{-1}\big).$$
When comparing two distributions that have the same shape (i.e., FC(y) = FE(y – Δ) for all y and some Δ) where a common variance exists, $4[f(\mu)]^2\sigma^2$ is also the relative efficiency of the difference in sample medians to the difference in sample means when there are independent random samples. When the variance does not exist, the median or difference in medians is more efficient. For underlying double exponential distributions, the relative efficiency of the difference in sample medians to the difference in sample means is 2.
Hodges–Lehmann Estimator of the Difference in Medians. The corresponding estimator of the difference in medians based on the Mann–Whitney–Wilcoxon test is the Hodges–Lehmann estimator $\hat{\Delta}_{HL} = \mathrm{med}_{i,j}(Y_j - X_i)$. As nC, nE → ∞ and nC/N → λ > 0, $\sqrt{N}(\hat{\Delta}_{HL} - \Delta) \xrightarrow{d} N(0, \tau^{-2})$, where
$$\tau = \sqrt{12\lambda(1 - \lambda)} \int_{-\infty}^{\infty} f_C^2(x)\,\mathrm{d}x.$$
When the two underlying distributions have the same shape, the efficiency of the Hodges–Lehmann estimator of the difference in medians relative to the difference in sample medians equals $3\big[\int_{-\infty}^{\infty} f_C^2(x)\,\mathrm{d}x\big]^2\big/[f(\mu)]^2$. For normal random samples with equal underlying variances, this relative efficiency is approximately 1.5 and the relative efficiency of the Hodges–Lehmann estimator to the difference in means is approximately 0.955. For random samples from double exponential distributions, the relative efficiency of the Hodges–Lehmann estimator of the difference in medians to the difference in sample means is approximately 1.5.
where 0 ≤ δ < 0.5. The value for δ, the non-inferiority margin, should depend on both the effect of the control therapy and the differences in the utility or preferences of the categories. In general, the fewer the number of categories (e.g., only "failure" and "success"), the larger the difference in utility between successive categories will be, and thus the tendency for a smaller margin.
We will describe the test procedure in Munzel and Hauschke18 for testing the hypotheses in Expression 12.23. Let $X_1, X_2, \ldots, X_{n_C}$ and $Y_1, Y_2, \ldots, Y_{n_E}$ denote independent random samples from distributions having respective distribution functions FC and FE. An unbiased, consistent estimator for p is given by $\hat{p} = (\bar{R}_E - (n_E + 1)/2)/n_C$, where $\bar{R}_E$ is the arithmetic average of the ranks of the observations in the experimental arm among all observations. That is, $\bar{R}_E = \sum_{j=1}^{n_E} R(Y_j)/n_E$, where $R(Y_j)$ is the rank of Yj in the ordering of the combined sample. Define also $\bar{R}_C = \sum_{j=1}^{n_C} R(X_j)/n_C$, where $R(X_j)$ is the rank of Xj in the ordering of the combined sample. When ties occur, $R(Y_j)$ ($R(X_j)$) is the midrank. Let $R^{(C)}(X_j)$ denote the rank of Xj among $X_1, \ldots, X_{n_C}$ and let $R^{(E)}(Y_j)$ denote the rank of Yj among $Y_1, \ldots, Y_{n_E}$. Define
$$J_C^2 = \sum_{j=1}^{n_C}\big(R(X_j) - R^{(C)}(X_j) - \bar{R}_C + (n_C + 1)/2\big)^2\big/(n_C - 1)$$
and
$$J_E^2 = \sum_{j=1}^{n_E}\big(R(Y_j) - R^{(E)}(Y_j) - \bar{R}_E + (n_E + 1)/2\big)^2\big/(n_E - 1).$$
Then for i = E, C, define $\hat{u}_i^2$ as the corresponding rank-based variance estimator based on $J_i^2$, and the test statistic
$$Q = \frac{\hat{p} - (0.5 - \delta)}{\sqrt{\hat{u}_C^2/n_C + \hat{u}_E^2/n_E}} \quad (12.24)$$
For large sample sizes, Q may be compared with standard normal critical values; the approximation can be inaccurate for small sample sizes. They recommend, for per-group sample sizes between 15 and 50, using a Satterthwaite-like approximation of the degrees of freedom. The quality of the approximation of the degrees of freedom was dependent on the number of categories and was deemed sufficient when there were at least three categories.
The value of tα/2,ν (where ν is the Satterthwaite degrees of freedom) would replace zα/2 for the determination of confidence intervals and as a critical value in hypothesis testing.
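The statistic in Expression 12.24 can be computed with midranks via scipy.stats.rankdata. In the sketch below, the normalizations $\hat{u}_C^2 = J_C^2/n_E^2$ and $\hat{u}_E^2 = J_E^2/n_C^2$ are an assumed (Brunner–Munzel-type) choice, since the text's exact definition of $\hat{u}_i^2$ is not reproduced above.

```python
import numpy as np
from scipy import stats

def q_statistic(x, y, delta):
    """Sketch of the rank-based statistic Q in Expression 12.24."""
    n_c, n_e = len(x), len(y)
    r_all = stats.rankdata(np.concatenate([x, y]))  # midranks, combined sample
    r_x, r_y = r_all[:n_c], r_all[n_c:]
    p_hat = (r_y.mean() - (n_e + 1) / 2.0) / n_c
    j2_c = np.sum((r_x - stats.rankdata(x) - r_x.mean() + (n_c + 1) / 2.0) ** 2) / (n_c - 1)
    j2_e = np.sum((r_y - stats.rankdata(y) - r_y.mean() + (n_e + 1) / 2.0) ** 2) / (n_e - 1)
    u2_c, u2_e = j2_c / n_e**2, j2_e / n_c**2  # assumed Brunner-Munzel normalization
    return (p_hat - (0.5 - delta)) / np.sqrt(u2_c / n_c + u2_e / n_e)
```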
For sizing a trial, let p′ represent the assumed value for p. Let σ 1 and σ 2
denote the respective assumed underlying standard deviations of FC(X1) and
FE(Y1), where l = σ 1/σ 2. Let k = nE/nC denote the allocation ratio. Then the
sample size for the control arm is given by
When the sample sizes are small, the term zα/2 + zβ in Equation 12.25 can be
replaced with tα/2,ν + tβ,ν (where v is the Satterthwaite degrees of freedom).
This creates an equation where iterations will be needed to determine the
sample sizes, since nC appears on both sides of the equation.
For two noncontinuous distributions, Wellek and Hampel20 proposed a nonparametric test of equivalence around the parameter P(Y > X | Y ≠ X). This parameter ignores the ties and the probability of a tie. For both equivalence and non-inferiority testing, a tie is consistent with the alternative hypothesis. Therefore, use of the parameter P(Y > X | Y ≠ X) will greatly penalize an experimental therapy in situations where a tie is quite likely and will lead to conservative testing, as noted by Munzel and Hauschke.18
For equivalence testing, a constant odds ratio based on the Wilcoxon mid-
ranks statistic and derived from the corresponding exact permutation distri-
bution was considered by Mehta, Patel, and Tsiatis.21
Often scores are assigned to the ordered categories and the data are treated
as continuous. However, it may be difficult to interpret a specific difference
between arms in the score, and the scores themselves may be subjective.
References
1. Hollander, M. and Wolfe, D.A., Nonparametric Statistical Methods, John Wiley &
Sons, New York, 1973.
2. Wiens, B.L., Randomization as a basis for inference in non-inferiority trials,
Pharm. Stat., 5, 265–271, 2006.
3. Good, P., Permutation, Parametric, and Bootstrap Tests of Hypotheses, Springer, New
York, NY, 2005.
4. Satterthwaite, F., An approximate distribution of estimates of variance compo-
nents, Biometrics, 2, 110–114, 1946.
13.1 Introduction
Many meaningful clinical endpoints are time-to-event endpoints—for
example, overall survival, time to a response, time to a cardiac-related event,
and time to progressive disease. When the intention and the outcome are that all subjects are followed until the event is observed (no censoring), time-to-event endpoints can be analyzed as continuous endpoints. The inferences can be based on the mean, median, or some other quantity relevant to continuous endpoints. However, in most practical cases involving a time-to-event endpoint, not all subjects are followed until an event (i.e., some subjects have their times censored). This limits the types of analyses that can be performed.
Nonparametric inferences on means and/or medians may not be possible. To
base the inference on means or medians may require following subjects for
a long and perhaps impractical length of time.
For a time-to-event endpoint, the amount of available information for
inferential purposes is tied to the total number of events and increases either
by continuing the follow-up on subjects that have not had events or by begin-
ning to follow additional subjects for events. For standard binary or con-
tinuous endpoints, the amount of available information increases solely by
including the outcomes of additional subjects.
For clinical trials, most time-to-event endpoints are defined as the time
from randomization (or enrollment or start of therapy) to the event of inter-
est or the first of many events of interest. Typically, at randomization, the
subject does not have the event or any of the events of interest. Starting the
time-to-event endpoint at randomization is also important because it is through randomization that subjects and their prognoses are fairly allocated to the treatment arms. In addition, to maintain the integrity of
this fairness of randomization, intent-to-treat analyses should be conducted
where all subjects are followed, regardless of adherence, to an event or the
end of study (i.e., until the data cutoff date or until some prespecified maxi-
mum follow-up has been completed). This allows for a valid comparison of
the study arms.
24 months than when the mean placebo survival is 4 months. The benefit of
the experimental therapy on survival is truly defined by the improvement
in mean/expected survival. However, usually the inference is not based on
the difference in mean/expected time-to-event. Therefore, the results when
positive may need to be translated into some form that provides an impres-
sion of the clinical benefit of the experimental therapy. Whenever possible,
although such instances are few, an inference should be based on a differ-
ence of means.
Composite Endpoints. Composite time-to-event endpoints are popular.
Such an endpoint is the time to the first event in a set of events of interest.
Examples include the time to the first event of stroke, myocardial infarction,
and death for cardiovascular trials, and the time to the first event of disease
progression and death in metastatic or advanced cancer. The use of a com-
posite endpoint may be necessary when the disease can be characterized
by many factors. For example, as reported by Chi,1 a disease may be charac-
terized by its pathophysiology, severity, signs and symptoms, progression,
morbidity, and mortality. Since the event or hazard rate for a composite end-
point is greater than that of the individual components, the use of a com-
posite endpoint has the advantage of requiring fewer subjects and having
an earlier analysis than a trial designed on an individual component. Many
researchers have written on the issues and disadvantages of using composite
endpoints.1–4
The individual components of a composite endpoint should be relevant
and meaningful for subjects and constitute clinical benefit. When the com-
ponents are equally important, and a new drug demonstrates superior effi-
cacy, the particular distribution of events across the individual components
or the differences between arms in the distribution of the events do not con-
tribute any necessary additional information on the new drug’s overall ben-
efit. When the importance varies across components, it is more difficult to
interpret the endpoint and the corresponding results. For example, a drug
that improves a major component (e.g., death) while having an adverse effect
on a minor component may be beneficial, but a drug that improves a minor
component while having an adverse effect on a major component would not
be beneficial. In addition, the severity of a component may change over time
owing to improvements in the treatment or management of the component,
thus reducing the utility of including that component in the composite end-
point. A composite endpoint may not be sensible if the components have
widely different importance.
When the components are not equally important, additional analyses involving subcomposite endpoints may need to be done to assess the effects. The subcomposite time-to-event endpoint should exclude the events of less relative importance. This process of excluding events of lesser importance may need to be repeated until an analysis is done on a subcomposite time-to-event endpoint that includes only the most important events of equal value. To perform valid analyses on these additional
$$h(t) = \lim_{\varepsilon \to 0} \frac{P(t \le T \le t + \varepsilon \,|\, T \ge t)}{\varepsilon} \quad (13.1)$$
For an individual subject, the hazard rate at time t, h(t), represents the
instantaneous risk of an event at time t given that the subject has not
had an event by time t. If the event is death, h(t) represents the instanta-
neous risk of death at time t for a subject who is alive as time t approaches.
Additionally, for a subject who is alive at time t (without an event by time
t), the probability the subject dies (has an event) during the next ε of time is
approximately εh(t) for a small ε. Evaluating the limit in Equation 13.1 gives
$$h(t) = \lim_{\varepsilon \to 0} \frac{P(t \le T \le t + \varepsilon \,|\, T \ge t)}{\varepsilon} = \frac{f(t)}{S(t)} = -\frac{\mathrm{d}}{\mathrm{d}t}\log S(t).$$
The cumulative hazard function is given by $H(t) = \int_0^t h(x)\,\mathrm{d}x = -\log S(t)$ for t ≥ 0.
When the hazard functions for the experimental and control arms are proportional, then the common ratio of the hazard functions, called the hazard ratio, is often used to measure the difference in the two distributions. The hazard ratio, θ, satisfies
$$\theta = \frac{h_E(t)}{h_C(t)} = \frac{H_E(t)}{H_C(t)} = \frac{-\log S_E(t)}{-\log S_C(t)}$$
and $S_E(t) = [S_C(t)]^\theta$ for all t ≥ 0.
In this chapter we will discuss the types of censoring, reasons for cen-
soring, and the issue of censoring deaths in Section 13.2. Non-inferiority
analyses involving exponential distributions are discussed in Section 13.3.
Non-inferiority analysis based on a hazard ratio from a proportional haz-
ards model is discussed in Section 13.4. Non-inferiority analyses either at
landmarks or involving medians are discussed in Section 13.5. The extension
of the testing problem in Section 13.5 to an inference over a preset interval is
discussed in Section 13.6.
13.2 Censoring
Throughout this chapter, the only type of censoring that will be considered is
right censoring. Right censoring means that a subject’s true time is unknown
and to the right of (greater than) the censored time. When the term censoring
is used, it will refer to right censoring unless otherwise stated.
Informative censoring occurs when the prognosis of a given subject with
a censored time is not independent of the censoring. In other words, what
to expect for a given subject’s ultimate or true time-to-event, which is cen-
sored at time x, is not represented by the follow-up experience of those
subjects in the same group with times that exceed x. Whenever a subject is
censored because treatment is being withheld because of a declining physical condition, the censoring is likely informative.
1. The subject did not have an event observed at the time of the data
cutoff for the analysis.
2. The subject completed the prespecified required time on study with-
out having an event.
For this type of censoring, the random censoring times are not independent of the
actual times. Although the censoring times are not independent of the time-
to-event endpoints, the censoring is not informative for a given treatment arm
provided the actual times to the event are a random sample.
Inclusion of Death as an Event. Censoring a subject’s time-to-event endpoint
because of death or death not related to disease is problematic and creates a
hypothetical endpoint. Follow-up ceases at death; there is no remaining time
to the event and thus there is no loss to follow-up or missing data. Complete
information or follow-up has been done on the subject. When death is not an
event of interest for the time-to-event endpoint, there is no actual time-to-
event for a subject who dies without being observed for an event of interest.
In situations where such censoring occurs, the subjects’ time-to-event times
in a given arm are not a random sample from that distribution that is being
estimated. Such an endpoint is a hypothetical endpoint. That is, for a given arm, what is being estimated is the distribution for the time-to-event, or the time-to-event when death occurs without experiencing the event, and we pretend that the remaining time to an event for dead subjects can be represented by the remaining time to an event for living subjects still under observation for an event. Having living subjects represent dead subjects goes beyond informative censoring and defies common sense. Another setting where this censoring
occurs involves analyses that include only disease-related deaths. This not
only creates a hypothetical endpoint where living subjects represent dead
subjects but also the determination of whether a death was related to the
disease, or even to the treatment, may be inexact and subjective.
times are better than smaller times (i.e., the event is undesirable), the null and alternative hypotheses to be considered are
$$H_o{:}\ \mu_C - \mu_E \ge \delta \quad \text{versus} \quad H_a{:}\ \mu_C - \mu_E < \delta \quad (13.2)$$
That is, the null hypothesis is that the active control is superior to the experi-
mental treatment by at least the prespecified quantity of δ ≥ 0. The alterna-
tive hypothesis is that the active control is superior by a smaller amount, or
the two treatments are identical, or the experimental treatment is superior.
When δ = 0, the hypotheses in Expression 13.2 reduce to classical one-sided
hypotheses for superiority testing. Rejection of the null hypothesis (Ho) leads
to the conclusion that the experimental treatment is noninferior to the con-
trol treatment. When smaller times are more desired than larger times (i.e.,
the event is desirable), the roles of μE and μ C in the hypotheses in Expression
13.2 would be reversed (i.e., test Ho: μ C – μE ≤ –δ vs. Ha: μ C – μE> –δ).
Let rE and rC denote the number of events observed in the experimental and control arms, respectively. For type II censoring, an approximate 100(1 – α)% confidence interval for the difference in means is given by $\hat{\mu}_E - \hat{\mu}_C \pm z_{\alpha/2}\sqrt{\hat{\mu}_E^2/r_E + \hat{\mu}_C^2/r_C}$ when rE and rC are sufficiently large. For type I censoring, an approximate 100(1 – α)% confidence interval based on Cox's proposal is given by
$$\frac{2r_E\hat{\mu}_E}{2r_E + 1} - \frac{2r_C\hat{\mu}_C}{2r_C + 1} \pm z_{\alpha/2}\sqrt{\frac{\hat{\mu}_E^2}{r_E + 0.5} + \frac{\hat{\mu}_C^2}{r_C + 0.5}}$$
when rE and rC are sufficiently large.
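Both interval formulas are one-liners given the arm summaries. A sketch, using the event counts and mean estimates for this example as they are quoted in Example 13.2 below:

```python
import numpy as np

def diff_means_ci(mu_e, r_e, mu_c, r_c, z=1.959964, type1=True):
    """Approximate CI for mu_E - mu_C under exponential models; type1=True
    uses Cox's proposal for type I censoring, otherwise the type II formula."""
    if type1:
        est = 2 * r_e * mu_e / (2 * r_e + 1) - 2 * r_c * mu_c / (2 * r_c + 1)
        half = z * np.sqrt(mu_e**2 / (r_e + 0.5) + mu_c**2 / (r_c + 0.5))
    else:
        est = mu_e - mu_c
        half = z * np.sqrt(mu_e**2 / r_e + mu_c**2 / r_c)
    return est, (est - half, est + half)

print(diff_means_ci(105.4, 137, 117.4, 63, type1=True))
print(diff_means_ci(105.4, 137, 117.4, 63, type1=False))
```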
The following example illustrates the use of these formulas.
Example 13.1
TABLE 13.1
Calculated 95% Confidence Intervals for the Difference in Means by Method

Method              Estimate   95% CI          Width
Type I censoring    –11.5      (–45.3, 22.4)   67.7
Type II censoring   –12.0      (–46.0, 21.9)   67.9
more than the estimate of the experimental mean. This led to a larger estimate of
the difference in means (–11.5 vs. –12). Had the values for the experimental and
control arms been exchanged for each other, the type II censoring method would
have the larger estimate for the difference in means (12 vs. 11.5).
That is, the null hypothesis is that the mean for the active control is superior
to that of the experimental treatment by at least δμ C, where δ ≥ 0. The alterna-
tive hypothesis is that the active control is superior by a smaller amount, or
the two treatments are identical, or the experimental treatment is superior.
When δ = 0, the hypotheses in Expression 13.3 reduce to classical one-sided
hypotheses for a superiority trial. Rejection of the null hypothesis (Ho) leads
to the conclusion that the experimental treatment is noninferior to the con-
trol treatment. When smaller times are more desired than larger times, the
roles of μE and μ C in the hypotheses in Expression 13.3 would be reversed
(i.e., test Ho: μ C/μE ≤ 1 – δ vs. Ha: μ C/μE> 1 – δ). Note that μE/μ C is also the
scale factor relating the two exponential distributions and the control versus
experimental hazard ratio (the ratio of the instantaneous risk of an event).
For type II censoring within each arm, we have that $V = (\hat{\mu}_E/\mu_E)\big/(\hat{\mu}_C/\mu_C)$ has an F distribution with 2rE and 2rC degrees of freedom in the numerator and denominator, respectively. For any 0 < γ < 1, the corresponding 100(1 – γ)% percentile, $F_{\gamma,2r_E,2r_C}$, is defined as that value satisfying $\gamma = P(V > F_{\gamma,2r_E,2r_C})$. For other F distributions, similar notation will be used for the percentiles. Note that $F_{1-\gamma,2r_C,2r_E} = 1/F_{\gamma,2r_E,2r_C}$. For 0 < α < 1, we have that
$$P\left(F_{1-\alpha/2,2r_E,2r_C} < \frac{\hat{\mu}_E/\mu_E}{\hat{\mu}_C/\mu_C} < F_{\alpha/2,2r_E,2r_C}\right) = P\left((\hat{\mu}_E/\hat{\mu}_C)F_{1-\alpha/2,2r_C,2r_E} < \frac{\mu_E}{\mu_C} < (\hat{\mu}_E/\hat{\mu}_C)F_{\alpha/2,2r_C,2r_E}\right).$$
Thus, a 100(1 – α)% confidence interval for the hazard ratio is given by
$$\big((\hat{\mu}_E/\hat{\mu}_C)F_{1-\alpha/2,2r_C,2r_E},\ (\hat{\mu}_E/\hat{\mu}_C)F_{\alpha/2,2r_C,2r_E}\big).$$
For type I censoring, the corresponding approximate 100(1 – α)% confidence interval is
$$\left(\frac{r_E(2r_C + 1)\hat{\mu}_E}{r_C(2r_E + 1)\hat{\mu}_C}F_{1-\alpha/2,2r_C+1,2r_E+1},\ \frac{r_E(2r_C + 1)\hat{\mu}_E}{r_C(2r_E + 1)\hat{\mu}_C}F_{\alpha/2,2r_C+1,2r_E+1}\right).$$
These results may apply approximately for other types of random censoring.
For endpoints for which larger values are better, non-inferiority
is concluded if the confidence interval for the ratio of means (the recipro-
cal of the hazard ratio) contains only values greater than the non-inferiority
threshold.
Another way of determining a confidence interval for the hazard ratio applies the asymptotic distributions for the natural log of the maximum likelihood estimators of μE and μC. For type II censoring, an approximate 100(1 – α)% confidence interval for the control versus experimental log hazard ratio is given by $\log\hat{\mu}_E - \log\hat{\mu}_C \pm z_{\alpha/2}\sqrt{1/r_E + 1/r_C}$. For type I censoring, from applying the proposal of Cox, we analogously have an approximate 100(1 – α)% confidence interval for the control versus experimental log hazard ratio given by
$$(\log\hat{\mu}_E - \log\hat{\mu}_C) + \log\frac{r_C(2r_E + 1)}{r_E(2r_C + 1)} \pm z_{\alpha/2}\sqrt{\frac{1}{r_E + 0.5} + \frac{1}{r_C + 0.5}}.$$
These results may apply approximately for other types of random censoring.
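These intervals are easy to compute with scipy. Note that the text's $F_{\gamma,a,b}$ is an upper-tail percentile, so scipy's lower-tail quantile function is called with the complementary probability; the inputs below can be checked against Example 13.2.

```python
import numpy as np
from scipy import stats

def hr_ci(mu_e, r_e, mu_c, r_c, alpha=0.05, use_f=True, type1=True):
    """CIs for the control vs. experimental hazard ratio mu_E/mu_C."""
    if use_f:
        if type1:
            c = r_e * (2 * r_c + 1) * mu_e / (r_c * (2 * r_e + 1) * mu_c)
            dfn, dfd = 2 * r_c + 1, 2 * r_e + 1
        else:
            c = mu_e / mu_c
            dfn, dfd = 2 * r_c, 2 * r_e
        # F_{1-alpha/2} (upper tail) is the lower-tail alpha/2 quantile, and vice versa.
        return c * stats.f.ppf(alpha / 2, dfn, dfd), c * stats.f.ppf(1 - alpha / 2, dfn, dfd)
    log_hr = np.log(mu_e / mu_c)
    if type1:
        log_hr += np.log(r_c * (2 * r_e + 1) / (r_e * (2 * r_c + 1)))
        half = stats.norm.ppf(1 - alpha / 2) * np.sqrt(1 / (r_e + 0.5) + 1 / (r_c + 0.5))
    else:
        half = stats.norm.ppf(1 - alpha / 2) * np.sqrt(1 / r_e + 1 / r_c)
    return np.exp(log_hr - half), np.exp(log_hr + half)
```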
Example 13.2 illustrates these formulas.
Example 13.2
We revisit Example 13.1. Note that the true control versus experimental hazard
ratio is 10/9 (≈1.111). We will apply each confidence interval formula for a hazard
ratio or log-hazard ratio conditioning on the number of uncensored observations
in each arm. We again have rE = 137 uncensored observations in the experimen-
tal arm with µ̂E = 105.4 and rC = 63 uncensored observations in the control arm
with µ̂C = 117.4. The corresponding confidence intervals for the hazard ratio are
provided in Table 13.2. All confidence intervals contain the true control versus experimental hazard ratio of roughly 1.111. For a non-inferiority threshold of 0.8,
non-inferiority would fail to be concluded regardless of the method. The confi-
dence intervals for the type II censoring methods have a slightly greater relative
width than the type I censoring methods. Which method is more conservative
depends on various factors, including whether the experimental or control arm
performed better in the clinical trial. In this example, the control arm performed
better (despite having a poorer underlying distribution) and the F distribution
methods gave more conservative intervals (i.e., had smaller lower limits) than
their normal distribution counterparts. Likewise, the type II censoring methods
were more conservative than the type I censoring methods. Had the values for the experimental and control arms been exchanged for each other, the opposite relations for conservatism would have held, as noted by the order of the upper limits of the 95% confidence intervals.

TABLE 13.2
Calculated 95% Confidence Intervals for the Hazard Ratio by Method

Method                                   Estimate   95% CI           Relative Width
Type I censoring, F distribution         0.901      (0.663, 1.205)   1.816
Type II censoring, F distribution        0.898      (0.660, 1.201)   1.820
Type I censoring, normal distribution    0.901      (0.670, 1.214)   1.813
Type II censoring, normal distribution   0.898      (0.666, 1.210)   1.816
function and represents the hazard function for a subject having β′x = 0,
when such is possible. Per Cox,6 estimation of β through a partial likelihood
function does not depend on the function h0.
Suppose there are 10 subjects at risk of an event at time t (i.e., as time
approaches t), for some t > 0, having hazard rates hi(t) for i = 1, . . . , 10 that are
continuous at t. Given that an event occurred at time t for exactly 1 of the 10
subjects, the probability that subject j had the event is given by
10
h j (t) ∑ h (t)
i=1
i (13.5)
The probability would remain the same if each subject’s hazard rate was mul-
tiplied or divided by some positive constant c (e.g., divided by a baseline haz-
ard rate value denoted by h0(t)). The partial likelihood function for β is based
on the product of conditional probabilities like that in Expression 13.5.
Let xi denote the vector of explanatory variables for the ith subject, i = 1,
. . . , n. Suppose that among the n subjects, k subjects are each followed to an
event where their times to an event are different, whereas n–k subjects have
their times to an event censored. Let t(1) < t (2) < . . . < t(k) denote the ordered
times when events occurred. For i = 1, . . . , k define R(t(i)) as the set of indices
of those subjects at risk of an event as time t(i) approaches (i.e., consists of the
indices of subjects whose time to event, censored or uncensored, is at least
t(i)) and let x(i) denote the vector of explanatory variables for the subject that
had an event at time t(i). Then applying the multiplication rule of probabilities
to the k probabilities of the form of Expression 13.5 at the times of the events
leads to the partial likelihood for β of
L(β) = ∏ exp(β′x
i=1
(i) ) ∑ exp(β′x )
j∈R ( t( i ) )
j (13.6)
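In practice this partial likelihood is maximized by standard software. A minimal sketch with the third-party lifelines package (a recent version is assumed) and made-up data:

```python
import pandas as pd
from lifelines import CoxPHFitter

# 'time' is the observed (possibly censored) time, 'event' is 1 if the event
# was observed and 0 if censored, 'treat' is 1 for the experimental arm.
df = pd.DataFrame({
    "time":  [5.0, 8.2, 3.1, 9.4, 7.7, 2.9, 6.5, 4.4],
    "event": [1, 0, 1, 1, 0, 1, 1, 0],
    "treat": [1, 1, 1, 1, 0, 0, 0, 0],
})
cph = CoxPHFitter()
cph.fit(df, duration_col="time", event_col="event")
beta1 = cph.params_["treat"]              # estimated log-hazard ratio
print(beta1, cph.confidence_intervals_)   # Wald-type 95% CI for beta1
```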
The use of the partial likelihood assumes that the censoring mechanisms are independent of the actual times to the event. Formally, if the
censoring is not independent, using the partial likelihood as a basis for infer-
ence may not be justified. It would be particularly problematic if the amount
of informative censoring is substantial.
The censoring mechanism being “random” requires that when an indi-
vidual censored at an early time would have survived without an event to
some later time, t″, their hazard rate of an event at time t″ would be the same
as that of another subject, having the same set of values for the explanatory
variables, who survived to time t″ without having an event. In essence, cen-
soring and the process of achieving an event are determined by independent
mechanisms.
Let β1 denote the experimental versus control log-hazard ratio correspond-
ing to the model given in Expression 13.4. Then the experimental versus
control hazard ratio is θ = exp(β1). For a non-inferiority threshold θo ≥ 1, the
hypotheses are expressed as
$$H_o{:}\ \theta \ge \theta_o \quad \text{versus} \quad H_a{:}\ \theta < \theta_o \quad (13.7)$$
Let $\hat{\beta}_1$ denote the maximum likelihood Cox estimator (often referred to as a Wald's estimator) of β1 and let $se(\hat{\beta}_1)$ denote an estimate of its standard error.
We will elaborate on the form for the standard error later. An approximate
100(1 – α)% confidence interval for the experimental versus control hazard ratio, θ = exp(β1), is given by $\exp\big(\hat{\beta}_1 \pm z_{\alpha/2}\,se(\hat{\beta}_1)\big)$, where the asymptotic standard error of $\hat{\beta}_1$ can be expressed as
$$\left[n_E n_C \int_0^1 \big(n_C + n_E\theta u^{(\theta-1)/\theta}\big)^{-1}\,\mathrm{d}u\right]^{-1/2} \quad (13.9)$$
TABLE 13.3
Asymptotic Relative Efficiencies and Ratios of Asymptotic Standard Errors

Hazard Ratio or Its Reciprocal   Asymptotic Relative Efficiencya   Ratio of the Asymptotic Standard Errorsb
1.00                             1.0000                            1.0000
0.95                             0.9993                            0.9997
0.90                             0.9972                            0.9986
0.85                             0.9935                            0.9968
0.80                             0.9879                            0.9939
0.75                             0.9803                            0.9901
0.70                             0.9703                            0.9851
0.65                             0.9578                            0.9787
0.60                             0.9424                            0.9708
0.55                             0.9238                            0.9611
0.50                             0.9014                            0.9494
0.45                             0.8747                            0.9352
0.40                             0.8429                            0.9181
0.35                             0.8050                            0.8972
0.30                             0.7596                            0.8716
a Cox estimator to the maximum likelihood estimator.
b Maximum likelihood estimator to the Cox estimator.
For rE and rC equal to the observed numbers of events in the experimental and control arms, the quantity $\sqrt{1/r_E + 1/r_C}$ has been a useful estimate of the unrestricted standard error of the log-hazard ratio when determining confidence intervals. An approximate 100(1 – α)% confidence interval for the experimental versus control hazard ratio would then be given by
$$\exp\big(\ln\hat{\theta} \pm z_{\alpha/2}\sqrt{1/r_E + 1/r_C}\big) \quad (13.10)$$
It should be noted that the standard error provided from statistical packages
is determined under the null hypothesis of no difference (i.e., a hazard ratio
of 1). Frequently in practice, the standard error restricted to the hazard ratio
equal to 1 will be relatively close to the quantity provided in Expression
13.10 and other estimates of the standard error. Thus, using the standard
error from the statistical packages tends to lead to the same conclusion as
using some other estimate of the standard error (e.g., unrestricted version or
restricted to the non-inferiority null hypothesis). However, caution should
be taken in the choice of an estimate of the standard error of the log-hazard
ratio. We provide two examples (Examples 13.3 and 13.4) to illustrate the use
of Expressions 13.9 and 13.10.
Example 13.3
Consider a two-arm study of the experimental drug A and the active control drug
B where 400 subjects are evenly randomized between the two arms. Suppose all
400 subjects are followed to the event of interest (e.g., death). Consider testing Ho:
θ ≥ 1.25 versus Ha: θ < 1.25. The test statistic is
$$\frac{\ln\hat{\theta} - \ln 1.25}{s}$$
where $\hat{\theta}$ is the Wald's estimator from a Cox model with treatment as the sole explanatory variable. The value for s is selected as $(2/\sqrt{400})/0.9939 = 0.10061$,
where the value of 0.9939 comes from Table 13.3. For an observed hazard ratio
of 0.95, the value for the test statistic above is –2.7277, which corresponds to a
p-value of 0.003. For a one-sided significance level of 0.025, non-inferiority is
concluded (0.003 < 0.025).
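The arithmetic in Example 13.3 can be checked with a short script; a minimal sketch in Python (using scipy, with the hazard ratio, event count, and Table 13.3 adjustment taken from the example):

```python
import numpy as np
from scipy.stats import norm

# Example 13.3: Ho: theta >= 1.25 vs. Ha: theta < 1.25; 400 events, 1:1 randomization
theta_hat = 0.95                         # observed hazard ratio
s = (2/np.sqrt(400))/0.9939              # standard error of log-hazard ratio (Table 13.3)
z = (np.log(theta_hat) - np.log(1.25))/s
print(round(z, 4), round(norm.cdf(z), 4))   # -2.7277 and one-sided p-value 0.0032
```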
Example 13.4
Consider again testing Ho: θ ≥ 1.25 versus Ha: θ < 1.25 on the basis of a two-arm
study where 1000 subjects are evenly randomized to the experimental and con-
trol arms. At the time of analysis, there are 320 and 304 events in the experimen-
tal and control arms, respectively, with a corresponding estimate of the hazard
ratio of 1.10. Using Expression 13.10 gives an estimate of the standard error of √(1/320 + 1/304) = 0.0801, from which a corresponding 95% confidence interval of
0.940–1.287 is obtained. Since the upper limit of 1.287 is greater than 1.25, non-
inferiority cannot be concluded.
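Similarly, the confidence interval in Example 13.4 follows directly from Expression 13.10; a minimal sketch:

```python
import numpy as np
from scipy.stats import norm

theta_hat, rE, rC = 1.10, 320, 304
se = np.sqrt(1/rE + 1/rC)                                # Expression 13.10
ci = np.exp(np.log(theta_hat) + np.array([-1, 1])*norm.ppf(0.975)*se)
print(round(se, 4), ci.round(3))   # 0.0801 and 95% CI (0.940, 1.287); 1.287 > 1.25
```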
Standard Errors for the Effects of Binary Covariates. For a binary (0–1) variable,
the form for the asymptotic standard error for its corresponding regression
parameter (a log-hazard ratio) is similar to that of the treatment effect. One
difference is that the standard error for the treatment effect (the treatment
log-hazard ratio) can be fairly controlled by knowing ahead of time roughly
how many subjects will be in each arm and how many events will be needed
for the analysis. For a binary covariate, the number of subjects that will have
each value, 0 and 1, is random (not controlled). It is customary to condition on
the number of subjects that have each value of the binary covariate (and the
number of events observed at each level) when determining the correspond-
ing standard error for its log-hazard ratio estimator. It is that conditioning
that justifies using the same formulas for the standard error for the binary
covariate as for the treatment effect. For ri (i = 0, 1) equal to the observed num-
ber of events among subjects having value i, a useful estimate of the asymptotic standard error for the log-hazard ratio of the binary covariate is √(1/r0 + 1/r1).
As with the treatment effect, software packages restrict the standard error to
the null hypothesis that the true value of the log-hazard ratio for the binary
covariate equals zero. If the true log-hazard ratio is not far from zero, these two estimated standard errors should be approximately equal.
It is important to note that when a variable is prognostic, and is still prog-
nostic given the set of values of any other potential covariates, the propor-
tional hazards assumption cannot simultaneously hold for the model that
includes that variable as an explanatory variable and the model that omits it
as an explanatory variable.
P(X > Y). For an experimental versus control hazard ratio of θ, it can easily
be shown that the probability is 1/(1 + θ) that a random subject, X, given the
experimental therapy will have a longer time to the event than a random
subject, Y, given the control therapy. This probability that the random subject
in the experimental arm has an event after the random subject in the control
arm has an event remains constant even when both random subjects have
“survived” for t amount of time without having an event.
For a randomly paired design, when the hazard rates are proportional,
a confidence interval can be found for the hazard ratio by first finding a
confidence interval for 1/(1 + θ), the probability that a random subject given
the experimental therapy will have a longer time to the event than a random
subject given the control therapy, and then converting the interval into a con-
fidence interval for θ. This can be done by simultaneously following subjects
within their random pairs and conditioning on the number of pairs, n, where
at least one subject had an event. On this condition, the number of pairs
where the experimental subject had a longer time has a binomial distribu-
tion with parameters n and 1/(1 + θ).
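A minimal sketch of this paired approach, with hypothetical counts (n pairs with at least one event, of which m had the experimental subject surviving longer), using an exact Clopper–Pearson interval for 1/(1 + θ) and then inverting:

```python
from scipy.stats import beta

# Hypothetical paired data: n pairs with at least one event; in m of them the
# experimental subject had the longer time to the event
n, m, alpha = 200, 120, 0.05
# Exact (Clopper-Pearson) confidence interval for q = 1/(1 + theta)
q_lo = beta.ppf(alpha/2, m, n - m + 1)
q_hi = beta.ppf(1 - alpha/2, m + 1, n - m)
# q = 1/(1 + theta) is decreasing in theta, so the interval endpoints swap
theta_lo, theta_hi = (1 - q_hi)/q_hi, (1 - q_lo)/q_lo
print(round(theta_lo, 3), round(theta_hi, 3))
```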
Additional Design Considerations. The treatment parameter, β1, being esti-
mated by a Cox model is dependent on the covariates being included and
excluded in the model. Suppose that the baseline prognosis of subjects
improves as accrual continues (e.g., owing to differences in a known prog-
nostic factor), that at any time of accrual for a subject accrued at that time
the hazard rates for the theoretical distribution of the experimental arm are
proportional to the hazard rates of the control arm with a log-hazard ratio of
β1 ≠ 0, and that all subjects are followed to the events or the same data cutoff
date. Then the “log-hazard ratio” being estimated by a Cox analysis with
treatment as the sole explanatory variable is between zero and β1. For β1 to be
that value being estimated as the treatment effect by a Cox analysis, covari-
ates that collectively completely capture the baseline prognosis of subjects
need to be included in the model.
Owing to the covariates included or excluded in a Cox model, it is impor-
tant to realize that the active control versus placebo treatment parameter
may also be different across historical trials (and in the non-inferiority trial).
Not adjusting for influential covariates in the active control therapy versus
placebo trials will tend to “underestimate” the active control effect relative to
when there is adjustment for those covariates. Also, when influential covari-
ates are adjusted for, the treatment parameter being estimated will depend
on the prognosis of the subjects. Consideration should be given, in both the historical trials and the non-inferiority trial, to capturing and adjusting for important covariates.
Group sequential non-inferiority trials can be done on a hazard ratio with
a fixed non-inferiority margin.10 When the hazards are not proportional,
having multiple analyses at different study times may be acceptable for a
superiority trial. However, nonproportional hazards can be problematic
for a non-inferiority analysis based on a hazard ratio that uses the same
threshold or margin for both analyses. Since the follow-up or censoring
distribution will depend on the time of the analysis, when the hazards are
not proportional a different parameter or value (that is called the “hazard
ratio” or “average hazard ratio”) is being estimated at each analysis. If there
have also been nonproportional hazards when comparing the active control
with placebo, then the effect of the active control therapy versus placebo as
measured by a hazard ratio depends on the follow-up or censoring distribu-
tion. In the presence of nonproportional hazards, a non-inferiority criterion
should consider the amount of subject follow-up to ensure that the rejec-
tion of the null hypothesis will truly mean that the experimental therapy
is noninferior to the active control therapy. When the non-inferiority criterion is based on the hazard ratio, the required number of events for power 1 – β at a one-sided significance level of α/2 with a k:1 randomization is approximately

((zα/2 + zβ)/(β1,a − β1,o))² (1 + k)²/k    (13.11)
where β1,a is the assumed experimental versus control log-hazard ratio (or
selected alternative), and β1,o is the non-inferiority threshold for the log-
hazard ratio. When powering for a superiority claim, β1,o = 0. Expression
13.11 was provided by Fleming10 for a one-to-one randomization (k = 1).
After determining the required number of events, the sample size will further depend on the accrual pattern, the duration of follow-up, and the underlying time-to-event and censoring distributions, which together determine the probability that a randomly selected subject will have had an event by the time of the analysis.
Example 13.5 illustrates the determination of the sample sizes for a time-to-
event endpoint.
Example 13.5
For testing Ho: θ ≥ 1.15 versus Ha: θ < 1.15 with 90% power at an assumed hazard ratio of 0.95, a one-sided significance level of 0.025, and a 2:1 randomization (k = 2), the required number of events is

((zα/2 + zβ)/(β1,a − β1,o))² (1 + k)²/k = ((1.96 + 1.2816)/(ln 0.95 − ln 1.15))² (1 + 2)²/2 ≈ 1296 events.
In determining the appropriate sample size that achieves 1296 events by some
target time, we will assume that the subjects will be accrued over 24 months
in a uniform fashion and it is desired to have the analysis 12 months after the
end of accrual (at a study time of 36 months). A random accrual time will be
modeled as uniformly distributed over the first 24 months. For ease in deter-
mining the sample size, we will assume that the underlying distributions
for the experimental and control arms are exponential distributions with
respective medians of 10 and 9.5 months. Then, the probability that a random
subject will have had an event by the study time of 36 months is
∫₀²⁴ [1 − exp(−(36 − x)/(10/ln 2))] dx/24 = 0.788 and

∫₀²⁴ [1 − exp(−(36 − x)/(9.5/ln 2))] dx/24 = 0.803
for the experimental and control arms, respectively. Therefore, after 36 months,
we would expect (2 × 0.788 + 0.803)/3 = 0.793 of the subjects to have had events.
This leads to a sample size of 1296/0.793 ≈ 1635 subjects. This sample-size calcu-
lation provides the number of subjects needed so the expected number of events
at the study time of 36 months is the number required for stopping and perform-
ing the analysis. There will be some variability to the timing in study months that
the analysis is performed (1296 events are reached). This variability may also be
considered when designing the trial.
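The calculations in Example 13.5 can be reproduced numerically; a minimal sketch using scipy's quadrature (the design inputs are those stated in the example):

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

z_a, z_b, k = norm.ppf(0.975), norm.ppf(0.90), 2
beta_a, beta_o = np.log(0.95), np.log(1.15)
events = int(np.ceil(((z_a + z_b)/(beta_a - beta_o))**2 * (1 + k)**2/k))  # 13.11

def p_event(median):
    """P(event by month 36) with uniform accrual over 24 months and an
    exponential event-time distribution with the given median."""
    return quad(lambda x: 1 - np.exp(-(36 - x)/(median/np.log(2))), 0, 24)[0]/24

p_bar = (2*p_event(10) + p_event(9.5))/3
print(events, round(p_bar, 3), int(np.ceil(events/p_bar)))   # 1296, 0.793, 1635
```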
Suppose u1 < u2 < · · · < ur denote the distinct ordered times at which events were observed when combining the observations from both arms. Then, for j = 1, . . . , r, let Nj denote the number of subjects at risk of an event just before time uj and let vj = ∑_{i=1}^{j} 1/Ni. The resulting plot has for each arm the points (vj, ln(−ln Ŝ(uj))) plotted for j = 1, . . . , r. Other approaches that can be used
include having the jth gap length, vj − vj−1, equal to the reciprocal of the harmonic
mean of the number of subjects in each arm that are at risk of an event just
before the jth event (or corrected version should one arm have no subjects at
risk) or having vj = j (i.e., enumerating the event times on the x-axis). In all
these cases, the resulting plot is invariant under increasing transformations.
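A minimal sketch of the plotting coordinates described above, for one arm, with hypothetical event times, risk-set sizes, and Kaplan–Meier values:

```python
import numpy as np

def cloglog_points(n_at_risk, surv_est):
    """Points (v_j, ln(-ln S_hat(u_j))) with v_j = sum over i <= j of 1/N_i."""
    v = np.cumsum(1.0/np.asarray(n_at_risk, dtype=float))
    y = np.log(-np.log(np.asarray(surv_est, dtype=float)))
    return v, y

# Hypothetical inputs for one arm at its distinct event times u_1 < u_2 < u_3
N = [50, 44, 37]            # subjects at risk just before each event time
S = [0.98, 0.955, 0.93]     # Kaplan-Meier estimates at those times
print(cloglog_points(N, S))
```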
Time-Dependent Covariate Model. Likewise, tests of proportional hazards
involving time or log time as a time-dependent covariate are not invariant
when applying an increasing, continuous transformation to all the censored
and uncensored times. For example, the p-values for testing whether the
coefficient for the time-dependent covariate is zero will change after the
transformation h(x) = exp{x +exp{x}} is applied to all the observations, even
though the ordered arrangement of the observations remains the same.
Analogous to the rescaling that was described above for the graphical dis-
plays, the censored and uncensored times can be rescaled on the basis of
the sum of the reciprocal of the number of subjects at risk of an event. The
rescaled survival time of the jth event would be ∑_{i=1}^{j} 1/Ni. The censored
times would also be rescaled so as not to affect the overall ordered arrange-
ment of the observations.
As with other diagnostic plots used to evaluate an assumption of propor-
tional hazards, the plot of the ratio of the hazard rates will also overempha-
size the places where the estimated hazard rates are near zero (i.e., the time
intervals that are not influential in a comparison of the arms). In a published
study, the authors concluded from one such plot that the hazard ratio was not
constant, whereas from a quite different type of plot they concluded that an assumption of proportional cumulative odds may be appropriate. The conclusion is
unusual as the aspects of the estimated survival distribution were such that
similar conclusions should have been drawn on the proportionality of the cumulative hazards and the cumulative odds. In the example, the estimated survival
probabilities within both arms were greater than 90% over all studied time
points. In particular, −ln(Ŝi(t))/(1 − Ŝi(t)) ≈ 1 for all studied t and i = E, C, and ŜC(t)/ŜE(t) appears to only vary between 0.98 and 1 over the studied interval
of time. Different conclusions on proportionality were drawn because the
assessment of proportional hazards was based on the ratio of estimates of
the hazard rates (not the ratio of estimates of the cumulative hazards) where
outlying estimates of the ratio were observed when the survival curves were
nearly flat and the hazard rates were close to zero. The assessment of the
cumulative odds ratio was based on a plot of the cumulative odds ratio over
time, which did not vary significantly. A plot of the ratio of the cumulative
hazards would also not have significantly varied over time.
A confidence interval for SE(t*) – SC(t*) can be determined using the respec-
tive Kaplan–Meier estimates and Greenwood’s estimates of the correspond-
ing variance. When the lower limit of the confidence interval for SE(t*) – SC(t*)
is greater than –δ, non-inferiority is concluded.
Kaplan–Meier Estimation. In the absence of censoring, the determination of
the estimated survival function (i.e., the event-free probabilities) for a given
arm is straightforward. For t > 0, the estimated survival function is given by Ŝ(t) = the relative frequency of times in that arm that are greater than t. In the presence of censoring, the
most common estimate of the survival function is the Kaplan–Meier esti-
mate.19 As earlier, let t(1)< t(2) < . . . < t(k) denote the distinct ordered times
when events occurred, and for i = 1, . . . , k define R(t(i)) as the set of indices
of those subjects at risk of an event as time t(i) approaches (i.e., consists of the
indices of subjects whose time-to-event, censored or uncensored, is at least
t(i)). Let ni denote the size of R(t(i))—that is, the number of subjects at risk of an
event as time t(i) approaches—and let di denote the number of subjects that
had events at time t(i). For ease, we will define t(0) = 0. Then for i = 1, . . . , k, 1 –
di/ ni represents the relative frequency of subjects followed completely from
time t(i–1) to time t(i) that did not have an event, and represents an estimate of
the conditional probability that a subject will not have an event during the
interval from t(i–1) to t(i) given they have not had an event by time t(i–1). For
intermediate intervals, t(i–1) to t, where t(i–1) < t < t(i), the observed relative fre-
quency of subjects followed completely from time t(i–1) to time t that did not
have an event is 1. Thus, 1 is the estimate of the conditional probability that
a subject will not have an event during the interval from t(i–1) to t given they
have not had an event by time t(i–1). The Kaplan–Meier estimate of the sur-
vival function applies the multiplication rule to these estimated conditional
probabilities.
For t(i) ≤ t < t(i+1), i = 0, 1, . . . , k, the Kaplan–Meier estimate of the survival function is given by

Ŝ(t) = ∏_{j=1}^{i} (1 − dj/nj)

For t(i) ≤ t < t(i+1), i = 0, 1, . . . , k, Greenwood's formula provides an estimate of the variance for Ŝ(t) of

Var̂(Ŝ(t)) ≈ (Ŝ(t))² ∑_{j=1}^{i} dj/(nj(nj − dj))
An approximate 100(1 – α)% confidence interval for SE(t*) – SC(t*) is given by ŜE(t*) − ŜC(t*) ± zα/2 √(Var̂(ŜE(t*)) + Var̂(ŜC(t*))). When the lower limit of the confidence interval is greater than –δ, the null hypothesis in Expression 13.12 is rejected and non-inferiority is concluded.
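The landmark comparison above is easy to assemble from scratch; a minimal sketch with hypothetical data (times, 0/1 event indicators, landmark t*, and margin δ = 0.10):

```python
import numpy as np
from scipy.stats import norm

def km_greenwood(times, events, t_star):
    """Kaplan-Meier estimate S_hat(t*) and its Greenwood variance estimate."""
    times, events = np.asarray(times, float), np.asarray(events, int)
    s, gw = 1.0, 0.0
    for t in np.unique(times[events == 1]):
        if t > t_star:
            break
        n = np.sum(times >= t)                      # at risk just before t
        d = np.sum((times == t) & (events == 1))    # events at t
        s *= 1 - d/n
        gw += d/(n*(n - d))
    return s, s**2*gw

# Hypothetical data for each arm
sE, vE = km_greenwood([5, 7, 9, 12, 14, 20], [1, 0, 1, 1, 0, 0], t_star=12)
sC, vC = km_greenwood([4, 8, 10, 11, 15, 18], [1, 1, 0, 1, 0, 1], t_star=12)
lower = (sE - sC) - norm.ppf(0.975)*np.sqrt(vE + vC)
print(round(lower, 3), lower > -0.10)   # non-inferiority if lower limit > -delta
```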
As noted by Com-Nougue, Rodary, and Patte,8 under the assumption of proportional hazards, the non-inferiority margin for the landmark analysis
can be linked to a non-inferiority threshold based on a hazard ratio by θo =
ln(SC(t*) – δ)/ln(SC(t*)). This relation, along with a guess of the event-free probability at the landmark for the control arm, can guide in translating a historical problem where inference was based on a hazard ratio to a non-inferiority problem involving a difference in event-free probabilities, or vice versa. First, the historical control effect would be estimated using one of the metrics, and then an appropriate non-inferiority threshold or margin would be determined for that metric. Then,
with a guess of SC(t*), the above relation leads to a non-inferiority margin or
threshold for the other metric.
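This back-and-forth conversion is a one-liner in each direction; a minimal sketch (the values of SC(t*), δ, and θo below are illustrative only):

```python
import numpy as np

def hr_threshold_from_margin(s_C, delta):
    # theta_o = ln(S_C(t*) - delta)/ln(S_C(t*))
    return np.log(s_C - delta)/np.log(s_C)

def margin_from_hr_threshold(s_C, theta_o):
    # inverse relation under proportional hazards: S_E = S_C**theta_o
    return s_C - s_C**theta_o

print(round(hr_threshold_from_margin(0.70, 0.10), 3))   # 1.432
print(round(margin_from_hr_threshold(0.70, 1.25), 3))   # 0.060
```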
For inference on one median, Efron21 and Reid22 used bootstrap methods to
derive confidence intervals for the median. Such bootstrapping methods can
be easily applied to determine confidence intervals for the difference or ratio
of two medians. Alternatively, confidence sets for one median can be derived
by inverting tests similar to a sign test, as done by Brookmeyer and Crowley23
and Emerson.24 The confidence interval for one median consists of all values
t* for which a two-sided test of Ho: S(t*) = 0.5 fails to reject the null hypothesis.
The Brookmeyer and Crowley procedure23 uses the Kaplan–Meier estimated
survival probability at t*, Sˆ (t*), along with the corresponding Greenwood
estimated variance. This estimated variance changes as t* changes. As noted
by Wang and Hettmansperger,17 the confidence set derived from these meth-
ods need not be an interval. For the Brookmeyer and Crowley procedure,
this inadequacy can be alleviated by choosing the Greenwood estimated variance of Ŝ(x), where x is the observed median. In generalizing to two medi-
ans, this type of estimated variance can be used for both arms in a minimum
dispersion test statistic, as in Su and Wei’s study.18
General Procedures. For comparing two medians, we will first discuss two
procedures that are not based on the assumption that the two underlying
distributions are related by a scale factor.
Su and Wei18 derived confidence intervals for the difference and ratio of
two medians based on a quadratic test statistic similar to a minimum dis-
persion test statistic used by Basawa and Koul25 for continuous data. We will
present the test procedure for a ratio of medians.
Let X and Y denote the sample medians of the control and experimental arms, respectively. The observed sample medians x and y satisfy x = min{t: ŜC(t) ≤ 0.5} and y = min{t: ŜE(t) ≤ 0.5}. For testing Ho: Λ = Λo against Ha: Λ ≠ Λo (for some 0 < Λo ≤ 1), the test statistic is

G(Λo) = min_{t>0} [(ŜE(Λo t) − 0.5)²/σ̂E² + (ŜC(t) − 0.5)²/σ̂C²]

where σ̂E² and σ̂C² are the Greenwood's estimates of the variances of ŜE(y) and ŜC(x), respectively. From a simulation
study of Su and Wei,18 the upper percentiles of the distribution of G(Λo) when
Λ = Λo can be approximated by the upper percentiles of a χ² distribution with 1 degree of freedom. Let χ²_{1,α} denote the 100αth upper percentile of a χ² distribution with 1 degree of freedom. Then, an approximate 100(1 – α)% confidence interval for µE/µC consists of those positive values u so that G(u) < χ²_{1,α}.
If the lower bound of the approximate 100(1 – α)% confidence interval for µE/µC is greater than Λo = 1 – δ, then the null hypothesis in Expression 13.13 is rejected and non-inferiority is concluded. An approximate 100(1 – α)% confidence interval for the control median is given by x ± zα/2 σ̂C/mC(ε), where mE(ε) and mC(ε) denote estimates of the densities of the respective time-to-event distributions near their medians (ε denoting a bandwidth). A corresponding test statistic for testing against the threshold Λo is

Z* = (Y − ΛoX)/√(σ̂E²/mE²(ε) + Λo²σ̂C²/mC²(ε))
When Z* > zα/2, the null hypothesis in Expression 13.13 is rejected and non-
inferiority is concluded. A Fieller approach can also be used to determine an
approximate 100(1 – α)% confidence interval for µE/µC .
The remaining procedures that will be discussed are based on the overall
assumption that the two underlying distributions differ by a scale factor.
Adapting Standard Time-to-Event Tests. Let X1, X2, . . . , X nC and Y1, Y2, . . . , YnE
denote independent random samples from distributions having respective
distribution functions FC and FE. The X’s and the Y’s represent the actual,
uncensored times to the event of the control and experimental arms, respec-
tively. The underlying assumption is that the two distributions are related
through a scale factor Λ (i.e., FE(y) = FC(y/Λ) for all y and some Λ). For the
control arm, the independent censoring variables are denoted as A1, A2, . . . ,
AnC , which are assumed to be a random sample having common distribution
function H. For each subject in the control arm, the variable X i* = min{X i , Ai }
and the event status I (X i = X i* ) are observed. For the experimental arm, the
independent censoring variables are denoted as B1, B2, . . . , BnE , which are
assumed to be a random sample having a common distribution function K.
For each subject in the experimental arm, the variable Yi* = min{Yi , Bi } and
the event status I (Yi = Yi* ) are observed.
Confidence intervals for Λ can be obtained through a test statistic for time-
to-event endpoints by altering the values in one of the arms and then testing
for the equality of the underlying distributions. For any positive number c,
replace X1, X2, . . . , X nC with cX1, cX2, . . . , cX nC and replace A1, A2, . . . , AnC
with cA1, cA2, . . . , cAnC . For the observations in the control arm, the analysis
multiplies each observed censored or uncensored time-to-event by c with-
out changing the event status for those observations. If the null hypothesis
of equal medians (i.e., equal underlying distributions for Yi and cXi) is not
rejected at a two-sided significance level α, then Λo is in the (approximate)
100(1 – α)% confidence interval for Λ. If the lower bound of this approxi-
mate 100(1 – α)% confidence interval for the scale factor (i.e., also for µE/µC )
is greater than Λo = 1 – δ, then the null hypothesis in Expression 13.13 is
rejected and non-inferiority is concluded. This procedure for determining a
confidence interval for the scale factor or ratio of medians can be applied to
the log-rank test (where the parameter of interest tends to be a hazard ratio,
not a scale factor) or any Wilcoxon-like test. This procedure for obtaining a
confidence interval for a scale factor is analogous to manipulating the Mann–
Whitney–Wilcoxon procedure for deriving a confidence interval for the shift
in the distributions (i.e., the difference in medians) provided in Section 12.4.
There may, however, be some crudeness to this procedure. For the event status to remain the same, the corresponding censoring
variables would also need to be multiplied by c. In a properly conducted clin-
ical trial, the censoring distributions should be the same across arms. When
assumptions that are made on the underlying distribution do not hold, com-
parisons involving quite different censoring distributions can be difficult to
interpret when there is a moderate or large amount of censoring.
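A grid-search version of this test-inversion procedure is straightforward; the sketch below assumes the lifelines package for the log-rank test and uses simulated data purely for illustration:

```python
import numpy as np
from lifelines.statistics import logrank_test

def scale_factor_ci(x_times, x_events, y_times, y_events, alpha=0.05):
    """All c for which the log-rank test of Y vs. c*X (event statuses
    unchanged) is not rejected at level alpha."""
    kept = [c for c in np.linspace(0.5, 2.0, 151)
            if logrank_test(np.asarray(x_times)*c, y_times,
                            event_observed_A=x_events,
                            event_observed_B=y_events).p_value > alpha]
    return (min(kept), max(kept)) if kept else None

# Simulated censored samples: control (x) and experimental (y)
rng = np.random.default_rng(1)
x, y = rng.exponential(10, 80), rng.exponential(11, 80)
cx, cy = rng.uniform(5, 30, 80), rng.uniform(5, 30, 80)
print(scale_factor_ci(np.minimum(x, cx), (x <= cx).astype(int),
                      np.minimum(y, cy), (y <= cy).astype(int)))
```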
Two Confidence Interval Procedures. For time-to-event data in the presence of
censoring, the use and properties of confidence intervals for the difference in
medians where the limits are differences of the confidence limits of the indi-
vidual confidence intervals for the medians were investigated by Wang and
Hettmansperger.17 Several cases are considered, including the case where the
two underlying time-to-event distributions are assumed to have the same
shape. The results are fairly analogous to those by Hettmansperger16 for
determining confidence intervals for the difference in medians for continu-
ous data, which are summarized in Section 12.4.
It is, however, unlikely that two time-to-event distributions differ by a
shift. If the two underlying distributions differ by a scale factor, which is
often assumed when comparing time-to-event distributions, then the dis-
tributions for the log times will have the same shape (i.e., differ by a shift).
The results of Wang and Hettmansperger17 can be applied to testing the ratio
of the underlying medians when assuming equal shapes for the distribu-
tion of the log times. For ease in both presentation and in comparing the
results to those of Hettmansperger16 in Section 12.4, the results of Wang and
Hettmansperger17 will be presented for a difference in medians for the log
times. The medians for the log times are denoted by µlog,E and µlog,C for the experimental and control arms, respectively.
As in Section 12.4, the 100(1 – α)% confidence interval for the difference in
medians of the log times has the form (L, U) = (LE – UC, UE – LC), where
(LE, UE) is a 100(1 – α E)% confidence interval for the median log time of the
experimental arm and (LC, UC) is a 100(1 – α C)% confidence interval for the
median log time of the control arm. The confidence coefficients for the indi-
vidual confidence intervals are selected so that when those two intervals are
disjoint, Ho: Δ = 0 is rejected at a significance level of α in favor of the two-
sided alternative Ha: Δ ≠ 0. The null hypothesis in Expression 13.14 is rejected
at a significance level of α/2, and non-inferiority is concluded if L > – δ.
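In practice the procedure reduces to simple arithmetic on the individual interval limits; a minimal sketch with hypothetical limits and a ratio margin of 0.8 translated to the log scale:

```python
import numpy as np

# Hypothetical individual confidence intervals for the median log times
LE, UE = np.log(9.0), np.log(13.0)    # experimental arm
LC, UC = np.log(8.5), np.log(12.0)    # control arm
L, U = LE - UC, UE - LC               # interval for the difference in medians
delta = -np.log(0.8)                  # log-scale margin from a ratio margin of 0.8
print(round(L, 3), round(U, 3), L > -delta)   # non-inferiority if L > -delta
```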
The previous notation for the time-to-event and censoring variables will
apply here to the log times. Let X1, X2, . . . , X nC and Y1, Y2, . . . , YnE denote
independent random samples from distributions having respective distribu-
tion functions FC and FE. The X’s and the Y’s represent the actual, uncensored
log times to the event of the control and experimental arms, respectively.
For the control arm, the independent censoring variables for the log times
are denoted as A1, A2, . . . , AnC (i.e., exp(A1), . . . , exp(AnC ) are the censor-
ing variables for exp(X1), exp(X2), . . . , exp(X nC ) ), which are assumed to be
a random sample having common distribution function H. For each sub-
ject in the control arm, the variable X i* = min{X i , Ai } and the event status
I (X i = X i* ) are observed. The common distribution function for X i* is given
by FC* (t) = 1 − (1 − FC (t))(1 − H (t)) . For the experimental arm, the independent
censoring variables for the log times are denoted as B1, B2, . . . , BnE , which
are assumed to be a random sample having common distribution function
K. For each subject in the experimental arm, the variable Yi* = min{Yi , Bi } and
the event status I (Yi = Yi* ) are observed. The common distribution function
for Yi* is given by FE* (t) = 1 − (1 − FE (t))(1 − K (t)).
The left continuous inverse of a distribution F is defined by F⁻¹(p) = inf{t: F(t) ≥ p} for 0 < p < 1. Let F̂C and F̂E denote the Kaplan–Meier estimates of FC and FE, respectively, and let GE*(t) = P(Yi* ≤ t, Yi* = Yi). Per Wang and Hettmansperger,17 τE and τC denote the associated asymptotic variance parameters for the sample medians, and the multipliers ZE and ZC are chosen to satisfy

√λ ZE + √(1 − λ) ZC ≈ zα/2 √(λτE + (1 − λ)τC)    (13.15)

In addition, the asymptotic width of the confidence interval does not depend on the choice of ZE and ZC that satisfies Equation 13.15. From Theorem 3 of Wang and Hettmansperger,17

ZE = √(τE/τC) ZC = zα/2 √(τE(λτE + (1 − λ)τC))/(λτE + (1 − λ)τC)
The authors also provided formulas for the multipliers in the equal-depth
case where dE = dC.17
For the equal confidence coefficient procedure, the common confidence
coefficient ranged from 0.83 to 0.88 in the cases studied,17 where the alloca-
tion ratios ranged from 1 to 3 and various relative frequencies of censoring
were assumed. We refer the reader to Wang and Hettmansperger’s paper17
for analogously determined confidence coefficients for the equal-length and equal-depth procedures.
Additionally, Wang and Hettmansperger17 modified the two confidence interval procedures for the equal-shape case to obtain related procedures.
For the cumulative odds ratio, ψ(t) = [FE(t)/(1 – FE(t))]/[FC(t)/(1 – FC(t))], the
hypotheses are expressed as
Ho: ψ(t) ≥ ψo for some t ∈ [τ0, τ1] and Ha: ψ(t) < ψo for all t ∈ [τ0, τ1]    (13.17)

Ho: sup_{t∈[τ0,τ1]} [SC(t) − SE(t)] ≥ δ and Ha: sup_{t∈[τ0,τ1]} [SC(t) − SE(t)] < δ    (13.18)
The hypotheses involving the cumulative odds ratio, ratio of the cumula-
tive hazards, ratio of event-free probabilities, and relative risk of an event,
would involve the following supremums being compared to the appropriate
non-inferiority margin or threshold: sup_{t∈[τ0,τ1]} ψ(t), sup_{t∈[τ0,τ1]} [ln SE(t)/ln SC(t)], sup_{t∈[τ0,τ1]} [SC(t)/SE(t)], and sup_{t∈[τ0,τ1]} [(1 − SE(t))/(1 − SC(t))], respectively. Freitag, Lange, and Munk27 used a hybrid bootstrap-based procedure based on that used by Shao and Tu28 to construct a confidence interval for sup_{t∈[τ0,τ1]} [SC(t) − SE(t)]
for testing the hypotheses in Expression 13.18. When the upper limit of the
confidence interval is less than the non-inferiority margin/threshold, the
null hypothesis is rejected and non-inferiority is concluded. This procedure
maintains the desired type I error rate and the supremum approach has more
power than the pointwise approach.
Comparing the event-free probabilities over an interval makes use of more
information in the data than a landmark analysis. As with landmark analy-
ses, it is not necessary to assume proportional hazards, the existence of a
scale factor, or proportional cumulative odds. When such an assumption
holds (or approximately holds), it is more efficient to base the inference on a
procedure that is designed for such an assumption than to restrict the infer-
ence to some prespecified interval.
The selected non-inferiority margin or threshold here may represent the
maximal allowed difference across [τ0,τ1], which may be a larger allowed
difference than for a landmark analysis at a specific time. When the same
margin or threshold is used for the non-inferiority analysis over [τ0,τ1] as
for a non-inferiority landmark analysis at landmark t* ∈ [τ0,τ1], rejecting the
null hypothesis in Expressions 13.16 or 13.18 for the interval analysis implies
that the null hypothesis in Expression 13.12 is rejected for the landmark
analysis.
Besides the need to choose a larger margin than for a landmark analysis, it can be very tricky to use the historical results to determine the effect of
the control therapy. The specific information on the estimates of the event-
free probabilities over [τ0,τ1] and their corresponding standard errors for the
control therapy and the placebo may not be readily available from some or
all of the historical trials. If such information is not readily available but it is still desired to base the non-inferiority inference on [τ0,τ1], it is likely that
the non-inferiority margin would be conservatively chosen.
It may also be difficult (as with landmark analyses) to determine how to incorporate differences in estimated SC(t) across trials, how to choose the appro-
priate/best interval [τ0,τ1] to consider and to set the non-inferiority margin/
threshold. There are analogous concerns and issues when the non-inferiority
inference over [τ0,τ1] is based on the cumulative odds ratio, ratio of the cumu-
lative hazards, ratio of event-free probabilities, or relative risk of an event.
When the assumptions do not hold for proportional hazards, the existence
of a scale factor, or a constant cumulative odds ratio, the corresponding sample
estimator unbiasedly estimates some quantity that depends on the amount
of follow-up (i.e., the censoring distributions) for that trial. Thus, the under-
lying time-to-event distributions for the control therapy and the placebo can
remain constant across trials (across historical trials and the non-inferiority
trial), thereby having a constant true effect of the control therapy across tri-
als; however, because the assumption relating the underlying distributions is
not true (e.g., the hazards are not proportional) and the amount of follow-up
differs across trials, the value the selected estimator (e.g., the hazard ratio
estimator) is unbiasedly estimating varies across the trials. Landmark analy-
ses and analyses over an interval would not be affected by differences across
trials in the amount of follow-up, although they have their own issues.
Horizontal Differences in the Survival Functions. The difference in medians is
one specific horizontal difference in the experimental and control survival
functions (i.e., SE⁻¹(0.5) − SC⁻¹(0.5)). For continuous time-to-event distributions, the difference in the means is the average of the horizontal differences in the experimental and control survival functions over all percentiles (i.e., the average of SE⁻¹(p) − SC⁻¹(p) for 0 < p < 1), or simply the mean difference in percentiles. Thus, the difference in means (when the means exist) is given by

µE − µC = ∫₀¹ (SE⁻¹(y) − SC⁻¹(y)) dy    (13.19)
Graphically, the difference in means is equal to the area between the two survival functions, which is usually represented by

µE − µC = ∫₀^∞ (SE(x) − SC(x)) dx    (13.20)
When the means exist, the expression in Equation 13.19 can be extended to continuous real-valued distributions as µE − µC = ∫₀¹ (FE⁻¹(y) − FC⁻¹(y)) dy. The assumption of continuous distributions is necessary for Equation 13.19 to hold. Therefore, care is needed when applying Equation 13.19 with Kaplan–Meier estimates of SE and SC, which are discrete.
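Under a restriction to a finite horizon (needed because the Kaplan–Meier curves are step functions that may not reach zero), the area interpretation in Equation 13.20 can be computed directly; a minimal sketch with hypothetical Kaplan–Meier summaries:

```python
import numpy as np

def restricted_mean(jump_times, surv_after_jump, horizon):
    """Area under a Kaplan-Meier step function up to `horizon`."""
    t = np.clip(np.concatenate(([0.0], np.asarray(jump_times, float), [horizon])),
                0.0, horizon)
    s = np.concatenate(([1.0], np.asarray(surv_after_jump, float)))
    return float(np.sum(s*np.diff(t)))

# Hypothetical jump times and Kaplan-Meier values for each arm
mE = restricted_mean([3, 6, 10], [0.9, 0.7, 0.5], horizon=12)
mC = restricted_mean([2, 5, 9], [0.85, 0.65, 0.45], horizon=12)
print(mE, mC, mE - mC)   # difference = area between the curves up to the horizon
```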
References
1. Chi, G.Y.H., Some issues with composite endpoints in clinical trials, Fund. Clin.
Pharm., 19, 609–619, 2005.
2. DeMets, D.L. and Califf, R.M., Lessons learned from recent cardiovascular clini-
cal trials: Part I, Circulation, 106, 746–751, 2002.
3. Montori, V.M. et al., Validity of composite endpoints in clinical trials, Br. Med. J.,
330, 594–596, 2005.
4. Kleist, P., Composite endpoints: Proceed with caution, Appl. Clin. Trial, May 1,
2006, at http://appliedclinicaltrialsonline.findpharma.com/appliedclinicaltrials/
article/articleDetail.jsp?id=324331.
5. Cox, D.R., Some simple approximate tests for Poisson variates, Biometrika, 40,
354–360, 1953.
6. Cox, D.R., Regression models and life tables, J. R. Stat. Soc., 34, 187–220, 1972.
7. Cox, D.R., Partial likelihood, Biometrika, 62, 269–276, 1975.
8. Com-Nougue, C., Rodary, C., and Patte, C., How to establish equivalence when
data are censored: A randomized trial of treatments for B non-Hodgkin lym-
phoma, Stat. Med., 12, 1353–1364, 1993.
9. Efron, B., Efficiency of Cox’s likelihood function for censored data, J. Am. Stat.
Assoc., 72, 557–565, 1977.
10. Fleming, T.R., Evaluation of active control trials in AIDS, J. Acq. Immun. Def.
Synd., 2, S82–S87, 1990.
11. Kalbfleisch, J.D. and Prentice, R.L., Estimation of the average hazard ratio,
Biometrika, 68, 105–112, 1981.
12. Fleming, T.R. and Harrington, D., Counting Processes and Survival Analysis, Wiley,
Chichester, 1991.
13. Crisp, A. and Curtis, P., Sample size estimation for non-inferiority trials of time-
to-event data, Pharm. Stat., 7, 236–244, 2008.
14. Cox, D.R., A note on the graphical analysis of survival data, Biometrika, 66, 188–
190, 1979.
15. Nelson, W., Theory and application of hazard plotting for censored failure data,
Technometrics, 14, 945–966, 1972.
16. Hettmansperger, T.P., Two-sample inference based on one-sample sign statis-
tics, J. R. Stat. Soc. C Appl., 33, 45–51, 1984.
17. Wang, J.-L. and Hettmansperger, T.P., Two-sample inference for median sur-
vival times based on one-sample procedures for censored survival data, J. Am.
Stat. Assoc., 85, 529–536, 1990.
18. Su, J.Q. and Wei, L.J., Nonparametric estimation for the difference or ratio of
median failure times, Biometrics, 49, 603–607, 1993.
19. Kaplan, E.L. and Meier, P., Nonparametric estimation from incomplete observa-
tions, J. Am. Stat. Assoc., 53, 457–481, 1958.
20. Thomas, D.R. and Grunkemeier, G.L., Confidence interval estimation of sur-
vival probabilities for censored data, J. Am. Stat. Assoc., 70, 865–871, 1975.
21. Efron, B., Censored data and the bootstrap, J. Am. Stat. Assoc., 76, 312–319,
1981.
22. Reid, N., Estimating the median survival time, Biometrika, 68, 601–608, 1981.
23. Brookmeyer, R. and Crowley, J., A confidence interval for the median survival
time, Biometrics, 38, 29–41, 1982.
24. Emerson, J.D., Nonparametric confidence intervals for the median in the pres-
ence of right censoring, Biometrics, 38, 17–27, 1982.
25. Basawa, I.V. and Koul, H.L., Large-sample statistics based on quadratic disper-
sion, Int. Stat. Rev., 56, 199–219, 1988.
26. Wei, L.J. and Gail, M.H., Nonparametric estimation for a scale-change with cen-
sored observations, J. Am. Stat. Assoc., 78, 382–388, 1983.
27. Freitag, G., Lange, S. and Munk, A., Non-parametric assessment of non-
inferiority with censored data, Stat. Med., 25, 1201–1217, 2006.
28. Shao, J. and Tu, D., The Jackknife and Bootstrap, Springer, New York, NY, 1995.
A.1.1 p-Values
A p-value is the probability of obtaining results as extreme or more extreme
(against the null hypothesis) than the observed results, where the probabil-
ity is determined under the assumption that the null hypothesis is true. For
most cases, when the null hypothesis is true, the p-value is a completely ran-
dom value between 0 and 1, its statistical distribution being a uniform distri-
bution over (0,1). As commonly applied, the null hypothesis is rejected if and
only if the p-value is less than or equal to the significance level. Hence, the
p-value can be regarded as the smallest significance level for which the null
hypothesis is rejected.
A p-value measures the strength of evidence against the null hypothesis
in the direction or directions of the alternative hypothesis. The smaller a
p-value, the stronger is the evidence against the null hypothesis, in favor of the
alternative hypothesis. A large p-value would correspond to little evidence
against the null hypothesis. Little or no evidence against the null hypothesis
does not mean that there is great evidence for the null hypothesis.
Examples A.1 through A.3 illustrate some properties of p-values. These
examples involve dichotomous data (coin tosses), continuous data (hemoglo-
bin levels), and time-to-event data (for an undesirable event).
Example A.1
We will consider a simple experiment of tossing a coin 10 times. Let p denote the
probability that any given toss results as a head. The coin is fair if p = 0.5. For the
null hypothesis of p = 0.5, there are three realistic possibilities for the alternative
hypothesis: p < 0.5, p > 0.5, and p ≠ 0.5. Suppose eight of these tosses result in a
head. Table A.1 summarizes the p-value in each of three cases. This example helps
illustrate the differences among the three cases in the directions of the strength of evidence.

TABLE A.1
Summary of p-Values for Three Cases Involving Dichotomous Data

Case | Null Hypothesis | Alternative Hypothesis | Result of the Experiment | As Strong or Stronger Evidence in Favor of Ha | p-Value^a
1 | Ho: p = 0.5 | Ha: p < 0.5 | 8 heads in 10 tosses | 8 or fewer heads in 10 tosses | 0.989
2 | Ho: p = 0.5 | Ha: p > 0.5 | 8 heads in 10 tosses | 8 or more heads in 10 tosses | 0.055
3 | Ho: p = 0.5 | Ha: p ≠ 0.5 (p < 0.5 or p > 0.5) | 8 heads in 10 tosses | 8 or more heads in 10 tosses, or 2 or fewer heads in 10 tosses | 0.109
a p-Values as fractions are 1013/1024, 56/1024, and 112/1024, respectively.

In case 1, the smaller the number of heads, the stronger is the evidence
against the null hypothesis in favor of the alternative hypothesis. In case 2, the
larger the number of heads, the stronger is the evidence against the null hypothesis
in favor of the alternative hypothesis. In case 3, the further the number of heads
is from five (50% of the number of tosses), the stronger the evidence against the
null hypothesis in favor of the alternative hypothesis. Note also that the p-value
in case 3 is double the p-value in case 2 (double the smaller of the p-values in
cases 1 and 2). In case 3, two or fewer heads among 10 tosses provide as strong
or stronger evidence against p = 0.5 in favor of p < 0.5 as the strength of eight
heads among 10 tosses provides against p = 0.5 in favor of p > 0.5. If p = 0.5, for
10 tosses, the probability of getting two or fewer heads equals the probability of
getting eight or more heads.
In this coin-tossing example, the test is based on the number of heads in 10
tosses, which is referred to as the test statistic. The p-values in cases 1 and 2 are
referred to as one-sided p-values since the respective alternative hypotheses are
one-sided. Likewise, since the alternative hypothesis in case 3 is two-sided, the
respective p-value is referred as a two-sided p-value. Note that, here, the sum of
the one-sided p-values equals 1 plus the probability of getting the observed num-
ber of heads if the null hypothesis is true. Whenever the test statistic has a discrete
distribution, the sum of the one-sided p-values will equal 1 plus the probability of
getting the observed value of the test statistic if the null hypothesis is true.
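The three p-values in Table A.1 can be verified directly from the binomial distribution; a minimal sketch:

```python
from scipy.stats import binom

n, x = 10, 8                          # 8 heads in 10 tosses; Ho: p = 0.5
p1 = binom.cdf(x, n, 0.5)             # Ha: p < 0.5: P(X <= 8) = 1013/1024
p2 = binom.sf(x - 1, n, 0.5)          # Ha: p > 0.5: P(X >= 8) = 56/1024
p3 = p2 + binom.cdf(n - x, n, 0.5)    # Ha: p != 0.5: add P(X <= 2) = 112/1024
print(round(p1, 3), round(p2, 3), round(p3, 3))   # 0.989, 0.055, 0.109
```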
Example A.2
TABLE A.2
Summary of p-Values for Three Cases Involving Continuous Data

Case | Null Hypothesis | Alternative Hypothesis | Result of the Experiment | As Strong or Stronger Evidence in Favor of Ha | p-Value
1 | Ho: μ = 11 | Ha: μ < 11 | Sample mean from 4 patients is 10.5 | Sample mean from 4 patients is 10.5 or less | 0.106
2 | Ho: μ = 11 | Ha: μ > 11 | Sample mean from 4 patients is 10.5 | Sample mean from 4 patients is 10.5 or more | 0.894
3 | Ho: μ = 11 | Ha: μ ≠ 11 (μ < 11 or μ > 11) | Sample mean from 4 patients is 10.5 | Sample mean from 4 patients is either 10.5 or less, or 11.5 or more | 0.211
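The p-values in Table A.2 are consistent with a known standard deviation of 0.8 (so a standard error of 0.4 for the mean of 4 patients); under that assumption, a minimal sketch:

```python
from scipy.stats import norm

mu0, xbar, se = 11, 10.5, 0.8/2       # assumed sigma = 0.8, n = 4 patients
z = (xbar - mu0)/se                   # -1.25
print(round(norm.cdf(z), 3),          # Ha: mu < 11 -> 0.106
      round(norm.sf(z), 3),           # Ha: mu > 11 -> 0.894
      round(2*norm.cdf(z), 3))        # Ha: mu != 11 -> 0.211
```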
Example A.3 will compare and contrast the calculation of a p-value for each
of four types of comparisons. For this example, an equivalence comparison
will be evaluated in Section A.3.
Example A.3
Let θ denote the true experimental arm versus control arm hazard ratio of some
undesirable event (e.g., death or disease progression). For an observed hazard
ratio of 0.91 based on 400 events in a clinical trial that had a one-to-one ran-
domization, Table A.3 summarizes the p-value for each comparison type. For the
non-inferiority comparison, a hazard ratio threshold of 1.1 is used.
For the inferiority, superiority, and difference comparisons, the orderings of the
strength of evidence against the null hypothesis (in favor of the alternative hypoth-
esis) are analogous to cases 1, 2, and 3 in each of Tables A.1 and A.2.
Note that the order of the strength of evidence is the same for a superiority compari-
son as with a non-inferiority comparison. For each of these comparisons, the smaller
the observed hazard ratio, the more favorable is the result for the experimental arm.
For these two comparisons, it is the same event (observing a hazard ratio of 0.91 or
less) whose probability is the p-value. The p-values are different because the prob-
abilities are calculated under different assumptions of the truth (θ = 1 and θ = 1.1). In
fact, because the “bar is lower” for a non-inferiority comparison than for a superiority
comparison between the same two treatment arms, the p-value for the non-inferiority
comparison will always be smaller than the p-value for a superiority comparison.
Note that had the observed hazard ratio equaled 1, the p-values for an inferior-
ity comparison and for a superiority comparison would be equal (both p-values
equaling 0.5). In this example, the p-values for an inferiority comparison and for
a non-inferiority comparison would be equal (both p-values approximately 0.317)
if the observed hazard ratio were the square root of 1.1 (the geometric mean of
1 and 1.1). When these p-values are equal, the strength of evidence in favor of
inferiority equals the strength of evidence in favor of non-inferiority.
There is fairly suggestive but not compelling evidence that the experimental arm is
noninferior to the control arm with respect to the time-to-event endpoint. For any
significance level less than 0.10, whether a one-sided or two-sided significance level,
there is not strong enough evidence that the experimental arm is inferior, superior,
or different from the control arm with respect to the time-to-event endpoint.
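Under the normal approximation for the log-hazard ratio (standard error about 2 divided by the square root of the number of events for a one-to-one randomization), the four p-values discussed above can be sketched as follows:

```python
import numpy as np
from scipy.stats import norm

theta_hat, events = 0.91, 400
se = 2/np.sqrt(events)                        # SE of the log-hazard ratio
z1 = np.log(theta_hat)/se                     # compared against theta = 1
zN = (np.log(theta_hat) - np.log(1.1))/se     # compared against theta = 1.1
print(round(norm.cdf(z1), 3),    # superiority p-value, ~0.173
      round(norm.sf(z1), 3),     # inferiority p-value, ~0.827
      round(2*norm.cdf(z1), 3),  # difference (two-sided) p-value, ~0.346
      round(norm.cdf(zN), 3))    # non-inferiority p-value, ~0.029
```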
When the correct value for a parameter or effect is the hypothesized value
in the null hypothesis and the test statistic has a continuous distribution, the
p-value is a random value between 0 and 1, its statistical distribution being
a uniform distribution over (0,1). For most hypothesis-testing scenarios in practice, when the correct value lies in the alternative hypothesis, the distribution of the p-value is stochastically smaller. Several authors have examined the distri-
bution of the p-value when the alternative hypothesis is true. Dempster and
Schatzoff1 and Schatzoff2 investigated the stochastic nature of the p-value
and evaluated test procedures based on the expected (mean) p-value at a
given alternative. Hung et al.3 determined, for a fixed significance level and
a fixed difference between the true value and the hypothesized value in the
null hypothesis, that as the sample size increases, the mean, variance, and
percentiles for the distribution of the p-value decrease toward zero. They
also examined the distribution of the p-value in certain cases when the effect
size is a random variable. Sackrowitz and Samuel-Cahn4 extended the work
of Dempster and Schatzoff,1 and also related the expected p-value to the sig-
nificance level and power. Joiner5 introduced the median significance level
and the “significance level of the average” (the significance level that corre-
sponds with the mean value of the test statistic) as measures of test efficiency.
Bhattacharya and Habtzghi6 also used the median p-value to evaluate the
performance of a test. Below, we provide our own analogous derivation for
the distribution of the p-value.
In general, the distribution of the p-value depends on the sample size and
the true value of the parameter (or alternatively on the significance level and
the true power of the test). For test statistics that are normally distributed, the
distribution of the p-value depends on the number of standard errors separating the true value and the hypothesized value in the null hypothesis. For
a random sample of size n from a normal distribution with mean μa and stan-
dard deviation σ, we will see that the distribution of the p-value for testing
the null hypothesis μ = μo against the alternative hypothesis μ < μo depends on the value of (μa – μo)/(σ/√n). We can replace μ, σ, and n by the equivalent quantities when comparing two means or when using a log-hazard ratio to compare
two time-to-event distributions. For test statistics that are modeled as hav-
ing a normal distribution, when the power is 1 – β with a one-sided signifi-
cance level of α, the number of standard errors separating the true value and the hypothesized value in the null hypothesis reduces to zα + zβ.
Suppose we are testing Ho: θ = θo versus Ha: θ < θo on the basis of an es
timate θ̂ , where the true value is θa and (θ̂ − θ a )/σ ′ is modeled as having
a standard normal distribution. The test statistic is (θ̂ − θ o )/σ ′ , and thus
the p-value is the observed value of Φ((θ̂ − θo)/σ′). Let G denote the distribution function for the p-value. Then for 0 < w < 1,

G(w) = P(Φ((θ̂ − θo)/σ′) ≤ w) = P((θ̂ − θa)/σ′ ≤ Φ⁻¹(w) + (θo − θa)/σ′) = Φ(Φ⁻¹(w) + (zα + zβ))
where α is the significance level and 1 – β is the power when the true value of θ is θa. For 0 < y < 1, the quantile function is given by G⁻¹(y) = Φ(Φ⁻¹(y) − (zα + zβ)).
Since Φ–1(p) = z1–p for 0 < p < 1, the 100p-th percentile of the distribution of
the p-value is given by Φ(z1–p – (zα + zβ)). Note for any significance level α, the
100(1 – α)-th percentile for the p-value is β (i.e., 1 minus the power). Also, the
100(1 – β)-th percentile for the p-value is α. For 0 < w < 1, the density function for the p-value is given by g(w) = φ(Φ⁻¹(w) + (zα + zβ))/φ(Φ⁻¹(w)), where φ denotes the standard normal density function.
We note that it can easily be shown that the distribution of the p-value
becomes larger with respect to a likelihood ratio ordering as zα + zβ becomes
smaller. In particular, for a fixed significance level, α, the distribution of the
p-value becomes smaller with respect to a likelihood ratio ordering when
the power increases (which can occur by either increasing the sample size or
choosing a more favorable alternative). Thus, for a fixed sample size, when comparing two alternatives, the relative likelihood increases in favor of the more favorable alternative as the observed p-value becomes smaller.
Note also that the test statistic has a normal distribution with mean
– (zα + zβ) and variance 1.
For cases where the test statistic is normally distributed, Table A.4 pro-
vides the median, 5th percentile, and 95th percentile for the distribution of
the p-value for various combinations for the significance level and power.
Whenever the power at the true effect size is 80% or greater, the median
p-value is very small and relatively much smaller than the significance level.
If a clinical trial is adequately powered at the actual effect size, the p-value
will typically be very small. An observed p-value that is microscopic (e.g.,
smaller than 10 –8 if the significance level is 0.005) would tend to be indica-
tive of an overpowered study—that is, the study would have had near 100%
power at the true effect size.
TABLE A.4
Median and Percentiles for the Distribution of the p-Value Based on Significance Level and Power

Significance Level^a | Power | Median | 5th Percentile | 95th Percentile
0.05 | 0.05 | 0.5 | 0.05 | 0.95
0.05 | 0.5 | 0.05 | 0.0005 | 0.5
0.05 | 0.8 | 0.0064 | 0.00002 | 0.2
0.05 | 0.9 | 0.0017 | 0.000002 | 0.1
0.025 | 0.025 | 0.5 | 0.05 | 0.95
0.025 | 0.5 | 0.025 | 0.0002 | 0.376
0.025 | 0.8 | 0.0025 | 0.000004 | 0.124
0.025 | 0.9 | 0.0006 | 0.0000005 | 0.055
0.01 | 0.01 | 0.5 | 0.05 | 0.95
0.01 | 0.5 | 0.01 | 0.00004 | 0.248
0.01 | 0.8 | 0.0008 | 0.0000007 | 0.064
0.01 | 0.9 | 0.0002 | 0.00000007 | 0.025
0.005 | 0.005 | 0.5 | 0.05 | 0.95
0.005 | 0.5 | 0.005 | 0.00001 | 0.176
0.005 | 0.8 | 0.0003 | 0.0000002 | 0.038
0.005 | 0.9 | 0.00006 | 0.00000002 | 0.013
a One-sided significance level.
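The entries of Table A.4 follow from the percentile formula Φ(z1–p – (zα + zβ)); a minimal sketch reproducing the rows for a one-sided significance level of 0.05:

```python
from scipy.stats import norm

def p_value_percentile(p, alpha, power):
    """100p-th percentile of the p-value distribution: Phi(Phi^{-1}(p) - (z_alpha + z_beta))."""
    z_sum = norm.ppf(1 - alpha) + norm.ppf(power)
    return norm.cdf(norm.ppf(p) - z_sum)

for power in (0.05, 0.5, 0.8, 0.9):
    median, p5, p95 = (p_value_percentile(p, 0.05, power) for p in (0.5, 0.05, 0.95))
    print(power, f"{median:.4f} {p5:.6f} {p95:.3f}")
```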
Although p-values larger than the significance level are not out of the ordi-
nary when the power at the true effect size is 80% or greater, large p-values
are out of the ordinary. Large p-values are indicative of either the alternative
hypothesis being false or that the study is not adequately powered at the true
effect size (i.e., the assumed effect size is greater than the true effect size). In
the latter case, the effect size chosen to design the study (“power the study”)
was larger than the true effect size.
Note that reporting a p-value for non-inferiority testing is rare. This is pri-
marily due to some subjectivity in the determination of the non-inferiority
margin.
Over repeated experiments (i.e., many independent repetitions of the experiment), about 95% of the 95% confidence intervals actually capture the correct value of the respective parameter. The value of (1 – α) is called the confidence coefficient.
For each different choice of an alternative hypothesis as presented in Tables
A.1 and A.2, there is a different type of confidence interval. For some real-
valued parameter θ and significance level α, θo values where Ho: θ = θo is not
rejected in favor of Ha: θ < θo form a 100(1 – α)% confidence interval for θ of
the form (–∞, U). The value for U is referred to as the 100(1 – α)% confidence
upper bound for θ. Analogously, θo values where Ho: θ = θo is not rejected
in favor of Ha: θ > θo at a significance level α form a 100(1 – α)% confidence
interval for θ of the form (L, ∞). The value for L is referred to as the 100(1 –α)%
confidence lower bound for θ. These two types of confidence intervals are
sometimes referred to as “one-sided confidence intervals” since they are
based on tests of one-sided alternative hypotheses.
Values for θo where Ho: θ = θo is not rejected in favor of Ha: θ ≠ θo at a signifi-
cance level α form a 100(1 – α)% confidence interval for θ of the form (L, U).
Such confidence intervals are sometimes referred to as “two-sided confidence
intervals” since they are based on tests of two-sided alternative hypotheses.
In clinical trials, the lower limit and upper limits, L and U, of the two-sided
100(1 – α)% confidence interval for θ that tend to be used are the 100(1 – α/2)%
confidence lower bound for θ and the 100(1 – α/2)% confidence upper bound
for θ, respectively. This allows the selected two-sided confidence interval to
be “error symmetric.” That is, before any data occur, the probability that the
two-sided 100(1 – α)% confidence interval for θ will lie entirely above the
true value equals the probability that the two-sided 100(1 – α)% confidence
interval for θ will lie entirely below the true value.
The factors that influence the width of a confidence interval depend on the
type of parameter of interest. For a mean of some characteristic within a study
arm, the width of the confidence interval depends on the sample size, the
variability between patients on that characteristic, and the confidence coef-
ficient. For a treatment versus control log-hazard ratio of some time-to-event
endpoint, the width of the confidence interval depends on the breakdown
of the number of events (or the total number of events and the randomiza-
tion ratio) and the confidence coefficient. In general terms, the width of a
confidence interval will depend on some quantification of the amount of evi-
dence or information gathered and the confidence coefficient. Increasing the
amount of evidence gathered on the correct value of a parameter θ (increasing
the sample size or the number of events) reduces the width of the confidence
interval. Also, increasing the confidence coefficient increases the width of
the confidence interval (e.g., a 90% confidence interval is wider than the cor-
responding 80% confidence interval).
Table A.5 summarizes various confidence interval formulas for large
sample sizes (event sizes). Here, for some quantitative characteristic, xE and
xC denote the sample means in the experimental and control arms, respec-
tively; sE and sC denote the respective sample standard deviations in the
TABLE A.5
Formulas for Approximate 100(1 – α)% Confidence Intervals for Particular
Parameters of Interest
Parameter of Interest Approximate 100(1 – α)% Confidence Interval
Single mean (μE)
xE ± zα /2 sE / nE
experimental and control arms, respectively; and μE and μ C denote the actual
or underlying means for the experimental and control arms, respectively.
For a dichotomous characteristic, where the possibilities will be expressed
as “success” or “failure,” let p̂E and p̂C denote the sample proportions of “suc-
cesses” in the experimental and control arms, respectively, and pE and pC
denote the actual probability of a “success” for the experimental and control
arms, respectively.
For a time-to-event endpoint, let θ denote the true experimental arm ver-
sus control arm log-hazard ratio and let θ̂ denote its estimate based on rE and
rC events on the experimental and control arms, respectively. Furthermore,
let nE and nC denote the sample sizes for the experimental and control arms,
respectively, and let the 100(1 – γ)-th percentile of a standard normal distribu-
tion be denoted by zγ .
The confidence intervals in Table A.5 are all of the same form—the esti-
mate plus or minus the corresponding standard error for the estimator multi-
plied by a standard normal value, which represents the number of standard
errors that the estimate and the parameter will be within each other 100(1 –
α)% of the time. The standard error is the square root of the average squared
distance between the estimator of the parameter and the actual value of the
parameter.
As can be seen from Table A.5, the confidence interval for a difference in
means (or proportions) is not determined by manipulating the individual
confidence intervals for each mean (proportion). The use of separate confi-
dence intervals is conservative in determining whether we can rule out that
the two true means are equal. Each separate confidence interval reflects, for
that arm only, the possibilities for the true mean where it was not out of the
ordinary to observe the data that was observed or more extreme data. The
confidence interval for the difference reflects possibilities for the difference
in means for which it was not out of the ordinary to observe that collective
data from both arms or more extreme data.
Suppose the 95% confidence interval for the mean of one arm is (2–8) and the
95% confidence interval for the mean of the other arm is (5–11). The 95% confi-
dence interval for the difference in means is (–0.43, 6.43). For the first (second)
arm, it is not out of the ordinary to observe the data for that arm if the true
mean was 2.5 (10.5). However, it would be out of the ordinary to observe the
collective data of both arms if the difference in the true means was 8.
The standard error of the estimator of the log-hazard ratio depends on the
randomization ratio, the total number of events, and the true hazard ratio.
For a one-to-one randomization, when the true hazard ratio is not far from 1,
the standard error is approximately 2 divided by the square root of the total
number of events. For a fixed total number of events, this provides a specific
relationship between the upper limit and lower limits of a confidence inter-
val for the hazard ratio and the hazard ratio estimate. For example, for a one-
to-one randomization where there are 400 events, the upper limit of the 95%
confidence interval for the hazard ratio should be about 22% greater than the
estimate of the hazard ratio, which in turn should be about 22% greater than
the lower limit of that 95% confidence interval.
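As a quick check of the 22% figure, note that exp(1.96 × 2/√400) ≈ 1.2165; a one-line sketch:

```python
import numpy as np
from scipy.stats import norm

print(np.exp(norm.ppf(0.975)*2/np.sqrt(400)))   # ~1.2165: each limit ~22% away
```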
A frequent mistake in calculating confidence intervals is applying asymp-
totic methods when the sample size is not large enough for the assumptions
to approximately hold. The confidence intervals will then not have a level
approximately equal to the desired level. The delta method has been applied
to many functions where it would take rather large sample sizes for the
asymptotic results to approximately hold. Care should be taken when apply-
ing such methods.
the 95% confidence interval for the true hazard ratio is (0.45, 0.79). The first
trial provides stronger evidence than the second trial that the experimental
arm gives longer times to the event than the control arm. However, the sec-
ond trial rules out more possibilities for the hazard ratio away from equality
(0.79, 1) than the first trial (0.83, 1).
The use of hypotheses tests and p-values has been viewed by some as
dichotomizing the results as either “successful” or “unsuccessful,” and that
the role of a clinical trial should be to get precise estimates of the effect of
the experimental therapy relative to the control therapy. Precise estimates
would require a particular maximum standard error for the estimate and
maximum width of the confidence interval. This would require studies of
some minimal sample size.
In hypothesis testing, the conclusion of the alternative hypothesis in one trial is reproduced by another trial if that other trial also reaches the conclusion of the same alternative hypothesis. In such a case, for each trial, the p-value is smaller than the respective significance level. In practice, for two superiority trials, a one-sided significance level of 0.025 would be used for each trial. For confidence intervals, it may be unclear what is meant by a “reproduced finding.” Is a reproduced finding getting similar confidence intervals from each study, or confidence intervals that have a great amount of overlap with respect to their widths?
the primary outcome variable is continuous. See Wiens’ study11 for more
discussion.
Exact Tests for Binary and Categorical Data. For binary data, exact methods
have been proposed in the literature and are commonly used. Unlike reran-
domization methods for continuous data, exact methods are easy to interpret
for binary data. Again, as one moves from a null hypothesis of no difference
to a null hypothesis of a nonzero difference, complications arise.
The idea of exact tests was first introduced by Fisher12 while he was devel-
oping a conditional exact test for comparing two independent proportions
in a 2 × 2 table. Fisher’s exact test deals with the classical null hypothesis of
no difference between the two proportions conditioning on the observed
marginal totals. In this case, the marginal totals form a sufficient statistic
(i.e., a sufficient quantity from the data on which inferences can be based)
for the nuisance parameter (common proportion, p). Conditioning on the
marginal totals yields a hypergeometric distribution for the number of suc-
cesses for the experimental group. However, because of the discreteness of
the hypergeometric distribution, Fisher’s test tends to be overly conservative.
For the same problem, Barnard13 proposed exact unconditional tests in
which all combinations of the unconditional sampling space are considered
in constructing the test. The probability is thus spread over more possi-
bilities, providing a test statistic that is less discrete in nature than the test
statistic for Fisher’s test. As a result, these exact unconditional tests gener-
ally offer better power than Fisher’s test, although they are computationally
more involved.
For testing the hypothesis of non-inferiority where the null space con-
tains a nonzero difference or a nonunity relative risk, a nuisance parameter
arises, making the calculation of the exact p-values more complicated. As
there is no simple sufficient statistic for the nuisance parameter, the condi-
tioning argument does not solve the problem. Exact unconditional methods
for non-inferiority testing have been proposed by Chan14 and Röhmel and
Mansmann,15 in which the nuisance parameter is eliminated using the maxi-
mization principle—that is, the exact p-value is taken as the maximum tail
probability over the entire null space. Because the maximization involves a
large number of iterations in evaluating sums of binomial probabilities, the
exact unconditional tests are computationally intensive, particularly with
large sample sizes. Some exact test procedures for non-inferiority and the
associated confidence intervals are currently available in statistical software.
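To illustrate the maximization principle, the sketch below computes an exact unconditional p-value for the non-inferiority hypotheses Ho: pE – pC ≤ –δ versus Ha: pE – pC > –δ. For simplicity it orders outcomes by the difference in sample proportions rather than by the score statistic used by Chan,14 and it maximizes the tail probability over a grid on the null boundary; it is a rough sketch of the idea, not the published algorithm, and the data in the last line are hypothetical.

```python
import numpy as np
from scipy.stats import binom

def exact_unconditional_p(xE, nE, xC, nC, delta, grid=1000):
    """Exact unconditional p-value for Ho: pE - pC <= -delta versus
    Ha: pE - pC > -delta, ordering outcomes by the difference in
    sample proportions and maximizing over the null boundary."""
    t_obs = xE / nE - xC / nC
    e = np.arange(nE + 1)[:, None]            # experimental successes
    c = np.arange(nC + 1)[None, :]            # control successes
    extreme = (e / nE - c / nC) >= t_obs      # as or more extreme outcomes
    p_max = 0.0
    for pC in np.linspace(delta, 1.0, grid):  # boundary: pE = pC - delta
        probs = (binom.pmf(np.arange(nE + 1), nE, pC - delta)[:, None]
                 * binom.pmf(np.arange(nC + 1), nC, pC)[None, :])
        p_max = max(p_max, probs[extreme].sum())
    return p_max

# hypothetical data: 10/15 vs. 4/15 responders, margin delta = 0.10
print(exact_unconditional_p(10, 15, 4, 15, delta=0.10))
```

A finer grid (or an optimization step) would be used in practice, and the enumeration grows quickly with the sample sizes, which is the computational burden noted above.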
Although most exact tests have been developed for non-inferiority testing, they can be easily adapted for equivalence testing by reformulating the equivalence hypothesis as two simultaneous non-inferiority hypotheses of opposite direction. Then, equivalence of the two treatments can be established if both one-sided hypotheses are rejected.16,17 This approach is also recommended in regulatory environments, as indicated in the International Conference on Harmonization E9 Guideline18 “Statistical Principles for Clinical Trials,” which states that “Operationally, this (equivalence test) is
any sample size. With the advances in computer hardware and software, the
results from exact tests can be readily determined.
Asymptotic Tests for Continuous Data. When the primary endpoint is continuous, asymptotic methods are commonly used. Consider the null and alternative hypotheses Ho: μC – μE ≥ δ versus Ha: μC – μE < δ. By assuming that the estimators of μC and μE, the sample means X̄C and X̄E, have approximate normal distributions (based on the underlying normality of the data or on a central limit theorem), the test statistic

Z = (X̄C – X̄E – δ)/se(X̄C – X̄E)

will have an approximate standard normal distribution, where se(X̄C – X̄E) is the standard error of the difference in sample means. For sample sizes of nC and nE, respectively, and common standard deviation σ, the standard error will be σ√(1/nC + 1/nE). Except for the subtraction of the nonzero δ in the numerator, this test statistic Z is identical to the test statistic for a test of superiority. The null hypothesis is rejected if Z < –zα in a one-sided test at level α, where zα is the 100 × (1 – α) percentile of the standard normal distribution (e.g., 1.645 for a one-sided test with significance level 0.05). A p-value can also be calculated. When σ is unknown and the samples are from normal distributions, a t test can be used where σ² is estimated by a pooled variance. With the large sample sizes common in non-inferiority clinical trials, it is not necessary to assume either an equal variance or that the samples are from a normal distribution. Rather, √(sC²/nC + sE²/nE), where sC² and sE² are the respective sample variances, can be used to estimate the standard error of the difference in sample means, replacing se(X̄C – X̄E) in Z.
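A minimal sketch of this test with unpooled variances; the summary statistics and margin in the final line are hypothetical.

```python
import numpy as np
from scipy.stats import norm

def noninf_test_means(xbarC, sC, nC, xbarE, sE, nE, delta, alpha=0.025):
    # Test Ho: muC - muE >= delta versus Ha: muC - muE < delta.
    se = np.sqrt(sC**2 / nC + sE**2 / nE)
    z = (xbarC - xbarE - delta) / se
    p = norm.cdf(z)                    # reject when z < -z_alpha
    half = norm.ppf(1 - alpha) * se    # two-sided 100(1 - 2*alpha)% CI;
    ci = (xbarC - xbarE - half, xbarC - xbarE + half)
    return z, p, ci                    # upper CI bound < delta iff z < -z_alpha

# hypothetical summary data: nC = nE = 300, margin delta = 1.5
print(noninf_test_means(10.0, 4.0, 300, 9.5, 4.2, 300, delta=1.5))
```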
In practice, non-inferiority test procedures are often expressed without test statistics (although test statistics can be used), in part because of the subjectivity involved in the choice of δ when analyzing non-inferiority trials. Rather, a two-sided 100(1 – α)% confidence interval for μC – μE is determined, and the null hypothesis is rejected if the upper bound of the confidence interval is less than δ. If the confidence interval is calculated as μ̂C – μ̂E ± zα/2 se(μ̂C – μ̂E), the confidence interval approach and the test statistic approach are identical. The confidence interval approach conveys those margins that can and cannot be ruled out by the data. Thus, when there are different perspectives on the non-inferiority margin (e.g., different perspectives among regulatory bodies), individual decisions on non-inferiority are based on the same confidence interval. The two-sided confidence interval is preferred because it gives information about both the “best-case scenario” and the “worst-case scenario” for the experimental treatment. A two-sided confidence interval can also be used to simultaneously test for superiority, non-inferiority, inferiority, and equivalence (if equivalence margins are established).
More details on statistical approaches for continuous data are given in
Chapter 12.
Asymptotic Tests for Binary Data. When the primary endpoint is binary, infer-
ence is based on the proportion of subjects in each arm who have a favorable
TABLE A.6
Summary of Posterior Probabilities of p > 0.5 for Different Prior Distributions When Observing 8 Heads in 10 Tosses

Case   Prior Distribution for p   Posterior Distribution for p   Posterior Probability of p > 0.5
1      Beta (α = 1, β = 1)        Beta (α = 9, β = 3)            0.967
2      Beta (α = 3, β = 3)        Beta (α = 11, β = 5)           0.941
3      Beta (α = 5, β = 5)        Beta (α = 13, β = 7)           0.916
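The entries of Table A.6 can be reproduced directly from the Beta posterior (a sketch):

```python
from scipy.stats import beta

# 8 heads in 10 tosses; a Beta(a, b) prior gives a Beta(a + 8, b + 2) posterior
for a, b in [(1, 1), (3, 3), (5, 5)]:
    print(a, b, round(beta.sf(0.5, a + 8, b + 2), 3))  # P(p > 0.5 | data)
```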
Example A.4
As before, let θ denote the true experimental versus control hazard ratio of some
undesirable event (e.g., death or disease progression). For an observed hazard
ratio of 0.91 based on 400 events in a clinical trial that had a one-to-one random-
ization, Table A.7 summarizes the posterior probability that θ lies in the alternative
hypothesis for each comparison type.
Since superiority (θ < 1) implies non-inferiority (θ < 1.1), the posterior probability for the non-inferiority comparison will always be greater than the posterior probability for a superiority comparison of the same two treatment arms. Also, since the
parameter has a continuous posterior distribution, the sum of the posterior prob-
abilities for superiority and for inferiority is 1, and the posterior probability for the
alternative hypothesis of a difference comparison is always 1 (even if the observed
hazard ratio is 1). In practice, since a difference comparison is a two-sided com-
parison, the posterior probabilities on each side of no difference would be calcu-
lated and compared when making an inference about whether a difference (and
the direction of that difference) has been demonstrated.
Note that from Tables A.3 and A.7, for inferiority, superiority, and non-inferiority comparisons, the sum of the p-value and the posterior probability of the alternative hypothesis equals 1. This will occur for each of these comparisons whenever a normal model with known variance is used for the estimator and a noninformative prior distribution is selected for the parameter. More on comparing and contrasting a p-value and the posterior probability of the alternative hypothesis will be provided in Section A.3.

TABLE A.7
Summary of Posterior Probabilities for Various Types of Comparisons for an Observed Experimental versus Control Hazard Ratio of 0.91 Based on 400 Events

Case              Null Hypothesis   Alternative Hypothesis   Posterior Probability of the Alternative Hypothesis
Inferiority       Ho: θ ≤ 1         Ha: θ > 1                0.173
Superiority       Ho: θ ≥ 1         Ha: θ < 1                0.827
Difference        Ho: θ = 1         Ha: θ ≠ 1                1
Non-inferiority   Ho: θ ≥ 1.1       Ha: θ < 1.1              0.971

For each case, the posterior distribution for the true log-hazard ratio is N(ln(0.91), (0.1)²).a
a The posterior distribution is approximated for a one-to-one randomization by using the asymptotic distribution of the estimator of the log-hazard ratio and a noninformative prior on the true log-hazard ratio.
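A sketch reproducing the posterior probabilities in Table A.7 from the normal posterior for the log-hazard ratio:

```python
import numpy as np
from scipy.stats import norm

mu, se = np.log(0.91), 0.1                   # ln(theta) ~ N(ln 0.91, 0.1^2)
p_sup = norm.cdf((np.log(1.0) - mu) / se)    # P(theta < 1)   ~ 0.827
p_inf = 1 - p_sup                            # P(theta > 1)   ~ 0.173
p_ni = norm.cdf((np.log(1.1) - mu) / se)     # P(theta < 1.1) ~ 0.971
print(p_inf, p_sup, p_ni)
```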
parameter θ. Let Ω denote the parameter space for θ, the set of all possible values for θ. We denote the probability density function (or probability mass function) of X by f(x|θ). Let h denote the probability density function for the prior distribution of θ. Then for the observed values x1, . . . , xn, from a random sample X1, . . . , Xn, the posterior density function for θ is given by

g(θ|x1, x2, . . . , xn) = [f(x1|θ) f(x2|θ) ⋅⋅⋅ f(xn|θ) h(θ)] / [∫Ω f(x1|θ) f(x2|θ) ⋅⋅⋅ f(xn|θ) h(θ) dθ].   (A.1)
TABLE A.8
Summary of Conjugate Families of Prior Distributions

Distribution from which
X1, …, Xn Is Randomly Drawn   Prior Distribution         Posterior Distributiona
Bernoulli (p)                 p ~ Beta (α, β)            p ~ Beta (α + Σxi, β + n – Σxi), with mean
                                                         (n x̄ + (α + β)(α/(α + β)))/(n + α + β)
Normal (μ, σ²), σ² known      μ ~ Normal (υ, τ²)         μ ~ Normal, with mean
                                                         ((n/σ²) x̄ + (1/τ²) υ)/(n/σ² + 1/τ²)
                                                         and variance 1/(n/σ² + 1/τ²)
Poisson (λ)                   λ ~ Gamma (α, β), where    λ ~ Gamma (α + Σxi, 1/(n + 1/β)), with mean
                              the mean of λ is αβ        (n x̄ + (1/β)(αβ))/(n + 1/β)
a For the posterior distributions, the observed values are x1, . . . , xn with sample mean x̄.
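A sketch of two of the conjugate updates in Table A.8 (the data vector in the last line is hypothetical):

```python
import numpy as np

def update_beta(alpha, beta_, x):
    # Beta-Bernoulli: x is an array of 0/1 observations
    return alpha + x.sum(), beta_ + len(x) - x.sum()

def update_normal(nu, tau2, x, sigma2):
    # Normal-Normal with known data variance sigma2; prior mu ~ N(nu, tau2)
    n, xbar = len(x), np.mean(x)
    post_var = 1 / (n / sigma2 + 1 / tau2)
    post_mean = post_var * (n * xbar / sigma2 + nu / tau2)
    return post_mean, post_var

# 8 successes in 10 trials with a uniform Beta(1, 1) prior -> Beta(9, 3)
print(update_beta(1, 1, np.array([1, 1, 1, 1, 1, 1, 1, 1, 0, 0])))
```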
distributions for which the inferences on the parameter are based almost
entirely on the data. In some settings, Bayesian inferences based on such
prior distributions are completely analogous or identical to inferences based
on frequentist methods. For example, consider an experiment where inde-
pendent samples of size 100 each are taken from normal distributions each
having a variance of 25 and respective means of μ1 and μ2. The statistical
hypotheses that will be tested are Ho: μ1 ≤ μ2 and Ha: μ1 > μ2. Let x̄1 and x̄2 denote the observed values of the respective sample means. The p-value for the normalized test statistic is 1 – Φ((x̄1 – x̄2)/√(1/2)). For noninformative prior distributions for μ1 and μ2 (or equivalently, a noninformative prior distribution on θ = μ1 – μ2), the posterior probability that μ1 > μ2 equals Φ((x̄1 – x̄2)/√(1/2)). Thus, for this example, for any 0 < α < 1, rejecting Ho (and thus concluding Ha) whenever the p-value is ≤ α is equivalent to rejecting Ho whenever the posterior probability of Ha is ≥ 1 – α.
Jeffreys Prior Distributions. In the univariate setting, a Jeffreys prior has a density function proportional to the square root of Fisher’s information. Fisher’s information in a single observation is given by

I(θ) = –E[∂²/∂θ² log f(X|θ)].

The density for the Jeffreys prior then satisfies h(θ) ∝ √I(θ). When sampling
That is, the estimate, a, is the value in the parameter space that minimizes the posterior expected loss ∫Ω L(θ, a) g(θ|x1, . . . , xn) dθ. For example, consider the squared-error loss function, L(θ, a) = (θ – a)². The value a = E(θ|x1, . . . , xn), the posterior mean of θ, minimizes E(θ – a)² = ∫Ω (θ – a)² g(θ|x1, . . . , xn) dθ. For an absolute loss function (L(θ, a) = |θ – a|), the expected loss is minimized by using the median of the posterior distribution as the estimate.
Consider a Jeffreys prior for p, the probability of a success. An experiment results in two successes and eight failures among 10 trials. Here, p has a posterior Beta distribution with α = 2.5 and β = 8.5. Table A.9 provides the estimates for three loss functions.
The posterior median and mean for p are approximately 0.210 and 0.227,
respectively. For cubed absolute loss, the value of approximately 0.242 mini-
mizes the expected loss. When the Beta distribution has a mean less than 0.5,
as in this example, the sequence of ah that minimizes E(|θ – a|h) increases to
0.5. If the Beta distribution has a mean greater than 0.5, this same sequence
decreases to 0.5. When the Beta distribution has a mean equal to 0.5, ah = 0.5
minimizes E(|θ – a|h) for all h > 0.
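The estimates in Table A.9 can be approximated numerically from posterior draws (a Monte Carlo sketch):

```python
import numpy as np
from scipy.stats import beta
from scipy.optimize import minimize_scalar

draws = beta(2.5, 8.5).rvs(200_000, random_state=1)   # posterior draws

for h in (1, 2, 3):   # absolute, squared, and cubed absolute loss
    est = minimize_scalar(lambda a: np.mean(np.abs(draws - a) ** h),
                          bounds=(0, 1), method="bounded").x
    print(h, round(est, 3))   # ~ 0.21, 0.23, 0.24
```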
A frequentist evaluation of a Bayesian method can also be done. For Bayesian estimators, the sampling distribution can be determined, as can the mean square error (when it exists) and the asymptotic properties of the estimator. For example, consider making an inference on a response rate p, based on a sample of 20 subjects and a Beta prior distribution for p where the value of each parameter is 2. Let x denote the number among the 20 subjects that responded. We will model x as a random value from a binomial distribution based on 20 trials with probability of success p. We denote the mean of the posterior distribution by p̂ = (x + 2)/24. Then the sampling distribution for p̂ is summarized by

P(p̂ = (x + 2)/24) = [20!/(x!(20 – x)!)] p^x (1 – p)^(20–x), for x = 0, . . . , 20.

The mean squared error for p̂ is (1 + p – p²)/144.
Credible Intervals. To illustrate an example of an equal-tailed credible inter-
val, consider a randomized, controlled clinical trial where 10 of 15 patients
on the experimental arm and 4 of 15 patients on the control arm responded.
We will use a Jeffreys prior for the prior distribution for the response rate (pC
TABLE A.9
Loss Functions and Corresponding Estimates

Loss Function L(θ, a)            Estimate
Absolute loss, |θ – a|           0.210
Squared-error loss, (θ – a)²     0.227
Cubed absolute loss, |θ – a|³    0.245
|θ – a|^h as h → ∞               0.5
TABLE A.10
Summary of 95% Credible Intervals

Parameter                              Prior Distribution              Posterior Distribution          95% Credible Interval
Control response rate pC               pC ~ Beta (0.5, 0.5)            pC ~ Beta (4.5, 11.5)           (0.097, 0.517)
Experimental response rate pE          pE ~ Beta (0.5, 0.5)            pE ~ Beta (10.5, 5.5)           (0.416, 0.860)
Difference in response rates pE – pC   pC and pE assumed independent   pC and pE assumed independent   (0.046, 0.663)
and pE) of each arm. Table A.10 summarizes the equal-tailed 95% credible
intervals for the response rate of each arm and for the difference in response
rates. The joint posterior distribution (with joint density being the product of
the posterior densities for pC and pE) is used to determine the 95% credible
interval for the difference.
Note that the 95% exact confidence intervals for pC and pE are (0.078, 0.551)
and (0.384, 0.882), respectively. Also, the large sample normal approximate
95% confidence interval for pE – pC is (0.073, 0.727). The 95% credible inter-
vals are narrower than the respective exact 95% confidence intervals for pC
and pE. Here, for the difference in response rates, the 95% credible interval
is narrower than and slightly shifted to the left of the 95% confidence inter-
val. Section A.3 investigates the relationship between credible intervals for a proportion based on a Jeffreys prior and the corresponding exact confidence intervals.
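A sketch reproducing Table A.10; the credible interval for the difference is approximated with Monte Carlo draws from the joint posterior:

```python
import numpy as np
from scipy.stats import beta

postC = beta(4.5, 11.5)    # Jeffreys prior + 4/15 responders (control)
postE = beta(10.5, 5.5)    # Jeffreys prior + 10/15 responders (experimental)
print(postC.ppf([0.025, 0.975]))   # ~ (0.097, 0.517)
print(postE.ppf([0.025, 0.975]))   # ~ (0.416, 0.860)

# equal-tailed credible interval for pE - pC from the joint posterior
diff = (postE.rvs(200_000, random_state=1)
        - postC.rvs(200_000, random_state=2))
print(np.quantile(diff, [0.025, 0.975]))   # ~ (0.046, 0.663)
```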
Hypothesis testing can be based on a credible interval, on the magnitude of the posterior probability that the alternative hypothesis is true, or on the expected loss/cost for rejecting or failing to reject the null hypothesis. In
any case, there will exist a rejection region, a set of possible samples for
which the null hypothesis is rejected. The rejection region can be assessed
to determine the type I error rate or size of the test, and the power function.
For example, suppose a posterior probability for p > 0.5 greater than 0.975 is needed to reject the null hypothesis that p ≤ 0.5 and conclude the alternative hypothesis that p > 0.5. For a Jeffreys prior distribution, this would require at least 15 responses among the 20 subjects. The power function for this test is

Σ_{x=15}^{20} [20!/(x!(20 – x)!)] p^x (1 – p)^(20–x), for 0 < p < 1,

and thus the size of the test is approximately 0.021.
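A sketch that locates the rejection region and evaluates the size of this test:

```python
import numpy as np
from scipy.stats import beta, binom

n = 20
x = np.arange(n + 1)
post = beta.sf(0.5, x + 0.5, n - x + 0.5)   # P(p > 0.5 | x), Jeffreys prior
x_min = x[post > 0.975].min()               # smallest x rejecting Ho (15)
size = binom.sf(x_min - 1, n, 0.5)          # P(X >= x_min | p = 0.5) ~ 0.021
print(x_min, size)
```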
The Likelihood Principle. The likelihood principle states that when two dif-
ferent experiments lead to the same or proportional likelihood functions,
inferences about the parameter should be identical for the two experiments.
The likelihood principle is preserved with Bayesian inference but may not
be preserved with frequentist inference. This is illustrated in the following
example.
Example A.5
Suppose we learn that among seven subjects that were given an investigational
drug in a phase I study, one subject experienced the target toxicity. Suppose we
are interested in whether the true probability that a subject will experience the tar-
get toxicity, p, is less than 0.5. Thus, we are interested in testing Ho: p = 0.5 against the alternative hypothesis Ha: p < 0.5. Consider the following two possible designs (with
corresponding calculations for the p-value):
• Design A (binomial design): The design of the study required the toxicity
experiences of exactly seven subjects given the proposed dose of the inves-
tigational drug. One patient out of the seven experienced the target toxicity.
Here, the p-value, the probability that zero or one of the seven patients
would experience the target toxicity when p = 0.5, equals 1/16 (=0.0625).
The one-sided lower 95% confidence interval for p is (0, 0.52).
• Design B (negative binomial design): For the study, subjects were to receive
the investigational therapy, one at a time, until a subject experiences the
target toxicity or until 10 patients have received the investigational therapy.
Here, the p-value, the probability that at least seven patients will be treated
with the investigational drug when p = 0.5, equals 1/64 (≈0.0156). The one-
sided lower 95% confidence interval for p is (0, 0.39).
The corresponding likelihood functions for p in the two studies are propor-
tional. However, if formal hypothesis testing were done with a significance level of 0.05, the decision on whether the evidence is strong enough to conclude p < 0.5 would differ between the study designs. In the Bayesian set-
ting, we have from Equation A.1 that since the likelihood functions for p are
proportional, the posterior distribution for p will not depend on whether the
design was A or B. For a Jeffreys prior for p (which leads to a Beta posterior
distribution with parameter values 1.5 and 6.5), the posterior probability that
p > 0.5 is 0.025. The one-sided lower 95% credible interval for p is (0, 0.44). For
a Beta prior on p with parameters α and β, the posterior distribution for p
will have parameter values α + 1 and β + 6. Sending α and β to zero leads to a
limiting Beta posterior distribution with parameter values of 1 and 6. On the
basis of this Beta posterior distribution, the posterior probability that p > 0.5
is 1/64 with a one-sided lower 95% credible interval for p of (0, 0.39). This pro-
vides analogous results to the frequentist analysis using a negative binomial
design. It can be shown that for alternative hypotheses of the form p > pothat
the p-value for observing the xth success in the nth trial from a negative bino-
mial design equals the probability of p > po when p has a Beta distribution
with parameter values x and n–x. Further comparison of credible intervals
and confidence intervals for proportions is provided in Section A.3.
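A sketch of the calculations in Example A.5, showing the design-dependent frequentist p-values and the design-independent posterior probability:

```python
from scipy.stats import binom, beta

# Design A (binomial): 1 toxicity among n = 7 subjects
pA = binom.cdf(1, 7, 0.5)        # = 1/16 = 0.0625
# Design B (negative binomial): first toxicity on subject 7
pB = 0.5 ** 6                    # = 1/64 ~ 0.0156
# Bayesian posterior (Jeffreys prior) is the same under either design
post = beta.sf(0.5, 1.5, 6.5)    # P(p > 0.5 | data) ~ 0.025
print(pA, pB, post)
```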
The exact 100(1 – α/2)% confidence lower bound for p is pL, where pL satisfies

Σ_{j=x}^{n} [n!/(j!(n – j)!)] pL^j (1 – pL)^(n–j) = α/2.

Similarly, the exact 100(1 – α/2)% confidence upper bound for p is pU, where pU satisfies

Σ_{j=0}^{x} [n!/(j!(n – j)!)] pU^j (1 – pU)^(n–j) = α/2.

The exact 100(1 – α)% confidence interval for p is then given by (pL, pU).
It can be shown by applying multiple integrations by parts that pL and pU satisfy

Σ_{j=x}^{n} [n!/(j!(n – j)!)] pL^j (1 – pL)^(n–j) = ∫_0^(pL) [n!/((x – 1)!(n – x)!)] z^(x–1) (1 – z)^(n–x) dz = α/2

and

Σ_{j=0}^{x} [n!/(j!(n – j)!)] pU^j (1 – pU)^(n–j) = ∫_(pU)^1 [n!/(x!(n – x – 1)!)] z^x (1 – z)^(n–x–1) dz = α/2.

Thus, the 100(1 – α)% confidence interval for p has the 100α/2-th percentile of a Beta distribution with parameter values x and n – x + 1 for its lower limit (pL) and the 100(1 – α/2)-th percentile of a Beta distribution with parameter values x + 1 and n – x for its upper limit (pU).
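This Beta-percentile characterization makes the exact interval straightforward to compute (a sketch):

```python
from scipy.stats import beta

def exact_ci(x, n, alpha=0.05):
    # Clopper-Pearson interval via the Beta percentile characterization
    lo = beta.ppf(alpha / 2, x, n - x + 1) if x > 0 else 0.0
    hi = beta.ppf(1 - alpha / 2, x + 1, n - x) if x < n else 1.0
    return lo, hi

print(exact_ci(4, 15))   # ~ (0.078, 0.551), the interval quoted for pC above
```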
Let qL and qU denote the exact 100α/2% confidence lower bound for p and the exact 100α/2% confidence upper bound for p, respectively. Thus, qL is the 100(1 – α/2)-th percentile of a Beta distribution with parameter values x and n – x + 1, and qU is the 100α/2-th percentile of a Beta distribution with parameter values x + 1 and n – x. Since we are 100α/2% confident that the actual value of p is in [qL, 1] and also 100α/2% confident that the actual value of p is in [0, qU], it would seem reasonable that we are 100(1 – α)% confident that the actual value of p is in (qU, qL). However, note that a Beta distribution with parameter values x and n – x + 1 is stochastically smaller than a Beta distribution with parameter values x + ½ and n – x + ½ (whose 100α/2-th and 100(1 – α/2)-th percentiles, rL and rU, are the limits of the corresponding Jeffreys-prior credible interval), which in turn is smaller than a Beta distribution with parameter values x + 1 and n – x. Thus, pL < rL < qU and qL < rU < pU. Hence, (qU, qL) is contained in (rL, rU), which in turn is contained in (pL, pU).
Note that when a frequentist says that a confidence coefficient (regardless of whether the confidence interval is one-sided or two-sided) is γ, this means that the confidence interval will capture the correct value of the parameter at least 100γ% of the time. Thus, the interval (qU, qL) will capture the correct value of the parameter at most (not at least) 100(1 – α)% of the time. In fact, there are few possibilities (if any) for the actual value of p that would be captured exactly 100(1 – α)% of the time by an exact 100(1 – α)% confidence interval for p. Thus, invariably, the “probability coverage” of a 100(1 – α)% exact confidence interval for p, (pL, pU), is greater than 1 – α. Exact confidence intervals for p have accordingly been regarded as conservative.
For negative binomial sampling, where the xth success is observed on the nth trial, the limits sL and sU of the exact 100(1 – α)% confidence interval satisfy

Σ_{j=x}^{n} [n!/(j!(n – j)!)] sL^j (1 – sL)^(n–j) = ∫_0^(sL) [n!/((x – 1)!(n – x)!)] z^(x–1) (1 – z)^(n–x) dz = α/2

and

Σ_{j=0}^{x–1} [(n – 1)!/(j!(n – 1 – j)!)] sU^j (1 – sU)^(n–1–j) = ∫_(sU)^1 [(n – 1)!/((x – 1)!(n – x – 1)!)] z^(x–1) (1 – z)^(n–x–1) dz = α/2.

Thus, the 100(1 – α)% confidence interval for p has the 100α/2-th percentile of a Beta distribution with parameter values x and n – x + 1 for its lower limit (sL) and the 100(1 – α/2)-th percentile of a Beta distribution with parameter values x and n – x for its upper limit (sU).
Note that the lower limit of the 100(1 – α)% exact confidence interval for p is the
same for binomial sampling as negative binomial sampling (i.e., sL = pL). Since
a Beta distribution with parameter values x and n – x is stochastically smaller than a Beta distribution with parameter values x + 1 and n – x, we have sU < pU. Whether there is an ordering between a Beta distribution with parameter values x and n – x and a Beta distribution with parameter values x + ½ and n – x + ½ (i.e., the respective order of sU and rU) depends on the specific values for
x and n. Note also that a Beta distribution with parameters x and n – x is the
limiting posterior distribution from using a Beta prior distribution for p with
parameters α and β, and then sending α and β to zero. Thus, sU is the upper limit
of the respective credible interval based on this limiting posterior distribution.
Inference based on a Jeffreys prior distribution can be directly generalized to
comparing two proportions. If the samples involving each proportion are inde-
pendent, the true probabilities of a success will have independent Beta posterior
distributions. The posterior distribution for the difference in the true probabili-
ties (or some other function of the true probabilities) can be determined, which
can be used to find an estimate and a credible interval for the difference in the
probabilities of a success. The specific approach used to find an exact confi-
dence interval for p cannot be directly extended to making inferences about
a difference in two probabilities. However, there are exact confidence inter-
val approaches for the difference of two probabilities (see Chapter 11). These
approaches require setting an ordering on the possible observations that may or may not be specified a priori (i.e., different orderings have been used in practice).
methods adjust for the possibility that the control therapy is not effective.
We borrow ideas from Simon23 in constructing some of the credible inter-
vals. We will assume that the sample/event size for the non-inferiority trial
is independent of results in estimating the effect of the control therapy.
Consider the following hypothetical example for overall survival. The pla-
cebo versus control therapy log-hazard ratio is estimated as 0.20, with corre-
sponding standard error of 0.10. A normal distribution is considered for the
sampling distribution of the placebo versus control log-hazard ratio estimator.
From the non-inferiority trial, the experimental therapy versus control therapy log-hazard ratio is estimated as –0.10, with corresponding standard error of 0.08. Table A.11 gives 95% confidence intervals and 95% credible intervals for the retention fraction, λ, based on various methods. Those methods that are being introduced are described below. For the Bayesian methods, the posterior distribution for the placebo versus control therapy log-hazard ratio, β, is modeled as a normal distribution with mean 0.2 and standard deviation 0.1, and the posterior distribution for the experimental versus control log-hazard ratio, η, is modeled as a normal distribution with mean –0.10 and standard deviation 0.08.
The intervals using frequentist methods do not adjust for the uncertainty
that the control therapy is less effective than placebo. Here the p-value
for testing whether the control therapy is better than placebo is 0.019. The
(Bayesian) probability that the control therapy is less effective than placebo
is also 0.019. If we ignore whether the control therapy is more or less effective
than placebo and extend the definition of λ to include cases where β < 0 (λ =
1 – η/β), then P(λ > 0.239) = P(λ < 4.56) = 0.975. Here the 95% (“equal-tailed”)
credible interval analog to the 95% confidence interval is (0.239, 4.56).
The other two credible intervals in Table A.11 do not ignore the uncertainty
that the control therapy may be less effective than placebo. When the deter-
mination of a credible interval for λ requires or is restricted to β > 0, the 95%
(“equal-tailed”) credible interval for λ is (–0.527, 4.51). That is, P(λ > –0.527, β > 0)
= 0.975 and P(λ > 4.51, β > 0) = 0.025. For this case, since P(β > 0) ≈ 0.977, an equal-
tailed credible interval with coefficient greater than 0.954 cannot be determined.
If possibilities where η – β < 0 and β < 0 (cases where the experimental therapy
is better than placebo, which is better than the control therapy) are regarded as
TABLE A.11
95% Confidence Interval or Credible Interval for Retention Fraction, λ, Based on Several Methods

Method                                           95% Confidence Interval or Credible Interval for λ
Fieller (based on a normalized test statistic)   (0.640, 26.612)
Delta method                                     (0.575, 2.424)
Bayesian, ignoring whether β < 0 or β > 0        P(0.239 < λ < 4.56) = 0.95
Bayesian, excluding P(η – β < 0 and β < 0)       P(–0.527 < λ < 4.51, β > 0) = 0.95
Bayesian, including P(η – β < 0 and β < 0)       P(0.614 < λ < 9.90, β > 0) = 0.95
having greater relative efficacy than any case where λ > 0 and β > 0, then the 95% (“equal-tailed”) credible interval for λ is (0.614, 9.90). That is, P({λ > 0.614, β > 0} or {η – β < 0, β < 0}) = 0.975 and P({λ > 9.90, β > 0} or {η – β < 0, β < 0}) = 0.025.
This last method has the advantage of considering both the uncertainty that
the control therapy may be less effective than a placebo and also other cases of
greater relative efficacy. The interval (0.614, 9.90) may be the most appropriate
95% CI (Confidence interval or credible interval) for the retention fraction.
When P(β > 0) is extremely close to 1, the credible intervals from the above
three methods will be approximately the same. The 95% confidence interval
from the Fieller method will also be similar. For example, if the estimate
of the placebo versus control therapy log-hazard ratio is instead 0.4, then
the 95% Fieller confidence interval and each of the three 95% credible inter-
vals are approximately (0.851, 1.81). The approximate 95% confidence interval
using the delta method is (0.839, 1.66). The confidence interval for the delta
method is noticeably different from that using Fieller’s method. The estima-
tor of the retention fraction may not have an approximate normal distribu-
tion. Rothmann and Tsou24 examined the actual coverage of delta method
confidence intervals for the retention fraction when estimated by a ratio of
independent random variables, each having an approximate normal distri-
bution. When the ratio of the mean to the standard deviation is greater than 8 for the estimator of the effect of the control therapy, then (per Rothmann and Tsou24) a hypothesis test based on a delta method confidence interval for the retention fraction will have approximately the desired type I error rate.
For testing for a retention fraction of more than 0.5, Table A.12 summarizes
the one-sided p-values and the analogous posterior probabilities using these
methods for the example given in Table A.11. For this case, the p-value or
posterior probability was similar for the normalized test statistic, the delta
method, and the Bayesian method, which includes possibilities where η – β <
0 and β < 0 as having the greatest relative efficacy.
TABLE A.12
One-Sided p-Values and Analogous Posterior Probabilities for Non-Inferiority, Testing for a Retention Fraction of More than 0.5

Method                                      p-Value/Posterior Probability
Normalized test statistic                   0.017
Delta method                                0.017
Bayesian, ignoring whether β < 0 or β > 0   P(λ < 0.5) = 0.032
Bayesian, excluding P(η – β < 0 and β < 0)  1 – P(λ > 0.5, β > 0) = 0.036
Bayesian, including P(η – β < 0 and β < 0)  1 – P({λ > 0.5, β > 0} or {η – β < 0, β < 0}) = 0.019
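The Bayesian quantities in Tables A.11 and A.12 can be approximated by Monte Carlo draws from the two normal posteriors. The sketch below implements the method that ignores the sign of β:

```python
import numpy as np

rng = np.random.default_rng(1)
b = rng.normal(0.20, 0.10, 1_000_000)    # beta: placebo vs. control log-HR
e = rng.normal(-0.10, 0.08, 1_000_000)   # eta: experimental vs. control log-HR
lam = 1 - e / b                          # retention fraction

print(np.quantile(lam, [0.025, 0.975]))  # ~ (0.24, 4.6)
print(np.mean(lam < 0.5))                # ~ 0.032 (Table A.12, "ignore" row)
```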
experiments even though the observed data are identical. The Bayesian anal-
ysis remains the same. This is also pertinent to the design and analysis of
non-inferiority trials when the analysis includes the estimation of the effect
of the control therapy from previous trials. The frequentist interpretation of the results formally depends on whether the design of the non-inferiority trial was independent of, or dependent on, the estimation of the effect of the control therapy. We will illustrate this by considering two designs for comparing
the means from two samples. For the purpose of a non-inferiority compari-
son, μ1 represents the effect of the control therapy versus placebo that will be
estimated by previous trials and μ2 represents the difference in the effects of
the control and experimental therapies that will be estimated from the non-
inferiority trial. Consider the following two experiments.
Case 1: A random sample of size 25 is drawn from a normal distribution
having an unknown mean μ1 and a variance equal to 100, and an indepen-
dent random sample of size 100 is drawn from a normal distribution having
an unknown mean μ2 and a variance equal to 100.
Case 2: A random sample of size 25 is drawn from a normal distribution
having an unknown mean μ1 and a variance equal to 100. The observed
sample mean, x1 , is noted. An independent random sample of size m(x1 ) is
drawn from a normal distribution having an unknown mean μ2 and a vari-
ance equal to 100, for some positive-integer valued function m.
Let x1 and x2 denote the respective sample means.
In case 1, the likelihood function reduces to (is proportional to)

L(μ1, μ2; x̄1, x̄2) = f(x̄1, x̄2; μ1, μ2) = [1/(4π)] exp{–[(x̄1 – μ1)²/8 + (x̄2 – μ2)²/2]}.

The likelihood function factors into the product of separate functions of x̄1 and x̄2, and also factors into the product of separate functions of μ1 and μ2. The two random sample means are independent, and if independent noninformative priors are selected for μ1 and μ2, then μ1 and μ2 will be independent at all stages of sampling. In fact, in such a case, certain frequentist and Bayesian inferences will be the same.
In case 2, the likelihood function reduces to (is proportional to)

L(μ1, μ2; x̄1, x̄2) = f(x̄1, x̄2; μ1, μ2) = [√(m(x̄1))/(40π)] exp{–[(x̄1 – μ1)²/8 + (x̄2 – μ2)²/(200/m(x̄1))]}.

This likelihood function will factor into the product of separate functions of μ1 and μ2. Analogous types of Bayesian methods can be applied in case 2 as in case 1, and if m(x̄1) = 100, the likelihood functions and the posterior distributions will be identical, and hence the inferences will be identical. However, in case 2, the likelihood function cannot be expressed as the product of separate functions of x̄1 and x̄2. In fact, if m is not a constant function, then the difference in the random sample means will not have a normal distribution. Suppose μ1 = μ2 = 0, and suppose m(x) = 1 if x < 0 while m(x) is so large that 100/m(x) ≈ 0 if x > 0. Then it is easy to see that P(X̄1 – X̄2 > 0) > 0.5, even though E(X̄1 – X̄2) = 0.
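A quick simulation of this extreme choice of m, with the huge second-stage sample size proxied by a tiny standard deviation, makes the point:

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(0.0, 2.0, 1_000_000)    # mean of 25 obs with variance 100
sd2 = np.where(x1 < 0, 10.0, 1e-3)      # m = 1 if x1 < 0; m huge if x1 > 0
x2 = rng.normal(0.0, sd2)               # mean of the second-stage sample
print(np.mean(x1 - x2 > 0))             # ~ 0.72, above 0.5 despite mean zero
```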
Scenarios like these arise when historical trials are used to estimate the
effect of the non-inferiority trial’s control therapy. Many of the first such non-
inferiority analyses that used historical trials to estimate the effect of the
control therapy had this estimation occur retrospectively after the results
of the non-inferiority trial were known. It is currently common practice to prospectively estimate the effect of the control therapy before conducting the non-inferiority trial. Thus, the non-inferiority criterion and the sizing of the non-inferiority trial will depend on the results from estimating the effect of the control therapy. For Bayesian analyses, it does not matter whether the estimate of the control therapy’s effect and its corresponding variance influence the sizing of the non-inferiority trial. For frequentist analyses, the sampling distribution of whatever test statistic is used is altered and may not be readily determined, even approximately. Rothmann25 provided a discussion on how the
type I error probability changes across the boundary of a non-inferiority null
hypothesis and potential ways of addressing this problem when trying to
maintain a desired type I error rate.
Example A.6
Consider testing the equivalence hypotheses

Ho: θ ≤ 0.8 or θ ≥ 1.25 vs. Ha: 0.8 < θ < 1.25 (A.3)

where 0.8 and 1.25 are the equivalence limits. As in Examples A.3 and A.4, the
observed experimental to control hazard ratio is 0.91. We will define the p-value
consistent with Section A.1 as the (largest) probability of observing a hazard ratio
of 0.91 or more extreme (more in favor of the alternative hypothesis) if the null
hypothesis were true. It would seem reasonable, at least conceptually, that the closer the observed hazard ratio is to 1 in a relative sense (the closer the observed log-hazard ratio is to 0), the stronger the evidence against the null hypothesis in favor of “equivalence.” On the basis of that approach, the p-value is the largest probability of getting an observed hazard ratio between 0.91 and 1/0.91 when the null hypothesis is true, which equals 0.098. In practice, a p-value
is rarely calculated when performing an equivalence test. In general, equivalence
is concluded if a confidence interval (usually a 90% confidence interval) contains
only possibilities within the equivalence margin. For example, for the alternative
hypothesis of equivalence in Equation A.3, equivalence may be concluded if a
90% confidence interval for θ lies within (0.8, 1.25). As the 90% confidence inter-
val for θ is (0.772, 1.072), which is not contained in (0.8, 1.25), equivalence cannot be concluded. Here, the p-value is less than 0.10, but the 90% confidence interval contains possibilities in the null hypothesis. The relationship between inferences based on a p-value and inferences based on a confidence interval is thus different for equivalence tests than for a superiority test or a test of a difference.
For determining the posterior probability of the alternative hypothesis in Equation
A.3, a noninformative prior distribution will be used for the true log-hazard ratio,
and the estimated log-hazard ratio will be modeled as having a normal distribu-
tion with standard deviation of 0.1. The posterior distribution for the log-hazard
ratio is a normal distribution with mean ln(0.91) and standard deviation 0.1. The
posterior probability of the alternative hypothesis in Expression A.3 is 0.900. Note
that for the equivalence comparison, the posterior probability of the alternative
hypothesis did not equal 1 minus the p-value. If a 90% posterior probability were
required for a conclusion of equivalence, the result would lie on the boundary of
statistical significance.
Thus, while it seems that there is 90% confidence that 0.8 < θ < 1.25, the 90%
confidence interval for θ does not lie within the interval (0.8, 1.25). We note that
these types of equivalence hypotheses tend to be tested using a 90% confidence
interval in various settings, including generic drug settings. Schuirmann17 showed
that such a test has a maximum type I error rate of 0.05. This is the result of treat-
ing an equivalence test as performing two simultaneous one-sided tests based on
one-sided 95% confidence intervals, both of which need statistical significance at
a 5% level. The alternative hypotheses for the one-sided tests are Ha: 0.8 < θ and
Ha: θ < 1.25. More commonly, the p-value for the equivalence test is alternatively
defined as the maximum of the two p-values from the two one-sided tests. For
each of the one-sided tests, the definition of the p-value in Section A.1 is used. In
this example, the respective one-sided p-values are 0.099 and 0.0008, resulting
in a p-value of 0.099 for the equivalence test. This p-value is compared with 0.05
(the desired type I error rate), not 0.10. Since 0.099 > 0.05, equivalence is not demonstrated. Note also that this p-value corresponds to the largest-level confidence interval lying within (0.8, 1.25), which has confidence coefficient 1 – 2 × p-value. Here, the 80.2% confidence interval is (0.8, 1.035).
For this second definition of the p-value for the equivalence test, the distribution of the p-value at the least favorable parameter value in Ho: θ ≤ 0.8 or θ ≥ 1.25 is stochastically larger than a uniform distribution over (0, 1). Thus, this test can be conservative. For the first definition of the p-value for an equivalence test in this example, that least favorable distribution is a uniform distribution over (0, 1).
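A sketch of the calculations in Example A.6:

```python
import numpy as np
from scipy.stats import norm

est, se = np.log(0.91), 0.1
lo, hi = np.log(0.8), np.log(1.25)        # equivalence limits, log scale

p_lo = 1 - norm.cdf((est - lo) / se)      # Ha: theta > 0.8   -> ~ 0.099
p_hi = norm.cdf((est - hi) / se)          # Ha: theta < 1.25  -> ~ 0.0008
p_tost = max(p_lo, p_hi)                  # ~ 0.099, compared with 0.05

# posterior probability of equivalence under a noninformative prior
post_equiv = norm.cdf((hi - est) / se) - norm.cdf((lo - est) / se)   # ~ 0.900
ci90 = np.exp(est + np.array([-1, 1]) * norm.ppf(0.95) * se)  # ~ (0.77, 1.07)
print(p_tost, post_equiv, ci90)
```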
A.4.1 Stratification
Clinical trials are commonly randomized using permuted blocks.26 Furthermore, randomization is commonly stratified by some predefined prognostic
factor. With stratification, separate permuted blocks are generated for each
level of stratification, and subjects are assigned the next available random-
ized treatment from the stratum to which they belong. In this way, subjects
from each stratum are assigned to the various treatment groups in num-
bers approximately equal to the desired randomization ratio (exactly equal
to the extent that entire blocks are used). This is desired to balance the levels
of meaningful prognostic factors between arms. By doing so, any demon-
strated difference between the arms can be attributed to the difference in the
treatments received instead of one arm being allocated with better subjects
than the other arm.
With stratification, treatment arms tend to be more similar in the distribu-
tion of the stratification factors. A clinical trial in which treatment arms are
not well balanced can be subject to criticism and difficult to interpret, even
though the randomization procedure was fair in its assignment of subjects
to arms, and the calculation of the p-value accounts for any potential imbal-
ance. Without stratification, the treatment arms will be balanced on average;
with stratification, the balance will be much closer for the realized allocation
as well as for the mean among many theoretical realizations.
A second advantage of using stratification is that it allows for the use of
analyses that have greater power. When a stratification factor is used for the
randomization process, the analysis may adjust for the stratification factor
by either using that factor as a covariate in the analysis or by integrating the
results of the comparisons within each level of the factor (known as a “strati-
fied analysis”). For an analysis of covariance, including the factor as a covari-
ate in the model tends to reduce the associated standard error in estimating
the difference in means (the treatment effect). The analysis of covariance
allows the covariate to explain its contribution to the total variability of the
observed outcomes. The remaining variability is now the background vari-
ability in estimating the treatment effect. This is true even if the treatment
arms are balanced for the stratification factor, provided the stratification fac-
tor is correlated with outcome. For a stratified analysis and for Cox propor-
tional hazards models and logistic regression models, the use of a prognostic
covariate in the analysis allows for a comparison of “likes” between treat-
ment arms. That is, the comparison is not obscured by arbitrary differences
in the covariate between arms. In a stratified analysis, patients with similar
be problematic when each site enrolls few study subjects, as it creates many
strata, some of which may be confounded with treatment when all subjects
at a given site are randomized to the same treatment. When it is appropriate,
sites can be grouped by geographic region or by some other meaningful cri-
terion (e.g., grouping based on climate or based on the specialization of the
principal investigator whenever the climate or specialization has impact).
A.4.2 Analyses
Members of the population for which a new drug is being developed have
characteristics that are quite heterogeneous. For a clinical trial to be appli-
cable to the entire population (have external validity), the study population
will also be heterogeneous. When subjects are provided a treatment, this
subject variability contributes to the variability in the observed outcomes.
Restricting a study to only patients who have similar values for a very influ-
ential prognostic factor will lead to less variable outcomes and will require
fewer patients for the desired power. However, the results from such a
study will only be externally valid for people similar to the study subjects.
Variability can be reduced by stratifying the randomization and analysis,
or by adjusting the analysis for prognostic factors not used in the random-
ization process. Stratified and adjusted analyses therefore allow studies to
enroll a diverse group of subjects without requiring a dramatically larger
number of subjects as a study using a homogeneous group of subjects.27,28
When a clinical trial incorporates a randomization process that uses strati-
fication, the analysis commonly incorporates the stratification factor as a
covariate. More generally, when a potentially prognostic baseline variable
is identified before the study begins, an analysis may use this variable as
a covariate in the model whether or not it was used as a stratification fac-
tor in the randomization process. Whether these prognostic factors should
be included as covariates in the primary analysis model is a matter of some
controversy for superiority analyses.26,27 For non-inferiority analyses, the
same controversies exist along with other concerns.
Covariates that can be included in the analysis model are identified in one
of three ways: prospectively identified as being prognostic, retrospectively
identified as being correlated with response, or recognized as being imbal-
anced in the clinical trial under consideration. A factor that is prospectively
identified is the easiest to assess; a factor that is identified based on the data
observed in the study is more difficult, and its inclusion in the model may
introduce bias.
In general, prospectively identified factors can be included as covariates in
the analysis model without much controversy. If the covariate influences the
conclusion, the argument can be made that the inclusion of the factor in the
analysis was made before the data were known and, therefore, the conclu-
sion is not biased. One aspect of covariate analysis that may differ in non-
inferiority and superiority analyses is the issue of collinearity. Collinearity
occurs when two prognostic factors are related to each other and at least
one is related to the outcome variable. A model that includes both covari-
ates commonly shows neither covariate has a statistically significant effect
on the outcome. Thus, with each collinear variable considered after adjust-
ment for the other, an effect is not identified for either covariate. When one
of these two variables is the randomized treatment group (i.e., by not having
the covariate balanced between groups by the randomization), this can have
the effect of masking a real effect of treatment. For superiority analyses, this
has the effect of decreasing the chance of finding a relationship; for non-
inferiority analyses, the impact on conclusions is much less well understood.
We thus recommend caution in interpreting a non-inferiority clinical trial
in which the analysis uses a covariate not included in the randomization
process and in which there is considerable imbalance between comparative
treatment arms.
The impact of choosing covariates for inclusion in the model on the basis
of data observed in the model is even more difficult to defend. Such post
hoc model choices are subject to biases in superiority analyses as well as in
non-inferiority analyses. Additionally, choosing covariates on the basis of
baseline imbalances can have the effect of causing collinearity, obscuring
differences in treatment groups, and resulting in a biased estimate of treat-
ment effect.
To illustrate, consider a simple additive analysis of covariance (ANCOVA)
model, in which
yij = α + κi + γj + ε
where α is the grand mean, κi is the effect of treatment group i, γj is the effect
of categorical covariate j with two levels, ε is the error, and yij is the observed
value for a subject with associated covariate and treatment assignment. If the
covariate is predictive of the outcome as an additive effect to the treatment,
as expected by the model, the confidence interval calculated via ANCOVA
will tend to be shorter than the confidence interval calculated without con-
sideration of the covariate. The confidence interval for the true difference
in means, μE – μ C, is calculated using the estimated treatment effect and the
mean square error, and the null hypothesis is rejected if the lower bound of
the confidence interval is greater than –δ.
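A sketch of a covariate-adjusted non-inferiority analysis on simulated data; the margin δ = 1, the effect sizes, and the variable names are hypothetical, and the statsmodels package is assumed to be available:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({"trt": rng.choice(["C", "E"], n),
                   "cov": rng.choice(["low", "high"], n)})
# true E - C difference of 0.2; the covariate adds 3.0 when "high"
df["y"] = (0.2 * (df.trt == "E") + 3.0 * (df.cov == "high")
           + rng.normal(0, 4, n))

fit = smf.ols("y ~ trt + cov", data=df).fit()   # additive ANCOVA model
lo, hi = fit.conf_int().loc["trt[T.E]"]         # 95% CI for mu_E - mu_C
delta = 1.0                                     # hypothetical NI margin
print((lo, hi), "non-inferior" if lo > -delta else "not demonstrated")
```

Because the covariate explains part of the outcome variability, the adjusted interval will tend to be shorter than the unadjusted one, as the text describes.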
Despite the advantages of using covariates in the analysis, we caution
that all covariates, like other aspects of the analysis, must be prespeci-
fied. Including or excluding covariates to obtain a confidence interval that
excludes –δ, based on post hoc analyses, is not appropriate as it inflates the
chance of falsely concluding that the experimental drug is noninferior to the
standard drug.
It is also often important to investigate whether there is any interaction
between treatment and a prespecified covariate on the outcome of inter-
est. Such an interaction effect means that the difference in the effects of the
experimental and control therapies varies across the levels of the covariate.
For superiority testing, when this difference in effect always favors the same
therapy, the interaction is regarded as quantitative. If instead this difference in
effects sometimes favors the experimental therapy and sometimes favors the
control therapy, the interaction is regarded as qualitative. Determination of
whether the interaction is quantitative or qualitative involves comparing the
difference of effects with zero (zero being the value specified as the difference
in effects in the null hypothesis).29 For non-inferiority testing, the determina-
tion of whether the interaction is quantitative or qualitative involves compar-
ing the difference of effects with the non-inferiority margin. In a stratified
analysis, an advantage of the control group in one stratum that is larger than
the non-inferiority margin can be offset by an advantage of the experimental
group in another stratum, a situation akin to that of a qualitative interaction
and causing the two treatments, on average, to look similar.30 In such a situa-
tion, a non-inferiority analysis on the overall population may be problematic.
Examination of the treatment effect in each stratum to check for consistency
of effect will be important (see Chapter 10 for further details).
References
1. Dempster, A.P. and Schatzoff, M., Expected significance level as a sensibility index for test statistics, J. Am. Stat. Assoc., 60, 420–436, 1965.
2. Schatzoff, M., Sensitivity comparisons among tests of the general linear hypoth-
eses, J. Am. Stat. Assoc., 61, 415–435, 1966.
3. Hung, H.M.J. et al., The behavior of the p-value when the alternative hypothesis
is true, Biometrics, 53, 11–22, 1997.
4. Sackrowitz, H. and Samuel-Cahn, E., P-values as random variables—expected
p-values, Am. Stat., 53, 326–331, 1999.
5. Joiner, B.L., The median significance level and other small sample measures of
test efficacy, J. Am. Stat. Assoc., 64, 971–985, 1969.
6. Bhattacharya, B. and Habtzghi, D., Median of the p-value under the alternative
hypothesis, Am. Stat., 56, 202–206, 2002.
7. Fisher, R.A., The Design of Experiments, Oliver and Boyd, Edinburgh, 1935.
8. Hollander, M. and Wolfe, D.A., Nonparametric Statistical Methods, John Wiley,
New York, NY, 1973.
9. Good P., Permutation, Parametric, and Bootstrap Tests of Hypotheses, Springer, New
York, NY, 2005.
10. Box, G.E.P., Hunter, J.S., and Hunter, W.G., Statistics for Experimenters: An
Introduction to Design, Data Analysis, and Model Building, John Wiley, New York,
NY, 1978.
11. Wiens, B.L., Randomization as a basis for inference in non-inferiority trials,
Pharm. Stat., 5, 265–271, 2006.
12. Fisher, R.A., Statistical Methods for Research Workers, Oliver and Boyd, Edinburgh,
1925.
13. Barnard, G.A., Significance tests for 2 × 2 tables, Biometrika, 34, 123–138, 1947.
14. Chan, I.S.F., Exact tests of equivalence and efficacy with a non-zero lower bound
for comparative studies, Stat. Med., 17, 1403–1413, 1998.
15. Röhmel, J. and Mansmann, U., Unconditional non-asymptotic one-sided tests for independent binomial proportions when the interest lies in showing non-inferiority and/or superiority, Biom. J., 41, 149–170, 1999.
16. Dunnett, C.W. and Gent, M., Significance testing to establish equivalence between treatments with special reference to data in the form of 2 × 2 tables, Biometrics, 33, 593–602, 1977.
17. Schuirmann, D., A comparison of the two one-sided tests procedure and the
power for assessing the equivalence of average bioavailability, J. Pharmacokinet.
Pharm., 15, 657–680, 1987.
18. International Conference on Harmonization of Technical Requirements for
Registration of Pharmaceuticals for Human Use (ICH), E9: statistical princi-
ples for clinical trials, 1998, at http://www.ich.org/cache/compo/475-272-1
.html#E4.
19. Neyman, J. and Pearson, E.S., On the problem of the most efficient tests of statistical hypotheses, Philos. T. R. Soc. Lond., 231, 289–337, 1933.
20. Blackwelder, W.C., Proving the null hypothesis in clinical trials, Control. Clin.
Trials, 3, 345–353, 1982.
21. Carlin, B.P. and Louis, T.A., Bayes and Empirical Bayes Methods for Data Analysis,
Chapman and Hall, London, 1996.
22. Goodman, S.N., Toward evidence-based medical statistics: The P-value fallacy, Ann. Intern. Med., 130, 995–1004, 1999.
23. Simon R., Bayesian design and analysis of active control clinical trials, Biometrics,
55, 484–487, 1999.
24. Rothmann, M.D. and Tsou, H., On non-inferiority analysis based on delta-
method confidence intervals, J. Biopharm. Stat., 13, 565–583, 2003.
25. Rothmann, M., Type I error probabilities based on design-stage strategies with
applications to non-inferiority trials, J. Biopharm. Stat., 15, 109–127, 2005.
26. Senn, S., Added values: Controversies concerning randomization and additivity
in clinical trials, Stat. Med., 23, 3729–3753, 2004.
27. Friedman, L.M., Furberg, C.D., and DeMets, D.L., Fundamentals of Clinical Trials,
3rd Edition, Springer, New York, NY, 1998.
28. Montgomery, D.C., Design and Analysis of Experiments, John Wiley & Sons, New
York, NY, 1991.
29. Gail, M. and Simon, R., Testing for qualitative interactions between treatment effects and patient subsets, Biometrics, 41, 361–372, 1985.
30. Wiens, B.L. and Heyse, J.F., Testing for interaction in studies of non-inferiority,
J. Biopharm. Stat., 13, 103–115, 2003.