JAMA Guide to Statistics and Methods
The intention-to-treat (ITT) principle is a cornerstone in the interpretation of randomized clinical trials (RCTs) conducted with the goal of influencing the selection of medical therapy for well-defined groups of patients. The ITT principle defines both the study population included in the primary efficacy analysis and how the outcomes are analyzed. Under ITT, study participants are analyzed as members of the treatment group to which they were randomized regardless of their adherence to, or whether they received, the intended treatment.1-3 For example, in a trial in which patients are randomized to receive either treatment A or treatment B, a patient may be randomized to receive treatment A but erroneously receive treatment B, or never receive any treatment, or not adhere to treatment A. In all of these situations, the patient would be included in group A when comparing treatment outcomes using an ITT analysis. Eliminating study participants who were randomized but not treated or moving participants between treatment groups according to the treatment they received would violate the ITT principle.

Related article 36

In this issue of JAMA, Robertson et al conducted an RCT using a factorial design to compare transfusion thresholds of 10 and 7 g/dL and administration of erythropoietin vs placebo in 895 patients with anemia and traumatic brain injury.4 The primary outcome was the 6-month Glasgow Outcome Scale (GOS), dichotomized so a good or moderate score indicated success. The trial was conducted with high fidelity to the protocol so only a few patients did not receive the intended treatment strategy. Two patients randomized to the 7-g/dL study group were managed according to the 10-g/dL threshold and an additional 2 patients randomized to the 7-g/dL study group received one transfusion not according to protocol. The authors implemented the ITT principle and the outcomes for these 4 patients were included in the 7-g/dL group.

Use of the Method

Why Is ITT Analysis Used?
The effectiveness of a therapy is not simply determined by its pure biological effect but is also influenced by the physician's ability to administer, or the patient's ability to adhere to, the intended treatment. The true effect of selecting a treatment is a combination of biological effects, variations in compliance or adherence, and other patient characteristics that influence efficacy. Only by retaining all patients intended to receive a given treatment in their original treatment group can researchers and clinicians obtain an unbiased estimate of the effect of selecting one treatment over another.

Treatment adherence often depends on many patient and clinician factors that may not be anticipated or are impossible to measure and that influence response to treatment. For example, in the study by Robertson et al, some patients randomized to the higher transfusion threshold may not have received the intended therapeutic strategy due to adverse events associated with transfusion, fluid overload, or unwillingness of clinicians to adhere to the strategy for other reasons. These patients are likely to be fundamentally different from those who were actually treated using the 10-g/dL strategy. The characteristics that differ between patients who received the intended therapy and those who did not could easily influence whether a successful GOS score is achieved. If the ITT principle was not followed and patients were removed from their randomized group and either ignored or assigned to the other treatment group, the results of the analysis would be biased and no longer represent the effect of choosing one therapy over the other.

It is common to see alternative analyses proposed, eg, per-protocol or modified intent-to-treat (MITT) analyses.5 A per-protocol analysis includes only study participants who completed the trial without any major deviations from the study protocol; this usually requires that they successfully receive and complete their assigned treatment(s), complete their study visits, and provide primary outcome data. The requirements to be included in the per-protocol analysis vary from study to study. While the definition of an MITT analysis also varies from study to study, the MITT approach deviates from the ITT approach by eliminating patients or reassigning patients to a study group other than the group to which they were randomized. Neither of these approaches satisfies the ITT principle and may lead to clinically misleading results. It has been observed that studies using MITT analysis are more likely to be positive than those following a strict ITT approach.5 A comparison of results from ITT and per-protocol or MITT analyses may provide some indication of the potential effect of nonadherence on overall treatment effectiveness.

Noninferiority trials, which are designed to demonstrate that an experimental treatment is no worse than an established one, require special considerations with regard to the ITT principle.6-8 Consider a noninferiority trial of 2 treatments—treatment A is a biologically ineffective experimental therapy and treatment B is a biologically effective standard therapy—with the goal to demonstrate that treatment A is noninferior to B. Patients may be randomized to receive treatment B, not adhere to the treatment, and fail treatment due to their nonadherence. If this happens frequently, treatment B will appear less efficacious. Thus, the intervention in group A may incorrectly appear noninferior to the intervention in group B, simply as a result of nonadherence rather than because of similar biological efficacy. In this case, the ITT analysis is somewhat misleading because the noninferiority is a result of poor adherence. In a noninferiority trial, both ITT and per-protocol analyses should be conducted and reported. If the per-protocol results are similar to the ITT results, the claim of noninferiority is substantially strengthened.6-8

What Are the Limitations of ITT Analysis?
A characteristic of the ITT principle is that poor treatment adherence may result in lower estimates of treatment efficacy and a loss of study power. However, these estimates are clinically relevant because real-world effectiveness is limited by the ability of patients and clinicians to adhere to a treatment.

Because all patients must be analyzed under the ITT principle, it is essential that all patients be followed up and their primary outcomes determined. Patients who discontinue study treatments are often more likely to be lost to follow-up. Following the ITT principle will not eliminate bias associated with missing outcome data; steps must always be taken to keep missing data to a minimum and, when missing data are unavoidable, to use minimally biasing methods for adjusting for missing data (eg, multiple imputation).

Why Did the Authors Use ITT Analysis in This Particular Study?
Robertson et al4 used an ITT analysis because it allowed the effectiveness of their therapeutic strategies to be evaluated without bias due to differences in adherence. Failure to follow the ITT principle could have led to greater scrutiny of the trial results, especially if adherence to the intended treatments had been poorer.

Caveats to Consider When Looking at Results Based on ITT Analysis
Although the ITT principle is important for estimating the efficacy of treatments, it should not be applied in the same way in assessing the safety (eg, medication adverse effects) of interventions. For example, it would not make sense to attribute an apparent adverse effect to an intended treatment when, in fact, the patient was never exposed to the experimental drug. For this reason, safety analyses are generally conducted according to the treatment actually received, even though this may not accurately estimate—and may well overestimate—the burden of adverse effects likely to be seen in clinical practice.

While determining the effect of choosing one treatment over another, or over no treatment at all, is a key goal of trials conducted late in the process of drug and device development, the goals of trials conducted earlier in development are generally focused on narrower questions such as biological efficacy and dose selection. In these cases, MITT and per-protocol analysis strategies have a greater role in guiding the design and conduct of subsequent clinical trials. For example, it would be unfortunate to falsely conclude, based on the ITT analysis of a phase 2 clinical trial, that a novel pharmaceutical agent is not effective when, in fact, the lack of efficacy stems from too high a dose and patients' inability to be adherent because of intolerable adverse effects. In that case, a lower dose may yield clinically important efficacy and a tolerable adverse effect profile. A per-protocol analysis may be helpful in such a case, allowing the detection of the beneficial effect in patients able to tolerate the new therapy.
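The nonadherence mechanism described for noninferiority trials can be sketched numerically. All the rates below (a 50% success rate without effective therapy, a 70% success rate when treatment B is actually received, and the range of nonadherence proportions) are hypothetical values chosen for illustration, not figures from any trial discussed here:

```python
# Hypothetical illustration: nonadherence in the effective arm (B)
# shrinks the apparent ITT difference, so an ineffective therapy (A)
# can look "noninferior." All rates are assumed, not from any trial.
p_untreated = 0.50     # success rate without effective therapy (assumed)
p_b_treated = 0.70     # success rate when treatment B is actually received (assumed)
p_a_itt = p_untreated  # treatment A is biologically ineffective

for nonadherence in (0.0, 0.2, 0.4):
    # Under ITT, arm B mixes adherent and nonadherent patients.
    p_b_itt = (1 - nonadherence) * p_b_treated + nonadherence * p_untreated
    print(f"nonadherence {nonadherence:.0%}: apparent ITT difference "
          f"B - A = {p_b_itt - p_a_itt:.2f}")
```

As nonadherence rises from 0% to 40%, the apparent advantage of B over A falls from 0.20 to 0.12, narrowing the margin the ineffective therapy must meet to be declared noninferior; this is why noninferiority trials should report a per-protocol analysis alongside the ITT analysis.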
ARTICLE INFORMATION
Author Affiliations: Berry Consultants LLC, Austin, Texas (Detry, Lewis); Department of Emergency Medicine, Harbor-UCLA Medical Center; Los Angeles Biomedical Research Institute; and David Geffen School of Medicine at UCLA, Torrance, California (Lewis).
Corresponding Author: Roger J. Lewis, MD, PhD, Department of Emergency Medicine, Harbor-UCLA Medical Center, Bldg D9, 1000 W Carson St, Torrance, CA 90509 ([email protected]).
Conflict of Interest Disclosures: The authors have completed and submitted the ICMJE Form for Disclosure of Potential Conflicts of Interest and none were reported.

REFERENCES
1. Cook T, DeMets DL. Introduction to Statistical Methods for Clinical Trials. Boca Raton, FL: Chapman & Hall/CRC; Taylor & Francis Group; 2008:chap 11.
2. Schulz KF, Altman DG, Moher D; CONSORT Group. CONSORT 2010 statement: updated guidelines for reporting parallel group randomized trials. Ann Intern Med. 2010;152(11):726-732.
3. Food and Drug Administration. Guidance for industry: E9 statistical principles for clinical trials. http://www.fda.gov/downloads/Drugs/GuidanceComplianceRegulatoryInformation/Guidances/ucm073137.pdf. September 1998. Accessed May 11, 2014.
4. Robertson CS, Hannay HJ, Yamal J-M, et al; Epo Severe TBI Trial Investigators. Effect of erythropoietin and transfusion threshold on neurological recovery after traumatic brain injury: a randomized clinical trial. JAMA. doi:10.1001/jama.2014.6490.
5. Montedori A, Bonacini MI, Casazza G, et al. Modified versus standard intention-to-treat reporting: are there differences in methodological quality, sponsorship, and findings in randomized trials? a cross-sectional study. Trials. 2011;12:58.
6. Piaggio G, Elbourne DR, Pocock SJ, Evans SJ, Altman DG; CONSORT Group. Reporting of noninferiority and equivalence randomized trials: extension of the CONSORT 2010 statement. JAMA. 2012;308(24):2594-2604.
7. Le Henanff A, Giraudeau B, Baron G, Ravaud P. Quality of reporting of noninferiority and equivalence randomized trials. JAMA. 2006;295(10):1147-1151.
8. Mulla SM, Scott IA, Jackevicius CA, et al. How to use a noninferiority trial: Users' Guides to the Medical Literature. JAMA. 2012;308:2605-2611.
Original research articles in JAMA are selected for publication because the results are valid and findings provide important new clinical, research, or policy-related insights. To be current, clinicians must read and understand the primary research literature. By implication, this means also understanding increasingly complex methodologies and statistical analyses now used in clinical research. Clinicians may not be familiar with research methods introduced after they completed training. Because relatively little emphasis is placed in medical school on research methods and statistics, clinicians may never have learned enough about these topics to properly understand current research articles.

As an aid for readers, in this issue of JAMA, we introduce the JAMA Guide to Statistics and Methods. This new series of articles will provide explanations about statistical analytic approaches and methods used in research reported in JAMA articles, and they will be written in language practicing clinicians can understand. These explanations will be published concurrently with research articles that use the statistical test or methodological approach, thereby providing an example of the topic being discussed.

The challenge in balancing statistical rigor with reader comprehension dates back to one of the first uses of a χ² test in a JAMA article. A randomized clinical trial evaluating azacyclonol for schizophrenia treatment used χ² analysis to demonstrate a statistically significant treatment effect.1 The author concluded that since the P value of .0003 was less than .05, sufficient evidence existed to establish a hypothesis. A letter published in response to this paper pointed out that P values do not establish hypotheses: "No p, however small, can ever establish that a hypothesis is correct…p merely is the probability that if a given hypothesis is correct, then chi-square will be found at least as large as it was in fact found. The distinction may be made clear to nonmathematical readers by the following example from the game of bridge. The chance that if a deal is honest a particular player should be dealt 13 hearts is only 1 in 635,013,559,600; but if he is indeed dealt such a hand it would be quite erroneous—and perhaps even fatal—for him to conclude that the probability is only 1 in 635,013,559,600 that the deal was honest, that is, that it is virtually certain that the deal was crooked."2

Recognizing the need to help clinician readers better understand how to interpret scientific articles, JAMA launched the Users' Guides to the Medical Literature series in 1993.3 The Users' Guides help clinicians better understand the medical literature. There are articles about how to search the literature, how to interpret studies, and how to understand the nuances characteristic of review articles. The JAMA Guide to Statistics and Methods complements the Users' Guides by providing a more granular and specific discussion about statistics and research methodology used in an individual article.

The first JAMA Guide to Statistics and Methods article discusses intention-to-treat (ITT) analysis4 as it relates to a research article in this issue.5 In addition to providing a general description of the ITT principle, the Guide to Statistics and Methods in this issue explains how this principle was applied in the research study it accompanies. By pairing a JAMA Guide to Statistics and Methods with a specific article, we hope the learning experience will be enhanced.

JAMA Guide to Statistics and Methods articles will be written in plain English, avoid complex mathematics, and present material graphically whenever possible. We distinguish between statistics and methods: statistics are mathematical approaches to describing collections of data whereas methods refer to how a study was designed or some other general approach to how a study was organized and conducted.

JAMA Guide to Statistics and Methods articles will explain why a particular test or method was used, what its limitations are, discuss risks of bias, and examine why the study authors used the particular test. These articles will explain how the findings from statistical tests should be interpreted in the accompanying JAMA research article. Also discussed will be the limitations of interpreting the data given the methodology used to examine it.

Because medical information is vast and rapidly expanding, physicians must pursue life-long learning. This requires reading and understanding research articles published in medical journals. Research articles cannot be assessed if the statistical analysis and research methodology used are not understood by the readers. Along with the Users' Guides to the Medical Literature, the new JAMA Guide to Statistics and Methods will help readers better understand clinical research reports that in turn will help them provide better patient care.
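The 1-in-635,013,559,600 figure in the quoted bridge example is 1 over C(52, 13), the number of distinct 13-card hands that can be dealt from a 52-card deck, and can be checked directly:

```python
from math import comb

# Number of distinct 13-card bridge hands from a 52-card deck;
# exactly one of them is all 13 hearts.
hands = comb(52, 13)
print(f"{hands:,}")  # 635,013,559,600
```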
ARTICLE INFORMATION
Author Affiliation: Deputy Editor, JAMA.
Corresponding Author: Edward H. Livingston, MD, JAMA, 330 N Wabash Ave, Chicago, IL 60611 ([email protected]).

REFERENCES
1. Odland TM. Azacyclonol (frenquel) hydrochloride in the treatment of chronic schizophrenia; a double-blind, controlled study. J Am Med Assoc. 1957;165(4):333-335.
2. Zeisler EB. Treatment of chronic schizophrenia. JAMA. 1957;165(13):1739.
3. Guyatt GH, Rennie D. Users' guides to the medical literature. JAMA. 1993;270(17):2096-2097.
4. Detry MA, Lewis RJ. The intention-to-treat principle: how to assess the true effect of choosing a medical treatment. JAMA. doi:10.1001/jama.2014.7523.
5. Robertson CS, Hannay HJ, Yamal J-M, et al; Epo Severe TBI Trial Investigators. Effect of erythropoietin and transfusion threshold on neurological recovery after traumatic brain injury: a randomized clinical trial. JAMA. doi:10.1001/jama.2014.6490.
the combination therapy were more likely to achieve continuous abstinence at 12 weeks than patients receiving varenicline alone. The absolute difference in the abstinence rate was estimated to be approximately 14%, which was statistically significant at level α = .05.

These findings differed from the results reported in 2 previous studies2,3 of the same question, which detected no difference in treatments. What explains this difference? One explanation offered by the authors is that the previous studies "…may have been inadequately powered," which means the sample size in those studies may have been too small to identify a difference between the treatments tested.

Figure. Power (%) vs sample size for each group (No.). For a baseline rate of 45% and a minimum detectable difference (MDD) of 14%, the target sample size of 398 (199 in each group) will produce a power of 80% when α is set to .05. When the MDD is 12%, the resulting sample size is 542 (2 × 271) to achieve a power of 80%.
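The Figure's sample sizes can be approximated with the standard normal-approximation formula for comparing two independent proportions. This is a sketch using the pooled-variance form of the formula; the software the authors used is not specified, so the results differ slightly from the reported 199 and 271 per group:

```python
from math import ceil
from statistics import NormalDist

def n_per_group(p1, p2, alpha=0.05, power=0.80):
    """Approximate per-group sample size for a two-sided comparison of
    two independent proportions (pooled-variance normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = .05
    z_beta = NormalDist().inv_cdf(power)           # about 0.84 for 80% power
    p_bar = (p1 + p2) / 2                          # pooled proportion
    variance = 2 * p_bar * (1 - p_bar)             # pooled variance term
    return ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)

# Baseline abstinence rate 45%; MDD of 14% and 12% as in the Figure.
print(n_per_group(0.45, 0.59))  # close to the 199 per group reported
print(n_per_group(0.45, 0.57))  # close to the 271 per group reported
```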
firm a difference that is slightly smaller for a related treatment was considered scientifically important.

What Are the Limitations of Power Analysis?
Calculation of sample size requires predictions of baseline rates and MDD, which may not be readily available, before the study begins. The sample size is especially sensitive to the MDD. This is illustrated by the blue line in the Figure, which shows the sample size needed in this study if the MDD were set to 12%. The resulting sample size is 542 (2 × 271) to achieve a power of 80%.

This method of conducting a power analysis might also produce the incorrect sample size if the data analysis conducted differs from that planned. For example, if abstinence were affected by other covariates, such as age, and the groups were unbalanced on this variable, other analyses might be used, such as logistic regression models accounting for covariate differences. The sample size that would be appropriate for one analysis may be too large or small to achieve the same power with another analytic procedure.

Why Did the Authors Use Power Analysis in This Particular Study?
The number of research participants available for any study is limited by resources. However, the authors were aware that previous studies comparing these treatments had found no significant difference in abstinence rates. This can occur even if a difference exists if the sample size is too small. The authors wanted to ensure that their sample size was adequate to detect even a small but clinically important difference, so they carefully evaluated sample size.

How Should This Method's Findings Be Interpreted in This Particular Study?
A power analysis can help with the interpretation of study findings when statistically significant effects are not found. However, because the findings in the study by Koegelenberg et al1 were statistically significant, interpretation of a lack of significance was unnecessary. If no statistically significant difference in abstinence rates had been found, the authors could have noted that, "The study was sufficiently powered to have a high chance of detecting a difference of 14% in abstinence rates. Thus, any undetected difference is likely to be of little clinical benefit."

Caveats to Consider When Looking at Results Based on Power Analysis
Sample size calculation based on any power analysis requires input from the researcher prior to the study. Some of these are assumptions and predictions of fact (such as the baseline rate), which may be incorrect. Others reflect the clinical judgment of the researcher (eg, MDD), with which the reader may disagree. If a statistically significant effect is not found, the reader should assess whether either of these are concerns.

The reader should also not interpret a lack of significance for an outcome other than the one on which the power analysis was based as confirmation that no difference exists, because the analysis is specific to the parameter settings. For example, no significant difference was found in this study for most adverse event rates, although the power analysis does not apply to these rates. Thus, the sample size may not be adequate to interpret that finding to confirm that no meaningful difference in these outcomes exists.
ARTICLE INFORMATION
Author Affiliation: Department of Statistical Science, Southern Methodist University, Dallas, Texas.
Corresponding Author: Lynne Stokes, PhD, Department of Statistical Science, Southern Methodist University, PO Box 750100, Dallas, TX 75275 ([email protected]).
Conflict of Interest Disclosures: The author has completed and submitted the ICMJE Form for Disclosure of Potential Conflicts of Interest and none were reported.

REFERENCES
1. Koegelenberg CFN, Noor F, Bateman ED, et al. Efficacy of varenicline combined with nicotine replacement therapy vs varenicline alone for smoking cessation: a randomized clinical trial. JAMA. doi:10.1001/jama.2014.7195.
2. Hajek P, Smith KM, Dhanji AR, McRobbie H. Is a combination of varenicline and nicotine patch more effective in helping smokers quit than varenicline alone? a randomised controlled trial. BMC Med. 2013;11:140.
3. Ebbert JO, Burke MV, Hays JT, Hurt RD. Combination treatment with varenicline and nicotine replacement therapy. Nicotine Tob Res. 2009;11(5):572-576.
4. Besada NA, Guerrero AC, Fernandez MI, Ulibarri MM, Jiménez-Ruiz CA. Clinical experience from a smokers clinic combining varenicline and nicotine gum. Eur Respir J. 2010;36(suppl 54):462s.
Problems can arise when researchers try to assess the statistical significance of more than 1 test in a study. In a single test, statistical significance is often determined based on an observed effect or finding that is unlikely (<5%) to occur due to chance alone. When more than 1 comparison is made, the chance of falsely detecting a nonexistent effect increases. This is known as the problem of multiple comparisons (MCs), and adjustments can be made in statistical testing to account for this.1

Related article page 502

In this issue of JAMA, Saitz et al2 report results of a randomized trial evaluating the efficacy of 2 brief counseling interventions (ie, a brief negotiated interview and an adaptation of a motivational interview, referred to as MOTIV) in reducing drug use in primary care patients when compared with not having an intervention. Because MCs were made, the authors adjusted how they determined statistical significance. In this article, we explain why adjustment for MCs is appropriate in this study and point out the limitations, interpretations, and cautions when using these adjustments.

Use of Method

Why Are Multiple Comparison Procedures Used?
When a single statistical test is performed at the 5% significance level, there is a 5% chance of falsely concluding that a supposed effect exists when in fact there is none. This is known as making a false discovery or having a false-positive inference. The significance level represents the risk of making a false discovery in an individual test, denoted as the individual error rate (IER). If 20 such tests are conducted, there is a 5% chance of making a false-positive inference with each test so that, on average, there will be 1 false discovery in the 20 tests.

Another way to view this is in terms of probabilities. If the probability of making a false conclusion (ie, false discovery) is 5% for a single test in which the effect does not exist, then 95% of the time, the test will arrive at the correct conclusion (ie, insignificant effect). With 2 such tests, the probability of finding an insignificant effect with the first test is 95%, as it is for the second. However, the probability of finding insignificant effects in the first and the second test is 0.95 × 0.95, or 90%. With 20 such tests, the probability that all of the 20 tests correctly show insignificance is (0.95)^20, or 36%. So there is a 100% − 36%, or 64%, chance of at least 1 false-positive test occurring among the 20 tests. Because this probability quantifies the risk of making any false-positive inference by a group, or family, of tests, it is referred to as the family-wise error rate (FWER). The FWER generally increases as the number of tests performed increases. For example, assuming IER = 5% and denoting the number of multiple tests performed as K, then for K = 2 independent tests, FWER = 1 − (0.95)^2 = 10%; for K = 3, FWER = 1 − (0.95)^3 = 14%; and for K = 20, FWER = 1 − (0.95)^20 = 64%. This shows that the risk of making at least 1 false discovery in MCs can be greatly inflated even if the error rate is well controlled in each individual test.

When MCs are made, to control FWER at a certain level, the threshold for determining statistical significance in individual tests must be adjusted.1 The simplest approach is known as the Bonferroni correction. It adjusts the statistical significance threshold by the number of tests. For example, for a FWER fixed at 5%, the IER in a group of 20 tests is set at 0.05/20 = 0.0025; ie, an individual test would have to have a P value less than .0025 to be considered statistically significant. The Bonferroni correction is easy to implement, but it sets the significance threshold too rigidly, reducing the statistical procedure's power to detect true effects.

The Hochberg sequential procedure, which was used in the study by Saitz et al,2 takes a different approach.3 All of the tests (the multiple comparisons) are performed and the resultant P values are ordered from largest to smallest on a list. If the FWER is fixed at 5% and the largest observed P value is less than .05, then all the tests are considered significant. Otherwise, if the next largest P value is less than 0.05/2 (.025), then all the tests except the one with the largest P value are considered significant. If not, and the third P value in the list is less than 0.05/3 (.017), then all the tests except those with the largest 2 P values are considered significant. This is continued until all the comparisons are made. This approach uses progressively more stringent statistical thresholds with the most stringent one being the Bonferroni threshold, and thus the approach can achieve a greater power to detect true effect than the Bonferroni procedure under appropriate conditions. An example in the Table consists of 6 tests in MCs; given a FWER of 5%, none of the tests
Table. An Example to Compare the Bonferroni Procedure and the Hochberg Sequential Procedure

Test  P Value  Bonferroni Threshold  Bonferroni Result  Hochberg Threshold  Hochberg Result
1     .40      0.05/6 = 0.008        Not significant    0.05                Not significant
2     .027     0.05/6 = 0.008        Not significant    0.05/2 = 0.025      Not significant
3     .020     0.05/6 = 0.008        Not significant    0.05/3 = 0.017      Not significant
4     .012     0.05/6 = 0.008        Not significant    0.05/4 = 0.0125     Significant
5     .011     0.05/6 = 0.008        Not significant    NA                  Significant
6     .010     0.05/6 = 0.008        Not significant    NA                  Significant

Abbreviation: NA, not applicable.
are significant with the Bonferroni procedure. By comparison, 3 tests Suppose in a different study, brief negotiated interview was
are significant with the Hochberg sequential procedure. intended to treat alcohol use and MOTIV was intended to treat
drug use. Then there is no need to adjust for MCs. This is in con-
What Are the Limitations of Multiple Comparison Procedures? trast to performing a family of tests from which the results as a
Statistical procedures to control FWER in MCs were developed to whole address a single research question; then adjusting for MCs
reduce the risk of making any false-positive discovery. This is offset is necessary. As in the report by Saitz et al,2 both the brief negoti-
by having a lower test power to detect true effects. For example, ated interview and MOTIV were compared with the control to
when K = 10, the Bonferroni-corrected IER is 0.05/10 = 0.005 to draw a single conclusion about the efficacy of brief counseling
control FWER at 0.05. Under the conventional 2-sided t test, for a interventions for drug use.
single test in the group to be considered significant, the observed
effect needs to be 43% larger than that with an IER = 0.05. When Confirmatory vs Exploratory
K = 20, the Bonferroni-corrected IER is 0.05/20 = 0.0025, and the Bender and Lange 5 suggested that MC procedures are only
observed effect needs to be 54% larger than that with an IER = 0.05. required for confirmatory studies for which the goal is the defini-
This limitation of reduced test power by controlling FWER be- tive proof of a predefined hypothesis to support final decision
comes more apparent as the number of tests in MCs increases. making. For exploratory studies seeking to generate hypotheses
that will be tested in future confirmatory studies, the number of
Why Did the Authors Use Multiple Comparison Procedures in This Particular Study?
In the study by Saitz et al, 2 tests were performed (brief negotiated interview vs no brief interview and MOTIV vs no brief interview) to determine if interventions with brief counseling were more effective in reducing drug use than interventions without counseling. With 2 tests and the IER set at 5%, the risk of falsely concluding at least 1 treatment is effective because of chance alone is 10%. To avoid the inflated FWER, the authors used the Hochberg sequential procedure.3

How Should This Method's Findings Be Interpreted in This Particular Study?
Saitz et al found that the adjusted P value4 based on the Hochberg procedure was .81 for both the brief negotiated interview and MOTIV vs no brief interview. The study did not provide sufficient evidence to claim that interventions with brief counseling were more effective than the one without brief counseling in reducing drug use among primary care patients. However, the absence of evidence does not mean there is an absence of an effect. The interventions may be effective, but this study did not have the statistical power to detect the effect.

…tests is usually large and the choice of hypotheses is likely data dependent (ie, selecting hypotheses after reviewing data), making MC adjustments unnecessary or even impossible at this stage of research. "Significant" results based on exploratory studies, however, should be clearly labeled so readers can correctly assess their scientific strength.

FWER vs FDR
The main approaches to MC adjustment include controlling FWER, which is the probability of making at least 1 false discovery in MCs, or controlling the false discovery rate (FDR), which is the expected proportion of false positives among all discoveries. When using the FDR approaches, a small proportion of false positives is tolerated to improve the chance of detecting true effects.6 In contrast, the FWER approaches avoid any false positives even at the cost of increased false negatives. The FDR and FWER represent 2 extremes of the relative importance of controlling for false positives or false negatives. The decision whether to control FWER or FDR should be made by carefully weighing the relative benefits between false-positive and false-negative discoveries in a specific study.
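The FWER inflation with 2 tests at a 5% individual error rate, and the Hochberg step-up adjustment, can be sketched in a few lines of Python. This is an illustration, not the authors' code, and the raw P values below are hypothetical, chosen only so that both adjusted values come out near the .81 reported by Saitz et al.

```python
# Illustrative sketch (not the authors' code): family-wise error rate (FWER)
# inflation with k independent tests, and Hochberg step-up adjusted P values.

def fwer(alpha, k):
    """Probability of at least 1 false positive among k independent tests."""
    return 1 - (1 - alpha) ** k

def hochberg_adjust(p_values):
    """Hochberg step-up adjusted P values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending P
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity.
    for rank in range(m - 1, -1, -1):
        i = order[rank]
        running_min = min(running_min, (m - rank) * p_values[i])
        adjusted[i] = min(running_min, 1.0)
    return adjusted

# Two tests, each at a 5% individual error rate, inflate the FWER to about 10%.
print(fwer(0.05, 2))  # about 0.0975

# Hypothetical raw P values (not the study's); with 2 tests, the smaller raw
# P value is doubled unless capped by the larger one.
print(hochberg_adjust([0.60, 0.81]))  # [0.81, 0.81]
```

Note that with 2 tests the Hochberg adjustment reduces to a simple rule: the larger P value is unchanged, and the smaller is doubled but never exceeds the larger.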
Related article page 1313

When assessing the clinical utility of therapies intended to improve subjective outcomes, the amount of improvement that is important to patients must be determined.1 The smallest benefit of value to patients is called the minimal clinically important difference (MCID). The MCID is a patient-centered concept, capturing both the magnitude of the improvement and also the value patients place on the change. Using patient-centered MCIDs is important for studies involving patient-reported outcomes,2 for which the clinical importance of a given change may not be obvious to clinicians selecting treatments. The MCID defines the smallest amount an outcome must change to be meaningful to patients.1

In this issue of JAMA, Hinman et al3 report findings from a clinical trial evaluating whether acupuncture (needle, laser, and sham laser) improved pain or overall functional outcomes compared with no acupuncture among patients with chronic knee pain. Pain was measured on a numerical rating scale and functional status by the Western Ontario and McMaster Universities Osteoarthritis Index (WOMAC) score. The MCIDs for both end points were based on prior experience with these scoring systems. The MCID for pain was determined using an expert consensus, or Delphi approach,4 while the MCID for function was determined using an "anchor" approach, based on patients' qualitative assessments of their own responses to treatment.5

Use of the Method
Why Is the MCID Used?
The appropriate clinical interpretation of changes on a numerical scale must consider not only statistical significance, but also whether the observed change is meaningful to patients. Identical changes on a numerical scale may have different clinical importance in different patient populations (eg, different ages, disease severity, injury type). Furthermore, statistical significance is linked to the sample size. Given a large enough sample, statistical significance between groups may occur with very small differences that are clinically meaningless.6

When determining how many patients to enroll in a study, the calculation usually reflects the intention to reliably find a clinically important effect of a treatment, such as the MCID. The smaller the treatment effect sought, the larger the required number of study participants.7

The MCID can be calculated using consensus, anchor, and distribution-based methods. Consensus (also known as Delphi) methods convene an expert panel to provide independent assessments of what constitutes a clinically relevant change. The assessments are then revised after the panel members review each other's assessments. This process is repeated until a consensus is reached regarding a numerical value for the MCID. The MCID for the pain assessment scale used in Hinman et al3 was determined by the Delphi method using a panel of 6 rheumatology experts.4

Anchor-based methods determine the MCID by associating the change in the numerical scale for an outcome to some other subjective and independent assessment of improvement. For example, patients may be asked if they felt "about the same," "a little bit better," or "quite a bit better" after receiving treatment. These categorical responses are then related to the numerical measurement scale used in the study, "anchoring" the numerical outcome scale to the qualitative, categorical assessment that is presumably more meaningful to patients. The MCID for the WOMAC measure of functional status in the study by Hinman et al3 was based on the 75th percentile of the WOMAC score; 75% of patients categorizing themselves as having experienced benefit (the anchor) had an improvement equal to or larger than the derived MCID using this definition.5

Distribution-based methods for defining the MCID involve neither expert opinion nor patient assessments. These methods rely on the statistical properties of the distribution of outcome scores, particularly how widely the scores vary between patients. These methods determine what magnitude of change is required to show that the change in an outcome measure in response to an intervention is more than would be expected from chance alone. Because distribution-based methods are not derived from individual patients, they probably should not be used to determine an MCID.6

What Are the Limitations of MCID Derivation Methods?
Consensus methods use clinical and domain experts, rather than patients, to define the MCID. In many settings, expert opinion may not be a valid and reliable way to determine what is important to patients.

Anchor-based methods are limited by the choice of anchor, which is a subjective assessment. For example, when an anchor is based on asking a patient whether he or she improved after receiving treatment, the response may be susceptible to recall bias. A patient's current status tends to influence recollection of the past. The anchor's validity and reliability are crucial for determination of a valid MCID.

Anchor-based methods may be influenced by the statistical distribution of scores within each category of the anchor. If the data are highly skewed, such as occurs with length-of-stay information because of the occasional outlying patient with a complicated clinical course, the derivation of the MCID may be affected by the outliers. Furthermore, anchor methods often rely on an MCID estimate derived from only a subset of patients (those within a particular category of the anchor). Not accounting for information from patients outside of this group may result in erroneous MCID
estimates if the characteristics of the excluded patients differ from those who were included.

Because distribution-based methods are based on purely statistical reasoning, they can only identify a minimal detectable effect: that is, an effect that is unlikely to be attributable to random measurement error. The lack of an anchor that links the numeric scores to an assessment of what is important to patients causes distribution-based methods to fall short of identifying important, clinically meaningful outcomes for patients. In fact, the term MCID is sometimes replaced by "minimal detectable change" when the difference is calculated by distribution-based methods.6 Distribution-based methods are not recommended as a first-line means for determining MCID.

Ideally, determination of the MCID should consider different thresholds in different subsets of the population. For example, patients with substantial amounts of pain at baseline might require greater pain reduction to perceive treatment benefit compared with those patients who have little baseline pain.

Why Did the Authors Use MCID in This Particular Study?
Hinman et al3 specified an MCID for each end point to establish an appropriate sample size for their study and to facilitate clinically meaningful interpretation of the final outcome data. The number of patients enrolled was selected to provide sufficient power (ie, probability) for detecting a change in outcomes resulting from the intervention that was at least as large as the MCID for each end point.

How Should MCID Findings Be Interpreted in This Particular Study?
The actual treatment effects observed in the study of Hinman et al3 were quite modest, ranging from an improvement of 0.9 to 1.2 units in pain, relative to an MCID of 1.8 units, and an improvement of 4.4 to 5.1 units in function, relative to an MCID of 6 WOMAC units. Although there were statistically significant differences between groups, the clinical importance of these differences is uncertain.3

Caveats to Consider When Looking at Results Based on MCIDs
In the study by Hinman et al3 the observed effect is smaller than the predefined MCID, yet the differences between groups still achieved statistical significance. This phenomenon is not uncommon and occurred in another recently published study in JAMA on the effect of vagal nerve stimulation for obesity treatment.8 This occurs because the study sample size is selected to achieve a high probability of detecting a benefit equal to the MCID, resulting in a substantial chance of demonstrating statistical significance even when the effect of an intervention is smaller than the MCID.

In the study by Hinman et al,3 acupuncture therapies were compared with control groups by measuring difference in mean changes in pain and function scores. An alternative experimental design would be based on a "responder analysis," namely, comparing the proportion of patients within each therapy who experienced a change greater than the MCID. This type of data presentation could be more informative because it focuses on patients who experience an improvement at least as large as the MCID.2 This approach is useful when the data are highly skewed by outliers in such a way that the calculated mean value may be above the MCID even when most patients do not have an effect greater than the MCID.

A fundamental aspect of MCIDs that is often ignored is the need to consider potential improvements from an intervention in relation to costs and complications. When selecting an MCID for a clinical trial, defining a meaningful improvement from the patient's perspective ideally involves considering all aspects of clinical care, both favorable and unfavorable.
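The anchor-based derivation and the responder analysis described above can be sketched with hypothetical data. Everything below is illustrative: the score changes, anchor responses, and the 1.8-unit MCID are invented for the example, not taken from the trial.

```python
import math

# Sketch with hypothetical data (not the trial's): one anchor-based MCID
# definition, plus a responder analysis against a prespecified MCID.

def nearest_rank_percentile(values, q):
    """Smallest value with at least q% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]

def anchor_based_mcid(score_changes, anchor_improved, coverage=75):
    """Change met or exceeded by `coverage`% of self-rated improvers.

    Requiring 75% of improvers to be at or above the threshold is the same
    as taking the 25th percentile of the improvers' score changes.
    """
    improvers = [c for c, improved in zip(score_changes, anchor_improved) if improved]
    return nearest_rank_percentile(improvers, 100 - coverage)

def responder_rate(improvements, mcid):
    """Proportion of patients whose improvement is at least the MCID."""
    return sum(1 for x in improvements if x >= mcid) / len(improvements)

changes = [1, 2, 3, 4, 5, 6, 7, 8]                              # higher = better
improved = [False, False, True, True, True, True, True, True]  # anchor responses
print(anchor_based_mcid(changes, improved))  # 4

# One outlier can push the mean improvement above an MCID of 1.8 even though
# only 1 of 6 patients actually reaches it, which is the skewed-data situation
# where a responder analysis is more informative than a mean difference.
treatment = [0.5, 0.8, 1.0, 1.2, 1.5, 9.0]
print(sum(treatment) / len(treatment))       # mean, about 2.33
print(responder_rate(treatment, mcid=1.8))   # 1 of 6 patients
```

The contrast in the last three lines illustrates the caveat above: the mean change clears the MCID while five of six patients do not.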
ARTICLE INFORMATION
Author Affiliations: Berry Consultants, Austin, Texas (McGlothlin, Lewis); Department of Emergency Medicine, Harbor-UCLA Medical Center, Torrance, California (Lewis); Los Angeles Biomedical Research Institute, Torrance, California (Lewis); David Geffen School of Medicine at UCLA, Los Angeles, California (Lewis).
Corresponding Author: Roger J. Lewis, MD, PhD, Department of Emergency Medicine, Harbor-UCLA Medical Center, Bldg D9, 1000 W Carson St, Torrance, CA 90509 ([email protected]).
Conflict of Interest Disclosures: Both authors have completed and submitted the ICMJE Form for Disclosure of Potential Conflicts of Interest and none were reported.

REFERENCES
1. Jaeschke R, Singer J, Guyatt GH. Measurement of health status: ascertaining the minimal clinically important difference. Control Clin Trials. 1989;10(4):407-415.
2. Guidance for industry: patient-reported outcome measures: use in medical product development to support labeling claims. Food and Drug Administration. http://www.fda.gov/downloads/Drugs/Guidances/UCM193282.pdf. Accessed August 27, 2014.
3. Hinman RS, McCrory P, Pirotta M, et al. Acupuncture for chronic knee pain: a randomized clinical trial. JAMA. doi:10.1001/jama.2014.12660.
4. Bellamy N, Carette S, Ford PM, et al. Osteoarthritis antirheumatic drug trials: III, setting the delta for clinical trials: results of a consensus development (Delphi) exercise. J Rheumatol. 1992;19(3):451-457.
5. Tubach F, Ravaud P, Baron G, et al. Evaluation of clinically relevant changes in patient reported outcomes in knee and hip osteoarthritis: the minimal clinically important improvement. Ann Rheum Dis. 2005;64(1):29-33.
6. Turner D, Schünemann HJ, Griffith LE, et al. The minimal detectable change cannot reliably replace the minimal important difference. J Clin Epidemiol. 2010;63(1):28-36.
7. Livingston EH, Elliot A, Hynan L, Cao J. Effect size estimation: a necessary component of statistical analysis. Arch Surg. 2009;144(8):706-712.
8. Ikramuddin S, Blackstone RP, Brancatisano A, et al. Effect of reversible intermittent intra-abdominal vagal nerve blockade on morbid obesity: the ReCharge randomized clinical trial. JAMA. 2014;312(9):915-922.
Related articles pages 2364 and 2374

Observational studies are commonly used to evaluate the changes in outcomes associated with health care policy implementation. An important limitation in using observational studies in this context is the need to control for background changes in outcomes that occur with time (eg, secular trends affecting outcomes). The difference-in-differences approach is increasingly applied to address this problem.1

In this issue of JAMA, studies by Rajaram and colleagues2 and Patel and colleagues3 used the difference-in-differences approach to evaluate the changes that occurred following the 2011 Accreditation Council for Graduate Medical Education (ACGME) duty hour reforms. The 2 studies were conducted with different data sources and study populations but used similar methods.

Use of the Method
Why Was the Difference-in-Differences Method Used?
The association between policy changes and subsequent outcomes is often evaluated by pre-post assessments. Outcomes after implementation are compared with those before. This design is valid only if there are no underlying time-dependent trends in outcomes unrelated to the policy change. If clinical outcomes were already improving before the policy, then using a pre-post study would lead to the erroneous conclusion that the policy was associated with better outcomes.

The difference-in-differences study design addresses this problem by using a comparison group that is experiencing the same trends but is not exposed to the policy change.4 Outcomes after and before the policy are compared between the comparison group without the exposure (group A) and the study group with the exposure (group B), which allows the investigator to subtract out the background changes in outcomes. Two differences in outcomes are important: the difference after vs before the policy change in the group exposed to the policy (B2 − B1, Figure) and the difference after vs before the date of the policy change in the unexposed group (A2 − A1). The change in outcomes that are related to implementation of the policy beyond background trends can then be estimated from the difference-in-differences analysis as follows: (B2 − B1) − (A2 − A1). If there is no relationship between policy implementation and subsequent outcomes, then the difference-in-differences estimate is equal to 0 (Figure, A). In contrast, if the policy is associated with beneficial changes, then the outcomes following implementation will improve to a greater extent in the exposed group. This will be shown by the difference-in-differences estimate (Figure, B).

Figure. A, No association between exposure and measured outcome. B, Association of exposure and measured outcome. Each panel plots improved outcome against time for an exposure group and a no-exposure group, showing the preexposure means (A1, B1) and postexposure means (A2, B2) and the differences A2 − A1 and B2 − B1 across the preexposure and postexposure periods.

These estimates are derived from regression models rather than simple subtraction. Using regression modeling allows the estimates to be adjusted for other factors (eg, patient or hospital characteristics) that may differ between the groups.4 Regression models also offer a way to estimate the statistical significance of the association between policy change and outcomes, by including a variable that indicates if the observation is in the pre or post period and another variable that divides the groups into those exposed and unexposed to the policy.

Statistically, the association between policy implementation and outcomes is estimated by examining the interaction between the pre-post and exposed-unexposed variables. If the association exists,
this interaction term will be significantly different from zero. Other design and statistical issues should be considered when performing difference-in-differences analysis and are considered in detail elsewhere.1,5

What Are the Limitations of the Difference-in-Differences Method?
The 2 main assumptions of difference-in-differences analysis are parallel trends and common shocks.4 The parallel trends assumption states that the trends in outcomes between the treated and comparison groups are the same prior to the intervention (Figure). If true, it is reasonable to assume that these parallel trends would continue for both groups even if the program was not implemented. This is tested empirically by examining the trends in both groups before the policy was implemented. In a regression model, this is evaluated by assessing the significance of the interaction term between time and policy exposure in the preintervention period. If the trends are significantly different prior to the intervention, a difference-in-differences analysis would be biased and a different comparison group should be sought.

In economics, a shock is an unexpected or unpredictable event (unrelated to the policy) that affects a system. The common shocks assumption states that any events occurring during or after the time the policy changed will equally affect the treatment and comparison groups. A key limitation to implementing a difference-in-differences design is finding a control group for which these assumptions are met. Ideally, the only difference between the comparison group and the study group would be exposure to the policy. In practice, such a group may be difficult to find.

Why Did the Authors Use the Difference-in-Differences Method?
The studies by Rajaram et al2 and Patel et al3 both used the difference-in-differences method to control for background trends in patient outcomes. The study by Rajaram et al, conducted using a large clinical registry for surgical patients (American College of Surgeons National Surgical Quality Improvement Program), evaluated several clinical outcomes (mortality, serious morbidity, readmission, failure to rescue) and American Board of Surgery pass rates after vs before the 2011 ACGME duty hour reforms.2 The authors chose to use nonteaching hospitals as a control group, which makes the assumption that teaching and nonteaching hospitals have similar trends for improved outcomes prior to the ACGME policy changes. Similarly, the study by Patel et al, conducted using Medicare claims data, evaluated mortality and readmissions after vs before the ACGME duty hour reforms, also using a comparison group of nonteaching hospitals.3

How Should the Findings Be Interpreted?
Both studies found no association of the 2011 ACGME duty hour reform with clinical outcomes. After accounting for the slight background trend for improved outcomes among these populations using the difference-in-differences method, there was no additional improvement (or worsening) in outcomes associated with the ACGME policy. Both studies had strong comparison groups and neither appeared to violate the key assumptions of this approach. The rigorous approach and the consistency of the finding across outcomes make a compelling case that there was no association between implementation of the policy and the measured outcomes.

What Caveats Should the Reader Consider?
Difference-in-differences analyses must also account for spillover effects. Spillovers occur when some aspect of the policy spills over and influences clinical care in the hospitals unexposed to the policy (eg, nonteaching hospitals improved quality in some way in reaction to the ACGME duty hour reforms). Spillover can be evaluated by examining whether there is a measurable change in outcomes in the comparison group of hospitals at the time of the policy implementation. In the studies in this issue of JAMA, the lack of a change in outcomes among nonteaching hospitals at the time of the duty hour reforms suggests there were no associated spillover effects.
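The core (B2 − B1) − (A2 − A1) computation can be sketched with hypothetical group means (not data from either study). In a saturated regression with pre/post and exposed/unexposed indicator variables, the coefficient on their interaction equals this same quantity.

```python
# Sketch of the difference-in-differences estimate from hypothetical
# group-period means (not data from either study).

def did_estimate(a1, a2, b1, b2):
    """Change in the exposed group (B) minus change in the unexposed group (A)."""
    return (b2 - b1) - (a2 - a1)

# The unexposed group improves from 10 to 12 (a background secular trend);
# the exposed group improves from 11 to 16. A naive pre-post comparison in
# the exposed group credits the policy with +5, but +2 of that is trend.
print(did_estimate(a1=10, a2=12, b1=11, b2=16))  # 3

# No-association case: both groups follow the same trend, so the estimate is 0.
print(did_estimate(a1=10, a2=12, b1=11, b2=13))  # 0
```

In practice the estimate comes from a regression model, as described above, so that patient and hospital characteristics can be adjusted for and a confidence interval attached to the interaction coefficient.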
ARTICLE INFORMATION
Author Affiliations: The Center for Healthcare Outcomes and Policy, University of Michigan, Ann Arbor (Dimick, Ryan); Institute for Healthcare Policy and Innovation, University of Michigan, Ann Arbor (Dimick, Ryan); Department of Surgery, School of Medicine, University of Michigan, Ann Arbor (Dimick); Department of Health Policy & Management, School of Public Health, University of Michigan, Ann Arbor (Dimick, Ryan).
Corresponding Author: Justin B. Dimick, MD, MPH, University of Michigan Medical School, 2800 Plymouth Rd, Bldg 16, Office 136E, Ann Arbor, MI 48109 ([email protected]).
Section Editors: Roger J. Lewis, MD, PhD, Department of Emergency Medicine, Harbor-UCLA Medical Center and David Geffen School of Medicine at UCLA; and Edward H. Livingston, MD, Deputy Editor, JAMA.
Conflict of Interest Disclosures: Both authors have completed and submitted the ICMJE Form for Disclosure of Potential Conflicts of Interest. Dr Dimick reported that he receives grant funding from the National Institutes of Health, the Agency for Healthcare Research and Quality, and BlueCross BlueShield of Michigan Foundation; and is a cofounder of ArborMetrix Inc, a company that makes software for profiling hospital quality and efficiency. Dr Ryan reported that he receives grant funding from the Agency for Healthcare Research and Quality.

REFERENCES
1. Ryan AM, Burgess J, Dimick JB. Why we shouldn't be indifferent to specification in difference-in-differences analysis [published online December 9, 2014]. Health Serv Res. doi:10.1111/1475-6773.12270.
2. Rajaram R, Chung JW, Jones AT, et al. Association of the 2011 ACGME resident duty hour reform with general surgery patient outcomes and with resident examination performance. JAMA. doi:10.1001/jama.2014.15277.
3. Patel MS, Volpp KG, Small DS, et al. Association of the 2011 ACGME resident duty hour reforms with mortality and readmissions among hospitalized Medicare patients. JAMA. doi:10.1001/jama.2014.15273.
4. Angrist JD, Pischke JS. Mostly Harmless Econometrics: An Empiricist's Companion. Princeton, NJ: Princeton University Press; 2008.
5. Bertrand M, Duflo E, Mullainathan S. How much should we trust differences-in-differences estimates? Q J Econ. 2004;119:249-275.
Sometimes a new treatment is best introduced to an entire group of patients rather than to individual patients. Examples include when the new approach requires procedures be followed by multiple members of a health care team or when the new technique is applied to the environment of care (eg, a method for cleaning a hospital room before it is known which patient will be assigned the room). This avoids confusion that could occur if all caregivers had to keep track of which patients were being treated the old way and which were being treated the new way.

One approach to evaluate the efficacy of such treatments—treatments for which the application typically involves changes at the level of the health care practice, hospital unit, or even health care system—is to conduct a cluster randomized trial. In a cluster randomized trial, study participants are randomized in groups or clusters so that all members within a single group are assigned to either the experimental intervention or the control.1,2 This contrasts with the more familiar randomized clinical trial (RCT) in which randomization occurs at the level of the individual participant, and the treatment assigned to one study participant is essentially independent of the treatment assigned to any other. In a cluster randomized trial, the cluster is the unit randomized, whereas in a traditional RCT, the individual study participant is randomized. In both types of trials, however, the outcomes of interest are recorded for each participant individually.

Although there are both theoretical and pragmatic reasons for using cluster randomization in a clinical trial, doing so introduces a fundamental challenge to those analyzing and interpreting the results of the trial: study participants from the same cluster (eg, patients treated within the same medical practice or hospital unit) tend to be more similar to each other than participants from different clusters.2 This nearly universal fact violates a common assumption of most statistical tests, namely, that individual observations are independent of each other. To obtain valid results, a cluster randomized trial must be analyzed using statistical methods that account for the greater similarity between individual participants from the same cluster compared with those from different clusters.2-4

In a recent JAMA article, Curley et al5 reported the results of RESTORE, a cluster randomized clinical trial evaluating a nurse-implemented, goal-directed sedation protocol for children with acute respiratory failure receiving mechanical ventilation in the intensive care setting, comparing this approach with usual care. The trial evaluated the primary hypothesis that the intervention group—patients treated in intensive care units (ICUs) using the goal-directed sedation protocol—would have a shorter duration of mechanical ventilation. Thirty-one pediatric ICUs, the "clusters," were randomized to either implement the goal-directed sedation protocol or continue their usual care practices.

Use of the Method
Why Is Cluster Randomization Used?
Cluster randomization should be used when it would be impractical or impossible to assign and correctly deliver the experimental and control treatments to individual study participants.1,2 Typical situations include the study of interventions that must be implemented by multiple team members, that affect workflow, or that alter the structure of care delivery. As in the RESTORE trial, interventions that involve training multidisciplinary health care teams are practically difficult to conduct using individual-level randomization, as health care practitioners cannot easily unlearn a new way of taking care of patients.

Cluster randomization is often used to reduce the mixing or contamination of treatments in the 2 groups of the trial, as might occur if patients in the control group start to be treated using some of the approaches included in the experimental treatment group, perhaps because the practitioners become habituated to the experimental approach or perceive it to be superior.1,2 For example, consider an injury prevention trial testing the effect of offering bicycle helmets to students in a classroom on the incidence of subsequent head injury. If a conventional RCT were conducted and half of the students in each classroom received helmets, it is likely that some of the other half of students would inform their parents about the ongoing intervention and many of these children might also begin to use bicycle helmets. Contamination is a form of crossover between treatment groups and will generally reduce the observed treatment effect using the usual intent-to-treat analysis.6 Cluster randomization may also be used to reduce potential selection bias. Physicians choosing individual patients from their practice to consent for randomization may tend to enroll patients with specific characteristics (eg, lesser or greater illness severity), reducing the external validity of the trial. Assignment of the treatment group at the practice level, with the application of the assigned treatment to all patients treated within the practice, may minimize this problem.

Using a cluster randomized design also can offer practical advantages. For example, if 2 or more treatments are considered to be within the standard of care, and depending on the risks associated with treatment, streamlined consent procedures or even integration of general and research consents may be used to reduce barriers to participation and ensure a truly representative patient population is enrolled in the trial.1,7

What Are Limitations of Cluster Randomization?
Any time data are clustered, the statistical analysis must use techniques that account for the likeness of cluster members.2,3 Extensions of the more-familiar regression models that are appropriate for the analysis of clustered data include generalized estimating equations (as used in RESTORE), mixed linear models, and hierarchical
2068 JAMA May 26, 2015 Volume 313, Number 20 (Reprinted) jama.com
models. While the proper use of these approaches is complex, the How Should Cluster Randomization Findings Be Interpreted
informed reader should be alert to statements that the analysis in This Particular Study?
method was selected to account for the similarity or correlations of As in any clinical trial, randomization may or may not work effec-
data within each cluster. The intracluster correlation coefficient (ICC) tively to create similar groups of patients. In RESTORE, some differ-
quantifies the likeness within clusters and ranges from 0 to 1, ences between the intervention groups were observed that might
although it is frequently in the 0.02 to 0.1 range.4 A value of 0 means partially explain the negative primary outcome. Specifically, the in-
each member of the cluster is not more like the other members, with tervention group had a greater proportion of younger children—a
respect to the measured characteristic, than they are to the popu- group that is more difficult to sedate.8 The RESTORE investigators
lation at large, so each additional individual contributes the same used randomization in blocks to ensure balance of pediatric ICU sizes
amount of new information. In contrast, a value of 1 means that each between groups; methods exist to balance groups in cluster trials
member of the cluster is exactly the same as the others in the clus- on multiple factors simultaneously.9 Although the RESTORE trial
ter, so any participants beyond the first contribute no additional in- yielded a negative primary outcome, the authors noted some prom-
formation at all. A larger ICC, representing greater similarity of re- ising secondary outcomes related to clinicians’ perception of pa-
sults within clusters, will decrease the effective sample size of the tient comfort. However, these assessments were unblinded and thus
trial, reducing the precision of estimates of treatment effects and the power of the trial.2 If the ICC is high, the effective sample size will be closer to the number of groups, and if the ICC is low, the effective sample size will be closer to the total number of individuals in the trial.

It is often impossible to maintain blinding of treatment assignment in a cluster randomized trial, both because of the nature of treatments and because of the number of patients in a given location all receiving the same treatment. It is well known that trials evaluating nonblinded interventions have a greater risk of bias.

Why Did the Authors Use Cluster Randomization in This Particular Study?
The RESTORE trial investigators used cluster randomization because they were introducing a nurse-implemented, goal-directed sedation protocol that required a change in behavior among multiple caregivers within each ICU. A major component of the experimental intervention was educating the critical care personnel regarding the perceived benefits and risks of sedation agents and use patterns relative to others. Had individual-level randomization been used to allocate patients, it is highly likely that the patients randomized to standard care would have received care that was somewhere between the prior standard and the new protocol, because all ICU caregivers would have been informed about the scientific and pharmacological basis for the goal-directed sedation protocol.

may be subject to bias.

Caveats to Consider When Looking at a Cluster Randomized Trial
When evaluating a cluster randomized trial, the first consideration is whether the use of clustering was well justified. Would it have been possible to use individual-level randomization and maintain fidelity in treatment allocation and administration? What would be the likelihood of contamination? Cluster randomization cannot minimize baseline differences between 2 treatment groups as efficiently as individual-level randomization. The design must be justified for scientific or logistical reasons to accept this trade-off.10

Second, the usual sources of bias should be considered, such as patient knowledge of treatment assignment and unblinded assessments of outcome. Although not specific to cluster randomized trials, these sources of bias tend to be more problematic.

Third, it is important to consider whether the intracluster correlation was appropriately accounted for in the design, analysis, and interpretation of the trial.1,10 During the design, the likely ICC should be considered to ensure the planned sample size is adequate. The analysis should be based on statistical methods that account for clustering, such as generalized estimating equations.

Finally, the interpretation should consider the extent to which the 2 treatment groups contained an adequate number, size, and similarity of clusters and whether any clusters were lost to follow-up.
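The relationship between the ICC and the effective sample size can be made concrete with the standard design-effect formula for equal cluster sizes. The sketch below is illustrative only: the number of ICUs, the cluster size, and the ICC values are hypothetical and are not taken from the RESTORE trial.

```python
# A minimal sketch (not from the article) of how the intracluster
# correlation coefficient (ICC) shrinks the effective sample size.
# For equal cluster sizes m, the design effect is DE = 1 + (m - 1) * ICC,
# and the effective sample size is n_total / DE.

def effective_sample_size(n_clusters, cluster_size, icc):
    """Effective sample size of a cluster randomized trial."""
    n_total = n_clusters * cluster_size
    design_effect = 1 + (cluster_size - 1) * icc
    return n_total / design_effect

# Hypothetical trial: 20 ICUs with 50 patients each (1000 patients total).
print(round(effective_sample_size(20, 50, icc=0.01)))  # 671: low ICC, nearer the 1000 individuals
print(round(effective_sample_size(20, 50, icc=1.00)))  # 20: extreme ICC, exactly the number of clusters
```

At an ICC of 1 every patient in a cluster is fully redundant, so the trial carries only as much information as it has clusters; at a small ICC the penalty is modest.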
ARTICLE INFORMATION
Author Affiliations: Department of Emergency Medicine, University of Michigan, Ann Arbor (Meurer); Department of Emergency Medicine, Harbor-UCLA Medical Center, Torrance, California (Lewis); Los Angeles Biomedical Research Institute, Torrance, California (Lewis); David Geffen School of Medicine at University of California, Los Angeles (Lewis); Berry Consultants, Austin, Texas (Lewis).
Corresponding Author: Roger J. Lewis, MD, PhD, Department of Emergency Medicine, Harbor-UCLA Medical Center, 1000 W Carson St, Bldg D9, Torrance, CA 90509 ([email protected]).
Section Editors: Roger J. Lewis, MD, PhD, Department of Emergency Medicine, Harbor-UCLA Medical Center and David Geffen School of Medicine at UCLA; and Edward H. Livingston, MD, Deputy Editor, JAMA.
Conflict of Interest Disclosures: Both authors have completed and submitted the ICMJE Form for Disclosure of Potential Conflicts of Interest and none were reported.

REFERENCES
1. Campbell MK, Elbourne DR, Altman DG; CONSORT group. CONSORT statement: extension to cluster randomised trials. BMJ. 2004;328(7441):702-708.
2. Wears RL. Advanced statistics: statistical methods for analyzing cluster and cluster-randomized data. Acad Emerg Med. 2002;9(4):330-341.
3. Dawid AP. Conditional independence in statistical theory. J R Stat Soc Series B. 1979;41:1-31.
4. Killip S, Mahfoud Z, Pearce K. What is an intracluster correlation coefficient? Ann Fam Med. 2004;2(3):204-208.
5. Curley MA, Wypij D, Watson RS, et al. Protocolized sedation vs usual care in pediatric patients mechanically ventilated for acute respiratory failure. JAMA. 2015;313(4):379-389.
6. Detry MA, Lewis RJ. The intention-to-treat principle. JAMA. 2014;312(1):85-86.
7. Huang SS, Septimus E, Kleinman K, et al. Targeted versus universal decolonization to prevent ICU infection. N Engl J Med. 2013;368(24):2255-2265.
8. Anand KJ, Willson DF, Berger J, et al. Tolerance and withdrawal from prolonged opioid use in critically ill children. Pediatrics. 2010;125(5):e1208-e1225.
9. Scott PA, Meurer WJ, Frederiksen SM, et al. A multilevel intervention to increase community hospital use of alteplase for acute stroke (INSTINCT). Lancet Neurol. 2013;12(2):139-148.
10. Ivers NM, Taljaard M, Dixon S, et al. Impact of CONSORT extension for cluster randomised trials on quality of reporting and study methodology. BMJ. 2011;343:d5886.

jama.com (Reprinted) JAMA May 26, 2015 Volume 313, Number 20 2069
Noninferiority Trials
Is a New Treatment Almost as Effective as Another?
Amy H. Kaji, MD, PhD; Roger J. Lewis, MD, PhD
Related article page 2340

Sometimes the goal of comparing a new treatment with a standard treatment is not to find an approach that is more effective but to find a therapy that has other advantages, such as lower cost, fewer adverse effects, or greater convenience with at least similar efficacy to the standard treatment. With other advantages, a treatment that is almost as effective as a standard treatment might be preferred in practice or for some patients. The purpose of a noninferiority trial is to rigorously evaluate a new treatment against an accepted and effective treatment with the goal of demonstrating that it is at least almost as good (ie, not inferior).

In this issue of JAMA, Salminen et al describe the results of a multicenter noninferiority trial of 530 adults with computed tomography–confirmed acute appendicitis who were randomized either to early appendectomy (the standard treatment) or to antibiotic therapy alone (a potentially less burdensome experimental treatment).1

Use of the Method

Why Are Noninferiority Trials Conducted?
In a traditional clinical trial, a new treatment is compared with a standard treatment or placebo with the goal of demonstrating that the new treatment has greater efficacy. The null hypothesis for such a trial is that the 2 treatments have the same effect. Rejection of this hypothesis, implying that the effects are different, is signaled by a statistically significant P value or, alternatively, by a 2-tailed confidence interval that excludes no effect. While the new treatment could be either superior or inferior, the typical trial aims to demonstrate superiority of the new treatment and is known as a superiority trial. Since a superiority trial is capable of identifying both harmful and beneficial effects of a new therapy vs a control (ie, a current therapy), a 2-tailed 95% CI can be used to indicate the upper and lower limits of the difference in treatment effect that are consistent with the observed data. The null hypothesis is rejected, indicating that the new therapy differs from the control, if the confidence interval does not include the result that indicates absence of effect (eg, a risk ratio of 1 or a risk difference of 0).2 This is equivalent to a statistically significant P value.

Although superiority or inferiority of a new treatment can be demonstrated by a superiority trial, it would generally be incorrect to conclude that the absence of a significant difference in a superiority trial demonstrates that the therapies have similar effects; absence of evidence of a difference is not reliable evidence that there is no difference. An active-controlled noninferiority trial is needed to determine whether a new intervention, which offers other advantages such as decreased toxicity or cost, does not have lesser efficacy than an established treatment.3-6 Noninferiority trials use known effective treatments as controls because there is little to be gained by demonstrating that a new therapy is not inferior to a sham or placebo treatment.

The objective of a noninferiority trial is to demonstrate that the intervention being evaluated achieves the efficacy of the established therapy within a predetermined acceptable noninferiority margin (Figure). The magnitude of this margin depends on what would be a clinically important difference, the expected event rates, and, possibly, regulatory requirements. Other determinants of the noninferiority margin include the known effect of the standard treatment vs placebo; the severity of the disease; toxicity, inconvenience, or cost of the standard treatment; and the primary end point. A smaller noninferiority margin is likely appropriate if the disease under investigation is severe or if the primary end point is death.3-6

Figure. Two Different Possible Results of a Noninferiority Trial, Summarized by 1-Tailed Confidence Intervals for the Relative Efficacy of the New and Active-Control Treatments
[Graphic omitted: two 1-sided confidence intervals for the difference in efficacy (new treatment minus active control) plotted against the noninferiority margin; one interval extends past the margin ("noninferiority not demonstrated") and the other lies entirely to the right of it ("noninferiority demonstrated").]
In the top example, the lower limit of the confidence interval lies to the left of the noninferiority margin, demonstrating that the results are consistent with greater inferiority (worse efficacy) than allowed by the noninferiority margin. Thus, the new treatment may be inferior and noninferiority is not demonstrated. In the lower example, the lower limit of the confidence interval lies to the right of the noninferiority margin, demonstrating noninferiority of the new treatment relative to the active-control treatment. The overall result of the trial is defined by the lower limit of the 1-sided confidence interval rather than by the point estimate for the treatment effect, so point estimates are not shown.

The sample size required to reliably demonstrate noninferiority depends on both the choice of the noninferiority margin and the assumed true relative effects of the compared treatments.3-6 An active-controlled noninferiority trial often requires a larger sample size than a superiority trial because the noninferiority margins used in noninferiority studies are generally smaller than the differences sought in superiority trials. Just as important is the assumed effect of the experimental treatment relative to the active-control treatment. The assumed effect may be that the experimental treatment is worse than the control but by a smaller amount than the noninferiority margin, that the 2 treatments are equivalent, or even that
jama.com (Reprinted) JAMA June 16, 2015 Volume 313, Number 23 2371
the experimental treatment is more effective. These 3 options will result in larger, intermediate, and smaller required sample sizes, respectively, to achieve the same trial power—the chance of demonstrating noninferiority—because they assume progressively better efficacy of the experimental treatment.

Because a noninferiority trial only aims to demonstrate noninferiority and does not aim to distinguish noninferiority from superiority, it is analyzed using a 1-sided confidence interval (Figure) or hypothesis test. Typically, a 1-sided 95% or 97.5% CI (−L to ∞; negative values represent inferiority of the experimental treatment) is constructed for the difference between the 2 treatments, and the lower limit, −L, is compared with the noninferiority margin. Noninferiority is demonstrated if the lower confidence limit lies above or to the right of the noninferiority margin.3-6

What Are the Limitations of Noninferiority Trials?
A negative noninferiority trial does not in general demonstrate inferiority of the experimental treatment, just as a negative superiority trial does not demonstrate equivalence of 2 treatments.

A noninferiority trial is similar to an equivalence trial in that the objective of both is to demonstrate that the intervention matches the action of the established therapy within a prespecified margin. However, the objective of a noninferiority trial is only to demonstrate that the experimental treatment is not substantially worse than the standard treatment, whereas that of an equivalence trial is to demonstrate that the experimental treatment is neither worse than nor better than the standard treatment.3

Why Was a Noninferiority Trial Conducted in This Case?
Ever since McBurney demonstrated reduced morbidity from pelvic infections with appendectomy, the standard treatment for acute appendicitis has been surgery, which requires general anesthesia, incurs increased cost, and is associated with postoperative complications, such as wound infections and adhesions. Thus, a less invasive approach with similar efficacy might be preferred by many patients and physicians. Three randomized trials summarized in a recent Cochrane analysis demonstrated equipoise as to whether appendicitis can successfully be treated with antibiotics alone rather than surgery.7 Because appendectomy is viewed as the standard treatment, it was considered the active control with which the less invasive experimental antibiotic treatment was to be compared.

To design the clinical trial, Salminen et al assumed a surgical treatment success rate of 99% and prespecified a noninferiority margin of −24% based on clinical considerations. This is equivalent to saying that if the rate of treatment success with antibiotics alone could be shown to be no worse than 24% worse than the rate with surgery, then the antibiotic-only strategy would be clinically noninferior. As this study demonstrates, the selection of the noninferiority margin is often subjective rather than based on specific criteria.

How Should the Results Be Interpreted?
The results demonstrated that all but 1 of 273 patients randomized to the surgery group underwent successful appendectomy, resulting in a treatment efficacy of 99.6%. In the antibiotic treatment group, 186 of 256 patients available for follow-up had treatment successes, for a success rate of 72.7%; 70 of the 256 patients underwent surgical intervention within 1 year of initial presentation. Thus, the point estimate for the difference in success rate with the antibiotic-only strategy was −27.0% and the associated 1-tailed 95% CI would range from −31.6% to infinity. Because that interval includes efficacy values worse than the noninferiority margin of −24%, noninferiority cannot be demonstrated.

Caveats to Consider When Looking at a Noninferiority Trial
Noninferiority active-controlled trials often require a larger sample size than placebo-controlled trials, in part because the chosen noninferiority margins are often small. The required sample size for a noninferiority trial is highly dependent on the noninferiority margin and the assumed effect of the new treatment; this assumed effect must be clearly stated and realistic.

The primary analysis for a superiority trial should be based on the intention-to-treat (ITT) principle because it is generally conservative in the setting of imperfect adherence to treatment. However, analyzing a noninferiority trial by ITT could make an inferior treatment appear to be noninferior if poor patient adherence resulted in both treatments being similarly ineffective. Thus, when analyzing a noninferiority trial, both ITT and per-protocol analyses should be conducted. The results are most meaningful when both approaches demonstrate noninferiority.

A noninferiority trial does not distinguish between a new treatment that is noninferior and one that is truly superior and cannot demonstrate equivalence.
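The arithmetic behind this interpretation can be reproduced from the reported counts. The sketch below assumes a simple Wald (normal-approximation) 1-sided interval for the risk difference; the trial statisticians may have used a different interval method, but this approximation recovers the published −27.0% point estimate and −31.6% lower limit.

```python
# Hedged reconstruction of the APPAC comparison from the counts reported
# in the article (186/256 antibiotic successes vs 272/273 surgical
# successes), assuming a Wald 1-sided 95% CI for the risk difference.
from math import sqrt

successes_abx, n_abx = 186, 256    # antibiotic-only group
successes_surg, n_surg = 272, 273  # appendectomy group (all but 1 of 273)
margin = -0.24                     # prespecified noninferiority margin

p_abx = successes_abx / n_abx      # about 72.7% success
p_surg = successes_surg / n_surg   # about 99.6% success
diff = p_abx - p_surg              # point estimate: about -27.0%

se = sqrt(p_abx * (1 - p_abx) / n_abx + p_surg * (1 - p_surg) / n_surg)
lower = diff - 1.645 * se          # 1-sided 95% lower limit: about -31.6%

noninferior = lower > margin       # False: the interval crosses the margin
print(f"difference {diff:.1%}, lower limit {lower:.1%}, noninferior: {noninferior}")
```

Because the lower confidence limit falls below the −24% margin, the decision rule illustrated in the Figure yields "noninferiority not demonstrated," matching the trial's conclusion.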
ARTICLE INFORMATION
Author Affiliations: Department of Emergency Medicine, Harbor-UCLA Medical Center, Torrance, California (Kaji, Lewis); David Geffen School of Medicine at UCLA, Torrance, California (Kaji, Lewis); Los Angeles Biomedical Research Institute, Los Angeles, California (Kaji, Lewis); Berry Consultants, LLC (Lewis).
Corresponding Author: Roger J. Lewis, MD, PhD, Department of Emergency Medicine, Harbor-UCLA Medical Center, 1000 W Carson St, Box 21, Torrance, CA 90509 ([email protected]).
Section Editors: Roger J. Lewis, MD, PhD, Department of Emergency Medicine, Harbor-UCLA Medical Center and David Geffen School of Medicine at UCLA; and Edward H. Livingston, MD, Deputy Editor, JAMA.
Conflict of Interest Disclosures: The authors have completed and submitted the ICMJE Form for Disclosure of Potential Conflicts of Interest and none were reported.

REFERENCES
1. Salminen P, Paajanen H, Rautio T, et al. Antibiotic therapy vs appendectomy for treatment of uncomplicated acute appendicitis: the APPAC randomized clinical trial. JAMA. doi:10.1001/jama.2015.6154.
2. Young KD, Lewis RJ. What is confidence? I: the use and interpretation of confidence intervals. Ann Emerg Med. 1997;30(3):307-310.
3. Kaji AH, Lewis RJ. Are we looking for superiority, equivalence, or noninferiority? Ann Emerg Med. 2010;55(5):408-411.
4. Mulla SM, Scott IA, Jackevicius CA, et al. How to use a non-inferiority trial: Users' Guides to the Medical Literature. JAMA. 2012;308:2605-2611.
5. Tamayo-Sarver JH, Albert JM, Tamayo-Sarver M, Cydulka RK. Advanced statistics: how to determine whether your intervention is different, at least as effective as, or equivalent. Acad Emerg Med. 2005;12(6):536-542.
6. Piaggio G, Elbourne DR, Altman DG, Pocock SJ, Evans SJW; CONSORT Group. Reporting of noninferiority and equivalence randomized trials: an extension of the CONSORT statement. JAMA. 2006;295(10):1152-1160.
7. Wilms IM, de Hoog DE, de Visser DC, Janzing HM. Appendectomy vs antibiotic treatment for acute appendicitis. Cochrane Database Syst Rev. 2011;(11):CD008359.
Missing Data
How to Best Account for What Is Not Known
Craig D. Newgard, MD, MPH; Roger J. Lewis, MD, PhD
Related article page 884

Missing data are common in clinical research, particularly for variables requiring complex, time-sensitive, resource-intensive, or longitudinal data collection methods. However, even seemingly readily available information can be missing. There are many reasons for “missingness,” including missed study visits, patients lost to follow-up, missing information in source documents, lack of availability (eg, laboratory tests that were not performed), and clinical scenarios preventing collection of certain variables (eg, missing coma scale data in sedated patients). It is particularly challenging to interpret studies when primary outcome data are missing. However, many methods commonly used for handling missing values during data analysis can yield biased results, decrease study power, or lead to underestimates of uncertainty, all reducing the chance of drawing valid conclusions.

In this issue of JAMA, Bakris et al evaluated the effect of finerenone on urinary albumin-creatinine ratio (UACR) in patients with diabetic nephropathy in a randomized, phase 2B, dose-finding clinical trial conducted in 148 sites in 23 countries.1 Because of the logistical complexity of the study, it is not surprising that some of the intended data collection could not be completed, resulting in missing outcome data. Bakris et al used several analysis and imputation techniques (ie, methods for replacing missing data with specific values) to assess the effects of different approaches for handling missing data. These methods included complete case analysis (restricting the analysis to include only patients with observed 90-day UACR values); last observation carried forward (LOCF; typically this involves using the last recorded data point as the final outcome; Bakris et al used the higher of 2 UACR values and, separately, the most recent UACR obtained prior to study discontinuation); baseline observation carried forward (using the baseline UACR value as the outcome UACR value, therefore assuming no treatment effect for that patient); mean value imputation (replacing missing values with the mean of observed UACR values); and random imputation (using randomly selected UACR values to replace missing UACR values).1 Multiple imputation2 to handle missing values was also performed. With the exception of multiple imputation, each of the imputation approaches replaces a missing value with a single number (termed “single” or “simple” imputation) and can threaten the validity of study results.3,4 The authors concluded that finerenone improved the UACR, a result that was consistent regardless of the method for handling missing data.

Use of the Method

Why Are These Methods Used?
It is rare for a research investigation not to have any missing data. If patients with missing variables are omitted from an analysis, the effective sample size is reduced and the treatment effect estimate may be incorrect.3 This is known as complete (observed) case analysis and is the default method used by most statistical software.

Strategies for handling missing values are each based on different assumptions and have different limitations. Key questions to consider when selecting a method for handling missing values include (1) Why are data missing? (2) How do patients with missing and complete data differ? and (3) Do the observed data help predict the missing values? To better understand this last concept, suppose a physician was asked to make a best guess about a characteristic of one of her patients that was missing from their chart; eg, weight, systolic blood pressure, fasting serum cholesterol, or serum creatinine. The chance of guessing a value close to the true value would likely be substantially improved if the physician was given related data about the patient, such as his or her age, comorbidities, and prior laboratory values.

The cause for missing data, called censoring, is “noninformative” when the reason a value could not be measured provides no information for what it should be. Censoring is “informative” when the absence of a value indicates something about what it should be. For example, a patient lost to follow-up may have quit the study because declining health made traveling to follow-up visits more difficult, implying that patients with complete follow-up data may have better health status than those with missing data.

There are 3 ways by which data may be missing.3,4 The first is that data may be missing completely at random (MCAR), meaning the probability of being missing is completely unrelated to all observed and unobserved patient characteristics. This is the least plausible mechanism but is the only one for which complete case analysis will yield unbiased results.

The next mechanism, missing at random (MAR) or “ignorable,” does not assume patients with missing values are similar to those with complete data but instead assumes that observed values can be used to “explain” which values are missing and help predict what the missing values would be.3 This mechanism of missingness is a more realistic assumption than MCAR, and MAR is assumed by most of the currently used valid techniques for handling missing data. However, most simple imputation methods still yield biased or falsely precise results when MAR is assumed.

Missing not at random (MNAR) is the most problematic censoring mechanism and occurs when missing values are dependent on unobserved or unknown factors. When MNAR is present, statistical adjustment for missing information is virtually impossible.

Because an investigator usually cannot determine the actual mechanism for missingness, statistical analyses usually proceed assuming the data conform to a MAR mechanism. Collecting information to explain why data are missing (eg, participants’ mode of transportation and distance to the clinic) can help predict certain values and make the MAR assumption more plausible.3,4
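The mechanical consequences of the simplest approaches can be seen in a small simulation. The data below are hypothetical and are not from the trial by Bakris et al; the point is only that mean imputation fills every gap with one identical number, deflating the apparent spread, while complete case analysis shrinks the sample.

```python
# Hypothetical outcome data illustrating two "naive" strategies:
# complete case analysis drops patients, and mean imputation inserts
# the same value repeatedly, shrinking the apparent variability.
import random
import statistics

random.seed(1)
true_values = [random.gauss(100, 20) for _ in range(200)]  # outcomes for 200 patients
observed = true_values[:150]                               # 50 outcomes are missing

complete_case = observed                                    # analysis set shrinks to 150
mean_imputed = observed + [statistics.mean(observed)] * 50  # 50 identical imputed values

print(len(complete_case), len(mean_imputed))
print(statistics.stdev(observed) > statistics.stdev(mean_imputed))  # True: SD is deflated
```

Adding values exactly at the mean leaves the sum of squared deviations unchanged while inflating the denominator, so the standard deviation (and hence any standard error built on it) is always artificially reduced.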
What Are the Limitations of These Methods?
Simple imputation methods (eg, LOCF, complete case analysis, mean value imputation, and random imputation) are considered “naive” because they fail to account for the uncertainty in imputing missing values, do not use information available in observed values, can introduce bias, and artificially increase precision (ie, inappropriately narrow confidence intervals and result in smaller P values).3,4 Each of these limitations can cause spurious results. Better estimates and measures of uncertainty (eg, confidence intervals) can be obtained by using maximum likelihood–based methods, hot deck imputation, and multiple imputation.3

The primary limitation of complete case analysis is bias and reduced sample size, resulting in reduced study power.4 Unless the data are MCAR (an unlikely event), estimates using observed case analysis will be biased and the direction of the bias unpredictable. Last observation carried forward is a commonly performed simple imputation technique. This strategy requires the tenuous assumption that the final outcome (eg, 90-day UACR) does not change from the last observed value. In mean value imputation, all missing values are replaced with the mean of observed values (eg, 90-day UACR). With an increasing proportion of missing data, mean value imputation results in larger numbers of patients with an identical imputed value, creating smaller measures of variance and greater bias, artificially increasing the apparent precision of inaccurate estimates.4,5 Random number imputation avoids the repetitive use of the same imputed estimate but fails to use observed values to inform the selected estimate.

Why Did the Authors Use This Method in This Particular Study?
In the study by Bakris et al, the primary outcome had missing values requiring the use of missing data methods. Several imputation methods were used so that results obtained by the various approaches could be compared.

How Should This Method’s Findings Be Interpreted in This Particular Study?
Because of the inherent limitations of simple imputation methods, the multiply imputed results provide the most valid results in the study by Bakris et al. Provided the underlying assumptions are met and rigorous imputation methods (eg, multiple imputation) are used, study results can be interpreted as if all values had been observed.

Caveats to Consider When Looking at the Results in This Study Based on This Method
The LOCF method for handling missing values (as used in the primary analysis by Bakris et al1) has the same fundamental limitations as other simple imputation methods, generating potentially biased results with inappropriately narrow confidence intervals. Because results from the post hoc multiple imputation analysis were reported to be no different from those of the LOCF analysis,1 the primary results can be considered valid despite the risks of using simple imputation methods. Nonetheless, results from the multiple imputation analysis are more rigorous (despite the post hoc selection of this strategy) because of the advantages of this method over simple imputation methods.5 Caution is required when using traditionally defined “conservative” methods for handling missing outcomes (eg, LOCF) over more sophisticated missing data methods. While they may be conservative in assigning the outcome of a participant with missing data, they can lead to both false-positive and false-negative results in measured treatment effects. In general, multiple imputation is the best approach for modeling the effects of missing data in studies.
ARTICLE INFORMATION
Author Affiliations: Center for Policy and Research in Emergency Medicine, Department of Emergency Medicine, Oregon Health and Science University, Portland (Newgard); Department of Emergency Medicine, Harbor–UCLA Medical Center, Torrance (Lewis); Los Angeles Biomedical Research Institute, Torrance, California (Lewis); David Geffen School of Medicine, University of California, Los Angeles (Lewis).
Corresponding Author: Craig D. Newgard, MD, MPH, Center for Policy and Research in Emergency Medicine, Department of Emergency Medicine, Oregon Health and Science University, 3181 SW Sam Jackson Park Rd, Mail Code CR-114, Portland, OR 97239-3098 ([email protected]).
Section Editors: Roger J. Lewis, MD, PhD, Department of Emergency Medicine, Harbor-UCLA Medical Center and David Geffen School of Medicine at UCLA; and Edward H. Livingston, MD, Deputy Editor, JAMA.
Conflict of Interest Disclosures: The authors have completed and submitted the ICMJE Form for Disclosure of Potential Conflicts of Interest and none were reported.

REFERENCES
1. Bakris GL, Agarwal R, Chan JCN, et al; Mineralocorticoid Receptor Antagonist Tolerability Study–Diabetic Nephropathy (ARTS-DN) Study Group. Effect of finerenone on albuminuria in patients with diabetic nephropathy: a randomized clinical trial. JAMA. doi:10.1001/jama.2015.10081.
2. Rubin DB. Multiple Imputation for Nonresponse in Surveys. New York, NY: Wiley; 1987.
3. Little RJA, Rubin DB. Statistical Analysis With Missing Data. 2nd ed. Princeton, NJ: Wiley; 2002.
4. Haukoos JS, Newgard CD. Advanced statistics: missing data in clinical research, I: an introduction and conceptual framework. Acad Emerg Med. 2007;14(7):662-668.
5. Newgard CD, Haukoos JS. Advanced statistics: missing data in clinical research, II: multiple imputation. Acad Emerg Med. 2007;14(7):669-678.
Related article page 1030

Risk prediction models help clinicians develop personalized treatments for patients. The models generally use variables measured at one time point to estimate the probability of an outcome occurring within a given time in the future. It is essential to assess the performance of a risk prediction model in the setting in which it will be used. This is done by evaluating the model’s discrimination and calibration. Discrimination refers to the ability of the model to separate individuals who develop events from those who do not. In time-to-event settings, discrimination is the ability of the model to predict who will develop an event earlier and who will develop an event later or not at all. Calibration measures how accurately the model’s predictions match overall observed event rates.

In this issue of JAMA, Melgaard et al used the C statistic, a global measure of model discrimination, to assess the ability of the CHA2DS2-VASc model to predict ischemic stroke, thromboembolism, or death in patients with heart failure and to do so separately for patients who had or did not have atrial fibrillation (AF).1

Use of the Method

Why Are C Statistics Used?
The C statistic is the probability that, given 2 individuals (one who experiences the outcome of interest and the other who does not or who experiences it later), the model will yield a higher risk for the first patient than for the second. It is a measure of concordance (hence, the name “C statistic”) between model-based risk estimates and observed events. C statistics measure the ability of a model to rank patients from high to low risk but do not assess the ability of a model to assign accurate probabilities of an event occurring (that is measured by the model’s calibration). C statistics generally range from 0.5 (random concordance) to 1 (perfect concordance).

C statistics can also be thought of as being the area under the plot of sensitivity (proportion of people with events whom the model predicts to be high risk) vs 1 minus specificity (proportion of people without events whom the model predicts to be high risk) for all possible classification thresholds. This plot is called the receiver operating characteristic (ROC) curve, and the C statistic is equal to the area under this curve.2 For example, in the study by Melgaard et al, CHA2DS2-VASc scores ranged from a low of 0 (heart failure only) to a high of 5 or higher, depending on the number of comorbidities a patient had. One point on the ROC curve would be when high risk is defined as a CHA2DS2-VASc score of 1 or higher and low risk as a CHA2DS2-VASc score of 0. Another point on the curve would be when high risk is defined as a CHA2DS2-VASc score of 2 or higher and low risk as a CHA2DS2-VASc score of lower than 2, etc. Each cut point is associated with a different sensitivity and specificity.

It is useful to quantify the performance and clinical value of predictive models using the positive predictive value (PPV; the proportion of patients in whom the model predicts an event will occur who actually have an event) and the negative predictive value (NPV; the proportion of patients whom the model predicts will not have an event who actually do not experience the event). An important measure of a model’s misclassification of events is 1 minus NPV, or the proportion of patients the model predicts will not have an event who actually have the event. The PPV and 1 minus NPV can be more informative for individual patients than the sensitivity and specificity because they answer the question “What are this patient’s chances of having an event when the model predicts they will or will not have one?” If the event rate is known, then the PPV and NPV can be estimated based on sensitivity and specificity and, hence, the C statistic can be viewed as a summary for both sets of measures.

What Are the Limitations of the C Statistic?
The C statistic has several limitations. As a single number, it summarizes the discrimination of a model but does not communicate all the information ROC plots contain and lacks direct clinical application. The NPV, PPV, sensitivity, and specificity have more clinical relevance, especially when presented as plots across all meaningful classification thresholds (as is done with ROCs). A weighted sum of sensitivity and specificity (known as the standardized net benefit) can be plotted to assign different penalties to the 2 misclassification errors (predicting an individual who ultimately experiences an event to be at low risk; predicting an individual who does not experience an event to be at high risk) according to the principles of decision analysis.3,4 In contrast, the C statistic does not effectively balance misclassification errors.5 In addition, the C statistic is only a measure of discrimination, not calibration, so it provides no information regarding whether the overall magnitude of risk is predicted accurately.

Why Did the Authors Use C Statistics in Their Study?
Melgaard et al1 sought to determine if the CHA2DS2-VASc score could predict occurrences of ischemic stroke, thromboembolism, or death among patients who have heart failure with and without AF. The authors used the C statistic to determine how well the model could distinguish between patients who would or would not develop each of the 3 end points they studied. The C statistic yielded the probability that a randomly selected patient who had an event had a risk score that was higher than a randomly selected patient who did not have an event.

How Should the Findings Be Interpreted?
The value of the C statistic depends not only on the model under investigation (ie, CHA2DS2-VASc score) but also on the distribution of risk factors in the sample to which it is applied. For example, if age is an important risk factor, the same model can appear to perform much better when applied to a sample with a wide age range compared with a sample with a narrow age range.

The C statistics reported by Melgaard et al1 range from 0.62 to

sified as low risk had major events, yielding an NPV of 61% to 76%. Because there was less misclassification among patients without AF who were predicted to be at low risk, a CHA2DS2-VASc score of 0 is
0.71 and do not appear impressive (considering that a C statistic of a better determinant of long-term low risk among patients without
0.5 represents random concordance). This might be due to limita- AF than patients with AF. This aspect of the model performance is
tions of the model; eg, if there were an insufficient number of pre- not apparent when looking at C statistics alone.
dictors or the predictors had been dichotomized for simplicity. The
nationwide nature of the data used by Melgaard et al suggests that Caveats to Consider When Using C Statistics to Assess
the unimpressive values of the C statistic cannot be attributed to nar- Predictive Model Performance
row ranges of risk factors in the analyzed cohort. Rather, it might sug- Special extensions of the C statistic need to be used when applying
gest inherent limitations in the ability to discriminate between pa- it to time-to-event data6 and competing-risk settings.7 Further-
tients with heart failure who will and will not die or develop ischemic more, there exist several appealing single-number alternatives to the
stroke or thromboembolism. C statistic. They include the discrimination slope, the Brier score, or
The C statistic analysis suggested that the CHA2DS2-VASc model the difference between sensitivity and 1 minus specificity evalu-
performed similarly among heart failure patients with and without ated at the event rate.3
AF (C statistics between 0.62 and 0.71 among patients with AF and The C statistic provides an important but limited assessment
0.63 to 0.69 among patients without AF). An additional insight of the performance of a predictive model and is most useful as a
emerges from NPV analysis looking at misclassification of events oc- familiar first-glance summary. The evaluation of the discriminat-
curring at 5 years, however. Between 19% and 27% of patients with- ing value of a risk model should be supplemented with other sta-
out AF who were predicted to be at low risk actually had 1 of the 3 tistical and clinical measures. Graphical summaries of model cali-
events and thus were misclassified, yielding an NPV of 73% to 82%. bration and clinical consequences of adopted decisions are
Between 24% and 39% of patients with AF whom the model clas- particularly useful.8
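The two descriptions of the C statistic given above (a concordance probability over patient pairs, and the area under the ROC curve swept over all classification thresholds) can be checked against each other with a short sketch. The integer risk scores below are hypothetical and are not data from Melgaard et al:

```python
from itertools import product

# Hypothetical integer risk scores (CHA2DS2-VASc-like, 0-5)
event_scores = [2, 3, 3, 4, 5]      # patients who experienced the event
no_event_scores = [0, 1, 1, 2, 3]   # patients who did not

def c_statistic(events, non_events):
    """C statistic as a concordance probability: the fraction of
    event/no-event pairs in which the event patient has the higher
    score (ties count as one-half)."""
    concordant = 0.0
    for e, n in product(events, non_events):
        if e > n:
            concordant += 1.0
        elif e == n:
            concordant += 0.5
    return concordant / (len(events) * len(non_events))

def roc_auc(events, non_events):
    """The same quantity as the area under the ROC curve, built by
    sweeping the classification threshold over all observed scores."""
    top = max(events + non_events) + 1
    thresholds = sorted(set(events + non_events) | {top}, reverse=True)
    points = [(0.0, 0.0)]
    for t in thresholds:
        sens = sum(e >= t for e in events) / len(events)
        fpr = sum(n >= t for n in non_events) / len(non_events)
        points.append((fpr, sens))
    # trapezoidal area under the (fpr, sensitivity) curve
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

print(c_statistic(event_scores, no_event_scores))       # 0.9
print(round(roc_auc(event_scores, no_event_scores), 3)) # 0.9
```

Both computations return the same value, illustrating that ranking ability is all the C statistic captures; as noted above, it says nothing about calibration.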
Two recent studies published in JAMA involved the analysis of observational data to estimate the effect of a treatment on patient outcomes. In the study by Rozé et al,1 a large observational data set was analyzed to estimate the relationship between early echocardiographic screening for patent ductus arteriosus and mortality among preterm infants. The authors compared mortality rates of 847 infants who were screened for patent ductus arteriosus and 666 who were not. The 2 infant groups were dissimilar; infants who were screened were younger, more likely female, and less likely to have received corticosteroids. The authors used propensity score matching to create 605 matched infant pairs from the original cohort to adjust for these differences. In the study by Huybrechts et al,2 the Medicaid Analytic eXtract data set was analyzed to estimate the association between antidepressant use during pregnancy and persistent pulmonary hypertension of the newborn. The authors included 3 789 330 women, of whom 128 950 had used antidepressants. Women who used antidepressants were different from those who had not, with differences in age, race/ethnicity, chronic illnesses, obesity, tobacco use, and health care use. The authors adjusted for these differences using, in part, the technique of propensity score stratification.

Use of the Method
Why Were Propensity Methods Used?
Many considerations influence the selection of one therapy over another. In many settings, more than one therapeutic approach is commonly used. In routine clinical practice, patients receiving one treatment will tend to be different from those receiving another, eg, if one treatment is thought to be better tolerated by elderly patients or more effective for patients who are more seriously ill. This results in a correlation—or confounding—between patient characteristics that affect outcomes and the choice of therapy (often called "confounding by indication"). If observational data obtained from routine clinical practice are examined to compare the outcomes of patients treated with different therapies, the observed difference will be the result of both differing patient characteristics and treatment choice, making it difficult to delineate the true effect of one treatment vs another.

The effect of an intervention is best assessed by randomizing treatment assignments so that, on average, the patients are similar in the 2 treatment groups. This allows a direct assessment of the effect of the intervention on outcome. In observational studies, randomization is not possible, so investigators must adjust for differences between groups to obtain valid estimates of the associations between the treatments being compared and the outcomes of interest.3 Multivariable statistical methods are often used to estimate this association while adjusting for confounding.

Propensity score methods are used to reduce the bias in estimating treatment effects and allow investigators to reduce the likelihood of confounding when analyzing nonrandomized, observational data. The propensity score is the probability that a patient would receive the treatment of interest, based on characteristics of the patient, treating clinician, and clinical environment.4 Such probabilities can be estimated using multivariable statistical methods (eg, logistic regression), in which case the treatment of interest is the dependent variable and the characteristics of the patient, prescribing clinician, and clinical setting are the predictors. Investigators estimate these probabilities, ranging from 0 to 1, for each patient in the study population. These probabilities—the propensity scores—are then used to adjust for differences between groups. In biomedical studies, propensity scores are often used to compare treatments, but they can also be used to estimate the relationship between any nonrandomized factor, such as exposure to a toxin or infectious agent, and the outcome of interest.

There are 4 general ways propensity scores are used. The most common is propensity score matching, which involves assembling 2 groups of study participants, one group that received the treatment of interest and the other that did not, while matching individuals with similar or identical propensity scores.1 The analysis of a propensity score–matched sample can then approximate that of a randomized trial by directly comparing outcomes between individuals who received the treatment of interest and those who did not, using methods that account for the paired nature of the data.5

The second approach is stratification on the propensity score.4 This technique involves separating study participants into distinct groups or strata based on their propensity scores. Five strata are commonly used, although increasing the number can reduce the likelihood of bias. The association between the treatment of interest and the outcome of interest is estimated within each stratum or pooled across strata to provide an overall estimate of the relationship between treatment and outcome. This technique relies on the notion that individuals within each stratum are more similar to each other than individuals in general; thus, their outcomes can be directly compared.

The third approach is covariate adjustment using the propensity score. For this approach, a separate multivariable model is developed, after the propensity score model, in which the study outcome serves as the dependent variable and the treatment group and propensity score serve as predictor variables. This allows the investigator to estimate the outcome associated with the treatment of interest while adjusting for the probability of receiving that treatment, thus reducing confounding.

The fourth approach is inverse probability of treatment weighting using the propensity score.6 In this instance, propensity scores are used to calculate statistical weights for each individual to create a sample in which the distribution of potential confounding factors is independent of exposure, allowing an unbiased estimate of the relationship between treatment and outcome.7

Alternative strategies—other than use of propensity scores—for adjusting for baseline differences between groups in observational studies include matching on baseline characteristics, performing stratified analyses, or using multivariable statistical methods to adjust for confounders. Propensity score methods are often more practical or statistically more efficient than these methods, in part
jama.com (Reprinted) JAMA October 20, 2015 Volume 314, Number 15 1637
because propensity score methods can substantially limit the number of predictor variables used in the final analysis. Propensity score methods generally allow many more variables to be included in the propensity score model, which increases the ability of these approaches to effectively adjust for confounding, than could be incorporated directly into a multivariable analysis of the study outcome.

What Are the Limitations of Propensity Score Methods?
The propensity score for each study participant is based on the available measured patient characteristics, and unadjusted confounding may still exist if unmeasured factors influenced treatment selection. Therefore, using fewer variables in the propensity score model reduces the likelihood of effectively adjusting for confounding.

Although propensity score matching may be used to assemble comparable study groups, the quality of matching depends on the quality of the propensity score model, which in turn depends on the quality and size of the available data and how the model was built. Conventional modeling methods (eg, variable selection, use of interactions, regression diagnostics, etc) are not typically recommended for the development of propensity score models. For example, propensity score models may optimally include a larger number of predictor variables.

Why Did the Authors Use Propensity Methods?
In the reports by Rozé et al1 and Huybrechts et al,2 both of whom used propensity score methods because their data were observational, the treatments of interest (ie, screening by echocardiography and use of antidepressants in pregnancy) were not randomly allocated, and important characteristics differed between groups. Direct comparisons of the outcomes between treated and untreated groups would have likely resulted in significantly biased estimates. Instead, use of propensity score matching and stratification enabled the investigators to create study groups that were similar to one another and more accurately measure the relationship between treatment and outcome.

How Should the Findings Be Interpreted?
Given the observational nature of these studies, the fact that individuals in the treated and untreated groups were dissimilar, and the goal of accurately estimating the association between treatment and outcome, the investigators had to adjust for differences in the treatment groups. Use of propensity score methods, whether by matching or stratification, resulted in less biased estimates than if such methods were not used. Even though observational data cannot definitively establish causal relationships or determine treatment effects as rigorously as a randomized clinical trial, assuming propensity score methods are properly used and the sample size is sufficiently large, these methods may provide a useful approximation of the likely effect of a treatment. This approach is particularly valuable for clinical situations in which randomized trials are not feasible or are unlikely to be conducted.

What Caveats Should the Reader Consider When Assessing the Results of Propensity Analyses?
The studies by Rozé et al1 and Huybrechts et al2 used propensity score matching and propensity score stratification, respectively. Although both methods are more valid in terms of balancing study groups than simple matching or stratification based on baseline characteristics, they vary in their ability to minimize bias. In general, propensity score matching minimizes bias to a greater extent than propensity score stratification. Assessment of balance between the groups, after use of propensity score methods, is important to allow readers to assess the comparability of patient groups.

Although no single standard approach exists to assess balance, comparing characteristics between treated and untreated patients typically begins with comparing summary statistics (eg, means or proportions) and the entire distributions of observed characteristics. For propensity score–matched samples, standardized differences (ie, differences divided by pooled standard deviations) are often used and, although no threshold is universally accepted, a standardized difference less than 0.1 is often considered negligible. Assessing for balance provides a general sense for how well matching or stratification occurred and thus the extent to which the results are likely to be valid. Unfortunately, balance can only be demonstrated for patient characteristics that were measured in the study. Differences could still exist between patient groups that were not measured, resulting in biased results.
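The matching and balance-assessment steps described above can be sketched in a few lines. The propensity scores and ages below are hypothetical, and the greedy nearest-neighbor matcher is one simple variant among many; in practice the scores would come from a fitted model (eg, logistic regression) and matching would be done with dedicated software:

```python
from statistics import mean, variance

def greedy_match(treated, control, caliper=0.1):
    """Greedy 1:1 nearest-neighbor matching on the propensity score.
    Pairs each treated unit with the closest unused control whose
    score lies within `caliper`; unmatched treated units are dropped.
    Returns (treated_id, control_id) pairs."""
    available = dict(control)                     # id -> propensity score
    pairs = []
    for t_id, t_score in sorted(treated, key=lambda x: -x[1]):
        if not available:
            break
        c_id = min(available, key=lambda c: abs(available[c] - t_score))
        if abs(available[c_id] - t_score) <= caliper:
            pairs.append((t_id, c_id))
            del available[c_id]
    return pairs

def standardized_difference(x_treated, x_control):
    """Standardized difference for a continuous covariate: difference
    in group means divided by the pooled standard deviation. Values
    below ~0.1 are often taken to indicate negligible imbalance."""
    pooled_sd = ((variance(x_treated) + variance(x_control)) / 2) ** 0.5
    return (mean(x_treated) - mean(x_control)) / pooled_sd

treated = [("T1", 0.62), ("T2", 0.35), ("T3", 0.80)]   # hypothetical scores
control = [("C1", 0.30), ("C2", 0.58), ("C3", 0.33), ("C4", 0.95)]

print(greedy_match(treated, control))   # T3 has no control within the caliper
print(standardized_difference([61, 64, 58, 70, 66], [60, 63, 59, 69, 67]))
```

After matching, balance would be checked this way for every measured covariate; a standardized difference above the chosen threshold signals that the propensity score model should be revisited.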
ARTICLE INFORMATION
Author Affiliations: Department of Emergency Medicine, University of Colorado School of Medicine, Denver (Haukoos); Department of Emergency Medicine, Harbor-UCLA Medical Center, Torrance, California (Lewis); David Geffen School of Medicine at UCLA, Los Angeles, California (Lewis).
Corresponding Author: Jason S. Haukoos, MD, MSc, Department of Emergency Medicine, Denver Health Medical Center, 777 Bannock St, Mail Code 0108, Denver, CO 80204 ([email protected]).
Section Editors: Roger J. Lewis, MD, PhD, Department of Emergency Medicine, Harbor-UCLA Medical Center and David Geffen School of Medicine at UCLA; and Edward H. Livingston, MD, Deputy Editor, JAMA.
Conflict of Interest Disclosures: All authors have completed and submitted the ICMJE Form for Disclosure of Potential Conflicts of Interest and none were reported.
Funding/Support: Dr Haukoos is supported, in part, by grants R01AI106057 from the National Institute of Allergy and Infectious Diseases (NIAID) and R01HS021749 from the Agency for Healthcare Research and Quality (AHRQ).
Disclaimer: The views expressed herein are those of the authors and do not necessarily represent the views of NIAID, the National Institutes of Health, or AHRQ.

REFERENCES
1. Rozé JC, Cambonie G, Marchand-Martin L, et al; Hemodynamic EPIPAGE 2 Study Group. Association between early screening for patent ductus arteriosus and in-hospital mortality among extremely preterm infants. JAMA. 2015;313(24):2441-2448.
2. Huybrechts KF, Bateman BT, Palmsten K, et al. Antidepressant use late in pregnancy and risk of persistent pulmonary hypertension of the newborn. JAMA. 2015;313(21):2142-2151.
3. Greenland S, Pearl J, Robins JM. Causal diagrams for epidemiologic research. Epidemiology. 1999;10(1):37-48.
4. Rosenbaum PR, Rubin DB. The central role of the propensity score in observational studies for causal effects. Biometrika. 1983;70:41-55.
5. Austin PC. An introduction to propensity score methods for reducing the effects of confounding in observational studies. Multivariate Behav Res. 2011;46(3):399-424.
6. Schaffer JM, Singh SK, Reitz BA, Zamanian RT, Mallidi HR. Single- vs double-lung transplantation in patients with chronic obstructive pulmonary disease and idiopathic pulmonary fibrosis since the implementation of lung allocation based on medical need. JAMA. 2015;313(9):936-948.
7. Robins JM, Hernán MA, Brumback B. Marginal structural models and causal inference in epidemiology. Epidemiology. 2000;11(5):550-560.
Multiple Imputation
A Flexible Tool for Handling Missing Data
Peng Li, PhD; Elizabeth A. Stuart, PhD; David B. Allison, PhD
In this issue of JAMA, Asch et al1 report results of a cluster randomized clinical trial designed to evaluate the effects of physician financial incentives, patient incentives, or shared physician and patient incentives on low-density lipoprotein cholesterol (LDL-C) levels among patients with high cardiovascular risk. Because 1 or more follow-up LDL-C measurements were missing for approximately 7% of participants, Asch et al used multiple imputation (MI) to analyze their data and concluded that shared financial incentives for physicians and patients, but not incentives to physicians or patients alone, resulted in the patients having lower LDL-C levels. Imputation is the process of replacing missing data with 1 or more specific values, to allow statistical analysis that includes all participants and not just those who do not have any missing data.

Related article page 1926

Missing data are common in research. In a previous JAMA Guide to Statistics and Methods, Newgard and Lewis2 reviewed the causes of missing data. These are divided into 3 classes: (1) missing completely at random, the most restrictive assumption, indicating that whether a data point is missing is completely unrelated to observed and unobserved data; (2) missing at random, a more realistic assumption than missing completely at random, indicating that whether a data point is missing can be explained by the observed data; or (3) missing not at random, meaning that the missingness is dependent on the unobserved values. Common statistical methods used for handling missing values were reviewed.2 When missing data occur, it is important to not exclude cases with missing information (analyses after such exclusion are known as complete case analyses). Single-value imputation methods are those that estimate what each missing value might have been and replace it with a single value in the data set. Single-value imputation methods include mean imputation, last observation carried forward, and random imputation. These approaches can yield biased results and are suboptimal. Multiple imputation better handles missing data by estimating and replacing missing values many times.

Use of Method
Why Is Multiple Imputation Used?
Multiple imputation fills in missing values by generating plausible numbers derived from distributions of and relationships among observed variables in the data set.3 Multiple imputation differs from single imputation methods because missing data are filled in many times, with many different plausible values estimated for each missing value. Using multiple plausible values provides a quantification of the uncertainty in estimating what the missing values might be, avoiding the creation of false precision (as can happen with single imputation). Multiple imputation provides accurate estimates of quantities or associations of interest, such as treatment effects in randomized trials, sample means of specific variables, and correlations between 2 variables, as well as the related variances. In doing so, it reduces the chance of false-positive or false-negative conclusions.

Multiple imputation entails 2 stages: (1) generating replacement values ("imputations") for missing data and repeating this procedure many times, resulting in many data sets with replaced missing information, and (2) analyzing the many imputed data sets and combining the results. In stage 1, MI imputes the missing entries based on statistical characteristics of the data, for example, the associations among and distributions of variables in the data set. After the imputed data sets are obtained, in stage 2, any analysis can be conducted within each of the imputed data sets as if there were no missing data. That is, each of the "filled-in" complete data sets is simply analyzed with any method that would be valid and appropriate for addressing a scientific question in a data set that had no missing data.

After the intended statistical analysis (regression, t test, etc) is run separately on each imputed data set (stage 2), the estimates of interest (eg, the mean difference in outcome between a treatment and a control group) from all the imputed data sets are combined into a single estimate using standard combining rules.3 For example, in the study by Asch et al,1 the reported treatment effect is the average of the treatment effects estimated from each of the imputed data sets. The total variance or uncertainty of the treatment effect is obtained, in part, by seeing how much the estimate varies from one imputed data set to the next, with greater variability across the imputed data sets indicating greater uncertainty due to missing data. This imputation-to-imputation variability is built into a formula that provides accurate standard errors and, thereby, confidence intervals and significance tests for the quantities of interest, while allowing for the uncertainty due to the missing data. This distinguishes MI from single imputation.

Combining most parameter estimates, such as regression coefficients, is straightforward,4 and modern software (including R, SAS, Stata, and others) can do the combining automatically. There are some caveats as to which variables must be included in the statistical model in the imputation stage, which are discussed extensively elsewhere.5

Another advantage of adding MI to the statistical toolbox is that it can handle interesting problems not conventionally thought of as missing data problems. Multiple imputation can correct for measurement error by treating the unobserved true scores (eg, someone's exact degree of ancestry from a particular population when there are only imperfect estimates for each person) as missing,6 generate data appropriate for public release while ensuring confidentiality,7 or make large-scale sampling more efficient through planned missing data (ie, by intentionally measuring some variables on only a subset of participants in a study to save money).8
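The combining rules described above (averaging the per-imputation estimates and adding within- and between-imputation variability) can be sketched as follows. The per-imputation estimates and standard errors are hypothetical; the rules shown are the standard combining rules of Rubin3:

```python
from statistics import mean, variance

# Hypothetical stage-2 results from m = 5 imputed data sets:
# an estimated treatment effect and its squared standard error each.
estimates = [4.2, 3.8, 4.5, 4.0, 4.1]        # effect estimate per imputed data set
within_var = [0.25, 0.24, 0.26, 0.25, 0.25]  # squared SE per imputed data set

m = len(estimates)
pooled_estimate = mean(estimates)            # average across imputations
w = mean(within_var)                         # within-imputation variance
b = variance(estimates)                      # between-imputation variance
total_var = w + (1 + 1 / m) * b              # total variance of the pooled estimate
pooled_se = total_var ** 0.5

print(pooled_estimate, round(pooled_se, 3))
```

The between-imputation term `b` is the "how much the estimate varies from one imputed data set to the next" component; dropping it (as single imputation implicitly does) would understate the standard error and produce overly narrow confidence intervals.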
1966 JAMA November 10, 2015 Volume 314, Number 18 (Reprinted) jama.com
What Are the Limitations of Multiple Imputation?
As with any statistical technique, the validity of MI depends on the validity of its assumptions. But when those assumptions are met, MI rests on well-established theory.3,5 Moreover, substantial empirical support exists for the validity of MI in simulations, including those based on real data patterns.9 In principle, computational speed can be a problem because each analysis must be run multiple times, but in practice, this is rarely an issue with modern computers.

Many nonstatisticians chafe at "making up data" as is done in MI and note that the validity of MI depends on an assumption about which factors relate to the probability that a data point is missing. Because of concern this assumption may be violated, it is tempting to retreat to the safe haven of complete case analysis, ie, only analyze the participants without missing values. This safe haven is, however, illusory. Although rarely made explicit by users, complete case analysis requires a far more restrictive assumption: that any data point missing is missing completely at random. Other common strategies—mean imputation, last observation carried forward, and other single imputation approaches—underestimate standard errors by ignoring or underestimating the inherent uncertainty created by missing data, a problem MI helps overcome.

Why Did the Authors Use Multiple Imputation in This Particular Study?
In the study by Asch et al,1 the primary outcome, LDL-C levels, had missing values. Thus, a method to handle missingness was needed to maintain the validity of the statistical inferences. Complete case analysis would have inappropriately not included 7% of their sample, leading to less study power, results restricted to those individuals without missing values, violation of the intent-to-treat principle, possible nonrandom loss and therefore a loss of the ability to rely on the fact of randomization to justify causal inferences, and ultimately to results that may not apply to the original full sample.

How Should Multiple Imputation Findings Be Interpreted in This Particular Study?
Provided that the underlying assumptions of MI are met, the results from this study can be interpreted as if all the participants had no missing entries. That is, both the estimates of quantities like means and measures of association and the estimates of their uncertainty (standard errors) on which formal statistical testing is based will not be biased by the fact that some data were missing. There would have been greater precision of the estimates and study power had there been no missing data. But imputation at least appropriately reflects the amount of information there actually is in the data available.

Caveats to Consider When Looking at Results Based on Multiple Imputation
When the missing data are not missing at random, results from MI may not be reliable. Generally, reasons for missingness cannot be fully identified. In practice, collecting more information about study participants may help identify why data are missing. These "auxiliary variables" can then be used in the imputation process and improve MI's performance. All other things being equal, imputation models with more variables included and a large number of imputations improve MI's performance. Multiple imputation is arguably the most flexible valid missing data approach among those that are commonly used.
ARTICLE INFORMATION
Author Affiliations: Office of Energetics and Nutrition Obesity Research Center, University of Alabama at Birmingham (Li, Allison); Department of Biostatistics, School of Public Health, University of Alabama at Birmingham (Li, Allison); Departments of Mental Health, Biostatistics, and Health Policy and Management, Johns Hopkins Bloomberg School of Public Health, Baltimore, Maryland (Stuart).
Corresponding Author: David B. Allison, PhD, Office of Energetics and Nutrition Obesity Research Center, University of Alabama at Birmingham, Ryals Bldg, Room 140J, 1665 University Blvd, Birmingham, AL 35294 ([email protected]).
Section Editors: Roger J. Lewis, MD, PhD, Department of Emergency Medicine, Harbor-UCLA Medical Center and David Geffen School of Medicine at UCLA; and Edward H. Livingston, MD, Deputy Editor, JAMA.
Conflict of Interest Disclosures: All authors have completed and submitted the ICMJE Form for Disclosure of Potential Conflicts of Interest and none were reported.
Funding/Support: This work was supported in part by grants from the National Institutes of Health (NIH) (R25HL124208, R25DK099080, R01MH099010, and P30DK056336).
Role of the Funder/Sponsor: The funding sources had no role in the preparation, review, or approval of the manuscript.
Disclaimer: The opinions expressed are those of the authors and do not necessarily represent those of the NIH or any other organization.

REFERENCES
1. Asch DA, Troxel AB, Stewart WF, et al. Effect of financial incentives to physicians, patients, or both on lipid levels: a randomized clinical trial. JAMA. doi:10.1001/jama.2015.14850.
2. Newgard CD, Lewis RJ. Missing data: how to best account for what is not known. JAMA. 2015;314(9):940-941.
3. Rubin DB. Multiple Imputation for Nonresponse in Surveys. New York, NY: Wiley; 1987.
4. White IR, Royston P, Wood AM. Multiple imputation using chained equations: issues and guidance for practice. Stat Med. 2011;30(4):377-399.
5. Schafer JL. Analysis of Incomplete Multivariate Data. New York, NY: Chapman & Hall; 1997.
6. Padilla MA, Divers J, Vaughan LK, Allison DB, Tiwari HK. Multiple imputation to correct for measurement error in admixture estimates in genetic structured association testing. Hum Hered. 2009;68(1):65-72.
7. Wang H, Reiter JP. Multiple imputation for sharing precise geographies in public use data. Ann Appl Stat. 2012;6(1):229-252.
8. Capers PL, Brown AW, Dawson JA, Allison DB. Double sampling with multiple imputation to answer large sample meta-research questions: introduction and illustration by evaluating adherence to two simple CONSORT guidelines. Front Nutr. 2015;2:6.
9. Elobeid MA, Padilla MA, McVie T, et al. Missing data in randomized clinical trials for weight loss: scope of the problem, state of the field, and performance of statistical methods. PLoS One. 2009;4(8):e6624.
Dose-Finding Trials
Optimizing Phase 2 Data in the Drug Development Process
Kert Viele, PhD; Jason T. Connor, PhD
Related article page 2251

Clinical trials in drug development are commonly divided into 3 categories or phases. The first phase aims to find the range of doses of potential clinical use, usually by identifying the maximum tolerated dose. The second phase aims to find doses that demonstrate promising efficacy with acceptable safety. The third phase aims to confirm the benefit previously found in the second phase using clinically meaningful end points and to demonstrate safety more definitively.

Dose-finding trials—studies conducted to identify the most promising doses or doses to use in later studies—are a key part of the second phase and are intended to answer the dual questions of whether future development is warranted and what dose or doses should be used. If too high a dose is chosen, adverse effects in later confirmatory phase 3 trials may threaten the development program. If too low a dose is chosen, the treatment effect may be too small to yield a positive confirmatory trial and gain approval from a regulatory agency. A well-designed dose-finding trial is able to establish the optimal dose of a medication and facilitate the decision to proceed with a phase 3 trial.

Selection of a dose for further testing requires an understanding of the relationships between dose and both efficacy and safety. These relationships can be assessed by comparing the data from each dose group with placebo, or with the other doses, in a series of pairwise comparisons. This approach is prone to both false-negative and false-positive results because of the large number of statistical comparisons and the relatively small number of patients receiving each dose. These risks can be mitigated by combining data from patients receiving multiple active doses into a single treatment group for comparison with placebo ("pooling"), but only if it is possible to reliably predict which doses are likely to be effective.

In general, dose-response relationships are best examined through dose-response models that make flexible, justifiable assumptions about the potential dose-response relationships and allow the integration of information from all doses used in the trial. This can reduce the risk of both false-negative and false-positive results; incorporating all data into the estimates of efficacy and safety for each dose produces more accurate estimates than evaluating the response to each dose separately.

In this issue of JAMA, Gheorghiade et al1 report the results of SOCRATES-REDUCED, a randomized placebo-controlled dose-finding clinical trial investigating 4 different target doses of vericiguat for patients with worsening chronic heart failure, with the primary outcome being a reduction in log-transformed level of N-terminal pro-B-type natriuretic peptide. The primary approach to analyzing the dose response, combining the data from patients allocated to the 3 highest target doses (pooling) for comparison with placebo, yielded a negative result (P = .15), but a different dose-response model based on linear regression, used in an exploratory secondary analysis, yielded a positive result (P = .02).

Use of the Method
Why Are Dose-Response Models Used?
A dose-response model assumes a general relationship between dose and efficacy or dose and the rates of adverse effects.2 Ideally, this allows data from patients receiving all doses of the drug to contribute to the estimated dose-response curve, maximizing the statistical power of the study and reducing the uncertainty in the estimates of the effects of each dose. When a sufficiently flexible general relationship is used, the dose-response model correctly identifies doses of low or high efficacy (avoiding the assumption of similar efficacy across doses, as is implied with pooling) while smoothing out spurious highs and lows (avoiding problems that occur when each dose is analyzed separately). A model can produce estimates and confidence intervals for the effect of every dose and often even for drug doses not included in the trial.

Dose-response modeling is first used to determine whether a treatment effect appears to exist and, if so, to estimate dose-specific effects to help optimize subsequent phase 3 trial design. Unlike a confirmatory trial in which a regulatory agency makes a binary decision (eg, to approve or not approve a drug), phase 2 trials are used to inform the next stage of drug development. Therefore, estimation of the magnitude of treatment effects is more important than testing hypotheses regarding treatment effects. Phase 2 dose-finding studies can also be used to predict the likelihood of later phase 3 success through calculation of predictive probabilities.3

The assumptions in the dose-response model can be rigid or flexible to match preexisting knowledge of the clinical setting. When accurate, such assumptions can increase the power of a trial design by incorporating known clinical information. When inaccurate, these assumptions compromise the statistical properties of the trial and the interpretability of the results. For example, in SOCRATES-REDUCED, the primary analysis consisted of pooling data from the 3 highest-dose regimens.1 This approach is most effective when the efficacious region of the dose range can be predicted reliably. The exploratory secondary analysis in SOCRATES-REDUCED was based on a linear regression model. This approach is most effective when a linear dose-response relationship is likely to exist over the range of doses evaluated in the trial.

A common dose-response model is the Emax model,4 which assumes an S-shaped curve for the dose response (eg, a monotonically increasing curve that is flat for low doses, increases for the middle dose range, and then flattens out again for high doses). The model is flexible in that the height of the plateau, the dose location of the increase in efficacy, and the rate of increase may all be informed by the data. Alternatives to the Emax model include smoothing models such as a normal dynamic linear model.5 These models take the raw data and produce a smooth curve that eliminates random highs and lows but maintains the general shape. Normal dynamic linear models are particularly useful for dose responses that may be "inverted U" shaped and may be applicable when the dose response is for an outcome that combines safety and efficacy (low doses may not be efficacious, high doses may be unsafe, resulting in an inverted U shape, with the optimal dose in the middle).

What Are the Limitations of Dose-Response Modeling?
All dose-response models require assumptions regarding the potential shapes of the dose-response curve, although sometimes (eg, with pooling) the assumptions are only implied. When assumptions are incorrect, inferences from the model may be invalid. In SOCRATES-REDUCED, the implied assumption of the primary analysis of similar efficacy among the 3 highest doses was not supported by the data. Similarly, the linear model used in the exploratory secondary analysis assumed that the increase in benefit from one dose to the next was the same between every successive pair of doses. This also does not appear to be strictly consistent with the data obtained in the trial.

Why Did the Authors Use Dose-Response Modeling in This Particular Study?
The authors used dose-response modeling to maximize the power of the primary analysis hypothesis test. If the 3 highest doses had all been similarly effective, pooling of data from these doses would result in higher sample sizes in the treatment group of the primary "treatment vs placebo" hypothesis test and higher power to detect an effect. In the exploratory secondary analysis using the linear dose-response model, the authors used a model that allowed the higher doses to be significantly more efficacious than the lower doses.

How Should the Dose-Response Findings Be Interpreted in This Particular Study?
Figure 2 in the report by Gheorghiade et al1 shows the key dose-response relationship and suggests that the 10-mg target dose is the most or possibly only effective dose. However, the primary analysis was null, and the protocol called for the statistical secondary analysis only if the primary analysis were significant at P < .05. Therefore, although the 10-mg dose appears to be the most promising for investigation in a phase 3 trial, the dose-ranging findings must be considered very tentative. There remains uncertainty regarding how best to estimate the effect of the 10-mg dose. The primary analysis did not evaluate the effect of the 10-mg dose alone, and separate analyses for each dose would be prone to high variation and false-positive results due to multiple comparisons. The exploratory linear model produced an estimated effect for the 10-mg dose under an assumption of linearity. This analysis and its results were considered only exploratory.

Caveats to Consider When Looking at Results Based on a Dose-Response Model
It is often useful to inspect a plot of the dose-response model–based estimates against all data observed in the trial. This allows visual confirmation that the chosen dose-response model captures the general shape of the observed data.
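The sigmoid Emax relationship described above is simple enough to state in a few lines of code. The sketch below is illustrative only; the parameter values are hypothetical and do not represent the vericiguat analysis:

```python
def emax_response(dose, e0, e_max, ed50, hill=1.0):
    """Sigmoid Emax model: expected response at a given dose.

    e0    -- response at dose 0 (placebo response)
    e_max -- maximum drug effect above placebo (height of the plateau)
    ed50  -- dose producing half of the maximum drug effect
    hill  -- steepness of the rise in the middle dose range
    """
    return e0 + e_max * dose**hill / (ed50**hill + dose**hill)

# Hypothetical parameters for illustration only.
for dose in [0, 1, 2.5, 5, 10]:
    response = emax_response(dose, e0=0.0, e_max=10.0, ed50=2.5, hill=2.0)
    print(f"dose {dose}: predicted response {response:.2f}")
```

At dose 0 the model returns e0, at the ED50 it returns e0 plus half of e_max, and at high doses it flattens toward the plateau e0 + e_max; the trial data inform where and how quickly that rise occurs.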
ARTICLE INFORMATION

Author Affiliations: Berry Consultants LLC, Austin, Texas (Viele, Connor); University of Central Florida College of Medicine, Orlando (Connor).

Corresponding Author: Kert Viele, PhD, Berry Consultants LLC, 4301 Westbank Dr, Bldg B, Ste 140, Austin, TX 78746 ([email protected]).

Section Editors: Roger J. Lewis, MD, PhD, Department of Emergency Medicine, Harbor-UCLA Medical Center and David Geffen School of Medicine at UCLA; and Edward H. Livingston, MD, Deputy Editor, JAMA.

Conflict of Interest Disclosures: Both authors have completed and submitted the ICMJE Form for Disclosure of Potential Conflicts of Interest and none were reported.

REFERENCES

1. Gheorghiade M, Greene SJ, Butler J, et al. Effect of vericiguat, a soluble guanylate cyclase stimulator, on natriuretic peptide levels in patients with worsening chronic heart failure and reduced ejection fraction: the SOCRATES-REDUCED randomized trial. JAMA. doi:10.1001/jama.2015.15734.

2. Bretz F, Hsu J, Pinheiro J, Liu Y. Dose finding: a challenge in statistics. Biom J. 2008;50(4):480-504.

3. Saville BR, Connor JT, Ayers GD, Alvarez J. The utility of Bayesian predictive probabilities for interim monitoring of clinical trials. Clin Trials. 2014;11(4):485-493.

4. Dragalin V, Hsuan F, Padmanabhan SK. Adaptive designs for dose-finding studies based on sigmoid Emax model. J Biopharm Stat. 2007;17(6):1051-1070.

5. Krams M, Lees KR, Hacke W, Grieve AP, Orgogozo JM, Ford GA; ASTIN Study Investigators. Acute Stroke Therapy by Inhibition of Neutrophils (ASTIN): an adaptive dose-response study of UK-279,276 in acute ischemic stroke. Stroke. 2003;34(11):2543-2548.
Longitudinal studies often include multiple, repeated measurements of each patient's status or outcome to assess differences in outcomes or in the rate of recovery or decline over time. Repeated measurements from a particular patient are likely to be more similar to each other than measurements from different patients, and this correlation needs to be considered in the analysis of the resulting data. Many common statistical methods, such as linear regression models, should not be used in this situation because those methods assume measurements to be independent of one another.

It is possible to compare outcomes between treatments using only a final measurement to determine whether there was a difference at the end of the study; however, this approach would not include much of the information captured with repeated measurements and there would be no consideration of the pattern of outcomes each patient experienced in reaching his or her final outcome. When outcomes are measured repeatedly over time, a wide variety of clinically important questions may be addressed.

In the EXACT study recently published in JAMA, Moseley et al1 examined activity limitations and quality of life (QOL) among patients with ankle fractures to determine if a supervised exercise program with rehabilitation advice was more beneficial than advice alone. Activity limitations and QOL were measured at baseline and at 1, 3, and 6 months of follow-up. The authors used mixed models2 to compare patient outcomes over time between the 2 intervention groups.

Use of the Method
Why Are Mixed Models Used for Repeated Measures Data?
Mixed models are ideally suited to settings in which the individual trajectory of a particular outcome for a study participant over time is influenced both by factors that can be assumed to be the same for many patients (eg, the effect of an intervention) and by characteristics that are likely to vary substantially from patient to patient (eg, the severity of the ankle fracture, baseline level of function, and QOL). Mixed models explicitly account for the correlations between repeated measurements within each patient.

The factors assumed to have the same effect across many patients are called fixed effects and the factors likely to vary substantially from patient to patient are called random effects. For example, the effect of a new treatment may be assumed to be the same for all patients and modeled as a fixed effect, whereas patients may have markedly different baseline function or inherent rates of recovery and these may be best modeled as random effects. Mixed models are called "mixed" because they generally contain both fixed and random effects. The ability to consider both fixed and random effects in the model gives flexibility to determine the effects of multiple factors and to address specific questions of clinical importance. In contrast, repeated measures analysis of variance (ANOVA), often used for analyzing longitudinal data, does not have this flexibility and can yield misleading results if its more rigid assumptions (eg, all effects are considered fixed) are not met.

Furthermore, using a mixed model, data from all assessments contribute to the treatment comparisons, resulting in more precise estimates and a more powerful study. A mixed model can also address whether outcomes changed over time (eg, the rate of recovery of function or decline) within each treatment group. Moreover, in addition to population-level comparisons, mixed models can be used to characterize an individual patient's response patterns over time. The specific clinical question motivating the trial determines the structure of the mixed model that is most applicable. For example, if the effect of a treatment on the rate of recovery from a patient-specific baseline is to be determined, then the mixed model is likely to include a random baseline effect and a fixed interaction term between treatment group and time, with the latter term capturing the effect of the treatment on the rate of recovery.

Observations may be correlated with each other in several different ways. These patterns are known as correlation structures, and it is important when using mixed models to use the correct structure. For example, if the correlation between each measurement is likely to be the same regardless of the length of time between the measurements, then a "compound symmetry" structure is appropriate. In contrast, if the correlation between measurements decreases as the time between measurements increases, then an "autoregressive" structure should be used. Finally, an "unstructured" correlation can be used if no constraints can be imposed on the correlation pattern, but fitting a model with an unstructured correlation requires a larger data set than the other approaches.

Ideally, the assumed correlation structure should be based on the clinical context in which the repeated measurements were taken. For example, certain longitudinal data (eg, pain scores after joint surgery) at adjacent assessments would tend to be more correlated than those measured farther apart, making an autoregressive structure appropriate. Statistical testing (eg, a likelihood ratio test) may be used when an objective comparison is needed to evaluate competing correlation structures.

Incomplete outcome data, for example, caused by patients missing some visits or dropping out of the study, are common in longitudinal studies.3 As a result, study participants may have different numbers of available measurements, a situation that cannot be addressed by repeated measures ANOVA. Mixed models can accommodate unbalanced data patterns and use all available observations and patients in the analysis. Mixed models assume that the missingness is independent of unobserved measurements, but dependent on the observed measurements.4,5 This assumption is called "missing at random" and is often reasonable.3,5 Repeated measures ANOVA requires a more unlikely assumption
that the missingness is independent of both the observed and unobserved measurements, called "missing completely at random." Using mixed models, reasonably valid estimates of treatment effects can often be obtained even when the missing values are not completely random and additional methods for handling missing data, such as multiple imputation, are generally not required.3-5

What Are the Limitations of Mixed Models?
As with any statistical model, a mixed model will have limited validity if its underlying assumptions are not met. For example, if the effect of a treatment varies substantially from patient to patient, for instance, because of genetic differences, then considering the treatment effect as fixed may not be reasonable. Similarly, the assumed correlation structure can adversely impact model results and study conclusions if incorrect. It is important to ensure that the structure of the mixed model matches what is reasonably believed about the clinical setting in which the model is applied.

Because of the larger number of parameters to be estimated from the data, mixed models may be difficult to estimate or "fit" when the available data are limited. This is especially true if an unstructured correlation structure must be used. The precise methods used by different software packages to fit mixed models differ, so the numerical results can vary somewhat based on the statistical software used.

In the presence of missing data, mixed models can provide valid inferences under an assumption that data are missing at random. However, in practice it is often impossible to know that this assumption is met, and informative censoring (nonignorable missingness) can never be ruled out. If the investigators suspect deviation from the missing-at-random assumption, sensitivity analyses may be conducted using models appropriate for nonignorable missingness. The models used would depend on the study design, missing data patterns observed, and other study-specific considerations.2

Why Did the Authors Use Mixed Models in This Particular Study?
The EXACT trial investigators used mixed models in their analyses because they wanted to answer the question of how outcomes changed over time and how they were affected by treatment. The model included fixed effects for treatment group, time of measurement, and baseline score. An interaction term between treatment group and time was also included to determine if the 2 treatment interventions led to different recovery trajectories over time. In addition, the model included a random effect for the baseline value, addressing the variability in the starting point for each patient.

The EXACT trial reported that in each treatment group, 10% to 20% of the patients were lost to follow-up as the study progressed. Thus, it was important for the authors to examine the effects of the missingness. They included a preplanned sensitivity analysis that used multiple imputation5 to evaluate how sensitive the primary outcome result was to the missing at random data assumption. The results of the main and sensitivity analyses were similar.

Caveats to Consider When Looking at Results From Mixed Models
As with most statistical models, it is important to consider whether the structure of the data obtained and the clinical setting (eg, repeated measures over time) match the model structure. It is often useful to inspect graphical data summaries (eg, "spaghetti" or "string" plots showing the outcome trajectories of individual study participants over time) to determine whether the observed data patterns appear consistent with model assumptions.

When outcome data are missing, the analyst should consider whether the pattern of missingness is likely to be random, meeting the assumptions inherent in mixed models. The rationale for the chosen correlation structure should be clear and based on study design (eg, the pattern of follow-up visits) rather than based on what allows a model to be fit with the available data.
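The compound symmetry and autoregressive structures described above differ only in how correlation varies with the time lag between measurements. A minimal sketch (illustrative only; the correlation value 0.6 is hypothetical):

```python
def compound_symmetry(n_visits, rho):
    """Correlation matrix in which every pair of visits has correlation rho."""
    return [[1.0 if i == j else rho for j in range(n_visits)]
            for i in range(n_visits)]

def autoregressive(n_visits, rho):
    """AR(1) correlation matrix: correlation decays as rho**lag, so visits
    farther apart in time are less correlated."""
    return [[rho ** abs(i - j) for j in range(n_visits)]
            for i in range(n_visits)]

cs = compound_symmetry(4, 0.6)
ar = autoregressive(4, 0.6)
# Under compound symmetry, visits 1 and 4 are as correlated as visits 1 and 2
# (both 0.6); under AR(1) the visit 1-to-4 correlation has decayed to
# 0.6**3 = 0.216.
```

An unstructured correlation, by contrast, would estimate each off-diagonal entry separately, which is why it requires a larger data set than either of these patterns.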
ARTICLE INFORMATION

Author Affiliations: Berry Consultants, LLC, Austin, Texas (Detry); Department of Epidemiology and Biostatistics, Milken Institute School of Public Health, The George Washington University, Washington, DC (Ma).

Corresponding Author: Michelle A. Detry, PhD, Berry Consultants LLC, 4301 Westbank Dr, Bldg B, Ste 140, Austin, TX 78746 ([email protected]).

Section Editors: Roger J. Lewis, MD, PhD, Department of Emergency Medicine, Harbor-UCLA Medical Center and David Geffen School of Medicine at UCLA; and Edward H. Livingston, MD, Deputy Editor, JAMA.

Conflict of Interest Disclosures: The authors have completed and submitted the ICMJE Form for Disclosure of Potential Conflicts of Interest and none were reported.

Funding/Support: Dr Ma receives funding from the Agency for Healthcare Research and Quality.

REFERENCES

1. Moseley AM, Beckenkamp PR, Haas M, Herbert RD, Lin CW; EXACT Team. Rehabilitation after immobilization for ankle fracture: the EXACT randomized clinical trial. JAMA. 2015;314(13):1376-1385.

2. Fitzmaurice GM, Laird NM, Ware JH. Applied Longitudinal Analysis. 2nd ed. Hoboken, NJ: Wiley; 2011.

3. Newgard CD, Lewis RJ. Missing data: how to best account for what is not known. JAMA. 2015;314(9):940-941. doi:10.1001/jama.2015.10516.

4. Ma Y, Mazumdar M, Memtsoudis SG. Beyond repeated-measures analysis of variance: advanced statistical methods for the analysis of longitudinal data in anesthesia research. Reg Anesth Pain Med. 2012;37(1):99-105.

5. Li P, Stuart EA, Allison DB. Multiple imputation: a flexible tool for handling missing data. JAMA. 2015;314(18):1966-1967. doi:10.1001/jama.2015.15281.
408 JAMA January 26, 2016 Volume 315, Number 4 (Reprinted) jama.com
Time-to-Event Analysis
Juliana Tolles, MD, MHS; Roger J. Lewis, MD, PhD
Related article page 990

Time-to-event analysis, also called survival analysis, was used in the study by Nissen et al1 published in this issue of JAMA to compare the risk of major adverse cardiovascular events (MACE) in a noninferiority trial of a combination of naltrexone and bupropion vs placebo for overweight or obese patients with cardiovascular risk factors. The authors used a type of time-to-event analysis called Cox proportional hazards modeling to compare the risk of MACE in the 2 groups, concluding that the use of naltrexone-bupropion increased the risk of MACE per unit time by no more than a factor of 2.

Use of the Method
Why Is Time-to-Event Analysis Used?
One way to evaluate how a medical treatment affects patients' risk of an adverse outcome is to analyze the time intervals between the initiation of treatment and the occurrence of such events. That information can be used to calculate the hazard for each treatment group in a clinical trial. The hazard is the probability that the adverse event will occur in a defined time interval. For example, Nissen et al1 could measure the number of patients who experience MACE while taking naltrexone-bupropion during week 8 of the study and calculate the risk that an individual patient will experience MACE during week 8, assuming that the patient has not had MACE before week 8. This concept of a discrete hazard rate can be extended to a hazard function, which is generally a continuous curve that describes how the hazard changes over time. The hazard function shows the risk at each point in time and is expressed as a rate or number of events per unit of time.2

Calculating the hazard function using time-to-event observations is challenging because the event of interest is usually not observed in all patients. Thus, the time to the event occurrence for some patients is invisible—or censored—and there is no way to know if the event will occur in the near future, the distant future, or never. Censoring may occur because the patient is lost to follow-up or did not experience the event of interest before the end of the study period. In Nissen et al,1 only 243 patients experienced MACE before the termination of the study, resulting in 8662 censored observations, meaning there were 8662 patients for whom it is not known when they experienced MACE, if ever. Common nonparametric statistical tests, such as the Wilcoxon rank sum test, could be used to compare the time intervals seen in the 2 groups if the analysis was limited to only the 243 patients who had observed events; however, when censored data are excluded from analysis, the information contained in the experience of the other 8662 patients is lost. While it is unknown when in the future, if ever, these patients will experience an event, the knowledge that these patients did not experience MACE during their participation in the trial is informative. The information contained in censored observations varies: patients whose data are censored early, such as a patient who is lost to follow-up in the first weeks of a study, contribute less information than those who are observed for a long time before censoring. However, all observations provide some information, and to avoid bias, methods of analysis that can accommodate censoring are used for time-to-event studies.

Kaplan-Meier plots and the Cox proportional hazards model are examples of methods for analyzing time-to-event data that account for censored observations. A Kaplan-Meier curve plots the fraction of "surviving" patients (those who have not experienced an event) against time for each treatment group. The height of the Kaplan-Meier curve at the end of each time interval is determined by taking the fraction or proportion of patients who remained event-free at the end of the prior time interval and multiplying that proportion by the fraction of patients who survive the current time interval without experiencing an event. The value of the Kaplan-Meier curve at the end of the current time interval then becomes the starting value for the next time interval. This iterative and cumulative multiplication process begins with the first time interval and continues in a stepwise manner along the Kaplan-Meier curve; the Kaplan-Meier curve is thus sometimes called the "product limit estimate" of the survival curve. Censoring is properly taken into account because only patients still being followed up at the beginning of each time interval are considered in determining the fraction "surviving" at the end of that time interval.3 Figures 2A and 2B in Nissen et al1 plot the cumulative incidence of MACE in each group vs time, an "upside-down" version of Kaplan-Meier, which provides similar information.

While a Kaplan-Meier plot elegantly represents differences between different groups' survival curves over time, it gives little indication of their statistical significance. The statistical significance of observed differences can be tested with a log-rank test.3 This test, however, does not account for confounding variables, such as differences in patient demographics between groups.

The Cox proportional hazards model both addresses the problem of censoring and allows adjustment for multiple prognostic independent variables, or confounders such as age and sex. The model assumes a "baseline" hazard function exists for individuals whose independent predictor variables are all equal to their reference value. The baseline hazard function is not explicitly defined but is allowed to take any shape. The output of a Cox proportional hazards model is a hazard ratio for each independent predictor variable, which defines how much the hazard is multiplied for each unit change in the variable of interest as compared with the baseline hazard function. Hazard ratios can be calculated for all independent variables, both confounders and intervention variables.

What Are the Limitations of the Proportional Hazards Model?
The Cox proportional hazards model relies on 2 important assumptions. The first is that data censoring is independent of the outcome of interest. If the placebo patients in the trial by Nissen et al1 were both less likely to experience MACE and less likely to follow up with trial investigators because they did not experience weight loss, the probability of censoring and the risk of MACE would be correlated,
ARTICLE INFORMATION

Author Affiliations: Department of Emergency Medicine, Harbor-UCLA Medical Center, Torrance, California (Tolles, Lewis); Los Angeles Biomedical Research Institute, Torrance, California (Tolles, Lewis); David Geffen School of Medicine at UCLA, Los Angeles, California (Tolles, Lewis); Berry Consultants LLC, Austin, Texas (Lewis).

Corresponding Author: Roger J. Lewis, MD, PhD, Department of Emergency Medicine, Harbor-UCLA Medical Center, Bldg D9, 1000 W Carson St, Torrance, CA 90509 ([email protected]).

Section Editors: Roger J. Lewis, MD, PhD, Department of Emergency Medicine, Harbor-UCLA Medical Center and David Geffen School of Medicine at UCLA; and Edward H. Livingston, MD, Deputy Editor, JAMA.

Conflict of Interest Disclosures: Both authors have completed and submitted the ICMJE Form for Disclosure of Potential Conflicts of Interest and none were reported.

REFERENCES

1. Nissen SE, Wolski KE, Prcela L, et al. Effect of naltrexone-bupropion on major adverse cardiovascular events in overweight and obese patients with cardiovascular risk factors: a randomized clinical trial. JAMA. doi:10.1001/jama.2016.1558.

2. Lee ET. Statistical Methods for Survival Analysis. 2nd ed. New York, NY: John Wiley & Sons; 1992.

3. Young KD, Menegazzi JJ, Lewis RJ. Statistical methodology: IX, Survival analysis. Acad Emerg Med. 1999;6(3):244-249.

4. Hess KR. Graphical methods for assessing violations of the proportional hazards assumption in Cox regression. Stat Med. 1995;14(15):1707-1723.
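The product-limit calculation described in the Kaplan-Meier discussion above can be sketched directly. The data below are made up for illustration and have nothing to do with the trial by Nissen et al:

```python
def kaplan_meier(observations):
    """Product-limit (Kaplan-Meier) estimate of the survival curve.

    observations -- list of (time, event) pairs; event is True if the event
    was observed at that time, False if the patient was censored then.
    Returns the (event_time, survival_probability) steps of the curve.
    """
    survival = 1.0
    curve = []
    for t in sorted({time for time, event in observations if event}):
        # Only patients still under follow-up at time t count as "at risk",
        # which is how censored observations contribute without bias.
        at_risk = sum(1 for time, _ in observations if time >= t)
        events = sum(1 for time, event in observations if time == t and event)
        survival *= 1 - events / at_risk
        curve.append((t, survival))
    return curve

# Hypothetical data: events at times 1, 3, and 4; censoring at times 2 and 5.
steps = kaplan_meier([(1, True), (2, False), (3, True), (4, True), (5, False)])
# steps -> [(1, 0.8), (3, 0.533...), (4, 0.266...)]
```

Each step multiplies the prior survival estimate by the fraction of at-risk patients surviving the current interval, which is exactly the iterative, cumulative multiplication the article describes.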
Clinical trials require significant resources to complete in terms of patients, investigators, and time and should be carefully designed and conducted so that they use the minimum amount of resources necessary to answer the motivating clinical question. The size of a clinical trial is typically based on the minimum number of patients required to have high probability of detecting the anticipated treatment effect. However, it is possible that strong evidence could emerge earlier in the trial either in favor of or against the benefit of the novel treatment. If early trial results are compelling, stopping the trial before the maximum planned sample size is reached presents ethical advantages for patients inside and outside the trial and can save resources that can be redirected to other clinical questions. This advantage must be balanced against the potential for overestimation of the treatment effect and other limitations of smaller trials (eg, limited safety data, less information about treatment effects in subgroups).

Many methods have been proposed to allow formal incorporation of early stopping into a clinical trial.1,2 All of these methods allow a trial to stop at a prespecified interim analysis while maintaining good statistical properties. Data monitoring committees or other similar governing bodies may also monitor the progress of a trial and recommend stopping the trial early in the absence of a prespecified formal rule. An overwhelmingly positive treatment effect might lead to a recommendation for unplanned early stopping but, more commonly, unplanned early stopping results from concerns for participant safety, lack of observed benefit, or concerns about the feasibility of continuing the trial due to slow patient accrual or new external information. Trials stopped for success in an ad hoc manner are challenging to interpret rigorously. In this article, we focus on early stopping for success or futility based on formal, prespecified stopping rules.

In the December 15, 2015, issue of JAMA, Stupp et al3 reported the results of a trial assessing electric tumor-treating fields plus temozolomide vs temozolomide alone in patients with glioblastoma. The trial design included a preplanned interim analysis defined according to an early stopping procedure. The trial was stopped for success at the interim analysis, reporting a hazard ratio of 0.62 for the

sary. During the course of a trial, strong evidence may accumulate that the experimental treatment offers a benefit. This may be from a large observed treatment effect emerging early in a trial or from the anticipated treatment effect being observed as early as two-thirds of the way through a trial.

Conversely, evidence could accumulate early in a trial that the experimental treatment performs no better than the control. In a trial with no provision for early stopping, patients would continue to be exposed to the potential harms of the experimental therapy with no hope of benefit. Interim analyses to stop trials early for futility may avoid this risk. Trials may also stop early for futility if there is a limited likelihood of eventual success.4

What Are the Limitations of Early Stopping?
One key statistical issue with early stopping, particularly early stopping for success, is accounting for multiple "looks" at the data. Accumulating data, particularly early in the trial with a smaller number of observations, is likely to exhibit larger random highs and lows of values for the treatment effects. The more frequently the data are analyzed as they accumulate, the greater the chance of observing one of these fluctuations. Rules allowing early stopping therefore require a higher level of evidence, such as a lower P value, at each interim analysis than would be required at the end of a trial with no potential for early stopping. Taken together, the multiple looks at the data, each requiring a higher bar for success, lead to the same overall chance of falsely declaring success (type I error) as a trial with the usual criterion for success (eg, P < .05) and no potential for early stopping.

Early stopping for futility requires no such adjustment. There are no added opportunities to declare a success; thus, no statistical adjustment to the success threshold is required. However, futility stopping may reduce the power of the trial by stopping trials based on a random low value for the treatment effect that could have gone on to be successful. This reduction in power is usually quite small.

Success thresholds are typically chosen to be more conservative for interim analyses than for the final analysis should the trial continue to completion. The O'Brien-Fleming method, for example, requires very small P values to declare success early in the
primary end point of progression-free survival. trial and then maintains a final P value very close to the traditional
.05 level at the final analysis.1 Using this method, very few trials could
Use of the Method be successful at the interim analyses that would not have been suc-
Why Is Early Stopping Used? cessful at the final analysis. Thus, there is a minimal “penalty” for the
When 2 treatments are compared in a randomized clinical trial, the interim analyses. The more conservative the early stopping crite-
treatment effects observed both during the trial and when the trial ria, the more assurance there is that an early stop for success is not
ends are subject to random highs and lows that depart from the true a false-positive result.
treatment effect. Sample sizes for trials are selected to reliably de- While methods such as O’Brien-Fleming protect against falsely
tect an anticipated treatment effect even if a modest, random low declaring an ineffective drug successful, the accuracy of estimates
observed treatment effect occurs at the final analysis. If such a ran- of the treatment effect in trials that have stopped early for success
dom low value does not occur or the true treatment effect is larger remains a concern.5 When considering the true effect of a treat-
than anticipated, the extra study participants required to provide ment, bias is introduced when considering only trials that have ob-
this protection against a false-negative result may not be neces- served a large enough treatment effect to meet the critical value for
1646 JAMA April 19, 2016 Volume 315, Number 15 (Reprinted) jama.com
success. By definition, successful trials have larger treatment ef- corresponding to a hazard ratio of 0.62. Given the potential for an
fects than unsuccessful trials; thus, successful trials include more ran- overestimated treatment effect, combined with the general intrac-
dom highs than random lows. As such, small trials that end in suc- tability of treating glioblastoma, there is good reason to suspect that
cess, either at the end or early, are prone to overestimating the the actual benefit of tumor-treating fields, while present, might be
treatment effect. The larger the observed treatment effect, the more smaller than that observed in the study. A robustness analysis (ie, a
likely it is an extreme random high, and the greater the chance for supplementary or supporting analysis conducted to see how con-
overestimation. If the trial were continued, with the enrollment of sistent the results are if different approaches were taken in con-
additional patients, it is likely that there would be a reduction of the ducting the analysis), based on the then-available data from all par-
observed treatment effect. In other words, trials with very impres- ticipants, illustrates this pattern. That analysis resulted in a hazard
sive early results are likely to become less impressive after observ- ratio of 0.69 (95% CI, 0.55-0.86), also with a P<.001. The result re-
ing more data, and this should be taken into account when moni- mained statistically significant, but the magnitude of the treat-
toring and interpreting such trials. Extreme attenuation, such as a ment effect was smaller.
complete disappearance of the observed treatment benefit, how-
ever, is less likely. Caveats to Consider When Looking at a Trial
That Stopped Early
Why Did the Authors Use Early Stopping in This Study? It is important to consider trial design, quality of trial conduct, safety
Glioblastoma is an aggressive cancer with few treatment options. and secondary end points, and other supplementary data when in-
In the report by Stupp et al,3 enrollment was largely complete at the terpreting the results of any clinical trial. For trials that stop early for
time of the interim analysis. However, the interim analysis allowed success, the statistical superiority of an experimental treatment is
the possibility that a beneficial result could be disseminated many straightforward when the early stopping was preplanned and it is
months (potentially years) earlier in advance of the fully mature data. reasonable to preserve patient resources and time once the pri-
mary objective of a trial has been addressed. Early stopping proce-
How Should Early Stopping Be Interpreted in This Particular Study? dures protect against a false conclusion of superiority. However, if
The primary analysis in this study found a hazard ratio of 0.62 the result seems implausibly good, there is a high likelihood that the
(P = .001) based on 18 months of follow-up from the first 315 pa- true effect is smaller than the observed effect. In that light, the ben-
tients enrolled. This is strong evidence of a treatment benefit for tu- efits of early stopping, to patients both in and out of the trial, must
mor-treating fields plus temozolomide in this population. How- be weighed against how much potential additional knowledge would
ever, care should be taken when interpreting the estimated benefit be gained if the trial were continued.
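The overestimation mechanism discussed in this article can be illustrated with a small simulation. This is an illustrative sketch, not an analysis from the article: the two-look O'Brien-Fleming boundaries (Z of about 2.80 at a halfway interim analysis, about 1.98 at the final analysis) are standard, but the true effect size, per-arm sample sizes, and normally distributed outcomes are all assumed values chosen for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_trials(n_trials=20_000, true_effect=0.3, n_interim=150, n_final=300,
                    z_interim=2.80, z_final=1.98):
    """Simulate 2-arm trials with one interim look using approximate
    O'Brien-Fleming boundaries, and record the effect estimate at the
    analysis where each trial declared success."""
    early, final = [], []
    for _ in range(n_trials):
        treat = rng.normal(true_effect, 1.0, n_final)  # outcome: higher is better
        ctrl = rng.normal(0.0, 1.0, n_final)
        # Interim analysis on the first n_interim patients per arm
        diff_i = treat[:n_interim].mean() - ctrl[:n_interim].mean()
        if diff_i / np.sqrt(2 / n_interim) > z_interim:  # stop early for success
            early.append(diff_i)
            continue
        # Final analysis on all patients
        diff_f = treat.mean() - ctrl.mean()
        if diff_f / np.sqrt(2 / n_final) > z_final:
            final.append(diff_f)
    return np.array(early), np.array(final)

early, final = simulate_trials()
print("true effect: 0.30")
print(f"mean estimate in trials stopped early for success: {early.mean():.2f}")
print(f"mean estimate in trials successful at the end:     {final.mean():.2f}")
```

Because a trial can stop early only by clearing the high interim bar (an observed difference of roughly 0.32 here, already above the true effect of 0.30), the average estimate among early stoppers is necessarily biased upward, while trials that continue to the final analysis yield estimates closer to the true value.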
ARTICLE INFORMATION

Author Affiliations: Berry Consultants LLC, Austin, Texas.

Corresponding Author: Kert Viele, PhD, Berry Consultants LLC, 4301 Westbank Dr, Bldg B, Ste 140, Austin, TX 78746 ([email protected]).

Section Editors: Roger J. Lewis, MD, PhD, Department of Emergency Medicine, Harbor-UCLA Medical Center and David Geffen School of Medicine at UCLA; and Edward H. Livingston, MD, Deputy Editor, JAMA.

Conflict of Interest Disclosures: All authors have completed and submitted the ICMJE Form for Disclosure of Potential Conflicts of Interest and none were reported.

REFERENCES

1. Jennison C, Turnbull BW. Group Sequential Methods With Applications to Clinical Trials. Boca Raton, FL: Chapman & Hall; 2000.
2. Broglio KR, Connor JT, Berry SM. Not too big, not too small: a Goldilocks approach to sample size selection. J Biopharm Stat. 2014;24(3):685-705.
3. Stupp R, Taillibert S, Kanner AA, et al. Maintenance therapy with tumor-treating fields plus temozolomide vs temozolomide alone for glioblastoma: a randomized clinical trial. JAMA. 2015;314(23):2535-2543.
4. Saville BR, Connor JT, Ayers GD, Alvarez J. The utility of Bayesian predictive probabilities for interim monitoring of clinical trials. Clin Trials. 2014;11(4):485-493.
5. Zhang JJ, Blumenthal GM, He K, Tang S, Cortazar P, Sridhara R. Overestimation of the effect size in group sequential trials. Clin Cancer Res. 2012;18(18):4872-4876.
Logistic Regression
Relating Patient Characteristics to Outcomes
Juliana Tolles, MD, MHS; William J. Meurer, MD, MS
In a recent issue of JAMA, Seymour et al1 presented a new method for estimating the probability of a patient dying of sepsis using information on the patient’s respiratory rate, systolic blood pressure, and altered mentation. The method used these clinical characteristics—called “predictor” or explanatory or independent variables—to estimate the likelihood of a patient having an outcome of interest, called the dependent variable. To determine the best way to use these clinical characteristics, the authors used logistic regression, a common statistical method for quantifying the relationship between patient characteristics and clinical outcomes.2

Use of the Method

Why Is Logistic Regression Used?
One use of logistic regression is to estimate the probability that an event will occur or that a patient will have a particular outcome using information or characteristics that are thought to be related to or influence such events. Logistic regression can show which of the various factors being assessed has the strongest association with an outcome and provides a measure of the magnitude of the potential influence. It also has the ability to “adjust” for confounding factors, ie, factors that are associated with both other predictor variables and the outcome, so the measure of the influence of the predictor of interest is not distorted by the effect of the confounder.

Although logistic regression can be used to evaluate epidemiological associations that do not represent cause and effect, this article focuses on the use of logistic regression to create models for predicting patient outcomes. In this context, the term predictors is used to refer to the independent factors (variables) for which the influences are being quantified, and the term outcome is used for the dependent variable that the logistic regression model is trying to predict.

Description of the Method
Patient outcomes that can only have 2 values (eg, lived vs died) are called binary or dichotomous. The outcomes for groups of patients can be summarized by the fraction of patients experiencing the outcome of interest or, similarly, by the probability that any single patient experiences that outcome. However, to understand the results of a logistic regression model, it is important to understand the difference between probability and odds. The probability that an event will occur divided by the probability that it will not occur is called the odds. For example, if there is a 75% chance of survival and a 25% chance of dying, then the odds of survival is 75%:25%, or 3. Logistic regression quantitatively links one or more predictors thought to influence a particular outcome to the odds of that outcome.2

The change in the odds of an outcome—for example, the increase in the odds of mortality associated with tachypnea in a patient with sepsis—is measured as a ratio called the odds ratio (OR). If patients with tachypnea have an odds of mortality of 2.0 and patients without tachypnea have an odds of mortality of 0.5, then the OR associated with tachypnea would be 2.0:0.5, or 4. This is the same as an increase in the probability of mortality from 1/3 to 2/3.

In logistic regression, the weight or coefficient calculated for each predictor determines the OR for the outcome associated with a 1-unit change in that predictor, or associated with a patient state (eg, tachypneic) relative to a reference state (eg, not tachypneic). Through these ORs and their associated 95% confidence intervals, logistic regression provides a measure of the magnitude of the influence of each predictor on the outcome of interest and of the uncertainty in the magnitude of the influence.

Logistic regression also enables “adjustment” for confounding factors—patient characteristics that might also influence the outcome and simultaneously be correlated with 1 or more predictors. To accomplish this, both the confounding factors and the predictors of interest are included in the model. For example, when adjusting for the influence of fever in estimating the influence of tachypnea on mortality, both fever and the presence of tachypnea would be included as predictors in the regression model. The result is that the estimate of the association between tachypnea and mortality would not be confounded by a possible correlation between fever and tachypnea.

What Are the Limitations of Logistic Regression?
First, the validity of a regression model depends on the number and suitability of the measured independent predictor variables. Ideally, all biologically relevant factors should be included. When multiple variables convey closely related information (a situation termed collinearity), such as would occur when using both serum lactate and anion gap as predictors in patients with septic shock, this can produce serious errors or great uncertainty in the estimates of the effects of these variables on the outcome of interest. When 2 variables provide overlapping information, minor random variation in the data can greatly and unpredictably influence how much of the association is attributed to one factor vs the other in the model.

A second limitation of logistic regression is that the variables must have a constant magnitude of association across the range of values for that variable. For example, in examining the relationship between age and mortality, if the odds ratio for mortality is 2 for each 10-year increase in age, this association needs to be the same when comparing 30- and 40-year-olds as it is when comparing 70- and 80-year-olds if the model is to be used across this entire age range. If the association is not consistent over the age range, then age may be stratified into ranges (eg, 21-50, 51-65, and ≥66) based on the assumption that within each category, the influence of age will be similar. The age category would then be used as the independent variable, usually with the lowest-risk age group the reference category.

A third limitation is that many logistic regression analyses assume that the effect of one predictor is not influenced by the value of another predictor. When this is not true and the value of one predictor alters the effect of another, there is said to be an “interaction” between the 2 predictors. Such interactions need to be explicitly included in the analysis to ensure the estimated associations are valid.

Why Did the Authors Use Logistic Regression in This Study?
Seymour et al likely selected logistic regression for its familiarity and interpretability. More complex prognostic models may produce algorithms that are difficult to use clinically.

How Should the Results of Logistic Regression Be Interpreted in This Particular Study?
Seymour et al used logistic regression to derive a new clinical tool for assessing the risk of mortality in patients with sepsis, called the quick Sequential [Sepsis-related] Organ Failure Assessment (qSOFA).1 The qSOFA model is used to estimate the likelihood of in-hospital mortality in patients with suspected infection using respiratory rate, systolic blood pressure, and Glasgow Coma Scale score. Rather than using the precise OR coefficient for each predictor in their final model, the authors simplified the model by assigning the same 1-point value to each predictor. By assigning all coefficients equal value, the authors created a simplified model that could be applied to individual patients by counting the number of positive clinical predictors. The authors then determined how well the qSOFA score estimated mortality relative to other models for estimating mortality in sepsis, demonstrating that a qSOFA score of 2 or more produced a 3- to 14-fold increase in the probability of in-hospital mortality over baseline risk in patients with sepsis. They also found that for patients not in intensive care, the qSOFA predicted mortality in patients with sepsis better than systemic inflammatory response syndrome criteria or the usual Sequential [Sepsis-related] Organ Failure Assessment score.

Caveats to Consider When Assessing the Results of a Logistic Regression Analysis
The associations found through logistic regression models are intended to provide insights into what might happen in a similar population of future patients. Certain combinations of patient characteristics and factors may have been sparsely represented in the data set (eg, young patients with sepsis and a low Glasgow Coma Scale score but a normal blood pressure and respiratory rate), and the estimates of the model for mortality among such patients should be considered with caution.

Because probabilities are more intuitive than ORs, it is important to avoid confusing them. For example, an increase in probability from 25% to 75% would correspond to a risk ratio (RR) of 3 but an OR of 9. However, when probabilities are very close to zero, the OR and the RR will be nearly equal. Thus, ORs and RRs are practically interchangeable when the outcome of interest is rare. However, when the outcome of interest is a common event (eg, occurring in >20% of patients in any group), it is important to recognize that ORs do not approximate RRs.

Reported ORs for the effects of predictors should be accompanied by 95% confidence intervals; intervals that include an OR of 1 would indicate a non–statistically significant relationship between that predictor and the outcome of interest.

The predictors included in logistic regression models should be selected to avoid redundancy in the information they provide (collinearity). It is also important to consider the possibility that the value of one predictor might alter the effect of another (interactions). Both of these situations can adversely affect the validity of the resulting logistic regression model.
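The odds, odds-ratio, and risk-ratio arithmetic used in this article can be checked directly. The following sketch is illustrative code, not part of the original article; it reproduces the worked numbers from the Description of the Method and Caveats sections.

```python
def odds(p):
    """Odds of an event with probability p: p / (1 - p)."""
    return p / (1 - p)

def odds_ratio(p1, p0):
    """Odds ratio comparing event probability p1 with p0."""
    return odds(p1) / odds(p0)

def risk_ratio(p1, p0):
    """Risk ratio (relative risk) comparing p1 with p0."""
    return p1 / p0

# A 75% chance of survival corresponds to odds of survival of 3
print(odds(0.75))  # 3.0

# Tachypnea example: odds 2.0 vs 0.5 gives OR 4,
# the same as probabilities of 2/3 vs 1/3
print(round(2.0 / 0.5, 2), round(odds_ratio(2/3, 1/3), 2))

# Common outcome: probability 25% -> 75% is an RR of 3 but an OR of 9
print(risk_ratio(0.75, 0.25), round(odds_ratio(0.75, 0.25), 2))

# Rare outcome: the OR approximates the RR (0.1% -> 0.2%)
print(risk_ratio(0.002, 0.001), round(odds_ratio(0.002, 0.001), 2))
```

The last two lines show why ORs and RRs are interchangeable only for rare outcomes: at low probabilities the denominator 1 − p is nearly 1, so odds and probability nearly coincide.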
Pragmatic Trials
Practical Answers to “Real World” Questions
Harold C. Sox, MD; Roger J. Lewis, MD, PhD
The concept of a “pragmatic” clinical trial was first proposed nearly 50 years ago as a study design philosophy that emphasizes answering questions of most interest to decision makers.1 Decision makers, whether patients, physicians, or policy makers, need to know what they can expect from the available diagnostic or therapeutic options when applied in day-to-day clinical practice. This focus on addressing real-world effectiveness questions influences choices about trial design, patient population, interventions, outcomes, and analysis. In this issue of JAMA, Gottenberg et al2 report the results of a trial designed to answer the question “If a biologic agent for rheumatoid arthritis is no longer effective for an individual patient, should the clinician recommend another drug with the same mechanism of action or switch to a biologic with a different mechanism of action?” Because the authors included some pragmatic elements in the trial design, this study illustrates the issues that clinicians should consider in deciding whether a trial result is likely to apply to their patients.

Related article page 1172

Use of the Method

Why Are Pragmatic Trials Conducted?
Pragmatic trials are intended to help typical clinicians and typical patients make difficult decisions in typical clinical care settings by maximizing the chance that the trial results will apply to patients that are usually seen in practice (external validity). The most important feature of a pragmatic trial is that patients, clinicians, clinical practices, and clinical settings are selected to maximize the applicability of the results to usual practice. Trial procedures and requirements must not inconvenience patients with substantial data collection and should impose a minimum of constraints on usual practice by allowing a choice of medication (within the constraints imposed by the purpose of the study) and dosage, providing the freedom to add cointerventions, and doing nothing to maximize adherence to the study protocol.

The pragmatic trial strategy contrasts with that used for an explanatory trial, the goal of which is to test a hypothesis that the intervention causes a clinical outcome. Explanatory trials seek to maximize the probability that the intervention—and not some other factor—causes the study outcome (internal validity). Explanatory trials seek to give the intervention the best possible chance to succeed by using experts to deliver it, delivering the intervention to patients who are most likely to respond, and administering the intervention in settings that provide expert after-care. Explanatory trials try to prevent any extraneous factors from influencing clinical outcomes, so they exclude patients who might have poor adherence and may intervene to maximize patient and clinician adherence to the study protocol. Explanatory trials are structured to avoid downstream events that could affect study outcomes. If these events occur at different rates in the different study groups, the effect attributed to the intervention may be larger or smaller than its true effect. To avoid this problem, explanatory trials may choose a relatively short follow-up period. Explanatory trials pursue internal validity at the cost of external validity, whereas pragmatic trials place a premium on external validity while maintaining as much internal validity as possible.

Description of the Method
According to Tunis et al,3 “the characteristic features of [pragmatic clinical trials] are that they (1) select clinically relevant alternative interventions to compare, (2) include a diverse population of study participants, (3) recruit participants from heterogeneous practice settings, and (4) collect data on a broad range of health outcomes.” Eligible patients may be defined by presumptive diagnoses, rather than confirmed ones, because treatments are often initiated when the diagnosis is uncertain.3 Pragmatic trials may compare classes of drugs and allow the physician to choose which drug in the class to use, the dose, and any cointerventions, a freedom that mimics usual practice. Furthermore, the outcome measures are more likely to be patient-reported, global, subjective, and patient-centered (eg, self-reported quality-of-life measures), rather than the more disease-centered end points commonly used in explanatory trials (eg, the results of laboratory tests or imaging procedures).

Both approaches to study design must deal with the cost of clinical trials. Explanatory trials control costs by keeping the trial period as short as possible, consistent with the investigators’ ability to enroll enough patients to answer the study question. These trials preferentially recruit patients who will experience the study end point and not leave the study early because of disinterest or death from causes other than the target condition. Investigators in explanatory trials prefer to enroll participants with a high probability of experiencing an outcome in the near term. In contrast, pragmatic trials may control costs by leveraging existing data sources, eg, using disease registries to identify potential participants and using data in electronic health records to identify study outcomes.

Although these concepts sharpen the contrasts between pragmatic and explanatory trials for pedagogical reasons, in reality, many trials have features of both designs, in part to find a reasonable balance between internal validity and external validity.4,5

What Are the Limitations of Pragmatic Trials?
The main limitation of a pragmatic trial is a direct consequence of choosing to conduct a lean study that puts few demands on patients and clinicians. Data collection may be sparse, and there are few clinical variables with which to identify subgroups of patients who respond particularly well to one of the interventions. The use of the electronic health record as a source of data may save money, but it typically means inconsistent data collection and missing data. Relying on typical clinicians rather than experts in caring for patients with the target condition may lead to increased variability in practice and associated documentation of clinical findings. The variation caused by these shortcomings may reduce statistical precision and the capability of answering the research question unequivocally.

Why Was a Pragmatic Trial Conducted in This Case?
While Gottenberg et al2 cite the pragmatism of their study as its main strength, the authors do not explain their study design decisions. However, they imply a pragmatic motivation when they state that the study confirms the superiority of a drug from a different class in a setting that “corresponds to the therapeutic question clinicians face in daily practice.” The investigators note that the main limitation of the study was the inability to blind the participants to the identity of the drug they received. Blinding is especially important when the principal study outcomes are those reported by the patient, who may be influenced by knowing the intervention that they received.

How Should the Results Be Interpreted?
The study by Gottenberg et al2 shows that, from the perspective of a population of patients, changing from one class of drugs to another improves the outcomes of care by rheumatologists in a rheumatology subspecialty clinic. This result has limited external validity. It probably applies to other rheumatology clinics, but its application to other settings is unknown. The main pragmatic feature of the study—allowing the rheumatologist to choose from several drugs within a class—implies that the main result applies strictly to the class of drugs rather than to any individual agent. It does not, for example, show that the improvement is the same regardless of which within-class drug the clinician determines. The trial was also pragmatic in that clinicians were aware of the primary treatment and were free to choose cointerventions that would complement it, as would occur in clinical practice.

Several features of this study were not pragmatic, and others raise internal validity concerns. The researchers recruited participants from rheumatology specialty clinics. Although the article does not specify the clinicians who managed the patient’s rheumatoid arthritis during the study, the clinicians were presumably rheumatologists in the participating practices. Even though the results apply to patients in a specialty clinic, whether they apply to patients managed by primary care physicians, with or without expert consultation, is unknown. The authors also did not specify the intensity of follow-up; was it typical of rheumatoid arthritis patients receiving biologic agents or did the study protocol specify more intensive follow-up? The primary outcome measure was a score based on the erythrocyte sedimentation rate and a count of involved joints. The article does not identify the person who assessed the primary outcome. Assigning this task to the managing physician would be consistent with a pragmatic design, but it would also raise concerns about biased outcome assessment because the person measuring the outcome would know the treatment assignment.

The terms “explanatory” and “pragmatic” mark the ends of a spectrum of study designs. Typically, as noted by Thorpe and co-authors of the PRECIS (Pragmatic-Explanatory Continuum Indicator Summary) article,5 some features of a study are pragmatic and others are explanatory, as the study by Gottenberg et al illustrates and as would be expected because internal validity and external validity are typically achieved at the cost of one another. Whether the authors label their study as pragmatic or explanatory, readers should pay close attention to the study characteristics that maximize its applicability to their patients and their practice style.

jama.com (Reprinted) JAMA September 20, 2016 Volume 316, Number 11 1205
ARTICLE INFORMATION

Author Affiliations: Patient Centered Outcomes Research Institute (PCORI), Washington, DC (Sox); Geisel School of Medicine at Dartmouth, Hanover, New Hampshire (Sox); Department of Emergency Medicine, Harbor-UCLA Medical Center, Torrance, California (Lewis); Department of Emergency Medicine, David Geffen School of Medicine at the University of California-Los Angeles (Lewis); Berry Consultants, LLC, Austin, Texas (Lewis).

Corresponding Author: Roger J. Lewis, MD, PhD, Department of Emergency Medicine, Harbor-UCLA Medical Center, 1000 W Carson St, PO Box 21, Torrance, CA 90509 ([email protected]).

Section Editors: Roger J. Lewis, MD, PhD, Department of Emergency Medicine, Harbor-UCLA Medical Center and David Geffen School of Medicine at UCLA; and Edward H. Livingston, MD, Deputy Editor, JAMA.

Conflict of Interest Disclosures: Both authors have completed and submitted the ICMJE Form for Disclosure of Potential Conflicts of Interest and none were reported.

Disclaimer: Dr Sox is an employee of the Patient-Centered Outcomes Research Institute (PCORI). This article does not represent the policies of PCORI.

REFERENCES

1. Schwartz D, Lellouch J. Explanatory and pragmatic attitudes in therapeutical trials. J Chronic Dis. 1967;20(8):637-648.
2. Gottenberg J-E, Brocq O, Perdriger A, et al. Non–TNF-targeted biologic vs a second anti-TNF drug to treat rheumatoid arthritis in patients with insufficient response to a first anti-TNF drug. JAMA. doi:10.1001/jama.2016.13512
3. Tunis SR, Stryer DB, Clancy CM. Practical clinical trials: increasing the value of clinical research for decision making in clinical and health policy. JAMA. 2003;290(12):1624-1632.
4. Zwarenstein M, Treweek S, Gagnier JJ, et al; CONSORT group; Pragmatic Trials in Healthcare (Practihc) group. Improving the reporting of pragmatic trials: an extension of the CONSORT statement. BMJ. 2008;337:a2390. doi:10.1136/bmj.a2390
5. Thorpe KE, Zwarenstein M, Oxman AD, et al. A Pragmatic-Explanatory Continuum Indicator Summary (PRECIS): a tool to help trial designers. J Clin Epidemiol. 2009;62(5):464-475.
In the assessment of the effect of a treatment or potential risk factor—termed an exposure—on a patient outcome, the possibility of confounding by other factors must be considered.1 For example, if researchers studied the effect of coffee drinking on the development of lung cancer, they might observe an apparent association between these 2 variables. However, because drinking coffee is also related to smoking, the observed association between coffee drinking and lung cancer does not represent a true causal relationship but is rather the result of the association of coffee drinking with smoking—the confounder—which is the true cause of lung cancer.

Related article page 1786

This illustration is a simple example of the very complicated and multifaceted phenomenon of confounding. Distortion from a confounder can appear to strengthen, weaken, or completely reverse the true effect of an exposure. In addition, multiple factors can interact to cause confounding in both epidemiologic and clinical research. Notwithstanding these complexities, a confounding variable can be readily identified if it meets 3 important criteria.1 First, a confounder must be an independent risk factor for the outcome, either a causal factor or a surrogate for a causal factor (eg, smoking for lung cancer). Second, a confounder must be associated with the exposure (eg, smoking and coffee drinking). Third, a confounder cannot be an intermediate variable between the exposure and the outcome (eg, smoking is not caused by drinking coffee).

A particularly important type of confounding in clinical research is "confounding by indication," which occurs when the clinical indication for selecting a particular treatment (eg, severity of the illness) also affects the outcome. For example, patients with more severe illness are likely to receive more intensive treatments and, when comparing the interventions, the more intensive intervention will appear to result in poorer outcomes. This is called "confounding by severity" to emphasize that the degree of illness is the confounder. Because the degree of severity affects both treatment selection and patient outcome and is not an intermediate between the treatment and outcome, it fulfills the criteria for confounding.

The nonrandomized assessment of tracheal intubation vs bag-valve-mask ventilation for pediatric cardiopulmonary arrest reported by Andersen et al2 in the November 1, 2016, issue of JAMA is likely to be complicated by confounding by indication. Clinical conditions (eg, asthma, cystic fibrosis, and upper airway obstruction) existing before and during a patient's cardiopulmonary resuscitation will both affect the patient's outcome and influence the type of airway management.2 In other words, it is likely that children with more severe disease and worse overall prognosis for survival had a greater probability of being intubated.2 This possibility is especially great because severity of illness is both a strong predictor of mortality and a strong predictor of the clinical decision to intubate.

Not all confounding by indication is related to severity of illness. Other factors that affect both the type of intervention and the outcome can result in this form of confounding. For example, patients with health insurance may receive different interventions for their illness compared with patients without insurance. Furthermore, patients with insurance also tend to be healthier and have access to better overall medical care, thus improving their overall measured outcomes. In this case, having health insurance may act as a confounder when estimating the effect of the treatment on the outcome.

Addressing Confounding in Clinical Research
The primary goal of clinical research, whether observational or interventional, is to obtain valid measures of the effects of treatments or potential risk factors on patient outcomes. Because confounding distorts the true relationship between the exposure of interest and the outcome, investigators attempt to control confounding to provide valid measures of the observed associations or treatment effects.3 In particular, randomized clinical trials (RCTs) use randomized treatment assignment to balance potential confounding factors—whether measured, unmeasured, or unknown—that might affect the outcome to ensure that those factors are unrelated to the assigned intervention. Thus, RCTs do not typically require use of statistical methods to adjust for confounding, as the randomization process is meant to limit all forms of confounding.

In some settings, RCTs may be inappropriate, impossible, or not feasible.4 In these situations, observational studies are often used to investigate causal relationships in which the treatment assignment for each patient is not randomized but instead is determined by clinical indications. These types of observational studies are generally more difficult to interpret than RCTs. Without an opportunity to randomize the exposure, potential confounding frequently exists. Failing to adjust for confounding during the statistical analysis could result in inaccurate estimates of the relationship between the exposure and the outcome.

Use of Methods to Control Confounding
To control confounding, clinical researchers implement study design procedures to prevent confounding (eg, randomization, restriction, and matching) and conduct statistical procedures in the analysis to remove confounding (eg, stratified analyses, regression modeling, and propensity scoring) for both clinical trials and observational studies. Previous JAMA Guide to Statistics and Methods articles have summarized the use of logistic regression models and propensity score methods.5,6

Andersen et al used propensity score matching to statistically adjust for confounding.6 The propensity score is the probability that a patient receives a specific treatment based on his or her characteristics and the clinical indications determined by the treating physician. This probability is used to match patients receiving the treatment of interest with those receiving the comparison treatment to control confounding by balancing potential confounding factors between these groups.

What Are the Limitations of Methods to Control for Confounding?
Incompletely controlled "residual" confounding may persist in clinical investigations despite study design and statistical procedures aimed at eliminating this form of bias.1,7,8 This can occur in RCTs when the randomization process fails (typically in smaller trials) to completely balance confounders between the treatment groups. More likely, residual confounding occurs in observational studies of interventions when statistical analyses do not adequately adjust for confounding. Reasons for failure of statistical adjustments include the following: (1) failure to measure the confounding variable so that it cannot be included in the statistical analysis (ie, "unmeasured confounding"); (2) use of a measure for the confounding variable that does not accurately reflect or capture the characteristic it is supposed to represent (eg, the variable used to describe the confounder is an imperfect or misclassified measure of the characteristic); and (3) use of overly broad categories for the confounder (ie, even for patients with the same value for the confounding variable there is important variability in the likelihood of receiving each treatment and in experiencing the outcome).

How Should the Results Be Interpreted?
In the study by Andersen et al, some degree of confounding by indication exists in the comparison between tracheal intubation and bag-valve-mask ventilation. Confounding by indication is evident because inclusion in the propensity score–matched statistical analysis of certain clinical conditions that might influence a clinician's decision to intubate a patient (eg, illness category, preexisting conditions, whether the arrest was witnessed; see Supplement in Andersen et al2) reduced the strength of the estimated deleterious effect of tracheal intubation. For example, in the unadjusted statistical analysis, tracheal intubation during pediatric cardiopulmonary resuscitation was associated with decreased survival to hospital discharge, with a risk ratio of 0.64 (95% CI, 0.59-0.69; P < .001). However, in the propensity score–matched adjusted statistical analysis, the risk ratio effect estimate was only 0.89 (95% CI, 0.81-0.99; P = .03). This change in estimate with statistical adjustment is evidence of confounding by 1 or more clinical conditions that were included in the multivariable analyses.9 Furthermore, if all of the important confounding variables were not included in the adjusted analyses, then residual confounding could still persist. Although Andersen et al implemented sophisticated statistical methods to specifically limit confounding by indication, their observational cohort study may not have included measures of all potential confounding, such as factors concerned with the resuscitation phase that influenced the decision to intubate the patients.

Caveats to Consider When Interpreting an Analysis Intended to Adjust for Confounding by Indication
When assessing an observational study of treatment effects for confounding by indication, the reader should consider why clinicians select specific interventions and how those decisions might be influenced by factors that also directly affect outcomes. Conversely, investigators must know and understand the causal and noncausal relationships among the intervention, potential confounders, and the outcome to ensure potential confounding is controlled. Underlying pathophysiologic processes must also be considered when determining what variables should be measured and included in any statistical analysis. Any assessment of a clinical intervention should include an evaluation of confounding by indication that is best accomplished by the following: (1) understanding the underlying pathophysiologic mechanisms leading to specific outcomes; (2) understanding the criteria for confounding and describing the relationships between potential confounders and both intervention and outcome variables; and (3) understanding effective study designs and statistical methods that reduce or eliminate confounding by indication.
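The shift from a badly biased crude estimate to a roughly null adjusted estimate can be illustrated with a short simulation (a hypothetical sketch in Python; the cohort size, probabilities, and function names are invented for illustration and are not the Andersen et al data). Severity drives both treatment selection and survival, while the treatment itself has no true effect, so the crude risk ratio falls far below 1 while the severity-stratified risk ratios stay near 1:

```python
import random

def simulate_cohort(n=20_000, seed=0):
    """Simulate a cohort in which illness severity drives both the
    choice of treatment and the outcome (confounding by severity).
    The treatment itself has no true effect on survival."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        severe = rng.random() < 0.5
        # Sicker patients are far more likely to receive the intensive treatment.
        treated = rng.random() < (0.8 if severe else 0.2)
        # Survival depends on severity only, not on treatment.
        survived = rng.random() < (0.2 if severe else 0.7)
        rows.append((severe, treated, survived))
    return rows

def risk_ratio(rows):
    """Risk ratio of survival, treated vs untreated."""
    treated = [r for r in rows if r[1]]
    untreated = [r for r in rows if not r[1]]
    p_t = sum(r[2] for r in treated) / len(treated)
    p_u = sum(r[2] for r in untreated) / len(untreated)
    return p_t / p_u

def stratified_risk_ratios(rows):
    """Risk ratios computed separately within each severity stratum."""
    return {severe: risk_ratio([r for r in rows if r[0] == severe])
            for severe in (False, True)}

if __name__ == "__main__":
    cohort = simulate_cohort()
    print(f"crude RR: {risk_ratio(cohort):.2f}")        # spuriously far from 1
    for severe, rr in stratified_risk_ratios(cohort).items():
        print(f"severe={severe}: RR {rr:.2f}")          # close to the true null
```

Matching on a propensity score pursues the same goal as this stratification: it compares treated and untreated patients who, given their characteristics, were about equally likely to receive the treatment.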
ARTICLE INFORMATION

Author Affiliations: Departments of Emergency Medicine and Preventive Medicine, Northwestern University Feinberg School of Medicine, Chicago, Illinois (Kyriacou); Senior Editor, JAMA (Kyriacou); Department of Emergency Medicine, Harbor-UCLA Medical Center and David Geffen School of Medicine at UCLA, Los Angeles, California (Lewis).

Corresponding Author: Demetrios N. Kyriacou, MD, PhD, Northwestern University Feinberg School of Medicine, 211 Ontario St, Ste 200, Chicago, IL 60611 ([email protected]).

Section Editor: Edward H. Livingston, MD, Deputy Editor, JAMA.

Conflict of Interest Disclosures: The authors have completed and submitted the ICMJE Form for Disclosure of Potential Conflicts of Interest and none were reported.

REFERENCES
1. Rothman KJ, Greenland S, Lash T, eds. Modern Epidemiology. 3rd ed. Philadelphia, PA: Lippincott Williams & Wilkins; 2008.
2. Andersen LW, Raymond TT, Berg RA, et al. The association between tracheal intubation during pediatric in-hospital cardiac arrest and survival. JAMA. doi:10.1001/jama.2016.14486
3. Greenland S, Morgenstern H. Confounding in health research. Annu Rev Public Health. 2001;22:189-212.
4. Black N. Why we need observational studies to evaluate the effectiveness of health care. BMJ. 1996;312(7040):1215-1218.
5. Tolles J, Meurer WJ. Logistic regression: relating patient characteristics to outcomes. JAMA. 2016;316(5):533-534.
6. Haukoos JS, Lewis RJ. The propensity score. JAMA. 2015;314(15):1637-1638.
7. Szklo M, Nieto FJ, eds. Epidemiology: Beyond the Basics. 3rd ed. Burlington, MA: Jones & Bartlett; 2014.
8. Fewell Z, Davey Smith G, Sterne JA. The impact of residual and unmeasured confounding in epidemiologic studies: a simulation study. Am J Epidemiol. 2007;166(6):646-655.
9. McNamee R. Confounding and confounders. Occup Environ Med. 2003;60(3):227-234.
Equipoise in Research
Integrating Ethics and Science in Human Research
Alex John London, PhD
The principle of equipoise states that, when there is uncertainty or conflicting expert opinion about the relative merits of diagnostic, prevention, or treatment options, allocating interventions to individuals in a manner that allows the generation of new knowledge (eg, randomization) is ethically permissible.1,2 The principle of equipoise reconciles 2 potentially conflicting ethical imperatives: to ensure that research involving human participants generates scientifically sound and clinically relevant information while demonstrating proper respect and concern for the rights and interests of study participants.1

Related article page 483

In this issue of JAMA, Lascarrou et al3 report the results of a randomized trial designed to investigate whether the "routine use of the video laryngoscope for orotracheal intubation of patients in the ICU increased the frequency of successful first-pass intubation compared with use of the Macintosh direct laryngoscope." Intubation in the intensive care unit (ICU) is associated with the potential for serious adverse events, and video laryngoscopy in the ICU has gained support from some clinicians who believe it to be superior to direct laryngoscopy. Such practitioners may therefore regard it as unethical to randomize study participants to direct laryngoscopy because they consider it to be an inferior intervention. But requiring uncertainty of individual clinicians to conduct a clinical trial gives too much ethical weight to personal judgment, hindering valuable research without providing benefit to patients. Therefore, it is important to understand the role of conflicting expert medical judgment in establishing equipoise and how this principle applies to the trial conducted by Lascarrou et al.

What Is Equipoise?
Two features of medical research pose special challenges for the goal of ensuring respect and concern for the rights and interests of participants. First, to generate reliable information, research often involves design features that alter the way participants are treated. For example, randomization and blinding are commonly used to reduce selection bias and treatment bias.4 Controlling how interventions are allocated and what researchers and participants know about who is receiving which interventions helps to more clearly distinguish the effects of the intervention from confounding effects. But randomization severs the link between what a participant receives and the recommendation of a treating clinician with an ethical duty to provide the best possible care for the individual person. In the study by Lascarrou et al,3 patients were randomized to undergo intubation with the video laryngoscope or the direct laryngoscope, independent of the preference of the treating physician.

Second, medical research involves exposing people to interventions whose risks and potential therapeutic, prophylactic, or diagnostic merits may be unknown, unclear, or the subject of disagreement within the medical community. In the present case, some clinicians may maintain that video laryngoscopy is the superior strategy for orotracheal intubation in the ICU, others may disagree, while others judge that there is not sufficient evidence to warrant a strong commitment for or against this approach.

The principle of equipoise states that if there is uncertainty or conflicting expert opinion about the relative therapeutic, prophylactic, or diagnostic merits of a set of interventions, then it is permissible to allocate a participant to receive an intervention from this set, so long as there is not consensus that an alternative intervention would better advance that participant's interests.1,2,5-7

In the present case, there is equipoise between video vs direct laryngoscopy because experts disagree about their relative clinical merits. These disagreements are reflected in variations in clinical practices. If it is ethically permissible for patients to receive care from expert clinicians in good professional standing with differing medical opinions about what constitutes optimal treatment, then it ordinarily cannot be wrong to permit participants to be randomized to those same treatment alternatives.5 Although randomization removes the link between what a participant receives and the recommendation of a particular clinician, the presence of equipoise ensures that each participant receives an intervention that would be recommended or utilized by at least a reasonable minority of informed expert clinicians.1,5,6 Equipoise thus ensures that randomization is consistent with respect for participant interests because it guarantees that no participant receives care known to be inferior to any available alternative.

Why Is Equipoise Important?
Ensuring equipoise helps researchers and institutional review boards (IRBs) fulfill 3 ethical obligations. First, to "disturb" equipoise, studies must be designed to generate information that resolves uncertainty or reduces divergence in opinion among qualified medical experts. Such studies are likely to have both social and scientific value. Second, any risks to which participants are exposed must be reasonable in light of the value of the information a study is likely to produce.5,6 IRBs must make this determination before participants are enrolled.

Third is the obligation to show respect for potential participants as autonomous decision makers. Explaining during the informed consent process the nature of the uncertainty or conflict in medical judgment that a study is designed to resolve allows each individual to decide whether to participate by understanding the relevant uncertainties, their effects on that person's own interests, and how their resolution will contribute to improving the state of medical care.

What Are the Limitations of Equipoise?
Since its introduction, the concept of equipoise has received numerous formulations, creating the potential for confusion and misunderstanding2,7 and spurring criticism and debate. One criticism holds that the version of equipoise described here is too permissive because it allows randomization even when individual clinicians are not uncertain about how best to treat a patient.8 The trial conducted by Lascarrou et al3 represents a case in which some clinicians have strong preferences for one modality of treatment over others. Requiring individual clinician uncertainty entrenches unwarranted variation in patient care by preventing participants from being offered the choice of participating in a study in which they might be allocated to interventions that would be recommended or utilized by other medical experts. If it is ethically acceptable for patients to receive care from informed, expert clinicians who favor different interventions, then it ordinarily cannot be unethical to allow patients to be randomized to the alternatives that such clinicians recommend. Legitimate disagreement among informed experts signifies that the clinical community lacks a basis for judging that patients are better off with one modality over the other.

An interpretation of equipoise that requires uncertainty on the part of the individual clinician is not ethically justified because it prevents studies that are likely to improve the quality of patient care without the credible expectation that this restriction will improve patient outcomes.

Another criticism is that equipoise is unlikely ever to exist, or to persist for long.9 This objection applies most directly to the view that equipoise only exists if the individual clinician believes that the interventions offered in a trial are of exactly equal expected value.10 On this view, equipoise would often disappear even though different experts retain conflicting medical recommendations. It therefore appears poorly suited to the goals of promoting the production of valuable information and protecting the interests of study participants.

How Is Equipoise Applied in This Case?
Lascarrou et al did not explicitly discuss equipoise in their study. However, the consent process approved by the ethics committee reflects the judgment that the interventions in the trial "were considered components of standard care" and patients who lacked decisional capacity could be enrolled even if no surrogate decision maker was present.

Ensuring that a study begins in and is designed to disturb a state of equipoise provides credible assurance to participants and other stakeholders that patients in medical distress can be enrolled in a study that will help improve patient care in emergency settings without concern that their health interests will be knowingly compromised in the process.

How Does Equipoise Influence the Interpretation of the Study?
In the past, strongly held beliefs about the effectiveness of treatments ranging from bloodletting to menopausal hormone therapy have proven to be false. Intubation in the ICU is associated with the potential for serious adverse events. Because video laryngoscopy is increasingly championed as the superior method for orotracheal intubation in the ICU, careful study of its relative merits and risks in comparison to conventional direct laryngoscopy addresses a question of clinical importance. The findings of Lascarrou et al3 suggest that perceived merits of video laryngoscopy do not translate into superior clinical outcomes and may be associated with higher rates of life-threatening complications. This result underscores the importance of conducting clinical research before novel interventions become widely incorporated into clinical practice, even if those interventions appear to offer clear advantages over existing alternatives.
In the March 8, 2016, issue of JAMA, Zemek et al1 used logistic regression to develop a clinical risk score for identifying which pediatric patients with concussion will experience prolonged postconcussion symptoms (PPCS). The authors prospectively recorded the initial values of 46 potential predictor variables, or risk factors—selected based on expert opinion and previous research—in a cohort of children with acute concussion. In the first part of the study, the authors created a logistic regression model to estimate the probability of PPCS using a subset of the variables; in the second part of the study, a separate set of data was used to assess the validity of the model, with the degree of success quantified using regression model diagnostics. The rationale for using logistic regression to develop predictive models was summarized in an earlier JAMA Guide to Statistics and Methods article.2 In this article, we discuss how well a model performs once it is defined.

Figure. Receiver Operating Characteristic Curves
[ROC curves plotting sensitivity (%) against 1 − specificity (%) for the PPCS risk score (validation cohort) and for physicians' prediction alone. PPCS indicates persistent postconcussion symptoms. The area under the curve was 0.71 (95% CI, 0.69-0.74) for the derivation cohort and 0.68 (95% CI, 0.65-0.72) for the validation cohort. Based on Figure 2 from Zemek et al.1]

Use of the Method
Why Are Logistic Regression Model Diagnostics Used?
Logistic regression models are often created with the goal of predicting the outcomes of future patients based on each patient's predictor variables.2 Regression model diagnostics measure how well models describe the underlying relationships between predictors and patient outcomes existing within the data, either the data on which the model was built or data from a different population.

The accuracy of a logistic regression model is mainly judged by considering discrimination and calibration. Discrimination is the ability of the model to correctly assign a higher risk of an outcome to the patients who are truly at higher risk (ie, "ordering them" correctly), whereas calibration is the ability of the model to assign the correct average absolute level of risk (ie, accurately estimate the probability of the outcome for a patient or group of patients). Regression model diagnostics are used to quantify model discrimination and calibration.

Description of the Method
The model developed by Zemek et al discriminates well if it consistently estimates a higher probability of PPCS in patients who develop PPCS vs those who do not; this can be assessed using a receiver operating characteristic (ROC) curve. An ROC curve is a plot of the sensitivity of a model (the vertical axis) vs 1 minus the specificity (the horizontal axis) for all possible cutoffs that might be used to separate patients predicted to have PPCS compared with patients who will not have PPCS (Figure).1 Given any 2 random patients, one with PPCS and one without PPCS, the probability that the model will correctly rank the patient with PPCS as higher risk is equal to the area under the ROC curve (AUROC).3 This area is also called the C statistic, short for "concordance" between model estimates of risk and the observed risk. The C statistic is discussed in detail in a previous JAMA Guide to Statistics and Methods article.4 A model with perfect sensitivity and specificity would have an AUROC of 1. A model that predicts who has PPCS no better than chance would have an AUROC of 0.5. While dependent on context, C statistic values higher than 0.7 are generally considered fair and values higher than 0.9 excellent; those less than 0.7 generally are not clinically useful.5

A particular model might discriminate well, correctly identifying patients who are at higher risk than others, but fail to accurately estimate the absolute probability of an outcome. For example, the model might estimate that patients with a high risk of PPCS have a 99% chance of developing the condition, whereas their actual risk is only 80%. Although this hypothetical model would correctly discriminate, it would be poorly calibrated. One method to assess calibration is to compare the average predicted and average observed probabilities of an outcome both for the population as a whole and at each level of risk across a population. The patients are commonly divided into 10 groups based on their predicted risk, so-called deciles of risk. In a well-calibrated model, the observed and predicted proportions of patients with the outcome of interest will be the same within each risk category, at least within the expected random variability (see Table 6 in the article by Zemek et al). The Hosmer-Lemeshow test measures the statistical significance of any differences between the observed and predicted outcomes over the risk groups; when there is good agreement, the Hosmer-Lemeshow statistic will not show a statistically significant difference, suggesting that the model is well calibrated.6 Another way to assess calibration is through a calibration plot (eFigure 3 in the article by Zemek et al) in which the observed
1068 JAMA March 14, 2017 Volume 317, Number 10 (Reprinted) jama.com
proportion of the outcome of interest is plotted against the predicted probability.

Some statistical programs also report a pseudo-R2 regression diagnostic for logistic regression models. The pseudo-R2 is meant to mimic the R2 calculated for linear regression models, a measure of the fraction of the variability in the outcome that is explained by the model. However, because there is no direct equivalent to R2 in logistic regression, many variations of pseudo-R2 have been developed by different statisticians, each with a slightly different interpretation.7

What Are the Limitations of Logistic Regression Diagnostics?
It is easy to interpret extreme values of the AUROC statistic—those close to 1 or 0.5—but it is a matter of judgment to decide whether a value of 0.75, for example, represents acceptable discrimination. The AUROC is therefore subject to interpretation and comparison with the AUROC values of competing diagnostic tests. Additionally, using the AUROC alone as a metric assumes that a false-positive result is just as bad as a false-negative result. This assumption is often not appropriate in clinical scenarios, and more sophisticated metrics such as a decision curve analysis may be needed to appropriately account for the different costs of different types of misclassification.8

With large sample sizes the Hosmer-Lemeshow statistic can yield false-positive results and thus falsely suggest that a model is poorly calibrated. In addition, the Hosmer-Lemeshow statistic depends on the number of risk groups into which the study population is divided. There is no theoretical basis for the "correct" number of risk groups into which a population should be divided. Also, with sample sizes smaller than 500, the test has low power and can fail to identify poorly calibrated models.9

Why Did the Authors Use Logistic Regression Diagnostics in This Particular Study?
Logistic regression model diagnostics, and model diagnostics generally, are essential for judging the usefulness of any new prediction instrument. A model is unlikely to improve practice if it performs no better than chance or currently available tests. However, in particular clinical applications, physicians may be interested in using models that perform well on only one of these metrics or perform well only at a particular cut point. For example, consider a clinical screening test for which the intended use is to discriminate between patients with very low risk of a particular outcome and all others. Such a model might discriminate well at a particular screening cut point but have poor calibration, or it may have inaccurate estimation of risk for patients who are not classified as very low risk but still be completely appropriate for its intended use.

How Should the Results of Logistic Regression Diagnostics Be Interpreted in This Particular Study?
The ROC curve plotted by Zemek et al (Figure) demonstrates modest discrimination; in the initial derivation cohort, the AUROC was 0.71. In the validation cohort, the combination of physician judgment with the final prediction model produced an AUROC of 0.68. While this AUROC value might seem low, it was substantially better than physician estimation alone for predicting PPCS (AUROC of 0.55). As the authors pointed out, this difference indicated that the model generally outperformed clinical judgment alone, although it provided only fair discrimination at best.

The model used by Zemek et al appears well calibrated. The Hosmer-Lemeshow statistic associated with the comparison between predicted and observed rates of PPCS (Table 6 in the article by Zemek et al) across all deciles of risk was not significant. Furthermore, the sample size in this study is large enough that the Hosmer-Lemeshow statistic should have reasonable power to detect poor calibration. The intercept and slope of the calibration plot on the validation cohort were 0.07 and 0.90, respectively, closely approaching their respective ideals of 0 and 1.

Caveats to Consider When Assessing the Results of Logistic Regression Diagnostics
Whenever possible, all metrics of model quality should be measured on a data set separate from the data set used to build the model. Independence of test data is crucial because reusing the data on which a model was built (the "training data") to measure accuracy will overestimate the accuracy of the model in future clinical applications. Zemek et al used an independent validation cohort, recruited from the same centers as the training cohort. Therefore, although the model was tested against data other than that from which it was derived, it still may lack external validity in patient populations seen in other settings.10
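Both diagnostics can be computed directly from a list of predicted risks and observed outcomes. The sketch below (a Python illustration on simulated data, not the Zemek et al model; all names are invented for the example) estimates the AUROC as the concordance probability over all outcome-discordant pairs and tabulates predicted vs observed event rates by decile of risk:

```python
import random
import statistics

def auroc(pred, outcome):
    """C statistic: probability that a randomly chosen patient with the
    outcome receives a higher predicted risk than a randomly chosen
    patient without it (ties count one-half)."""
    pos = [p for p, y in zip(pred, outcome) if y]
    neg = [p for p, y in zip(pred, outcome) if not y]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def calibration_by_decile(pred, outcome, groups=10):
    """Mean predicted risk vs observed event rate in each risk decile."""
    order = sorted(range(len(pred)), key=lambda i: pred[i])
    size = len(order) // groups
    table = []
    for g in range(groups):
        idx = order[g * size:(g + 1) * size] if g < groups - 1 else order[g * size:]
        table.append((statistics.mean(pred[i] for i in idx),
                      statistics.mean(outcome[i] for i in idx)))
    return table

if __name__ == "__main__":
    rng = random.Random(1)
    # Simulated risks, with outcomes drawn from those same risks:
    # a well-calibrated model by construction.
    pred = [rng.random() for _ in range(4000)]
    outcome = [rng.random() < p for p in pred]
    print(f"AUROC: {auroc(pred, outcome):.2f}")
    for expected, observed in calibration_by_decile(pred, outcome):
        print(f"predicted {expected:.2f} vs observed {observed:.2f}")
```

A roughly constant agreement between the two columns of the decile table is what the Hosmer-Lemeshow test formalizes; because this simulated model is calibrated by construction, predicted and observed rates agree within sampling error in every decile.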
ARTICLE INFORMATION

Author Affiliations: Departments of Emergency Medicine and Neurology, University of Michigan, Ann Arbor (Meurer); Department of Emergency Medicine, Harbor-UCLA Medical Center, Torrance, California (Tolles); Los Angeles Biomedical Research Institute, Torrance, California (Tolles); David Geffen School of Medicine at UCLA, Los Angeles, California (Tolles).

Corresponding Author: William J. Meurer, MD, MS, Department of Emergency Medicine, 1500 E Medical Center Dr, Ann Arbor, MI 48109-5303 ([email protected]).

Section Editors: Roger J. Lewis, MD, PhD, Department of Emergency Medicine, Harbor-UCLA Medical Center and David Geffen School of Medicine at UCLA; and Edward H. Livingston, MD, Deputy Editor, JAMA.

Conflict of Interest Disclosures: All authors have completed and submitted the ICMJE Form for Disclosure of Potential Conflicts of Interest and none were reported.

REFERENCES
1. Zemek R, Barrowman N, Freedman SB, et al; Pediatric Emergency Research Canada (PERC) Concussion Team. Clinical risk score for persistent postconcussion symptoms among children with acute concussion in the ED. JAMA. 2016;315(10):1014-1025.
2. Tolles J, Meurer WJ. Logistic regression: relating patient characteristics to outcomes. JAMA. 2016;316(5):533-534.
3. Hanley JA, McNeil BJ. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology. 1982;143(1):29-36.
4. Pencina MJ, D'Agostino RB Sr. Evaluating discrimination of risk prediction models: the C statistic. JAMA. 2015;314(10):1063-1064.
5. Swets JA. Measuring the accuracy of diagnostic systems. Science. 1988;240(4857):1285-1293.
6. Hosmer DW Jr, Lemeshow S, Sturdivant RX. Applied Logistic Regression. 3rd ed. New York, NY: Wiley; 2013.
7. Cameron AC, Windmeijer FAG. An R-squared measure of goodness of fit for some common nonlinear regression models. J Econom. 1997;77(2):329-342. doi:10.1016/S0304-4076(96)01818-0
8. Fitzgerald M, Saville BR, Lewis RJ. Decision curve analysis. JAMA. 2015;313(4):409-410.
9. Hosmer DW, Hosmer T, Le Cessie S, Lemeshow S. A comparison of goodness-of-fit tests for the logistic regression model. Stat Med. 1997;16(9):965-980.
10. Efron B. How biased is the apparent error rate of a prediction rule? J Am Stat Assoc. 1986;81(394):461-470. doi:10.2307/2289236
jama.com (Reprinted) JAMA March 14, 2017 Volume 317, Number 10 1069
Clinical trials characterizing the effects of an experimental therapy rarely have only a single outcome of interest. In a previous report in JAMA,1 the CLEAN-TAVI investigators evaluated the benefits of a cerebral embolic protection device for stroke prevention during transcatheter aortic valve implantation. The primary end point was the reduction in the number of ischemic lesions observed 2 days after the procedure. The investigators were also interested in 16 secondary end points involving measurement of the number, volume, and timing of cerebral lesions in various brain regions. Statistically comparing a large number of outcomes using the usual significance threshold of .05 is likely to be misleading because there is a high risk of falsely concluding that a significant effect is present when none exists.2 If 17 comparisons are made when there is no true treatment effect, each comparison has a 5% chance of falsely concluding that an observed difference exists, leading to a 58% chance of falsely concluding at least 1 difference exists. The formula 1 − (1 − α)^N can be used to calculate the chance of obtaining at least 1 falsely significant result when there is no true underlying difference between the groups (in this case α is .05 and N is 17 for the number of tests).

To avoid a false-positive result while still comparing the multiple clinically relevant end points used in the CLEAN-TAVI study, the investigators used a serial gatekeeping approach for statistical testing. This method tests an outcome, and if that outcome is statistically significant, then the next outcome is tested. This minimizes the chance of falsely concluding a difference exists when it does not.

Use of the Method
Why Is Serial Gatekeeping Used?
Many methods exist for conducting multiple comparisons while keeping the overall trial-level risk of a false-positive error at an acceptable level. The Bonferroni approach3 requires a more stringent criterion for statistical significance (a smaller P value) for each statistical test, but each is interpreted independently of the other comparisons. This approach is often considered to be too conservative, reducing the ability of the trial to detect true benefits when they exist.4 Other methods leverage additional knowledge about the trial design to allow only the comparisons of interest. In the Dunnett method for comparing multiple experimental drug doses against a single control, the number of comparisons is reduced by never comparing experimental drug doses against each other.5 Multiple comparison procedures, including the Hochberg procedure, have been discussed in a prior JAMA Guide to Statistics and Methods.2

Description of the Method
A serial gatekeeping procedure controls the false-positive risk by requiring the multiple end points to be compared in a predefined sequence and stopping all further testing once a nonsignificant result is obtained. A given comparison might be considered positive if it were placed early in the sequence, but the same analysis would be considered negative if it were positioned in the sequence after a negative result. By restricting the pathways for obtaining a positive result, gatekeeping controls the risk of false-positive results but preserves greater power for the earlier, higher-priority end points. This approach works well to test a sequence of secondary end points as in the CLEAN-TAVI study or to test a series of branching secondary end points (Figure).

Steps in serial gatekeeping are as follows: (1) determine the order for testing multiple end points, considering their relative importance and the likelihood that there is a difference in each; (2) test the first end point against the desired global false-positive rate (ie, .05) and, if the finding does not reach statistical significance, then stop all further testing and declare this and all downstream end points nonsignificant. If testing the first end point is significant, then declare this difference significant and proceed with the testing of the next end point; (3) test the next end point using a significance threshold of .05; if not significant, stop all further testing and declare this and all downstream end points nonsignificant. If significant, then declare this difference significant and proceed with the testing of the next end point; and (4) repeat the prior step until obtaining a first nonsignificant result, or until all end points have been tested.

As shown in the Figure, this approach can be extended to test 2 or more end points at the same step by using a Bonferroni adjustment to evenly split the false-positive error rate within the step. In that case, testing is continued until either all branches have obtained a first nonsignificant result or all end points have been tested. For example, a neuroimaging end point could be used as a single end point for the first level, reflecting the assumption that if an improvement in an imaging outcome is not achieved then an improvement in a patient-centered functional outcome is highly unlikely, followed by a split to allow the testing of motor functions on one branch and verbal functions on the other. This avoids the need to prioritize either motor or verbal function over the other and may increase the ability to demonstrate an improvement in either domain.

Serial gatekeeping provides strict control of the false-positive error rate because it restricts multiple comparisons by sequentially testing hypotheses until the first nonsignificant test is found, and, no matter how significant later end points appear to be, they are never tested. The advantage is increased power for detecting effects on the end points that appear early in the sequence because they are tested against .05 rather than, eg, .05 divided by the total number of outcomes tested using a traditional Bonferroni adjustment. By accounting for the importance of certain hypotheses over others and by grouping hypotheses into primary and secondary groups, gatekeeping allocates the trial's power to be consistent with the investigators' priorities.6

What Are the Limitations of Gatekeeping Strategies?
Gatekeeping strategies are a powerful way to incorporate trial-specific clinical information to create prespecified ordering of hypotheses and mitigate the need to adjust for multiple comparisons
jama.com (Reprinted) JAMA October 10, 2017 Volume 318, Number 14 1385
Figure. Criteria for Statistical Significance That Would Be Used in a Hypothetical Gatekeeping Strategy

[Flowchart: Outcomes 1, 2, and 3 (A vs B) are tested in sequence, each requiring P < .05 to proceed; any result with P ≥ .05 stops all testing. If all 3 are significant, testing splits into 2 parallel branches tested at P < .025: Outcomes 4 and 5 on one branch and Outcomes 6 and 7 on the other; a result with P ≥ .025 stops further testing on that branch only.]

This Figure shows the criteria for statistical significance that would be used in a hypothetical gatekeeping strategy in which there are 3 levels each with a single end point, followed by 2 levels with 2 end points each. The 3 end points are each tested in order against a criterion of .05. All testing stops as soon as 1 result is nonsignificant. If all are significant then a pair of fourth-level end points are tested, and to preserve the required significance of .05 at that level across 2 end points, the criterion for statistical significance is adjusted with a Bonferroni correction value of .025 for each. If 1 or both of these end points is significant at .025, then the next end point in the branch is tested, against a criterion of .025. If 1 or both are nonsignificant, no further testing occurs. If any outcome tested along a given pathway is not statistically significant, no further outcomes along that branch are tested because they are assumed to be nonsignificant.
at each stage of testing. The primary challenge in using gatekeeping is the need to prespecify and truly commit to the order of testing. The resulting limitation is that if, in retrospect, the order of outcome testing appears ill chosen (eg, if an early end point is negative and important end points later in the sequence appear to suggest large treatment effects), then there is no rigorous, post hoc method for statistically evaluating the later end points. This highlights the importance of having a clear data analysis strategy determined before the trial is started, and maintaining transparency (eg, publishing the study design and analysis plan on public websites or in journals).

How Was Gatekeeping Used in This Case?
The CLEAN-TAVI investigators used a gatekeeping strategy to compare several magnetic resonance imaging end points along with neurological and neurocognitive performance.1 The first was the primary study end point, the number of brain lesions 2 days after TAVI. Secondary end points were only tested if the primary one was positive. Then, up to 16 secondary end points were tested in a defined sequence. The study was markedly positive, with the primary and many secondary end points demonstrating benefit. The first 8 comparisons were reported in detail in the publication—in their prespecified order—retaining the structure of the gatekeeping strategy.1

How Should the Results Be Interpreted?
The CLEAN-TAVI clinical trial demonstrated the efficacy of a cerebral protection strategy with respect to multiple imaging measures of ischemic damage. The use of the prespecified gatekeeping strategy should provide assurance that the large number of imaging end points that were compared was unlikely to have led to false-positive results.
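The familywise error arithmetic and the 4-step serial procedure described above can be sketched in a few lines of code. The P values below are made up for illustration and are not from the CLEAN-TAVI trial.

```python
def familywise_error(alpha, n_tests):
    """Chance of at least 1 false-positive among N independent tests
    when no true differences exist: 1 - (1 - alpha)**N."""
    return 1 - (1 - alpha) ** n_tests

print(round(familywise_error(0.05, 17), 2))  # 0.58, the figure cited in the text

def serial_gatekeeping(p_values, alpha=0.05):
    """Test end points in their prespecified order; the gate closes at the
    first nonsignificant result, and later end points are never tested."""
    results = []
    gate_open = True
    for p in p_values:
        significant = gate_open and p < alpha
        results.append(significant)
        gate_open = significant
    return results

# Hypothetical sequence: the third end point closes the gate, so the
# fourth is declared nonsignificant even though its P value is small.
print(serial_gatekeeping([0.010, 0.030, 0.200, 0.001]))
# [True, True, False, False]

def branch_results(branches, alpha=0.05):
    """Branching extension (as in the Figure): each parallel branch is
    tested serially at alpha divided evenly across the branches, and a
    nonsignificant result stops only its own branch."""
    split = alpha / len(branches)
    return [serial_gatekeeping(branch, split) for branch in branches]

print(branch_results([[0.010, 0.030], [0.020, 0.004]]))
# [[True, False], [True, True]]: P = .030 fails the .025 criterion and
# stops its branch; the other branch continues and both end points pass
```

This is a sketch of the general procedure, not the CLEAN-TAVI analysis code; real analyses also prespecify the ordering and thresholds in the statistical analysis plan.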
ARTICLE INFORMATION
Author Affiliations: Department of Emergency Medicine, Harbor-UCLA Medical Center, Torrance, California (Yadav, Lewis); Los Angeles Biomedical Research Institute, Torrance, California (Yadav); Berry Consultants, LLC, Austin, Texas (Lewis).
Corresponding Author: Kabir Yadav, MDCM, MS, MSHS, Department of Emergency Medicine, 1000 W Carson St, Box 21, Torrance, CA 90509 ([email protected]).
Section Editors: Roger J. Lewis, MD, PhD, Department of Emergency Medicine, Harbor-UCLA Medical Center and David Geffen School of Medicine at UCLA; and Edward H. Livingston, MD, Deputy Editor, JAMA.
Conflict of Interest Disclosures: Both authors have completed and submitted the ICMJE Form for Disclosure of Potential Conflicts of Interest and none were reported.

REFERENCES
1. Haussig S, Mangner N, Dwyer MG, et al. Effect of a cerebral protection device on brain lesions following transcatheter aortic valve implantation in patients with severe aortic stenosis. JAMA. 2016;316(6):592-601.
2. Cao J, Zhang S. Multiple comparison procedures. JAMA. 2014;312(5):543-544.
3. Bland JM, Altman DG. Multiple significance tests: the Bonferroni method. BMJ. 1995;310(6973):170.
4. Hommel G, Bretz F, Maurer W. Powerful short-cuts for multiple testing procedures with special reference to gatekeeping strategies. Stat Med. 2007;26(22):4063-4073.
5. Holm S. A simple sequentially rejective multiple test procedure. Scand J Stat. 1979;6(2):65-70.
6. Dmitrienko A, Millen BA, Brechenmacher T, Paux G. Development of gatekeeping strategies in confirmatory clinical trials. Biom J. 2011;53(6):875-893.
In this issue of JAMA, Laptook et al1 report the results of a clinical trial investigating the effect of hypothermia administered between 6 and 24 hours after birth on death and disability from hypoxic-ischemic encephalopathy (HIE). Hypothermia is beneficial for HIE when initiated within 6 hours of birth, but administering hypothermia that soon after birth is impractical.2 The study by Laptook et al1 addressed the utility of inducing hypothermia 6 or more hours after birth because this is a more realistic time window given the logistics of providing this therapy. Performing this study was difficult because of the limited number of infants expected to be enrolled. To overcome this limitation, the investigators used a Bayesian analysis of the treatment effect to ensure that a clinically useful result would be obtained even if traditional approaches for defining statistical significance were impractical. The Bayesian approach allows for the integration or updating of prior information with newly obtained data to yield a final quantitative summary of the information. Laptook et al1 considered several options for the representation of prior information—termed neutral, skeptical, and enthusiastic priors—generating different final summaries of the evidence.

Related article page 1550

Prior Information
What Is Prior Information?
Prior information is the evidence or beliefs about something that exist prior to or independently of the data to be analyzed. The mathematical representation of prior information (eg, of beliefs regarding the likely efficacy of hypothermia for HIE 6-24 hours after birth) must summarize both the known information and the remaining uncertainty. Some prior information is quite strong, such as data from many similar patients, and might have little remaining uncertainty, or it can be weak or uninformative with substantial uncertainty.

Clinicians routinely interpret the results of a new study in the context of prior work. Are the new results consistent? How can new information be synthesized with the old? Often this synthesis is done by clinicians when they consider the totality of evidence used to treat patients or interpret research studies.

Prior information may be formally incorporated in trial analysis using Bayes theorem, which provides a mechanism for synthesizing information from multiple sources.3,4 Clear specification of the prior information used and the assumptions made needs to be reported in the article or appendix to allow transparency in the analysis and reporting of outcomes.

Why Is Prior Information Important?
When large quantities of patient outcome data are available, traditional non-Bayesian (frequentist) and Bayesian approaches for quantifying observed treatment effects will yield similar results because the contribution of the observed data will outweigh that of the prior information. This is not the case for evaluating HIE treatments because very few neonates are affected. Despite a large research network, Laptook et al1 were only able to enroll 168 eligible newborns in 8 years.

Prior information facilitates more efficient study design, allowing stronger, more definitive conclusions without requiring additional patients to be included in the study or analysis. As such, the use of prior information is particularly relevant and important for the study of rare diseases where patient resources are limited.

Prior information can take a number of forms. For example, for binary outcomes, the knowledge that an adverse outcome occurs in 15% to 40% of cases is worth the equivalent of having to enroll 30 or more patients into the trial (depending on the certainty attached to this knowledge). Another form of prior information could be beliefs held regarding the effect of a delay beyond 6 hours in instituting therapeutic hypothermia, ie, that the treatment effect at 7 hours is similar to that at 6 hours and the longer it takes to begin treatment, the less effective the treatment is likely to be.

Limitations of Prior Information
Prior information is a form of assumption. As with any assumption, incorrect prior information can result in invalid or misleading conclusions. For instance, if prior information used the assumption that hypothermia becomes less effective with increasing postnatal age and, in fact, waiting until 12 to 24 hours was associated with the greatest benefit, the resulting inferences would likely be incompatible with the data, less accurate, or biased. If the statistical model uses prior information derived from neonates 0 to 6 hours old in evaluating the treatment effect in neonates 6 to 24 hours of age, and is based on the assumption that the patients respond similarly, the results may be biased or less accurate if the 2 age groups actually respond differently to treatment.

These assumptions can be assessed. Just as the modeling assumptions made in logistic regression can be checked through goodness-of-fit tests,5 there are tests that can be used to verify agreement between prior and current data. More importantly, some methods for incorporating prior information can explicitly adjust to conflict between the prior and the data, decreasing the reliance on prior information when the new data appear to be inconsistent with the proposed prior information.6

How Was Prior Information Used?
Laptook et al1 incorporated prior information by allowing for the outcome to vary across time windows of 6 to 12 hours and 12 to 24 hours and prespecifying 3 separate prior distributions on the overall treatment effect (Description of Bayesian Analyses and Implementation Details section of the eAppendix in Supplement 2). The neutral prior assumes that the treatment effect diminishes completely after 6 hours, the enthusiastic prior assumes that effect does not
jama.com (Reprinted) JAMA October 24/31, 2017 Volume 318, Number 16 1605
diminish at all after 6 hours, and the skeptical prior assumes that the treatment is detrimental after 6 hours. Primary results are presented based on the neutral prior and, as such, the authors' approach is transparent and easily interpretable. The authors found a 76% probability of benefit with the neutral prior, a 90% probability of benefit with the enthusiastic prior, and a 73% probability of benefit with the skeptical prior.1

An alternative to this approach might include specifying a model that relates postnatal age at the start of therapeutic hypothermia to the magnitude of the treatment effect, assuming that the effect does not increase over time. This model would explicitly account for a possible decrease in treatment benefit with increasing age at initiation, while still allowing the effect at each age to inform the effects at other ages. Additionally, this model could be heavily informed or anchored in the 0 to 6–hour range using data from previous studies.2 With this anchor, inferences would be improved across the range of 6 to 24 hours, with a particular increase in precision for the time intervals closer to 6 hours. This may have allowed more definitive conclusions to be drawn from the same set of data.

How Should the Trial Results Be Interpreted in Light of the Prior Information?
Laptook et al1 used a prespecified Bayesian analysis, using prior information, to allow quantitatively rigorous conclusions to be drawn regarding the probability that therapeutic hypothermia is effective 6 to 24 hours after birth in neonates with HIE. Conclusions of the analysis were given as probabilities that benefit exists. For example, the statement that there is "a 76% probability of any reduction in death or disability, and a 64% probability of at least 2% less death or disability" is easily understood by clinicians and can be used to inform clinical care. The use of several options for prior information allows clinicians with different perspectives to have the data interpreted over a range of prior beliefs.
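The idea that prior information is "worth" a certain number of patients can be made concrete with a conjugate Beta-Binomial sketch. This illustrates the general principle only; it is not the model used by Laptook et al, and the prior parameters and trial counts below are hypothetical.

```python
def update_beta(a, b, events, n):
    """Posterior Beta(a, b) parameters after observing `events` adverse
    outcomes among n patients (conjugate Beta-Binomial updating)."""
    return a + events, b + (n - events)

# A Beta(8, 22) prior has mean 8/30 ~ 0.27, places most of its mass
# roughly between 15% and 40%, and carries the weight of about
# a + b = 30 patients' worth of data.
a0, b0 = 8, 22

# Hypothetical new trial data: 12 adverse outcomes among 40 patients.
a1, b1 = update_beta(a0, b0, events=12, n=40)

print((a1, b1))                  # (20, 50): the prior's 30 "patients" plus the 40 observed
print(round(a1 / (a1 + b1), 3))  # posterior mean 0.286, between the prior mean
                                 # (0.267) and the observed rate (0.30)
```

A strong prior (large a + b) pulls the posterior toward the prior mean; a weak prior lets the new data dominate, which is why prior information matters most when, as here, few patients can be enrolled.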
ARTICLE INFORMATION
Author Affiliations: Berry Consultants LLC, Austin, Texas (Quintana, Viele, Lewis); Department of Emergency Medicine, Harbor-UCLA Medical Center, Los Angeles, California (Lewis); Los Angeles Biomedical Research Institute, Torrance, California (Lewis); David Geffen School of Medicine at UCLA, Los Angeles, California (Lewis).
Corresponding Author: Roger J. Lewis, MD, PhD, Department of Emergency Medicine, Harbor-UCLA Medical Center, Bldg D9, 1000 W Carson St, Torrance, CA 90509 ([email protected]).
Section Editors: Roger J. Lewis, MD, PhD, Department of Emergency Medicine, Harbor-UCLA Medical Center and David Geffen School of Medicine at UCLA; and Edward H. Livingston, MD, Deputy Editor, JAMA.
Conflict of Interest Disclosures: All authors have completed and submitted the ICMJE Form for Disclosure of Potential Conflicts of Interest and none were reported.

REFERENCES
1. Laptook AR, Shankaran S, Tyson JE, et al; Eunice Kennedy Shriver National Institute of Child Health and Human Development Neonatal Research Network. Effect of therapeutic hypothermia initiated after 6 hours of age on death or disability among newborns with hypoxic-ischemic encephalopathy: a randomized clinical trial. JAMA. doi:10.1001/jama.2017.14972
2. Jacobs SE, Morley CJ, Inder TE, et al; Infant Cooling Evaluation Collaboration. Whole-body hypothermia for term and near-term newborns with hypoxic-ischemic encephalopathy: a randomized controlled trial. Arch Pediatr Adolesc Med. 2011;165(8):692-700.
3. Food and Drug Administration. Guidance for the use of Bayesian statistics in medical device clinical trials. https://www.fda.gov/MedicalDevices/ucm071072.htm. Published February 5, 2010. Accessed September 20, 2017.
4. Spiegelhalter DJ, Abrams KR, Myles JP. Bayesian Approaches to Clinical Trials and Health-Care Evaluation. Chichester, England: Wiley; 2004.
5. Meurer WJ, Tolles J. Logistic regression diagnostics: understanding how well a model predicts outcomes. JAMA. 2017;317(10):1068-1069.
6. Viele K, Berry S, Neuenschwander B, et al. Use of historical control data for assessing treatment effects in clinical trials. Pharm Stat. 2014;13(1):41-54.
There are many harmful manifestations of atherosclerotic cardiovascular disease (ASCVD). Because all of these manifestations are undesirable, combining the most important ones into a single study outcome measure can simplify efforts to measure the overall effect of the disease on health outcomes. For example, ASCVD can result in myocardial infarction (MI), stroke, or death. Each of these is to be avoided, and how well an intervention reduces the risk of any of these occurring can be measured by combining all of these clinical outcomes into a single composite end point. A composite end point is an outcome that is defined as occurring if 1 or more of the components occurs. For ASCVD, one of the most common composites is called major adverse cardiovascular events (MACE). Because a composite outcome occurs more frequently than its individual components, composites can reduce the number of study participants required to achieve the desired power of a study, making it easier and less expensive to conduct a clinical trial.

To demonstrate the benefits and limitations of composite outcomes, this JAMA Guide to Statistics and Methods article reviews a study by Kavousi et al1 that assessed the utility of coronary artery calcification (CAC) testing for estimating the probability of incident ASCVD in low-risk women using a composite end point that included nonfatal MI, coronary heart disease (CHD) death, and stroke. Overall, the authors found that CAC "was associated with an increased risk of ASCVD and modest improvement in prognostic accuracy compared with traditional risk factors."1

Why Are Composite End Points Used in Clinical Studies?
Composite end points may be used in a clinical trial (or in observational studies) if the target disease has several clinically important consequences and the study is intended to examine the effects (or association) of an intervention on (or exposure with) more than 1 consequence or end point. In this case, a composite end point provides a summary measure for the treatment effect. Composite end points, such as MACE, may also be used when a single outcome of interest (eg, CHD death in a low-risk population) is rare, making it impractical to conduct studies that are adequately powered to demonstrate an effect of an intervention on its occurrence.2 For rare outcomes, researchers often combine several types of events (CHD death, MI, and stroke) in a single composite end point. Because the frequency of the composite end point is greater than any of its components, this facilitates the design of studies of reasonable size and duration that have sufficient statistical power. If only 1 infrequent outcome were considered, such as CHD death, studies to determine the effect of an intervention on those outcomes could be unreasonably large or take too long to complete.

Limitations of Composite End Points
When multiple outcomes are combined into a composite end point, each is given the same importance because the occurrence of any component counts equally within the composite. However, the relative importance of each component to patients, their families, and clinicians may be very different. For example, it is common to count the occurrence of CHD death, nonfatal MI, or stroke equally as a MACE event. If any of these outcomes should occur, the patient is considered to have the outcome event, resulting in each component being weighted equally. However, CHD death is more important to patients than nonfatal MI, especially if the patient recovers from the MI with little or no long-term effects.

If patients perceive the importance of individual components differently (eg, death being a much worse outcome than having an MI), then using a singular composite end point to represent a study result may be misleading. For example, the LIFE trial compared losartan and atenolol for hypertension and showed a statistically significant advantage of losartan for reducing the composite end point of having an ASCVD event (CHD death, MI, or stroke). However, this effect was only observed with stroke (fatal and nonfatal) and not with MI or CHD death.3 Often, a positive effect on the composite end point is driven by the event with the highest occurrence rate (eg, a significant reduction in nonfatal MI). If this component has relatively few consequences for a patient but other outcomes, such as stroke or CHD death, that are more consequential are unaffected or even increased by the intervention, then the apparent benefit of the intervention is misleading. To remedy this limitation, additional analyses, with each component considered individually, are recommended as an adjunct to the analysis of the composite outcome. However, because each individual component is less common than the composite, the power of these analyses is often limited and, further, multiple comparisons increase the risk of a false-positive result.4

In addition, if patients can experience the composite outcome more than once, and the number of events is the outcome of interest, then an intervention that increases the incidence of CHD death might incorrectly appear beneficial. This can happen because, when a death occurs, patients are no longer at risk of having a nonfatal MI or stroke. This is known as a competing risk because some of the risks being assessed cannot happen after another, such as death, has occurred. The intervention may not be desirable, even if it shows an overall advantage on the composite of incident ASCVD, because a study participant who would have had 2 mild MIs instead dies first and has only 1 event that is more serious. Competing risks might therefore result in an underestimate of the occurrence rate of the composite end point that would occur if the mortality rate were not increased by the intervention. The study by Kavousi et al1 avoided possible bias from competing risks by including only first-incident events in the analysis.

Composite end points can be more useful when a specific weight is assigned to each component that reflects each component's importance or "utility" to patients and clinicians. For example, avoiding CHD death would have a greater utility to
1820 JAMA November 14, 2017 Volume 318, Number 18 (Reprinted) jama.com
patients than avoiding a nonfatal MI. A composite end point in asymptomatic individuals is associated with higher risk of coro-
with no weights to its components assumes the utilities of all out- nary heart disease (CHD) and all-cause mortality.” They used the
comes are equal and, in clinical medicine, this is almost never the composite end point of incident ASCVD (CHD death, nonfatal MI,
case. The relative value of the weights or utilities of each outcome and stroke) to assess the value of CAC for cardiovascular risk assess-
of interest should be scientifically elicited from patients or clini- ment among women with low risk of CVD (<7.5% predicted risk)
cians. For example, in a study by Ho et al,5 patient preferences and concluded that CAC presence was associated with an in-
were elicited using a discrete-choice experiment in which 540 creased risk of incident ASCVD. The incidence rate difference be-
obese respondents evaluated the effectiveness, safety, and other attributes of weight-loss devices. The study generated patient utilities for effectiveness, safety, and other device attributes that were subsequently used to inform regulatory decision-making.

Eliciting and assigning utilities can be difficult because individual patients and clinicians are likely to assign different values to each outcome. Equal weighting of each component of a composite outcome avoids this complexity and subjectivity, but the loss of important information can be considerable because this approach ignores the relative importance of markedly different clinical events. Despite the difficulties and subjectivity involved in assigning relative weights or patient utilities, doing so can be advantageous because the weights represent the relative value patients and clinicians place on the individual components of the composite end point.6

How Were Composite End Points Used in This Study?
Kavousi et al1 stated that "CAC scanning allows for the detection of subclinical coronary atherosclerosis, and the presence of CAC

tween CAC presence and CAC absence was 2.92 events (95% CI, 2.02-3.83) per 1000 person-years.

In addition, the composite end point of total CHD (nonfatal MI and CHD death) was examined as a secondary outcome. The incidence rate difference between CAC presence and CAC absence groups was 2.63 events (95% CI, 1.92-3.34) per 1000 person-years.

How Does the Use of a Composite End Point Affect the Interpretation of This Study?
The authors did not separately assess the utility of CAC testing for predicting CHD death, which is the most important component of the ASCVD composite end point to most patients and clinicians. Consequently, the authors' conclusion that "the presence of CAC in asymptomatic individuals is associated with higher risk for coronary heart disease (CHD) and all-cause mortality" could be challenged if the larger incidence of CHD and all-cause mortality in the group in which CAC was present were explained by a higher incidence of nonfatal MI while the group with CAC absence still had a higher incidence of CHD deaths.
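The contrast between equal weighting and utility weighting of composite components discussed above can be made concrete with a minimal sketch. The utilities, event labels, and patient lists below are entirely hypothetical, not values from any cited study:

```python
# Hypothetical utility-weighted composite: each patient's score is the
# utility of the component event experienced (1.0 = best, 0.0 = worst).
UTILITIES = {
    "no_event": 1.00,
    "nonfatal_mi": 0.60,   # hypothetical weight
    "chd_death": 0.00,
}

def event_rate(events):
    """Equal weighting: any component event counts the same."""
    return sum(1 for e in events if e != "no_event") / len(events)

def mean_utility(events):
    """Utility weighting: average utility-weighted outcome per group."""
    return sum(UTILITIES[e] for e in events) / len(events)

group_a = ["no_event", "no_event", "nonfatal_mi", "no_event"]
group_b = ["no_event", "chd_death", "no_event", "nonfatal_mi"]

diff = mean_utility(group_a) - mean_utility(group_b)
```

Under equal weighting, group B simply has twice the event rate of group A; the utility-weighted comparison additionally reflects that group B's events include a CHD death, the component rated worst.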
ARTICLE INFORMATION

Author Affiliation: Office of Biostatistics and Epidemiology, Center for Biologics Evaluation and Research/US Food and Drug Administration, Silver Spring, Maryland.

Corresponding Author: Telba Z. Irony, PhD, Office of Biostatistics and Epidemiology, Center for Biologics Evaluation and Research/Food and Drug Administration, 10903 New Hampshire Ave, Bldg 71, 1216, Silver Spring, MD 20953 ([email protected]).

Section Editors: Roger J. Lewis, MD, PhD, Department of Emergency Medicine, Harbor-UCLA Medical Center and David Geffen School of Medicine at UCLA; and Edward H. Livingston, MD, Deputy Editor, JAMA.

Conflict of Interest Disclosures: The author has completed and submitted the ICMJE Form for Disclosure of Potential Conflicts of Interest and none were reported.

Disclaimer: This article reflects the views of the author and should not be construed to represent the Food and Drug Administration's views or policies.

REFERENCES
1. Kavousi M, Desai CS, Ayers C, et al. Prevalence and prognostic implications of coronary artery calcification in low-risk women: a meta-analysis. JAMA. 2016;316(20):2126-2134.
2. Kleist P. Composite endpoints for clinical trials: current perspectives. Int J Pharm Med. 2007;21(3):187-198. doi:10.2165/00124363-200721030-00001
3. Dahlöf B, Devereux RB, Kjeldsen SE, et al; LIFE Study Group. Cardiovascular morbidity and mortality in the Losartan Intervention for Endpoint Reduction in Hypertension Study (LIFE): a randomised trial against atenolol. Lancet. 2002;359(9311):995-1003.
4. Cao J, Zhang S. Multiple comparison procedures. JAMA. 2014;312(5):543-544.
5. Ho MP, Gonzalez JM, Lerner HP, et al. Incorporating patient-preference evidence into regulatory decision making. Surg Endosc. 2015;29(10):2984-2993.
6. Chaisinanunkul N, Adeoye O, Lewis RJ, et al; DAWN Trial and MOST Trial Steering Committees; Additional contributors from DAWN Trial Steering Committee. Adopting a patient-centered approach to primary outcome analysis of acute stroke trials using a utility-weighted Modified Rankin Scale. Stroke. 2015;46(8):2238-2243.
jama.com (Reprinted) JAMA November 14, 2017 Volume 318, Number 18 1821
Mendelian Randomization
Connor A. Emdin, DPhil; Amit V. Khera, MD; Sekar Kathiresan, MD
Mendelian randomization uses genetic variants to determine whether an observational association between a risk factor and an outcome is consistent with a causal effect.1 Mendelian randomization relies on the natural, random assortment of genetic variants during meiosis yielding a random distribution of genetic variants in a population.1 Individuals are naturally assigned at birth to inherit a genetic variant that affects a risk factor (eg, a gene variant that raises low-density lipoprotein [LDL] cholesterol levels) or not inherit such a variant. Individuals who carry the variant and those who do not are then followed up for the development of an outcome of interest. Because these genetic variants are typically unassociated with confounders, differences in the outcome between those who carry the variant and those who do not can be attributed to the difference in the risk factor. For example, a genetic variant associated with higher LDL cholesterol levels that also is associated with a higher risk of coronary heart disease would provide supportive evidence for a causal effect of LDL cholesterol on coronary heart disease.

One way to explain the principles of mendelian randomization is through an example: the study of the relationship of high-density lipoprotein (HDL) cholesterol and triglycerides with coronary heart disease. Increased HDL cholesterol levels are associated with a lower risk of coronary heart disease, an association that remains significant even after multivariable adjustment.2 By contrast, an association between increased triglyceride levels and coronary risk is no longer significant following multivariable analyses. These observations have been interpreted as HDL cholesterol being a causal driver of coronary heart disease, whereas triglyceride level is a correlated bystander.2 To better understand these relationships, researchers have used mendelian randomization to test whether the observational associations between HDL cholesterol or triglyceride levels and coronary heart disease risk are consistent with causal relationships.3-5

Use of the Method
Why Is Mendelian Randomization Used?
Basic principles of mendelian randomization can be understood through comparison with a randomized clinical trial. To answer the question of whether raising HDL cholesterol levels with a treatment will reduce the risk of coronary heart disease, individuals might be randomized to receive a treatment that raises HDL cholesterol levels and a placebo that does not have this effect. If there is a causal effect of HDL cholesterol on coronary heart disease, a drug that raises HDL cholesterol levels should eventually reduce the risk of coronary heart disease. However, randomized trials are costly, take a great deal of time, and may be impractical to carry out, or there may not be an intervention to test a certain hypothesis, limiting the number of clinical questions that can be answered by randomized trials.

What Are the Limitations of Mendelian Randomization?
Mendelian randomization rests on 3 assumptions: (1) the genetic variant is associated with the risk factor; (2) the genetic variant is not associated with confounders; and (3) the genetic variant influences the outcome only through the risk factor. The second and third assumptions are collectively known as independence from pleiotropy. Pleiotropy refers to a genetic variant influencing the outcome through pathways independent of the risk factor. The first assumption can be evaluated directly by examining the strength of association of the genetic variant with the risk factor. The second and third assumptions, however, cannot be empirically proven and require both judgment by the investigators and the performance of various sensitivity analyses.

If genetic variants are pleiotropic, mendelian randomization studies may be biased. For example, if genetic variants that increase HDL cholesterol levels also affect the risk of coronary heart disease through an independent pathway (eg, by decreasing inflammation), a causal effect of HDL cholesterol on coronary heart disease may be claimed when the true causal effect is due to the alternate pathway.

Another limitation is statistical power. Determinants of statistical power in a mendelian randomization study include the frequency of the genetic variant(s) used, the effect size of the variant on the risk factor, and study sample size. Because any given genetic variant typically explains only a small proportion of the variance in the risk factor, multiple variants are often combined into a polygenic risk score to increase statistical power.

How Did the Authors Use Mendelian Randomization?
In a previous report in JAMA, Frikke-Schmidt et al4 initially applied mendelian randomization to HDL cholesterol and coronary heart disease using gene variants in the ABCA1 gene. When compared with noncarriers, carriers of loss-of-function variants in the ABCA1 gene displayed a 17-mg/dL lower HDL cholesterol level but did not have an increased risk of coronary heart disease (odds ratio, 0.93; 95% CI, 0.53-1.62). The observed 17-mg/dL decrease in HDL cholesterol level is expected to increase coronary heart disease by 70%, and this study had more than 80% power to detect such a difference; thus, the lack of a genetic association of ABCA1 gene variants and coronary heart disease was unlikely to be due to low statistical power. These data were among the first to cast doubt on the causal role of HDL cholesterol for coronary heart disease. In other mendelian randomization studies, genetic variants that raised HDL cholesterol levels were not associated with reduced risk of coronary heart disease, a result consistent with HDL cholesterol as a noncausal factor.5

Low HDL cholesterol levels track with high plasma triglyceride levels, and triglyceride levels reflect the concentration of triglyceride-rich lipoproteins in blood. Using multivariable mendelian randomization, Do et al3 examined the relationship among correlated risk factors such as HDL cholesterol and triglyceride levels. In an
jama.com (Reprinted) JAMA November 21, 2017 Volume 318, Number 19 1925
Figure. Comparison of Observational Estimates and Mendelian Randomization Estimates of the Association of Low-Density Lipoprotein (LDL) Cholesterol, High-Density Lipoprotein (HDL) Cholesterol, and Triglycerides With Coronary Heart Disease

Analysis                    Source      Odds Ratio (95% CI)
LDL cholesterol
  Observational             ERFC2       1.37 (1.09-1.73)
  Mendelian randomization   Do et al3   1.46 (1.37-1.56)
  Test for heterogeneity: P = .60
HDL cholesterol
  Observational             ERFC2       0.78 (0.76-0.81)
  Mendelian randomization   Do et al3   0.96 (0.89-1.03)
  Test for heterogeneity: P < .01
Triglycerides
  Observational             ERFC2       0.99 (0.96-1.03)
  Mendelian randomization   Do et al3   1.43 (1.28-1.60)
  Test for heterogeneity: P < .01

Observational estimates are derived from the Emerging Risk Factors Collaboration (ERFC).2 Mendelian randomization estimates are derived from Do et al3 based on an analysis of 185 genetic variants that alter plasma lipids and mutually adjusted for other lipid fractions (eg, HDL cholesterol and triglycerides for LDL cholesterol). A formal test of heterogeneity (Cochran Q test) shows that the observational and mendelian randomization causal estimates are consistent for LDL cholesterol but not for HDL cholesterol or triglycerides.
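The heterogeneity tests reported in the Figure can be reproduced approximately from the published odds ratios and 95% CIs alone. The sketch below assumes the CIs are symmetric on the log scale; with only 2 estimates, the Cochran Q test reduces to a z-test on the difference of the log odds ratios:

```python
# Reproducing the Figure's heterogeneity tests from the reported odds
# ratios and 95% CIs. With 2 estimates, the Cochran Q test is equivalent
# to a z-test on the difference of the log odds ratios.
from math import erfc, log, sqrt

def log_or_and_se(odds_ratio, lo, hi):
    """Log odds ratio and its standard error recovered from a 95% CI."""
    return log(odds_ratio), (log(hi) - log(lo)) / (2 * 1.96)

def heterogeneity_p(obs, mr):
    """Two-sided P value for the difference between 2 independent estimates."""
    b1, se1 = log_or_and_se(*obs)
    b2, se2 = log_or_and_se(*mr)
    z = (b1 - b2) / sqrt(se1 ** 2 + se2 ** 2)
    return erfc(abs(z) / sqrt(2))  # identical to 2 * (1 - Phi(|z|))

# (OR, lower, upper) triples transcribed from the Figure
p_ldl = heterogeneity_p((1.37, 1.09, 1.73), (1.46, 1.37, 1.56))
p_hdl = heterogeneity_p((0.78, 0.76, 0.81), (0.96, 0.89, 1.03))
```

Here p_ldl comes out near .60 and p_hdl far below .01, matching the P values shown in the Figure.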
analysis of 185 polymorphisms that altered plasma lipids, a 1-SD increase in HDL cholesterol level (approximately 14 mg/dL) due to genetic variants was not associated with risk of coronary heart disease (odds ratio, 0.96; 95% CI, 0.89-1.03; Figure). In contrast, a 1-SD increase in triglyceride level (approximately 89 mg/dL) was associated with an elevated risk of coronary heart disease (odds ratio, 1.43; 95% CI, 1.28-1.60). LDL cholesterol and triglyceride-rich lipoprotein levels, but not HDL cholesterol level, may be the causal drivers of coronary heart disease risk as demonstrated by these mendelian randomization studies.

Caveats to Consider When Evaluating Mendelian Randomization Studies
The primary concern when evaluating mendelian randomization studies is whether genetic variants used in the study are likely to be pleiotropic. Variants in a single gene that affects an individual risk factor are most likely to affect the outcome only through the risk factor and not have pleiotropic effects. For example, variants in CRP, the gene encoding C-reactive protein, have been used in a mendelian randomization study to exclude a direct causal effect of C-reactive protein on coronary heart disease.6 However, variants in single genes that encode a risk factor of interest are often not available. In these cases, pleiotropy can be examined by testing whether the gene variants used are associated with known confounders such as diet, smoking, and lifestyle factors.7 More advanced statistical techniques, including median regression8 and use of population-specific instruments,7 have recently been proposed to protect against pleiotropic variants biasing results.

A second concern relates to whether the mendelian randomization study has adequate statistical power to detect an association. Consequently, an estimate from a mendelian randomization study that is nonsignificant should be accompanied by a power analysis based on the strength of the genetic instrument and the size of the study. Furthermore, mendelian randomization estimates should be compared with results from traditional observational analyses using a formal test for heterogeneity.
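The core logic of mendelian randomization, that an unmeasured confounder distorts the observational estimate while the genetic association does not, can be illustrated with a toy simulation. Every number below (effect sizes, variance of the confounder, the Wald ratio estimator on a continuous outcome) is illustrative, not a reanalysis of any cited study:

```python
# Toy mendelian randomization simulation (all numbers illustrative).
# A variant g shifts the risk factor x; an unmeasured confounder distorts
# the observational x-y association but not the g associations, so the
# Wald ratio slope(g, y) / slope(g, x) recovers the causal effect.
import random

random.seed(0)
CAUSAL_EFFECT = 0.5            # true effect of risk factor on outcome

n = 100_000
g = [random.randint(0, 2) for _ in range(n)]       # allele count, 0-2
conf = [random.gauss(0, 1) for _ in range(n)]      # unmeasured confounder
x = [0.3 * gi + ci + random.gauss(0, 1) for gi, ci in zip(g, conf)]
y = [CAUSAL_EFFECT * xi + 2.0 * ci + random.gauss(0, 1)
     for xi, ci in zip(x, conf)]

def slope(a, b):
    """Ordinary least-squares slope of b regressed on a."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
    var = sum((ai - ma) ** 2 for ai in a)
    return cov / var

naive = slope(x, y)               # confounded: well above 0.5
wald = slope(g, y) / slope(g, x)  # mendelian randomization: near 0.5
```

The confounded observational slope lands near 1.5 in this setup, while the Wald ratio stays near the true causal effect of 0.5, provided the variant satisfies the 3 assumptions described above.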
2250 JAMA December 12, 2017 Volume 318, Number 22 (Reprinted) jama.com
the total number of true-positive locations with ratings higher than c divided by the total number of lesions on all the slides. c was varied across the values of the rating scale, and the resultant FROC curve plotted the pairs of FPR(c) and TPF(c) for all values of c. The points were connected by straight lines.

What Are the Limitations of FROC Curves?
In ROC curve analysis, the diagonal 45° line characterizes a test that cannot distinguish between diseased and nondiseased states. The line serves as a benchmark for judging how well a test works. There is no analogous, simply defined line for FROC curves. Thus, it can be more difficult to infer from FROC curves how well a test is performing. Moreover, because the FPR is not a fraction and can take values greater than 1, the FROC curve may extend endlessly along the horizontal axis.

An integral component in calculating an FROC curve is the proximity criterion. Choosing a different distance for this criterion may result in a different FROC curve.4,5 Furthermore, the decision regarding how to handle multiple findings close to a single lesion affects the FROC curve.6 In addition, analysis of FROC data can be complicated because there are many sources of variability, and appropriately accounting for these may be difficult.4,6

How Should the FROC Curves Be Interpreted in This Study?
The FROC curves in Figure 1 and eFigure 4 in the study by Ehteshami Bejnordi et al facilitate visualization of how well the automated computer algorithms performed across the spectrum of the rating scale when compared with the criterion standard. One algorithm performed better than the others at all thresholds. At each average FPR per slide, the TPF was greater for the HMS-MIT II team (Harvard Medical School and Massachusetts Institute of Technology) than for other algorithms. In addition to looking at the entire FROC curve, the operating characteristics can be explored at particular points of interest. Examining certain TPFs at specific FPRs allows readers to judge if the algorithms performed well enough to be useful in clinical practice. For instance, if an average of at most 1 false positive per slide was determined to be acceptable, team HMS-MIT II had a TPF of approximately 0.81 (see eTable 4 in the article's supplement). At the operating point of having 1 false positive per slide, the TPF of 0.81 means that team HMS-MIT II identified 81% of all metastases and failed to identify 19% of them.

Caveats to Consider When Looking at FROC Curves
FROC curves depend on the sample chosen and do not necessarily generalize to other populations or sets of cases with different distributions of disease locations.5 While several numerical indices have been proposed for summarizing FROC curve results as a simple number,5,7 none of them have been universally accepted as has the area under the curve used to summarize an ROC curve. Ehteshami Bejnordi et al identified specific FPRs of interest and took the mean value of TPFs at those FPRs as a summary measure.
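The construction of FROC operating points described above can be sketched in a few lines. The findings, slide count, and lesion count below are hypothetical, not data from the cited study:

```python
# Sketch of FROC operating points from rated findings (hypothetical data).
# Each finding is (rating, matched a true lesion?). FPR(c) is the average
# number of false positives per slide with rating above c; TPF(c) is the
# fraction of all lesions detected with rating above c.
findings = [
    (5, True), (4, True), (4, False), (3, True),
    (2, False), (2, False), (1, True), (1, False),
]
N_SLIDES = 3     # total slides read (hypothetical)
N_LESIONS = 5    # total lesions on all slides (hypothetical)

def froc_point(c):
    """(false positives per slide, true-positive fraction) at threshold c."""
    fp = sum(1 for rating, hit in findings if rating > c and not hit)
    tp = sum(1 for rating, hit in findings if rating > c and hit)
    return fp / N_SLIDES, tp / N_LESIONS

# Sweep c from strictest to most lenient and connect the points by lines
curve = [froc_point(c) for c in range(5, -1, -1)]
```

Note that the most lenient threshold, froc_point(0), yields an average FPR above 1 in this toy data, illustrating why the FROC horizontal axis, unlike an ROC curve's, is not bounded by 1.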
ARTICLE INFORMATION

Author Affiliation: Department of Epidemiology and Biostatistics, Memorial Sloan Kettering Cancer Center, New York, New York.

Corresponding Author: Chaya S. Moskowitz, PhD, Department of Epidemiology and Biostatistics, Memorial Sloan Kettering Cancer Center, 485 Lexington Ave, Second Floor, New York, NY 10017 ([email protected]).

Section Editors: Roger J. Lewis, MD, PhD, Department of Emergency Medicine, Harbor-UCLA Medical Center and David Geffen School of Medicine at UCLA; and Edward H. Livingston, MD, Deputy Editor, JAMA.

Conflict of Interest Disclosures: The author has completed and submitted the ICMJE Form for Disclosure of Potential Conflicts of Interest and none were reported.

Funding/Support: This work was supported by core grant P30 CA008748 to Memorial Sloan Kettering Cancer Center from the National Cancer Institute.

Role of the Funder/Sponsor: The National Cancer Institute had no role in the preparation, review, or approval of the manuscript; and decision to submit the manuscript for publication.

REFERENCES
1. Ehteshami Bejnordi B, Veta M, van Diest PJ, et al; CAMELYON16 Consortium. Diagnostic assessment of deep learning algorithms for detection of lymph node metastases in women with breast cancer. JAMA. doi:10.1001/jama.2017.14585
2. Alba AC, Agoritsas T, Walsh M, et al. Discrimination and calibration of clinical prediction models: Users' Guides to the Medical Literature. JAMA. 2017;318(14):1377-1384.
3. Hanley JA. Receiver operating characteristic (ROC) methodology: the state of the art. Crit Rev Diagn Imaging. 1989;29(3):307-335.
4. Chakraborty DP. A brief history of free-response receiver operating characteristic paradigm data analysis. Acad Radiol. 2013;20(7):915-919.
5. Zou KH, Liu A, Bandos AI, Ohno-Machado L, Rockette HE. Statistical Evaluation of Diagnostic Performance: Topics in ROC Analysis. Boca Raton, FL: Chapman & Hall; 2012.
6. Gur D, Rockette HE. Performance assessments of diagnostic systems under the FROC paradigm: experimental, analytical, and results interpretation issues. Acad Radiol. 2008;15(10):1312-1315.
7. Bandos AI, Rockette HE, Song T, Gur D. Area under the free-response ROC curve (FROC) and a related summary index. Biometrics. 2009;65(1):247-256.
Related article page 567

Cluster randomized trials are studies in which groups of individuals, for example those associated with specific clinics, families, or geographical areas, are randomized between an experimental intervention and a control.1 A stepped-wedge design is a type of cluster design in which the clusters are randomized to the order in which they receive the experimental regimen. All clusters begin the study with the control intervention, and by the end of the trial (assuming no unexpected and unacceptable safety issues arise), all clusters are receiving the experimental regimen.

Use of the Method
Why Is a Stepped-Wedge Clinical Trial Design Used?
Cluster randomized trials have been performed for many decades, even centuries,2 but the statistical underpinnings of such designs have been worked out only relatively recently.3,4 The primary motivation for a cluster design is to study treatments that can be delivered only in a group setting (eg, an educational approach in a classroom setting) or to avoid contamination in the delivery of each regimen (eg, a behavioral intervention that could be delivered individually but in settings in which those randomized to different approaches are in close contact with each other and might learn about and then adopt the alternative regimen).1 Clusters are typically identified prospectively and randomized to receive the experimental or control intervention. However, there are exceptions, such as the ring vaccination trial conducted during the 2014-2015 Ebola epidemic, in which clusters were defined around newly identified cases.5

If a cluster randomized trial is deemed necessary or desirable in a specific setting, but resource limitations permit only a gradual implementation of the experimental regimen, a stepped-wedge design may be considered as the fairest way to determine which clusters receive the experimental regimen earlier and which later. Stepped-wedge designs have benefits similar to those of crossover trials because outcomes within a cluster may be compared between the time intervals in which a cluster received the control and the experimental interventions. This controls for the unique characteristics of the cluster when making the treatment comparison. One attractive aspect of stepped-wedge designs is that all participants in all clusters ultimately receive the experimental regimen, thereby ensuring that all participants have an opportunity to potentially benefit from the intervention. This can be advantageous when strong beliefs exist regarding the efficacy of a treatment regimen. When limited resources preclude making the treatment regimen widely available from the start, the use of randomization to determine which clusters get early access to the treatment regimen may appeal to participants' sense of fairness.

Description of the Stepped-Wedge Clinical Trial Design
Important considerations in designing a stepped-wedge trial include the number of clusters, the number of "steps" (time points at which the changeovers from control to intervention occur), the duration of treatment at each step, and the balance of prognostic characteristics across the clusters receiving the intervention at each step. The required sample size (total number of participants) to achieve a given level of power decreases as the numbers of clusters and steps increase. Maximum power for a given number of clusters is achieved when each cluster has its own step, but more typically multiple clusters are randomized to change at the same time to limit trial duration.6 The risk of bias decreases as the number of clusters increases, as more clusters improve the likelihood of achieving similar prognoses across clusters, and as the trial duration decreases, reducing the effect of temporal trends.

Limitations of the Stepped-Wedge Design
As with cluster randomized designs generally, stepped-wedge designs require larger sample sizes, often much larger, than would be required for randomized trials in which individual study participants are randomized to receive the experimental or control intervention. Efficiency is reduced because of the need to account for the similarities among participants within a given cluster; ie, the extent to which individuals within a cluster are more alike than they are similar to the study population as a whole. Consequently, each individual in a cluster provides less information about the study findings than would occur if the randomization had been by individual. For example, suppose the outcome of a trial was 1-year survival, and in 1 cluster the prognosis of participants was so good that every participant in the cluster was certain to survive at least 1 year. Then the information from that cluster is the same whether there are 100 participants or only 1 participant. When participants are randomized individually, the factors that influence outcomes are balanced within each participating site, and in an analysis appropriately stratified by site, the comparisons will not be affected by site differences in prognosis. Even though randomization of clusters is intended to balance prognosis, such balance cannot be ensured with a small number of clusters (eg, 10-20), which is common in many cluster randomized trials. The randomization can be stratified according to characteristics that are considered to relate to prognosis (eg, mean socioeconomic status of cluster participants), but this is often difficult to do precisely. Unless the number of clusters is quite large, stratification by more than 1 or 2 variables is not feasible.

Another limitation of the stepped-wedge design is the potential for confounding by temporal trends. When changes in clinical care are occurring over a short time, comparisons of outcomes between earlier and later periods may be influenced by background changes that affect the outcome of interest irrespective of the
jama.com (Reprinted) JAMA February 13, 2018 Volume 319, Number 6 607
intervention being tested. Another time-dependent phenomenon that can influence stepped-wedge trials is the effect of accumulating experience with the intervention. If more experience enhances the likelihood that the intervention will be successful, participants in clusters randomized earlier in the trial will more likely benefit. Time dependency concerns must be balanced against the advantage that the before-after comparison within clusters balances the unknown as well as the known characteristics of cluster participants. To address the time dependency, the time factor must be accounted for in the analysis.

How Was the Stepped-Wedge Design Used?
In this issue of JAMA, Huffman and colleagues7 report results of the QUIK trial, an investigation of a quality improvement intervention intended to reduce complications following myocardial infarction. A stepped-wedge design was used rather than a standard cluster randomized design because this approach allowed all the participating hospitals to receive the experimental intervention during the course of the study and also had the advantage of controlling for potential differences in study participant characteristics by comparing outcomes within a cluster during different periods.8 The authors did not pursue an individually randomized design, which also would have controlled for both cluster characteristics and temporal trends. Individual randomization for quality improvement interventions would probably not be feasible within individual participating hospitals because the intervention would be difficult to isolate to individual patients. Sixty-three hospitals were included in the study and were randomized in groups of 12 or 13 that would initiate the intervention at 1 of 4 randomization points. The duration of each of the 4 steps was 4 months. After adjusting for within-hospital clustering and temporal trends, the prognostic characteristics of the trial participants in the 2 treatment groups were similar.

How Should a Stepped-Wedge Clinical Trial Be Interpreted?
Huffman et al did not find a significant benefit of the quality improvement intervention. Although unadjusted analyses did suggest benefit, appropriate statistical analysis adjusting for time trends markedly attenuated the benefit. In this case, it is possible that the quality of care was improving while the study was progressing independent of the study intervention, highlighting the importance of accounting for time trends (clearly shown in Figures 2A and 2B in the article7) when analyzing the results of stepped-wedge trials.

Concerns have been raised about the difficulties in obtaining informed consent from patients in stepped-wedge trials.9 Obtaining individual informed consent is often difficult in cluster randomized trials because individuals receiving treatment in a particular cluster may not be able to avoid exposure to the intervention assigned to that cluster. In the QUIK trial, consent was not obtained from patients who received the assigned intervention but it was obtained for 30-day follow-up. The investigators noted that this requirement may have introduced selection bias because of refusals by some participants.

Stepped-wedge clinical trials offer a way to evaluate an intervention in a system in which the ultimate goal is to implement the intervention at all sites yet retain the ability to objectively evaluate the intervention's efficacy.
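A stepped-wedge allocation of the kind described can be sketched as a cluster-by-period indicator matrix. The cluster and step counts below are illustrative, not the QUIK trial's actual randomization:

```python
# Illustrative stepped-wedge design matrix: rows are clusters, columns are
# time periods, 0 = control and 1 = intervention. Every cluster starts on
# control, crosses over at its randomized step, and stays on intervention.
import random

random.seed(1)

def stepped_wedge(n_clusters, n_steps):
    # Balanced assignment of clusters to crossover steps, in random order
    starts = [1 + i % n_steps for i in range(n_clusters)]
    random.shuffle(starts)
    # n_steps + 1 periods: all-control in period 0, all-intervention at the end
    return [[1 if period >= s else 0 for period in range(n_steps + 1)]
            for s in starts]

design = stepped_wedge(n_clusters=8, n_steps=4)
```

The first column is all zeros and the last all ones, mirroring the description above: every cluster begins on the control regimen and, by the end of the trial, every cluster is receiving the experimental regimen.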
The most compelling way to establish that an intervention definitively causes a clinical outcome is to randomly allocate patients into treatment groups. Randomization helps to ensure that a certain proportion of patients receive each treatment and that the treatment groups being compared are similar in both measured and unmeasured patient characteristics.1,2 Simple or unrestricted, equal randomization of patients between 2 treatment groups is equivalent to tossing a fair coin for each patient assignment.2,3 As the sample size increases, the 2 groups will become more perfectly balanced. However, this balance is not guaranteed when there are relatively few patients enrolled in a trial. In the coin toss scenario, obtaining several consecutive heads, for example, is more likely than typically perceived.1,4 If a long series of assignments to 1 group occurred when randomizing patients in a clinical trial, imbalances between the groups would occur.

Imbalances between groups can be minimized in small sample–size studies by restricting the randomization procedure. Restricted randomization means that simple randomization is applied within defined groups of patients. Two recent articles in JAMA used restrictions on the randomization procedure: Bilecen et al5 used permuted block randomization, a restricted randomization method used to help ensure the balance of the number of patients assigned to each treatment group.3 Kim et al6 used a stratified randomization scheme together with permuted block randomization. Stratified randomization is a restricted randomization method used to balance one or a few prespecified prognostic characteristics between treatment groups.1

Explanation of the Concept
What Are Permuted Blocks and Stratified Randomization?
The permuted block technique randomizes patients between groups within a set of study participants, called a block. Treatment assignments within blocks are determined so that they are random in order but that the desired allocation proportions are achieved exactly within each block. In a 2-group trial with equal allocation and a block size of 6, 3 patients in each block would be assigned to the control and 3 to the treatment, and the ordering of those 6 assignments would be random. For example, with treatment labels A and B, possible blocks might be: ABBABA, BABBAA, and AABABB. As each block is filled, the trial is guaranteed to have the desired allocation to each group.

Stratified randomization requires identification of key prognostic characteristics that are measurable at the time of randomization and are considered to be strongly associated with the primary outcome. The categories of the prognostic characteristics define the strata, and the total number of strata for randomization is the product of the number of categories across the selected prognostic characteristics.1,7 Randomization is then performed separately within each stratum.7 For example, if randomization were stratified by sex (men vs women) and age (<40, 40-59, ≥60 years), there would be a total of 6 strata. Randomization within each stratum could be a simple randomization or could be a permuted block randomization.

Why Are Permuted Blocks and Stratified Randomization Important?
The most efficient allocation of patients for maximizing statistical power is often equal allocation into groups. Power to detect a treatment effect is increased as the standard error of the treatment-effect estimate is decreased. In a 2-group setting, allocating more patients to 1 group would reduce the standard error for that 1 group, but doing so would decrease the sample size and increase the standard error in the other group. The standard error of the treatment effect or the difference between the groups is therefore minimized with equal allocation.8 Permuted block randomization avoids such imbalances.2 This is an important consideration for trials with planned interim analyses because interim analyses may be conducted using small sample sizes, resulting in a greater chance of having large imbalances in the allocation of patients between groups.1,4,7

Stratified randomization ensures balance between treatment groups for the selected, measurable prognostic characteristics used to define the strata. Because stratified randomization essentially produces a randomized trial within each stratum, stratification can be used when different patient populations are being enrolled or if it is important to analyze results within the subgroups defined by the stratifying characteristics.3,7 For example, when there are concerns that an intervention is influenced by patient sex, stratification might occur by sex. Because patients are randomly allocated both in the male and female groups, the effect of the intervention can be tested for the entire population and—assuming sufficient sample size—separately in men and women.

Limitations of Permuted Block Randomization and Stratified Randomization
The main limitation of permuted block randomization is the potential for bias if treatment assignments become known or predictable.1,9 For example, with a block size of 4, if an investigator knew the first 3 assignments in the block, the investigator also would know with certainty the assignment for the next patient enrolled. The use of reasonably large block sizes, random block sizes, and strong blinding procedures such as double-blind treatment assignments and identical-appearing placebos are strategies used to prevent this.

In stratified randomization, the number of strata should be fairly limited, such as 3 or 4, but even fewer strata should be used in trials enrolling relatively few research participants.7,10 There is no particular statistical disadvantage to stratification, but strata do result in more complex randomization procedures.3 In some settings, stratified randomization may not be possible because it is simply not
feasible to determine a patient's prognostic characteristics before getting a treatment assignment, such as in an emergency setting. An alternative to stratification is to prespecify a statistical adjustment in the primary analysis for the key characteristics that are thought to influence outcomes and may not be completely balanced between groups by the randomization procedure. Another alternative to stratification is minimization.7 Minimization considers the current balance of the key prognostic characteristics between treatment groups and, if an imbalance exists, assigns future patients as necessary to rebalance the groups.7 For example, if the experimental group had a smaller proportion of women than did the control group and the next patient to be randomized is a woman, a minimization procedure might assign that patient to the experimental group. Minimization can be more complex than stratification, but is effective and can accommodate more factors than stratification.7

How Were These Approaches to Randomization Used?
Bilecen et al5 reported a single-center randomized clinical trial comparing a fibrinogen concentrate with placebo in reducing intraoperative bleeding during high-risk cardiac surgery, with a total sample size of 120 patients. In this study, patients were randomized according to a permuted block randomization scheme with a block size of 4. With this randomization scheme, the entire randomization list can be generated before a single patient is enrolled. Random treatment assignments are generated in groups of 4 by randomly selecting 2 of the assignments to be to the control group and then allowing the remaining 2 assignments to be to the treatment group. As each patient is randomized into the trial, the patient receives the next sequential assignment on the randomization list. The study by Bilecen et al had an equal number of patients randomized into the 2 treatment groups. The block sizes were small, so randomization was performed centrally and blinding procedures were in place to minimize the ability of the investigators to predict the randomization sequence.

Kim et al6 performed a multicenter clinical trial assessing the hemoglobin response at 12 weeks among patients undergoing radical gastrectomy after administration of ferric carboxymaltose or placebo. A total of 454 patients were randomized using both stratification and permuted blocks with random block sizes. Randomization was stratified at each site based on the clinical stage of gastric cancer. For this randomization scheme, a randomization list can be generated prior to the start of the trial as well, but 1 randomization list must be generated for each combination of site and clinical stage. A sequence of block sizes is randomly generated; allowable block sizes were 2, 4, or 6 in this study. Within each block, half of the assignments are randomly selected to be to the control group and the remaining assignments are allowed to be to the treatment group. As each patient is randomized into the trial, the patient receives the next sequential assignment on the randomization list specific to his or her site and clinical cancer stage. The use of a random block size ensures that the next randomization assignment cannot be guessed. Because this was a multicenter trial with 7 sites, randomization within each site ensures that a site discontinuing participation in the trial or enrolling poorly would not affect the overall balance of the treatment groups.2,7 Stratifying by clinical cancer stage ensures that the control and intervention groups are balanced on this 1 important prognostic characteristic. The treatment groups were nearly equal in size and were balanced for cancer stage. While Kim et al did not report the primary efficacy results by cancer stage subgroups, it would have been appropriate to do so.

How Does the Approach to Randomization Affect the Trial's Interpretation?
In a clinical trial, the ultimate goal of the randomization procedure is to create similar treatment groups that allow an unbiased comparison. Restricted randomization procedures such as stratified randomization and permuted block randomization create balance between important prognostic characteristics and are useful when conducting randomized trials enrolling relatively few patients.3 In the cases of the trials by Bilecen et al and by Kim et al, the restricted randomization procedures minimized the risk of biased study results by ensuring balanced treatment groups.
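The scheme described for the trial by Kim et al6 (a separate randomization list per stratum, built from permuted blocks of randomly chosen size) can be sketched in a few lines of Python. This is an illustrative sketch, not the trial software; the function names, stratum labels, and list lengths are invented for the example.

```python
import random

def permuted_block(size):
    """One block containing exactly size/2 'control' and size/2 'treatment'
    assignments, in random order (exact balance once the block fills)."""
    block = ["control"] * (size // 2) + ["treatment"] * (size // 2)
    random.shuffle(block)
    return block

def stratum_list(n_patients, block_sizes=(2, 4, 6)):
    """Randomization list for one stratum, built from permuted blocks of
    randomly chosen size so the next assignment cannot be guessed."""
    assignments = []
    while len(assignments) < n_patients:
        assignments.extend(permuted_block(random.choice(block_sizes)))
    return assignments[:n_patients]

random.seed(2018)
# One list per stratum (here, site x clinical stage), generated before
# the first patient is enrolled; each patient receives the next
# sequential assignment on the list for his or her stratum.
strata = [(site, stage) for site in range(1, 8) for stage in ("I", "II", "III")]
lists = {stratum: stratum_list(30) for stratum in strata}
```

Because every completed block is exactly balanced, the 2 groups within a stratum can differ in size by at most half of the largest block size.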
ARTICLE INFORMATION
Author Affiliation: Berry Consultants LLC, Austin, Texas.
Corresponding Author: Kristine Broglio, MS, Berry Consultants LLC, 3345 Bee Caves Rd, Ste 201, Austin, TX 78746 ([email protected]).
Section Editors: Roger J. Lewis, MD, PhD, Department of Emergency Medicine, Harbor-UCLA Medical Center and David Geffen School of Medicine at UCLA; and Edward H. Livingston, MD, Deputy Editor, JAMA.
Conflict of Interest Disclosures: The author has completed and submitted the ICMJE Form for Disclosure of Potential Conflicts of Interest and none were reported.

REFERENCES
1. Pocock SJ. Allocation of patients to treatment in clinical trials. Biometrics. 1979;35(1):183-197.
2. Lachin JM. Statistical properties of randomization in clinical trials. Control Clin Trials. 1988;9(4):289-311.
3. Lachin JM, Matts JP, Wei LJ. Randomization in clinical trials: conclusions and recommendations. Control Clin Trials. 1988;9(4):365-374.
4. Zelen M. The randomization and stratification of patients to clinical trials. J Chronic Dis. 1974;27(7-8):365-375.
5. Bilecen S, de Groot JAH, Kalkman CJ, et al. Effect of fibrinogen concentrate on intraoperative blood loss among patients with intraoperative bleeding during high-risk cardiac surgery: a randomized clinical trial. JAMA. 2017;317(7):738-747.
6. Kim Y-W, Bae J-M, Park Y-K, et al; FAIRY Study Group. Effect of intravenous ferric carboxymaltose on hemoglobin response among patients with acute isovolemic anemia following gastrectomy: the FAIRY Randomized Clinical Trial. JAMA. 2017;317(20):2097-2104.
7. Kernan WN, Viscoli CM, Makuch RW, Brass LM, Horwitz RI. Stratified randomization for clinical trials. J Clin Epidemiol. 1999;52(1):19-26.
8. Hey SP, Kimmelman J. The questionable use of unequal allocation in confirmatory trials. Neurology. 2014;82(1):77-79.
9. Matts JP, Lachin JM. Properties of permuted-block randomization in clinical trials. Control Clin Trials. 1988;9(4):327-344.
10. Therneau TM. How many stratification factors are "too many" to use in a randomization plan? Control Clin Trials. 1993;14(2):98-108.
Odds ratios frequently are used to present the strength of association between risk factors and outcomes in the clinical literature. Odds and odds ratios are related to the probability of a binary outcome (an outcome that is either present or absent, such as mortality). The odds are the ratio of the probability that an outcome occurs to the probability that the outcome does not occur. For example, suppose that the probability of mortality is 0.3 in a group of patients. This can be expressed as the odds of dying: 0.3/(1 − 0.3) = 0.43. When the probability is small, the odds are virtually identical to the probability. For example, for a probability of 0.05, the odds are 0.05/(1 − 0.05) = 0.053. This similarity does not exist when the value of the probability is large.

Probability and odds are different ways of expressing similar concepts. For example, when randomly selecting a card from a deck, the probability of selecting a spade is 13/52 = 25%. The odds of selecting a card with a spade are 25%/75% = 1:3. Clinicians usually are interested in knowing probabilities, whereas gamblers think in terms of odds. Odds are useful when wagering because they represent fair payouts. If one were to bet $1 on selecting a spade from a deck of cards, a payout of $3 is necessary to have an even chance of winning your money back. From the gambler's perspective, a payout smaller than $3 is unfavorable and greater than $3 is favorable.

Differences between 2 groups having a binary outcome such as mortality can be compared using odds ratios, the ratio of 2 odds. Differences also can be compared using probabilities by calculating the relative risk ratio, which is the ratio of 2 probabilities. Odds ratios commonly are used to express the strength of associations from logistic regression to predict a binary outcome.1

Why Report Odds Ratios From Logistic Regression?
Researchers often analyze a binary outcome using multivariable logistic regression. One potential limitation of logistic regression is that the results are not directly interpretable as either probabilities or relative risk ratios. However, the results from a logistic regression are converted easily into odds ratios, because logistic regression estimates a parameter, known as the log odds, which is the natural logarithm of the odds ratio. For example, if the log odds estimated by logistic regression is 0.4, then the odds ratio can be derived by exponentiating the log odds (exp(0.4) = 1.5). It is the odds ratio that is usually reported in the medical literature. The odds ratio is always positive, although the estimated log odds can be positive or negative (a log odds of −0.2 equals an odds ratio of exp(−0.2) = 0.82).

The odds ratio for a risk factor contributing to a clinical outcome can be interpreted as whether someone with the risk factor is more or less likely than someone without that risk factor to experience the outcome of interest. Logistic regression modeling allows the estimates for a risk factor of interest to be adjusted for other risk factors, such as age, smoking status, and diabetes. One nice feature of the logistic function is that the odds ratio for one covariate is constant for all values of the other covariates.

Another nice feature of odds ratios from a logistic regression is that it is easy to test the statistical strength of the association. The standard test is whether the parameter (log odds) equals 0, which corresponds to a test of whether the odds ratio equals 1. Odds ratios typically are reported in a table with 95% CIs. If the 95% CI for an odds ratio does not include 1.0, then the odds ratio is considered to be statistically significant at the 5% level.

What Are the Limitations of Odds Ratios?
Several caveats must be considered when reporting results with odds ratios. First, the interpretation of odds ratios is framed in terms of odds, not in terms of probabilities. Odds ratios often are mistaken for relative risk ratios.2,3 Although for rare outcomes odds ratios approximate relative risk ratios, when the outcomes are not rare, odds ratios always overestimate relative risk ratios, a problem that becomes more acute as the baseline prevalence of the outcome exceeds 10%. Relative risk ratios cannot be calculated directly from odds ratios. For example, an odds ratio for men of 2.0 could correspond to the situation in which the probability of some event is 1% for men and 0.5% for women. An odds ratio of 2.0 also could correspond to a probability of an event occurring 50% for men and 33% for women, or to a probability of 80% for men and 67% for women.

Second, and less well known, the magnitude of the odds ratio from a logistic regression is scaled by an arbitrary factor (equal to the square root of the variance of the unexplained part of the binary outcome).4 This arbitrary scaling factor changes when more or better explanatory variables are added to the logistic regression model because the added variables explain more of the total variation and reduce the unexplained variance. Therefore, adding more independent explanatory variables to the model will increase the odds ratio of the variable of interest (eg, treatment) due to dividing by a smaller scaling factor. In addition, the odds ratio also will change if the additional variables are not independent, but instead are correlated with the variable of interest; it is even possible for the odds ratio to decrease if the correlation is strong enough to outweigh the change due to the scaling factor.

Consequently, there is no unique odds ratio to be estimated, even from a single study. Different odds ratios from the same study cannot be compared when the statistical models that result in the odds ratio estimates have different explanatory variables because each model has a different arbitrary scaling factor.4-6 Nor can the magnitude of the odds ratio from one study be compared with the magnitude of the odds ratio from another study, because different samples and different model specifications will have different arbitrary scaling factors. A further implication is that the magnitudes of odds ratios of a given association in multiple studies cannot be synthesized in a meta-analysis.4

How Did the Authors Use Odds Ratios?
In a recent JAMA article, Tringale and colleagues7 studied industry payments to physicians for consulting, ownership, royalties, and research, as well as whether payments differed by physician specialty or sex. Industry payments were received by 50.8% of men across
all specialties compared with 42.6% of women across all specialties. Converting these probabilities to odds, the odds that men receive industry payments are 1.03 (0.51/0.49), and the odds that women receive industry payments are 0.74 (0.43/0.57).

The odds ratio for men compared with women is the ratio of the odds for men divided by the odds for women. In this case, the unadjusted odds ratio is 1.03/0.74 = 1.39. Therefore, the odds of men receiving industry payments are about 1.4 times as large (40% higher) as the odds for women. Note that the ratio of the odds is different from the ratio of the probabilities because the probability is not close to 0. The unadjusted ratio of the probabilities for men and women (Tringale et al7 report each probability, but not the ratio), the relative risk ratio, is 1.19 (0.51/0.43).

Greater odds that men may receive industry payments may be explained by their disproportionate representation in specialties more likely to receive industry payments. After controlling for specialty (and other factors), the estimated odds ratio was reduced from 1.39 to 1.28, with a 95% CI of 1.26 to 1.31, which did not include 1.0 and, therefore, is statistically significant. The odds ratio probably declined after adjusting for more variables because they were correlated with physicians' sex.

How Should the Findings Be Interpreted?
In exploring the association between physician sex and receiving industry payments, Tringale and colleagues7 found that men are more likely to receive payments than women, even after controlling for confounders. The magnitude of the odds ratio, about 1.4, indicates the direction of the effect, but the magnitude of the number itself is hard to interpret. The estimated odds ratio is 1.4 when simultaneously accounting for specialty, spending region, sole proprietor status, sex, and the interaction between specialty and sex. A different odds ratio would be found if the model included a different set of explanatory variables. The 1.4 estimated odds ratio should not be compared with odds ratios estimated from other data sets with the same set of explanatory variables, or with odds ratios estimated from this same data set with a different set of explanatory variables.4

What Caveats Should the Reader Consider?
Odds ratios are one way, but not the only way, to present an association when the main outcome is binary. Tringale et al7 also report absolute rate differences. The reader should understand odds ratios in the context of other information, such as the underlying probability. When the probabilities are small, odds ratios and relative risk ratios are nearly identical, but they can diverge widely for large probabilities. The magnitude of the odds ratio is hard to interpret because of the arbitrary scaling factor and cannot be compared with odds ratios from other studies. It is best to examine study results presented in several ways to better understand the true meaning of study findings.
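The odds arithmetic above is easy to reproduce. The following Python sketch (illustrative only; the odds helper function is ours, not from the article) converts the reported proportions to odds, recomputes the unadjusted odds ratio and relative risk ratio, and shows why the relative risk cannot be recovered from the odds ratio alone.

```python
import math

def odds(p):
    """Odds = probability the outcome occurs / probability it does not."""
    return p / (1 - p)

# A log odds (logistic regression coefficient) exponentiates to an odds ratio.
assert round(math.exp(0.4), 1) == 1.5
assert round(math.exp(-0.2), 2) == 0.82

# Industry payments: 50.8% of men and 42.6% of women received payments.
odds_men = odds(0.508)                 # about 1.03
odds_women = odds(0.426)               # about 0.74
unadjusted_or = odds_men / odds_women  # about 1.39
relative_risk = 0.508 / 0.426          # about 1.19

# The same odds ratio of about 2.0 is consistent with very different
# relative risks, so the RR cannot be recovered from the OR alone.
for p1, p0 in [(0.01, 0.005), (0.50, 0.33), (0.80, 0.67)]:
    print(round(odds(p1) / odds(p0), 1), round(p1 / p0, 2))
    # prints: 2.0 2.0 / 2.0 1.52 / 2.0 1.19
```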
ARTICLE INFORMATION
Author Affiliations: Department of Health Management and Policy, Department of Economics, University of Michigan, Ann Arbor (Norton); National Bureau of Economic Research, Cambridge, Massachusetts (Norton); Division of Health Policy and Management, School of Public Health, University of Minnesota, Minneapolis (Dowd); Center for Health Services Research in Primary Care, Durham Veterans Affairs Medical Center, Durham, North Carolina (Maciejewski); Department of Population Health Sciences, Duke University School of Medicine, Durham, North Carolina (Maciejewski); Division of General Internal Medicine, Department of Medicine, Duke University School of Medicine, Durham, North Carolina (Maciejewski).
Section Editors: Roger J. Lewis, MD, PhD, Department of Emergency Medicine, Harbor-UCLA Medical Center and David Geffen School of Medicine at UCLA; and Edward H. Livingston, MD, Deputy Editor, JAMA.
Conflict of Interest Disclosures: All authors have completed and submitted the ICMJE Form for Disclosure of Potential Conflicts of Interest. Dr Maciejewski reported receiving personal fees from the University of Alabama at Birmingham for a workshop presentation; receiving grants from NIDA and the Veterans Affairs; receiving a contract from NCQA to Duke University for research; being supported by research career scientist award 10-391 from the Veterans Affairs Health Services Research and Development; and that his spouse owns stock in Amgen. No other disclosures were reported.

REFERENCES
1. Meurer WJ, Tolles J. Logistic regression diagnostics: understanding how well a model predicts outcomes. JAMA. 2017;317(10):1068-1069. doi:10.1001/jama.2016.20441
2. Schwartz LM, Woloshin S, Welch HG. Misunderstandings about the effects of race and sex on physicians' referrals for cardiac catheterization. N Engl J Med. 1999;341(4):279-283. doi:10.1056/NEJM199907223410411
3. Holcomb WL Jr, Chaiworapongsa T, Luke DA, Burgdorf KD. An odd measure of risk: use and misuse of the odds ratio. Obstet Gynecol. 2001;98(4):685-688.
4. Norton EC, Dowd BE. Log odds and the interpretation of logit models. Health Serv Res. 2018;53(2):859-878. doi:10.1111/1475-6773.12712
5. Miettinen OS, Cook EF. Confounding: essence and detection. Am J Epidemiol. 1981;114(4):593-603. doi:10.1093/oxfordjournals.aje.a113225
6. Hauck WW, Neuhaus JM, Kalbfleisch JD, Anderson S. A consequence of omitted covariates when estimating odds ratios. J Clin Epidemiol. 1991;44(1):77-81. doi:10.1016/0895-4356(91)90203-L
7. Tringale KR, Marshall D, Mackey TK, Connor M, Murphy JD, Hattangadi-Gluth JA. Types and distribution of payments from industry to physicians in 2015. JAMA. 2017;317(17):1774-1784. doi:10.1001/jama.2017.3091
Case-Control Studies
Using “Real-world” Evidence to Assess Association
Telba Z. Irony, PhD
In a recent issue of JAMA Internal Medicine, Wang et al2 assessed the association between cardiovascular disease (CVD) and use of inhaled long-acting β2-agonists (LABAs) or long-acting antimuscarinic antagonists (LAMAs) in chronic obstructive pulmonary disease (COPD), utilizing a nested case-control study.

[Figure: design of a nested case-control study. Patients with hypertension and/or diabetes serve as the cases, and exposure to the risk factor or treatment (new use of a COPD inhaler) is compared between cases and controls.]
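The nested sampling itself is simple to sketch in Python. This is an illustrative sketch of risk-set sampling in a nested case-control design, not the procedure of Wang et al2; the cohort size, event times, and matching ratio are invented.

```python
import random

random.seed(1)

# Toy cohort: each member has an id and, for cases, the time the outcome
# occurred (None for members who never develop the outcome).
cohort = [{"id": i, "event_time": None} for i in range(1000)]
for member in random.sample(cohort, 50):          # 50 cases
    member["event_time"] = random.uniform(0, 365)

cases = [m for m in cohort if m["event_time"] is not None]

def sample_controls(case, cohort, k=4):
    """Nested case-control sampling: for one case, draw k controls from
    members still at risk (no event before the case's event time)."""
    at_risk = [m for m in cohort
               if m is not case
               and (m["event_time"] is None
                    or m["event_time"] > case["event_time"])]
    return random.sample(at_risk, k)

matched_sets = {case["id"]: sample_controls(case, cohort) for case in cases}
# Exposure is then ascertained only for the 50 cases and their 200
# controls, not for the full cohort of 1000.
```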
jama.com (Reprinted) JAMA September 11, 2018 Volume 320, Number 10 1027
LAMA). Thus, the authors found that new use of LABAs or LAMAs was associated with a modest increase in cardiovascular risk in patients with COPD within 30 days of therapy initiation.

Why Are Case-Control Studies Used?
Case-control studies are time-efficient and less costly than RCTs, particularly when the outcome of interest is rare or takes a long time to occur, because the cases are identified at study onset and the outcomes have already occurred, with no need for long-term follow-up. The case-control design is useful in exploratory studies to assess a possible association between an exposure and an outcome. Nested case-control studies are less expensive than full cohort studies because the exposure is only assessed for the cases and for the selected controls, not for the full cohort.

Limitations of Case-Control Studies
Case-control studies are retrospective, and data quality must be carefully evaluated to avoid bias. For instance, because individuals included in the study and evaluators need to consider exposures and outcomes that happened in the past, these studies may be subject to recall bias and observer bias. Because the controls are selected retrospectively, such studies are also subject to selection bias, which may make the case and control groups not comparable. For a valid comparison, appropriate controls must be used; that is, selected controls must be representative of the population that produced the cases. The ideal control group would be generated by a random sample from the general population that generated the cases. If controls are not representative of the population, selection bias may occur.

Case-control studies provide less compelling evidence than RCTs. Due to randomization, treatment and control groups in RCTs tend to be similar with respect to baseline variables, including unmeasured ones.5 Because the only difference between treatment and control groups is the treatment, RCTs can demonstrate causation between treatment and outcome. In case-control studies, case and control groups are similar with respect to the matching variables, but are not necessarily similar with respect to unmeasured variables. Such studies are susceptible to confounding, which occurs when the exposure and the outcome are both associated with a third unmeasured variable.6 Unlike RCTs, case-control studies demonstrate association between exposure and outcome but do not demonstrate causation.

The objective of case-control studies is to compare the occurrence of an outcome with and without an exposure. The relative risk (RR), which is the ratio between the probability of the outcome when exposed and the probability of the outcome when not exposed, provides a straightforward comparison measure; but because the case-control study design does not allow for the estimation of the occurrence of the outcome in the population (ie, incidence or prevalence), the RR cannot be determined from a case-control study. A case-control study can only estimate the odds ratio (OR), which is the ratio of odds and not the ratio of probabilities. The OR approximates the RR for rare outcomes, but differs substantially when the outcome of interest is common. In addition, case-control studies are limited to the examination of one outcome, and it is difficult to examine the temporal sequence between exposure and outcome.

Despite these limitations, case-control studies and other "real-world" evidence can provide valuable empirical evidence to complement RCTs. Additionally, case-control studies may be able to address questions for which an RCT is either not feasible or not ethical.7

How Was the Method Applied in This Case?
In the case-control study by Wang et al,2 the exposure to LABA and LAMA use for both cases and controls in the year preceding the occurrence of the CVD event was measured and stratified by duration since initiation of LABA or LAMA into 4 groups: current (≤30 days), recent (31-90 days), past (91-180 days), and remote (>180 days). Additional stratification on concomitant COPD medications and other factors was also conducted. The data source used in the study (Taiwan National Health Insurance Research Database) mitigates data quality concerns because it is national, universal, compulsory, and subject to periodic audits. Overall, the authors found that new use of LABAs or LAMAs was associated with a modest increase in cardiovascular risk in patients with COPD within 30 days of therapy initiation, and this finding was strengthened by the steps taken to ensure data quality and comparability of cases and controls.

How Does the Case-Control Design Affect the Interpretation of the Study?
Causality cannot be established in a case-control study because there is no way to control for unmeasured confounders. In the study by Wang et al,2 the use of the disease risk score for predicting CVD events was helpful to control for measured confounders but could not adjust for unmeasured confounders. The authors mitigated further possible confounding effects by conducting extensive sensitivity analyses.
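The distinction between the estimable OR and the non-estimable RR can be made concrete with a small worked example (the counts below are hypothetical, not data from Wang et al2).

```python
# Hypothetical exposure counts from a case-control study.
cases_exposed, cases_unexposed = 120, 80        # sampled because they are cases
controls_exposed, controls_unexposed = 90, 110  # sampled because they are not

# Odds of exposure among cases and among controls.
odds_cases = cases_exposed / cases_unexposed           # 1.50
odds_controls = controls_exposed / controls_unexposed  # about 0.82

# The exposure odds ratio equals the outcome odds ratio (the cross-product
# ad/bc), which is the one quantity this design can estimate.
odds_ratio = odds_cases / odds_controls
print(round(odds_ratio, 2))  # 1.83

# The RR = P(outcome | exposed) / P(outcome | unexposed) is NOT estimable:
# the investigator fixed the ratio of cases to controls, so the incidence
# of the outcome in the population cannot be recovered from these counts.
```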
ARTICLE INFORMATION
Author Affiliation: Office of Biostatistics and Epidemiology, Center for Biologics Evaluation and Research (CBER), US Food and Drug Administration, Silver Spring, Maryland.
Corresponding Author: Telba Z. Irony, PhD, Office of Biostatistics and Epidemiology, CBER/FDA, 10903 New Hampshire Ave, Bldg 71, 1216, Silver Spring, MD 20953 ([email protected]).
Published Online: August 23, 2018. doi:10.1001/jama.2018.12115
Conflict of Interest Disclosures: The author has completed and submitted the ICMJE Form for Disclosure of Potential Conflicts of Interest and none were reported.
Disclaimer: This article reflects the views of the author and should not be construed to represent FDA's views or policies.

REFERENCES
1. Breslow NE. Statistics in epidemiology: the case-control study. J Am Stat Assoc. 1996;91(433):14-28. doi:10.1080/01621459.1996.10476660
2. Wang MT, Liou JT, Lin CW, et al. Association of cardiovascular risk with inhaled long-acting bronchodilators in patients with chronic obstructive pulmonary disease: a nested case-control study. JAMA Intern Med. 2018;178(2):229-238. doi:10.1001/jamainternmed.2017.7720
3. Norton EC, Dowd BE, Maciejewski ML. Odds ratios: current best practice and use. JAMA. 2018;320(1):84-85. doi:10.1001/jama.2018.6971
4. Chen H, Cohen P, Chen S. How big is a big odds ratio? Interpreting the magnitudes of odds ratios in epidemiological studies. Commun Stat Simul Comput. 2010;39(4):860-864.
5. Broglio K. Randomization in clinical trials: permuted blocks and stratification. JAMA. 2018;319(21):2223-2224. doi:10.1001/jama.2018.6360
6. Kyriacou DN, Lewis RJ. Confounding by indication in clinical research. JAMA. 2016;316(17):1818-1819. doi:10.1001/jama.2016.16435
7. Corrigan-Curay J, Sacks L, Woodcock J. Real-world evidence and real-world data for evaluating drug safety and effectiveness [published online August 13, 2018]. JAMA. 2018. doi:10.1001/jama.2018.10136
Viewpoint pages 1101 and 1099
Video

Neural networks, a subclass of methods in the broader field of machine learning, are highly effective in enabling computer systems to analyze data, facilitating the work of clinicians. Neural networks have been used since the 1980s, with convolutional neural networks (CNNs) applied to images beginning in the 1990s.1-3 Examples include identifying natural images of everyday life,4 classifying retinal pathology,5 selecting cellular elements on pathological slides,6 and correctly identifying the spatial orientation of chest radiographs.7 Successful neural networks for such tasks are typically composed of multiple analysis layers; the term deep learning is also (synonymously) used to describe this class of neural networks.

Opening the Deep Learning Black Box
One way to understand how CNNs work is to use an analogy of written language. Ideas are communicated in written articles that are composed of a series of paragraphs that are, in turn, made of sentences, sentences of words, and words from collections of letters. Understanding text comes after assessing the relationships of the letters to one another in increasing layers of complexity (a "deep" hierarchical representation: from letters, to words, to sentences, to paragraphs). Images are analyzed by computers via motifs, instead of letters. A motif is a collection of pixels that forms a basic unit of analysis, the simplest of which represent the most basic pattern for communicating visual information, just as a letter does for language. After the computer learns the form of these motifs, they are detected in images using a filter that is matched to the motif's structure.

Consider the image in the accompanying Video corresponding to a collection of words. An image may be considered as a map, with the location-dependent pixel value reflecting the signal strength at a given point; the collection of pixels yields an image, and in this example the image forms a set of words.

The most primitive building blocks that make up the images are on the first layer of the CNN model; these building blocks correspond to the motifs. The CNN detects these motifs by applying filters to the images. Each filter is a set of pixels that is of similar form as the respective motif. In this example, the first layer filters correspond to the letters of the alphabet. Each filter is shifted sequentially to each location in the image and measures the degree to which the local properties of the image match the filter at each location, a process called convolution. The result of this convolution process is projected to another array (or new image) called a feature map. Feature maps quantify the degree of match between the filter and each local region in the original image. If there are N first layer filters, there are N 2D feature maps created by the convolutional process.

The N feature maps output from layer 1 are now aligned spatially and "stacked" atop each other; this is the input to layer 2 of the model. At layer 2, another set of filters is used to process the image. Each layer 2 filter has N motifs associated with it, matched to the N components (layers) of the input "image" at layer 2. If there are M second-layer filters, there are M feature maps output from layer 2. These M feature maps are again spatially aligned and "stacked" and correspond to the input to layer 3, and the process repeats again. A sequential analysis of this form is repeated for a desired number of layers, and the final set of feature maps at the top (last layer) is used to make a classification decision. This decision may correspond to determining whether a sentence being searched for is in the image (as in the Video), or whether a lesion is present in a photograph, as was the case for recent work identifying pathology from retinal photographs.5

The Video illustrates how a CNN works. Consider the word "Ada" from the name Ada Lovelace. Each first layer filter corresponds to one of the letters in the alphabet. For simplicity, all letters are assumed to be uppercase. Consider "ADA" located within the original image being analyzed. When the A motif overlies the letter A in the original image, the convolution output generates a strong signal that is mapped to the corresponding feature map. This map has weak signals everywhere A is absent, including in the space where the D in ADA is present. On the feature map corresponding to the letter/filter D, a strong signal is situated spatially where all D's are located, including at the location between the 2 A's in ADA. The N feature maps (where N is the number of motifs or letters assessed) output from the first layer of the model detect the presence and location of each of the N building-block motifs/letters. The second layer filters simultaneously analyze all N feature maps output from layer one, looking for combinations of letters.

For the example shown in the accompanying Video, each second layer filter is designed to detect a short sequence of letters, for example, the sequence ADA. Each second layer filter has N components, corresponding to the N feature maps output from the first layer. The N-component filter that is seeking to detect the sequence ADA will have zero amplitude on N−2 of its components (associated with the N−2 letters other than A and D). The component of the filter corresponding to the letter A will seek 2 nearby strong signals (corresponding to A "space" A), and the filter component corresponding to the letter D will have a single strong amplitude, situated between where the A's reside. After the N components of the ADA filter are convolved with the N first layer output feature maps, and then summed, a strong signal will be manifested at the location of the word ADA.

Each of the M feature maps output from layer 2 corresponds to a different short collection of letters. Moving to layer 3, larger words and groups of words are detected. Moving to layer 4, sequences of words and sentences are detected. Finally, at layers 5 and above, paragraphs are detected. At each layer the same process is used: convolutions with filters, with the number of filter components matched to the number of feature maps from the layer below. There are a few additional details of the CNN omitted here for simplicity, but this captures the essence of the model.

What is the advantage of this hierarchy? Why not directly learn filters for sequences like ADA? The hierarchy facilitates a
1192 JAMA September 18, 2018 Volume 320, Number 11 (Reprinted) jama.com
more complete ability to share data (termed sharing strength in statistics). The words ADA, ADAPT, ADAM, and MADAM all have the sequence ADA, and the presence of all these words may be shared in learning the sequence ADA. By learning a hierarchical representation (rather than directly learning separate filters for each word), the model is able to more fully utilize all data (the model is able to learn what the word ADA may look like by leveraging experience with ADAPT, ADAM, etc). Similar concepts exist for analysis of medical images: images of different states of health may be distinct at one level of granularity (scale), but at a finer scale they may share substructural characteristics (hence, at that finer scale the model can learn motifs that are shared between different states of health/disease).

What Are the Limitations of Deep Learning Methods?
The deep network has a specified number of layers, and at each layer there are a specified number of filters to learn. There are currently no means of defining the appropriate number of layers and filters. For example, the model in eFigure 1 of the article by Ting et al5 was not designed for medical images, but rather was originally specified for analysis of natural imagery.8 This suggests that the already excellent performance can be further improved by refinement/tuning of the model structure for medical images.

Because numerous parameters must be tuned in the learning process (number of layers, number of kernels at each layer, form of the classifier, etc), it is essential that the test data set be completely separate from all data used for training and for evaluating model performance. If one examines the results on the test data set, then goes back and adjusts the model structure, and then retrains the model, effectively all the data are being used in the training process (the "test" data set becomes part of the training process and is not independent). This separation is necessary so that deep learning results are not overly optimistic and will generalize to medical settings outside those used for model development.
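The convolution-and-feature-map step described above can be sketched numerically. In the following toy example (the 3×3 "letter" patterns, variable names, and function are all invented for illustration; they are not taken from the article or its Video), a filter for the letter D is slid across a tiny binary image spelling A-D-A, and the resulting feature map peaks exactly where the D sits:

```python
import numpy as np

# Hypothetical 3x3 binary "letters" standing in for first-layer motifs.
A = np.array([[0, 1, 0],
              [1, 1, 1],
              [1, 0, 1]])
D = np.array([[1, 1, 0],
              [1, 0, 1],
              [1, 1, 0]])

image = np.hstack([A, D, A])  # a 3x9 "image" spelling ADA

def feature_map(image, filt):
    """Convolution: shift the filter to each valid location and record how
    strongly the local image patch matches the filter there."""
    fh, fw = filt.shape
    ih, iw = image.shape
    out = np.zeros((ih - fh + 1, iw - fw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            out[r, c] = np.sum(image[r:r + fh, c:c + fw] * filt)
    return out

fm_d = feature_map(image, D)
# The map is maximal at column 3, where the D begins; applying the A filter
# instead produces peaks at columns 0 and 6, where the A's begin.
print(fm_d, int(np.argmax(fm_d)))
```

A real CNN learns the filter values from data rather than having them specified by hand, and it stacks many such maps across many layers, but the matching operation performed at each layer is this same convolution.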
ARTICLE INFORMATION
Author Affiliations: Duke University, Durham, North Carolina (Carin); Duke Clinical Research Institute, Department of Biostatistics and Bioinformatics, Duke University, Durham, North Carolina (Pencina).
Corresponding Author: Lawrence Carin, PhD, Duke University, Durham, NC 27705 ([email protected]).
Section Editors: Roger J. Lewis, MD, PhD, Department of Emergency Medicine, Harbor-UCLA Medical Center and David Geffen School of Medicine at UCLA; and Edward H. Livingston, MD, Deputy Editor, JAMA.
Conflict of Interest Disclosures: The authors have completed and submitted the ICMJE Form for Disclosure of Potential Conflicts of Interest. Dr Pencina reported that his institution received grant support from Bristol-Myers Squibb and Regeneron/Sanofi. Dr Carin reported no disclosures.

REFERENCES
1. LeCun Y, Boser BE, Denker JS, et al. Handwritten digit recognition with a back-propagation network. Adv Neural Inf Process Syst. 1990:396-404. https://www.cs.rit.edu/~mpv/course/ai/lecun-90c.pdf
2. Hinton G. Deep learning—a technology with the potential to transform health care [published online August 30, 2018]. JAMA. doi:10.1001/jama.2018.11100
3. Naylor CD. On the prospects for a (deep) learning health care system [published online August 30, 2018]. JAMA. doi:10.1001/jama.2018.11103
4. Krizhevsky A, Sutskever I, Hinton GE. ImageNet classification with deep convolutional neural networks. Presented at: NIPS'12: 25th International Conference on Neural Information Processing Systems; December 2012.
5. Ting DSW, Cheung CY, Lim G, et al. Development and validation of a deep learning system for diabetic retinopathy and related eye diseases using retinal images from multiethnic populations with diabetes. JAMA. 2017;318(22):2211-2223. doi:10.1001/jama.2017.18152
6. Kraus OZ, Ba JL, Frey BJ. Classifying and segmenting microscopy images with deep multiple instance learning. Bioinformatics. 2016;32(12):i52-i59. doi:10.1093/bioinformatics/btw252
7. Rajkomar A, Lingam S, Taylor AG, Blum M, Mongan J. High-throughput classification of radiographs using deep convolutional neural networks. J Digit Imaging. 2017;30(1):95-101. doi:10.1007/s10278-016-9914-9
8. Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556v6. Last updated April 2015.
jama.com (Reprinted) JAMA December 11, 2018 Volume 320, Number 22 2365

Treatment effects will differ from one study to another evaluating similar therapies, both because of random variation between individual patients and owing to true differences that exist because of other differences, including inclusion criteria and temporal trends. The sources of variability have many levels; one level involves the random differences between individual patients, and another level involves the systematic differences that exist between studies. This multilevel or hierarchical information occurs in many research settings, such as in cluster-randomized trials and meta-analyses.1,2 Sources of variation can be better understood and quantified if treatment effect estimates from each individual study are examined in relation to the totality of information available in all the studies.

Author Audio Interview
Related article page 2344

Bayesian analysis differs from the usual frequentist approach (eg, use of P values or confidence intervals). Rather than focusing on the probability of different patterns in outcomes assuming specific treatment effects, Bayesian analysis relies on the use of prior information in combination with data from a study to calculate the probabilities of a treatment effect.3 Readers may be familiar with Bayesian analysis when used in randomized clinical trials.4,5 In this type of Bayesian analysis, patients are considered largely equivalent except with respect to the assigned treatment, and the goal is to estimate the probability of an overall treatment effect in the population.

In contrast, a Bayesian hierarchical model (BHM) is a statistical procedure that integrates information across many levels, so multiple quantities are estimated simultaneously, and explicitly separates the observed variability into parts attributable to random differences and true differences.6 The model has 2 key characteristics. First, there is a hierarchical or multilevel structure. For example, if multiple studies were conducted to evaluate diabetes management strategies, the first-level data may be improvements in hemoglobin A1C values in individual patients, the second-level data may be the mean improvements for patients within each trial, and the third-level data may be the average improvements in trials grouped according to the type of disease management strategy. Second, prior information is used to reflect available information, even if vague, regarding the likely values and variability at each level of the hierarchy (eg, the variability of improvements in patients in a single trial, the variability of average treatment effects between trials using similar disease management strategies, and the variability of treatment effects among groups of trials that use different disease management strategies). Using Bayes theorem, prior information, and the data, the BHM yields estimates of the true effects at each level of the hierarchy.3,6 Estimates of true treatment effects may be derived for individual patients, patient subgroups, individual trials, or groups of trials. Each of these estimates is informed by the entire data set included in the statistical model.6

In this issue of JAMA, Stunnenberg and colleagues present the results of a trial that used a BHM to integrate data from a series of N-of-1 crossover trials7 comparing mexiletine with placebo in the treatment of patients with nondystrophic myotonia.8 An N-of-1 trial uses a patient as his or her own control by repeatedly exposing the patient to a treatment or placebo and measuring the effect of the intervention. Each N-of-1 trial exposes the patient to between 1 and 4 treatment pairs or sets, with each set randomizing the order of mexiletine and placebo, with a 1-week washout period between therapies. After each treatment set, prespecified rules were used to determine whether the patient should continue to the next treatment set or discontinue, either for evidence of benefit of mexiletine, evidence of no benefit, or for reaching the maximum allowed number of treatment sets. A BHM was used to integrate data from all available N-of-1 trials performed in all the patients to produce estimates of treatment effects for each patient individually and also for 2 genetic subtypes of the disease.

Why Is a BHM Used?
Multilevel data have an underlying hierarchical structure. In the report by Stunnenberg et al, each trial had data from a single patient, and patients were grouped into genetic subtypes. Properly integrating this information required acknowledging the commonalities, eg, data from 2 patients having the same genetic subtype are more likely to be similar than data from 2 patients having different genetic subtypes. Heterogeneity between genetic subtypes and patient-to-patient variability are simultaneously accounted for in the BHM. A pooled analysis, ie, simply combining data from all patients, would not account for systematic patient-to-patient differences. At the other extreme, analyzing each individual patient's trial separately would not account for the information available across all the trials. This could result in underpowered analyses.

By considering the results across all trials, BHMs allow for more accurate estimates of the treatment effects for each individual trial because of a fundamental fact about multilevel data—namely, that regardless of the true systematic differences between the true treatment effects estimated by each trial, random variability is more likely to amplify these differences than diminish them. For example, suppose 4 single-intervention group trials of 100 patients each are conducted to estimate a common rate of a particular patient outcome, which is, hypothetically, 60% for all of the trials. Because of random variability in 100 patients, it is likely that one of the studies will produce an observed rate less than 60%, while another will produce an observed rate greater than 60%. Even though the studies all have exactly equal true underlying rates, numerical simulation demonstrates that the lowest observed value will average 54.9% and the highest will average 65.0%. Even though no true heterogeneity exists, when actual observations are made the results will appear heterogeneous because of random variation. Consider a different scenario in which the 4 trials are truly different, with true underlying rates of 54%, 58%, 62%, and 66%. Although the true rates range from 54% to 66%, the observed rates on average will range from 52.5% to 67.4% because of the additional random variation seen in a trial with
a finite sample size. Here again, the observed values tend to be farther apart than the true values. The lowest observed value in a group is likely lower than its true value, and the highest observed value in a group is likely higher than its true value.

Knowing that observed values tend to be farther apart than the true values, the best estimates of the true values are closer together than the observed values. These estimates (which are more accurate than if each estimate were based only on the results from the individual trial) can be obtained using "shrinkage estimation."6,9 The term "shrinkage" refers to the reduction in the observed differences between the trials. The purpose of the BHM is to determine the proper amount to move the observed treatment differences closer together to obtain the shrinkage estimates. The model estimates the proportion of total variability attributable to random (within-trial) variability and the amount attributable to systematic differences. By eliminating random noise, the resulting estimates are, on average, closer to the underlying truth.6

If the observed heterogeneity is consistent entirely with random variation, the resulting estimates for each group will be close to each other. In contrast, if the observed heterogeneity far exceeds what may be explained by random variation, the heterogeneity will be attributed to true differences that exist between the groups, and the treatment effect estimates will not shift much from the observed rates.

Estimates from a BHM typically have reduced variability compared with those from independent analyses, in which each trial is analyzed separately. This results in tighter interval estimates of treatment effects and may result in statistical hypothesis tests with greater power and lower type I error. For these reasons, BHMs are especially promising for studies of rare diseases for which large sample sizes are not feasible.

What Are the Limitations of BHMs?
All statistical models are predicated on assumptions that should be understood before applying the method. Bayesian hierarchical models rely on various assumptions (eg, the number of levels and the prior probability distributions used as the basis for Bayesian estimation of treatment effects) to estimate and separate within- and across-group variability.6 Additionally, most BHMs assume a certain type of distribution for the across-group variability—for example, a bell-shaped curve. This assumption may fail if there is an outlying group inconsistent with a bell shape, potentially resulting in biased estimates for that outlying group.10 It is important to consider sensitivity analyses that verify the robustness of the conclusions to changes in the choices of prior distributions.

How Were BHMs Used in This Case?
In the study by Stunnenberg et al,8 information from 27 N-of-1 trials was integrated to produce estimates of the treatment effect of mexiletine relative to placebo for 2 genetic subgroups and for the overall population. The outcome was a reduction in self-reported muscular stiffness on a 1-to-9 scale using a validated questionnaire. The mean reduction in stiffness was 3.84 (95% CI, 2.52 to 5.16) for the CLCN1 genotype and 1.94 (95% CI, 0.35 to 3.53) for the SCN4A genotype. The mean reduction across all subgroups was 3.06 (95% CI, 1.96 to 4.15). Bayesian hierarchical models were used to successfully and rigorously integrate information with a complex underlying structure: a variable number of treatment sets per patient, with patients grouped into 2 genotype subgroups.

The BHM allows analysis at different levels of aggregation. In the study by Stunnenberg et al,8 the aggregation occurred at 3 levels. First, data from each patient were aggregated across multiple treatment sets to estimate a single treatment effect for each patient. At the second level of the hierarchy, data were aggregated to estimate the treatment effect within 2 genotype subgroups. The third level described the distribution across subgroups.

How Should BHMs Be Interpreted?
A BHM provides estimates of treatment effects, or other relevant clinical metrics, at each level of the hierarchy, based on all data included in the model. Because of the inclusion of a greater amount of information, these estimates are generally more accurate than if analyses were conducted on subgroups separately, increasing the power of statistical comparisons. For example, in a 3-level model with multiple measurements for each patient, multiple patient subtypes (eg, genetic subtypes), and an overall treatment effect for all patients, the hierarchical model provides estimates for each patient individually, for each patient subtype, and across all subtypes.
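The 4-trial thought experiment described in the text is easy to reproduce. The sketch below is a minimal simulation under the stated assumptions (binomial sampling; the random seed and replication count are arbitrary choices of this illustration): it repeatedly draws 4 trials of 100 patients with a common true event rate of 60% and averages the lowest and highest observed rates.

```python
import numpy as np

rng = np.random.default_rng(0)
n_trials, n_patients, true_rate = 4, 100, 0.60

# Many replications of 4 identical trials; in each replication record
# the lowest and highest of the 4 observed event rates.
reps = 20_000
rates = rng.binomial(n_patients, true_rate, size=(reps, n_trials)) / n_patients
lowest = rates.min(axis=1).mean()
highest = rates.max(axis=1).mean()

# Despite there being no true heterogeneity, the averages land near the
# 54.9% and 65.0% cited in the text: observed extremes spread apart.
print(round(lowest, 3), round(highest, 3))
```

Shrinkage estimation runs this logic in reverse: because observed extremes are predictably farther apart than the truth, the best estimates pull the observed rates back toward each other.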
ARTICLE INFORMATION
Author Affiliations: Berry Consultants LLC, Austin, Texas.
Corresponding Author: Anna E. McGlothlin, PhD, Berry Consultants LLC, 3345 Bee Caves Rd, Ste 201, Austin, TX 78746 ([email protected]).
Section Editors: Roger J. Lewis, MD, PhD, Department of Emergency Medicine, Harbor-UCLA Medical Center and David Geffen School of Medicine at UCLA; and Edward H. Livingston, MD, Deputy Editor, JAMA.
Conflict of Interest Disclosures: Drs McGlothlin and Viele are employees of Berry Consultants LLC, a private consulting firm specializing in Bayesian adaptive clinical trial design, implementation, and analysis.

REFERENCES
1. Meurer WJ, Lewis RJ. Cluster randomized trials. JAMA. 2015;313(20):2068-2069. doi:10.1001/jama.2015.5199
2. Whitehead A. Meta-analysis of Controlled Clinical Trials. West Sussex, United Kingdom: Wiley; 2002. doi:10.1002/0470854200
3. Quintana M, Viele K, Lewis RJ. Bayesian analysis: using prior information to interpret the results of clinical trials. JAMA. 2017;318(16):1605-1606. doi:10.1001/jama.2017.15574
4. Goligher EC, Tomlinson G, Hajage D, et al. Extracorporeal membrane oxygenation for severe acute respiratory distress syndrome and posterior probability of mortality benefit in a post hoc Bayesian analysis of a randomized clinical trial [published online October 22, 2018]. JAMA. doi:10.1001/jama.2018.14276
5. Lewis RJ, Angus DC. Time for clinicians to embrace their inner Bayesian? reanalysis of results of a clinical trial of extracorporeal membrane oxygenation [published online October 22, 2018]. JAMA. doi:10.1001/jama.2018.16916
6. Gelman A, Stern HS, Carlin JB, Dunson DB, Vehtari A, Rubin DB. Bayesian Data Analysis. 3rd ed. Boca Raton, FL: CRC Press; 2013.
7. Zucker DR, Schmid CH, McIntosh MW, et al. Combining single patient (N-of-1) trials to estimate population treatment effects and to evaluate individual patient responses to treatment. J Clin Epidemiol. 1997;50(4):401-410. doi:10.1016/S0895-4356(96)00429-5
8. Stunnenberg BC, Raaphorst J, Groenewoud HM, et al. Effect of mexiletine on muscle stiffness in patients with nondystrophic myotonia evaluated using aggregated N-of-1 trials [published December 11, 2018]. JAMA. doi:10.1001/jama.2018.18020
9. Lipsky AM, Gausche-Hill M, Vienna M, Lewis RJ. The importance of "shrinkage" in subgroup analyses. Ann Emerg Med. 2010;55(6):544-552. doi:10.1016/j.annemergmed.2010.01.002
10. Neuenschwander B, Wandel S, Roychoudhury S, Bailey S. Robust exchangeability designs for early phase clinical trials with multiple strata. Pharm Stat. 2016;15(2):123-134. doi:10.1002/pst.1730
Marginal effects can be used to express how the predicted probability of a binary outcome changes with a change in a risk factor. For example, how does 1-year mortality risk change with a 1-year increase in age or for a patient with diabetes compared with a patient without diabetes? This approach can make the results more easily understood. Marginal effects often are reported with logistic regression analyses to communicate and quantify the incremental risk associated with each factor.1,2

In a 2013 article in JAMA Psychiatry, Cummings et al3 studied factors that predicted access to outpatient mental health facilities that accept Medicaid. Their main outcome had 3 categories, which were labeled "no access," "some access," and "good access." An ordered logistic regression model was developed and results were presented as the change in the probability of each outcome for a change in certain demographic factors.

Use of Marginal Effects
Why Are Marginal Effects Used?
There are several ways to express the strength of the association between a risk factor and a binary outcome from a logistic regression. One popular approach is the odds ratio (OR).4 The odds are the ratio of the probability that an outcome occurs to the probability that the outcome does not occur. The ratio of the odds for 2 groups—the OR—is often used to quantify differences between 2 different groups; eg, treatment and control groups. Another approach is the risk ratio, which is the probability that the outcome occurs in the presence of the risk factor divided by the probability that the outcome occurs in the absence of the risk factor. Risk ratios are often easier to use in clinical practice than are ORs.4,5

A third alternative is the marginal effect, which is the change in the probability that the outcome occurs as the risk factor changes by 1 unit while holding all the other explanatory variables constant. When the risk factor is continuous (eg, age), the change in the probability that the outcome occurs that is associated with a 1-unit change in the risk factor has been called a marginal effect. When the risk factor is discrete (eg, presence or absence of diabetes), the change has been called an incremental effect. In this article, the term marginal effect represents this strength of association measure in both instances.

What Are Marginal Effects?
Of the 3 approaches, marginal effects are the most intuitive because they are expressed as the change in the predicted probability that the outcome occurs that is associated with a 1-unit change in the risk factor. Unlike ORs, it is easier to compare marginal effects across different studies because they are less sensitive to the statistical model conditions that influence the reported values of ORs.6 Marginal effects depend on the values of the other explanatory variables and will not be the same for all members of a group.

For example, consider a linear regression analysis predicting body weight in pounds from a person's height measured in inches. If the regression coefficient is 5, it means that a 1-in increase in height is associated with a 5-lb increase in weight. In this instance, the marginal effect of the 1-unit change in the risk factor, height, is how it changes the predicted outcome, weight in pounds. This is true in linear regressions unless the predictors included in the model include higher powers of the risk factors (eg, age and age squared) or interactions among the explanatory variables (eg, 2 explanatory variables multiplied together). In a simple linear regression (eg, without interactions between predictors), this marginal effect is constant across all values of the risk factor. For instance, a change in height from 5 ft 8 in to 5 ft 9 in has the same predicted effect as does a change from 6 ft 3 in to 6 ft 4 in. The marginal effect is also constant across all values of the other explanatory variables, such as age or presence of diabetes.

In a nonlinear model like logistic regression, the marginal effect of the risk factor is an informative way to answer the research question—how does a change in a risk factor affect the probability that the outcome occurs? In logistic regression, neither the marginal effect nor the OR is the same as the regression coefficient. Instead, the marginal effect reflects the nonlinear function on which the logistic regression model is based. Logistic regression ensures that predicted probabilities lie between 0 and 1, even for extreme values of a continuous risk factor, by modeling the relationship as a curve that fits between 0 and 1. Thus, the marginal effect of a 1-unit increase in age is not constant. The marginal effect will be small when the probability of the outcome is close to 0 or 1 and relatively large when the probability is close to 0.5. Because the values of the other covariates change the predicted probabilities, the marginal effect of any covariate depends on the value of other covariates in the model. For example, the marginal effect of a 1-unit increase in age may depend on whether the study participant is a man or a woman, even without including an interaction term between sex and age.7 The variability in marginal effects makes intuitive sense because it is expected that the effect of a risk factor on the outcome is heterogeneous; ie, different effects for different values of the risk factor and other explanatory variables.

In logistic regression, there is no single marginal effect for the entire sample of individuals, so analysts must choose how to present marginal effects. The most common way is to report the average marginal effect across all persons in the data set, knowing that it is larger for some individuals and smaller for others. A second way is to report the marginal effect calculated at the means of all covariates. This can lead to a challenging interpretation; for instance, an estimated marginal effect of a risk factor for a person who is 50% pregnant or 20% diabetic. A third way is to report the marginal effect for an individual with a specific set of characteristics; for example, the effect of an intervention on pregnant patients with diabetes.

An important advantage of marginal effects over ORs is that estimated marginal effects are less sensitive than ORs to inclusion of different sets of explanatory variables and estimation based on different samples of data.4 The sensitivity of ORs and marginal effects to different model specifications and data sets was reviewed by Norton and Dowd.6

What Are the Limitations of Marginal Effects?
Marginal effects vary across individuals, so it is important to present reported marginal effects in context by comparing the marginal effects with the magnitude of the baseline risk. For example, a change in probability of 1% may seem small if the baseline risk is 80% but may be large for a rare outcome (eg, baseline risk of 2%).

Care must be exercised when reporting marginal effects from case-control studies.8 In this type of model, the sample proportions of the outcome values are not representative of the population.5 Simple logistic models cannot provide either a meaningful marginal effect or a meaningful risk ratio from a case-control study, so ORs are the appropriate measures of association in this setting.

Until recently, it was challenging to compute marginal effects from logistic regressions and other nonlinear models such as ordered logistic, Poisson, negative binomial, and conditional logistic models. In recent years, standard statistical packages have added commands that make it easier to generate marginal effects, including the margins command in Stata and the margins package in R.

How Should the Marginal Effects Be Interpreted in Cummings et al?
Cummings et al3 described how changes in 4 county-level characteristics would change the predicted probability of either having no access or having good access to mental health outpatient treatment facilities that accept Medicaid (see Table 2 in the article3). For example, an increase of 31 percentage points in the fraction of the county population living in a rural community (the standard deviation of that variable) would on average increase the probability of no access to mental health care by 27.9 percentage points (baseline risk = 34.8%) but would also increase the probability of good access by 3.4 percentage points (baseline risk = 20.2%), holding the effect of other explanatory variables constant. Such a change in rural population therefore would decrease the probability of some access, the third possible outcome, by 31.3 percentage points (27.9 + 3.4).

Marginal effects are a useful way to describe the average effect of changes in explanatory variables on the change in the probability of outcomes in logistic regression and other nonlinear models. Marginal effects provide a direct and easily interpreted answer to the research question of interest.
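For a logistic model, the marginal effect of a continuous risk factor x with coefficient b is b·p·(1−p), so it varies with the predicted probability p. The sketch below illustrates the presentation choices discussed in the text; the coefficients and simulated covariates are invented for illustration and are not estimated from Cummings et al or any real data set.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Hypothetical logistic-regression coefficients and covariates.
b0, b_age, b_diab = -6.0, 0.08, 0.9
rng = np.random.default_rng(1)
age = rng.uniform(40, 90, 1000)
diabetes = rng.integers(0, 2, 1000).astype(float)

p = sigmoid(b0 + b_age * age + b_diab * diabetes)

# Marginal effect of age at each observation: dp/d(age) = b * p * (1 - p),
# largest where p is near 0.5 and small where p is near 0 or 1.
me_age = b_age * p * (1 - p)
ame = me_age.mean()  # average marginal effect across the sample

# Marginal effect at the means of the covariates (the "50% diabetic" person).
p_bar = sigmoid(b0 + b_age * age.mean() + b_diab * diabetes.mean())
mem = b_age * p_bar * (1 - p_bar)

# Incremental effect of diabetes: average difference in predicted probability.
inc_diab = (sigmoid(b0 + b_age * age + b_diab)
            - sigmoid(b0 + b_age * age)).mean()
print(round(ame, 4), round(mem, 4), round(inc_diab, 4))
```

The Stata margins command and R margins package noted in the text automate this arithmetic, computing the derivative (or finite difference, for discrete factors) of the predicted probability for each observation and then averaging.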
ARTICLE INFORMATION
Author Affiliations: Department of Health Management and Policy, Department of Economics, University of Michigan, Ann Arbor (Norton); National Bureau of Economic Research, Cambridge, Massachusetts (Norton); Division of Health Policy and Management, School of Public Health, University of Minnesota, Minneapolis (Dowd); Durham Center of Innovation to Accelerate Discovery and Practice Transformation (ADAPT), Durham Veterans Affairs Health Care System, Durham, North Carolina (Maciejewski); Department of Population Health Sciences, Duke University School of Medicine, Durham, North Carolina (Maciejewski); Division of General Internal Medicine, Department of Medicine, Duke University School of Medicine, Durham, North Carolina (Maciejewski).
Corresponding Author: Matthew L. Maciejewski, PhD, Durham Center of Innovation to Accelerate Discovery and Practice Transformation (ADAPT), Durham Veterans Affairs Health Care System, 508 Fulton St, Ste 600, Durham, NC 27705 ([email protected]).
Section Editors: Roger J. Lewis, MD, PhD, Department of Emergency Medicine, Harbor-UCLA Medical Center and David Geffen School of Medicine at UCLA; and Edward H. Livingston, MD, Deputy Editor, JAMA.
Published Online: March 8, 2019. doi:10.1001/jama.2019.1954
Conflict of Interest Disclosures: Dr Maciejewski reported being supported by research career scientist award 10-391 from the Veterans Affairs Health Services Research and Development and support from the Durham Veterans Affairs Health Services Research and Development Center of Innovation (CIN 13-410); receiving grants from the National Institute on Drug Abuse and the Department of Veterans Affairs; receiving a contract from the National Committee for Quality Assurance to Duke University for research; and that his spouse owns stock in Amgen. No other disclosures were reported.

REFERENCES
1. Meurer WJ, Tolles J. Logistic regression diagnostics: understanding how well a model predicts outcomes. JAMA. 2017;317(10):1068-1069. doi:10.1001/jama.2016.20441
2. Tolles J, Meurer WJ. Logistic regression: relating patient characteristics to outcomes. JAMA. 2016;316(5):533-534. doi:10.1001/jama.2016.7653
3. Cummings JR, Wen H, Ko M, Druss BG. Geography and the Medicaid mental health care infrastructure: implications for health care reform. JAMA Psychiatry. 2013;70(10):1084-1090. doi:10.1001/jamapsychiatry.2013.377
4. Norton EC, Dowd BE, Maciejewski ML. Odds ratios—current best practice and use. JAMA. 2018;320(1):84-85. doi:10.1001/jama.2018.6971
5. Sackett DL, Deeks JJ, Altman DG. Down with odds ratios! BMJ Evid Based Med. 1996;1(6):164-166. doi:10.1136/ebm.1996.1.164
6. Norton EC, Dowd BE. Log odds and the interpretation of logit models. Health Serv Res. 2018;53(2):859-878. doi:10.1111/1475-6773.12712
7. Karaca-Mandic P, Norton EC, Dowd B. Interaction terms in nonlinear models. Health Serv Res. 2012;47(1 pt 1):255-274. doi:10.1111/j.1475-6773.2011.01314.x
8. Irony TZ. Case-control studies: using "real-world" evidence to assess association. JAMA. 2018;320(10):1027-1028. doi:10.1001/jama.2018.12115
Health care decision makers, including patients, clinicians, hospitals, private health systems, and public payers (eg, Medicare), are often challenged with choosing among several new or existing interventions or programs to commit their limited resources to. This choice is ideally based on a comparison of health benefits, harms, and costs associated with each alternative. How best to determine the optimal intervention is a challenging task because benefits, harms, and costs must be weighed for a given option and compared with alternatives.

One way to inform such decisions is to perform a cost-effectiveness analysis. A cost-effectiveness analysis is an analytic method for quantifying the relative benefits and costs among 2 or more alternative interventions in a consistent framework. In a 2018 study published in JAMA Oncology, Moss et al1 examined the cost-effectiveness of multimodal ovarian cancer screening with serum cancer antigen 125 compared with no screening in the United States, based on findings from the large United Kingdom Collaborative Trial of Ovarian Cancer Screening (UKCTOCS). The UKCTOCS evaluated the effect of screening on ovarian cancer mortality2 and demonstrated that multimodal screening reduced mortality among women without prevalent ovarian cancer.

The Use of Cost-effectiveness Analysis
Choosing among alternative treatments or programs is complicated because benefits, harms, and costs vary in the following ways: (1) benefits may be reflected in varying patterns of reduced morbidity or mortality in patients; (2) interventions vary in price and also in costs of acquiring or providing them (eg, time costs); and (3) benefits and costs accrue differently to different constituents (patients, caregivers, clinicians, health systems, and society). A cost-effectiveness analysis is designed to allow decision makers to clearly understand the tradeoffs of costs, harms, and benefits between alternative treatments and to combine those considerations into a single metric, the incremental cost-effectiveness ratio (ICER), that can be used to inform decision making when limited resources are available.

Description of Cost-effectiveness Analysis
Cost-effectiveness analysis is an analytic tool in which the costs and harms and benefits of an intervention (intervention A) and at least 1 alternative (intervention B) are calculated and presented as a ratio of the incremental cost (cost of intervention A − cost of intervention B) and the incremental effect (effectiveness of intervention A − effectiveness of intervention B). This ratio is known as the ICER.

The incremental cost in the numerator represents the additional resources (eg, medical care costs, costs from productivity changes) incurred from the use of intervention A over intervention B. The incremental effect in the denominator of the ICER represents the additional health outcomes (eg, the number of cases of a disease prevented or the quality-adjusted life-years [QALYs] gained) through the use of intervention A over intervention B.3

QALYs are the most commonly used benefit measure in cost-effectiveness analyses, in which the length of life is left unchanged or adjusted downward to reflect the health-related quality of life. Specifically, a quality weight of 1 indicates optimal health, 0 indicates the equivalent of death, and weights between 0 and 1 indicate less-than-optimal health. The weight for each period is multiplied by the length of the period to yield the QALYs for that period.

A primary rationale for using QALYs as a standard effectiveness outcome in cost-effectiveness analyses is the ability for policy makers to compare ICERs for various interventions across different diseases when allocating scarce resources to the intervention(s) that provide the greatest value for money. ICER values that are low suggest that intervention A improves health at a small additional cost per unit of health, assuming that A is both more costly and effective than B. If the ICER is negative, interpretation is more complex because negative ICERs can result from negative incremental costs (ie, the new treatment is less costly than the existing treatment) or from negative incremental benefits (ie, the new treatment is less effective than the existing treatment). A new treatment is said to be “dominant” if it is lower in cost and more effective than the comparator and is clearly of better value for money. However, the new treatment is said to be “dominated” if it is higher in cost and less effective than the comparator and is not of good value for money.

Limitations in the Use of Cost-effectiveness Analysis
There are important qualifications to consider when reviewing a cost-effectiveness analysis. What is considered cost-effective depends on comparing the ICER to the threshold value (eg, $50 000 or $100 000 per additional QALY) of the decision maker, which represents the willingness to pay for a unit of increased effectiveness (eg, 1 QALY). The threshold helps to determine which interventions merit investment. This willingness to pay is often represented by the largest ICER among all the interventions that were adopted before current resources were exhausted, because adoption of any new intervention would require removal of an existing intervention to free up resources. There is no fixed threshold for cost per QALY to determine what is cost-effective. Most decision makers do not rely on a single threshold to determine investment decisions.

Cost-effectiveness analyses have numerous limitations, including that available data may be drawn from heterogeneous populations, that data on important outcomes may be unavailable, and that only short-term outcomes may be available and long-term outcomes must be extrapolated. Further, simplifying assumptions often must be made about how to represent the health states associated with the disease being studied that may not accurately represent the nuance and complexities of the clinical setting.

In 2016, the Second Panel on Cost-Effectiveness in Health and Medicine4 recommended that all cost-effectiveness analyses should include a discussion of relevant limitations and efforts to compensate for the shortcomings of cost-effectiveness analyses.
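In code, the QALY and ICER arithmetic described above reduces to a few lines. The sketch below is a minimal illustration in Python with entirely hypothetical quality weights, costs, and effectiveness values (not figures from any study):

```python
def qalys(periods):
    """Total QALYs: sum of (quality weight x years lived at that weight)."""
    return sum(weight * years for weight, years in periods)

def icer(cost_a, eff_a, cost_b, eff_b):
    """Incremental cost-effectiveness ratio of intervention A vs B:
    (cost of A - cost of B) / (effectiveness of A - effectiveness of B)."""
    return (cost_a - cost_b) / (eff_a - eff_b)

# Hypothetical patient: 10 years in optimal health (weight 1.0),
# then 5 years in a chronic health state (weight 0.5) = 12.5 QALYs.
q = qalys([(1.0, 10), (0.5, 5)])

# Hypothetical interventions: A costs $20 000 more than B and yields
# 0.5 additional QALYs, giving an ICER of $40 000 per QALY gained.
ratio = icer(cost_a=120_000, eff_a=13.0, cost_b=100_000, eff_b=12.5)
print(q, ratio)
```

Note that the ratio is only directly interpretable when A is more costly and more effective than B; a negative ICER requires the dominance reasoning discussed in the text rather than a mechanical reading of the number.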
The Second Panel also recommended that all cost-effectiveness analyses should provide their findings from a health care sector perspective, which would incorporate the costs, benefits, and harms that are incurred by a payer, and from a societal perspective, which would incorporate all costs and health effects regardless of who incurs the costs or experiences the effects. To ensure that all consequences to patients, caregivers, social services, and others outside the health care sector are considered, the Second Panel recommended use of an “Impact Inventory” that lists the health- and non–health-related effects of an intervention. This tool allows analysts to evaluate categories of effects that may be most important to diverse stakeholders. Checklists for the various items that should be included when reporting cost-effectiveness analysis results were provided by the Second Panel.4

How Was the Cost-effectiveness Analysis Performed in This Study
Moss et al evaluated the cost-effectiveness of a multimodal screening (MMS) program for ovarian cancer in the United States from a health care sector perspective (eg, Medicare).1 In a health care sector perspective, only costs, health benefits, and harms that were observed by the health care sector are considered, and other costs, benefits, and harms that may affect patients or their caregivers are ignored.4

The authors developed a Markov simulation model using data from the UKCTOCS to compare MMS with no screening for women beginning at 50 years of age in the general population. The model, which involved a mathematical simulation that evaluated the benefits of the screening strategies in hypothetical cohorts of patients as they moved from one health state to the next, according to transition probabilities, demonstrated that MMS was both more expensive and more effective in reducing ovarian cancer mortality than no screening.

Clinical effectiveness was estimated from the UKCTOCS trial estimates of the effects of MMS on ovarian cancer mortality, with extrapolation of the long-term effects beyond the 11-year follow-up period. Direct medical costs were estimated based on Medicare claims data. Quality of life–related weights were included for the health states of being cancer free, undergoing MMS screening, and having ovarian cancer (incorporating lower weights for the chemotherapy and cancer stage).

How Should the Cost-effectiveness Analysis Be Interpreted in This Study
In the main, base-case analysis, MMS screening with a risk algorithm cost estimate of $100 reduced ovarian cancer mortality by 15%, resulting in an incremental cost-effectiveness ratio of $106 187 per QALY gained (95% CI, $97 496-$127 793). The authors explored the uncertainty in the underlying parameters and found that screening women starting at 50 years of age with MMS was cost-effective in 70% of the simulations at a willingness to pay of $150 000 per QALY. If the willingness to pay were $100 000 per QALY, then screening was cost-effective 47% of the time.

A cost-effectiveness analysis does not make the decision for patients, clinicians, health care systems, or policy makers, but rather provides information that they can use to facilitate decision making. A cost-effectiveness analysis is also not designed for cost containment. These analyses do not set the level of resources to be spent on health care, but rather they provide information that can be used to ensure that those resources, whatever the level available, are used as effectively as possible to improve health. When reviewing cost-effectiveness analyses, readers should examine the study and use the recommendations from the Second Panel on Cost-Effectiveness in Health and Medicine4 to help understand the implications of cost-effectiveness analysis research.
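Statements such as "cost-effective in 70% of the simulations at a willingness to pay of $150 000 per QALY" come from probabilistic sensitivity analysis: each simulation draw yields an incremental cost and incremental QALY pair, and a draw counts as cost-effective when its incremental net monetary benefit (willingness to pay × incremental QALYs − incremental cost) is positive. The sketch below shows that computation in Python with made-up random draws, not output from the Moss et al model:

```python
import random

def prob_cost_effective(draws, wtp):
    """Share of simulation draws whose incremental net monetary benefit,
    wtp * (incremental QALYs) - (incremental cost), is positive."""
    favorable = sum(1 for d_cost, d_qalys in draws if wtp * d_qalys - d_cost > 0)
    return favorable / len(draws)

# Made-up probabilistic sensitivity draws of (incremental cost,
# incremental QALYs); a real analysis would generate these by sampling
# the Markov model's parameter distributions.
random.seed(0)
draws = [(random.gauss(20_000, 5_000), random.gauss(0.18, 0.05))
         for _ in range(10_000)]

for wtp in (50_000, 100_000, 150_000):
    share = prob_cost_effective(draws, wtp)
    print(f"${wtp:,} per QALY: cost-effective in {100 * share:.0f}% of draws")
```

Plotting this share against a range of willingness-to-pay values gives the cost-effectiveness acceptability curve commonly reported alongside the base-case ICER.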
ARTICLE INFORMATION

Author Affiliations: Department of Population Health Sciences, Duke University School of Medicine, Durham, North Carolina (Sanders, Maciejewski); Duke Clinical Research Institute, Duke University, Durham, North Carolina (Sanders); Duke-Margolis Center for Health Policy, Duke University, Durham, North Carolina (Sanders); Durham Center of Innovation to Accelerate Discovery and Practice Transformation (ADAPT), Durham Veterans Affairs Health Care System, Durham, North Carolina (Maciejewski); Division of General Internal Medicine, Department of Medicine, Duke University School of Medicine, Durham, North Carolina (Maciejewski); The Comparative Health Outcomes, Policy, and Economics (CHOICE) Institute, Departments of Pharmacy, Health Services, and Economics, University of Washington, Seattle (Basu).

Corresponding Author: Gillian D. Sanders, PhD, Duke University Medical Center, 100 Fuqua Dr, Box 90120, Durham, NC 27710 (gillian.sanders@duke.edu).

Section Editors: Roger J. Lewis, MD, PhD, Department of Emergency Medicine, Harbor-UCLA Medical Center and David Geffen School of Medicine at UCLA; and Edward H. Livingston, MD, Deputy Editor, JAMA.

Published Online: March 11, 2019. doi:10.1001/jama.2019.1265

Conflict of Interest Disclosures: Dr Maciejewski reported receiving research and center funding (CIN 13-410) from the VA Health Services Research and Development Service, receiving a contract for research from the National Committee for Quality Assurance, receiving research funding from the National Institute on Drug Abuse (RCS 10-391), and that his spouse owns stock in Amgen. Dr Basu reported consulting for Merck, Pfizer, GlaxoSmithKline, Janssen, and AstraZeneca as an expert on issues related to cost-effectiveness analysis. No other disclosures were reported.

REFERENCES

1. Moss HA, Berchuck A, Neely ML, Myers ER, Havrilesky LJ. Estimating cost-effectiveness of a multimodal ovarian cancer screening program in the United States: secondary analysis of the UK Collaborative Trial of Ovarian Cancer Screening (UKCTOCS). JAMA Oncol. 2018;4(2):190-195. doi:10.1001/jamaoncol.2017.4211

2. Jacobs IJ, Menon U, Ryan A, et al. Ovarian cancer screening and mortality in the UK Collaborative Trial of Ovarian Cancer Screening (UKCTOCS): a randomised controlled trial. Lancet. 2016;387(10022):945-956. doi:10.1016/S0140-6736(15)01224-6

3. Neumann PJ, Cohen JT. QALYs in 2018—advantages and concerns. JAMA. 2018;319(24):2473-2474. doi:10.1001/jama.2018.6072

4. Sanders GD, Neumann PJ, Basu A, et al. Recommendations for conduct, methodological practices, and reporting of cost-effectiveness analyses: Second Panel on Cost-effectiveness in Health and Medicine. JAMA. 2016;316(10):1093-1103. doi:10.1001/jama.2016.12195
Randomized clinical trials are considered the most reliable source of evidence for the effects of medical interventions, but nonexperimental studies are often used to assess the effectiveness of treatments as they are used in actual clinical practice. In nonexperimental studies, treatment groups may differ by important patient characteristics, such as disease severity, frailty, cognitive function, vulnerability to adverse effects, and ability to pay.1 While statistical adjustment can account for imbalances in observed characteristics between groups, observed imbalances are concerning because they suggest that unobserved differences may also exist. Unobserved patient characteristics that influence both treatment and the outcomes result in “unobserved confounding,” a bias that cannot be removed using standard statistical adjustment.1

Supplemental content

Instrumental variable analysis is an approach used to help address unobserved confounding when estimating treatment effects in nonrandomized studies and in randomized studies when protocol nonadherence exists. An instrumental variable is a factor that should effectively randomize some patients to the different groups; it should be correlated with the treatment received and related to outcomes only through its effect on treatment. Under specific assumptions, instrumental variable methods can provide unbiased estimates of treatment effects even if unobserved confounding exists.

In an article published in JAMA Network Open,2 Desai and colleagues used instrumental variable methods to assess the association between initiation of osteoporosis medications and nonvertebral fracture risk in a cohort of 97 169 patients aged 50 years or older who were hospitalized for hip fracture from 2004 to 2015.2 The authors used an instrumental variable analysis to address suspected unobserved confounding from frailty, disease risk, and other factors that might bias the estimate of the treatment effect.

Use of the Method
Why Are Instrumental Variables Used?
When patients and clinicians are free to choose treatments, some of the factors that influence treatment choice may also be strongly related to the likelihood of a good outcome. In the study by Desai et al,2 men were less likely to initiate osteoporosis medication than women, and they also may be less likely to develop fractures. If all confounders are observed, then potential confounding influence can be addressed by adjusting for them in a statistical model. If only some confounders are observed, then outcome differences between treatment groups may be driven by treatment alone, confounders alone, or both. For example, if patients with osteoporosis were not given a preventive treatment because they were perceived to be close to the end of life, the group of patients not receiving treatment might be more severely ill or frail than patients receiving treatment. If frailty that is not measured in the study influences patient outcomes and is imbalanced between groups, then frailty would be an unmeasured confounder and would bias the estimated treatment effect in regression or propensity score methods.3 Instrumental variables that are strongly related to treatment can be used to help reduce the effect of unobserved confounding when estimating the association between a treatment (initiation of osteoporosis treatment following a hip fracture) and an outcome of interest (prevention of subsequent nonvertebral fractures). In the report by Desai et al, the strongest of 4 instrumental variables considered was based on hospital-level rates of osteoporosis medication prescribing following hip fracture. Clinicians at some hospitals routinely initiated treatment; clinicians at other hospitals did not.

What Are Instrumental Variables?
Instrumental variables should plausibly operate like a randomization process, effectively randomly assigning a subset of the patients into different treatment groups to achieve balance on observed and unobserved factors (see the eTable in the Supplement for a list of instrumental variables used in previous studies), without having any direct effect on outcome. In one example,4 day of the week of hospital admission was an instrumental variable used for assessing the association between health care costs and 2 approaches (medical and surgical) to kidney stone removal. Admission day was associated with intervention type because surgical treatments were much more likely to be performed on weekdays than weekends. Thus, an instrumental variable of admission day may approximate the randomization process in a clinical trial. Day of the week may approximate a randomization process for patients admitted for emergency reasons (eg, acute renal colic) but not for patients admitted for elective reasons (eg, plastic surgery).

If admission day is unlikely to be associated with changes in the outcome of interest, except through its potential relationship with whether medical or surgical intervention is provided, then this variable may be a suitable instrumental variable. When a plausibly valid and measurable instrumental variable has been identified by the investigator, an instrumental variable analysis can be conducted. Although there are many different approaches, instrumental variable analysis is typically done using a 2-stage regression model. In the first stage, a model for the probability of receiving some treatment is estimated, which depends on the instrumental variable and other patient factors. From this fitted model, each patient’s predicted probability of treatment and the difference between actual treatment received and predicted treatment (referred to as the “residual”) are generated. In the second stage, a traditional regression model of the outcome is estimated as a function of patient factors and the predicted treatment5 (or the residual6) from the first model.

Limitations of Instrumental Variable Estimation
Although instrumental variables may provide estimates of causal relationships in the presence of unmeasured confounders, these analyses are difficult to implement and interpret. The first challenge is finding a variable that is plausibly predictive of treatment but does not have a direct or indirect effect on outcome. To assess the execution and reporting of results from an instrumental variable analysis,1,7 strong justification should be provided to explain why unobserved confounding may bias results from conventional approaches, why the chosen instrumental variable might be expected to be strongly related to treatment assignment, and why the instrumental variable should be related to the outcome only through its effect on treatment. These theoretical justifications should be supported by empirical evidence of a strong association with treatment in the first-stage model predicting treatment.

A second challenge is that the key instrumental variable assumption, that the instrument is related to the outcome only through its effect on treatment, cannot be confirmed with data. To test whether there is no independent causal pathway from the instrumental variable to the outcome would require controlling for all confounding factors. Yet, unobserved confounding that motivates the need for instrumental variable analysis precludes testing of this assumption. If all confounding factors were observed, there would be no need for an instrumental variable analysis. The instrumental variable should be unassociated with observed patient characteristics if it effectively randomizes patients. Reports involving instrumental variable analysis typically include a table that summarizes the means and frequencies of observed patient characteristics across levels of the instrument. It is still possible that the instrument could be associated with unobserved variables, so researchers should make a strong argument why this is unlikely (Supplement).

A third challenge is that instrumental variable methods yield an estimated treatment effect that is often difficult to interpret. Under a specific assumption,5 an instrumental variable estimate will generalize only to patients whose treatment status depends on the instrument, a subgroup that has been termed “compliers” or “marginal patients.”5,7 If the treatment affects all patients in a similar manner, the estimate from an instrumental variable analysis should generalize to all patients. Reports describing the results of instrumental variable analyses should discuss interpretation of estimated treatment effects, including characterization of the marginal patients.

A fourth challenge arises when an instrumental variable is weakly correlated with treatment, a problem that can lead to biased estimates with wide CIs.7,8 Weak instrumental variables are associated with treatment in only a small number of patients and therefore effectively result in a substantially reduced sample size.8 If the instrumental variable is strong and the underlying assumptions hold, then instrumental variable methods may be useful in settings where strong unobserved confounding is suspected.

How Were Instrumental Variables Used?
The study by Desai et al2 considered 4 instrumental variables: calendar year, specialist access, prescribing rates of osteoporosis medications across geographic regions, and hospital preference. These instrumental variables attempt to exploit random variation in treatment choice across calendar time, physicians, geographic regions, and hospitals. The authors found that several patient covariates were imbalanced when comparing patients who received osteoporosis medications and patients who did not, but imbalances were less severe when comparing patients by hospital-specific variation in osteoporosis medication prescribing rates (see Figure 2 in the article by Desai et al2). The authors examined the properties of the instruments and found that the hospital-specific prescribing measure was the most appropriate measure to address unobserved confounding because it was the most strongly related to the treatment. Desai et al also conducted a conventional regression analysis to compare with the results from instrumental variable estimation.

How Should the Instrumental Variables Analysis Be Interpreted?
The authors came to 2 conclusions. First, osteoporosis medication initiation may have been associated with a reduced risk of nonvertebral fractures in adults with recent hip fractures. This result may not generalize to all patients eligible for osteoporosis medications because this cohort was younger and had fewer comorbidities than Medicare cohorts examined in previous studies. Second, unobserved confounding may have resulted in an underestimation of the association between osteoporosis medication initiation and nonvertebral fractures (1.3 fewer events in the multivariable-adjusted analysis vs 4.2 fewer events in the instrumental variable analysis).
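The 2-stage estimation described in this article can be illustrated with simulated data. The sketch below (Python; all data and variable names are invented for illustration, and the first stage uses a simple linear probability model rather than the richer models used in practice) builds an unobserved confounder into the data, shows that a conventional regression is biased, and then recovers an estimate close to the true effect by regressing the outcome on treatment predicted from a single binary instrument:

```python
import random

def ols(x, y):
    """Simple one-variable ordinary least squares: returns (intercept, slope)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
             / sum((xi - mx) ** 2 for xi in x))
    return my - slope * mx, slope

def two_stage_least_squares(z, t, y):
    """Stage 1: predict treatment t from instrument z.
    Stage 2: regress outcome y on the predicted treatment."""
    a, b = ols(z, t)
    t_hat = [a + b * zi for zi in z]
    _, effect = ols(t_hat, y)
    return effect

# Simulated cohort: the unobserved confounder u (eg, unmeasured frailty)
# raises treatment uptake but lowers the outcome; the instrument z
# (eg, hospital prescribing preference) affects y only through t.
random.seed(1)
z, t, y = [], [], []
for _ in range(20_000):
    zi = random.randint(0, 1)
    u = random.gauss(0, 1)
    ti = 1 if zi + u + random.gauss(0, 1) > 0.5 else 0
    yi = 2.0 * ti - 1.5 * u + random.gauss(0, 1)   # true effect of t is 2.0
    z.append(zi); t.append(ti); y.append(yi)

naive = ols(t, y)[1]                    # biased by the unobserved confounder
iv = two_stage_least_squares(z, t, y)   # should land close to 2.0
print(round(naive, 2), round(iv, 2))
```

This mirrors the pattern reported by Desai et al, in which the conventional regression understated the association relative to the instrumental variable estimate.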
ARTICLE INFORMATION

Author Affiliations: Center of Innovation to Accelerate Discovery & Practice Transformation, Durham Veterans Affairs Medical Center, Durham, North Carolina (Maciejewski); Department of Population Health Sciences, Duke University School of Medicine, Durham, North Carolina (Maciejewski); Division of General Internal Medicine, Department of Medicine, Duke University School of Medicine, Durham, North Carolina (Maciejewski); Department of Epidemiology, University of North Carolina at Chapel Hill Gillings School of Global Public Health (Brookhart).

Corresponding Author: Matthew L. Maciejewski, PhD, Department of Population Health Sciences, Duke University Medical Center, 508 Fulton St, Ste 600, Durham, NC 27705 (matthew.maciejewski@va.gov).

Published Online: May 2, 2019. doi:10.1001/jama.2019.5646

Conflict of Interest Disclosures: Dr Maciejewski reported receiving research and center funding (CIN 13-410 and RCS 10-391) from the VA Health Services Research and Development Service, receiving a contract for research from the National Committee for Quality Assurance, receiving research funding from the National Institute on Drug Abuse (R01 DA040056), and that his spouse owns stock in Amgen. Dr Brookhart reported equity ownership in NoviSci.

REFERENCES

1. Brookhart MA, Rassen JA, Schneeweiss S. Instrumental variable methods in comparative safety and effectiveness research. Pharmacoepidemiol Drug Saf. 2010;19(6):537-554.

2. Desai RJ, Mahesri M, Abdia Y, et al. Association of osteoporosis medication use after hip fracture with prevention of subsequent nonvertebral fractures: an instrumental variable analysis. JAMA Netw Open. 2018;1(3):e180826.

3. Haukoos JS, Lewis RJ. The propensity score. JAMA. 2015;314(15):1637-1638.

4. Hollingsworth JM, Norton EC, Kaufman SR, Smith RM, Wolf JS Jr, Hollenbeck BK. Medical expulsive therapy versus early endoscopic stone removal for acute renal colic: an instrumental variable analysis. J Urol. 2013;190(3):882-887.

5. Angrist JD, Imbens GW, Rubin DB. Identification of causal effects using instrumental variables. J Am Stat Assoc. 1996;91(434):444-472.

6. Terza JV, Basu A, Rathouz PJ. Two-stage residual inclusion estimation: addressing endogeneity in health econometric modeling. J Health Econ. 2008;27(3):531-543. doi:10.1016/j.jhealeco.2007.09.009

7. Ertefaie A, Small DS, Flory JH, Hennessy S. A tutorial on the use of instrumental variables in pharmacoepidemiology. Pharmacoepidemiol Drug Saf. 2017;26(4):357-367. doi:10.1002/pds.4158

8. Hernán MA, Robins JM. Instruments for causal inference: an epidemiologist’s dream? Epidemiology. 2006;17(4):360-372. doi:10.1097/01.ede.0000222409.00878.37
Each individual’s genetic makeup influences the presence of, manifestation of, and susceptibility to disease. The identification of specific genetic regions that influence disease for mendelian genetic conditions, such as cystic fibrosis and Huntington disease, has often been elucidated through familial linkage studies, combining familial patterns of disease with a limited set of genomic markers. In contrast, many diseases have very complex underlying mechanisms with many genes and the environment influencing risk. Understanding the influence of genetics on risk for these diseases requires approaches beyond familial linkage studies.

Related article page 1682

A candidate gene association study begins by identifying the candidate genes—either 1 gene or multiple genes thought to belong to a common pathway. The association between genetic variations in the candidate genes and the presence of disease is investigated. The success of this strategy is highly dependent on the correct choice of genes to study, although the overall experience with candidate gene association studies has been disappointing.1

In contrast to a candidate gene association study, a genome-wide association study (GWAS) is based on a hypothesis-free strategy with no need to specify target genes in advance, and can be used to survey the entire genome to elucidate susceptibility to common heritable human diseases. A GWAS quantifies the association between the presence of disease and genetic variations at known positions in the genome, referred to as single-nucleotide polymorphisms (SNPs; see the Table for related terminology), to pinpoint relatively smaller areas of the genome that may contribute to the risk of disease.

In this issue of JAMA, Hauser et al2 report a GWAS that evaluated genetic disposition for primary open-angle glaucoma (POAG) in individuals with African ancestry. SNP rs59892895*C in the amyloid β A4 precursor protein-binding family B member 2 (APBB2) gene was found to be significantly associated with POAG in this population, while no association between this gene and POAG was found in European or Asian populations. The authors conclude that there are differences in genetic mechanisms underlying glaucoma in African ancestry populations compared with European and Asian ancestry populations.

Use of the Method
Why Is the Method Used?
GWASs take advantage of variation in the millions of known SNPs, occurring in known locations across the entire genome, to determine whether one genetic variant (ie, allele) at the location of each SNP occurs more often than expected in individuals with a particular disease than in those without the disease. The associated SNPs are then considered to mark a region of the human genome that influences the risk of disease. The approach allows identification of small genetic regions that contain potential effector genes (ie, genes that may affect the likelihood of disease).

Description of the Method
The most common approach used in a GWAS is to compare allele frequency among affected individuals with that of a healthy control group. The history of humans and the human genome has resulted in some SNP variants being coinherited more frequently with some variants of disease-related effector genes, with the likelihood of coinheritance decreasing the farther the SNP is from the effector gene because of random recombination. This phenomenon is known as linkage disequilibrium. If the allele frequency at a particular SNP is significantly different between affected individuals and control individuals, the allele variant is said to be associated with the disease. Because each SNP is analyzed for association with disease independently and hundreds of thousands to millions of SNPs are analyzed in a single GWAS, very strict criteria for statistical significance must be applied to avoid false-positive results.3 A P value threshold of 5 × 10⁻⁸ is typically used to define statistical significance (ie, a Bonferroni correction, with the threshold determined by dividing .05 by 10⁶, to reflect the number of tests).

The odds ratios (ORs) obtained in GWASs are the odds of disease among individuals who have a specific allele vs the odds of disease among individuals who do not have that same allele, reflecting both the degree of coinheritance or linkage disequilibrium of the SNP allele and the disease-causing allele in the effector gene and the magnitude of the effector gene’s effect on disease risk. Most disease-associated SNPs have very small effect sizes (OR <1.5) and large numbers of individuals are needed to identify predisposing (or protective) SNPs.4

Table. Genomic Terms and Definitions

Allele: One of 2 or more DNA sequences occurring at a particular gene locus (eg, blood groups A and B)

Effector gene: The gene (whether protein coding or RNA coding) that underlies an SNP association with a trait

Gene: The basic unit of heredity that occupies a specific location on a chromosome; each gene consists of nucleotides arranged in a linear manner; although many genes code for a specific protein or segments of protein leading to a particular characteristic or function, other genes just code for RNA

Genome-wide association study: A way for scientists to identify inherited genetic variants associated with risk of disease or a particular trait; this method surveys the entire genome for genetic polymorphisms, typically SNPs, that occur more frequently in individuals with the disease or trait being assessed (cases) than in individuals without the disease or trait (controls)

Imputation: The statistical inference of unobserved genotypes; it is achieved by using a known genotype in a population (eg, from the HapMap or the 1000 Genomes Project)

Linkage analysis: A gene-hunting technique that traces patterns of disease or traits in families and attempts to locate a trait-causing gene by identifying genetic markers of known chromosomal location that are coinherited with the trait

Linkage disequilibrium: Where alleles of different SNPs occur together more often than can be accounted for by chance (ie, beyond the association due to their physical proximity on a chromosome)

Locus: The physical site or location of a specific gene on a chromosome

Meta-analysis: The statistical procedure for combining data from multiple studies, extensively used in genome-wide association studies

Single-nucleotide polymorphism (SNP): DNA sequence variations that occur when a single nucleotide (adenine, thymine, cytosine, or guanine) in the genome sequence is altered; usually present in at least 1% of the population
This can be achieved by prospectively gathering samples from large A key scientific step in modern GWASs is to move from finding
populations (eg, the deCode project, which gathered genotypic and associated SNPs to identifying the actual effector transcript that
medical data from more than 160 000 volunteer participants, com- codes for protein or for RNA and is responsible for the underlying
prising well over half of the adult population in Iceland)5 or, more disease pathophysiology. One way to do that is to try to focus nar-
often, via combining samples from different study cohorts using rowly on the fewest amount of SNPs within a given region or locus
meta-analytic methods. to eventually even a single SNP. The SNP itself may play a role in pro-
Genotyping arrays used for GWASs do not directly genotype all tein function or processing, regulation of transcription, or may sim-
known SNP variations in the genome. Information for SNPs not di- ply be in linkage disequilibrium with other rare variants that are re-
rectly measured can be imputed using reference panels that cover sponsible for disease risk.7
a greater number of SNPs, spanning both SNPs directly measured
in a particular GWA study and those that were not included, along How Was the Method Used?
with information on the known SNP locations. Hauseretal2 performedadiscoveryGWASof2320patientswithPOAG
and 2121 unaffected control individuals without POAG of African an-
What Are the Limitations of the Method? cestry. Replication of the study was carried out in 5401 individuals with
GWASs have several limitations. Early in the evolution of GWASs, de- POAG and 13 015 control individuals of African ancestry for SNPs with
spite stringent significance thresholds, false-positive associations significant associations at genome-wide significance level
were common and many findings failed to replicate in subsequent (P < 5 × 10−8). A significant association was mapped to the APBB2
studies. Issues including differences in phenotyping of patients, the rs59892895T>C locus, whereas the minor allele C was observed to be
genotyping methods used, and a failure to account for artifacts in- associated with increased risk of POAG. A second de novo replication
troduced by subpopulations of patients in cohorts of differing an- that included 1536 individuals with POAG and 1902 control individu-
cestral backgrounds (population stratification) all likely contrib- als further confirmed the association. Functional studies of a very small
uted to challenges in replicating findings. Over time, methods for number of donor eyes suggested that individuals with African ances-
accounting for these and other technical issues have been im- try who carried the risk allele had higher levels of APBB2 in the retina.
proved, increasing the reliability of GWAS findings. This increased expression level was accompanied by increased levels
Genotyping arrays designed for GWASs rely on coinheritance or of cytotoxic β-amyloid, which colocalizes with retinal ganglion cells,
linkage disequilibrium between SNPs to provide coverage of the en- and the death of these cells defines POAG. Of interest, the mecha-
tire genome, even though only a subset of SNPs are characterized in nism hypothesized by the investigators of increased β-amyloid in the
anyparticularGWAS.Thus,oneidentifiedSNPusuallyrepresentsmany retina potentially links glaucoma to Alzheimer disease.2
others, and the identified associated SNP variants are unlikely to be
within the potentially causal effector genes. The most common as- How Should the Results Be Interpreted?
sociated SNP variants identified in GWASs are noncoding complicat- In their GWAS, Hauser et al2 also found that 26 SNPs from 15 loci that
ing attempts to establish the molecular effects of these GWAS loci. were previously identified to be associated with POAG in individu-
From a clinical perspective, it is tempting to use GWAS findings to als with European and Asian ancestry had significantly lower effect
predict disease risk. However, the predictive ability of SNP markers sizes in individuals with African ancestry. This finding coupled with
with very low ORs (including the SNP identified in the study by Hauser the finding of a risk-associated SNP that appears to be unique to in-
et al2) is quite poor. Approaches for aggregating multiple SNPs with dividuals of African ancestry suggests that genetic influences on
low ORs into a risk model, known as a polygenic risk score, are evolv- POAG in African populations could be different from those in
ing for a number of common conditions.6 To date, GWASs have been European and Asian populations. The identification of associated
conductedinmostlyEuropeanpopulations,sofindingsmaynotbegen- SNPs in a GWAS study, such as rs59892895 for POAG, may help to
eralizable to other populations. The study by Hauser et al2 is important identify and delineate the actions of effector transcripts, in this case
in that it conducts a GWAS including individuals of diverse African the APBB2 gene, thus yielding a possible explanatory mechanism for
ancestries, which comparatively few successful GWASs have done. the association and potential targets for new therapies.
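To make the case-control comparison concrete, the following sketch (with hypothetical allele counts, not data from Hauser et al) tests a single SNP by comparing allele frequencies between cases and controls and screens the result against the genome-wide threshold of P < 5 × 10−8:

```python
# A minimal sketch (not the authors' pipeline) of the core GWAS
# computation: for one SNP, compare allele frequency in cases vs
# controls with a 1-df chi-square test. All counts are hypothetical.
from math import erfc, sqrt

GENOME_WIDE_ALPHA = 5e-8  # conventional genome-wide significance threshold

def allele_test(case_alt, case_ref, ctrl_alt, ctrl_ref):
    """Pearson chi-square test (1 df) on a 2x2 table of allele counts."""
    a, b, c, d = case_alt, case_ref, ctrl_alt, ctrl_ref
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    p = erfc(sqrt(chi2 / 2))  # P(chi2_1 > x) = erfc(sqrt(x/2))
    return chi2, p

# Hypothetical counts: each case and control contributes 2 alleles
# (cohort sizes loosely modeled on the discovery sample).
chi2, p = allele_test(case_alt=950, case_ref=3690, ctrl_alt=590, ctrl_ref=3652)
print(f"chi2 = {chi2:.1f}, P = {p:.2g}, genome-wide significant: {p < GENOME_WIDE_ALPHA}")
```

In a real GWAS this test is repeated for millions of SNPs, which is why the significance threshold is so stringent.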
ARTICLE INFORMATION
Author Affiliations: The Institute for Translational Genomics and Population Sciences, Los Angeles Biomedical Research Institute, Department of Pediatrics, Harbor-UCLA Medical Center, Torrance, California.
Corresponding Author: Jerome I. Rotter, MD, The Institute for Translational Genomics and Population Sciences, Los Angeles Biomedical Research Institute, Department of Pediatrics, Harbor-UCLA Medical Center, 1124 W Carson St, E-5, Torrance, CA 90502 ([email protected]).
Section Editors: Roger J. Lewis, MD, PhD, Department of Emergency Medicine, Harbor-UCLA Medical Center and David Geffen School of Medicine at UCLA; and Edward H. Livingston, MD, Deputy Editor, JAMA.
Conflict of Interest Disclosures: Drs Rotter and Guo reported receiving grants from the National Institutes of Health.

REFERENCES
1. Hirschhorn JN, Lohmueller K, Byrne E, Hirschhorn K. A comprehensive review of genetic association studies. Genet Med. 2002;4(2):45-61.
2. Hauser MA, Allingham RR, Aung T, et al; The Genetics of Glaucoma in People of African Ancestry (GGLAD) Consortium. Association of genetic variants with primary open-angle glaucoma among individuals with African ancestry [published November 5, 2019]. JAMA. doi:10.1001/jama.2019.16161
3. Cao J, Zhang S. Multiple comparison procedures. JAMA. 2014;312(5):543-544.
4. Visscher PM, Brown MA, McCarthy MI, Yang J. Five years of GWAS discovery. Am J Hum Genet. 2012;90(1):7-24. doi:10.1016/j.ajhg.2011.11.029
5. Greely HT. The uneasy ethical and legal underpinnings of large-scale genomic biobanks. Annu Rev Genomics Hum Genet. 2007;8:343-364. doi:10.1146/annurev.genom.7.080505.115721
6. Sugrue LP, Desikan RS. What are polygenic scores and why are they important? JAMA. 2019;321(18):1820-1821. doi:10.1001/jama.2019.3893
7. Mahajan A, Taliun D, Thurner M, et al. Fine-mapping type 2 diabetes loci to single-variant resolution using high-density imputation and islet-specific epigenome maps. Nat Genet. 2018;50(11):1505-1513. doi:10.1038/s41588-018-0241-6
Figure. Relative Contribution of 50 Simulated Patients With Different Ages and Number of General Practice Visits Within the Past Year
[Three bubble plots; y-axis: General Practice Visits, No. (0-20); x-axis: Patient Age, y (40-90).]
Simulated according to the distribution of the same variables in the Zeng et al study.1 The bubble size reflects the relative contribution of each patient to the analysis. A, Each patient represents only themselves. Patients receiving tramadol are older with more general practitioner visits. B, Each patient with a dark blue dot was successfully matched to a similar patient receiving the other treatment and represents only themselves. Patients without a dark blue dot failed to find a match and are excluded, resulting in a smaller sample with a narrower range. C, After inverse probability of treatment weighting, some patients represent up to 8 other patients and others represent less than 1.
excluded from the population, but some represent up to 8 other patients while others represent less than 1 other patient.

The weighted sample mimics the potential population in an RCT drawn from the same target population as the original sample of treatment A plus B. In the study by Zeng et al1 this would include all 38 407 patients. The clinical relevance of the corresponding RCT depends on whether the sampled population is representative of patients to whom results will be generalized. When the sample includes patients who already received treatment A or B with near certainty, they are up-weighted by IPTW to have larger influence on the results. In the simulated example, patients older than 75 years with more than 16 GP visits in the past 2 years almost always received tramadol. The single patient who received diclofenac and had these characteristics got a large weight (Figure, C). This up-weighting, particularly when it is extreme, can result in inflated variance.6 There is substantial uncertainty about the average treatment effect for all patients when the target population is defined to include patients who nearly always receive 1 treatment. Other target populations can be studied by varying the definition of the weight.3

How Should the Propensity Analysis in Zeng et al Be Interpreted?
Zeng et al1 created a matched sample of 13 024 patients, representing a relatively narrow target population. IPTW would have used all 38 407 patients and represented the population from which they were all sampled. In this study, another 66 261 patients received other analgesics and could have been candidates to receive tramadol or diclofenac, further broadening the potential population that might have been considered for an RCT. None of these populations is inherently better, but each can be characterized through a baseline characteristics table. After matching in the analysis by Zeng et al,1 patients who received tramadol were younger (mean age, 70.3 vs 72.1) and had a lower mean (SD) number of GP visits (12.8 [9.7] vs 14.3 [12.8]). The reduction in SD suggests that the range of GP visits under evaluation narrowed, limiting the generalizability of the findings. To determine whether the target population that resulted from the application of propensity score methods is clinically relevant, investigators should undertake a comprehensive review of baseline characteristics, which goes beyond the simple comparison of means or medians.
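How the up-weighting arises can be sketched in a few lines, assuming the propensity scores (each patient's estimated probability of receiving tramadol given covariates) have already been fit, eg, by logistic regression; all values below are hypothetical:

```python
# Sketch of inverse probability of treatment weighting (IPTW) for the
# average treatment effect. Propensity scores and treatment flags are
# hypothetical, not data from Zeng et al.

def iptw_weights(treated, propensity):
    """ATE weights: 1/e for treated patients, 1/(1-e) for controls.
    A patient who nearly always receives one treatment (e near 0 or 1)
    gets an extreme weight, inflating the variance of the estimate."""
    return [1 / e if t else 1 / (1 - e) for t, e in zip(treated, propensity)]

treated = [1, 1, 0, 0, 0]                    # 1 = received tramadol
propensity = [0.80, 0.95, 0.30, 0.50, 0.90]  # hypothetical P(tramadol | covariates)
weights = iptw_weights(treated, propensity)
print([round(w, 2) for w in weights])  # -> [1.25, 1.05, 1.43, 2.0, 10.0]
```

The last control, who almost always would have received tramadol, stands in for 10 patients, which is the mechanism behind the inflated variance noted above.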
ARTICLE INFORMATION
Author Affiliations: Biostatistics and Bioinformatics, Duke University, Durham, North Carolina (Thomas); Department of Statistical Science, Duke University, Durham, North Carolina (Li); Duke Clinical Research Institute, Department of Biostatistics and Bioinformatics, Duke University, Durham, North Carolina (Pencina).
Corresponding Author: Michael Pencina, PhD, Duke Clinical Research Institute, Department of Biostatistics and Bioinformatics, Duke University, 2400 Pratt St, Durham, NC 27705 ([email protected]).
Section Editors: Roger J. Lewis, MD, PhD, Department of Emergency Medicine, Harbor-UCLA Medical Center and David Geffen School of Medicine at UCLA; and Edward H. Livingston, MD, Deputy Editor, JAMA.
Published Online: January 10, 2020. doi:10.1001/jama.2019.21558
Conflict of Interest Disclosures: Dr Pencina reported receiving grants from Sanofi/Regeneron and Amgen and personal fees from Merck and Boehringer Ingelheim outside the submitted work. No other disclosures were reported.

REFERENCES
1. Zeng C, Dubreuil M, LaRochelle MR, et al. Association of tramadol with all-cause mortality among patients with osteoarthritis. JAMA. 2019;321(10):969-982. doi:10.1001/jama.2019.1347
2. Haukoos JS, Lewis RJ. The propensity score. JAMA. 2015;314(15):1637-1638.
3. Li F, Morgan KL, Zaslavsky AM. Balancing covariates via propensity score weighting. J Am Stat Assoc. 2018;113(521):390-400.
4. Stuart EA. Matching methods for causal inference. Stat Sci. 2010;25(1):1-21.
5. London AJ. Equipoise in research: integrating ethics and science in human research. JAMA. 2017;317(5):525-526. doi:10.1001/jama.2017.0016
6. Li F, Thomas LE, Li F. Addressing extreme propensity scores via the overlap weights. Am J Epidemiol. 2019;188(1):250-257.
Use of the Method
Why Is Health State Utility Assessment Used?
Although survival or freedom from some major event such as myocardial infarction is an important treatment outcome, it does not explicitly account for a patient's quality of life. The quality-adjusted life-year (QALY) was developed to measure health outcomes in terms of both quantity and quality.1 QALYs are calculated by multiplying the duration of time spent in a health state by the utility associated with that health state; for instance, 1 year spent in perfect health (1 year × a utility value of 1) is equal to 1 QALY. Totaling the QALYs accumulated by a patient during a specified follow-up period provides a summary measure of the health outcomes achieved.

Description of Direct Methods for Utility Assessment
To measure utilities, health states of interest must first be defined. Selection of the health states typically depends on the health states and events required for the eventual economic model. Descriptions of the health states are often developed from several sources, such as patient and physician interviews, trial data, and published literature.4 People are then asked to evaluate these health states to assign them a numerical value.

Depending on the goal of the study, participants may include members of the general public, patients with a disease who have not yet experienced the health state, or patients with a disease who are experiencing the health state.5 Patients may be more familiar with their health states than members of the general public; however, the general public may be more relevant for the development of economic models used to guide allocation of societal resources.6

Direct methods for utility assessment ask respondents about their preferences for health states. The simplest method for directly assessing preferences is the visual analog scale. When using the visual analog scale, respondents are asked to estimate where on a scale they feel a health state falls to determine their preference (Figure). When performing a time trade-off study, respondents are asked to choose between 2 scenarios: living for a period in an impaired health state or living with perfect health for a shorter period.1 The standard gamble method incorporates an element of risk in the decision posed to respondents.6 When using the standard gamble method, respondents are asked to choose between 2 options: remaining in an impaired health state or taking a gamble, in which there is a chance of returning to perfect health and a chance of immediate death.

[Figure. Direct assessment of the utility of incontinence: a plot of quality of life (0-0.9 axis) against time (0-10 y) depicts 10 y with incontinence for the time trade-off, and panel C (standard gamble) shows the decision between remaining with incontinence and a gamble with a 70% chance of perfect health and a 30% chance of death.]
The health state of incontinence was valued at 0.7 in A-C. The duration of time spent in perfect health is varied in B until respondents indicate an equal preference between the choices, with the preference derived from the durations of life offered at this point of indifference. For example, if 10 years of life with incontinence is equally preferred to 7 years with perfect health, the health state of incontinence is valued at 7 years/10 years = 0.7. In C, respondents are willing to tolerate a higher chance of death in exchange for a chance of cure from worse states of health. The probability of a cure at the point of indifference determines the utility. For example, if the respondent shows equal preference for living with incontinence and taking a pill with a 70% chance of cure and a 30% chance of death, the utility of incontinence is valued at 70% for probability of a cure = 70/100 = 0.7. This figure was adapted from Whitehead and Ali.1

Limitations of Direct Methods for Utility Assessment
The visual analog scale may not be as good as other methods because it involves a rating rather than a choice task and does not reflect how patients make medical decisions. The visual analog scale
jama.com (Reprinted) JAMA March 17, 2020 Volume 323, Number 11 1085
may also be subject to biases and ceiling or floor effects, in which the set limits of the scale constrain its ability to capture the complete range of responses.5 To reflect equal outcomes in the calculation of QALYs, respondents who estimate a utility of 0.5 for a health state should be willing to trade half of their life expectancy in exchange for a return to perfect health (eg, 1 year spent at a utility of 0.5 should have the same value as 6 months with perfect health). In contrast to the visual analog scale, the time trade-off and the standard gamble methods ask respondents to evaluate a health state in terms of an exchange for duration or probability of perfect health.

When using direct methods, respondents are asked to consider complex concepts and make decisions about hypothetical situations they have not actually experienced. Consequently, they may underestimate the effect of treatment complications or become confused by the complicated scenarios they are asked to consider. Direct methods are also subject to framing effects, in which the format or wording of the description of a health state may shape responses.5 The hypothetical nature of these methods also means respondents may not have well-formed preferences for certain aspects of their health, allowing for further bias by the manner of questioning.6

Because it includes a gamble, values determined using the standard gamble method may be sensitive to whether a respondent is a risk-taker or risk averse.6 In contrast, the time trade-off may be subject to time preference because respondents may value years in the near term more than years later in life, and their preferences may be influenced by the duration of life offered in the trade-off.6 Both the standard gamble and the time trade-off methods may be subject to ceiling effects as well, in particular when assessing adverse outcomes that are of importance to respondents but that they would not take a mortality risk to avoid (eg, sexual dysfunction).

How Were Direct Methods for Utility Assessment Used?
Faris et al3 used the visual analog scale, time trade-off, and standard gamble methods to determine respondents' preferences for 5 health states. Naive observers were selected as the target population to reflect societal preferences, with the instruments administered electronically using a web-based application. The survey was initially assessed for language, duration, and comprehension by a team of volunteers prior to administration to improve understanding of the complex concepts presented.

Visual depictions of the probabilities were included to facilitate understanding of mathematical concepts. A trained proctor was present for additional clarification. Due to the importance of aesthetics in the health states evaluated, a unique strength of the study was the inclusion of videos in the depictions of the health states.

How Should a Health State Utility Assessment Be Interpreted?
Faris et al3 found no differences in preferences for 3 of the 5 health states for the diseases studied using all 3 methods. Two of the clinical outcomes studied were substantially favored over the others using all 3 methods. As seen in this study, there may be variability in the values derived by each method.5 The visual analog scale may result in lower preference values because there is no consequence to giving a health state a lower value. Although not seen in the study by Faris et al,3 the standard gamble may yield higher utility values because respondents tend to avoid the risk of immediate death.

What Caveats Should the Reader Consider?
Utility assessment varies for a variety of reasons, including the method used for deriving the utility and the population assessed. Even though many of these methods are rooted in formal mathematical theory, gauging patient preferences is a psychological process that results in variable responses. Including an assessment of variance and modeling the potential effect of prognostic and demographic factors is critical. When using utilities in economic modeling, sensitivity analyses, which assess the effect of altering the values for variables included in the models, are essential in interpreting results.
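The arithmetic behind these direct methods is simple enough to check with a short script, shown here for the hypothetical incontinence example from the Figure (not data from Faris et al):

```python
# Worked check of direct utility assessment arithmetic for a
# hypothetical health state (incontinence, utility 0.7), and the
# QALYs implied by 10 years spent in that state.

def time_tradeoff_utility(years_perfect, years_impaired):
    """Utility = years in perfect health / years in the impaired state
    that the respondent judges equally preferable (point of indifference)."""
    return years_perfect / years_impaired

def standard_gamble_utility(p_cure):
    """Utility = probability of cure at the point of indifference;
    1 - p_cure is the accepted risk of immediate death."""
    return p_cure

u_tto = time_tradeoff_utility(7, 10)  # 7 y perfect ~ 10 y with incontinence
u_sg = standard_gamble_utility(0.70)  # indifferent at 70% cure / 30% death
qalys = 10 * u_tto                    # 10 y at utility 0.7 -> 7 QALYs
print(f"TTO utility {u_tto:.2f}, SG utility {u_sg:.2f}, QALYs {qalys:.1f}")
```

Both methods value the state at 0.7, so a decade in that state contributes 7 QALYs to an economic model.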
ARTICLE INFORMATION
Author Affiliations: Department of Radiation Oncology, University of California, Los Angeles (Chang, Raldow); Department of Urology, University of California, Los Angeles (Saigal).
Corresponding Author: Ann C. Raldow, MD, MPH, Department of Radiation Oncology, University of California, 200 Medical Plaza Driveway, Ste B265, Los Angeles, CA 90095 ([email protected]).
Section Editors: Roger J. Lewis, MD, PhD, Department of Emergency Medicine, Harbor-UCLA Medical Center and David Geffen School of Medicine at UCLA; and Edward H. Livingston, MD, Deputy Editor, JAMA.
Published Online: February 24, 2020. doi:10.1001/jama.2020.0656
Conflict of Interest Disclosures: …shared decision-making aids. Dr Raldow reported being a consultant for Intelligent Automation Inc and receiving honoraria from Varian Medical Systems. No other disclosures were reported.

REFERENCES
1. Whitehead SJ, Ali S. Health outcomes in economic evaluation: the QALY and utilities. Br Med Bull. 2010;96(1):5-21. doi:10.1093/bmb/ldq033
2. Wailoo AJ, Hernandez-Alava M, Manca A, et al. Mapping to estimate health-state utility from non–preference-based outcome measures: an ISPOR Good Practices for Outcomes Research Task Force Report. Value Health. 2017;20(1):18-27. doi:10.1016/j.jval.2016.11.006
3. Faris C, Tessler O, Heiser A, Hadlock T, Jowett N. Evaluation of societal health utility of facial palsy and facial reanimation. JAMA Facial Plast Surg. 2018;20(6):480-487. doi:10.1001/jamafacial.2018.0866
4. Wolowacz SE, Briggs A, Belozeroff V, et al. Estimating health-state utility for economic models in clinical studies: an ISPOR Good Research Practices Task Force Report. Value Health. 2016;19(6):704-719. doi:10.1016/j.jval.2016.06.001
5. Neumann PJ, Sanders GD, Russell LB, Siegel JE, Ganiats TG, eds. Cost Effectiveness in Health and Medicine. 2nd ed. Oxford University Press; 2017.
6. Hunink MGM, Weinstein MC, Wittenberg E, et al. Decision Making in Health and Medicine: Integrating Evidence and Values. 2nd ed. Cambridge University Press; 2014. doi:10.1017/CBO9781139506779
Random-Effects Meta-analysis
Summarizing Evidence With Caveats
Stylianos Serghiou, MD; Steven N. Goodman, MD, PhD
Questions involving medical therapies are often studied more than once. For example, numerous clinical trials have been conducted comparing opioids with placebos or nonopioid analgesics in the treatment of chronic pain. In the December 18, 2018, issue of JAMA, Busse et al1 evaluated the evidence on opioid efficacy from 96 randomized clinical trials and, as part of that work, used random-effects meta-analysis to synthesize results from 42 randomized clinical trials on the difference in pain reduction among patients taking opioids vs placebo using a 10-cm visual analog scale (Figure 2 in Busse et al).1 Meta-analysis is the process of quantitatively combining study results into a single summary estimate and is a foundational tool for evidence-based medicine. Random-effects meta-analysis is the most common approach.

Why Is Random-Effects Meta-analysis Used?
Each study evaluating the effect of a treatment provides its own answer in terms of an observed or estimated effect size. Opioids reduced pain by 0.54 cm more than placebo on a visual analog scale in 1 study2; this was the observed effect size and represents the best estimate from that study of the true opioid effect. The true effect is the underlying benefit of opioid treatment if it could be measured perfectly, and is a single value that cannot directly be known.

If a particular study was replicated with new patients in the same setting multiple times, the observed treatment effects would vary by chance even though the true effects would be the same in each. The belief that the true effect was the same in each study is called the fixed-effect assumption, whereby the fixed effect is the common, unknown true effect underlying each replication. A meta-analysis making the fixed-effect assumption is called a fixed-effect meta-analysis. The corresponding fixed-effect estimate of the treatment effect is a weighted average of the individual study estimates and is always more precise (ie, it has a narrower confidence interval [CI]) than that of any individual study, making the estimate appear closer to the true value than any individual study.

However, medical studies addressing the same question are typically not exact replications and they can use different types of medication or interventions for different amounts of time, at different intensities, within different populations, and have differently measured outcomes.3 Differences in study characteristics reduce the confidence that each study is actually estimating the same true effect. The alternative assumption is that the true effects being estimated are different from each other, or heterogeneous. In statistical jargon, this is called the random-effects assumption. The plural in effects implies there is more than 1 true effect, and random implies that the reasons the true effects differ are unknown.

A random-effects assumption is less restrictive than a fixed-effect assumption and reflects the variation or heterogeneity in the true effects estimated by each trial. This usually results in a more realistic estimate of the uncertainty in the overall treatment effect with larger CIs than would be obtained if a fixed effect was assumed. A random-effects model can also be used to provide differing, study-specific estimates of the treatment effect in each trial, something that cannot be done under the fixed-effect assumption.

Description of Random-Effects Meta-analysis
In a random-effects meta-analysis, the statistical model estimates multiple parameters. First, the model estimates a separate treatment effect for each trial, representing the estimate of the true effect for the trial. The assumption that the true effects can vary from trial to trial is the foundation for a random-effects meta-analysis. Second, the model estimates an overall treatment effect, representing an average of the true effects over the group of studies included. Third, the model estimates the variability or degree of heterogeneity in the true treatment effects across trials. Compared with a fixed-effect estimate, the random-effects estimate for the overall effect is more influenced by smaller studies and has a wider CI, reflecting not just the chance variation that is reflected in a fixed-effect estimate, but also the variation among the true effects.4 In the report by Busse et al,1 the random-effects average opioid benefit was −0.69 cm (95% CI, −0.82 to −0.56 cm).

Whether the variability observed in the estimates of treatment effect is consistent with chance variation alone is reflected in statistical measures of heterogeneity, often expressed as an I2, the percentage of total variation in the random-effects estimate due to heterogeneity in the true underlying treatment effects. An I2 value greater than 50% to 75% is considered large.5 Busse et al1 report an I2 of 70.4%, reflecting the marked variation among studies, which is also demonstrated by nonoverlapping CIs around some individual treatment estimates.

A more natural heterogeneity measure is the standard deviation of the true effects, often denoted as τ. A τ of 0.35 cm can be derived from the data in Figure 2 in the article by Busse et al.1 Given the overall random-effects estimate of −0.69 cm, this means that the true effects in individual studies could vary over the range of −0.69 cm ± 2τ, or −1.39 cm to 0.01 cm, namely a true benefit in some studies roughly twice as large as the average and no benefit in some others. This reflects the display provided in the study by Busse et al1 in which 10 of 42 studies estimated a benefit larger than 1 cm, which was the minimum clinically important difference. Quantifying the variability in treatment effects among studies helps readers decide whether combining these results makes sense. Like the proverbial person said to be at normal average temperature with 1 foot in ice and the other in boiling water, the estimated average effect can be nonsensical if the true individual study effects are too variable.

Why Did the Authors Use Random-Effects Meta-analysis?
Meta-analyses incorporate some uncertainties that mathematical summaries cannot reflect. A sensible approach is to use the
jama.com (Reprinted) JAMA January 22, 2019 Volume 321, Number 3 301
statistical method least likely to overstate certainty almost regardless of perceptions or philosophy about true effects being fixed or random, which is why random-effects models are a frequent choice in meta-analyses.

The studies in the report by Busse et al1 demonstrate substantial variability, being both qualitatively and quantitatively different. Tominaga et al6 examined the effect of tapentadol extended-release tablets in Japanese patients with either chronic osteoarthritic knee pain or lower back pain, whereas Simpson and Wlodarczyk7 examined the effect of transdermal buprenorphine in Australian patients with diabetic peripheral neuropathic pain. These studies used different opioids to treat different sources of pain in culturally different populations that may assess pain differently; indeed, the variability in observed effects between the 2 studies suggests that the differences seen are probably beyond chance variation. Taken together, these features provide substantial evidence that these studies are not examining the same effect, consistent with the random-effects assumption. Thus, a fixed effect is not plausible and a random-effects meta-analysis is the appropriate method.

What Are Limitations of a Random-Effects Meta-analysis?
First, the random-effects model does not explain heterogeneity, it merely incorporates it. The standard recommendation is that researchers should attempt to reduce heterogeneity8,9 by using subgroups of studies or a meta-regression; however, such methods represent exploratory data-dependent exercises and their results must be interpreted accordingly.

What Are the Caveats to Consider When Assessing the Results of a Random-Effects Meta-analysis?
The overall summary effect of a random-effects meta-analysis is representative of the study-specific true effects without the estimate representing a true effect (ie, there may be no population of patients or interventions for which this summary value is true). This is why a random-effects meta-analysis should be interpreted with consideration of the qualitative and quantitative heterogeneity, particularly the range of effects calculated using ±2 × τ. If the range is too broad, or if I2 exceeds 50% to 75%, the meta-analytic estimate might be too unrepresentative of the underlying effects, potentially obscuring important differences. Busse et al1 assessed qualitative heterogeneity partly through their assessment of directness and judged it to be minimal, although that may not capture every dimension of importance.

Other caveats apply to all meta-analyses and include whether the analyses include all relevant studies, whether the studies are representative of the population of interest, whether study exclusions are justified, and whether study quality was adequately assessed. Sometimes heterogeneity reflects a mixture of diverse biases, with few of the studies properly estimating even their own true effects. Busse et al1 addressed this by using a risk-of-bias assessment.

How Should the Results of a Random-Effects Meta-analysis Be Interpreted in This Particular Study?
Three of 8 meta-analyses reported in Busse et al1 (using 7 different outcome measures) have an I2 of 50% or greater, and the meta-analysis we have been discussing that used the outcome of pain has
Second, there are many approaches to calculating the random- an I2 of 70.4%, reflecting conflict or a high degree of variability among
effects estimates. Although most produce similar estimates, the studies. The meta-analysis by Busse et al1 provided strong evi-
DerSimonian-Laird method is the most widely used and it pro- dence against opioids increasing pain and suggested that opioids are
duces CIs that are too narrow and P values that are too small when generally likely to reduce chronic noncancer pain by a modest 0.69
there are few studies (<10-15) and sizable heterogeneity; accord- cm more than placebo (less than the 1 cm minimum clinically impor-
ingly, this approach is not optimal in the setting of few studies and tant difference). However, in view of the amount of heterogeneity,
high heterogeneity and often may be contraindicated.10 it is possible that in some settings and patients, the benefit of opi-
Third, small studies more strongly influence estimates from oids could be lesser or greater than this random-effects estimate.
random-effects than from fixed-effect models; in fact, the larger the As such, physicians should consider this summary result with cau-
heterogeneity, the larger their relative influence. If smaller studies are tion, and in conjunction with the effect from the subset of studies
judged as more likely to be biased, this can be a substantial concern. most relevant to the patients they need to treat.
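The quantities discussed above (Cochran's Q, the between-study variance τ², I², and the ±2 × τ range) can be computed with a short script. The sketch below uses the standard DerSimonian-Laird formulas; it is an illustration, not code from the meta-analysis under discussion, and the three study effects and variances are invented numbers.

```python
import math

def dersimonian_laird(effects, variances):
    """Random-effects pooled estimate via the DerSimonian-Laird method."""
    w = [1.0 / v for v in variances]                  # inverse-variance weights
    sw = sum(w)
    fixed = sum(wi * y for wi, y in zip(w, effects)) / sw
    q = sum(wi * (y - fixed) ** 2 for wi, y in zip(w, effects))   # Cochran's Q
    df = len(effects) - 1
    c = sw - sum(wi * wi for wi in w) / sw
    tau2 = max(0.0, (q - df) / c)                     # between-study variance
    i2 = 100.0 * max(0.0, (q - df) / q) if q > 0 else 0.0   # I² heterogeneity
    w_re = [1.0 / (v + tau2) for v in variances]      # random-effects weights
    pooled = sum(wi * y for wi, y in zip(w_re, effects)) / sum(w_re)
    se = math.sqrt(1.0 / sum(w_re))
    return pooled, se, tau2, i2

# Three hypothetical studies with very different true effects
pooled, se, tau2, i2 = dersimonian_laird([-1.0, 0.0, 1.0], [0.04, 0.04, 0.04])
tau = math.sqrt(tau2)
# Approximate range of true study effects discussed in the text: pooled ± 2τ
low, high = pooled - 2 * tau, pooled + 2 * tau
```

With these invented inputs, I² is 96% and the ±2 × τ range spans roughly −2 to 2 around a pooled estimate of 0, illustrating how a summary estimate can sit in the middle of mutually inconsistent study effects.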
ARTICLE INFORMATION
Author Affiliations: Department of Health Research and Policy, Stanford University School of Medicine, Stanford, California (Serghiou, Goodman); Meta-research Innovation Center at Stanford, Stanford, California (Serghiou, Goodman); Department of Medicine, Stanford University School of Medicine, Stanford, California (Goodman).
Corresponding Author: Steven N. Goodman, MD, PhD, 150 Governor's Ln, Stanford, CA 94305 ([email protected]).
Section Editors: Roger J. Lewis, MD, PhD, Department of Emergency Medicine, Harbor-UCLA Medical Center and David Geffen School of Medicine at UCLA; and Edward H. Livingston, MD, Deputy Editor, JAMA.
Published Online: December 19, 2018. doi:10.1001/jama.2018.19684
Conflict of Interest Disclosures: None reported.

REFERENCES
1. Busse JW, Wang L, Kamaleldin M, et al. Opioids for chronic noncancer pain: a systematic review and meta-analysis. JAMA. 2018;320(23):2448-2460. doi:10.1001/jama.2018.18472
2. Wen W, Sitar S, Lynch SY, et al. A multicenter, randomized, double-blind, placebo-controlled trial to assess the efficacy and safety of single-entity, once-daily hydrocodone tablets in patients with uncontrolled moderate to severe chronic low back pain. Expert Opin Pharmacother. 2015;16(11):1593-1606.
3. Serghiou S, Patel CJ, Tan YY, et al. Field-wide meta-analyses of observational associations can map selective availability of risk factors and the impact of model specifications. J Clin Epidemiol. 2016;71:58-67.
4. Nikolakopoulou A, Mavridis D, Salanti G. Demystifying fixed and random effects meta-analysis. Evid Based Ment Health. 2014;17(2):53-57.
5. Higgins JP, Thompson SG, Deeks JJ, Altman DG. Measuring inconsistency in meta-analyses. BMJ. 2003;327(7414):557-560.
6. Tominaga Y, Koga H, Uchida N, et al. Methodological issues in conducting pilot trials in chronic pain as randomized, double-blind, placebo-controlled studies. Drug Res (Stuttg). 2016;66(7):363-370. doi:10.1055/s-0042-107669
7. Simpson RW, Wlodarczyk JH. Transdermal buprenorphine relieves neuropathic pain. Diabetes Care. 2016;39(9):1493-1500. doi:10.2337/dc16-0123
8. Thompson SG, Sharp SJ. Explaining heterogeneity in meta-analysis. Stat Med. 1999;18(20):2693-2708.
9. Riley RD, Higgins JP, Deeks JJ. Interpretation of random effects meta-analyses. BMJ. 2011;342:d549.
10. Cornell JE, Mulrow CD, Localio R, et al. Random-effects meta-analysis of inconsistent effects. Ann Intern Med. 2014;160(4):267-270.
Randomized trials serve as the standard for comparative studies of treatment effects. In many settings, it may not be feasible or ethical to conduct a randomized study,1 and researchers may pursue observational studies to better understand clinical outcomes. A central limitation of observational studies is the potential for confounding bias that arises because treatment assignment is not random. Thus, the observed associations may be attributable to differences other than the treatment being investigated and causality cannot be assumed.

In the October 16, 2018, issue of JAMA, results from a large, multisite observational study of the association between bariatric surgery and long-term macrovascular disease outcomes among patients with severe obesity and type 2 diabetes were reported by Fisher et al.2 Using data from 5301 patients aged 19 to 79 years who underwent bariatric surgery at 1 of 4 integrated health systems in the United States between 2005 and 2011 and 14 934 matched nonsurgical patients, they found that bariatric surgery was associated with a 40% lower incidence of macrovascular disease at 5 years (2.1% in the surgical group and 4.3% in the nonsurgical group; hazard ratio [HR], 0.60 [95% CI, 0.42-0.86]).

Two strategies were used to mitigate confounding bias. In the first, a matched cohort design was used where nonsurgical patients were matched to surgical patients on the basis of a priori–identified potential confounders (study site, age, sex, body mass index, hemoglobin A1c level, insulin use, observed diabetes duration, and prior health care use). In the second strategy used to adjust for confounding bias, the primary results were based on the fit of a multivariable Cox model that adjusted for all of the factors used in the matching as well as a broader range of potential confounders (Table 1 in the article2). Thus, any imbalances in the observed potential confounders that remained after the matching process were controlled for by the statistical analysis. Despite these efforts, however, given the observational design, the potential for unmeasured confounding remained.

Why Is the E-Value Used?
While matching and regression-based analyses provide some control of confounding, that control extends only to factors that are measured. The potential for confounding from factors that were not measured in the study still exists. To assess how much of a problem unmeasured confounding factors may pose, researchers may conduct a sensitivity or bias analysis.3 Common to most of these sensitivity analysis methods is the use of a formula for which 2 inputs are required: (1) the strength and direction of the association between the unmeasured confounder and treatment choice and (2) the strength and direction of association between the unmeasured confounder and outcome.4

Furthermore, additional inputs or information may be needed, such as the prevalence of the unmeasured confounder and how it is associated with measured confounders. When it is known what the unmeasured confounder is, these could potentially be obtained from published studies and/or through other data sources. For example, smoking is a known risk factor for developing cardiovascular disease, but the smoking status of an individual patient may not be included in a database. In this case, an assumption about the prevalence of smoking could be made based on prior research where smoking status was considered in similar clinical conditions. However, the prevalence cannot be estimated for a true unknown confounder. Additionally, many approaches require making simplifying assumptions, such as that the unmeasured confounder is binary. Thus, sensitivity analyses of this type can only proceed once additional information, typically in the form of a series of inputs for some formulas, has been specified by investigators. Because decisions about each of the assumptions can affect the analysis results, the most rigorous approach to these types of sensitivity analyses would involve investigators considering a broad range of values for each input and then examining how the results are influenced.

While achievable in principle, this approach has limitations. First, the approach has been criticized as being susceptible to misuse, in the sense that an investigator could choose to focus on assumptions that make the original result seem robust. Second, if many scenarios are considered, there is potential for conflicting results within the sensitivity analysis, which may make it difficult to draw firm conclusions.

The E-value is an alternative approach to sensitivity analyses for unmeasured confounding in observational studies that avoids making assumptions that, in turn, require subjective assignment of inputs for some formulas.4 Specifically, an E-value analysis asks the question: how strong would the unmeasured confounding have to be to negate the observed results?5 The E-value itself answers this question by quantifying the minimum strength of association on the risk ratio scale that an unmeasured confounder must have with both the treatment and outcome, while simultaneously considering the measured covariates, to negate the observed treatment–outcome association. If the strength of unmeasured confounding is weaker than indicated by the E-value, then the main study result could not be overturned to one of "no association" (ie, moving the estimated risk ratio to 1.0) by the unmeasured confounder. E-values can therefore help assess the robustness of the main study result by considering whether unmeasured confounding of this magnitude is plausible. The E-value provides a measure related to the evidence for causality, hence the name "E-value."

The E-value has many appealing features. First, in contrast to standard methods for sensitivity analysis, it requires no assumptions from investigators. Second, it is intuitive because the lowest possible number is 1. The higher the E-value is, the stronger the unmeasured confounding must be to explain the observed association.
602 JAMA February 12, 2019 Volume 321, Number 6 (Reprinted) jama.com
Third, the calculation is also readily applied to the bounds of a 95% CI. Thus, investigators can assess the extent of unmeasured confounding that would be required to shift the confidence interval so that it includes a risk ratio of 1.0 (ie, no association). Fourth, the E-value is simple to calculate for a range of effect measures (including relative risks, HRs, and risk differences) and study designs. The formulas for the E-value for different effect measures, including continuous outcomes, are available4 and the E-value has been implemented in freely available software and an online calculator (https://evalue.hmdc.harvard.edu/app/).6

What Are the Limitations of the E-Value?
The E-value is a general tool for sensitivity analyses that does not require assumptions about the nature of the unmeasured confounder. In some settings, investigators may be amenable to making assumptions (eg, about the prevalence of an unmeasured confounder) so that their sensitivity analyses can be tailored to their specific study design and/or statistical analyses. Such analyses, however, should always be considered in the context of the plausibility of the assumptions made.

Why Did the Authors Use the E-Value in This Particular Study?
The data used by Fisher and colleagues2 were abstracted retrospectively from the medical record databases of 4 integrated health care systems and, as such, are representative of clinical decisions and care at these institutions. Because the investigative team did not have control over whether patients underwent bariatric surgery (ie, treatment was not randomly assigned), the potential for unmeasured confounding bias needed to be acknowledged and thoroughly investigated.

How Should the E-Value Findings Be Interpreted in This Particular Study?
Fisher and colleagues2 found that bariatric surgery was associated with a lower composite incidence of macrovascular events at 5 years (2.1% in the surgical group vs 4.3% in the nonsurgical group) that had an HR of 0.60 (95% CI, 0.42-0.86). The E-value for this was 2.72, meaning that residual confounding could explain the observed association if there exists an unmeasured covariate having a relative risk association at least as large as 2.72 with both macrovascular events and with bariatric surgery. The E-value for the upper limit of the confidence interval was 1.60. In the Fisher et al2 study, the HRs for some of the known, powerful macrovascular disease risk factors were 1.09 (95% CI, 0.85-1.41) for hypertension, 1.88 (95% CI, 1.34-2.63) for dyslipidemia, and 1.48 (95% CI, 1.17-1.87) for being a current smoker. It is not likely that an unmeasured or unknown confounder would have a substantially greater effect on macrovascular disease development than these known risk factors by having a relative risk exceeding 2.72.

Caveats to Consider When Looking at Results Based on the E-Value
E-values must be interpreted, and indeed only have meaning, within the context of the study at hand. In particular, their magnitude may be large or small depending on the magnitude of the associations of other risk factors. For example, if most other risk factors have an HR of 1.1, then an E-value of 1.3 will be relatively large because unmeasured confounding would have to have much larger effects than most risk factors to explain away the reported association. In contrast, if many risk factors have an HR of 2.0, then an E-value of 1.3 will be relatively modest. The adjustments that have been performed (ie, for the observed confounders) should also be considered.
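The E-value itself is a one-line formula from VanderWeele and Ding (reference 5): for a risk ratio RR > 1, E = RR + sqrt(RR × (RR − 1)), with protective ratios inverted first. The sketch below treats the reported HR as an approximate risk ratio and reproduces the values quoted in the text; it is an illustration, not the authors' analysis code.

```python
import math

def e_value(rr: float) -> float:
    """E-value for a risk ratio (VanderWeele & Ding formula).
    Protective ratios (rr < 1) are inverted before applying the formula."""
    if rr < 1:
        rr = 1.0 / rr
    return rr + math.sqrt(rr * (rr - 1.0))

# Point estimate and CI bound from Fisher et al: HR 0.60 (95% CI, 0.42-0.86)
print(round(e_value(0.60), 2))   # 2.72, as reported in the text
print(round(e_value(0.86), 2))   # 1.60, E-value for the CI limit closer to 1.0
```

Applying the same calculation to the CI bound nearer the null is what the article means by shifting the confidence interval to include a risk ratio of 1.0.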
ARTICLE INFORMATION
Author Affiliations: Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, Massachusetts (Haneuse, VanderWeele); Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, Massachusetts (VanderWeele); Kaiser Permanente Washington Health Research Institute, Seattle, Washington (Arterburn); Department of Medicine, University of Washington, Seattle (Arterburn).
Corresponding Author: Sebastien Haneuse, PhD, Department of Biostatistics, Harvard T.H. Chan School of Public Health, 655 Huntington Ave, Bldg II, Room 407, Boston, MA 02115 ([email protected]).
Section Editors: Roger J. Lewis, MD, PhD, Department of Emergency Medicine, Harbor-UCLA Medical Center and David Geffen School of Medicine at UCLA; and Edward H. Livingston, MD, Deputy Editor, JAMA.
Published Online: January 24, 2019. doi:10.1001/jama.2018.21554
Conflict of Interest Disclosures: Drs Haneuse and VanderWeele reported receiving grants from the National Institutes of Health. Dr Arterburn reported receiving grants from the National Institutes of Health and Patient-Centered Outcomes Research Institute. No other disclosures were reported.

REFERENCES
1. Courcoulas AP, Yanovski SZ, Bonds D, et al. Long-term outcomes of bariatric surgery: a National Institutes of Health symposium. JAMA Surg. 2014;149(12):1323-1329. doi:10.1001/jamasurg.2014.2440
2. Fisher DP, Johnson E, Haneuse S, et al. Association between bariatric surgery and macrovascular disease outcomes in patients with type 2 diabetes and severe obesity. JAMA. 2018;320(15):1570-1582. doi:10.1001/jama.2018.14619
3. Lash TL, Fox MP, Fink AK. Applying Quantitative Bias Analysis to Epidemiologic Data. Berlin, Germany: Springer Science & Business Media; 2011.
4. Ding P, VanderWeele TJ. Sensitivity analysis without assumptions. Epidemiology. 2016;27(3):368-377. doi:10.1097/EDE.0000000000000457
5. VanderWeele TJ, Ding P. Sensitivity analysis in observational research: introducing the E-value. Ann Intern Med. 2017;167(4):268-274. doi:10.7326/M16-2607
6. Mathur MB, Ding P, Riddell CA, VanderWeele TJ. Web site and R Package for computing E-values. Epidemiology. 2018;29(5):e45-e47. doi:10.1097/EDE.0000000000000864
Mediation Analysis
Hopin Lee, PhD; Robert D. Herbert, PhD; James H. McAuley, PhD
In a 2018 study published in JAMA Network Open, Silverstein et al1 used mediation analysis to investigate how a problem-solving educational program prevented depressive symptoms in low-income mothers. Using data from a randomized trial, the authors tested 8 plausible mechanisms by which the intervention could have its effects. They concluded that problem-solving education reduced the risk of depressive symptoms in low-income mothers primarily by reducing maternal stress.

Use of the Method
Why Is Mediation Analysis Used?
The effects of health and medical interventions are often presumed to work through specific biological or psychosocial mechanisms. Possible mechanisms can be evaluated using mediation analysis.

Description of Mediation Analysis
In mediation analysis, the effect of an intervention on an outcome is partitioned into indirect and direct effects. Indirect effects work through mediators of interest, whereas direct effects work through other mechanisms. These effects are often shown in a causal diagram (Figure). Mediation analysis can estimate indirect and direct effects and the proportion mediated, a statistical measure estimating how much of the total intervention effect works through a particular mediator.

Two broad analytical approaches are used to conduct a mediation analysis: statistical and causal. Statistical mediation analysis uses regression models to estimate the strength of intervention-mediator and mediator-outcome effects. These regression coefficients can then be multiplied to estimate the indirect effect.2 Statistical mediation analysis is limited by its inability to accurately model situations in which there are nonlinear relationships between the intervention, mediator, and outcome or when there is an interaction between the intervention and the mediator.2 Causal mediation analysis is more general and rigorous. It is more general because it allows for nonlinear relationships and interactions,3 and more rigorous because it explicitly outlines the assumptions that are necessary for
Figure. Mediation Analysis Applied to a Study of Problem-Solving Education (PSE) to Prevent Maternal Depression

[Causal diagram: arrows run from PSE to each candidate mediator (problem-solving ability, mastery, perceived stress, self-esteem, social coping, and behavioral activation) and from each mediator to the rate of depressive symptom elevations. Each PSE-to-mediator arrow carries a standardized regression coefficient with a 95% CI, and each mediator-to-outcome arrow carries an adjusted rate ratio with a 95% CI. Summary estimates: specific indirect effect of PSE through perceived stress, 0.91 (95% CI, 0.85-0.98); total indirect effect of PSE through all mediators, 0.89 (0.81-0.98); direct effect of PSE, 0.72 (0.53-0.98).]

Annotations from the figure: The standardized regression coefficients describe the average causal effects of the intervention on each of the mediators; they can be interpreted as the average increase in the mediator, expressed as a proportion of a standard deviation of the mediator, that would result if a person was given PSE rather than usual care. For example, PSE reduced perceived stress by 11% of a standard deviation. The adjusted rate ratios represent the relative increase in the rate of developing worsened depression caused by a one standard deviation change in the mediator. For example, a one standard deviation reduction in perceived stress changes the rate of developing worsened depressive symptoms by a factor of 0.43; in other words, a one standard deviation change in stress yields a relative rate reduction of 1 − 0.43 = 57%.

Mediation analysis exploring how PSE reduced depressive symptoms in low-income mothers. The numbers on the arrows connecting PSE to each of the mediators are standardized regression coefficients (with 95% CIs). The numbers on the arrows linking mediators to the rate of worsened depression are adjusted rate ratios. The indirect and direct effects derived from the mediation analysis are reported in the lower right corner. The rate ratio of the specific indirect effect of PSE through perceived stress is 0.91, indicating that on average, PSE reduced the rate of worsened depression by 1 − 0.91 = 9% through its effect on perceived stress. PSE also reduced the rate of worsened depression through other mechanisms in the model, including stress (rate ratio for the total indirect effect of PSE through all mediators = 0.89). The direct effect rate ratio of 0.72 indicates that a substantial effect of PSE on the rate of worsened depression worked through unmeasured mechanisms. Adapted from Silverstein et al.1
jama.com (Reprinted) JAMA February 19, 2019 Volume 321, Number 7 697
making causal claims and includes sensitivity analyses to assess these assumptions.2,3 Causal mediation analysis uses linear or nonlinear regression techniques to model the intervention-mediator and mediator-outcome effects. The regression models are used to simulate potential values of the mediator and outcome for each study participant under hypothetical and observed scenarios of receiving and not receiving the intervention. These estimates are then used to calculate the average indirect and direct effects.3

Statistical and causal mediation analyses produce similar estimates when linear models are used for continuous variables and there are no interactions between the intervention and the mediator. When mediators or outcomes are binary variables, or when nonlinear models are used or interactions are present, statistical mediation can produce biased estimates and the causal mediation approach is preferred.2

What Are the Limitations of Mediation Analysis?
The explicit objective of all mediation analyses is to demonstrate causal relationships. This objective requires that specific assumptions are met. In a mediation analysis, the intervention-outcome, intervention-mediator, and mediator-outcome effects must be unconfounded to permit valid causal inferences. This requirement is often called the no confounding, or ignorability, assumption.2 In a randomized trial, participants are randomly assigned to intervention groups, so the intervention-outcome and intervention-mediator effects can be assumed to be unconfounded. However, trial participants are not usually randomly assigned to receive or not receive the mediator, so the mediator-outcome effect may be confounded, even in randomized trials. To overcome this potential source of bias, investigators can control for known confounders of the mediator-outcome effect by using techniques such as regression adjustment. However, as highlighted in a previous JAMA Guide to Statistics and Methods article,4 unmeasured confounding may still introduce bias even if known confounders have been adjusted for. Sensitivity analyses can and should be used to assess the potential bias caused by unmeasured confounding in mediation analyses.3

The risk of confounding in mediation analyses is greater in observational studies than in randomized trials3 because participants in observational studies are not randomly allocated to receive the exposure. In observational studies, unlike randomized trials, it cannot be assumed that either the exposure-mediator or exposure-outcome effects are unconfounded. Therefore, estimates of all effects in the mediation model of an observational study could be biased, and control of confounding may be required for all effects. An example is provided by an observational study in which Cheng et al5 used mediation analysis to test whether functional brain connectivity mediated the effect of depression on sleep quality. Because participants were not randomly allocated to levels of depression (the exposure) or functional brain connectivity (the mediator), the investigators adjusted for known confounders of all effects in the mediation models. Despite these efforts, it is possible that in this study, as in all observational studies, unmeasured confounding could have biased the estimates of indirect and direct effects.

Why Did the Authors Use Mediation Analysis?
Although the effectiveness of problem-solving education on depression is well established, the psychological mechanisms mediating this effect are not known. Silverstein et al1 conducted a planned mediation analysis of their randomized trial to understand the mechanisms by which problem-solving education reduced depressive symptoms (Figure). Knowledge of the mediating mechanisms may be useful in creating more efficient or effective problem-solving educational interventions.

Caveats to Consider When Assessing the Results of Mediation Analysis
If a mediation analysis has not adjusted for confounding or explored the effects of unmeasured confounding using sensitivity analysis techniques, the findings should be interpreted with caution. Mediation analyses of randomized or nonrandomized studies can only demonstrate causal effects if confounding can be confidently ruled out. If there are multiple mediators of the intervention that affect one another, mediators may act as postrandomization confounders of the effects of other mediators.6 In these cases, caution is necessary when interpreting estimates of indirect and direct effects derived from mediation analyses of single mediators. The timing of measurements of the mediator and outcome is also important. If the mediator is measured at the same time as the outcome, there could be reverse causation. That is, the outcome could cause the mediator rather than the mediator causing the outcome.
ARTICLE INFORMATION
Author Affiliations: Centre for Statistics in Medicine, Nuffield Department of Orthopaedics, Rheumatology and Musculoskeletal Sciences, University of Oxford, Oxford, United Kingdom (Lee); School of Medicine and Public Health, University of Newcastle, New South Wales, Australia (Lee); Neuroscience Research Australia (NeuRA), Sydney, New South Wales, Australia (Herbert, McAuley); School of Medical Sciences, Faculty of Medicine, University of New South Wales, Australia (Herbert, McAuley).
Corresponding Author: Hopin Lee, PhD, Botnar Research Centre, Nuffield Department of Orthopaedics, Rheumatology and Musculoskeletal Sciences, University of Oxford, Windmill Road, Headington, Oxford, OX3 7LD, United Kingdom ([email protected]).
Section Editors: Roger J. Lewis, MD, PhD, Department of Emergency Medicine, Harbor-UCLA Medical Center and David Geffen School of Medicine at UCLA; and Edward H. Livingston, MD, Deputy Editor, JAMA.
Published Online: January 25, 2019. doi:10.1001/jama.2018.21973

REFERENCES
1. Silverstein M, Cabral H, Hegel M, et al. Problem-solving education to prevent depression among low-income mothers. JAMA Netw Open. 2018;1(2):e180334. doi:10.1001/jamanetworkopen.2018.0334
2. VanderWeele TJ. Explanation in Causal Inference: Methods for Mediation and Interaction. New York, NY: Oxford University Press; 2015.
3. Imai K, Keele L, Yamamoto T. Identification, inference and sensitivity analysis for causal mediation effects. Stat Sci. 2010;25(1):51-71.
4. Haukoos JS, Lewis RJ. The propensity score. JAMA. 2015;314(15):1637-1638. doi:10.1001/jama.2015.13480
5. Cheng W, Rolls ET, Ruan H, Feng J. Functional connectivities in the brain that mediate the association between depressive problems and sleep quality. JAMA Psychiatry. 2018;75(10):1052-1061. doi:10.1001/jamapsychiatry.2018.1941
6. VanderWeele TJ, Vansteelandt S. Mediation analysis with multiple mediators. Epidemiol Methods. 2014;2(1):95-115.
Effectively communicating clinical trial results to patients and clinicians is a requirement for appropriate application in clinical practice. In a recent issue of JAMA, Zhao et al1 reported the results from a randomized clinical trial comparing dual antiplatelet therapy with aspirin monotherapy for preserving saphenous vein graft patency in 500 patients undergoing coronary artery bypass grafting. Dual antiplatelet therapy was found to be superior to aspirin monotherapy. The authors1 used the number needed to treat (NNT) to communicate effect size, reporting that for every 8 patients treated with dual agents rather than aspirin alone, 1 additional patient would achieve saphenous graft patency at 1 year. The NNT may be defined as the number of patients who need to be treated with one therapy vs another for 1 additional patient to have the desired outcome. Since its first description 30 years ago,2 the NNT has become an important means to express the magnitude of benefit conferred by a therapy.3

conveys statistical rather than clinical significance. The P value suggests there will be a difference in outcomes associated with choice of therapy, but not how large that difference will be.

Risk ratios and odds ratios convey the relative rather than the absolute differences in outcomes with different treatments.4 They are interpretable only if the event rate in the control comparator group is also stated, and then require mental calculation not readily performable by many decision makers. For example, a treatment that increases by 1.5-fold the frequency of a desirable outcome (risk ratio = 1.5) will help only 1 of every 100 patients if the base rate of the desirable outcome in the control group is 2% (increased in the active group to 3%), but will help 20 of every 100 patients if the base rate of the desirable outcome in the control group is 40% (increased in the active group to 60%). In contrast, the NNT conveys the absolute size of differences in outcome proportions with different treatments in a readily interpretable manner.
798 JAMA February 26, 2019 Volume 321, Number 8 (Reprinted) jama.com
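The arithmetic behind these comparisons is easy to make concrete: the NNT is the reciprocal of the absolute difference in outcome proportions. The sketch below reuses the base-rate example from the preceding paragraph and the trial's reported 12.2% risk difference; it is illustrative only.

```python
def nnt_from_rates(p_control: float, p_active: float) -> float:
    """NNT is the reciprocal of the absolute difference in outcome proportions."""
    return 1.0 / abs(p_active - p_control)

# Risk ratio of 1.5 on a 2% base rate (2% -> 3%): 1 extra patient helped per 100
print(round(nnt_from_rates(0.02, 0.03)))   # 100
# The same risk ratio on a 40% base rate (40% -> 60%): 20 extra per 100
print(round(nnt_from_rates(0.40, 0.60)))   # 5
# Zhao et al: absolute risk difference of 12.2% for graft patency at 1 year
print(round(1.0 / 0.122))                  # 8
```

The two base-rate scenarios yield NNTs of 100 and 5 from the same risk ratio, which is precisely why the relative measure alone can mislead.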
A limitation of the NNT shared by the natural frequency is that randomized clinical trial results fully specify NNT values only for binary outcomes (such as the occurrence of an infection, a rash, or death), but not for ordinal or continuous outcomes (such as reduced pain or degree of disability). This drawback has been partially mitigated by the development of methods that provide estimated NNT values for ordinal or continuous outcomes using automated or content expert–informed derivation techniques.8 However, these methods require additional, often untestable, assumptions to estimate the distribution of an observed treatment group benefit among individual patients because the same clinical trial result can arise when many patients experience a small individual benefit or when fewer patients experience a large individual benefit.

Another limitation is that the NNT reflects the number, not the importance, of events. Different types of events are each given their own separate NNT values and the resulting quantitative statements may encourage overweighting of less important outcomes. For example, a therapy is clearly of substantial net benefit even if it has a nominally lower NNT to harm of 3 for a minor adverse effect (such as transient mild headache) accompanying a nominally higher NNT to benefit of 5 for a major beneficial effect (such as averting fatal cardiac failure). An alternative approach is to integrate multiple outcomes into a single measure of treatment effect using health-related utility values for each of the outcomes.9,10 Once event values are converted to this single consistent measure, an NNT to achieve any given magnitude of benefit on the utility scale can be derived.6 For example, the “number needed to save one life” was recently used to express the number of patients with acute ischemic stroke treated with thrombectomy required to achieve the same total benefit as saving the life of 1 patient who would have died and achieving a normal neurological outcome.6

Further limitations include that the NNT does not convey the financial costs and benefits of treatments and only expresses the magnitude of effect expected for a prototypical patient, reflecting the aggregate characteristics of the population enrolled in a particular clinical trial. In contrast, each individual patient has distinctive features modifying the baseline risk and treatment response. In addition, when patient outcomes vary over time, the reported NNT reflects the benefit at a particular time point and several different NNT values might be needed to capture varying benefits (eg, at early, middle, or late stages of the treatment course).

How Was the Concept of NNT Applied in This Particular Study?
In the Results section of the study by Zhao et al,1 the primary efficacy end point findings were reported including each group’s individual outcome proportions and 95% CIs, the relative treatment effect magnitude (relative risk, 0.48 [95% CI, 0.31-0.74]), the absolute treatment effect magnitude (risk difference, 12.2% [95% CI, 5.2%-19.2%]), and the statistical significance (P < .001). The authors restated this result as an NNT of 8 (the approximate reciprocal of 12.2%). Reporting the findings this way conveyed the probability of benefit in a clinically useful manner. The authors did not provide the 95% CI around the NNT value. The absence of a 95% CI around the NNT improves readability, but somewhat obscures the degree of uncertainty around the estimated value. An NNT to harm value was not provided for the increase in minor bleeding events that also was observed with double antiplatelet therapy. However, it is prudent in primary trial reports to state NNT values only for the lead efficacy and safety end points that were the prespecified focus of hypothesis testing.

How Should the NNT Be Interpreted in Zhao et al?
The absolute risk difference of 12.2% with the 95% CI of 5.2% to 19.2% reported by Zhao et al1 indicates that approximately 8 patients (given by the reciprocal of 0.122) need to be treated with dual antiplatelet therapy as opposed to aspirin monotherapy to avoid 1 case of saphenous vein graft occlusion. However, the data are also consistent with this number being as low as 5 or as high as 19 (given by the reciprocals of 0.192 and 0.052, respectively). These values capture both the probability of benefit for the individual patient (approximately 1 in 8) and the uncertainty in that probability.
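The reciprocal arithmetic described in this section can be checked directly. The inputs below are the risk difference and CI limits reported by Zhao et al; the helper function is illustrative.

```python
# NNT point estimate and CI limits from the reported risk difference
# of 12.2% (95% CI, 5.2% to 19.2%).
def nnt_from_risk_difference(rd):
    """NNT is the reciprocal of the absolute risk difference."""
    return 1 / rd

point = nnt_from_risk_difference(0.122)  # ~8.2, reported as "approximately 8"
best = nnt_from_risk_difference(0.192)   # ~5.2, the most favorable CI limit
worst = nnt_from_risk_difference(0.052)  # ~19.2, the least favorable CI limit
print(f"NNT = {point:.1f} (95% CI, {best:.1f} to {worst:.1f})")
```

Note that the CI limits swap roles under the reciprocal: the upper limit of the risk difference yields the lower (more favorable) limit of the NNT.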
ARTICLE INFORMATION
Author Affiliations: Department of Neurology, Ronald Reagan–UCLA Medical Center and David Geffen School of Medicine, University of California, Los Angeles (Saver); Department of Emergency Medicine, Harbor-UCLA Medical Center, Torrance, California (Lewis); Department of Emergency Medicine, David Geffen School of Medicine, University of California, Los Angeles (Lewis); Berry Consultants LLC, Austin, Texas (Lewis).
Corresponding Author: Jeffrey L. Saver, MD, Reed Neurologic Research Center, 710 Westwood Plaza, Los Angeles, CA 90095 ([email protected]).
Section Editor: Edward H. Livingston, MD, Deputy Editor, JAMA.
Published Online: February 7, 2019. doi:10.1001/jama.2018.21971
Conflict of Interest Disclosures: None reported.
Disclaimer: Dr Saver is an associate editor of JAMA, but he was not involved in any of the decisions regarding review of the manuscript or its acceptance.

REFERENCES
1. Zhao Q, Zhu Y, Xu Z, et al. Effect of ticagrelor plus aspirin, ticagrelor alone, or aspirin alone on saphenous vein graft patency 1 year after coronary artery bypass grafting: a randomized clinical trial. JAMA. 2018;319(16):1677-1686.
2. Laupacis A, Sackett DL, Roberts RS. An assessment of clinically useful measures of the consequences of treatment. N Engl J Med. 1988;318(26):1728-1733.
3. Mendes D, Alves C, Batel-Marques F. Number needed to treat (NNT) in clinical literature. BMC Med. 2017;15(1):112.
4. Norton EC, Dowd BE, Maciejewski ML. Odds ratios—current best practice and use. JAMA. 2018;320(1):84-85.
5. Hoffrage U, Lindsey S, Hertwig R, Gigerenzer G. Medicine: communicating statistical information. Science. 2000;290(5500):2261-2262.
6. Nogueira RG, Jadhav AP, Haussen DC, et al. Thrombectomy 6 to 24 hours after stroke with a mismatch between deficit and infarct. N Engl J Med. 2018;378(1):11-21.
7. Peng J, He F, Zhang Y, et al. Differences in simulated doctor and patient medical decision making. PLoS One. 2013;8(11):e79181.
8. Saver JL. Optimal end points for acute stroke therapy trials. Stroke. 2011;42(8):2356-2362.
9. Irony TZ. The “utility” in composite outcome measures. JAMA. 2017;318(18):1820-1821.
10. Hong KS, Ali LK, Selco SL, et al. Weighting components of composite end points in clinical trials. Stroke. 2011;42(6):1722-1729.
When designing a comparative outcomes or a cost-effectiveness analysis, the time horizon defining the duration of time for outcomes assessment must be carefully considered. The time horizon must be long enough to capture the intended and unintended benefits and harms of the intervention(s).1,2 In some instances, the time horizon should extend beyond the duration of a clinical trial when a specific end point is measured, whereas in other instances modeling outcomes over a longer period is unnecessary. Using a longer time horizon than is necessary may add unnecessary cost and complexity to the cost-effectiveness analysis model.

In the May 2017 issue of JAMA Ophthalmology, Wittenborn et al3 examined costs and effectiveness of home-based macular degeneration monitoring systems using a lifetime horizon in a cost-effectiveness analysis and a 10-year horizon in a budget impact analysis. The rationale for selection of time horizons and their implications for interpreting the research is reviewed in this JAMA Guide to Statistics and Methods article.

The Use of Time Horizon in a Cost-effectiveness Analysis
For cost-effectiveness, the time horizon is the time over which the costs and effects are measured.1,2 Cost-effectiveness analyses should consider time horizons that capture all intended and unintended consequences of the interventions being evaluated, irrespective of when they occur. When a cost-effectiveness analysis is performed as part of a clinical trial, the time in which the data are collected may be limited to the duration of the trial itself. This is called a within-trial horizon.4 For example, cost-effectiveness of antibiotics used to treat acute sinusitis can use a within-trial time horizon because the disease and its treatment occur over a very short period and extrapolation of benefits over a long period is not required.

If benefits and harms of an intervention can occur over the entirety of a patient’s life, then a lifetime horizon is appropriate. For example, the use of a low vs a high dose of aspirin may affect the primary end point of cardiovascular events within the trial period, but the effects are likely to continue throughout the patients’ lifetime. In such cases, within-trial analyses based on the primary end point are incomplete and potentially misleading because the long-term population effects on health and costs would not be captured.5 In some instances, the appropriate time horizon can be somewhere between the duration of the trial and the lifetime of patients; eg, when a cost-effectiveness analysis is conducted from a payer’s perspective. However, the choice of time horizon can have substantial influence on cost-effectiveness analysis results in such cases.6

Limitations Regarding Selection of Time Horizons
Studies using long time horizons can be difficult and expensive to complete. Consequently, 2 types of information external to the trial are generally needed because few trials collect patient information for their entire lives. First, if the trial studied the effect of an intervention on an intermediate outcome such as cholesterol level, there must be a plausible epidemiological link between the intermediate outcome and more comprehensive outcomes such as survival. When the epidemiological links are weak, assumptions about the comprehensive outcome cannot reliably be made from observations of intermediate outcomes. Second, the treatment effect found within the trial must persist beyond the trial follow-up. For example, if a trial of aspirin demonstrates an annual reduction of stroke by a risk ratio of 0.8 over 3 years, would this effect be expected to continue beyond 3 years? Extrapolations regarding long-term treatment effects depend on both clinical assumptions, such as the known efficacy of an intervention, and behavioral assumptions, such as long-term adherence to treatment. These assumptions are commonly based on empirical evidence arising from other research studies, such as information describing aspirin treatment adherence from real-world data. In the absence of prior information to fulfill these 2 conditions, the assumptions should be explicitly stated and alternate analyses that vary the scenarios and assumptions should be conducted to understand how the results are influenced by these assumptions.

Choosing the time horizon for a cost-effectiveness study depends on what perspective is selected for study. The perspective (eg, patient, private or public payer, society) defines whose costs are included and thus the types of resource use to be considered in a cost-effectiveness analysis. A private payer’s perspective may require a relatively short time horizon, such as 1 to 2 years, and so may not capture an intervention’s full benefits and harms that occur with time. In this scenario, benefits and harms determined from a payer perspective may differ from those determined from a societal perspective, which will be longer and will also include the effect of an intervention on patients, caregivers, and other sectors of society. The Second Panel on Cost-effectiveness in Health and Medicine recommends using the health care sector and societal perspectives for cost-effectiveness analysis.2 These 2 perspectives should be prioritized in selecting the appropriate time horizon and may involve selection of the same time horizon.

When designing studies with long time horizons, investigators must anticipate how outcomes for the treatment and control groups might change over time. For example, will the control group’s outcome trend change linearly over time, such as a slow, linear gain until a certain age, then a slow linear decline until death? Will the treatment group’s outcome change in a nonlinear fashion in the short term and then stabilize into a fairly linear trend after a number of years? How may survival curves be extrapolated over time? These time-dependent changes in outcomes can be very difficult to anticipate.
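The aspirin example above can be turned into a small scenario analysis. Every input below is an assumption chosen for illustration: a hypothetical 4% annual control-group stroke risk, and two post-trial scenarios for whether the within-trial risk ratio of 0.8 persists or wanes to null.

```python
# Extrapolating a within-trial effect (annual stroke risk ratio of 0.8
# observed for 3 years) to a 10-year horizon under 2 persistence assumptions.
BASE_ANNUAL_RISK = 0.04  # hypothetical control-group annual stroke risk
TRIAL_YEARS, HORIZON = 3, 10
scenarios = {"effect persists": 0.8, "effect wanes to null": 1.0}

for label, rr_after_trial in scenarios.items():
    stroke_free_control = stroke_free_treated = 1.0
    for year in range(1, HORIZON + 1):
        rr = 0.8 if year <= TRIAL_YEARS else rr_after_trial
        stroke_free_control *= 1 - BASE_ANNUAL_RISK
        stroke_free_treated *= 1 - BASE_ANNUAL_RISK * rr
    arr = stroke_free_treated - stroke_free_control  # absolute risk reduction
    print(f"{label}: 10-year absolute risk reduction = {arr:.3f}")
```

The two scenarios bracket the plausible long-term benefit; a full analysis would vary the assumptions more systematically, as the text recommends.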
1096 JAMA March 19, 2019 Volume 321, Number 11 (Reprinted) jama.com
How Was Time Horizon Defined and Used in the JAMA Ophthalmology Study?
Wittenborn et al3 compared costs and effectiveness of patients randomized to supplementing usual care with home-based macular degeneration monitoring systems or randomized to usual care. A lifetime horizon was used for the cost-effectiveness analysis from a societal perspective and a 10-year horizon for a budget impact analysis from a public payer (Medicare) perspective. The analysis considered costs that would be incurred by patients, clinicians, and health care systems necessary for monitoring, as well as medical costs. It also examined costs incurred by patients and employers, such as productivity costs due to the time employees were unable to do their usual activities such as work. The incremental cost-effectiveness ratio, which summarizes the benefits (eg, effectiveness) and costs of the 2 alternatives in a single metric, attributable to monitoring based on a societal perspective with a lifetime horizon was $35 663 (95% CI, cost-saving to $235 613) per quality-adjusted life-year gained. The budget impact analysis, which estimated the cumulative cost of covering monitoring over the initial 10 years, found that Medicare would be projected to spend $1312 (95% CI, $222-$2848) per patient. Taken together, these analyses demonstrated that coverage of home telemonitoring would require additional Medicare expenditures over the first 10 years but could be a good value for money when the time horizon was expanded from 10 years to patient lifetimes and when the set of relevant benefits and costs was broadened from a public payer perspective (Medicare) to a societal perspective.

How Does the Time Horizon Selected by Wittenborn et al Affect the Interpretation of the Study?
The lifetime analysis was warranted because the natural history of age-related macular degeneration continues over patients’ lifetimes. However, the challenge faced by the authors was to extrapolate the HOME trial results, with a mean duration of 1.4 years, to a lifetime horizon.3 The trial end point of best-corrected visual acuity scores at the detection of age-related macular degeneration–associated choroidal neovascularization (CNV) had to be extrapolated to lifetime quality-adjusted life-years, and the acuity score at time of CNV diagnosis and the false-positive rates of CNV diagnosis observed in the trial had to be extrapolated beyond the duration of the trial. The authors relied on extensive epidemiological studies to estimate these parameters and found that their results were relatively insensitive to variation in the input value of one parameter at a time over a range. However, to what extent continuous monitoring would lead to additional behavioral changes (prevention of scheduled eye examinations or fidelity to monitoring recommendations) beyond the duration of the HOME trial remains uncertain and was not captured by the model parameters. The authors studied some of these considerations by changing underlying assumptions of the model to reflect slightly higher benefits of monitoring and found that the monitoring interventions became cost-saving when modeled using a lifetime horizon.
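The incremental cost-effectiveness ratio itself is a one-line calculation. The inputs below are hypothetical round numbers chosen only so the result lands near the study's reported $35 663 per QALY; they are not the actual cost and effectiveness values from the model.

```python
# ICER = (incremental cost) / (incremental effectiveness), here in $ per QALY.
delta_cost = 2_500.0   # assumed incremental lifetime cost per patient ($)
delta_qaly = 0.07      # assumed incremental quality-adjusted life-years

icer = delta_cost / delta_qaly
print(f"ICER = ${icer:,.0f} per QALY gained")

# Whether this is "good value" depends on a willingness-to-pay threshold.
print("below a $100 000/QALY threshold:", icer < 100_000)
```

Because the ratio divides two extrapolated quantities, small changes in either input (eg, from a shorter time horizon truncating the QALY gain) can move the ICER substantially, which is the sensitivity the article describes.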
ARTICLE INFORMATION
Author Affiliations: The Comparative Health Outcomes, Policy, and Economics (CHOICE) Institute, University of Washington, Seattle (Basu); Center for Health Services Research in Primary Care, Durham Veterans Affairs Medical Center, Durham, North Carolina (Maciejewski); Duke University Medical Center, Durham, North Carolina (Maciejewski).
Corresponding Author: Anirban Basu, PhD, University of Washington, 1959 NE Pacific St, PO Box 357660, Seattle, WA 98195 ([email protected]).
Section Editors: Roger J. Lewis, MD, PhD, Department of Emergency Medicine, Harbor-UCLA Medical Center and David Geffen School of Medicine at UCLA; and Edward H. Livingston, MD, Deputy Editor, JAMA.
Published Online: February 21, 2019. doi:10.1001/jama.2019.1153
Conflict of Interest Disclosures: Dr Basu reports receiving fees from Salutis Consulting LLC. Dr Maciejewski reports owning stock through spouse’s employment at Amgen; grants and other support from the Veterans Affairs Health Services Research and Development and the National Committee for Quality Assurance; and grants from the National Institute on Drug Abuse.

REFERENCES
1. Gold MR, Siegel JE, Russell LB, Weinstein MC, eds. Cost-Effectiveness in Health and Medicine. New York, NY: Oxford University Press; 1996.
2. Sanders GD, Neumann PJ, Basu A, et al. Recommendations for conduct, methodological practices, and reporting of cost-effectiveness analyses: Second Panel on Cost-Effectiveness in Health and Medicine. JAMA. 2016;316(10):1093-1103. doi:10.1001/jama.2016.12195
3. Wittenborn JS, Clemons T, Regillo C, Rayess N, Liffmann Kruger D, Rein D. Economic evaluation of a home-based age-related macular degeneration monitoring system. JAMA Ophthalmol. 2017;135(5):452-459. doi:10.1001/jamaophthalmol.2017.0255
4. Hlatky MA, Owens DK, Sanders GD. Cost-effectiveness as an outcome in randomized clinical trials. Clin Trials. 2006;3(6):543-551. doi:10.1177/1740774506073105
5. Sculpher MJ, Claxton K, Drummond M, McCabe C. Whither trial-based economic evaluation for health care decision making? Health Econ. 2006;15(7):677-687. doi:10.1002/hec.1093
6. Kim DD, Wilkinson CL, Pope EF, Chambers JD, Cohen JT, Neumann PJ. The influence of time horizon on results of cost-effectiveness analyses. Expert Rev Pharmacoecon Outcomes Res. 2017;17(6):615-623. doi:10.1080/14737167.2017.1331432
It is common for treatments to be evaluated in clinical trials that involve many sites or centers, primarily because one center rarely can enroll sufficient numbers of patients to complete the trial.1 The use of multiple clinical sites introduces complexity because outcomes at different sites may be systematically different, eg, due to differences in patient populations, ancillary treatment practices, or other factors. Thus, appropriate statistical analyses of multicenter clinical trials consider these center effects to yield a better understanding of the overall mean treatment effect and the variability in treatment effects and patient outcomes among sites.1

In the May 15, 2018, issue of JAMA, Dodick et al2 published the results of a clinical trial that compared migraine prevention by 2 different dosing regimens of fremanezumab vs placebo. The number of migraine days was recorded during a 28-day baseline period and a 3-month treatment period. The primary outcome for the study was the change from baseline in the mean number of monthly migraine days during treatment. The 875 participating patients were recruited from 123 centers in 9 countries. Using a primary analysis that accounted for each patient’s mean number of migraines during the baseline period,2,3 treatment, country (US vs non-US), and other factors, the authors reported a difference with monthly dosing vs placebo of –1.5 days (95% CI, –2.01 to –0.93 days; P < .001) and with single higher dosing vs placebo of –1.3 days (95% CI, –1.79 to –0.72 days; P < .001). They also conducted a post hoc sensitivity analysis that accounted for effects of the specific country of enrollment.2

Estimating Treatment Effects in Multicenter Clinical Trials
Why Are Differences Between Centers Considered When Estimating Treatment Effects?
The goals of the statistical analysis of a multicenter clinical trial include providing a valid estimate of the treatment effect (ie, the mean difference in outcomes between patients treated in the 2 groups) and understanding and quantifying the remaining uncertainty or precision in the estimated treatment effect.1 Patients treated at different centers may differ in their overall prognoses but experience the same relative benefit of a treatment compared with standard care. Alternatively, patients at different centers may differ in both their overall prognoses and in the treatment effect. Only the first case is considered in this article.

The randomization of patients to treatments in multicenter trials is usually stratified by center to achieve balance in the numbers of patients receiving each treatment within each center and, in what follows, it is assumed this has been done.4 Balance improves the statistical efficiency of the trial, increasing precision in the estimation of treatment effects given a particular sample size. It also reduces the risk in modestly sized trials that chance imbalance in treatment allocation at centers with smaller numbers of patients results in a bias if that center has better or worse outcomes on average than other centers.1,4

Center effects, including systematic differences in outcomes between patients enrolled in different centers, have the potential to affect the estimate of the treatment effect (eg, if patients vary in severity of illness from center to center, any imbalance in treatment allocation within centers could affect the estimate of the treatment effect). However, a potentially more important and less well-appreciated effect of differences between centers is on the uncertainty or the precision in the estimate of the treatment effect. Even when stratified randomization is used successfully to achieve balance between groups across centers,4 the resulting confidence intervals (CIs) and P values can change substantially depending on whether the center effect is included in the statistical model used to estimate the treatment effect, potentially affecting the overall interpretation of the clinical trial result.1

The uncertainty in the estimate of the treatment effect is defined by the variability in the estimates that would be obtained, hypothetically, if equivalent multicenter clinical trials were independently repeated many times. There are 2 distinct ways that such repetition might be conducted. First, the same centers might be used but the random treatment allocation (eg, which treatment is assigned to each patient in sequence) would be changed from repetition to repetition, maintaining stratification to balance treatments within each center each time. Second, different centers could also be used for each trial, selected from a larger pool of centers with similar characteristics. Only the first approach is considered here, which addresses the uncertainty in the treatment effect obtained from the original trial with those particular centers. The second type of repetition, with variation in the participating centers, would address a more general type of uncertainty—including uncertainty due to differences in the treatment effect among centers—that is beyond the scope of this discussion.

Suppose the same multicenter clinical trial results were analyzed using 2 different statistical models. First, a simple model is used that does not include a term for center effect, so all variability is assumed to arise only from the inherent variation among patients but without any systematic differences from center to center. Second, a more appropriate model is used that separates the variability arising from differences among patients within centers from the variability arising from differences among centers. The uncertainty in the treatment effect obtained from the 2 models, representing the variability in the mean treatment effect that would be seen in the repeated equivalent trials, and reflected in the width of the 95% CI around the estimated treatment effect, will be different. The first model, without the center effect, will overestimate the variability in the estimated mean treatment effect because it assumes that all variability is an inherent characteristic of the patient population. This will result in an overly wide CI and decreased statistical power (the overly wide CI will be more likely to include a zero or null treatment effect, equivalent to a nonstatistically significant P value). In contrast, the model incorporating the center effect will correctly recognize that some of the variability is a characteristic of patients within each center and some is a characteristic of differences in populations enrolled at different centers, and eliminate the latter. This will
jama.com (Reprinted) JAMA March 26, 2019 Volume 321, Number 12 1211
reduce the estimated variability for patients within each center, resulting in a narrower, more accurate CI and increased power.

How Are Center Effects Incorporated into Estimates of Treatment Effects?
Various statistical models can be used to account for center effects when estimating the treatment effect. The simplest way is to consider centers as fixed effects—each center is associated with its own effect on patient outcomes—in an appropriate model. This approach can be used in linear models, survival analysis, or logistic regression for dichotomous outcomes and, in each case, allows for the greater similarity among treatment outcomes within a center compared with across centers.5

Limitations of Estimates of Treatment Effects From Multicenter Clinical Trials
It is often stated that the purpose of enrolling patients at multiple centers is to increase the external generalizability of the trial results; however, centers are typically selected for factors (eg, academic affiliations, large patient volumes) intended to speed enrollment, which limits the generalizability of results to other clinical settings. Thus, even estimates from multicenter clinical trials may lack external validity when applied to qualitatively different practice settings.

Each of the models used to adjust for center effects has underlying assumptions regarding how the differences between centers affect patient outcomes or, similarly, how patients within centers are more similar on average than patients between centers. If these assumptions do not hold true, then the results of the analyses may be biased or estimates of uncertainty may be incorrect, affecting the statistical significance of results or the width of CIs.

How Were the Multicenter Data Analyzed in the Study by Dodick et al?
The 875 participating patients were recruited in 123 centers in 9 countries and randomized in a 1:1:1 ratio to the 3 treatments.2 Patient randomization was stratified by sex, country, and baseline preventive medication used. The primary analysis used analysis of covariance3 to adjust for the influence of the 3 stratification factors, the number of episodes occurring in the baseline period, and the number of years since first onset of migraine. The effect of country was simplified by reducing this variable to only 2 groups: US vs non-US. To more completely evaluate the potential “country effect” (analogous to the “center effect” described above), the authors conducted a “post hoc sensitivity analysis using a mixed-effects model that included country instead of region as a random effect.”2 The results of this analysis are presented in eTable 4 in the article’s Supplement 3, demonstrating a difference from placebo with monthly dosing of –1.5 (95% CI, –2.00 to –0.93) and a difference with the single higher dose of –1.3 (95% CI, –1.79 to –0.72).2 These results are almost identical to those of the primary analysis, suggesting that either there was little effect of country on the results or that all effect of the country was captured by simply distinguishing US sites from non-US sites.

How Should the Results From This Study Be Interpreted?
The treatment effects estimated by Dodick et al,2 and the associated CIs, were nearly identical regardless of whether the effect of location of enrollment was dichotomized as within or outside of the US, or captured as the country of enrollment, with 9 possibilities. The consistency suggests that the variability in outcomes associated with the location of enrollment is either small or is captured similarly by both models. The result from the model adjusting for the country of enrollment is the preferred estimate, because there is no harm in the adjustment if the country-to-country variability is unimportant and the adjustment is critically important for correctly determining the uncertainty in the estimate of the mean treatment effect if the country-to-country variability turns out to be substantial. In both the primary and post hoc sensitivity analyses, the models partitioned the variability in patient outcomes between that associated with the location of enrollment (US vs non-US or by country) and that associated with differences between patients, resulting in more accurate CIs than would have been obtained had these effects not been included.
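A small simulation illustrates the 2-model comparison described in this article. Every number below is invented for illustration: 30 centers with systematic center-to-center shifts in outcome, a true treatment effect of -1.5 days, and 1:1 stratified allocation. Within-center demeaning is used as a simple stand-in for a center fixed-effects model; it yields the same point estimate with a visibly smaller standard error.

```python
import random
import statistics

random.seed(7)
TRUE_EFFECT = -1.5   # assumed true treatment effect (migraine days)
N_CENTERS, PER_CENTER = 30, 20

# Simulate: outcome = center shift + treatment effect + patient noise,
# with treatment allocated exactly 1:1 within each center (stratified).
rows = []  # (center, treated, outcome)
for c in range(N_CENTERS):
    center_shift = random.gauss(0, 2.0)   # systematic center effect
    for i in range(PER_CENTER):
        treated = i % 2
        y = center_shift + TRUE_EFFECT * treated + random.gauss(0, 1.0)
        rows.append((c, treated, y))

def diff_and_se(data):
    """Mean difference (treated - control) and its standard error."""
    g0 = [y for _, t, y in data if t == 0]
    g1 = [y for _, t, y in data if t == 1]
    diff = statistics.mean(g1) - statistics.mean(g0)
    se = (statistics.variance(g0) / len(g0) + statistics.variance(g1) / len(g1)) ** 0.5
    return diff, se

# Model 1: no center term -- center-to-center variation inflates the SE.
eff1, se1 = diff_and_se(rows)

# Model 2: center effects removed by demeaning outcomes within each center.
demeaned = []
for c in range(N_CENTERS):
    sub = [r for r in rows if r[0] == c]
    m = statistics.mean(y for _, _, y in sub)
    demeaned.extend((cc, t, y - m) for cc, t, y in sub)
eff2, se2 = diff_and_se(demeaned)

print(f"ignoring centers:  effect = {eff1:+.2f}, SE = {se1:.2f}")
print(f"adjusting centers: effect = {eff2:+.2f}, SE = {se2:.2f}")
```

With exact 1:1 balance, the two point estimates coincide, so only the uncertainty changes, which mirrors the article's point that the center term mainly affects CI width and power. The demeaning shortcut slightly understates the SE (it ignores the degrees of freedom spent on the center means) but shows the qualitative contrast.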
ARTICLE INFORMATION
Author Affiliations: Luxembourg Institute of Health, Strassen, Luxembourg (Senn); Medical Statistics Group, ScHARR, The University of Sheffield, Sheffield, United Kingdom (Senn); Department of Emergency Medicine, Harbor-UCLA Medical Center, Torrance, California (Lewis); Department of Emergency Medicine, David Geffen School of Medicine at UCLA, Los Angeles, California (Lewis); Berry Consultants LLC, Austin, Texas (Lewis).
Corresponding Author: Stephen J. Senn, PhD, 29 Merchiston Crescent, Edinburgh EH10 5AJ, United Kingdom ([email protected]).
Section Editor: Edward H. Livingston, MD, Deputy Editor, JAMA.
Published Online: March 1, 2019. doi:10.1001/jama.2019.1480
Conflict of Interest Disclosures: Dr Senn reported serving as a consultant to the pharmaceutical industry.

REFERENCES
1. Senn S. Some controversies in planning and analysing multi-centre trials. Stat Med. 1998;17(15-16):1753-1765. doi:10.1002/(SICI)1097-0258(19980815/30)17:15/16<1753::AID-SIM977>3.0.CO;2-X
2. Dodick DW, Silberstein SD, Bigal ME, et al. Effect of fremanezumab compared with placebo for prevention of episodic migraine: a randomized clinical trial. JAMA. 2018;319(19):1999-2008. doi:10.1001/jama.2018.4853
3. Vickers AJ, Altman DG. Statistics notes: analysing controlled trials with baseline and follow up measurements. BMJ. 2001;323(7321):1123-1124. doi:10.1136/bmj.323.7321.1123
4. Broglio K. Randomization in clinical trials: permuted blocks and stratification. JAMA. 2018;319(21):2223-2224. doi:10.1001/jama.2018.6360
5. Meurer WJ, Lewis RJ. Cluster randomized trials: evaluating treatments applied to groups. JAMA. 2015;313(20):2068-2069. doi:10.1001/jama.2015.5199
What Are the Limitations of Intermediate End Points?
Despite the appeal of using replacement end points, 2 fundamental requirements must be met2 to ensure that the replacement end point is a “valid surrogate,” ie, the effect of the intervention on the replacement end point reliably predicts its effect on the patient-centered outcome. The first requirement is that the replacement end point and the patient-centered outcome are strongly correlated. The second is that effects of an intervention on the replacement end point should fully capture its net effect on the patient-centered outcome. There are several reasons this second requirement is often not met; thus, “a correlate does not a surrogate make.”3

How Have Intermediate Outcomes Been Used?
An article published in JAMA Oncology4 by Ritchie and colleagues illustrates these issues in the immuno-oncology setting, where checkpoint inhibitors have frequently been evaluated for the treatment of advanced solid cancers. The objective response rate (ORR), determined from decreases in tumor size, is often thought to be a valid surrogate for overall survival simply because, on a patient-specific level, treatment responders live longer than nonresponders. The article evaluated the primary end points in 24 randomized controlled phase 2 trials of checkpoint inhibitors. Across these trials of checkpoint inhibitors, the correlation between the observed ORR odds ratio and the overall survival hazard ratio was modest (0.57 [95% CI, 0.23-0.89]). Ritchie et al4 suggest avoiding ORR as a primary end point in phase 2 trials of checkpoint inhibitors.

Similarly, the findings reported by Ritchie et al4 suggest that the effects of checkpoint inhibitors on progression-free survival (PFS), another replacement end point for overall survival, were only modestly correlated with their effects on overall survival (0.42 [95% CI, 0.04-0.81]). It appears that PFS is insensitive to the longer-term effects of checkpoint inhibitors on overall survival.

How Should Intermediate Outcomes Be Used for Checkpoint Inhibitor Studies?
To understand why intermediate outcomes such as ORR and PFS may be correlates for patient-centered end points such as overall survival and yet might not be valid surrogates, consider the case of a randomized trial that compares an immuno-oncology agent, such as a checkpoint inhibitor, with chemotherapy. The Figure shows the multiple disease-process causal pathways and treatment-intervention mechanisms of action that can influence whether a replacement end point is a valid surrogate. First, the replacement end point might not lie in a pathophysiological pathway through which the disease process causally induces effects on the patient-centered end point (ie, the green arrow does not exist). Second, even if the replacement end point is in the pathway, there could be treatment effects on other causal pathways such as influencing tumor burden over longer periods that are inadequately captured by the replacement end point. This mismatch is likely when using short-term outcomes such as ORR and PFS as replacement end points in trials of checkpoint inhibitors that often have important longer-term effects. Third, even if the intervention has the intended mechanism of action (dashed black arrows), its effects on the patient-centered end point might be affected by other mechanisms (orange arrow) that are not captured by the replacement end point.
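The trial-level associations quoted in this section (eg, the 0.57 correlation between ORR odds ratios and overall survival hazard ratios) are correlations of effect estimates across trials, conventionally computed on the log scale for ratio measures. A minimal sketch of that computation, using invented per-trial effect estimates rather than the published data:

```python
import math

# Hypothetical per-trial effect estimates (invented, NOT the published
# data): odds ratios for ORR and hazard ratios for overall survival.
orr_odds_ratios = [1.2, 2.5, 0.9, 1.8, 3.0, 1.1]
os_hazard_ratios = [0.95, 0.70, 1.05, 0.80, 0.60, 0.90]

def pearson(x, y):
    """Plain Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Ratio measures are compared on the log scale.
r = pearson([math.log(v) for v in orr_odds_ratios],
            [math.log(v) for v in os_hazard_ratios])
print(round(r, 2))
```

With these invented values the correlation is strongly negative, because larger response benefits (odds ratios above 1) track larger survival benefits (hazard ratios below 1); a validation exercise asks how tightly the two effects track each other across trials.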
1184 JAMA March 24/31, 2020 Volume 323, Number 12 (Reprinted) jama.com
Even if intermediate end points such as ORR and PFS are not established as valid surrogate end points, they can be useful as supportive end points in phase 3 trials or as primary end points in proof-of-concept trials insofar as they provide substantive evidence about biological effects. Hence, insights about mechanisms of action of interventions are important in formulating those end points. Importantly, Ritchie et al4 recognized that checkpoint inhibitors and chemotherapy have fundamentally different mechanisms of action. Because these therapies influence tumor biology differently, what is known about the reliability of ORR and PFS as surrogate end points for overall survival in the chemotherapy setting cannot be assumed to hold in the setting of checkpoint inhibitors.

Because it is important to have sensitivity to the longer-term effects of checkpoint inhibitors on tumor burden, duration of response (DOR) is at least as important as ORR. A preferred biomarker might be each patient’s “time in response.” This end point includes all patients in the analysis, in which nonresponders are included with an outcome of “zero” duration of response. This approach enables an intention-to-treat analysis with increased sensitivity. For example, a doubling in ORR and a doubling in DOR translates to a 4-fold increase in “time in response.” Sensitivity could be further improved by using “time in disease control” or “long-term average change in disease burden” biomarkers that would capture causal effects on both disease stability and durability of tumor shrinkage. While such measures would still be relatively insensitive to unintended effects of interventions, this approach would reduce the risk of false-negative conclusions compared with traditional biomarkers such as ORR and PFS.

For proper validation of potential surrogate end points, there must be an in-depth understanding of the multiple causal pathways of the disease process and of the intended as well as unintended mechanisms of action of the treatment intervention. Because such insights are inherently imperfect, there is also a need for meta-analyses of completed trials to assess the relationship of the net effects of interventions on the potential surrogate end points and on the patient-centered outcomes. Informative illustrations of that process are provided by the validation of “death or cancer recurrence” in the adjuvant colon setting for 5-fluorouracil–based regimens.5 Another illustration is the validation of systolic and diastolic blood pressure as a surrogate for patient-centered outcomes for antihypertensive drug trials.6 Intermediate outcome validation is, however, specific for drugs or classes of drugs because the validity of a surrogate might not properly extrapolate across different drug classes (or even to another drug in the same class), especially if important unintended effects are drug- or class-specific. Caution is also required before extrapolating surrogates from adults to children.7

If assessments of efficacy are based on replacement end points that are not properly validated as surrogates for direct measures of how a patient feels, functions, or survives, patients may be exposed to interventions that have unfavorable benefit-to-risk profiles. A classic example is the use of class IC antiarrhythmic agents to suppress arrhythmias after myocardial infarction. While beneficial effects on the arrhythmia intermediate outcome led to off-label use of these agents in hundreds of thousands of patients per year, the placebo-controlled Cardiac Arrhythmia Suppression Trial8 revealed that these drugs tripled the death rate. There are many other examples in which biomarkers, when used as replacement end points, have yielded misleading results about efficacy.6 While use of replacement end points provides more rapid assessments of experimental interventions, timeliness should not be achieved at the expense of a misleading risk-benefit profile. Registrational or pivotal trials should use direct measures of how a patient feels, functions, or survives whenever replacement end points have not been properly validated.
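The “time in response” arithmetic described earlier can be sketched directly. The patient counts, response rates, and durations below are invented for illustration; the point is that nonresponders contribute zero, so the mean scales with both ORR and DOR:

```python
# Illustrative sketch; all counts and durations are invented.
# "Time in response" gives every randomized patient a value:
# responders contribute their duration of response (DOR, in months),
# and nonresponders contribute zero, which enables an
# intention-to-treat summary.

def mean_time_in_response(n_patients, n_responders, mean_dor_months):
    """Mean time in response across ALL randomized patients."""
    return (n_responders * mean_dor_months) / n_patients

base = mean_time_in_response(100, 20, 6.0)       # ORR 20%, DOR 6 mo
improved = mean_time_in_response(100, 40, 12.0)  # ORR and DOR both doubled

print(improved / base)  # doubling both ORR and DOR -> 4-fold increase
```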
ARTICLE INFORMATION
Author Affiliations: Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison (DeMets); Department of Medicine and Epidemiology, University of Washington, Seattle (Psaty); Department of Biostatistics, University of Washington, Seattle (Fleming).
Corresponding Author: David L. DeMets, PhD, Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, 610 Walnut St, WARF Bldg, Second Floor, Madison, WI 53726 ([email protected]).
Section Editors: Roger J. Lewis, MD, PhD, Department of Emergency Medicine, Harbor-UCLA Medical Center and David Geffen School of Medicine at UCLA; and Edward H. Livingston, MD, Deputy Editor, JAMA.
Published Online: February 27, 2020. doi:10.1001/jama.2020.1176
Conflict of Interest Disclosures: Dr DeMets reported receiving personal fees from Actelion, AstraZeneca, Boston Scientific, Medtronic, DalCor, Duke Clinical Research Institute, Intercept, Liva Nova, Mesoblast, University of Minnesota, Sanofi, Pop Health Research Institute, Amgen, Bristol-Myers Squibb, GlaxoSmithKline, and Merck. Dr Psaty reported serving on the steering committee of the Yale Open Data Access Project funded by Johnson & Johnson. No other disclosures were reported.
Funder/Support: Work on this article was supported by National Institutes of Health (NIH) grant R37-AI29168 (Dr Fleming).
Role of the Funder/Sponsor: The NIH had no role in the preparation, review, or approval of the manuscript; and decision to submit the manuscript for publication.

REFERENCES
1. Wittes J, Lakatos E, Probstfield J. Surrogate endpoints in clinical trials: cardiovascular diseases. Stat Med. 1989;8(4):415-425. doi:10.1002/sim.4780080405
2. Prentice RL. Surrogate endpoints in clinical trials: definition and operational criteria. Stat Med. 1989;8(4):431-440. doi:10.1002/sim.4780080407
3. Fleming TR, DeMets DL. Surrogate end points in clinical trials: are we being misled? Ann Intern Med. 1996;125(7):605-613. doi:10.7326/0003-4819-125-7-199610010-00011
4. Ritchie G, Gasper H, Man J, et al. Defining the most appropriate primary end point in phase 2 trials of immune checkpoint inhibitors for advanced solid cancers: a systematic review and meta-analysis. JAMA Oncol. 2018;4(4):522-528. doi:10.1001/jamaoncol.2017.5236
5. Fleming TR. Surrogate endpoints and FDA’s accelerated approval process. Health Aff (Millwood). 2005;24(1):67-78. doi:10.1377/hlthaff.24.1.67
6. Fleming TR, Powers JH. Biomarkers and surrogate endpoints in clinical trials. Stat Med. 2012;31(25):2973-2984. doi:10.1002/sim.5403
7. Barst RJ, Ivy DD, Gaitan G, et al. A randomized, double-blind, placebo-controlled, dose-ranging study of oral sildenafil citrate in treatment-naive children with pulmonary arterial hypertension. Circulation. 2012;125(2):324-334. doi:10.1161/CIRCULATIONAHA.110.016667
8. Cardiac Arrhythmia Suppression Trial (CAST) Investigators. Preliminary report: effect of encainide and flecainide on mortality in a randomized trial of arrhythmia suppression after myocardial infarction. N Engl J Med. 1989;321(6):406-412. doi:10.1056/NEJM198908103210629
jama.com (Reprinted) JAMA March 24/31, 2020 Volume 323, Number 12 1185
The Cox proportional hazards model, introduced in 1972,1 has become the default approach for survival analysis in randomized trials. The Cox model estimates the ratio of the hazard of the event or outcome of interest (eg, death) between 2 treatment groups. Informally, the hazard at any given time is the probability of experiencing the event of interest in the next interval among individuals who had not yet experienced the event by the start of the interval. Because the Cox model requires the hazards in both groups to be proportional, researchers are often asked to “test” whether hazards are proportional.

Supplemental content

What Does It Mean That Hazards Are Proportional?
The hazards are proportional if the hazard ratio remains constant from day 1 of the study until the end of follow-up. In practice, this does not occur for most medical interventions. Three articles previously published in JAMA illustrate different scenarios regarding proportional hazards.

Scenario 1—No Immediate Effect
The Air Force/Texas Coronary Atherosclerosis Prevention Study2 randomly assigned patients without atherosclerotic cardiovascular disease to either statin therapy or placebo. The hazard ratio of a major adverse cardiovascular event was 0.63 (95% CI, 0.50-0.79) for statin vs placebo. However, the cumulative incidences of major adverse cardiovascular event in the statin and placebo groups were almost identical during the first 6 months of follow-up and diverged thereafter. That is, the overall hazard ratio of 0.63 was a weighted average of the time-varying hazard ratios, which were close to 1 in the first months of follow-up and declined later.

Scenario 2—Immediate and Delayed Effects in Opposite Directions
The Norwegian Colorectal Cancer Prevention Trial3 randomly assigned individuals aged 50 to 64 years to flexible sigmoidoscopy screening or no screening. The hazard ratio of colorectal cancer was 0.80 (95% CI, 0.70-0.92) for screening vs no screening. However, the cumulative incidence was greater in the screening group until about 5 years of follow-up and lower after that time. That is, the hazard ratio of 0.80 was a weighted average of the time-varying hazard ratios, which were greater than 1 in the early follow-up and less than 1 in the later follow-up.

Scenario 3—Variations in Disease Susceptibility
A Women’s Health Initiative study4 randomly assigned postmenopausal women to either estrogen plus progestin hormone therapy or placebo. The hazard ratio of coronary heart disease was 1.24 (95% CI, 1.00-1.54) for hormone therapy vs placebo. However, the hazard ratio was 1.8 during the first year and 0.70 after 5 years of follow-up. The overall hazard ratio of 1.24 was a weighted average of the time-varying hazard ratios throughout the follow-up.

Why Are Hazards Usually Not Proportional in Medical Studies?
Hazards are not proportional when the treatment effect changes over time. In scenario 1, the effect of statin therapy on cardiovascular events only became evident after 6 months or longer. In scenario 2, screening for colorectal cancer had both an immediate effect on the detection of undiagnosed cancers (hazard ratio of colorectal cancer greater than 1 in the early follow-up) and a delayed preventive effect due to the removal of cancer precursors (hazard ratio less than 1 later in the follow-up).5

Hazards may also not be proportional because disease susceptibility varies between individuals. Those with greater disease susceptibility are more likely to develop the disease earlier. In scenario 3, some women had a greater risk of coronary heart disease than others because of, for example, a genetic predisposition. Even if hormone therapy increased the risk of disease by a constant factor (eg, by 80%) at every single time of follow-up, it is still possible that the hazard ratio would have declined from 1.8 during the first year of follow-up to less than 1 in later years because the most susceptible women would have been diagnosed with coronary heart disease in the early follow-up.6 As a result, the most susceptible women would have been removed more rapidly from the treatment group than from the control group. After 5 years, women without coronary heart disease in the treatment group would have been, on average, less susceptible to develop the disease than those in the placebo group. That is, remaining disease-free for 5 years in the treatment group would have become a proxy for being resistant to the development of coronary heart disease. The hazard ratio after 5 years can be less than 1 even if hormone therapy did not prevent coronary heart disease in any women in the study.4,5

The Figure depicts the 3 scenarios described above. These examples illustrate why hazards are not expected to be proportional in almost any clinical study. The exception is when the treatment has no effect—then the hazard ratio is constant at 1 throughout the follow-up.

Figure. Nonproportional Hazards and Survival Curves in 3 Hypothetical Trials Comparing a Treatment vs a Control
[Each of the 3 panels plots survival, %, and hazard, ×10–4, over follow-up time for the treatment and control groups.] In scenario 1, both the hazards (dotted lines) and the survival curves (solid lines) gradually diverge (ie, the hazard ratio is not constant but is always greater than 1). In scenario 2, both the hazards and the survival curves cross (ie, the hazard ratio goes from greater than 1 to less than 1). In scenario 3, the hazards cross because of depletion of susceptibles in the treatment group, but the survival curves do not cross. The hazards would have been proportional if the dotted lines were straight and horizontal.

What Are the Problems of Using Hazard Ratios From Proportional Hazards Models?
A mortality hazard ratio estimate of, for instance, 0.7 for treatment vs placebo cannot be interpreted as a constant 30% mortality decrease in the treatment group at all times during the follow-up period. Rather, a hazard ratio of 0.7 means that, on average, treatment decreases mortality during the follow-up period. The magnitude of the cumulative benefit at a particular time can only be conveyed by a comparison of the survival (proportion of individuals alive) in each group.

One limitation of using Cox regression models when the hazard ratio is not constant during the follow-up period is reporting an incorrect standard variance estimator when the statistical model includes covariates other than the treatment group indicator.7 This limitation can be overcome, and valid 95% confidence intervals can be estimated, by using bootstrapping methods. Another limitation is that the magnitude of the Cox hazard ratio depends on the distribution of losses to follow-up (censoring), even if the losses occur at random. This limitation can be overcome by estimating an inverse probability–weighted hazard ratio (eAppendix in the Supplement).

How Should Hazard Ratios Be Interpreted?
As a weighted average of the time-varying hazard ratios, the hazard ratio estimate from a Cox proportional hazards model is often used as a convenient summary of the treatment effect during the follow-up. However, a hazard ratio from a Cox model needs to be interpreted as a weighted average of the true hazard ratios over the entire follow-up period. The 95% confidence interval should be estimated using a valid method such as bootstrapping and also using inverse probability weighting to adjust for losses to follow-up.

An implication is that statistical tests for proportional hazards are unnecessary. Because it is expected that the hazard ratio will vary over the follow-up period, tests of proportional hazards yielding high P values are probably underpowered.

Reports of hazard ratios should be supplemented with reports of effect measures directly calculated from absolute risks, such as the survival differences or the restricted mean survival difference,8 at times prespecified in the study protocol. These measures are arguably more helpful for clinical decision-making and more easily understood by patients.

jama.com (Reprinted) JAMA April 14, 2020 Volume 323, Number 14 1401
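The depletion-of-susceptibles mechanism in scenario 3 can be made concrete with a small deterministic sketch. The two susceptibility classes, their yearly event probabilities, and the constant individual-level factor of 1.8 are invented (not the Women's Health Initiative data); the point is that the population hazard ratio starts at 1.8 and falls below 1 even though no individual's hazard ratio ever changes:

```python
# Deterministic sketch of scenario 3: "depletion of susceptibles."
# All numbers are invented. A high-susceptibility class (30% of each
# group, yearly event risk 0.20) is mixed with a low-susceptibility
# class (risk 0.01); treatment multiplies every individual's yearly
# risk by a constant factor of 1.8.

def yearly_population_hazards(p_high, h_high, h_low, factor, years):
    """Pooled yearly hazards of a 2-class mixture: the probability of
    an event in each year among those still event-free at its start."""
    s_high, s_low = p_high, 1.0 - p_high      # event-free fractions
    e_high = min(1.0, h_high * factor)
    e_low = min(1.0, h_low * factor)
    hazards = []
    for _ in range(years):
        events = s_high * e_high + s_low * e_low
        hazards.append(events / (s_high + s_low))
        s_high *= 1.0 - e_high                # high-risk class depletes fastest
        s_low *= 1.0 - e_low
    return hazards

treated = yearly_population_hazards(0.3, 0.20, 0.01, 1.8, 8)
control = yearly_population_hazards(0.3, 0.20, 0.01, 1.0, 8)
hr_by_year = [t / c for t, c in zip(treated, control)]

# The year-1 hazard ratio equals the true individual-level factor, 1.8;
# the ratio then declines year by year and eventually crosses 1.
print([round(hr, 2) for hr in hr_by_year])
```

Because each individual's hazard is multiplied by the same constant, the year-1 ratio is exactly 1.8; every later ratio compares increasingly dissimilar survivor populations, which is the selection effect the article describes.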
ARTICLE INFORMATION
Author Affiliations: Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, Massachusetts (Stensrud, Hernán); Department of Biostatistics, Oslo Centre for Biostatistics and Epidemiology, University of Oslo, Oslo, Norway (Stensrud); Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, Massachusetts (Hernán); Harvard-MIT Division of Health Sciences and Technology, Boston, Massachusetts (Hernán).
Corresponding Author: Mats J. Stensrud, MD, DrPhilos, Department of Epidemiology, Harvard T.H. Chan School of Public Health, 677 Huntington Ave, Boston, MA 02115 ([email protected]).
Section Editors: Roger J. Lewis, MD, PhD, Department of Emergency Medicine, Harbor-UCLA Medical Center and David Geffen School of Medicine at UCLA; and Edward H. Livingston, MD, Deputy Editor, JAMA.
Published Online: March 13, 2020. doi:10.1001/jama.2020.1267
Conflict of Interest Disclosures: Dr Stensrud reported receiving grants from the Research Council of Norway (NFR239956/F20) and the ASISA Fellowship. Dr Hernán reported receiving a grant from the National Institutes of Health (R37 AI102634).
Additional Information: The simulated data and R code are available at https://github.com/CausalInference.

REFERENCES
1. Cox DR. Regression models and life-tables. J R Stat Soc Series B Stat Methodol. 1972;34:187-202.
2. Downs JR, Clearfield M, Weis S, et al. Primary prevention of acute coronary events with lovastatin in men and women with average cholesterol levels: results of AFCAPS/TexCAPS. JAMA. 1998;279(20):1615-1622. doi:10.1001/jama.279.20.1615
3. Holme Ø, Løberg M, Kalager M, et al. Effect of flexible sigmoidoscopy screening on colorectal cancer incidence and mortality: a randomized clinical trial. JAMA. 2014;312(6):606-615. doi:10.1001/jama.2014.8266
4. Rossouw JE, Anderson GL, Prentice RL, et al; Writing Group for the Women’s Health Initiative Investigators. Risks and benefits of estrogen plus progestin in healthy postmenopausal women: principal results from the Women’s Health Initiative randomized controlled trial. JAMA. 2002;288(3):321-333. doi:10.1001/jama.288.3.321
5. Manson JE, Hsia J, Johnson KC, et al; Women’s Health Initiative Investigators. Estrogen plus progestin and the risk of coronary heart disease. N Engl J Med. 2003;349(6):523-534. doi:10.1056/NEJMoa030808
6. Hernán MA. The hazards of hazard ratios. Epidemiology. 2010;21(1):13-15. doi:10.1097/EDE.0b013e3181c1ea43
7. DiRienzo A, Lagakos S. Effects of model misspecification on tests of no randomized treatment effect arising from Cox’s proportional hazards model. J R Stat Soc Series B Stat Methodol. 2001;63:745-757. doi:10.1111/1467-9868.00310
8. Pak K, Uno H, Kim DH, et al. Interpretability of cancer clinical trial results using restricted mean survival time as an alternative to the hazard ratio. JAMA Oncol. 2017;3(12):1692-1696. doi:10.1001/jamaoncol.2017.2797
In this issue of JAMA, Baxter and colleagues1 from the Non-Invasive Abdominal Aortic Aneurysm Clinical Trial (N-TA3CT) research group report findings from a clinical trial that evaluated the effect of doxycycline vs placebo on aneurysm growth among patients with small infrarenal abdominal aortic aneurysms. The primary outcome was the maximum transverse diameter of the aneurysm relative to the initial baseline value after 2 years of treatment. However, the investigators anticipated that some patients might die or experience rupture of the aneurysm requiring endovascular repair. In these patients, the maximum transverse diameter measurement at 2 years would be missing but not by chance alone because the missing data are informative about the status of these patients relative to those who completed the study and had 2-year measurement data available. To allow for this informatively missing data, the statistical analysis plan for the study prespecified that the primary efficacy analysis would be conducted using worst ranks, a nonparametric analytic method. In this JAMA Guide to Statistics and Methods, general nonparametric statistics are addressed.

Related article page 2029

What Are Nonparametric Statistics?
Many statistical methods start with a statistical assumption that the distribution of measured values can be summarized by relatively few parameters. For example, a normal or bell-shaped distribution is completely defined by 2 parameters, the mean and the standard deviation. The commonly used t test that compares the means of 2 groups assumes the data arise from 2 populations and that each has a normal distribution with the same standard deviation but with different means. This is called a parametric analysis because the assumed distribution (eg, normal) can be completely summarized by a few parameters (eg, the mean and standard deviation).

On the other hand, a rank-based nonparametric analysis provides an alternative approach that requires fewer assumptions. Rather than assume that the data have a specific parametric distribution, nonparametric methods assess whether the distributions between groups appear to differ, without assuming a specific shape for those distributions. The simplest such nonparametric test is the Wilcoxon rank sum test,2 which examines the order of the observed values—their ranks—in the 2 groups. In this approach, the observed values are replaced by their ranks and the ranks are analyzed.

As a simple example, consider 2 groups (A and B) with the maximum transverse diameter values in Table 1. In a rank analysis, each value is replaced by its rank across all groups. This yields the ascending ranks (lowest to highest) in Table 2.

Table 2. Ranks of the Observed Values in Each Group
Group  Ranks
A      8  7  4  6
B      2  3  9  1  5

Under the null hypothesis of no difference in the distributions between groups, regardless of their underlying shape, each group should have a similar mixture of low, intermediate, and high ranks. The statistical test of significance can then be based on the sum of the ranks in either group. Consider group A, the smaller of the 2 groups with 4 observations, for which the sum of the ranks is Sa = 25. The sum of all 9 ranks is 1 + 2 + … + 9 = 45 and the average rank is 45/9 = 5. Thus, under the null hypothesis, the expected sum in group A by chance alone is the average rank multiplied by the number of observations, (5)(4) = 20, which is not substantially different from what was observed (25), and the resulting 2-sided P value is .29. The same P value is obtained if the test is based on the sum of the ranks in the other group B.

The Wilcoxon test is also equivalent to a Mann-Whitney analysis3 that provides an alternate derivation and computation. Thus, the test is commonly called the Wilcoxon-Mann-Whitney test. This test is also a member of a family of linear rank statistics4,5 that includes as special cases the Mantel-Haenszel or log-rank test6 for survival data among many others.

A linear rank test extends the concept of the sum of ranks to a sum of rank scores, for which the score is a specified function of the simple rank. If one hypothesizes that the observations in the population follow a given distribution, then it is possible to describe the rank scores that provide an optimal linear rank statistic. For example, for a logistic distribution, the simple ranks themselves provide an optimal test (ie, the Wilcoxon test). Likewise, for a normal distribution with equal standard deviations, a test using van der Waerden or normal scores7 is optimal. The power of the test is also nearly equivalent to the power of the t test.

Why Are Nonparametric Statistics Important?
The principal advantage of a nonparametric test is that it provides a test of a null hypothesis of equality of distributions that does not require specific parametric assumptions. For example, the t test provides an optimal and valid test of the equality of the distributions in the 2 groups if it is assumed that both distributions are normal (bell shaped) with the same mean and with the same standard deviation. Under these assumptions, the test of equality of means is equivalent to a test of the equality of the entire distributions, and the type I (false-positive) error probability does not exceed the specified significance level, such as .05. However, when the distributions in the 2 groups have the same means but have different standard deviations or even different shapes, the type I error probability of the t test can be increased, leading to more false-positive results than specified by the significance level (eg, more than .05). On the other hand, a nonparametric test will provide the desired type I error probability (eg, .05) regardless of whether the normal distribution assumptions apply.

In general, a nonparametric test such as the Wilcoxon test also has good power relative to a t test or other parametric tests. The relative power of 2 possible tests is assessed by the relative efficiency, often expressed as the ratio of sample sizes needed to provide the same power.5 When the actual distributions are normal but there is a difference in means, the Wilcoxon test has a relative efficiency of 0.955 vs the t test, meaning that if the t test required a sample size of 100 to achieve a given level of power, the Wilcoxon test would require a sample size of 100/0.955 = 105 to provide the same power, a small difference. Furthermore, the Wilcoxon test is the optimal or most efficient test when the actual distributions are logistic, meaning a bell shape but with a higher fraction of the observations closer to the mean than the normal distribution, and a slightly smaller fraction in the shoulders of the distribution. Then the relative efficiency of the Wilcoxon test to the t test is 1.1, meaning that the t test would need a 10% higher sample size to provide the same power as the Wilcoxon test. Similarly, the normal scores test is highly efficient over a range of possible distributions, ie, is also robust.

Ranks (or rank scores) can also be used as the dependent variable in a regression model that adjusts for other covariates, termed a rank transformation analysis,8,9 in which the regression model-based test (eg, a t test) is approximately equal to the large-sample rank-based test. For example, in a 2-group comparative study with before and after treatment observations, a common approach would be to use an analysis of covariance (ANCOVA), in which the difference between posttreatment group means is adjusted for the pretreatment or baseline values in a simple regression model to increase study power. A similar nonparametric analysis can be conducted by using the ranks of both the baseline values and the posttreatment values. This approach has advantages8,9 over the analysis of the raw values because it avoids the substantial inflation of the type I (false-positive) error rate or loss of power that may occur if there are deviations from the normality assumptions required by the usual ANCOVA analyses.

Limitations and Alternatives to Using Nonparametric Statistics
With a parametric analysis such as a t test, the analysis often starts with estimates of the assumed parameters, such as the means in each group, the standard deviation, and the mean difference and its standard error. The t test is then computed as a function of these parameter estimates so that there is a 1:1 correspondence between the test and the confidence limits; ie, if P < .05, the confidence limits on the mean difference do not include zero. Conversely, a nonparametric test is not based on estimates of population parameters, in which case its results may conflict with parameter estimates. On the other hand, a nonparametric test can be based on summary quantities that can be interpreted without the need for any distributional assumptions, such as the Mann-Whitney difference mentioned above.

How Were Nonparametric Statistics Used in This Study?
Nonparametric methods were used in the N-TA3CT study analysis of the growth of abdominal aortic aneurysms. The authors used a rank transformation ANCOVA in which the changes in the maximum transverse diameter rank scores from baseline to 2 years were compared between groups while also adjusting for the rank score of the baseline measure and sex. Van der Waerden or normal rank scores were used.

In this study, measures of the aortic diameter would be missing if the aneurysm required repair or had ruptured during the course of the study. When analyzing such study results, it is important to account for missing data.10 When missing data are related to the outcome of interest, nonparametric methods can also be used to account for nonrandomly missing data that are related to and provide information about the outcome. Baxter et al1 used worst-rank methods to analyze the outcomes of their study. This method will be reviewed in a subsequent JAMA Guide to Statistics and Methods article.

2080 JAMA May 26, 2020 Volume 323, Number 20 (Reprinted) jama.com
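The worked example above can be reproduced exactly (a sketch: the 2-sided P value follows from enumerating all 126 equally likely assignments of ranks to group A), and van der Waerden scores can be computed with the standard library's normal distribution:

```python
from itertools import combinations
from statistics import NormalDist

# Ranks from the worked example (group A: n=4, group B: n=5).
ranks_a = [8, 7, 4, 6]
n_a, n_total = 4, 9

s_a = sum(ranks_a)                   # observed rank sum: 25
expected = n_a * (n_total + 1) / 2   # average rank (5) x n_a (4) = 20

# Exact 2-sided P value: under the null hypothesis, all C(9, 4) = 126
# assignments of 4 of the 9 ranks to group A are equally likely; count
# rank sums at least as far from 20 as the observed 25.
all_sums = [sum(c) for c in combinations(range(1, n_total + 1), n_a)]
dev = abs(s_a - expected)
p = sum(abs(s - expected) >= dev for s in all_sums) / len(all_sums)
print(round(p, 2))  # 0.29, matching the example

# Van der Waerden (normal) scores replace rank r with the standard
# normal quantile at r / (n + 1); these are the scores used in the
# N-TA3CT rank transformation ANCOVA.
vdw = [NormalDist().inv_cdf(r / (n_total + 1)) for r in range(1, n_total + 1)]
print(round(vdw[0], 2))  # score for rank 1: -1.28
```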
ARTICLE INFORMATION
Author Affiliation: The Biostatistics Center, Department of Biostatistics and Bioinformatics, Milken Institute School of Public Health, George Washington University, Rockville, Maryland.
Corresponding Author: John M. Lachin, ScD, Biostatistics Center, George Washington University, 6110 Executive Blvd, Rockville, MD 20852 ([email protected]).
Section Editors: Roger J. Lewis, MD, PhD, Department of Emergency Medicine, Harbor-UCLA Medical Center and David Geffen School of Medicine at UCLA; and Edward H. Livingston, MD, Deputy Editor, JAMA.
Conflict of Interest Disclosures: None reported.

REFERENCES
1. Baxter BT, Matsumura J, Curci JA, et al. Effect of doxycycline on aneurysm growth among patients with small infrarenal abdominal aortic aneurysms: a randomized clinical trial. JAMA. Published May 26, 2020. doi:10.1001/jama.2020.5230
2. Wilcoxon F. Individual comparisons by ranking methods. Biometrics Bull. 1945;1(6):80-83. doi:10.2307/3001968
3. Mann HB, Whitney DR. On a test of whether one of two random variables is stochastically larger than the other. Ann Math Stat. 1947;18(1):50-60. doi:10.1214/aoms/1177730491
4. Randles RH, Wolfe DA. Introduction to the Theory of Nonparametric Statistics. John Wiley & Sons; 1979.
5. Tolles J, Lewis RJ. Time-to-event analysis. JAMA. 2016;315(10):1046-1047. doi:10.1001/jama.2016.1825
6. van der Waerden BL. Order tests for the two-sample problem and their power. Indag Math. 1952;14:453-458. doi:10.1016/S1385-7258(52)50063-5
7. Conover WJ, Iman RL. Analysis of covariance using the rank transformation. Biometrics. 1982;38(3):715-724. doi:10.2307/2530051
8. Conover WJ, Iman RL. Rank transformations as a bridge between parametric and nonparametric statistics. Am Stat. 1981;35(3):124-133. doi:10.2307/2683975
9. Newgard CD, Lewis RJ. Missing data: how to best account for what is not known. JAMA. 2015;314(9):940-941. doi:10.1001/jama.2015.10516
10. Lachin JM. Worst-rank score analysis with informatively missing observations in clinical trials. Control Clin Trials. 1999;20(5):408-422. doi:10.1016/S0197-2456(99)00022-7
jama.com (Reprinted) JAMA May 26, 2020 Volume 323, Number 20 2081
Overlap Weighting: A Propensity Score Method That Mimics Attributes of a Randomized Clinical Trial
Laine E. Thomas, PhD; Fan Li, PhD; Michael J. Pencina, PhD
Evidence obtained from clinical practice settings that compares alternative treatments is an important source of information about populations and end points for which randomized clinical trials are unavailable or infeasible.1 Unlike clinical trials, which strive to ensure patient characteristics are comparable across treatment groups through randomization, observational studies must attempt to adjust for differences (ie, confounding). This is frequently addressed with a propensity score (PS) that summarizes differences in patient characteristics between treatment groups. The PS is the probability that each individual will be assigned to receive the treatment of interest given their measured covariates.2 Matching or weighting on the PS is used to adjust comparisons between the 2 groups being compared.2,3

Related article at jamacardiology.com

In an article in JAMA Cardiology, Mehta et al evaluated the association between angiotensin-converting enzyme inhibitors (ACEIs), angiotensin II receptor blockers (ARBs), or both with testing positive for severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), the virus that causes coronavirus disease 2019 (COVID-19), in 18 472 patients who were tested in the Cleveland Clinic Health System between March 8, 2020, and April 12, 2020.4 Overlap weighting5,6 based on the PS was used to adjust for confounding in the comparison of the 2285 patients who had been treated with ACEIs/ARBs with the 16 187 patients who did not receive ACEIs/ARBs. After adjustment, there was no significant association between ACEI/ARB use and testing positive for SARS-CoV-2.

Use of the Method
Why Is Overlap Weighting Used?
Overlap weighting is a PS method that attempts to mimic important attributes of randomized clinical trials: a clinically relevant target population, covariate balance, and precision. The target population is the group of patients for whom the conclusions are drawn.3 Balance refers to the similarity of patient characteristics across treatment groups, which is an important condition for avoiding bias. Precision denotes the certainty about the estimate of association between the treatment and the outcome of interest; more precise estimates have narrower CIs and greater statistical power.

Although the classic PS methods of inverse probability of treatment weighting (IPTW) and matching can adjust for differences in measured characteristics,2,3 these methods have potential limitations with respect to target population, balance, and precision.5 Conventional IPTW assigns a weight of 1/PS for treated and 1/(1 − PS) for untreated patients, allowing individuals with underrepresented characteristics to count more in the analysis.3 Matching operates differently, taking each treated study participant and finding the closest PS match among controls, usually within a bound. In observational data, in which the initial differences between treatment groups may be large, these methods can modify the target population,3 fail to achieve good balance, or substantially worsen precision.5

Overlap weighting overcomes these limitations by assigning weights to each patient that are proportional to the probability of that patient belonging to the opposite treatment group.6 Specifically, treated patients are weighted by the probability of not receiving treatment (1 − PS) and untreated patients are weighted by the probability of receiving treatment (PS). These weights are smaller for extreme PS values, so that outliers who are nearly always treated (PS near 1) or never treated (PS near 0) do not dominate the results and worsen precision, as occurs with IPTW. These outliers contribute relatively less to the result, while patients whose characteristics are compatible with either treatment contribute relatively more (Figure, A and B). The resulting target population mimics the characteristics of a pragmatic randomized trial that is highly inclusive, excluding no study participants from the available sample but emphasizing the comparison of patients at clinical equipoise. Moreover, overlap weighting has desirable statistical properties. It leads to exact balance on the mean of every measured covariate when the PS is estimated by logistic regression, and it is proven to optimize the precision of the estimated association between treatment and outcomes among a large class of PS weighting methods, including IPTW and an analogue to matching.6 Overlap weighting can be as efficient as randomization when no adjustment is needed.5

What Are the Limitations of Overlap Weighting?
Like all PS methods, overlap weighting cannot adjust for patient characteristics that are not measured and included in the model for the PS. It is important to identify confounding variables from the literature, attempt to include them in the analysis, and recognize potential bias due to unmeasured factors. For applications in which the initial imbalances in patient characteristics between treatment groups are modest, overlap weighting yields results similar to IPTW. The advantages of overlap weighting are greatest when the comparator groups are initially very different.

Why Did the Authors Use Overlap Weighting in This Study?
Mehta et al4 used overlap weighting to achieve good balance and minimize the variance of the estimated association between ACEI/ARB treatment and a positive SARS-CoV-2 test result. Both goals were achieved. Balance was demonstrated by reporting the overlap-weighted covariate means (or proportions) for the group that received ACEIs/ARBs and the group that did not. There was no difference between groups after weighting (Figure, C). The list of covariates included risk factors related to receiving ACEI/ARB treatment and associated with testing positive for COVID-19. The adjusted treatment comparisons were estimated with narrow CIs, providing strong evidence for the null result.
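The weighting scheme described above is straightforward to sketch in code. The following simulation (simulated, confounded data; not the Mehta et al analysis) fits a logistic PS model by maximum likelihood, forms the overlap weights, and checks the exact mean-balance property:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
age = rng.normal(60, 10, n)
diabetes = rng.binomial(1, 0.3, n)
X = np.column_stack([np.ones(n), age, diabetes])  # design matrix with intercept

# simulate confounded treatment assignment: older patients and patients
# with diabetes are more likely to be treated
true_logit = -6 + 0.08 * age + 0.8 * diabetes
z = rng.binomial(1, 1 / (1 + np.exp(-true_logit)))

# fit the logistic PS model by Newton-Raphson (maximum likelihood)
beta = np.zeros(X.shape[1])
for _ in range(25):
    p = 1 / (1 + np.exp(-X @ beta))
    beta += np.linalg.solve(X.T @ (X * (p * (1 - p))[:, None]), X.T @ (z - p))
ps = 1 / (1 + np.exp(-X @ beta))  # estimated propensity scores

# overlap weights: treated patients get 1 - PS, untreated patients get PS
w = np.where(z == 1, 1 - ps, ps)

def weighted_mean(x, group):
    m = z == group
    return np.sum(w[m] * x[m]) / np.sum(w[m])

# exact balance: weighted means agree (to numerical precision) for every
# covariate included in the logistic PS model
for name, x in [("age", age), ("diabetes", diabetes)]:
    print(name, weighted_mean(x, 1) - weighted_mean(x, 0))
```

Running this prints differences on the order of machine precision, a direct consequence of the logistic score equations; substituting the IPTW weights 1/PS and 1/(1 − PS) on the same data yields only approximate balance and produces a few extreme weights.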
jama.com (Reprinted) JAMA June 16, 2020 Volume 323, Number 23 2417
Figure. Effect of Overlap Weighting on the Relative Contribution of 50 Simulated Patients With Different Ages and Diabetes Status
[Figure shows bubble plots of age by diabetes status (panels A and B) and absolute standardized mean differences for age, BMI, CAD, COPD, diabetes, heart failure, hypertension, and male sex (panel C).]
Simulated according to the distribution of the same variables in the study by Mehta et al.4 The bubble size reflects the relative contribution of each patient to the analysis. A, Each patient represents only themselves. Patients receiving angiotensin-converting enzyme inhibitors (ACEIs) are older and more likely to have diabetes. B, After overlap weighting, some patients represent up to 3 other patients, while others represent less than 1 other patient. C, The absolute standardized mean difference is the absolute value of the difference in means between treatment groups divided by the SD. An absolute standardized mean difference less than or equal to 0.10 indicates good balance. BMI indicates body mass index; CAD, coronary artery disease; COPD, chronic obstructive pulmonary disease.
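The balance metric defined in the figure caption can be computed directly; a minimal sketch with made-up summary numbers:

```python
def abs_std_mean_diff(mean_a, mean_b, sd):
    """|difference in means| / SD; <= 0.10 is conventionally good balance."""
    return abs(mean_a - mean_b) / sd

# hypothetical age summaries (years) in two treatment groups
smd = abs_std_mean_diff(63.2, 62.5, 14.0)
print(round(smd, 3))  # prints 0.05
```

A value of 0.05 would meet the 0.10 threshold described in the caption; note that conventions differ on whether the SD is pooled across groups or taken from one group.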
How Should the Results of Overlap Weighting Be Interpreted in This Study?
The primary results of the study by Mehta et al4 can be interpreted just like those of other PS methods. That is, after adjustment for differences in cardiovascular risk factors, 9.1% of patients who were treated with ACEIs/ARBs tested positive for SARS-CoV-2 compared with 9.4% of patients who were not treated with ACEIs/ARBs (odds ratio, 0.97 [95% CI, 0.81-1.15]). These estimates are measures of association between ACEI/ARB status and test positivity, with respect to a population of patients at equipoise either to receive treatment with ACEIs/ARBs or not, and for whom all measured covariates are made similar across treatments through overlap weighting. Bias due to unmeasured differences between patients who received ACEI/ARB treatment vs those who did not cannot be ruled out.

Caveats to Consider When Assessing the Results of an Overlap-Weighted Analysis
Overlap weighting creates exact balance on the mean of every measured covariate when the PS is estimated by logistic regression (Figure, C). This is particularly important for reducing bias7; however, balance on the mean may not result in complete adjustment for confounding on that variable. In addition, the baseline characteristics table of the overlap-weighted sample should be presented (Tables 2 and 3 in Mehta et al4). This table can include covariate means, medians, interquartile ranges, or any other statistics that are useful for understanding the population. This approach will help demonstrate which randomized clinical trial is best emulated by the overlap-weighted analysis with respect to target population, balance, and precision.
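As a quick consistency check, the odds ratio can be recomputed from the rounded weighted proportions reported above; because of rounding, this only approximately recovers the published model-based estimate of 0.97:

```python
# odds ratio from the reported weighted proportions of positive tests
p_treated, p_untreated = 0.091, 0.094  # ACEI/ARB group vs no-ACEI/ARB group
odds_ratio = (p_treated / (1 - p_treated)) / (p_untreated / (1 - p_untreated))
print(round(odds_ratio, 2))
```

The wide 95% CI (0.81-1.15) around this near-unity estimate is what supports the null interpretation, not the point estimate alone.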
ARTICLE INFORMATION
Author Affiliations: Department of Biostatistics and Bioinformatics, Duke University, Durham, North Carolina (Thomas, Pencina); Duke Clinical Research Institute, Durham, North Carolina (Thomas, Li, Pencina); Department of Statistical Science, Duke University, Durham, North Carolina (Li).
Corresponding Author: Michael J. Pencina, PhD, Department of Biostatistics and Bioinformatics, Duke University, 2400 Pratt St, Durham, NC 27705 ([email protected]).
Section Editors: Roger J. Lewis, MD, PhD, Department of Emergency Medicine, Harbor-UCLA Medical Center and David Geffen School of Medicine at UCLA; and Edward H. Livingston, MD, Deputy Editor, JAMA.
Published Online: May 5, 2020. doi:10.1001/jama.2020.7819
Conflict of Interest Disclosures: Dr Pencina reported receiving grants from Regeneron/Sanofi and Amgen and personal fees from Merck and Boehringer Ingelheim outside the submitted work. No other disclosures were reported.
REFERENCES
1. Basch E, Schrag D. The evolving uses of "real-world" data. JAMA. 2019;321(14):1359-1360. doi:10.1001/jama.2019.4064
2. Haukoos JS, Lewis RJ. The propensity score. JAMA. 2015;314(15):1637-1638. doi:10.1001/jama.2015.13480
3. Thomas L, Li F, Pencina M. Using propensity score methods to create target populations in observational clinical research. JAMA. 2020;323(5):466-467. doi:10.1001/jama.2019.21558
4. Mehta N, Kalra A, Nowacki AS, et al. Association of use of angiotensin-converting enzyme inhibitors and angiotensin II receptor blockers with testing positive for coronavirus disease 2019 (COVID-19). JAMA Cardiol. Published online May 5, 2020. doi:10.1001/jamacardio.2020.1855
5. Li F, Thomas LE, Li F. Addressing extreme propensity scores via the overlap weights. Am J Epidemiol. 2019;188(1):250-257.
6. Li F, Morgan KL, Zaslavsky AM. Balancing covariates via propensity score weighting. J Am Stat Assoc. 2018;113(521):390-400. doi:10.1080/01621459.2016.1260466
7. Zubizarreta JR. Stable weights that balance covariates for estimation with incomplete outcome data. J Am Stat Assoc. 2015;110(511):910-922. doi:10.1080/01621459.2015.1023805
During epidemics, there is a critical need to understand both the likely number of infections and their time course to inform both public health and health care system responses. Approaches to forecasting the course of an epidemic vary and can include simulating the dynamics of disease transmission and recovery1,2 or empirical fitting of data trends.3 A common approach is to use epidemic compartmental models, such as the susceptible-infected-recovered (SIR) model.1,2

Why Is a SIR Model Used?
The SIR model aims to predict the number of individuals who are susceptible to infection, are actively infected, or have recovered from infection at any given time. This model was introduced in 1927, less than a decade after the 1918 influenza pandemic,4 and its popularity may be due in part to its simplicity, which allows modelers to approximate disease behavior by estimating a small number of parameters.

Description of the SIR Model
In compartmental models, individuals within a closed population are separated into mutually exclusive groups, or compartments, based on their disease status. Each individual is considered to be in 1 compartment at a given time, but can move from one compartment to another based on the parameters of the model. The SIR model is one of the most basic compartmental models, named for its 3 compartments (susceptible, infected, and recovered). In this model, the assumed progression is for a susceptible individual to become infected through contact with another infected individual. Following a period as an infected individual, during which that person is assumed to be contagious, the individual advances to a noncontagious state, termed recovery, although that stage may include death or effective isolation.

In most modeled epidemics, all of a population begins in the susceptible compartment (Figure), which contains individuals who might become infected if exposed to the pathogen. This implies that no one has immunity to the disease at the beginning of the outbreak. The infected compartment is defined as individuals who have the ability to infect individuals in the susceptible compartment. As such, this compartment includes asymptomatic transmitters of the pathogen as well as hospitalized patients who require intensive levels of care. One simplification in the SIR model is that it does not consider the latent period following exposure; rather, it assumes that newly infected individuals are immediately contagious.

The rate at which susceptible individuals become infected depends on the number of individuals in each of the susceptible and infected compartments. At the start of an outbreak, when there are few infected individuals, the disease spreads slowly. As more individuals become infected, they contribute to the spread and increase the rate of infection. An additional factor in calculating the rate of spread is the effective contact rate (β). This parameter accounts for the transmissibility of the disease as well as the mean number of contacts per individual. Community mitigation strategies, such as quarantining infected individuals, social distancing, and closing schools, reduce this value and therefore slow the spread. Although these interventions can alter the movement of individuals from the susceptible compartment to the infected compartment, the transition from the infected to the recovered compartment is solely dependent on the amount of time that an individual is contagious, captured in the rate of recovery (γ).

The term recovered in the SIR model can be misleading because the recovered compartment does not necessarily refer to an individual's clinical course of the disease, but instead represents individuals who are no longer contagious. Because compartmental models assume "closed" populations (without migration), individuals who have gained immunity to the disease and those who die of the disease are both included in this compartment.

The SIR model is defined by only 2 parameters: the effective contact rate (β), which affects the transition from the susceptible compartment to the infected compartment, and the rate of recovery (or mortality; γ), which affects the transition from the infected compartment to the recovered compartment. If the rate at which individuals become infected exceeds the rate at which infected individuals recover, there will be an accumulation of individuals in the infected compartment. The basic reproduction number R0, the mean number of new infections caused by a single infected individual over the course of their illness, is the ratio between β and γ (R0 = β/γ). A decrease in the effective contact rate β through community mitigation strategies decreases R0, delaying and lowering the peak infection rate that occurs in the epidemic (ie, "flattening the curve"). However, to maintain the decrease in total infections, the decrease in R0 generally must be sustained.

What Are the Limitations of SIR Models?
The simplicity of the SIR model makes it easy to compute, but it also likely oversimplifies complex disease processes. The model does not, for example, incorporate the latent period between when an individual is exposed to a disease and when that individual becomes infected and contagious. In the context of coronavirus disease 2019 (COVID-19), this corresponds to the time it takes for severe acute respiratory syndrome coronavirus 2 to replicate in a newly infected individual and reach levels sufficient for transmission. Extensions of the SIR model, such as the SEIR model ("E" denotes exposed but not yet contagious), account for this parameter, but additional extensions of the model would be necessary to, for example, model the time-dependent introduction of community mitigation strategies.

The SIR model also makes several simplifying assumptions about the population. It assumes homogeneous mixing of the population, meaning that all individuals in the population are assumed to have an equal probability of coming in contact with one another. This does not reflect human social structures, in which the majority of contact occurs within limited networks. The SIR model also assumes a closed population with no migration, births, or deaths from causes other than the epidemic.

In addition, the parameters in a traditional SIR model do not allow for quantification of uncertainty in model parameters. The parameter inputs are point estimates, which are single values reflecting the modeler's best guess. A common strategy in predicting the course of an epidemic is to calculate the SIR model over a few possible values for each parameter. The result is a range of future trajectories, but this strategy does not formally quantify the uncertainty in the predictions. More complex models use distributions for each parameter instead of a point estimate to characterize the probability of various future trajectories.5 If the parameters are not known with any precision, these more complex models will demonstrate the uncertainty in projections. The actual effect of social distancing, for example, is often unknown. It is also possible, in more complex adaptations of the SIR compartmental framework, to incorporate observed data formally so that parameter values are estimated from the incoming data.5

How Should SIR Models Be Interpreted?
The SIR model is one of several types of models that can be used to model an infectious disease epidemic. During the COVID-19 pandemic, the results of SIR models have been compared with those of other modeling approaches.6 For example, some groups have used network transmission models, which use information about connectivity between individuals and groups within a population to spatially model disease transmission.7 Alternative models that are not based on the biologic mechanisms of disease have also been developed, such as the Institute for Health Metrics and Evaluation's COVID-19 pandemic model, which is based on fitting curves to empirically observed data.3 When different modeling approaches produce qualitatively different results, it may be due to critical differences in underlying assumptions, making it imperative to determine which assumptions are more likely to be valid. Alternatively, differing results may indicate that the supporting data are simply insufficient to draw a reliable conclusion. Although no model can perfectly predict the future, a good model provides an approximation that is accurate enough to be useful for informing public policy.
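The compartment flows described above can be made concrete with a small simulation. The following sketch (arbitrary illustrative parameters, not calibrated to any real disease) integrates the standard SIR equations, dS/dt = −βSI/N, dI/dt = βSI/N − γI, and dR/dt = γI, with a simple Euler scheme, and shows how lowering β (and hence R0 = β/γ) flattens the epidemic curve:

```python
def simulate_sir(beta, gamma, n=1_000_000, i0=10, days=500, dt=0.1):
    """Euler integration of the SIR model; returns the peak number infected."""
    s, i, r = n - i0, float(i0), 0.0
    peak = i
    for _ in range(int(days / dt)):
        new_infections = beta * s * i / n * dt  # flow S -> I
        new_recoveries = gamma * i * dt         # flow I -> R
        s -= new_infections
        i += new_infections - new_recoveries
        r += new_recoveries
        peak = max(peak, i)
    return peak

gamma = 0.1                                  # ~10 days infectious on average
high = simulate_sir(beta=0.30, gamma=gamma)  # R0 = 3.0, no mitigation
low = simulate_sir(beta=0.15, gamma=gamma)   # R0 = 1.5, with mitigation
print(f"peak infected at R0=3.0: {high:,.0f}")
print(f"peak infected at R0=1.5: {low:,.0f}")
```

With these values the mitigated epidemic peaks at roughly one-fifth of the unmitigated peak, the "flattening" described above; a closed-form check is that the peak infected fraction approaches 1 − (1 + ln R0)/R0 as the initial infected count becomes small.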
ARTICLE INFORMATION
Author Affiliations: Department of Emergency Medicine, Harbor-UCLA Medical Center, Torrance, California (Tolles); David Geffen School of Medicine, Department of Emergency Medicine, University of California, Los Angeles (Tolles); Predictive Healthcare, University of Pennsylvania Health System, Philadelphia (Luong).
Corresponding Author: Juliana Tolles, MD, MHS, Department of Emergency Medicine, Harbor-UCLA Medical Center, 1000 W Carson St, Bldg D9, Torrance, CA 90502 ([email protected]).
Section Editors: Roger J. Lewis, MD, PhD, Department of Emergency Medicine, Harbor-UCLA Medical Center and David Geffen School of Medicine at UCLA; and Edward H. Livingston, MD, Deputy Editor, JAMA.
Published Online: May 27, 2020. doi:10.1001/jama.2020.8420
Conflict of Interest Disclosures: None reported.
REFERENCES
1. Lourenco J, Paton R, Ghafari M, et al. Fundamental principles of epidemic spread highlight the immediate need for large-scale serological surveys to assess the stage of the SARS-CoV-2 epidemic. medRxiv. Preprint posted March 26, 2020. doi:10.1101/2020.03.24.20042291
2. Weissman GE, Crane-Droesch A, Chivers C, et al. Locally informed simulation to predict hospital capacity needs during the COVID-19 pandemic. Ann Intern Med. Published online April 7, 2020. doi:10.7326/M20-1260
3. Murray CJL; IHME COVID-19 health service utilization forecasting team. Forecasting COVID-19 impact on hospital bed-days, ICU-days, ventilator days and deaths by US state in the next 4 months. medRxiv. Preprint posted March 30, 2020. doi:10.1101/2020.03.27.20043752
4. Kermack WO, McKendrick AG. A contribution to the mathematical theory of epidemics. Proc R Soc Lond A. 1927;115:700-721. doi:10.1098/rspa.1927.0118
5. Clancy D, O'Neill PD. Bayesian estimation of the basic reproduction number in stochastic epidemic models. Bayesian Anal. 2008;3(4):737-757.
6. Jewell NP, Lewnard JA, Jewell BL. Predictive mathematical models of the COVID-19 pandemic. JAMA. Published online April 16, 2020. doi:10.1001/jama.2020.6585
7. Zlojutro A, Rey D, Gardner L. A decision-support framework to optimize border control for global outbreak mitigation. Sci Rep. 2019;9(1):2216.
2516 JAMA June 23/30, 2020 Volume 323, Number 24 (Reprinted) jama.com
A prerandomization run-in is a period between screening a potential trial participant and their being randomized. In a 2019 article in JAMA Network Open, Fukuoka et al1 evaluated whether a mobile phone education application and in-person counseling could increase physical activity in 210 study participants. A prerandomization run-in was used to improve adherence and determine baseline physical activity levels of the participants. In this article, the advantages and limitations of run-in periods are reviewed.

Use of Run-in Periods
Description of the Method
A potential participant is screened and asked for consent to take part in the trial but is not immediately randomized. Instead, for a period of weeks or months, the participant may try the intervention (or placebo) to determine if they are likely to be adherent to the study protocol if they were to be randomized. Participants are typically assessed at least twice before randomization to allow consent to be reconfirmed, adherence to treatment or to data collection determined, and baseline and end-of-run-in measures obtained. Only at the later assessment would the participant be randomized, and only if they had been adherent with the study protocol and met all the inclusion criteria.

Why Are Run-in Periods Used in Trials?
For trialists, a key aim of the prerandomization run-in period is to improve adherence to trial treatments or procedures and to reduce loss to follow-up, with a consequent improvement in statistical power. A run-in also may allow assessment of treatment tolerability, response to treatment, or both. From the perspective of trial participants, a run-in allows time to reflect on whether they remain willing to be randomized, including giving more time to understand trial information, experience study procedures, and discuss participation with their managing physicians or family before committing to a long-term trial.

Nonadherence in clinical trials leads to systematic underestimation of the treatment effect that would result from actually taking the treatment (ie, 100% adherence and no drop-ins in the control group).2 A basic principle of randomized trials is that each participant is analyzed in the group to which they were assigned, irrespective of their adherence (ie, an intention-to-treat analysis).3 Undertaking "per-protocol" analyses (ie, analyzing only those who were adherent) introduces bias because of inherent differences between those who do or do not adhere to the trial protocol. By excluding participants likely to be poorly adherent or unwilling to collect adequate data (such as occurred in the study by Fukuoka et al) before randomization, a run-in period can reduce the risk of bias due to differential dropout or data collection and improve statistical power.

Including a prerandomization run-in is particularly valuable in long-term trials, in which adherence to an intervention is required over a prolonged period. However, there is a trade-off between loss of generalizability of the results to a wider population (by excluding potentially nonadherent participants) and an improved chance of obtaining a reliable result through improved adherence. Whether a trial should include a run-in needs to balance these conflicting issues.

A randomized trial with poor adherence leading to an inconclusive result (a type II error) wastes time, money, and participant goodwill. If a statistically reliable result can be achieved by including a prerandomization run-in (while accepting some loss of generalizability), this is preferable. Typically, the randomized population after including a run-in will have a lower absolute event rate than those who were excluded. Those randomized will tend to be healthier (a "healthy volunteer" effect), since older persons, individuals with more severe illness, and less adherent groups are excluded. However, when substantial randomized data from different types of people are available (as for blood pressure or low-density lipoprotein cholesterol [LDL-C] lowering), they indicate that proportional treatment effects are usually generalizable to populations at different levels of absolute risk.4 Run-ins are useful when treatments have known adverse effects that are likely to affect adherence. For example, a trial that investigated the effects of niacin on lipids and cardiovascular events in high-risk patients included an active run-in to minimize postrandomization dropouts due to the known intolerance to niacin in some patients.5 One-third (33.1%) of participants did not tolerate niacin during the active run-in and were not randomized; had they been included and stopped treatment during the trial, there would have been a serious reduction in study power to determine the effect of niacin.

When subgroup analyses of biomarker responses to treatments are anticipated, an active run-in is useful to characterize an individual patient's biomarker response to treatment.6 This type of biomarker analysis is not possible without a run-in because an appropriate comparator cannot be established in the placebo-allocated group, whose biomarker response to the treatment is unknown. In addition, the biomarker response to the intervention may facilitate recalculation of the sample size5 or selection of an appropriate dose of the study medication. A run-in period can also be used to further check eligibility (eg, of blood markers that might render the intervention inappropriate) and allow time for other clinicians to be informed of patients' possible recruitment into the clinical trial.6

A run-in period may facilitate the standardization of a patient's usual treatment regimen or allow washout of a potentially interacting drug. For example, when assessing a new cholesterol-modifying drug, it may be desirable to standardize a patient's current LDL-C–lowering therapy to ensure all the enrolled patients are adequately treated when the study is initiated. This approach can minimize the risk of the addition of postrandomization nonstudy lipid-lowering therapy, which would create difficulties in estimating the effect of the study drug.5 Run-in periods can also facilitate optimization of a patient's disease treatment (eg, renin-angiotensin system blockade in a trial for kidney disease) or ensure a stable clinical condition (eg, for asthma) prior to randomization to the intervention.
188 JAMA July 14, 2020 Volume 324, Number 2 (Reprinted) jama.com
What Are the Limitations of Including a Run-in? evant in long-term trials in which loss of adherence may adversely
Because a run-in excludes individuals who are poorly adherent or affect the ability to obtain a clear result or in which complex proce-
experience adverse effects from the intervention, the representa- dures or follow-up leads to poor adherence with the intervention or
The representativeness of the trial population may be compromised and results less widely generalizable.7 The hazards of treatments may be underestimated in the randomized study because vulnerable groups (such as the elderly), who may be both less adherent and at greater risk of adverse effects, are excluded as a result of the run-in.

Including a run-in has cost implications for the study, and money spent could be used to increase the sample size. The counterargument is that a run-in improves cost-effectiveness, as better adherence allows a smaller sample size and fewer resources in follow-up. In the Physicians' Health Study, including the run-in was estimated to increase pill-taking adherence by 20% to 40% and allowed for an estimated 30% smaller sample size.8

Participants may notice the change in treatment from run-in to the postrandomization period (eg, from placebo to active treatment, or vice versa), thereby compromising the blinding of the study. If the trial is collecting "hard" clinical outcomes, such as heart attacks or death, this is unlikely to be a problem, but if outcomes are subjective (such as symptomatic pain), there is a risk of introducing bias in the outcome assessment. This unblinding risk can be minimized by good matching of the treatment and placebo, as symptom reporting should not then be affected by knowledge of treatment assignment. Other limitations include the possibility of "carry-over" effects of the run-in treatment into the postrandomization phase. For example, it may be best to avoid an active run-in when an intervention has a long effective half-life.9

The appropriateness of run-in periods depends on the aims of the trial, the nature and duration of the intervention, the population, the condition, and the length of follow-up. Run-ins are most relevant in trials of long-term treatments with extended follow-up. Run-ins are not appropriate or desirable in trials in acute disease, surgical trials, or studies with short observation periods.

Why Did the mPED Study Use a Run-in?
In the study by Fukuoka et al, a 3-week run-in allowed determination of participants' baseline average daily steps and adherence to the study intervention, defined by at least 80% response to daily messaging, 80% use of a daily activity diary, and wearing an accelerometer for at least 8 hours per day. Only those who used the accelerometer regularly and met the adherence requirements during run-in (n = 210) were eligible for randomization. The run-in provided the opportunity to obtain an average baseline for an outcome measure that had substantial variability (ie, physical activity) and allowed for adherence to be assessed.

How Should the Results of a Run-in Be Interpreted in the mPED Study?
Because of the run-in, 34% of potential study participants who initially provided consent (n = 108) were excluded before randomization, with poor adherence with the study equipment the most common reason (≈20%). Those withdrawals, discontinuations, and lack of end point collection occurring after randomization would have substantially reduced study power. The overall retention rate of 97.6% at 9 months allowed a statistically significant effect of the app-based intervention to be detected in this adherent population. The degree to which these results can be extrapolated to a less adherent group is a matter of clinical judgment.
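The adherence–sample-size tradeoff described above can be illustrated with a simple power calculation. The sketch below is not the method used in the Physicians' Health Study analysis8; it assumes, purely for illustration, hypothetical event risks of 20% (control) vs 15% (fully adherent treated) and that nonadherent participants experience the control-group risk, so lower adherence dilutes the effect observed under intention to treat.

```python
from math import ceil
from statistics import NormalDist

def n_per_group(p_control, p_treated, alpha=0.05, power=0.80):
    """Normal-approximation sample size per group for comparing two proportions."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)
    z_b = NormalDist().inv_cdf(power)
    variance = p_control * (1 - p_control) + p_treated * (1 - p_treated)
    return ceil((z_a + z_b) ** 2 * variance / (p_control - p_treated) ** 2)

def itt_diluted_risk(p_control, p_treated, adherence):
    """Simplifying assumption: nonadherent participants experience the control
    risk, so the observed treated-group risk is a weighted average."""
    return adherence * p_treated + (1 - adherence) * p_control

p0, p1 = 0.20, 0.15  # hypothetical control and fully adherent treated risks
n_low = n_per_group(p0, itt_diluted_risk(p0, p1, adherence=0.7))
n_high = n_per_group(p0, itt_diluted_risk(p0, p1, adherence=0.9))
# Better adherence yields a larger observed effect and a markedly smaller trial.
```

Under these invented numbers, raising adherence from 70% to 90% cuts the required sample size per group by roughly 40%, consistent in direction with the savings reported for the Physicians' Health Study.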
ARTICLE INFORMATION

Author Affiliations: National Clinical Research Center of Cardiovascular Diseases, Fuwai Hospital, National Center for Cardiovascular Diseases, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, China (Huo); Nuffield Department of Population Health, MRC Population Health Research Unit, Clinical Trial Service Unit and Epidemiological Studies Unit, University of Oxford, Oxford, United Kingdom (Huo, Armitage).

Corresponding Author: Jane Armitage, Nuffield Department of Population Health, MRC Population Health Research Unit, Clinical Trial Service Unit and Epidemiological Studies Unit, Richard Doll Bldg, Old Road Campus, Oxford OX3 7LF, United Kingdom ([email protected]).

Section Editors: Roger J. Lewis, MD, PhD, Department of Emergency Medicine, Harbor-UCLA Medical Center and David Geffen School of Medicine at UCLA; and Edward H. Livingston, MD, Deputy Editor, JAMA.

Additional Contributions: We are grateful to Richard Haynes, DM, Will Herrington, MD, David Preiss, PhD, Marion Mafham, MD, Richard Bulbulia, MD, and Louise Bowman, MD (MRC Population Health Research Unit, Clinical Trial Service Unit and Epidemiological Studies Unit, Nuffield Department of Population Health, University of Oxford), and Dylan Morris, DPhil (Queensland Research Centre for Peripheral Vascular Disease, College of Medicine and Dentistry, James Cook University, Townsville, Queensland, Australia), for their constructive advice on the manuscript. None of these individuals received any compensation for their contributions.

REFERENCES

1. Fukuoka Y, Haskell W, Lin F, Vittinghoff E. Short- and long-term effects of a mobile phone app in conjunction with brief in-person counseling on physical activity among physically inactive women: the mPED randomized clinical trial. JAMA Netw Open. 2019;2(5):e194281. doi:10.1001/jamanetworkopen.2019.4281

2. Wittes J. Sample size calculations for randomized controlled trials. Epidemiol Rev. 2002;24(1):39-53. doi:10.1093/epirev/24.1.39

3. Detry MA, Lewis RJ. The intention-to-treat principle: how to assess the true effect of choosing a medical treatment. JAMA. 2014;312(1):85-86. doi:10.1001/jama.2014.7523

4. Blood Pressure Lowering Treatment Trialists' Collaboration. Blood pressure-lowering treatment based on cardiovascular risk: a meta-analysis of individual patient data. Lancet. 2014;384(9943):591-598. doi:10.1016/S0140-6736(14)61212-5

5. HPS2-THRIVE Collaborative Group. Effects of extended-release niacin with laropiprant in high-risk patients. N Engl J Med. 2014;371:203-212.

6. Heart Protection Study Collaborative Group. MRC/BHF Heart Protection Study of cholesterol lowering with simvastatin in 20,536 high-risk individuals: a randomised placebo-controlled trial. Lancet. 2002;360(9326):7-22. doi:10.1016/S0140-6736(02)09327-3

7. Rothwell PM. External validity of randomised controlled trials: "to whom do the results of this trial apply?" Lancet. 2005;365(9453):82-93. doi:10.1016/S0140-6736(04)17670-8

8. Lang JM, Buring JE, Rosner B, Cook N, Hennekens CH. Estimating the effect of the run-in on the power of the Physicians' Health Study. Stat Med. 1991;10(10):1585-1593. doi:10.1002/sim.4780101010

9. Armitage JM, Bowman L, Clarke RJ, et al; Study of the Effectiveness of Additional Reductions in Cholesterol and Homocysteine (SEARCH) Collaborative Group. Effects of homocysteine-lowering with folic acid plus vitamin B12 vs placebo on mortality and major morbidity in myocardial infarction survivors: a randomized trial. JAMA. 2010;303(24):2486-2494. doi:10.1001/jama.2010.840
jama.com (Reprinted) JAMA July 14, 2020 Volume 324, Number 2 189
In the May 3, 2019, issue of JAMA Network Open, Sukul and colleagues1 used a regression discontinuity design (RDD) study to determine if a 2011 change in Medicare policy, expanding the number of secondary diagnostic codes allowed in Medicare billing from 9 to 24, was associated with a change in the observed disease severity of Medicare patients, independent of any real change in these patients' medical condition.

The apparent disease severity, based on secondary diagnostic codes, might increase because of real changes in the overall disease burden over time, as an artifact of the listing of additional codes due to the greater number allowed, or some combination of these factors. The RDD is a method for separating the contribution of the abrupt change in policy (the discontinuity) to observed increases in outcomes from the contribution of other factors that may produce more gradual changes.2

Use of the RDD Method
Why Is an RDD Study Used?
The RDD method is used to study the outcomes related to an abrupt change when it is not possible to randomly assign patients to the conditions before and after the change. A goal of an RDD is to minimize the effect of confounding on the estimated effect of a policy or treatment change.2 Unobserved confounding is particularly concerning in nonrandomized studies because this bias cannot be completely removed using conventional statistical methods, such as regression or propensity score–based analysis.3 An RDD attempts to minimize the risk for unobserved confounding when generating the association between an exposure and the change in the outcome of interest.4

Description of the RDD Method
The RDD approach relies on having a continuous variable, the "running variable," for which the levels above a certain cutoff abruptly change the probability of receiving one treatment or the other, resulting in a "natural experiment." The 2011 change in Medicare policy for secondary diagnosis codes represents just such a natural experiment, and calendar time represents the running variable. Because the presence of potentially confounding factors should be the same immediately around the date when the policy change occurred, estimates of the effect of the change at the cutoff point should not be substantially confounded.5

To evaluate the relationship of the change with the outcome of interest, the outcome is first estimated for patients above the cutoff as the value of the running variable approaches the cutoff from above. The same outcome is then estimated for patients below the cutoff as the value of the running variable approaches the cutoff from below. The difference in these estimates, with both estimates extrapolated to hypothetical patients right at the cutoff, quantifies the association between the change and the outcome at the discontinuity.2,5

An RDD requires a continuous running variable that separates patients into 2 groups; the running variable is not always time. For example, because of practice guidelines, newborns just below 1500 g of birth weight are much more likely to be admitted to the neonatal intensive care unit (NICU) than those just above. Infants born at 1499 g and 1501 g should be essentially identical. Thus, an RDD can be used to understand the association between NICU care and birth outcomes.6

To estimate the difference between the outcomes of infants close to 1500-g weight attributable to NICU care, separate regression analyses of infants below and above the cutoff would be conducted. One regression estimates the outcomes of hypothetical 1500-g infants based on infants with weights greater than 1500 g (including some substantially heavier infants).6 A second regression estimates the outcomes of hypothetical 1500-g infants, but based on infants with weights less than 1500 g (including some substantially lighter infants). In each case, although the regression is based on data from infants with weights different from 1500 g, the regression is used to predict the outcomes for a 1500-g infant. The difference in outcomes for 1500-g infants predicted using the greater-weight and lesser-weight cohorts is the RDD estimate of the association of NICU care with outcomes.

The RDD approach can be extended to more complex situations: for example, cases in which the relationship ascribed to the running variable is not absolute (ie, "fuzzy"), meaning there may be some patients with values near the cutoff who may receive either treatment, or cases in which patients are followed up over time and some of the same patients contribute results both before and after the time cutoff. The first case might occur when examining the association between blood transfusion and outcomes, with the hemoglobin value used as the running variable, because patients with hemoglobin values near the transfusion threshold may or may not be transfused based on other clinical considerations. The second case might occur when studying the effect of a change in a practice guideline for the care of chronic pain on opioid use, because some patients may receive care and be followed up both before and after the change in practice guideline.

Limitations of the RDD Method
A number of assumptions must be satisfied for an RDD to be valid. For example, while there should be an abrupt change in the treatment (or policy) received below and above the threshold, an assumption of the RDD approach is that there are no abrupt changes in the relationship between the running variable and the treatment or outcome except at the discontinuity threshold. This helps ensure the validity of the regressions conducted using data from each group. For example, the relationship between NICU admission and birth weight should be smooth (likely decreasing with increasing birth weight), with no abrupt changes except at 1500 g. A similar consideration would apply to clinical outcomes and birth weight.
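The two-sided estimation just described can be sketched in a few lines of code. The example below simulates data with a smooth trend plus a jump at a cutoff (all numbers are invented for illustration; this is not the analysis of Sukul et al1), fits separate ordinary least-squares lines below and above the cutoff, and takes the difference of their predictions at the cutoff as the RDD estimate.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares with one predictor: returns (intercept, slope)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope

def rdd_estimate(xs, ys, cutoff):
    """Fit separate regressions below and above the cutoff; the difference of
    their predictions *at* the cutoff estimates the discontinuity."""
    below = [(x, y) for x, y in zip(xs, ys) if x < cutoff]
    above = [(x, y) for x, y in zip(xs, ys) if x >= cutoff]
    b_int, b_slope = fit_line([x for x, _ in below], [y for _, y in below])
    a_int, a_slope = fit_line([x for x, _ in above], [y for _, y in above])
    return (a_int + a_slope * cutoff) - (b_int + b_slope * cutoff)

# Simulated data (all numbers invented): a smooth secular trend plus a jump
# of 0.4 at the policy change (cutoff = 0).
random.seed(0)
xs = [i / 100 for i in range(-100, 100)]  # running variable (eg, centered time)
ys = [1.7 + 0.1 * x + (0.4 if x >= 0 else 0.0) + random.gauss(0, 0.05)
      for x in xs]
jump = rdd_estimate(xs, ys, cutoff=0.0)   # close to the true discontinuity, 0.4
```

Because both regressions extrapolate to the same cutoff value, smooth secular trends cancel out of the difference, leaving only the abrupt change.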
jama.com (Reprinted) JAMA July 28, 2020 Volume 324, Number 4 381
A second assumption is that no one with the ability to allocate the treatment or influence outcomes can manipulate the running variable. For example, a practitioner might alter the recorded birth weight to influence the admission of an infant to the NICU. To test this assumption, the distribution of the running variable near the threshold may be examined for unexplained patterns that might suggest external manipulation.

The generalizability of estimates of treatment effects based on RDDs may be limited because valid estimates can only be generated for patients close to the running variable's threshold. Thus, the relationship between Medicare coverage and outcomes using an RDD with age as the running variable and age 65 years as the cutoff may be generalizable to 64-year-olds in a study comparing 64- and 66-year-olds, but that estimate would not generalize to 40-year-olds.

How Was the RDD Method Used?
The study by Sukul and colleagues1 examined whether the number of available secondary diagnosis coding positions (treatment) was associated with the number of condition categories coded (outcome) for all hospital discharges of Medicare fee-for-service beneficiaries between 2008 and 2015. Because clinicians were allowed to use more secondary diagnosis coding positions starting in 2011, a simple comparison of the average number of secondary diagnoses before and after 2011 might be biased because of real changes in the patient population over time, such as an increasing burden of disease, that would influence the number of diagnoses made. Thus, the authors exploited a natural experiment using an RDD with the date of hospital discharge as the running variable.

How Should the Analysis Based on the RDD Method Be Interpreted?
Sukul and colleagues1 found that the unadjusted mean number of condition categories coded increased from 1.7 in 2008 to 2.7 in 2015. This observed increase likely represents a combination of secular changes and the relationship with the change in Medicare coding rules. Based on the RDD used, adjusting for time, there was an increase of 0.4 diagnostic codes at the discontinuity of the 2011 coding policy change. The authors concluded that the 2011 expansion in the number of secondary diagnosis coding positions was associated with an increase in the estimated severity of illness of hospitalized Medicare beneficiaries. The RDD allowed the authors to separate the association of secular changes from the association of the change in the Medicare rule with the number of diagnoses recorded. This result aids in the quantitative comparison of illness severity in the patient populations treated before and after 2011.
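A crude version of the manipulation check described above is to compare how many observations fall into narrow bins just below and just above the threshold; the formal version of this idea is the McCrary density test, which this sketch does not implement. The birth weights below are invented, and the bin width and z statistic are illustrative choices only.

```python
from math import sqrt

def density_check(running_values, cutoff, bandwidth):
    """Count observations in narrow bins just below and just above the cutoff.
    Under the no-manipulation assumption the split should be roughly even;
    a large z statistic suggests sorting of the running variable."""
    below = sum(1 for v in running_values if cutoff - bandwidth <= v < cutoff)
    above = sum(1 for v in running_values if cutoff <= v < cutoff + bandwidth)
    n = below + above
    z = (below - n / 2) / sqrt(n * 0.25)  # normal approximation, binomial(n, 0.5)
    return below, above, z

# Hypothetical birth weights (grams); heaping just below 1500 g, where NICU
# admission becomes much more likely, would be suspicious.
weights = [1480, 1490, 1495, 1498, 1499, 1499, 1502, 1505, 1510]
below, above, z = density_check(weights, cutoff=1500, bandwidth=25)
```

With these invented data the split is 6 below vs 3 above (z = 1.0), which would not by itself be alarming in so small a sample; in practice the comparison is made with many observations and a formal test.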
ARTICLE INFORMATION

Author Affiliations: Center for Health Services Research in Primary Care, Durham Veterans Affairs Medical Center, Durham, North Carolina (Maciejewski); Department of Population Health Sciences, Duke University School of Medicine, Durham, North Carolina (Maciejewski); Division of General Internal Medicine, Department of Medicine, Duke University School of Medicine, Durham, North Carolina (Maciejewski); The Comparative Health Outcomes, Policy, and Economics (CHOICE) Institute, Departments of Pharmacy, Health Services, and Economics, University of Washington, Seattle (Basu).

Corresponding Author: Matthew L. Maciejewski, PhD, Center for Health Services Research in Primary Care, Durham Veterans Affairs Medical Center and Duke University Medical Center, 411 W Chapel Hill St, Ste 600, Durham, NC 27705 ([email protected]).

Section Editors: Roger J. Lewis, MD, PhD, Department of Emergency Medicine, Harbor-UCLA Medical Center and David Geffen School of Medicine at UCLA; and Edward H. Livingston, MD, Deputy Editor, JAMA.

Published Online: July 2, 2020. doi:10.1001/jama.2020.3822

Conflict of Interest Disclosures: Dr Maciejewski reported ownership of Amgen stock through his spouse's employment. Dr Basu reported receiving personal fees from Salutis Consulting LLC.

Funding/Support: This work was supported by the Office of Research and Development, Health Services Research and Development Service, Department of Veterans Affairs (CIN 13-410). Dr Maciejewski was supported by a Research Career Scientist award from the VA (RCS 10-391).

Role of the Funder/Sponsor: The Department of Veterans Affairs had no role in the preparation, review, or approval of the manuscript or the decision to submit the manuscript for publication.

Disclaimer: The views expressed in this article are those of the authors and do not necessarily reflect the position or policy of the Department of Veterans Affairs, Duke University, or the University of Washington.

REFERENCES

1. Sukul D, Hoffman GJ, Nuliyalu U, et al. Association between Medicare policy reforms and changes in hospitalized Medicare beneficiaries' severity of illness. JAMA Netw Open. 2019;2(5):e193290. doi:10.1001/jamanetworkopen.2019.3290

2. Thistlethwaite D, Campbell D. Regression-discontinuity analysis: an alternative to the ex-post facto experiment. J Educ Psychol. 1960;51(6):309-317. doi:10.1037/h0044319

3. Haukoos JS, Lewis RJ. The propensity score. JAMA. 2015;314(15):1637-1638. doi:10.1001/jama.2015.13480

4. Imbens G, Lemieux T. Regression discontinuity designs: a guide to practice. J Econom. 2008;142(2):615-635. doi:10.1016/j.jeconom.2007.05.001

5. Shadish WR, Galindo R, Wong VC, Steiner PM, Cook TD. A randomized experiment comparing random and cutoff-based assignment. Psychol Methods. 2011;16(2):179-191. doi:10.1037/a0023345

6. Almond D, Doyle JJ, Kowalski AE, Williams H. Estimating marginal returns to medical care: evidence from at-risk newborns. Q J Econ. 2010;125(2):591-634. doi:10.1162/qjec.2010.125.2.591
In precision medicine, a common question for researchers is whether patients can be classified with others who have similar risks and treatment responses. Such groupings can assist in predicting risk and matching patients with appropriate treatment strategies. The challenge is that it is often not easy to identify meaningful clusters of people with the observable data.

Latent class analysis (LCA) is a common explanatory modeling technique that allows researchers to identify groups of people who have similar characteristics that can include demographics, clinical characteristics, treatments, comorbidities, and outcomes.1 The term latent derives from the fact that the classes are not directly observable. Latent class analysis estimates the probability of each participant being a member of each latent class.2

In the November 17, 2019, issue of JAMA Cardiology, Patel et al3 used group-based trajectory modeling (GBTM), a type of LCA, and identified 5 distinct patterns of change in participant urine albumin-creatinine ratio (UACR) observed over 20 years. These 5 classes were independently associated with adverse changes in cardiac structure and ventricular function.3 Notably, participants belonging to the identified trajectory classes could not be distinguished by the baseline UACR alone, highlighting the value of this technique. This Guide to Statistics and Methods article describes LCA, its potential application, and limitations.

Use of the Method
What Is LCA?
Latent class analysis is a statistical technique that identifies groups defined by specific combinations of observed variables.2 Latent class analysis assigns each participant a probability of being in each subgroup based on maximum likelihood estimation. Then, each participant is assigned to the group to which they have the highest probability of belonging. In GBTM, the trajectories' shape can be straight or curvilinear; shapes are based on maximum likelihood estimation. Selecting the number of groups requires manual reconciliation of the trajectories' shape, the minimum number of participants assigned to a trajectory, and measures indicating how well the model fits, such as the Akaike information criterion or the Bayesian information criterion.4 Although there is no single criterion to select the number and shape of classes or trajectories, the number of classes that yields the best fit to the observed data, the highest average probability of group membership, and the fewest poor-fitting participants (ie, those with a highest probability of group membership <0.7) is chosen.5 Therefore, reporting the decisions and rationales behind this process of manual reconciliation is crucial.

Why Is LCA Used?
Latent class analysis is useful when the patterns that constitute distinct clusters or classes are difficult to discern with traditional methods. For example, the clinical heterogeneity within the broad definition of sepsis has made it difficult to determine whether there are patient subgroups that respond more favorably to one treatment or another. Investigators have used LCA as a confirmatory analysis6 to reproduce novel phenotypes of sepsis identified by another clustering technique called consensus k-means clustering. The investigators identified clinical phenotypes of sepsis with differential treatment responses, based on a combination of hemodynamic, laboratory, and end-organ functional parameters.

Latent class analysis can also capture groups of participant preferences that depend on a complex intersection of options. For example, LCA was used recently to group personal preferences for bariatric surgery, resulting in 3 subgroups relating to concerns about costs, benefit focus, and procedure focus.7 An advantage of this approach is that the grouping originates from the data; therefore, the categories are not predefined and thus not limited by current conceptual frameworks.

The LCA methods include longitudinal approaches, in which participant-level trajectories of an outcome can be classified into groupings. These approaches are called GBTM or latent class growth analysis4 and identify underlying subgroups that would have been masked if only a single regression line was estimated, as is done in the majority of longitudinal analyses. For example, Wu et al1 found 5 trajectories of overall cardiovascular health over 4 years that were independently associated with subsequent risk of incident cardiovascular disease.

Limitations of LCA
Several limitations of LCA merit consideration. First, although grouping based on latent class facilitates data presentation and interpretation, participants do not actually belong to a single group. The class membership for each participant is assigned based on the highest probability of belonging to one of the latent classes. That is, some participants have similar probabilities of belonging to multiple groups (ie, probabilities of 0.5, 0.49, and 0.01 for classes A, B, and C, respectively); however, group membership is assigned based on the highest probability.5 Therefore, it is critical to examine the participants for whom the highest probability of belonging to a single class is poor (<0.7) and provide descriptions of such participants.5

Second, the number of classes is derived from the cohort, considering model fit and complexity, and is not fixed. Latent class analysis applied to a larger cohort or a cohort with more observed characteristics may yield a different number of classes with different patterns. Therefore, reporting validation and reproducibility of the latent classes is important. To validate the latent classes, researchers may perform cross-validation. Reproducibility should be tested by using different source data6 to test whether the identified groupings can be reproduced, although it is often difficult to find an independent data set of a similar cohort that contains variables comparable with the original data set.

Third, for GBTM applied to longitudinal data, participants assigned to a class may vary around the estimated trajectory.
700 JAMA August 18, 2020 Volume 324, Number 7 (Reprinted) jama.com
For example, a participant may have a rapid recovery trajectory initially, then experience a catastrophic event such as a stroke and no longer follow the initial trajectory afterward.8 In such cases, the probability of membership is unlikely to be high for a single class. Furthermore, GBTM can model trajectories only as polynomial functions and is not equipped to model trajectories of other shapes, such as a cyclical trajectory.

How Was LCA Used?
Patel et al3 used GBTM to categorize participants (n = 2647) into 5 distinct classes of trajectories of UACR (Figure 1 in the article) over the course of 20 years in young adults. Urine albumin-creatinine ratios were recorded prospectively at 5 time points (10, 15, 20, 25, and 30 years after enrollment). Group-based trajectory modeling identified distinct categories of trajectories that were not identifiable from the baseline UACR measurement alone. The authors labeled the trajectories, with the high-increasing group (1.6% of participants) showing a clinically concerning pattern of persistently high UACR that continued to increase over the study period. Their linear models showed that trajectories of worse UACR were associated with greater risk of adverse cardiac structural alterations and worse ventricular systolic and diastolic function measured at year 30, the end of the study period.

How Should the LCA Be Interpreted?
Using GBTM, Patel et al3 were able to reduce the complex longitudinal UACR data to an interpretable number of trajectory types that were associated with adverse cardiac function and structural alterations. They concluded that such trajectory-based categorization may help with early identification of those at risk of subclinical cardiovascular disease. However, identifying the group expected to have the worst trajectory before the completion of follow-up remains a challenge. As expected in a longitudinal study, there was participant attrition over time (71% of surviving participants completed the last follow-up). The excluded cohort differed from the retained cohort in certain demographic and comorbidity characteristics. Because these exclusions occurred based on follow-up information, it remains unknown whether the trajectory classes are generalizable to the initial cohort prior to such exclusions, which is the cohort for which clinicians are interested in prognosticating risk. There are methods for modeling follow-up loss using GBTM,4 which were not applied. As the authors modeled the trajectories of more than 2600 participants, if they had split the sample and found similar trajectories in each split, this would have reinforced the reproducibility of the latent classes. Regardless, the general conclusion that the dynamic changes of UACR may be associated with later adverse cardiac remodeling is supported by their approach.
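The assignment step that LCA shares with GBTM (each participant assigned to the class with the highest posterior probability, and participants whose highest probability falls below 0.7 flagged as poorly classified) can be sketched directly. The example below does not fit a model; it assumes a hypothetical, already-fitted 2-class model for a single biomarker, with purely illustrative weights, means, and SDs, and applies Bayes' rule to compute posterior class probabilities.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a normal distribution, used as the class-conditional model."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

def posterior_probabilities(x, classes):
    """Bayes' rule: P(class k | x) is proportional to weight_k * f_k(x)."""
    likelihoods = [w * normal_pdf(x, mu, sd) for (w, mu, sd) in classes]
    total = sum(likelihoods)
    return [lk / total for lk in likelihoods]

# Hypothetical already-fitted 2-class model for one biomarker:
# (class weight, mean, SD) -- illustrative numbers only.
classes = [(0.6, 5.0, 1.0), (0.4, 9.0, 1.0)]

assignments = []
for x in [4.8, 7.0, 9.5]:
    probs = posterior_probabilities(x, classes)
    best = max(range(len(probs)), key=lambda k: probs[k])
    well_classified = probs[best] >= 0.7  # flag participants below 0.7
    assignments.append((x, best, round(probs[best], 2), well_classified))
# The middle value (7.0) lands between the two class means: it is assigned to
# class 0 with posterior 0.6 (< 0.7) and flagged as poorly classified, while
# the other two values are assigned with near certainty.
```

This is exactly the situation the Limitations section describes: modal assignment hides the uncertainty of participants who sit between classes, which is why reporting those with a highest posterior probability below 0.7 matters.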
ARTICLE INFORMATION

Author Affiliations: Section of Cardiac Surgery, Department of Surgery, Yale School of Medicine, New Haven, Connecticut (Mori); Center for Outcomes Research and Evaluation, Yale-New Haven Hospital, New Haven, Connecticut (Mori, Krumholz); Section of Cardiovascular Medicine, Department of Internal Medicine, Yale School of Medicine, New Haven, Connecticut (Krumholz); Department of Health Policy and Management, Yale School of Public Health, New Haven, Connecticut (Krumholz); Section of Geriatrics, Department of Internal Medicine, Yale School of Medicine, New Haven, Connecticut (Allore); Department of Biostatistics, Yale School of Public Health, New Haven, Connecticut (Allore).

Corresponding Author: Heather G. Allore, PhD, Yale School of Medicine, Geriatrics, 300 George St, Ste 775, New Haven, CT 06511 ([email protected]).

Section Editors: Roger J. Lewis, MD, PhD, Department of Emergency Medicine, Harbor-UCLA Medical Center and David Geffen School of Medicine at UCLA; and Edward H. Livingston, MD, Deputy Editor, JAMA.

Conflict of Interest Disclosures: Dr Krumholz reported receipt of personal fees from UnitedHealth, IBM Watson Health, Element Science, Aetna, Facebook, Siegfried & Jensen law firm, Arnold & Porter law firm, Ben C. Martin law firm, and the National Center for Cardiovascular Diseases, Beijing, and grants from the Centers for Medicare & Medicaid Services, Medtronic, the US Food and Drug Administration, Johnson & Johnson, and the Shenzhen Center for Health Information, as well as being cofounder of Hugo, a personal health information platform, and Refactor Health, an enterprise health care artificial intelligence–augmented data management company. No other disclosures were reported.

REFERENCES

1. Wu S, An S, Li W, et al. Association of trajectory of cardiovascular health score and incident cardiovascular disease. JAMA Netw Open. 2019;2(5):e194758. doi:10.1001/jamanetworkopen.2019.4758

2. Vermunt JK, Magidson J. Latent class cluster analysis. In: Hagenaars JA, McCutcheon AL, eds. Applied Latent Class Analysis. Cambridge University Press; 2002:89-106. doi:10.1017/CBO9780511499531.004

3. Patel RB, Colangelo LA, Reis JP, Lima JAC, Shah SJ, Lloyd-Jones DM. Association of longitudinal trajectory of albuminuria in young adulthood with myocardial structure and function in later life: Coronary Artery Risk Development in Young Adults (CARDIA) study. JAMA Cardiol. 2019;5(2):184-192. doi:10.1001/jamacardio.2019.4867

4. Nagin DS, Odgers CL. Group-based trajectory modeling in clinical research. Annu Rev Clin Psychol. 2010;6:109-138. doi:10.1146/annurev.clinpsy.121208.131413

5. Nagin D. Group-Based Modeling of Development. Harvard University Press; 2005. doi:10.4159/9780674041318

6. Seymour CW, Kennedy JN, Wang S, et al. Derivation, validation, and potential treatment implications of novel clinical phenotypes for sepsis. JAMA. 2019;321(20):2003-2017. doi:10.1001/jama.2019.5791

7. Rozier MD, Ghaferi AA, Rose A, Simon NJ, Birkmeyer N, Prosser LA. Patient preferences for bariatric surgery: findings from a survey using discrete choice experiment methodology. JAMA Surg. 2019;154(1):e184375. doi:10.1001/jamasurg.2018.4375

8. Gill TM, Murphy TE, Gahbauer EA, Allore HG. The course of disability before and after a serious fall injury. JAMA Intern Med. 2013;173(19):1780-1786. doi:10.1001/jamainternmed.2013.9063
The goal of many medical research studies is to estimate the direction and magnitude of the effect of an intervention or treatment on a clinical outcome (in clinical trials) or the association between an exposure and an outcome (in observational studies). This effect or association can be presented in various forms, depending on the measured outcome. For example, if the outcome is a continuous measure (eg, blood pressure), the effect or association could be represented as a mean difference between the groups. If the outcome is a time-to-event outcome (eg, time to death), the effect or association is often expressed as a hazard ratio.

Related article page 1058

For binary outcomes (eg, 90-day survival), the measure of the effect or association is often presented as an odds ratio (ie, dividing the odds of the outcome in one group by the odds of the outcome in another), in which the odds are the probability divided by 1 minus the probability. Odds ratios are commonly reported in clinical research because of the frequent use of logistic regression when there is a need to adjust for various characteristics (eg, to adjust for potential confounders in an observational study). Logistic regression yields odds ratios, is relatively straightforward to perform, and is widely available in statistical software. However, as explained in an earlier JAMA Guide to Statistics and Methods article,1 there are limitations to odds ratios. For instance, odds ratios do not approximate risk ratios when the outcome is frequent (Table), and odds ratios are easily misinterpreted by researchers, clinicians, and patients.

An observational study by Grunau et al2 in this issue of JAMA evaluated survival to hospital discharge among patients who received ongoing resuscitation for out-of-hospital cardiac arrest during transport to the hospital compared with continuous resuscitation at the scene. Instead of reporting odds ratios, the authors estimated risk ratios and risk differences, measures of association that are more intuitive to interpret.

What Are Risk Ratios and Risk Differences?
A risk ratio is the probability (or risk) of an outcome in one group divided by the probability in another, whereas the risk difference is the probability of an outcome in one group minus the probability in another. For example, if survival is 50% in one group and 40% in another, the measures of effect or association are as follows: the risk ratio is 0.50/0.40 = 1.25 (ie, a relative increase in survival of 25%), and the risk difference is 0.50 − 0.40 = 0.10 (ie, an absolute increase in survival of 10%), although all of these measures are derived from the same underlying risks. An effect or association that appears very beneficial according to a risk ratio or odds ratio might be negligible when applied to the absolute scale, which may present misleading information about the clinical benefit or harm of an intervention. These measures change with outcome prevalence. As shown in the Table, as the prevalence of the outcome increases, odds ratios become consistently farther from 1 compared with risk ratios.

How Are Risk Ratios and Risk Differences Estimated?
Although adjusted risk ratios and risk differences may be more clinically intuitive than adjusted odds ratios, they have traditionally not been used because of the complexity of the methods required to calculate them when adjustment for other variables is necessary. However, several methods are available for accomplishing this task. Simple calculations can be used to compute unadjusted estimates of risk ratios and risk differences. When adjusted estimates of risk ratios are needed, binomial models,3 modified Poisson models,4 and other techniques can be used to estimate them. Binomial and modified Poisson models are both regression-based approaches within the framework of generalized linear models. These are flexible models that assume a linear relationship between a set of variables and an outcome. The outcome can have different forms, depending on the link function of the model (ie, a function to transform the outcome; for example, a log link performs a logarithmic transformation).

To obtain risk ratios, both the binomial and modified Poisson methods assume a log link function to produce log risks, which, when exponentiated, can be directly interpreted as risk ratios. To obtain risk differences, the methods assume an identity link function (ie, no transformation) to obtain regression coefficients that are risk differences. The "modified" part of the Poisson method refers to the use of robust variance estimation (ie, a method to estimate valid standard errors for coefficients in a regression model) to account for the model misspecification that occurs because the binary outcome does not follow a Poisson distribution. Although both approaches have limitations, they tend to produce correct estimates with valid CIs and are easy to implement in standard statistical software, including SAS,5 Stata,6 and R.7

Limitations of Risk Ratios and Risk Differences
In addition to the usual limitations of estimating and interpreting
the risk difference is 0.50 − 0.40 = 0.10 (ie, an absolute increase in measures of effect or association (ie, confounding, selection bias,
survival of 10%), which translates into a number needed to treat of and information bias), several caveats should be considered when
10 (ie, 1/the risk difference, or 1/0.10); and the odds ratio is (0.50/ adjusted risk ratios and risk differences are estimated. The statisti-
0.50)/(0.40/0.60) = 1.50 (ie, a relative increase in odds of survival cal computations are more complex than conventional methods
of 50%) (Table). An intervention that increases the relative odds of and it may be challenging to effectively communicate the method-
survival by 50% has the appearance of being more beneficial than ology to a nonstatistical audience. For example, log binomial
a relative increase in risk of 25%, or an absolute increase in risk of regression may result in computational errors such that the risk
1098 JAMA September 15, 2020 Volume 324, Number 11 (Reprinted) jama.com
Group 1 prevalence, % Group 2 prevalence, % Risk difference, % Risk ratio Odds ratio
2 1 1 2.00 2.02
5 4 1 1.25 1.26
50 40 10 1.25 1.50
60 50 10 1.20 1.50
ratio and risk difference cannot be estimated.8 Modified Poisson resuscitation at the scene. The risk ratio for survival was estimated
regression is more likely to produce results, but may lead to CIs at 0.48 (95% CI, 0.43 to 0.54) and the risk difference was esti-
that are too wide because of misspecification of the outcome mated at –4.8% (95% CI, −4.4% to −5.3%). In other words, trans-
distribution.9 For both models, these situations may be more likely port to the hospital was associated with a relative reduction in sur-
to occur with small sample sizes, when many variables are included vival of 52% and an absolute reduction in survival of 4.8%.
in the model, or both. However, each measure of this association alone does not pro-
vide a complete representation of the intervention. Although the risk
How Did the Authors Use Risk Ratios and Risk Difference ratio is generally constant across different baseline risks, the risk dif-
and How Should They Be Interpreted? ference is not.10 To fully describe the exposure-outcome relation-
The observational study by Grunau et al2 used modified Poisson re- ship, both absolute and relative measures should be reported, a prac-
gression to estimate survival among patients with out-of-hospital tice recommended by both the Consolidated Standards of Reporting
cardiac arrest who were transported to the hospital during ongo- Trials (CONSORT) and the Strengthening the Reporting of Obser-
ing resuscitation compared with those who received continuous vational Studies in Epidemiology (STROBE) guidelines.
parametric assumptions—thus the term "nonparametric statistics." In the May 26, 2020, issue of JAMA, Baxter and colleagues from the N-TA3CT Research Group2 used a nonparametric analysis known as the worst-rank score method in a manner that also captured information from missing data that could result from worsening of the patient's condition.

Use of the Method
Why Is the Worst-Rank Score Method Used?
Virtually all studies experience some missing data.3 Missing data are considered missing completely at random (MCAR) when the missing data are the result of random processes by which some values are observed and others are missing. If the missing data are indeed MCAR (an untestable hypothesis), then an analysis of the observed data using virtually any statistical method will provide an unbiased test. Such missing data are called noninformative because it is assumed that the missing data are the result of a random process, and a missing datum conveys no information about what the missing value might be.

However, it is possible that missing data are informatively missing, in which case the missing data result from other outcomes that reflect a change in the patient's status, either improvement or deterioration. For example, in a study of congestive heart failure, missing data resulting from the death of a patient due to worsening heart failure would indicate that this patient had a worse outcome than any patient who survived. In such settings, Wittes et al4 suggested that a rank analysis could readily capture this information by assigning the worst ranks to study participants who died. Lachin5 then described the statistical properties of a worst-rank analysis and showed that a rank test using worst ranks provided an unbiased statistical test of the difference between groups.

Informatively missing data were an anticipated feature of the N-TA3CT study,2 which was conducted to compare the effect of doxycycline vs placebo on aneurysm growth among patients with small infrarenal abdominal aortic aneurysms. The primary outcome was the maximum transverse diameter (MTD) of the aneurysm relative to the initial baseline value after 2 years of treatment. However, it was possible that some patients might die or experience rupture of the aneurysm and would require endovascular repair. For such patients the MTD measurement at 2 years would be missing and would be informative about the status of these patients relative to those who completed the study and had 2-year MTD measurement data available. Thus, the statistical analysis plan for the study prespecified that these informatively missing patients would be included in the primary efficacy analysis using worst ranks. This JAMA Guide to Statistics and Methods describes the nonparametric worst-rank analysis.

Description of Worst-Rank Score Analysis
In the previous JAMA Guide to Statistics and Methods article,1 a simple rank test was described for 2 groups, one with 4 observations and the other with 5, in which the 9 original values were replaced by their ranks (1 to 9). However, now suppose this study actually started with 12 patients, 6 in each group, in which 2 patients in group A died and 1 in group B died. In a worst-rank analysis, if all that was known is that these 3 patients died, then these 3 deaths would be assigned the average worst rank of 11 (the mean of 10, 11, and 12), as shown in the Table.

However, if other information would allow the 3 deaths to be ranked by a measure of severity, then the analysis could also be conducted using exact worst ranks. For example, the deaths might be ordered by their survival time, longer being better, so that the first death (in study time since randomization) would receive the rank of 12; the second, 11; and the third, 10.

Missing data also may occur from nonfatal deterioration of the patient's condition. In such a case the exact worst ranks can be assigned based on mechanisms of informative missingness other than mortality. For example, the DREAM study6 assessed the effect of blockade of the renin-angiotensin system using ramipril vs placebo for the prevention of diabetes in patients with cardiovascular disease or hypertension. A key outcome was the 2-hour poststimulus glucose level in an oral glucose tolerance test. However, the test was not conducted after a study participant developed diabetes, so the 2-hour glucose levels were informatively missing for such individuals. Thus, study participants who had developed diabetes were assigned exact worst ranks based on the days from randomization to the diagnosis of diabetes. The analysis was then conducted using the Wilcoxon nonparametric rank test.

How Was the Worst-Rank Score Analysis Used?
The above methods were used in the N-TA3CT study analysis. The authors used a rank transformation analysis of covariance in which the rank scores for the 2-year MTD values were compared between groups while also adjusting for the rank score of the baseline MTD and sex.1 The analysis also used worst ranks for patients
1670 JAMA October 27, 2020 Volume 324, Number 16 (Reprinted) jama.com
who died or who underwent surgical repair. From the CONSORT diagram, 225 patients had an observed measurable MTD at 2 years. An additional 22 who underwent surgical repair were assigned worse ranks of 226 to 247, based on the time to repair (226 the shortest, 247 the longest), and the 7 who died were assigned ranks of 248 to 254, based on the time to death.

There were no differences between groups in the distribution of the measurable MTD scores at 2 years and no appreciable differences in the incidence of repair (9 vs 13) or death (4 vs 3).

Limitations of the Worst-Rank Score Analysis
One drawback to rank-based methods, and to a worst-rank score analysis, is that the analysis does not provide an estimate of a parametric "effect size" that is a function of an estimate of the difference in the distribution parameters between the 2 groups, such as a mean difference with 95% confidence limits. However, when using a Wilcoxon test, the Mann-Whitney formulation provides an estimate of a useful parameter-free or distribution-free quantity that describes the difference between groups. This analysis is based on an estimate of the probability that a random study participant from group A will have a higher value than a random study participant from group B, designated as P(A>B). This probability estimate is based on the sum of the ranks in group A and thus is equivalent to the Wilcoxon rank sum test. If there are no tied ranks, the expected probability under the null hypothesis is P(A>B) = ½. However, to allow for ties, the Mann-Whitney difference is computed as P(A>B) − P(B>A), where P(B>A) is the reverse probability (a random B greater than a random A).1 Under the null hypothesis of no difference in the distributions between groups, the Mann-Whitney difference is zero and the Mann-Whitney test equals the Wilcoxon test.

How Should Worst-Rank Score Analysis Be Interpreted?
A rank analysis of measurements under the MCAR assumption provides a robust and powerful test of a difference between groups in the population distributions, regardless of the shape of the distribution within each group. An analysis with worst ranks for the patients who have informatively missing data then tests whether there is a difference between groups in the distribution of either the measured values, or the distribution of the times at which informative events occur that result in missing data for the measured primary outcome, or both. Lachin and others5,7-9 have shown that the worst-rank analysis has good power to detect group differences in situations in which the ranks of the observed values differ between groups or the worst ranks differ in the same direction, and neither of the two shows a difference in the opposite direction.
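To make the mechanics above concrete, here is a from-scratch Python sketch (not the N-TA3CT analysis code; the observed outcome values are invented for illustration) that assigns midranks to the observed outcomes, gives the 3 deaths in the hypothetical 12-patient example the shared worst rank of 11, and computes the group A rank sum and the Mann-Whitney difference P(A>B) − P(B>A).

```python
def tied_ranks(values):
    """Rank values 1..n, assigning the average rank (midrank) to ties."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of 1-based positions i..j
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def mann_whitney_difference(ranks_a, ranks_b):
    """P(A>B) - P(B>A), estimated from all pairwise comparisons; tied pairs
    (here, death vs death) count toward neither probability."""
    gt = sum(1 for a in ranks_a for b in ranks_b if a > b)
    lt = sum(1 for a in ranks_a for b in ranks_b if a < b)
    return (gt - lt) / (len(ranks_a) * len(ranks_b))

# Invented observed outcomes (higher = worse): 4 observed plus 2 deaths in
# group A, 5 observed plus 1 death in group B.
obs_a = [3.1, 4.5, 2.2, 5.0]
obs_b = [1.9, 2.8, 3.6, 4.1, 2.4]

obs_ranks = tied_ranks(obs_a + obs_b)  # ranks 1..9 for the 9 observed values
death_rank = (10 + 11 + 12) / 3        # shared worst rank of 11 for the 3 deaths

ranks_a = obs_ranks[:len(obs_a)] + [death_rank] * 2
ranks_b = obs_ranks[len(obs_a):] + [death_rank] * 1

print("group A ranks:", ranks_a)
print("rank sum A:", sum(ranks_a))  # 46.0; basis of the Wilcoxon rank sum test
print("Mann-Whitney difference:", round(mann_whitney_difference(ranks_a, ranks_b), 3))  # 0.389
```

If survival times for the deaths were known, the three copies of `death_rank` would simply be replaced by exact worst ranks of 10, 11, and 12 ordered by time to death, as described above.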
ARTICLE INFORMATION
Author Affiliation: Biostatistics Center, Department of Biostatistics and Bioinformatics, Milken Institute School of Public Health, The George Washington University, Rockville, Maryland.
Corresponding Author: John M. Lachin, ScD, Biostatistics Center, Department of Biostatistics and Bioinformatics, Milken Institute School of Public Health, The George Washington University, 6110 Executive Blvd, Rockville, MD 20852 ([email protected]).
Section Editors: Roger J. Lewis, MD, PhD, Department of Emergency Medicine, Harbor-UCLA Medical Center and David Geffen School of Medicine at UCLA; and Edward H. Livingston, MD, Deputy Editor, JAMA.
Published Online: September 28, 2020. doi:10.1001/jama.2020.7709
Conflict of Interest Disclosures: None reported.

REFERENCES
1. Lachin JM. Nonparametric statistical analysis. JAMA. 2020;323(20):2080-2081. doi:10.1001/jama.2020.5874
2. Baxter BT, Matsumura J, Curci JA, et al; N-TA3CT Investigators. Effect of doxycycline on aneurysm growth among patients with small infrarenal abdominal aortic aneurysms: a randomized clinical trial. JAMA. 2020;323(20):2029-2038. doi:10.1001/jama.2020.5230
3. Newgard CD, Lewis RJ. Missing data: how to best account for what is not known. JAMA. 2015;314(9):940-941. doi:10.1001/jama.2015.10516
4. Wittes J, Lakatos E, Probstfield J. Surrogate endpoints in clinical trials: cardiovascular diseases. Stat Med. 1989;8(4):415-425. doi:10.1002/sim.4780080405
5. Lachin JM. Worst-rank score analysis with informatively missing observations in clinical trials. Control Clin Trials. 1999;20(5):408-422. doi:10.1016/S0197-2456(99)00022-7
6. Bosch J, Yusuf S, Gerstein HC, et al; DREAM Trial Investigators. Effect of ramipril on the incidence of diabetes. N Engl J Med. 2006;355(15):1551-1562. doi:10.1056/NEJMoa065061
7. McMahon RP, Harrell FE Jr. Power calculation for clinical trials when the outcome is a composite ranking of survival and a nonfatal outcome. Control Clin Trials. 2000;21(4):305-312. doi:10.1016/S0197-2456(00)00052-0
8. Matsouaka RA, Betensky RA. Power and sample size calculations for the Wilcoxon-Mann-Whitney test in the presence of death-censored observations. Stat Med. 2015;34(3):406-431. doi:10.1002/sim.6355
9. Colantuoni E, Scharfstein DO, Wang C, et al. Statistical methods to compare functional outcomes in randomized controlled trials with high mortality. BMJ. 2018;360:j5748. doi:10.1136/bmj.j5748