Quantifying Treatment Differences in Confirmatory Trials With Delayed Effects
José L. Jiménez
Novartis Pharma A.G., Basel, Switzerland
arXiv:1908.10502v1 [stat.ME] 28 Aug 2019
Abstract
Dealing with non-proportional hazards is increasingly common nowadays when designing
confirmatory clinical trials in oncology. Under these circumstances, the hazard ratio may not
be the best statistical measure of the treatment effect, nor is the log-rank test the most powerful
statistical test. Possible alternatives include the restricted mean survival time (RMST), which
does not rely on the proportional hazards assumption and is clinically interpretable, and the
weighted log-rank test, which is proven to outperform the log-rank test in delayed effects settings.
We conduct simulations to evaluate the performance and operating characteristics of RMST-based
inference and compare them to the log-rank test and the weighted log-rank test with parameter
values (ρ = 0, γ = 1), as well as to their linked hazard ratios.
ratios. The weighted log-rank test is generally the most powerful test in a delayed effects
setting, and RMST-based tests have, under certain conditions, better performance than the
log-rank test when the truncation time is reasonably close to the tail of the observed curves.
In terms of treatment effect quantification, the hazard ratio linked to the weighted log-rank
test is able to capture the maximal treatment difference and provides a valuable summary
of the treatment effect in delayed effect settings. Hence, we recommend the inclusion of the
hazard ratio linked to the weighted log-rank test among the measurements of treatment effect
in settings where there is suspicion of substantial departure from the proportional hazards
assumption.
Keywords: delayed effects; non-proportional hazards; restricted mean survival time; weighted
log-rank.
1 Introduction
Randomized controlled clinical trials are the gold standard in drug development to confirm both
safety and efficacy of new drugs. When the primary endpoint is a time-to-event endpoint, the
objective is to quantify the difference between the survival curves of the treatment arms. The most
common time-to-event endpoints used for confirmatory phase III trials in oncology are progression-free
survival (PFS) and overall survival (OS). PFS corresponds to the time from randomization
until disease progression or death, whereas OS corresponds to the time from randomization until
death.
In this article we focus on the application to the immuno-oncology (IO) space. IO agents have
an effect on both the subject’s immune system and the tumor’s microenvironment. This way, the
tumors may be eliminated from the host or the disease progression may be delayed. In contrast with
chemotherapeutic agents, the effect of an IO agent is not directed to the tumor but to the subject’s
immune system, which means that the effect is not observable immediately. This lag between
treatment administration and the activation of the immune response is known in the literature as a
delayed treatment effect. This delay induces non-proportional hazards and may translate into an
overall underestimation of the PFS or OS difference with respect to the control treatment arm
(i.e., the hazard ratio (HR) may increase towards 1 as the delay increases).
There exist multiple approaches in the literature to quantify treatment differences and to test the
null hypothesis (HR = 1) against the alternative hypothesis (HR < 1) in confirmatory clinical trials.
The weighted log-rank test with the Fleming and Harrington class of weights [3] has gained
considerable attention in recent years in the IO space, as it allows weighting late differences between
survival curves over early differences by tuning its two parameters (ρ, γ). On this matter, [5] made
an extensive evaluation of the weighted log-rank test in confirmatory trials with delayed effects.
However, this comparison was made based only on power, and did not explore the quantification of
treatment effect.
[4] explored the differences between the restricted mean survival time (RMST) and the hazard ratio
(HR) in a large number of scenarios that include both proportional and non-proportional hazards.
The RMST is a robust and clinically interpretable measure of the survival time distribution
that only depends on the selection of the cutoff (truncation) time $t^*$, which needs to be pre-specified
to avoid selection bias (see [9]). Its clear advantage over the HR in a delayed effects setting is that it
does not rely on the proportional hazards assumption. Moreover, analogous to the hazard ratio as
a measure of the relative risk of an event, a similar measure for the RMST can be obtained by
simply taking the ratio of the RMSTs between arms (control vs. experimental), with a ratio
< 1 meaning a survival benefit in the experimental arm. In the non-proportional hazards setting in
which we are interested, the work from [4] concluded that the RMST-based tests are more efficient
than the log-rank test under certain censoring conditions (i.e., they achieve higher power). However,
the HR still gives a slightly better estimate of the maximal treatment differences, although as the
dropout rate increases the differences between the HR and the RMST ratio tend to disappear. That
article, however, does not include the performance of the weighted log-rank test (and its adjusted
HR), which is proven to be more powerful than the log-rank test in settings with delayed effects.
Note that in [4] the HR used is a weighted average of the HR over time on the log scale, and not
just the HR from the standard Cox model.
In this article we extend the work of [4] and evaluate the differences in performance between RMST-based
tests and the weighted log-rank test with the parameter combination (ρ = 0, γ = 1), as well
as between the ratio of the RMSTs of the two treatment groups and the HR linked to the weighted
log-rank test, referred to as the "adjusted HR" (see section 2.2), in a setting with delayed effects.
For completeness, we also include the standard HR in the comparison.
The rest of the manuscript is structured as follows. In section 2 we describe the weighted log-rank
test, the HR that is linked to the weighted log-rank test, and the RMST. In section 3 we
present an empirical evaluation of the log-rank test, the weighted log-rank test and the RMST-based
tests in simulated scenarios with delayed effects. In section 4 we present an evaluation of a real trial
example. Last, in section 5 we discuss the major findings and conclusions of the article.
2 Method
Following the notation of [5], let T be a vector that contains the event times, $t_i$, $i = 1, 2, \ldots, D$,
between the patients' enrollment date and the patients' final event date, $t_D$, such that $t_1 < t_2 <
\cdots < t_D$. Let the number of events at time $t_i$ be denoted as $d_i$, the total number of patients at
risk at that time be denoted as $n_i$, and the effect delay (in months) be denoted as $\varepsilon$. As previously
described, if $t < \varepsilon$ both survival curves run in parallel, and once $t \ge \varepsilon$ the survival curves start
diverging. Hence, we assume the following density functions $f_j(t)$, survival functions $S_j(t)$ and
hazard functions $h_j(t)$ for the control group ($j = 1$) and for the experimental group ($j = 2$):
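A sketch of this specification, under the assumption (consistent with the delayed-effect model of [5]) of an exponential control hazard $\lambda_1$ and an experimental hazard that switches from $\lambda_1$ to $\lambda_2$ at the delay $\varepsilon$:

```latex
h_1(t) = \lambda_1, \qquad
h_2(t) =
\begin{cases}
\lambda_1, & t < \varepsilon,\\
\lambda_2, & t \ge \varepsilon,
\end{cases}
\qquad
S_1(t) = e^{-\lambda_1 t}, \qquad
S_2(t) =
\begin{cases}
e^{-\lambda_1 t}, & t < \varepsilon,\\
e^{-\lambda_1 \varepsilon - \lambda_2 (t - \varepsilon)}, & t \ge \varepsilon,
\end{cases}
```

with $f_j(t) = h_j(t)\, S_j(t)$ in each arm.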
$$A_{t_i} = \frac{w_{t_i}}{\max(w_{t_i})}. \tag{4}$$
In (4), $A_{t_i}$ is non-negative and has a maximal value of 1. The hazard function in the Cox model
proposed by [8] is defined as
$$\lambda(t_i; X) = \lambda_0\, e^{A_{t_i} \times \beta \times X} = \lambda_0\, e^{\beta \times X^{*}_{t_i}}, \tag{5}$$
where $\lambda(t_i; X)$ has a constant coefficient and a time-varying covariate $X^{*}_{t_i} = A_{t_i} \times X$ that represents
the treatment assignment weighted by the adjustment factor. The $\hat{\beta}$ from the Cox models with time-varying
coefficients are proven to be unbiased (see [1]).
Also, since $A_{t_i} \le 1$, we can interpret $\hat{\beta}$ as the estimated maximal effect. The time points
where we observe the maximal treatment difference are weighted with $A_{t_i} = 1$ in the corresponding
weighted log-rank test. Moreover, this weighted log-rank test (and consequently the score test from
this model) is optimal and has the highest power, based on Schoenfeld's proof [10].
The adjusted hazard ratio is therefore defined as
$$HR_{t_i} = \frac{\lambda_0\, e^{A_{t_i} \times \beta \times 1}}{\lambda_0\, e^{A_{t_i} \times \beta \times 0}} = e^{A_{t_i} \times \beta}, \tag{6}$$
where $e^{\beta}$ represents the maximal effect.
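As an illustration of this construction, the Fleming–Harrington weighted log-rank statistic can be sketched in a few lines of pure Python (a minimal sketch for exposition only; the simulations in this article use R, and the function name and data layout below are illustrative assumptions):

```python
import math

def fh_weighted_logrank(times, events, groups, rho=0.0, gamma=1.0):
    """Weighted log-rank statistic with Fleming-Harrington weights
    w(t_i) = S(t_i-)^rho * (1 - S(t_i-))^gamma, where S is the pooled
    left-continuous Kaplan-Meier estimate. `events` is 1 for an event and
    0 for censoring; `groups` is 0 (control) or 1 (experimental).
    Returns (z, numerator, variance)."""
    data = sorted(zip(times, events, groups))
    event_times = sorted({t for t, e, _ in data if e == 1})
    num, var, s_pooled = 0.0, 0.0, 1.0
    for t in event_times:
        n = sum(1 for ti, _, _ in data if ti >= t)            # at risk, pooled
        n1 = sum(1 for ti, _, g in data if ti >= t and g == 1)
        d = sum(1 for ti, e, _ in data if ti == t and e == 1)
        d1 = sum(1 for ti, e, g in data if ti == t and e == 1 and g == 1)
        w = (s_pooled ** rho) * ((1.0 - s_pooled) ** gamma)
        num += w * (d1 - d * n1 / n)                          # observed - expected
        if n > 1:
            var += w * w * d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
        s_pooled *= 1.0 - d / n                               # KM update after t
    z = num / math.sqrt(var) if var > 0 else float("nan")
    return z, num, var
```

With (ρ = 0, γ = 1) the weight $(1 - \hat S(t^-))$ is close to 0 for early events and close to 1 for late events, matching the delayed-effects emphasis; with (ρ = 0, γ = 0) all weights equal 1 and the statistic reduces to the standard log-rank test.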
To estimate $\mu$ in (7) we can use the Kaplan–Meier (KM) estimator $\hat{S}(t)$ of $S(t)$, hence
$$\hat{\mu} = \int_0^{t^*} \hat{S}(t)\, dt. \tag{9}$$
The RMST-based test statistic standardizes the difference $\hat{\mu}_E(t^*) - \hat{\mu}_C(t^*)$ by its estimated
standard error, where $\hat{S}_E(t)$ and $\hat{S}_C(t)$ are the estimated survival curves of the experimental and
control arms respectively. The estimated variance term is defined as $V(\hat{\mu}_E(t^*)) + V(\hat{\mu}_C(t^*))$.
Since we have the RMST of both treatment arms, we can compute the ratio
$$\frac{\int_0^{t^*} \hat{S}_C(t)\, dt}{\int_0^{t^*} \hat{S}_E(t)\, dt}. \tag{12}$$
Equation (12) is, just like the hazard ratio, a relative measure of the treatment effect, with a ratio
< 1 indicating a survival improvement in the experimental arm. The variance of (12) is
estimated using the delta method.
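A hedged pure-Python sketch of this estimator (function names are illustrative assumptions; in practice the KM curve and the RMST come from standard survival software):

```python
def km_curve(times, events):
    """Kaplan-Meier estimate for one arm: returns the step function as a list
    of (event time, S(t)) pairs; `events` is 1 for an event, 0 for censoring."""
    data = sorted(zip(times, events))
    curve, s = [], 1.0
    for t in sorted({ti for ti, e in data if e == 1}):
        d = sum(1 for ti, e in data if ti == t and e == 1)
        n = sum(1 for ti, _ in data if ti >= t)   # number at risk just before t
        s *= 1.0 - d / n
        curve.append((t, s))
    return curve

def rmst(times, events, t_star):
    """Restricted mean survival time: area under the KM step function on [0, t_star]."""
    area, prev_t, s = 0.0, 0.0, 1.0
    for t, s_new in km_curve(times, events):
        if t >= t_star:
            break
        area += s * (t - prev_t)
        prev_t, s = t, s_new
    return area + s * (t_star - prev_t)
```

The RMST ratio of (12) is then the ratio of the two arms' RMSTs evaluated at a common truncation time $t^*$, e.g. the minimax event or minimax observed time.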
To have an objective evaluation, in this article $t^*$ is linked to the data and is pre-specified as i)
the minimum of the maximal observed event times (minimax event time) and ii) the minimum of the
maximal observed (event or censored) times (minimax observed time) of each treatment arm.
3 Simulation study
3.1 Setup
The survival times T are simulated by randomly drawing samples from U(0, 1) and back-transforming
them using the inverse survival function $S_j^{-1}(U)$. We assume that the dropout time variable D
follows an exponential distribution with rate parameters $\lambda_{D_E}$ and $\lambda_{D_C}$ in the experimental arm and
the control arm respectively. Again, following the notation of [4], let Y denote the time at which a subject
is enrolled in the trial; its distribution is the same in both treatment arms. We assume that T
and D are independent and that their distributions do not depend on Y. The accrual and event times
of different patients are also independent.
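The inverse-transform step can be sketched as follows for a piecewise-exponential delayed-effect model (a minimal illustration; the rate parameterization and function name are assumptions, and the article's simulations actually use the nphsim R package):

```python
import math
import random

def sample_delayed_effect_time(u, lam1, lam2, delay):
    """Invert S(t) for a survival time with hazard lam1 before `delay` and
    lam2 afterwards, given u ~ U(0, 1):
        S(t) = exp(-lam1 * t)                           for t < delay,
        S(t) = exp(-lam1 * delay - lam2 * (t - delay))  for t >= delay.
    Setting delay = 0 (or lam2 = lam1) recovers a plain exponential,
    as used for the control arm."""
    t = -math.log(u) / lam1          # candidate under the pre-delay hazard
    if t < delay:
        return t
    return delay + (-math.log(u) - lam1 * delay) / lam2

# Example: scenario-1-like rates (medians of 6 and 15 months, i.e. HR = 0.4).
rng = random.Random(2019)
lam_c = math.log(2) / 6
lam_e = math.log(2) / 15
sample = sample_delayed_effect_time(rng.random(), lam_c, lam_e, delay=3.0)
```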
In total, we implement 2 scenarios with delay times ($\varepsilon$) that go from 0 (i.e., proportional
hazards) up to a 4-month delay. The median survival of the control arm is 6 months in both scenarios,
whereas the median survival of the experimental arm is 15 and 9 months respectively. Hence, the
true hazard ratios (i.e., maximal treatment differences) are 0.4 and 0.667 respectively.
Sample size is calculated as described above using Schoenfeld's formula (see [11]). Hence,
the necessary number of events in each scenario is 52 and 258, and the total sample size 75 and 330
respectively.
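For illustration, Schoenfeld's event-count formula can be coded directly (a sketch assuming a one-sided test and r:1 allocation; small differences from the event counts above may arise from rounding conventions or additional design adjustments in the actual calculation):

```python
import math
from statistics import NormalDist

def schoenfeld_events(hr, alpha=0.025, power=0.90, r=1.0):
    """Number of events d required to detect hazard ratio `hr` with a
    one-sided level-`alpha` log-rank test and allocation ratio r:1:
        d = (z_{1-alpha} + z_{power})^2 / (p * (1 - p) * log(hr)^2),
    where p = r / (1 + r) is the allocation proportion."""
    z_a = NormalDist().inv_cdf(1 - alpha)
    z_b = NormalDist().inv_cdf(power)
    p = r / (1 + r)
    return math.ceil((z_a + z_b) ** 2 / (p * (1 - p) * math.log(hr) ** 2))
```

For the two scenarios above this gives roughly 51 and 257 events, in line with the 52 and 258 reported.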
The 2 scenarios with the different delays are implemented with dropout rates that
follow an exponential distribution with hazard rates of 1% and 3%. Data is generated using the
nphsim R package [12], where we incorporate an 18-month enrollment period (with ramp-up),
administrative censoring at 25 months, a 1:1 randomization ratio, a power of 90% and a one-sided level
α of 2.5%. We run M = 10,000 simulated trials, over which we calculate the empirical power defined as
$$\text{Power} = \frac{1}{M} \sum_{i=1}^{M} I(z_{\text{test},i} > z_{\alpha}). \tag{13}$$
The RMST-based test (minimax event time), the RMST-based test (minimax observed time), the
RMST ratio, the weighted log-rank test with the parameter combination (ρ = 0, γ = 1), the log-rank
test, the HR and the adjusted HR are all evaluated simultaneously in the 2 presented scenarios.
3.2 Results
In Figures 1, 2, 3 and 4 we present the empirical comparison between the RMST-based tests and the
weighted log-rank test with parameters (ρ = 0, γ = 1), together with their respective treatment effect
difference estimates. Recall that the adjusted HR is the only method studied in this article that
actually estimates the maximal treatment difference; the HR and the RMST ratios provide a treatment
difference across the entire study.
Figure 1 contains the results of scenario 1 where, under proportional hazards, the hazard ratio
is equal to 0.4 (i.e., the maximal treatment difference is 0.4). Overall, we can see that in terms of
power, and as expected from previous literature, the weighted log-rank test with the parameter
combination (ρ = 0, γ = 1) is the test with the highest power as the delay increases. When the dropout
rate is equal to 1% and the delay is equal to 0 (i.e., under proportional hazards), the log-rank
test has a power of 90% and is the most powerful test. The RMST-based test (minimax observed
time) performs slightly worse, but outperforms both the RMST-based test (minimax event time)
and the weighted log-rank test. However, when the delay starts to increase, the weighted log-rank
test achieves the highest power, outperforming the log-rank test and the RMST-based tests.
The weighted log-rank test remains the most powerful test also when the dropout rate increases
to 3%. However, it is interesting to point out that, as the dropout rate and the delay increase, the
RMST-based test (minimax observed time) becomes slightly more powerful than the log-rank test.
Figure 3 contains the estimated treatment differences in scenario 1. For both dropout
rates, we observe that the HR provides a treatment difference across the entire study of 0.41 under
proportional hazards, which increases up to 0.68 with a 4-month delay. Regarding the RMST
ratios, the one using the minimax event time provides an estimated treatment difference of
0.62 under proportional hazards, increasing up to 0.83 with a 4-month delay. The RMST ratio
that uses the minimax observed time provides an estimated treatment difference of 0.59 under
proportional hazards that increases up to 0.78 with a 4-month delay. The adjusted HR provides a
maximal treatment difference of 0.42 under proportional hazards, which increases up to 0.54 with
a 4-month delay.
Figure 2 contains the results of scenario 2 where, under proportional hazards, the hazard ratio
is equal to 0.667 (i.e., the maximal treatment difference is 0.667). Overall we can see that, just
like in scenario 1, in terms of power the weighted log-rank test with the parameter combination
(ρ = 0, γ = 1) is the test with the highest power as the delay increases. When the dropout rate is
equal to 1% and the delay is equal to 0 (i.e., under proportional hazards), the log-rank test has
a power of 90% and is the most powerful test. The RMST-based test (minimax observed time)
performs slightly worse, but outperforms both the RMST-based test (minimax event time) and
the weighted log-rank test. However, when the delay starts to increase, the weighted log-rank test
achieves the highest power, outperforming the log-rank test and the RMST-based tests.
The weighted log-rank test remains the most powerful test also when the dropout rate increases
to 3%. However, it is interesting to point out that, as the dropout rate and the delay increase, in
this scenario, which has a smaller treatment difference between arms than scenario 1 (i.e.,
the maximal treatment difference in scenario 1 is 0.4 and in scenario 2 is 0.667), both RMST-based
tests become more powerful than the log-rank test. This conclusion is in line with the conclusions
made by [4].
Figure 4 contains the estimated treatment differences in scenario 2. With a dropout
rate of 1%, we observe that the HR provides a treatment difference across the entire study of 0.67
under proportional hazards, which increases up to 0.81 with a 4-month delay. Regarding the RMST
ratios, the one using the minimax event time provides an estimated treatment difference of 0.75
under proportional hazards, increasing up to 0.86 with a 4-month delay. The RMST ratio that uses
the minimax observed time provides an estimated treatment difference of 0.73 under proportional
hazards that increases up to 0.85 with a 4-month delay. The adjusted HR provides a maximal
treatment difference of 0.67 under proportional hazards, which increases up to 0.73 with a 4-month
delay. With a dropout rate of 3%, we observe that the HR provides a treatment difference across
the entire study of 0.67 under proportional hazards, which increases up to 0.81 with a 4-month
delay. Regarding the RMST ratios, the one using the minimax event time provides an estimated
treatment difference of 0.71 under proportional hazards, increasing up to 0.81 with a 4-month
delay. The RMST ratio that uses the minimax observed time provides an estimated treatment
difference of 0.71 under proportional hazards that increases up to 0.81 with a 4-month delay. The
adjusted HR provides a maximal treatment difference of 0.67 under proportional hazards, which
increases up to 0.72 with a 4-month delay.
Overall, the simulations performed in this article provide the following conclusions:
• In line with [5], the weighted log-rank test with parameters (ρ = 0, γ = 1) is the method that
provides the highest power in a setting with delayed effects.
• In line with the conclusions presented by [8], the adjusted HR that is linked to the weighted
log-rank test with parameters (ρ = 0, γ = 1) captures very well the maximal treatment
difference between two treatment arms in the presence of delayed effects.
• In line with the conclusions from [4], the RMST-based test using the minimax observed
time outperforms the log-rank test in terms of power in the presence of delayed effects and
increasing dropout rates. However, the HR yields a treatment difference across the entire
study that is closer to the maximal treatment difference than the RMST ratios.
• Even though the HR and the RMST ratios do not aim to estimate the maximal treatment
difference, when used for this purpose, as is done in current practice, their estimate of the
maximal treatment difference is, by far, not as good as the one provided by the adjusted HR.
Figure 1: Empirical power in scenario 1 for the log-rank test, the RMST-based test (minimax event
time), the RMST-based test (minimax observed time) and the weighted log-rank test with the
parameter combination (ρ = 0, γ = 1), with dropout rates of 1% and 3%.
Figure 2: Empirical power in scenario 2 for the log-rank test, the RMST-based test (minimax event
time), the RMST-based test (minimax observed time) and the weighted log-rank test with the
parameter combination (ρ = 0, γ = 1), with dropout rates of 1% and 3%.
Figure 3: Treatment difference estimations in scenario 1 using the HR, the adjusted HR and the
RMST ratios, with dropout rates of 1% and 3%.
Figure 4: Treatment difference estimations in scenario 2 using the HR, the adjusted HR and the
RMST ratios, with dropout rates of 1% and 3%.
Figure 5: Overall survival Kaplan-Meier curves of the phase 3 randomized study in patients with
relapsed or refractory, CD22-positive, Philadelphia chromosome (Ph)-positive or Ph-negative acute
lymphoblastic leukemia. A total of 326 patients were randomized 1:1 to receive either inotuzumab
ozogamicin (inotuzumab ozogamicin group) or standard intensive chemotherapy (standard-therapy
group) (source: [6]).
It is not the objective of this article to assess the results of this particular clinical trial. Its only
purpose is to show the performance of the methodology used in this article in a real setting.
5 Discussion
Nowadays, it is quite common to find studies where the proportional hazards assumption does not
hold (i.e., with the use of novel cancer therapies such as targeted therapies or immunotherapies).
However, despite the fact that the HR lacks interpretability under non-proportional hazards, it
is still the standard method to quantify treatment differences.
In this article we present a comparison between the log-rank test, the weighted log-rank test with
parameters (ρ = 0, γ = 1) and the RMST-based tests, together with their linked treatment difference
estimates (i.e., the HR, the adjusted HR and the RMST ratios), which are widely used in clinical trials with delayed
effects. This article represents an extension of the work done by [4]. In that article, a
comparison is made between the log-rank and RMST-based tests (and their linked HR and RMST ratios).
That comparison concludes that RMST ratios not only better capture the treatment differences but
are also interpretable, since they do not rely on the proportional hazards assumption. The
comparison is done in a wide variety of scenarios, including non-proportional hazards. However,
we believe that under non-proportional hazards, the weighted log-rank test and its linked HR are
much more appropriate than both the HR and the RMST ratios.
We implement all these methods (i.e., the log-rank test, the weighted log-rank test with parameter
values (ρ = 0, γ = 1), the RMST-based tests, and their linked HRs and RMST ratios) in two scenarios
with delayed effects and different dropout rates. From these simulations we conclude that under
non-proportional hazards scenarios where a late separation of the survival curves is observed, the RMST-based
test has better performance than the log-rank test in terms of power when the truncation
time is reasonably close to the tail of the observed Kaplan-Meier curves. However, the weighted
log-rank test with parameters (ρ = 0, γ = 1) outperforms both the RMST-based tests and the log-rank
test. In terms of treatment effect quantification, the HR linked to the weighted log-rank test is the
measure that performs best under non-proportional hazards.
The estimation of the treatment effect is also a key component of the analysis of a clinical trial.
The RMST-based tests do not rely on any model assumptions and hence their interpretation is
straightforward. In contrast, under non-proportional hazards the HR varies with time, and its estimated
value cannot be interpreted as the average HR across times. The RMST can capture the entire
event-free distribution and hence is able to provide a clinically meaningful summary of the group
differences in a randomized study.
However, we believe that the HR linked to the weighted log-rank test also provides a good summary
of the group differences by giving the maximal treatment difference observed along the entire trial,
which can be easily interpreted under non-proportional hazards. Moreover, it does not require
specifying any truncation time, unlike the RMST ratio. From our point of view this is a clear advantage
with respect to the RMST ratio because, if we design a study with the RMST as the primary analysis
powered to detect a meaningful difference between the two RMSTs, the selection of the truncation
time cannot be based on the minimax event time or the minimax observed time when data are not
available. Instead, this truncation time should be a fixed timepoint. This time window has to be
large enough and expected to capture most of the survival curves for the RMST to be used as an
adequate global summary statistic. However, we believe that the maximal treatment difference is
only useful in scenarios where there is a late separation between Kaplan-Meier curves. It would not
make sense to provide this measurement, for example, in scenarios with crossing Kaplan-Meier curves.
Therefore, under non-proportional hazards with late separation, we agree with [4] in that the RMST
curves, as well as the related ratios, are easy to interpret, are clinically meaningful to characterize
the treatment effect over time, and have a clear advantage over the HR and the log-rank test.
However, we have shown that the weighted log-rank test with parameters (ρ = 0, γ = 1) outperforms
the RMST-based tests in terms of power, and that its linked HR provides a treatment difference
summary that can also be very useful in the presence of delayed effects.
Disclaimer
The views and opinions expressed in this article are those of the author and do not necessarily
reflect the official policy or position of Novartis Pharma A.G.
References
[1] Per Kragh Andersen and Richard D Gill. Cox’s regression model for counting processes: a
large sample study. The annals of statistics, pages 1100–1120, 1982.
[2] David R Cox. Regression models and life-tables. Journal of the Royal Statistical Society: Series
B (Methodological), 34(2):187–202, 1972.
[3] Thomas R Fleming and David P Harrington. A class of hypothesis tests for one and two
sample censored survival data. Communications in Statistics-Theory and Methods, 10(8):763–
794, 1981.
[4] Bo Huang and Pei-Fen Kuan. Comparison of the restricted mean survival time with the hazard
ratio in superiority trials with a time-to-event end point. Pharmaceutical statistics, 17(3):202–
213, 2018.
[5] José L Jiménez, Viktoriya Stalbovskaya, and Byron Jones. Properties of the weighted log-rank
test in the design of confirmatory studies with delayed effects. Pharmaceutical statistics, 2018.
[6] Hagop M Kantarjian, Daniel J DeAngelo, Matthias Stelljes, Giovanni Martinelli, Michaela
Liedtke, Wendy Stock, Nicola Gökbuget, Susan O’Brien, Kongming Wang, Tao Wang, et al.
Inotuzumab ozogamicin versus standard therapy for acute lymphoblastic leukemia. New Eng-
land Journal of Medicine, 375(8):740–753, 2016.
[7] John Lawrence. Strategies for changing the test statistic during a clinical trial. Journal of
biopharmaceutical statistics, 12(2):193–205, 2002.
[8] Ray S Lin and Larry F León. Estimation of treatment effects in weighted log-rank tests.
Contemporary clinical trials communications, 8:147–155, 2017.
[9] Patrick Royston and Mahesh KB Parmar. Restricted mean survival time: an alternative to
the hazard ratio for the design and analysis of randomized trials with a time-to-event outcome.
BMC medical research methodology, 13(1):152, 2013.
[10] David Schoenfeld. The asymptotic properties of nonparametric tests for comparing survival
distributions. Biometrika, 68(1):316–319, 1981.
[11] David A Schoenfeld et al. Sample-size formula for the proportional-hazards regression model.
Biometrics, 39(2):499–503, 1983.
[12] Yang Wang, Haiyan Wu, and Keaven Anderson. nphsim: Simulation and power calculations
for time-to-event clinical trials. https://github.com/keaven/nphsim/.