Models for Multi-State Survival Data

Multi-state models provide a statistical framework for studying longitudinal data on subjects when
focus is on the occurrence of events that the subjects may experience over time. They find application
particularly in biostatistics, medicine, and public health. The book includes mathematical
detail which can be skipped by readers more interested in the practical examples. It is aimed at
biostatisticians and at readers with an interest in the topic having a more applied background, such
as epidemiology. This book builds on several courses the authors have taught on the subject.

Key Features:

• Intensity-based and marginal models.
• Survival data, competing risks, illness-death models, recurrent events.
• Full chapter on pseudo-values.
• Intuitive introductions and mathematical details.
• Practical examples of event history data.
• Exercises.

Software code in R and SAS and the data used in the book can be found on the book’s webpage.

Henrik Ravn is senior statistical director at Novo Nordisk A/S, Denmark. He graduated with
an MSc in theoretical statistics in 1992 from University of Aarhus, Denmark and completed
a PhD in Biostatistics in 2002 from the University of Copenhagen, Denmark. He joined Novo
Nordisk in late 2015 after more than 22 years of experience doing biostatistical and epidemiological
research at Statens Serum Institut, Denmark, and in Guinea-Bissau, West Africa. He
has co-authored more than 160 papers, mainly within epidemiology and application of survival
analysis and has taught several courses as external lecturer at Section of Biostatistics, University
of Copenhagen.

Per Kragh Andersen has been professor of Biostatistics at the Department of Public Health, University of
Copenhagen, Denmark, since 1998. He earned a mathematical statistics degree from the University
of Copenhagen in 1978, a PhD in 1982, and a DMSc degree in 1997. From 1993 to 2002 he worked
as chief statistician at Danish Epidemiology Science. He is author or co-author of more than 125
papers on statistical methodology and more than 250 papers in the medical literature. His research
has concentrated on survival analysis, and he is co-author of the 1993 book Statistical Models
Based on Counting Processes. He has taught several courses both nationally and internationally
both for students with a mathematical background and for students in medicine or public health.
CHAPMAN & HALL/CRC
Texts in Statistical Science Series
Joseph K. Blitzstein, Harvard University, USA
Julian J. Faraway, University of Bath, UK
Martin Tanner, Northwestern University, USA
Jim Zidek, University of British Columbia, Canada
Recently Published Titles

Sampling
Design and Analysis, Third Edition
Sharon L. Lohr

Theory of Statistical Inference
Anthony Almudevar

Probability, Statistics, and Data
A Fresh Approach Using R
Darrin Speegle and Bryan Clair

Bayesian Modeling and Computation in Python
Osvaldo A. Martin, Ravin Kumar and Junpeng Lao

Bayes Rules!
An Introduction to Applied Bayesian Modeling
Alicia Johnson, Miles Ott and Mine Dogucu

Stochastic Processes with R
An Introduction
Olga Korosteleva

Design and Analysis of Experiments and Observational Studies using R
Nathan Taback

Time Series for Data Science: Analysis and Forecasting
Wayne A. Woodward, Bivin Philip Sadler and Stephen Robertson

Statistical Theory
A Concise Introduction, Second Edition
Felix Abramovich and Ya’acov Ritov

Applied Linear Regression for Longitudinal Data
With an Emphasis on Missing Observations
Frans E.S. Tan and Shahab Jolani

Fundamentals of Mathematical Statistics
Steffen Lauritzen

Modelling Survival Data in Medical Research, Fourth Edition
David Collett

Applied Categorical and Count Data Analysis, Second Edition
Wan Tang, Hua He and Xin M. Tu

Geographic Data Science with Python
Sergio Rey, Dani Arribas-Bel and Levi John Wolf

Models for Multi-State Survival Data
Rates, Risks, and Pseudo-Values
Per Kragh Andersen and Henrik Ravn

For more information about this series, please visit: https://www.routledge.com/Chapman--HallCRC-Texts-in-Statistical-Science/book-series/CHTEXSTASCI
Models for Multi-State
Survival Data
Rates, Risks, and Pseudo-Values

Per Kragh Andersen and Henrik Ravn

Figures by Julie Kjærulff Furberg

Designed cover image: © Gustav Ravn
First edition published 2024
by CRC Press
6000 Broken Sound Parkway NW, Suite 300, Boca Raton, FL 33487-2742

and by CRC Press

4 Park Square, Milton Park, Abingdon, Oxon, OX14 4RN

CRC Press is an imprint of Taylor & Francis Group, LLC

© 2024 Taylor & Francis Group, LLC

Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot assume
responsibility for the validity of all materials or the consequences of their use. The authors and publishers have
attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders
if permission to publish in this form has not been obtained. If any copyright material has not been acknowledged please
write and let us know so we may rectify in any future reprint.

Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or
utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying,
microfilming, and recording, or in any information storage or retrieval system, without written permission
from the publishers.

For permission to photocopy or use material electronically from this work, access www.copyright.com or contact the
Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. For works that are
not available on CCC please contact [email protected]

Trademark notice: Product or corporate names may be trademarks or registered trademarks and are used only for identification
and explanation without intent to infringe.

ISBN: 978-0-367-14002-1 (hbk)

ISBN: 978-1-032-56869-0 (pbk)
ISBN: 978-0-429-02968-4 (ebk)

DOI: 10.1201/9780429029684

Typeset in Nimbus Roman font

by KnowledgeWorks Global Ltd.

Publisher’s note: This book has been prepared from camera-ready copy provided by the authors.

Access the Support Material: https://multi-state-book.github.io/companion/

Contents

Preface

List of symbols and abbreviations

1 Introduction
1.1 Examples of event history data
1.1.1 PBC3 trial in liver cirrhosis
1.1.2 Guinea-Bissau childhood vaccination study
1.1.3 Testis cancer incidence and maternal parity
1.1.4 PROVA trial in liver cirrhosis
1.1.5 Recurrent episodes in affective disorders
1.1.6 LEADER cardiovascular trial in type 2 diabetes
1.1.7 Bone marrow transplantation in acute leukemia
1.1.8 Copenhagen Holter study
1.2 Parameters in multi-state models
1.2.1 Choice of time-variable
1.2.2 Marginal parameters
1.2.3 Conditional parameters
1.2.4 Data representation
1.2.5 Target parameter
1.3 Independent censoring and competing risks
1.4 Mathematical definition of parameters (*)
1.4.1 Marginal parameters (*)
1.4.2 Conditional parameters (*)
1.4.3 Counting processes (*)
1.5 Exercises

2 Intuition for intensity models
2.1 Models for homogeneous groups
2.1.1 Nelson-Aalen estimator
2.1.2 Piece-wise constant hazards
2.1.3 Significance tests
2.2 Regression models
2.2.1 Multiplicative regression models
2.2.2 Modeling assumptions
2.2.3 Cox versus Poisson models
2.2.4 Additive regression models
2.2.5 Additive versus multiplicative models
2.3 Delayed entry
2.4 Competing risks
2.5 Recurrent events
2.5.1 Recurrent episodes in affective disorders
2.5.2 LEADER cardiovascular trial in type 2 diabetes
2.6 Exercises

3 Intensity models
3.1 Likelihood function (*)
3.2 Non-parametric models (*)
3.2.1 Nelson-Aalen estimator (*)
3.2.2 Inference (*)
3.3 Cox regression model (*)
3.4 Piece-wise constant hazards (*)
3.5 Additive regression models (*)
3.6 Examples
3.6.1 PBC3 trial in liver cirrhosis
3.6.2 Guinea-Bissau childhood vaccination study
3.6.3 PROVA trial in liver cirrhosis
3.6.4 Testis cancer incidence and maternal parity
3.7 Time-dependent covariates
3.7.1 Adapted covariates
3.7.2 Non-adapted covariates
3.7.3 Inference
3.7.4 Inference (*)
3.7.5 Recurrent episodes in affective disorders
3.7.6 PROVA trial in liver cirrhosis
3.7.7 PBC3 trial in liver cirrhosis
3.7.8 Bone marrow transplantation in acute leukemia
3.7.9 Additional issues
3.8 Models with shared parameters
3.8.1 Duplicated data set
3.8.2 PROVA trial in liver cirrhosis
3.8.3 Bone marrow transplantation in acute leukemia
3.8.4 Joint likelihood (*)
3.9 Frailty models
3.9.1 Inference (*)
3.9.2 Clustered data
3.9.3 Recurrent events
3.10 Exercises

4 Intuition for marginal models
4.1 Plug-in methods
4.1.1 Two-state model
4.1.2 Competing risks
4.1.3 Illness-death models
4.2 Direct models
4.2.1 Two-state model
4.2.2 Competing risks
4.2.3 Recurrent events
4.3 Marginal hazard models
4.3.1 Clustered data
4.3.2 Recurrent events
4.3.3 Illness-death model
4.4 Independent censoring – revisited
4.4.1 Investigating the censoring distribution
4.4.2 Censoring and covariates – a review
4.4.3 Independent competing risks – a misnomer (*)
4.4.4 Semi-competing risks (*)
4.5 Exercises

5 Marginal models
5.1 Plug-in for Markov processes (*)
5.1.1 Two-state model (*)
5.1.2 Competing risks (*)
5.1.3 Progressive illness-death model (*)
5.1.4 Recurrent events (*)
5.1.5 Progressive multi-state models (*)
5.2 Plug-in for non-Markov processes (*)
5.2.1 State occupation probabilities (*)
5.2.2 Transition probabilities (*)
5.2.3 Recurrent events (*)
5.2.4 Semi-Markov processes (*)
5.3 Landmarking
5.3.1 Conditional survival probabilities
5.3.2 Landmark super models
5.3.3 Bone marrow transplantation in acute leukemia
5.3.4 Multi-state landmark models
5.3.5 Estimating equations (*)
5.4 Micro-simulation
5.4.1 Simulating multi-state processes
5.4.2 Simulating from an improper distribution
5.4.3 PROVA trial in liver cirrhosis
5.5 Direct regression models
5.5.1 Generalized estimating equations (*)
5.5.2 Two-state model (*)
5.5.3 Competing risks (*)
5.5.4 Recurrent events (*)
5.5.5 State occupation probabilities (*)
5.6 Marginal hazard models (*)
5.6.1 Cox score equations – revisited (*)
5.6.2 Multivariate Cox model (*)
5.6.3 Clustered data (*)
5.6.4 Recurrent events (*)
5.6.5 Illness-death model (*)
5.7 Goodness-of-fit
5.7.1 Cumulative residuals (*)
5.7.2 Generalized estimating equations (*)
5.7.3 Cox model (*)
5.7.4 Direct regression models (*)
5.8 Examples
5.8.1 Non-Markov transition probabilities
5.8.2 Direct binomial regression
5.8.3 Extended models for recurrent events
5.8.4 Goodness-of-fit based on cumulative residuals
5.9 Exercises

6 Pseudo-values
6.1 Intuition
6.1.1 Introduction
6.1.2 Hazard difference
6.1.3 Restricted mean
6.1.4 Cumulative incidence
6.1.5 Cause-specific time lost
6.1.6 Non-Markov transition probabilities
6.1.7 Recurrent events
6.1.8 Covariate-dependent censoring
6.2 Theoretical properties (*)
6.3 Approximation of pseudo-values (*)
6.4 Goodness-of-fit (*)
6.5 Exercises

7 Further topics
7.1 Interval-censoring
7.1.1 Markov processes (*)
7.1.2 Two-state model (*)
7.1.3 Competing risks (*)
7.1.4 Progressive illness-death model (*)
7.2 Models for dependent data
7.2.1 Times of entry into states
7.2.2 Shared frailty model – two-stage estimation (*)
7.3 Causal inference
7.3.1 Definition of causality
7.3.2 The g-formula (*)
7.3.3 Inverse probability of treatment weighting (*)
7.3.4 Summary and discussion
7.4 Joint models with time-dependent covariates
7.4.1 Random effects model
7.4.2 Likelihood (*)
7.4.3 Prediction of survival probabilities
7.4.4 Landmarking and joint models
7.5 Cohort sampling
7.5.1 Nested case-control studies (*)
7.5.2 Case-cohort studies (*)
7.5.3 Discussion

Bibliography

Subject Index

Preface

Multi-state models provide a statistical framework for studying longitudinal data on subjects
when focus is on the occurrence of events that the subjects may experience over time.
The simplest situation is when only a single event, ‘death’, is of interest – a situation
known as survival analysis. We shall use the phrase multi-state survival data for the data
that arise in the general case when observing subjects over time and several events may be
of interest during their life spans.
As indicated in the sub-title of the book, models for multi-state survival data can either
be specified via rates of transition between states or by directly addressing the risk of
occupying a given state at given points in time. A general approach for addressing risks and
other marginal parameters is via pseudo-values about which a whole chapter is provided.
The background for writing this book is our teaching of several courses on various aspects
of multi-state survival data – either for participants with a basic training in statistics or with
a clinical/epidemiological background. Several texts on multi-state models already exist.
These include the books by Beyersmann et al. (2012), Broström (2012), Geskus (2016),
and more recently, Cook and Lawless (2018). The book by Kalbfleisch and Prentice (1980,
2002) focuses on survival analysis but also discusses general multi-state models. Books
with more emphasis on mathematics are those by Andersen et al. (1993), Hougaard (2000),
Martinussen and Scheike (2006), and Aalen et al. (2008). In addition, several review papers
have appeared (e.g., Hougaard, 1999; Andersen and Keiding, 2002; Putter et al., 2007; Andersen
and Pohar Perme, 2008; Bühler et al., 2023). In spite of the existence of these texts,
we were unable to identify a suitable common book for these different types of participants
and, importantly, none of the cited texts provide detailed discussions on the development
of methods based on pseudo-values.
With this book we aim to fill this gap (at least for ourselves) and to provide a text that
is applicable as the basis for courses for mixed groups of participants. By addressing at
least two types of readers, we deliberately run the risk of falling between two chairs; however,
we believe that readers with different mathematical backgrounds and interests should
all benefit from the book. Those readers who appreciate some mathematical details can
read the book from the beginning to the end, thereby first getting a (hopefully) more intuitive
introduction to the topics, including practical examples and, subsequently, in sections
marked with ‘(*)’ get more details. This will, unavoidably, entail some repetitions. On the
other hand, readers with less interest in mathematical details can read all sections that are
not marked with ‘(*)’ without losing the flow of the book. It should be emphasized that
we will from time to time refer to (*)-marked sections and to more technical publications
from those more intuitive sections. The text includes several summary boxes that emphasize
highlights from recent sections.
The book discusses a number of practical examples of event history data that are meant to
illustrate the methods discussed. Sometimes, different statistical models that are fitted to the
same data are mathematically incompatible and we will make remarks to this effect along
the way. Software code for the calculations in the examples is not documented in the book.
Rather, code in R and SAS and data can be found on the book’s webpage. The webpage
also includes code for solving the practical exercises found at the end of each chapter and
solutions to theoretical exercises marked with (*).
The cover drawing probably calls for an explanation, as follows. When analyzing recurrent
events data and ignoring competing risks (which is quite common in applications), then
a curve like the top one on the figure may be obtained – a curve that is upwards biased.
However, then comes the book by the crow (Per’s middle name) and the raven (Henrik’s
last name) as a rescue and forces the curve downwards to avoid the bias. We wish to thank
Gustav Ravn for the cover drawing.
There are a number of other people to whom we wish to address our thanks and without
whose involvement the writing of this book would not have been possible.
First and foremost, our sincere thanks go to Julie K. Furberg who has carefully created
all figures and validated all analyses quoted in the book. She also gave valuable feedback
on the text. Eva N.S. Wandall thoroughly created solutions to practical exercises and contributed
to some of the examples.
Several earlier drafts of chapters were read and commented upon by Anne Katrine Duun-Henriksen,
Niels Keiding, Thomas H. Scheike, and Henrik F. Thomsen. Torben Martinussen
provided important input for Chapter 6.
A special thank you goes to those who have provided data for the practical examples:
Peter Aaby, Jules Angst, Flemming Bendtsen, John P. Klein, Bjørn S. Larsen, Thorkild
I.A. Sørensen, Niels Tygstrup, and Tine Westergaard. Permission to present analyses of
LEADER data was given by Novo Nordisk A/S.
We thank our employers: University of Copenhagen and Novo Nordisk A/S for letting us
work on this book project during working hours for several years. Special thanks go to Novo
Nordisk A/S for granting a stay at Favrholm Campus to finalize the book. Communication
with the publishers has been smooth and we are grateful for their patience.
Bagsværd, March 2023
Per Kragh Andersen
Henrik Ravn
List of symbols and abbreviations

The following list describes main symbols and abbreviations used in the book:
α(t) Hazard (intensity, rate) function
β Regression coefficient
ε_h(τ) Expected length of stay in state h in [0, τ]
λ(t) Intensity process for counting process
µ(t) Mean number of recurrent events until time t; µ(t) = E(N(t))
A(t) Cumulative hazard function; $A(t) = \int_0^t \alpha(u)\,du$
A_i Frailty for cluster/subject i
B_i Delayed entry time for subject i
C_i Right-censoring time for subject i
D_i Event indicator for subject i: {0, 1}; with competing risks {0, 1, …, k}
DF Degrees of freedom
E(T) Expectation of the random variable T
F(t) Cumulative distribution function for the random variable T
F_h(t) Cumulative incidence function for cause h
G(t) Survival distribution function for C (censoring)
GEE Generalized estimating equation
h State in multi-state model
HR Hazard ratio
I(⋯) Indicator function; I(⋯) is 1 if ⋯ is true and 0 otherwise
J Inspection time giving rise to interval-censoring
K(t) Weight function in logrank test and other non-parametric tests
L_i Likelihood contribution from subject i
LP Linear predictor = β_1 Z_1 + β_2 Z_2 + ⋯ + β_p Z_p
LRT Likelihood ratio test
M(t) Martingale process
N(t) Counting process; N_i(t) for subject i
P(·) Probability
PL Partial likelihood
Q_h(t) Probability to be in state h at time t (state occupation probability)
R(t) Risk set at time t
S(t) Survival distribution function for the random variable T; 1 − F(t)
SD Standard deviation
T_i Event time for individual i
U(·) Score function or other estimating function
V(t) Multi-state process
W_i Weight for subject i
X_i Observation time for subject i; X_i = min(T_i, C_i)
Y(t) Number at risk at time t
Y_i(t) At-risk indicator for subject i
Z_i Covariate for subject i, may be time-dependent: Z_i(t)
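
For orientation, several of the symbols above are linked by simple identities. The block below collects those stated in the list, together with two standard relations that are assumptions added here rather than entries from the list: $S(t) = \exp\{-A(t)\}$, which holds in the two-state survival model when T has a continuous distribution, and $Y(t) = \sum_i Y_i(t)$.

```latex
\begin{align*}
A(t)   &= \int_0^t \alpha(u)\,du,  &  S(t) &= 1 - F(t) = \exp\{-A(t)\}, \\
X_i    &= \min(T_i, C_i),          &  Y(t) &= \sum_i Y_i(t), \\
\mu(t) &= E\{N(t)\}.
\end{align*}
```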
Chapter 1

Introduction

In many fields of quantitative science, subjects are followed over time for the occurrence of
certain events. Examples include clinical studies where cancer patients are followed from
time of surgery until time of death from any cause (survival analysis), epidemiological cohort
studies, e.g., registry-based, where disease-free individuals (‘exposed’ or ‘unexposed’)
are followed from a given calendar time until diagnosis of a certain disease, or demographic
studies where women are followed through child-bearing ages with the focus on ages at
which they give birth to live-born children. Data from such studies may be represented as
events occurring in continuous time and a mathematical framework in which to study such
phenomena is that of multi-state models where an event is considered a transition between
certain (discrete) states. We will denote the resulting data as multi-state survival data or
event history data.
Possible scientific questions that may be addressed in event history analyses include how
the mortality of the cancer patients is associated with individual prognostic variables such
as age, disease stage or histological features of the tumor, or what is the probability that exposed
or unexposed subjects are diagnosed with the disease within a certain time interval, or
what is the expected time that women spend as nulliparous depending on the socio-economic
status of their families.
An important feature of event history data is that of incomplete observation. This means
that observation of the event(s) of interest is precluded by the occurrence of another event,
such as end-of-study, drop-out of study, or death of the individual (in case the event of
interest is non-fatal). Here, as we shall discuss in more detail in Section 1.3, an important
distinction is between avoidable events (right-censoring) representing practical restrictions
in data collection that prevent further observation of the subject (e.g., end-of-study or drop-out)
and non-avoidable events (competing risks), such as the death of a patient. For the
former class of avoidable events, it is an important question whether the incomplete data
that are available to the investigator after censoring still suitably represent the population
for which inference is intended. This is the notion of independent censoring that will also
be further discussed in Section 1.3.
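
The right-censored observation scheme described above can be sketched in a few lines of Python. The sketch follows the notation in the list of symbols, with observation time X_i = min(T_i, C_i), and adds an event indicator D_i = I(T_i ≤ C_i). The function name `observe` and all numerical values are hypothetical illustrations, not data from the book.

```python
from typing import List, Tuple


def observe(event_times: List[float], censoring_times: List[float]) -> List[Tuple[float, int]]:
    """Return (X_i, D_i) pairs: X_i = min(T_i, C_i) and D_i = 1 if the event was observed."""
    observed = []
    for t, c in zip(event_times, censoring_times):
        x = min(t, c)            # follow-up ends at the event or at censoring
        d = 1 if t <= c else 0   # 1: event observed, 0: right-censored
        observed.append((x, d))
    return observed


# Hypothetical latent event times T_i and censoring times C_i (years)
T = [2.5, 4.0, 1.2, 6.3]
C = [3.0, 3.5, 5.0, 6.0]
data = observe(T, C)
print(data)
```

For a censored subject (D_i = 0), all that is known is that the event time exceeds X_i.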
In this book we will discuss two classes of statistical models for multi-state survival data:
Intensity-based models and marginal models. Briefly, intensities or rates are parameters that
describe the immediate future development of the process conditionally on past information
on how the process has developed, while marginal parameters, such as the risk of being in
a given state at a particular time, do not involve such a conditioning. Both classes of models
often involve explanatory variables (or covariates/prognostic variables/risk factors – terms
that we will use interchangeably in the book).
The first model class targets intensities and is inspired by standard hazard models for sur-
vival data, and we shall see that models such as the Cox (1972) proportional hazards model
also play an important role for more general multi-state survival data. Throughout, we will
use the terms intensity, hazard, and rate interchangeably. Models for intensities are dis-
cussed in Chapters 2 and 3.
The second model class targets marginal parameters (e.g., risks) and, here, one approach is
plug-in methods where the marginal parameter is estimated using intensity-based models.
Thus, the results from these models are either inserted (‘plugged’) into an equation giving
the relationship between the marginal parameter and the intensities, or they are used as the
basis for simulating a large number of realizations of the multi-state process, whereby the
marginal parameter may be estimated, a technique known as micro-simulation. Another
approach is to use models that directly target marginal parameters, and a number of such models
will also be presented. Marginal models are discussed in Chapters 4 and 5. For direct
marginal models (or simply direct models), as we shall see in Chapter 6, pseudo-values
(or pseudo-observations) are useful. In the final Chapter 7, a number of further topics are
briefly discussed.
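
To make the two plug-in ideas concrete, the following minimal sketch contrasts them in the simplest possible setting. It assumes a two-state model with a constant hazard, so that the risk of having left state 0 by time t is F(t) = 1 − exp(−αt). The value of `alpha_hat`, the time horizon, and the function names are hypothetical; this is not the book's code, which is provided in R and SAS on the book's webpage.

```python
import math
import random


def plug_in_risk(alpha: float, t: float) -> float:
    """Insert the hazard into F(t) = 1 - exp(-A(t)), with A(t) = alpha * t."""
    return 1.0 - math.exp(-alpha * t)


def micro_simulation_risk(alpha: float, t: float, n: int, seed: int = 1) -> float:
    """Simulate n realizations of the process and count those in state 1 at time t."""
    rng = random.Random(seed)
    in_state_1 = sum(1 for _ in range(n) if rng.expovariate(alpha) <= t)
    return in_state_1 / n


alpha_hat = 0.1   # hypothetical estimated hazard (events per year)
horizon = 5.0     # time point of interest (years)
risk_formula = plug_in_risk(alpha_hat, horizon)
risk_simulated = micro_simulation_risk(alpha_hat, horizon, n=50_000)
print(round(risk_formula, 4), round(risk_simulated, 4))
```

With a large number of simulated paths the two numbers should agree closely. Simulation becomes the practical choice when no simple equation links the marginal parameter to the intensities.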
Sections marked with ‘(*)’ contain, as indicated in the Preface, more mathematical details.
Each chapter ends with a number of exercises where those marked with ‘(*)’ are more
technical.
Multi-state survival data
Multi-state survival data (event history data) represent subjects followed over time
for the occurrence of events of interest. The events occur in continuous time and an
event is considered a transition between discrete states.

1.1 Examples of event history data

To support a more tangible discussion of the concepts highlighted above, we will in this
section present a series of examples of event history data and, along the way, specify the
scientific questions that the examples were meant to address, the events of interest (and,
thereby, the states in the multi-state model), and the censoring events.

1.1.1 PBC3 trial in liver cirrhosis

The PBC3 trial was a multi-center randomized clinical trial conducted in six European hospitals
(Lombard et al., 1993). Between January 1983 and January 1989, 349 patients with
the liver disease primary biliary cirrhosis (PBC) were randomized to treatment with either
Cyclosporin A (CyA, 176 patients) or placebo (173 patients). The purpose of the trial was
to study the effect of treatment on the survival time, so the event of interest is death of the
patient. The censoring events include drop-out before the planned termination of the trial
(end of December 1988) and being alive at the end of the trial. However, during the course
of the trial, an increased use of liver transplantation as a possible treatment for patients with
this disease forced the investigators to reconsider the trial design. Liver transplantation was
primarily offered to severely ill patients and, therefore, censoring patients at the time of
transplantation would likely leave the investigators with a sample of ‘too well’ patients
that would no longer be representative of patients with PBC. This led them to redefine the
main event of interest to be ‘failure of medical treatment’ defined as the composite end-point
of either death or liver transplantation, whichever occurred first. This is because both
death and the need for a liver transplantation signal that the medical treatment is no longer
effective. Patients were followed from randomization until treatment failure, drop-out or
January 1989; 61 patients died (CyA: 30, placebo: 31), another 29 were transplanted (CyA:
14, placebo: 15) and 4 patients were lost to follow-up before January 1989. For patients
lost to follow-up and for those alive without having had a liver transplantation in January
1989, all that is known about time to failure is that it exceeds time from randomization
to end of follow-up.

0: Alive → 1: Dead

Figure 1.1 The two-state model for survival data.

Figure 1.1 shows the general two-state model for survival data with states ‘0: Alive’ and ‘1:
Dead’ and one possible transition from state 0 to state 1 representing the event ‘death’. In
the PBC3 trial, this model is applicable with the two states representing: (0) ‘Alive without
transplantation’ and (1) ‘Dead or transplantation’ and the transition, 0 → 1, representing
the event of interest – failure of medical treatment.
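
The composite end-point can be illustrated with a small sketch that derives the time and status of ‘failure of medical treatment’ from separate death and transplantation times. The function `treatment_failure`, its arguments, and the example patients are hypothetical and only mirror the definition given above; they are not part of the trial data.

```python
from typing import Optional, Tuple


def treatment_failure(
    death_time: Optional[float],
    transplant_time: Optional[float],
    end_of_follow_up: float,
) -> Tuple[float, int]:
    """Return (time, status); status 1 = failure of medical treatment, 0 = censored."""
    observed = [
        t for t in (death_time, transplant_time)
        if t is not None and t <= end_of_follow_up
    ]
    if observed:
        return min(observed), 1   # 0 -> 1 transition at whichever event occurred first
    return end_of_follow_up, 0    # still in state 0 when observation stopped


# Hypothetical patients (times in years since randomization)
died = treatment_failure(2.1, None, 4.0)
transplanted = treatment_failure(None, 1.4, 4.0)
transplanted_then_died = treatment_failure(3.0, 1.4, 4.0)
censored = treatment_failure(None, None, 3.2)
print(died, transplanted, transplanted_then_died, censored)
```

A patient who is transplanted and later dies contributes only the transplantation time, since the composite event is defined by whichever occurs first.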
PBC3 was a randomized trial and, therefore, the explanatory variable of primary inter-
est was the treatment indicator. However, in addition, a number of clinical, biochemical,
and histological variables were recorded at entry into the study. Studying the distribution
of such prognostic variables in the two treatment groups, it appeared that, in spite of the
randomization, the CyA group tended to present with somewhat less favorable values of
these variables than the placebo group. Therefore, evaluation of the treatment effect with
or without adjustment for explanatory variables shows some differences to be discussed in
later chapters.
An alternative to defining the composite end-point ‘failure of medical treatment’ would
be to study the two events ‘death without transplantation’ and ‘liver transplantation’ separately.
This would enable a study of possibly different effects of treatment (and other covariates)
on each of these separate events. This situation is depicted in Figure 1.2, showing
the general competing risks model. Compared to Figure 1.1 it is seen that the initial state
‘Alive’ is the same whereas the final state ‘Dead’ is now split into a number, k, of separate
states, transitions into which represent deaths from different causes. For the PBC3 trial,
state 0 represents, as before, ‘Alive without transplantation’ and there are k = 2 final states
representing, respectively, ‘1: Transplantation’ and ‘2: Dead without transplantation’. The
event ‘liver transplantation’ is a 0 → 1 transition and ‘death without liver transplantation’ a
0 → 2 transition. Some patients died after liver transplantation. However, the initial medical
treatment (CyA or placebo) was no longer considered relevant after a transplantation, so
information on mortality after transplantation was not ascertained as a part of the trial and
is not available.

0: Alive → 1: Dead, cause 1
   ⋮
0: Alive → k: Dead, cause k

Figure 1.2 The competing risks model with k causes of death.
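
As a small arithmetic check, the event counts quoted in Section 1.1.1 can be tabulated by cause and by treatment arm, and collapsed to the composite end-point. The dictionary layout and variable names below are illustrative assumptions; only the counts are taken from the text.

```python
# Counts reported in the text for the PBC3 trial, by treatment arm
randomized = {"CyA": 176, "placebo": 173}
transplanted = {"CyA": 14, "placebo": 15}   # 0 -> 1 transitions
died = {"CyA": 30, "placebo": 31}           # 0 -> 2 transitions (death without transplantation)

summary = {}
for arm, n in randomized.items():
    failures = transplanted[arm] + died[arm]   # composite 'failure of medical treatment'
    summary[arm] = {
        "transplantation": transplanted[arm],
        "death_without_transplantation": died[arm],
        "composite_failure": failures,
        "no_event_observed": n - failures,
    }

total_failures = sum(s["composite_failure"] for s in summary.values())
for arm, s in summary.items():
    print(arm, s)
print("total failures:", total_failures)
```

The 259 patients without an observed failure comprise the 4 patients lost to follow-up and those still alive without transplantation at the end of follow-up.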

1.1.2 Guinea-Bissau childhood vaccination study

The Guinea-Bissau study was a longitudinal study of women of fertile age and their
prospectively registered children and was initiated in 1990 in five regions of Guinea-Bissau,
West Africa. This observational study was set up to assess childhood and maternal mortality.
In each region, 20 clusters of 100 women were selected and visited approximately every
6 months by a mobile team. The purpose of collecting these data, used by Kristensen et al.
(2000), was to examine the association between routine childhood vaccinations and infant
mortality. The main outcome was infant mortality over 6 months among 5,274 children aged
0-6 months at the initial visit (first visit by the mobile team). The recommended
vaccination schedule for this age group in Guinea-Bissau at that time was BCG (Bacillus
Calmette-Guérin) and polio at birth; DTP (diphtheria, tetanus, and pertussis) and oral polio
at 6, 10, and 14 weeks. At the visits, vaccination status was ascertained by inspection of
the immunization card. The authors analyzed mortality before next visit according to the
vaccination status assessed at the initial visit.
Table 1.1 Guinea-Bissau childhood vaccination study: Mortality during 6 months of follow-up
according to vaccination status (BCG or DTP) at initial visit among 5,274 children.

                Died during follow-up
Vaccinated      Yes            No      Total
No               95 (4.9%)     1847    1942
Yes             127 (3.8%)     3205    3332
Total           222 (4.2%)     5052    5274

As in the PBC3 example, there are two relevant states ‘Alive’ and ‘Dead’ as represented in
Figure 1.1. The censoring events include out-migration between visits and being alive at the
subsequent visit. Table 1.1 provides the basic mortality for vaccinated and non-vaccinated
children.
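The percentages in Table 1.1 can be reproduced with a few lines of Python. This is only a sketch of the crude arithmetic: the variable names are illustrative, the calculation ignores censoring (e.g., out-migration between visits), and the crude risk ratio it prints is not adjusted for any covariates.

```python
# Counts copied from Table 1.1 (deaths and number of children)
# by vaccination status at the initial visit.
table_1_1 = {
    "not vaccinated": {"died": 95, "total": 1942},
    "vaccinated": {"died": 127, "total": 3332},
}


def six_month_risk(group):
    """Crude proportion of children who died during follow-up."""
    return group["died"] / group["total"]


risk_unvaccinated = six_month_risk(table_1_1["not vaccinated"])
risk_vaccinated = six_month_risk(table_1_1["vaccinated"])

# Crude (unadjusted) comparison of the two groups.
crude_risk_ratio = risk_vaccinated / risk_unvaccinated

print(f"Not vaccinated: {100 * risk_unvaccinated:.1f}%")
print(f"Vaccinated:     {100 * risk_vaccinated:.1f}%")
print(f"Crude risk ratio (vaccinated vs. not vaccinated): {crude_risk_ratio:.2f}")
```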
This study was an observational study, as allocation to vaccination groups was not random-
ized. This means that any observed association between vaccination status and later mor-
tality may be confounded because of uneven distributions of mortality risk factors among
vaccinated and non-vaccinated children. Thus, there may be a need to adjust for covariates
ascertained at the initial visit in the analysis of mortality and vaccinations.
In principle, information on vaccines received between visits was available for surviving
children at the next visit. However, these extra data are discarded because, culturally, the
belongings of deceased children, including immunization cards, are destroyed. Information
on vaccines given between visits would therefore be differential (available for survivors
only), and using it would lead to immortal time bias.

1.1.3 Testis cancer incidence and maternal parity

In a registry-based study, Westergaard et al. (1998) extracted information from the Danish
Civil Registration system on all women born in Denmark from 1935 to 1978 who
were alive when the system was established in April 1968. Based on this, a cohort of all
(1,015,994) sons of those women who were alive in 1968 or born later was created, and
this cohort was followed from April 1968 or date of birth (whichever came later) until a
diagnosis of testicular cancer (ascertained in The Danish Cancer Registry, 626 cases), death
(1.5%), emigration (1.3%), or end of study (end of 1992). The total follow-up time at risk
was 15,981,967 years. The main purpose of the study was to address whether first-born sons
have a higher incidence of testicular cancer than later born sons, and a secondary question
was whether this potential association was present for two histological sub-types, semi-
nomas (183 cases) and non-seminomas (443 cases). First-born sons provided 7,972,276
person-years at risk and 398 cases of testicular cancer, and the similar numbers for later
born sons were 8,009,691 person-years and 228 cancer cases. A number of other potential
risk factors for testis cancer, including age of the mother at time of birth and calendar time
at birth of the son were also ascertained from the civil registration system.
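The person-years and case counts quoted above translate directly into crude incidence rates. The sketch below uses only numbers stated in the text; the rate ratio it prints is unadjusted (no account is taken of age, calendar time, or other risk factors) and is shown purely to illustrate the arithmetic.

```python
# Cases of testicular cancer and person-years at risk, as quoted in the text.
first_born = {"cases": 398, "person_years": 7_972_276}
later_born = {"cases": 228, "person_years": 8_009_691}


def rate_per_100k(group):
    """Crude incidence rate per 100,000 person-years."""
    return 1e5 * group["cases"] / group["person_years"]


rate_first = rate_per_100k(first_born)
rate_later = rate_per_100k(later_born)
crude_rate_ratio = rate_first / rate_later

# Consistency checks against the totals reported for the whole cohort.
total_cases = first_born["cases"] + later_born["cases"]
total_person_years = first_born["person_years"] + later_born["person_years"]

print(f"First-born: {rate_first:.2f} per 100,000 person-years")
print(f"Later born: {rate_later:.2f} per 100,000 person-years")
print(f"Crude rate ratio: {crude_rate_ratio:.2f}")
print(f"Totals: {total_cases} cases, {total_person_years} person-years")
```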
The relevant multi-state model for this situation is the competing risks model (Figure
1.2) where the final states are ‘Testis cancer’ (possibly further split into seminomas and
non-seminomas) and ‘Death without testis cancer’. The censoring events are emigration
and end-of-study. However, because of the rather large data set with more than a million
cohort members, the raw data set with individual records was first tabulated according to
the explanatory variables (including ‘current age of the son’) where, for each combination
of these variables, the person-years at risk and numbers of seminomas and non-seminomas
are given. In a similar fashion, the numbers of deaths could be tabulated; however, that
information was not part of the available data and this has a number of consequences for
the analyses that are possible to conduct for this study. This will be discussed later
(Section 3.6).

Table 1.2 PROVA trial in liver cirrhosis: Numbers of patients and numbers of events.

                                              Deaths     Deaths
                                              without    after
Treatment group       Patients   Bleedings    bleeding   bleeding   Drop-out
Sclerotherapy only          73          13          13          5          5
Propranolol only            68          12           5          6          7
Both treatments             73          12          20         10          5
No treatment                72          13           8          8          3
Total                      286          50          46         29         20

1.1.4 PROVA trial in liver cirrhosis

The PROVA trial (PROVA Study Group, 1991) was a Danish-Norwegian multi-center,
investigator-initiated clinical trial with the purpose of evaluating the prophylactic effect
of propranolol (a beta-blocker) and/or sclerotherapy (a treatment where polidocanol is in-
jected directly in the sub-mucosa next to the vein) on the occurrence of bleeding and death
in patients with liver cirrhosis. Eligible patients, recruited from eleven hospitals, included
those in whom cirrhosis was histologically verified, endoscopy had shown oesophageal
varices, but a transfusion-requiring bleeding had not yet been observed. Between Novem-
ber 1985 and March 1989, 286 patients were randomized (1:1:1:1) as shown in Table 1.2
that also shows the numbers of events in each of the four treatment groups. Twenty patients
dropped out of the trial without an event before the date of termination (end of 1989). At
the end of follow-up, 286 − (46 + 29 + 20) = 191 patients were still alive and in the trial
(out of whom 50 − 29 = 21 had experienced a bleeding).
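The bookkeeping behind these two numbers can be repeated within each treatment group. The sketch below copies the counts from Table 1.2 and assumes, as the text does, that drop-outs left the trial without an event; the function and variable names are illustrative only.

```python
# Counts copied from Table 1.2: (patients, bleedings, deaths without bleeding,
# deaths after bleeding, drop-outs) for each treatment group.
prova = {
    "Sclerotherapy only": (73, 13, 13, 5, 5),
    "Propranolol only": (68, 12, 5, 6, 7),
    "Both treatments": (73, 12, 20, 10, 5),
    "No treatment": (72, 13, 8, 8, 3),
}


def still_in_trial(patients, bleedings, dead_no_bleed, dead_after_bleed, dropouts):
    """Return (alive and in the trial, of whom had a bleeding) at end of follow-up.

    Assumes that drop-outs left the trial without having had an event.
    """
    alive = patients - (dead_no_bleed + dead_after_bleed + dropouts)
    alive_after_bleeding = bleedings - dead_after_bleed
    return alive, alive_after_bleeding


per_group = {name: still_in_trial(*counts) for name, counts in prova.items()}
total_alive = sum(alive for alive, _ in per_group.values())
total_alive_after_bleeding = sum(bled for _, bled in per_group.values())

for name, (alive, bled) in per_group.items():
    print(f"{name}: {alive} alive in the trial, {bled} of them after a bleeding")
print(f"Total: {total_alive} alive in the trial, "
      f"{total_alive_after_bleeding} of them after a bleeding")
```

In terms of the illness-death model introduced next, the second number in each pair counts patients who would be in the state ‘Alive with bleeding’ at the end of follow-up.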
Figure 1.3 shows the (irreversible or progressive) illness-death model. This multi-state
model is also known as the disability model. Compared to Figure 1.1, it is now the ini-
tial state 0 that is split into separate states, and compared to Figure 1.2, a transition from
state 1 to state 2 is now included. This model is applicable in the PROVA trial with state
0 representing ‘Alive without bleeding’, state 1 ‘Alive with bleeding’ and state 2 ‘Dead’.
The event ‘bleeding’ corresponds to a 0 → 1 transition, ‘death without bleeding’ to a 0 → 2
transition, and ‘death after bleeding’ to a 1 → 2 transition. Note that, in the trial, death after
bleeding was not considered a primary end-point, but since these deaths were registered as
secondary end-points (and since they are of clinical interest) they are included as events in
the figure.
[Figure 1.3: diagram with states 0 ‘Disease-free’, 1 ‘Diseased’, and 2 ‘Dead’ and arrows for the 0 → 1, 0 → 2, and 1 → 2 transitions.]

Figure 1.3 The progressive illness-death model.

As in the PBC3 trial (Section 1.1.1), there were two censoring events:
Drop-out and end-of-study. Furthermore, a number of potential explanatory variables were
recorded at entry into the PROVA trial. These variables may be used when studying the
prognosis of the patients.

1.1.5 Recurrent episodes in affective disorders

Psychiatric patients diagnosed with an affective disorder, i.e., unipolar disorder (depres-
sion) or bipolar disorder (manic-depression), often experience recurrent disease episodes
after the initial diagnosis. Kessing et al. (2004) reported on follow-up of 186 unipolar
and 220 bipolar patients who had been admitted to the Psychiatric Hospital, University of
Zürich, Switzerland between 1959 and 1963. Here, we study the 119 of those patients who
had their initial diagnosis in that period (98 unipolar and 21 bipolar patients). At follow-up
times in 1963, 1965, 1970, 1975, 1980, and 1985, disease episodes were retrospectively
ascertained via family doctors’ reports, records from in- and out-patient services, and via
patients or family members. Data on mortality and on dates of end of episodes were also
collected. The purpose was to study the pattern of repeated disease episodes, in particular
whether the disease course was deteriorating, and the event of primary interest was there-
fore the beginning of a new episode. Patients had on average 5.6 observed episodes (range
from 1 to 26), 78 patients had died by 1985, 38 patients were still alive, and 3 were lost to
follow-up before 1985. Figure 1.4 shows a multi-state model applicable in this situation,
known as the illness-death model with recovery. Compared to Figure 1.3, a transition from
state 1 to state 0, a ‘recovery’, is now included. If we think of an episode as an admission
to hospital, then state 0 corresponds to ‘Out of hospital’ and state 1 to ‘In hospital’. A tran-
sition from 0 to 1 is a hospital admission (initiation of a new episode – the event of primary
interest) and a 1 → 0 transition is a discharge from hospital (end of an episode). From both
states, a patient may die and, thereby, make a transition to state 2. Thus, the model depicts
that there are periods where patients are not at risk for experiencing the event of primary
interest.
[Figure 1.4: diagram with states 0 ‘At risk’, 1 ‘Not at risk’, and 2 ‘Dead’ and arrows for the 0 → 1, 1 → 0, 0 → 2, and 1 → 2 transitions.]

Figure 1.4 The illness-death model with recovery, applicable for recurrent episodes with a terminal
event, i.e., situations with a terminal event and with periods between times at which subjects are at
risk for a new event.

Sometimes, in spite of the fact that there are intervals between the ‘at-risk periods’, focus
may be on times from the initiation of one episode to the initiation of the next rather than on
times between events. In such an approach, depicted in Figure 1.5, the interval not at risk is
included in the time between events so that, by definition, there are no such intervals.
Note that the terminal state has been re-labelled as ‘D’. We will denote recurrent events
where the events have a certain duration (as in this example) as recurrent episodes.

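[Figure 1.5: diagram with states 0 ‘No event’, 1 ‘1 event’, 2 ‘2 events’, ···, and D ‘Dead’, with arrows from each event-count state to the next one and from each event-count state to D.]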

Figure 1.5 A multi-state model for recurrent events with a terminal event and no intervals between
at-risk periods.

In these data, a number of explanatory variables that may affect the outcome and its asso-
ciation with the initial diagnosis (unipolar vs. bipolar) were recorded at the time when the
initial diagnosis was given. These potential confounders include sex and age of the patient
and calendar time at the initial diagnosis.
In situations without a terminal event, e.g., when mortality is negligible, the models in
Figures 1.4 and 1.5 without the final ‘Dead’ state may be applicable.

1.1.6 LEADER cardiovascular trial in type 2 diabetes

The LEADER trial (Liraglutide Effect and Action in Diabetes: Evaluation of cardiovas-
cular outcome Results) (Marso et al., 2016) was a company-initiated, double-blind ran-
domized controlled multi-center trial investigating the cardiovascular effects of liraglutide,
a glucagon like peptide-1 (GLP-1) receptor agonist approved for treatment of type 2 dia-
betes, versus placebo when added to standard of care in a population with type 2 diabetes
and a high cardiovascular risk. A total of 9,340 subjects were randomized 1:1 to receive
either liraglutide or placebo during the period from September 2010 through April 2012.
Follow-up was terminated between August 2014 and December 2015, corresponding to a
planned time on trial between 42 and 60 months. The median follow-up time was reported
to be 3.8 years. The primary end-point was a three-component major adverse cardiovascular
events (3-p MACE) composite end-point consisting of non-fatal stroke, non-fatal myocar-
dial infarction (MI) or cardiovascular (CV) death. The primary analysis was a time-to-event
analysis of time to first 3-p MACE. This end-point occurred in 608 out of 4,668 patients
randomized to liraglutide and in 694 out of 4,672 placebo treated patients – a difference
that was statistically significant when analyzed in a competing risks model (Figure 1.2),
the competing event being death from non-CV causes.
Two of the components of 3-p MACE may occur repeatedly. Thus, recurrent MI and recur-
rent stroke (either fatal or non-fatal) are both possible, and a model like the one depicted
in Figure 1.5 may be applicable. Here, one possibility would be to define the recurrent
event as MI in which case the terminal event would be death from any cause (entry into
state D). One could also be interested in ‘recurrent 3-p MACE’ where the state D would be
non-CV death; however, one would then have to address the problem that the occurrence
of one component of the recurrent end-point, namely CV death, would imply that no fur-
ther events are possible. If an MI or stroke was fatal, this has been coded as an event (MI
or 3-p MACE) occurring on a given calendar day, and then a cardiovascular death on the
subsequent calendar day. Censoring was primarily caused by patients being alive at the end
of follow-up. Thus, according to Marso et al. (2016), only 139 patients in the liraglutide
group (3.0%) did not complete the study and the similar number in the placebo group was
159 (3.4%). Table 1.3 gives some key numbers of event counts observed in the trial.

Table 1.3 LEADER cardiovascular trial in type 2 diabetes: Observed myocardial infarctions (MI)
and major adverse cardiovascular events (MACE).

                              Recurrent MI              Recurrent 3-p MACE
                              Liraglutide   Placebo     Liraglutide   Placebo
≥ 1 event                             292       339             608       694
(Total events                         359       421             768       923)
Dead before 1st event                 329       373             137       133
Censored before 1st event            4047      3960            3923      3845
Randomized                           4668      4672            4668      4672

1.1.7 Bone marrow transplantation in acute leukemia

For patients with certain hematological diseases such as leukemia, bone marrow (BM)
transplantation (also called stem cell transplantation) is an often used treatment option.
Briefly, the immune system of the patient is first affected by chemotherapy (the
‘pre-conditioning’ which removes disease symptoms) and, next, bone marrow from the donor is
infused. Two serious and competing events are often studied in relation to the treatment:
Relapse of the disease, i.e., return of the disease symptoms, and non-relapse mortality (death
in remission), both of which signal that the treatment with BM is no longer effective. After
a relapse, patients are given second line treatment to reduce the mortality. A complication
to the treatment is graft versus host disease (GvHD) where the infused donor cells react
against the patient.
A data set compiled from the Center for International Blood and Marrow Transplant Re-
search (CIBMTR) was analyzed by Andersen and Pohar Perme (2008) with the main pur-
pose of studying how occurrence of the GvHD event affected relapse and death in remis-
sion. The CIBMTR is comprised of clinical and basic scientists who confidentially share
data on their blood and bone marrow transplant patients with the CIBMTR Data Collection
Center located at the Medical College of Wisconsin, Milwaukee, USA. The CIBMTR is
a repository of information about results of transplants at more than 450 transplant cen-
ters worldwide. The present data set consists of 2,009 patients from 255 different centers
who received an HLA-identical sibling transplant between 1995 and 2004 for acute myel-
ogenous leukemia (AML) or acute lymphoblastic leukemia (ALL) and were transplanted in
first complete remission, i.e., when the pre-conditioning has eliminated the leukemia symp-
toms. All patients received BM or peripheral blood (PB) stem cell transplantation. Table
1.4 gives an overview of the events observed during follow-up (until 2007). The 1,272
(= 2,009 − 737) patients who were still alive at that time are censored.
Figure 1.6 shows the states and events for this study. A number of potential prognostic
variables for the events (GvHD, relapse and death) were ascertained at time of transplan-
tation. These variables include disease type (ALL vs. AML), graft type (BM or BM/PB),
and sex and age of the patient. Sometimes, GvHD is considered, not a state but rather a
time-dependent covariate, in which case the diagram in Figure 1.3 would be applicable
with states BMT, relapse and dead.

Table 1.4 Bone marrow transplantation in acute leukemia: Events observed in 2,009 leukemia
patients who underwent bone marrow transplantation (GvHD: Graft versus host disease).

Event                No. of patients   Percentage
Relapse                          259   12.9
Death                            737   36.7
Relapse and death                232   89.6 of patients with relapse
GvHD                             976   48.6
GvHD and relapse                  91    9.3 of patients with GvHD
GvHD and death                   389   39.9 of patients with GvHD

[Figure 1.6: diagram with states 0 ‘BMT’, 1 ‘GvHD’, 2 ‘Relapse’, and 3 ‘Dead’ and arrows for the possible transitions between them.]

Figure 1.6 Bone marrow transplantation (BMT) in acute leukemia: States and transitions (GvHD:
Graft versus host disease).

1.1.8 Copenhagen Holter study

In the Copenhagen Holter Study, men and women aged 55, 60, 65, 70, or 75 years and
living in two postal regions in Copenhagen, Denmark were contacted during the period
1998–2000. These subjects received a questionnaire including items on cardiovascular risk
factors and medical history. All respondents with more than 1 risk factor and a 60% sample
of those with 0 or 1 risk factor were invited to a physical examination including a 48-hour
continuous electrocardiogram recording (‘Holter monitoring’). Larsen et al. (2015)
reported on a follow-up until 2013 of 678 participants with the purpose of studying the
association between excessive supra-ventricular ectopic activity (ESVEA, a particular kind
of irregular heart rhythm detected via the Holter monitoring) and later atrial fibrillation (AF,
a serious heart arrhythmia affecting the blood circulation) and stroke. It was well known that
ESVEA increases the incidence of AF, but one purpose of the study was to examine whether
the incidence of stroke in patients with ESVEA was increased over and above what could
be explained by an increase in the occurrence of AF. Events of AF, stroke and death during
follow-up were ascertained via the Danish National Patient Registry. Figure 1.7 shows the
possible states and transitions that can be studied based on these data. Note that, compared
to Figure 1.6, this state diagram allows a 2 → 1 transition; however, such transitions do not
impact the basic scientific question raised in the Copenhagen Holter Study. Table 1.5 shows
the number of patients who were observed to follow the different possible paths through
these states according to ESVEA at time of recruitment. From this table it appears that AF
occurred in 18% of the patients with ESVEA and in 10% of those without, stroke occurred
in 21% of patients with ESVEA and in 9% of those without. Among those who experienced
AF without stroke, 25% of patients with ESVEA later had a stroke. The similar fraction for
patients without ESVEA was 9%. An analysis of these interacting events must account for
the fact that patients may also die without AF and/or stroke events.
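The percentages quoted from Table 1.5 can be recomputed from the path counts. The sketch below encodes each observed path as the sequence of states visited after state 0; this encoding and the helper names are illustrative choices, not part of the original analysis.

```python
# Observed paths after state 0, with (without ESVEA, with ESVEA) counts from Table 1.5.
paths = {
    (): (320, 34),
    ("AF",): (29, 8),
    ("Stroke",): (17, 1),
    ("AF", "Stroke"): (3, 1),
    ("Stroke", "AF"): (4, 0),
    ("Dead",): (158, 32),
    ("AF", "Dead"): (20, 4),
    ("Stroke", "Dead"): (25, 14),
    ("AF", "Stroke", "Dead"): (2, 3),
    ("Stroke", "AF", "Dead"): (1, 2),
}


def count(group, keep):
    """Patients in a group (0: without ESVEA, 1: with ESVEA) whose path satisfies `keep`."""
    return sum(n[group] for path, n in paths.items() if keep(path))


def af_before_any_stroke(path):
    """True if AF occurred while the patient had not (yet) had a stroke."""
    return "AF" in path and (
        "Stroke" not in path or path.index("AF") < path.index("Stroke")
    )


summary = {}
for group, label in enumerate(["without ESVEA", "with ESVEA"]):
    total = count(group, lambda p: True)
    af_first = count(group, af_before_any_stroke)
    summary[label] = {
        "AF": count(group, lambda p: "AF" in p) / total,
        "stroke": count(group, lambda p: "Stroke" in p) / total,
        "stroke after AF": count(
            group, lambda p: af_before_any_stroke(p) and "Stroke" in p
        ) / af_first,
    }

for label, fractions in summary.items():
    print(label, {name: f"{100 * value:.0f}%" for name, value in fractions.items()})
```

These are crude fractions of patients. They take no account of follow-up time or of deaths occurring before AF or stroke, which is the point made in the final sentence of the paragraph above.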
[Figure 1.7: diagram with states 0 ‘No event’, 1 ‘AF’, 2 ‘Stroke’, and 3 ‘Dead’ and arrows for the possible transitions between them.]

Figure 1.7 Copenhagen Holter study: States and transitions (AF: Atrial fibrillation).

Table 1.5 Copenhagen Holter study: Number of patients following different paths (ESVEA:
Excessive supra-ventricular ectopic activity; AF: Atrial fibrillation).

                                    Number of patients
Observed path                 Without ESVEA   With ESVEA   Total
0                                       320           34     354
0 → AF                                   29            8      37
0 → Stroke                               17            1      18
0 → AF → Stroke                           3            1       4
0 → Stroke → AF                           4            0       4
0 → Dead                                158           32     190
0 → AF → Dead                            20            4      24
0 → Stroke → Dead                        25           14      39
0 → AF → Stroke → Dead                    2            3       5
0 → Stroke → AF → Dead                    1            2       3
Total                                   579           99     678

The Copenhagen Holter study was an observational study, so adjustment for covariates
(potential confounders) may be needed when examining the association between the
‘exposure’ ESVEA and later events like AF or stroke. A number of covariates were observed
posure’ ESVEA and later events like AF or stroke. A number of covariates were observed
at the examination at the time of recruitment, including smoking status, age, sex, blood
pressure, and body mass index. The follow-up in this study was, like in the testis cancer
incidence study, Example 1.1.3, registry-based and in Denmark this means that, in princi-
ple, there is no loss to follow-up (except for the fact that patients may emigrate before the
end of the study which, however, nobody did). As a consequence the only censoring event
is end-of-study (alive in 2013). This data set will mainly be used for practical exercises
throughout the book.
Examples of event history data

1. PBC3 trial: Randomized trial of effect of CyA vs. placebo on survival and liver
transplantation in patients with Primary Biliary Cirrhosis (n = 349).
2. Guinea-Bissau study: Observational study of effect of childhood vaccinations
on survival (n = 5,274).
3. Testis cancer study: Register study on the relationship between maternal parity
and testicular cancer rates of their sons (n = 1,015,994).
4. PROVA trial: Randomized trial of effect of propranolol and/or sclerotherapy on
the occurrence of bleeding and death in patients with liver cirrhosis (n = 286).
5. Recurrent episodes in affective disorders: Observational study of pattern of
repeated disease episodes for patients with unipolar or bipolar disorder (n = 119).
6. LEADER trial: Randomized trial in type 2 diabetics with high cardiovascular
risk – effect of liraglutide vs. placebo on cardiovascular events (n = 9,340).
7. Bone marrow transplantation: Observational study of effect of graft versus host
disease (GvHD) on relapse and death in remission among bone marrow transplanted
patients with leukemia (n = 2,009).
8. Copenhagen Holter study: Observational study of the association between excessive
supra-ventricular ectopic activity (ESVEA) and later atrial fibrillation
(AF) and stroke (n = 678).

1.2 Parameters in multi-state models

1.2.1 Choice of time-variable
In the examples in Section 1.1, we have seen how multi-state models may provide a suitable
framework for describing event history data. Multi-state models are given by a number of
states and possible transitions between these states that occur over time. For applications
of multi-state models, for any set of data at hand, one must consider what is meant by time.
In other words, a suitable time origin, a time zero, must be chosen.
In some cases, this choice is obvious. Thus, for randomized studies such as the PBC3,
PROVA, and LEADER trials (Examples 1.1.1, 1.1.4, and 1.1.6), time zero is typically taken
to be the time of randomization where participants fulfill the relevant inclusion criteria and
where treatment is initiated. In clinical follow-up studies, there may also be an initiating
event that defines the inclusion into the study and, therefore, serves as a suitable time origin.
This includes the initial diagnosis of affective disorder in Example 1.1.5, the time of bone
marrow transplantation in Example 1.1.7, and perhaps to a lesser extent, the time of initial
clinical assessment in the Copenhagen Holter Study (Example 1.1.8).
However, in observational studies (Examples 1.1.2 and 1.1.3), the time of entry into the
study is not necessarily a suitable time origin because that date may not be the time of any
important event in the lifetime of the participants. In such cases, alternative time axes to
be used for modeling include age or calendar time. Here, subjects are not always followed
from the corresponding time origin (time of birth or some fixed calendar date), a situation
known as delayed entry (or left-truncation), and subjects are only included into the study
conditionally on being alive and event-free at the time of entry into the study. As an example,
we can consider the study of children in Guinea-Bissau (Example 1.1.2) where a choice of
a primary time-variable for the survival analysis is needed. For the time-variable ‘time
since initial visit’, say t, all children will be followed from the same time zero. This
time-variable has the advantage of all risk factors being ascertained at the same time; however,
the mortality rate for the children will likely not depend strongly on t. Thus, an alternative to
using t as the time-variable would be to use (current) age of the children, a time-variable that
will definitely affect the mortality rates. Some children were born into the study because
the mother was followed during pregnancy and those children will be followed from age 0.
Other children will only be included, i.e., be at risk of dying in the analysis, at a later
age, namely the age at which the child was first observed to be in state 0 (initial visit).
This is an example of delayed entry that will be further discussed in Section 2.3. Also for
Example 1.1.3 (testis cancer incidence and maternal parity), there will be delayed entry
when age is chosen as the primary time-variable because only boys born after 1968 are
followed from birth.

Table 1.6 Small set of survival data.

Subject   Time from entry   Status at time of exit       Age        Age
number    to exit           (0 = censored, 1 = dead)     at entry   at exit
 1         5                1                            12         17
 2         6                0                             0          6
 3         7                1                             0          7
 4         8                1                            10         18
 5         9                0                             6         15
 6        12                0                             6         18
 7        13                1                             9         22
 8        15                1                             3         18
 9        16                1                             8         24
10        20                0                             0         20
11        22                0                             0         22
12        23                1                             2         25
For illustration, consider the small set of survival data provided in Table 1.6 and Figure
1.8. The subjects were followed from a time zero of entry (recruitment). Let t be the time
since entry. Additionally, the age at time zero and at time of exit is given. Figure 1.8a
depicts the survival data using time t (time since entry) as time-variable and Figure 1.8b
the same survival data using age as time-variable, illustrating delayed entry. It is seen that
the same follow-up intervals are represented in the two figures. These intervals, however,
are re-allocated differently along the chosen time axis.
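The re-allocation of follow-up intervals can be made concrete by listing, for each time axis, which subjects are under observation at a given point. The sketch below copies the rows of Table 1.6 and uses the convention that a subject is at risk at a value v if entry < v ≤ exit; that convention, and the helper names, are assumptions made for illustration.

```python
# Rows of Table 1.6: (subject, time from entry to exit, status, age at entry, age at exit);
# status 1 = dead, 0 = censored.
table_1_6 = [
    (1, 5, 1, 12, 17), (2, 6, 0, 0, 6), (3, 7, 1, 0, 7), (4, 8, 1, 10, 18),
    (5, 9, 0, 6, 15), (6, 12, 0, 6, 18), (7, 13, 1, 9, 22), (8, 15, 1, 3, 18),
    (9, 16, 1, 8, 24), (10, 20, 0, 0, 20), (11, 22, 0, 0, 22), (12, 23, 1, 2, 25),
]


def interval(row, axis):
    """Follow-up interval (start, stop] of one subject on the chosen time axis."""
    _, time, _, age_entry, age_exit = row
    return (0, time) if axis == "time since entry" else (age_entry, age_exit)


def at_risk(value, axis):
    """Subjects under observation just before `value` on the chosen axis."""
    result = []
    for row in table_1_6:
        start, stop = interval(row, axis)
        if start < value <= stop:
            result.append(row[0])
    return result


# On both axes each subject contributes an interval of the same length;
# on the age axis it is shifted by the age at entry (delayed entry).
same_length = all(
    interval(r, "age")[1] - interval(r, "age")[0] == interval(r, "time since entry")[1]
    for r in table_1_6
)

print("Same interval lengths on both axes:", same_length)
print("At risk at t = 11 since entry:", at_risk(11, "time since entry"))
print("At risk at age 11:            ", at_risk(11, "age"))
```

With time since entry as the time-variable, the set at risk can only shrink as time passes. With age as the time-variable, subjects also join the risk set when they reach their age at entry.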

[Figure 1.8: two panels, (a) Time since entry as time-variable and (b) Age as time-variable, each showing the follow-up interval of every subject (vertical axis: Subject number, 1 to 12; horizontal axis: Time (t) or Age, 0 to 25).]

Figure 1.8 Small set of survival data: Twelve subjects of whom seven died during the study (dots)
and five were censored (circles).

1.2.2 Marginal parameters

The small set of survival data in Table 1.6 is, in principle, a data set with a quantitative
outcome variable, say T and one might, therefore, wonder whether quantities such as mean
and standard deviation (SD) were applicable as descriptive parameters. However, the presence
of censoring makes the calculation of a simple average as an estimator of the expected
value E(T ) futile. Neither the average of all twelve observation times nor that of the seven
uncensored survival times is applicable as an estimate of E(T ) (both will under-estimate the
mean, see Exercise 1.1). Similarly, the relative frequency of observation times greater than,
say t = 10, cannot be used as an estimator of the probability P(T > t) of surviving past
time t because of the censored observations ≤ t for which the corresponding true survival
times T may or may not exceed t.
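For concreteness, the naive summaries discussed above can be computed from the observation times in Table 1.6. The sketch below only shows what these quantities are; as argued in the text, none of them is a valid estimate of E(T) or P(T > t).

```python
# Time from entry to exit and status (1 = dead, 0 = censored) from Table 1.6.
times = [5, 6, 7, 8, 9, 12, 13, 15, 16, 20, 22, 23]
status = [1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1]

# Naive average of all twelve observation times (censored times treated as deaths).
mean_all = sum(times) / len(times)

# Naive average of the seven uncensored survival times only.
death_times = [t for t, d in zip(times, status) if d == 1]
mean_uncensored = sum(death_times) / len(death_times)

# Naive relative frequency of observation times greater than t0 = 10.
t0 = 10
fraction_beyond = sum(t > t0 for t in times) / len(times)

# Censored observations at or before t0: their true survival times may or may not exceed t0.
censored_before = [t for t, d in zip(times, status) if d == 0 and t <= t0]

print(f"Average of all observation times:  {mean_all:.2f}")
print(f"Average of uncensored times:       {mean_uncensored:.2f}")
print(f"Fraction of observation times > {t0}: {fraction_beyond:.3f}")
print(f"Censored observation times <= {t0}:   {censored_before}")
```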
These considerations illustrate that other parameters and methods of estimation are required
for multi-state survival data and, in the following, we will discuss such parameters. Having
decided on a time zero, we let
(V (t), t ≥ 0)
be the multi-state process denoting, at time t, the state occupied at that time among a num-
ber of discrete states h = 0, . . . , k. For the two-state model for survival data in Figure 1.1,
the multi-state process at time t can take the values V (t) = 0 or V (t) = 1. One set of param-
eters of interest is the state occupation (or ‘occupancy’) probabilities at any time, t. Denote
the probability (risk) of being in state h at time t as
\[ Q_h(t) = P(V(t) = h); \]
then the sum of these over all possible states will be equal to 1
\[ Q_0(t) + \cdots + Q_k(t) = \sum_{h=0}^{k} Q_h(t) = 1. \]

In the two-state model for survival data, Figure 1.1, with the random variable T being time
to death, the state 0 occupation probability Q0(t) is the survival function, i.e.,
\[ Q_0(t) = S(t) = P(T > t), \]
and Q1(t) is the failure distribution function, i.e.,
\[ Q_1(t) = F(t) = P(T \le t) = 1 - S(t). \]
[Figure 1.9: two panels, (a) Time since entry as time-variable and (b) Age as time-variable, showing the estimate of S(t) and of S(age), respectively (vertical axis from 0.0 to 1.0; horizontal axis: Time (t) or Age, 0 to 25).]

Figure 1.9 Small set of survival data: Estimated survival functions.

For the small set of survival data (Table 1.6), Figure 1.9 provides estimates of the survival
function, S(·), with either (a) time since entry or (b) age as time-variable. The Kaplan-Meier
estimator, which we will return to later in the book (Sections 4.1.1 and 5.1.1), was
used for the estimation of S(·). Note that the shapes of the survival functions are somewhat
different, illustrating the importance of choice of time zero. For any time point, the vertical
distance from the curve up to 1 represents F(t) = 1 − S(t).
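The estimator itself is treated later in the book, but a minimal sketch of the standard product-limit (Kaplan-Meier) formula, applied to the data in Table 1.6, may help to see why the two curves differ. The function below is an illustration written for this purpose, not the code used to produce Figure 1.9. It handles delayed entry by counting a subject as at risk at time t only if entry < t ≤ exit, and the age-scale estimate it returns is conditional on being alive at the smallest entry age.

```python
# Table 1.6: time from entry to exit, status (1 = dead, 0 = censored), age at entry.
times = [5, 6, 7, 8, 9, 12, 13, 15, 16, 20, 22, 23]
status = [1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1]
age_entry = [12, 0, 0, 10, 6, 6, 9, 3, 8, 0, 0, 2]
age_exit = [a + t for a, t in zip(age_entry, times)]


def kaplan_meier(exit_times, status, entry_times=None):
    """Product-limit estimate as a list of (death time, survival just after it).

    `entry_times` allows for delayed entry; by default everybody enters at 0.
    """
    if entry_times is None:
        entry_times = [0] * len(exit_times)
    death_times = sorted({x for x, d in zip(exit_times, status) if d == 1})
    surv, curve = 1.0, []
    for t in death_times:
        # Risk set: entered before t and still under observation at t.
        at_risk = sum(1 for e, x in zip(entry_times, exit_times) if e < t <= x)
        deaths = sum(1 for x, d in zip(exit_times, status) if x == t and d == 1)
        surv *= 1 - deaths / at_risk
        curve.append((t, surv))
    return curve


km_time = kaplan_meier(times, status)
km_age = kaplan_meier(age_exit, status, entry_times=age_entry)

print("Time since entry:", [(t, round(s, 3)) for t, s in km_time])
print("Age:             ", [(t, round(s, 3)) for t, s in km_age])
```

On the time-since-entry axis the risk set starts at twelve subjects and only decreases, whereas on the age axis it is built up gradually as subjects enter, which changes both the jump times and the jump sizes of the estimate.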
The probabilities Qh (t) are examples of marginal parameters, i.e., at time t, their value is
not conditional on the past history (V (s), s < t) of the multi-state process (though they may
involve covariates recorded at time zero). Other marginal parameters include the expected
time, εh (·), spent in state h, either during all times, i.e., all the way up to infinity, εh (∞), or
up to some threshold time τ < ∞, εh (τ). The latter parameters have the property that they
add up to τ, i.e.,
\[ \varepsilon_0(\tau) + \cdots + \varepsilon_k(\tau) = \sum_{h=0}^{k} \varepsilon_h(\tau) = \tau, \]

because the time from 0 to τ has to be divided among the possible states. For the two-state
model (Figure 1.1), ε0 (τ) is the τ-restricted mean life time, i.e., the expected time lived
before time τ, and ε1 (τ) = τ − ε0 (τ) is the expected time lost before time τ. Figure 1.10
illustrates the estimated restricted mean life time for τ = 12, the area under the survival
curve, for the small set of survival data.
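As a sketch of how such an area can be computed, the code below integrates a Kaplan-Meier step function for the data in Table 1.6 (time since entry as time-variable) from 0 to τ = 12. The functions are illustrative; the numerical result is what this sketch produces and is not a value quoted from the book.

```python
# Time from entry to exit and status (1 = dead, 0 = censored) from Table 1.6.
times = [5, 6, 7, 8, 9, 12, 13, 15, 16, 20, 22, 23]
status = [1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1]


def kaplan_meier(times, status):
    """Return [(death time, survival estimate just after it)] for right-censored data."""
    curve, surv = [], 1.0
    for t in sorted({x for x, d in zip(times, status) if d == 1}):
        at_risk = sum(x >= t for x in times)
        deaths = sum(x == t and d == 1 for x, d in zip(times, status))
        surv *= 1 - deaths / at_risk
        curve.append((t, surv))
    return curve


def restricted_mean(curve, tau):
    """Area under the survival step function between 0 and tau."""
    area, last_time, last_surv = 0.0, 0.0, 1.0
    for t, surv in curve:
        if t >= tau:
            break
        area += last_surv * (t - last_time)  # rectangle up to the next jump
        last_time, last_surv = t, surv
    return area + last_surv * (tau - last_time)  # final rectangle up to tau


tau = 12
time_lived = restricted_mean(kaplan_meier(times, status), tau)
time_lost = tau - time_lived

print(f"Estimated expected time lived before tau = {tau}: {time_lived:.2f}")
print(f"Estimated expected time lost before tau = {tau}:  {time_lost:.2f}")
```

By construction the two estimates add up to τ, in line with the identity for the restricted means given above.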
In cases where all subjects are in the same state ‘0’ at time zero (which is the case for all the
multi-state models depicted in Section 1.1), the distribution of the time (Th ) from time zero
until (first) entry into another state h is another marginal parameter. Examples include the
distribution of the survival time (time until entry into state 1 in Figure 1.1), time to event
no. h in a recurrent events situation (e.g., Figure 1.5), or time to relapse or to GvHD in the
model for the bone marrow transplantation data (Example 1.1.7, Figure 1.6). Note that, in
the last two examples, not all subjects will eventually enter into these states and the entry
times must be defined properly (formally, the value of these times may be infinite and Th is
denoted an improper variable).

Figure 1.10 Small set of survival data: Illustration of estimated restricted mean life time before time
τ = 12.
For the recurrent events multi-state models (Figures 1.4 and 1.5), another marginal param-
eter of interest is the expected number of events, say µ(t) = E(N(t)) before time t where
N(t) counts events before time t. In Figure 1.4, this is the expected number of times that
state 1 is visited before time t.
Marginal parameters

Marginal parameters express the following type of quantities: You place yourself
at time 0 and ask questions about aspects of the multi-state process V (·) at a later
time point t (without consideration of what may happen between 0 and t). We will
discuss the following marginal parameters:
• State occupation probability: Qh (t) = P(V (t) = h), the probability (risk) of
being in state h at time t.
• Restricted mean: εh (τ), the expected time spent in state h before time τ.
• Distribution of time Th to (first) entry into state h; relevant if everyone starts at
time 0 in the same state (0).
• Expected number of events µ(t) before time t; relevant for a recurrent events
process N(·), counting the number of events over time.
1.2.3 Conditional parameters
Another set of parameters in multi-state models for a time point t is conditional on the past
up to that time. The most important of such parameters are the transition intensities (or
rates or hazards – as emphasized in the Introduction, we will use these notions
interchangeably). For two different states, h, j these are (for a ‘small’ dt > 0) given by

\[ \alpha_{hj}(t) \approx \frac{P(V(t + dt) = j \mid V(t) = h \text{ and the past for } s < t)}{dt}, \tag{1.1} \]
where ‘|’ is read ‘given that’. Here, the past includes both information of the history on the
multi-state process (V (s), s < t) up to (but not including) time t and of possible covariates
Z, recorded at time zero. The interpretation is that the conditional probability that a subject
in state h at time t makes a transition to state j ≠ h ‘just after time t’ given the past is
(approximately when dt > 0 is small) equal to αh j (t)dt. Thus, if subjects are followed
over time then, at any time t, the transition intensities give the instantaneous conditional
probabilities per time unit, given the past of events at time t, see, e.g., Andersen et al.
(2021).
The transition intensities are short-term transition probabilities per time unit where, more
generally, the transition probability for any two states (i.e., not necessarily distinct) is

Ph j (s,t) = P(V (t) = j | V (s) = h and the past before time s),

i.e., the conditional probability of being in state j at time t given state h at the earlier time
s and given the past at time s. In the common situation where all subjects are in an initial
state (say, 0) at the time origin, t = 0, we have for any state h that

P0h (0,t) = Qh (t),

the state occupation probability at time t. However, more generally, transition probabilities
are more complicated parameters than the state occupation probabilities because they may
depend on the past at time s > 0. If Ph j (s,t) only depends on the past via the state (h)
occupied at time s then the multi-state process is said to be a Markov process. Note that
the parameters αh j (s) and Ph j (s,t) involve conditioning on the past, but never on the future
beyond time s. Indeed, conditioning on the future is a quite common mistake in multi-state
survival analysis (e.g., Andersen and Keiding, 2012) – a mistake that we will sometimes
refer to in later chapters.
The simplest example of a transition intensity is the hazard function, α01 (t) = α(t), in
the two-state model in Figure 1.11. For some states, the transition intensities out of that
state are all equal to 0, i.e., no transitions out of that state are possible. An example is the
state ‘Dead’ in Figure 1.11 and such a state is said to be absorbing, whereas a state that is
not absorbing is said to be transient, an example being the state ‘Alive’ in that figure. For
the competing risks model (Figure 1.2), there is a transition intensity from the transient state
0 to each of the absorbing states, the cause-specific hazards for cause h = 1, . . . , k having
the interpretations α0h (t)dt ≈ the conditional probability of failure from cause h at time t
given no failure before time t. In Figure 1.12, we have added the cause-specific hazards to
the earlier Figure 1.2 in the same way as we did when going from Figure 1.1 to Figure 1.11.

[Diagram: box 0 (Alive) with an arrow, labelled α01(t), to box 1 (Dead).]

Figure 1.11 The two-state model for survival data with hazard function for the 0 → 1 transition.

[Diagram: box 0 (Alive) with arrows, labelled α01(t), ..., α0k(t), to boxes 1 (Dead, cause 1), ..., k (Dead, cause k).]

Figure 1.12 The competing risks model with cause-specific hazard functions.

In a similar way, transition intensities may be added to the other box-and-arrow diagrams
in Section 1.1.
Intuitively, if one knows all transition intensities at all times, then both the marginal parameters
and the transition probabilities may be calculated. This is because, by knowing
the intensities, numerous paths for V (t) may be generated by moving forward in time in
small steps (of size dt), whereby Qh (t), εh (τ), and Ph j (s,t) may be computed as simple
averages over these numerous paths. This is, indeed, true and it is the idea behind micro-
simulation that we will return to in Section 5.4. In some multi-state models, including the
two-state model and the competing risks model (Figures 1.1, resp. 1.11 and 1.2, resp. 1.12),
the marginal parameters may also be computed explicitly by certain mathematical expres-
sions, e.g., the probability of staying in the initial state 0 in the two-state model (the survival
function) is given by the formula
$$S(t) = Q_0(t) = \exp\left(-\int_0^t \alpha(u)\,du\right) \tag{1.2}$$

that expresses how to get the survival function from the hazard function. Likewise, for the
competing risks model, the probability of being in the final, absorbing state h = 1, . . . , k at
time t is given by
$$Q_h(t) = \int_0^t S(u)\,\alpha_{0h}(u)\,du, \tag{1.3}$$
where S(u) is given by Equation (1.2) with α = α01 + · · · + α0k. This probability is frequently referred to as
the (cause h-) cumulative incidence function, Fh (t), a name that originates from epidemi-
ology where that name means ‘the cumulative risk of an event over time’, see, e.g., Szklo
and Nieto (2014, ch. 2). In Chapter 4, we will give intuitive arguments why Equations (1.2)
and (1.3) look the way they do.
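
As a rough numerical illustration of Equations (1.2) and (1.3), and of the idea of generating paths in small steps of size dt, the following minimal sketch considers a competing risks model with k = 2 causes. The constant hazard values, the step size, the number of paths and all function names are assumptions made for illustration only; they are not taken from any of the examples in the book.

```python
import math
import random

# Hypothetical constant cause-specific hazards (per time unit) for k = 2 causes.
ALPHA = {1: 0.10, 2: 0.05}


def survival(t, alpha=ALPHA):
    """Equation (1.2) with alpha = alpha_01 + ... + alpha_0k held constant."""
    return math.exp(-sum(alpha.values()) * t)


def cumulative_incidence(h, t, alpha=ALPHA, steps=2_000):
    """Equation (1.3): integrate S(u) * alpha_0h over [0, t] by the midpoint rule."""
    du = t / steps
    return sum(survival((i + 0.5) * du, alpha) * alpha[h] * du for i in range(steps))


def simulate_state(t, alpha=ALPHA, dt=0.05, rng=random):
    """Move forward in steps of size dt and return the state occupied at time t."""
    for _ in range(round(t / dt)):
        r = rng.random()
        cum = 0.0
        for h, a in alpha.items():
            cum += a * dt      # P(0 -> h transition during the step) is about alpha_0h * dt
            if r < cum:
                return h       # absorbing state reached, the path stops here
    return 0                   # still in the initial state at time t


t = 5.0
q0 = survival(t)
q1 = cumulative_incidence(1, t)
q2 = cumulative_incidence(2, t)

# Micro-simulation: Q_1(t) as a simple average over many simulated paths.
rng = random.Random(1)
n_paths = 10_000
states = [simulate_state(t, rng=rng) for _ in range(n_paths)]
sim_q1 = states.count(1) / n_paths

print(q0, q1, q2, q0 + q1 + q2)
print(sim_q1)
```

The three state occupation probabilities add up to 1 (up to numerical error), and the simulated proportion of paths ending in state 1 should be close to the value obtained from Equation (1.3), with a discrepancy that shrinks as dt decreases and the number of paths increases.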
Models for multi-state survival data, e.g., regression models where adjustment for covari-
ates is performed, may conveniently be specified via the transition intensities, the Cox
(1972) regression model for survival data being one prominent such example. Intensity-
based models are studied in Chapters 2 and 3. Having modeled all intensities, marginal parameters
in simple multi-state models may be obtained by plugging the intensities into
expressions like Equation (1.2) or (1.3). However, the marginal parameters may depend on
the intensities in a non-simple fashion, and it is therefore of interest to aim at setting up
direct regression models for the way in which, e.g., εh (t), depends on covariates. Marginal
models (both models based on plug-in and direct models) are the topic of Chapters 4 and 5
(see also Chapter 6 where such direct models are based on pseudo-values).

Conditional parameters

Conditional parameters for a multi-state process V (·) quantify, at time t, the future
development of the process conditionally on the past of the process before t. We will
discuss two types of conditional parameters:
1. Transition intensities: αh j (t) gives the probability per time unit of moving to
state j right after time t given that you are in state h at t and given the past up to t

αh j (t) ≈ P(V (t + dt) = j | V (t) = h and the past for s < t)/dt,

(Equation 1.1). Transition intensities are only defined if j is different from h.

2. Transition probabilities: Ph j (s,t) gives the probability of being in state j at time
t, given that you were in state h at an earlier time point s and given the past history
of V (·) up to that earlier time point. Transition probabilities are also defined if h
and j are the same state.

1.2.4 Data representation

The multi-state survival data arising when observing a subject over time consist of a series
of times of transition between states and the corresponding types of transition, i.e., from
which state and to which state did the subject move at that time. For some subjects, this
observed event history will end in an absorbing state from where no further transitions
are possible, e.g., the subject is observed to die. However, there may be right-censoring in
which case information on a time last seen alive, and the state occupied by the subject at
that time will typically be available. Such data will have the format

((0,V (0)), (T1 ,V (T1 )), (T2 ,V (T2 )), . . . , (X,V (X))),

with one record per subject, where T1 , T2 , . . . , TN−1 are the observed times of transition and X
is either the time TN of the last transition into an absorbing state or the time C of censoring.
Such data are said to be in wide format (or marked point process format). Such a format may
typically be directly obtained from raw data consisting of dates where events happened
(together with date of entry into the study and/or date of birth).
We will, in later chapters, typically assume that data for independent and identically dis-
tributed (i.i.d.) subjects i = 1, . . . , n are observed and, in situations where data may be de-
pendent, we will explicitly emphasize this.
Wide format is, however, less suitable as a basis for the analysis of the data, for which
purpose data are transformed into long format (or counting process format) where each
subject may be represented by several records. Here, each record typically corresponds to
a given type of transition, say from state h to state j and includes time of entry into state h,
time last seen in state h and information on whether the subject, at this latter time, made a
transition to state j, the ‘(Start, Stop, Status)’ triple.
As the name suggests, data in long format are closely linked to the mathematical repre-
sentation of observations from multi-state processes via counting processes. Thus, for each
possible transition, say from state h to another state j, data from a given subject, i may be
represented as the counting process

Nh ji (t) = No. of direct h → j transitions observed for subject i in the interval [0,t],

together with the indicator of being at risk for that transition at time t− (i.e., ‘just before
time t’)
Yhi (t) = I(subject i is in state h at time t−).
Here, the indicator function I(· · · ) is 1 if · · · is true and 0 otherwise. Note that, for several
multi-state models depicted in Section 1.1, the number of observed events of a given type
is at most 1 for any given subject; an exception is the model for recurrent events in Figure
1.4 where each subject may experience several 0 → 1 and 1 → 0 transitions. Counting
processes, N(t) = Nh j (t) with more than one jump may also be constructed by adding up
processes for individual subjects,

$$N(t) = \sum_i N_{hji}(t).$$

Likewise, the total number at risk at time t, Y (t) = Yh (t), is obtained by adding up individual
at-risk processes,
$$Y(t) = \sum_i Y_{hi}(t).$$
[Figure 1.13, four panels: (a) Counting process, N(t), and (b) Number at risk, Y(t), both plotted against Time (t); (c) Counting process, N(age), and (d) Number at risk, Y(age), both plotted against Age.]

Figure 1.13 Small set of survival data: Counting process and number at risk using t as time-variable
(a-b) and age as time-variable (c-d).

Figure 1.13 shows N(t) = N01 (t) and Y (t) = Y0 (t) for the small set of survival data from Table 1.6.
When using t (time since entry) as time-variable, the at-risk function Y (·) is monotonically
decreasing, whereas when using age as time-variable, the at-risk function also increases
due to the delayed entries.
For a small value of dt > 0 we define the jump at time T for the counting process N(·) to be
dN(T ) = N(T + dt) − N(T −), i.e., dN(t) = 1 if t = T is an observed time of transition and
dN(t) = 0 otherwise. The representation using counting processes has turned out to be very
useful when formulating estimators in multi-state models and their statistical properties.
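
To make the aggregated processes N(t) and Y(t) concrete, the following minimal sketch evaluates them from a handful of records. The four records are hypothetical (they are not the data from Table 1.6), and the function names are chosen for illustration only. A record with a positive start time represents delayed entry, which is what allows the number at risk to increase.

```python
# Hypothetical (start, stop, status) records, one per subject, e.g., on the age scale.
# start > 0 means delayed entry; status 1 = event observed at stop, 0 = censored at stop.
RECORDS = [
    (0.0, 5.0, 1),
    (0.0, 8.0, 0),
    (3.0, 9.0, 1),
    (6.0, 12.0, 1),
]


def counting_process(records, t):
    """N(t): number of events observed in the interval [0, t]."""
    return sum(1 for _, stop, status in records if status == 1 and stop <= t)


def at_risk(records, t):
    """Y(t): number of subjects under observation just before t (start < t <= stop)."""
    return sum(1 for start, stop, _ in records if start < t <= stop)


for t in (1.0, 4.0, 7.0, 10.0):
    print(t, counting_process(RECORDS, t), at_risk(RECORDS, t))
```

In this small example the number at risk goes up between t = 1 and t = 4 because a subject enters at time 3, while N(t) can only stay constant or increase.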
We will next illustrate how typical raw data in wide format (with one record per subject)
may be transformed into long format where each subject may be represented by several
records. We will use the PROVA trial (Section 1.1.4) to illustrate the ideas and assume that
the dates doe, dob, dod, and dls are defined as in Table 1.7.

Table 1.7 PROVA trial in liver cirrhosis: Date variables available (NA: Not available).

     Date          Description
1    doe           Date of entry into the study, i.e., date of randomization.
2    dob (> doe)   Date of transfusion-requiring bleeding; if no bleeding was observed,
                   then dob = NA. Time of bleeding is T1 = dob − doe.
3    dod (> doe)   Date of death; if no death was observed, then dod = NA.
                   If dod ≠ NA, then time of death is T2 = dod − doe and this is also
                   the right-hand end-point, say X, for the interval of observation.
4    dls (> doe)   Date last seen; if dod = NA, then dls = date of censoring, in which
                   case the censoring time is C = dls − doe and this equals the
                   right-hand end-point, X, for the interval of observation.
                   If dod ≠ NA, then dls is equal to dod.

Note that the inequalities given in the table should all be checked with the data to make sure
that the recorded dates are consistent with reality. Also, if both dob and dod are observed,
then the inequality dob ≤ dod should hold. A data set with one line per subject containing
these date variables is in wide format. From the basic date-variables, times and types of
observed events may be defined as shown in Table 1.8, thereby taking the first step towards
transforming the data into long format.

Table 1.8 PROVA trial in liver cirrhosis: Observed transitions for different patterns of observed
dates (NA: Not available).

     Observed        Transition                    Last seen
     dob    dod      0→1       0→2       1→2       State   Time
1    NA     NA       No        No        No        0       C
2    Yes    NA       At T1     No        No        1       C
3    NA     Yes      No        At T2     No        2       T2
4    Yes    Yes      At T1     No        At T2     2       T2

Here, times refer to time since entry, i.e., the time origin t = 0 is the time of randomization.
The resulting counting processes and at-risk processes for the data in long format are shown
in Table 1.9. Examples of what realizations of the multi-state process V (t) would look like
for specific values of T1 , T2 ,C are shown in Figure 1.14.
Based on these observations, one may now construct three data sets – one for each of the
possible transitions. Each record in the data set for the h → j transition has the structure
Table 1.9 PROVA trial in liver cirrhosis: Counting processes and at-risk processes for different
patterns of observed dates (NA: Not available).

     dob    dod    N01(t)       N02(t)       N12(t)       Y0(t)        Y1(t)
1    NA     NA     0            0            0            I(C ≥ t)     0
2    Yes    NA     I(T1 ≤ t)    0            0            I(T1 ≥ t)    I(T1 < t ≤ C)
3    NA     Yes    0            I(T2 ≤ t)    0            I(T2 ≥ t)    0
4    Yes    Yes    I(T1 ≤ t)    0            I(T2 ≤ t)    I(T1 ≥ t)    I(T1 < t ≤ T2)

[Figure 1.14, four panels showing V(t) against t: (1) Censoring at C (< T1, T2); (2) Bleeding at T1 and censoring at C (< T2); (3) Death without bleeding at T2 (< C); (4) Bleeding at T1 and death at T2 (< C).]
Figure 1.14 PROVA trial in liver cirrhosis: The process V (t) for different patterns of observed dates
corresponding to the rows in Tables 1.8–1.10.

(Start, Stop, Status) with

Start = Time of entry into h,

Stop = Time last seen in h,
Status = Transition to j or not at time Stop

This is the data set in long format. In the PROVA example, each subject contributes at most
one record to each data set, as shown in Table 1.10, where the Status variable is 1 if an
h → j transition was observed at time Stop (i.e., Status = dNh j (Stop)).
Note that an observed 0 → 1 transition also gives rise to a record in the data set for 0 →
2 transitions ending in no transition (and vice versa). Also note that, in the data set for
the 1 → 2 transition, there is delayed entry meaning that subjects are not at risk of that
transition from time zero but only from a later time point (Start, T1 ) where the subject
was first observed to be in state 1. Presence of delayed entry is closely connected to the
choice of time-variable. Thus, had one chosen to consider the 1 → 2 transition intensity
in the PROVA trial to depend primarily on time since bleeding (and not on time since
Table 1.10 PROVA trial in liver cirrhosis: Records (Start, Stop, Status) in the three data sets
for different patterns of observed dates (NA: Not available).

     Observed        Data set
     dob    dod      0→1           0→2           1→2
1    NA     NA       (0, C, 0)     (0, C, 0)     NA
2    Yes    NA       (0, T1, 1)    (0, T1, 0)    (T1, C, 0)
3    NA     Yes      (0, T2, 0)    (0, T2, 1)    NA
4    Yes    Yes      (0, T1, 1)    (0, T1, 0)    (T1, T2, 1)

randomization), then there would have been no delayed entry. The example also illustrates
that even though a basic time origin is chosen for a given multi-state model (here time of
randomization), there may be (later) transitions in the model for which another time origin
may be more appropriate.
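
The transformation from wide to long format summarized in Tables 1.7, 1.8 and 1.10 can be sketched in a few lines of code. The sketch below is only an illustration: the function name, the use of None for NA, the choice of days as time unit and the example dates are all assumptions, and the checks of the date inequalities mentioned above are left out.

```python
from datetime import date


def prova_long(doe, dob, dod, dls):
    """Return (Start, Stop, Status) records per transition, following Table 1.10.

    Dates are datetime.date objects; dob and dod are None when not available (NA).
    Times are measured in days since entry (doe).
    """
    t1 = (dob - doe).days if dob is not None else None   # time of bleeding, T1
    x = (dls - doe).days                                  # end of observation, X (T2 or C)
    died = dod is not None

    # Time last seen in state 0: T1 if bleeding was observed, otherwise X.
    stop0 = t1 if t1 is not None else x
    records = {
        "0->1": (0, stop0, 1 if t1 is not None else 0),
        "0->2": (0, stop0, 1 if (died and t1 is None) else 0),
    }
    if t1 is not None:
        # Delayed entry: at risk of the 1 -> 2 transition only from T1 onwards.
        records["1->2"] = (t1, x, 1 if died else 0)
    return records


# Hypothetical subject with bleeding after 30 days and death after 100 days (row 4).
row4 = prova_long(date(2020, 1, 1), date(2020, 1, 31), date(2020, 4, 10), date(2020, 4, 10))
# Hypothetical subject censored after 100 days without bleeding (row 1).
row1 = prova_long(date(2020, 1, 1), None, None, date(2020, 4, 10))
print(row4)
print(row1)
```

Each of the three keys corresponds to one of the three data sets, so stacking the records for a given key over all subjects gives the data set for that transition.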

1.2.5 Target parameter

The scientific questions addressed through the examples in Section 1.1 most often involve a
number of covariates and, as a result, both intensity models and direct models for marginal
parameters will often be (multiple) regression models. In the regression models that we will
be studying in later chapters, the covariates will always enter via a linear predictor com-
bining (linearly) the effects of the individual covariates on some suitable function (the link
function) of the multi-state model parameter via regression coefficients, see, e.g., Andersen
and Skovgaard (2010, ch. 5).
As an example, we can consider a direct model for a state occupation probability (risk)
Qh (t) at time t. Suppose that there are p covariates Z1 , Z2 , . . . , Z p under consideration –
then the model may be

log(Qh (t)) = β0 + β1 Z1 + β2 Z2 + · · · + β p Z p ,

i.e., the covariate effects on Qh (t) are linear on the scale of the logarithm (the link function)
of the risk. Note that we will often refer to a coefficient, β j , as the ‘effect’ of the corre-
sponding covariate Z j – also in situations where a causal interpretation is not aimed at. The
expression on the right-hand side of this equation is the linear predictor

LP = β1 Z1 + β2 Z2 + · · · + β p Z p (1.4)

and involves regression coefficients β1 , β2 , . . . , β p (but not the intercept, β0 ). The interpretation
of a single β j is as follows. Consider two subjects differing by 1 unit in covariate j
and having identical values for the remaining covariates in the model. Then the difference
between the log(risks) for those two subjects is
$$\beta_j = (\beta_0 + \beta_1 Z_1 + \cdots + \beta_{j-1} Z_{j-1} + \beta_j (Z_j + 1) + \cdots + \beta_p Z_p) - (\beta_0 + \beta_1 Z_1 + \cdots + \beta_{j-1} Z_{j-1} + \beta_j Z_j + \cdots + \beta_p Z_p).$$
Thus, exp(β j ) is the risk ratio for a 1 unit difference in Z j for given values of the re-
maining covariates in the model. It is seen that not only does the interpretation of β j de-
pend on the chosen link function (here, the logarithm) but also on which other covariates
(Z1 , . . . , Z j−1 , Z j+1 , . . . , Z p ) are included in the model. Therefore, a regression coefficient,
e.g., for a treatment variable, unadjusted for other covariates is likely to differ from
one that is adjusted for sex and age. (One exception is when the model is linear, i.e., the
link function is the identity, and treatment is randomized and, thereby, independent of sex
and age and other covariates in which case the parameter is collapsible, see, e.g., Daniel
et al., 2021). Nevertheless, for a number of reasons, regression models and their estimated
coefficients are useful in connection with the analysis of multi-state survival data.
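
The interpretation of exp(β j) as a risk ratio in the model with logarithmic link can be checked with a small calculation. The coefficient and covariate values below are hypothetical and serve only to illustrate the arithmetic.

```python
import math

# Hypothetical coefficients in log(Q_h(t)) = beta0 + beta1 * Z1 + beta2 * Z2.
BETA0 = -2.0
BETA = [0.3, -0.1]


def risk(z, beta0=BETA0, beta=BETA):
    """Risk Q_h(t) implied by the log-link model for the covariate vector z."""
    return math.exp(beta0 + sum(b * zj for b, zj in zip(beta, z)))


z_a = [1.0, 4.0]
z_b = [2.0, 4.0]   # differs from z_a by 1 unit in Z1 only
risk_ratio = risk(z_b) / risk(z_a)
print(risk_ratio, math.exp(BETA[0]))
```

The two printed numbers agree: the intercept and the contribution from Z2 cancel in the ratio, leaving exp(β1).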
First of all, regression models describe the association between covariates and intensities
or marginal parameters in multi-state models and insight may be gained from these asso-
ciations when trying to understand the development of the process. In this connection, it
is also of interest to compare estimates of β j across models with different levels of adjust-
ments, e.g., do we see similar associations with Z j with or without adjustment for other
covariates? Another major use of multi-state regression models is prediction, e.g., what is
the estimated risk of certain events for a subject with given characteristics? These aspects
will be further illustrated in the later chapters.
However, the proper answer to a scientific question posed need not be given by quoting a
coefficient from a suitable regression model in which case other target parameters should
be considered. We will see that regression models are still useful ‘building blocks’ when
targeting alternative parameters. As an example of how a target parameter properly address-
ing the scientific question posed may be chosen, we can consider the PBC3 trial (Example
1.1.1). Here, the question of interest is whether treatment with CyA prolongs time to treat-
ment failure and, since the study was randomized, this may be answered by estimating and
comparing survival curves (S(t) – see Section 1.2.2) for the CyA and placebo groups. How-
ever, as we shall see in Section 2.2.1, randomization was not perfect and levels of important
prognostic variables (albumin and bilirubin) tended to be more beneficial in the placebo
group than in the CyA group. For this reason (but also influenced by non-collapsibility),
the estimated regression coefficients for treatment with or without adjustment for these two
variables will differ. Also, estimated survival curves for treated and control patients will
vary with their levels of albumin and bilirubin and it would be of interest to estimate one
survival curve for each treatment group that properly accounts for the covariate imbalance
between the groups. Such a parameter, the contrast (e.g., difference or ratio) between the
survival functions in the two groups had they had the same covariate distribution, may be
obtained using the g-formula (e.g., Hernán and Robins, 2020, ch. 13), which works by averaging
individually predicted curves over the observed distribution of albumin and bilirubin
(Z2 , Z3 ). Thus, two predictions are performed for each subject, i: One setting treatment (Z1 )
to CyA and one setting treatment to placebo and in both predictions keeping the observed
values (Z2i , Z3i ) for albumin and bilirubin. The predictions for each value of treatment are
then averaged over i = 1, . . . , n
$$\widehat{S}_j(t) = \frac{1}{n} \sum_{i=1}^{n} \widehat{S}(t \mid Z_1 = j, Z_2 = Z_{2i}, Z_3 = Z_{3i}), \quad j = \text{CyA}, \text{placebo}. \tag{1.5}$$
We will illustrate the use of the g-formula in later chapters. Here, a challenge will be to do
inference for the treatment contrast (i.e., to assess the uncertainty in the form of a confi-
dence interval) and, typically, a bootstrap procedure will be applied (e.g., Efron and Tib-
shirani, 1993, ch. 6).
In the final chapter of the book (Section 7.3), we will also discuss under what circumstances
the resulting treatment contrast may be given a causal interpretation. There, we will also de-
fine what is meant by causal, discuss alternative approaches to causal inference, and under
which assumptions (including that of no unmeasured confounders) a causal interpretation
is possible.
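
The averaging in Equation (1.5) may be sketched as follows. Everything specific in this sketch is an assumption made for illustration: the prediction model (a constant hazard depending log-linearly on the covariates), its coefficients and the four covariate pairs are hypothetical and are not estimates from the PBC3 trial. In practice, the predictions would come from a fitted regression model such as those studied in later chapters.

```python
import math

# Hypothetical prediction model (an assumption for illustration only):
# S(t | Z) = exp(-LAMBDA0 * exp(B1*Z1 + B2*Z2 + B3*Z3) * t), Z1 = 1 for CyA, 0 for placebo.
LAMBDA0 = 2.0
B1, B2, B3 = -0.4, -0.08, 0.01

# Hypothetical observed covariates (Z2i, Z3i) = (albumin, bilirubin) for n = 4 subjects.
COVARIATES = [(38.0, 20.0), (30.0, 60.0), (42.0, 12.0), (35.0, 45.0)]


def predicted_survival(t, z1, z2, z3):
    """Model-based prediction of S(t | Z1 = z1, Z2 = z2, Z3 = z3)."""
    return math.exp(-LAMBDA0 * math.exp(B1 * z1 + B2 * z2 + B3 * z3) * t)


def g_formula(t, treatment, covariates=COVARIATES):
    """Equation (1.5): average the predictions over the observed (Z2i, Z3i)."""
    preds = [predicted_survival(t, treatment, z2, z3) for z2, z3 in covariates]
    return sum(preds) / len(preds)


t = 2.0
s_cya = g_formula(t, treatment=1)
s_placebo = g_formula(t, treatment=0)
print(s_cya, s_placebo, s_cya - s_placebo)
```

Both averages use the same set of covariate values, so the difference between them is not affected by covariate imbalance between the treatment groups. A confidence interval for the contrast could be obtained by repeating the whole calculation, including the model fit, on bootstrap samples.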

1.3 Independent censoring and competing risks

In Section 1.1, the practical examples demonstrated that incomplete observation is in-
evitable when dealing with event history data, and in Section 1.2 we discussed the structure
of the resulting data in the sample and the parameters in multi-state models that are typi-
cally targeted. In this section, we will be more specific about the target population for which
inference is intended based on the observed, incomplete data and we will also discuss under
which assumptions concerning the censoring mechanism this inference is valid.
We will first review some fundamental concepts from basic statistics. Statistics provides
methods for doing inference on parameters attached to a population based on data from a
sample from the population. Data are represented by random variables whose distribution
is characterized by these parameters and, provided the sample is well-defined (e.g., random
or representative), valid inference may be achieved. As a standard example, think of the
distribution of blood pressure in women in a given age range with a given diagnosis, and
where the parameters of interest are the mean and the standard deviation. If blood pres-
sure measurements for a sample of such female patients are available, then inference for
the parameters is often straightforward, and it is typically not hard to conceptualize the
population from which the sample was drawn and that for which inference is intended.
Next, we turn to an example from event history analysis, namely a study of patients diag-
nosed with the chronic disease malignant melanoma (skin cancer), all given the standard
treatment (radical surgery) and where interest focuses on the survival time from surgery,
see Figure 1.1. By observing data from a sample of patients with malignant melanoma,
one would aim at making inference on the survival time distribution in the population of
such patients. However, because of the very nature of survival data, one has to face the
additional complication (over and above randomness or representativeness of the sample)
of incomplete observation of the random variables in question. At the time when data are to
be analyzed, some patients may still be alive and all that is known about the survival time
for such a patient is that it exceeds the time elapsed from time of surgery to time of analysis
– the survival time is right-censored.
We saw in Section 1.2 that this incompleteness has consequences for the types of param-
eters on which one focuses (e.g., survival functions or hazards instead of means and stan-
dard deviations) and consequently for the types of statistical models that are typically used.
However, it also has consequences for the conceptualization of the population from which
the sample was drawn. For the malignant melanoma example, the population would be (all)
patients with the disease who have undergone radical surgery and who are observed until
failure. That is, in the population there is no incompleteness and the question naturally
arises under what conditions the incomplete data in the sample allow valid inference for
the population. The condition needed is known as independent right-censoring.
We discuss a definition of this concept further below, but let us first consider an extension
of the example. Deaths among melanoma patients may either be categorized as death from
the disease or as death from other causes, see Figure 1.2. Suppose that the scientific inter-
est focuses on death from the disease. It is then a question of whether the competing risk
of death from other causes can be considered as, possibly independent, right-censoring.
For that to be the case, following the arguments just given, data should be considered as
a sample from a population without censoring, i.e., a population where all melanoma pa-
tients are observed until death from the disease. We will therefore argue that the answer to
the question is ‘no’ because the complete population without censoring (death from other
causes) would be completely hypothetical: One can hardly imagine ever obtaining data from
a population where melanoma patients cannot die from other causes.
The complete target population for which inference is intended is therefore one where
avoidable causes of incompleteness (right-censoring) are absent. However, non-avoidable
causes of incompleteness (competing risks) may, indeed, be present in the target popula-
tion, and the corresponding events should therefore be included as possible transitions in the
box-and-arrow diagrams such as those shown in Section 1.1. Thus, if in the PBC3 trial (Example
1.1.1) one would study risk factors for transplantation, then the non-avoidable event
of death without transplantation cannot be considered a, possibly independent, censoring
mechanism because the population without this event would be completely hypothetical.
A similar remark concerns an analysis of the event stroke in the Copenhagen Holter Study
(Example 1.1.8) where considering death as censoring would be inappropriate. On the other
hand, events like drop-out or end-of-study (e.g., in the PBC3 trial, Example 1.1.1, or the
PROVA trial, Example 1.1.4) are examples of avoidable events (censoring).
An important question is whether such censoring events can be considered independent
censoring. We will define independent censoring as follows (see, e.g., Andersen et al.,
1993, ch. III). Let, as in Section 1.2, (V (t),t ≥ 0) be the multi-state process. The transition
intensities for this process are (approximately when dt > 0 is small)

$$\alpha_{hj}(t) \approx \frac{P(V(t+dt) = j \mid V(t) = h \text{ and the past for } s < t)}{dt}$$
for states h ≠ j. Censoring at C is then independent if
$$\frac{P(V(t+dt) = j \mid V(t) = h, \text{ past for } s < t \text{ and } C > t)}{dt} \approx \alpha_{hj}(t), \tag{1.6}$$
i.e., if the additional information at time t that a subject is still uncensored does not alter
the transition intensities. As a consequence of independent censoring, the subset of subjects
still at risk (i.e., uncensored) at any time t represents the population at this time. Examples
of independent censoring mechanisms include random censoring where C is independent
of V (t) and type 2 censoring where remaining subjects are censored when a pre-specified
number of events has been observed in the whole sample (Andersen et al., 1993, ch. III).
Note that the definition of independent censoring involves the past and, thereby, the covari-
ates that are included in the model for the transition intensities. This means that, roughly
speaking, events and censoring should be conditionally independent given covariates.
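
A small simulation may illustrate what random censoring implies in the two-state model. The sketch below assumes a constant hazard and a censoring time drawn independently of the event time; the two rates, the sample size and the seed are hypothetical. With a constant hazard, the number of observed events divided by the total time at risk (the occurrence/exposure rate) estimates that hazard.

```python
import random

rng = random.Random(2024)
ALPHA = 0.10        # hypothetical constant hazard for the 0 -> 1 transition
CENS_RATE = 0.05    # hypothetical rate of the independent censoring time C
n = 20_000

events = 0
time_at_risk = 0.0
for _ in range(n):
    t = rng.expovariate(ALPHA)       # time of the 0 -> 1 transition
    c = rng.expovariate(CENS_RATE)   # censoring time, drawn independently of t
    x = min(t, c)                    # observed follow-up time
    events += 1 if t <= c else 0     # the event is observed only if it precedes censoring
    time_at_risk += x

rate_estimate = events / time_at_risk   # occurrence/exposure rate
print(rate_estimate)
```

Although about one third of the subjects are censored in this setup, the estimate is close to the hazard used to generate the data, because those still uncensored at any time t have the same hazard as the whole population. If, instead, C were generated to depend on the subject's future event time, this would no longer be expected to hold.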
The independent censoring condition can, unfortunately, typically not be tested based on
the available censored data (except for the fact that it may be investigated if censoring de-
pends on covariates - see Section 4.4.1). This is because the future events after censoring
(that should be independent of the censoring time) are not observed. Investigation of inde-
pendent censoring is therefore a matter of discussion in each single case. In the PBC3 and
PROVA trials, two censoring mechanisms were operating: End-of-study and drop-out. The
former (administrative censoring) can typically safely be taken to be independent (the fact
that the study has reached its planned termination should have no consequence for future
events for those still at risk). However, if there is some calendar time trend in patient re-
cruitment (e.g., patients may tend to be less and less severely ill at time of recruitment as
time passes), then the administrative censoring may only be independent conditionally on
the time of recruitment, and calendar time should then be adjusted for in the analysis.
Similarly, if censoring depends on other important prognostic variables, then these should be
adjusted for; see Section 4.4.2 for further discussion. For censoring due to drop-out events,
one must typically be more careful. In a trial, patients may drop out because of toxicity or because of lack of efficacy and,
in both cases, knowing that a subject is still uncensored may carry information on the future
event risk for that subject. If, in the PBC3 trial, one focuses on mortality, then censoring
by liver transplantation cannot be considered independent because a liver transplantation
was typically only offered to patients with a relatively poor prognosis, and this means that
the information that a patient is still untransplanted tells that this patient is ‘doing relatively
well’.
Practical advice would be, first of all, to do one's best to avoid drop-out and, if drop-outs
do happen, then to record the reasons for drop-out in the patient file such that the problem
of independent censoring can be discussed and such that covariates related to drop-out can
be accounted for in the model for the event occurrence.

Multi-state model, competing risks, and censoring

A multi-state model is given by a number of different states that a subject can occupy
and the possible transitions between the states. The transitions represent the events
that may happen. Such a model can be depicted in a box-and-arrow diagram where
the transition intensities may be indicated (e.g., Figures 1.11 and 1.12).
These diagrams show the possible states in a completely observed population, i.e.,
censoring is not a state in the model. If one particular transition is of interest, then
other transitions in the multi-state model, possibly competing with that, are non-
avoidable events that must be properly addressed in the analysis and should not be
treated as (potentially avoidable) censoring.

1.4 Mathematical definition of parameters (*)

In the examples in Section 1.1, we have seen a number of multi-state models consisting
of a finite number of states and possible transitions between some of these states. Section
1.2 gave an informal introduction to parameters associated with multi-state models and the
present section discusses the mathematical definition of these parameters.
1.4.1 Marginal parameters (*)
We denote the multi-state process by (V (t), t ≥ 0), i.e., at time t, V (t) is the state occupied
by a generic subject. Usually, a small number, k + 1, of states is considered, often labelled
0, 1, ..., k. The state space is then the finite set

S = {0, 1, ..., k}. (1.7)

Corresponding to these states there are state occupation (or ‘occupancy’) probabilities

Qh (t) = P(V (t) = h), h∈S (1.8)

giving the marginal distribution over the states at time t, so we have for all t that ∑h Qh (t) =
1.
In the two-state model for survival data in Figure 1.1, Q0 (t) is the probability of still being
alive at time t, the survival function, often denoted S(t), and Q1 (t) = 1 − Q0 (t) is the failure
distribution function, F(t) = 1 − S(t). In Figure 1.2, Q0 (t) is also the survival function,
S(t), and Qh (t), h = 1, . . . , k are the cumulative incidence functions for cause h, i.e., the
probability Fh (t) of failure from cause h before time t. The probability Fh (t) is sometimes
referred to as a sub-distribution function as Fh (∞) < 1.
Another marginal parameter of interest, which may be obtained from the state occupation
probabilities, is the expected time spent in a given state (expected length of stay). For state
h, this is given by
$$\varepsilon_h(\infty) = E\left(\int_0^\infty I(V(t) = h)\,dt\right) = \int_0^\infty Q_h(t)\,dt. \tag{1.9}$$

Since we have to deal with right-censoring, whereby information about the process V (t)
for large values of time t is limited, restricted means are often studied, i.e.,
$$\varepsilon_h(\tau) = \int_0^\tau Q_h(t)\,dt \tag{1.10}$$

for some suitable time threshold, τ < ∞. This is the expected time spent in state h in the
interval from 0 to τ. Since, for all t, ∑h∈S Qh (t) = 1 it follows that ∑h∈S εh (τ) = τ.
For the two-state model for survival data (Figure 1.1), ε0 (∞) is the expected life time E(T )
and ε0 (τ) is the τ-restricted mean life time E(T ∧ τ), the expected time lived before time
τ and, thus, ε1 (τ) = τ − ε0 (τ) is the expected time lost before time τ. For the competing
risks model (Figure 1.2), ε0 (τ) is the τ-restricted mean life time and, for h ≠ 0, εh (τ) is the
expected time lost due to cause h before time τ (see Section 5.1.2). For the disability model
(Figure 1.3), ε1 (τ) is the expected time lived with disability before time τ.
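
The restricted means in (1.10) can be illustrated numerically. The following is a minimal sketch, assuming a competing risks model with two causes and constant cause-specific hazards; the hazard values, the horizon τ = 5 and all function names are hypothetical choices made for illustration only. It integrates the state occupation probabilities by the trapezoidal rule and shows that the expected times spent in the states add up to τ.

```python
import math

def state_probs(t, a1, a2):
    """State occupation probabilities (Q0, Q1, Q2) at time t in a competing
    risks model with constant cause-specific hazards a1 and a2 (an assumption)."""
    total = a1 + a2
    q0 = math.exp(-total * t)        # still in state 0 (alive)
    q1 = a1 / total * (1.0 - q0)     # cumulative incidence, cause 1
    q2 = a2 / total * (1.0 - q0)     # cumulative incidence, cause 2
    return q0, q1, q2

def restricted_means(tau, a1, a2, steps=10_000):
    """Trapezoidal approximation of eps_h(tau) = integral from 0 to tau of Q_h(t) dt."""
    dt = tau / steps
    eps = [0.0, 0.0, 0.0]
    prev = state_probs(0.0, a1, a2)
    for i in range(1, steps + 1):
        cur = state_probs(i * dt, a1, a2)
        for h in range(3):
            eps[h] += 0.5 * (prev[h] + cur[h]) * dt
        prev = cur
    return eps

# Hypothetical hazards (per year) and a 5-year horizon.
eps = restricted_means(tau=5.0, a1=0.10, a2=0.05)
print("restricted mean life time:", eps[0])
print("time lost due to causes 1 and 2:", eps[1], eps[2])
print("sum over states:", sum(eps))
```

Here eps[0] approximates the τ-restricted mean life time and eps[1], eps[2] the expected time lost due to each cause before τ.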
In the common situation where everyone is in the same state (0) at time t = 0 (i.e., P(V (0) =
0) = 1), the marginal distribution of the random variable

$$T_h = \inf_{t>0}\{V(t) = h\}, \quad h \neq 0, \tag{1.11}$$

that is, the time of first entry into state h, h ≠ 0 (which may be infinite), may also be of
interest. For recurrent events, Th is the time until the hth occurrence of the event, e.g.,
the time from diagnosis to episode no. h for the psychiatric patients discussed in Section
1.1.5 (Figure 1.5). However, the most important marginal parameter for a recurrent events
process is the expected number of events in [0,t]
µ(t) = E(N(t)), (1.12)
where N(t) is the number of recurrent events in [0,t]. For the model in Figure 1.4, this is
the expected number of visits to state 1 in [0,t].
The parameters defined in this section are called marginal since, at time t, they involve no
conditioning on the past (V (s), s < t) (though they may involve time-fixed covariates).

1.4.2 Conditional parameters (*)

To describe the time-dynamics of V (t), one may use conditional parameters such as the
transition probabilities
Ph j (s,t) = P(V (t) = j | V (s) = h), h, j ∈ S , s < t. (1.13)
Note that these probabilities do not necessarily correspond to direct transitions from h to j,
thus, in Figure 1.3 there are two possible paths from state 0 to state 2: one going directly,
and one going through state 1. A state h is said to be absorbing if no transitions out of
the state are possible, i.e., for all j ≠ h, the transition probabilities out of the state h are 0,
Ph j (s,t) = 0 for all s < t. A state that is not absorbing is said to be transient. In Figure 1.1,
state 1 is absorbing and state 0 is transient; in Figure 1.2, states 1 to k are absorbing and
state 0 transient; while, in Figure 1.3, state 2 is absorbing and states 0 and 1 transient. In
the situation where all subjects are in state 0 at time t = 0 (P(V (0) = 0) = 1) we have that
P0h (0,t) = Qh (t).
The transition probabilities Ph j (s,t) will, more generally, depend on the past history of the
process V (s) at time s and on covariates Z. For the moment, we will restrict attention
to time-fixed covariates Z recorded at time t = 0 and postpone the discussion of time-
dependent covariates until Section 3.7. The past information available at time s for con-
ditioning will be denoted Hs . If all Ph j (s,t) only depend on the past via the current state h
(and, possibly, via time-fixed covariates), then the multi-state process is said to be a Markov
process. As an example: If the probability, P12 (s,t) of dying before time t for a patient in the
bleeding state 1 at time s of the PROVA trial (Example 1.1.4) only depends on time since
start of treatment (s), then the process is Markovian; if it depends on the time, d = s − T1 ,
elapsed in state 1 at time s, i.e., the time since onset of the bleeding episode (and possibly
also on s), then the process is semi-Markovian.
For modeling purposes, as we shall see in later chapters, transition intensities αh j (t) are
convenient. These are given as the following limit
$$\alpha_{hj}(t) = \lim_{dt \to 0} \frac{1}{dt} P_{hj}(t, t + dt), \quad j \neq h, \tag{1.14}$$
which we assume to exist. That is, if dt > 0 is a ‘small’ time window, then
$$P_{hj}(t, t + dt) \approx \alpha_{hj}(t)\,dt, \quad j \neq h.$$
One reason why intensities are useful for modeling purposes is that they (in contrast to
Ph j (s,t) that must be between 0 and 1) can take on any non-negative value. For the two-
state survival model, Figure 1.1, α01 (t) is the hazard function

$$\alpha(t) = \lim_{dt \to 0} P(T \le t + dt \mid T > t)/dt \tag{1.15}$$

for the survival time T = T1 (time until entry into state 1). For the competing risks model,
Figure 1.2, α0h (t) is the cause-specific hazard

$$\alpha_h(t) = \lim_{dt \to 0} P(T \le t + dt, D = h \mid T > t)/dt, \tag{1.16}$$

where D = V (∞) is the cause of death and T = min_{h>0} Th is the survival time (time of exit
from state 0) with Th defined in (1.11). Both of these multi-state processes are Markovian.
The illness-death process of Figure 1.3 is non-Markovian if the intensity α12 (t) not only
depends on t but also on the time, d = t − T1 , spent in state 1 at time t.
The transition intensities are the most basic parameters of a multi-state model in the sense
that, in principle, if all transition intensities are specified, then all other parameters such as
state occupation probabilities, transition probabilities, and expected times spent in various
states may be derived. As we shall see in later chapters, the mapping from intensities to
other parameters is sometimes given by explicit formulas, though this depends both on the
structure of the states and possible transitions in the model and on the specific assumptions
(e.g., Markov or non-Markov) made for the intensities. Examples of such explicit formu-
las include the survival function (1.2) for the two-state model for survival data and, more
generally,
$$Q_0(t) = \exp\left(-\int_0^t \sum_h \alpha_h(u)\,du\right)$$

in the competing risks model (Figure 1.2). Also, the cause-h cumulative incidence function
in the competing risks model, Equation (1.3), and the probability of being in the interme-
diate ‘Diseased’ state in both the Markov and semi-Markov illness-death model (Figure
1.3) (formulas to be given in Sections 5.1.3 and 5.2.4) are explicit functions of the inten-
sities. A general way of going from intensities to marginal parameters (not building on a
mathematical expression) is to use micro-simulation (Section 5.4).
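
As a rough illustration of the two routes from intensities to marginal parameters, the sketch below compares an explicit formula with a small micro-simulation for the state occupation probability Q1 (t) in a Markov illness-death model. It assumes constant intensities (in which case the integral expression of Exercise 1.3 has a closed form); the intensity values, the sample size and the function names are hypothetical, and the simulation scheme is only one simple possibility, not necessarily the approach of Section 5.4.

```python
import math
import random

def q1_explicit(t, a01, a02, a12):
    """Q1(t) for a Markov illness-death model with constant intensities
    (closed form of the integral expression; requires a01 + a02 != a12)."""
    out0 = a01 + a02
    return a01 * (math.exp(-a12 * t) - math.exp(-out0 * t)) / (out0 - a12)

def q1_simulated(t, a01, a02, a12, n=20_000, seed=1):
    """Fraction of n simulated subjects occupying state 1 at time t."""
    rng = random.Random(seed)
    in_state_1 = 0
    for _ in range(n):
        exit0 = rng.expovariate(a01 + a02)        # time of leaving state 0
        if exit0 > t:
            continue                               # still in state 0 at time t
        if rng.random() < a01 / (a01 + a02):       # the exit was a 0 -> 1 transition
            exit1 = exit0 + rng.expovariate(a12)   # time of the 1 -> 2 transition
            if exit1 > t:
                in_state_1 += 1
    return in_state_1 / n

# Hypothetical intensities (per year), evaluated at t = 3 years.
args = (3.0, 0.2, 0.05, 0.3)
print("explicit :", q1_explicit(*args))
print("simulated:", q1_simulated(*args))
```

The two numbers should agree up to simulation error, which shrinks as n grows.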

1.4.3 Counting processes (*)

Closely connected to the transition intensities is the representation of observations from a
multi-state process as counting processes. In later chapters, we will typically assume that
observations from independent and identically distributed (i.i.d.) subjects i = 1, . . . , n are
available and that the multi-state process Vi (t) for subject i is observed on the time interval
[0, Xi ]. (In some cases, data may be dependent, and we will then make explicit remarks to
this effect.) Here, Xi = Ci ∧ Ti , i.e., Xi is either equal to Ci , the observed right-censoring
time for subject i, or Xi is the time (say, Ti ) where subject i is observed to enter an absorbing
state. In the latter case, the value of Ci may or may not be known depending on the situation,
e.g., if for the two-state model for survival data, subject i is observed to die at time Ti , then
it may not be known when (i.e., at time Ci > Ti ) that subject would have been censored had
it not died at Ti – that will depend on the actual right-censoring mechanism in the study.
Thus, if in the PBC3 trial (Example 1.1.1), there had been no drop-out and, therefore, all
censoring was caused by being alive at the end of 1988 (administrative censoring), then all
potential censoring times would have been known.
We now consider a generic subject i and drop the index i for ease of notation. The multi-
state process (V (t),t ≤ X) with state space S can then be represented by the counting
processes
(Nh j (t), h, j ∈ S , h ≠ j, t ≤ X) (1.17)
where each Nh j (t) counts the number of observed direct h → j transitions in the time inter-
val [0,t]. If state h is absorbing, then Nh j (t) = 0 for all states j ≠ h and all values of time
t. Note that, in our notation, we will not distinguish between the complete (uncensored)
multi-state process and the censored process (observed until time X) and a similar remark
goes for the counting processes derived from the multi-state process.
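
The representation (1.17) can be made concrete with a small sketch. It assumes that the observed path of one subject is stored as a list of (time, state) pairs (a hypothetical data layout chosen for this illustration) and returns, for a time t ≤ X, the counts Nh j (t) together with the state occupied just before t.

```python
def counting_processes(path, t):
    """Return (N, state_before_t) at time t for one subject.

    path: list of (time, state) pairs, starting with (0.0, initial state) and
    listing each observed transition up to the end of observation X.
    N[(h, j)] is the number of direct h -> j transitions in [0, t];
    state_before_t is V(t-), the state occupied just before t.
    The sketch assumes t <= X.
    """
    counts = {}
    state_before_t = path[0][1]
    for (_, h), (time, j) in zip(path, path[1:]):
        if time <= t:                     # transition has happened by time t
            counts[(h, j)] = counts.get((h, j), 0) + 1
        if time < t:                      # transition happened strictly before t
            state_before_t = j
    return counts, state_before_t

# Hypothetical illness-death path: disease onset at 2.0, death at 5.5.
path = [(0.0, 0), (2.0, 1), (5.5, 2)]
print(counting_processes(path, 2.0))   # the jump at 2.0 is counted, but V(2.0-) is still 0
print(counting_processes(path, 4.0))
```

The distinction between the two comparisons (time ≤ t and time < t) mirrors the difference between N(t) and the left limit V (t−) used later for the at-risk indicator.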
Let the past history of the multi-state process (including relevant covariates) at time t be
the sigma-algebra Ht . Then the (random) intensity process, λh j (t) for Nh j (t) (with respect
to that history) under independent censoring is (approximately when dt > 0 is small)

E(dNh j (t) | Ht− )/dt ≈ λh j (t) = αh j (t)Yh (t). (1.18)

Here, the transition intensity αh j (t) is some function of time t and the past history (Ht− )
for the interval [0,t),
Yh (t) = I(V (t−) = h) (1.19)
is the indicator for the subject of being observed to be in state h just before time t, and

dNh j (t) = Nh j (t) − Nh j (t−) (1.20)

is the increment (0 or 1, the jump) for Nh j at time t. Since λh j (t) is fixed given (Ht− ),
Equation (1.18) implies that if we define
$$M_{hj}(t) = N_{hj}(t) - \int_0^t \lambda_{hj}(u)\,du \tag{1.21}$$

then
E(dMh j (t) | Ht− ) = 0
from which it follows that the process Mh j (t) in (1.21) is a martingale, i.e.,

E(M(t) | Hs ) = M(s), s≤t

see Exercise 1.4. The decomposition of the counting process in (1.21) into a martingale plus
the integrated intensity process (the compensator) is known as the Doob-Meyer decompo-
sition of Nh j (·). Since martingales possess a number of useful mathematical properties, in-
cluding approximate large-sample normal distributions (e.g., Andersen et al., 1993, ch. II),
this observation has the consequence that large-sample properties of estimators, estimating
equations, and test statistics may be derived when formulated via counting processes. We
will hint at this in later chapters.
1.5 Exercises

Exercise 1.1 Consider the small data set in Table 1.6 and argue why both the average of all
(12) observation times and the average of the (7) uncensored times will likely underestimate
the true mean survival time from entry into the study.

Exercise 1.2
1. Consider the following records mimicking the Copenhagen Holter study (Example 1.1.8)
in wide format and transform them into long format, i.e., create one data set for each of
the possible transitions in Figure 1.7.

                     Time of
Subject    AF    stroke    death    last seen
   1       NA      NA        NA        100
   2       10      NA        NA         90
   3       NA      20        NA         80
   4       15      30        NA         85
   5       NA      NA        70         70
   6       30      NA        75         75
   7       NA      35        95         95
   8       25      50        65         65

2. Do the same for the entire data set.

Exercise 1.3 (*)

1. Derive Equations (1.2) and (1.3) for, respectively, the survival function in the two-state
model (Figure 1.1) and the cumulative incidence function in the competing risks model
(Figure 1.2).
2. Show, for the Markov illness-death model (Figure 1.3), that the state occupation proba-
bility for state 1 at time t, Q1 (t), is
$$\int_0^t \exp\left(-\int_0^u (\alpha_{01}(x) + \alpha_{02}(x))\,dx\right) \alpha_{01}(u) \exp\left(-\int_u^t \alpha_{12}(x)\,dx\right) du.$$

Exercise 1.4 (*) Argue (intuitively) how the martingale property E(M(t) | Hs ) = M(s)
follows from E(dM(t) | Ht− ) = 0 (Section 1.4.3).
Chapter 2

Intuition for intensity models

In this chapter, we will give a non-technical introduction to models for intensities to be
discussed in more mathematical detail in Chapter 3. Along with the introduction of the
models, examples will be given to illustrate how results from analysis of these models can
be interpreted. More examples are given in Chapter 3. In Section 1.2, we argued that the
intensity (or rate or hazard) is the basic parameter in multi-state models, so, estimation of
the intensity is crucial. In the following, we will present intuitive arguments for the way in
which models for the intensity may be analyzed for one transition at a time. An important
point is that modeling of different intensities in a given multi-state model may be done sep-
arately, and in Section 3.1 we will present a mathematical argument (based on a likelihood
factorization) why this is so. An intuitive argument is the following. The rate describes what
happens locally in time among those who are at risk for a given type of event and because,
in a small time interval from t to t + dt, only one event can happen, there is no need to
consider which other events may happen when focusing on a single event type. This means,
e.g., that in a competing risks model, each single cause-specific hazard may be analyzed
by, formally, censoring for the competing causes (and for avoidable censoring events, see
Section 1.3). We will exemplify this in Section 2.4. This does not mean that ‘censoring for
competing causes’ is considered an independent censoring mechanism, rather it should be
thought of as a ‘technical hack’ in the estimation. However, for other parameters (such as
the cumulative probability over time of dying from a specific cause, the cumulative inci-
dence, cf. Equation (1.3)), different attention must be paid to deaths from other causes and
to other kinds of loss to follow-up. This phenomenon has caused confusion in the literature
on competing risks, and it will be an important topic for discussion in this book, see, e.g.,
Section 4.1.2.

2.1 Models for homogeneous groups

We will use the two-state model for the PBC3 trial (Section 1.1.1) and the composite end-
point ‘failure of medical treatment’, i.e., death or transplantation, as motivating example.
Consider a patient (i) randomized to CyA treatment who is still event-free (alive without
having had a liver transplantation) at time t since randomization. The intensity of the event
‘failure of medical treatment’ for that subject at time t, αi (t) has the interpretation

αi (t)dt = P(i has an event before time t + dt | i is event-free at t).

2.1.1 Nelson-Aalen estimator
We will use the counting process notation (where I(· · · ) is an indicator function, see Section
1.2)
dNi (t) = I(i has an event in the interval (t,t + dt))
and
Yi (t) = I(i is event-free and uncensored just before time t).
If we assume that censoring is independent (Equation (1.6)) and that all CyA treated pa-
tients have the same intensity α(t), then a natural estimator of the probability α(t)dt is the
fraction
$$\frac{\text{No. of patients with an event in } (t, t + dt)}{\text{No. of patients at risk of an event just before time } t} = \frac{dN(t)}{Y(t)}.$$
Here, dN(t) = ∑i dNi (t) and Y (t) = ∑i Yi (t) are the total number of events at time t (typi-
cally 0 or 1) and the total number of subjects at risk at time t, respectively. This idea leads
to the Nelson-Aalen estimator for the cumulative hazard
$$A(t) = \int_0^t \alpha(u)\,du,$$

as follows. Let 0 < X1 ≤ X2 ≤ · · · ≤ Xn be the ordered times of observation for the CyA
treated patients, i.e., an X is either an observed time of failure or a time of censoring,
whatever came first for a given subject. Then, for each such time, a term dN(X)/Y (X) is
added to the estimator, and the Nelson-Aalen estimator may then be written

$$\hat{A}(t) = \frac{dN(X_1)}{Y(X_1)} + \frac{dN(X_2)}{Y(X_2)} + \cdots,$$
where contributions from all times of observation ≤ t are added up. Since only observed
failure times (dN(X) = 1) effectively contribute to this sum (for a censoring at X, dN(X) =
0), the estimator may be re-written as

$$\hat{A}(t) = \sum_{\text{event times, } X \le t} \frac{1}{\text{No. at risk at } X}.$$

This estimates the cumulative hazard and, on a plot of Â(t) against t, the ‘approximate local
slope’ estimates the intensity at that point in time. To establish an interpretation of A(t), we
study the situation with survival data (Figure 1.1), where the following ‘experiment’ can be
considered: Assume that one subject is observed from time t = 0 until failure (at time X1 ,
say). At that failure time, replace the first subject by another subject who is still alive and
observe that second subject until failure (at time X2 > X1 ). At X2 , replace by a third subject
who is still alive and observe until failure (at X3 > X2 ), and so on. In that experiment A(t)
is the expected number of replacements in [0,t]. In particular, we may note that A(t) is not
a probability and its value may exceed the value 1.
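
The construction of the Nelson-Aalen estimator can be sketched in a few lines. The function below is a minimal illustration, not the software used for the PBC3 analyses; the six observation times and event indicators are hypothetical, and the function name is chosen for this example only.

```python
def nelson_aalen(times, events):
    """Nelson-Aalen estimate of the cumulative hazard.

    times: observation times X_i; events: 1 for an observed failure, 0 for censoring.
    Returns a list of (event time, estimate just after that time).
    """
    data = sorted(zip(times, events))
    n = len(data)
    estimate, cum = [], 0.0
    i = 0
    while i < n:
        t = data[i][0]
        at_risk = n - i                  # Y(t): subjects with X_i >= t
        d = 0                            # dN(t): number of failures at time t
        while i < n and data[i][0] == t:
            d += data[i][1]
            i += 1
        if d > 0:                        # censorings add nothing to the sum
            cum += d / at_risk
            estimate.append((t, cum))
    return estimate

# Hypothetical data: 6 subjects, two of them censored.
times = [2.0, 3.0, 3.5, 5.0, 6.0, 8.0]
events = [1, 0, 1, 1, 0, 1]
na = nelson_aalen(times, events)
for t, a in na:
    print(t, round(a, 4))
```

In this toy example the final value is 1/6 + 1/4 + 1/3 + 1 = 1.75, which illustrates that the cumulative hazard is not a probability and may exceed 1.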
The standard deviation (or standard error) of the estimator Â(t) (which we will throughout
abbreviate SD) can also be estimated, whereby confidence limits can be added to plots
of Â(t), preferably by transforming symmetric limits for log(A(t)). This amounts to a
lower 95% confidence limit for A(t) of Â(t) exp(−1.96 · SD/Â(t)) and an upper limit of
Â(t) exp(1.96 · SD/Â(t)). The SD for the cumulative hazard estimate will increase most
markedly during periods at which few subjects are at risk, i.e., Y (t) is low.

[Figure 2.1: cumulative hazard (vertical axis) against time since randomization in years (horizontal axis); curves for Placebo and CyA.]
Figure 2.1 PBC3 trial in liver cirrhosis: Nelson-Aalen estimates by treatment.

Figure 2.1 shows the Nelson-Aalen estimates for the two treatment groups, CyA and
placebo, from the PBC3 trial. It is seen that the curves for both treatment groups are
roughly linear (suggesting that the hazards are roughly constant) and that they are quite
equal (suggesting that CyA treatment does not affect the rate of failure from medical
treatment in patients with PBC). This can be emphasized by estimating the SD of Â(t). Thus,
at 2 years the estimates are, respectively, 0.183 (SD=0.035) in the placebo group and 0.167
(SD=0.034) in the CyA group, leading to the 95% confidence intervals (0.126, 0.266), re-
spectively (0.112, 0.249). To simplify the figure, confidence limits have not been added to
the curves.
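
The confidence limits quoted above follow directly from the log-transformation. As a small check, the sketch below recomputes them from the reported estimates and SD values at 2 years; the function name is hypothetical.

```python
import math

def log_transformed_ci(a_hat, sd, z=1.96):
    """Limits for A(t) obtained from symmetric limits for log(A(t))."""
    factor = math.exp(z * sd / a_hat)
    return a_hat / factor, a_hat * factor

# Estimates and SD at 2 years as quoted in the text.
placebo = log_transformed_ci(0.183, 0.035)
cya = log_transformed_ci(0.167, 0.034)
print("placebo:", [round(v, 3) for v in placebo])
print("CyA    :", [round(v, 3) for v in cya])
```

Note that the resulting intervals are not symmetric around the estimate, in contrast to limits of the form estimate ± 1.96 · SD.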

2.1.2 Piece-wise constant hazards

Even though both hazards in the PBC3 trial seem rather constant, a model assuming this
is typically considered to be too restrictive for practical purposes. However, a flexible ex-
tension of the constant hazard model exists, namely the piece-wise constant hazards (or
piece-wise exponential) model. This builds on a division of time into ‘suitable’ intervals
within which the hazard is assumed to be constant. To fit this model, a specification of a
number, L of intervals in which α(t) is assumed to be constant is needed. This specification
should, ideally, be pre-determined, i.e., without having looked at the data and the distribu-
tion of event times. This could, for example, be based on ‘nice’ values of time such as 5-
or 10-year age intervals or yearly intervals of follow-up time. However, alternatively one
sometimes attempts to choose intervals including roughly equal numbers of events. In each
such interval the value of the hazard may then be estimated by an occurrence/exposure rate
obtained as the ratio between the number of events occurring in that interval and the total
(‘exposure’) time at risk in the interval. Thus, if αℓ is the hazard in interval no. ℓ, then it is
estimated by
$$\hat{\alpha}_\ell = \frac{\text{No. of events in interval } \ell}{\text{Total time at risk in interval } \ell} = \frac{D_\ell}{Y_\ell}.$$
Note that the hazard has a per time dimension and that, therefore, whenever a numerical
value of a hazard is quoted, the units in which time is measured should be given.

Table 2.1 PBC3 trial in liver cirrhosis: Events, risk time, and estimated hazards in a piece-wise
exponential model by treatment group.

Treatment   Interval ℓ   Events   Risk time Yℓ   Hazard α̂ℓ          SD
            (year)       Dℓ       (in years)     (per 100 years)
CyA         0-1          24       295.50          8.1                1.7
            2-3          18       137.67         13.1                3.1
            4-5           2        20.80          9.6                6.8
Placebo     0-1          27       287.08          9.4                1.8
            2-3          17       136.00         12.5                3.0
            4-5           2        23.66          8.5                6.0

For the PBC3 data, working with two-year intervals of follow-up time, the resulting event
and person-time counts together with the resulting estimated rates are shown in Table 2.1
together with their estimated standard deviation, SD = √Dℓ /Yℓ , and depicted in Figure 2.2.
It is seen that, judged from the SD values, the estimated hazards are, indeed, quite constant
over time and between the treatment groups. The SD is smaller when the event count is
large.
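
The occurrence/exposure rates and their SD can be recomputed from the event counts and risk times in Table 2.1. The sketch below does this; the variable and function names are hypothetical, and the results are expressed per 100 years as in the table.

```python
import math

# (treatment, interval, events D, risk time Y in years), copied from Table 2.1.
table_2_1 = [
    ("CyA", "0-1", 24, 295.50),
    ("CyA", "2-3", 18, 137.67),
    ("CyA", "4-5", 2, 20.80),
    ("Placebo", "0-1", 27, 287.08),
    ("Placebo", "2-3", 17, 136.00),
    ("Placebo", "4-5", 2, 23.66),
]

def occurrence_exposure(d, y, per=100.0):
    """Occurrence/exposure rate D/Y and its SD sqrt(D)/Y, per `per` years."""
    return per * d / y, per * math.sqrt(d) / y

rates = [(trt, interval) + occurrence_exposure(d, y) for trt, interval, d, y in table_2_1]
for trt, interval, rate, sd in rates:
    print(f"{trt:8s} {interval:4s} {rate:6.1f} {sd:5.1f}")
```

Because the SD is √D/Y, intervals with only two events (the last interval in each group) have an SD of almost the same size as the rate itself.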
Figure 2.3 compares for the placebo group the estimated cumulative hazards using the two
different models (a step function for the non-parametric model and a broken straight line
for the piece-wise exponential model) and the two sets of estimates are seen to coincide
well.

2.1.3 Significance tests

If a formal statistical comparison between the hazards in the two treatment groups is de-
sired, either in the non-parametric case using the Nelson-Aalen estimators or in the case
of the piece-wise constant hazard model, then this may be achieved via a suitable signif-
icance test. In the latter case, a standard likelihood ratio test (LRT) may be used which
for the PBC3 data, splitting into two-year intervals of follow-up time, gives the value 0.31
yielding (using the χ²₃-distribution, i.e., the chi-squared distribution with 3 DF (degrees of
freedom)) a P-value of 0.96.
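
A likelihood ratio test of this kind can be sketched from the aggregated counts in Table 2.1. The sketch assumes that the test compares a model with separate piece-wise constant hazards in the two groups (six parameters) with a model with common hazards (three parameters); this is an interpretation made for illustration, and the function names are hypothetical. The maximized log-likelihood of a piece-wise exponential model is the sum over intervals of D log(D/Y) − D, and the tail probability of the chi-squared distribution with 3 DF has a closed form in terms of the complementary error function.

```python
import math

# (events D, risk time Y) per two-year interval, copied from Table 2.1.
cya = [(24, 295.50), (18, 137.67), (2, 20.80)]
placebo = [(27, 287.08), (17, 136.00), (2, 23.66)]

def max_loglik(cells):
    """Maximized piece-wise exponential log-likelihood: sum of D*log(D/Y) - D."""
    return sum(d * math.log(d / y) - d for d, y in cells)

# Common hazards: pool events and risk time within each interval.
pooled = [(d0 + d1, y0 + y1) for (d0, y0), (d1, y1) in zip(cya, placebo)]
lrt = 2.0 * (max_loglik(cya) + max_loglik(placebo) - max_loglik(pooled))

def chi2_sf_3df(x):
    """Upper tail probability of the chi-squared distribution with 3 DF."""
    return math.erfc(math.sqrt(x / 2.0)) + math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0)

p_value = chi2_sf_3df(lrt)
print(round(lrt, 2), round(p_value, 2))
```

Under the stated assumption, the statistic and P-value should come out close to the values quoted in the text.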
In the former case, the standard non-parametric test for comparing the hazards in the two
treatment groups is the logrank test. This is obtained by, at each observation time, X (in
either group, 0 or 1), setting up a two-by-two table summarizing the observations at that
time, see Table 2.2.

[Figure 2.2: estimated hazard function per 100 years (vertical axis) against time since randomization in years (horizontal axis); curves for Placebo and CyA.]
Figure 2.2 PBC3 trial in liver cirrhosis: Estimated piece-wise exponential hazard functions by
treatment group, see Table 2.1.

Across the tables for different X, the observed, dN0 (X), and expected (under the hypothesis
of identical hazards in groups 0 and 1)

$$\frac{Y_0(X)}{Y_0(X) + Y_1(X)} \left(dN_0(X) + dN_1(X)\right),$$

numbers of failures from one group (here group 0) are added. Denote the resulting sums by
O0 and E0 , respectively. Also the variances

$$\frac{Y_0(X)\,Y_1(X)}{(Y_0(X) + Y_1(X))^2} \left(dN_0(X) + dN_1(X)\right)$$

(if all failure times are distinct) are added across the tables to give v. Note that only observed
failure times (in either group) effectively contribute to these sums. The two-sample logrank
test statistic, to be evaluated in the χ²₁-distribution, is then
$$\frac{(O_0 - E_0)^2}{v}.$$

Table 2.2 Summary of observations in two groups at a time, X, of observation.

Group   Died                Survived            Alive before
0       dN0(X)              Y0(X) − dN0(X)      Y0(X)
1       dN1(X)              Y1(X) − dN1(X)      Y1(X)
Total   dN0(X) + dN1(X)                         Y0(X) + Y1(X)

[Figure 2.3: cumulative hazard (vertical axis) against time since randomization in years (horizontal axis); curves for Nelson-Aalen and Piece-wise exponential.]
Figure 2.3 PBC3 trial in liver cirrhosis: Estimated cumulative hazards for the placebo group.

For the PBC3 trial, the observed number of failures in the placebo group is O0 = 46, the
expected number is E0 = 44.68, and the variance v = 22.48 leading to a logrank test statistic
of 0.08 and the P-value 0.78. Note that the same results would be obtained by focusing,
instead, on the CyA group (because O0 + O1 = E0 + E1 = total number of observed failures
and, therefore, (O0 − E0 )² = (O1 − E1 )²).
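
The steps of the logrank test can be sketched as follows. The function adds observed and expected numbers of failures and the variance terms across the two-by-two tables, using the variance formula for distinct failure times given above. The six-subject data set is hypothetical; the last lines only redo the arithmetic for the PBC3 figures quoted in the text, using the fact that the tail probability of the chi-squared distribution with 1 DF equals erfc(√(x/2)).

```python
import math

def logrank(times, events, groups):
    """Two-sample logrank test; returns (O0, E0, v, test statistic).

    The variance term is the one valid for distinct failure times.
    """
    data = list(zip(times, events, groups))
    o0 = e0 = v = 0.0
    for t in sorted({s for s, d, _ in data if d == 1}):
        y0 = sum(1 for s, _, g in data if s >= t and g == 0)   # at risk, group 0
        y1 = sum(1 for s, _, g in data if s >= t and g == 1)   # at risk, group 1
        dn0 = sum(1 for s, d, g in data if s == t and d == 1 and g == 0)
        dn1 = sum(1 for s, d, g in data if s == t and d == 1 and g == 1)
        o0 += dn0
        e0 += y0 / (y0 + y1) * (dn0 + dn1)
        v += y0 * y1 / (y0 + y1) ** 2 * (dn0 + dn1)
    return o0, e0, v, (o0 - e0) ** 2 / v

def chi2_sf_1df(x):
    """Upper tail probability of the chi-squared distribution with 1 DF."""
    return math.erfc(math.sqrt(x / 2.0))

# Hypothetical data with distinct failure times.
times = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
events = [1, 1, 0, 1, 1, 0]
groups = [0, 1, 0, 0, 1, 1]
o0, e0, v, stat = logrank(times, events, groups)
print(o0, round(e0, 3), round(v, 3), round(stat, 3))

# Arithmetic for the PBC3 figures quoted in the text.
pbc3_stat = (46 - 44.68) ** 2 / 22.48
pbc3_p = chi2_sf_1df(pbc3_stat)
print(round(pbc3_stat, 2), round(pbc3_p, 2))
```

With tied failure times a modified variance term would be needed, which this sketch does not attempt.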
The logrank test can be extended to a comparison of more than two groups where, for
comparison of k groups, the resulting test statistic is evaluated in the χ²ₖ₋₁-distribution (e.g.,
Collett, 2015, ch. 2). Also a stratified version of the logrank test is available.

2.2 Regression models

In this section, we will study regression models describing the way in which hazard func-
tions may depend on covariates. It will be an assumption throughout that censoring is
independent given the covariates under study – see Section 1.3. As explained there, this
assumption cannot be checked based on the censored event history data; however, if cen-
soring depends on a covariate, then this should be included in the hazard model, see Section
4.4.2 for further discussion.
2.2.1 Multiplicative regression models
A more informative way of comparing the two groups than using a significance test is to
quantify the discrepancies between the hazards in the two treatment groups via a regression
model. Using a hazard ratio for this quantification leads to the Cox proportional hazards
regression model (Cox model) when the starting point is the non-parametric model (Cox,
1972) and to the ‘Poisson’ (or piece-wise exponential) regression model when the starting
point is the model with a piece-wise constant hazard (e.g., Clayton and Hills, 1993, ch.
22-23).

Cox model
A Cox model for the PBC3 trial would assume that the hazard for the placebo group is α0 (t)
and no assumptions concerning this hazard are imposed (it is a completely unspecified non-
parametric, baseline hazard function). On the other hand, the hazard function (say, α1 (t))
for the CyA group is assumed to be proportional to the baseline hazard, i.e., there exists a
constant hazard ratio, say HR, such that, for all t,

α1 (t) = α0 (t) HR,

i.e.,
$$\frac{\alpha_1(t)}{\alpha_0(t)} = \mathrm{HR}.$$
This specifies a regression model because, for each patient (i) in the PBC3 trial, we can
define an explanatory variable (or covariate) Zi , as follows,

$$Z_i = \begin{cases} 0 & \text{if patient } i \text{ was in the placebo group} \\ 1 & \text{if patient } i \text{ was in the CyA group} \end{cases}$$

and then the Cox model for the hazard for patient i

$$\alpha_i(t) = \begin{cases} \alpha_0(t) & \text{if patient } i \text{ was in the placebo group} \\ \alpha_0(t)\,\mathrm{HR} & \text{if patient } i \text{ was in the CyA group} \end{cases}$$

can be written as the regression model

αi (t) = α0 (t) exp(β Zi )

with HR = exp(β ). On the logarithmic scale, the model becomes

log(αi (t)) = log(α0 (t)) + β Zi .

Thus, the proportionality assumption is the same as a constant difference between the
log(hazards) at any time t. Figure 2.4 illustrates the proportional hazards assumption for a
binary covariate Z, both on the hazard scale (a) and the log(hazard) scale (b).
Because the Cox model combines a non-parametric baseline hazard with a parametric spec-
ification of the covariate effect, it is often called semi-parametric.

[Figure 2.4, two panels: (a) hazard scale; (b) log(hazard) scale.]
Figure 2.4 Illustrations of the assumptions for a Cox model for a binary covariate Z.

To estimate the hazard ratio, HR (or the regression coefficient β = log(HR)), the Cox log-
partial likelihood function l(β ) is maximized. This is
$$l(\beta) = \sum_{\text{event times, } X} \log\left(\frac{\exp(\beta Z_{\text{event}})}{\sum_{j \text{ at risk at time } X} \exp(\beta Z_j)}\right) \tag{2.1}$$
and the intuition behind this is, as follows. At any event time, X, the covariate value, say
Zevent , for the individual with an event at that time ‘is compared’ to that of the subjects ( j)
who were at risk for an event at that time (i.e., still event-free and uncensored, including
the failing subject). Thus, if ‘surprisingly often’, the individual having an event is placebo-
treated compared to the distribution of the treatment variable Z j among those at risk, then
this signals that placebo treatment is a risk factor. The set of subjects who are at risk of the
event at a given time t is denoted the risk set, R(t).
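
The maximization of the log-partial likelihood (2.1) can be sketched for a single covariate. The code below is a minimal illustration under the assumption of distinct event times; it uses Newton-Raphson iterations, which is one common choice but not necessarily what any particular software package does. The eight-subject data set and all function names are hypothetical.

```python
import math

def cox_loglik_score_info(beta, times, events, z):
    """Log-partial likelihood (2.1), its first derivative (score) and minus its
    second derivative (information) for one covariate, assuming distinct event times."""
    loglik = score = info = 0.0
    for t_i, d_i, z_i in zip(times, events, z):
        if d_i != 1:
            continue                       # censored subjects only enter via risk sets
        risk = [z_j for t_j, z_j in zip(times, z) if t_j >= t_i]   # risk set R(t_i)
        s0 = sum(math.exp(beta * z_j) for z_j in risk)
        s1 = sum(z_j * math.exp(beta * z_j) for z_j in risk)
        s2 = sum(z_j ** 2 * math.exp(beta * z_j) for z_j in risk)
        loglik += beta * z_i - math.log(s0)
        score += z_i - s1 / s0             # covariate of the failing subject vs. risk-set average
        info += s2 / s0 - (s1 / s0) ** 2
    return loglik, score, info

def fit_cox(times, events, z, iterations=25):
    """Newton-Raphson maximization; returns (estimate, model-based SD)."""
    beta = 0.0
    for _ in range(iterations):
        _, score, info = cox_loglik_score_info(beta, times, events, z)
        beta += score / info
    _, _, info = cox_loglik_score_info(beta, times, events, z)
    return beta, 1.0 / math.sqrt(info)

# Hypothetical data: 8 subjects, binary covariate z, two censored observations.
times = [1, 2, 3, 4, 5, 6, 7, 8]
events = [1, 1, 1, 0, 1, 1, 0, 1]
z = [1, 0, 1, 1, 0, 1, 0, 0]
beta_hat, sd_hat = fit_cox(times, events, z)
print("beta:", round(beta_hat, 3), "HR:", round(math.exp(beta_hat), 3), "SD:", round(sd_hat, 3))
```

The score contribution at each event time is exactly the comparison described above: the covariate value of the failing subject minus a weighted average of the covariate among those at risk.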
For the PBC3 data, a Cox model including the treatment indicator, Z yields an estimated
regression coefficient of β̂ = −0.059 with an estimated standard deviation of 0.211, leading
to an estimated hazard ratio of exp(−0.059) = 0.94 with 95% confidence limits from
0.62(= exp(−0.059 − 1.96 · 0.211)) to 1.43(= exp(−0.059 + 1.96 · 0.211)). This contains
the null value HR = 1, in accordance with the logrank test statistic. The estimated SD is
known as a model-based standard deviation since it follows from the likelihood function
l(β ). In the Cox model, the cumulative baseline hazard may be estimated using a ‘Nelson-
Aalen-like’ estimator, known as the Breslow estimator:
$$\hat{A}_0(t) = \sum_{\text{event times, } X \le t} \frac{1}{\sum_{j \text{ at risk at time } X} \exp(\hat{\beta} Z_j)}. \tag{2.2}$$

For the PBC3 data, A0 (t) is the cumulative hazard in the placebo group, and the estimate
is shown in Figure 2.5. Note that, compared to Figure 2.1, there are many more steps in
the Breslow estimate. This is because all event times, i.e., in either treatment group, give
rise to a jump in the baseline hazard estimator. The intuition is that, due to the proportional
hazards assumption, an event for a CyA treated patient also contains information about the
hazard in the placebo group.
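
The Breslow estimator (2.2) can be sketched in the same spirit. The function below takes a value of the regression coefficient as given; the data are hypothetical, as is the value 0.8 used for the coefficient. Setting the coefficient to 0 reduces the estimator to the Nelson-Aalen estimator for the pooled sample, which gives a simple check of the code.

```python
import math

def breslow(times, events, z, beta):
    """Breslow estimate (2.2) of the cumulative baseline hazard for a given beta.

    Returns a list of (event time, estimate just after that time); every event
    time, regardless of the covariate value of the failing subject, gives a jump.
    """
    n = len(times)
    order = sorted(range(n), key=lambda i: times[i])
    out, cum = [], 0.0
    for i in order:
        if events[i] == 1:
            denom = sum(math.exp(beta * z[j]) for j in range(n) if times[j] >= times[i])
            cum += 1.0 / denom
            out.append((times[i], cum))
    return out

# Hypothetical data: 8 subjects, binary covariate z, two censored observations.
times = [1, 2, 3, 4, 5, 6, 7, 8]
events = [1, 1, 1, 0, 1, 1, 0, 1]
z = [1, 0, 1, 1, 0, 1, 0, 0]

baseline = breslow(times, events, z, beta=0.8)    # hypothetical coefficient
pooled_na = breslow(times, events, z, beta=0.0)   # equals pooled Nelson-Aalen
print(len(baseline), round(baseline[-1][1], 4), round(pooled_na[-1][1], 4))
```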

Multiple Cox regression

The PBC3 trial was randomized and a regression analysis including only the treatment vari-
able may be reasonable. In observational studies focusing on an exposure covariate, there
may be confounders for which adjustment is needed when estimating the association with
exposure. This leads to the need of performing multiple (Cox) regression analyses. Because
the randomization in the PBC3 trial was not completely successful (it turned out that CyA
treated patients, in spite of the randomization, tended to have slightly less favorable val-
ues of important prognostic variables like serum bilirubin and serum albumin, see Table
2.3), we will illustrate multiple Cox regression using this example. The joint effect of p
covariates, as explained in Section 1.2.5, is summarized in a linear predictor
LP = β1 Z1 + β2 Z2 + · · · + β p Z p
(see Equation (1.4)) in which each covariate Z j enters via a regression coefficient β j . The
resulting model for a subject with covariates Z1 , Z2 , . . . , Z p is then given by the hazard
$$\alpha_0(t) \exp(\mathrm{LP})$$
where the regression coefficients have the following interpretation (see Section 1.2.5). Consider
two subjects differing 1 unit for covariate j and having identical values for the remaining
covariates in the model. Then the ratio between their hazards at any time t is
$$\frac{\alpha_0(t) \exp(\beta_1 Z_1 + \cdots + \beta_{j-1} Z_{j-1} + \beta_j (Z_j + 1) + \cdots + \beta_p Z_p)}{\alpha_0(t) \exp(\beta_1 Z_1 + \cdots + \beta_{j-1} Z_{j-1} + \beta_j Z_j + \cdots + \beta_p Z_p)} = \exp(\beta_j).$$
Thus, exp(β j ) is the hazard ratio for a 1 unit difference in Z j at any time t and for given
values of the remaining covariates in the model. Furthermore, the baseline hazard α0 (t) is
the hazard function for subjects where the linear predictor equals 0.

[Figure 2.5: cumulative baseline hazard (vertical axis) against time since randomization in years (horizontal axis).]
Figure 2.5 PBC3 trial in liver cirrhosis: Breslow estimate for the cumulative baseline hazard in a
Cox model including only treatment.

Table 2.3 PBC3 trial in liver cirrhosis: Average covariate values by treatment group.

Treatment (n) Albumin (g/L) Bilirubin (µmol/L)

CyA (176) 37.51 48.56
Placebo (173) 39.26 42.34

If, for the PBC3 data we add the covariates Z2 = albumin and Z3 = bilirubin to the model
including only the treatment indicator, then the results in Table 2.4 are obtained.
It is seen that, for given values of the variables albumin and bilirubin, the log(hazard ratio)
comparing CyA with placebo is now numerically much larger, and the 95% confidence
interval for exp(β1 ), (0.391, 0.947), excludes the null value 1. The interpretation
of the coefficient for albumin is that the hazard ratio when comparing two subjects
differing by 1 g/L in albumin is exp(−0.116) = 0.89 for given values of treatment and bilirubin.
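
Hazard ratios and their confidence limits are obtained by exponentiating the coefficient and the limits estimate ± 1.96 · SD. The sketch below redoes this arithmetic for the treatment coefficient in the model with treatment only and in the adjusted model (coefficients and SD as reported in the text and in Table 2.4); the function name is hypothetical. Because the inputs are rounded to three decimals, the last digit of a recomputed limit may differ slightly from the quoted one.

```python
import math

def hazard_ratio_ci(beta, sd, z=1.96):
    """Hazard ratio exp(beta) with limits exp(beta -/+ z*sd)."""
    return math.exp(beta), math.exp(beta - z * sd), math.exp(beta + z * sd)

unadjusted = hazard_ratio_ci(-0.059, 0.211)   # treatment only
adjusted = hazard_ratio_ci(-0.496, 0.226)     # adjusted for albumin and bilirubin
albumin_hr = math.exp(-0.116)                 # per 1 g/L albumin

print([round(v, 2) for v in unadjusted])
print([round(v, 3) for v in adjusted])
print(round(albumin_hr, 2))
```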
REGRESSION MODELS 45
Table 2.4 PBC3 trial in liver cirrhosis: Estimated coefficients (and SD) from a Cox model.

Covariate                           β̂           SD
Treatment   CyA vs. placebo      -0.496       0.226
Albumin     per 1 g/L            -0.116       0.021
Bilirubin   per 1 µmol/L          0.00895     0.00098

Table 2.5 PBC3 trial in liver cirrhosis: Estimated coefficients (and SD) from a Poisson regression
model.

Covariate                           β̂           SD
Treatment   CyA vs. placebo      -0.475       0.224
Albumin     per 1 g/L            -0.112       0.021
Bilirubin   per 1 µmol/L          0.00846     0.00094

Poisson model
In a similar way, a multiplicative regression model can be obtained with the starting point
being the piece-wise constant hazards model. For the PBC3 data, the model including only
treatment is
$$\alpha_i(t) = \alpha_0(t)\exp(\beta Z_i),$$
but now the baseline hazard, instead of being completely unspecified as in the Cox model,
is assumed to be constant in, e.g., 2-year intervals of follow-up time

$$\alpha_0(t) = \begin{cases} \alpha_1 & \text{if } t < 2, \\ \alpha_2 & \text{if } 2 \le t < 4, \\ \alpha_3 & \text{if } 4 \le t. \end{cases}$$

The resulting regression model is known as Poisson or piece-wise exponential regression.

The reason behind the name ‘Poisson’ is a bit technical and will be given in Section 3.3.
Estimates of the parameters β, α1, α2, α3 are obtained by referring to the maximum likelihood
principle (which also motivates the parameter estimates in the Cox model). For a simple
Poisson model, including only treatment, the estimates are (with 95% confidence limits)
exp(β̂) = 0.942 (0.623, 1.424), α̂1 = 0.088 (0.067, 0.115), α̂2 = 0.128 (0.092, 0.178),
α̂3 = 0.090 (0.034, 0.239) (the latter three expressed in the unit ‘per 1 year’), close to what
was seen in Table 2.1. In this model, the time intervals appear as a categorical covariate.
Multiple Poisson regression is now straightforward and is given by the following hazard
for a subject with covariates $Z_1, Z_2, \ldots, Z_p$

$$\alpha_0(t)\exp(\mathrm{LP})$$

where the linear predictor LP is given by Equation (1.4) and the baseline hazard α0 (t) (the
hazard function when LP = 0) is assumed piece-wise constant. For the PBC3 data, adding
albumin and bilirubin to the model yields the estimates shown in Table 2.5 which are seen
to be quite close to those from the similar Cox model (Table 2.4).
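The way the time intervals enter as a categorical covariate can be sketched by splitting each subject's follow-up at the cut points 2 and 4 years and collecting events and person-years per interval. In a model without covariates, the maximum likelihood estimate of each piece-wise constant hazard is the number of events divided by the person-years at risk in that interval. The function names and the three-subject data set below are hypothetical and serve only as an illustration; they are not the PBC3 data.

```python
from collections import defaultdict

def split_follow_up(time, event, cuts=(2.0, 4.0)):
    """Split follow-up from 0 to `time` at the cut points.

    Returns rows (interval index, person-time, event indicator) for the
    intervals [0, 2), [2, 4), [4, infinity).
    """
    bounds = [0.0, *cuts, float("inf")]
    rows = []
    for k in range(len(bounds) - 1):
        lo, hi = bounds[k], bounds[k + 1]
        if time < lo:
            break
        exposure = min(time, hi) - lo
        failed_here = int(bool(event) and time < hi)
        if exposure > 0 or failed_here:
            rows.append((k, exposure, failed_here))
    return rows

def occurrence_exposure_rates(data, cuts=(2.0, 4.0)):
    """Events divided by person-time in each interval."""
    events = defaultdict(int)
    person_time = defaultdict(float)
    for time, event in data:
        for k, exposure, failed in split_follow_up(time, event, cuts):
            events[k] += failed
            person_time[k] += exposure
    return {k: events[k] / person_time[k] for k in sorted(person_time) if person_time[k] > 0}

# Hypothetical (time in years, event indicator) pairs.
toy = [(1.5, 1), (3.0, 0), (5.0, 1)]
rates = occurrence_exposure_rates(toy)
print(rates)
```

With covariates, the same split records (one row per subject and interval) are what a Poisson regression with interval as a categorical covariate would be fitted to.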
2.2.2 Modeling assumptions
Whenever the effect on some outcome of several explanatory variables is obtained by com-
bining the variables into a linear predictor, some assumptions are imposed:
• The effect of a quantitative covariate on the linear predictor is linear.
• For each covariate, its effect on the linear predictor is independent of other variables’
effects, i.e., there are no interactions between the covariates.
Since these assumptions are standard in models with a linear predictor, there are standard
ways of checking them. Thus, as discussed, e.g., by Andersen and Skovgaard (2006, ch.
4-5), to check linearity, extended models including non-linear effects, such as splines or
polynomials, may be fitted and compared statistically to the simple model with a linear
effect. To examine interactions, interaction terms may be added to the linear predictor and
the resulting models may be compared statistically to the simple additive model.
We will exemplify goodness-of-fit investigations using the data from the PBC3 trial.

Checking linearity
We will illustrate how to examine linearity of a quantitative covariate, Z, using either a
quadratic effect or linear splines. Both in the Cox model and in the Poisson model, either
the covariate Z² or covariates of the form

$$Z_j = (Z - a_j)\cdot I(Z > a_j), \quad j = 1, \ldots, s$$

may be added to the linear predictor. Here, the covariate values $a_1 < \cdots < a_s$ are knots
to be selected. If no particular clinically relevant cut-points are available, then one would
typically use certain percentiles as knots. The spline covariate $Z_j$ gives, for subjects who
have a value of Z that exceeds the knot $a_j$, how much the value exceeds $a_j$. For subjects
with $Z \le a_j$, the spline covariate is $Z_j = 0$ and the linear predictor now depends on Z as
a broken straight line. Here, the interpretation of coefficient no. j is the change in slope
at the knot $a_j$, and the coefficient for Z is the slope below the first knot, $a_1$. Linearity,
therefore, corresponds to the hypothesis that all coefficients for the added spline covariates
are equal to zero. In a model with a quadratic effect, i.e., including both Z and Z², the
corresponding coefficients (say, β1 and β2) do not, themselves, have particularly simple
interpretations. However, the fact that a positive β2 suggests that the best fitting quadratic
curve for the covariate is a convex (‘happy’) parabola, while a negative β2 suggests that the
best fitting parabola for the covariate is concave (‘bad tempered’) does give some insight
into the dose-response relationship between the linear predictor and Z. In both cases, the
extreme point for the parabola (a minimum if β2 > 0 and a maximum if β2 < 0) corresponds
to Z = −β1/(2β2), a fact that may give further insight.
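The spline covariates and the two interpretations above (coefficients as changes in slope, and the extreme point of the parabola) can be sketched as follows. The function names are hypothetical; the knots and coefficients are the bilirubin values reported for the Cox model in Table 2.6, and the subject with bilirubin 40 µmol/L is an invented example.

```python
from itertools import accumulate

def linear_spline_covariates(z, knots):
    """Z_j = (Z - a_j) * I(Z > a_j) for each knot a_j."""
    return [max(z - a, 0.0) for a in knots]

def segment_slopes(slope_below_first_knot, slope_changes):
    """Slope of the broken line on each segment: the changes add up knot by knot."""
    return list(accumulate(slope_changes, initial=slope_below_first_knot))

def parabola_extreme_point(beta1, beta2):
    """Value of Z where beta1 * Z + beta2 * Z**2 attains its minimum or maximum."""
    return -beta1 / (2.0 * beta2)

knots = [17.1, 2 * 17.1, 3 * 17.1]

# Spline covariates for a hypothetical subject with bilirubin 40 µmol/L.
spline_terms = linear_spline_covariates(40.0, knots)

# Cox spline coefficients for bilirubin (Table 2.6): slope, then changes in slope.
slopes = segment_slopes(0.0624, [-0.0146, -0.0026, -0.0400])

# Cox quadratic coefficients for bilirubin (Table 2.6).
vertex = parabola_extreme_point(0.0200, -0.000031)

print(spline_terms, slopes, round(vertex, 1))
```

The decreasing segment slopes are another way of seeing the concave shape that the negative quadratic coefficient indicates.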
For albumin, there is a normal range from 35 g/L and up, and we choose s = 1 knot placed at
a1 = 35. For bilirubin, the normal range is 0 to 17.1 µmol/L and we let s = 3 and a1 = 17.1,
a2 = 2 × 17.1, and a3 = 3 × 17.1. Table 2.6 shows the results for both the Cox model and the
Poisson model. It is seen that, for albumin, there is no evidence against linearity in either
model, which is also illustrated in Figure 2.6 where the estimated linear predictors under
linearity and under the spline model are shown for the Poisson case.
Table 2.6 PBC3 trial in liver cirrhosis: Estimated coefficients (and SD) from Cox and Poisson
models modeling the effects of albumin, bilirubin, and log2(bilirubin) using linear splines (S) or as
quadratic (Q); LRT denotes the appropriate likelihood ratio test for linearity. All models included
albumin and bilirubin (modeled as described) and treatment.

                                      Cox model                    Poisson model
     Covariate                      β̂           SD               β̂           SD
S    Albumin                     -0.0854      0.045            -0.0864      0.045
       > 35 g/L                  -0.0557      0.073            -0.0474      0.072
     LRT                         0.60 (1 DF), P = 0.44         0.44 (1 DF), P = 0.51
Q    Albumin                     -0.1295      0.213            -0.1388      0.210
     Albumin²                     0.000195    0.00298           0.000371    0.00293
     LRT                         0.0042 (1 DF), P = 0.95       0.02 (1 DF), P = 0.90
S    Bilirubin                    0.0624      0.062             0.0617      0.062
       > 17.1 µmol/L             -0.0146      0.085            -0.0168      0.085
       > 2 × 17.1 µmol/L         -0.0026      0.053             0.0027      0.053
       > 3 × 17.1 µmol/L         -0.0400      0.026            -0.0428      0.026
     LRT                         24.40 (3 DF), P < 0.001       24.54 (3 DF), P < 0.001
Q    Bilirubin                    0.0200      0.0033            0.0200      0.0033
     Bilirubin²                  -0.000031    0.0000091        -0.000032    0.0000091
     LRT                         12.35 (1 DF), P < 0.001       13.34 (1 DF), P < 0.001
S    log2(bilirubin)              0.201       0.465             0.198       0.466
       > 17.1 µmol/L              0.935       0.915             0.882       0.912
       > 2 × 17.1 µmol/L         -0.386       1.293            -0.234       1.278
       > 3 × 17.1 µmol/L         -0.181       0.988            -0.314       0.971
     LRT                         1.61 (3 DF), P = 0.66         1.71 (3 DF), P = 0.63
Q    log2(bilirubin)              0.582       0.500             0.628       0.498
     (log2(bilirubin))²           0.0072      0.043             0.0016      0.042
     LRT                         0.03 (1 DF), P = 0.87         0.00 (1 DF), P = 0.97

For bilirubin, however, linearity describes the relationship quite poorly as illustrated both by
the likelihood ratio tests and Figure 2.7 (showing the linear predictors for the Poisson model
under linearity and with linear splines). Both the negative coefficients from the models
with quadratic effects and this figure suggest that the effect of bilirubin should rather be
modeled as some concave function. The maximum point for the parabola corresponds to a
bilirubin value of 0.0200/(2 · 0.000031) = 322.6 which is compatible with the figure. The
concave curve could be approximated by a logarithmic curve and Table 2.6 (and Figure 2.8)
show the results after a log2 -transformation and using the same knots. It should be noticed
that any logarithmic transformation would have the same impact on the results, and we
chose the log2 -transformation because it enhances the interpretation, as will be explained
in what follows. Since the linear spline has no systematic deviations from a straight line,
linearity after log-transformation is no longer contraindicated, and Table 2.7 shows the
[Plot: linear predictor (vertical axis) against albumin (horizontal axis), with two curves: effect as linear spline and linear effect.]

Figure 2.6 PBC3 trial in liver cirrhosis: Linear predictor as a function of albumin in two Poisson
models. Both models also included treatment and bilirubin.

Table 2.7 PBC3 trial in liver cirrhosis: Estimated coefficients (and SD) from Cox and Poisson
models with linear effects of albumin and log2(bilirubin).

                                         Cox model            Poisson model
Covariate                               β̂        SD           β̂        SD
Treatment         CyA vs. placebo    -0.574    0.224       -0.546    0.223
Albumin           per 1 g/L          -0.091    0.022       -0.087    0.022
log2(bilirubin)   per doubling        0.665    0.074        0.647    0.073

estimates from Cox and Poisson models including treatment, albumin and log2 (bilirubin).
The interpretation of the Cox-coefficient for the latter covariate is that the hazard increases
by a factor of about exp(0.665) = 1.94 when comparing two subjects where one has twice
the value of bilirubin compared to the other (and similarly for the coefficient from the
Poisson model).
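The ‘per doubling’ interpretation, and the remark that any logarithmic transformation has the same impact, can be checked numerically. The sketch below uses the Cox coefficient 0.665 from Table 2.7; the function name is hypothetical. Because β · log2(x) = (β / ln 2) · ln(x), a model using the natural logarithm would have coefficient β / ln 2 and give identical hazard ratios.

```python
import math

BETA_LOG2_BILIRUBIN = 0.665  # Cox model, Table 2.7

def hazard_ratio_for_fold_change(beta_log2, fold):
    """Hazard ratio comparing two subjects whose covariate values differ by a factor `fold`."""
    return math.exp(beta_log2 * math.log2(fold))

hr_doubling = hazard_ratio_for_fold_change(BETA_LOG2_BILIRUBIN, 2.0)
hr_fourfold = hazard_ratio_for_fold_change(BETA_LOG2_BILIRUBIN, 4.0)

# Equivalent coefficient if the natural logarithm had been used instead.
beta_ln = BETA_LOG2_BILIRUBIN / math.log(2.0)
hr_doubling_ln = math.exp(beta_ln * math.log(2.0))

print(round(hr_doubling, 2), round(hr_fourfold, 2), round(beta_ln, 3))
```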

Checking interactions
In the models from Table 2.7, we will now study potential treatment-covariate interactions.
In Table 2.8, interactions between treatment and, in turn, albumin and log2 (bilirubin) have
been introduced, as follows. The covariate Z (albumin or log2 (bilirubin)) is replaced by two
covariates:
$$Z_{(0)} = \begin{cases} Z & \text{if treatment is placebo,} \\ 0 & \text{if treatment is CyA,} \end{cases}$$
[Plot: linear predictor (vertical axis) against bilirubin (horizontal axis), with two curves: effect as linear spline and linear effect.]

Figure 2.7 PBC3 trial in liver cirrhosis: Linear predictor as a function of bilirubin in two Poisson
models. Both models also included treatment and albumin.

[Plot: linear predictor (vertical axis) against log2(bilirubin) (horizontal axis), with two curves: effect as linear spline and linear effect.]

Figure 2.8 PBC3 trial in liver cirrhosis: Linear predictor as a function of log2 (bilirubin) in two
Poisson models. Both models also included treatment and albumin.
Table 2.8 PBC3 trial in liver cirrhosis: Cox and Poisson models with examination of interaction
between treatment and albumin or log2(bilirubin); LRT denotes the likelihood ratio test for the
hypothesis of no interaction.

                                       Cox model            Poisson model
       Covariate                      β̂        SD           β̂        SD
Z(0)   Albumin, placebo            -0.081    0.034       -0.076    0.034
Z(1)   Albumin, CyA                -0.097    0.027       -0.094    0.028
       LRT                         0.13, P = 0.71        0.17, P = 0.68
Z(0)   log2(bilirubin), placebo     0.726    0.099        0.704    0.097
Z(1)   log2(bilirubin), CyA         0.593    0.106        0.580    0.105
       LRT                         0.86, P = 0.35        0.78, P = 0.38

and
$$Z_{(1)} = \begin{cases} Z & \text{if treatment is CyA,} \\ 0 & \text{if treatment is placebo.} \end{cases}$$
The model, additionally, includes a main effect of treatment. However, since the interpre-
tation of this is the hazard ratio for treatment when Z = 0, its parameter estimate was not
included in the table. From Table 2.8 it is seen that, for both models, the interactions are
quite small both judged from the separate coefficients in the two treatment groups and the
corresponding likelihood ratio tests. If a more satisfactory interpretation of the main effect
of treatment is required, then the Z in the definition of the interaction covariates Z(0) and
Z(1) can be replaced by a centered covariate, Z − Z̄, where Z̄ is, e.g., an average Z-value. In
that case, the main effect of treatment is the hazard ratio at Z = Z̄. Since centering does not
change the coefficients for Z(0) and Z(1) , we did not make this modification in the analysis.
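The claim that centering leaves the coefficients for Z(0) and Z(1) unchanged can be verified with a short calculation. The notation below is introduced only for this illustration: T is the treatment indicator (1 for CyA, 0 for placebo), so that $Z_{(0)} = Z(1-T)$ and $Z_{(1)} = ZT$, and $\gamma_0$, $\gamma_1$, $\beta_T^{c}$ denote the coefficients of the centered model.

$$\beta_T^{c}\,T + \gamma_0 (Z - \bar{Z})(1 - T) + \gamma_1 (Z - \bar{Z})\,T = -\gamma_0 \bar{Z} + \bigl(\beta_T^{c} - (\gamma_1 - \gamma_0)\bar{Z}\bigr)\,T + \gamma_0 Z_{(0)} + \gamma_1 Z_{(1)}.$$

The constant $-\gamma_0\bar{Z}$ is absorbed into the baseline hazard, the coefficients $\gamma_0$ and $\gamma_1$ are the same as in the uncentered model, and only the main effect of treatment changes, from $\beta_T^{c}$ (the log hazard ratio at $Z = \bar{Z}$) to $\beta_T^{c} - (\gamma_1 - \gamma_0)\bar{Z}$ (the log hazard ratio at $Z = 0$).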

Checking proportional hazards

Both Cox and Poisson regression models impose the additional assumption of proportional
hazards, i.e., the multiplicative effect of any covariate is constant over time. Since this
corresponds to no interaction between covariate and time and since ‘time’ in the Poisson
model enters as a categorical explanatory variable, tests for no interaction are applicable
for examining proportional hazards in that model.
For the PBC3 trial, introducing such interactions leads to likelihood ratio tests (all with 2
DF since there are 3 time intervals) for the three covariates in the model, see Table 2.9. It
is seen that proportional hazards give a reasonable description of how all covariates affect
the hazard over time in the Poisson model.
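The P-values attached to likelihood ratio statistics such as those in Table 2.9 can be reproduced approximately, assuming (as is standard, though not spelled out here) that each statistic is referred to a chi-squared distribution with the stated DF. For 1 and 2 DF the tail probability has a closed form, which is all this minimal sketch handles; the function name is hypothetical.

```python
import math

def lrt_p_value(statistic, df):
    """Upper tail probability of a chi-squared distribution with 1 or 2 DF."""
    if df == 1:
        return math.erfc(math.sqrt(statistic / 2.0))
    if df == 2:
        return math.exp(-statistic / 2.0)
    raise ValueError("only 1 or 2 DF are handled in this sketch")

# Statistics from Table 2.9 (2 DF each): treatment, albumin, log2(bilirubin).
for name, statistic in [("treatment", 0.43), ("albumin", 1.80), ("log2(bilirubin)", 1.58)]:
    print(f"{name}: P = {lrt_p_value(statistic, 2):.2f}")
```

Small discrepancies in the last digit can occur because the statistics in the tables are themselves rounded.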
In the Cox model, the time effect is modeled via the non-parametric baseline hazard, and
examination of proportional hazards requires special techniques. We will return to a more
detailed discussion of such methods later in the book (e.g., Sections 3.7 and 5.7) and here
just mention a graphical technique (based on a stratified Cox model).
Table 2.9 PBC3 trial in liver cirrhosis: Examination of proportional hazards in a Poisson model
including treatment, albumin, and log2(bilirubin); LRT denotes the likelihood ratio test.

                      Treatment            Albumin           log2(bilirubin)
Interval (year)      β̂        SD          β̂        SD          β̂        SD
0-1               -0.562    0.291      -0.110    0.028       0.710    0.093
2-3               -0.462    0.345      -0.052    0.035       0.558    0.121
4-5               -1.266    1.230      -0.065    0.153       0.305    0.482
LRT               0.43, P = 0.81       1.80, P = 0.41        1.58, P = 0.45

For the Cox model, an alternative to assuming proportional hazards for treatment is to
stratify by treatment, leading to the stratified Cox model

$$\alpha_i(t) = \alpha_{j0}(t)\exp(\mathrm{LP}), \quad \text{when } i \text{ is in stratum } j. \tag{2.3}$$

Here, the linear predictor no longer includes treatment and the stratum is j = 0 for placebo
treated patients and j = 1 for patients from the CyA group. The effect of treatment is
via the two separate baseline hazards α00 (t) for placebo and α10 (t) for CyA, and these
two baseline hazards are not assumed to be proportional. Rather, like the baseline hazard
in the unstratified Cox model, they are completely unspecified. Figure 2.9 illustrates the
assumptions behind the stratified Cox model for two strata and one binary covariate Z both
on the hazard and the log(hazard) scale.
By estimating the cumulative baseline hazards A00(t) and A10(t) separately, the proportional
hazards assumption may be investigated. This may be done graphically by plotting
Â10(t) against Â00(t) where, under proportional hazards, the resulting curve should be close
to a straight line through the point (0, 0) with a slope equal to the hazard ratio for treatment.
Note that, in Equation (2.3), the effect of the linear predictor (i.e., of the variables albumin
and log2 (bilirubin)) is the same in both treatment groups. Inference for this model builds on
a stratified Cox log partial likelihood where there are separate risk sets for the two treatment
groups (details to be given in Section 3.3).
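The reason a straight line through the origin is expected follows in one step. Writing θ for the hazard ratio for treatment (a symbol introduced only for this illustration), proportional hazards means $\alpha_{10}(t) = \theta\,\alpha_{00}(t)$, so that

$$A_{10}(t) = \int_0^t \alpha_{10}(u)\,du = \theta \int_0^t \alpha_{00}(u)\,du = \theta\,A_{00}(t).$$

A plot of $A_{10}(t)$ against $A_{00}(t)$ therefore has slope θ and passes through (0, 0); systematic curvature in the plot of the estimates suggests that the ratio of the two baseline hazards changes over time.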
We fitted the stratified Cox model to the PBC3 data which resulted in coefficients (SD)
−0.090 (0.022) for albumin and 0.663 (0.075) for log2 (bilirubin) close to what we have
seen before. Figure 2.10 shows the goodness-of-fit plot and suggests that proportional haz-
ards for treatment fits the PBC3 data well. The slope of the straight line in the plot is the
estimated hazard ratio for treatment exp(−0.574) found in the unstratified model. Similar
investigations could be done for albumin and log2 (bilirubin); however, for these quantitative
covariates, one would need a categorization in order to create the strata. For such covari-
ates, other ways of examining proportional hazards are better suited and will be discussed
in Sections 3.7 and 5.7.

2.2.3 Cox versus Poisson models

In the PBC3 trial, almost identical results were found for the Cox models and the cor-
responding Poisson models. This is not surprising because, in this example, the hazard
seemed rather time-constant. However, the similarity between the two types of model tends
[Two panels: (a) hazard scale; (b) log(hazard) scale.]

Figure 2.9 Illustrations of the assumptions for a stratified Cox model for two strata and one binary
covariate Z.
[Plot: cumulative baseline hazard for CyA (vertical axis) against cumulative baseline hazard for placebo (horizontal axis).]

Figure 2.10 PBC3 trial in liver cirrhosis: Cumulative baseline hazard for CyA plotted against
cumulative baseline hazard for placebo in a stratified Cox model. The straight line has slope
exp(−0.574) = 0.563.

to hold quite generally (depending, though, to some extent on how the time-intervals for the
Poisson model are chosen). This is because any (hazard) function may be approximated by
a piece-wise constant function. Given this fact, the choice between the two types of model
is rather a matter of convenience. Some pros and cons may be given:
• In the Cox model a choice of time-intervals is not needed.
• For the Poisson model, estimates of the effect of the time-variable (the piece-wise constant
baseline hazard) are given together with covariate estimates. In the Cox model, the
(cumulative) baseline hazard needs special consideration.
• In the Poisson model, examination of proportional hazards is an integrated part of the
analysis requiring no special techniques.
• Some problems may involve several time-variables (e.g., Example 1.1.4). For the Cox
model, one of these must be selected as the ‘baseline’ time-variable, and the others can
then be included as time-dependent covariates (Section 3.7). For the Poisson model,
several (categorized) time-variables may be accounted for simultaneously.
• The Poisson model with categorical covariates may be fitted to a tabulated (and typically
much smaller) data set (Sections 3.4 and 3.6.4).
2.2.4 Additive regression models
Both the Cox model and the Poisson model resulted in hazard ratios as measures of the
association between a covariate and the hazard function. Furthermore, we saw in the PBC3
example (and tried to argue beyond that study in Section 2.2.3) that these two multiplicative
models were so closely related that the resulting hazard ratios from either would be similar.
However, other hazard regression models exist and may sometimes provide a better fit to
a given data set and/or provide estimates with a more useful and direct interpretation. One
such class of models is the class of additive hazard models among which the Aalen model
(Aalen, 1989) is the most frequently used. In this model, the hazard function for a subject
with covariates (Z1 , Z2 , . . . , Z p ) is given by the sum

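$$\alpha(t) = \alpha_0(t) + \mathrm{LP}(t),$$

where the (now time-dependent) linear predictor is

$$\mathrm{LP}(t) = \beta_1(t) Z_1 + \cdots + \beta_p(t) Z_p.$$

Here, the regression function $\beta_j(t)$ is the hazard difference at time t for two subjects differing
by 1 unit in $Z_j$ and having identical values of the remaining covariates:

$$\beta_j(t) = \bigl(\alpha_0(t) + \beta_1(t) Z_1 + \cdots + \beta_j(t)(Z_j + 1) + \cdots + \beta_p(t) Z_p\bigr) - \bigl(\alpha_0(t) + \beta_1(t) Z_1 + \cdots + \beta_j(t) Z_j + \cdots + \beta_p(t) Z_p\bigr)$$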

(see also Section 1.2.5). In the Cox model, the cumulative baseline hazard could be estimated
using the Breslow estimator. Likewise, in the Aalen model the cumulative baseline
hazard and the cumulative regression functions

$$A_0(t) = \int_0^t \alpha_0(u)\,du, \quad B_1(t) = \int_0^t \beta_1(u)\,du, \quad \ldots, \quad B_p(t) = \int_0^t \beta_p(u)\,du$$

can be estimated. More specifically, the change in (A0(t), B1(t), ..., Bp(t)) at an observed
event time, X, is estimated by multiple linear regression. The subjects who enter this linear
regression are those who are at risk at time X, the outcome is 1 for the subject with an
event and 0 for the others, and this outcome is regressed linearly on the covariates for those
subjects. The resulting estimators Â0(t), B̂1(t), ..., B̂p(t) at time t are obtained by adding up
the estimated changes for event times X ≤ t. Since the estimated change at an event time,
X, need not be positive, plots of the estimates B̂j(t) against t need not be increasing.
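For a single binary covariate, the linear regression at each event time reduces to group means, which makes the procedure easy to sketch: the intercept is the proportion failing among those at risk with Z = 0, and the slope is the difference between the proportions failing with Z = 1 and Z = 0. The code below is an illustration under that simplifying assumption (one binary covariate, no tied event times); the function name and the five-subject data set are hypothetical.

```python
def aalen_binary(data):
    """Cumulative baseline hazard and cumulative regression function, binary covariate.

    `data` holds (time, event indicator, z) with z in {0, 1}.
    Returns a list of (event time, A0_hat, B1_hat).
    """
    a0 = 0.0
    b1 = 0.0
    path = []
    for t in sorted({time for time, event, _ in data if event}):
        at_risk = [(time, event, z) for time, event, z in data if time >= t]
        y = [sum(1 for _, _, z in at_risk if z == g) for g in (0, 1)]
        if min(y) == 0:
            break  # the regression on intercept and z is no longer identifiable
        d = [sum(1 for time, event, z in at_risk if z == g and event and time == t)
             for g in (0, 1)]
        a0 += d[0] / y[0]                # estimated change in A0: intercept
        b1 += d[1] / y[1] - d[0] / y[0]  # estimated change in B1: slope
        path.append((t, a0, b1))
    return path

# Hypothetical data: (time, event indicator, z).
toy = [(1, 1, 0), (3, 0, 0), (4, 1, 0), (2, 1, 1), (5, 0, 1)]
path = aalen_binary(toy)
for t, a0_hat, b1_hat in path:
    print(t, round(a0_hat, 3), round(b1_hat, 3))
```

The estimate of B1 moves down at event times in the Z = 0 group and up at event times in the Z = 1 group, which is why the curve need not be increasing.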
Figure 2.11 shows both the estimated cumulative baseline hazard and the estimated cu-
mulative treatment effect in an Aalen model for the PBC3 data including treatment as the
only covariate. The estimated treatment effect is equipped with 95% point-wise confidence
limits. It is seen that the cumulative baseline hazard is roughly linear suggesting (as we
have seen in previous analyses) that the baseline hazard is roughly constant. In this model
including only one binary explanatory variable, the estimated cumulative baseline hazard
[Two panels: cumulative baseline hazard and cumulative treatment effect, each plotted against time since randomization (years).]

Figure 2.11 PBC3 trial in liver cirrhosis: Estimated cumulative baseline hazard and cumulative
regression function for treatment (with 95% point-wise confidence limits) in an additive hazard
model.

is the Nelson-Aalen estimate in the placebo group, cf. Figure 2.1. The cumulative treatment
effect, B̂1(t), is (judged from the confidence limits) close to 0, still in accordance with
previous analyses. The estimator in this model is the difference between the Nelson-Aalen
estimators for the CyA group and the placebo group, see Figure 2.1. A significance test
derived from the model confirms that there is no evidence against a null effect of treatment
in this model (P = 0.75).

It is seen that the Aalen model is very flexible, including a completely unspecified
(non-parametric) baseline hazard and covariate effects, and that the estimates from the model are
entire curves that may be hard to communicate (though exp(−B1(t)) is the ratio between
survival functions, see Equation (1.2)). It is, thus, of interest to simplify the model, e.g.,
by restricting the regression functions to be time-constant (e.g., Martinussen and Scheike,
2007, ch. 5). The hypothesis of a time-constant hazard difference β1(t) = β1 may also
be tested within the model and results in a P-value of 0.62 and an estimate β̂1 = −0.0059
(SD = 0.021) per year corresponding to P = 0.78. Note that this coefficient has a ‘per time’
dimension relating to the units in which the time-variable was recorded. Thus, if somehow
10,000 person-years were collected for both the treated group and for the control group
then, according to this estimate, 59 fewer treatment failures are expected in the treated
group.
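The arithmetic behind these two numbers can be sketched briefly. The expected difference in failures is the hazard difference multiplied by the person-years at risk. The P-value is reproduced here under the assumption of a two-sided Wald test (estimate divided by SD, referred to a standard normal distribution); the text does not state which test was used, but the result agrees with the reported value.

```python
import math

beta1_hat = -0.0059   # estimated hazard difference per year
sd = 0.021
person_years = 10_000

# Expected difference in number of failures, treated minus control.
expected_difference = beta1_hat * person_years

# Two-sided Wald P-value: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
z = beta1_hat / sd
p_two_sided = math.erfc(abs(z) / math.sqrt(2.0))

print(round(expected_difference), round(p_two_sided, 2))
```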
The simple additive model with a time-varying treatment effect may be extended with more
covariates like albumin and bilirubin. This leads to the estimated cumulative regression
functions shown in Figure 2.12. To interpret such a curve one should (as it was the case for
the Nelson-Aalen estimator) focus on its local slope which at time t is the approximate haz-
ard difference at that time when comparing subjects differing by one unit of the covariate.
It is seen that these slopes are generally negative for treatment and albumin and positive for
bilirubin in accordance with earlier results. To enhance readability of the figure, confidence
limits have been excluded. However, significance tests for the three regression functions
(Table 2.10) show that both of the biochemical variables are quite significant, but treatment
is not. Inspection of Figure 2.12 suggests that the regression functions are roughly constant
[Three panels: cumulative treatment effect, cumulative albumin effect, and cumulative bilirubin effect, each plotted against time since randomization (years).]

Figure 2.12 PBC3 trial in liver cirrhosis: Estimated cumulative regression functions for treatment,
albumin, and bilirubin in an additive hazard model.

(roughly linear estimated cumulative effects) and this is also what formal significance tests
indicate (Table 2.10).
Even though there is no evidence against a constant effect for any of the three covariates,
we first consider a model with a constant effect of treatment (Z) and time-varying effects
of albumin (Z2) and bilirubin (Z3)

$$\alpha(t) = \alpha_0(t) + \beta_1 Z + \beta_2(t) Z_2 + \beta_3(t) Z_3.$$

Here, the estimated hazard difference for treatment is β̂1 = −0.040 (SD = 0.020) per year,
P = 0.05. This model imposes the assumptions for the linear predictor of linear effects
of albumin and bilirubin and no interactions between the included covariates, now
Table 2.10 PBC3 trial in liver cirrhosis: P-values and estimated coefficients (and SD) from additive
hazard models including treatment and linear effects of albumin and bilirubin.

              P-value for          P-value for         Estimated constant effect per year
Covariate     covariate effect     constant effect          β̂           SD
Treatment         0.112                0.69              -0.041       0.020
Albumin           0.006                0.96              -0.0084      0.0022
Bilirubin        <0.001                0.16               0.0023      0.00038

assumptions referring to the additive hazard scale. These assumptions may be tested using
standard methods. Adding quadratic effects of, respectively, albumin and bilirubin to the
model with a constant effect of treatment results in P-values for linearity of 0.065 for
albumin and 0.05 for bilirubin. There seems to be no strong evidence against linearity. In the
models including quadratic effects, the estimated (constant) treatment effects are,
respectively, β̂1 = −0.042 (0.020) and β̂1 = −0.040 (0.021) (per year). This means that more
flexible models for the biochemical variables do not substantially change the estimated
treatment effect. A test for no interaction was performed in the model where all three
covariate effects are constant (Table 2.10, last column). This was done along the same lines as
for the Cox model and identified no important interactions between treatment and albumin
(P = 0.76) and between treatment and bilirubin (P = 0.08).
Like the Cox model, an additive hazard regression model having some or all regres-
sion functions constant (β j (t) = β j ) and an unspecified baseline hazard α0 (t) is semi-
parametric.

2.2.5 Additive versus multiplicative models

From the analyses of the PBC3 data, we have seen that both a Cox model with a non-
parametric baseline and time-constant hazard ratios and an additive model with a non-
parametric baseline and time-constant hazard differences provide reasonable fits to the data.
This is in spite of the fact that the two models are mathematically incompatible (unless
the effects of the quantitative covariates are null and the baseline hazard is constant) and
should be interpreted to the effect that the methods used for investigating lack-of-fit may not
be sufficiently effective to detect minor model departures. A potential advantage of using
additive hazards models is that the coefficients relate more directly to absolute differences
in event occurrence than coefficients from multiplicative models for which the impact on
absolute event occurrence depends strongly on the magnitude of the baseline hazard.
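The dependence on the baseline hazard can be made explicit for a binary covariate Z. Under a multiplicative model with coefficient β, the hazard difference between subjects with Z = 1 and Z = 0 is

$$\alpha_0(t)\exp(\beta) - \alpha_0(t) = \alpha_0(t)\bigl(\exp(\beta) - 1\bigr),$$

whereas under an additive model with a time-constant effect it is simply $\beta_1$, whatever the baseline hazard. As a purely hypothetical illustration, a hazard ratio of exp(β) = 0.5 corresponds to 5 fewer events per 100 person-years when $\alpha_0 = 0.10$ per year, but only 0.5 fewer events per 100 person-years when $\alpha_0 = 0.01$ per year.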
The multiplicative models assuming either a non-parametric baseline (Cox) or a piece-
wise constant baseline (Poisson) were quite similar in terms of their estimates, and one
may wonder if an additive hazard model with a piece-wise constant baseline could also be
studied? The answer is ‘yes’, however, the algorithms for fitting such a model may be quite
sensitive to details concerning starting values to avoid negative predicted hazards during
the iterations. A simple model, including only treatment (Z) is

$$\alpha(t) = \alpha_0(t) + \beta_1 Z$$

where

$$\alpha_0(t) = \begin{cases} \alpha_1 & \text{if } t < 2, \\ \alpha_2 & \text{if } 2 \le t < 4, \\ \alpha_3 & \text{if } 4 \le t. \end{cases}$$

Table 2.11 PBC3 trial in liver cirrhosis: Estimated time-constant coefficients and SD (per year)
from an additive hazard model with piece-wise constant baseline hazard including treatment and
linear effects of albumin and bilirubin.

Covariate         β̂           SD
Treatment      -0.050       0.062
Albumin        -0.0083      0.0048
Bilirubin       0.0020      0.00064

The estimates in this model are: α̂1 = 0.091 (0.016), α̂2 = 0.132 (0.024), α̂3 = 0.094 (0.046)
(all rates per year, quite close to the similar estimates from the multiplicative Poisson
model) and β̂1 = −0.0073 (0.021) (per year, quite close to the insignificant treatment
effect in the Aalen model with a time-constant hazard difference). When including the two
biochemical variables, however, the convergence of the algorithm is questionable but
nevertheless leads to estimates quite similar to those from the time-constant Aalen model but
with larger estimated standard deviations, see Table 2.11.

2.3 Delayed entry

So far, in this chapter, we have shown how models for a single intensity may be set up,
analyzed, and checked. As an illustrating example, the PBC3 trial was used with focus on the
composite end-point ‘failure of medical treatment’. The PBC3 trial was a randomized trial
and, therefore, the time origin for the intensity models was naturally chosen to be the time
of randomization and all subjects were observed and followed from this time origin. Recall
Example 1.1.2 in which children in Guinea-Bissau were visited by a mobile team and
followed for about six months at which time deaths and migrations in the intermediate period
were ascertained at a second visit by the team. To relate the mortality rate to vaccinations at
baseline, a Cox model may be used with time at first visit as the time origin and censoring
for emigrations and end of follow-up (alive at time of second visit). Table 2.12 shows
estimated regression coefficients in a model including an indicator for being BCG vaccinated
at baseline and age at baseline as a quantitative variable. It is seen that BCG vaccination
is associated with a reduced mortality rate while age at baseline is formally statistically
insignificant. However, because of its confounding effect, adjustment for age at baseline
may still be reasonable when estimating the BCG effect: Unadjusted for age, the latter is
β̂ = −0.282 (0.135), somewhat different from the adjusted estimate of −0.353 (Table 2.12).
Figure 2.13 shows the estimated cumulative hazard for an unvaccinated child aged 3 months
at baseline. It is seen that the curve is roughly linear and, therefore, the estimated baseline
hazard is a roughly constant function of follow-up time. This illustrates a general feature
of observational studies like the Bissau study, namely that, as discussed in Section 1.2.1,
recruitment into such a study is not an important event in the life of the participants. There-
fore, time since recruitment (follow-up time in the Bissau study) may be unlikely to affect
Table 2.12 Guinea-Bissau childhood vaccination study: Estimated coefficients (and SD) from Cox
models using either follow-up time or age as the time-variable.

                          Follow-up time              Age
Covariate                 β̂         SD            β̂         SD
BCG    yes vs. no      -0.353     0.144        -0.356     0.141
Age    months           0.055     0.038

[Plot: cumulative hazard (vertical axis) against follow-up time in months (horizontal axis).]

Figure 2.13 Guinea-Bissau childhood vaccination study: Estimated cumulative hazard in a Cox
model for a BCG-unvaccinated child aged 3 months at baseline. Time origin is time of first visit.

the intensity of future events and to use the non-parametric baseline hazard in a Cox model
to model its effect may not be an optimal use of that model’s possibilities. In the Bissau
study and in many other observational studies, a general alternative to using time since re-
cruitment as the time-variable for intensity models is to use (current) age. That is, the time
origin is now time of birth, time of entry is age at entry and time of event or censoring is
the corresponding age. Inference in this case has to be performed taking delayed entry into
account.
It turns out that intensity modeling, as discussed earlier in this chapter, carries through with
simple modifications. Thus, the following types of model may be studied for the Bissau
data. If, for the ith child, ai is the age at entry, then the mortality rate at age a, a > ai,
could be α1(a) if the child was BCG vaccinated before age ai and α0(a) otherwise. If these
rates are not further specified then their cumulatives can be estimated using the
Nelson-Aalen estimator and they may be non-parametrically compared using the logrank test (with
delayed entry). In both cases, all that is needed is a modification of the risk set at any age a
[Plot: cumulative hazard (vertical axis) against age in months (horizontal axis).]

Figure 2.14 Guinea-Bissau childhood vaccination study: Breslow estimate of the cumulative base-
line hazard in a Cox model with current age as time-variable, i.e., time origin is time of birth.

which should now be those children, i for whom ai < a and ai +Xi ≥ a where, as previously,
Xi is time (since entry) of event or censoring. That is, the risk set at age a includes those
subjects who have an age at entry smaller than a and for whom event or censoring has
not yet occurred (see Figure 1.8b). Similarly, a Cox model αi (a) = α0 (a) exp(β Zi ) may
be fitted using the same risk set modification, and inference for a model with a piece-wise
constant intensity should be based on numbers of events and person-years at risk in suitably
chosen age intervals. Also, additive hazard models may be adapted to handle delayed entry
using similar risk set modifications.
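
The risk set modification described above can be sketched in a few lines of Python. This is a minimal illustration, not code from the study: the function name `risk_set` and the small data set are hypothetical, and ages and follow-up times are assumed to be measured in months.

```python
def risk_set(age, entry_ages, followup_times):
    """Indices i of subjects at risk at a given age under delayed entry.

    Subject i is at risk at age a when a_i < a <= a_i + X_i, where a_i is
    the age at entry and X_i the time since entry of event or censoring.
    """
    return [
        i
        for i, (a_i, x_i) in enumerate(zip(entry_ages, followup_times))
        if a_i < age <= a_i + x_i
    ]


# Hypothetical data: ages at entry and follow-up times X_i, both in months.
entry_ages = [0.5, 2.0, 3.0, 4.5]
followup_times = [6.0, 1.0, 6.0, 2.0]

at_risk_at_3 = risk_set(3.0, entry_ages, followup_times)
at_risk_at_5 = risk_set(5.0, entry_ages, followup_times)
```

The child entering at exactly 3 months is not yet counted at age 3, in line with the strict inequality ai < a.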
Table 2.12 also shows the estimated BCG coefficient in a Cox model with age as the time-
variable and it is seen to be very similar to that from the model using follow-up time. Figure
2.14 shows the Breslow estimate of the cumulative baseline hazard (i.e., for an unvaccinated
child) in this model. The curve appears slightly convex (bending upwards) in accordance
with the positive coefficient for age at entry in the Cox model with follow-up time as time-
variable (Table 2.12). This figure also illustrates another feature of intensity models for
data with delayed entry, namely that, for small values of time (here, ages less than about 1
month) there may be few subjects at risk and, thereby, few or no events. As a consequence,
cumulative hazard plots may appear flat (as here) or have big steps. In such situations, one
may choose to present plots where the time axis begins in a suitable value a0 > 0 of age
where the risk set is sufficiently large – a choice that has no impact on the shape of the curve
(i.e., the estimated hazard itself) for values of a > a0 .
In Section 1.3, the concept of independent censoring was discussed. Recall that the meaning
of that concept is that, at any time t, subjects who were censored at that time should have the
same failure rate as those who survived beyond t. Handling delayed entry as just described
relies on a similar assumption of independent delayed entry. The assumption is here that,
at any time t, subjects who enter at that time should have the same failure rate as those who
were observed to be at risk prior to t. The consequence of independent delayed entry is
that, at any time t, the observed risk set at that time is representative for all subjects in the
population who are still alive at t.

Time zero
When analyzing multi-state survival data, a time zero must be chosen. For random-
ized studies, the time of randomization where participants fulfill the relevant in-
clusion criteria and where treatment is initiated is the standard choice. In clinical
follow-up studies there may also be an initiating event that defines the inclusion into
the study and may, therefore, serve as time zero. In observational studies, the time of
entry into the study is not necessarily a suitable time zero because that date may not
be the time of any important event in the life time of the participating subjects. In
such cases, an alternative time axis to be used is (current) age. For this, subjects are
not always followed from the corresponding time zero (age at birth), and data may
be observed with delayed entry. This means that subjects are only included into the
study conditionally on being alive and event-free at the time of entry into the study.
Analysis of intensities in the presence of delayed entry requires a modification of
the risk sets, and care must be taken when the risk set, as a consequence, becomes
small for small values of age.

2.4 Competing risks

We have seen how the intensity of a single event may be analyzed with main emphasis on
the composite end-point ‘failure of medical treatment’ in the PBC3 trial. However, com-
pletely similar inference can be made for the two cause-specific hazards in that study, i.e.,
for the events ‘transplantation’ and ‘death without transplantation’. As explained in the
introduction to this chapter, each of these cause-specific hazards may be analyzed by, for-
mally, censoring for the competing cause (and for the avoidable censoring events drop-out
and end of study). We will now study Cox models for the PBC3 trial, separately for these
two end-points which, together, constitute the composite end-point. Table 2.13 shows the
results together with a re-analysis of the composite end-point, now also including the co-
variates sex and age. These covariates did not have different distributions in the two treat-
ment groups and were not included in previous analyses. It is seen that treatment, sex,
albumin, and bilirubin have quite similar effects on the two competing end-points, and
these effects are then emphasized when studying the composite end-point. This illustrates
what we discussed in Section 1.3, namely that transplantation was primarily offered to pa-
tients with a poor prognosis (e.g., low albumin and high bilirubin). However, looking at the
age effects we see that the situation is different: High age means a higher hazard of death,
but high age is also associated with a lower rate of transplantation. This means that, even
though transplantation was primarily offered to patients with a poor prognosis, it was less of
an option if the patient was old. Overall, the composite end-point is more common for older
Table 2.13 PBC3 trial in liver cirrhosis: Estimated coefficients (and SD) from Cox models for death
without transplantation, transplantation, and failure of medical treatment, respectively.

Event type            Covariate                              β̂        SD

Death without         Treatment          CyA vs. placebo   -0.420    0.268
transplantation       Albumin            per 1 g/L         -0.070    0.029
                      log2(Bilirubin)    per doubling       0.692    0.093
                      Sex                male vs. female   -0.486    0.319
                      Age                per year           0.073    0.016

Transplantation       Treatment          CyA vs. placebo   -0.673    0.413
                      Albumin            per 1 g/L         -0.094    0.039
                      log2(Bilirubin)    per doubling       0.832    0.147
                      Sex                male vs. female   -0.204    0.563
                      Age                per year          -0.048    0.021

Failure of medical    Treatment          CyA vs. placebo   -0.510    0.223
treatment             Albumin            per 1 g/L         -0.071    0.023
                      log2(Bilirubin)    per doubling       0.738    0.078
                      Sex                male vs. female   -0.585    0.267
                      Age                per year           0.031    0.012

patients because there were more deaths than transplantations. Such patterns can of course
only be observed when actually separating the components of the composite end-point.
Note that, because the hazard for the composite end-point, α(t) is the sum α1 (t) + α2 (t)
of the cause-specific hazards, a Cox model for α(t) may be mathematically incompatible
with Cox models for the cause-specific hazards. However, such a potential lack of fit may
not be sufficiently serious to invalidate conclusions.
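
The incompatibility can be seen in a small numerical sketch. The baseline hazards and coefficients below are hypothetical and chosen only for illustration: each cause-specific hazard follows a Cox-type model for a binary covariate Z, but with different baseline shapes and different coefficients, and the resulting hazard ratio for the composite end-point then varies with time.

```python
import math


def cause_specific_hazards(t, z, beta1=0.7, beta2=-0.5):
    """Two proportional cause-specific hazards (hypothetical values)."""
    alpha1 = 0.10 * math.exp(beta1 * z)      # constant baseline hazard
    alpha2 = 0.02 * t * math.exp(beta2 * z)  # linearly increasing baseline
    return alpha1, alpha2


def composite_hazard_ratio(t):
    """Hazard ratio (Z = 1 vs. Z = 0) for the composite end-point at time t."""
    return sum(cause_specific_hazards(t, 1)) / sum(cause_specific_hazards(t, 0))


# The ratio is not constant in t, so the composite hazard is not
# a proportional hazards model in Z.
ratios = [composite_hazard_ratio(t) for t in (1.0, 5.0, 20.0)]
```

In this sketch the composite hazard ratio moves from close to exp(0.7) at early times towards exp(−0.5) at late times, as the second cause comes to dominate the sum.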
Readers familiar with the Fine-Gray regression model may wonder why that model is not
discussed under this section’s heading of competing risks. The explanation is that the Fine-
Gray model is not a hazard-based model, and we will, therefore, postpone the discussion of
this model to Chapters 4 and 5 where the focus is on models for marginal parameters, such
as the cumulative incidence function.

2.5 Recurrent events

Similar remarks as those in the previous section go for the other examples introduced in
Section 1.1 where each transition intensity may be analyzed separately. This also includes
the recurrent events in Examples 1.1.5 and 1.1.6, see Figures 1.4 and 1.5. However, in these
examples several transitions correspond to occurrence of the same event – an affective
episode (a psychiatric admission) in Example 1.1.5 and a major cardiovascular event or
a myocardial infarction in Example 1.1.6. It may seem more natural to model a general
admission intensity (from state 0 to state 1 in Figure 1.4) than setting up separate models
for the first, second, etc. occurrence of the event (e.g., 0 → 1, 1 → 2, . . . transitions in Figure
1.5). We will illustrate these points using the data from both of these examples.
2.5.1 Recurrent episodes in affective disorders
Table 2.14 (left column with heading ‘Time since diagnosis’) shows estimated regression
coefficients for bipolar versus unipolar disease based on separate Cox models for first,
second, third, and fourth re-admission. The time-variable was time since initial diagnosis,
so, models are fitted with delayed entry since a patient is only at risk for admission no.
h = 1, 2, 3, 4 after discharge from admission no. h − 1. It is seen that bipolar patients gen-
erally have increased re-admission intensities; however, there is a large variability among
the estimates for different values of the number of episodes. As an alternative to using
time since initial diagnosis as the time-variable, one may use time since latest discharge
from psychiatric hospital (known as gap time models) in which case there is no delayed
entry. Table 2.14 (right column) also shows estimated regression coefficients for bipolar
versus unipolar disease based on separate Cox models for first, second, third, and fourth
re-admission using the gap time-variable. Results are quite similar to those obtained using
time since initial diagnosis. For both sets of models, we see both a considerable variation
among different values of h and an increasing SD with increasing h. The latter is due to the
diminishing number of patients at risk for later admissions.
It is, therefore, of interest to try to summarize the effects across repeated events. One way
of doing this is via the model

α01 (t | Z) = α01,0 (t) exp(β Z)

with a single baseline 0 → 1 transition intensity and a common effect of the covariate
Z (Z = 1 for bipolar patients and Z = 0 for unipolar patients). Table 2.14 (left column)
also shows the resulting β̂ summarizing the h-specific estimates into a single effect, and
leading to a markedly reduced SD (all re-admissions are studied here, not just the first
four episodes). This model is a simple version of what is sometimes referred to as the AG
model, so named after Andersen and Gill (1982), see, e.g., Amorim and Cai (2015), where
re-admissions for any patient are independent of previous episodes. More involved models
for these data will be studied in later sections (e.g., Section 3.7.5). A way of accounting for
previous episodes is to stratify, leading to the model

α01h (t | Z) = α01,h0 (t) exp(β Z)

for episode no. h = 1, 2, 3, · · · where there are separate baseline hazards (α01,h0 (t) for
episode no. h) but only a single regression coefficient for Z. This model (known as the
PWP model after Prentice, Williams and Peterson, 1981, see Amorim and Cai, 2015) is
seen to provide a smaller coefficient for Z than the AG model (Table 2.14, bottom line).
This is because, by taking the number of previous episodes into account (here, via stratifi-
cation), some of the discrepancy between bipolar and unipolar patients disappears since the
occurrence of repeated episodes is itself affected by the initial diagnosis. Similar models
(AG and PWP) may be set up for gap times and the results are also shown in the right col-
umn of Table 2.14. These models rely on the assumption that gap times are independent,
not only among patients – a standard assumption – but also within patients. The latter as-
sumption may not be reasonably fulfilled and we will in later sections (e.g., Sections 3.7.5
and 3.9) discuss models where this assumption is relaxed. A discussion of the use of gap
time models for recurrent events was given by Hougaard (2022), recommending (at least
Table 2.14 Recurrent episodes of affective disorder: Estimated coefficients (and SD) from Cox mod-
els per episode, AG model, and PWP model for bipolar vs. unipolar disease.

                           Time since diagnosis      Gap time model

Model        Episode         β̂        SD             β̂        SD
Cox model    1             0.356    0.250          0.399    0.249
             2             0.189    0.260          0.217    0.258
             3            -0.117    0.301         -0.111    0.287
             4             1.150    0.354          0.596    0.318
AG model                   0.366    0.094          0.126    0.094
PWP model                  0.242    0.112          0.028    0.100

for data from randomized trials) not to use gap time models. A further complication arising
when studying recurrent events is that censoring may depend on the number of previous
events (see, e.g., Cook and Lawless, 2007; sect. 2.6).
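
The models of this section differ mainly in how the same at-risk periods are presented to the fitting routine. The following sketch shows one possible layout; the function name, the field names, and the example patient are hypothetical. Each period out of hospital becomes one row with an episode number (the stratum in a PWP-type model), an entry and an exit time on the time-since-diagnosis scale (delayed entry), and the corresponding gap time.

```python
def recurrent_event_rows(periods):
    """Turn at-risk periods into one row per episode.

    periods: list of (start, stop, event) on the time-since-diagnosis scale,
    where start is the time of the latest discharge, stop is the time of
    re-admission or censoring, and event is 1 for a re-admission.
    """
    rows = []
    for h, (start, stop, event) in enumerate(periods, start=1):
        rows.append(
            {
                "episode": h,         # stratum for a PWP-type model
                "start": start,       # delayed entry time for episode h
                "stop": stop,         # time of re-admission or censoring
                "gap": stop - start,  # time since latest discharge
                "event": event,
            }
        )
    return rows


# Hypothetical patient: two re-admissions, then censored while out of hospital.
rows = recurrent_event_rows([(0.3, 2.5, 1), (3.0, 7.0, 1), (7.5, 10.0, 0)])
```

An AG-type model would use the rows with (start, stop] as risk intervals and ignore the episode number, a PWP-type model would additionally stratify on it, and a gap time model would use the gap as the time-variable without delayed entry.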

2.5.2 LEADER cardiovascular trial in type 2 diabetes

Analyses similar to those discussed in the previous section were presented for recurrent
myocardial infarctions (MI) in the LEADER trial (Example 1.1.6) by Furberg et al. (2022).
Figure 2.15 shows the Nelson-Aalen estimates for the cumulative hazards, and Table 2.15

[Figure 2.15 plot: cumulative hazard (vertical axis, 0.00 to 0.12) against time since randomization in months (horizontal axis, 0 to 60); separate curves for liraglutide and placebo.]

Figure 2.15 LEADER cardiovascular trial in type 2 diabetes: Estimated cumulative hazards
(Nelson-Aalen estimates) of recurrent myocardial infarctions by treatment group.
Table 2.15 LEADER cardiovascular trial in type 2 diabetes: Estimated coefficients (and SD) from
models for the hazard of recurrent myocardial infarctions for liraglutide vs. placebo.

Model                                  β̂        SD
Cox model    1st event              -0.159    0.080
AG model     Cox type               -0.164    0.072
             Piece-wise constant    -0.164    0.072
PWP model    2nd event              -0.047    0.197
             3rd event              -0.023    0.400
             4th event               0.629    0.737
             5th event              -0.429    1.230
             All events             -0.130    0.072

summarizes results from regression models. The coefficients for liraglutide versus placebo
from a Cox model for time to first event and from an AG model are quite similar; however,
with the latter having a smaller SD. Notice that a Cox type model and a model with a piece-
wise constant intensity give virtually identical results. As in the previous example, we can
see that estimates from PWP-type models with event-dependent coefficients become highly
variable, and that the coefficient from a PWP model with a common effect among different
event numbers gets numerically smaller compared to that from an AG model. One may
argue that the PWP models with event-dependent coefficients are not relevant for estimating
the treatment effect in a randomized trial because, for later event numbers, patients are no
longer directly comparable between the two treatment groups due to different selection into
the groups when the treatment, indeed, has an effect.

Intensity, hazard, rate

The basic parameter in multi-state models is the intensity (also called hazard or
rate). The intensity describes what happens locally in time and conditionally on the
past among those who are at risk for a given type of event at that time. Specification
of all intensities allows simulation of realizations of the multi-state process.
Intensities may typically be analyzed one transition at a time using methods as de-
scribed in this chapter. These include:
• Non-parametric estimation using the Nelson-Aalen estimator.
• Parametric estimation using a model with a piece-wise constant hazard.
• Multiplicative regression models:
– Cox model with a non-parametric baseline hazard (including the AG model
for recurrent events).
– Poisson model with a piece-wise constant baseline hazard.
• Aalen (additive) regression model.
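
The remark that specification of all intensities allows simulation of the process can be illustrated with a minimal sketch. It assumes an illness-death model without recovery (states 0 = healthy, 1 = ill, 2 = dead) and constant transition intensities; the function name and the numerical values are hypothetical.

```python
import random


def simulate_illness_death(alpha01, alpha02, alpha12, rng):
    """Simulate one path of an illness-death process with constant intensities.

    Returns a list of (time, state) pairs, starting in state 0 at time 0
    and ending in the absorbing state 2.
    """
    path = [(0.0, 0)]
    total = alpha01 + alpha02
    # Time of leaving state 0 is exponential with rate alpha01 + alpha02.
    t = rng.expovariate(total)
    # The transition is 0 -> 1 with probability alpha01 / (alpha01 + alpha02).
    if rng.random() < alpha01 / total:
        path.append((t, 1))
        # Time spent in state 1 is exponential with rate alpha12.
        t += rng.expovariate(alpha12)
    path.append((t, 2))
    return path


rng = random.Random(2024)
paths = [simulate_illness_death(0.10, 0.05, 0.30, rng) for _ in range(200)]
```

With time-varying intensities the exponential waiting times would have to be replaced by draws from the appropriate waiting time distributions.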
2.6 Exercises

Exercise 2.1 Consider the data from the Copenhagen Holter study (Example 1.1.8).
1. Estimate non-parametrically the cumulative hazards of death for subjects with or without
ESVEA.
2. Make a non-parametric test for comparison of the two.
3. Make a similar analysis based on a model where the hazard is assumed constant within
5-year intervals.

Exercise 2.2 Consider the data from the Copenhagen Holter study.
1. Make a version of the data set enabling an analysis of the composite end-point of stroke
or death without stroke (‘stroke-free survival’, i.e., define the relevant Time and Status
variables), see Section 1.2.4.
2. Estimate non-parametrically the cumulative hazards of stroke-free survival for subjects
with or without ESVEA.
3. Make a non-parametric test for comparison of the two.
4. Make a similar analysis based on a model where the hazard is assumed constant within
5-year intervals.

Exercise 2.3 Consider the data from the Copenhagen Holter study and the composite end-
point stroke-free survival.
1. Fit a Cox model and estimate the hazard ratio between subjects with or without ESVEA.
2. Fit a Poisson model where the hazard is assumed constant within 5-year intervals and
estimate the hazard ratio between subjects with or without ESVEA.
3. Compare the results from the two models.

Exercise 2.4 Consider the data from the Copenhagen Holter study and the composite end-
point stroke-free survival.
1. Fit a Cox model and estimate the hazard ratio between subjects with or without ESVEA,
now also adjusting for sex, age, and systolic blood pressure (sysBP).
2. Fit a Poisson model where the hazard is assumed constant within 5-year intervals and
estimate the hazard ratio between subjects with or without ESVEA, now also adjusting
for sex, age, and sysBP.
3. Compare the results from the two models.
Exercise 2.5
1. Check the Cox model from the previous exercise by examining proportional hazards
between subjects with or without ESVEA and between men and women.
2. Check for linearity on the log(hazard)-scale for age and sysBP.
3. Do the same for the Poisson model.

Exercise 2.6 Consider the data from the Copenhagen Holter study and focus now on the
mortality rate after stroke.
1. Estimate non-parametrically the cumulative hazards for subjects with or without ESVEA
using the time-variable ‘time since recruitment’.
2. Assume proportional hazards and estimate the hazard ratio between subjects with or
without ESVEA.
3. Repeat these two questions using now the time-variable ‘time since stroke’ and compare
the results.

Exercise 2.7
1. Consider the data from the Copenhagen Holter study and fit Cox models for the cause-
specific hazards for the outcomes stroke and death without stroke including ESVEA,
sex, age, and sysBP.
2. Compare with the results from Exercise 2.4 (first question).

Exercise 2.8 Consider the data on repeated episodes in affective disorder, Example 1.1.5.
1. Estimate non-parametrically the cumulative event intensities for unipolar and bipolar
patients.
2. Fit an AG-type model and estimate, thereby, the ratio between intensities for unipolar
and bipolar patients, adjusting for year of diagnosis.
3. Fit a PWP model and estimate, thereby, the ratio between intensities for unipolar and
bipolar patients, adjusting for year of diagnosis.
4. Compare the results from the two models.
Chapter 3

Intensity models

This chapter gives some of the mathematical details behind the intensity models introduced
in Chapter 2. The corresponding sections are marked with ‘(*)’. The chapter also contains
a less technical section dealing with examples (Section 3.6) and, finally, time-dependent
covariates (Section 3.7) and models with shared parameters (Section 3.8) are introduced, as
well as random effects (frailty) models for situations where an assumption of independence
among observations may not be justified (Section 3.9).

3.1 Likelihood function (*)

In Section 1.4, we described the way in which data obtained by observing a multi-state
process V (t) could be represented using counting processes. Thus, for each pair of states
h, j, h ≠ j, the process Nh j (t) counts the number of observed direct h → j transitions in [0,t].
Further, the distribution of each Nh j (t) is dynamically described by the intensity process

λh j (t) ≈ E(dNh j (t) | Ht− )/dt = P(dNh j (t) = 1 | Ht− )/dt

which, under independent censoring, equals

λh j (t) = αh j (t)Yh (t).

Here, Yh (t) = I(V (t−) = h) is the indicator for being in state h just before time t and
the transition intensity αh j (t) is some non-negative function of the past (Ht− ) and of
time t containing parameters (say, θ ) to be estimated. Estimation is based on observing
(Nh ji (t),Yhi (t)) over some time interval for n independent subjects i = 1, . . . , n possibly,
together with covariates Z i for those subjects. As explained in Section 1.4, the right-hand
end-point of the time interval of observation for subject i is either an observed time, Ci of
right-censoring or it is a point in time (say, Ti ) where the multi-state process Vi (t) reaches
an absorbing state, e.g., when subject i dies. We will denote the time interval of observa-
tion for subject i by [0, Xi ] with Xi = Ti ∧ Ci . There may also be delayed entry (Sections
1.2 and 2.3), i.e., the observation of subject i begins at a later time point Bi > 0 and i is
only observed conditionally on not having reached an absorbing state by time Bi . For the
moment, we will discuss the case of no delayed entry and return to the general case below.
We assume throughout that both censoring and delayed entry are independent (Sections 1.3
and 2.3), possibly given relevant covariates.

For a given state space S = {0, 1, . . . , k} and a given transition structure as indicated by
the box-and-arrow diagrams of Chapter 1, there is a certain number of possible transitions
and we index these by v = 1, . . . , K. Splitting the time interval of observation, t ≤ Xi , into
(small) sub-intervals, each of length dt > 0 we have, for each subject, a (long!) sequence
of multinomial experiments

$$\Bigl(1;\ dN_{1i}(t), \ldots, dN_{Ki}(t),\ 1 - \sum_v dN_{vi}(t)\Bigr).$$

These correspond to events of type v = 1, . . . , K or to no event between time t and t + dt and
have probability parameters conditionally on the past Ht− given by the intensity processes

$$\Bigl(\lambda_{1i}(t)\,dt, \ldots, \lambda_{Ki}(t)\,dt,\ 1 - \sum_v \lambda_{vi}(t)\,dt\Bigr).$$

The index parameter of the multinomial distribution equals 1 because, in continuous time,
at most 1 event can happen at time t. The probability, conditionally on the past Ht− , of ob-
serving a given configuration (dN1 (t), . . . , dNK (t)) of events at time t, where either exactly
one of the dNv (t) equals 1 or they are all equal to 0, is then
$$L_i(t) = \bigl(\lambda_{1i}(t)\,dt\bigr)^{dN_{1i}(t)} \cdots \bigl(\lambda_{Ki}(t)\,dt\bigr)^{dN_{Ki}(t)} \times \Bigl(1 - \sum_v \lambda_{vi}(t)\,dt\Bigr)^{1-\sum_v dN_{vi}(t)}$$

and, therefore, the contribution to the likelihood function from subject i is the product

$$L_i = \prod_{t \le X_i} L_i(t)$$

over all such intervals for t ≤ Xi . There will only be a finite number of intervals with an
event and letting dt → 0, the last factor

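$$\prod_{t \le X_i} \Bigl(1 - \sum_v \lambda_{vi}(t)\,dt\Bigr)^{1-\sum_v dN_{vi}(t)}$$

reduces to the product-integral $\exp\bigl(-\sum_v \int_0^{X_i} \lambda_{vi}(u)\,du\bigr)$ (Andersen et al., 1993, ch. II). We
will discuss the product-integral in more detail in Section 5.1. The likelihood contribution
from subject i then becomes

$$L_i = \prod_{t \le X_i} \bigl(\lambda_{1i}(t)\,dt\bigr)^{dN_{1i}(t)} \cdots \bigl(\lambda_{Ki}(t)\,dt\bigr)^{dN_{Ki}(t)} \exp\Bigl(-\sum_v \int_0^{X_i} \lambda_{vi}(u)\,du\Bigr). \tag{3.1}$$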

Equation (3.1) is the Jacod formula for the likelihood based on observing a multivariate
counting process for subject i, (Nvi (t), t ≤ Xi ; v = 1, . . . , K) (Andersen et al., 1993, ch. II).
For independent subjects, the overall likelihood is

$$L = \prod_i L_i.$$

Some brief remarks are in order here; for more details, the reader is referred to Andersen et
al. (1993; ch. III).
1. In the case of delayed entry at time Bi > 0, the vth observed counting process for subject
i is
$$\int_{B_i}^{t} dN_v(u) = N_v(t) - N_v(B_i)$$
and, in order to compute the associated likelihood contribution Li in Equation (3.1), it
must be assumed that the intensity λvi (t) for t > Bi does not depend on the unobserved
past from the time interval [0, Bi ). Such an assumption must be discussed on a case-by-
case basis.
2. There may be covariates, Z i to be included in the model. These could be time-fixed
covariates, observed at time of entry into the study, and in that case the intensities in the
likelihood are conditional on these covariates. Here, it is assumed that the distribution of
the covariates carries no information on the parameters of interest (i.e., the θ ’s implicitly
appearing in the αv (t) functions). There could also be time-dependent covariates, Z i (t)
that are functions of the past history Ht− including covariates observed at entry – known
as adapted time-dependent covariates (see Section 3.7.1). In some cases there are (non-
adapted) time-dependent covariates that depend on other information on each subject in
which case the likelihood is more complicated. This and other aspects of time-dependent
covariates will be discussed in Section 3.7.
3. We assume throughout that censoring is independent, i.e., (Section 1.3) the multi-state
process is conditionally independent of censoring times for given covariates. Under this
assumption, observation of censoring times gives rise to factors in the likelihood, and the
factors arising from observation of the (censored) multi-state process (the Jacod formula)
will depend correctly on the parameter θ from the complete population.
4. It has been assumed in (3.1) that the factor in the likelihood arising from observation of
the time of censoring, Ci for subject i does not depend on the parameters, θ . Following
Kalbfleisch and Prentice (1980, ch. 5), we will denote this assumption non-informative
censoring. Note the difference between this concept and that of independent censoring
discussed in Section 1.3. What was there defined as independent censoring is in some
texts, somewhat confusingly, denoted non-informative censoring, but we have, like An-
dersen et al. (1993, ch. III), followed the notation introduced by Kalbfleisch and Prentice.
Similar remarks go for the observation of a potential time, Bi of delayed entry.
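5. The likelihood (3.1) factorizes into a product over transition types v
$$\begin{aligned}
L_i &= \prod_{t \le X_i} \bigl(\lambda_{1i}(t)\,dt\bigr)^{dN_{1i}(t)} \cdots \bigl(\lambda_{Ki}(t)\,dt\bigr)^{dN_{Ki}(t)} \times \exp\Bigl(-\sum_v \int_0^{X_i} \lambda_{vi}(u)\,du\Bigr) \\
&= \prod_v \Bigl[\exp\Bigl(-\int_0^{X_i} \lambda_{vi}(u)\,du\Bigr) \prod_{t \le X_i} \bigl(\lambda_{vi}(t)\,dt\bigr)^{dN_{vi}(t)}\Bigr].
\end{aligned} \tag{3.2}$$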

This has the consequence that data may be analyzed by considering one transition at a
time (unless some parameters of interest are shared between several transition types – a
situation that will be discussed in Section 3.8).
6. For the special case of survival data (Figure 1.1) there is only one type of event and the
index v may be omitted. There is also at most one event for subject i (at time Xi ) and in
this case the likelihood contribution from subject i reduces to
$$L_i = \exp\Bigl(-\int_0^{X_i} \lambda_i(u)\,du\Bigr) \bigl(\lambda_i(X_i)\bigr)^{dN_i(X_i)}.$$
With Di = dNi (Xi ), and fi and Si , respectively, the density and survival function for the
survival time Ti , this becomes
$$L_i = f_i(X_i)^{D_i}\, S_i(X_i)^{1-D_i}.$$
It is seen that, for each v, the factor (3.2) in the general multi-state likelihood has the
same structure as the likelihood contribution from a subject, i in the two-state model
(however, possibly with more than one jump for each subject).
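
The two expressions for the survival data likelihood in remark 6 can be checked against each other numerically. The sketch below assumes a constant hazard (an exponential survival time), a choice made here only because the density and the survival function are then available in closed form; the function names are hypothetical.

```python
import math


def loglik_intensity(x, d, lam):
    """log of exp(-integral of the hazard over [0, x]) * hazard**d, constant hazard lam."""
    return -lam * x + d * math.log(lam)


def loglik_density_survival(x, d, lam):
    """log of f(x)**d * S(x)**(1 - d) for an exponential survival time."""
    f = lam * math.exp(-lam * x)  # density
    s = math.exp(-lam * x)        # survival function
    return d * math.log(f) + (1 - d) * math.log(s)


# One subject with an observed event (d = 1) and one censored subject (d = 0).
event_pair = (loglik_intensity(2.5, 1, 0.4), loglik_density_survival(2.5, 1, 0.4))
censored_pair = (loglik_intensity(2.5, 0, 0.4), loglik_density_survival(2.5, 0, 0.4))
```

Both pairs agree, reflecting that f = αS and that S is the exponential of minus the cumulative hazard.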

Likelihood factorization
Intensities are modeled in continuous time and at most one event can happen at any
given time. This has the consequence that the likelihood function (the Jacod for-
mula) factorizes and intensities may be modeled one at a time.
Under the assumption of independent censoring, the multi-state process is condi-
tionally independent of the censoring times given covariates. This has the conse-
quence that, as long as the covariates that create conditional independence are ac-
counted for, the censoring times will also give rise to separate factors in the likeli-
hood, and intensities in the multi-state process may be analyzed without specifying
a model for censoring.

As a consequence of the likelihood factorization, models for the αv (t) can be set up for one
type of transition v (∼ h → j) at a time. Possible such models (to be discussed in detail
in the following sections) include (3.3)-(3.6) below, where for ease of notation we have
dropped the index v. First,
α(·) is completely unspecified (3.3)
(Section 3.2). Here, the time-variable needs specification, the major choices being the base-
line time-variable t, or current duration in the state h, i.e., d = t − Th where Th (as in Section
1.4) denotes the time of entry into state h. In both cases, α(·) may depend on categorical
covariates by stratification, and we shall see in Section 3.2 how non-parametric estimation of
the cumulative hazard $A(t) = \int_0^t \alpha(u)\,du$ (using the Nelson-Aalen estimator) is performed.
α(·) is piece-wise constant (3.4)
(Section 3.4). The time-variable needs specification and, in contrast to (3.3), it is now
assumed that time is split into a number, L, of intervals given by 0 = s0 < s1 < · · · < sL = ∞
and in each interval α(·) is constant
$$\alpha(u) = \alpha_\ell \quad \text{for } s_{\ell-1} \le u < s_\ell.$$
The likelihood, as we shall see in Section 3.4, now leads to estimating $\alpha_\ell$ by
occurrence/exposure rates. As it was the case for (3.3), the hazard may depend on a categorical
covariate using stratification.
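
Occurrence/exposure rates for a piece-wise constant hazard can be computed as in the following sketch. The function name and the example records are hypothetical; each record is (entry time, exit time, event indicator), which also accommodates delayed entry. Since a single time point carries no exposure, an event at exactly a cut point is here assigned to the interval ending at that point.

```python
def occurrence_exposure_rates(records, cuts):
    """Events, time at risk, and their ratio in each interval.

    records: list of (entry, exit, event) with entry < exit.
    cuts: interval start points s_0 = 0 < s_1 < ...; the last interval
    is open-ended (its upper limit is infinity).
    """
    n = len(cuts)
    events = [0] * n
    exposure = [0.0] * n
    for entry, exit_, event in records:
        for k in range(n):
            lower = cuts[k]
            upper = cuts[k + 1] if k + 1 < n else float("inf")
            # Time at risk spent by this subject inside the interval.
            overlap = min(exit_, upper) - max(entry, lower)
            if overlap > 0:
                exposure[k] += overlap
            if event and lower < exit_ <= upper:
                events[k] += 1
    rates = [d / y if y > 0 else float("nan") for d, y in zip(events, exposure)]
    return events, exposure, rates


# Hypothetical data with two intervals, [0, 5) and [5, infinity).
records = [(0.0, 3.0, 1), (0.0, 7.0, 0), (2.0, 6.0, 1)]
events, exposure, rates = occurrence_exposure_rates(records, [0.0, 5.0])
```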
$$\alpha(\cdot \mid Z) = \alpha_0(\cdot) \exp(\beta^{\mathsf T} Z) \qquad \text{Cox regression model} \tag{3.5}$$
(Section 3.3). In (3.5), the hazard is allowed to depend on time-fixed covariates, Z assuming
that all hazards are proportional and the baseline hazard, α0 (·) is completely unspecified
(as in (3.3)). The resulting Cox regression model is said to be semi-parametric because it
contains both the finite parameter vector β = (β1 , . . . , β p )T and the non-parametric compo-
nent α0 (·). This model may be generalized by allowing the covariates to depend on time.
Thus, in a recurrent events process (Example 1.1.5) the intensity of events can be modeled
as depending on the number of previous events, see Section 3.7. Though the Cox model
(3.5) is the most frequently used regression model, alternatives do exist. Thus (3.4) and
(3.5) may be combined into a multiplicative hazard regression model with a piece-wise
constant baseline hazard, known as piece-wise exponential or Poisson regression (Section
3.4). Finally, as an alternative to the multiplicative model (3.5), an additive hazard model

α(· | Z ) = α0 (·) + β (·)T Z (3.6)

may be studied (Section 3.5). Here, all regression functions β j (·) may be time-dependent
(leading to the Aalen model) or some or all of the β j (·) may be time-constant. The baseline
hazard α0 (·) is typically taken to be unspecified like in the Cox model, an alternative being
to assume it to be piece-wise constant.

3.2 Non-parametric models (*)

3.2.1 Nelson-Aalen estimator (*)
In this section, we consider a single type of event and the model for the intensity process
for the associated counting process Ni (t) is

λi (t) = α(t)Yi (t)

with Yi (t) = I(Bi < t ≤ Xi ) for some entry time Bi ≥ 0. That is, there is a common hazard,
α(t) for all subjects and this hazard is not further specified. In other words, we have a non-
parametric model for the intensity. The likelihood based on observing independent subjects
i = 1, . . . , n is
$$L = \prod_i \Bigl(\prod_t \bigl(\alpha(t) Y_i(t)\bigr)^{dN_i(t)}\Bigr) \exp\Bigl(-\int_0^\infty \alpha(u) Y_i(u)\,du\Bigr),$$
leading to the log-likelihood

$$\log(L) = \sum_i \Bigl(\int_0^\infty \log\bigl(\alpha(t) Y_i(t)\bigr)\,dN_i(t) - \int_0^\infty \alpha(t) Y_i(t)\,dt\Bigr). \tag{3.7}$$

Here, integrals are written with a lower limit of 0 and an upper limit equal to ∞ because the
indicators Yi (t) take care of the proper range of integration, (Bi , Xi ] for subject i. Formally,
the derivative with respect to a single α(t) is
$$\sum_i \Bigl(\frac{dN_i(t)}{\alpha(t)} - Y_i(t)\,dt\Bigr) \tag{3.8}$$

leading to the score equation

$$\sum_i \bigl(dN_i(t) - Y_i(t)\,\alpha(t)\,dt\bigr) = 0 \tag{3.9}$$
with the solution
$$\widehat{\alpha(t)\,dt} = \frac{\sum_i dN_i(t)}{\sum_i Y_i(t)}.$$
Thus, the Nelson-Aalen estimator for the cumulative hazard $A(t) = \int_0^t \alpha(u)\,du$ is
$$\widehat{A}(t) = \int_0^t \frac{\sum_i dN_i(u)}{\sum_i Y_i(u)} \tag{3.10}$$

and it has a maximum likelihood interpretation. Note that ∑i Yi (t) is the number of subjects
observed to be at risk at time t and if X is an observed event time for (N1 (t), . . . , Nn (t)) then,
for each such time, a term 1/(∑i Yi (X)) is added to the estimator. On a plot of Â(t) against
t, the approximate local slope estimates the hazard at that point in time, see Figure 2.1 for
an example. Note that this slope may be estimated directly by smoothing the Nelson-Aalen
estimator (e.g., Andersen et al., 1993; ch. IV).
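
The Nelson-Aalen estimator (3.10) is a sum over observed event times of the number of events divided by the number at risk. A minimal sketch, with a hypothetical function name and hypothetical data, and with the at-risk indicator Yi (t) = I(Bi < t ≤ Xi ) handling delayed entry:

```python
def nelson_aalen(entry, exit_, event):
    """Nelson-Aalen estimate of the cumulative hazard.

    entry, exit_, event: per-subject entry time B_i, exit time X_i, and
    event indicator. Returns a list of (time, dN, Y, A_hat), one element
    per distinct observed event time.
    """
    event_times = sorted({x for x, d in zip(exit_, event) if d})
    cum_hazard = 0.0
    steps = []
    for t in event_times:
        # Y(t): number of subjects with B_i < t <= X_i.
        at_risk = sum(1 for b, x in zip(entry, exit_) if b < t <= x)
        # dN(t): number of events observed exactly at time t.
        n_events = sum(1 for x, d in zip(exit_, event) if d and x == t)
        cum_hazard += n_events / at_risk
        steps.append((t, n_events, at_risk, cum_hazard))
    return steps


# Hypothetical data: four subjects, two of them with delayed entry.
steps = nelson_aalen(
    entry=[0.0, 0.0, 1.0, 2.0],
    exit_=[2.0, 3.0, 4.0, 5.0],
    event=[1, 0, 1, 1],
)
```

Here the risk sets at the three event times have sizes 3, 2, and 1, so the estimate increases by 1/3, 1/2, and 1.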

3.2.2 Inference (*)

If we let N = ∑i Ni , Y = ∑i Yi then, by the Doob-Meyer decomposition (1.21) we have that
$$\begin{aligned}
\widehat{A}(t) - \int_0^t I(Y(u) > 0)\,\alpha(u)\,du
&= \int_0^t I(Y(u) > 0)\,\frac{\alpha(u) Y(u)\,du + dM(u)}{Y(u)} - \int_0^t I(Y(u) > 0)\,\alpha(u)\,du \\
&= \int_0^t I(Y(u) > 0)\,\frac{dM(u)}{Y(u)}
\end{aligned}$$

showing that the Nelson-Aalen estimator minus its target parameter (slightly modified be-
cause of the possibility that the risk set may become empty, i.e., Y (u) = 0) is a martingale
integral and, thereby, itself a martingale (Andersen et al., 1993, ch. IV). From this, point-
wise confidence limits for A(t) may be based on asymptotic normality for martingales and
on the following model-based variance estimator. By formally taking the derivative of (3.8)
with respect to α(s), we get 0 for s ≠ t and −dN(t)/α(t)² for s = t. Plugging in the Nelson-Aalen
increments dN(t)/Y (t), this leads to the following estimator for var($\widehat{A}(t)$)

$$\widehat{\sigma}^2(t) = \int_0^t \frac{dN(u)}{(Y(u))^2}. \tag{3.11}$$

Point-wise confidence limits based on (3.11) typically build on symmetric confidence limits
for log(A(t)), i.e., a 95% confidence interval for A(t) is

from $\widehat{A}(t)\exp\bigl(-1.96\,\widehat{\sigma}(t)/\widehat{A}(t)\bigr)$ to $\widehat{A}(t)\exp\bigl(1.96\,\widehat{\sigma}(t)/\widehat{A}(t)\bigr)$.

Simultaneous confidence bands can also be constructed (e.g., Andersen et al., 1993; ch.
IV).
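As a small numerical illustration of (3.11) and the log-transformed confidence limits, the following sketch takes the number of events and the number at risk at each event time as input. The function name `nelson_aalen_ci` and the input values are hypothetical, and NumPy is assumed.

```python
import numpy as np

def nelson_aalen_ci(d, y, z=1.96):
    """Nelson-Aalen estimate with log-transformed point-wise confidence limits.

    d[k] = number of events and y[k] = number at risk at the k-th event time.
    """
    d = np.asarray(d, dtype=float)
    y = np.asarray(y, dtype=float)
    A_hat = np.cumsum(d / y)                  # increments of (3.10)
    sigma = np.sqrt(np.cumsum(d / y**2))      # square root of (3.11)
    factor = np.exp(z * sigma / A_hat)        # symmetric limits for log(A(t))
    return A_hat, sigma, A_hat / factor, A_hat * factor

# Hypothetical input: one event at each of three times with 4, 4 and 2 at risk.
A_hat, sigma, lower, upper = nelson_aalen_ci(d=[1, 1, 1], y=[4, 4, 2])
print(np.column_stack([A_hat, lower, upper]))
```

Because the limits are symmetric on the log scale, the lower limit is always positive and the product of the two limits equals the squared estimate.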
As mentioned in Section 3.1, stratification according to a categorical covariate, Z is possible
and separate Nelson-Aalen estimators can be calculated for each category. Comparison
among two or more cumulative hazards may be performed using the logrank test or other
non-parametric tests, as follows. For the two-sample test, let $\widehat{A}_j(t) = \int_0^t dN_j(u)/Y_j(u)$ be
the Nelson-Aalen estimator in group j = 0, 1, where N j = ∑i in group j Ni counts failures in
group j and Y j = ∑i in group j Yi is the number at risk in that group. A general class of non-
parametric test statistics for the null hypothesis H0 : A0 (u) = A1 (u), u ≤ t can then be based
on the process

$$\int_0^t K(u) \bigl( d\widehat{A}_1(u) - d\widehat{A}_0(u) \bigr), \tag{3.12}$$
where K(·) is some weight function of the observations in the interval [0,t) which is 0
whenever Y1 (u) = 0 or Y0 (u) = 0. Under H0 and using (1.21), the process (3.12) reduces to
the martingale
$$\int_0^t K(u) \left( \frac{dM_1(u)}{Y_1(u)} - \frac{dM_0(u)}{Y_0(u)} \right),$$
whereby conditions for asymptotic normality under H0 of the test statistic may be found
(Andersen et al., 1993; ch. V), see Exercise 3.1.
The most common choice of weight function is

$$K(t) = \frac{Y_0(t) Y_1(t)}{Y_0(t) + Y_1(t)}$$

leading to the logrank test

$$LR(t) = N_1(t) - \int_0^t \frac{Y_1(u)}{Y_0(u) + Y_1(u)} \bigl( dN_0(u) + dN_1(u) \bigr).$$

Evaluated at t = ∞, this has the interpretation as

LR(∞) = ‘Observed’ - ‘Expected’ (in group 1),

as explained in Section 2.1.3.

The statistic LR(∞) is normalized using the (hypergeometric) variances vi

$$v_i = \frac{Y_0(X_i) Y_1(X_i)}{(Y_0(X_i) + Y_1(X_i))^2} \cdot \frac{Y_0(X_i) + Y_1(X_i) - dN_0(X_i) - dN_1(X_i)}{Y_0(X_i) + Y_1(X_i) - 1} \bigl( dN_0(X_i) + dN_1(X_i) \bigr)$$

added across all observation times (Xi ) to give v. Note that the last factor in vi equals 1
when exactly 1 failure is observed at Xi , i.e., when dN0 (Xi ) + dN1 (Xi ) = 1. The resulting
two-sample logrank test statistic to be evaluated in the $\chi^2_1$-distribution is

$$\frac{LR(\infty)^2}{v}.$$
The logrank test is most powerful against proportional hazards alternatives, i.e., when
α1 (t) = HRα0 (t) for some constant hazard ratio HR, but other choices of weight func-
tion K(t) provide non-parametric tests with other power properties.
Along the same lines, non-parametric tests for comparison of intensities α j (t) among m > 2
groups may be constructed, as well as stratified tests (Andersen et al., 1993, ch. V).
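The two-sample logrank statistic and its hypergeometric variance can be computed directly from the formulas above. The sketch below assumes no delayed entry; the function name `logrank_two_sample` and the toy data are hypothetical, and NumPy is assumed.

```python
import math
import numpy as np

def logrank_two_sample(time, event, group):
    """Return (LR, v): 'observed - expected' in group 1 and its variance."""
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=int)
    group = np.asarray(group, dtype=int)

    lr, v = 0.0, 0.0
    for t in np.unique(time[event == 1]):
        y0 = np.sum((time >= t) & (group == 0))     # Y_0(t)
        y1 = np.sum((time >= t) & (group == 1))     # Y_1(t)
        d0 = np.sum((time == t) & (event == 1) & (group == 0))
        d1 = np.sum((time == t) & (event == 1) & (group == 1))
        y, d = y0 + y1, d0 + d1
        lr += d1 - d * y1 / y                       # observed - expected
        if y > 1:                                   # v_i is 0 when y0 or y1 is 0
            v += d * (y0 * y1 / y**2) * (y - d) / (y - 1)
    return lr, v

# Hypothetical toy data: six subjects, alternating group membership.
time = [1, 2, 3, 4, 5, 6]
event = [1, 1, 1, 1, 0, 1]
group = [1, 0, 1, 0, 1, 0]
lr, v = logrank_two_sample(time, event, group)
chi2 = lr**2 / v
p_value = math.erfc(math.sqrt(chi2 / 2))   # upper tail of chi-square with 1 DF
print(lr, v, chi2, p_value)
```

Swapping the group labels changes the sign of LR(∞) but leaves v and the test statistic unchanged.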
3.3 Cox regression model (*)
Often, analyses of multi-state models involve several covariates, in which case stratification
as discussed in Section 3.2, is no longer feasible and some model specification of how
the covariates affect the hazard is needed. In the Cox regression model (Cox, 1972) this
specification is done using hazard ratios in a multiplicative model while, at the same time,
keeping the way in which time affects the hazard unspecified. Thus, in the Cox model the
hazard function for subject i with covariates Z i = (Zi1 , . . . , Zip )T is

$$\alpha(t \mid \boldsymbol{Z}_i) = \alpha_0(t) \exp(\boldsymbol{\beta}^T \boldsymbol{Z}_i), \tag{3.13}$$
where the baseline hazard α0 (t) is not further specified and the effect of the covariates is
via the linear predictor LPi = β T Z i involving p regression coefficients β = (β1 , . . . , β p )T
(Section 1.2.5). Inference for the baseline hazard and regression coefficients builds on the
Jacod formula
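$$L_i = \exp\left( -\int_0^\infty Y_i(t) \alpha_0(t) \exp(\boldsymbol{\beta}^T \boldsymbol{Z}_i)\,dt \right) \prod_t \bigl( Y_i(t) \alpha_0(t) \exp(\boldsymbol{\beta}^T \boldsymbol{Z}_i) \bigr)^{dN_i(t)}$$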

for the likelihood contribution from subject i (Section 3.1). This leads to the total log-
likelihood
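$$\sum_i \left( \int_0^\infty \log\bigl( Y_i(t) \alpha_0(t) \exp(\boldsymbol{\beta}^T \boldsymbol{Z}_i) \bigr)\,dN_i(t) - \int_0^\infty Y_i(t) \alpha_0(t) \exp(\boldsymbol{\beta}^T \boldsymbol{Z}_i)\,dt \right)$$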

and differentiation with respect to a single α0 (t) (along the same lines as were used when
deriving the Nelson-Aalen estimator in Section 3.2.1) leads to the score equation

$$\sum_i \bigl( dN_i(t) - Y_i(t) \alpha_0(t) \exp(\boldsymbol{\beta}^T \boldsymbol{Z}_i)\,dt \bigr) = 0. \tag{3.14}$$

For fixed β , this has the solution

$$\widehat{\alpha_0(t)\,dt} = \frac{\sum_i dN_i(t)}{\sum_i Y_i(t) \exp(\boldsymbol{\beta}^T \boldsymbol{Z}_i)} \tag{3.15}$$
which is identical to the Nelson-Aalen increments in the case of no covariates (β = 0).
Inserting (3.15) into the likelihood ∏i Li yields the profile likelihood

$$PL(\boldsymbol{\beta}) \times \exp\left( -\int_0^\infty \sum_i dN_i(t) \right) \prod_t \left( \sum_i dN_i(t) \right)^{\sum_i dN_i(t)}$$

where the first factor

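$$PL(\boldsymbol{\beta}) = \prod_i \prod_t \left( \frac{Y_i(t) \exp(\boldsymbol{\beta}^T \boldsymbol{Z}_i)}{\sum_j Y_j(t) \exp(\boldsymbol{\beta}^T \boldsymbol{Z}_j)} \right)^{dN_i(t)} \tag{3.16}$$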
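is the Cox partial likelihood (Cox, 1975) and the second factor does not depend on the
β-parameters. To estimate β, PL(β) is maximized by computing the Cox score

$$\frac{\partial}{\partial \boldsymbol{\beta}} \log(PL(\boldsymbol{\beta})) = \sum_i \int_0^\infty \left( \boldsymbol{Z}_i - \frac{\sum_j Y_j(t) \boldsymbol{Z}_j \exp(\boldsymbol{\beta}^T \boldsymbol{Z}_j)}{\sum_j Y_j(t) \exp(\boldsymbol{\beta}^T \boldsymbol{Z}_j)} \right) dN_i(t) \tag{3.17}$$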
and solving the resulting score equation. This leads to the Cox maximum partial likelihood
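estimator $\widehat{\boldsymbol{\beta}}$ and inserting this into (3.15) yields the Breslow estimator of the cumulative
baseline hazard $A_0(t) = \int_0^t \alpha_0(u)\,du$

$$\widehat{A}_0(t) = \int_0^t \frac{\sum_i dN_i(u)}{\sum_i Y_i(u) \exp(\widehat{\boldsymbol{\beta}}^T \boldsymbol{Z}_i)} \tag{3.18}$$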
(Breslow, 1974). Note that the sums ‘∑ Y j (t)...’ in (3.15)-(3.18) are effectively sums over
the risk set
R(t) = { j : Y j (t) = 1} (3.19)
at time t.
Large-sample inference for $\widehat{\boldsymbol{\beta}}$ may be based on standard likelihood results for PL(β). A
crucial step is to note that, evaluated at the true regression parameter and considered as
a process in t when based on the data from [0,t], (3.17) is a martingale (Andersen and
Gill, 1982; Andersen et al., 1993, ch. VII), see Exercise 3.2. Thus, model-based standard
deviations of $\widehat{\boldsymbol{\beta}}$ may be obtained from the second derivative of log(PL(β)) and the resulting
Wald tests (as well as score- and likelihood ratio tests) are also valid. Also, joint large-sample
inference for $\widehat{\boldsymbol{\beta}}$ and $\widehat{A}_0(t)$ is available (Andersen and Gill, 1982; Andersen et al.,
1993, ch. VII). For a simple model including only a binary covariate, the score test reduces
to the logrank test discussed in Section 3.2.2 (see Exercise 3.3). If the simple model is a
stratified Cox model (see (3.20) below), the score test reduces to a stratified logrank test.
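The estimation steps for the Cox model may be sketched as follows: the score (3.17) and minus the second derivative of log(PL(β)) are accumulated over the risk sets (3.19), β is updated by Newton-Raphson, and the Breslow estimator (3.18) is obtained by plugging in. This is an illustrative sketch only; the function names and toy data are hypothetical, NumPy is assumed, and tied event times are handled as in (3.16).

```python
import numpy as np

def cox_score_info(beta, entry, exit_, event, Z):
    """Cox score (3.17) and minus the second derivative of log PL at beta."""
    w = np.exp(Z @ beta)
    score = np.zeros(len(beta))
    info = np.zeros((len(beta), len(beta)))
    for t in np.unique(exit_[event == 1]):
        risk = (entry < t) & (t <= exit_)            # risk set R(t), see (3.19)
        fail = (exit_ == t) & (event == 1)
        wr, Zr = w[risk], Z[risk]
        s0 = wr.sum()
        zbar = (wr[:, None] * Zr).sum(axis=0) / s0   # weighted covariate average
        s2 = (Zr * wr[:, None]).T @ Zr / s0
        score += Z[fail].sum(axis=0) - fail.sum() * zbar
        info += fail.sum() * (s2 - np.outer(zbar, zbar))
    return score, info

def cox_fit(entry, exit_, event, Z, n_iter=25, tol=1e-10):
    """Newton-Raphson for the score equation; returns (beta_hat, covariance)."""
    beta = np.zeros(Z.shape[1])
    for _ in range(n_iter):
        score, info = cox_score_info(beta, entry, exit_, event, Z)
        step = np.linalg.solve(info, score)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta, np.linalg.inv(cox_score_info(beta, entry, exit_, event, Z)[1])

def breslow(beta, entry, exit_, event, Z):
    """Breslow estimate (3.18) of the cumulative baseline hazard."""
    w = np.exp(Z @ beta)
    times = np.unique(exit_[event == 1])
    jumps = [((exit_ == t) & (event == 1)).sum() / w[(entry < t) & (t <= exit_)].sum()
             for t in times]
    return times, np.cumsum(jumps)

# Hypothetical toy data with a single binary covariate and no delayed entry.
entry = np.zeros(6)
exit_ = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
event = np.array([1, 1, 1, 1, 0, 1])
Z = np.array([1.0, 0.0, 1.0, 0.0, 1.0, 0.0]).reshape(-1, 1)

beta_hat, cov = cox_fit(entry, exit_, event, Z)
times, A0_hat = breslow(beta_hat, entry, exit_, event, Z)
print(beta_hat, np.sqrt(np.diag(cov)), A0_hat)
```

With a single binary covariate, the score evaluated at β = 0 equals the logrank 'observed minus expected' quantity, which illustrates why the score test reduces to the logrank test. With β = 0 the Breslow estimator coincides with the Nelson-Aalen estimator.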
Since the Cox model (3.13) includes a linear predictor LPi = β T Z i , there are general meth-
ods available for checking some of the assumptions imposed in the model, such as linear
effects (on the log(hazard) scale) of quantitative covariates and absence or presence of
interactions between covariates. Examination of properties for the linear predictor was ex-
emplified in Section 2.2.2. A special feature of the Cox model that needs special attention
when examining goodness-of-fit is that of proportional hazards (no interaction between
covariates and time). This is because of the non-parametric modeling of the time effect
via the baseline hazard α0 (t). We have chosen to collect discussions of general methods
for goodness-of-fit examinations for a number of different multi-state models (including
the Cox model) in a separate Section 5.7. However, methods for the Cox model are also
described in connection with the PBC3 example in Section 2.2.2, and in Section 3.7 ex-
amination of the proportional hazards assumption using time-dependent covariates will be
discussed.
A useful extension of (3.13) when non-proportional hazards are detected, say, among the
levels j = 1, . . . , m of a categorical covariate, Z0 , is the stratified Cox model
$$\alpha(t \mid Z_0 = j, \boldsymbol{Z}) = \alpha_{j0}(t) \exp(\boldsymbol{\beta}^T \boldsymbol{Z}), \quad j = 1, \ldots, m. \tag{3.20}$$
In (3.20), there is an unspecified baseline hazard for each stratum, S j given by the level of
Z0 , but the same effect of Z in all strata (‘no interaction between Z0 and Z ’, though this
assumption may be relaxed). Inference still builds on the Jacod formula that now leads to a
stratified Cox partial likelihood for β
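$$PL_s(\boldsymbol{\beta}) = \prod_j \prod_{i \in S_j} \prod_t \left( \frac{Y_i(t) \exp(\boldsymbol{\beta}^T \boldsymbol{Z}_i)}{\sum_{k \in S_j} Y_k(t) \exp(\boldsymbol{\beta}^T \boldsymbol{Z}_k)} \right)^{dN_i(t)} \tag{3.21}$$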
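and a Breslow estimator for $A_{j0}(t) = \int_0^t \alpha_{j0}(u)\,du$

$$\widehat{A}_{j0}(t) = \int_0^t \frac{\sum_{i \in S_j} dN_i(u)}{\sum_{i \in S_j} Y_i(u) \exp(\widehat{\boldsymbol{\beta}}^T \boldsymbol{Z}_i)} \tag{3.22}$$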

where, in both (3.21) and (3.22), i ∈ S j if Z0i = j, see Exercise 3.4.

3.4 Piece-wise constant hazards (*)

An alternative to the non- and semi-parametric models for the hazard studied in Sections
3.2 and 3.3 is a parametric model for α(·). For the special case of survival data (Figure
1.1), many parametric models have been studied in the literature, including the exponen-
tial, Weibull, and other accelerated failure time models, and the Gompertz model (e.g.,
Andersen et al., 1993, ch. VI). For general multi-state models, however, these parametric
specifications are less frequently used, and for that reason we will in this section restrict
attention to the piece-wise constant (or piece-wise exponential) hazard model (3.4) which
is, indeed, useful for a general multi-state process. The situation is, as follows. We consider
a single type of event and the model for the intensity process for the associated counting
process Ni (t) is
λi (t) = α(t)Yi (t)
with Yi (t) = I(Bi < t ≤ Xi ) for some entry time Bi ≥ 0, and α(t) is a common hazard for all
subjects. This hazard is specified as

$$\alpha(t) = \alpha_\ell \quad \text{for } s_{\ell-1} \le t < s_\ell$$

for a number (L) of intervals given by 0 = s0 < s1 < · · · < sL = ∞. Thus, in each interval,
α(·) is assumed to be constant. This model typically provides a reasonable approximation
to any given hazard, it is flexible and, as we shall see shortly, inference for the model is
simple. The model has the drawbacks that the cut-points ($s_\ell$) need to be chosen and that
the resulting hazard is not a smooth function of time. Smooth extensions of the piece-wise
constant hazard model using, e.g., splines have been developed (e.g., Royston and Parmar,
2002) but will not be further discussed here. We will show the estimation details for the
case L = 2, the general case L > 2 is analogous. The starting point is the Jacod formula
(3.1), and the (essential part of the) associated log-likelihood (3.7) now becomes
$$\log(L) = \sum_i \sum_{\ell=1,2} \left( \int_{s_{\ell-1}}^{s_\ell} \log(\alpha_\ell)\,dN_i(t) - \int_{s_{\ell-1}}^{s_\ell} \alpha_\ell Y_i(t)\,dt \right).$$

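With $D_\ell = \sum_i \int_{s_{\ell-1}}^{s_\ell} dN_i(t)$, the total number of observed events in interval $\ell$, and
$Y_\ell = \sum_i \int_{s_{\ell-1}}^{s_\ell} Y_i(t)\,dt$, the total time at risk observed in interval $\ell$, the associated score is

$$\frac{\partial}{\partial \alpha_\ell} \log(L) = \frac{D_\ell}{\alpha_\ell} - Y_\ell$$

leading to the occurrence/exposure rate

$$\widehat{\alpha}_\ell = \frac{D_\ell}{Y_\ell}$$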
being the maximum likelihood estimator. Standard large sample likelihood inference techniques
can be used to show that the pair $(\widehat{\alpha}_1, \widehat{\alpha}_2)^T$ is asymptotically normal with the proper
mean and a covariance matrix based on the derivatives of the score which is estimated by

$$\begin{pmatrix} D_1 / Y_1^2 & 0 \\ 0 & D_2 / Y_2^2 \end{pmatrix}.$$

A crucial step in this derivation is to notice that the score ∂ log(L)/∂ α` is a martingale
when evaluated at the true parameter values and considered as a process in t, when based
on data in [0,t] (Andersen et al., 1993, ch. VI).
By the delta-method, it is seen that the $\log(\widehat{\alpha}_\ell)$ are asymptotically normal with mean
$\log(\alpha_\ell)$ and a standard deviation which may be estimated by $1/\sqrt{D_\ell}$. Furthermore, the
different $\log(\widehat{\alpha}_\ell)$ are asymptotically independent. This result is used for constructing 95%
confidence limits for $\alpha_\ell$ which become

from $\widehat{\alpha}_\ell \exp(-1.96/\sqrt{D_\ell})$ to $\widehat{\alpha}_\ell \exp(1.96/\sqrt{D_\ell})$.
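The occurrence/exposure rates and their confidence limits only require the event counts and the time at risk in each interval. The sketch below computes both from individual records; the function name `piecewise_rates`, the cut-points and the toy data are hypothetical, NumPy is assumed, and the last cut-point is taken to be infinity as in the text.

```python
import numpy as np

def piecewise_rates(entry, exit_, event, cuts):
    """Occurrence/exposure rates D_l / Y_l on the intervals [cuts[l-1], cuts[l])."""
    entry = np.asarray(entry, dtype=float)
    exit_ = np.asarray(exit_, dtype=float)
    event = np.asarray(event, dtype=int)
    cuts = np.asarray(cuts, dtype=float)   # s_0 = 0 < s_1 < ... < s_L = infinity
    lo, hi = cuts[:-1], cuts[1:]

    # Time at risk per subject and interval: overlap of (B_i, X_i] with the interval.
    overlap = np.minimum(exit_[:, None], hi) - np.maximum(entry[:, None], lo)
    Y = np.clip(overlap, 0.0, None).sum(axis=0)

    # Each observed event is assigned to the interval containing X_i.
    idx = np.searchsorted(cuts, exit_, side="right") - 1
    D = np.bincount(idx[event == 1], minlength=len(lo)).astype(float)

    rate = D / Y
    se_log = 1.0 / np.sqrt(D)              # estimated SD of log(rate); needs D > 0
    return D, Y, rate, rate * np.exp(-1.96 * se_log), rate * np.exp(1.96 * se_log)

# Hypothetical toy data with L = 2 intervals split at time 3.
entry = [0, 0, 0, 0, 0, 0]
exit_ = [1, 2, 3, 4, 6, 8]
event = [1, 0, 1, 1, 1, 0]
D, Y, rate, lower, upper = piecewise_rates(entry, exit_, event, cuts=[0, 3, np.inf])
print(D, Y, rate, lower, upper)
```

In the toy data there is 1 event in 15 time units at risk before time 3, and 3 events in 9 time units afterwards.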

Comparison of piece-wise constant hazards in m strata, using the same partition of time,
0 = s0 < s1 < · · · < sL = ∞ in all strata may be performed via the likelihood ratio test which,
under the null hypothesis of equal hazards in all strata, follows an asymptotic $\chi^2_{L(m-1)}$-distribution.
The model with a piece-wise constant hazard can also be used as baseline hazard in a
multiplicative (or additive – see Section 3.5) hazard regression model. The resulting multi-
plicative model is
$$\alpha(t \mid \boldsymbol{Z}_i) = \alpha_0(t) \exp(\boldsymbol{\beta}^T \boldsymbol{Z}_i),$$
with $\alpha_0(t) = \alpha_{0\ell}$ when $s_{\ell-1} \le t < s_\ell$, $\ell = 1, \ldots, L$. Since any baseline hazard function can
be approximated by a piece-wise constant function, the resulting Poisson or piece-wise
exponential regression model is closely related to a Cox model, and we showed in Section
2.2.1 that (for the PBC3 data) results from the two types of model were very similar. The
parameters in this model are estimated via the likelihood function (3.1), and hypothesis
testing for the parameters is also based on this.
The likelihood simplifies if Z i only consists of categorical variables, i.e., if there exist finite
sets C1 , . . . , C p such that Zi j ∈ C j , j = 1, . . . , p. In that case, Z i takes values c , say, in the
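finite set C = C1 × · · · × Cp. Letting $\theta_c = \exp(\boldsymbol{\beta}^T \boldsymbol{Z}_i)$ when $\boldsymbol{Z}_i = c$, the likelihood function
becomes

$$\prod_i \left[ \prod_t (\lambda_i(t))^{dN_i(t)} \right] \exp\left( -\int_0^\infty \lambda_i(t)\,dt \right) = \prod_{\ell=1}^{L} \prod_{c \in C} (\alpha_{0\ell} \theta_c)^{N_{\ell c}} \exp(-\alpha_{0\ell} \theta_c Y_{\ell c})$$

with

$$N_{\ell c} = \sum_{i: \boldsymbol{Z}_i = c} \bigl( N_i(s_\ell) - N_i(s_{\ell-1}) \bigr), \quad Y_{\ell c} = \sum_{i: \boldsymbol{Z}_i = c} \int_{s_{\ell-1}}^{s_\ell} Y_i(t)\,dt,$$

the total number of events in the interval from $s_{\ell-1}$ to $s_\ell$, respectively, the total time at risk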
in that interval among subjects with Z i = c , see Exercise 3.5.
The resulting likelihood is seen to be proportional to the likelihood obtained by, formally,
treating $N_{\ell c}$ as independent Poisson random variables with mean $\alpha_{0\ell} \theta_c Y_{\ell c}$. This fact is the
origin of the name Poisson regression, and it has the consequence that parameters may
be estimated using software for fitting such a Poisson model. However, since there is no
requirement of assuming the $N_{\ell c}$ to be Poisson distributed, this name has caused some
confusion and the resulting model should, perhaps, rather be called piece-wise exponential
regression since it derives from an assumption of piece-wise constant hazards. Another consequence
of this likelihood reduction is that, when fitting the model with only categorical
covariates, data may first be summarized in the cross-tables $(N_{\ell c}, Y_{\ell c})$, $\ell = 1, \ldots, L$, $c \in C$.
For large data sets this may be a considerable data reduction, and we will use this fact when
analyzing the testis cancer incidence data (Example 1.1.3) in Section 3.6.4.
The model, as formulated here, is multiplicative in time and covariates and, thus, assumes
proportional hazards. However, since the categorical time-variable and the covariates enter
the model on equal footings, examination of proportional hazards can be performed by
examining time×covariate interactions. Other aspects of the linear predictor may be tested
in the way described in Section 2.2.2.
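To illustrate how the cross-tables of event counts and time at risk suffice for fitting the model, the sketch below maximizes the Poisson-type likelihood by Newton-Raphson on the log scale, with the logarithm of the time at risk acting as an offset. The function name `poisson_fit`, the starting values and the small cross-table are hypothetical, and NumPy is assumed; in practice, standard software for Poisson regression would be used.

```python
import numpy as np

def poisson_fit(X, counts, exposure, n_iter=50, tol=1e-10):
    """Fit log(mean) = log(exposure) + X @ coef by Newton-Raphson."""
    X = np.asarray(X, dtype=float)
    counts = np.asarray(counts, dtype=float)
    exposure = np.asarray(exposure, dtype=float)
    # Starting values: least squares on the log of slightly shifted crude rates.
    coef = np.linalg.lstsq(X, np.log((counts + 0.5) / exposure), rcond=None)[0]
    for _ in range(n_iter):
        mu = exposure * np.exp(X @ coef)              # alpha_0l * theta_c * Y_lc
        step = np.linalg.solve(X.T @ (X * mu[:, None]), X.T @ (counts - mu))
        coef = coef + step
        if np.max(np.abs(step)) < tol:
            break
    mu = exposure * np.exp(X @ coef)
    return coef, np.linalg.inv(X.T @ (X * mu[:, None]))

# Hypothetical cross-table: L = 2 time intervals by 2 covariate categories.
# Columns of X: indicator of interval 1, indicator of interval 2, covariate.
X = [[1, 0, 0], [1, 0, 1], [0, 1, 0], [0, 1, 1]]
exposure = [100.0, 50.0, 80.0, 40.0]   # Y_lc, total time at risk per cell
counts = [10, 10, 16, 16]              # N_lc, number of events per cell
coef, cov = poisson_fit(X, counts, exposure)
alpha01, alpha02, theta = np.exp(coef)
print(alpha01, alpha02, theta, np.sqrt(cov[2, 2]))
```

The toy table was constructed to be exactly multiplicative, so the fit returns the baseline rates 0.1 and 0.2 and the hazard ratio 2.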

3.5 Additive regression models (*)

Both the Cox model and the multiplicative Poisson model resulted in hazard ratios as
measures of the association between a covariate and the hazard function. Other hazard
regression models exist and may, as explained in Section 2.2.4, sometimes provide a better
fit to a given data set and/or provide estimates with a more useful and direct interpretation.
One such class of models is that of additive hazard models among which the Aalen model
(Aalen, 1989; Andersen et al., 1993, ch. VII) is the most frequently used. In this model, the
hazard function for a subject with covariates Z = (Z1 , . . . , Z p )T is given by

α(t | Z ) = α0 (t) + β1 (t)Z1 + · · · + β p (t)Z p . (3.23)

Here, both the baseline hazard α0 (t) and the regression functions β1 (t), ... , β p (t) are un-
specified functions of time, t. The interpretation of the baseline hazard, like the baseline
hazard α0 (t) in the Cox model, is the hazard function for a subject with a linear predictor
equal to 0, while the value, β j (t), of the jth regression function is the hazard difference at
time t for two subjects who differ by 1 unit in their values for Z j and have identical values
for the remaining covariates, see Section 2.2.4.
In this model, the likelihood is intractable and other methods of estimation are most often
used (though Lu et al., 2023, studied the maximum likelihood estimator under certain
constraints). To this end, we define the cumulative baseline hazard and the cumulative re-
gression functions
$$A_0(t) = \int_0^t \alpha_0(u)\,du, \quad B_j(t) = \int_0^t \beta_j(u)\,du, \quad j = 1, \ldots, p$$

with increments collected in the (p + 1)-column vector, say,

dB(t) = (dA0 (t), dB1 (t), . . . , dB p (t))T

and these functions may be estimated using multiple linear regression, as follows. For each
subject i = 1, . . . , n we represent the outcome as a counting process Ni (t) and collect the in-
crements dNi (t) in the n-column vector dN(t) = (dN1 (t), . . . , dNn (t))T . We, further, define
the n × (p + 1)-matrix Y(t) with ith row given by Yi (t) (1, Zi1 , . . . , Zip ), and the parameters
may be estimated as solutions to the linear regression problem

E(dN(t) | Ht− ) = Y(t)dB(t).

An unweighted least squares solution is

$$\widehat{\mathbf{B}}(t) = \int_0^t I(\mathrm{rank}(\mathbf{Y}(u)) = p + 1)\,(\mathbf{Y}(u)^T \mathbf{Y}(u))^{-1} \mathbf{Y}(u)^T\,d\mathbf{N}(u), \tag{3.24}$$
0

however, more efficient, weighted versions of (3.24) exist (e.g., Martinussen and Scheike,
2006, ch. 5).
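The unweighted least squares estimator (3.24) amounts to solving one small linear regression problem at each event time and accumulating the solutions. A minimal sketch follows, assuming no delayed entry; the function name `aalen_least_squares` and the toy data are hypothetical, and NumPy is assumed.

```python
import numpy as np

def aalen_least_squares(exit_, event, Z):
    """Unweighted least squares estimate (3.24) of (A_0, B_1, ..., B_p)."""
    exit_ = np.asarray(exit_, dtype=float)
    event = np.asarray(event, dtype=int)
    Z = np.asarray(Z, dtype=float).reshape(len(exit_), -1)
    design = np.column_stack([np.ones(len(exit_)), Z])    # rows (1, Z_i1, ..., Z_ip)

    times = np.unique(exit_[event == 1])
    B = np.zeros((len(times), design.shape[1]))
    current = np.zeros(design.shape[1])
    for k, t in enumerate(times):
        Y = design * (exit_ >= t)[:, None]                # i-th row Y_i(t)(1, Z_i)
        dN = ((exit_ == t) & (event == 1)).astype(float)
        if np.linalg.matrix_rank(Y) == design.shape[1]:   # indicator in (3.24)
            current = current + np.linalg.solve(Y.T @ Y, Y.T @ dN)
        B[k] = current
    return times, B

# Hypothetical toy data with a single binary covariate.
exit_ = [1, 2, 3, 4, 5, 6]
event = [1, 1, 1, 1, 0, 1]
Z = [1, 0, 1, 0, 1, 0]
times, B = aalen_least_squares(exit_, event, Z)
print(times)
print(B)
```

With a single binary covariate, the first column of `B` is the Nelson-Aalen estimator in the group with Z = 0 and the second column is the difference between the two group-specific Nelson-Aalen estimators, for as long as both groups have subjects at risk.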
Large-sample properties of the estimators may be derived using that
$$\widehat{\mathbf{B}}(t) - \int_0^t I(\mathrm{rank}(\mathbf{Y}(u)) = p + 1)\,d\mathbf{B}(u)$$

is a vector-valued martingale. Thereby, SD($\widehat{B}_j(t)$) may be estimated to obtain 95% point-wise
confidence limits for B j (t). Furthermore, hypothesis tests for model reductions, such
as B j (t) = 0, 0 < t ≤ τ∗ for a chosen τ∗, may be derived, e.g., based on $\sup_{t \le \tau^*} |\widehat{B}_j(t)|$
(Martinussen and Scheike, 2006, ch. 5).
The Aalen model is very flexible including a completely unspecified baseline hazard and
covariate effects and, as a result, the estimates from the model are entire curves. It may,
thus, be of interest to simplify the model, e.g., by restricting some regression functions to be
time-constant. The hypothesis of a time-constant hazard difference β j (t) = β j , 0 < t ≤ τ∗
may be tested (e.g., using a supremum-based statistic such as $\sup_{t \le \tau^*} |\widehat{B}_j(t) - (t/\tau^*)\widehat{\beta}_j|$,
where $\widehat{\beta}_j$ is an estimate under the hypothesis of a time-constant hazard difference for the
jth covariate). In the resulting semi-parametric model where some covariate effects are
time-varying and others are time-constant, parameters may be estimated as described by
Martinussen and Scheike (2006, ch. 5). The ultimate model reduction leads to the additive
hazard model
α(t | Z ) = α0 (t) + β1 Z1 + · · · + β p Z p (3.25)
with a non-parametric baseline and time-constant hazard differences, much like the Cox
regression model (Lin and Ying, 1994).
The multiplicative models assuming either a non-parametric baseline (Cox) or a piece-wise
constant baseline (Poisson) were quite similar in terms of their estimates and, in a similar
fashion, an additive hazard model with a piece-wise constant baseline can be studied. This
leads to a model like (3.25) but now with $\alpha_0(t) = \alpha_{0\ell}$ when $s_{\ell-1} \le t < s_\ell$, $\ell = 1, \ldots, L$. This
model is fully parametric and may be fitted using maximum likelihood. However, fitting
algorithms may be sensitive to starting values, reflecting the general difficulty in relation to
additive hazard models that estimated hazards may become negative.
Counting processes and martingales

The mathematical foundation of the models for analyzing intensities is that of count-
ing processes and martingales (e.g., Andersen et al., 1993).

3.6 Examples
This section presents a series of worked examples to illustrate the models for rates discussed
so far. We will first recap the results from the PBC3 trial (Example 1.1.1), next present
extended analyses of the childhood vaccination survival data from Guinea-Bissau (Example
1.1.2), and finally discuss Examples 1.1.4 and 1.1.3.

3.6.1 PBC3 trial in liver cirrhosis

The purpose of this trial was to evaluate the effect of treatment with CyA versus placebo on
the composite end-point ‘failure of medical treatment’. Data from the trial were used ex-
tensively in Chapter 2 to illustrate the methods introduced there. In summary, we found in
Section 2.2.1 that, unadjusted for other covariates, there was no effect of treatment. How-
ever, adjusting for the biochemical variables albumin and log2 (bilirubin), both of which
turned out to have less favorable values in the active group, CyA-treated patients were
found to have a significantly lower hazard of the composite end-point. Further adjustment
for age and sex did not change this conclusion (Section 2.4).
Even though the main scientific question in the PBC3 trial dealt with the composite end-
point, further insight could be gained by separate studies of its two components: Death
without liver transplantation and liver transplantation. Analyses of the two cause-specific
hazards (Section 2.4) showed that, while treatment, albumin, bilirubin and sex had effects
on the two outcomes going in the same direction, high age was associated with a higher
rate of death and a lower rate of transplantation.

3.6.2 Guinea-Bissau childhood vaccination study

In this study, the purpose of the analyses was to assess how the mortality rate in the 6-month
period between visits by a mobile team was associated with vaccinations given before the
first visit. In Section 2.3, some initial analyses were presented with main focus on BCG
vaccination and the choice of time-variable in the analysis. The preliminary conclusion
was that BCG-vaccinated children had a lower mortality rate – both when follow-up time
and current age was used as the time-variable in a Cox model – and we concluded that
the latter choice of time-variable was preferable because follow-up time was not associated
with the mortality rate.
We will now analyze the data to address how the mortality rate depends not only on BCG
vaccination but also on vaccination with DTP. As explained in Example 1.1.2, this vaccine
was given in three doses, and Table 3.1 shows the joint distribution of children according
to the two vaccinations at first visit. It is seen that few children, unvaccinated with BCG,
have received any dose of DTP, in other words the two covariates BCG and DTP are highly
correlated. This must be borne in mind when concluding from the subsequent analyses.
Table 3.1 Guinea-Bissau childhood vaccination study: Vaccination status at initial visit among
5,274 children.

                                  DTP doses
BCG          0               1               2              3
Yes      1159  35.1%     1299  39.4%     582  17.6%     261  7.9%
No       1942  98.4%       19   1.0%       9   0.5%       3  0.2%
Total    3101  58.8%     1318  25.0%     591  11.2%     264  5.0%

Table 3.2 Guinea-Bissau childhood vaccination study: Estimated coefficients (and SD) from Cox
models, using age as time-variable, for vaccination status at initial visit.

                          BCG            Any DTP dose       Interaction
Model                 β̂       SD        β̂       SD        β̂       SD
Only BCG           -0.356   0.141
Only DTP                               -0.039   0.149
Additive effects   -0.558   0.192      0.328   0.202
Interaction        -0.576   0.202      0.125   0.718      0.221   0.743

Since relatively few children have received multiple DTP doses, we will dichotomize that
variable in the following. Table 3.2 shows estimated regression coefficients from Cox mod-
els addressing the (separate and joint) effects of vaccinations on mortality. We have repeated
the analysis from the previous chapter where only BCG is accounted for showing a beneficial
effect of this vaccine. It is seen that, unadjusted for BCG, there is no association
between any dose of DTP and mortality, while, adjusting for BCG, DTP tends to increase
mortality, albeit insignificantly, while the effect of BCG seems even more beneficial than
without adjustment. In this model, due to the collinearity between the two covariates, stan-
dard deviations are inflated compared to the two simple models. Finally, it is seen that there
is no important interaction between the effects of the two vaccines on mortality (though
the test for this has small power because few children received a DTP vaccination without
a previous BCG). An explanation behind these findings could be that BCG vaccination is,
indeed, beneficial as seen in the additive model, while DTP tends to be associated with
an increased mortality rate. This latter effect is, however, not apparent without adjustment
for BCG because most children who got the DTP had already received the ‘good’ BCG
vaccine. Further discussions are provided by Kristensen et al. (2000).

3.6.3 PROVA trial in liver cirrhosis

In this trial, patients were randomized in a two-by-two factorial design to either propra-
nolol, sclerotherapy, both treatments, or no treatment and followed with respect to the two
competing end-points variceal bleeding or death without bleeding. The scientific question
addressed was how the occurrence of these outcomes was affected by treatment. We can
address this question by comparing the event rates among the treatment arms using the
non-parametric four-sample logrank test and, further, discrepancies between the treatment
Table 3.3 PROVA trial in liver cirrhosis: Cox models for the rates of variceal bleeding and death
without bleeding.

(a) Variceal bleeding

Covariate                                  β̂        SD        β̂        SD
Sclerotherapy         vs. none           0.056    0.392     0.177    0.433
Propranolol           vs. none          -0.040    0.400     0.207    0.424
Both treatments       vs. none          -0.032    0.401     0.031    0.421
Sex                   male vs. female                      -0.026    0.329
Coagulation factors   % of normal                          -0.0207   0.0078
log2 (bilirubin)      µmol/L                                0.191    0.149
Medium varices        vs. small                             0.741    0.415
Large varices         vs. small                             1.884    0.442

(b) Death without bleeding

Covariate                                  β̂        SD        β̂        SD
Sclerotherapy         vs. none           0.599    0.450     0.826    0.459
Propranolol           vs. none          -0.431    0.570    -0.160    0.575
Both treatments       vs. none           1.015    0.419     0.910    0.420
Sex                   male vs. female                       0.842    0.416
Coagulation factors   % of normal                          -0.0081   0.0068
log2 (bilirubin)      µmol/L                                0.445    0.137
Medium varices        vs. small                             0.222    0.347
Large varices         vs. small                             0.753    0.449


arms may be quantified via hazard ratios from Cox models for each of the two outcomes.
The results from the Cox models are summarized in Table 3.3.
The four-sample logrank test statistics for the two outcomes take, respectively, the values
0.071 for bleeding and 12.85 for death without bleeding corresponding to P-values of 0.99
and 0.005. So, for bleeding there are no differences among the treatment groups, while
for death without bleeding there are. Inspecting the regression parameters for this outcome
in Table 3.3, it is seen that the two groups where sclerotherapy was given have higher
death rates. The Cox model with separate effects in all treatment arms can be reduced
to a model with additive (on the log-rate scale) effects of sclerotherapy and propranolol
(LRT = 1.63, 1 DF, P = 0.20) and in the resulting model, propranolol is insignificant (LRT
= 0.35, 1 DF, P = 0.56). The regression parameter for sclerotherapy in the final model is
$\widehat{\beta}$ = 1.018 (SD = 0.328).
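The P-values quoted above can be checked from the test statistics and their degrees of freedom. The helper below is a hypothetical sketch using only the Python standard library; it evaluates the upper tail of the chi-square distribution for integer degrees of freedom via the closed forms for 1 and 2 DF and the standard recursion in steps of 2 DF.

```python
import math

def chi2_sf(x, df):
    """Upper tail probability P(chi-square with df degrees of freedom > x)."""
    half = x / 2.0
    if df % 2 == 1:
        k, q = 1, math.erfc(math.sqrt(half))   # closed form for 1 DF
    else:
        k, q = 2, math.exp(-half)              # closed form for 2 DF
    while k < df:                              # recursion from k DF to k + 2 DF
        q += half ** (k / 2.0) * math.exp(-half) / math.gamma(k / 2.0 + 1.0)
        k += 2
    return q

# Four-sample logrank tests (3 DF) for bleeding and for death without bleeding.
p_bleeding = chi2_sf(0.071, 3)
p_death = chi2_sf(12.85, 3)
# Likelihood ratio tests (1 DF) for the two model reductions.
p_additive = chi2_sf(1.63, 1)
p_propranolol = chi2_sf(0.35, 1)
print(p_bleeding, p_death, p_additive, p_propranolol)
```

Small discrepancies in the second decimal relative to the quoted P-values may arise because the test statistics are themselves reported in rounded form.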
The PROVA trial was randomized and, hence, adjustment for prognostic variables should
not change these conclusions (though, in the PBC3 trial, Example 1.1.1, such an adjustment
did change the estimated treatment effects considerably). Nevertheless, Table 3.3 for illus-
tration also shows treatment effects after such adjustments. Adjustment was made for four
covariates, of which two (coagulation factors and size of varices, the latter a three-category
variable), are associated with bleeding and two (sex and log2 (bilirubin)) with death with-
out bleeding. However, conclusions concerning treatment effects are not changed by this
Table 3.4 PROVA trial in liver cirrhosis: Cox model for the composite outcome bleeding-free
survival.

Covariate                                  β̂        SD
Sclerotherapy         vs. none           0.525    0.313
Propranolol           vs. none           0.100    0.338
Both treatments       vs. none           0.495    0.292
Sex                   male vs. female    0.360    0.253
Coagulation factors   % of normal       -0.0136   0.0053
log2 (bilirubin)      µmol/L             0.328    0.102
Medium varices        vs. small          0.446    0.263
Large varices         vs. small          1.333    0.301

adjustment, and the same is the case if further adjustment for age is done (results not
shown). It could be noted that, from the outset of the trial, the two end-points were consid-
ered equally important and, thus, merging them into the composite end-point ‘bleeding-free
survival’ is of interest. However, the results in Table 3.3 suggest that doing so would provide
a less clear picture. Estimates are shown in Table 3.4 where the significance of treatment
diminishes (LRT with 3 DF: P = 0.19) and, among the covariates, sex loses its significance
(Wald test: P = 0.16), while the three remaining covariates keep their significance. Note
also that Cox models for the two separate end-points are mathematically incompatible with
a Cox model for the composite end-point.

3.6.4 Testis cancer incidence and maternal parity

The data set available for studying the relationship between maternal parity and sons’ rates
of testicular cancer (Example 1.1.3) consists of tables of testicular cancer cases (seminomas
or non-seminomas) and person-years at risk according to the variables: Current age of the
son, birth cohort of the son, parity, and mother’s age at the birth of the son. The basis for
these tables was a follow-up of sons born to women in Denmark from the birth cohorts
1935-78 and who were either alive in 1968 (when the Danish Civil Registration System
was established) or born between 1968 and 1992. The sons were followed from birth or
1968, whatever came last, to a diagnosis of testicular cancer, death, emigration or end of
1992, whatever came first. Using current age as the time-variable in a Poisson regression
model for the cancer rate, sons born before 1968 have left-truncated follow-up records,
beginning in their age in 1968, whereas sons born after 1968 were followed from birth (age
0, no left-truncation). Figure 3.1 shows a Lexis diagram for this follow-up information, i.e.,
an age by calendar time coordinate system illustrating the combinations of age and calendar
time represented in the data set (e.g., Keiding, 1998). The numbers of person-years at risk
in combinations of age and birth cohort are given in the diagram showing that the majority
of person-time comes from boys aged less than 15 years.
Death of the son was a competing risk for the event of interest, testis cancer. However, as
explained in Section 1.1.3, the numbers of deaths in each of the categories of the tables were
not part of the available data set. This has the consequence that only the rate of testicular
cancer can be analyzed and not the full competing risks model including the death rate.

[Lexis diagram omitted; vertical axis: Age (years), horizontal axis: Calendar time.]

Figure 3.1 Testis cancer incidence and maternal parity: Lexis diagram showing the numbers of
person-years at risk (in units of 10,000 years) for combinations of age and birth cohort.

Fortunately, due to the likelihood factorization (3.2), and as explained in the introduction to
Chapter 2, the available data do allow a correct inference for the rate of testis cancer. Table
3.5 shows estimated log(hazard ratios) from Poisson models with age as the time-variable
(in categories 0-14, 15-19, 20-24, 25-29, and 30+ years) and including either parity alone
(1 vs. 2+) or parity together with birth cohort and mother’s age. It is seen that, unadjusted,
firstborn sons have an exp(0.217) = 1.24-fold increased rate of testicular cancer and this
estimate is virtually unchanged after adjustment for birth cohort and mother’s age. The
95% confidence interval for the adjusted hazard ratio is (1.05, 1.50), P = 0.01. The rates
increase markedly with age and LR tests for the adjusting factors are LRT= 2.53, 3 DF,
P = 0.47 for mother’s age and LRT= 9.22, 4 DF, P = 0.06 for birth cohort of son. It was
also studied whether the increased rate for firstborn sons varied among age groups (i.e., a
potential interaction between parity and age, non-proportional hazards). This was not the
case as seen by an insignificant LR-statistic for interaction (LRT= 7.76, 4 DF, P = 0.10).
It was finally studied how the rates of seminomas and non-seminomas, respectively, were
associated with parity. The hazard ratios (HRs) in relation to these competing end-points
were remarkably similar: HR= 1.23 (0.88, 1.72) for seminomas, HR= 1.27 (1.02, 1.56) for
non-seminomas.
TIME-DEPENDENT COVARIATES 87
Table 3.5 Testis cancer incidence and maternal parity: Poisson regression models.

Covariate                                     β̂       SD       β̂       SD
Parity                 1 vs. 2+             0.217   0.084    0.230   0.091
Age (years)            0-14 vs. 20-24      -4.031   0.211   -4.004   0.239
                       15-19 vs. 20-24     -1.171   0.119   -1.167   0.125
                       25-29 vs. 20-24      0.560   0.098    0.617   0.104
                       30+ vs. 20-24        0.753   0.133    0.954   0.154
Mother’s age (years)   12-19 vs. 30+                         0.029   0.241
                       20-24 vs. 30+                         0.058   0.222
                       25-29 vs. 30+                        -0.117   0.225
Son cohort             1950-57 vs. 1973+                    -0.363   0.288
                       1958-62 vs. 1973+                    -0.080   0.248
                       1963-67 vs. 1973+                     0.124   0.237
                       1968-72 vs. 1973+                     0.134   0.236
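
The hazard ratios and confidence limits quoted for parity follow from the coefficients and SDs in Table 3.5 by exponentiating the estimate plus or minus 1.96 SD. The sketch below reproduces that arithmetic; the function name `hazard_ratio` is a hypothetical helper introduced here, and the normal-approximation (Wald) interval is assumed.

```python
import math

def hazard_ratio(beta, sd, z=1.96):
    """Return (HR, lower, upper) from a log(hazard ratio) and its SD.

    Assumes a Wald-type interval: exp(beta -/+ z * sd).
    """
    return (math.exp(beta), math.exp(beta - z * sd), math.exp(beta + z * sd))

# Parity 1 vs. 2+ from Table 3.5: unadjusted and adjusted models
unadjusted = hazard_ratio(0.217, 0.084)
adjusted = hazard_ratio(0.230, 0.091)

print("unadjusted HR: %.2f" % unadjusted[0])
print("adjusted HR: %.2f (%.2f, %.2f)" % adjusted)
```

The same helper applies to any row of the regression tables in this section, since all of them report a coefficient on the log(rate) scale together with its SD.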

3.7 Time-dependent covariates

In Chapter 2 and Sections 3.1-3.6, we have studied (multiplicative or additive) hazard re-
gression models. Recall that the hazard function at time t, conditionally on the past, gives
the instantaneous probability per time unit of having an event just after time t

$$
\alpha_{hj}(t) \approx P(V(t + dt) = j \mid V(t) = h \text{ and the past for } s < t)/dt,
$$

if the event is a transition from state h to state j (and dt > 0 is ‘small’). In the model dis-
cussions and examples given so far, the past for s < t only included time-fixed covariates,
recorded at the time of study entry. However, one of the strengths of hazard regression
models is their ability to also include covariates that are time-dependent. Time-dependent
covariates can be quite different in nature and, in what follows, we will distinguish between
adapted and non-adapted covariates and, for the latter class, between internal (or endoge-
nous) and external (or exogenous) covariates. In later Subsections 3.7.5-3.7.8 we will, via
examples, illustrate different aspects of hazard regression models with time-dependent co-
variates.

3.7.1 Adapted covariates

We will use the term adapted covariates for time-dependent covariates that are constructed
only from aspects of the past (V(s), s < t). Thus, adapted covariates contain no randomness
over and above this past (including time-fixed covariates Z, combined with time t).
Examples of adapted covariates include:
• Number of events before t in a recurrent events study (e.g., Example 1.1.5).
• Time since entry into the current state h (e.g., time since bleeding in the PROVA trial,
Example 1.1.4).
• Current age or current calendar time Z(t) = Z + t if Z is age or calendar time at entry
into the study and t time since entry.
• Z(t) = Z · f (t) for some specific function f (·) to model a non-proportional effect of Z.
Another example, not represented in any of our examples from Section 1.1, would be one
where
• Z(t) is a pre-specified dose (i.e., planned at time t = 0) given to a subject while still alive,
e.g., Z(t) = z0 when 0 < t ≤ t0 and Z(t) = z1 when t0 < t.
Note that, in the last three examples the development of the time-dependent covariate is
deterministic (possibly given time-fixed covariates).
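
Because adapted covariates are functions of the observed past only, they can be computed directly from a subject's event history. The following is a minimal sketch of the examples listed above; all function names and the toy event times are hypothetical and serve only as illustration.

```python
from bisect import bisect_left

def n_previous_events(event_times, t):
    """N(t-): number of recurrent events strictly before time t."""
    return bisect_left(sorted(event_times), t)

def time_since_entry(entry_time, t):
    """Duration in the current state, d(t) = t - (time of entry into the state)."""
    if t < entry_time:
        raise ValueError("t must not precede entry into the state")
    return t - entry_time

def current_age(age_at_entry, t):
    """Z(t) = Z + t, with Z the age at study entry and t the time since entry."""
    return age_at_entry + t

def planned_dose(t, t0, z0, z1):
    """Pre-specified dose: z0 when 0 < t <= t0 and z1 when t0 < t."""
    if t <= 0:
        raise ValueError("dose is defined for t > 0 only")
    return z0 if t <= t0 else z1

# Hypothetical subject with recurrent events at these times (years)
events = [1.0, 2.5, 4.0]
print(n_previous_events(events, 2.5))  # the event at 2.5 is not yet counted
print(current_age(40.0, 2.5))
```

None of these functions needs information beyond the multi-state history, time-fixed covariates and t, which is what distinguishes them from the non-adapted covariates of the next subsection.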

3.7.2 Non-adapted covariates

Non-adapted time-dependent covariates are those that involve extra randomness, i.e., not
represented by the past for V (t). Examples include:
• Repeated measurements of a biochemical marker in a patient, such as repeated record-
ings of serum bilirubin in the PBC3 trial (Example 1.1.1). The value at time t could be
the current value of the marker, lagged values (i.e., the value at time t − ∆, e.g., 1 month
ago), or average values over the period from t − ∆ to t.
• Additional vaccinations given during follow-up in the Guinea-Bissau childhood vacci-
nation study (Example 1.1.2).
• Complications (or improvements) that may occur during follow-up of a patient, e.g.,
obtaining an Absolute Neutrophil Count (ANC) above 500 cells per µL in the bone
marrow transplantation study (Example 1.1.7).
• The level of air pollution in a study of the occurrence of asthma attacks.
In the first three examples, the recording of a value of the time-dependent covariate at time
t requires that the subject is still alive at that time, while, in the last example, the value
of the time-dependent covariate exists whether or not a given subject is still alive. The former
type of covariate is known as endogenous or internal, the latter as exogenous or external.

3.7.3 Inference
Inference for the parameters in the hazard model can proceed along the lines described in
earlier sections. For the Cox model, the regression coefficient β can be estimated from the
Cox partial log-likelihood
$$
l(\beta) = \sum_{\text{event times, } X} \log\left( \frac{\exp(\beta Z_{\text{event}}(X))}{\sum_{j \text{ at risk at time } X} \exp(\beta Z_j(X))} \right)
$$

where, at event time X, the covariate values at that time are used for everyone at risk at X.
Similarly, the cumulative baseline hazard A0 (t) is estimated by

$$
\widehat{A}_0(t) = \sum_{\text{event times, } X \le t} \frac{1}{\sum_{j \text{ at risk at time } X} \exp(\widehat{\beta} Z_j(X))}.
$$

Least squares estimation in the additive Aalen model (Sections 2.2.4 and 3.5) can, likewise,
be modified to include time-dependent covariates. The arguments for the Cox model depend
on whether the time-dependent covariates are adapted or not. These arguments are outlined
in the next section.
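
The key point of the two formulas above is that, at each event time X, every subject at risk contributes the covariate value it has at X. A minimal sketch of this, for a single scalar covariate, is given below. The data layout (a list of dictionaries with keys `entry`, `exit`, `event` and a covariate function `z`) and the function names are assumptions made for illustration, and tied event times are handled one event at a time with a common risk set.

```python
import math

def risk_set(subjects, x):
    """Subjects under observation just before time x (allows delayed entry)."""
    return [s for s in subjects if s["entry"] < x <= s["exit"]]

def cox_partial_loglik(beta, subjects):
    """Cox partial log-likelihood with a time-dependent covariate z(t)."""
    loglik = 0.0
    for s in subjects:
        if not s["event"]:
            continue
        x = s["exit"]
        # Covariate values at time x are used for everyone at risk at x
        denom = sum(math.exp(beta * r["z"](x)) for r in risk_set(subjects, x))
        loglik += beta * s["z"](x) - math.log(denom)
    return loglik

def breslow(beta_hat, subjects, t):
    """Estimated cumulative baseline hazard at time t for a given beta_hat."""
    total = 0.0
    for s in subjects:
        if s["event"] and s["exit"] <= t:
            x = s["exit"]
            total += 1.0 / sum(math.exp(beta_hat * r["z"](x))
                               for r in risk_set(subjects, x))
    return total

# Hypothetical toy data: subject B switches covariate value at t = 1.5
subjects = [
    {"entry": 0.0, "exit": 1.0, "event": True, "z": lambda t: 0.0},
    {"entry": 0.0, "exit": 2.0, "event": True, "z": lambda t: 1.0 if t > 1.5 else 0.0},
    {"entry": 0.0, "exit": 3.0, "event": True, "z": lambda t: 0.0},
]
print(cox_partial_loglik(math.log(2.0), subjects))
print(breslow(math.log(2.0), subjects, 2.0))
```

In practice, β would be found by maximizing `cox_partial_loglik` numerically; the sketch only evaluates the criterion to show where the time-dependent covariate values enter.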
The ability to do micro-simulation (Section 5.4) depends on whether time-dependent co-
variates are adapted or not. Micro-simulation of models including non-adapted covariates
requires joint models for the multi-state process and the time-dependent covariate, and in
Section 7.4 we will briefly discuss joint models.
There is a related feature (to be discussed in Sections 4.1 and 5.2) that depends on whether
a hazard model includes time-dependent covariates or not. This is the question of whether it
is possible to estimate marginal parameters by plug-in. This will typically only be possible
with time-fixed covariates, with deterministic time-dependent covariates, or with exogenous
covariates; for the latter situation, see Yashin and Arjas (1988).

3.7.4 Inference (*)

When Zi(t) is adapted, the arguments for deriving the likelihood follow those in Sections
3.1 and 3.3. Thus, with a Cox model, λi(t) = Yi(t) α0(t) exp(β^T Zi(t)) is still the intensity
process for Ni(t) with respect to the history Ht generated by the multi-state process and
baseline covariates. This means that the Jacod formula (3.1) is applicable as the starting
point for inference and, hence, that jumps in the baseline hazard, for fixed β, are estimated
by
$$
\widehat{\alpha_0(t)\,dt} = \frac{\sum_i dN_i(t)}{\sum_i Y_i(t) \exp(\beta^T Z_i(t))}, \tag{3.26}
$$
and the essential factor in the resulting profile likelihood will still be
$$
PL(\beta) = \prod_i \prod_t \left( \frac{Y_i(t) \exp(\beta^T Z_i(t))}{\sum_j Y_j(t) \exp(\beta^T Z_j(t))} \right)^{dN_i(t)}, \tag{3.27}
$$

which is the Cox partial likelihood with time-dependent covariates.

For non-adapted time-dependent covariates, the arguments for deriving a likelihood be-
come more involved. This is because the full likelihood will also include factors resulting
from the extra randomness in Z i (t) and will no longer be given by the Jacod formula (3.1).
It was discussed by Andersen et al. (1993; ch. III) how the likelihood given by (3.1) can
still be considered a partial likelihood that is applicable for inference and that, for the Cox
model, estimation can still be based on (3.26) and (3.27).
Estimates in the Aalen model may be based on (3.24) when defining the matrix Y(t) to have
ith row equal to Yi (t)(1, Zi1 (t), . . . , Zip (t)). This is the case both for adapted and non-adapted
covariates.

3.7.5 Recurrent episodes in affective disorders

The study was presented in Section 1.1.5, and in Section 2.5, where simple regression mod-
els for the association between the initial diagnosis (bipolar vs. unipolar) and the recurrence
rate were studied. In an AG-model, the estimated hazard ratio between the two diagnostic
groups was exp(0.366) = 1.442 with 95% confidence limits from 1.198 to 1.736, see Table
2.14. In this model, recurrent episodes were assumed to be independent in the sense that
the recurrence rate depended on no other aspects of the past than the initial diagnosis. One
way of relaxing this assumption is to include the number of previous episodes for subject i,
Table 3.6 Recurrent episodes in affective disorders: AG models with number of previous episodes,
N(t−), as time-dependent covariate.

Covariate                  β̂       SD        β̂       SD         β̂        SD
Bipolar vs. unipolar     0.366   0.094     0.318   0.095      0.067    0.097
N(t−)                                      0.126   0.0087     0.425    0.032
N(t−)²                                                       -0.0136   0.0016

Ni(t−) as a time-dependent covariate. This is an adapted variable. Since its effect appears
highly non-linear (P < 0.001 for linearity in a model including both Ni(t−) and Ni(t−)²),
we quote the hazard ratio for diagnosis from the model including the quadratic term which
is exp(0.067) = 1.069 with 95% confidence limits from 0.884 to 1.293. In a model includ-
ing only Ni (t−), it is seen that the recurrence rate increases with the number of previous
episodes and so does the rate in the quadratic model where the maximum of the estimated
parabola for Ni (t−) is −0.425/(−2 · 0.0136) = 15.6. Table 3.6 summarizes the results. In-
cluding Ni (t−) as a categorical variable instead gives the hazard ratio exp(0.090) = 1.094
(0.904, 1.324). In both cases, a substantial reduction of the hazard ratio is seen when com-
paring to the value 1.442 from the model without adjustment for Ni (t−). The explanation
was given in Section 2.5 in connection with the PWP model where adjustment for previ-
ous episodes was carried out using (time-dependent) stratification, namely that the occur-
rence of repeated episodes is itself affected by the initial diagnosis, and previous episodes,
therefore, serve as an intermediate variable between the baseline covariate and recurrent
episodes. The AG model including functions of Ni (t−) is, thus, more satisfactory in the
sense that the independence assumption is relaxed; however, it is less clear if it answers the
basic question of comparison between the two diagnostic groups. To answer this question,
models for the marginal mean number of events over time may be better suited (Sections
4.2.3 and 5.5.4).
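
The location of the maximum of the estimated parabola quoted above can be checked from the coefficients for N(t−) and N(t−)² in Table 3.6. The sketch below does this and also evaluates the estimated log(rate ratio) relative to no previous episodes; the names `b1`, `b2` and `log_rate_ratio` are introduced here for illustration only.

```python
import math

# Coefficients from the third model in Table 3.6
b1 = 0.425     # N(t-)
b2 = -0.0136   # N(t-) squared

def log_rate_ratio(n):
    """Estimated log(rate ratio) for n previous episodes versus none."""
    return b1 * n + b2 * n * n

# A parabola b1*n + b2*n^2 with b2 < 0 has its maximum at n = -b1 / (2*b2)
n_max = -b1 / (2.0 * b2)
print("maximum at n = %.1f" % n_max)
print("rate ratio at the maximum: %.1f" % math.exp(log_rate_ratio(n_max)))
```

Up to about 15 previous episodes the estimated rate therefore increases with Ni(t−), in line with the statement that the rate increases with the number of previous episodes in the quadratic model as well.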
Another time-dependent covariate which may affect the estimate is current calendar period.
This is also an adapted covariate for a given value of the baseline covariate, calendar time at
diagnosis. The number of available beds in psychiatric hospitals has been decreasing over
time, and if access to hospital varies between the two diagnostic groups, then adjustment
for period may give rise to a different hazard ratio for diagnosis. We, therefore, created a
covariate by categorizing current period, i.e., calendar time at diagnosis + follow-up time,
into the intervals 1959-65, 1966-70, 1971-75, 1976-80, and 1981+, and adjusted for this in
the simple AG model including only diagnosis. As seen in Table 3.7, the resulting hazard
ratio for diagnosis is only slightly smaller than without adjustment (exp(0.361) = 1.435,
95% c.i. from 1.193 to 1.728). It is also seen that the recurrence rate tends to decrease with
calendar period.
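
The current calendar period covariate described above is a deterministic function of the baseline covariate and follow-up time. A minimal sketch of its construction follows; it assumes calendar time is recorded in decimal years and that the categories of Table 3.7 change on 1 January, neither of which is stated in the text, and the function name is hypothetical.

```python
def calendar_period(year_at_diagnosis, followup_years):
    """Categorize current calendar time = time at diagnosis + follow-up time.

    Assumes decimal calendar years and category boundaries at 1 January.
    """
    current = year_at_diagnosis + followup_years
    if current < 1966.0:
        return "1959-65"
    if current < 1971.0:
        return "1966-70"
    if current < 1976.0:
        return "1971-75"
    if current < 1981.0:
        return "1976-80"
    return "1981+"

# Hypothetical patient diagnosed mid-1964, evaluated after 1 and 2 years
print(calendar_period(1964.5, 1.0))
print(calendar_period(1964.5, 2.0))
```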

3.7.6 PROVA trial in liver cirrhosis

The study was presented in Section 1.1.4, and in Section 3.6.3 analyses addressing the main
question in this trial were presented. Thus, the rates of variceal bleeding and death with-
out bleeding were related to the given treatment, concluding that treatment arms involving
sclerotherapy had a higher mortality rate whereas no effect on the bleeding rate was seen.
The course of the patients after onset of the primary end-point, bleeding, was not part of
Table 3.7 Recurrent episodes in affective disorders: AG model with current calendar period as
time-dependent covariate.

Covariate                              β̂       SD
Bipolar vs. unipolar                 0.361   0.095
Period    1966-70 vs. 1959-65       -0.251   0.208
          1971-75 vs. 1959-65       -0.179   0.331
          1976-80 vs. 1959-65       -0.367   0.439
          1981+ vs. 1959-65         -1.331   0.554

the basic trial question; however, it is of clinical interest to study the 1 → 2 transition rate
in the illness-death model of Figure 1.3. For this purpose, a choice of time-variable in the
model for the rate α12 (·) is needed. For the two-state model for survival analysis and for
the competing risks model, a single time origin was assumed and all intensities depended
on time t since that origin. The same is the case with the rates α01 (t) and α02 (t) in the
illness-death model. However, for the rate α12 (·), both the time-variable t (time since ran-
domization) and time since entry into state 1, duration d = d(t) = t − T1 , may play a role.
Note the similarity with the choice of time-variable for models for transition intensities in
models for recurrent events (Section 2.5). If α12 (·) only depends on t, then the multi-state
process is said to be Markovian; if it depends on d, then it is semi-Markovian; see Section
1.4. In the Markovian case, inference for α12 (t) needs to take delayed entry into account;
if α12 (·) only depends on d, then this is not the case.
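
The practical difference between the two time-variables lies in how the risk set for the 1 → 2 transition is formed. The sketch below contrasts the two, assuming each patient is described by a bleeding time `t_bleed` and an exit time `t_exit` (death or censoring), both measured from randomization; the field names, function names and toy data are hypothetical.

```python
def risk_set_study_time(patients, t):
    """At risk at time t since randomization: delayed entry at the bleeding time."""
    return [p["id"] for p in patients if p["t_bleed"] < t <= p["t_exit"]]

def risk_set_duration(patients, d):
    """At risk at duration d since bleeding: everyone starts at d = 0."""
    return [p["id"] for p in patients if 0 < d <= p["t_exit"] - p["t_bleed"]]

# Hypothetical patients (times in years since randomization)
patients = [
    {"id": 1, "t_bleed": 0.2, "t_exit": 0.9},
    {"id": 2, "t_bleed": 1.0, "t_exit": 3.0},
    {"id": 3, "t_bleed": 1.5, "t_exit": 1.6},
]

print(risk_set_study_time(patients, 0.5))  # only patient 1 has bled by t = 0.5
print(risk_set_duration(patients, 0.05))   # all three are at risk shortly after bleeding
```

With t as time-variable, a patient enters the risk set only at the time of bleeding, which is why early risk sets can be small; with d as time-variable, all patients are at risk from duration 0.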
Results from Cox regression analyses are displayed in Table 3.8. It should be kept in mind
when interpreting these results that this is a small data set with 50 patients and 29 deaths
(Table 1.2). We first fitted a (Markov) model with t as the baseline time-variable including
treatment, sex, and log2 (bilirubin) (top panel (a), first column in Table 3.8). The latter two
covariates were not strongly associated with the rate and their coefficients are not shown –
the same holds for subsequent models. In this model, the treatment effect was significant
(LRT= 11.12, 3 DF, P = 0.01) and there was an interaction between propranolol and scle-
rotherapy (LRT= 6.78, 1 DF, P = 0.009). The combined treatment group seems to have a
high mortality rate and the group receiving only sclerotherapy a low rate; however, it should
be kept in mind that one is no longer comparing randomized groups because the patients
entering the analysis are selected by still being alive and having experienced a bleeding –
features that may, themselves, be affected by treatment. The Breslow estimate of the cu-
mulative baseline hazard is shown in Figure 3.2 and is seen to increase sharply for small
values of t. It should be kept in mind that there is delayed entry and, as a consequence, few
patients at risk at early failure times: For the first five failure times the numbers at risk were,
respectively, 2, 3, 5, 6, and 6. The Markov assumption corresponds to no effect of duration
since bleeding on the mortality rate after bleeding, and this hypothesis may be investigated
using adapted time-dependent covariates. Defining di (t) = t − T1i where T1i is the time of
entry into state 1 for patient i, i.e., his or her time of bleeding, the following two covariates
were added to the first model

$$
Z_{i1}(t) = I(d_i(t) < 5 \text{ days})
$$

$$
Z_{i2}(t) = I(5 \text{ days} \le d_i(t) < 10 \text{ days}),
$$
Table 3.8 PROVA trial in liver cirrhosis: Cox models for the rate of death after bleeding using
different baseline time-variables.

(a) Time since randomization

Covariate                                    β̂       SD       β̂       SD
Sclerotherapy vs. none                    -1.413   0.679   -1.156   0.684
Propranolol vs. none                      -0.115   0.595   -0.024   0.631
Both treatments vs. none                   0.733   0.544    0.425   0.611
d(t) < 5 days vs. ≥ 10 days                                 2.943   0.739
5 days ≤ d(t) < 10 days vs. ≥ 10 days                       2.345   0.803

(b) Duration

Covariate                                    β̂       SD       β̂       SD
Sclerotherapy vs. none                    -0.997   0.643   -1.019   0.650
Propranolol vs. none                      -0.300   0.596   -0.312   0.601
Both treatments vs. none                   0.871   0.514    0.847   0.524
t < 1 year vs. t ≥ 2 years                                 -0.172   0.910
1 year ≤ t < 2 years vs. t ≥ 2 years                       -0.221   0.886

see top panel (a), second column of Table 3.8. These covariates were strongly associated
with the mortality rate (P < 0.001), and the Markov assumption is clearly rejected. We
can also see that the treatment effect changes somewhat and it is no longer statistically
significant (P = 0.29). The same conclusion is arrived at if, instead, the time-dependent
covariate di (t) is included with an assumed linear effect on the log(rate) (not shown).
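
To fit a Cox model with the duration indicators Zi1(t) and Zi2(t), the follow-up after bleeding is typically split into intervals on which the covariates are constant. The sketch below shows one way to do this, with times in days since randomization; the function names and the row layout (`start`, `stop`, `event`, `z1`, `z2`) are hypothetical. Rows are intervals (start, stop], so a death at a duration exactly equal to a cut point is assigned to the earlier interval, which differs from the definition of the indicators only at those two points.

```python
def duration_covariates(t, t_bleed):
    """Return (Z1, Z2) at time t for a patient who bled at time t_bleed (days)."""
    d = t - t_bleed
    return (1 if d < 5 else 0, 1 if 5 <= d < 10 else 0)

def split_by_duration(t_bleed, t_exit, died, cuts=(5.0, 10.0)):
    """Split follow-up after bleeding at 5 and 10 days since bleeding."""
    rows = []
    start = t_bleed
    bounds = [t_bleed + c for c in cuts] + [float("inf")]
    for k, upper in enumerate(bounds):
        if start >= t_exit:
            break
        stop = min(upper, t_exit)
        rows.append({
            "start": start,
            "stop": stop,
            # the event, if any, belongs to the row that ends at t_exit
            "event": int(died and stop == t_exit),
            "z1": int(k == 0),
            "z2": int(k == 1),
        })
        start = stop
    return rows

# Hypothetical patient: bleeding on day 100, death on day 107
for row in split_by_duration(100.0, 107.0, True):
    print(row)
```

The resulting rows have delayed entry at `start` on the time-since-randomization scale, which is what a Markov-type analysis of the 1 → 2 transition requires.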
The coefficients for the time-dependent covariates show that the mortality rate is very high
shortly after the bleeding episode and instead of attempting to model this effect parametri-
cally, using time-dependent covariates, an alternative would be to use duration since bleed-
ing as the baseline time-variable in a Cox model. Results from such models are shown
in the lower panel (b) of Table 3.8. In the model including treatment (together with sex
and log2(bilirubin); coefficients not shown), the treatment effect is statistically significant (P = 0.02), and
there is a tendency that the combined treatment group has the highest mortality rate and
the group receiving only sclerotherapy the lowest. The Breslow estimate of the cumulative
baseline hazard is shown in Figure 3.3 and is seen to increase sharply for small values of
duration. With this time-variable there is no left-truncation and the estimator does not have
a particularly large variability for small values of duration.
To this model, one may add functions of t as time-dependent covariates to investigate
whether time since randomization affects the mortality rate. Neither a piece-wise constant
effect (Table 3.8, lower panel (b), second column) nor a linear effect (not shown) suggested
any importance and their inclusion has little impact on the estimated treatment effects.
We will prefer the model with duration as the baseline time-variable because, of the two
time-variables, duration seems to have the strongest effect on the mortality rate and by
using this in the Cox baseline hazard, one avoids making parametric assumptions about the
way in which it affects the rate.

[Figure 3.2: plot. Horizontal axis: Time since randomization (years), 0 to 4. Vertical axis: Cumulative hazard, 0.00 to 1.00.]

Figure 3.2 PROVA trial in liver cirrhosis: Breslow estimate for the cumulative baseline hazard in a
Cox model for the 1 → 2 transition rate as a function of time since randomization.

[Figure 3.3: plot. Horizontal axis: Duration (years), 0 to 3. Vertical axis: Cumulative hazard, 0.00 to 0.15.]

Figure 3.3 PROVA trial in liver cirrhosis: Breslow estimate for the cumulative baseline hazard in a
Cox model for the 1 → 2 transition rate as a function of time since bleeding (duration).
Table 3.9 PROVA trial in liver cirrhosis: Deaths after bleeding/person-years at risk according to
duration and time since randomization.

                      Time since randomization
Duration        0-1 year     1-2 years     2+ years
0-4 days        8/0.463      2/0.131       0/0.014
5-9 days        2/0.386      1/0.116       0/0.014
10+ days        7/12.430     5/19.882      4/21.565

The choice between duration since bleeding and time since randomization as baseline time-variable
may be entirely avoided by, instead of using a Cox regression model, analyzing
the data using a Poisson regression model and splitting the follow-up time (since bleeding)
according to both time-variables. This will require a choice of cut-points for both time-
variables. Using the same intervals as in the two Cox models (i.e., 1 and 2 years for t
and 5 and 10 days for d(t)) gives the distribution of deaths and person-years at risk as
shown in Table 3.9. It is seen that no patients with a bleeding episode after 2 years since
randomization died within the first 10 days after bleeding, and one is also reminded of the
fact that this is a small data set. Having split the follow-up time after bleeding according
to the two time-variables, Poisson regression models including either time-variable or both
may be fitted. The results from these models, also including the insignificant variables sex
and log2 (bilirubin), are presented in Table 3.10. It can be noticed (cf. Section 2.2) that
results from similar Cox or Poisson models tend to be very close and, furthermore, that it
is not crucial for the estimated regression coefficients for treatment which time-variable(s)
we adjust for. This is also the case in a model allowing for an interaction between the
two time-variables (not shown). However, the fact that in the model including both time-
variables additively, duration is strongly associated with the rate (P < 0.001) but time since
randomization is not (P = 0.53) suggests that it is most important to account for duration
since bleeding.
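
The deaths and person-years in Table 3.9 are the sufficient statistics for a piece-wise constant (Poisson) rate model, and crude occurrence/exposure rates can be read off directly. The sketch below computes these rates per person-year from the table; the variable names are introduced here, and the rates are unadjusted, so they are not expected to reproduce the coefficients of Table 3.10 exactly.

```python
durations = ["0-4 days", "5-9 days", "10+ days"]
periods = ["0-1 year", "1-2 years", "2+ years"]

# Table 3.9: rows are duration since bleeding, columns are time since randomization
deaths = [[8, 2, 0],
          [2, 1, 0],
          [7, 5, 4]]
person_years = [[0.463, 0.131, 0.014],
                [0.386, 0.116, 0.014],
                [12.430, 19.882, 21.565]]

# Cell-specific rates: deaths per person-year
rates = [[d / py for d, py in zip(d_row, py_row)]
         for d_row, py_row in zip(deaths, person_years)]

# Marginal rates by duration (summing over time since randomization)
rate_by_duration = [sum(d_row) / sum(py_row)
                    for d_row, py_row in zip(deaths, person_years)]

# Marginal rates by time since randomization (summing over duration)
rate_by_period = [sum(row[k] for row in deaths) / sum(row[k] for row in person_years)
                  for k in range(len(periods))]

total_deaths = sum(sum(row) for row in deaths)
for name, r in zip(durations, rate_by_duration):
    print("%-9s %.2f deaths per person-year" % (name, r))
```

The crude rates fall steeply with duration since bleeding and much less steeply with time since randomization, which agrees with the conclusion that duration is the more important time-variable.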

Time-variable and time-dependent covariates

The Cox model requires a specification of the time-variable for the baseline hazard.
If several time-variables affect the intensity (e.g., both time on study and duration in
a state), there is a choice to be made: Which time-variable should be baseline, and
which can be handled using adapted time-dependent covariates? A general piece of advice is
to choose as baseline time-variable one that has a marked effect on the hazard that
may be hard to model parametrically.
For a Poisson model, several time-variables can be handled in parallel (i.e., without
pin-pointing one as ‘baseline’); however, in that case all time-variables must be
categorized with an assumption that they have a piece-wise constant effect on the
intensity.
Table 3.10 PROVA trial in liver cirrhosis: Poisson regression models for the rate of death after
bleeding accounting for duration since bleeding (a), time since randomization (b), or both (c).

(a)

Covariate                                  β̂       SD
Sclerotherapy vs. none                  -1.130   0.642
Propranolol vs. none                    -0.314   0.589
Both treatments vs. none                 0.967   0.516
d(t) < 5 days vs. ≥ 10 days              3.602   0.439
5 ≤ d(t) < 10 days vs. ≥ 10 days         2.844   0.637

(b)

Covariate                                  β̂       SD
Sclerotherapy vs. none                  -1.281   0.652
Propranolol vs. none                    -0.432   0.566
Both treatments vs. none                 0.801   0.525
t < 1 year vs. t ≥ 2 years               1.507   0.579
1 ≤ t < 2 years vs. t ≥ 2 years          0.430   0.648

(c)

Covariate                                  β̂       SD
Sclerotherapy vs. none                  -1.110   0.648
Propranolol vs. none                    -0.318   0.579
Both treatments vs. none                 0.826   0.514
d(t) < 5 days vs. ≥ 10 days              3.350   0.464
5 ≤ d(t) < 10 days vs. ≥ 10 days         2.583   0.659
t < 1 year vs. t ≥ 2 years               0.733   0.618
1 ≤ t < 2 years vs. t ≥ 2 years          0.189   0.655

3.7.7 PBC3 trial in liver cirrhosis

In Section 2.2 we saw how the assumption of proportional hazards could be checked in a
Poisson regression model by introducing an interaction between covariates and the catego-
rized time-variable. By using adapted time-dependent covariates, the same idea applies for
the Cox regression model, and we will illustrate this using the PBC3 trial as example. To
the model including treatment, albumin, and log2 (bilirubin), a time-dependent covariate of
the form
Zi (t) = Zi · f (t)
for some function f (·) of time t was added. Here Zi is one of the time-fixed covariates. To do
this, a choice of f (t) must be made and choosing a monotone f (·), a test of proportionality
versus an alternative of a monotone hazard ratio is obtained. Typical choices of f (·) are the
identity f (t) = t, log(t) or I(t > t0 ) for some threshold t0 . Table 3.11 shows results from
such tests in the form of the estimated coefficient for Zi (t) and its SD. It is seen that, in
all cases the estimated coefficient is rather small compared to its SD resulting nowhere in
Table 3.11 PBC3 trial in liver cirrhosis: Examination of proportional hazards in a Cox model
including treatment, albumin, and log2(bilirubin).

                             Treatment          Albumin        log2(bilirubin)
Function                     β̂       SD        β̂       SD        β̂       SD
f(t) = t                   0.045   0.178     0.023   0.019    -0.063   0.062
f(t) = log(t)              0.108   0.254     0.033   0.026    -0.077   0.088
f(t) = I(t > 2 years)      0.032   0.434     0.057   0.043    -0.182   0.149

any evidence against proportionality. It is also seen that, for each of the three covariates,
the coefficient has the same sign for all choices of f (t). The tendencies for treatment and
albumin are that the hazard ratio increases over time while, for bilirubin, it decreases.
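
The statement that each coefficient in Table 3.11 is small compared to its SD can be made concrete by computing the ratio of the two. The sketch below does so and attaches a two-sided normal P-value to each ratio; using a Wald test here is an assumption of this illustration (the text does not say which test the authors applied), and the names `table_3_11` and `wald` are hypothetical.

```python
import math

# (estimate, SD) for the coefficient of Z * f(t), copied from Table 3.11
table_3_11 = {
    "f(t) = t": {
        "treatment": (0.045, 0.178),
        "albumin": (0.023, 0.019),
        "log2(bilirubin)": (-0.063, 0.062),
    },
    "f(t) = log(t)": {
        "treatment": (0.108, 0.254),
        "albumin": (0.033, 0.026),
        "log2(bilirubin)": (-0.077, 0.088),
    },
    "f(t) = I(t > 2 years)": {
        "treatment": (0.032, 0.434),
        "albumin": (0.057, 0.043),
        "log2(bilirubin)": (-0.182, 0.149),
    },
}

def wald(beta, sd):
    """Return (z, two-sided P) for H0: coefficient = 0, normal approximation."""
    z = beta / sd
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p

results = {(f, cov): wald(*est)
           for f, row in table_3_11.items()
           for cov, est in row.items()}

for (f, cov), (z, p) in results.items():
    print("%-22s %-16s z = %6.2f  P = %.2f" % (f, cov, z, p))
```

Every ratio is below 1.96 in absolute value, consistent with the conclusion that there is no evidence against proportionality for any of the three covariates.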

3.7.8 Bone marrow transplantation in acute leukemia

The four-state model for the bone marrow transplantation study was shown in Figure 1.6.
We will first present Nelson-Aalen estimates for the cumulative transition rates. Figure 3.4
shows the estimates for the cumulative mortality rates A03 (t), without relapse or GvHD,
A13 (t), without relapse and with GvHD, and A23 (t), after relapse. The latter two are esti-
mated taking the delayed entry into states 1 or 2 into account. Here, time t denotes time
since bone marrow transplantation (BMT). It is seen that occurrence of a relapse markedly
increases the mortality rate and also that the direct mortality rate for patients after expe-
riencing GvHD is higher than without. Figure 3.5 shows the estimated cumulative GvHD
rate, A01 (t), and shows that the rate is high shortly after BMT and thereafter it decreases.
Finally, Figure 3.6 shows the estimated cumulative relapse rates, A02 (t) without GvHD and
A12 (t) after GvHD, the latter estimated using delayed entry. It is seen that the GvHD event
decreases the relapse rate.
For these estimates, no assumptions are made in relation to how the two different relapse
rates and the three death rates are connected. More parsimonious models could build on
various proportional hazards assumptions, e.g., α12 (t) = exp(β2 )α02 (t) for the relapse rates
and α13 (t) = exp(β3 )α03 (t) for the death rates with or without GvHD. Such models can be
fitted by including GvHD as a time-dependent covariate in separate models for the relapse
and death rates. When GvHD is a state in the multi-state model, the covariate

Zi (t) = I(i had GvHD before time t)

is an adapted time-dependent covariate. For relapse, the estimated hazard ratio for GvHD
is exp(β̂2) = 0.858 with 95% confidence limits from 0.663 to 1.112 (P = 0.25). The proportional
hazards assumption was evaluated by including the time-dependent covariate
Zi(t) log(t + 1) which gives a P-value of 0.35. A graphical evaluation, following the lines
from the stratified Cox model in Section 2.2, can be performed by plotting Â12(t) against
Â02(t), see Figure 3.7. Under proportional hazards, the resulting curve should be a straight
line through the point (0,0), with slope equal to exp(β̂2) = 0.858 and this is seen to be a
good approximation. For death, the similar analyses yield exp(β̂3) = 3.113 with 95% confidence
limits from 2.577 to 3.760. Addition of an interaction between GvHD and log(t + 1)

[Figure 3.4: plot. Horizontal axis: Time since bone marrow transplantation (months), 0 to 156. Vertical axis: Cumulative hazard of death, 0 to 6.]

Figure 3.4 Bone marrow transplantation in acute leukemia: Cumulative mortality rate after relapse
(dashed line); cumulative mortality rate after GvHD (dotted line); cumulative mortality rate without
relapse or GvHD (solid line) (GvHD: Graft versus host disease).

[Figure 3.5: plot. Horizontal axis: Time since bone marrow transplantation (months), 0 to 156. Vertical axis: Cumulative GvHD hazard, 0.0 to 0.8.]

Figure 3.5 Bone marrow transplantation in acute leukemia: Cumulative rate of GvHD (Graft versus
host disease).

[Figure 3.6: plot. Horizontal axis: Time since bone marrow transplantation (months), 0 to 156. Vertical axis: Cumulative relapse hazard, 0.0 to 0.2.]

Figure 3.6 Bone marrow transplantation in acute leukemia: Cumulative relapse rate after GvHD
(dashed line); cumulative relapse rate without GvHD (solid line) (GvHD: Graft versus host disease).

gives P = 0.11, and the goodness-of-fit plot is seen in Figure 3.8. This figure does suggest
some deviations from proportional hazards apparently caused by a too low hazard ratio
early on (convex shape of the curve); however, the formal test is insignificant.
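
The cumulative rates plotted against each other in Figures 3.7 and 3.8 are Nelson-Aalen estimates, and those for transitions out of the GvHD state require delayed entry. A minimal sketch of the estimator with delayed entry follows: at each observed event time, the number of events is divided by the number at risk, and these increments are accumulated. The record layout `(entry, exit, event)` and the toy data are hypothetical.

```python
def nelson_aalen(records):
    """Nelson-Aalen estimate allowing delayed entry.

    records: list of (entry, exit, event) with event = 1 for an observed
    transition at `exit` and 0 for censoring. Returns [(time, cum_hazard)].
    """
    event_times = sorted({x for (_, x, e) in records if e})
    cum = 0.0
    out = []
    for t in event_times:
        d = sum(1 for (_, x, e) in records if e and x == t)   # events at t
        y = sum(1 for (v, x, _) in records if v < t <= x)     # at risk just before t
        cum += d / y
        out.append((t, cum))
    return out

# Hypothetical data: the last two subjects enter the state after time 0
records = [(0.0, 1.0, 1), (0.0, 2.0, 0), (0.5, 3.0, 1), (2.5, 4.0, 1)]
for t, a in nelson_aalen(records):
    print("t = %.1f  A(t) = %.3f" % (t, a))
```

Evaluating two such estimates (with and without GvHD) at a common grid of time points and plotting one against the other gives the kind of goodness-of-fit plot described above; under proportional hazards the points should lie close to a straight line through (0,0).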
Bone marrow transplantation studies often aim at studying the two adverse end-points re-
lapse and death without relapse, the latter often termed death in remission or treatment-
related mortality, both signaling that the treatment with BMT is no longer effective. In such
a situation, a relevant multi-state model to use would be the competing risks model, Figure
1.2, i.e., the disease course after relapse is not studied, and GvHD is no longer considered
a separate state in the model. However, in an analysis of the rates of relapse and death in
remission, it would still be of interest to study how these may be affected by occurrence of
GvHD over time. This can be done as just described, i.e., by including the time-dependent
GvHD covariate Zi (t) in the Cox models for the two rates. However, for the competing
risks model, this will no longer be an adapted time-dependent covariate but rather a non-
adapted, internal or endogenous time-dependent covariate, because the past history at time
t in the competing risks model does not contain information on GvHD and because the
existence of Zi (t) requires subject i to be alive. At this point, the distinction between these
two situations may look rather academic but when, later in the book (Section 5.2.4), we go
beyond rate models and also target marginal parameters, such as the probability of experi-
encing a relapse, the distinction will become important. We will conclude this example by
presenting results from analyses of these two event rates taking both of the time-dependent
covariates GvHD and ANC500 into account, the latter taking the value 1 at time t if, at that
time, the Absolute Neutrophil Count is above 500 cells per µL, based on repeated blood

[Figure 3.7: plot. Horizontal axis: Cumulative relapse hazard: no GvHD, 0 to 0.20. Vertical axis: Cumulative relapse hazard: GvHD, 0 to 0.20.]

Figure 3.7 Bone marrow transplantation in acute leukemia: Cumulative relapse rate with GvHD
plotted against cumulative relapse rate without GvHD. The dashed straight line has slope equal to
exp(β̂2) = 0.858 (GvHD: Graft versus host disease).

[Figure 3.8: plot. Horizontal axis: Cumulative death hazard: no GvHD, 0 to 0.20. Vertical axis: Cumulative death hazard: GvHD, 0 to 1.0.]

Figure 3.8 Bone marrow transplantation in acute leukemia: Cumulative death rate with GvHD
plotted against cumulative death rate without GvHD. The dashed straight line has slope equal to
exp(β̂3) = 3.113 (GvHD: Graft versus host disease).
Table 3.12 Bone marrow transplantation in acute leukemia: Cox models for relapse and death in
remission (GvHD: Graft versus host disease, BM: Bone marrow, PB: Peripheral blood, AML: Acute
myelogenous leukemia, ALL: Acute lymphoblastic leukemia, ANC: Absolute neutrophil count).

(a) Relapse

Covariate                               β̂       SD       β̂       SD
GvHD(t)                              -0.184   0.134   -0.188   0.134
Age          per 10 years            -0.040   0.045   -0.039   0.045
Graft type   BM only vs. PB/BM       -0.125   0.135   -0.130   0.135
Disease      ALL vs. AML              0.563   0.130    0.562   0.130
ANC500(t)                                             -2.138   1.077

(b) Death in remission

Covariate                               β̂       SD       β̂       SD
GvHD(t)                               1.041   0.098    1.040   0.099
Age          per 10 years             0.263   0.033    0.263   0.033
Graft type   BM only vs. PB/BM       -0.085   0.096   -0.140   0.096
Disease      ALL vs. AML              0.334   0.098    0.336   0.098
ANC500(t)                                             -2.228   0.305

samples taken during follow-up, and equal to 0 otherwise. Table 3.12 shows the results. It
is seen that the earlier results for GvHD are sustained after adjustment for age, graft type,
disease (and ANC500), i.e., GvHD tends to reduce the relapse rate and increase the death
rate. Higher age markedly increases the death rate and tends to be associated with a lower
relapse rate. Patients with ALL have higher event rates than those with AML, and patients
receiving only bone marrow tend to have lower event rates than those who also receive
peripheral blood. Finally, obtaining an ANC above 500 markedly reduces both event rates.
In principle, one could go a step further and concentrate on the mortality rate (cf. Figure
1.1), treating relapse as an internal time-dependent covariate. However, as indicated in Fig-
ure 3.4, occurrence of relapse is such a serious event that it is typically treated as a separate
end-point.

Time-dependent covariates or state

An adapted time-dependent covariate at time t is a function of the past of the multi-state
process before that time, for example whether a given state has been visited before t.
We have seen that the same time-dependent covariate may also be included in a
simplified multi-state model without inclusion of that state. This avoids the need for
modeling the development of the time-dependent covariate; however, the covariate
is then no longer adapted and prediction from the model is no longer feasible.
The ultimate choice between these options will depend on the scientific questions
the model is aimed at addressing.
TIME-DEPENDENT COVARIATES 101
3.7.9 Additional issues
We have seen that incorporating time-dependent covariates into a hazard regression model
is, in principle, straightforward. The same methods of estimation apply as for models in-
cluding only time-fixed covariates, e.g., the Cox partial likelihood. Furthermore, the inter-
pretation of exp(β ) is the ratio between the instantaneous event risks per time unit for a one-
unit change in Z and for given values of the remaining explanatory variables in the model.
However, this simplicity is somewhat deceptive and models including time-dependent co-
variates are considerably more involved than models without. In the following, we will
discuss this in more detail, see also Fisher and Lin (1999).

Data availability; measurement errors

To set up the Cox partial likelihood or the estimating equations for the Aalen model, values
for all covariates at all event times are needed. For internal covariates, this may entail some
difficulties if covariates are based on repeated measurements of some marker, such as the
ANC in the bone marrow transplantation study. Thus, at any given event time, T , what
will typically be known is the last value recorded before time T , and in order to assess the
covariate value at T , some extrapolation or modeling of the repeated measurements of the
marker is needed. Frequently, last observation carried forward is used, though studies have
shown that this may not be optimal and better ways of extrapolating are preferable (e.g.,
Andersen and Liestøl, 2003). Note that interpolation between values before and after T is
not recommended since later values are only available for the selected group of subjects
who survive until the time of the next measurement, thereby one is ‘conditioning on the
future’ when including this information. The fact that the marker varies over time will also
imply that it is measured with some error, so that accounting for measurement error may
be needed (e.g., Bycott and Taylor, 1998). In the bone marrow transplantation study, what
was recorded was the first time for which ANC exceeded 500 cells per µL and the true
time of crossing this threshold would have been at an earlier point in time. A solution that
is sometimes used to such problems is joint modeling of the marker and the multi-state
process. We will briefly discuss such techniques in Section 7.4.
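As a minimal sketch of last observation carried forward, the following Python helper looks up the most recent marker value recorded strictly before a given event time. The function name `locf_value` and the measurement times and values are hypothetical and serve only as illustration; they are not data from the bone marrow transplantation study.

```python
from bisect import bisect_left


def locf_value(times, values, t):
    """Return the last marker value recorded strictly before time t.

    `times` must be sorted in increasing order. Returns None when no
    measurement precedes t. Values recorded at or after t are never used,
    so the lookup does not condition on the future.
    """
    k = bisect_left(times, t)  # number of measurement times strictly below t
    return values[k - 1] if k > 0 else None


# Hypothetical repeated measurements for one subject (time in months).
measure_times = [0.0, 1.0, 3.0, 6.0]
marker_values = [120, 340, 610, 580]

# Covariate value to use at an event time of 4.5 months: the value from month 3.
value_at_event = locf_value(measure_times, marker_values, 4.5)

# No measurement precedes time 0, so no value is available.
value_before_first = locf_value(measure_times, marker_values, 0.0)
```

Interpolating between the values at months 3 and 6 would instead use a measurement that exists only for subjects still under observation at month 6, which is the selection problem described above.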
A related problem occurs when time-dependent covariates are used in situations with de-
layed entry. Here, the information collected for a subject at the time of entry into the study
may not suffice for calculating the value of a covariate such as the duration since some
earlier event, and care must be exercised when trying to include such covariates. Another
problem with delayed entry and internal time-dependent covariates may occur when mod-
eling rates in terms of age rather than time on study. This is because the risk set at any
given age may consist of subjects with strongly varying times since the latest measurement
of the marker and, thereby, possibly with differential measurement error (e.g., Andersen
and Liestøl, 2003).

Interpretation of covariate effects

In a multiplicative model, exp(β ) is the ratio between event rates for a one-unit change in
Z and for given values of the remaining explanatory variables in the model. This means
that if a model includes both a time-fixed covariate, Z1 , such as initial diagnosis or ran-
domized treatment group, and a non-deterministic time-dependent covariate, Z2 (t) such as
occurrence of later events, then exp(β1 ) is the hazard ratio for Z1 for given value of Z2 (t).
This may mask some of the true effect of Z1 because some of this effect may be mediated
via changes in Z2 (t). An example of this was presented in Section 3.7.5 when analyzing the
data on repeated episodes in affective disorders and where the hazard ratio for the initial
diagnosis, bipolar vs. unipolar, was quite different without or with adjusting for N(t−), the
number of episodes before time t. A similar problem was observed when, in the PROVA
trial (Section 3.7.6), the mortality rate after a bleeding episode was modeled in relation to
the randomized treatment and it was noticed that the patients contributing to this analy-
sis may be selected differently in the treatment groups due to the way in which treatment
affects the occurrence of a bleeding and the mortality rate without bleeding. The desired
effect of a baseline covariate on the event occurrence may, in such situations, be better stud-
ied in terms of the cumulative risk of the event or the expected number of recurrent events.
We will return to this problem in Sections 4.2.3 and 5.5.4.
An additional problem with interpretation that we will also return to in later chapters is that
a hazard ratio cannot necessarily be interpreted as reflecting a risk ratio. This is the case
for both time-fixed and for time-dependent covariates and is due to the fact that the same
covariate may affect other transition hazards in the multi-state model (e.g., in the case of
competing risks, see Sections 4.1.2 and 4.2.2).

Immortal time bias

A time-dependent covariate is sometimes, mistakenly, considered to be time-fixed and this
will lead to immortal time bias (e.g., Suissa, 2007; Andersen et al., 2021). The name of this
bias is explained below. In the bone marrow transplantation study, one might try to com-
pare rates of relapse among those who do or do not ever experience GvHD, or, in the study
of vaccinations and mortality in Guinea-Bissau, one may wish to compare the mortality
rates between children who do or do not receive additional vaccinations in the period be-
tween the two visits by the mobile team. This entails conditioning on the future and leads
to bias because those who live long enough to obtain GvHD or receive additional vacci-
nations without experiencing the event of interest will appear to have a longer time until
event occurrence. The correct way to handle the situation is to treat GvHD or additional
vaccinations as time-dependent covariates (e.g., Jensen et al., 2007).
We will illustrate this bias via the bone marrow transplantation study. Recall that when,
correctly, treating GvHD as a time-dependent covariate, the estimated hazard ratio for GvHD
is exp(β̂2) = 0.858 for relapse and exp(β̂3) = 3.113 for death in remission (Section 3.7.8).
If one, naively, includes instead the variable ‘GvHD: yes/no’ in the model as a time-fixed
covariate, the hazard ratio in relation to relapse is 0.544, and for death in remission it is
1.576. Thus, for both outcomes GvHD appears more beneficial in the incorrect analyses.
This can be explained, as follows. There are 3,938.61 person-years at risk in the initial
state (0) in the multi-state model of Figure 1.6 and 3,255.19 in the GvHD state (1). How-
ever, out of the person-years spent in state 0, 283.42 are from patients who later develop
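GvHD. These person-years are spent in state 0 and should not be credited to the GvHD group.
They are 'immortal' in the sense that a patient classified as a GvHD patient cannot, by
definition, have experienced the event before developing GvHD. However, when treating GvHD
as time-fixed, these immortal years are included in the risk time among GvHD patients, which
leads to the bias described.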
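The person-year bookkeeping can be sketched in a few lines of Python. The three person-year totals are those quoted in the text; everything else is an assumption made for illustration. In particular, the calculation uses crude rates (events divided by person-years) and assumes that the event counts in the two groups are the same in both analyses, so it shows the direction of the bias and is not expected to reproduce the Cox estimates 0.544 and 1.576.

```python
# Person-years at risk reported in the text for the model of Figure 1.6.
py_state0 = 3938.61             # initial state 0 (no GvHD so far)
py_state1 = 3255.19             # GvHD state 1
py_state0_later_gvhd = 283.42   # part of py_state0 from patients who later develop GvHD

# Correct analysis (GvHD time-dependent): risk time follows the current state.
correct_gvhd = py_state1
correct_no_gvhd = py_state0

# Naive analysis (GvHD 'ever: yes/no' as time-fixed): the pre-GvHD years of
# later GvHD patients are moved into the GvHD group.
naive_gvhd = py_state1 + py_state0_later_gvhd
naive_no_gvhd = py_state0 - py_state0_later_gvhd

# Assumption: event counts d1 (GvHD) and d0 (no GvHD) are unchanged, so the
# crude rate ratio (d1 / py_gvhd) / (d0 / py_no_gvhd) is multiplied by:
distortion = (correct_gvhd / naive_gvhd) * (naive_no_gvhd / correct_no_gvhd)

print(f"GvHD person-years:    {correct_gvhd:.2f} -> {naive_gvhd:.2f}")
print(f"no-GvHD person-years: {correct_no_gvhd:.2f} -> {naive_no_gvhd:.2f}")
print(f"crude rate ratio is multiplied by {distortion:.3f}")
```

Under these assumptions the factor is below 1 for any event counts, so the naive crude rate ratio for GvHD is pulled downwards for both relapse and death in remission, in the same direction as the hazard ratios reported above.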
MODELS WITH SHARED PARAMETERS 103
3.8 Models with shared parameters
In most of the examples so far, different transitions in a multi-state model were modeled
separately, i.e., models for different transition intensities had no parameters in common.
Thus, in Section 3.6.3, separate models for the rate of bleeding and the rate of death with-
out bleeding in the PROVA trial (Example 1.1.4) were studied and, in Section 3.7.8, the
rates of relapse and that of death in remission in the bone marrow transplantation study
(Example 1.1.7) were also modeled separately. Since, in these examples the rates of com-
pletely different events were studied, having parameters in common for the different tran-
sition intensities seems quite unnatural. However, in the case of recurrent events (Section
2.5), examples were given (in the form of PWP or AG models) where the initial diagnosis
of bipolar versus unipolar disorder in the study of recurrent episodes in affective disorders
(Example 1.1.5) had the same multiplicative effect on the rates of first, second, third, ...
recurrence. Also in the PROVA trial and in the bone marrow transplantation study, models
with common regression coefficients could be envisaged. Thus, in the former case, vari-
ables such as sex and age could have the same effects on the mortality rates with or without
a bleeding episode, and in the latter, variables such as disease and graft type could have the
same effect on the rate of death in remission with or without GvHD.
In the present section, we will study models for several transition intensities where some
covariates may have common effects across different transitions. We will provide a detailed
study of the illness-death model (Figure 1.3); however, the ideas presented for this model
carry over to more complex multi-state models. As we shall see, the concepts of type-
specific covariates and time-dependent strata are crucial in this discussion.
We address simultaneous (Cox) modeling of the rates α02 (t) and α12 (t) in an illness-death
model, and for each covariate Z there is a choice whether it has the same or different effects
on these two rates. Furthermore, α02 (t) and α12 (t) may or may not be proportional. The
modeling combination that we have mostly focused on so far is when all covariates have
different effects and where the two rates are not proportional, in which case the two rates are
modeled separately. However, we shall see in what follows that all modeling combinations
may be obtained by fitting one common model for the two hazards to a duplicated data set
where either Z is used directly or replaced by two type-specific covariates, and the model
is either stratified (time-dependently) by the starting state (0 or 1) or that state is used as a
time-dependent covariate.

3.8.1 Duplicated data set

The starting point for such an analysis is the two separate data sets for, respectively, the
0 → 2 and 1 → 2 transitions, see Table 1.10 for a discussion in relation to the PROVA
trial where ‘disease’ corresponds to bleeding. The data set for the 0 → 2 transition has
variables (Start, Stop, Status) where Start = 0, Stop = time last seen in state
0, and Status = 1, if at time Stop, a 0 → 2 transition (death without the disease) was
observed and = 0 otherwise. Suppose that the data set, additionally, contains a (numerical)
covariate Z. The data set for the 1 → 2 transition has Start = time of entry into state 1
(time of disease), Stop = time last seen in state 1, and Status = 1, if at time Stop, a
1 → 2 transition was observed (death with the disease) and = 0 otherwise. Suppose that
this data set also contains the covariate Z. The duplicated data set should have one or two
records for each subject, one if the disease was not observed and two if it was observed.
Each record should have the following variables:
• (Start, Stop, Status) copied from the original data sets.
• Z copied from the original data sets.
• A stratum variable, say Type, that is 0, if the record came from the 0 → 2 data set and 1
if the record came from the 1 → 2 data set.
• Two type-specific covariates (Z0 , Z1 ) = (Z, 0) if Type=0, and (Z0 , Z1 ) = (0, Z) if Type=1,
i.e., the type-specific covariate corresponding to ‘the other value of Type’ is set to 0.
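The construction of the duplicated data set can be sketched as follows. This is a minimal illustration in plain Python, not the authors' code: the input layout (keys `id`, `z`, `t_disease`, `t_last`, `dead`) and the three subjects are hypothetical, while the output variables `Start`, `Stop`, `Status`, `Z`, `Type`, `Z0` and `Z1` follow the list above.

```python
def duplicate_records(subjects):
    """Build the duplicated data set for the 0 -> 2 and 1 -> 2 transitions.

    Each subject is a dict with (hypothetical) keys:
      id        : subject identifier
      z         : numerical covariate Z
      t_disease : time of entry into state 1, or None if disease was not observed
      t_last    : time last seen (death or censoring)
      dead      : True if death was observed at t_last
    """
    rows = []
    for s in subjects:
        diseased = s["t_disease"] is not None
        # Record from the 0 -> 2 data set: at risk in state 0 from time 0
        # until disease onset, death or censoring, whichever comes first.
        rows.append({
            "id": s["id"], "Start": 0.0,
            "Stop": s["t_disease"] if diseased else s["t_last"],
            "Status": int(s["dead"] and not diseased),
            "Z": s["z"], "Type": 0, "Z0": s["z"], "Z1": 0.0,
        })
        if diseased:
            # Record from the 1 -> 2 data set: at risk in state 1 from the
            # time of disease (delayed entry) until death or censoring.
            rows.append({
                "id": s["id"], "Start": s["t_disease"], "Stop": s["t_last"],
                "Status": int(s["dead"]),
                "Z": s["z"], "Type": 1, "Z0": 0.0, "Z1": s["z"],
            })
    return rows


# Three hypothetical subjects: disease then death, death without disease, censored.
subjects = [
    {"id": 1, "z": 1.2, "t_disease": 2.0, "t_last": 4.0, "dead": True},
    {"id": 2, "z": 0.4, "t_disease": None, "t_last": 3.0, "dead": True},
    {"id": 3, "z": 2.1, "t_disease": None, "t_last": 5.0, "dead": False},
]
rows = duplicate_records(subjects)
for r in rows:
    print(r)
```

Subject 1 contributes two records and the other two subjects one each, in line with the rule that a subject has two records only if the disease was observed.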
The models with different combinations of common versus different effects of Z and pro-
portional versus non-proportional baseline 0 → 2 and 1 → 2 transition rates can now be fit-
ted to the duplicated data set, as follows, where, in all cases, the (Start, Stop, Status)
triple is used as response variable.
• Different covariate effects and non-proportional hazards

α02 (t | Z) = α02,0 (t) exp (β0 Z), α12 (t | Z) = α12,0 (t) exp (β1 Z). (3.28)

Include the type-specific covariates (Z0 , Z1 ) and stratify by Type (time-dependent strata).
This is equivalent to fitting separate models for the two transition rates.
• Same covariate effect and non-proportional hazards

α02 (t | Z) = α02,0 (t) exp (β Z), α12 (t | Z) = α12,0 (t) exp (β Z). (3.29)

Include the original covariate Z and stratify by Type (time-dependent strata).

• Different covariate effects and proportional hazards

α02 (t | Z) = α0 (t) exp (β0 Z), α12 (t | Z) = α0 (t) exp (β1 Z + γ). (3.30)

Include the type-specific covariates (Z0 , Z1 ) and use Type as a (time-dependent) covari-
ate – the latter yielding the hazard ratio exp(γ).
• Same covariate effect and proportional hazards

α02 (t | Z) = α0 (t) exp (β Z), α12 (t | Z) = α0 (t) exp (β Z + γ). (3.31)

Include the original covariate Z and use Type as a (time-dependent) covariate – the latter
yielding the hazard ratio exp(γ).
Models (3.28) and (3.29), respectively (3.30) and (3.31), may be compared using likelihood
ratio tests, i.e., it can be examined whether the regression coefficients for Z can be taken to
be the same for the two transition types. Comparing models (3.28) and (3.30), respectively
(3.29) and (3.31), corresponds to an examination of proportional hazards as exemplified,
e.g., in Sections 2.2.2 and 3.7.8. Multiple regression models, i.e., including more covariates
with combinations of identical and different effects on the two rates, can be set up along
these lines and will be exemplified below. For all models, model-based SD can be applied.
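The statement that model (3.28) is equivalent to fitting separate models can be checked numerically. The sketch below evaluates a Cox log partial likelihood at given coefficient values (it does not maximize it) for a small hypothetical duplicated data set, and compares the joint stratified version with the sum of the two separate ones. The function name, the data and the coefficient values are assumptions made for illustration.

```python
import math


def log_partial_likelihood(rows, beta, covariates, stratify=False):
    """Cox log partial likelihood for (Start, Stop, Status) records.

    The risk set at an event time t holds the records with Start < t <= Stop,
    restricted to records with the same Type when stratify=True.
    """
    def lp(r):
        return sum(b * r[c] for b, c in zip(beta, covariates))

    total = 0.0
    for e in rows:
        if e["Status"] != 1:
            continue
        t = e["Stop"]
        risk = [r for r in rows
                if r["Start"] < t <= r["Stop"]
                and (not stratify or r["Type"] == e["Type"])]
        total += lp(e) - math.log(sum(math.exp(lp(r)) for r in risk))
    return total


def row(start, stop, status, z, typ):
    """One record of the duplicated data set with type-specific covariates."""
    return {"Start": start, "Stop": stop, "Status": status, "Z": z, "Type": typ,
            "Z0": z if typ == 0 else 0.0, "Z1": z if typ == 1 else 0.0}


# Hypothetical duplicated data set (four subjects, two of whom get the disease).
rows = [
    row(0.0, 2.0, 0, 1.0, 0), row(0.0, 3.0, 1, 0.0, 0),
    row(0.0, 5.0, 0, 1.0, 0), row(0.0, 1.5, 0, 0.0, 0),
    row(2.0, 4.0, 1, 1.0, 1), row(1.5, 6.0, 0, 0.0, 1),
]
b0, b1 = 0.4, -0.7  # arbitrary coefficient values for the comparison

# Model (3.28): type-specific covariates, stratified by Type.
joint = log_partial_likelihood(rows, (b0, b1), ("Z0", "Z1"), stratify=True)

# Separate models for the 0 -> 2 and the 1 -> 2 transition.
sep0 = log_partial_likelihood([r for r in rows if r["Type"] == 0], (b0,), ("Z",))
sep1 = log_partial_likelihood([r for r in rows if r["Type"] == 1], (b1,), ("Z",))

# Model (3.29): one common coefficient for Z, still stratified by Type.
common = log_partial_likelihood(rows, (b0,), ("Z",), stratify=True)

print(joint, sep0 + sep1, common)
```

Because the joint and the separate log partial likelihoods agree for every choice of coefficients, they are maximized by the same estimates. Passing the single covariate `Z` with `stratify=True` gives the likelihood of model (3.29) instead.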
The models (3.28) and (3.30) with type-specific covariates correspond to inclusion of an
interaction between Type and Z. This observation suggests how joint Poisson models for
the two mortality rates may also be fitted. This will require a duplicated data set including
cases of death and person-years at risk, both before and after disease occurrence, and where
interactions between some covariates and Type may be included.
The data duplication trick for the illness-death model works in a similar fashion for other
multi-state models, including the competing risks model (Figure 1.2). Thus, for the PBC3
trial, models with common covariate effects on the rate of death without transplantation
and the rate of transplantation could be fitted as well as models with proportional cause-
specific hazards. However, since the interpretation of such common effects on rates of quite
different events is not attractive, we will not illustrate this feature in the following.

3.8.2 PROVA trial in liver cirrhosis

We will study the PROVA trial (Example 1.1.4) and joint models for the mortality rates
with or without a previous bleeding. For simplicity, we drop the treatment variables and
study models including the covariates sex and log2 (bilirubin) (see also Section 3.7.6). We
will also (for illustration, and in spite of the fact that better-fitting models were identified in
Section 3.7.6) study Markov models, i.e., the time-variable t used for both the mortality rate
without bleeding α02 (t) and that after bleeding, α12 (t) is time since randomization. Table
3.13 shows the estimated coefficients for the two covariates obtained by fitting a number
of joint models, and Figure 3.9 shows the estimated cumulative baseline hazards for the
model (3.28) where proportionality is not assumed.
In model (3.28) (Table 3.13a), it is seen that the coefficients for sex are close, while those
for bilirubin are not (likelihood ratio tests for the corresponding model reductions are, re-
spectively, 0.04 and 9.8 both with 1 DF). In the model with identical sex effects (Table
3.13b), it is seen that bilirubin only affects the mortality rate without bleeding and the co-
efficient for log2 (bilirubin) in the model for α12 (t) can be set to 0 (Table 3.13c). Note the
gain in efficiency for the coefficient for sex when based on both mortality rates (SD in (b),
(c) vs. (a)).
Judged from Figure 3.9, it seems that the baseline rates are far from proportional and this
is supported by a test for proportionality in the model where Type is a time-dependent
covariate – here the coefficient for Type· log(t) is strongly significant (P < 0.001). For
this reason, no results from models assuming proportional mortality rates with and without
bleeding are presented.

3.8.3 Bone marrow transplantation in acute leukemia

In Section 3.7.8, we studied models for the rates of relapse and death in remission in the
bone marrow transplantation study (Example 1.1.7) using GvHD as a time-dependent co-
variate. In this section, we follow up on that example and illustrate joint modeling of the
rates α03 (t) and α13 (t) in Figure 1.6, i.e., the rates of death in remission without or with
GvHD. We will include the two covariates age and disease (ALL vs. AML), and Table
3.14 shows estimated regression coefficients from a series of such models. The model (a)
corresponds to fitting separate Cox models for the two transition rates, and it is seen that
Table 3.13 PROVA trial in liver cirrhosis: Joint Cox models for the mortality rates without or with
bleeding (sex: Males vs. females).

(a)
                              Sex                 log2(bilirubin)
Event type                    β̂        SD         β̂        SD
Death without bleeding        1.041    0.411       0.528    0.115
Death after bleeding          0.910    0.481      -0.162    0.183
Both

(b)
                              Sex                 log2(bilirubin)
Event type                    β̂        SD         β̂        SD
Death without bleeding                             0.527    0.115
Death after bleeding                              -0.169    0.179
Both                          0.987    0.312

(c)
                              Sex                 log2(bilirubin)
Event type                    β̂        SD         β̂        SD
Death without bleeding                             0.526    0.115
Death after bleeding                               0
Both                          0.942    0.307

both covariates have quite similar effects. The cumulative baseline rates from this model
are shown in Figure 3.10 and do not contra-indicate proportionality, so model (b) in the
table shows results from a model assuming α13(t) = exp(γ)α03(t), corresponding to treating
Type (GvHD) as a time-dependent covariate. Here, γ̂ = 0.842 (SD = 0.269). Proportionality
is also supported by including the covariate GvHD(t) log(t), for which the likelihood
ratio test is LRT = 2.08 with 1 DF. The resulting model is of the form (3.30) with different
coefficients and proportional hazards. In model (c) of the table, coefficients are shown for
the model where the two type-specific covariates for age and disease are replaced by common
covariates, leading to a model of the form (3.31). The LRT for common coefficients
is 0.74 with 2 DF, supporting the simpler model, in which the gain in efficiency for the
regression coefficients can be noticed. In model (c), γ̂ = 1.049 (SD = 0.098).

3.8.4 Joint likelihood (*)

The estimators in joint Cox models for all transition intensities αh j (t) in a multi-state model
are related to those based on the stratified Cox partial likelihood (3.21) and the correspond-
ing Breslow estimators (3.22). That is, there is a single p-vector of regression coefficients
β and a number, K of unspecified baseline hazards αv0 (t), v = 1, . . . , K. To realize this, it

Figure 3.9 PROVA trial in liver cirrhosis: Breslow estimates for the cumulative baseline mortality
rates in a joint Cox model for the 0 → 2 (solid line) and 1 → 2 transition rates (dashed line).
[Plot not reproduced. Vertical axis: cumulative hazard; horizontal axis: time since randomization (years).]

Figure 3.10 Bone marrow transplantation in acute leukemia: Breslow estimates for the cumulative
baseline rates of death in remission in a joint Cox model for the mortality rates with (dashed line)
or without GvHD (solid line) (GvHD: graft versus host disease).
[Plot not reproduced. Vertical axis: cumulative hazard; horizontal axis: time since bone marrow transplantation (months).]
Table 3.14 Bone marrow transplantation in acute leukemia: Joint Cox models for the rates of death
in remission without or with GvHD (GvHD: Graft versus host disease, disease: Acute myelogenous
leukemia (AML) vs. acute lymphoblastic leukemia (ALL)).

(a)
                                        Disease             Age (years)
Event type                              β̂        SD         β̂        SD
Death in remission without GvHD         0.267    0.159      0.256    0.048
Death in remission with GvHD            0.360    0.125      0.287    0.041
Both

(b)
                                        Disease             Age (years)
Event type                              β̂        SD         β̂        SD
Death in remission without GvHD         0.262    0.159      0.246    0.048
Death in remission with GvHD            0.379    0.125      0.292    0.041
Both

(c)
                                        Disease             Age (years)
Event type                              β̂        SD         β̂        SD
Death in remission without GvHD
Death in remission with GvHD
Both                                    0.332    0.098      0.272    0.031

has to be argued that all the models studied in Section 3.8.1, i.e., with covariates having
either different or common regression coefficients for the different transition hazards and
with separate or proportional baseline hazards, may be written in the form

\[
\alpha_{hji}(t) = \alpha_{v0}(t)\exp\bigl(\boldsymbol{\beta}^{\mathsf T}\boldsymbol{Z}_{hji}(t)\bigr)
                = \alpha_{\phi(h,j)0}(t)\exp\bigl(\boldsymbol{\beta}^{\mathsf T}\boldsymbol{Z}_{hji}(t)\bigr)
\tag{3.32}
\]

with proper definition of (possibly time-dependent) p-vectors of type-specific covariates
$\boldsymbol{Z}_{hji}(t) = (Z_{hj1i}(t), \dots, Z_{hjpi}(t))$, i.e., there is one set of p covariates for each type, h → j,
of transition in the model. In (3.32), v = φ (h, j) is the function of the transition type that
takes the same value for pairs of states, h, j for which the corresponding baseline hazards
αh j0 (t) are assumed proportional.
We will now illustrate how this idea works using the three-state illness-death model as an
example (Figure 1.3). Assume that we will fit the following models for the three transition
intensities for subject i
\[
\begin{aligned}
\alpha_{01i}(t) &= \alpha_{01,0}(t)\exp(\beta_0 Z_{1i}),\\
\alpha_{02i}(t) &= \alpha_{02,0}(t)\exp(\beta_1 Z_{1i} + \beta_2 Z_{2i}),\\
\alpha_{12i}(t) &= \alpha_{02,0}(t)\exp(\gamma)\exp(\beta_1 Z_{1i} + \beta_2' Z_{2i}).
\end{aligned}
\]
Thus, there are two time-fixed covariates: $Z_1$, which influences all three hazards and influences
the two death intensities in the same way ($\beta_1$), and $Z_2$, which does not influence the disease rate
$\alpha_{01}(t)$ and influences the mortality rates without or with the disease in different ways ($\beta_2$
and $\beta_2'$). Furthermore, the mortality rates with and without the disease are proportional. It is
seen that, in this case, there are p = 5 unknown regression coefficients $\boldsymbol{\beta} = (\beta_0, \beta_1, \beta_2, \beta_2', \gamma)$
and K = 2 unspecified baseline hazards $\alpha_{01,0}(t)$ and $\alpha_{02,0}(t)$. The models may be written
in the form (3.32) with this p-vector of regression coefficients and these K baseline hazards
by defining the type-specific covariates,
\[
\begin{aligned}
\boldsymbol{Z}_{01i}(t) &= (Z_{1i}, 0, 0, 0, 0),\\
\boldsymbol{Z}_{02i}(t) &= (0, Z_{1i}, Z_{2i}, 0, 0),\\
\boldsymbol{Z}_{12i}(t) &= (0, Z_{1i}, 0, Z_{2i}, 1).
\end{aligned}
\]
The mapping φ is φ (0, 1) = 1, φ (0, 2) = φ (1, 2) = 2 corresponding to α02,0 (t) and α12,0 (t)
being proportional.
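A quick numerical check that 5-vectors of type-specific covariates reproduce the three linear predictors of this example is sketched below. The coefficient and covariate values are arbitrary and purely illustrative, and the names `type_specific_covariates`, `PHI` and `linear_predictor` are hypothetical.

```python
import math


def type_specific_covariates(z1, z2):
    """Type-specific 5-vectors for the transitions 0->1, 0->2 and 1->2."""
    return {
        (0, 1): (z1, 0.0, 0.0, 0.0, 0.0),
        (0, 2): (0.0, z1, z2, 0.0, 0.0),
        (1, 2): (0.0, z1, 0.0, z2, 1.0),  # last entry switches on gamma
    }


# Mapping phi from transition type to baseline hazard number.
PHI = {(0, 1): 1, (0, 2): 2, (1, 2): 2}


def linear_predictor(beta, z):
    """Inner product beta^T Z."""
    return sum(b * x for b, x in zip(beta, z))


# Arbitrary illustrative values: beta = (beta0, beta1, beta2, beta2', gamma).
beta0, beta1, beta2, beta2p, gamma = 0.3, -0.2, 0.5, 0.1, 0.7
beta = (beta0, beta1, beta2, beta2p, gamma)
z1, z2 = 1.0, 2.0

Z = type_specific_covariates(z1, z2)
lp = {trans: linear_predictor(beta, z) for trans, z in Z.items()}

# Direct evaluation of the three model equations for comparison.
direct = {
    (0, 1): beta0 * z1,
    (0, 2): beta1 * z1 + beta2 * z2,
    (1, 2): gamma + beta1 * z1 + beta2p * z2,
}
for trans in Z:
    print(trans, "baseline", PHI[trans], "hazard ratio", math.exp(lp[trans]))
```

Each vector has exactly p = 5 entries, one per regression coefficient, and the two death transitions share baseline number 2.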
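We now take the Jacod likelihood (3.1) as the starting point which, for the current model,
becomes
\[
L = \prod_i \prod_{(h,j)} \Bigl\{ \exp\Bigl(-\int_0^\infty Y_{hi}(t)\,\alpha_{\phi(h,j)0}(t)
      \exp\bigl(\boldsymbol{\beta}^{\mathsf T}\boldsymbol{Z}_{hji}(t)\bigr)\,dt\Bigr)
    \times \prod_t \Bigl(Y_{hi}(t)\,\alpha_{\phi(h,j)0}(t)
      \exp\bigl(\boldsymbol{\beta}^{\mathsf T}\boldsymbol{Z}_{hji}(t)\bigr)\Bigr)^{dN_{hji}(t)} \Bigr\}.
\tag{3.33}
\]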

Note that (3.33) no longer factorizes over types, ν since the same β appears for all types.
Transforming by the logarithm, differentiating with respect to a single αv0 (t), and solving
for αv0 (t) as in Section 3.3 leads, for fixed β , to the estimate
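\[
\widehat{\alpha_{v0}(t)\,dt}
  = \frac{\sum_i \sum_{\phi(h,j)=v} dN_{hji}(t)}
         {\sum_i \sum_{\phi(h,j)=v} Y_{hi}(t)\exp\bigl(\boldsymbol{\beta}^{\mathsf T}\boldsymbol{Z}_{hji}(t)\bigr)}
  = \frac{dN_v(t)}{S_{0v}(\boldsymbol{\beta}, t)},
\tag{3.34}
\]

say. Inserting this into (3.33) leads to the relevant version of the stratified Cox partial
likelihood
\[
PL(\boldsymbol{\beta}) = \prod_i \prod_v \prod_{\phi(h,j)=v} \prod_t
  \left( \frac{Y_{hi}(t)\exp\bigl(\boldsymbol{\beta}^{\mathsf T}\boldsymbol{Z}_{hji}(t)\bigr)}
              {S_{0v}(\boldsymbol{\beta}, t)} \right)^{dN_{hji}(t)},
\]
and the Breslow estimator becomes
\[
\widehat{A}_{v0}(t) = \int_0^t \frac{dN_v(u)}{S_{0v}(\widehat{\boldsymbol{\beta}}, u)}
\]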
with notation as in (3.34), see Exercise 3.6. Since the resulting estimators are likelihood-
based, model-based SD may be obtained from the second derivative of the log-likelihood,
and likelihood ratio tests are also available.
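The estimate (3.34) and the Breslow estimator can be sketched for a small hypothetical data set in which the 0 → 2 and 1 → 2 transitions share one baseline hazard (v = 2). The increments are evaluated at given coefficient values; no maximization of the partial likelihood is attempted. The record layout (with a `trans` key and an indicator `G` carrying γ), the data and the coefficient values are assumptions made for illustration.

```python
import math
from itertools import accumulate


def breslow_increments(rows, beta, covariates, phi):
    """Jumps dN_v(t) / S_0v(beta, t) of the Breslow estimator for each baseline v.

    rows : records with Start, Stop, Status, trans (the transition h -> j as a
           tuple) and the type-specific covariates named in `covariates`
    phi  : dict mapping each transition to its baseline hazard number v
    """
    def lp(r):
        return sum(b * r[c] for b, c in zip(beta, covariates))

    out = {}
    for v in sorted(set(phi.values())):
        group = [r for r in rows if phi[r["trans"]] == v]
        event_times = sorted({r["Stop"] for r in group if r["Status"] == 1})
        jumps = []
        for t in event_times:
            d_n = sum(1 for r in group if r["Status"] == 1 and r["Stop"] == t)
            s0 = sum(math.exp(lp(r)) for r in group if r["Start"] < t <= r["Stop"])
            jumps.append((t, d_n / s0))
        out[v] = jumps
    return out


def rec(start, stop, status, z, trans):
    """One record; G is 1 for the 1 -> 2 transition so that it carries gamma."""
    return {"Start": start, "Stop": stop, "Status": status, "trans": trans,
            "Z": z, "G": 1.0 if trans == (1, 2) else 0.0}


PHI = {(0, 2): 2, (1, 2): 2}  # proportional mortality rates: common baseline
rows = [
    rec(0.0, 2.0, 0, 1.0, (0, 2)), rec(0.0, 3.0, 1, 0.0, (0, 2)),
    rec(0.0, 5.0, 0, 1.0, (0, 2)), rec(0.0, 1.5, 0, 0.0, (0, 2)),
    rec(2.0, 4.0, 1, 1.0, (1, 2)), rec(1.5, 6.0, 0, 0.0, (1, 2)),
]

beta = (0.4, 0.8)  # illustrative values for (beta, gamma)
jumps = breslow_increments(rows, beta, ("Z", "G"), PHI)[2]
cumulative = list(accumulate(j for _, j in jumps))
print(jumps, cumulative)
```

With all coefficients equal to zero, each jump reduces to the number of events divided by the number at risk, i.e., the Nelson-Aalen increment for the pooled death transitions.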
3.9 Frailty models
In Chapter 2 and in previous sections of the present chapter, we have shown several ex-
amples of regression models for a single transition rate in a multi-state model and unless
there were parameters that were shared among several transitions (see Section 3.8), these
intensities could be analyzed separately. In all these examples, an assumption of indepen-
dence among observational units (typically among subjects/patients) was reasonable. In
the present section, we will study situations where the independence assumption is not nec-
essarily met. First of all, correlated event history data may be a consequence of subjects
‘coming in clusters’, such as members of the same family, patients attending the same gen-
eral physician or medical center, or inhabitants in the same community. In these situations
it is likely that the event occurrences for subjects from the same cluster are more alike
than those among clusters. An example is the bone marrow transplantation study (Example
1.1.7) where the 2,009 patients were treated in one of 255 different medical centers and
where patients attending the same center may share some common traits and, thereby, be
more alike than patients from different centers. As a quite different situation, one may be
interested in the distribution of times to entry into different states in a multi-state model,
e.g., time to event no. h = 1, 2, . . . in a recurrent events situation (e.g., Figure 1.5), or time
to relapse or to GvHD in the model for the bone marrow transplantation data (Example
1.1.7, Figure 1.6). Here, within any given patient, these times will likely be dependent.
A classical way of modeling dependence among observational units in statistics is to use
random effects to represent unobserved common traits, and in event history analysis random
effects models are known as frailty models (e.g., Hougaard, 2000). In the present section,
we will discuss frailty models, first presenting some of the more technical inference details
(Section 3.9.1) and, next, focusing on two major examples of using frailty models. Thus,
in Section 3.9.2, we will study shared frailty models for clustered data, while Section 3.9.3
presents frailty models for recurrent events, possibly jointly with mortality. In Sections 4.3
and 5.6, we will return to the problem of dependent event history data and discuss marginal
hazard models, and in Section 7.2 we will summarize our discussions.

3.9.1 Inference (*)

Frailty models may be set up quite generally for both situations giving rise to dependent
data. If data come in (independent) clusters i = 1, . . . , n with ni observations in cluster i,
then the intensity model for subject h in cluster i could be

\[
\alpha_{ih}(t \mid Z_{ih}, A_i) = A_{ih}\,\alpha^{c}_{ih}(t \mid Z_{ih}), \quad h = 1, \dots, n_i. \tag{3.35}
\]

In the situation where, for each subject i = 1, . . . , n (assumed independent), we have a
multi-state model with possible transition types ν = 1, . . . , K (Section 3.1), the frailty model for
the type ν transition could be

\[
\alpha_{\nu i}(t \mid Z_i, A_i) = A_{\nu i}\,\alpha^{c}_{\nu i}(t \mid Z_i) \tag{3.36}
\]

where the independent frailties $A_i = (A_{1i}, \dots, A_{Ki})$, i = 1, . . . , n follow some K-variate
distribution across the population. In both cases, inference for parameters in the baseline
intensities $\alpha^{c}_{\nu i}(t \mid Z_i)$, respectively $\alpha^{c}_{ih}(t \mid Z_{ih})$ (the conditional hazards given covariates for a
FRAILTY MODELS 111
frailty of 1) and in the frailty distribution may, in principle, be performed using the
likelihood approach as described in Section 3.1. For given frailty, observations are independent
with the likelihood given by the Jacod formula (3.1), and the likelihood for the observed
data is obtained by integrating out the frailty. This may entail technical and numerical
challenges and, furthermore, for this approach to work, the assumption that censoring is
independent of the frailty must be imposed (Nielsen et al., 1992). This is because the full
likelihood, as explained in Section 3.1, in addition to the factors arising from (3.1), also in-
volves factors reflecting the censoring distribution, and if these factors depend on the frailty,
then integration of the likelihood over the frailty distribution may become intractable.
Models like (3.35) and (3.36) were discussed by Putter and van Houwelingen (2015) and
by Balan and Putter (2020). Even though models of both types may, in principle, be ana-
lyzed, these authors concluded that frailty models are most useful for clustered data and
for recurrent events (without or with competing risks). As a side remark, we can mention
that frailty models may also be studied for univariate survival data to explain effects of
omitted covariates (e.g., Aalen et al., 2008, ch. 6). However, as discussed by Putter and van
Houwelingen (2015) and Balan and Putter (2020), this may become quite speculative be-
cause information on effects of missing covariates comes from deviations from proportional
hazards and a proportional hazards model with a missing covariate and a non-proportional
hazards model will be virtually indistinguishable. Following this we will, in what follows,
concentrate on frailty models for clustered data and for recurrent events.

3.9.2 Clustered data

The set-up is as follows. Data come in independent clusters i = 1, . . . , n with ni observations
in cluster i and we will assume that the intensity for subject h in cluster i is given by

\[
\alpha_{ih}(t \mid Z_{ih}, A_i) = A_i\,\alpha^{c}_{ih}(t \mid Z_{ih}), \quad h = 1, \dots, n_i. \tag{3.37}
\]

Here, the Zih are observed individual level covariates and A1 , . . . , An are independent and
identically distributed random frailties representing unobserved covariates shared by mem-
bers of cluster i. We will assume that their distribution is independent of the observed
covariates. Standard choices for the frailty distribution include the gamma distribution with
mean E(A) = 1 and an unknown standard deviation σ = SD(A) to be estimated, and the
log-normal distribution with E(log(A)) = 0 and SD(log(A)) = σ . The parameter σ quan-
tifies the unobserved heterogeneity among clusters and, at the same time, the intra-cluster
association. We will exemplify this below. The baseline hazard could be of the Cox-form

\[
\alpha^{c}_{ih}(t \mid Z_{ih}) = \alpha_0(t)\exp(\mathrm{LP}_{ih}),
\]

with an unspecified $\alpha_0(t)$, possibly stratified, or $\alpha_0(t)$ could be piece-wise constant. The
regression parameters $\beta_\ell$ in the linear predictor $\mathrm{LP}_{ih} = \beta_1 Z_{ih1} + \cdots + \beta_p Z_{ihp}$ have a
within-cluster interpretation, $\exp(\beta_\ell)$ giving the hazard ratio for a one-unit difference in covariate
$Z_{ih\ell}$ for given values of the remaining observed covariates, cf. Section 2.2.1, and for given
frailty.
Table 3.15 Bone marrow transplantation in acute leukemia: Frailty models for relapse-free survival
taking clustering by medical center into account (BM: Bone marrow, PB: Peripheral blood, AML:
Acute myelogenous leukemia, ALL: Acute lymphoblastic leukemia).

                                   Gamma frailty        Log-normal frailty
Covariate                          β̂        SD          β̂        SD
Graft type   BM only vs. BM/PB     -0.175   0.087       -0.177   0.088
Disease      ALL vs. AML            0.472   0.080        0.474   0.080
Age          per 10 years           0.187   0.028        0.189   0.028
Frailty SD²                         0.117                0.139

Bone marrow transplantation in acute leukemia

We return to the bone marrow transplantation study where patients were treated in one
of 255 different medical centers. These centers had strongly varying sizes, contributing
between 1 and 110 subjects. Shared frailty Cox models for relapse-free survival were fit-
ted with, first, a gamma distributed random effect representing unobserved factors shared
among patients from the same center and, next, a log-normal frailty. Table 3.15 shows the
results. The two sets of coefficients using the two different frailty distributions are seen to
be similar. The interpretation of, e.g., the effect of graft type is that two patients from the
same center – one receiving bone marrow only, the other bone marrow or peripheral blood
and having identical values for disease and age – have a ratio between their hazards for
relapse-free survival of exp(−0.175) = 0.839 with 95% confidence limits from 0.708 to
0.995. The estimated SD² of the gamma frailty distribution was σ̂² = 0.117. This corresponds
to a value of Kendall's τ-coefficient of concordance of 0.117/(0.117 + 2) = 0.055
(Hougaard, 2000, ch. 4), a fairly small value reflecting that the within-cluster association
is low. For the log-normal frailty, σ̂² = 0.139.
A standard Cox model stratified by center results in regression coefficients with the same
within-center interpretation as the frailty models. Estimates from this model were as
follows (with SD in brackets): graft type: −0.197 (0.114), disease: 0.471 (0.091), age: 0.210
(0.035). They are seen to be close to those from the frailty models, however, with somewhat
larger SD. This is because many of the small centers contribute with little information to
the stratified model.
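
The hazard ratio, confidence limits and Kendall's τ quoted above for the gamma frailty model can be checked with a few lines of arithmetic. The following Python sketch is not part of the original analysis: the function names are ours, the inputs are the rounded estimates from Table 3.15, and the confidence limits are assumed to be of the usual form exp(β̂ ± 1.96 · SD).

```python
import math

def hazard_ratio_ci(beta, sd, z=1.96):
    """Hazard ratio exp(beta) with limits exp(beta - z*sd) and exp(beta + z*sd)."""
    return math.exp(beta), math.exp(beta - z * sd), math.exp(beta + z * sd)

def kendall_tau_gamma(frailty_variance):
    """Kendall's tau for a shared gamma frailty: variance / (variance + 2)."""
    return frailty_variance / (frailty_variance + 2.0)

# Rounded estimates for graft type and for the frailty variance (Table 3.15).
hr, lower, upper = hazard_ratio_ci(-0.175, 0.087)
tau = kendall_tau_gamma(0.117)
print(f"HR = {hr:.3f} ({lower:.3f} to {upper:.3f}), tau = {tau:.3f}")
```

Because the inputs are rounded to three decimals, the last digit of the output may differ slightly from the values quoted in the text.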

3.9.3 Recurrent events

In the previous section, we studied the use of a shared frailty model for the analysis of
clustered data. A similar model may be applicable for recurrent events without competing
risks (Figures 1.4 or 1.5 without the absorbing state 2, respectively D). Here, an AG-type
shared frailty model for the recurrent events intensity would be

αi (t | Ai ) = Ai α0 (t) exp(LPi (t)),

allowing for a linear predictor including time-dependent covariates. Here, there is an
individual-level frailty, Ai, that is assumed to follow some distribution (typically a gamma
distribution with mean 1 and SD = σ) across the population. As was the case for clustered
data, the interpretation of a regression parameter in the linear predictor is a within-subject
effect and, for that reason, the model may be questionable for time-fixed covariates. Thus,
for a randomized study such as the LEADER trial (Example 1.1.6), the treatment effect
would be the ratio between intensities for the same subject under, respectively, treatment
and control and, in any given study, both situations would not be observed and model es-
timates build on extrapolations beyond the observed data. With time-dependent covariates,
however, the model may be more directly applicable, a possible example being the number
of previous episodes in the study of recurrence in affective disorders – see Example 1.1.5,
where one of the questions addressed was whether the disease course was deteriorating.
A deteriorating disease course is suggested if, for given frailty, there is a tendency for the
re-admission intensity to increase with the number of previous episodes. This was studied
by Kessing et al. (1999, 2004); however, under the assumption that the discharge inten-
sity (and censoring) was independent of frailty. This assumption may be unrealistic since
more severely ill patients (i.e., with a high frailty) are likely to also spend more time in the
hospital than patients with a lower frailty. An extended model to address this problem was
proposed by O’Keefe et al. (2018) where the discharge intensity has a frailty factor of 1/Ai .
In many situations, there will be a competing risk in the form of a mortality that needs to be
addressed. In Kessing et al. (1999, 2004), the approximation that mortality is independent
of frailty was imposed – an assumption that is likely to be violated since more severely ill
patients with a high frailty may also have a higher mortality rate. Having a frailty effect
that is shared between the re-admission intensity and the mortality rate (e.g., Huang and
Wang, 2004) may be more satisfactory; however, as discussed in Section 3.8, models with
the same effect of covariates (observed or unobserved) on different transitions are not easy
to interpret. More general models for recurrent events with competing risks were discussed
by Cook and Lawless (2007, ch. 6) who considered a bivariate frailty (Ai1, Ai2) and the
model
$$\alpha_i(t \mid A_{i1}) = A_{i1}\,\alpha_0(t)\exp(\mathrm{LP}_{i1}(t))$$
for the recurrent events intensity and
$$\alpha_i^D(t \mid A_{i2}) = A_{i2}\,\alpha_0^D(t)\exp(\mathrm{LP}_{i2}(t))$$
for the mortality rate. A more parsimonious model with $A_{i2} = A_{i1}^{\gamma}$, i.e.,
$$\alpha_i(t \mid A_i) = A_i\,\alpha_0(t)\exp(\mathrm{LP}_{i1}(t)),$$
$$\alpha_i^D(t \mid A_i) = A_i^{\gamma}\,\alpha_0^D(t)\exp(\mathrm{LP}_{i2}(t)) \tag{3.38}$$

was studied by Liu et al. (2004) and Rondeau et al. (2007). Here, γ is an additional param-
eter to be estimated and Ai follows a gamma distribution with mean 1. In the former paper,
inference was based on the EM algorithm while, in the latter paper, a penalized likelihood
approach was used.

LEADER cardiovascular trial in type 2 diabetes

We will illustrate the joint frailty model by quoting results from analyzing the LEADER
trial (Example 1.1.6) by Furberg et al. (2022). We will do this even though a frailty model
may not be the best choice for analyzing trial data due to its within-subject interpretation of
coefficients. The recurrent event under study is recurrent myocardial infarctions (MI) and
the competing event is all-cause death. The joint frailty model with Cox baseline hazards,
$\alpha_0(t)$, $\alpha_0^D(t)$, did not converge when using the penalized likelihood approach of Rondeau
et al. (2007), so instead, models with piece-wise constant baseline hazards were studied.
Analyses of frailty models with Cox-type or piece-wise constant baseline hazards for the
recurrent events process alone (i.e., assuming that frailty does not affect mortality) yielded
quite similar results. This is seen in Table 3.16, where the effect of treatment (log(rate ratio)
for liraglutide vs. placebo) on the recurrent MI rate is β̂ = −0.177 (SD = 0.088) for both
the piece-wise constant and the Cox-type model. The estimated frailty SD (σ̂) in the two
models (assuming a gamma distributed frailty) was 2.38 and 2.39, respectively.

Table 3.16 LEADER cardiovascular trial in type 2 diabetes: Frailty models for recurrent myocardial
infarctions (MI) with a gamma frailty distribution.

                            Piece-wise constant     Cox-type
                            β̂         SD            β̂         SD
Liraglutide vs. placebo    -0.177    0.088         -0.177    0.088
Frailty SD                  2.38                    2.39

The corresponding estimates from the joint frailty model, Equation (3.38), were β̂ = −0.186
(SD = 0.068) and σ̂ = 0.947 (SD = 0.031), see Table 3.17. In this model, the estimated effect
of treatment on mortality was β̂D = −0.211 (SD = 0.078) and the association parameter
linking the frailties for recurrent events and mortality was estimated to be γ̂ = 1.860 (SD =
0.115). The interpretation is that, for any given patient (i.e., for given frailty), treatment
with liraglutide reduces the MI rate by a factor of exp(−0.186) = 0.830 and reduces the
mortality rate by a factor of exp(−0.211) = 0.809. Furthermore, patients with a high MI rate
(high frailty) also have a high mortality rate (γ̂ > 0). Heterogeneity among patients (frailty SD,
σ̂) appears to be considerably higher when not accounting for mortality.
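
To make the interpretation of the joint frailty model concrete, the following Python sketch converts the quoted estimates into rate ratios and shows how a frailty value A enters the two rates of Equation (3.38) as A and A^γ. It is an illustration only: the function names are ours, and the frailty values 0.5, 1 and 2 are hypothetical and not estimates from the LEADER analysis.

```python
import math

def rate_ratio(beta):
    """Within-subject rate ratio exp(beta) for a treatment coefficient."""
    return math.exp(beta)

def frailty_multipliers(a, gamma):
    """Factors multiplying the baseline rates in Equation (3.38):
    a for the recurrent event intensity and a**gamma for the mortality rate."""
    return a, a ** gamma

print("MI rate ratio:       ", round(rate_ratio(-0.186), 3))
print("Mortality rate ratio:", round(rate_ratio(-0.211), 3))

# Hypothetical frailty values; gamma is the rounded estimate quoted in the text.
for a in (0.5, 1.0, 2.0):
    mi_factor, death_factor = frailty_multipliers(a, 1.860)
    print(a, mi_factor, round(death_factor, 3))
```

With γ > 0, a frailty above 1 raises both rates and a frailty below 1 lowers both, which is the positive association described above.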

Table 3.17 LEADER cardiovascular trial in type 2 diabetes: Joint frailty model for recurrent
myocardial infarctions (MI) and all-cause mortality – piece-wise constant baseline hazards and
gamma frailty distribution.

                            Recurrent MI          All-cause death
                            β̂         SD          β̂         SD
Liraglutide vs. placebo    -0.186    0.068       -0.211    0.078
Frailty SD (σ̂)              0.947
Association (γ̂)             1.86
3.10 Exercises

Exercise 3.1 (*) Show that, under the null hypothesis H0 : A0(t) = A1(t), the test statistic
$$\int_0^t K(u)\,\bigl(d\widehat{A}_1(u) - d\widehat{A}_0(u)\bigr)$$
is a martingale (Section 3.2.2).

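Exercise 3.2 (*) Show that, when evaluated at the true parameter vector β0, the Cox partial
likelihood score
$$\sum_i \int_0^t \left( \boldsymbol{Z}_i - \frac{\sum_j Y_j(u)\,\boldsymbol{Z}_j \exp(\boldsymbol{\beta}_0^{\mathsf{T}} \boldsymbol{Z}_j)}{\sum_j Y_j(u) \exp(\boldsymbol{\beta}_0^{\mathsf{T}} \boldsymbol{Z}_j)} \right) dN_i(u)$$
is a martingale (Section 3.3).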

Exercise 3.3 (*) Show that, for a Cox model with a single binary covariate, the score test
for the hypothesis β = 0 based on the first and second derivative of log PL(β ) (Equation
(3.16)) is equal to the logrank test.

Exercise 3.4 (*) Show that, for the stratified Cox model (3.20), the profile likelihood is
given by (3.21) and the resulting Breslow estimator by (3.22).

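Exercise 3.5 (*) Consider the situation in Section 3.4 with categorical covariates and show
that the likelihood is given by
$$\prod_{\ell=1}^{L} \prod_{\mathbf{c} \in \mathcal{C}} (\alpha_{0\ell}\,\theta_{\mathbf{c}})^{N_{\ell \mathbf{c}}} \exp(-\alpha_{0\ell}\,\theta_{\mathbf{c}}\,Y_{\ell \mathbf{c}}).$$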

Exercise 3.6 (*) Derive the estimating equations for the model studied in Section 3.8.4.

Exercise 3.7 Consider the Cox model for stroke-free survival in the Copenhagen Holter
study including the covariates ESVEA, sex, age, and systolic blood pressure (Exercise 2.4).
Test, using time-dependent covariates, whether the effects of these covariates may be de-
scribed as time-constant hazard ratios.

Exercise 3.8 Consider the Cox model for stroke-free survival in the Copenhagen Holter
study including the covariates ESVEA, sex, age, and systolic blood pressure. Add to that
model the time-dependent covariate I(AF ≤ t). How does this affect the effect of ESVEA?

Exercise 3.9 Consider the Cox model for stroke-free survival in the Copenhagen Holter
study including the covariates ESVEA, sex, age, and systolic blood pressure. Add to that
model, incorrectly, the covariate AF – now considered as time-fixed. How does this affect
the AF-effect?

Exercise 3.10 Consider an illness-death model for the Copenhagen Holter study with states
‘0: Alive without AF or stroke’, ‘1: Alive with AF and no stroke’, ‘2: Dead or stroke’, see
Figures 1.3 and 1.7.
1. Fit separate Cox models for the rates of the composite end-point for subjects without or
with AF, i.e., for the 0 → 2 and 1 → 2 transitions including the covariates ESVEA, sex,
age, and systolic blood pressure. The time-variable in both models should be time since
recruitment.
2. Examine to what extent a combined model for the two intensities (i.e., possibly with
common regression coefficients and/or proportional hazards between the 0 → 2 and 1 →
2 transition rates) may be fitted.

Exercise 3.11 Consider the data on repeated episodes in affective disorder, Example 1.1.5.
1. Fit separate gamma frailty models for unipolar and bipolar patients including the co-
variate ‘number of previous events N(t−)’ assuming (not quite satisfactorily!) that the
mortality rate is independent of frailty.
2. Do the recurrence rates tend to increase with number of previous episodes?

Exercise 3.12 Consider the data on mortality in relation to childhood vaccinations in
Guinea-Bissau, Example 1.1.2.
1. Fit a gamma frailty model with a random effect of cluster (‘village’) including binary
variables for BCG and DTP vaccination and adjusting for age at recruitment (i.e., using
time since recruitment as time-variable). Compare the results with those in Table 2.12.
2. Fit a Cox model stratified on cluster, including binary variables for BCG and DTP vac-
cination and adjusting for age at recruitment. Compare with the results from the frailty
model.
Chapter 4

Intuition for marginal models

In this chapter, we will give a less technical introduction to the different models for risks
and other marginal parameters to be discussed in more mathematical details in Chapter 5.
Along with the introduction of the models, examples will be given to illustrate how results
from analysis of these models can be interpreted. In Chapter 2, we gave an intuitive in-
troduction to models for the basic parameter in multi-state models, the transition intensity.
As explained in Section 1.2, knowing all rates in principle enables calculation of marginal
model parameters such as the probability (risk), Qh (t) of occupying state h at time t. In
some multi-state models it is possible, mathematically, to describe this relationship. This is
the case for the two-state survival model of Figure 1.1, the competing risks model (Figure
1.2), and the progressive illness-death model (Figure 1.3). This means that if estimates are
given for all transition rates, e.g., via a regression model, then the marginal parameters may
be estimated (for given covariates) by plug-in. Plug-in refers to the idea of estimating a
given function, say g(θ ), of the parameter θ , by first getting an estimate θb of θ , and then
using g(θb) as the plug-in estimate of g(θ ). This is the topic of Section 4.1. As we shall
see there, in a regression situation this activity does not provide parameters that directly
describe how the marginal parameters are associated with the covariates. Therefore, it may
be of interest to set up regression models where marginal parameters are linked directly to
covariates. This is the topic of Section 4.2. The direct model approach has the additional
advantage that while plug-in builds on correctly specified models for all intensities (and,
thereby, a risk of model misspecification is run), only a single directly specified model for
the marginal parameter needs to be correct. In Section 4.3, we introduce marginal hazard
models that may be applicable in situations where an independence assumption need not
be justified and in Section 4.4, we return to a discussion of the concept of independent cen-
soring, including ways of studying whether censoring is affected by observed covariates.

4.1 Plug-in methods

4.1.1 Two-state model
In this model (Figure 1.11) there is only one transition rate, α01 (t) = α(t), the hazard
function for the distribution of the survival time T . It has the interpretation that α(t)dt is
(approximately) the conditional probability of dying before time t + dt (for a small dt > 0)
given survival till time t
α(t)dt ≈ P(T ≤ t + dt | T > t). (4.1)

From the hazard function, the survival function, a marginal parameter, S(t) = Q0 (t), is
derived as follows. Divide the interval from 0 to t into small intervals all of length ∆. From
Equation (4.1), the probability of surviving the next little time interval (from u to u + ∆)
given survival till u is (1 − α(u)∆) as illustrated in Figure 4.1.

Figure 4.1 The probability of surviving the time interval from u to u + ∆ given survival till u is
(1 − α(u)∆). [Time axis from 0 to t with grid points 0, ∆, 2∆, ..., u, u + ∆, ..., t − ∆, t; the
interval from u to u + ∆ is labeled 1 − α(u)∆.]

The marginal probability, S(t), is the product of such factors (conditional probabilities) for
u < t
$$S(t) = (1 - \alpha(\Delta)\Delta) \cdot (1 - \alpha(2\Delta)\Delta) \cdots (1 - \alpha(u)\Delta) \cdots (1 - \alpha(t - \Delta)\Delta). \tag{4.2}$$
This observation leads to the Kaplan-Meier estimator for the survival function (Kaplan and
Meier, 1958), as follows. We follow the arguments in Section 2.1.1 leading to the Nelson-
Aalen estimator for the cumulative hazard, i.e., estimating α(u)∆ by the fraction

$$\frac{\text{No. of patients with an event in } (u, u + \Delta)}{\text{No. of patients at risk of an event just before time } u} = \frac{dN(u)}{Y(u)},$$

where dN(u), the observed number of failures at time u, is typically 0 or 1. The survival
function is then estimated by plugging-in this fraction into the product-representation for
S(t). Hereby, the Kaplan-Meier estimator is obtained

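$$\widehat{S}(t) = \left(1 - \frac{dN(X_1)}{Y(X_1)}\right) \cdots \left(1 - \frac{dN(X_k)}{Y(X_k)}\right) = \prod_{\text{event times } X \le t} \left(1 - \frac{1}{\text{No. at risk at } X}\right). \tag{4.3}$$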

In Equation (4.3), X1 , . . . , Xk are the individual observation times before time t (some are
event times, others are censoring times) and the second line of the equation uses the ‘prod-
uct symbol’ ∏ which is similar to the ‘summation symbol’ ∑ used previously. Note that,
for times u < t with no observed event, dN(u) is 0, and the factor 1 − dN(u)/Y (u) becomes
1, so the plug-in estimator effectively becomes a product over observed event times before
time t as seen in Equation (4.3). The standard deviation of Ŝ(t) can be estimated using the
Greenwood formula whereby confidence limits around Ŝ(t) may be obtained (Kaplan and
Meier, 1958). This is typically done by taking as starting point symmetric confidence limits
for log(A(t)), see Section 2.1.1. This amounts to 95% confidence limits for S(t) obtained
by raising Ŝ(t) to the powers exp(±1.96 · SD/(Ŝ(t)Â(t))) where SD is the Greenwood
estimate.
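
The product in Equation (4.3) and the Greenwood formula are easy to compute by hand on a small data set. The following is a minimal Python sketch, not the software used for the analyses in this book; the function name, the input format (one observation time and one event indicator per subject) and the toy data are our own assumptions.

```python
import math

def kaplan_meier(times, events):
    """Kaplan-Meier estimate with Greenwood SD at each observed event time.

    times  : observation times X_i (event or censoring times)
    events : 1 if X_i is an event time, 0 if it is a censoring time
    Returns a list of tuples (time, S_hat, greenwood_sd).
    """
    result = []
    surv, gw_sum = 1.0, 0.0
    for t in sorted(set(times)):
        at_risk = sum(1 for x in times if x >= t)                       # Y(t)
        d = sum(1 for x, e in zip(times, events) if x == t and e == 1)  # dN(t)
        if d == 0:
            continue  # the factor 1 - dN(t)/Y(t) equals 1 at censoring times
        surv *= 1.0 - d / at_risk
        if at_risk > d:  # Greenwood term is undefined when everyone at risk fails
            gw_sum += d / (at_risk * (at_risk - d))
        result.append((t, surv, surv * math.sqrt(gw_sum)))
    return result

# Toy data: six subjects, two of them censored (at times 2 and 4).
for row in kaplan_meier([1, 2, 2, 3, 4, 5], [1, 1, 0, 1, 0, 1]):
    print(row)
```

Subjects censored at an event time are counted as still at risk at that time, which is the usual convention.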
Figure 4.2 PBC3 trial in liver cirrhosis: Kaplan-Meier estimates by treatment. [Plot of survival
probability against time since randomization (years), with one curve each for Placebo and CyA.]

As an example, Figure 4.2 shows the Kaplan-Meier estimates for the two treatment groups
in the PBC3 trial (Example 1.1.1) for the time to the composite end-point ‘failure of medical
treatment’. Similar to Figure 2.1, displaying the Nelson-Aalen estimates, the figure suggests
that, unadjusted for prognostic variables, the survival probability is unaffected by treatment.
This should be no surprise since the two estimators build on exactly the same information
and, therefore, are in one-to-one correspondence with each other. One important difference,
however, lies in the interpretation of the values on the vertical axis. In Figure 2.1, the
cumulative hazard was estimated and, as discussed there, the interpretation of this quantity
is not so direct. Figure 4.2, on the other hand, depicts the fraction of patients that, over
time, is still event-free. At 2 years, the estimates are 0.846 in the CyA group and 0.832 for
placebo with values of the Greenwood SD equal to 0.029, respectively 0.030. The resulting
95% confidence interval for S(2 years) is then (0.766, 0.882) for the placebo group and
(0.800, 0.894) for CyA. To enhance readability, confidence limits have not been added to
Figure 4.2.
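
The confidence limits obtained by raising the Kaplan-Meier value to the powers exp(±1.96 · SD/(ŜÂ)) can be sketched in Python as follows. The function name is ours, Â is taken to be −log(Ŝ), and the inputs are the rounded 2-year values for the placebo group quoted above, so the output need not agree with the quoted limits in the third decimal.

```python
import math

def loglog_limits(s_hat, sd, z=1.96):
    """Limits S^exp(+z*SD/(S*A)) and S^exp(-z*SD/(S*A)), where A = -log(S)
    is the cumulative hazard corresponding to the survival probability S."""
    a_hat = -math.log(s_hat)
    power = math.exp(z * sd / (s_hat * a_hat))
    # Raising a probability to a power > 1 makes it smaller: that is the lower limit.
    return s_hat ** power, s_hat ** (1.0 / power)

lower, upper = loglog_limits(0.832, 0.030)  # placebo group at 2 years
print(round(lower, 3), round(upper, 3))
```

A useful property of limits constructed on this scale is that they always stay between 0 and 1.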
Let us have a closer look at the product representation Equation (4.2) for S(t) to understand
this one-to-one correspondence. From the product representation we get a sum representa-
tion by using the logarithm

$$-\log(S(t)) = \sum_{u < t} -\log(1 - \alpha(u)\Delta) \approx \sum_{u \le t} \alpha(u)\Delta,$$

where for the approximation we have used that − log(1 − x) ≈ x which holds for small
positive values of x as seen in Figure 4.3. It then follows that
$$S(t) = \exp\left(-\int_0^t \alpha(u)\,du\right) \tag{4.4}$$

because, for small ∆, the sum $\sum_{u \le t} \alpha(u)\Delta$ will be equal to the integral in Equation (4.4).

Figure 4.3 The functions y = − log(1 − x) and y = x for 0 < x < 1. Note that the two functions almost
coincide for small values of x > 0.

This equation was already given in Equation (1.2). This formula expresses the one-to-one
correspondence between the survival function and the hazard. Knowing the hazard is know-
ing the survival function, and vice versa.
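
The passage from the product in Equation (4.2) to the exponential formula in Equation (4.4) can be checked numerically. The Python sketch below assumes a hypothetical hazard α(u) = 0.5u (so that the cumulative hazard is 0.25t²); the hazard and the function names are chosen for illustration only.

```python
import math

def survival_by_product(hazard, t, n_steps):
    """Product of (1 - alpha(u) * delta) over u = delta, 2*delta, ..., t - delta."""
    delta = t / n_steps
    prod = 1.0
    for k in range(1, n_steps):
        prod *= 1.0 - hazard(k * delta) * delta
    return prod

def hazard(u):
    """Hypothetical hazard with cumulative hazard A(t) = 0.25 * t**2."""
    return 0.5 * u

t = 2.0
exact = math.exp(-0.25 * t ** 2)  # Equation (4.4) for this hazard
for n_steps in (10, 100, 10000):
    print(n_steps, survival_by_product(hazard, t, n_steps), exact)
```

As the intervals become shorter, the product approaches the value given by Equation (4.4).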
Suppose now that in the PBC3 example we have estimated the hazard by assuming it to be
piece-wise constant as in Table 2.1. We can then estimate S(t) by plug-in
$$\widehat{S}(t) = \exp\left(-\int_0^t \widehat{\alpha}(u)\,du\right).$$

Figure 4.4 shows the Kaplan-Meier estimator for the PBC3 placebo group together with
the plug-in estimator using a piece-wise constant hazard model for α(t) and, just like in
Figure 2.3, it is seen that the two models give quite similar results.
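
For a piece-wise constant hazard, the integral in the plug-in formula is a sum of rate times time spent in each interval. The Python sketch below illustrates this; the cut points and rates are hypothetical and are not the estimates from Table 2.1, and the function names are ours.

```python
import math

def piecewise_cumulative_hazard(t, cuts, rates):
    """A(t) for a hazard equal to rates[j] on the interval starting at cuts[j].

    cuts  : increasing interval start points, beginning with 0
    rates : one hazard value per interval; the last interval is open-ended
    """
    total = 0.0
    for j, rate in enumerate(rates):
        start = cuts[j]
        end = cuts[j + 1] if j + 1 < len(cuts) else math.inf
        if t <= start:
            break
        total += rate * (min(t, end) - start)
    return total

def piecewise_survival(t, cuts, rates):
    """Plug-in estimate S(t) = exp(-A(t))."""
    return math.exp(-piecewise_cumulative_hazard(t, cuts, rates))

cuts = [0.0, 2.0, 4.0]      # hypothetical interval start points (years)
rates = [0.10, 0.20, 0.30]  # hypothetical hazards per year
for t in (1.0, 3.0, 5.0):
    print(t, round(piecewise_survival(t, cuts, rates), 4))
```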
Having the one-to-one correspondence, Equation (4.4), a regression model for α(t) induces
a regression model for S(t). Assume a Cox regression model for α(t)

α(t) = α0 (t) exp(LP),

where the linear predictor LP is given by Equation (1.4). The survival function is then given
by
$$S(t \mid Z) = \exp\left(-\int_0^t \alpha_0(u)\exp(\mathrm{LP})\,du\right) = \exp\bigl(-A_0(t)\exp(\mathrm{LP})\bigr), \tag{4.5}$$
where $A_0(t) = \int_0^t \alpha_0(u)\,du$. Using the complementary log-log transformation cloglog of the
distribution function F(t)
$$\mathrm{cloglog}(F(t)) = \log(-\log(1 - F(t))) = \log(-\log(S(t))),$$
we get the regression model in the cloglog scale
$$\log(-\log(S(t \mid Z))) = \log(A_0(t)) + \mathrm{LP}. \tag{4.6}$$

Figure 4.4 PBC3 trial in liver cirrhosis: Estimated survival curves for the placebo group. [Plot of
survival probability against time since randomization (years), with curves for the Kaplan-Meier and
the piece-wise exponential estimators.]


The cloglog function is the link function which takes us from the marginal parameter to the
linear predictor. As an example, let us consider the models for the PBC3 trial presented
in Table 2.7. Here, the survival function at time t for a CyA treated subject (Z1 = 1) with
biochemical values albumin = Z2 , bilirubin = Z3 may (based on the Cox model) be estimated
by

$$\widehat{S}(t \mid Z_1 = 1, Z_2, Z_3) = \exp\bigl(-\widehat{A}_0(t)\exp(-0.574 - 0.091 Z_2 + 0.665\log_2(Z_3))\bigr)$$

while, for a placebo treated patient with the same values of the biochemical variables, the
estimated survival function at time t is

$$\widehat{S}(t \mid Z_1 = 0, Z_2, Z_3) = \exp\bigl(-\widehat{A}_0(t)\exp(-0.091 Z_2 + 0.665\log_2(Z_3))\bigr).$$

Figure 4.5 shows the estimated survival curves for albumin = 38g/L and bilirubin =
45µmol/L – values close to the observed average values among all patients. We can see
that, on the probability scale Equation (4.5), the treatment effect is time-dependent, while
on the cloglog scale Equation (4.6), the effect is time-constant as assumed in the Cox and
Poisson models, see Figure 4.6. It is also the case that, on the probability scale, the differ-
ence between the curves will depend on the values of albumin and bilirubin. The fact that,
on the cloglog scale, vertical distances between survival curves are constant under propor-
tional hazards was classically used to construct goodness-of-fit plots for the Cox model
based on a stratified model (e.g., Andersen et al., 1993, Section VII.3) – a technique that is
still offered by standard software packages. However, since we find that these plots may be
hard to interpret, we will not provide examples of their use and prefer, instead, plots such
as those exemplified in Figure 2.10.
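
The contrast between the probability scale and the cloglog scale can be sketched numerically. The Python code below uses the regression coefficients quoted above but, since the Breslow estimate is not reproduced here, it assumes a hypothetical cumulative baseline hazard A0(t) = 0.08t; the function names are ours as well.

```python
import math

BETA_CYA, BETA_ALBUMIN, BETA_LOG2_BILIRUBIN = -0.574, -0.091, 0.665  # from the text

def baseline_cumhaz(t):
    """Hypothetical cumulative baseline hazard (an assumption, not the Breslow estimate)."""
    return 0.08 * t

def cox_survival(t, cya, albumin, bilirubin):
    """S(t | Z) = exp(-A0(t) * exp(LP)), Equation (4.5)."""
    lp = (BETA_CYA * cya + BETA_ALBUMIN * albumin
          + BETA_LOG2_BILIRUBIN * math.log2(bilirubin))
    return math.exp(-baseline_cumhaz(t) * math.exp(lp))

def cloglog(s):
    """log(-log(S)), the scale used in Equation (4.6)."""
    return math.log(-math.log(s))

for t in (1.0, 3.0, 5.0):
    s_cya = cox_survival(t, 1, 38.0, 45.0)
    s_placebo = cox_survival(t, 0, 38.0, 45.0)
    print(t, round(s_cya - s_placebo, 4), round(cloglog(s_cya) - cloglog(s_placebo), 4))
```

Whatever baseline is assumed, the difference on the cloglog scale equals the treatment coefficient at every time point, while the difference on the probability scale changes with t.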
If a single set of covariate-adjusted survival curves for the two treatment groups is desired,
then this may be obtained by averaging curves such as those exemplified over the observed
distribution of Z2 , Z3 . As explained in Section 1.2.5, this is known as the g-formula and
works by performing two predictions for each subject, i, one setting treatment to CyA
and one setting treatment to placebo, and in both predictions keeping the observed values
(Z2i , Z3i ) for albumin and bilirubin. The predictions for each value of treatment are then
averaged over i = 1, . . . , n (see Equation (1.5)). The g-formula results in the curves shown
in Figure 4.7. Note that, if randomization in the PBC3 trial had been more successful, then
these curves would resemble the Kaplan-Meier estimates in Figure 4.2. Using the curves
obtained based on the g-formula, it is possible to visualize the treatment effect on the prob-
ability scale after covariate-adjustment using plug-in. At 2 years, the values of the curves in
Figure 4.7 are 0.799 for placebo and 0.867 for CyA with estimated SD, respectively, 0.025
and 0.019 – slightly smaller than what is obtained based on 1,000 bootstrap replications,
namely SD values of 0.028 and 0.022. The treatment effect (risk difference at 2 years) is
thus 0.867 − 0.799 = 0.068, and it has an estimated SD of 0.026 close to that based on
1,000 bootstrap replications which is 0.027.
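
The two-predictions-per-subject recipe behind the g-formula can be sketched as follows. This Python code is an illustration under stated assumptions: the covariate values are made up and stand in for the observed (Z2i, Z3i), the cumulative baseline hazard 0.08t is hypothetical, and only the regression coefficients are taken from the text.

```python
import math

def cox_survival(t, cya, albumin, bilirubin):
    """Predicted S(t | Z) from a Cox model with a hypothetical baseline A0(t) = 0.08 * t."""
    lp = -0.574 * cya - 0.091 * albumin + 0.665 * math.log2(bilirubin)
    return math.exp(-0.08 * t * math.exp(lp))

def g_formula_survival(t, cya, covariates):
    """Average of the predictions over all subjects, with treatment set to `cya`
    for everyone and each subject keeping the observed (albumin, bilirubin)."""
    predictions = [cox_survival(t, cya, albumin, bilirubin)
                   for albumin, bilirubin in covariates]
    return sum(predictions) / len(predictions)

# Hypothetical (albumin, bilirubin) values for four subjects.
covariates = [(38.0, 45.0), (20.0, 90.0), (42.0, 12.0), (35.0, 60.0)]
s_placebo = g_formula_survival(2.0, 0, covariates)
s_cya = g_formula_survival(2.0, 1, covariates)
print(round(s_placebo, 3), round(s_cya, 3), round(s_cya - s_placebo, 3))
```

The difference between the two averages is the covariate-adjusted treatment effect on the probability scale; its SD would in practice be obtained by, e.g., bootstrapping the whole procedure.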
On a technical note, one may wonder why one does typically not estimate S(t) non-
parametrically by plugging-in the Nelson-Aalen estimator into Equation (4.4). The answer
is that for a distribution with jumps as the one estimated by Nelson-Aalen, the relation-
ship between the cumulative hazard and the survival function is given by the product-
representation rather than by Equation (4.4). Having said this, it should be mentioned that
computer packages often offer the estimator ‘exp(−Nelson-Aalen)’ as an alternative to
Kaplan-Meier (and ‘− log(Kaplan-Meier)’ as an alternative to Nelson-Aalen) and that this
in practice makes little difference. Following this remark, the survival function for given
covariates based on the Cox model could, alternatively, have been estimated by a product-
representation based on the Breslow estimator for the cumulative baseline hazard. Figure
4.8 shows the result of using these, alternative, estimators when predicting the survival
function in the two treatment groups for albumin = 38 and bilirubin = 45 and, as we can
see, this makes virtually no difference compared to Figure 4.5.
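
That the two estimators differ little in practice can be seen on a toy example. The Python sketch below computes the Kaplan-Meier estimator and ‘exp(−Nelson-Aalen)’ side by side; the function name and the data are hypothetical.

```python
import math

def km_and_exp_na(times, events):
    """Return (time, Kaplan-Meier, exp(-Nelson-Aalen)) at each observed event time."""
    out = []
    km, na = 1.0, 0.0
    for t in sorted(set(times)):
        at_risk = sum(1 for x in times if x >= t)                       # Y(t)
        d = sum(1 for x, e in zip(times, events) if x == t and e == 1)  # dN(t)
        if d == 0:
            continue
        km *= 1.0 - d / at_risk   # product-representation
        na += d / at_risk         # Nelson-Aalen increment
        out.append((t, km, math.exp(-na)))
    return out

# Hypothetical data: 20 subjects with events at odd times and censorings at even times.
times = list(range(1, 21))
events = [1, 0] * 10
for t, km, exp_na in km_and_exp_na(times, events):
    print(t, round(km, 4), round(exp_na, 4))
```

Since 1 − x ≤ exp(−x), the Kaplan-Meier value never exceeds exp(−Nelson-Aalen), and the two are close as long as the risk set is not small.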

Figure 4.5 PBC3 trial in liver cirrhosis: Estimated survival curves for a patient with albumin =
38 g/L and bilirubin = 45 µmol/L based on a Cox model. There is one curve for each value of
treatment. [Plot of the estimated survival function against time since randomization (years), with
curves for Placebo and CyA.]

Figure 4.6 PBC3 trial in liver cirrhosis: Estimated survival curves for a patient with albumin = 38
g/L and bilirubin = 45 µmol/L based on a Cox model. The vertical scale is cloglog-transformed,
and there is one curve for each value of treatment. [Plot of log(−log(survival function)) against
time since randomization (years), with curves for Placebo and CyA.]

Figure 4.7 PBC3 trial in liver cirrhosis: Estimated survival curve in the two treatment groups based
on the g-formula. [Plot of the estimated survival function (g-formula) against time since
randomization (years), with curves for Placebo and CyA.]

Figure 4.8 PBC3 trial in liver cirrhosis: Estimated survival curves for a patient with albumin = 38
g/L and bilirubin = 45 µmol/L based on a Cox model. There is one curve for each value of treatment
and estimates are based on the product-formula. [Plot of the estimated survival function against
time since randomization (years), with curves for Placebo and CyA.]

Restricted mean life time

Having estimated Q0(t) = S(t), estimates of the marginal parameter τ-restricted mean life
time ε0(τ), i.e., the expected time spent in state 0 before time τ, may be obtained by plug-in.
This is because
$$\varepsilon_0(\tau) = \int_0^\tau S(t)\,dt, \tag{4.7}$$
i.e., it is the area under the survival function (Figure 1.10). The equation can be derived in
the following (perhaps less intuitive) way. If T is the life time, then ε0(τ) is the expected
value of the minimum, min(T, τ), of T and the threshold τ. This random variable may be
written as
$$\min(T, \tau) = \int_0^{\min(T,\tau)} 1\,dt = \int_0^\tau I(T > t)\,dt$$
and, therefore, ε0(τ), the expected value of this, is
$$E\left(\int_0^\tau I(T > t)\,dt\right) = \int_0^\tau E(I(T > t))\,dt = \int_0^\tau P(T > t)\,dt = \int_0^\tau S(t)\,dt$$

which is exactly the right-hand side of Equation (4.7). A non-parametric estimator for ε0(τ)
is obtained by plugging-in the Kaplan-Meier estimator for S(t) into Equation (4.7) while
a regression model for ε0(τ) may be obtained by plugging-in, e.g., a Cox model-based
estimator for S(t | Z) into the equation.

Table 4.1 PBC3 trial in liver cirrhosis: Estimated 3-year restricted means (and SD) by treatment
group. *: Based on 1,000 bootstrap replications.

                                           Placebo              CyA
Model            (albumin, bilirubin)      ε̂0(3)    SD          ε̂0(3)    SD
Non-parametric                             2.61     0.064       2.68     0.057
                                                    0.064*               0.058*
Cox              (38, 45)                  2.53     0.068*      2.72     0.054*
                 (20, 90)                  0.96     0.268*      1.38     0.279*
g-formula                                  2.55     0.060*      2.71     0.046*

The method is illustrated using the PBC3 data, and Table 4.1 shows the results. It is seen
that, unadjusted, the 3-year restricted means do not differ between the two treatment groups.
If based on a Cox model, the restricted means differ according to the chosen covariate pat-
tern and single, adjusted values may be obtained using the g-formula. Most of the SD values
in the table are based on 1,000 bootstrap replications, and it is seen that the scenario with
albumin and bilirubin values of 20 and 90 provides larger SD – the explanation being that
these values are more extreme compared to the observed distributions of the two biochem-
ical variables.
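
Since an estimated survival curve such as the Kaplan-Meier estimator is a step function, the integral in Equation (4.7) reduces to a sum of rectangle areas. The Python sketch below shows the computation; the function name, the input format and the toy curve are our own assumptions and not the PBC3 estimates.

```python
def restricted_mean(step_times, step_values, tau):
    """Area under a right-continuous step function S(t) on the interval [0, tau].

    step_times  : increasing times at which S jumps
    step_values : value of S from each jump time onwards (S(t) = 1 before the first jump)
    """
    area, prev_t, prev_s = 0.0, 0.0, 1.0
    for t, s in zip(step_times, step_values):
        if t >= tau:
            break
        area += prev_s * (t - prev_t)   # rectangle up to the next jump
        prev_t, prev_s = t, s
    return area + prev_s * (tau - prev_t)  # last rectangle, ending at tau

# Toy survival curve jumping to 0.8, 0.5 and 0.2 at times 1, 2 and 3.
print(restricted_mean([1.0, 2.0, 3.0], [0.8, 0.5, 0.2], tau=2.5))
```

For the toy curve, the area up to τ = 2.5 is 1 · 1 + 0.8 · 1 + 0.5 · 0.5.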

4.1.2 Competing risks

In the competing risks model (Figure 1.12), there is one transition hazard for each absorbing
state h, the cause-specific hazard α0h (t) = αh (t). We discussed in Chapter 2 how each single
hazard can be analyzed using, e.g., the Nelson-Aalen estimator or a Cox regression model
and an important point was that modeling of α1 (t) and α2 (t) could be done separately. An
intuitive argument for that was given, and in Section 2.4 the cause-specific hazards for the
two competing events ‘transplantation’ and ‘death without transplantation’ in the PBC3
trial (Example 1.1.1) were analyzed.
The situation is different when we want to estimate cumulative probabilities over time,
i.e., when we wish to go from the two cause-specific hazards to the three state occupation
probabilities: The overall (survival) probability of no event S(t) = Q0 (t), the probability
(or risk) of transplantation, the cumulative incidence F1 (t) = Q1 (t), and the probability of
death without transplantation, the cumulative incidence F2 (t) = Q2 (t).
The overall survival probability S(t) has the same relation to the total transition hazard out
of the initial state 0 as the one we saw in the case of overall survival data, Equation (4.4),
i.e.,
$$S(t) = \exp\left(-\int_0^t (\alpha_1(u) + \alpha_2(u))\,du\right).$$
To derive the expression for the cause h cumulative incidence (h = 1, 2), the following
argument applies. Recall the definition of the cause-specific hazard

αh (u)du ≈ P(in state h at time u + du | in state 0 at time u)

(Section 1.2.3). The cumulative incidence at time t is the probability that a cause h event
has happened between time 0 and time t. The probability that it happens in the little time
interval from u to u + du (with 0 < u ≤ t) is the probability S(u) of no events before time u
(being in state 0 at time u) times the conditional probability αh (u)du of cause h happening
in that little interval given no previous events as illustrated in Figure 4.9.

Figure 4.9 The probability that a cause h event happens in the little time interval from u to u + du is
the probability S(u) of no events before time u times the conditional probability αh(u)du. [Time
axis from 0 to t; the interval from u to u + du is labeled S(u)αh(u)du.]

Since, for different values of u, the events ‘cause h happens in the interval from u to u + du’
are exclusive, their total probability is the sum (integral) of the separate probabilities from
0 to t, i.e.,
$$F_h(t) = \int_0^t S(u)\,\alpha_h(u)\,du, \tag{4.8}$$
an equation that was already given in (1.3).
Estimating S(t) by the overall Kaplan-Meier estimator and the cumulative cause h specific
hazard by the Nelson-Aalen estimator, plug-in into Equation (4.8) leads to the
non-parametric Aalen-Johansen estimator
$$\widehat{F}_h(t) = \sum_{\text{cause } h \text{ event times } X \le t} \widehat{S}(X-)\,\frac{1}{Y(X)} \tag{4.9}$$
of the cause h cumulative incidence. In Equation (4.9), Ŝ(X−) is the Kaplan-Meier value
just before the event time, X, i.e., the jump at that time is not included. Confidence limits
around F̂h(t) may also be computed, preferably based on a symmetric confidence interval
for cloglog(Fh(t)) in the same way as for the Kaplan-Meier estimator (Section 4.1.1).
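
Equation (4.9) translates directly into a short computation: at each cause h event time, the Kaplan-Meier value just before that time is multiplied by the fraction of subjects at risk who fail from cause h. The following Python sketch is a minimal illustration; the function name, the coding of causes (0 for censoring) and the toy data are our own assumptions.

```python
def aalen_johansen(times, causes, cause):
    """Cumulative incidence for one cause at its observed event times.

    times  : observation times
    causes : 0 for a censoring, otherwise the cause (1, 2, ...) of the event
    cause  : the cause h of interest
    Returns a list of tuples (time, F_hat).
    """
    out = []
    surv, cum_inc = 1.0, 0.0  # overall Kaplan-Meier just before t, and F_h
    for t in sorted(set(times)):
        at_risk = sum(1 for x in times if x >= t)
        d_all = sum(1 for x, c in zip(times, causes) if x == t and c != 0)
        d_h = sum(1 for x, c in zip(times, causes) if x == t and c == cause)
        if d_h > 0:
            cum_inc += surv * d_h / at_risk  # S(X-) * dN_h(X) / Y(X)
            out.append((t, cum_inc))
        surv *= 1.0 - d_all / at_risk        # jump at t is included only afterwards
    return out

# Toy data: five subjects, cause 1 at times 1 and 4, cause 2 at times 2 and 5,
# and one censoring at time 3.
times, causes = [1, 2, 3, 4, 5], [1, 2, 0, 1, 2]
print(aalen_johansen(times, causes, 1))
print(aalen_johansen(times, causes, 2))
```

In the toy data the last subject at risk has an event, so the overall Kaplan-Meier estimate ends at 0 and the two cumulative incidences add up to 1.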
In a similar fashion, Cox models (i.e., regression coefficients (β̂1, β̂2) and Breslow estimates
(Â10, Â20)) for each of the cause-specific hazards may be plugged-in into Equation (4.8) to
obtain estimates of Fh(t | Z) for given values of covariates Z.
It is very important to notice that, via the factor S(·), the cumulative incidence for cause
1 depends on both of the cause-specific hazards α1 (·) and α2 (·). This means that, in spite
of the fact that inference for α1 (·) could be carried out by, formally, censoring for cause
2 events, both causes must be taken into account when estimating the cumulative risk of
cause 1 events. An estimator for F1(t) obtained as ‘1 − Ŝ1(t)’ where Ŝ1(t) is a Kaplan-Meier
estimator counting only cause 1 events as events (and cause 2 events as censorings)
will be a biased estimator. This Kaplan-Meier curve estimates $\exp(-\int_0^t \alpha_1(u)\,du)$ –
a quantity that does not possess a probability interpretation in the population where both
causes are operating (the population for which we wish to make inference, cf. Section 1.3).
The incorrect cumulative incidence estimator 1 − Ŝ1(t) will be upwards biased because
$F_1(t) \le 1 - \exp(-\int_0^t \alpha_1(u)\,du)$, intuitively because, by counting cause 2 events as
censorings, we pretend that, had these subjects not been ‘censored’, then they would still be at
risk for the event of interest (i.e., cause 1), see, e.g., Andersen et al. (2012). In other words,
the one-to-one correspondence between a single rate (αh (t)) and the risk (Fh (t)) that we
saw for the two-state model (Section 4.1.1) does not exist in the competing risks model: To
compute the cause h risk, Fh (t), not only the rate for cause h is needed, but also the rates
for the competing cause(s) (and vice versa).
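
The size of the bias is easy to work out when both cause-specific hazards are constant, because Equation (4.8) then has the closed form F1(t) = α1/(α1 + α2) · (1 − exp(−(α1 + α2)t)). The Python sketch below compares this with 1 − exp(−α1 t), the quantity targeted by the incorrect estimator; the hazard values are hypothetical and the function names are ours.

```python
import math

def cumulative_incidence(t, a1, a2):
    """F1(t) from Equation (4.8) when the cause-specific hazards a1 and a2 are constant."""
    total = a1 + a2
    return a1 / total * (1.0 - math.exp(-total * t))

def one_minus_km_target(t, a1):
    """The quantity estimated by '1 - Kaplan-Meier' for cause 1: 1 - exp(-a1 * t)."""
    return 1.0 - math.exp(-a1 * t)

a1, a2 = 0.05, 0.10  # hypothetical constant hazards per year
for t in (1, 2, 4, 6):
    print(t, round(cumulative_incidence(t, a1, a2), 4), round(one_minus_km_target(t, a1), 4))
```

The gap between the two columns grows with time and with the size of the competing hazard, and it vanishes only when the competing hazard is zero.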
We will illustrate cumulative incidences and the bias incurred when using the incorrect
estimator using the PBC3 data. Figures 4.10 and 4.11 show, for the placebo group, a
stacked plot of cumulative incidences and overall survival function computed, respectively,
correctly by using the Aalen-Johansen estimator, Equation (4.9), and incorrectly using
‘1 − Kaplan-Meier’ based on a single cause. More specifically, F̂1(t), F̂1(t) + F̂2(t), and
F̂1(t) + F̂2(t) + Ŝ(t) are plotted against t. In Figure 4.10, the latter curve is, correctly, equal
to 1 while, in Figure 4.11, this sum exceeds 1 because ‘1− Kaplan-Meier’ is an upwards
biased estimator of the cumulative incidence. Note that (in the correct Figure 4.10), the
values of the vertical axis have simple interpretations as the fractions of patients who, over
time, are expected to experience the various events.
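
The contrast between the two figures can be reproduced on a small scale. The following is a minimal sketch of the Aalen-Johansen estimator for two competing causes, written in its usual form where the increment of F̂h at an event time is Ŝ(t−) times the number of cause h events divided by the number at risk, together with the incorrect ‘1 − Kaplan-Meier’ estimator. The eight observations are invented for illustration and the function names are not taken from any package.

```python
def aalen_johansen(times, causes):
    """times: observed times; causes: 0 = censored, 1 or 2 = cause of failure.
    Returns a list of (t, S, F1, F2) evaluated at each distinct event time."""
    event_times = sorted({t for t, c in zip(times, causes) if c != 0})
    surv, f1, f2 = 1.0, 0.0, 0.0
    rows = []
    for t in event_times:
        at_risk = sum(1 for s in times if s >= t)
        d1 = sum(1 for s, c in zip(times, causes) if s == t and c == 1)
        d2 = sum(1 for s, c in zip(times, causes) if s == t and c == 2)
        f1 += surv * d1 / at_risk          # S(t-) * dN1(t) / Y(t)
        f2 += surv * d2 / at_risk          # S(t-) * dN2(t) / Y(t)
        surv *= 1.0 - (d1 + d2) / at_risk  # overall Kaplan-Meier update
        rows.append((t, surv, f1, f2))
    return rows

def one_minus_km(times, causes, cause):
    """Incorrect estimator: Kaplan-Meier with the other cause treated as
    censoring, evaluated at the last event time of the given cause."""
    km = 1.0
    for t in sorted({s for s, c in zip(times, causes) if c == cause}):
        at_risk = sum(1 for s in times if s >= t)
        d = sum(1 for s, c in zip(times, causes) if s == t and c == cause)
        km *= 1.0 - d / at_risk
    return 1.0 - km

# Invented toy data: eight subjects, two censored.
times = [1, 2, 3, 4, 5, 6, 7, 8]
causes = [1, 2, 0, 1, 2, 1, 0, 2]

for t, s, f1, f2 in aalen_johansen(times, causes):
    print(f"t = {t}: S = {s:.3f}, F1 = {f1:.3f}, F2 = {f2:.3f}, sum = {s + f1 + f2:.3f}")
print("naive cause 1:", round(one_minus_km(times, causes, 1), 3))
print("naive cause 2:", round(one_minus_km(times, causes, 2), 3))
```

With the Aalen-Johansen estimator the three stacked components add up to 1 at every event time, whereas each naive estimate is at least as large as the corresponding cumulative incidence, so the naive stack can exceed 1.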
We will also illustrate predicted cumulative incidences from cause-specific Cox models.
This can be done for a given pattern for the covariates that enter into the Cox models.
Figure 4.12 shows predicted stacked cumulative incidences and overall survival for placebo
treated female patients with, respectively (age, albumin, bilirubin) equal to (40, 38, 45), (40,
20, 90), and (60, 38, 45). It is seen that, for the second pattern, both cumulative risks are
considerably larger than for the first while, comparing the first and the last pattern, it is seen
that the older patient has a much higher risk of death and a lower risk of transplantation.
128 INTUITION FOR MARGINAL MODELS

[Plot. Vertical axis: Stacked cumulative incidence and survival (0.0 to 1.0). Horizontal axis: Time since randomization (years), 0 to 6. Curves: Transplantation; Transplantation + death without transplantation; Overall.]

Figure 4.10 PBC3 trial in liver cirrhosis: Stacked cumulative incidence and survival curves for the
placebo group. Cumulative incidences are, correctly, estimated using the Aalen-Johansen estimator,
Equation (4.9).

[Plot. Vertical axis: Stacked cumulative incidence and survival (0.0 to 1.1). Horizontal axis: Time since randomization (years), 0 to 6. Curves: Transplantation; Transplantation + death without transplantation; Overall.]

Figure 4.11 PBC3 trial in liver cirrhosis: Stacked cumulative incidence and survival curves for
the placebo group. Cumulative incidences are, incorrectly, estimated using the ‘1−Kaplan-Meier’
estimators based on single causes.
PLUG-IN METHODS 129
[Three-panel plot. Panels: (a) 40 years old, albumin = 38 g/L, bilirubin = 45 µmol/L; (b) 40 years old, albumin = 20 g/L, bilirubin = 90 µmol/L; (c) 60 years old, albumin = 38 g/L, bilirubin = 45 µmol/L. Vertical axis in each panel: Stacked cumulative incidence and survival (0.0 to 1.0). Horizontal axis: Time since randomization (years), 0 to 6. Curves: Transplantation; Transplantation + death without transplantation; Overall.]

Figure 4.12 PBC3 trial in liver cirrhosis: Predicted, stacked cumulative incidence and survival
curves for three women in the placebo group based on cause-specific Cox models.
Table 4.2 PBC3 trial in liver cirrhosis: Estimated time lost (in years) before 3 years due to
transplantation (T) and death without transplantation (D) by treatment group in four scenarios
(F: female).

Scenario  Event type  Sex  Age (years)  Albumin (g/L)  Bilirubin (µmol/L)  Placebo  CyA

1         T           No adjustment                                        0.142    0.086
          D           No adjustment                                        0.250    0.235
2         T           F    40           38             45                  0.220    0.117
          D           F    40           38             45                  0.090    0.061
3         T           F    40           20             90                  1.377    0.967
          D           F    40           20             90                  0.364    0.302
4         T           F    60           38             45                  0.080    0.043
          D           F    60           38             45                  0.373    0.256

Cause-specific time lost

For overall survival data we saw how the τ-restricted mean survival time, ε0(τ), could be
estimated as the area under the survival curve. This is the expected time lived before time
τ. The maximum time lived before τ is equal to τ and, therefore, $\tau - \varepsilon_0(\tau) = \int_0^\tau F(t)\,dt$ is
the expected time lost before time τ. We have that F(t) = F1(t) + F2(t) is the sum of the
cause-specific cumulative incidences and

$$\varepsilon_h(\tau) = \int_0^\tau F_h(t)\,dt$$

is the expected time lost ‘due to cause h’ before time τ (Andersen, 2013). This may be
estimated as the area under the Aalen-Johansen estimator (or under a model-predicted cu-
mulative incidence estimator) over the interval from 0 to τ.
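
Because the Aalen-Johansen estimate is a step function, the area under it is a finite sum of rectangles. The sketch below shows one way to compute it, assuming the estimate is supplied as jump times and the values taken from each jump onwards. The function name and the small example curves are hypothetical.

```python
def time_lost(jump_times, values, tau):
    """Area under a right-continuous step function on [0, tau].

    jump_times: increasing times at which the function jumps.
    values[i]:  value of the function on [jump_times[i], jump_times[i + 1]).
    The function is taken to be 0 before the first jump.
    """
    area = 0.0
    for i, (t, f) in enumerate(zip(jump_times, values)):
        if t >= tau:
            break
        next_t = jump_times[i + 1] if i + 1 < len(jump_times) else tau
        area += f * (min(next_t, tau) - t)
    return area

# Invented cumulative incidence curves for two causes.
tau = 5.0
eps1 = time_lost([1, 4, 6], [0.125, 0.275, 0.425], tau)  # time lost due to cause 1
eps2 = time_lost([2, 5, 8], [0.125, 0.275, 0.575], tau)  # time lost due to cause 2
eps0 = tau - eps1 - eps2                                 # restricted mean time alive

print(f"time lost, cause 1: {eps1:.3f}")
print(f"time lost, cause 2: {eps2:.3f}")
print(f"restricted mean:    {eps0:.3f}")
```

The last line uses the fact that the time lived and the cause-specific times lost add up to τ, so the restricted mean can be recovered without integrating the survival curve separately.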
Table 4.2 shows the estimated expected time lost (in years) before τ = 3 years in each
treatment group of the PBC3 trial due to transplantation and death without transplantation,
respectively. The scenarios considered are either: No adjustment or adjustment for sex,
age, albumin, and bilirubin with three different covariate configurations. In the unadjusted
situation, note that the total time lost equals 3 minus the restricted means (2.61 and 2.68
years, respectively) presented in Table 4.1. The estimated values for the SD in the two
treatment groups: Placebo, respectively CyA, based on 1,000 bootstrap replications, were
0.040 and 0.030 for transplantation, and 0.053 and 0.050 for death without transplantation.
For the other three scenarios, the numbers in Table 4.2 (for the placebo group) provide,
for each cause, one-number summaries of the tendencies seen in Figure 4.12. In this scale
we can notice the beneficial effect of CyA after adjustment. For these values, estimates
of SD could also be obtained using the bootstrap (though we have not illustrated this).
Single values for each treatment group could be obtained by averaging over the observed
covariate distribution using the g-formula (Section 1.2.5). We will illustrate this based on a
direct regression model for εh (τ) in Section 4.2.2.

   0: Disease-free  ------ α01(t) ------>  1: Diseased
          |                                     |
        α02(t)                              α13(t, d)
          |                                     |
          v                                     v
   2: Dead without disease              3: Dead with disease

Figure 4.13 States and transitions in the modified illness-death model without recovery.

Estimation of cumulative incidence

The cumulative incidence for a given cause h depends on the cause-specific hazards
for all causes. This means that, in spite of the fact that inference for separate αh (·)
can be carried out one at a time, all causes must be taken into account when esti-
mating the cumulative risk of cause h events. This is done using the Aalen-Johansen
estimator. An estimator for Fh(t) obtained as ‘1 − Ŝh(t)’ where Ŝh(t) is a Kaplan-
Meier estimator counting only cause h events as events (and other events as
censorings) will be a biased estimator. It will be upwards biased because, by counting
other events as censorings, we pretend that, had these subjects not been ‘censored’,
then they would still be at risk for the event of interest (i.e., cause h). The one-to-one
correspondence between a single rate (αh (t)) and the risk (Fh (t)) that is present in
the two-state model does not exist in the competing risks model.

4.1.3 Illness-death models

The illness-death model without recovery is illustrated in Figure 1.3. There are three states:
‘0: Disease-free’, ‘1: Diseased’, and ‘2: Dead’; and three transition rates: α02 (·), the mor-
tality rate without the disease, α01 (·), the disease rate, and α12 (·), the mortality rate with
the disease. Some discussions of this model simplify if state 2 is split into two, ‘2: Dead
without the disease’ and ‘3: Dead with the disease’, and the rate α12 (·) is renamed α13 (·);
we will do so in what follows, see Figure 4.13.
If α13 (·) = 0, then the model is the competing risks model. This means that the probabilities
Q0 (t) and Q2 (t) may be obtained from the rates α01 (·) and α02 (·) as described in Section
4.1.2 and the same holds for the expected time, ε0 (τ), spent in state 0 before time τ, i.e.,
the restricted disease-free mean life time. In the general case, i.e., α13 (·) ≥ 0, the proba-
bility Q1 (t) + Q3 (t) of having experienced the disease before time t (and either being alive
with the disease, state 1, or dead with the disease, state 3 at that time) is the cumulative
incidence $F_1(t) = \int_0^t Q_0(u)\alpha_{01}(u)\,du$ in that competing risks model. The novelty compared
to the competing risks model lies in distinguishing between occupancy of the states 1 and
3. The state 1 occupation probability, Q1 (t) will also depend on the 1 → 3 transition inten-
sity α13 (·). Additionally, as discussed in Section 3.7, a complication arises in connection
with the choice of time origin for the 1 → 3 transition intensity α13 (·). For this rate both
the time-variable t and duration since entry into state 1, say d = d(t) = t − T1 , may play a
role. If α13 (·) only depends on t then, as mentioned previously, the multi-state process is
Markovian; if it depends on d, then it is semi-Markovian (Sections 1.2.3 and 1.4). In the
Markovian case, inference for α13 (t) needs to take delayed entry into account; if α13 (·)
only depends on d, then this is not the case. The case where α13 (·) depends on both t and
d is more complex and models for this situation were discussed in Section 3.7. For the
Markovian case, a direct argument for Q1 (t) is possible, as follows.
To occupy state 1 at time t, a 0 → 1 transition must have occurred at some earlier time point
u < t and, subsequently, no 1 → 3 transition has occurred. The probability of a 0 → 1 tran-
sition between u and u + du is (following the lines of argument leading to the expression for
the cumulative incidence, Section 4.1.2) Q0(u)α01(u)du and the conditional probability of
staying in state 1 between times u and t is $\exp(-\int_u^t \alpha_{13}(x)\,dx)$ as illustrated in Figure 4.14.
Adding up these probabilities for the different u < t (i.e., integrating over u from 0 to t)

[Diagram. Time axis with the points 0, u, u + du, and t marked; the interval from u to t is annotated with $Q_0(u)\alpha_{01}(u)\,du \cdot \exp(-\int_u^t \alpha_{13}(x)\,dx)$.]

Figure 4.14 The probability that a 0 → 1 transition happens in the small time interval from u to
u + du is the probability Q0(u) of no transition out of state 0 before time u times the conditional
probability α01(u)du. The probability of no 1 → 3 transition from u to t given in state 1 at time u is
$\exp(-\int_u^t \alpha_{13}(x)\,dx)$.

leads to the desired expression

$$Q_1(t) = \int_0^t Q_0(u)\,\alpha_{01}(u)\,\exp\Big(-\int_u^t \alpha_{13}(x)\,dx\Big)\,du$$

for the state 1 occupation probability. The semi-Markovian case is similar; however, now
the probability of staying in state 1 from entry at time u and until t is
$\exp(-\int_u^t \alpha_{13}(x, x-u)\,dx)$.
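
For the Markovian case, the integral for the state 1 occupation probability can be evaluated numerically once the three intensities are specified. The sketch below assumes constant intensities, for which a closed form is available as a check; the numerical values and function names are hypothetical and serve only to illustrate the plug-in calculation.

```python
import math

def q1_numeric(t, a01, a02, a13, n=20000):
    """Midpoint-rule approximation of
    Q1(t) = int_0^t Q0(u) * a01 * exp(-a13 * (t - u)) du,
    where Q0(u) = exp(-(a01 + a02) * u) for constant intensities."""
    h = t / n
    total = 0.0
    for k in range(n):
        u = (k + 0.5) * h
        q0 = math.exp(-(a01 + a02) * u)       # still in state 0 at time u
        stay = math.exp(-a13 * (t - u))       # no 1 -> 3 transition in (u, t]
        total += q0 * a01 * stay * h
    return total

def q1_closed_form(t, a01, a02, a13):
    """Closed form of the same integral; requires a01 + a02 != a13."""
    lam = a01 + a02
    return a01 * (math.exp(-a13 * t) - math.exp(-lam * t)) / (lam - a13)

# Hypothetical constant intensities (per year).
a01, a02, a13 = 0.10, 0.05, 0.30
t = 5.0
print(f"numeric:     {q1_numeric(t, a01, a02, a13):.6f}")
print(f"closed form: {q1_closed_form(t, a01, a02, a13):.6f}")
```

With estimated, time-varying intensities the same sum would be taken over the observed transition times instead of a regular grid.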
The probability Q3 (t) equals F1 (t)−Q1 (t) and, for all h, plug-in is applicable for estimation
of Qh (t). The latter approach for estimation of Q3 (t) is related to an idea of Pepe (1991), as
follows. For both the Markovian and the semi-Markovian situation, the probability Q0 (t) +
Q1 (t) of being alive with or without the disease, may be estimated by the Kaplan-Meier
estimator, say Ŝ(t), counting all deaths and disregarding disease occurrences. This leads to
the alternative Pepe estimator

Q̂1(t) = Ŝ(t) − Q̂0(t).

[Plot. Vertical axis: Probability (0.00 to 0.05). Horizontal axis: Time since bone marrow transplantation (months), 0 to 156. Curves: Prevalence of relapse; Probability of being alive with relapse.]

Figure 4.15 Bone marrow transplantation in acute leukemia: State occupation probability and
prevalence for the relapse state.

Another quantity of interest in this model is the disease prevalence at time t

$$\frac{Q_1(t)}{Q_0(t) + Q_1(t)},$$

i.e., the conditional probability of being diseased at time t given alive at time t. Finally, the
expected time lived with the disease before time τ is

$$\varepsilon_1(\tau) = \int_0^\tau Q_1(u)\,du.$$
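
Both quantities follow directly from the state occupation probabilities. As a rough sketch, assuming constant transition intensities so that Q0(t) and Q1(t) have simple closed forms, the prevalence and the expected time lived with the disease could be computed as follows. The intensities and function names are hypothetical and unrelated to the bone marrow transplantation data.

```python
import math

def q0(t, a01, a02):
    """Probability of being disease-free and alive at time t."""
    return math.exp(-(a01 + a02) * t)

def q1(t, a01, a02, a13):
    """Probability of being alive with the disease at time t
    (constant intensities, a01 + a02 != a13)."""
    lam = a01 + a02
    return a01 * (math.exp(-a13 * t) - math.exp(-lam * t)) / (lam - a13)

def prevalence(t, a01, a02, a13):
    """Q1(t) / (Q0(t) + Q1(t)): probability of being diseased given alive."""
    diseased = q1(t, a01, a02, a13)
    return diseased / (q0(t, a01, a02) + diseased)

def time_with_disease(tau, a01, a02, a13, n=10000):
    """Trapezoidal approximation of eps1(tau) = int_0^tau Q1(u) du."""
    h = tau / n
    total = 0.5 * (q1(0.0, a01, a02, a13) + q1(tau, a01, a02, a13))
    for k in range(1, n):
        total += q1(k * h, a01, a02, a13)
    return total * h

# Hypothetical constant intensities (per year) and time horizon.
a01, a02, a13 = 0.10, 0.05, 0.30
tau = 10.0
print(f"prevalence at tau: {prevalence(tau, a01, a02, a13):.4f}")
print(f"eps1(tau):         {time_with_disease(tau, a01, a02, a13):.4f} years")
```

A high value of the 1 → 3 intensity keeps both numbers low, which mirrors the explanation given for the low probabilities seen in the relapse example.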

To illustrate this, we use a simplified version of the model for the bone marrow trans-
plantation data (Example 1.1.7) where graft versus host disease is not accounted for, see
Figure 1.6. Figure 4.15 shows both the estimated probability, Q̂1(t), of being alive with
relapse and the estimated prevalence. As a consequence of the high mortality rate with
relapse (Figure 3.4), both probabilities are rather low. From Q̂1(t), the expected time lived
with relapse before time τ can be estimated, and with τ = 120 months the estimate is
ε̂1(τ) = 1.62 (SD = 0.28) months (with SD in brackets based on 1,000 bootstrap samples).
For state 0, the expected time spent alive without relapse before τ = 120 months is
ε̂0(τ) = 75.78 (SD = 1.25) months while the expected time lost due to death before τ = 120
months is 42.61 (SD = 1.23) months (ε̂2(τ) = 29.13, SD = 1.13 months lost without relapse
and ε̂3(τ) = 13.48, SD = 0.80 months lost after relapse).
So, in the illness-death model, the use of plug-in becomes cumbersome though still doable
and the same is the case for more complicated irreversible (‘forward-going’) models, i.e.,
Table 4.3 Recurrent episodes in affective disorders: Estimated numbers of years spent in and out of
hospital, and lost due to death before τ = 15 years for patients with unipolar or bipolar disease (SD
based on 1,000 bootstrap replications).

            Out of hospital    In hospital      Dead             All
Disease     Years    SD        Years    SD      Years    SD      (τ)
Unipolar     9.59    0.51      2.20     0.23    3.21     0.48    15.00
Bipolar     12.33    0.77      1.87     0.34    0.80     0.63    15.00

those for which transitions back into previous states are not possible (e.g., Figures 1.5-
1.6). However, it seems clear that a more general technique – also covering models with
back-transitions – would be preferable, and for Markov processes such a technique exists.
This technique, based on product-integration of the transition intensities, is, however, not as
intuitive as those described in the present section. We will return to a discussion of product-
integration in Section 5.1 and skip the details here. We will rather illustrate results from an
analysis using the illness-death model with recovery, i.e., the model for recurrent episodes
(recurrent events with periods between times at which subjects are at risk for a new event),
Figure 1.4, as an example.
We study Example 1.1.5 on recurrent episodes in affective disorders and refer to an ongoing
episode as ‘being in hospital’. As for the illness-death model without recovery, we may be
interested in the state occupation probabilities, Q0 (t), the probability of being out of the
hospital t years after the initial diagnosis, Q1 (t), the probability of being in the hospital
at time t, and Q2 (t), the probability of being dead at time t. Likewise, the average times,
ε0 (τ), ε1 (τ), ε2 (τ), spent in each of the states until some threshold τ may be of interest.
Figure 4.16 shows the stacked estimates of the state occupation probabilities for patients
with unipolar or bipolar disorder. It is seen that bipolar patients spend more time out of the
hospital, and unipolar patients have a higher mortality – an observation that is emphasized
by computing the one-number summaries b εh (15 years), h = 0, 1, 2, see Table 4.3. Note that,
as explained in Section 1.2.2, the estimates add up to τ = 15 years.

Plug-in

Plug-in is a technique that enables estimation of a marginal parameter based on a
specification of all intensities in a multi-state model, at least whenever a
mathematical expression is available that expresses this dependence. It has the advantage
that the censoring distribution need not be specified (except for the fact that it
should be considered which covariates affect censoring). However, plug-in (1)
requires a correct specification of all intensities in the multi-state model, and (2) does
not provide parameters that directly link the marginal parameter to covariates.

4.2 Direct models

This section discusses models where a marginal parameter in a multi-state model is di-
rectly linked to covariates. This will be done on a case-by-case basis for specific multi-state
diagrams as exemplified in Chapter 1. In each case the method involves setting up a set
DIRECT MODELS 135
[Two-panel plot. Panels: Unipolar; Bipolar. Vertical axis in each panel: Probability (0.00 to 1.00). Horizontal axis: Time since first admission (years), 0 to 20. Stacked states: Dead; In hospital; Out of hospital.]

Figure 4.16 Recurrent episodes in affective disorders: Estimated stacked state occupation probabil-
ities for patients with unipolar or bipolar disorder.

of generalized estimating equations (GEEs) whose solutions are the regression parameters
giving this direct link. Mathematical details will be described in Section 5.5 where we will
see that this approach typically also involves estimation of the distribution of censoring
times (see also Section 4.4.1).

4.2.1 Two-state model

In this model, there is only one transition intensity, the hazard function α(t) and, as seen
in Section 4.1.1, a regression model for this hazard also induces a regression model for the
survival probability S(t) = Q0 (t) (and at the same time for the failure probability F(t) =
Q1 (t)). For a multiplicative hazard model such as the Cox model, the survival function is

S(t | Z) = exp(−A0 (t) exp(LP))

and, therefore,

log(− log(S(t | Z))) = log(− log(1 − F(t | Z))) = log(A0 (t)) + LP (4.10)

where LP = β1 Z1 + · · · + βp Zp is the linear predictor, see Section 2.2.1 and Equation (4.6).
For an additive hazard model with constant hazard differences, α(t | Z) = α0 (t) + LP, we
have that
− log(S(t | Z))/t = A0 (t)/t + LP.
In both cases, a certain transformation, the link function, of the marginal parameter (here
S(t)) gives the linear predictor and, therefore, the regression parameters can be interpreted
in the scale given by this link function, see Section 1.2.5. In the case of a multiplicative
hazard, the link function is the cloglog function, corresponding to exp(β ) being hazard
ratios (see Equation (4.6)), and in the additive case the link function is − log(·) (note the
minus sign) and the exp(β ) coefficients correspond to ratios between survival probabilities.
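
The role of the two link functions can be verified numerically. In the sketch below, a covariate with coefficient β shifts the cloglog-transformed failure probability by exactly β under a multiplicative hazard model, and shifts −log(S(t))/t by exactly β under an additive model with constant hazard difference. The baseline cumulative hazard, the coefficient, and the time point are hypothetical values.

```python
import math

def cloglog(p):
    """Complementary log-log transformation log(-log(1 - p))."""
    return math.log(-math.log(1.0 - p))

def survival_multiplicative(cum_base_hazard, lp):
    """S(t | Z) = exp(-A0(t) * exp(LP)) for a Cox-type model."""
    return math.exp(-cum_base_hazard * math.exp(lp))

def survival_additive(cum_base_hazard, lp, t):
    """S(t | Z) = exp(-A0(t) - LP * t) when alpha(t | Z) = alpha0(t) + LP."""
    return math.exp(-cum_base_hazard - lp * t)

# Hypothetical values: A0(t) at t = 2 years and a single binary covariate.
t, A0 = 2.0, 0.4
beta_mult, beta_add = 0.7, 0.05

# Multiplicative model: difference on the cloglog scale of F = 1 - S.
f_ref = 1.0 - survival_multiplicative(A0, 0.0)
f_exp = 1.0 - survival_multiplicative(A0, beta_mult)
print("cloglog difference:", round(cloglog(f_exp) - cloglog(f_ref), 6))

# Additive model: difference on the -log(S)/t scale.
s_ref = survival_additive(A0, 0.0, t)
s_exp = survival_additive(A0, beta_add, t)
print("-log(S)/t difference:", round(-math.log(s_exp) / t + math.log(s_ref) / t, 6))
```

In both cases the difference does not depend on the chosen time point, which is what makes the coefficients interpretable on the scale of the link function.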
Restricted mean life time
For the restricted mean life time and for marginal parameters in more complicated multi-
state models, hazard models do not provide a simple link between this parameter and co-
variates. In such situations, a way forward is to set up equations whose solutions provide
parameters that establish a direct link. As a first example, we will look at the restricted
mean life time ε0 (τ). For this parameter, Tian et al. (2014) proposed estimating equations
where some transformation, such as the logarithm or the identity function, of the restricted
mean survival time is linear in the covariates, i.e.,

log(ε0 (τ)) = β0 + LP
ε0 (τ) = β0 + LP.

We provide more mathematical details in Section 5.5.2 and here illustrate the method using
the PBC3 data.
Table 4.4 shows the estimated coefficients in a linear model for the 3-year restricted mean
including treatment (Z1 ), albumin (Z2 ), and bilirubin (Z3 )

ε0 (3 | Z) = β0 + β1 Z1 + β2 Z2 + β3 log2 (Z3 ),

i.e., the link function is the identity function, meaning that it is the restricted mean itself that
is given by the linear predictor. In the estimation, it has been assumed that censoring does
not depend on covariates, see Section 4.4.1 for more details. To estimate the variability
of the estimators, robust or sandwich estimators of the SD are used (we will be using
both names in what follows). The use of the word ‘sandwich’ stems from the fact that the
mathematical expression for the collection of standard deviations and correlations consists
of two identical parts, the ‘bread’, with something different, the ‘meat’, in between. The
coefficients β1 , β2 , and β3 have attractive interpretations. For given values of albumin and
bilirubin, a CyA-treated patient on average lives 0.168 years longer without transplantation
during the first 3 years after randomization compared to a placebo-treated patient; for each
extra 10 g/L of albumin, the average time lived without transplantation during 3 years
increases by 0.31 years; and for each doubling of bilirubin, the average time lived without
transplantation decreases by 0.214 years. The intercept β0 is the 3-year restricted mean when
the covariates all take the value 0 – an enhanced interpretability would be obtained by
centering the quantitative covariates (Section 2.2.2). However, more informative absolute
values for the restricted mean can be obtained using the g-formula (Section 1.2.5). This
gives the values 2.559 (0.065) years for placebo and 2.729 (0.053) for CyA with estimated
SD values based on 1,000 bootstrap replications in brackets. Since the model is linear,
the difference between these two restricted means is the coefficient (β1 ) for treatment and
based on the bootstrap procedure, the estimated value is 0.170 year with a bootstrap SD of
0.079, almost as in Table 4.4.
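
To make the interpretation of the identity link concrete, the fitted model can be evaluated for a given covariate pattern. The sketch below uses the coefficients reported in Table 4.4; the chosen patient (albumin 38 g/L, bilirubin 45 µmol/L) and the function name are illustrative assumptions, and the result is a model prediction for that pattern, not a number reported in the text.

```python
import math

# Coefficients from Table 4.4 (identity link, tau = 3 years).
COEF = {
    "intercept": 2.376,
    "cya": 0.168,              # CyA vs. placebo
    "albumin": 0.031,          # per 1 g/L
    "log2_bilirubin": -0.214,  # per doubling
}

def predicted_restricted_mean(cya, albumin, bilirubin):
    """Linear predictor for the 3-year restricted mean (in years).
    cya is 1 for CyA and 0 for placebo."""
    return (COEF["intercept"]
            + COEF["cya"] * cya
            + COEF["albumin"] * albumin
            + COEF["log2_bilirubin"] * math.log2(bilirubin))

placebo = predicted_restricted_mean(0, albumin=38, bilirubin=45)
treated = predicted_restricted_mean(1, albumin=38, bilirubin=45)
print(f"placebo: {placebo:.3f} years, CyA: {treated:.3f} years, "
      f"difference: {treated - placebo:.3f} years")
```

Note that nothing in a linear model with identity link forces the prediction to stay between 0 and τ = 3 years, so extreme covariate values may give predictions outside that range.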

4.2.2 Competing risks

For the competing risks model it is of interest to relate the cumulative incidence to
covariates, and this may be done using the Fine-Gray model (Fine and Gray, 1999). In this
model, the cumulative incidence Fh (t) for cause h is linked to covariates in the same way
Table 4.4 PBC3 trial in liver cirrhosis: Direct linear model (identity as link function) for the 3-year
restricted mean.

Covariate                            β̂        SD
Intercept                            2.376    0.381
Treatment        CyA vs. placebo     0.168    0.078
Albumin          per 1 g/L           0.031    0.008
log2(Bilirubin)  per doubling       -0.214    0.034

Table 4.5 PBC3 trial in liver cirrhosis: Estimated coefficients (and SD) from Fine-Gray models for
death without transplantation and transplantation.

                                     Death without
                                     transplantation     Transplantation
Covariate                            β̂        SD         β̂        SD
Treatment        CyA vs. placebo    -0.353    0.260      -0.409    0.368
Albumin          per 1 g/L          -0.061    0.031      -0.070    0.033
log2(Bilirubin)  per doubling        0.616    0.089       0.619    0.101
Sex              male vs. female    -0.415    0.317      -0.092    0.580
Age              per year            0.087    0.016      -0.075    0.017

as F(t) = 1 − S(t) is linked to covariates in the Cox model for survival data, i.e., via the
cloglog function – see Equation (4.6). The model is

log(− log(1 − Fh(t | Z))) = log(Ã0h(t)) + LPh

with linear predictor LPh = β1h Z1 + · · · + βph Zp. Estimating equations for the β-parameters
were proposed together with an estimator for Ã0h(t), from which Fh(t | Z) may be predicted.
Robust estimators for the associated SD were also given. We will give more details in
Section 5.5.3 and here illustrate the model using the PBC3 data.
Table 4.5 shows the estimated coefficients when fitting Fine-Gray models to the cumulative
incidences for the two competing events transplantation and death without transplantation.
In the estimation, it has been assumed that censoring does not depend on covariates. From
the negative signs of the coefficients for treatment, sex and albumin, it appears that CyA
treatment, male sex and high albumin all decrease the risks of both end-points. High biliru-
bin increases the risk of both end-points, while advanced age increases the death risk and
decreases the risk of transplantation. These results are qualitatively well in line with those
obtained when analyzing the cause-specific hazards. It is important to realize that the two
sets of models target different parameters and, as we have seen in Section 4.1.2, each cu-
mulative incidence depends on both cause-specific hazards and, therefore, a coefficient in
a Fine-Gray model depends on how the corresponding covariate is associated with both
cause-specific hazards. It follows that a situation can occur where a covariate is associated
with, e.g., an increased cause-specific hazard for cause 1 but not associated with that for
cause 2, in which case that covariate could affect (decrease) the cumulative incidence for
cause 2. This is because a high cause 1 risk ‘leaves fewer subjects to experience cause
2’. An example of this situation was provided by Andersen et al. (2012). Similar mecha-
nisms also explain differences between the coefficients from the cause-specific Cox models
(Table 2.13) and the Fine-Gray models (Table 4.5). For those covariates where the cause-
specific Cox coefficients have the same sign for both events (i.e., treatment, sex, albumin
and bilirubin), the Fine-Gray coefficients are numerically smaller while, for age where the
Cox coefficients have opposite signs, the Fine-Gray coefficients are numerically larger.
One may wonder, what is the exact interpretation of the Fine-Gray coefficients (apart from
being risk differences on the cloglog scale)? When applying the cloglog(x) = log(− log(1 − x))
transformation to the risk function F = 1 − S in the case of no competing risks, the result
is the logarithm of the cumulative hazard, and the slope of the cumulative hazard (the
hazard α(·)) has the interpretation

α(t)dt ≈ P(event in (t,t + dt) | no event < t).

However, application of the cloglog function to the cause h cumulative incidence in the
presence of competing risks results in the logarithm of a function whose slope (say, α̃h(t))
has the following interpretation

α̃h(t)dt ≈ P(cause h event in (t,t + dt) | no cause h or a competing event < t).

This function is known as the cause h sub-distribution hazard and its interpretation is not
very appealing: It gives the cause h event rate among those who have either not yet had a
cause h event or have experienced a competing event. This awkward ‘risk set’ has caused
some debate (e.g., Putter et al., 2020), but it is also the basis for the equations from which
the regression coefficients are estimated, see Section 5.5.3 for more details. In conclusion,
the Fine-Gray model has the nice feature that it provides a direct link between a cumulative
incidence and covariates, but this association, exp(β ), is expressed on the not-so-nice scale
of sub-distribution hazard ratios. The use of other link functions will be discussed in Sec-
tions 5.5.5 and 6.1.4 and, as we shall see there, these choices may entail other difficulties.
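
The relation between the sub-distribution hazard and the cause-specific hazard can be written out explicitly. The following short derivation is a sketch that uses only the definition of the cumulative incidence from Section 4.1.2, where the derivative of F_h(t) is S(t)α_h(t); the notation \tilde{A}_h for the cumulative sub-distribution hazard is introduced here for convenience.

```latex
\begin{align*}
\tilde{A}_h(t) &= -\log\bigl(1 - F_h(t)\bigr), \\
\tilde{\alpha}_h(t) &= \frac{d}{dt}\tilde{A}_h(t)
   = \frac{\frac{d}{dt}F_h(t)}{1 - F_h(t)}
   = \frac{S(t)\,\alpha_h(t)}{1 - F_h(t)}, \\
1 - F_h(t) &= S(t) + \sum_{j \neq h} F_j(t) \;\ge\; S(t)
   \quad\Longrightarrow\quad \tilde{\alpha}_h(t) \le \alpha_h(t).
\end{align*}
```

The denominator 1 − F_h(t) is the probability of the awkward ‘risk set’: subjects who are still event-free together with those who have already failed from a competing cause.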
An appealing feature of the Fine-Gray model is the ease with which the cause h cumula-
tive incidence can be predicted for given covariates by combining the estimated regression
coefficients for that cause with the estimate of the baseline cumulative sub-distribution
hazard Ã0h(t). Recall that, in order to predict a cause h cumulative incidence based on
cause-specific hazard models, both regression coefficients and cumulative baseline hazards
for all causes are needed. Prediction based on Fine-Gray models is exemplified in Figure
4.17 where the cumulative incidences for both treatments and both events in the PBC3 trial
are estimated for a 40-year old woman with albumin equal to 38 g/L and bilirubin equal to
45 µmol/L. The curves for placebo treatment can be compared with those in Figure 4.12a
and are seen to be close to those presented there. It should be noted that cause-specific Cox
proportional hazards models and Fine-Gray (proportional sub-distribution hazards) models
are mathematically incompatible.
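
The prediction step can be sketched in a few lines by inverting the cloglog link, which gives Fh(t | Z) = 1 − exp(−Ã0h(t) exp(LPh)). The regression coefficients below are those reported for transplantation in Table 4.5, whereas the value of the baseline cumulative sub-distribution hazard is a purely hypothetical placeholder, since the estimated baseline is not reported in the text. The printed risks are therefore illustrative only.

```python
import math

# Fine-Gray coefficients for transplantation from Table 4.5.
BETA = {
    "cya": -0.409,
    "albumin": -0.070,
    "log2_bilirubin": 0.619,
    "male": -0.092,
    "age": -0.075,
}

def linear_predictor(cya, albumin, bilirubin, male, age):
    return (BETA["cya"] * cya
            + BETA["albumin"] * albumin
            + BETA["log2_bilirubin"] * math.log2(bilirubin)
            + BETA["male"] * male
            + BETA["age"] * age)

def predicted_cuminc(base_cum_subhazard, lp):
    """Invert the cloglog link: F = 1 - exp(-A0_tilde(t) * exp(LP))."""
    return 1.0 - math.exp(-base_cum_subhazard * math.exp(lp))

# HYPOTHETICAL baseline cumulative sub-distribution hazard at some time t.
A0_TILDE = 0.5

lp_placebo = linear_predictor(0, albumin=38, bilirubin=45, male=0, age=40)
lp_cya = linear_predictor(1, albumin=38, bilirubin=45, male=0, age=40)
f_placebo = predicted_cuminc(A0_TILDE, lp_placebo)
f_cya = predicted_cuminc(A0_TILDE, lp_cya)
print(f"placebo: {f_placebo:.4f}, CyA: {f_cya:.4f}")
```

Only the coefficients and the baseline for the cause in question enter the calculation, in contrast to plug-in based on cause-specific hazards, where the quantities for all causes are needed.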
To illustrate how single predictions across the patient population may be obtained, we use
the g-formula (Section 1.2.5) to estimate cumulative incidences at 2 years based on the
Fine-Gray models. To estimate the SD of the resulting risks of transplantation or death
without transplantation, 1,000 bootstrap samples were drawn. For transplantation, the g-
formula gives a 2-year risk of 0.069 for placebo and 0.049 for CyA. Based on the bootstrap,
the corresponding values (SD) are, respectively, 0.113 (0.198) and 0.092 (0.202); however,
excluding samples with degenerate estimates of 1 (47 samples), the bootstrap-based values
are 0.069 (0.020) for placebo and 0.048 (0.013) for CyA – in line with the risk estimates
from the original data. The estimated treatment effect (risk difference at 2 years) is 0.021
with a bootstrap SD of 0.020. For death without transplantation, the estimated 2-year risks
are 0.117 for placebo and 0.088 for CyA in accordance with the corresponding bootstrap
values (SD) of, respectively, 0.117 (0.021) and 0.088 (0.019) leading to an estimated risk
difference at 2 years of 0.030 (0.023).
It is sometimes advocated that the Fine-Gray model is used only for the ‘cause of interest’
(say, cause 1). However, we believe that, in the competing risks model, all causes should be
analyzed because the association between the cause 1 cumulative incidence and a covariate
may be a result of an association between that covariate and the cause-specific hazard for
a competing cause. Therefore, Latouche et al. (2013) argued that, to get an overview, all
cause-specific hazards and all cumulative incidences should be studied. Another item to
pay attention to is the fact that separate Fine-Gray models for two competing causes may
be mathematically incompatible (and certainly incompatible with a Cox model for overall
survival) and may provide overall risk estimates exceeding 1.

Cause-specific time lost

In Section 4.1.2, we estimated the transplant-free time lost from either death or transplan-
tation in the PBC3 trial (Example 1.1.1) by plug-in and in Section 4.2.1, a direct regression
model was set up for the restricted mean life time following Tian et al. (2014). We will now
illustrate how a similar direct regression model may be studied for the cause-specific time
lost in the competing risks model (Conner and Trinquart, 2021). We will return to a more
detailed discussion of the resulting estimating equations in Section 5.5.3 and here illustrate
results from an analysis of the competing risks data from the PBC3 trial (assuming that
censoring does not depend on covariates).
The parameter of interest is εh (τ), and we wish to relate this to covariates via a linear
predictor LPh . This is typically done by assuming either of the models
log(εh (τ)) = βh0 + LPh
εh (τ) = βh0 + LPh ,
though other link functions may be applied. In the PBC3 example we use the identity link
and look at linear models for εh (τ), h = 1, 2, i.e., either transplantation or death without
transplantation, for τ = 3 years. Table 4.6 shows results from fitting three sets of models
for these two end-points. We first fit a model including only the treatment variable (CyA vs.
placebo, first column) yielding, for transplantation, an intercept of β̂10 = 0.138 years
(corresponding to placebo) and a difference of β̂11 = −0.049 years when comparing CyA with
placebo. For death without transplantation, the corresponding estimates are β̂20 = 0.244
years and β̂21 = 0.000 years, respectively. Note that these numbers are close to those found
in Table 4.2: 0.142 years and 0.086 − 0.142 = −0.056 years for transplantation and 0.250
years and 0.235 − 0.250 = −0.015 years for death without transplantation. Next, we
adjust for sex, age, albumin, and log2 (bilirubin), second column. In Table 4.2, we estimated
the time lost in each treatment group for different sets of fixed values for these covari-
ates, yielding different treatment effects for the different sets (0.117 − 0.220 = −0.103,

[Two-panel plot. Panels: (a) Death without transplantation, vertical axis: Cumulative incidence for death w/o transplantation (0.0 to 1.0); (b) Transplantation, vertical axis: Cumulative incidence for transplantation (0.0 to 1.0). Horizontal axis in both panels: Time since randomization (years), 0 to 6. Curves: Placebo; CyA.]

Figure 4.17 PBC3 trial in liver cirrhosis: Predicted cumulative incidence for a 40-year-old woman
with albumin equal to 38 g/L and bilirubin equal to 45 µmol/L based on Fine-Gray models.
0.967 − 1.377 = −0.410, and 0.043 − 0.080 = −0.037, respectively for transplantation
for the three choices made, and 0.061 − 0.090 = −0.029, 0.302 − 0.364 = −0.062, and
0.256 − 0.373 = −0.117, respectively for death without transplantation – all numbers in
years). By fitting the direct linear model, we get a single treatment effect of β̂11 = −0.067
years for transplantation and β̂21 = −0.082 years for death without transplantation. In the
final model in Table 4.6 (right column) we only adjust for the two biochemical variables
albumin and log2 (bilirubin) and we can compare with the results for the 3-year restricted
mean life time in Table 4.4. This is because ε0 (τ) + ε1 (τ) + ε2 (τ) = τ (Section 1.2.2) and,
therefore, coefficients β0 j , β1 j , β2 j in linear models for the three ε-parameters will satisfy
β1 j + β2 j = −β0 j . Adding up the β -parameters for the two end-points from the last model
in Table 4.6 we get, −0.068 − 0.079 = −0.147 for treatment, −0.002 − 0.028 = −0.030
for albumin, and 0.091 + 0.124 = 0.215 for log2 (bilirubin), to be compared with the coef-
ficients 0.148, 0.030, and −0.215 for the models for ε0 (τ).
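The constraint on the coefficients can be written out in one step. The display below is a sketch, assuming that the same covariates $Z_1, \dots, Z_p$ enter all three linear models; the intercept symbols $\beta_{h0}$ and the number of covariates $p$ are notation introduced here for illustration.

```latex
\begin{align*}
\varepsilon_h(\tau \mid Z) &= \beta_{h0} + \sum_{j=1}^{p} \beta_{hj} Z_j, \qquad h = 0, 1, 2, \\
\tau = \sum_{h=0}^{2} \varepsilon_h(\tau \mid Z)
  &= \sum_{h=0}^{2} \beta_{h0} + \sum_{j=1}^{p} \Bigl( \sum_{h=0}^{2} \beta_{hj} \Bigr) Z_j
  \quad \text{for all } Z, \\
\Longrightarrow \quad \beta_{00} + \beta_{10} + \beta_{20} &= \tau,
  \qquad \beta_{1j} + \beta_{2j} = -\beta_{0j}, \quad j = 1, \dots, p.
\end{align*}
```

Since the identity must hold for every covariate value, the covariate coefficients have to cancel across the three states while the intercepts add up to τ. The sums quoted above (−0.147, −0.030, and 0.215) agree with the negated coefficients for $\varepsilon_0(\tau)$ up to the third decimal.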
The overall average time lost may be estimated using the g-formula (Section 1.2.5). The
last models in Table 4.6 were re-fitted on 1,000 bootstrap samples and this provided average
time lost due to transplantation of 0.143 years (bootstrap SD = 0.040) for placebo and
0.073 years (bootstrap SD = 0.030) for CyA. This gives an average treatment effect of
−0.070 years (0.050) – close to the estimated treatment effect in the linear model of Table
4.6, however with a somewhat smaller SD. The corresponding numbers for death without
transplantation were, respectively, 0.288 years (0.057) for placebo and 0.208 years (0.048)
for CyA, yielding an average treatment effect of −0.080 years (0.072), both numbers close
to the values from the table.
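As a reminder of what the g-formula averages over, the following sketch states the standardized mean for treatment level $a$ under a direct linear model. It assumes a model without treatment-covariate interactions; the superscript notation $\hat{\varepsilon}_h^{(a)}$ is introduced here and is not taken from the text.

```latex
\[
\hat{\varepsilon}_h^{(a)}(\tau) = \frac{1}{n} \sum_{i=1}^{n} \hat{\varepsilon}_h(\tau \mid \text{treatment} = a, Z_i),
\qquad
\hat{\varepsilon}_h^{(1)}(\tau) - \hat{\varepsilon}_h^{(0)}(\tau) = \hat{\beta}_{h1}.
\]
```

Under that assumption, the difference between the two standardized means within any single fitted model equals the treatment coefficient, which is consistent with the bootstrap averages being close to the coefficients in Table 4.6.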
Table 4.6 PBC3 trial in liver cirrhosis: Estimated coefficients (and SD) from direct linear models
for time lost (in years) due to death without transplantation or to transplantation before τ = 3 years
(Bili: Bilirubin).

(a) Death without transplantation

Covariate                          β̂       SD        β̂       SD        β̂       SD
Intercept                        0.244
Treatment   CyA vs. placebo      0.000   0.088    -0.082   0.078    -0.079   0.078
Albumin     per 1 g/L                             -0.019   0.007    -0.028   0.007
log2(Bili)  per doubling                           0.159   0.037     0.124   0.037
Sex         male vs. female                        0.129   0.125
Age         per year                               0.013   0.004

(b) Transplantation

Covariate                          β̂       SD        β̂       SD        β̂       SD
Intercept                        0.138
Treatment   CyA vs. placebo     -0.049   0.089    -0.067   0.085    -0.068   0.083
Albumin     per 1 g/L                             -0.007   0.008    -0.002   0.008
log2(Bili)  per doubling                           0.072   0.044     0.091   0.042
Sex         male vs. female                        0.132   0.139
Age         per year                              -0.010   0.004

Fine-Gray model

The Fine-Gray model is a direct model for the cumulative incidence that expresses
the association with covariates as sub-distribution hazard ratios. Therefore, interpre-
tation of the resulting parameters is challenging. If using the model, this should be
done for all causes in the competing risks model because the association between
a given covariate and a given cumulative incidence may be a result of the way in
which the covariate affects other causes. Based on Fine-Gray models for all causes,
the estimated overall failure risk for given subjects may exceed 1.

4.2.3 Recurrent events

For recurrent events one may, in principle, focus on the same marginal parameters (Qh (t)
and εh (t)) as exemplified in the previous sections. However, since it is the same type of
event that may occur repeatedly, there is another marginal parameter that is of interest to
study. This is the mean function or expected number of recurrent events over time, i.e.,

µ(t) = E(N(t))

where N(t) is the process counting the number of recurrent events in the interval from 0 to
t.

No terminal event
We will focus on the situation depicted in Figure 1.5 and first look at the case where the
mortality rate is negligible, i.e., state D on that figure is not relevant. In that case it turns
out that the estimating equations that are set up for µ(t) are solved by the Nelson-Aalen
estimator
\[
\hat{\mu}(t) = \sum_{\text{event times } X \le t} \frac{dN(X)}{Y(X)} \tag{4.11}
\]
(Lawless and Nadeau, 1995). To compute confidence limits around this estimator, robust
estimators of the SD should be used and the confidence interval will typically be based
on symmetric confidence limits for log(µ(t)) – similarly to confidence limits around the
Nelson-Aalen estimator for the cumulative hazard (Section 2.1.1). A regression model

µ(t | Z) = µ0 (t) exp(LP) (4.12)

may also be analyzed quite simply since it may be shown (Lawless and Nadeau, 1995;
Lin et al., 2000) that solving what are formally score equations based on a Cox partial log
likelihood
\[
l(\beta) = \sum_{\text{event times } X} \log\left( \frac{\exp(\mathrm{LP}_{\text{event}})}{\sum_{j \text{ at risk at time } X} \exp(\mathrm{LP}_j)} \right)
\]

leads to valid estimators for β (more details to be given in Section 5.5.4). Robust standard
deviations must be used. A Breslow-type estimator for the baseline mean function
µ0 (t) also exists. The model in Equation (4.12) is often denoted the LWYY model after the
authors of Lin et al. (2000). Just as it was the case for the Cox model (Section 2.2.2), attention
should be paid to the goodness-of-fit of the multiplicative model in Equation (4.12).
Methods for doing this are discussed in Section 5.7.4.

Table 4.7 Recurrent episodes in affective disorders: Estimated ratios between mean numbers of
psychiatric episodes between patients with bipolar vs. unipolar diagnosis (c.i.: confidence interval).

Model              Mortality treated as    exp(β̂)    95% c.i.
LWYY model         Censoring               1.52      (1.07, 2.17)
Ghosh-Lin model    Competing risk          1.95      (1.48, 2.56)

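To make the structure of the partial log likelihood for Equation (4.12) concrete, here is a minimal Python sketch for a single covariate. It is an illustration under stated assumptions, not the authors' implementation: the toy data, the names `partial_loglik` and `maximize`, and the ternary search are all hypothetical choices, and a subject is taken to be at risk at time X whenever follow-up has not yet ended, regardless of earlier events. Robust standard deviations are not computed.

```python
import math
from typing import List, Tuple

# Hypothetical toy data: (covariate z, recurrent event times, end of follow-up)
Subject = Tuple[float, List[float], float]

toy: List[Subject] = [
    (1.0, [1.0, 2.0, 4.0], 5.0),
    (1.0, [3.0], 6.0),
    (0.0, [2.5], 4.5),
    (0.0, [], 5.5),
]

def partial_loglik(beta: float, subjects: List[Subject]) -> float:
    """Sum over all recurrent events of log(exp(LP_event) / sum over risk set of exp(LP_j))."""
    ll = 0.0
    for z_i, events, _ in subjects:
        for x in events:
            # Risk set at x: every subject still under observation at x
            denom = sum(math.exp(beta * z_j) for z_j, _, end_j in subjects if end_j >= x)
            ll += beta * z_i - math.log(denom)
    return ll

def maximize(subjects: List[Subject], lo: float = -5.0, hi: float = 5.0, iters: int = 200) -> float:
    """Ternary search for the maximizer; relies on concavity in beta."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if partial_loglik(m1, subjects) < partial_loglik(m2, subjects):
            lo = m1
        else:
            hi = m2
    return (lo + hi) / 2.0

beta_hat = maximize(toy)
mean_ratio = math.exp(beta_hat)  # estimated ratio between the two mean functions
print(beta_hat, mean_ratio)
```

In this toy data set all four subjects are under observation at every event time, so the estimated mean ratio reduces to the ratio of events per subject in the two groups (4/2 against 1/2).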
We will exemplify this using data from Example 1.1.5 on recurrent episodes in affective
disorders. By focusing on times from one re-admission to the next, disregarding the fact
that there are in-hospital periods during which the event does not occur (see, e.g., Andersen
et al., 2019), we are in the situation of Figure 1.5. The parameter µ(t), the expected number
of re-admissions in [0,t], refers to a population where the duration of these periods has a
certain distribution, and one should realize that, in a population with another distribution of
these durations, the parameter would have been different. Most importantly, the parameter
µ(t) also refers to a population where patients cannot die. This is a completely unreason-
able assumption, and we include this example, mainly to demonstrate the bias that arises
when we, incorrectly, treat patients who die as censorings in Equation (4.11). We shall see
that this bias is similar to that seen for competing risks in Section 4.1.2 when, incorrectly,
estimating the cumulative incidence using ‘1−Kaplan-Meier’ and we will below return to
the correct analysis, properly taking mortality into account. Figure 4.18 shows the estimated
values of µ(t) for patients whose initial diagnosis was either unipolar or bipolar, obtained
using Equation (4.11). Note that (except for the fact that mortality is treated incorrectly) the
vertical axis has the attractive interpretation as the average numbers of re-admissions over
time since diagnosis and note that bipolar patients, on average, have more re-admissions
than unipolar patients. This discrepancy can be quantified using the multiplicative regres-
sion model from Equation (4.12) and, as seen in Table 4.7, the ratio (assumed constant)
between the two mean curves is estimated to be 1.52.

Terminal event
To perform a satisfactory analysis of the data on recurrent episodes in affective disorders,
we need to estimate µ(t) in the presence of a non-negligible mortality rate. Let S(t) be the
(marginal) survival function, i.e., S(t) = P(TD > t) where TD is time to entry into state D of
Figure 1.5 without consideration of re-admissions before time t. The mean number of re-
current events in the interval from 0 to t may be estimated using the following modification
of the estimator in Equation (4.11)

\[
\hat{\mu}(t) = \sum_{\text{event times } X \le t} \hat{S}(X-)\, \frac{dN(X)}{Y(X)} \tag{4.13}
\]
(Cook and Lawless, 1997; Ghosh and Lin, 2000). We will denote Equation (4.13) the Cook-Lawless
estimator. Here, $\hat{S}(\cdot)$ is the Kaplan-Meier estimator for S and the minus sign in
$\hat{S}(X-)$ means that a death event at time X is not included in the Kaplan-Meier estimator
at that time. An SD for this estimator is also available whereby confidence limits may be
computed. Since the Kaplan-Meier estimator is ≤ 1, it is seen by comparing Equations
(4.11) and (4.13) that treating mortality as censoring leads to an upwards biased estimator
for µ(t). The intuition behind this bias is the same as that discussed when comparing
the correct Aalen-Johansen estimator and the biased ‘1−Kaplan-Meier’ estimator for the
cumulative incidence with competing risks – namely that by treating dead patients as censored
we pretend that, had they not been ‘censored’, then they would still be at risk for the
recurrent event. The bias is clearly seen in Figure 4.19 when comparing with Figure 4.18
(and even clearer on the cover figure where the birds sit on the correctly estimated curve
for unipolar patients).

[Figure 4.18 Recurrent episodes in affective disorders: Estimated average numbers of psychiatric
episodes after initial diagnosis for patients with unipolar or bipolar disorder. NB: mortality is
treated as censoring. Horizontal axis: time since first admission (years), 0 to 30; vertical axis:
expected number of episodes, 0 to 8; curves: Unipolar, Bipolar.]

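The difference between Equations (4.11) and (4.13) can be traced on a few subjects. The following Python sketch computes both estimators side by side; the toy data and the function name `mean_functions` are hypothetical, all subjects are assumed to enter at time 0, and deaths are assumed not to coincide with recurrent event times.

```python
from typing import List, Tuple

# Hypothetical toy data: (recurrent event times, end of follow-up, died at end of follow-up?)
Subject = Tuple[List[float], float, bool]

toy: List[Subject] = [
    ([1.0, 3.0], 4.0, False),
    ([2.0], 2.5, True),
    ([], 1.5, True),
    ([1.0, 2.0, 3.5], 5.0, False),
]

def mean_functions(subjects: List[Subject]) -> List[Tuple[float, float, float]]:
    """Return (time, Nelson-Aalen, Cook-Lawless) at each recurrent event time."""
    event_times = sorted({t for ev, _, _ in subjects for t in ev})
    death_times = sorted({end for _, end, died in subjects if died})
    out = []
    na = 0.0      # Equation (4.11): mortality treated as censoring
    cl = 0.0      # Equation (4.13): increments weighted by S(X-)
    surv = 1.0    # Kaplan-Meier estimate of S just before the current time
    k = 0         # index of the next death time not yet used in the Kaplan-Meier product
    for x in event_times:
        # Update the Kaplan-Meier estimate with deaths strictly before x, giving S(x-)
        while k < len(death_times) and death_times[k] < x:
            d = death_times[k]
            at_risk = sum(1 for _, end, _ in subjects if end >= d)
            deaths = sum(1 for _, end, died in subjects if died and end == d)
            surv *= 1.0 - deaths / at_risk
            k += 1
        y = sum(1 for _, end, _ in subjects if end >= x)   # Y(x): subjects under observation
        dn = sum(ev.count(x) for ev, _, _ in subjects)     # dN(x): events at x
        na += dn / y
        cl += surv * dn / y
        out.append((x, na, cl))
    return out

result = mean_functions(toy)
for time, na_value, cl_value in result:
    print(time, round(na_value, 3), round(cl_value, 3))
```

Because each increment of the Cook-Lawless estimator is the corresponding Nelson-Aalen increment multiplied by a factor of at most 1, the second curve can never lie above the first.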
The discrepancy between the two curves in Figure 4.19 may be quantified using a multi-
plicative regression model
µ(t | Z) = µ0 (t) exp(LP)
just like for the situation with no mortality, Equation (4.12). However, the estimating equa-
tions now need to be modified to properly account for the presence of a non-negligible
mortality rate (Ghosh and Lin, 2002; more details to be given in Section 5.5.4). We will
refer to this model as the Ghosh-Lin model. Table 4.7 also shows the estimated mean ra-
tio from this model which is seen to be 1.95. For both estimates in Table 4.7, it has been
assumed that censoring does not depend on covariates. The Ghosh-Lin estimate is seen
to be larger than that based on the incorrect assumption of no mortality. The explanation
is that the bias affects the curves for the two groups differently because unipolar patients
have a higher mortality rate than bipolar patients (estimated hazard ratio between bipolar
and unipolar patients in a Cox model for the marginal mortality rate is 0.410 with 95%
confidence limits from 0.204 to 0.825).

[Figure 4.19 Recurrent episodes in affective disorders: Estimated average numbers of psychiatric
admissions after initial diagnosis for patients with unipolar or bipolar disorder. NB: Mortality is
treated as a competing risk using the Cook-Lawless estimator. Horizontal axis: time since first
admission (years), 0 to 30; vertical axis: expected number of episodes, 0 to 8; curves: Unipolar, Bipolar.]

A critique that can be raised against the use of the marginal mean µ(t) in the presence of a
competing risk is that a treatment may appear beneficial if it quickly kills the patient and,
thereby, prevents further recurrent events from happening. Therefore, the occurrence of the
competing event (‘death’) must somehow be considered jointly with the recurrent events
process N(t), at the least by also quoting results from an analysis of the mortality rate.
One approach in this direction is the Mao-Lin (2016) model for the composite end-point
consisting of recurrent events and death – to be discussed in Section 5.5.4.

LEADER cardiovascular trial in type 2 diabetes

Similar analyses on the data from the LEADER trial (Example 1.1.6) were conducted by
Furberg et al. (2022). Figure 4.20 shows the non-parametric estimates of the mean func-
tions for recurrent MI in the two treatment groups with or without proper adjustment for
the competing mortality risk. The bias imposed by not taking death into account as a com-
peting risk is rather small in this example as a consequence of the relatively low mortality
rate (Table 1.3). Note that the Nelson-Aalen estimates are identical to the curves shown
in Figure 2.15 and correctly estimate the cumulative intensities. The estimated mean ratio
between liraglutide and placebo without taking mortality into account (LWYY model) is
exp(−0.164) = 0.849 (95% confidence limits from 0.714 to 1.009), while that obtained in
the Ghosh-Lin model is exp(−0.159) = 0.853 (0.718, 1.013). The latter is slightly closer to
1 because the mortality rate in the placebo group is slightly higher (a Cox model for all-cause
mortality gives a log(hazard ratio) for placebo vs. liraglutide of 0.166 (SD = 0.070)). This
illustrates the need to study both the recurrent events and mortality.

[Figure 4.20 LEADER cardiovascular trial in type 2 diabetes: Estimated average numbers of
myocardial infarctions. NB: One curve for each treatment group where mortality is treated as
censoring and one for each group where mortality is treated as a competing risk (CL: Cook-Lawless
estimates, NA: Nelson-Aalen estimates). Horizontal axis: time since randomization (months), 0 to 60;
vertical axis: expected number of events per subject, 0.00 to 0.12; curves: Liraglutide and Placebo,
each with CL and NA estimates.]


Mean function and terminal event

In the presence of a terminal event in a recurrent events multi-state process, the
Nelson-Aalen estimator is upwards biased for the mean function. This is because,
by censoring for the terminal event, one pretends to be in a population without that
event. To correctly take the competing risk of the terminal event into account, the
Cook-Lawless estimator must be used. In analogy with our recommendation to al-
ways study all competing events in a competing risks model (and not just the ‘cause
of interest’), we emphasize that the occurrence of the terminal event must be studied
together with the recurrent events.
Direct models
A direct model may be set up for the way in which a marginal parameter depends
on covariates. This requires (1) specification of a link function that gives the scale
on which parameters are to be interpreted, and (2) setting up a set of general-
ized estimating equations (GEEs), the solutions to which are the desired parameter
estimates. Direct modeling has some advantages compared to plug-in and micro-simulation:
(1) it provides a set of regression coefficients that directly explain the
association on the scale of the chosen link function, and (2) it targets directly the
marginal parameter of interest and does not rely on a correct specification of all in-
tensity models. However, a direct marginal model (1) does not provide information
on the dynamics of the multi-state process, and it is not possible to simulate paths
of the process based on a marginal model, and (2) requires a correct specification of
the censoring distribution.

4.3 Marginal hazard models

In Sections 1.2.2 and 1.4.1, the distribution of the time of first entry into state h (say, Th )
was introduced as a marginal parameter for situations where all subjects occupy the same state (0)
at time 0. Examples include time to occurrence no. h = 1, 2, . . . of a recurrent event (Figure
1.5) or time to relapse or GvHD in the model for the disease course after bone marrow
transplantation (Figure 1.6). As mentioned there, the random variable Th may, formally,
be infinite because the event in question will not necessarily happen for all subjects (Th is
improper). If several such times (e.g., times for different event numbers, h in a recurrent
events study) are studied simultaneously, then an assumption that these times are indepen-
dent within subjects (i) will typically not be reasonable, see, e.g., the discussion in Section
3.9.
As discussed in that section, there is a quite different situation that gives rise to dependent
event history data, namely when subjects come in clusters, such as members of the same
family, or patients attending the same medical center. Here, the random variable Th would
be time to the event of interest for subject h in a given cluster (i).
In Section 3.9, the potential association within subjects/clusters was taken into account
by using a frailty model, whereby, estimated regression parameters have a within-cluster
interpretation. However, inference for the marginal time to event distributions without a
specification of the intra-cluster/patient association is a useful alternative. This goal may
be achieved using an approach based on generalized estimating equations (GEEs) as dis-
cussed, e.g., by Wei et al. (1989) and by Lin (1994). In Section 5.6, the mathematical
background for the marginal Cox model will be described, and in Section 7.2, we will sum-
marize the discussion of analysis of dependent event history data. In the present section,
we will introduce the idea of marginal hazard models and discuss the extent to which it
is applicable to the examples mentioned and to similar examples. We will first consider
clustered data and next turn to the situation of marginal distributions of times of entry into
different states in a multi-state model.
4.3.1 Clustered data
In situations where subjects come in clusters, it is relevant to account for the cluster struc-
ture in the analysis of the event history data. This can be done by setting up standard Cox
models for the event intensity for each subject separately, i.e., by specifying the marginal
hazard for subject h as
αh (t) ≈ P(Th ≤ t + dt | Th > t)/dt. (4.14)
This hazard is marginal in the sense that the life course of other cluster members is not taken
into account (even though this may be informative for subject h due to the suspected within-
cluster correlation). As an example, one could specify the following model for subject h in
cluster i with covariates (Zih1 , . . . , Zihp )

αih (t | Z) = α0 (t) exp(LPih ),

which is just a standard Cox model (in which stratification is also possible, i.e., different
baseline hazards in certain sub-groups). Estimation of the β coefficients and the baseline
hazard(s) follow exactly the same lines as described previously (Section 2.2.1 and Section
3.3), and the estimates are exactly the same as they would have been under independence.
The standard deviations of the estimates will be different because the cluster structure is
taken into account when computing the robust standard deviations instead of the model-
based standard deviations used in previous examples of the Cox model. The robust standard
deviations will often be larger than the model-based since the latter will over-estimate the
amount of precision by over-estimating the number of independent units in the study. This is
typically the case when there is a positive within-cluster correlation. However, in situations
with a negative within-cluster correlation or when the covariate varies within, rather than
among clusters, they may also be smaller. We will consider the bone marrow transplantation
data (Example 1.1.7) and take the cluster structure implied by patients attending different
medical centers into account. We will study models for three different outcomes: relapse,
relapse-free survival (i.e., the composite end-point of either relapse or death in remission
– leaving state 0 in Figure 4.13), or overall survival (time until entry into either state 2
or 3 in that figure). Table 4.8 shows results from models including the three covariates
graft type, disease, and age. For relapse, the estimated coefficients and the model-based
standard deviations are close to those found in Table 3.12 where adjustment for the time-
dependent covariate graft versus host disease was also conducted – without that adjustment
the two sets of estimates and standard deviations would have been identical. The robust
standard deviations tend to be larger than the model-based since a positive within-center
association is suspected. This is, in particular, the case for the covariate age that has a
larger variation among centers than within (F-ratio in a one-way ANOVA is 5.64). This
tendency is confirmed by the overall Wald significance tests for the three coefficients: 21.60
based on the model-based results and 16.36 for the robust for the outcome relapse. The
results for relapse-free survival and overall survival are similar since most events for the
composite end-point are deaths (Table 1.4). Robust standard deviations tend to be larger,
and the three degree of freedom Wald tests for all coefficients are more significant when
based on the model-based results (81.52 vs. 52.59 for relapse-free survival and 76.06 vs.
48.78 for overall survival).
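The reason why robust standard deviations tend to exceed model-based ones under positive within-cluster correlation can be seen in a much simpler setting than the Cox model. The Python sketch below swaps in the estimating equation for a plain mean, Σ(y − µ) = 0, and compares the model-based SD with the cluster-robust sandwich SD. It is not the marginal Cox computation itself; the toy clusters and the function name are hypothetical.

```python
import math
from typing import Dict, List, Tuple

def naive_and_cluster_robust_sd(clusters: Dict[str, List[float]]) -> Tuple[float, float, float]:
    """Return (estimate, model-based SD, cluster-robust SD) for the mean of all observations."""
    values = [y for ys in clusters.values() for y in ys]
    n = len(values)
    mu = sum(values) / n
    # Model-based variance: treats all n observations as independent units
    naive_var = sum((y - mu) ** 2 for y in values) / n ** 2
    # Sandwich variance: the 'bread' is n, the 'meat' sums squared score totals per cluster
    meat = sum(sum(y - mu for y in ys) ** 2 for ys in clusters.values())
    robust_var = meat / n ** 2
    return mu, math.sqrt(naive_var), math.sqrt(robust_var)

# Hypothetical clusters with strong positive within-cluster correlation
toy_clusters = {
    "A": [1.0, 1.2, 0.9],
    "B": [3.0, 3.1, 2.8],
    "C": [2.0, 2.1, 1.9],
}
mu_hat, sd_model, sd_robust = naive_and_cluster_robust_sd(toy_clusters)
print(mu_hat, sd_model, sd_robust, sd_robust / sd_model)

# With one observation per cluster the two SDs coincide
singletons = {str(i): [y] for i, y in enumerate([1.0, 1.2, 0.9, 3.0])}
_, sd_model_single, sd_robust_single = naive_and_cluster_robust_sd(singletons)
```

The point estimate is unaffected by the clustering, just as the β estimates in Table 4.8 are the same as under independence; only the SD changes.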
Table 4.8 Bone marrow transplantation in acute leukemia: Estimated coefficients, model-based SD,
and robust SD from marginal hazard models for relapse, relapse-free survival, and overall survival
taking clustering by medical center into account (BM: Bone marrow, PB: Peripheral blood, AML:
Acute myelogenous leukemia, ALL: Acute lymphoblastic leukemia).

                                                       SD
                                       β̂      Model-based   Robust    Ratio
Relapse
  Graft type   BM only vs. BM/PB    -0.108       0.134       0.138    1.025
  Disease      ALL vs. AML           0.549       0.129       0.174    1.345
  Age          per 10 years         -0.045       0.044       0.075    1.686
Relapse-free survival
  Graft type   BM only vs. BM/PB    -0.161       0.077       0.077    0.997
  Disease      ALL vs. AML           0.455       0.078       0.078    1.004
  Age          per 10 years          0.169       0.026       0.033    1.286
Overall survival
  Graft type   BM only vs. BM/PB    -0.160       0.079       0.081    1.022
  Disease      ALL vs. AML           0.405       0.080       0.078    0.975
  Age          per 10 years          0.173       0.026       0.033    1.267

4.3.2 Recurrent events

For recurrent events, times to first, second, third, etc. event occurrence within each patient
are correlated. Analyses of these times may, in the case of no competing risks in the form of
a terminal event that prevents further occurrences of the recurrent event (Figure 1.5 without
state D), be analyzed as just described for the bone marrow transplantation data. Thus, Cox
models
αhi (t | Z) = αh0 (t) exp(LPi ), h = 1, 2, . . . , K
for the hazard of the time to event occurrence no. h for patient i with covariates
(Zi1 , . . . , Zip ) may be set up. Here, the maximum number of events, K, to be studied needs
specification, which must be done on a case-by-case basis. Sub-models with common
covariate effects for several or all h are possible (i.e., β j instead of βh j ). The model is known
as the WLW model for recurrent events (Wei, Lin and Weissfeld, 1989). An example could
be the data on recurrent episodes of affective disorders, ignoring the fact that patients may
die during follow-up (Example 1.1.5). However, in previous analyses of these data (e.g.,
Section 4.2.3) we have seen that neglecting mortality, i.e., treating it as censoring, may lead
to considerably biased estimates in some situations and we will not pursue this idea any
further. An analysis of times to first, second, ... event occurrence needs to properly address
the competing risk of death. The problem is that the marginal hazard for time Th , Equation
(4.14)
αh (t) ≈ P(Th ≤ t + dt | Th > t)/dt,
is now marginal in the sense that neither the times of previous events (i.e., no.
1, 2, . . . , h − 1) nor the time to death, TD , is taken into account. The latter means that either
αh (t) should be interpreted in a hypothetical population without mortality, or the conditioning
event Th > t should be taken to mean that the subject either is alive at time t but has not yet
experienced event no. h, or the subject has already died at time t without having had h recurrences.
We have already dismissed in Section 1.3 the first possibility as being unrealistic, and the
second possibility means that αh (t) is a sub-distribution hazard rather than an ordinary
hazard (Section 4.2.2). In the latter case, one may turn to marginal Fine-Gray models for
the cumulative incidences for event occurrence no. h = 1, . . . , K (Zhou et al., 2010).

Table 4.9 Recurrent episodes in affective disorders: Numbers of recurrences and deaths used in
marginal Cox models (WLW-models).

Episode no.    Recurrence    Death    Total
h = 1              99          16      115
h = 2              82          28      110
h = 3              62          45      107
h = K = 4          47          55      102

To have well-defined marginal hazards, the definition of ‘event occurrence no. h’ could
be modified to being the composite end-point ‘occurrence of event no. h or death’, much
like earlier definitions of recurrence-free survival in the bone marrow transplantation study,
Example 1.1.7, or failure of medical treatment (transplantation-free survival) in the PBC3
trial, Example 1.1.1. This possibility was discussed by Li and Lagakos (1997) together with
a suggestion to model the cause-specific hazards for recurrence no. h = 1, . . . , K, taking
mortality into account as a competing risk. In the latter case, hazards are no longer marginal
in the sense of Equation (4.14).
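The two event definitions just described differ only in how a death before recurrence no. h is coded. The following Python sketch builds the corresponding records for one subject, with one row per episode number h; the row layout, the field names, and the function `wlw_rows` are hypothetical and only illustrate the bookkeeping, not any particular software's data format.

```python
from typing import List, NamedTuple

class Row(NamedTuple):
    subject: int
    h: int               # episode number, used as stratum
    time: float          # time from 0 to event no. h, death, or censoring
    composite: int       # 1 if event no. h or death was observed
    cause_specific: int  # 1 only if event no. h was observed (death counts as censoring)

def wlw_rows(subject: int, event_times: List[float], end: float, died: bool, K: int) -> List[Row]:
    """One row per episode number h = 1, ..., K, all measured from time 0."""
    ordered = sorted(event_times)
    rows = []
    for h in range(1, K + 1):
        if len(ordered) >= h:
            # Recurrence no. h was observed: an event under both definitions
            rows.append(Row(subject, h, ordered[h - 1], 1, 1))
        else:
            # Follow-up ended first: a death counts only for the composite end-point
            rows.append(Row(subject, h, end, int(died), 0))
    return rows

# Hypothetical subject with two recurrences who dies at time 7
rows = wlw_rows(subject=1, event_times=[2.0, 5.0], end=7.0, died=True, K=4)
for row in rows:
    print(row)
```

With this coding, the number of composite events for a given h is the sum of the recurrences and the deaths, which is how the totals in Table 4.9 arise.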
We will illustrate marginal hazard models using the data on recurrence and death in patients
with affective disorders (Example 1.1.5). As in Section 2.5 we will restrict attention to
the first K = 4 recurrences for which the numbers of events (recurrences and/or deaths)
are shown in Table 4.9. Table 4.10 shows the results from analyses including only the
covariate bipolar vs. unipolar disorder. For the composite end-point, the hazard ratios tend
to decrease with episode number (h) while there is rather an opposite trend for the models
for the cause-specific hazards for recurrence no. h = 1, 2, 3, K = 4. The likely explanation
is that, as seen in Table 4.9, the fraction of deaths for the composite end-point increases
with h and, as we have seen earlier (Section 4.2), mortality is higher for unipolar than for
bipolar patients. The estimates for the separate coefficients βh may be compared using a
three degree of freedom Wald test which for the composite end-point is 6.08 (P = 0.11)
and for the cause-specific hazards is 8.27 (P = 0.04). Even though the latter is borderline
statistically significant, Table 4.10 also shows, for both analyses, the estimated log(hazard
ratio) in the model where β is the same for all h.
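The P-values quoted for the two three degree of freedom Wald tests can be checked directly. This small Python sketch uses the closed-form upper tail of the chi-square distribution with 3 degrees of freedom, so that only the standard library is needed; the function name `chi2_sf_3df` is a hypothetical helper.

```python
import math

def chi2_sf_3df(x: float) -> float:
    """Upper tail probability P(X > x) for a chi-square distribution with 3 degrees of freedom."""
    return math.erfc(math.sqrt(x / 2.0)) + math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0)

p_composite = chi2_sf_3df(6.08)       # Wald statistic for the composite end-point
p_cause_specific = chi2_sf_3df(8.27)  # Wald statistic for the cause-specific hazards
print(round(p_composite, 2), round(p_cause_specific, 2))
```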

4.3.3 Illness-death model

Consider an illness-death model for a simplified situation for the bone marrow transplanta-
tion data, i.e., without consideration of GvHD. This is Figure 1.3 with state 1 correspond-
ing to relapse and state 2 to death. Suppose one wishes to make inference for both ‘time to
death’ and ‘time to relapse’. Could the marginal Cox model be applicable for this purpose?
Table 4.10 Recurrent episodes in affective disorders: Estimated coefficients (and robust SD) for
bipolar versus unipolar disorder from marginal Cox models (WLW-models) for the composite end-
point of recurrence or death and for the cause-specific hazards of recurrence.

                 Composite end-point     Cause-specific hazard
Episode no.         β̂        SD             β̂        SD
h = 1             0.380    0.209          0.495    0.202
h = 2             0.291    0.255          0.640    0.242
h = 3             0.003    0.246          0.534    0.269
h = K = 4         0.107    0.237          0.879    0.283
Joint             0.193    0.204          0.615    0.211

The answer seems to be ‘no’ because the marginal hazard for relapse given in Equation
(4.14) is not well defined in the relevant population where death also operates. There have
been attempts in the literature to do this anyway (taking into account the ‘informative
censoring by death’) under the heading of semi-competing risks (e.g., Fine et al., 2001), but
we will not follow that idea here and refer to further discussion in Section 4.4.4. Instead,
we will proceed as in the recurrent events case and re-define the problem to jointly study
times to death and times to the composite end-point of either relapse or death (relapse-free
survival). These times are correlated within each patient, since all deaths in remission count
as events of both types but their marginal hazards are well defined. The numbers of events
are 737 overall deaths and 764 (= 259 + 737 − 232, cf. Table 1.4) occurrences of relapse
or death in remission. Table 4.11 shows results from models including the covariates graft
type, disease, and age. The models are fitted using a stratified Cox model, stratified for the
two types of events with type-specific covariates and using robust SD. Note that, for both
end-points, the estimates are the same as those found in Table 4.8. This is because they
solve the same estimating equations. The robust SD are also close but not identical because
another clustering is now taken into account (patient rather than center as in Table 4.8). The
two sets of coefficients are strongly correlated: The estimated correlations are 0.98, 0.96,
and 0.97, respectively, for graft type, disease, and age. These correlations are accounted
for in Wald tests for equality of the two sets of coefficients. These are 0.004 (P = 0.95),
5.17 (P = 0.023), and 0.32 (P = 0.57), respectively, for the three covariates. Under the
hypothesis of equal coefficients for graft type and age, the estimates are −0.161 (0.077)
and 0.171 (0.026). Note that the SDs are not much reduced as a consequence of the high
correlations.
If one, further, wishes to include ‘time to GvHD’ in the analysis, then this needs to be
defined as GvHD-free survival, i.e., events for this outcome are either GvHD or death,
whichever comes first. For this outcome there are 1,324 (= 976 + 737 − 389, cf. Table 1.4)
events of which 976 are GvHD occurrences. The results from an analysis of all three out-
comes are found in Table 4.11 (where those for overall and relapse-free survival are the
same as in the previous model). It is seen that the estimated coefficients for GvHD-free sur-
vival differ somewhat from those for the other two end-points since the majority of these
events are not deaths.
Table 4.11 Bone marrow transplantation in acute leukemia: Estimated coefficients (and robust SD)
from marginal hazard models for relapse-free survival, overall survival, and GvHD-free survival
(BM: Bone marrow, PB: Peripheral blood, AML: Acute myelogenous leukemia, ALL: Acute lym-
phoblastic leukemia).

                                       β̂        SD
Relapse-free survival
  Graft type   BM only vs. BM/PB    -0.161    0.077
  Disease      ALL vs. AML           0.455    0.078
  Age          per 10 years          0.169    0.026
Overall survival
  Graft type   BM only vs. BM/PB    -0.160    0.079
  Disease      ALL vs. AML           0.405    0.079
  Age          per 10 years          0.173    0.027
GvHD-free survival
  Graft type   BM only vs. BM/PB    -0.260    0.059
  Disease      ALL vs. AML           0.293    0.060
  Age          per 10 years          0.117    0.019

Marginal hazard models

A marginal hazard model describes the marginal distribution of the time to a certain
event. For clustered data, this is carried out without consideration of the event times
for other cluster members and without having to specify the within-cluster associ-
ation, and, in this situation, marginal hazard models are useful. The same may be
the case in a recurrent events situation without a terminal event, in which case the
marginal time to first, second, third, etc. event may be analyzed without having to
specify their dependence (the WLW model).
However, in situations with competing risks (both for recurrent events and other
multi-state models) the concept of a marginal hazard is less obvious.

Model-based and robust SD

Intensity-based models build on the likelihood approach that directly provides
model-based SD. Direct marginal models aim at establishing the association between
a marginal parameter and covariates and build on setting up generalized estimating
equations (GEEs). In this case, robust values for SDs are based on the GEE
using the sandwich formula. These SD are robust to certain model deviations; how-
ever, the link between the marginal parameter and the covariates should be correctly
specified (the GEE should be unbiased).
4.4 Independent censoring – revisited
4.4.1 Investigating the censoring distribution
In the beginning of Section 4.2, we mentioned that fitting direct models for a marginal pa-
rameter would typically require estimation of the distribution G(t) = P(C > t) of censoring
times. To do this, we imagine that in a study, any given subject has a potential time where
observation of that subject would be terminated if he or she had not yet died (or, more
precisely, not yet reached an absorbing state in the multi-state process) prior to that time.
This potential censoring time will, at the latest, be the time from entry into the study and
until the study is terminated, but subjects may drop out of the study prior to that time – see
the examples in Section 1.1. For some subjects, C is observed but for others, observation
of C is precluded if the survival time T is less than C. This suggests that G(t) may be esti-
mated by a Kaplan-Meier estimator where ‘censoring’ is the event and ‘death’ (absorbing
states) acts as censoring. If G(t) depends on covariates, then the distribution may be esti-
mated via, e.g., a Cox model for the censoring hazard. To obtain valid estimation of G(t),
an independent censoring condition (similar to Equation (1.6)) must be fulfilled. Since, in
essence, Equation (1.6) implies that C and T are conditionally independent given covariates
and since conditional independence is a condition that is symmetric in C and T , it follows
that the same condition will also ensure valid estimation of G(t). As explained in Section
1.3, this condition cannot be directly evaluated based on the available data – except for
the fact that the way in which the distributions of T and C depend on covariates may be
investigated using regression models.
Even though the censoring distribution is typically not of primary scientific interest, it may
be informative to study G(t) to get an overview of the follow-up time distribution (e.g., van
Houwelingen and Putter, 2012, ch. 1; Andersen et al., 2021). In what follows, we will do
so in a number of examples.

PBC3 trial in liver cirrhosis

In the PBC3 trial (Example 1.1.1), most censoring was administrative and only four patients
were lost to follow-up prior to end-of-study. This means that the distribution of censoring
times mostly reflects the distribution of calendar entry times. Figure 4.21 shows the resulting
$\hat{G}(t)$ together with the Kaplan-Meier estimator $\hat{S}(t)$ for the overall survival function
(probability of no treatment failure before time t) for both treatment groups combined. It is
seen that the censoring distribution is rather uniform (survival function close to linear) over
the 6 years where the trial was conducted; however, with relatively fewer short censoring
times reflecting that few patients were recruited towards the end of the trial.
The function $\hat{G}(t)$ gives the probability of being uncensored at time $t$ and $\hat{S}(t)$ the
probability of having had no treatment failure by that time. Thus, the product gives the probability
of still being in the study at $t$ and it holds that

$$\hat{G}(t-)\,\hat{S}(t-) = \frac{Y(t)}{n}, \qquad (4.15)$$

the fraction of subjects still in the study just before t (where, as previously, the t− means
that a possible jump in the estimator at t is not yet included).
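
The identity (4.15) can be checked numerically. The following is a minimal sketch, assuming simulated data without tied observation times; the helper `km_left_limit` and the exponential and uniform distributions are hypothetical choices for illustration and are not taken from the PBC3 data.

```python
import numpy as np

def km_left_limit(times, is_event, t):
    """Kaplan-Meier estimate just before t (product over event times < t).

    Assumes that there are no tied observation times.
    """
    order = np.argsort(times)
    x = times[order]
    d = is_event[order].astype(float)
    n = len(x)
    at_risk = n - np.arange(n)            # Y(X_i) for the i-th ordered time
    factors = 1.0 - d / at_risk
    return float(np.prod(factors[x < t]))

# Hypothetical simulated data (an assumption, not the trial data)
rng = np.random.default_rng(1)
n = 200
T = rng.exponential(3.0, n)               # failure times
C = rng.uniform(0.0, 6.0, n)              # potential censoring times
X = np.minimum(T, C)
failed = T <= C

def check_identity(t):
    """Return (G_hat(t-) * S_hat(t-), Y(t)/n)."""
    s_hat = km_left_limit(X, failed, t)   # 'death' is the event
    g_hat = km_left_limit(X, ~failed, t)  # 'censoring' is the event
    y_over_n = np.sum(X >= t) / n         # fraction still in the study just before t
    return s_hat * g_hat, y_over_n
```

Since every observation time removes exactly one subject, either as a failure or as a censoring, the two Kaplan-Meier products telescope to $Y(t)/n$, which is what `check_identity` should reproduce up to rounding.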
154 INTUITION FOR MARGINAL MODELS

[Figure: probability (0.0 to 1.0) against time since randomization (0 to 6 years); one curve for censoring and one for treatment failure.]

Figure 4.21 PBC3 trial in liver cirrhosis: Kaplan-Meier estimates for censoring and treatment failure.

To study if censoring depends on covariates, simple Cox models were studied including,
respectively, treatment, albumin or $\log_2$(bilirubin). The estimates (SD) were, respectively,
$\hat{\beta} = 0.084$ (0.126), $\hat{\beta} = 0.0010$ (0.013), and $\hat{\beta} = -0.0025$ (0.0018) with associated (Wald)
P-values 0.50, 0.94, and 0.16 showing that the censoring times (entry times) have a distribution
that is independent of the prognostic variables.

PROVA trial in liver cirrhosis

In the PROVA trial, 75 patients out of 286 died and 20 dropped out prior to the end of
the trial, leaving 191 patients alive at the end of the study (Example 1.1.4, Table 1.2).
Figure 4.22 shows the estimated distribution of censoring times which is rather uniform
between 1 and 4 years with few censorings before or after that interval. Wald P-values for
the association between the censoring hazard and covariates based on a series of simple
Cox models are shown in Table 4.12 where it is seen that censoring depends strongly on
age but not much on the other covariates. Previous analyses of these data (e.g., Section
3.6.3) showed that adjustment for age had little impact on the results of hazard regression
models.

Recurrent episodes in affective disorder

In Example 1.1.5 on recurrence in affective disorders, out of 119 patients included, 78 had
died, leaving 41 censored observations out of whom 38 were in the study by its end in 1985.
Figure 4.23 shows $\hat{G}(t)$ for these data, and it is seen that most censorings happen between
22 and 26 years in accordance with entry between 1959 and 1963 and study termination
in 1985. For these data, there is no association between censoring and the initial diagnosis
[Figure: probability of no censoring (0.0 to 1.0) against time since randomization (0 to 7 years).]

Figure 4.22 PROVA trial in liver cirrhosis: Kaplan-Meier estimate for censoring.

Table 4.12 PROVA trial in liver cirrhosis: Tests for association between censoring and covariates.

Covariate P-value
Treatment 0.98
Size of varices 0.10
Sex 0.59
Coagulation factors 0.28
log2 (Bilirubin) 0.18
Age 0.003

(bipolar versus unipolar, P = 0.20), but a very strong association with calendar time at
initial diagnosis (P < 0.001). It was seen in Table 3.6 that the variable was also associated
with the recurrence intensity, but also that adjustment did not much affect the estimate for
the initial diagnosis.

4.4.2 Censoring and covariates – a review

We have studied two classes of models for multi-state survival data: Intensity-based models
and marginal models and, for both, the issue of right-censoring had to be dealt with. As
indicated in Section 4.2 (and as will be further discussed in Sections 5.5.1 and 6.1.8), a
direct model for a marginal parameter typically requires that a model for the censoring
distribution is correctly specified. The simplest situation arises when censoring can be
considered independent of covariates, and it was seen in the previous section how censoring
models can be investigated.
[Figure: probability of no censoring (0.0 to 1.0) against time since first admission (0 to 30 years).]

Figure 4.23 Recurrent episodes in affective disorders: Kaplan-Meier estimate for censoring.

For models based on intensities, the situation is different. If censoring is independent, i.e.,
censoring is conditionally independent of the multi-state process given covariates, then the
methods discussed in Chapters 2 and 3 apply. However, this strictly requires that the covari-
ates which create the conditional independence are accounted for in the hazard model. We
hinted at this in Sections 1.3 and 2.2. The mathematical explanation follows from Section
3.1 where we emphasized that observation of the censoring times gives rise to likelihood
contributions which factorize from the failure contributions under the conditional independence
assumption. This means that the contributions arising from censoring can be disregarded
when analyzing the failure intensities, and one need not worry about the way in
which censoring is affected by covariates (‘the censoring model may be misspecified’),
except that the covariates that create the conditional independence must be accounted
for in the intensity model.
So, the conclusion is that intensity-based models and, thereby, inference for marginal pa-
rameters based on plug-in, are less sensitive to the way in which covariates may affect cen-
soring. However, to achieve this flexibility, one has to assume that censoring is independent
given covariates and covariates affecting the censoring distribution should be accounted for
in the hazard model provided that they are associated with the hazard. Investigations along
these lines were exemplified in the previous section.

4.4.3 Independent competing risks – a misnomer (*)

In Section 4.4.1 (see also Section 1.3), the point of view was as follows. In the complete
population, there is a time-to-event distribution that we wish to estimate. The observations
are incomplete because of censoring, and all subjects have a potential time to censoring
that may or may not be observed. Thus, both time to event and time to censoring are well-
defined (proper) random variables that have some joint distribution. To make inference on
either of the marginal distributions, i.e., either the time to event in the absence of censoring
(the distribution of scientific interest), or (as we did in Section 4.4.1) the marginal time to
censoring, an assumption of independent censoring is needed. We may be willing to make
this assumption after careful discussion of the mechanisms that lead to censoring.
A question is then: Could a similar approach be used for competing risks? We will argue
that the answer is negative, first and foremost because the question is ill-posed – we do not
believe that there is a well-defined ‘time to failure from cause 1 in the absence of cause 2’ or
vice versa. As explained in Section 1.3, in the complete population without censoring, both
causes are operating, and the occurrence of failure from one cause prevents failure from
the other cause from ever happening. As a consequence, and as explained earlier (Section
1.2.2, see also Section 5.1.2), the times to entry into states 1 or 2 in the competing risks
model (Figure 1.2) are improper random variables because the corresponding event (failure
from a specific cause) may never happen.
Nonetheless, there is a literature on classical competing risks building on random variables,
each representing time to failure from a specific cause (e.g., Crowder, 2001, ch. 3). We will
discuss this in the case of k = 2 competing causes, in which case there are two random
variables $\tilde{T}_1, \tilde{T}_2$ representing time to failure from cause 1 or 2, respectively. These variables
are known as the latent failure times and they have a joint survival distribution

$$\tilde{S}(t_1,t_2) = P(\tilde{T}_1 > t_1, \tilde{T}_2 > t_2),$$

with marginal distributions $\tilde{S}_h(t_h) = P(\tilde{T}_h > t_h)$, $h = 1, 2$ ($\tilde{S}_1(t_1) = \tilde{S}(t_1, 0)$ and
$\tilde{S}_2(t_2) = \tilde{S}(0,t_2)$) and associated marginal (or net) hazards. Complete observations would be the
smaller of the latent failure times, $\min(\tilde{T}_1, \tilde{T}_2)$, and the corresponding cause (the index of
the minimum). Incomplete observation may be a consequence of right-censoring at $C$, in
which case it is only known that both $\tilde{T}_1$ and $\tilde{T}_2$ are greater than $C$, and the cause of failure
is unknown.
A major problem with this approach is that, as discussed, e.g., by Tsiatis (1975), with
no further assumptions, the joint distribution $\tilde{S}(t_1,t_2)$ is unidentifiable from the available
observations – also in the case of no censoring. What is identifiable are the cause-specific
hazard functions $\alpha_h(t)$ and functions thereof, such as the overall survival function $S(t)$ and
the cumulative incidences $F_h(t)$ (e.g., Prentice et al., 1978), i.e., exactly the quantities that
we have been focusing on. In terms of the joint survival function, $S(t) = \tilde{S}(t,t)$ and the
cause-specific hazards are obtained as

$$\alpha_h(t) = -\frac{\partial}{\partial t_h} \log \tilde{S}(t_1,t_2)\Big|_{t_1=t_2=t}.$$

An assumption that would make $\tilde{S}(t_1,t_2)$ identifiable is independence of $\tilde{T}_1, \tilde{T}_2$ – a situation
known as independent competing risks and, under that assumption, the hazard of $\tilde{T}_h$ (the
net hazard) equals $\alpha_h(t)$. However, as just mentioned, independence cannot be identified
by observing only the minimum of the latent failure times and the associated cause. Another
assumption that would make $\tilde{S}(t_1,t_2)$ identifiable was discussed by Zheng and Klein (1995)
who showed that if the dependence structure is specified by some specific shared frailty
model (see Section 3.9), then $\tilde{S}(t_1,t_2)$ can be estimated. The problem is that there is no
support in the data for any such dependence structure.
These problems were nicely illustrated by Kalbfleisch and Prentice (1980, ch. 7) who gave
an example of two different joint survival functions, one corresponding to independence,
the other not necessarily, with identical cause-specific hazards. More specifically, the
following two joint survival functions were studied

$$\tilde{S}(t_1,t_2) = \exp\bigl(1 - \alpha_1 t_1 - \alpha_2 t_2 - \exp(\alpha_{12}(\alpha_1 t_1 + \alpha_2 t_2))\bigr)$$

(with $\alpha_1, \alpha_2 > 0$ and $\alpha_{12} > -1$) and

$$\tilde{S}^*(t_1,t_2) = \exp\left(1 - \alpha_1 t_1 - \alpha_2 t_2 - \frac{\alpha_1 e^{\alpha_{12}(\alpha_1+\alpha_2)t_1} + \alpha_2 e^{\alpha_{12}(\alpha_1+\alpha_2)t_2}}{\alpha_1 + \alpha_2}\right).$$

For $\tilde{S}$, $\tilde{T}_1$ and $\tilde{T}_2$ are independent if $\alpha_{12} = 0$ (in which case $\tilde{S}(t_1,t_2)$ is the product
$\exp(-\alpha_1 t_1)\exp(-\alpha_2 t_2)$) and the parameter $\alpha_{12}$, therefore, quantifies deviations from
independence. On the other hand, $\tilde{S}^*(t_1,t_2)$ corresponds to independence no matter the value
of $\alpha_{12}$ (it is always a product of survival functions of $t_1$ and $t_2$); however, the cause-specific
hazards for $\tilde{S}$ and $\tilde{S}^*$ are the same, namely

$$\alpha_h(t) = \alpha_h\bigl(1 + \alpha_{12}\exp(\alpha_{12}(\alpha_1 + \alpha_2)t)\bigr), \quad h = 1, 2.$$

So, even though $\alpha_1, \alpha_2, \alpha_{12}$ can all be estimated if this model for the cause-specific hazards
were postulated, the estimated value of $\alpha_{12}$ cannot be taken to quantify discrepancies from
independence, since it cannot be ascertained from the data whether $\tilde{S}$ or $\tilde{S}^*$ (or some other
model with the same cause-specific hazards) is correctly specified.
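
The claim that the two joint survival functions share their cause-specific hazards can be verified numerically. The sketch below differentiates the logarithm of each joint survival function by a central difference on the diagonal $t_1 = t_2 = t$ and compares with the closed-form hazard given above; the parameter values and function names are hypothetical and chosen only for illustration.

```python
import math

# Hypothetical parameter values (alpha_1, alpha_2 > 0 and alpha_12 > -1)
A1, A2, A12 = 0.4, 0.7, 0.5

def log_S(t1, t2):
    """log of the first joint survival function (dependent unless A12 = 0)."""
    return 1 - A1 * t1 - A2 * t2 - math.exp(A12 * (A1 * t1 + A2 * t2))

def log_S_star(t1, t2):
    """log of the second joint survival function (always a product form)."""
    c = A12 * (A1 + A2)
    return 1 - A1 * t1 - A2 * t2 - (A1 * math.exp(c * t1) + A2 * math.exp(c * t2)) / (A1 + A2)

def cause_specific_hazard(log_surv, h, t, eps=1e-6):
    """-d/dt_h log S(t1, t2) evaluated at t1 = t2 = t, by central difference."""
    if h == 1:
        diff = log_surv(t + eps, t) - log_surv(t - eps, t)
    else:
        diff = log_surv(t, t + eps) - log_surv(t, t - eps)
    return -diff / (2 * eps)

def closed_form_hazard(h, t):
    """alpha_h * (1 + alpha_12 * exp(alpha_12 * (alpha_1 + alpha_2) * t))."""
    a = A1 if h == 1 else A2
    return a * (1 + A12 * math.exp(A12 * (A1 + A2) * t))
```

Off the diagonal the two functions differ (for instance at $(t_1, t_2) = (1, 2)$), which illustrates that observations of the minimum and its cause cannot separate the two models.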
Such considerations have had the consequence that the latent failure time approach to com-
peting risks has been more or less abandoned in biostatistics (e.g., Prentice et al., 1978;
Andersen et al., 2012), and emphasis has been on cause-specific hazards and cumulative
incidences. Even when simulating competing risks data, it has been argued (Beyersmann et
al., 2009) that one should not use latent failure times even though this provides data with
the same distribution as when using cause-specific hazards. We will return to this question
in Section 5.4.
In summary, both concepts of independent censoring and independent competing risks re-
late to independence between two random time-variables. However, it is only in the former
case that these two random variables (time to event and time to censoring) are well-defined,
and, for that reason, we find the concept of independent censoring important, whereas we
find the concept of independent competing risks (and the associated latent failure time ap-
proach) less relevant.

4.4.4 Semi-competing risks (*)

As mentioned in Section 4.3, Fine et al. (2001) studied the illness-death model under the
heading of semi-competing risks with the focus of studying the marginal distribution of
time to illness (entry into state 1 in Figure 1.3). This activity is quite similar to the study of
marginal distributions of times to failure from specific causes in the competing risks model
as discussed in the previous section. The approach of Fine et al. (2001) was to assume a
shared gamma frailty model for the joint distribution of time to illness and time to death,
and, under such an assumption, the marginal distributions are identifiable from the available
data where occurrence of death (entry into state 2) prevents the illness from ever happening.
However, since this (or a similar) association structure cannot be supported by the observed
data, we will follow recommendations of Xu et al. (2010) and Andersen and Keiding (2012)
and not study the illness-death model from the point of view of semi-competing risks.

Independent censoring/competing risks

Both of the concepts of independent censoring and independent competing risks

relate to independence between two random time-variables. However, it is only in
the former case that these two random variables (time to event and time to censoring)
are well-defined, and, for that reason, we find the concept of independent censoring
important, whereas we find the concept of independent competing risks (and the
associated latent failure time approach) less relevant.
4.5 Exercises

Exercise 4.1 Consider the data from the Copenhagen Holter study and estimate the proba-
bilities of stroke-free survival for subjects with or without ESVEA using the Kaplan-Meier
estimator.

Exercise 4.2 Consider the Cox model for stroke-free survival in the Copenhagen Holter
study including the covariates ESVEA, sex, age, and systolic blood pressure (Exercise 2.4).
1. Estimate the survival functions for a female subject aged 65 years and with systolic
blood pressure equal to 150 mmHg – either with or without ESVEA.
2. Estimate the survival functions for patients with or without ESVEA using the g-formula.

Exercise 4.3 Consider the data from the Copenhagen Holter study and fit a linear model
for the 3-year restricted mean time to the composite end-point stroke or death including
ESVEA, sex, age, and systolic blood pressure.

Exercise 4.4 Consider the Cox models for the cause-specific hazards for the outcomes
stroke and death without stroke in the Copenhagen Holter study including ESVEA, sex,
age, and systolic blood pressure (Exercise 2.7). Estimate (using plug-in) the cumulative
incidences for both end-points for a female subject aged 65 years and with systolic blood
pressure equal to 150 mmHg – either with or without ESVEA.

Exercise 4.5
1. Repeat the previous question using instead Fine-Gray models.
2. Estimate the cumulative incidence functions for patients with or without ESVEA using
the g-formula.

Exercise 4.6 Consider the data from the Copenhagen Holter study and fit linear models
for the expected time lost (numbers of years) before 3 years due to either stroke or death
without stroke including ESVEA, sex, age, and systolic blood pressure.

Exercise 4.7 Consider an illness-death model for the Copenhagen Holter study with states
‘0: Alive without AF or stroke’, ‘1: Alive with AF and no stroke’, ‘2: Dead or stroke’, see
Figures 1.3 and 1.7.
1. Estimate the prevalence of AF.
2. Estimate the expected lengths of stay in states 0 or 1 up to 3 years.
3. Evaluate the SD of the expected lengths of stay using the bootstrap.

Exercise 4.8 Consider the data on mortality in relation to childhood vaccinations in
Guinea-Bissau, Example 1.1.2.
1. Fit a marginal hazard model for the mortality rate, adjusting for cluster ‘(village)’ and
including binary variables for BCG and DTP vaccination and adjusting for age at re-
cruitment (i.e., using time since recruitment as time-variable).
2. Compare the results with those from the gamma frailty model (Exercise 3.12).

Exercise 4.9 Consider the data on recurrent episodes in affective disorder, Example 1.1.5.
1. Estimate the mean number of episodes, µ(t), in [0,t] for unipolar and bipolar patients,
taking the mortality into account.
2. Estimate, incorrectly, the same mean curves by treating death as censoring and compare
with the correct curves from the first question, thereby, re-constructing the cover figure
from this book (unipolar patients).

Exercise 4.10 Consider the data from the Copenhagen Holter study.
1. Estimate the distribution, G(t), of censoring.
2. Examine to what extent this distribution depends on the variables ESVEA, sex, age, and
systolic blood pressure.
Chapter 5

Marginal models

This chapter explains some of the mathematical foundation for the methods illustrated via
the practical examples in Chapter 4. As we did in that chapter, we will separately study
methods based on plug-in of results from intensity models and, here, it turns out to be cru-
cial whether the multi-state process is Markovian (Section 5.1) or not (Section 5.2). Along
the way, we will also introduce methods that were not exemplified in Chapter 4, including
the techniques of landmarking and micro-simulation (Sections 5.3 and 5.4), both of which
also build on plug-in of hazard models. The second part of this chapter, Sections 5.5-5.7,
describes the background for direct models for marginal parameters based on generalized
estimating equations (GEEs), including a general method based on cumulative residuals
for assessment of goodness-of-fit. Finally, Section 5.8 provides practical examples of new
methods presented.

5.1 Plug-in for Markov processes (*)

Let $V(t)$ be a multi-state process with state space $\mathcal{S} = \{0, 1, \ldots, k\}$ and assume that $V(t)$
is Markovian, i.e., the transition intensities satisfy the Markov property

$$\alpha_{hj}(t)\,dt \approx P(V(t+dt) = j \mid V(t) = h, \mathcal{H}_{t-}) = P(V(t+dt) = j \mid V(t) = h, Z) \qquad (5.1)$$

for all states $h, j \in \mathcal{S}$, $j \neq h$. Here, the $Z$ are time-fixed covariates included in $\mathcal{H}_t$, i.e., the
only way in which the intensities depend on the past is via the current state $h = V(t)$ and
via $Z$. Define the cumulative intensities $A_{hj}(t) = \int_0^t \alpha_{hj}(u)\,du$ and let

$$A_{hh}(t) = -\sum_{j \in \mathcal{S},\, j \neq h} A_{hj}(t).$$

We can now collect all $A_{hj}(t)$, $h, j \in \mathcal{S}$ in a $(k+1) \times (k+1)$-matrix $\mathbf{A}(t)$. The
product-integral of $\mathbf{A}(\cdot)$ over the interval $(s,t]$ is defined as the $(k+1) \times (k+1)$-matrix

$$\mathbf{P}(s,t) = \prod_{(s,t]} \bigl(\mathbf{I} + d\mathbf{A}(u)\bigr) = \lim_{\max|u_i - u_{i-1}| \to 0} \prod_{i=1}^{N} \bigl(\mathbf{I} + \mathbf{A}(u_i) - \mathbf{A}(u_{i-1})\bigr) \qquad (5.2)$$

for any partition $s = u_0 < u_1 < \cdots < u_N = t$ of $(s,t]$ (Gill and Johansen, 1990). Here, $\mathbf{I}$ is the
$(k+1) \times (k+1)$ identity matrix. We defined the $A_{hj}$ as integrated intensities, but (5.2) is

also well-defined if the $A_{hj}$ have jumps, and in the case where the $A_{hj}$ correspond to purely
discrete measures, the product-integral (5.2) is just a finite matrix product over the jump
times in $(s,t]$ (reflecting the Chapman-Kolmogorov equations). In the special case where
all intensities are time-constant on $(s,t]$, the product-integral (5.2) is the matrix exponential

$$\mathbf{P}(s,t) = \exp(\boldsymbol{\alpha} \cdot (t-s)) = \mathbf{I} + \sum_{i=1}^{\infty} \frac{1}{i!}\bigl(\boldsymbol{\alpha} \cdot (t-s)\bigr)^i.$$
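
As an illustration of the time-constant case, the following sketch compares the truncated exponential series with a finite product over a fine, equidistant partition of $(s,t]$ for a three-state progressive illness-death intensity matrix. The intensity values, the truncation at 40 terms, the number of partition points, and the function names are assumptions made for this example only.

```python
import numpy as np

def intensity_matrix(a01, a02, a12):
    """Time-constant intensity matrix for states 0, 1, 2 (rows sum to zero)."""
    return np.array([[-(a01 + a02), a01, a02],
                     [0.0, -a12, a12],
                     [0.0, 0.0, 0.0]])

def expm_series(M, terms=40):
    """I + sum_{i=1}^{terms} M^i / i!, i.e., the series truncated after `terms` terms."""
    result = np.eye(M.shape[0])
    term = np.eye(M.shape[0])
    for i in range(1, terms + 1):
        term = term @ M / i               # M^i / i!
        result = result + term
    return result

def product_integral_constant(alpha, length, n_steps):
    """Finite product of (I + alpha * du) over an equidistant partition."""
    step = np.eye(alpha.shape[0]) + alpha * (length / n_steps)
    return np.linalg.matrix_power(step, n_steps)

alpha = intensity_matrix(0.3, 0.1, 0.5)   # hypothetical intensities
P_exp = expm_series(alpha * 2.0)          # t - s = 2
P_prod = product_integral_constant(alpha, 2.0, 100_000)
```

With a finer partition, `P_prod` approaches `P_exp`, and each row of either matrix sums to one, as required for a transition probability matrix.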

It now holds (Gill and Johansen, 1990), that $\mathbf{P}(s,t)$ is the transition probability matrix
for the Markov process $V(\cdot)$, i.e., element $h, j$ is $P_{hj}(s,t) = P(V(t) = j \mid V(s) = h)$. If $\mathbf{A}$ is
absolutely continuous, then, for given intensity matrix $\boldsymbol{\alpha}(t)$, the matrix $\mathbf{P}$ given by (5.2)
solves the Kolmogorov forward differential equations

$$\frac{\partial}{\partial t}\mathbf{P}(s,t) = \mathbf{P}(s,t)\,\boldsymbol{\alpha}(t), \quad \text{with } \mathbf{P}(s,s) = \mathbf{I}. \qquad (5.3)$$

This suggests plug-in estimators for $\mathbf{P}(s,t)$ based on models fitted for the intensities. A
non-parametric estimator for $\mathbf{P}(s,t)$ for an assumed homogeneous group is obtained by
plugging-in the Nelson-Aalen estimator $\hat{\mathbf{A}}$, and the resulting estimator

$$\hat{\mathbf{P}}(s,t) = \prod_{(s,t]} \bigl(\mathbf{I} + d\hat{\mathbf{A}}(u)\bigr) \qquad (5.4)$$

is the Aalen-Johansen estimator (Aalen and Johansen, 1978; Andersen et al., 1993; ch. IV).
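
Because the Nelson-Aalen estimator only jumps at observed transition times, (5.4) is a finite matrix product. A minimal sketch of this computation is given below, assuming a hypothetical data layout in which each record is `(start, stop, from_state, to_state)` and `to_state == from_state` marks a record that is censored at `stop`; the function name and the layout are illustrative assumptions and not an interface of any particular software package.

```python
import numpy as np

def aalen_johansen(records, n_states, s, t):
    """Finite matrix product of (I + dA_hat(u)) over transition times u in (s, t].

    records: iterable of (start, stop, from_state, to_state); a record with
    to_state == from_state is censored at stop (hypothetical layout).
    """
    rec = np.asarray(records, dtype=float)
    start, stop = rec[:, 0], rec[:, 1]
    frm, to = rec[:, 2].astype(int), rec[:, 3].astype(int)
    is_transition = frm != to
    jump_times = np.unique(stop[is_transition & (stop > s) & (stop <= t)])
    P = np.eye(n_states)
    for u in jump_times:
        dA = np.zeros((n_states, n_states))
        for h in range(n_states):
            # Y_h(u): number at risk in state h just before u
            at_risk = np.sum((frm == h) & (start < u) & (stop >= u))
            if at_risk == 0:
                continue
            for j in range(n_states):
                if j != h:
                    dN = np.sum(is_transition & (frm == h) & (to == j) & (stop == u))
                    dA[h, j] = dN / at_risk       # Nelson-Aalen increment
            dA[h, h] = -dA[h].sum()               # rows of dA sum to zero
        P = P @ (np.eye(n_states) + dA)
    return P

# Small competing risks example: four subjects starting in state 0
example = [(0, 1, 0, 1), (0, 2, 0, 2), (0, 3, 0, 1), (0, 4, 0, 0)]
P_hat = aalen_johansen(example, n_states=3, s=0, t=3)
```

In this small example nobody is censored before time 3, so the first row of `P_hat` equals the observed fractions in states 0, 1, and 2 at that time.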
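If the transition intensities for a homogeneous group are assumed piece-wise constant on
$(s,t]$, e.g., $\boldsymbol{\alpha} = \boldsymbol{\alpha}_1$ on $(s,u]$ and $\boldsymbol{\alpha} = \boldsymbol{\alpha}_2$ on $(u,t]$, then $\boldsymbol{\alpha}_1$ and $\boldsymbol{\alpha}_2$ are estimated by separate
occurrence/exposure rates in the two sub-intervals, and the plug-in estimator, using the
Chapman-Kolmogorov equations, is

$$\hat{\mathbf{P}}(s,t) = \hat{\mathbf{P}}(s,u)\,\hat{\mathbf{P}}(u,t) = \exp(\hat{\boldsymbol{\alpha}}_1 \cdot (u-s))\exp(\hat{\boldsymbol{\alpha}}_2 \cdot (t-u)), \qquad (5.5)$$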

with similar expressions if there are more than two sub-intervals of (s,t] on which α is
constant. Both this expression and (5.4) also apply if the model for the intensities is a
hazard regression model with time-fixed covariates, e.g., a Cox model or a Poisson model.
The situation with time-dependent covariates is more challenging and will be discussed
in Section 5.2.4. Aalen et al. (2001) studied plug-in estimation based on additive hazards
models.
The state occupation probabilities are

$$Q_h(t) = P(V(t) = h) = \sum_{j \in \mathcal{S}} Q_j(0) P_{jh}(0,t).$$

In the situation where all subjects are in the same state (0) at time t = 0, i.e., Q0 (0) = 1,
these are Qh (t) = P0h (0,t) and the product-limit estimator may be used for this marginal
parameter. We will pay special attention to this situation in what follows. Based on the
state occupation probabilities, another marginal parameter, the expected length of stay in
state h before time τ, is directly obtained as
$$\varepsilon_h(\tau) = \int_0^\tau Q_h(t)\,dt.$$
Since these marginal parameters (and P(s,t)) are differentiable functionals of the intensi-
ties, large-sample properties of the resulting plug-in estimators may be derived from those
of the intensity estimators using functional delta-methods. The details are beyond the scope
of this presentation and may be found in Andersen et al. (1993; ch. II and IV).
We will now look at some of the multi-state models introduced in Section 1.1.

5.1.1 Two-state model (*)

In Figure 1.11, there are two states and a single transition intensity, α01 (t) = α(t) from 0
to 1, the hazard function for the distribution of the survival time T . The two-state process
is born Markov and the resulting A matrix is

$$\mathbf{A}(t) = \begin{pmatrix} -A(t) & A(t) \\ 0 & 0 \end{pmatrix}.$$

For this model, the transition probability matrix is

$$\mathbf{P}(s,t) = \begin{pmatrix} P_{00}(s,t) & 1 - P_{00}(s,t) \\ 0 & 1 \end{pmatrix}$$

and the differential Equation (5.3) is

$$\frac{\partial}{\partial t} P_{00}(s,t) = -P_{00}(s,t)\,\alpha(t)$$

with the solution $P_{00}(s,t) = \exp(-\int_s^t \alpha(u)\,du)$. In Section 4.1, an intuitive argument for this
expression was given and the survival function $S(t) = P_{00}(0,t)$ may be estimated by plug-in
based on a model for $\alpha(t)$, e.g., a piece-wise constant hazard model. If $A$ corresponds to
a discrete distribution, such as that estimated by the Nelson-Aalen estimator or by a Cox
model using the Breslow estimator for the cumulative baseline hazard, then it is the general
product-integral in which plug-in should be made and not the exponential expression. This
leads in the case of the Nelson-Aalen estimator to the Kaplan-Meier estimator for $S(t)$

$$\hat{S}(t) = \prod_{[0,t]} \bigl(1 - d\hat{A}(u)\bigr) = \prod_{X_i \leq t} \left(1 - \frac{dN(X_i)}{Y(X_i)}\right)$$

(Kaplan and Meier, 1958) and the conditional Kaplan-Meier estimator for $P_{00}(s,t) = S(t \mid s)$

$$\hat{S}(t \mid s) = \prod_{s < X_i \leq t} \left(1 - \frac{dN(X_i)}{Y(X_i)}\right),$$

where $X_1, X_2, \ldots$ are the observation times, $X_i = T_i \wedge C_i$. The general variance formula in
this case reduces to the Greenwood formula

$$SD(\hat{S}(t)) = \hat{S}(t)\sqrt{\sum_{X_i \leq t} \frac{dN(X_i)}{Y(X_i)(Y(X_i) - 1)}}$$
(Kaplan and Meier, 1958). As discussed in Section 4.1.1, confidence limits for S(t) are
typically based on symmetric confidence limits for log(A(t)) = log(− log(S(t))).
Note that, in the special case of no censoring, the Kaplan-Meier estimator is the relative
frequency $\hat{S}(t) = Y(t)/n$ of survivors at time $t$, and the Greenwood formula reduces to the
binomial variance formula.
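
The Kaplan-Meier estimator and the Greenwood formula can be sketched in a few lines. The code below is an illustration under the assumption of no tied observation times (so that $dN(X_i)$ is 0 or 1); the function name and the small data set are hypothetical.

```python
import numpy as np

def kaplan_meier_greenwood(times, is_event, t):
    """Return (S_hat(t), SD of S_hat(t)) using the Greenwood formula.

    Assumes no tied observation times, so dN(X_i) is 0 or 1.
    """
    order = np.argsort(times)
    x = times[order]
    d = is_event[order].astype(float)
    n = len(x)
    at_risk = (n - np.arange(n)).astype(float)   # Y(X_i)
    use = (x <= t) & (d == 1.0)                  # failure times up to t
    surv = float(np.prod(1.0 - 1.0 / at_risk[use]))
    if surv == 0.0:
        return 0.0, 0.0                          # last subject failed: no variability left
    var_sum = float(np.sum(1.0 / (at_risk[use] * (at_risk[use] - 1.0))))
    return surv, surv * float(np.sqrt(var_sum))

# Hypothetical complete (uncensored) sample of eight life times
T = np.array([2.0, 5.0, 1.0, 7.0, 3.0, 9.0, 4.0, 8.0])
all_failed = np.ones(len(T), dtype=bool)
s_hat, sd_hat = kaplan_meier_greenwood(T, all_failed, 4.5)
```

In this uncensored example, four of the eight subjects survive beyond 4.5, so `s_hat` is the relative frequency 0.5 and `sd_hat` equals the binomial value $\sqrt{0.5 \cdot 0.5 / 8}$.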
For the Cox model with time-fixed covariates, the cumulative hazard for given covariates
$Z$ is $A_0(t)\exp(\beta^T Z)$ and the plug-in estimator for the survival probability is

$$\hat{S}(t \mid Z) = \prod_{X_i \leq t} \left(1 - \frac{dN(X_i)\exp(\hat{\beta}^T Z)}{\sum_{j \in R(X_i)} \exp(\hat{\beta}^T Z_j)}\right),$$

where $R(t) = \{j : Y_j(t) = 1\}$ is the risk set at time $t$, cf. Equation (3.19). This estimator may
become negative and an alternative and commonly used estimator is

$$\hat{S}(t \mid Z) = \exp\bigl(-\hat{A}_0(t)\exp(\hat{\beta}^T Z)\bigr),$$

where $\hat{A}_0(t)$ is the Breslow estimator (3.18). This estimator can be criticized for using
the continuous-time version of the product-integral on a time-discrete estimator; how-
ever, from a practical point of view, the differences between these estimators tend to
be of minor importance, cf. the discussion in Section 4.1.1 about not estimating S(t) by
‘exp(−Nelson-Aalen)’.
From estimates of the survival function $S$, the restricted mean life time

$$\varepsilon_0(\tau) = E(T \wedge \tau) = \int_0^\tau S(t)\,dt$$

may be estimated by plug-in, also for given covariates based on a hazard regression model.
In the special case of no censoring, $\hat{\varepsilon}_0(\tau) = (1/n)\sum_i (T_i \wedge \tau)$ is a simple average.
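
Since the Kaplan-Meier estimator is a step function, the plug-in estimate of the restricted mean is a finite sum of rectangle areas. The sketch below assumes no tied observation times; the function name and the data are hypothetical.

```python
import numpy as np

def restricted_mean(times, is_event, tau):
    """Area under the Kaplan-Meier curve on [0, tau] (no tied times assumed)."""
    order = np.argsort(times)
    x = times[order]
    d = is_event[order].astype(float)
    n = len(x)
    at_risk = n - np.arange(n)
    surv_after = np.cumprod(1.0 - d / at_risk)       # S_hat just after each ordered time
    before_tau = x < tau
    grid = np.concatenate(([0.0], x[before_tau], [tau]))
    heights = np.concatenate(([1.0], surv_after[before_tau]))
    return float(np.sum(heights * np.diff(grid)))    # sum of rectangle areas

# Hypothetical complete sample: the estimate should equal the average of min(T_i, tau)
T = np.array([2.0, 5.0, 1.0, 7.0, 3.0])
all_failed = np.ones(len(T), dtype=bool)
rmst = restricted_mean(T, all_failed, 4.0)
```

If the largest observation time is a censoring before $\tau$, this sketch carries the last value of the Kaplan-Meier estimator forward to $\tau$, which is one of several possible conventions.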

5.1.2 Competing risks (*)

We will focus on the competing risks model in Figure 1.2 with one transient state (0) and
two absorbing states (1, 2). The general competing risks model with k > 2 absorbing states
is similar and, in any case, the competing risks process is born Markov. For the three-state
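model, the $\mathbf{A}$ matrix is:

$$\mathbf{A}(t) = \begin{pmatrix} -A_1(t) - A_2(t) & A_1(t) & A_2(t) \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{pmatrix}$$

where $A_1, A_2$ are the cumulative cause-specific hazards for the two competing events. With
the $\mathbf{P}$ matrix

$$\mathbf{P}(s,t) = \begin{pmatrix} P_{00}(s,t) & P_{01}(s,t) & P_{02}(s,t) \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}$$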
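(where $P_{00} = 1 - P_{01} - P_{02}$ and $P_{0h}(s,t)$, $h = 1, 2$ are the conditional cumulative incidences),
the Kolmogorov forward equations become

$$\frac{\partial}{\partial t} P_{00}(s,t) = -P_{00}(s,t)\bigl(\alpha_1(t) + \alpha_2(t)\bigr)$$

$$\frac{\partial}{\partial t} P_{0h}(s,t) = P_{00}(s,t)\,\alpha_h(t), \quad h = 1, 2$$

with solutions

$$P_{00}(s,t) = \exp\left(-\int_s^t \bigl(\alpha_1(u) + \alpha_2(u)\bigr)\,du\right)$$

$$P_{0h}(s,t) = \int_s^t P_{00}(s,u)\,\alpha_h(u)\,du, \quad h = 1, 2.$$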

In Section 4.1, intuitive arguments for these expressions were given. The resulting
non-parametric plug-in estimators (for $s = 0$) are the overall Kaplan-Meier estimator

$$\hat{S}(t) = \hat{P}_{00}(0,t) = \hat{Q}_0(t) = \prod_{X_i \leq t} \left(1 - \frac{dN(X_i)}{Y(X_i)}\right)$$

(jumping at times of failure from either cause) and the cumulative incidence estimator

$$\hat{Q}_h(t) = \hat{F}_h(t) = \hat{P}_{0h}(0,t) = \int_0^t \hat{S}(u-)\,\frac{dN_h(u)}{Y(u)}, \quad h = 1, 2, \qquad (5.6)$$

where $N_h(\cdot)$ counts failures from cause $h = 1, 2$ and $Y(t)$ is the number of subjects at risk
in state 0 just before time $t$. The estimator (5.6) was discussed by, e.g., Aalen (1978), and
is often denoted the Aalen-Johansen estimator even though this name is also used for the
general estimator (5.4) (Aalen and Johansen, 1978). Note that, in (5.6), the Kaplan-Meier
estimator is evaluated just before a cause-$h$ failure time at $u$.
In the special case of no censoring, $\hat{Q}_h(t) = N_h(t)/n$.
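
The estimator (5.6) adds, at each cause-$h$ failure time, the Kaplan-Meier value just before that time divided by the number at risk. A minimal sketch, assuming no tied observation times and a hypothetical coding of the cause variable (0 for censoring, 1 or 2 for the failure cause), is:

```python
import numpy as np

def cumulative_incidence(times, cause, t):
    """Return (S_hat(t), F1_hat(t), F2_hat(t)) for two competing causes.

    cause: 0 = censored, 1 or 2 = cause of failure (hypothetical coding).
    Assumes no tied observation times.
    """
    order = np.argsort(times)
    x = times[order]
    c = cause[order]
    n = len(x)
    surv_before = 1.0                  # overall Kaplan-Meier value S_hat(u-)
    F = {1: 0.0, 2: 0.0}
    for i in range(n):
        if x[i] > t:
            break
        at_risk = n - i                # Y(u) in state 0 just before u = x[i]
        if c[i] in (1, 2):
            F[int(c[i])] += surv_before / at_risk
            surv_before *= 1.0 - 1.0 / at_risk
    return surv_before, F[1], F[2]

# Hypothetical uncensored example: F_h(t) should equal N_h(t)/n
times = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
cause = np.array([1, 2, 1, 1, 2])
S_hat, F1_hat, F2_hat = cumulative_incidence(times, cause, 3.5)
```

At every failure time the decrease in the Kaplan-Meier value equals the increase in one of the cumulative incidences, so the three estimates always sum to one, with or without censoring.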
Expressions similar to $\hat{S}$ and (5.6) apply when estimating the state occupation probabilities
$Q_h(t)$, $h = 0, 1, 2$ based on, e.g., Cox models, for the two cause-specific hazards:
$\alpha_h(t \mid Z) = \alpha_{h0}(t)\exp(\beta_h^T Z)$. In all cases, variance estimators are available (e.g., Andersen et al., 1991;
1993, ch. VII). Plug-in based on piece-wise exponential models may also be performed.
If $T$, as in the two-state model, is the life time (time spent in state 0), then the restricted
mean life time is

$$\varepsilon_0(\tau) = E(T \wedge \tau) = \int_0^\tau S(t)\,dt$$

and plug-in estimation is straightforward. In the competing risks model, one can also study
the random variables

$$T_h = \inf\{t > 0 : V(t) = h\}, \quad h = 1, 2,$$

i.e., the times of entry into state $h = 1, 2$. These are improper random variables because,
e.g., $P(T_1 = \infty) = \lim_{t \to \infty} Q_2(t)$ is the positive probability that cause 1 never happens. The
restricted random variables $T_h \wedge \tau$ are proper and

$$E(T_h \wedge \tau) = E\int_0^\tau I(T_h > t)\,dt = \tau - E\int_0^\tau I(T_h \leq t)\,dt = \tau - \int_0^\tau Q_h(t)\,dt.$$
It follows that

$$\varepsilon_h(\tau) = \int_0^\tau Q_h(t)\,dt$$

can be interpreted as the expected life time lost due to cause $h$ before time $\tau$ and plug-in
estimation is straightforward (Andersen, 2013).
In the special case of no censoring, $\hat{\varepsilon}_h(\tau) = (1/n)\sum_i (\tau - T_{hi} \wedge \tau)$ is a simple average.

5.1.3 Progressive illness-death model (*)

In Figure 1.3 there are three transition intensities, two transient states (0, 1) and one ab-
sorbing (2). This model is Markov whenever the α12 intensity only depends on t and not on
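duration in state 1. This leads to the $\mathbf{A}$ matrix

$$\mathbf{A}(t) = \begin{pmatrix} -A_{01}(t) - A_{02}(t) & A_{01}(t) & A_{02}(t) \\ 0 & -A_{12}(t) & A_{12}(t) \\ 0 & 0 & 0 \end{pmatrix}$$

and associated $\mathbf{P}$ matrix

$$\mathbf{P}(s,t) = \begin{pmatrix} P_{00}(s,t) & P_{01}(s,t) & P_{02}(s,t) \\ 0 & P_{11}(s,t) & P_{12}(s,t) \\ 0 & 0 & 1 \end{pmatrix}$$

with $P_{00} = 1 - P_{01} - P_{02}$ and $P_{11} = 1 - P_{12}$. The Kolmogorov forward equations become

$$\frac{\partial}{\partial t} P_{00}(s,t) = -P_{00}(s,t)\bigl(\alpha_{01}(t) + \alpha_{02}(t)\bigr)$$

$$\frac{\partial}{\partial t} P_{01}(s,t) = P_{00}(s,t)\,\alpha_{01}(t) - P_{01}(s,t)\,\alpha_{12}(t)$$

$$\frac{\partial}{\partial t} P_{11}(s,t) = -P_{11}(s,t)\,\alpha_{12}(t)$$

with solutions

$$P_{00}(s,t) = \exp\left(-\int_s^t \bigl(\alpha_{01}(u) + \alpha_{02}(u)\bigr)\,du\right)$$

$$P_{11}(s,t) = \exp\left(-\int_s^t \alpha_{12}(u)\,du\right)$$

$$P_{01}(s,t) = \int_s^t P_{00}(s,u)\,\alpha_{01}(u)\,P_{11}(u,t)\,du.$$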

In Section 4.1.3, intuitive arguments for the latter expression were given, and plug-in estimation
is possible. Classical work on the illness-death model includes Fix and Neyman
(1951) and Sverdrup (1965).
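
The explicit expression for $P_{01}(s,t)$ may be compared with the general product-integral on a fine partition. The sketch below does this for time-varying intensities; the particular cumulative intensities, the grid sizes, and the function names are hypothetical choices made for illustration only.

```python
import numpy as np

# Hypothetical cumulative intensities and the matching intensity alpha_01
def A01(t): return 0.2 * t + 0.05 * t**2
def A02(t): return 0.05 * t
def A12(t): return 0.15 * t**2
def a01(t): return 0.2 + 0.1 * t

def p01_explicit(s, t, n_grid=4001):
    """Integral of P00(s,u) * alpha01(u) * P11(u,t) over (s,t], trapezoid rule."""
    u = np.linspace(s, t, n_grid)
    p00 = np.exp(-(A01(u) - A01(s) + A02(u) - A02(s)))
    p11 = np.exp(-(A12(t) - A12(u)))
    f = p00 * a01(u) * p11
    du = u[1] - u[0]
    return float(du * (f.sum() - 0.5 * (f[0] + f[-1])))

def p_product_integral(s, t, n_steps=20000):
    """Finite product of (I + A(u_i) - A(u_{i-1})) over an equidistant partition."""
    grid = np.linspace(s, t, n_steps + 1)
    P = np.eye(3)
    for lo, hi in zip(grid[:-1], grid[1:]):
        d01, d02, d12 = A01(hi) - A01(lo), A02(hi) - A02(lo), A12(hi) - A12(lo)
        step = np.array([[1.0 - d01 - d02, d01, d02],
                         [0.0, 1.0 - d12, d12],
                         [0.0, 0.0, 1.0]])
        P = P @ step
    return P

P = p_product_integral(0.0, 2.0)
p01 = p01_explicit(0.0, 2.0)
```

The entries `P[0, 0]` and `P[1, 1]` should also agree with the exponential expressions for $P_{00}$ and $P_{11}$, up to the discretization error of the partition.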
Additional marginal parameters for the illness-death model (with or without recovery) include

$$\varepsilon_h(\tau) = \int_0^\tau Q_h(t)\,dt, \quad h = 0, 1,$$

the expected length of stay in [0, τ], respectively, alive without or with the illness. The
prevalence is

$$\frac{Q_1(t)}{Q_0(t) + Q_1(t)},$$

i.e., the probability of living with the illness at time $t$ given alive at that time.

5.1.4 Recurrent events (*)

In Section 1.1, two diagrams for recurrent events were studied, corresponding to either in-
tervals or no intervals between periods where subjects are at risk for the recurrent event
(Figures 1.4 and 1.5, respectively). In the former case (the illness-death model with
recovery), there are two transient states (0 and 1), one absorbing state (2), and the $\mathbf{A}(t)$ matrix
is

$$\mathbf{A}(t) = \begin{pmatrix} -A_{01}(t) - A_{02}(t) & A_{01}(t) & A_{02}(t) \\ A_{10}(t) & -A_{10}(t) - A_{12}(t) & A_{12}(t) \\ 0 & 0 & 0 \end{pmatrix}.$$

The associated $\mathbf{P}$ matrix is

$$\mathbf{P}(s,t) = \begin{pmatrix} P_{00}(s,t) & P_{01}(s,t) & P_{02}(s,t) \\ P_{10}(s,t) & P_{11}(s,t) & P_{12}(s,t) \\ 0 & 0 & 1 \end{pmatrix}$$

with P00 = 1 − P01 − P02 and P11 = 1 − P10 − P12 . In contrast to the progressive illness-death
model, the Kolmogorov forward equations do not have an explicit solution but for given
Nelson-Aalen estimates $\hat{A}_{hj}$, P can be estimated from (5.4) using plug-in. For the recurrent
events intensity α01 (t), this corresponds to an AG-type Markov model, see Section 2.5.
Assuming, further, that Q0 (0) = 1, both the εh (τ) and the prevalence may be estimated as
in the previous section. This was exemplified in Section 4.1.3 using the data on recurrence
in affective disorders (Example 1.1.5).
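Because no explicit solution exists for the model with recovery, the plug-in step amounts to multiplying matrices of the form I + dÂ(u) over the observed event times. The following Python sketch shows this finite product for the three-state model with recovery. The helper names `increment` and `product_integral` and the jump sizes are hypothetical; in practice each jump would be a Nelson-Aalen increment dN_hj(u)/Y_h(u).

```python
import numpy as np

def increment(d01=0.0, d02=0.0, d10=0.0, d12=0.0):
    """Matrix dA(u) of cumulative-hazard jumps at one event time (states 0, 1, 2)."""
    return np.array([
        [-(d01 + d02), d01, d02],
        [d10, -(d10 + d12), d12],
        [0.0, 0.0, 0.0],            # state 2 is absorbing
    ])

def product_integral(increments):
    """Finite product of (I + dA(u)) over the ordered event times in (s, t]."""
    P = np.eye(3)
    for dA in increments:
        P = P @ (np.eye(3) + dA)
    return P

# Hypothetical jumps at four ordered event times in (s, t]
jumps = [
    increment(d01=1 / 10),              # one 0 -> 1 transition, 10 at risk in state 0
    increment(d02=1 / 9),               # one 0 -> 2 transition, 9 at risk in state 0
    increment(d10=1 / 3),               # one 1 -> 0 transition, 3 at risk in state 1
    increment(d01=1 / 8, d12=1 / 2),    # simultaneous 0 -> 1 and 1 -> 2 transitions
]
P_hat = product_integral(jumps)
print(P_hat)
```

Each factor is a stochastic matrix as long as no jump exceeds one, so the rows of the product sum to one by construction.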
For the case with no intervals between at-risk periods, a maximum number (say, K) of
recurrent events to be considered must be decided upon in order to get a transition matrix
with finite dimension. Letting, e.g., K = 2 this corresponds to having states 0 and 1 in
Figure 1.5 transient and states 2 and D absorbing (i.e., transitions out of state 2 are not
considered). Re-labelling state D as 3, this gives the A(t) matrix
\[
A(t) = \begin{pmatrix}
-A_{01}(t) - A_{03}(t) & A_{01}(t) & 0 & A_{03}(t) \\
0 & -A_{12}(t) - A_{13}(t) & A_{12}(t) & A_{13}(t) \\
0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0
\end{pmatrix}
\]
and associated P matrix
\[
P(s,t) = \begin{pmatrix}
P_{00}(s,t) & P_{01}(s,t) & P_{02}(s,t) & P_{03}(s,t) \\
0 & P_{11}(s,t) & P_{12}(s,t) & P_{13}(s,t) \\
0 & 0 & 1 & 0 \\
0 & 0 & 0 & 1
\end{pmatrix}
\]
with P00 = 1 − P01 − P02 − P03 and P11 = 1 − P12 − P13 . In this case, the model is progressive
and explicit expressions for the elements of P exist (see next section) but plug-in using
the general expression (5.4) is also possible. Models for the recurrence intensities αh,h+1 (t)
could be of PWP Markov type (i.e., separate models for each h) or of AG Markov type (one
common model for all h), see Section 2.3. For this model, another marginal parameter of
potential interest, also estimable from $\hat{P}$, is the probability P(N(t) ≥ h) of seeing at least
h recurrences in [0,t]. For situations where mortality is negligible, simplified versions of
the two models considered are available (e.g., Exercise 5.1).

5.1.5 Progressive multi-state models (*)

In some models, explicit expressions for the state occupation probabilities Qh (t) were avail-
able and we have seen that these expressions follow directly from the general product-
integral representation of the transition probability matrix P. The strength of the general
approach is that it applies to any multi-state Markov process including, as we saw in the
previous section, the model depicted in Figure 1.4 where subjects may return to a previ-
ously occupied state. In progressive (or irreversible) models such as that shown in Figure
1.6 where subjects do not return to a previously occupied state, some state occupation
probabilities may be expressed explicitly in terms of the intensities. As an example, for the
model in Figure 1.6, the probability Q2 (t) of being alive after a relapse may be expressed,
as follows. There are two paths leading to state 2: Either (a) directly from the initial state
0, or (b) from that state via state 1 (GvHD). The probability of the former path is given by
\[
Q_2^{(a)}(t) = \int_0^t P_{00}(0,u)\,\alpha_{02}(u)\,P_{22}(u,t)\,du
\]

because, at some time u < t, a 0 → 2-transition must have happened and, between times u
and t, the subject must stay in state 2. For the other path, i.e., via state 1, the probability is
\[
Q_2^{(b)}(t) = \int_0^t P_{00}(0,u)\,\alpha_{01}(u) \int_u^t P_{11}(u,x)\,\alpha_{12}(x)\,P_{22}(x,t)\,dx\,du,
\]

reflecting that, first a 0 → 1-transition must happen (at u < t), next the subject must stay
in state 1 from time u to a time x between u and t, make a 1 → 2-transition at x and stay
in state 2 between x and t. Similar expressions, though cumbersome, may be derived for
Qh (t) parameters in other progressive processes. These are not crucial for Markov processes
as discussed so far in this chapter because the general product-integral representation is
available; however, similar arguments may be applied also to some semi-Markov processes
where intensities not only depend on time t but also on the time spent in a given state. This
will be discussed in Section 5.2.4.
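The agreement between the two path probabilities and the product-integral can be checked numerically. The Python sketch below assumes a four-state Markov model of the Figure 1.6 type (0: initial, 1: GvHD, 2: relapse, 3: dead) with constant, hypothetical intensities. It evaluates the two path integrals by the trapezoidal rule and compares their sum with the (0, 2) element of a fine-grid product-integral; all names and values are illustrative assumptions.

```python
import numpy as np

def trapezoid(y, x):
    """Trapezoidal approximation to the integral of y over the grid x."""
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2.0)

# Hypothetical constant intensities
a01, a02, a03, a12, a13, a23 = 0.5, 0.2, 0.1, 0.3, 0.2, 0.4
t = 2.0

def P00(s, u):
    return np.exp(-(a01 + a02 + a03) * (u - s))

def P11(s, u):
    return np.exp(-(a12 + a13) * (u - s))

def P22(s, u):
    return np.exp(-a23 * (u - s))

u = np.linspace(0.0, t, 401)

# Path (a): 0 -> 2 at time u, then stay in state 2 until t
q2_a = trapezoid(P00(0.0, u) * a02 * P22(u, t), u)

# Path (b): 0 -> 1 at time u, 1 -> 2 at time x in (u, t), then stay in state 2
def inner(u_val):
    x = np.linspace(u_val, t, 401)
    return trapezoid(P11(u_val, x) * a12 * P22(x, t), x)

q2_b = trapezoid(P00(0.0, u) * a01 * np.array([inner(v) for v in u]), u)

# Product-integral over (0, t] approximated by n equal factors I + A * t / n
A = np.array([
    [-(a01 + a02 + a03), a01, a02, a03],
    [0.0, -(a12 + a13), a12, a13],
    [0.0, 0.0, -a23, a23],
    [0.0, 0.0, 0.0, 0.0],
])
n = 2 ** 20
P = np.linalg.matrix_power(np.eye(4) + A * t / n, n)
print(q2_a + q2_b, P[0, 2])
```

The two printed numbers should agree up to discretization error, which illustrates why the explicit path formulas are not needed in the Markov case.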

5.2 Plug-in for non-Markov processes (*)

For Markov processes, the transition intensities satisfy the Markov property (5.1), and under
this assumption the product-integral maps transition intensities onto transition probabilities
and, thereby, state occupation probabilities. If the Markov property is not satisfied, then
the transition intensities at time t depend on the past Ht− in a more complex way than
through the current state and, possibly, time-fixed covariates, and the transition probability
matrix P(s,t) cannot be computed using the product-integral. An explanation why product-
integration of the Nelson-Aalen estimators does not estimate transition probabilities in non-
Markov models (Titman, 2015) is that, at time u, t > u > s, the Nelson-Aalen estimator uses
all subjects in a given state at that time – no matter the state occupied at s.

5.2.1 State occupation probabilities (*)

The product-integral can still be used for estimating state occupation probabilities as shown
by Datta and Satten (2001), see also Overgaard (2019). The basis for this is a concept related
to the intensity, namely the partial transition rate which is the right-hand side of Equation
(5.1), say,
\[
\alpha^*_{hj}(t) \approx P(V(t + dt) = j \mid V(t) = h, Z)/dt. \tag{5.7}
\]
Note that, in (5.7), conditioning is only on the current state and baseline covariates and,
under Markovianity, it equals the transition intensity. It was shown by Datta and Satten
(2001) that, provided that censoring is independent of V (t), the cumulative partial transition
rate $A^*_{hj}(t) = \int_0^t \alpha^*_{hj}(u)\,du$ can be consistently estimated by the Nelson-Aalen estimator
\[
\hat{A}^*_{hj}(t) = \int_0^t \frac{dN_{hj}(u)}{Y_h(u)}
\]

with Nh j ,Yh defined as previously. Extensions to the situation where censoring and V (t)
depend on (possibly, time-dependent) covariates were studied by Datta and Satten (2002)
and Gunnes et al. (2007).
The partial transition rates may be of interest in their own right and asymptotic results
follow from Glidden (2002). The partial rates are also important for marginal models for
recurrent events as we shall see in Section 5.5.4 where we will also argue why the Nelson-
Aalen estimator is consistent for A∗h j (t). However, their main interest lies in the fact that
they provide a step towards estimating state occupation probabilities. Assume for simplic-
ity that all subjects occupy the same state, 0 at time 0, i.e., Q0 (0) = 1. In that case, the top
row of the (k + 1) × (k + 1) product-integral matrix
\[
\prod_{(0,t]} \bigl(I + dA^*(u)\bigr)
\]
is the vector of state occupation probabilities Q(t) = (Q0 (t), Q1 (t), . . . , Qk (t)), suggesting
the plug-in estimator
\[
\hat{Q}(t) = (1, 0, \ldots, 0) \prod_{(0,t]} \bigl(I + d\hat{A}^*(u)\bigr), \tag{5.8}
\]

which is the top row of the Aalen-Johansen estimator. Asymptotic results for (5.8) were also
given by Glidden (2002), including both a complex variance estimator and a simulation-
based way of assessing the uncertainty based on an idea of Lin et al. (1993) – an idea
that we will return to in connection with goodness-of-fit examinations using cumulative
residuals in Section 5.7.
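The estimator (5.8) can be computed directly from individual state histories. The following Python sketch is one minimal way to do so; the data layout (a list of (entry time, state) pairs per subject plus an end-of-follow-up time) and the function name `state_occupation` are assumptions made for the illustration, and the four subjects are invented.

```python
import numpy as np

def state_occupation(paths, ends, n_states, t):
    """Top row of the product of I + dA*(u) over the event times in (0, t].

    paths[i] lists (entry time, state) pairs for subject i, starting with (0.0, 0);
    ends[i] is the end of follow-up for subject i (hypothetical data layout).
    """
    # All observed transitions as (time, from state, to state)
    transitions = [
        (t1, h, j)
        for path in paths
        for (_, h), (t1, j) in zip(path[:-1], path[1:])
    ]
    Q = np.zeros(n_states)
    Q[0] = 1.0                      # all subjects start in state 0
    for u in sorted({time for time, _, _ in transitions if time <= t}):
        # Y_h(u): number under observation and in state h just before u
        Y = np.zeros(n_states)
        for path, end in zip(paths, ends):
            if end >= u:
                Y[[h for (t0, h) in path if t0 < u][-1]] += 1
        # Nelson-Aalen jumps dN_hj(u) / Y_h(u)
        dA = np.zeros((n_states, n_states))
        for time, h, j in transitions:
            if time == u:
                dA[h, j] += 1.0 / Y[h]
        dA[np.diag_indices(n_states)] = -dA.sum(axis=1)
        Q = Q @ (np.eye(n_states) + dA)
    return Q

# Four invented subjects in an illness-death model, all followed until time 10
paths = [
    [(0.0, 0), (1.0, 1), (3.0, 2)],
    [(0.0, 0), (2.0, 2)],
    [(0.0, 0), (4.0, 1)],
    [(0.0, 0)],
]
ends = [10.0, 10.0, 10.0, 10.0]
print(state_occupation(paths, ends, n_states=3, t=3.5))
```

Without censoring before t, as in this toy data set, the result coincides with the observed proportions in each state at time t, which is a convenient sanity check.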
The estimator (5.8) works for any state in a multi-state model, but for a transient state in
a progressive model, an alternative is available. This estimator, originally proposed for the
non-Markov irreversible illness-death model (Figure 1.3) by Pepe (1991) and Pepe et al.
(1991), builds on the difference between Kaplan-Meier estimators and is, as such, not a
plug-in estimator based on intensities. We discuss it here for completeness and it works,
as follows, for the illness-death model. If T0 is the time spent in the initial state and T2
is the time of death, both random variables observed, possibly with right-censoring, then
the Kaplan-Meier estimator based on T0 estimates Q0 (t) while that based on T2 estimates
1−Q2 (t) = Q0 (t)+Q1 (t), so, their difference estimates the probability Q1 (t) of being in the
transient state 1 at time t. The resulting estimator is known as the Pepe estimator, and Pepe
(1991) also provided variance estimators. Alternatively, a non-parametric bootstrap may
be applied to assess the variability of the estimator. This idea generalizes to any transient
state in a progressive model. To estimate Qh (t) for such a state, one may use the difference
between the Kaplan-Meier estimator of staying in the set of states Sh ∪ {h} and that of
staying in Sh , where Sh denotes the set of states from which state h is reachable.
Based on an estimator for Qh (t), the expected length of stay in that state, $\varepsilon_h(\tau) = \int_0^\tau Q_h(t)\,dt$,
may be estimated by plug-in.
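For the illness-death model, the Pepe estimator described above only requires two Kaplan-Meier curves. The Python sketch below is a minimal illustration: `kaplan_meier` and `pepe_estimate` are hypothetical helper names, and the four observations are invented (status 1 means the event time was observed, 0 means right-censored).

```python
import numpy as np

def kaplan_meier(times, events, t):
    """Kaplan-Meier estimate of the probability of no event by time t."""
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=bool)
    surv = 1.0
    for u in np.unique(times[events & (times <= t)]):
        at_risk = np.sum(times >= u)
        failures = np.sum(events & (times == u))
        surv *= 1.0 - failures / at_risk
    return surv

def pepe_estimate(t0, d0, t2, d2, t):
    """Estimate of Q1(t): KM based on time of death minus KM based on time in state 0."""
    return kaplan_meier(t2, d2, t) - kaplan_meier(t0, d0, t)

# Invented data: T0 = time of leaving state 0, T2 = time of death
t0 = [1.0, 2.0, 4.0, 10.0]
d0 = [1, 1, 1, 0]
t2 = [3.0, 2.0, 10.0, 10.0]
d2 = [1, 1, 0, 0]
print(pepe_estimate(t0, d0, t2, d2, t=2.5))
```

The expected length of stay in state 1 could then be approximated by summing this step function over a grid of time points in [0, τ].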

5.2.2 Transition probabilities (*)

We mentioned in the introduction to Section 5.2 that an explanation why product-
integration of the Nelson-Aalen estimators does not estimate transition probabilities in
non-Markov models is that, at time u, t > u > s, the Nelson-Aalen estimator uses all
subjects in a given state at that time – no matter the state occupied at s (Titman, 2015).
Based on such an idea, Putter and Spitoni (2018) suggested to use the proposal of Datta
and Satten (2001) combined with sub-setting for estimation of transition probabilities
Ph j (s,t) = P(V (t) = j | V (s) = h). To estimate this quantity for a fixed value of s and a
fixed state h ∈ S , attention was restricted to those processes Vi (·) in state h at time s, i.e.,
processes for which Yhi (s) = 1 and counting processes and at-risk processes were defined
for this subset

\[
dN^{LM}_{j\ell}(t) = \sum_i dN_{j\ell i}(t)\,Y_{hi}(s), \qquad Y^{LM}_j(t) = \sum_i Y_{ji}(t)\,Y_{hi}(s), \qquad t \ge s.
\]

Here ‘LM’ stands for landmarking, a common name used for restricting attention to sub-
jects who are still at risk at a landmark time point, here time s (Anderson et al., 1983; van
Houwelingen, 2007). We will return to uses of the landmarking idea in Section 5.3 where
models with time-dependent covariates are studied. The Nelson-Aalen estimators for the
partial transition rates based on these sub-sets are
\[
\hat{A}^{*LM}_{j\ell}(t) = \int_s^t \frac{dN^{LM}_{j\ell}(u)}{Y^{LM}_j(u)}, \qquad t \ge s.
\]

These may be plugged-in to the product-integral to yield the landmark Aalen-Johansen
estimator
\[
\hat{P}^{LM}(s,t) = Q^{LM}(s) \prod_{(s,t]} \bigl(I + d\hat{A}^{*LM}(u)\bigr), \tag{5.9}
\]

where QLM (s) is the (k + 1) row vector with element h equal to 1 and other elements equal
to 0. For fixed s, the asymptotic properties of (5.9) follow from the results of Glidden
(2002).
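The landmark Aalen-Johansen estimator (5.9) differs from the ordinary product-integral estimator only through the sub-setting step. The Python sketch below makes that step explicit. The data layout (lists of (entry time, state) pairs plus an end-of-follow-up time), the function name `landmark_aalen_johansen`, and the four subjects are all hypothetical.

```python
import numpy as np

def landmark_aalen_johansen(paths, ends, n_states, s, t, h):
    """Row h of the landmark Aalen-Johansen estimate of P(s, t)."""
    def state_before(path, u):
        return [a for (t0, a) in path if t0 < u][-1]

    def state_at(path, u):
        return [a for (t0, a) in path if t0 <= u][-1]

    # Landmark sub-set: subjects under observation and in state h at time s
    subset = [(p, e) for p, e in zip(paths, ends)
              if e > s and state_at(p, s) == h]
    # Transitions in (s, t] made by subjects in the sub-set
    transitions = [(t1, a, b)
                   for p, _ in subset
                   for (_, a), (t1, b) in zip(p[:-1], p[1:])
                   if s < t1 <= t]
    row = np.zeros(n_states)
    row[h] = 1.0                    # the vector Q^LM(s)
    for u in sorted({time for time, _, _ in transitions}):
        Y = np.zeros(n_states)      # at-risk counts within the sub-set
        for p, e in subset:
            if e >= u:
                Y[state_before(p, u)] += 1
        dA = np.zeros((n_states, n_states))
        for time, a, b in transitions:
            if time == u:
                dA[a, b] += 1.0 / Y[a]
        dA[np.diag_indices(n_states)] = -dA.sum(axis=1)
        row = row @ (np.eye(n_states) + dA)
    return row

# Four invented subjects in an illness-death model, all followed until time 10
paths = [
    [(0.0, 0), (1.0, 1), (3.0, 2)],
    [(0.0, 0), (2.0, 2)],
    [(0.0, 0), (4.0, 1)],
    [(0.0, 0)],
]
ends = [10.0, 10.0, 10.0, 10.0]
print(landmark_aalen_johansen(paths, ends, n_states=3, s=1.5, t=4.5, h=0))
```

Only the three subjects still in state 0 at the landmark s = 1.5 contribute, so the subject who had already moved to state 1 does not enter any risk set.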
The work by Titman (2015) on transition probabilities for non-Markov models should also
be mentioned here, even though the methods are not based on plug-in of intensities. Follow-
ing Uña-Alvarez and Meira-Machado (2015), Titman suggested a similar extension (i.e.,
based on sub-setting) of the Pepe estimator for a transient state j in a progressive model.
To estimate Ph j (s,t), one looks at the sub-set of processes Vi (·) in state h at time s and,
for fixed s, this transition probability is estimated as the difference between Kaplan-Meier
estimators of staying in sets of states Sh j and Sh j ∪ { j}, respectively, at time t where Sh j
is the set of states reachable from h and from which j can be reached. A variance estimator
was also presented.
For any state, j (absorbing or transient) in any multi-state model (progressive or not), Tit-
man (2015) also suggested another estimator for Ph j (s,t) based on sub-setting to processes
in state h at time s, as follows. Define Rh j to be the set of states reachable from h but
from which j cannot be reached. For the considered sub-set of processes, the following
competing risks process for u ≥ s is defined when j is an absorbing state
\[
V^*_s(u) = \begin{cases}
0 & \text{if } V(u) \notin R_{hj} \cup \{j\}, \\
1 & \text{if } V(u) \in R_{hj}, \\
2 & \text{if } V(u) = j.
\end{cases}
\]


For the considered sub-set, this process is linked to V (t) by the relation Ph j (s,t) =
P(Vs∗ (t) = 2) and the desired transition probability can be estimated using the Aalen-
Johansen estimator for the cause 2 cumulative incidence for Vs∗ (t). More specifically, if
$N^*_{s\ell}(u)$ counts cause ℓ = 1, 2 events and $Y^*_s(u)$ is the number still at risk for cause 1 or 2
events at time u− then the estimator is
\[
\hat{P}^T_{hj}(s,t) = \int_s^t \hat{P}(V^*_s(u) = 0 \mid V^*_s(s) = 0)\,\frac{dN^*_{s2}(u)}{Y^*_s(u)},
\]
where $\hat{P}(V^*_s(u) = 0 \mid V^*_s(s) = 0)$ is estimated using the Kaplan-Meier estimator
\[
\prod_{(s,u]} \Bigl(1 - \frac{dN^*_{s1}(v) + dN^*_{s2}(v)}{Y^*_s(v)}\Bigr).
\]

If j is a transient state, then the following survival process for u ≥ s is defined for the
considered sub-set of processes
\[
V^*_s(u) = \begin{cases}
0 & \text{if } V(u) \notin R_{hj}, \\
1 & \text{if } V(u) \in R_{hj}.
\end{cases}
\]

For this sub-set, the process Vs∗ (t) is related to V (t) via Ph j (s,t) = P(Vs∗ (t) = 0)P(V (t) =
j | Vs∗ (t) = 0), where the first factor can be estimated by the Kaplan-Meier estimator for
Vs∗ (t). Titman (2015) proposed to estimate the second factor by the relative frequency of
processes in state j at time t among those for which Vs∗ (t) = 0, i.e., by
\[
\frac{\sum_i I(V_i(t) = j,\ V_i(s) = h,\ V_i(t) \notin R_{hj})}{\sum_i I(V_i(s) = h,\ V_i(t) \notin R_{hj})}.
\]
Titman’s construction extended that of Allignol et al. (2014) for the illness-death model.
For this model, an alternative estimator was previously proposed by Meira-Machado et al.
(2006); however, the latter proposal has the drawback that to obtain consistency, the support
for the survival time distribution must be contained within that of the censoring distribu-
tion, and we will not discuss these estimators further. Both Titman (2015) and Putter and
Spitoni (2018) presented simulation studies showing that, for Markov models the Aalen-
Johansen estimator outperforms the more general estimators discussed in this section. For
non-Markov models, the Aalen-Johansen estimator was biased, whereas both the landmark
Aalen-Johansen estimator and the general estimator proposed by Titman had a satisfactory
performance. Malzahn et al. (2021) extended the landmark Aalen-Johansen technique to
hybrid situations where, only for some transitions, the Markov property fails, whereas, for
others, the Markov assumption is compatible with the data.
Non-parametric tests for the Markov assumption have been studied for the irreversible
illness-death model by Rodriguez-Girondo and de Uña-Alvarez (2012) building on esti-
mates of Kendall’s τ. For general multi-state models, Titman and Putter (2022) derived
logrank-type tests (Section 3.2.2) for Markovianity comparing landmark Nelson-Aalen
estimators $\hat{A}^{*LM}_{j\ell}(t)$ to standard Nelson-Aalen estimators $\hat{A}_{j\ell}(t)$.

5.2.3 Recurrent events (*)

For the situation with recurrent events, Figure 1.5, marginal parameters such as Qh (t) and
εh (t) may also be relevant as discussed in Section 5.1.4. In this situation, the expected
number of events
µ(t) = E(N(t)),
where N(t) is the recurrent events counting process, is another marginal parameter to target,
and it has an attractive interpretation. For this parameter, a plug-in estimator also exists.
This is because
\[
\mu(t) = \sum_{h=1}^{\infty} h \cdot P(N(t) = h) = \sum_{h=1}^{\infty} h \cdot Q_h(t),
\]

and since the Qh (t) may be estimated by (5.8), we can estimate µ(t) by plug-in. A difficulty
is, though, that one has to decide on the number of terms to include in the sum – a choice
that need not be clear-cut since, for a large number, h of events, there may not be sufficient
data to properly estimate Qh (t).
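The effect of truncating the sum can be illustrated with a toy calculation. In the Python sketch below, the Qh (t) at a fixed t are replaced by Poisson probabilities with mean 2, which is purely an assumption made to have known values; in an analysis they would be the estimates from (5.8). The helper names are hypothetical.

```python
import math

def poisson_pmf(h, mean):
    """Stand-in for Q_h(t) at a fixed t (illustrative assumption only)."""
    return math.exp(-mean) * mean ** h / math.factorial(h)

def truncated_mean(Q):
    """Sum of h * Q_h over the states h = 0, ..., K included in the model."""
    return sum(h * q for h, q in enumerate(Q))

true_mean = 2.0
for K in (2, 5, 10):
    Q = [poisson_pmf(h, true_mean) for h in range(K + 1)]
    print(K, truncated_mean(Q))
```

The truncated sum can never exceed the full sum, so choosing K too small leads to underestimation of µ(t), whereas a large K requires estimating Qh (t) for states that few subjects reach.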
We shall later (Section 5.5.4) see how direct models for µ(t) may be set up using general-
ized estimating equations. This also leads to an alternative plug-in estimator suggested by
Cook et al. (2009) that we will discuss there.

5.2.4 Semi-Markov processes (*)

A special class of non-Markov models is semi-Markov models where, at time t, transition
intensities αh j (·) depend not only on t but also on the time spent in state h, i.e., the duration
d = t − Th where Th is the (last) time of entry into state h before time t. We have studied
such models previously in connection with gap time models for recurrent events (Section
2.3) and when discussing adapted time-dependent covariates in Section 3.7, e.g., for the
PROVA trial (Example 1.1.4). For these data, we presented Cox models with either t or
d as the baseline time-variable and with the effect of the other time-variable expressed
as explicit functions using time-dependent covariates. Such models may form the basis for
plug-in estimation of transition and state occupation probabilities as discussed by Andersen
et al. (2022). We will illustrate this via estimation of the probability Q2 (t) of being alive in
the relapse state in the four-state model for the bone marrow transplantation data (Example
1.1.7, Figure 1.6). Assuming the multi-state process to be Markovian, we showed in Section
5.1.5 that Q2 (t) is the sum of
\[
Q_2^{(a)}(t) = \int_0^t P_{00}(0,u)\,\alpha_{02}(u)\,P_{22}(u,t)\,du
\]
and
\[
Q_2^{(b)}(t) = \int_0^t P_{00}(0,u)\,\alpha_{01}(u) \int_u^t P_{11}(u,x)\,\alpha_{12}(x)\,P_{22}(x,t)\,dx\,du.
\]
Suppose now that the transition intensities out of states 1 and 2 depend on both t and d
and that we have modeled α12 (t, d), α13 (t, d) and α23 (t, d). In such a situation, for t > s,
the probability of staying in state 1 between times s and t given entry into the state at time
T1 ≤ s is given by

\[
\begin{aligned}
P_{11}(s,t; T_1) &= P(V(t) = 1 \mid V(s) = 1, V(T_1-) = 0, V(T_1) = 1) \\
&= \exp\Bigl(-\int_s^t \bigl(\alpha_{12}(u, u - T_1) + \alpha_{13}(u, u - T_1)\bigr)\,du\Bigr),
\end{aligned}
\]

because the waiting time distribution in state 1 given entry at T1 has hazard function
α12 (u, u − T1 ) + α13 (u, u − T1 ) at time u. Similarly, the probability of staying in state 2
between times s and t given entry into the state at time T2 ≤ s is given by
\[
\begin{aligned}
P_{22}(s,t; T_2) &= P(V(t) = 2 \mid V(s) = 2, V(T_2-) = 0 \text{ or } 1, V(T_2) = 2) \\
&= \exp\Bigl(-\int_s^t \alpha_{23}(u, u - T_2)\,du\Bigr).
\end{aligned}
\]

The probability of being in state 2 at time t is now the sum of

\[
Q_2^{(a*)}(t) = \int_0^t P_{00}(0,u)\,\alpha_{02}(u)\,P_{22}(u,t; u)\,du
\]
and
\[
Q_2^{(b*)}(t) = \int_0^t P_{00}(0,u)\,\alpha_{01}(u) \int_u^t P_{11}(u,x; u)\,\alpha_{12}(x, x - u)\,P_{22}(x,t; x)\,dx\,du.
\]
Note that P22 in this expression could also depend on the time u of 0 → 1 transition (though,
in that case the process would not be termed semi-Markov). This idea generalizes to other
progressive semi-Markov processes and to multi-state processes where the dependence on
the past is modeled using adapted time-dependent covariates; however, both the resulting
expressions and the associated variance calculations tend to get rather complex, as demon-
strated by Shu et al. (2007) for the irreversible illness-death model.
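The building block of these expressions is the probability of staying in a state when the intensities depend on both time and duration. The Python sketch below evaluates P11 (s,t; T1 ) by numerical integration for two hypothetical intensities out of state 1; the function names and the functional forms of α12 and α13 are assumptions chosen only to make the example checkable by hand.

```python
import numpy as np

def staying_probability(s, t, entry, hazards, n_grid=2001):
    """exp(-int_s^t sum_k alpha_k(u, u - entry) du), by the trapezoidal rule."""
    u = np.linspace(s, t, n_grid)
    total = sum(alpha(u, u - entry) for alpha in hazards)
    integral = np.sum((total[1:] + total[:-1]) * np.diff(u)) / 2.0
    return float(np.exp(-integral))

# Hypothetical intensities out of state 1 as functions of time t and duration d
def alpha12(t, d):
    return 0.2 * d                  # increases with the time spent in state 1

def alpha13(t, d):
    return np.full_like(t, 0.1)     # depends on neither t nor d

# Probability of remaining in state 1 from s = 2 to t = 3 after entry at T1 = 1
p11 = staying_probability(s=2.0, t=3.0, entry=1.0, hazards=[alpha12, alpha13])
print(p11)
```

For this choice the integrated intensity is 0.1 (2² − 1²) + 0.1 = 0.4, so the result can be compared with exp(−0.4); the same function would serve for P22 (s,t; T2 ) with α23 as the only intensity.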
Markov and non-Markov processes

For Markov processes, the intensities at time t only depend on the past history via
the state occupied at t and, possibly, via time-fixed covariates. They have attrac-
tive mathematical properties, most importantly that transition probabilities may be
obtained using plug-in via the product-integral.
Both the two-state model and the competing risks model are born Markov; however,
for more complicated multi-state models the Markov assumption is restrictive and
may not be fulfilled in practical examples. Analysis of non-Markov processes is less
straightforward, though state occupation probabilities (and expected length of stay
in a state) may be obtained using product-integration.

5.3 Landmarking
Plug-in works for hazard models with time-fixed covariates and for some models with
adapted time-dependent covariates as exemplified in Sections 3.7, 5.1, and 5.2.4. For haz-
ard models with non-adapted time-dependent covariates Z(t), it is typically not possible to
express parameters such as transition or state occupation probabilities using only the tran-
sition intensities. This is because the future course of the process V (t) will also depend on
the way in which the time-dependent covariates develop. In such a situation, in order to
estimate these probabilities, a joint model for V (t) and Z(t) is needed and we will briefly
discuss such joint models in Section 7.4. One way of approximating these probabilities
using plug-in estimators is based on landmarking. In Sections 5.3.1-5.3.3, we will discuss
this concept in the framework of the two-state model for survival data (Figure 1.1) with an
illustration using the example on bone marrow transplantation in acute leukemia (Example
1.1.7). In Section 5.3.4, we will briefly mention the extensions needed for more complex
multi-state models, and Section 5.3.5 provides some of the mathematical details.

5.3.1 Conditional survival probabilities

We look at the two-state model for survival data and have in mind a Cox regression model
for the hazard function α(t | Z(·)) including, possibly non-adapted, time-dependent covari-
ates Z(t). We aim at estimating conditional survival probabilities, such as,
P00 (s,t) = P(T > t | T > s, (Z(u), u ≤ s)), t > s,
i.e., given survival till a landmark time s (where T is the survival time) and given the
course of Z(·) up to the landmark time. The model that will be used for approximating such
conditional survival probabilities is the following Cox model
\[
\alpha_s(t \mid (Z(u), u \le s)) = \alpha_{0s}(t) \exp(\mathrm{LP}_s), \quad t \ge s, \tag{5.10}
\]
with a landmark-specific baseline hazard α0s (t) and with linear predictor
\[
\mathrm{LP}_s = \beta_{1s} Z_1(s) + \cdots + \beta_{ps} Z_p(s)
\]
including the current covariate values at the landmark time s and with landmark-specific
regression coefficients. Recall from Section 3.7.2 that the value of the time-dependent co-
variate at time s may be the lagged value, Z j (s − ∆) or similar. It is seen that the covariates
in the model (5.10) are kept fixed for t > s at their value at time s and the model is fit-
ted (using delayed entry) for all subjects still at risk at time s, i.e., subjects i, for whom
Yi (s) = Y0i (s) = 1. From this model, the desired conditional survival probabilities can be
estimated using plug-in as explained in Sections 4.1.1 and 5.1.1. Typically, only short-term
predictions will be aimed at since the covariate values at time s are used when predicting
at that time, and the further development of the time-dependent covariates for t > s is not
accounted for. As a consequence, when fitting the model (5.10), one often censors everyone
at some horizon time thor (s) > s and, thereby, restricts attention to predictions for a period
of length at most thor (s) − s.
Following this basic idea, van Houwelingen (2007) and van Houwelingen and Putter (2012,
ch. 7) suggested to study a series of landmark models at times 0 ≤ s1 < s2 < · · · < sL , each
fitted, as just described, to subjects still at risk at the various landmark times. The model at a
given landmark s j has baseline hazard α0s j (t) and linear predictor LPs j = β1s j Z1 (s j ) + · · · +
β ps j Z p (s j ). If the horizon time when analyzing data at landmark s j is taken to be no larger
than the subsequent s j+1 , then the analyses at different landmarks will be independent in
the sense that a given failure time will appear in at most one analysis. However, this is
no requirement for the method, though it should be kept in mind that if intervals from
landmark to horizon do overlap for two successive landmarks, then, for each time point t
belonging to both intervals, the two landmark models will provide alternative descriptions
for the hazard at t. Therefore, models based on several landmarks should be regarded as
merely descriptions to be used for predictions and not as proper probability models for the
data generation process.

5.3.2 Landmark super models

The separate landmark models introduced in the previous section will typically contain a
large number of parameters (p · L regression coefficients and L baseline hazards), and it is
of interest to reduce this number. Furthermore, if successive landmarks are close, then one
would not expect the regression coefficients to substantially change from one landmark to
the next. Rather, some smoothness across landmarks is expected for the parameters. This
leads to the development of landmark super models, as follows. First, regression
coefficients at different landmarks are connected by letting
$\beta_{ks} = \beta_k + \sum_{\ell=1}^{m_k} \gamma_{k\ell} f_{k\ell}(s)$, k = 1, . . . , p,
for $m_k$ suitably chosen functions $f_{k\ell}(s)$ that may vary among covariates (k). For ease of
notation we will drop this generality and let each βks , k = 1, . . . , p, be

βks = βk + γk1 f1 (s) + · · · + γkm fm (s), (5.11)

with the same smoothing functions for all covariates. If, in Equation (5.11), we let f1 (s1 ) =
· · · = fm (s1 ) = 0, then βk will be the effect of covariate k at the first landmark, s1 . A typical
choice could be m = 2 and

\[
f_1(s) = (s - s_1)/(s_L - s_1), \qquad f_2(s) = \bigl((s - s_1)/(s_L - s_1)\bigr)^2.
\]

Table 5.1 Bone marrow transplantation in acute leukemia: Distribution of the time-dependent
covariates ANC500 and GvHD at entry and at five landmarks (ANC: Absolute neutrophil count,
GvHD: Graft versus host disease).

                          Landmark s j (months)
                      0      0.5     1.0     1.5     2.0     2.5
At risk Y (s j )    2099    1988    1949    1905    1876    1829
ANC500(s j ) = 1       0     906    1912    1899    1874    1828
GvHD(s j ) = 1         0     180     391     481     499     495

To fit these models, a data duplication trick, following the lines of Section 3.8 can be
applied. Thus, L copies of the data set are needed, where copy number j includes all subjects
still at risk at landmark s j , i.e., subjects i with Yi (s j ) = 1. The Cox model is stratified on
j (to yield separate baseline hazards α0s j (t)), and the model in stratum j should include
the interactions Z · fℓ (s j ), ℓ = 1, . . . , m. For inference, robust standard deviations should be
used.
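The data duplication step can be sketched in a few lines of Python. The example below builds a stacked data set for one binary time-dependent covariate that switches from 0 to 1 at most once (as GvHD does), with delayed entry at each landmark and censoring at the horizon. The record layout, the function name `stacked_landmark_data` and the three subjects are hypothetical; the sketch takes "at risk at s" to mean an exit time strictly after s, and the quadratic choice of f1 and f2 follows the typical choice mentioned above.

```python
def stacked_landmark_data(subjects, landmarks, width):
    """One row per subject and landmark at which the subject is still at risk."""
    s1, sL = landmarks[0], landmarks[-1]
    rows = []
    for j, s in enumerate(landmarks, start=1):
        f1 = (s - s1) / (sL - s1)
        f2 = f1 ** 2
        horizon = s + width
        for sub in subjects:
            if sub["time"] <= s:
                continue                       # no longer at risk at landmark s
            # Covariate value frozen at the landmark time
            z = 1 if sub["switch"] is not None and sub["switch"] <= s else 0
            rows.append({
                "id": sub["id"],
                "stratum": j,                  # landmark number, for stratification
                "entry": s,                    # delayed entry at the landmark
                "exit": min(sub["time"], horizon),
                "status": sub["status"] if sub["time"] <= horizon else 0,
                "z": z,
                "z_f1": z * f1,                # interaction Z * f1(s_j)
                "z_f2": z * f2,                # interaction Z * f2(s_j)
            })
    return rows

# Invented subjects: exit time, event indicator, time of covariate switch (or None)
subjects = [
    {"id": 1, "time": 0.8, "status": 1, "switch": 0.3},
    {"id": 2, "time": 8.2, "status": 1, "switch": 1.2},
    {"id": 3, "time": 4.0, "status": 0, "switch": None},
]
rows = stacked_landmark_data(subjects, landmarks=[0.5, 1.0, 1.5, 2.0, 2.5], width=6.0)
for row in rows:
    print(row)
```

The stacked rows would then be passed to a Cox regression routine that supports delayed entry, stratification on `stratum`, and robust standard deviations clustered on `id`.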
Having separate baseline hazards for each landmark will provide different models for the
hazard at some time points if the time horizons thor (s j ) are chosen in such a way that
prediction intervals overlap. Therefore, the baseline hazards could also be taken to vary
from one landmark to the next in a smooth way by letting

\[
\alpha_{0s}(t) = \alpha_0(t) \exp\bigl(\eta_1 g_1(s) + \cdots + \eta_{m_0} g_{m_0}(s)\bigr), \tag{5.12}
\]
with all gℓ (s1 ) = 0, such that α0 (t) refers to the first landmark, s1 . Often, one would choose
the same smoothing functions for regression coefficients and baseline hazards, i.e., m = m0
and gℓ = fℓ . This model can also be fitted using the duplicated data set; however,
stratification on j should no longer be imposed because only a single baseline hazard is needed. The
model with baseline hazard given by Equation (5.12) provides a description for all values
of s and, thereby, using this model conditional predictions may be obtained for all s, not
only for the landmarks chosen when fitting the model.
Note the similarity with the tests for proportional hazards using time-dependent covariates
as discussed in Section 3.7.7. It is seen that the idea of studying departures from proportion-
ality using suitably defined functions f j (t) may also be used to obtain flexible models with
non-proportional hazards – both for models with time-fixed and time-dependent covariates.

5.3.3 Bone marrow transplantation in acute leukemia

In Section 3.7.8, we studied models for the data on bone marrow transplantation (BMT) in
acute leukemia (Example 1.1.7) including the two time-dependent covariates: Occurrence
of graft versus host disease (GvHD) and reaching an Absolute Neutrophil Count above
500 cells per µL (ANC500). We will now estimate conditional probabilities of relapse-
free survival, i.e., time to relapse or death in remission whatever comes first, based on past
information on the two time-dependent covariates using landmarking. Both covariates take
the value 0 at time t = 0 of BMT and, typically, a change of value from 0 to 1 takes place (if
at all) relatively shortly after BMT. Table 5.1 shows the distribution of the two covariates
at chosen landmarks 0.5, 1.0, 1.5, 2.0, and 2.5 months.
We first fit models like (5.10) at these landmarks using the horizon thor = s j + 6 months, see
Table 5.2. It is seen that both covariates have an effect on the hazard function: for GvHD,
the effect is rather constant over time, and presence of ANC500 seems to be increasingly
protective over time. Based on this model we predict the 6-month conditional relapse-free
survival probabilities given still at risk at the respective landmarks, i.e., using thor (s j ) =
s j + 6 months. Figure 5.1 shows these predictions for subjects with either ANC500(s j ) =
GvHD(s j ) = 0 or ANC500(s j ) = GvHD(s j ) = 1. It is seen that the effect of, in particular
ANC500, over time has a quite marked influence on the curves.

Table 5.2 Bone marrow transplantation in acute leukemia: Estimated effects (and robust SD) of
time-dependent covariates ANC500 and GvHD at five landmarks using a 6-month horizon from
landmarks (ANC: Absolute neutrophil count, GvHD: Graft versus host disease).

    Landmark               ANC500               GvHD
j   s j (months)     Estimate     SD      Estimate     SD
1       0.5           -0.335    0.107       0.703    0.149
2       1.0           -0.609    0.320       0.679    0.115
3       1.5           -1.686    0.610       0.863    0.113
4       2.0           -2.964    0.404       0.802    0.115
5       2.5           -3.354    0.178       0.831    0.121

Table 5.3 Bone marrow transplantation in acute leukemia: Estimated (Est) smooth effects of
time-dependent covariates ANC500 and GvHD (with robust SD) based on landmark super models using
a 6-month horizon from landmarks (ANC: Absolute neutrophil count, GvHD: Graft versus host
disease).

                                    Stratified          Smoothed
Covariate           Parameter      Est      SD        Est      SD
ANC500(s)           β1           -0.322   0.106     -0.298   0.094
ANC500(s) f1 (s)    γ11          -1.191   1.670     -1.393   1.630
ANC500(s) f2 (s)    γ12          -2.257   1.791     -2.038   1.765
GvHD(s)             β2            0.663   0.143      0.674   0.140
GvHD(s) f1 (s)      γ21           0.391   0.430      0.333   0.417
GvHD(s) f2 (s)      γ22          -0.226   0.342     -0.175   0.333
g1 (s)              η1                               1.414   1.606
g2 (s)              η2                               1.940   1.751
Next, we fit a landmark super model with coefficients given by (5.11) choosing f1 (s) =
(s − s1 )/(sL − s1 ) = (s − 0.5)/2 and f2 (s) = f1 (s)^2 . Estimated coefficients are shown in
Table 5.3 (stratified), and Figure 5.2 shows the associated conditional relapse-free survival
curves. These are seen to be roughly consistent with those based on the simple landmark
model. Table 5.3 (smoothed) also shows coefficients in a super model with a single base-
line hazard α0 (t) (i.e., at s1 = 0.5 months) and later landmark-specific baseline hazards
(i.e., at s j > 0.5 months) specified by (5.12) and, thereby, varying smoothly among
landmarks. The smoothing functions were chosen as gℓ (s) = fℓ (s), ℓ = 1, 2. Figure 5.3 shows
the corresponding estimated conditional relapse-free survival probabilities – quite similar
to those in the previous figures.
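To see what the smooth specification (5.11) implies at each landmark, the stratified estimates from Table 5.3 can be inserted into the formula with f1 (s) = (s − 0.5)/2 and f2 (s) = f1 (s)^2. The Python sketch below does this; the dictionary layout and the function name `beta_at` are hypothetical, while the numbers are those reported in Table 5.3.

```python
# Stratified estimates from Table 5.3: (beta_k, gamma_k1, gamma_k2)
coef = {
    "ANC500": (-0.322, -1.191, -2.257),
    "GvHD": (0.663, 0.391, -0.226),
}
landmarks = [0.5, 1.0, 1.5, 2.0, 2.5]
s1, sL = landmarks[0], landmarks[-1]

def beta_at(covariate, s):
    """Landmark-specific coefficient beta_ks implied by Equation (5.11)."""
    beta, gamma1, gamma2 = coef[covariate]
    f1 = (s - s1) / (sL - s1)
    return beta + gamma1 * f1 + gamma2 * f1 ** 2

for s in landmarks:
    print(s, round(beta_at("ANC500", s), 3), round(beta_at("GvHD", s), 3))
```

At the last landmark both smoothing functions equal one, so the implied coefficients are the row sums, about −3.770 for ANC500 and 0.828 for GvHD, which may be compared with the separate landmark estimates −3.354 and 0.831 in Table 5.2.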
[Figure 5.1: two panels, (a) ANC500(s j ) = GvHD(s j ) = 0 and (b) ANC500(s j ) = GvHD(s j ) = 1. Horizontal axis: time since bone marrow transplantation (months), 0 to 9; vertical axis: conditional survival probability, 0.0 to 1.0; one curve per landmark 0.5, 1.0, 1.5, 2.0, 2.5.]

Figure 5.1 Bone marrow transplantation in acute leukemia: Estimated 0- to 6-month conditional
relapse-free survival probabilities given survival till landmarks s j = 0.5, 1.0, 1.5, 2.0, and 2.5
months.

5.3.4 Multi-state landmark models

Landmarking for general multi-state models, i.e., not necessarily the two-state model, fol-
lows closely the techniques outlined in Sections 5.3.1-5.3.2. This is because, using the tech-
niques discussed in these sections, landmark models may be set up for one transition inten-
sity at a time and, once the landmark models are established, transition probabilities may be
predicted by plug-in. When setting up the landmark models, all sorts of flexibility is
available in the sense that different covariates and different smoothing functions ( fℓ (s), gℓ (s))
may be chosen for the different transitions, though the same landmarks are typically used
for all transition hazards.

[Figure 5.2: two panels, (a) ANC500(s j ) = GvHD(s j ) = 0 and (b) ANC500(s j ) = GvHD(s j ) = 1. Horizontal axis: time since bone marrow transplantation (months), 0 to 9; vertical axis: conditional survival probability, 0.0 to 1.0; one curve per landmark 0.5, 1.0, 1.5, 2.0, 2.5.]

Figure 5.2 Bone marrow transplantation in acute leukemia: Estimated 0- to 6-month conditional
relapse-free survival probabilities given survival till landmarks s j = 0.5, 1.0, 1.5, 2.0, and 2.5
months. Estimates are based on a landmark super model with coefficients varying smoothly among
landmarks and landmark-specific baseline hazards.

5.3.5 Estimating equations (*)

In this section, we will explain how the estimates in models (5.10)-(5.12) are obtained.
Model (5.10) for landmarks $s_1, \ldots, s_L$ consists of $L$ standard Cox regression models, one
for each of the data sub-sets that together constitute the stacked data set. The estimating
equations for $\beta_s$ are obtained from the Cox log-partial likelihood
\[
\sum_{i=1}^{n} Y_i(s) \int_{s}^{t_{\mathrm{hor}}} \log\left(\frac{\exp(LP_{s,i})}{\sum_k Y_k(t)\exp(LP_{s,k})}\right) dN_i(t), \quad s = s_1, \ldots, s_L,
\]
by taking derivatives with respect to the parameters in the linear predictor
$LP_s = \beta_{1s} Z_1(s) + \cdots + \beta_{ps} Z_p(s)$ and equating to zero. The resulting Breslow
estimator for the cumulative baseline hazard $A_{0s}(t)$, $t < t_{\mathrm{hor}}$, is
\[
\widehat{A}_{0s}(t) = \int_{s}^{t} \frac{\sum_i Y_i(s)\,dN_i(u)}{\sum_i Y_i(u)\exp(\widehat{LP}_{s,i})}.
\]

[Figure 5.3: two panels, (a) ANC500($s_j$) = GvHD($s_j$) = 0 and (b) ANC500($s_j$) = GvHD($s_j$) = 1. Vertical axis: conditional survival probability (0.0 to 1.0); horizontal axis: time since bone marrow transplantation (months, 0 to 9); one curve per landmark 0.5, 1.0, 1.5, 2.0, 2.5.]

Figure 5.3 Bone marrow transplantation study: Estimated 0- to 6-month conditional relapse-free
survival probabilities given survival till landmarks $s_j$ = 0.5, 1.0, 1.5, 2.0, and 2.5 months. Estimates
are based on a landmark super model with coefficients and baseline hazards varying smoothly
among landmarks.

This model contains $L$ sets of parameters, each consisting of $p$ regression parameters and a
baseline hazard, estimated separately for each landmark $s_j$, $j = 1, \ldots, L$. If the horizon time,
$t_{\mathrm{hor}}(s_j)$, used when analyzing data at landmark $s_j$ is no larger than the subsequent $s_{j+1}$, then
the analyses at different landmarks will be independent and inference for parameters can in
principle be performed using model-based variances; however, typically robust variances
are used for all landmarking models.
The model (5.11) is also a stratified Cox model where all strata contribute to the estimating
equations for regression parameters. The estimating equations are obtained from the
log(pseudo-likelihood)
\[
\sum_{j=1}^{L} \sum_{i=1}^{n} Y_i(s_j) \int_{s_j}^{t_{\mathrm{hor}}(s_j)} \log\left(\frac{\exp(LP_{s_j,i})}{\sum_k Y_k(t)\exp(LP_{s_j,k})}\right) dN_i(t) \tag{5.13}
\]
where the linear predictor is
\[
LP_{s_j} = \sum_{k=1}^{p} \beta_{k s_j} Z_k(s_j) = \sum_{k=1}^{p} \Big(\beta_k + \sum_{\ell=1}^{m} \gamma_{k\ell} f_\ell(s_j)\Big) Z_k(s_j),
\]
see Exercise 5.5. The Breslow estimator is
\[
\widehat{A}_{0s_j}(t) = \int_{s_j}^{t} \frac{\sum_i Y_i(s_j)\,dN_i(u)}{\sum_i Y_i(u)\exp(\widehat{LP}_{s_j,i})}.
\]

Also in model (5.12), all strata contribute to the estimation of the coefficients in the linear
predictor
\[
LP_{s_j} = \sum_{\ell=1}^{m_0} \eta_\ell g_\ell(s_j) + \sum_{k=1}^{p} \beta_{k s_j} Z_k(s_j)
= \sum_{\ell=1}^{m_0} \eta_\ell g_\ell(s_j) + \sum_{k=1}^{p} \Big(\beta_k + \sum_{\ell=1}^{m} \gamma_{k\ell} f_\ell(s_j)\Big) Z_k(s_j),
\]
and the pseudo-log-likelihood has the same form as (5.13). Model (5.12) has only one
cumulative baseline hazard which may be estimated by
\[
\widehat{A}_0(t) = \int_{0}^{t} \frac{\sum_{j=1}^{L} \sum_i Y_i(s_j)\,dN_i(u)}{\sum_{j=1}^{L} \sum_i Y_i(u)\exp(\widehat{LP}_{s_j,i})}.
\]

Here, event times belonging to several prediction intervals give several contributions to the
estimator. In all cases, inference for regression parameters (including γs and ηs) is based
on robust variances.
5.4 Micro-simulation
In Section 4.1 and in previous sections of this chapter, we have utilized mathematical re-
lationships between transition intensities and marginal parameters to obtain estimates for
the latter via plug-in. This worked well for the simpler models like those depicted in
Figures 1.1-1.3 and, more generally, for Markov models. However, when the multi-state
models become more involved (e.g., Section 5.2), this approach becomes more cumber-
some. In this section, we will briefly discuss a brute force approach to estimating marginal
parameters based on transition intensities, namely micro-simulation (or discrete event sim-
ulation) (e.g., Mitton, 2000; Rutter et al., 2011).

5.4.1 Simulating multi-state processes

The idea is as follows: Based on (estimated) transition intensities, generate many paths for
the multi-state process and estimate the marginal parameter of interest by averaging over
the paths generated. This is doable because, at any time $t$ when the process occupies state
$h = V(t)$, the transition intensities govern what happens in the next little time interval from
$t$ to $t + dt$. Thus, in this time interval, the process moves to state $j \neq h$ with conditional
probability $\alpha_{hj}(t)dt$ given the past (Equation (1.1)) and with conditional probability
$1 - \sum_{j \neq h} \alpha_{hj}(t)dt$, the process stays in state $h$. Note that some of these intensities will be zero
if a state $j$ cannot be reached directly from state $h$. In this time interval, it is thus possible
to simulate data $(dN_{hj}(t), j \neq h)$ from the process by a multinomial experiment
\[
\Big(dN_{hj}(t),\ j \neq h,\ 1 - \sum_{j \neq h} dN_{hj}(t)\Big) \sim \mathrm{mult}\Big(1;\ \alpha_{hj}(t)dt,\ j \neq h,\ 1 - \sum_{j \neq h} \alpha_{hj}(t)dt\Big).
\]

The relevant multinomial distribution has index parameter 1 because at most one of the
counting processes Nh j (·) jumps at t, and the probability parameters are given by the tran-
sition intensities (see also Section 3.1). The intensities may be given conditionally on time-
fixed covariates Z and also conditionally on adapted time-dependent covariates Z(t) (Sec-
tion 3.7). In these cases, the time-fixed covariate pattern for which processes are generated
must first be decided upon. Non-adapted time-dependent covariates Z(t) involve extra ran-
domness, and micro-simulations can in this case only be carried out if a model for this extra
randomness is also set up. This will require joint modeling of V (t) and Z(t) (Section 7.4)
and will not be further considered here.
It may be computationally involved to generate paths for V (t) in steps of size dt as just
described and, fortunately, the simulations may be carried out in more efficient ways. This
is because, at any time t, we have locally a competing risks situation, and methods for
generating competing risks data (e.g., Beyersmann et al., 2009) may be utilized. Following
the recommended approach from that paper, the simulation algorithm goes, as follows:
1. Generate a time $T_1$ of transition out of the initial state 0 based on the survival function
\[
S_0(t) = \exp\Big(-\sum_{h \neq 0} \int_0^t \alpha_{0h}(u)du\Big) = P_{00}(0,t).
\]
Given a transition at $T_1$, the process moves to state $h_1 \neq 0$ with probability
\[
\frac{\alpha_{0h_1}(T_1)}{\sum_{h \neq 0} \alpha_{0h}(T_1)}.
\]
2. If the state, say $\tilde{h}$, reached when drawing a state from this multinomial distribution is
absorbing, then stop.
3. If state $\tilde{h}$ is transient, then generate a time $T_2 > T_1$ of transition out of that state based on
the (conditional) survival function
\[
S_1(t \mid T_1) = \exp\Big(-\sum_{h \neq \tilde{h}} \int_{T_1}^t \alpha_{\tilde{h}h}(u \mid T_1)du\Big) = P_{\tilde{h}\tilde{h}}(T_1,t).
\]
Given a transition at $T_2$, the process moves to state $h_2 \neq \tilde{h}$ with probability
\[
\frac{\alpha_{\tilde{h}h_2}(T_2)}{\sum_{h \neq \tilde{h}} \alpha_{\tilde{h}h}(T_2)}.
\]
4. Go to step 2.
Note that, in step 3, as the algorithm progresses, the past will contain more and more infor-
mation in the form of previous times (T1 , T2 , . . . ) and types of transition, and the transition
intensities may depend on this information, e.g., in the form of adapted time-dependent
covariates. The simulation process stops when reaching an absorbing state. For processes
without an absorbing state (e.g., Figure 1.4 without state 2), one has to decide upon a time
horizon (say, τ) within which the processes are generated. In this case, attention must be
restricted to marginal parameters not relating to times > τ.
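
The steps above can be sketched in a few lines of Python. This is only a minimal illustration under simplifying assumptions that are not part of the text: all intensities are taken to be constant in time (so that sojourn times are exponential), and the function names, the dictionary layout for the intensities, and the numerical values are hypothetical.

```python
import math
import random

def simulate_path(alpha, absorbing, rng, start=0, horizon=math.inf):
    """Simulate one path as a list of (transition time, state entered).

    alpha[h][j] is the (assumed constant) intensity for the h -> j transition.
    """
    t, state = 0.0, start
    path = [(t, state)]
    while state not in absorbing:
        rates = alpha[state]
        total = sum(rates.values())
        # Sojourn time in the current state: survival function exp(-total * u).
        t += rng.expovariate(total)
        if t > horizon:
            break  # the path is only followed up to the time horizon
        # Next state j is drawn with probability alpha_hj / sum_j alpha_hj.
        targets = list(rates)
        state = rng.choices(targets, weights=[rates[j] for j in targets])[0]
        path.append((t, state))
    return path

def state_at(path, t):
    """State occupied at time t, i.e. V(t), for a simulated path."""
    current = path[0][1]
    for time, state in path:
        if time > t:
            break
        current = state
    return current

def occupation_probability(paths, h, t):
    """Average of I(V(t) = h) over the simulated paths."""
    return sum(state_at(p, t) == h for p in paths) / len(paths)

# Illness-death model: 0 = healthy, 1 = ill, 2 = dead (absorbing); made-up intensities.
alpha = {0: {1: 0.5, 2: 0.2}, 1: {2: 1.0}}
rng = random.Random(2024)
paths = [simulate_path(alpha, absorbing={2}, rng=rng) for _ in range(20000)]
q1_hat = occupation_probability(paths, h=1, t=1.0)
```

With constant intensities the state 1 occupation probability has a closed form, which gives a simple check of the simulated average; for time-varying intensities the exponential draw would have to be replaced by inversion of the relevant survival function.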
The algorithm works well with a parametric specification of the intensities, e.g., as piece-wise
constant functions of time (Iacobelli and Carstensen, 2013). However, for non- or
semi-parametric models, such as the Cox model, in the steps where the next state is determined
given a transition at time $T_\ell$, i.e., when computing the probabilities
$\alpha_{h_{\ell-1}h_\ell}(T_\ell) / \sum_{h \neq h_{\ell-1}} \alpha_{h_{\ell-1}h}(T_\ell)$ and drawing from the associated multinomial distribution, one would
need the jumps for the cumulative hazards $\widehat{A}_{h_{\ell-1}h}(T_\ell)$ at that time and, typically, at most one
of these will be > 0. For such situations, an alternative version of the algorithm is needed
and, for that purpose, the method not recommended by Beyersmann et al. (2009) is applicable.
This method builds on the latent failure time approach to competing risks discussed
in Section 4.4. However, the method does provide data with the correct distribution. The
algorithm then goes, as follows:
1. Generate independent latent potential times $\widetilde{T}_{11}, \ldots, \widetilde{T}_{1k}$ of transition from the initial state
0 to each of the other states based on the ‘survival functions’
\[
S_{0h}(t) = \exp\Big(-\int_0^t \alpha_{0h}(u)du\Big), \quad h = 1, \ldots, k.
\]
Let $T_1$ be the minimum of $\widetilde{T}_{11}, \ldots, \widetilde{T}_{1k}$; if this minimum corresponds to $T_1 = \widetilde{T}_{1h_1}$, then
the process moves to state $h_1$ at time $T_1$.
2. If the state, $\tilde{h}$, thus reached is absorbing, then stop.
3. If state $\tilde{h}$ is transient, then generate independent latent potential times $\widetilde{T}_{2h}$, $h \neq \tilde{h}$, all $> T_1$,
of transition from that state based on the (conditional) ‘survival functions’
\[
S_{\tilde{h}h}(t \mid T_1) = \exp\Big(-\int_{T_1}^t \alpha_{\tilde{h}h}(u \mid T_1)du\Big), \quad h \neq \tilde{h}.
\]
Let $T_2$ be the minimum of $\widetilde{T}_{2h}$, $h \neq \tilde{h}$; if this minimum corresponds to $T_2 = \widetilde{T}_{2h_2}$, then the
process moves to state $h_2$ at time $T_2$.
4. Go to step 2.
Based on $N$ processes ($V_\ell^*(t)$, $\ell = 1, 2, \ldots, N$) generated in either of these ways, the desired
marginal parameter may be estimated by a simple average, e.g.,
\[
\widehat{Q}_h(t) = \frac{1}{N} \sum_\ell I(V_\ell^*(t) = h)
\]
for the state $h$ occupation probability $Q_h(t)$ at time $t$.
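
The single transition step of the latent-time version of the algorithm may be sketched as follows. As an illustrative assumption only, the intensities are constant, so each latent time is exponential; the function name and the numbers are hypothetical.

```python
import random

def latent_time_step(rates, t_now, rng):
    """One transition out of the current state via latent potential times.

    rates maps each reachable state to an (assumed constant) transition intensity.
    Returns (transition time, state entered).
    """
    # Independent latent times, one per reachable state, all later than t_now.
    latent = {j: t_now + rng.expovariate(a) for j, a in rates.items() if a > 0}
    # The process moves to the state whose latent time is the smallest.
    winner = min(latent, key=latent.get)
    return latent[winner], winner

rng = random.Random(1)
draws = [latent_time_step({1: 0.5, 2: 0.2}, 0.0, rng) for _ in range(20000)]
share_to_1 = sum(state == 1 for _, state in draws) / len(draws)
mean_exit = sum(time for time, _ in draws) / len(draws)
```

In this constant-intensity setting, the share of moves to state 1 should be close to 0.5/(0.5 + 0.2) and the mean exit time close to 1/(0.5 + 0.2), the same as under the first version of the algorithm.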

If the intensities used when generating the $V_\ell^*(t)$ were known, then the variability of the
estimator could be obtained from the empirical variation across the processes generated.
Thus, the SD of $\widehat{Q}_h(t)$ could be obtained as the square root of
\[
\frac{1}{N(N-1)} \sum_\ell \big(I(V_\ell^*(t) = h) - \widehat{Q}_h(t)\big)^2.
\]
This can be made small by simulating many processes (large $N$). However,
such a measure of variability would not take into account that the intensities used when
generating the $V_\ell^*(t)$ contain parameters $\theta$ which are, themselves, estimated with a certain
uncertainty. To account for this second level of uncertainty, the following double generation
of processes can be applied (e.g., O’Hagan et al., 2007):
1. Draw parameter values $\theta_1, \ldots, \theta_B$ from the estimated distribution of the parameter
estimate $\widehat{\theta}$, typically a multivariate normal distribution.
2. For each $\theta_b$, $b = 1, \ldots, B$, simulate $N$ processes $V_{b\ell}^*(t)$, $\ell = 1, \ldots, N$ based on intensities
having parameter value $\theta_b$, and calculate the corresponding estimate of the marginal
parameter, e.g.,
\[
\bar{Q}_{hb}(t) = \frac{1}{N} \sum_\ell I(V_{b\ell}^*(t) = h).
\]
The associated (within-simulation runs) variance
\[
SD^2_{hb}(t) = \frac{1}{N(N-1)} \sum_\ell \big(I(V_{b\ell}^*(t) = h) - \bar{Q}_{hb}(t)\big)^2
\]
can be made small by choosing $N$ large.
3. The variation among estimates for different $b$ is quantified by the (between-simulation
runs) variance
\[
SD^2_h(t) = \frac{1}{B-1} \sum_b \big(\bar{Q}_{hb}(t) - \bar{\bar{Q}}_h(t)\big)^2 \tag{5.14}
\]
where
\[
\bar{\bar{Q}}_h(t) = \frac{1}{B} \sum_b \bar{Q}_{hb}(t).
\]

The variability of the estimator $\widehat{Q}_h(t)$ obtained by simulating using the estimate $\widehat{\theta}$ based
on the observed data can now, for large $N$, be quantified by (5.14). An alternative way of
evaluating this variability would be to use the bootstrap.
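
The double generation of processes can be sketched as follows. To keep the example short, it assumes a two-state model with a constant hazard $\exp(\theta)$, so that $Q_0(t)$ is the survival probability, and a scalar normal distribution for the estimate of $\theta$; the function name, the parameter estimate and its standard error are made up for illustration.

```python
import math
import random

def double_generation(theta_hat, se, t, B, N, rng):
    """Return (grand mean, between-run variance (5.14), list of within-run variances)."""
    q_bar = []     # estimates for each drawn parameter value theta_b
    within = []    # within-simulation-run variances
    for _ in range(B):
        theta_b = rng.gauss(theta_hat, se)      # step 1: draw a parameter value
        rate = math.exp(theta_b)
        # Step 2: N processes; indicator of still being in state 0 at time t.
        indic = [1.0 if rng.expovariate(rate) > t else 0.0 for _ in range(N)]
        q = sum(indic) / N
        q_bar.append(q)
        within.append(sum((i - q) ** 2 for i in indic) / (N * (N - 1)))
    # Step 3: variation among the B estimates.
    grand_mean = sum(q_bar) / B
    between = sum((q - grand_mean) ** 2 for q in q_bar) / (B - 1)
    return grand_mean, between, within

rng = random.Random(3)
grand_mean, between, within = double_generation(
    theta_hat=math.log(0.5), se=0.1, t=1.0, B=200, N=2000, rng=rng)
sd_between = math.sqrt(between)
```

For finite $N$ the between-run variance still contains a small contribution from the within-run simulation noise, which is why $N$ should be chosen large.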
[Figure 5.4: two survival curves (solid and dashed). Vertical axis: survival probability, U (0.0 to 1.0); horizontal axis: survival time, T (0 to 10).]

Figure 5.4 Illustration of sampling from a survival function: For the solid curve, values of U of,
respectively 0.4 or 0.1, provide survival times of 1.53 and 3.84, for the dashed curve U = 0.4 gives
T = 4.59 and U = 0.1 a survival time censored at τ = 10.

5.4.2 Simulating from an improper distribution

When drawing a time T from a survival distribution S(t), one would typically draw a ran-
dom number U from a uniform [0, 1] distribution and find the T for which S(T ) = U. This
works well for a proper distribution, i.e., when $S(t) \to 0$ as $t \to \infty$. However, an estimated
survival function $\widehat{S}(t)$ often does not reach 0 even for large values of $t$, in which case this
method does not directly work (if a small value of $U$ is drawn), see Figure 5.4 for further
explanation.
Two approaches may be used here. First, ways of extrapolating $\widehat{S}(t)$ for large values of $t$
could be applied and, here, use of a parametric model would be possible. This, however, is
not entirely satisfactory since the reason the estimated survival function does not reach 0 is
the lack of data to support a model for the right-hand tail of the distribution. This means that
any parametric extrapolation will lack support from the data. A model with a piece-wise
constant hazard may be the least unsatisfactory choice though, as just stated, there will be
little data support for the estimate of the hazard in the last interval (ending in +∞). Second,
if data provide no support for the intensities beyond some threshold (say, τ), then attention
could be restricted to marginal parameters not involving times > τ. This is the approach
that we will illustrate in the next section.
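
The second approach can be sketched as follows: a time is drawn by solving $S(T) = U$, and the draw is returned as censored at τ whenever $S(\tau) > U$. The sketch assumes a piece-wise constant hazard and a finite τ; the function name, the interval end points and the hazard values are hypothetical.

```python
import math
import random

def draw_time(cuts, hazards, tau, rng):
    """Draw from S(t) = exp(-A(t)) with piece-wise constant hazard, censoring at tau.

    cuts[k] is the right end point of interval k (the last one may be math.inf)
    and hazards[k] the hazard on that interval. Returns (time, observed).
    """
    u = 1.0 - rng.random()            # uniform on (0, 1]
    target = -math.log(u)             # solve S(T) = U, i.e. A(T) = -log(U)
    cum, left = 0.0, 0.0
    for right, haz in zip(cuts, hazards):
        right = min(right, tau)       # no data support for the hazard beyond tau
        gain = haz * (right - left)   # cumulative hazard added on this interval
        if haz > 0 and cum + gain >= target:
            return left + (target - cum) / haz, True
        cum += gain
        left = right
        if left >= tau:
            break
    return tau, False                 # S(tau) > U: the time is censored at tau

rng = random.Random(7)
sample = [draw_time([1.0, math.inf], [1.0, 0.1], tau=5.0, rng=rng) for _ in range(20000)]
share_censored = sum(not observed for _, observed in sample) / len(sample)
```

With these made-up hazards, $S(5) = \exp(-1.4)$, so roughly a quarter of the draws should come out as censored at τ = 5.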

5.4.3 PROVA trial in liver cirrhosis

We will illustrate micro-simulation using data from the PROVA trial in liver cirrhosis (Ex-
ample 1.1.4). For this example, other methods for analyzing marginal parameters, such as
the probability Q1 (t) of being alive at time t after a bleeding episode, or the expected time
ε1 (τ) spent in the bleeding state before time τ are available. Thereby, it is possible to com-
pare the results from micro-simulation with those obtained from these alternative methods.
We emphasize, however, that the strength of micro-simulation is that the method applies
quite generally, i.e., also when other methods do not work.
The basis for the simulations is a set of models for the transition intensities
$\alpha_{01}(t)$, $\alpha_{02}(t)$, $\alpha_{12}(t - T_1)$ with a piece-wise constant dependence on either time since
randomization ($t$) or time since bleeding ($t - T_1$). The numbers of intervals, respectively 17,
16, and 10, were chosen such that each interval contained 2-3 events and, first, no covariates
were taken into account. The first method of generating illness-death processes was
applied since a piece-wise constant hazard model directly provides estimates of the transition
intensities (and not just their cumulatives). A censored event time was generated if the
time of exit from state 0 would otherwise exceed 4.13 years, or if the time spent in state
1 would otherwise exceed 3.73 years (see Figure 5.4). Table 5.4 shows the resulting estimates
of $Q_1(t)$ and, for comparison, the corresponding Aalen-Johansen estimates. It is seen
that the two estimators behave quite similarly. The table also gives the estimated SD using,
respectively, the asymptotic Aalen-Johansen SD, Equation (5.14) with $N = B = 1{,}000$, or
the bootstrap. The SD across replications for $b = 1, \ldots, B = 1{,}000$ is also illustrated by the
width of the histogram shown in Figure 5.5.

Table 5.4 PROVA trial in liver cirrhosis: Estimates (and SD) of the probability, $Q_1(t)$, of being in
the bleeding state 1 at time $t$, using either the Aalen-Johansen estimator or micro-simulation. The
SD for the estimate obtained using micro-simulation was either based on $SD_1(t)$ given by Equation
(5.14) with $N = B = 1{,}000$ or on 1,000 bootstrap replications.

                Aalen-Johansen                Micro-simulation
t (years)   $\widehat{Q}_1(t)$   SD      $\widehat{Q}_1(t)$   $SD_1(t)$   Bootstrap SD
0.5         0.050                0.013   0.053                0.015       0.011
1.0         0.081                0.016   0.074                0.020       0.016
1.5         0.091                0.018   0.091                0.022       0.017
2.0         0.093                0.019   0.088                0.021       0.018
2.5         0.089                0.019   0.082                0.021       0.018
3.0         0.089                0.019   0.075                0.022       0.019
3.5         0.079                0.019   0.069                0.023       0.020
4.0         0.063                0.020   0.051                0.017       0.016

From 10,000 simulated illness-death processes, the expected time (years) ε1 (τ) spent in the
bleeding state before time τ was also estimated. Figure 5.6 shows the estimate as a func-
tion of τ together with the corresponding estimate based on the integrated Aalen-Johansen
estimator. The two estimators are seen to coincide well.
In a second set of simulations, the binary covariate $Z$: Sclerotherapy (yes=1, no=0) was
added to the three transition intensity models assuming proportional hazards, thus requiring
three more parameters to be estimated. From $N = 1{,}000$ processes, each re-sampled $B = 1{,}000$
times, the probability $Q_1(2)$ was estimated for $Z = 0, 1$. Histograms of the resulting
estimates are shown in Figure 5.7. It is seen that treatment with sclerotherapy reduces
the probability of being in the bleeding state, likely owing to the fact that this treatment
increases the death intensity without bleeding (Table 3.3).

[Figure 5.5: histogram. Vertical axis: density; horizontal axis: $\widehat{Q}_1(2)$ (0.04 to 0.16).]

Figure 5.5 PROVA trial in liver cirrhosis: Distribution of $\widehat{Q}_1(2)$ based on $B = 1{,}000$ estimates, each
from $N = 1{,}000$ processes.

Figure 5.6 PROVA trial in liver cirrhosis: Estimates of average time, $\varepsilon_1(\tau)$, spent in the bleeding
state before time (year) $\tau$; based on either the Aalen-Johansen estimator or on micro-simulation
($N = 10{,}000$).

[Figure 5.7: histograms for Z = 0 and Z = 1. Vertical axis: density; horizontal axis: $\widehat{Q}_1(2 \mid Z)$ (0.00 to 0.20).]

Figure 5.7 PROVA trial in liver cirrhosis: Distribution of $\widehat{Q}_1(2 \mid Z)$, $Z = 0, 1$, (sclerotherapy: No,
yes) based on $B = 1{,}000$ estimates, each from $N = 1{,}000$ processes.

Micro-simulation
Micro-simulation is a general plug-in technique for marginal parameters in a multi-
state model when intensities have been specified. The strength of the method is
its generality, and it is applicable in situations where plug-in using a mathematical
expression is not feasible. This includes estimation based on a model where the
intensities are functions of the past given by adapted time-dependent covariates.

5.5 Direct regression models

An alternative to basing estimation of marginal parameters for given covariates on models
for all intensities is to directly set up a model for the way in which the marginal parameter
depends on covariates. This requires specification of a link function that gives the scale on
which parameters are to be interpreted (Section 1.2.5) and setting up a set of generalized
estimating equations (GEEs), the solutions of which are the desired parameter estimates.
Such an approach has some advantages compared to plug-in and micro-simulation. First, it
provides a set of regression coefficients that directly explain the association on the scale of
the chosen link function and, second, it targets directly the marginal parameter of interest
and, thereby, it does not rely on a correct specification of all intensity models – a spec-
ification that may be difficult. A direct marginal model does not provide information on
the dynamics of the multi-state process, and it is not possible to simulate paths of the pro-
cess based on a marginal model. Also, direct modeling requires modeling of the censoring
distribution.
In this section, the general ideas of GEE are first introduced (Section 5.5.1), and in the
subsequent sections these ideas will be used to outline properties of estimators in direct
regression models for marginal parameters in a number of multi-state models.

5.5.1 Generalized estimating equations (*)

Generalized estimating equation (GEE) is a technique for estimating parameters in a regression
model for a marginal mean value parameter. Let, in a general setting, $T_1, \ldots, T_n$ be
independent random variables with conditional mean value given a $p$-vector of covariates
$\boldsymbol{Z}$ (possibly including a constant term corresponding to the model intercept) specified as
the generalized linear model
\[
g(E(T \mid \boldsymbol{Z})) = \boldsymbol{\beta}_0^T \boldsymbol{Z}, \tag{5.15}
\]
i.e., the mean value transformed with the link function $g$ is linear in the covariates $\boldsymbol{Z}$, and $\boldsymbol{\beta}_0$
is the true regression coefficient. To estimate this parameter, a set of unbiased estimating
equations is set up
\[
U(\boldsymbol{\beta}) = \sum_i U_i(\boldsymbol{\beta}) = \sum_i A(\boldsymbol{\beta}, \boldsymbol{Z}_i)\big(T_i - g^{-1}(\boldsymbol{\beta}^T \boldsymbol{Z}_i)\big) = 0. \tag{5.16}
\]
Equations (5.16) are unbiased by (5.15) since, given covariates $\boldsymbol{Z}$, $E(U(\boldsymbol{\beta}_0)) = 0$. The
function $A(\boldsymbol{\beta}, \boldsymbol{Z}_i)$ is typically the $p$-vector
\[
A(\boldsymbol{\beta}, \boldsymbol{Z}_i) = \Big(\frac{\partial}{\partial \beta_j} g^{-1}(\boldsymbol{\beta}^T \boldsymbol{Z}_i),\ j = 1, \ldots, p\Big)
\]
of partial derivatives of the mean function. The independent random variables $T_i$, $i = 1, \ldots, n$
could be vector-valued with a dimension that may vary among $i$-values reflecting
a clustered data structure, possibly with clusters of varying size, $s_i$. In that case, $A(\boldsymbol{\beta}, \boldsymbol{Z}_i)$
would be a $(p \times s_i)$-matrix, possibly including an $(s_i \times s_i)$ working correlation matrix. However,
we will for simplicity restrict attention to the scalar case where $T_i$ is univariate.
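The asymptotic properties of the solution $\widehat{\boldsymbol{\beta}}$ to (5.16) rely on a Taylor expansion of (5.16)
around the true parameter value $\boldsymbol{\beta}_0$
\[
U(\boldsymbol{\beta}) \approx U(\boldsymbol{\beta}_0) + DU(\boldsymbol{\beta}_0)(\boldsymbol{\beta} - \boldsymbol{\beta}_0),
\]
where $DU(\boldsymbol{\beta})$ is the $(p \times p)$-matrix of partial derivatives of $U$. Inserting $\widehat{\boldsymbol{\beta}}$, using $U(\widehat{\boldsymbol{\beta}}) = 0$,
and re-arranging we get
\[
\sqrt{n}(\widehat{\boldsymbol{\beta}} - \boldsymbol{\beta}_0) \approx \big(-n^{-1} DU(\widehat{\boldsymbol{\beta}})\big)^{-1} \frac{1}{\sqrt{n}} U(\boldsymbol{\beta}_0).
\]
Now, $\frac{1}{\sqrt{n}} U(\boldsymbol{\beta}_0)$ is a sum of independent random variables to which a central limit theorem
may be applied, i.e., conditions may be given under which it has a limiting $N(0, \mathbf{V})$-distribution
as $n \to \infty$ where the limiting covariance matrix may be estimated by
\[
\widehat{\mathbf{V}} = \frac{1}{n} \sum_i U_i(\widehat{\boldsymbol{\beta}}) U_i(\widehat{\boldsymbol{\beta}})^T.
\]
If, further, $-n^{-1} DU(\widehat{\boldsymbol{\beta}})$ converges in probability to a non-negative definite matrix, then it
follows that $\sqrt{n}(\widehat{\boldsymbol{\beta}} - \boldsymbol{\beta}_0)$ also has a limiting zero-mean normal distribution with a covariance
matrix that can be estimated by the sandwich estimator
\[
DU(\widehat{\boldsymbol{\beta}})^{-1}\, \widehat{\mathbf{V}}\, DU(\widehat{\boldsymbol{\beta}})^{-1}. \tag{5.17}
\]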

The ‘meat’ of the sandwich, $\widehat{\mathbf{V}}$, is the covariance of the GEE and the ‘bread’ is the inverse
matrix of the partial derivatives, $DU(\boldsymbol{\beta})$, of the GEE. If the GEEs are obtained as score
equations by equating log-likelihood derivatives to zero, then the variance estimator simplifies
because minus the expected second log-likelihood derivative equals the variance of
the score (e.g., Andersen et al., 1993, ch. VI).
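
A small numerical sketch of (5.16) and of the sandwich construction is given below for the log link and $p = 2$ covariates (an intercept and one binary covariate). It is an illustration only: the function names and the data are made up, the equations are solved by a Fisher-scoring-type iteration (an assumption about how one might solve them), and the variance is computed on the scale of $\widehat{\boldsymbol{\beta}}$ itself, i.e., as $DU^{-1}(\sum_i U_i U_i^T)DU^{-1}$ without the normalizations by $n$ used in the asymptotic statement.

```python
import math

def inv2(a):
    """Inverse of a 2x2 matrix given as nested lists."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [[a[1][1] / det, -a[0][1] / det], [-a[1][0] / det, a[0][0] / det]]

def matmul2(a, b):
    """Product of two 2x2 matrices."""
    return [[sum(a[r][k] * b[k][c] for k in range(2)) for c in range(2)]
            for r in range(2)]

def gee_log_link(T, Z, n_iter=100, tol=1e-12):
    """Solve sum_i A_i (T_i - exp(beta'Z_i)) = 0 with A_i = exp(beta'Z_i) Z_i."""
    beta = [math.log(sum(T) / len(T)), 0.0]   # start; assumes Z[i][0] is the constant 1
    for _ in range(n_iter):
        U = [0.0, 0.0]
        H = [[0.0, 0.0], [0.0, 0.0]]          # minus the expected derivative of U
        for t, z in zip(T, Z):
            mu = math.exp(beta[0] * z[0] + beta[1] * z[1])
            for r in range(2):
                U[r] += z[r] * mu * (t - mu)
                for c in range(2):
                    H[r][c] += z[r] * z[c] * mu * mu
        Hinv = inv2(H)
        step = [Hinv[r][0] * U[0] + Hinv[r][1] * U[1] for r in range(2)]
        beta = [beta[0] + step[0], beta[1] + step[1]]
        if max(abs(s) for s in step) < tol:
            break
    return beta

def sandwich(T, Z, beta):
    """Bread * meat * bread, with bread = DU^{-1} and meat = sum_i U_i U_i'."""
    DU = [[0.0, 0.0], [0.0, 0.0]]
    meat = [[0.0, 0.0], [0.0, 0.0]]
    for t, z in zip(T, Z):
        mu = math.exp(beta[0] * z[0] + beta[1] * z[1])
        Ui = [z[r] * mu * (t - mu) for r in range(2)]
        for r in range(2):
            for c in range(2):
                DU[r][c] += z[r] * z[c] * mu * (t - 2.0 * mu)   # exact derivative of U
                meat[r][c] += Ui[r] * Ui[c]
    bread = inv2(DU)                           # symmetric, since DU is symmetric
    return matmul2(matmul2(bread, meat), bread)

# Made-up data: four subjects with Z = 0 and three with Z = 1.
T = [1.0, 2.0, 3.0, 2.5, 4.0, 3.5, 5.0]
Z = [[1, 0]] * 4 + [[1, 1]] * 3
beta_hat = gee_log_link(T, Z)
V_hat = sandwich(T, Z, beta_hat)
```

Because the model is saturated for a single binary covariate, the fitted means reproduce the two group averages, which makes the result easy to check by hand.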
In our applications of GEE, we will often face the additional complication that ‘our random
variables $T_i$’ are incompletely observed because of right-censoring in which case inverse
probability of censoring weighted (IPCW) GEE
\[
U(\boldsymbol{\beta}) = \sum_i U_i(\boldsymbol{\beta}) = \sum_i D_i \widehat{W}_i A(\boldsymbol{\beta}, \boldsymbol{Z}_i)\big(T_i - g^{-1}(\boldsymbol{\beta}^T \boldsymbol{Z}_i)\big) = 0 \tag{5.18}
\]
must be used. In (5.18), $D_i = 1$ if $T_i$ is completely observed, $D_i = 0$ if $T_i$ is right-censored,
and $\widehat{W}_i$ is a weight giving the inverse probability, possibly depending on covariates $\boldsymbol{Z}_i$,
that the observation at $T_i$ is uncensored. In this situation, the covariance of the GEE, i.e.,
the ‘meat’ of the sandwich, will involve an extra term owing to the need to estimate the
censoring distribution. Assuming that censoring is independent and does not depend on $\boldsymbol{Z}$,
this is typically done using the Kaplan-Meier estimator, say $\widehat{G}(t)$, with censoring being the
event; otherwise, a regression model for censoring can be used, e.g., a Cox model leading
to weights depending on estimates of $G(t \mid \boldsymbol{Z})$ (Section 4.4.1). Presence of delayed entry
further complicates the situation and, in this case, the weights $\widehat{W}_i$ need to be modified to
also reflect the (inverse) probability of no truncation. We will not go into details here but
refer to Geskus (2016, ch. 2) for an example where such a modification was studied for the
competing risks model.

5.5.2 Two-state model (*)

For the two-state model, as discussed in Section 4.2.1, a hazard regression model (e.g.,
multiplicative or additive) directly implies a model for the state occupation probabilities
Q0 (t) = S(t) and Q1 (t) = F(t) (with, respectively, a cloglog or a −log link function). This
approach implies a marginal model for all time points, t. If a model for a single (t0 ) or a
few time points and/or a model with other link functions is wanted, then direct regression
can be achieved using direct binomial regression or using pseudo-values. We will return to
this in Sections 5.5.5 and 6.1.1.
For the τ-restricted mean life time ε0 (τ) = E(T ∧ τ) where T is the survival time, possibly
observed with right-censoring, direct regression models were studied by Tian et al. (2014).
Let the potential right-censoring time for subject i be Ci and let Di = I(Ti ∧ τ ≤ Ci ) be
the indicator of observing the restricted life time for that subject. Let Xi = Ti ∧ τ ∧ Ci and
assume the generalized linear model
\[
\varepsilon_0(\tau \mid \boldsymbol{Z}) = g^{-1}(\boldsymbol{\beta}^T \boldsymbol{Z})
\]
(where the vector $\boldsymbol{Z}$ now includes the constant covariate equal to 1). Typical link functions
could be $g = \log$ or $g$ = identity. Now, $\boldsymbol{\beta}$ is estimated by solving the unbiased GEE
\[
U(\boldsymbol{\beta}) = \sum_i \frac{D_i}{\widehat{G}(X_i)} \boldsymbol{Z}_i \big(X_i - g^{-1}(\boldsymbol{\beta}^T \boldsymbol{Z}_i)\big) = 0,
\]
where $\widehat{G}$ is the Kaplan-Meier estimator for the distribution of $C$. Thus, subjects for whom
$T_i \wedge \tau$ was observed are up-weighted to also represent subjects who were censored. Using
counting process notation, the estimating equations become
\[
U(\boldsymbol{\beta}) = \sum_i \int_0^\tau \frac{\boldsymbol{Z}_i}{\widehat{G}(t)} \big(t - g^{-1}(\boldsymbol{\beta}^T \boldsymbol{Z}_i)\big) dN_i(t) = 0.
\]
Tian et al. (2014) discussed asymptotic normality for $\widehat{\boldsymbol{\beta}}$, the solution to these GEEs, assuming,
among other things, that $G$ is independent of $\boldsymbol{Z}$ with $G(\tau) > 0$, and derived an
expression for the sandwich estimator of the variance of $\widehat{\boldsymbol{\beta}}$.
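
The weighting idea can be sketched for the identity link, where the GEE reduces to weighted normal equations with a closed-form solution. The sketch below is illustrative only: it assumes $p = 2$ covariates (an intercept and one binary covariate) and no ties between event and censoring times, so that evaluating the Kaplan-Meier estimator just before $X_i$ agrees with $\widehat{G}(X_i)$; the function names and the data are made up.

```python
def censoring_km_left(x, d):
    """Kaplan-Meier estimate of P(C >= t), with censoring (d == 0) as the event."""
    factors = []
    for c in sorted({xi for xi, di in zip(x, d) if di == 0}):
        at_risk = sum(xj >= c for xj in x)
        n_cens = sum(xj == c and dj == 0 for xj, dj in zip(x, d))
        factors.append((c, 1.0 - n_cens / at_risk))

    def g_left(t):
        g = 1.0
        for c, f in factors:
            if c < t:                 # only censoring times strictly before t
                g *= f
        return g

    return g_left

def rmst_ipcw_identity(times, status, Z, tau):
    """IPCW GEE for the tau-restricted mean life time, identity link, p = 2."""
    x = [min(t, tau) for t in times]
    # D_i = 1 if T_i ^ tau is observed: an event, or follow-up reaching tau.
    d = [1 if (s == 1 or t >= tau) else 0 for t, s in zip(times, status)]
    g_left = censoring_km_left(x, d)
    a = [[0.0, 0.0], [0.0, 0.0]]      # sum of w_i Z_i Z_i'
    b = [0.0, 0.0]                    # sum of w_i Z_i X_i
    for xi, di, z in zip(x, d, Z):
        if di == 0:
            continue                  # censored subjects contribute via the weights only
        w = 1.0 / g_left(xi)
        for r in range(2):
            b[r] += w * z[r] * xi
            for c in range(2):
                a[r][c] += w * z[r] * z[c]
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [(a[1][1] * b[0] - a[0][1] * b[1]) / det,
            (a[0][0] * b[1] - a[1][0] * b[0]) / det]

# Made-up data: status 1 = death observed, 0 = censored; tau = 4.
times = [1.0, 2.0, 3.0, 5.0, 1.5, 2.5, 3.5, 6.0]
status = [1, 0, 1, 1, 1, 1, 1, 0]
Z = [[1, 0]] * 4 + [[1, 1]] * 4
beta_hat = rmst_ipcw_identity(times, status, Z, tau=4.0)
```

Here `beta_hat[0]` is the weighted mean of $X_i$ among the observed subjects with $Z = 0$, and `beta_hat[0] + beta_hat[1]` is the corresponding mean for $Z = 1$. A log link would require an iterative solution instead of the closed form.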

5.5.3 Competing risks (*)

In the competing risks model, the most important marginal parameter is the cumulative
incidence, say for cause 1
F1 (t) = P(T ≤ t, D = 1),
i.e., the state 1 occupation probability in the multi-state model of Figure 1.2. Here, T is the
life time (time spent in state 0) and D the failure indicator, D = V (∞). Fine and Gray (1999)
studied the following generalized linear model for this parameter
\[
\log(-\log(1 - F_1(t \mid \boldsymbol{Z}))) = \log(\widetilde{A}_{01}(t)) + \boldsymbol{\beta}^T \boldsymbol{Z}, \tag{5.19}
\]

where the risk parameter is linked to the covariates in the same way as in the Cox model for
survival data, i.e., using the cloglog link. (Fine and Gray allowed inclusion of deterministic
time-dependent covariates, but we will skip this possibility in what follows.) Indeed, the
Fine-Gray model (5.19) is a Cox model for the hazard function for the improper random
variable
T1 = inf{t : V (t) = 1},
which is the time of entry into state 1. This hazard function is the cause 1 sub-distribution
hazard given by
\[
\widetilde{\alpha}_1(t) = \lim_{dt \to 0} P(T_1 \leq t + dt \mid T_1 > t)/dt \tag{5.20}
\]
\[
= \lim_{dt \to 0} P(T \leq t + dt, D = 1 \mid T > t \text{ or } (T \leq t \text{ and } D \neq 1))/dt.
\]

It follows from this expression that the sub-distribution hazard has a rather unintuitive in-
terpretation, being the cause-1 mortality rate among subjects who are either alive or have
already failed by a competing cause. We will discuss other choices of link function for the
cumulative incidence later (Section 5.5.5 and Chapter 6), and there we will see that this may
lead to other difficulties for the resulting model. Nice features of the link function in the
Fine-Gray model include the fact that predicted failure probabilities stay within the admis-
sible range between 0 and 1 and, as we shall see now, that it suggests estimating equations
inspired by the score equations resulting from a Cox model.
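In the special case of no censoring, the ‘risk set’
\[
\widetilde{R}_1(t) = \{i : T_i \geq t \text{ or } (T_i \leq t \text{ and } D_i \neq 1)\}
\]
for the sub-distribution hazard is completely observed, and we let
$\widetilde{Y}_i(t) = I(T_i \geq t \text{ or } (T_i \leq t \text{ and } D_i \neq 1)) = 1 - N_{1i}(t-)$ (where $N_{1i}(t) = I(T_i \leq t, D_i = 1)$ is the counting process for
cause 1 failures) be the membership indicator for this risk set. In this case the ‘Cox score’
is
\[
U_1(\boldsymbol{\beta}) = \sum_i \int_0^\infty \left(\boldsymbol{Z}_i - \frac{\sum_j \widetilde{Y}_j(t) \boldsymbol{Z}_j \exp(\boldsymbol{\beta}^T \boldsymbol{Z}_j)}{\sum_j \widetilde{Y}_j(t) \exp(\boldsymbol{\beta}^T \boldsymbol{Z}_j)}\right) dN_{1i}(t).
\]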
Fine and Gray (1999) used martingale results to ascertain that the resulting score equation is
unbiased and to obtain asymptotic normality for its solution. Similar results were obtained
in the case with right-censoring, conditionally independent of V (t) for given covariates,
and where the censoring times Ci are known for all i (e.g., administrative censoring). In
that case the risk set is re-defined as
\[
\widetilde{R}_1(t) = \{i : C_i \wedge T_i \geq t \text{ or } (T_i \leq t \text{ and } D_i \neq 1 \text{ and } C_i \geq t)\},
\]
i.e., subjects who are either still alive and uncensored at $t$ or have failed from a competing
cause before $t$ and, at the same time, have a censoring time, $C_i$, exceeding $t$. The membership
indicator for this risk set is $Y_i^*(t) = \widetilde{Y}_i(t) I(C_i \geq t)$ and, thus, subjects who fail from a
competing cause stay in the risk set, not indefinitely, but until their time of censoring. This
leads to the ‘Cox score’
\[
U_1^*(\boldsymbol{\beta}) = \sum_i \int_0^\infty \left(\boldsymbol{Z}_i - \frac{\sum_j Y_j^*(t) \boldsymbol{Z}_j \exp(\boldsymbol{\beta}^T \boldsymbol{Z}_j)}{\sum_j Y_j^*(t) \exp(\boldsymbol{\beta}^T \boldsymbol{Z}_j)}\right) dN_{1i}(t).
\]

In the general case of right-censoring (assumed to be conditionally independent of $V(t)$
for given covariates) but now with $C_i$ unobserved for failing subjects, IPCW techniques
were used, as follows. According to the situation where all censoring times were known,
subjects $j$ failing from a competing cause should stay in the risk set until their time, $C_j$, of
censoring. Now, $C_j$ is not observed, so those subjects stay in the risk set with a weight that
diminishes over time reflecting a decreasing probability of still being uncensored. This is
obtained using the weights
\[
W_j(t) = \frac{I(C_j \geq T_j \wedge t)\,\widehat{G}(t)}{\widehat{G}(T_j \wedge C_j \wedge t)} \tag{5.21}
\]
where $\widehat{G}$ estimates the censoring distribution either non-parametrically using the Kaplan-Meier
estimator or via a regression model. There are three kinds of subjects who were not
observed to fail from cause 1 before $t$:
1. $j$ is still alive and uncensored in which case $W_j(t) = 1$,
2. $j$ was censored before $t$ in which case $W_j(t) = 0$,
3. $j$ failed from a competing cause before $t$ in which case $W_j(t) = \widehat{G}(t)/\widehat{G}(T_j)$, the conditional
probability of being still uncensored at $t$ given uncensored at the failure time $T_j$.
The resulting GEE are $U_1^W(\boldsymbol{\beta}) = 0$ where
\[
U_1^W(\boldsymbol{\beta}) = \sum_i \int_0^\infty \left(\boldsymbol{Z}_i - \frac{\sum_j W_j(t) \widetilde{Y}_j(t) \boldsymbol{Z}_j \exp(\boldsymbol{\beta}^T \boldsymbol{Z}_j)}{\sum_j W_j(t) \widetilde{Y}_j(t) \exp(\boldsymbol{\beta}^T \boldsymbol{Z}_j)}\right) W_i(t)\, dN_{1i}(t), \tag{5.22}
\]
and $\widetilde{Y}_i(t) = 1 - N_{1i}(t-)$, the indicator of no observed cause 1 failure before time $t$. Fine
and Gray (1999) showed that these equations are approximately unbiased, that their solutions
are asymptotically normal, and derived a consistent sandwich variance estimator. The
estimator
\[
\widehat{\widetilde{A}}_{01}(t) = \sum_i \int_0^t \frac{W_i(u)\, dN_{1i}(u)}{\sum_j W_j(u) \widetilde{Y}_j(u) \exp(\widehat{\boldsymbol{\beta}}^T \boldsymbol{Z}_j)}
\]
for the cumulative baseline sub-distribution hazard $\widetilde{A}_{01}(t)$ was also presented with asymptotic
results.
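
The weights (5.21) and their three cases can be sketched as below, expressed in terms of what is actually observed: the time $X_j = T_j \wedge C_j$ and a status code. The sketch is illustrative only; the function names and the status coding are hypothetical, and the censoring survival function $G$ is passed in as a known function rather than estimated.

```python
import math

def fine_gray_weight(t, x, status, G):
    """W_j(t) from (5.21) for a subject with observed time x = T ^ C.

    status: 0 = censored at x, otherwise the cause of failure at x.
    G: callable giving the (estimated) probability of being uncensored at a time.
    """
    if status == 0:
        # C_j = x < T_j: the indicator I(C_j >= T_j ^ t) is 1 only while t <= x.
        return 1.0 if t <= x else 0.0
    # Failure observed at x = T_j <= C_j: ratio G(t) / G(T_j ^ t).
    return G(t) / G(min(x, t))

def at_risk_weight(t, x, status, G):
    """W_j(t) times the indicator of no cause 1 failure before t."""
    y_tilde = 0.0 if (status == 1 and x < t) else 1.0
    return y_tilde * fine_gray_weight(t, x, status, G)

def G(s):
    """Hypothetical censoring survival function, used for illustration only."""
    return math.exp(-0.1 * s)

w_alive = fine_gray_weight(2.0, 5.0, 1, G)       # still alive and uncensored at t = 2
w_censored = fine_gray_weight(2.0, 1.0, 0, G)    # censored at time 1, before t = 2
w_competing = fine_gray_weight(3.0, 1.0, 2, G)   # competing failure at 1: G(3) / G(1)
```

A subject failing from a competing cause thus keeps a weight below 1 that decreases with $t$, while a subject failing from cause 1 leaves the risk set through the indicator of no cause 1 failure before $t$.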
The Fine-Gray model can be used for a single cause or for all causes – one at a time –
and, as we have seen, inference requires modeling of the censoring distribution. When all
causes are modeled, there is no guarantee that, for any given covariate pattern, one minus
the sum of the estimated cumulative incidences given that covariate pattern is a proper
survival function (e.g., Austin et al., 2021). Furthermore, the partial likelihood approach is
not fully efficient. Based on such concerns, Mao and Lin (2017) proposed an alternative
non-parametric likelihood approach to joint modeling of all cumulative incidences. The
Jacod formula (3.1) for the competing risks model was re-written in terms of the cumulative
incidences and their derivatives – the sub-distribution densities
\[
f_j(t) = \frac{d}{dt} F_j(t) = \alpha_j(t) S(t),
\]
as follows. For two causes of failure, the contribution to the Jacod formula from an observation
at time $X$ is, with the notation previously used,
\[
L = \alpha_1(X)^{I(D=1)} \alpha_2(X)^{I(D=2)} S(X).
\]
This can be re-written as
\[
L = (\alpha_1(X)S(X))^{I(D=1)} (\alpha_2(X)S(X))^{I(D=2)} S(X)^{1 - I(D=1) - I(D=2)}
\]
\[
= f_1(X)^{I(D=1)} f_2(X)^{I(D=2)} (1 - F_1(X) - F_2(X))^{1 - I(D=1) - I(D=2)},
\]

in which the cumulative incidences may be parametrized, e.g., as in the Fine-Gray model
or using other link functions. Mao and Lin (2017) showed that, under suitable conditions,
the resulting estimators are efficient and asymptotically normal. Similar to modeling via
hazard functions, this approach does not require a model for censoring.
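
As a numerical sanity check of the re-writing of the likelihood contribution in terms of cumulative incidences, the following Python sketch evaluates both forms for a single observation. It assumes constant cause-specific hazards, an assumption made only for illustration so that S, f_j and F_j have closed forms; the function names are hypothetical and do not come from the text.

```python
import math


def hazard_form(x, d, a1, a2):
    """Contribution alpha_1^I(D=1) * alpha_2^I(D=2) * S(X) for constant hazards a1, a2."""
    surv = math.exp(-(a1 + a2) * x)          # S(x) = exp(-(a1 + a2) x)
    i1, i2 = int(d == 1), int(d == 2)
    return a1 ** i1 * a2 ** i2 * surv


def cuminc_form(x, d, a1, a2):
    """Same contribution written with sub-distribution densities and cumulative incidences."""
    total = a1 + a2
    surv = math.exp(-total * x)
    f1, f2 = a1 * surv, a2 * surv            # f_j(x) = alpha_j(x) S(x)
    F1 = a1 / total * (1.0 - surv)           # F_j(x) = integral of f_j from 0 to x
    F2 = a2 / total * (1.0 - surv)
    i1, i2 = int(d == 1), int(d == 2)
    return f1 ** i1 * f2 ** i2 * (1.0 - F1 - F2) ** (1 - i1 - i2)


if __name__ == "__main__":
    for d in (0, 1, 2):                      # d = 0 denotes a censored observation
        print(d, hazard_form(1.7, d, 0.3, 0.5), cuminc_form(1.7, d, 0.3, 0.5))
```

The two functions agree for censored observations (d = 0) and for failures from either cause, because 1 − F1(X) − F2(X) = S(X).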
Cause-specific time lost
Conner and Trinquart (2021) used the approach of Tian et al. (2014) to study regression
models for the τ-restricted cause-specific time lost in the competing risks model. Following
Section 5.1.2, the parameter of interest is
$$\varepsilon_h(\tau) = \tau - E(T_h \wedge \tau) = \int_0^\tau Q_h(u)\,du$$

where Th is the time of entry into state h, possibly observed with right-censoring. Let the
potential right-censoring time for subject i be Ci and assume the generalized linear model

$$\varepsilon_h(\tau \mid Z) = g^{-1}(\beta_h^T Z)$$

(where the vector Z includes the constant covariate equal to 1). Typical link functions could
be g = log or g = identity. Now, β h is estimated by solving the unbiased GEE

$$U(\beta_h) = \sum_i \int_0^\tau \frac{Z_i}{\hat G(t)} \left(\tau - t - g^{-1}(\beta_h^T Z_i)\right) dN_{hi}(t) = 0$$

where $\hat G$ is the Kaplan-Meier estimator for the distribution of C and $N_{hi}(t)$ the counting
process for h-events for subject i.
Conner and Trinquart (2021) discussed conditions for asymptotic normality of the resulting
solution $\hat\beta_h$ to these GEEs and derived an expression for the sandwich estimator for the
variance of $\hat\beta_h$.

5.5.4 Recurrent events (*)

For recurrent events we will focus on the mean function

µ(t) = E(N(t)),

where N(t) counts the number of events in [0,t]. We will distinguish between the two situa-
tions where either there are competing risks in the form of a terminal event, the occurrence
of which prevents further recurrent events from happening, or there is no such terminal
event.

No terminal event
We will begin by considering the latter situation. The parameter µ(t) is closely linked to a
partial transition rate as introduced in (5.7), as follows. The partial transition rates for this
model are (approximately for small dt > 0)
$$\alpha^*_{h,h+1}(t) \approx P(V(t + dt) = h + 1 \mid V(t) = h)/dt,$$

and if these are assumed independent of h, then they equal

P(N(t + dt) = N(t) + 1)/dt = E(dN(t))/dt = dµ(t),

the derivative of the mean function. In Section 3.2.1 we derived the score equation (3.9) for
the cumulative hazard and the idea is now to use this as the basis for an unbiased GEE for
µ(t)
$$\sum_i Y_i(t) \left( dN_i(t) - d\mu(t) \right) = 0, \qquad (5.23)$$

where Yi (t) = 1 if i is still uncensored at time t, i.e., Yi (t) = I(Ci > t) (note that, for recurrent
events without competing risks, times Ci of censoring will always be observed). Equation
(5.23) is solved by
$$d\mu(t) = \frac{\sum_i Y_i(t)\,dN_i(t)}{\sum_i Y_i(t)}$$
corresponding to estimating the mean function by the Nelson-Aalen estimator
$$\hat\mu(t) = \int_0^t \frac{\sum_i dN_i(u)}{\sum_i Y_i(u)}. \qquad (5.24)$$
(Note that we only have dNi (t) = 1 if Yi (t) = 1.) Equation (5.23) is unbiased if censoring
is independent of the multi-state process in which case the estimator can be shown to be
consistent (Lawless and Nadeau, 1995; Lin et al., 2000). For more general censoring, (5.23)
may be replaced by the IPCW GEE
$$\sum_i \frac{Y_i(t)}{\hat G_i(t)} \left( dN_i(t) - d\mu(t) \right) = 0$$

leading to the weighted Nelson-Aalen estimator

$$\hat\mu(t) = \int_0^t \frac{\sum_i dN_i(u)/\hat G_i(u)}{\sum_i Y_i(u)/\hat G_i(u)},$$

where $\hat G_i(t)$ estimates the probability $E(Y_i(t))$ that subject i is uncensored at time t, possibly
via a regression model. For both estimators, a sandwich variance estimator is available, or
bootstrap methods may be used.
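
The unweighted estimator (5.24) is simple to compute directly. The following Python sketch, assuming NumPy, illustrates it; the function name and the data layout (one list of event times per subject plus one censoring time per subject) are hypothetical choices made for this example. As a further assumption, a subject is counted as at risk at time t when C_i ≥ t, so that an event occurring exactly at the censoring time is still counted.

```python
import numpy as np


def mean_function_nelson_aalen(event_times, censor_times):
    """Nelson-Aalen type estimate of mu(t) = E(N(t)) for recurrent events, no terminal event.

    event_times  : list of lists, the observed event times of each subject
    censor_times : censoring time C_i of each subject (always observed here)
    Returns the distinct event times and the estimate of mu at those times.
    """
    censor = np.asarray(censor_times, dtype=float)
    pooled = np.concatenate([np.asarray(e, dtype=float) for e in event_times])
    times, counts = np.unique(pooled, return_counts=True)        # sum_i dN_i(u) at each u
    at_risk = np.array([(censor >= t).sum() for t in times])     # sum_i Y_i(u)
    return times, np.cumsum(counts / at_risk)


if __name__ == "__main__":
    # Three subjects: two events, one event, and no events before censoring.
    times, mu_hat = mean_function_nelson_aalen([[1.0, 3.0], [2.0], []], [4.0, 2.5, 1.5])
    for t, m in zip(times, mu_hat):
        print(f"t = {t:.1f}  mu_hat = {m:.4f}")
```

The IPCW version would divide each event increment and each at-risk indicator by an estimate of the probability of being uncensored, which is not shown here.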
A multiplicative regression model for the mean function, inspired by the Cox regression
model, is
$$\mu(t \mid Z) = \mu_0(t) \exp(\beta^T Z), \qquad (5.25)$$
see Lawless and Nadeau (1995) and Lin et al. (2000), often referred to as the LWYY model.
Unbiased GEE may be established from a working intensity model with a Cox type inten-
sity where the score equations are

$$\sum_i Y_i(t) \left( dN_i(t) - d\mu_0(t) \exp(\beta^T Z_i) \right) = 0 \qquad (5.26)$$

and

$$\sum_i Y_i(t) Z_i \left( dN_i(t) - d\mu_0(t) \exp(\beta^T Z_i) \right) = 0. \qquad (5.27)$$

Equation (5.26) is, following the lines of Section 3.3, for fixed β solved by
$$\hat\mu_0(t) = \int_0^t \frac{\sum_i dN_i(u)}{\sum_i Y_i(u) \exp(\beta^T Z_i)} \qquad (5.28)$$
and inserting this solution into (5.27) leads to the equation

$$\sum_i \left( Z_i - \frac{\sum_j Y_j(t) Z_j \exp(\beta^T Z_j)}{\sum_j Y_j(t) \exp(\beta^T Z_j)} \right) dN_i(t) = 0$$

which is identical to the Cox score Equation (3.17). To assess the uncertainty of the esti-
mator, a sandwich estimator, as derived by Lin et al. (2000) must be used instead of the
model-based SD obtained from the derivative of the score. Further, the baseline mean function
$\mu_0(t)$ may be estimated by the Breslow-type estimator obtained by inserting $\hat\beta$ into
(5.28).

Terminal event
The situation where there are events competing with the recurrent events process was stud-
ied by Cook and Lawless (1997) and by Ghosh and Lin (2000, 2002), see also Cook et al.
(2009). In this situation, the partial transition rate
$$\alpha^*_{h,h+1}(t) \approx P(V(t + dt) = h + 1 \mid V(t) = h)/dt$$

(which we assume to be independent of h) has a slightly different interpretation, namely

$$\alpha^*(t) \approx E(dN(t) \mid T_D > t)/dt$$

where $T_D$ is the time to the competing event, i.e., the time of
entry into state D in Figure 1.5, typically the time to death. We define
$$A^*(t) = E(N(t) \mid T_D > t) = \int_0^t \alpha^*(u)\,du \qquad (5.29)$$

and, as in the case of no competing risks, it may be estimated by the Nelson-Aalen estimator
$$\hat A^*(t) = \int_0^t \frac{\sum_i dN_i(u)}{\sum_i Y_i(u)}.$$
The quantity A∗ (t) is not of much independent interest (it conditions on the future); how-
ever, since in the case of competing risks we have
$$E(N(t)) = \int_0^t S(u)\,dA^*(u),$$

this suggests the plug-in estimator, the Cook-Lawless estimator,

$$\hat\mu(t) = \int_0^t \hat S(u-)\,d\hat A^*(u)$$

where $\hat S(\cdot)$ is the Kaplan-Meier estimator for $S(t) = P(T_D > t)$. Asymptotic results for this
estimator were presented by Ghosh and Lin (2000).
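
The plug-in construction of the Cook-Lawless estimator can be sketched in a few lines of Python with NumPy. This is an illustration under stated assumptions rather than a reference implementation: the function name and data layout are hypothetical, each subject is followed until the smaller of death and censoring, a subject is at risk at time u when that follow-up end is at least u, and tied recurrent events are pooled into one increment.

```python
import numpy as np


def cook_lawless_mean(event_times, end_times, died):
    """Plug-in estimate of E(N(t)) with a terminal event: sum of S_hat(u-) dA*_hat(u).

    event_times : list of lists with the recurrent event times of each subject
    end_times   : end of follow-up for each subject (death or censoring, whichever is first)
    died        : True if follow-up ended with the terminal event
    """
    end = np.asarray(end_times, dtype=float)
    died = np.asarray(died, dtype=bool)
    death_times = np.unique(end[died])

    def km_left(u):
        # Kaplan-Meier estimate of S(u-): product over death times strictly before u.
        s = 1.0
        for d in death_times[death_times < u]:
            s *= 1.0 - ((end == d) & died).sum() / (end >= d).sum()
        return s

    pooled = np.concatenate([np.asarray(e, dtype=float) for e in event_times])
    times, counts = np.unique(pooled, return_counts=True)
    # Increment at u: S_hat(u-) times the Nelson-Aalen increment of A*.
    increments = np.array([km_left(u) * c / (end >= u).sum() for u, c in zip(times, counts)])
    return times, np.cumsum(increments)


if __name__ == "__main__":
    times, mu_hat = cook_lawless_mean([[1.0], [3.0], []], [2.0, 4.0, 5.0], [True, False, False])
    print(times, mu_hat)
```

When no subject dies, the Kaplan-Meier factor equals one throughout and the sketch reduces to the Nelson-Aalen estimator of the mean function.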
Regression analysis for µ(t) in the presence of a terminal event can proceed in two direc-
tions. Cook et al. (2009) discussed a plug-in estimator combining a regression model for
S(t) via a Cox model for the marginal death intensity and one for A∗ (t) using the estimat-
ing Equations (5.26)-(5.27) for µ(t | Z ) without competing risks. As it was the case for
the plug-in models discussed previously, this enables prediction of E(N(t) | Z ) but does
not provide regression parameters that directly quantify the association. To obtain this, the
direct model for µ(t | Z ) discussed by Ghosh and Lin (2002) is applicable. This model
also has the multiplicative structure (5.25), and direct IPCW GEE for this marginal param-
eter were set up, as follows. Ghosh and Lin (2002), following Fine and Gray (1999), first
considered the case with purely administrative censoring, i.e., where the censoring times
Ci are known for all subjects i and, next, for general censoring, IPCW GEE were studied.
The resulting equations are, except for the factors $\tilde Y_i(t) = 1 - N_i(t-)$ appearing in Equation
(5.22) for the Fine-Gray model, identical to that equation, i.e.,
$$U_1^W(\beta) = \sum_i \int_0^\infty \left( Z_i - \frac{\sum_j W_j(t) Z_j \exp(\beta^T Z_j)}{\sum_j W_j(t) \exp(\beta^T Z_j)} \right) W_i(t)\,dN_{1i}(t), \qquad (5.30)$$
where the weights $W_i(t)$ are given by (5.21). Ghosh and Lin (2002) presented asymptotic results
for the solution $\hat\beta$, including a sandwich-type variance estimator, and for the Breslow-type
estimator
$$\hat\mu_0(t) = \sum_i \int_0^t \frac{W_i(u)\,dN_i(u)}{\sum_j W_j(u) \exp(\hat\beta^T Z_j)}$$
for the baseline mean function.
As discussed in Section 4.2.3, the occurrence of the competing event (‘death’) must be
considered jointly with the recurrent events process N(t) when a terminal event is present.
To this end, Ghosh and Lin (2002) also studied an inverse probability of survival weighted
(IPSW) estimator, as follows. In (5.30), the weights are re-defined as
$$W_i^D(t) = \frac{I(T_{Di} \wedge C_i \ge t)}{\hat S(t \mid Z_i)}$$
where the denominator estimates the conditional probability given covariates of survival
past time t. Ghosh and Lin showed that the corresponding GEE are approximately unbiased
and derived asymptotic properties of the resulting estimator $\hat\beta$. Though the details were not
given, this also provides the joint asymptotic distribution of $\hat\beta$ and, say $\hat\beta_D$, the estimated
regression coefficient in a Cox model for the survival time distribution. Note that, compared
to the IPCW approach, the IPSW approach has the advantage of not having to estimate the
censoring distribution, but instead the survival time distribution which is typically of greater
scientific interest.

Mao-Lin model
Mao and Lin (2016) defined a composite end-point combining information on N(t) and
survival. They considered multi-type recurrent events processes Nh (t), h = 1, . . . , k, Nh (t)
counting events of type h together with, say N0 (t), the counting process for the terminal
event and assumed that each event type and death can be equipped with a severity weight
(or utility) ch , h = 0, 1, . . . , k. They defined the weighted process
$$\bar N(t) = \sum_{h=0}^{k} c_h N_h(t)$$

(which is a counting process if all ch = 1) and considered a multiplicative model

$$E(\bar N(t)) = \mu_0(t) \exp(\beta^T Z) \qquad (5.31)$$
for its mean, sometimes referred to as the Mao-Lin model. Approximately unbiased GEE
for β are exactly equal to (5.30) and also the estimator for µ0 (t) suggested by Mao and
Lin (2016) equals that from the Ghosh-Lin model. Asymptotic results for these estimators
were provided. Furberg et al. (2022) studied a situation with competing risks where some
causes of death were included in a composite end-point, but others were considered as
events competing with the composite end-point.
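
To make the weighted process concrete, the following Python sketch evaluates N̄(t) for a single subject. The function name, the data layout and the severity weights used in the example are hypothetical and chosen purely for illustration; they are not taken from Mao and Lin (2016).

```python
import numpy as np


def weighted_composite_count(t, event_times_by_type, weights):
    """Evaluate N-bar(t) = sum_h c_h N_h(t) for one subject.

    event_times_by_type : list over h = 0, 1, ..., k of the subject's event times of type h
                          (type 0 is the terminal event, so it has at most one time)
    weights             : severity weights c_0, c_1, ..., c_k
    """
    total = 0.0
    for c_h, times in zip(weights, event_times_by_type):
        n_h = np.sum(np.asarray(times, dtype=float) <= t)   # N_h(t): type-h events in [0, t]
        total += c_h * n_h
    return total


if __name__ == "__main__":
    # Death at time 5, two events of type 1 and one event of type 2 (illustrative values).
    history = [[5.0], [1.0, 2.0], [3.0]]
    print(weighted_composite_count(4.0, history, [2.0, 1.0, 0.5]))
    print(weighted_composite_count(5.0, history, [1.0, 1.0, 1.0]))
```

With all weights equal to one, the function simply counts all events, including the terminal event, up to time t.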

5.5.5 State occupation probabilities (*)

For a general multi-state process, V (t), the state occupation probability Qh (t) = P(V (t) =
h) is the expectation of the random variable I(V (t) = h). In a situation with no censoring,
regression models for E(I(V (t0 ) = h) | Z ), for a fixed time point t0 , could be fitted using
GEE with the binary outcome variable I(V (t0 ) = h). In the more realistic setting with cen-
soring, direct binomial regression for the state occupation probability Qh (t0 ) was studied
by Scheike et al. (2008) for the special case of the competing risks model and, more gen-
erally, by Scheike and Zhang (2007). Azarang et al. (2017) used a similar approach for the
transition probability in the progressive illness-death model (Figure 1.3).
The technique of Scheike and Zhang (2007) is closely related to what we have demonstrated
in Sections 5.5.1-5.5.4. If the right-censoring time for subject i is Ci , then the indicator
I(Vi (t0 ) = h)I(Ci > t0 ) is always observed and can be used, suitably weighted, as response
in the GEE
$$U(\beta) = \sum_i A(\beta, Z_i) \left( \frac{I(V_i(t_0) = h)\, I(C_i > t_0)}{\hat G(t_0)} - g^{-1}(\beta^T Z_i) \right) \qquad (5.32)$$

for the regression parameter β = (β0 , β1 , . . . , β p )T . This parameter vector includes an in-
tercept β0 depending on the chosen value of t0 . The model has link function g, i.e.,
$g(E(I(V_i(t_0) = h) \mid Z_i)) = \beta^T Z_i$. In (5.32), $A(\beta, Z_i)$ is usually the (p + 1)-vector of partial
derivatives
$$A(\beta, Z_i) = \left( \frac{\partial}{\partial \beta_j} g^{-1}(\beta^T Z_i),\ j = 0, 1, \dots, p \right)$$
of the mean function (see Section 5.5.1) and $\hat G$ is the Kaplan-Meier estimator if $C_i$, i = 1, . . . , n,
are assumed i.i.d. with survival distribution G(·).

The asymptotic distribution of $\hat\beta$ was derived by Scheike and Zhang (2007) together with
a variance estimator using the sandwich formula. However, bootstrap or an i.i.d. decompo-
sition (to be further discussed in Section 5.7) are also possible when estimating the asymp-
totic variance. Extensions to a model for several time points simultaneously have also been
considered (e.g., Grøn and Gerds, 2014). Blanche et al. (2023) compared, for the compet-
ing risks model, estimates based on (5.32) with those obtained by solving the GEE of the
form (5.18) with I(Vi (t0 ) = h) as the response variable (see also Exercise 5.4).
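
In the simplest special case of (5.32), with an intercept only and the identity link, the GEE has an explicit solution: the average of the IPCW-weighted responses. The following Python sketch, assuming NumPy, illustrates this case. It additionally assumes that all censoring times are observed, in which case the Kaplan-Meier estimator of G reduces to the empirical proportion of subjects with C_i > t0; the function name and data layout are hypothetical.

```python
import numpy as np


def ipcw_state_occupation(state_at_t0, censor_times, t0, h):
    """Intercept-only, identity-link solution of the IPCW GEE for Q_h(t0).

    state_at_t0  : state V_i(t0) of each subject (the value is ignored if C_i <= t0)
    censor_times : censoring times C_i, assumed observed for every subject
    """
    state = np.asarray(state_at_t0)
    uncensored = np.asarray(censor_times, dtype=float) > t0     # I(C_i > t0)
    g_hat = uncensored.mean()                                    # estimate of G(t0) = P(C > t0)
    if g_hat == 0.0:
        raise ValueError("no subject is uncensored at t0")
    response = ((state == h) & uncensored) / g_hat               # weighted binary response
    return response.mean()                                       # solves sum_i (response_i - beta_0) = 0


if __name__ == "__main__":
    q_hat = ipcw_state_occupation([1, 0, 1, 2, 1], [5.0, 5.0, 1.0, 5.0, 5.0], t0=2.0, h=1)
    print(q_hat)
```

In this special case the estimate is just the fraction of subjects in state h among those still under observation at t0; with covariates and a non-identity link the equations must be solved iteratively.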
For the analysis of the cumulative incidence in a competing risks model, the cloglog link
function log(− log(1 − p)) will provide regression coefficients with a similar interpreta-
tion as those in the Fine-Gray model (Section 5.5.3); however, using the direct binomial
approach other link functions may also be studied. This is also possible using pseudo-
observations to be discussed in Chapter 6. As discussed, e.g., by Gerds et al. (2012), a
log-link gives parameters with a relative risk interpretation; however, this comes with the
price that estimates may be unstable for time points close to 0 and that predicted risks may
exceed 1.

5.6 Marginal hazard models (*)

In Section 4.3, we introduced analysis of the marginal parameter ‘distribution of time, Th of
(first) entry into state h’ in a multi-state model via models for the marginal hazard. Within
any subject (i), different Th , e.g., times to event no. h = 1, 2, ... in a model for recurrent
events, cannot reasonably be assumed independent, and both in that section and in Section
3.9, the situation was treated together with that of clustered data which also gives rise to
dependent event history data. In the latter situation, time-to-event information for subjects
from the same family, medical center or the like, is studied and independence within clus-
ters is questionable, whereas independence among clusters may still be reasonable. In a
frailty model (Section 3.9), regression parameters with a within cluster interpretation are
estimated.
In this section, we will discuss inference for the marginal time to event distributions with-
out a specification of the intra-cluster/subject association using marginal Cox models as
discussed, e.g., by Wei et al. (1989) and by Lin (1994). We will, furthermore, in Sections
5.6.3-5.6.5 discuss to what extent this approach and the very concept of a marginal hazard
are applicable in the different situations. In Section 7.2, we will summarize the discussion
of analysis of dependent event history data.

5.6.1 Cox score equations – revisited (*)

Before discussing the marginal Cox model, recall the score equations for β for a Cox model

$$\lambda_i(t) = Y_i(t) \alpha_0(t) \exp(\beta^T Z_i(t))$$

for the intensity process λi (t) for the counting process Ni (t) = I(Xi ≤ t, Di = 1) counting
occurrences of the event of interest where, as usual, $Y_i(t) = I(X_i \ge t)$. These are $U(\beta) = 0$
where
$$U(\beta) = \sum_i \int_0^\infty (Z_i(t) - \bar Z(\beta, t))\,dN_i(t),$$

cf. (3.17). Here,

$$\bar Z(\beta, t) = \frac{S_1(\beta, t)}{S_0(\beta, t)}$$
and $S_0(\beta, t) = \sum_i Y_i(t) \exp(\beta^T Z_i(t))$, $S_1(\beta, t) = \sum_i Y_i(t) Z_i(t) \exp(\beta^T Z_i(t))$. Note that, since
$\sum_i \int_0^t (Z_i(t) - \bar Z(\beta, t)) Y_i(t) \exp(\beta^T Z_i(t)) \alpha_0(t)\,dt = 0$, the score may be re-written as
$$U(\beta) = \sum_i \int_0^\infty (Z_i(t) - \bar Z(\beta, t))\,dM_i(t) \qquad (5.33)$$

where $M_i(t) = N_i(t) - \int_0^t Y_i(u) \exp(\beta^T Z_i(u)) \alpha_0(u)\,du$ is the counting process martingale
(Equation (1.21)). This shows that the score evaluated based on data on the interval [0,t]
and evaluated at the true parameter value $\beta_0$ is a martingale, and the martingale central
limit theorem may be used to show asymptotic normality of the score. Thereby, asymptotic
normality of the solution $\hat\beta$ follows as in Section 5.5.1 with the simplification that,
as explained below (5.17), minus the derivative $DU(\hat\beta)$ of the score estimates the inverse
variance of the score, such that the variance of $\hat\beta$ may be estimated by $DU(\hat\beta)^{-1}$ with

$$DU(\beta) = \sum_i \int_0^\infty \left( \frac{S_2(\beta, t)}{S_0(\beta, t)} - \frac{S_1(\beta, t) S_1(\beta, t)^T}{S_0(\beta, t)^2} \right) dN_i(t) \qquad (5.34)$$

and $S_2(\beta, t) = \sum_i Y_i(t) Z_i(t) Z_i(t)^T \exp(\beta^T Z_i(t))$. This is the model-based variance estimate.
A robust estimator of the variance of $\hat\beta$ may also be derived as in Section 5.5.1 following
Lin and Wei (1989) where the ‘meat’ of the sandwich in (5.17) is
$$\hat V = \sum_i \int_0^\infty (Z_i(t) - \bar Z(\hat\beta, t))\,d\hat M_i(t) \int_0^\infty (Z_i(t) - \bar Z(\hat\beta, t))^T\,d\hat M_i(t). \qquad (5.35)$$

Here, $\hat M_i$ is obtained by plugging-in $\hat\beta$ and the Breslow estimator for $\alpha_0(t)dt$ into the
expression for $M_i$. The resulting robust variance-covariance matrix is then, as in (5.17),
$DU(\hat\beta)^{-1} \hat V DU(\hat\beta)^{-1}$.
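
The ingredients of the model-based and robust variance estimates can be computed directly from the quantities S0, S1 and S2. The following Python sketch, assuming NumPy, evaluates the score, the matrix in (5.34) and the 'meat' (5.35) at a given value of β. It is an illustration under simplifying assumptions, not a fitting routine: covariates are time-fixed, tied event times are handled one event at a time with a common risk set, and the function name and return layout are hypothetical.

```python
import numpy as np


def cox_score_pieces(beta, times, status, Z):
    """Return U(beta), DU(beta) as in (5.34), V as in (5.35), and the per-subject residuals.

    times  : observed times X_i;  status : 1 if the event was observed, 0 if censored
    Z      : (n, p) array of time-fixed covariates
    """
    beta = np.atleast_1d(np.asarray(beta, dtype=float))
    times = np.asarray(times, dtype=float)
    status = np.asarray(status, dtype=int)
    Z = np.asarray(Z, dtype=float).reshape(len(times), -1)
    n, p = Z.shape
    risk = np.exp(Z @ beta)
    U, DU, resid = np.zeros(p), np.zeros((p, p)), np.zeros((n, p))
    for k in np.flatnonzero(status == 1):
        w = (times >= times[k]) * risk            # Y_i(t) exp(beta^T Z_i) at the event time
        S0 = w.sum()
        S1 = w @ Z
        S2 = (Z * w[:, None]).T @ Z
        zbar = S1 / S0
        U += Z[k] - zbar
        DU += S2 / S0 - np.outer(zbar, zbar)
        resid[k] += Z[k] - zbar                   # integral of (Z_i - Zbar) dN_i
        resid -= (Z - zbar) * (w / S0)[:, None]   # compensator part, Breslow increment 1/S0
    V = resid.T @ resid
    return U, DU, V, resid


if __name__ == "__main__":
    U, DU, V, _ = cox_score_pieces([0.0], [1.0, 2.0, 3.0], [1, 1, 0], [[1.0], [0.0], [1.0]])
    robust = np.linalg.inv(DU) @ V @ np.linalg.inv(DU)   # sandwich form DU^{-1} V DU^{-1}
    print(U, DU, V, robust)
```

The per-subject residuals sum to the score, so at the solution of the score equation they sum to zero; in practice the pieces would be evaluated at the estimate rather than at an arbitrary β as in the usage example.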

5.6.2 Multivariate Cox model (*)

For the Cox model for the intensity of a single event, the robust variance derived in the
previous section seems of minor importance since, in this case, it gives the variance of the
estimator for a least false parameter for a misspecified Cox model, though it may be useful
for hypothesis testing, see Lin and Wei (1989).
For a multivariate situation, the sandwich variance is needed because it is robust against
misspecification of the within-cluster correlation structure. We will now discuss the multi-
variate situation in more detail. Suppose that there are independent units, i = 1, . . . , n, within
which there are K types of event of interest. We will denote these units as ‘clusters’ even
though, in some of the situations to be studied, the units correspond to subjects. The associ-
ated counting processes are Nhi (t) = I(Xhi ≤ t, Dhi = 1), h = 1, . . . , K, where Xhi = Thi ∧Chi
and Dhi = I(Xhi = Thi ). Here, Thi ≤ ∞ is the uncensored time of the type h event in cluster i
and Chi the associated time of right-censoring. The marginal hazard for events of type h in
a given cluster i is

$$\alpha_h(t) = \lim_{dt \to 0} P(T_{hi} \le t + dt \mid T_{hi} > t, Z_{hi}(t))/dt, \qquad (5.36)$$

i.e., conditioning is only on Thi > t (and on covariates) and not on other information for
cluster no. i.
The Cox model for the marginal intensity of type h events is

$$\lambda_{hi}(t) = Y_{hi}(t) \alpha_{h0}(t) \exp(\beta_h^T Z_i(t)), \quad h = 1, \dots, K$$

with Yhi (t) = I(Xhi ≥ t). Following Lin (1994) we will use the feature of type-specific co-
variates (see Section 3.8) and re-write the model as

$$\lambda_{hi}(t) = Y_{hi}(t) \alpha_{h0}(t) \exp(\beta^T Z_{hi}(t)), \quad h = 1, \dots, K, \qquad (5.37)$$
since this formulation, as discussed in Section 3.8, allows models with the same β for
several types of events. The GEEs for β are now $\bar U(\beta) = 0$ where
$$\bar U(\beta) = \sum_i \sum_{h=1}^{K} \int_0^\infty (Z_{hi}(t) - \bar Z_h(\beta, t))\,dN_{hi}(t).$$

Lin (1994) also discussed a model with a common baseline hazard across event types and
a more general, stratified model generalizing both of these models was discussed by Spiekerman
and Lin (1998). The derivations for these models are similar to those for (5.37) and
we will omit the corresponding details.
Lin (1994) discussed asymptotic normality of the solution $\hat\beta$ and presented the robust
estimator of the variance-covariance matrix, which is $D\bar U(\hat\beta)^{-1}\, \hat{\bar V}\, D\bar U(\hat\beta)^{-1}$. Here, $D\bar U$ and
$\hat{\bar V}$ are sums over types h of the corresponding type-specific quantities given by (5.34) and
(5.35). Lin’s asymptotic results hold true no matter the correlation among clusters i. However,
the results build on the assumption that the vectors of event times $(T_{hi}, h = 1, \dots, K)$
and right-censoring times $(C_{hi}, h = 1, \dots, K)$ are conditionally independent given the
covariates $(Z_{hi}, h = 1, \dots, K)$.

5.6.3 Clustered data (*)

For truly clustered data, e.g., a family with K members, an event of type h corresponds to the
event of interest for family member no. h. In this situation, conditioning in (5.36) is only on
subject h in family i being event-free at time t and not on the status of other family members.
Individual censoring times Chi , conditionally independent of Thi given covariates could be a
plausible assumption. For clustered data, one could generalize to a situation with competing
risks and study Cox models for the cause-specific event hazard for each individual without
having to specify the within-family correlation structure though the hazard function for
subject h is no longer a marginal hazard in the sense of (5.36). Alternatively, in this situation
one may follow the approach of Zhou et al. (2012) who extended the Fine-Gray model for
the cause-specific cumulative incidence to clustered data.

5.6.4 Recurrent events (*)

For recurrent events, i corresponds to a subject, K is the maximum number of events of
interest, and an event of type h is the hth occurrence of the event for the subject. For the
situation with no competing risks (Figure 1.5 without the terminal state D), the WLW model
(Wei et al., 1989 – see Section 4.3.2) would be applicable for the marginal distributions
of time until first, second, third, etc. recurrent event. This is because both the marginal
hazard for event recurrence no. h, i.e., without considering possible occurrence of events
no. 1, 2, . . . , h − 1, is well-defined, and also the existence of a censoring time Ci for each
subject i, conditionally independent of the recurrent events process for given covariates is
plausible (though, one has to get used to the fact that subject i is considered at risk for event
no. h no matter if event no. h − 1 has occurred).
When there are competing risks, the notion of a marginal hazard becomes less obvious
and difficulties appear when applying the WLW model in this situation (which was done
by Wei et al., 1989, in one of their examples). Here, one can argue that a marginal hazard
for time to event no. h is not well-defined because it relates to a hypothetical population
without mortality or, considering time to death, TD , as a censoring time, one can argue that
this cannot reasonably be considered independent of T1 , T2 , . . . . If one, which makes more
sense, treats death as a competing risk, then the model by Zhou et al. (2012) may be adapted
to a (marginal) analysis of the cumulative incidences for event no. h = 1, . . . , K.
One way of circumventing this problem would be to study the event times (T1 ∧TD , . . . , TK ∧
TD ), possibly jointly with TD . These event times are all censored by a single Ci for subject
i. This was one of the possibilities discussed by Li and Lagakos (1997), and an example of
this approach was given (for the data on recurrent episodes in affective disorder) in Table
4.10. Alternatively, one could acknowledge the presence of competing risks by restricting
attention to cause-specific hazards

$$\alpha_h(t) = \lim_{dt \to 0} P(T_{hi} \le t + dt \mid T_{hi} > t, T_{Di} > t, Z_{hi}(t))/dt,$$

i.e., also conditioning on being alive at t, but then the parameter is no longer a marginal
hazard. This solution was also discussed by Li and Lagakos (1997) and exemplified in Table
4.10.

5.6.5 Illness-death model (*)

Difficulties, similar to those described for recurrent events, also appear when using the con-
cept of a marginal hazard for the illness-death model (Figure 1.3) or for the model for the
disease course after bone marrow transplantation (Figure 1.6). For the bone marrow trans-
plantation study (Example 1.1.7), i would be the patient and h = 1, 2, 3 could correspond,
respectively, to GvHD, relapse, and death for that patient. To pinpoint the problem, consider
a simplified multi-state model for the bone marrow transplantation data, i.e., without con-
sideration of GvHD. This is Figure 1.3 with state 1 corresponding to relapse and state 2 to
death, in which case the time to relapse $T_1 = \inf\{t : V(t) = 1\}$ is an improper random variable
with P(T1 = ∞) being the probability of making a direct 0 → 2 transition, i.e., experiencing
death in remission. Here, the marginal hazard for T1 given by (5.36) is mathematically well-
defined. However, it is either the hazard function for the distribution of time to relapse in a
population with no mortality, or it is the sub-distribution hazard for the cause 1 cumulative
incidence, cf. Section 5.5.3. The former situation was touched upon by Lin (1994) in one
of his examples where it was noted that if one attempts to make marginal Cox models for
T1 and time to death, $T_2 = \inf\{t : V(t) = 2\}$, by censoring for death when analyzing T1, then
the conditional independence assumption between event times (T1 , T2 ) and the associated
censoring times (C ∧ T2 ,C) given covariates will be violated. This situation, i.e., attempting
to study the marginal distribution of T1 in a population without death informatively censor-
ing for T2 , was later referred to as semi-competing risks (e.g., Fine et al., 2001); however,
as discussed in Section 4.4.4, in our opinion this is asking a wrong question for the illness-
death model. In the latter situation, i.e., it is acknowledged that mortality is operating but no
conditioning on T2 is done when studying the hazard for T1 , one cannot make inference for
this hazard without distinguishing between censoring and death. Models for the cumulative
incidence of type 1 events, i.e., acknowledging the fact that an observed death in remission
signals that T1 = ∞, were studied by Bellach et al. (2019), thereby providing alternatives to
the Fine-Gray model. This approach is related to the previously discussed model of Mao
and Lin (2017) who studied joint models for all cumulative incidences, see Section 5.5.3.
A way of circumventing these problems, similar to what was discussed for recurrent events
in the previous section, would be to follow Lin (1994) and re-define the problem for the
illness-death model to study marginal Cox models, not for (T1 , T2 ) but for (T0 , T2 ) with
T0 = T1 ∧ T2 being the time spent in the initial state, 0, i.e., the times under study would
be recurrence-free survival and overall survival times. For this pair of times, the marginal
hazards are well-defined and the censoring times are (C,C) which may reasonably be as-
sumed conditionally independent of the event times for given covariates. Thus, if for the
bone marrow transplantation study, one wishes to analyze both time to relapse and time
to GvHD, then a possibility would be to study marginal models for relapse-free survival
(i.e., without censoring for GvHD) and for GvHD-free survival (i.e., without censoring for
relapse), possibly jointly with time to death. Such analyses were exemplified for the data
from Example 1.1.7 in Table 4.11.

5.7 Goodness-of-fit
In previous chapters and sections, a number of different models for multi-state survival
data have been discussed, including models for intensities (rates) and direct models for
marginal parameters (such as risks). All models impose a number of assumptions, such as
proportional hazards, additivity of covariates (no interaction) and linearity of quantitative
covariates in a linear predictor. We have in connection with examples shown how these
assumptions may be checked, often by explicitly introducing parameters describing depar-
tures from these assumptions. Thus, interaction terms or quadratic terms have been added
to a linear predictor (e.g., Section 2.2.1) as well as time-dependent covariates expressing
interactions in a Cox model between covariates and time (e.g., Section 3.7.7).
In this section, some general techniques for assessment of goodness-of-fit will be reviewed,
building on an idea put forward for the Cox model by Lin et al. (1993). The techniques
are based on cumulative residuals and provide both a graphical model assessment and a
numerical goodness-of-fit test – both using re-sampling from an approximate large-sample
distribution and, as we shall see, they are applicable for both hazard models and marginal
models. Section 5.7.1 presents the mathematical idea for the general method, with special
attention to GEE and the Cox model. Examples and discussion of how graphs and tests are
interpreted are given in Section 5.8.4.

5.7.1 Cumulative residuals (*)

Many of the estimators considered so far are obtained by solving equations of the form

$$U(\beta) = \sum_i A(\beta, Z_i)\, e_i$$

for a suitably defined set of residuals ei , i = 1, . . . , n. This is the case for the general GEE
discussed in Section 5.5.1, but also the score equations based on Cox partial likelihood may
be re-written into this form (Section 5.6.1). Solving the equations and, thereby, obtaining
parameter estimates $\hat\beta$, a set of observed residuals $\hat e_i$ is obtained which may be used for
model checking by looking at processes of the form

$$W(z) = W_j(z) = \sum_i h(Z_i) I(Z_{ij} \le z)\, \hat e_i \qquad (5.38)$$

based on cumulative sums of residuals, where $Z_{ij}$ is a single (quantitative) covariate.

5.7.2 Generalized estimating equations (*)

If the residual is
$$\hat e_i = T_i - g^{-1}(\hat\beta, Z_i)$$
for a suitably defined response Ti , assumed independent among subjects, and link function
g then W (z) in (5.38) may be re-written as

$$W(z) = \sum_i h(Z_i) I(Z_{ij} \le z)\, e_i + \sum_i h(Z_i) I(Z_{ij} \le z) \left( g^{-1}(\beta_0, Z_i) - g^{-1}(\hat\beta, Z_i) \right)$$

(Lin et al., 2002). Here, the second sum is Taylor expanded around the true parameter value
$\beta_0$
$$-\sum_i h(Z_i) I(Z_{ij} \le z)\, \frac{\partial}{\partial \beta} g^{-1}(\beta^*, Z_i)(\hat\beta - \beta_0)$$
and by another Taylor expansion, see Section 5.5.1,

$$\hat\beta - \beta_0 \approx (-DU(\hat\beta))^{-1} U(\beta_0) = (-DU(\hat\beta))^{-1} \sum_i A(\beta_0, Z_i)\, e_i.$$
i

Collecting terms, the goodness-of-fit process W (z) is seen to have the same asymptotic
distribution as the following sum $\sum_i f_z(V_i(\cdot))\, e_i$ of i.i.d. terms, where
$$f_z(V_i(\cdot)) = h(Z_i) I(Z_{ij} \le z) \left( 1 - \frac{\partial}{\partial \beta} g^{-1}(\beta_0, Z_i)\, (-DU(\beta_0))^{-1} \sum_k A(\beta_0, Z_k) \right).$$

The asymptotic distribution of W(z) can now be approximated by generating i.i.d. standard
normal variables $(U_1, \dots, U_n)$ and calculating $\sum_i \hat f_z(V_i(\cdot))\, \hat e_i U_i$ (sometimes referred to as the
conditional multiplier theorem or the wild bootstrap, e.g., Martinussen and Scheike, 2006,
ch. 2; Bluhmki et al., 2018) where, in $\hat f_z(\cdot)$, $\beta_0$ is replaced by $\hat\beta$. This i.i.d. decomposition
gives rise to a number of plots of cumulative residuals and also to tests obtained by compar-
ing the observed goodness-of-fit process to those obtained by repeated generation of i.i.d.
standard normal variables. We will illustrate this in Section 5.8.4.
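
The multiplier idea can be illustrated for the simplest GEE, ordinary least squares with the identity link. In that case, ignoring the weight function (h = 1), the cumulative residual process has the exact representation W(z) = ∑_i [I(Z_ij ≤ z) − η(z)^T (Z^T Z)^{-1} Z_i] e_i with η(z) = ∑_k I(Z_kj ≤ z) Z_k. The following Python sketch, assuming NumPy, resamples this representation with standard normal multipliers. The specialization to least squares, the supremum test statistic, the function name and the default arguments are assumptions made for this example and are not taken from the text.

```python
import numpy as np


def cumulative_residual_check(y, Z, j, n_sim=1000, seed=0):
    """Cumulative residuals over covariate column j for a least squares fit, with multiplier resampling.

    Returns the grid of z values, the observed process W(z), the simulated paths
    and a supremum-type p-value.
    """
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    Z = np.asarray(Z, dtype=float)
    n = len(y)
    beta_hat, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ beta_hat                                   # observed residuals
    grid = np.sort(Z[:, j])
    ind = (Z[:, j][None, :] <= grid[:, None]).astype(float)    # I(Z_ij <= z), shape (n_grid, n)
    W_obs = ind @ resid
    eta = ind @ Z                                              # eta(z), shape (n_grid, p)
    f = ind - eta @ np.linalg.solve(Z.T @ Z, Z.T)              # influence weights f_z(i)
    U = rng.standard_normal((n_sim, n))                        # i.i.d. N(0, 1) multipliers
    W_sim = (U * resid) @ f.T                                  # simulated paths, (n_sim, n_grid)
    p_value = (np.abs(W_sim).max(axis=1) >= np.abs(W_obs).max()).mean()
    return grid, W_obs, W_sim, p_value


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    x = rng.uniform(0.0, 1.0, 100)
    y = 1.0 + 2.0 * x + rng.normal(0.0, 0.5, 100)              # correctly specified linear model
    design = np.column_stack([np.ones(100), x])
    grid, W_obs, W_sim, p = cumulative_residual_check(y, design, j=1)
    print("supremum test p-value:", p)
```

Plotting W_obs against the grid together with a sample of rows of W_sim gives the graphical check, while the p-value summarizes how extreme the observed path is relative to the simulated ones.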

5.7.3 Cox model (*)

For the Cox regression model, the estimating equation is the Cox score Equation (5.33)
$$U(\beta) = \sum_i \int_0^\infty (Z_i - \bar Z(\beta, t))\,dM_i(t) = 0,$$
where $M_i$ is the martingale residual for the counting process $N_i$ for subject i. This was
shown in Section 5.6 where the definitions of $\bar Z(\beta, t)$, $S_0(\beta, u)$, and $S_1(\beta, u)$ are also found.
The goodness-of-fit process is
$$W(t, z) = W_j(t, z) = \sum_i h(Z_i) I(Z_{ij} \le z)\, \hat M_i(t)$$

with
$$\hat M_i(t) = N_i(t) - \int_0^t Y_i(u) \exp(\hat\beta^T Z_i)\, \frac{\sum_k dN_k(u)}{S_0(\hat\beta, u)}.$$
Taylor expanding $\hat M_i(t)$ as a function of β around $\beta_0$ and using

$$\hat\beta - \beta_0 \approx (-DU(\beta_0))^{-1} U(\beta_0) = (-DU(\beta_0))^{-1} \sum_i \int_0^\infty (Z_i - \bar Z(\beta_0, t))\,dM_i(t),$$

the process W (t, z) is approximated by

$$\begin{aligned}
W(t, z) \approx \sum_i h(Z_i) I(Z_{ij} \le z) \Bigg( & N_i(t) - \int_0^t Y_i(u) \exp(\beta_0^T Z_i)\, \frac{\sum_k dN_k(u)}{S_0(\beta_0, u)} \\
& - \int_0^t \frac{S_0(\beta_0, u) Z_i^T \exp(\beta_0^T Z_i) - \exp(\beta_0^T Z_i) S_1(\beta_0, u)^T}{S_0(\beta_0, u)^2}\, d\sum_k N_k(u) \\
& \times (-DU(\beta_0))^{-1} \sum_k \int_0^\infty (Z_k - \bar Z(\beta_0, u))\,dM_k(u) \Bigg).
\end{aligned}$$

In this expression, the Doob-Meyer decomposition (1.21) of Ni is used to get that W (t, z)
has the same asymptotic distribution as
$$\begin{aligned}
& \sum_i \int_0^t \left( h(Z_i) I(Z_{ij} \le z) - g_j(\beta_0, u, z) \right) dM_i(u) \\
& - \sum_i \int_0^t Y_i(u) \exp(\beta_0^T Z_i)\, h(Z_i) I(Z_{ij} \le z) (Z_i - \bar Z(\beta_0, u))^T \alpha_0(u)\,du \\
& \quad \times (-DU(\beta_0))^{-1} \sum_k \int_0^\infty (Z_k - \bar Z(\beta_0, u))\,dM_k(u),
\end{aligned}$$

where
$$g_j(\beta, u, z) = \frac{\sum_i Y_i(u) \exp(\beta^T Z_i)\, h(Z_i) I(Z_{ij} \le z)}{S_0(\beta, u)}.$$
This asymptotic distribution is approximated by replacing $\beta_0$ by $\hat\beta$, $\alpha_0(u)du$ by the Breslow
estimator, and $dM_i(t)$ by $dN_i(t) U_i$ with $U_1, \dots, U_n$ i.i.d. standard normal variables.
Lin et al. (1993) suggested to use W j (t, z) with h(·) = 1 and t = ∞ to check the functional
form for a quantitative Zi j , i.e., to plot cumulative martingale residuals

$$\sum_i I(Z_{ij} \le z)\, \hat M_i(\infty)$$

against z, together with a large number of paths generated from the approximate asymptotic
distribution.
To examine proportional hazards, Lin et al. (1993) proposed to let $h(Z) = Z_{ij}$ and $z = \infty$,
i.e., cumulative Schoenfeld or ‘score’ residuals
$$\sum_i \int_0^t (Z_{ij} - \bar Z_j(\hat\beta, u))\,dN_i(u)$$

are plotted against t, where the jth Schoenfeld residual for subject i failing at time Xi is
its contribution $Z_{ij} - \bar Z_j(\hat\beta, X_i)$ to the Cox score. The observed path for the goodness-of-fit
process is plotted together with a large number of paths generated from the approximate
asymptotic distribution.

5.7.4 Direct regression models (*)

The general idea has also been applied to a number of other special regression models, such
as those discussed in Sections 5.5.2-5.5.5. Thus, Li et al. (2015) used the technique for the
Fine-Gray regression model (5.19), Lin et al. (2000) did it for the multiplicative mean
model for recurrent events without competing risks (5.25), and Martinussen and Scheike
(2006, ch. 5) for the Aalen additive hazards model (3.23). The technique was also used in
connection with analyses based on pseudo-values by Pavlič et al. (2019), see Section 6.4.

5.8 Examples
In this section, we will exemplify some of the new methods that have been introduced in
the current chapter.

5.8.1 Non-Markov transition probabilities

PROVA trial in liver cirrhosis
In Section 3.7.6, models for the transition intensities in the three-state illness-death model
for the PROVA data (Example 1.1.4) were studied and one important conclusion from these
analyses was that the process did not fulfill the Markov property. This was seen in Table 3.8
where the mortality rate α12 (·) after bleeding depended on the duration d = d(t) = t − T1
in the bleeding state. In this example, we will study to what extent this deviation from
the Markov assumption affects estimation of the transition probability P01 (s,t), i.e., the
probability of being alive in the bleeding state at time t given alive without bleeding at the
earlier time point s. Some of the analyses to be reported, as well as some further analyses
of the PROVA data were presented by Andersen et al. (2022).
Under a Markov assumption, the probability P01(s,t) may be estimated using the Aalen-Johansen
estimator which is the plug-in estimator of $\int_s^t P_{00}(s,u)\,\alpha_{01}(u)\,P_{11}(u,t)\,du$ (Section
5.1.3). This estimate (based on the entire data set of 286 observations, disregarding treat-
ment and other covariates) is shown in Figure 5.8 for s = 1 year. This figure also shows three
landmark based estimators, namely the landmark Aalen-Johansen estimator suggested by
Putter and Spitoni (2018), the Titman (2015) estimator for a transient state, and the Pepe
estimator also discussed by Titman (2015) (see Section 5.2.2). At the time point s = 1 year,
there were 190 patients still at risk in state 0, and the three landmark estimators are based on
those subjects. It is seen that the curves are quite different, with the Markov-based estimator
throughout over-estimating the probability compared to the landmark Aalen-Johansen
and Pepe estimators, whereas Titman’s estimator is close to the other landmark based
estimators for t < 1.8 years while, for larger values of t, it approaches the Aalen-Johansen
estimator.

[Figure 5.8: probability (axis ticks 0.00 to 0.08) plotted against time since randomization
(1 to 4 years); curves: LM Pepe, LM Titman, LM AaJ, AaJ.]

Figure 5.8 PROVA trial in liver cirrhosis: Estimates for the transition probability P01(s,t), t > s for
s = 1 year (LM: Landmark, AaJ: Aalen-Johansen).

We next compare with plug-in estimators using the expression

$$P_{01}(s,t) = \int_s^t P_{00}(s,u)\,\alpha_{01}(u)\,P_{11}(u,t \mid u)\,du,$$

and basing the probability $P_{11}(u,t \mid u) = \exp\bigl(-\int_u^t \alpha_{12}(x, x-u)\,dx\bigr)$ of staying in state 1
until time t given entry into that state at the earlier time u on different models for the 1 → 2
transition intensity. We consider the following models

$$\alpha_{12}(t,d) = \alpha_{12,0}(d), \tag{5.39}$$

$$\alpha_{12}(t,d) = \alpha_{12,0}(t)\exp(\mathrm{LP}(d)). \tag{5.40}$$

Equation (5.39) is the special semi-Markov model where the intensity only depends on d.
In (5.40), the baseline 1 → 2 intensity depends on t and functions of d are used as time-
dependent covariates (as in Table 3.8). In (5.40), the linear predictor is either chosen as

LP(d) = β1 I(d < 5 days) + β2 I(5 days ≤ d < 10 days)

or LP(d) = β · d. Figure 5.9 shows the resulting estimates P̂01(s,t) for s = 1 year together
with the Aalen-Johansen and landmark Aalen-Johansen estimates. It is seen that the
estimate based on a semi-Markov model with d as the baseline time-variable is close to the
landmark Aalen-Johansen estimate. On the other hand, the models with t as baseline time-
variable and duration-dependent covariates differ according to the way in which the effect
of duration is modeled: With a piece-wise constant effect it is closer to the Markov-based
Aalen-Johansen estimator, and with a linear effect it is closer to the landmark estimate.
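
The plug-in expression for P01(s,t) can be evaluated by numerical integration once models for the
intensities are available. The following is a minimal sketch only, not the procedure used for the
PROVA analyses: the function names, the midpoint rule, and the constant intensities in the usage
lines are hypothetical choices made for illustration. The 1 → 2 intensity is passed as a function
of both time x and duration d = x − u, so that (5.39) and (5.40) are both covered.

```python
import math

def p11_stay(u, t, alpha12, n_steps=200):
    """P11(u, t | u) = exp(-int_u^t alpha12(x, x - u) dx), by the midpoint rule."""
    if t <= u:
        return 1.0
    h = (t - u) / n_steps
    integral = 0.0
    for k in range(n_steps):
        d = (k + 0.5) * h              # duration in state 1 at the midpoint
        integral += alpha12(u + d, d) * h
    return math.exp(-integral)

def plug_in_p01(s, t, alpha01, alpha02, alpha12, n_steps=400):
    """Midpoint approximation of int_s^t P00(s,u) alpha01(u) P11(u,t|u) du."""
    h = (t - s) / n_steps
    total = 0.0
    cum_out = 0.0                      # integral of alpha01 + alpha02 from s to the grid point
    for k in range(n_steps):
        u = s + (k + 0.5) * h
        out_rate = alpha01(u) + alpha02(u)
        # P00(s,u): no transition out of state 0 between s and u
        p00 = math.exp(-(cum_out + 0.5 * h * out_rate))
        total += p00 * alpha01(u) * p11_stay(u, t, alpha12) * h
        cum_out += h * out_rate
    return total

# Hypothetical constant intensities (per year), chosen only to exercise the code
a01, a02, a12 = 0.10, 0.05, 0.50
p01 = plug_in_p01(1.0, 4.0, lambda u: a01, lambda u: a02, lambda x, d: a12)
print(round(p01, 4))
```

With constant intensities the integral has a closed form, which gives a convenient check of the
numerical routine; a duration-dependent model is obtained by letting the last argument depend on d.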

[Figure 5.9: probability (axis ticks 0.00 to 0.06) plotted against time since randomization
(1 to 4 years); curves: FPD, LD, LM AaJ, PWCD, Semi, AaJ.]

Figure 5.9 PROVA trial in liver cirrhosis: Estimates for the transition probability P01 (s,t), t > s
for s = 1 year (FPD: Fractional polynomial duration effect in (5.40), LD: Linear duration effect in
(5.40), LM: Landmark, AaJ: Aalen-Johansen, PWCD: Piece-wise constant duration effect in (5.40),
Semi: Semi-Markov (5.39)).

To study whether a more detailed model for the duration effect LP(d) in (5.40) would
provide a better fit to the data, a model with a duration effect modeled using a fractional
polynomial

$$\mathrm{LP}(d) = \beta_1 d + \beta_2 d^2 + \beta_3 d^3 + \beta_4 \log(d)$$

(e.g., Andersen and Skovgaard, 2010, ch. 4) was also studied, see Figure 5.9. It is seen that
the latter estimate is close to that using a linear duration effect.
To assess the variability of the estimators, Andersen et al. (2022) also conducted a bootstrap
experiment by sampling B = 1,000 times with replacement from the PROVA data set and
repeating the analyses on each bootstrap sample. It was found that the Aalen-Johansen
estimator has a relatively large SD; however, since this estimator tends to be upwards biased
as seen in Figures 5.8-5.9, a more fair comparison between the estimated variabilities is
obtained by studying the relative SD, i.e., the coefficient of variation SD(P̂)/P̂. This showed
that the estimators based on sub-sampling (landmark Aalen-Johansen, Pepe, Titman) have
relatively large relative SD-values. On the other hand, the Aalen-Johansen estimator and
the plug-in estimators (5.39) and (5.40) (with a linear duration effect as covariate) have
smaller relative SD.
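
The relative SD used for this comparison can be sketched generically as follows. This is an
illustrative sketch under stated assumptions, not the code behind the reported experiment: the
function name is hypothetical, the toy data are made up, and the bootstrap SD is divided by the
estimate from the original sample (dividing by the bootstrap mean would be another option).

```python
import random
import statistics

def bootstrap_relative_sd(data, estimator, n_boot=1000, seed=1):
    """Bootstrap SD of an estimator and its relative SD (coefficient of variation)."""
    rng = random.Random(seed)
    n = len(data)
    estimates = []
    for _ in range(n_boot):
        # Resample n subjects with replacement and repeat the analysis
        sample = [data[rng.randrange(n)] for _ in range(n)]
        estimates.append(estimator(sample))
    sd = statistics.stdev(estimates)
    return sd, sd / estimator(data)

# Toy data: 100 subjects, 20 of whom are in the state of interest at time t
toy = [1] * 20 + [0] * 80
sd, rel_sd = bootstrap_relative_sd(toy, statistics.mean)
print(round(sd, 3), round(rel_sd, 3))
```

In an application, `estimator` would be the full estimation procedure (for example a landmark
or plug-in estimator of a transition probability) applied to one resampled data set.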
In conclusion, the estimators based on sub-sampling are truly non-parametric and hence
reliable; however, being based on fewer subjects, they are likely to be more variable. On
the other hand, the plug-in estimators are based on the full sample and hence less variable
though it may be a challenge to correctly model the effect of ‘the other time-variable’ using
time-dependent covariates.

Table 5.5 Bone marrow transplantation in acute leukemia: Estimated coefficients (and SD) for
duration effects in Cox models for the transition intensities α12(·), α13(·) and α23(·) (GvHD: Graft
versus host disease).

Transition   Duration in                 β̂        SD
1→2          GvHD state (t − T1)         0.074    0.046
1→3          GvHD state (t − T1)         0.050    0.021
2→3          Relapse state (t − T2)     -0.066    0.016

Bone marrow transplantation in acute leukemia

We will here illustrate analyses on the bone marrow transplantation data (Example 1.1.7)
similar to those conducted for the PROVA trial in the previous section. Following Andersen
et al. (2022), we will be focusing on the probability P02 (s,t) of being alive in the relapse
state at time t given alive in the initial state at an earlier time s, see Figure 1.6. For this
example, a Markov model does not fit the data well; see Table 5.5 where results from
Cox models for the transition intensities α12 (·), α13 (·) and α23 (·) allowing for duration
dependence in states 1 or 2 are summarized. It is seen that the two death intensities depend
on (a linear effect of) duration, t − T1 or t − T2 in states 1 or 2, respectively. The former
increases with duration (β̂ > 0), while the latter decreases (β̂ < 0).
Figures 5.10 and 5.11 show estimates of P02 (s,t), t > s for s = 3 and 9 months using various
estimators: The Markov-based Aalen-Johansen estimator, the landmark Aalen-Johansen
and Pepe estimators, and three plug-in estimators. The first plug-in estimator uses duration
in states 1 or 2 as baseline time-variables with no adjustment for time t since transplan-
tation, and the two others model the intensities out of states 1 or 2 using t as baseline
time-variable and adjusting for duration in states 1 or 2 using the models with estimates
given in Table 5.5. It is seen that, for this example, the deviations from the Markov assumption
are less severe for the estimation of transition probabilities, and no big differences
between the various estimators are apparent. However, the estimate based on the
semi-Markov model for s = 9 months does seem to be somewhat lower for t > 2.5 years. A
possible explanation is that the semi-Markov model does not take time t since transplanta-
tion into account for the transition intensities out of states 1 and 2 and, as seen in Section
3.7.8, this time-variable does have an effect. So, the example illustrates that when using
plug-in estimators modeling duration effects explicitly, great care must be exercised when
setting up these models.

[Figure 5.10: probability (axis ticks 0.00 to 0.03) plotted against time since bone marrow
transplantation (0 to 60 months); curves: LinGvHD, LM Pepe, Semi, LinRel, LM AaJ, AaJ.]

Figure 5.10 Bone marrow transplantation in acute leukemia: Estimates for the transition probability
P02(s,t), t > s for s = 3 months (LinGvHD: Linear duration effect in state 1, LM: Landmark, Semi:
Semi-Markov, LinRel: Linear duration effect in state 2, AaJ: Aalen-Johansen).

[Figure 5.11: probability (axis ticks 0.00 to 0.03) plotted against time since bone marrow
transplantation (0 to 60 months); curves: LinGvHD, LM Pepe, Semi, LinRel, LM AaJ, AaJ.]

Figure 5.11 Bone marrow transplantation in acute leukemia: Estimates for the transition probability
P02(s,t), t > s for s = 9 months (LinGvHD: Linear duration effect in state 1, LM: Landmark, Semi:
Semi-Markov, LinRel: Linear duration effect in state 2, AaJ: Aalen-Johansen).

5.8.2 Direct binomial regression

We will illustrate the use of direct binomial regression (Section 5.5.5) for estimating covariate
effects on the cumulative incidence of death without transplantation in the PBC3 trial.
In Section 4.2.2, these data were analyzed using the Fine-Gray model and estimates with
a sub-distribution hazard ratio interpretation were obtained. Using the estimating equations
(5.32), it is possible to apply other link functions for linking the cumulative incidence
to covariates, e.g., a logistic link yielding estimates with an odds ratio interpretation.
Table 5.6 shows estimates obtained by fitting such models to the cumulative incidence for
death without a liver transplantation (F2(·)), either at t0 = 2 years or simultaneously at
(t1, t2, t3) = (1, 2, 3) years. It is seen that, unadjusted, the odds of dying without
transplantation before 2 years are 1.098 = exp(0.093) times the odds for a placebo-treated person
when treated with CyA, with 95% confidence interval from 0.531 to 2.27, and after adjustment
for albumin and log2(bilirubin) the corresponding odds ratio is 0.630 = exp(−0.463) (95%
confidence interval (0.266, 1.488)). Moving from analyzing F2(t0 | Z) to jointly analyzing
F2(tj | Z), j = 1, 2, 3 and assuming time-constant effects, the estimated SDs become smaller.
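
The odds ratios and confidence limits quoted above follow from the coefficients and SDs in
Table 5.6 by exponentiating Wald limits on the log(odds) scale. The short sketch below shows the
arithmetic; the function name is hypothetical and the normal quantile 1.96 is assumed. Since the
table reports rounded coefficients, the results agree with the quoted figures only up to rounding.

```python
import math

def ratio_with_ci(beta_hat, sd, z=1.96):
    """Exponentiate a log-scale coefficient together with its Wald 95% limits."""
    return (math.exp(beta_hat),
            math.exp(beta_hat - z * sd),
            math.exp(beta_hat + z * sd))

# Treatment coefficients and SDs as printed in Table 5.6(a)
unadjusted = ratio_with_ci(0.093, 0.371)
adjusted = ratio_with_ci(-0.463, 0.439)
print([round(x, 3) for x in unadjusted])
print([round(x, 3) for x in adjusted])
```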

Table 5.6 PBC3 trial in liver cirrhosis: Estimated coefficients (and SD) from direct binomial
(logistic) models for the cumulative incidence of death without transplantation at t0 = 2 years or
simultaneously at (t1, t2, t3) = (1, 2, 3) years.

(a) t0 = 2 years

                                       Unadjusted           Adjusted
Covariate                              β̂        SD         β̂        SD
Treatment         CyA vs. placebo      0.093    0.371      -0.463    0.439
Albumin           per 1 g/L                                -0.147    0.037
log2(bilirubin)   per doubling                              0.639    0.151

(b) (t1, t2, t3) = (1, 2, 3) years

                                       Unadjusted           Adjusted
Covariate                              β̂        SD         β̂        SD
Treatment         CyA vs. placebo     -0.030    0.323       0.520    0.373
Albumin           per 1 g/L                                -0.125    0.035
log2(bilirubin)   per doubling                              0.579    0.128

5.8.3 Extended models for recurrent events

Furberg et al. (2022) applied versions of the Mao-Lin (2016) model for recurrent events
with competing risks to data from the LEADER trial (Example 1.1.6). Recall from Section
5.5.4 that this model (5.31) concerns the mean of a weighted recurrent end-point counting
both recurrent events and death. For recurrent MI including all-cause death (with all severity
weights ch = 1), the estimated log(mean ratio) was β̂ = −0.159 (SD = 0.057). Furberg et
al. also studied ‘recurrent 3-p MACE’, thus counting recurrent myocardial infarctions,
recurrent strokes, and cardiovascular deaths as events (giving all events a severity weight
of ch = 1). Non-cardiovascular death was here (somewhat incorrectly) treated as censoring.
This yielded a mean ratio of exp(−0.183) = 0.833 between liraglutide and placebo with
a 95% confidence interval from 0.742 to 0.935. Adjusting properly for the competing risk
of non-cardiovascular death (in a model combining the Mao-Lin model with a Ghosh-Lin
model – see appendix in Furberg et al., 2022) had almost no impact on the estimate which
was a mean ratio of 0.832 (0.741, 0.934). A possible explanation for this similarity is the
lack of difference between non-cardiovascular death rates in the two treatment groups.

5.8.4 Goodness-of-fit based on cumulative residuals

We will show how plots of cumulative residuals may be used for assessing goodness-of-fit
for the Cox regression models fitted to the data from the PBC3 trial (Example 1.1.1).

Checking linearity
The martingale residual from a Cox model is the difference

$$\widehat{M}_i = D_i - \widehat{A}_0(X_i)\exp(\widehat{\mathrm{LP}}_i)$$

between the failure indicator Di (= 1 for a failure and = 0 for a censored observation) for
subject i and the estimated cumulative hazard evaluated at the time, Xi , of failure/censoring.
The latter has an interpretation as an ‘expected value’ of Di at time Xi . If Z j is a quantitative
covariate, then, according to Lin et al. (1993), a plot of the cumulative sum of martingale
residuals for subjects with Zi j ≤ z against z is sensitive to non-linearity of the effect of the
covariate on the linear predictor. If linearity provides a good description of the effect, then
the resulting curve should vary non-systematically around 0. A formal test for linearity may
be obtained by comparing the curve with a large number of random realizations of what the
curve should look like under linearity, e.g., focusing on the maximum value of the observed
curve compared to the maxima of the random realizations.
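
As a minimal sketch of how the observed curve is built, the code below computes martingale
residuals from a given linear predictor using the Breslow estimator of the cumulative baseline
hazard, and then accumulates them in order of the covariate. The function names, the toy data,
and the coefficient 0.3 are hypothetical (the coefficient is not a fitted value), and the random
realizations needed for the formal test are not generated here.

```python
import math

def martingale_residuals(times, events, lin_pred):
    """M_i = D_i - A0_hat(X_i) * exp(LP_i), with A0_hat the Breslow estimator."""
    n = len(times)
    risk = [math.exp(lp) for lp in lin_pred]
    resid = []
    for i in range(n):
        cum_haz = 0.0
        for k in range(n):
            if events[k] == 1 and times[k] <= times[i]:
                # S0 at the failure time X_k: sum of exp(LP) over subjects still at risk
                s0 = sum(risk[j] for j in range(n) if times[j] >= times[k])
                cum_haz += 1.0 / s0
        resid.append(events[i] - cum_haz * risk[i])
    return resid

def cumulative_residuals(covariate, resid):
    """Points (z, sum of residuals for subjects with covariate <= z)."""
    order = sorted(range(len(covariate)), key=lambda i: covariate[i])
    path, running = [], 0.0
    for pos, i in enumerate(order):
        running += resid[i]
        last_of_tie = (pos == len(order) - 1
                       or covariate[order[pos + 1]] != covariate[i])
        if last_of_tie:
            path.append((covariate[i], running))
    return path

# Toy data: observation times X_i, failure indicators D_i, one quantitative covariate
times = [2, 3, 3, 5, 7, 8]
events = [1, 0, 1, 1, 0, 1]
covariate = [1.0, 2.5, 0.5, 3.0, 2.0, 1.5]
lin_pred = [0.3 * z for z in covariate]    # hypothetical coefficient

resid = martingale_residuals(times, events, lin_pred)
path = cumulative_residuals(covariate, resid)
for z, value in path:
    print(z, round(value, 3))
```

With the Breslow estimator the residuals sum to zero, so the curve always returns to 0 at the
largest covariate value; it is the excursions in between that carry information about
non-linearity.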
In Section 2.2, a model for the rate of failure of medical treatment including the covariates
treatment, albumin, and bilirubin was fitted to the PBC3 data, see Table 2.4. To assess
linearity of the two quantitative covariates albumin and bilirubin, Figures 5.12 and 5.13
show cumulative martingale residuals plotted against the covariate. While, for albumin,
linearity is not contra-indicated, the plot for bilirubin shows clear departures from ‘random
variation around 0’. This is supported by P-values from a formal significance test (0.459
for albumin and extremely small for bilirubin). The curve for bilirubin gets negative for
small values of the covariate indicating that the ‘expected’ number of failures is too large
for low values of bilirubin compared to the observed (the latter is often equal to 0). This
suggests that relatively more weight should be given to low values of bilirubin in the linear
predictor and relatively less weight to high values – something that may be achieved by a
transformation of the covariate with a concave (‘downward bending’) function, such as the
logarithm. Figure 5.14 shows the plot after transformation and now the curve is more in
accordance with what would be expected under linearity. This is supported by the P-value
of 0.481. The estimates in this model were given in Table 2.7. The plot for albumin in this
model (not shown) is not much different from what was seen in Figure 5.12.

[Figure 5.12: cumulative martingale residuals plotted against albumin (20 to 50).]

Figure 5.12 PBC3 trial in liver cirrhosis: Checking linearity using cumulative martingale residuals
plotted against albumin.

[Figure 5.13: cumulative martingale residuals plotted against bilirubin (0 to 400).]

Figure 5.13 PBC3 trial in liver cirrhosis: Checking linearity using cumulative martingale residuals
plotted against bilirubin.

Checking proportional hazards

The Schoenfeld (or score) residuals from a Cox model are the differences

$$D_i\bigl(Z_{ij} - \bar{Z}_j(X_i)\bigr),$$

between the observed covariate Zij for a subject failing at time Xi and an expected average
value for covariate j, Z̄j(Xi), among subjects at risk at that time. According to Lin et
al. (1993), a plot of the cumulative sum of Schoenfeld residuals for subjects with Xi ≤ t
against time t is sensitive to departures from proportional hazards. If the proportional haz-
ards assumption fits the data well, then the resulting curve should vary non-systematically
around 0, and a formal goodness-of-fit test may be obtained along the same lines as for the
plot of cumulative martingale residuals.
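
The observed score process for a single covariate can be sketched as below. This is an
illustrative sketch only: the function name and toy data are hypothetical, the regression
coefficient is supplied by the user rather than estimated, and neither the standardization nor
the simulated reference paths are included.

```python
import math

def cumulative_score_process(times, events, z, beta):
    """Running sum of D_i * (Z_i - Zbar(beta, X_i)), ordered by X_i (one covariate)."""
    n = len(times)
    order = sorted(range(n), key=lambda i: times[i])
    path, running = [], 0.0
    for i in order:
        if events[i] == 1:
            # Weighted covariate average among subjects at risk at the failure time
            at_risk = [j for j in range(n) if times[j] >= times[i]]
            weights = [math.exp(beta * z[j]) for j in at_risk]
            z_bar = sum(w * z[j] for w, j in zip(weights, at_risk)) / sum(weights)
            running += z[i] - z_bar
            path.append((times[i], running))
    return path

# Toy data with a binary covariate, evaluated at beta = 0
score_path = cumulative_score_process([1, 2, 3, 4], [1, 1, 0, 1], [1, 0, 1, 0], 0.0)
for t, value in score_path:
    print(t, round(value, 4))
```

When the coefficient is the Cox estimate, the final value of the path is the Cox score evaluated
at the estimate and is therefore 0, so the process starts and ends at 0.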
Figures 5.15-5.17 show plots of cumulative Schoenfeld residuals (standardized by division
by SD(β̂j)) against time for the three covariates in the model: Treatment, albumin, and
log2(bilirubin). It is seen that the proportional hazards assumption is not contra-indicated
for any of the covariates. This is confirmed both by the curves and the associated P-values
(0.919, 0.418, and 0.568, respectively).

[Figure 5.14: cumulative martingale residuals plotted against log2(bilirubin) (2.5 to 7.5).]

Figure 5.14 PBC3 trial in liver cirrhosis: Checking linearity using cumulative martingale residuals
plotted against log2(bilirubin).

[Figure 5.15: standardized score process plotted against time since randomization (0 to 5 years).]

Figure 5.15 PBC3 trial in liver cirrhosis: Checking proportional hazards using cumulative Schoenfeld
residuals for treatment (standardized) plotted against the time-variable.

[Figure 5.16: standardized score process plotted against time since randomization (0 to 5 years).]

Figure 5.16 PBC3 trial in liver cirrhosis: Checking proportional hazards using cumulative Schoenfeld
residuals for albumin (standardized) plotted against the time-variable.

[Figure 5.17: standardized score process plotted against time since randomization (0 to 5 years).]

Figure 5.17 PBC3 trial in liver cirrhosis: Checking proportional hazards using cumulative Schoenfeld
residuals for log2(bilirubin) (standardized) plotted against the time-variable.

The goodness-of-fit examinations show that linearity for albumin seems to describe the data
well, while, for bilirubin, a log-transformation is needed to obtain linearity. Furthermore,
for all three variables in the model, proportional hazards is a reasonable assumption. These
conclusions are well in line with what was seen in Sections 2.2.2 and 3.7.7. An advantage of
the general approach using cumulative residuals is that one need not specify an alternative
against which linearity or proportional hazards is tested.
5.9 Exercises

Exercise 5.1 (*) Consider the two-state reversible Markov model

                  α01(t)
   0: At risk    ------->    1: Not at risk
                 <-------
                  α10(t)

Set up the A (t) and P (s,t) matrices and express, using the Kolmogorov forward differential
equations, the transition probabilities in terms of the transition intensities (Section 5.1).

Exercise 5.2 (*) Consider the four-state model for the bone marrow transplantation study,
Figure 1.6. Set up the A (t) and P (s,t) matrices and express, using the Kolmogorov for-
ward differential equations, the transition probabilities in terms of the transition intensities
(Section 5.1).

Exercise 5.3 (*)

1. Consider the competing risks model and show that the ratio between the cause h sub-
distribution hazard and the corresponding cause-specific hazard is

$$\frac{\tilde{\alpha}_h(t)}{\alpha_h(t)} = \frac{S(t)}{1 - F_h(t)}.$$

2. Show that, thereby, proportional sub-distribution hazards and proportional cause-specific
hazards are incompatible.

Exercise 5.4 (*)

Consider the competing risks model and direct binomial regression for Qh(t0), the cause-h
cumulative incidence at time t0 (Section 5.5.5). The estimating equation (5.18) is

$$\sum_i D_i(t_0)\,\widehat{W}_i(t_0)\,\boldsymbol{A}(\boldsymbol{\beta}^{\mathsf T}\boldsymbol{Z}_i)\bigl(N_{hi}(t_0) - Q_h(t_0 \mid \boldsymbol{Z}_i)\bigr)$$

with Di(t0) the indicator I(Ti ∧ t0 ≤ Ci) of observing the state occupied at t0 and
Ŵi(t0) = 1/Ĝ((t0 ∧ Xi)−) the estimated inverse probability of no censoring (strictly) before
the minimum of t0 and the observation time Xi for subject i. The alternative estimating equation
(5.32) is

$$\sum_i \boldsymbol{A}(\boldsymbol{\beta}^{\mathsf T}\boldsymbol{Z}_i)\bigl(N_{hi}(t_0)\,D_i(t_0)\,\widehat{W}_i(t_0) - Q_h(t_0 \mid \boldsymbol{Z}_i)\bigr).$$

Show that, replacing Ĝ by the true G, both estimating equations are unbiased.

Exercise 5.5 (*) Derive the estimating equations for the landmark model (5.11).
Exercise 5.6 Consider an illness-death model for the Copenhagen Holter study with states
‘0: Alive without AF or stroke’, ‘1: Alive with AF and no stroke’, ‘2: Dead or stroke’, see
Figures 1.3 and 1.7. Examine, using a time-dependent covariate, whether this process may
be modeled as being Markovian.

Exercise 5.7 Consider the four-state model for the Copenhagen Holter study, see Figure
1.7.
1. Fit separate landmark models at times 3, 6, and 9 years for the mortality rate, including
AF, stroke, ESVEA, sex, age, and systolic blood pressure.
2. Fit landmark ‘super models’ where the coefficients vary smoothly among landmarks but
with separate baseline hazards at each landmark.
3. Fit a landmark ‘super model’ where both the coefficients and the baseline hazards vary
smoothly among landmarks.

Exercise 5.8 Consider a competing risks model for the Copenhagen Holter study with
states ‘0: Alive without AF or stroke’, ‘1: Alive with AF and no stroke’, ‘2: Dead or stroke’,
see Figures 1.2 and 1.7.
Fit, using direct binomial regression, a model for being in state 1 at time 3 years including
the covariates ESVEA, sex, age, and systolic blood pressure.

Exercise 5.9 Consider the Cox model for stroke-free survival in the Copenhagen Holter
study including the covariates ESVEA, sex, age, and systolic blood pressure (Exercises 2.4
and 3.7).
1. Investigate, using cumulative Schoenfeld residuals, whether the effects of the covariates
may be described as time-constant hazard ratios.
2. Investigate, using cumulative martingale residuals, whether the effects of age and sys-
tolic blood pressure can be considered linear on the log(hazard) scale.

Exercise 5.10 Consider the data on recurrent episodes in affective disorder, Example 1.1.5.
Fit a Mao-Lin regression model (5.31) for the mean of the composite end-point recurrent
episode or death, including initial diagnosis as the only covariate and using severity weights
equal to 1.
Chapter 6

Pseudo-values

In Sections 4.2 and 5.5, we discussed how direct regression models for marginal parameters
in a multi-state model could be set up and fitted using generalized estimating equations
(GEEs). It turned out that this could be done on a case-by-case basis and that, furthermore, it
was typically necessary to explicitly address the censoring distribution. This is because the
uncensored observations had to be re-weighted to also represent those who were censored
and this required estimation of the probability of being uncensored at the times of observed
failures. One might ask whether it would be possible to apply a more general technique
when fitting marginal models for multi-state processes. The answer to this question is ‘yes’,
under the proviso that one is content with a model for a single or a finite number of time
points. A way to do this is to apply pseudo-values (or pseudo-observations – we will use
these notions interchangeably in what follows).
The idea is as follows: With complete data, i.e., in the absence of censoring, a regres-
sion model could be set up and fitted using standard GEE using the relevant aspect of the
complete data as response variable as explained in Section 5.5.1. To model the survival
probability S(t0 | Z) in the point t0 , the survival indicator I(Ti > t0 ) would be observed for
all subjects i = 1, . . . , n and could thus be used as outcome variable in the GEE. With in-
complete data this is not possible, and in this case the pseudo-values are calculated based
on the available data and they replace the incompletely observed response variables (e.g.,
Andersen et al., 2003; Andersen and Pohar Perme, 2010). This is doable because they,
under suitable assumptions on the censoring distribution, have the correct expected value
for given covariates (Graw et al., 2009; Jacobsen and Martinussen, 2016; Overgaard et al.,
2017). The pseudo-values typically build on a non-parametric estimator for the marginal
parameter, such as the Kaplan-Meier estimator for the survival function in the two-state
model (Sections 4.1.1 or 5.1.1) or the Aalen-Johansen estimator for the competing risks
cumulative incidence (Sections 4.1.2 or 5.1.2). Thereby, censoring is dealt with once and
for all leaving us with a set of n observations which are approximately independent and
identically distributed (i.i.d.). Note that, while right-censoring may be handled in this way,
data with left-truncation are typically harder to deal with (Parner et al., 2023).
In Section 6.1, the basic idea is presented in an intuitive way with several examples and in
Section 6.2, more mathematical details are provided. Section 6.3 presents a fast approxi-
mation to calculation of pseudo-values, and Section 6.4 gives a brief account of how to use
cumulative residuals when assessing goodness-of-fit of models fitted to pseudo-values.

6.1 Intuition
6.1.1 Introduction
The set-up is as follows: V (t) is a multi-state process, and interest focuses on a marginal
parameter which is the expected value, E( f (V )) = θ , say, of some function f of the process.
Examples include the following:
• V (t) is the two-state process for survival data, Figure 1.1, and θ is the state 0 occupation
probability Q0 (t0 ) at a fixed time point t0 , i.e., the survival probability S(t0 ) = P(T > t0 )
at that time.
• V (t) is the competing risks process, Figure 1.2, and θ is the state h, h > 0 occupation
probability Qh (t0 ) at a fixed time point t0 , i.e., the cause-h cumulative incidence Fh (t0 ) =
P(T ≤ t0 , D = h) at that time.
• V (t) is the two-state process for survival data, Figure 1.1, and θ is the expected time
ε0 (τ) spent in state 0 before a fixed time point τ, i.e., the τ-restricted mean survival
time.
• V (t) is the competing risks process, Figure 1.2, and θ is the expected time εh (τ) spent in
state h > 0 up to a fixed time point τ, i.e., the cause-h specific time lost due to that cause
before time τ.
• V (t) is a recurrent events process, Figures 1.4-1.5, and θ is the expected number, µ(t0 ) =
E(N(t0 )) of events at a fixed time point t0 .
In this section, we present the idea for the first of these examples. The other examples follow
the same lines and more discussion is provided in Sections 6.1.2-6.1.7. We are interested in
a regression model for the survival function at time t0 , S(t0 | Z), that is, the expected value
of the survival indicator f (T ) = I(T > t0 ) given covariates Z

S(t0 | Z) = E(I(T > t0 ) | Z).

One typical model for this could be what corresponds to a Cox model, i.e.,

log(− log S(t0 | Z)) = β0 + LP,

where the intercept is β0 = log(A0 (t0 )), the log(cumulative baseline hazard) at time t0 , and
LP is the linear predictor LP = β1 Z1 + · · · + β p Z p . Another would correspond to an additive
hazard model
− log(S(t0 | Z))/t0 = β0 /t0 + LP
with β0 = A0 (t0 ), the cumulative baseline hazard at t0 . In general, some function g, the
link function, of the marginal parameter θ is the linear predictor. Note that such a model is
required to hold only at time t0 and not at all time points.
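
To make the role of the link function concrete, the sketch below inverts the two models above,
turning an intercept and a linear predictor into a survival probability at t0. The function names
and the numerical values are hypothetical and serve only to show the algebra.

```python
import math

def cloglog(s):
    """Link for the Cox-type model: log(-log S)."""
    return math.log(-math.log(s))

def surv_cox_type(beta0, lp):
    """Invert log(-log S(t0 | Z)) = beta0 + LP."""
    return math.exp(-math.exp(beta0 + lp))

def surv_additive_type(beta0, lp, t0):
    """Invert -log(S(t0 | Z)) / t0 = beta0 / t0 + LP."""
    return math.exp(-(beta0 + t0 * lp))

# Hypothetical values: intercepts, linear predictors, and t0 = 2 years
s_cox = surv_cox_type(-1.0, 0.4)
s_add = surv_additive_type(0.3, 0.05, 2.0)
print(round(s_cox, 3), round(s_add, 3))
```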
We first consider the unrealistic situation without censoring, i.e., survival times T1 , . . . , Tn
are observed and so are the t0 -survival indicators f (Ti ) = I(Ti > t0 ), i = 1, . . . , n. This situa-
tion serves as motivation for the way in which pseudo-observations are defined and, in this
situation, two facts can be noted:
1. The marginal mean E(f(T)) = E(I(T > t0)) = S(t0) can be estimated as a simple average

$$\widehat{S}(t_0) = \frac{1}{n}\sum_i I(T_i > t_0).$$

2. A regression model for θ = S(t0 | Z) with link function g can be analyzed using GEE with
f (T1 ) = I(T1 > t0 ), . . . , f (Tn ) = I(Tn > t0 ) as responses. This is a standard generalized
linear model for a binary outcome with link function g.
Let $\widehat{S}^{-i}$ be the estimator for S without observation i, i.e.,

$$\widehat{S}^{-i}(t_0) = \frac{1}{n-1}\sum_{j \ne i} I(T_j > t_0).$$

We now have that

$$
\begin{aligned}
n \cdot \widehat{S}(t_0) &= f(T_1) + \cdots + f(T_{i-1}) + f(T_i) + f(T_{i+1}) + \cdots + f(T_n),\\
(n-1) \cdot \widehat{S}^{-i}(t_0) &= f(T_1) + \cdots + f(T_{i-1}) + f(T_{i+1}) + \cdots + f(T_n),
\end{aligned}
$$

i.e.,

$$n \cdot \widehat{S}(t_0) - (n-1) \cdot \widehat{S}^{-i}(t_0) = f(T_i).$$
Thus, the ith observation can be re-constructed by combining the marginal estimator based
on all observations and that obtained without observation no. i.
We now turn to the realistic scenario where some survival times are incompletely observed
because of right-censoring, i.e., the available data are (Xi , Di ), i = 1, . . . , n where Xi is the
ith time of observation, the smaller of the true survival time Ti and the censoring time
Ci , and Di is 1 if Ti is observed and 0 if the ith observation is censored. In this case it is
still possible to estimate the marginal survival function, namely using the Kaplan-Meier
estimator, Ŝ, given by Equation (4.3). Based on this, we can calculate the quantity

$$\theta_i = n \cdot \widehat{S}(t_0) - (n-1) \cdot \widehat{S}^{-i}(t_0), \tag{6.1}$$

where $\widehat{S}^{-i}(t_0)$ is the estimator (now Kaplan-Meier) applied to the sample of size n − 1
obtained by eliminating observation no. i from the full sample. The θi , i = 1, . . . , n given by
(6.1) are the pseudo-observations for the incompletely observed survival indicators f (Ti ) =
I(Ti > t0 ), i = 1, . . . , n.
Note that pseudo-values are computed for all subjects – whether the survival time was
observed or only a censored observation was available.
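
The construction in (6.1) can be sketched in a few lines. The code below is an illustration under
simplifying assumptions, not an efficient or recommended implementation: the function names and
toy data are hypothetical, and the Kaplan-Meier estimator is recomputed from scratch for every
leave-one-out sample.

```python
def kaplan_meier(times, events, t0):
    """Kaplan-Meier estimate of S(t0) from right-censored data (X_i, D_i)."""
    surv = 1.0
    failure_times = sorted({x for x, d in zip(times, events) if d == 1 and x <= t0})
    for u in failure_times:
        at_risk = sum(1 for x in times if x >= u)
        failures = sum(1 for x, d in zip(times, events) if x == u and d == 1)
        surv *= 1.0 - failures / at_risk
    return surv

def pseudo_values(times, events, t0):
    """theta_i = n * S_hat(t0) - (n - 1) * S_hat^{-i}(t0), as in (6.1)."""
    n = len(times)
    full = kaplan_meier(times, events, t0)
    values = []
    for i in range(n):
        # Leave out observation i and re-estimate
        loo_times = times[:i] + times[i + 1:]
        loo_events = events[:i] + events[i + 1:]
        values.append(n * full - (n - 1) * kaplan_meier(loo_times, loo_events, t0))
    return values

# Without censoring, the pseudo-values reproduce the survival indicators I(T_i > t0)
complete = pseudo_values([1, 2, 3, 4, 5], [1, 1, 1, 1, 1], 2.5)

# With censoring (D_i = 0), every subject still gets a pseudo-value
censored = pseudo_values([1, 2, 3, 4, 5], [1, 0, 1, 0, 1], 3.5)
print([round(v, 3) for v in complete])
print([round(v, 3) for v in censored])
```

In the second call, the subject censored at time 2 receives a pseudo-value strictly between 0 and
1, reflecting that the survival status at t0 is unknown for that subject.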
The idea is now, first, to transform the data (Xi , Di ), i = 1, . . . , n into θi , i = 1, . . . , n using
(6.1), i.e., to add one more variable to each of the n lines in the data set and, next, to an-
alyze a regression model for θ by using θi , i = 1, . . . , n as responses in a GEE with the
desired link function, g. Here, typically, a normal error distribution is specified since this
will enable the correct estimating equations to be set up for the mean value (in spite of
the fact that the distribution of pseudo-values is typically far from normal). Such a proce-
dure will provide estimators for the parameters in the regression model S(t0 | Z) that have
been shown to be mathematically well-behaved if the distribution of the censoring times Ci
does not depend on the covariates, Z (Graw et al., 2009; Jacobsen and Martinussen, 2016;
Overgaard et al., 2017, 2023). The situation with covariate-dependent censoring is more
complex and will be discussed in Section 6.1.8. The standard sandwich estimator based on
the GEE is most often used, though this may be slightly conservative, i.e., a bit too large,
as will be explained in Section 6.2.
The use of pseudo-values for fitting marginal models for multi-state parameters has a num-
ber of attractive features:
1. It can be used quite generally for marginal multi-state parameters whenever a suitable
estimator θ̂ for the marginal mean θ = E(f(V)) is available.
2. It provides us with a set of new variables θ1 , . . . , θn for which standard models for com-
plete data can be analyzed.
3. It provides us with a set of new variables θ1 , . . . , θn to which various plotting techniques
are applicable.
4. If interest focuses on a single time point t0 , then a specification of a model for other time
points is not needed.
A number of difficulties should also be mentioned:
1. If censoring depends on covariates, then modifications of the method are necessary.
2. It only provides a model at a fixed point in time t0 , or as we shall see just below, at a
number of fixed points in time t1 , . . . ,tm , and these time points need to be specified.
3. The base estimator needs to be re-calculated n + 1 times, and if this computation is
involved and/or n is large, then obtaining the n pseudo-values may be cumbersome.
A multivariate model for S(t1 | Z), . . . , S(tm | Z) at a number, m of time points t1 , . . . ,tm can
be analyzed in a similar way. The response in the resulting GEE is m-dimensional and a
joint model for all time points is considered. The model could be what corresponds to a
Cox model, i.e.,
log(− log S(t j | Z)) = β0 j + LP,
with β0 j = log(A0 (t j )), j = 1, . . . , m, the log(cumulative baseline hazard) at t j .

PBC3 trial in liver cirrhosis

As an example we will study the PBC3 trial (Example 1.1.1) and pseudo-observations for
the indicator of no failure from medical treatment. For illustration, we first fix two subjects
and plot the pseudo-observations for those subjects as a function of time. We choose one
subject with a failure at 1 year (X = 366 days, D = 1) and one with a censored observation
at 1 year (X = 365 days, D = 0), see Figure 6.1. It is seen that, for t < 1 year, the pseudo-
observations for the two subjects coincide. This is because an observation time at X (here
1 year) has the same impact on the Kaplan-Meier estimator $\hat{S}(t)$ for t < X whether or not
the observation time corresponds to a failure or a censoring – in both cases, the subject
is a member of the risk set R(t). The pseudo-values before 1 year are (slightly) above
1. For t ≥ X, however, the pseudo-values for the two subjects differ. The resulting curve
for the failure is seen to be a ‘caricature’ of the indicator I(1 year > t) (however, with
[Figure 6.1: pseudo-values (vertical axis) plotted against time since randomization in years (horizontal axis).]

Figure 6.1 PBC3 trial in liver cirrhosis: Pseudo-values for the survival indicator I(T > t) as a
function of follow-up time t for two subjects: A failure at T = 1 year (dashed) and a censoring at
C = 1 year (dotted).

negative values for t > 1 that increase towards 0), whereas, for the censored observation, the
pseudo-values decrease (without reaching 0). Even though the pseudo-values for I(T > t)
go beyond the interval [0, 1], they have approximately the correct conditional expectation
given covariates, i.e.,
E(I(Ti > t) | Zi ) ≈ E(θi | Zi )
if the censoring distribution is independent of covariates. This is why they can be used as
responses in a GEE for S(t | Z).
We next fix time and show what the pseudo-observations for all subjects in the data
set look like at selected time points. For illustration, we compute pseudo-values at times
(t1 ,t2 ,t3 ) = (1, 2, 3) years. Figures 6.2a-6.2c show the results which are equivalent to what
was seen in Figure 6.1. For an observed failure time at T ≤ t j the pseudo-value is negative,
for an observed censoring time at C ≤ t j the pseudo-value is between 0 and 1, and for an
observation time X > t j (both failures and censorings) the pseudo-value is slightly above 1.
We next show how pseudo-observations can be used when fitting models for S(t0 | Z) and
look at t0 = 2 years as an example. To assess models including only a single quantitative
covariate like bilirubin or log2 (bilirubin), scatter-plots may be used much like in simple
linear regression models (Andersen and Skovgaard, 2010, ch. 4). Figure 6.3 (left panel)
shows pseudo-values plotted against bilirubin. Note that adding a scatter-plot smoother to
the plot is crucial – much like when plotting binary outcome data. The resulting curve
[Figure 6.2, panels (a) I(Ti > t1 = 1 year), (b) I(Ti > t2 = 2 years), and (c) I(Ti > t3 = 3 years): pseudo-values (vertical axis) plotted against Xi in years (horizontal axis).]

Figure 6.2 PBC3 trial in liver cirrhosis: Pseudo-values for the survival indicator I(Ti > tl = l years),
l = 1, 2, 3 for all subjects, i, plotted against the observation time Xi (failures: o, censorings: x).
[Figure 6.3: left panel, pseudo-values against bilirubin; right panel, log(−log(predicted pseudo-values)) against bilirubin.]

Figure 6.3 PBC3 trial in liver cirrhosis: Pseudo-values for the survival indicator I(Ti > 2 years)
for all subjects, i, plotted against the covariate Zi = bilirubin with a scatter-plot smoother super-
imposed (left); in the right panel, the smoother is transformed with the cloglog link function.

[Figure 6.4: left panel, pseudo-values against log2(bilirubin); right panel, log(−log(predicted pseudo-values)) against log2(bilirubin).]

Figure 6.4 PBC3 trial in liver cirrhosis: Pseudo-values for the survival indicator I(Ti > 2 years)
for all subjects, i, plotted against the covariate Zi = log2 (bilirubin) with a scatter-plot smoother
super-imposed (left); in the right panel, the smoother is transformed with the cloglog link function.

should be linear on the scale of the link function

cloglog(1 − S(t0 )) = log(− log(S(t0 )))

and Figure 6.3 (right panel) shows the smoother after this transformation. It is seen that
linearity does not describe the association well. Plotting, instead, against log2 (bilirubin)
(Figure 6.4) shows that using a linear model in this scale is not contra-indicated.
We can then fit a model for S(t0 | Z1 , Z2 , log2 (Z3 )) with Z1 , the indicator for CyA treatment,
Z2 = albumin, and Z3 = bilirubin using the pseudo-values at t0 = 2 years as the outcome
variable and using the cloglog link. Table 6.1 (left panel) shows the results. Compared
Table 6.1 PBC3 trial in liver cirrhosis: Estimated coefficients (and robust SD) from models for the
survival function with linear effects of albumin and log2 (bilirubin) based on pseudo-values. The
cloglog link function was used, and the SD values are based on the sandwich formula.

                                     A single time point       Three time points
                                     t0 = 2                    (t1, t2, t3) = (1, 2, 3)
Covariate                            $\hat{\beta}$   SD        $\hat{\beta}$   SD
Treatment         CyA vs. placebo    -0.705          0.369     -0.599          0.287
Albumin           per 1 g/L          -0.105          0.034     -0.094          0.026
log2(bilirubin)   per doubling        0.836          0.140      0.684          0.092

with the Cox model results in Table 2.7, it is seen that the estimated coefficients based
on the pseudo-values are similar. The SD values are somewhat larger which should be no
surprise since the Cox models use all data, whereas the pseudo-values concentrate on a
single point in time. A potential advantage of using the pseudo-observations is that if inter-
est does focus on a single time point, then they, in contrast to a Cox model, avoid making
modeling assumptions about the behavior at other time points. In Table 6.1 (right panel),
results are shown for a joint model for pseudo-values at times (t1 ,t2 ,t3 ) = (1, 2, 3) years:
log(− log S(t j | Z)) = β0 j + LP, j = 1, 2, 3. Now, the results are closer to those based on
the Cox model in Table 2.7. In particular, values of SD are smaller than when based on
pseudo-values at a single point in time and simulation studies (e.g., Andersen and Pohar
Perme, 2010) have shown that the SD does tend to get smaller when based on more time
points; however, more than m ∼ 5 time points will typically not add much to the preci-
sion. The model for more time points is fitted by adding the time points at which pseudo-
values are computed as a categorical covariate and the output then also includes estimates
($\hat{\beta}_{01}$, $\hat{\beta}_{02}$, $\hat{\beta}_{03}$) of the Cox log(cumulative baseline hazard) at times (t1 ,t2 ,t3 ). Note that, in
such a model, non-proportional hazards (at the chosen time points) correspond to interactions
with this categorical time-variable. For the models based on pseudo-values, the SD
values are obtained using sandwich estimators from the GEE. These have been shown to
be slightly conservative but typically not seriously biased; see also Section 6.2.
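How such a fit can be carried out is sketched below: the estimating equation for the cloglog model is solved by Gauss-Newton iterations and followed by a sandwich covariance estimate. This is a minimal illustration under stated assumptions, not the software used for Table 6.1: an independence working correlation with unit working variance is assumed, and the names `fit_cloglog_gee`, `theta`, `Z`, and `cluster` are hypothetical. For several time points, the rows of `Z` would be stacked over time points, with one indicator column per time point playing the role of the intercepts β0j, and `cluster` would hold the subject identifier.

```python
import numpy as np

def fit_cloglog_gee(theta, Z, cluster=None, n_iter=100, tol=1e-10):
    """Solve sum_i D_i (theta_i - mu_i) = 0 with mu_i = exp(-exp(Z_i' beta)).

    theta   : pseudo-values used as responses
    Z       : design matrix including intercept column(s)
    cluster : optional subject identifiers (rows from the same subject share an id)
    Returns (beta_hat, sandwich_covariance).
    """
    theta = np.asarray(theta, dtype=float)
    Z = np.asarray(Z, dtype=float)
    n, p = Z.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        eta = Z @ beta
        mu = np.exp(-np.exp(eta))
        # Derivative of mu with respect to beta: -exp(eta) * mu * Z.
        D = (-np.exp(eta) * mu)[:, None] * Z
        step = np.linalg.solve(D.T @ D, D.T @ (theta - mu))
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    eta = Z @ beta
    mu = np.exp(-np.exp(eta))
    D = (-np.exp(eta) * mu)[:, None] * Z
    scores = D * (theta - mu)[:, None]        # per-row contributions to U(beta)
    if cluster is None:
        cluster = np.arange(n)                # every row is its own cluster
    _, group = np.unique(np.asarray(cluster), return_inverse=True)
    summed = np.zeros((group.max() + 1, p))
    np.add.at(summed, group, scores)          # sum the scores within each subject
    bread = np.linalg.inv(D.T @ D)
    cov = bread @ (summed.T @ summed) @ bread # sandwich formula
    return beta, cov

# Noise-free illustration: responses generated exactly from the model.
z = np.linspace(-1.0, 1.0, 21)
Z_demo = np.column_stack([np.ones_like(z), z])
beta_demo, cov_demo = fit_cloglog_gee(np.exp(-np.exp(Z_demo @ np.array([-1.0, 0.5]))), Z_demo)
```

With real pseudo-values in place of the noise-free responses, the square roots of the diagonal of the returned covariance matrix play the role of the robust SD values reported in the tables.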
Another type of plot which is applicable when assessing a regression model based on
pseudo-observations is a residual plot. Figure 6.5 shows residuals from the model in Table
6.1 (right panel), i.e.,
$$r_{ij} = \theta_{ij} - \exp\bigl(-\exp(\hat{\beta}_{0j} + \widehat{\mathrm{LP}}_i)\bigr)$$

plotted against log2 (bilirubin) for subject i. Here, j = 1, 2, 3 refers to the three time points
(t1 ,t2 ,t3 ). A smoother has been superimposed for each j and it is seen that the residuals
vary roughly randomly around 0 indicating a suitable fit of the model. Note that residual
plots are applicable also for multiple regression models in contrast to the scatter-plots in
Figure 6.3 (left panel). In Section 6.4 we will briefly discuss how formal significance tests
for the goodness-of-fit of regression models based on pseudo-observations may be devised
using cumulative pseudo-residuals.
[Figure 6.5: pseudo-residuals (vertical axis) plotted against log2(bilirubin) (horizontal axis), with separate smoothers for years 1, 2, and 3.]

Figure 6.5 PBC3 trial in liver cirrhosis: Pseudo-residuals for the survival indicator I(Ti > t j ) for
all subjects, i, and for (t1 ,t2 ,t3 ) = (1, 2, 3) years plotted against log2 (bilirubin). The cloglog link
function was used.

Pseudo-values
Pseudo-observations are computed once and for all at the chosen time points. This
takes care of censoring and provides us with a set of new observations that can
be used as response variables in a GEE with the desired link function. This also
enables application of various graphical techniques for data presentation and model
assessment. In the example, this provided estimates comparable to those obtained
with a Cox model. This, however, is only a ‘poor man’s Cox model’ since fitting
the full Cox model is both easier and more efficient. So, the main argument for
using pseudo-observations is the generality of the approach: The same basic ideas
apply for a number of marginal parameters in multi-state models and for several link
functions. We will demonstrate these features in Sections 6.1.2-6.1.7.

6.1.2 Hazard difference

In Section 6.1.1, the idea of pseudo-values was introduced via the PBC3 trial in liver cirrho-
sis, and regression models for the survival function at one or a few time points were studied
using the cloglog link function. The resulting regression coefficients were then comparable
to log(hazard ratios) estimated using the Cox model. If, instead, one wishes to estimate pa-
rameters with a hazard difference interpretation, then this may be achieved by modeling the
exact same pseudo-values but using instead a (minus) log link. Using bilirubin as an exam-
ple, linearity may be investigated (in a univariate model) by plotting pseudo-values against
[Figure 6.6: left panel, pseudo-values against bilirubin; right panel, −log(predicted pseudo-values) against bilirubin.]

Figure 6.6 PBC3 trial in liver cirrhosis: Pseudo-values for the survival indicator I(Ti > 2 years)
for all subjects, i, plotted against the covariate Zi = bilirubin with a scatter-plot smoother super-
imposed (left); in the right panel, the smoother is transformed with the (minus) log link function.

Table 6.2 PBC3 trial in liver cirrhosis: Estimated coefficients (and robust SD) from a model for the
survival indicator I(Ti > t j ) with linear effects of albumin and bilirubin based on pseudo-values at
(t1 ,t2 ,t3 ) = (1, 2, 3) years. The (minus) log link function was used.

Covariate                       $\hat{\beta}$   SD
Treatment   CyA vs. placebo     -0.048          0.031
Albumin     per 1 g/L           -0.0097         0.0032
Bilirubin   per 1 µmol/L         0.0042         0.0008

bilirubin. This is done in Figure 6.6 where a smoother has been superimposed (left), and in
the right-hand panel this smoother is log transformed. There seems to be a problem with
the fit for large values of bilirubin (where the smoother gets negative, thereby preventing a
log transform). Table 6.2 shows the estimated coefficients in a model for S(t j | Z) for the
three time points (t1 ,t2 ,t3 ) = (1, 2, 3) years, using the (minus) log link function and includ-
ing the covariates treatment, albumin and bilirubin. The coefficients have hazard difference
interpretations and may be compared to those seen in Table 2.10. The estimates based on
pseudo-values are seen to be similar, however, with larger SD values.
A residual plot may be used to assess the model fit, and Figure 6.7 shows the pseudo-
residuals from the model in Table 6.2 plotted against bilirubin. Judged from the smoothers,
the fit is not quite as good as that using the cloglog link.

6.1.3 Restricted mean

Turning now to the restricted mean life time, ε0 (τ) = E(min(T, τ)), pseudo-values are
based on the integrated Kaplan-Meier estimator
$$\hat{\varepsilon}_0(\tau) = \int_0^\tau \hat{S}(t)\,dt,$$
[Figure 6.7: pseudo-residuals (vertical axis) plotted against bilirubin (horizontal axis), with separate smoothers for years 1, 2, and 3.]

Figure 6.7 PBC3 trial in liver cirrhosis: Pseudo-residuals for the survival indicator I(Ti > t j ) for all
subjects, i, and for (t1 ,t2 ,t3 ) = (1, 2, 3) years plotted against bilirubin. The model used the (minus)
log link function (Table 6.2).

Table 6.3 PBC3 trial in liver cirrhosis: Estimated coefficients (and robust SD) from a linear model
(identity link function) for the τ-restricted mean life time for τ = 3 years based on pseudo-values.

Covariate                             $\hat{\beta}$   SD
Intercept                              2.83           0.35
Treatment         CyA vs. placebo      0.148          0.073
Albumin           per 1 g/L            0.023          0.0068
log2(bilirubin)   per doubling        -0.243          0.032

that is,
$$\theta_i = n\int_0^\tau \hat{S}(t)\,dt - (n-1)\int_0^\tau \hat{S}^{-i}(t)\,dt.$$
We consider the PBC3 trial and the value τ =3 years and compare with results using the
model by Tian et al. (2014). Figure 6.8 shows the scatter-plot where the pseudo-values θi
are plotted against the observation times Xi . It is seen that all observations Xi > τ give rise
to identical pseudo-values slightly above τ while observed failures before τ have pseudo-
values close to the observed Xi = Ti and censored observations before τ have values that
increase with Xi in the direction of τ.
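The pseudo-values for the restricted life time can be sketched in the same leave-one-out fashion, now with the area under the Kaplan-Meier curve as base estimator. This is a minimal illustration; the function names are hypothetical, and the sketch simply carries the last Kaplan-Meier value forward to τ if the largest observation time is a censoring before τ.

```python
import numpy as np

def km_restricted_mean(time, event, tau):
    """Area under the Kaplan-Meier curve on [0, tau]."""
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=int)
    area, surv, last = 0.0, 1.0, 0.0
    for u in np.unique(time[(event == 1) & (time <= tau)]):
        area += surv * (u - last)            # rectangle up to the next failure time
        at_risk = np.sum(time >= u)
        failures = np.sum((time == u) & (event == 1))
        surv *= 1.0 - failures / at_risk
        last = u
    return area + surv * (tau - last)        # final piece from last failure to tau

def rmst_pseudo_values(time, event, tau):
    """Pseudo-values n * eps(tau) - (n - 1) * eps^{-i}(tau)."""
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=int)
    n = len(time)
    full = km_restricted_mean(time, event, tau)
    keep = np.ones(n, dtype=bool)
    pseudo = np.empty(n)
    for i in range(n):
        keep[i] = False
        pseudo[i] = n * full - (n - 1) * km_restricted_mean(time[keep], event[keep], tau)
        keep[i] = True
    return pseudo

# Made-up uncensored data: the pseudo-values then equal min(T_i, tau).
demo = rmst_pseudo_values([0.5, 1.0, 2.0, 4.0, 5.0], [1, 1, 1, 1, 1], tau=3.0)
```

In the uncensored case all subjects with Ti > τ receive the value τ, in line with the pattern described for Figure 6.8, where censoring pushes these values slightly above τ.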
Table 6.3 shows the results from a linear model (i.e., identity as link function) for ε0 (τ |
Z), τ = 3 years, based on pseudo-observations. The results are seen to coincide quite well
with those obtained using the Tian et al. model (Table 4.4). Figure 6.9 shows a scatter-plot
of pseudo-values against log2 (bilirubin), which does not seem to contra-indicate a linear model.
[Figure 6.8: pseudo-values (vertical axis) plotted against Xi in years (horizontal axis).]

Figure 6.8 PBC3 trial in liver cirrhosis: Pseudo-values for the restricted life time min(Ti , τ) for all
subjects, i, plotted against the observation time Xi for τ =3 years: Observed failures (o), censored
observations (x).

[Figure 6.9: pseudo-values (vertical axis) plotted against log2(bilirubin) (horizontal axis).]

Figure 6.9 PBC3 trial in liver cirrhosis: Pseudo-values for the restricted life time min(Ti , τ) for
all subjects, i, plotted against log2 (bilirubin) for τ =3 years. A scatter-plot smoother has been
superimposed.
Table 6.4 PBC3 trial in liver cirrhosis: Estimated coefficients (and robust SD) from models with
logistic and cloglog link functions for the cumulative incidence of death without transplantation
before t0 = 2 years based on pseudo-values.

(a) logit link function

Covariate                            $\hat{\beta}$   SD        $\hat{\beta}$   SD
Treatment         CyA vs. placebo     0.112          0.370     -0.574          0.506
Albumin           per 1 g/L                                    -0.144          0.049
log2(bilirubin)   per doubling                                  0.713          0.188

(b) cloglog link function

Covariate                            $\hat{\beta}$   SD        $\hat{\beta}$   SD
Treatment         CyA vs. placebo     0.106          0.351     -0.519          0.425
Albumin           per 1 g/L                                    -0.114          0.037
log2(bilirubin)   per doubling                                  0.570          0.145

6.1.4 Cumulative incidence

We will now, for comparison with the direct binomial regression analyses in Section 5.8.2,
present models for the 2-year cumulative incidence of death without transplantation in the
PBC3 trial based on pseudo-observations. Table 6.4 shows the results. Comparing with
Table 5.6 it is seen that the two approaches yield very similar results – both in terms of es-
timated log(odds ratios) and of their SD. The table also shows similar results using, instead
of the logistic link function, a cloglog link function as in the Fine-Gray model. Though
the coefficients are similar, they have a different interpretation, namely log(sub-distribution
hazard ratios) rather than log(odds ratios). It is an advantage of the pseudo-value approach
(and of direct binomial regression) that several link functions may be used. However, it
should be kept in mind that the time point(s) at which pseudo-values are calculated must be
selected. Choosing, as in the previous section, time points 1, 2, and 3 years the estimated
coefficients (SD) using the cloglog link are, respectively, −0.511 (0.349) for treatment,
−0.107 (0.032) for albumin, and 0.519 (0.117) for log2 (bilirubin), i.e., similar coefficients
but somewhat smaller SD. Other aspects to consider when choosing the link function are,
as already discussed in Section 5.5.5, that predicted probabilities may exceed 1 (e.g., for
identity and log links) or may be negative (e.g., for the identity link) and that, for the same
mentioned link functions, models with a time-constant effect may be implausible.
For comparison with the Fine-Gray analyses presented in Table 4.5, we fitted pseudo-value
based models including also the covariates sex and age. Both a model using three time
points (1, 2, 3 years) and one using ten time points (0.5, 1.0,. . . , 4.5, 5.0 years) are shown
in Table 6.5. Some efficiency gain is seen when using more time points and comparing
with the Fine-Gray results, similar coefficients with somewhat increased SD are seen when
using pseudo-values.
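For the cumulative incidence, the only change to the pseudo-value computation is the base estimator. The sketch below uses a plain Aalen-Johansen estimator for competing risks; the function names and the coding of `cause` (0 for censoring, 1, 2, ... for the failure causes) are assumptions made for illustration.

```python
import numpy as np

def aalen_johansen(time, cause, t0, h):
    """Aalen-Johansen estimate of the cause-h cumulative incidence F_h(t0)."""
    time = np.asarray(time, dtype=float)
    cause = np.asarray(cause, dtype=int)
    surv, cuminc = 1.0, 0.0
    for u in np.unique(time[(cause > 0) & (time <= t0)]):
        at_risk = np.sum(time >= u)
        d_all = np.sum((time == u) & (cause > 0))
        d_h = np.sum((time == u) & (cause == h))
        cuminc += surv * d_h / at_risk       # S(u-) * dN_h(u) / Y(u)
        surv *= 1.0 - d_all / at_risk        # all-cause Kaplan-Meier update
    return cuminc

def aj_pseudo_values(time, cause, t0, h):
    """Pseudo-values n * F_h(t0) - (n - 1) * F_h^{-i}(t0)."""
    time = np.asarray(time, dtype=float)
    cause = np.asarray(cause, dtype=int)
    n = len(time)
    full = aalen_johansen(time, cause, t0, h)
    keep = np.ones(n, dtype=bool)
    pseudo = np.empty(n)
    for i in range(n):
        keep[i] = False
        pseudo[i] = n * full - (n - 1) * aalen_johansen(time[keep], cause[keep], t0, h)
        keep[i] = True
    return pseudo

# Made-up uncensored data with two causes: pseudo-values equal I(T_i <= t0, D_i = 1).
demo = aj_pseudo_values([1, 2, 3, 4, 5], [1, 2, 1, 2, 1], t0=3.5, h=1)
```

The resulting values can be used as responses in a GEE with, e.g., a logit or cloglog link applied to the cumulative incidence.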
6.1.5 Cause-specific time lost
In Section 4.2.2, we showed models for the time lost before τ = 3 years in the PBC3
trial due to transplantation or death without transplantation using estimating equations
suggested by Conner and Trinquart (2021). We will repeat these analyses using pseudo-
observations, see Table 6.6. Comparing with Table 4.6, it is seen that coefficients are quite
similar with smaller SD when based on pseudo-values. As we did in Section 4.2.2, we may
compare the coefficients from Tables 6.3 and 6.6 and notice that, for each explanatory vari-
able, the coefficient from the former table equals minus the sum of the coefficients from the
latter (e.g., for treatment we have −(−0.063 − 0.085) = 0.148).
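The relation between the two tables can be checked for all three covariates. The worked check below uses the rounded coefficients from Table 6.3 and from the adjusted models in Table 6.6; the symbols ε1(τ) and ε2(τ) for the expected time lost before τ due to transplantation and to death without transplantation, and the superscripts on β, are notation assumed here for illustration. Since the restricted mean and the cause-specific times lost add up to τ,

$$\varepsilon_0(\tau \mid Z) = \tau - \varepsilon_1(\tau \mid Z) - \varepsilon_2(\tau \mid Z),$$

linear models for the three quantities with the same covariates must have coefficients satisfying $\beta^{(0)} = -(\beta^{(1)} + \beta^{(2)})$ for each covariate:

$$\begin{aligned}
\text{Treatment:} \quad & -(-0.063 - 0.085) = 0.148, \\
\text{Albumin:} \quad & -(-0.001 - 0.022) = 0.023, \\
\log_2(\text{bilirubin}): \quad & -(0.100 + 0.143) = -0.243.
\end{aligned}$$

All three values agree with the coefficients in Table 6.3.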

6.1.6 Non-Markov transition probabilities

This example continues the example in Section 5.8.1 and reports some further results on
data from the PROVA trial in liver cirrhosis (Example 1.1.4) presented by Andersen et al.
(2022). We look at models for the probability P01 (s,t) of being alive in the bleeding state
1 at time t > 1 year given alive without bleeding at time s = 1 year and base the analy-
ses on pseudo-observations for the indicator I(Vi (t) = 1). These are based on estimators
$\hat{P}_{01}(1,t)$ using landmarking or plug-in. As an example, we will study how this probability
depends on whether sclerotherapy was given or not. Figure 6.10 shows the landmark Aalen-
Johansen estimators for P01 (1,t) in the two treatment groups and the two curves seem to be
rather close. This tendency is also seen when computing pseudo-values at time t = 2 years
(close to the median of observed transition times) based on different base estimators for
the transition probability and fitting a model with a log link, including only an indicator for
sclerotherapy. Table 6.7 shows the resulting estimates of the treatment effect at that time
using as base estimators, respectively, the landmark Aalen-Johansen or Pepe estimators or
different plug-in estimators. For the plug-in estimators, there is a choice between different
data sets on which estimation can be based, and this could affect the efficiency. The model
could be based on fitting the intensity models to the entire data set, the landmark data set,
or to the data set consisting of all patients who were still at risk at time s = 1 year. All esti-
mated coefficients provide a log(relative risk) close to 0. Using the ‘at-risk’ data set appears
to be associated with the smallest SD.

Table 6.5 PBC3 trial in liver cirrhosis: Estimated coefficients (and robust SD) from models for the
cumulative incidence of death without transplantation. Models use the cloglog link function based
on pseudo-values at either 3 or 10 time points.

                                     3 time points             10 time points
Covariate                            $\hat{\beta}$   SD        $\hat{\beta}$   SD
Treatment         CyA vs. placebo    -0.272          0.337     -0.413          0.318
Albumin           per 1 g/L          -0.076          0.033     -0.038          0.032
log2(Bilirubin)   per doubling        0.666          0.121      0.669          0.118
Sex               male vs. female    -0.502          0.400     -0.855          0.388
Age               per year            0.073          0.022      0.096          0.022
Table 6.6 PBC3 trial in liver cirrhosis: Estimated coefficients (and robust SD) from linear models
for time lost (in years) due to transplantation or to death without transplantation before τ = 3 years.
Analyses are based on pseudo-values.

(a) Transplantation

Covariate                            $\hat{\beta}$   SD        $\hat{\beta}$   SD
Treatment         CyA vs. placebo    -0.056          0.051     -0.063          0.046
Albumin           per 1 g/L                                    -0.001          0.004
log2(Bilirubin)   per doubling                                  0.100          0.026

(b) Death without transplantation

Covariate                            $\hat{\beta}$   SD        $\hat{\beta}$   SD
Treatment         CyA vs. placebo    -0.015          0.073     -0.085          0.069
Albumin           per 1 g/L                                    -0.022          0.007
log2(Bilirubin)   per doubling                                  0.143          0.032

6.1.7 Recurrent events

Furberg et al. (2023) studied bivariate pseudo-values in the context of recurrent events
with competing risks, with examples from the LEADER trial (Example 1.1.6). That is,
a joint model based on pseudo-values for both the mean of the number, N(t), of recur-
rent myocardial infarctions and that of the survival indicator I(TD > t) was proposed. This
enables joint inference for the treatment effects on both components. For the time points
(t1 ,t2 ,t3 ) = (20, 30, 40) months, the following models were studied

log(E(N(t j ) | Z)) = log(µ0 (t j )) + βR Z

and
log(− log(E(I(TD > t j ) | Z))) = log(A0 (t j )) + βS Z,
where Z is the indicator for treatment with liraglutide. Joint estimation of the treatment
effects (βR , βS ) was based on pseudo-values for (N(t j ), I(TD > t j )), j = 1, 2, 3, using the

Table 6.7 PROVA trial in liver cirrhosis: Estimated coefficients (log(relative risk)), (with robust SD)
of sclerotherapy (yes vs. no) on the probability P01 (1,t) of being alive in the bleeding state at time
t = 2 years given alive in the initial state at time s = 1 year based on pseudo-values using different
base estimators.

Base estimator                           $\hat{\beta}$   SD
Landmark Pepe                            -0.151          0.925
Landmark Aalen-Johansen                  -0.261          0.882
Plug-in linear, complete                 -0.849          1.061
Plug-in duration scale, complete data    -0.636          0.832
Plug-in linear, at-risk data             -0.650          0.674
Plug-in linear, landmark data            -0.079          0.920
[Figure 6.10: probability (vertical axis) plotted against time since randomization in years (horizontal axis), with separate curves for sclerotherapy No and Yes.]

Figure 6.10 PROVA trial in liver cirrhosis: Landmark Aalen-Johansen estimators for the probability
P01 (1,t) of being alive in the bleeding state at time t > 1 year among patients observed to be in
the initial state at time s = 1 year. Separate curves are estimated for patients treated or not with
sclerotherapy.

sandwich estimator to estimate the SD and correlations of ($\hat{\beta}_R$, $\hat{\beta}_S$). Figure 6.11 shows the
non-parametric Cook-Lawless estimates for E(N(t) | Z) and the Kaplan-Meier estimates
for S(t | Z), and Table 6.8 shows the results from the bivariate pseudo-value regression.
The estimated regression coefficients are close to those based on separate Ghosh-Lin and
Cox models quoted in Section 4.2.3, i.e., −0.159 (SD = 0.088) for the log(mean ratio)
and −0.166 (SD = 0.070) for the log(hazard ratio). From the estimated joint distribution
of ($\hat{\beta}_R$, $\hat{\beta}_S$), i.e., a bivariate normal distribution with SD and correlation equal to the values
from Table 6.8, it is possible to conduct a bivariate Wald test for the hypothesis (βR , βS ) =
(0, 0). The 2 DF Wald statistic takes the value 8.138 corresponding to P = 0.017.
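The bivariate Wald test can be sketched directly from the five numbers in Table 6.8. This is a minimal illustration with a hypothetical function name; because the table entries are rounded to three decimals, the statistic computed below (about 8.19) differs slightly from the value 8.138 quoted in the text, while the P-value still rounds to 0.017.

```python
import math

def bivariate_wald(b1, b2, sd1, sd2, corr):
    """Wald statistic b' V^{-1} b for a 2-dimensional estimate, with its P-value.

    V is the covariance matrix built from the two SDs and the correlation.
    The closed form below is the quadratic form written out for the 2 x 2 case.
    """
    z1, z2 = b1 / sd1, b2 / sd2
    stat = (z1 ** 2 + z2 ** 2 - 2.0 * corr * z1 * z2) / (1.0 - corr ** 2)
    # The chi-square distribution with 2 DF has survival function exp(-x / 2).
    p_value = math.exp(-stat / 2.0)
    return stat, p_value

# Rounded estimates from Table 6.8.
stat, p_value = bivariate_wald(-0.218, -0.163, 0.097, 0.082, 0.100)
```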

Table 6.8 LEADER cardiovascular trial in type 2 diabetes: Parameter estimates (with robust SD) for
treatment (liraglutide vs. placebo) from a bivariate pseudo-value model with recurrent myocardial
infarctions (R) and overall survival (S) at three time points at (20,30,40) months.

$\hat{\beta}_R$   $\hat{\beta}_S$   SD($\hat{\beta}_R$)   SD($\hat{\beta}_S$)   Corr($\hat{\beta}_R$, $\hat{\beta}_S$)
-0.218            -0.163            0.097                 0.082                 0.100

6.1.8 Covariate-dependent censoring

In this section, we will briefly explain the needed modifications to the approach (for the
survival indicator I(T > t0 )) when there is covariate-dependent censoring. It can be shown

Figure 6.11 LEADER cardiovascular trial in type 2 diabetes: Cook-Lawless estimates of the mean
number of recurrent myocardial infarctions (left) and Kaplan-Meier estimates of the survival func-
tion (right), by treatment group.

(see, e.g., Binder et al., 2014) that when survival data (Xi , Di ), i = 1, . . . , n are available then
the Kaplan-Meier estimator may, alternatively, be written using IPCW

$$\hat{S}(t) = 1 - \frac{1}{n}\sum_{i=1}^{n} \frac{N_i(t)}{\hat{G}(X_i-)}, \tag{6.2}$$

where, as previously, the ith counting process is Ni (t) = I(Xi ≤ t, Di = 1) and $\hat{G}$ is the
Kaplan-Meier estimator for the distribution of censoring times. When censoring times depend
on covariates Z, this motivates another estimator for S(t), namely

$$\hat{S}_c(t) = 1 - \frac{1}{n}\sum_{i=1}^{n} \frac{N_i(t)}{\hat{G}(X_i- \mid Z_i)}, \tag{6.3}$$

with $\hat{G}(t \mid Z)$ now based on a regression model for the censoring distribution, e.g., a Cox
model. For this situation, pseudo-values θi for the survival indicator can be based on $\hat{S}_c(t)$

$$\theta_i = n \cdot \hat{S}_c(t) - (n-1) \cdot \hat{S}_c^{-i}(t)$$

and used as explained in Section 6.1.1. If the model for the censoring distribution is cor-
rectly specified, then the resulting estimators have the desired properties (Overgaard et al.,
2019). Similar modifications may be applied to other estimators on which pseudo-values
are based.
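The IPCW representation (6.2) can be checked numerically with the following minimal sketch, which assumes data without ties between failure and censoring times. The function names are hypothetical. The optional argument `g_left` is an assumption made for illustration: it lets the user supply values of Ĝ(Xi− | Zi) from a separately fitted censoring model, which turns the same function into the estimator (6.3).

```python
import numpy as np

def km_left_limit(time, indicator, u):
    """Product-limit estimate just before u for the events flagged by indicator."""
    surv = 1.0
    for v in np.unique(time[(indicator == 1) & (time < u)]):
        surv *= 1.0 - np.sum((time == v) & (indicator == 1)) / np.sum(time >= v)
    return surv

def ipcw_survival(time, event, t, g_left=None):
    """IPCW estimate 1 - (1/n) * sum_i N_i(t) / G(X_i-).

    If g_left is None, G is the Kaplan-Meier estimator for censoring (Equation (6.2));
    otherwise g_left[i] is used as the censoring weight for subject i (Equation (6.3)).
    """
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=int)
    n = len(time)
    total = 0.0
    for i in range(n):
        if event[i] == 1 and time[i] <= t:   # N_i(t) = 1 only for failures before t
            if g_left is None:
                weight = km_left_limit(time, 1 - event, time[i])
            else:
                weight = g_left[i]
            total += 1.0 / weight
    return 1.0 - total / n

# Made-up data: failures at 1, 3, 4, 6 and censorings at 2, 5.
s_ipcw = ipcw_survival([1, 2, 3, 4, 5, 6], [1, 0, 1, 1, 0, 1], t=3.5)
```

For these data the ordinary Kaplan-Meier estimate at t = 3.5 is (1 − 1/6)(1 − 1/4) = 0.625, and the IPCW form gives the same value.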

6.2 Theoretical properties (*)

Theoretical properties of methods based on pseudo-observations have been derived by,
among others, Graw et al. (2009), Jacobsen and Martinussen (2016), and by Overgaard
et al. (2017). In this section we will give a sketch of how these properties may be derived
using the Aalen-Johansen estimator as an example.
Recall that the pseudo-values build on an estimator for the marginal parameter of interest,
such as the Kaplan-Meier or the Aalen-Johansen estimator. The basic idea in the derivation
of the properties is to consider this basic estimator as a functional (i.e., a function of functions)
of certain empirical processes. In the following, we will indicate how this works for
the Aalen-Johansen estimator for the cause-h cumulative incidence $F_h(t) = \int_0^t S(u)\,dA_h(u)$.
The estimator is (Equation (5.4))

$$\hat{F}_h(t) = \int_0^t \hat{S}(u-)\,d\hat{A}_h(u)$$

with $d\hat{A}_h(u) = dN_h(u)/Y(u)$ being the jumps in the Nelson-Aalen estimator of the cause-h
specific cumulative hazard (Equation (3.10)) and $\hat{S}(u)$ the all-cause Kaplan-Meier estimator
(Equation (4.3)). Now, by Equation (4.15), the fraction of subjects still at risk at time t−,
i.e., Y (t)/n, can be re-written as

$$\frac{1}{n}\,Y(t) = \hat{S}(t-)\,\hat{G}(t-),$$

leading to the IPCW version of the Aalen-Johansen estimator

$$\hat{F}_h(t) = \int_0^t \frac{1}{\hat{G}(u-)}\,dN_h(u)/n,$$

compare Section 6.1.8. Here, as previously, $\hat{G}$ is the Kaplan-Meier estimator for the censoring
distribution. The empirical processes on which the estimator is based are

$$\hat{H}_Y(t) = \frac{1}{n}\sum_i Y_i(t)$$

and

$$\hat{H}_h(t) = \frac{1}{n}\sum_i N_{hi}(t), \quad h = 0, 1, \ldots, k,$$

where N0i is the counting process for censoring, N0i (t) = I(Xi ≤ t, Di = 0). With this notation,
the cumulative censoring hazard is estimated by the Nelson-Aalen estimator

$$\hat{A}_0(t) = \int_0^t \frac{1}{\hat{H}_Y(u)}\,d\hat{H}_0(u)$$

and the corresponding survival function G by the product-integral

$$\hat{G}(t) = \prod_{[0,t]} \bigl(1 - d\hat{A}_0(u)\bigr).$$

Since observations for i = 1, . . . , n are independent, it follows by the law of large numbers
that $\hat{H} = (\hat{H}_Y, \hat{H}_0, \hat{H}_1, \ldots, \hat{H}_k)$ converges to a certain limit η = (ηY , η0 , η1 , . . . , ηk ). The
Aalen-Johansen estimator $\hat{F}_h$ is a certain functional, say φ, of $\hat{H}$ and the true value Fh (t) = θ
is the same functional applied to η.
A smooth functional such as φ may be Taylor (von Mises) expanded

$$\phi(\hat{H}) \approx \phi(\eta) + \frac{1}{n}\sum_i \dot{\phi}(X_i^*) = \theta + \frac{1}{n}\sum_i \dot{\phi}(X_i^*)$$

where Xi∗ = (Xi , Di ) is the data point for subject i and $\dot{\phi}$ is the first order influence function
for φ (·). This is defined by

$$\dot{\phi}(x) = \frac{d}{du}\,\phi\bigl((1-u)\eta + u\delta_x\bigr)\Big|_{u=0} = \phi'_{\eta}(\delta_x - \eta),$$

i.e., the derivative of φ at η in the direction δx − η where δx is Dirac’s delta, δx (y) = I(y = x)
(e.g., Overgaard et al., 2017).
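We can now approximate the pseudo-observation for I(Ti ≤ t, Di = h)

$$\begin{aligned}
\theta_i &= n\hat{F}_h(t) - (n-1)\hat{F}_h^{-i}(t) \\
&= n\phi(\hat{H}) - (n-1)\phi(\hat{H}^{-i}) \\
&\approx n\Bigl(\phi(\eta) + \frac{1}{n}\sum_{\ell}\dot{\phi}(X_\ell^*)\Bigr) - (n-1)\Bigl(\phi(\eta) + \frac{1}{n-1}\sum_{\ell \neq i}\dot{\phi}(X_\ell^*)\Bigr),
\end{aligned}$$

i.e.,

$$\theta_i \approx \theta + \dot{\phi}(X_i^*). \tag{6.4}$$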
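The cancellation leading to (6.4) can be spelled out in one further step. Writing ℓ for the summation index (a notational choice made here to keep it distinct from the fixed subject i) and multiplying out the two brackets of the expansion gives

$$\begin{aligned}
\theta_i &\approx n\phi(\eta) + \sum_{\ell}\dot{\phi}(X_\ell^*) - (n-1)\phi(\eta) - \sum_{\ell \neq i}\dot{\phi}(X_\ell^*) \\
&= \phi(\eta) + \Bigl(\sum_{\ell}\dot{\phi}(X_\ell^*) - \sum_{\ell \neq i}\dot{\phi}(X_\ell^*)\Bigr) \\
&= \theta + \dot{\phi}(X_i^*),
\end{aligned}$$

since φ(η) = θ and the two sums differ only by the term for subject i.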
We assume a model for the cumulative incidence of the form

$$g\bigl(E(I(T \le t, D = h) \mid Z)\bigr) = \beta^{\mathsf{T}} Z,$$

i.e., with link function g and where Z contains the constant ‘1’ and β the corresponding
intercept, and estimates of β are obtained by solving the GEE

$$U(\beta) = \sum_i A(\beta, Z_i)\bigl(\theta_i - g^{-1}(\beta^{\mathsf{T}} Z_i)\bigr) = 0.$$

Now by (6.4), these GEEs are seen to be (approximately) unbiased if

$$E(\dot{\phi}(X_i^*) \mid Z_i) = g^{-1}(\beta^{\mathsf{T}} Z_i) - \theta \tag{6.5}$$

and this must be verified on a case-by-case basis by explicit calculation of the influence
function. This has been done by Graw et al. (2009) for the cumulative incidence and more
generally by Overgaard et al. (2017) under the assumption that censoring is independent of
covariates. For the cumulative incidence the influence function is
$$\dot{\phi}(X_i^*) = \int_0^t \frac{dN_{hi}(u)}{G(u-)} - F_h(t) + \int_0^t \frac{F_h(t) - F_h(u)}{S(u)G(u)}\,dM_{0i}(u), \tag{6.6}$$

where Nhi counts h-events for subject i and $M_{0i}(t) = N_{0i}(t) - \int_0^t Y_i(u)\,dA_0(u)$ is the martingale
for the process N0i counting censorings for subject i. From this expression, (6.5) may
be shown using the martingale property of M0i (t).
By the first order von Mises expansion, unbiasedness of the GEE was established and, had
the pseudo-values θ1 , . . . , θn been independent, the standard sandwich variance estimator
would apply for $\hat{\beta}$. A second order von Mises expansion gives the approximation

$$\theta_i \approx \theta + \dot{\phi}(X_i^*) + \frac{1}{n-1}\sum_{j \neq i} \ddot{\phi}(X_i^*, X_j^*) \tag{6.7}$$

where φ̈ is the second-order influence function. This may be shown to have expectation
zero (Overgaard et al., 2017); however, the presence of the second order terms shows that
θ1 , . . . , θn are not independent meaning that the GEEs are not a sum of independent terms
even when inserting the true value $\beta_0$, and, therefore, the sandwich estimator needs to be
modified to properly describe the variability of $\hat{\beta}$. The details were presented by Jacobsen
and Martinussen (2016) for the Kaplan-Meier estimator and more generally by Overgaard
et al. (2017). However, use of the standard sandwich variance estimator based on the GEE
for pseudo-values from the Aalen-Johansen estimator turns out to be only slightly conser-
vative because the extra term in the correct variance estimator arising from the second order
terms in the expansion is negative and tends to be numerically small.

6.3 Approximation of pseudo-values (*)

In this section, we will first discuss how a less computer-intensive approximation to pseudo-
values may be computed and, next, show how well this approximation works in the PBC3
trial. Computation of the pseudo-values may be time-consuming because the estimator $\hat{\theta}$
needs to be re-calculated n times. Equation (6.4) suggests a plug-in approximation to the
pseudo-value for subject i, namely,

$$\hat{\theta}_i = \hat{\theta} + \hat{\dot{\phi}}(X_i^*), \tag{6.8}$$

where, in the latter term, estimates for the quantities appearing in the expression for the
influence function are inserted (Parner et al., 2023; Bouaziz, 2023). To illustrate this idea,
we study the Aalen-Johansen estimator $\hat{F}_h(t)$ for which the influence function is given by
(6.6). The corresponding approximation in (6.8) is then

$$\hat{\theta}_i = \int_0^t \frac{dN_{hi}(u)}{\hat{G}(u-)} + \int_0^t \frac{\hat{F}_h(t) - \hat{F}_h(u)}{Y(u)/n}\,d\hat{M}_{0i}(u),$$

inserting estimates for Fh , G and the cumulative censoring hazard A0 . An advantage is that
these estimates can be calculated once based on the full sample and, next, the estimated
influence function is evaluated at each observation Xi∗ . Thus,

$$\hat{\theta}_i = \frac{N_{hi}(t)}{\hat{G}((X_i \wedge t)-)} + \frac{N_{0i}(t)\bigl(\hat{F}_h(t) - \hat{F}_h(X_i \wedge t)\bigr)}{Y(X_i \wedge t)/n} - \int_0^{X_i \wedge t} \frac{\bigl(\hat{F}_h(t) - \hat{F}_h(u)\bigr)\,dN_0(u)}{Y(u)^2/n} \tag{6.9}$$

where the first term contributes if i has a cause-h event before time t, the second if i is
censored before time t, and the last term contributes in all cases. Note that the first term
corresponds to the outcome variable in direct binomial regression, Section 5.5.5.
The empirical influence function may be obtained in the following, alternative way, known
as the infinitesimal jackknife (IJ) (Jaeckel, 1972; Efron, 1982, ch. 6). Using results from Lu
and Tsiatis (2008), the influence function (6.6) may be re-written as
\[ \dot\phi(X_i^*) = \int_0^t \frac{1}{G(u)}\, dM_{hi}(u) - \int_0^t S(u) \int_0^u \frac{1}{S(s)G(s)}\, dM_i(s)\, dA_h(u), \tag{6.10} \]

where, for subject i, $M_{hi}$ is the cause-h event martingale and $M_i = \sum_h M_{hi}$ the martingale for
all events. The estimator $\hat F_h(t) = \int_0^t \hat S(u-)\, d\hat A_h(u)$ is written as a function of weights where
the actual estimator is obtained when all weights equal 1/n

\[ \hat F_h^w(t) = \int_0^t \exp\left( -\sum_\ell \hat A_\ell^w(u) \right) d\hat A_h^w(u), \tag{6.11} \]

with $\hat A_\ell^w(t) = \int_0^t \frac{\sum_i w_i\, dN_{\ell i}(u)}{\sum_i w_i Y_i(u)}$. The IJ influence function is obtained as

\[ \hat{\dot\phi}(X_i^*) = \left. \frac{\partial \hat F_h^w(t)}{\partial w_i} \right|_{w = 1/n} \]
and the approximate pseudo-value is
\[ \hat\theta_i = \hat\theta + \hat{\dot\phi}(X_i^*), \]

see Exercise 6.2.

PBC3 trial in liver cirrhosis

As an illustration, we will study the closeness of the IJ-approximation (for the survival
function, see Exercise 6.1) using the PBC3 trial (Example 1.1.1) as example. Pseudo-values
for the indicator I(Ti > t0 ) of no failure of medical treatment before time t0 = 2 years were
computed using the direct formula $n \cdot \hat S(t_0) - (n - 1) \cdot \hat S^{-i}(t_0)$ and compared with the IJ
approximation, see Figure 6.12. The approximation seems to work remarkably well (notice
the vertical scale on the right-hand plot).
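
To make the comparison concrete, the following is a minimal Python sketch (not the code behind Figure 6.12) that computes pseudo-values for the survival indicator in two ways on a small made-up data set: by the direct leave-one-out formula and by the plug-in approximation of Exercise 6.1 with the Kaplan-Meier estimate inserted for S(t). The data, the function names, and the choice t0 = 3.5 are assumptions made purely for illustration.

```python
def km_survival(times, events, t0):
    """Kaplan-Meier estimate of S(t0); events[i] is 1 for failure, 0 for censoring."""
    surv = 1.0
    for u in sorted({x for x, d in zip(times, events) if d == 1 and x <= t0}):
        at_risk = sum(x >= u for x in times)                              # Y(u)
        failures = sum(x == u and d == 1 for x, d in zip(times, events))  # dN(u)
        surv *= 1.0 - failures / at_risk
    return surv


def jackknife_pseudo_values(times, events, t0):
    """Direct formula n * S(t0) - (n - 1) * S^{-i}(t0): one refit per subject."""
    n = len(times)
    full = km_survival(times, events, t0)
    return [
        n * full
        - (n - 1) * km_survival(times[:i] + times[i + 1:], events[:i] + events[i + 1:], t0)
        for i in range(n)
    ]


def plug_in_pseudo_values(times, events, t0):
    """S(t0) plus the estimated influence function, using full-sample quantities only."""
    n = len(times)
    surv = km_survival(times, events, t0)
    jumps = sorted({x for x, d in zip(times, events) if d == 1 and x <= t0})
    at_risk = {u: sum(x >= u for x in times) for u in jumps}
    failures = {u: sum(x == u and d == 1 for x, d in zip(times, events)) for u in jumps}
    pseudo = []
    for x, d in zip(times, events):
        upper = min(x, t0)                                   # t0 ∧ X_i
        n_i = 1.0 if (d == 1 and x <= t0) else 0.0           # N_i(t0)
        y_upper = sum(v >= upper for v in times)             # Y(t0 ∧ X_i)
        integral = sum(failures[u] / at_risk[u] ** 2 for u in jumps if u <= upper)
        phi = -surv * n * (n_i / y_upper - integral)
        pseudo.append(surv + phi)
    return pseudo


# Hypothetical data: observed times X_i and failure indicators D_i.
times = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
events = [1, 0, 1, 1, 0, 1]
exact = jackknife_pseudo_values(times, events, t0=3.5)
approx = plug_in_pseudo_values(times, events, t0=3.5)
for e, a in zip(exact, approx):
    print(f"{e:8.4f} {a:8.4f}")
```

With only six subjects the two columns need not agree closely; the comparison in the text concerns the full PBC3 data. The estimated influence function terms sum to zero over subjects, so the approximate pseudo-values average to the Kaplan-Meier estimate itself.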

6.4 Goodness-of-fit (*)

If the response variable Ti in Section 5.7.2 is a pseudo-value θi , then the Taylor expansion of
the GEE becomes the (second order) von Mises expansion discussed in Section 6.2 in order
to take into account that the pseudo-values θi , i = 1, . . . , n are not independent. Therefore,
some extra arguments are required when using this idea for pseudo-residuals (Pavlič et al.,
2019). However, since the second order terms tend to be numerically small, the first order
expansion may provide a satisfactory approximation.
[Figure 6.12: two scatter plots. Left panel: IJ pseudo-value against pseudo-value, both axes from −0.2 to 1.0. Right panel: difference (vertical axis, −0.010 to 0.010) against pseudo-value.]

Figure 6.12 PBC3 trial in liver cirrhosis: Infinitesimal jackknife (IJ) pseudo-values for the survival
indicator I(Ti > 2 years) for all subjects, i, (left) and difference between IJ pseudo-values and
ordinary pseudo-values (6.1) (right) plotted against the ordinary pseudo-values. An identity line has
been added to the left-hand plot.

Pavlič et al. (2019) used data from the PBC3 trial to show how plots of cumulative pseudo-residuals
were applicable when assessing goodness-of-fit of regression models based on
pseudo-observations. Models for the probability of no medical failure before chosen time
points were fitted to pseudo-values using the cloglog link and including only the covariate
bilirubin or only log(bilirubin). The conclusion was that linearity is rejected without
transforming the covariate, whereas no deviations from linearity were detected after a
log-transformation.
6.5 Exercises

Exercise 6.1 (*) The influence function for the survival function S(t) is
\[ \dot\phi(X_i^*) = -S(t) \int_0^t \frac{dM_i(u)}{S(u)G(u)} \]
with $M_i(u) = N_i(u) - \int_0^u Y_i(s)\, dA(s)$ being the martingale for the failure counting process
for subject i (Overgaard et al. 2017). The corresponding ‘plug-in’ approximation is then
\[ \hat{\dot\phi}(X_i^*) = -\hat S(t) \left( \frac{N_i(t)}{Y(t \wedge X_i)/n} - \int_0^{t \wedge X_i} \frac{dN(u)}{Y(u)^2/n} \right). \]

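1. Show that, writing the estimator in the ‘exp(−Nelson-Aalen)’ form
\[ \hat S^w(t) = \exp\left( -\int_0^t \frac{\sum_i w_i\, dN_i(u)}{\sum_i w_i Y_i(u)} \right), \]
this expression is obtained as
\[ \hat{\dot\phi}(X_i^*) = \left. \frac{\partial \hat S^w(t)}{\partial w_i} \right|_{w = 1/n}. \]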

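2. Show that for the standard Kaplan-Meier estimator
\[ \hat S^w(t) = \prod_{[0,t]} \left( 1 - \frac{\sum_i w_i\, dN_i(u)}{\sum_i w_i Y_i(u)} \right) \]
it holds that
\[ \left. \frac{\partial \hat S^w(t)}{\partial w_i} \right|_{w = 1/n} = -\hat S(t) \left( \frac{N_i(t)}{Y(t \wedge X_i) - dN(t \wedge X_i)} - \int_0^{t \wedge X_i} \frac{dN(u)}{Y(u)(Y(u) - dN(u))} \right). \]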

3. Show that, in the case of no censoring, the influence function reduces to

φ̇ (Xi∗ ) = I(Ti > t) − S(t).

Exercise 6.2 (*)

1. Show that, for the Aalen-Johansen estimator, the calculation
\[ \hat{\dot\phi}(X_i^*) = \left. \frac{\partial \hat F_h^w(t)}{\partial w_i} \right|_{w = 1/n}, \]
with $\hat F_h^w(t)$ given by (6.11), leads to the pseudo-value approximation obtained by
plugging-in estimators into (6.10).
2. Show that, in the case of no censoring, the influence function reduces to

φ̇ (Xi∗ ) = I(Ti ≤ t, D = h) − Fh (t).

Exercise 6.3 Consider the Copenhagen Holter study and the composite end-point stroke-
free survival.
1. Fit, using pseudo-values, a cloglog model for experiencing that end-point before time 3
years including the covariates ESVEA, sex, age, and systolic blood pressure.
2. Compare the results with those of Exercise 2.4.

Exercise 6.4 Consider the Copenhagen Holter study and the composite end-point stroke-
free survival.
1. Fit, using pseudo-values, a linear model for the 3-year restricted mean time to the composite
event including the covariates ESVEA, sex, age, and systolic blood pressure.
2. Compare with the results of Exercise 4.3.

Exercise 6.5 Consider the competing outcomes stroke and death without stroke in the
Copenhagen Holter study.
1. Fit, using pseudo-values, a cloglog-model for the cumulative incidences at 3 years in-
cluding ESVEA, sex, age, and systolic blood pressure.
2. Compare with the results of Exercises 4.4 and 5.8.
Chapter 7

Further topics

In previous chapters, we have discussed a number of methods for analyzing statistical mod-
els for multi-state survival data based on rates or on marginal parameters, such as risks of
being in certain states at certain time-points – the latter type of models sometimes based on
pseudo-values. In this final chapter, we will introduce a number of possible extensions to
these methods. For these further topics, entire books and review papers have been written,
e.g., Sun (2006) and van den Hout (2020) on interval-censored data (see also Cook and
Lawless, 2018, ch. 5), Hougaard (2000) and Prentice and Zhao (2020) on non-independent
data, Hernán and Robins (2020) on causal inference, Rizopoulos (2012) on joint models,
and Borgan and Samuelsen (2014) on cohort sampling. This means that our exposition will
be brief and we provide references for further reading.

7.1 Interval-censoring
So far, we have assumed that the multi-state process Vi (t) was observed continuously, i.e.,
exact transition times were observed up till the time Xi = Ti ∧Ci – the minimum of the time
of reaching an absorbing state and the time of right-censoring. Such an observation scheme
is not always possible. Sometimes, Vi (t) is only observed intermittently, that is, only the
values $V_i(J_0^i), V_i(J_1^i), \ldots, V_i(J_{N_i}^i)$ at a number $(N_i + 1)$ of inspection times $J^i = (J_0^i, J_1^i, \ldots, J_{N_i}^i)$
are ascertained. Typically, the first time $J_0^i$ equals 0 for all subjects i but, in general, the
inspection times may vary among subjects. The resulting observations of Vi (t) are said to
be interval-censored. The data arising when $J^i$ is the same for all i are known as panel data
(e.g., Kalbfleisch and Lawless, 1985). There may also be situations where the very concept
of an exact transition time is not meaningful, e.g., the time of onset of a slowly developing
disease such as dementia. In such a case, typically only a last time seen without the disease
and a first time seen with the disease are available for any subject who develops the disease,
once more giving rise to interval-censoring.
An assumption that will be made throughout, similar to that of independent censoring (Section
1.3), is that the inspection process $J^i$ is independent of Vi (t) (e.g., Sun, 2006, ch. 1; see
also Cook and Lawless, 2018, ch. 7).
In this section, we will give a brief account of some techniques that have been developed
for analysis of interval-censored multi-state survival data.

7.1.1 Markov processes (*)
Intermittent observation of the process V (t) gives rise to a likelihood contribution from
subject i that is a product of factors
\[ P(V_i(J_\ell^i) = s_\ell \mid V_i(J_{\ell-1}^i) = s_{\ell-1}), \quad \ell = 1, \ldots, N_i, \]
each corresponding to the probability of moving from the state $s_{\ell-1}$ occupied at time $J_{\ell-1}^i$
to the state $s_\ell$ occupied at the next inspection time, $J_\ell^i$. The resulting likelihood is tractable
if V (t) is a Markov process and transition hazards are assumed to be piece-wise constant
because then the transition probabilities are explicit functions of the transition hazards,
see Equation (5.5). Piece-wise constant hazard models for general Markov multi-state pro-
cesses were discussed by Jackson (2011) (see also van den Hout, 2017, ch. 4) and may also
be used in the special models to be discussed in the next sections.
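
As a rough illustration of this likelihood, the sketch below treats the simplest piece-wise constant case, a single piece, i.e., time-homogeneous hazards. In that case the matrix of transition probabilities over an interval of length d is the matrix exponential exp(Qd) of the intensity matrix Q. The helper names, the series-based matrix exponential, and all numerical values are assumptions chosen for illustration; in practice a dedicated library routine for the matrix exponential would normally be preferred.

```python
import math


def mat_mul(a, b):
    """Product of two square matrices stored as nested lists."""
    k = len(a)
    return [[sum(a[i][m] * b[m][j] for m in range(k)) for j in range(k)] for i in range(k)]


def transition_probabilities(q, dt, terms=20, squarings=10):
    """P(dt) = exp(q * dt) via a truncated Taylor series with scaling and squaring."""
    k = len(q)
    scaled = [[q[i][j] * dt / 2 ** squarings for j in range(k)] for i in range(k)]
    result = [[float(i == j) for j in range(k)] for i in range(k)]
    term = [row[:] for row in result]
    for m in range(1, terms + 1):
        # term now holds scaled^m / m!
        term = [[x / m for x in row] for row in mat_mul(term, scaled)]
        result = [[result[i][j] + term[i][j] for j in range(k)] for i in range(k)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result


def panel_log_likelihood(q, inspection_times, states):
    """Sum over consecutive inspections of log P(V(J_l) = s_l | V(J_{l-1}) = s_{l-1})."""
    loglik = 0.0
    for l in range(1, len(inspection_times)):
        p = transition_probabilities(q, inspection_times[l] - inspection_times[l - 1])
        loglik += math.log(p[states[l - 1]][states[l]])
    return loglik


# Hypothetical illness-death intensities: 0 -> 1, 0 -> 2, 1 -> 2 (state 2 absorbing).
a01, a02, a12 = 0.10, 0.05, 0.30
q = [[-(a01 + a02), a01, a02],
     [0.0, -a12, a12],
     [0.0, 0.0, 0.0]]

# One subject seen in state 0 at times 0 and 1, in state 1 at 2.5, and in state 2 at 4.
print(panel_log_likelihood(q, [0.0, 1.0, 2.5, 4.0], [0, 0, 1, 2]))
```

Summing such contributions over subjects and maximizing over the hazards would give the maximum likelihood estimates; with several pieces, the transition matrix over an interval spanning a change point is the product of the matrices for the sub-intervals.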

7.1.2 Two-state model (*)

For the two-state model (Figure 1.1), at most one transition can be observed for any given
subject and interval-censored observations then reduce to an interval (JL , JR ] where that
transition took place, i.e., T ∈ (JL , JR ]. General interval-censoring is 0 < JL < JR < ∞ with
V (JL ) = 0, V (JR ) = 1, while the special case JL = 0, JR < ∞, V (JR ) = 1 is left-censoring.
The situation with JL = 0 or JR = ∞ with either V (JR ) or V (JL ) ∈ {0, 1} is known as current
status data in which case there is a single inspection time where it is ascertained whether or
not the event has already happened. The special cases JL > 0, JR = ∞ with V (JL ) = 0 and
JL = JR = T are, respectively, right-censoring (at C = JL ) and exact observation of T .
A simple, but not recommended, approach (e.g., Sun, 2006, ch. 2) is mid-point imputation
where, for JR < ∞, the ‘exact’ survival time is set to T = (JR + JL )/2 and, for JR = ∞,
observation of T is considered right-censored at C = JL . With this approach, analysis proceeds
as if data were exactly observed, except for right-censoring.
Observation of the interval (JL , JR ], with T ∈ (JL , JR ], gives rise to the following special
case of the likelihood discussed in Section 7.1.1

\[ \prod_i \left( S(J_L^i) - S(J_R^i) \right), \]

where S is the survival function S(t) = P(T > t). Analysis of parametric models, including
the piece-wise exponential model, based on this likelihood is simple and asymptotic prop-
erties follow from standard likelihood theory. Non-parametric maximization leads to the
Turnbull (1976) estimator for which the large-sample distribution is more complex (e.g.,
Sun, 2006, ch. 2 and 3). Pseudo-values based on a parametric model were discussed by
Bouaziz (2023).
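
For a parametric model, the likelihood above can be maximized directly. The following sketch assumes the simplest case, an exponential model S(t) = exp(−λt), and uses a golden-section search, which suffices here because the log-likelihood is concave in λ for this model. The data, the search bounds, and the function names are illustrative assumptions.

```python
import math


def log_likelihood(rate, intervals):
    """Sum of log(S(J_L) - S(J_R)) with S(t) = exp(-rate * t); J_R may be math.inf."""
    total = 0.0
    for left, right in intervals:
        surv_left = math.exp(-rate * left)
        surv_right = 0.0 if math.isinf(right) else math.exp(-rate * right)
        mass = surv_left - surv_right
        if mass <= 0.0:               # numerical underflow: treat as impossible
            return -math.inf
        total += math.log(mass)
    return total


def maximise_rate(intervals, lower=1e-8, upper=50.0, iterations=200):
    """Golden-section search for the rate maximizing the log-likelihood."""
    ratio = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lower, upper
    for _ in range(iterations):
        c = b - ratio * (b - a)
        d = a + ratio * (b - a)
        if log_likelihood(c, intervals) < log_likelihood(d, intervals):
            a = c                     # maximum lies in [c, b]
        else:
            b = d                     # maximum lies in [a, d]
    return (a + b) / 2.0


# Hypothetical observations (J_L, J_R]; J_R = inf codes right-censoring at J_L.
intervals = [(0.0, 2.0), (1.0, 3.0), (2.0, 4.0), (3.0, math.inf), (0.5, 1.5)]
rate_hat = maximise_rate(intervals)
print(rate_hat, log_likelihood(rate_hat, intervals))
```

A piece-wise exponential model would replace the single rate by one rate per time interval and require a multivariate optimizer, but the likelihood contributions have the same form.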
Regression analysis of interval-censored survival data via transformation models based on
this likelihood was studied by Zeng et al. (2016). This class of models includes the Cox
model, previously studied by Finkelstein (1986). Regression analysis of current status data
using an additive hazards model was discussed by Lin et al. (1998). For the two-state model,
panel data give rise to grouped survival data. In this situation, non-parametric estimation
of the survival function reduces to the classical life-table (e.g., Preston et al., 2000). The
Cox model for grouped survival data was studied by Prentice and Gloeckler (1978).
7.1.3 Competing risks (*)
Also for the competing risks model (Figure 1.2), interval-censored data reduce to an interval
(JL , JR ] with T ∈ (JL , JR ]. Following Section 7.1.1, we will assume that, when JR < ∞, we
also observe the state V (JR ), i.e., the cause of death. When JR = ∞, observation of T is
right-censored at JL and the cause of death is unknown.
Similar to the two-state model, Section 7.1.2, mid-point imputation is a possibility but it
is generally not recommended. The likelihood contribution from subject i based on the
interval-censored competing risks data is
\[ \left( \prod_h \left( F_h(J_R^i) - F_h(J_L^i) \right)^{I(D_i = h)} \right) S(J_L^i)^{I(D_i = 0)} \]

where Di denotes the cause of death h = 1, . . . , k with Di = 0 if observation of Ti is
right-censored, Fh is the cause-h cumulative incidence, and S = 1 − ∑h Fh the overall survival
function. Following Section 7.1.1, a model with piece-wise constant cause-specific hazards
is tractable via the resulting likelihood. Non-parametric estimation of the cumulative in-
cidences Fh (t) was discussed by Hudgens et al. (2004), generalizing the Turnbull (1976)
estimator, while Frydman and Liu (2013) focused on non-parametric estimation of the cu-
mulative cause-specific hazards. Competing risks panel data give rise to multiple decrement
life-tables (e.g., Preston et al., 2000).
Regression modeling of the cumulative incidences using transformation models (including
the Fine-Gray model) was studied by Mao et al. (2017).
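
The likelihood contribution for interval-censored competing risks data can be sketched for the simplest tractable case, constant cause-specific hazards α_h, where F_h(t) = (α_h/α)(1 − exp(−αt)) with α = ∑h α_h. The hazards, the observations, and the function names below are hypothetical and serve only to show how the two types of factor enter.

```python
import math


def cumulative_incidence(t, hazards, cause):
    """F_h(t) = (alpha_h / alpha) * (1 - exp(-alpha * t)), alpha being the sum of all hazards."""
    total = sum(hazards.values())
    return hazards[cause] / total * (1.0 - math.exp(-total * t))


def overall_survival(t, hazards):
    """S(t) = 1 - sum_h F_h(t) = exp(-alpha * t)."""
    return math.exp(-sum(hazards.values()) * t)


def log_contribution(j_left, j_right, cause, hazards):
    """cause = 0: right-censored at j_left; cause = h > 0: cause-h failure in (j_left, j_right]."""
    if cause == 0:
        return math.log(overall_survival(j_left, hazards))
    mass = (cumulative_incidence(j_right, hazards, cause)
            - cumulative_incidence(j_left, hazards, cause))
    return math.log(mass)


hazards = {1: 0.10, 2: 0.25}      # hypothetical cause-specific hazards
# Hypothetical observations (J_L, J_R, D): D = 0 means right-censored at J_L.
observations = [(1.0, 2.0, 1), (0.0, 1.5, 2), (3.0, math.inf, 0), (0.5, 2.5, 1)]
print(sum(log_contribution(l, r, d, hazards) for l, r, d in observations))
```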

7.1.4 Progressive illness-death model (*)

For the illness-death model without recovery (Figure 1.3), data are more complex. It is
usually assumed that time of death, T is either exactly observed or right-censored at C. In
addition, there may be earlier inspection times, JL < JR ≤ X = T ∧ C where the subject is
last observed to be healthy, V (JL ) = 0, respectively, first observed to be diseased, V (JR ) =
1. When the subject is censored at C or observed to die at T and no prior time JR with
V (JR ) = 1 is observed (but there may be a time JL ≥ 0 last seen healthy), the disease status
at C or T is typically not known. If a subject is observed with the disease at a time point JR
then, as mentioned in Sections 7.1.2-7.1.3, mid-point imputation is an option. The general
likelihood contributions for these four situations are shown in Table 7.1 and we refer to
Section 5.1.3 for equations giving the transition probabilities as functions of the transition
intensities.
Table 7.1 Likelihood contributions for four types of interval-censored observation of an irreversible
illness-death model.

Observation (V(JL) = 0)     Likelihood contribution

V(JR) = 1, V(C) = 1         P00(0, JL) P01(JL, JR) P11(JR, C)
V(JR) = 1, V(T) = 2         P00(0, JL) P01(JL, JR) P11(JR, T−) α12(T)
V(C) ∈ {0, 1}               P00(0, JL) (P00(JL, C) + P01(JL, C))
V(T) = 2                    P00(0, JL) (P00(JL, T−) α02(T) + P01(JL, T−) α12(T))
If the disease status (h = 0, 1) at the time of failure or censoring, X = T ∧C, is known, then
the likelihood contribution is instead $P_{00}(0, J_L) \sum_{h=0,1} I(V(X-) = h)\, P_{0h}(J_L, X-)\, \alpha_{h2}(X)^D$
where D = I(T < C) is the usual failure indicator.
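
The four contributions in Table 7.1 can be evaluated explicitly when the transition hazards are constant. The sketch below assumes this special case, with closed-form transition probabilities for the irreversible illness-death model and α12 different from α01 + α02; the hazard values and the function names are hypothetical.

```python
import math


def p00(s, t, a01, a02):
    """Probability of staying healthy from s to t."""
    return math.exp(-(a01 + a02) * (t - s))


def p11(s, t, a12):
    """Probability of staying diseased (alive) from s to t."""
    return math.exp(-a12 * (t - s))


def p01(s, t, a01, a02, a12):
    """Probability of being diseased at t given healthy at s; assumes a12 != a01 + a02."""
    out = a01 + a02
    return a01 / (out - a12) * (math.exp(-a12 * (t - s)) - math.exp(-out * (t - s)))


def likelihood_contribution(j_left, end, died, hazards, j_right=None):
    """Rows of Table 7.1; `end` is T if died, else C; j_right is the first time seen diseased."""
    a01, a02, a12 = hazards
    healthy_until_left = p00(0.0, j_left, a01, a02)
    if j_right is not None:                       # rows 1 and 2: disease was observed
        path = p01(j_left, j_right, a01, a02, a12) * p11(j_right, end, a12)
        return healthy_until_left * path * (a12 if died else 1.0)
    stay_healthy = p00(j_left, end, a01, a02)
    become_diseased = p01(j_left, end, a01, a02, a12)
    if died:                                      # row 4: disease status at death unknown
        return healthy_until_left * (stay_healthy * a02 + become_diseased * a12)
    return healthy_until_left * (stay_healthy + become_diseased)   # row 3


hazards = (0.10, 0.05, 0.30)                      # hypothetical (a01, a02, a12)
print(likelihood_contribution(1.0, 4.0, False, hazards, j_right=2.0))   # row 1
print(likelihood_contribution(1.0, 4.0, True, hazards, j_right=2.0))    # row 2
print(likelihood_contribution(1.0, 4.0, False, hazards))                # row 3
print(likelihood_contribution(1.0, 4.0, True, hazards))                 # row 4
```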
A model with piece-wise constant transition hazards is tractable (e.g., Lindsey and Ryan,
1993). Frydman (1995) studied non-parametric maximum likelihood estimation of the tran-
sition hazards for the case where disease status at time of death is always known and Fry-
dman and Szarek (2009) extended the discussion to the situation where this is not the case
– both papers generalizing the Turnbull estimator. Joly et al. (2002) studied the situation
with possible unknown disease status at death and estimated the transition hazards using
penalized likelihood.
Pseudo-values for state occupation indicators, in particular for being diseased, I(V (t) = 1),
were discussed by Sabathé et al. (2020) based on the penalized likelihood estimator and
by Johansen et al. (2020) based on parametric models, including the piece-wise constant
hazards model.

7.2 Models for dependent data

In Sections 3.9, 4.3, and 5.6, the situation with dependent failure times was studied, and
two different scenarios were discussed. These are analysis of clustered data and inference
for times of entry into different states in the multi-state model, a special case of the latter
being times to recurrence no. h = 1, . . . , K in a model for recurrent events. Either a frailty
was used to explicitly model the dependence, or the dependence was treated as a nuisance
parameter, and focus was on inference for the marginal distributions, taking a potential
dependence into account by using robust SD estimation. Reviews of these situations have
been given by Lin (1994) with most emphasis on marginal hazard models, by Wei and
Glidden (1997), and by Glidden and Vittinghoff (2004), the latter with much emphasis on
the shared frailty model. We concluded that, for clustered data, both modeling approaches
were useful, the main difference being that frailty models provide regression coefficients
with a within-cluster interpretation, whereas marginal hazards make comparisons among
clusters. At the same time, the frailty SD also gave a measure of the within-cluster asso-
ciation. We will return to a discussion of clustered data in Section 7.2.2 and discuss how
a hybrid approach, two-stage estimation, may sometimes provide the best of two worlds
by, at the same time, providing regression parameters with a marginal interpretation and
a measure of within-cluster association. We first return to the second situation concerning
times of entry into different states.

7.2.1 Times of entry into states

For the situation concerning times of entry into different states, we concluded previously
that this is not well treated by either of the two methods, a major problem being that these
times are often improper random variables for which a marginal hazard (see Equation
(5.36)) is not a straightforward quantity. As discussed in Section 5.6, it either refers to
a hypothetical world where other events do not occur, or it should be taken to be a sub-
distribution hazard with its associated difficulties in interpretation (see Section 4.2.2). We
also refer to Sections 4.4.3 and 4.4.4 for a discussion of difficulties in connection with the
related concept of latent failure times for competing risks (and semi-competing risks). Recurrent
events provide a special case where, as discussed in Section 3.9.3, a frailty model is useful
if within-subject parameters are relevant. As shown in Sections 4.2.3 and 5.5.4, marginal models
for recurrent events are also useful; these are, however, not models for marginal hazards but
rather models for marginal means (or, equivalently, marginal rate functions), at least in the
case where a terminating event (such as death) stops the recurrent events process. Without
terminating events, the WLW model may be applicable. For
general multi-state models, we find that transition (e.g., cause-specific) hazards or state oc-
cupation probabilities and associated expected lengths of stay are typically more relevant
target parameters than marginal distributions of times of entry into states.

7.2.2 Shared frailty model – two-stage estimation (*)

We now return to the shared frailty model for clustered data, Equation (3.37). Recall from
Section 3.9 that we assume that subjects have conditionally independent survival times
given frailty, that the frailty distribution is independent of covariates, and the censoring
is independent of frailty. We will concentrate on bivariate survival data, i.e., clusters of
size ni = 2, i = 1, . . . , n; however, all that follows goes through for general cluster sizes.
We also concentrate on the shared gamma frailty model and only provide brief remarks on
other frailty distributions. For any i (which we omit from the notation), we have that the
conditional survival functions given frailty are
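\[ S_h^c(t \mid A) = P(T_h > t \mid A) = \exp\left( -A \int_0^t \alpha_h^c(u)\, du \right), \quad h = 1, 2, \]

and the marginal survival functions $E_A(S_h^c(t \mid A))$ are

\[ S_h(t) = \int_0^\infty \exp\left( -a \int_0^t \alpha_h^c(u)\, du \right) f_\theta(a)\, da, \quad h = 1, 2, \]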

where fθ (·) is the density function for the gamma distribution with mean 1 and SD2 = θ .
The survival function can be evaluated to be
\[ S_h(t) = \left( 1 + \frac{1}{\theta} A_h^c(t) \right)^{-\theta} \]
with $A_h^c(t) = \int_0^t \alpha_h^c(u)\, du$ (e.g., Hougaard, 2000, ch. 7).
From the assumption of conditional independence of (T1 , T2 ) given frailty, A, it follows that
the bivariate survival function is
\[ S(t_1, t_2) = P(T_1 > t_1, T_2 > t_2) = \left( S_1(t_1)^{-1/\theta} + S_2(t_2)^{-1/\theta} - 1 \right)^{-\theta}, \tag{7.1} \]

(e.g., Hougaard, 2000, ch. 7; see also Cook and Lawless, 2018, ch. 6). Based on this result,
the intensity process for the counting process Nh (t) for subject h = 1, 2 can be seen to equal

\[ \lim_{\Delta \to 0} \frac{1}{\Delta} P(N_h(t + \Delta) - N_h(t) = 1 \mid H_{t-}) = Y_h(t)\, \alpha_h^c(t)\, \frac{\theta + N_1(t-) + N_2(t-)}{\theta + A_1^c(t \wedge X_1) + A_2^c(t \wedge X_2)}, \]
where Ht , as introduced in Section 1.4, denotes the observed past in [0,t] (see, e.g., Nielsen
et al., 1992). This shows that the shared gamma frailty model induces an intensity for
members of a given cluster that, at time t, depends on the number of previously observed
events in the cluster.
These derivations show how the marginal and joint distributions of T1 and T2 follow from
the shared (gamma) frailty specification where the joint distribution is expressed in terms
of the marginals by (7.1). Based on this expression, Glidden (2000), following earlier work
by Hougaard (1987) and Shih and Louis (1995), showed that it is possible to go the other
way around, i.e., first to specify the margins S1 (t) and S2 (t) and, subsequently, to combine
them into a joint survival function using (7.1). This equation is an example of a copula, i.e.,
a joint distribution on the unit square [0, 1] × [0, 1] with uniform margins. It is seen that a
shared frailty model induces a copula – other examples were given by, among others, An-
dersen (2005). Glidden (2000), for the gamma distribution, and Andersen (2005) for other
frailty distributions studied two-stage estimation, as follows. First, the marginal survival
functions S1 (t) and S2 (t) are specified and analyzed, e.g., using marginal Cox models as
in Sections 4.3 and 5.6. Next, estimates from these marginal models are inserted into (7.1)
(or another specified copula) to obtain a profile-like likelihood for the parameter(s) in the
copula, i.e., for θ in the case of the gamma distribution. This is maximized to obtain an
estimate, θb, for the association parameter. These authors derived asymptotic results for the
estimators (building, to a large extent, on Spiekerman and Lin, 1999). With a two-stage
approach, it is possible to get regression coefficients with a marginal interpretation (based
on a model for which goodness-of-fit examinations are simple) and at the same time get
a quantification of the within-cluster association, e.g., using Kendall’s coefficient of con-
cordance as exemplified in Section 3.9.2. Methods for evaluating the fit to the data of the
chosen copula are, however, not well developed (see, e.g., Andersen et al., 2005).
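
The role of Equation (7.1) as a copula can be checked numerically. The sketch below, with hypothetical values for θ and for the conditional cumulative hazards, first computes the margins from the marginal survival formula given above and then combines them via (7.1). Because of conditional independence given the frailty, the result should agree with the marginal formula applied to the sum of the two conditional cumulative hazards. Function names are assumptions for illustration.

```python
def marginal_survival(cum_hazard, theta):
    """S_h(t) = (1 + A_h^c(t) / theta) ** (-theta), the marginal survival formula above."""
    return (1.0 + cum_hazard / theta) ** (-theta)


def joint_survival(s1, s2, theta):
    """Equation (7.1): combine marginal survival probabilities s1, s2 in (0, 1]."""
    return (s1 ** (-1.0 / theta) + s2 ** (-1.0 / theta) - 1.0) ** (-theta)


theta = 2.0
a1, a2 = 0.4, 0.9        # hypothetical values of A_1^c(t1) and A_2^c(t2)
s1 = marginal_survival(a1, theta)
s2 = marginal_survival(a2, theta)

via_copula = joint_survival(s1, s2, theta)
via_frailty = marginal_survival(a1 + a2, theta)   # joint survival computed directly
print(via_copula, via_frailty, s1 * s2)
```

In two-stage estimation, s1 and s2 would instead be estimates from the fitted marginal models, and θ would be chosen to maximize the resulting likelihood; that step is not shown here.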

7.3 Causal inference

The g-formula was introduced in Section 1.2.5 as a means by which a single marginal treat-
ment difference can be estimated based on a regression model including both treatment and
other explanatory variables (confounders), and the method was later illustrated for a num-
ber of marginal parameters in Chapter 4. In this section, we will give a brief discussion
of the circumstances under which a marginal treatment difference can be given a causal
interpretation. This requires a definition of causality, and we will follow the approach of
Hernán and Robins (2020) based on potential outcomes (or counterfactuals) that is appli-
cable whenever the treatment corresponds to a well-defined intervention, i.e., it should be
possible to describe the target randomized trial in which the treatment effect would be es-
timable. We will define causality following these lines in Section 7.3.1 and demonstrate
that, in this setting, data from a randomized trial do, indeed, allow estimation of average
causal effects. In the more technical Sections 7.3.2-7.3.3, we will discuss the assumptions
that are needed for causal inference in observational studies (including consistency, ex-
changeability, and positivity) and show that, under such assumptions, average causal effects
may be estimated via the g-formula or via an alternative approach using Inverse Probability
of Treatment Weighting (IPTW) based on propensity scores. The final Section 7.3.4 gives a
less technical summary with further discussion.
7.3.1 Definition of causality
Let Z be a binary treatment variable, Z = 1 for treated, Z = 0 for controls, let V (t) be a
multi-state process, and let θ = E( f (V )) be a marginal parameter of interest. Imagine a
population in which all subjects receive the treatment; in this population we could observe
the process $V^{z=1}(t)$. Imagine, similarly, the same population but now every subject receives
the control, whereby we could observe $V^{z=0}(t)$. The processes $V^{z=z_0}(t)$, z0 = 0, 1 are the
potential outcomes or counterfactuals, so called because, in reality, each subject i will receive
at most one of the treatments, which means that at least one of $V_i^{z=0}(t)$ or $V_i^{z=1}(t)$ will
never be observable. The average causal effect of treatment on θ is now defined as

\[ \theta_Z = E(f(V^{z=1})) - E(f(V^{z=0})), \tag{7.2} \]

i.e., the difference between the means of what would be observed if every subject either
receives the treatment or every subject receives the control. In (7.2), the average causal
effect is defined as a difference; however, the causal treatment effect could equally well be
a ratio or another contrast between $E(f(V^{z=1}))$ and $E(f(V^{z=0}))$.
Examples of functions f (·) are the state h indicator f (V ) = I(V (t0 ) = h) at time t0 in which
case θZ would be the causal risk difference of occupying state h at time t0 , e.g., the t0 -year
risk difference of failure in the two-state model (Figure 1.1). Another example based on the
two-state model is f (V ) = min(T, τ), in which case θZ is the causal difference between the
τ-restricted mean life times under treatment and control. Note that θZ could not be a hazard
difference or a hazard ratio because the hazard functions

\[ \alpha^{z=z_0}(t) \approx P(T^{z=z_0} \le t + dt \mid T^{z=z_0} > t)/dt, \quad z_0 = 0, 1, \]

do not contrast the same population under treatment and control but, rather, they are con-
trasting the two, possibly different, sub-populations who would survive until time t under
either treatment or under control (e.g., Martinussen et al., 2020).
If treatment were randomized, then θZ would be estimable based on the observed data. This
is because for z0 = 0, 1 we have that

\[ E(f(V^{z=z_0})) = E(f(V^{z=z_0}) \mid Z = z_0) \]

due to exchangeability: because of randomization, treatment allocation Z is independent
of everything else, including the potential outcomes $V^{z=z_0}$. That is, computing the mean over
all subjects, $E(f(V^{z=z_0}))$, results in the same as computing the mean, $E(f(V^{z=z_0}) \mid Z = z_0)$,
over the subset of subjects who were randomized to treatment z0 . By assuming consistency,
$V_i^{z=z_0}(t) = V_i(t)$ if Zi = z0 , i.e., what is observed for subject i if receiving treatment z0 equals
that subject’s counterfactual outcome under treatment z0 , we have that

\[ E(f(V^{z=z_0}) \mid Z = z_0) = E(f(V) \mid Z = z_0) \]

and the latter mean is estimable from the subset of subjects randomized to treatment z0
– at least under an assumption of independent censoring. Note that, by the consistency
assumption, counterfactual outcomes are linked to the observed outcomes.
7.3.2 The g-formula (*)
In the previous section, we presented a formal definition of causality based on counterfactu-
als and argued why average causal effects were estimable based on data from a randomized
study under the assumption of consistency. We will now turn to observational data and dis-
cuss under which extra assumptions an average causal effect can be estimated using the
g-formula. Recall from Section 1.2.5 that the g-formula computes the average prediction
\[ \hat\theta_{z_0} = \frac{1}{n} \sum_i \hat f(V_i(t) \mid Z = z_0, \tilde Z_i) \tag{7.3} \]

based on some regression model for the parameter θ of interest, including treatment Z
and other covariates (confounders) $\tilde Z$. The prediction is performed by setting treatment to
z0 (= 0, 1) for all subjects and keeping the observed confounders $\tilde Z_i$ for subject i = 1, . . . , n.
This estimates (under assumptions to be stated in the following) the mean $E(f(V^{z=z_0}))$.
This is because we always have the identity

\[ E(f(V^{z=z_0})) = E_{\tilde Z}\left( E(f(V^{z=z_0}) \mid \tilde Z) \right) \]

and, under an assumption of conditional exchangeability, this equals

\[ E_{\tilde Z}\left( E(f(V^{z=z_0}) \mid \tilde Z) \right) = E_{\tilde Z}\left( E(f(V^{z=z_0}) \mid \tilde Z, Z = z_0) \right). \]

That is, we assume that sufficiently many confounders are collected in $\tilde Z$ to obtain
exchangeability for given value of $\tilde Z$ or, in other words, for given confounders those who get
treatment 1 and those who get treatment 0 are exchangeable. This assumption is also known
as no unmeasured confounders. Finally, consistency, i.e.,

\[ E_{\tilde Z}\left( E(f(V^{z=z_0}) \mid \tilde Z, Z = z_0) \right) = E_{\tilde Z}\left( E(f(V) \mid \tilde Z, Z = z_0) \right), \]

is assumed, where the right-hand side is the quantity that is estimated by the g-formula.
In addition, an assumption of positivity should be imposed, meaning that for all values
of $\tilde Z$ the probability of receiving either treatment should be positive. By this assumption,
prediction of the outcome based on the regression model for θ is feasible for all confounder
values under both treatments and, therefore, ‘every corner of the population is reached by
the predictions’. The g-formula estimate of (7.2) then becomes

\[ \hat\theta_Z = \hat\theta_1 - \hat\theta_0 \tag{7.4} \]

with $\hat\theta_{z_0}$, z0 = 0, 1 given by (7.3).
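
The steps behind (7.3) and (7.4) can be sketched on a toy data set. The example assumes complete data (or pseudo-values) so that each subject contributes an outcome θi, a single binary confounder, and a saturated outcome model, i.e., the mean outcome within each combination of treatment and confounder. The data and the function names are made up for illustration; any of the regression models from earlier chapters could take the place of the saturated model.

```python
from collections import defaultdict

# Hypothetical records: (treatment Z, binary confounder, outcome theta_i).
data = [
    (1, 0, 0.9), (1, 0, 0.7),
    (0, 0, 0.5), (0, 0, 0.5), (0, 0, 0.6), (0, 0, 0.4),
    (1, 1, 0.6), (1, 1, 0.4), (1, 1, 0.5), (1, 1, 0.5),
    (0, 1, 0.2), (0, 1, 0.4),
]


def fit_outcome_model(records):
    """Saturated outcome model: mean outcome in each (treatment, confounder) cell."""
    cells = defaultdict(list)
    for z, conf, y in records:
        cells[(z, conf)].append(y)
    return {cell: sum(ys) / len(ys) for cell, ys in cells.items()}


def g_formula(records, z0):
    """Average prediction (7.3): set treatment to z0, keep each subject's own confounder."""
    model = fit_outcome_model(records)
    return sum(model[(z0, conf)] for _, conf, _ in records) / len(records)


effect = g_formula(data, 1) - g_formula(data, 0)          # Equation (7.4)

treated = [y for z, _, y in data if z == 1]
controls = [y for z, _, y in data if z == 0]
naive = sum(treated) / len(treated) - sum(controls) / len(controls)
print(round(effect, 4), round(naive, 4))
```

In this constructed example, treatment is more common when the confounder equals 1, where outcomes are lower, so the unadjusted difference (about 0.17) understates the standardized difference (0.25). A missing (treatment, confounder) cell would make the prediction impossible, which is the positivity assumption in miniature.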

7.3.3 Inverse probability of treatment weighting (*)

In the previous section, we argued that an average causal effect, under suitable assumptions,
is estimable using the g-formula via modeling of certain features of the data – namely the
expected outcome for given treatment and confounders. The same average causal effect is
estimable under the same assumptions by modeling a completely different feature of the
data, namely the probability of treatment assignment for given values of the confounders,
$\tilde Z$. The conditional probability

\[ PS(\tilde Z_i) = P(Z_i = 1 \mid \tilde Z_i) \tag{7.5} \]

of subject i receiving treatment 1 is known as the propensity score and the idea in inverse
probability of treatment weighting, IPTW, is to construct a re-weighted data set, replacing
the outcome for subject i by a weighted outcome using the weights

\[ \hat W_i = \frac{Z_i}{\widehat{PS}(\tilde Z_i)} + \frac{1 - Z_i}{1 - \widehat{PS}(\tilde Z_i)}, \tag{7.6} \]

where the propensity score has been estimated. That is, the outcome for subject i is
weighted by the inverse probability of receiving the treatment that was actually received
and, by this, the re-weighted data set becomes free of confounding because $\tilde Z$ has the same
distribution among treated (Z = 1) and controls (Z = 0) (e.g., Rosenbaum and Rubin, 1983).
Therefore, a simple model including only treatment can be fitted to the re-weighted data set
to estimate θZ . This could be any of the models discussed in previous chapters from which
θ = E( f (V )) can be estimated, e.g., Cox or Fine-Gray models for risk parameters at some
time point, or direct models for the expected length of stay in a state in [0, τ].
In the situation where the outcome is represented by a pseudo-value θi (Andersen et al.,
2017) or with complete data, i.e., with no censoring, whereby f (Vi (t)) is completely ob-
servable and equals θi , see Section 6.1, the estimate is a simple difference between weighted
averages
\[ \hat\theta_Z = \frac{1}{n} \sum_{i: Z_i = 1} \hat W_i \theta_i - \frac{1}{n} \sum_{i: Z_i = 0} \hat W_i \theta_i. \]
In this case, it can be seen that this actually estimates the average causal effect, as follows.
The mean of the estimate in treatment group 1 (inserting the true propensity score) is

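\[ E\left( \frac{1}{n} \sum_{i: Z_i = 1} W_i \theta_i \right) = E_{\tilde Z}\left( \frac{1}{n} \sum_i E\left( \frac{Z_i \theta_i}{PS(\tilde Z_i)} \,\Big|\, \tilde Z_i \right) \right) \]

and, assuming consistency, this is

\[ E_{\tilde Z}\left( \frac{1}{n} \sum_i \frac{1}{PS(\tilde Z_i)} E(Z_i \theta_i \mid \tilde Z_i) \right) = E_{\tilde Z}\left( \frac{1}{n} \sum_i \frac{1}{PS(\tilde Z_i)} E(Z_i f(V_i^{z=1}) \mid \tilde Z_i) \right) \]

and, finally, by conditional exchangeability, this is

\[ E_{\tilde Z}\left( \frac{1}{n} \sum_i \frac{1}{PS(\tilde Z_i)} E(Z_i \mid \tilde Z_i)\, E(f(V_i^{z=1}) \mid \tilde Z_i) \right) = \frac{1}{n} \sum_i E(f(V_i^{z=1})). \]

An identical calculation for the control group gives the desired result. It is seen that, because
we divide by $PS(\tilde Z)$ or $1 - PS(\tilde Z)$ in (7.6), the assumption of positivity is needed.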
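
The weighting scheme can be sketched on a small made-up data set with one binary confounder and fully observed outcomes θi. The propensity score is estimated here by the treated fraction within each confounder level, which is an assumption made to keep the example self-contained; a logistic regression would be the usual choice with several confounders. All names and numbers are hypothetical.

```python
# Hypothetical records: (treatment Z, binary confounder, outcome theta_i).
data = [
    (1, 0, 0.9), (1, 0, 0.7),
    (0, 0, 0.5), (0, 0, 0.5), (0, 0, 0.6), (0, 0, 0.4),
    (1, 1, 0.6), (1, 1, 0.4), (1, 1, 0.5), (1, 1, 0.5),
    (0, 1, 0.2), (0, 1, 0.4),
]


def propensity_scores(records):
    """Estimated P(Z = 1 | confounder): treated fraction within each confounder level."""
    counts = {}
    for z, conf, _ in records:
        treated, total = counts.get(conf, (0, 0))
        counts[conf] = (treated + z, total + 1)
    return {conf: treated / total for conf, (treated, total) in counts.items()}


def iptw_weights(records):
    """Weights (7.6): inverse probability of the treatment actually received."""
    ps = propensity_scores(records)
    return [z / ps[conf] + (1 - z) / (1.0 - ps[conf]) for z, conf, _ in records]


def iptw_effect(records):
    """Difference between weighted averages over treated and over controls."""
    n = len(records)
    weights = iptw_weights(records)
    treated_sum = sum(w * y for w, (z, _, y) in zip(weights, records) if z == 1)
    control_sum = sum(w * y for w, (z, _, y) in zip(weights, records) if z == 0)
    return treated_sum / n - control_sum / n


print(propensity_scores(data))
print(round(iptw_effect(data), 4))
```

A confounder level in which everyone (or no one) is treated would give a propensity score of 1 (or 0) and a division by zero in the weights, which is how a violation of positivity shows up in this sketch.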
7.3.4 Summary and discussion
Sections 7.3.2 and 7.3.3 demonstrated that the average causal effect (7.2) may be estimated
in two different ways under a certain set of assumptions. The g-formula (Equations (7.3)
and (7.4)) builds on an outcome model, i.e., a model by which the marginal parameter θ of
interest may be predicted for given values of treatment Z and confounders $\tilde Z$. On the other
hand, IPTW builds on a model for treatment assignment (the propensity score, Equation
(7.5)) from which weights (7.6) are calculated and a re-weighted data set is constructed. The
re-weighted data set is free of confounding from $\tilde Z$ and, therefore, the average causal effect
(7.2) may be estimated by fitting a simple model including only treatment Z to this data set.
The assumptions needed for a causal interpretation of the resulting θbZ include: Consistency
that links the observed outcomes to the counterfactuals, see Section 7.3.1, positivity, i.e., a
probability different from both 0 and 1 for any subject in the population of receiving either
treatment, and no unmeasured confounders – sufficiently many confounders are collected
in Z
e to ensure, for given confounder values, that those who get treatment 1 and those who
get treatment 0 are exchangeable. It is an important part of any causal inference endeavor
to discuss to what extent these conditions are likely to be fulfilled. In addition to these
assumptions, the g-formula rests on the outcome model being correctly specified and IPTW
on the propensity score model being correctly specified. Doubly robust methods have been
devised that only require one of these models to be correct, as well as even less model-
dependent techniques based on targeted maximum likelihood estimation (TMLE), see, e.g.,
van der Laan and Rose (2011).
Causal inference is frequently used – also for the analysis of multi-state survival data, see,
e.g., Gran et al. (2015), while Janvin et al. (2023) discussed causal inference for recurrent
events with competing risks. In this connection, analysis with time-dependent covariates
poses particular challenges because these may, at the same time, be affected by previous
treatment allocation and be predictive for both future treatment allocation and for the out-
come (known as time-dependent confounding, see, e.g., Daniel et al., 2013).

7.4 Joint models with time-dependent covariates

We have previously discussed difficulties in connection with estimating marginal param-
eters by plug-in based on intensity models with time-dependent covariates. The problem
was announced in Section 3.7.3 when introducing inference in models with time-dependent
covariates and further discussed in Section 3.7.8 in connection with considering the role of
GvHD in the multi-state model for the bone marrow transplantation data (Example 1.1.7).
Finally, landmarking was introduced in Section 5.3 as a simple way to circumvent the diffi-
culty. In the bone marrow transplantation study, estimation of state occupation probabilities
in a model that accounts for GvHD became possible when considering it as a state in the
multi-state model (Figure 1.6) rather than as a non-adapted time-dependent covariate in a
competing risks model with end-points relapse and non-relapse mortality. Thus, the solu-
tion used was to study a joint model for the time-dependent covariate and the events of
interest. This approach is, in principle, available whenever the time-dependent covariate
is categorical, though the resulting multi-state models quickly get complicated if several
categorical time-dependent covariates need consideration and/or have several categories.
For a quantitative time-dependent covariate, the approach is not attractive, as it would re-
quire a categorization of the covariate; however, a solution is still to consider a joint model
for the multi-state process and the time-dependent covariate. The model for the covariate
should now be of an entirely different type, namely a model for repeated measurements
of a quantitative random variable. A large literature about this topic has evolved, some of
it summarized in the book by Rizopoulos (2012), and earlier review articles in the area
include Henderson et al. (2000) and Tsiatis and Davidian (2004). In this section, we will
give a very brief introduction to the topic, concentrating on a random effects model for the
evolvement of the time-dependent covariate and how this may be used as a basis for esti-
mating (conditional) survival probabilities in the framework of the two-state model (Figure
1.1). Competing risks and recurrent events in this setting were discussed by Rizopoulos
(2012, ch. 5) and will not be further considered in our brief account here.

7.4.1 Random effects model

In the joint model to be discussed in this section, the time-dependent covariate, Zi (t) for
subject i is assumed to be a sum of a true value at time t, mi (t) and a measurement error,
εi (t),
Zi (t) = mi (t) + εi (t),
where the error terms are assumed independent of everything else and normally distributed
with mean zero and a certain SD. The true value follows a linear mixed model
\[
m_i(t) = \gamma_0 + \mathrm{LP}^m_i(t) + \sum_{\ell=1}^{k} f_\ell(t)\log(A_{\ell i})
\]

with a fixed-effects linear predictor $\mathrm{LP}^m_i(t) = \sum_\ell \gamma_\ell \widetilde{Z}_{\ell i}(t)$ depending on covariates $\widetilde{Z}$
that are either time-fixed or deterministic functions of time. The random effects,
$\log(A_{1i}), \ldots, \log(A_{ki})$, enter via k fixed functions of time, $f_\ell(t)$, $\ell = 1, \ldots, k$, where k is often
taken to be 2 with $(f_1(t), f_2(t)) = (1, t)$, corresponding to random intercept and random
slope. The random effects are assumed to follow a k-variate normal distribution with mean
zero and some covariance. The hazard function

\[
\alpha_i(t \mid \boldsymbol{A}_i) = \alpha_0(t)\exp\big(\beta_0 m_i(t) + \mathrm{LP}^{\alpha}_i\big)
\]

is assumed to depend on the true value of the time-dependent covariate and, thereby, on
the random effects, and possibly on other time-fixed covariates via the linear predictor
$\mathrm{LP}^{\alpha}_i = \sum_\ell \beta_\ell Z_{\ell i}$. Some components of $Z$ and $\widetilde{Z}$ may appear in both linear predictors $\mathrm{LP}^m$ and $\mathrm{LP}^{\alpha}$.
The baseline hazard is typically modeled parametrically, e.g., by assuming it to be piecewise
constant. In this model, the random effects $(A_1, \ldots, A_k)$ serve as frailties (Section 3.9)
which, at the same time, affect the longitudinal development of the time-dependent covariate.
The survival time, T, and the time-dependent covariate are assumed to be conditionally
independent given the frailties, and measurements $Z_i(t_{i\ell_1})$ and $Z_i(t_{i\ell_2})$ taken at different time
points are also conditionally independent for given frailties. Thus, the correlation among
repeated measurements of $Z_i(\cdot)$ is given entirely by the random effects. These assumptions
are utilized when setting up the likelihood in the next section. A careful discussion of the
assumptions was given by Tsiatis and Davidian (2004).
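
The structure of the model in this section can be sketched in code. The following is a minimal illustration, assuming a random intercept and random slope, i.e., $(f_1(t), f_2(t)) = (1, t)$, a constant baseline hazard, and linear predictors supplied as plain numbers. The function names, the argument `b` (standing for the pair of normally distributed random effects, $\log(A_{1i}), \log(A_{2i})$), and all parameter values are hypothetical.

```python
import numpy as np

def true_value(t, gamma0, lp_m, b):
    """m_i(t) = gamma0 + LP^m_i + b[0] * 1 + b[1] * t, i.e. (f_1, f_2) = (1, t)."""
    return gamma0 + lp_m + b[0] + b[1] * np.asarray(t, dtype=float)

def hazard(t, alpha0, beta0, lp_alpha, gamma0, lp_m, b):
    """alpha_i(t | A_i) = alpha0 * exp(beta0 * m_i(t) + LP^alpha_i); constant baseline assumed."""
    return alpha0 * np.exp(beta0 * true_value(t, gamma0, lp_m, b) + lp_alpha)

def simulate_measurements(times, gamma0, lp_m, b, sd_eps, rng):
    """Z_i(t) = m_i(t) + eps_i(t) with independent N(0, sd_eps^2) measurement errors."""
    m = true_value(times, gamma0, lp_m, b)
    return m + rng.normal(0.0, sd_eps, size=m.shape)

# Hypothetical parameter values, chosen only to make the sketch runnable.
rng = np.random.default_rng(1)
cov = np.array([[0.5, 0.05], [0.05, 0.02]])        # covariance of (intercept, slope)
b_i = rng.multivariate_normal(np.zeros(2), cov)    # random effects for one subject
times = np.array([0.0, 0.5, 1.0, 2.0])             # measurement times
z_i = simulate_measurements(times, gamma0=2.0, lp_m=0.3, b=b_i, sd_eps=0.4, rng=rng)
alpha_i = hazard(times, alpha0=0.1, beta0=0.5, lp_alpha=0.0, gamma0=2.0, lp_m=0.3, b=b_i)
print(z_i, alpha_i)
```

Given `b_i`, the measurement errors are drawn independently, so any correlation among the simulated measurements for a subject comes from the shared random effects, and the hazard depends on the error-free value rather than on the measurements themselves.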
7.4.2 Likelihood (*)
The data for each of n independent subjects include the censored event time information
$(X_i, D_i)$, covariates $(\boldsymbol{Z}_i, \widetilde{\boldsymbol{Z}}_i(t))$, and measurements of the time-dependent covariate
$\boldsymbol{Z}_i(t) = (Z_i(t_{i1}), \ldots, Z_i(t_{in_i}))$ taken at $n_i$ time points (typically with $t_{i1} = 0$). The likelihood
contribution from the event time information $(X_i, D_i)$ for given frailties and given $(\boldsymbol{Z}_i, \widetilde{\boldsymbol{Z}}_i(t))$
is
\[
L^{\alpha}_i(\boldsymbol{\theta} \mid \boldsymbol{A}_i) = \big(\alpha_i(X_i \mid \boldsymbol{A}_i)\big)^{D_i} \exp\Big(-\int_0^{X_i} \alpha_i(t \mid \boldsymbol{A}_i)\,dt\Big).
\]

From the conditional independence assumptions summarized in Section 7.4.1, it follows
that, for given frailties, the likelihood contribution from observation of the time-dependent
covariate is
\[
L^{Z}_i(\boldsymbol{\theta} \mid \boldsymbol{A}_i) = \prod_{\ell=1}^{n_i} \varphi\big(Z_i(t_{i\ell})\big),
\]

where $\varphi$ is the relevant normal density function. The observed-data likelihood is now obtained
by integrating over the unobserved frailties
\[
L_i(\boldsymbol{\theta}) = \int L^{\alpha}_i(\boldsymbol{\theta} \mid \boldsymbol{A}_i)\, L^{Z}_i(\boldsymbol{\theta} \mid \boldsymbol{A}_i)\, \varphi(\boldsymbol{A}_i)\, d\boldsymbol{A}_i
\]
with normal density $\varphi(\boldsymbol{A}_i)$. Maximization of $L(\boldsymbol{\theta}) = \prod_i L_i(\boldsymbol{\theta})$ over the set of all parameters
(denoted $\boldsymbol{\theta}$) involves numerical challenges which may be approached, e.g., using the EM
algorithm (Rizopoulos, 2012; ch. 4). Also, variance estimation for the parameter estimates,
$\widehat{\boldsymbol{\theta}}$, may be challenging though, in principle, variance estimates may be obtained from the second
derivative of $\log(L(\boldsymbol{\theta}))$.
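
To make the integration over the frailties concrete, the sketch below evaluates one subject's contribution $L_i(\boldsymbol{\theta})$ numerically in a deliberately simplified special case. It assumes a single normally distributed random intercept with SD `sd_b`, no fixed-effects covariates, a constant baseline hazard `alpha0`, and Gauss-Hermite quadrature in place of the EM algorithm mentioned above. These simplifications, the function names, and the parameter values are all assumptions made for illustration.

```python
import numpy as np

def normal_pdf(x, mean, sd):
    """Density of N(mean, sd^2) evaluated at x."""
    return np.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))

def likelihood_contribution(x_i, d_i, z_obs, alpha0, beta0, gamma0, sd_eps, sd_b, n_nodes=40):
    """Observed-data likelihood for one subject in a random-intercept special case.

    Assumed model: m_i(t) = gamma0 + b with b ~ N(0, sd_b^2), hazard alpha0 * exp(beta0 * m_i),
    and measurements Z = m_i + N(0, sd_eps^2) errors.
    """
    nodes, weights = np.polynomial.hermite.hermgauss(n_nodes)
    b = np.sqrt(2.0) * sd_b * nodes                   # quadrature points for N(0, sd_b^2)
    haz = alpha0 * np.exp(beta0 * (gamma0 + b))       # hazard is constant in t in this special case
    lik_event = haz ** d_i * np.exp(-haz * x_i)       # event-time part given b
    z_obs = np.asarray(z_obs, dtype=float)
    # Longitudinal part given b: product of normal densities over the measurements.
    lik_long = np.prod(normal_pdf(z_obs[:, None], gamma0 + b[None, :], sd_eps), axis=0)
    # Gauss-Hermite rule: integral of g(b) against N(0, sd_b^2) ~ sum_k w_k g(b_k) / sqrt(pi).
    return np.sum(weights * lik_event * lik_long) / np.sqrt(np.pi)

value = likelihood_contribution(x_i=3.0, d_i=1, z_obs=[1.3, 0.9], alpha0=0.2,
                                beta0=0.5, gamma0=1.0, sd_eps=1.0, sd_b=0.5)
print(value)
```

With several correlated random effects the integral becomes multi-dimensional, which is one source of the numerical challenges referred to in the text.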

7.4.3 Prediction of survival probabilities

Our goal with the joint model was estimation of survival probabilities based on a model
with time-dependent covariates. The model was described in Section 7.4.1 and inference
for the model parameters in Section 7.4.2. Estimation of (conditional) survival probabilities
P(Ti > t | Ti > s), t > s for given values of time-fixed covariates and given observed time-
dependent covariate up till time s, additionally, requires prediction of the random effects
for the subject in question. A large literature exists on prediction in random effects models
and details go beyond what can be covered here. It is, indeed, possible to make a prediction,
$\widehat{\boldsymbol{A}}_{is}$, based on observation of time-fixed covariates $(\boldsymbol{Z}_i, \widetilde{\boldsymbol{Z}}_i)$ and of the time-dependent covariate
at times $t_{i1}, \ldots, t_{in_s}$ $(< s)$ (Rizopoulos, 2012; ch. 7). The estimated conditional survival
function is given by
\[
\widehat{S}_i(t \mid s) = \exp\Big(-\int_s^t \alpha_i(u \mid \widehat{\boldsymbol{A}}_{is}; \widehat{\boldsymbol{\theta}})\,du\Big),
\]
where $\widehat{\boldsymbol{\theta}}$ is the maximum likelihood estimator for all model parameters.
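
Once predicted random effects and parameter estimates are available, the conditional survival function is a one-dimensional integral of the hazard. The sketch below evaluates it with the trapezoidal rule, assuming a random intercept and slope, a constant baseline hazard, and that the predicted random effects `b_hat` have already been obtained by some other means. The function name and all numerical values are hypothetical.

```python
import numpy as np

def conditional_survival(s, t, alpha0, beta0, lp_alpha, gamma0, lp_m, b_hat, n_grid=2001):
    """exp(-integral from s to t of alpha_i(u | b_hat) du), by the trapezoidal rule.

    Assumed model: m_i(u) = gamma0 + lp_m + b_hat[0] + b_hat[1] * u and
    alpha_i(u) = alpha0 * exp(beta0 * m_i(u) + lp_alpha) with constant baseline alpha0.
    """
    u = np.linspace(s, t, n_grid)
    m = gamma0 + lp_m + b_hat[0] + b_hat[1] * u
    haz = alpha0 * np.exp(beta0 * m + lp_alpha)
    cum_haz = np.sum(0.5 * (haz[1:] + haz[:-1]) * np.diff(u))  # trapezoidal rule on [s, t]
    return np.exp(-cum_haz)

# Hypothetical values: survival to t = 3 for a subject alive at the prediction time s = 1.
surv = conditional_survival(s=1.0, t=3.0, alpha0=0.1, beta0=0.5, lp_alpha=0.0,
                            gamma0=1.0, lp_m=0.0, b_hat=[0.2, 0.3])
print(f"Predicted P(T > 3 | T > 1): {surv:.3f}")
```

In this special case the integral also has a closed form, which gives a simple way to check the numerical result.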

The same information also enables prediction of future values of the time-dependent co-
variate – even beyond the survival time for the subject in question. The joint modeling
approach has been criticized for being able to do this (briefly discussed, e.g., by Tsiatis and
Davidian, 2004); however, for the purpose of estimating survival probabilities, which was
our goal with the joint model, this point of criticism is less relevant.
7.4.4 Landmarking and joint models
We have discussed two ways of obtaining estimates of marginal parameters based on mod-
els with time-dependent covariates: Landmarking (Section 5.3) and joint models (current
section). Following Putter and van Houwelingen (2022), it can be concluded that the for-
mer is a ‘pragmatic approach that avoids specifying a model for the time-dependent co-
variate’ while the latter is ‘quite efficient when the model is well specified’ but ‘quite sen-
sitive to misspecification of the longitudinal trajectory’. As a compromise, Putter and van
Houwelingen (2022) suggested a hybrid method – still based on landmarking, but also in-
volving a working model for Z(·) by which the conditional expectation E(Z(t) | {Z(u), u ≤
s}, T > s) may be approximated and used for prediction at the landmark time s. The details
go beyond what we can cover here; however, this ‘landmarking 2.0’ idea seems to be a
viable compromise that addresses the bias-variance trade-off between the two approaches.

7.5 Cohort sampling

In Sections 2.2.1 and 3.3 (see also Section 5.6), it was shown how the Cox regression model
could be fitted to a sample of n subjects (the full cohort) by solving estimating equations
based on the Cox partial likelihood (e.g., Equations (2.1) and (3.17)). Furthermore, the
cumulative baseline hazard could be estimated using the Breslow estimator (Equations (2.2)
or (3.18)). To compute these estimates, information on covariates, at any time t, was needed
for all subjects at risk at that time.
In practical analyses of large cohorts, covariate ascertainment may be unnecessarily costly,
in particular when relatively few subjects actually experience the event for which the haz-
ard is being modeled. In such cases, ways of sampling from the cohort may provide con-
siderable savings of resources without seriously compromising statistical efficiency, and
this section discusses two such cohort sampling methods – nested case-control sampling
and case-cohort sampling. As an example, Josefson et al. (2000) studied the association
between cervical carcinoma in situ (CIN) and HPV-16 viral load. They applied a nested
case-control study including all 468 cases of CIN and 1-2 controls per case sampled from
the cohort consisting of 146,889 Swedish women. This cohort was screened between 1969
and 1995 and generated the cases. The purpose of applying this design was to reduce the
costs in connection with doing the cytological analyses needed to ascertain the viral load
from the smears, taken from the screened women and, subsequently, stored. As a second
example, Petersen et al. (2005) used a case-cohort design in a study of the association
between cause-specific mortality rates among Danish adoptees and cause of death infor-
mation for their biological and adoptive parents. Data on all 1,403 adoptees who were
observed to die before 1993 (the cases) were ascertained together with data on a random
sub-cohort sampled from the entire Danish Adoption Register. The sub-cohort consisted of
1,683 adoptees among whom 203 were also cases. In that study, ascertainment of data on
cause-specific mortality for the biological and adoptive parents was time-consuming, as it
involved scrutiny of non-computerized mortality records.
The discussion in this section will be relatively brief and concentrates on the main ideas.
Sections 7.5.1-7.5.2 give some technical results with a broader summary and examples in
Section 7.5.3.

Figure 7.1 A cohort observed from time $t = 0$ to $\tau$ with $D = 3$ cases observed at times $t_1, t_2, t_3$, at
which $m - 1 = 2$ controls are sampled from the respective risk sets.

7.5.1 Nested case-control studies (*)

The nested case-control study is a case-control study matched on time in the sense that the
data set for analysis consists of all the cases observed in the cohort and a set of controls
randomly sampled from each risk set at the times at which cases are observed. For simplicity, we
concentrate on models for the hazard function in the two-state model for survival data,
Figure 1.1, though the same ideas apply for general transition intensities. Covariates,
which for the sake of simplicity are assumed to be time-constant, are ascertained, first of all,
for all subjects from the full cohort $i = 1, \ldots, n$ for whom $N_i(X_i) = 1$. These are the
cases, say $\ell = 1, \ldots, D = \sum_i N_i(X_i)$, occurring at times $t_1, \ldots, t_D$. In addition, at each failure
time $t_\ell$, a number, $m - 1$, of controls are randomly sampled from the risk set at that time and
their covariate values are also ascertained. The sampled risk set at time $t_\ell$, $\widetilde{R}(t_\ell)$, consists of
the case and the sampled controls. Figure 7.1 depicts the situation with $m = 3$.
The nested case-control study was discussed by Thomas (1977) with full mathematical
details by Borgan et al. (1995). A survey of both this design and the case-cohort study was
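given by Borgan and Samuelsen (2014). Estimation of regression coefficients $\boldsymbol{\beta}$ in a Cox
model for the hazard function $\alpha_i(t) = \alpha_0(t)\exp(\boldsymbol{\beta}^{\mathsf{T}}\boldsymbol{Z}_i)$ in the cohort proceeds by solving
the score equations based on a partial likelihood
\[
\widetilde{\mathrm{PL}}_{\mathrm{NCC}}(\boldsymbol{\beta}) = \prod_{\ell=1}^{D} \frac{\exp(\boldsymbol{\beta}^{\mathsf{T}}\boldsymbol{Z}_\ell)}{\sum_{j\in\widetilde{R}(t_\ell)} \exp(\boldsymbol{\beta}^{\mathsf{T}}\boldsymbol{Z}_j)}, \tag{7.7}
\]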

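which is Equation (3.16) with the sum $\sum_j Y_j(t)\exp(\boldsymbol{\beta}^{\mathsf{T}}\boldsymbol{Z}_j)$ over the full risk set replaced
by the corresponding sum over the sampled risk set $\widetilde{R}(t_\ell)$. Having estimated the regression
coefficients, the cumulative baseline hazard function $A_0(t) = \int_0^t \alpha_0(u)\,du$ may be estimated
by
\[
\widehat{A}_{0,\mathrm{NCC}}(t) = \sum_{t_\ell \le t} \frac{1}{(Y(t_\ell)/m) \sum_{j\in\widetilde{R}(t_\ell)} \exp(\widehat{\boldsymbol{\beta}}^{\mathsf{T}}\boldsymbol{Z}_j)}, \tag{7.8}
\]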

which is the Breslow estimator (3.18) with the sum over the sampled risk set up-weighted
by the ratio between the full risk set size, $Y(t_\ell)$, and that of the sampled risk set, $m$.

Figure 7.2 A cohort observed from time $t = 0$ to $\tau$ with $D = 3$ cases observed at times $t_1, t_2, t_3$. A
random sub-cohort, $\widetilde{S}$, is sampled at time $t = 0$.

Large sample properties of (7.7) and (7.8), including estimation of SD, were discussed by Borgan
et al. (1995) who also introduced other ways of sampling from the risk set than simple
random sampling.
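
The sampling scheme and the estimators (7.7) and (7.8) can be sketched as follows for a single scalar covariate. This is a minimal illustration, assuming simple random sampling of controls, no tied failure times, and time-constant covariates. The function names and the toy data are hypothetical, and no maximization of the partial likelihood is attempted.

```python
import numpy as np

def sample_ncc(x, d, m, rng):
    """For each case, sample m - 1 controls without replacement from the others at risk.

    Returns a list of (case index, indices in the sampled risk set, full risk set size Y(t)).
    Assumes no tied failure times.
    """
    x = np.asarray(x, dtype=float)
    d = np.asarray(d, dtype=int)
    cases = np.flatnonzero(d == 1)
    cases = cases[np.argsort(x[cases])]              # process cases in time order
    sampled = []
    for case in cases:
        at_risk = np.flatnonzero(x >= x[case])       # full risk set, includes the case
        candidates = at_risk[at_risk != case]
        k = min(m - 1, candidates.size)              # fewer controls if the risk set is small
        if k > 0:
            controls = rng.choice(candidates, size=k, replace=False)
        else:
            controls = np.empty(0, dtype=int)
        risk_set = np.concatenate(([case], controls)).astype(int)
        sampled.append((int(case), risk_set, int(at_risk.size)))
    return sampled

def log_pl_ncc(beta, z, sampled):
    """Logarithm of the partial likelihood (7.7) for a scalar covariate."""
    z = np.asarray(z, dtype=float)
    return sum(beta * z[case] - np.log(np.sum(np.exp(beta * z[rs]))) for case, rs, _ in sampled)

def breslow_ncc(t, beta_hat, x, z, sampled):
    """Estimator (7.8); the weight is Y(t) divided by the sampled risk set size (m when full)."""
    x = np.asarray(x, dtype=float)
    z = np.asarray(z, dtype=float)
    total = 0.0
    for case, rs, y_full in sampled:
        if x[case] <= t:
            total += 1.0 / ((y_full / rs.size) * np.sum(np.exp(beta_hat * z[rs])))
    return total

# Toy data (hypothetical): observation times, failure indicators, one binary covariate.
x = [1.0, 2.0, 3.0, 4.0]
d = [1, 0, 1, 1]
z = [0.0, 1.0, 1.0, 0.0]
ncc = sample_ncc(x, d, m=2, rng=np.random.default_rng(0))
print(log_pl_ncc(0.3, z, ncc), breslow_ncc(5.0, 0.3, x, z, ncc))
```

When $m - 1$ is at least the number of other subjects at risk, every sampled risk set is the full risk set and (7.7) reduces to the ordinary Cox partial likelihood, which gives a convenient check of an implementation.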

7.5.2 Case-cohort studies (*)

In the nested case-control study (Section 7.5.1), new controls are sampled at each failure
time. In the case-cohort design a random sub-cohort, say $\widetilde{S}$ of size $\widetilde{m}$, is sampled at time
t = 0 and used as a comparison group for all subsequent cases, see Prentice (1986) and
Borgan and Samuelsen (2014). Figure 7.2 depicts the situation in the same cohort as in
Figure 7.1.
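Estimation of regression coefficients $\boldsymbol{\beta}$ in a Cox model for the hazard function
$\alpha_i(t) = \alpha_0(t)\exp(\boldsymbol{\beta}^{\mathsf{T}}\boldsymbol{Z}_i)$ in the cohort may be carried out by solving the score equations based on
the pseudo-likelihood
\[
\widetilde{\mathrm{PL}}_{\mathrm{CC}}(\boldsymbol{\beta}) = \prod_{\ell=1}^{D} \frac{\exp(\boldsymbol{\beta}^{\mathsf{T}}\boldsymbol{Z}_\ell)}{\sum_{j\in\widetilde{S}\cup\{\ell\}} Y_j(t_\ell)\exp(\boldsymbol{\beta}^{\mathsf{T}}\boldsymbol{Z}_j)}. \tag{7.9}
\]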

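Here, the comparison group at case time $t_\ell$ is the part of the sub-cohort $\widetilde{S}$ that is still at risk
(i.e., with $Y_j(t_\ell) = 1$), with the case $\{\ell\}$ added if this occurred outside the sub-cohort. Let
$\widetilde{Y}(t_\ell)$ be the size of this comparison group. From $\widehat{\boldsymbol{\beta}}$, the cumulative baseline hazard may be
estimated by
\[
\widehat{A}_{0,\mathrm{CC}}(t) = \sum_{t_\ell \le t} \frac{1}{(Y(t_\ell)/\widetilde{Y}(t_\ell)) \sum_{j\in\widetilde{S}\cup\{\ell\}} Y_j(t_\ell)\exp(\widehat{\boldsymbol{\beta}}^{\mathsf{T}}\boldsymbol{Z}_j)}, \tag{7.10}
\]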

which is the Breslow estimator (3.18) with the sum over the remaining sub-cohort at time $t_\ell$
up-weighted to represent the sum over the full risk set at that time. Large sample properties
of (7.9) and (7.10) were discussed by Self and Prentice (1988), and modifications of the
estimating equations by Borgan and Samuelsen (2014). For example, all cases still at risk at $t_\ell$ may
be included in the comparison group when equipped with suitable weights.
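
The comparison groups and the estimators (7.9) and (7.10) can be sketched along the same lines for a scalar covariate. Again this is only an illustration under stated assumptions: simple random sampling of the sub-cohort at time 0, no tied failure times, time-constant covariates, and the unweighted version of the pseudo-likelihood rather than the modified estimating equations. Function names and toy data are hypothetical.

```python
import numpy as np

def comparison_groups(x, d, subcohort):
    """For each case: sub-cohort members still at risk, with the case itself added.

    Returns a list of (case index, indices in the comparison group, full risk set size Y(t)).
    """
    x = np.asarray(x, dtype=float)
    d = np.asarray(d, dtype=int)
    in_sub = np.zeros(x.size, dtype=bool)
    in_sub[np.asarray(subcohort, dtype=int)] = True
    groups = []
    for case in np.flatnonzero(d == 1):
        at_risk = x >= x[case]
        comp = in_sub & at_risk
        comp[case] = True                            # add the case if outside the sub-cohort
        groups.append((int(case), np.flatnonzero(comp), int(at_risk.sum())))
    return groups

def log_pl_cc(beta, z, groups):
    """Logarithm of the pseudo-likelihood (7.9) for a scalar covariate."""
    z = np.asarray(z, dtype=float)
    return sum(beta * z[case] - np.log(np.sum(np.exp(beta * z[g]))) for case, g, _ in groups)

def breslow_cc(t, beta_hat, x, z, groups):
    """Estimator (7.10), with weight Y(t) divided by the comparison group size."""
    x = np.asarray(x, dtype=float)
    z = np.asarray(z, dtype=float)
    total = 0.0
    for case, g, y_full in groups:
        if x[case] <= t:
            total += 1.0 / ((y_full / g.size) * np.sum(np.exp(beta_hat * z[g])))
    return total

# Toy data (hypothetical) and a random sub-cohort of size 2 sampled at time 0.
x = [1.0, 2.0, 3.0, 4.0]
d = [1, 0, 1, 1]
z = [0.0, 1.0, 1.0, 0.0]
rng = np.random.default_rng(0)
subcohort = rng.choice(len(x), size=2, replace=False)
groups = comparison_groups(x, d, subcohort)
print(log_pl_cc(0.3, z, groups), breslow_cc(5.0, 0.3, x, z, groups))
```

The same `subcohort` can be reused with a different failure indicator `d`, which reflects the feature of the design that one sub-cohort may serve as comparison group for several case series.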
7.5.3 Discussion
Figures 7.1 and 7.2 illustrate the basic ideas in the two sampling designs. In the nested
case-control design, controls are sampled at the observed failure times, while, in the case-
cohort design, the sub-cohort is sampled at time t = 0 and used throughout the period of
observation. It follows that the latter design is useful when more than one case series is of
interest in a given cohort, because the same sub-cohort may be used as comparison group
for all cases. This was the situation in the study by Petersen et al. (2005) where mortality
rates from a number of different causes were analyzed. In the nested case-control design,
controls are matched on time – a feature that was useful in the study by Josefson et al.
(2000) because the smears from cases and matched controls had the same storage time and,
thereby, ‘similar quality’. If, in a nested case-control study, more case series are studied
then new controls must be sampled for each new series since the failure times will typically
differ among case series. However, Støer et al. (2012) discussed how to re-use controls
among case series in such a situation.
In situations where both designs are an option, one may wonder about their relative efficiencies.
It appears that the efficiencies of the two designs are quite similar when based on
similar numbers of subjects for whom covariates are ascertained. The efficiency of a nested
case-control study compared to a full cohort study has been shown to be of the order of
magnitude of (m − 1)/m, see, e.g., Borgan and Samuelsen (2014).

Guinea-Bissau childhood vaccination study

In this study (Example 1.1.2), relatively few children died during follow-up (222 or 4.2%,
see Table 1.1), and cohort sampling could be an option, even though vaccination status
and other covariates were, indeed, ascertained for all children in the study. To illustrate the
techniques, a nested case-control study was conducted within the cohort of 5,274 children
by sampling m − 1 = 3 controls at each of the D = 222 observed failures. For comparison,
a similar-sized case-cohort study was also conducted by sampling a 12.5% random sub-
cohort (664 children) from the full cohort, resulting in 641 ‘new’ children and 23 cases
within the sub-cohort. Table 7.2 shows the estimated coefficients for BCG vaccination from
Cox models with follow-up time as the time-variable, adjusted for age at recruitment as a
categorical variable. It is seen that similar estimates are obtained in the three analyses with
a somewhat smaller SD from the full cohort design and with similar values of SD for the
two cohort sampling designs. The ratio between $\mathrm{SD}^2$ for the full cohort and the nested case-control
design, $(0.146/0.174)^2 = 0.71$, is well in line with the ratio $(m - 1)/m = 0.75$.
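
As a quick check of this arithmetic, the ratios can be recomputed from the rounded SD values reported in Table 7.2. Because only three decimals are available, the last digit may differ slightly from a calculation based on unrounded values. The ratio for the case-cohort design is included for comparison only; the benchmark $(m - 1)/m$ refers to the nested case-control design.

\[
\Big(\frac{0.146}{0.174}\Big)^2 \approx 0.70, \qquad \Big(\frac{0.146}{0.166}\Big)^2 \approx 0.77, \qquad \frac{m-1}{m} = \frac{3}{4} = 0.75 .
\]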

Table 7.2 Guinea-Bissau childhood vaccination study: Estimated coefficients (and SD) for BCG
vaccination (yes vs. no) from Cox models using follow-up time as the time-variable. Adjustment for
age at entry was made, and different sampling designs were used: Full cohort, nested case-control
with $m - 1 = 3$ controls per case, and case-cohort with $\widetilde{m} = 664$.

Design                 $\widehat{\beta}$       SD
Full cohort            -0.347    0.146
Nested case-control    -0.390    0.174
Case-cohort            -0.389    0.166

Bibliography

Aalen, O. O. (1978). Nonparametric estimation of partial transition probabilities in multiple
decrement models. Ann. Statist., 6:534–545.
– (1989). A linear regression model for the analysis of life times. Statist. in Med., 8:907–
925.
Aalen, O. O., Borgan, Ø., Fekjær, H. (2001). Covariate adjustment of event histories esti-
mated from Markov chains: The additive approach. Biometrics, 57:993–1001.
Aalen, O. O., Borgan, Ø., Gjessing, H. (2008). Survival and Event History Analysis: A
Process Point of View. New York: Springer.
Aalen, O. O., Johansen, S. (1978). An empirical transition matrix for nonhomogeneous
Markov chains based on censored observations. Scand. J. Statist., 5:141–150.
Allignol, A., Beyersmann, J., Gerds, T. A., Latouche, A. (2014). A competing risks ap-
proach for nonparametric estimation of transition probabilities in a non-Markov illness-
death model. Lifetime Data Analysis, 20:495–513.
Amorim, L. D. A. F., Cai, J. (2015). Modelling recurrent events: a tutorial for analysis in
epidemiology. Int. J. Epidemiol., 44:324–333.
Andersen, E. W. (2005). Two-stage estimation in copula models used in family studies.
Lifetime Data Analysis, 11:333–350.
Andersen, P. K. (2013). Decomposition of number of years lost according to causes of
death. Statist. in Med., 32:5278–5285.
Andersen, P. K., Angst, J., Ravn, H. (2019). Modeling marginal features in studies of re-
current events in the presence of a terminal event. Lifetime Data Analysis, 25:681–695.
Andersen, P. K., Borgan, Ø., Gill, R. D., Keiding, N. (1993). Statistical Models Based on
Counting Processes. New York: Springer.
Andersen, P. K., Ekstrøm, C. T., Klein, J. P., Shu, Y., Zhang, M.-J. (2005). A class of
goodness of fit tests for a copula based on bivariate right-censored data. Biom. J., 47:815–
824.
Andersen, P. K., Geskus, R. B., de Witte, T., Putter, H. (2012). Competing risks in epidemi-
ology: possibilities and pitfalls. Int. J. Epidemiol., 41:861–870.
Andersen, P. K., Gill, R. D. (1982). Cox’s regression model for counting processes: a large
sample study. Ann. Statist., 10:1100–1120.
Andersen, P. K., Hansen, L. S., Keiding, N. (1991). Non- and semi-parametric estimation
of transition probabilities from censored observations of a non-homogeneous Markov
process. Scand. J. Statist., 18:153–167.
Andersen, P. K., Keiding, N. (2002). Multi-state models for event history analysis. Statist.
Meth. Med. Res., 11:91–115.

Andersen, P. K., Keiding, N. (2012). Interpretability and importance of functionals in com-
peting risks and multistate models. Statist. in Med., 31:1074–1088.
Andersen, P. K., Klein, J. P., Rosthøj, S. (2003). Generalized linear models for correlated
pseudo-observations, with applications to multi-state models. Biometrika, 90:15–27.
Andersen, P. K., Liestøl, K. (2003). Attenuation caused by infrequently updated covariates
in survival analysis. Biostatistics, 4:633–649.
Andersen, P. K., Pohar Perme, M. (2008). Inference for outcome probabilities in multi-state
models. Lifetime Data Analysis, 14:405–431.
– (2010). Pseudo-observations in survival analysis. Statist. Meth. Med. Res., 19:71–99.
Andersen, P. K., Pohar Perme, M., van Houwelingen, H. C., Cook, R. J., Joly, P., Mart-
inussen, T., Taylor, J. M. G., Abrahamowicz, M., Therneau, T. M. (2021). Analysis of
time-to-event for observational studies: Guidance to the use of intensity models. Statist.
in Med., 40:185–211.
Andersen, P. K., Skovgaard, L. T. (2006). Regression with Linear Predictors. New York:
Springer.
Andersen, P. K., Syriopoulou, E., Parner, E. T. (2017). Causal inference in survival analysis
using pseudo-observations. Statist. in Med., 36:2669–2681.
Andersen, P. K., Wandall, E. N. S., Pohar Perme, M. (2022). Inference for transition prob-
abilities in non-Markov multi-state models. Lifetime Data Analysis, 28:585–604.
Anderson, J. R., Cain, K. C., Gelber, R. D. (1983). Analysis of survival by tumor response.
J. Clin. Oncol., 1:710–719.
Austin, P. C., Steyerberg, E. W., Putter, H. (2021). Fine-Gray subdistribution hazard models
to simultaneously estimate the absolute risk of different event types: Cumulative total
failure probability may exceed 1. Statist. in Med., 40:4200–4212.
Azarang, L., Scheike, T., Uña-Alvarez, J. (2017). Direct modeling of regression effects for
transition probabilities in the progressive illness-death model. Statist. in Med., 36:1964–
1976.
Balan, T. A., Putter, H. (2020). A tutorial on frailty models. Statist. Meth. Med. Res.,
29:3424–3454.
Bellach, A., Kosorok, M. R., Rüschendorf, L., Fine, J. P. (2019). Weighted NPMLE for the
subdistribution of a competing risk. J. Amer. Statist. Assoc., 114:259–270.
Beyersmann, J., Allignol, A., Schumacher, M. (2012). Competing Risks and Multistate
Models with R. New York: Springer.
Beyersmann, J., Latouche, A., Buchholz, A., Schumacher, M. (2009). Simulating competing
risks data in survival analysis. Statist. in Med., 28:956–971.
Binder, N., Gerds, T. A., Andersen, P. K. (2014). Pseudo-observations for competing risks
with covariate dependent censoring. Lifetime Data Analysis, 20:303–315.
Blanche, P. F., Holt, H., Scheike, T. H. (2023). On logistic regression with right censored
data, with or without competing risks, and its use for estimating treatment effects. Life-
time Data Analysis, 29:441–482.
Bluhmki, T., Schmoor, C., Dobler, D., Pauly, M., Finke, J., Schumacher, M., Beyersmann,
J. (2018). A wild bootstrap approach for the Aalen–Johansen estimator. Biometrics,
74:977–985.
Borgan, Ø., Goldstein, L., Langholz, B. (1995). Methods for the analysis of sampled cohort
data in the Cox proportional hazards model. Ann. Statist., 23:1749–1778.
Borgan, Ø., Samuelsen, S. O. (2014). “Nested case-control and case-cohort studies”. Hand-
book of Survival Analysis. Ed. by J. P. Klein, H. C. van Houwelingen, J. G. Ibrahim, T. H.
Scheike. Boca Raton: CRC Press. Chap. 17:343–367.
Bouaziz, O. (2023). Fast approximations of pseudo-observations in the context of right-
censoring and interval-censoring. Biom. J., 65:22000714.
Breslow, N. E. (1974). Covariance analysis of censored survival data. Biometrics, 30:89–
99.
Broström, G. (2012). Event history analysis with R. London: Chapman and Hall/CRC.
Bühler, A., Cook, R. J., Lawless, J. F. (2023). Multistate models as a framework for esti-
mand specification in clinical trials of complex diseases. Statist. in Med., 42:1368–1397.
Bycott, P., Taylor, J. M. G. (1998). A comparison of smoothing techniques for CD4 data
measured with error in a time-dependent Cox proportional hazards model. Statist. in
Med., 17:2061–2077.
Clayton, D. G., Hills, M. (1993). Statistical Models in Epidemiology. Oxford: Oxford Uni-
versity Press.
Collett, D. (2015). Modelling Survival Data in Medical Research (3rd ed.) Boca Raton:
Chapman and Hall/CRC.
Conner, S. C., Trinquart, L. (2021). Estimation and modeling of the restricted mean time
lost in the presence of competing risks. Statist. in Med., 40:2177–2196.
Cook, R. J., Lawless, J. F. (1997). Marginal analysis of recurrent events and a terminating
event. Statist. in Med., 16:911–924.
– (2007). The Statistical Analysis of Recurrent Events. New York: Springer.
– (2018). Multistate Models for the Analysis of Life History Data. Boca Raton: Chapman
and Hall/CRC.
Cook, R. J., Lawless, J. F., Lakhal-Chaieb, L., Lee, K.-A. (2009). Robust estimation of
mean functions and treatment effects for recurrent events under event-dependent censor-
ing and termination: Application to skeletal complications in cancer metastatic to bone.
J. Amer. Statist. Assoc., 104:60–75.
Cox, D. R. (1972). Regression models and life-tables. J. Roy. Statist. Soc., ser. B, 34:187–
220.
– (1975). Partial likelihood. Biometrika, 62:269–276.
Crowder, M. (2001). Classical Competing Risks. London: Chapman and Hall/CRC.
Daniel, R. M., Cousens, S. N., de Stavola, B. L., Kenward, M. G., Sterne, J. A. C. (2013).
Methods for dealing with time-dependent confounding. Statist. in Med., 32:1584–1618.
Daniel, R. M., Zhang, J., Farewell, D. (2021). Making apples from oranges: Comparing
non collapsible effect estimators and their standard errors after adjustment for different
covariate sets. Biom. J., 63:528–557.
Datta, S., Satten, G. A. (2001). Validity of the Aalen-Johansen estimators of stage oc-
cupation probabilities and Nelson-Aalen estimators of integrated transition hazards for
non-Markov models. Stat. & Prob. Letters, 55:403–411.
– (2002). Estimation of integrated transition hazards and stage occupation probabilities for
non-Markov systems under dependent censoring. Biometrics, 58:792–802.
Efron, B. (1982). The Jackknife, the Bootstrap and Other Resampling Plans. SIAM,
Philadelphia: CBMS-NSF Regional Conference Series in Applied Mathematics.
Efron, B., Tibshirani, R. (1993). An Introduction to the Bootstrap. Boca Raton: Chapman
and Hall/CRC.
Fine, J. P., Gray, R. J. (1999). A proportional hazards model for the subdistribution of a
competing risk. J. Amer. Statist. Assoc., 94:496–509.
Fine, J. P., Jiang, H., Chappell, R. (2001). On semi-competing risks data. Biometrika,
88:907–919.
Finkelstein, D. M. (1986). A proportional hazards model for interval-censored failure time
data. Biometrics, 42:845–854.
Fisher, L. D., Lin, D. Y. (1999). Time-dependent covariates in the Cox proportional-hazards
regression model. Ann. Rev. Public Health, 20:145–157.
Fix, E., Neyman, J. (1951). A simple stochastic model of recovery, relapse, death and loss
of patients. Hum. Biol., 23:205–241.
Frydman, H. (1995). Nonparametric estimation of a Markov illness-death process from
interval-censored observations, with applications to diabetes survival data. Biometrika,
82:773–789.
Frydman, H., Liu, J. (2013). Nonparametric estimation of the cumulative intensities in an
interval censored competing risks model. Lifetime Data Analysis, 19:79–99.
Frydman, H., Szarek, M. (2009). Nonparametric estimation in a Markov illness-death pro-
cess from interval censored observations with missing intermediate transition status. Bio-
metrics, 65:143–151.
Furberg, J. K., Korn, S., Overgaard, M., Andersen, P. K., Ravn, H. (2023). Bivariate pseudo-
observations for recurrent event analysis with terminal events. Lifetime Data Analysis,
29:256–287.
Furberg, J. K., Rasmussen, S., Andersen, P. K., Ravn, H. (2022). Methodological challenges
in the analysis of recurrent events for randomised controlled trials with application to
cardiovascular events in LEADER. Pharmaceut. Statist., 21:241–267.
Gerds, T. A., Scheike, T. H., Andersen, P. K. (2012). Absolute risk regression for competing
risks: interpretation, link functions, and prediction. Statist. in Med., 31:3921–3930.
Geskus, R. (2016). Data Analysis with Competing Risks and Intermediate States. Boca
Raton: Chapman and Hall/CRC.
Ghosh, D., Lin, D. Y. (2000). Nonparametric analysis of recurrent events and death. Bio-
metrics, 56:554–562.
– (2002). Marginal regression models for recurrent and terminal events. Statistica Sinica,
12:663–688.
Gill, R. D., Johansen, S. (1990). A survey of product-integration with a view towards ap-
plication in survival analysis. Ann. Statist., 18:1501–1555.
Glidden, D. V. (2000). A two-stage estimator of the dependence parameter for the Clayton-
Oakes model. Lifetime Data Analysis, 6:141–156.
– (2002). Robust inference for event probabilities with non-Markov event data. Biometrics,
58:361–368.
Glidden, D. V., Vittinghoff, E. (2004). Modelling clustered survival data from multicentre
clinical trials. Statist. in Med., 23:369–388.
Gran, J. M., Lie, S. A., Øyeflaten, I., Borgan, Ø., Aalen, O. O. (2015). Causal inference in
multi-state models – Sickness absence and work for 1145 participants after work reha-
bilitation. BMC Publ. Health, 15:1082.
Graw, F., Gerds, T. A., Schumacher, M. (2009). On pseudo-values for regression analysis
in competing risks models. Lifetime Data Analysis, 15:241–255.
Grøn, R., Gerds, T. A. (2014). “Binomial regression models”. Handbook of Survival Analysis.
Ed. by J. P. Klein, H. C. van Houwelingen, J. G. Ibrahim, T. H. Scheike. Boca Raton:
CRC Press. Chap. 11:221–242.
Gunnes, N., Borgan, Ø., Aalen, O. O. (2007). Estimating stage occupation probabilities in
non-Markov models. Lifetime Data Analysis, 13:211–240.
Henderson, R., Diggle, P., Dobson, A. (2000). Joint modelling of longitudinal measure-
ments and event time data. Biostatistics, 1:465–480.
Hernán, M. A., Robins, J. M. (2020). Causal Inference: What If. Boca Raton: Chapman
and Hall/CRC.
Hougaard, P. (1986). A class of multivariate failure time distributions. Biometrika, 73:671–
678.
– (1999). Multi-state models: a review. Lifetime Data Analysis, 5:239–264.
– (2000). Analysis of Multivariate Survival Data. New York: Springer.
– (2022). Choice of time scale for analysis of recurrent events data. Lifetime Data Analysis,
28:700–722.
Huang, C., Wang, M. (2004). Joint modeling and estimation for recurrent event processes
and failure time data. J. Amer. Statist. Assoc., 99:1153–1165.
Hudgens, M. G., Satten, G. A., Longini, I. M. (2001). Nonparametric maximum likelihood
estimation for competing risks survival data subject to interval censoring and truncation.
Biometrics, 57:74–80.
Iacobelli, S., Carstensen, B. (2013). Multiple time scales in multi-state models. Statist. in
Med., 32:5315–5327.
Jackson, C. (2011). Multi-state models for panel data: the msm package for R. J. Statist.
Software, 38:1–27.
Jacobsen, M., Martinussen, T. (2016). A note on the large sample properties of estimators
based on generalized linear models for correlated pseudo-observations. Scand. J. Statist.,
43:845–862.
Jaeckel, L. A. (1972). The Infinitesimal Jackknife. Tech. rep. Bell Laboratories, MM 72-
1215-11.
Janvin, M., Young, J. G., Ryalen, P. C., Stensrud, M. J. (2023). Causal inference with re-
current and competing events. Lifetime Data Analysis. (in press).
Jensen, H., Benn, C. S., Nielsen, J., Lisse, I. M., Rodrigues, A., Andersen, P. K., Aaby, P.
(2007). Survival bias in observational studies of the effect of routine immunisations on
childhood survival. Trop. Med. Int. Health, 12:5–14.
Johansen, M. N., Lundbye-Christensen, S., Parner, E. T. (2020). Regression models using
parametric pseudo-observations. Statist. in Med., 39:2949–2961.
Joly, P., Commenges, D., Helmer, C., Letenneur, L. (2002). A penalized likelihood ap-
proach for an illness-death model with interval-censored data: application to age-specific
incidence of dementia. Biostatistics, 3:433–443.
Josefson, A. M., Magnusson, P. K. E., Ylitalo, N., Sørensen, P., Qwarforth-Tubbin, P., An-
dersen, P. K., Melbye, M., Adami, H.-O., Gyllensten, U. B. (2000). Viral load of human
papilloma virus 16 as a determinant for development of cervical carcinoma in situ: a
nested case-control study. The Lancet, 355:2189–2193.
Kalbfleisch, J. D., Lawless, J. F. (1985). The analysis of panel data under a Markov as-
sumption. J. Amer. Statist. Assoc., 80:863–871.
Kalbfleisch, J. D., Prentice, R. L. (1980). The Statistical Analysis of Failure Time Data.
(2nd ed. 2002). New York: Wiley.
Kaplan, E. L., Meier, P. (1958). Non-parametric estimation from incomplete observations.
J. Amer. Statist. Assoc., 53:457–481, 562–563.
Keiding, N. (1998). “Lexis diagram”. Encyclopedia of Biostatistics vol. 3. New York: Wi-
ley:2232–2234.
Kessing, L. V., Hansen, M. G., Andersen, P. K., Angst, J. (2004). The predictive effect
of episodes on the risk of recurrence in depressive and bipolar disorder - a life-long
perspective. Acta Psych. Scand., 109:339–344.
Kessing, L. V., Olsen, E. W., Andersen, P. K. (1999). Recurrence in affective disorder:
Analyses with frailty models. Amer. J. Epidemiol., 149:404–411.
Kristensen, I., Aaby, P., Jensen, H. (2000). Routine vaccinations and child survival: follow
up study in Guinea-Bissau, West Africa. Br. Med. J., 321:1435–1438.
Larsen, B. S., Kumarathurai, P., Falkenberg, J., Nielsen, O. W., Sajadieh, A. (2015). Exces-
sive atrial ectopy and short atrial runs increase the risk of stroke beyond atrial fibrillation.
J. Amer. College Cardiol., 66:232–241.
Latouche, A., Allignol, A., Beyersmann, J., Labopin, M., Fine, J. P. (2013). A competing
risks analysis should report results on all cause-specific hazards and cumulative inci-
dence functions. J. Clin. Epidemiol., 66:648–653.
Lawless, J. F., Nadeau, J. C. (1995). Some simple robust methods for the analysis of recur-
rent events. Technometrics, 37:158–168.
Li, J., Scheike, T. H., Zhang, M.-J. (2015). Checking Fine and Gray subdistribution hazards
model with cumulative sums of residuals. Lifetime Data Analysis, 21:197–217.
Li, Q. H., Lagakos, S. W. (1997). Use of the Wei-Lin-Weissfeld method for the analysis of
a recurrent and a terminating event. Statist. in Med., 16:925–940.
Lin, D. Y. (1994). Cox regression analysis of multivariate failure time data: the marginal
approach. Statist. in Med., 13:2233–2247.
Lin, D. Y., Oakes, D., Ying, Z. (1998). Additive hazards regression models with current
status data. Biometrika, 85:289–298.
Lin, D. Y., Wei, L. J. (1989). The robust inference for the Cox proportional hazards model.
J. Amer. Statist. Assoc., 84:1074–1078.
Lin, D. Y., Wei, L. J., Yang, I., Ying, Z. (2000). Semiparametric regression for the mean
and rate functions of recurrent events. J. Roy. Statist. Soc., ser. B, 62:711–730.
Lin, D. Y., Wei, L. J., Ying, Z. (1993). Checking the Cox model with cumulative sums of
martingale-based residuals. Biometrika, 80:557–572.
– (2002). Model-checking techniques based on cumulative residuals. Biometrics, 58:1–12.
Lin, D. Y., Ying, Z. (1994). Semiparametric analysis of the additive risk model. Biometrika,
81:61–71.
Lindsey, J. C., Ryan, L. M. (1993). A three-state multiplicative model for rodent tumori-
genicity experiments. J. Roy. Statist. Soc., ser. C, 42:283–300.
Liu, L., Wolfe, R. A., Huang, X. (2004). Shared frailty models for recurrent events and a
terminal event. Biometrics, 60:747–756.
Lombard, M., Portmann, B., Neuberger, J., Williams, R., Tygstrup, N., Ranek, L., Ring-
Larsen, H., Rodes, J., Navasa, M., Trepo, C., Pape, G., Schou, G., Badsberg, J. H., An-
dersen, P. K. (1993). Cyclosporin A treatment in primary biliary cirrhosis: results of a
long-term placebo controlled trial. Gastroenterology, 104:519–526.
Lu, C., Goeman, J., Putter, H. (2023). Maximum likelihood estimation in the additive haz-
ards model. Biometrics, 28:700–722.
Lu, X., Tsiatis, A. A. (2008). Improving the efficiency of the log-rank test using auxiliary
covariates. Biometrika, 95:679–694.
Malzahn, N., Hoff, R., Aalen, O. O., Mehlum, I. S., Putter, H., Gran, J. M. (2021). A hybrid
landmark Aalen-Johansen estimator for transition probabilities in partially non-Markov
multi-state models. Lifetime Data Analysis, 27:737–760.
Mao, L., Lin, D. Y. (2016). Semiparametric regression for the weighted composite endpoint
of recurrent and terminal events. Biostatistics, 17:390–403.
– (2017). Efficient estimation of semiparametric transformation models for the cumulative
incidence of competing risk. J. Roy. Statist. Soc., ser. B, 79:573–587.
Mao, L., Lin, D. Y., Zeng, D. (2017). Semiparametric regression analysis of interval-
censored competing risks data. Biometrics, 73:857–865.
Marso, S. P., Daniels, G. H., Brown-Frandsen, K., Kristensen, P., Mann, J. F. E., Nauck,
M. A., Nissen, S. E., Pocock, S., Poulter, N. R., Ravn, L. S., Steinberg, W. M., Stockner,
M., Zinman, B., Bergenstal, R. M., Buse, J. B., for the LEADER steering committee
(2016). Liraglutide and Cardiovascular Outcomes in Type 2 Diabetes. New Engl. J. Med.,
375:311–322.
Martinussen, T., Scheike, T. H. (2006). Dynamic Regression Models for Survival Data.
New York: Springer.
Martinussen, T., Vansteelandt, S., Andersen, P. K. (2020). Subtleties in the interpretation of
hazard contrasts. Lifetime Data Analysis, 26:833–855.
Meira-Machado, L., Uña-Alvarez, J., Cadarso-Suárez, C. (2006). Nonparametric estimation
of transition probabilities in a non-Markov illness-death model. Lifetime Data Analysis,
12:325–344.
Mitton, L., Sutherland, H., Weeks, M. (eds.) (2000). Microsimulation Modelling for Policy
Analysis. Challenges and Innovations. Cambridge: Cambridge University Press.
Nielsen, G. G., Gill, R. D., Andersen, P. K., Sørensen, T. I. A. (1992). A counting process
approach to maximum likelihood estimation in frailty models. Scand. J. Statist., 19:25–
43.
O’Hagan, A., Stevenson, M., Madan, J. (2007). Monte Carlo probabilistic sensitivity anal-
ysis for patient level simulation models: efficient estimation of mean and variance using
ANOVA. Health Economics, 16:1009–1023.
O’Keefe, A. G., Su, L., Farewell, V. T. (2018). Correlated multistate models for multiple
processes: An application to renal disease progression in systemic lupus erythematosus.
Appl. Statist., 67:841–860.
Overgaard, M. (2019). State occupation probabilities in non-Markov models. Math. Meth.
Statist., 28:279–290.
Overgaard, M., Andersen, P. K., Parner, E. T. (2023). Pseudo-observations in a multi-state
setting. The Stata Journal, 23:491–517.
Overgaard, M., Parner, E. T., Pedersen, J. (2017). Asymptotic theory of generalized esti-
mating equations based on jack-knife pseudo-observations. Ann. Statist., 45:1988–2015.
– (2019). Pseudo-observations under covariate-dependent censoring. J. Statist. Plan. and
Inf., 202:112–122.
Parner, E. T., Andersen, P. K., Overgaard, M. (2023). Regression models for censored time-
to-event data using infinitesimal jack-knife pseudo-observations, with applications to
left-truncation. Lifetime Data Analysis, 29:654–671.
Pavlič, K., Martinussen, T., Andersen, P. K. (2019). Goodness of fit tests for estimating
equations based on pseudo-observations. Lifetime Data Analysis, 25:189–205.
Pepe, M. S. (1991). Inference for events with dependent risks in multiple endpoint studies.
J. Amer. Statist. Assoc., 86:770–778.
Pepe, M. S., Longton, G., Thornquist, M. (1991). A qualifier Q for the survival function to
describe the prevalence of a transient condition. Statist. in Med., 10:413–421.
Petersen, L., Andersen, P. K., Sørensen, T. I. A. (2005). Premature death of adult adoptees:
Analyses of a case-cohort sample. Gen. Epidemiol., 28:376–382.
Prentice, R. L. (1986). A case-cohort design for epidemiologic cohort studies and disease
prevention trials. Biometrika, 73:1–11.
Prentice, R. L., Gloeckler, L. A. (1978). Regression analysis of grouped survival data with
application to breast cancer data. Biometrics, 34:57–67.
Prentice, R. L., Kalbfleisch, J. D., Peterson, A. V., Flournoy, N., Farewell, V. T., Breslow,
N. E. (1978). The analysis of failure times in the presence of competing risks. Biometrics,
34:541–554.
Prentice, R. L., Williams, B. J., Peterson, A. V. (1981). On the regression analysis of mul-
tivariate failure time data. Biometrika, 68:373–379.
Prentice, R. L., Zhao, S. (2020). The Statistical Analysis of Multivariate Failure Time Data.
Boca Raton: Chapman and Hall/CRC.
Preston, S., Heuveline, P., Guillot, M. (2000). Demography: Measuring and Modeling Pop-
ulation Processes. New York: Wiley.
PROVA study group (1991). Prophylaxis of first time hemorrhage from esophageal varices
by sclerotherapy, propranolol or both in cirrhotic patients: A randomized multicenter
trial. Hepatology, 14:1016–1024.
Putter, H., Fiocco, M., Geskus, R. B. (2007). Tutorial in biostatistics: competing risks and
multi-state models. Statist. in Med., 26:2389–2430.
Putter, H., Schumacher, M., van Houwelingen, H. C. (2020). On the relation between the
cause-specific hazard and the subdistribution rate for competing risks data: The
Fine-Gray model revisited. Biom. J., 62:790–807.
Putter, H., Spitoni, C. (2018). Non-parametric estimation of transition probabilities in non-
Markov multi-state models: The landmark Aalen-Johansen estimator. Statist. Meth. Med.
Res., 27:2081–2092.
Putter, H., van Houwelingen, H. C. (2015). Frailties in multi-state models: Are they identi-
fiable? Do we need them? Statist. Meth. Med. Res., 24:675–692.
– (2022). Landmarking 2.0: Bridging the gap between joint models and landmarking.
Statist. in Med., 41:1901–1917.
Rizopoulos, D. (2012). Joint Models for Longitudinal and Time-to-Event Data. Boca Ra-
ton: Chapman and Hall/CRC.
Rodriguez-Girondo, M., Uña-Alvarez, J. (2012). A nonparametric test for Markovianity in
the illness-death model. Statist. in Med., 31:4416–4427.
Rondeau, V., Mathoulin-Pelissier, S., Jacqmin-Gadda, H., Brouste, V., Soubeyran, P.
(2007). Joint frailty models for recurring events and death using maximum penalized
likelihood estimation: Application on cancer events. Biostatistics, 8:708–721.
Rosenbaum, P. R., Rubin, D. B. (1983). The central role of the propensity score in
observational studies for causal effects. Biometrika, 70:41–55.
Royston, P., Parmar, M. K. B. (2002). Flexible parametric proportional-hazards and
proportional-odds models for censored survival data, with application to prognostic mod-
elling and estimation of treatment effects. Statist. in Med., 21:2175–2197.
Rutter, C. M., Zaslavsky, A. M., Feuer, E. J. (2011). Dynamic microsimulation models for
health outcomes: a review. Med. Decision Making, 31:10–18.
Sabathé, C., Andersen, P. K., Helmer, C., Gerds, T. A., Jacqmin-Gadda, H., Joly, P. (2020).
Regression analysis in an illness-death model with interval-censored data: a pseudo-
value approach. Statist. Meth. Med. Res., 29:752–764.
Scheike, T. H., Zhang, M.-J. (2007). Direct modelling of regression effects for transition
probabilities in multistate models. Scand. J. Statist., 34:17–32.
Scheike, T. H., Zhang, M.-J., Gerds, T. A. (2008). Predicting cumulative incidence proba-
bility by direct binomial regression. Biometrika, 95:205–220.
Self, S. G., Prentice, R. L. (1988). Asymptotic distribution theory and efficiency results for
case-cohort studies. Ann. Statist., 16:64–81.
Shih, J. H., Louis, T. A. (1995). Inferences on association parameter in copula models for
bivariate survival data. Biometrics, 51:1384–1399.
Shu, Y., Klein, J. P., Zhang, M.-J. (2007). Asymptotic theory for the Cox semi-Markov
illness-death model. Lifetime Data Analysis, 13:91–117.
Spiekerman, C. F., Lin, D. Y. (1998). Marginal regression models for multivariate failure
time data. J. Amer. Statist. Assoc., 93:1164–1175.
Støer, N., Samuelsen, S. O. (2012). Comparison of estimators in nested case-control studies
with multiple outcomes. Lifetime Data Analysis, 18:261–283.
Suissa, S. (2008). Immortal time bias in pharmacoepidemiology. Amer. J. Epidemiol.,
167:492–499.
Sun, J. (2006). The Statistical Analysis of Interval-censored Failure Time Data. New York:
Springer.
Sverdrup, E. (1965). Estimates and test procedures in connection with stochastic models for
deaths, recoveries and transfers between different states of health. Skand. Aktuarietidskr.,
48:184–211.
Szklo, M., Nieto, F. J. (2014). Epidemiology. Beyond the Basics. Burlington: Jones and
Bartlett.
Thomas, D. C. (1977). Addendum to ‘Methods of cohort analysis: appraisal by application
to asbestos mining’ by F. D. K. Liddell, J. C. McDonald, D. C. Thomas. J. Roy. Statist.
Soc., ser. A, 140:469–491.
Tian, L., Zhao, L., Wei, L. J. (2014). Predicting the restricted mean event time with the
subject’s baseline covariates in survival analysis. Biostatistics, 15:222–233.
Titman, A. C. (2015). Transition probability estimates for non-Markov multi-state models.
Biometrics, 71:1034–1041.
Titman, A. C., Putter, H. (2022). General tests of the Markov property in multi-state models.
Biostatistics, 23:380–396.
Tsiatis, A. A. (1975). A nonidentifiability aspect of the problem of competing risks. Proc.
Nat. Acad. Sci. USA, 72:20–22.
Tsiatis, A. A., Davidian, M. (2004). Joint modeling of longitudinal and time-to-event data:
An overview. Statistica Sinica, 14:809–834.
Turnbull, B. W. (1976). The empirical distribution with arbitrarily grouped, censored and
truncated data. J. Roy. Statist. Soc., ser. B, 38:290–295.
Uña-Alvarez, J., Meira-Machado, L. (2015). Nonparametric estimation of transition prob-
abilities in the non-Markov illness-death model: A comparative study. Biometrics,
71:364–375.
van den Hout, A. (2020). Multi-State Survival Models for Interval-Censored Data. Boca
Raton: Chapman and Hall/CRC.
van der Laan, M. J., Rose, S. (2011). Targeted Learning. Causal Inference for Observa-
tional and Experimental Data. New York: Springer.
van Houwelingen, H. C. (2007). Dynamic prediction by landmarking in event history anal-
ysis. Scand. J. Statist., 34:70–85.
van Houwelingen, H. C., Putter, H. (2012). Dynamic Prediction in Clinical Survival Anal-
ysis. Boca Raton: Chapman and Hall/CRC.
Wei, L. J., Glidden, D. V. (1997). An overview of statistical methods for multiple failure
time data in clinical trials. Statist. in Med., 16:833–839.
Wei, L. J., Lin, D. Y., Weissfeld, L. (1989). Regression analysis of multivariate incomplete
failure time data by modeling marginal distributions. J. Amer. Statist. Assoc., 84:1065–
1073.
Westergaard, T., Andersen, P. K., Pedersen, J. B., Frisch, M., Olsen, J. H., Melbye, M.
(1998). Testis cancer risk and maternal parity: a population-based cohort study. Br. J.
Cancer, 77:1180–1185.
Xu, J., Kalbfleisch, J. D., Tai, B. (2010). Statistical analysis of illness-death processes and
semicompeting risks data. Biometrics, 66:716–725.
Yashin, A., Arjas, E. (1988). A note on random intensities and conditional survival func-
tions. J. Appl. Prob., 25:630–635.
Zeng, D., Mao, L., Lin, D. Y. (2016). Maximum likelihood estimation for semiparametric
transformation models with interval-censored data. Biometrika, 103:253–271.
Zheng, M., Klein, J. P. (1995). Estimates of marginal survival for dependent competing
risks based on an assumed copula. Biometrika, 82:127–138.
Zhou, B., Fine, J. P., Latouche, A., Labopin, M. (2012). Competing risks regression for
clustered data. Biostatistics, 13:371–383.
Subject Index

Page numbers followed by * refer to a (*)-marked section, and page numbers followed by
an italic b refer to a summary box.

Aalen additive hazard model, 54–58, 65b, 80–81*
    piece-wise constant baseline, 58, 81*
    semi-parametric, 81*
Aalen-Johansen estimator, 126, 167*
    confidence limits, 127
    for general Markov process, 164*
    intuitive argument, 126
    IPCW version, 238
    state occupation probability, 171*
absorbing state, 18, 31*
adapted covariate, 71*, 87–88
additive hazard model, see Aalen additive hazard model
administrative censoring, 29
AG model, 63, 65b
at risk, 21
    total number, 21
at-risk
    function, 22
    indicator, 21
    process, 21
avoidable event, 1, 28

bone marrow transplantation, 9–10
    analyzed as clustered data, 148
    expected length of stay, 133
    frailty model, 112
    illness-death model, 133
    joint Cox models, 105
    landmarking, 178–179
    marginal hazard models, 151
    non-Markov model, 211
    prevalence, 133
    time-dependent covariate, 96–100
bootstrap, 26
box-and-arrows diagram, see diagram
Breslow estimator, 43, 77*

case-cohort study, 259*
causal inference, 27, 250–251, 252*
cause-specific hazard, see hazard
cause-specific time lost, see time lost
censoring, 29b
    administrative, 29
    and competing risks, 28
    drop-out event, 29
    independent, 1, 27–29, 153–156, 159b
        definition, 28
        lack of test for, 29
    interval, 245, 246–248*
    investigation of, 153–155
    left, 246*
    non-informative, 71*
    right, 1, 27
Chapman-Kolmogorov equations, 164*
cloglog, see link function
clustered data, see also frailty model
    marginal hazard model, 148, 203*
cohort sampling, 257–260
collapsibility, 26
compensator, 33*
competing risks, 1, 27–29, 29b
    as censoring, 28
    diagram, 4, 19
    direct model, 136–141, 193–196*
    in marginal hazard model, 149
    independent, 156, 159b
    interval-censoring, 247*
    latent failure times, 157*
    plug-in, 125–130, 166–168*
    time lost, 30*, 130

composite end-point, 3, 9
condition on the future, 18, 101, 102, 198
condition on the past, 18
conditional multiplier theorem, 206*
conditional parameter, 20b
confounder, 8, 250
consistency, see causal inference
Cook-Lawless estimator, 143, 198*
Copenhagen Holter study, 10–12
copula, 250*
counterfactual, see causal inference
counting process, 21, 32, 33*, 69*
    format, 21
    jump, 22
covariate, 2, 25
    adapted, 71*, 87–88, 100b
    endogenous, 88
    exogenous, 88
    external, 88
    internal, 88
    non-adapted, 71*, 88
    time-dependent, 71*, 87–89, 94b, 100b
        Cox model, 88, 89*
        cumulative baseline hazard, 88
        immortal time bias, 102
        inference, 88–89, 89*
        interpretation, 101
        partial likelihood, 88, 89*
    type-specific, 103, 202*
Cox model, 41–44, 54, 65b, 76–78*
    baseline hazard, 41
    checking proportional hazards, 50, 95, 208*
    cloglog link, 121
    interpretation, 41, 44
    Jacod formula, 76*
    large-sample inference, 77*
    linear predictor, 43
    marginal, see marginal hazard model
    martingale residuals, 207*
    multiple, 43
    multivariate, 202–203*
    partial likelihood, 43, 76*
    profile likelihood, 76*
    Schoenfeld residuals, 208*
    score function, 76*, 201*
    stratified, 51, 77*
    survival function, 121, 166*
    time-dependent covariate, 88, 89*
    time-dependent strata, 103
    versus Poisson model, 51–53
Cox partial likelihood, 43, 76*
    intuition, 43
    time-dependent covariate, 88, 89*
cumulative hazard, 36
    interpretation, 36
    non-parametric estimator, see Nelson-Aalen estimator
cumulative incidence function, 20, 30*, 131b
    area under, 130
    biased estimator using Kaplan-Meier, 127
    cloglog link, 137, 193*
    direct model, 136, 193*
    etymology, 20
    Fine-Gray model, see Fine-Gray model
    intuition, 126
    non-parametric estimator, see Aalen-Johansen estimator
    plug-in, 126, 167*
    prediction of, 138
    pseudo-values, 233–234
cumulative mean function, see mean function
cumulative regression function, 54
current status data, 246*

data duplication trick, 105, 177
delayed entry, 13, 24, 58–61, 61b, 71*
    choice of time-variable, 24
    GEE, 192*
    independent, 61, 69*
    logrank test, 59
    pseudo-values, 221
dependent data, see frailty model and marginal hazards model
diagram
    competing risks model, 4, 19
    illness-death model, 7
    illness-death model with recovery, 8
    recurrent episodes, 8
    recurrent events, 8
    two-state model, 3, 19
direct binomial regression, 200–201*
    link function, 200*
direct marginal model, see direct model
direct model, 2, 134–146, 147b, 190–201*
    competing risk, 193–196*
    cumulative incidence function, 136, 193*
    GEE, 191–192*
    link function, 191*
    recurrent events, 142, 196*
    restricted mean, 136, 192*
    sandwich estimator, 192*
    state occupation probability, 200–201*
    survival function, 135, 192*
    time lost, 139, 196*
    two-state model, 192–193*
disability model, 6, see also illness-death model
discrete event simulation, see micro-simulation
Doob-Meyer decomposition, 33*
drop-out event, 29

event history data, 1
examples, 13b
    bone marrow transplantation, 9
    Copenhagen Holter study, 10
    Guinea-Bissau study, 4
    LEADER trial, 9
    PBC3 trial, 2
    PROVA trial, 6
    recurrent episodes in affective disorders, 7
    small set of survival data, 14
    testis cancer study, 5
exchangeability, see causal inference
expected length of stay, 16, 30*, 133, 164*
    plug-in, 168*
expected time spent in state, see expected length of stay

failure distribution function, 15, 30*
Fine-Gray model, 136, 142b, 193*
    interpretation, 138
frailty model, 110–114
    censoring assumption, 111*
    clustered data, 111
    frailty distribution, 111
    inference, 110*
    recurrent events, 112
    shared, 111, 249–250*
    two-stage estimation, 249–250*
functional delta-method, 165*

gap time model, 63
GEE, 135, 152b, 191–192*
    delayed entry, 192*
    generalized linear model, 191*
    IPCW, 192*, 194*, 197*, 199*
    sandwich estimator, 152b, 192*
generalized estimating equation, see GEE
generalized linear model, 191*
g-formula, 26, 252*
Ghosh-Lin model, 144, 199*
goodness-of-fit, see residuals
Greenwood formula, 118, 165*
Guinea-Bissau study, 4–5
    Breslow estimate, 60
    case-cohort, 260
    Cox model, 59, 82–83
    delayed entry, 58–61
    nested case-control, 260

hazard, 65b, see also intensity
    cause-specific, 18, 32*, 125
    difference, 54, 80*
        pseudo-values, 229
    integrated, see cumulative hazard
    marginal, 148, 152b, 202*
    one-to-one with survival function, 20, 120
    ratio, 41
    sub-distribution, 138, 193*

illness-death model, 6
    diagram, 7
    expected length of stay, 133
    interval-censoring, 247–248*
    intuitive argument, 132
    irreversible, 6
    marginal hazard model, 150, 204*
    plug-in, 131–134, 168*
    prevalence, 133
    progressive, 6, 168*
    semi-Markov, 132
    time with disability, 30
    with recovery, 8
immortal time bias, 102
improper random variable, 157, 167*, 187, 193*, 204*, 248
incomplete observation, see censoring
independent censoring, see censoring
independent delayed entry, see delayed entry
indicator function, 21
influence function, 238–241*
integrated hazard, see cumulative hazard
integrated intensity process, 33*
intensity, 1, 2, 18, 65b
    models with shared parameters, 103–105
        likelihood function, 106–109*
    non-parametric model, 73*
intensity process, 33*, 69*
intermediate variable, 90
intermittent observation, 246
interval-censoring, 245–248*
    competing risks, 247*
    illness-death model, 247–248*
    Markov process, 246*
    two-state model, 246*
inverse probability of
    censoring weight, 192*, see also GEE
    survival weight, 199*
    treatment weight, 252–253*
IPCW, 192*, see also GEE
irreversible model, 6, see also illness-death model

Jacod formula, 70*
joint Cox models, see intensity models with shared parameters
joint models, 254–257
    likelihood, 256*

Kaplan-Meier estimator, 117–120, 165*
    conditional, 165*
    confidence limits, 118, 166*
    intuitive argument, 118
    IPCW version, 237
    variance estimator, 165*
Kolmogorov forward differential equations, 164*

landmarking, 172*, 176–183
    Aalen-Johansen estimator, 172*
    bone marrow transplantation, 178–179
    estimating equations, 181*
    joint models, 257
    super model, 177–178
latent failure times, see competing risks
LEADER trial, 9
    AG model, 65
    bivariate pseudo-values, 235–236
    Cook-Lawless estimator, 237
    frailty model, 113
    Ghosh-Lin model, 145
    intensity models for recurrent myocardial infarctions, 64–65
    Mao-Lin model, 213
    PWP model, 65
left-truncation, see delayed entry
Lexis diagram, 85
likelihood function, 69–73*
    factorization, 71*, 72b
    Jacod formula, 70*
    multinomial experiment, 70*
    two-state model, 72*
likelihood ratio test, 38
linear predictor, 25
    checking interactions, 48
    checking linear effect, 46
link function, 25, 135, 191*, 200*
    cloglog, 121, 135, 227, 233
    identity, 136, 231, 233
    logarithm, 136, 233, 234
    –logarithm, 135, 229
    logit, 233
    pseudo-values, 223
logrank test, 38, 75*
    as score test, 77*
    delayed entry, 59
    stratified, 40, 77*
long data format, 21
LWYY model, 143, 197*

Mao-Lin model, 199*
marginal Cox model, see marginal hazard model
marginal hazard model, 147–151, 152b, 201–204*
    clustered data, 148, 203*
    competing risks, 149, 203*
    illness-death model, 150, 204*
    recurrent events, 149, 203*
    robust standard deviations, 148
    time to (first) entry, 147, 201*
    WLW model, 149, 203*
marginal parameter, 14, 16, 17b
    direct model, 134–146
    failure distribution function, 15
    for recurrent events, 31
    mean function, 142
    restricted mean, 16, 30*
    state occupation probability, 15
    survival function, 15
    time lost, 130
    time to (first) entry, 16, 30*, 147
marked point process format, 21
Markov process, 18, 31*, 132, 134, 163–170*, 176b
    Aalen-Johansen estimator, 164*
    interval-censoring, 246*
    product-integral, 163*
    property, 163*
    state occupation probability, 164*
    test for assumption of, 91, 174*
    transition probability, 164*
martingale, 33*, 82b*
matrix exponential, 164*
mean function
    Cook-Lawless estimator, 143, 198*
    critique against, 145*
    Ghosh-Lin model, 144
    LWYY model, 143, 197*
    Mao-Lin model, 199*
    Nelson-Aalen estimator, 142, 197*
    terminal event, 143, 146b
micro-simulation, 2, 19, 184–190, 190b
    PROVA trial, 187–190
multi-state
    process, 15, 30*
    survival data, 1, 2b
multi-state model, 29b
    diagram, see diagram
multinomial experiment
    likelihood function, 70*
    micro-simulation, 184*
multiplicative hazard regression model, 41–53, see also Cox model and Poisson model

Nelson-Aalen estimator, 36–37, 65b, 73–75*
    confidence limits, 36, 74*
    for recurrent events, 142, 197*
    maximum likelihood interpretation, 74*
    variance estimator, 74*
nested case-control study, 258–259*
non-adapted covariate, 71*, 88
non-avoidable event, 1, 28
non-collapsibility, 26
non-informative censoring, 71*
non-Markov process, 32*, 170–175*, 176b, 234
    Nelson-Aalen estimator, 171*
    product-integral, 171*
    recurrent events, 174*
    state occupation probability, 171*
    transition probability, 172*

observational study, 13
    bone marrow transplantation, 9
    Copenhagen Holter study, 10
    Guinea-Bissau study, 4
    recurrent episodes in affective disorders, 7
    testis cancer study, 5
occurrence/exposure rate, 38, 72*, 78*, 164*
    standard deviation, 38

panel data, 245
parametric hazard model
    piece-wise constant, 37–38, 78–80*
    Poisson, 45
partial transition rate, 171*
    recurrent events, 198*
past, 18
    condition on the, 18
    history, 16
    information, 1, 31*
path, 11, 19, 31*
    micro-simulation, 184
PBC3 trial, 2–4
    Aalen model, 54–57
    analysis of censoring, 153
    Breslow estimate, 44
    cause-specific hazard, 61
    censoring, 2
    competing risks, 61
    Cox model, 43–44
        checking linearity, 46–48, 214
        checking proportional hazards, 50, 95, 215
    cumulative incidence function
        Aalen-Johansen, 127
        direct model, 138
        plug-in from Cox models, 127
    direct binomial regression, 211–213
    Fine-Gray model, 138
    g-formula, 122
    logrank test, 40
    martingale residuals, 214
    Nelson-Aalen estimate, 37
    piece-wise constant hazards, 38
    Poisson model, 45
        checking interactions, 48
        checking linearity, 46–48
        checking proportional hazards, 50
    pseudo-values, 224–234
    residuals, 213–218
    restricted mean
        direct model, 136
        plug-in, 125
    Schoenfeld residuals, 215
    survival function
        Kaplan-Meier, 119, 121
        plug-in, 121
    time lost
        direct model, 139
        plug-in, 130
    time-dependent covariate, 95–96
Pepe estimator, 132, 172*
piece-wise constant hazards, 37–38, 65b, 78–80*
    regression, 79*
piece-wise exponential model, see piece-wise constant hazards
plug-in, 2, 20, 117–134, 134b
    competing risks, 125–130, 166–168*
    cumulative incidence function, 126, 167*
    expected length of stay, 168*
    illness-death model, 131–134, 168*
    Markov process, 163–170*
    prevalence, 169*
    recurrent events, 169–170*
    restricted mean, 122, 166*
    semi-Markov process, 174–175*
    survival function, 118, 165*
    time lost, 130, 168*
    two-state model, 117–125, 165–166*
Poisson model, 45, 65b, 79*, see also piece-wise constant hazards
    checking proportional hazards, 50
    etymology, 80*
    joint models, 105*
    versus Cox model, 51–53
population
    hypothetical, 28
    sample from, 27
    without censoring, 28
positivity, see causal inference
prediction, 26, see also landmarking and joint models
prevalence, see illness-death model
    plug-in, 169*
product-integral, 70*, 163*
    non-Markov, 171*
profile likelihood, 76*
prognostic variable, 1, 2
progressive model, 170*
propensity score, see causal inference
proportional hazards, 41
    checking, 50, 95, 208*
    same as no interaction with time, 50
PROVA trial, 6–7
    analysis of censoring, 154
    Cox model, 83–85
    delayed entry, 91
    joint Cox models, 105
    logrank test, 84
    Markov, 91
    micro-simulation, 187–190
    non-Markov, 208–210
    pseudo-values, 234
    time-dependent covariate, 90–94
pseudo-observation, see pseudo-value
pseudo-values, 2, 229, 229b
    bivariate, 235–236
    covariate-dependent censoring, 236
    cumulative incidence function, 233–234
    cumulative residuals, 241*
    delayed entry, 221
    GEE, 223
    hazard difference, 229
    infinitesimal jackknife, 241*
    influence function, 238–241*
    intuition, 222–224
    link function, 223
        cloglog, 227, 233
        identity, 231, 233
        logarithm, 233, 234
        –logarithm, 229
        logit, 233
    no censoring, 222
    non-Markov process, 234
    recurrent events, 235–236
    residual plot, 228
    restricted mean, 229
    scatter plot, 224
    survival indicator, 223
    theoretical properties, 237–241*
    time lost, 233–234
    with censoring, 223
PWP model, 63

randomized trial
    LEADER, 9
    PBC3, 2
    PROVA, 6
rate, 2, 65b, see intensity or hazard
recurrent episodes, 8, see also illness-death model with recovery and recurrent events
recurrent episodes in affective disorders, 7–9
    AG model, 63
    analysis of censoring, 154
    Cook-Lawless estimator, 144
    gap time model, 63
    Ghosh-Lin model, 143
    LWYY model, 143
    mean function, 143
    PWP model, 63
    state occupation probability, 134
    time-dependent covariate, 89–90
    WLW model, 150
recurrent events
    composite end-point, 199*
Cook-Lawless estimator, 143, 198* counting process, 22
diagram, 8 delayed entry, 16
direct model, 142–146, 196* Kaplan-Meier estimate, 16
frailty model, 112 restricted mean, 17
Ghosh-Lin model, 144, 199* software, xii
LWYY model, 143, 197* (start, stop, status) triple, 21
Mao-Lin model, 199* state
marginal hazard model, 149, 152b, absorbing, 18, 31*
203* expected length of stay, 30*
mean function, 142 space, 30*
partial transition rate, 198* transient, 18, 31*
plug-in, 169–170* state occupancy, see state occupation
probability of at least h events, 170* state occupation probability, 18, 30*, 164*,
progressive model, 170* 171*
pseudo-values, 235–236 direct model, 200–201*
PWP model, 63 sub-distribution
terminal event, 146b, 198* function, 30*
WLW model, 149 hazard, see hazard
registry-based study summary box
testis cancer study, 5 conditional parameters, 20
regression coefficient, 25 counting processes and martingales, 82
interpretation, 25 cumulative incidence function, 131
regression function, 54 direct models, 147
regression model, 25 examples of event history data, 13
residuals Fine-Gray model, 142
cumulative, 205–208* independent censoring/competing risks,
cumulative martingale, 207* 159
cumulative pseudo, 241* intensity, hazard, rate, 65
cumulative sums of, 206* likelihood factorization, 72
martingale, 207*, 214 marginal hazard models, 152
Schoenfeld, 208*, 215 marginal parameters, 17
score, 208*, 215 Markov and non-Markov processes,
restricted mean, 16, 30* 176
direct model, 136, 192* mean function and terminal event, 146
plug-in, 122, 166* micro-simulation, 190
pseudo-values, 229 model-based and robust SD, 152
risk, 1, 15 multi-state model, competing risks, and
risk factor, 2 censoring, 29
risk set, 43, 77* multi-state survival data, 2
robust SD, see GEE plug-in, 134
pseudo-values, 229
sample, 27 time zero, 61
sandwich estimator, see GEE time-dependent covariate or state, 100
Schoenfeld residuals, see residuals time-variable and time-dependent
semi-competing risks, 151, 158* covariates, 94
semi-Markov process, 31*, 91*, 132, survival function, 15, 30*
174–175* area under, 125
semi-parametric model, 41, 57, 73*, 81* cloglog link, 121
shared parameters for intensity models, see direct model, 135, 192*
intensity model non-parametric estimator, see
small set of survival data, 14 Kaplan-Meier estimator
at-risk function, 22 one-to-one with hazard, 20, 120
plug-in estimator, 118, 122, 165* calendar time, 13
plug-in estimator from Cox model, choice of, 13, 91
166* delayed entry, 24
pseudo-values, 223 several, 53, 94
transient state, 18, 31*
target parameter, 25 transition
target population, 27 intensity, 18, 31*, 69*
terminal event, 8, 9, 143, 198* probability, 18, 31*
testis cancer study, 5–6 Turnbull estimator, 246*
Lexis diagram, 85 two-state model, 3
Poisson model, 85–86 diagram, 3, 19
time axis, see time-variable direct model, 135–136, 192–193*
time lost, 30* interval-censoring, 246*
cause-specific, 130, 196* likelihood function, 72*
competing risks, 130, 196* plug-in, 117–125, 165–166*
direct model, 139, 196* type-specific covariate, see covariate
plug-in, 130, 168*
pseudo-values, 233–234 utility, 199
time origin, 13
time to (first) entry, see marginal parameter von Mises expansion, 239*
time zero, 13, 61b
time-dependent covariate, see covariate wide data format, 21
time-dependent strata, 103 wild bootstrap, 206*
time-variable, 61b, 94b WLW model, 149, 152b, 203*
age, 13 competing risks, 150, 203*

