Bayesian Analysis of Failure Time Data Using P-Splines
1 Introduction
1.1 Outline
1.2 Notation
Bibliography
List of Figures
6.1 (a) hazard (b) density (c) survivor function of the lognormal distribution
6.2 (a) hazard (b) density (c) survivor function of the Weibull distribution
6.3 Example of a bathtub shaped hazard
6.1 Data set expansion for piecewise constant model, given knots 0, 2, 5, 8
Failure time analysis is a form of regression analysis in which the time until an
event occurs is of interest. In this thesis, the event is generically referred to as
failure and the observational units are referred to as individuals.
Unlike most regression models, failure time models are usually not formulated for
the conditional expectation. Instead, most of them are formulated in terms of the
hazard rate, which gives the risk of failure and is defined precisely in the
following. A general formulation for the hazard rate is (Cox and Oakes 1984,
p. 70):
\[ h(t \mid \boldsymbol{z}_i, \boldsymbol{\beta}) = h_0(t)\,\rho(\beta_1 z_{i1} + \dots + \beta_k z_{ik}) = h_0(t)\,\rho(\eta_i). \]
Here, the baseline hazard $h_0(t)$ gives the hazard of an individual with standard
conditions, corresponding to $\boldsymbol{z} = \boldsymbol{0}$;
$\eta_i = \boldsymbol{z}_i^\top \boldsymbol{\beta}$ is the linear predictor and
$\rho(\cdot)$ is a nonnegative function satisfying $\rho(0) = 1$. Splines allow the
replacement of linear effects of the form $\boldsymbol{z}_i^\top \boldsymbol{\beta}$
in the linear predictor by more general functions. This is useful for flexible
modeling of the baseline hazard, treating time formally like a covariate. A spline
is a function consisting of local polynomials that are joined together at points in
the domain of the covariate. Splines can be understood as a regression model: every
spline can be written as a weighted sum of basis functions depending on a covariate,
which is a regression model in which the regression coefficients are given by the
weights.
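The view of a spline as a weighted sum of basis functions can be sketched in a few lines of Python. The sketch below is an illustration only, not the implementation used in this thesis: it evaluates B-spline basis functions with the Cox-de Boor recursion, and the function names, the equidistant knots and the weights are assumptions chosen for the example.

```python
def bspline_basis(x, knots, degree):
    """Evaluate all B-spline basis functions of the given degree at x
    (Cox-de Boor recursion; illustrative helper, not from the thesis)."""
    # Degree 0: indicator functions of the half-open knot intervals.
    basis = [1.0 if knots[j] <= x < knots[j + 1] else 0.0
             for j in range(len(knots) - 1)]
    for d in range(1, degree + 1):
        new_basis = []
        for j in range(len(knots) - d - 1):
            left_den = knots[j + d] - knots[j]
            right_den = knots[j + d + 1] - knots[j + 1]
            left = (x - knots[j]) / left_den * basis[j] if left_den > 0 else 0.0
            right = ((knots[j + d + 1] - x) / right_den * basis[j + 1]
                     if right_den > 0 else 0.0)
            new_basis.append(left + right)
        basis = new_basis
    return basis


def spline_value(x, knots, degree, weights):
    """Spline as a regression model: weighted sum of the basis functions."""
    return sum(w * b for w, b in zip(weights, bspline_basis(x, knots, degree)))


# Assumed setup: cubic B-splines on equidistant knots; with these knots the
# basis is complete on the interval [0, 4).
knots = [float(k) for k in range(-3, 8)]
degree = 3
basis_at_x = bspline_basis(1.7, knots, degree)      # 7 basis functions
weights = [0.2, 0.5, 1.0, 0.8, 0.4, 0.3, 0.1]       # hypothetical coefficients
value = spline_value(1.7, knots, degree, weights)
```

Inside the interval covered by a complete basis, the basis functions are nonnegative and sum to one, so a spline with constant weights is constant.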
The aim of this thesis is to present Bayesian methods for models where either
the hazard rate, covariates, or both are modeled via splines, in discrete and contin-
uous time. B-spline basis functions in combination with a penalty to avoid over-
fitting (usually called P-splines) are the main building blocks used for modeling.
P-splines have good numerical properties, and allow fast computation. Addition-
ally, other useful basis functions for failure time analysis will be given. Failures are
always assumed to be nonrecurrent. A fully Bayesian perspective using MCMC
methods is taken.
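The role of the penalty can be illustrated with a small sketch. It assumes the common choice of penalizing squared differences of adjacent basis weights; the difference order, the smoothing parameter and all names below are assumptions for illustration, not definitions taken from this chapter.

```python
def differences(values, order):
    """Apply the difference operator `order` times to a list of numbers."""
    for _ in range(order):
        values = [b - a for a, b in zip(values, values[1:])]
    return values


def difference_penalty(weights, order=2, smoothing=1.0):
    """smoothing * sum of squared order-th differences of the weights."""
    return smoothing * sum(d * d for d in differences(weights, order))


rough = [0.0, 1.0, 4.0, 9.0]    # second differences: 2, 2
linear = [1.0, 2.0, 3.0, 4.0]   # second differences: 0, 0

penalty_rough = difference_penalty(rough)     # 8.0
penalty_linear = difference_penalty(linear)   # 0.0
```

Under this assumed penalty, weights that change linearly are not penalized at all for order 2, while wiggly weights receive a large penalty; a larger smoothing parameter therefore pulls the fit towards a smoother function.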
1.1 Outline
This thesis is structured as follows. First, the basic concepts of failure time
analysis are introduced. For the statistical analysis of failure time data, time is
represented by a random variable which is characterized by quantities that are
specific to failure time modeling. These quantities can be used to construct the
likelihood by taking into account special properties of failure time data, such as
censoring, which refers to failure times that are not fully observed. Next, two
central model families are introduced: the relative risk and the log-location-scale
model family. The subsequent chapter gives an overview of computational and
inferential methods as far as they are relevant for model building. The chapter
concludes with the introduction of Bayesian P-splines using the Gaussian likelihood
as an example. The sampling scheme for Gaussian responses can be adjusted for the
probit model for discrete time and the lognormal model for continuous time.
Subsequently, models for the analysis of discrete time are introduced. Gibbs
samplers for these models are categorized here into methods embedded in the
generalized linear model (GLM) framework and the latent variable framework. Based on
those frameworks, efficient Bayesian sampling schemes can be constructed. From the
GLM framework, iteratively weighted least squares (IWLS) proposals based on Fisher
scoring for the Metropolis-Hastings algorithm can be derived. Many sampling schemes
for models using P-splines were developed on the basis of IWLS proposals, including
sampling schemes for continuous time models. Discrete time models are illustrated
using data on unemployment durations. Subsequently, estimation for continuous time
is described. The focus is on relative risk models, but the lognormal model and
extensions based on it will also be discussed. The methods are illustrated using a
data set on crime recidivism. The final chapter gives a summary and an outlook.
1.2 Notation
This thesis uses standard notation as commonly found in the literature. The
distinction between a random variable Y and its realizations y will be made in the
introductory chapters and ignored in the later chapters when the meaning is obvious.
The conventions used in this thesis are listed here. Conditioning on parameters
will often be suppressed for notational simplicity.
Symbol               Explanation
x                    scalar
x = (x1, ..., xn)    vector
X                    matrix
I[·]                 indicator function
diag(x1, ..., xn)    diagonal matrix obtained from x
bdiag(A, B)          block diagonal matrix out of matrices A, B
Symbol               Explanation
h(t)                 hazard rate
H(t)                 cumulative hazard rate
h0(t)                baseline hazard
H0(t)                cumulative baseline hazard
G(t)                 survivor function
D                    available data
L(θ | D)             likelihood
vi                   censoring indicator
η                    linear predictor
The following table gives an overview of the shorthand used for the distributions.
and density
\[ f(t) = \frac{dF(t)}{dt}. \]
\[ G(0) = 1, \tag{2.1} \]
Variables whose survivor function does not satisfy 2.2 are called defective; for
those it follows that E[T] = ∞. The probability of failure in the small interval
[t, t + dt) can be approximated by h(t)dt (Aalen et al. 2008, pp. 5–17). The
function h(t) is the hazard rate, defined as:
Figure 2.1: Some hazard rates (hazard against t)
Figure 2.2: Some survivor functions (survivor function against t)
\[ \frac{F(t + \Delta) - F(t)}{G(t)}. \]
Hence 2.3 is
\[ \frac{1}{G(t)} \lim_{\Delta \to 0} \frac{F(t + \Delta) - F(t)}{\Delta} = \frac{F'(t)}{G(t)} = \frac{f(t)}{G(t)}, \]
showing that the hazard is a conditional density.
The cumulative hazard rate is
\[ H(t) = \int_0^t h(u)\,du = \int_0^t \frac{f(u)}{G(u)}\,du = \left[-\log G(u)\right]_0^t = -\log G(t), \]
due to 2.1. Hence, the survivor function can be written in terms of the hazard rate:
\[ G(t) = \exp\left(-\int_0^t h(u)\,du\right) = \exp(-H(t)). \tag{2.4} \]
satisfy
\[ \int_0^t h(s)\,ds < \infty \]
for all $t$ and
\[ \int_0^\infty h(s)\,ds = \infty \]
to be the hazard rate of a nondefective continuous variable (Kalbfleisch and Pren-
tice 2002, p. 9). Many models in failure time modeling are formulated in terms of
the hazard rate first.
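Relation 2.4 can be checked numerically. The following sketch integrates a hazard rate with the trapezoidal rule and compares the resulting survivor function with a closed form; the Weibull hazard, its parameter values and the function names are assumptions chosen purely for illustration.

```python
import math


def weibull_hazard(t, shape, scale):
    """Weibull hazard rate (shape / scale) * (t / scale) ** (shape - 1)."""
    return (shape / scale) * (t / scale) ** (shape - 1)


def cumulative_hazard(hazard, t, steps=10_000):
    """Trapezoidal approximation of the integral of the hazard over [0, t]."""
    width = t / steps
    total = 0.5 * (hazard(0.0) + hazard(t))
    total += sum(hazard(i * width) for i in range(1, steps))
    return total * width


shape, scale, t = 1.5, 2.0, 3.0                      # assumed example values
H_numeric = cumulative_hazard(lambda u: weibull_hazard(u, shape, scale), t)
G_numeric = math.exp(-H_numeric)                     # survivor function via 2.4
G_exact = math.exp(-((t / scale) ** shape))          # Weibull survivor function
```

Both routes give the same survivor probability up to the integration error, which is what makes it possible to specify a model through the hazard rate alone.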
rate:
\[ G(t) = \prod_{j=1}^{t} (1 - h(j)). \]
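The product formula can be illustrated with a short sketch. It assumes the usual reading of the discrete hazard h(j) as the probability of failure at time j given survival up to j, so that G(t) is the probability of surviving beyond t; the hazard values and names are hypothetical.

```python
def discrete_survivor(hazards):
    """Return [G(1), ..., G(T)] with G(t) = prod_{j <= t} (1 - h(j))."""
    survivor, current = [], 1.0
    for h in hazards:
        current *= 1.0 - h
        survivor.append(current)
    return survivor


hazards = [0.5, 0.5, 0.5]            # hypothetical constant discrete hazard
G = discrete_survivor(hazards)       # [0.5, 0.25, 0.125]

# Failure probabilities P(T = t) = h(t) * G(t - 1), using G(0) = 1.
pmf = [h * g_prev for h, g_prev in zip(hazards, [1.0] + G[:-1])]
```

The failure probabilities up to time T and the remaining survival probability G(T) add up to one, which is a quick consistency check for any hazard sequence.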
Assuming grouped failure times might not be appropriate in all cases, as some
random variables are intrinsically discrete. However, some helpful results follow
from this assumption, and estimation is simplified by basing inference on the
likelihood contributions following from 2.2, which leads to an identical modeling
framework.
Failure time data have some special characteristics which have to be accounted
for in the construction of the likelihood. A failure time is referred to as censored
when the actual failure time is not observed and is only known to fall into an
interval. Failure times are left-truncated if they can only be observed when they
exceed a truncation time. Time-varying covariates are often available in the data
set. The following sections, based on Klein and Moeschberger (2003, pp. 63–77),
clarify how these conditions are accounted for in the formulation of the likelihood.
Conceptually, these adjustments can be represented in a unified framework by
varying the likelihood contributions. As a consequence, the likelihood becomes
more difficult to work with, but there are computational methods which simplify
estimation.
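How the likelihood contributions vary with the type of observation can be sketched for the simplest case. The example assumes right censoring and a constant hazard (exponential model): an observed failure contributes the density f(t), a right-censored time contributes the survivor function G(t). The data, the variable names and the model choice are illustrative assumptions, not taken from the thesis.

```python
import math


def exponential_loglik(rate, times, observed):
    """Log-likelihood of a constant hazard `rate` under right censoring.

    Observed failure:    log f(t) = log(rate) - rate * t
    Right-censored time: log G(t) = -rate * t
    """
    total = 0.0
    for t, v in zip(times, observed):
        total += -rate * t               # log survivor part, both cases
        if v == 1:
            total += math.log(rate)      # log hazard, observed failures only
    return total


times = [2.0, 3.5, 1.0, 4.0]             # hypothetical observed times
observed = [1, 0, 1, 0]                  # 1 = failure observed, 0 = censored

# For this model the maximizer has a closed form:
# number of observed failures divided by the total time at risk.
rate_hat = sum(observed) / sum(times)
loglik_hat = exponential_loglik(rate_hat, times, observed)
```

Censored individuals still contribute information through their time at risk, which is why they enter the denominator of the estimate even though no failure was observed for them.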
\[ t_i = \min(\dot{t}_i, c_i). \]
Here, $\dot{t}_i$ is the true failure time and $c_i$ is the censoring time. The
indicator variable $v_i$ is defined as
\[ v_i = \begin{cases} 1 & \text{if } \dot{t}_i \le c_i, \\ 0 & \text{if } \dot{t}_i > c_i, \end{cases} \]