Statistical Methods for Dynamic Treatment Regimes
Reinforcement Learning, Causal Inference, and Personalized Medicine
Bibhas Chakraborty
Erica E.M. Moodie
Statistics for Biology and Health
Series Editors: M. Gail, K. Krickeberg, J. Samet, A. Tsiatis, W. Wong
Bibhas Chakraborty
Department of Biostatistics
Columbia University
New York, USA

Erica E.M. Moodie
Department of Epidemiology, Biostatistics, and Occupational Health
McGill University
Montreal, Québec, Canada
ISSN 1431-8776
ISBN 978-1-4614-7427-2 ISBN 978-1-4614-7428-9 (eBook)
DOI 10.1007/978-1-4614-7428-9
Springer New York Heidelberg Dordrecht London
Library of Congress Control Number: 2013939595

Preface
This book was written to summarize and describe the state of the art of statistical
methods developed to address questions of estimation and inference for dynamic
treatment regimes, a branch of personalized medicine. The study of dynamic treat-
ment regimes is relatively young, and until now, no single source has aimed to pro-
vide an overview of the methodology and results which are dispersed in journals,
proceedings, and technical reports so as to orient researchers to the field. Our pri-
mary focus is on description of the methods, clear communication of the conceptual
underpinnings, and their illustration via analyses drawn from real applications as
well as results from simulated data. The first chapter serves to set the context for the
statistical reader in the landscape of personalized medicine; we assume a familiarity
with elementary calculus, linear algebra, and basic large-sample theory. Important
theoretical properties of the methods described will be stated when appropriate;
however, the reader will, for the most part, be referred to the primary research arti-
cles for the proofs of the results. By doing so, we hope the book will be accessible to
a wide audience of statisticians, epidemiologists, and medical researchers with some
statistical training, as well as computer scientists (machine/reinforcement learning
researchers) interested in medical applications.
Examples of data analyses from real applications are found throughout the book.
From these, we hope to impart a sense of the power and versatility of the methods
discussed to answer important problems in medical research. Where possible, we
refer readers to available code or packages in different statistical languages to facili-
tate implementation; whether or not such code exists, we aim to describe all analytic
approaches in sufficient detail that any researcher with a reasonable background in
statistical programming could implement the methods from scratch.
We hope that the publication of this book will foster the genuine enthusiasm
that we feel for this important area of research. Indeed, with the demographic shift
of most Western populations to older age, the treatment of chronic conditions will
bring increased pressure to develop evidence-based strategies for care that is tai-
lored to individual changes in health status. The recently proposed methods have not
yet reached a wide audience and consequently are underutilized. We hope that this
text will serve as a useful handbook to those already active in the field of dynamic
regimes and spark a new generation of researchers to turn their attention to this
important and exciting area.
Acknowledgements
Bibhas Chakraborty would like to acknowledge support from the National Insti-
tutes of Health (NIH) grant R01 NS072127-01A1 and the Calderone Research Prize
for Junior Faculty (2011) awarded by the Mailman School of Public Health of
Columbia University. Erica Moodie is supported by a Natural Sciences and En-
gineering Research Council (NSERC) University Faculty Award and by research
grants from NSERC and the Canadian Institutes of Health Research (CIHR). Finan-
cial support for the writing of this book was provided by the Quebec Population
Health Research Network (QPHRN).
We are indebted to numerous colleagues for lively and insightful discussions.
Our research has been enriched by exchanges with Daniel Almirall, Ken Cheung,
Nema Dean, Eric Laber, Bruce Levin, Susan Murphy, Min Qian, Thomas Richard-
son, Jamie Robins, Susan Shortreed, David Stephens, and Jonathan Wakefield. In
particular, we wish to thank Ashkan Ertefaie, Eric Laber, Min Qian, Olli Saarela,
and Michael Wallace for detailed comments on a first version of the text. Also,
we would like to acknowledge help in software development and creation of some
graphics for this book from Guqian Du, Tianxiao Huang, and Jingyi Xin – students
in the Department of Biostatistics at Columbia University. Jonathan Weinberg, Ben-
jamin Rich, and Yue Ru Sun, students in the Department of Mathematics & Statis-
tics, the Department of Epidemiology, Biostatistics, & Occupational Health, and the
School of Computer Science, respectively, at McGill University, also assisted in the
preparation of some simulation results and graphics.
We wish to thank our many medical and epidemiological collaborators for
thought-provoking discussions and/or the privilege of using their data: Dr. Michael
Kramer (PROBIT), Drs. Merrick Moseley and Catherine Stewart (MOTAS), Dr. Au-
gustus John Rush (STAR*D), and Dr. Victor J. Strecher (Project Quit – Forever
Free). MOTAS was funded by the Guide Dogs for the Blind Association (UK); per-
mission to analyze the data was granted by the MOTAS Cooperative. The follow-up
of the PROBIT study was made possible by a grant from CIHR.
Data used in Sect. 5.2.4 were obtained from the limited access data sets dis-
tributed from the NIMH-supported “Clinical Antipsychotic Trials of Intervention
Effectiveness in Schizophrenia” (CATIE-Sz). This is a multisite clinical trial of
persons with schizophrenia comparing the effectiveness of randomly assigned med-
ication treatment. The study was supported by NIMH Contract #N01MH90001 to
the University of North Carolina at Chapel Hill. The ClinicalTrials.gov identifier is
NCT00014001. Analyses of the CATIE data presented in the book reflect the views
of the authors and may not reflect the opinions or views of the CATIE-Sz Study
Investigators or the NIH.
Data used in Sect. 8.9 were obtained from the limited access data sets dis-
tributed from the NIMH-supported “Sequenced Treatment Alternatives to Relieve
Depression” (STAR*D) study. The study was supported by NIMH Contract #
N01MH90003 to the University of Texas Southwestern Medical Center. The Clini-
calTrials.gov identifier is NCT00021528. Analyses of the STAR*D data presented
in the book reflect the views of the authors and may not reflect the opinions or views
of the STAR*D Study Investigators or the NIH.
Contents

1 Introduction
1.1 Evidence-Based Personalized Medicine for Chronic Diseases
1.2 Personalized Medicine and Medical Decision Making
1.2.1 Single-stage Decision Problems in Personalized Medicine
1.2.2 Multi-stage Decisions and Dynamic Treatment Regimes
1.3 Outline of the Book
Glossary
References
Index
Acronyms
AB Adaptive bootstrap
ADHD Attention deficit hyperactivity disorder
AFT Accelerated failure time
ATT Average treatment effect on the treated
BCAWS Biased coin adaptive within-subject
BIC Bayesian Information Criterion
BUP Bupropion
BUS Buspirone
CATIE Clinical Antipsychotic Trials of Intervention Effectiveness
CBT Cognitive behavioral therapy
CCM Chronic care model
CCNIA Characterizing Cognition in Nonverbal Individuals with Autism
CIT Citalopram
CPB Centered percentile bootstrap
CRAN Comprehensive R Archive Network
CT Cognitive psychotherapy
DAG Directed acyclic graph
DB Double bootstrap
DP Dynamic programming
DTR Dynamic treatment regime
EF Estimating function
EM Enhanced motivational program
GAM Generalized additive model
GLM Generalized linear model
HAART Highly active antiretroviral therapy
HIV Human immunodeficiency virus
HM Hard-max
HT Hard-threshold
IMOR Iterative minimization of regrets
INR International normalized ratio
IPTW Inverse probability of treatment weighting
Chapter 1
Introduction

1.1 Evidence-Based Personalized Medicine for Chronic Diseases
Effective management of chronic disease calls for ongoing, coordinated care in the spirit of the chronic care model (CCM) (Wagner et al. 2001) rather than the more traditional acute care model. Some of the key features of health care that the CCM emphasizes are as follows. First, clinicians following the CCM treat the patients by individualizing the treatment type, dosage and timing according to ongoing measures of patient response, adherence, burden, side effects, and preference; there is a strong emphasis on personalization of care according to patients' needs. Second, instead of deciding on a treatment once and for all (static treatment), clinicians following the CCM sequentially make decisions
about what to do next to optimize patient outcome, given an individual patient’s
case history (dynamic treatment). The main motivations for considering sequences
of treatments are high inter-patient variability in response to treatment, likely re-
lapse, presence or emergence of co-morbidities, time-varying side effect severity,
and reduction of costs and burden when intensive treatment is unnecessary (Collins
et al. 2004). Third, while there exist traditional practice guidelines for clinicians
that are primarily based on “expert opinions”, the CCM advocates for making these
regimes more objective and evidence-based. In fact, Wagner et al. (2001) described
the CCM as “a synthesis of evidence-based system changes intended as a guide to
quality improvement and disease management activities” (p. 69).
Since effective care for chronic disorders typically requires ongoing medical in-
tervention, management of chronic disorders poses additional challenges for the
paradigm of personalized medicine. This is because the personalization has to hap-
pen through multiple stages of intervention. In this context, dynamic treatment
regimes (Murphy et al. 2001; Murphy 2003; Robins 2004; Lavori and Dawson 2004)
offer a vehicle to operationalize the sequential decision making process involved in
the personalized clinical practice consistent with the CCM, and thereby a potential
way to improve it. In the following sections, we will develop key notions underlying
dynamic treatment regimes.
1.2 Personalized Medicine and Medical Decision Making

More statistically oriented works include Lindley (1985), French (1986), and Parmi-
giani (2002); in particular, Parmigiani (2002) provides an excellent account of the
Bayesian approach to medical decision making. The type of decision problems stud-
ied in this book are, however, slightly different from the ones considered by the
above authors. Below we briefly introduce the single-stage and multi-stage decision
problems arising in personalized medicine that we will be considering in this book.
1.2.1 Single-stage Decision Problems in Personalized Medicine

For simplicity, first consider a single-stage decision problem, where the clinician has
to decide on the optimal treatment for an individual patient. Suppose the clinician
observes a certain characteristic (e.g. a demographic variable, a biomarker, or result
of a diagnostic test) of the patient, say o, and based on that has to decide whether
to prescribe treatment a or treatment a′. In this example, a decision rule could be: "give treatment a to the patient if his individual characteristic o is higher than a pre-specified threshold, and treatment a′ otherwise". More formally, a decision rule is
a mapping from currently available information, often succinctly referred to as the
state, into the space of possible decisions.
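To make this concrete, here is a minimal sketch of such a rule in code; the threshold value and the treatment labels are hypothetical placeholders, not values from any real study.

    # A decision rule maps the current state o into the space of possible
    # decisions. The threshold (1.5) and the treatment labels are
    # hypothetical placeholders, not values from any real study.
    def decision_rule(o, threshold=1.5):
        return "a" if o > threshold else "a_prime"

    print(decision_rule(2.3))  # state above threshold -> "a"
    print(decision_rule(0.7))  # state below threshold -> "a_prime"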
Any decision, medical or otherwise, is statistically evaluated in terms of its utility,
and the state in which the decision is made. For concreteness, let o denote the state
(e.g. patient characteristic), a denote a possible decision (treatment), and U(o, a) denote the utility of taking the decision a while in the state o. Following Wald (1949), the current statistical decision problem can be formulated in terms of the opportunity loss (or regret) associated with each pair (o, a) by defining a loss function

L(o, a) = sup_{a′} U(o, a′) − U(o, a),

where the supremum is taken over all possible decisions a′ for fixed o. The loss func-
tion is the difference between the utility of the optimal decision for state o, and
the utility of the current decision a under that state. Clearly the goal is to find the
decision that minimizes the loss function at the given state o; this is personalized
decision making since the optimal decision depends on the state. Equivalently, the
problem can be formulated directly in terms of the utility without defining the loss
function; in that case the goal would be to choose a decision so as to maximize
the utility for the given state o. The utility function can be specified in various
ways, depending on the specific problem. One of the most common ways would be
to set U(o, a) = E_a(Y | o), i.e. the conditional expectation of the primary outcome Y given the state, where the expectation is computed according to a probability distribution indexed by the decision a; we will make the underlying distributions precise in Chap. 3. Alternatively, one can define U(o, a) = E(Y(a) | o), where Y(a)
is the potential outcome of the decision a; see Chap. 2 for a precise description of
the potential outcome framework.
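As a minimal numerical sketch of this formulation, suppose estimates of U(o, a) are in hand for a fixed state o and two candidate decisions (the utility values below are invented for illustration); the optimal decision and the regret of each alternative then follow directly.

    # Hypothetical estimated utilities U(o, a) for a fixed state o and two
    # candidate decisions; the numbers are invented for illustration.
    utilities = {"a": 7.2, "a_prime": 5.9}

    # The optimal decision maximizes U(o, a) over available decisions,
    # and the regret of a decision a is sup_a' U(o, a') - U(o, a).
    optimal = max(utilities, key=utilities.get)
    regret = {a: utilities[optimal] - u for a, u in utilities.items()}

    print(optimal)  # -> "a"
    print(regret)   # -> {"a": 0.0, "a_prime": ~1.3}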
1.2.2 Multi-stage Decisions and Dynamic Treatment Regimes
Decision making problems arising not only in medicine but also in many other sci-
entific domains like business, computer science, and social sciences often involve
complex choices with multiple stages, where decisions made at one stage affect
those to be made at another. In the context of multi-stage decisions, a dynamic treat-
ment regime (DTR) is a sequence of decision rules, one per stage of intervention,
for adapting a treatment plan to the time-varying state of an individual subject. Each
decision rule takes a subject’s individual characteristics and treatment history ob-
served up to that stage as inputs, and outputs a recommended treatment at that stage;
recommendations can include treatment type, dosage, and timing. DTRs are alter-
natively known as treatment strategies (Lavori and Dawson 2000; Thall et al. 2000,
2002, 2007a), adaptive treatment strategies (Murphy 2005a; Lavori and Dawson
2008), or treatment policies (Lunceford et al. 2002; Wahed and Tsiatis 2004, 2006).
Conceptually, a DTR can be viewed as a decision support system of a clinician (or
more generally, any decision maker), described as a key element of the CCM (Wag-
ner et al. 2001). At a more basic level, it may be helpful to think of the regime as
a rule-book and the specific treatment as the rules that apply to an individual case.
The reason for considering a DTR as a whole instead of its individual stage-specific
components is that the long-term effect of the current treatment may depend on
the performance of future treatment choices. This issue will be discussed in greater
detail in Chaps. 2 and 3.
Patients with HIV infection are usually treated with highly active antiretroviral ther-
apy (HAART). It is widely agreed that HAART should be initiated when CD4 cell
count falls below 200 cells/μl, but a key question is whether to initiate HAART
sooner in the course of the disease. In particular, it is of interest to know whether
it is optimal to begin treatment when CD4 cell count first drops below a certain
threshold, where that threshold may be as low as 200, or as high as 500, cells/μl
(Sterne et al. 2009). Thus, the process of treating an HIV-infected patient is a multi-
stage decision problem faced by the clinician who has to make treatment decisions
based on the patient’s CD4 count history (state) at a series of critical decision points
(stages) (Cain et al. 2010).
Patients with cancer are often treated initially with a powerful chemotherapy, known
as induction therapy, to induce remission of the disease. If the patient responds (e.g.
shows sign of remission), the clinician tries to maintain remission for as long as pos-
sible before relapse by prescribing a maintenance therapy to intensify or augment
the effects of the first-line induction therapy. If the patient does not respond (e.g.
does not show sign of remission) to the first-line induction therapy, the clinician pre-
scribes a second-line induction therapy to try to induce remission. Of course there
exist many possible induction therapies and maintenance therapies. For treating a
patient with cancer, a clinician may want to use a DTR that maximizes the disease-
free survival time (primary outcome). One possible DTR can be: “initially prescribe
the first-line induction therapy a; if the patient responds to a, prescribe maintenance
therapy a′, and if the patient does not respond to a, prescribe the second-line induction therapy a′′". See, for example, Wahed and Tsiatis (2004) for further details on
this two-stage clinical decision problem in the context of leukemia.
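This regime can be written down as a pair of stage-specific decision rules; the sketch below simply encodes the quoted rule, with placeholder labels standing in for the generic therapies a, a′, and a′′.

    # The quoted leukemia regime as a pair of stage-specific decision rules;
    # "a", "a_prime", "a_dbl_prime" stand for the generic therapies a, a', a''.
    def stage1_rule():
        return "a"  # this regime always starts with first-line induction a

    def stage2_rule(responded):
        # maintenance a' for responders, second-line induction a'' otherwise
        return "a_prime" if responded else "a_dbl_prime"

    for responded in (True, False):
        print("responder:" if responded else "non-responder:",
              stage1_rule(), "then", stage2_rule(responded))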
In the following chapter, we will focus on methods from the sta-
tistical literature that hinge on direct modeling of contrasts of conditional mean
outcomes under different regimes; this includes methods such as G-estimation of
structural nested mean models and A-learning.
In Chap. 5, we turn to methods that model regimes directly. The chapter includes
inverse probability of treatment weighted estimators such as marginal structural
models as well as classification-based estimators. Chapter 6 takes a more model-
based approach, and considers the likelihood-based method of G-computation.
The first six chapters focus on continuous outcome settings. In Chap. 7, we
consider the literature to date on alternative outcome types: composite (multi-
dimensional) outcomes, censored data, and discrete outcomes. A variety of methods
from the previous chapters will be revisited.
Inference for optimal DTRs is discussed in Chap. 8. The issue of inference is
particularly difficult in the DTR setting due to the phenomenon of non-regularity.
Non-regularity and the ensuing complications arise because any method of esti-
mating the optimal DTR involves non-smooth operations on the data. As a result,
standard asymptotic theory or the usual bootstrap approach fail to produce valid con-
fidence intervals for true treatment effect parameters. Various methods of avoiding
this problem are discussed and compared in this chapter.
In Chap. 9, we discuss some additional considerations, such as model-building strategies and variable selection; there, we also conclude the book with some overall discussion and remarks on the directions in which the field appears to be moving.
Chapter 2
The Data: Observational Studies
and Sequentially Randomized Trials
The data for constructing (optimal) DTRs that we consider are obtained from either
longitudinal observational studies or sequentially randomized trials. In this chapter
we review these two types of data sources, their advantages and drawbacks, and
the assumptions required to perform valid analyses in each, along with some ex-
amples. We also discuss a basic framework of causal inference in the context of
observational studies, and power and sample size issues in the context of random-
ized studies.
2.1 Longitudinal Observational Studies

The goal of much of statistical inference is to quantify causal relationships, for in-
stance to be able to assert that a specified treatment1 improves patient outcomes
rather than to state that treatment use or prescription of treatment is merely asso-
ciated or correlated with better patient outcomes. Randomized trials are the “gold
standard” in study design, as randomization coupled with compliance allows causal
interpretations to be drawn from statistical association. Making causal inferences
from observational data, however, can be tricky and relies critically on certain (un-
verifiable) assumptions which we will discuss in Sect. 2.1.3. The notion of causation
is not new: it has been the subject matter of philosophers as far back as Aristotle,
and more recently of econometricians and statisticians. Holland (1986) provides a
nice overview of the philosophical views and definitions of causation as well as
of the causal models frequently used in statistics. Neyman (1923) and later Rubin
(1974) laid the foundations for the framework now used in modern causal inference.
The textbook Causal Inference (Hernán and Robins 2013) provides a thorough de-
scription of basic definitions and most modern methods of causal inference for both
1 In this book, we use the term treatment generically to denote either a medical treatment or an
exposure (which is the preferred term in the causal inference literature and more generally in epi-
demiology).
Much of the exposition of methods used when data are observational will rely on
the notion of potential outcomes (also called counterfactuals), defined as a per-
son’s outcome had he followed a particular treatment regime, possibly different from
the regime which he was actually observed to follow (hence, counter to fact). The
individual-level causal effect of a regime may then be viewed as the difference in
outcomes if a person had followed that regime as compared to a placebo regime or
a standard care protocol. Consider, for example, a simple one-stage2 randomized
trial in which subjects can receive either a or a′. Suppose now that an individual was randomized to receive treatment a. This individual will have a single observed outcome Y which corresponds to the potential outcome "Y under treatment a", denoted by Y(a), and one unobservable potential outcome, Y(a′), corresponding to the outcome under a′. An alternative notation to express counterfactual quantities is via subscripting: Y_a and Y_{a′} (Hernán et al. 2000). Pearl (2009) uses an approach similar
to that of the counterfactual framework, using what is called the “do” notation to
express the idea that a treatment is administered rather than simply observed to have
been given: in his notation, E[Y |do(A = a)] is the expected value of the outcome
variable Y under the intervention regime a, i.e. it is the population average were all
subjects forced to take treatment a.
The so-called fundamental problem of causal inference lies in the definition of
causal parameters at an individual level. Suppose we are interested in the causal ef-
fect of taking treatment a instead of treatment a′. An individual-level causal parameter that could be considered is a person's outcome under treatment a′ subtracted from his outcome under treatment a, i.e. Y(a) − Y(a′). Clearly, it is not possible to observe the outcome under both treatments a and a′ without further data and assumptions
(e.g. in a cross-over trial with no carry-over effect) and so the individual-level causal
effect can never be observed. However, population-level causal parameters or aver-
age causal effects can be identified under randomization with perfect compliance, or
bounded under randomization with non-compliance. Without randomization, i.e. in
observational studies or indeed randomized trials with imperfect compliance, fur-
ther assumptions are required to estimate population-level causal effects, which we
shall detail shortly.
Suppose now that rather than being a one-stage trial, subjects are treated over two
stages, and can receive at each stage either a or a′. If an individual was randomized to receive treatment a first and then treatment a′, this individual will have a single
observed outcome Y which corresponds to the potential outcome “Y under regime
2 While the term stage is commonly used in the randomized trial literature, the term interval is
more popular in the causal inference literature. In this book, for consistency, we will use the term
stage for both observational and randomized studies.
(a, a′)", which we denote by Y(a, a′), and three unobservable potential outcomes: Y(a, a), Y(a′, a), and Y(a′, a′), corresponding to outcomes under each of the other three possible regimes. As is clear even in this very simple example, the number of potential outcomes (2^K per subject when there are K stages with two options at each) and causal effects as represented by contrasts between the potential outcomes can be very large, even for a moderate number of stages. As shall be seen in Chap. 4, the optimal dynamic regime may be estimated while limiting the models specified to only a subset of all possible contrasts.
Longitudinal data are increasingly available to health researchers; this type of data
presents challenges not observed in cross-sectional data, not the least of which is the
presence of time-varying confounding variables and intermediate effects. A variable
O is said to be a mediating or intermediate variable if it is caused by A and in turn
causes changes in Y . For example, a prescription sleep-aid medication (A) may cause
dizziness (O) which in turn causes fall-related injuries (Y ). In contrast, a variable,
O, is said to confound a relationship between a treatment A and an outcome Y if
it is a common cause of both the treatment and the outcome. More generally, a
variable is said to be a confounder (relative to a set of covariates X) if it is a pre-
treatment covariate that removes some or all of the bias in a parameter estimate,
when taken into account in addition to the variables X. It may be the case, then,
that a variable is a confounder relative to one set of covariates X but not another,
X′. If the effect of O on both A and Y is not accounted for, it may appear that there
is a relationship between A and Y when in fact their pattern of association may
be due entirely to changes in O. For example, consider a study of the dependence
of the number of deaths by drowning (Y ) on the use of sunscreen (A). A strong
positive relationship is likely to be observed; however, it is far more likely that this is
due to the confounding variable air temperature (O). When air temperature is high,
individuals may be more likely to require sunscreen and may also be more likely to
swim, but there is no reason to believe that the use of sunscreen increases the risk of
drowning. In cross-sectional data, eliminating the bias due to a confounding effect
is typically achieved by adjusting for the variable in a regression model.
Directed Acyclic Graphs (DAGs), also called causal graphs, formalize the causal
assumptions that a researcher may make regarding the variables he wishes to ana-
lyze. A graph is said to be directed if all inter-variable relationships are connected
by arrows indicating that one variable causes changes in another and acyclic if it
has no closed loops (no feedback between variables); see, for example, Greenland
et al. (1999) or Pearl (2009) for further details. DAGs are becoming more common
in epidemiology and related fields as researchers seek to clarify their assumptions
about hypothesized relationships and thereby justify modeling choices (e.g. Bodnar
et al. 2004; Brotman et al. 2008). In particular, confounding in its simplest form can
be visualized in a DAG if there is an arrow from O into A, and another from O into
Y . Similarly, mediation is said to occur if there is at least one directed path of arrows
from A to Y that passes through O.
Let us now briefly turn to a two-stage setting where data are collected at three
time-points: baseline (t1 = 0), t2, and t3. Covariates are denoted O1 and O2, measured at baseline and t2, respectively. Treatments at stages 1 and 2, received in the intervals [0, t2) and [t2, t3), are denoted A1 and A2, respectively. Outcome, measured at t3, is
denoted Y . Suppose there is an additional variable, U, which is a cause of both O2
and Y . See Fig. 2.1.
Fig. 2.1 A two-stage directed acyclic graph illustrating time-varying confounding and mediation (nodes O1, O2, Y, A1, A2, and an unmeasured cause U; arrows labeled (a)–(g); time-points t1, t2, t3)
We first focus on the effect of A1 on Y ; A1 acts directly on Y , but also acts indi-
rectly through O2 as indicated by arrows (e) and (d); O2 is therefore a mediator. We
now turn our attention to the effect of A2 on Y ; O2 confounds this relationship, as
can be observed by arrows (d) and (f). In this situation, adjustment for O2 is essential
to obtaining unbiased estimation of the effect of A2 on Y . However, complications
may arise if there are unmeasured factors that also act as confounders; in Fig. 2.1,
U acts in this way. If one were to adjust for O2 in a regression model, it would open
what is called a “back-door” path from Y to A2 via the path (b)→(a)→(c)→(g). This
is known as collider-stratification bias, selection bias, Berksonian bias, Berkson’s
paradox, or, in some contexts, the null paradox (Robins and Wasserman 1997; Gail
and Benichou 2000; Greenland 2003; Murphy 2005a); this problem will be consid-
ered in greater depth in Sect. 3.4.2 in the context of estimation. Collider-stratification
bias can also occur when conditioning on or stratifying by variables that are caused
by both the exposure and the outcome, and there has been a move in the epidemiol-
ogy literature to use the term selection bias only for bias caused by conditioning on
post-treatment variables, and the term confounding for bias caused by pre-treatment
variables (Hernán et al. 2004).
Modeling choices become more complex when data are collected over time,
particularly as a variable may act as both a confounder and a mediator. The use
of a DAG forces the analyst to be explicit in his modeling assumptions, particu-
larly as the absence of an arrow between two variables (“nodes”) in a graph implies
the assumption of (conditional) independence. Some forms of estimation are able
to avoid the introduction of collider-stratification bias by eliminating conditioning
(e.g. weighting techniques) while others rely on the assumption that no variables
such as U exist. See Sect. 3.4.2 for a discussion on how Q-learning, a stage-wise
regression based method of estimation, avoids this kind of bias by analyzing one
stage at a time.
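A small simulation can make this bias tangible. The data-generating process below is our own simplification of the structure in Fig. 2.1 (A1 has no effect on Y at all, U is an unmeasured common cause of O2 and Y, and A2 is randomized): a regression of Y on (A1, A2) correctly finds a null A1 effect, while additionally adjusting for the collider O2 manufactures a spurious one.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 200_000

    # Our simplified data-generating process: A1 has NO effect on Y at all.
    # U is unmeasured; O2 is a collider on the path A1 -> O2 <- U -> Y.
    a1 = rng.integers(0, 2, n)
    u = rng.normal(size=n)
    o2 = a1 + u + rng.normal(size=n)
    a2 = rng.integers(0, 2, n)          # second-stage treatment, randomized
    y = a2 + u + rng.normal(size=n)

    def ols_coefs(y, *covs):
        # least-squares fit; returns (intercept, coefficients...)
        x = np.column_stack([np.ones(len(y)), *covs])
        return np.linalg.lstsq(x, y, rcond=None)[0]

    print("Y ~ A1 + A2     :", ols_coefs(y, a1, a2).round(2))      # A1 coef ~ 0
    print("Y ~ A1 + A2 + O2:", ols_coefs(y, a1, a2, o2).round(2))  # A1 coef ~ -0.5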
That is, for any possible regime ā_K, the treatment A_j received in the jth stage is independent of any future (potential) covariate or outcome, O_{j+1}(ā_j), . . ., O_K(ā_{K−1}), Y(ā_K), conditional on the history H_j (Robins 1997).
That is, feasibility requires some subjects to have followed regime d̄_K for the an-
alyst to be able to estimate its performance non-parametrically. To express this in
terms of decision trees, no non-parametric inference can be made about the effect
of following a particular branch of a decision tree if no one in the sample followed
that path.
Other terms have been used to describe feasible treatment regimes, including
viable (Wang et al. 2012) and realistic (Petersen et al. 2012) rules. Feasibility
is closely related to the positivity, or experimental treatment assignment (ETA),
assumption. Positivity, like feasibility, requires that there are both treated and
untreated individuals at every level of the treatment and covariate history. Positiv-
ity may be violated either theoretically or practically. A theoretical or structural
violation occurs if the study design prohibits certain individuals from receiving a
particular treatment, e.g. failure of one type of drug may preclude the prescription
of other drugs in that class. A practical violation of the positivity assumption is
said to occur when a particular stratum of subjects has a very low probability of re-
ceiving the treatment (Neugebauer and Van der Laan 2005; Cole and Hernán 2008).
Visual and bootstrap-based approaches to diagnosing positivity violations have been
proposed for one-stage settings (Wang et al. 2006; Petersen et al. 2012). Practical
positivity violations may be more prevalent in longitudinal studies if there exists a
large number of possible treatment paths; methods for handling such violations in
multi-stage settings are less developed.
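In the spirit of the diagnostics cited above, a rudimentary screen for practical violations in a single-stage setting is to tabulate the observed probability of treatment within strata of the covariate; strata in which almost no one (or almost everyone) is treated are flagged. The simulated data and the 5% cut-off below are illustrative assumptions, not part of any published procedure.

    import numpy as np

    rng = np.random.default_rng(1)
    n = 5_000

    # Illustrative data: treatment assignment depends strongly on covariate O,
    # so extreme strata contain almost no treated (or untreated) subjects.
    o = rng.normal(size=n)
    a = rng.binomial(1, 1.0 / (1.0 + np.exp(-3.0 * o)))

    # Tabulate Pr(A = 1) within deciles of O and flag near-violations.
    edges = np.quantile(o, np.linspace(0.0, 1.0, 11))
    for lo, hi in zip(edges[:-1], edges[1:]):
        p_hat = a[(o >= lo) & (o <= hi)].mean()
        flag = "  <-- near-violation" if min(p_hat, 1 - p_hat) < 0.05 else ""
        print(f"O in [{lo:+.2f}, {hi:+.2f}]: Pr(A=1) = {p_hat:.3f}{flag}")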
There is an additional assumption that is not required for estimation, but that
is useful for understanding the counterfactual quantities and models that will be
considered: the assumption of additive local rank preservation, which we shall elu-
cidate in two steps. First, local rank preservation states that the ranking of subjects’
outcomes under a particular treatment pattern ā_K is the same as their ranking under any other pattern, say d̄_K, given treatment and covariate history (see Table 2.1). In particular, if we consider two regimes d̄_K and ā_K, local rank preservation states that the ranking of patients' outcomes under regime d̄_K is the same as their ranking under regime ā_K conditional on the history H_j. Local rank preservation is said to be additive when Y(d̄_K) = Y(ā_K) + cons, where cons = E[Y(d̄_K) − Y(ā_K)], i.e., the individual causal effect equals the average causal effect. This is also called unit
treatment additivity. Thus, rank preservation makes the assumption that the indi-
viduals who do best under one regime will also do so under another, and in fact
the ranking of each individual’s outcome will remain unchanged whatever the treat-
ment pattern received. Additive local rank preservation makes the much stronger
assumption that the difference between any two individuals’ outcomes will be the
same under all treatment patterns.
Table 2.1 Local rank preservation (LRP) and additive LRP, assuming all subjects have the same baseline covariates

                          LRP             Additive LRP
Subject   Y(ā_K)  Rank    Y(d̄_K)  Rank    Y(d̄_K)  Rank
1         12.8    3       15.8    3       13.9    3
2         10.9    1       14.0    1       12.0    1
3         13.1    4       16.0    4       14.2    4
4         12.7    2       14.5    2       13.8    2
2.2 Examples of Longitudinal Observational Studies

Data suitable for studying DTRs have arisen from a variety of observational sources, including encouragement trials (Moodie et al. 2009) and cohort studies (Van der Laan and
Petersen 2007b). We shall briefly describe three here to demonstrate the variety of
questions that can be addressed using observational data and DTR methodology.
In particular, the data in the examples below have been addressed using regret-
regression, G-estimation, and marginal structural models; these and related methods
of estimation are presented in Chaps. 4 and 5.
Rosthøj et al. (2006) aimed to find a warfarin dosing strategy to control the risk of
both clotting and excessive bleeding, by tailoring treatment using the international
normalized ratio, a measure of clotting tendency of blood. Observational data were
taken from hospital records over a five year period; recorded variables included age,
sex, and diagnosis as well as a time-varying measure of INR. There exists a standard
target range for INR, and so the vector-valued tailoring variable, O j , was taken to
be 0 if the most recent INR measurement lay within the target range and otherwise
was taken to be the ratio of the difference between the INR measurement and the
nearest boundary of the target range, and the width of that target range. Treatment at
stage j, A j , was taken to be the change in warfarin dose (with 0 being an acceptable
option). The outcome of interest was taken to be the percentage of the time on study
in which a subject’s INR was within the target range.
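The construction of this tailoring variable is easily coded; the sketch below follows the verbal definition above, with an illustrative target range that is our assumption rather than a value taken from the study.

    # The warfarin tailoring variable as described above: 0 inside the target
    # range; otherwise the (signed) distance from the nearest boundary divided
    # by the range's width. The range (2.0, 3.0) is an illustrative assumption.
    def tailoring_variable(inr, lower=2.0, upper=3.0):
        if lower <= inr <= upper:
            return 0.0
        nearest = lower if inr < lower else upper
        return (inr - nearest) / (upper - lower)

    print(tailoring_variable(2.5))  # in range    -> 0.0
    print(tailoring_variable(1.6))  # below range -> ~ -0.4
    print(tailoring_variable(3.8))  # above range -> ~ 0.8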
Rosthøj et al. (2006) modeled the effect of taking the observed rather than the
optimal dose of warfarin using parametric mean models that are quadratic in the
dosing effect so that doses that are either too low or too high are penalized.
Cotton and Heagerty (2011) performed an analysis of the United States Renal Data
System, an administrative data set based on Medicare claims for hemodialysis patients with
end-stage renal disease. Covariates included demographic variables as well as clin-
ical and laboratory variables such as diabetes, HIV status, and creatinine clearance.
Monthly information was also available on the number of dialysis sessions reported,
the number of epoetin doses recorded, the total epoetin dosage, iron supplementa-
tion dose, the number of days hospitalized and the most recently recorded hemat-
ocrit measurement in the month.
Restricting their analysis to incident end-stage renal disease patients free from
HIV/AIDS from 2003, Cotton and Heagerty (2011) considered treatment rules that
adjust epoetin treatment at time j, A j , multiplicatively based on the value of treat-
ment in the previous month, A j−1 , and the most recent hematocrit measurement, O j :
         ⎧ A_{j−1} × (0, 0.75)     if O_j ≥ ψ + 3
A_j ∈    ⎨ A_{j−1} × (0.75, 1.25)  if O_j ∈ (ψ − 3, ψ + 3)
         ⎩ A_{j−1} × (1.25, ∞)     if O_j ≤ ψ − 3
where the target hematocrit range specified by the parameter ψ is varied to consider
a range of different regimes. That is, O_j is the tailoring variable at each month, and the optimal regime is the treatment rule d_j^{opt}(O_j, A_{j−1}; ψ) that maximizes survival time for ψ ∈ {31, 32, . . ., 40}. Thus, in contrast to the strategy employed by Rosthøj
et al. (2006), the decision rules considered in the analysis of Cotton and Heagerty
(2011) did not attempt to estimate the optimal treatment changes/doses, but rather
focused on estimating which target range of hematocrit should initiate a change in
treatment dose from one month to the next. Note that the parameter ψ (the mid-
value of the target hematocrit range) does not vary over time, but rather is common
over all months; this is called parameter sharing (over time).
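Operationally, each month the rule compares the latest hematocrit to the band ψ ± 3 and rescales the previous dose. In the sketch below, the specific multipliers are our own arbitrary choices from within each interval, since the regimes above are defined by intervals of multipliers rather than point values.

    # One month of the hematocrit-based rule: rescale last month's epoetin
    # dose according to where O_j falls relative to the band psi +/- 3.
    # The multipliers 0.5, 1.0, 1.5 are arbitrary picks from within each
    # interval; the published regimes fix the intervals, not point values.
    def adjust_dose(prev_dose, hematocrit, psi):
        if hematocrit >= psi + 3:
            return prev_dose * 0.5   # above band: reduce, within (0, 0.75)
        if hematocrit <= psi - 3:
            return prev_dose * 1.5   # below band: increase, within (1.25, inf)
        return prev_dose * 1.0       # inside band: roughly maintain

    for hct in (28.0, 34.0, 39.0):   # example target midpoint psi = 34
        print(hct, "->", adjust_dose(10_000, hct, psi=34))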
2.3 Sequentially Randomized Studies

It is well known that estimates based on observational data are often subject to
confounding and various hidden biases; hence randomized data, when available, are
preferable for more accurate estimation and stronger statistical inference (Rubin
1974; Holland 1986; Rosenbaum 1991). This is especially important when dealing
with DTRs since the hidden biases can compound over stages. One crucial point
to note here is that developing DTRs is a developmental procedure rather than a
confirmatory procedure. Usual randomized controlled trials are used as the “gold
standard” for evaluating or confirming the efficacy of a newly developed treatment,
not for developing the treatment per se. Thus, generating meaningful data for de-
veloping optimal DTRs is beyond the scope of the usual confirmatory randomized
trials; special design considerations are required. A special class of designs called
sequential multiple assignment randomized trial (SMART) designs, tailor-made for
the purpose of developing optimal DTRs, is discussed below.
SMART designs involve an initial randomization of patients to possible treat-
ment options, followed by re-randomizations at each subsequent stage of some or
all of the patients to another treatment available at that stage. The re-randomizations
at each subsequent stage may depend on information collected after previous treat-
ments, but prior to assigning the new treatment, e.g. how well the patient responded
to the previous treatment. Thus, even though a subject is randomized more than
once, ethical constraints are not violated. This type of design was first introduced
by Lavori and Dawson (2000) under the name biased coin adaptive within-subject
(BCAWS) design, and practical considerations for designing such trials were dis-
cussed by Lavori and Dawson (2004). Building on these works, Murphy (2005a)
proposed the general framework of the SMART design. These designs attempt to
conform better to the way clinical practice for chronic disorders actually occurs, but
still retain the well-known advantages of randomization over observational studies.
SMART-like trials, i.e. trials involving multiple randomizations, had been used
in various fields even before the exact framework was formally established; see
for example, the CALGB Protocol 8923 for treating elderly patients with leukemia
(Stone et al. 1995; Wahed and Tsiatis 2004, 2006), the CATIE trial for antipsy-
chotic medications in patients with Alzheimer’s disease (Schneider et al. 2001), the
STAR*D trial for treatment of depression (Lavori et al. 2001; Rush et al. 2004; Fava
et al. 2003), and some cancer trials conducted at the MD Anderson Cancer Center
(Thall et al. 2000). Other examples include a smoking cessation study conducted
by the Center for Health Communications Research at the University of Michigan
(Strecher et al. 2008; Chakraborty et al. 2010), and a trial of neurobehavioral treat-
ments for patients with metastatic malignant melanoma (Auyeung et al. 2009). More
recently, Lei et al. (2012) discussed four additional examples of SMARTs: the Adap-
tive Characterizing Cognition in Nonverbal Individuals with Autism (CCNIA) De-
velopmental and Augmented Intervention (Kasari 2009) for school-age, nonverbal
children with autism spectrum disorders; the Adaptive Pharmacological and Behav-
ioral Treatments for children with attention deficit hyperactivity disorder (ADHD)
(see for example, Nahum-Shani et al. 2012a,b); the Adaptive Reinforcement-Based
Treatment for Pregnant Drug Abusers (RBT) (Jones 2010); and the ExTENd study
for alcohol-dependent individuals (Oslin 2005). Lei et al. (2012) also discussed the
subtle distinctions between different types of SMARTs in terms of the extent of
multiple randomizations: (i) SMARTs in which only the non-responders to one of
the initial treatments are re-randomized (e.g. CCNIA); (ii) SMARTs in which non-
responders to all the initial treatments are re-randomized (e.g. the ADHD trial); and
(iii) SMARTs in which both responders and non-responders to all the initial treat-
ments are re-randomized (e.g. RBT, ExTENd).
Fig. 2.2 Hypothetical SMART design schematic for the addiction management example (an "R" within a circle denotes randomization at a critical decision point): patients are first randomized to CBT or NTX; responders are then re-randomized to TM or TMC, while non-responders are re-randomized to a switch of the initial treatment or to the augmentation EM + CBT + NTX
Note that the goal of a SMART design is to generate high-quality data that would aid in the development and evaluation of optimal DTRs. A competing design approach
could be to conduct separate randomized trials for each of the separate stages, to
find the optimal treatment at each stage based on the trial data, and then combine
these optimal treatments from individual stages to create a DTR. For example, in-
stead of the SMART design for the addiction management study described above,
the researcher may conduct two single-stage randomized trials. The first trial would
involve a comparison of the initial treatments (CBT versus NTX). The researcher
would then choose the best treatment based on the results of the first trial and move
on to the second trial where all subjects would be initially treated with the cho-
sen treatment and then responders would be randomized to one of the two possi-
ble options: TM or TMC, and non-responders would be randomized to one of the
two possible options: switch of the initial treatment or a treatment augmentation
(EM + CBT + NTX). However, when used to optimize DTRs, this approach suffers
from several disadvantages as compared to a SMART design.
First, this design strategy is myopic, and may often fail to detect possible de-
layed effects of treatments, ultimately resulting in a suboptimal DTR (Lavori and
Dawson 2000). Many treatments can have effects that do not occur until after the
intermediate outcome (e.g. response to initial treatment) has been measured, such
as improving the effect of a future treatment or long-term side effects that prevent
a patient from being able to use an alternative useful treatment in future. SMART
designs are able to capture such delayed effects, while the competing approach is not. This point can be further elucidated using the addiction management example,
following the original arguments of Murphy (2005a). Suppose counseling (TMC) is
more effective than monitoring (TM) among responders to CBT; this is a realistic
scenario since the subject can learn to use counseling during CBT at the initial stage
and thus is able to take advantage of the counseling offered at the subsequent stage
to responders. Individuals who received NTX during the initial treatment would not
have learned to use counseling, and thus among responders to NTX the addition
of counseling to the monitoring does not improve abstinence relative to monitoring
alone. If an individual is a responder to CBT, it is best to offer TMC as the sec-
ondary treatment. But if the individual is a responder to NTX, it is best to offer the
less expensive TM as the secondary treatment. In summary, even if CBT and NTX
result in the same proportion of responders (or, even if CBT appears less effective
at the initial stage), CBT may be the best initial treatment as part of the entire treat-
ment sequence. This would be due to the enhanced effect of TMC when preceded
by CBT. On the other hand, if the researcher employs two separate stage-specific
trials, he would likely conduct the second trial with NTX (which is cheaper than
CBT) as the initial treatment, unless CBT looks significantly better than NTX at the
first trial. In that case, there would be no way for the researcher to discover the truly
optimal regime.
Second, even though the results of the first trial may indicate that treatment a is
initially less effective than treatment a′, it is quite possible that treatment a may elicit
valuable diagnostic information that would permit the researcher to better personal-
ize the subsequent treatment to each subject, and thus improve the primary outcome.
This issue can be better discussed using the ADHD study example (Nahum-Shani
et al. 2012a,b), following the original discussion of Lei et al. (2012). In secondary
analyses of the ADHD study, Nahum-Shani et al. (2012a,b) found evidence that
children’s adherence to the initial intervention could be used to better match the
secondary intervention. More precisely, among non-responders to the initial inter-
vention (either low-dose medication or low-dose behavioral modification), those
with low adherence performed better when the initial intervention was augmented
with the other type of intervention at the second stage, compared to increasing the
dose or intensity of the initial treatment at the second stage. This phenomenon is
sometimes called the diagnostic effect or prescriptive effect.
Third, subjects who enroll and remain in a single-stage trial may be inherently
different from those who enroll and remain in a SMART. This is a type of co-
hort effect or selection effect, as discussed by Murphy et al. (2007a). Consider a
single-stage randomized trial in which CBT is compared with NTX. First, in order
to reduce variability in the treatment effect, investigators would tend to set very re-
strictive entry criteria (this is the case with most RCTs), which would result in a
cohort that represents only a small subset of the treatable population. In contrast,
researchers employing a SMART design would not try to reduce the variability in
the treatment effect, since this design would allow varying treatment sequences for
different types of patients. Hence SMARTs can recruit from a wider population
of patients, and would likely result in greater generalizability. Furthermore, in a
single-stage RCT, for subjects with no improvement in symptoms and for those ex-
periencing severe side-effects, there is often no option but to drop out of the study
or cease to comply with the study protocol. In contrast, non-responding subjects in
a SMART would know that their treatments will be altered at some point. Thus it
can be argued that non-responding subjects may be less likely to drop out from a
SMART relative to a single-stage randomized trial. Consequently the choice of the
best initial treatment obtained from a single-stage trial may be based on a sample
less representative of the study population compared to the choice of the best initial
treatment obtained from a SMART.
From the above discussion, it is clear that conducting separate stage-specific tri-
als and combining best treatment options from these separate trials may fail to de-
tect delayed effects and diagnostic effects, and may result in possible cohort effects,
thereby rendering the developed sequence of treatment decisions potentially subop-
timal. This has been the motivation to consider SMART designs.
For simplicity of exposition, let us focus on SMART designs with only two stages;
however the ideas can be generalized to any finite number of stages. Denote the
observable data trajectory for a subject in a SMART by (O1, A1, O2, A2, Y), where O1 denotes pre-treatment information, A1 the first-stage treatment, O2 the intermediate information collected after A1 but before A2, A2 the second-stage treatment, and Y the primary outcome. Under sequential randomization, the mean primary outcome of a DTR can be written as a function
of the multivariate distribution of the observable data obtained from a SMART;
see Murphy (2005a) for detailed derivation. This property ensures that data from
SMARTs can be effectively used to evaluate pre-specified DTRs or to estimate the
optimal DTR within a certain class. We defer our discussion of estimation of optimal
DTRs to later chapters.
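For concreteness, one standard way to express this identity for two stages (the g-computation formulation, stated here under the usual consistency and sequential randomization conditions; we paraphrase rather than quote the book's own display) is, for a regime d = (d1, d2),

μ_d = E[Y(d)] = Σ_{o1} Σ_{o2} E[Y | O1 = o1, A1 = d1(o1), O2 = o2, A2 = d2(o1, o2)] × P(O2 = o2 | O1 = o1, A1 = d1(o1)) × P(O1 = o1),

with integrals replacing sums for continuous covariates; every factor on the right-hand side is estimable from SMART data because both treatments are randomized with known probabilities.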
As is the case with any other study, power and sample size calculations are cru-
cial elements in designing a SMART. In a SMART, one can investigate multiple
research questions, both concerning entire DTRs (e.g. comparing the effects of two
DTRs) and concerning certain components thereof (e.g. testing the main effect of the
first stage treatment, controlling for second stage treatment). To power a SMART,
however, the investigator needs to choose a primary research question (primary hy-
pothesis), and calculate the sample size based on that question. Additionally, one or
more secondary questions (hypotheses) may be investigated in the study. While the
SMART provides unbiased answers (free from confounding) to these secondary
questions by virtue of randomization, it is not necessarily powered to address these
secondary hypotheses.
Suppose the randomization probability is 1/2 for each treatment option at the first
stage. Standard calculation yields a total sample size formula for the two sided test
with power (1 − β ) and size α :
n = 4(z_{α/2} + z_β)^2 δ^{−2},

where z_{α/2} and z_β are the standard normal (1 − α/2) and (1 − β) percentiles, respectively. To use the formula, one needs to postulate the effect size δ, as
is the case in standard two-group randomized controlled trials (RCTs).
Another interesting primary question could be: “on average what is the best sec-
ondary treatment, TM or TMC, for responders to initial treatment?”. In other words,
the researcher wants to compare the mean primary outcomes of two groups of re-
sponders (those who get TM versus TMC as the secondary treatment). As before,
a standard formula can be used. Define the standardized effect size δ as the standardized difference in mean primary outcomes between the two groups (Cohen 1988), i.e. the difference in means divided by the common standard deviation.
Let γ denote the overall response rate to initial treatment. Suppose the random-
ization probability is 1/2 for each treatment option at the second stage. Standard
calculation yields a total sample size formula for the two sided test with power
(1 − β ) and size α :
n = 4(z_{α/2} + z_β)^2 δ^{−2} γ^{−1}.
To use the formula, one needs to postulate the overall initial response rate γ , in
addition to postulating the effect size δ . A similar question could be a comparison
of secondary treatments among non-responders; in this case the sample size formula
would be a function of non-response rate to the initial treatment.
Alternatively researchers may be interested in primary research questions related
to entire DTRs. In this case, Murphy (2005a) argued that the primary research ques-
tions should involve the comparison of two DTRs beginning with different initial
treatments. Test statistics and sample size formulae for this type of research ques-
tion have been derived by Murphy (2005a) and Oetting et al. (2011).
The comparison of two DTRs, say d̄ and d̄′, beginning with different initial treatments can be obtained by comparing the subgroup of subjects in the trial whose treatment assignments are consistent with regime d̄ with the subgroup of subjects in the trial whose treatment assignments are consistent with regime d̄′. Note that there is no overlap between these two subgroups, since a subject's initial treatment assignment can be consistent with only one of d̄ or d̄′. The standardized effect size in this context is defined as δ = (μ_d̄ − μ_d̄′)/√{(σ²_d̄ + σ²_d̄′)/2}, where μ_d̄ is the mean primary outcome under the regime d̄ and σ²_d̄ is its variance. Suppose the randomization probability for each treatment option is 1/2 at each stage. In this case, using a large-sample approximation, the required sample size for the two-sided test with power (1 − β) and size α is

n = 8(z_{α/2} + z_β)^2 δ^{−2}.
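Since the sample size formulae in this section share the form n = c(z_{α/2} + z_β)^2 δ^{−2} γ^{−1}, with c ∈ {4, 8} and γ = 1 unless the comparison is restricted to (non)responders, a single helper covers all three; this sketch uses scipy for the normal quantiles.

    from scipy.stats import norm

    def smart_sample_size(delta, alpha=0.05, power=0.80, c=4, gamma=1.0):
        """n = c * (z_{alpha/2} + z_beta)^2 / (delta^2 * gamma); c = 4 for the
        stage-specific comparisons above, c = 8 for comparing two embedded
        DTRs; gamma is the (non)response rate when the comparison is
        restricted to (non)responders."""
        z = norm.ppf(1.0 - alpha / 2.0) + norm.ppf(power)
        return c * z**2 / (delta**2 * gamma)

    print(round(smart_sample_size(0.5)))             # first-stage comparison
    print(round(smart_sample_size(0.5, gamma=0.6)))  # responders only
    print(round(smart_sample_size(0.5, c=8)))        # two embedded DTRs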
Oetting et al. (2011) discussed additional research questions and the correspond-
ing test statistics and sample size formulae under different working assumptions. A
web application that calculates the required sample size for sizing a study designed
to discover the best DTR using a SMART design for continuous outcomes can be
found at
http://methodologymedia.psu.edu/smart/samplesize.
Some alternative approaches to sample size calculations can be found in Dawson
and Lavori (2010, 2012).
Furthermore, for time-to-event outcomes, sample size formulae can be found in
Feng and Wahed (2009) and Li and Murphy (2011). A web application for sample
size calculation in this case can be found at
http://methodologymedia.psu.edu/logranktest/samplesize.
Randomization Probabilities
Let p1(a1 | H1) and p2(a2 | H2) be the randomization probabilities at the first and second
stage, respectively. Formulae for the randomization probabilities that would create
equal sample sizes across all DTRs were derived by Murphy (2005a). This was
motivated by the classical large sample comparison of means for which, given equal
variances, the power of a test is maximized by equal sample sizes. Let k1 (H1 ) be
the number of treatment options at the first stage with history H1, and k2(H2) the number of treatment options at the second stage with history H2. Then
Murphy's calculations give the optimal values of the randomization probabilities as

p2(a2 | H2) = 1/k2(H2),   p1(a1 | H1) = [E{k2(H2)^{−1} | A1 = a1, H1}]^{−1} / Σ_{a′1} [E{k2(H2)^{−1} | A1 = a′1, H1}]^{−1}.   (2.1)

If k2 does not depend on H2, the above formulae can be directly used at the start of
the trial. Otherwise, working assumptions concerning the distribution of O2 given
(O1 , A1 ) are needed in order to use the formulae. In the case of the addiction man-
agement example, k1 (H1 ) = 2 and k2 (H2 ) = 2 for all possible combinations of
(H1 , H2 ). Thus (2.1) yields an optimal randomization probability of 1/2 for each
treatment option at each stage. See Murphy (2005a) for derivations and further
details.
Over the years, some principles and practical considerations have emerged mainly
from the works of Lavori and Dawson (2004), Murphy (2005a) and Murphy et al.
(2007a), which researchers should keep in mind as general guidelines when design-
ing a SMART.
First, Murphy (2005a) recommended that the primary research question should
consider simple DTRs, leading to tractable sample size calculations. For example, in
the addiction management study, one can consider regimes where the initial decision
rule does not depend on an individual’s pre-treatment information and the secondary
decision rule depends only on the individual’s initial treatment and his response
status (as opposed to depending on a large number of intermediate variables).
Second, when designing the trial, the class of treatment options at each stage
should be restricted by ethical, scientific or feasibility considerations (Lavori and
Dawson 2004; Murphy 2005a). It is better to use a low-dimensional summary criterion
(e.g. response status) instead of all intermediate outcomes (e.g. improvement in
symptom severity, side-effects, adherence, etc.) to restrict the class of possible treatments;
in many contexts, including mental health studies, feasibility considerations
may often force researchers to incorporate a patient's preference into this low-dimensional
summary. Lavori and Dawson (2004) demonstrated how to constrain treatment op-
tions (and thus decision rules) using the STAR*D study as an example (this study
will be introduced later in this chapter). Yet, Murphy (2005a) warned against un-
necessary restriction of the class of the decision rules. In our view, determining the
“right class” of treatment options in any given study remains an art, and cannot be
fully operationalized.
Third, a SMART should be viewed as one trial among a series of randomized
trials intended to develop and/or refine a DTR (Collins et al. 2005). It should even-
tually be followed by a confirmatory randomized trial that compares the developed
regime and an appropriate control (Murphy 2005a; Murphy et al. 2007a).
Fourth, like traditional randomized trials, SMARTs may involve usual problems
such as dropout, non-compliance, incomplete assessments, etc. However, by virtue
of the option to alter the non-functioning treatments at later stages, SMARTs should
be more appealing to participants, which may result in greater recruitment success,
greater compliance, and lower dropout compared to a standard RCT.
Finally, as in the context of any standard randomized trial, feasibility and ac-
ceptability considerations relating to a SMART can best be assessed via (external)
pilot studies (see, e.g. Vogt 1993). Recently Almirall et al. (2012a) discussed how
to effectively design a SMART pilot study that can precede, and thereby aid in fine-
tuning, a full-blown SMART. They also presented a sample size calculation formula
useful for designing a SMART pilot study.
The SMART design discussed above involves stages of treatment and/or experi-
mentation. In this regard, it bears similarity with some other common designs, in-
cluding what are known as adaptive designs (Berry 2001, 2004). Below we discuss
the distinctions between SMART and some other multi-stage designs, to avoid any
confusion.
“Adaptive design” is an umbrella term used to denote a variety of trial designs that
allow certain trial features to change from an initial specification based on accu-
mulating data (evolving information) while maintaining statistical, scientific, and
ethical integrity of the trial (Dragalin 2006; Chow and Chang 2008). Some com-
mon types of adaptive designs are as follows. A response adaptive design allows
modification of the randomization schedules based on observed data at pre-set in-
terim times in order to increase the probability of success for future subjects; Berry
et al. (2001) discussed an example of this type of design. A group sequential design
(Pocock 1977; Pampallona and Tsiatis 1994) allows premature stopping of a trial
due to safety, futility and/or efficacy with options of additional adaptations based
on the results of interim analyses. A sample size re-estimation design involves the
re-calculation of sample size based on study parameters (e.g. revised effect size,
conditional power, nuisance parameters) obtained from interim data; see Banerjee
and Tsiatis (2006) for an example. An adaptive dose-finding design is used in early
phase clinical development to identify the minimum effective dose and the max-
imum tolerable dose, which are then used to determine the dose level for the next
phase clinical trials (see for example, Chen 2011). An adaptive seamless phase II/III
trial design is a design that addresses within a single trial objectives that are nor-
mally achieved through separate trials in phase II and phase III of clinical devel-
opment, by using data from patients enrolled before and after the adaptation in the
final analysis; see Levin et al. (2011) for an example. In general, the aim of adaptive
designs is to improve the quality, speed and efficiency of clinical development by
modifying one or more aspects of a trial. Recent perspectives on adaptive designs
can be found in Coffey et al. (2012).
Based on the above discussion, we can now identify the distinctions between the
standard SMART design and adaptive designs. In a SMART design, each subject
moves through multiple stages of treatment, while in most adaptive designs each
stage involves different subjects. The goal of a SMART is to develop a good DTR
that could benefit future patients. Many adaptive designs (e.g. response adaptive de-
sign) try to provide the most efficacious treatment to each patient in the trial based
on the current knowledge available at the time that a subject is randomized. In a
SMART, unlike in an adaptive design, the design elements such as the final sam-
ple size, randomization probabilities and treatment options are pre-specified. Thus,
SMART designs involve within-subject adaptation of treatment, while adaptive de-
signs involve between-subject adaptation.
Next comes the natural question of whether some adaptive features can be in-
tegrated into the SMART design framework. In some cases the answer is yes, at
least in principle. For example, Thall et al. (2002) provided a statistical framework
for an adaptive design in a multi-stage treatment setting involving two SMARTs.
Thall and Wathen (2005) considered a similar but more flexible design where the
randomization criteria for each subject at each stage depended on the data from
all subjects previously enrolled. However, adaptation based on interim data is less
feasible in settings where subjects’ outcomes may only be observed after a long pe-
riod of time has elapsed. How to optimally use adaptive design features within the
SMART framework is an open question that warrants further research.
SMART designs have some operational similarity with classical crossover trial designs;
however, they are very different conceptually. First, treatment allocation at any
stage after the initial stage of a SMART typically depends on a subject’s intermedi-
ate outcome (response/non-response). However, in a crossover trial, subjects receive
all the candidate treatments irrespective of their intermediate outcomes. Second, as
the goal of a typical crossover study is to determine the outcome of a one-off treatment,
crossover trials consciously attempt to wash out the carryover effects (i.e.
delayed effects), whereas SMARTs attempt to capture them and, where possible,
take advantage of any interactions between treatments at different stages to opti-
mize outcome following a sequence of treatments.
As mentioned earlier, a SMART should be viewed as one trial among a series of ran-
domized trials intended to develop and/or refine a DTR. It should eventually be fol-
lowed by a confirmatory randomized trial that compares the developed regime and
an appropriate control (Murphy 2005a; Murphy et al. 2007a). This purpose is shared
by the multiphase experimental approach (with distinct phases for screening, refin-
ing, and confirming) involving factorial designs, originally developed in engineering
(Box et al. 1978), and recently used in the development of multicomponent behav-
ioral interventions (Collins et al. 2005, 2009; Chakraborty et al. 2009). Note that
DTRs are multicomponent treatments, and SMARTs are developmental trials to aid
in the innovation of optimal DTRs. From this perspective, a SMART design can be
viewed as one screening/refining experiment embedded in the entire multiphase ex-
perimental approach. In fact, Murphy and Bingham (2009) developed a framework
to connect SMARTs with factorial designs. However, there remain many open ques-
tions in this context, and more research is needed to fully establish the connections.
Fig. 2.3 A schematic of the algorithm for treatment assignment in the STAR*D study
2.5 Discussion
In this chapter, we have described the two sources of data that are commonly used
for estimating DTRs: observational follow-up studies and SMARTs. The use of ob-
servational data adds an element of complexity to the problem of estimation and
requires careful handling and additional assumptions, due to the possibility of con-
founding. To assist in the careful formulation of causal contrasts in the presence
of confounding, the potential outcomes framework was introduced. In contrast,
SMARTs offer simpler analyses but often require significant investment to conduct
a high quality trial with adequate power. We discussed conceptual underpinnings
of and practical considerations for conducting a SMART, as well as its distinctions
from other multiphase designs. We introduced several examples of observational
and sequentially randomized studies, some of which we will investigate further in
subsequent chapters.
Chapter 3
Statistical Reinforcement Learning
The value of a state, with respect to a given policy, is the total expected future reward of an agent,
starting with that state, and following the given policy to select actions thereafter.
Thus the goal of RL, rephrased in terms of policy and value, is to estimate a pol-
icy that maximizes the value over a specified class of policies. Since the value is
a function of two arguments, namely the state and the policy, the above-mentioned
maximization of value over policy space can happen either for each state, or aver-
aged over all the states. Unless otherwise specified, in this book, we will concentrate
on maximization of value averaged over all the states.
Compare the above conceptual framework of traditional RL to the present prob-
lem of constructing DTRs for chronic disorders. The computerized decision support
system of the clinician plays the role of the learning agent, while the population
of subjects with the disorder of interest plays the role of the environment. Each
clinic visit by a patient defines a stage of potential clinical intervention. Pre-
treatment observations on the patient constitute the state, and the treatment (type,
dosage, timing etc.) serves as the action. A suitably-defined measure of the patient’s
well-being following the treatment can be conceptualized as the reward. For exam-
ple, in the addiction management problem described in Chap. 1, reward can be the
percentage of days abstinent in a given period of time following the treatment. How-
ever, it must be recognized that constructing good rewards in the medical setting is
challenging, and how best to combine different outcomes of interest (e.g. efficacy
and toxicity) into a single reward is an open question. Finally, policy is synonymous
with dynamic treatment regime, and the value of a policy is the same as the expected
primary outcome under a dynamic regime.
While the problem of constructing DTRs from patient data seems to be a special
case of classical RL, it has several unique features that distinguish it from the
classical RL problem. Below we list the major distinctions:
Unknown System Dynamics and the Presence of Unknown Causes: In many RL
problems, the system dynamics (multivariate distribution of the data, including
state transition probabilities) are known from the physical laws or other subject-
matter knowledge. For example, in the case of a robot learning about its envi-
ronment, the system dynamics are often known (Sutton and Barto 1998, p. 66).
Unfortunately this is often not the case in the medical setting due to the pres-
ence of potentially many unknown causes of the outcome. Consider, for example,
treatment of chronic depression. A patient’s response to a treatment may depend
on how well he adheres to the treatment assigned, his genetic composition, co-
occurring family problems, etc. These unknown causes play an important role
in treatment outcome, and in some cases interact with treatment to affect that
outcome. Hence classical DP algorithms that use direct knowledge of the system
dynamics are not suitable, and constructing DTRs in the medical setting using
patient data is not a straightforward RL problem.
Furthermore, the unknown causes and system dynamics pose potential chal-
lenges to statistical methods for estimating treatment effects. In statistics, it is a
common practice to collect data, whenever possible, on all potential risk factors
and confounders and adjust for these by including them in regression models.
However, it is not possible to collect data on a cause if it is unknown, and
hence such causes can be neither measured nor adjusted for.
Need to Pool over Subject-level Data: In some other classical RL problems, the
system dynamics are not completely specified, rendering DP methods unsuitable;
however, good generative models are available from which one can simulate
data on states, actions, and rewards – essentially as much data as one wants. In
other words, data are often very cheap in these classical problems. The primary
restrictive issue in this setting is the computational complexity. In the medical
setting, however, data are extremely expensive in terms of both time and money.
Furthermore, generative models from which to simulate patient data are rarely
available due, in part, to the point noted above: that there may be unknown or
poorly understood causes of the outcome in the medical setting. Thus, all that
is typically available to the analyst is a sample consisting of treatment records
(pre-treatment observations, treatments, and post-treatment outcomes) of n pa-
tients from a randomized or observational study. The sample size n is usually not
very large compared to the size of the state space and the action space. Hence
one is forced to use parametric or semi-parametric statistical models (called
function approximation in computer science) to add some structure to the data
and then pool over subjects’ data to construct the decision rules. In computer
science, a sample of data is often called a batch; hence a sub-class of RL al-
gorithms that work with batch data are called batch-mode learning algorithms
(Ernst et al. 2005; Murphy 2005b). In the medical setting, batch-mode RL al-
gorithms with function approximation (as opposed to the more common online
algorithms, which we will not discuss in this book) are suitable.
In a general RL problem, the agent and the environment interact at each of a possibly
infinite number of stages. In the medical setting, we restrict ourselves to RL prob-
lems with a finite number of stages (say K). These are called finite-horizon prob-
lems. We do not assume the Markov property here, since it is not appropriate in general.
At stage j (1 ≤ j ≤ K), the agent observes a state O_j ∈ 𝒪_j and executes an action
A_j ∈ 𝒜_j, where 𝒪_j is the state space and 𝒜_j is the action space. We will restrict
ourselves to settings where the state O j can be a vector consisting of discrete or
continuous variables, but the action A j can only be discrete. RL problems with a
continuous action space are beyond the scope of this book, however exciting work
on G-estimation of optimal strategies for dosing of continuous-valued treatments is
being undertaken (Rich et al. 2010). Partly as a consequence of its action, the agent
receives a real-valued reward Y j ∈ R, and moves on to the next stage with a new
state O j+1 ∈ O j+1 . As in Chap. 2, define Ō j ≡ (O1 , . . . , O j ) and Ā j ≡ (A1 , . . . , A j ).
Also define the history H j at stage j as the vector (Ō j , Ā j−1 ). At any stage j, the
quantities O j , A j ,Y j and H j are random variables, the observed values of which will
be denoted respectively by o j , a j , y j and h j . The reward Y j is conceptualized as a
known function of the history H j , the current action A j , and the next state O j+1 .
Thus,

Y_j = Y_j(H_j, A_j, O_{j+1}) = Y_j(Ō_j, Ā_j, O_{j+1}).
In some settings, there may be only one terminal reward YK ; rewards at all previous
stages are taken to be 0. In statistical terms, rewards may be taken to be synonymous
with outcome.
Define a deterministic policy d ≡ (d₁, …, d_K) as a vector of decision rules, where
for 1 ≤ j ≤ K, d_j : ℋ_j → 𝒜_j is a mapping from the history space ℋ_j to the action
space 𝒜_j. A policy is called stochastic if the above mappings are from the history
space ℋ_j to the space of probability distributions over the action space 𝒜_j, which,
in a slight abuse of notation, we will denote d_j(a_j|h_j). The collection of policies,
depending on the history space and action space, defines a function space called the
policy space, often denoted by 𝒟.
A finite-horizon trajectory consists of the set {O1 , A1 , O2 , . . . , AK , OK+1 }.
As mentioned earlier, the problem of constructing DTRs conforms to what is
known as batch-mode RL in computer science. In a batch-mode RL problem, the
data consist of n such finite-horizon trajectories; let P_d denote the distribution of
a trajectory when actions are chosen according to a policy d.
Denote the expectation with respect to the distribution Pd by Ed . The primary goal
in statistical RL is to estimate (learn) the optimal policy, say d ∗ , from the data on
n finite-horizon trajectories, not necessarily generated by the optimal policy (hence
the need for what are known as off-policy algorithms in RL). By optimal policy
within a policy class, we mean the one with greatest possible value within that class.
The precise definition of value follows.
The value function for a state o₁ with respect to an arbitrary policy d is

V^d(o₁) = E_d[ ∑_{j=1}^K Y_j(H_j, A_j, O_{j+1}) | O₁ = o₁ ].
2 In the case of a SMART, this policy consists of the randomization probabilities and is known
by design, whereas for an observational study, this can be estimated by the propensity score (see
Sect. 3.5 for definition).
This represents the total expected future reward starting at a particular state o1 and
thereafter choosing actions according to the policy d. Given a policy d, the stage
j value function for a history h j is the total expected future rewards from stage j
onwards, and is given by

V_j^d(h_j) = E_d[ ∑_{k=j}^K Y_k(H_k, A_k, O_{k+1}) | H_j = h_j ],  1 ≤ j ≤ K.
Note that, by definition, V₁^d(·) = V^d(·). For convenience, set V_{K+1}^d(·) ≡ 0. Then the
stage j value function satisfies the recursion

V_j^d(h_j) = E_d[ ∑_{k=j}^K Y_k(H_k, A_k, O_{k+1}) | H_j = h_j ]
  = E_d[ Y_j(H_j, A_j, O_{j+1}) | H_j = h_j ] + E_d[ ∑_{k=j+1}^K Y_k(H_k, A_k, O_{k+1}) | H_j = h_j ]
  = E_d[ Y_j(H_j, A_j, O_{j+1}) | H_j = h_j ] + E_d[ E_d( ∑_{k=j+1}^K Y_k(H_k, A_k, O_{k+1}) | H_{j+1} ) | H_j = h_j ]
  = E_d[ Y_j(H_j, A_j, O_{j+1}) | H_j = h_j ] + E_d[ V_{j+1}^d(H_{j+1}) | H_j = h_j ]
  = E_d[ Y_j(H_j, A_j, O_{j+1}) + V_{j+1}^d(H_{j+1}) | H_j = h_j ],  1 ≤ j ≤ K.   (3.4)
The optimal stage j value function is defined as

V_j^{opt}(h_j) = max_{d∈𝒟} V_j^d(h_j).
The optimal value functions satisfy the Bellman equation (Bellman 1957),

V_j^{opt}(h_j) = max_{a_j∈𝒜_j} E[ Y_j(H_j, A_j, O_{j+1}) + V_{j+1}^{opt}(H_{j+1}) | H_j = h_j, A_j = a_j ],   (3.5)

when all observations and actions are discrete (see Sutton and Barto, 1998, p. 76,
for details). The Bellman equation also holds for more general scenarios, but with
additional assumptions.
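To make the backward induction in (3.5) concrete, here is a toy R computation with binary states and actions over K = 2 stages. For simplicity the toy is Markov with invented transition and reward functions, although, as noted above, the Markov property is not assumed in general.

```r
## Toy finite-horizon dynamic programming via the Bellman equation (3.5);
## the transition and reward functions are invented for illustration.
K <- 2
states  <- c(0, 1)
actions <- c(-1, 1)
p_next <- function(s, a) { p <- plogis(0.5 * s + 0.8 * a); c(1 - p, p) }  # P(s'=0), P(s'=1)
reward <- function(s, a, s_next) s_next + 0.1 * a * s

V <- matrix(0, nrow = length(states), ncol = K + 1)  # V[s + 1, j]; V[, K + 1] = 0
for (j in K:1) {
  for (s in states) {
    q <- sapply(actions, function(a) {
      pr <- p_next(s, a)
      sum(pr * (sapply(states, function(sn) reward(s, a, sn)) + V[, j + 1]))
    })
    V[s + 1, j] <- max(q)  # maximize the expected reward-to-go over actions
  }
}
V[, 1]  # optimal value for each initial state
```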
Finally, the (marginal) value of a policy d, written V^d, is the average value
function under that policy, averaged over possible initial observations, i.e.,

V^d = E[ V^d(O₁) ] = E_d[ ∑_{j=1}^K Y_j(H_j, A_j, O_{j+1}) ].
Note that the above expectation is taken with respect to the entire likelihood of the
data, as given by (3.2) or (3.3), for the case of deterministic or stochastic policy
respectively. Thus the value of a policy is simply the marginal mean outcome under
that policy.
Given a policy, the primary statistical goal is to estimate its value. A related
problem would be to compare the values of two or more pre-specified policies; this
is, in fact, an extension of the problem of comparing mean outcomes of two or
more (static) treatments. Note that this is often the primary analysis of a SMART. In
Sect. 5.1, we will consider some methods for estimating the value of a pre-specified
policy developed in the statistics literature.
In many classical RL as well as medical domains, researchers often seek to
estimate a policy that maximizes the value (i.e. the optimal policy). One approach is
to first specify a policy space, and then employ some method to estimate the value
of each policy in that space to find the best one. An alternative approach is to work
with what is known as the action-value function, or simply the Q-function (where “Q”
stands for the “quality of action”), instead of the value function V^d defined above.
Q-functions are defined as follows.
The stage j Q-function for policy d is the total expected future reward starting
from a history h_j at stage j, taking an action a_j, and following the policy d thereafter.
Thus,

Q_j^d(h_j, a_j) = E_d[ Y_j(H_j, A_j, O_{j+1}) + V_{j+1}^d(H_{j+1}) | H_j = h_j, A_j = a_j ].

In particular, the optimal stage j Q-function is

Q_j^{opt}(h_j, a_j) = E[ Y_j(H_j, A_j, O_{j+1}) + V_{j+1}^{opt}(H_{j+1}) | H_j = h_j, A_j = a_j ].
The primary goal in statistical RL is to estimate the optimal policy. As briefly men-
tioned in Sect. 3.3, one approach to achieve this goal is to first specify a policy space
D, and then employ any suitable method to estimate the value of each candidate pol-
icy d ∈ 𝒟 to estimate the best one, say d̂^{opt}. More precisely,

d̂^{opt} = arg max_{d∈𝒟} V̂^d,

where V̂^d is the estimated value of policy d.
This class of methods is known as the policy search methods in the RL literature (Ng
and Jordan 2000). Methods like inverse probability weighting and marginal struc-
tural models developed in the causal inference literature also fall in this category;
we will discuss these approaches in considerable detail in Chap. 5. While the policy
search approach is typically non-parametric or semi-parametric, requiring only mild
assumptions about the data, the main issue is the high variability of the value func-
tion estimates, and the resulting high variability in the estimated optimal policies.
With reference to estimating the optimal policy, one can conceive of methods that
lie at the other end of the parametric spectrum, in the sense that they model the entire
multivariate distribution of the data, and then apply dynamic programming methods
to learn the optimal policy. Likelihood-based methods, including G-computation
and Bayesian methods, fall in that category; we will briefly discuss them in Chap. 9.
One downside of this class of methods is that the entire likelihood of the data may
not be relevant for choosing optimal actions, and hence these methods run the risk
of providing a biased estimator of the value function (and hence the optimal policy)
if the model specification is incorrect. Since there is more modeling involved in this
approach, there are more chances to get it wrong.
In between these two extremes, there exist attractive methods that model only
part of the entire likelihood, e.g. the conditional expectation of the reward given his-
tory and action. In other words, these methods model the Q-functions (Q-learning)
or even only parts of the Q-functions relevant for decision making (e.g. A-learning,
G-estimation, etc.). In the present chapter and Chap. 4, we will be focusing on this
class of methods. Modeling the conditional expectation can be done via regression.
Below we introduce a simple version of Q-learning that estimates the optimal policy
in two steps: (1) estimate the stage-specific Q-functions by using parametric models
(e.g. linear models), and (2) recommend the actions that maximize the estimated
Q-functions. In its simplest incarnation (using linear models for the Q-functions),
Q-learning3 can be viewed as an extension of least squares regression to multi-stage
decision problems. However, one can use more flexible models (e.g. splines, neural
networks, trees etc.) for the Q-functions.
For clarity of exposition, we will first describe Q-learning for studies with two
stages, and then generalize to K (≥ 2) stages. In a two-stage study, longitudinal
data on a single subject are given by the trajectory (O1 , A1 , O2 , A2 , O3 ), where nota-
tions are defined in Sect. 3.3. The histories at each stage are given by H1 ≡ O1 and
H2 ≡ (O1 , A1 , O2 ). The data available for estimation consist of a random sample
of n subjects. For simplicity, assume that the data arise from a SMART with two
possible treatments at each stage, A j ∈ {−1, 1} and that they are randomized (con-
ditionally on history) with known randomization probabilities. The study can have
either a single terminal reward (primary outcome), Y , observed at the end of stage
3 The version of Q-learning we will be using in this book is similar to the fitted Q-iteration algo-
rithm in the RL literature. This version is an adaptation of Watkins’ classical Q-learning to batch
data, involving function approximation.
2, or two rewards (intermediate and final outcomes), Y1 and Y2 , observed at the end
of each stage. The case of a single terminal outcome Y is viewed as a special case
with Y1 ≡ 0 and Y2 = Y . A two-stage policy (DTR) consists of two decision rules,
say (d1 , d2 ), with d j (H j ) ∈ {−1, 1}.
One simple method to construct the optimal DTR d^{opt} = (d₁^{opt}, d₂^{opt}) is Q-learning
(Watkins 1989; Sutton and Barto 1998; Murphy 2005b). First define the
optimal Q-functions for the two stages as follows:

Q₂^{opt}(H₂, A₂) = E[ Y₂ | H₂, A₂ ],
Q₁^{opt}(H₁, A₁) = E[ Y₁ + max_{a₂} Q₂^{opt}(H₂, a₂) | H₁, A₁ ].

If the above two Q-functions were known, the optimal DTR (d₁^{opt}, d₂^{opt}), using a
backwards induction argument (as in dynamic programming), would be

d_j^{opt}(h_j) = arg max_{a_j} Q_j^{opt}(h_j, a_j),  j = 1, 2.   (3.7)
In practice, the true Q-functions are not known and hence must be estimated from
the data. Note that Q-functions are conditional expectations, and hence a natural
approach to model them is via regression models. Consider linear regression models
for the Q-functions. Let the stage j ( j = 1, 2) Q-function be modeled as
Q_j^{opt}(H_j, A_j; β_j, ψ_j) = β_j^T H_{j0} + (ψ_j^T H_{j1}) A_j,   (3.8)
where H j0 and H j1 are two (possibly different) vector summaries (or, features) of
the history H j , with H j0 denoting the “main effect of history” (H j0 includes the
intercept term) and H j1 denoting the “treatment effect of history” (the vector H j1
also includes an intercept-like term that corresponds to the main effect of treatment).
The variables in H_{j0} are often termed predictive, while H_{j1} is said to
contain prescriptive or tailoring variables. The Q-learning algorithm involves the
following steps:
1. Stage 2 regression: (β̂₂, ψ̂₂) = arg min_{β₂,ψ₂} (1/n) ∑_{i=1}^n ( Y_{2i} − Q₂^{opt}(H_{2i}, A_{2i}; β₂, ψ₂) )².
2. Stage 1 pseudo-outcome: Ŷ_{1i} = Y_{1i} + max_{a₂} Q₂^{opt}(H_{2i}, a₂; β̂₂, ψ̂₂),  i = 1, …, n.
3. Stage 1 regression: (β̂₁, ψ̂₁) = arg min_{β₁,ψ₁} (1/n) ∑_{i=1}^n ( Ŷ_{1i} − Q₁^{opt}(H_{1i}, A_{1i}; β₁, ψ₁) )².
Note that in step 2 above, the quantity Ŷ_{1i} is a predictor of the unobserved random
variable Y_{1i} + max_{a₂} Q₂^{opt}(H_{2i}, a₂), i = 1, …, n. Once the Q-functions have been
estimated, finding the optimal DTR is easy. The estimated optimal DTR using Q-
learning is given by (d̂₁^{opt}, d̂₂^{opt}), where the stage j optimal rule is specified as

d̂_j^{opt}(h_j) = arg max_{a_j} Q_j^{opt}(h_j, a_j; β̂_j, ψ̂_j),  j = 1, 2.
The above procedure can be easily generalized to K > 2 stages. Define Q_{K+1}^{opt} ≡ 0,
and

Q_j^{opt}(H_j, A_j) = E[ Y_j + max_{a_{j+1}} Q_{j+1}^{opt}(H_{j+1}, a_{j+1}) | H_j, A_j ],  j = 1, …, K.

Model the Q-functions as

Q_j^{opt}(H_j, A_j; β_j, ψ_j) = β_j^T H_{j0} + (ψ_j^T H_{j1}) A_j,  j = 1, …, K.

Moving backwards through the stages j = K, K − 1, …, 1, compute

(β̂_j, ψ̂_j) = arg min_{β_j,ψ_j} (1/n) ∑_{i=1}^n ( Y_{ji} + max_{a_{j+1}} Q_{j+1}^{opt}(H_{j+1,i}, a_{j+1}; β̂_{j+1}, ψ̂_{j+1}) − Q_j^{opt}(H_{ji}, A_{ji}; β_j, ψ_j) )²,

where the first two terms inside the squared error constitute the stage j pseudo-outcome.
The estimated optimal DTR is given by the rules

d̂_j^{opt}(h_j) = arg max_{a_j} Q_j^{opt}(h_j, a_j; β̂_j, ψ̂_j),  j = 1, …, K.
Q-learning with linear models and K = 2 stages has been implemented in the R
package qLearn that is freely available from the Comprehensive R Archive Net-
work (CRAN):
http://cran.r-project.org/web/packages/qLearn/index.html.
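For readers who prefer to see the two-stage algorithm end to end, the following self-contained R sketch implements it by hand with lm(); the simulated generative model and all variable names are illustrative assumptions, not taken from any study or from the qLearn package.

```r
## Hand-rolled two-stage Q-learning with linear working models (3.8);
## the data-generating mechanism below is invented for illustration.
set.seed(1)
n  <- 500
O1 <- rnorm(n)
A1 <- sample(c(-1, 1), n, replace = TRUE)            # stage 1 randomized
O2 <- 0.5 * O1 + 0.3 * A1 + rnorm(n)
A2 <- sample(c(-1, 1), n, replace = TRUE)            # stage 2 randomized
Y  <- O1 + O2 + 0.4 * A1 + A2 * (0.2 + 0.7 * O2) + rnorm(n)  # terminal outcome (Y1 = 0)

## Step 1: stage 2 regression with main effects plus A2 interactions
fit2 <- lm(Y ~ O1 + A1 + O2 + A2 + O2:A2)
b2   <- coef(fit2)

## Step 2: pseudo-outcome = predicted outcome under the better stage 2 action
q2 <- function(a2) fitted(fit2) + (a2 - A2) * (b2["A2"] + b2["O2:A2"] * O2)
Yhat1 <- pmax(q2(1), q2(-1))

## Step 3: stage 1 regression of the pseudo-outcome
fit1 <- lm(Yhat1 ~ O1 + A1 + O1:A1)
b1   <- coef(fit1)

## Estimated optimal rules: sign of the fitted stage-specific treatment effect
d2_opt <- sign(b2["A2"] + b2["O2:A2"] * O2)
d1_opt <- sign(b1["A1"] + b1["O1:A1"] * O1)
```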
Some readers, especially those unfamiliar with causal inference, may find the
indirect, two-step procedure of Q-learning a bit strange, at least on the surface.
To them, the following one-step procedure for estimating the optimal DTR might
seem more natural. In this approach, one would model the conditional mean out-
come E(Y |O1 , A1 , O2 , A2 ) and run an all-at-once regression analysis; the estimated
optimal policy would be given by

(d̂₁^{opt}, d̂₂^{opt}) = arg max_{(a₁,a₂)} E(Y | o₁, a₁, o₂, a₂).
Unfortunately, this is not a good idea because of the possibility of bias in the
estimation of stage 1 treatment effect; this arises as a consequence of what is
known as collider-stratification bias or Berkson’s paradox (Gail and Benichou
2000; Greenland 2003; Murphy 2005a; Chakraborty 2011). This phenomenon was
first described in the context of a retrospective study examining a risk factor for
P(O₂ = 1 | O₁, U, A₁) = U [ q₁(1 + A₁)/2 + q₂(1 − A₁)/2 ] + (1 − U) [ q₃(1 + A₁)/2 + q₄(1 − A₁)/2 ],

where each q_j ∈ [0, 1]. That is, the binary intermediate outcome O₂ (responder/non-responder
status to initial treatment) depends on U and also on the treatment A₁
(when q₁ − q₂ ≠ 0 and q₃ − q₄ ≠ 0). By applying Bayes' theorem and some algebra,
one can see that conditioning on O₂ induces an association between A₁ and U, and
hence a spurious association between A₁ and the outcome Y (Fig. 3.1).

Fig. 3.1 A diagram displaying the spurious effect between A₁ and Y, as a consequence of Berkson's paradox
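A quick simulation makes the danger concrete; the q values and the effect of U on Y below are invented for illustration.

```r
## Simulating collider-stratification bias: U affects O2 and Y; A1 affects
## O2 only. Adjusting for O2 induces a spurious A1 effect. Values are invented.
set.seed(2)
n   <- 1e5
U   <- rbinom(n, 1, 0.5)
A1  <- sample(c(-1, 1), n, replace = TRUE)  # randomized initial treatment
q   <- c(0.7, 0.4, 0.5, 0.2)                # q1, q2, q3, q4
pO2 <- U * (q[1] * (1 + A1) / 2 + q[2] * (1 - A1) / 2) +
  (1 - U) * (q[3] * (1 + A1) / 2 + q[4] * (1 - A1) / 2)
O2  <- rbinom(n, 1, pO2)
Y   <- 2 * U + rnorm(n)                     # Y depends on U but not on A1

coef(lm(Y ~ A1))["A1"]       # approximately 0: A1 has no effect on Y
coef(lm(Y ~ A1 + O2))["A1"]  # clearly non-zero: bias from conditioning on O2
```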
average number of cigarettes smoked per day, number of months not smoked during
the study period, all measured at 6 months from the baseline), and O3 consists of the
same outcome variables measured at stage 2 (12 months from the baseline). A1 and
A2 represent the behavioral interventions given at stages 1 and 2 respectively. The
outcome Y could be the quit status at the end of the study. An example DTR can have
the following form: “At stage 1, if a subject’s baseline selfefficacy is greater
than a threshold (say 7, on a 1–10 scale), then provide the highly-personalized level
of the treatment component source; and if the subject is willing to continue treat-
ment, then at stage 2 provide booster intervention if he continues to be a smoker
at the end of stage 1 and control otherwise”. Of course characteristics other than
selfefficacy or a combination of more than one characteristic can be used to
specify a DTR. To find the optimal DTR, we applied the two-stage Q-learning pro-
cedure involving the following steps.
1. Fit the stage 2 regression (n = 281) of FF6Quitstatus on stage 2 history and
   treatment, using a linear model of the form (3.8).
2. Construct the pseudo-outcome (Ŷ₁) for the stage 1 regression by plugging the
   stage 2 estimates into the maximized stage 2 Q-function. Note that in this case
   one can construct the pseudo-outcome for everyone who participated at stage 1,
   since there are no variables from post-stage 1 required to do so.
3. Fit the stage 1 regression (n = 1,401) of the pseudo-outcome on stage 1 history
   and treatment, using a model of the same form.
Table 3.1 Regression coefficients and 95 % bootstrap confidence intervals at stage 1 (significant
effects are in bold)
Variable Coefficient 95 % CI
motivation 0.04 (−0.00, 0.08)
selfefficacy 0.03 (0.00, 0.06)
education −0.01 (−0.07, 0.06)
source −0.15 (−0.35, 0.06)
source × selfefficacy 0.03 (0.00, 0.06)
story 0.05 (−0.01, 0.11)
story × education −0.07 (−0.13, −0.01)
Note that the sample sizes at the two stages differ because only 281 subjects
were willing to continue treatment into stage 2 (as allowed by the study protocol).
No significant treatment effect was found in the regression analysis at stage 2. The
stage 1 analysis summary, including the regression coefficients and 95 % bootstrap
confidence intervals4 (using 1,000 replications) is presented in Table 3.1.
The conclusions from the present data analysis can be summarized as follows.
Since no significant stage 2 treatment effect was found, this analysis suggests that
the stage 2 behavioral intervention need not be adapted to the smoker’s individual
characteristics, interventions previously received, or stage 1 outcome. More inter-
esting results are found at stage 1. It is found that subjects with higher levels of
selfefficacy are more likely to quit. The highly personalized level of source
is more effective for subjects with a higher selfefficacy (≥ 7), and the deeply
tailored level of story is more effective for subjects with lower education (≤ high
school); these two conclusions can be drawn from the interaction plots (with confi-
dence intervals) presented in Fig. 3.2. Thus, according to this data analysis, to maxi-
mize each individual’s chance of quitting over the two stages, the web-based smok-
ing cessation intervention should be designed in future such that: (1) smokers with
high selfefficacy (≥7) are assigned to highly personalized level of source,
and (2) smokers with lower education are assigned to deeply tailored level of
story.
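Written as code, the regime suggested by this analysis is just a pair of simple rules; the sketch below uses our own labels for the comparison treatment levels, and "low_education" encodes education at or below high school.

```r
## The stage 1 rules suggested by the analysis, expressed as a function;
## the level labels other than "highly personalized", "deeply tailored",
## and "low tailored" are our own placeholders.
stage1_rule <- function(selfefficacy, low_education) {
  list(source = if (selfefficacy >= 7) "highly personalized" else "less personalized",
       story  = if (low_education) "deeply tailored" else "low tailored")
}
stage1_rule(selfefficacy = 8, low_education = TRUE)
```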
Until recently, Q-learning had only been studied and implemented in settings where
the exposure was randomized. However, as the development of DTRs is often ex-
ploratory, the power granted by the large samples often available using observa-
tional data may be a good means of discovering potentially optimal DTRs which
may later be assessed in a confirmatory randomized trial. It has long been believed
that Q-learning could easily be adapted to observational (non-randomized) treatment
4 Inference for stage 1 parameters in Q-learning is problematic due to an underlying lack of
smoothness, so usual bootstrap inference is not theoretically valid. Nevertheless, we use it here
for illustrative purposes only. Valid inference procedures will be discussed in Chap. 8.
Fig. 3.2 Interaction plots: (a) source by self-efficacy (upper panel), (b) story by education (lower
panel), along with confidence intervals for predicted stage 1 pseudo-outcome
settings, provided all confounding covariates are measured, using covariate adjustment
or so-called causal methods, i.e. propensity score approaches, including
regression, matching, and inverse probability of treatment weighting.
Fig. 3.3 Causal diagram for generative model for simulations in Sect. 3.5
Let μ = E[Y |C1 , O1 , A1 ,C2 , O2 , A2 ], and ε ∼ N(0, 1) be the error term. Then Y =
μ + ε , with
μ = γ0 + γ1C1 + γ2 O1 + γ3 A1 + γ4 O1 A1 + γ5C2 + γ6 A2 + γ7 O2 A2 + γ8 A1 A2 .
See Fig. 3.3 for the causal diagram corresponding to the data generating models.
Parameters were chosen to produce regular settings (see Chap. 8 for considera-
tion of non-regularity): γ = (0, γ1 , 0, −0.5, 0, γ5, 0.25, 0.5, 0.5) and δ = (0.1, 0.1).
We begin with a randomized treatment setting, ζ0 = ζ1 = 0 and γ1 = γ5 = 0. This
is the reference scenario.
Table 3.2 Performance of Q-learning adjustment methods: Bias, Monte Carlo Variance (MC var),
and Mean Squared Error (MSE)
Method Bias MC var MSE
Scenario A
None −0.0004 0.0080 0.0080
Linear −0.0005 0.0079 0.0079
PS (linear) −0.0006 0.0079 0.0079
PS (quintiles) −0.0007 0.0080 0.0080
PS (matching) −0.0066 0.0164 0.0164
IPW −0.0006 0.0080 0.0080
Scenario B
None 0.0073 0.0297 0.0298
Linear 0.0088 0.0121 0.0122
PS (linear) 0.0054 0.0188 0.0188
PS (quintiles) 0.0056 0.0204 0.0204
PS (matching) −0.0109 0.0431 0.0432
IPW 0.0080 0.0224 0.0224
Scenario C
None −0.7201 0.0256 0.5441
Linear −0.0027 0.0116 0.0116
PS (linear) −0.2534 0.0233 0.0875
PS (quintiles) −0.3151 0.0213 0.1206
PS (matching) −0.2547 0.0681 0.1330
IPW −0.4304 0.0189 0.2042
Scenario D
None −0.5972 0.0211 0.3777
Linear 0.0075 0.0120 0.0121
PS (linear) −0.2599 0.0227 0.0902
IPW −0.3274 0.0159 0.1231
Scenario E
None −0.2475 0.0114 0.0727
Linear 0.0075 0.0120 0.0121
PS (linear) 0.0050 0.0141 0.0141
IPW −0.1381 0.0116 0.0306
where O₁ is the 4 × 1 vector consisting of O₁ from step (1) repeated four times,
A₁ = (−1, −1, 1, 1), O₂ = (O₂(−1), O₂(−1), O₂(1), O₂(1)) using the potential
values generated in step (2), and A₂ = (−1, 1, −1, 1).
4. Set the confounders to be C₁ = Ȳ, the mean of the four potential outcomes, and C₂ = max(Y), their maximum.
5. From among the four possible treatment paths and corresponding potential out-
comes, select the “observed” data using P[A j = 1|C j ] = 1 − P[A j = −1|C j ] =
expit(ζ0 + ζ1C j ), j = 1, 2.
The vector of δ's was set to (0.1, 0.1), while the vector of γ's was taken to be
(0, 0, −0.5, 0, 0.25, 0.5, 0.5), indicating a regular (large effect) scenario. In simula-
tions where treatment was randomly allocated, ζ0 = ζ1 = 0, while for confounded
treatment, ζ0 = 0.2, ζ1 = 1. As can be observed from Eq. (3.9), the Q-functions
will not depend on the values of C1 and C2 so that any model for the Q-function
Table 3.3 Performance of Q-learning adjustment methods under the confounding by counterfac-
tuals simulations: Bias, Monte Carlo Variance (MC var), Mean Squared Error (MSE) and coverage
of 95 % bootstrap confidence intervals
Adjustment Randomized treatment Confounded treatment
method Bias MC var MSE Cover Bias MC var MSE Cover
n = 250
None 0.0020 0.0082 0.0082 94.0 0.2293 0.0080 0.0605 26.4
Linear 0.0011 0.0032 0.0032 95.1 0.0051 0.0039 0.0039 93.8
PS (linear) 0.0010 0.0052 0.0052 96.2 0.0548 0.0060 0.0090 89.4
PS (quintiles) 0.0008 0.0056 0.0056 96.1 0.0779 0.0061 0.0121 83.2
PS (matching) 0.0027 0.0099 0.0099 98.0 0.1375 0.0107 0.0295 75.9
IPW 0.0004 0.0046 0.0046 93.9 0.0108 0.0075 0.0076 92.8
n = 1,000
None −0.0012 0.0022 0.0022 93.4 0.2246 0.0021 0.0525 0.5
Linear 0.0001 0.0009 0.0009 93.5 0.0037 0.0010 0.0010 93.5
PS (linear) −0.0002 0.0014 0.0014 95.5 0.0446 0.0015 0.0035 77.0
PS (quintiles) −0.0004 0.0015 0.0015 95.7 0.0699 0.0015 0.0064 55.0
PS (matching) −0.0015 0.0026 0.0026 97.5 0.1256 0.0027 0.0184 31.0
IPW −0.0008 0.0012 0.0012 93.6 0.0018 0.0018 0.0018 93.6
3.6 Discussion
In Sect. 3.3, the longitudinal data structure was described. In Sect. 3.4, Q-learning
was introduced. This is a semi-parametric approach to estimation that we will return
to throughout the remainder of the text. In its typical implementation, it employs a
sequence of regressions, initially aiming to determine the optimal treatment strategy
at the last stage, then at each previous stage assuming the optimal DTR is followed
in later stages. The method is appealing for its computational and conceptual sim-
plicity, and as we will see in the next chapter, it ties closely with other methods of
estimation from the statistical literature. However, Q-learning may depend heavily
on being able to correctly specify the model for the Q-function, as we observed in
Sect. 3.5. The approach must therefore be undertaken with particular caution when
non-randomized data are used.
Chapter 4
Semi-parametric Estimation of Optimal DTRs
by Modeling Contrasts of Conditional Mean
Outcomes
stage j, in persons with treatment and covariate history h_j who subsequently receive
the optimal regime d_{j+1}^{opt}:

γ_j(h_j, a_j) = E[ Y(ā_j, d_{j+1}^{opt}) − Y(ā_{j−1}, d_j^{ref}, d_{j+1}^{opt}) | H_j = h_j ],

where “optimal” refers to treatment subsequent to stage j and “blip” refers to the
single-stage change in treatment at stage j, i.e., a “blip” of treatment given at stage
j. Note that at least one of Y(ā_j, d_{j+1}^{opt}), Y(ā_{j−1}, d_j^{ref}, d_{j+1}^{opt}) is a counterfactual outcome
unless the reference regime is optimal in stage j and the patient was treated
optimally at stage j and thereafter, in which case γ_j(h_j, a_j) = 0.
As noted previously, a dynamic treatment regime d̄_K^{opt} is optimal if it maximizes
the expectation of the outcome Y; define components of the optimal regime recursively,
starting at K, as

d_j^{opt}(h_j) = arg max_{a_j} E[ Y(ā_{j−1}, a_j, d_{j+1}^{opt}) | H_j = h_j ].
In general, d_j^{opt}(h_j) depends on ō_j and ā_{j−1}; however, as in previous chapters,
we will sometimes suppress the argument of the treatment regime function and
simply write d_j^{opt}. Note that both of the counterfactual outcomes Y(ā_j, d_{j+1}^{opt}) and
Y(ā_{j−1}, d_j^{ref}, d_{j+1}^{opt}) in the optimal blip-to-reference assume that the optimal regime
is followed from stage j + 1 to the end of treatment. However, the actual treatments
prescribed by the optimal regime may differ because the treatments which maximize
outcome given treatment history ā_j may not correspond to those that maximize
outcome given treatment history (ā_{j−1}, d_j^{ref}). Thus, we must keep in mind the subtle
distinction between the optimal regime and the specific optimal treatment prescribed
by that regime given an individual's particular history.
Optimal regimes are defined for any sequence of treatment and covariate history,
even a sequence h j that might not be possible to observe had the optimal regime
been followed by all participants from the first stage. Thus, an optimal regime
provides information not only on the best treatment choices from “time zero”, but
also on the treatment choices that would maximize outcome from some other time
or stage, even if a sub-optimal regime had been followed up to that point. The
sequential randomization or no unmeasured confounding assumption discussed in
Chap. 2 is important as it allows us to infer that the average counterfactual outcome
of people who received treatments āK had they instead received d j from stage j
onwards is the same as the average outcome in those people who actually received
treatments (ā j−1 , d j ) conditional on history, and thus identify the parameters of the
blip function.
The assumption of rank preservation, introduced in Chap. 2, provides a simplistic
situation in which the parameters of a SNMM may be interpreted at the individual
level. That is, additive local rank preservation gives that the difference in the outcome
that would be observed for each particular person (who has history H_j = h_j)
should he be treated with regime (ā_{j−1}, d_j^{ref}, d_{j+1}^{opt}) instead of regime (ā_j, d_{j+1}^{opt}) is
equal to γ_j(h_j, a_j) given some treatment a_j at stage j. However, SNMMs may be
used without making such assumptions, relying instead on an arguably more useful
population-level interpretation of average causal effects.
There are two special cases of optimal blip-to-reference functions that are
commonly used in the dynamic regimes literature and applications. We focus
here on binary treatment, i.e. A j ∈ {−1, 1},1 however the two SNMMs dis-
cussed below are mathematically equivalent under more general treatment types
(Robins 2004, pp. 243–245).
The first of these, suggested by Robins (2004), takes the reference regime to be
the zero regime, where by “zero regime” we mean some substantively meaningful
treatment such as placebo or standard care. Of course, like the optimal regime, what
is considered to be standard care may be different for participants with different
covariates or in different stages of treatment. Call this the optimal blip-to-zero func-
tion.
A second special case of the optimal blip-to-reference function, called the regret
function, takes the negative of the optimal blip-to-reference that uses optimal
treatment at stage j as the reference regime. Denote this by

μ_j(h_j, a_j) = E[ Y(ā_{j−1}, d_j^{opt}) − Y(ā_j, d_{j+1}^{opt}) | H_j = h_j ]

for j = 1, …, K. Thus the regret at stage j is the expected difference in the outcome
that would have been observed had the participant taken the optimal treatment in
stage j instead of treatment a_j, in participants who followed ā_j up to stage j
and the optimal regime from stage j + 1 onwards; note that this is identical in spirit
to the loss function L(o, a) introduced in Chap. 1.
For binary treatment and continuous outcome, the correspondence between the
optimal blip-to-reference functions and regrets is:

μ_j(h_j, a_j) = max_{a′_j} γ_j(h_j, a′_j) − γ_j(h_j, a_j),  or
γ_j(h_j, a_j) = μ_j(h_j, d_j^{ref}) − μ_j(h_j, a_j).
It is evident from these identities that if the regret is smooth in its arguments, the
optimal blip-to-zero will also be smooth. The converse does not hold: a smooth
optimal blip-to-zero may imply a discontinuous regret function. We shall henceforth
assume that d_j^{ref} equals the zero regime (coded −1), and simply refer to the optimal
blip-to-zero function as the optimal blip.
1 While the 0/1 coding of treatment is widely used in the causal inference literature, the −1/1
coding is more common in Q-learning and SMART design literature, and hence we will adopt it in
this chapter as in the rest of the book.
E[ Y(A₁, A₂) − Y(A₁, −1) | H₂ ] = (ψ₂₀ + ψ₂₁O₂ + ψ₂₃(A₁ + 1)/2)(A₂ + 1)/2,

giving d₂^{opt} = sign(ψ₂₀ + ψ₂₁O₂ + ψ₂₃(A₁ + 1)/2). However at previous stages (in
this example, there is only one prior stage: the first), the standard blip is

E[ Y(A₁, A₂^{opt}(Ō₂, A₁)) − Y(−1, A₂^{opt}(Ō₂, −1)) | H₁ ]
  = E[ (β₂₁ψ₁₀ + β₂₁ψ₁₁O₁)(A₁ + 1)/2
     + (ε + c₁ + c₂)(sign(ε + c₁ + c₂) + 1)/2
     − (ε + c₁)(sign(ε + c₁) + 1)/2 | H₁ ],

where

ε = ψ₂₁ε₁,
c₁ = c₁(O₁) = ψ₂₀ + ψ₂₁β₁₀ + ψ₂₁β₁₁O₁,
c₂ = c₂(O₁, A₁) = (ψ₂₂ + ψ₂₁ψ₁₀ + ψ₂₁ψ₁₁O₁)(A₁ + 1)/2.
This gives an optimal rule that contains both the probability density and cumulative
distribution functions of the normal distribution. Specific parametric knowledge of the distribution of the
state variable is required to estimate the optimal rule – but of course this is precisely
what we wish to avoid when using semi-parametric methods such as G-estimation,
which will be presented shortly, in Sect. 4.3. The study of standard SNMMs will not
be pursued further in this chapter, as, for the present, our interest lies in estimating
optimal dynamic regimes without explicitly modeling all aspects of the longitudinal
distribution of the data.
Under a given parameterization, the optimal rules may be expressed as

d_j^{opt}(h_j; ψ) = arg max_{a_j} γ_j(h_j, a_j; ψ)

or

d_j^{opt}(h_j; ψ) = { a_j such that μ_j(h_j, a_j; ψ) = 0 }
for j = 1, 2, . . . , K. Sections 4.3 and 4.4 discuss in greater detail methods of finding
estimators ψ̂ of ψ so that the optimal rules may be estimated to be the treatment
a j which maximizes γ j (h j , a j ; ψ̂ ) over all possible treatments at stage j. A number
of estimators for ψ have been proposed. For example, in some cases solutions can
be found in closed form while in others, an objective function must be minimized
or iteration is required (Murphy 2003). Once ψ̂ has been found by an appropriate
method, the optimal treatments are found by maximizing the regret or the optimal
blip function over all treatments where ψ̂ is used in place of ψ .
There is a variety of parameterizations that may be chosen to describe the optimal
blip function at each stage. For instance, we may suppose that the blips are time-dependent
(non-stationary), so that γ_j(h_j, a_j; ψ_j) is such that ψ_j ≠ ψ_k whenever
j ≠ k. On the other hand, when the state variable, O_j, measures the same quantity at
each stage, for example CD4 cell count in an HIV setting or white blood cell count
in a cancer trial, it may be more reasonable to assume that parameters are shared
across stages: ψ_j = ψ_k for all j, k. We consider the shared parameters case in more
detail in Chap. 9.
Define D_j(γ) to be the set of rules, d_j^{opt}, that are optimal under the optimal blip
model γ_j(h_j, a_j; ψ):

D_j(γ) = { d_j^{opt}(·; ψ) | d_j^{opt}(h_j; ψ) = arg max_{a_j} γ_j(h_j, a_j; ψ) for some ψ }.

Similarly, let D_j(μ) be the set of optimal rules that are compatible with regret model
μ_j(h_j, a_j; ψ):

D_j(μ) = { d_j^{opt}(·; ψ) | μ_j(h_j, d_j^{opt}(h_j; ψ); ψ) = 0 for some ψ }.
Murphy (2003) proposes modeling regrets using a known link function, f(u), that
relates the regret and the decision rule. For a scale parameter η_j(h_j) ≥ 0, set

μ_j^f(h_j, a_j) = η_j(h_j) × f(a_j − d_j^{opt}(h_j; ψ)),

and let

D_j(μ̆_j) = { d_j^{opt}(·; ψ) | d_j^{opt}(h_j; ψ) = arg min_{a_j} μ̆_j(h_j, a_j; ψ) for some ψ }

denote the set of optimal rules that are compatible with the approximation μ̆_j(h_j, a_j)
of μ_j(h_j, a_j). The approximate regret may not equal zero at the optimal regime.
In particular, using the expit function gives μ̆_j(h_j, a_j) = 0.5 at the optimal regime for
individuals whose covariate values lie exactly on the optimal decision rule threshold
(i.e., people for whom the optimal rule is not unique).
Suppose γ_j(h_j, a_j) = c(o_j; ψ)(a_j + 1)/2 is monotonic and increasing in o_j, so
that treatment is beneficial if a subject is above a threshold value of the random
variable O_j, and D_j(γ) = D_j(μ) = {sign(c(o_j; ψ))}. This holds true since whenever the optimal
blip is positive, the outcome is being maximized by taking treatment (A = 1) rather
than not (A = −1), while when the optimal blip is negative, the best one could do is
to have an expected difference in potential outcomes of zero, which is achieved by
not being treated (Fig. 4.1).
Fig. 4.1 (a) Monotonic, increasing optimal blip and (b) corresponding regret functions. The regret
in black is for A = −1, and in grey, for A = 1. Note that the regrets, where non-zero, are a reflection
of the optimal blip above the x-axis
4.3 G-estimation
Robins (2004) proposed finding the parameters ψ of the optimal blip function
or regret function via G-estimation. This method is a generalization of esti-
mating equations designed to account for time-varying covariates and general
treatment regimes. There are close ties between G-estimation and instrumental vari-
ables methods. To use an instrumental variable analysis to estimate a causal ef-
fect requires a variable (the instrument) that is associated with the outcome only
through its effect on the treatment received and possibly also through measured
confounders (Angrist et al. 1996). All that is required to define an unbiased esti-
mating equation is that the model residual is conditionally uncorrelated with the
instrument. Viewing the expected counterfactual outcome (G j (ψ ), defined below)
as the model and a centered function of treatment as the instrument, we may think of
G-estimation as an instrumental variables analysis (Joffe 2000); by the assumption
of no unmeasured confounding, treatment allocation at stage j is independent of
outcome and state in any future stage, conditional on treatment and state history.
See Joffe and Brensinger (2003) for a detailed one-stage explanation and imple-
mentation. Instrumental variables analysis and G-estimation fall under the wider
umbrella of the Generalized Method of Moments approach, thoroughly treated by
Newey and McFadden (1994).
Define

G_j(ψ) = Y + ∑_{k=j}^K { γ_k(h_k, d_k^{opt}; ψ) − γ_k(h_k, a_k; ψ) },

where γ_k(h_k, a_k) = E[ Y(ā_k, d_{k+1}^{opt}) − Y(ā_{k−1}, 0_k, d_{k+1}^{opt}) | H_k = h_k ] is the stage k
optimal blip. For example, in a two-stage setting we have

G_2(ψ) = Y − E[ Y(ā_2) | H_2 = h_2 ] + E[ Y(a_1, d_2^{opt}) | H_2 = h_2 ]
at the second stage, which is the observed outcome minus the expected counterfac-
tual outcome under the observed treatment (given the observed covariate history)
plus the expected counterfactual outcome under the observed treatment at the first
stage and the optimal treatment at the second stage (given the observed covariate
history). In expectation, the first of two terms cancel out, leaving only the expected
counterfactual outcome under the observed treatment at the first stage and the opti-
mal treatment at the second stage. Similarly, for the first stage in a two-stage setting,
we have

G_1(ψ) = Y + E[ Y(d_1^{opt}) − Y(a_1, d_2^{opt}) | H_1 = h_1 ] + E[ Y(a_1, d_2^{opt}) − Y(ā_2) | H_2 = h_2 ].

The third and fourth terms, −E[ Y(a_1, d_2^{opt}) | H_1 = h_1 ] and +E[ Y(a_1, d_2^{opt}) | H_2 = h_2 ],
cancel in expectation, as do the first and last, leaving only the expected counterfactual
outcome under optimal treatment at both stages.
Therefore, G_j(ψ) is a person's outcome adjusted by the expected difference
between the average outcome for someone who received a_j and someone who
was given the optimal treatment at the start of stage j, where both had the same
treatment and covariate history to the start of stage j − 1 and were subsequently
treated optimally. Under the assumption of additive local rank preservation, G_j(ψ)
equals the counterfactual outcome, not simply its expectation (Robins 2004); i.e.,
G_j(ψ) = Y(ā_{j−1}, d_j^{opt}). Now, let S_j(A_j) = s_j(H_j, A_j) be a vector-valued function
that is chosen by the analyst to contain the variables thought to interact with
treatment to effect a difference in the outcome; the range of the function is in
ℝ^{dim(ψ_j)}. For example, if K = 2, we may choose S₁(A₁) = (A₁ + 1)/2 · (1, O₁)^T
and S₂(A₂) = (A₂ + 1)/2 · (1, O₁, A₁, O₁A₁)^T, which is simply the derivative of the
stage j blip function with respect to ψ.
Model the probability of receiving treatment a j by p j (A j = 1|H j ; α ), where α
may be vector-valued; for binary treatment, this model is the propensity score which
was first introduced in Sect. 3.5. A common parametric model used to describe the
treatment allocation probabilities is the logistic model when treatment is binary,
however non-parametric models may also be used. Let

U(ψ, α) = ∑_{j=1}^K G_j(ψ) { S_j(A_j) − E[S_j(A_j) | H_j; α] }.   (4.2)

Setting U(ψ, α̂) equal to zero and solving yields an estimating equation for ψ. Efficiency
may be improved by also centering G_j(ψ) about a model E[G_j(ψ) | H_j; ς] for its
conditional mean given the history, yielding

U(ψ, α, ς) = ∑_{j=1}^K { G_j(ψ) − E[G_j(ψ) | H_j; ς] } { S_j(A_j) − E[S_j(A_j) | H_j; α] }.   (4.3)
Robins (2004) proved that the estimator ψ̂ of ψ using Eq. (4.3) is consistent provided
that either E[G_j(ψ)|H_j; ς] or p_j(A_j = 1|H_j; α) is correctly modeled, and
thus the estimate is said to be doubly-robust. In fact, the propensity score model
p_j(A_j = 1|H_j; α) need not be correctly specified in the sense of capturing the data-generating
mechanism, but rather it must include and correctly model the impact
of all confounding variables on the choice of treatment. To use Eq. (4.3), typically
estimates ς̂ and α̂ of the nuisance parameters ς and α are substituted into
the estimating equation. Estimates from Eq. (4.3) may be considerably less variable
than those from Eq. (4.2), but they are still not efficient. As with Eq. (4.2), if the
treatment model and its parameters are known (as they would be, for instance, in a
randomized trial with perfect compliance), estimates from (4.3) are more efficient
using estimated treatment probabilities than the known values (Robins 2004).
Efficient estimates can be found with judicious choice of the function S_j(A_j).
Unfortunately, the form of S_j(A_j) that leads to efficient estimates is typically complex
except in the special case where

Var[ G_j(ψ) | H_j, A_j ] = Var[ G_j(ψ) | H_j ]

for all j (Robins 1994). In the particular situation where this variance assumption
holds, setting

S_j(A_j; ψ) = E[ ∂G_j(ψ)/∂ψ | H_j, A_j ] (Var[G_j(ψ) | H_j])⁻¹

yields estimators that are semi-parametric efficient provided each of E[G_j(ψ)|H_j],
p_j(A_j = 1|H_j; α), E[∂G_j(ψ)/∂ψ], and Var[G_j(ψ)|H_j] is correctly specified. Note,
however, that “correct” specification of the treatment model does not in fact require
complete knowledge of the treatment assignment mechanism, but only that the
model p_j(A_j = 1|H_j; α) conditions on all variables that confound the relationship
between treatment in the jth stage, A j , and the outcome; Ertefaie et al. (2012) prove
this in the context of propensity score adjustment and inverse probability weighting.
When optimal blips are linear in ψ and parameters are not assumed to be shared
across stages, we can solve for ψ̂ explicitly. In general, for instance when blips are
not linear or parameters are shared across stages, search algorithms may be required.
Use the modification

G_{mod,j}(ψ) = Y − γ_j(h_j, a_j; ψ) + ∑_{k=j+1}^K { γ_k(h_k, d_k^{opt}; ψ) − γ_k(h_k, a_k; ψ) },

which is a person's outcome adjusted by the expected difference between the average
outcome for someone who received a_j and someone who was given the zero
regime at stage j, where both had the same treatment and covariate history to stage
j − 1 and were treated optimally from stage j + 1 onwards. Under additive local
rank preservation, G_{mod,j}(ψ) = Y(ā_{j−1}, −1, d_{j+1}^{opt}).
This modification allows recursive estimation using Eqs. (4.2) or (4.3). At the last
stage, we first estimate the nuisance parameters (ς_K, α_K) and note that G_{mod,K}(ψ) and
consequently U_K(ψ) then involve only a single (possibly vector-valued) unknown
parameter, ψ_K. Solve for ψ̂_K. Now estimate (ς_{K−1}, α_{K−1}) at the second-to-last stage,
K − 1. Substituting the estimate ψ̂_K leaves only the parameter ψ_{K−1}
unknown in U_{K−1}(ψ), and so ψ̂_{K−1} may be found. Continuing in this manner yields
recursive G-estimates for all optimal regime parameters, ψ_j, j = 1, …, K.
Recursive G-estimation is particularly useful when parameters are not shared
across stages (i.e., not stationary or common to different stages). An example of
blip functions for two stages which are linear in ψ but do have common parameters
between stages is γ₁(h₁, a₁) = (ψ₀ + ψ₁o₁)(a₁ + 1)/2 and γ₂(h₂, a₂) =
(ψ₀ + ψ₁o₂ + ψ₂a₁)(a₂ + 1)/2, since ψ₀ and ψ₁ appear in the blip functions at both
stages. In fact, recursive G-estimation may still be used when parameters are shared
by first assuming no sharing and then taking, for instance, the inverse-covariance
weighted average or even the simple average of the stage-specific estimates. Note
that Gmod, j (ψ ) could also be used in G-estimation (Eqs. (4.2) or (4.3)) without recur-
sion.
Accomplishing G-estimation (using either the standard or the recursive approach) requires estimates of the nuisance parameters ς and α. Thus, we can perform G-estimation in two steps: find ς(ψ) analytically by ordinary least squares and α by some possibly non-parametric method of estimation (step 1), then plug these estimates into Eq. (4.2) or (4.3) and solve to find ψ (step 2). For recursive G-estimation, we in fact have two steps of estimation at each stage, for a total of 2K steps. The impact of this two-step approach on the estimation of standard errors will be considered in Chap. 8.
In the one-stage case with univariate O and binary A, a linear (optimal) blip function gives
γ(o, a; ψ) = (ψ_0 + ψ_1 o)(a + 1)/2,
so that E[Y | O = o, A = a] = (ψ_0 + ψ_1 o)(a + 1)/2 + b(o), where the form of b(o) is
not specified. In this simple context, we may model b(o) non-parametrically and use
ordinary least squares to model the optimal blip as an alternative to G-estimation
(Robins 2004); in fact, the ordinary least squares (OLS) approach is a simplified
implementation of Q-learning. We consider two examples; in both, we generate the
state and treatment data such that O ∼ Uniform(−0.5,3) and A takes the value 1 with
probability expit(−2 + 1.8O). In the first example, Y = −1.4 + 0.8O + A(5 + 2O) + ε; in the second, Y = −1.4O³ + e^O + A(5 + 2O) + ε, with ε ∼ N(0, 1) for each. For both G-estimation approaches, based on (4.2) and (4.3), take S(A) = (A + 1)/2 · (1, O)^T. To perform G-estimation more efficiently, model E[H(ψ) | O] as a linear function of O. Recall that the estimator is doubly-robust, so that it is consistent even when E[H(ψ) | O] is incorrectly specified, provided the treatment allocation probabilities are correctly modeled with respect to the confounding variables. For the regression method, model b(o) in two ways: non-parametrically with a smoothing spline and parametrically via a linear model.
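To make the two-step recipe concrete, the following is a minimal R sketch that simulates the first example above and solves the Eq. (4.2)-type estimating equation in closed form; it is an illustration under the stated set-up, not code from any published package.

```r
## Minimal sketch: one-stage G-estimation for the first simulated example.
set.seed(42)
expit <- function(x) 1 / (1 + exp(-x))
n <- 1000
O <- runif(n, -0.5, 3)
A <- ifelse(runif(n) < expit(-2 + 1.8 * O), 1, -1)          # treatment in {-1, 1}
Y <- -1.4 + 0.8 * O + (A + 1) / 2 * (5 + 2 * O) + rnorm(n)  # linear b(o) example

## Step 1: estimate the nuisance treatment model P(A = 1 | O).
pA <- fitted(glm((A + 1) / 2 ~ O, family = binomial))

## Step 2: solve P_n[ G(psi) {S(A) - E[S(A)|O]} ] = 0, where
## S(A) = (A + 1)/2 (1, O)^T and G(psi) = Y - (psi0 + psi1 O)(A + 1)/2.
## The equation is linear in psi, so a closed form exists:
S <- cbind(1, O) * (A + 1) / 2   # chosen S(A)
D <- S - cbind(1, O) * pA        # S(A) - E[S(A) | O]
psi_hat <- solve(t(D) %*% S, t(D) %*% Y)
psi_hat                          # should be near the truth (5, 2)
```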
Table 4.1 shows results from 1,000 simulations in which G-estimation is compared to modeling the outcome, Y, among those who were not treated, with a straight line and with a smoothing spline, and then regressing the observed outcome minus the predicted value from the initial regression on the state variable. When Y(A = −1) depends linearly on O, all four methods exhibit little bias, with the smallest variability exhibited by the regression method which models b(o) linearly, followed closely by G-estimation using Eq. (4.3).
In the second example, where the dependence of Y on O is highly non-linear, G-estimation using (4.2) demonstrates the least bias of the four approaches; however, it is also the most highly variable. Using the more efficient G-estimating Eq. (4.3) reduces the standard error considerably, at the cost of introducing some bias at small sample sizes. The regression method that models b(o) linearly exhibits low variance but considerable bias even at large sample sizes.
Table 4.1 Comparison of G-estimation and OLS regression for a one-stage case

                  G-estimation                 Linear regression
                  Eq. (4.2)      Eq. (4.3)     Linear b(o)    Smooth b(o)
n       ψ         ψ̂       SE     ψ̂       SE    ψ̂       SE     ψ̂       SE
Y = −1.4 + 0.8O + A(5 + 2O) + ε
50 ψ0 = 5 5.183 1.071 5.020 0.777 5.029 0.599 5.062 1.004
ψ1 = 2 1.864 0.823 1.978 0.563 1.975 0.412 1.938 0.778
100 ψ0 = 5 5.040 0.602 4.984 0.476 4.976 0.381 4.996 0.598
ψ1 = 2 1.963 0.464 2.000 0.355 2.007 0.263 1.993 0.480
1,000 ψ0 = 5 5.008 0.176 5.001 0.150 5.000 0.123 4.998 0.172
ψ1 = 2 1.993 0.136 1.998 0.111 1.999 0.084 2.001 0.130
Y = −1.4O3 + eO + A(5 + 2O) + ε
50 ψ0 = 5 4.626 3.846 5.655 1.917 11.358 2.389 7.449 2.018
ψ1 = 2 2.167 3.494 1.452 1.501 −3.134 1.585 −0.018 1.695
100 ψ0 = 5 4.940 1.893 5.318 1.187 10.982 1.541 6.817 1.208
ψ1 = 2 1.944 1.980 1.680 0.990 −2.907 1.098 0.523 1.086
1,000 ψ0 = 5 4.981 0.481 5.011 0.355 10.726 0.476 6.319 0.286
ψ1 = 2 2.011 0.550 1.990 0.295 −2.654 0.337 1.001 0.244
Under the following sufficient conditions, it has been shown that Q-learning and G-estimation are algebraically equivalent when linear models are used for the Q-functions (Chakraborty et al. 2010):
(i) The parameters in Q_1^opt and Q_2^opt are distinct;
(ii) A_j has zero conditional mean given the history H_j = (Ō_j, Ā_{j−1}), j = 1, 2; and
(iii) The covariates used in the model for Q_1^opt are nested within the covariates used in the model for Q_2^opt, i.e., (H_{10}^T, H_{11}^T A_1) ⊂ H_{20}^T, where H_{j0} and H_{j1} are two vector summaries of the history H_j, denoting the main effect of history and the part of history that interacts with treatment.
Recall that with binary treatments A j coded −1/1, the random variable A j may in
fact have a zero mean conditional on covariate history.
Recall that the regret is given by μ_j(h_j, a_j) = E[Y(ā_{j−1}, d_j^opt) − Y(ā_j, d_{j+1}^opt) | H_j = h_j], which can also be expressed as
−μ_j(h_j, a_j) = Q_j^opt(h_j, a_j) − max_{a_j} Q_j^opt(h_j, a_j),
so that, under the linear models above,
−μ_j(H_j, A_j; ψ_j) = ψ_j^T H_{j1} a_j − |ψ_j^T H_{j1}|, j = 1, 2.
By writing the second-stage least squares solution as
β̂_2 = [P_n(H_{20} H_{20}^T)]^{−1} {P_n(H_{20} Y_2) − P_n(H_{20} H_{21}^T A_2 ψ_2)},
we have shown that the G-estimating equation is identical to the second-stage least squares regression performed in Q-learning. The proof for the first-stage estimation follows in a similar fashion.
In the case of shared parameters, the G-estimating functions are stacked and solved simultaneously. Approximate solutions to the G-estimating functions and Q-learning functions have been considered by taking the outer product of the estimating functions and searching for values of ψ that minimize the resulting quadratic function (Chakraborty and Moodie 2013), as in the sketch below. In such circumstances, assuming conditions (ii) and (iii) above, it can again be shown that Q-learning and G-estimation are equivalent.
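As a rough illustration of this device, suppose U is a hypothetical function (written by the analyst, not from any package) returning the stacked vector of sample-averaged stage-wise estimating functions at ψ; an approximate solution then minimizes the quadratic form U(ψ)ᵀU(ψ):

```r
## Hypothetical: U(psi) returns the stacked estimating functions P_n U_j(psi);
## minimizing the quadratic form U'U gives an approximate solution for psi.
psi_hat <- optim(psi_init, fn = function(psi) sum(U(psi)^2))$par
```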
As noted previously, even in randomized trial settings where treatment probabilities are fixed and known, it is more efficient to use estimates of the propensity score rather than the known randomization probabilities; since the propensity score does not involve the parameters ψ_j, it is typically estimated at the outset of a G-estimation analysis and substituted into the G-estimating functions for the DTR parameters. Thus, while Q-learning and G-estimation are in some instances equivalent, the typical implementation of these methods leads to estimates which are not identical.
μ̃_j = Q_j^opt(H_j, A_j) − Q_j^opt(H_j, −1)
4.4 Regret-Based Methods of Estimation
Murphy (2003) developed a method that estimates the parameters of the optimal regime, ψ, by searching for (ψ̂, ĉ) which satisfy
∑_{j=1}^{K} P_n [ {Y + ĉ + ∑_{k=1}^{K} μ_k(H_k, A_k; ψ̂) − ∑_a μ_j(H_j, a; ψ̂) p_j(a | H_j; α̂)}² ]
≤ ∑_{j=1}^{K} P_n [ {Y + c + ∑_{k≠j} μ_k(H_k, A_k; ψ̂) + μ_j(H_j, A_j; ψ) − ∑_a μ_j(H_j, a; ψ) p_j(a | H_j; α̂)}² ] (4.4)
for all c and all ψ. Treatment probabilities – i.e., the parameters α of the propensity score – can be estimated in the same fashion as for G-estimation. The scalar quantity c is not easily interpreted, except in the special case of no effect of treatment, when ĉ = −P_n(Y), i.e. the negative of the sample mean of the outcomes. In fact, c (ĉ) may be omitted from (4.4); it is not required for estimation but greatly improves the stability of the procedure (Murphy 2003). The estimator ψ̂ is consistent for ψ provided the treatment allocation probabilities, p_j(A_j = 1 | H_j; α), are correctly specified with respect to confounding variables.
Murphy (2003) described an iterative method of finding solutions to (4.4), which begins by selecting an initial value of ψ̂, say ψ̂^(1), then minimizing the right-hand side (RHS) of the equation over (ψ, c) to obtain a new value of ψ̂, ψ̂^(2), and repeating this until convergence. This iterative minimization of regrets (IMOR) method may not produce a monotonically decreasing sequence of RHS values of Eq. (4.4). Furthermore, this iterative procedure may not converge to a minimum; use of several starting seeds and profile plots of the RHS of (4.4) for each parameter in a stage about its estimate may reassure the analyst that a minimum was reached.
Rosthøj et al. (2006) provided an empirical demonstration of the method applied to estimate the optimal dose of anticoagulation medication, and investigated convergence properties through extensive simulations. The simulation study suggested that IMOR may not converge when samples are small (e.g. 300), there is considerable noise in the data, or the researcher cannot posit good initial values for the search algorithm; mis-specification of the treatment model can lead to serious convergence problems, indicating that IMOR is not a doubly-robust procedure.
In the following section, we will see that IMOR is closely connected to, but not
in general the same as, Robins’ more efficient estimation (Eq. (4.3)) and that these
are equivalent under the null hypothesis of no treatment effect for particular model
choices in Eq. (4.3). Note that the term “more efficient G-estimation” is used to
distinguish between the two G-estimating Eqs. (4.2) and (4.3), and is not meant to
imply that G-estimation is more efficient than IMOR.
Consider the one-stage case with observed variables O, A, and Y, where A is binary and O and Y are both univariate. We shall demonstrate that G-estimation and IMOR are equivalent, given specific choices of models. Robins (2004, Theorem 9.1) proves that for
γ(o, a; ψ) = E[Y | O = o, A = a] − E[Y | O = o, A = −1],
which equals E[Y(a) − Y(−1) | O = o] under the no unmeasured confounding assumption, γ(o, a) is the unique function g(o, a) minimizing
E[ {Y − g(O, A) − E[Y − g(O, A) | O]}² ] (4.5)
subject to g(o, −1) = 0. This constraint on g(·) is required to restrict the function to be in the class of blip functions. In G-estimation, minimization occurs when the derivative of Eq. (4.5) with respect to ψ equals zero. At the minimum, g(o, a) = γ(o, a), which gives Y − g(o, a) = G(ψ) = G_mod(ψ). Taking S(A) equal to −∂/∂ψ g(O, A), the derivative of Eq. (4.5) equals Eq. (4.3), the more efficient G-estimating equation.
For a one-stage problem, IMOR proceeds directly by minimizing the left-hand side of Eq. (4.4). At the minimum we have
E[ {Y − g(o, a) − E[Y − g(o, a) | O]}² ] = E[ {Y − γ(o, a) − E[Y − γ(o, a) | O = o]}² ]
with ĉ = −μ(o, −1) + E[μ(o, −1) − Y | O = o].
One critical difference exists: IMOR does not model E[G_mod(ψ) | O = o] explicitly, but rather does so through the regrets and ĉ. This expression for ĉ makes clear that, under the null hypothesis of no treatment effect, ĉ = E[G_mod(ψ)] = E[Y], and IMOR is equivalent to G-estimation using Eq. (4.3) with E[G_mod(ψ) | O = o] modeled by a constant.
Suppose now that we have K stages and we observe Ō_K, Ā_K, and Y, where A_j is binary and O_j, Y are univariate for all j. Suppose also that parameters are not shared across stages, so that ψ_j ≠ ψ_k for j ≠ k. Robins (2004, Corollary 9.2) extended Eq. (4.5), proving that for an optimal blip γ_j(h_j, a_j) with parameters ψ_j, the unique function g(h_j, a_j) minimizing
E[ { Y − g(h_j, a_j) + ∑_{k=j+1}^{K} [γ_k(h_k, d_k^opt; ψ_k) − γ_k(h_k, a_k; ψ_k)]
  − E[ Y − g(h_j, a_j) + ∑_{k=j+1}^{K} [γ_k(h_k, d_k^opt; ψ_k) − γ_k(h_k, a_k; ψ_k)] | H_j = h_j ] }² ] (4.6)
is the optimal blip itself. When S(A_j) = −∂/∂ψ_j g(H_j, A_j), Eq. (4.6) equals a G-estimating equation of the same form as Eq. (4.3) using the modified version of the counterfactual quantity G_j(ψ).
IMOR is another method of recursive minimization. At any stage j, taking g(h_j, a_j) = γ_j(h_j, a_j; ψ_j) = μ_j(h_j, −1; ψ_j) − μ_j(h_j, a_j; ψ_j) in Eq. (4.6) leads to the RHS of (4.4) for a single stage with
−ĉ = μ_j(h_j, −1; ψ_j) + ∑_{k=1}^{j−1} μ_k(h_k, a_k; ψ̂_k) + ∑_{k=j+1}^{K} μ_k(h_k, a_k; ψ̂_k).
However, the parameter c in Eq. (4.4) is not stage-specific, and IMOR and G-estimation are not in general equivalent. As in the one-stage instance, there is an important difference in the way the methods achieve their solutions, namely whether E[G_mod,j(ψ) | H_j = h_j] is modeled explicitly or through the regrets and ĉ. As in the case of a single stage, under the null hypothesis of no treatment effect, ĉ = E[G_mod,j(ψ)] = E[Y], so there is equivalence between IMOR and G-estimation (4.3) when E[G_mod,j(ψ) | H_j = h_j] is modeled with a constant (which is stationary across all stages) and S(A_j) = −∂/∂ψ_j g(H_j, A_j).
Regarding the relative efficiency, we can make the following points:
(i) Under the null hypothesis of no treatment effect, IMOR is a special case of G-estimation using Eq. (4.3) in which E[G_mod,j(ψ_j) | H_j] is assumed to be constant.
(ii) Under regularity conditions, estimates from Eq. (4.3) are the most efficient among the class of G-estimates using a given function S(A_j) when both the propensity score (w.r.t. confounders) and expected counterfactual models are correctly specified (Robins 2004, Theorems 3.3(ii), 3.4).
(iii) Equation (4.3) does not satisfy the regularity conditions under the null hypothesis, due to non-differentiability of the estimating equation in a neighborhood of ψ = 0. However, the conditions hold for constant blip functions, γ_j(h_j, a_j) = a_j ψ_j (which may depend on j but not h_j), which posit no treatment interactions. (See Chap. 8 for a thorough consideration of the problem of non-regularity and solutions.)
Therefore, we may say that if the null hypothesis holds and we estimate a constant blip model (which trivially is correctly specified under the null hypothesis of no treatment effect), then G-estimation is more efficient than IMOR when E[G_mod,j(ψ) | H_j = h_j] = E[Y | H_j = h_j] depends on H_j and is correctly specified in G-estimating Eq. (4.3). If E[G_mod,j(ψ) | H_j = h_j] is constant, IMOR and Eq. (4.3) yield efficient estimators.
4.4.2 A-Learning
4.4.3 Regret-Regression
Two very similar methods have been proposed to model blip or regret function parameters using regret-regression. The first, proposed by Almirall et al. (2010), relies on the observation that, in a two-stage setting, the marginal mean of the counterfactual outcome Y(a_1, a_2) can be decomposed into a grand mean β_0 = E[Y(−1, −1)], zero blip-to-zero functions γ_j^0 — i.e. blip functions that consider the zero treatment regime to be the reference at stage j and assume zero treatment at all subsequent stages — and nuisance functions ε_j(H_j), which must have mean zero for the equality of the decomposition to hold. In particular, ε_1(H_1) = E[Y(−1, −1) | H_1] − E[Y(−1, −1)] and ε_2(H_2) = E[Y(a_1, −1) | H_2] − E[Y(a_1, −1) | H_1]. One modeling possibility is to set the functions equal to a linear function of the residual of O_j about its estimated conditional mean: ε_j(H_j) = β_j(O_j − Ê[O_j | H_{j−1}, A_{j−1}]).
A linear model implementation of the algorithm can thus be described in brief as following the steps (see the sketch below):
1. At each stage, regress O_j on the history H_{j−1} and A_{j−1} and set Z_j = O_j − Ê[O_j | H_{j−1}, A_{j−1}].
2. Estimate the parameters of
E[Y | H̄_K, Ā_K; β, ψ] = β_0 + ∑_{j=2}^{K} β_j^T Z_j − ∑_{j=1}^{K} γ^0(H_j, A_j; ψ_j)
by least squares.
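A minimal R sketch of these two steps for K = 2 follows, assuming a hypothetical data frame df with columns O1, A1, O2, A2, Y, treatments coded −1/1, and linear zero blip-to-zero functions; the sign convention follows the displayed model.

```r
## Step 1: residualize the stage-2 state.
Z2 <- df$O2 - fitted(lm(O2 ~ O1 + A1, data = df))

## Step 2: least-squares fit of the displayed model,
## th = (beta0, beta2, psi10, psi11, psi20, psi21).
rss <- function(th) {
  g1 <- (th[3] + th[4] * df$O1) * (df$A1 + 1) / 2  # gamma0_1(H1, A1; psi1)
  g2 <- (th[5] + th[6] * df$O2) * (df$A2 + 1) / 2  # gamma0_2(H2, A2; psi2)
  sum((df$Y - (th[1] + th[2] * Z2 - g1 - g2))^2)
}
est <- optim(rep(0, 6), rss, method = "BFGS")$par
```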
In simulation, this method appeared to perform better than IMOR in terms of both
bias and variability.
The regret-regression methods described above require estimation of the components of the full data-likelihood that involve the time-varying covariates O_j, but not the treatment decisions, A_j. In contrast, G-estimation and IMOR do not require estimating the components of the data likelihood relating to the state variables. It has been argued that it may be easier to model the treatment mechanism than the covariate mechanism. This is undoubtedly the case in sequentially randomized trials, but may be subject to debate in observational studies. Finally, as noted in Sect. 4.3.1, G-estimation enjoys the property of double-robustness, which is not a feature of
the regret-based methods described in this section. However, under correct model specification, the regression-based estimators appear to enjoy lower variability than G-estimators, and it is reasonable to conjecture that the regret-regression estimator will as well. Furthermore, as the above methods are regression-based, the usual linear-regression residual diagnostic techniques may be used to guide the choice of the regret function. See Sect. 9.2 for further discussion of model-checking in DTR estimation.
To implement G-estimation for one stage, we modeled the optimal blip function with a simple linear model,
γ(o, a; ψ) = (ψ_0 + ψ_1 o)(a + 1)/2. (4.7)
The G-estimates (95 % CI), which can be found in closed form, are ψ̂_0 = 0.847 (0.337, 1.358) and ψ̂_1 = −0.014 (−0.023, −0.005), so that the optimal rule is to treat all amblyopes who are no older than 61.7 months. Bootstrap re-sampling was used to estimate standard errors. The confidence interval of each parameter excludes 0, implying that there is a significant effect of treatment at the 5 % level. Using G-estimating Eq. (4.3) yields the same estimate of the optimal rule.
The G-estimation result, which suggests that occlusion is beneficial but less so at
older ages, is in keeping with medical knowledge of neuro-development. Both the
models and the methods of estimation varied between the two analyses described
above, and so we now investigate further to discern the source of the differing
results.
The linear blip of Eq. (4.7) implies a symmetric regret in which treatment should occur for low values of O; using G-estimation, the regret is estimated to be
μ(O, A; ψ̂) = |ψ̂_0 + ψ̂_1 O| (A + sign(O − β̂))²/4, with threshold β̂ = −ψ̂_0/ψ̂_1. (4.9)
Suppose that the model of the regret from the G-estimation approach, described in (4.9), is correct, and take the optimal treatment to be: treat only patients who are no older than 62 months. The threshold β_2 in the SNMM described by μ_f(O, A) corresponds to −ψ_0/ψ_1 = 62 from μ(O, A) in Eq. (4.9). To help visualize this, let β_2 = 62 and arbitrarily choose β_1 = 0.4; then the model μ_f(O, A) is a step function (Fig. 4.2), which assigns equal regret to all treatment regimes other than that which is optimal. If the model of Eq. (4.9) is correct, then μ_f(O, A) does not capture its “peakedness”, and it is in exactly this case that simulations have shown the IMOR method to perform less well (Murphy 2003).
Fig. 4.2 (a) Optimal blip and (b) regret functions. The blue lines are under the linear blip parameterization; the red, under μ_f(O, A); and the dashed line is the threshold for the optimal rule. In (b), the solid lines are regrets for A = 1; the dotted lines, for A = −1
Using IMOR to find β using model (4.9), i.e. |β − O| × (A + sign(O − β))²/4, gives estimates similar to those found via G-estimation: β̂ (95 % CI) = 61.3 (58.6, 64.1). Thus, when comparable blip and regret models are used, G-estimation and IMOR yield similar estimates. Restricting the analysis to the children who were followed to 12 weeks resulted in the same decision rule (61 months), and varying the number of hours of occlusion required to define treatment also failed to substantially change the optimal decision rule.
Suppose that children aged 36–96 months are treated for amblyopia by eye patching over 12 weeks, with a check-up at 8 weeks. The outcome is a utility function incorporating a child's visual acuity and a measure of psychological stress endured due to wearing an eye patch. The variables are generated with errors ε ∼ N(0, 0.05) and δ ∼ N(0, 0.12), independent of each other and of Ā_2, Ō_2, Y. Treatment A_j takes the value 1 with probability p_j, where p_1 = expit(2 − 0.03 O_1) and p_2 = expit(−0.1 + 0.5 O_2). The optimal blips are parameterized by ψ = (ψ_10, ψ_11, ψ_20, ψ_21, ψ_22, ψ_23); true values are given in Table 4.2.
Table 4.2 Comparison of G-estimation and IMOR for two stages in 1,000 simulated data sets (n = 500, 1,000, 2,000): a hypothetical trial of occlusion therapy for treatment of amblyopia

              G-estimation                      IMOR
              Eq. (4.2)         Eq. (4.3)
ψ             ψ̂        SE       ψ̂        SE      ψ̂        SE
n = 500
ψ10 = 18.0 17.845 2.554 17.984 0.717 17.996 0.561
ψ11 = −0.3 −0.298 0.038 −0.300 0.009 −0.300 0.008
ψ20 = 3.0 3.059 2.122 3.000 0.652 3.023 0.416
ψ21 = 0.5 0.456 3.001 0.507 0.857 0.475 0.635
ψ22 = 2.0 1.952 2.750 2.006 0.728 1.986 0.613
ψ23 = 0.0 0.016 4.630 −0.031 1.017 0.010 0.985
n = 1,000
ψ10 = 18.0 17.914 1.734 17.998 0.470 18.019 0.381
ψ11 = −0.3 −0.299 0.026 −0.300 0.006 −0.300 0.006
ψ20 = 3.0 3.031 1.463 2.994 0.447 3.014 0.281
ψ21 = 0.5 0.469 2.068 0.506 0.582 0.483 0.432
ψ22 = 2.0 1.968 1.886 2.002 0.508 1.987 0.418
ψ23 = 0.0 0.028 3.171 −0.008 0.700 0.017 0.674
n = 2,000
ψ10 = 18.0 17.916 1.197 17.997 0.327 18.006 0.269
ψ11 = −0.3 −0.299 0.018 −0.300 0.004 −0.300 0.004
ψ20 = 3.0 3.085 1.012 3.018 0.314 3.018 0.193
ψ21 = 0.5 0.386 1.432 0.473 0.412 0.473 0.297
ψ22 = 2.0 1.902 1.302 1.976 0.355 1.976 0.288
ψ23 = 0.0 0.144 2.196 0.030 0.487 0.038 0.463
4.5 Discussion

Chapter 5
Estimation of Optimal DTRs by Directly Modeling Regimes

Here V̂^d (or V̂^{d(ψ)}) denotes the estimated value function of the regime d (or d(ψ)).
Perhaps one of the simplest ways to conceptualize the indexing parameter ψ is
to consider treatment rules of the form: “At stage j, change treatment when the
tailoring variable (suitable summary of the available history H j ) falls below or
above a threshold ψ ”. This class of methods is known as policy search methods
in the RL literature (Ng and Jordan 2000). A variety of methods from the statistics
literature, including inverse probability weighting (Robins et al. 2000) and marginal
structural models (Robins et al. 2000; Hernán et al. 2000; Murphy et al. 2001), fall
under this class of methods. The current chapter is devoted to a detailed description
of these methods.
5.1 Estimating the Value of an Arbitrary Regime: Inverse Probability Weighting

The most crucial part of all the procedures mentioned above is the estimation of the value function for an arbitrary regime (or treatment policy) d. The value of d can be estimated from a sample of n data trajectories of the form {O_1, A_1, . . . , O_K, A_K, O_{K+1}} in several ways. Note that the expectation in the expression for the value in (3.6) is taken with respect to the distribution P_d, but the data trajectories are drawn from a distribution P_π corresponding to the exploration policy π; see Chap. 3 for more details. When d = π, the estimation is relatively straightforward. For example, in a SMART, the investigator may be naturally interested in estimating the values of the regimes “embedded” in the study (these are the exploration policies). To make the discussion concrete, let us consider the hypothetical SMART design (with K = 2) in the addiction management context introduced in Chap. 2.
Fig. 5.1 Hypothetical SMART design schematic for the addiction management example (an “R” within a circle denotes randomization at a critical decision point); treatment options include NTX, CBT, EM+CBT+NTX, TM, and TMC
There are eight embedded regimes in this study; see Fig. 5.1, which is a reproduction of Fig. 2.2. For example, one embedded regime is, “treat the patient with NTX at stage 1; give TM at stage 2 if the patient is a responder, and give CBT at stage 2 if the patient is a non-responder”. Other embedded regimes can be described similarly. Estimating the value of any of these embedded regimes can be done by collecting all the subjects whose realized treatment experiences are consistent with the rules given by the embedded regime of interest, and computing the sample average of the primary outcome. When the regime to be evaluated, d, is not one of the embedded regimes in a study, the estimation is more complicated. Viewed from a causal inference perspective, this is a problem of estimating a counterfactual mean. By a change of probability measure, the value of d can be written as
V^d = ∫ Y dP_d = ∫ (dP_d/dP_π) Y dP_π,
where dP_d/dP_π is a version of the Radon–Nikodym derivative, and is given by the ratio of the two likelihoods (3.2) and (3.1). This ratio simplifies to
∏_{j=1}^{K} I[A_j = d_j(H_j)] / π_j(A_j | H_j).
Note that the trick of changing the probability measure employs the same basic idea as importance sampling in Monte Carlo simulation. Thus, by changing the probability measure as above, the expression for the value becomes
V^d = ∫ ∏_{j=1}^{K} {I[A_j = d_j(H_j)] / π_j(A_j | H_j)} Y dP_π = ∫ w_{d,π} Y dP_π,
where
w_{d,π} = ∏_{j=1}^{K} I[A_j = d_j(H_j)] / π_j(A_j | H_j)
is a weight function depending on the entire data trajectory (we deliberately suppress the dependence on A_j and H_j for notational simplicity). A natural way to estimate V^d is by its empirical version V̂^d,
V̂^d = P_n [w_{d,π} Y], (5.1)
where P_n denotes the empirical average over a sample of size n. Even though the expectation of the weight function is 1, it is preferable to normalize the weights by their sample mean to obtain a more stable estimate. The resulting estimator is known as the inverse probability of treatment weighted (IPTW) estimator (Robins et al. 2000), or more simply the inverse probability weighted or weighting (IPW) estimator, and is given by
V̂^d_IPTW = P_n [w_{d,π} Y] / P_n [w_{d,π}]. (5.2)
In the case where the data arise from a SMART, the exploration policy consisting
of the randomization probabilities π j (A j |H j ) is known by design. Hence, by the law
of large numbers, the IPTW estimator is consistent. However, the IPTW estimator
is highly variable due to the presence of the non-smooth indicator functions inside
the weights.
Recently Zhang et al. (2012b) proposed a similar, doubly-robust estimator of the value function for a single-stage treatment regime, using augmented inverse probability of treatment weighting. Let a data trajectory be given by (H, A, Y) with the treatment A ∈ {−1, 1}, and H ≡ O (since this is a one-stage setting). Let d(H; ψ) denote a regime indexed by ψ, μ(A, H; β̂) an estimated model for the mean outcome as a function of baseline covariates H and treatment A, and π(H; γ̂) an estimated propensity score. Then
V̂^d_AIPTW = P_n [ C_ψ Y / π_c(H; ψ, γ̂) − {(C_ψ − π_c(H; ψ, γ̂)) / π_c(H; ψ, γ̂)} m(H; ψ, β̂) ]
is the doubly-robust, augmented IPTW estimator of the mean outcome (value) under treatment rule d(H; ψ), where C_ψ = I[A = d(H; ψ)] indicates that the treatment received is consistent with the regime, π_c(H; ψ, γ̂) = P(C_ψ = 1 | H) under the fitted propensity score, and m(H; ψ, β̂) = μ(d(H; ψ), H; β̂) is the modeled outcome under the regime-recommended treatment.
Thus, for a specific value of ψ (denoting a specific regime), the contribution to the value function estimator for someone treated with A = 1 and for whom d(H; ψ) = 1 is
Y / π(H; γ̂) − {(1 − π(H; γ̂)) / π(H; γ̂)} μ(1, H; β̂),
while for someone treated with A = 1 and for whom d(H; ψ) = −1 it is simply μ(−1, H; β̂). Similarly, the contribution to the value function estimator for someone who received A = −1 and for whom d(H; ψ) = −1 is
Y / (1 − π(H; γ̂)) − {π(H; γ̂) / (1 − π(H; γ̂))} μ(−1, H; β̂),
while for someone with A = −1 and d(H; ψ) = 1 it is μ(1, H; β̂). Thus, each individual contributes a convex combination of their observed outcome, Y, and their modeled outcome, μ(d(H; ψ), H; β̂), to the value estimator. In addition to being more robust to model mis-specification, doubly-robust estimators tend to be more efficient than their non-augmented counterparts (Robins 2004).
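The estimator can be written compactly; the following R sketch assumes all inputs (regime recommendations, fitted propensity and outcome models) have been pre-computed by the analyst, and is an illustration rather than the authors' code.

```r
## One-stage augmented IPTW value estimator under the definitions above.
## d: regime-recommended treatments in {-1, 1}; pi1 = est. P(A = 1 | H);
## mu1, mum1 = est. E[Y | A = 1, H] and E[Y | A = -1, H] (vectors).
aipw_value <- function(Y, A, d, pi1, mu1, mum1) {
  C   <- as.numeric(A == d)            # consistency with the regime
  pic <- ifelse(d == 1, pi1, 1 - pi1)  # P(C = 1 | H)
  m   <- ifelse(d == 1, mu1, mum1)     # modeled outcome under d
  mean(C * Y / pic - (C - pic) / pic * m)
}
```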
When the ultimate interest lies in picking a regime that is optimal, one can consider the estimated value as a function of ψ, and then select the value of ψ, say ψ^opt, that maximizes V̂^ψ_AIPTW ≡ V̂^{d(H;ψ)}_AIPTW. One can view this approach as single-stage marginal structural modeling, where the target of estimation becomes the marginal mean conditional on baseline covariates, i.e. V^ψ_AIPTW(O_1), instead of the overall marginal mean V^ψ_AIPTW; see Sect. 3.3 to understand the distinction between the two. See Sect. 5.2 for details on the marginal structural modeling approach.
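For instance, under a hypothetical one-dimensional threshold rule d(H; ψ) = sign(O − ψ), the search over ψ can be a simple grid scan re-using aipw_value() from the sketch above (all objects assumed defined as before):

```r
## Hypothetical grid search over threshold regimes d(H; psi) = sign(O - psi).
psis <- seq(min(O), max(O), length.out = 50)
vals <- sapply(psis, function(p)
  aipw_value(Y, A, d = sign(O - p), pi1, mu1, mum1))
psi_opt <- psis[which.max(vals)]  # index of the regime maximizing the value
```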
Zhang et al. (2012b) also considered a similar estimator based on a standard IPTW formulation (i.e. without the augmentation term), but found the resulting estimated optimal regime inferior in performance to that obtained via the augmented IPTW estimating function.
5.2 Marginal Structural Models and Weighting Methods

Marginal structural models (MSMs) were originally proposed to estimate the effect of static treatment regimes (Robins 1999a; Robins et al. 2000; Hernán et al. 2000), i.e., treatment regimens that are not tailored to evolving patient characteristics; however, they are increasingly being applied to the problem of estimating optimal DTRs.
These models are said to be marginal because they pertain to population-average effects (marginalizing over all time-varying covariates and/or intermediate outcomes, and possibly also over some or all baseline covariates), and structural because they describe causal (not associational) effects. The approach requires an initial investment in data manipulation, but is appealing because of the ease with which the models may be estimated using standard software. Furthermore, the approach provides a mechanism for evaluating the effect of small changes in the parameter indexing the regime (e.g. a decision rule threshold) on the average potential outcome in the population.
Although in discussing the estimation of marginal structural models the focus in this text is on inverse probability weighting, estimation can also be performed by other means, such as targeted maximum likelihood (Van der Laan and Rubin 2006; Neugebauer et al. 2010). In brief, targeted maximum likelihood estimation can estimate treatment effects for longitudinal data in the presence of time-dependent confounders; the method is doubly-robust and can be made to optimize asymptotic estimation efficiency, but may not be implemented as easily as IPTW in complex scenarios.
For a static regime a, the IPW estimating function takes the form
U^IPW(Y(a), H_K | w, β) = w(A | H_K) ∂/∂β V^a(O_1; β) [Y − V^a(O_1; β)].
In the data-augmentation approach, a copy of an individual's data is created for each regime with which their observed history is consistent; we follow Shortreed and Moodie (2012) in calling these copies replicates. For example, if an individual's data are consistent with regime d̄_K through stage j, we say that the replicate follows that regime through stage j; at the point where the individual's observed history is no longer compatible with regime d̄_K, the replicate corresponding to that individual and threshold ψ is artificially censored. A weighted analysis of this augmented data set with artificial censoring mimics an analysis of a trial in which individuals are randomized to follow one of the treatment regimes of interest, under the assumptions of Sect. 2.1.3 as well as the assumption of correct specification of the marginal response model.
Zhao et al. (2012) and Zhang et al. (2012b) proposed closely related approaches that straddle the static and dynamic regime settings, in that they seek to estimate a personalized treatment rule, but do so in a single-stage setting only, so that the regime is not truly dynamic, or changing, over time. The approach of Zhang et al. (2012b) has already been discussed in Sect. 5.1; we will discuss the approach of Zhao et al. (2012) in Sect. 5.3 while considering a classification-based approach to estimating the value function.
As in the case of estimating the optimal treatment rule for a single stage of treatment, estimation of the optimal DTR for multi-stage treatments requires finding the regime d that maximizes the population average outcome V^d(O_1) = E_d[Y | O_1] = E[Y(d) | O_1], or alternatively V^d = E[Y(d)]. Then
U^IPW(Y(d), H_K | w, β) = w(A | H_K) ∂/∂β V^d(O_1; β) [Y − V^d(O_1; β)]
= w(A | H_K) ∂/∂β V^a(O_1, ψ; β) [Y − V^a(O_1, ψ; β)]
is the estimating function for the marginal structural model, where w = w(A | H_K) is a weight for a replicate in the augmented data set, and the threshold ψ is treated as a covariate in the outcome model, which is parameterized by β. The weight w is constructed by taking the product of the probability of receiving the assigned treatment regime and the probability of continued observation, i.e. of not being lost to follow-up (not censored) or artificially censored, of a replicate in the augmented data set under the assigned treatment regime. It is typically the case in MSM estimation of DTRs that, given a replicate's current covariates and a regime threshold ψ, the probability of continued observation at any stage j is equivalent to the probability of receiving treatment consistent with the assigned regime at that stage.
Then construct the final weight for each replicate i by taking the product over all observed stages, w_i = ∏_j w_i(j).
4. Perform a weighted linear regression with weights w_i to obtain the coefficient estimates of the model E[Y(d) | O_1] = V^a(O_1, ψ; β). Typically, the model posited for V^a(O_1, ψ; β) will not be monotonic, but rather will allow for a flexion point, thus allowing the value to be maximized at some value of ψ other than the boundaries of Ψ. (A sketch of steps 3–4 follows below.)
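A minimal R sketch of steps 3–4: one replicate is kept per person-threshold pair consistent with the regime, then a weighted quadratic response model is fit. Here follows_regime() and the stage-wise weight columns w1, w2 are hypothetical placeholders for the consistency check and weights described above.

```r
## Build the replicate data set and fit the weighted response model.
thresholds <- seq(-30, 20, by = 2)
reps <- do.call(rbind, lapply(thresholds, function(psi) {
  r <- df[follows_regime(df, psi), ]  # hypothetical consistency indicator
  transform(r, psi = psi)
}))
reps$w <- reps$w1 * reps$w2           # w_i = product over stages of w_i(j)
fit <- lm(Y ~ A1 * (psi + I(psi^2)), data = reps, weights = w)
```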
Cotton and Heagerty (2011) have proposed an approach that is closely related to the above algorithm, but rather than creating a replicate for each person-threshold pair, they propose generating m data sets in which patients are randomly assigned to one of the treatment regimes with which their data are compatible. Each of the m data sets is then analyzed as a single, complete data set in which regime membership is treated as known and unique. To date, no studies have been conducted to determine the relative performance of the two data-augmentation approaches to DTR MSM estimation.
[Figure: causal diagram relating the unmeasured factors U_0, U_1, treatments A_1, A_2, and states O_1, O_2, O_3.]
Following Bembom and Van der Laan (2007), O_1 was generated from a uniform (200, 800) distribution and O_j, j = 2, 3, were generated from a normal distribution with mean π_j and variance 10², where
π_j = O_{j−1} − 40 + 50 U_0 I[A_{j−1} = −1] + 60 U_1 I[A_{j−1} = 1] + 1.2 j I[j = 3] X,
where X is a measured risk factor that is uniformly distributed on the range (−6, 3).
Treatment in the first stage, A1 , was randomly allocated with equal probability given
to each option. In the second stage, treatment was again randomly allocated to all
individuals for whom S ≥ −50; all individuals with S < −50 switched treatments.
Denote the mean response under treatment rule “treat with A1 = a1 then switch
to A2 = −a1 if S < ψ ” by V (a1 ,ψ ) . The true values of V (a1 ,ψ ) and of the optimal
threshold were determined by Monte Carlo simulation. Bembom and Van der Laan
(2007) were followed in assuming a simplifying, quadratic form for the relationship
between the switching threshold, ψ , and the expected response. Figure 5.3 depicts
Fig. 5.3 The dependence of V^(a₁,ψ) on the threshold ψ: truth (thick black lines) and projections onto a quadratic function (thinner grey lines)
the true dependence of the mean responses V^(−1,ψ) and V^(1,ψ) on the treatment-switching threshold ψ, and the projection of these functions onto quadratic models over the range of potential thresholds (−30, . . . , 20). The true optimal rule is given by initial treatment A_1 = −1 followed by a switch to A_2 = 1 if the health indicator is not increased by at least 12; if the initial treatment is A_1 = 1, the optimal decision is to switch to treatment A_2 = −1 if the health indicator is not increased by at least 10. The projection of V^(a₁,ψ) onto quadratic models, however, yields slightly less aggressive treatment rules: if the initial treatment is A_1 = −1, switch to A_2 = 1 if the health indicator is not increased by at least 8, while for initial treatment A_1 = 1, switch to A_2 = −1 if the indicator is not increased by at least 10.
Fifty-two candidate dynamic treatment regimes are evaluated, indexed by initial treatment and the switching threshold ψ ∈ {−30, −28, . . . , 18, 20}, considering two quadratic (in ψ) mean models: Model 1, which omits the risk factor X, and Model 2, which includes it.
Results are presented in Table 5.1. Including the predictive variable in the response model leads to reduced mean squared error for the estimators of the parameters of the quadratic projection of the response onto the decision rule threshold. In terms of the decision rule itself, the median estimated optimal threshold over the 5,000 simulated data sets coincides for Models 1 and 2, and indeed the median values equal the values of the threshold that maximize the quadratic projection of the true dependence of the mean response onto ψ. However, the interval formed by taking the 2.5th and 97.5th percentiles of the distribution of thresholds is narrower for Model 2 than Model 1. For example, the interval formed over the simulated data sets for the optimal threshold for the regime (A_1 = −1, A_2 = 1, ψ) is (6, 10) if X is included in the response model, and (4, 14) otherwise.
Table 5.1 Threshold rules for a continuous response estimated via MSMs. Bias, Monte Carlo standard error (SE), and root mean squared error (rMSE) of parameters estimating the dependence of the response, V^(a₁,ψ), on the decision threshold ψ in a quadratic model. Model 1 omits the risk factor X from the response model; Model 2 does not. Summaries are based on 5,000 simulated data sets, for sample sizes n = 100, 250, 500, 1,000

                    Model 1                      Model 2
                    Bias (%)   SE†    rMSE†      Bias (%)   SE†    rMSE†
A1 = −1, A2 = 1
n = 100     ψ       13.29      8.64    8.78      13.23      5.91    6.11
            ψ²       9.16      0.42    0.43       9.42      0.31    0.32
n = 250     ψ       12.91      5.43    5.64      12.72      3.71    4.00
            ψ²       8.85      0.26    0.27       8.56      0.20    0.21
n = 500     ψ       13.29      3.75    4.05      13.50      2.57    3.01
            ψ²       8.67      0.18    0.19       8.64      0.14    0.15
n = 1,000   ψ       12.59      2.67    3.04      12.48      1.81    2.32
            ψ²       8.38      0.13    0.14       8.24      0.10    0.11
A1 = 1, A2 = −1
n = 100     ψ        6.64     12.44   12.54       7.11      9.26    9.42
            ψ²       0.38      0.56    0.56       0.59      0.41    0.41
n = 250     ψ        6.69      7.72    7.89       6.54      5.77    5.98
            ψ²       0.94      0.35    0.35       1.01      0.25    0.25
n = 500     ψ        7.02      5.41    5.67       6.75      4.06    4.38
            ψ²       0.51      0.24    0.24       0.77      0.18    0.18
n = 1,000   ψ        7.09      3.86    4.23       6.99      2.93    3.39
            ψ²       0.33      0.17    0.17       0.47      0.13    0.13
† Multiplied by 10²
(c) TA(ψ), ψ ∈ (Ψ \ 30): treat with the typical antipsychotic at baseline if the PANSS score is ψ or higher, then switch to an atypical antipsychotic when the PANSS score falls below ψ; if the PANSS score is below ψ at baseline, treat with an atypical antipsychotic for 12 months.
Note that if a replicate's baseline PANSS score is less than the threshold ψ* and the individual was assigned the typical antipsychotic at enrollment in the CATIE trial, this replicate is not deemed consistent with the regime TA(ψ*). Any replicate with a baseline PANSS score below the threshold ψ* is considered to follow the regime TA(ψ*) only if their initial assigned treatment in the CATIE study was an atypical medication.
2. Censor CATIE replicates in the augmented data set at the month that any of the three following events occur:
(a) An individual, and thus all corresponding replicates, is randomized to a drug not considered in the current analysis.
(b) An individual, and thus all corresponding replicates, progresses to the unrandomized, unblinded stage of the trial prior to month 12.
(c) A replicate, for which the corresponding individual is initially assigned the typical antipsychotic, is censored for no longer following their assigned dynamic treatment regime. That is, given a PANSS threshold ψ, replicates may stop following the regime for one of two reasons:
(i) Before choosing to switch off the typical antipsychotic, a replicate's PANSS score falls below the threshold ψ of their assigned regime.
(ii) At the visit that treatment is switched from the typical antipsychotic, the PANSS score is equal to or greater than ψ.
Note that censoring individuals for reasons (a) and (b) could occur in any analysis of the CATIE data depending, of course, on the scientific question of interest; we refer to this as off-study censoring. Censoring for reason (c) is specific to the data-augmented dynamic treatment regimes analysis, and we refer to this type as simply artificial censoring.
3. Estimate censoring models to ensure parameter estimates are not biased by any covariates that may be predictive of both censoring and the 12-month outcome. Estimate stabilized censorship weights using the baseline variables listed below and a spline on month of observation with knots at months 1, 2, . . . , 11 to ensure continuity at the knots. Specifically, the baseline variables were:
• years on prescription antipsychotic medication;
• a binary indicator of hospitalization in the 3 months prior to CATIE entry;
• factors of the categorical variables site type, sex, race, marital status, education, employment;
• PANSS score;
• body-mass index;
• alcohol and drug use;
• Calgary depression score;
• presence and severity of movement disorders;
• quality of life;
• physical and mental functioning;
• and the threshold, ψ .
All baseline covariates are included in the numerator of the stabilized weights. In addition to the baseline variables, the model for the denominator of the weights includes baseline treatment, current (time-varying) values of body mass index, alcohol and drug use, PANSS score, Calgary depression score, presence and severity of movement disorders, quality of life, physical and mental functioning, medication adherence, date of observation, and the previous month's treatment assignment. The baseline covariates are also included as linear terms in the final response model, which is estimated using the weighted, augmented data set. Censorship models are estimated at each month, as individuals may switch treatment at any month in the CATIE study. Since not all variables were collected at every month, we use the last scheduled value for those covariates that were not collected at a particular monthly visit. Following convention, all weights were truncated at 10 to avoid excess variability (Cain et al. 2010; Van der Laan and Petersen 2007a).
4. Perform a weighted linear regression with the weights constructed as in the previous step to obtain the coefficient estimates of the final response model, which includes indicators for the regimes AA and TA(ψ), linear and quadratic terms in the threshold ψ, and linear terms in {O_{1,k}}, the collection of baseline variables used in the numerator of the stabilized censorship weights.
Missing data are handled by multiple imputation (Shortreed et al. 2010), while confidence intervals are constructed using a bootstrap procedure (Shao and Sitter 1996). Results are summarized in Fig. 5.4, which shows the predicted mean 12-month PANSS scores for an individual who is Caucasian and unmarried, who graduated from college, had not been hospitalized in the 3 months prior to CATIE, was not employed at entry into the CATIE study, had spent 13 years on prescription anti-psychotic medications prior to CATIE, was recruited from a university clinic, had an average baseline PANSS score (75.58), was classified as moderately ill by the clinician global impression of illness severity index, had no drug or alcohol use as judged by the clinician, had no movement disorder symptoms at baseline as measured by any of the three movement disorder scales, and had average baseline values of body-mass index (29.8), Calgary Depression score (4.7), quality of life score (2.8), and mental and physical function as measured by the SF-12. The coefficient estimates (95 % CI) are β̂_1: 62.9 (50.9, 74.7); β̂_2: 60.8 (58.2, 73.0); β̂_3: 7.7 × 10⁻¹ (3.3 × 10⁻¹, 1.02); β̂_4: −6.0 × 10⁻³ (−8.6 × 10⁻³, −2.0 × 10⁻³). These results
suggest that the treatment regimes “always treat with a typical antipsychotic” and
“always treat with an atypical antipsychotic” are equivalent treatment strategies in
order to reduce 12-month symptoms, as there was no significant difference between
the predicted mean of these two regimes. As the threshold used for switching from the typical to an atypical antipsychotic is increased, the 12-month PANSS score increases. The statistically significant threshold, ψ, indicates that there is merit to tailoring within the TA(ψ) regime, and suggests that for most smaller values of ψ, reduced PANSS scores are observed at 12 months if initial therapy with the typical antipsychotic is continued rather than changing therapy depending on ψ.
Fig. 5.4 Predicted 12-month PANSS scores from the dynamic MSM for the regimes AA and TA(ψ). The horizontal axis indicates the threshold values for the TA(ψ) regime
It is first critical to note that the expected counterfactual outcome (i.e. value) under a treatment regime d can be expressed as a function of that regime and the contrast function C(H) = μ(H, 1) − μ(H, −1), where μ(H, A) is a model for the mean outcome as a function of the treatment A (coded as −1/1) and the covariate H:
E[Y(d)] = E[ μ(H, −1) + I[d(H) = 1] C(H) ],
which is maximized for d(H) = 2 · I[μ(H, 1) > μ(H, −1)] − 1. Thus, if an estimate of the contrast function C(H), say Ĉ(H), were available, the optimal regime could be found by taking
d̂^opt(H) = 2 · I[Ĉ(H) > 0] − 1.
We briefly note that in a single-stage case, an estimate of the contrast could be found by regression (Q-learning), where parameters in the mean outcome model μ(H, A; β) are estimated to give Ĉ_reg(H) = μ(H, 1; β̂) − μ(H, −1; β̂); or by G-estimation, where the contrast itself is modeled, yielding Ĉ_G(H); or indeed in any number of ways, including the augmented IPW approach of Zhang et al. (2012b). Using either of the first two of these approaches, an optimal regime could be found by recommending treatment whenever the estimated contrast exceeds 0. The disadvantage to this approach is that there is strong reliance on correct specification of the model for the contrast (or the mean outcome); as we observed in Sect. 3.5, incorrect specification of the model can lead to considerable bias and very poor coverage.
The classification approach, then, aims to separate the estimation of the optimal regime from the modeling of the contrast, to reduce the dependence of the optimal regime estimator on the specification of the contrast. Zhang et al. (2012a) and Rubin and van der Laan (2012) have independently shown that
d^opt = arg min_d E[ W · I[Z ≠ d(H)] ],
where W = |C(H)| and Z = sign(C(H)). That is, they show that the optimal treatment decision is the one that minimizes the distance between the rule, d(H), and the rule implied by the contrast, Z = sign(C(H)), where that distance is weighted by the relative importance of treating that individual, W = |C(H)|. That is, the goal is to minimize the error for the response Z using covariates H in the classification rule d. This can be accomplished using a host of different non-parametric classification methods (e.g. trees, as sketched below) and does not require a parametric form for the treatment regime.
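A minimal R sketch of the weighted classification step follows, assuming an estimated contrast vector C_hat and a data frame H of covariates are already available (both hypothetical); a classification tree is used purely as one convenient weighted classifier.

```r
## Weighted classification of the contrast-implied rule using a tree.
library(rpart)
dat <- data.frame(H, Z = factor(sign(C_hat)))  # Z: rule implied by the contrast
W   <- abs(C_hat)                              # importance weights |C(H)|
fit <- rpart(Z ~ ., data = dat, weights = W, method = "class")
## predict(fit, newdata, type = "class") gives the estimated decision d(h)
```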
The authors further note that the augmented IPW estimator introduced in Sect. 5.1 is a special case of this type of classification-based estimator. In simulation, Zhang et al. (2012a) showed that the classification-based estimator of the optimal DTR using the augmented IPW estimate of the contrast performed very well, even when the true form of the decision rule was not characterized by a tree. However, the simulation results also demonstrated that the classification approach that took the estimated contrast from a regression or via non-augmented IPW often exhibited the worst performance. It is perhaps not surprising that the quality of the estimated contrast can seriously affect the classification-based estimator, as the estimated contrast is used to define the response, or target classification: Z = sign(C(H)).
Zhao et al. (2012) developed a method based on the IPTW approach to identifying the optimal regime and termed it outcome weighted learning (OWL), in recognition of the machine learning flavor present in the approach. Clearly the expected outcome (value) under a treatment rule d is given by
V^d_IPTW = E[ I[A = d(H)] / {Aπ(H) + (1 − A)/2} · Y ],
where the treatment A is coded as −1/1 with π(H) = P(A = 1 | H). Note that the denominator reduces to the probability of being treated amongst those treated (A = 1), i.e. π(H), and the probability of not being treated amongst those who were not (A = −1), i.e., 1 − π(H). Thus the optimal rule is given by maximizing the empirical analogue of this quantity over d. Since d(H) can always be represented as sign(f(H)) for some suitable function f (exploiting the fact that A is coded −1/1), this is equivalent to finding
f̂^opt(H) = arg min_f P_n [ Y / {Aπ(H) + (1 − A)/2} · I[A ≠ sign(f(H))] ],
and then setting d̂^opt(H) = sign(f̂^opt(H)). In the machine learning literature, the objective function appearing on the right side of the above display is viewed as a weighted 0–1 loss, which is a non-smooth, non-convex function. It is well known that such a function is difficult to minimize directly. One common approach to address this difficulty is to consider convex surrogate loss functions instead of the original non-convex 0–1 loss (Zhang 2004). Most modern classification methods, as well as the classical logistic regression method, in effect minimize such a convex surrogate loss function; see Hastie et al. (2009, Sect. 10.6) for a vivid discussion. In particular, Zhao et al. (2012) employed the popular hinge loss function that is used in the context of support vector machines (Cortes and Vapnik 1995). In addition, Zhao et al. (2012) penalized the hinge loss for complexity in the estimated f; this is a common technique to avoid overfitting the data. Thus, following the classification literature, Zhao et al. (2012) replaced the original minimization problem by the following convex surrogate minimization problem:
problem by the following convex surrogate minimization problem:
Y
fˆopt (H) = argmin Pn (1 − A f (H))+ + λn|| f ||2 ,
f Aπ (H) + (1 − A)/2
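A minimal R sketch of the idea with a linear rule f(H) = b₀ + bᵀH follows. For simplicity a smooth logistic surrogate replaces the hinge loss used by Zhao et al. (2012), and positive outcomes Y > 0 are assumed so that the weights are non-negative; pi1 denotes the estimated P(A = 1 | H).

```r
## Outcome weighted learning with a linear rule and a logistic surrogate.
owl_fit <- function(H, A, Y, pi1, lambda = 0.1) {
  X <- as.matrix(H)
  W <- Y / ifelse(A == 1, pi1, 1 - pi1)   # outcome-based weights (needs Y > 0)
  obj <- function(b) {
    f <- b[1] + X %*% b[-1]
    mean(W * log1p(exp(-A * f))) + lambda * sum(b[-1]^2)
  }
  optim(rep(0, ncol(X) + 1), obj, method = "BFGS")$par
}
## estimated rule: treat with A = 1 when b0 + b'h > 0
```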
5.4 Assessing the Merit of an Estimated Regime

An interesting question that has not yet been properly addressed in the existing literature is how best to define the merit of the estimated optimal regime d̂, irrespective of the estimation procedure employed (e.g. Q-learning, G-estimation, MSM, etc.). As in any estimation procedure, one would tend to think of bias and variance as natural metrics. However, since regimes are functions, rather than real numbers or vectors, bias and variance have to be defined, if possible, in terms of their associated values (mean potential outcomes) rather than directly. First let us consider the notion of variance, since it is easy to conceptualize. Naturally, one can consider the variability in the value under the estimated regime, or use cross-validation (Zhang et al. 2012a; Rubin and van der Laan 2012). More precisely, we can write
Var(d̂) = E[ {V^d̂ − E(V^d̂)}² ],
where the expectation is over the distribution of the entire sample. Thus, the above variance represents the variability of the value of d̂ across different samples.
In the present context, bias is a more difficult concept. First, let V^opt = max_{d∈D} V^d be the optimal value function (i.e. the value of the optimal regime) within a pre-specified class of regimes D. Then the bias of the estimated regime d̂ can be defined as
Bias(d̂) = E(V^d̂) − V^opt,
where the above expectation is over the distribution of the entire sample. The bias represents how much the expected value of the estimated regime, averaged over the distribution of the sample, differs from the best possible value. One can combine the bias and variance criteria into a mean squared error (MSE) type criterion for the estimated regime:
MSE(d̂) = Bias²(d̂) + Var(d̂) = E[(V^d̂ − V^opt)²].
The MSE measures how “close” the estimated regime is to the truly optimal regime within the class under consideration, in a well-defined sense. It is not hard to imagine the existence of a bias-variance trade-off across different estimation procedures considered in this book; for example, the policy search or MSM-type methods considered in this chapter are likely to lead to less bias but more variance compared to Q-learning (which involves more parametric modeling).
A more traditional criterion for assessing the merit of an arbitrary (but fixed) regime, drawn from the reinforcement learning literature, is the generalization error. The generalization error of a fixed regime d at a state (e.g. baseline covariate) o_1 is defined as the difference between the optimal value function and the value function under the regime of interest d. Thus,
η^d(o_1) = V^opt(o_1) − V^d(o_1).
However, to assess the overall performance of a regime, one needs to combine the generalization errors over possible values of o_1. The traditional approach in reinforcement learning (Bertsekas and Tsitsiklis 1996) is to use the maximum generalization error, max_{o_1}(V^opt(o_1) − V^d(o_1)), which represents the worst-case scenario. Another option is to consider an average generalization error (Kearns et al. 2000; Kakade 2003; Murphy 2005b), defined as
η^d = ∫_{o_1} (V^opt(o_1) − V^d(o_1)) dP(o_1) = V^opt − V^d,
where the integral is with respect to the distribution of the initial state o_1.
For an estimated regime d̂, the MSE and the generalization error are related; it turns out that the MSE is the expected value of the squared average generalization error, i.e.,
MSE(d̂) = E[η²(d̂)],
where the above expectation is taken with respect to the distribution of the sample.
While the concept of generalization error is simple and intuitive, its computation for a given estimation procedure is usually quite complex. Murphy (2005b) derived finite-sample upper bounds on the generalization error of Q-learning. The results are quite technical in nature, and hence beyond the scope of this book. We are not aware of any work that has considered generalization errors of other estimation procedures presented in this book. The next question is that of formal inference, e.g. testing for a significant difference between candidate regimes, arising from different procedures, in terms of their values; we will briefly re-visit this in Sect. 8.10. It is not clear whether such testing must be done through values, or whether a more direct approach can be devised.
5.5 Discussion
Thus, while developed independently, the single-stage regime estimator of Zhang et al. (2012b) is in fact based on the same principles of estimation as the multi-stage (longitudinal) DTR estimators of Hernán et al. (2006), Van der Laan and Petersen (2007a), Robins et al. (2008), Orellana et al. (2010a), and Orellana et al. (2010b): in both cases, the treatment threshold parameters are the direct targets of estimation, and are obtained through estimating the value (population average marginal outcome) as a function of the decision parameters, with the optimal rule being chosen by the indexing parameter which maximizes the mean marginal outcome.
Marginal structural models are appealing due to the simplicity of implementation as well as their familiarity among statisticians and epidemiologists, who use these as a standard tool when estimating the impact of static treatment regimes in longitudinal data. Using MSMs, estimated via inverse probability weighting (augmented or not), allows the analyst to estimate the decision rule parameters directly.
All of the methods that we have considered in this chapter are suitable for non-randomized data; of course, they rely on the validity of a number of assumptions, some of which are untestable but can be assessed at least informally using model diagnostics (see Sect. 9.2) and substantive knowledge of the health condition under consideration.
Chapter 6
G-computation: Parametric Estimation of Optimal DTRs
The G-computation approach proceeds by writing the value of a regime as a sum, over treatment and covariate patterns, of terms involving E[∑_j Y_j | H_j = h_j, A_j = a_j], and then fitting a parametric model, say φ_j(h_j, a_j; θ_j), for the inside conditional expectation. Note that in a single-stage setting, the above expression simply gives
V^d = E{ E[Y | H = h, A = d(h)] }, which is estimated by P_n Ê[Y | A = d(h), H = h] = P_n φ(h, d(h); θ̂), where θ̂ is an estimator of θ. The resulting estimator is known as the G-computation formula (Robins 1986), and is given by
V̂^d_G = P_n [ ∑_{(h_j,a_j): 1≤j≤K} I[d_1(h_1) = a_1, . . . , d_K(h_K) = a_K] φ_j(h_j, a_j; θ̂_j) ], (6.1)
with the required conditional distributions estimated from the corresponding components of the likelihood, e.g. ∏_{j=1}^{K+1} f_j(o_j | h_{j−1}, a_{j−1}) in Eq. (3.1). Thus the key idea underlying
G-computation is to estimate the marginal mean (or distribution) of the outcome by first fitting models for the conditional means (or conditional likelihoods) of stage-specific, time-varying outcomes given history and action, and then to substitute values corresponding to specific treatment patterns into Eq. (6.1) (or the corresponding expression for the data likelihood). Note that in G-computation, a potentially greater part of the likelihood of the data is modeled (the states and responses), in contrast to some of the semi-parametric approaches of the previous two chapters, where efforts are focused on modeling the treatment allocation probabilities and the final outcome model. G-computation requires the assumption of no unmeasured confounding introduced in Chap. 2. See Robins and Hernán (2009) or Dawid and Didelez (2010) for a detailed exposition of G-computation.
G-computation has seen considerable use in the last decade. Thall et al. (2002)
considered G-computation to evaluate a phase II clinical trial of prostate cancer
treatment. Lavori and Dawson (2004) demonstrated (with R pseudocode) how to
evaluate two-stage data, motivated by the treatment for major depressive disorder
in the sequentially randomized STAR*D trial; see Chap. 2 for a brief description of
this trial. Bembom and Van der Laan (2007) demonstrated the use of G-computation
and compared results with marginal structural models (see Sect. 5.2) to examine the
optimal chemotherapy for the treatment of prostate cancer, choosing from among
four first-line treatments and the same four treatments offered as salvage therapy
(Thall et al. 2007b).
One of the most complex and realistic implementations of G-computation us-
ing epidemiological data was performed by Taubman et al. (2009), who used more
than 20 years of data from the Nurses’ Health Study to examine the effect of
composite lifestyle interventions on the risk of coronary heart disease. Similarly,
Young et al. (2011) analyzed data from a large multi-country cohort study of HIV+
individuals to determine when to initiate antiretroviral therapy as a function of CD4
cell count, and Westreich et al. (2012) used G-computation to evaluate the impact
of antiretroviral therapy on time to AIDS or death. The question was the same as
that investigated by Cain et al. (2010) using a marginal structural modeling ap-
proach (albeit with different data). G-computation has also been adopted in the
econometric literature (e.g. Abbring and Heckman 2007), where it has been used
to explore the effects of television-watching on performance in math and reading
(Huang and Lee 2010), and of spanking on behavior (Lee and Huang 2012).
Diggle et al. (2002) provided a simple expositional example on two stages where
all variables are binary; in such a case, it is simple to implement G-computation
non-parametrically (i.e. without using a parametric model for the conditional mean
or distribution). More recently, Daniel et al. (2013) demonstrated the use of G-
computation, as well as two semi-parametric approaches to estimating time-varying
treatment effects, using simulated data. In the tutorial, a small by-hand example of
a non-parametric implementation of G-computation is given as is a more complex
scenario which requires parametric modeling. The supplementary material in the
tutorial includes a worked example in which there is loss to follow-up, so that the
treatment of interest is redefined to be not simply treatment pattern ā, but rather
receipt of treatment pattern ā and continued observation. G-computation has been
implemented as a SAS macro (http://www.hsph.harvard.edu/causal/software/)
and as a Stata command (Daniel et al. 2011), facilitating dissemination and use of
the method.
There are two potentially serious drawbacks to G-computation. The first is that in
complex settings (many stages, or high dimensional intermediate observations), G-
computation typically requires an estimate of the distribution of each intermediate
outcome O j , given each possible history up to that time point. Using a Monte Carlo
In this example, consider two key stages (intervals): birth to 3 months, and
3–6 months of age. The “treatment” or exposure of interest for our analysis is
any breastfeeding measured in each of the stages. That is, A1 takes the value 1 if
the child was breastfed up to 3 months of age (and is set to −1 otherwise), and
A2 is the corresponding quantity for any breastfeeding from 3 to 6 months of age.
Note that any breastfeeding allows for exclusive breastfeeding or breastfeeding with
supplementation with formula or solid foods. The outcome, Y , is the vocabulary
subtest score on the WASI measured at age 6.5 years. A single tailoring variable
is considered at each stage: the birthweight of the infant at the first stage, and the
infant’s 3-month weight at the second stage.
Implementing G-computation to address the question of whether breastfeeding
itself produces higher vocabulary subtest scores requires models for both the vocab-
ulary subtest score, as well as for the 3-month weight. A linear model was used to fit
the vocabulary subtest score on the log-scale (Y ) as a function of baseline covariates
(intervention group status, geographical location (eastern/rural, eastern/urban, west-
ern/rural, or western/urban), mother’s education, mother’s smoking status, family
history of allergy, mother’s age, mother’s breastfeeding of any previous children,
whether birth was by cesarean section, gender) as well as birthweight, 3-month
weight, breastfeeding from 0 to 3 months (A1 ), breastfeeding from 3 to 6 months
(A2 ), and the first-order interactions (i) A1 × A2 , (ii) A1 by birthweight, and (iii)
A2 by 3-month weight. Note that O1 includes all baseline covariates and the tai-
loring variable birthweight, while O2 includes all variables in O1 in addition to
3-month weight. Three-month weight was also fit on the log scale using a linear
model that conditioned on the baseline covariates and birthweight, breastfeeding
from 0 to 3 months (A1 ), and the interaction between A1 and birthweight.
The G-computation procedure used can be described by the following steps, for
any regime of interest, d = (d1 (h1 ), d2 (h2 )):
1. Fit an appropriate joint distribution model for the baseline variables O1 . For
PROBIT, a non-parametric approach is adopted, and the empirical distribution
was used.
2. Fit an appropriate model to the intermediate variable, O2 , as a function of O1 and
A1 . For PROBIT, a linear model on the log-transformed 3-month weight is used.
3. Fit an appropriate model to the response, Y , as a function of O1 , A1 , O2 , and A2 .
For PROBIT, a linear model on the log-transformed subtest score is used.
4. Create a hypothetical population by drawing a random sample with replacement
from the distribution of baseline covariates fit in Step (1).
5. Using coefficient estimates and randomly sampled residuals from the model fit
in Step (2), determine the (counterfactual) intermediate variable o2 (d1 ) under
treatment regime d1 with history h1 = o1 .
6. Using coefficient estimates and randomly sampled residuals from the model fit in
Step (3), determine the response under treatment regime d with history h1 = o1 ,
h2 = (o1 , d1 (h1 ), o2 (d1 )).
Using this approach, we can compare distributions under different treatment
regimes, such as the static regimes “never breastfeed” or “breastfeed until at least
6 months of age”, or the dynamic regime “breastfeed until three months of age, then
continue only if 3-month weight exceeds 6.5 kg”. Note that in steps 5 and 6, one
could assume a likelihood for the potential outcomes, e.g. a normal distribution,
rather than the less parametric approach of selecting a random residual.
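For concreteness, the following R sketch implements Steps 1–6 above for a stripped-down version of this example in which birthweight is the only baseline covariate; the data frame dat and its column names (bw, a1, a2, logwt3, logvocab) are hypothetical placeholders, not the code used for the published PROBIT analysis.

  # Two-stage G-computation following Steps 1-6 above.
  # dat columns: bw (birthweight), a1, a2 (breastfeeding indicators, -1/1),
  # logwt3 (log 3-month weight), logvocab (log vocabulary subtest score).
  gcomp <- function(dat, d1, d2, B = 10000) {
    fit.o2 <- lm(logwt3 ~ bw * a1, data = dat)                  # Step 2
    fit.y  <- lm(logvocab ~ bw + logwt3 + a1 * a2 + a1:bw + a2:logwt3,
                 data = dat)                                    # Step 3
    sim <- data.frame(bw = sample(dat$bw, B, replace = TRUE))   # Steps 1 and 4
    sim$a1 <- d1(sim$bw)                                        # Step 5
    sim$logwt3 <- predict(fit.o2, newdata = sim) +
      sample(resid(fit.o2), B, replace = TRUE)
    sim$a2 <- d2(sim$logwt3)                                    # Step 6
    exp(predict(fit.y, newdata = sim) +
          sample(resid(fit.y), B, replace = TRUE))              # counterfactual scores
  }
  # e.g. "breastfeed to 3 months, then continue only if 3-month weight > 6.5 kg":
  # y.dtr <- gcomp(dat, d1 = function(bw) rep(1, length(bw)),
  #                d2 = function(lw) ifelse(lw > log(6.5), 1, -1))

Averaging the draws returned for each regime of interest yields the counterfactual means compared across regimes, as in Fig. 6.1.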
Table 6.1 Coefficient estimates from linear regression models for the log-transformed vocabu-
lary subtest score of the WASI and the log-transformed 3-month weight
Vocab. score Weight at 3 months
Est. SD Est. SD
Intercept 4.315 0.047 1.312 0.020
Intervention 0.071 0.035 0.012 0.006
East Belarus (rural) 0.034 0.048 −0.015 0.008
West Belarus (urban) 0.008 0.053 −0.011 0.008
West Belarus (rural) −0.002 0.044 −0.016 0.007
Attended some university 0.047 0.003 0.008 0.002
Completed university 0.099 0.004 0.009 0.003
Smoker −0.008 0.008 0.002 0.005
Allergy 0.023 0.006 −0.003 0.004
Age 0.009 0.002 −0.001 0.001
Age2 0.000 0.000 0.000 0.000
BF previously −0.049 0.003 0.001 0.002
Did not BF −0.042 0.003 −0.002 0.002
Cesarean 0.000 0.004 −0.005 0.002
Gender −0.011 0.002 0.045 0.001
Birthweight 0.010 0.004 0.139 0.002
A1 : Breastfed 0–3 months 0.048 0.019 0.008 0.012
Weight at 3 months 0.017 0.002
A2 : Breastfed 3–6 months −0.121 0.057
A1 ×Birthweight −0.012 0.005 0.001 0.004
A2 ×Weight at 3 months 0.016 0.007
A1 × A2 0.019 0.037
Results from regression models which account for within-hospital clustering are
presented in Table 6.1; coefficient estimates from models which ignored clustering
are very similar. Statistically significant effects of breastfeeding and its interaction
with weight are found in the model for the log vocabulary score. However, when
these models are subsequently used to produce samples from the counterfactual
distribution of outcomes, it is evident that the impact of breastfeeding itself on the
Fig. 6.1 Counterfactual vocabulary subtest score under three different breastfeeding regimes esti-
mated by G-computation: a DTR (gray, solid line), no breastfeeding (dashed line) and breastfeed-
ing until at least 6 months (dotted line)
vocabulary subtest score is minimal (see Fig. 6.1), with mean test scores varying by
less than one point under the three regimes considered. These results are broadly in
line with the findings of Moodie et al. (2012).
6.2 Bayesian Estimation of DTRs

A detailed presentation of the many modeling choices required for any particular
application of Bayesian estimation of a dynamic treatment regime is beyond the
scope of this text; however, a great number of resources are available to the interested
reader (see, e.g. Chen et al. 2010).
averaging over random draws from the space of possible models, so that inference
is based on results from the averaged model. They argued that this is a distinct
advantage of a Bayesian approach over frequentist methods (semi-parametric or
otherwise), as it allows the analyst to incorporate uncertainty regarding model spec-
ification into the estimation procedure. As in the frequentist approaches, Bayesian
estimation of optimal dynamic treatment regimes may be computationally bur-
densome in complex settings with many covariates and/or stages, although some
advances have been made. For example, Wathen and Thall (2008) adapted the
forward-sampling approach of Carlin et al. (1998) so as to be able to sample from
the predictive distribution of the outcome under each of several regimes, where the
distribution is estimated from the observed data; however, in this case the “regimes”
of interest were stopping rules for group sequential clinical trials.
Arjas and Saarela (2010) considered data on HIV treatment from the Multi-
Center AIDS Cohort (Kaslow et al. 1987), focusing on a two-stage setting in which
there is a single (continuous) tailoring variable at each stage, treatment is binary,
and the outcome is a continuous variable. They postulated appropriate prior dis-
tributions for each component of the joint likelihood, and thus obtained the joint
posterior distribution. Following this, the posterior predictive distribution was used
to see how the outcomes of individuals drawn from the same population as those
who formed the sample data were distributed under different treatment patterns.
This approach uses the principles set forth by Arjas and Parner (2004), who sug-
gested using summaries of the posterior predictive distributions as the main crite-
rion for comparing different treatment regimes, leading to what they refer to as an
“integrated causal analysis” in which the same probabilistic framework is used for
inference about model parameters as well as for treatment comparisons and hence
the choice of an optimal regime.
Lavori and Dawson (2000) used multiple imputations performed by an
approximate Bayesian bootstrap (Rubin and Shenker 1991) to generate draws
from the counterfactual distributions, and thereby allow a non-parametric means
of comparing mean outcomes under different treatment strategies. Zajonc (2012)
proposed a similar approach, though from a more overtly Bayesian perspective, and
considered data from the North Carolina Education Research Data Center, examining
the impact of honors programs on tenth grade mathematics test scores. Two stages
with a binary exposure were considered; several baseline confounders, and two
time-dependent variables were used in the analysis. Tailoring of the decision rule
was performed in a variety of ways including using the single, continuous math-
ematics score at the start of each stage as well as by creating an index score that
was a composite of five variables including sex, race, and test score. The approach
was the same in spirit as that of Arjas and Saarela (2010); however, the motivation
was somewhat different. Like Lavori and Dawson (2000), Zajonc (2012) framed
the estimation problem as one of missing data, where the missing information is on
the potential outcomes, and undertook estimation through what is effectively a
multiple imputation approach.
to estimate the posterior predictive distribution of the potential outcomes, and the
optimal regime was selected as that which maximized the expected posterior utility,
where the utility was simply some analyst-defined function of the outcome and
potentially other factors such as treatment cost.
The Bayesian posterior predictive approach to dynamic treatment regime estima-
tion is in many ways similar to G-computation, but is more readily able to capture
three primary sources of variability in estimators: (i) randomness in covariates and
outcomes as specified by the predictive distribution for the outcome given data, (ii)
potential randomness in the regime (if, for example, the DTR had a stochastic com-
ponent such as “treat within three months of the occurrence of a particular health-
related event”); and (iii) variability in the unknown model parameters (Arjas 2012).
There have also been a number of applications of Bayesian predictive inference to
examine questions of causation for non-continuous outcomes, many by Elja Arjas
and colleagues. For example, Arjas and Andreev (2000) used a Bayesian nonpara-
metric intensity model for recurrent events to study the impact of child-care setting
on the number of ear infections.
We now return to the PROBIT trial, and re-analyze the data using a Bayesian
predictive approach.
A Bayesian G-computation procedure is designed to complement and compare
with the analysis performed in Sect. 6.1.2. A variety of summary measures of the
Fig. 6.2 Counterfactual vocabulary subtest score under two different breastfeeding regimes esti-
mated by a Bayesian implementation of G-computation: no breastfeeding (dashed line) and breast-
feeding until at least 6 months (dotted line)
Fig. 6.3 Distribution of the mean counterfactual vocabulary subtest score under two breastfeeding
regimes estimated by a Bayesian implementation of G-computation: no breastfeeding (left) and
breastfeeding until at least 6 months (right)
6.3 Discussion

Chapter 7
Estimation of DTRs for Alternative Outcome Types

Up to this point, our development has focused entirely on the continuous outcome
setting. In this chapter, we will turn our attention to the developments that have
been made for estimating DTRs for more difficult outcome types including multi-
component rewards, time-to-event data, and discrete outcomes. As we shall see,
the range of approaches considered in previous chapters have been employed, but
additional care and thought must be devoted to appropriately handling additional
complexities in these settings.
$$Q_j^{opt}(H_j, A_j) = \delta\big(\beta_{j1}^T H_{j0} + \psi_{j1}^T H_{j1} A_j\big) + (1-\delta)\big(\beta_{j2}^T H_{j0} + \psi_{j2}^T H_{j1} A_j\big).$$
7.2 Estimating DTRs for Time-to-Event Outcomes with Q-learning

While much of the DTR literature has focused on continuous outcomes, research
and analyses have been conducted for time-to-event data as well. Here, we briefly
review some key developments.
Huang and Ning (2012) used linear regression to fit accelerated failure time (AFT)
models (Cox and Oaks 1984) in a Q-learning framework to estimate the optimal
DTR in a time-to-event setting. Consider a two-stage setting, where patients may
receive treatment in at least one and possibly two stages of a study. That is, all
patients are exposed to some level of the treatment (where we include a control
condition as a possible level of treatment) at the first stage. After the first stage
of treatment, one of three possibilities may occur to a study participant: (1) the
individual is cured by the treatment and does not require further treatment; (2) the
individual experiences the outcome event; or (3) the individual requires a second
stage of treatment, e.g. because of disease recurrence. Let Y denote the total follow-
up time for an individual. If the individual is cured, he is followed until the end
of the study and then censored so that Y is the time from the start of treatment to
the censoring time; if he experiences the outcome event, Y is the time at which the
event occurs. Further, let R denote the time from the initial treatment to the start
of the second stage treatment (assuming this to be the same as the time of disease
recurrence), and let S denote the time from the start of the second stage treatment
until end of follow-up (due to experiencing the event or the end of the study); then
Y = R + S. Set S = 0 for those individuals who did not experience a second stage of
treatment.
First, let us assume that there is no censoring. Then an AFT Q-learning algorithm
for time-to-event outcomes proceeds much like that for continuous outcomes:

1. Stage 2 parameter estimation: Using OLS, find estimates $(\hat\beta_2, \hat\psi_2)$ of the conditional mean model $Q_2^{opt}(H_{2i}, A_{2i}; \beta_2, \psi_2)$ of the log-transformed time of follow-up from the start of the second stage, $\log(S_i)$, for those who experienced a second stage treatment.
2. Stage 2 optimal rule: By substitution, $\hat{d}_2^{opt}(h_2) = \arg\max_{a_2} Q_2^{opt}(h_2, a_2; \hat\beta_2, \hat\psi_2)$.
3. Stage 1 pseudo-outcome: Set $S_i^* = \max_{a_2} \exp\big(Q_2^{opt}(H_{2i}, a_2; \hat\beta_2, \hat\psi_2)\big)$, $i = 1, \ldots, n$, which can be viewed as the time to event that would be expected under optimal second-stage treatment. Then calculate the pseudo-outcome,
$$\hat{Y}_{1i} = \begin{cases} Y_i & \text{if } S_i = 0 \\ R_i + S_i^* & \text{if } S_i > 0, \end{cases} \qquad i = 1, \ldots, n.$$
4. Stage 1 parameter estimation: Using OLS, compute
$$(\hat\beta_1, \hat\psi_1) = \arg\min_{\beta_1, \psi_1} \frac{1}{n}\sum_{i=1}^n \Big(\log(\hat{Y}_{1i}) - Q_1^{opt}(H_{1i}, A_{1i}; \beta_1, \psi_1)\Big)^2.$$
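A minimal R sketch of this uncensored algorithm, assuming a single tailoring variable and linear Q-functions at each stage, is given below; the data frame df and its columns (h1, h2, a1, a2 coded -1/1, and times R, S, Y as defined above) are hypothetical names chosen to match the notation.

  # AFT Q-learning without censoring, following steps 1-4 above.
  s2 <- subset(df, S > 0)                         # those with a second stage
  fit2 <- lm(log(S) ~ h2 * a2, data = s2)         # step 1
  q2 <- function(h2, a2) {                        # fitted Q2 on the log scale
    predict(fit2, newdata = data.frame(h2 = h2, a2 = a2))
  }
  # steps 2-3: optimal second-stage rule and pseudo-outcome
  s.star <- pmax(exp(q2(s2$h2, 1)), exp(q2(s2$h2, -1)))
  df$Y1 <- df$Y
  df$Y1[df$S > 0] <- s2$R + s.star
  fit1 <- lm(log(Y1) ~ h1 * a1, data = df)        # step 4
  # estimated first-stage rule: choose a1 = 1 when its predicted log
  # time-to-event exceeds that under a1 = -1
  d1.hat <- function(h1) {
    ifelse(predict(fit1, data.frame(h1 = h1, a1 = 1)) >
           predict(fit1, data.frame(h1 = h1, a1 = -1)), 1, -1)
  }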
In Q-learning, the Q-functions need not always be modeled by linear models. In the
RL literature, Q-functions have been modeled via regression trees or more sophisti-
cated variations like random forests and extremely randomized trees (Ernst et al.
2005; Geurts et al. 2006; Guez et al. 2008) or via kernel-based regression (Or-
moneit and Sen 2002). More recently in the DTR literature, Zhao et al. (2011) em-
ployed support vector regression (SVR) to model the Q-functions in the context
of modeling survival time in a cancer clinical trial. These modern methods from the
machine learning literature are often appealing due to their robustness and flexibility
in estimating the Q-functions. Following Zhao et al. (2011), here we briefly present
the SVR method to fit Q-functions.
Stepping outside the RL framework for a moment, consider a regression problem
with the vector of predictors $x \in \mathbb{R}^m$ and the outcome $y \in \mathbb{R}$. Given the data
$\{(x_i, y_i)\}_{i=1}^n$, the goal in SVR is to find a (regression) function $f: \mathbb{R}^m \to \mathbb{R}$ that
closely matches the target $y_i$ for the corresponding $x_i$. One of the popular loss
functions is the so-called $\varepsilon$-insensitive loss function (Vapnik 1995), defined as
$L(f(x_i), y_i) = (|f(x_i) - y_i| - \varepsilon)_+$, where $\varepsilon > 0$ and $u_+$ denotes the positive part of $u$.
The ε -insensitive loss function ignores errors of size less than ε and grows linearly
beyond that. Conceptually, this property is similar to that of the robust regression
methods (Huber 1964); see Hastie et al. (2009, p. 435) for more details on this sim-
ilarity, including a graphical representation.
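The ε-insensitive loss is trivial to compute directly; the small R helper below (an illustrative function of our own, not from any package) makes the flat region explicit.

  # epsilon-insensitive loss: zero inside a tube of half-width eps,
  # growing linearly outside it
  eps.loss <- function(pred, y, eps = 0.1) pmax(abs(pred - y) - eps, 0)
  eps.loss(pred = c(1.0, 1.3, 2.5), y = c(1.05, 1.0, 1.0))  # 0.0 0.2 1.4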
In SVR, typically the regression function f (·) is assumed to take the form f (x) =
θ0 + θ T Φ (x), where Φ (x) is a vector of non-linear basis functions (or, features)
of the original predictor vector x. Thus, while the regression function employs a
linear model involving the transformed features Φ (x), it can potentially become
highly non-linear in the original predictor space, thereby allowing great flexibility
and predictive power. It turns out that the problem of solving for unknown f is a
convex optimization problem, and can be solved by quadratic programming using
Lagrange multipliers (see, for example, Hastie et al. 2009, Chap. 12).
In the context of dynamic treatment regimes, the outcome of interest y (e.g. sur-
vival time from cancer) is often censored. The presence of censoring makes matters
more complicated and the SVR procedure as outlined above cannot be used without
modification. Shivaswamy et al. (2007) considered a version of SVR, without the
ε -insensitive property, to take into account censored outcomes. Building on their
work, Zhao et al. (2011) developed a procedure called ε -SVR-C (where C denotes
censoring) that can account for censored outcomes and has the ε -insensitive prop-
erty. Below we briefly present their procedure.
In general, we denote interval-censored survival (more generally, time-to-event)
data by $\{(x_i, l_i, u_i)\}_{i=1}^n$, where $l$ and $u$ stand for the lower and upper bounds of the
interval under consideration. If a patient experiences the death event, then the cor-
responding observation is denoted by $(x_i, y_i)$ with $l_i = u_i = y_i$. Also, letting
$u_i = +\infty$, one can easily construct a right-censored observation $(x_i, l_i, +\infty)$. Given
the interval-censored data, consider the following loss function:
The shape of the loss function for both interval-censored data and right-censored
data is displayed in Fig. 7.1.
Fig. 7.1 ε -SVR-C loss functions for: (a) interval-censored data (left panel), and (b) right-censored
data (right panel)
Defining the index sets $L = \{i : l_i > -\infty\}$ and $U = \{i : u_i < +\infty\}$, the $\varepsilon$-SVR-C
optimization formulation is:

$$\min_{\theta, \theta_0, \xi, \xi^*} \;\frac{1}{2}\|\theta\|^2 + C_E\Big(\sum_{i\in L}\xi_i + \sum_{i\in U}\xi_i^*\Big), \quad \text{subject to}$$
$$(\theta_0 + \theta^T \Phi(x_i)) - u_i \le \varepsilon + \xi_i^*, \; i \in U;$$
$$l_i - (\theta_0 + \theta^T \Phi(x_i)) \le \varepsilon + \xi_i, \; i \in L;$$
$$\xi_i \ge 0, \; i \in L; \qquad \xi_i^* \ge 0, \; i \in U.$$
In the above display, $\xi_i$ and $\xi_i^*$ are the so-called slack variables and $C_E$ is the cost
of error. By minimizing the regularization term $\frac{1}{2}\|\theta\|^2$ as well as the training error
$C_E\big(\sum_{i\in L}\xi_i + \sum_{i\in U}\xi_i^*\big)$, the $\varepsilon$-SVR-C algorithm can avoid both overfitting and un-
derfitting of the training data.
Interestingly, the solution depends on the basis function Φ only through inner
products Φ (xi )T Φ (x j ), ∀i, j. In fact, one need not explicitly specify the basis func-
tion Φ ; it is enough to specify the kernel function K(xi , x j ) = Φ (xi )T Φ (x j ). One
popular choice of K used by Zhao et al. (2011) is the Gaussian (or radial basis) ker-
nel, given by K(xi , x j ) = exp(−γ ||xi − x j ||2 ). Thus the above optimization problem
is equivalent to the following dual problem:
1
min (λ − λ )T K(xi , x j )(λ − λ ) − ∑ (li − ε )λi + ∑ (ui + ε )λi ,
λ ,λ 2 i∈L i∈U
subject to
∑ λi − ∑ λi = 0, 0 ≤ λi , λi ≤ CE , i = 1, . . . , n.
i∈L i∈U
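Note that the dual depends on the data only through the kernel matrix. A minimal R helper for the Gaussian kernel used by Zhao et al. (2011) might be written as follows (the function name is ours); the matrix it returns is the K appearing in the dual objective.

  # Gaussian (radial basis) kernel matrix: K[i, j] = exp(-gamma * ||x_i - x_j||^2)
  gauss.kernel <- function(X, gamma = 1) {
    exp(-gamma * as.matrix(dist(X))^2)   # dist() gives pairwise Euclidean distances
  }
  K <- gauss.kernel(matrix(rnorm(20), ncol = 2), gamma = 0.5)  # 10 x 10 example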
Fig. 7.2 Treatment plan and therapy options for advanced non-small cell lung cancer in a hypo-
thetical SMART design
In the case of known Q-functions, the optimal DTR $(d_1^{opt}, d_2^{opt})$, using a backwards
induction argument, would be
When the Q-functions are unknown, they are estimated using suitable models. In the
present development, censored outcomes (T1 ∧C, δ1 = I[T1 ≤ C]) and (T2 ∧C2 , δ2 =
I[T2 ≤ C2 ]) are used at both stages. The exact algorithm to perform Q-learning with
ε -SVR-C for censored survival data is as follows:
1. For those individuals with $Y_D = 1$ (i.e. those who actually go on to the second stage of treatment), perform right-censored regression using $\varepsilon$-SVR-C of the censored outcome $(T_2 \wedge C_2, \delta_2)$ on the stage-2 variables $(H_2, (A_2, T_M))$ to obtain $\hat{Q}_2^{opt}$.
2. Construct the pseudo-outcome.
3. In fitting $Q_1^{opt}$, the pseudo-outcome $\hat{T}_D$ is assessed through the censored observation $(\tilde{X}, \tilde\delta)$, with $\tilde{X} = T_1 \wedge C + Y_D \hat{T}_2 = \hat{T}_D \wedge \tilde{C}$ and $\tilde\delta = I[\hat{T}_D \le \tilde{C}]$, where $\tilde{C} = C\, I[C < t_2] + \infty\, I[C_2 \ge t_2]$. Perform $\varepsilon$-SVR-C of $(\tilde{X}, \tilde\delta)$ on $(H_1, A_1)$ to obtain $\hat{Q}_1^{opt}$.
Once the Q-functions are fitted, the estimated optimal DTR is given by (dˆ1opt , dˆ2opt ),
where the stage-specific optimal rules are given by
In the ε -SVR-C steps of the Q-learning algorithm, the tuning parameters CE and
γ are chosen via cross validation over a grid of values. Zhao et al. (2011) reported
robustness of the procedure to relatively small values of ε ; they set its value at 0.1
in their simulation study.
Zhao et al. (2011) evaluated the above method of estimating the optimal DTR
with survival-type outcome in an extensive simulation study. In short, they consid-
ered a generative model, the parameters of which could be easily tweaked to reflect
four different clinical scenarios resulting in four different optimal regimes. They
generated data on 100 virtual patients from each of the 4 clinical scenarios, thus a
total of 400 virtual patients. Then the optimal regime was estimated via Q-learning
with ε -SVR-C. For evaluation purposes, an independent test sample of size 100 per
clinical scenario (hence totaling 400) was also generated. Outcomes (overall sur-
vival) for these virtual test patients were evaluated for the estimated optimal regime
as well as all possible (12) fixed regimes, using the generative model. Furthermore,
they repeated the simulations ten times for the training sample (each of size 400).
Then ten different estimated optimal regimes from these ten training samples were
applied to the same test sample (of size 400) mentioned earlier. All the results for
each of the 13 treatment regimes (12 fixed, plus the estimated optimal) were aver-
aged over the 400 test patients. It was found that the true overall survival was sub-
stantially higher for the estimated optimal regime than any of the 12 fixed regimes.
They also conducted additional simulations to check the sensitivity of the procedure
to the sample size. It was found that for sample sizes ≥100, the procedure is very
reliable in selecting the optimal regime.
7.3 Q-learning of DTRs for Discrete Outcomes

Moodie et al. (2013) recently tackled the challenging problem of Q-learning for
discrete-valued outcomes, and took a less parametric approach to modeling the Q-
functions by using generalized additive models (GAMs). Generalized additive mod-
els provide a user-friendly means of introducing greater flexibility in modeling the
relationship between an outcome and covariates. GAMs are treated as penalized
regression splines with different smoothing parameters allowed for each covariate,
where the degree of smoothing is selected by generalized cross-validation (Wood
2006, 2011). The automatic parsimony that the approach ensures helps to control
the dimensionality of the estimation problem, an important feature in the DTR set-
ting where the covariate space is potentially very large.
Suppose we are in a setting where the outcome at the final stage is discrete,
and there are no intermediate rewards. The outcome could represent, for instance,
a simple indicator of success such as maintenance of viral load below a given
threshold over the course of a study (a binary outcome), or the number of emer-
gency room visits in a given period (a count, possibly Poisson-distributed). When
the outcome Y is discrete, the Q-learning procedure must be adapted to respect
the constraints on the outcome, for example, Y is bounded in [0, 1], or Y is
non-negative. By definition, in a two-stage setting, we have $Q_2^{opt}(H_2, A_2) = E[Y \mid H_2, A_2]$ at the final interval. A reasonable modeling choice would be to consider a generalized linear model (GLM). For instance, for a Bernoulli utility, we might choose a logistic model of the form $E[Y \mid H_2, A_2] = \mathrm{expit}\big(\beta_j^T H_{j0} + (\psi_j^T H_{j1}) A_j\big)$, which is bounded by [0, 1]. As in the continuous utility setting, the optimal regime at the first interval is defined by

since the logit function is strictly increasing. We may therefore model the logit of
$Q_1^{opt}(H_1, A_1; \beta_1, \psi_1)$ rather than the Q-function itself to determine the optimal DTR.
The Q-learning algorithm for a discrete outcome consists of the following steps:

1. Interval 2 parameter estimation: Using GLM regression with a strictly increasing link function, $f(\cdot)$, find estimates $(\hat\beta_2, \hat\psi_2)$ of the conditional mean model for the outcome $Y$, $Q_2^{opt}(H_{2i}, A_{2i}; \beta_2, \psi_2)$.
2. Interval 2 optimal rule: Set $\hat{d}_2^{opt}(h_2) = \arg\max_{a_2} Q_2^{opt}(h_2, a_2; \hat\beta_2, \hat\psi_2)$.
3. Interval 1 pseudo-outcome: Set
The estimated optimal DTR using Q-learning is given by (dˆ1 , dˆ2 ). In a binary
outcome scenario, note that unlike in the continuous utility setting, the pseudo-
outcome, Ỹ1i , does not represent the (expected) value of the second-interval Q-
function under the optimal treatment but rather a transformation of that expected
outcome.
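A minimal R sketch of steps 1–3 for a binary outcome with a logit link is given below; the data frame df and its columns (o1, o2, a1, a2 coded -1/1, and y in {0, 1}) are hypothetical, and the first-interval regression of the logit-scale pseudo-outcome is shown as an ordinary linear fit purely for illustration.

  # Q-learning with a binary outcome (logit link), steps 1-3 above.
  fit2 <- glm(y ~ o1 + o2 + a2 + a2:o2, family = binomial, data = df)  # step 1
  lp2 <- function(a2.val) {            # linear predictor with a2 set to a2.val
    newd <- df
    newd$a2 <- a2.val
    predict(fit2, newdata = newd, type = "link")
  }
  # step 2: since expit is strictly increasing, the optimal second-interval
  # rule maximizes the linear predictor
  d2.hat <- ifelse(lp2(1) > lp2(-1), 1, -1)
  # step 3: pseudo-outcome on the logit scale, a transformation of the
  # expected outcome under optimal second-interval treatment
  df$y1.tilde <- pmax(lp2(1), lp2(-1))
  fit1 <- lm(y1.tilde ~ o1 * a1, data = df)       # interval 1 regression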
We briefly consider a simulation study. The data for treatments (A1 , A2 ), and
covariates (C1 , O1 ,C2 , O2 ) were generated as in Sect. 3.5. We considered three out-
come distributions: normal, Bernoulli, and Poisson, and two forms of the relation-
ship between the outcome and the variables C1 and C2 . The first setting corresponds
to Scenario C of Sect. 3.5 (normal outcome, Q-functions linear in covariates); the
second varies only in that quadratic terms for C1 and C2 are included in the mean
model. Similarly, settings three and four correspond to a Bernoulli outcome with Q-
functions that are, respectively, linear and quadratic in C1 and C2 , and the final pair
of settings correspond to a Poisson outcome with Q-functions that are, respectively,
linear and quadratic in the covariates. Results are presented in Table 7.1.
Overall, we observe very good performance of both the linear (correct) speci-
fication and the GAM specification of the Q-function when the true confounder-
outcome relationship is linear: estimators are unbiased, and the use of the GAM for
the Q-function exhibits reasonably low variability even for the smaller sample size of
250. In fact, the variability of the estimator resulting from a GAM for the Q-function
is as low as that of the linear model-based estimator for the normal and Poisson
outcomes, implying there is little cost for the additional flexibility in these cases.
When the dependence of the utility on the confounding variables is quadratic, only
the decision rule parameters resulting from a GAM for the Q-function exhibit little
or no bias and good coverage rates.
Thus, it appears that Moodie et al. (2013) have taken a modest but promising step
on the path to a more fully generalized Q-learning algorithm, with the consideration
of a flexible, spline-based modeling approach for discrete outcomes. The next step
of adapting Q-learning to allow discrete interval-specific outcomes is challenging,
and remains an open problem.
7.4 IPW for Censored and Discrete Outcomes

Some of the seminal work in developing MSMs for DTR estimation was performed
in a survival context, using inverse probability weighting combined with pooled
logistic regression to approximate a Cox model for the estimation of the hazard ra-
tio parameters (Hernán et al. 2006; Robins et al. 2008). The methods are gaining
popularity in straightforward applications examining, for example, when to initi-
ate dialysis (Sjölander et al. 2011) or antiretroviral therapy (Shepherd et al. 2010).
These methods require little adaptation to the algorithm described in Sect. 5.2.2: as
with continuous outcomes, data-augmentation is undertaken to create replicates of
individuals that are compatible with each regime of interest. The only step that dif-
Table 7.1 Comparison of the performance of Q-learning for normal, Bernoulli, and Poisson
outcomes when the true Q-function is either linear or quadratic in the covariates: bias, Monte
Carlo variance (MC var), Mean Squared Error (MSE) and coverage of 95 % bootstrap confidence
intervals (Cover) of the first interval decision rule parameter ψ10 . Bias, variance, and MSE are each
multiplied by 10.
Adjustment n = 250 n = 1,000
method Bias MC var MSE Cover Bias MC var MSE Cover
Normal outcome, Q-functions linear in covariates
None 10.03 0.35 10.41 0.0 10.12 0.09 10.32 0.0
Linear 0.02 0.08 0.08 94.1 0.00 0.02 0.02 93.0
GAM 0.02 0.08 0.08 94.4 0.00 0.02 0.02 93.6
Normal outcome, Q-functions quadratic in covariates
None 18.18 16.30 49.35 68.1 18.92 4.31 40.11 10.8
Linear 29.64 20.53 108.38 37.9 31.42 4.72 103.46 0.1
GAM 0.21 1.49 1.50 95.2 −0.11 0.40 0.40 92.7
Bernoulli outcome, Q-functions linear in covariates
None 8.65 1.57 8.97 13.7 8.45 0.19 7.32 0.0
Linear 0.20 1.98 1.98 94.9 0.00 0.28 0.28 95.1
GAM 0.81 4.25 4.25 97.2 0.00 0.28 0.28 95.8
Bernoulli outcome, Q-functions quadratic in covariates
None 3.77 0.65 2.07 64.8 3.71 0.15 1.53 10.8
Linear 1.54 0.87 1.11 92.5 1.56 0.20 0.44 79.7
GAM 0.06 2.63 2.63 97.2 −0.11 0.32 0.32 97.0
Poisson outcome, Q-functions linear in covariates
None 8.97 0.70 8.74 5.6 9.49 0.23 9.23 0.0
Linear 0.14 0.11 0.11 93.9 0.14 0.02 0.03 93.8
GAM 0.13 0.11 0.11 95.7 0.14 0.02 0.03 94.5
Poisson outcome, Q-functions quadratic in covariates
None 4.39 0.19 2.12 15.4 4.32 0.04 1.91 0.0
Linear −1.01 0.27 0.38 90.1 −1.06 0.07 0.19 72.6
GAM 0.00 0.28 0.28 96.7 0.14 0.64 0.65 94.6
fers is the outcome regression model, which is adapted to the outcome type, using,
for example, a weighted Cox model or a weighted pooled logistic regression rather
than weighted linear regression.
A separate but closely related body of work has focused on survival data
primarily in two-phase cancer trials. In the trials which motivated the statistical
developments, cancer patients were randomly assigned to one of several initial ther-
apies and, if the initial treatments successfully induced remission, the patient was
randomized to one of several maintenance therapies. A wide collection of methods
have been developed in this framework, including weighted Kaplan-Meier cen-
soring survivor curves and mean-restricted survival times (Lunceford et al. 2002),
an improved estimator for the survival distribution which was shown to be the
most efficient among regular, asymptotically linear estimators (Wahed and Tsiatis
2004, 2006). Log-rank tests and sample size calculations have since been developed
(Feng and Wahed 2009). While these methods do address estimation of a dynamic
regime of the form “what is the best initial treatment? what is the best subsequent
treatment if the initial treatment fails?”, these methods are typically used to select
from among a small class of initial and maintenance treatment pairs, and have not
been developed to select an optimal threshold from among a potentially large list of
values.
The general MSM framework for DTR estimation has been further adapted to
handle stochastic treatment assignment rules. For example, Cain et al. (2010) con-
sidered treatment rules which allowed for a grace period of m months in the timing
of treatment initiation, i.e. a rule of the form “initiate treatment within m months of
covariate O crossing threshold ψ ” rather than “initiate treatment when covariate O
crosses threshold ψ ”.
Thall and colleagues have considered DTRs in several cancer treatment settings,
where the typical treatment paradigm is “play the winner, drop the loser” (Thall
et al. 2000): a patient given an initial course of a treatment will continue to receive
that treatment if it is deemed to be sufficiently successful (e.g. due to partial tumor
shrinkage or partial remission), will be switched to a maintenance therapy or follow-
up if completely successful, and will be switched to an alternative treatment (some-
times referred to as a salvage therapy) if the initial treatment is unsuccessful. The
definition of success on a particular course of treatment may depend on which
course it is. For example, in prostate cancer, a success on the first course of treat-
ment requires a decrease of at least 40 % in the cancer biomarker prostate-specific
antigen (PSA) from baseline, while success in the second course requires a decrease
of at least 80 % in PSA from the baseline value (and, in both cases, no evidence of
disease progression).
In a prostate cancer treatment trial, Thall et al. (2000) took a parametric approach
to estimating the best sequence of treatments with the goal of maximizing the prob-
ability of successful treatment, where success is a binary variable. Four treatment
courses were considered. Patients were randomized to one of the four treatments,
and if treatment failed, randomized to one of the remaining three options. That is,
A1 = {1, 2, 3, 4} and A2 = A1 \ a1 (where a1 is the treatment actually given at the
first stage). A patient was switched from a treatment after the first failure, or deemed
to have had a successful therapy following two successful courses of the same treat-
ment. Thus, the trial can be viewed as a two-stage trial in which patients can have
at least one and at most two courses of treatment in the first stage, and at most two
courses of treatment in the second stage, for a total of two to four courses of treatment.
The optimizing criterion for determining the best DTR was the probability of
successful therapy. That is, the goal was to maximize $\xi(a, a') = \xi_a + (1 - \xi_a)\xi_{a'|a}$,
where $\xi_a$ is the probability of a patient success in the first two courses with initial
treatment $a$ and $\xi_{a'|a}$ is the probability that the patient has two successful courses
with treatment $a'$ following initial (unsuccessful) treatment with $a$, i.e. under treat-
ment strategy $(a, a')$. Parametric conditional probability models were posited to
obtain estimates of $\xi(a, a')$ that were allowed to depend on the patient's state and
treatment history. For example, letting $Y_j$ take the value 1 if a patient experiences
successful treatment on the $j$th course and 0 otherwise, patient outcomes through
the first two courses of therapy can be characterized by the following probabilities:

which gives $\xi_a = \theta_1(a)\,\theta_2(1; (a, a))$. Logistic regression models were proposed for
the above probabilities, i.e. $\mathrm{logit}(\theta_j)$ were modeled as linear functions of treatment
and covariate histories for each of the $j$ courses of treatment. These probability
models can be extended to depend on state variables such as initial disease severity
as well. Once all these models are fitted, one can pick the best DTR, i.e. the best
treatment pair $(a, a')$ that maximizes the overall success probability $\xi(a, a')$.
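The final maximization step is then a small search over ordered treatment pairs. The R sketch below assumes two hypothetical, vectorized functions theta1(a) and theta2(a2, a1) returning the fitted course-success probabilities; they stand in for whatever logistic regression fits are available.

  # Choose (a, a') maximizing xi(a, a') = xi_a + (1 - xi_a) * xi_{a'|a}.
  # theta1(a): estimated P(success in first two courses | initial treatment a)
  # theta2(a2, a1): estimated P(two successful courses of a2 | a1 unsuccessful)
  best.pair <- function(theta1, theta2, trts = 1:4) {
    pairs <- subset(expand.grid(a = trts, a2 = trts), a != a2)
    pairs$xi <- theta1(pairs$a) +
      (1 - theta1(pairs$a)) * theta2(pairs$a2, pairs$a)
    pairs[which.max(pairs$xi), ]   # best (a, a') and its success probability
  }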
7.6 Discussion
In this chapter, we have considered the estimation of DTRs for a variety of outcome
types, including multi-dimensional continuous outcomes, time-to-event outcomes
in the presence of censoring, as well as discrete outcomes. Methods used in the
literature for such data include Q-learning, marginal structural models, and a fully
parametric, likelihood-based approach. In the context of Q-learning, modeling of
time-to-event data has been accomplished using accelerated failure time models
(with censoring handled by inverse probability weighting) and using the less para-
metric approach of support vector regression. For discrete outcomes, Q-learning has
also been combined with generalized additive models selected by generalized cross-
validation, with promising results. The MSM approach has been implemented for
discrete failure times only, but can easily be used in a continuous-time setting using
a marginal structural Cox model. G-estimation can also be employed assuming an
AFT model (see Mark and Robins 1993; Hernán et al. 2005) to estimate DTRs; however,
the approach remains under-utilized, perhaps because of the relative lack of standard
software with which it can be implemented.
Chapter 8
Inference and Non-regularity
Inference plays a key role in almost all statistical problems. In the context of DTRs,
one can think of inference for mainly two types of quantities: (i) inference for the
parameters indexing the optimal regime; and (ii) inference for the value function
(mean outcome) of a regime – either a regime that was pre-specified, or one that
was estimated. The literature contains several instances of estimation and inference
for the value functions of one or more pre-specified regimes (Lunceford et al. 2002;
Wahed and Tsiatis 2004, 2006; Thall et al. 2000, 2002, 2007a). However, there has
been relatively little work on inference for the value function of an estimated policy,
mainly due to the difficulty of the problem.
Constructing confidence intervals (CIs) for the parameters indexing the optimal
regime is important for the following reasons. First, if the CIs for some of these
parameters contain zero, then perhaps the corresponding components of the patient
history need not be collected to make optimal decisions using the estimated DTR.
This has the potential to reduce the cost of data collection in a future implementa-
tion of the estimated optimal DTR. Thus in the present context, CIs can be viewed
as a tool – albeit one that is not very sophisticated – for doing variable selection.
Such CIs can be useful in exploratory data analysis when trying to interactively find
a suitable model for, say, the Q-functions. Second, note that when linear models are
used for the Q-functions, the difference in predicted mean outcomes corresponding
to two treatments, e.g. a contrast of Q-functions or a blip function, becomes a lin-
ear combination of the parameters indexing the optimal regime. Point-wise CIs for
these linear combinations can be constructed over a range of values of the history
variables based on the CIs for individual parameters. These CIs can dictate when
there is insufficient support in the data to recommend one treatment over another; in
such cases, treatment decisions can be made based on other considerations, e.g. cost,
familiarity, burden, preference, etc.
An additional complication in inference for the parameters indexing the optimal
regime arises because of a phenomenon called non-regularity. It was Robins (2004)
who first considered the problem of inference for the parameters of the optimal
DTR in the context of G-estimation. As originally discussed by Robins, the treat-
ment effect parameters at any stage prior to the last can be non-regular under
$$U_n(\theta) \to_d N(0, \Sigma_U). \qquad (8.2)$$
$$H_0: \psi = \psi_0, \quad \beta = \text{‘anything’}$$
$$H_A: (\psi, \beta) \neq (\psi_0, \beta)$$
From Eq. (8.4), it can be seen that E[Un (ψ )] = E[Uadj (ψ )] so Uadj (ψ ) is an unbiased
EF; Eq. (8.5) follows from Eq. (8.4) via a substitution from a Taylor expansion of
the EF for β about its limiting value. From Eq. (8.5), we can derive the asymptotic
distribution of the parameter of interest $\psi$ to be

$$\sqrt{n}\,(\hat\psi - \psi) \to_d N_p(0, \Sigma_{\hat\psi}),$$

where $\Sigma_{\hat\psi} = A^{-1}\,\Sigma_{U_{adj}}\,(A^{-1})^T$, with $A$ the probability limit of $-E\big[\frac{\partial}{\partial\psi} U_{adj}(\psi)\big]$ and
$\Sigma_{U_{adj}}$ the probability limit of $E\big[U_{adj}(\psi)\,U_{adj}(\psi)^T\big]$. Note that $\hat\psi$ is the substitution
estimator defined by finding the solution to the EF where an estimate of the (vector)
nuisance parameter, $\hat\beta$, has been plugged into the equation in place of the true value, $\beta$.
It is interesting to compare the variance of the substitution estimator $\hat\psi$ with
that of the estimator, say $\tilde\psi$, that would result from plugging in the true value of the
nuisance parameter (a feasible estimator only when such true values are known). That
is, we may wish to compare $\Sigma_{\hat\psi}$ and $\Sigma_{\tilde\psi}$. It turns out that no general statement
regarding the two estimators' variances can be made; however, there are special
cases in which relationships can be derived (see Henmi and Eguchi (2004) for a
geometric consideration of EFs which serves to elucidate the variance relationships).
For example, if the EF is the score function for $\theta$ in a parametric model, there is a
cost (in terms of information loss or variance inflation) that is incurred for having to
estimate the nuisance parameters. In contrast, in the semi-parametric setting where
the score functions for $\psi$ and $\beta$ are orthogonal and the score function is used as
the EF for $\beta$, it can be shown that $\Sigma_{\tilde\psi} - \Sigma_{\hat\psi}$ is positive definite. That is, efficiency is
gained by estimating rather than knowing the nuisance parameter $\beta$.
We now apply the theory of the previous section to Q-learning for the case where we
use linear models parameterized by $\theta_j = (\psi_j, \beta_j)$ of the form $Q_j^{opt}(H_j, A_j; \beta_j, \psi_j) = \beta_j^T H_{j0} + (\psi_j^T H_{j1}) A_j$. For simplicity of exposition, we will focus on the two-stage
setting, but extensions to the general, K-stage setting follow directly. Following
the algorithm for Q-learning outlined in Sect. 3.4.1, we begin with a regression of
$Y_2$ using the model $Q_2^{opt}(H_2, A_2; \beta_2, \psi_2) = \beta_2^T H_{20} + (\psi_2^T H_{21}) A_2$. Letting $X_2$ denote
$(H_{20}, H_{21} A_2)$, this gives a linear regression of the familiar form $E[Y_2 | X_2] = X_2\theta_2$,
with $\mathrm{Var}[\hat\theta_2] = (X_2^T X_2)^{-1}\sigma^2$, where $\sigma^2$ denotes the variance of the residuals $Y_2 - X_2\theta_2$. Confidence intervals can then be formed, and significance tests performed, for
the vector parameter $\theta_2$. If composite tests of the form $H_0: \psi_2 = 0$ are desired, hy-
pothesizing that the variables contained in $H_{21}$ are not significantly useful tailoring
variables without specifying any hypothesized values for the value of $\beta_2$, then the
Wald statistic should be scaled using $I_{\psi_2\psi_2.\beta_2}^{1/2} = (I_{\psi_2\psi_2} - I_{\psi_2\beta_2} I_{\beta_2\beta_2}^{-1} I_{\beta_2\psi_2})^{1/2}$,
where
$$\begin{pmatrix} I_{\psi_2\psi_2} & I_{\psi_2\beta_2} \\ I_{\beta_2\psi_2} & I_{\beta_2\beta_2} \end{pmatrix}$$

is a block decomposition of the information matrix of the regression parameters at
the second stage; similarly, $I_{\psi_2\psi_2.\beta_2}^{1/2}$ should be used to determine
the limits of a confidence interval.
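In R, the nuisance-adjusted block can be computed directly from a fitted second-stage regression; the sketch below assumes a fitted lm object fit2 and index vectors psi.idx and beta.idx (hypothetical names) partitioning its coefficients into tailoring and main-effect terms, and expresses the composite test in the equivalent quadratic form.

  # Composite Wald test of psi2 = 0 with beta2 treated as a nuisance.
  info  <- solve(vcov(fit2))                  # estimated information matrix
  I.adj <- info[psi.idx, psi.idx] -
           info[psi.idx, beta.idx] %*% solve(info[beta.idx, beta.idx]) %*%
           info[beta.idx, psi.idx]            # I_{psi psi . beta}
  psi.hat <- coef(fit2)[psi.idx]
  wald <- as.numeric(t(psi.hat) %*% I.adj %*% psi.hat)  # ~ chi-square under H0
  pval <- pchisq(wald, df = length(psi.idx), lower.tail = FALSE)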
Now, let us consider the first-stage estimator. First-stage estimation proceeds
by first forming the pseudo-outcome $Y_1 + \beta_2^T H_{20} + |\psi_2^T H_{21}|$, which we implement
in practice using the estimate $\hat{Y}_1 = Y_1 + \hat\beta_2^T H_{20} + |\hat\psi_2^T H_{21}|$, and regressing this on
$(H_{10}, H_{11} A_1)$ using the model $Q_1^{opt}(H_1, A_1; \beta_1, \psi_1) = \beta_1^T H_{10} + (\psi_1^T H_{11}) A_1$. This
gives the stage-specific estimating functions

$$U_{2,n}(\theta_2) = \mathbb{P}_n\Big[\big(Y_2 - Q_2^{opt}(H_2, A_2; \beta_2, \psi_2)\big)\,\frac{\partial}{\partial\theta_2} Q_2^{opt}(H_2, A_2; \beta_2, \psi_2)\Big] = \mathbb{P}_n\Big[\big(Y_2 - \beta_2^T H_{20} - (\psi_2^T H_{21})A_2\big)\,(H_{20}^T, H_{21}^T A_2)^T\Big],$$

$$U_{1,n}(\theta_1, \theta_2) = \mathbb{P}_n\Big[\big(Y_1 + \max_{A_2} Q_2^{opt}(H_2, A_2; \beta_2, \psi_2) - Q_1^{opt}(H_1, A_1; \beta_1, \psi_1)\big)\,\frac{\partial}{\partial\theta_1} Q_1^{opt}(H_1, A_1; \beta_1, \psi_1)\Big] = \mathbb{P}_n\Big[\big(Y_1 + \beta_2^T H_{20} + |\psi_2^T H_{21}| - \beta_1^T H_{10} - (\psi_1^T H_{11})A_1\big)\,(H_{10}^T, H_{11}^T A_1)^T\Big].$$

Then the (joint) estimating equation for all the parameters from both stages of
Q-learning is given by

$$\begin{pmatrix} U_{2,n}(\theta_2) \\ U_{1,n}(\theta_1, \theta_2) \end{pmatrix} = 0.$$
At the first stage, then, both the main effect parameters $\beta_1$ and all second-
stage parameters can be considered nuisance parameters. Collecting these into
a single vector $\beta = (\beta_1, \beta_2, \psi_2)$, we use a similar form to the above, form-
ing Wald test statistics or CIs for the tailoring variable parameters using
$I_{\psi_1\psi_1.\beta}^{1/2} = (I_{\psi_1\psi_1} - I_{\psi_1\beta} I_{\beta\beta}^{-1} I_{\beta\psi_1})^{1/2}$, where

$$\begin{pmatrix} I_{\psi_1\psi_1} & I_{\psi_1\beta} \\ I_{\beta\psi_1} & I_{\beta\beta} \end{pmatrix}$$

is the corresponding block decomposition of the information matrix.
The variance of the optimal decision rule parameters $\hat\psi$ must adjust for the plug-in
estimates of nuisance parameters in the estimating function of Eq. (4.3), $U(\psi) = \sum_{i=1}^n \sum_{j=1}^K U_j(\psi_j, \hat\varsigma_j(\psi_j), \hat\alpha_j)$. In the derivations that follow, we assume the param-
eters are not shared between stages; however, the calculations are similar in the
shared-parameter setting. Second derivatives of the estimating functions for all
parameters are needed, and thus we require that each subject’s optimal regime must
be unique at every stage except possibly the first. If for any individual, the optimal
treatment is not unique, then it is the case that γ j (h j , a j ) = 0, or equivalently that for
a Q-function β jT H j0 + (ψ Tj H j1 )(A j + 1)/2, ψ Tj H j1 = 0. Provided the rule is unique,
then the estimating functions used in each stage of estimation for G-estimation will
be differentiable and so the asymptotic variance can be determined.
Robins (2004) derives the variance of $U(\psi, \varsigma(\psi), \alpha)$ by performing a first-order
Taylor expansion of the function about the limiting values $\varsigma$ and $\alpha$ of $\hat\varsigma(\psi)$ and $\hat\alpha$:

$$U_{adj}(\psi) = U(\psi, \varsigma, \alpha) + E\Big[\frac{\partial}{\partial\varsigma}U(\psi, \varsigma, \alpha)\Big](\hat\varsigma(\psi) - \varsigma) + E\Big[\frac{\partial}{\partial\alpha}U(\psi, \varsigma, \alpha)\Big](\hat\alpha - \alpha) + o_p(1).$$

This gives

$$U_{adj}(\psi) = U(\psi, \varsigma, \alpha) - E\Big[\frac{\partial}{\partial\varsigma}U(\psi, \varsigma, \alpha)\Big] E\Big[\frac{\partial}{\partial\varsigma}\dot{l}_\varsigma(\varsigma)\Big]^{-1}\dot{l}_\varsigma(\varsigma) - E\Big[\frac{\partial}{\partial\alpha}U(\psi, \varsigma, \alpha)\Big] E\Big[\frac{\partial}{\partial\alpha}\dot{l}_\alpha(\alpha)\Big]^{-1}\dot{l}_\alpha(\alpha).$$
Thus the estimating function has variance E[Uadj (ψ )⊗2 ] = E[Uadj (ψ )Uadj (ψ )T ].
It follows that the variance of the blip function parameters which index the deci-
sion rules, ψ̂ = (ψ̂1T , ψ̂2T , . . . , ψ̂KT )T , is given by
$$\Sigma_{\hat\psi} = E\Bigg[\bigg\{E\Big[\frac{\partial}{\partial\psi} U_{adj}(\psi, \varsigma, \alpha)\Big]^{-1} U_{adj}(\psi, \varsigma, \alpha)\bigg\}^{\otimes 2}\Bigg].$$
Suppose at each of two stages, $p$ different parameters are estimated. Then $\Sigma_{\hat\psi}$ is
the $(2p) \times (2p)$ covariance matrix

$$\Sigma_{\hat\psi} = \begin{pmatrix} \Sigma_{\hat\psi}^{(11)} & \Sigma_{\hat\psi}^{(12)} \\ \Sigma_{\hat\psi}^{(21)} & \Sigma_{\hat\psi}^{(22)} \end{pmatrix}.$$
The $p \times p$ covariance matrix of $\hat\psi_2 = (\hat\psi_{20}, \ldots, \hat\psi_{2(p-1)})$ that accounts for using
the substitution estimates $\hat\varsigma_2$ and $\hat\alpha_2$ is $\Sigma_{\hat\psi}^{(22)}$, and accounting for substituting $\hat\psi_2$
as well as $\hat\varsigma_1$ and $\hat\alpha_1$ to estimate $\psi_1$ gives the $p \times p$ covariance matrix $\Sigma_{\hat\psi}^{(11)}$ for
$\hat\psi_1 = (\hat\psi_{10}, \ldots, \hat\psi_{1(p-1)})$.
However, as shown in Sect. 4.3.1, parameters can be estimated separately at each
stage using recursive G-estimation. In such a case, it is possible to estimate the
variances $\Sigma_{\hat\psi}^{(22)}$ and $\Sigma_{\hat\psi}^{(11)}$ of the stage-specific parameters recursively as
well (Moodie 2009a). The development for the estimation of the diagonal compo-
nents, $\Sigma_{\hat\psi}^{(jj)}$, of the covariance matrix $\Sigma_{\hat\psi}$ will be undertaken in a two-stage setting,
but the extension to the K-stage case follows directly.
Let $U_{adj,1}(\psi_1, \psi_2)$ and $U_{adj,2}(\psi_2)$ denote, respectively, the first and second com-
ponents of $U_{adj}(\psi)$. At the second stage, use $U_{adj,2}$ to calculate $\Sigma_{\hat\psi}^{(22)}$. To find the
covariance matrix of $\hat\psi_1$, use a Taylor expansion of $U_1(\psi_1, \hat\psi_2, \hat\varsigma_1(\psi_1), \hat\alpha_1)$ about the
limiting values of the nuisance parameters $(\psi_2, \varsigma_1, \alpha_1)$. After some simplification,
this gives:

$$U_{adj,1}^\varepsilon(\psi_1, \psi_2) = U_{adj,1}(\psi_1, \psi_2) - E\Big[\frac{\partial}{\partial\psi_2} U_1(\psi_1, \psi_2, \varsigma_1, \alpha_1)\Big]\, E\Big[\frac{\partial}{\partial\psi_2} U_{adj,2}(\psi_2, \varsigma_2, \alpha_2)\Big]^{-1} U_{adj,2}(\psi_2, \varsigma_2, \alpha_2) + o_p(1).$$
It then follows that $\sqrt{n}(\hat\psi_1 - \psi_1)$ converges in distribution to

$$N\Bigg(0,\; E\bigg[\Big\{E\Big[\frac{\partial}{\partial\psi_1} U^\varepsilon_{adj,1}\Big]^{-1} U^\varepsilon_{adj,1}\Big\}^{\otimes 2}\bigg]\Bigg).$$

Thus, the diagonal components of $\Sigma_{\hat\psi}$ are obtained using a more tractable calcula-
tion.
Note that if there are K > 2 stages, similar derivations can be used, but re-
quire the use of $K - j$ adjustment terms to $U_{adj,j}$ for the estimation and substitu-
tion of all future decision rule parameters, $\psi_{j+1}, \ldots, \psi_K$. Note that $U^\varepsilon_{adj}$ and $U_{adj}$
produce numerically the same variance estimate at each stage: that is, the recursive
variance calculation simply provides a more convenient and less computationally
intensive approach by taking advantage of known independences (i.e. zeros in the
matrix of derivatives of U(ψ ) with respect to ψ ) which arise because decision rules
do not share parameters at different stages. The asymptotic variances can lead to
coverage below the nominal level in small samples, but perform well for samples
of size 1,000 or greater in regular settings where differentiability of the EFs holds
(Moodie 2009a).
Berger and Boos (1994) and Berger (1996) proposed a general method for
constructing valid hypothesis tests in the presence of a nuisance parameter. One
can develop an asymptotically exact confidence interval for the stage 1 parameter
ψ1 by inverting these hypothesis tests, based on the following nuisance parameter
formulation. As we have noted above, many DTR parameter estimators are obtained
via substitution because the true value of the stage 2 parameter ψ2 is unknown and
must be estimated (see Sect. 8.2 for details). Instead, if the true value of $\psi_2$ were
known a priori, the asymptotic distribution of $\sqrt{n}(\hat\psi_1 - \psi_1)$ would be regular (in
fact, normal), and standard procedures could be used to construct an asymptotically
valid confidence interval, although performance of such asymptotic variance esti-
mators may be poor in small samples. Thus, while $\psi_2$ is not of primary interest for
analyzing stage 1 decisions, it nevertheless plays an essential role in the asymptotic
distribution of $\sqrt{n}(\hat\psi_1 - \psi_1)$. In this sense, $\psi_2$ is a nuisance parameter. This idea
was used by Robins (2004) to construct a projection confidence interval for ψ1 .
The basic idea is as follows. Let $S_{n,1-\alpha}(\psi_2)$ denote an asymptotically exact con-
fidence interval for $\psi_1$ if $\psi_2$ were known, i.e., $P(\psi_1 \in S_{n,1-\alpha}(\psi_2)) = 1 - \alpha + o_P(1)$.
Of course, the exact value of $\psi_2$ is not known, but since $\sqrt{n}(\hat\psi_2 - \psi_2)$ is regular
and asymptotically normal, it is straightforward to construct a $(1-\varepsilon)$ asymptotic
confidence interval for $\psi_2$, say $C_{n,1-\varepsilon}$, for arbitrary $\varepsilon > 0$. Then, it follows that
$\bigcup_{\gamma\in C_{n,1-\varepsilon}} S_{n,1-\alpha}(\gamma)$ is a $(1 - \alpha - \varepsilon)$ confidence interval for $\psi_1$. To see this, note
that

$$P\Big(\psi_1 \in \bigcup_{\gamma\in C_{n,1-\varepsilon}} S_{n,1-\alpha}(\gamma)\Big) \ge 1 - \alpha + o_P(1) - P\big(\psi_2 \notin C_{n,1-\varepsilon}\big) = 1 - \alpha - \varepsilon + o_P(1). \qquad (8.6)$$
Thus, the projection confidence interval is the union of the confidence intervals
Sn,1−α (γ ) over all values γ ∈ Cn,1−ε , and is an asymptotically valid (1 − α − ε )
confidence interval for ψ1 . The main downside of this approach is that it is poten-
tially highly conservative. Also, its implementation can be computationally highly
expensive.
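For a scalar ψ1 and ψ2, a naive grid implementation of the projection interval is straightforward, if slow. In the R sketch below, ci.psi1.given (a function returning Sn,1−α(γ) as c(lower, upper) for a fixed value γ of ψ2) is a hypothetical placeholder for whatever fixed-ψ2 interval construction is available.

  # Projection (1 - alpha - eps) confidence interval for a scalar psi1.
  projection.ci <- function(data, psi2.hat, se2, ci.psi1.given,
                            alpha = 0.05, eps = 0.01, grid.n = 200) {
    C.n  <- psi2.hat + c(-1, 1) * qnorm(1 - eps / 2) * se2  # (1-eps) CI for psi2
    grid <- seq(C.n[1], C.n[2], length.out = grid.n)
    lims <- sapply(grid, function(g) ci.psi1.given(g, data, alpha))
    c(min(lims[1, ]), max(lims[2, ]))  # union of intervals S_{n,1-alpha}(gamma)
  }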
Fig. 8.1 Absolute bias of ψ̂10 in hard-max Q-learning in different regions (regular and non-regular)
of the underlying parameter space. Different plots correspond to different parameter settings.
of the parameter space that lead to bias in ψ̂10 , thereby reinforcing the necessity to
address the problem through careful estimation and inference techniques.
As noted by Moodie and Richardson (2010), the bias maps can be used to visually
represent the asymptotic results concerning DTR estimators. Consistency may be
visualized by looking at a horizontal cross-section of a bias map: as sample size in-
creases, the bias of the first-stage estimator will decrease to be smaller than any
fixed, positive number at all non-regular parameter settings, even those that are
nearly non-regular. However, as derived by Robins (2004), there exist sequences of
data generating processes {ψ(n) } for which the second-stage parameters ψ2 decrease
with increasing n in such a way that the asymptotic bias of the first-stage estimator
ψ̂1 is strictly positive. Contours of constant bias can be found along the lines on the
bias map traced by plotting g2 (ψ2 ) = kn−1/2 against n, for some constant k. The
asymptotic bias is bounded and, in finite samples, the value of the second-stage pa-
rameters (i.e. the “nearness” to non-regularity) and the sample size both determine
the bias of the first-stage parameter estimator.
138 8 Inference and Non-regularity
With (3.8) as the model for Q-functions, the optimal DTR is given by
where sign(x) = 1 if x > 0, and −1 otherwise. Note that the term β jT H j0 on the right
hand side of (3.8) does not feature in the optimal DTR. Thus for estimating optimal
DTRs, the ψ j s are the parameters of interest, while β j s are nuisance parameters.
These ψ j s are the policy parameters for which we want to construct confidence
intervals.
Inference for ψ2 , the stage 2 parameters, is straightforward since this falls in
the framework of standard linear regression. In contrast, inference for ψ1 , the
stage 1 parameters, is complicated by the previously discussed problem of non-
regularity resulting from the underlying non-smooth maximization operation in the
estimation procedure. To further understand the problem, recall that the stage 1
pseudo-outcome in Q-learning for the i-th subject is

Ŷ1i = Y1i + max_{a2} Qopt2(H2i, a2; β̂2, ψ̂2) = Y1i + β̂2T H20,i + |ψ̂2T H21,i|,  i = 1, . . . , n,

which is a non-smooth function of ψ̂2. When ψ2T H21 = 0 with positive probability, the asymptotic distribution of ψ̂1 changes abruptly with the underlying parameters, and in finite samples the estimator oscillates between
the two asymptotic distributions across samples. Consequently, ψ̂1 becomes a biased
estimator of ψ1 , and Wald type CIs for components of ψ1 show poor coverage rates
(Robins 2004; Moodie and Richardson 2010).
Let us again consider a typical, two-stage scenario with linear optimal blip func-
tions,
Let η2 = ψ20 + ψ21 o2 + ψ22 (a1 + 1)/2 + ψ23 o2 (a1 + 1)/2 and similarly define
η̂2 = ψ̂20 + ψ̂21 o2 + ψ̂22 (a1 + 1)/2 + ψ̂23 o2 (a1 + 1)/2. The G-estimating function
for ψ2 is unbiased, so E[η̂2] = η2. The sign of η2 is used to decide the optimal treatment at the second stage: d2opt = sign(η2) = sign(ψ20 + ψ21 o2 + ψ22 a1 + ψ23 o2 a1), and dˆ2opt = sign(η̂2), so that the G-estimating equation solved for ψ1 at the first stage contains the term sign(η̂2)η̂2 = |η̂2|. By Jensen's inequality, E[|η̂2|] ≥ |E[η̂2]| = |η2|, so that

sign(η̂2)η̂2 ≥ sign(η2)η2,

where ≥ is used to denote "greater than or equal to in expectation". The quantity γ2(h2, d2opt; ψ2) − γ2(h2, a2; ψ2) in Gmod,1(ψ1) – or more generally, the sum ∑_{k>j} [γk(hk, dkopt; ψk) − γk(hk, ak; ψk)] in Gmod,j(ψj) – corresponds conceptually to |μ| in
the toy example with normally-distributed random variables Xi that was introduced
at the start of the section. By using a biased estimate of sign(η2 )η2 in Gmod,1 (ψ1 ),
some strictly positive value is added into the G-estimating equation for ψ1 . The esti-
mating function no longer has expectation zero and hence is asymptotically biased.
While these threshold estimators are quite intuitive in nature, only limited theoretical results are available. We present them in the context of Q-learning, but they can equally be applied in a G-estimation setting. The hard-threshold pseudo-outcome takes the form
Ŷ1iHT = Y1i + β̂2T H20,i + |ψ̂2T H21,i | · I[|ψ̂2T H21,i | > λi ], i = 1, . . . , n, (8.8)
where λi (>0) is the threshold for the i-th subject in the sample (possibly depending
on the variability of the linear combination ψ̂2T H21,i for that subject). One way to
operationalize this is to perform a preliminary test (for each subject in the sample) of
the null hypothesis ψ2T H21,i = 0 (H21,i is considered fixed in this test), set Ŷ1iHT = Ŷ1i
if the null hypothesis is rejected, and replace |ψ̂2T H21,i | with the “better guess” of 0
in the case that the test fails to reject the null hypothesis. Thus the hard-threshold
pseudo-outcome can be written as
Ŷ1iHT = Y1i + β̂2T H20,i + |ψ̂2T H21,i| · I[ √n |ψ̂2T H21,i| / √(H21,iT Σ̂ψ̂2 H21,i) > zα/2 ],  i = 1, . . . , n,   (8.9)

where n−1 Σ̂ψ̂2 is the estimated covariance matrix of ψ̂2. The
corresponding estimator of ψ1 , denoted by ψ̂1HT , will be referred to as the hard-
threshold estimator. The hard-threshold estimator is common in many areas like
variable selection in linear regression and wavelet shrinkage (Donoho and John-
stone 1994). Moodie and Richardson (2010) proposed this estimator for bias cor-
rection in the context of G-estimation, and called it the Zeroing Instead of Plugging
In (ZIPI) estimator. In regular data-generating settings, ZIPI estimators converge to
the usual recursive G-estimators and therefore are asymptotically consistent, unbi-
ased and normally distributed. Furthermore, in any non-regular setting where there
exist some individuals for whom there is a unique optimal regime, ZIPI estimators
have smaller asymptotic bias than the recursive G-estimators provided parameters
are not shared across stages (Moodie and Richardson 2010).
Note that Ŷ1HT is still a non-smooth function of ψ̂2 and hence ψ̂1HT is a non-
regular estimator of ψ1 . However, the problematic term |ψ̂2T H21 | is thresholded, and
hence one might expect that the degree of non-regularity is somewhat reduced. An
important issue regarding the use of this estimator is the choice of the significance
level α of the preliminary test, which is an unknown tuning parameter. As dis-
cussed by Moodie and Richardson (2010), this is a difficult problem even in better-
understood settings where preliminary test based estimators are used; no widely ap-
plicable data-driven method for choosing α in this setting is available. Chakraborty
et al. (2010) studied the behavior of the usual bootstrap in conjunction with this
estimator empirically.
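As an illustration, a minimal implementation of the hard-threshold pseudo-outcome (8.9) might look as follows; this is a sketch assuming the stage 2 fit has already produced β̂2, ψ̂2, and the asymptotic covariance Σ̂ψ̂2, and the array names are ours, not from any package.

```python
import numpy as np
from scipy import stats

def hard_threshold_pseudo_outcome(Y1, H20, H21, beta2, psi2, Sigma_psi2, alpha=0.08):
    """Hard-threshold pseudo-outcome of Eq. (8.9).

    Y1: (n,) stage 1 outcomes; H20: (n, p0) and H21: (n, p1) stage 2 feature
    matrices; Sigma_psi2: estimated asymptotic covariance of psi2-hat, so that
    Sigma_psi2 / n estimates its finite-sample covariance.
    """
    n = len(Y1)
    lin = H21 @ psi2                                   # psi2' H21_i for each subject
    # Subject-specific asymptotic standard error of the linear combination.
    se = np.sqrt(np.einsum('ij,jk,ik->i', H21, Sigma_psi2, H21))
    z = stats.norm.ppf(1 - alpha / 2)
    keep = np.sqrt(n) * np.abs(lin) / se > z           # per-subject pretest of psi2'H21_i = 0
    return Y1 + H20 @ beta2 + np.abs(lin) * keep
```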
The soft-threshold pseudo-outcome takes the non-negative garrote form

Ŷ1iST = Y1i + β̂2T H20,i + |ψ̂2T H21,i| · ( 1 − λi/|ψ̂2T H21,i|² )+,  i = 1, . . . , n,   (8.10)

where x+ = x I[x > 0] denotes the positive part of x, and λi (>0) is a tuning parameter associated with the i-th subject in the sample (again possibly depending on the variability of the linear combination ψ̂2T H21,i for that subject). In the context of regression shrinkage (Breiman 1995) and wavelet shrinkage (Gao 1998), the third term on the right side of (8.10) is generally known as the non-negative garrote estimator. As discussed by Zou (2006), the non-negative garrote estimator is a special case of the adaptive lasso estimator. Chakraborty et al. (2010) proposed this soft-threshold estimator in the context of Q-learning.
Like the hard-threshold pseudo-outcome, Ŷ1ST is also a non-smooth function of
ψ̂2 and hence ψ̂1ST remains a non-regular estimator of ψ1 . However, the problematic
term |ψ̂2T H21 | is thresholded and shrunk towards zero, which reduces the degree of
non-regularity. As in the case of hard-threshold estimators, a crucial issue here is to
choose a data-driven tuning parameter λi ; see below for a choice of λi following a
Bayesian approach. Figure 8.2 presents the hard-max, the hard-threshold, and the
soft-threshold pseudo-outcomes.
Theorem 8.1. Let X be a random variable such that X|μ ∼ N(μ, σ²) with known variance σ². Let the prior distribution on μ be given by μ|φ² ∼ N(0, φ²), with Jeffreys' noninformative hyper-prior on φ², i.e., p(φ²) ∝ 1/φ². Then an empirical Bayes estimator of |μ| is given by

|μ̂|EB = X (1 − 3σ²/X²)+ [ 2Φ( (X/σ)(1 − 3σ²/X²)+ ) − 1 ] + σ √(2/π) exp( −(X²/(2σ²)) ((1 − 3σ²/X²)+)² ),   (8.11)

where Φ(·) denotes the standard normal distribution function.
Fig. 8.2 Hard-threshold and soft-threshold pseudo-outcomes compared with the hard-max
pseudo-outcome
For moderate to large values of |X| relative to σ, this simplifies to

|μ̂|EB ≈ X (1 − 3σ²/X²)+ sign(X) = |X| (1 − 3σ²/X²)+.   (8.12)
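In code, the empirical Bayes estimate (8.11), as reconstructed above, can be computed directly; the sketch below takes a scalar X and a known variance σ², and assumes X ≠ 0 so that the shrinkage factor is well defined.

```python
import numpy as np
from scipy.stats import norm

def eb_abs_mean(X, sigma2):
    """Empirical Bayes estimate of |mu| from X ~ N(mu, sigma2), per Eq. (8.11).

    Assumes X != 0; sigma2 is the known sampling variance.
    """
    shrink = max(1.0 - 3.0 * sigma2 / X**2, 0.0)          # (1 - 3*sigma^2/X^2)^+
    term1 = X * shrink * (2.0 * norm.cdf(X * shrink / np.sqrt(sigma2)) - 1.0)
    term2 = np.sqrt(2.0 * sigma2 / np.pi) * np.exp(-X**2 * shrink**2 / (2.0 * sigma2))
    return term1 + term2
```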
Now for i = 1, . . . , n separately, put X = ψ̂2T H21,i and μ = ψ2T H21,i (for fixed H21,i), and plug in σ̂i² = H21,iT Σ̂ψ̂2 H21,i/n for σ². This leads to the choice λi = 3 H21,iT Σ̂ψ̂2 H21,i/n in the soft-threshold pseudo-outcome (8.10), giving

Ŷ1iST = Y1i + β̂2T H20,i + |ψ̂2T H21,i| ( 1 − 3 H21,iT Σ̂ψ̂2 H21,i / (n |ψ̂2T H21,i|²) ) · I[ n |ψ̂2T H21,i|² > 3 H21,iT Σ̂ψ̂2 H21,i ],  i = 1, . . . , n.   (8.13)
The presence of the indicator function in (8.13) indicates that Ŷ1iST is a thresholding
rule for small values of |ψ̂2T H21,i |, while the term just preceding the indicator func-
tion makes Ŷ1iST a shrinkage rule for moderate to large values of |ψ̂2T H21,i | (for which
the indicator function takes the value one).
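The following sketch assembles the soft-threshold pseudo-outcome (8.13) from a fitted stage 2 model; as in the earlier hard-threshold sketch, the inputs are assumed to be plain arrays together with the plug-in covariance Σ̂ψ̂2, and the names are illustrative.

```python
import numpy as np

def soft_threshold_pseudo_outcome(Y1, H20, H21, beta2, psi2, Sigma_psi2):
    """Soft-threshold pseudo-outcome of Eq. (8.13), lambda_i = 3 H21_i' Sigma H21_i / n."""
    n = len(Y1)
    lin = H21 @ psi2
    var_i = np.einsum('ij,jk,ik->i', H21, Sigma_psi2, H21) / n   # sigma_i^2 plug-in
    with np.errstate(divide='ignore'):
        # Garrote-type factor: zero (thresholding) for small |psi2'H21_i|,
        # shrinkage toward zero for moderate to large values.
        shrink = np.maximum(1.0 - 3.0 * var_i / lin**2, 0.0)
    return Y1 + H20 @ beta2 + np.abs(lin) * shrink
```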
Interestingly, the thresholding rule in (8.13) also provides some guidance for
choosing the tuning parameter of the hard-threshold estimator. Note that the indicator function in (8.13) corresponds to a pretest that uses a critical value of √3 ≈ 1.7321; equating this value to zα/2 and solving for α, we get α = 0.0833. Hence a hard-threshold estimator with tuning parameter α = 0.0833 ≈ 0.08 corresponds to
the soft-threshold estimator without the shrinkage effect. Chakraborty et al. (2010)
empirically showed that the hard-threshold estimator with α = 0.08 outper-
formed other choices of this tuning parameter as reported in the original paper
by Moodie and Richardson (2010).
The data for this illustration come from the web-based smoking cessation study introduced in Sect. 3.4.3, and the variables considered here are the same as those considered there. To find the optimal DTR, we applied both the hard-max and the soft-threshold estimators within the Q-learning framework. This involved:
1. Fit the stage 2 regression (n = 281) of FF6Quitstatus.
2. Construct both the hard-max and the soft-threshold versions of the stage 1 pseudo-outcome. Note that in this case one can construct both versions of the pseudo-outcomes for everyone who participated at stage 1, since there are no variables from post-stage 1 required to do so.
3. Fit the stage 1 regression (n = 1,401) of the pseudo-outcome.
Table 8.1 Regression coefficients and 95 % bootstrap confidence intervals at stage 1, using both
the hard-max and the soft-threshold estimators (significant effects are in bold)
Hard-max Soft-threshold
Variable Coefficient 95 % CI Coefficient 95 % CI
motivation 0.04 (−0.00, 0.08) 0.04 (0.00, 0.08)
selfefficacy 0.03 (0.00, 0.06) 0.03 (0.00, 0.06)
education −0.01 (−0.07, 0.06) −0.01 (−0.07, 0.06)
source −0.15 (−0.35, 0.06) −0.15 (−0.35, 0.06)
source × selfefficacy 0.03 (0.00, 0.06) 0.03 (0.00, 0.06)
story 0.05 (−0.01, 0.11) 0.05 (−0.01, 0.11)
story × education −0.07 (−0.13, −0.01) −0.07 (−0.13, −0.01)
From the above analysis, it is found that at stage 1 subjects with higher levels of
motivation or selfefficacy are more likely to quit. The highly personalized level of source is more effective for subjects with higher selfefficacy
(≥7), and the deeply tailored level of story is more effective for subjects with lower
education (≤ high school); these two conclusions can be drawn from the interaction plots (with confidence intervals) presented in Fig. 3.2 (see Sect. 3.4.3). Thus,
to maximize each individual's chance of quitting over the two stages, the web-based
smoking cessation intervention should in future be designed such that: (1) smokers with high self-efficacy (≥7) are assigned to the highly personalized level of
source, and (2) smokers with lower education are assigned to the deeply tailored
level of story.
8.4 Penalized Q-learning

The difference between the use of penalization in penalized Q-learning (PQ-learning) and that used in the context of variable selection is in the "target" of penalization: while penalties are applied to each variable (covariate) in a variable selection context, they are applied to each subject in the case of PQ-learning.
Let θ j = (β jT , ψ Tj )T for j = 1, 2. PQ-learning starts by considering a penalized
least squares optimization at stage 2; it minimizes the objective function
W2(θ2) = ∑_{i=1}^n ( Y2i − Qopt2(H2i, A2i; β2, ψ2) )² + ∑_{i=1}^n Jλn( |ψ2T H21,i| )
with respect to θ2 to obtain the stage 2 estimates θ̂2 , where Jλn (·) is a pre-specified
penalty function and λn is a tuning parameter. The penalty function can be taken
directly from the variable selection literature; in particular, Song et al. (2011) used
the adaptive lasso (Zou 2006) penalty, where Jλn(θ) = λn θ/|θ̂|^α with α > 0 and θ̂
being a √n-consistent estimator of θ. Furthermore, as in the adaptive lasso procedure, the tuning parameter λn is taken to satisfy √n λn → 0 and n λn → ∞. The rest
of the Q-learning algorithm (hard-max version) is unchanged in PQ-learning.
of the Q-learning algorithm (hard-max version) is unchanged in PQ-learning.
The above minimization is implemented via local quadratic approximation
(LQA), following Fan and Li (2001). The procedure starts with an initial value ψ̂2(0)
of ψ2 , and then uses LQA for the penalty terms in the objective function:
Jλn( |ψ2T H21,i| ) ≈ Jλn( |ψ̂2(0)T H21,i| ) + (1/2) { J′λn( |ψ̂2(0)T H21,i| ) / |ψ̂2(0)T H21,i| } { (ψ2T H21,i)² − (ψ̂2(0)T H21,i)² }
for ψ2 close to ψ̂2(0) . Hence the objective function can be locally approximated, up
to a constant, by
∑_{i=1}^n ( Y2i − Qopt2(H2i, A2i; β2, ψ2) )² + (1/2) ∑_{i=1}^n { J′λn( |ψ̂2(0)T H21,i| ) / |ψ̂2(0)T H21,i| } (ψ2T H21,i)².
When Q-functions are approximated by linear models as in (3.8), the above mini-
mization problem has a closed form solution:
ψ̂2 = [ X22T (I − X21(X21T X21)−1 X21T + D) X22 ]−1 X22T (I − X21(X21T X21)−1 X21T) Y2,
β̂2 = (X21T X21)−1 X21T (Y2 − X22 ψ̂2),

where X21 and X22 denote the stage 2 design matrices corresponding to β2 and ψ2 respectively, and D is the diagonal matrix arising from the local quadratic approximation of the penalty, with i-th diagonal entry (1/2) J′λn(|ψ̂2(0)T H21,i|)/|ψ̂2(0)T H21,i|. The
above minimization procedure can be continued for more than one step or until
convergence. However, as discussed by Fan and Li (2001), either the one-step or
multi-step procedure will be as efficient as the fully iterative procedure as long as
the initial estimators are good enough.
Variance Estimation
Song et al. (2011) provided sandwich-type plug-in estimators for the variances of θ̂2 and θ̂1; the stage 1 estimator explicitly involves the estimated covariance matrix of θ̂2, thereby accounting for the propagation of stage 2 estimation uncertainty into the estimation of the stage 1 Q-function Qopt1(H1, A1; θ1).
A further limitation is that this approach does not directly address the uncertainty about the optimal treatment for patients with 'small' – rather than
zero – treatment effects. Such situations may be better handled by a local asymptotic
framework. From this perspective, the PQ-learning method is still non-regular as it
is not consistent under local alternatives; see Laber et al. (2011) for further details
on this issue.
8.5 Double Bootstrap Confidence Intervals

The double bootstrap (see, e.g. Davison and Hinkley 1997; Nankervis 2005) is a
computationally intensive method for constructing CIs. Chakraborty et al. (2010)
implemented this method for inference in the context of Q-learning. Empirically it
was found to offer valid CIs for the policy parameters in the face of non-regularity.
Below we present a brief description.
Let θ̂ be an estimator of a parameter θ and θ̂∗ be its bootstrap version. As is
well-known, the 100(1 − α)% percentile bootstrap CI is given by ( θ̂∗(α/2), θ̂∗(1−α/2) ),
where θ̂∗(γ) is the 100γ-th percentile of the bootstrap distribution. Then the double
(percentile) bootstrap CI is calculated as follows:
1. Draw B1 first-step bootstrap samples from the original data. For each first-
step bootstrap sample, calculate the bootstrap version of the estimator θ̂ ∗b ,
b = 1, . . . , B1 .
2. Conditional on each first-step bootstrap sample, draw B2 second-step (nested)
bootstrap samples and calculate the double bootstrap versions of the estimator,
e.g., θ̂ ∗∗bm , b = 1, . . . , B1 , m = 1, . . . , B2 .
3. For b = 1, . . . , B1, calculate u∗b = (1/B2) ∑_{m=1}^{B2} I[θ̂∗∗bm ≤ θ̂], where θ̂ is the estimator
based on the original data.
4. The double bootstrap CI is given by ( θ̂∗q̂(α/2), θ̂∗q̂(1−α/2) ), where q̂(γ) = u∗(γ), the
100γ-th percentile of the distribution of u∗b, b = 1, . . . , B1.
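A direct transcription of steps 1–4 into Python might look as follows; this is a sketch in which `estimator` is any user-supplied function mapping a data array to a scalar estimate, and all names are illustrative.

```python
import numpy as np

def double_bootstrap_ci(data, estimator, alpha=0.05, B1=500, B2=100, rng=None):
    """Double (percentile) bootstrap CI following steps 1-4 above."""
    rng = np.random.default_rng(rng)
    n = len(data)
    theta_hat = estimator(data)
    theta_star = np.empty(B1)
    u_star = np.empty(B1)
    for b in range(B1):
        sample = data[rng.integers(n, size=n)]          # step 1: first-step resample
        theta_star[b] = estimator(sample)
        inner = np.empty(B2)
        for m in range(B2):
            nested = sample[rng.integers(n, size=n)]    # step 2: nested resample
            inner[m] = estimator(nested)
        u_star[b] = np.mean(inner <= theta_hat)         # step 3
    # Step 4: adjusted quantile levels taken from the distribution of u*.
    q_lo = np.quantile(u_star, alpha / 2)
    q_hi = np.quantile(u_star, 1 - alpha / 2)
    return np.quantile(theta_star, q_lo), np.quantile(theta_star, q_hi)
```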
Next we attempt to provide some intuition1 about the double bootstrap using the
bagged hard-max estimator. Bagging (Breiman 1996), a nickname for bootstrap ag-
gregating, is a well-known ensemble method used to smooth “unstable” estimators,
e.g. decision trees in classification. Bagging was originally motivated by Breiman
as a variance-reduction technique; however Bühlmann and Yu (2002) showed that it
is a smoothing operation that also reduces the mean squared error of the estimator
in the case of decision trees, where a “hard decision” based on an indicator function
is taken. Note that in the context of Q-learning, the hard-max pseudo-outcome can
be re-written as
1 This is unpublished work, but the first author was pointed to this direction by Dr. Susan Murphy
(personal communication).
Ŷ1i = Y1i + β̂2T H20,i + (ψ̂2T H21,i) · ( 2 · I[ψ̂2T H21,i > 0] − 1 ).   (8.14)
The second term in (8.14) contains an indicator function (as in a decision tree).
Hence one can expect that the bagged version of the hard-max estimator will ef-
fectively “smooth out” the effect of this indicator function (e.g. replace the hard
decision by a soft decision) and hence should reduce the degree of non-regularity.
More precisely, bagging would effectively replace the indicator I[ψ̂2T H21,i > 0] by
Φ( √n ψ̂2T H21,i / √(H21,iT Σ̂ψ̂2 H21,i) ); see Bühlmann and Yu (2002) for details. The bagged hard-max
estimator of ψ1 can be calculated as follows:
1. Construct a bootstrap sample of size n from the original data.
2. Compute the bootstrap version ψ̂1∗ of the usual hard-max estimator ψ̂1 .
3. Repeat steps 1 and 2 above B2 times, yielding ψ̂1∗1, . . . , ψ̂1∗B2. Then the bagged
hard-max estimator is given by ψ̂1Bag = (1/B2) ∑_{b=1}^{B2} ψ̂1∗b.
When it comes to constructing CIs, the effect of considering a usual bootstrap CI
using B1 replications along with the bagged hard-max estimator (already using B2
bootstrap replications) is, in a way, equivalent to considering a double bootstrap CI
in conjunction with the original (un-bagged) hard-max estimator.
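Steps 1–3 amount to a few lines of code; the sketch below bags an arbitrary estimator by averaging bootstrap replicates (the names are illustrative, not from a package).

```python
import numpy as np

def bagged_estimate(data, estimator, B2=100, rng=None):
    """Bagged version of an estimator (steps 1-3 above): average over bootstrap replicates."""
    rng = np.random.default_rng(rng)
    n = len(data)
    reps = [estimator(data[rng.integers(n, size=n)]) for _ in range(B2)]
    return np.mean(reps, axis=0)
```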
8.6 Adaptive Bootstrap Confidence Intervals

The adaptive confidence interval of Laber et al. (2011) is based on writing cT√n(θ̂1 − θ1) as a sum Wn + Un, where the first term is smooth and the second term is non-smooth. While Wn is asymptotically normally distributed, the distribution of Un depends on the underlying data-generating process "non-smoothly". To illustrate the effect of this non-
smoothness, fix H21 = h21 . If hT21 ψ2 > 0, then Un is asymptotically normal with
mean zero. On the other hand, Un has a non-normal asymptotic distribution if
hT21ψ2 = 0. Thus, the asymptotic distribution of cT√n(θ̂1 − θ1) depends abruptly
on both the true parameter ψ2 and the distribution of patient features H21. In particular, the asymptotic distribution of cT√n(θ̂1 − θ1) depends on the frequency of
patient features H21 = h21 for which there is no treatment effect (i.e. features for
which hT21 ψ2 = 0). As discussed earlier in this chapter, this non-regularity compli-
cates the construction of CIs for cT θ1 .
The procedure involves a subject-specific pretest; define

Tn(h21) = n (hT21 ψ̂2)² / (hT21 Σ̂ψ̂2 h21),

where Σ̂ψ̂2/n is the estimated covariance matrix of ψ̂2. Note that Tn(h21)
corresponds to the usual test statistic when testing the null hypothesis: hT21 ψ2 = 0.
The pretests are performed using a cutoff λn , which is a tuning parameter of the
procedure and can be varied; to optimize performance, Laber et al. (2011) used
λn = log log n in their simulation study and √data analysis.
Let the upper and lower bounds on cT√n(θ̂1 − θ1) discussed above be given
by U(c) and L(c) respectively; both of these quantities are functions of λn.
Laber et al. (2011) showed that the limiting distributions of cT√n(θ̂1 − θ1) and
U(c) are equal in the case H21Tψ2 ≠ 0 with probability one. Similarly, the limiting distributions of cT√n(θ̂1 − θ1) and L(c) are equal in the case H21Tψ2 ≠ 0 with
probability one. That is, when there is a large treatment effect for almost all patients
then the upper (or lower) bound is tight. However, when there is a non-null subset of
patients for which there is no treatment effect, then the limiting distribution of the
upper bound is stochastically larger than the limiting distribution of cT√n(θ̂1 − θ1).
This adaptivity between non-regular and regular settings is a key feature of this
procedure.
Next we discuss how to actually construct the CIs by this procedure. By con-
struction of U (c) and L (c), it follows that
cT θ̂1 − U(c)/√n ≤ cT θ1 ≤ cT θ̂1 − L(c)/√n.

The distributions of U(c) and L(c) are approximated using the bootstrap. Let û
be the 1 − α/2 quantile of the bootstrap distribution of U(c), and let l̂ be the α/2
quantile of the bootstrap distribution of L(c). Then [cT θ̂1 − û/√n, cT θ̂1 − l̂/√n] is
the desired confidence interval.
Through a series of theorems, Laber et al. (2011) proved the consistency of the
bootstrap in this context, and in particular that
P( cT θ̂1 − û/√n ≤ cT θ1 ≤ cT θ̂1 − l̂/√n ) ≥ 1 − α + oP(1).
8.7 m-out-of-n Bootstrap Confidence Intervals

The m-out-of-n bootstrap is a well-known tool for producing valid confidence sets
for non-smooth functionals (Shao 1994; Bickel et al. 1997). This method is the
same as the usual nonparametric bootstrap (Efron 1979) except that the resample
size, historically denoted by m, is of a smaller order of magnitude than the orig-
inal sample size n. More precisely, m depends on n, tends to infinity with n, and
satisfies m = o(n). Intuitively, the m-out-of-n bootstrap works asymptotically by
letting the empirical distribution tend to the true generative distribution at a faster
rate than the analogous convergence of the bootstrap empirical distribution to the
empirical distribution. In essence, this allows the empirical distribution to reach
its limit ‘first’ so that bootstrap resamples behave as if they were drawn from the
true generative distribution. Unfortunately, the choice of the resample size m has
long been a difficult obstacle since the condition m = o(n) is purely asymptotic and
thus provides no guidance for finite samples. Data-driven approaches for choosing
m in various contexts were given by Hall et al. (1995), Lee (1999), Cheung et al.
(2005), and Bickel and Sakov (2008). However, these choices were not directly
applicable to the present problem. In the approach of Chakraborty et al. (2013), the resample size is tied to a subject-level pretest: one checks whether the test statistic for hT21ψ2 = 0 falls in the acceptance region of the null hypothesis. Thus, a natural choice for the cutoff τn(h21) is hT21 Σ̂21 h21 · χ²1,1−ν, where n−1 Σ̂21 is the plug-in estimator of the asymptotic covariance matrix of ψ̂2, and χ²1,1−ν denotes the (1 − ν) quantile of the χ² distribution
with 1 degree of freedom. Chakraborty et al. (2013) used ν = 0.001 in their simulations, and also showed robustness of results to this choice of ν via a thorough
sensitivity analysis.
As before, let c ∈ Rdim(θ1) be a known vector. To form a (1 − η) × 100% confidence interval for cT θ1, first find l̂ and û, the (η/2) × 100 and (1 − η/2) × 100
percentiles of cT √m (θ̂1(b) − θ̂1) respectively, where θ̂1(b) is the m-out-of-n bootstrap
analog of θ̂1 (the dependence of θ̂1(b) on m is implicit in the notation). The confidence interval is then given by ( cT θ̂1 − û/√m, cT θ̂1 − l̂/√m ).
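A sketch of this m-out-of-n centered percentile bootstrap interval for a given resample size m is given below (how m is chosen is described next; `estimator` and the data layout are illustrative assumptions).

```python
import numpy as np

def m_out_of_n_ci(data, estimator, m, c, eta=0.05, B=1000, rng=None):
    """m-out-of-n centered percentile bootstrap CI for c' theta_1 (sketch)."""
    rng = np.random.default_rng(rng)
    n = len(data)
    est = c @ estimator(data)
    boot = np.empty(B)
    for b in range(B):
        resample = data[rng.integers(n, size=m)]   # resample size m < n
        boot[b] = c @ estimator(resample)
    # Percentiles of c' sqrt(m) (theta^(b) - theta_hat).
    root = np.sqrt(m) * (boot - est)
    l_hat = np.quantile(root, eta / 2)
    u_hat = np.quantile(root, 1 - eta / 2)
    return est - u_hat / np.sqrt(m), est - l_hat / np.sqrt(m)
```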
Next we describe the double bootstrap procedure for choosing the tuning
parameter α employed to define m. Suppose cT θ1 is the parameter of interest,
and its estimate from the original data is cT θ̂1 . Consider a grid of possible values
of α ; Chakraborty et al. (2013) used {0.025, 0.05, 0.075, . . ., 1} in their simulation
study and data analysis. The exact algorithm follows.
1. Draw B1 usual n-out-of-n first-stage bootstrap samples from the data and calculate the corresponding bootstrap estimates cT θ̂1(b1), b1 = 1, . . . , B1. Fix α at the
smallest value in the grid.
2. Compute the corresponding values of m̂(b1 ) using Eq. (8.16), b1 = 1, . . . , B1 .
3. Conditional on each first-stage bootstrap sample, draw B2 m̂(b1)-out-of-n second-stage (nested) bootstrap samples and calculate the double bootstrap versions of
the estimate cT θ̂1(b1b2), b1 = 1, . . . , B1, b2 = 1, . . . , B2.
4. For b1 = 1, . . . , B1, compute the (η/2) × 100 and (1 − η/2) × 100 percentiles of
cT √m̂(b1) ( θ̂1(b1b2) − θ̂1(b1) ), b2 = 1, . . . , B2, say l̂DB(b1) and ûDB(b1) respectively.
Construct the double centered percentile bootstrap CI from the b1-th first-stage bootstrap data as ( cT θ̂1(b1) − ûDB(b1)/√m̂(b1), cT θ̂1(b1) − l̂DB(b1)/√m̂(b1) ), b1 = 1, . . . , B1.
5. Estimate the coverage rate of the double bootstrap CI from all the first-stage
bootstrap data sets as

(1/B1) ∑_{b1=1}^{B1} I[ cT θ̂1(b1) − ûDB(b1)/√m̂(b1) ≤ cT θ̂1 ≤ cT θ̂1(b1) − l̂DB(b1)/√m̂(b1) ].
6. If the above coverage rate is at or above the nominal rate, up to Monte Carlo
error, then pick the current value of α as the final value. Otherwise, update α to
its next higher value in the grid.
7. Repeat steps 2–6, until the coverage rate of the double bootstrap CI, up to Monte
Carlo error, attains the nominal coverage rate, or the grid is exhausted.2 A sketch implementing this grid search is given below.
2 If this unlikely event does occur, one should examine the observed values of p̂. If the values of p̂
are concentrated close to zero, ν may be increased; if not, the maximal value in the grid should be
increased.
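The grid search of steps 1–7 can be sketched as follows; because Eq. (8.16) is not reproduced here, the resample-size rule is passed in as a user-supplied function m_hat(sample, alpha), and all other names are likewise illustrative.

```python
import numpy as np

def choose_alpha(data, estimator, c, m_hat, grid=None,
                 eta=0.05, B1=500, B2=100, rng=None):
    """Double-bootstrap choice of alpha for the m-out-of-n resample size (steps 1-7)."""
    rng = np.random.default_rng(rng)
    n = len(data)
    grid = np.arange(0.025, 1.001, 0.025) if grid is None else grid
    target = c @ estimator(data)
    # Step 1: first-stage n-out-of-n bootstrap samples and estimates.
    samples = [data[rng.integers(n, size=n)] for _ in range(B1)]
    firsts = np.array([c @ estimator(s) for s in samples])
    for alpha in grid:                                   # steps 6-7: walk up the grid
        covered = 0
        for s, first in zip(samples, firsts):
            m = int(m_hat(s, alpha))                     # step 2: Eq. (8.16), user-supplied
            # Steps 3-4: nested m-out-of-n bootstrap, double centered percentile CI.
            root = np.array([np.sqrt(m) * (c @ estimator(s[rng.integers(n, size=m)]) - first)
                             for _ in range(B2)])
            lo = first - np.quantile(root, 1 - eta / 2) / np.sqrt(m)
            hi = first - np.quantile(root, eta / 2) / np.sqrt(m)
            covered += (lo <= target <= hi)              # step 5
        if covered / B1 >= 1 - eta:                      # step 6: nominal coverage reached
            return alpha
    return grid[-1]                                      # grid exhausted (see footnote)
```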
8.8 Simulation Study

Each generative model in the simulation study can be characterized by two quantities: (i) the probability of generating an individual history such that γ5A2 + γ6O2A2 + γ7A1A2 = 0, and (ii) the standardized effect size E(γ5 + γ6O2 + γ7A1)/√Var(γ5 + γ6O2 + γ7A1). These two quantities, denoted by p
and φ , respectively, can be thought of as measures of non-regularity. Note that for
fixed parameter values, the linear combination (γ5 + γ6 O2 + γ7 A1 ) that governs the
non-regularity in an example generative model can take only four possible values
corresponding to the four possible (O2 , A1 ) cells. The cell probabilities can be easily
calculated; the formulae are provided in Table 8.2. Using the quantities presented
in Table 8.2, one can write
E[γ5 + γ6 O2 + γ7 A1 ] = q1 f1 + q2 f2 + q3 f3 + q4 f4 ,
E[(γ5 + γ6 O2 + γ7 A1 )2 ] = q1 f12 + q2 f22 + q3 f32 + q4 f42 .
From these two, one can calculate Var[γ5 + γ6 O2 + γ7 A1 ], and subsequently the effect
size φ .
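For example, given the four cell probabilities q1, . . . , q4 and the corresponding values f1, . . . , f4 of the linear combination, the effect size φ can be computed with a small helper like the following (the function name is ours).

```python
import numpy as np

def effect_size_phi(q, f):
    """Standardized effect size phi = E[L] / sd(L) for L = gamma5 + gamma6*O2 + gamma7*A1,
    where L takes value f[k] with probability q[k] over the four (O2, A1) cells."""
    q, f = np.asarray(q), np.asarray(f)
    mean = np.sum(q * f)
    var = np.sum(q * f**2) - mean**2
    return mean / np.sqrt(var)
```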
Table 8.3 provides the parameter settings; the first six of these settings were
constructed by Chakraborty et al. (2010), and were described therein as “non-
regular,” “near-non-regular,” and “regular.” Example 1 is a setting where there is
no treatment effect for any subject (any possible history) at either stage. Example
2 is similar to example 1, except that there is a very weak stage 2 treatment effect for
every subject; this effect is hard to detect given the noise level in
the data. Example 3 is a setting where there is no stage 2 treatment effect for half
the subjects in the population, but a reasonably large effect for the other half of
subjects. In example 4, there is a very weak stage 2 treatment effect for half the
subjects in the population, but a reasonably large effect for the other half of sub-
jects (the parameters are close to those in example 3). Example 5 is a setting where
there is no stage 2 treatment effect for one-fourth of the subjects in the population,
but others have a reasonably large effect. Example 6 is a completely regular setting
where there is a reasonably large stage 2 treatment effect for every subject in the
population. Song et al. (2011) also used these six examples for empirical evaluation
of their PQ-learning method.
To these six, Laber et al. (2011) added three further examples labeled A, B, and
C. Example A is an example of a strongly regular setting. Example B is an example
of a non-regular setting where the non-regularity is strongly dependent on the stage
1 treatment. In example B, for histories with A1 = 1, there is a moderate effect of
A2 at the second stage. However, for histories with A1 = −1, there is no effect of
A2 at the second stage, i.e., both actions at the second stage are equally optimal.
In example C, for histories with A1 = 1, there is a moderate effect of A2 , and for
histories with A1 = −1, there is a small effect of A2 . Thus example C is a “near-
non-regular” setting that behaves similarly to example B.
The Q-learning analysis models used in the simulation study are given by
H20 = (1, O1 , A1 , O1 A1 )T ,
H21 = (1, O2 , A1 )T ,
H10 = (1, O1 )T ,
H11 = (1, O1 )T .
So the models for the Q-functions are correctly specified. For the purpose of
inference, the focus is on ψ10 and ψ11 , the parameters associated with stage 1 treat-
ment A1 in the analysis model. They can be expressed in terms of the γs and δs, the
parameters of the generative model; the expressions involve the quantities
q1 = q3 = (1/4)( expit(δ1 + δ2) − expit(−δ1 + δ2) ) and q2 = q4 = (1/4)( expit(δ1 − δ2) − expit(−δ1 − δ2) ).
Let θ̂ denote an estimator based on the original data and θ̂(b) its bootstrap
version. Then the 100(1 − α)% CPB (centered percentile bootstrap) confidence interval is given by
( 2θ̂ − θ̂(b)(1−α/2), 2θ̂ − θ̂(b)(α/2) ), where θ̂(b)(γ) is the 100γ-th percentile of the bootstrap
distribution. The competing methods are listed below:
distribution. The competing methods are listed below:
(i) CPB interval in conjunction with the (original) hard-max estimator (CPB-
HM);
(ii) CPB interval in conjunction with the hard-threshold estimator with α = 0.08
(CPB-HT0.08 );
(iii) CPB interval in conjunction with the soft-threshold estimator (CPB-ST);
(iv) Double bootstrap interval in conjunction with the hard-max estimator (DB-
HM);
(v) Asymptotic confidence interval in conjunction with the PQ-learning estimator
(PQ);
(vi) Adaptive bootstrap confidence interval (ACI);
(vii) m-out-of-n CPB interval with fixed α = 0.1, in conjunction with the hard-max
estimator (m̂0.1 -CPB-HM);
(viii) m-out-of-n CPB interval with data-driven α chosen by double bootstrap, in
conjunction with the hard-max estimator (m̂α̂ -CPB-HM);
(ix) m-out-of-n CPB interval with fixed α = 0.1, in conjunction with the soft-
threshold estimator (m̂0.1 -CPB-ST);
(x) m-out-of-n CPB interval with data-driven α chosen by double bootstrap, in
conjunction with the soft-threshold estimator (m̂α̂-CPB-ST).
The comparisons are conducted on a variety of settings represented by examples
1–6 and A–C, using N = 1,000 simulated data sets, B = 1,000 bootstrap replications,
and a sample size of n = 300. However, the double bootstrap CIs are based on
B1 = 500 first-stage and B2 = 100 second-stage bootstrap iterations, due to the increased computational burden. Note that here we simply compile the results from the
original papers rather than implementing and running the methods afresh; as a consequence,
results are not available for every method in every example.
We focus on the coverage rate and width of CIs for the parameter ψ10 that denotes
the main effect of treatment; see Table 8.4 for coverage and Table 8.5 for width of
CIs. Different authors also reported results for the stage 1 interaction parameter ψ11;
however, the effect of non-regularity is less pronounced on this parameter, and hence
it is less interesting for the purpose of illustrating non-regularity and comparing the
competing methods.
First, let us focus on Table 8.4. As expected from the inconsistency of the usual n-
out-of-n bootstrap in the present non-regular problem, the CPB-HM method shows
the problem of under-coverage in most of the examples. While CPB-HT0.08 , by
virtue of bias correction via thresholding (see Moodie and Richardson 2010), per-
forms well in Ex. 1–4, it fares poorly in Ex. 5–6 (and was never implemented in
Ex. A–C). Similarly CPB-ST performs well, again by virtue of bias correction via
thresholding (see Chakraborty et al. 2010), except in Ex. 6, A, and B. The compu-
tationally expensive double bootstrap method (DB-HM) performs well across the
first six examples (but was never tried on Ex. A–C). The PQ method (see Song
et al. 2011) performs well across the first six examples (but was never tried on
Ex. A–C).
Table 8.4 Monte Carlo estimates of coverage probabilities of confidence intervals for the main
effect of treatment (ψ10 ) at the 95 % nominal level. Estimates significantly below 0.95 at the 0.05
level are marked with ∗. Examples are designated NR non-regular, NNR near-non-regular, R regular
Ex. 1 Ex. 2 Ex. 3 Ex. 4 Ex. 5 Ex. 6 Ex. A Ex. B Ex. C
n = 300
NR NNR NR NNR NR R R NR NNR
CPB-HM 0.936 0.932* 0.928* 0.921* 0.933* 0.931* 0.944 0.925* 0.922*
CPB-HT0.08 0.950 0.953 0.943 0.941 0.932* 0.885* – – –
CPB-ST 0.962 0.961 0.947 0.946 0.942 0.918* 0.918* 0.931* 0.938
DB-HM 0.936 0.936 0.948 0.944 0.942 0.950 – – –
PQ 0.951 0.940 0.952 0.955 0.953 0.953 – – –
ACI 0.994 0.994 0.975 0.976 0.962 0.957 0.950 0.977 0.976
m̂0.1 -CPB-HM 0.984 0.982 0.956 0.955 0.943 0.949 0.953 0.971 0.970
m̂α̂ -CPB-HM 0.964 0.964 0.953 0.950 0.939 0.947 0.944 0.955 0.960
m̂0.1 -CPB-ST 0.993 0.993 0.979 0.976 0.954 0.943 0.939 0.972 0.977
m̂α̂ -CPB-ST 0.971 0.976 0.961 0.956 0.949 0.935 0.926* 0.971 0.967
Table 8.5 presents the Monte Carlo estimates of the mean width of CIs. Mean
widths corresponding to CPB-HT0.08 , DB-HM and PQ were not reported in the
original papers in which they appeared. Among the rest of the methods, as expected,
Table 8.5 Monte Carlo estimates of the mean width of confidence intervals for the main effect of
treatment (ψ10 ) at the 95 % nominal level. Widths with corresponding coverage significantly below
nominal are marked with ∗. Examples are designated NR non-regular, NNR near-non-regular, R
regular
Ex. 1 Ex. 2 Ex. 3 Ex. 4 Ex. 5 Ex. 6 Ex. A Ex. B Ex. C
n = 300
NR NNR NR NNR NR R R NR NNR
CPB-HM 0.269 0.269* 0.300* 0.300* 0.320* 0.309* 0.314 0.299* 0.299*
CPB-HT0.08 – – – – – – – – –
CPB-ST 0.250 0.250 0.293 0.293 0.319 0.319* 0.323* 0.303* 0.304
DB-HM – – – – – – – – –
PQ – – – – – – – – –
ACI 0.354 0.354 0.342 0.342 0.341 0.327 0.327 0.342 0.342
m̂0.1 -CPB-HM 0.346 0.347 0.341 0.341 0.340 0.341 0.332 0.342 0.343
m̂α̂ -CPB-HM 0.331 0.331 0.321 0.323 0.330 0.336 0.322 0.328 0.328
m̂0.1 -CPB-ST 0.324 0.324 0.336 0.336 0.343 0.352 0.343 0.353 0.353
m̂α̂ -CPB-ST 0.273 0.275 0.306 0.306 0.328 0.349 0.331* 0.330 0.332
CIs constructed via the usual n-out-of-n method (CPB-HM and CPB-ST) have the
least width; however these are often associated with under-coverage. The widths
of the CIs from the last five methods are quite comparable, with m̂α̂ -CPB-HM and
m̂α̂ -CPB-ST offering narrower CIs more often.
Given the above findings, it is very hard to declare an overall winner. From a
purely theoretical standpoint, the ACI method (Laber et al. 2011) is arguably the
strongest since it uses a local asymptotic framework. However it is conceptually
complicated, computationally expensive, and often conservative in finite samples.
In terms of finite sample performance, both versions of the m-out-of-n bootstrap
method (Chakraborty et al. 2013) are at least as good as (and often better than) the
ACI method; moreover, they are conceptually very simple and hence may be more
attractive to practitioners. The version with fixed α (m̂0.1 -CPB-HM), while simi-
lar to ACI in conservatism, is computationally much cheaper. On the other hand,
the version with data-driven choice of α (m̂α̂ -CPB), while computationally as de-
manding as the ACI, overcomes the conservatism and provides nominal coverage in
all the examples. Nonetheless, m-out-of-n bootstrap methods are valid only under
fixed alternatives, not under local alternatives. The PQ-learning method (Song et al.
2011) is also valid only under fixed alternatives but not under local alternatives.
This method is non-conservative in Ex. 1–6, and is computationally the cheapest.
However its coverage performance in Ex. A–C and the mean widths of CIs resulting
from this method in all the examples are unknown to us at this point.
Note that the bias maps of Fig. 8.1 in Sect. 8.2 were created in a scenario where
γ5 + γ6 O2 + γ7 A1 = 0 with positive probability. As noted previously, the generative
parameters γ5 , γ6 and γ7 correspond to the policy parameters ψ20 , ψ21 , and ψ22 of
the analysis model, respectively. For all bias maps in the figure, γ1 = γ2 = γ4 = 0 and
γ3 = −0.5; the first three plots (upper panel) explored the extent of bias in regions
around the parameter setting given in Ex. 5 of Table 8.3, while the last three plots
(lower panel) explore the extent of bias in regions around the parameter setting in
Ex. 6 of Table 8.3. More precisely, in the first three plots, δ1 = 1, δ2 = 0; and only
one of ψ20 (= γ5 ), ψ21 (= γ6 ), or ψ22 (= γ7 ) was varied while the remaining were
fixed (e.g. (ψ21 , ψ22 ) = (0.5, 0.5) fixed in the first plot, (ψ20 , ψ22 ) = (1.0, 0.5) fixed
in the second plot, and (ψ20 , ψ21 ) = (1.0, 0.5) fixed in the third plot). Similarly, in
the last three plots, δ1 = δ2 = 0.1; and only one of ψ20 , ψ21 , or ψ22 was varied
while the remaining were fixed, e.g. (ψ21 , ψ22 ) = (0.5, 0.5) fixed in the first plot of
the lower panel, (ψ20 , ψ22 ) = (0.25, 0.5) fixed in the second plot of the lower panel,
and (ψ20 , ψ21 ) = (0.25, 0.5) fixed in the third plot of the lower panel.
8.9 Analysis of STAR*D Data: An Illustration

Selective serotonin reuptake inhibitors (SSRIs) are the most commonly prescribed
class of antidepressants with simple dosing regimens and a preferable adverse effect
profile in comparison to other types of antidepressants (Nelson 1997; Mason et al.
2000). Serotonin is a neurotransmitter in the human brain that regulates a variety
of functions including mood; SSRIs affect the serotonin-based brain circuits. Other
classes of antidepressants may act on serotonin in concert with other neurotransmitter systems, or on entirely different neurotransmitters. While a meta-analysis of the
efficacy trials of four antidepressants submitted to the US Food and Drug Administration, for which full data sets were available, found that pharmacological treatment
of depression was no more effective than placebo for mild to moderate depression,
other studies support the effectiveness of SSRIs and other antidepressants in primary care settings (Arroll et al. 2005, 2009). Few studies have examined treatment
patterns, and in particular, few have studied best prescribing practices following
treatment failure.
Sequenced Treatment Alternatives to Relieve Depression (STAR*D) was a
multisite, multi-level randomized controlled trial designed to assess the comparative
effectiveness of different treatment regimes for patients with major depressive disor-
der, and was introduced earlier in Chap. 2. See Sect. 2.4.2 for a detailed description
of the study design along with a schematic of the treatment assignment algorithm.
Here we will focus on levels 2, 2A, and 3 of the study only. For the purpose of the
current analysis, we will classify the treatments into two categories: (i) treatment
with an SSRI (alone or in combination): sertraline (SER), CIT + bupropion (BUP),
CIT + buspirone (BUS), or CIT + cognitive psychotherapy (CT) or (ii) treatment
with one or more non-SSRIs: venlafaxine (VEN), BUP, or CT alone. Only the
patients assigned to CIT + CT or CT alone in level 2 were eligible, in the case of a
non-satisfactory response, to move to a supplementary level of treatment (level 2A),
to receive either VEN or BUP. Patients not responding satisfactorily at level 2 (and
level 2A, if applicable) would continue to level 3. Treatment options at level 3 can
again be classified into two categories, i.e. treatment with (i) SSRI: an augmentation
of any SSRI-containing level 2 treatment with either lithium (Li) or thyroid hor-
mone (THY), or (ii) non-SSRI: mirtazapine (MIRT) or nortriptyline (NTP), or an
augmentation of any non-SSRI level 2 treatment with either Li or THY.
8.9.2 Analysis
One thousand two hundred and sixty patients were used at stage 1 (level 2); a small
number (19) of patients were omitted altogether due to gross item missingness in the
covariates. Of the 1,260 patients at stage 1, there were 792 who were non-remitters
(QIDS > 5) who should have moved to stage 2 (level 3); however, only 324 patients
were present at stage 2 while the rest dropped out. To adjust for this dropout, the
model for Qopt2 was fitted using inverse probability weighting where the probability
of being present at stage 2 was estimated by logistic regression using O11 , O21 , O31 ,
A1 , −Y1 , O22 , O11 A1 , O21 A1 , and O31 A1 as predictors.
Another complexity arose in the computation of the pseudo-outcome,
maxa2 Qopt2. Note that for the (792 − 324) = 468 non-remitters who were absent from
stage 2, covariates O12 (QIDS.start at stage 2) and O32 (preference at stage 2)
were missing, rendering the computation of the pseudo-outcome impossible for
them. For these patients, the value of O12 was imputed by the last observed QIDS
score in the previous stage – a sensible strategy for a continuous, slowly changing
variable like the QIDS score. On the other hand, the missing values of the binary
variable O32 (preference at stage 2) were imputed using k nearest neighbor (k-NN)
classification, where k was chosen via leave-one-out cross-validation. Following
these imputations, Q-learning was implemented for this data; the estimates of the
parameters of the Q-functions, along with their 95 % bootstrap CIs were computed.
While only the usual bootstrap was used at stage 2, both the usual bootstrap and the
adaptive m-out-of-n bootstrap procedure (with α chosen via double bootstrap) were
employed at stage 1, to facilitate ready comparison.
8.9.3 Results
Results of the above analysis are presented in Table 8.6. In this analysis, m
was chosen to be 1,059 in a data-driven way (using double bootstrap). At both
stages, the coefficient of QIDS.start (β12 and β11 ) and the coefficient of prefer-
ence (β32 and β31 ) were statistically significant. Additionally ψ31 , the coefficient
of preference-by-treatment interaction at stage 1 was significantly different from 0;
this fact is particularly interesting because it suggests that the decision rule at stage
1 should be individually tailored based on preference.
The estimated optimal DTR can be explicitly described in terms of the ψ̂ s:
dˆ2opt (H2 ) = sign(−0.18 − 0.01O12 − 0.25O22), and dˆ1opt (H1 ) = sign(−0.73 +
0.01O11 + 0.01O21 − 0.67O31). That is, the estimated optimal DTR suggests treat-
ing a patient at stage 2 with an SSRI if (−0.18 − 0.01 × QIDS.start2 − 0.25 ×
QIDS.slope2 ) > 0, and with a non-SSRI otherwise. Similarly, it suggests treat-
ing a patient at stage 1 with an SSRI if (−0.73 + 0.01 × QIDS.start1 + 0.01 ×
QIDS.slope1 − 0.67 × preference1 ) > 0, and with a non-SSRI otherwise.
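For illustration, the two estimated rules can be packaged into a small function; the ±1 coding of preference1 used below is an assumption made for the sketch, matching the ±1 treatment coding used throughout.

```python
def star_d_decisions(qids_start1, qids_slope1, preference1, qids_start2, qids_slope2):
    """Apply the estimated optimal DTR from the STAR*D analysis (point estimates only).

    Returns +1 for SSRI and -1 for non-SSRI at each stage; preference1 is assumed
    to be coded as a +/-1 contrast for this illustration.
    """
    d1 = 1 if (-0.73 + 0.01 * qids_start1 + 0.01 * qids_slope1
               - 0.67 * preference1) > 0 else -1
    d2 = 1 if (-0.18 - 0.01 * qids_start2 - 0.25 * qids_slope2) > 0 else -1
    return d1, d2
```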
Table 8.6 Regression coefficients and their 95 % centered percentile bootstrap CIs (both the usual
n-out-of-n and the novel m-out-of-n) in the analysis of STAR*D data (significant coefficients are
in bold)
However, these are just the “point estimates” of the optimal decision rules.
A measure of confidence for these estimated decision rules can be formulated as
follows. Note that the estimated difference in mean outcome at stage 2 corresponding to the two treatment options is given by 2 × (−0.18 − 0.01 × QIDS.start2 − 0.25 × QIDS.slope2).
For any fixed values of QIDS.start, QIDS.slope, and preference, one can construct
point-wise CIs for the above difference in mean outcome (or, pseudo-outcome)
based on the CIs for the individual ψ s, thus leading to a confidence band around the
entire function. The mean difference function and its 95 % confidence band over the
observed range of QIDS.start and QIDS.slope are plotted for stage 1 (separately for
preference = “switch” and preference = “augment or no preference”) and for stage
2 (patients with all preferences combined), and are presented in Fig. 8.3. Since the
confidence bands in all three panels contain zero, there is insufficient evidence in
the data to recommend a unique best treatment.
Fig. 8.3 Predicted difference in mean outcome and its 95 % confidence band for: (a) patients pre-
ferring treatment switch at stage 1; (b) patients either preferring treatment augmentation or without
preference at stage 1; and (c) all patients at stage 2
8.10 Inference About the Value of an Estimated DTR

In Sect. 5.1, we discussed estimation of the value of an arbitrary DTR. Once a DTR
dˆ is estimated from the data (say, via Q-learning, G-estimation, etc.), a key quantity
to assess its merit is its true value, V^dˆ. A point estimate of this quantity, say V̂^dˆ,
can be obtained, for example, by the IPTW formula (see Sect. 5.1). However, it may
be more interesting to construct a confidence interval for V^dˆ and see if the confi-
dence interval contains the optimal value V opt (implying that the estimated DTR is
not significantly different from the optimal DTR), or the value of some other pre-
specified (not necessarily optimal) DTR. It turns out that the estimation of the value
of an estimated DTR, or constructing a confidence interval for it, is a very difficult
problem.
From Sect. 5.1, we can express the value of dˆ by
V^dˆ = ∫ { ∏_{j=1}^K I[Aj = dˆj(Hj)] / πj(Aj|Hj) } Y dPπ,   (8.17)
where π is an embedded DTR in the study from which the data arose (e.g. the
randomization probabilities in the study); see Sect. 5.1 for further details. Note
that (8.17) can be alternatively expressed as
V^dˆ = ∫ { ∏_{j=1}^K 1/πj(Aj|Hj) } Y { ∏_{j=1}^K I[Aj = dˆj(Hj)] } dPπ
     = ∫ c(O1, A1, . . . , OK+1; π) ∏_{j=1}^K I[Aj = dˆj(Hj)] dPπ,   (8.18)

where

c(O1, A1, . . . , OK+1; π) = { ∏_{j=1}^K 1/πj(Aj|Hj) } Y
is a function of the entire data trajectory and the embedded DTR π . Note that the
form of the value function, as expressed in (8.18), is analogous to the test error
(misclassification rate) of a classifier in a weighted (or, cost-sensitive) classifica-
tion problem, where c(O1 , A1 , . . . , OK+1 ; π ) serves as the weight (or, cost) function.
Zhao et al. (2012) vividly discussed this analogy in a single-stage decision problem;
see also Sect. 5.3.
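The IPTW point estimate V̂^dˆ in (8.17) replaces the integral by an empirical average; a minimal sketch follows, with user-supplied rule and treatment-probability functions (all names illustrative).

```python
import numpy as np

def iptw_value(Y, A, H, d_hat, pi):
    """IPTW estimate of the value of an estimated DTR d_hat, per Eq. (8.17).

    Y: (n,) outcomes; A: (n, K) treatments; H: list of K per-stage history arrays;
    d_hat, pi: lists of K functions giving the estimated rule and the (known)
    treatment probabilities at each stage.
    """
    n, K = A.shape
    w = np.ones(n)
    for j in range(K):
        follows = (A[:, j] == d_hat[j](H[j]))            # I[A_j = d_j(H_j)]
        w *= follows / pi[j](A[:, j], H[j])              # inverse-probability weight
    return np.mean(w * Y)
```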
From this analogy, one can argue that the confidence intervals for the value func-
tion could be constructed in ways similar to those for confidence intervals for the
test error of a learned classifier. Unfortunately, constructing valid confidence in-
tervals for the test error in classification is an extremely difficult problem due to
the inherent non-regularity (note the presence of non-smooth indicator functions in
the definition of the value function); see Laber and Murphy (2011) for further de-
tails. Standard methods like normal approximation or the usual bootstrap fail in this
problem. Laber and Murphy (2011) developed a method for constructing such con-
fidence intervals by use of smooth data-dependent upper and lower bounds on the
test error; this method is similar to the method described in Sect. 8.6 in the context
of inference for Q-learning parameters. They proved that for linear classifiers, their
proposed confidence interval automatically adapts to the non-smoothness of the test
error, and is consistent under local alternatives. The method provided nominal cover-
age on a suite of test problems using a range of classification algorithms and sample
sizes. While intuitively one can expect that this method could be successfully used
for constructing confidence intervals for the value function, more research is needed
to extend and fine-tune the procedure to the current setting.
8.11 Bayesian Estimation

Following the estimation of the posterior density via direct calculation or, more
likely, Markov Chain Monte Carlo, the Bayesian analyst must then formulate opti-
mal decision rules. This can be done in a variety of manners, such as recommending
treatment if the posterior median of H Tj1 ψ j is greater than some threshold or if the
probability that the posterior mean of H Tj1 ψ j exceeds a threshold is greater than a
half. Decisions based on either of these rules will coincide when the posterior is nor-
mally distributed, but may not in general (i.e. when laws are exceptional). Alterna-
tively, both Arjas and Saarela (2010) and Zajonc (2012) considered a G-computation
like approach, and choose as optimal the rule that maximizes the posterior predictive
mean of the outcome.
8.12 Discussion
In this chapter, we have illustrated the problem of non-regularity that arises in the
context of inference about the optimal “current” (stage j) treatment rule, when the
optimal treatments at subsequent stages are non-unique for at least some non-null
proportion of subjects in the population. We have discussed and illustrated the phe-
nomenon using Q-learning as well as G-estimation.
Some of these tools are extensions of methods that are more familiar to applied quantitative researchers. Many of the inference tools dis-
cussed in this chapter can be extended to involve more stages and more treatment
options at each stage; see, for example, Laber et al. (2011) and Song et al. (2011).
Aside from notational complications, extending the adaptive m-out-of-n procedure
should also be straightforward.
Finally, we touched on the problems of inference for the value of an estimated
DTR, discussing the work of Laber and Murphy (2011), and Bayesian estimation.
These are very interesting yet very difficult problems, and little has yet appeared in
the literature. More targeted research is warranted.
Chapter 9
Additional Considerations and Final Thoughts
In estimating optimal adaptive treatment strategies, the variables used to tailor treat-
ment to patient characteristics are typically hand-picked by experts who seek to use
a minimum set of variables routinely available in clinical practice. However, studies
often use a large set of easy-to-measure covariates (e.g., multiple surveys of men-
tal health status and functioning) from which a smaller subset of variables must be
selected for any practical implementation of treatment tailoring. It may therefore
be desirable to be able to select tailoring variables with which to index the class of
regimes using automated or data-adaptive procedures. It has been noted that predic-
tion methods such as boosting could aid in selecting variables to adapt treatments
(LeBlanc and Kooperberg 2010); many such methods can be applied with ease, particularly to the regression-based approaches to estimating optimal DTRs. However,
their ability to select variables for strong interactions with treatment, rather than
simply strong predictive power, may require special care and further study.
9.1 Variable Selection

Recall the distinction between predictive variables (used to increase precision of
estimates) and prescriptive variables (used to adapt treatment strategies to patients),
i.e. tailoring variables (Gunter et al. 2007). In the Q-learning notation, predictive
variables correspond to the H j0 terms in the Q-function associated with parameters
β , while the prescriptive or tailoring variables are those contained in H j1 , asso-
ciated with parameters ψ . Tailoring variables must qualitatively interact with the
treatment, meaning that the choice of optimal treatment varies for different values
of such variables. The usefulness of a prescriptive variable can be characterized by
the magnitude of the interaction and the proportion of the population for whom the
optimal action changes given the knowledge of the variable (Gunter et al. 2007).
We will focus the discussion in this section on the randomized trial setting, so that
variable selection is strictly for the purposes of optimal treatment tailoring, rather
than elimination of bias due to confounding. Further, we will restrict attention to
the one-stage setting, as to date there have been no studies on the use of variable
selection for dynamic treatment regimes in the multi-stage setting.
Lu et al. (2013) proposed an adaptation of the lasso which penalizes only interaction
terms. Specifically, they consider a squared-error loss function of the A-learning form

Ln(ψ, β, α) = (1/n) ∑_{i=1}^n [ Yi − φ(Oi; β) − (Ai − π(Oi; α)) ψT Oi ]²,
where the covariate vector Oi is augmented by a column of 1s and has total length
p + 1, π (o) = P(A = 1|O = o; α ) is the propensity score for a binary treatment A
and φ (O) is an arbitrary function. Lu et al. (2013) noted that the estimating function
found by taking the derivative of the loss function Ln (ψ , β , α ) with respect to ψ
corresponds to an A-learning method of estimation, and is therefore robust to mis-
specification of the conditional mean model φ (O; β ) for the response Y in the sense
that the estimator requires correct specification of either the propensity score or the
mean model φ (O; β ). The decision (or treatment interaction) parameters ψ are then
shrunk using an adaptive lasso which penalizes parameters with a weight inversely
proportional to their estimated value, solving
min_ψ { Ln(ψ, β̂, α̂) + λn ∑_{j=1}^{p+1} |ψ̂j|−1 |ψj| }.
In simulation, the procedure was found to select the correct interaction terms with very high probability in samples of size 100 or larger. In high
dimensional settings, the penalized estimator increased the selection of the correct
treatment choice relative to the unpenalized estimator by 7–8 %; in low dimensional
settings, the improvement was more modest (2–3 %).
As proposed by Gunter et al. (2007, 2011b), the S-score for a (univariate) variable
O is defined as

SO = Pn( max_{a∈A} Pn[Y|A = a, O] − max_{a∈A} Pn[Y|A = a] ).
The S-score of a variable O captures the expected increase in the response that is
observed by adapting treatment based on the value of that variable. S-scores com-
bine two characteristics of a useful tailoring variable: the interaction of the variable
with the treatment and the proportion of the population exhibiting variability in its
value. A high S-score for a variable is indicative of a strong qualitative interaction
between the variable and the treatment, as well as a high proportion of patients for
whom the optimal action would change if the value of the variable were taken into
consideration. Thus, S-scores may be used to rank variables and select those that
have the highest scores. The performance of the S-score ranking method was found
to be superior to the standard lasso (Tibshirani 1996) in terms of consistent selection
of a small number of variables from a large set of covariates of interest.
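For a discrete candidate variable O and outcome/treatment arrays, the empirical S-score can be computed directly; the sketch below assumes every (O, A) cell is non-empty, and the function name is ours.

```python
import numpy as np

def s_score(Y, A, O):
    """Empirical S-score of a discrete candidate tailoring variable O (Gunter et al. 2007).

    Compares the mean outcome attainable when treatment is tailored on O with the
    mean outcome under the single best overall treatment.
    """
    untailored = max(Y[A == a].mean() for a in np.unique(A))
    tailored = 0.0
    for o in np.unique(O):
        cell = (O == o)
        best = max(Y[cell & (A == a)].mean() for a in np.unique(A))
        tailored += cell.mean() * best           # P_n-weighted best response given O = o
    return tailored - untailored
```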
In the real-data implementation of the S-score ranking performed by Gunter et al.
(2007), each variable was evaluated separately, without taking into account poten-
tial correlation between variables. Two variables that are highly correlated may have
similar S-scores (Biernot and Moodie 2010) but may not both be necessary for deci-
sion making. The S-score may be modified in a straightforward fashion to examine
the usefulness of a set of variables, O′, given the use of others, O, by considering, for
example,

S_{O′|O} = Pn( max_{a∈A} Pn[Y|A = a, O, O′] − max_{a∈A} Pn[Y|A = a, O] ).

Thus, the S-score approach could be used to select the variable, O, with the highest
score, then select a second variable, O′, with the highest S-score given the use of O
as a prescriptive variable, and so on.
For i = 1, . . . , n subjects and j = 1, . . . , p possible tailoring variables, Gunter et al.
(2007, 2011b) proposed an alternative score (the U-score), based both on the strength
of the interaction of the variable with treatment and on the proportion of the
population for whom the optimal decision differs if the variable is used for tailoring,
the latter captured by

P∗j = Pn( I[ argmax_{a∈A} Pn[Y|Oj = oij, A = a] ≠ argmax_{a∈A} Pn[Y|A = a] ] ).
Gunter et al. (2007, 2011b) suggested the use of the S- and U-scores in combi-
nation with lasso:
1. Select variables that are predictive of the outcome Y from among the variables in
(H10 , AH11 ), using cross-validation or the BIC to select the penalty parameter.
2. Rank each variable O j using the S- or U-score, retaining the predictive variables
selected in step (1) to reduce the variability in the estimated mean response.
Choose the M most highly-ranked variables, where M is the cardinality of the
variables in H11 for which the S- or U-score is non-zero.
3. Create nested subsets of variables.
(a) Let H∗11 be the top M variables found in step (2), and let H∗10 denote the union
of the predictive variables chosen in step (1) and H∗11. Let M∗ denote the cardinality of (H∗10, H∗11).
(b) Run a weighted lasso where all main effect and interaction variables chosen
in step (1) only have weight 1, and all interaction variables chosen in step
(2) are given a weight 0 < w ≤ 1 which is a non-decreasing function of the
U- or S-score. This downweights the importance of the prescriptive variables,
which are favored by lasso.
(c) Create M ∗ nested subsets based on the order of entry of the M ∗ variables in
the weighted lasso.
4. Choose from among the subsets based on the highest expected response, or al-
ternatively, the highest adjusted gain in the outcome relative to not using any
tailoring variables.
The variable selection approaches based on the S- and U-scores were found to
perform well in simulation, leading to variable choices that provided higher ex-
pected outcomes than lasso alone (Gunter et al. 2007, 2011b).
Gunter et al. (2011a) suggested that the qualitative ranking of the previous section
is complex and difficult to interpret, and instead proposed the use of a stepwise
procedure, using the expected response conditional on treatment A and covariates
O, as the criterion on which to select or omit tailoring variables.
The suggested approach begins by fitting a regression model for the response Y as
a function of treatment only, and estimating the mean response to the overall (“un-
tailored”) optimal treatment; denote this by V̂0∗ . Next, let C contain the treatment
variable as well as all variables which are known to be important predictors of the
response. Fit a regression model for the response Y as a function of treatment and all
variables in C and estimate the mean response to the overall (un-tailored) optimal
treatment when the predictors in C are included in the model; denote this by V̂C∗ .
A key quantity that will be used to decide variable inclusion or exclusion is the
adjusted value of the model. For C , the adjusted value is AVC = (V̂C∗ − V̂0∗ )/|C |
where |C | is the rank of the model matrix used in the estimation of the response
conditional on the variables in C .
Let E denote all eligible variables, both predictive variables and treatment-covariate interaction terms, not included in C. The procedure is then carried out by
performing forward selection and backward elimination at each step.
Forward selection: For each variable e ∈ E ,
1. Estimate the predictive model using all the variables in C plus the variable e.
2. Optimize the estimated predictive model over the treatment actions to obtain
the optimal mean response, V̂e∗, and calculate the adjusted value, AVe = (V̂e∗ −
V̂0∗)/|C + e|.
3. Retain the covariate e∗ which results in the largest value of AVe .
Backward elimination: For each variable c ∈ C ,
1. Estimate the predictive model using all the variables in C except the variable c.
2. Optimize the estimated predictive model over the treatment actions to obtain the
optimal mean response, V̂−c∗, and calculate the adjusted value, AV−c = (V̂−c∗ −
V̂0∗)/|C − c|.
3. Let c∗ be the covariate which results in the largest value of AV−c .
If AVC, AVe∗, and AV−c∗ are all negative, the stepwise procedure is complete and no further variable selection is required. If AVe∗ > max{AVC, AV−c∗}, then e∗ is added to C and AVC is set to AVe∗; otherwise, if AV−c∗ > max{AVC, AVe∗}, then c∗ is removed from C and AVC is set to AV−c∗. Gunter et al. (2011a) suggested that all covariate main effects should be retained in a model in which a treatment-covariate interaction is present, and that covariates relating to a single characteristic (e.g. dummy variables indicating the levels of a categorical variable) should be grouped.
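To make the adjusted-value criterion concrete, the sketch below gives a minimal Python illustration under simplifying assumptions of our own: a binary treatment coded 0/1, a linear model containing an intercept, a treatment main effect, covariate main effects, and treatment-covariate interactions, with each candidate variable entering as a main effect plus an interaction. All function names are hypothetical.

```python
import numpy as np

def value_and_rank(main, inter, A, Y):
    """Fit E[Y | A, O] by OLS; return (V*, rank of the design matrix).

    main:  (n, p) covariates entering as main effects (p may be 0)
    inter: (n, q) covariates interacted with treatment (q may be 0)
    A:     (n,) treatment indicator coded 0/1
    V* is the mean predicted outcome when each subject receives the
    treatment that maximizes the fitted model, i.e. the estimated mean
    response under the model's optimal rule.
    """
    n = len(Y)
    design = lambda a: np.column_stack(
        [np.ones(n), a, main, a[:, None] * inter])
    X = design(A)
    beta, *_ = np.linalg.lstsq(X, Y, rcond=None)
    preds = [design(np.full(n, a)) @ beta for a in (0.0, 1.0)]
    return np.mean(np.maximum(*preds)), np.linalg.matrix_rank(X)

def forward_step(main, candidates, A, Y, V0):
    """One forward-selection step: return the candidate with the largest
    adjusted value AV_e = (V*_e - V*_0) / |C + e|."""
    best_j, best_av = None, -np.inf
    for j in range(candidates.shape[1]):
        o = candidates[:, [j]]                # candidate as (n, 1) block
        V, rank = value_and_rank(np.column_stack([main, o]), o, A, Y)
        av = (V - V0) / rank
        if av > best_av:
            best_j, best_av = j, av
    return best_j, best_av
```

A backward-elimination step is entirely analogous: refit with each variable in C removed in turn and compute AV−c = (V̂−c∗ − V̂0∗)/|C − c|.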
In simulation, the stepwise method was found to have higher specificity but lower
sensitivity than the qualitative interaction ranking approach of the previous section
(Gunter et al. 2011a). That is, the stepwise procedure was less likely to falsely in-
clude variables which did not qualitatively interact with treatment, at the cost of
being less able to identify variables which did. However, the stepwise procedure is
rather easier to implement and can be applied to different outcome types such as
binary or count data.
Gunter et al. (2011c) used a similar, but more complex, method to perform
variable selection while controlling the number of falsely significant findings by us-
ing bootstrap sampling and permutation thresholding in combination. The bootstrap
procedure is used as a form of voting algorithm to ensure the selection of variables that modify the effect of treatment in a single direction, while the permutation algorithm is used to
maintain a family-wise error rate across the tests of significance for the coefficients
associated with the tailoring variables.
9.2 Model Checking via Residual Diagnostics

There has been relatively little work on the topic of model checking for estimating
optimal DTRs. The regret-regression approach of Henderson et al. (2010) is one of
the first in which the issues of model checking and diagnostics were specifically
addressed. Because regret-regression uses ordinary least squares for estimation of
the model parameters, standard tools for regression model checking and diagnostics
can be employed. In particular, Henderson et al. (2010) showed that residual plots
can be used to diagnose model mis-specification. In fact, these standard approaches
can and should be used whenever a regression-based approach to estimating DTR
parameters, such as Q-learning or A-learning as implemented by Almirall et al.
(2010), is taken.
Consider the following small example using Q-learning: data are generated such that O11 ∼ N(0, 1) and O21 ∼ N(−0.50 + 0.5 O11, 1), treatment Aj is randomly assigned at each stage with probability 1/2, and binary tailoring variables O12 and O22 are generated at each stage. Thus the state variables are O1 = (O11, O12) and O2 = (O21, O22). The outcome Y is then generated as a linear function of the predictive variables Oj1 and the treatment-by-tailoring-variable interactions Aj Oj2, plus an error ε ∼ N(0, 1).
We fit three models. The first is correctly specified, the second omits the single predictive variable, Oj1, from the model for the Q-function at each stage, and the third omits the interaction Aj Oj2 from the Q-function model. As observed in Fig. 9.1,
residuals from the OLS fit at each stage of the Q-learning algorithm can be used
to detect the omission of important predictors of the response, but may not be suf-
ficiently sensitive to detect the omission of important tailoring variables from the
Q-function model.
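For readers who wish to reproduce this style of diagnostic, a minimal sketch follows. The generative model below is a hypothetical stand-in of our own, not the model underlying Fig. 9.1, and plotting is only indicated in a comment.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Hypothetical two-stage generative model (illustration only)
O11 = rng.normal(0.0, 1.0, n)               # stage-1 predictive variable
O12 = rng.binomial(1, 0.5, n)               # stage-1 tailoring variable
A1 = rng.choice([-1.0, 1.0], n)             # randomized stage-1 treatment
O21 = rng.normal(-0.5 + 0.5 * O11, 1.0, n)  # stage-2 predictive variable
O22 = rng.binomial(1, 0.5, n)               # stage-2 tailoring variable
A2 = rng.choice([-1.0, 1.0], n)             # randomized stage-2 treatment
Y = O11 + O21 + A1 * O12 + A2 * O22 + rng.normal(0.0, 1.0, n)

def ols(X, y):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta, y - X @ beta

# Stage 2: regress Y on the stage-2 history; keep residuals r2
X2 = np.column_stack([np.ones(n), O11, O21, A2, A2 * O22])
beta2, r2 = ols(X2, Y)

# Pseudo-outcome: predicted outcome under the optimal stage-2 action
X2_pos, X2_neg = X2.copy(), X2.copy()
X2_pos[:, 3], X2_pos[:, 4] = 1.0, O22
X2_neg[:, 3], X2_neg[:, 4] = -1.0, -O22
Y_tilde = np.maximum(X2_pos @ beta2, X2_neg @ beta2)

# Stage 1: regress the pseudo-outcome on the stage-1 history; residuals r1
X1 = np.column_stack([np.ones(n), O11, A1, A1 * O12])
beta1, r1 = ols(X1, Y_tilde)

# Plot r1 vs. O11 and r2 vs. O21 (e.g. with matplotlib) and look for
# systematic trends; refitting with O11 or the A2*O22 interaction omitted
# reproduces the mis-specified settings discussed above.
```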
Fig. 9.1 Residual diagnostic plots for Q-learning using a simulated data set with n = 500. The first and second columns show plots for residuals at the first and second stages, respectively. The first row corresponds to a correctly specified Q-function model. In the second and third rows, Q-function models at each stage are mis-specified by the omission, respectively, of a predictive variable and an interaction with a tailoring variable
Residual diagnostics can also be constructed for G-estimation of optimal DTRs (Rich et al. 2010). G-estimation of the stage-j blip parameters is based on the quantity

$$G_{\mathrm{mod},j}(\psi) \equiv G_{\mathrm{mod},j}(H_K, A_K; \psi) = Y - \gamma_j(H_j, A_j; \psi_j) + \sum_{m=j+1}^{K-1} \mu_m(H_m, A_m; \psi_m).$$

The centered quantity

$$\begin{aligned}
G_{ij} - E[G_{ij} \mid H_j; \varsigma_j(\psi_j)]
&= Y_i - \gamma_j(H_j, A_j; \psi_j) + \sum_{m=j+1}^{K-1} \mu_m(H_m, A_m; \psi_m) - E[G_{ij} \mid H_j; \varsigma_j(\psi_j)] \\
&= Y_i - \Big\{ E[G_{ij} \mid H_j; \varsigma_j(\psi_j)] - \sum_{m=j+1}^{K-1} \mu_m(H_m, A_m; \psi_m) + \gamma_j(H_j, A_j; \psi_j) \Big\}
\end{aligned}$$

has mean zero conditional on the history H_j, so that a fitted value for Y_i is given by

$$\hat{Y}_{ij}(\psi) = E[G_{ij} \mid H_j; \varsigma_j(\psi_j)] - \sum_{m=j+1}^{K-1} \mu_m(H_m, A_m; \psi_m) + \gamma_j(H_j, A_j; \psi_j).$$

The residual for the ith individual at the jth stage is then defined to be

$$r_{ij}(\psi) = Y_i - \Big\{ E[G_{ij} \mid H_j; \varsigma_j(\psi_j)] - \sum_{m=j+1}^{K-1} \mu_m(H_m, A_m; \psi_m) + \gamma_j(H_j, A_j; \psi_j) \Big\}.$$
To use the residuals for model-checking purposes, estimates ψ̂ and ς̂j(ψ̂j) must be substituted for the unknown parameters. The residuals rij can be used to verify the models E[Gij | Hj; ςj(ψj)] and γj(hj, aj; ψj), diagnosing underspecification (that is, the omission of a variable) and checking the assumptions regarding the functional form in which covariates were included in the models.
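Assembling these residuals from the output of a G-estimation fit is straightforward; a minimal sketch, with argument names of our own choosing, is given below. Plotting rij against components of Hj or against the fitted values then proceeds exactly as in the Q-learning example above.

```python
import numpy as np

def g_estimation_residuals(Y, E_G_given_H, mu_future_sum, gamma_j):
    """Stage-j residuals r_ij = Y_i - (E[G_ij|H_j] - sum_m mu_m + gamma_j).

    Y:             (n,) observed outcomes
    E_G_given_H:   (n,) fitted expected counterfactual model E[G_ij | H_j]
    mu_future_sum: (n,) sum over m = j+1, ..., K-1 of mu_m(H_m, A_m)
    gamma_j:       (n,) fitted blip gamma_j(H_j, A_j)
    All four arrays are assumed to come from a prior G-estimation fit,
    with estimates substituted for the unknown parameters.
    """
    return Y - (E_G_given_H - mu_future_sum + gamma_j)
```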
Rich et al. (2010) considered a two-stage simulation, and examined plots of the
first- and second-stage residuals against covariates and fitted values. The resid-
ual plots were able to detect incorrectly-specified models in a variety of settings,
and appeared able to distinguish at which stage the model was mis-specified.
While patterns in residual plots provide a useful indicator of problems with model
specification, they do not necessarily indicate in which model a problem occurs,
i.e. whether the problem is in the specification of the blip function or the expected
counterfactual model.
As an illustration, consider a two-stage simulated data set in which the states are generated as
O1 ∼ N(0, 140)
O2 ∼ N(50 + 1.25O1, 120)
In Fig. 9.2, we plot the residuals for four different models, three of which have mis-specified components, from a single data set of size 500. The first and second models mis-specified the form of E[Gij | Hj; ςj(ψj)], the expected counterfactual model, at stages one and two, respectively. The third model correctly specified the expected counterfactual models, but omitted O1 and O2 from the blip models at both stages. The fourth model was correctly specified. In the first, second, and fourth rows, the stage(s) at which no models are mis-specified provide residual plots with no systematic patterns. However, if the expected counterfactual model (rows 1 and 2) or the blip models (row 3) are mis-specified at one or both stages, obvious trends appear in the residual plots. As noted by Rich et al. (2010), mis-specification of the expected counterfactual model and mis-specification of the blip function result in similar patterns in the residual plots; it is therefore not possible to determine which model is incorrect simply by inspection of the residual plots.
Fig. 9.2 Residual diagnostic plots for G-estimation using a simulated data set with n = 500. The first two columns show plots for residuals at the first stage (j = 1), the last two for residuals at the second stage (j = 2). Specifically, the columns plot: (1) first-stage residuals vs. O1, (2) residuals vs. fitted values at the first stage, (3) second-stage residuals vs. O2, and (4) residuals vs. fitted values at the second stage. Rows correspond to model choices: (1) E[Gmod,1(ψ) | O1; ς1(ψ1)] mis-specified, (2) E[Gmod,2(ψ) | H2; ς2(ψ2)] mis-specified, (3) γ1(O1, A1; ψ1) and γ2(H2, A2; ψ2) mis-specified, and (4) all models correctly specified. The solid grey curve indicates a loess smooth through the points
9.3 Discussion and Concluding Remarks

In today's health care, there seems to be an increasing trend in the use of sophisticated mobile devices (e.g. smart phones, actigraph units containing accelerometers, etc.) to remotely monitor patients' chronic health conditions and to act on
the fly, when needed. In the language of the reinforcement learning literature, this is an instance of online decision making in a possibly infinite-horizon setting involving many stages of intervention. The development of statistically sound estimation and inference techniques for such a setting is another very important direction for future research.
The call to personalize medicine is growing more urgent, and reaching beyond
the walls of academia. Even in popular literature (see, e.g. Topol 2012), it has been
declared that
This is a new era of medicine, in which each person can be near fully defined at the individ-
ual level, instead of how we practice medicine at the population level, with [. . . ] use of the
same medication and dosage for a diagnosis rather than for a patient.
While it is true that high dimensional data, even genome scans, are increas-
ingly available to the average “consumer” of medicine, there remains the need
to adequately and appropriately evaluate any new, tailored approach to treatment.
It is that evaluation, by statistical means, that has proven theoretically, computa-
tionally, and practically challenging and has driven many of the methodological
innovations described in this text.
The study of estimation and inference for dynamic treatment regimes is still
relatively young, and constantly evolving. Many inferential problems, including
inference about the optimal value function, remain incompletely addressed. A fur-
ther key challenge is the dissemination of the statistical results into the medical and
public health spheres, so that the methods being developed are not used in ‘toy’
examples, but are deployed in routine use for the evidence-based improvement of
treatment of chronic illnesses. While observational data can help drive hypotheses
and suggest good regimes to explore, increasing the use of SMARTs in clinical re-
search will be required to better understand and evaluate the sequential treatment
decisions that are routinely taken in the care of chronic illnesses.
Glossary
Collider-stratification bias Bias that arises due to the selection of the sample or
conditioning of an analysis model on a covariate that is a common effect of the
treatment of interest (or a variable which causes treatment) and the outcome (or one
of its causes).
Confounding The bias that occurs when the treatment and the outcome have a
common cause that is not appropriately accounted for in an analysis.
Counterfactual outcome The outcome that would have been observed if individual i had received treatment a, where a is not the treatment actually received. Often used interchangeably with the term potential outcome.
Dynamic treatment regime A set of rules for determining effective treatments for
individuals based on personal characteristics such as treatment and covariate history.
G-computation An estimation procedure that models the dependence of covariates
on the history, then simulates from these models the outcome that would have been
observed had exposures been fixed by intervention.
G-estimation An estimation procedure typically coupled with structural nested
models that aims to simulate nested randomized-controlled trials at each stage of
treatment within strata of treatment and covariate history.
Marginal structural model A model for the mean counterfactual outcome which
conditions on treatment (and sometimes also baseline covariates) only, but does not
include any post-baseline covariates.
Non-regular estimator An estimator whose asymptotic distribution does not con-
verge uniformly over the parameter space. In the context of the estimation of optimal
dynamic treatment regimes, this typically occurs due to non-differentiability of the
estimating function with respect to a parameter that indexes a decision rule.
Policy From the reinforcement learning literature: a dynamic treatment regime.
Policy search methods In the reinforcement learning literature, a class of methods that find the optimal regime directly by estimating the value (marginal mean outcome) under each candidate regime within a pre-specified class, and then selecting as optimal the regime that maximizes the estimated value.
Potential outcome The outcome that would be observed if individual i were to
receive treatment a, where here treatment may indicate a single- or multi-component
intervention that is either static or dynamic.
Propensity score For a binary-valued treatment, it is the conditional probability of
receiving treatment given covariates.
Q-function The total expected future reward, starting from stage j with covariate history hj, taking an action aj, and following the treatment policy d thereafter. Thus,

$$Q_j^d(h_j, a_j) = E_d\Big[\sum_{k=j}^{K} Y_k(H_k, A_k, O_{k+1}) \,\Big|\, H_j = h_j, A_j = a_j\Big].$$

Note that if aj follows policy d, then the Q-function equals the value function.
Regret A blip function in which both the reference regime d∗j and the subsequent regime d are taken to be the optimal treatment regime. It is the expected difference in the outcome among participants with history hj that would be observed had the participants taken the optimal treatment from stage j onwards, instead of taking the observed treatment aj and subsequently following the optimal regime:

$$\mu_j(h_j, a_j) = E\big[Y(\bar{a}_{j-1}, \underline{d}_j^{\,\mathrm{opt}}) - Y(\bar{a}_j, \underline{d}_{j+1}^{\,\mathrm{opt}}) \mid H_j = h_j\big].$$
Value function The total expected future reward, starting with a particular covariate history and following the given treatment regime thereafter. The stage-j value function for history hj with respect to a regime d is

$$V_j^d(h_j) = E_d\Big[\sum_{k=j}^{K} Y_k(H_k, A_k, O_{k+1}) \,\Big|\, H_j = h_j\Big] = E_d\Big[Y_j(H_j, A_j, O_{j+1}) + V_{j+1}^d(H_{j+1}) \,\Big|\, H_j = h_j\Big], \quad 1 \le j \le K.$$

The value function, or simply value, represents the expected future reward starting at stage j with history hj and thereafter choosing treatments according to the policy d.
References
Arjas, E., & Parner, J. (2004). Causal reasoning from longitudinal data. Scandina-
vian Journal of Statistics, 31, 171–187.
Arjas, E., & Saarela, O. (2010). Optimal dynamic regimes: Presenting a case for
predictive inference. The International Journal of Biostatistics, 6.
Arroll, B., MacGillivray, S., Ogston, S., Reid, I., Sullivan, F., Williams, B., & Crom-
bie, I. (2005). Efficacy and tolerability of tricyclic antidepressants and SSRIs compared with placebo for treatment of depression in primary care: A meta-analysis.
Annals of Family Medicine, 3, 449–456.
Arroll, B., Elley, C. R., Fishman, T., Goodyear-Smith, F. A., Kenealy, T., Blashki,
G., Kerse, N., & MacGillivray, S. (2009). Antidepressants versus placebo for
depression in primary care. Cochrane Database of Systematic Reviews, 3,
CD007954.
Auyeung, S. F., Long, Q., Royster, E. B., Murthy, S., McNutt, M. D., Lawson, D.,
Miller, A., Manatunga, A., & Musselman, D. L. (2009). Sequential multiple-
assignment randomized trial design of neurobehavioral treatment for patients
with metastatic malignant melanoma undergoing high-dose interferon-alpha ther-
apy. Clinical Trials, 6, 480–490.
Banerjee, A., & Tsiatis, A. A. (2006). Adaptive two-stage designs in phase II clinical
trials. Statistics in Medicine, 25, 3382–3395.
Bellman, R. E. (1957). Dynamic programming. Princeton: Princeton University
Press.
Bembom, O., & Van der Laan, M. J. (2007). Statistical methods for analyzing
sequentially randomized trials. Journal of the National Cancer Institute, 99,
1577–1582.
Berger, R. L. (1996). More powerful tests from confidence interval p values. Amer-
ican Statistician, 50, 314–318.
Berger, R. L., & Boos, D. D. (1994). P values maximized over a confidence set
for the nuisance parameter. Journal of the American Statistical Association, 89,
1012–1016.
Berkson, J. (1946). Limitations of the application of fourfold tables to hospital data.
Biometrics Bulletin, 2, 47–53.
Berry, D. A. (2001). Adaptive clinical trials and Bayesian statistics in drug develop-
ment (with discussion). Biopharmaceutical Report, 9, 1–11.
Berry, D. A. (2004). Bayesian statistics and the efficiency and ethics of clinical
trials. Statistical Science, 19, 175–187.
Berry, D. A., Mueller, P., Grieve, A. P., Smith, M., Parke, T., Blazek, R., Mitchard,
N., & Krams, M. (2001). Adaptive Bayesian designs for dose-ranging drug trials.
In C. Gatsonis, R. E. Kass, B. Carlin, A. Carriquiry, A. Gelman, I. Verdinelli, & M. West (Eds.), Case studies in Bayesian statistics (Vol. V, pp. 99–181).
New York: Springer.
Bertsekas, D. P., & Tsitsiklis, J. (1996). Neuro-dynamic programming. Belmont:
Athena Scientific.
Berzuini, C., Dawid, A. P., & Didelez, V. (2012). Assessing dynamic treatment
strategies. In C. Berzuini, A. P. Dawid, & L. Bernardinelli (Eds.), Causality:
Statistical perspectives and applications (pp. 85–100). Chichester, West Sussex, United Kingdom: Wiley.
Bickel, P. J., & Sakov, A. (2008). On the choice of m in the m out of n bootstrap and
confidence bounds for extrema. Statistica Sinica, 18, 967–985.
Bickel, P. J., Klaassen, C. A. J., Ritov, Y., & Wellner, J. A. (1993). Efficient and
adaptive estimation for semiparametric models. Baltimore: Johns Hopkins Uni-
versity Press.
Bickel, P. J., Götze, F., & van Zwet, W. (1997). Resampling fewer than n observations:
Gains, losses and remedies for losses. Statistica Sinica, 7, 1–31.
Biernot, P., & Moodie, E. E. M. (2010). A comparison of variable selection ap-
proaches for dynamic treatment regimes. The International Journal of Biostatis-
tics, 6.
Bodnar, L. M., Davidian, M., Siega-Riz, A. M., & Tsiatis, A. A. (2004). Marginal
structural models for analyzing causal effects of time-dependent treatments: An
application in perinatal epidemiology. American Journal of Epidemiology, 159,
926–934.
Box, G. E. P., Hunter, W. G., & Hunter, J. S. (1978). Statistics for experimenters:
An introduction to design, data analysis, and model building. New York: Wiley.
Breiman, L. (1995). Better subset regression using the nonnegative garrote. Techno-
metrics, 37, 373–384.
Breiman, L. (1996). Bagging predictors. Machine Learning, 24, 123–140.
Brotman, R. M., Klebanoff, M. A., Nansel, T. R., Andrews, W. W., Schwebke, J. R.,
Zhang, J., Yu, K. F., Zenilman, J. M., & Scharfstein, D. O. (2008). A longitu-
dinal study of vaginal douching and bacterial vaginosis – A marginal structural
modeling analysis. American Journal of Epidemiology, 168, 188–196.
Bühlmann, P., & Yu, B. (2002). Analyzing bagging. Annals of Statistics, 30,
927–961.
Cain, L. E., Robins, J. M., Lanoy, E., Logan, R., Costagliola, D., & Hernán, M. A.
(2010). When to start treatment? A systematic approach to the comparison of dy-
namic regimes using observational data. The International Journal of Biostatis-
tics, 6.
Carlin, B. P., Kadane, J. B., & Gelfand, A. E. (1998). Approaches for optimal se-
quential decision analysis in clinical trials. Biometrics, 54, 964–975.
Chakraborty, B. (2009). A study of non-regularity in dynamic treatment regimes
and some design considerations for multicomponent interventions (Dissertation,
University of Michigan, 2009).
Chakraborty, B. (2011). Dynamic treatment regimes for managing chronic health
conditions: A statistical perspective. American Journal of Public Health, 101,
40–45.
Chakraborty, B., & Moodie, E. E. M. (2013). Estimating optimal dynamic treatment
regimes with shared decision rules across stages: An extension of Q-learning
(under revision).
Chakraborty, B., Collins, L. M., Strecher, V. J., & Murphy, S. A. (2009). Develop-
ing multicomponent interventions using fractional factorial designs. Statistics in
Medicine, 28, 2687–2708.
Chakraborty, B., Murphy, S. A., & Strecher, V. (2010). Inference for non-regular
parameters in optimal dynamic treatment regimes. Statistical Methods in Medical
Research, 19, 317–343.
Chakraborty, B., Laber, E. B., & Zhao, Y. (2013). Inference for optimal dynamic
treatment regimes using an adaptive m-out-of-n bootstrap scheme. Biometrics,
(in press).
Chapman, G. B., & Sonnenberg, F. B. (2000). Decision making in health care: The-
ory, psychology, and applications. Cambridge, UK: Cambridge University Press.
Chen, M.-H., Müller, P., Sun, D., & Ye, K. (Eds.). (2010). Frontiers of statistical decision making and Bayesian analysis: In honor of James O. Berger. New York: Springer.
Cheung, K. Y., Lee, S. M. S., & Young, G. A. (2005). Iterating the m out of n bootstrap in nonregular smooth function models. Statistica Sinica, 15, 945–967.
Cheung, Y. K. (2011). Dose finding by the continual reassessment method. Boca Raton: Chapman & Hall/CRC.
Chow, S. C., & Chang, M. (2008). Adaptive design methods in clinical trials – A
review. Orphanet Journal of Rare Diseases, 3.
Clemen, R. T., & Reilly, T. (2001). Making hard decisions. Pacific Grove: Duxbury.
Coffey, C. S., Levin, B., Clark, C., Timmerman, C., Wittes, J., Gilbert, P., & Harris,
S. (2012). Overview, hurdles, and future work in adaptive designs: Perspectives
from an NIH-funded workshop. Clinical Trials, 9, 671–680.
Cohen, J. (1988). Statistical power for the behavioral sciences (2nd ed.). Hillsdale:
Erlbaum.
Cole, S. R., & Frangakis, C. (2009). The consistency statement in causal inference:
A definition or an assumption? Epidemiology, 20, 3–5.
Cole, S. R., & Hernán, M. A. (2008). Constructing inverse probability weights for
marginal structural models. American Journal of Epidemiology, 168, 656–664.
Collins, L. M., Murphy, S. A., & Bierman, K. (2004). A conceptual framework for
adaptive preventive interventions. Prevention Science, 5, 185–196.
Collins, L. M., Murphy, S. A., Nair, V. N., & Strecher, V. J. (2005). A strategy
for optimizing and evaluating behavioral interventions. Annals of Behavioral
Medicine, 30, 65–73.
Collins, L. M., Chakraborty, B., Murphy, S. A., & Strecher, V. J. (2009). Compari-
son of a phased experimental approach and a single randomized clinical trial for
developing multicomponent behavioral interventions. Clinical Trials, 6, 5–15.
Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine Learning, 20,
273–297.
Cotton, C. A., & Heagerty, P. J. (2011). A data augmentation method for estimat-
ing the causal effect of adherence to treatment regimens targeting control of an
intermediate measure. Statistics in Bioscience, 3, 28–44.
Cox, D. R. (1958). Planning of experiments. New York: Wiley.
Cox, D. R., & Oakes, D. (1984). Analysis of survival data. Boca Raton, Florida:
Chapman & Hall/CRC.
D’Agostino, R. B., Jr. (1998). Tutorial in biostatistics: Propensity score methods
for bias reduction in the comparison of a treatment to a non-randomized control
group. Statistics in Medicine, 17, 2265–2281.
Hernán, M. A., Brumback, B., & Robins, J. M. (2000). Marginal structural models
to estimate the causal effect of zidovudine on the survival of HIV-positive men.
Epidemiology, 11, 561–570.
Hernán, M. A., Hernández-Díaz, S., & Robins, J. M. (2004). A structural approach
to selection bias. Epidemiology, 15, 615–625.
Hernán, M. A., Cole, S. R., Margolick, J., Cohen, M., & Robins, J. M. (2005). Struc-
tural accelerated failure time models for survival analysis in studies with time-
varying treatments. Pharmacoepidemiology and Drug Safety, 14, 477–491.
Hernán, M. A., Lanoy, E., Costagliola, D., & Robins, J. M. (2006). Comparison of
dynamic treatment regimes via inverse probability weighting. Basic & Clinical
Pharmacology & Toxicology, 98, 237–242.
Hirano, K., & Porter, J. (2009). Asymptotics for statistical treatment rules. Econo-
metrica, 77, 1683–1701.
Holland, P. (1986). Statistics and causal inference. Journal of the American Statisti-
cal Association, 81, 945–970.
Huang, F., & Lee, M.-J. (2010). Dynamic treatment effect analysis of TV effects on
child cognitive development. Journal of Applied Econometrics, 25, 392–419.
Huang, X., & Ning, J. (2012). Analysis of multi-stage treatments for recurrent dis-
eases. Statistics in Medicine, 31, 2805–2821.
Huber, P. (1964). Robust estimation of a location parameter. Annals of Mathematical Statistics, 35, 73–101.
Joffe, M. M. (2000). Confounding by indication: The case of calcium channel block-
ers. Pharmacoepidemiology and Drug Safety, 9, 37–41.
Joffe, M. M., & Brensinger, C. (2003). Weighting in instrumental variables and G-
estimation. Statistics in Medicine, 22, 1285–1303.
Jones, H. (2010). Reinforcement-based treatment for pregnant drug abusers
(HOME II). Bethesda: National Institutes of Health. http://clinicaltrials.gov/ct2/
show/NCT01177982?term=jones+pregnant&rank=9.
Kaelbling, L. P., Littman, M. L., & Moore, A. (1996). Reinforcement learning: A
survey. The Journal of Artificial Intelligence Research, 4, 237–285.
Kakade, S. M. (2003). On the sample complexity of reinforcement learning (Disser-
tation, University College London).
Kasari, C. (2009). Developmental and augmented intervention for facilitating
expressive language (CCNIA). Bethesda: National Institutes of Health. http://
clinicaltrials.gov/ct2/show/NCT01013545?term=kasari&rank=5.
Kaslow, R. A., Ostrow, D. G., Detels, R., Phair, J. P., Polk, B. F., & Rinaldo, C. R.
(1987). The Multicenter AIDS Cohort Study: Rationale, organization, and se-
lected characteristics of the participants. American Journal of Epidemiology, 126,
310–318.
Kearns, M., Mansour, Y., & Ng, A. Y. (2000). Approximate planning in large POMDPs via reusable trajectories. In Advances in neural information processing systems (Vol. 12). Cambridge, MA: MIT Press.
Kramer, M. S., Chalmers, B., Hodnett, E. D., Sevkovskaya, Z., Dzikovich, I.,
Shapiro, S., Collet, J., Vanilovich, I., Mezen, I., Ducruet, T., Shishko, G.,
Zubovich, V., Mknuik, D., Gluchanina, E., Dombrovsky, V., Ustinovitch, A.,
Ko, T., Bogdanovich, N., Ovchinikova, L., & Helsing, E. (2001). Promotion of breastfeeding intervention trial (PROBIT): A randomized trial in the Republic of Belarus. Journal of the American Medical Association, 285, 413–420.
Moodie, E. E. M., Chakraborty, B., & Kramer, M. S. (2012). Q-learning for estimat-
ing optimal dynamic treatment rules from observational data. Canadian Journal
of Statistics, 40, 629–645.
Moodie, E. E. M., Dean, N., & Sun, Y. R. (2013). Q-learning: Flexible learning
about useful utilities. Statistics in Biosciences, (in press).
Mortimer, K. M., Neugebauer, R., Van der Laan, M. J., & Tager, I. B. (2005). An
application of model-fitting procedures for marginal structural models. American
Journal of Epidemiology, 162, 382–388.
Murphy, S. A. (2003). Optimal dynamic treatment regimes (with Discussion). Jour-
nal of the Royal Statistical Society, Series B, 65, 331–366.
Murphy, S. A. (2005a). An experimental design for the development of adaptive
treatment strategies. Statistics in Medicine, 24, 1455–1481.
Murphy, S. A. (2005b). A generalization error for Q-learning. Journal of Machine
Learning Research, 6, 1073–1097.
Murphy, S. A., & Bingham, D. (2009). Screening experiments for developing dy-
namic treatment regimes. Journal of the American Statistical Association, 104, 391–408.
Murphy, S. A., Van der Laan, M. J., Robins, J. M., & CPPRG (2001). Marginal mean
models for dynamic regimes. Journal of the American Statistical Association, 96,
1410–1423.
Murphy, S. A., Lynch, K. G., Oslin, D., McKay, J. R., & Ten Have, T. (2007a). De-
veloping adaptive treatment strategies in substance abuse research. Drug and Al-
cohol Dependence, 88, s24–s30.
Murphy, S. A., Oslin, D., Rush, A. J., & Zhu, J. (2007b). Methodological challenges
in constructing effective treatment sequences for chronic psychiatric disorders.
Neuropsychopharmacology, 32, 257–262.
Nahum-Shani, I., Qian, M., Almirall, D., Pelham, W., Gnagy, B., Fabiano, G., Wax-
monsky, J., Yu, J., & Murphy, S. A. (2012a). Experimental design and primary
data analysis methods for comparing adaptive interventions. Psychological Meth-
ods, 17, 457–477.
Nahum-Shani, I., Qian, M., Almirall, D., Pelham, W., Gnagy, B., Fabiano, G., Wax-
monsky, J., Yu, J., & Murphy, S. (2012b). Q-learning: A data analysis method for
constructing adaptive interventions. Psychological Methods, 17, 478–494.
Nankervis, J. C. (2005). Computational algorithms for double bootstrap confidence
intervals. Computational Statistics & Data Analysis, 49, 461–475.
Nelson, J. C. (1997). Safety and tolerability of the new antidepressants. Journal of
Clinical Psychiatry, 58(Suppl. 6), 26–31.
Neugebauer, R., & Van der Laan, M. J. (2005). Why prefer double robust estimators
in causal inference? Journal of Statistical Planning and Inference, 129, 405–426.
Neugebauer, R., & Van der Laan, M. J. (2006). G-computation estimation for causal
inference with complex longitudinal data. Computational Statistics & Data Anal-
ysis, 51, 1676–1697.
Neugebauer, R., Silverberg, M. J., & Van der Laan, M. J. (2010). Observational
study and individualized antiretroviral therapy initiation rules for reducing can-
cer incidence in HIV-infected patients (Technical report). U.C. Berkeley Division
of Biostatistics Working Paper Series.
Newey, W. K., & McFadden, D. (1994). Large sample estimation and hypothesis
testing. In R. F. Engle & D. L. McFadden (Eds.), Handbook of econometrics
(Vol. IV, pp. 2113–2245). Amsterdam/Oxford: Elsevier Science.
Neyman, J. (1923). On the application of probability theory to agricultural experi-
ments. Essay in principles. Section 9 (translation published in 1990). Statistical
Science, 5, 472–480.
Ng, A., & Jordan, M. (2000). PEGASUS: A policy search method for large MDPs and POMDPs. In Proceedings of the sixteenth conference on uncertainty in artificial intelligence.
Oetting, A. I., Levy, J. A., Weiss, R. D., & Murphy, S. A. (2011). Statistical method-
ology for a SMART design in the development of adaptive treatment strategies.
In: P. E. Shrout, K. M. Keyes, & K. Ornstein (Eds.) Causality and Psychopathol-
ogy: Finding the Determinants of Disorders and their Cures (pp. 179–205). Ar-
lington: American Psychiatric Publishing.
Olshen, R. A. (1973). The conditional level of the F-test. Journal of the American
Statistical Association, 68, 692–698.
Orellana, L., Rotnitzky, A., & Robins, J. M. (2010a). Dynamic regime marginal
structural mean models for estimation of optimal dynamic treatment regimes, part
I: Main content. The International Journal of Biostatistics, 6.
Orellana, L., Rotnitzky, A., & Robins, J. M. (2010b). Dynamic regime marginal
structural mean models for estimation of optimal dynamic treatment regimes, part
II: Proofs and additional results. The International Journal of Biostatistics, 6.
Ormoneit, D., & Sen, S. (2002). Kernel-based reinforcement learning. Machine
Learning, 49, 161–178.
Oslin, D. (2005). Managing alcoholism in people who do not respond to naltrexone
(ExTENd). Bethesda: National Institutes of Health. http://clinicaltrials.gov/ct2/
show/NCT00115037?term=oslin&rank=8.
Pampallona, S., & Tsiatis, A. A. (1994). Group sequential designs for one and two
sided hypothesis testing with provision for early stopping in favour of the null
hypothesis. Journal of Statistical Planning and Inference, 42, 19–35.
Parmigiani, G. (2002). Modeling in medical decision making: A Bayesian approach.
New York: Wiley.
Pearl, J. (2009). Causality (2nd ed.). New York: Cambridge University Press.
Petersen, M. L., Deeks, S. G., & Van der Laan, M. J. (2007). Individualized treat-
ment rules: Generating candidate clinical trials. Statistics in Medicine, 26, 4578–
4601.
Petersen, M. L., Porter, K. E., Gruber, S., Wang, Y., & Van der Laan, M. J. (2012).
Diagnosing and responding to violations in the positivity assumption. Statistical
Methods in Medical Research, 21, 31–54.
Partnership for Solutions (2004). Chronic conditions: Making the case for ongoing
care: September 2004 update. Baltimore: Partnership for Solutions, Johns Hop-
kins University.
Pineau, J., Bellemare, M. G., Rush, A. J., Ghizaru, A., & Murphy, S. A. (2007).
Constructing evidence-based treatment strategies using methods from computer
science. Drug and Alcohol Dependence, 88, S52–S60.
Pliskin, J. S., Shepard, D., & Weinstein, M. C. (1980). Utility functions for life years
and health status: Theory, assessment, and application. Operations Research, 28,
206–224.
Pocock, S. J. (1977). Group sequential methods in the design and analysis of clinical
trials. Biometrika, 64, 191–199.
Politis, D. N., Romano, J. P., & Wolf, M. (1999). Subsampling. New York: Springer.
Pötscher, B. M. (2007). Confidence sets based on sparse estimators are necessarily
large. arXiv preprint arXiv:0711.1036.
Pötscher, B. M., & Schneider, U. (2008). Confidence sets based on penalized maxi-
mum likelihood estimators. MPRA paper, University Library of Munich, Germany.
Qian, M., & Murphy, S. A. (2011). Performance guarantees for individualized treat-
ment rules. Annals of Statistics, 39, 1180–1210.
Rich, B., Moodie, E. E. M., Stephens, D. A., & Platt, R. W. (2010). Model check-
ing with residuals for g-estimation of optimal dynamic treatment regimes. The
International Journal of Biostatistics, 6.
Rich, B., Moodie, E. E. M., & Stephens, D. A. (2013). Adaptive individualized dosing in pharmacological studies: Generating candidate dynamic dosing strategies for warfarin treatment (submitted).
Robins, J. M. (1986). A new approach to causal inference in mortality studies with
sustained exposure periods – Application to control of the healthy worker survivor
effect. Mathematical Modelling, 7, 1393–1512.
Robins, J. M. (1994). Correcting for non-compliance in randomized trials using
structural nested mean models. Communications in Statistics, 23, 2379–2412.
Robins, J. M. (1997). Causal inference from complex longitudinal data. In
M. Berkane (Ed.), Latent variable modeling and applications to causality: Lec-
ture notes in statistics (pp. 69–117). New York: Springer.
Robins J. M. (1999a). Marginal structural models versus structural nested models as
tools for causal inference. In: M. E. Halloran & D. Berry (Eds.) Statistical models
in epidemiology: The environment and clinical trials. IMA, 116, NY: Springer-
Verlag, pp. 95–134.
Robins, J. M. (1999b). Association, causation, and marginal structural models. Syn-
these, 121, 151–179.
Robins, J. M. (2004). Optimal structural nested models for optimal sequential de-
cisions. In D. Y. Lin & P. Heagerty (Eds.), Proceedings of the second Seattle
symposium on biostatistics (pp. 189–326). New York: Springer.
Robins, J. M., & Hernán, M. A. (2009). Estimation of the causal effects of time-
varying exposures. In G. Fitzmaurice, M. Davidian, G. Verbeke, & G. Molen-
berghs (Eds.), Longitudinal data analysis. Boca Raton: Chapman & Hall/CRC.
Robins, J. M., & Wasserman, L. (1997). Estimation of effects of sequential treat-
ments by reparameterizing directed acyclic graphs. In D. Geiger & P. Shenoy
(Eds.), Proceedings of the thirteenth conference on uncertainty in artificial intel-
ligence (pp. 409–430). Providence.
Robins, J. M., Hernán, M. A., & Brumback, B. (2000). Marginal structural models
and causal inference in epidemiology. Epidemiology, 11, 550–560.
Robins, J. M., Orellana, L., & Rotnitzky, A. (2008). Estimation and extrapolation of
optimal treatment and testing strategies. Statistics in Medicine, 27, 4678–4721.
Rosenbaum, P. R. (1991). Discussing hidden bias in observational studies. Annals
of Internal Medicine, 115, 901–905.
Rosenbaum, P. R., & Rubin, D. B. (1983). The central role of the propensity score
in observational studies for causal effects. Biometrika, 70, 41–55.
Rosenbaum, P. R., & Rubin, D. B. (1984). Reducing bias in observational studies us-
ing subclassification on the propensity score. Journal of the American Statistical
Association, 79, 516–524.
Rosenbaum, P. R., & Rubin, D. B. (1985). Constructing a control group using mul-
tivariate matched sampling methods that incorporate the propensity score. The
American Statistician, 39, 33–38.
Rosthøj, S., Fullwood, C., Henderson, R., & Stewart, S. (2006). Estimation of op-
timal dynamic anticoagulation regimes from observational data: A regret-based
approach. Statistics in Medicine, 25, 4197–4215.
Rubin, D. B. (1974). Estimating causal effects of treatments in randomized and
nonrandomized studies. Journal of Educational Psychology, 66, 688–701.
Rubin, D. B. (1980). Discussion of “randomized analysis of experimental data: The
Fisher randomization test” by D. Basu. Journal of the American Statistical Asso-
ciation, 75, 591–593.
Rubin, D. B., & Schenker, N. (1991). Multiple imputation in health-care data bases:
An overview and some applications. Statistics in Medicine, 10, 585–598.
Rubin, D. B., & van der Laan, M. J. (2012). Statistical issues and limitations in
personalized medicine research with clinical trials. International Journal of Bio-
statistics, 8.
Rush, A. J., Fava, M., Wisniewski, S. R., Lavori, P. W., Trivedi, M. H., Sackeim,
H. A., Thase, M. E., Nierenberg, A. A., Quitkin, F. M., Kashner, T. M., Kupfer,
D. J., Rosenbaum, J. F., Alpert, J., Stewart, J. W., McGrath, P. J., Biggs, M. M.,
Shores-Wilson, K., Lebowitz, B. D., Ritz, L., & Niederehe, G. (2004). Sequenced
treatment alternatives to relieve depression (STAR*D): Rationale and design.
Controlled Clinical Trials, 25, 119–142.
Saarela, O., Moodie, E. E. M., Stephens, D. A., & Klein, M. B. (2013a). On
Bayesian estimation of marginal structural models (submitted).
Saarela, O., Stephens, D. A., & Moodie, E. E. M. (2013b). The role of exchange-
ability in causal inference (submitted).
Schneider, L. S., Tariot, P. N., Lyketsos, C. G., Dagerman, K. S., Davis, K. L., &
Davis, S. (2001). National Institute of Mental Health Clinical Antipsychotic Tri-
als of Intervention Effectiveness (CATIE): Alzheimer disease trial methodology.
American Journal of Geriatric Psychiatry, 9, 346–360.
Schulte, P. J., Tsiatis, A. A., Laber, E. B., & Davidian, M. (2012). Q- and A-learning
methods for estimating optimal dynamic treatment regimes. arXiv, 1202.4177v1.
Sekhon, J. S. (2011). Multivariate and propensity score matching software with au-
tomated balance optimization: The matching package for R. Journal of Statistical
Software, 42, 1–52.
Shao, J. (1994). Bootstrap sample size in nonregular cases. Proceedings of the Amer-
ican Mathematical Society, 122, 1251–1262.
Shao, J., & Sitter, R. R. (1996). Bootstrap for imputed survey data. Journal of the
American Statistical Association, 91, 1278–1288.
Shepherd, B. E., Jenkins, C. A., Rebeiro, P. F., Stinnette, S. E., Bebawy, S. S., Mc-
Gowan, C. C., Hulgan, T., & Sterling, T. R. (2010). Estimating the optimal CD4
count for HIV-infected persons to start antiretroviral therapy. Epidemiology, 21,
698–705.
Shivaswamy, P., Chu, W., & Jansche, M. (2007). A support vector approach to cen-
sored targets. In Proceedings of the seventh IEEE international conference on
data mining, Omaha (pp. 655–660).
Shortreed, S. M., & Moodie, E. E. M. (2012). Estimating the optimal dynamic an-
tipsychotic treatment regime: Evidence from the sequential-multiple assignment
randomized CATIE Schizophrenia Study. Journal of the Royal Statistical Society, Series C, 61, 577–599.
Shortreed, S. M., Laber, E., & Murphy, S. A. (2010). Imputation methods for the
clinical antipsychotic trials of intervention and effectiveness study (Technical re-
port SOCS-TR-2010.8). School of Computer Science, McGill University.
Shortreed, S. M., Laber, E., Lizotte, D. J., Stroup, T. S., Pineau, J., & Murphy,
S. A. (2011). Informing sequential clinical decision-making through reinforce-
ment learning: An empirical study. Machine Learning, 84, 109–136.
Sjölander, A., Nyrén, O., Bellocco, R., & Evans, M. (2011). Comparing different
strategies for timing of dialysis initiation through inverse probability weighting.
American Journal of Epidemiology, 174, 1204–1210.
Song, R., Wang, W., Zeng, D., & Kosorok, M. R. (2011). Penalized Q-learning for
dynamic treatment regimes. arXiv:1108.5338v1 [stat.ME].
Sox, H. C., Blatt, M. A., Higgins, M. C., & Marton, K. I. (1988). Medical decision
making. Boston: Butterworth-Heinemann.
Sterne, J. A. C., May, M., Costagliola, D., de Wolf, F., Phillips, A. N., Harris, R.,
Funk, M. J., Geskus, R. B., Gill, J., Dabis, F., Miró, J. M., Justice, A. C., Led-
ergerber, B., Fätkenheuer, G., Hogg, R. S., D’Arminio Monforte, A., Saag, M.,
Smith, C., Staszewski, S., Egger, M., Cole, S. R., & The When To Start Consor-
tium (2009). Timing of initiation of antiretroviral therapy in AIDS-free HIV-1-
infected patients: A collaborative analysis of 18 HIV cohort studies. Lancet, 373,
1352–1363.
Stewart, C. E., Fielder, A. R., Stephens, D. A., & Moseley, M. J. (2002). Design
of the Monitored Occlusion Treatment of Amblyopia Study (MOTAS). British
Journal of Ophthalmology, 86, 915–919.
Stewart, C. E., Moseley, M. J., Stephens, D. A., & Fielder, A. R. (2004). Treat-
ment dose-response in amblyopia therapy: The Monitored Occlusion Treatment
of Amblyopia Study (MOTAS). Investigations in Ophthalmology and Visual Sci-
ence, 45, 3048–3054.
Stone, R. M., Berg, D. T., George, S. L., Dodge, R. K., Paciucci, P. A., Schulman,
P., Lee, E. J., Moore, J. O., Powell, B. L., & Schiffer, C. A. (1995). Granulocyte
macrophage colony-stimulating factor after initial chemotherapy for elderly pa-
tients with primary acute myelogenous leukemia. The New England Journal of
Medicine, 332, 1671–1677.
Strecher, V., McClure, J., Alexander, G., Chakraborty, B., Nair, V., Konkel, J.,
Greene, S., Collins, L., Carlier, C., Wiese, C., Little, R., Pomerleau, C., & Pomer-
leau, O. (2008). Web-based smoking cessation components and tailoring depth:
Results of a randomized trial. American Journal of Preventive Medicine, 34, 373–
381.
Stroup, T. S., McEvoy, J. P., Swartz, M. S., Byerly, M. J., Glick, I. D., Canive, J. M.,
McGee, M., Simpson, G. M., Stevens, M. D., & Lieberman, J. A. (2003). The
National Institute of Mental Health Clinical Antipsychotic Trials of Intervention Effectiveness (CATIE) project: Schizophrenia trial design and protocol development. Schizophrenia Bulletin, 29, 15–31.
Stroup, T. S., Lieberman, J. A., McEvoy, J. P., Davis, S. M., Meltzer, H. Y., Rosen-
heck, R. A., Swartz, M. S., Perkins, D. O., Keefe, R. S. E., Davis, C. E., Severe, J.,
& Hsiao, J. K. (2006). Effectiveness of olanzapine, quetiapine, risperidone, and
ziprasidone in patients with chronic schizophrenia following discontinuation of
a previous atypical antipsychotic. American Journal of Psychiatry, 163, 611–622.
Sturmer, T., Schneeweiss, S., Brookhart, M. A., Rothman, K. J., Avorn, J., & Glynn,
R. J. (2005). Analytic strategies to adjust confounding using exposure propensity
scores and disease risk scores: Nonsteroidal antiinflammatory drugs and short-
term mortality in the elderly. American Journal of Epidemiology, 161, 891–898.
Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning: An introduction. Cam-
bridge, MA: MIT.
Swartz, M. S., Perkins, D. O., Stroup, T. S., McEvoy, J. P., Nieri, J. M., & Haal, D. D.
(2003). Assessing clinical and functional outcomes in the Clinical Antipsychotic
Trials of Intervention Effectiveness (CATIE) schizophrenia trial. Schizophrenia
Bulletin, 29, 33–43.
Taubman, S. L., Robins, J. M., Mittleman, M. A., & Hernán, M. A. (2009). Inter-
vening on risk factors for coronary heart disease: An application of the parametric
g-formula. International Journal of Epidemiology, 38, 1599–1611.
Thall, P. F., & Wathen, J. K. (2005). Covariate-adjusted adaptive randomization in a
sarcoma trial with multi-stage treatments. Statistics in Medicine, 24, 1947–1964.
Thall, P. F., Millikan, R. E., & Sung, H. G. (2000). Evaluating multiple treatment
courses in clinical trials. Statistics in Medicine, 19, 1011–1028.
Thall, P. F., Sung, H. G., & Estey, E. H. (2002). Selecting therapeutic strategies
based on efficacy and death in multicourse clinical trials. Journal of the American
Statistical Association, 97, 29–39.
Thall, P. F., Wooten, L. H., Logothetis, C. J., Millikan, R. E., & Tannir, N. M.
(2007a). Bayesian and frequentist two-stage treatment strategies based on se-
quential failure times subject to interval censoring. Statistics in Medicine, 26,
4687–4702.
Thall, P. F., Logothetis, C., Pagliaro, L. C., Wen, S., Brown, M. A., Williams, D.,
& Millikan, R. E. (2007b). Adaptive therapy for androgen-independent prostate
cancer: A randomized selection trial of four regimens. Journal of the National
Cancer Institute, 99, 1613–1622.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of
the Royal Statistical Society, Series B, 58, 267–288.
Topol, E. (2012). Creative destruction of medicine: How the digital revolution and
personalized medicine will create better health care. New York: Basic Books.
Torrance, G. W. (1986). Measurement of health state utilities for economic ap-
praisal. Journal of Health Economics, 5, 1–30.
Tsiatis, A. A. (2006). Semiparametric theory and missing data. New York: Springer.
Van der Laan, M. J., & Petersen, M. L. (2007a). Causal effect models for realistic
individualized treatment and intention to treat rules. The International Journal of
Biostatistics, 3.
Van der Laan, M. J., & Petersen, M. L. (2007b). Statistical learning of origin-specific
statically optimal individualized treatment rules. The International Journal of
Biostatistics, 3.
Van der Laan, M. J., & Robins, J. M. (2003). Unified methods for censored longitu-
dinal data and causality. New York: Springer.
Van der Laan, M. J., & Rubin, D. (2006). Targeted maximum likelihood learning.
The International Journal of Biostatistics, 2.
Van der Vaart, A. W. (1998). Asymptotic statistics. Cambridge, UK: Cambridge Uni-
versity Press.
Vansteelandt, S., & Goetghebeur, E. (2003). Causal inference with generalized
structural mean models. Journal of the Royal Statistical Society, Series B, 65,
817–835.
Vapnik, V. (1995). The nature of statistical learning theory. New York: Springer.
Vogt, W. P. (1993). Dictionary of statistics and methodology: A nontechnical guide
for the social sciences. Newbury Park: Sage Publications.
Wagner, E. H., Austin, B. T., Davis, C., Hindmarsh, M., Schaefer, J., & Bonomi, A.
(2001). Improving chronic illness care: Translating evidence into action. Health
Affairs, 20, 64–78.
Wahed, A. S., & Tsiatis, A. A. (2004). Optimal estimator for the survival distribution
and related quantities for treatment policies in two-stage randomized designs in
clinical trials. Biometrics, 60, 124–133.
Wahed, A. S., & Tsiatis, A. A. (2006). Semiparametric efficient estimation of sur-
vival distributions in two-stage randomisation designs in clinical trials with cen-
sored data. Biometrika, 93, 163–177.
Wald, A. (1949). Statistical decision functions. New York: Wiley.
Wang, Y., Petersen, M. L., Bangsberg, D., & Van der Laan, M. J. (2006). Diag-
nosing bias in the inverse probability of treatment weighted estimator resulting
from violation of experimental treatment assignment. UC Berkeley Division of
Biostatistics Working Paper Series.
Wang, L., Rotnitzky, A., Lin, X., Millikan, R. E., & Thall, P. F. (2012). Evaluation of
viable dynamic treatment regimes in a sequentially randomized trial of advanced
prostate cancer. Journal of the American Statistical Association, 107, 493–508.
Wathen, J. K., & Thall, P. F. (2008). Bayesian adaptive model selection for optimiz-
ing group sequential clinical trials. Statistics in Medicine, 27, 5586–5604.