Leveraging Factored Action Spaces
Shengpu Tang1 Maggie Makar1 Michael W. Sjoding2 Finale Doshi-Velez3 Jenna Wiens1
¹ Division of Computer Science & Engineering, University of Michigan, Ann Arbor, MI, USA
² Division of Pulmonary and Critical Care, Michigan Medicine, Ann Arbor, MI, USA
³ SEAS, Harvard University, Cambridge, MA, USA
Correspondence to: {tangsp,wiensj}@umich.edu
Reviewed on OpenReview: https://openreview.net/forum?id=Jd70afzIvJ4
Abstract
Many reinforcement learning (RL) applications have combinatorial action spaces,
where each action is a composition of sub-actions. A standard RL approach
ignores this inherent factorization structure, resulting in a potential failure to
make meaningful inferences about rarely observed sub-action combinations; this is
particularly problematic for offline settings, where data may be limited. In this work,
we propose a form of linear Q-function decomposition induced by factored action
spaces. We study the theoretical properties of our approach, identifying scenarios
where it is guaranteed to lead to zero bias when used to approximate the Q-function.
Outside the regimes with theoretical guarantees, we show that our approach can still
be useful because it leads to better sample efficiency without necessarily sacrificing
policy optimality, allowing us to achieve a better bias-variance trade-off. Across
several offline RL problems using simulators and real-world datasets motivated by
healthcare, we demonstrate that incorporating factored action spaces into value-
based RL can result in better-performing policies. Our approach can help an agent
make more accurate inferences within underexplored regions of the state-action
space when applying RL to observational datasets.
1 Introduction
In many real-world decision-making problems, the action space exhibits an inherent combinatorial
structure. For example, in healthcare, an action may correspond to a combination of drugs and treat-
ments. When applying reinforcement learning (RL) to these tasks, past work [1–4] typically considers
each combination a distinct action, resulting in an exponentially large action space (Figure 1a). This
is inefficient as it fails to leverage any potential independence among dimensions of the action space.
This type of factorization structure in action space could be incorporated when designing the architec-
ture of function approximators for RL (Figure 1b). Similar ideas have been used in the past, primarily
to improve online exploration [5, 6], or to handle multiple agents [7–11] or multiple rewards [12].
However, the applicability of this approach has not been systematically studied, especially in offline
settings and when the MDP presents no additional structure (e.g., when the state space cannot be
explicitly factorized).
In this work, we develop an approach for offline RL with factored action spaces by learning linearly
decomposable Q-functions. First, we study the theoretical properties of this approach, investigating
the sufficient and necessary conditions for it to lead to an unbiased estimate of the Q-function (i.e.,
zero approximation error). Even when the linear decomposition is biased, we note that our approach
leads to a reduction of variance, which in turn leads to an improvement in sample efficiency. Lastly,
we show that when sub-actions exhibit certain structures (e.g., when two sub-actions “reinforce” their
independent effects), the linear approximation, though biased, can still lead to the optimal policy. We
test our approach in offline RL domains using a simulator [13] and a real clinical dataset [2], where
domain knowledge about the relationship among actions suggests our proposed factorization approach
is applicable. Empirically, our approach outperforms a non-factored baseline when the sample size is
limited, even when the theoretical assumptions (around the validity of a linear decomposition) are not
perfectly satisfied. Qualitatively, in the real-data experiment, our approach learns policies that better
capture the effect of less frequently observed treatment combinations.

Figure 1: Illustration of Q-network architectures, which take the state s as input and output Q(s, a)
for a selected action. In this example, the action space A consists of D = 3 binary sub-action spaces,
each containing two sub-actions (depicted as icons in the original figure). (a) Learning with the
combinatorial action space requires 2³ = 8 output heads (exponential in D), one for each combination
of sub-actions. (b) Incorporating the linear Q decomposition for the factored action space requires
2 × 3 = 6 output heads (linear in D).
Our work provides both theoretical insights and empirical evidence for RL practitioners to consider
this simple linear decomposition for value-based RL approaches. Our contribution complements
many popular offline RL methods focused on distribution shift (e.g., BCQ [14]) and goes beyond
pessimism-only methods by leveraging domain knowledge. Because our approach is compatible with
any algorithm that has a Q-function component, we expect it to lead to gains for offline RL problems
with combinatorial action spaces where data are limited and where domain knowledge can be used to
check the validity of the theoretical assumptions.
2 Problem Setup
We consider Markov decision processes (MDPs) defined by a tuple M = (S, A, p, r, µ0 , γ), where
S and A are the state and action spaces, p(s′|s, a) and r(s, a) are the transition and instantaneous
reward functions, µ0 (s) is the initial state distribution, and γ ∈ [0, 1] is the discount factor. A
probabilistic policy π(a|s) specifies a mapping from each state to a probability distribution over
actions. For a deterministic policy, π(s) refers to the action with π(a|s) = 1. The state-value function
is defined as V^π(s) = E_π E_M[ Σ_{t=1}^{∞} γ^{t−1} r_t | s_1 = s ]. The action-value function, Q^π(s, a), is
defined by further restricting the action taken from the starting state. The goal of RL is to find a policy
π∗ = arg max_π E_{s∼µ_0}[V^π(s)] (or an approximation) that has the maximum expected performance.
2.1 Factored Action Spaces
While the standard MDP definition abstracts away the underlying structure within the action space A,
in this paper, we explicitly express a factored action space as a Cartesian product of D sub-action
spaces, A = ×_{d=1}^{D} A_d = A_1 × · · · × A_D. We use a ∈ A to denote each action, which can be written
as a vector of sub-actions a = [a1 , . . . , aD ], with each ad ∈ Ad . In general, a sub-action space can
be discrete or continuous, and the cardinalities of discrete sub-action spaces are not required to be the
same. For clarity of analysis and illustration, we consider discrete sub-action spaces in this paper.
2.2 Linear Decomposition of Q Function
The traditional factored MDP literature almost exclusively considers state space factorization [15]. In
contrast, here we capitalize on action space factorization to parameterize value functions. Specifically,
our approach considers a linear decomposition of the Q function, as illustrated in Figure 1b:
Q^π(s, a) = Σ_{d=1}^{D} q_d(s, a_d).          (1)
Each component qd (s, ad ) in the summation is allowed to condition on the full state space s and only
one sub-action ad . While similar forms of decomposition have been used in past work, there are key
differences in how the summation components are parameterized. In the multi-agent RL literature,
each component qd (sd , ad ) can only condition on the corresponding state space of the d-th agent [e.g.,
8, 9]. The decomposition in Eqn. (1) also differs from a related form of decomposition considered
by Juozapaitis et al. [12] where each component qd (s, a) can condition on the full action a. To the
best of our knowledge, we are the first to consider this specific form of Q-function decomposition
backed by both theoretical rigor and empirical evidence; in addition, we are the first to apply this idea
to offline RL. We discuss other related work in Section 5.
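For concreteness, a minimal sketch of the two architectures in Figure 1 is given below. PyTorch is assumed, the backbone is an arbitrary two-layer MLP, and the class names, layer sizes, and the restriction to binary sub-actions are illustrative rather than taken from the released implementation.

```python
import torch
import torch.nn as nn


class CombinatorialQNet(nn.Module):
    """Baseline (Figure 1a): one output head per sub-action combination (2^D heads)."""
    def __init__(self, state_dim, n_subactions=3, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 ** n_subactions),   # exponential in D
        )

    def forward(self, s):            # s: [batch, state_dim]
        return self.net(s)           # Q(s, .) over all 2^D combinations


class FactoredQNet(nn.Module):
    """Proposed (Figure 1b): Q(s, a) = sum_d q_d(s, a_d), using 2*D heads."""
    def __init__(self, state_dim, n_subactions=3, hidden=64):
        super().__init__()
        self.D = n_subactions
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * n_subactions),    # linear in D
        )

    def forward(self, s, a):
        # a: [batch, D] with binary entries; returns Q(s, a) as [batch]
        q = self.net(s).view(-1, self.D, 2)                        # q_d(s, .) for each d
        return q.gather(2, a.long().unsqueeze(-1)).squeeze(-1).sum(dim=1)
```

Both networks share the same state-encoder capacity; the only difference is how the output heads are indexed by the action, which is what restricts the factored network to the linear function class in Eqn. (1).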
3 Theoretical Analyses
In this section, we study the theoretical properties of the linear Q-function decomposition induced by
factored action spaces. We first present sufficient and necessary conditions for our approach to yield
unbiased estimates, and then analyze settings in which our approach can reduce variance without
sacrificing policy performance when the conditions are violated. Finally, we discuss how domain
knowledge may be used to check the validity of these conditions, providing examples in healthcare.
(Figure: panel (a) shows the 2-D factored action space A_x × A_y with sub-actions [Left, Right] × [Down, Up]; the remaining panels show an original MDP with +1/+2 rewards, its abstract MDP in the y-direction, and the example decomposition q_x(s_{0,0}, a_x) + q_y(s_{0,0}, a_y) = Q^π(s_{0,0}, a) for a = [a_x, a_y].)
Therefore, while Theorem 1 imposes a rather stringent set of assumptions on the MDP structure
(transitions, rewards) and the policy, violations of these conditions do not preclude the linear parame-
terization of the Q-function from being an unbiased estimator.
3.3 How Does Bias Affect Policy Learning?
When the sufficient conditions do not hold perfectly, using the linear parameterization in Eqn. (1) to
fit the Q-function may incur nonzero approximation error (bias). This can affect the performance of
the learned policy; in Appendix B.3, we derive error bounds based on the extent of bias relative to
the sufficient conditions in Theorem 1. Despite this bias, our approach always leads to a reduction
in the variance of the estimator. This gives us an opportunity to achieve a better bias-variance
trade-off, especially given limited historical data in the offline setting. In addition, as we will
demonstrate, biased Q-values do not always result in suboptimal policy performance, and we identify
the characteristics of problems where this occurs under our proposed linear decomposition.
Applying our approach amounts to solving for the parameters rLeft , rRight , rDown , rUp of the linear system
in Figure 4b, while dropping the interaction term rInteract , resulting in a form of omitted-variable bias
[19]. Solving the system gives the approximate value function where the interaction term β appears
in the approximation Q̂ for all arms (Figure 4c, details in Appendix B.8).
Note that Q̂ = Q∗ only when β = 0, i.e., there is no interaction between the two sub-actions. We
first consider the family of problems with α = 1 and β ∈ [−4, 4]. In Figure 5a, we measure the
value approximation error RMSE(Q∗, Q̂), as well as the suboptimality V^{π∗} − V^{π̂} = max_a Q∗(a) −
Q∗(arg max_a Q̂(a)) of the greedy policy defined by Q̂ as compared to π∗. As expected, when β = 0,
Q̂ is unbiased and has zero approximation error. When β ≠ 0, Q̂ is biased and RMSE > 0; however,
for β ≥ −1, Q̂ corresponds to a policy that correctly identifies the optimal action.
We further investigate this phenomenon considering both α, β ∈ [−4, 4] (to show all regions with
interesting trends), measuring RMSE and suboptimality in the same way as above. As shown in
Figure 5b, the approximation error is zero only when β = 0, regardless of α. However, in Figure 5c,
for a wide range of α and β settings, suboptimality is zero; this suggests that in those regions, even
in the presence of bias (non-zero approximation error), our approach leads to an approximate value
function that correctly identifies the optimal action. The irregular contour outlines multiple regions
where this happens; one key region is when the two sub-actions affect the reward in the same direction
(i.e., α ≥ 0) and their interaction effects also affect the reward in the same direction (i.e., β ≥ 0).
Figure 5: (a) The approximation error and policy suboptimality of our approach for the bandit
problem in Figure 4a, for different settings of β when α = 1. The Q-value approximation is unbiased
only when β = 0, but the corresponding approximate policy is optimal for a wider range of β ≥ −1.
(b-c) The approximation error and policy suboptimality of our approach for the bandit problem in
Figure 4a, for different settings of α and β. The Q-value approximation is unbiased only when
β = 0, but the corresponding approximate policy is optimal for a wide range of α and β values. The
highlighted region of zero suboptimality corresponds to α ≥ 0 and β ≥ 0.
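The computation behind Figure 5 is small enough to sketch directly. Since the exact reward table of Figure 4a is not reproduced here, the sketch below assumes a 2×2 bandit with Q∗([Left, Down]) = 0, Q∗([Right, Down]) = 1, Q∗([Left, Up]) = α, and Q∗([Right, Up]) = 1 + α + β, so that β is exactly the omitted interaction; it fits the additive model by least squares and reports the approximation error and the suboptimality of the induced greedy action.

```python
import numpy as np

def analyze(alpha, beta):
    # Assumed 2x2 reward table: rows = x sub-action (Left, Right), cols = y sub-action (Down, Up).
    Q_star = np.array([[0.0, alpha],
                       [1.0, 1.0 + alpha + beta]])

    # Additive model Q_hat(ax, ay) = q_x(ax) + q_y(ay): one indicator column per sub-action value.
    # The design omits the interaction column, so the fit exhibits omitted-variable bias when beta != 0.
    X = np.array([[1, 0, 1, 0],    # [Left, Down]
                  [1, 0, 0, 1],    # [Left, Up]
                  [0, 1, 1, 0],    # [Right, Down]
                  [0, 1, 0, 1]])   # [Right, Up]
    y = np.array([Q_star[0, 0], Q_star[0, 1], Q_star[1, 0], Q_star[1, 1]])
    w, *_ = np.linalg.lstsq(X, y, rcond=None)   # rank-deficient design; lstsq gives the projection
    Q_hat = (X @ w).reshape(2, 2)

    rmse = np.sqrt(np.mean((Q_star - Q_hat) ** 2))
    greedy = np.unravel_index(np.argmax(Q_hat), Q_hat.shape)   # ties broken by first index
    suboptimality = Q_star.max() - Q_star[greedy]
    return rmse, suboptimality

for beta in [-2.0, 0.0, 2.0]:
    print(beta, analyze(alpha=1.0, beta=beta))
```

Under this assumed parameterization, β = 0 recovers Q∗ exactly, while β ≠ 0 yields RMSE = |β|/4 yet can still leave the greedy action optimal, consistent with the trends shown in Figure 5.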
Based on our theoretical analysis, strong assumptions (Section 3.1) on the problem structure (though
not necessary, Section 3.2) are the only known way to guarantee the unbiasedness of our proposed
linear approximation. It is thus crucial to understand the applicability (and inapplicability) of our
approach in real-world scenarios. Exploring to what extent these assumptions hold in practice is
especially important for safety-critical domains such as healthcare where incorrect actions (treatments)
can have devastating consequences. Fortunately, RL tasks for healthcare are often equipped with
significant domain knowledge, which serves as a better guide to inform the algorithm design than
heuristics-driven reasoning alone [20, 5, 9].
Oftentimes, when clinicians treat conditions using multiple medications at the same time (giving rise
to the factored action space), it is because each medication has a different “mechanism of action,”
resulting in negligible or limited interactions. For example, several classes of medications are used in
the management of chronic heart failure, and each has unique and incremental benefits for patient
outcomes [21]. Problems such as this satisfy the sufficient conditions in Section 3.1 in spite of a
non-factorized state space. Moreover, any small interactions would have a bounded effect on RL
policy performance (according to Appendix B.3).
Similarly, in the management of sepsis (which we consider in Section 4.2), fluids and vasopressors
affect blood pressure to correct hypotension via different mechanisms [22]. Fluid infusion increases
“preload” by increasing the blood return to the heart to make sure the heart has enough blood to
pump out [23]. In contrast, common vasopressors (e.g., norepinephrine) increase “inotropy” by
stimulating the heart muscle and increase peripheral vascular resistance to maintain perfusion to
organs [24, 25]. Therefore, while the two treatments may appear to operate on the same part of the
state space (e.g., they both increase blood pressure), in general they are not expected to interfere with
each other. Recently, there has also been evidence suggesting that their combination can better correct
hypotension [26], which places this problem approximately in the regime discussed in Section 3.3.2.
In offline settings with limited historical data, the benefits of a reduction in variance can outweigh
any potential small bias incurred in the scenarios above and lead to overall performance improvement
(Section 3.3.1). However, our approach is not suitable if the interaction is counter to the effect of
the sub-actions (e.g., two drugs that raise blood pressure individually, but when combined lead to
a decrease). In such scenarios, the resulting bias will likely lead to suboptimal performance (Sec-
tion 3.3.2). Nevertheless, many drug-drug interactions are known and predictable [27–30]. In such
cases, one can either explicitly encode the interaction terms or resort back to a combinatorial action
space (Appendix B.9). While we focus on healthcare, there are other domains in which significant
domain knowledge regarding the interactions among sub-actions is available, e.g., cooperative multi-
agent games in finance where there is a higher payoff if agents cooperate (positive interaction effects)
or intelligent tutoring systems that teach basic arithmetic operations as well as fractions (which are
distinct but related skills). For these problems, this knowledge can and should be leveraged.
4 Experimental Evaluations
We apply our approach to two offline RL problems from healthcare: a simulated and a real-data
problem, both having an action space that is composed of several sub-action spaces. These problems
correspond to settings discussed in Section 3.4 where we expect our proposed approach to perform
well. In the following experiments, we compare our proposed approach (Figure 1b), which makes
assumptions regarding the effect of sub-actions in combination with other sub-actions, against a
common baseline that considers a combinatorial action space (Figure 1a).
(Figure 6: five panels, left to right ρ = 0.9125 (ε = 0.1), ρ = 0.5625 (ε = 0.5), ρ = 0.125, ρ = 0.01, ρ = 0; x-axes: sample size in training episodes, 10² to 10⁴; y-axes: performance (policy value); legend: Baseline vs. Proposed.)
Figure 6: Performance on the sepsis simulator across sample sizes and behavior policies. Plots
display the performance over 10 runs, with the trend lines showing medians and error bars showing
interquartile ranges. ρ is the probability of taking the optimal action under the behavior policies
used to generate offline datasets from the simulator. The left two plots show two ε-greedy policies
(ρ > 0.125; conversion: ρ = (1 − ε) + ε/|A|); the middle plot shows a uniformly random policy
(ρ = 0.125); the right two plots show two policies that undersample the optimal action (ρ < 0.125);
from left to right, ρ decreases. Across different data distributions, our proposed approach outperforms
the baseline at small sample sizes, and closely matches baseline performance at large sample sizes.
Dashed lines denote the value of the optimal policy, which equals 0.736.
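As a sanity check of the conversion (assuming the simulator's combinatorial action space has |A| = 2³ = 8 actions, consistent with the uniformly random value ρ = 1/8 = 0.125): ε = 0.1 gives ρ = (1 − 0.1) + 0.1/8 = 0.9125, and ε = 0.5 gives ρ = (1 − 0.5) + 0.5/8 = 0.5625, matching the panel values above.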
At small sample sizes (< 5000), the proposed approach consistently outperforms the baseline. As sample size
increases further, the performance gap shrinks and eventually the baseline overtakes our proposed
approach. This is because variance decreases with increasing sample size but the bias incurred by the
factored approximation does not change. Once there are enough samples, reductions in variance are
no longer advantageous and the incurred bias dominates the performance. Overall, this shows that
our approach is promising especially for datasets with limited sample size.
How does behavior policy affect performance? As we anneal the behavior policy closer to the optimal
policy (ρ > 0.125, Figure 6 left two), we reduce the randomness in the behavior policy and limit the
amount of exploration possible at the same sample size. The same overall trend largely holds. On
the other hand, when the probability of taking the optimal action is less than random (ρ < 0.125,
Figure 6 right two), the proposed approach achieves better performance than the baseline with an
even larger gap for limited sample sizes (≤ 10³). Without observing the optimal actions (ρ = 0),
the baseline performs relatively poorly, even for large sample sizes. In comparison, our approach
accounts for relationships among actions to some extent and is thus able to better generalize to the
unobserved and underexplored optimal actions, thereby outperforming the baseline.
Takeaways. In a challenging situation where our theoretical assumptions do not perfectly hold, our
proposed approach matches or outperforms the baseline, especially for smaller sample sizes.
(Figure 8: heatmaps over the treatment grid, with rows indexing IV fluid dose bins (0, 1–500 mL, 500 mL–1 L, 1–2 L, >2 L per 4 h) and columns indexing vasopressor dose bins (0, 0.001–0.08, 0.08–0.2, 0.2–0.45, >0.45 μg/kg/min); the left panels show per-action counts under each policy, and the rightmost panel shows per-action state heterogeneity values of roughly 0.14 to 0.18.)
Figure 8: (a) Qualitative comparison of policies. (b) Per-action state heterogeneity, measured as the
standard deviation of all state embeddings from which a particular action is observed in the dataset,
averaged over state embedding dimensions. Actions with higher IV fluid doses exhibited greater
heterogeneity in the observed states from which those actions were taken (by the clinician policy).
The Q-networks were trained for a maximum of 10,000 iterations, with checkpoints saved every 100
iterations. We performed model selection [31] over the saved checkpoints (candidate policies) by
evaluating policy performance using the validation set with OPE. Specifically, we estimated policy
value using weighted importance sampling (WIS) and measured effective sample size (ESS), where
the behavior policy was estimated using k nearest neighbors in the embedding space. Following
previous work [39], the final policies were selected by maximizing validation WIS with ESS of
≥ 200 (we consider other thresholds in Appendix D.3), for which we report results on the test set.
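For reference, a minimal sketch of the WIS and ESS computations used in this selection step is shown below; it assumes per-step probabilities of the logged actions under the candidate and (estimated) behavior policies have already been computed, and the function name and data layout are illustrative rather than taken from the released code.

```python
import numpy as np

def wis_and_ess(probs_eval, probs_behav, rewards, gamma=1.0):
    """probs_eval, probs_behav: lists (one entry per trajectory) of per-step probabilities of the
    logged action under the evaluation and behavior policies; rewards: per-step rewards per trajectory."""
    weights, returns = [], []
    for pe, pb, r in zip(probs_eval, probs_behav, rewards):
        weights.append(np.prod(np.asarray(pe) / np.asarray(pb)))            # per-trajectory ratio
        returns.append(np.sum(gamma ** np.arange(len(r)) * np.asarray(r)))  # discounted return
    w, g = np.asarray(weights), np.asarray(returns)
    wis = np.sum(w * g) / np.sum(w)        # self-normalized (weighted) importance sampling estimate
    ess = np.sum(w) ** 2 / np.sum(w ** 2)  # one common definition of effective sample size
    return wis, ess
```

Checkpoints would then be ranked by the WIS estimate among those whose ESS clears the chosen threshold (≥ 200 above).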
Results. We visualize the validation performance over all candidate policies. Figure 7-left shows
that the performance Pareto frontier (in terms of WIS and ESS) of the proposed approach generally
dominates the baseline.
Quantitative comparisons. Evaluating the final selected policies on the test set (Figure 7-right) shows
that the proposed factored BCQ achieves a higher policy value (estimated using WIS) than baseline
BCQ at the same level of ESS. In addition, both policies have a similar level of agreement with the
clinician policy, comparable to the average agreement among clinicians.
Qualitative comparisons. In Figure 8a, we compare the distributions of recommended actions by
the clinician behavior policy, baseline BCQ and factored BCQ, as evaluated on the test set. While
overall the policies look rather similar, in that the most frequently recommended action corresponds
to low doses of IV fluids <500mL with no vasopressors, there are notable differences for key parts of
the action space. In particular, baseline BCQ almost never recommends higher doses of IV fluids
>500 mL, either alone or in combination with vasopressors, whereas both clinician and factored
BCQ recommend IV fluids >500 mL more frequently. These actions are typically used for critically
ill patients, for whom the Surviving Sepsis Campaign guidelines recommend up to >2 L of fluids
[40]. We hypothesize that this difference is due to a higher level of heterogeneity in the patient states
for which actions with high IV fluid doses were observed, compared to the remaining actions with
lower doses of IV fluids. To further understand this phenomenon, we measure the per-action state
heterogeneity in the test set by computing, for each action, the standard deviation (averaged over
the embedding dimensions) of all RNN state embeddings from which that action is taken according
to the behavior policy. As shown in Figure 8b, actions with higher IV fluids generally have larger
standard deviations, supporting our hypothesis. The larger heterogeneity combined with lower sample
sizes makes it difficult for baseline BCQ to correctly infer the effects of these actions, as it does not
leverage the relationship among actions. In contrast, our approach leverages the factored action space
and can thus make better inferences about these actions.
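The heterogeneity measure itself is straightforward to compute; the sketch below assumes an array of RNN state embeddings and the corresponding logged actions (names are illustrative).

```python
import numpy as np

def per_action_state_heterogeneity(embeddings, actions):
    """embeddings: [N, d] state embeddings; actions: [N] logged (behavior-policy) action indices.
    Returns, for each action, the standard deviation of the embeddings of states in which that
    action was taken, averaged over embedding dimensions."""
    return {int(a): embeddings[actions == a].std(axis=0).mean()
            for a in np.unique(actions)}
```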
Takeaways. Applied to real clinical data, our proposed approach outperforms the baseline quantita-
tively and recommends treatments that align better with clinical knowledge. While promising, these
results are based in part on OPE, which has many issues [32, 33]. We stress that further investigation
and close collaboration with clinicians are essential before such RL algorithms are used in practice.
5 Related Work
For many years, the factored RL literature focused exclusively on state space factorization [15, 41–43].
More recently, interest in action space factorization has grown, as RL is applied in increasingly
more complex planning and control tasks. In particular, researchers have previously considered
the model-based setting with known MDP factorizations in which both state and action spaces are
factorized [44–47]. For model-free approaches, others have studied methods for factored actions
with a policy component (i.e., policy-based or actor-critic) [48, 20, 49–51]. In contrast, our work
considers value-based methods as those have been the most successful in offline RL [52].
Among prior work with a value-based component (e.g., Q-network), the majority pertains to multi-
agent [8–10, 53, 54] or multi-objective [51] problems that impose known, explicit assumptions on
the state space or the reward function. Notably, Son et al. [10] established theoretical conditions
for factored optimal actions (called “Individual-Global-Max”) for multi-agent RL and motivated
subsequent works [55, 56]; their result differs from our contribution on the unbiasedness of factored
Q-functions (instead of actions) for single-agent RL. In the online setting for single-agent deep
RL, Sharma et al. [20] and Tavakoli et al. [5] incorporated factored action spaces into Q-network
architecture designs, but did not provide a formal justification for the linear decomposition. Others
have empirically compared various “mixing” functions to combine the values of sub-actions [20, 9]. In
contrast, while our work only considers the linear decomposition function, we examine its theoretical
properties and provide justifications for using this approach in practical problems, especially in offline
settings. Our linear Q-decomposition is related to that of Swaminathan et al. [57] who also applied a
linearity assumption for off-policy evaluation, but for combinatorial contextual bandits rather than
RL. Our insights on the bias-variance trade-off are also related to concurrent work by Saito and
Joachims [58], who proposed efficient off-policy evaluation for bandit problems with large (but not
necessarily factored) action spaces. In Table 1 of the appendix, we further outline the differences
between our work and the existing literature.
Finally, the sufficient conditions we establish are related to, but different from, those identified by
Van Seijen et al. [59] and Juozapaitis et al. [12] who considered reward decompositions in the absence
of factored actions. Related, Metz et al. [60] proposed an approach that sequentially predicts values
for discrete dimensions of a transformed continuous action space, but assume an a priori ordering
of action dimensions, which we do not; Pierrot et al. [50] studied a different form of action space
factorization where sub-actions are sequentially selected in an autoregressive manner. Complementary
to our work, Tavakoli et al. [6] proposed to organize the sub-actions and interactions as a hypergraph
and linearly combining the values; our theoretical results on the linear decomposition nonetheless
apply to their setting where the sub-action interactions are explicitly identified and encoded.
6 Conclusion
To better leverage factored action spaces in RL, we developed an approach to learning policies that
incorporates a simple linear decomposition of the Q-function. We theoretically analyze the sufficient
and necessary conditions for this parameterization to yield unbiased estimates, study its effect on
variance reduction, and identify scenarios when any resulting bias does not lead to suboptimal
performance. We also note how domain knowledge may be used to inform the applicability of
our approach in practice, for problems where any possible bias is negligible or does not affect
optimality. Through empirical experiments on two offline RL problems involving a simulator and
real clinical data, we demonstrate the advantage of our approach especially in settings with limited
sample sizes. We provide further discussions on limitations, ethical considerations and societal
impacts in Appendix A. Though motivated by healthcare, our approach could apply more broadly
to scale RL to other applications (e.g., education) involving combinatorial action spaces where
domain knowledge may be used to verify the theoretical conditions. Future work should consider the
theoretical implications of linear Q decompositions when combined with other offline RL-specific
algorithms [52]. Given the challenging nature of identifying the best treatments from offline data, our
proposed approach may also be combined with other RL techniques that do not aim to identify the
single best action (e.g., learning dead-ends [61] or set-valued policies [34]).
Acknowledgments
This work was supported by the National Science Foundation (NSF; award IIS-1553146 to JW;
award IIS-2007076 to FDV; award IIS-2153083 to MM) and the National Library of Medicine of
the National Institutes of Health (NLM; grant R01LM013325 to JW and MWS). The views and
conclusions in this document are those of the authors and should not be interpreted as necessarily
representing the official policies, either expressed or implied, of the National Science Foundation, nor
of the National Institutes of Health. This work was supported, in part, by computational resources
and services provided by Advanced Research Computing, a division of Information and Technology
Services (ITS) at the University of Michigan, Ann Arbor. The authors would like to thank Adith
Swaminathan, Tabish Rashid, and members of the MLD3 group for helpful discussions regarding
this work, as well as the reviewers for constructive feedback.
References
[1] Damien Ernst, Guy-Bart Stan, Jorge Goncalves, and Louis Wehenkel. Clinical data based optimal STI
strategies for HIV: a reinforcement learning approach. In Proceedings of the 45th IEEE Conference on
Decision and Control, pages 667–672. IEEE, 2006.
[2] Matthieu Komorowski, Leo A Celi, Omar Badawi, Anthony C Gordon, and A Aldo Faisal. The artificial
intelligence clinician learns optimal treatment strategies for sepsis in intensive care. Nature medicine, 24
(11):1716–1720, 2018.
[3] Niranjani Prasad, Li Fang Cheng, Corey Chivers, Michael Draugelis, and Barbara E Engelhardt. A
reinforcement learning approach to weaning of mechanical ventilation in intensive care units. In Conference
on Uncertainty in Artificial Intelligence (UAI), 2017.
[4] Sonali Parbhoo, Mario Wieser, Volker Roth, and Finale Doshi-Velez. Transfer learning from well-curated
to less-resourced populations with HIV. In Machine Learning for Healthcare Conference, pages 589–609.
PMLR, 2020.
[5] Arash Tavakoli, Fabio Pardo, and Petar Kormushev. Action branching architectures for deep reinforcement
learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.
[6] Arash Tavakoli, Mehdi Fatemi, and Petar Kormushev. Learning to represent action values as a hypergraph
on the action vertices. In International Conference on Learning Representations, 2021. URL https:
//openreview.net/forum?id=Xv_s64FiXTv.
[7] Craig Boutilier. Planning, learning and coordination in multiagent decision processes. In TARK, volume 96,
pages 195–210. Citeseer, 1996.
[8] Peter Sunehag, Guy Lever, Audrunas Gruslys, Wojciech Marian Czarnecki, Vinicius Zambaldi, Max Jader-
berg, Marc Lanctot, Nicolas Sonnerat, Joel Z. Leibo, Karl Tuyls, and Thore Graepel. Value-decomposition
networks for cooperative multi-agent learning based on team reward. In Proceedings of the 17th Interna-
tional Conference on Autonomous Agents and MultiAgent Systems, AAMAS, 2018.
[9] Tabish Rashid, Mikayel Samvelyan, Christian Schroeder, Gregory Farquhar, Jakob Foerster, and Shimon
Whiteson. QMIX: Monotonic value function factorisation for deep multi-agent reinforcement learning. In
International Conference on Machine Learning, pages 4295–4304. PMLR, 2018.
[10] Kyunghwan Son, Daewoo Kim, Wan Ju Kang, David Earl Hostallero, and Yung Yi. QTRAN: Learning
to factorize with transformation for cooperative multi-agent reinforcement learning. In International
conference on machine learning, pages 5887–5896. PMLR, 2019.
[11] Ming Zhou, Yong Chen, Ying Wen, Yaodong Yang, Yufeng Su, Weinan Zhang, Dell Zhang, and Jun
Wang. Factorized Q-learning for large-scale multi-agent systems. In Proceedings of the First International
Conference on Distributed Artificial Intelligence, pages 1–7, 2019.
[12] Zoe Juozapaitis, Anurag Koul, Alan Fern, Martin Erwig, and Finale Doshi-Velez. Explainable reinforce-
ment learning via reward decomposition. In IJCAI/ECAI Workshop on Explainable Artificial Intelligence,
2019.
[13] Michael Oberst and David Sontag. Counterfactual off-policy evaluation with Gumbel-max structural causal
models. In International Conference on Machine Learning, pages 4881–4890, 2019.
[14] Scott Fujimoto, David Meger, and Doina Precup. Off-policy deep reinforcement learning without explo-
ration. In International Conference on Machine Learning, pages 2052–2062. PMLR, 2019.
[15] Daphne Koller and Ronald Parr. Computing factored value functions for policies in structured MDPs.
In Proceedings of the 16th International Joint Conference on Artificial Intelligence-Volume 2, pages
1332–1339, 1999.
[16] Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of machine learning. MIT
press, 2018.
[17] Yaqi Duan, Chi Jin, and Zhiyuan Li. Risk bounds and Rademacher complexity in batch reinforcement
learning. arXiv preprint arXiv:2103.13883, 2021.
[18] Maggie Makar, Ben Packer, Dan Moldovan, Davis Blalock, Yoni Halpern, and Alexander D’Amour.
Causally motivated shortcut removal using auxiliary labels. In International Conference on Artificial
Intelligence and Statistics, pages 739–766. PMLR, 2022.
[19] Jeffrey M Wooldridge. Introductory econometrics: A modern approach. Cengage learning, 2015.
[20] Sahil Sharma, Aravind Suresh, Rahul Ramesh, and Balaraman Ravindran. Learning to factor policies
and action-value functions: Factored action space representations for deep reinforcement learning. arXiv
preprint arXiv:1705.07269, 2017.
[21] Michel Komajda, Michael Boehm, Jeffrey S Borer, Ian Ford, Luigi Tavazzi, Matthieu Pannaux, and Karl
Swedberg. Incremental benefit of drug therapies for chronic heart failure with reduced ejection fraction: a
network meta-analysis. European journal of heart failure, 20(9):1315–1322, 2018.
[22] Jeffrey Gotts and Michael Matthay. Sepsis: pathophysiology and clinical management. BMJ, 353, 2016.
[23] Laurent Guérin, Jean-Louis Teboul, Romain Persichini, Martin Dres, Christian Richard, and Xavier Monnet.
Effects of passive leg raising and volume expansion on mean systemic pressure and venous return in shock
in humans. Critical Care, 19(1):1–9, 2015.
[24] Olfa Hamzaoui, Jean-François Georger, Xavier Monnet, Hatem Ksouri, Julien Maizel, Christian Richard,
and Jean-Louis Teboul. Early administration of norepinephrine increases cardiac preload and cardiac
output in septic patients with life-threatening hypotension. Critical care, 14(4):1–9, 2010.
[25] Xavier Monnet, Julien Jabot, Julien Maizel, Christian Richard, and Jean-Louis Teboul. Norepinephrine
increases cardiac preload and reduces preload dependency assessed by passive leg raising in septic shock
patients. Critical care medicine, 39(4):689–694, 2011.
[26] Olfa Hamzaoui. Combining fluids and vasopressors: A magic potion? Journal of Intensive Medicine, 2021.
ISSN 2667-100X. doi: https://doi.org/10.1016/j.jointm.2021.09.004.
[27] Teijo I Saari, Kari Laine, Mikko Neuvonen, Pertti J Neuvonen, and Klaus T Olkkola. Effect of voricona-
zole and fluconazole on the pharmacokinetics of intravenous fentanyl. European journal of clinical
pharmacology, 64(1):25–30, 2008.
[28] Pamela L Smithburger, Sandra L Kane-Gill, and Amy L Seybert. Drug–drug interactions in the medical
intensive care unit: an assessment of frequency, severity and the medications involved. International
Journal of Pharmacy Practice, 20(6):402–408, 2012.
[29] WebMD. Drugs interaction checker - find interactions between medications. https://www.webmd.com/
interaction-checker/default.htm.
[30] Epic. Lexicomp and medi-span for epic - drug reference information. https://www.wolterskluwer.com/en/
solutions/lexicomp/about/epic/drug-reference.
[31] Shengpu Tang and Jenna Wiens. Model selection for offline reinforcement learning: Practical considerations
for healthcare settings. In Machine Learning for Healthcare Conference, pages 2–35. PMLR, 2021.
[32] Omer Gottesman, Fredrik Johansson, Joshua Meier, Jack Dent, Donghun Lee, Srivatsan Srinivasan,
Linying Zhang, Yi Ding, David Wihl, Xuefeng Peng, et al. Evaluating reinforcement learning algorithms
in observational health settings. arXiv preprint arXiv:1805.12298, 2018.
[33] Omer Gottesman, Fredrik Johansson, Matthieu Komorowski, Aldo Faisal, David Sontag, Finale Doshi-
Velez, and Leo Anthony Celi. Guidelines for reinforcement learning in healthcare. Nature medicine, 25(1):
16–18, 2019.
[34] Shengpu Tang, Aditya Modi, Michael Sjoding, and Jenna Wiens. Clinician-in-the-loop decision making:
Reinforcement learning with near-optimal set-valued policies. In International Conference on Machine
Learning, pages 9387–9396. PMLR, 2020.
[35] Taylor W Killian, Haoran Zhang, Jayakumar Subramanian, Mehdi Fatemi, and Marzyeh Ghassemi. An
empirical study of representation learning for reinforcement learning in healthcare. In Machine Learning
for Health NeurIPS Workshop, 2020.
[36] Alistair EW Johnson, Tom J Pollard, Lu Shen, Li-wei H Lehman, Mengling Feng, Mohammad Ghassemi,
Benjamin Moody, Peter Szolovits, Leo Anthony Celi, and Roger G Mark. MIMIC-III, a freely accessible
critical care database. Scientific data, 3(1):1–9, 2016.
[37] Jayakumar Subramanian and Aditya Mahajan. Approximate information state for partially observed
systems. In 2019 IEEE 58th Conference on Decision and Control (CDC), pages 1629–1636. IEEE, 2019.
[38] Scott Fujimoto, Edoardo Conti, Mohammad Ghavamzadeh, and Joelle Pineau. Benchmarking batch deep
reinforcement learning algorithms. arXiv preprint arXiv:1910.01708, 2019.
[39] Yao Liu and Emma Brunskill. Avoiding overfitting to the importance weights in offline policy optimization,
2022. URL https://openreview.net/forum?id=dLTXoSIcrik.
[40] Laura Evans, Andrew Rhodes, Waleed Alhazzani, Massimo Antonelli, Craig M Coopersmith, Craig
French, Flávia R Machado, Lauralyn Mcintyre, Marlies Ostermann, Hallie C Prescott, et al. Surviving
sepsis campaign: international guidelines for management of sepsis and septic shock 2021. Intensive care
medicine, 47(11):1181–1247, 2021.
[41] Carlos Guestrin, Daphne Koller, Ronald Parr, and Shobha Venkataraman. Efficient solution algorithms for
factored MDPs. Journal of Artificial Intelligence Research, 19:399–468, 2003.
[42] Alexander L Strehl, Carlos Diuk, and Michael L Littman. Efficient structure learning in factored-state
MDPs. 2007.
[43] Karina Valdivia Delgado, Scott Sanner, and Leliane Nunes De Barros. Efficient solutions to factored MDPs
with imprecise transition probabilities. Artificial Intelligence, 175(9-10):1498–1527, 2011.
[44] Aswin Raghavan, Saket Joshi, Alan Fern, Prasad Tadepalli, and Roni Khardon. Planning in factored action
spaces with symbolic dynamic programming. In Twenty-Sixth AAAI Conference on Artificial Intelligence,
2012.
[45] Aswin Raghavan, Roni Khardon, Alan Fern, and Prasad Tadepalli. Symbolic opportunistic policy iteration
for factored-action mdps. Advances in Neural Information Processing Systems, 26, 2013.
[46] Ian Osband and Benjamin Van Roy. Near-optimal reinforcement learning in factored MDPs. Advances in
Neural Information Processing Systems, 27:604–612, 2014.
[47] Yangyi Lu, Amirhossein Meisami, and Ambuj Tewari. Causal Markov decision processes: Learning good
interventions efficiently. arXiv preprint arXiv:2102.07663, 2021.
[48] Brian Sallans and Geoffrey E Hinton. Reinforcement learning with factored states and actions. The Journal
of Machine Learning Research, 5:1063–1088, 2004.
[49] Tom Van de Wiele, David Warde-Farley, Andriy Mnih, and Volodymyr Mnih. Q-learning in enormous
action spaces via amortized approximate maximization. arXiv preprint arXiv:2001.08116, 2020.
[50] Thomas Pierrot, Valentin Macé, Jean-Baptiste Sevestre, Louis Monier, Alexandre Laterre, Nicolas Perrin,
Karim Beguir, and Olivier Sigaud. Factored action spaces in deep reinforcement learning, 2021. URL
https://openreview.net/forum?id=naSAkn2Xo46.
[51] Thomas Spooner, Nelson Vadori, and Sumitra Ganesh. Factored policy gradients: Leveraging structure
for efficient learning in MOMDPs. In Advances in Neural Information Processing Systems, 2021. URL
https://openreview.net/forum?id=NXGnwTLlWiR.
[52] Sergey Levine, Aviral Kumar, George Tucker, and Justin Fu. Offline reinforcement learning: Tutorial,
review, and perspectives on open problems. arXiv preprint arXiv:2005.01643, 2020.
[53] Laetitia Matignon, Guillaume J Laurent, and Nadine Le Fort-Piat. Independent reinforcement learners
in cooperative Markov games: a survey regarding coordination problems. The Knowledge Engineering
Review, 27(1):1–31, 2012.
[54] Ardi Tampuu, Tambet Matiisen, Dorian Kodelja, Ilya Kuzovkin, Kristjan Korjus, Juhan Aru, Jaan Aru, and
Raul Vicente. Multiagent cooperation and competition with deep reinforcement learning. PLOS ONE, 12(4):
1–15, 04 2017. doi: 10.1371/journal.pone.0172395. URL https://doi.org/10.1371/journal.pone.0172395.
[55] Jianhao Wang, Zhizhou Ren, Terry Liu, Yang Yu, and Chongjie Zhang. QPLEX: Duplex dueling multi-
agent Q-learning. In International Conference on Learning Representations, 2021. URL https://openreview.
net/forum?id=Rcmk0xxIQV.
[56] Jianhao Wang, Zhizhou Ren, Beining Han, Jianing Ye, and Chongjie Zhang. Towards understanding
cooperative multi-agent Q-learning with value factorization. Advances in Neural Information Processing
Systems, 34:29142–29155, 2021.
[57] Adith Swaminathan, Akshay Krishnamurthy, Alekh Agarwal, Miro Dudik, John Langford, Damien Jose,
and Imed Zitouni. Off-policy evaluation for slate recommendation. Advances in Neural Information
Processing Systems, 30, 2017.
[58] Yuta Saito and Thorsten Joachims. Off-policy evaluation for large action spaces via embeddings. arXiv
preprint arXiv:2202.06317, 2022.
[59] Harm Van Seijen, Mehdi Fatemi, Joshua Romoff, Romain Laroche, Tavian Barnes, and Jeffrey Tsang.
Hybrid reward architecture for reinforcement learning. In Advances in Neural Information Processing
Systems, 2017.
[60] Luke Metz, Julian Ibarz, Navdeep Jaitly, and James Davidson. Discrete sequential prediction of continuous
actions for deep RL, 2018. URL https://openreview.net/forum?id=r1SuFjkRW.
[61] Mehdi Fatemi, Taylor W. Killian, Jayakumar Subramanian, and Marzyeh Ghassemi. Medical dead-ends
and learning to identify high-risk states and treatments. In Advances in Neural Information Processing
Systems, 2021. URL https://openreview.net/forum?id=4CRpaV4pYp.
[62] Jinglin Chen and Nan Jiang. Information-theoretic considerations in batch reinforcement learning. In
International Conference on Machine Learning, pages 1042–1051, 2019.
[63] Tengyang Xie and Nan Jiang. Batch value-function approximation with only realizability. In International
Conference on Machine Learning, pages 11404–11413. PMLR, 2021.
[64] Wenhao Zhan, Baihe Huang, Audrey Huang, Nan Jiang, and Jason D Lee. Offline reinforcement learning
with realizability and single-policy concentrability. arXiv preprint arXiv:2202.04634, 2022.
[65] Lihong Li, Thomas J Walsh, and Michael L Littman. Towards a unified theory of state abstraction for
MDPs. Proceedings of the 9th International Symposium on Artificial Intelligence and Mathematics, 4,
2006.
[66] Satinder P Singh and Richard C Yee. An upper bound on the loss from approximate optimal-value functions.
Machine Learning, 16(3):227–233, 1994.
[67] Michael Kearns and Satinder Singh. Near-optimal reinforcement learning in polynomial time. Machine
learning, 49(2):209–232, 2002.
[68] Michail G Lagoudakis and Ronald Parr. Least-squares policy iteration. The Journal of Machine Learning
Research, 4:1107–1149, 2003.
[69] Pranjal Awasthi, Natalie Frank, and Mehryar Mohri. On the Rademacher complexity of linear hypothesis
sets. arXiv preprint arXiv:2007.11045, 2020.
[70] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International
Conference on Learning Representations, 2015.
[71] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare,
Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through
deep reinforcement learning. Nature, 518(7540):529–533, 2015.
Checklist
1. For all authors...
(a) Do the main claims made in the abstract and introduction accurately reflect the paper’s
contributions and scope? [Yes]
(b) Did you describe the limitations of your work? [Yes] Section 3.4, Appendix A
(c) Did you discuss any potential negative societal impacts of your work? [Yes] Appx A
(d) Have you read the ethics review guidelines and ensured that your paper conforms to
them? [Yes]
2. If you are including theoretical results...
(a) Did you state the full set of assumptions of all theoretical results? [Yes] Sec 3, Appx B
(b) Did you include complete proofs of all theoretical results? [Yes] Appendix B
3. If you ran experiments...
(a) Did you include the code, data, and instructions needed to reproduce the main exper-
imental results (either in the supplemental material or as a URL)? [Yes] see Data &
Code Availability
(b) Did you specify all the training details (e.g., data splits, hyperparameters, how they
were chosen)? [Yes] Sec 4.1-4.2, Appendix D.1-D.2
(c) Did you report error bars (e.g., with respect to the random seed after running experi-
ments multiple times)? [Yes] Fig 6
(d) Did you include the total amount of compute and the type of resources used (e.g., type
of GPUs, internal cluster, or cloud provider)? [No] our experiments only involved
small neural networks
4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...
(a) If your work uses existing assets, did you cite the creators? [Yes] MIMIC-III by
Johnson et al. [36], sepsisSim by Oberst and Sontag [13]
(b) Did you mention the license of the assets? [No] please refer to the website of the
creator of MIMIC-III https://physionet.org/content/mimiciii/1.4/
(c) Did you include any new assets either in the supplemental material or as a URL? [N/A]
(d) Did you discuss whether and how consent was obtained from people whose data
you’re using/curating? [No] please refer to the website of the creator of MIMIC-III
https://physionet.org/content/mimiciii/1.4/
(e) Did you discuss whether the data you are using/curating contains personally identifiable
information or offensive content? [Yes] Sec 4.2, MIMIC-III is de-identified
5. If you used crowdsourcing or conducted research with human subjects...
(a) Did you include the full text of instructions given to participants and screenshots, if
applicable? [N/A]
(b) Did you describe any potential participant risks, with links to Institutional Review
Board (IRB) approvals, if applicable? [N/A]
(c) Did you include the estimated hourly wage paid to participants and the total amount
spent on participant compensation? [N/A]
A Additional Discussion
Computational Efficiency. While our main analysis focuses on statistical efficiency (variance) and
its trade-off with approximation error (bias), here we outline some considerations on computational
efficiency. To compute the values for all output heads in Figure 1, there is a clear saving of
computational cost by our approach with a linear complexity O(D) (measured in flops) in the number
of sub-actions, whereas the baseline has an exponential complexity O(exp(D)). We consider two
common inference operations after the values of the output heads are computed: maxa Q(s, a)
and arg maxa Q(s, a). For both operations, the baseline has an exponential time complexity of
O(exp(D)). For our proposed approach, an optimized implementation has a linear time complexity
of O(D): one can perform argmax/max per sub-action and then concatenate/sum the results. In our
current code release, we did not implement the optimized version; instead, we made use of the sub-
action featurization matrix defined in Appendix B.4 so that automatic differentiation can be applied
directly. This implementation is computationally more expensive than our analysis above and than the
baseline: the forward pass includes a dense matrix multiplication with time complexity O(D exp(D))
flops, followed by an O(exp(D)) argmax/max operation. In settings where computational complexity
might be a bottleneck (especially at inference time), we recommend using the featurization matrix
implementation for learning and the optimized version for inference.
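For illustration, the O(D) max/argmax described above amounts to the following (a sketch; the function name and input layout are not from the released code):

```python
import numpy as np

def factored_max_and_argmax(sub_q_values):
    """sub_q_values: list of D arrays, where sub_q_values[d][j] = q_d(s, a_d = j).
    Since Q(s, a) = sum_d q_d(s, a_d), the joint max/argmax over exp(D) actions
    reduces to D independent per-sub-action max/argmax operations."""
    best_value = sum(float(q.max()) for q in sub_q_values)
    best_action = [int(q.argmax()) for q in sub_q_values]
    return best_value, best_action

# Example with D = 3 binary sub-actions:
q_heads = [np.array([0.25, -0.5]), np.array([0.0, 0.5]), np.array([0.25, 0.25])]
print(factored_max_and_argmax(q_heads))   # (1.0, [0, 1, 0])
```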
Limitations. Our theoretical analysis in Section 3 focuses on the “realizability” condition of the linear
function class [62], where we are interested in guarantees of zero approximation error, i.e., whether
the true Q∗ lies within the linear function class. In principle, it is possible to find Q∗ given a realizable
function class (e.g., by enumerating all member functions). However, when Q-learning-style iterative
algorithms are used in practice, their convergence relies on a stronger "completeness" condition, as
discussed in [62–64]. We did not investigate how our proposed form of parameterization (and the
specific shape of bias introduced) interacts with the learning procedure, and this is an interesting
direction for future work (Wang et al. [56] studied this for linear value factorization in the context of
FQI but for multi-agent RL).
Ethical Considerations and Societal Impact. In general, policies computationally derived using RL
must be carefully validated before they can be used in high-stakes domains such as healthcare. Our
linear parameterization implicitly makes an independence assumption with respect to the sub-actions,
allowing the Q-function to generalize to sub-action combinations that are underexplored (and even
unexplored) in the offline data (as shown in Section 4.1). When the independence assumptions are
valid (according to domain knowledge), this is a case of a “free lunch” as we can reduce variance
without introducing any bias. However, inaccurate or incomplete domain knowledge may render the
independence assumptions invalid and cause the agent to incorrectly generalize to dangerous actions
(e.g., learned policy recommends drug combinations with adverse side effects, see Section 3.4). This
misuse may be alleviated by incorporating additional offline RL safeguards to constrain the learned
policy (e.g., BCQ was used in Section 4.2 to restrict the learned policy to not take rarely observed
sub-action combinations). Still, to apply RL in healthcare and other safety-critical domains, it is
important to consult and closely collaborate with domain experts (e.g., clinicians for healthcare
problems) to come up with meaningful tasks and informed assumptions, and perform thorough
evaluations involving both the quantitative and qualitative aspects [32, 33].
B Detailed Theoretical Analyses
To build intuition, we first consider a related setting where D MDPs are running in parallel. If every
MDP evolves independently as controlled by its respective policy, then the total return from all D
MDPs should naturally be the sum of the individual returns from each MDP. Formally, we state the
following proposition involving fully factored MDPs and factored policies. Here, we use the vector
notation s = [s1 , · · · , sD ] to indicate the explicit state space factorization.
Definition 1. Given MDPs M_1, ..., M_D where each M_d is defined by (S_d, A_d, p_d, r_d), a fully
factored MDP M = ⊗_{d=1}^{D} M_d is defined by (S, A, p, r) such that S = ×_{d=1}^{D} S_d, A = ×_{d=1}^{D} A_d,
p(s′|s, a) = ∏_{d=1}^{D} p_d(s′_d|s_d, a_d), and r(s, a) = Σ_{d=1}^{D} r_d(s_d, a_d).

Definition 2. Given MDPs M_1, ..., M_D and policies π_1, ..., π_D where each π_d : S_d → Δ(A_d),
a factored policy π = ⊗_{d=1}^{D} π_d for the MDP M = ⊗_{d=1}^{D} M_d is π : S → Δ(A) such that
π(a|s) = ∏_{d=1}^{D} π_d(a_d|s_d).

Proposition 7. The Q-function of policy π = ⊗_{d=1}^{D} π_d for MDP M = ⊗_{d=1}^{D} M_d can be expressed
as Q^π_M(s, a) = Σ_{d=1}^{D} Q^{π_d}_{M_d}(s_d, a_d).

To match the form in Eqn. (1), we can set q_d(s, a_d) = Q^{π_d}_{M_d}(s_d, a_d). Importantly, each Q^{π_d}_{M_d} does
not depend on any a_{d′} with d′ ≠ d. Note that although our definition of q_d is allowed to condition
on the entire state space s, each Q^{π_d}_{M_d} only depends on s_d. Proposition 7 can be seen as a corollary to
Theorem 1 where the abstractions are defined using the sub-state spaces, such that φ_d : S → S_d.
Proof of Proposition 7. Without loss of generality, we consider the setting with D = 2 such that
A = A_1 × A_2; the extension to D > 2 is straightforward. The proof is based on mathematical induction
on a sequence of h-step Q-functions of π defined as Q^{π,(h)}_M(s, a) = E[ Σ_{t=1}^{h} γ^{t−1} r_t | s_1 = s, a_1 = a, a_t ∼ π ].

Base case. For h = 1, the one-step Q-function is simply the reward, which by assumption satisfies
r(s, a) = r_1(s_1, a_1) + r_2(s_2, a_2). Therefore, Q^{π,(1)}_M(s, a) = Q^{π_1,(1)}_{M_1}(s_1, a_1) + Q^{π_2,(1)}_{M_2}(s_2, a_2).

Inductive step. Suppose Q^{π,(h)}_M(s, a) = Q^{π_1,(h)}_{M_1}(s_1, a_1) + Q^{π_2,(h)}_{M_2}(s_2, a_2) holds. We can express
Q^{π,(h+1)}_M in terms of Q^{π,(h)}_M using the Bellman equation:

    Q^{π,(h+1)}_M(s, a) = r(s, a) + γ Σ_{s′} p(s′|s, a) V^{π,(h)}_M(s′),

where we refer to r(s, a) as term (1) and the second summand as term (2), and
V^{π,(h)}_M(s′) = Σ_{a′} π(a′|s′) Q^{π,(h)}_M(s′, a′).

By Definition 1, term (1) can be written as a sum r(s, a) = r_1(s_1, a_1) + r_2(s_2, a_2) where each summand
depends on only either a_1 or a_2 but not both. Next we show that term (2) also decomposes in a similar
manner. For a given s we have:

    V^{π,(h)}_M(s) = Σ_a π(a|s) Q^{π,(h)}_M(s, a)
                  = Σ_{a_1, a_2} π_1(a_1|s_1) π_2(a_2|s_2) [ Q^{π_1,(h)}_{M_1}(s_1, a_1) + Q^{π_2,(h)}_{M_2}(s_2, a_2) ]
                  = [ Σ_{a_2} π_2(a_2|s_2) ] Σ_{a_1} π_1(a_1|s_1) Q^{π_1,(h)}_{M_1}(s_1, a_1) + [ Σ_{a_1} π_1(a_1|s_1) ] Σ_{a_2} π_2(a_2|s_2) Q^{π_2,(h)}_{M_2}(s_2, a_2)
                  = Σ_{a_1} π_1(a_1|s_1) Q^{π_1,(h)}_{M_1}(s_1, a_1) + Σ_{a_2} π_2(a_2|s_2) Q^{π_2,(h)}_{M_2}(s_2, a_2),

where the bracketed sums equal 1; here we use the fact that π_1(a_1|s_1) Q^{π_1,(h)}_{M_1}(s_1, a_1) is independent of
π_2(a_2|s_2) (and vice versa), and that π_d(·|s_d) is a probability simplex. Letting
V^{π_d,(h)}_{M_d}(s_d) = Σ_{a_d} π_d(a_d|s_d) Q^{π_d,(h)}_{M_d}(s_d, a_d), then V^{π,(h)}_M(s′) = V^{π_1,(h)}_{M_1}(s′_1) + V^{π_2,(h)}_{M_2}(s′_2).

Substituting into term (2), we have:

    Σ_{s′} p(s′|s, a) V^{π,(h)}_M(s′)
        = Σ_{s′_1, s′_2} p_1(s′_1|s_1, a_1) p_2(s′_2|s_2, a_2) [ V^{π_1,(h)}_{M_1}(s′_1) + V^{π_2,(h)}_{M_2}(s′_2) ]
        = [ Σ_{s′_2} p_2(s′_2|s_2, a_2) ] Σ_{s′_1} p_1(s′_1|s_1, a_1) V^{π_1,(h)}_{M_1}(s′_1) + [ Σ_{s′_1} p_1(s′_1|s_1, a_1) ] Σ_{s′_2} p_2(s′_2|s_2, a_2) V^{π_2,(h)}_{M_2}(s′_2)
        = Σ_{s′_1} p_1(s′_1|s_1, a_1) V^{π_1,(h)}_{M_1}(s′_1) + Σ_{s′_2} p_2(s′_2|s_2, a_2) V^{π_2,(h)}_{M_2}(s′_2),

where we make use of a similar independence property between p_1(s′_1|s_1, a_1) V^{π_1,(h)}_{M_1}(s′_1) and
p_2(s′_2|s_2, a_2), and the fact that p_d(·|s_d, a_d) is a probability simplex.

Therefore, we have Q^{π,(h+1)}_M(s, a) = Q^{π_1,(h+1)}_{M_1}(s_1, a_1) + Q^{π_2,(h+1)}_{M_2}(s_2, a_2) as desired, where
Q^{π_d,(h+1)}_{M_d}(s_d, a_d) = r_d(s_d, a_d) + γ Σ_{s′_d} p_d(s′_d|s_d, a_d) Σ_{a′_d} π_d(a′_d|s′_d) Q^{π_d,(h)}_{M_d}(s′_d, a′_d).

By mathematical induction, this decomposition holds for any h-step Q-function. Letting h → ∞
shows that this holds for the full Q-function.
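Proposition 7 can also be verified numerically on small random instances. The sketch below builds a fully factored MDP and a factored policy from two random components (Definitions 1 and 2), performs exact policy evaluation on both sides, and checks equality; the sizes, seed, and helper names are illustrative.

```python
import numpy as np

def policy_eval(P, R, pi, gamma=0.9):
    """Exact policy evaluation. P: [S, A, S], R: [S, A], pi: [S, A]; returns Q: [S, A]."""
    S = P.shape[0]
    P_pi = np.einsum('sat,sa->st', P, pi)   # state-to-state kernel under pi
    r_pi = np.einsum('sa,sa->s', R, pi)     # expected one-step reward under pi
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
    return R + gamma * np.einsum('sat,t->sa', P, V)

def random_mdp(S, A, rng):
    P = rng.random((S, A, S)); P /= P.sum(-1, keepdims=True)
    pi = rng.random((S, A));   pi /= pi.sum(-1, keepdims=True)
    return P, rng.random((S, A)), pi

rng = np.random.default_rng(0)
S1, A1, S2, A2 = 3, 2, 4, 2
P1, R1, pi1 = random_mdp(S1, A1, rng)
P2, R2, pi2 = random_mdp(S2, A2, rng)

# Fully factored MDP and factored policy, with joint indices s = s1*S2 + s2 and a = a1*A2 + a2.
P  = np.einsum('iaj,kbl->ikabjl', P1, P2).reshape(S1 * S2, A1 * A2, S1 * S2)
R  = (R1[:, None, :, None] + R2[None, :, None, :]).reshape(S1 * S2, A1 * A2)
pi = (pi1[:, None, :, None] * pi2[None, :, None, :]).reshape(S1 * S2, A1 * A2)

Q_joint = policy_eval(P, R, pi)
Q1, Q2 = policy_eval(P1, R1, pi1), policy_eval(P2, R2, pi2)
Q_sum = (Q1[:, None, :, None] + Q2[None, :, None, :]).reshape(S1 * S2, A1 * A2)
print(np.allclose(Q_joint, Q_sum))   # True: Q_M^pi = Q_M1^pi1 + Q_M2^pi2
```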
Proof of Theorem 1. Without loss of generality, we consider the setting with D = 2 so A = A1 ×A2 ;
extension to D > 2 is straightforward. The proof is based on mathematical induction on a sequence
Ph
of h-step Q-functions of π denoted by Q(h) (s, a) = E[ t=1 γ t−1 rt |s1 = s, a1 = a, at ∼ π].
Base case. For h = 1, the one-step Q-function is simply the reward, which by assumption r(s, a) =
(1)
r1 (z1 , a1 ) + r2 (z2 , a2 ). We can trivially set qd (zd , ad ) = rd (zd , ad ) such that Q(1) (s, a) =
(1) (1)
q1 (z1 , a1 ) + q2 (z2 , a2 ).
18
(h) (h)
Inductive step. Suppose Q(h) (s, a) = q1 (z1 , a1 ) + q2 (z2 , a2 ) holds. We can express Q(h+1)
in terms of Q(h) using the Bellman equation:
X
Q(h+1) (s, a) = r(s, a) +γ p(s0 |s, a)V (h) (s0 )
s0
| {z }
1 | {z }
2
X
where V (h) (s0 ) = π(a0 |s0 )Q(h) (s0 , a0 ).
a0
1 can be written as a sum r(s, a) = r1 (z1 , a1 ) + r2 (z2 , a2 ) where each summand depends on
only either a1 or a2 but not both. Next we show 2 also decomposes in a similar manner.
For a given s we have:
X
V (h) (s) = π(a|s)Q(h) (s, a)
a
(h) (h)
X
= π1 (a1 |z1 )π2 (a2 |z2 ) q1 (z1 , a1 ) + q2 (z2 , a2 )
a1 ,a2
P
|z: 1 P (h)
P
|z: 1 P (h)
= a π 2
(a2 2 ) a π 1 (a1 |z1 )q 1 (z 1 , a1 ) + a π
1
(a1 1 ) a2 π2 (a2 |z2 )q2 (z2 , a2 )
2 1 1
(h) (h)
X X
= π1 (a1 |z1 )q1 (z1 , a1 ) + π2 (a2 |z2 )q2 (z2 , a2 ) ,
a1 a2
where we used the property that $\pi_1(a_1|z_1)\, q_1^{(h)}(z_1,a_1)$ is independent of $\pi_2(a_2|z_2)$ (and vice versa), and that $\pi_d(\cdot|z_d)$ is a probability simplex. Letting $v_d^{(h)}(z_d) = \sum_{a_d} \pi_d(a_d|z_d)\, q_d^{(h)}(z_d,a_d)$, we can write $V^{(h)}(s') = v_1^{(h)}(z_1') + v_2^{(h)}(z_2')$.
Substituting into term (ii), we have:
$$\begin{aligned}
\sum_{s'} p(s'|s,a)\, V^{(h)}(s')
&= \sum_{z'} \sum_{\tilde{s} \in \phi^{-1}(z')} p(\tilde{s}|s,a)\, V^{(h)}(\tilde{s}) \\
&= \sum_{z'} \Bigg( \sum_{\tilde{s} \in \phi^{-1}(z')} p(\tilde{s}|s,a) \Bigg) \left( v_1^{(h)}(z_1') + v_2^{(h)}(z_2') \right) \\
&= \sum_{z_1', z_2'} p_1(z_1'|z_1,a_1)\, p_2(z_2'|z_2,a_2) \left( v_1^{(h)}(z_1') + v_2^{(h)}(z_2') \right) \\
&= \underbrace{\sum_{z_2'} p_2(z_2'|z_2,a_2)}_{=\,1} \sum_{z_1'} p_1(z_1'|z_1,a_1)\, v_1^{(h)}(z_1') + \underbrace{\sum_{z_1'} p_1(z_1'|z_1,a_1)}_{=\,1} \sum_{z_2'} p_2(z_2'|z_2,a_2)\, v_2^{(h)}(z_2') \\
&= \sum_{z_1'} p_1(z_1'|z_1,a_1)\, v_1^{(h)}(z_1') + \sum_{z_2'} p_2(z_2'|z_2,a_2)\, v_2^{(h)}(z_2'),
\end{aligned}$$
where on the first line we used the property of state abstractions to replace the index of summation; on the second line we used the fact that all $\tilde{s} \in \phi^{-1}(z')$ share the same abstract state vector $z'$ and hence the same value $V^{(h)}(\tilde{s}) = v_1^{(h)}(z_1') + v_2^{(h)}(z_2')$, which allows us to sum their transition probabilities directly; on the third line we substituted Eqn. (2); and the final steps use the same independence property as above together with the fact that $p_d(\cdot|z_d,a_d)$ is a probability simplex.
Therefore, we have $Q^{(h+1)}(s,a) = q_1^{(h+1)}(z_1,a_1) + q_2^{(h+1)}(z_2,a_2)$ as desired, where
$$q_d^{(h+1)}(z_d,a_d) = r_d(z_d,a_d) + \gamma \sum_{z_d'} p_d(z_d'|z_d,a_d) \sum_{a_d'} \pi_d(a_d'|z_d')\, q_d^{(h)}(z_d',a_d').$$
By mathematical induction, this decomposition holds for any h-step Q-function. Letting h → ∞
shows that this holds for the full Q-function.
B.3 Policy Learning with Bias - Performance Bounds
Consider a particular model-based procedure for approximating the optimal Q-function using Eqn. (1): i) find approximations $\hat{M} = (\hat{p}, \hat{r})$ that are close to the true transition/reward functions $p, r$, such that there exists some state abstraction set $\phi$ with $\hat{p}, \hat{r}$ satisfying Eqns. (2) and (3) with respect to $\phi$; ii) perform planning (e.g., dynamic programming) using the approximate MDP parameters $\hat{p}$ and $\hat{r}$. We can show the following performance bounds; note that these upper bounds are loose and information-theoretic (in that they require knowledge of the implicit factorization).
Proposition 8. If the approximation errors in $\hat{p}$ and $\hat{r}$ are upper bounded by $\epsilon_p$ and $\epsilon_r$ for all $s \in \mathcal{S}, a \in \mathcal{A}$:
$$\sum_{s'} \big| p(s'|s,a) - \hat{p}(s'|s,a) \big| \leq \epsilon_p, \qquad \big| r(s,a) - \hat{r}(s,a) \big| \leq \epsilon_r,$$
then the above model-based procedure leads to an approximate Q-function $\hat{Q}$ and an approximate policy $\hat{\pi}$ that satisfy:
$$\| Q^*_M - Q^*_{\hat{M}} \|_\infty \leq \frac{\epsilon_r}{1-\gamma} + \frac{\gamma \epsilon_p R_{\max}}{2(1-\gamma)^2}, \qquad \| V^*_M - V^{\hat{\pi}}_M \|_\infty \leq \frac{2\epsilon_r}{1-\gamma} + \frac{\gamma \epsilon_p R_{\max}}{(1-\gamma)^2}.$$
Proof. See classical results by Singh and Yee [66] and Kearns and Singh [67] (the simulation lemma).
To help understand how the linear parameterization of the Q-function in Eqn. (1) affects the representation power of the function class, we first define the following matrices for action space featurization.
Definition 3. The sub-action mapping matrix for sub-action space $\mathcal{A}_d$ is defined as
$$\Psi_d = \begin{bmatrix} \text{---}\; \psi_d(a^1)^\top \;\text{---} \\ \vdots \\ \text{---}\; \psi_d(a^{|\mathcal{A}|})^\top \;\text{---} \end{bmatrix} \in \{0,1\}^{|\mathcal{A}| \times |\mathcal{A}_d|},$$
where each row $\psi_d(a^i)^\top \in \{0,1\}^{1 \times |\mathcal{A}_d|}$ is a one-hot vector with a value of 1 in column $\mathrm{proj}_{\mathcal{A} \to \mathcal{A}_d}(a^i)$.
Remark. The $i$-th row of $\Psi_d$ corresponds to an action $a^i \in \mathcal{A}$, and the $j$-th column corresponds to a particular element of the sub-action space $a_d^j \in \mathcal{A}_d$. The $(i,j)$-entry of $\Psi_d$ is 1 if and only if the projection of $a^i$ onto the sub-action space $\mathcal{A}_d$ is $a_d^j$. Since each row is a one-hot vector, the sum of elements in each row is exactly 1, i.e., $\psi_d(a^i)^\top \mathbf{1} = 1$.
Definition 4. The sub-action mapping matrix, $\Psi$, is defined by a horizontal concatenation of $\Psi_d$ for $d = 1 \ldots D$:
$$\Psi = \big[\, \Psi_1 \;\cdots\; \Psi_D \,\big] \in \{0,1\}^{|\mathcal{A}| \times (\sum_d |\mathcal{A}_d|)}.$$
Remark. $\Psi$ describes how to map each action $a^i \in \mathcal{A}$ to its corresponding sub-actions. Therefore, the sum of elements in each row is exactly $D$, the number of sub-action spaces: $\psi(a^i)^\top \mathbf{1} = D$.
Definition 5. The condensed sub-action mapping matrix, $\tilde{\Psi}$, is
$$\tilde{\Psi} = \big[\, \mathbf{1} \;\; \tilde{\Psi}_1 \;\cdots\; \tilde{\Psi}_D \,\big] \in \{0,1\}^{|\mathcal{A}| \times (1 + \sum_d (|\mathcal{A}_d| - 1))},$$
where the first column contains all 1's, and $\tilde{\Psi}_d$ denotes $\Psi_d$ with the first column removed.
Proposition 9. $\mathrm{colspace}(\Psi) = \mathrm{colspace}(\tilde{\Psi})$ and $\mathrm{rank}(\Psi) = \mathrm{rank}(\tilde{\Psi}) = \mathrm{ncols}(\tilde{\Psi})$ (i.e., the matrix $\tilde{\Psi}$ has full column rank). Consequently, $\Psi\Psi^+ = \tilde{\Psi}\tilde{\Psi}^+$.
Corollary 10. Suppose the Q-function $Q$ of a policy $\pi$ at state $s$ is linearly decomposable with respect to the sub-actions, i.e., we can write $Q(s,a) = \sum_{d=1}^{D} q_d(s, a_d)$ for all $a_d \in \mathcal{A}_d$. Then there exist $w$ and $\tilde{w}$ such that the column vector containing the Q-values for all actions at state $s$ can be expressed as $Q(s, \mathcal{A}) = \Psi w = \tilde{\Psi}\tilde{w}$. In other words, Eqn. (1) is equivalent to $Q(s, \mathcal{A}) \in \mathrm{colspace}(\tilde{\Psi})$.
Corollary 11. Suppose $Q(s, \mathcal{A}) \notin \mathrm{colspace}(\tilde{\Psi})$. Let $\hat{w} = \Psi^+ Q(s, \mathcal{A})$ and $\hat{\tilde{w}} = \tilde{\Psi}^+ Q(s, \mathcal{A})$ be the least-squares solutions of the respective linear equations. Then $\Psi\hat{w} = \tilde{\Psi}\hat{\tilde{w}}$.
Remark. Corollaries 10 and 11 imply there are two possible implementations, regardless of whether the true Q-function can be represented by the linear parameterization. Intuitively, both versions project the true Q-value vector $Q(s, \mathcal{A})$ for a particular state $s$ onto the subspace spanned by the columns of $\Psi$ or $\tilde{\Psi}$. Since the two matrices have the same column space, the results of the projections are equal. This does not imply that $\hat{w}$ and $\hat{\tilde{w}}$ are equal (they cannot be, as they have different dimensions), but rather that the resultant Q-value estimates are equal: $\hat{Q}(s, \mathcal{A}) = \Psi\hat{w} = \tilde{\Psi}\hat{\tilde{w}}$.
To make the theorem statements more concrete, we inspect a simple numerical example and verify
the theoretical properties.
Example 3. Consider an MDP with $\mathcal{A} = \mathcal{A}_1 \times \mathcal{A}_2$, where $\mathcal{A}_1 = \{0,1\}$ and $\mathcal{A}_2 = \{0,1\}$. Consequently, $|\mathcal{A}_1| = |\mathcal{A}_2| = 2$ and $|\mathcal{A}| = 2^2 = 4$.

Suppose for state $s$ we can write $Q(s,a) = Q(s, [a_1, a_2]) = q_1(s, a_1) + q_2(s, a_2)$ for all $a_1 \in \mathcal{A}_1, a_2 \in \mathcal{A}_2$. Then
$$Q(s, \mathcal{A}) = \begin{bmatrix} Q(s, a_1{=}0, a_2{=}0) \\ Q(s, a_1{=}0, a_2{=}1) \\ Q(s, a_1{=}1, a_2{=}0) \\ Q(s, a_1{=}1, a_2{=}1) \end{bmatrix} = \begin{bmatrix} q_1(s,0) + q_2(s,0) \\ q_1(s,0) + q_2(s,1) \\ q_1(s,1) + q_2(s,0) \\ q_1(s,1) + q_2(s,1) \end{bmatrix} = \begin{bmatrix} 1 & 0 & 1 & 0 \\ 1 & 0 & 0 & 1 \\ 0 & 1 & 1 & 0 \\ 0 & 1 & 0 & 1 \end{bmatrix} \begin{bmatrix} q_1(s,0) \\ q_1(s,1) \\ q_2(s,0) \\ q_2(s,1) \end{bmatrix} = \Psi w,$$
where $\Psi = [\Psi_1 \;\; \Psi_2]$ and $w = \begin{bmatrix} w_1 \\ w_2 \end{bmatrix}$ with $w_d = [q_d(s,0),\, q_d(s,1)]^\top$. We can also write
$$Q(s, \mathcal{A}) = \tilde{\Psi}\tilde{w}, \quad \text{where } \tilde{\Psi} = \begin{bmatrix} 1 & 0 & 0 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \\ 1 & 1 & 1 \end{bmatrix}, \quad \tilde{w} = \begin{bmatrix} v_0(s) \\ u_1(s) \\ u_2(s) \end{bmatrix} = \begin{bmatrix} q_1(s,0) + q_2(s,0) \\ q_1(s,1) - q_1(s,0) \\ q_2(s,1) - q_2(s,0) \end{bmatrix}.$$
One can verify that $\mathrm{rank}(\Psi) = \mathrm{rank}(\tilde{\Psi}) = 3$ and $\mathrm{colspace}(\Psi) = \mathrm{colspace}(\tilde{\Psi})$, because the columns of $\tilde{\Psi}$ are linearly independent, but the columns of $\Psi$ are not:
$$\begin{bmatrix} 1 \\ 1 \\ 0 \\ 0 \end{bmatrix} + \begin{bmatrix} 0 \\ 0 \\ 1 \\ 1 \end{bmatrix} - \begin{bmatrix} 1 \\ 0 \\ 1 \\ 0 \end{bmatrix} = \begin{bmatrix} 0 \\ 1 \\ 0 \\ 1 \end{bmatrix}.$$
Furthermore,
$$\Psi^+ = \begin{bmatrix} 3/8 & 3/8 & -1/8 & -1/8 \\ -1/8 & -1/8 & 3/8 & 3/8 \\ 3/8 & -1/8 & 3/8 & -1/8 \\ -1/8 & 3/8 & -1/8 & 3/8 \end{bmatrix}, \qquad \tilde{\Psi}^+ = \begin{bmatrix} 3/4 & 1/4 & 1/4 & -1/4 \\ -1/2 & -1/2 & 1/2 & 1/2 \\ -1/2 & 1/2 & -1/2 & 1/2 \end{bmatrix},$$
and
$$\Psi\Psi^+ = \tilde{\Psi}\tilde{\Psi}^+ = \begin{bmatrix} 3/4 & 1/4 & 1/4 & -1/4 \\ 1/4 & 3/4 & -1/4 & 1/4 \\ 1/4 & -1/4 & 3/4 & 1/4 \\ -1/4 & 1/4 & 1/4 & 3/4 \end{bmatrix}.$$
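This example can also be checked mechanically; the short numpy sketch below (ours, purely illustrative) verifies the ranks, the shared column space, and the projection identities of Proposition 9 and Corollaries 10 and 11.

```python
import numpy as np

# Psi = [Psi_1, Psi_2]; rows ordered as (a1, a2) = (0,0), (0,1), (1,0), (1,1).
Psi = np.array([[1, 0, 1, 0],
                [1, 0, 0, 1],
                [0, 1, 1, 0],
                [0, 1, 0, 1]], dtype=float)
# Psi_tilde = [1 | Psi_1, Psi_2 with their first columns removed].
Psi_t = np.array([[1, 0, 0],
                  [1, 0, 1],
                  [1, 1, 0],
                  [1, 1, 1]], dtype=float)

print(np.linalg.matrix_rank(Psi), np.linalg.matrix_rank(Psi_t))   # 3 3

P = Psi @ np.linalg.pinv(Psi)        # orthogonal projector onto colspace(Psi)
P_t = Psi_t @ np.linalg.pinv(Psi_t)  # orthogonal projector onto colspace(Psi_tilde)
print(np.allclose(P, P_t))           # True (Proposition 9)

# A decomposable Q-vector lies in the shared column space (Corollary 10) ...
q1, q2 = np.array([0.3, 1.2]), np.array([-0.5, 0.8])
Q_dec = np.array([q1[i] + q2[j] for i in (0, 1) for j in (0, 1)])
print(np.allclose(P @ Q_dec, Q_dec))                              # True

# ... while adding an interaction term still leaves the two least-squares fits equal (Corollary 11).
Q_gen = Q_dec + np.array([0.0, 0.0, 0.0, 1.0])
w_hat = np.linalg.pinv(Psi) @ Q_gen
wt_hat = np.linalg.pinv(Psi_t) @ Q_gen
print(np.allclose(Psi @ w_hat, Psi_t @ wt_hat))                   # True
```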
Proof of Proposition 9. First note that $\Psi$ is a tall matrix for non-trivial cases, with more rows than columns, because $|\mathcal{A}| = \prod_d |\mathcal{A}_d| \geq \sum_d |\mathcal{A}_d|$ if $|\mathcal{A}_d| \geq 2$ for all $d$ (this can be verified by induction on $D$). Therefore, the rank of $\Psi$ is the number of linearly independent columns of $\Psi$.
We use the following notation to write matrix $\Psi_d$ in terms of its columns:
$$\Psi_d = \begin{bmatrix} | & & | \\ c_{d,1} & \cdots & c_{d,|\mathcal{A}_d|} \\ | & & | \end{bmatrix}.$$
Claim 1: The columns of $\Psi_d$ are pairwise orthogonal, $c_{d,j}^\top c_{d,j'} = 0$ for all $j \neq j'$, and they form an orthogonal basis. This is because each row $\psi_d(a^i)^\top$ is a one-hot vector, containing only one 1; this implies that out of the two entries in row $i$ of $c_{d,j}$ and $c_{d,j'}$, at least one entry is 0, and their product must be 0.

Claim 2: The sum of entries in each row of $\Psi_d$ is 1, and $\sum_{j=1}^{|\mathcal{A}_d|} c_{d,j} = \mathbf{1}$, a column vector of 1's with matching size. This is a direct consequence of each row $\psi_d(a^i)^\top$ being a one-hot vector. In other words, $\mathbf{1} \in \mathrm{colspace}(\Psi_d)$.

Claim 3: The columns of $\Psi$ are not linearly independent. This is because there is not a unique way to write $\mathbf{1}$ as a linear combination of the columns of $\Psi$. For example, $\sum_{j=1}^{|\mathcal{A}_d|} c_{d,j} = \sum_{j=1}^{|\mathcal{A}_{d'}|} c_{d',j} = \mathbf{1}$ for some $d' \neq d$, where we used the columns of $\Psi_d$ and $\Psi_{d'}$.

Claim 4: $\mathbf{1} \notin \mathrm{colspace}(\tilde{\Psi}_1 \cdots \tilde{\Psi}_D)$ because the first entry of every column vector in any $\tilde{\Psi}_d$ is 0 and no linear combination of them can result in a 1. Consequently, $\mathbf{1} \notin \mathrm{colspace}(\tilde{\Psi}_d)$ for any $d$.

Claim 5: $c_{d,1} \notin \mathrm{colspace}(\mathbf{1},\, \tilde{\Psi}_{d'} : d' \neq d)$, where $c_{d,1}$ is the column removed from $\Psi_d$ to construct $\tilde{\Psi}_d$. This can also be seen from the first entry of the column vector: the first entry of $c_{d,1}$ is 1, and all columns of $\tilde{\Psi}_{d'}$, $d' \neq d$, have a first entry of 0.

Claim 6: $c_{d,j} \notin \mathrm{colspace}(\mathbf{1},\, \tilde{\Psi}_1 \cdots \tilde{\Psi}_D \setminus \{c_{d,j}\})$ for $j > 1$. By expressing $c_{d,j} = \big(\mathbf{1} - \sum_{j'=2, j' \neq j}^{|\mathcal{A}_d|} c_{d,j'}\big) + (-c_{d,1})$, we observe that the first part of the sum lies in the column space, while the second part does not (from the previous claim, $c_{d,1}$ is not in the column space of $\tilde{\Psi}_{d'}$ for $d' \neq d$; within $\tilde{\Psi}_d$, the only way to obtain $c_{d,1}$ is $c_{d,1} = \mathbf{1} - \sum_{j'=2}^{|\mathcal{A}_d|} c_{d,j'}$, and we have excluded one of the columns, $c_{d,j}$, from the column space).
Combining these claims implies that each column of $\tilde{\Psi}$ cannot be expressed as a linear combination of the other columns, and thus $\tilde{\Psi}$ has full column rank: $\mathrm{rank}(\tilde{\Psi}) = \mathrm{ncols}(\tilde{\Psi}) = 1 + \sum_{d=1}^{D} (|\mathcal{A}_d| - 1)$. It follows that the columns of $\tilde{\Psi}$ form a linearly independent set spanning the same space as the columns of $\Psi$ (each column of $\Psi$ is recoverable from $\mathbf{1}$ and the columns of the $\tilde{\Psi}_d$, and vice versa), so the two column spaces and ranks are equal.
$\Psi\Psi^+$ and $\tilde{\Psi}\tilde{\Psi}^+$ are orthogonal projection matrices onto the column spaces of $\Psi$ and $\tilde{\Psi}$, respectively. Since $\mathrm{colspace}(\Psi) = \mathrm{colspace}(\tilde{\Psi})$, it follows that $\Psi\Psi^+ = \tilde{\Psi}\tilde{\Psi}^+$.
Consider the matrix form of the Bellman equation (cf. Sec. 2 of Lagoudakis and Parr [68]):
$$Q = R + \gamma P^\pi Q,$$
where $Q \in \mathbb{R}^{|\mathcal{S}||\mathcal{A}|}$ is a vector containing the Q-values for all state-action pairs, $R \in \mathbb{R}^{|\mathcal{S}||\mathcal{A}|}$, and $P^\pi \in \mathbb{R}^{|\mathcal{S}||\mathcal{A}| \times |\mathcal{S}||\mathcal{A}|}$ is the $(s,a)$-transition matrix induced by the MDP and policy $\pi$. Solving this equation gives us the Q-function in closed form:
$$Q = (I - \gamma P^\pi)^{-1} R, \tag{5}$$
where $I \in \mathbb{R}^{|\mathcal{S}||\mathcal{A}| \times |\mathcal{S}||\mathcal{A}|}$.
To derive a necessary condition, we start by assuming that the Q-function is representable by the linear parameterization, i.e., there exists $W \in \mathbb{R}^{(\sum_{d=1}^{D} |\mathcal{A}_d|) \times |\mathcal{S}|}$ such that $\mathrm{vec}^{-1}_{|\mathcal{A}| \times |\mathcal{S}|}(Q) = \Psi W$. Here, $\mathrm{vec}^{-1}_{|\mathcal{A}| \times |\mathcal{S}|}$ is the inverse vectorization operator that reshapes the vector of all Q-values into a matrix of size $|\mathcal{A}| \times |\mathcal{S}|$, and $\Psi \in \{0,1\}^{|\mathcal{A}| \times (\sum_{d=1}^{D} |\mathcal{A}_d|)}$ is defined in Appendix B.4. Substituting Eqn. (5) into the premise gives the necessary condition: there must exist $W \in \mathbb{R}^{(\sum_{d=1}^{D} |\mathcal{A}_d|) \times |\mathcal{S}|}$ such that
$$\mathrm{vec}^{-1}_{|\mathcal{A}| \times |\mathcal{S}|}\big( (I - \gamma P^\pi)^{-1} R \big) = \Psi W.$$
Unfortunately, unlike the sufficient conditions in Theorem 1 (and Proposition 7), this necessary condition is not as clean and is likely not verifiable in most settings. The matrix inverse and the $\mathrm{vec}^{-1}$ reshaping operation make it challenging to further manipulate the expression. This highlights the non-trivial nature of the problem.
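That said, for small tabular MDPs the condition can be checked numerically: compute Q in closed form via Eqn. (5), reshape it, and test whether each column lies in colspace(Ψ) using the projector ΨΨ+. The sketch below is ours; the function and variable names are illustrative.

```python
import numpy as np

def q_closed_form(p, r, pi, gamma):
    """Q = (I - gamma * P^pi)^{-1} R, stacked over (s, a) pairs in s-major order."""
    S, A = r.shape
    # P^pi[(s, a), (s', a')] = p(s'|s, a) * pi(a'|s')
    P_pi = (p[:, :, :, None] * pi[None, None, :, :]).reshape(S * A, S * A)
    return np.linalg.solve(np.eye(S * A) - gamma * P_pi, r.reshape(S * A))

def is_factorable(Q, S, A, Psi, tol=1e-8):
    """Check whether vec^{-1}(Q) = Psi W for some W, i.e., whether each column of the
    |A| x |S| Q-matrix lies in colspace(Psi)."""
    Q_mat = Q.reshape(S, A).T                  # |A| x |S|; column s holds Q(s, .)
    proj = Psi @ np.linalg.pinv(Psi)           # orthogonal projector onto colspace(Psi)
    return float(np.max(np.abs(proj @ Q_mat - Q_mat))) < tol
```

On a product MDP with a factored policy (e.g., the one constructed in the sketch after the proof of Proposition 7, using the 4×4 Ψ of Example 3), is_factorable returns True, as Corollary 10 predicts; perturbing a single reward entry typically makes it return False.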
Proof for Proposition 5. For the sake of argument, we consider the one-timestep bandit setting; extension to the sequential setting can be derived following Chen and Jiang [62] and Duan et al. [17]. Let the true generative model be $Q^* = \Psi r + \psi_{\text{Interact}} r_{\text{Interact}}$ (details in Appendix B.8). We formally show the reduction in the variance of the estimators by comparing the lower bounds of their respective empirical Rademacher complexities; a smaller Rademacher complexity translates into lower-variance estimators.

Suppose we obtain a sample of $m$ actions and apply the linear approximation. Our approach for the factored action space corresponds to the matrix $X \in \{0,1\}^{m \times (\sum_d |\mathcal{A}_d|)}$, obtained by stacking the corresponding rows of $\Psi$ (recall Definition 4). The complete, combinatorial action space corresponds to the matrix $X' = [X, x_{\text{Interact}}] \in \{0,1\}^{m \times (1 + \sum_d |\mathcal{A}_d|)}$, obtained by adding the corresponding rows of $\psi_{\text{Interact}}$. By definition, $\|X\|_{p,q} < \|X'\|_{p,q}$, since the former drops a column with non-zero norm that exists in the latter.

Consider the following two function families, for the factored action space and the complete action space respectively:
$$\mathcal{F}_F = \{ f = w_F^\top x : \|w_F\|_2 \leq A \}, \qquad \mathcal{F}_C = \{ f = w_C^\top x' : \|w_C\|_2 \leq A \},$$
for some $A > 0$. A straightforward application of Theorem 2 of Awasthi et al. [69] shows that the lower bound on the Rademacher complexity of the factored action space is smaller than that of the complete action space, which completes our argument.
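The proposition is about Rademacher complexity; as a complementary, purely empirical illustration of the variance effect it implies, the following Monte Carlo sketch (our own setup, not part of the formal argument) fits both linear models on repeated small samples from the 2×2 bandit of Appendices B.7–B.8 and compares the spread of their value estimates.

```python
import numpy as np

rng = np.random.default_rng(0)
# 2x2 bandit designs: factored X (4 feature columns) vs. complete X' (adds the interaction).
X = np.array([[1, 0, 1, 0],
              [1, 0, 0, 1],
              [0, 1, 1, 0],
              [0, 1, 0, 1]], dtype=float)
X_full = np.hstack([X, np.array([[0.0], [0.0], [0.0], [1.0]])])
Q_true = np.array([0.0, 0.5, 1.0, 1.75])                 # alpha = 0.5, beta = 0.25 (illustrative)

def fit_predict(design, actions, rewards):
    w = np.linalg.lstsq(design[actions], rewards, rcond=None)[0]
    return design @ w                                     # predicted values for all 4 arms

m, noise, reps = 20, 1.0, 2000
preds_f, preds_c = [], []
for _ in range(reps):
    a = rng.integers(0, 4, size=m)                        # m logged actions
    y = Q_true[a] + noise * rng.standard_normal(m)        # noisy observed rewards
    preds_f.append(fit_predict(X, a, y))
    preds_c.append(fit_predict(X_full, a, y))
print(np.var(preds_f, axis=0).mean(), np.var(preds_c, axis=0).mean())
# The factored model's estimates vary less across resampled datasets (lower variance),
# at the cost of the omitted-interaction bias quantified in Appendices B.7-B.8 below.
```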
B.7 Standardization of Rewards for the Bandit Setting (Proposition 6)
Suppose the rewards of the four arms are [R0,0 , R0,1 , R1,0 , R1,1 ]. We can apply the following trans-
formations to reduce any reward function to the form of [0, α, 1, 1 + α + β], and these transformations
do not affect the least-squares solution:
• If R0,0 = R1,0 and R0,1 = R1,1 , we can ignore x-axis sub-action as setting it to either 0
(←) or 1 (→) does not affect the reward. Similarly, if R0,0 = R0,1 and R1,0 = R1,1 , we
can ignore y-axis sub-action. In both cases, this reduces to a one-dimensional action space
which we do not discuss further.
• Now at least one of the following is false: R0,0 = R1,0 or R0,1 = R1,1 . If R0,0 6= R1,0 ,
skip this step. Otherwise, it must be that R0,0 = R1,0 and R0,1 6= R1,1 . Swap the role of
down vs. up such that the new R0,0 6= R1,0 .
• If R0,0 < R1,0 , skip this step. Otherwise it must be that R0,0 > R1,0 . Swap the role of left
vs. right so that R0,0 < R1,0 .
• If R0,0 6= 0, subtract R0,0 from all rewards so that the new R0,0 = 0.
• Now R1,0 > R0,0 = 0, so R1,0 must be positive. If R1,0 ≠ 1, divide all rewards by R1,0 so that the new R1,0 = 1.
• Lastly, we should have R0,0 = 0 and R1,0 = 1. Set α = R0,1 and β = R1,1 − R1,0 − R0,1 .
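These steps amount to a handful of swaps and affine rescalings; a direct transcription in Python (ours, not from the paper's released code) looks as follows.

```python
def standardize_rewards(R00, R01, R10, R11):
    """Reduce a 2x2 bandit reward table R[x][y] (x: left/right, y: down/up) to the
    canonical form [0, alpha, 1, 1 + alpha + beta] via the steps listed above.
    Returns (alpha, beta), or None for the degenerate (effectively 1-D) cases."""
    if (R00 == R10 and R01 == R11) or (R00 == R01 and R10 == R11):
        return None                                        # one sub-action is irrelevant
    if R00 == R10:                                         # swap the roles of down and up
        R00, R01, R10, R11 = R01, R00, R11, R10
    if R00 > R10:                                          # swap the roles of left and right
        R00, R01, R10, R11 = R10, R11, R00, R01
    R01, R10, R11 = R01 - R00, R10 - R00, R11 - R00        # shift so that R00 = 0
    R00 = 0.0
    R01, R11 = R01 / R10, R11 / R10                        # rescale so that R10 = 1
    R10 = 1.0
    return R01, R11 - R10 - R01                            # (alpha, beta)
```

For instance, standardize_rewards(0.0, 0.5, 1.0, 1.75) returns (0.5, 0.25), i.e., alpha = 0.5 and beta = 0.25.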
Written as a generative model for the Q-values of the four arms,
$$\begin{bmatrix} Q^*(\swarrow) \\ Q^*(\nwarrow) \\ Q^*(\searrow) \\ Q^*(\nearrow) \end{bmatrix} = \begin{bmatrix} 1 & 0 & 1 & 0 \\ 1 & 0 & 0 & 1 \\ 0 & 1 & 1 & 0 \\ 0 & 1 & 0 & 1 \end{bmatrix} \begin{bmatrix} r_{\text{Left}} \\ r_{\text{Right}} \\ r_{\text{Down}} \\ r_{\text{Up}} \end{bmatrix} + \begin{bmatrix} 0 \\ 0 \\ 0 \\ 1 \end{bmatrix} r_{\text{Interact}}, \qquad \text{i.e.,}\quad Q^* = \Psi r + \psi_{\text{Interact}}\, r_{\text{Interact}}.$$
Here, $r_{\text{Left}}, r_{\text{Right}}, r_{\text{Down}}, r_{\text{Up}}, r_{\text{Interact}}$ are parameters of the generative model. Note that the matrix $[\Psi, \psi_{\text{Interact}}]$ has a column space of $\mathbb{R}^4$, i.e., this generative model captures every possible reward configuration of the four actions.
Applying our proposed linear approximation translates to "dropping" the interaction parameter, $r_{\text{Interact}}$, and estimating the remaining four parameters. This leads to a form of omitted-variable bias, which can be computed as:
$$\Psi^+ \psi_{\text{Interact}}\, r_{\text{Interact}} = \begin{bmatrix} 1 & 0 & 1 & 0 \\ 1 & 0 & 0 & 1 \\ 0 & 1 & 1 & 0 \\ 0 & 1 & 0 & 1 \end{bmatrix}^+ \begin{bmatrix} 0 \\ 0 \\ 0 \\ 1 \end{bmatrix} r_{\text{Interact}} = \begin{bmatrix} 3/8 & 3/8 & -1/8 & -1/8 \\ -1/8 & -1/8 & 3/8 & 3/8 \\ 3/8 & -1/8 & 3/8 & -1/8 \\ -1/8 & 3/8 & -1/8 & 3/8 \end{bmatrix} \begin{bmatrix} 0 \\ 0 \\ 0 \\ 1 \end{bmatrix} r_{\text{Interact}} = \begin{bmatrix} -1/8 \\ 3/8 \\ -1/8 \\ 3/8 \end{bmatrix} r_{\text{Interact}}.$$
The biased estimates of the four parameters are:
$$\hat{r} = r + \Psi^+ \psi_{\text{Interact}}\, r_{\text{Interact}}, \qquad \begin{bmatrix} \hat{r}_{\text{Left}} \\ \hat{r}_{\text{Right}} \\ \hat{r}_{\text{Down}} \\ \hat{r}_{\text{Up}} \end{bmatrix} = \begin{bmatrix} r_{\text{Left}} - \tfrac{1}{8} r_{\text{Interact}} \\ r_{\text{Right}} + \tfrac{3}{8} r_{\text{Interact}} \\ r_{\text{Down}} - \tfrac{1}{8} r_{\text{Interact}} \\ r_{\text{Up}} + \tfrac{3}{8} r_{\text{Interact}} \end{bmatrix}.$$
For the bandit problem in Figure 4a, substituting $r_{\text{Left}} + r_{\text{Down}} = 0$, $r_{\text{Left}} + r_{\text{Up}} = \alpha$, $r_{\text{Right}} + r_{\text{Down}} = 1$, and $r_{\text{Interact}} = \beta$ gives
$$\begin{bmatrix} \hat{Q}(\swarrow) \\ \hat{Q}(\nwarrow) \\ \hat{Q}(\searrow) \\ \hat{Q}(\nearrow) \end{bmatrix} = \begin{bmatrix} -\tfrac{1}{4}\beta \\ \alpha + \tfrac{1}{4}\beta \\ 1 + \tfrac{1}{4}\beta \\ 1 + \alpha + \tfrac{3}{4}\beta \end{bmatrix},$$
which is the solution we presented in Figure 4c.
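The numbers above are easy to reproduce; the following numpy sketch (with illustrative values alpha = 0.5, beta = 1) recomputes the omitted-variable bias direction and the biased Q̂.

```python
import numpy as np

# Rows ordered as the four arms [left-down, left-up, right-down, right-up];
# columns of Psi are [Left, Right, Down, Up]; the interaction feature fires only for right-up.
Psi = np.array([[1, 0, 1, 0],
                [1, 0, 0, 1],
                [0, 1, 1, 0],
                [0, 1, 0, 1]], dtype=float)
psi_int = np.array([0.0, 0.0, 0.0, 1.0])

alpha, beta = 0.5, 1.0                         # illustrative values
# Any r consistent with r_Left + r_Down = 0, r_Left + r_Up = alpha,
# r_Right + r_Down = 1 yields the same Q*; we pick one such r.
r = np.array([0.0, 1.0, 0.0, alpha])           # [Left, Right, Down, Up]
Q_star = Psi @ r + psi_int * beta              # [0, alpha, 1, 1 + alpha + beta]

print(np.linalg.pinv(Psi) @ psi_int)           # [-1/8, 3/8, -1/8, 3/8]: bias direction
Q_hat = Psi @ (np.linalg.pinv(Psi) @ Q_star)   # least-squares fit without the interaction
print(Q_hat)  # [-beta/4, alpha + beta/4, 1 + beta/4, 1 + alpha + 3*beta/4]
```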
When the interaction effect is not negligible and can lead to suboptimal performance, one solution is to explicitly encode the residual interaction terms in the decomposed Q-function by letting $Q(s,a) = \sum_{d=1}^{D} q_d(s, a_d) + R(a)$. The exact parameterization of the residual term $R(a)$ is problem-dependent: one may incorporate the approach of Tavakoli et al. [6] to systematically consider interactions of certain "ranks" (e.g., limiting it to only two-way or three-way interactions), and consider regularizing the magnitude of the residual terms so that we still benefit from the efficiency gains of the linear decomposition.
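One possible realization (our sketch in PyTorch; the paper does not prescribe this exact module) is a Q-head that sums per-sub-action components and adds a learned residual over joint actions, whose magnitude can then be penalized in the training loss.

```python
import itertools
import torch
import torch.nn as nn

class FactoredQWithResidual(nn.Module):
    """Q(s, a) = sum_d q_d(s, a_d) + R(a), where R(a) is a learned residual over
    joint actions whose magnitude can be penalized during training."""

    def __init__(self, state_dim, sub_action_sizes, hidden=256):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        # One linear head per sub-action space, producing q_d(s, .).
        self.heads = nn.ModuleList(nn.Linear(hidden, k) for k in sub_action_sizes)
        # Enumerate all joint actions once, e.g. (0,0), (0,1), (1,0), (1,1).
        self.combos = list(itertools.product(*[range(k) for k in sub_action_sizes]))
        # State-independent residual R(a); it could also be made state-dependent.
        self.residual = nn.Parameter(torch.zeros(len(self.combos)))

    def forward(self, s):
        h = self.trunk(s)
        per_dim = [head(h) for head in self.heads]            # each of shape (B, |A_d|)
        q = torch.stack([sum(per_dim[d][:, a_d] for d, a_d in enumerate(combo))
                         for combo in self.combos], dim=1)    # (B, |A|)
        return q + self.residual                              # broadcast residual over batch

# During training, adding e.g. lam * model.residual.abs().sum() to the TD loss keeps the
# interaction terms small unless the data supports them.
```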
Figure 10: (a) A one-dimensional chain MDP, with an initial state s0 and an absorbing state s1 , and
two actions a = 0 (left) and a = 1 (right). (b) A two-dimensional chain MDP shown together with
the component chains Mx and My . Rewards are denoted in red. Squares indicate absorbing states
whose outgoing transition arrows are omitted. For readability, in the diagram, the states and actions
are laid out following a convention similar to the Cartesian coordinate system so that the bottom left
state has index (0, 0), and right and up both increase the corresponding coordinate by 1.
In this appendix, we discuss the building blocks of the examples used in the main paper and provide
additional examples to support the theoretical properties presented in Section 3.
One-dimensional Chain. First, consider the chain problem depicted in Figure 10a. The agent always
starts in the initial state s0 and can take one of two possible actions: left (a = 0), which leads the
agent to stay at s0 , or right (a = 1), which leads the agent to transition into s1 and receive a reward of
+1. After reaching the absorbing state s1 , both a = 0 and a = 1 lead the agent to stay at s1 with zero
reward. For γ < 1, a (deterministic) optimal policy is π ∗ (s0 ) = 1, and either action can be taken in
s1 . Next, we use this MDP to construct a two-dimensional problem.
Two-dimensional Chain. Following the construction used in Definition 1, we consider an MDP M = Mx × My consisting of two chains (the horizontal chain Mx and the vertical chain My) running in parallel, as shown in Figure 10b. Their corresponding state spaces are Sx = {s0,?, s1,?} and Sy = {s?,0, s?,1}, which indicate the x- and y-coordinates respectively. There are 4 actions from each state, depicted by diagonal arrows {↙, ↖, ↘, ↗}; each action a = [ax, ay] effectively leads the agent to perform ax in Mx and ay in My. For example, taking action ↗ = [→, ↑] from state s0,0 leads the agent to transition into state s1,1 and receive a reward of +2 (the sum of +1 from Mx and +1 from My). For γ < 1, an optimal policy for this MDP is to always move up and right, π∗(·) = ↗ = [→, ↑], regardless of which state the agent is in.
Satisfying the Sufficient Conditions. Let φx : S → Sx and φy : S → Sy be the abstractions. By construction, the transition and reward functions of this MDP satisfy Eqns. (2) and (3). To apply Theorem 1, the policy must satisfy Eqn. (4). In Figure 11, we show three such policies (other policies in this category are omitted due to symmetry and transitions that have the same outcome), together with the true Q-functions (with γ = 0.9) and their decompositions in the form of Eqn. (1).
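These values are straightforward to reproduce; the small numpy sketch below (ours) builds the two chains, forms the product MDP, and recovers the decomposition shown in Figure 11 for the optimal policy. The same code can be rerun with the other policies.

```python
import numpy as np

gamma = 0.9

# One-dimensional chain (Figure 10a): states {0, 1}; actions {0: left, 1: right}.
p1 = np.zeros((2, 2, 2))
p1[0, 0, 0] = 1.0          # left from s0 stays at s0
p1[0, 1, 1] = 1.0          # right from s0 moves to s1 (reward +1)
p1[1, :, 1] = 1.0          # s1 is absorbing
r1 = np.zeros((2, 2)); r1[0, 1] = 1.0
p2, r2 = p1.copy(), r1.copy()            # the vertical chain is identical

# Product MDP (Figure 10b): s = (x, y) flattened as 2*x + y, a = (ax, ay) as 2*ax + ay.
p = np.einsum('iax,jby->ijabxy', p1, p2).reshape(4, 4, 4)
r = (r1[:, None, :, None] + r2[None, :, None, :]).reshape(4, 4)

def q_eval(p, r, pi, gamma, iters=1000):
    q = np.zeros_like(r)
    for _ in range(iters):
        q = r + gamma * p @ (pi * q).sum(axis=-1)
    return q

# Optimal policy: always move right and up (pi_x(right | .) = pi_y(up | .) = 1).
pi_x = np.tile([0.0, 1.0], (2, 1))
pi_y = pi_x.copy()
pi = (pi_x[:, None, :, None] * pi_y[None, :, None, :]).reshape(4, 4)

q = q_eval(p, r, pi, gamma)
qx = q_eval(p1, r1, pi_x, gamma)
qy = q_eval(p2, r2, pi_y, gamma)
q_sum = (qx[:, None, :, None] + qy[None, :, None, :]).reshape(4, 4)

print(q[0])                    # s_{0,0}: [1.8, 1.9, 1.9, 2.0] for actions [↙, ↖, ↘, ↗]
print(np.allclose(q, q_sum))   # True: Q = Qx + Qy, matching Figure 11
```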
Violating the Sufficient Conditions.
• Policy violates Eqn. (4) - Nonzero bias. For this setting, we hold the MDP (transitions
and rewards) unchanged. In Figure 12, we show seven policies that do not satisfy Eqn. (4),
together with the resultant Q-function and the biased linear approximation with the non-zero
approximation error.
• Transition violates Eqn. (2) - Nonzero Bias. Figure 13 shows an example where one
transition has been modified.
• Reward violates Eqn. (3) - Nonzero Bias. Figure 14 shows an example where one reward has been modified.
• Transition violates Eqn. (2), or policy violates Eqn. (4) - Zero Bias. If γ = 0, then the
Q-function is simply the immediate reward, and any conditions on the transition or policy
can be forgone.
• Reward violates Eqn. (3) - Zero Bias. It is possible to construct reward functions adversarially such that r itself does not satisfy the condition, and yet Q can be linearly decomposed. See Figure 15 for an example.
[Figure 11 table (layout not reproduced): for the optimal policy π∗ and two non-optimal policies, each row shows the policy π, the MDP diagram, and the Q-values Qπ decomposed as Qx + Qy. For example, under π∗, Qπ(s0,0, {↙, ↖, ↘, ↗}) = [1.8, 1.9, 1.9, 2.0], with Qx(s0,?, {←, →}) = [0.9, 1] and Qy(s?,0, {↓, ↑}) = [0.9, 1].]
Figure 11: Example MDPs and policies where Proposition 7 applies, for the optimal policy and two
particular non-optimal policies. γ = 0.9. We show the linear decomposition of the Q-function into
Qx and Qy . Qx only depends on the x-coordinate of state and the sub-action that moves ← or →;
Qy only depends on the y-coordinate of state and the sub-action that moves ↓ or ↑.
Figure 12: Example MDPs and policies where Proposition 7 does not apply because the policy
violates Eqn. (4) (violations are highlighted). γ = 0.9. For example, in the first case, the policy does
not take the same sub-action from s0,0 and s0,1 with respect to the horizontal chain Mx . Applying
the linear approximation produces biased estimates Q̂ of the true Q-function, Qπ .
Figure 13: Example MDPs and policies where Theorem 1 does not apply because the transition function violates Eqn. (2). γ = 0.9. In this example, the highlighted transition corresponding to the action ↗ = [→, ↑] from s0,1 does not move right (→ under Mx) to s1,1 and instead moves back to state s0,1. Applying the linear approximation produces biased estimates Q̂ of the true Q-function, Qπ.
Figure 14: Example MDPs and policies where Theorem 1 does not apply because the reward function violates Eqn. (3). γ = 0.9. In this example, the reward function of the bottom left state s0,0 does not satisfy the condition because the reward of ↗ is 1 ≠ 2 = 1 + 1. Applying the linear approximation produces biased estimates Q̂ of the true Q-function, Qπ.
Figure 15: Example MDPs and policies where Theorem 1 does not apply because the reward function
violates Eqn. (3). γ = 1. In this example, the reward function of the bottom left state s0,0 does not
satisfy the condition because 7 + 1.5 6= 2 + 3. However, there exists a linear decomposition of the
true Q-function, Qπ , for a particular policy denoted by bold blue arrows.
D Experiments
D.1 Sepsis Simulator - Implementation Details
When generating the datasets, we follow the default initial state distribution specified in the original
implementation.
By default, we used neural networks consisting of one hidden layer with 1,000 neurons and ReLU
activation to allow for function approximators with sufficient expressivity. We trained these networks
using the Adam optimizer (default settings) [70] with a batch size of 64 for a maximum of 100
epochs, applying early stopping on 10% “validation data” (specific to each supervised task) with
a patience of 10 epochs. We minimized the mean squared error (MSE) for regression tasks (each
iteration of FQI). For FQI, we also added value clipping (to be within the range of possible returns
[−1, 1]) when computing bootstrapping targets to ensure a bounded function class and encourage
better convergence behavior [71].
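For concreteness, the target clipping can be implemented as below; this is our sketch, with illustrative names, and it shows one plausible placement of the clip (on the bootstrapped next-state value, with the range [−1, 1] corresponding to the possible returns in this task).

```python
import numpy as np

def fqi_targets(rewards, next_q_values, dones, gamma, clip=(-1.0, 1.0)):
    """Regression targets for one FQI iteration, with clipped bootstrap values.

    rewards:       (N,) observed rewards
    next_q_values: (N, |A|) current Q-estimates at the next states
    dones:         (N,) 1.0 if the transition ended the episode, else 0.0
    """
    bootstrap = np.clip(next_q_values.max(axis=1), *clip)   # keep within possible returns
    return rewards + gamma * (1.0 - dones) * bootstrap
```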
The RNN AIS encoder was trained to output the mean of a unit-variance multivariate Gaussian over the observation at subsequent timesteps, conditioned on the subsequent actions, following the idea in Subramanian and Mahajan [37]. We performed a grid search over the hyperparameters
(Table 2) for training the RNN, selecting the model that achieved the smallest validation loss. Using
the best encoder model, we then trained the offline RL policy using BCQ (and factored BCQ),
considering validation performance of all checkpoints (saved every 100 iterations, for a maximum of
10,000 iterations) and all combinations of the BCQ hyperparameters (Table 2).
Table 2: Hyperparameter values used for training the RNN approximate information state as well as
BCQ for offline RL. Discrete BCQ for both the baseline and factored implementation are identical
except for the final layer of the Q-networks.
Hyperparameter Searched Settings
RNN:
- Embedding dimension, dS {8, 16, 32, 64, 128}
- Learning rate { 1e-5, 5e-4, 1e-4, 5e-3, 1e-3 }
BCQ (with 5 random restarts):
- Threshold, τ {0, 0.01, 0.05, 0.1, 0.3, 0.5, 0.75, 0.999}
- Learning rate 3e-4
- Weight decay 1e-3
- Hidden layer size 256
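To illustrate the one architectural difference noted in the caption (the final layer of the Q-networks), here is our PyTorch sketch of the two heads for the D = 3 binary sub-action setting of Figure 1; the hidden layer size follows Table 2, but the exact module structure is an assumption, not the authors' released code.

```python
import torch
import torch.nn as nn

class BaselineQHead(nn.Module):
    """Standard head: one output per joint action (2^3 = 8 for three binary sub-actions)."""

    def __init__(self, d_state, n_joint_actions=8, hidden=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d_state, hidden), nn.ReLU(),
                                 nn.Linear(hidden, n_joint_actions))

    def forward(self, s):
        return self.net(s)                                    # (B, |A|)

class FactoredQHead(nn.Module):
    """Factored head: per-sub-action outputs summed to score every joint action."""

    def __init__(self, d_state, hidden=256):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(d_state, hidden), nn.ReLU())
        self.heads = nn.ModuleList(nn.Linear(hidden, 2) for _ in range(3))

    def forward(self, s):
        h = self.body(s)
        q1, q2, q3 = (head(h) for head in self.heads)         # each (B, 2)
        # Q(s, [a1, a2, a3]) = q1(s, a1) + q2(s, a2) + q3(s, a3), for all 8 combinations.
        return (q1[:, :, None, None] + q2[:, None, :, None]
                + q3[:, None, None, :]).flatten(start_dim=1)  # (B, 8)
```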
Figure 16: Validation performance (in terms of WIS and ESS) for all hyperparameter settings and all
checkpoints considered during model selection. Left - baseline, Right - proposed.
Figure 17: Left - Pareto frontiers of validation performance for the baseline and proposed approaches; Right - test performance of the candidate models that lie on the validation Pareto frontier. The validation performance largely reflects the test performance, and the proposed approach outperforms the baseline in terms of test performance, albeit with a bit more overlap.
Figure 18: Model selection with different minimum ESS cutoffs. In the main paper we used ESS
≥ 200; here we sweep this threshold and compare the resultant selected policies for both the baseline
and proposed approach (only using candidate models that lie on the validation Pareto frontier). In
general, across the ESS cutoffs, the proposed approach outperforms the baseline in terms of test set
WIS value, with comparable or slightly lower ESS.