Leveraging Factored Action Spaces
Shengpu Tang1 Maggie Makar1 Michael W. Sjoding2 Finale Doshi-Velez3 Jenna Wiens1
¹ Division of Computer Science & Engineering, University of Michigan, Ann Arbor, MI, USA
² Division of Pulmonary and Critical Care, Michigan Medicine, Ann Arbor, MI, USA
³ SEAS, Harvard University, Cambridge, MA, USA
Correspondence to: {tangsp,wiensj}@umich.edu
Reviewed on OpenReview: https://openreview.net/forum?id=Jd70afzIvJ4
Abstract
Many reinforcement learning (RL) applications have combinatorial action spaces,
where each action is a composition of sub-actions. A standard RL approach
ignores this inherent factorization structure, resulting in a potential failure to
make meaningful inferences about rarely observed sub-action combinations; this is
particularly problematic for offline settings, where data may be limited. In this work,
we propose a form of linear Q-function decomposition induced by factored action
spaces. We study the theoretical properties of our approach, identifying scenarios
where it is guaranteed to lead to zero bias when used to approximate the Q-function.
Outside the regimes with theoretical guarantees, we show that our approach can still
be useful because it leads to better sample efficiency without necessarily sacrificing
policy optimality, allowing us to achieve a better bias-variance trade-off. Across
several offline RL problems using simulators and real-world datasets motivated by
healthcare, we demonstrate that incorporating factored action spaces into value-
based RL can result in better-performing policies. Our approach can help an agent
make more accurate inferences within underexplored regions of the state-action
space when applying RL to observational datasets.
1 Introduction
In many real-world decision-making problems, the action space exhibits an inherent combinatorial
structure. For example, in healthcare, an action may correspond to a combination of drugs and treat-
ments. When applying reinforcement learning (RL) to these tasks, past work [1–4] typically considers
each combination a distinct action, resulting in an exponentially large action space (Figure 1a). This
is inefficient as it fails to leverage any potential independence among dimensions of the action space.
This type of factorization structure in action space could be incorporated when designing the architec-
ture of function approximators for RL (Figure 1b). Similar ideas have been used in the past, primarily
to improve online exploration [5, 6], or to handle multiple agents [7–11] or multiple rewards [12].
However, the applicability of this approach has not been systematically studied, especially in offline
settings and when the MDP presents no additional structure (e.g., when the state space cannot be
explicitly factorized).
In this work, we develop an approach for offline RL with factored action spaces by learning linearly
decomposable Q-functions. First, we study the theoretical properties of this approach, investigating
the sufficient and necessary conditions for it to lead to an unbiased estimate of the Q-function (i.e.,
zero approximation error). Even when the linear decomposition is biased, we note that our approach
leads to a reduction of variance, which in turn leads to an improvement in sample efficiency. Lastly,
we show that when sub-actions exhibit certain structures (e.g., when two sub-actions “reinforce” their
independent effects), the linear approximation, though biased, can still lead to the optimal policy. We
test our approach in offline RL domains using a simulator [13] and a real clinical dataset [2], where
domain knowledge about the relationship among actions suggests our proposed factorization approach
is applicable. Empirically, our approach outperforms a non-factored baseline when the sample size is
limited, even when the theoretical assumptions (around the validity of a linear decomposition) are not
perfectly satisfied. Qualitatively, in the real-data experiment, our approach learns policies that better
capture the effect of less frequently observed treatment combinations.

Figure 1: Illustration of Q-network architectures, which take the state s as input and output Q(s, a)
for a selected action. In this example, the action space A consists of D = 3 binary sub-action spaces,
each containing two sub-actions (depicted as icons in the original figure). (a) Learning with the
combinatorial action space requires 2³ = 8 output heads (exponential in D), one for each combination
of sub-actions. (b) Incorporating the linear Q decomposition for the factored action space requires
2 × 3 = 6 output heads (linear in D).
Our work provides both theoretical insights and empirical evidence for RL practitioners to consider
this simple linear decomposition for value-based RL approaches. Our contribution complements
many popular offline RL methods focused on distribution shift (e.g., BCQ [14]) and goes beyond
pessimism-only methods by leveraging domain knowledge. Because our approach is compatible with
any algorithm that has a Q-function component, we expect it to lead to gains for offline RL problems
with combinatorial action spaces where data are limited and where domain knowledge can be used to
check the validity of the theoretical assumptions.
2 Problem Setup
We consider Markov decision processes (MDPs) defined by a tuple M = (S, A, p, r, µ0 , γ), where
S and A are the state and action spaces, p(s′|s, a) and r(s, a) are the transition and instantaneous
reward functions, µ0 (s) is the initial state distribution, and γ ∈ [0, 1] is the discount factor. A
probabilistic policy π(a|s) specifies a mapping from each state to a probability distribution over
actions. For a deterministic policy, π(s) refers to the action with π(a|s) = 1. The state-value function
is defined as V^π(s) = E_π E_M[ Σ_{t=1}^{∞} γ^{t−1} r_t | s_1 = s ]. The action-value function, Q^π(s, a), is
defined by further restricting the action taken from the starting state. The goal of RL is to find a policy
π∗ = arg max_π E_{s∼µ_0}[V^π(s)] (or an approximation) that has the maximum expected performance.
2.1 Factored Action Spaces
While the standard MDP definition abstracts away the underlying structure within the action space A,
in this paper, we explicitly express a factored action space as a Cartesian product of D sub-action
spaces, A = ×_{d=1}^{D} A_d = A_1 × · · · × A_D. We use a ∈ A to denote each action, which can be written
as a vector of sub-actions a = [a1 , . . . , aD ], with each ad ∈ Ad . In general, a sub-action space can
be discrete or continuous, and the cardinalities of discrete sub-action spaces are not required to be the
same. For clarity of analysis and illustration, we consider discrete sub-action spaces in this paper.
2.2 Linear Decomposition of Q Function
The traditional factored MDP literature almost exclusively considers state space factorization [15]. In
contrast, here we capitalize on action space factorization to parameterize value functions. Specifically,
our approach considers a linear decomposition of the Q function, as illustrated in Figure 1b:
Q^π(s, a) = Σ_{d=1}^{D} q_d(s, a_d).          (1)
Each component qd (s, ad ) in the summation is allowed to condition on the full state space s and only
one sub-action ad . While similar forms of decomposition have been used in past work, there are key
differences in how the summation components are parameterized. In the multi-agent RL literature,
each component qd (sd , ad ) can only condition on the corresponding state space of the d-th agent [e.g.,
8, 9]. The decomposition in Eqn. (1) also differs from a related form of decomposition considered
by Juozapaitis et al. [12] where each component qd (s, a) can condition on the full action a. To the
best of our knowledge, we are the first to consider this specific form of Q-function decomposition
backed by both theoretical rigor and empirical evidence; in addition, we are the first to apply this idea
to offline RL. We discuss other related work in Section 5.
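For concreteness, a minimal sketch of the two architectures in Figure 1 is given below. PyTorch is assumed, the backbone is an arbitrary two-layer MLP, and the class names, layer sizes, and the restriction to binary sub-actions are illustrative rather than taken from the released implementation.

```python
import torch
import torch.nn as nn


class CombinatorialQNet(nn.Module):
    """Baseline (Figure 1a): one output head per sub-action combination (2^D heads)."""
    def __init__(self, state_dim, n_subactions=3, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 ** n_subactions),   # exponential in D
        )

    def forward(self, s):            # s: [batch, state_dim]
        return self.net(s)           # Q(s, .) over all 2^D combinations


class FactoredQNet(nn.Module):
    """Proposed (Figure 1b): Q(s, a) = sum_d q_d(s, a_d), using 2*D heads."""
    def __init__(self, state_dim, n_subactions=3, hidden=64):
        super().__init__()
        self.D = n_subactions
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * n_subactions),    # linear in D
        )

    def forward(self, s, a):
        # a: [batch, D] with binary entries; returns Q(s, a) as [batch]
        q = self.net(s).view(-1, self.D, 2)                        # q_d(s, .) for each d
        return q.gather(2, a.long().unsqueeze(-1)).squeeze(-1).sum(dim=1)
```

Both networks share the same state-encoder capacity; the only difference is how the output heads are indexed by the action, which is what restricts the factored network to the linear function class in Eqn. (1).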
3 Theoretical Analyses
In this section, we study the theoretical properties of the linear Q-function decomposition induced by
factored action spaces. We first present sufficient and necessary conditions for our approach to yield
unbiased estimates, and then analyze settings in which our approach can reduce variance without
sacrificing policy performance when the conditions are violated. Finally, we discuss how domain
knowledge may be used to check the validity of these conditions, providing examples in healthcare.
(Figure: panel (a) shows the 2-D factored action space A_x × A_y with sub-actions [Left, Right] × [Down, Up]; the remaining panels show an original MDP with +1/+2 rewards, its abstract MDP in the y-direction, and the example decomposition q_x(s_{0,0}, a_x) + q_y(s_{0,0}, a_y) = Q^π(s_{0,0}, a) for a = [a_x, a_y].)
Therefore, while Theorem 1 imposes a rather stringent set of assumptions on the MDP structure
(transitions, rewards) and the policy, violations of these conditions do not preclude the linear parame-
terization of the Q-function from being an unbiased estimator.
3.3 How Does Bias Affect Policy Learning?
When the sufficient conditions do not hold perfectly, using the linear parameterization in Eqn. (1) to
fit the Q-function may incur nonzero approximation error (bias). This can affect the performance of
the learned policy; in Appendix B.3, we derive error bounds based on the extent of bias relative to
the sufficient conditions in Theorem 1. Despite this bias, our approach always leads to a reduction
in the variance of the estimator. This gives us an opportunity to achieve a better bias-variance
trade-off, especially given limited historical data in the offline setting. In addition, as we will
demonstrate, biased Q-values do not always result in suboptimal policy performance, and we identify
the characteristics of problems where this occurs under our proposed linear decomposition.
Applying our approach amounts to solving for the parameters rLeft , rRight , rDown , rUp of the linear system
in Figure 4b, while dropping the interaction term rInteract , resulting in a form of omitted-variable bias
[19]. Solving the system gives the approximate value function where the interaction term β appears
in the approximation Q̂ for all arms (Figure 4c, details in Appendix B.8).
Note that Q̂ = Q∗ only when β = 0, i.e., there is no interaction between the two sub-actions. We
first consider the family of problems with α = 1 and β ∈ [−4, 4]. In Figure 5a, we measure the
value approximation error RMSE(Q∗, Q̂), as well as the suboptimality V^{π∗} − V^{π̂} = max_a Q∗(a) −
Q∗(arg max_a Q̂(a)) of the greedy policy defined by Q̂ as compared to π∗. As expected, when β = 0,
Q̂ is unbiased and has zero approximation error. When β ≠ 0, Q̂ is biased and RMSE > 0; however,
for β ≥ −1, Q̂ corresponds to a policy that correctly identifies the optimal action.
We further investigate this phenomenon considering both α, β ∈ [−4, 4] (to show all regions with
interesting trends), measuring RMSE and suboptimality in the same way as above. As shown in
Figure 5b, the approximation error is zero only when β = 0, regardless of α. However, in Figure 5c,
for a wide range of α and β settings, suboptimality is zero; this suggests that in those regions, even
in the presence of bias (non-zero approximation error), our approach leads to an approximate value
function that correctly identifies the optimal action. The irregular contour outlines multiple regions
where this happens; one key region is when the two sub-actions affect the reward in the same direction
(i.e., α ≥ 0) and their interaction effects also affect the reward in the same direction (i.e., β ≥ 0).
Figure 5: (a) The approximation error and policy suboptimality of our approach for the bandit
problem in Figure 4a, for different settings of β when α = 1. The Q-value approximation is unbiased
only when β = 0, but the corresponding approximate policy is optimal for a wider range of β ≥ −1.
(b-c) The approximation error and policy suboptimality of our approach for the bandit problem in
Figure 4a, for different settings of α and β. The Q-value approximation is unbiased only when
β = 0, but the corresponding approximate policy is optimal for a wide range of α and β values. The
highlighted region of zero suboptimality corresponds to α ≥ 0 and β ≥ 0.
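The computation behind Figure 5 is small enough to sketch directly. Since the exact reward table of Figure 4a is not reproduced here, the sketch below assumes a 2×2 bandit with Q∗([Left, Down]) = 0, Q∗([Right, Down]) = 1, Q∗([Left, Up]) = α, and Q∗([Right, Up]) = 1 + α + β, so that β is exactly the omitted interaction; it fits the additive model by least squares and reports the approximation error and the suboptimality of the induced greedy action.

```python
import numpy as np

def analyze(alpha, beta):
    # Assumed 2x2 reward table: rows = x sub-action (Left, Right), cols = y sub-action (Down, Up).
    Q_star = np.array([[0.0, alpha],
                       [1.0, 1.0 + alpha + beta]])

    # Additive model Q_hat(ax, ay) = q_x(ax) + q_y(ay): one indicator column per sub-action value.
    # The design omits the interaction column, so the fit exhibits omitted-variable bias when beta != 0.
    X = np.array([[1, 0, 1, 0],    # [Left, Down]
                  [1, 0, 0, 1],    # [Left, Up]
                  [0, 1, 1, 0],    # [Right, Down]
                  [0, 1, 0, 1]])   # [Right, Up]
    y = np.array([Q_star[0, 0], Q_star[0, 1], Q_star[1, 0], Q_star[1, 1]])
    w, *_ = np.linalg.lstsq(X, y, rcond=None)   # rank-deficient design; lstsq gives the projection
    Q_hat = (X @ w).reshape(2, 2)

    rmse = np.sqrt(np.mean((Q_star - Q_hat) ** 2))
    greedy = np.unravel_index(np.argmax(Q_hat), Q_hat.shape)   # ties broken by first index
    suboptimality = Q_star.max() - Q_star[greedy]
    return rmse, suboptimality

for beta in [-2.0, 0.0, 2.0]:
    print(beta, analyze(alpha=1.0, beta=beta))
```

Under this assumed parameterization, β = 0 recovers Q∗ exactly, while β ≠ 0 yields RMSE = |β|/4 yet can still leave the greedy action optimal, consistent with the trends shown in Figure 5.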
Based on our theoretical analysis, strong assumptions (Section 3.1) on the problem structure (though
not necessary, Section 3.2) are the only known way to guarantee the unbiasedness of our proposed
linear approximation. It is thus crucial to understand the applicability (and inapplicability) of our
approach in real-world scenarios. Exploring to what extent these assumptions hold in practice is
especially important for safety-critical domains such as healthcare where incorrect actions (treatments)
can have devastating consequences. Fortunately, RL tasks for healthcare are often equipped with
significant domain knowledge, which serves as a better guide to inform the algorithm design than
heuristics-driven reasoning alone [20, 5, 9].
Oftentimes, when clinicians treat conditions using multiple medications at the same time (giving rise
to the factored action space), it is because each medication has a different “mechanism of action,”
resulting in negligible or limited interactions. For example, several classes of medications are used in
the management of chronic heart failure, and each has unique and incremental benefits for patient
outcomes [21]. Problems such as this satisfy the sufficient conditions in Section 3.1 in spite of a
non-factorized state space. Moreover, any small interactions would have a bounded effect on RL
policy performance (according to Appendix B.3).
Similarly, in the management of sepsis (which we consider in Section 4.2), fluids and vasopressors
affect blood pressure to correct hypotension via different mechanisms [22]. Fluid infusion increases
“preload” by increasing the blood return to the heart to make sure the heart has enough blood to
pump out [23]. In contrast, common vasopressors (e.g., norepinephrine) increase “inotropy” by
stimulating the heart muscle and increase peripheral vascular resistance to maintain perfusion to
organs [24, 25]. Therefore, while the two treatments may appear to operate on the same part of the
state space (e.g., they both increase blood pressure), in general they are not expected to interfere with
each other. Recently, there has also been evidence suggesting that their combination can better correct
hypotension [26], which places this problem approximately in the regime discussed in Section 3.3.2.
In offline settings with limited historical data, the benefits of a reduction in variance can outweigh
any potential small bias incurred in the scenarios above and lead to overall performance improvement
(Section 3.3.1). However, our approach is not suitable if the interaction is counter to the effect of
the sub-actions (e.g., two drugs that raise blood pressure individually, but when combined lead to
a decrease). In such scenarios, the resulting bias will likely lead to suboptimal performance (Sec-
tion 3.3.2). Nevertheless, many drug-drug interactions are known and predictable [27–30]. In such
cases, one can either explicitly encode the interaction terms or resort back to a combinatorial action
space (Appendix B.9). While we focus on healthcare, there are other domains in which significant
domain knowledge regarding the interactions among sub-actions is available, e.g., cooperative multi-
agent games in finance where there is a higher payoff if agents cooperate (positive interaction effects)
or intelligent tutoring systems that teach basic arithmetic operations as well as fractions (which are
distinct but related skills). For these problems, this knowledge can and should be leveraged.
4 Experimental Evaluations
We apply our approach to two offline RL problems from healthcare: a simulated and a real-data
problem, both having an action space that is composed of several sub-action spaces. These problems
correspond to settings discussed in Section 3.4 where we expect our proposed approach to perform
well. In the following experiments, we compare our proposed approach (Figure 1b), which makes
assumptions regarding the effect of sub-actions in combination with other sub-actions, against a
common baseline that considers a combinatorial action space (Figure 1a).
(Figure 6: five panels, left to right ρ = 0.9125 (ε = 0.1), ρ = 0.5625 (ε = 0.5), ρ = 0.125, ρ = 0.01, ρ = 0; x-axes: sample size in training episodes, 10² to 10⁴; y-axes: performance (policy value); legend: Baseline vs. Proposed.)
Figure 6: Performance on the sepsis simulator across sample sizes and behavior policies. Plots
display the performance over 10 runs, with the trend lines showing medians and error bars showing
interquartile ranges. ρ is the probability of taking the optimal action under the behavior policies
used to generate offline datasets from the simulator. The left two plots show two ε-greedy policies
(ρ > 0.125; conversion: ρ = (1 − ε) + ε/|A|); the middle plot shows a uniformly random policy
(ρ = 0.125); the right two plots show two policies that undersample the optimal action (ρ < 0.125);
from left to right, ρ decreases. Across different data distributions, our proposed approach outperforms
the baseline at small sample sizes, and closely matches baseline performance at large sample sizes.
Dashed lines denote the value of the optimal policy, which equals 0.736.
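As a sanity check of the conversion (assuming the simulator's combinatorial action space has |A| = 2³ = 8 actions, consistent with the uniformly random value ρ = 1/8 = 0.125): ε = 0.1 gives ρ = (1 − 0.1) + 0.1/8 = 0.9125, and ε = 0.5 gives ρ = (1 − 0.5) + 0.5/8 = 0.5625, matching the panel values above.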
At small sample sizes (< 5000), the proposed approach consistently outperforms the baseline. As sample size
increases further, the performance gap shrinks and eventually the baseline overtakes our proposed
approach. This is because variance decreases with increasing sample size but the bias incurred by the
factored approximation does not change. Once there are enough samples, reductions in variance are
no longer advantageous and the incurred bias dominates the performance. Overall, this shows that
our approach is promising especially for datasets with limited sample size.
How does behavior policy affect performance? As we anneal the behavior policy closer to the optimal
policy (ρ > 0.125, Figure 6 left two), we reduce the randomness in the behavior policy and limit the
amount of exploration possible at the same sample size. The same overall trend largely holds. On
the other hand, when the probability of taking the optimal action is less than random (ρ < 0.125,
Figure 6 right two), the proposed approach achieves better performance than the baseline with an
even larger gap for limited sample sizes (≤ 10³). Without observing the optimal actions (ρ = 0),
the baseline performs relatively poorly, even for large sample sizes. In comparison, our approach
accounts for relationships among actions to some extent and is thus able to better generalize to the
unobserved and underexplored optimal actions, thereby outperforming the baseline.
Takeaways. In a challenging situation where our theoretical assumptions do not perfectly hold, our
proposed approach matches or outperforms the baseline, especially for smaller sample sizes.
(Figure 8: heatmaps over the treatment grid, with rows indexing IV fluid dose bins (0, 1–500 mL, 500 mL–1 L, 1–2 L, >2 L per 4 h) and columns indexing vasopressor dose bins (0, 0.001–0.08, 0.08–0.2, 0.2–0.45, >0.45 μg/kg/min); the left panels show per-action counts under each policy, and the rightmost panel shows per-action state heterogeneity values of roughly 0.14 to 0.18.)
Figure 8: (a) Qualitative comparison of policies. (b) Per-action state heterogeneity, measured as the
standard deviation of all state embeddings from which a particular action is observed in the dataset,
averaged over state embedding dimensions. Actions with higher IV fluid doses exhibited greater
heterogeneity in the observed states from which those actions were taken (by the clinician policy).
The Q-networks were trained for a maximum of 10,000 iterations, with checkpoints saved every 100
iterations. We performed model selection [31] over the saved checkpoints (candidate policies) by
evaluating policy performance using the validation set with OPE. Specifically, we estimated policy
value using weighted importance sampling (WIS) and measured effective sample size (ESS), where
the behavior policy was estimated using k nearest neighbors in the embedding space. Following
previous work [39], the final policies were selected by maximizing validation WIS with ESS of
≥ 200 (we consider other thresholds in Appendix D.3), for which we report results on the test set.
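For reference, a minimal sketch of the WIS and ESS computations used in this selection step is shown below; it assumes per-step probabilities of the logged actions under the candidate and (estimated) behavior policies have already been computed, and the function name and data layout are illustrative rather than taken from the released code.

```python
import numpy as np

def wis_and_ess(probs_eval, probs_behav, rewards, gamma=1.0):
    """probs_eval, probs_behav: lists (one entry per trajectory) of per-step probabilities of the
    logged action under the evaluation and behavior policies; rewards: per-step rewards per trajectory."""
    weights, returns = [], []
    for pe, pb, r in zip(probs_eval, probs_behav, rewards):
        weights.append(np.prod(np.asarray(pe) / np.asarray(pb)))            # per-trajectory ratio
        returns.append(np.sum(gamma ** np.arange(len(r)) * np.asarray(r)))  # discounted return
    w, g = np.asarray(weights), np.asarray(returns)
    wis = np.sum(w * g) / np.sum(w)        # self-normalized (weighted) importance sampling estimate
    ess = np.sum(w) ** 2 / np.sum(w ** 2)  # one common definition of effective sample size
    return wis, ess
```

Checkpoints would then be ranked by the WIS estimate among those whose ESS clears the chosen threshold (≥ 200 above).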
Results. We visualize the validation performance over all candidate policies. Figure 7-left shows
that the performance Pareto frontier (in terms of WIS and ESS) of the proposed approach generally
dominates the baseline.
Quantitative comparisons. Evaluating the final selected policies on the test set (Figure 7-right) shows
that the proposed factored BCQ achieves a higher policy value (estimated using WIS) than baseline
BCQ at the same level of ESS. In addition, both policies have a similar level of agreement with the
clinician policy, comparable to the average agreement among clinicians.
Qualitative comparisons. In Figure 8a, we compare the distributions of recommended actions by
the clinician behavior policy, baseline BCQ and factored BCQ, as evaluated on the test set. While
overall the policies look rather similar, in that the most frequently recommended action corresponds
to low doses of IV fluids <500mL with no vasopressors, there are notable differences for key parts of
the action space. In particular, baseline BCQ almost never recommends higher doses of IV fluids
>500 mL, either alone or in combination with vasopressors, whereas both clinician and factored
BCQ recommend IV fluids >500 mL more frequently. These actions are typically used for critically
ill patients, for whom the Surviving Sepsis Campaign guidelines recommend up to >2 L of fluids
[40]. We hypothesize that this difference is due to a higher level of heterogeneity in the patient states
for which actions with high IV fluid doses were observed, compared to the remaining actions with
lower doses of IV fluids. To further understand this phenomenon, we measure the per-action state
heterogeneity in the test set by computing, for each action, the standard deviation (averaged over
the embedding dimensions) of all RNN state embeddings from which that action is taken according
to the behavior policy. As shown in Figure 8b, actions with higher IV fluids generally have larger
standard deviations, supporting our hypothesis. The larger heterogeneity combined with lower sample
sizes makes it difficult for baseline BCQ to correctly infer the effects of these actions, as it does not
leverage the relationship among actions. In contrast, our approach leverages the factored action space
and can thus make better inferences about these actions.
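The heterogeneity measure itself is straightforward to compute; the sketch below assumes an array of RNN state embeddings and the corresponding logged actions (names are illustrative).

```python
import numpy as np

def per_action_state_heterogeneity(embeddings, actions):
    """embeddings: [N, d] state embeddings; actions: [N] logged (behavior-policy) action indices.
    Returns, for each action, the standard deviation of the embeddings of states in which that
    action was taken, averaged over embedding dimensions."""
    return {int(a): embeddings[actions == a].std(axis=0).mean()
            for a in np.unique(actions)}
```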
Takeaways. Applied to real clinical data, our proposed approach outperforms the baseline quantita-
tively and recommends treatments that align better with clinical knowledge. While promising, these
results are based in part on OPE, which has many issues [32, 33]. We stress that further investigation
and close collaboration with clinicians are essential before such RL algorithms are used in practice.
5 Related Work
For many years, the factored RL literature focused exclusively on state space factorization [15, 41–43].
More recently, interest in action space factorization has grown, as RL is applied in increasingly
more complex planning and control tasks. In particular, researchers have previously considered
the model-based setting with known MDP factorizations in which both state and action spaces are
factorized [44–47]. For model-free approaches, others have studied methods for factored actions
with a policy component (i.e., policy-based or actor-critic) [48, 20, 49–51]. In contrast, our work
considers value-based methods as those have been the most successful in offline RL [52].
Among prior work with a value-based component (e.g., Q-network), the majority pertains to multi-
agent [8–10, 53, 54] or multi-objective [51] problems that impose known, explicit assumptions on
the state space or the reward function. Notably, Son et al. [10] established theoretical conditions
for factored optimal actions (called “Individual-Global-Max”) for multi-agent RL and motivated
subsequent works [55, 56]; their result differs from our contribution on the unbiasedness of factored
Q-functions (instead of actions) for single-agent RL. In the online setting for single-agent deep
RL, Sharma et al. [20] and Tavakoli et al. [5] incorporated factored action spaces into Q-network
architecture designs, but did not provide a formal justification for the linear decomposition. Others
have empirically compared various “mixing” functions to combine the values of sub-actions [20, 9]. In
contrast, while our work only considers the linear decomposition function, we examine its theoretical
properties and provide justifications for using this approach in practical problems, especially in offline
settings. Our linear Q-decomposition is related to that of Swaminathan et al. [57] who also applied a
linearity assumption for off-policy evaluation, but for combinatorial contextual bandits rather than
RL. Our insights on the bias-variance trade-off are also related to concurrent work by Saito and
Joachims [58], who proposed efficient off-policy evaluation for bandit problems with large (but not
necessarily factored) action spaces. In Table 1 of the appendix, we further outline the differences
between our work and the existing literature.
Finally, the sufficient conditions we establish are related to, but different from, those identified by
Van Seijen et al. [59] and Juozapaitis et al. [12] who considered reward decompositions in the absence
of factored actions. Related, Metz et al. [60] proposed an approach that sequentially predicts values
for discrete dimensions of a transformed continuous action space, but assume an a priori ordering
of action dimensions, which we do not; Pierrot et al. [50] studied a different form of action space
factorization where sub-actions are sequentially selected in an autoregressive manner. Complementary
to our work, Tavakoli et al. [6] proposed to organize the sub-actions and interactions as a hypergraph
and linearly combining the values; our theoretical results on the linear decomposition nonetheless
apply to their setting where the sub-action interactions are explicitly identified and encoded.
6 Conclusion
To better leverage factored action spaces in RL, we developed an approach to learning policies that
incorporates a simple linear decomposition of the Q-function. We theoretically analyze the sufficient
and necessary conditions for this parameterization to yield unbiased estimates, study its effect on
variance reduction, and identify scenarios when any resulting bias does not lead to suboptimal
performance. We also note how domain knowledge may be used to inform the applicability of
our approach in practice, for problems where any possible bias is negligible or does not affect
optimality. Through empirical experiments on two offline RL problems involving a simulator and
real clinical data, we demonstrate the advantage of our approach especially in settings with limited
sample sizes. We provide further discussions on limitations, ethical considerations and societal
impacts in Appendix A. Though motivated by healthcare, our approach could apply more broadly
to scale RL to other applications (e.g., education) involving combinatorial action spaces where
domain knowledge may be used to verify the theoretical conditions. Future work should consider the
theoretical implications of linear Q decompositions when combined with other offline RL-specific
algorithms [52]. Given the challenging nature of identifying the best treatments from offline data, our
proposed approach may also be combined with other RL techniques that do not aim to identify the
single best action (e.g., learning dead-ends [61] or set-valued policies [34]).
Acknowledgments
This work was supported by the National Science Foundation (NSF; award IIS-1553146 to JW;
award IIS-2007076 to FDV; award IIS-2153083 to MM) and the National Library of Medicine of
the National Institutes of Health (NLM; grant R01LM013325 to JW and MWS). The views and
conclusions in this document are those of the authors and should not be interpreted as necessarily
representing the official policies, either expressed or implied, of the National Science Foundation, nor
of the National Institutes of Health. This work was supported, in part, by computational resources
and services provided by Advanced Research Computing, a division of Information and Technology
Services (ITS) at the University of Michigan, Ann Arbor. The authors would like to thank Adith
Swaminathan, Tabish Rashid, and members of the MLD3 group for helpful discussions regarding
this work, as well as the reviewers for constructive feedback.
References
[1] Damien Ernst, Guy-Bart Stan, Jorge Goncalves, and Louis Wehenkel. Clinical data based optimal STI
strategies for HIV: a reinforcement learning approach. In Proceedings of the 45th IEEE Conference on
Decision and Control, pages 667–672. IEEE, 2006.
[2] Matthieu Komorowski, Leo A Celi, Omar Badawi, Anthony C Gordon, and A Aldo Faisal. The artificial
intelligence clinician learns optimal treatment strategies for sepsis in intensive care. Nature medicine, 24
(11):1716–1720, 2018.
[3] Niranjani Prasad, Li Fang Cheng, Corey Chivers, Michael Draugelis, and Barbara E Engelhardt. A
reinforcement learning approach to weaning of mechanical ventilation in intensive care units. In Conference
on Uncertainty in Artificial Intelligence (UAI), 2017.
[4] Sonali Parbhoo, Mario Wieser, Volker Roth, and Finale Doshi-Velez. Transfer learning from well-curated
to less-resourced populations with HIV. In Machine Learning for Healthcare Conference, pages 589–609.
PMLR, 2020.
[5] Arash Tavakoli, Fabio Pardo, and Petar Kormushev. Action branching architectures for deep reinforcement
learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.
[6] Arash Tavakoli, Mehdi Fatemi, and Petar Kormushev. Learning to represent action values as a hypergraph
on the action vertices. In International Conference on Learning Representations, 2021. URL https:
//openreview.net/forum?id=Xv_s64FiXTv.
[7] Craig Boutilier. Planning, learning and coordination in multiagent decision processes. In TARK, volume 96,
pages 195–210. Citeseer, 1996.
[8] Peter Sunehag, Guy Lever, Audrunas Gruslys, Wojciech Marian Czarnecki, Vinicius Zambaldi, Max Jader-
berg, Marc Lanctot, Nicolas Sonnerat, Joel Z. Leibo, Karl Tuyls, and Thore Graepel. Value-decomposition
networks for cooperative multi-agent learning based on team reward. In Proceedings of the 17th Interna-
tional Conference on Autonomous Agents and MultiAgent Systems, AAMAS, 2018.
[9] Tabish Rashid, Mikayel Samvelyan, Christian Schroeder, Gregory Farquhar, Jakob Foerster, and Shimon
Whiteson. QMIX: Monotonic value function factorisation for deep multi-agent reinforcement learning. In
International Conference on Machine Learning, pages 4295–4304. PMLR, 2018.
[10] Kyunghwan Son, Daewoo Kim, Wan Ju Kang, David Earl Hostallero, and Yung Yi. QTRAN: Learning
to factorize with transformation for cooperative multi-agent reinforcement learning. In International
conference on machine learning, pages 5887–5896. PMLR, 2019.
[11] Ming Zhou, Yong Chen, Ying Wen, Yaodong Yang, Yufeng Su, Weinan Zhang, Dell Zhang, and Jun
Wang. Factorized Q-learning for large-scale multi-agent systems. In Proceedings of the First International
Conference on Distributed Artificial Intelligence, pages 1–7, 2019.
[12] Zoe Juozapaitis, Anurag Koul, Alan Fern, Martin Erwig, and Finale Doshi-Velez. Explainable reinforce-
ment learning via reward decomposition. In IJCAI/ECAI Workshop on Explainable Artificial Intelligence,
2019.
[13] Michael Oberst and David Sontag. Counterfactual off-policy evaluation with Gumbel-max structural causal
models. In International Conference on Machine Learning, pages 4881–4890, 2019.
[14] Scott Fujimoto, David Meger, and Doina Precup. Off-policy deep reinforcement learning without explo-
ration. In International Conference on Machine Learning, pages 2052–2062. PMLR, 2019.
[15] Daphne Koller and Ronald Parr. Computing factored value functions for policies in structured MDPs.
In Proceedings of the 16th International Joint Conference on Artificial Intelligence-Volume 2, pages
1332–1339, 1999.
[16] Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of machine learning. MIT
press, 2018.
[17] Yaqi Duan, Chi Jin, and Zhiyuan Li. Risk bounds and Rademacher complexity in batch reinforcement
learning. arXiv preprint arXiv:2103.13883, 2021.
[18] Maggie Makar, Ben Packer, Dan Moldovan, Davis Blalock, Yoni Halpern, and Alexander D’Amour.
Causally motivated shortcut removal using auxiliary labels. In International Conference on Artificial
Intelligence and Statistics, pages 739–766. PMLR, 2022.
[19] Jeffrey M Wooldridge. Introductory econometrics: A modern approach. Cengage learning, 2015.
[20] Sahil Sharma, Aravind Suresh, Rahul Ramesh, and Balaraman Ravindran. Learning to factor policies
and action-value functions: Factored action space representations for deep reinforcement learning. arXiv
preprint arXiv:1705.07269, 2017.
[21] Michel Komajda, Michael Boehm, Jeffrey S Borer, Ian Ford, Luigi Tavazzi, Matthieu Pannaux, and Karl
Swedberg. Incremental benefit of drug therapies for chronic heart failure with reduced ejection fraction: a
network meta-analysis. European journal of heart failure, 20(9):1315–1322, 2018.
[22] Jeffrey Gotts and Michael Matthay. Sepsis: pathophysiology and clinical management. BMJ, 353, 2016.
[23] Laurent Guérin, Jean-Louis Teboul, Romain Persichini, Martin Dres, Christian Richard, and Xavier Monnet.
Effects of passive leg raising and volume expansion on mean systemic pressure and venous return in shock
in humans. Critical Care, 19(1):1–9, 2015.
[24] Olfa Hamzaoui, Jean-François Georger, Xavier Monnet, Hatem Ksouri, Julien Maizel, Christian Richard,
and Jean-Louis Teboul. Early administration of norepinephrine increases cardiac preload and cardiac
output in septic patients with life-threatening hypotension. Critical care, 14(4):1–9, 2010.
[25] Xavier Monnet, Julien Jabot, Julien Maizel, Christian Richard, and Jean-Louis Teboul. Norepinephrine
increases cardiac preload and reduces preload dependency assessed by passive leg raising in septic shock
patients. Critical care medicine, 39(4):689–694, 2011.
[26] Olfa Hamzaoui. Combining fluids and vasopressors: A magic potion? Journal of Intensive Medicine, 2021.
ISSN 2667-100X. doi: https://doi.org/10.1016/j.jointm.2021.09.004.
[27] Teijo I Saari, Kari Laine, Mikko Neuvonen, Pertti J Neuvonen, and Klaus T Olkkola. Effect of voricona-
zole and fluconazole on the pharmacokinetics of intravenous fentanyl. European journal of clinical
pharmacology, 64(1):25–30, 2008.
[28] Pamela L Smithburger, Sandra L Kane-Gill, and Amy L Seybert. Drug–drug interactions in the medical
intensive care unit: an assessment of frequency, severity and the medications involved. International
Journal of Pharmacy Practice, 20(6):402–408, 2012.
[29] WebMD. Drugs interaction checker - find interactions between medications. https://www.webmd.com/
interaction-checker/default.htm.
[30] Epic. Lexicomp and medi-span for epic - drug reference information. https://www.wolterskluwer.com/en/
solutions/lexicomp/about/epic/drug-reference.
[31] Shengpu Tang and Jenna Wiens. Model selection for offline reinforcement learning: Practical considerations
for healthcare settings. In Machine Learning for Healthcare Conference, pages 2–35. PMLR, 2021.
[32] Omer Gottesman, Fredrik Johansson, Joshua Meier, Jack Dent, Donghun Lee, Srivatsan Srinivasan,
Linying Zhang, Yi Ding, David Wihl, Xuefeng Peng, et al. Evaluating reinforcement learning algorithms
in observational health settings. arXiv preprint arXiv:1805.12298, 2018.
[33] Omer Gottesman, Fredrik Johansson, Matthieu Komorowski, Aldo Faisal, David Sontag, Finale Doshi-
Velez, and Leo Anthony Celi. Guidelines for reinforcement learning in healthcare. Nature medicine, 25(1):
16–18, 2019.
[34] Shengpu Tang, Aditya Modi, Michael Sjoding, and Jenna Wiens. Clinician-in-the-loop decision making:
Reinforcement learning with near-optimal set-valued policies. In International Conference on Machine
Learning, pages 9387–9396. PMLR, 2020.
[35] Taylor W Killian, Haoran Zhang, Jayakumar Subramanian, Mehdi Fatemi, and Marzyeh Ghassemi. An
empirical study of representation learning for reinforcement learning in healthcare. In Machine Learning
for Health NeurIPS Workshop, 2020.
[36] Alistair EW Johnson, Tom J Pollard, Lu Shen, Li-wei H Lehman, Mengling Feng, Mohammad Ghassemi,
Benjamin Moody, Peter Szolovits, Leo Anthony Celi, and Roger G Mark. MIMIC-III, a freely accessible
critical care database. Scientific data, 3(1):1–9, 2016.
[37] Jayakumar Subramanian and Aditya Mahajan. Approximate information state for partially observed
systems. In 2019 IEEE 58th Conference on Decision and Control (CDC), pages 1629–1636. IEEE, 2019.
[38] Scott Fujimoto, Edoardo Conti, Mohammad Ghavamzadeh, and Joelle Pineau. Benchmarking batch deep
reinforcement learning algorithms. arXiv preprint arXiv:1910.01708, 2019.
[39] Yao Liu and Emma Brunskill. Avoiding overfitting to the importance weights in offline policy optimization,
2022. URL https://openreview.net/forum?id=dLTXoSIcrik.
[40] Laura Evans, Andrew Rhodes, Waleed Alhazzani, Massimo Antonelli, Craig M Coopersmith, Craig
French, Flávia R Machado, Lauralyn Mcintyre, Marlies Ostermann, Hallie C Prescott, et al. Surviving
sepsis campaign: international guidelines for management of sepsis and septic shock 2021. Intensive care
medicine, 47(11):1181–1247, 2021.
[41] Carlos Guestrin, Daphne Koller, Ronald Parr, and Shobha Venkataraman. Efficient solution algorithms for
factored MDPs. Journal of Artificial Intelligence Research, 19:399–468, 2003.
[42] Alexander L Strehl, Carlos Diuk, and Michael L Littman. Efficient structure learning in factored-state
MDPs. 2007.
[43] Karina Valdivia Delgado, Scott Sanner, and Leliane Nunes De Barros. Efficient solutions to factored MDPs
with imprecise transition probabilities. Artificial Intelligence, 175(9-10):1498–1527, 2011.
[44] Aswin Raghavan, Saket Joshi, Alan Fern, Prasad Tadepalli, and Roni Khardon. Planning in factored action
spaces with symbolic dynamic programming. In Twenty-Sixth AAAI Conference on Artificial Intelligence,
2012.
[45] Aswin Raghavan, Roni Khardon, Alan Fern, and Prasad Tadepalli. Symbolic opportunistic policy iteration
for factored-action mdps. Advances in Neural Information Processing Systems, 26, 2013.
[46] Ian Osband and Benjamin Van Roy. Near-optimal reinforcement learning in factored MDPs. Advances in
Neural Information Processing Systems, 27:604–612, 2014.
[47] Yangyi Lu, Amirhossein Meisami, and Ambuj Tewari. Causal Markov decision processes: Learning good
interventions efficiently. arXiv preprint arXiv:2102.07663, 2021.
[48] Brian Sallans and Geoffrey E Hinton. Reinforcement learning with factored states and actions. The Journal
of Machine Learning Research, 5:1063–1088, 2004.
[49] Tom Van de Wiele, David Warde-Farley, Andriy Mnih, and Volodymyr Mnih. Q-learning in enormous
action spaces via amortized approximate maximization. arXiv preprint arXiv:2001.08116, 2020.
[50] Thomas Pierrot, Valentin Macé, Jean-Baptiste Sevestre, Louis Monier, Alexandre Laterre, Nicolas Perrin,
Karim Beguir, and Olivier Sigaud. Factored action spaces in deep reinforcement learning, 2021. URL
https://openreview.net/forum?id=naSAkn2Xo46.
[51] Thomas Spooner, Nelson Vadori, and Sumitra Ganesh. Factored policy gradients: Leveraging structure
for efficient learning in MOMDPs. In Advances in Neural Information Processing Systems, 2021. URL
https://openreview.net/forum?id=NXGnwTLlWiR.
[52] Sergey Levine, Aviral Kumar, George Tucker, and Justin Fu. Offline reinforcement learning: Tutorial,
review, and perspectives on open problems. arXiv preprint arXiv:2005.01643, 2020.
[53] Laetitia Matignon, Guillaume J Laurent, and Nadine Le Fort-Piat. Independent reinforcement learners
in cooperative Markov games: a survey regarding coordination problems. The Knowledge Engineering
Review, 27(1):1–31, 2012.
[54] Ardi Tampuu, Tambet Matiisen, Dorian Kodelja, Ilya Kuzovkin, Kristjan Korjus, Juhan Aru, Jaan Aru, and
Raul Vicente. Multiagent cooperation and competition with deep reinforcement learning. PLOS ONE, 12(4):
1–15, 04 2017. doi: 10.1371/journal.pone.0172395. URL https://doi.org/10.1371/journal.pone.0172395.
[55] Jianhao Wang, Zhizhou Ren, Terry Liu, Yang Yu, and Chongjie Zhang. QPLEX: Duplex dueling multi-
agent Q-learning. In International Conference on Learning Representations, 2021. URL https://openreview.
net/forum?id=Rcmk0xxIQV.
[56] Jianhao Wang, Zhizhou Ren, Beining Han, Jianing Ye, and Chongjie Zhang. Towards understanding
cooperative multi-agent Q-learning with value factorization. Advances in Neural Information Processing
Systems, 34:29142–29155, 2021.
[57] Adith Swaminathan, Akshay Krishnamurthy, Alekh Agarwal, Miro Dudik, John Langford, Damien Jose,
and Imed Zitouni. Off-policy evaluation for slate recommendation. Advances in Neural Information
Processing Systems, 30, 2017.
[58] Yuta Saito and Thorsten Joachims. Off-policy evaluation for large action spaces via embeddings. arXiv
preprint arXiv:2202.06317, 2022.
[59] Harm Van Seijen, Mehdi Fatemi, Joshua Romoff, Romain Laroche, Tavian Barnes, and Jeffrey Tsang.
Hybrid reward architecture for reinforcement learning. In Advances in Neural Information Processing
Systems, 2017.
[60] Luke Metz, Julian Ibarz, Navdeep Jaitly, and James Davidson. Discrete sequential prediction of continuous
actions for deep RL, 2018. URL https://openreview.net/forum?id=r1SuFjkRW.
[61] Mehdi Fatemi, Taylor W. Killian, Jayakumar Subramanian, and Marzyeh Ghassemi. Medical dead-ends
and learning to identify high-risk states and treatments. In Advances in Neural Information Processing
Systems, 2021. URL https://openreview.net/forum?id=4CRpaV4pYp.
[62] Jinglin Chen and Nan Jiang. Information-theoretic considerations in batch reinforcement learning. In
International Conference on Machine Learning, pages 1042–1051, 2019.
[63] Tengyang Xie and Nan Jiang. Batch value-function approximation with only realizability. In International
Conference on Machine Learning, pages 11404–11413. PMLR, 2021.
[64] Wenhao Zhan, Baihe Huang, Audrey Huang, Nan Jiang, and Jason D Lee. Offline reinforcement learning
with realizability and single-policy concentrability. arXiv preprint arXiv:2202.04634, 2022.
[65] Lihong Li, Thomas J Walsh, and Michael L Littman. Towards a unified theory of state abstraction for
MDPs. Proceedings of the 9th International Symposium on Artificial Intelligence and Mathematics, 4,
2006.
[66] Satinder P Singh and Richard C Yee. An upper bound on the loss from approximate optimal-value functions.
Machine Learning, 16(3):227–233, 1994.
[67] Michael Kearns and Satinder Singh. Near-optimal reinforcement learning in polynomial time. Machine
learning, 49(2):209–232, 2002.
[68] Michail G Lagoudakis and Ronald Parr. Least-squares policy iteration. The Journal of Machine Learning
Research, 4:1107–1149, 2003.
[69] Pranjal Awasthi, Natalie Frank, and Mehryar Mohri. On the Rademacher complexity of linear hypothesis
sets. arXiv preprint arXiv:2007.11045, 2020.
[70] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International
Conference on Learning Representations, 2015.
[71] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare,
Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through
deep reinforcement learning. Nature, 518(7540):529–533, 2015.
Checklist
1. For all authors...
(a) Do the main claims made in the abstract and introduction accurately reflect the paper’s
contributions and scope? [Yes]
(b) Did you describe the limitations of your work? [Yes] Section 3.4, Appendix A
(c) Did you discuss any potential negative societal impacts of your work? [Yes] Appx A
(d) Have you read the ethics review guidelines and ensured that your paper conforms to
them? [Yes]
2. If you are including theoretical results...
(a) Did you state the full set of assumptions of all theoretical results? [Yes] Sec 3, Appx B
(b) Did you include complete proofs of all theoretical results? [Yes] Appendix B
3. If you ran experiments...
(a) Did you include the code, data, and instructions needed to reproduce the main exper-
imental results (either in the supplemental material or as a URL)? [Yes] see Data &
Code Availability
(b) Did you specify all the training details (e.g., data splits, hyperparameters, how they
were chosen)? [Yes] Sec 4.1-4.2, Appendix D.1-D.2
(c) Did you report error bars (e.g., with respect to the random seed after running experi-
ments multiple times)? [Yes] Fig 6
(d) Did you include the total amount of compute and the type of resources used (e.g., type
of GPUs, internal cluster, or cloud provider)? [No] our experiments only involved
small neural networks
4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...
(a) If your work uses existing assets, did you cite the creators? [Yes] MIMIC-III by
Johnson et al. [36], sepsisSim by Oberst and Sontag [13]
(b) Did you mention the license of the assets? [No] please refer to the website of the
creator of MIMIC-III https://physionet.org/content/mimiciii/1.4/
(c) Did you include any new assets either in the supplemental material or as a URL? [N/A]
(d) Did you discuss whether and how consent was obtained from people whose data
you’re using/curating? [No] please refer to the website of the creator of MIMIC-III
https://physionet.org/content/mimiciii/1.4/
(e) Did you discuss whether the data you are using/curating contains personally identifiable
information or offensive content? [Yes] Sec 4.2, MIMIC-III is de-identified
5. If you used crowdsourcing or conducted research with human subjects...
(a) Did you include the full text of instructions given to participants and screenshots, if
applicable? [N/A]
(b) Did you describe any potential participant risks, with links to Institutional Review
Board (IRB) approvals, if applicable? [N/A]
(c) Did you include the estimated hourly wage paid to participants and the total amount
spent on participant compensation? [N/A]
A Additional Discussion
Computational Efficiency. While our main analysis focuses on statistical efficiency (variance) and
its trade-off with approximation error (bias), here we outline some considerations on computational
efficiency. To compute the values for all output heads in Figure 1, there is a clear saving of
computational cost by our approach with a linear complexity O(D) (measured in flops) in the number
of sub-actions, whereas the baseline has an exponential complexity O(exp(D)). We consider two
common inference operations after the values of the output heads are computed: maxa Q(s, a)
and arg maxa Q(s, a). For both operations, the baseline has an exponential time complexity of
O(exp(D)). For our proposed approach, an optimized implementation has a linear time complexity
of O(D): one can perform argmax/max per sub-action and then concatenate/sum the results. In our
current code release, we did not implement the optimized version; instead, we made use of the sub-
action featurization matrix defined in Appendix B.4 so that automatic differentiation can be applied
directly. This implementation is computationally more expensive than our analysis above and than the
baseline: the forward pass includes a dense matrix multiplication with time complexity O(D exp(D))
flops, followed by an O(exp(D)) argmax/max operation. In settings where computational complexity
might be a bottleneck (especially at inference time), we recommend using the featurization matrix
implementation for learning and the optimized version for inference.
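For illustration, the O(D) max/argmax described above amounts to the following (a sketch; the function name and input layout are not from the released code):

```python
import numpy as np

def factored_max_and_argmax(sub_q_values):
    """sub_q_values: list of D arrays, where sub_q_values[d][j] = q_d(s, a_d = j).
    Since Q(s, a) = sum_d q_d(s, a_d), the joint max/argmax over exp(D) actions
    reduces to D independent per-sub-action max/argmax operations."""
    best_value = sum(float(q.max()) for q in sub_q_values)
    best_action = [int(q.argmax()) for q in sub_q_values]
    return best_value, best_action

# Example with D = 3 binary sub-actions:
q_heads = [np.array([0.25, -0.5]), np.array([0.0, 0.5]), np.array([0.25, 0.25])]
print(factored_max_and_argmax(q_heads))   # (1.0, [0, 1, 0])
```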
Limitations. Our theoretical analysis in Section 3 focuses on the “realizability” condition of the linear
function class [62], where we are interested in guarantees of zero approximation error, i.e., whether
the true Q∗ lies within the linear function class. In principle, it is possible to find Q∗ given a realizable
function class (e.g., by enumerating all member functions). However, when Q-learning-style iterative
algorithms are used in practice, their convergence relies on a stronger "completeness" condition, as
discussed in [62–64]. We did not investigate how our proposed form of parameterization (and the
specific shape of bias introduced) interacts with the learning procedure, and this is an interesting
direction for future work (Wang et al. [56] studied this for linear value factorization in the context of
FQI but for multi-agent RL).
Ethical Considerations and Societal Impact. In general, policies computationally derived using RL
must be carefully validated before they can be used in high-stakes domains such as healthcare. Our
linear parameterization implicitly makes an independence assumption with respect to the sub-actions,
allowing the Q-function to generalize to sub-action combinations that are underexplored (and even
unexplored) in the offline data (as shown in Section 4.1). When the independence assumptions are
valid (according to domain knowledge), this is a case of a “free lunch” as we can reduce variance
without introducing any bias. However, inaccurate or incomplete domain knowledge may render the
independence assumptions invalid and cause the agent to incorrectly generalize to dangerous actions
(e.g., learned policy recommends drug combinations with adverse side effects, see Section 3.4). This
misuse may be alleviated by incorporating additional offline RL safeguards to constrain the learned
policy (e.g., BCQ was used in Section 4.2 to restrict the learned policy to not take rarely observed
sub-action combinations). Still, to apply RL in healthcare and other safety-critical domains, it is
important to consult and closely collaborate with domain experts (e.g., clinicians for healthcare
problems) to come up with meaningful tasks and informed assumptions, and perform thorough
evaluations involving both the quantitative and qualitative aspects [32, 33].
B Detailed Theoretical Analyses
To build intuition, we first consider a related setting where D MDPs are running in parallel. If every
MDP evolves independently as controlled by its respective policy, then the total return from all D
MDPs should naturally be the sum of the individual returns from each MDP. Formally, we state the
following proposition involving fully factored MDPs and factored policies. Here, we use the vector
notation s = [s1 , · · · , sD ] to indicate the explicit state space factorization.
Definition 1. Given MDPs M_1, ..., M_D where each M_d is defined by (S_d, A_d, p_d, r_d), a fully
factored MDP M = ⊗_{d=1}^{D} M_d is defined by (S, A, p, r) such that S = ×_{d=1}^{D} S_d, A = ×_{d=1}^{D} A_d,
p(s′|s, a) = ∏_{d=1}^{D} p_d(s′_d|s_d, a_d), and r(s, a) = Σ_{d=1}^{D} r_d(s_d, a_d).

Definition 2. Given MDPs M_1, ..., M_D and policies π_1, ..., π_D where each π_d : S_d → Δ(A_d),
a factored policy π = ⊗_{d=1}^{D} π_d for the MDP M = ⊗_{d=1}^{D} M_d is π : S → Δ(A) such that
π(a|s) = ∏_{d=1}^{D} π_d(a_d|s_d).

Proposition 7. The Q-function of policy π = ⊗_{d=1}^{D} π_d for MDP M = ⊗_{d=1}^{D} M_d can be expressed
as Q^π_M(s, a) = Σ_{d=1}^{D} Q^{π_d}_{M_d}(s_d, a_d).

To match the form in Eqn. (1), we can set q_d(s, a_d) = Q^{π_d}_{M_d}(s_d, a_d). Importantly, each Q^{π_d}_{M_d} does
not depend on any a_{d′} with d′ ≠ d. Note that although our definition of q_d is allowed to condition
on the entire state space s, each Q^{π_d}_{M_d} only depends on s_d. Proposition 7 can be seen as a corollary to
Theorem 1 where the abstractions are defined using the sub-state spaces, such that φ_d : S → S_d.
Proof of Proposition 7. Without loss of generality, we consider the setting with D = 2 such that
A = A_1 × A_2; the extension to D > 2 is straightforward. The proof is based on mathematical induction
on a sequence of h-step Q-functions of π defined as Q^{π,(h)}_M(s, a) = E[ Σ_{t=1}^{h} γ^{t−1} r_t | s_1 = s, a_1 = a, a_t ∼ π ].

Base case. For h = 1, the one-step Q-function is simply the reward, which by assumption satisfies
r(s, a) = r_1(s_1, a_1) + r_2(s_2, a_2). Therefore, Q^{π,(1)}_M(s, a) = Q^{π_1,(1)}_{M_1}(s_1, a_1) + Q^{π_2,(1)}_{M_2}(s_2, a_2).

Inductive step. Suppose Q^{π,(h)}_M(s, a) = Q^{π_1,(h)}_{M_1}(s_1, a_1) + Q^{π_2,(h)}_{M_2}(s_2, a_2) holds. We can express
Q^{π,(h+1)}_M in terms of Q^{π,(h)}_M using the Bellman equation:

    Q^{π,(h+1)}_M(s, a) = r(s, a) + γ Σ_{s′} p(s′|s, a) V^{π,(h)}_M(s′),

where we refer to r(s, a) as term (1) and the second summand as term (2), and
V^{π,(h)}_M(s′) = Σ_{a′} π(a′|s′) Q^{π,(h)}_M(s′, a′).

By Definition 1, term (1) can be written as a sum r(s, a) = r_1(s_1, a_1) + r_2(s_2, a_2) where each summand
depends on only either a_1 or a_2 but not both. Next we show that term (2) also decomposes in a similar
manner. For a given s we have:

    V^{π,(h)}_M(s) = Σ_a π(a|s) Q^{π,(h)}_M(s, a)
                  = Σ_{a_1, a_2} π_1(a_1|s_1) π_2(a_2|s_2) [ Q^{π_1,(h)}_{M_1}(s_1, a_1) + Q^{π_2,(h)}_{M_2}(s_2, a_2) ]
                  = [ Σ_{a_2} π_2(a_2|s_2) ] Σ_{a_1} π_1(a_1|s_1) Q^{π_1,(h)}_{M_1}(s_1, a_1) + [ Σ_{a_1} π_1(a_1|s_1) ] Σ_{a_2} π_2(a_2|s_2) Q^{π_2,(h)}_{M_2}(s_2, a_2)
                  = Σ_{a_1} π_1(a_1|s_1) Q^{π_1,(h)}_{M_1}(s_1, a_1) + Σ_{a_2} π_2(a_2|s_2) Q^{π_2,(h)}_{M_2}(s_2, a_2),

where the bracketed sums equal 1; here we use the fact that π_1(a_1|s_1) Q^{π_1,(h)}_{M_1}(s_1, a_1) is independent of
π_2(a_2|s_2) (and vice versa), and that π_d(·|s_d) is a probability simplex. Letting
V^{π_d,(h)}_{M_d}(s_d) = Σ_{a_d} π_d(a_d|s_d) Q^{π_d,(h)}_{M_d}(s_d, a_d), then V^{π,(h)}_M(s′) = V^{π_1,(h)}_{M_1}(s′_1) + V^{π_2,(h)}_{M_2}(s′_2).

Substituting into term (2), we have:

    Σ_{s′} p(s′|s, a) V^{π,(h)}_M(s′)
        = Σ_{s′_1, s′_2} p_1(s′_1|s_1, a_1) p_2(s′_2|s_2, a_2) [ V^{π_1,(h)}_{M_1}(s′_1) + V^{π_2,(h)}_{M_2}(s′_2) ]
        = [ Σ_{s′_2} p_2(s′_2|s_2, a_2) ] Σ_{s′_1} p_1(s′_1|s_1, a_1) V^{π_1,(h)}_{M_1}(s′_1) + [ Σ_{s′_1} p_1(s′_1|s_1, a_1) ] Σ_{s′_2} p_2(s′_2|s_2, a_2) V^{π_2,(h)}_{M_2}(s′_2)
        = Σ_{s′_1} p_1(s′_1|s_1, a_1) V^{π_1,(h)}_{M_1}(s′_1) + Σ_{s′_2} p_2(s′_2|s_2, a_2) V^{π_2,(h)}_{M_2}(s′_2),

where we make use of a similar independence property between p_1(s′_1|s_1, a_1) V^{π_1,(h)}_{M_1}(s′_1) and
p_2(s′_2|s_2, a_2), and the fact that p_d(·|s_d, a_d) is a probability simplex.

Therefore, we have Q^{π,(h+1)}_M(s, a) = Q^{π_1,(h+1)}_{M_1}(s_1, a_1) + Q^{π_2,(h+1)}_{M_2}(s_2, a_2) as desired, where
Q^{π_d,(h+1)}_{M_d}(s_d, a_d) = r_d(s_d, a_d) + γ Σ_{s′_d} p_d(s′_d|s_d, a_d) Σ_{a′_d} π_d(a′_d|s′_d) Q^{π_d,(h)}_{M_d}(s′_d, a′_d).

By mathematical induction, this decomposition holds for any h-step Q-function. Letting h → ∞
shows that this holds for the full Q-function.
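Proposition 7 can also be verified numerically on small random instances. The sketch below builds a fully factored MDP and a factored policy from two random components (Definitions 1 and 2), performs exact policy evaluation on both sides, and checks equality; the sizes, seed, and helper names are illustrative.

```python
import numpy as np

def policy_eval(P, R, pi, gamma=0.9):
    """Exact policy evaluation. P: [S, A, S], R: [S, A], pi: [S, A]; returns Q: [S, A]."""
    S = P.shape[0]
    P_pi = np.einsum('sat,sa->st', P, pi)   # state-to-state kernel under pi
    r_pi = np.einsum('sa,sa->s', R, pi)     # expected one-step reward under pi
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
    return R + gamma * np.einsum('sat,t->sa', P, V)

def random_mdp(S, A, rng):
    P = rng.random((S, A, S)); P /= P.sum(-1, keepdims=True)
    pi = rng.random((S, A));   pi /= pi.sum(-1, keepdims=True)
    return P, rng.random((S, A)), pi

rng = np.random.default_rng(0)
S1, A1, S2, A2 = 3, 2, 4, 2
P1, R1, pi1 = random_mdp(S1, A1, rng)
P2, R2, pi2 = random_mdp(S2, A2, rng)

# Fully factored MDP and factored policy, with joint indices s = s1*S2 + s2 and a = a1*A2 + a2.
P  = np.einsum('iaj,kbl->ikabjl', P1, P2).reshape(S1 * S2, A1 * A2, S1 * S2)
R  = (R1[:, None, :, None] + R2[None, :, None, :]).reshape(S1 * S2, A1 * A2)
pi = (pi1[:, None, :, None] * pi2[None, :, None, :]).reshape(S1 * S2, A1 * A2)

Q_joint = policy_eval(P, R, pi)
Q1, Q2 = policy_eval(P1, R1, pi1), policy_eval(P2, R2, pi2)
Q_sum = (Q1[:, None, :, None] + Q2[None, :, None, :]).reshape(S1 * S2, A1 * A2)
print(np.allclose(Q_joint, Q_sum))   # True: Q_M^pi = Q_M1^pi1 + Q_M2^pi2
```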
Proof of Theorem 1. Without loss of generality, we consider the setting with D = 2 so A = A1 ×A2 ;
extension to D > 2 is straightforward. The proof is based on mathematical induction on a sequence
Ph
of h-step Q-functions of π denoted by Q(h) (s, a) = E[ t=1 γ t−1 rt |s1 = s, a1 = a, at ∼ π].
Base case. For h = 1, the one-step Q-function is simply the reward, which by assumption r(s, a) =
(1)
r1 (z1 , a1 ) + r2 (z2 , a2 ). We can trivially set qd (zd , ad ) = rd (zd , ad ) such that Q(1) (s, a) =
(1) (1)
q1 (z1 , a1 ) + q2 (z2 , a2 ).
18
(h) (h)
Inductive step. Suppose Q(h) (s, a) = q1 (z1 , a1 ) + q2 (z2 , a2 ) holds. We can express Q(h+1)
in terms of Q(h) using the Bellman equation:
X
Q(h+1) (s, a) = r(s, a) +γ p(s0 |s, a)V (h) (s0 )
s0
| {z }
1 | {z }
2
X
where V (h) (s0 ) = π(a0 |s0 )Q(h) (s0 , a0 ).
a0
1 can be written as a sum r(s, a) = r1 (z1 , a1 ) + r2 (z2 , a2 ) where each summand depends on
only either a1 or a2 but not both. Next we show 2 also decomposes in a similar manner.
For a given s we have:
X
V (h) (s) = π(a|s)Q(h) (s, a)
a
(h) (h)
X
= π1 (a1 |z1 )π2 (a2 |z2 ) q1 (z1 , a1 ) + q2 (z2 , a2 )
a1 ,a2
P
|z: 1 P (h)
P
|z: 1 P (h)
= a π 2
(a2 2 ) a π 1 (a1 |z1 )q 1 (z 1 , a1 ) + a π
1
(a1 1 ) a2 π2 (a2 |z2 )q2 (z2 , a2 )
2 1 1
(h) (h)
X X
= π1 (a1 |z1 )q1 (z1 , a1 ) + π2 (a2 |z2 )q2 (z2 , a2 ) ,
a1 a2
where we used the property that $\pi_1(a_1|z_1)\, q_1^{(h)}(z_1,a_1)$ is independent of $\pi_2(a_2|z_2)$ (and vice versa), and that $\pi_d(\cdot|z_d)$ is a probability simplex. Letting $v_d^{(h)}(z_d) = \sum_{a_d} \pi_d(a_d|z_d)\, q_d^{(h)}(z_d,a_d)$, we can write $V^{(h)}(s') = v_1^{(h)}(z_1') + v_2^{(h)}(z_2')$.
Substituting into term (ii), we have:
$$\begin{aligned}
\sum_{s'} p(s'|s,a)\, V^{(h)}(s')
&= \sum_{z'} \sum_{\tilde{s} \in \phi^{-1}(z')} p(\tilde{s}|s,a)\, V^{(h)}(\tilde{s}) \\
&= \sum_{z'} \Bigg( \sum_{\tilde{s} \in \phi^{-1}(z')} p(\tilde{s}|s,a) \Bigg) \left( v_1^{(h)}(z_1') + v_2^{(h)}(z_2') \right) \\
&= \sum_{z_1', z_2'} p_1(z_1'|z_1,a_1)\, p_2(z_2'|z_2,a_2) \left( v_1^{(h)}(z_1') + v_2^{(h)}(z_2') \right) \\
&= \underbrace{\sum_{z_2'} p_2(z_2'|z_2,a_2)}_{=\,1} \sum_{z_1'} p_1(z_1'|z_1,a_1)\, v_1^{(h)}(z_1') + \underbrace{\sum_{z_1'} p_1(z_1'|z_1,a_1)}_{=\,1} \sum_{z_2'} p_2(z_2'|z_2,a_2)\, v_2^{(h)}(z_2') \\
&= \sum_{z_1'} p_1(z_1'|z_1,a_1)\, v_1^{(h)}(z_1') + \sum_{z_2'} p_2(z_2'|z_2,a_2)\, v_2^{(h)}(z_2'),
\end{aligned}$$
where on the first line we used the property of state abstractions to replace the index of summation; on the second line we used the fact that all $\tilde{s} \in \phi^{-1}(z')$ share the same abstract state vector $z'$ and hence the same value $V^{(h)}(\tilde{s}) = v_1^{(h)}(z_1') + v_2^{(h)}(z_2')$, which allows us to sum their transition probabilities directly; on the third line we substituted Eqn. (2); and the final steps use the same independence property as above together with the fact that $p_d(\cdot|z_d,a_d)$ is a probability simplex.
Therefore, we have $Q^{(h+1)}(s,a) = q_1^{(h+1)}(z_1,a_1) + q_2^{(h+1)}(z_2,a_2)$ as desired, where
$$q_d^{(h+1)}(z_d,a_d) = r_d(z_d,a_d) + \gamma \sum_{z_d'} p_d(z_d'|z_d,a_d) \sum_{a_d'} \pi_d(a_d'|z_d')\, q_d^{(h)}(z_d',a_d').$$
By mathematical induction, this decomposition holds for any h-step Q-function. Letting h → ∞
shows that this holds for the full Q-function.
B.3 Policy Learning with Bias - Performance Bounds
Consider a particular model-based procedure for approximating the optimal Q-function using Eqn. (1): i) find approximations $\hat{M} = (\hat{p}, \hat{r})$ that are close to the true transition/reward functions $p, r$, such that there exists some state abstraction set $\phi$ with $\hat{p}, \hat{r}$ satisfying Eqns. (2) and (3) with respect to $\phi$; ii) perform planning (e.g., dynamic programming) using the approximate MDP parameters $\hat{p}$ and $\hat{r}$. We can show the following performance bounds; note that these upper bounds are loose and information-theoretic (in that they require knowledge of the implicit factorization).
Proposition 8. If the approximation errors in $\hat{p}$ and $\hat{r}$ are upper bounded by $\epsilon_p$ and $\epsilon_r$ for all $s \in \mathcal{S}, a \in \mathcal{A}$:
$$\sum_{s'} \big| p(s'|s,a) - \hat{p}(s'|s,a) \big| \leq \epsilon_p, \qquad \big| r(s,a) - \hat{r}(s,a) \big| \leq \epsilon_r,$$
then the above model-based procedure leads to an approximate Q-function $\hat{Q}$ and an approximate policy $\hat{\pi}$ that satisfy:
$$\| Q^*_M - Q^*_{\hat{M}} \|_\infty \leq \frac{\epsilon_r}{1-\gamma} + \frac{\gamma \epsilon_p R_{\max}}{2(1-\gamma)^2}, \qquad \| V^*_M - V^{\hat{\pi}}_M \|_\infty \leq \frac{2\epsilon_r}{1-\gamma} + \frac{\gamma \epsilon_p R_{\max}}{(1-\gamma)^2}.$$
Proof. See classical results by Singh and Yee [66] and Kearns and Singh [67] (the simulation lemma).
To help understand how the linear parameterization of the Q-function in Eqn. (1) affects the representation power of the function class, we first define the following matrices for action space featurization.
Definition 3. The sub-action mapping matrix for sub-action space $\mathcal{A}_d$ is defined as
$$\Psi_d = \begin{bmatrix} \text{---}\; \psi_d(a^1)^\top \;\text{---} \\ \vdots \\ \text{---}\; \psi_d(a^{|\mathcal{A}|})^\top \;\text{---} \end{bmatrix} \in \{0,1\}^{|\mathcal{A}| \times |\mathcal{A}_d|},$$
where each row $\psi_d(a^i)^\top \in \{0,1\}^{1 \times |\mathcal{A}_d|}$ is a one-hot vector with a value of 1 in column $\mathrm{proj}_{\mathcal{A} \to \mathcal{A}_d}(a^i)$.
Remark. The $i$-th row of $\Psi_d$ corresponds to an action $a^i \in \mathcal{A}$, and the $j$-th column corresponds to a particular element of the sub-action space $a_d^j \in \mathcal{A}_d$. The $(i,j)$-entry of $\Psi_d$ is 1 if and only if the projection of $a^i$ onto the sub-action space $\mathcal{A}_d$ is $a_d^j$. Since each row is a one-hot vector, the sum of elements in each row is exactly 1, i.e., $\psi_d(a^i)^\top \mathbf{1} = 1$.
Definition 4. The sub-action mapping matrix, $\Psi$, is defined by a horizontal concatenation of $\Psi_d$ for $d = 1 \ldots D$:
$$\Psi = \big[\, \Psi_1 \;\cdots\; \Psi_D \,\big] \in \{0,1\}^{|\mathcal{A}| \times (\sum_d |\mathcal{A}_d|)}.$$
Remark. $\Psi$ describes how to map each action $a^i \in \mathcal{A}$ to its corresponding sub-actions. Therefore, the sum of elements in each row is exactly $D$, the number of sub-action spaces: $\psi(a^i)^\top \mathbf{1} = D$.
Definition 5. The condensed sub-action mapping matrix, $\tilde{\Psi}$, is
$$\tilde{\Psi} = \big[\, \mathbf{1} \;\; \tilde{\Psi}_1 \;\cdots\; \tilde{\Psi}_D \,\big] \in \{0,1\}^{|\mathcal{A}| \times (1 + \sum_d (|\mathcal{A}_d| - 1))},$$
where the first column contains all 1's, and $\tilde{\Psi}_d$ denotes $\Psi_d$ with the first column removed.
Proposition 9. $\mathrm{colspace}(\Psi) = \mathrm{colspace}(\tilde{\Psi})$ and $\mathrm{rank}(\Psi) = \mathrm{rank}(\tilde{\Psi}) = \mathrm{ncols}(\tilde{\Psi})$ (i.e., the matrix $\tilde{\Psi}$ has full column rank). Consequently, $\Psi\Psi^+ = \tilde{\Psi}\tilde{\Psi}^+$.
Corollary 10. Suppose the Q-function $Q$ of a policy $\pi$ at state $s$ is linearly decomposable with respect to the sub-actions, i.e., we can write $Q(s,a) = \sum_{d=1}^{D} q_d(s, a_d)$ for all $a_d \in \mathcal{A}_d$. Then there exist $w$ and $\tilde{w}$ such that the column vector containing the Q-values for all actions at state $s$ can be expressed as $Q(s, \mathcal{A}) = \Psi w = \tilde{\Psi}\tilde{w}$. In other words, Eqn. (1) is equivalent to $Q(s, \mathcal{A}) \in \mathrm{colspace}(\tilde{\Psi})$.
Corollary 11. Suppose $Q(s, \mathcal{A}) \notin \mathrm{colspace}(\tilde{\Psi})$. Let $\hat{w} = \Psi^+ Q(s, \mathcal{A})$ and $\hat{\tilde{w}} = \tilde{\Psi}^+ Q(s, \mathcal{A})$ be the least-squares solutions of the respective linear equations. Then $\Psi\hat{w} = \tilde{\Psi}\hat{\tilde{w}}$.
Remark. Corollaries 10 and 11 imply there are two possible implementations, regardless of whether the true Q-function can be represented by the linear parameterization. Intuitively, both versions project the true Q-value vector $Q(s, \mathcal{A})$ for a particular state $s$ onto the subspace spanned by the columns of $\Psi$ or $\tilde{\Psi}$. Since the two matrices have the same column space, the results of the projections are equal. This does not imply that $\hat{w}$ and $\hat{\tilde{w}}$ are equal (they cannot be, as they have different dimensions), but rather that the resultant Q-value estimates are equal: $\hat{Q}(s, \mathcal{A}) = \Psi\hat{w} = \tilde{\Psi}\hat{\tilde{w}}$.
To make the theorem statements more concrete, we inspect a simple numerical example and verify
the theoretical properties.
Example 3. Consider an MDP with $\mathcal{A} = \mathcal{A}_1 \times \mathcal{A}_2$, where $\mathcal{A}_1 = \{0,1\}$ and $\mathcal{A}_2 = \{0,1\}$. Consequently, $|\mathcal{A}_1| = |\mathcal{A}_2| = 2$ and $|\mathcal{A}| = 2^2 = 4$.

Suppose for state $s$ we can write $Q(s,a) = Q(s, [a_1, a_2]) = q_1(s, a_1) + q_2(s, a_2)$ for all $a_1 \in \mathcal{A}_1, a_2 \in \mathcal{A}_2$. Then
$$Q(s, \mathcal{A}) = \begin{bmatrix} Q(s, a_1{=}0, a_2{=}0) \\ Q(s, a_1{=}0, a_2{=}1) \\ Q(s, a_1{=}1, a_2{=}0) \\ Q(s, a_1{=}1, a_2{=}1) \end{bmatrix} = \begin{bmatrix} q_1(s,0) + q_2(s,0) \\ q_1(s,0) + q_2(s,1) \\ q_1(s,1) + q_2(s,0) \\ q_1(s,1) + q_2(s,1) \end{bmatrix} = \begin{bmatrix} 1 & 0 & 1 & 0 \\ 1 & 0 & 0 & 1 \\ 0 & 1 & 1 & 0 \\ 0 & 1 & 0 & 1 \end{bmatrix} \begin{bmatrix} q_1(s,0) \\ q_1(s,1) \\ q_2(s,0) \\ q_2(s,1) \end{bmatrix} = \Psi w,$$
where $\Psi = [\Psi_1 \;\; \Psi_2]$ and $w = \begin{bmatrix} w_1 \\ w_2 \end{bmatrix}$ with $w_d = [q_d(s,0),\, q_d(s,1)]^\top$. We can also write
$$Q(s, \mathcal{A}) = \tilde{\Psi}\tilde{w}, \quad \text{where } \tilde{\Psi} = \begin{bmatrix} 1 & 0 & 0 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \\ 1 & 1 & 1 \end{bmatrix}, \quad \tilde{w} = \begin{bmatrix} v_0(s) \\ u_1(s) \\ u_2(s) \end{bmatrix} = \begin{bmatrix} q_1(s,0) + q_2(s,0) \\ q_1(s,1) - q_1(s,0) \\ q_2(s,1) - q_2(s,0) \end{bmatrix}.$$
One can verify that $\mathrm{rank}(\Psi) = \mathrm{rank}(\tilde{\Psi}) = 3$ and $\mathrm{colspace}(\Psi) = \mathrm{colspace}(\tilde{\Psi})$, because the columns of $\tilde{\Psi}$ are linearly independent, but the columns of $\Psi$ are not:
$$\begin{bmatrix} 1 \\ 1 \\ 0 \\ 0 \end{bmatrix} + \begin{bmatrix} 0 \\ 0 \\ 1 \\ 1 \end{bmatrix} - \begin{bmatrix} 1 \\ 0 \\ 1 \\ 0 \end{bmatrix} = \begin{bmatrix} 0 \\ 1 \\ 0 \\ 1 \end{bmatrix}.$$
Furthermore,
$$\Psi^+ = \begin{bmatrix} 3/8 & 3/8 & -1/8 & -1/8 \\ -1/8 & -1/8 & 3/8 & 3/8 \\ 3/8 & -1/8 & 3/8 & -1/8 \\ -1/8 & 3/8 & -1/8 & 3/8 \end{bmatrix}, \qquad \tilde{\Psi}^+ = \begin{bmatrix} 3/4 & 1/4 & 1/4 & -1/4 \\ -1/2 & -1/2 & 1/2 & 1/2 \\ -1/2 & 1/2 & -1/2 & 1/2 \end{bmatrix},$$
and
$$\Psi\Psi^+ = \tilde{\Psi}\tilde{\Psi}^+ = \begin{bmatrix} 3/4 & 1/4 & 1/4 & -1/4 \\ 1/4 & 3/4 & -1/4 & 1/4 \\ 1/4 & -1/4 & 3/4 & 1/4 \\ -1/4 & 1/4 & 1/4 & 3/4 \end{bmatrix}.$$
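This example can also be checked mechanically; the short numpy sketch below (ours, purely illustrative) verifies the ranks, the shared column space, and the projection identities of Proposition 9 and Corollaries 10 and 11.

```python
import numpy as np

# Psi = [Psi_1, Psi_2]; rows ordered as (a1, a2) = (0,0), (0,1), (1,0), (1,1).
Psi = np.array([[1, 0, 1, 0],
                [1, 0, 0, 1],
                [0, 1, 1, 0],
                [0, 1, 0, 1]], dtype=float)
# Psi_tilde = [1 | Psi_1, Psi_2 with their first columns removed].
Psi_t = np.array([[1, 0, 0],
                  [1, 0, 1],
                  [1, 1, 0],
                  [1, 1, 1]], dtype=float)

print(np.linalg.matrix_rank(Psi), np.linalg.matrix_rank(Psi_t))   # 3 3

P = Psi @ np.linalg.pinv(Psi)        # orthogonal projector onto colspace(Psi)
P_t = Psi_t @ np.linalg.pinv(Psi_t)  # orthogonal projector onto colspace(Psi_tilde)
print(np.allclose(P, P_t))           # True (Proposition 9)

# A decomposable Q-vector lies in the shared column space (Corollary 10) ...
q1, q2 = np.array([0.3, 1.2]), np.array([-0.5, 0.8])
Q_dec = np.array([q1[i] + q2[j] for i in (0, 1) for j in (0, 1)])
print(np.allclose(P @ Q_dec, Q_dec))                              # True

# ... while adding an interaction term still leaves the two least-squares fits equal (Corollary 11).
Q_gen = Q_dec + np.array([0.0, 0.0, 0.0, 1.0])
w_hat = np.linalg.pinv(Psi) @ Q_gen
wt_hat = np.linalg.pinv(Psi_t) @ Q_gen
print(np.allclose(Psi @ w_hat, Psi_t @ wt_hat))                   # True
```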
Proof of Proposition 9. First note that $\Psi$ is a tall matrix for non-trivial cases, with more rows than columns, because $|\mathcal{A}| = \prod_d |\mathcal{A}_d| \geq \sum_d |\mathcal{A}_d|$ if $|\mathcal{A}_d| \geq 2$ for all $d$ (this can be verified by induction on $D$). Therefore, the rank of $\Psi$ is the number of linearly independent columns of $\Psi$.
We use the following notation to write matrix $\Psi_d$ in terms of its columns:
$$\Psi_d = \begin{bmatrix} | & & | \\ c_{d,1} & \cdots & c_{d,|\mathcal{A}_d|} \\ | & & | \end{bmatrix}.$$
Claim 1: The columns of $\Psi_d$ are pairwise orthogonal, $c_{d,j}^\top c_{d,j'} = 0$ for all $j \neq j'$, and they form an orthogonal basis. This is because each row $\psi_d(a^i)^\top$ is a one-hot vector, containing only one 1; this implies that out of the two entries in row $i$ of $c_{d,j}$ and $c_{d,j'}$, at least one entry is 0, and their product must be 0.

Claim 2: The sum of entries in each row of $\Psi_d$ is 1, and $\sum_{j=1}^{|\mathcal{A}_d|} c_{d,j} = \mathbf{1}$, a column vector of 1's with matching size. This is a direct consequence of each row $\psi_d(a^i)^\top$ being a one-hot vector. In other words, $\mathbf{1} \in \mathrm{colspace}(\Psi_d)$.

Claim 3: The columns of $\Psi$ are not linearly independent. This is because there is not a unique way to write $\mathbf{1}$ as a linear combination of the columns of $\Psi$. For example, $\sum_{j=1}^{|\mathcal{A}_d|} c_{d,j} = \sum_{j=1}^{|\mathcal{A}_{d'}|} c_{d',j} = \mathbf{1}$ for some $d' \neq d$, where we used the columns of $\Psi_d$ and $\Psi_{d'}$.

Claim 4: $\mathbf{1} \notin \mathrm{colspace}(\tilde{\Psi}_1 \cdots \tilde{\Psi}_D)$ because the first entry of every column vector in any $\tilde{\Psi}_d$ is 0 and no linear combination of them can result in a 1. Consequently, $\mathbf{1} \notin \mathrm{colspace}(\tilde{\Psi}_d)$ for any $d$.

Claim 5: $c_{d,1} \notin \mathrm{colspace}(\mathbf{1},\, \tilde{\Psi}_{d'} : d' \neq d)$, where $c_{d,1}$ is the column removed from $\Psi_d$ to construct $\tilde{\Psi}_d$. This can also be seen from the first entry of the column vector: the first entry of $c_{d,1}$ is 1, and all columns of $\tilde{\Psi}_{d'}$, $d' \neq d$, have a first entry of 0.

Claim 6: $c_{d,j} \notin \mathrm{colspace}(\mathbf{1},\, \tilde{\Psi}_1 \cdots \tilde{\Psi}_D \setminus \{c_{d,j}\})$ for $j > 1$. By expressing $c_{d,j} = \big(\mathbf{1} - \sum_{j'=2, j' \neq j}^{|\mathcal{A}_d|} c_{d,j'}\big) + (-c_{d,1})$, we observe that the first part of the sum lies in the column space, while the second part does not (from the previous claim, $c_{d,1}$ is not in the column space of $\tilde{\Psi}_{d'}$ for $d' \neq d$; within $\tilde{\Psi}_d$, the only way to obtain $c_{d,1}$ is $c_{d,1} = \mathbf{1} - \sum_{j'=2}^{|\mathcal{A}_d|} c_{d,j'}$, and we have excluded one of the columns, $c_{d,j}$, from the column space).
Combining these claims implies that each column of $\tilde{\Psi}$ cannot be expressed as a linear combination of the other columns, and thus $\tilde{\Psi}$ has full column rank: $\mathrm{rank}(\tilde{\Psi}) = \mathrm{ncols}(\tilde{\Psi}) = 1 + \sum_{d=1}^{D} (|\mathcal{A}_d| - 1)$. It follows that the columns of $\tilde{\Psi}$ form a linearly independent set spanning the same space as the columns of $\Psi$ (each column of $\Psi$ is recoverable from $\mathbf{1}$ and the columns of the $\tilde{\Psi}_d$, and vice versa), so the two column spaces and ranks are equal.
$\Psi\Psi^+$ and $\tilde{\Psi}\tilde{\Psi}^+$ are orthogonal projection matrices onto the column spaces of $\Psi$ and $\tilde{\Psi}$, respectively. Since $\mathrm{colspace}(\Psi) = \mathrm{colspace}(\tilde{\Psi})$, it follows that $\Psi\Psi^+ = \tilde{\Psi}\tilde{\Psi}^+$.
Consider the matrix form of the Bellman equation (cf. Sec. 2 of Lagoudakis and Parr [68]):
$$Q = R + \gamma P^\pi Q,$$
where $Q \in \mathbb{R}^{|\mathcal{S}||\mathcal{A}|}$ is a vector containing the Q-values for all state-action pairs, $R \in \mathbb{R}^{|\mathcal{S}||\mathcal{A}|}$, and $P^\pi \in \mathbb{R}^{|\mathcal{S}||\mathcal{A}| \times |\mathcal{S}||\mathcal{A}|}$ is the $(s,a)$-transition matrix induced by the MDP and policy $\pi$. Solving this equation gives us the Q-function in closed form:
$$Q = (I - \gamma P^\pi)^{-1} R, \tag{5}$$
where $I \in \mathbb{R}^{|\mathcal{S}||\mathcal{A}| \times |\mathcal{S}||\mathcal{A}|}$.
To derive a necessary condition, we start by assuming that the Q-function is representable by the linear parameterization, i.e., there exists $W \in \mathbb{R}^{(\sum_{d=1}^{D} |\mathcal{A}_d|) \times |\mathcal{S}|}$ such that $\mathrm{vec}^{-1}_{|\mathcal{A}| \times |\mathcal{S}|}(Q) = \Psi W$. Here, $\mathrm{vec}^{-1}_{|\mathcal{A}| \times |\mathcal{S}|}$ is the inverse vectorization operator that reshapes the vector of all Q-values into a matrix of size $|\mathcal{A}| \times |\mathcal{S}|$, and $\Psi \in \{0,1\}^{|\mathcal{A}| \times (\sum_{d=1}^{D} |\mathcal{A}_d|)}$ is defined in Appendix B.4. Substituting Eqn. (5) into the premise gives the necessary condition: there must exist $W \in \mathbb{R}^{(\sum_{d=1}^{D} |\mathcal{A}_d|) \times |\mathcal{S}|}$ such that
$$\mathrm{vec}^{-1}_{|\mathcal{A}| \times |\mathcal{S}|}\big( (I - \gamma P^\pi)^{-1} R \big) = \Psi W.$$
Unfortunately, unlike the sufficient conditions in Theorem 1 (and Proposition 7), this necessary condition is not as clean and is likely not verifiable in most settings. The matrix inverse and the $\mathrm{vec}^{-1}$ reshaping operation make it challenging to further manipulate the expression. This highlights the non-trivial nature of the problem.
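That said, for small tabular MDPs the condition can be checked numerically: compute Q in closed form via Eqn. (5), reshape it, and test whether each column lies in colspace(Ψ) using the projector ΨΨ+. The sketch below is ours; the function and variable names are illustrative.

```python
import numpy as np

def q_closed_form(p, r, pi, gamma):
    """Q = (I - gamma * P^pi)^{-1} R, stacked over (s, a) pairs in s-major order."""
    S, A = r.shape
    # P^pi[(s, a), (s', a')] = p(s'|s, a) * pi(a'|s')
    P_pi = (p[:, :, :, None] * pi[None, None, :, :]).reshape(S * A, S * A)
    return np.linalg.solve(np.eye(S * A) - gamma * P_pi, r.reshape(S * A))

def is_factorable(Q, S, A, Psi, tol=1e-8):
    """Check whether vec^{-1}(Q) = Psi W for some W, i.e., whether each column of the
    |A| x |S| Q-matrix lies in colspace(Psi)."""
    Q_mat = Q.reshape(S, A).T                  # |A| x |S|; column s holds Q(s, .)
    proj = Psi @ np.linalg.pinv(Psi)           # orthogonal projector onto colspace(Psi)
    return float(np.max(np.abs(proj @ Q_mat - Q_mat))) < tol
```

On a product MDP with a factored policy (e.g., the one constructed in the sketch after the proof of Proposition 7, using the 4×4 Ψ of Example 3), is_factorable returns True, as Corollary 10 predicts; perturbing a single reward entry typically makes it return False.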
Proof for Proposition 5. For the sake of argument, we consider the one-timestep bandit setting; extension to the sequential setting can be derived following Chen and Jiang [62] and Duan et al. [17]. Let the true generative model be $Q^* = \Psi r + \psi_{\text{Interact}} r_{\text{Interact}}$ (details in Appendix B.8). We formally show the reduction in the variance of the estimators by comparing the lower bounds of their respective empirical Rademacher complexities; a smaller Rademacher complexity translates into lower-variance estimators.

Suppose we obtain a sample of $m$ actions and apply the linear approximation. Our approach for the factored action space corresponds to the matrix $X \in \{0,1\}^{m \times (\sum_d |\mathcal{A}_d|)}$, obtained by stacking the corresponding rows of $\Psi$ (recall Definition 4). The complete, combinatorial action space corresponds to the matrix $X' = [X, x_{\text{Interact}}] \in \{0,1\}^{m \times (1 + \sum_d |\mathcal{A}_d|)}$, obtained by adding the corresponding rows of $\psi_{\text{Interact}}$. By definition, $\|X\|_{p,q} < \|X'\|_{p,q}$, since the former drops a column with non-zero norm that exists in the latter.

Consider the following two function families, for the factored action space and the complete action space respectively:
$$\mathcal{F}_F = \{ f = w_F^\top x : \|w_F\|_2 \leq A \}, \qquad \mathcal{F}_C = \{ f = w_C^\top x' : \|w_C\|_2 \leq A \},$$
for some $A > 0$. A straightforward application of Theorem 2 of Awasthi et al. [69] shows that the lower bound on the Rademacher complexity of the factored action space is smaller than that of the complete action space, which completes our argument.
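The proposition is about Rademacher complexity; as a complementary, purely empirical illustration of the variance effect it implies, the following Monte Carlo sketch (our own setup, not part of the formal argument) fits both linear models on repeated small samples from the 2×2 bandit of Appendices B.7–B.8 and compares the spread of their value estimates.

```python
import numpy as np

rng = np.random.default_rng(0)
# 2x2 bandit designs: factored X (4 feature columns) vs. complete X' (adds the interaction).
X = np.array([[1, 0, 1, 0],
              [1, 0, 0, 1],
              [0, 1, 1, 0],
              [0, 1, 0, 1]], dtype=float)
X_full = np.hstack([X, np.array([[0.0], [0.0], [0.0], [1.0]])])
Q_true = np.array([0.0, 0.5, 1.0, 1.75])                 # alpha = 0.5, beta = 0.25 (illustrative)

def fit_predict(design, actions, rewards):
    w = np.linalg.lstsq(design[actions], rewards, rcond=None)[0]
    return design @ w                                     # predicted values for all 4 arms

m, noise, reps = 20, 1.0, 2000
preds_f, preds_c = [], []
for _ in range(reps):
    a = rng.integers(0, 4, size=m)                        # m logged actions
    y = Q_true[a] + noise * rng.standard_normal(m)        # noisy observed rewards
    preds_f.append(fit_predict(X, a, y))
    preds_c.append(fit_predict(X_full, a, y))
print(np.var(preds_f, axis=0).mean(), np.var(preds_c, axis=0).mean())
# The factored model's estimates vary less across resampled datasets (lower variance),
# at the cost of the omitted-interaction bias quantified in Appendices B.7-B.8 below.
```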
B.7 Standardization of Rewards for the Bandit Setting (Proposition 6)
Suppose the rewards of the four arms are [R0,0 , R0,1 , R1,0 , R1,1 ]. We can apply the following trans-
formations to reduce any reward function to the form of [0, α, 1, 1 + α + β], and these transformations
do not affect the least-squares solution:
• If R0,0 = R1,0 and R0,1 = R1,1 , we can ignore x-axis sub-action as setting it to either 0
(←) or 1 (→) does not affect the reward. Similarly, if R0,0 = R0,1 and R1,0 = R1,1 , we
can ignore y-axis sub-action. In both cases, this reduces to a one-dimensional action space
which we do not discuss further.
• Now at least one of the following is false: R0,0 = R1,0 or R0,1 = R1,1 . If R0,0 6= R1,0 ,
skip this step. Otherwise, it must be that R0,0 = R1,0 and R0,1 6= R1,1 . Swap the role of
down vs. up such that the new R0,0 6= R1,0 .
• If R0,0 < R1,0 , skip this step. Otherwise it must be that R0,0 > R1,0 . Swap the role of left
vs. right so that R0,0 < R1,0 .
• If R0,0 6= 0, subtract R0,0 from all rewards so that the new R0,0 = 0.
• Now R1,0 > R0,0 = 0, so R1,0 must be positive. If R1,0 ≠ 1, divide all rewards by R1,0 so that the new R1,0 = 1.
• Lastly, we should have R0,0 = 0 and R1,0 = 1. Set α = R0,1 and β = R1,1 − R1,0 − R0,1 .
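These steps amount to a handful of swaps and affine rescalings; a direct transcription in Python (ours, not from the paper's released code) looks as follows.

```python
def standardize_rewards(R00, R01, R10, R11):
    """Reduce a 2x2 bandit reward table R[x][y] (x: left/right, y: down/up) to the
    canonical form [0, alpha, 1, 1 + alpha + beta] via the steps listed above.
    Returns (alpha, beta), or None for the degenerate (effectively 1-D) cases."""
    if (R00 == R10 and R01 == R11) or (R00 == R01 and R10 == R11):
        return None                                        # one sub-action is irrelevant
    if R00 == R10:                                         # swap the roles of down and up
        R00, R01, R10, R11 = R01, R00, R11, R10
    if R00 > R10:                                          # swap the roles of left and right
        R00, R01, R10, R11 = R10, R11, R00, R01
    R01, R10, R11 = R01 - R00, R10 - R00, R11 - R00        # shift so that R00 = 0
    R00 = 0.0
    R01, R11 = R01 / R10, R11 / R10                        # rescale so that R10 = 1
    R10 = 1.0
    return R01, R11 - R10 - R01                            # (alpha, beta)
```

For instance, standardize_rewards(0.0, 0.5, 1.0, 1.75) returns (0.5, 0.25), i.e., alpha = 0.5 and beta = 0.25.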
Written as a generative model for the Q-values of the four arms,
$$\begin{bmatrix} Q^*(\swarrow) \\ Q^*(\nwarrow) \\ Q^*(\searrow) \\ Q^*(\nearrow) \end{bmatrix} = \begin{bmatrix} 1 & 0 & 1 & 0 \\ 1 & 0 & 0 & 1 \\ 0 & 1 & 1 & 0 \\ 0 & 1 & 0 & 1 \end{bmatrix} \begin{bmatrix} r_{\text{Left}} \\ r_{\text{Right}} \\ r_{\text{Down}} \\ r_{\text{Up}} \end{bmatrix} + \begin{bmatrix} 0 \\ 0 \\ 0 \\ 1 \end{bmatrix} r_{\text{Interact}}, \qquad \text{i.e.,}\quad Q^* = \Psi r + \psi_{\text{Interact}}\, r_{\text{Interact}}.$$
Here, $r_{\text{Left}}, r_{\text{Right}}, r_{\text{Down}}, r_{\text{Up}}, r_{\text{Interact}}$ are parameters of the generative model. Note that the matrix $[\Psi, \psi_{\text{Interact}}]$ has a column space of $\mathbb{R}^4$, i.e., this generative model captures every possible reward configuration of the four actions.
Applying our proposed linear approximation translates to "dropping" the interaction parameter, $r_{\text{Interact}}$, and estimating the remaining four parameters. This leads to a form of omitted-variable bias, which can be computed as:
$$\Psi^+ \psi_{\text{Interact}}\, r_{\text{Interact}} = \begin{bmatrix} 1 & 0 & 1 & 0 \\ 1 & 0 & 0 & 1 \\ 0 & 1 & 1 & 0 \\ 0 & 1 & 0 & 1 \end{bmatrix}^+ \begin{bmatrix} 0 \\ 0 \\ 0 \\ 1 \end{bmatrix} r_{\text{Interact}} = \begin{bmatrix} 3/8 & 3/8 & -1/8 & -1/8 \\ -1/8 & -1/8 & 3/8 & 3/8 \\ 3/8 & -1/8 & 3/8 & -1/8 \\ -1/8 & 3/8 & -1/8 & 3/8 \end{bmatrix} \begin{bmatrix} 0 \\ 0 \\ 0 \\ 1 \end{bmatrix} r_{\text{Interact}} = \begin{bmatrix} -1/8 \\ 3/8 \\ -1/8 \\ 3/8 \end{bmatrix} r_{\text{Interact}}.$$
The biased estimates of the four parameters are:
$$\hat{r} = r + \Psi^+ \psi_{\text{Interact}}\, r_{\text{Interact}}, \qquad \begin{bmatrix} \hat{r}_{\text{Left}} \\ \hat{r}_{\text{Right}} \\ \hat{r}_{\text{Down}} \\ \hat{r}_{\text{Up}} \end{bmatrix} = \begin{bmatrix} r_{\text{Left}} - \tfrac{1}{8} r_{\text{Interact}} \\ r_{\text{Right}} + \tfrac{3}{8} r_{\text{Interact}} \\ r_{\text{Down}} - \tfrac{1}{8} r_{\text{Interact}} \\ r_{\text{Up}} + \tfrac{3}{8} r_{\text{Interact}} \end{bmatrix}.$$
For the bandit problem in Figure 4a, substituting $r_{\text{Left}} + r_{\text{Down}} = 0$, $r_{\text{Left}} + r_{\text{Up}} = \alpha$, $r_{\text{Right}} + r_{\text{Down}} = 1$, and $r_{\text{Interact}} = \beta$ gives
$$\begin{bmatrix} \hat{Q}(\swarrow) \\ \hat{Q}(\nwarrow) \\ \hat{Q}(\searrow) \\ \hat{Q}(\nearrow) \end{bmatrix} = \begin{bmatrix} -\tfrac{1}{4}\beta \\ \alpha + \tfrac{1}{4}\beta \\ 1 + \tfrac{1}{4}\beta \\ 1 + \alpha + \tfrac{3}{4}\beta \end{bmatrix},$$
which is the solution we presented in Figure 4c.
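The numbers above are easy to reproduce; the following numpy sketch (with illustrative values alpha = 0.5, beta = 1) recomputes the omitted-variable bias direction and the biased Q̂.

```python
import numpy as np

# Rows ordered as the four arms [left-down, left-up, right-down, right-up];
# columns of Psi are [Left, Right, Down, Up]; the interaction feature fires only for right-up.
Psi = np.array([[1, 0, 1, 0],
                [1, 0, 0, 1],
                [0, 1, 1, 0],
                [0, 1, 0, 1]], dtype=float)
psi_int = np.array([0.0, 0.0, 0.0, 1.0])

alpha, beta = 0.5, 1.0                         # illustrative values
# Any r consistent with r_Left + r_Down = 0, r_Left + r_Up = alpha,
# r_Right + r_Down = 1 yields the same Q*; we pick one such r.
r = np.array([0.0, 1.0, 0.0, alpha])           # [Left, Right, Down, Up]
Q_star = Psi @ r + psi_int * beta              # [0, alpha, 1, 1 + alpha + beta]

print(np.linalg.pinv(Psi) @ psi_int)           # [-1/8, 3/8, -1/8, 3/8]: bias direction
Q_hat = Psi @ (np.linalg.pinv(Psi) @ Q_star)   # least-squares fit without the interaction
print(Q_hat)  # [-beta/4, alpha + beta/4, 1 + beta/4, 1 + alpha + 3*beta/4]
```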
When the interaction effect is not negligible and can lead to suboptimal performance, one solution is to explicitly encode the residual interaction terms in the decomposed Q-function by letting $Q(s,a) = \sum_{d=1}^{D} q_d(s, a_d) + R(a)$. The exact parameterization of the residual term $R(a)$ is problem-dependent: one may incorporate the approach of Tavakoli et al. [6] to systematically consider interactions of certain "ranks" (e.g., limiting it to only two-way or three-way interactions), and consider regularizing the magnitude of the residual terms so that we still benefit from the efficiency gains of the linear decomposition.
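One possible realization (our sketch in PyTorch; the paper does not prescribe this exact module) is a Q-head that sums per-sub-action components and adds a learned residual over joint actions, whose magnitude can then be penalized in the training loss.

```python
import itertools
import torch
import torch.nn as nn

class FactoredQWithResidual(nn.Module):
    """Q(s, a) = sum_d q_d(s, a_d) + R(a), where R(a) is a learned residual over
    joint actions whose magnitude can be penalized during training."""

    def __init__(self, state_dim, sub_action_sizes, hidden=256):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        # One linear head per sub-action space, producing q_d(s, .).
        self.heads = nn.ModuleList(nn.Linear(hidden, k) for k in sub_action_sizes)
        # Enumerate all joint actions once, e.g. (0,0), (0,1), (1,0), (1,1).
        self.combos = list(itertools.product(*[range(k) for k in sub_action_sizes]))
        # State-independent residual R(a); it could also be made state-dependent.
        self.residual = nn.Parameter(torch.zeros(len(self.combos)))

    def forward(self, s):
        h = self.trunk(s)
        per_dim = [head(h) for head in self.heads]            # each of shape (B, |A_d|)
        q = torch.stack([sum(per_dim[d][:, a_d] for d, a_d in enumerate(combo))
                         for combo in self.combos], dim=1)    # (B, |A|)
        return q + self.residual                              # broadcast residual over batch

# During training, adding e.g. lam * model.residual.abs().sum() to the TD loss keeps the
# interaction terms small unless the data supports them.
```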
Figure 10: (a) A one-dimensional chain MDP, with an initial state s0 and an absorbing state s1 , and
two actions a = 0 (left) and a = 1 (right). (b) A two-dimensional chain MDP shown together with
the component chains Mx and My . Rewards are denoted in red. Squares indicate absorbing states
whose outgoing transition arrows are omitted. For readability, in the diagram, the states and actions
are laid out following a convention similar to the Cartesian coordinate system so that the bottom left
state has index (0, 0), and right and up both increase the corresponding coordinate by 1.
In this appendix, we discuss the building blocks of the examples used in the main paper and provide
additional examples to support the theoretical properties presented in Section 3.
One-dimensional Chain. First, consider the chain problem depicted in Figure 10a. The agent always
starts in the initial state s0 and can take one of two possible actions: left (a = 0), which leads the
agent to stay at s0 , or right (a = 1), which leads the agent to transition into s1 and receive a reward of
+1. After reaching the absorbing state s1 , both a = 0 and a = 1 lead the agent to stay at s1 with zero
reward. For γ < 1, a (deterministic) optimal policy is π ∗ (s0 ) = 1, and either action can be taken in
s1 . Next, we use this MDP to construct a two-dimensional problem.
Two-dimensional Chain. Following the construction used in Definition 1, we consider an MDP M = Mx × My consisting of two chains (the horizontal chain Mx and the vertical chain My) running in parallel, as shown in Figure 10b. Their corresponding state spaces are Sx = {s0,?, s1,?} and Sy = {s?,0, s?,1}, which indicate the x- and y-coordinates respectively. There are 4 actions from each state, depicted by diagonal arrows {↙, ↖, ↘, ↗}; each action a = [ax, ay] effectively leads the agent to perform ax in Mx and ay in My. For example, taking action ↗ = [→, ↑] from state s0,0 leads the agent to transition into state s1,1 and receive a reward of +2 (the sum of +1 from Mx and +1 from My). For γ < 1, an optimal policy for this MDP is to always move up and right, π∗(·) = ↗ = [→, ↑], regardless of which state the agent is in.
Satisfying the Sufficient Conditions. Let φx : S → Sx and φy : S → Sy be the abstractions. By construction, the transition and reward functions of this MDP satisfy Eqns. (2) and (3). To apply Theorem 1, the policy must satisfy Eqn. (4). In Figure 11, we show three such policies (other policies in this category are omitted due to symmetry and transitions that have the same outcome), together with the true Q-functions (with γ = 0.9) and their decompositions in the form of Eqn. (1).
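These values are straightforward to reproduce; the small numpy sketch below (ours) builds the two chains, forms the product MDP, and recovers the decomposition shown in Figure 11 for the optimal policy. The same code can be rerun with the other policies.

```python
import numpy as np

gamma = 0.9

# One-dimensional chain (Figure 10a): states {0, 1}; actions {0: left, 1: right}.
p1 = np.zeros((2, 2, 2))
p1[0, 0, 0] = 1.0          # left from s0 stays at s0
p1[0, 1, 1] = 1.0          # right from s0 moves to s1 (reward +1)
p1[1, :, 1] = 1.0          # s1 is absorbing
r1 = np.zeros((2, 2)); r1[0, 1] = 1.0
p2, r2 = p1.copy(), r1.copy()            # the vertical chain is identical

# Product MDP (Figure 10b): s = (x, y) flattened as 2*x + y, a = (ax, ay) as 2*ax + ay.
p = np.einsum('iax,jby->ijabxy', p1, p2).reshape(4, 4, 4)
r = (r1[:, None, :, None] + r2[None, :, None, :]).reshape(4, 4)

def q_eval(p, r, pi, gamma, iters=1000):
    q = np.zeros_like(r)
    for _ in range(iters):
        q = r + gamma * p @ (pi * q).sum(axis=-1)
    return q

# Optimal policy: always move right and up (pi_x(right | .) = pi_y(up | .) = 1).
pi_x = np.tile([0.0, 1.0], (2, 1))
pi_y = pi_x.copy()
pi = (pi_x[:, None, :, None] * pi_y[None, :, None, :]).reshape(4, 4)

q = q_eval(p, r, pi, gamma)
qx = q_eval(p1, r1, pi_x, gamma)
qy = q_eval(p2, r2, pi_y, gamma)
q_sum = (qx[:, None, :, None] + qy[None, :, None, :]).reshape(4, 4)

print(q[0])                    # s_{0,0}: [1.8, 1.9, 1.9, 2.0] for actions [↙, ↖, ↘, ↗]
print(np.allclose(q, q_sum))   # True: Q = Qx + Qy, matching Figure 11
```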
Violating the Sufficient Conditions.
• Policy violates Eqn. (4) - Nonzero bias. For this setting, we hold the MDP (transitions
and rewards) unchanged. In Figure 12, we show seven policies that do not satisfy Eqn. (4),
together with the resultant Q-function and the biased linear approximation with the non-zero
approximation error.
• Transition violates Eqn. (2) - Nonzero Bias. Figure 13 shows an example where one
transition has been modified.
• Reward violates Eqn. (3) - Nonzero Bias. Figure 14 shows an example where one reward has been modified.
• Transition violates Eqn. (2), or policy violates Eqn. (4) - Zero Bias. If γ = 0, then the
Q-function is simply the immediate reward, and any conditions on the transition or policy
can be forgone.
• Reward violates Eqn. (3) - Zero Bias. It is possible to construct reward functions adversarially such that r itself does not satisfy the condition, and yet Q can be linearly decomposed. See Figure 15 for an example.
[Figure 11 table (layout not reproduced): for the optimal policy π∗ and two non-optimal policies, each row shows the policy π, the MDP diagram, and the Q-values Qπ decomposed as Qx + Qy. For example, under π∗, Qπ(s0,0, {↙, ↖, ↘, ↗}) = [1.8, 1.9, 1.9, 2.0], with Qx(s0,?, {←, →}) = [0.9, 1] and Qy(s?,0, {↓, ↑}) = [0.9, 1].]
Figure 11: Example MDPs and policies where Proposition 7 applies, for the optimal policy and two
particular non-optimal policies. γ = 0.9. We show the linear decomposition of the Q-function into
Qx and Qy . Qx only depends on the x-coordinate of state and the sub-action that moves ← or →;
Qy only depends on the y-coordinate of state and the sub-action that moves ↓ or ↑.
Figure 12: Example MDPs and policies where Proposition 7 does not apply because the policy
violates Eqn. (4) (violations are highlighted). γ = 0.9. For example, in the first case, the policy does
not take the same sub-action from s0,0 and s0,1 with respect to the horizontal chain Mx . Applying
the linear approximation produces biased estimates Q̂ of the true Q-function, Qπ .
Figure 13: Example MDPs and policies where Theorem 1 does not apply because the transition function violates Eqn. (2). γ = 0.9. In this example, the highlighted transition corresponding to the action ↗ = [→, ↑] from s0,1 does not move right (→ under Mx) to s1,1 and instead moves back to state s0,1. Applying the linear approximation produces biased estimates Q̂ of the true Q-function, Qπ.
Figure 14: Example MDPs and policies where Theorem 1 does not apply because the reward function violates Eqn. (3). γ = 0.9. In this example, the reward function of the bottom left state s0,0 does not satisfy the condition because the reward of ↗ is 1 ≠ 2 = 1 + 1. Applying the linear approximation produces biased estimates Q̂ of the true Q-function, Qπ.
Figure 15: Example MDPs and policies where Theorem 1 does not apply because the reward function
violates Eqn. (3). γ = 1. In this example, the reward function of the bottom left state s0,0 does not
satisfy the condition because 7 + 1.5 6= 2 + 3. However, there exists a linear decomposition of the
true Q-function, Qπ , for a particular policy denoted by bold blue arrows.
D Experiments
D.1 Sepsis Simulator - Implementation Details
When generating the datasets, we follow the default initial state distribution specified in the original
implementation.
By default, we used neural networks consisting of one hidden layer with 1,000 neurons and ReLU
activation to allow for function approximators with sufficient expressivity. We trained these networks
using the Adam optimizer (default settings) [70] with a batch size of 64 for a maximum of 100
epochs, applying early stopping on 10% “validation data” (specific to each supervised task) with
a patience of 10 epochs. We minimized the mean squared error (MSE) for regression tasks (each
iteration of FQI). For FQI, we also added value clipping (to be within the range of possible returns
[−1, 1]) when computing bootstrapping targets to ensure a bounded function class and encourage
better convergence behavior [71].
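For concreteness, the target clipping can be implemented as below; this is our sketch, with illustrative names, and it shows one plausible placement of the clip (on the bootstrapped next-state value, with the range [−1, 1] corresponding to the possible returns in this task).

```python
import numpy as np

def fqi_targets(rewards, next_q_values, dones, gamma, clip=(-1.0, 1.0)):
    """Regression targets for one FQI iteration, with clipped bootstrap values.

    rewards:       (N,) observed rewards
    next_q_values: (N, |A|) current Q-estimates at the next states
    dones:         (N,) 1.0 if the transition ended the episode, else 0.0
    """
    bootstrap = np.clip(next_q_values.max(axis=1), *clip)   # keep within possible returns
    return rewards + gamma * (1.0 - dones) * bootstrap
```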
The RNN AIS encoder was trained to output the mean of a unit-variance multivariate Gaussian over the observation at subsequent timesteps, conditioned on the subsequent actions, following the idea in Subramanian and Mahajan [37]. We performed a grid search over the hyperparameters
(Table 2) for training the RNN, selecting the model that achieved the smallest validation loss. Using
the best encoder model, we then trained the offline RL policy using BCQ (and factored BCQ),
considering validation performance of all checkpoints (saved every 100 iterations, for a maximum of
10,000 iterations) and all combinations of the BCQ hyperparameters (Table 2).
Table 2: Hyperparameter values used for training the RNN approximate information state as well as
BCQ for offline RL. Discrete BCQ for both the baseline and factored implementation are identical
except for the final layer of the Q-networks.
Hyperparameter Searched Settings
RNN:
- Embedding dimension, dS {8, 16, 32, 64, 128}
- Learning rate { 1e-5, 5e-4, 1e-4, 5e-3, 1e-3 }
BCQ (with 5 random restarts):
- Threshold, τ {0, 0.01, 0.05, 0.1, 0.3, 0.5, 0.75, 0.999}
- Learning rate 3e-4
- Weight decay 1e-3
- Hidden layer size 256
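To illustrate the one architectural difference noted in the caption (the final layer of the Q-networks), here is our PyTorch sketch of the two heads for the D = 3 binary sub-action setting of Figure 1; the hidden layer size follows Table 2, but the exact module structure is an assumption, not the authors' released code.

```python
import torch
import torch.nn as nn

class BaselineQHead(nn.Module):
    """Standard head: one output per joint action (2^3 = 8 for three binary sub-actions)."""

    def __init__(self, d_state, n_joint_actions=8, hidden=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d_state, hidden), nn.ReLU(),
                                 nn.Linear(hidden, n_joint_actions))

    def forward(self, s):
        return self.net(s)                                    # (B, |A|)

class FactoredQHead(nn.Module):
    """Factored head: per-sub-action outputs summed to score every joint action."""

    def __init__(self, d_state, hidden=256):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(d_state, hidden), nn.ReLU())
        self.heads = nn.ModuleList(nn.Linear(hidden, 2) for _ in range(3))

    def forward(self, s):
        h = self.body(s)
        q1, q2, q3 = (head(h) for head in self.heads)         # each (B, 2)
        # Q(s, [a1, a2, a3]) = q1(s, a1) + q2(s, a2) + q3(s, a3), for all 8 combinations.
        return (q1[:, :, None, None] + q2[:, None, :, None]
                + q3[:, None, None, :]).flatten(start_dim=1)  # (B, 8)
```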
Figure 16: Validation performance (in terms of WIS and ESS) for all hyperparameter settings and all
checkpoints considered during model selection. Left - baseline, Right - proposed.
Figure 17: Left - Pareto frontiers of validation performance for the baseline and proposed approaches; Right - test performance of the candidate models that lie on the validation Pareto frontier. The validation performance largely reflects the test performance, and the proposed approach outperforms the baseline in terms of test performance, albeit with a bit more overlap.
Figure 18: Model selection with different minimum ESS cutoffs. In the main paper we used ESS
≥ 200; here we sweep this threshold and compare the resultant selected policies for both the baseline
and proposed approach (only using candidate models that lie on the validation Pareto frontier). In
general, across the ESS cutoffs, the proposed approach outperforms the baseline in terms of test set
WIS value, with comparable or slightly lower ESS.