
Exploiting Structure in Offline Multi-Agent RL:

The Benefits of Low Interaction Rank


Wenhao Zhan* Scott Fujimoto† Zheqing Zhu† Jason D. Lee‡
Daniel R. Jiang† Yonathan Efroni†

October 3, 2024

arXiv:2410.01101v1 [cs.LG] 1 Oct 2024

Abstract
We study the problem of learning an approximate equilibrium in the offline multi-agent
reinforcement learning (MARL) setting. We introduce a structural assumption—the interaction
rank—and establish that functions with low interaction rank are significantly more robust to
distribution shift compared to general ones. Leveraging this observation, we demonstrate that
utilizing function classes with low interaction rank, when combined with regularization and
no-regret learning, admits decentralized, computationally and statistically efficient learning
in offline MARL. Our theoretical results are complemented by experiments that showcase the
potential of critic architectures with low interaction rank in offline MARL, contrasting with
commonly used single-agent value decomposition architectures.

1 Introduction
Multi-agent reinforcement learning (MARL) is a general framework for interactive decision-making
with multiple agents. Recent breakthroughs in this field include learning superhuman strategies
in games like Go (Silver et al., 2016), StarCraft II (Vinyals et al., 2019), Texas hold’em poker
(Brown and Sandholm, 2019), and Diplomacy (Bakhtin et al., 2022). Additionally, MARL has been
successfully applied in real-world domains, including auctions (Jin et al., 2018), pricing systems
(Nanduri and Das, 2007), and traffic control (Wu et al., 2017). However, most of these successes
rely on online and iterative interaction with the environment, which enables the collection of diverse
and exploratory data. In practice, online interaction with exploratory policies is often infeasible or
prohibitive due to safety constraints, making it necessary to use offline datasets instead.
Several recent works have investigated the application of modern deep RL algorithms to the
offline MARL setting (Yang et al., 2021; Tseng et al., 2022; Wang et al., 2024). Despite recent
advances, there remains a lack of standardized methods that can effectively tackle complex, real-
world problems beyond simulated or simplistic settings. Recent works (Cui and Du, 2022; Zhang et al., 2023b) studied offline MARL from a sample complexity perspective.
*Princeton University. Work done at Meta. †Meta. ‡Princeton University.

Offline Setting                     | Reward Assumption  | Sample Complexity | Efficient Algorithm
Markov Game                         | —                  | $O(C^N)$          | ✗
Contextual Game                     | K-Interaction Rank | $O(C^K)$          | ✓
Markov Game w/ Decoupled Transition | K-Interaction Rank | $O(C^K)$          | ✓

Table 1: Comparison of the results presented in this work (highlighted in orange) and prior work. C
is the single-agent coverage coefficient. Here we present the worst-case dependence of the sample
complexity in the single-agent coverage coefficient, where N is the number of agents.

Specifically, Zhang
et al. (2023b) designed the BCEL algorithm, a sample-efficient algorithm for the offline general-
sum MARL setting with general function classes. However, its implementation poses significant
challenges due to the need to solve a non-convex problem in the joint action space. Furthermore,
the algorithm’s sample complexity is tied to the unilateral coverage coefficient, which can scale
exponentially with the number of agents in the worst-case scenario. This raises the following
question, which becomes the focus of this work:

Are there any natural structural assumptions that allow for both sample efficient and
computationally efficient algorithms in the offline MARL setting?

Recent lower bounds show that computing an equilibrium in a MARL setting is hard in gen-
eral (Daskalakis et al., 2009, 2023). Nevertheless, for some specialized MARL classes, this need
not be the case. In this work, we study the MARL setting with low interaction rank (IR). In this
setting the reward model decomposes to a sum of terms, each involving the interactions of only a
subset of the agents (Section 3). Our key statistical result is that functions with low interaction rank
are more robust to distribution shift compared to general functions. This result, which, as we show,
has natural applications in offline MARL, may also be of general interest.
Assuming the reward model has low interaction rank, we leverage regularization and no-regret
learning to develop decentralized computationally-efficient offline algorithms for the contextual
game (CG) setting, and for Markov games (MG) with a decoupled transition model (Section 4
and Section 5). Notably, we prove that applying structures with low interaction rank allows these
algorithms to achieve sample-efficient learning, avoiding exponential dependence on the number
of agents. Lastly, in Section 6, we empirically corroborate our findings. This shows the potential of
using reward architectures with low interaction rank in the offline MARL setting, and the need to go
beyond the standard single-agent value decomposition architectures, which have been popularized
for MARL (Sunehag et al., 2017; Rashid et al., 2020; Yu et al., 2022).

2 Preliminaries
We define the general offline multi-agent RL setting, which includes all of the models we study.

General-sum contextual MG. A contextual MG is defined by the tuple $\mathcal{M} = (N, H, \mathcal{C}, \mathcal{S} := \prod_{i=1}^N \mathcal{S}_i, \mathcal{A} := \prod_{i=1}^N \mathcal{A}_i, \{R^\star_{i,h}\}_{i=1,h=1}^{N,H})$, where $N$ is the number of agents and $H$ is the horizon. $\mathcal{C}$ is the context space. In each episode, a public context $c \in \mathcal{C}$, which is observed by all agents and stays invariant throughout the episode, is drawn from the distribution $\rho$. $\mathcal{S}_i$ and $\mathcal{A}_i$ are the local state and action spaces of the $i$-th agent. We assume the initial local state of each agent is fixed for simplicity, but our analysis can be easily extended to accommodate stochastic initial states. $R^\star_{i,h}(c, s, a)$ is the reward distribution of agent $i$ at step $h$ given the context $c$, joint state $s$ and joint action $a$. We assume the value of $R^\star_{i,h}$ lies in $[0, 1]$ and denote the mean of $R^\star_{i,h}$ by $r^\star_{i,h}$. In this paper, we study general-sum RL (Littman, 1994) and thus $r^\star_{i,h}: \mathcal{C} \times \mathcal{S} \times \mathcal{A} \to [0, 1]$ can be an arbitrary reward function.

Policy and value functions. A joint policy $\pi = \{\pi_h\}_{h=1}^H$ is a mapping from $\mathcal{C} \times \mathcal{S}$ to the simplex $\Delta_{\mathcal{A}}$ which determines the joint action selection probability under the public context and joint state at each step. Given $\pi$, we use $\pi_i = \{\pi_{i,h}\}_{h=1}^H$ to denote the marginalized policy for agent $i$. In the decentralized setting (Zhang et al., 2023a; DeWeese and Qu, 2024; Qu et al., 2020; Lin et al., 2021; Jin et al., 2024), each agent $i$ independently executes its local policy $\pi_i$ based only on the public context $c$ and its local state $s_i$, i.e., $\pi = \prod_{i=1}^N \pi_i$ where $\pi_{i,h}: \mathcal{C} \times \mathcal{S}_i \to \Delta_{\mathcal{A}_i}$ for all $i, h$. In this case, we call the joint policy $\pi$ a product policy.

Given a reward function $r_i$ of agent $i$ and joint policy $\pi$, we define the value function and Q-function associated with agent $i$ to be agent $i$'s expected return conditioned on the current joint state (and action):
$$V^{\pi,r}_{i,h}(c, s) := \mathbb{E}_\pi\Bigg[\sum_{h'=h}^{H} r_{i,h'}(c, s_{h'}, a_{h'}) \,\Big|\, c, s_h = s\Bigg], \qquad Q^{\pi,r}_{i,h}(c, s, a) := \mathbb{E}_\pi\Bigg[\sum_{h'=h}^{H} r_{i,h'}(c, s_{h'}, a_{h'}) \,\Big|\, c, s_h = s, a_h = a\Bigg].$$
Here, $\mathbb{E}_\pi[\cdot]$ denotes the expectation under the distribution of the trajectory when executing $\pi$ in $\mathcal{M}$. We will omit the superscript $r$ in $V^{\pi,r}_{i,h}$ and $Q^{\pi,r}_{i,h}$ if $r$ is the ground truth reward $r^\star$.

Offline equilibrium learning. For any joint policy $\pi$, if no agent can increase its own expected reward by changing its policy while the other agents keep their policies fixed, then $\pi$ is a coarse correlated equilibrium (CCE) (Aumann, 1987). More specifically, let $\Pi_i := \{\mu_i : \mathcal{C} \times \mathcal{S}_i \to \Delta_{\mathcal{A}_i}\}$ denote the local policy class of agent $i$; then an $\epsilon$-approximate CCE can be defined as follows:

Definition 1 (Coarse Correlated Equilibrium). A joint policy $\pi$ is called an $\epsilon$-approximate CCE if
$$\mathrm{Gap}_i(\pi) := \max_{\mu_i \in \Pi_i} \mathbb{E}_{c \sim \rho}\big[V_{i,1}^{\mu_i \times \pi_{-i}}(c, s_1)\big] - \mathbb{E}_{c \sim \rho}\big[V_{i,1}^{\pi}(c, s_1)\big] \le \epsilon, \quad \forall i \in [N],$$
where $\pi_{-i}$ is the marginalized policy of $\pi$ for all agents excluding $i$.

If $\pi$ is a product policy and satisfies Definition 1, then $\pi$ is the well-known Nash equilibrium (NE) (Nash et al., 1950). Given that NE can be hard to compute even for general-sum normal-form games (Daskalakis et al., 2009), our goal is to learn an $\epsilon$-approximate CCE. In particular, we want to identify a natural structural property for MARL under which we can design statistically and computationally efficient offline algorithms, meaning that we assume access to an offline dataset $\mathcal{D}$ without allowing interaction with the environment beyond it.

3 Interaction Rank Implies Robustness to Distribution Shift


In this section, we define the key structural property introduced in this work—the interaction rank
(IR) of a function. We show that a function with a low interaction rank is significantly more robust
to distribution shift compared to a general function in a standard offline supervised learning setting.
This observation later enables us to derive sample efficient guarantees for the MARL setting. For an
arbitrary function, we define its interaction rank as follows.
Definition 2 (Interaction Rank). A function $f : \mathcal{X} \times \mathcal{Y}_1 \times \cdots \times \mathcal{Y}_W \to [0, 1]$ has interaction rank $K$ ($K$-IR) if there exists a group of sub-functions $\cup_{0 \le k \le K-1} \{g_{j_1,\ldots,j_k}\}_{j_1 < \cdots < j_k}$ which satisfies
$$f(x, y_1, \ldots, y_W) = \sum_{k=0}^{K-1} \sum_{1 \le j_1 < \cdots < j_k \le W} g_{j_1,\ldots,j_k}(x, y_{j_1}, \ldots, y_{j_k}), \quad \forall x \in \mathcal{X}, y_1 \in \mathcal{Y}_1, \ldots, y_W \in \mathcal{Y}_W.$$

Intuitively, the function can be decomposed into a sum of sub-functions $g$, each depending only on a subset of the input variables. This structure is common in practice and finds application in fields including physics (Grana, 2016), economics (Asghari et al., 2022), and statistics (Vonesh et al., 2001).
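To make Definition 2 concrete, here is a minimal Python sketch (our own illustrative example, not from the paper) of a function with $W = 3$ inputs and interaction rank $K = 2$: every sub-function touches the shared input $x$ and at most one of the $y_j$.

```python
import numpy as np

# Illustrative 2-IR function: per Definition 2 with K = 2, it is a sum of
# sub-functions g_0(x) and g_j(x, y_j), none of which couples two or more y_j.
# (The specific sub-functions below are arbitrary choices.)
def g0(x):      return 0.1 * x
def g1(x, y1):  return 0.2 * x * y1
def g2(x, y2):  return 0.3 * np.sin(y2) + 0.05 * x
def g3(x, y3):  return 0.1 * y3 ** 2

def f_2ir(x, y1, y2, y3):
    return g0(x) + g1(x, y1) + g2(x, y2) + g3(x, y3)

# In contrast, a term such as x * y1 * y2 couples two of the y_j jointly and
# would push the interaction rank up to 3.
rng = np.random.default_rng(0)
x, (y1, y2, y3) = rng.uniform(), rng.uniform(size=3)
print(f_2ir(x, y1, y2, y3))
```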

Relation to Taylor series. When restricting the inputs of a function to a local neighborhood, Definition 2 can be understood as a Taylor expansion of the function. To see this, fix an $x \in \mathcal{X}$; then any $K$-differentiable function $f$ in a local region of $\{y_w\}_{w=1}^W \in \prod_w \mathcal{Y}_w$ can be approximated as
$$f(x, y_1, \ldots, y_W) \simeq f(x, y_1', \ldots, y_W') + \sum_{k=1}^{K} \frac{1}{k!} \sum_{j_1, \ldots, j_k} \frac{\partial^k f(x, y_1', \ldots, y_W')}{\partial y_{j_1} \cdots \partial y_{j_k}} \prod_{k'=1}^{k} (y_{j_{k'}} - y'_{j_{k'}}).$$
Hence, the interaction rank of a $K$-th order Taylor expansion is upper bounded by $K + 1$. Further, if the Taylor series is close to $f$, we can find a good approximation of $f$ with low interaction rank.

Bounded interaction rank implies distribution shift robustness. The key property that makes
functions with low interaction rank useful in the offline MARL is their robustness to distribution
shift. Towards formalizing this statement, let us first consider an offline supervised learning setting.
Suppose we wish to learn a target function f ⋆ in an offline setting. The training distribution is
x ∼ p, yi ∼ pi (·|x), ∀i and the target distribution is x ∼ p′ , yi ∼ p′i (·|x), ∀i. The distribution shift is
quantified by the density ratio:
$$\max_{x \in \mathcal{X}} \frac{p'(x)}{p(x)} \le C_{\mathrm{DS}}, \qquad \max_{i \in [W],\, x \in \mathcal{X},\, y_i \in \mathcal{Y}_i} \frac{p'_i(y_i \mid x)}{p_i(y_i \mid x)} \le C_{\mathrm{DS}}.$$

Let $\hat{f}$ denote the learned function. Standard guarantees imply that the training error (i.e., under $p$ and $p_i$) can be upper bounded by $\epsilon$:
$$\mathbb{E}_{x \sim p,\, y_1 \sim p_1(\cdot|x), \ldots, y_W \sim p_W(\cdot|x)}\Big[\big((f^\star - \hat{f})(x, y_1, \ldots, y_W)\big)^2\Big] \le \epsilon. \tag{1}$$

When $f^\star$ and $\hat{f}$ are general functions, the optimal worst-case learning error under the target distribution is $O\big((C_{\mathrm{DS}})^{W+1} \epsilon\big)$, which scales exponentially with the input size $W$. However, if $f^\star$ and $\hat{f}$ have bounded interaction rank, this result can be significantly improved; the error under distribution shift only scales exponentially with the interaction rank.

Theorem 1. If $f^\star$ and $\hat{f}$ are $K$-IR, we have
$$\mathbb{E}_{x \sim p',\, y_1 \sim p'_1(\cdot|x), \ldots, y_W \sim p'_W(\cdot|x)}\Big[\big((f^\star - \hat{f})(x, y_1, \ldots, y_W)\big)^2\Big] \lesssim (2W)^{2(K-1)} C_{\mathrm{DS}}^{K}\, \epsilon.$$

Here for any two functions g and g ′ , g ≲ g ′ means that there exists a constant c > 0 such that
g < cg ′ always holds. Theorem 1 indicates that when K ≪ W , function classes with bounded
interaction rank are more robust to distribution shift and can significantly alleviate the curse of
dimensionality due to multiple agents in offline learning. In MARL, W + 1 will be the number of
agents, while K is the interaction rank of the reward.
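As a rough worked comparison (with assumed, purely illustrative values), take $W + 1 = 50$ agents, interaction rank $K = 2$, and density-ratio bound $C_{\mathrm{DS}} = 2$:
$$\underbrace{(C_{\mathrm{DS}})^{W+1}\,\epsilon = 2^{50}\,\epsilon \approx 1.1 \times 10^{15}\,\epsilon}_{\text{general function class}} \qquad \text{vs.} \qquad \underbrace{(2W)^{2(K-1)}\, C_{\mathrm{DS}}^{K}\,\epsilon = 98^2 \cdot 4\,\epsilon \approx 3.8 \times 10^{4}\,\epsilon}_{\text{2-IR function class}}.$$
Both are worst-case constants multiplying the in-distribution error $\epsilon$; the point is only the relative scaling in $W$ versus $K$.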

4 Warm Up: Contextual Games


The robustness to distribution shift of low-IR functions suggests that such a property may be useful
for offline MARL. Indeed, in the offline setting we need to properly estimate quantities that deviate
from the data distribution. To provide intuition for the benefits of low-IR reward classes and
corresponding algorithmic design, we start by considering the contextual games (CG) setting as a
warm up.

Offline CG. The CG problem is a general-sum contextual MG where $\mathcal{S}_i = \emptyset$ for all $i$ and $H = 1$. To simplify notation, we omit the $h$ subscript in $r_h$ and $\pi_h$ for this setting. We assume the offline dataset $\mathcal{D} = \mathcal{D}_R$ where each sample $(c, a = \{a_i\}_{i=1}^N, \{r_i\}_{i=1}^N)$ is i.i.d. sampled from $c \sim \rho$, $a_i \sim \nu_i(\cdot|c)$, $r_i \sim R_i^\star(c, a)$ for all $i \in [N]$. We call $\nu_i$ the offline behavior policy for each agent $i$ and use $\nu$ to denote the product behavior policy $\prod_{i \in [N]} \nu_i$. Let us assume for simplicity that we have learned reward functions $\{\hat{r}_i \in [0, 1]\}_{i \in [N]}$ from the offline dataset with in-distribution training error $\epsilon$:
$$\mathbb{E}_{c \sim \rho,\, a \sim \nu(\cdot|c)}\big[(r_i^\star - \hat{r}_i)^2\big] \le \epsilon, \quad \forall i \in [N].$$

Algorithm: Decentralized $\chi^2$-Regularized Policy Gradient. Given $\hat{r}$, we propose a decentralized, $\chi^2$-regularized, no-regret policy gradient based algorithm. As we show, this algorithm produces a set of policies which are near equilibrium. In each iteration $t$, each agent updates its policy via:
$$\pi_i^{t+1}(c) = \arg\min_{p \in \Delta_{\mathcal{A}_i}} \; -\langle \hat{r}_i^t(c, \cdot), p \rangle + \underbrace{\lambda\, \chi^2(p, \nu_i(c))}_{\text{regularization}} + \underbrace{\frac{1}{\eta}\, D_{c,i}(p, \pi_i^t(c))}_{\text{no-regret learning}}. \tag{2}$$

Here $\hat{r}_i^t(c, a_i) = \mathbb{E}_{a_j \sim \pi_j^t(c), \forall j \ne i}[\hat{r}_i(c, a)]$ is the expected reward of agent $i$ given that the other agents' policies are $\prod_{j \ne i} \pi_j^t$. The regularizer $\chi^2(p, \nu_i(c)) := \mathbb{E}_{a_i \sim \nu_i(\cdot|c)}\big[(p(a_i)/\nu_i(a_i|c) - 1)^2\big]$ is the $\chi^2$-divergence between distribution $p$ and $\nu_i(c)$, and $D_{c,i}(p, \pi_i^t(c))$ is the Bregman divergence between distribution $p$ and $\pi_i^t(c)$:
$$D_{c,i}(p, \pi_i^t(c)) := \chi^2(p, \nu_i(c)) - \chi^2(\pi_i^t(c), \nu_i(c)) - \big\langle \nabla_{\pi_i^t(c)} \chi^2(\pi_i^t(c), \nu_i(c)),\, p - \pi_i^t(c) \big\rangle = \mathbb{E}_{a_i \sim \nu_i(\cdot|c)}\Bigg[\bigg(\frac{p(a_i) - \pi_i^t(a_i|c)}{\nu_i(a_i|c)}\bigg)^2\Bigg].$$
We denote the total number of iterations by $T$. Eq. (2) has two divergence terms, which serve different roles. We add the $\chi^2$-divergence regularization term to encourage the policy trajectory to stay close to the behavior policy $\nu_i$ and thus lessen the distribution shift issue. On the other hand, to ensure the update enjoys no regret, we include a Bregman divergence term motivated by the policy mirror descent literature (Zhan et al., 2023a; Lan, 2023). Notably, Eq. (2) is a quadratic optimization problem whose input size is only $|\mathcal{A}_i|$. Thus, for small action and state spaces we can solve it efficiently without incurring exponential computation cost as the number of agents increases.
Remark 1. The key ingredients of our algorithm are (1) regularization and (2) no-regret learning.
We choose χ2 -divergence and its corresponding Bregman divergence for a tractable theoretical
analysis. In practice, other regularizers can also be utilized, such as KL divergence (Rafailov
et al., 2024) or the L2 behavior cloning term in TD3-BC (Fujimoto and Gu, 2021). Additionally, in
practice, one-step online gradient (Zinkevich, 2003) can be used as the no-regret learning algorithm.
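As a concrete illustration of the update in Eq. (2), the following minimal Python sketch solves the per-context quadratic problem by projected gradient descent over the simplex. This is our own sketch under assumed discrete local actions, a uniform behavior policy in the toy usage, and a fixed step size; it stands in for an exact QP solver or the one-step online gradient update mentioned in Remark 1.

```python
import numpy as np

def project_to_simplex(v):
    """Euclidean projection of a real vector v onto the probability simplex."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u + (1.0 - css) / (np.arange(len(v)) + 1) > 0)[0][-1]
    tau = (1.0 - css[rho]) / (rho + 1)
    return np.maximum(v + tau, 0.0)

def chi2_regularized_update(r_hat, pi_t, nu, lam, eta, steps=500, lr=1e-2):
    """One decentralized update in the spirit of Eq. (2):
    minimize  -<r_hat, p> + lam * chi2(p, nu) + (1/eta) * D(p, pi_t)  over the simplex,
    with  chi2(p, nu) = sum_a (p_a - nu_a)^2 / nu_a  and
          D(p, pi_t)  = sum_a (p_a - pi_t_a)^2 / nu_a  (requires nu > 0 everywhere).
    """
    p = pi_t.copy()
    for _ in range(steps):
        grad = -r_hat + 2.0 * lam * (p - nu) / nu + (2.0 / eta) * (p - pi_t) / nu
        p = project_to_simplex(p - lr * grad)
    return p

# Toy usage: 4 local actions, uniform behavior policy.
rng = np.random.default_rng(0)
nu = np.full(4, 0.25)
pi_t = nu.copy()
r_hat = rng.uniform(size=4)   # estimated expected reward of each local action
pi_next = chi2_regularized_update(r_hat, pi_t, nu, lam=1.0, eta=1.0)
print(pi_next, pi_next.sum())
```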

Theoretical analysis. Now we analyze the statistical sample complexity of the above algorithm. If
the reward function class has no specific structure, the sample complexity can still scale exponentially
with N due to distribution shift. To address this, we leverage a low-IR reward function class:
Assumption 1 (K-IR Reward). Suppose that the interaction ranks of $r_i^\star$ and $\hat{r}_i$ are upper bounded by $K$, with $\mathcal{X} = \mathcal{C} \times \mathcal{A}_i$ and $\mathcal{Y}_j = \mathcal{A}_j$ in Definition 2, for all $i \in [N]$.
Assumption 1 naturally holds in a variety of games. For example, polymatrix games (Howson Jr,
1972; Kalogiannis and Panageas, 2024; MacQueen and Wright, 2024) characterize the reward
function via pairwise interactions and, thus, for these settings Assumption 1 holds with K = 2. In
network games (Galeotti et al., 2010; DeWeese and Qu, 2024; Park et al., 2024), the reward only
depends on the neighbors and thus Assumption 1 holds with K equal to the degree of the network.
Note that for all of these examples, we have K ≪ N .
Now we introduce a bound on the maximum gap of the output policy $\hat{\pi}$ under $K$-IR reward classes. Let $r(\pi)$ be the expected reward under the distribution $c \sim \rho, a \sim \pi(\cdot|c)$. Similar to existing offline RL analysis techniques (Xie et al., 2021), we split the bound into on-support and off-support components:

Theorem 2 (Informal). Suppose Assumption 1 holds. Let $\Pi_i(C) := \{\mu_i : \mathbb{E}_{c \sim \rho}[\chi^2(\mu_i(c), \nu_i(c))] \le C\}$ denote the policy class which has bounded $\chi^2$-divergence from the behavior policy $\nu_i$. Fix any $\delta \in (0, 1]$ and select $T, \eta, \lambda$ in Eq. (2) properly. Then, with probability at least $1 - \delta$, we have
$$\max_i \mathrm{Gap}_i(\hat{\pi}) \lesssim \max_{i \in [N]} \min_{C \ge 1} \Big\{ \big(C (2N^2)^{K-1} \epsilon\big)^{\frac{1}{3K-1}} + \mathrm{subopt}_i(C, \hat{\pi}) \Big\}, \tag{3}$$
where $\mathrm{subopt}_i(C, \hat{\pi}) := \max_{\mu_i \in \Pi_i} r_i^\star(\mu_i, \hat{\pi}_{-i}) - \max_{\mu_i \in \Pi_i(C)} r_i^\star(\mu_i, \hat{\pi}_{-i})$ is the off-support bias.

Optimal bias-variance tradeoff. We call Πi (C) a covered policy class because policies within
it have bounded χ2 -divergence from the behavior policy ν, which implies that we can estimate
their performance relatively accurately from the offline dataset. The right hand side of Eq. (3) can
be viewed as a bias-variance decomposition of the gap. The first term is the variance term which
measures the distribution-shift effect of comparing against policies from Πi (C). The second term is
the bias term which quantifies the performance difference between the global optimal policy and
the optimal policy in the covered policy class. As C increases, the considered covered policy class
will expand, and thus the variance term will grow while the bias term will diminish. Notably, our algorithm does not require any information about $C$, and the bound in Theorem 2 holds with the best choice of $C$, which means that we identify the optimal bias-variance tradeoff automatically.

Polynomial sample complexity with single-agent concentrability. Let us consider the following single-agent all-policy concentrability coefficient $C_{\mathrm{sin}} := \max_{i \in [N], \mu_i, c \in \mathcal{C}, a_i \in \mathcal{A}_i} \frac{\mu_i(a_i|c)}{\nu_i(a_i|c)}$. Note that $C_{\mathrm{sin}}$ will not scale with $N$ exponentially. Then Theorem 2 implies that if $C_{\mathrm{sin}} < \infty$, the maximum gap under the interaction rank structure can be upper bounded by
$$\max_i \mathrm{Gap}_i(\hat{\pi}) \lesssim \big(C_{\mathrm{sin}} (2N^2)^{K-1} \epsilon\big)^{\frac{1}{3K-1}}.$$

Therefore, given a fixed K, we can learn an approximate CCE with polynomial sample complexity
with respect to the number of agents N under single-agent all-policy concentrability. This demon-
strates the power of low-IR reward classes for MARL. When combined with regularization and
no-regret learning, the sample complexity is significantly improved, making computationally- and
statistically-efficient algorithm design possible in MARL.

Proof highlights. We provide a proof sketch of Theorem 2 for $K = 2$, supplying intuition for how $K$-IR reward classes benefit theoretical sample complexity. For any agent $i \in [N]$ and policy $\mu_i \in \Pi_i(C)$ where $C > 1$, we can bound the in-support gap $\sum_{t=1}^T r_i^\star(\mu_i, \pi_{-i}^t) - r_i^\star(\pi^t)$ as follows:
$$\underbrace{\sum_{t=1}^T \mathbb{E}_{c \sim \rho,\, a_i \sim \mu_i(c),\, a_{-i} \sim \pi_{-i}^t}\big[(r_i^\star - \hat{r}_i)(c, a)\big]}_{(1)} + \underbrace{\sum_{t=1}^T \mathbb{E}_{c \sim \rho,\, a \sim \pi^t}\big[(\hat{r}_i - r_i^\star)(c, a)\big]}_{(2)} + \underbrace{\sum_{t=1}^T \Big(\hat{r}_i(\mu_i, \pi_{-i}^t) - \hat{r}_i(\pi^t)\Big)}_{(3)}.$$

We need to bound terms (1), (2), and (3). Term (3) is the performance difference when changing the policy of agent $i$ to $\mu_i$. Note that this is equivalent to the regret of agent $i$ with loss function $-\hat{r}_i^t$, and thus we can bound it with techniques similar to those in the policy mirror descent literature (Zhan et al., 2023a). Term (1) represents the reward learning error under the comparator policy $\mu_i$ and the learned policy $\pi_{-i}^t$, whose induced distribution differs from $\nu$. To control it, we need to tackle the distribution shift between the two. We use $g_\emptyset^i, \{g_j^i\}_{j \ne i}$ and $\hat{g}_\emptyset^i, \{\hat{g}_j^i\}_{j \ne i}$ to denote the decompositions of $r_i^\star$ and $\hat{r}_i$, and use $\Delta_\emptyset^i$ and $\Delta_j^i$ to denote $g_\emptyset^i - \hat{g}_\emptyset^i$ and $g_j^i - \hat{g}_j^i$. Since we apply a $K$-IR reward class (Assumption 1), we can decompose term (1) as follows:
$$\mathbb{E}_{c \sim \rho,\, a_i \sim \mu_i(\cdot|c),\, a_{-i} \sim \pi_{-i}^t(\cdot|c)}\big[(r_i^\star - \hat{r}_i)(c, a)\big] = \mathbb{E}_{c \sim \rho,\, a_i \sim \mu_i(\cdot|c)}\big[\Delta_\emptyset^i(c, a_i)\big] + \sum_{j \ne i} \mathbb{E}_{c \sim \rho,\, a_i \sim \mu_i(\cdot|c),\, a_j \sim \pi_j^t(\cdot|c)}\big[\Delta_j^i(c, a_i, a_j)\big].$$
Meanwhile, from the property of $\chi^2$-divergence, we have
$$\mathbb{E}_{c \sim \rho,\, a_i \sim \mu_i(\cdot|c),\, a_j \sim \pi_j^t(\cdot|c)}\big[\Delta_j^i(c, a_i, a_j)\big] \le \sqrt{\mathbb{E}_{c \sim \rho,\, a_i \sim \nu_i(\cdot|c),\, a_j \sim \nu_j(\cdot|c)}\Big[\big(\Delta_j^i(c, a_i, a_j)\big)^2\Big] \cdot \Big(1 + \chi^2\big(\rho \circ (\mu_i \times \pi_j^t),\, \rho \circ (\nu_i \times \nu_j)\big)\Big)},$$
where we use $\rho \circ p$ to denote the joint distribution $c \sim \rho, a \sim p(\cdot|c)$ for some conditional distribution $p$. For the $\chi^2$-divergence term, $\chi^2(\mu_i(c), \nu_i(c))$ is bounded because $\mu_i$ is from the covered policy class; we can also upper bound $\chi^2(\pi_j^t(c), \nu_j(c))$ due to the $\chi^2$ regularizer term in Eq. (2). Thus, we only need to bound $\mathbb{E}_{c \sim \rho,\, a_i \sim \nu_i(\cdot|c),\, a_j \sim \nu_j(\cdot|c)}\big[(\Delta_j^i(c, a_i, a_j))^2\big]$.
This is non-trivial because we only regress with respect to $r^\star$, which is the sum of the sub-functions $g$, and there exist infinitely many IR decompositions of $r^\star$. Fortunately, we are able to show that such an aligned decomposition exists:

Lemma 1 (Sub-function Alignment for $K = 2$, informal). There exists a standardized IR decomposition of $r^\star$ and $\hat{r}$, denoted by $g_\emptyset, g_1, \ldots, g_W$ and $\hat{g}_\emptyset, \hat{g}_1, \ldots, \hat{g}_W$, such that
$$\mathbb{E}_{c \sim \rho,\, a_i \sim \nu_i(\cdot|c),\, a_j \sim \nu_j(\cdot|c)}\Big[\big(\Delta_j^i(c, a_i, a_j)\big)^2\Big] \le 2\epsilon, \quad \forall j \ne i.$$
With Lemma 1, we are able to bound term (1) efficiently. Term (2) can be handled similarly. Notably, Lemma 1 holds for general $K$, as shown in Lemma 4, and the IR decomposition circumvents exponential scaling with $N$. The above discussion illustrates that low-IR reward classes are quite effective in mitigating the learning error under distribution shift in MARL.

5 Decentralized Regularized Actor-Critic in Markov Games with Decoupled Transitions
We are now ready to investigate the benefits of low interaction rank in offline MGs. In particular,
we will propose our main algorithmic framework to utilize low-IR function classes.

MGs with decoupled transitions. In this work we assume that the transition of each local state depends only on the local state, the public context, and the local action (Zhang et al., 2023a; DeWeese and Qu, 2024; Jin et al., 2024), which can be characterized by the kernel $P_{i,h}^\star : \mathcal{C} \times \mathcal{S}_i \times \mathcal{A}_i \to \Delta_{\mathcal{S}_i}$ for all $i \in [N], h \in [H]$. Note that the reward function $R_{i,h}^\star(c, s, a)$ is still that of a general-sum game and depends on the joint state and joint action. Notice that CGs are a special case of MGs with decoupled transitions.
Remark 2. The decoupled transitions property finds application in many practical scenarios, including sensor coverage, autonomous vehicles, and robotics, and has been studied in the online decentralized learning setting (Zhang et al., 2023a; DeWeese and Qu, 2024; Jin et al., 2024). For more general MGs, decentralized no-regret algorithms are hard to design even in the full-information setting. As far as we know, Erez et al. (2023) is the only existing work that achieves sublinear regret in general MGs when all the agents adopt a decentralized algorithm. However, they only focus on tabular cases in the full-information setting or the online setting with a minimum reachability assumption. Therefore, we leave extending our analysis to more general MGs as an important future direction.

In particular, we consider the decentralized setting where each agent $i$ executes its policy $\pi_i$ only based on the public context $c$ and its local state $s_i$ (Zhang et al., 2023a; DeWeese and Qu, 2024; Qu et al., 2020; Lin et al., 2021; Jin et al., 2024). For agent $i$, given a local policy $\pi_i = \{\pi_{i,h}\}_{h \in [H]}$ and a public context $c$, the transition of the local state is indeed independent of the other agents, and thus we can define the local state visitation measure as follows:
$$d_h^{\pi_i}(s|c) := \mathbb{P}^{\pi_i}(s_{i,h} = s \mid c), \quad \forall h \in [H], s \in \mathcal{S}_i, i \in [N],$$
where $s_{i,h}$ is the local state of agent $i$ at step $h$ and $\mathbb{P}^{\pi_i}(\cdot|c)$ denotes the distribution of the trajectories under policy $\pi_i$ and public context $c$. We also define $d_h^{\pi_i}(s, a|c) := d_h^{\pi_i}(s|c)\, \pi_{i,h}(a|c, s)$.

Offline dataset. We assume access to an offline dataset $\{\mathcal{D}_h\}_{h=1}^H$. $\mathcal{D}_h$ consists of $M$ i.i.d. samples $(c, \{s_i, a_i, s_i'\}_{i \in [N]}, \{r_i\}_{i \in [N]})$ where $c \sim \rho$, $s_i \sim \sigma_{i,h}(\cdot|c)$, $a_i \sim \nu_{i,h}(\cdot|c, s_i)$, $s_i' \sim P_{i,h}^\star(\cdot|c, s_i, a_i)$ and $r_i \sim R_{i,h}^\star(c, \{s_i, a_i\}_{i \in [N]})$. Note that $\sigma_{i,h}$ may not be the local state visitation measure $d_h^{\nu_i}(\cdot|c)$. We also use $\sigma_h$ to denote $\prod_{i \in [N]} \sigma_{i,h}$.

General function approximation. We consider the general function approximation setting. This makes the algorithm applicable to potentially large or even infinite state and action spaces. Suppose that we have function classes $\mathcal{R} = \{\mathcal{R}_i\}_{i=1}^N$ to approximate the reward functions $r_{i,h}^\star$, where $\mathcal{R}_i \subseteq \{r : \mathcal{C} \times \mathcal{S} \times \mathcal{A} \to [0, 1]\}$ for all $i \in [N]$. In addition, we use function classes $\{\mathcal{P}_i\}_{i \in [N]}$ where $\mathcal{P}_i \subseteq \{P : \mathcal{C} \times \mathcal{S}_i \times \mathcal{A}_i \to \Delta_{\mathcal{S}_i}\}$ to approximate the transition model. We assume here that $\mathcal{R}_i$ and $\mathcal{P}_i$ are finite, but the analysis can be extended to infinite function classes naturally by replacing the cardinalities of $\mathcal{R}_i$ and $\mathcal{P}_i$ with their covering or bracketing numbers (Wainwright, 2019). To simplify notation, we use $|\mathcal{R}|$ and $|\mathcal{P}|$ to denote $\max_{i \in [N]} |\mathcal{R}_i|$ and $\max_{i \in [N]} |\mathcal{P}_i|$.

5.1 Algorithmic Framework


For general-sum MGs with decoupled transitions, we consider an algorithmic framework widely used in practice, the actor-critic method (Barto et al., 1983). Arming it with regularization and no-regret learning, we propose DR-AC for offline learning in MGs. The full algorithm is stated in Algorithm 1. Notably, DR-AC is a decentralized model-based algorithm which is computationally efficient given that we are able to solve least squares regression (LSR) and maximum likelihood estimation (MLE) problems. DR-AC consists of two phases: offline reward and transition learning, followed by decentralized actor-critic updates.

Offline reward and transition learning. We first learn the reward function $\hat{r}_i$ for each agent $i$ using LSR on the offline dataset $\mathcal{D}_R$. In particular, here we use a function class $\mathcal{R}_i$ in which all functions have bounded IR, so that our learned reward has higher robustness to distribution shift, as we have shown in the previous section. We also learn the transition model for each agent $i$ via MLE on the offline dataset with function classes $\mathcal{P}_i$. Note that LSR and MLE problems are common in supervised learning and can be solved with simple methods like stochastic gradient descent (Jain et al., 2018). The RL literature has also assumed the existence of efficient solutions to these optimization problems, calling algorithms that depend on them oracle-efficient (Dann et al., 2018; Agarwal et al., 2020; Uehara et al., 2021; Song et al., 2022).
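For intuition, in the tabular case both estimation oracles in line 3 of Algorithm 1 have simple closed forms: the LSR solution is the per-cell empirical mean reward and the MLE solution is the empirical next-state frequency. Below is a minimal Python sketch (our own, with an assumed tuple format for the dataset rows, not the paper's implementation):

```python
from collections import defaultdict

def fit_reward_lsr(rows):
    """rows: iterable of (c, s, a, r) with c, s, a hashable (joint state/action as tuples).
    Tabular least squares: the minimizer is the per-cell mean observed reward."""
    sums, counts = defaultdict(float), defaultdict(int)
    for c, s, a, r in rows:
        sums[(c, s, a)] += r
        counts[(c, s, a)] += 1
    return {cell: sums[cell] / counts[cell] for cell in sums}

def fit_transition_mle(rows):
    """rows: iterable of (c, s_i, a_i, s_i_next) for one agent and step.
    Tabular MLE: empirical next-local-state frequencies per (c, s_i, a_i) cell."""
    counts = defaultdict(lambda: defaultdict(int))
    for c, s, a, s_next in rows:
        counts[(c, s, a)][s_next] += 1
    return {cell: {s2: n / sum(nxt.values()) for s2, n in nxt.items()}
            for cell, nxt in counts.items()}
```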

Algorithm 1 Decentralized Regularized Actor-Critic (DR-AC)
1: Initialize $\pi_i^1$ to be the behavior policy $\nu_i$ for each agent $i$.
2: /** Offline Reward & Transition Learning **/
3: Compute for all $i \in [N], h \in [H]$
$$\hat{r}_{i,h} = \arg\min_{r \in \mathcal{R}_i} \sum_{(c, s, a, r_i) \in \mathcal{D}_h} (r(c, s, a) - r_i)^2, \qquad \hat{P}_{i,h} = \arg\max_{P \in \mathcal{P}_i} \sum_{(c, s_i, a_i, s_i') \in \mathcal{D}_h} \log P(s_i' \mid c, s_i, a_i).$$
4: for $t = 1, \ldots, T$ do
5:   for $i \in [N], h \in [H]$ do
6:     /** Critic Update **/
7:     Estimate the single-agent Q-function with the learned reward $\hat{r}_i$ and transition $\hat{P}_i$:
$$\hat{Q}_{i,h}^t(c, s_i, a_i) = \mathbb{E}_{(s_j, a_j) \sim \hat{d}_h^{\pi_j^t}(\cdot|c),\, \forall j \ne i}\Big[\hat{Q}_{i,h}^{\pi^t, \hat{r}}(c, s, a)\Big], \quad \forall c \in \mathcal{C}, s_i \in \mathcal{S}_i, a_i \in \mathcal{A}_i.$$
8:     /** Actor Update **/
9:     Run mirror descent for all $c \in \mathcal{C}, s \in \mathcal{S}_i$:
$$\pi_{i,h}^{t+1}(c, s) = \arg\min_{p \in \Delta_{\mathcal{A}_i}} -\big\langle \hat{Q}_{i,h}^t(c, s, \cdot), p \big\rangle + \lambda\, \chi^2(p, \nu_{i,h}(c, s)) + \frac{1}{\eta}\, D_{c,s,i}(p, \pi_{i,h}^t(c, s)). \tag{4}$$
10: Return: the uniform mixture of $\big\{\prod_{i \in [N]} \pi_i^t\big\}_{t=1}^T$.

Critic update. In each iteration, for each agent $i$, we estimate its current single-agent Q-function, given other agents' policies, with the learned reward $\hat{r}$ and transition model $\hat{P}$:
$$\hat{Q}_{i,h}^t(c, s_i, a_i) = \mathbb{E}_{(s_j, a_j) \sim \hat{d}_h^{\pi_j^t}(\cdot|c),\, \forall j \ne i}\Big[\hat{Q}_{i,h}^{\pi^t, \hat{r}}(c, s, a)\Big],$$
where we use $\hat{Q}_{i,h}^{\pi, \hat{r}}$ and $\hat{d}_h^{\pi_j}$ to denote the joint Q-function and local state visitation measure of $\pi$ under reward $\hat{r}$ and transition $\hat{P}$. In practice, we can simply use a Monte-Carlo-type method to estimate $\hat{Q}_{i,h}^t$, which only requires solving an LSR problem and is thus computationally efficient. See Appendix B for more details.

Actor update. Given the estimated Q-function, we use regularized policy gradient to update
each agent’s policy. The update formula Eq. (4) is almost the same as the update in Eq. (2) for
CGs, except the estimated reward is replaced with the estimated Q-function. We use χ2 -divergence
for regularization and Bregman divergence in Algorithm 1. Nevertheless, DR-AC allows other
regularizers and no-regret learning techniques as mentioned in Remark 1. Note that Eq. (4) is a
quadratic optimization problem with input size |Ai | and thus can be solved efficiently.

5.2 Theoretical Analysis


We now present the sample complexity guarantee for DR-AC. We assume the function classes $\{\mathcal{R}_i\}_{i \in [N]}$ and $\{\mathcal{P}_i\}_{i \in [N]}$ are realizable.

Assumption 2. Suppose that $r_{i,h}^\star \in \mathcal{R}_i$ and $P_{i,h}^\star \in \mathcal{P}_i$ for all $i \in [N], h \in [H]$.
In general, DR-AC can have exponentially large statistical complexity with respect to the number
of agents N . However, similarly to the CG result, a low-IR reward function class alleviates this.
Assumption 3 (K-IR Reward). Suppose that the IR of ri is upper bounded by K with X =
C × Si × Ai and Yj = Sj × Aj in Definition 2 for all j ̸= i, ri ∈ Ri , i, j ∈ [N ].
In addition, we assume that the offline dataset satisfies single-agent all-policy concentrability
for the local state distribution. Recall that σi,h is the dataset distribution.
Assumption 4. Suppose that for all $i \in [N]$ we have
$$\max_{i \in [N],\, \mu_i,\, c \in \mathcal{C},\, s \in \mathcal{S}_i,\, h \in [H]} \frac{d_h^{\mu_i}(s|c)}{\sigma_{i,h}(s|c)} \le C_S < \infty.$$

We need Assumption 4 because bounded $\chi^2$-divergence between the action probabilities of two policies does not imply bounded $\chi^2$-divergence between their state visitation measures. In DR-AC we can only regularize the action probabilities and therefore require additional concentrability for the local states. Nevertheless, here we only need single-agent concentrability and thus $C_S$ does not scale exponentially with $N$. Now we can bound the maximum gap of the output policy $\hat{\pi}$ of DR-AC:
Theorem 3. Suppose Assumption 2, Assumption 3 and Assumption 4 hold. Let $\Pi_i(C) := \{\mu_i : \mathbb{E}_{c \sim \rho,\, s \sim d_h^{\mu_i}(\cdot|c)}[\chi^2(\mu_{i,h}(c, s), \nu_{i,h}(c, s))] \le C,\ \forall h\}$ denote the policy class which has bounded $\chi^2$-divergence from the behavior policy $\nu_i$. Fix any $\delta \in (0, 1]$ and select
$$\lambda = C_S^{\frac{K}{3K+2}} H^{\frac{3K}{3K+2}} (2N^2)^{\frac{K-1}{3K+2}} \epsilon_{RP}^{\frac{1}{3K+2}}, \qquad \eta = \frac{\lambda}{H^2}, \qquad T = \frac{H^2}{\lambda^2},$$
where $\epsilon_{RP} := \frac{\log(NH|\mathcal{R}||\mathcal{P}|/\delta)}{M}$. Then, with probability at least $1 - \delta$, the output of DR-AC, $\hat{\pi}$, satisfies:
$$\max_i \mathrm{Gap}_i(\hat{\pi}) \lesssim \max_{i \in [N]} \min_{C \ge 1} \Big\{ C\, C_S^{\frac{K}{3K+2}} H^{\frac{6K+2}{3K+2}} (2N^2)^{\frac{K-1}{3K+2}} \epsilon_{RP}^{\frac{1}{3K+2}} + \mathrm{subopt}_i(C, \hat{\pi}) \Big\},$$
where $\mathrm{subopt}_i(C, \hat{\pi}) := \max_{\mu_i} \mathbb{E}_{c \sim \rho}\big[V_{i,1}^{\mu_i \circ \hat{\pi}_{-i}}(c, s_1)\big] - \max_{\mu_i \in \Pi_i(C)} \mathbb{E}_{c \sim \rho}\big[V_{i,1}^{\mu_i \circ \hat{\pi}_{-i}}(c, s_1)\big]$.
Similarly to Theorem 2, Theorem 3 indicates that DR-AC admits an optimal bias-variance tradeoff over the covered policy class $\Pi_i(C)$. In addition, if we have single-agent all-policy concentrability $C_{\mathrm{sin}} := \max_{h, i, \mu_i, c, s_i \in \mathcal{S}_i, a_i \in \mathcal{A}_i} \frac{\mu_{i,h}(a_i|c, s_i)}{\nu_{i,h}(a_i|c, s_i)} < \infty$, DR-AC is capable of learning an $\epsilon$-approximate CCE given sample complexity
$$M \gtrsim \frac{C_{\mathrm{sin}}^{3K+2} C_S^{K} H^{6K+2} (2N^2)^{K-1} \log(NH|\mathcal{R}||\mathcal{P}|/\delta)}{\epsilon^{3K+2}}.$$
Therefore, given a fixed K and single-agent all-policy concentrability, DR-AC can learn an ap-
proximate CCE in polynomial sample complexity with respect to N for general-sum MGs with
decoupled transitions. This suggests that introducing low-IR structure to the reward class is still
beneficial for offline learning in general-sum MGs.
Remark 3. For $|\mathcal{R}|$, note that we require the function classes $\mathcal{R}_i$ to have IR bounded by $K$. This means that their complexity scales at most exponentially with $K$.

[Figure 1: Network diagrams for the $i$-th agent: joint-action reward critic, 2nd-order interaction rank (2-IR) reward critic, and single-agent (1-IR) reward critic.]

Comparison with existing works. To our knowledge, Cui and Du (2022); Zhang et al. (2023b) are the only existing offline general-sum MARL works with provable statistical guarantees. However, the proposed methods are not decentralized and require evaluating the gap for every possible candidate joint policy, resulting in an impractically high computational burden.
Statistically, although Cui and Du (2022) achieves $\tilde{O}(1/\epsilon^2)$ complexity, they require a stronger concentrability assumption, which is the following unilateral concentrability with a target policy $\pi^\star$:
$$C_{\mathrm{uni}}(\pi^\star) := \max_{h, i, \mu_i, c, s \in \mathcal{S}, a \in \mathcal{A}} \frac{d_h^{\mu_i}(s_i, a_i|c)\, d_h^{\pi^\star_{-i}}(s_{-i}, a_{-i}|c)}{\sigma_h(s|c)\, \nu(a|s, c)},$$
where $s_{-i}$ and $a_{-i}$ are the joint state and action of the agents excluding $i$. Note that the single-agent all-policy concentrability coefficient $C_S C_{\mathrm{sin}}$ is indeed weaker than $C_{\mathrm{uni}}$ and we have $C_S C_{\mathrm{sin}} \le C_{\mathrm{uni}}(\pi^\star)$ for any $\pi^\star$. In the worst case, $C_{\mathrm{uni}}$ can still scale exponentially with the number of agents $N$, whereas our sample complexity scales with the IR $K$.
For Zhang et al. (2023b), in CGs, they have the following concentrability assumption:
$$C_{\mathrm{uni}}(\mathcal{R}, \pi^\star) := \max_{i, \mu_i, r \in \mathcal{R}_i} \frac{\mathbb{E}_{c \sim \rho,\, a_i \sim \mu_i(\cdot|c),\, a_{-i} \sim \pi^\star_{-i}(\cdot|c)}\big[(r - r^\star)^2\big]}{\mathbb{E}_{c \sim \rho,\, a \sim \nu(\cdot|c)}\big[(r - r^\star)^2\big]}.$$
In their work $\mathcal{R}_i$ can be a general function class and thus $C_{\mathrm{uni}}(\mathcal{R}, \pi^\star)$ can be as large as $C_{\mathrm{uni}}(\pi^\star)$ in the worst case. Notably, if we use a function class with $K$-IR instead, Theorem 1 shows that $C_{\mathrm{uni}}(\mathcal{R}, \pi^\star) \lesssim C_{\mathrm{sin}}^K$. Therefore, we indeed find a particular function class such that the concentrability in Zhang et al. (2023b) is not vacuous. For MGs, Zhang et al. (2023b) uses a function class to approximate the joint Q-function while we use $\mathcal{F}$ to approximate the single-agent Q-function, and thus the results are not directly comparable.

6 Experiments
In this section, we examine the practical implications of our results. With this in mind, our findings can be interpreted as providing the following guideline: use a reward or Q-function class with the smallest possible IR that can still represent the underlying true model. This approach strikes a balance between two factors: it ensures realizability by requiring that the model can be represented accurately, and it improves sample efficiency, as demonstrated in Theorem 2 and Theorem 3.

Implementation and experimental setting. To examine the usefulness of this observation, we study a simple offline CG environment. We implement the actor update in DR-AC as a single gradient descent update with respect to the TD3+BC objective (Fujimoto and Gu, 2021) from the Tianshou library (Weng et al., 2022). Further, recall that TD3+BC adds an explicit L2 regularization term that keeps the policy close to the data collection policy and thus fits into the framework of DR-AC. To test the potential benefits of low-rank reward critic architectures, we experimented with three different types, depicted in Figure 1: i) joint-action, ii) 2-IR, and iii) 1-IR reward critics. The joint-action reward critic is a general mapping from the joint action space to a number and, hence, is the most expressive; it can represent both 2-IR and 1-IR rewards. On the other hand, the 1-IR architecture is the least expressive, as it cannot represent 2-IR reward models, since it only accesses a single agent's action. Notably, we choose the number of parameters of the 2-IR and joint-action architectures to be of the same order of magnitude for a fair comparison.

The details of our environment setting are as follows (see Appendix A for additional information). We consider the continuous action setting, where $\forall i \in [N], a_i \in [-1, 1]$. The underlying reward model is a 2-IR function of the form $\forall i \in [N],\ r_i(s, a) = \sum_{j=1}^N a_i a_j / \sqrt{N} + \epsilon$, where $\epsilon \sim \mathrm{Uniform}(-\sigma, \sigma)$ and $\sigma > 0$. Further, we set the number of agents to $N = 50$. We collect offline data with the uniform policy and set the number of samples $M$ such that $\sigma N / M = 0.1$. In this noise regime, the reward model is learnable but the noise level may affect the training procedure. We experimented with a few architectures for each reward critic type and report the best one here. We also experimented with an additional environment in which the underlying reward is a 1-IR model (see additional results in Appendix A).

[Figure 2 (y-axis: Gap; x-axis: Time Steps; curves: 2-IR critic, 1-IR critic, Joint-action critic): Comparison of TD3+BC instantiated with different critic architectures, i) 1-IR critic, ii) 2-IR critic, and iii) joint-action critic. The underlying true reward is 2-IR. This figure showcases the advantage of using a 2-IR critic architecture compared to the 1-IR or general joint-action critics when the underlying model is 2-IR. The shaded area represents the standard error across trials.]
environment in which the underlying reward is a 1-IR model (see additional results in Appendix A).

Results. Experiment results are depicted in Figure 2. The 2-IR critic approach leads to the best
performing result by significant margin compared to the joint-action and 1-IR reward critics. For
the 2-IR critic the maximum gap across agents is the smallest, meaning the joint policy is in a
near equilibrium point. Interestingly, the simpler 1-IR model has the worst performance among
the three candidates. Such an approach for critic modeling is common in the online cooperative
MARL setting (Sunehag et al., 2017; Rashid et al., 2020; Yu et al., 2022). Nevertheless, as our
experiments show, it can dramatically fail in offline MARL. This is because in the online setting, the
agent can continually collect fresh samples to update the estimated 1-IR reward so that the critic can
learn accurate local approximations of the current expected reward even if the other agents’ policies
change. However, in offline setting, a 1-IR critic cannot make such updates because iterative data
collection is not allowed. In the offline MARL setting, single agent critic models may be severely
biased and degrade the performance of the learned policies.

7 Conclusions
In this work, we investigated the benefits of using reward models with low IR in the offline MARL setting. We showed that the sample complexity of learning an approximate equilibrium in offline MARL can scale exponentially with the IR instead of with the number of agents. Our proposed algorithm is a decentralized, no-regret learning algorithm that can be implemented in practical settings while utilizing standard RL algorithms. The empirical results demonstrate the superior performance of the critic with the smallest IR that can still represent the underlying true model in offline MARL, while the widely used single-agent critic can fail catastrophically in this setting. Moving forward, building critics with low IR in MARL is a promising direction for future work, as is exploring additional structural assumptions that alleviate the hardness of the MARL problem.

References
Agarwal, A., Kakade, S., Krishnamurthy, A., and Sun, W. (2020). Flambe: Structural complexity
and representation learning of low rank mdps. arXiv preprint arXiv:2006.10814.

Asghari, M., Fathollahi-Fard, A. M., Mirzapour Al-E-Hashem, S., and Dulebenets, M. A. (2022).
Transformation and linearization techniques in optimization: A state-of-the-art survey. Mathe-
matics, 10(2):283.

Aumann, R. J. (1987). Correlated equilibrium as an expression of bayesian rationality. Econometrica:


Journal of the Econometric Society, pages 1–18.

Bakhtin, A., Brown, N., Dinan, E., Farina, G., Flaherty, C., Fried, D., Goff, A., Gray, J., Hu, H.,
et al. (2022). Human-level play in the game of diplomacy by combining language models with
strategic reasoning. Science, 378(6624):1067–1074.

Barto, A. G., Sutton, R. S., and Anderson, C. W. (1983). Neuronlike adaptive elements that can
solve difficult learning control problems. IEEE transactions on systems, man, and cybernetics,
(5):834–846.

Brown, N. and Sandholm, T. (2019). Superhuman ai for multiplayer poker. Science, 365(6456):885–
890.

Cui, Q. and Du, S. S. (2022). Provably efficient offline multi-agent reinforcement learning via
strategy-wise bonus. Advances in Neural Information Processing Systems, 35:11739–11751.

Dann, C., Jiang, N., Krishnamurthy, A., Agarwal, A., Langford, J., and Schapire, R. E. (2018).
On oracle-efficient pac rl with rich observations. Advances in Neural Information Processing
Systems, 2018:1422–1432.

Daskalakis, C., Goldberg, P. W., and Papadimitriou, C. H. (2009). The complexity of computing a
nash equilibrium. SIAM Journal on Computing, 39(1):195–259.

Daskalakis, C., Golowich, N., and Zhang, K. (2023). The complexity of markov equilibrium in
stochastic games. In The Thirty Sixth Annual Conference on Learning Theory, pages 4180–4234.
PMLR.

DeWeese, A. and Qu, G. (2024). Locally interdependent multi-agent mdp: Theoretical framework
for decentralized agents with dynamic dependencies. arXiv preprint arXiv:2406.06823.

Erez, L., Lancewicki, T., Sherman, U., Koren, T., and Mansour, Y. (2023). Regret minimization
and convergence to equilibria in general-sum markov games. In International Conference on
Machine Learning, pages 9343–9373. PMLR.

Fujimoto, S. and Gu, S. S. (2021). A minimalist approach to offline reinforcement learning.


Advances in neural information processing systems, 34:20132–20145.

Galeotti, A., Goyal, S., Jackson, M. O., Vega-Redondo, F., and Yariv, L. (2010). Network games.
The review of economic studies, 77(1):218–244.

Grana, D. (2016). Bayesian linearized rock-physics inversion. Geophysics, 81(6):D625–D641.

Howson Jr, J. T. (1972). Equilibria of polymatrix games. Management Science, 18(5-part-1):312–


318.

Jain, P., Kakade, S. M., Kidambi, R., Netrapalli, P., and Sidford, A. (2018). Parallelizing stochastic
gradient descent for least squares regression: mini-batching, averaging, and model misspecifica-
tion. Journal of machine learning research, 18(223):1–42.

Jin, J., Song, C., Li, H., Gai, K., Wang, J., and Zhang, W. (2018). Real-time bidding with multi-agent
reinforcement learning in display advertising. In Proceedings of the 27th ACM international
conference on information and knowledge management, pages 2193–2201.

Jin, R., Chen, Z., Lin, Y., Song, J., and Wierman, A. (2024). Approximate global convergence of
independent learning in multi-agent systems. arXiv preprint arXiv:2405.19811.

Kalogiannis, F. and Panageas, I. (2024). Zero-sum polymatrix markov games: Equilibrium collapse
and efficient computation of nash equilibria. Advances in Neural Information Processing Systems,
36.

Lan, G. (2023). Policy mirror descent for reinforcement learning: Linear convergence, new sampling
complexity, and generalized problem classes. Mathematical programming, 198(1):1059–1106.

Lin, Y., Qu, G., Huang, L., and Wierman, A. (2021). Multi-agent reinforcement learning in stochastic
networked systems. Advances in neural information processing systems, 34:7825–7837.

Littman, M. L. (1994). Markov games as a framework for multi-agent reinforcement learning. In


Machine learning proceedings 1994, pages 157–163. Elsevier.

Liu, Q., Chung, A., Szepesvári, C., and Jin, C. (2022). When is partially observable reinforcement
learning not scary? arXiv preprint arXiv:2204.08967.

MacQueen, R. and Wright, J. (2024). Guarantees for self-play in multiplayer games via polymatrix
decomposability. Advances in Neural Information Processing Systems, 36.

Nanduri, V. and Das, T. K. (2007). A reinforcement learning model to assess market power under
auction-based energy pricing. IEEE transactions on Power Systems, 22(1):85–95.

Nash, J. F. et al. (1950). Non-cooperative games.

Park, C., Zhang, K., and Ozdaglar, A. (2024). Multi-player zero-sum markov games with networked
separable interactions. Advances in Neural Information Processing Systems, 36.

Qu, G., Wierman, A., and Li, N. (2020). Scalable reinforcement learning of localized policies for
multi-agent networked systems. In Learning for Dynamics and Control, pages 256–266. PMLR.

Rafailov, R., Sharma, A., Mitchell, E., Manning, C. D., Ermon, S., and Finn, C. (2024). Direct
preference optimization: Your language model is secretly a reward model. Advances in Neural
Information Processing Systems, 36.

Rashid, T., Samvelyan, M., De Witt, C. S., Farquhar, G., Foerster, J., and Whiteson, S. (2020).
Monotonic value function factorisation for deep multi-agent reinforcement learning. Journal of
Machine Learning Research, 21(178):1–51.
Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J.,
Antonoglou, I., Panneershelvam, V., Lanctot, M., et al. (2016). Mastering the game of Go with
deep neural networks and tree search. nature, 529(7587):484–489.
Song, Y., Zhou, Y., Sekhari, A., Bagnell, J. A., Krishnamurthy, A., and Sun, W. (2022). Hybrid rl:
Using both offline and online data can make rl efficient. arXiv preprint arXiv:2210.06718.
Sunehag, P., Lever, G., Gruslys, A., Czarnecki, W. M., Zambaldi, V., Jaderberg, M., Lanctot, M.,
Sonnerat, N., Leibo, J. Z., Tuyls, K., et al. (2017). Value-decomposition networks for cooperative
multi-agent learning. arXiv preprint arXiv:1706.05296.
Tseng, W.-C., Wang, T.-H. J., Lin, Y.-C., and Isola, P. (2022). Offline multi-agent reinforcement
learning with knowledge distillation. Advances in Neural Information Processing Systems,
35:226–237.
Uehara, M., Zhang, X., and Sun, W. (2021). Representation learning for online and offline rl in
low-rank mdps. arXiv preprint arXiv:2110.04652.
Vinyals, O., Babuschkin, I., Czarnecki, W. M., Mathieu, M., Dudzik, A., Chung, J., Choi, D. H.,
Powell, R., Ewalds, T., Georgiev, P., et al. (2019). Grandmaster level in starcraft ii using
multi-agent reinforcement learning. Nature, 575(7782):350–354.
Vonesh, E. F., Wang, H., and Majumdar, D. (2001). Generalized least squares, taylor series
linearization and fisher’s scoring in multivariate nonlinear regression. Journal of the American
Statistical Association, 96(453):282–291.
Wainwright, M. (2019). High-Dimensional Statistics: A Non-Asymptotic Viewpoint, volume 48 of
Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press.
Wang, X., Xu, H., Zheng, Y., and Zhan, X. (2024). Offline multi-agent reinforcement learning
with implicit global-to-local value regularization. Advances in Neural Information Processing
Systems, 36.
Weng, J., Chen, H., Yan, D., You, K., Duburcq, A., Zhang, M., Su, Y., Su, H., and Zhu, J. (2022).
Tianshou: A highly modularized deep reinforcement learning library. Journal of Machine
Learning Research, 23(267):1–6.
Wu, C., Kreidieh, A., Parvate, K., Vinitsky, E., and Bayen, A. M. (2017). Flow: Architecture and
benchmarking for reinforcement learning in traffic control. arXiv preprint arXiv:1710.05465, 10.
Xie, T., Cheng, C.-A., Jiang, N., Mineiro, P., and Agarwal, A. (2021). Bellman-consistent pessimism
for offline reinforcement learning. arXiv preprint arXiv:2106.06926.
Yang, Y., Ma, X., Li, C., Zheng, Z., Zhang, Q., Huang, G., Yang, J., and Zhao, Q. (2021).
Believe what you see: Implicit constraint approach for offline multi-agent reinforcement learning.
Advances in Neural Information Processing Systems, 34:10299–10312.

Yu, C., Velu, A., Vinitsky, E., Gao, J., Wang, Y., Bayen, A., and Wu, Y. (2022). The surprising ef-
fectiveness of ppo in cooperative multi-agent games. Advances in Neural Information Processing
Systems, 35:24611–24624.

Zhan, W., Cen, S., Huang, B., Chen, Y., Lee, J. D., and Chi, Y. (2023a). Policy mirror descent for
regularized reinforcement learning: A generalized framework with linear convergence. SIAM
Journal on Optimization, 33(2):1061–1091.

Zhan, W., Uehara, M., Kallus, N., Lee, J. D., and Sun, W. (2023b). Provable offline preference-based
reinforcement learning.

Zhan, W., Uehara, M., Sun, W., and Lee, J. D. (2022). Pac reinforcement learning for predictive
state representations. arXiv preprint arXiv:2207.05738.

Zhang, R., Zhang, Y., Konda, R., Ferguson, B., Marden, J., and Li, N. (2023a). Markov games with
decoupled dynamics: Price of anarchy and sample complexity. In 2023 62nd IEEE Conference
on Decision and Control (CDC), pages 8100–8107. IEEE.

Zhang, Y., Bai, Y., and Jiang, N. (2023b). Offline learning in markov games with general function
approximation. In International Conference on Machine Learning, pages 40804–40829. PMLR.

Zinkevich, M. (2003). Online convex programming and generalized infinitesimal gradient ascent. In
Proceedings of the 20th international conference on machine learning (icml-03), pages 928–936.

A Additional Experimental Details
In this section we give additional information on the experiment design. Additional hyper-parameters
related to training are given in Table 2.
Our high-level implementation follows the framework of DR-AC and has three steps:

1. Data collection. Collect data via a uniform policy, where each agent executes a random action
ai ∼ Uniform([−1, 1]) for all i ∈ [N ].

2. Learn critic. Learn $N$ reward critic models using LSR and the collected offline data. Namely, for each agent $i \in [N]$, estimate a reward critic by solving the following LSR:
$$\arg\min_{r \in \mathcal{R}_i} \sum_{(a, r_i) \in \mathcal{D}_h} (r(a) - r_i)^2.$$
We experiment with three reward critic types, namely, different reward classes $\mathcal{R}_i$: 1-IR, 2-IR, and joint-action critic models. We solve this by gradient descent, which iteratively samples a batch from $\mathcal{D}_h$ and takes a gradient step. Our method returns the critic with the smallest validation loss, calculated with respect to a holdout validation dataset, through the course of training. Lastly, if during the run the critic does not show improvement after the number of steps specified by the 'patience' parameter, we stop the run (see Table 2 for hyper-parameter values).

3. Learn actor. Apply TD3+BC on all agents to get a policy per agent.

Next we elaborate on the critic architectures we used and their implementation.

1. Joint-action critic. We experimented with architectures with 3 layers and 2 layers. Recall that
N is the number of agents. The 3 layer architectures are of size N × width × width × width × 1
where width ∈ [512, 1028, 2056], and the 2 layer architectures are of size N ×width×width×1
where width ∈ [128, 512, 2056].

2. 2-IR critic. We experimented with 2-layer architectures of size $2 \times \text{width} \times \text{width} \times 1$ where width $\in [64, 128, 256]$. For the $i$-th agent, there are $N$ such networks, where each network represents the interaction term with the $j$-th agent. Let this network be denoted as $\mathrm{DNN}_{ij}: \mathcal{A} \times \mathcal{A} \to \mathbb{R}$. With these, the reward of the $i$-th agent is given by $\hat{r}_i(a) = \sum_j \mathrm{DNN}_{ij}(a_i, a_j)$ (a code sketch of this architecture follows the list).

3. 1-IR critic. We experimented with 2 layer architectures of size 1 × width × width × 1 where
width ∈ [128, 256, 512], where the only input to the network is the action of the ith agent.
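A minimal PyTorch sketch of the 2-IR critic described in item 2 above (our own code; the widths, the toy ground-truth reward, and the short training loop are illustrative assumptions, not the exact configuration used in the experiments):

```python
import torch
import torch.nn as nn

class TwoIRCritic(nn.Module):
    """2-IR reward critic for one agent: one small MLP per other agent j maps the
    action pair (a_i, a_j) to a scalar, and the prediction is the sum over j,
    matching r_hat_i(a) = sum_j DNN_ij(a_i, a_j)."""
    def __init__(self, n_agents: int, agent_idx: int, width: int = 64):
        super().__init__()
        self.agent_idx = agent_idx
        self.pair_nets = nn.ModuleList([
            nn.Sequential(nn.Linear(2, width), nn.ReLU(),
                          nn.Linear(width, width), nn.ReLU(),
                          nn.Linear(width, 1))
            for _ in range(n_agents)
        ])

    def forward(self, actions: torch.Tensor) -> torch.Tensor:
        # actions: (batch, n_agents) joint continuous actions in [-1, 1]
        a_i = actions[:, self.agent_idx:self.agent_idx + 1]
        terms = [net(torch.cat([a_i, actions[:, j:j + 1]], dim=-1))
                 for j, net in enumerate(self.pair_nets)]
        return torch.stack(terms, dim=0).sum(dim=0)      # (batch, 1)

# Toy usage: fit agent 0's critic on synthetic 2-IR rewards by least squares.
torch.manual_seed(0)
n_agents, batch = 5, 256
critic = TwoIRCritic(n_agents, agent_idx=0)
opt = torch.optim.Adam(critic.parameters(), lr=1e-4)
a = 2 * torch.rand(batch, n_agents) - 1
r = (a[:, :1] * a).sum(dim=-1, keepdim=True) / n_agents ** 0.5
for _ in range(10):
    loss = ((critic(a) - r) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
```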

The metric which we measure is the maximum gap, defined by
$$\max_{i \in [N]} \max_{a_i \in [-1, 1]} \Bigg( a_i \bigg( a_i + \sum_{j \ne i} \pi_j \bigg) - \pi_i \bigg( \sum_{j \in [N]} \pi_j \bigg) \Bigg),$$

[Figure 3 (three panels — 2-IR Critic, Joint-Action Critic, 1-IR Critic; x-axis: Time Steps, y-axis: Maximum Gap; legends list the layer/width variants of each critic type): Comparison of TD3+BC instantiated with different critic architectures, i) 1-IR critic, ii) 2-IR critic, and iii) joint-action critic. The underlying true reward is 2-IR. The shaded area represents the standard error computed across trials.]

where πj is the policy for agent j (note that here we use deterministic policies). In particular, the
above expression obtains its maximum at ai = ±1.
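A short NumPy helper implementing this metric for deterministic policies (our own sketch; it uses the fact noted above that the inner maximum over $a_i \in [-1, 1]$ is attained at $a_i = \pm 1$):

```python
import numpy as np

def maximum_gap(pi: np.ndarray) -> float:
    """pi: array of shape (N,) with each agent's deterministic action in [-1, 1]."""
    total = pi.sum()
    others = total - pi                                 # sum_{j != i} pi_j, per agent
    on_policy = pi * total                              # pi_i * sum_j pi_j
    best_dev = np.maximum(1.0 + others, 1.0 - others)   # max over a_i in {-1, +1} of a_i * (a_i + others)
    return float(np.max(best_dev - on_policy))
```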
Details of the environment with the underlying reward of 2-IR are presented in Section 6.
Figure 3 depicts additional results that measure the performance of various architectures for the 2-IR
environment. As observed, the 2-IR critic consistently performs better compared to the joint-action
architecture and the 1-IR architecture.
We experimented with an additional environment in which the underlying reward model is a 1-IR reward model of the form $\forall i \in [N],\ r_i^\star(s, a) = a_i^2 + \epsilon$. Additional parameters of the environment
are similar to those described in Section 6. Since the underlying reward model is a 1-IR, we expect
the 1-IR critic type to result in good performance. Further, since the 2-IR critic is not significantly
more expressive compared to the 1-IR critic, we may expect it to have good performance as well.
Figure 4 depicts the results of this experiment for all reward critic types and architectures. These
show that both the 1-IR and 2-IR reward critics have good performance, whereas the joint-action
critic performs significantly worse with respect to the maximum gap metric.

[Figure 4 (three panels — 2-IR Critic, Joint-Action Critic, 1-IR Critic; x-axis: Time Steps, y-axis: Maximum Gap; legends list the layer/width variants of each critic type): Comparison of TD3+BC instantiated with different critic architectures, i) 1-IR critic, ii) 2-IR critic, and iii) joint-action critic. The underlying true reward is 1-IR. The shaded area represents the standard error computed across trials.]

Hyperparameter                 | Value
Critic learning rate           | 1e-4
Critic batch size              | 64
Patience parameter for critic  | 20
Actor learning rate            | 1e-3
Actor batch size               | 64
Number of epochs               | 500
Optimizer                      | Adam
Policy architecture            | MLP, 3 layers, width 128, w/ ReLU activations
TD3+BC α parameter             | 5
# of trials per experiment     | 10

Table 2: Hyperparameters used in the experiments.

B Q-function Estimation

In this section we provide a computationally efficient method to estimate $\hat{Q}_{i,h}^t$ in Algorithm 1. We assume access to a function class $\{\mathcal{F}_i\}_{i \in [N]}$ where $\mathcal{F}_i \subseteq \{f : \mathcal{C} \times \mathcal{S}_i \times \mathcal{A}_i \to [0, H]\}$ to approximate the single-agent Q-functions. The full algorithm is shown in Algorithm 2.

Specifically, in Algorithm 2, we will sample $c \sim \rho$, $s_{i,h} \sim \sigma_{i,h}(\cdot|c)$, $a_{i,h} \sim \frac{1}{2}\nu_{i,h}(\cdot|c, s_{i,h}) + \frac{1}{2}\pi_{i,h}^t(\cdot|c, s_{i,h})$, $(s_{-i,h}, a_{-i,h}) \sim \hat{d}_h^{\pi_{-i}^t}(\cdot|c)$, and then roll out the joint policy $\pi^t$ in $\hat{P}$. It can be observed that the cumulative reward $q$ is indeed an unbiased estimate of $\hat{Q}_{i,h}^t(c, s_{i,h}, a_{i,h})$. Notably, we sample the state $s_{i,h}$ from the offline dataset to leverage the offline information. We also sample $a_{i,h}$ from $\frac{1}{2}\nu_{i,h} + \frac{1}{2}\pi_{i,h}^t$ such that the actions can cover both the current policy $\pi_{i,h}^t$ and the competing policy $\mu_{i,h}$, which has bounded $\chi^2$-divergence from $\nu_{i,h}$. Then we only need to run LSR on the collected batch to estimate the Q-function. In summary, Algorithm 2 can be implemented with LSR oracles.

Algorithm 2 Q-function Estimation

1: Input: Estimated reward $\hat r$, estimated transition $\widehat P$, policy $\pi^t$, step $h$, agent $i$, function class $\mathcal{F}_i$.
2: $\mathcal{D}_{\mathrm{sim}} \leftarrow \emptyset$.
3: for $m = 1, \dots, M_{\mathrm{sim}}$ do
4:   Sample $c \sim \rho$.
5:   Execute $\pi^t$ in $\widehat P$ with public context $c$ until step $h$.
6:   Denote the current joint local state excluding agent $i$ by $s_{-i,h}$. Reset the state of agent $i$ to be $s_{i,h} \sim \sigma_{i,h}(\cdot|c)$.
7:   Execute $a_{i,h} \sim \frac12 \nu_{i,h}(\cdot|c,s_{i,h}) + \frac12 \pi^t_{i,h}(\cdot|c,s_{i,h})$ and $a_{-i,h} \sim \pi^t_{-i}(\cdot|c,s_{-i,h})$.
8:   Continue to execute the joint policy $\pi^t$ in $\widehat P$ until step $H$.
9:   Compute the cumulative reward $q$ starting from $(c, s_h, a_h)$ under the reward model $\hat r$. Add $(c, s_{i,h}, a_{i,h}, q)$ into $\mathcal{D}_{\mathrm{sim}}$.
10: Run LSR: $\widetilde Q^t_{i,h} = \arg\min_{f\in\mathcal{F}_i} \sum_{(c,s_{i,h},a_{i,h},q)\in\mathcal{D}_{\mathrm{sim}}} (f(c,s_{i,h},a_{i,h}) - q)^2$.
11: Return: $\widetilde Q^t_{i,h}$.
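As a concrete illustration, the following sketch implements the Monte-Carlo rollout and least-squares regression steps of Algorithm 2 in Python. The environment-model and policy interfaces (`sample_context`, `rollout_to_step`, `cumulative_reward`, and so on) are hypothetical placeholders standing in for the learned $\widehat P$, $\hat r$, and $\pi^t$; they are not part of the paper's code.

```python
# Minimal sketch of Algorithm 2 (assumed interfaces): roll out pi^t in the
# learned model P_hat, reset agent i's state from the offline state
# distribution, randomize its action between nu and pi^t, record the
# cumulative reward q under r_hat, and fit Q by least-squares regression.
import numpy as np

def estimate_q(model, pi_t, nu_i, sigma_ih, func_class, i, h, H, m_sim, rng):
    data = []
    for _ in range(m_sim):
        c = model.sample_context()                        # c ~ rho
        s = model.rollout_to_step(pi_t, c, h)             # joint local state at step h
        s[i] = sigma_ih.sample(c)                         # reset agent i from offline data
        mix = nu_i if rng.random() < 0.5 else pi_t[i]     # a_i ~ (nu_{i,h} + pi^t_{i,h}) / 2
        a = [pi_t[j].act(c, s[j], h) for j in range(len(s))]
        a[i] = mix.act(c, s[i], h)
        q = model.cumulative_reward(pi_t, c, s, a, h, H)  # roll out to H under r_hat
        data.append((c, s[i], a[i], q))
    # LSR over the function class: pick the f minimizing squared error on data.
    return min(func_class,
               key=lambda f: sum((f(c, si, ai) - q) ** 2 for c, si, ai, q in data))
```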

B.1 Theoretical Guarantee

Next we want to show that the estimated $\widetilde Q^t_{i,h}$ is close to $\widehat Q^t_{i,h}$. We have the following lemma:

Lemma 2 (Q-function estimation error). Suppose $\widehat Q^t_{i,h} \in \mathcal{F}_i$ for all $t, i, h$ and Assumption 4 holds. With probability at least $1-\delta$, we have for all $i, t, h$ and $\mu_i \in \Pi_i(C)$ that
$$\mathbb{E}_{c\sim\rho,\,s_i\sim d^{\mu_i}_h(\cdot|c)}\left[\left\langle \widehat Q^t_{i,h}(c,s_i,\cdot) - \widetilde Q^t_{i,h}(c,s_i,\cdot),\ \mu_{i,h}(\cdot|c,s_i) - \pi^t_{i,h}(\cdot|c,s_i)\right\rangle\right] \lesssim \sqrt{\frac{C C_S H^2 \log(NTH|\mathcal{F}|/\delta)}{M_{\mathrm{sim}}}}.$$
Recall that for any two functions $g$ and $g'$, $g \lesssim g'$ means that there exists a constant $c > 0$ such that $g < cg'$ always holds. Note that from the proof of Theorem 3, Lemma 2 suggests that we can use $\widetilde Q^t_{i,h}$ as a surrogate of $\widehat Q^t_{i,h}$, and Theorem 3 still holds as long as $M_{\mathrm{sim}} \gtrsim \frac{C C_S H^2 \log(NTH|\mathcal{F}|/\delta)}{\epsilon^2}$. Therefore, Algorithm 2 is indeed a computationally and statistically efficient Q-function estimator.

Proof of Lemma 2. From the guarantee of LSR (Lemma 13), we know with probability at least $1-\delta$ that for all $i\in[N]$, $t\in[T]$, $h\in[H]$,
$$\mathbb{E}_{c\sim\rho,\,s_i\sim\sigma_{i,h}(\cdot|c),\,a_i\sim\frac12\nu_{i,h}(\cdot|c,s_i)+\frac12\pi^t_{i,h}(\cdot|c,s_i)}\left[\left(\widehat Q^t_{i,h}(c,s_i,a_i)-\widetilde Q^t_{i,h}(c,s_i,a_i)\right)^2\right]\lesssim\frac{H^2\log(NTH|\mathcal{F}|/\delta)}{M_{\mathrm{sim}}}.$$
Therefore, from Cauchy-Schwarz, we have
$$\mathbb{E}_{c\sim\rho,\,s_i\sim d^{\mu_i}_h(\cdot|c),\,a_i\sim\pi^t_{i,h}(\cdot|c,s_i)}\left[\left|\widehat Q^t_{i,h}(c,s_i,a_i)-\widetilde Q^t_{i,h}(c,s_i,a_i)\right|\right]\lesssim\sqrt{\frac{C_S H^2\log(NTH|\mathcal{F}|/\delta)}{M_{\mathrm{sim}}}}.$$
On the other hand, since $\mu_i\in\Pi_i(C)$, from Lemma 5 we know
$$\mathbb{E}_{c\sim\rho,\,s_i\sim d^{\mu_i}_h(\cdot|c),\,a_i\sim\mu_{i,h}(\cdot|c,s_i)}\left[\left|\widehat Q^t_{i,h}(c,s_i,a_i)-\widetilde Q^t_{i,h}(c,s_i,a_i)\right|\right]\lesssim\sqrt{\frac{C C_S H^2\log(NTH|\mathcal{F}|/\delta)}{M_{\mathrm{sim}}}}.$$
Therefore we have
$$\mathbb{E}_{c\sim\rho,\,s_i\sim d^{\mu_i}_h(\cdot|c)}\left[\left\langle\widehat Q^t_{i,h}(c,s_i,\cdot)-\widetilde Q^t_{i,h}(c,s_i,\cdot),\ \mu_{i,h}(\cdot|c,s_i)-\pi^t_{i,h}(\cdot|c,s_i)\right\rangle\right]\lesssim\sqrt{\frac{C C_S H^2\log(NTH|\mathcal{F}|/\delta)}{M_{\mathrm{sim}}}}.$$
C Proofs in Section 3
C.1 Proof of Theorem 1
We first define a specific IR decomposition for any function $f$ in Definition 2 that will be useful in the rest of the proof.

Lemma 3 (Standardized IR Decomposition). For any function $f$ with interaction rank $K$ and training distribution $x\sim p$, $y_i\sim p_i(\cdot|x)$ for all $i$, there exists a group of sub-functions $\cup_{0\le k\le K-1}\{g'_{j_1,\dots,j_k}\}_{j_1<\dots<j_k}$ such that
$$\mathbb{E}_{y_{j_l}\sim p_{j_l}(\cdot|x)}\left[g'_{j_1,\dots,j_k}(x,y_{j_1},\dots,y_{j_k})\right] = 0 \qquad (5)$$
for all $k\in[K-1]$, $l\in[k]$, $x\in\mathcal{X}$, $y_{j_{l'}}\in\mathcal{Y}_{j_{l'}}$ ($l'\ne l$), and
$$f(x,y_1,\dots,y_W) = \sum_{k=0}^{K-1}\sum_{1\le j_1<\dots<j_k\le W} g'_{j_1,\dots,j_k}(x,y_{j_1},\dots,y_{j_k}),\qquad \forall x\in\mathcal{X},\ y_1\in\mathcal{Y}_1,\dots,y_W\in\mathcal{Y}_W.$$
We call this group of sub-functions $\cup_{0\le k\le K-1}\{g'_{j_1,\dots,j_k}\}_{j_1<\dots<j_k}$ the standardized IR decomposition of $f$.

The standardized decomposition separates the variations and the mean of $f$ under the training distribution. With Lemma 3, we can upper bound the per-sub-function fitting error by simply fitting their summation $f$:

Lemma 4 (Sub-function Alignment). For any functions $f^\star$ and $\hat f$ with interaction rank $K$, let $\cup_{0\le k\le K-1}\{g_{j_1,\dots,j_k}\}_{j_1<\dots<j_k}$ and $\cup_{0\le k\le K-1}\{\hat g_{j_1,\dots,j_k}\}_{j_1<\dots<j_k}$ denote the standardized decompositions of $f^\star$ and $\hat f$ from Lemma 3. Assume that the following holds:
$$\mathbb{E}_{x\sim p,\,y_1\sim p_1(\cdot|x),\dots,y_W\sim p_W(\cdot|x)}\left[\left((f^\star-\hat f)(x,y_1,\dots,y_W)\right)^2\right] \le \epsilon.$$
Then for any $0\le k\le K-1$ and $1\le j_1<\dots<j_k\le W$, we have
$$\mathbb{E}_{x\sim p,\,y_{j_1}\sim p_{j_1}(\cdot|x),\dots,y_{j_k}\sim p_{j_k}(\cdot|x)}\left[\left(\Delta_{j_1,\dots,j_k}(x,y_{j_1},\dots,y_{j_k})\right)^2\right] \le 2^k\epsilon,$$
where $\Delta_{j_1,\dots,j_k} := g_{j_1,\dots,j_k} - \hat g_{j_1,\dots,j_k}$.

Lemma 4 implies that the learning error of the standardized sub-functions can be efficiently upper bounded by the fitting error of $f$ when the interaction rank is small. This property is the key reason why interaction rank is a more precise measure of function complexity than input size.
Now let $\cup_{0\le k\le K-1}\{g_{j_1,\dots,j_k}\}_{j_1<\dots<j_k}$ and $\cup_{0\le k\le K-1}\{\hat g_{j_1,\dots,j_k}\}_{j_1<\dots<j_k}$ denote the standardized decompositions of $f$ and $\hat f$. Then from Lemma 4, we know for any $0\le k\le K-1$ and $j_1<\dots<j_k$ that
$$\mathbb{E}_{x\sim p,\,y_{j_1}\sim p_{j_1}(\cdot|x),\dots,y_{j_k}\sim p_{j_k}(\cdot|x)}\left[\left(\Delta_{j_1,\dots,j_k}(x,y_{j_1},\dots,y_{j_k})\right)^2\right] \le 2^k\epsilon,$$
which, since the distribution shift between $p'$ and $p$ is bounded (with coefficient $C_{\mathrm{DS}}$), implies that
$$\mathbb{E}_{x\sim p',\,y_{j_1}\sim p'_{j_1}(\cdot|x),\dots,y_{j_k}\sim p'_{j_k}(\cdot|x)}\left[\left(\Delta_{j_1,\dots,j_k}(x,y_{j_1},\dots,y_{j_k})\right)^2\right] \le (C_{\mathrm{DS}})^{k+1}\, 2^k\epsilon. \qquad (6)$$

On the other hand, we know
$$\mathbb{E}_{x\sim p',\,y_1\sim p'_1(\cdot|x),\dots,y_W\sim p'_W(\cdot|x)}\left[\left((f-\hat f)(x,y_1,\dots,y_W)\right)^2\right]=\mathbb{E}_{x\sim p',\,y_1\sim p'_1(\cdot|x),\dots,y_W\sim p'_W(\cdot|x)}\left[\left(\sum_{k=0}^{K-1}\sum_{1\le j_1<\dots<j_k\le W}\Delta_{j_1,\dots,j_k}(x,y_{j_1},\dots,y_{j_k})\right)^2\right]$$
$$\lesssim W^{K-1}\sum_{k=0}^{K-1}\sum_{1\le j_1<\dots<j_k\le W}\mathbb{E}_{x\sim p',\,y_{j_1}\sim p'_{j_1}(\cdot|x),\dots,y_{j_k}\sim p'_{j_k}(\cdot|x)}\left[\left(\Delta_{j_1,\dots,j_k}(x,y_{j_1},\dots,y_{j_k})\right)^2\right],$$
where the last step is due to the AM-GM inequality. Now substituting Eq. (6) into the above inequality, we have
$$\mathbb{E}_{x\sim p',\,y_1\sim p'_1(\cdot|x),\dots,y_W\sim p'_W(\cdot|x)}\left[\left((f-\hat f)(x,y_1,\dots,y_W)\right)^2\right]\lesssim(2W)^{2(K-1)}C_{\mathrm{DS}}^{K}\,\epsilon.$$
This concludes our proof.

C.2 Proof of Lemma 3


From Definition 2, we know that there exists a group of sub-functions $\cup_{0\le k\le K-1}\{g_{j_1,\dots,j_k}\}_{j_1<\dots<j_k}$ which satisfies
$$f(x,y_1,\dots,y_W)=\sum_{k=0}^{K-1}\sum_{1\le j_1<\dots<j_k\le W}g_{j_1,\dots,j_k}(x,y_{j_1},\dots,y_{j_k}),\qquad\forall x\in\mathcal{X},\ y_1\in\mathcal{Y}_1,\dots,y_W\in\mathcal{Y}_W.$$
We prove the proposition by induction on $K$. For $K=1$, Lemma 3 holds trivially. Now suppose the proposition holds for $K-1$ where $K\ge2$. Then for any $\{j_l\}_{l=1}^{K-1}$, we construct $g'_{j_1,\dots,j_{K-1}}(x,y_{j_1},\dots,y_{j_{K-1}})$ as follows:
$$g'_{j_1,\dots,j_{K-1}}(x,y_{j_1},\dots,y_{j_{K-1}})=g_{j_1,\dots,j_{K-1}}(x,y_{j_1},\dots,y_{j_{K-1}})+\sum_{k=1}^{K-1}\sum_{1\le l_1<\dots<l_k\le K-1}(-1)^k\,\mathbb{E}_{y_{j_{l_1}}\sim p_{j_{l_1}}(\cdot|x),\dots,y_{j_{l_k}}\sim p_{j_{l_k}}(\cdot|x)}\left[g_{j_1,\dots,j_{K-1}}(x,y_{j_1},\dots,y_{j_{K-1}})\right].$$
It can be verified that $g'_{j_1,\dots,j_{K-1}}(x,y_{j_1},\dots,y_{j_{K-1}})$ satisfies the property of the standardized decomposition, i.e., Eq. (5). Now consider the function $f'$:
$$f'(x,y_1,\dots,y_W)=f(x,y_1,\dots,y_W)-\sum_{1\le j_1<\dots<j_{K-1}\le W}g'_{j_1,\dots,j_{K-1}}(x,y_{j_1},\dots,y_{j_{K-1}}).$$
Note that $f'$ satisfies Definition 2 with IR $K-1$. By the induction hypothesis, there exists a standardized decomposition of $f'$:
$$f'(x,y_1,\dots,y_W)=\sum_{k=0}^{K-2}\sum_{1\le j_1<\dots<j_k\le W}g'_{j_1,\dots,j_k}(x,y_{j_1},\dots,y_{j_k}),\qquad\forall x\in\mathcal{X},\ y_1\in\mathcal{Y}_1,\dots,y_W\in\mathcal{Y}_W,$$
where $g'_{j_1,\dots,j_k}$ satisfies the requirement in Eq. (5) for all $k\in[K-2]$. This implies that
$$f(x,y_1,\dots,y_W)=\sum_{k=0}^{K-1}\sum_{1\le j_1<\dots<j_k\le W}g'_{j_1,\dots,j_k}(x,y_{j_1},\dots,y_{j_k}),\qquad\forall x\in\mathcal{X},\ y_1\in\mathcal{Y}_1,\dots,y_W\in\mathcal{Y}_W,$$
where $g'_{j_1,\dots,j_k}$ satisfies the requirement in Eq. (5) for all $k\in[K-1]$. Therefore the claim holds for $K$ as well, and the proposition follows by induction.
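To make the construction above tangible, the following sketch (with toy dimensions chosen arbitrarily) builds the standardized decomposition for a tabular function with $W=2$ and $K=2$ and numerically checks the zero-mean property in Eq. (5) and the exact reconstruction of $f$.

```python
# Numerical check of the standardized IR decomposition (Lemma 3) in a toy
# tabular case with W = 2 variables and interaction rank K = 2, i.e.
# f(x, y1, y2) = g0(x) + g1(x, y1) + g2(x, y2). Sizes are arbitrary assumptions.
import numpy as np

rng = np.random.default_rng(0)
X, Y1, Y2 = 3, 4, 5
g0, g1, g2 = rng.normal(size=X), rng.normal(size=(X, Y1)), rng.normal(size=(X, Y2))
f = g0[:, None, None] + g1[:, :, None] + g2[:, None, :]     # a 2-IR function

p1 = rng.dirichlet(np.ones(Y1), size=X)                      # p_1(y1 | x)
p2 = rng.dirichlet(np.ones(Y2), size=X)                      # p_2(y2 | x)

# Standardized decomposition: the mean term and centered single-variable terms.
g0_std = np.einsum('xab,xa,xb->x', f, p1, p2)                # E_{y1,y2}[f | x]
g1_std = np.einsum('xab,xb->xa', f, p2) - g0_std[:, None]    # centered in y1
g2_std = np.einsum('xab,xa->xb', f, p1) - g0_std[:, None]    # centered in y2

# Zero-mean property (Eq. (5)) and exact reconstruction of f.
assert np.allclose(np.einsum('xa,xa->x', g1_std, p1), 0, atol=1e-10)
assert np.allclose(np.einsum('xb,xb->x', g2_std, p2), 0, atol=1e-10)
assert np.allclose(g0_std[:, None, None] + g1_std[:, :, None] + g2_std[:, None, :], f)
print("standardized decomposition verified for the 2-IR toy example")
```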

C.3 Proof of Lemma 4


Fix any $0\le k\le K-1$ and $1\le j_1<\dots<j_k\le W$. With a slight abuse of notation, we also use $f^\star$ and $\hat f$ to denote the expected function values under the training distribution:
$$f^\star(x,y_{j_1},\dots,y_{j_k}):=\mathbb{E}_{y_j\sim p_j(\cdot|x),\,\forall j\notin\{j_l\}_{l\in[k]}}\left[f^\star(x,y_1,\dots,y_W)\right],\qquad \hat f(x,y_{j_1},\dots,y_{j_k}):=\mathbb{E}_{y_j\sim p_j(\cdot|x),\,\forall j\notin\{j_l\}_{l\in[k]}}\left[\hat f(x,y_1,\dots,y_W)\right].$$
From the Cauchy-Schwarz inequality, we can observe that
$$\mathbb{E}_{x\sim p,\,y_{j_l}\sim p_{j_l}(\cdot|x),\forall l\in[k]}\left[\left((f^\star-\hat f)(x,y_{j_1},\dots,y_{j_k})\right)^2\right]\le\epsilon. \qquad (7)$$
Since we are considering the standardized decomposition, from Lemma 3 we have
$$f^\star(x,y_{j_1},\dots,y_{j_k})=\sum_{k'=0}^{k}\sum_{1\le l_1<\dots<l_{k'}\le k}g_{j_{l_1},\dots,j_{l_{k'}}}(x,y_{j_{l_1}},\dots,y_{j_{l_{k'}}}),\qquad \hat f(x,y_{j_1},\dots,y_{j_k})=\sum_{k'=0}^{k}\sum_{1\le l_1<\dots<l_{k'}\le k}\hat g_{j_{l_1},\dots,j_{l_{k'}}}(x,y_{j_{l_1}},\dots,y_{j_{l_{k'}}}).$$
Now we use a symmetrization trick to prove the result. Consider the following symmetrization operation applied to $f^\star$:
$$\mathcal{G}(f^\star)(x,\{y_{j_l}\}_{l\in[k]},\{y'_{j_l}\}_{l\in[k]}):=\sum_{k'=0}^{k}\sum_{1\le l_1<\dots<l_{k'}\le k}(-1)^{k'}f^\star(x,\{y_{j_l}\}_{l\notin\{l_1,\dots,l_{k'}\}},\{y'_{j_l}\}_{l\in\{l_1,\dots,l_{k'}\}}).$$
It can be verified that
$$\mathcal{G}(f^\star)(x,\{y_{j_l}\}_{l\in[k]},\{y'_{j_l}\}_{l\in[k]})=\sum_{k'=0}^{k}\sum_{1\le l_1<\dots<l_{k'}\le k}(-1)^{k'}g_{j_1,\dots,j_k}(x,\{y_{j_l}\}_{l\notin\{l_1,\dots,l_{k'}\}},\{y'_{j_l}\}_{l\in\{l_1,\dots,l_{k'}\}}).$$
This implies that we have
$$(\mathcal{G}(f^\star-\hat f))(x,\{y_{j_l}\}_{l\in[k]},\{y'_{j_l}\}_{l\in[k]})=\sum_{k'=0}^{k}\sum_{1\le l_1<\dots<l_{k'}\le k}(-1)^{k'}\Delta_{j_1,\dots,j_k}(x,\{y_{j_l}\}_{l\notin\{l_1,\dots,l_{k'}\}},\{y'_{j_l}\}_{l\in\{l_1,\dots,l_{k'}\}}). \qquad (8)$$
On the one hand, from the AM-GM inequality and Eq. (7) we have
$$\mathbb{E}_{x\sim p,\,y_{j_l}\sim p_{j_l}(\cdot|x),\,y'_{j_l}\sim p_{j_l}(\cdot|x),\forall l\in[k]}\left[\left((\mathcal{G}(f^\star-\hat f))(x,\{y_{j_l}\}_{l\in[k]},\{y'_{j_l}\}_{l\in[k]})\right)^2\right]\le 2^{2k}\epsilon.$$
On the other hand, we can expand the left-hand side of the above inequality:
$$\mathbb{E}_{x\sim p,\,y_{j_l}\sim p_{j_l}(\cdot|x),\,y'_{j_l}\sim p_{j_l}(\cdot|x),\forall l\in[k]}\left[\left((\mathcal{G}(f^\star-\hat f))(x,\{y_{j_l}\}_{l\in[k]},\{y'_{j_l}\}_{l\in[k]})\right)^2\right]$$
$$=\mathbb{E}_{x\sim p,\,y_{j_l}\sim p_{j_l}(\cdot|x),\,y'_{j_l}\sim p_{j_l}(\cdot|x),\forall l\in[k]}\left[\left(\sum_{k'=0}^{k}\sum_{1\le l_1<\dots<l_{k'}\le k}(-1)^{k'}\Delta_{j_1,\dots,j_k}(x,\{y_{j_l}\}_{l\notin\{l_1,\dots,l_{k'}\}},\{y'_{j_l}\}_{l\in\{l_1,\dots,l_{k'}\}})\right)^2\right]$$
$$=2^k\,\mathbb{E}_{x\sim p,\,y_{j_l}\sim p_{j_l}(\cdot|x),\forall l\in[k]}\left[\left(\Delta_{j_1,\dots,j_k}(x,y_{j_1},\dots,y_{j_k})\right)^2\right],$$
where the second step is due to Eq. (8) and the third step holds because the cross terms are zero, owing to the independence between $y_j$ and $y'_j$ given $x$ and to Lemma 3. Therefore we have
$$\mathbb{E}_{x\sim p,\,y_{j_1}\sim p_{j_1}(\cdot|x),\dots,y_{j_k}\sim p_{j_k}(\cdot|x)}\left[\left(\Delta_{j_1,\dots,j_k}(x,y_{j_1},\dots,y_{j_k})\right)^2\right]\le 2^k\epsilon,$$
which concludes our proof.
D Proof of Theorem 2
We first present the formal statement of Theorem 2:
Theorem 4. Suppose Assumption 1 holds. Let $\Pi_i(C):=\{\mu_i:\mathbb{E}_{c\sim\rho}[\chi^2(\mu_i(c),\nu_i(c))]\le C\}$ denote the policy class with bounded $\chi^2$-divergence from the behavior policy $\nu_i$. Fix any $\delta\in(0,1]$ and select
$$T=(2N^2)^{-\frac{2K-2}{3K-1}}\epsilon^{-\frac{2}{3K-1}},\qquad \eta=\lambda=(2N^2)^{\frac{K-1}{3K-1}}\epsilon^{\frac{1}{3K-1}}.$$
Then with probability at least $1-\delta$, we have
$$\max_i\mathrm{Gap}_i(\hat\pi)\lesssim\max_{i\in[N]}\min_{C\ge1}\left\{C\left((2N^2)^{K-1}\epsilon\right)^{\frac{1}{3K-1}}+\mathrm{subopt}_i(C,\hat\pi)\right\},$$
where $\mathrm{subopt}_i(C,\hat\pi):=\max_{\mu_i\in\Pi_i}r_i^\star(\mu_i,\hat\pi_{-i})-\max_{\mu_i\in\Pi_i(C)}r_i^\star(\mu_i,\hat\pi_{-i})$ is the off-support bias.

Proof of Theorem 4. Note that for any agent $i\in[N]$ and policy $\mu_i\in\Pi_i(C)$ where $C\ge1$, we have
$$\sum_{t=1}^T\left(r_i^\star(\mu_i,\pi_{-i}^t)-r_i^\star(\pi^t)\right)=\underbrace{\sum_{t=1}^T\mathbb{E}_{c\sim\rho,\,a_i\sim\mu_i(c),\,a_{-i}\sim\pi_{-i}^t}\left[(r_i^\star-\hat r_i)(c,a)\right]}_{(1)}+\underbrace{\sum_{t=1}^T\mathbb{E}_{c\sim\rho,\,a\sim\pi^t}\left[(\hat r_i-r_i^\star)(c,a)\right]}_{(2)}+\underbrace{\sum_{t=1}^T\left(\hat r_i(\mu_i,\pi_{-i}^t)-\hat r_i(\pi^t)\right)}_{(3)}. \qquad (9)$$
With a slight abuse of notation, we use $\cup_{0\le k\le K-1}\{g^i_{j_1,\dots,j_k}\}$ and $\cup_{0\le k\le K-1}\{\hat g^i_{j_1,\dots,j_k}\}$ to denote the standardized decompositions of $r_i^\star$ and $\hat r_i$, as defined in Lemma 3, and $\Delta^i_{j_1,\dots,j_k}$ to denote $g^i_{j_1,\dots,j_k}-\hat g^i_{j_1,\dots,j_k}$. First note that from Lemma 4, we have for all $i\in[N]$, $0\le k\le K-1$ and $1\le j_1<\dots<j_k\le N$ where $j_l\ne i$ for all $l\in[k]$ that
$$\mathbb{E}_{c\sim\rho,\,a_i\sim\nu_i(\cdot|c),\,a_{j_l}\sim\nu_{j_l}(\cdot|c),\forall l\in[k]}\left[\left(\Delta^i_{j_1,\dots,j_k}(c,a_i,a_{j_1},\dots,a_{j_k})\right)^2\right]\le 2^k\epsilon. \qquad (10)$$
Next we bound terms (1), (2) and (3) in Eq. (9) respectively.

Bounding term (1). For term (1), from Lemma 3, we know for all policies $\mu_i\in\Pi_i(C)$ where $C\ge1$ that
$$\mathbb{E}_{c\sim\rho,\,a_i\sim\mu_i(\cdot|c),\,a_{-i}\sim\pi_{-i}^t(\cdot|c)}\left[(r_i^\star-\hat r_i)(c,a)\right]=\sum_{k=0}^{K-1}\ \sum_{1\le j_1<\dots<j_k\le N:\,j_l\ne i,\forall l\in[k]}\mathbb{E}_{c\sim\rho,\,a_i\sim\mu_i(\cdot|c),\,a_{j_l}\sim\pi^t_{j_l}(\cdot|c),\forall l\in[k]}\left[\Delta^i_{j_1,\dots,j_k}(c,a_i,a_{j_1},\dots,a_{j_k})\right].$$
To quantify the above transfer error, we have the following lemma, which leverages the $\chi^2$-divergence between the target distribution and the training distribution:
Lemma 5. For two distributions $d^1,d^2\in\Delta(\mathcal{Z})$ and any function $f$ defined on $\mathcal{Z}$, we have
$$\mathbb{E}_{z\sim d^1}[f(z)]\le\sqrt{\mathbb{E}_{z\sim d^2}\left[(f(z))^2\right]\left(1+\chi^2(d^1,d^2)\right)}.$$
Proof. Note that we have
$$1+\chi^2(d^1,d^2)=1+\sum_{z\in\mathcal{Z}}\frac{(d^1(z)-d^2(z))^2}{d^2(z)}=\sum_{z\in\mathcal{Z}}\frac{(d^1(z))^2}{d^2(z)}.$$
Then the lemma follows directly from the Cauchy-Schwarz inequality.


From Lemma 5 we have
$$\mathbb{E}_{c\sim\rho,\,a_i\sim\mu_i(\cdot|c),\,a_{j_l}\sim\pi^t_{j_l}(\cdot|c),\forall l\in[k]}\left[\Delta^i_{j_1,\dots,j_k}(c,a_i,a_{j_1},\dots,a_{j_k})\right]$$
$$\le\sqrt{\mathbb{E}_{c\sim\rho,\,a_i\sim\nu_i(c),\,a_{j_l}\sim\nu_{j_l}(\cdot|c),\forall l\in[k]}\left[\left(\Delta^i_{j_1,\dots,j_k}(c,a_i,a_{j_1},\dots,a_{j_k})\right)^2\right]}\cdot\sqrt{1+\chi^2\!\left(\rho\circ\Big(\mu_i\times\prod_{l\in[k]}\pi^t_{j_l}\Big),\ \rho\circ\Big(\nu_i\times\prod_{l\in[k]}\nu_{j_l}\Big)\right)}$$
$$\le\sqrt{2^k\epsilon\left(1+\chi^2\!\left(\rho\circ\Big(\mu_i\times\prod_{l\in[k]}\pi^t_{j_l}\Big),\ \rho\circ\Big(\nu_i\times\prod_{l\in[k]}\nu_{j_l}\Big)\right)\right)},$$
where recall that we use $\rho\circ p$ to denote the joint distribution $c\sim\rho$, $a\sim p(\cdot|c)$ for a conditional distribution $p$, and the last step utilizes Eq. (10).
Now we only need to bound the $\chi^2$-divergence between $\rho\circ\big(\mu_i\times\prod_{l\in[k]}\pi^t_{j_l}\big)$ and $\rho\circ\big(\nu_i\times\prod_{l\in[k]}\nu_{j_l}\big)$. We achieve this with the following lemma:
Lemma 6. For any $2k$ policies $\{p_j\}_{j=1}^k$ and $\{q_j\}_{j=1}^k$, we have
$$1+\chi^2\!\left(\rho\circ\prod_{j=1}^k p_j,\ \rho\circ\prod_{j=1}^k q_j\right)=\mathbb{E}_{c\sim\rho}\left[\prod_{j=1}^k\left(1+\chi^2(p_j(c),q_j(c))\right)\right].$$
Proof. Note that we have
$$1+\chi^2\!\left(\rho\circ\prod_{j=1}^k p_j,\ \rho\circ\prod_{j=1}^k q_j\right)=\sum_{c,a_1,\dots,a_k}\frac{\left(\rho(c)\prod_{j\in[k]}p_j(a_j|c)\right)^2}{\rho(c)\prod_{j\in[k]}q_j(a_j|c)}=\sum_{c}\rho(c)\sum_{a_1,\dots,a_k}\frac{\left(\prod_{j\in[k]}p_j(a_j|c)\right)^2}{\prod_{j\in[k]}q_j(a_j|c)}=\sum_{c}\rho(c)\prod_{j\in[k]}\left(\sum_{a_j}\frac{(p_j(a_j|c))^2}{q_j(a_j|c)}\right)$$
$$=\sum_{c}\rho(c)\prod_{j\in[k]}\left(1+\chi^2(p_j(c),q_j(c))\right)=\mathbb{E}_{c\sim\rho}\left[\prod_{j=1}^k\left(1+\chi^2(p_j(c),q_j(c))\right)\right].$$

Therefore, from Lemma 6 we have
$$1+\chi^2\!\left(\rho\circ\Big(\mu_i\times\prod_{l\in[k]}\pi^t_{j_l}\Big),\ \rho\circ\Big(\nu_i\times\prod_{l\in[k]}\nu_{j_l}\Big)\right)=\mathbb{E}_{c\sim\rho}\left[\big(\chi^2(\mu_i(c),\nu_i(c))+1\big)\prod_{l\in[k]}\big(\chi^2(\pi^t_{j_l}(c),\nu_{j_l}(c))+1\big)\right]. \qquad (11)$$
Meanwhile, from the policy update formula Eq. (2), we have for all $t\in[T]$ and $c\in\mathcal{C}$:
$$-\langle\hat r_i^t(c,\cdot),\pi_i^{t+1}(c)\rangle+\lambda\chi^2(\pi_i^{t+1}(c),\nu_i(c))+\frac{1}{\eta}D_{c,i}(\pi_i^{t+1}(c),\pi_i^t(c))\le-\langle\hat r_i^t(c,\cdot),\pi_i^{t}(c)\rangle+\lambda\chi^2(\pi_i^{t}(c),\nu_i(c))+\frac{1}{\eta}D_{c,i}(\pi_i^{t}(c),\pi_i^t(c)).$$
Since $D_{c,i}(\pi_i^t(c),\pi_i^t(c))=0$ and $\hat r_i^t\in[0,1]$, we know
$$\chi^2(\pi_i^{t+1}(c),\nu_i(c))\le\chi^2(\pi_i^{t}(c),\nu_i(c))+\frac{1}{\lambda}.$$
Since $\chi^2(\pi_i^1(c),\nu_i(c))=\chi^2(\nu_i(c),\nu_i(c))=0$, we have for all $t\in[T+1]$ and $c\in\mathcal{C}$ that
$$\chi^2(\pi_i^t(c),\nu_i(c))\le\frac{t-1}{\lambda}. \qquad (12)$$
Substituting Eq. (12) into Eq. (11), we have
$$\mathbb{E}_{c\sim\rho}\left[\big(\chi^2(\mu_i(c),\nu_i(c))+1\big)\prod_{l\in[k]}\big(\chi^2(\pi^t_{j_l}(c),\nu_{j_l}(c))+1\big)\right]\le\mathbb{E}_{c\sim\rho}\left[\chi^2(\mu_i(c),\nu_i(c))+1\right]\left(\frac{T}{\lambda}\right)^k\le(C+1)\left(\frac{T}{\lambda}\right)^k,$$
where the second step is due to $\mu_i\in\Pi_i(C)$.

Therefore, we have for all policies $\mu_i\in\Pi_i(C)$ where $C\ge1$ that
$$\mathbb{E}_{c\sim\rho,\,a_i\sim\mu_i(c),\,a_{j_l}\sim\pi^t_{j_l}(\cdot|c),\forall l\in[k]}\left[\Delta^i_{j_1,\dots,j_k}(c,a_i,a_{j_1},\dots,a_{j_k})\right]\lesssim\sqrt{C\epsilon\cdot\left(\frac{2T}{\lambda}\right)^k}.$$
This implies that
$$(1)\lesssim T\sum_{k=0}^{K-1}C^k_{N-1}\sqrt{C\epsilon\cdot\left(\frac{2T}{\lambda}\right)^k}\lesssim T\sqrt{C\epsilon\cdot\left(\frac{2TN^2}{\lambda}\right)^{K-1}}. \qquad (13)$$
Here $C^k_{N-1}$ denotes the binomial coefficient $\binom{N-1}{k}$.

Bounding term (2). Similarly, for term (2), following the same arguments as for term (1), we know for all policies $\mu_i\in\Pi_i(C)$ where $C\ge1$ that
$$\mathbb{E}_{c\sim\rho,\,a_i\sim\pi_i^t(c),\,a_{j_l}\sim\pi^t_{j_l}(c),\forall l\in[k]}\left[\Delta^i_{j_1,\dots,j_k}(c,a_i,a_{j_1},\dots,a_{j_k})\right]\le\sqrt{\epsilon\,\big(\mathbb{E}_{c\sim\rho}[f_{c,i}(\pi_i^t)]+1\big)\cdot\left(\frac{2T}{\lambda}\right)^k}.$$
Recall that we use $f_{c,i}(p)$ to denote the chi-squared divergence $\chi^2(p,\nu_i(c))$. Then with the AM-GM inequality, we have
$$\mathbb{E}_{c\sim\rho,\,a_i\sim\pi_i^t(c),\,a_{j_l}\sim\pi^t_{j_l}(c),\forall l\in[k]}\left[\Delta^i_{j_1,\dots,j_k}(c,a_i,a_{j_1},\dots,a_{j_k})\right]\le\frac{\lambda}{N^{K-1}}\,\mathbb{E}_{c\sim\rho}\left[f_{c,i}(\pi_i^t)\right]+\frac{N^{K-1}}{\lambda}\cdot\left(\frac{2T}{\lambda}\right)^k\cdot\epsilon+\sqrt{\epsilon\cdot\left(\frac{2T}{\lambda}\right)^k}.$$
Therefore, we have
$$(2)-\lambda\sum_{t=1}^T\mathbb{E}_{c\sim\rho}\left[f_{c,i}(\pi_i^t)\right]\lesssim\frac{T}{\lambda}\cdot\left(\frac{2TN^2}{\lambda}\right)^{K-1}\cdot\epsilon+T\sqrt{\epsilon\cdot\left(\frac{2TN^2}{\lambda}\right)^{K-1}}. \qquad (14)$$

Bounding term (3). First, we have the following lemma characterizing the no-regret guarantee of regularized policy gradient (see Appendix D.1 for the proof):

Lemma 7 (No-Regret Regularized Policy Gradient). Given a sequence of loss functions $\{l^t\}_{t\in[T]}$ where $l^t:\mathcal{X}\times\mathcal{Y}\to[0,B]$ for some $B>0$, and a reference policy $\nu:\mathcal{X}\mapsto\Delta_{\mathcal{Y}}$, suppose we initialize $p^1$ to be $\nu$ and run the following regularized policy gradient for $T$ iterations:
$$p^{t+1}(x)=\arg\min_{p\in\Delta_{\mathcal{Y}}}\ -\langle l^t(x,\cdot),p\rangle+\lambda\chi^2(p,\nu(x))+\frac{1}{\eta}D_x(p,p^t),$$
where $D_x(p,p^t)$ is the Bregman divergence between $p(x)$ and $p^t(x)$. Then for every policy $\mu$ and every $x\in\mathcal{X}$,
$$\sum_{t=1}^T\left\langle l^t(x),\mu(x)-p^t(x)\right\rangle+\lambda\sum_{t=1}^{T+1}\chi^2(p^t(x),\nu(x))\le\left(T\lambda+\frac{1}{\eta}\right)\chi^2(\mu(x),\nu(x))+\frac{\eta TB^2}{4}.$$

Note that $(3)=\sum_{t=1}^T\mathbb{E}_{c\sim\rho}\left[\langle\hat r_i^t(c),\mu_i(c)-\pi_i^t(c)\rangle\right]$. Thus, Lemma 7 implies that for any policy $\mu_i\in\Pi_i(C)$, we have
$$(3)+\lambda\sum_{t=1}^T\mathbb{E}_{c\sim\rho}\left[f_{c,i}(\pi_i^t)\right]\lesssim TC\lambda+\frac{C}{\eta}+\frac{\eta T}{4}. \qquad (15)$$

Putting all pieces together. Now substituting Eqs. (13), (14), (15) into Eq. (9), we have for all policies $\mu_i\in\Pi_i(C)$ where $C\ge1$ that
$$r_i^\star(\mu_i,\hat\pi_{-i})-r_i^\star(\hat\pi)\lesssim C\lambda+\frac{C}{\eta T}+\frac{\eta}{4}+\sqrt{C\epsilon\cdot\left(\frac{2TN^2}{\lambda}\right)^{K-1}}+\frac{1}{\lambda}\cdot\left(\frac{2TN^2}{\lambda}\right)^{K-1}\cdot\epsilon.$$
Therefore, by setting
$$T=(2N^2)^{-\frac{2K-2}{3K-1}}\epsilon^{-\frac{2}{3K-1}},\qquad \eta=\lambda=(2N^2)^{\frac{K-1}{3K-1}}\epsilon^{\frac{1}{3K-1}},$$
we have for all policies $\mu_i\in\Pi_i(C)$ where $C\ge1$ that
$$r_i^\star(\mu_i,\hat\pi_{-i})-r_i^\star(\hat\pi)\lesssim C\left((2N^2)^{K-1}\epsilon\right)^{\frac{1}{3K-1}}.$$
This concludes our proof.

D.1 Proof of Lemma 7


Let $f_x(p)$ denote the $\chi^2$-divergence $\chi^2(p(x),\nu(x))$. First, by the first-order optimality condition of the policy update step, we know for all $p:\mathcal{X}\to\Delta_{\mathcal{Y}}$ and all $t\in[T]$, $x\in\mathcal{X}$ that
$$\left\langle-\eta l^t(x)+(1+\eta\lambda)\nabla f_x(p^{t+1})-\nabla f_x(p^t),\ p(x)-p^{t+1}(x)\right\rangle\ge0. \qquad (16)$$
This implies that for all $t\in[T]$, $x\in\mathcal{X}$ and any policy $\mu$, we have
$$\langle\eta l^t(x),\mu(x)-p^t(x)\rangle+\eta\lambda f_x(p^t)-\eta\lambda f_x(\mu)$$
$$=\left\langle\eta l^t(x)-(1+\eta\lambda)\nabla f_x(p^{t+1})+\nabla f_x(p^t),\ \mu(x)-p^{t+1}(x)\right\rangle+\left\langle\nabla f_x(p^{t+1})-\nabla f_x(p^t),\ \mu(x)-p^{t+1}(x)\right\rangle+\left\langle\eta l^t(x),\ p^{t+1}(x)-p^t(x)\right\rangle+\eta\lambda\left\langle\nabla f_x(p^{t+1}),\ \mu(x)-p^{t+1}(x)\right\rangle+\eta\lambda f_x(p^t)-\eta\lambda f_x(\mu)$$
$$\le\underbrace{\left\langle\nabla f_x(p^{t+1})-\nabla f_x(p^t),\ \mu(x)-p^{t+1}(x)\right\rangle}_{(4)}+\underbrace{\left\langle\eta l^t(x),\ p^{t+1}(x)-p^t(x)\right\rangle}_{(5)}+\underbrace{\eta\lambda\left\langle\nabla f_x(p^{t+1}),\ \mu(x)-p^{t+1}(x)\right\rangle+\eta\lambda f_x(p^t)-\eta\lambda f_x(\mu)}_{(6)},$$
where the inequality drops the first term, which is nonpositive by Eq. (16). Next we bound terms (4), (5) and (6) respectively.

First, for term (4), note that we have the following lemma:

Lemma 8. For any $p^1,p^2,p^3:\mathcal{X}\to\Delta_{\mathcal{Y}}$, we have for all $x\in\mathcal{X}$
$$\left\langle\nabla f_x(p^1)-\nabla f_x(p^2),\ p^3(x)-p^1(x)\right\rangle=D_x(p^3,p^2)-D_x(p^3,p^1)-D_x(p^1,p^2).$$
Proof. By definition, we know
$$D_x(p,p')=f_x(p)-f_x(p')-\langle\nabla f_x(p'),\ p-p'\rangle.$$
Substituting this definition into Lemma 8 proves the lemma.

From Lemma 8, we can rewrite (4) as follows:
$$(4)=D_x(\mu,p^t)-D_x(\mu,p^{t+1})-D_x(p^{t+1},p^t).$$
Then for term (5), from the Cauchy-Schwarz inequality, we have
$$(5)\le\sum_{y\in\mathcal{Y}}\left(\frac{(p^{t+1}(y|x)-p^t(y|x))^2}{\nu(y|x)}+\frac{\nu(y|x)\,\eta^2(l^t(x,y))^2}{4}\right)\le D_x(p^{t+1},p^t)+\frac{\eta^2B^2}{4},$$
where the last step comes from the definition of $D_x$.

Finally, for term (6), since $f_x$ is convex, we know
$$\eta\lambda\left\langle\nabla f_x(p^{t+1}),\ \mu(x)-p^{t+1}(x)\right\rangle\le\eta\lambda f_x(\mu)-\eta\lambda f_x(p^{t+1}).$$
This implies that
$$(6)\le\eta\lambda\left(f_x(p^t)-f_x(p^{t+1})\right).$$
In summary, for all $t\in[T]$, $x\in\mathcal{X}$ and any policy $\mu$, we have
$$\langle\eta l^t(x),\mu(x)-p^t(x)\rangle+\eta\lambda f_x(p^t)-\eta\lambda f_x(\mu)\le D_x(\mu,p^t)-D_x(\mu,p^{t+1})+\eta\lambda\left(f_x(p^t)-f_x(p^{t+1})\right)+\frac{\eta^2B^2}{4}.$$
Therefore, summing from $t=1$ to $T$ and dividing by $\eta$, we have
$$\sum_{t=1}^T\left\langle l^t(x),\mu(x)-p^t(x)\right\rangle+\lambda\sum_{t=1}^{T+1}\chi^2(p^t(x),\nu(x))\le\left(T\lambda+\frac{1}{\eta}\right)\chi^2(\mu(x),\nu(x))+\frac{\eta TB^2}{4},$$
where we use the fact that $D_x(\mu,p^1)=D_x(\mu,\nu)=\chi^2(\mu(x),\nu(x))$.
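For intuition, the sketch below runs the regularized update of Lemma 7 numerically for a single $x$. Because the Bregman divergence of $f_x(p)=\chi^2(p,\nu(x))$ is a $1/\nu$-weighted squared Euclidean distance, each update is a small simplex-constrained quadratic program; we solve it here with scipy's SLSQP solver purely as an illustration (the solver choice, step sizes, and toy problem sizes are assumptions).

```python
# One iteration of the chi-square-regularized policy gradient in Lemma 7 for a
# single x: p^{t+1} = argmin_{p in simplex} -<l, p> + lam * chi2(p, nu)
#                                           + (1/eta) * sum((p - p_t)^2 / nu).
# A small convex QP; solved with scipy as an illustrative choice.
import numpy as np
from scipy.optimize import minimize

def chi2_pg_step(l, p_t, nu, lam, eta):
    def objective(p):
        return (-l @ p + lam * np.sum((p - nu) ** 2 / nu)
                + np.sum((p - p_t) ** 2 / nu) / eta)
    cons = ({'type': 'eq', 'fun': lambda p: p.sum() - 1.0},)
    bounds = [(0.0, 1.0)] * len(nu)
    res = minimize(objective, p_t, bounds=bounds, constraints=cons, method='SLSQP')
    return res.x

rng = np.random.default_rng(2)
nu = rng.dirichlet(np.ones(5))           # reference (behavior) policy at x
p = nu.copy()                            # p^1 initialized at nu
for t in range(10):
    l = rng.uniform(0.0, 1.0, size=5)    # reward/loss vector l^t(x, .)
    p = chi2_pg_step(l, p, nu, lam=1.0, eta=0.1)
print("final policy:", np.round(p, 3), "chi2 to nu:", np.sum((p - nu) ** 2 / nu))
```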

E Proof of Theorem 3
Let $f_{c,s,i,h}(p)$ denote the $\chi^2$-divergence $\chi^2(p_h(c,s),\nu_{i,h}(c,s))$. Note that for any agent $i\in[N]$ and policy $\mu_i\in\Pi_i(C)$ where $C\ge1$, we have
$$\sum_{t=1}^T\left(\mathbb{E}_{c\sim\rho}\left[V^{\mu_i\circ\pi^t_{-i},r^\star}_{i,1}(c,s_1)\right]-\mathbb{E}_{c\sim\rho}\left[V^{\pi^t,r^\star}_{i,1}(c,s_1)\right]\right)$$
$$=\underbrace{\sum_{h=1}^H\sum_{t=1}^T\mathbb{E}_{c\sim\rho,\,s_h\sim d^{\mu_i\circ\pi^t_{-i}}_h(\cdot|c),\,a_{i,h}\sim\mu_{i,h}(\cdot|c,s_{i,h}),\,a_{-i,h}\sim\pi^t_{-i}(\cdot|c,s_{-i,h})}\left[r^\star_{i,h}(c,s_h,a_h)-\hat r_{i,h}(c,s_h,a_h)\right]}_{(1)}$$
$$+\underbrace{\sum_{h=1}^H\sum_{t=1}^T\mathbb{E}_{c\sim\rho,\,s_h\sim d^{\pi^t}_h(\cdot|c),\,a_h\sim\pi^t(\cdot|c,s_h)}\left[-r^\star_{i,h}(c,s_h,a_h)+\hat r_{i,h}(c,s_h,a_h)\right]}_{(2)}$$
$$+\underbrace{\sum_{t=1}^T\left(\mathbb{E}_{c\sim\rho}\left[V^{\mu_i\circ\pi^t_{-i},\hat r}_{i,1}(c,s_1)\right]-\mathbb{E}_{c\sim\rho}\left[\widehat V^{\mu_i\circ\pi^t_{-i},\hat r}_{i,1}(c,s_1)\right]\right)}_{(3)}+\underbrace{\sum_{t=1}^T\left(\mathbb{E}_{c\sim\rho}\left[\widehat V^{\pi^t,\hat r}_{i,1}(c,s_1)\right]-\mathbb{E}_{c\sim\rho}\left[V^{\pi^t,\hat r}_{i,1}(c,s_1)\right]\right)}_{(4)}$$
$$+\underbrace{\sum_{t=1}^T\left(\mathbb{E}_{c\sim\rho}\left[\widehat V^{\mu_i\circ\pi^t_{-i},\hat r}_{i,1}(c,s_1)\right]-\mathbb{E}_{c\sim\rho}\left[\widehat V^{\pi^t,\hat r}_{i,1}(c,s_1)\right]\right)}_{(5)},$$
where we use $\widehat V^{\pi,\hat r}_{i,h}$ to denote the joint value function under reward $\hat r$ and transition $\widehat P$. Next we bound these terms separately. In particular, terms (1) and (2) are bounded using the statistical guarantees of the reward model and the distribution-shift robustness of low-IR models; terms (3) and (4) are bounded using the statistical guarantees of the transition model together with the decoupling property; and term (5) is bounded via a no-regret analysis, after identifying appropriate value and Q-functions that satisfy the Bellman equation.
We use $\cup_{0\le k\le K-1}\{g^{i,h}_{j_1,\dots,j_k}\}$ and $\cup_{0\le k\le K-1}\{\hat g^{i,h}_{j_1,\dots,j_k}\}$ to denote the standardized decompositions of $r^\star_{i,h}$ and $\hat r_{i,h}$, as defined in Lemma 3, and $\Delta^{i,h}_{j_1,\dots,j_k}$ to denote $g^{i,h}_{j_1,\dots,j_k}-\hat g^{i,h}_{j_1,\dots,j_k}$. From Assumption 2 and the LSR guarantee (Lemma 13), with probability at least $1-\delta/2$ we have for all $i\in[N]$, $h\in[H]$ that
$$\mathbb{E}_{c\sim\rho,\,s_j\sim\sigma_{j,h}(\cdot|c),\,a_j\sim\nu_{j,h}(\cdot|c,s_j),\forall j}\left[\left(r^\star_{i,h}(c,s,a)-\hat r_{i,h}(c,s,a)\right)^2\right]\lesssim\frac{\log(NH|\mathcal{R}|/\delta)}{M}=:\epsilon_R.$$
Combining the above inequality with Lemma 4, we have for all $i\in[N]$, $h\in[H]$, $0\le k\le K-1$ and $1\le j_1<\dots<j_k\le N$ where $j_l\ne i$ for all $l\in[k]$ that
$$\mathbb{E}_{c\sim\rho,\,s_i\sim\sigma_{i,h}(\cdot|c),\,a_i\sim\nu_i(\cdot|c,s_i),\,s_{j_l}\sim\sigma_{j_l,h}(\cdot|c),\,a_{j_l}\sim\nu_{j_l}(\cdot|c,s_{j_l}),\forall l\in[k]}\left[\left(\Delta^{i,h}_{j_1,\dots,j_k}(c,z_i,z_{j_1},\dots,z_{j_k})\right)^2\right]\le 2^k\epsilon_R,$$
where we use $z_j$ to denote $(s_j,a_j)$. Next we bound terms (1)-(5) in the decomposition above.
For term (1), fix $h\in[H]$ and $t\in[T]$; then we know
$$\mathbb{E}_{c\sim\rho,\,s_h\sim d^{\mu_i\circ\pi^t_{-i}}_h(\cdot|c),\,a_{i,h}\sim\mu_{i,h}(c,s_{i,h}),\,a_{-i,h}\sim\pi^t_{-i}(c,s_{-i,h})}\left[r^\star_{i,h}(c,s_h,a_h)-\hat r_{i,h}(c,s_h,a_h)\right]=\sum_{k=0}^{K-1}\ \sum_{1\le j_1<\dots<j_k\le N:\,j_l\ne i,\forall l\in[k]}\mathbb{E}_{c\sim\rho,\,z_i\sim d^{\mu_i}_h(\cdot|c),\,z_{j_l}\sim d^{\pi^t_{j_l}}_h(\cdot|c),\forall l}\left[\Delta^{i,h}_{j_1,\dots,j_k}(c,z_i,z_{j_1},\dots,z_{j_k})\right].$$
With arguments similar to those in the proof of Theorem 2, from Lemma 5 we have
$$\mathbb{E}_{c\sim\rho,\,z_i\sim d^{\mu_i}_h(\cdot|c),\,z_{j_l}\sim d^{\pi^t_{j_l}}_h(\cdot|c),\forall l}\left[\Delta^{i,h}_{j_1,\dots,j_k}(c,z_i,z_{j_1},\dots,z_{j_k})\right]$$
$$\le\sqrt{\mathbb{E}_{c\sim\rho,\,s_i\sim d^{\mu_i}_h(\cdot|c),\,a_i\sim\nu_{i,h}(\cdot|c,s_i),\,s_{j_l}\sim d^{\pi^t_{j_l}}_h(\cdot|c),\,a_{j_l}\sim\nu_{j_l,h}(\cdot|c,s_{j_l}),\forall l}\left[\left(\Delta^{i,h}_{j_1,\dots,j_k}(c,z_i,z_{j_1},\dots,z_{j_k})\right)^2\right]}\cdot\sqrt{1+\chi^2\!\left(\rho\circ\Big(d^{\mu_i}_h\times\prod_{l\in[k]}d^{\pi^t_{j_l}}_h\Big)\circ\Big(\mu_i\times\prod_{l\in[k]}\pi^t_{j_l}\Big),\ \rho\circ\Big(d^{\mu_i}_h\times\prod_{l\in[k]}d^{\pi^t_{j_l}}_h\Big)\circ\Big(\nu_i\times\prod_{l\in[k]}\nu_{j_l}\Big)\right)}$$
$$\le\sqrt{(C_S)^{k+1}2^k\epsilon_R}\cdot\sqrt{1+\chi^2\!\left(\rho\circ\Big(d^{\mu_i}_h\times\prod_{l\in[k]}d^{\pi^t_{j_l}}_h\Big)\circ\Big(\mu_i\times\prod_{l\in[k]}\pi^t_{j_l}\Big),\ \rho\circ\Big(d^{\mu_i}_h\times\prod_{l\in[k]}d^{\pi^t_{j_l}}_h\Big)\circ\Big(\nu_i\times\prod_{l\in[k]}\nu_{j_l}\Big)\right)}.$$
On the other hand, from Lemma 6 we know
$$1+\chi^2\!\left(\rho\circ\Big(d^{\mu_i}_h\times\prod_{l\in[k]}d^{\pi^t_{j_l}}_h\Big)\circ\Big(\mu_i\times\prod_{l\in[k]}\pi^t_{j_l}\Big),\ \rho\circ\Big(d^{\mu_i}_h\times\prod_{l\in[k]}d^{\pi^t_{j_l}}_h\Big)\circ\Big(\nu_i\times\prod_{l\in[k]}\nu_{j_l}\Big)\right)\lesssim C\left(\frac{TH}{\lambda}\right)^k.$$
This implies that
$$\mathbb{E}_{c\sim\rho,\,z_i\sim d^{\mu_i}_h(\cdot|c),\,z_{j_l}\sim d^{\pi^t_{j_l}}_h(\cdot|c),\forall l}\left[\Delta^{i,h}_{j_1,\dots,j_k}(c,z_i,z_{j_1},\dots,z_{j_k})\right]\lesssim\sqrt{C_S^{k+1}C\epsilon_R\left(\frac{2TH}{\lambda}\right)^k}.$$
Therefore we have
$$(1)\lesssim TH\sqrt{CC_S^{K}\epsilon_R\left(\frac{2THN^2}{\lambda}\right)^{K-1}}.$$
Similarly, term (2) is bounded by
$$(2)\lesssim TH\sqrt{\left(\frac{C_STH}{\lambda}\right)^{K}(2N^2)^{K-1}\epsilon_R}.$$
For term (3), note that we have
$$(3)=\sum_{h=1}^H\sum_{t=1}^T\left(\mathbb{E}_{c\sim\rho,\,s_h\sim d^{\mu_i\circ\pi^t_{-i}}_h(\cdot|c),\,a_{i,h}\sim\mu_{i,h}(\cdot|c,s_{i,h}),\,a_{-i,h}\sim\pi^t_{-i}(\cdot|c,s_{-i,h})}\left[\hat r_{i,h}(c,s_h,a_h)\right]-\mathbb{E}_{c\sim\rho,\,s_h\sim\hat d^{\mu_i\circ\pi^t_{-i}}_h(\cdot|c),\,a_{i,h}\sim\mu_{i,h}(\cdot|c,s_{i,h}),\,a_{-i,h}\sim\pi^t_{-i}(\cdot|c,s_{-i,h})}\left[\hat r_{i,h}(c,s_h,a_h)\right]\right)$$
$$\le\sum_{h=1}^H\sum_{t=1}^T\mathbb{E}_{c\sim\rho}\left[\sum_{s,a}\left|d^{\mu_i\circ\pi^t_{-i}}_h(s,a|c)-\hat d^{\mu_i\circ\pi^t_{-i}}_h(s,a|c)\right|\right].$$
At the same time, due to the decoupled transition, we have the following lemma:
Lemma 9. For any product policy $\pi$, we have for all $h\in[H]$ that
$$\mathbb{E}_{c\sim\rho}\left[\sum_{s,a}\left|d^\pi_h(s,a|c)-\hat d^\pi_h(s,a|c)\right|\right]\le\sum_{j=1}^N\mathbb{E}_{c\sim\rho}\left[\sum_{s_j,a_j}\left|d^{\pi_j}_h(s_j,a_j|c)-\hat d^{\pi_j}_h(s_j,a_j|c)\right|\right].$$
Thus, from Lemma 9, we only need to bound $\mathbb{E}_{c\sim\rho}\big[\sum_{s_j,a_j}|d^{\pi_j}_h(s_j,a_j|c)-\hat d^{\pi_j}_h(s_j,a_j|c)|\big]$ for each agent $j$ and single-agent policy $\pi_j$. This is achieved in the following lemma:
Lemma 10. For any $j\in[N]$ and single-agent policy $\pi_j$, we have for all $h\in[H]$ that
$$\mathbb{E}_{c\sim\rho}\left[\sum_{s_j,a_j}\left|d^{\pi_j}_h(s_j,a_j|c)-\hat d^{\pi_j}_h(s_j,a_j|c)\right|\right]\le\sum_{h'=1}^{h-1}\mathbb{E}_{c\sim\rho,\,(s_j,a_j)\sim d^{\pi_j}_{h'}(\cdot|c)}\left[\left\|\widehat P_{j,h'}(\cdot|c,s_j,a_j)-P_{j,h'}(\cdot|c,s_j,a_j)\right\|_1\right].$$
On the other hand, from the guarantee of MLE in the literature (Liu et al., 2022; Zhan et al., 2022, 2023b) (Lemma 14), we know with probability at least $1-\delta/2$ that for all $j\in[N]$, $h\in[H]$,
$$\mathbb{E}_{c\sim\rho,\,s_j\sim\sigma_{j,h}(\cdot|c),\,a_j\sim\nu_{j,h}(\cdot|c,s_j)}\left[\left\|\widehat P_{j,h}(\cdot|c,s_j,a_j)-P_{j,h}(\cdot|c,s_j,a_j)\right\|_1^2\right]\lesssim\frac{\log(HN|\mathcal{P}|/\delta)}{M}=:\epsilon_P. \qquad (17)$$
From Lemma 5, this implies that with probability at least $1-\delta/2$, we have for all $j\in[N]$, $h\in[H]$, $t\in[T]$, $\mu_i\in\Pi_i(C)$ that
$$\mathbb{E}_{c\sim\rho,\,(s_i,a_i)\sim d^{\mu_i}_h(\cdot|c)}\left[\left\|\widehat P_{i,h}(\cdot|c,s_i,a_i)-P_{i,h}(\cdot|c,s_i,a_i)\right\|_1\right]\lesssim\sqrt{C_SC\epsilon_P}, \qquad (18)$$
$$\mathbb{E}_{c\sim\rho,\,(s_j,a_j)\sim d^{\pi^t_j}_h(\cdot|c)}\left[\left\|\widehat P_{j,h}(\cdot|c,s_j,a_j)-P_{j,h}(\cdot|c,s_j,a_j)\right\|_1\right]\lesssim\sqrt{\frac{C_STH\epsilon_P}{\lambda}}.$$
Therefore, we have
$$(3)\lesssim H^2T\sqrt{C_SC\epsilon_P}+H^2TN\sqrt{\frac{C_STH\epsilon_P}{\lambda}}.$$
For term (4), following the same arguments as for term (3), we have
$$(4)\lesssim H^2TN\sqrt{\frac{C_STH\epsilon_P}{\lambda}}.$$
For term (5), we first need to show that the expected single-agent Q-function $\widehat Q^t_{i,h}$ satisfies a Bellman equation. In particular, for any product policy $\pi$, define (overloading notation through the arguments)
$$\widehat Q^{\pi,\hat r}_{i,h}(c,s_i,a_i):=\mathbb{E}_{(s_j,a_j)\sim\hat d^{\pi_j}_h(\cdot|c),\,\forall j\ne i}\left[\widehat Q^{\pi,\hat r}_{i,h}(c,s,a)\right],$$
where the inner $\widehat Q^{\pi,\hat r}_{i,h}(c,s,a)$ is the joint Q-function under $\hat r$ and $\widehat P$. We have the following lemma:

Lemma 11. Given a joint policy $\pi_{-i}$ for the agents other than $i$, for all $i\in[N]$, $h\in[H]$, $c\in\mathcal{C}$, $s_i\in\mathcal{S}_i$, $a_i\in\mathcal{A}_i$ and policy $\mu_i$, we have
$$\widehat V^{\mu_i\circ\pi_{-i},\hat r}_{i,h}(c,s_i):=\mathbb{E}_{a_i\sim\mu_{i,h}(\cdot|c,s_i)}\left[\widehat Q^{\mu_i\circ\pi_{-i},\hat r}_{i,h}(c,s_i,a_i)\right],$$
$$\widehat Q^{\mu_i\circ\pi_{-i},\hat r}_{i,h}(c,s_i,a_i)=\mathbb{E}_{(s_{-i},a_{-i})\sim\hat d^{\pi_{-i}}_h(\cdot|c)}\left[\hat r_{i,h}(c,s,a)\right]+\mathbb{E}_{s_i'\sim\widehat P_{i,h}(\cdot|c,s_i,a_i)}\left[\widehat V^{\mu_i\circ\pi_{-i},\hat r}_{i,h+1}(c,s_i')\right].$$

Lemma 11 indeed implies that $\widehat Q^{\mu_i\circ\pi_{-i},\hat r}_{i,h}(c,s_i,a_i)$ is a valid Q-function with respect to the reward function $\mathbb{E}_{(s_{-i},a_{-i})\sim\hat d^{\pi_{-i}}_h(\cdot|c)}[\hat r_{i,h}(c,s,a)]$ under the transition model $\widehat P$, and thus we have the following performance difference lemma:

Lemma 12. Given a joint policy $\pi_{-i}$ for the agents other than $i$, for any policies $\mu_i$ and $\mu_i'$, we have
$$\widehat V^{\mu_i'\circ\pi_{-i},\hat r}_{i,1}(c,s_{i,1})-\widehat V^{\mu_i\circ\pi_{-i},\hat r}_{i,1}(c,s_{i,1})=\sum_{h=1}^H\mathbb{E}_{s_{i,h}\sim\hat d^{\mu_i'}_h(\cdot|c)}\left[\left\langle\widehat Q^{\mu_i\circ\pi_{-i},\hat r}_{i,h}(c,s_{i,h},\cdot),\ \mu_{i,h}'(\cdot|c,s_{i,h})-\mu_{i,h}(\cdot|c,s_{i,h})\right\rangle\right].$$

Now, given Lemma 12, we have
$$(5)=\sum_{t=1}^T\left(\mathbb{E}_{c\sim\rho}\left[\widehat V^{\mu_i\circ\pi^t_{-i},\hat r}_{i,1}(c,s_{i,1})\right]-\mathbb{E}_{c\sim\rho}\left[\widehat V^{\pi^t,\hat r}_{i,1}(c,s_{i,1})\right]\right)=\sum_{h=1}^H\mathbb{E}_{c\sim\rho,\,s_i\sim\hat d^{\mu_i}_h(\cdot|c)}\left[\sum_{t=1}^T\left\langle\widehat Q^t_{i,h}(c,s_i,\cdot),\ \mu_{i,h}(\cdot|c,s_i)-\pi^t_{i,h}(\cdot|c,s_i)\right\rangle\right]$$
$$\le\underbrace{\sum_{h=1}^H\mathbb{E}_{c\sim\rho,\,s_i\sim d^{\mu_i}_h(\cdot|c)}\left[\sum_{t=1}^T\left\langle\widehat Q^t_{i,h}(c,s_i,\cdot),\ \mu_{i,h}(\cdot|c,s_i)-\pi^t_{i,h}(\cdot|c,s_i)\right\rangle\right]}_{(6)}+\underbrace{TH\sum_{h=1}^H\mathbb{E}_{c\sim\rho}\left[\sum_{s_i}\left|\hat d^{\mu_i}_h(s_i|c)-d^{\mu_i}_h(s_i|c)\right|\right]}_{(7)}.$$
Applying Lemma 7 and since $\mu_i\in\Pi_i(C)$, we have
$$(6)\lesssim TH\lambda C+\frac{HC}{\eta}+\frac{\eta H^3T}{4}.$$
From Lemma 10 and Eq. (18), we have
$$(7)\lesssim TH^3\sqrt{C_SC\epsilon_P}.$$

Therefore, we have
$$\mathbb{E}_{c\sim\rho}\left[V^{\mu_i\circ\hat\pi_{-i},r^\star}_{i,1}(c,s_1)\right]-\mathbb{E}_{c\sim\rho}\left[V^{\hat\pi,r^\star}_{i,1}(c,s_1)\right]\lesssim H\sqrt{C\left(\frac{C_STH}{\lambda}\right)^{K}(2N^2)^{K-1}\epsilon_R}+H\lambda C+\frac{HC}{T\eta}+\frac{\eta H^3}{4}+H^3\sqrt{C_SC\epsilon_P}+H^2N\sqrt{\frac{C_STH\epsilon_P}{\lambda}}.$$
Let
$$T=C_S^{-\frac{2K}{3K+2}}H^{\frac{4}{3K+2}}(2N^2)^{-\frac{2K-2}{3K+2}}\epsilon_{RP}^{-\frac{2}{3K+2}},\qquad \eta=C_S^{\frac{K}{3K+2}}H^{-\frac{3K+4}{3K+2}}(2N^2)^{\frac{K-1}{3K+2}}\epsilon_{RP}^{\frac{1}{3K+2}},\qquad \lambda=C_S^{\frac{K}{3K+2}}H^{\frac{3K}{3K+2}}(2N^2)^{\frac{K-1}{3K+2}}\epsilon_{RP}^{\frac{1}{3K+2}},$$
where $\epsilon_{RP}:=\frac{\log(NH|\mathcal{R}||\mathcal{P}|/\delta)}{M}$, and then we have for all $\mu_i\in\Pi_i(C)$ that
$$\mathbb{E}_{c\sim\rho}\left[V^{\mu_i\circ\hat\pi_{-i},r^\star}_{i,1}(c,s_1)\right]-\mathbb{E}_{c\sim\rho}\left[V^{\hat\pi,r^\star}_{i,1}(c,s_1)\right]\lesssim C\,C_S^{\frac{K}{3K+2}}H^{\frac{6K+2}{3K+2}}(2N^2)^{\frac{K-1}{3K+2}}\epsilon_{RP}^{\frac{1}{3K+2}}.$$
This concludes our proof.

E.1 Proof of Lemma 9


Note that given $c$, the distributions of the pairs $(s_j,a_j)$ across agents are mutually independent due to the decoupled transition. Therefore we have
$$\mathbb{E}_{c\sim\rho}\left[\sum_{s,a}\left|d^\pi_h(s,a|c)-\hat d^\pi_h(s,a|c)\right|\right]=\mathbb{E}_{c\sim\rho}\left[\sum_{s,a}\left|\prod_{j\in[N]}d^{\pi_j}_h(s_j,a_j|c)-\prod_{j\in[N]}\hat d^{\pi_j}_h(s_j,a_j|c)\right|\right].$$
Now for any $0\le k\le N-1$, consider the following difference:
$$I_k:=\mathbb{E}_{c\sim\rho}\left[\sum_{s,a}\left|\prod_{1\le j\le k}d^{\pi_j}_h(s_j,a_j|c)\prod_{k+1\le j\le N}\hat d^{\pi_j}_h(s_j,a_j|c)-\prod_{1\le j\le k+1}d^{\pi_j}_h(s_j,a_j|c)\prod_{k+2\le j\le N}\hat d^{\pi_j}_h(s_j,a_j|c)\right|\right].$$
Note that we have
$$I_k=\mathbb{E}_{c\sim\rho}\left[\sum_{s,a}\prod_{1\le j\le k}d^{\pi_j}_h(s_j,a_j|c)\prod_{k+2\le j\le N}\hat d^{\pi_j}_h(s_j,a_j|c)\left|\hat d^{\pi_{k+1}}_h(s_{k+1},a_{k+1}|c)-d^{\pi_{k+1}}_h(s_{k+1},a_{k+1}|c)\right|\right]$$
$$=\mathbb{E}_{c\sim\rho}\left[\sum_{s_{k+1},a_{k+1}}\left|\hat d^{\pi_{k+1}}_h(s_{k+1},a_{k+1}|c)-d^{\pi_{k+1}}_h(s_{k+1},a_{k+1}|c)\right|\cdot\sum_{s_{-(k+1)},a_{-(k+1)}}\prod_{1\le j\le k}d^{\pi_j}_h(s_j,a_j|c)\prod_{k+2\le j\le N}\hat d^{\pi_j}_h(s_j,a_j|c)\right]$$
$$=\mathbb{E}_{c\sim\rho}\left[\sum_{s_{k+1},a_{k+1}}\left|\hat d^{\pi_{k+1}}_h(s_{k+1},a_{k+1}|c)-d^{\pi_{k+1}}_h(s_{k+1},a_{k+1}|c)\right|\right].$$
Therefore we have
$$\mathbb{E}_{c\sim\rho}\left[\sum_{s,a}\left|d^\pi_h(s,a|c)-\hat d^\pi_h(s,a|c)\right|\right]\le\sum_{k=0}^{N-1}I_k=\sum_{j=1}^N\mathbb{E}_{c\sim\rho}\left[\sum_{s_j,a_j}\left|d^{\pi_j}_h(s_j,a_j|c)-\hat d^{\pi_j}_h(s_j,a_j|c)\right|\right].$$
This concludes our proof.

E.2 Proof of Lemma 10


Let $\delta_h$ denote $\mathbb{E}_{c\sim\rho}\big[\sum_{s_j,a_j}|d^{\pi_j}_h(s_j,a_j|c)-\hat d^{\pi_j}_h(s_j,a_j|c)|\big]$. Then we know $\delta_1=0$. In addition, for any $2\le h\le H$, we have
$$\delta_h=\mathbb{E}_{c\sim\rho}\left[\sum_{s_j,a_j}\left|\sum_{s_j',a_j'}d^{\pi_j}_{h-1}(s_j',a_j'|c)P_{j,h-1}(s_j|c,s_j',a_j')\pi_{j,h}(a_j|c,s_j)-\sum_{s_j',a_j'}\hat d^{\pi_j}_{h-1}(s_j',a_j'|c)\widehat P_{j,h-1}(s_j|c,s_j',a_j')\pi_{j,h}(a_j|c,s_j)\right|\right]$$
$$\le\mathbb{E}_{c\sim\rho}\left[\sum_{s_j,a_j,s_j',a_j'}d^{\pi_j}_{h-1}(s_j',a_j'|c)\,\pi_{j,h}(a_j|c,s_j)\left|P_{j,h-1}(s_j|c,s_j',a_j')-\widehat P_{j,h-1}(s_j|c,s_j',a_j')\right|\right]$$
$$\quad+\mathbb{E}_{c\sim\rho}\left[\sum_{s_j,a_j,s_j',a_j'}\widehat P_{j,h-1}(s_j|c,s_j',a_j')\,\pi_{j,h}(a_j|c,s_j)\left|d^{\pi_j}_{h-1}(s_j',a_j'|c)-\hat d^{\pi_j}_{h-1}(s_j',a_j'|c)\right|\right]$$
$$=\mathbb{E}_{c\sim\rho,\,(s_j,a_j)\sim d^{\pi_j}_{h-1}(\cdot|c)}\left[\left\|\widehat P_{j,h-1}(\cdot|c,s_j,a_j)-P_{j,h-1}(\cdot|c,s_j,a_j)\right\|_1\right]+\delta_{h-1},$$
where the first inequality adds and subtracts $d^{\pi_j}_{h-1}\widehat P_{j,h-1}\pi_{j,h}$ and applies the triangle inequality, and the last step sums out $a_j$ (using $\sum_{a_j}\pi_{j,h}(a_j|c,s_j)=1$) and $s_j$, respectively. Unrolling the recursion, we have
$$\mathbb{E}_{c\sim\rho}\left[\sum_{s_j,a_j}\left|d^{\pi_j}_h(s_j,a_j|c)-\hat d^{\pi_j}_h(s_j,a_j|c)\right|\right]\le\sum_{h'=1}^{h-1}\mathbb{E}_{c\sim\rho,\,(s_j,a_j)\sim d^{\pi_j}_{h'}(\cdot|c)}\left[\left\|\widehat P_{j,h'}(\cdot|c,s_j,a_j)-P_{j,h'}(\cdot|c,s_j,a_j)\right\|_1\right].$$
This concludes our proof.

E.3 Proof of Lemma 11


First, it can be observed that $\widehat V^{\mu_i\circ\pi_{-i},\hat r}_{i,h}(c,s_{i,h})=\mathbb{E}_{s_{-i}\sim\hat d^{\pi_{-i}}_h}\big[\widehat V^{\mu_i\circ\pi_{-i},\hat r}_{i,h}(c,s_h)\big]$. Note that we have
$$\widehat Q^{\mu_i\circ\pi_{-i},\hat r}_{i,h}(c,s_{i,h},a_{i,h})=\mathbb{E}_{(s_{-i,h},a_{-i,h})\sim\hat d^{\pi_{-i}}_h(\cdot|c)}\left[\mathbb{E}_{\mu_i\circ\pi_{-i},\widehat P}\left[\sum_{h'=h}^H\hat r_{i,h'}(c,s_{h'},a_{h'})\,\Big|\,c,s_h,a_h\right]\right]$$
$$=\mathbb{E}_{(s_{-i,h},a_{-i,h})\sim\hat d^{\pi_{-i}}_h(\cdot|c)}\left[\hat r_{i,h}(c,s_h,a_h)\right]+\mathbb{E}_{(s_{-i,h},a_{-i,h})\sim\hat d^{\pi_{-i}}_h(\cdot|c)}\left[\mathbb{E}_{\mu_i\circ\pi_{-i},\widehat P}\left[\sum_{h'=h+1}^H\hat r_{i,h'}(c,s_{h'},a_{h'})\,\Big|\,c,s_h,a_h\right]\right],$$
where we use $\mathbb{E}_{\pi,\widehat P}[\cdot]$ to denote the expectation over the trajectory generated by executing the joint policy $\pi$ under the transition model $\widehat P$.
On the other hand, we know
$$\mathbb{E}_{\mu_i\circ\pi_{-i},\widehat P}\left[\sum_{h'=h+1}^H\hat r_{i,h'}(c,s_{h'},a_{h'})\,\Big|\,c,s_h,a_h\right]=\mathbb{E}_{s_{j,h+1}\sim\widehat P_{j,h}(\cdot|c,s_{j,h},a_{j,h}),\forall j}\left[\mathbb{E}_{\mu_i\circ\pi_{-i},\widehat P}\left[\sum_{h'=h+1}^H\hat r_{i,h'}(c,s_{h'},a_{h'})\,\Big|\,c,s_{h+1}\right]\right]=\mathbb{E}_{s_{j,h+1}\sim\widehat P_{j,h}(\cdot|c,s_{j,h},a_{j,h}),\forall j}\left[\widehat V^{\mu_i\circ\pi_{-i},\hat r}_{i,h+1}(c,s_{h+1})\right].$$
Therefore we know
$$\widehat Q^{\mu_i\circ\pi_{-i},\hat r}_{i,h}(c,s_{i,h},a_{i,h})=\mathbb{E}_{(s_{-i,h},a_{-i,h})\sim\hat d^{\pi_{-i}}_h(\cdot|c)}\left[\hat r_{i,h}(c,s_h,a_h)\right]+\mathbb{E}_{s_{i,h+1}\sim\widehat P_{i,h}(\cdot|c,s_{i,h},a_{i,h}),\,(s_{-i,h},a_{-i,h})\sim\hat d^{\pi_{-i}}_h(\cdot|c),\,s_{j,h+1}\sim\widehat P_{j,h}(\cdot|c,s_{j,h},a_{j,h}),\forall j\ne i}\left[\widehat V^{\mu_i\circ\pi_{-i},\hat r}_{i,h+1}(c,s_{h+1})\right]$$
$$=\mathbb{E}_{(s_{-i,h},a_{-i,h})\sim\hat d^{\pi_{-i}}_h(\cdot|c)}\left[\hat r_{i,h}(c,s_h,a_h)\right]+\mathbb{E}_{s_{i,h+1}\sim\widehat P_{i,h}(\cdot|c,s_{i,h},a_{i,h}),\,s_{-i,h+1}\sim\hat d^{\pi_{-i}}_{h+1}(\cdot|c)}\left[\widehat V^{\mu_i\circ\pi_{-i},\hat r}_{i,h+1}(c,s_{h+1})\right]$$
$$=\mathbb{E}_{(s_{-i,h},a_{-i,h})\sim\hat d^{\pi_{-i}}_h(\cdot|c)}\left[\hat r_{i,h}(c,s_h,a_h)\right]+\mathbb{E}_{s_{i,h+1}\sim\widehat P_{i,h}(\cdot|c,s_{i,h},a_{i,h})}\left[\widehat V^{\mu_i\circ\pi_{-i},\hat r}_{i,h+1}(c,s_{i,h+1})\right].$$
This concludes our proof.

E.4 Proof of Lemma 12


Let $\tilde r_{i,h}(c,s_i,a_i)$ denote $\mathbb{E}_{(s_{-i},a_{-i})\sim d^{\pi_{-i}}_h(\cdot|c)}\left[r_{i,h}(c,s,a)\right]$. From Lemma 11, we have
$$V^{\mu_i'\circ\pi_{-i},r}_{i,1}(c,s_{i,1})-V^{\mu_i\circ\pi_{-i},r}_{i,1}(c,s_{i,1})=\mathbb{E}_{\mu_i'}\left[\sum_{h=1}^H\tilde r_{i,h}(c,s_{i,h},a_{i,h})\,\Big|\,c\right]-V^{\mu_i\circ\pi_{-i},r}_{i,1}(c,s_{i,1})$$
$$=\mathbb{E}_{\mu_i'}\left[\sum_{h=2}^H\tilde r_{i,h}(c,s_{i,h},a_{i,h})\,\Big|\,c\right]+\mathbb{E}_{\mu_i'}\left[\tilde r_{i,1}(c,s_{i,1},a_{i,1})-V^{\mu_i\circ\pi_{-i},r}_{i,1}(c,s_{i,1})\,\Big|\,c\right]$$
$$=\mathbb{E}_{\mu_i'}\left[\sum_{h=2}^H\tilde r_{i,h}(c,s_{i,h},a_{i,h})\,\Big|\,c\right]+\mathbb{E}_{\mu_i'}\left[Q^{\mu_i\circ\pi_{-i},r}_{i,1}(c,s_{i,1},a_{i,1})-V^{\mu_i\circ\pi_{-i},r}_{i,2}(c,s_{i,2})-V^{\mu_i\circ\pi_{-i},r}_{i,1}(c,s_{i,1})\,\Big|\,c\right]$$
$$=\mathbb{E}_{\mu_i'}\left[\sum_{h=2}^H\tilde r_{i,h}(c,s_{i,h},a_{i,h})\,\Big|\,c\right]-\mathbb{E}_{\mu_i'}\left[V^{\mu_i\circ\pi_{-i},r}_{i,2}(c,s_{i,2})\right]+\mathbb{E}_{s_{i,1}\sim d^{\mu_i'}_1(\cdot|c)}\left[\left\langle Q^{\mu_i\circ\pi_{-i},r}_{i,1}(c,s_{i,1},\cdot),\ \mu_{i,1}'(\cdot|c,s_{i,1})-\mu_{i,1}(\cdot|c,s_{i,1})\right\rangle\right].$$
Here the first step is due to the definition of the value function and the third step is due to Lemma 11. Now applying the above argument recursively to $\mathbb{E}_{\mu_i'}\big[\sum_{h=2}^H\tilde r_{i,h}(c,s_{i,h},a_{i,h})\,|\,c\big]-\mathbb{E}_{\mu_i'}\big[V^{\mu_i\circ\pi_{-i},r}_{i,2}(c,s_{i,2})\big]$, we have
$$V^{\mu_i'\circ\pi_{-i},r}_{i,1}(c,s_{i,1})-V^{\mu_i\circ\pi_{-i},r}_{i,1}(c,s_{i,1})=\sum_{h=1}^H\mathbb{E}_{s_{i,h}\sim d^{\mu_i'}_h(\cdot|c)}\left[\left\langle Q^{\mu_i\circ\pi_{-i},r}_{i,h}(c,s_{i,h},\cdot),\ \mu_{i,h}'(\cdot|c,s_{i,h})-\mu_{i,h}(\cdot|c,s_{i,h})\right\rangle\right].$$
This concludes our proof.

F Auxiliary Lemmas
Lemma 13 (Song et al. (2022)). Let $\{(x_m,y_m)\}_{m=1}^M$ be $M$ samples that are independently sampled from $x_m\sim p$ and $y_m\sim q(\cdot|x_m)$, i.e., $y_m=f^\star(x_m)+\epsilon_m$ where $\epsilon_m$ is a random noise. Suppose that $y_m\in[0,1]$ for all $m\in[M]$ and we have access to a function class $\mathcal{G}:\mathcal{X}\to[0,1]$ which satisfies $f^\star\in\mathcal{G}$. Then if $\{\epsilon_m\}_{m=1}^M$ are independent and $\mathbb{E}[y_m|x_m]=f^\star(x_m)$, we have with probability at least $1-\delta$ that
$$\mathbb{E}_{x\sim p}\left[(\hat f(x)-f^\star(x))^2\right]\lesssim\frac{\log(|\mathcal{G}|/\delta)}{M},$$
where $\hat f=\arg\min_{f\in\mathcal{G}}\sum_{m=1}^M(f(x_m)-y_m)^2$ is the LSR solution.
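As an illustration of this rate (with a toy finite class, noise model, and sample sizes chosen arbitrarily), the sketch below runs least-squares regression over a small finite function class containing the true function and reports the excess risk, which should shrink roughly like $\log|\mathcal{G}|/M$.

```python
# Toy illustration of Lemma 13: least-squares regression over a finite class G
# that contains the true function; the excess risk decays roughly as log|G|/M.
# The class, noise model, and sample sizes are arbitrary assumptions.
import numpy as np

rng = np.random.default_rng(3)
# A small finite class of clipped linear functions; the true function is in G.
G = [lambda x, a=a, b=b: np.clip(a * x + b, 0.0, 1.0)
     for a in np.linspace(-1, 1, 9) for b in np.linspace(0, 1, 9)]
f_star = G[40]

for M in [50, 200, 800, 3200]:
    x = rng.uniform(0.0, 1.0, size=M)
    y = f_star(x) + rng.uniform(-0.05, 0.05, size=M)          # E[y | x] = f_star(x)
    f_hat = min(G, key=lambda f: np.sum((f(x) - y) ** 2))     # LSR over the class
    x_test = rng.uniform(0.0, 1.0, size=10_000)
    risk = np.mean((f_hat(x_test) - f_star(x_test)) ** 2)
    print(f"M={M:5d}  excess risk={risk:.6f}  log|G|/M={np.log(len(G))/M:.6f}")
```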

Lemma 14 (Zhan et al. (2023b)). Let $\{(x_m,y_m)\}_{m=1}^M$ be $M$ samples that are i.i.d. sampled from $x_m\sim p$ and $y_m\sim q^\star(\cdot|x_m)$. Suppose we have access to a probability model class $\mathcal{Q}$ which satisfies $q^\star\in\mathcal{Q}$. Then we have with probability at least $1-\delta$ that
$$\mathbb{E}_{x\sim p}\left[\left\|\hat q(\cdot|x)-q^\star(\cdot|x)\right\|_1^2\right]\lesssim\frac{\log(|\mathcal{Q}|/\delta)}{M},$$
where $\hat q=\arg\max_{q\in\mathcal{Q}}\sum_{m=1}^M\log q(y_m|x_m)$ is the MLE solution.
