Offline Pre-trained Multi-agent Decision Transformer
Abstract: Offline reinforcement learning leverages previously collected offline datasets to learn optimal policies with no need to access the real environment. Such a paradigm is also desirable for multi-agent reinforcement learning (MARL) tasks, given the combinatorially increased interactions among agents and with the environment. However, in MARL, the paradigm of offline pre-training with online fine-tuning has not been studied, nor are datasets or benchmarks for offline MARL research available. In this paper, we facilitate the research by providing large-scale datasets and using them to examine the usage of the decision transformer in the context of MARL. We investigate the generalization of MARL offline pre-training in the following three aspects: 1) between single agents and multiple agents, 2) from offline pre-training to online fine-tuning, and 3) to multiple downstream tasks with few-shot and zero-shot capabilities. We start by introducing the first offline MARL dataset with diverse quality levels based on the StarCraft II environment, and then propose the novel architecture of the multi-agent decision transformer (MADT) for effective offline learning. MADT leverages the transformer′s modelling ability for sequence modelling and integrates it seamlessly with both offline and online MARL tasks. A significant benefit of MADT is that it learns generalizable policies that can transfer between different types of agents under different task scenarios. On the StarCraft II offline dataset, MADT outperforms the state-of-the-art offline reinforcement learning (RL) baselines, including BCQ and CQL. When applied to online tasks, the pre-trained MADT significantly improves sample efficiency and enjoys strong performance in both few-shot and zero-shot cases. To the best of our knowledge, this is the first work that studies and demonstrates the effectiveness of offline pre-trained models in terms of sample efficiency and generalizability enhancements for MARL.
Keywords: Pre-training model, multi-agent reinforcement learning (MARL), decision making, transformer, offline reinforcement
learning.
Citation: L. Meng, M. Wen, C. Le, X. Li, D. Xing, W. Zhang, Y. Wen, H. Zhang, J. Wang, Y. Yang, B. Xu. Offline pre-trained multi-
agent decision transformer. Machine Intelligence Research, vol.20, no.2, pp.233–248, 2023. http://doi.org/10.1007/s11633-022-1383-7
of the first models that verifies the possibility of solving conventional (offline) RL problems by generative trajectory modelling, i.e., modelling the joint distribution of the sequence of states, actions, and rewards without temporal difference learning.
The technique of transforming decision-making problems into sequence modelling problems has opened a new gate for solving RL tasks. Crucially, this activates a novel pathway toward training RL systems on diverse datasets[20−22] in much the same manner as in supervised learning, which is often instantiated by offline RL techniques[23]. Offline RL methods have recently attracted tremendous attention since they enable agents to apply self-supervised or unsupervised RL methods in settings where online collection is infeasible. We thus argue that this is particularly important for MARL problems, since online exploration in multi-agent settings may not be feasible in many settings[24], but learning with unsupervised or meta-learned[25] outcome-driven objectives via offline data is still possible. However, it is still unclear whether the effectiveness of sequence modelling through transformer architectures also applies to MARL problems.
In this paper, we propose the multi-agent decision transformer (MADT), an architecture that casts the problem of MARL as conditional sequence modelling. Our mandate is to understand whether the proposed MADT can learn, through pre-training, a generalized policy on offline datasets that can then be effectively used in other downstream environments (known or unknown). As a study example, we specifically focus on the well-known challenge for MARL tasks, the StarCraft multi-agent challenge (SMAC)[26], and demonstrate the possibility of solving multiple SMAC tasks with one big sequence model.
Our contribution is as follows: We propose a series of transformer variants for offline MARL by leveraging the sequential modelling of the attention mechanism. In particular, we validate our pre-trained sequential model in the challenging multi-agent environment for its sample efficiency and transferability. We built a dataset with different skill levels covering different variations of SMAC scenarios. Experimental results on SMAC tasks show that MADT enjoys fast adaptation and superior performance via learning one big sequence model.
The main challenges in our offline pre-training and online fine-tuning problems are the out-of-distribution and training paradigm mismatch problems. We tackle these two problems with the sequential model and pre-train the global critic model offline.

2 Related work

Offline deep reinforcement learning. Recent works have successfully applied RL online in robotics control[27, 28] and game AIs[29]. However, many works attempt to reduce the cost of online interactions by learning with neural networks from an offline dataset, known as offline RL methods[23]. Offline RL methods can be divided into two classes: constraint-based and sequential model-based methods. For the constraint-based methods, a straightforward approach is to adopt an off-policy algorithm and regard the offline dataset as a replay buffer to learn a policy with promising performance. However, the experience in offline datasets and the interactions with online environments have different distributions, which causes overestimation in off-policy (value-based) methods[30]. Substantial works in offline RL aim at resolving the distribution shift between the static offline datasets and the online environment interactions[30−32]. In addition, relying on the dynamic planning ability of the transition model, Matsushima et al.[33, 34] learn different models offline and regularize the policy efficiently. In particular, Yang et al.[35, 36] constrain off-policy algorithms in the multi-agent field. Related to our work on improving sample efficiency, Nair et al.[37] derive the Karush-Kuhn-Tucker (KKT) conditions of the online objective, generating an advantage weight to avoid the out-of-distribution (OOD) problem. For the sequential model-based methods, the Decision Transformer outperforms many state-of-the-art offline RL algorithms by regarding the offline policy training process as sequential modelling and testing it online[19, 38]. In contrast, we present a transformer-based method in the multi-agent field that attempts to transfer across many scenarios without extra constraints. By sharing the sequential model across agents and learning a global critic network offline, we construct a pre-trained multi-agent policy that can be continuously fine-tuned online.
Multi-agent reinforcement learning. As a natural extension of single-agent RL, MARL[1] attracts much attention for solving more complex problems under Markov games. Classic algorithms often assume that multiple agents interact with the environment online and collect the experience to train the joint policy from scratch. Many empirical successes have been demonstrated in solving zero-sum games through MARL methods[8, 39]. When solving decentralized partially observable Markov decision processes (Dec-POMDPs) or potential games[40], the framework of centralized training and decentralized execution (CTDE) is often employed[41−45], where a centralized critic is trained to gather all agents′ local observations and assign credits. While CTDE methods rely on the individual-global-max assumption[46], another thread of work is built on the so-called advantage decomposition lemma[47], which holds in general for any cooperative game; such a lemma leads to provably convergent multi-agent trust-region methods[48] and constrained policy optimization methods[49].
Transformer. The Transformer[50] has achieved a great breakthrough in modelling relations between input and output sequences of variable length for sequence-to-sequence problems[51], especially in machine translation[52] and speech recognition[53]. Recent works even reorganize vision problems as a sequential modelling process and construct state-of-the-art (SOTA) models with pre-training, named vision transformers (ViT)[16, 54, 55].
Due to the Markovian property of trajectories in offline datasets, we can utilize the Transformer as in language modelling. Therefore, the Transformer can bridge the gap between supervised learning in the offline setting and reinforcement learning in online interaction because of its representation capability. We claim that the components in Markov games are sequential, and then utilise the transformer for each agent to fit a transferable MARL policy. Furthermore, we fine-tune the learned policy via trial-and-error.

3 Methodology

In this section, we demonstrate how the transformer is applied to our offline pre-training MARL framework. First, we introduce the typical paradigm and computation process for multi-agent reinforcement learning and the attention-based model. Then, we introduce an offline MARL method, in which the transformer sequentially maps between the local observations and actions of each agent in the offline dataset via parameter sharing. We then leverage the hidden representation as the input of the MADT to minimize the cross-entropy loss. Furthermore, we introduce how to integrate online MARL with MADT in constructing our whole framework to train a universal MARL policy. To accelerate online learning, we load the pre-trained model as a part of the MARL algorithms and learn the policy based on the experience in the latest buffer collected from the online environment. To train a universal MARL policy that quickly adapts to other tasks, we bridge the gap between different scenarios in terms of observations, actions, and available actions, respectively. Fig. 1 overviews our method from the perspective of offline pre-training with supervised learning and online fine-tuning with MARL algorithms. The main contributions of this work are summarized as follows: 1) We constructed an offline dataset for multi-agent offline pre-training on the well-known challenging task, SMAC; 2) To improve the sample efficiency online, we propose fine-tuning the pre-trained multi-agent policy instantiated with the sequence model by sharing the policy among agents, and show the strong capacity of sequence modelling for multi-agent reinforcement learning in the few-shot and zero-shot settings; 3) We propose pre-training an actor and a critic to fine-tune with the policy-based network. In contrast to imitation learning, which only fits a policy network offline, MADT trains the actor and critic offline together and fine-tunes them online in the RL-style training scheme. We also give some empirical conclusions, such as the effect of reward-to-go in the online fine-tuning stage and the multi-task padding method on SMAC.

Fig. 1 Overview of the pipeline for pre-training the general policy and fine-tuning it online

Multi-agent reinforcement learning. For the Markov game, which is a multi-agent extension of the Markov decision process (MDP), there is a tuple representing the essential elements ⟨S, A, R, P, n, γ⟩, where S denotes the state space of the n agents, S_1 × S_2 × ··· × S_n → S, A_i is the action space of each agent i, P : S_i × A_i → PD(S_i) denotes the transition function emitting the distribution over the state space, A is the joint action space, and R_i : S × A_i → R is the reward function of each agent. Each agent takes actions following its policy π(a|s) ∈ Π_i : S → PD(A), where Π_i denotes the policy space of agent i, a ∈ A_i, and s ∈ S_i. Each agent aims to maximize its long-term reward Σ_t γ^t r_i^t, where r_i^t ∈ R_i denotes the reward of agent i at time t and γ denotes the discount factor. In the cooperative setting, we also denote r_i by r, shared among agents, for simplification.

Attention-based model. The attention-based model has shown stable and strong representation capability. The scaled dot-product attention uses the self-attention mechanism demonstrated in [50]. Let Q ∈ R^{t_q × d_q} be the queries, K ∈ R^{t_k × d_k} be the keys, and V ∈ R^{t_v × d_v} be the values, where t_∗ are the element numbers of the different inputs and d_∗ are the corresponding element dimensions. Normally, t_k = t_v and d_q = d_k. The outputs of self-attention are computed as

Attention(Q, K, V) = softmax(QK^T / √d_k) V    (1)

where the scalar 1/√d_k is used to prevent the softmax function from entering regions that have very small gradients. Then, we introduce the multi-head attention process as follows:

MultiHead(Q, K, V) = Concat(head_1, ···, head_h) W^O    (2)

head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V).    (3)
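To make Eqs. (1)–(3) concrete, the following is a minimal NumPy sketch of scaled dot-product attention and multi-head attention. The function names, the per-head weight lists W_q, W_k, W_v, and the output projection W_o are illustrative assumptions rather than the authors' implementation; the optional additive mask anticipates the causal mask M used later in Eq. (7).

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V, mask=None):
    # Eq. (1): Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V.
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    if mask is not None:
        scores = scores + mask  # additive mask, e.g., -inf above the diagonal for causality
    return softmax(scores, axis=-1) @ V

def multi_head(Q, K, V, W_q, W_k, W_v, W_o):
    # Eqs. (2)-(3): per-head projections, attention, concatenation, output projection.
    heads = [attention(Q @ wq, K @ wk, V @ wv) for wq, wk, wv in zip(W_q, W_k, W_v)]
    return np.concatenate(heads, axis=-1) @ W_o

# Toy usage: 2 heads, sequence length 5, model width 8.
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))
W_q = [rng.normal(size=(8, 4)) for _ in range(2)]
W_k = [rng.normal(size=(8, 4)) for _ in range(2)]
W_v = [rng.normal(size=(8, 4)) for _ in range(2)]
W_o = rng.normal(size=(8, 8))
out = multi_head(x, x, x, W_q, W_k, W_v, W_o)  # shape (5, 8)
```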
3.1 Multi-agent decision transformer

Algorithm 1 (final update steps):
9) Update θ with θ = arg max_{θ′} (1/C) Σ_{t=1}^{C} P(a_i^t) log P(â_i^t | τ_i^{<t}; θ′)
10) end for
11) end for

Algorithm 1 shows the offline training process for a single task of MADT, in which we autoregressively encode the trajectories from the offline datasets in offline pre-trained MARL and train the transformer-based network with supervised learning. We carefully reformulate the trajectories as the inputs of the causal transformer so that they differ from those in the Decision Transformer[19]: we deprecate the reward-to-go and the actions that are encoded together with states in the single-agent DT. We will interpret the reason for this in the next section. Similar to seq2seq models, MADT is based on an autoregressive architecture with reformulated sequential inputs across timescales. The left part of Fig. 2 shows the architecture. The causal transformer encodes agent i′s trajectory sequence τ_i^t at time step t into a hidden representation h_i^t = (h_1, h_2, ···, h_l) with a dynamic mask. Given h^t, the output at time step t is based on the previous data, and the model then consumes the previously emitted actions as additional inputs when predicting a new action.

(Fig. 2: offline-dataset token sequence a_{t−1}, o_t, a_t, o_{t+1}, ··· for each of the n agents, with available actions A and state s.)

Trajectories reformulation as input. We model the lowest granularity at each time step as a modelling unit x_t from the static offline dataset for a concise representation. MARL has many elements, such as ⟨global_state, local_observation⟩, that differ from the single-agent case. It is reasonable for sequential modelling methods to model them in an MDP. Therefore, we formulate the trajectory as follows:

τ^i = (x_1, ···, x_t, ···, x_C), where x_t = (s_t, o_t^i, a_t^i)

where s_t denotes the global shared state, o_t^i denotes the individual observation of agent i at time step t, and a_t^i denotes the action. We regard x_t as a token and process the whole sequence similarly to the scheme in language modelling.

Output sequence construction. To bridge the gap between training with the whole context trajectory and testing with only previous data, we mask the context data to autoregressively produce the output at time step t using only the previous data in ⟨1, ···, t − 1⟩. Therefore, MADT predicts the sequential actions at each time step using the decoder as follows:

y = a_t = arg max_a p_θ(a | τ, a_1, ···, a_{t−1})    (6)

where θ denotes the parameters of MADT, τ denotes the trajectory including the global state s and local observation o before time step t, and p_θ is the distribution over the legal action space under the available action v.
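The following is a minimal sketch of the greedy decoding rule in Eq. (6): at evaluation time, the action at each step is the arg max of the predicted distribution over legal actions, and the emitted action is appended to the context for the next step. The model call and its return format are illustrative assumptions, not the released interface.

```python
import numpy as np

def decode_episode(madt, states, observations, avail_actions):
    """Autoregressively pick a_t = argmax_a p_theta(a | tau, a_1..a_{t-1}), as in Eq. (6)."""
    actions = []
    for t in range(len(observations)):
        # The model conditions on the trajectory so far plus previously emitted actions.
        logits = madt(states[: t + 1], observations[: t + 1], actions)  # (n_actions,)
        logits = np.where(avail_actions[t] > 0, logits, -np.inf)        # keep only legal actions
        actions.append(int(np.argmax(logits)))
    return actions
```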
Core module description. MADT differs from the transformers in conventional sequence modelling tasks that take inputs with position encoding and decode the encoded hidden representation autoregressively. We use a masking mechanism with a lower triangular matrix to compute the attention:

Attention(Q, K, V) = softmax(QK^T / √d_k + M) V    (7)

where M is the mask matrix that ensures that the input at time step t can only correlate with the inputs from ⟨1, ···, t − 1⟩. We employ the cross-entropy (CE) as the total sequential prediction loss and utilize the available action v to ensure agents take illegal actions with a probability of zero. The CE loss can be represented as follows:

L_CE(θ) = −(1/C) Σ_{t=1}^{C} P(a_t) log P(â_t | τ_t, â_{<t}; θ)    (8)

where C is the context length, a_t is the ground-truth action, τ_t includes {s_{1:t}, o_{1:t}}, and â denotes the output of MADT. The cross-entropy loss shown above aims to minimize the distribution distance between the prediction and the ground truth.
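As a concrete illustration of the trajectory reformulation and of Eqs. (7)–(8), the sketch below shows one offline training step in PyTorch: each token x_t = (s_t, o_t^i, a_t^i) is embedded by the model, a lower-triangular mask enforces causality, illegal actions are muted using the available-action vector v, and the cross-entropy of the ground-truth actions is minimized. The model interface (madt), the batch layout, and the tensor shapes are assumptions for illustration, not the released implementation.

```python
import torch
import torch.nn.functional as F

def causal_mask(T, device=None):
    # Lower-triangular mask M of Eq. (7): 0 on/below the diagonal, -inf above it.
    m = torch.full((T, T), float("-inf"), device=device)
    return torch.triu(m, diagonal=1)

def offline_madt_step(madt, optimizer, batch):
    """One supervised pre-training step on a batch of reformulated trajectories.

    batch["state"]:  (B, T, ds)  global state s_t
    batch["obs"]:    (B, T, do)  local observation o_t^i of one agent
    batch["action"]: (B, T)      ground-truth discrete action a_t^i
    batch["avail"]:  (B, T, A)   available-action vector v (1 = legal, 0 = illegal)
    """
    B, T = batch["action"].shape
    # The causal transformer maps the token sequence x_t = (s_t, o_t^i) to action logits.
    logits = madt(batch["state"], batch["obs"], attn_mask=causal_mask(T))  # (B, T, A)
    # Force illegal actions to probability zero before the cross-entropy of Eq. (8).
    logits = logits.masked_fill(batch["avail"] == 0, float("-inf"))
    loss = F.cross_entropy(logits.reshape(B * T, -1), batch["action"].reshape(B * T))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```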
3.2 Multi-agent decision transformer with PPO

The method above can fit the data distribution well, owing to the sequential modelling capacity of the transformer. However, it fails to work well when pre-training on the offline datasets and then improving continually by interacting with the online environment. The reason is the mismatch between the objectives of the offline and online phases. In the offline stage, the imitation-based objective conforms to a supervised learning style in MADT and ignores measuring each action with a value model. When the pre-trained model is loaded to interact with the online environment, the buffer will only collect actions conforming to the distributions of the offline datasets rather than those corresponding to a high reward at the current state. That means the pre-trained policy is encouraged to choose an action identical to the distribution in the offline dataset, even though it leads to a low reward. Therefore, we need to design another paradigm, MADT-PPO, to integrate RL and supervised learning for fine-tuning in Algorithm 2. Fig. 2 shows the pre-training and fine-tuning framework. A direct method is to share the pre-trained model across each agent and implement the REINFORCE algorithm[56]. However, using only actors results in higher variance, and the employment of a critic to assess state values is necessary. Therefore, in online MARL, we leverage an extension of PPO, the state-of-the-art algorithm on tasks of StarCraft, the multi-agent particle environment (MPE), and even the return-based game Hanabi[57]. In the offline stage, we adopt the strategy mentioned before to pre-train an offline policy π(o_i, a_i) for each agent and additionally use the global state to pre-train a centralized critic V_ϕ(s). In the fine-tuning stage, we first load the offline pre-trained shared policy as each agent′s online initial policy π_i(o_i, a_i). When the critic is pre-trained, we instantiate the centralized critic with the pre-trained model as V_ϕ(s). To fine-tune the pre-trained multi-agent policy and critic model, multiple agents clear the buffer and interact with the environment to learn the policy via maximizing the following PPO objective:

Σ_{i=1}^{n} E_{s∼ρ_{θ_old}, a∼π_{θ_old}} [min(ω Â(s, a), clip(ω, 1−ϵ, 1+ϵ) Â(s, a))]    (9)

where ω = π_θ(o_i, a_i) / π_{θ_old}(o_i, a_i) denotes the importance weight used in PPO-style algorithms, Â denotes the advantage function computed from the reward and the critic model, and ϵ denotes the clip parameter. The detailed fine-tuning pipeline can be found in Algorithm 2.

Algorithm 2. MADT-Online: Multi-agent decision transformer with PPO
1) Input: Offline dataloader D, pre-trained MADT policy with parameter θ
2) Initialize θ and ϕ as the parameters of an actor π_θ(a_i|o_i) and a critic V_ϕ(s), respectively, which can be inherited directly from the pre-trained models
3) Initialize n as the agent number, γ as the discount factor, and ϵ as the clip ratio
4) for τ = {τ_1, ···, τ_i, ···, τ_n} in D do
5) Sample τ_i = {s_t, o_t^i, a_t^i}_{t∈1:C} as the ground truth, where C is the context length
6) Compute the advantage function Â(s, a_i) = Σ_t γ^t r(s, a_i) − V_ϕ(s)
7) Compute the importance weight ω = π_θ(o_i, a_i) / π_{θ_old}(o_i, a_i)
8) Update θ_i for i ∈ 1, ···, n via:
9) θ_i = arg max E_{s∼ρ_{θ_old}, a∼π_{θ_old}}[clip(ω, 1 − ϵ, 1 + ϵ) Â(s, a_i)]
10) Compute the MSE loss L_ϕ = (1/2)[Σ_t γ^t r_t − V_ϕ(s)]^2
11) Update the critic network via ϕ = arg min_ϕ L_ϕ
12) end for
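The following is a minimal sketch of the clipped-PPO update of Eq. (9) together with the advantage and critic regression of steps 6), 10) and 11). The advantage here is the simple return-minus-value estimate used in Algorithm 2; the actor/critic module interfaces and the flat batch layout are assumptions for illustration.

```python
import torch

def discounted_returns(rewards, gamma):
    # G_t = sum_{k>=t} gamma^{k-t} r_k, computed backwards over one episode (shape (T,)).
    G, out = 0.0, torch.zeros_like(rewards)
    for t in range(rewards.shape[0] - 1, -1, -1):
        G = rewards[t] + gamma * G
        out[t] = G
    return out

def madt_ppo_update(actor, critic, actor_opt, critic_opt, obs, state, actions,
                    rewards, old_log_probs, gamma=0.99, eps=0.2):
    returns = discounted_returns(rewards, gamma)
    values = critic(state).squeeze(-1)            # V_phi(s)
    advantages = (returns - values).detach()      # A_hat(s, a_i), as in step 6)

    # Clipped surrogate objective of Eq. (9) for the shared policy.
    log_probs = actor(obs).log_prob(actions)      # log pi_theta(a_i | o_i)
    ratio = torch.exp(log_probs - old_log_probs)  # importance weight omega, step 7)
    surrogate = torch.min(ratio * advantages,
                          torch.clamp(ratio, 1 - eps, 1 + eps) * advantages)
    actor_loss = -surrogate.mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # Critic MSE regression of step 10): (G_t - V_phi(s))^2.
    critic_loss = ((returns - critic(state).squeeze(-1)) ** 2).mean()
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()
    return actor_loss.item(), critic_loss.item()
```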
3.3 Universal model across scenarios

To train a universal policy for each of the scenarios in SMAC, which might vary in agent number, feature space, action space, and reward ranges, we consider the modifications listed below.

Parameter sharing across agents. When offline examples are collected from multiple tasks, or the test phase involves a different number of agents from the offline datasets, the difference in agent numbers across tasks makes it intractable to decide the number of actors. Thus, we share the parameters across all actors with one model and attach one-hot agent IDs to the observations for compatibility with a variable number of agents.

Feature encoding. When the policy needs to generalize to new scenarios with different feature shapes, we propose encoding all features into a universal space by padding zeros at the end and mapping them to a low-dimensional space with fully connected networks.

Action masking. Another issue is the different action spaces across scenarios. For example, fewer enemies in a scenario means fewer potential attack options and hence fewer available actions. Therefore, an extra vector is utilized to mute the unavailable actions so that their probabilities are always zero during both the learning and evaluation processes.

Reward scaling. Different scenarios might vary in reward ranges and lead to unbalanced models during multi-task offline learning. To balance the influence of examples from different scenarios, we scale their rewards to the same range to ensure that the output models have comparable performance across different tasks.
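A minimal sketch of the modifications above (shared parameters with one-hot agent IDs, zero-padded feature encoding, action masking, and reward scaling); the dimension names and the unified observation/action sizes are illustrative assumptions, and the [0, 1] scaling range is only one possible choice of common range.

```python
import numpy as np

def encode_observation(obs, agent_id, max_n_agents, max_obs_dim):
    # Zero-pad per-scenario features to a universal size and append a one-hot agent ID,
    # so a single shared policy can serve a variable number of agents.
    padded = np.zeros(max_obs_dim, dtype=np.float32)
    padded[: obs.shape[0]] = obs
    one_hot = np.eye(max_n_agents, dtype=np.float32)[agent_id]
    return np.concatenate([padded, one_hot])

def mask_action_logits(logits, avail_actions, max_n_actions):
    # Pad the scenario's action space to the universal size and mute unavailable actions,
    # so their probabilities are exactly zero after the softmax.
    masked = np.full(max_n_actions, -np.inf, dtype=np.float32)
    masked[: logits.shape[0]] = np.where(avail_actions > 0, logits, -np.inf)
    return masked

def scale_reward(reward, reward_min, reward_max):
    # Map per-scenario rewards onto a common range for balanced multi-task learning.
    return (reward - reward_min) / max(reward_max - reward_min, 1e-8)
```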
4 Experiments

We show three experimental settings: offline MARL, online MARL by loading the pre-trained model, and few-shot or zero-shot offline learning. For offline MARL, we verify the performance of our method by pre-training the policy and directly testing it on the corresponding maps. To demonstrate the capacity of the pre-trained policy on the original or new scenarios, we then demonstrate fine-tuning in the online environment. Experimental results in offline MARL show that our MADT-offline in Section 3.1 outperforms the state-of-the-art methods. Furthermore, MADT-online in Section 3.2 can improve the sample efficiency across multiple scenarios. Besides, the universal MADT trained from multi-task data with MADT-online generalizes well to each scenario in a few-shot or even zero-shot setting.

4.1 Offline datasets

The offline datasets are collected from the running policy, MAPPO[58], on the well-known SMAC task[26]. Each dataset contains a large number of trajectories τ := (s_t, o_t, a_t, r_t, done_t, v_t)_{t=1}^{T}. Different from D4RL[59], our datasets conform to the properties of Dec-POMDPs, which involve local observations and available actions for each agent. In the appendix, we list the statistical properties of the offline datasets in Tables A1 and A2.
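For concreteness, one way to represent and iterate the per-step records τ = (s_t, o_t, a_t, r_t, done_t, v_t) described above is sketched below; the field layout and the episode-splitting helper are assumptions about how such a Dec-POMDP dataset could be organized, not a specification of the released files.

```python
from dataclasses import dataclass
from typing import List
import numpy as np

@dataclass
class Step:
    state: np.ndarray          # global state s_t
    obs: np.ndarray            # per-agent local observations o_t, shape (n_agents, obs_dim)
    actions: np.ndarray        # per-agent actions a_t, shape (n_agents,)
    reward: float              # shared team reward r_t
    done: bool                 # episode-termination flag done_t
    avail_actions: np.ndarray  # per-agent available-action vectors v_t, shape (n_agents, n_actions)

def split_episodes(steps: List[Step]) -> List[List[Step]]:
    # Cut the flat record stream into episodes at the done flags.
    episodes, current = [], []
    for s in steps:
        current.append(s)
        if s.done:
            episodes.append(current)
            current = []
    if current:
        episodes.append(current)
    return episodes
```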
4.2 Offline multi-agent reinforcement learning

In this experiment, we aim to validate the effectiveness of the offline version of MADT in Section 3.1 as a framework for offline MARL on the static offline datasets. We train a policy on the offline datasets of various qualities and then apply it to an online environment, StarCraft[26]. There are also baselines under this setting, such as behavior cloning (BC), a kind of imitation learning method showing stable performance in single-agent offline RL. In addition, we employ the conventional, effective single-agent offline RL algorithms BCQ[32], CQL[31], and ICQ[35], and then use the extension method of simply mixing each agent's value network proposed by [35] for the multi-agent setting, denoted as "xx-MA". We compare the performance of the offline version of MADT with the abovementioned offline RL algorithms under online evaluation in the MARL environment. To verify the quality of our collected datasets, we chose data from different levels and trained the baselines as well as our MADT. Fig. 3 shows the overall performance on datasets of various qualities. The baseline methods enhance their performance stably, indicating the quality of our offline datasets. Furthermore, our MADT outperforms the offline MARL baselines and converges faster across easy, hard, and super hard maps (2s3z, 3s5z, 3s5z VS. 3s6z, corridor). From the initial performance in the evaluation period, our pre-trained model gives a higher return than the baselines in each task. Besides, our model can surpass the average performance in the offline dataset.

4.3 Offline pre-training and online fine-tuning

The experiments designed in this subsection intend to answer the question: Is the pre-training process necessary for online MARL? First, we compare the online version of MADT in Section 3.2 with and without loading the pre-trained model. If MADT is trained only on online experience, we can view it as a transformer-based MAPPO that replaces the actor and critic backbone networks with the transformer.
(Fig. 3 plots: average return vs. timesteps; legend: MADT, BC, BCQ-MA, CQL-MA, ICQ-MA; panels (a) 2s3z (Easy), (b) 3s5z (Hard), (c) 3s5z VS. 3s6z (Super hard), (d) Corridor (Super hard).)
Fig. 3 Performance of offline MADT compared with baselines on four easy or (super-)hard SMAC maps. The dotted lines represent the
mean values in the training set. Columns (a)–(d) are average returns from (poor, medium, good) datasets from top to the bottom.
Furthermore, we validate that our framework, MADT with the pre-trained model, can improve sample efficiency on most easy, hard, and super hard maps.

Necessity of the pre-trained model. We train our MADT on the datasets collected from a map and fine-tune it on the same map online with the MAPPO algorithm. For a fair comparison, we use the transformer as both the actor and critic networks, with and without the pre-trained model. Primarily, we choose three maps from the easy, hard, and super hard categories to validate the effectiveness of the pre-trained model in Fig. 4. Experimental results show that the pre-trained model converges faster than the algorithm trained from scratch, especially on challenging maps.

Improving sample efficiency. To validate the sample efficiency improvement obtained by loading our pre-trained MADT and fine-tuning it with MAPPO, we compare the overall framework with the state-of-the-art algorithm MAPPO[58] without the pre-training phase. We measure the sample efficiency in terms of the time to threshold mentioned in [60], which denotes the number of online interactions (timesteps) needed to achieve a predefined threshold, reported in Table 1; our pre-trained model needs far fewer interactions than the traditional MAPPO to achieve the same win rate.
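The time-to-threshold metric reported in Table 1 can be computed as below from an evaluation curve of (timestep, win rate) pairs; this small helper is only an illustration of the metric, under the assumption that win rates are evaluated periodically during training.

```python
def samples_to_threshold(timesteps, win_rates, threshold):
    """Return the number of online interactions needed to first reach `threshold`
    win rate, or None if the policy never reaches it (the "infinity" case in Table 1)."""
    for t, w in zip(timesteps, win_rates):
        if w >= threshold:
            return t
    return None

# Example: a pre-trained policy that already starts above 20% needs 0 extra samples
# (the "-" entries in Table 1).
print(samples_to_threshold([0, 8e3, 4e4], [0.25, 0.4, 0.6], 0.2))  # -> 0
```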
4.4 Generalization with multi-task pre-training

Experiments in this section explore the transferability of the universal MADT mentioned in Section 3.3, which is pre-trained with mixed data from multiple tasks. Depending on whether the downstream tasks have been seen or not, the few-shot experiments are designed to validate the adaptability to the seen tasks, whereas the zero-shot experiments are designed for the held-out maps.

Few-shot learning. The results in Fig. 5(a) show that our method can utilize multi-task datasets to train a universal policy and generalize well to all tasks. The pre-trained MADT can achieve higher returns than the model trained from scratch when we limit the interactions with the environment.

Zero-shot learning. Fig. 5(b) shows that our universal MADT can surprisingly improve performance on downstream tasks even if they have not been seen before (3 stalkers VS. 4 zealots).

4.5 Ablation study

The experiments in this subsection are designed to answer the following research questions: RQ1: Why should we choose MAPPO for the online phase? RQ2: Which kind of input should be used to make the pre-trained model beneficial for online MARL? RQ3: Why cannot the offline version of MADT be improved in the online fine-tuning period after pre-training?

Suitable online algorithm. Although the selection of the MARL algorithm for the online phase should be flexible according to specific tasks, we design experiments to answer RQ1 here. As discussed in Section 3, we can train a Decision Transformer for each agent and fine-tune it online with an MARL algorithm. An intuitive method is to load the pre-trained transformer and take it as the policy network for fine-tuning with a policy gradient method, e.g., REINFORCE[56].
(Fig. 4 plots: average return vs. timesteps; legend: online MARL w/ pre-trained model, online MARL w/o pre-trained model; panels (a) So many baneling (Easy), (b) 5m VS. 6m (Hard), (c) 3s5z VS. 3s6z (Super hard).)
Fig. 4 Average returns with and without the pre-trained model
Table 1 Number of online interactions needed to achieve win rates of 20%, 40%, 60%, 80% and 100% for the training policy (MAPPO / pre-trained MADT). "–" means no more samples are needed to reach the target win rate; "∞" means the policy cannot reach the target win rate.

Maps | 20% | 40% | 60% | 80% | 100%
So many baneling (Easy) | 3.2E+4/8E+3 | 1E+5/4E+4 | 3.2E+5/7E+4 | 5E+5/8E+4 | 1E+6/6.4E+5
Bane VS. bane (Easy) | 3.2E+3/– | 3.2E+3/– | 3.2E+5/– | 4E+5/– | 5.6E+5/–
10m VS. 11m (Hard) | 2E+5/– | 3.5E+5/– | 4E+5/2.8E+4 | 1.7E+6/1.2E+5 | 4E+6/2.5E+5
Corridor (Super hard) | 1.5E+6/– | 1.8E+6/– | 2E+6/– | 2.8E+6/– | 7.8E+6/4E+5
(Fig. 5 plots: (a) average returns on the maps 2m VS. 1z, 2s VS. 1sc, 3m, 3s VS. 3z, 3s VS. 4z for the policy fine-tuned from the universal MADT vs. trained from scratch; (b) average return vs. timesteps on the held-out map for the universal MADT (ours) vs. training from scratch.)
Fig. 5 Few-shot and zero-shot validation results. (a) shows the average returns of the universal MADT pre-trained from all five tasks
data and the policy trained from scratch, individually. We limit the environment interaction to 2.5 M steps. (b) shows the average
returns of a held-out map (3s VS. 4z ), where the universal MADT is trained from data on (2m VS. 1z , 2s VS. 1sc, 3m , 3s VS. 3z ).
However, due to the high variance issue mentioned in Section 3.2, we choose MAPPO as the online algorithm and compare its performance in improving the sample efficiency during the online period in Fig. 6(a).
(Fig. 6 plots: average return vs. timesteps; (a) MADT with MAPPO vs. MADT with REINFORCE; (b) input formulations: state only, state & action, reward-to-go & state & action; (c) MADT with MAPPO for fine-tuning vs. pure MADT for fine-tuning.)
Fig. 6 Ablation results on a hard map, 5m_vs_6m , for validating the necessity of (a) MAPPO in MADT-online, (b) Input formulation,
(c) online version of MADT.
Dropping reward-to-go in MADT. To answer RQ2, we compare different inputs embedded into the transformer, including combinations of state, reward-to-go, and action. We find reward-to-go harmful to online fine-tuning performance, as shown in Fig. 6(b). We attribute this to the mismatch between the reward-to-go distributions of the offline data and the online samples. That is, the rewards of online samples are usually lower than those of offline data due to stochastic exploration at the beginning of the online phase. This deteriorates the fine-tuning capability of the pre-trained model, and based on Fig. 6(b), we only choose states as our inputs for pre-training and fine-tuning.

Integrating online MARL with MADT. To answer RQ3, we directly apply the offline version of MADT for pre-training and fine-tune it online. However, Fig. 6(c) shows that it cannot be improved during the online phase. We attribute this to the absence of any incentive to chase higher rewards, and conclude that offline MADT is supervised learning and tends to fit its collected experience even when the rewards are unsatisfactory.

5 Conclusions

In this work, we propose MADT, an offline pre-trained model for MARL, which integrates the transformer to improve sample efficiency and generalizability in tackling SMAC tasks. MADT learns a big sequence model that outperforms the state-of-the-art methods in offline settings, including BC, BCQ, CQL, and ICQ. When applied in online settings, the pre-trained MADT can drastically improve the sample efficiency. We applied MADT to train a generalizable policy over a series of SMAC tasks and then evaluated its performance under both few-shot and zero-shot settings. The results demonstrate that the pre-trained MADT policy adapts quickly to new tasks and improves performance on different downstream tasks. To the best of our knowledge, this is the first work that demonstrates the effectiveness of offline pre-training and the effectiveness of sequence modelling through transformer architectures in the context of MARL.

Appendix A Properties of datasets

We list the properties of our offline datasets in Tables A1 and A2.

Appendix B Details of hyper-parameters

Details of the hyper-parameters used for MADT experiments are listed in Tables B1–B5.
Table A1 Properties of our offline dataset collected from the experience of multi-agent PPO on the easy maps of SMAC

Table A2 Properties of our offline dataset collected from the experience of multi-agent PPO on the hard and super hard maps of SMAC

Table B1 Common hyper-parameters for all MADT experiments for pre-training on a map, taking 3m (easy) as an example
Table B5 Hyper-parameters for MADT experiments in Fig. 5(b)

Hyper-parameter | Value
Offline_map_lists | [2m_vs_1z, 3m, 2s_vs_1sc, 3s_vs_3z]
Offline_episode_num | [250, 250, 250, 250]
Offline_lr | 5E–4
Online_lr | 1E–4

References

[1] Y. D. Yang, J. Wang. An overview of multi-agent reinforcement learning from game theoretical perspective. [Online], Available: https://arxiv.org/abs/2011.00583, 2020.
[2] S. Shalev-Shwartz, S. Shammah, A. Shashua. Safe, multi-agent, reinforcement learning for autonomous driving. [Online], Available: https://arxiv.org/abs/1610.03295, 2016.
[6] Y. D. Yang, R. Luo, M. N. Li, M. Zhou, W. N. Zhang, J. Wang. Mean field multi-agent reinforcement learning. In Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, pp. 5571–5580, 2018.
[7] Y. D. Yang, L. T. Yu, Y. W. Bai, Y. Wen, W. N. Zhang, J. Wang. A study of AI population dynamics with million-agent reinforcement learning. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, ACM, Stockholm, Sweden, pp. 2133–2135, 2018.
[8] P. Peng, Y. Wen, Y. D. Yang, Q. Yuan, Z. K. Tang, H. T. Long, J. Wang. Multiagent bidirectionally-coordinated nets: Emergence of human-level coordination in learning to play StarCraft combat games. [Online], Available: https://arxiv.org/abs/1703.10069, 2017.
[9] M. Zhou, Z. Y. Wan, H. J. Wang, M. N. Wen, R. Z. Wu, Y. Wen, Y. D. Yang, W. N. Zhang, J. Wang. MALib: A parallel framework for population-based multi-agent reinforcement learning. [Online], Available: https://arxiv.org/abs/2106.07551, 2021.
[10] X. T. Deng, Y. H. Li, D. H. Mguni, J. Wang, Y. D. Yang. On the complexity of computing Markov perfect equilibrium in general-sum stochastic games. [Online], Available: https://arxiv.org/abs/2109.01795, 2021.
[11] T. Haarnoja, A. Zhou, K. Hartikainen, G. Tucker, S. Ha, J. Tan, V. Kumar, H. Zhu, A. Gupta, P. Abbeel, S. Levine. Soft actor-critic algorithms and applications. [Online], Available: https://arxiv.org/abs/1812.05905, 2018.
[12] R. Munos, T. Stepleton, A. Harutyunyan, M. G. Bellemare. Safe and efficient off-policy reinforcement learning. In Proceedings of the 30th Conference on Neural Information Processing Systems, Barcelona, Spain, pp. 1054–1062, 2016.
[13] L. Espeholt, R. Marinier, P. Stanczyk, K. Wang, M. Michalski. SEED RL: Scalable and efficient deep-RL with accelerated central inference. In Proceedings of the 8th International Conference on Learning Representations, Addis Ababa, Ethiopia, 2020.
[14] L. Espeholt, H. Soyer, R. Munos, K. Simonyan, V. Mnih, T. Ward, Y. Doron, V. Firoiu, T. Harley, I. Dunning, S. Legg, K. Kavukcuoglu. IMPALA: Scalable distributed deep-RL with importance weighted actor-learner architectures. In Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, pp. 1407–1416, 2018.
[15] K. M. He, X. L. Chen, S. N. Xie, Y. H. Li, P. Dollár, R. Girshick. Masked autoencoders are scalable vision learners. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, New Orleans, USA, pp. 15979–15988, 2022. DOI: 10.1109/CVPR52688.2022.01553.
[16] Z. Liu, Y. T. Lin, Y. Cao, H. Hu, Y. X. Wei, Z. Zhang, S. Lin, B. N. Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of IEEE/CVF International Conference on Computer Vision, IEEE, Montreal, Canada, pp. 9992–10002, 2021. DOI: 10.1109/ICCV48922.2021.00986.
[17] S. Kim, J. Kim, H. W. Chun. Wave2Vec: Vectorizing electroencephalography bio-signal for prediction of brain disease. International Journal of Environmental Research and Public Health, vol. 15, no. 8, Article number 1750, 2018. DOI: 10.3390/ijerph15081750.
[18] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, D. Amodei. Language models are few-shot learners. In Proceedings of the 34th International Conference on Neural Information Processing Systems, Vancouver, Canada, Article number 159, 2020. DOI: 10.5555/3495724.3495883.
[19] L. L. Chen, K. Lu, A. Rajeswaran, K. Lee, A. Grover, M. Laskin, P. Abbeel, A. Srinivas, I. Mordatch. Decision transformer: Reinforcement learning via sequence modeling. [Online], Available: https://arxiv.org/abs/2106.01345, 2021.
[20] Y. D. Yang, J. Luo, Y. Wen, O. Slumbers, D. Graves, H. bou Ammar, J. Wang, M. E. Taylor. Diverse auto-curriculum is critical for successful real-world multiagent learning systems. In Proceedings of the 20th International Conference on Autonomous Agents and Multi-agent Systems, ACM, pp. 51–56, 2021.
[21] N. Perez-Nieves, Y. D. Yang, O. Slumbers, D. H. Mguni, Y. Wen, J. Wang. Modelling behavioural diversity for learning in open-ended games. In Proceedings of the 38th International Conference on Machine Learning, pp. 8514–8524, 2021.
[22] X. Y. Liu, H. T. Jia, Y. Wen, Y. J. Hu, Y. F. Chen, C. J. Fan, Z. P. Hu, Y. D. Yang. Unifying behavioral and response diversity for open-ended learning in zero-sum games. In Proceedings of the 35th Conference on Neural Information Processing Systems, pp. 941–952, 2021.
[23] S. Levine, A. Kumar, G. Tucker, J. Fu. Offline reinforcement learning: Tutorial, review, and perspectives on open problems. [Online], Available: https://arxiv.org/abs/2005.01643, 2020.
[24] R. Sanjaya, J. Wang, Y. D. Yang. Measuring the non-transitivity in chess. Algorithms, vol. 15, no. 5, Article number 152, 2022. DOI: 10.3390/a15050152.
[25] X. D. Feng, O. Slumbers, Y. D. Yang, Z. Y. Wan, B. Liu, S. McAleer, Y. Wen, J. Wang. Discovering multi-agent auto-curricula in two-player zero-sum games. [Online], Available: https://arxiv.org/abs/2106.02745, 2021.
[26] M. Samvelyan, T. Rashid, C. S. de Witt, G. Farquhar, N. Nardelli, T. G. J. Rudner, C. M. Hung, P. H. S. Torr, J. Foerster, S. Whiteson. The StarCraft multi-agent challenge. In Proceedings of the 18th International Conference on Autonomous Agents and MultiAgent Systems, Montreal, Canada, pp. 2186–2188, 2019.
[27] Z. Li, S. R. Xue, X. H. Yu, H. J. Gao. Controller optimization for multirate systems based on reinforcement learning. International Journal of Automation and Computing, vol. 17, no. 3, pp. 417–427, 2020. DOI: 10.1007/s11633-020-1229-0.
[28] Y. Li, D. Xu. Skill learning for robotic insertion based on one-shot demonstration and reinforcement learning. International Journal of Automation and Computing, vol. 18, no. 3, pp. 457–467, 2021. DOI: 10.1007/s11633-021-1290-3.
[29] C. Berner, G. Brockman, B. Chan, V. Cheung, P. Debiak, C. Dennison, D. Farhi, Q. Fischer, S. Hashme, C. Hesse, et al. Dota 2 with large scale deep reinforcement learning. [Online], Available: https://arxiv.org/abs/1912.06680, 2019.
[30] A. Kumar, J. Fu, G. Tucker, S. Levine. Stabilizing off-policy Q-learning via bootstrapping error reduction. In Proceedings of the 33rd Conference on Neural Information Processing Systems, Vancouver, Canada, pp. 11761–11771, 2019.
[31] A. Kumar, A. Zhou, G. Tucker, S. Levine. Conservative Q-learning for offline reinforcement learning. In Proceedings of the 34th International Conference on Neural Information Processing Systems, Vancouver, Canada, Article number 100, 2020. DOI: 10.5555/3495724.3495824.
[32] S. Fujimoto, D. Meger, D. Precup. Off-policy deep reinforcement learning without exploration. In Proceedings of the 36th International Conference on Machine Learning, Long Beach, USA, pp. 2052–2062, 2019.
[33] T. Matsushima, H. Furuta, Y. Matsuo, O. Nachum, S. X. Gu. Deployment-efficient reinforcement learning via model-based offline optimization. In Proceedings of the 9th International Conference on Learning Representations, 2021.
[34] D. J. Su, J. D. Lee, J. M. Mulvey, H. V. Poor. MUSBO: Model-based uncertainty regularized and sample efficient batch optimization for deployment constrained reinforcement learning. [Online], Available: https://arxiv.org/abs/2102.11448, 2021.
[35] Y. Q. Yang, X. T. Ma, C. H. Li, Z. W. Zheng, Q. Y. Zhang, G. Huang, J. Yang, Q. C. Zhao. Believe what you see: Implicit constraint approach for offline multi-agent reinforcement learning. [Online], Available: https://arxiv.org/abs/2106.03400, 2021.
[36] J. C. Jiang, Z. Q. Lu. Offline decentralized multi-agent reinforcement learning. [Online], Available: https://arxiv.org/abs/2108.01832, 2021.
[37] A. Nair, M. Dalal, A. Gupta, S. Levine. Accelerating online reinforcement learning with offline datasets. [Online], Available: https://arxiv.org/abs/2006.09359, 2020.
[38] M. Janner, Q. Y. Li, S. Levine. Offline reinforcement learning as one big sequence modeling problem. [Online], Available: https://arxiv.org/abs/2106.02039, 2021.
[39] L. C. Dinh, Y. D. Yang, S. McAleer, Z. Tian, N. P. Nieves, O. Slumbers, D. H. Mguni, H. bou Ammar, J. Wang. Online double oracle. [Online], Available: https://arxiv.org/abs/2103.07780, 2021.
[40] D. H. Mguni, Y. T. Wu, Y. L. Du, Y. D. Yang, Z. Y. Wang, M. N. Li, Y. Wen, J. Jennings, J. Wang. Learning in nonzero-sum stochastic games with potentials. In Proceedings of the 38th International Conference on Machine Learning, pp. 7688–7699, 2021.
[41] Y. D. Yang, Y. Wen, J. Wang, L. H. Chen, K. Shao, D. Mguni, W. N. Zhang. Multi-agent determinantal Q-learning. In Proceedings of the 37th International Conference on Machine Learning, pp. 10757–10766, 2020.
[42] T. Rashid, M. Samvelyan, C. Schroeder, G. Farquhar, J. Foerster, S. Whiteson. QMIX: Monotonic value function factorisation for deep multi-agent reinforcement learning. In Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, pp. 4295–4304, 2018.
[43] Y. Wen, Y. D. Yang, R. Luo, J. Wang, W. Pan. Probabilistic recursive reasoning for multi-agent reinforcement learning. In Proceedings of the 7th International Conference on Learning Representations, New Orleans, USA, 2019.
[44] Y. Wen, Y. D. Yang, J. Wang. Modelling bounded rationality in multi-agent interactions by generalized recursive reasoning. In Proceedings of the 29th International Joint Conference on Artificial Intelligence, Yokohama, Japan, pp. 414–421, 2020. DOI: 10.24963/ijcai.2020/58.
[45] S. Hu, F. D. Zhu, X. J. Chang, X. D. Liang. UPDeT: Universal multi-agent reinforcement learning via policy decoupling with transformers. [Online], Available: https://arxiv.org/abs/2101.08001, 2021.
[46] K. Son, D. Kim, W. J. Kang, D. E. Hostallero, Y. Yi. QTRAN: Learning to factorize with transformation for cooperative multi-agent reinforcement learning. In Proceedings of the 36th International Conference on Machine Learning, Long Beach, USA, pp. 5887–5896, 2019.
[47] J. G. Kuba, M. N. Wen, L. H. Meng, S. D. Gu, H. F. Zhang, D. H. Mguni, J. Wang, Y. D. Yang. Settling the variance of multi-agent policy gradients. In Proceedings of the 35th Conference on Neural Information Processing Systems, pp. 13458–13470, 2021.
[48] J. G. Kuba, R. Q. Chen, M. N. Wen, Y. Wen, F. L. Sun, J. Wang, Y. D. Yang. Trust region policy optimisation in multi-agent reinforcement learning. In Proceedings of the 10th International Conference on Learning Representations, 2022.
[49] S. D. Gu, J. G. Kuba, M. N. Wen, R. Q. Chen, Z. Y. Wang, Z. Tian, J. Wang, A. Knoll, Y. D. Yang. Multi-agent constrained policy optimisation. [Online], Available: https://arxiv.org/abs/2110.02793, 2021.
[50] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, USA, pp. 6000–6010, 2017. DOI: 10.5555/3295222.3295349.
[51] I. Sutskever, O. Vinyals, Q. V. Le. Sequence to sequence learning with neural networks. In Proceedings of the 27th International Conference on Neural Information Processing Systems, Montreal, Canada, pp. 3104–3112, 2014. DOI: 10.5555/2969033.2969173.
[52] Q. Wang, B. Li, T. Xiao, J. B. Zhu, C. L. Li, D. F. Wong, L. S. Chao. Learning deep transformer models for machine translation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 1810–1822, 2019. DOI: 10.18653/v1/P19-1176.
[53] L. H. Dong, S. Xu, B. Xu. Speech-transformer: A no-recurrence sequence-to-sequence model for speech recognition. In Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, Calgary, Canada, pp. 5884–5888, 2018. DOI: 10.1109/ICASSP.2018.8462506.
[54] K. Han, Y. H. Wang, H. T. Chen, X. H. Chen, J. Y. Guo, Z. H. Liu, Y. H. Tang, A. Xiao, C. J. Xu, Y. X. Xu, Z. H. Yang, Y. M. Zhang, D. C. Tao. A survey on vision transformer. [Online], Available: https://arxiv.org/abs/2012.12556, 2020.
[55] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. H. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, N. Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In Proceedings of the 9th International Conference on Learning Representations, 2020.
[56] R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, vol. 8, no. 3, pp. 229–256, 1992. DOI: 10.1007/BF00992696.
[57] I. Mordatch, P. Abbeel. Emergence of grounded compositional language in multi-agent populations. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence and Thirtieth Innovative Applications of Artificial Intelligence Conference and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence, New Orleans, USA, Article number 183, 2018. DOI: 10.5555/3504035.3504218.
[58] C. Yu, A. Velu, E. Vinitsky, J. X. Gao, Y. Wang, A. Bayen, Y. Wu. The surprising effectiveness of PPO in cooperative, multi-agent games. [Online], Available: https://arx-
Jun Wang received the B. Sc. degree from Southeast University, China in 1997, the M. Sc. degree from National University of Singapore, Singapore in 2003, and the Ph. D. degree from Delft University of Technology, The Netherlands in 2007. He is currently a professor with the Computer Science Department, University College London, UK. He has published over 200 research articles. His team won the First Global Real-Time Bidding Algorithm Contest with more than 80 participants worldwide. He was the winner of multiple best paper awards. He was a recipient of the Beyond Search—Semantic Computing and Internet Economics Award by Microsoft Research and the Yahoo! FREP Faculty Award. He has served as an Area Chair for ACM CIKM and ACM SIGIR. His recent service includes the Co-Chair for Artificial Intelligence, Semantics, and Dialog in ACM SIGIR 2018.
His research interests are in the areas of AI and intelligent systems, covering (multi-agent) reinforcement learning, deep generative models, and their diverse applications on information retrieval, recommender systems and personalization, data mining, smart cities, bot planning, and computational advertising.
E-mail: [email protected] (Corresponding author)
ORCID iD: 0000-0001-9006-7951

Yaodong Yang received the B. Sc. degree in electronic engineering & information science from University of Science and Technology of China, China in 2013, the M. Sc. degree in science (Quantitative Biology/Biostatistics) from Imperial College London, UK in 2014, and the Ph. D. degree in computer science from University College London, UK in 2021. He is a machine learning researcher with ten years of working experience in both academia and industry. Currently, he is an assistant professor at Peking University, China. Before joining Peking University, he was an assistant professor at King′s College London, UK. Before KCL, he was a principal research scientist at Huawei UK. Before Huawei, he was a senior research manager at AIG, working on AI applications in finance. He has maintained a track record of more than forty publications at top conferences and journals, along with the Best System Paper Award at CoRL 2020 and the Best Blue-sky Paper Award at AAMAS 2021.
His research interests include reinforcement learning and multi-agent systems.
E-mail: [email protected]

Bo Xu received the B. Sc. degree in electrical engineering from Zhejiang University, China in 1988, and the M. Sc. and Ph. D. degrees in pattern recognition and intelligent system from Institute of Automation, Chinese Academy of Sciences, China in 1992 and 1997, respectively. He is a professor, the director of Institute of Automation, Chinese Academy of Sciences, China, and also deputy director of Center for Excellence in Brain Science and Intelligence Technology, Chinese Academy of Sciences, China.
His research interests include brain-inspired intelligence, brain-inspired cognitive models, natural language processing and understanding, and brain-inspired robotics.
E-mail: [email protected] (Corresponding author)
ORCID iD: 0000-0002-1111-1529