Offline Pre-trained Multi-agent Decision Transformer
Abstract: Offline reinforcement learning leverages previously collected offline datasets to learn optimal policies with no need to access the real environment. Such a paradigm is also desirable for multi-agent reinforcement learning (MARL) tasks, given the combinatorially increased interactions among agents and with the environment. However, in MARL, the paradigm of offline pre-training with online fine-tuning has not been studied, nor are datasets or benchmarks for offline MARL research available. In this paper, we facilitate the research by providing large-scale datasets and using them to examine the usage of the decision transformer in the context of MARL. We investigate the generalization of MARL offline pre-training in the following three aspects: 1) between single agents and multiple agents, 2) from offline pre-training to online fine-tuning, and 3) to multiple downstream tasks with few-shot and zero-shot capabilities. We start by introducing the first offline MARL dataset with diverse quality levels based on the StarCraft II environment, and then propose the novel architecture of the multi-agent decision transformer (MADT) for effective offline learning. MADT leverages the transformer′s modelling ability for sequence modelling and integrates it seamlessly with both offline and online MARL tasks. A significant benefit of MADT is that it learns generalizable policies that can transfer between different types of agents under different task scenarios. On the StarCraft II offline dataset, MADT outperforms the state-of-the-art offline reinforcement learning (RL) baselines, including BCQ and CQL. When applied to online tasks, the pre-trained MADT significantly improves sample efficiency and enjoys strong performance in both few-shot and zero-shot cases. To the best of our knowledge, this is the first work that studies and demonstrates the effectiveness of offline pre-trained models in terms of sample efficiency and generalizability enhancements for MARL.
Keywords: Pre-training model, multi-agent reinforcement learning (MARL), decision making, transformer, offline reinforcement
learning.
Citation: L. Meng, M. Wen, C. Le, X. Li, D. Xing, W. Zhang, Y. Wen, H. Zhang, J. Wang, Y. Yang, B. Xu. Offline pre-trained multi-
agent decision transformer. Machine Intelligence Research, vol.20, no.2, pp.233–248, 2023. http://doi.org/10.1007/s11633-022-1383-7
of the first models that verifies the possibility of solving conventional (offline) RL problems by generative trajectory modelling, i.e., modelling the joint distribution of the sequence of states, actions, and rewards without temporal difference learning.
The technique of transforming decision-making problems into sequence modelling problems has opened a new gate for solving RL tasks. Crucially, this activates a novel pathway toward training RL systems on diverse datasets[20−22] in much the same manner as in supervised learning, which is often instantiated by offline RL techniques[23]. Offline RL methods have recently attracted tremendous attention since they enable agents to apply self-supervised or unsupervised RL methods in settings where online collection is infeasible. We thus argue that this is particularly important for MARL problems, since online exploration in multi-agent settings may not be feasible in many settings[24], but learning with unsupervised or meta-learned[25] outcome-driven objectives via offline data is still possible. However, it is still unclear whether the effectiveness of sequence modelling through transformer architectures also applies to MARL problems.
In this paper, we propose the multi-agent decision transformer (MADT), an architecture that casts the problem of MARL as conditional sequence modelling. Our mandate is to understand whether the proposed MADT can learn, through pre-training, a generalized policy on offline datasets that can then be effectively used in other downstream environments (known or unknown). As a study example, we specifically focus on the well-known challenge for MARL tasks, the StarCraft multi-agent challenge (SMAC)[26], and demonstrate the possibility of solving multiple SMAC tasks with one big sequence model.
Our contribution is as follows: We propose a series of transformer variants for offline MARL by leveraging the sequential modelling of the attention mechanism. In particular, we validate our pre-trained sequential model in the challenging multi-agent environment for its sample efficiency and transferability. We built a dataset with different skill levels covering different variations of SMAC scenarios. Experimental results on SMAC tasks show that MADT enjoys fast adaptation and superior performance via learning one big sequence model.
The main challenges in our offline pre-training and online fine-tuning problems are the out-of-distribution and training paradigm mismatch problems. We tackle these two problems with the sequential model and pre-train the global critic model offline.

2 Related work

Offline deep reinforcement learning. Recent works have successfully applied RL online in robotics control[27, 28] and game AIs[29]. However, many works attempt to reduce the cost of online interactions by learning with neural networks from an offline dataset, known as offline RL methods[23]. Offline RL methods can be divided into two classes: constraint-based and sequential model-based methods. For the constraint-based methods, a straightforward approach is to adopt an off-policy algorithm and regard the offline dataset as a replay buffer to learn a policy with promising performance. However, the experience in offline datasets and the interactions with online environments have different distributions, which causes overestimation in off-policy (value-based) methods[30]. Substantial works in offline RL aim at resolving the distribution shift between the static offline datasets and the online environment interactions[30−32]. In addition, relying on the dynamic planning ability of the transition model, Matsushima et al.[33, 34] learn different models offline and regularize the policy efficiently. In particular, Yang et al.[35, 36] constrain off-policy algorithms in the multi-agent field. Related to our work on improving sample efficiency, Nair et al.[37] derive the Karush-Kuhn-Tucker (KKT) conditions of the online objective, generating an advantage weight to avoid the out-of-distribution (OOD) problem. For the sequential model-based methods, the Decision Transformer outperforms many state-of-the-art offline RL algorithms by regarding the offline policy training process as sequential modelling and testing it online[19, 38]. In contrast, we present a transformer-based method in the multi-agent field that attempts to transfer across many scenarios without extra constraints. By sharing the sequential model across agents and learning a global critic network offline, we construct a pre-trained multi-agent policy that can be continuously fine-tuned online.
Multi-agent reinforcement learning. As a natural extension of single-agent RL, MARL[1] attracts much attention for solving more complex problems under Markov games. Classic algorithms often assume that multiple agents interact with the environment online and collect the experience to train the joint policy from scratch. Many empirical successes have been demonstrated in solving zero-sum games through MARL methods[8, 39]. When solving decentralized partially observable Markov decision processes (Dec-POMDPs) or potential games[40], the framework of centralized training and decentralized execution (CTDE) is often employed[41−45], where a centralized critic is trained to gather all agents′ local observations and assign credits. While CTDE methods rely on the individual-global-max assumption[46], another thread of work is built on the so-called advantage decomposition lemma[47], which holds in general for any cooperative game; such a lemma leads to provably convergent multi-agent trust-region methods[48] and constrained policy optimization methods[49].
Transformer. The Transformer[50] has achieved a great breakthrough in modelling relations between input and output sequences of variable length for sequence-to-sequence problems[51], especially in machine translation[52] and speech recognition[53]. Recent works even reorganize vision problems as a sequential modelling process and construct state-of-the-art (SOTA) models with pre-training, named vision transformers (ViT)[16, 54, 55].
Due to the Markovian property of trajectories in offline datasets, we can utilize the Transformer as in language modelling. Therefore, the Transformer can bridge the gap between supervised learning in the offline setting and reinforcement learning in online interaction because of its representation capability. We claim that the components in Markov games are sequential, and then utilise the transformer for each agent to fit a transferable MARL policy. Furthermore, we fine-tune the learned policy via trial-and-error.

3 Methodology

In this section, we demonstrate how the transformer is applied to our offline pre-training MARL framework. First, we introduce the typical paradigm and computation process for multi-agent reinforcement learning and the attention-based model. Then, we introduce an offline MARL method, in which the transformer sequentially maps between the local observations and actions of each agent in the offline dataset via parameter sharing. We then leverage the hidden representation as the input of the MADT to minimize the cross-entropy loss. Furthermore, we introduce how to integrate online MARL with MADT in constructing our whole framework to train a universal MARL policy. To accelerate online learning, we load the pre-trained model as a part of the MARL algorithms and learn the policy based on the experience in the latest buffer collected from the online environment. To train a universal MARL policy that quickly adapts to other tasks, we bridge the gap between different scenarios in terms of observations, actions, and available actions, respectively. Fig. 1 overviews our method from the perspective of offline pre-training with supervised learning and online fine-tuning with MARL algorithms. The main contributions of this work are summarized as follows: 1) We constructed an offline dataset for multi-agent offline pre-training on the well-known challenging task, SMAC; 2) To improve the sample efficiency online, we propose fine-tuning the pre-trained multi-agent policy instantiated with the sequence model by sharing the policy among agents, and show the strong capacity of sequence modelling for multi-agent reinforcement learning in the few-shot and zero-shot settings; 3) We propose pre-training an actor and a critic to fine-tune with the policy-based network. In contrast to imitation learning, which only fits a policy network offline, MADT trains the actor and critic offline together and fine-tunes them online in the RL-style training scheme. We also give some empirical conclusions, such as the effect of reward-to-go in the online fine-tuning stage and the multi-task padding method on SMAC.

Fig. 1 Overview of the pipeline for pre-training the general policy and fine-tuning it online

Multi-agent reinforcement learning. For the Markov game, which is a multi-agent extension of the Markov decision process (MDP), there is a tuple representing the essential elements ⟨S, A, R, P, n, γ⟩, where S denotes the state space of the n agents, S_1 × S_2 × ··· × S_n → S, A_i is the action space of each agent i, P : S_i × A_i → PD(S_i) denotes the transition function emitting the distribution over the state space, A is the joint action space, and R_i : S × A_i → R is the reward function of each agent. Each agent takes actions following its policy π(a|s) ∈ Π_i : S → PD(A), where Π_i denotes the policy space of agent i, a ∈ A_i, and s ∈ S_i. Each agent aims to maximize its long-term reward Σ_t γ^t r_i^t, where r_i^t ∈ R_i denotes the reward of agent i at time t and γ denotes the discount factor. In the cooperative setting, we also denote r_i by r, shared among agents, for simplification.

Attention-based model. The attention-based model has shown stable and strong representation capability. The scaled dot-product attention uses the self-attention mechanism demonstrated in [50]. Let Q ∈ R^{t_q × d_q} be the queries, K ∈ R^{t_k × d_k} be the keys, and V ∈ R^{t_v × d_v} be the values, where t_∗ are the element numbers of the different inputs and d_∗ are the corresponding element dimensions. Normally, t_k = t_v and d_q = d_k. The outputs of self-attention are computed as

Attention(Q, K, V) = softmax(QK^T / √d_k) V    (1)

where the scalar 1/√d_k is used to prevent the softmax function from entering regions that have very small gradients. Then, we introduce the multi-head attention process as follows:

MultiHead(Q, K, V) = Concat(head_1, ···, head_h) W^O    (2)

head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V).    (3)
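To make Eqs. (1)–(3) concrete, the following is a minimal NumPy sketch of scaled dot-product attention and multi-head attention. The function names, the per-head weight lists W_q, W_k, W_v, and the output projection W_o are illustrative assumptions rather than the authors' implementation; the optional additive mask anticipates the causal mask M used later in Eq. (7).

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V, mask=None):
    # Eq. (1): Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V.
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    if mask is not None:
        scores = scores + mask  # additive mask, e.g., -inf above the diagonal for causality
    return softmax(scores, axis=-1) @ V

def multi_head(Q, K, V, W_q, W_k, W_v, W_o):
    # Eqs. (2)-(3): per-head projections, attention, concatenation, output projection.
    heads = [attention(Q @ wq, K @ wk, V @ wv) for wq, wk, wv in zip(W_q, W_k, W_v)]
    return np.concatenate(heads, axis=-1) @ W_o

# Toy usage: 2 heads, sequence length 5, model width 8.
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))
W_q = [rng.normal(size=(8, 4)) for _ in range(2)]
W_k = [rng.normal(size=(8, 4)) for _ in range(2)]
W_v = [rng.normal(size=(8, 4)) for _ in range(2)]
W_o = rng.normal(size=(8, 8))
out = multi_head(x, x, x, W_q, W_k, W_v, W_o)  # shape (5, 8)
```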
3.1 Multi-agent decision transformer

Algorithm 1 (final update steps):
9) Update θ with θ = arg max_{θ′} (1/C) Σ_{t=1}^{C} P(a_i^t) log P(â_i^t | τ_i^{<t}; θ′)
10) end for
11) end for

Algorithm 1 shows the offline training process for a single task of MADT, in which we autoregressively encode the trajectories from the offline datasets in offline pre-trained MARL and train the transformer-based network with supervised learning. We carefully reformulate the trajectories as the inputs of the causal transformer so that they differ from those in the Decision Transformer[19]: we deprecate the reward-to-go and the actions that are encoded together with states in the single-agent DT. We will interpret the reason for this in the next section. Similar to seq2seq models, MADT is based on an autoregressive architecture with reformulated sequential inputs across timescales. The left part of Fig. 2 shows the architecture. The causal transformer encodes agent i′s trajectory sequence τ_i^t at time step t into a hidden representation h_i^t = (h_1, h_2, ···, h_l) with a dynamic mask. Given h^t, the output at time step t is based on the previous data, and the model then consumes the previously emitted actions as additional inputs when predicting a new action.

(Fig. 2: offline-dataset token sequence a_{t−1}, o_t, a_t, o_{t+1}, ··· for each of the n agents, with available actions A and state s.)

Trajectories reformulation as input. We model the lowest granularity at each time step as a modelling unit x_t from the static offline dataset for a concise representation. MARL has many elements, such as ⟨global_state, local_observation⟩, that differ from the single-agent case. It is reasonable for sequential modelling methods to model them in an MDP. Therefore, we formulate the trajectory as follows:

τ^i = (x_1, ···, x_t, ···, x_C), where x_t = (s_t, o_t^i, a_t^i)

where s_t denotes the global shared state, o_t^i denotes the individual observation of agent i at time step t, and a_t^i denotes the action. We regard x_t as a token and process the whole sequence similarly to the scheme in language modelling.

Output sequence construction. To bridge the gap between training with the whole context trajectory and testing with only previous data, we mask the context data to autoregressively produce the output at time step t using only the previous data in ⟨1, ···, t − 1⟩. Therefore, MADT predicts the sequential actions at each time step using the decoder as follows:

y = a_t = arg max_a p_θ(a | τ, a_1, ···, a_{t−1})    (6)

where θ denotes the parameters of MADT, τ denotes the trajectory including the global state s and local observation o before time step t, and p_θ is the distribution over the legal action space under the available action v.
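The following is a minimal sketch of the greedy decoding rule in Eq. (6): at evaluation time, the action at each step is the arg max of the predicted distribution over legal actions, and the emitted action is appended to the context for the next step. The model call and its return format are illustrative assumptions, not the released interface.

```python
import numpy as np

def decode_episode(madt, states, observations, avail_actions):
    """Autoregressively pick a_t = argmax_a p_theta(a | tau, a_1..a_{t-1}), as in Eq. (6)."""
    actions = []
    for t in range(len(observations)):
        # The model conditions on the trajectory so far plus previously emitted actions.
        logits = madt(states[: t + 1], observations[: t + 1], actions)  # (n_actions,)
        logits = np.where(avail_actions[t] > 0, logits, -np.inf)        # keep only legal actions
        actions.append(int(np.argmax(logits)))
    return actions
```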
Core module description. MADT differs from the transformers in conventional sequence modelling tasks that take inputs with position encoding and decode the encoded hidden representation autoregressively. We use a masking mechanism with a lower triangular matrix to compute the attention:

Attention(Q, K, V) = softmax(QK^T / √d_k + M) V    (7)

where M is the mask matrix that ensures that the input at time step t can only correlate with the inputs from ⟨1, ···, t − 1⟩. We employ the cross-entropy (CE) as the total sequential prediction loss and utilize the available action v to ensure agents take illegal actions with a probability of zero. The CE loss can be represented as follows:

L_CE(θ) = −(1/C) Σ_{t=1}^{C} P(a_t) log P(â_t | τ_t, â_{<t}; θ)    (8)

where C is the context length, a_t is the ground-truth action, τ_t includes {s_{1:t}, o_{1:t}}, and â denotes the output of MADT. The cross-entropy loss shown above aims to minimize the distribution distance between the prediction and the ground truth.
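As a concrete illustration of the trajectory reformulation and of Eqs. (7)–(8), the sketch below shows one offline training step in PyTorch: each token x_t = (s_t, o_t^i, a_t^i) is embedded by the model, a lower-triangular mask enforces causality, illegal actions are muted using the available-action vector v, and the cross-entropy of the ground-truth actions is minimized. The model interface (madt), the batch layout, and the tensor shapes are assumptions for illustration, not the released implementation.

```python
import torch
import torch.nn.functional as F

def causal_mask(T, device=None):
    # Lower-triangular mask M of Eq. (7): 0 on/below the diagonal, -inf above it.
    m = torch.full((T, T), float("-inf"), device=device)
    return torch.triu(m, diagonal=1)

def offline_madt_step(madt, optimizer, batch):
    """One supervised pre-training step on a batch of reformulated trajectories.

    batch["state"]:  (B, T, ds)  global state s_t
    batch["obs"]:    (B, T, do)  local observation o_t^i of one agent
    batch["action"]: (B, T)      ground-truth discrete action a_t^i
    batch["avail"]:  (B, T, A)   available-action vector v (1 = legal, 0 = illegal)
    """
    B, T = batch["action"].shape
    # The causal transformer maps the token sequence x_t = (s_t, o_t^i) to action logits.
    logits = madt(batch["state"], batch["obs"], attn_mask=causal_mask(T))  # (B, T, A)
    # Force illegal actions to probability zero before the cross-entropy of Eq. (8).
    logits = logits.masked_fill(batch["avail"] == 0, float("-inf"))
    loss = F.cross_entropy(logits.reshape(B * T, -1), batch["action"].reshape(B * T))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```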
3.2 Multi-agent decision transformer with PPO

The method above can fit the data distribution well, owing to the sequential modelling capacity of the transformer. However, it fails to work well when pre-training on the offline datasets and then improving continually by interacting with the online environment. The reason is the mismatch between the objectives of the offline and online phases. In the offline stage, the imitation-based objective conforms to a supervised learning style in MADT and ignores measuring each action with a value model. When the pre-trained model is loaded to interact with the online environment, the buffer will only collect actions conforming to the distributions of the offline datasets rather than those corresponding to a high reward at the current state. That means the pre-trained policy is encouraged to choose an action identical to the distribution in the offline dataset, even though it leads to a low reward. Therefore, we need to design another paradigm, MADT-PPO, to integrate RL and supervised learning for fine-tuning in Algorithm 2. Fig. 2 shows the pre-training and fine-tuning framework. A direct method is to share the pre-trained model across each agent and implement the REINFORCE algorithm[56]. However, using only actors results in higher variance, and the employment of a critic to assess state values is necessary. Therefore, in online MARL, we leverage an extension of PPO, the state-of-the-art algorithm on tasks of StarCraft, the multi-agent particle environment (MPE), and even the return-based game Hanabi[57]. In the offline stage, we adopt the strategy mentioned before to pre-train an offline policy π(o_i, a_i) for each agent and additionally use the global state to pre-train a centralized critic V_ϕ(s). In the fine-tuning stage, we first load the offline pre-trained shared policy as each agent′s online initial policy π_i(o_i, a_i). When the critic is pre-trained, we instantiate the centralized critic with the pre-trained model as V_ϕ(s). To fine-tune the pre-trained multi-agent policy and critic model, multiple agents clear the buffer and interact with the environment to learn the policy via maximizing the following PPO objective:

Σ_{i=1}^{n} E_{s∼ρ_{θ_old}, a∼π_{θ_old}} [min(ω Â(s, a), clip(ω, 1−ϵ, 1+ϵ) Â(s, a))]    (9)

where ω = π_θ(o_i, a_i) / π_{θ_old}(o_i, a_i) denotes the importance weight used in PPO-style algorithms, Â denotes the advantage function computed from the reward and the critic model, and ϵ denotes the clip parameter. The detailed fine-tuning pipeline can be found in Algorithm 2.

Algorithm 2. MADT-Online: Multi-agent decision transformer with PPO
1) Input: Offline dataloader D, pre-trained MADT policy with parameter θ
2) Initialize θ and ϕ as the parameters of an actor π_θ(a_i|o_i) and a critic V_ϕ(s), respectively, which can be inherited directly from the pre-trained models
3) Initialize n as the agent number, γ as the discount factor, and ϵ as the clip ratio
4) for τ = {τ_1, ···, τ_i, ···, τ_n} in D do
5) Sample τ_i = {s_t, o_t^i, a_t^i}_{t∈1:C} as the ground truth, where C is the context length
6) Compute the advantage function Â(s, a_i) = Σ_t γ^t r(s, a_i) − V_ϕ(s)
7) Compute the importance weight ω = π_θ(o_i, a_i) / π_{θ_old}(o_i, a_i)
8) Update θ_i for i ∈ 1, ···, n via:
9) θ_i = arg max E_{s∼ρ_{θ_old}, a∼π_{θ_old}}[clip(ω, 1 − ϵ, 1 + ϵ) Â(s, a_i)]
10) Compute the MSE loss L_ϕ = (1/2)[Σ_t γ^t r_t − V_ϕ(s)]^2
11) Update the critic network via ϕ = arg min_ϕ L_ϕ
12) end for
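The following is a minimal sketch of the clipped-PPO update of Eq. (9) together with the advantage and critic regression of steps 6), 10) and 11). The advantage here is the simple return-minus-value estimate used in Algorithm 2; the actor/critic module interfaces and the flat batch layout are assumptions for illustration.

```python
import torch

def discounted_returns(rewards, gamma):
    # G_t = sum_{k>=t} gamma^{k-t} r_k, computed backwards over one episode (shape (T,)).
    G, out = 0.0, torch.zeros_like(rewards)
    for t in range(rewards.shape[0] - 1, -1, -1):
        G = rewards[t] + gamma * G
        out[t] = G
    return out

def madt_ppo_update(actor, critic, actor_opt, critic_opt, obs, state, actions,
                    rewards, old_log_probs, gamma=0.99, eps=0.2):
    returns = discounted_returns(rewards, gamma)
    values = critic(state).squeeze(-1)            # V_phi(s)
    advantages = (returns - values).detach()      # A_hat(s, a_i), as in step 6)

    # Clipped surrogate objective of Eq. (9) for the shared policy.
    log_probs = actor(obs).log_prob(actions)      # log pi_theta(a_i | o_i)
    ratio = torch.exp(log_probs - old_log_probs)  # importance weight omega, step 7)
    surrogate = torch.min(ratio * advantages,
                          torch.clamp(ratio, 1 - eps, 1 + eps) * advantages)
    actor_loss = -surrogate.mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # Critic MSE regression of step 10): (G_t - V_phi(s))^2.
    critic_loss = ((returns - critic(state).squeeze(-1)) ** 2).mean()
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()
    return actor_loss.item(), critic_loss.item()
```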
3.3 Universal model across scenarios

To train a universal policy for each of the scenarios in SMAC, which might vary in agent number, feature space, action space, and reward ranges, we consider the modifications listed below.

Parameter sharing across agents. When offline examples are collected from multiple tasks, or the test phase involves a different number of agents from the offline datasets, the difference in agent numbers across tasks makes it intractable to decide the number of actors. Thus, we share the parameters across all actors with one model and attach one-hot agent IDs to the observations for compatibility with a variable number of agents.

Feature encoding. When the policy needs to generalize to new scenarios with different feature shapes, we propose encoding all features into a universal space by padding zeros at the end and mapping them to a low-dimensional space with fully connected networks.

Action masking. Another issue is the different action spaces across scenarios. For example, fewer enemies in a scenario means fewer potential attack options and hence fewer available actions. Therefore, an extra vector is utilized to mute the unavailable actions so that their probabilities are always zero during both the learning and evaluation processes.

Reward scaling. Different scenarios might vary in reward ranges and lead to unbalanced models during multi-task offline learning. To balance the influence of examples from different scenarios, we scale their rewards to the same range to ensure that the output models have comparable performance across different tasks.
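A minimal sketch of the modifications above (shared parameters with one-hot agent IDs, zero-padded feature encoding, action masking, and reward scaling); the dimension names and the unified observation/action sizes are illustrative assumptions, and the [0, 1] scaling range is only one possible choice of common range.

```python
import numpy as np

def encode_observation(obs, agent_id, max_n_agents, max_obs_dim):
    # Zero-pad per-scenario features to a universal size and append a one-hot agent ID,
    # so a single shared policy can serve a variable number of agents.
    padded = np.zeros(max_obs_dim, dtype=np.float32)
    padded[: obs.shape[0]] = obs
    one_hot = np.eye(max_n_agents, dtype=np.float32)[agent_id]
    return np.concatenate([padded, one_hot])

def mask_action_logits(logits, avail_actions, max_n_actions):
    # Pad the scenario's action space to the universal size and mute unavailable actions,
    # so their probabilities are exactly zero after the softmax.
    masked = np.full(max_n_actions, -np.inf, dtype=np.float32)
    masked[: logits.shape[0]] = np.where(avail_actions > 0, logits, -np.inf)
    return masked

def scale_reward(reward, reward_min, reward_max):
    # Map per-scenario rewards onto a common range for balanced multi-task learning.
    return (reward - reward_min) / max(reward_max - reward_min, 1e-8)
```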
4 Experiments

We show three experimental settings: offline MARL, online MARL by loading the pre-trained model, and few-shot or zero-shot offline learning. For offline MARL, we verify the performance of our method by pre-training the policy and directly testing it on the corresponding maps. To demonstrate the capacity of the pre-trained policy on the original or new scenarios, we then demonstrate fine-tuning in the online environment. Experimental results in offline MARL show that our MADT-offline in Section 3.1 outperforms the state-of-the-art methods. Furthermore, MADT-online in Section 3.2 can improve the sample efficiency across multiple scenarios. Besides, the universal MADT trained from multi-task data with MADT-online generalizes well to each scenario in a few-shot or even zero-shot setting.

4.1 Offline datasets

The offline datasets are collected from the running policy, MAPPO[58], on the well-known SMAC task[26]. Each dataset contains a large number of trajectories τ := (s_t, o_t, a_t, r_t, done_t, v_t)_{t=1}^{T}. Different from D4RL[59], our datasets conform to the properties of Dec-POMDPs, which involve local observations and available actions for each agent. In the appendix, we list the statistical properties of the offline datasets in Tables A1 and A2.
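For concreteness, one way to represent and iterate the per-step records τ = (s_t, o_t, a_t, r_t, done_t, v_t) described above is sketched below; the field layout and the episode-splitting helper are assumptions about how such a Dec-POMDP dataset could be organized, not a specification of the released files.

```python
from dataclasses import dataclass
from typing import List
import numpy as np

@dataclass
class Step:
    state: np.ndarray          # global state s_t
    obs: np.ndarray            # per-agent local observations o_t, shape (n_agents, obs_dim)
    actions: np.ndarray        # per-agent actions a_t, shape (n_agents,)
    reward: float              # shared team reward r_t
    done: bool                 # episode-termination flag done_t
    avail_actions: np.ndarray  # per-agent available-action vectors v_t, shape (n_agents, n_actions)

def split_episodes(steps: List[Step]) -> List[List[Step]]:
    # Cut the flat record stream into episodes at the done flags.
    episodes, current = [], []
    for s in steps:
        current.append(s)
        if s.done:
            episodes.append(current)
            current = []
    if current:
        episodes.append(current)
    return episodes
```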
4.2 Offline multi-agent reinforcement learning

In this experiment, we aim to validate the effectiveness of the offline version of MADT in Section 3.1 as a framework for offline MARL on the static offline datasets. We train a policy on the offline datasets of various qualities and then apply it to an online environment, StarCraft[26]. There are also baselines under this setting, such as behavior cloning (BC), a kind of imitation learning method showing stable performance in single-agent offline RL. In addition, we employ the conventional, effective single-agent offline RL algorithms BCQ[32], CQL[31], and ICQ[35], and then use the extension method of simply mixing each agent's value network proposed by [35] for the multi-agent setting, denoted as "xx-MA". We compare the performance of the offline version of MADT with the abovementioned offline RL algorithms under online evaluation in the MARL environment. To verify the quality of our collected datasets, we chose data from different levels and trained the baselines as well as our MADT. Fig. 3 shows the overall performance on datasets of various qualities. The baseline methods enhance their performance stably, indicating the quality of our offline datasets. Furthermore, our MADT outperforms the offline MARL baselines and converges faster across easy, hard, and super hard maps (2s3z, 3s5z, 3s5z VS. 3s6z, corridor). From the initial performance in the evaluation period, our pre-trained model gives a higher return than the baselines in each task. Besides, our model can surpass the average performance in the offline dataset.

4.3 Offline pre-training and online fine-tuning

The experiments designed in this subsection intend to answer the question: Is the pre-training process necessary for online MARL? First, we compare the online version of MADT in Section 3.2 with and without loading the pre-trained model. If MADT is trained only on online experience, we can view it as a transformer-based MAPPO that replaces the actor and critic backbone networks with the transformer.
(Fig. 3 plots: average return vs. timesteps; legend: MADT, BC, BCQ-MA, CQL-MA, ICQ-MA; panels (a) 2s3z (Easy), (b) 3s5z (Hard), (c) 3s5z VS. 3s6z (Super hard), (d) Corridor (Super hard).)
Fig. 3 Performance of offline MADT compared with baselines on four easy or (super-)hard SMAC maps. The dotted lines represent the
mean values in the training set. Columns (a)–(d) are average returns from (poor, medium, good) datasets from top to the bottom.
Furthermore, we validate that our framework, MADT with the pre-trained model, can improve sample efficiency on most easy, hard, and super hard maps.

Necessity of the pre-trained model. We train our MADT on the datasets collected from a map and fine-tune it on the same map online with the MAPPO algorithm. For a fair comparison, we use the transformer as both the actor and critic networks, with and without the pre-trained model. Primarily, we choose three maps from the easy, hard, and super hard categories to validate the effectiveness of the pre-trained model in Fig. 4. Experimental results show that the pre-trained model converges faster than the algorithm trained from scratch, especially on challenging maps.

Improving sample efficiency. To validate the sample efficiency improvement obtained by loading our pre-trained MADT and fine-tuning it with MAPPO, we compare the overall framework with the state-of-the-art algorithm MAPPO[58] without the pre-training phase. We measure the sample efficiency in terms of the time to threshold mentioned in [60], which denotes the number of online interactions (timesteps) needed to achieve a predefined threshold, reported in Table 1; our pre-trained model needs far fewer interactions than the traditional MAPPO to achieve the same win rate.
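The time-to-threshold metric reported in Table 1 can be computed as below from an evaluation curve of (timestep, win rate) pairs; this small helper is only an illustration of the metric, under the assumption that win rates are evaluated periodically during training.

```python
def samples_to_threshold(timesteps, win_rates, threshold):
    """Return the number of online interactions needed to first reach `threshold`
    win rate, or None if the policy never reaches it (the "infinity" case in Table 1)."""
    for t, w in zip(timesteps, win_rates):
        if w >= threshold:
            return t
    return None

# Example: a pre-trained policy that already starts above 20% needs 0 extra samples
# (the "-" entries in Table 1).
print(samples_to_threshold([0, 8e3, 4e4], [0.25, 0.4, 0.6], 0.2))  # -> 0
```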
4.4 Generalization with multi-task pre-training

Experiments in this section explore the transferability of the universal MADT mentioned in Section 3.3, which is pre-trained with mixed data from multiple tasks. Depending on whether the downstream tasks have been seen or not, the few-shot experiments are designed to validate the adaptability to the seen tasks, whereas the zero-shot experiments are designed for the held-out maps.

Few-shot learning. The results in Fig. 5(a) show that our method can utilize multi-task datasets to train a universal policy and generalize well to all tasks. The pre-trained MADT can achieve higher returns than the model trained from scratch when we limit the interactions with the environment.

Zero-shot learning. Fig. 5(b) shows that our universal MADT can surprisingly improve performance on downstream tasks even if they have not been seen before (3 stalkers VS. 4 zealots).

4.5 Ablation study

The experiments in this subsection are designed to answer the following research questions: RQ1: Why should we choose MAPPO for the online phase? RQ2: Which kind of input should be used to make the pre-trained model beneficial for online MARL? RQ3: Why cannot the offline version of MADT be improved in the online fine-tuning period after pre-training?

Suitable online algorithm. Although the selection of the MARL algorithm for the online phase should be flexible according to specific tasks, we design experiments to answer RQ1 here. As discussed in Section 3, we can train a Decision Transformer for each agent and fine-tune it online with an MARL algorithm. An intuitive method is to load the pre-trained transformer and take it as the policy network for fine-tuning with a policy gradient method, e.g., REINFORCE[56].
(Fig. 4 plots: average return vs. timesteps; legend: online MARL w/ pre-trained model, online MARL w/o pre-trained model; panels (a) So many baneling (Easy), (b) 5m VS. 6m (Hard), (c) 3s5z VS. 3s6z (Super hard).)
Fig. 4 Average returns with and without the pre-trained model
Table 1 Number of online interactions needed to achieve win rates of 20%, 40%, 60%, 80% and 100% for the training policy (MAPPO / pre-trained MADT). "–" means no more samples are needed to reach the target win rate; "∞" means the policy cannot reach the target win rate.

Maps | 20% | 40% | 60% | 80% | 100%
So many baneling (Easy) | 3.2E+4/8E+3 | 1E+5/4E+4 | 3.2E+5/7E+4 | 5E+5/8E+4 | 1E+6/6.4E+5
Bane VS. bane (Easy) | 3.2E+3/– | 3.2E+3/– | 3.2E+5/– | 4E+5/– | 5.6E+5/–
10m VS. 11m (Hard) | 2E+5/– | 3.5E+5/– | 4E+5/2.8E+4 | 1.7E+6/1.2E+5 | 4E+6/2.5E+5
Corridor (Super hard) | 1.5E+6/– | 1.8E+6/– | 2E+6/– | 2.8E+6/– | 7.8E+6/4E+5
(Fig. 5 plots: (a) average returns on the maps 2m VS. 1z, 2s VS. 1sc, 3m, 3s VS. 3z, 3s VS. 4z for the policy fine-tuned from the universal MADT vs. trained from scratch; (b) average return vs. timesteps on the held-out map for the universal MADT (ours) vs. training from scratch.)
Fig. 5 Few-shot and zero-shot validation results. (a) shows the average returns of the universal MADT pre-trained from all five tasks
data and the policy trained from scratch, individually. We limit the environment interaction to 2.5 M steps. (b) shows the average
returns of a held-out map (3s VS. 4z ), where the universal MADT is trained from data on (2m VS. 1z , 2s VS. 1sc, 3m , 3s VS. 3z ).
However, due to the high variance issue mentioned in Section 3.2, we choose MAPPO as the online algorithm and compare its performance in improving the sample efficiency during the online period in Fig. 6(a).
(Fig. 6 plots: average return vs. timesteps; (a) MADT with MAPPO vs. MADT with REINFORCE; (b) input formulations: state only, state & action, reward-to-go & state & action; (c) MADT with MAPPO for fine-tuning vs. pure MADT for fine-tuning.)
Fig. 6 Ablation results on a hard map, 5m_vs_6m , for validating the necessity of (a) MAPPO in MADT-online, (b) Input formulation,
(c) online version of MADT.
Dropping reward-to-go in MADT. To answer RQ2, we compare different inputs embedded into the transformer, including combinations of state, reward-to-go, and action. We find reward-to-go harmful to online fine-tuning performance, as shown in Fig. 6(b). We attribute this to the mismatch between the reward-to-go distributions of the offline data and the online samples. That is, the rewards of online samples are usually lower than those of offline data due to stochastic exploration at the beginning of the online phase. This deteriorates the fine-tuning capability of the pre-trained model, and based on Fig. 6(b), we only choose states as our inputs for pre-training and fine-tuning.

Integrating online MARL with MADT. To answer RQ3, we directly apply the offline version of MADT for pre-training and fine-tune it online. However, Fig. 6(c) shows that it cannot be improved during the online phase. We attribute this to the absence of any incentive to chase higher rewards, and conclude that offline MADT is supervised learning and tends to fit its collected experience even when the rewards are unsatisfactory.

5 Conclusions

In this work, we propose MADT, an offline pre-trained model for MARL, which integrates the transformer to improve sample efficiency and generalizability in tackling SMAC tasks. MADT learns a big sequence model that outperforms the state-of-the-art methods in offline settings, including BC, BCQ, CQL, and ICQ. When applied in online settings, the pre-trained MADT can drastically improve the sample efficiency. We applied MADT to train a generalizable policy over a series of SMAC tasks and then evaluated its performance under both few-shot and zero-shot settings. The results demonstrate that the pre-trained MADT policy adapts quickly to new tasks and improves performance on different downstream tasks. To the best of our knowledge, this is the first work that demonstrates the effectiveness of offline pre-training and the effectiveness of sequence modelling through transformer architectures in the context of MARL.

Appendix A Properties of datasets

We list the properties of our offline datasets in Tables A1 and A2.

Appendix B Details of hyper-parameters

Details of the hyper-parameters used for MADT experiments are listed in Tables B1–B5.
Table A1 Properties of our offline dataset collected from the experience of multi-agent PPO on the easy maps of SMAC

Table A2 Properties of our offline dataset collected from the experience of multi-agent PPO on the hard and super hard maps of SMAC

Table B1 Common hyper-parameters for all MADT experiments for pre-training on a map, taking 3m (easy) as an example
Table B5 Hyper-parameters for MADT experiments in Fig. 5(b)

Hyper-parameter | Value
Offline_map_lists | [2m_vs_1z, 3m, 2s_vs_1sc, 3s_vs_3z]
Offline_episode_num | [250, 250, 250, 250]
Offline_lr | 5E–4
Online_lr | 1E–4

References

[1] Y. D. Yang, J. Wang. An overview of multi-agent reinforcement learning from game theoretical perspective. [Online], Available: https://arxiv.org/abs/2011.00583, 2020.
[2] S. Shalev-Shwartz, S. Shammah, A. Shashua. Safe, multi-agent, reinforcement learning for autonomous driving. [Online], Available: https://arxiv.org/abs/1610.03295, 2016.
[6] Y. D. Yang, R. Luo, M. N. Li, M. Zhou, W. N. Zhang, J. Wang. Mean field multi-agent reinforcement learning. In Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, pp. 5571–5580, 2018.
[7] Y. D. Yang, L. T. Yu, Y. W. Bai, Y. Wen, W. N. Zhang, J. Wang. A study of AI population dynamics with million-agent reinforcement learning. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, ACM, Stockholm, Sweden, pp. 2133–2135, 2018.
[8] P. Peng, Y. Wen, Y. D. Yang, Q. Yuan, Z. K. Tang, H. T. Long, J. Wang. Multiagent bidirectionally-coordinated nets: Emergence of human-level coordination in learning to play StarCraft combat games. [Online], Available: https://arxiv.org/abs/1703.10069, 2017.
[9] M. Zhou, Z. Y. Wan, H. J. Wang, M. N. Wen, R. Z. Wu, Y. Wen, Y. D. Yang, W. N. Zhang, J. Wang. MALib: A parallel framework for population-based multi-agent reinforcement learning. [Online], Available: https://arxiv.org/abs/2106.07551, 2021.
[10] X. T. Deng, Y. H. Li, D. H. Mguni, J. Wang, Y. D. Yang. On the complexity of computing Markov perfect equilibrium in general-sum stochastic games. [Online], Available: https://arxiv.org/abs/2109.01795, 2021.
[11] T. Haarnoja, A. Zhou, K. Hartikainen, G. Tucker, S. Ha, J. Tan, V. Kumar, H. Zhu, A. Gupta, P. Abbeel, S. Levine. Soft actor-critic algorithms and applications. [Online], Available: https://arxiv.org/abs/1812.05905, 2018.
[12] R. Munos, T. Stepleton, A. Harutyunyan, M. G. Bellemare. Safe and efficient off-policy reinforcement learning. In Proceedings of the 30th Conference on Neural Information Processing Systems, Barcelona, Spain, pp. 1054–1062, 2016.
[13] L. Espeholt, R. Marinier, P. Stanczyk, K. Wang, M. Michalski. SEED RL: Scalable and efficient deep-RL with accelerated central inference. In Proceedings of the 8th International Conference on Learning Representations, Addis Ababa, Ethiopia, 2020.
[14] L. Espeholt, H. Soyer, R. Munos, K. Simonyan, V. Mnih, T. Ward, Y. Doron, V. Firoiu, T. Harley, I. Dunning, S. Legg, K. Kavukcuoglu. IMPALA: Scalable distributed deep-RL with importance weighted actor-learner architectures. In Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, pp. 1407–1416, 2018.
[15] K. M. He, X. L. Chen, S. N. Xie, Y. H. Li, P. Dollár, R. Girshick. Masked autoencoders are scalable vision learners. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, New Orleans, USA, pp. 15979–15988, 2022. DOI: 10.1109/CVPR52688.2022.01553.
[16] Z. Liu, Y. T. Lin, Y. Cao, H. Hu, Y. X. Wei, Z. Zhang, S. Lin, B. N. Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of IEEE/CVF International Conference on Computer Vision, IEEE, Montreal, Canada, pp. 9992–10002, 2021. DOI: 10.1109/ICCV48922.2021.00986.
[17] S. Kim, J. Kim, H. W. Chun. Wave2Vec: Vectorizing electroencephalography bio-signal for prediction of brain disease. International Journal of Environmental Research and Public Health, vol. 15, no. 8, Article number 1750, 2018. DOI: 10.3390/ijerph15081750.
[18] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, D. Amodei. Language models are few-shot learners. In Proceedings of the 34th International Conference on Neural Information Processing Systems, Vancouver, Canada, Article number 159, 2020. DOI: 10.5555/3495724.3495883.
[19] L. L. Chen, K. Lu, A. Rajeswaran, K. Lee, A. Grover, M. Laskin, P. Abbeel, A. Srinivas, I. Mordatch. Decision transformer: Reinforcement learning via sequence modeling. [Online], Available: https://arxiv.org/abs/2106.01345, 2021.
[20] Y. D. Yang, J. Luo, Y. Wen, O. Slumbers, D. Graves, H. bou Ammar, J. Wang, M. E. Taylor. Diverse auto-curriculum is critical for successful real-world multiagent learning systems. In Proceedings of the 20th International Conference on Autonomous Agents and Multi-agent Systems, ACM, pp. 51–56, 2021.
[21] N. Perez-Nieves, Y. D. Yang, O. Slumbers, D. H. Mguni, Y. Wen, J. Wang. Modelling behavioural diversity for learning in open-ended games. In Proceedings of the 38th International Conference on Machine Learning, pp. 8514–8524, 2021.
[22] X. Y. Liu, H. T. Jia, Y. Wen, Y. J. Hu, Y. F. Chen, C. J. Fan, Z. P. Hu, Y. D. Yang. Unifying behavioral and response diversity for open-ended learning in zero-sum games. In Proceedings of the 35th Conference on Neural Information Processing Systems, pp. 941–952, 2021.
[23] S. Levine, A. Kumar, G. Tucker, J. Fu. Offline reinforcement learning: Tutorial, review, and perspectives on open problems. [Online], Available: https://arxiv.org/abs/2005.01643, 2020.
[24] R. Sanjaya, J. Wang, Y. D. Yang. Measuring the non-transitivity in chess. Algorithms, vol. 15, no. 5, Article number 152, 2022. DOI: 10.3390/a15050152.
[25] X. D. Feng, O. Slumbers, Y. D. Yang, Z. Y. Wan, B. Liu, S. McAleer, Y. Wen, J. Wang. Discovering multi-agent auto-curricula in two-player zero-sum games. [Online], Available: https://arxiv.org/abs/2106.02745, 2021.
[26] M. Samvelyan, T. Rashid, C. S. de Witt, G. Farquhar, N. Nardelli, T. G. J. Rudner, C. M. Hung, P. H. S. Torr, J. Foerster, S. Whiteson. The StarCraft multi-agent challenge. In Proceedings of the 18th International Conference on Autonomous Agents and MultiAgent Systems, Montreal, Canada, pp. 2186–2188, 2019.
[27] Z. Li, S. R. Xue, X. H. Yu, H. J. Gao. Controller optimization for multirate systems based on reinforcement learning. International Journal of Automation and Computing, vol. 17, no. 3, pp. 417–427, 2020. DOI: 10.1007/s11633-020-1229-0.
[28] Y. Li, D. Xu. Skill learning for robotic insertion based on one-shot demonstration and reinforcement learning. International Journal of Automation and Computing, vol. 18, no. 3, pp. 457–467, 2021. DOI: 10.1007/s11633-021-1290-3.
[29] C. Berner, G. Brockman, B. Chan, V. Cheung, P. Debiak, C. Dennison, D. Farhi, Q. Fischer, S. Hashme, C. Hesse, et al. Dota 2 with large scale deep reinforcement learning. [Online], Available: https://arxiv.org/abs/1912.06680, 2019.
[30] A. Kumar, J. Fu, G. Tucker, S. Levine. Stabilizing off-policy Q-learning via bootstrapping error reduction. In Proceedings of the 33rd Conference on Neural Information Processing Systems, Vancouver, Canada, pp. 11761–11771, 2019.
[31] A. Kumar, A. Zhou, G. Tucker, S. Levine. Conservative Q-learning for offline reinforcement learning. In Proceedings of the 34th International Conference on Neural Information Processing Systems, Vancouver, Canada, Article number 100, 2020. DOI: 10.5555/3495724.3495824.
[32] S. Fujimoto, D. Meger, D. Precup. Off-policy deep reinforcement learning without exploration. In Proceedings of the 36th International Conference on Machine Learning, Long Beach, USA, pp. 2052–2062, 2019.
[33] T. Matsushima, H. Furuta, Y. Matsuo, O. Nachum, S. X. Gu. Deployment-efficient reinforcement learning via model-based offline optimization. In Proceedings of the 9th International Conference on Learning Representations, 2021.
[34] D. J. Su, J. D. Lee, J. M. Mulvey, H. V. Poor. MUSBO: Model-based uncertainty regularized and sample efficient batch optimization for deployment constrained reinforcement learning. [Online], Available: https://arxiv.org/abs/2102.11448, 2021.
[35] Y. Q. Yang, X. T. Ma, C. H. Li, Z. W. Zheng, Q. Y. Zhang, G. Huang, J. Yang, Q. C. Zhao. Believe what you see: Implicit constraint approach for offline multi-agent reinforcement learning. [Online], Available: https://arxiv.org/abs/2106.03400, 2021.
[36] J. C. Jiang, Z. Q. Lu. Offline decentralized multi-agent reinforcement learning. [Online], Available: https://arxiv.org/abs/2108.01832, 2021.
[37] A. Nair, M. Dalal, A. Gupta, S. Levine. Accelerating online reinforcement learning with offline datasets. [Online], Available: https://arxiv.org/abs/2006.09359, 2020.
[38] M. Janner, Q. Y. Li, S. Levine. Offline reinforcement learning as one big sequence modeling problem. [Online], Available: https://arxiv.org/abs/2106.02039, 2021.
[39] L. C. Dinh, Y. D. Yang, S. McAleer, Z. Tian, N. P. Nieves, O. Slumbers, D. H. Mguni, H. bou Ammar, J. Wang. Online double oracle. [Online], Available: https://arxiv.org/abs/2103.07780, 2021.
[40] D. H. Mguni, Y. T. Wu, Y. L. Du, Y. D. Yang, Z. Y. Wang, M. N. Li, Y. Wen, J. Jennings, J. Wang. Learning in nonzero-sum stochastic games with potentials. In Proceedings of the 38th International Conference on Machine Learning, pp. 7688–7699, 2021.
[41] Y. D. Yang, Y. Wen, J. Wang, L. H. Chen, K. Shao, D. Mguni, W. N. Zhang. Multi-agent determinantal Q-learning. In Proceedings of the 37th International Conference on Machine Learning, pp. 10757–10766, 2020.
[42] T. Rashid, M. Samvelyan, C. Schroeder, G. Farquhar, J. Foerster, S. Whiteson. QMIX: Monotonic value function factorisation for deep multi-agent reinforcement learning. In Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, pp. 4295–4304, 2018.
[43] Y. Wen, Y. D. Yang, R. Luo, J. Wang, W. Pan. Probabilistic recursive reasoning for multi-agent reinforcement learning. In Proceedings of the 7th International Conference on Learning Representations, New Orleans, USA, 2019.
[44] Y. Wen, Y. D. Yang, J. Wang. Modelling bounded rationality in multi-agent interactions by generalized recursive reasoning. In Proceedings of the 29th International Joint Conference on Artificial Intelligence, Yokohama, Japan, pp. 414–421, 2020. DOI: 10.24963/ijcai.2020/58.
[45] S. Hu, F. D. Zhu, X. J. Chang, X. D. Liang. UPDeT: Universal multi-agent reinforcement learning via policy decoupling with transformers. [Online], Available: https://arxiv.org/abs/2101.08001, 2021.
[46] K. Son, D. Kim, W. J. Kang, D. E. Hostallero, Y. Yi. QTRAN: Learning to factorize with transformation for cooperative multi-agent reinforcement learning. In Proceedings of the 36th International Conference on Machine Learning, Long Beach, USA, pp. 5887–5896, 2019.
[47] J. G. Kuba, M. N. Wen, L. H. Meng, S. D. Gu, H. F. Zhang, D. H. Mguni, J. Wang, Y. D. Yang. Settling the variance of multi-agent policy gradients. In Proceedings of the 35th Conference on Neural Information Processing Systems, pp. 13458–13470, 2021.
[48] J. G. Kuba, R. Q. Chen, M. N. Wen, Y. Wen, F. L. Sun, J. Wang, Y. D. Yang. Trust region policy optimisation in multi-agent reinforcement learning. In Proceedings of the 10th International Conference on Learning Representations, 2022.
[49] S. D. Gu, J. G. Kuba, M. N. Wen, R. Q. Chen, Z. Y. Wang, Z. Tian, J. Wang, A. Knoll, Y. D. Yang. Multi-agent constrained policy optimisation. [Online], Available: https://arxiv.org/abs/2110.02793, 2021.
[50] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, USA, pp. 6000–6010, 2017. DOI: 10.5555/3295222.3295349.
[51] I. Sutskever, O. Vinyals, Q. V. Le. Sequence to sequence learning with neural networks. In Proceedings of the 27th International Conference on Neural Information Processing Systems, Montreal, Canada, pp. 3104–3112, 2014. DOI: 10.5555/2969033.2969173.
[52] Q. Wang, B. Li, T. Xiao, J. B. Zhu, C. L. Li, D. F. Wong, L. S. Chao. Learning deep transformer models for machine translation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 1810–1822, 2019. DOI: 10.18653/v1/P19-1176.
[53] L. H. Dong, S. Xu, B. Xu. Speech-transformer: A no-recurrence sequence-to-sequence model for speech recognition. In Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, Calgary, Canada, pp. 5884–5888, 2018. DOI: 10.1109/ICASSP.2018.8462506.
[54] K. Han, Y. H. Wang, H. T. Chen, X. H. Chen, J. Y. Guo, Z. H. Liu, Y. H. Tang, A. Xiao, C. J. Xu, Y. X. Xu, Z. H. Yang, Y. M. Zhang, D. C. Tao. A survey on vision transformer. [Online], Available: https://arxiv.org/abs/2012.12556, 2020.
[55] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. H. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, N. Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In Proceedings of the 9th International Conference on Learning Representations, 2020.
[56] R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, vol. 8, no. 3, pp. 229–256, 1992. DOI: 10.1007/BF00992696.
[57] I. Mordatch, P. Abbeel. Emergence of grounded compositional language in multi-agent populations. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence and Thirtieth Innovative Applications of Artificial Intelligence Conference and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence, New Orleans, USA, Article number 183, 2018. DOI: 10.5555/3504035.3504218.
[58] C. Yu, A. Velu, E. Vinitsky, J. X. Gao, Y. Wang, A. Bayen, Y. Wu. The surprising effectiveness of PPO in cooperative, multi-agent games. [Online], Available: https://arx-
Jun Wang received the B. Sc. degree from Southeast University, China in 1997, the M. Sc. degree from National University of Singapore, Singapore in 2003, and the Ph. D. degree from Delft University of Technology, The Netherlands in 2007. He is currently a professor with the Computer Science Department, University College London, UK. He has published over 200 research articles. His team won the First Global Real-Time Bidding Algorithm Contest with more than 80 participants worldwide. He was the winner of multiple best paper awards. He was a recipient of the Beyond Search—Semantic Computing and Internet Economics Award by Microsoft Research and the Yahoo! FREP Faculty Award. He has served as an Area Chair for ACM CIKM and ACM SIGIR. His recent service includes the Co-Chair for Artificial Intelligence, Semantics, and Dialog in ACM SIGIR 2018.
His research interests are in the areas of AI and intelligent systems, covering (multi-agent) reinforcement learning, deep generative models, and their diverse applications on information retrieval, recommender systems and personalization, data mining, smart cities, bot planning, and computational advertising.
E-mail: [email protected] (Corresponding author)
ORCID iD: 0000-0001-9006-7951

Yaodong Yang received the B. Sc. degree in electronic engineering & information science from University of Science and Technology of China, China in 2013, the M. Sc. degree in science (Quantitative Biology/Biostatistics) from Imperial College London, UK in 2014, and the Ph. D. degree in computer science from University College London, UK in 2021. He is a machine learning researcher with ten years of working experience in both academia and industry. Currently, he is an assistant professor at Peking University, China. Before joining Peking University, he was an assistant professor at King′s College London, UK. Before KCL, he was a principal research scientist at Huawei UK. Before Huawei, he was a senior research manager at AIG, working on AI applications in finance. He has maintained a track record of more than forty publications at top conferences and journals, along with the Best System Paper Award at CoRL 2020 and the Best Blue-sky Paper Award at AAMAS 2021.
His research interests include reinforcement learning and multi-agent systems.
E-mail: [email protected]

Bo Xu received the B. Sc. degree in electrical engineering from Zhejiang University, China in 1988, and the M. Sc. and Ph. D. degrees in pattern recognition and intelligent system from Institute of Automation, Chinese Academy of Sciences, China in 1992 and 1997, respectively. He is a professor, the director of Institute of Automation, Chinese Academy of Sciences, China, and also deputy director of Center for Excellence in Brain Science and Intelligence Technology, Chinese Academy of Sciences, China.
His research interests include brain-inspired intelligence, brain-inspired cognitive models, natural language processing and understanding, and brain-inspired robotics.
E-mail: [email protected] (Corresponding author)
ORCID iD: 0000-0002-1111-1529