
Learning Multi-Agent Intention-Aware Communication for

Optimal Multi-Order Execution in Finance


Yuchen Fang∗† Zhenggang Tang∗† Kan Ren∗‡
[email protected] [email protected] [email protected]
Shanghai Jiao Tong University University of Illinois Microsoft Research Asia
Urbana-Champaign

Weiqing Liu Li Zhao Jiang Bian


Microsoft Research Asia Microsoft Research Asia Microsoft Research Asia

Dongsheng Li Weinan Zhang‡ Yong Yu


Microsoft Research Asia [email protected] Shanghai Jiao Tong University
Shanghai Jiao Tong University

Tie-Yan Liu
Microsoft Research Asia
ABSTRACT

Order execution is a fundamental task in quantitative finance, aiming at finishing acquisition or liquidation for a number of trading orders of specific assets. Recent advances in model-free reinforcement learning (RL) provide a data-driven solution to the order execution problem. However, the existing works always optimize execution for an individual order, overlooking the practice that multiple orders are specified to be executed simultaneously, which results in suboptimality and bias. In this paper, we first present a multi-agent RL (MARL) method for multi-order execution that considers practical constraints. Specifically, we treat every agent as an individual operator trading one specific order, while keeping communicating with the other agents and collaborating to maximize the overall profits. Nevertheless, the existing MARL algorithms often incorporate communication among agents by exchanging only the information of their partial observations, which is inefficient in a complicated financial market. To improve collaboration, we then propose a learnable multi-round communication protocol in which the agents communicate their intended actions with each other and refine them accordingly. It is optimized through a novel action value attribution method which is provably consistent with the original learning objective yet more efficient. The experiments on data from two real-world markets have illustrated superior performance with significantly better collaboration effectiveness achieved by our method.

CCS CONCEPTS

• Computing methodologies → Multi-agent reinforcement learning.

KEYWORDS

Reinforcement Learning, Quantitative Finance, Multi-agent Reinforcement Learning, Order Execution, Financial Trading

ACM Reference Format:
Yuchen Fang, Zhenggang Tang, Kan Ren, Weiqing Liu, Li Zhao, Jiang Bian, Dongsheng Li, Weinan Zhang, Yong Yu, and Tie-Yan Liu. 2023. Learning Multi-Agent Intention-Aware Communication for Optimal Multi-Order Execution in Finance. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD '23), August 6–10, 2023, Long Beach, CA, USA. ACM, New York, NY, USA, 13 pages. https://doi.org/10.1145/3580305.3599856

* These authors contributed equally to this research.
† This work was conducted during the internship of Yuchen Fang and Zhenggang Tang at Microsoft Research Asia.
‡ Corresponding author.

1 INTRODUCTION

In quantitative finance, the primary goal of the investor is to maximize the long-term value through continuous trading of multiple assets in the market [1, 2]. The process consists of two parts: portfolio management, which dynamically allocates the portfolio across the assets, and order execution, whose goal is to fulfill a number of acquisition or liquidation orders specified by the portfolio management strategy within a time horizon, closing the loop of investment [3, 4]. Figure (1a) presents the trading process within one trading day. The trader first updates the target portfolio allocation following some portfolio management strategy. Then, the orders shown in the red dotted zone need to be executed to accomplish the actual portfolio adjustment. We focus on the order execution task in this paper, which aims at simultaneously finishing the execution of multiple orders during the time horizon while maximizing the overall execution profit gain.

The challenge of order execution lies in two aspects. First, the number of orders changes with the portfolio allocation from day to day, which requires the order execution strategy to be scalable and flexible enough to support a large and varying number of orders. Second, the cash balance is limited, and all acquiring operations consume the limited cash supply of the trader, which can only be replenished by the liquidating operations. The lack of cash supply may lead to missing good trading opportunities, which urges one to achieve a balance between acquisition and liquidation to avoid conflicting trading decisions that would cause cash shortage and poor trading performance. Figure (1b) illustrates a typical example of a conflicting trading decision that results in cash imbalance and low execution utility. The execution of acquisition orders is forced to be postponed due to cash shortage until the liquidation orders supplement the cash, leading to missing the best acquisition opportunity. We have observed similar evidence in real-world transactions and provide more analysis in the experiments.

[Figure 1: An example of multi-order execution and the motivation of collaboration within it. (a) Illustration of financial investment as a combination of portfolio management and multi-order execution. (b) Imbalanced cash utilization requiring further coordination between acquisition and liquidation operations.]

Although there exist many works on order execution, few of them manage to address the above challenges. Traditional financial-model-based methods [5–7] and some recently developed model-free reinforcement learning (RL) methods [4, 8, 9] only optimize the strategy for single-order execution without considering the practice of multi-order execution, which results in low trading efficacy. Moreover, it is not applicable to directly transfer the existing methods to multi-order execution, since utilizing only one agent to conduct the execution of multiple orders would lead to a scalability issue, as the action space of one individual agent grows exponentially with the number of orders. It is also not flexible enough for the execution of a varying number of orders [10–12].

To resolve the above challenges, we treat multi-order execution as a multi-agent collaboration problem and utilize a multi-agent reinforcement learning (MARL) method where each agent acts to execute one individual order, which factorizes the joint action space for scalability to varying order numbers [13–15], and all agents collaborate to achieve higher overall profits with fewer decision conflicts. However, the existing MARL solutions for general multi-agent collaboration are not suitable for the multi-order execution environment, where the actions of one agent can significantly influence the others through the shared cash balance, further affecting the final performance as the financial market changes drastically. The mainstream methods, which build a communication channel among agents to promote collaboration [16–18], only allow agents to share information about their partial observations, which cannot directly reflect the intentions of the agents, thus harming the collaboration performance. [19] models the intentions of agents as imagined future trajectories and shares the intentions through communication in order to achieve better collaboration. However, it requires predicting environment dynamics and future actions of others to generate the imagined trajectory, which is intractable especially in the noisy and complicated financial market. Also, the agents therein cannot respond to the intentions of others, i.e., change the actions they intended to take after receiving the messages, until the next timestep, which makes the intention messages less helpful for achieving good coordination at the current timestep.

In this paper, we propose a novel multi-round intention-aware communication protocol for communicating the intended actions among agents at each timestep, which is optimized through an action value attribution method. Specifically, we first model the intention of agents as the actions they intend to take at the current timestep [20, 21] and share the intentions between agents through communication. Then, during multiple rounds of communication, the agents are allowed to coordinate with each other and achieve a better balance between acquisition and liquidation, whereafter the last intended actions are taken as the final decisions of the current timestep. Note that the intended actions of the agents should be gradually refined for better collaboration during multi-round communication. To ensure this, we propose a novel action value attribution method to directly optimize and refine the intended actions at each round, which is proved to be unbiased with respect to the original decision-making objective yet more sample efficient.

Our contributions are three-fold as discussed below.
• We illustrate the necessity of simultaneous optimization of all orders in the multi-order execution task. To the best of our knowledge, this is the first work formulating this problem as a multi-agent collaboration task and utilizing a MARL method to solve it.
• We formulate the intention of agents as the actions they intend to take and propose a novel action value attribution method to optimize the intended actions directly. We are the first to explicitly refine the intended actions of agents in a cooperative scenario, which may shed some light on research about general multi-agent reinforcement learning.
• Our proposed intention refinement mechanism allows agents to share and modify their intended actions before the final decisions are made. The experiment results on two real-world stock markets have demonstrated the superiority of our approach on both trading performance and collaboration effectiveness.

2 RELATED WORK

2.1 RL for Order Execution

Reinforcement learning (RL) based solutions have been proposed for order execution due to its nature as a sequential decision-making task. Early works [22–24] extend traditional control-theory-based methods [5, 25] and rely on unrealistic assumptions about the market, thus not performing well in real-world situations. Several following works [4, 8, 9, 26, 27] adopt a data-driven mindset and utilize model-free RL methods to learn optimal trading strategies. However, all these methods target individual order execution and do not consider the practical constraints of multi-order execution shown in Figure (1b), leading to sub-optimal or impractical trading behaviors. Although MARL has been widely adopted in the financial area for market simulation [28–31] and portfolio management [7, 32, 33], there is no existing method utilizing MARL directly for order execution. To our best knowledge, this is the first work using MARL for the multi-order execution task with practical constraints.

2.2 Communication in General MARL

Communication is an essential approach to encourage collaboration between agents in multi-agent collaboration problems. Early works [11, 34, 35] design pre-defined communication protocols between agents. DIAL [36] and CommNet [17] first proposed differentiable communication mechanisms with deep neural networks. The following works can be divided into two sub-groups. The first group focuses mainly on "who to communicate with", i.e., throttling the communication channel rather than using a fully connected global one [18, 37–39], while the second concentrates on "how to deliver messages", i.e., designing different network structures for better information passing and aggregation in the communication channel [16, 37, 40]. However, the messages shared in all these communication methods contain only the information from the observations of the agents and do not explicitly reflect their intentions, leading to catastrophic discordance. Our work is orthogonal to these methods, since we focus on "what to communicate" and the corresponding optimization algorithm, which can be easily adapted to the communication structures in the works mentioned above to make up for their shortcomings.

2.3 Intention Modeling

Explicit modeling of the intentions of agents has been used in Theory of Mind (ToM) and Opponent Modeling (OM) methods. ToM-net [41] captures the mental states of other agents and predicts their future actions. OM [42] uses agent policies to predict the intended actions of opponents. But all these works are conducted under a competitive setting and require agents to infer each other's intention, which could be inaccurate considering the instability nature of MARL [43]. [19] first conducts intention sharing between agents through communication mechanisms under cooperative settings. However, the agents are not allowed to modify their actions after receiving others' intentions at the current step, thus still suffering from discordance. Also, this method requires forecasting the state transitions of the environment for the next few steps, which may suffer from compounding errors, especially in the finance area where the environment is extremely noisy. In contrast, our method improves the joint action of the agents gradually through the intention sharing and refinement process, and does not use predicted state transitions or any additional environment information.

3 PRELIMINARY

In this section, we first present the task definition of the multi-order execution problem, including notations for variables and optimization goals. Then, we formulate the task as a Markov decision process of the trader interacting with the market environment.

3.1 Multi-Order Execution

We generalize the typical settings of order execution in previous works [8, 44] to multi-order execution, where a set of orders needs to be fulfilled within a predefined time horizon. As shown in Figure (1a), we take intraday order execution as a running example, while other time horizons follow the same schema. Thus, for each trading day, there is a set of 𝑛 orders to be traded for 𝑛 assets, respectively, which is denoted as a tuple (𝑛, 𝑐0, d, M). It includes the trading directions of the orders d = (𝑑1, . . . , 𝑑𝑛), where 𝑑𝑖 ∈ {1, −1} stands for liquidation and acquisition, respectively, and the order index is 𝑖 ∈ [1, 𝑛]; the amounts of shares to trade for each asset M = (𝑀1, . . . , 𝑀𝑛); and the initial cash balance of the trading account 𝑐0, which is shared by all the trading operations during the day as explained below. For simplicity, we assume that there are 𝑇 timesteps in the time horizon, i.e., one trading day. At each timestep 𝑡 ∈ [1, 𝑇], the trader should propose the volumes to trade, denoted as q𝑡 = (𝑞𝑡1, . . . , 𝑞𝑡𝑛). The market prices of the corresponding assets are p𝑡 = (𝑝𝑡1, . . . , 𝑝𝑡𝑛), which are not revealed to the trader before she proposes the trading decision at 𝑡. During the trading process, liquidating operations replenish the cash balance of the trader while acquiring operations consume it. As a result, the balance of the trading account varies at every timestep as

    c_t = c_{t-1} + \sum_{i=1}^{n} d^i (p_t^i \cdot q_t^i).

Different from the previous works on order execution that only optimize for a single order, the objective of multi-order execution is to maximize the overall profit while fulfilling all the orders without running out of cash, which can be formulated as

    \arg\max_{\mathbf{q}_1, \ldots, \mathbf{q}_T} \; \sum_{i=1}^{n} d^i \left[ \sum_{t=1}^{T} (p_t^i \cdot q_t^i) \right] \Big/ \left( \sum_{t=1}^{T} q_t^i \right),
    \text{s.t.} \; \sum_{t=1}^{T} \mathbf{q}_t = \mathbf{M}, \quad \mathbf{q}_t \ge 0, \quad c_t \ge 0, \; \forall t \in \{1, \ldots, T\}.    (1)

The average execution price (AEP) of order 𝑖 is calculated as

    \bar{p}^i = \left[ \sum_{t=1}^{T} (p_t^i \cdot q_t^i) \right] \Big/ \left( \sum_{t=1}^{T} q_t^i \right).    (2)

The trader needs to maximize the AEP of all liquidation orders (𝑑𝑖 = 1) while minimizing that of acquisition orders (𝑑𝑖 = −1).

3.2 Multi-Order Execution as a Markov Decision Process

The multi-order execution problem can be formulated as a Markov decision process (MDP) (𝑛, S, A, 𝑅, 𝐼, 𝛾), where each agent executes one order and all the agents share a collective goal. Here S is the state space, A is the action space of each agent, and 𝑛 is the number of agents, corresponding to the order number. Note that, for different trading days, the order number 𝑛 varies dynamically and can be large, which makes the joint action space extremely huge for one single-agent RL policy [8, 9] to learn to execute multiple orders simultaneously. Thus, in the multi-agent RL schema for multi-order execution, we treat each agent as the individual operator executing one order. Each agent has a policy 𝜋𝑖(𝒔𝑖; 𝜽𝑖) which produces a distribution over actions and is parameterized by 𝜽𝑖. The actions of all agents 𝒂 = (𝑎1, ..., 𝑎𝑛), sampled from the corresponding policies, are used to interact with the environment; 𝑅(𝒔, 𝒂) is the reward function and 𝐼(𝒔, 𝒂) is the transition function of the environment, which gives the next state after receiving the action. Our goal is to optimize a unified policy 𝝅 = {𝜋1, ..., 𝜋𝑛} with parameters 𝜽 = {𝜽1, ..., 𝜽𝑛} to maximize the expected accumulative reward

    J(\boldsymbol{\theta}) = E_{\tau \sim p_{\boldsymbol{\pi}}(\tau)} \left[ \sum_{t} \gamma^t R(\mathbf{s}_t, \mathbf{a}_t) \right],    (3)

where 𝑝𝝅(𝜏) is the probability distribution of the trajectory 𝜏 = {(𝒔1, 𝒂1), . . . , (𝒔𝑇, 𝒂𝑇)} sampled by 𝝅 from the environment and 𝛾 is the discount factor.
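
To make the bookkeeping of Section 3.1 concrete, here is a minimal NumPy sketch (the function and variable names are our own illustrative assumptions, not the authors' implementation) that tracks the shared cash balance c_t = c_{t-1} + Σ_i d^i p_t^i q_t^i and computes the average execution price of Eq. (2) from given prices and executed volumes.

```python
import numpy as np

def run_bookkeeping(c0, d, prices, volumes):
    """Track the shared cash balance and per-order average execution price (AEP).

    c0      : initial cash balance shared by all orders
    d       : (n,) array, +1 for liquidation and -1 for acquisition
    prices  : (T, n) array of market prices p_t^i
    volumes : (T, n) array of executed volumes q_t^i
    """
    T, n = prices.shape
    cash = np.empty(T)
    c_prev = c0
    for t in range(T):
        # liquidation (+1) replenishes cash, acquisition (-1) consumes it
        c_prev = c_prev + np.sum(d * prices[t] * volumes[t])
        cash[t] = c_prev
    # AEP of Eq. (2): volume-weighted average execution price per order
    aep = (prices * volumes).sum(axis=0) / volumes.sum(axis=0)
    return cash, aep

# toy example: one liquidation and one acquisition order over T = 4 steps
cash, aep = run_bookkeeping(
    c0=1000.0,
    d=np.array([1.0, -1.0]),
    prices=np.array([[10.0, 20.0], [10.5, 19.5], [11.0, 19.0], [10.8, 19.2]]),
    volumes=np.array([[25.0, 10.0], [25.0, 10.0], [25.0, 10.0], [25.0, 10.0]]),
)
assert (cash >= 0).all(), "a feasible execution keeps c_t >= 0 at every step"
```

A feasible execution plan must keep every entry of the cash path non-negative, which is exactly the c_t ≥ 0 constraint in Eq. (1).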

More implementation details of the MDP definition are presented below.

Number of agents 𝑛. We define each agent to be the operator for a single asset, and each agent is responsible for the execution of the corresponding order.

State space S. The state 𝒔𝑡 ∈ S describes the overall information of the system. However, for each agent executing a specific order 𝑖, the state 𝒔𝑖𝑡 observed at timestep 𝑡 contains only the historical market information of the corresponding asset collected just before timestep 𝑡 and some shared trading status. A detailed description of the observed state information is given in Appendix A.2.

Action space A. Agent 𝑖 proposes the action 𝑎𝑖𝑡 ∈ A after observing 𝒔𝑖𝑡 at 𝑡 ∈ {1, . . . , 𝑇}. Following [4], we define a discrete action space 𝑎𝑖𝑡 ∈ {0, 0.25, 0.5, 0.75, 1} which corresponds to the proportion of the target order 𝑀𝑖, and the trading volume 𝑞𝑡𝑖 = 𝑎𝑖𝑡 𝑀𝑖 is executed at timestep 𝑡. Moreover, \sum_{t=1}^{T} a_t^i = 1 is satisfied by fixing a_T^i = 1 - \sum_{t=1}^{T-1} a_t^i to ensure all orders are fully fulfilled. A similar setting has been widely adopted in the related literature [8, 9]. We also conduct experiments on different action spaces and present the results in Appendix B.2, which show that this action setting performs sufficiently well.

We should note the cash limitation here. If the remaining cash balance is not adequate for all acquisition agents, their intended trading volumes are cut off by the environment, evenly, to the scale at which the remaining cash balance is just used up after this timestep. For instance, if the actions of all the acquisition agents require twice as much as the remaining cash, their executed volumes are scaled to half, to ensure the cash balance c_t ≥ 0 for 1 ≤ t ≤ T.

Reward function. Different from the previous works on order execution that optimize the performance of each order individually, we formulate the reward of all agents in multi-order execution based on the rewards for the execution of each individual order, which include three parts: the profitability of trading, the penalty of market impact, and the penalty of cash shortage where the cash balance has been used up and the acquisition actions are limited.

First, to account for the profitability during order execution caused by actions, following [4], we formulate this term of the reward as the volume-weighted execution gain over the average price:

    R_e^+(\mathbf{s}_t, a_t^i; i) = d^i \, \frac{q_t^i}{M^i} \cdot \underbrace{\left( \frac{p_t^i - \tilde{p}^i}{\tilde{p}^i} \right)}_{\text{price normalization}} = d^i a_t^i \left( \frac{p_t^i}{\tilde{p}^i} - 1 \right),    (4)

where \tilde{p}^i = \frac{1}{T} \sum_{t=1}^{T} p_t^i is the average market price of asset 𝑖 over the whole time horizon. Note that, as discussed in [4], incorporating 𝑝̃𝑖 in the reward function at timestep 𝑡 does not cause information leakage, since the reward is not included in the state 𝒔𝑡 and thus does not influence the actions of our agent; it only takes effect in back-propagation during training.

Second, to avoid potential market impacts, i.e., too large a trading volume that may harmfully influence the market, we follow the recent works [4, 8, 9] and adopt a quadratic penalty for trading too much of asset 𝑖 within a short time period:

    R_a^-(\mathbf{s}_t, a_t^i; i) = -\alpha \left( \frac{q_t^i}{M^i} \right)^2 = -\alpha (a_t^i)^2,    (5)

where 𝛼 is a hyper-parameter controlling the penalty degree.

Third, we also penalize all the agents whenever the remaining cash balance is used up right at timestep 𝑡:

    R_c^-(\mathbf{s}_t, a_t^i; i) = -\sigma \, \mathbb{1}[c_t = 0 \mid c_{t-1} > 0],    (6)

where 𝜎 is a hyper-parameter.

Thus, the reward for executing the 𝑖-th order is defined as

    R(\mathbf{s}_t, a_t^i; i) = R_e^+(\mathbf{s}_t, a_t^i; i) + R_a^-(\mathbf{s}_t, a_t^i; i) + R_c^-(\mathbf{s}_t, a_t^i; i)
                             = d^i a_t^i \left( \frac{p_t^i}{\tilde{p}^i} - 1 \right) - \alpha (a_t^i)^2 - \sigma \, \mathbb{1}[c_t = 0 \mid c_{t-1} > 0].    (7)

Finally, in order to optimize the execution of all the orders holistically, the overall reward function, which is shared by all the agents, is defined as the average of the rewards of all the orders:

    R(\mathbf{s}_t, \mathbf{a}_t) = \frac{1}{n} \sum_{i=1}^{n} R(\mathbf{s}_t, a_t^i; i).    (8)

Assumptions. There are two main assumptions adopted in this paper. Similar to [9], (i) the temporary market impact is modeled as a reward penalty, i.e., R_a^-, and we assume that the market is resilient and will bounce back to equilibrium at the next timestep; (ii) we ignore commissions and exchange fees, as these expenses are relatively small fractions for the institutional investors that we mainly aim at.
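
To illustrate the environment mechanics defined above, here is a small self-contained sketch (our own simplification with assumed hyper-parameter values, not the paper's code) of one environment step: acquisition volumes are evenly cut back when they exceed the available cash, and the shared reward of Eq. (8) is assembled from the per-order terms of Eqs. (4)–(6). Whether same-step liquidation proceeds are available to fund acquisitions is an assumption of this sketch; the paper only specifies the aggregate cash dynamics and the even cutback rule.

```python
import numpy as np

def env_step(c_prev, d, a_t, p_t, p_tilde, M, alpha=0.01, sigma=0.1):
    """One simplified environment step for n orders at timestep t.

    c_prev  : remaining cash balance c_{t-1}
    d       : (n,) trade directions, +1 liquidation / -1 acquisition
    a_t     : (n,) proposed action proportions in {0, 0.25, 0.5, 0.75, 1}
    p_t     : (n,) market prices of the assets at timestep t
    p_tilde : (n,) daily average market prices
    M       : (n,) total order sizes in shares
    alpha, sigma : penalty hyper-parameters (values here are placeholders)
    """
    q_t = a_t * M                                      # intended volumes
    buy = d < 0
    buy_cost = np.sum(p_t[buy] * q_t[buy])
    cash_in = np.sum(p_t[~buy] * q_t[~buy])            # proceeds of liquidation
    budget = c_prev + cash_in                          # assumed available for buying
    if buy_cost > budget:                              # even cutback of all acquisitions
        q_t[buy] *= budget / buy_cost
        a_t = q_t / M                                  # executed proportions after cutback
    c_t = c_prev + np.sum(d * p_t * q_t)               # cash dynamics of Section 3.1

    r_exec = d * a_t * (p_t / p_tilde - 1.0)           # R_e^+, Eq. (4)
    r_impact = -alpha * a_t ** 2                       # R_a^-, Eq. (5)
    r_cash = -sigma * float(c_t <= 0 and c_prev > 0)   # R_c^-, Eq. (6), shared penalty
    per_order = r_exec + r_impact + r_cash             # Eq. (7)
    return c_t, per_order, float(per_order.mean())     # shared reward, Eq. (8)
```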

4 METHODOLOGY

In this section, we first briefly introduce the main challenges of solving the multi-order execution task. Then, we illustrate how the previous communication-based MARL methods fail to meet these challenges by describing the design and problems of the general framework of existing methods, further clarifying the motivation of our improvements. Finally, we introduce the details of our multi-round intention-aware communication with action refinement, including the optimization method.

4.1 Problems of the General Multi-Agent Communication Framework

There are two main challenges in solving the multi-order execution task with a MARL method: (1) the agents have to extract essential information from their own observations, e.g., judge whether it is a good opportunity to trade to derive a high profit; (2) the acquisition and liquidation orders should be coordinated with each other to maintain a reasonable cash balance without severe cash shortage. The latter requires the agents to be aware of the situations and decision intentions of the other agents at the current timestep, and to adjust their decisions to avoid potential conflicts, e.g., congested cash consumption. However, the existing multi-agent communication methods are limited by a common inflexible framework and cannot solve all these challenges, as summarized and discussed below. Note that the following procedure is conducted at each timestep; thus, we omit the subscript 𝑡 for simplicity whenever the context is clear.

To solve the above challenge (1), at each timestep, the 𝑖-th agent extracts a hidden state from its observation 𝒔𝑖 with an information extractor 𝐸(·),

    \mathbf{h}_0^i = E(\mathbf{s}^i).    (9)

Then, the agents communicate with each other using a communication channel 𝐶(·) and update the hidden states, for 𝐾 rounds in total, for a thorough information exchange,

    (\mathbf{h}_k^1, \ldots, \mathbf{h}_k^n) = C(\mathbf{h}_{k-1}^1, \ldots, \mathbf{h}_{k-1}^n), \quad 1 \le k \le K.    (10)

Finally, the actions of the agents are generated with a decision-making module 𝐷(·) as

    a^i \sim D(\mathbf{h}_K^i).    (11)

The previous works on multi-agent communication summarized above all focus on improving the structure of the communication channel 𝐶, either throttling the channel for more efficient communication [18, 37–39] or improving the information-passing ability with a more complicated channel design [16, 37, 40]. However, none of these approaches breaks through the above framework, and we claim that two main problems exist within this framework. First, all hidden representations 𝒉𝑖𝑘 only contain information about the partial observations of the agents but not the actions they intend to take, making it harder for agents to reach good collaboration. Second, though multiple rounds of communication are conducted during this process, the agents only make decisions once, after the final round of communication. They have no opportunity to refine their actions afterward, leading to discordance, as it is hard for agents to reach an agreement immediately in a complicated environment like the financial market. These problems combined make existing methods fail to solve challenge (2) mentioned above, and thus unsuitable for the multi-order execution task.

4.2 Intention-Aware Communication

In this section, we first describe the framework of our proposed intention-aware communication (IaC) method. Then, we discuss the optimization details, including the intended action refinement.

4.2.1 Decision making with multi-round intention communication. To solve the problems mentioned above, we propose intention-aware communication, which can be divided into two parts: observation extraction and multi-round communication with decision making. Following the above convention, we by default omit the subscript 𝑡 in notations without causing misunderstanding, since this procedure is conducted at each timestep. The whole process is illustrated in Figure 2.

[Figure 2: The policy framework of our proposed intention-aware communication method. It illustrates the decision-making process of all agents at timestep 𝑡. In total, 𝐾 rounds of intention-aware communication are conducted among agents to derive the final decision 𝒂. Intended actions 𝒂𝑘 are made after the 𝑘-th round and fed to the next communication round. All the modules of different rounds share the same parameters 𝜽; thus, few additional parameters are introduced.]

Observable information extraction. Similar to the framework described in Section 4.1, from the view of the 𝑖-th agent, at each timestep, an information extractor 𝐸 is utilized to extract patterns from the input and encode them as an initial hidden representation 𝒉𝑖0 of agent 𝑖 from its observation 𝒔𝑖, following Eq. (9).

Multi-round communication with decision making. Our central improvement lies in the multi-round communication process. Instead of designing more complicated communication channels, we focus on "what to communicate" and share the intended actions of the agents during each round of communication. We also make it possible for agents to constantly refine their actions according to the intentions of the others during this process. The process can be formulated as

    (\mathbf{h}_k^1, \ldots, \mathbf{h}_k^n) = C(\mathbf{h}_{k-1}^1 \,\|\, a_{k-1}^1, \ldots, \mathbf{h}_{k-1}^n \,\|\, a_{k-1}^n),    (12)
    a_k^i \sim D(\mathbf{h}_k^i), \quad 1 \le i \le n, \; 1 \le k \le K,    (13)

where 𝑎𝑖0 is a dummy action whose value can be arbitrarily assigned, 𝐶(·) is the communication channel where agents exchange information and update their hidden states for one round from 𝒉𝑘−1 to 𝒉𝑘, and 𝐷(·) is the decision-making module. The intended actions of the last round are used as the final actions actually executed in the environment, i.e., 𝒂 = 𝒂𝐾 in our method.

Note that our proposed method is different from the general framework of the previous communication-based MARL methods described in Section 4.1, which only share the information extracted from the partial observations of the agents and make the (final) decisions only once, after the last round of communication. The novel parts of our proposed method are emphasized with dashed lines and striped blocks in Figure 2.

The communication channel 𝐶(·) is scalable to a varying number of agents, following previous methods [16, 36].

4.2.2 Intention optimization with action value attribution. The intended actions 𝒂𝑘 generated after the 𝑘-th communication round should convey the intention of every agent, which reflects the instant intuition of decision making at the current timestep. Thus, the intended actions 𝒂𝑘 should by design reflect the true intentions of the agents and be exchanged through the next round of communication to further facilitate collaborative decision making, until the final round where 𝒂 = 𝒂𝐾. To achieve this, we propose an auxiliary objective for each round of intended actions. All auxiliary objectives are optimized together with the original objective 𝐽(𝜽) defined in Eq. (3) to keep all intended actions highly correlated with the final goal and to progressively refine the decisions round after round.

We first introduce some definitions before describing the design of the auxiliary objective in detail. Recalling that we set the context of the agents at timestep 𝑡, we define the value function V(\mathbf{s}_t) = \sum_{t'=t}^{T} \gamma^{t'-t} R(\mathbf{s}_{t'}, \mathbf{a}_{t'}) as the expected cumulative reward we could get starting from state 𝒔𝑡 following our policy 𝝅, and the action value Q(\mathbf{s}_t, \mathbf{a}_t) = R(\mathbf{s}_t, \mathbf{a}_t) + \gamma V(I(\mathbf{s}_t, \mathbf{a}_t)) as the expected cumulative reward if we take action 𝒂𝑡 at timestep 𝑡 and follow policy 𝝅 afterwards.

Denoting the intention generation process during the 𝑘-th round of communication as 𝝅𝑘, i.e., 𝝅𝑘(·|𝒔, 𝒂𝑘−1) = 𝐷(𝐶(𝒉𝑘−1 ‖ 𝒂𝑘−1)), we optimize 𝝅𝑘 by defining an auxiliary objective function that maximizes the expected cumulative reward 𝑄(𝒔, 𝒂𝑘) as if we took 𝒂𝑘 as the final actions instead of 𝒂𝐾, for all 𝒔 and 𝒂𝑘−1 encountered while interacting with the environment. Note that, when 𝑘 = 𝐾, this objective is consistent with our original objective 𝐽(𝜽), as 𝒂𝐾 are the final actions used to interact with the environment. Thus, all our objectives can be uniformly denoted as

    J_k(\boldsymbol{\theta}) = E_{\mathbf{s}, \mathbf{a}_{k-1}} \big[ E_{\mathbf{a}_k \sim \boldsymbol{\pi}_k(\cdot \mid \mathbf{s}, \mathbf{a}_{k-1}; \boldsymbol{\theta})} [ Q(\mathbf{s}, \mathbf{a}_k) ] \big], \quad 1 \le k \le K,    (14)

where 𝜽 are the parameters of 𝐸(·), 𝐶(·), 𝐷(·). The gradient of 𝐽𝑘 w.r.t. 𝜽 is calculated as

    \nabla_{\boldsymbol{\theta}} J_k(\boldsymbol{\theta}) = E_{\mathbf{s}, \mathbf{a}_k, \mathbf{a}_{k-1}} [ \nabla_{\boldsymbol{\theta}} \log \boldsymbol{\pi}_k(\mathbf{a}_k \mid \mathbf{s}, \mathbf{a}_{k-1}; \boldsymbol{\theta}) \, Q(\mathbf{s}, \mathbf{a}_k) ].    (15)

We further expect the intended actions to be generally refined during the communication process, which means 𝒂𝑘 should achieve a higher return than 𝒂𝑘−1. Therefore, we use the expected return of the intended actions of the previous round of communication as a baseline function in Eq. (15), where 𝑄(𝒔, 𝒂0) = 0:

    \nabla_{\boldsymbol{\theta}} J_k(\boldsymbol{\theta}) = E_{\mathbf{s}, \mathbf{a}_k, \mathbf{a}_{k-1}} [ \nabla_{\boldsymbol{\theta}} \log \boldsymbol{\pi}_k(\mathbf{a}_k \mid \mathbf{s}, \mathbf{a}_{k-1}; \boldsymbol{\theta}) \, ( Q(\mathbf{s}, \mathbf{a}_k) - Q(\mathbf{s}, \mathbf{a}_{k-1}) ) ].    (16)

This is reasonable, as we would like to encourage 𝒂𝑘 to achieve better performance than 𝒂𝑘−1 and penalize those that perform worse. Moreover, the policy gradient in Eq. (16) remains unbiased with respect to Eq. (15), since the gradient contributed by the baseline w.r.t. 𝜽 is

    E_{\mathbf{s}, \mathbf{a}_k, \mathbf{a}_{k-1}} [ -\nabla_{\boldsymbol{\theta}} \log \boldsymbol{\pi}_k(\mathbf{a}_k \mid \mathbf{s}, \mathbf{a}_{k-1}; \boldsymbol{\theta}) \, Q(\mathbf{s}, \mathbf{a}_{k-1}) ]
    = E_{\mathbf{s}, \mathbf{a}_{k-1}} [ -Q(\mathbf{s}, \mathbf{a}_{k-1}) \, E_{\mathbf{a}_k} [ \nabla_{\boldsymbol{\theta}} \log \boldsymbol{\pi}_k(\mathbf{a}_k \mid \mathbf{s}, \mathbf{a}_{k-1}; \boldsymbol{\theta}) ] ]
    = E_{\mathbf{s}, \mathbf{a}_{k-1}} [ -Q(\mathbf{s}, \mathbf{a}_{k-1}) \, \nabla_{\boldsymbol{\theta}} 1 ] = 0.    (17)

Action value attribution. We further clarify the intuition behind our auxiliary objective from the perspective of attributing the credit of the final decision across the whole process of intention refinement over the rounds of communication. First, aiming at encouraging the agents to find better actions than the intended actions exchanged during the previous round of communication, we define another auxiliary objective for optimizing 𝝅𝑘 as

    J'_k(\boldsymbol{\theta}) = E_{\mathbf{s}, \mathbf{a}_{k-1}} \big[ E_{\mathbf{a}_k \sim \boldsymbol{\pi}_k(\cdot \mid \mathbf{s}, \mathbf{a}_{k-1}; \boldsymbol{\theta})} [ Q(\mathbf{s}, \mathbf{a}_k) - Q(\mathbf{s}, \mathbf{a}_{k-1}) ] \big].    (18)

Taking derivatives of Eq. (18) w.r.t. 𝜽 and considering Eq. (16), we can easily find that ∇𝜽𝐽𝑘(𝜽) = ∇𝜽𝐽′𝑘(𝜽). Thus, considering the consistency between 𝐽𝐾(𝜽) and 𝐽(𝜽) mentioned above, 𝐽′𝐾(𝜽), i.e., the auxiliary objective defined over the decisions of the last round, is also consistent with our original target 𝐽(𝜽). Also, since \sum_{k=1}^{K} (Q(\mathbf{s}, \mathbf{a}_k) - Q(\mathbf{s}, \mathbf{a}_{k-1})) = Q(\mathbf{s}, \mathbf{a}_K), we can see that the original optimization goal 𝐽(𝜽), i.e., 𝐽′𝐾 here, has been distributed to each 𝐽′𝑘 for 1 ≤ 𝑘 ≤ 𝐾. In other words, what we are doing is designing an action value attribution method in which the value of the last decision 𝒂𝐾 is attributed to all the intended actions.

Optimization with action value attribution decomposes the final optimization objective over the rounds of intention-aware communication, which not only alleviates the burden of multi-agent communication optimization, but also improves decision making gradually by learning to promote the action value at each round, as shown in Eqs. (16) and (18).

Action value estimation. The last detail is how to estimate the action value 𝑄(𝒔, 𝒂) needed to calculate ∇𝜽𝐽𝑘(𝜽) in Eq. (16). For 𝑘 = 𝐾, 𝑄(𝒔, 𝒂𝐾) can be directly calculated from the sampled trajectory, as 𝒂𝐾 is the final decision used to interact with the environment. For 1 ≤ 𝑘 < 𝐾, we train an action value estimation model Q̂(𝒔, 𝒂) on the trajectories collected by interacting with the environment using the actual decisions 𝒂𝐾. Note that this procedure does not require the environment to provide any additional information about the intended actions, which guarantees generalizability for wider applications of MARL.

Overall, we optimize the objective functions 𝐽𝑘(𝜽) for all communication rounds simultaneously; thus, the final loss function to minimize w.r.t. the parameters 𝜽 is defined as

    L(\boldsymbol{\theta}) = -\frac{1}{K} \sum_{k=1}^{K} J_k(\boldsymbol{\theta}).    (19)

As for the implementation, we use the PPO algorithm [45] to optimize all the intended actions and the final decisions.

The overall decision-making and optimization process of our proposed intention-aware communication method is presented in Algorithm 1. The detailed network structures of the extractor 𝐸(·), communication channel 𝐶(·), decision module 𝐷(·) and action value estimator Q̂(·, ·) are presented in Section A.6.

Algorithm 1 Intention-Aware Communication
Require: Randomly initialized network parameters 𝜽
while 𝜽 not converged do
    Sample a set of orders (𝑛, 𝑐0, d, M) from the dataset and obtain the initial state 𝒔1.
    for timestep 𝑡 ∈ [1, 𝑇] do
        𝒉𝑖𝑡,0 ← 𝐸(𝒔𝑖𝑡) for each agent 𝑖 ∈ [1, 𝑛].
        𝑎𝑖𝑡,0 ← dummy action for each agent 𝑖 ∈ [1, 𝑛].
        for communication round 𝑘 ∈ [1, 𝐾] do
            Refine the intended actions 𝒂𝑡,𝑘 as in Eqs. (12) and (13).
        end for
        Obtain 𝒔𝑡+1 = 𝐼(𝒔𝑡, 𝒂𝑡,𝐾) from the environment.
    end for
    Update 𝜽 by minimizing 𝐿(𝜽) defined in Eq. (19).
end while
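
As a concrete reading of Eqs. (16) and (19), the snippet below computes the per-round policy-gradient surrogate in which each round's intended action is scored against the action value of the previous round's intention, with Q(s, a_0) = 0. In the paper this is combined with PPO [45] and a learned critic Q̂; the plain REINFORCE-style form and the function names here are our simplifying assumptions.

```python
import torch

def intention_attribution_loss(log_probs, q_values):
    """Surrogate loss of Eqs. (16) and (19) for one state.

    log_probs : list of length K, log pi_k(a_k | s, a_{k-1}) summed over agents
    q_values  : list of length K, estimated Q(s, a_k) for each round's intended actions
                (Q(s, a_K) can come from the sampled return; earlier rounds use a critic)
    """
    K = len(log_probs)
    q_prev = torch.zeros(())            # Q(s, a_0) = 0 by convention
    loss = torch.zeros(())
    for k in range(K):
        advantage = (q_values[k] - q_prev).detach()   # round-wise action value attribution
        loss = loss - log_probs[k] * advantage        # maximize J_k  <=>  minimize -J_k
        q_prev = q_values[k]
    return loss / K                                    # L(theta) = -(1/K) sum_k J_k, Eq. (19)

# toy usage with dummy tensors standing in for policy outputs and critic estimates
lp = [torch.tensor(-1.2, requires_grad=True) for _ in range(3)]
qv = [torch.tensor(0.1), torch.tensor(0.25), torch.tensor(0.4)]
intention_attribution_loss(lp, qv).backward()
```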

5 EXPERIMENTS

In this section, we present the experiment settings and results with extended investigations. The reproducible code and benchmark data will be released upon the acceptance of this paper.

5.1 Datasets

All the compared methods are trained and evaluated on the historical transaction data of the China A-share stock market and the US stock market from 2018 to 2021, collected from Yahoo Finance (https://finance.yahoo.com). All datasets are divided into training, validation, and test sets according to time, and the statistics of all datasets are presented in Appendix A.1. For both stock markets, we construct three datasets with a rolling window of one-year length and stride, denoted as CHW1, CHW2, CHW3 and USW1, USW2, USW3. For each trading day, several sets of orders with the corresponding initial cash budgets {(𝑛, 𝑐0, d, M)} are generated according to a widely used portfolio management strategy, "Buying-Winners-and-Selling-Losers" [1, 46], implemented in [47], and each set of intraday execution orders includes the asset name, the amount and the trading type of each order, as discussed in Sec. 3. The orders are the same for all the compared methods for fairness. Without loss of generality, all the orders in our datasets are restricted to be fulfilled within one trading day, which is 240 minutes long for the Chinese stock market and 390 minutes for the US stock market. The detailed trading process is described in Appendix A.3.

5.2 Evaluation Settings

5.2.1 Compared methods. Since we are the first to study the simultaneous optimization of multi-order execution, we first compare our proposed method and its variants with traditional financial-model-based methods and some single-agent RL baselines proposed for the order execution problem. Note that, for single-agent RL methods optimized for single orders, instead of using the summation of rewards over all orders as the reward function, the agent is only optimized for the reward of each individual order 𝑅(𝒔𝑡, 𝑎𝑖𝑡; 𝑖) as defined in Eq. (7), aligned with how these methods were originally proposed. Then, to illustrate the effectiveness of our proposed intention-aware communication mechanism, we compare our method with general MARL works incorporating common communication algorithms. All methods are evaluated with the same multi-order execution procedure described in Appendix A.3.
• TWAP (Time-Weighted Average Price) [25] is a passive rule-based method which evenly distributes the order amount across the whole time horizon, whose average execution price is the average market price 𝑝̃𝑖 as defined in Eq. (4).
• VWAP (Volume-Weighted Average Price) [6] is another widely used strategy which distributes the order proportionally to the estimated market volume to reduce market impact.
• AC is derived in [5] as a trading strategy focusing on balancing price risk and market impact using closed-form market heuristics.
• DDQN (Double Deep Q-network) [8] is a single-agent value-based RL method for single-order execution optimization.
• PPO was proposed in [9] and utilizes the PPO algorithm to optimize single-order execution.
• CommNet [36] first utilizes a neural network as a broadcasting communication channel to share information between agents.
• TarMAC [16] is a multi-agent reinforcement learning method which first utilizes an attention-based communication channel.
• IS [19] incorporates intention communication in MARL, forecasting the future trajectories of other agents as intentions.
• IaC is our proposed method, which utilizes an intention-aware communication mechanism to increase the cooperative trading efficacy for multi-order execution. For a comprehensive comparison, we conduct experiments on two variants of our method, IaCT and IaCC, which implement the communication channel as in TarMAC and CommNet, respectively.

For all the compared methods, the hyper-parameters are tuned on the validation sets and then evaluated on the test sets. For RL-based methods, the policies are trained with six different random seeds after determining the optimal hyper-parameters, and the means and standard deviations of the results on the test sets are reported. The detailed hyper-parameter settings are presented in Appendix A.5. All RL-based methods share the same network structures for the extractor 𝐸(·), communication network 𝐶(·) (if it exists) and decision module 𝐷(·); thus the numbers of parameters are similar, for fair comparison.

5.2.2 Evaluation metrics. The first evaluation metric is the average execution gain (EG) over all orders, calculated as EG = \frac{1}{|D|} \sum_{i=1}^{|D|} EG_i, where |D| is the number of orders in the dataset and EG_i is the execution gain of order 𝑖 relative to the corresponding daily average market price 𝑝̃𝑖 of asset 𝑖, defined as EG_i = d^i (\bar{p}^i_{\text{strategy}} - \tilde{p}^i) / \tilde{p}^i \times 10^4 ‱. Here \bar{p}^i_{\text{strategy}} is the average execution price of the evaluated strategy as defined in Eq. (2). Note that EG is proportional to the reward 𝑅𝑒+ described in Eq. (4) and has been widely used in the order execution task [4, 8, 9]. EG is measured in basis points (BPs), where one basis point is 1‱. To better illustrate profitability, we also report the additional annualized rate of return (ARR) brought by the order execution algorithm, relative to the same portfolio management strategy with a TWAP execution solution, whose average execution price is 𝑝̃𝑖 and whose EG is 0; its detailed calculation is presented in Appendix A.4. Following [8, 9], we also report the gain-loss ratio (GLR) of EG, E[EG_i | EG_i > 0] / E[−EG_i | EG_i < 0], and the positive rate (POS), P[EG_i > 0], across all the orders in the dataset. All the above metrics, i.e., EG, ARR, GLR and POS, are better when their values are higher.

Moreover, as a reasonable execution strategy should manage the cash resources wisely to avoid shortages, our last evaluation metric is the average percentage of time of conflict (TOC), during which the agents conduct conflicting actions and suffer from a shortage of cash, defined as 100% × E[\sum_{t=1}^{T} \mathbb{1}(c_t = 0)] / T. TOC is better when its value is lower. Generally, a high TOC value indicates that the acquisition orders are often limited by the cash supply, which usually results in suboptimal EG. Note that, although the portfolio management strategy responsible for generating the daily orders could hold more cash, i.e., allocate a larger initial cash balance 𝑐0 for each set of orders to offer a larger budget for acquisition and reduce TOC, a large cash position would lower the capital utilization and cause a lower profit rate. Thus, the initial cash budget 𝑐0 would not be very large, which requires multi-order execution to actively coordinate liquidation and acquisition to manage the cash resources well, i.e., to achieve a low TOC value.
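
For clarity, the metrics above can be computed as in the following sketch; the array layout and the function name are our own assumptions, and the per-order strategy AEP p̄ and daily market average p̃ are taken as precomputed inputs.

```python
import numpy as np

def evaluation_metrics(d, p_bar_strategy, p_tilde, cash_paths):
    """EG / POS / GLR over orders, and TOC over trading days.

    d              : (N,) trade directions, +1 liquidation / -1 acquisition
    p_bar_strategy : (N,) average execution price of the strategy per order, Eq. (2)
    p_tilde        : (N,) daily average market price per order
    cash_paths     : (days, T) cash balance c_t of each evaluated trading day
    """
    eg = d * (p_bar_strategy - p_tilde) / p_tilde * 1e4      # execution gain in basis points
    pos = float(np.mean(eg > 0))                             # positive rate P[EG_i > 0]
    glr = float(eg[eg > 0].mean() / -eg[eg < 0].mean())      # gain-loss ratio
    toc = float(100.0 * np.mean(cash_paths == 0))            # % of timesteps with c_t = 0
    return {"EG": float(eg.mean()), "POS": pos, "GLR": glr, "TOC": toc}
```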

5.3 Experiment Results

Table 1: Results of all the compared methods on six rolling windows of two real-world markets. ↑ (↓) means that a higher (lower) value is better. For learning-based methods, we report the mean ± standard deviation over six random seeds. * indicates a p-value < 10^-6 in the significance test [48]. Each cell lists EG (‱) ↑ / ARR (%) ↑ / POS ↑ / GLR ↑ / TOC (%) ↓.

China A-share Market
| Method Group | Method | CHW1 | CHW2 | CHW3 |
| Financial model based (single-order optimization) | TWAP | 0.00 / 0.00 / 0.50 / 1.00 / 0.00 | 0.00 / 0.00 / 0.50 / 1.00 / 0.00 | 0.00 / 0.00 / 0.50 / 1.00 / 0.00 |
| | AC | -3.26 / -0.65 / 0.48 / 1.00 / 0.00 | -1.25 / -0.25 / 0.49 / 0.96 / 0.00 | -6.14 / -1.22 / 0.48 / 0.95 / 0.00 |
| | VWAP | -3.26 / -0.65 / 0.49 / 1.01 / 0.00 | -2.23 / -0.45 / 0.48 / 0.92 / 0.00 | -6.13 / -1.22 / 0.48 / 0.95 / 0.00 |
| Single-agent RL (single-order optimization) | PPO | 21.63±1.45 / 5.56±0.36 / 0.59±0.01 / 1.27±0.05 / 40.39±6.66 | 24.36±2.32 / 6.28±0.58 / 0.60±0.01 / 1.23±0.04 / 45.12±10.95 | 20.01±1.11 / 5.13±0.28 / 0.58±0.01 / 1.26±0.03 / 30.24±5.37 |
| | DDQN | 6.25±0.27 / 1.57±0.07 / 0.53±0.01 / 1.05±0.01 / 8.43±1.72 | 7.12±0.64 / 1.80±0.16 / 0.54±0.01 / 1.06±0.03 / 18.27±3.23 | 7.07±0.56 / 1.78±0.14 / 0.53±0.02 / 1.03±0.01 / 10.27±1.14 |
| Multi-agent RL (multi-order optimization) | CommNet | 20.32±0.98 / 5.21±0.25 / 0.60±0.01 / 1.20±0.01 / 2.45±0.43 | 30.21±1.89 / 7.84±0.47 / 0.59±0.02 / 1.33±0.03 / 6.87±1.29 | 21.02±1.24 / 5.39±0.31 / 0.58±0.01 / 1.30±0.01 / 2.98±0.99 |
| | TarMAC | 22.46±1.42 / 5.77±0.36 / 0.57±0.03 / 1.29±0.01 / 2.56±0.10 | 31.12±0.88 / 8.09±0.22 / 0.60±0.01 / 1.38±0.02 / 3.75±0.48 | 21.89±0.70 / 5.62±0.18 / 0.58±0.01 / 1.33±0.01 / 4.03±1.53 |
| | IS | 21.22±2.88 / 5.45±0.72 / 0.58±0.01 / 1.28±0.02 / 3.02±0.52 | 30.01±0.35 / 7.79±0.09 / 0.59±0.01 / 1.35±0.04 / 5.23±0.51 | 22.04±1.13 / 5.66±0.28 / 0.59±0.01 / 1.37±0.05 / 4.54±0.56 |
| | IaCC | 28.38±2.34* / 7.35±0.59* / 0.64±0.02* / 1.33±0.01 / 1.56±0.05 | 32.28±0.21 / 8.40±0.05 / 0.60±0.01 / 1.38±0.01 / 1.63±0.33 | 24.78±1.02 / 6.39±0.26 / 0.60±0.01 / 1.39±0.01* / 1.29±0.43* |
| | IaCT | 28.36±3.45 / 7.35±0.86 / 0.63±0.03 / 1.34±0.01* / 1.23±0.36* | 33.01±0.18* / 8.60±0.05* / 0.61±0.01* / 1.41±0.01* / 1.58±0.32* | 25.45±1.22* / 6.57±0.31* / 0.60±0.00* / 1.38±0.02 / 1.38±0.48 |

US Stock Market
| Method Group | Method | USW1 | USW2 | USW3 |
| Financial model based (single-order optimization) | TWAP | 0.00 / 0.00 / 0.50 / 1.00 / 0.00 | 0.00 / 0.00 / 0.50 / 1.00 / 0.00 | 0.00 / 0.00 / 0.50 / 1.00 / 0.00 |
| | AC | 1.23 / 0.24 / 0.51 / 1.01 / 0.00 | 0.75 / 0.05 / 0.50 / 1.01 / 0.00 | 1.52 / 0.29 / 0.51 / 1.01 / 0.00 |
| | VWAP | -2.34 / -0.60 / 0.49 / 0.96 / 0.00 | -2.88 / -0.89 / 0.48 / 0.98 / 0.00 | -1.25 / -0.24 / 0.49 / 0.99 / 0.00 |
| Single-agent RL (single-order optimization) | PPO | 4.02±0.46 / 1.01±0.12 / 0.53±0.01 / 1.03±0.02 / 14.18±3.43 | 4.39±0.84 / 1.10±0.21 / 0.53±0.01 / 1.04±0.03 / 12.36±2.73 | 4.59±0.78 / 1.15±0.20 / 0.54±0.01 / 0.99±0.04 / 10.18±2.53 |
| | DDQN | 1.08±0.36 / 0.27±0.09 / 0.51±0.01 / 1.02±0.01 / 7.12±1.04 | 2.09±0.77 / 0.52±0.19 / 0.51±0.01 / 1.02±0.01 / 7.12±1.04 | 1.79±0.45 / 0.45±0.11 / 0.52±0.01 / 0.98±0.05 / 6.34±1.44 |
| Multi-agent RL (multi-order optimization) | CommNet | 6.01±1.31 / 1.51±0.33 / 0.53±0.01 / 1.04±0.02 / 1.33±0.53 | 5.75±1.03 / 1.45±0.26 / 0.52±0.01 / 1.04±0.01 / 1.03±0.38 | 6.42±1.83 / 1.62±0.46 / 0.54±0.01 / 1.04±0.01 / 1.88±0.47 |
| | TarMAC | 6.54±1.05 / 1.65±0.27 / 0.54±0.01 / 1.03±0.01 / 2.34±0.68 | 6.28±1.52 / 1.58±0.39 / 0.53±0.01 / 1.04±0.02 / 2.58±0.37 | 6.99±2.01 / 1.76±0.51 / 0.54±0.02 / 1.04±0.01 / 3.21±0.88 |
| | IS | 5.77±0.88 / 1.45±0.22 / 0.53±0.01 / 1.03±0.01 / 1.67±0.78 | 4.98±0.75 / 1.25±0.19 / 0.52±0.02 / 1.03±0.01 / 3.23±0.58 | 5.85±0.78 / 1.47±0.20 / 0.53±0.01 / 1.03±0.02 / 4.33±1.62 |
| | IaCC | 7.82±0.25 / 1.97±0.06 / 0.54±0.01 / 1.07±0.03* / 1.30±0.10 | 7.57±0.32 / 1.91±0.08 / 0.55±0.02* / 1.06±0.02 / 0.97±0.08* | 8.01±0.33 / 2.02±0.08 / 0.54±0.01 / 1.08±0.03* / 1.01±0.13* |
| | IaCT | 8.11±0.52* / 2.05±0.13* / 0.55±0.02 / 1.06±0.01 / 1.01±0.22* | 7.99±0.29* / 2.02±0.07 / 0.54±0.01 / 1.07±0.03* / 1.01±0.29 | 8.23±0.56* / 2.08±0.14* / 0.55±0.02* / 1.07±0.01 / 1.38±0.27 |

The detailed experiment results are listed in Table 1. We can tell that (1) our proposed methods, i.e., IaCC and IaCT, significantly improve the trading performance on the test environments compared to the other baselines and achieve the highest profits, i.e., EG, POS and GLR, on all datasets. Also, the intention-aware communication mechanism brings a significant reduction in the TOC metric, which shows that sharing intended actions through communication offers much better collaboration than the previous MARL methods. (2) Almost all the MARL methods with multi-order optimization achieve higher profits and lower TOC than the RL methods optimized for single orders. This illustrates the necessity of jointly optimizing multi-order execution and encouraging collaboration among agents. (3) Although IS also shares the intentions of agents through communication, it achieves worse results than IaC, indicating that refining the intended actions over multiple rounds within a single timestep is important for the agents to reach good collaboration in a complicated environment. Also, intention communication in IS requires predicting the future trajectories of all agents, which might not accurately reflect the true intentions of the other agents; moreover, it suffers from large compounding errors in noisy financial environments. (4) All the financial-model-based methods achieve a TOC equal to zero since they do not actively seek the best execution opportunities but mainly focus on reducing market impact [5], which easily yields a low TOC value yet low (poorer) EG performance. (5) There exists a huge gap between the performance of RL-based methods on the China A-share market and the US stock market. The reason may be the relatively larger daily price volatility in the China stock market, as shown in Appendix B.3.

5.4 Extended Investigation

To further present the necessity of jointly optimizing multiple orders and the improvement in collaboration efficiency achieved by the proposed intention-aware communication mechanism, we investigate the statistics of the market data and compare the transaction details of our IaC strategy and other baselines. The analysis in this section is based on the trading results on the test sets of CHW1 and USW1, while the other datasets lead to similar conclusions.

Collaboration is necessary when conducting multi-order execution. We first clarify the necessity of collaboration for multi-order execution by illustrating the trading opportunities of acquisition and liquidation orders over the trading day. Figure (3a) illustrates the general price trend of all the assets by showing the average price deviation at each minute, Δp̂_t = E_t[(p_t − p̃)/p̃], as defined in Eq. (4). It shows that, on average, there exist acquisition opportunities (lower prices) at the beginning of the trading day, while the opportunities for liquidation (higher prices) do not come until the middle of the trading day. Generally, the opportunities for liquidation come later than those for acquisition, which requires careful collaboration between buyers and sellers during execution and further calls for multi-order optimization solutions, as the fulfillment of acquisition orders depends on the cash supplied by liquidation.

Multi-order optimization improves significantly over single-order execution. We compare the execution details of IaC and PPO to show that jointly optimizing multiple orders is necessary to achieve high profit in multi-order execution. Figure (3b) shows how IaCC and PPO distribute the given orders across the trading day on average. The bars exhibit the ratio of acquisition and liquidation orders fulfilled at every minute on average, i.e., E_t[q_t/M]. The hollow red bars show the volumes the buyers intend to fulfill, and the solid bars show what they actually trade considering the cash limitation. We find from Figure (3b) that PPO tends not to liquidate much at the beginning of the day, as it is not a good opportunity for liquidation, as shown in Figure (3a), leading to slow cash replenishment. Although PPO intends to buy many shares in the first 30 minutes of the day when the price is low, it is severely limited by the cash shortage. It has to postpone the acquisition operations and loses the best trading opportunities of the day. On the contrary, IaCT coordinates acquisition and liquidation more actively and efficiently, fulfilling most liquidation orders in the first hour of trading and thus guaranteeing sufficient cash supply for the acquisition orders, which shows the improvement in collaboration brought by optimizing all orders simultaneously with our method.

4010
Learning Multi-Agent Intention-Aware Communication for Optimal Multi-Order Execution in Finance KDD ’23, August 6–10, 2023, Long Beach, CA, USA

acquisition opportunity daily average price


liquidation opportunity
Liquidation Acquisition (traded) Acquisition (intended)

General price trend of CHW1


Price deviation Δpt̂ (BPs) Price deviation Δpt̂ (BPs)

Overall trading situation IaCC on CHW1 IaCC on USW1


15 IaCT on CHW1 IaCT on USW1
PPO on CHW1 IaCT on CHW1
10 EG w.r.t. communication rounds

Ratio of fulfilled order (%)


2.0 2.0
5 conflict (cash shortage) 28.5

buy

CHW1 EG (‱)
8.00

USW1 EG (‱)
1.0

buy
0 1.0 28.0
-5 0.0 7.75
27.5
-10 0.0 1.0 7.50

sell
sell
-15 27.0
2.0 7.25
09:30 10:30 11:30 14:00 15:00 1.0
09:30 10:30 11:30 14:00 15:00 09:30 10:30 11:30 14:00 15:00
General price trend of USW1 TOC w.r.t. communication rounds
10 PPO on USW1 IaCT on USW1 1.8

Ratio of fulfilled order (%)

CHW1 TOC (‱)

USW1 TOC (‱)


1.0 4
1.0
5 conflict (cash shortage) 1.6
3

buy

buy
0.5 0.5
0 1.4
2
-5 0.0 0.0
1.2 1

sell
sell
-10 0.5 1 2 3
0.5
09:30 10:30 11:30 12:30 13:30 14:30 15:30 communication round index (k/3)
09:30 11:00 12:30 14:00 15:30 09:30 11:00 12:30 14:00 15:30
Time Time Time
(c) The TOC and EG of intended ac-
(a) 𝑡 [Δ𝑝ˆ𝑡 ], average price deviation at ev- E (b) The average ratio of order fulfilled at every minute. tions after different communica-
ery minute Δ𝑝ˆ𝑡 = ( 𝑝˜𝑡 − 1) to the averaged
𝑝
Better coordination of our IaC methods leads to higher tion rounds in IaCC and IaCT (to-
˜
price 𝑝 over all the market stocks. profits (EG) and lower TOC (i.e., fewer conflicts). tally 𝐾 = 3 rounds).

Figure 3: The price trend of assets in the order and the corresponding trading situation of compared methods.

performance (EG) and collaboration effectiveness (TOC) of these The performances first improve sharply and then remain stable
intended actions in Figure (3c) to exhibit the collaboration ability of when 𝐾 ≥ 3, which indicates that when intention-aware commu-
the agents after each round of action refinement. We can tell that (1) nication is sufficient (𝐾 ≥ 3), additional communication would
all intended actions achieve good TOC performance and reasonable not bring significant improvements. These observations reflect the
EG, which reflects that even the intended actions proposed before stability and robustness of our proposed IaC method.
the final actions have reflected the intentions of the agents thus
subsequently offer clear information for better collaboration. (2) The
EG gets higher and the TOC gradually reaches optimal while the 6 SOFTWARE FOR ORDER EXECUTION
agents communicate, indicating that the agents manage to improve We developed a financial decision-making toolkit Qlib.RL based on
their intended actions based on the intentions of each other for Qlib [47], to support order execution scenario. It offers APIs for
better collaboration. We also conduct case study on the transaction receiving orders from upstream portfolio management systems and
details of the intended actions after each round of communication outputs detailed execution decisions to downstream trading pro-
in Appendix B.4 to further show the refinement of intended actions gram. Qlib.RL supports simultaneously execution of 1,000 orders on
during our intention-aware communication. a machine with a NVIDIA P40 GPU and an Intel Xeon 8171M CPU,
Figure 4: (a) Convergence analysis of intended actions along with communication rounds (1 ≤ k ≤ K = 5): the average change between intended actions of neighboring rounds, |a^(k+1) − a^(k)|, for IaCC and IaCT on CHW1 and USW1. (b) Analysis of the total number K of communication rounds: EG performance on CHW1 and USW1 for K = 1, ..., 5.
The intended actions have converged after multi-round communication in IaC. Figure 4a shows the average difference between the intended actions of neighboring rounds with a total of K = 5 communication rounds. It indicates that the intended actions generally reach convergence as the agents communicate for multiple rounds. Figure 4b illustrates the influence of the total number of communication rounds K on the EG performance of our IaC method. The performances first improve sharply and then remain stable when K ≥ 3, which indicates that once intention-aware communication is sufficient (K ≥ 3), additional communication does not bring significant improvements. These observations reflect the stability and robustness of our proposed IaC method.
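The neighboring-round differences reported in Figure 4a can be computed from the recorded intended actions, for example as in the brief sketch below; the (K, n_agents, n_timesteps) layout is an assumption made for illustration.

import numpy as np

def round_to_round_change(intention_history):
    """Average absolute change of intended actions between neighboring rounds.

    `intention_history` is assumed to have shape (K, n_agents, n_timesteps):
    the intended actions recorded after each of the K communication rounds.
    Returns a length K-1 vector whose k-th entry is the mean of
    |a^(k+1) - a^(k)|, the quantity plotted in Figure 4a.
    """
    history = np.asarray(intention_history, dtype=float)
    return np.abs(np.diff(history, axis=0)).mean(axis=(1, 2))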
communication rounds. It indicates that the intended actions gen- shown the superiority of our proposed method.
erally reaches convergence as the agents communicate for multiple In the future, we plan to conduct joint optimization with order ex-
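As a rough illustration of the rolling retraining scheme described above, the retraining calendar could be generated as follows; the helper and dates are assumptions for illustration and do not reflect Qlib.RL's actual configuration API.

import pandas as pd

def rolling_retrain_windows(start, end, retrain_freq="2MS"):
    """Build (train_cutoff, deploy_until) pairs for a rolling retraining scheme.

    At each cutoff date the policy is retrained with the latest data available
    up to that date and then deployed for the following two months, matching
    the every-2-months rolling manner described in the text.
    """
    cutoffs = pd.date_range(start, end, freq=retrain_freq)
    return [(cutoff, cutoff + pd.DateOffset(months=2)) for cutoff in cutoffs]

# Example: retraining points covering one year of deployment.
windows = rolling_retrain_windows("2021-01-01", "2021-12-31")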
7 CONCLUSION AND FUTURE WORK
In this paper, we formulate the multi-order execution task as a multi-agent collaboration problem and solve it through an intention-aware communication method. Specifically, we model the intention of each agent as its intended actions and allow the agents to share and refine their intended actions through multiple rounds of communication. A novel action value attribution method is proposed to optimize the intended actions directly. The experimental results have shown the superiority of our proposed method.
In the future, we plan to conduct joint optimization of order execution and portfolio management, and to adapt our intention-aware communication methodology to wider RL applications.


REFERENCES
[1] J. Wang, Y. Zhang, K. Tang, J. Wu, and Z. Xiong, "Alphastock: A buying-winners-and-selling-losers investment strategy using interpretable deep reinforcement attention networks," in KDD, 2019.
[2] Y. Ye, H. Pei, B. Wang, P.-Y. Chen, Y. Zhu, J. Xiao, and B. Li, "Reinforcement-learning based portfolio management with augmented asset movement prediction states," in AAAI, 2020.
[3] A. Cartea and S. Jaimungal, "Optimal execution with limit and market orders," Quantitative Finance, vol. 15, no. 8, pp. 1279–1291, 2015.
[4] Y. Fang, K. Ren, W. Liu, D. Zhou, W. Zhang, J. Bian, Y. Yu, and T.-Y. Liu, "Universal trading for order execution with oracle policy distillation," in AAAI, 2021.
[5] R. Almgren and N. Chriss, "Optimal execution of portfolio transactions," Journal of Risk, 2001.
[6] S. M. Kakade, M. Kearns, Y. Mansour, and L. E. Ortiz, "Competitive algorithms for vwap and limit order trading," in Proceedings of the 5th ACM Conference on Electronic Commerce, 2004, pp. 189–198.
[7] J. W. Lee, J. Park, O. Jangmin, J. Lee, and E. Hong, "A multiagent approach to q-learning for daily stock trading," IEEE Transactions on Systems, Man, and Cybernetics-Part A: Systems and Humans, vol. 37, no. 6, pp. 864–877, 2007.
[8] B. Ning, F. H. T. Ling, and S. Jaimungal, "Double deep q-learning for optimal execution," CoRR, 2018.
[9] S. Lin and P. A. Beling, "An end-to-end optimal trade execution framework based on proximal policy optimization," in IJCAI, 2020.
[10] M. Zhou, Y. Chen, Y. Wen, Y. Yang, Y. Su, W. Zhang, D. Zhang, and J. Wang, "Factorized q-learning for large-scale multi-agent systems," in Proceedings of the First International Conference on Distributed Artificial Intelligence, 2019, pp. 1–7.
[11] M. Tan, "Multi-agent reinforcement learning: Independent vs. cooperative agents," in ICML, 1993.
[12] X. Li, J. Zhang, J. Bian, Y. Tong, and T.-Y. Liu, "A cooperative multi-agent reinforcement learning framework for resource balancing in complex logistics network," in Proceedings of the 18th International Conference on Autonomous Agents and MultiAgent Systems. International Foundation for Autonomous Agents and Multiagent Systems, 2019, pp. 980–988.
[13] J. Foerster, G. Farquhar, T. Afouras, N. Nardelli, and S. Whiteson, "Counterfactual multi-agent policy gradients," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32, 2018.
[14] K. Zhang, Z. Yang, H. Liu, T. Zhang, and T. Basar, "Fully decentralized multi-agent reinforcement learning with networked agents," in International Conference on Machine Learning. PMLR, 2018, pp. 5872–5881.
[15] L. Wang, L. Han, X. Chen, C. Li, J. Huang, W. Zhang, W. Zhang, X. He, and D. Luo, "Hierarchical multiagent reinforcement learning for allocating guaranteed display ads," IEEE Transactions on Neural Networks and Learning Systems, 2021.
[16] A. Das, T. Gervet, J. Romoff, D. Batra, D. Parikh, M. Rabbat, and J. Pineau, "Tarmac: Targeted multi-agent communication," in International Conference on Machine Learning, 2019.
[17] S. Sukhbaatar, R. Fergus et al., "Learning multiagent communication with backpropagation," NeurIPS, 2016.
[18] T. Huang, Z. Lu et al., "Learning individually inferred communication for multi-agent cooperation," NeurIPS, 2020.
[19] W. Kim, J. Park, and Y. Sung, "Communication in multi-agent reinforcement learning: Intention sharing," in ICLR, 2020.
[20] P. R. Cohen and H. J. Levesque, "Intention is choice with commitment," Artificial Intelligence, 1990.
[21] M. Tomasello, M. Carpenter, J. Call, T. Behne, and H. Moll, "Understanding and sharing intentions: The origins of cultural cognition," Behavioral and Brain Sciences, 2005.
[22] D. Hendricks and D. Wilcox, "A reinforcement learning extension to the Almgren-Chriss framework for optimal trade execution," in CIFEr, 2014.
[23] R. Hu, "Optimal order execution using stochastic control and reinforcement learning," 2016.
[24] K. Dabérius, E. Granat, and P. Karlsson, "Deep execution-value and policy based reinforcement learning for trading and beating market benchmarks," Available at SSRN 3374766, 2019.
[25] D. Bertsimas and A. W. Lo, "Optimal control of execution costs," Journal of Financial Markets, vol. 1, no. 1, pp. 1–50, 1998.
[26] Y. Nevmyvaka, Y. Feng, and M. Kearns, "Reinforcement learning for optimized trade execution," in ICML, 2006.
[27] Y. Deng, F. Bao, Y. Kong, Z. Ren, and Q. Dai, "Deep direct reinforcement learning for financial signal representation and trading," IEEE Transactions on Neural Networks and Learning Systems, vol. 28, no. 3, pp. 653–664, 2017.
[28] W. Bao and X.-y. Liu, "Multi-agent deep reinforcement learning for liquidation strategy analysis," CoRR, 2019.
[29] M. Karpe, J. Fang, Z. Ma, and C. Wang, "Multi-agent reinforcement learning in a realistic limit order book market simulation," in Proceedings of the First ACM International Conference on AI in Finance, 2020.
[30] J. Lussange, I. Lazarevich, S. Bourgeois-Gironde, S. Palminteri, and B. Gutkin, "Modelling stock markets by multi-agent reinforcement learning," Computational Economics, 2021.
[31] J. Lussange, I. Lazarevich, S. Bourgeois-Gironde, S. Palminteri, and B. Gutkin, "Stock market microstructure inference via multi-agent reinforcement learning," CoRR, 2019.
[32] J. Lee, R. Kim, S.-W. Yi, and J. Kang, "Maps: Multi-agent reinforcement learning-based portfolio management system," in IJCAI, 2020.
[33] R. Wang, H. Wei, B. An, Z. Feng, and J. Yao, "Commission fee is not enough: A hierarchical reinforced framework for portfolio management," in AAAI, 2021.
[34] F. S. Melo, M. T. Spaan, and S. J. Witwicki, "Querypomdp: Pomdp-based communication in multiagent systems," in European Workshop on Multi-Agent Systems, 2011.
[35] L. Panait and S. Luke, "Cooperative multi-agent learning: The state of the art," Autonomous Agents and Multi-Agent Systems, 2005.
[36] J. Foerster, I. A. Assael, N. De Freitas, and S. Whiteson, "Learning to communicate with deep multi-agent reinforcement learning," NeurIPS, 2016.
[37] J. Jiang, C. Dun, T. Huang, and Z. Lu, "Graph convolutional reinforcement learning," in ICLR, 2020.
[38] J. Jiang and Z. Lu, "Learning attentional communication for multi-agent cooperation," NeurIPS, 2018.
[39] D. Kim, S. Moon, D. Hostallero, W. J. Kang, T. Lee, K. Son, and Y. Yi, "Learning to schedule communication in multi-agent reinforcement learning," ICLR, 2019.
[40] P. Peng, Y. Wen, Y. Yang, Q. Yuan, Z. Tang, et al., "Multiagent bidirectionally-coordinated nets: Emergence of human-level coordination in learning to play starcraft combat games," CoRR, 2017.
[41] N. Rabinowitz, F. Perbet, F. Song, C. Zhang, S. A. Eslami, and M. Botvinick, "Machine theory of mind," in International Conference on Machine Learning. PMLR, 2018.
[42] R. Raileanu, E. Denton, A. Szlam, and R. Fergus, "Modeling others using oneself in multi-agent reinforcement learning," in ICML, 2018.
[43] R. Lowe, Y. I. Wu, A. Tamar, J. Harb, O. Pieter Abbeel, and I. Mordatch, "Multi-agent actor-critic for mixed cooperative-competitive environments," NeurIPS, 2017.
[44] Á. Cartea, S. Jaimungal, and J. Penalva, Algorithmic and High-Frequency Trading. Cambridge University Press, 2015.
[45] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, "Proximal policy optimization algorithms," CoRR, 2017.
[46] N. Jegadeesh and S. Titman, "Returns to buying winners and selling losers: Implications for stock market efficiency," The Journal of Finance, 1993.
[47] X. Yang, W. Liu, D. Zhou, J. Bian, and T.-Y. Liu, "Qlib: An AI-oriented quantitative investment platform," CoRR, 2020.
[48] B. Bhattacharya and D. Habtzghi, "Median of the p value under the alternative hypothesis," The American Statistician, 2002.
[49] K. Cho, B. van Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, "Learning phrase representations using rnn encoder–decoder for statistical machine translation," in EMNLP, 2014.
[50] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention is all you need," NeurIPS, 2017.
