
IEEE TRANSACTIONS ON EMERGING TOPICS IN COMPUTATIONAL INTELLIGENCE, VOL. 3, NO. 1, FEBRUARY 2019

StarCraft Micromanagement With Reinforcement Learning and Curriculum Transfer Learning

Kun Shao, Yuanheng Zhu, Member, IEEE, and Dongbin Zhao, Senior Member, IEEE

Abstract—Real-time strategy games have been an important field of game artificial intelligence in recent years. This paper presents a reinforcement learning and curriculum transfer learning method to control multiple units in StarCraft micromanagement. We define an efficient state representation, which breaks down the complexity caused by the large state space in the game environment. Then, a parameter sharing multi-agent gradient-descent Sarsa(λ) algorithm is proposed to train the units. The learning policy is shared among our units to encourage cooperative behaviors. We use a neural network as a function approximator to estimate the action–value function, and propose a reward function to help units balance their move and attack. In addition, a transfer learning method is used to extend our model to more difficult scenarios, which accelerates the training process and improves the learning performance. In small-scale scenarios, our units successfully learn to combat and defeat the built-in AI with 100% win rates. In large-scale scenarios, the curriculum transfer learning method is used to progressively train a group of units, and it shows superior performance over some baseline methods in target scenarios. With reinforcement learning and curriculum transfer learning, our units are able to learn appropriate strategies in StarCraft micromanagement scenarios.

Index Terms—Reinforcement learning, transfer learning, curriculum learning, neural network, game AI.

Manuscript received August 23, 2017; revised November 17, 2017 and February 8, 2018; accepted March 24, 2018. Date of publication April 27, 2018; date of current version January 21, 2019. This work was supported by the National Natural Science Foundation of China (NSFC) under Grants 61573353, 61603382, and 61533017. (Corresponding author: Dongbin Zhao.)
The authors are with the State Key Laboratory of Management and Control for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China, and also with the University of Chinese Academy of Sciences, Beijing 101408, China (e-mail: [email protected]; [email protected]; [email protected]).
Digital Object Identifier 10.1109/TETCI.2018.2823329
2471-285X © 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

I. INTRODUCTION

Artificial intelligence (AI) has made great advances in the last decade. As an excellent testbed for AI research, games have been helping AI to grow since its birth, including the ancient board games [1]–[4], the classic Atari video games [5], [6], and the imperfect information game [7]. These games have a fixed, limited set of actions, and researchers only need to control a single agent in the game environment. Besides, there are a large number of games that include multiple agents and require complex rules, which are much more difficult for AI research.

In this paper, we focus on a real-time strategy (RTS) game to explore the learning of multi-agent control. RTS games usually run in real time, which is different from taking turns to play in board games [8]. As one of the most popular RTS games, StarCraft has a huge player base and numerous professional competitions, requiring different strategies, tactics and reactive control techniques. For game AI research, StarCraft provides an ideal environment to study the control of multiple units with different difficulty levels [9]. In recent years, the study of StarCraft AI has made impressive progress, driven by StarCraft AI competitions and the Brood War Application Programming Interface (BWAPI)1 [10]. Recently, researchers have developed more efficient platforms to promote the development of this field, including TorchCraft, ELF and PySC2. StarCraft AI aims at solving a series of challenges, such as spatial and temporal reasoning, multi-agent collaboration, opponent modeling and adversarial planning [8]. At present, designing a game AI for the full StarCraft game based on machine learning methods is out of reach. Many researchers focus on micromanagement as the first step to study AI in StarCraft [11]. In combat scenarios, units have to navigate in a highly dynamic environment and attack enemies within fire range. There are many methods for StarCraft micromanagement, including potential fields for spatial navigation and obstacle avoidance [12], [13], Bayesian modeling to deal with incompleteness and uncertainty in the game [14], heuristic game-tree search to handle both build order planning and units control [15], and neuroevolution to control individual units with hand-crafted features [16].

1 http://bwapi.github.io/

As an intelligent learning method, reinforcement learning (RL) is very suitable for sequential decision-making tasks. In StarCraft micromanagement, there are some interesting applications with RL methods. Shantia et al. use online Sarsa and neural-fitted Sarsa with a short term memory reward function to control units' attack and retreat [17]. They use vision grids to obtain the terrain information. This method needs a hand-crafted design, and the number of input nodes has to change with the number of units. Besides, they apply an incremental learning method to scale the task to a larger scenario with 6 units. However, the win rate with incremental learning is still below 50%. Wender et al. use different RL algorithms in micromanagement, including Q-learning and Sarsa [18]. They control one powerful unit to play against multiple weak units, without cooperation and teamwork between own units.

In the last few years, deep learning has achieved remarkable performance in many complex problems [19], and has dramatically improved the generalization and scalability of traditional RL algorithms [5].

Deep reinforcement learning (DRL) can teach agents to make decisions in high-dimension state spaces in an end-to-end manner. Usunier et al. propose an RL method to tackle micromanagement with deep neural networks [20]. They use a greedy MDP to choose actions for units sequentially at each time step, with zero-order optimization to update the model. This method is able to control all units owned by the player, and observes the full state of the game. Peng et al. use an actor-critic method and recurrent neural networks (RNNs) to play StarCraft combat games [21]. The dependency of units is modeled by bi-directional RNNs in hidden layers, and its gradient update is efficiently propagated through the entire networks. Different from Usunier's and Peng's work, which design centralized controllers, Foerster et al. propose a multi-agent actor-critic method to tackle decentralized micromanagement tasks, which significantly improves the performance over centralized RL controllers [22].

For StarCraft micromanagement, traditional methods have difficulties in handling the complicated state and action space, and in learning cooperative tactics. Modern methods rely on the strong compute capability introduced by deep learning. Besides, learning micromanagement with model-free RL methods usually needs a lot of training time, which is even more serious in large scale scenarios. In this paper, we aim to explore a more efficient state representation to break down the complexity caused by the large state space, and to propose an appropriate RL algorithm to solve the problem of multi-agent decision making in StarCraft micromanagement. In addition, we introduce curriculum transfer learning to extend the RL model to various scenarios and improve the sample efficiency.

The main contributions consist of three parts. First, we propose an efficient state representation method to deal with the large state space in StarCraft micromanagement. This method takes units' attributes and distances into consideration, allowing an arbitrary number of units on both sides. Compared with related work, our state representation is more concise and more efficient. Second, we present a parameter sharing multi-agent gradient-descent Sarsa(λ) (PS-MAGDS) algorithm to train our units. Using a neural network as a function approximator, agents share the parameters of a centralized policy, and update the policy with their own experiences simultaneously. This method trains homogeneous agents efficiently, and encourages cooperative behaviors. To solve the problem of sparse and delayed rewards, we introduce a reward function including small intermediate rewards in the RL model. This reward function improves the training process, and serves as an intrinsic motivation to help units collaborate with each other. Third, we propose a transfer learning method to extend our model to various scenarios. Compared with learning from scratch, this method accelerates the training process and improves the learning performance to a great extent. In large scale scenarios, we apply the curriculum transfer learning method to successfully train a group of units. In terms of win rates, our proposed method is superior to some baseline methods in target scenarios.

The rest of the paper is organized as follows. In Section II, we describe the problem formulation of StarCraft micromanagement, as well as the backgrounds of reinforcement learning and curriculum transfer learning. In Section III, we present the reinforcement learning model for micromanagement, including the state representation method, network architecture and action definition. In Section IV, we introduce the parameter sharing multi-agent gradient-descent Sarsa(λ) algorithm and the reward function. In Section V, we introduce the StarCraft micromanagement scenarios used in our paper and the training details. In Section VI, we make an analysis of the experimental results and discuss the learned strategies. In the end, we draw a conclusion of the paper and propose some future work.

Fig. 1. Representation of the agent-environment interaction in reinforcement learning.

II. PROBLEM FORMULATION AND BACKGROUNDS

A. Problem Formulation

In StarCraft micromanagement, we need to control a group of units to destroy the enemies under certain terrain conditions. The combat scenario with multiple units is approximated as a Markov game, a multi-agent extension of Markov decision processes (MDPs) [21]–[23]. In a Markov game with N agents, a set of states S is used to describe the properties of all agents and the environment, as well as a set of actions A_1, ..., A_N and observations O_1, ..., O_N for each agent.

In the combat, units on each side need to cooperate with each other. Developing a learning model for multiple units is challenging in micromanagement. In order to maintain a flexible framework and allow an arbitrary number of units, we consider that each of our units accesses the state space S through its own observation of the current combat, treating other units as part of the environment, S → O_i. Each unit interacts in the combat environment with its own observation and action. S × A_1 × ... × A_N → S' denotes the transition from state S to the successive state S' with the actions of all the units, and R_1, ..., R_N are the generated rewards of each unit. For the sake of multi-agent cooperation, the policy is shared among our units. The goal of each unit is to maximize its total expected return.
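As a minimal illustration of this decentralized formulation (this is our own sketch, not code from the paper; the class and field names are hypothetical), each controlled unit builds its own observation O_i from the global state S and queries the same shared policy:

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class UnitState:
    """Per-unit attributes in the global state (hypothetical fields)."""
    x: float
    y: float
    hitpoint: float
    cooldown: float
    is_ally: bool

# The global state S is simply the collection of all unit states.
GlobalState = List[UnitState]

def observe(state: GlobalState, agent_idx: int) -> list:
    """S -> O_i: build unit i's local observation, treating all other
    units (allies and enemies) as part of the environment."""
    me = state[agent_idx]
    obs = [me.cooldown, me.hitpoint]
    for j, other in enumerate(state):
        if j == agent_idx:
            continue
        # Relative information only; the concrete features are defined in Section III.
        obs.extend([other.x - me.x, other.y - me.y, float(other.is_ally)])
    return obs

def step_all_units(state: GlobalState, shared_policy: Callable[[list], int]) -> List[int]:
    """All controlled units query the same (parameter-shared) policy,
    but each acts on its own observation, so their behaviors can differ."""
    return [shared_policy(observe(state, i))
            for i, u in enumerate(state) if u.is_ally]
```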

B. Reinforcement Learning

To solve the multi-agent control problem in StarCraft micromanagement, we can resort to reinforcement learning. Reinforcement learning is a type of machine learning in which agents learn by trial and error and determine the ideal behavior from their own experience with the environment [24]. We draw the classic RL diagram in Fig. 1. It shows the process in which an RL agent interacts with the environment. The agent-environment interaction process in RL is formulated as a Markov decision process. The agent in state s makes an action a according to the policy π. This behavior causes a reward r, and the agent transfers to a new state. We define the future discounted return at time t as \sum_{t'=t}^{T} \gamma^{t'-t} r_{t'}, where T is the terminal time step and γ ∈ [0, 1] is a discount factor that determines the importance of future rewards. The aim of an RL model is to learn an optimal policy π, which defines the probability of selecting action a in state s, so that the sum of the overall discounted rewards is maximized, as demonstrated by

\max_{\pi} \mathbb{E}\left[\sum_{t'=t}^{T} \gamma^{t'-t} r_{t'} \,\middle|\, s = s_t, a = a_t, \pi\right].   (1)

As one of the most popular RL algorithms, temporal difference (TD) learning is a combination of the Monte Carlo method and the dynamic programming method. TD methods can learn from raw experience without a model of the environment, and update estimates based on part of the sequence, without waiting for a final outcome [25]. The most widely known TD learning algorithms are Q-learning and Sarsa. Q-learning estimates the value of making an action in a given state and iteratively updates the Q-value estimate towards the observed reward. The TD error δ_t in Q-learning is computed as

\delta_t = r_{t+1} + \gamma \max_a Q(s_{t+1}, a) - Q(s_t, a_t).   (2)

Q-learning is an off-policy learning method, which means it learns a different policy from the one used to choose actions. Different from Q-learning's off-policy mechanism, Sarsa is an on-policy method, which means the policy is used both for selecting actions and for updating previous Q-values [24]. The Sarsa update rule is demonstrated as

\delta_t = r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t),   (3a)
Q(s_t, a_t) = Q(s_t, a_t) + \alpha \delta_t,   (3b)

where α is the learning rate. Traditional reinforcement learning methods have some successful applications, including TD in Backgammon [26] and adaptive dynamic programming (ADP) in control [27]–[29].

Reinforcement learning with deep neural network function approximators has received great attention in recent years. DRL provides an opportunity to train a single agent to solve a series of human-level tasks in an end-to-end manner [30], [31]. As the most famous DRL algorithm, deep Q-network (DQN) uses the experience replay technique and a target network to remove the correlations between samples and stabilize the training process [5]. In the last few years, we have witnessed a great number of improvements on DQN, including double DQN [32], prioritized DQN [33], dueling DQN [34], distributed DQN [35] and asynchronous DQN [36]. Apart from value-based DRL methods like DQN and its variants, policy-based DRL methods use deep networks to parameterize and optimize the policy directly [37]. Deep deterministic policy gradient (DDPG) is the continuous analogue of DQN, which uses a critic to estimate the value of the current policy and an actor to update the policy [38]. Policy-based DRL methods play important roles in continuous control, including asynchronous advantage actor-critic (A3C) [36], trust region policy optimization (TRPO) [39], proximal policy optimization (PPO) [40], and so on. The sample complexity of traditional DRL methods tends to be high, which limits their application to real-world problems. In contrast, model-based DRL approaches learn value functions and policies in a data-efficient way, and have been widely used in sensorimotor control. Guided policy search (GPS) uses a supervised learning algorithm to train the policy and an RL algorithm to generate guiding distributions, allowing deep policies to be trained efficiently [41]. Researchers have also proposed some other model-based DRL methods, like normalized advantage functions (NAF) [42] and embed to control (E2C) [43].

Multi-agent reinforcement learning is a closely related area to our work [44]. A multi-agent system includes a number of agents interacting in one environment [45], [46]. Recently, some multi-agent reinforcement learning algorithms with deep neural networks have been proposed to learn communication [47], cooperative-competitive behaviors [23] and imperfect information [48]. In our work, we use a multi-agent reinforcement learning method with policy sharing among agents to learn cooperative behaviors. Agents share the parameters of a centralized policy, and update the policy with their own experiences simultaneously. This method can train homogeneous agents more efficiently [49].

C. Curriculum Transfer Learning

Generally speaking, model-free reinforcement learning methods need plenty of samples to learn an optimal policy. However, for many challenging tasks it is difficult for traditional RL methods to learn admissible policies in large state and action spaces. In StarCraft micromanagement, there are numerous scenarios with different units and terrain conditions. It would take a lot of time to learn useful strategies in different scenarios from scratch. A number of researchers focus on improving the learning speed and performance by exploiting domain knowledge across various but related tasks. The most widely used approach is transfer learning (TL) [50], [51]. To some extent, transfer learning is a kind of generalization across tasks, transferring knowledge from source tasks to target tasks. Besides, transfer learning can be extended to RL problems by reusing the model parameters in the same model architecture [52]. The procedure of using transfer learning in our experiments is to train the model with the RL method in a source scenario first. Then, we can use the well-trained model as a starting point to learn micromanagement in target scenarios.

As a special form of transfer learning, curriculum learning involves a set of tasks organized by increasing level of difficulty. The initial tasks are used to guide the learner so that it can perform better on the final task [53]. Combining curriculum learning and transfer learning, the curriculum transfer learning (CTL) method has shown good performance in helping the learning process converge faster and towards a better optimum in recent work [54]–[56]. For micromanagement, a feasible way of using CTL is mastering a simple scenario first, and then solving difficult scenarios based on this knowledge. By changing the number and type of units, we can control the difficulty of micromanagement. In this way, we can use CTL to train our units with a sequence of progressively difficult micromanagement scenarios, as shown in Fig. 2.
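As a minimal sketch of this parameter-transfer procedure (our own illustration, not the authors' released code; the scenario names and the train_in_scenario function are placeholders, and the network sizes anticipate Section III), the policy trained on each easier scenario is simply carried over as the initialization for the next, harder one:

```python
import torch
import torch.nn as nn

def make_policy_net(state_dim: int = 93, hidden: int = 100, n_actions: int = 9) -> nn.Module:
    # Fixed-size input and output, so the same architecture fits every scenario.
    return nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                         nn.Linear(hidden, n_actions))

def train_in_scenario(policy: nn.Module, scenario: str, episodes: int) -> None:
    """Placeholder for the RL training loop of Section IV; it would update
    `policy` in place by playing `episodes` episodes of `scenario`."""
    pass  # assumption: the actual training code is omitted here

# A curriculum of progressively harder micromanagement scenarios (hypothetical names).
curriculum = ["easy scenario", "harder scenario", "target scenario"]

policy = make_policy_net()
for scenario in curriculum:
    # The policy trained on the previous (easier) scenario is the starting point
    # for the next one, instead of a fresh random initialization.
    train_in_scenario(policy, scenario, episodes=4000)  # 4000 is an example budget
    torch.save(policy.state_dict(), f"{scenario}.pt")
```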

Fig. 2. Representation of curriculum transfer learning. Storing knowledge gained solving the source task and gradually applying it to M curricular tasks to update the knowledge. Eventually, applying it to the target task.

TABLE I
THE DATA TYPE AND DIMENSION OF INPUTS IN OUR MODEL

III. LEARNING MODEL FOR MICROMANAGEMENT

A. Representation of High-Dimension State

State representation of StarCraft is still an open problem with no universal solution. We construct a state representation with inputs from the game engine, which have different data types and dimensions, as depicted in Table I. The proposed state representation method is efficient and independent of the number of units in the combat. In summary, the state representation is composed of three parts: the current step state information, the last step state information and the last step action, as shown in Fig. 3. The current step state information includes our weapon's cooldown time, our unit's hitpoint, the distance information of own units, the distance information of enemy units and the distance information of terrain. The last step state information is the same as the current step. We take the last step action into consideration, which has been proven to be helpful for the learning process in the RL domain [57], [58]. The proposed state representation method also has good generalization and can be used in other combat games which need to take agents' properties and distance information into consideration.

Fig. 3. Representation of the learning model of one unit in StarCraft micromanagement scenarios. The state representation has three parts and a neural network is used as the function approximator. The network outputs the probabilities of moving to 8 directions and attack.

All inputs with real type are normalized by their maximum values. Among them, CoolDown and HitPoint have 1 dimension each. We divide the combat map into 8 equal sector areas, and compute the distance information in each area. Units' distance information is listed as follows:
1) OwnSumInfo: own units' distances are summed in each area;
2) OwnMaxInfo: own units' distances are maximized in each area;
3) EnemySumInfo: enemy units' distances are summed in each area;
4) EnemyMaxInfo: enemy units' distances are maximized in each area.

If a unit is out of the central unit's sight range D, the unit's distance value dis_unit is set to 0.05. Otherwise, the value is linear in d, the distance to the central unit, as demonstrated in (4):

dis\_unit(d) = \begin{cases} 0.05, & d > D \\ 1 - 0.95\,(d/D), & d \le D \end{cases}   (4)

In addition, the terrain distance value dis_terrain is also computed in the 8 sector areas. If the obstacle is out of the central unit's sight range, the value is set to 0. Otherwise, the value is also linear in the distance to the central unit, as shown in (5):

dis\_terrain(d) = \begin{cases} 0, & d > D \\ 1 - d/D, & d \le D \end{cases}   (5)
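To make the 8-sector distance features concrete, the following sketch (assumed Python; the function names and the sector orientation are our choices, not taken from the paper's released code) computes dis_unit from (4) for every visible unit and aggregates the per-sector sum and maximum, as in OwnSumInfo/OwnMaxInfo:

```python
import math

def dis_unit(d: float, D: float) -> float:
    # Eq. (4): 0.05 outside the sight range, otherwise linear in d.
    return 0.05 if d > D else 1.0 - 0.95 * d / D

def sector_index(dx: float, dy: float, n_sectors: int = 8) -> int:
    # Map a relative position to one of 8 equal sector areas around the central unit.
    angle = math.atan2(dy, dx) % (2 * math.pi)
    return int(angle / (2 * math.pi / n_sectors)) % n_sectors

def distance_features(center, others, D: float, n_sectors: int = 8):
    """Return (SumInfo, MaxInfo): per-sector sum and max of dis_unit values
    for the units in `others`, seen from `center` (both are (x, y) tuples)."""
    sum_info = [0.0] * n_sectors
    max_info = [0.0] * n_sectors
    cx, cy = center
    for (x, y) in others:
        dx, dy = x - cx, y - cy
        d = math.hypot(dx, dy)
        k = sector_index(dx, dy, n_sectors)
        v = dis_unit(d, D)
        sum_info[k] += v
        max_info[k] = max(max_info[k], v)
    return sum_info, max_info

# Example: two allied units seen from a central unit with an assumed sight range D = 256.
own_sum, own_max = distance_features((0, 0), [(30, 40), (-100, 20)], D=256)
```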
In this way, the current step state information has 42 dimensions. The last step action has 9 dimensions, with the selected action set to 1 and the other actions set to 0. In total, the state representation in our model is embedded into 93 dimensions.

B. Action Definition

In StarCraft micromanagement scenarios, the original action space is very large. At each time step, each unit can move in arbitrary directions with arbitrary distances in the map. When the unit decides to attack, it can choose any enemy unit in the weapon's fire range. In order to simplify the action space, we choose 8 move directions with a fixed distance and attacking the weakest enemy as the available actions for each unit. When the chosen action is move, our units will turn to one of the 8 directions, Up, Down, Left, Right, Upper-left, Upper-right, Lower-left, Lower-right, and move a fixed distance. When the chosen action is attack, our units will stay at the current position and focus fire on enemy units. Currently, we select the enemy with the lowest hitpoint in our weapon's attack range as the target. According to the experimental results, these actions are enough to control our units in the game.

C. Network Architecture

Because our units' experience covers only a limited subset of the large state space and most test states will never have been explored before, it is difficult to apply tabular reinforcement learning to learn an optimal policy. To solve this problem, we use a neural network parameterized by a vector θ to approximate the state-action values and improve our RL model's generalization.

The input of the network is the 93-dimension tensor from the state representation. We have 100 neurons in the hidden layer, and use the rectified linear unit (ReLU) activation function for the network nonlinearity, as demonstrated by

f(z) = \max(0, z),   (6)

where z is the output of the hidden layer. We use the ReLU function rather than the Sigmoid or tanh function, because the ReLU function does not suffer from the vanishing gradient problem, which guarantees the effective training of the model [59]. Different from saturating nonlinearities such as Sigmoid or tanh, the ReLU function is a non-saturating nonlinearity. In terms of training time with gradient descent, the non-saturating nonlinearity is much faster [60].

The output layer of the neural network has 9 neurons, giving the probabilities of moving to the 8 directions and attacking. The learning model of one unit in StarCraft micromanagement scenarios, including the state representation, neural network architecture and output actions, is depicted in Fig. 3.

IV. LEARNING METHOD FOR MICROMANAGEMENT

In this paper, we formulate StarCraft micromanagement as a multi-agent reinforcement learning model. We propose a parameter sharing multi-agent gradient-descent Sarsa(λ) (PS-MAGDS) method to train the model, and design a reward function as intrinsic motivation to promote the learning process. The whole PS-MAGDS reinforcement learning diagram is depicted in Fig. 4.

Fig. 4. The PS-MAGDS reinforcement learning diagram in the StarCraft micromanagement scenarios.

A. Parameter Sharing Multi-Agent Gradient-Descent Sarsa(λ)

We propose a multi-agent RL algorithm that extends the traditional Sarsa(λ) to multiple units by sharing the parameters of the policy network among our units. To accelerate the learning process and tackle the problem of delayed rewards, we use eligibility traces in reinforcement learning. As a basic mechanism in RL, eligibility traces are used to assign temporal credit, which considers a set of previously experienced transitions [61]. This means they consider not only the value of the last state-action pair but also the previously visited ones. With this method, we can alleviate the problem of delayed reward in the game environment. Sarsa with eligibility traces, termed Sarsa(λ), is one way of averaging backups made over multiple steps. λ is a factor that determines the weight of each backup. In our implementation of Sarsa(λ) for multiple units combat, we use a neural network as the function approximator and share the network parameters among all our units. Although we have only one network to train, the units can still behave differently because each one receives different observations and actions as its input.

To update the policy network efficiently, we use the gradient-descent method to train the Sarsa(λ) reinforcement learning model. The gradient-descent learning update is demonstrated in (7):

\delta_t = r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}; \theta_t) - Q(s_t, a_t; \theta_t),   (7a)
\theta_{t+1} = \theta_t + \alpha \delta_t e_t,   (7b)
e_t = \gamma \lambda e_{t-1} + \nabla_{\theta_t} Q(s_t, a_t; \theta_t), \quad e_0 = 0,   (7c)

where e_t is the eligibility trace at time step t.
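Assuming automatic differentiation is used to obtain ∇_θ Q, a minimal PyTorch sketch of the network from Section III-C and the accumulating-trace update in (7) could look as follows (this is our illustration, not the authors' released implementation; all controlled units would call the same function, since the parameters and traces are shared):

```python
import torch
import torch.nn as nn

# 93-dimension state, one hidden layer of 100 ReLU units, 9 actions (8 moves + attack).
q_net = nn.Sequential(nn.Linear(93, 100), nn.ReLU(), nn.Linear(100, 9))

gamma, lam, alpha = 0.9, 0.8, 0.001
# One eligibility trace per parameter tensor, initialized to zero (e_0 = 0).
traces = [torch.zeros_like(p) for p in q_net.parameters()]

def sarsa_lambda_step(s, a, r, s_next, a_next):
    """One gradient-descent Sarsa(λ) update, Eq. (7), shared by all units."""
    q_sa = q_net(torch.as_tensor(s, dtype=torch.float32))[a]
    with torch.no_grad():
        q_next = q_net(torch.as_tensor(s_next, dtype=torch.float32))[a_next]
        delta = r + gamma * q_next - q_sa              # Eq. (7a)
    # Gradient of Q(s_t, a_t; θ) with respect to the shared parameters.
    q_net.zero_grad()
    q_sa.backward()
    with torch.no_grad():
        for p, e in zip(q_net.parameters(), traces):
            e.mul_(gamma * lam).add_(p.grad)           # Eq. (7c): e_t = γλ e_{t-1} + ∇θ Q
            p.add_(alpha * delta * e)                  # Eq. (7b): θ_{t+1} = θ_t + α δ_t e_t
```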

One of the challenging issues in reinforcement learning is the trade-off between exploration and exploitation. If we choose the best action at every step according to the current policy, we are likely to get trapped in a local optimum. On the contrary, if we tend to explore in the large state space, the model will have difficulty in converging. In the experiment, we use the ε-greedy method to choose actions during training, which selects the current best action with probability 1 − ε, and takes a random exploratory action with probability ε:

a = \begin{cases} randint(N), & random(0, 1) < \varepsilon \\ \arg\max_a Q(s, a), & \text{otherwise} \end{cases}   (8)

where N equals 9 in the experiment.

We use an exponential ε decay to implement the ε-greedy method. The ε is initialized to 0.5 and is annealed with an exponential smoothing window of the episode number episode_num, as demonstrated by

\varepsilon = 0.5 / \sqrt{1 + episode\_num}.   (9)

The overall parameter sharing multi-agent gradient-descent Sarsa(λ) method is presented in Algorithm 1.

Algorithm 1: Parameter Sharing Multi-Agent Gradient-Descent Sarsa(λ).
1: Initialize policy parameters θ shared among our units
2: Repeat (for each episode):
3:   e_0 = 0
4:   Initialize s_t, a_t
5:   Repeat (for each step of episode):
6:     Repeat (for each unit):
7:       Take action a_t, receive r_{t+1}, next state s_{t+1}
8:       Choose a_{t+1} from s_{t+1} using ε-greedy
9:       If random(0, 1) < ε
10:        a_{t+1} = randint(N)
11:      else
12:        a_{t+1} = arg max_a Q(s_{t+1}, a; θ_t)
13:    Repeat (for each unit):
14:      Update TD error, weights and eligibility traces
15:      δ_t = r_{t+1} + γQ(s_{t+1}, a_{t+1}; θ_t) − Q(s_t, a_t; θ_t)
16:      θ_{t+1} = θ_t + αδ_t e_t
17:      e_{t+1} = γλe_t + ∇_{θ_{t+1}} Q(s_{t+1}, a_{t+1}; θ_{t+1})
18:      t ← t + 1
19:   until s_t is terminal

B. Reward Function

The reward function provides useful feedback for RL agents, which has a significant impact on the learning results [62]. The goal of StarCraft micromanagement is to destroy all of the enemy units in the combat. If the reward is only based on the final result, the reward function will be extremely sparse. Moreover, units usually get a positive reward only after many steps. The delay in rewards makes it difficult to learn which set of actions is responsible for the corresponding rewards.

To tackle the problem of sparse and delayed rewards in micromanagement, we design a reward function that includes small intermediate rewards. In our experiment, all agents receive a main reward caused by their attack action at each time step, equal to the damage that the enemy units received minus the hitpoint loss of our units:

r_t = \big(damage\_amount_t \times damage\_factor - \rho \times (unit\_hitpoint_{t-1} - unit\_hitpoint_t)\big) / 10   (10)

where damage_amount is the amount of damage caused by our units' attack, damage_factor is our units' attack force and unit_hitpoint is our unit's hitpoint. We divide the reward by a constant to resize it to a more suitable range, which is set to 10 in our experiment. ρ is a normalization factor to balance the total hitpoints of our units and the enemy units,

\rho = \sum_{i=1}^{H} enemy\_hitpoint_i \Big/ \sum_{j=1}^{N} unit\_hitpoint_j   (11)

where H is the number of enemy units, and N is the number of our units. Generally speaking, this normalization factor is necessary in StarCraft micromanagement with different numbers and types of units. Without proper normalization, the policy network will have difficulty in converging, and our units will need many more episodes to learn useful behaviors.

Apart from the basic attack reward, we consider some extra rewards as intrinsic motivation to speed up the training process. When a unit is destroyed, we introduce an extra negative reward, and set it to −10 in our experiment. We would like to punish this behavior, considering that the loss of our own units has a bad influence on the combat result. Besides, in order to encourage our units to work as a team and make cooperative actions, we introduce a reward for units' move. If there are none of our units or enemy units in the move direction, we give this move action a small negative reward, which is set to −0.5. According to the experiment, this reward has an impressive effect on the learning performance, as shown in Fig. 6.

C. Frame Skip

When applying reinforcement learning to video games, we should pay attention to the continuity of actions. Because of the real-time property of StarCraft micromanagement, it is impractical to make an action every game frame. One feasible method is using the frame skip technique, which executes a training step every fixed number of frames. However, a small frame skip will introduce strong correlation in the training data, while a large frame skip will reduce the effective training samples. We refer to related work in [20], and try several frame skips (8, 10, 12) in a small scale micromanagement scenario. At last, we set the frame skip to 10 in our experiment, which takes an action every 10 frames for each unit.
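A compact sketch of the shaped per-step reward in (10)–(11), together with the −10 unit-loss penalty and the −0.5 idle-move penalty described above, is given below (our own Python illustration; the variable names and the example hitpoint values are ours, not taken from the paper):

```python
def normalization_factor(enemy_hitpoints, own_hitpoints):
    # Eq. (11): total enemy hitpoints divided by total own hitpoints.
    return sum(enemy_hitpoints) / sum(own_hitpoints)

def step_reward(damage_amount, damage_factor, hp_prev, hp_now, rho,
                unit_died=False, moved_away_from_everyone=False):
    """Shaped reward for one unit at one time step."""
    # Eq. (10): damage dealt minus (normalized) hitpoint loss, rescaled by 10.
    r = (damage_amount * damage_factor - rho * (hp_prev - hp_now)) / 10.0
    if unit_died:
        r += -10.0   # extra penalty when the unit is destroyed
    if moved_away_from_everyone:
        r += -0.5    # small penalty for moving where no units are present
    return r

# Example with illustrative hitpoint values only (not Table II data).
rho = normalization_factor(enemy_hitpoints=[160] * 6, own_hitpoints=[125] * 3)
r = step_reward(damage_amount=12, damage_factor=1.0,
                hp_prev=125, hp_now=118, rho=rho)
```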

V. EXPERIMENT SETTINGS

A. StarCraft Micromanagement Scenarios

We consider several StarCraft micromanagement scenarios with various units, including Goliaths vs. Zealots, Goliaths vs. Zerglings and Marines vs. Zerglings, as shown in Fig. 5.

Fig. 5. Representation of the StarCraft micromanagement scenarios in our experiments, left: Goliaths vs. Zealots; middle: Goliaths vs. Zerglings; right: Marines vs. Zerglings.

TABLE II
THE COMPARATIVE ATTRIBUTES OF DIFFERENT UNITS IN OUR MICROMANAGEMENT SCENARIOS

1) In the first scenario, we will control 3 Goliaths to fight against 6 Zealots. From Table II, we can see that the enemy units have an advantage in the number of units, hitpoint and damage factor. By contrast, our units' fire range is much wider.
2) In the second scenario, the enemies have 20 Zerglings. Our Goliath units have an advantage in hitpoint, damage factor and fire range, while the enemies have many more units and less cooldown time.
3) In the third scenario, we will control up to 20 Marines to fight against 30 Zerglings. The enemy units have an advantage in speed and amount, while our units have an advantage in fire range and damage factor.

We divide these scenarios into two groups. The first and the second are small scale micromanagement, and the last is large scale micromanagement. In these scenarios, the enemy units are controlled by the built-in AI, which is hard-coded with game inputs. An episode terminates when either side's units are all destroyed. A human beginner of StarCraft can't beat the built-in AI in these scenarios. Platinum-level players have average win rates of below 50% over 100 games for each scenario. Our RL agents are expected to exploit their advantages and avoid their disadvantages to win these combats.

B. Training

In the training process, we set the discount factor γ to 0.9, the learning rate α to 0.001, and the eligibility trace factor λ to 0.8 in all scenarios. Moreover, the maximum steps of each episode are set to 1000. In order to accelerate the learning process, the game runs at full speed by setting gameSpeed to 0 in BWAPI. The experiment is run on a computer with an Intel i7-6700 CPU and 16 GB of memory.
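Collected in one place, the training settings stated in Sections IV and V-B might be expressed as a simple configuration block (a sketch only; the key names are ours):

```python
# Hyperparameters as stated in Sections IV and V-B (key names are ours).
TRAIN_CONFIG = {
    "discount_gamma": 0.9,         # discount factor γ
    "learning_rate_alpha": 0.001,  # learning rate α
    "trace_lambda": 0.8,           # eligibility trace factor λ
    "epsilon_init": 0.5,           # initial ε for ε-greedy exploration
    "max_episode_steps": 1000,     # episode step limit
    "frame_skip": 10,              # one decision every 10 game frames
    "episodes_small_scale": 4000,  # training episodes in the small scale scenarios
    "game_speed": 0,               # BWAPI gameSpeed (full speed)
}
```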

VI. RESULTS AND DISCUSSIONS

In this section, we analyze the results in different micromanagement scenarios and discuss our RL model's performance. In small scale scenarios, we use the first scenario as a starting point to train our units. In the remaining scenarios, we introduce the transfer learning method to scale the combat to large scenarios. The objective of StarCraft micromanagement is defeating the enemies and increasing the win rates in these given scenarios. For a better comprehension, we analyze the win rates, episode steps and average rewards during training, as well as the learned strategies. Our code and results are open-source at https://github.com/nanxintin/StarCraft-AI.

A. Small Scale Micromanagement

In small scale micromanagement scenarios, we will train Goliaths against enemy units with different amounts and types. In the second scenario, we will also use the transfer learning method to train Goliaths based on the well-trained model of the first scenario. Both scenarios are trained with 4000 episodes and over 1 million steps.

1) Goliaths vs. Zealots: In this scenario, we train our Goliath units from scratch and analyze the results.
1) Win Rates: At first, we analyze the learning performance of our RL method with moveReward. To evaluate the win rates, we test our model after every 200 episodes' training for 100 combats, and depict the results in Fig. 6. We can see that our Goliath units can't win any combats before 1400 episodes. With the progress of training, units start to win several games and the curve of win rates has an impressive increase after 2000 episodes. After 3000 episodes' training, our units reach win rates of 100% at last.

Fig. 6. The win rates of our units in the 3 Goliaths vs. 6 Zealots micromanagement scenario from every 200 episodes' training.

2) Episode Steps: We depict the average episode steps and the standard deviations of our three Goliath units during training in Fig. 7. It is apparent that the curve of average episode steps has four stages. In the opening, the episode steps are extremely few because the Goliaths have learned nothing and are destroyed quickly. After that, the Goliaths start to realize that the hitpoint damage causes a negative reward. They learn to run away from enemies and the episode steps increase to a high level. Then, the episode steps start to decrease because the Goliaths learn to attack to get positive rewards, rather than just running away. In the end, the Goliaths have learned an appropriate policy to balance move and attack, and they are able to destroy the enemies in about 300 steps.

Fig. 7. The episode steps of our units in the 3 Goliaths vs. 6 Zealots micromanagement scenario during training.

3) Average Rewards: Generally speaking, a powerful game AI in micromanagement scenarios should defeat the enemies as soon as possible. Here we introduce the average rewards, dividing the total rewards by the episode steps in the combat. The curve of our Goliath units' average rewards is depicted in Fig. 8. The average rewards have an obvious increase in the opening, grow steadily during training and stay smooth after almost 3000 episodes.

Fig. 8. The average reward of our units in the 3 Goliaths vs. 6 Zealots micromanagement scenario during training.

2) Goliaths vs. Zerglings: In this scenario, the enemy units are a group of Zerglings, and we reuse the well-trained model from the first scenario to initialize the policy network. In comparison with learning from scratch, we can gain a better understanding of transfer learning.
1) Win Rates: We draw the win rates in Fig. 9. When training from scratch, the learning process is extremely slow and our units can't win a game until 1800 episodes. Without transfer learning, the win rates are below 60% after 4000 episodes. When training based on the model of the first scenario, the learning process is much faster. Even in the opening, our units win several games, and the win rates reach 100% in the end.

Fig. 9. The win rates of our units in the 3 Goliaths vs. 20 Zerglings micromanagement scenario from every 200 episodes' training.

2) Episode Steps: In Fig. 10, we draw the average episode steps of our three Goliaths during training. Without transfer learning, the curve has a similar trend to that in the first scenario. The average episode steps have an obvious increase in the opening and drop gradually during training. When training with transfer learning, the average episode steps stay stable over the whole training process, within the range of 200 to 400. A possible explanation is that our units have learned some basic move and attack skills from the well-trained model, and they use these skills to speed up the training process.

Fig. 10. The average episode steps of our units in the 3 Goliaths vs. 20 Zerglings micromanagement scenario during training.

3) Average Rewards: We draw the average rewards of our three Goliaths in Fig. 11. When training from scratch, our units have difficulty in winning the combat in the opening and the average rewards are at a low level before 1000 episodes. The average rewards with transfer learning, by

comparison, are much higher from the beginning and behave better over the whole training process.

Fig. 11. The average reward of our units in the 3 Goliaths vs. 20 Zerglings micromanagement scenario during training.

B. Large Scale Micromanagement

In large scale micromanagement scenarios, we use curriculum transfer learning to train our Marines to play against Zerglings, and compare the results with some baseline methods.
1) Marines vs. Zerglings: In this section, we design a curriculum with 3 classes to train the units, as shown in Table III. After training, we test the performance in two target scenarios: M10 vs. Z13 and M20 vs. Z30. In addition, we use some baseline methods as a comparison, which consist of rule-based approaches and DRL approaches.
1) Weakest: A rule-based method, attacking the weakest enemy in the fire range.
2) Closest: A rule-based method, attacking the closest enemy in the fire range.
3) GMEZO: A DRL method, based on zero-order optimization, having impressive results over traditional RL methods [20].
4) BiCNet: A DRL method, based on the actor-critic architecture, having the best performance in most StarCraft micromanagement scenarios [21].

TABLE III
CURRICULUM DESIGN FOR MARINES VS. ZERGLINGS MICROMANAGEMENT
M: Marine, Z: Zergling.

In Table IV, we present the win rates of the PS-MAGDS method and the baseline methods. In each scenario, we measure our model's average win rates in 100 test games for 5 times. In M10 vs. Z13, PS-MAGDS achieves a win rate of 97%, which is much higher than the other methods, including the recently proposed GMEZO and BiCNet. In M20 vs. Z30, PS-MAGDS has the second best performance, which is very close to the best one.

TABLE IV
PERFORMANCE COMPARISON OF OUR MODEL WITH BASELINE METHODS IN TWO LARGE SCALE SCENARIOS
M: Marine, Z: Zergling.

We also test our well-trained models in curricular scenarios and unseen scenarios, and present the results in Table V. We can see that PS-MAGDS has outstanding performances in these curricular scenarios. In unseen scenarios with more units, PS-MAGDS also has acceptable results.

TABLE V
WIN RATES IN VARIOUS CURRICULAR SCENARIOS AND UNSEEN SCENARIOS
M: Marine, Z: Zergling.

C. Strategies Analysis

In StarCraft micromanagement, there are different types of units with different skills and properties. Players need to learn how to move and attack with a group of units in real time. If we design a rule-based AI to solve this problem, we have to consider a large number of conditions, and the agent's ability is also limited. Beginners of StarCraft could not win any of the combats presented in our paper, so these behaviors are highly complex and difficult to learn. With reinforcement learning and curriculum transfer learning, our units are able to master several useful strategies in these scenarios. In this section, we make a brief analysis of the strategies that our units have learned.

1) Disperse Enemies: In small scale micromanagement scenarios, our Goliath units have to fight against an opponent with a larger number of units and more total hitpoints. If our units stay together and fight against a group of units face-to-face, they will be destroyed quickly and lose the combat. The appropriate strategy is to disperse the enemies, and destroy them one by one.

In the first scenario, our Goliath units have learned to disperse the Zealots after training. In the opening, our units disperse the enemies into several parts and destroy one part first. After that, the winning Goliath moves to the other Goliaths and helps to fight against the enemies. Finally, our units focus fire on the remaining enemies and destroy them. For a better understanding, we choose some frames of a game replay in the combat and draw the units' move and attack directions in Fig. 12. The white lines stand for the move directions and the red lines stand for the attack directions.

A similar strategy occurs in the second scenario. The opponent has many more units, and the Zergling rush, which is frequently used in StarCraft games, has great damage power.

Our Goliath units disperse the Zerglings into several groups and keep a suitable distance from them. When the units' weapons are in a valid cooldown state, they stop moving and attack the enemies, as shown in Fig. 13.

Fig. 12. The sample game replay in the 3 Goliaths vs. 6 Zealots micromanagement scenario.

Fig. 13. The sample game replay in the 3 Goliaths vs. 20 Zerglings micromanagement scenario.

2) Keep the Team: In large scale micromanagement scenarios, each side has a mass of units. Marines are small-size ground units with low hitpoints. If they fight in several small groups, they are unable to resist the enemies. A suitable strategy is keeping our Marines in a team, moving in the same direction and attacking the same target, as demonstrated in Fig. 14. From these figures, we can see that our Marines have learned to move forward and retreat in a queue.

Fig. 14. The sample game replay in the 20 Marines vs. 30 Zerglings micromanagement scenario.

3) Hit and Run: Apart from the global strategies discussed above, our units have also learned some local strategies during training. Among them, Hit and Run is the most widely used tactic in StarCraft micromanagement. Our units rapidly learn the Hit and Run tactic in all scenarios, including the single unit's Hit and Run in Figs. 12 and 13, and a group of units' Hit and Run in Fig. 14.

4) Existing Problems: Although our units have learned useful strategies after training, there are still some problems in combats. For instance, Goliaths move forward and backward now and then and don't join in combats to help other units in time. In addition, units prefer moving to the boundary of the map, so as to avoid the enemies.

VII. CONCLUSION AND FUTURE WORK

This paper focuses on the control of multiple units in StarCraft micromanagement scenarios. We present several contributions, including an efficient state representation, the parameter sharing multi-agent gradient-descent Sarsa(λ), the effective reward function and the curriculum transfer learning method used to extend our model to various scenarios. We demonstrate the effectiveness of our approach in both small scale and large scale scenarios, and the superior performance over some baseline methods in two target scenarios. It is remarkable that our proposed method is able to learn appropriate strategies and defeat the built-in AI in various scenarios.

In addition, there are still some areas for future work. The cooperative behaviors of multiple units are learned by sharing the policy network, by constructing an efficient state representation method that includes other units' information, and by the proposed intrinsically motivated reward function. Although our units can successfully master some effective coordination strategies, we will explore more intelligent methods for multi-agent collaboration. To solve the delayed reward problem in StarCraft micromanagement, we use a simple, straightforward and efficient reward shaping method. Nevertheless, there are also some other methods for solving sparse and delayed rewards, such as hierarchical reinforcement learning. Hierarchical RL integrates hierarchical action-value functions, operating at different temporal scales [63]. Compared with the reward shaping method, hierarchical RL has the capacity to learn temporally-abstracted exploration, and gives agents more flexibility. But its framework is also much more complicated, and automatic subgoal extraction is still an open problem. In the future, we will make an in-depth study on applying hierarchical RL to StarCraft. At present, we can only train ranged ground units of the same type, while training melee ground units using RL methods is

still an open problem. We will improve our method for more types of units and more complex scenarios in the future. Finally, we will also consider using our micromanagement model in a StarCraft bot to play the full game.

ACKNOWLEDGMENT

We would like to thank Q. Zhang, Y. Chen, D. Li, Z. Tang, and N. Li for the helpful comments and discussions about this work and paper writing. We also thank the BWAPI and StarCraft group for their meaningful work.

REFERENCES

[1] D. Silver et al., "Mastering the game of Go with deep neural networks and tree search," Nature, vol. 529, no. 7587, pp. 484–489, 2016.
[2] D. Silver et al., "Mastering the game of Go without human knowledge," Nature, vol. 550, no. 7676, pp. 354–359, 2017.
[3] D. Zhao, Z. Zhang, and Y. Dai, "Self-teaching adaptive dynamic programming for Gomoku," Neurocomputing, vol. 78, no. 1, pp. 23–29, 2012.
[4] K. Shao, D. Zhao, Z. Tang, and Y. Zhu, "Move prediction in Gomoku using deep learning," in Proc. Youth Academic Annu. Conf. Chin. Assoc. Autom., 2016, pp. 292–297.
[5] V. Mnih et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529–533, 2015.
[6] D. Zhao, H. Wang, K. Shao, and Y. Zhu, "Deep reinforcement learning with experience replay based on SARSA," in Proc. IEEE Symp. Series Comput. Intell., 2017, pp. 1–6.
[7] M. Moravcik et al., "DeepStack: Expert-level artificial intelligence in heads-up no-limit poker," Science, vol. 356, no. 6337, pp. 508–513, 2017.
[8] S. Ontanon, G. Synnaeve, A. Uriarte, F. Richoux, D. Churchill, and M. Preuss, "A survey of real-time strategy game AI research and competition in StarCraft," IEEE Trans. Comput. Intell. AI Games, vol. 5, no. 4, pp. 293–311, Dec. 2013.
[9] R. Lara-Cabrera, C. Cotta, and A. J. Fernandez-Leiva, "A review of computational intelligence in RTS games," in Proc. IEEE Symp. Foundations Comput. Intell., 2013, pp. 114–121.
[10] G. Robertson and I. Watson, "A review of real-time strategy game AI," AI Mag., vol. 35, no. 4, pp. 75–104, 2014.
[11] K. Shao, Y. Zhu, and D. Zhao, "Cooperative reinforcement learning for multiple units combat in StarCraft," in Proc. IEEE Symp. Series Comput. Intell., 2017, pp. 1–6.
[12] J. Hagelback, "Hybrid pathfinding in StarCraft," IEEE Trans. Comput. Intell. AI Games, vol. 8, no. 4, pp. 319–324, Dec. 2016.
[13] A. Uriarte and S. Ontañón, "Kiting in RTS games using influence maps," in Proc. Artif. Intell. Interactive Digital Entertainment Conf., 2012, pp. 31–36.
[14] G. Synnaeve and P. Bessiere, "Multiscale Bayesian modeling for RTS games: An application to StarCraft AI," IEEE Trans. Comput. Intell. AI Games, vol. 8, no. 4, pp. 338–350, Dec. 2016.
[15] D. Churchill and B. Michael, "Incorporating search algorithms into RTS game agents," in Proc. Artif. Intell. Interactive Digital Entertainment Conf., 2012, pp. 2–7.
[16] I. Gabriel, V. Negru, and D. Zaharie, "Neuroevolution based multi-agent system for micromanagement in real-time strategy games," in Proc. Balkan Conf. Informat., 2012, pp. 32–39.
[17] A. Shantia, E. Begue, and M. Wiering, "Connectionist reinforcement learning for intelligent unit micro management in StarCraft," in Proc. Int. Joint Conf. Neural Netw., 2011, pp. 1794–1801.
[18] S. Wender and I. Watson, "Applying reinforcement learning to small scale combat in the real-time strategy game StarCraft: Broodwar," in Proc. IEEE Conf. Comput. Intell. Games, 2012, pp. 402–408.
[19] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436–444, 2015.
[20] N. Usunier, G. Synnaeve, Z. Lin, and S. Chintala, "Episodic exploration for deep deterministic policies: An application to StarCraft micromanagement tasks," in Proc. Int. Conf. Learn. Representations, 2017.
[21] P. Peng et al., "Multiagent bidirectionally-coordinated nets for learning to play StarCraft combat games," arXiv:1703.10069, 2017.
[22] J. Foerster, G. Farquhar, T. Afouras, N. Nardelli, and S. Whiteson, "Counterfactual multi-agent policy gradients," in Proc. 32nd AAAI Conf. Artif. Intell., 2018.
[23] R. Lowe, Y. Wu, and A. Tamar, "Multi-agent actor-critic for mixed cooperative-competitive environments," in Proc. Adv. Neural Inf. Process. Syst., 2017, pp. 6382–6393.
[24] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cambridge, MA, USA: MIT Press, 1998.
[25] L. P. Kaelbling, M. L. Littman, and A. W. Moore, "Reinforcement learning: A survey," J. Artif. Intell. Res., vol. 4, no. 1, pp. 237–285, 1996.
[26] G. Tesauro, "TD-Gammon, a self-teaching backgammon program, achieves master-level play," Neural Comput., vol. 6, no. 2, pp. 215–219, 1994.
[27] Q. Zhang, D. Zhao, and W. Ding, "Event-based robust control for uncertain nonlinear systems using adaptive dynamic programming," IEEE Trans. Neural Netw. Learn. Syst., vol. 29, no. 1, pp. 37–50, Jan. 2018.
[28] Q. Zhang, D. Zhao, and Y. Zhu, "Event-triggered H∞ control for continuous-time nonlinear system via concurrent learning," IEEE Trans. Syst., Man, Cybern., Syst., vol. 47, no. 7, pp. 1071–1081, Jul. 2017.
[29] Y. Zhu, D. Zhao, and X. Li, "Iterative adaptive dynamic programming for solving unknown nonlinear zero-sum game based on online data," IEEE Trans. Neural Netw. Learn. Syst., vol. 28, no. 3, pp. 714–725, Mar. 2017.
[30] D. Zhao et al., "Review of deep reinforcement learning and discussions on the development of computer Go," Control Theory Appl., vol. 33, no. 6, pp. 701–717, 2016.
[31] Z. Tang, K. Shao, D. Zhao, and Y. Zhu, "Recent progress of deep reinforcement learning: From AlphaGo to AlphaGo Zero," Control Theory Appl., vol. 34, no. 12, pp. 1529–1546, 2017.
[32] H. Van Hasselt, A. Guez, and D. Silver, "Deep reinforcement learning with double Q-learning," in Proc. 30th AAAI Conf. Artif. Intell., 2016, pp. 2094–2100.
[33] T. Schaul, J. Quan, I. Antonoglou, and D. Silver, "Prioritized experience replay," in Proc. Int. Conf. Learn. Representations, 2016.
[34] Z. Wang, T. Schaul, M. Hessel, H. Van Hasselt, M. Lanctot, and N. De Freitas, "Dueling network architectures for deep reinforcement learning," in Proc. Int. Conf. Mach. Learn., 2016, pp. 1995–2003.
[35] A. Nair et al., "Massively parallel methods for deep reinforcement learning," presented at the Deep Learning Workshop, Int. Conf. Mach. Learn., Lille, France, 2015.
[36] V. Mnih et al., "Asynchronous methods for deep reinforcement learning," in Proc. Int. Conf. Mach. Learn., 2016, pp. 1928–1937.
[37] D. Li, D. Zhao, Q. Zhang, and C. Luo, "Policy gradient methods with Gaussian process modelling acceleration," in Proc. Int. Joint Conf. Neural Netw., 2017, pp. 1774–1779.
[38] T. P. Lillicrap et al., "Continuous control with deep reinforcement learning," in Proc. Int. Conf. Learn. Representations, 2016.
[39] J. Schulman, S. Levine, P. Abbeel, M. I. Jordan, and P. Moritz, "Trust region policy optimization," in Proc. Int. Conf. Mach. Learn., 2015, pp. 1889–1897.
[40] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, "Proximal policy optimization algorithms," arXiv:1707.06347, 2017.
[41] S. Levine and V. Koltun, "Guided policy search," in Proc. Int. Conf. Mach. Learn., 2013, pp. 1–9.
[42] S. Gu, T. P. Lillicrap, I. Sutskever, and S. Levine, "Continuous deep Q-learning with model-based acceleration," in Proc. Int. Conf. Mach. Learn., 2016, pp. 2829–2838.
[43] M. Watter, J. T. Springenberg, J. Boedecker, and M. A. Riedmiller, "Embed to control: A locally linear latent dynamics model for control from raw images," in Proc. Adv. Neural Inf. Process. Syst., 2015, pp. 2746–2754.
[44] M. L. Littman, "Markov games as a framework for multi-agent reinforcement learning," in Mach. Learn. Proc., 1994, pp. 157–163.
[45] M. Tan, "Multi-agent reinforcement learning: Independent vs. cooperative agents," in Proc. 10th Int. Conf. Mach. Learn., 1993, pp. 330–337.
[46] Z. Zhang, D. Zhao, J. Gao, D. Wang, and Y. Dai, "FMRQ—A multiagent reinforcement learning algorithm for fully cooperative tasks," IEEE Trans. Cybern., vol. 47, no. 6, pp. 1367–1379, Jun. 2017.
[47] S. Sukhbaatar, A. Szlam, and R. Fergus, "Learning multiagent communication with backpropagation," in Proc. Adv. Neural Inf. Process. Syst., 2016, pp. 2244–2252.
[48] M. Lanctot, V. Zambaldi, and A. Gruslys, "A unified game-theoretic approach to multiagent reinforcement learning," arXiv:1711.00832, 2017.
[49] J. K. Gupta, M. Egorov, and M. Kochenderfer, "Cooperative multi-agent control using deep reinforcement learning," in Proc. Int. Conf. Auton. Agents Multiagent Syst., 2017, pp. 66–83.
[50] S. J. Pan and Q. Yang, "A survey on transfer learning," IEEE Trans. Knowl. Data Eng., vol. 22, no. 10, pp. 1345–1359, Oct. 2010.

[51] A. Gupta, Y.-S. Ong, and L. Feng, "Insights on transfer optimization: Because experience is the best teacher," IEEE Trans. Emerg. Topics Comput. Intell., vol. 2, no. 1, pp. 51–64, Feb. 2018.
[52] M. E. Taylor and P. Stone, "Transfer learning for reinforcement learning domains: A survey," J. Mach. Learn. Res., vol. 10, pp. 1633–1685, 2009.
[53] Y. Bengio, J. Louradour, R. Collobert, and J. Weston, "Curriculum learning," in Proc. 26th Annu. Int. Conf. Mach. Learn., 2009, pp. 41–48.
[54] A. Graves et al., "Hybrid computing using a neural network with dynamic external memory," Nature, vol. 538, no. 7626, pp. 471–476, 2016.
[55] Y. Wu and Y. Tian, "Training agent for first-person shooter game with actor-critic curriculum learning," in Proc. Int. Conf. Learn. Representations, 2017.
[56] Q. Dong, S. Gong, and X. Zhu, "Multi-task curriculum transfer deep learning of clothing attributes," in Proc. IEEE Winter Conf. Appl. Comput. Vision, 2017, pp. 520–529.
[57] J. X. Wang et al., "Learning to reinforcement learn," in Proc. Int. Conf. Learn. Representations, 2017.
[58] P. Mirowski et al., "Learning to navigate in complex environments," in Proc. Int. Conf. Learn. Representations, 2017.
[59] X. Glorot, A. Bordes, and Y. Bengio, "Deep sparse rectifier neural networks," in Proc. Int. Conf. Artif. Intell. Statist., 2011, pp. 315–323.
[60] V. Nair and G. E. Hinton, "Rectified linear units improve restricted Boltzmann machines," in Proc. Int. Conf. Mach. Learn., 2010, pp. 807–814.
[61] S. P. Singh and R. S. Sutton, "Reinforcement learning with replacing eligibility traces," Mach. Learn., vol. 22, no. 1, pp. 123–158, 1996.
[62] A. Y. Ng, D. Harada, and S. J. Russell, "Policy invariance under reward transformations: Theory and application to reward shaping," in Proc. Int. Conf. Mach. Learn., 1999, pp. 278–287.
[63] T. D. Kulkarni, K. R. Narasimhan, A. Saeedi, and J. B. Tenenbaum, "Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation," in Proc. Adv. Neural Inf. Process. Syst., 2016, pp. 3675–3683.

Kun Shao received the B.S. degree in automation from Beijing Jiaotong University, Beijing, China, in 2014. He is currently working toward the Ph.D. degree in control theory and control engineering with the State Key Laboratory of Management and Control for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China. His current research interests include reinforcement learning, deep learning, and game AI.

Yuanheng Zhu (M'15) received the B.S. degree from Nanjing University, Nanjing, China, in 2010, and the Ph.D. degree from the State Key Laboratory of Management and Control for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China, in 2015. He is currently an Associate Professor with the State Key Laboratory of Management and Control for Complex Systems, Institute of Automation, Chinese Academy of Sciences. His research interests include optimal control, adaptive dynamic programming, and reinforcement learning.

Dongbin Zhao (M'06–SM'10) received the B.S., M.S., and Ph.D. degrees from Harbin Institute of Technology, Harbin, China, in 1994, 1996, and 2000, respectively. He was a Postdoctoral Fellow with Tsinghua University, Beijing, China, from 2000 to 2002. He has been a Professor with the Institute of Automation, Chinese Academy of Sciences, since 2012, and also a Professor with the University of Chinese Academy of Sciences, Beijing, China. From 2007 to 2008, he was also a visiting scholar at the University of Arizona. He has published 4 books and more than 60 international journal papers. His current research interests include computational intelligence, adaptive dynamic programming, deep reinforcement learning, robotics, intelligent transportation systems, and smart grids. Dr. Zhao is the Associate Editor of the IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS (2012-), IEEE COMPUTATIONAL INTELLIGENCE MAGAZINE (2014-), etc. He is the Chair of the Beijing Chapter and was the Chair of the Adaptive Dynamic Programming and Reinforcement Learning Technical Committee (2015–2016) and the Multimedia Subcommittee (2015–2016) of the IEEE Computational Intelligence Society (CIS). He is a Guest Editor of renowned international journals and is involved in organizing several international conferences.
