StarCraft Micromanagement With Reinforcement Learning and Curriculum Transfer Learning
Abstract—Real-time strategy games have been an important field of game artificial intelligence in recent years. This paper presents a reinforcement learning and curriculum transfer learning method to control multiple units in StarCraft micromanagement. We define an efficient state representation, which breaks down the complexity caused by the large state space in the game environment. Then, a parameter sharing multi-agent gradient-descent Sarsa(λ) algorithm is proposed to train the units. The learning policy is shared among our units to encourage cooperative behaviors. We use a neural network as a function approximator to estimate the action-value function, and propose a reward function to help units balance their move and attack. In addition, a transfer learning method is used to extend our model to more difficult scenarios, which accelerates the training process and improves the learning performance. In small-scale scenarios, our units successfully learn to combat and defeat the built-in AI with 100% win rates. In large-scale scenarios, the curriculum transfer learning method is used to progressively train a group of units, and it shows superior performance over some baseline methods in target scenarios. With reinforcement learning and curriculum transfer learning, our units are able to learn appropriate strategies in StarCraft micromanagement scenarios.

Index Terms—Reinforcement learning, transfer learning, curriculum learning, neural network, game AI.

Manuscript received August 23, 2017; revised November 17, 2017 and February 8, 2018; accepted March 24, 2018. Date of publication April 27, 2018; date of current version January 21, 2019. This work was supported by the National Natural Science Foundation of China (NSFC) under Grants 61573353, 61603382, and 61533017. (Corresponding author: Dongbin Zhao.)

The authors are with the State Key Laboratory of Management and Control for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China, and also with the University of Chinese Academy of Sciences, Beijing 101408, China (e-mail: [email protected]; [email protected]; [email protected]).

Digital Object Identifier 10.1109/TETCI.2018.2823329

I. INTRODUCTION

Artificial intelligence (AI) has made great advances in the last decade. As an excellent testbed for AI research, games have been helping AI to grow since its birth, including the ancient board games [1]–[4], the classic Atari video games [5], [6], and imperfect information games [7]. These games have a fixed, limited set of actions, and researchers only need to control a single agent in the game environment. Besides, there are a large number of games that involve multiple agents and complex rules, which are much more difficult for AI research.

In this paper, we focus on a real-time strategy (RTS) game to explore the learning of multi-agent control. RTS games run in real time, which is different from taking turns to play in board games [8]. As one of the most popular RTS games, StarCraft has a huge player base and numerous professional competitions, requiring different strategies, tactics and reactive control techniques. For game AI research, StarCraft provides an ideal environment to study the control of multiple units with different difficulty levels [9]. In recent years, the study of StarCraft AI has made impressive progress, driven by StarCraft AI competitions and the Brood War Application Programming Interface (BWAPI)1 [10]. Recently, researchers have developed more efficient platforms to promote the development of this field, including TorchCraft, ELF and PySC2. StarCraft AI aims at solving a series of challenges, such as spatial and temporal reasoning, multi-agent collaboration, opponent modeling and adversarial planning [8]. At present, designing a game AI for the full StarCraft game based on machine learning methods is out of reach. Many researchers focus on micromanagement as the first step to study AI in StarCraft [11]. In combat scenarios, units have to navigate in a highly dynamic environment and attack enemies within fire range. There are many methods for StarCraft micromanagement, including potential fields for spatial navigation and obstacle avoidance [12], [13], Bayesian modeling to deal with incompleteness and uncertainty in the game [14], heuristic game-tree search to handle both build order planning and unit control [15], and neuroevolution to control individual units with hand-crafted features [16].

As an intelligent learning method, reinforcement learning (RL) is very suitable for sequential decision-making tasks. In StarCraft micromanagement, there are some interesting applications with RL methods. Shantia et al. use online Sarsa and neural-fitted Sarsa with a short term memory reward function to control units' attack and retreat [17]. They use vision grids to obtain the terrain information. This method needs a hand-crafted design, and the number of input nodes has to change with the number of units. Besides, they apply an incremental learning method to scale the task to a larger scenario with 6 units. However, the win rate with incremental learning is still below 50%. Wender et al. use different RL algorithms in micromanagement, including Q-learning and Sarsa [18]. They control one powerful unit to play against multiple weak units, without cooperation and teamwork between own units.

In the last few years, deep learning has achieved remarkable performance in many complex problems [19], and has dramatically improved the generalization and scalability of traditional RL algorithms [5]. Deep reinforcement learning (DRL)

1http://bwapi.github.io/
agent-environment interaction process in RL is formulated as a Markov decision process. The agent in state s takes an action a according to the policy π. This behavior causes a reward r, and transfers the agent to a new state s_{t+1}. We define the future discounted return at time t as Σ_{t'=t}^{T} γ^{t'−t} r_{t'}, where T is the terminal time step and γ ∈ [0, 1] is a discount factor that determines the importance of future rewards. The aim of an RL model is to learn an optimal policy π, which defines the probability of selecting action a in state s, so that the sum of the overall discounted rewards is maximized, as demonstrated by

max_π E[ Σ_{t'=t}^{T} γ^{t'−t} r_{t'} | s = s_t, a = a_t, π ].   (1)

As one of the most popular RL algorithms, temporal difference (TD) learning is a combination of the Monte Carlo method and dynamic programming. TD methods can learn from raw experience without a model of the environment, and update estimates based on part of the sequence, without waiting for a final outcome [25]. The most widely known TD learning algorithms are Q-learning and Sarsa. Q-learning estimates the value of making an action in a given state and iteratively updates the Q-value estimate towards the observed reward. The TD error δ_t in Q-learning is computed as

δ_t = r_{t+1} + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t).   (2)

control, including asynchronous advantage actor-critic (A3C) [36], trust region policy optimization (TRPO) [39], proximal policy optimization (PPO) [40], and so on. The sample complexity of traditional DRL methods tends to be high, which limits their application to real-world problems. In contrast, model-based DRL approaches learn the value function and policy in a data-efficient way, and have been widely used in sensorimotor control. Guided policy search (GPS) uses a supervised learning algorithm to train the policy and an RL algorithm to generate guiding distributions, which allows deep policies to be trained efficiently [41]. Researchers have also proposed some other model-based DRL methods, such as normalized advantage functions (NAF) [42] and embed to control (E2C) [43].

Multi-agent reinforcement learning is a closely related area to our work [44]. A multi-agent system includes a number of agents interacting in one environment [45], [46]. Recently, some multi-agent reinforcement learning algorithms with deep neural networks have been proposed to learn communication [47], cooperative-competitive behaviors [23] and imperfect information [48]. In our work, we use a multi-agent reinforcement learning method with policy sharing among agents to learn cooperative behaviors. Agents share the parameters of a centralized policy, and update the policy with their own experiences simultaneously. This method can train homogeneous agents more efficiently [49].
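As a concrete illustration of the discounted return in (1) and the one-step TD error in (2), the following short Python snippet works through toy numbers; the reward sequence, discount factor and Q-values are invented for the example and are not quantities from the paper.

gamma = 0.9                                    # discount factor in [0, 1]
rewards = [0.0, 0.0, 1.0]                      # toy rewards r_t, r_{t+1}, r_{t+2}

# Discounted return from time t, as in (1): sum of gamma^(t'-t) * r_{t'}
G = sum(gamma ** k * r for k, r in enumerate(rewards))
print(G)                                       # 0.81

# One-step Q-learning TD error, as in (2), for a transition (s, a1, r, s_next)
Q = {("s", "a1"): 0.2, ("s_next", "a1"): 0.5, ("s_next", "a2"): 0.3}
r = 0.0
delta = r + gamma * max(Q[("s_next", a)] for a in ("a1", "a2")) - Q[("s", "a1")]
print(delta)                                   # 0.25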
TABLE I
THE DATA TYPE AND DIMENSION OF INPUTS IN OUR MODEL
B. Action Definition

In StarCraft micromanagement scenarios, the original action space is very large. At each time step, each unit can move in an arbitrary direction with an arbitrary distance on the map. When a unit decides to attack, it can choose any enemy unit within its weapon's fire range. In order to simplify the action space, we choose 8 move directions with a fixed distance and attacking the weakest enemy as the available actions for each unit. When the chosen action is move, our units turn to one of the 8 directions, Up, Down, Left, Right, Upper-left, Upper-right, Lower-left, Lower-right, and move a fixed distance. When the chosen action is attack, our units stay at the current position and focus fire on enemy units. Currently, we select the enemy with the lowest hitpoints within our weapon's attack range as the target. According to the experimental results, these actions are enough to control our units in the game.
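To make this discrete action set concrete, the following Python sketch enumerates the 9 actions and applies the attack-weakest rule; the unit and enemy dictionaries, the MOVE_STEP value and the execute helper are hypothetical illustrations, not the paper's actual implementation.

import math

MOVE_STEP = 2.0  # fixed move distance (placeholder value)

# 8 move directions as (dx, dy) offsets, plus one attack action: 9 actions in total.
DIRECTIONS = {
    "up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0),
    "upper_left": (-1, -1), "upper_right": (1, -1),
    "lower_left": (-1, 1), "lower_right": (1, 1),
}
ACTIONS = list(DIRECTIONS) + ["attack"]

def execute(unit, action, enemies):
    """Apply one discrete action to a unit; positions are (x, y) tuples."""
    if action == "attack":
        # Stay in place and focus fire on the weakest enemy within fire range.
        in_range = [e for e in enemies
                    if math.dist(unit["pos"], e["pos"]) <= unit["fire_range"]]
        if in_range:
            unit["target"] = min(in_range, key=lambda e: e["hitpoints"])
    else:
        dx, dy = DIRECTIONS[action]
        unit["pos"] = (unit["pos"][0] + dx * MOVE_STEP,
                       unit["pos"][1] + dy * MOVE_STEP)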
C. Network Architecture

Because our units' experience covers only a limited subset of the large state space and most test states will never have been explored before, it is difficult to apply tabular reinforcement learning to learn an optimal policy. To solve this problem, we use a neural network parameterized by a vector θ to approximate the state-action values and improve our RL model's generalization.

The input of the network is the 93-dimensional tensor from the state representation. We use 100 neurons in the hidden layer, and use the rectified linear unit (ReLU) activation function for the network nonlinearity, as demonstrated by

f(z) = max(0, z),   (6)

where z is the output of the hidden layer. We use the ReLU function rather than the Sigmoid or tanh function, because the ReLU function does not suffer from the vanishing gradient problem, which guarantees effective training of the model [59]. Different from saturating nonlinearities such as Sigmoid or tanh, the ReLU function is a non-saturating nonlinearity. In terms of training time with gradient descent, the non-saturating nonlinearity is much faster [60].

The output layer of the neural network has 9 neurons, giving the probabilities of moving in the 8 directions and attacking. The learning model of one unit in StarCraft micromanagement scenarios, including the state representation, neural network architecture and output actions, is depicted in Fig. 3.
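A minimal sketch of this 93-100-9 feedforward approximator is given below in plain numpy; the weight initialization and the forward-pass helper are illustrative only, since the paper does not tie the model to a particular framework.

import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(0.0, 0.1, (93, 100))   # input (93-dim state) -> hidden (100 units)
b1 = np.zeros(100)
W2 = rng.normal(0.0, 0.1, (100, 9))    # hidden -> output (8 move directions + attack)
b2 = np.zeros(9)

def network_output(state):
    """Forward pass with the ReLU nonlinearity of (6) in the hidden layer."""
    hidden = np.maximum(0.0, state @ W1 + b1)   # f(z) = max(0, z)
    return hidden @ W2 + b2                      # one output per discrete action

# Example: evaluate a random 93-dimensional state vector.
print(network_output(rng.normal(size=93)).shape)   # (9,)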
IV. LEARNING METHOD FOR MICROMANAGEMENT

In this paper, we formulate StarCraft micromanagement as a multi-agent reinforcement learning model. We propose a parameter sharing multi-agent gradient-descent Sarsa(λ) (PS-MAGDS) method to train the model, and design a reward function as an intrinsic motivation to promote the learning process. The whole PS-MAGDS reinforcement learning diagram is depicted in Fig. 4.

Fig. 4. The PS-MAGDS reinforcement learning diagram in the StarCraft micromanagement scenarios.

A. Parameter Sharing Multi-Agent Gradient-Descent Sarsa(λ)

We propose a multi-agent RL algorithm that extends the traditional Sarsa(λ) to multiple units by sharing the parameters of the policy network among our units. To accelerate the learning process and tackle the problem of delayed rewards, we use eligibility traces in reinforcement learning. As a basic mechanism in RL, eligibility traces are used to assign temporal credit, considering a set of previously experienced transitions [61]. This means the update considers not only the value of the last state-action pair but also those of previously visited ones. With this method, we can solve the problem of delayed reward in the game environment. Sarsa with eligibility traces, termed Sarsa(λ), is one way of averaging backups made after multiple steps, where λ is a factor that determines the weight of each backup. In our implementation of Sarsa(λ) for multiple units combat, we use a neural network as the function approximator and share the network parameters among all our units. Although we have only one network to train, the units can still behave differently, because each one receives different observations and actions as its input.

To update the policy network efficiently, we use the gradient-descent method to train the Sarsa(λ) reinforcement learning model. The gradient-descent learning update is demonstrated in (7),

δ_t = r_{t+1} + γ Q(s_{t+1}, a_{t+1}; θ_t) − Q(s_t, a_t; θ_t)   (7a)
θ_{t+1} = θ_t + α δ_t e_t   (7b)
e_t = γλ e_{t−1} + ∇_{θ_t} Q(s_t, a_t; θ_t),   e_0 = 0   (7c)

where e_t is the eligibility trace at time step t.

One of the challenging issues in reinforcement learning is the trade-off between exploration and exploitation. If we choose the best action at every step according to the current policy, we are likely to be trapped in a local optimum. On the contrary, if we tend to explore in the large state space, the model will have difficulty converging. In the experiment, we use the ε-greedy method to choose actions during training, which selects the current best action with probability 1 − ε, and takes a random exploratory action with probability ε.
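The update in (7a)-(7c) can be sketched in a few lines of Python. For clarity, the sketch below uses a linear approximator Q(s, a; θ) = θ[a]·s, so the gradient in (7c) reduces to the state vector; the paper instead uses the neural network described above, and all constants here are placeholders. Each unit feeds its own transitions into the same shared parameters, which is the parameter-sharing idea of PS-MAGDS.

import numpy as np

N_ACTIONS, STATE_DIM = 9, 93
ALPHA, GAMMA, LAM, EPSILON = 0.001, 0.95, 0.8, 0.1   # placeholder hyperparameters

theta = np.zeros((N_ACTIONS, STATE_DIM))   # policy parameters shared by all units

def q(s, a):
    return theta[a] @ s

def epsilon_greedy(s):
    """Select the current best action with probability 1 - EPSILON, else explore."""
    if np.random.rand() < EPSILON:
        return np.random.randint(N_ACTIONS)
    return int(np.argmax([q(s, a) for a in range(N_ACTIONS)]))

def sarsa_lambda_step(trace, s, a, r, s_next, a_next, done):
    """One gradient-descent Sarsa(lambda) update with a per-unit eligibility trace."""
    global theta
    target = r if done else r + GAMMA * q(s_next, a_next)
    delta = target - q(s, a)                 # TD error, (7a)
    trace = GAMMA * LAM * trace              # decay all traces, (7c)
    trace[a] += s                            # add the gradient of Q(s, a; theta)
    theta = theta + ALPHA * delta * trace    # shared parameter update, (7b)
    return trace

# Each unit calls sarsa_lambda_step with its own experience, e.g.:
# trace = np.zeros_like(theta)              # e_0 = 0 at the start of an episode
# a = epsilon_greedy(s); ...; trace = sarsa_lambda_step(trace, s, a, r, s2, a2, False)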
Fig. 5. Representation of the StarCraft micromanagement scenarios in our experiments, left: Goliaths vs. Zealots; middle: Goliaths vs. Zerglings; right: Marines
vs. Zerglings.
Fig. 6. The win rates of our units in 3 Goliaths vs. 6 Zealots micromanagement scenario from every 200 episodes' training.

Fig. 7. The episode steps of our units in 3 Goliaths vs. 6 Zealots micromanagement scenario during training.

Fig. 9. The win rates of our units in 3 Goliaths vs. 20 Zerglings micromanagement scenario from every 200 episodes' training.

Fig. 10. The average episode steps of our units in 3 Goliaths vs. 20 Zerglings micromanagement scenario during training.
Fig. 11. The average reward of our units in 3 Goliaths vs. 20 Zerglings micromanagement scenario during training.

TABLE III
CURRICULUM DESIGN FOR MARINES VS. ZERGLINGS MICROMANAGEMENT
M: Marine, Z: Zergling.

TABLE IV
PERFORMANCE COMPARISON OF OUR MODEL WITH BASELINE METHODS IN TWO LARGE SCALE SCENARIOS
M: Marine, Z: Zergling.

comparison, are much higher from the beginning and behave better in the whole training process.

B. Large Scale Micromanagement

In large scale micromanagement scenarios, we use curriculum transfer learning to train our Marines to play against Zerglings, and compare the results with some baseline methods.

1) Marines vs. Zerglings: In this section, we design a curriculum with 3 classes to train the units, as shown in Table III. After training, we test the performance in two target scenarios: M10 vs. Z13 and M20 vs. Z30. In addition, we use some baseline methods as a comparison, which consist of rule-based approaches and DRL approaches; a schematic sketch of this curriculum transfer procedure is given after the following list.

1) Weakest: A rule-based method, attacking the weakest enemy in the fire range.
2) Closest: A rule-based method, attacking the closest enemy in the fire range.
3) GMEZO: A DRL method, based on zero-order optimization, having impressive results over traditional RL methods [20].
4) BiCNet: A DRL method, based on the actor-critic architecture, having the best performance in most StarCraft micromanagement scenarios [21].
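The curriculum transfer procedure can be summarized in a short, purely illustrative Python loop: parameters learned on one curricular class initialize training on the next, and the final model is then evaluated on the target scenarios. The train and evaluate stubs, episode counts and class names below are hypothetical placeholders, not the paper's implementation.

import numpy as np

def train(theta, scenario, episodes):
    # Placeholder for PS-MAGDS training on one curricular class.
    print(f"training on {scenario} for {episodes} episodes")
    return theta                                  # would return the updated parameters

def evaluate(theta, scenario, test_games):
    # Placeholder for measuring the average win rate over the test games.
    print(f"evaluating on {scenario} over {test_games} test games")

theta = np.zeros((9, 93))                         # shared policy parameters (Section IV-A)

for scenario in ["class 1", "class 2", "class 3"]:   # the 3 curricular classes of Table III
    theta = train(theta, scenario, episodes=1000)    # transfer: reuse the previous parameters

for target in ["M10 vs. Z13", "M20 vs. Z30"]:        # target scenarios
    evaluate(theta, target, test_games=100)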
In Table IV, we present the win rates of the PS-MAGDS method and the baseline methods. In each scenario, we measure our model's average win rates in 100 test games for 5 times. In M10 vs. Z13, PS-MAGDS achieves a win rate of 97%, which is much higher than the other methods, including the recently proposed GMEZO and BiCNet. In M20 vs. Z30, PS-MAGDS has the second best performance, which is very close to the best one.

TABLE V
WIN RATES IN VARIOUS CURRICULAR SCENARIOS AND UNSEEN SCENARIOS
M: Marine, Z: Zergling.

We also test our well-trained models in curricular scenarios and unseen scenarios, and present the results in Table V. We can see that PS-MAGDS has outstanding performance in these curricular scenarios. In unseen scenarios with more units, PS-MAGDS also has acceptable results.

C. Strategies Analysis

In StarCraft micromanagement, there are different types of units with different skills and properties. Players need to learn how to move and attack with a group of units in real time. If we design a rule-based AI to solve this problem, we have to consider a large number of conditions, and the agent's ability is also limited. Beginners of StarCraft could not win any of the combats presented in our paper, so these behaviors are highly complex and difficult to learn. With reinforcement learning and curriculum transfer learning, our units are able to master several useful strategies in these scenarios. In this section, we make a brief analysis of the strategies that our units have learned.

1) Disperse Enemies: In small scale micromanagement scenarios, our Goliath units have to fight against an opponent with a larger number of units and more total hitpoints. If our units stay together and fight against a group of units face-to-face, they will be destroyed quickly and lose the combat. The appropriate strategy is to disperse the enemies and destroy them one by one.

In the first scenario, our Goliath units have learned to disperse the Zealots after training. In the opening, our units split the enemies into several parts and destroy one part first. After that, the winning Goliath moves to the other Goliaths and helps them fight against the enemies. Finally, our units focus fire on the remaining enemies and destroy them. For a better understanding, we choose some frames of the game replay in the combat and draw the units' move and attack directions in Fig. 12. The white lines stand for the move directions and the red lines stand for the attack directions.

A similar strategy occurs in the second scenario. The opponent has many more units, and the Zergling rush has great damage
Fig. 12. The sample game replay in 3 Goliaths vs. 6 Zealots micromanagement scenario.

Fig. 14. The sample game replay in 20 Marines vs. 30 Zerglings micromanagement scenario.
still an open problem. We will improve our method for more types of units and more complex scenarios in the future. Finally, we will also consider using our micromanagement model in a StarCraft bot to play the full game.

ACKNOWLEDGMENT

We would like to thank Q. Zhang, Y. Chen, D. Li, Z. Tang, and N. Li for the helpful comments and discussions about this work and paper writing. We also thank the BWAPI and StarCraft group for their meaningful work.

REFERENCES

[1] D. Silver et al., "Mastering the game of Go with deep neural networks and tree search," Nature, vol. 529, no. 7587, pp. 484–489, 2016.
[2] D. Silver et al., "Mastering the game of Go without human knowledge," Nature, vol. 550, no. 7676, pp. 354–359, 2017.
[3] D. Zhao, Z. Zhang, and Y. Dai, "Self-teaching adaptive dynamic programming for Gomoku," Neurocomputing, vol. 78, no. 1, pp. 23–29, 2012.
[4] K. Shao, D. Zhao, Z. Tang, and Y. Zhu, "Move prediction in Gomoku using deep learning," in Proc. Youth Academic Annu. Conf. Chin. Assoc. Autom., 2016, pp. 292–297.
[5] V. Mnih et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529–533, 2015.
[6] D. Zhao, H. Wang, K. Shao, and Y. Zhu, "Deep reinforcement learning with experience replay based on SARSA," in Proc. IEEE Symp. Series Comput. Intell., 2017, pp. 1–6.
[7] M. Moravčík et al., "DeepStack: Expert-level artificial intelligence in heads-up no-limit poker," Science, vol. 356, no. 6337, pp. 508–513, 2017.
[8] S. Ontanon, G. Synnaeve, A. Uriarte, F. Richoux, D. Churchill, and M. Preuss, "A survey of real-time strategy game AI research and competition in StarCraft," IEEE Trans. Comput. Intell. AI Games, vol. 5, no. 4, pp. 293–311, Dec. 2013.
[9] R. Lara-Cabrera, C. Cotta, and A. J. Fernández-Leiva, "A review of computational intelligence in RTS games," in Proc. IEEE Symp. Foundations Comput. Intell., 2013, pp. 114–121.
[10] G. Robertson and I. Watson, "A review of real-time strategy game AI," AI Mag., vol. 35, no. 4, pp. 75–104, 2014.
[11] K. Shao, Y. Zhu, and D. Zhao, "Cooperative reinforcement learning for multiple units combat in StarCraft," in Proc. IEEE Symp. Series Comput. Intell., 2017, pp. 1–6.
[12] J. Hagelback, "Hybrid pathfinding in StarCraft," IEEE Trans. Comput. Intell. AI Games, vol. 8, no. 4, pp. 319–324, Dec. 2016.
[13] A. Uriarte and S. Ontañón, "Kiting in RTS games using influence maps," in Proc. Artif. Intell. Interactive Digital Entertainment Conf., 2012, pp. 31–36.
[14] G. Synnaeve and P. Bessière, "Multiscale Bayesian modeling for RTS games: An application to StarCraft AI," IEEE Trans. Comput. Intell. AI Games, vol. 8, no. 4, pp. 338–350, Dec. 2016.
[15] D. Churchill and M. Buro, "Incorporating search algorithms into RTS game agents," in Proc. Artif. Intell. Interactive Digital Entertainment Conf., 2012, pp. 2–7.
[16] I. Gabriel, V. Negru, and D. Zaharie, "Neuroevolution based multi-agent system for micromanagement in real-time strategy games," in Proc. Balkan Conf. Informat., 2012, pp. 32–39.
[17] A. Shantia, E. Begue, and M. Wiering, "Connectionist reinforcement learning for intelligent unit micro management in StarCraft," in Proc. Int. Joint Conf. Neural Netw., 2011, pp. 1794–1801.
[18] S. Wender and I. Watson, "Applying reinforcement learning to small scale combat in the real-time strategy game StarCraft: Broodwar," in Proc. IEEE Conf. Comput. Intell. Games, 2012, pp. 402–408.
[19] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436–444, 2015.
[20] N. Usunier, G. Synnaeve, Z. Lin, and S. Chintala, "Episodic exploration for deep deterministic policies: An application to StarCraft micromanagement tasks," in Proc. Int. Conf. Learn. Representations, 2017.
[21] P. Peng et al., "Multiagent bidirectionally-coordinated nets for learning to play StarCraft combat games," arXiv:1703.10069, 2017.
[22] J. Foerster, G. Farquhar, T. Afouras, N. Nardelli, and S. Whiteson, "Counterfactual multi-agent policy gradients," in Proc. 32nd AAAI Conf. Artif. Intell., 2018.
[23] R. Lowe, Y. Wu, and A. Tamar, "Multi-agent actor-critic for mixed cooperative-competitive environments," in Proc. Adv. Neural Inf. Process. Syst., 2017, pp. 6382–6393.
[24] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cambridge, MA, USA: MIT Press, 1998.
[25] L. P. Kaelbling, M. L. Littman, and A. W. Moore, "Reinforcement learning: A survey," J. Artif. Intell. Res., vol. 4, no. 1, pp. 237–285, 1996.
[26] G. Tesauro, "TD-Gammon, a self-teaching backgammon program, achieves master-level play," Neural Comput., vol. 6, no. 2, pp. 215–219, 1994.
[27] Q. Zhang, D. Zhao, and W. Ding, "Event-based robust control for uncertain nonlinear systems using adaptive dynamic programming," IEEE Trans. Neural Netw. Learn. Syst., vol. 29, no. 1, pp. 37–50, Jan. 2018.
[28] Q. Zhang, D. Zhao, and Y. Zhu, "Event-triggered H∞ control for continuous-time nonlinear system via concurrent learning," IEEE Trans. Syst., Man, Cybern., Syst., vol. 47, no. 7, pp. 1071–1081, Jul. 2017.
[29] Y. Zhu, D. Zhao, and X. Li, "Iterative adaptive dynamic programming for solving unknown nonlinear zero-sum game based on online data," IEEE Trans. Neural Netw. Learn. Syst., vol. 28, no. 3, pp. 714–725, Mar. 2017.
[30] D. Zhao et al., "Review of deep reinforcement learning and discussions on the development of computer Go," Control Theory Appl., vol. 33, no. 6, pp. 701–717, 2016.
[31] Z. Tang, K. Shao, D. Zhao, and Y. Zhu, "Recent progress of deep reinforcement learning: From AlphaGo to AlphaGo Zero," Control Theory Appl., vol. 34, no. 12, pp. 1529–1546, 2017.
[32] H. Van Hasselt, A. Guez, and D. Silver, "Deep reinforcement learning with double Q-learning," in Proc. 30th AAAI Conf. Artif. Intell., 2016, pp. 2094–2100.
[33] T. Schaul, J. Quan, I. Antonoglou, and D. Silver, "Prioritized experience replay," in Proc. Int. Conf. Learn. Representations, 2016.
[34] Z. Wang, T. Schaul, M. Hessel, H. Van Hasselt, M. Lanctot, and N. De Freitas, "Dueling network architectures for deep reinforcement learning," in Proc. Int. Conf. Mach. Learn., 2016, pp. 1995–2003.
[35] A. Nair et al., "Massively parallel methods for deep reinforcement learning," presented at the Deep Learning Workshop, Int. Conf. Mach. Learn., Lille, France, 2015.
[36] V. Mnih et al., "Asynchronous methods for deep reinforcement learning," in Proc. Int. Conf. Mach. Learn., 2016, pp. 1928–1937.
[37] D. Li, D. Zhao, Q. Zhang, and C. Luo, "Policy gradient methods with Gaussian process modelling acceleration," in Proc. Int. Joint Conf. Neural Netw., 2017, pp. 1774–1779.
[38] T. P. Lillicrap et al., "Continuous control with deep reinforcement learning," in Proc. Int. Conf. Learn. Representations, 2016.
[39] J. Schulman, S. Levine, P. Abbeel, M. I. Jordan, and P. Moritz, "Trust region policy optimization," in Proc. Int. Conf. Mach. Learn., 2015, pp. 1889–1897.
[40] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, "Proximal policy optimization algorithms," arXiv:1707.06347, 2017.
[41] S. Levine and V. Koltun, "Guided policy search," in Proc. Int. Conf. Mach. Learn., 2013, pp. 1–9.
[42] S. Gu, T. P. Lillicrap, I. Sutskever, and S. Levine, "Continuous deep Q-learning with model-based acceleration," in Proc. Int. Conf. Mach. Learn., 2016, pp. 2829–2838.
[43] M. Watter, J. T. Springenberg, J. Boedecker, and M. A. Riedmiller, "Embed to control: A locally linear latent dynamics model for control from raw images," in Proc. Adv. Neural Inf. Process. Syst., 2015, pp. 2746–2754.
[44] M. L. Littman, "Markov games as a framework for multi-agent reinforcement learning," in Mach. Learn. Proc., 1994, pp. 157–163.
[45] M. Tan, "Multi-agent reinforcement learning: Independent vs. cooperative agents," in Proc. 10th Int. Conf. Mach. Learn., 1993, pp. 330–337.
[46] Z. Zhang, D. Zhao, J. Gao, D. Wang, and Y. Dai, "FMRQ—A multiagent reinforcement learning algorithm for fully cooperative tasks," IEEE Trans. Cybern., vol. 47, no. 6, pp. 1367–1379, Jun. 2017.
[47] S. Sukhbaatar, A. Szlam, and R. Fergus, "Learning multiagent communication with backpropagation," in Proc. Adv. Neural Inf. Process. Syst., 2016, pp. 2244–2252.
[48] M. Lanctot, V. Zambaldi, and A. Gruslys, "A unified game-theoretic approach to multiagent reinforcement learning," arXiv:1711.00832, 2017.
[49] J. K. Gupta, M. Egorov, and M. Kochenderfer, "Cooperative multi-agent control using deep reinforcement learning," in Proc. Int. Conf. Auton. Agents Multiagent Syst., 2017, pp. 66–83.
[50] S. J. Pan and Q. Yang, "A survey on transfer learning," IEEE Trans. Knowl. Data Eng., vol. 22, no. 10, pp. 1345–1359, Oct. 2010.
[51] A. Gupta, Y.-S. Ong, and L. Feng, "Insights on transfer optimization: Because experience is the best teacher," IEEE Trans. Emerg. Topics Comput. Intell., vol. 2, no. 1, pp. 51–64, Feb. 2018.
[52] M. E. Taylor and P. Stone, "Transfer learning for reinforcement learning domains: A survey," J. Mach. Learn. Res., vol. 10, pp. 1633–1685, 2009.
[53] Y. Bengio, J. Louradour, R. Collobert, and J. Weston, "Curriculum learning," in Proc. 26th Annu. Int. Conf. Mach. Learn., 2009, pp. 41–48.
[54] A. Graves et al., "Hybrid computing using a neural network with dynamic external memory," Nature, vol. 538, no. 7626, pp. 471–476, 2016.
[55] Y. Wu and Y. Tian, "Training agent for first-person shooter game with actor-critic curriculum learning," in Proc. Int. Conf. Learn. Representations, 2017.
[56] Q. Dong, S. Gong, and X. Zhu, "Multi-task curriculum transfer deep learning of clothing attributes," in Proc. IEEE Winter Conf. Appl. Comput. Vision, 2017, pp. 520–529.
[57] J. X. Wang et al., "Learning to reinforcement learn," in Proc. Int. Conf. Learn. Representations, 2017.
[58] P. Mirowski et al., "Learning to navigate in complex environments," in Proc. Int. Conf. Learn. Representations, 2017.
[59] X. Glorot, A. Bordes, and Y. Bengio, "Deep sparse rectifier neural networks," in Proc. Int. Conf. Artif. Intell. Statist., 2011, pp. 315–323.
[60] V. Nair and G. E. Hinton, "Rectified linear units improve restricted Boltzmann machines," in Proc. Int. Conf. Mach. Learn., 2010, pp. 807–814.
[61] S. P. Singh and R. S. Sutton, "Reinforcement learning with replacing eligibility traces," Mach. Learn., vol. 22, no. 1, pp. 123–158, 1996.
[62] A. Y. Ng, D. Harada, and S. J. Russell, "Policy invariance under reward transformations: Theory and application to reward shaping," in Proc. Int. Conf. Mach. Learn., 1999, pp. 278–287.
[63] T. D. Kulkarni, K. R. Narasimhan, A. Saeedi, and J. B. Tenenbaum, "Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation," in Proc. Adv. Neural Inf. Process. Syst., 2016, pp. 3675–3683.

Kun Shao received the B.S. degree in automation from Beijing Jiaotong University, Beijing, China, in 2014. He is currently working toward the Ph.D. degree in control theory and control engineering with the State Key Laboratory of Management and Control for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China. His current research interests include reinforcement learning, deep learning, and game AI.

Yuanheng Zhu (M'15) received the B.S. degree from Nanjing University, Nanjing, China, in 2010, and the Ph.D. degree from the State Key Laboratory of Management and Control for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China, in 2015. He is currently an Associate Professor with the State Key Laboratory of Management and Control for Complex Systems, Institute of Automation, Chinese Academy of Sciences. His research interests include optimal control, adaptive dynamic programming, and reinforcement learning.

Dongbin Zhao (M'06–SM'10) received the B.S., M.S., and Ph.D. degrees from Harbin Institute of Technology, Harbin, China, in 1994, 1996, and 2000, respectively. He was a Postdoctoral Fellow with Tsinghua University, Beijing, China, from 2000 to 2002. He has been a Professor with the Institute of Automation, Chinese Academy of Sciences, since 2012, and is also a Professor with the University of Chinese Academy of Sciences, Beijing, China. From 2007 to 2008, he was a visiting scholar at the University of Arizona. He has published 4 books and more than 60 international journal papers. His current research interests include computational intelligence, adaptive dynamic programming, deep reinforcement learning, robotics, intelligent transportation systems, and smart grids. Dr. Zhao is an Associate Editor of the IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS (2012-), IEEE COMPUTATIONAL INTELLIGENCE MAGAZINE (2014-), etc. He is the Chair of the Beijing Chapter and was the Chair of the Adaptive Dynamic Programming and Reinforcement Learning Technical Committee (2015–2016) and the Multimedia Subcommittee (2015–2016) of the IEEE Computational Intelligence Society (CIS). He is a Guest Editor of renowned international journals and is involved in organizing several international conferences.