
Hierarchical Reinforcement Learning in StarCraft II

with Human Expertise in Subgoals Selection


Xinyi Xu*¹, Tiancheng Huang², Pengfei Wei¹, Akshay Narayan¹, Tze-Yun Leong¹
¹NUS, School of Computing, Medical Computing Lab
{xuxinyi,weipf,anarayan,leongty}@comp.nus.edu.sg
²NTU, School of Computer Science and Engineering
[email protected]

arXiv:2008.03444v3 [cs.AI] 29 Sep 2020

Abstract

This work is inspired by recent advances in hierarchical reinforcement learning (HRL) (Barto and Mahadevan 2003; Hengst 2010), and improvements in learning efficiency from heuristic-based subgoal selection, experience replay (Lin 1993; Andrychowicz et al. 2017), and task-based curriculum learning (Bengio et al. 2009; Zaremba and Sutskever 2014). We propose a new method to integrate HRL, experience replay and effective subgoal selection through an implicit curriculum design based on human expertise, to support sample-efficient learning and enhance interpretability of the agent's behavior. Human expertise remains indispensable in many areas such as medicine (Buch, Ahmed, and Maruthappu 2018) and law (Cath 2018), where interpretability, explainability and transparency are crucial in the decision-making process, for ethical and legal reasons. Our method simplifies the complex task sets for achieving the overall objectives by decomposing them into subgoals at different levels of abstraction. Incorporating relevant subjective knowledge also significantly reduces the computational resources spent in exploration for RL, especially in high-speed, changing, and complex environments where the transition dynamics cannot be effectively learned and modelled in a short time. Experimental results in two StarCraft II (SC2) (Vinyals et al. 2017) minigames demonstrate that our method can achieve better sample efficiency than flat and end-to-end RL methods, and provides an effective method for explaining the agent's performance.

Introduction

Reinforcement learning (RL) (Sutton and Barto 2018) enables agents to learn how to take actions, by interacting with an environment, to maximize a series of rewards received over time. In combination with advances in deep learning and computational resources, the Deep Reinforcement Learning (DRL) (Mnih et al. 2013) formulation has led to dramatic results in acting from perception (Mnih et al. 2015), game playing (Silver et al. 2016), and robotics (Andrychowicz et al. 2020). However, DRL usually requires extensive computation to achieve satisfactory performance. For example, in full-length StarCraft II (SC2) games, AlphaStar (Vinyals et al. 2019) achieves superhuman performance at the expense of huge computational resources.¹ Training flat DRL agents even on minigames (simplistic versions of the full-length SC2 games) requires 600 million samples (Vinyals et al. 2017) or 10 billion samples (Zambaldi et al. 2019) for each minigame, repeated with 100 different sets of hyper-parameters, which is approximately equivalent to over 630 and 10,500 years of game playing time, respectively. Even with such a large number of training samples, DRL agents are not yet able to beat human experts at some minigames (Vinyals et al. 2017; Zambaldi et al. 2019).

We argue that learning a new task in general, or SC2 minigames in particular, is a two-stage process, viz., learning the fundamentals and mastering the skills. For SC2 minigames, novice human players learn the minigame fundamentals reasonably quickly by decomposing the game into smaller, distinct and necessary steps. However, to achieve mastery over the minigame, humans take a long time, mainly to practice the precision of skills. RL agents, on the other hand, may take a long time to learn the fundamentals of the gameplay but achieve mastery (stage two) efficiently. This can be observed from the training progress curves in (Vinyals et al. 2017), which show spikes followed by plateaus of reward signals instead of steady and gradual increases.

We want to leverage human expertise to reduce the 'warm-up' time required by the RL agents. The Hierarchical Reinforcement Learning (HRL) framework (Bakker and Schmidhuber 2004; Levy et al. 2019) comprises a general layered architecture that supports different levels of abstraction, corresponding to human expertise and the agent's skills at low-level manoeuvres. Intuitively, HRL provides a way to combine the best of human expertise and the agent, by organizing the inputs from humans at a high level (more abstract) and those from the agent at a lower level (more precise). In this work, we extend the HRL framework to incorporate human expertise in subgoal selection. We demonstrate the effects of our method in mastering SC2 minigames, and present preliminary results on sample efficiency and interpretability over the flat RL methods.

*Corresponding author.
Copyright © 2020, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
¹According to (Vinyals et al. 2019), for each of their 12 agents, they conduct training on 32 TPUs for 44 days.
The rest of the paper is organized as follows. We briefly outline the background information in the next section. Next, we describe our proposed methodology. Further, we discuss the related works and present our experimental results. We then conclude the paper, highlighting opportunities for future work.

Preliminaries

Markov decision process and Reinforcement learning: A Markov decision process (MDP) is a five-tuple ⟨S, A, T, R, γ⟩, where S is the set of states the agent can be in; A is the set of possible actions available for the agent; R : S × A ↦ ℝ is the reward function; T : S × A ↦ ∆S is the transition function; and γ ∈ [0, 1] is the discount factor that denotes the usefulness of the future rewards. We consider the standard formalism of reinforcement learning where an agent continuously interacts with a fully observable environment, defined using an MDP. A deterministic policy is a mapping π : S ↦ A and we can describe a sequence of actions and reward signals from the environment. Every episode begins with an initial state s_0. At each time step t, the agent takes an action a_t = π_t(s_t), and gets a reward r_t = R(s_t, a_t). At the same time, s_{t+1} is sampled from T(s_t, a_t). Over time, the discounted cumulative reward, called return, is calculated as R_t = Σ_{i=t}^{∞} γ^{i−t} r_i. The agent's task is to maximize the expected return E_{s_0}[R_0 | s_0]. Furthermore, the Q-function (or action-value function) is defined as Q^π(s_t, a_t) = E[R_t | s_t, a_t]. An optimal policy π* satisfies Q^{π*}(s, a) ≥ Q^π(s, a) for all s ∈ S, a ∈ A, and for any possible π. All optimal policies have the same Q-function, called the optimal Q-function and denoted Q*, satisfying the Bellman equation:

Q*(s, a) = E_{s′∼T(s,a)}[R(s, a) + γ max_{a′∈A} Q*(s′, a′)].
Q-function Approximators The above definitions enable one possible solution to MDPs: using a function approximator for Q*. Deep Q-Networks (DQN) (Mnih et al. 2013) and Deep Deterministic Policy Gradients (DDPG) (Lillicrap et al. 2016) are such approaches tackling model-free RL problems. Typically, a neural network Q is trained to approximate Q*. During training, experiences are generated via an exploration policy, usually an ε-greedy policy with the current Q. The experience tuples (s_t, a_t, r_t, s_{t+1}) are stored in a replay buffer. Q is trained using gradient descent with respect to the loss L := E[(Q(s_t, a_t) − y_t)²], where y_t = r_t + γ max_{a′∈A} Q(s_{t+1}, a′), with experiences sampled from the replay buffer.

An exploration policy is a policy that describes how the agent interacts with the environment. For instance, a policy that picks actions randomly encourages exploration. On the other hand, a greedy policy with respect to Q, as in π_Q(s) = argmax_{a∈A} Q(s, a), encourages exploitation. To balance these, a standard ε-greedy approach (Sutton and Barto 2018) is adopted: with probability ε take a random action, and with probability 1 − ε take a greedy action.
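As a concrete illustration of the two ideas above (a sketch of the general technique, not the implementation used in this paper), the following tabular example combines ε-greedy action selection with a replay buffer and the one-step TD target y_t = r_t + γ max_{a′} Q(s_{t+1}, a′); the constants and sizes are placeholder assumptions:

import random
from collections import deque

import numpy as np

GAMMA, EPSILON, LR = 0.99, 0.1, 0.1      # illustrative constants
N_STATES, N_ACTIONS = 16, 4              # illustrative, small discrete MDP

Q = np.zeros((N_STATES, N_ACTIONS))      # Q(s, a) estimates
buffer = deque(maxlen=10_000)            # replay buffer of (s, a, r, s') tuples

def epsilon_greedy(state: int) -> int:
    """With probability EPSILON explore randomly, otherwise exploit argmax_a Q(s, a)."""
    if random.random() < EPSILON:
        return random.randrange(N_ACTIONS)
    return int(np.argmax(Q[state]))

def replay_update(batch_size: int = 32) -> None:
    """Sample stored experiences and move Q(s, a) toward the TD target
    y = r + gamma * max_a' Q(s', a')."""
    batch = random.sample(buffer, min(batch_size, len(buffer)))
    for s, a, r, s_next in batch:
        target = r + GAMMA * np.max(Q[s_next])
        Q[s, a] += LR * (target - Q[s, a])

The deep variants replace the table Q with a neural network and the per-tuple update with a gradient step on the loss above.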
Goal Space G Schaul et al. (2015) extended DQN to include a goal space G. A (sub)goal can be described with specifically selected states, or via functions such as f : S ↦ [0, 1] indicating whether a state is a goal or not. Introducing G modifies the original reward function R slightly: ∀g ∈ G, R_g : S × A ↦ ℝ, R(s, a | g) := R_g(s, a). At the beginning of each episode, in addition to s_0, the initialization includes a fixed g to create a tuple (s_0, g). Other modifications naturally follow: π : S × G ↦ A, and Q^π(s_t, a_t, g) = E[R_t | s_t, a_t, g].
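To make the goal-conditioned notation concrete, here is a small sketch (our own illustration; the indicator-style reward and the helper names are assumptions, not details from Schaul et al. 2015):

from typing import Callable, Tuple

State = Tuple[int, ...]
Goal = Tuple[int, ...]

def make_goal_reward(goal_test: Callable[[State, Goal], bool]) -> Callable[[State, int, Goal], float]:
    """Build a goal-conditioned reward R_g(s, a): here simply an indicator of
    whether the state satisfies the fixed episode goal g."""
    def r_g(state: State, action: int, goal: Goal) -> float:
        return 1.0 if goal_test(state, goal) else 0.0
    return r_g

# A goal described by a specifically selected state: it is reached exactly
# when the current state equals that state.
reach_goal_reward = make_goal_reward(lambda s, g: s == g)

The policy and Q-function are conditioned in the same way, taking (s, g) instead of s alone.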
Experience Replay Lin (1993) proposed the idea of using 'experience buffers' to help machines learn. Formally, a single time-step experience is defined as a tuple (s_t, a_t, r_t, s_{t+1}), and more generally an experience can be constructed by concatenating multiple consecutive experience tuples.

Curriculum Learning Methods in this framework typically explicitly or implicitly design a series of tasks or goals (with gradually increased difficulties) for the agent to follow and learn, i.e., the curriculum (Bengio et al. 2009; Weng 2020).

StarCraft II SC2 is a real-time strategy (RTS) game, where players command their units to compete against each other. In an SC2 full-length game, players typically start out by commanding units to collect resources (minerals and gas) to build up their economy and army at the same time. When they have amassed a sufficiently large army, they command these units to attack their opponents' base in order to win. SC2 is currently a very promising simulation environment for RL, due to its high flexibility and complexity and its wide-ranging applicability in the fields of game theory, planning and decision making, operations optimization, etc. SC2 minigames, as opposed to the full-length games described above, are built-in episodic tutorials where novice players can learn and practice their skills in a controlled and less complex environment. Some relevant skills include collecting resources, building certain army units, etc.

Proposed Methodology

We propose a novel method of integrating the advantages of human expertise and RL agents to facilitate fundamentals learning and skills mastery of a learning task. Our method adopts the principle of Curriculum Learning (Bengio et al. 2009) and follows a task-oriented approach (Zaremba and Sutskever 2014). The key idea is to leverage human expertise to simplify the complex learning procedure, by decomposing it into hierarchical subgoals as the curriculum for the agent. More specifically, we factorize the learning task into several successive subtasks indispensable for the agent to complete the entire complex learning procedure. The customized reward function in each subtask implicitly captures the corresponding subgoal. Importantly, these successive subgoals are chosen so that they are gradually more difficult, to improve learning efficiency (Bengio et al. 2009; Justesen et al. 2018). With defined subgoals, we use the Experience Replay technique to construct the experiences and further improve the empirical sample efficiency (Andrychowicz et al. 2017; Bakker and Schmidhuber 2004; Levy et al. 2019). Furthermore, adopting clearly defined subtasks and subgoals enhances the interpretability of the agent's learning progress. In implementation, we customize SC2 minigames to embed human expertise on subgoal information and the criteria to identify and select subgoals during learning. Therefore, the agent learns the subpolicies and combines them in a hierarchical way. By following a well-defined decomposition of the original minigame into subtasks, we can choose the desired state of a previous subtask to be the starting condition of the next subtask, thus completing the connection between subtasks.

Hierarchy: Subgoals and Subtasks

Our proposed hierarchy is composed of subgoals, which collectively divide the problem into simpler subtasks that can be solved easily and efficiently. Each subgoal is implicitly captured as the desired state in its corresponding subtask, and we refer to the agent's skill in reaching a subgoal as its corresponding subpolicy. The rationale behind this is as follows. First, the advantages of human expertise and of the agents are complementary to each other in terms of learning and mastering the task. Human players are good at seeing the big picture and thus identifying the essential and distinct steps/skills very quickly. On the other hand, agents are proficient in honing learned skills and manoeuvres to a high degree of precision. Second, a hierarchy helps reduce the complexity of the search space via divide-and-conquer. Lastly, this method enhances the interpretability of the subgoals (and subpolicies).

Figure 2: Collect Minerals and Gas. From left to right, top to bottom, (1)-(4): (1) to build refineries; (2) to collect gas with built refineries; (3) both tasks in (1) and (2); (4) all three tasks in (1), (2), (3) and collect minerals.
Figure 1 illustrates the concept of subgoals and subpolicies with a simple navigation agent. The agent is learning to navigate to the flag post from the initial state s_0. One possible sequence of the states is s_1, ..., s_5. Therefore, the entire trajectory can be decomposed into subgoals; for instance, Levy et al. (2019) used heuristic-based subgoal selection criteria (in Figure 1 these selected subgoals, g_0, ..., g_4, are denoted by orange circles). On the other hand, the sequence of red nodes denotes the subgoals of our method. We highlight that this sequence would constitute a better guided and more efficient exploration path. In addition, this sequence is better aligned with the game, where some states are the prerequisites for other states (illustrated as the black dashed arrows).

Figure 1: Navigation Agent.

Figure 3: Build Marines. From left to right, top to bottom, (1)-(4): (1) to build supply depots; (2) to build barracks; (3) to build marines with (1) and (2) already built; (4) all three tasks in (1), (2), (3).

Subgoals Selection and Experience Replays

Subgoal Design and Selection. We use a similar method for constructing experiences with a goal space as previous works (Andrychowicz et al. 2017; Levy et al. 2019). However, our method introduces human expertise in constructing the hierarchy and in subgoal selection. In (Andrychowicz et al. 2017), the hindsight experience replay buffer is constructed via random sampling from the goal space and concatenating the sampled goals to an already executed sequence {s_1, ..., s_T}, hence the name hindsight. The subgoals are initialized with heuristic-based selection and updated according to hindsight actions. For example, in Figure 1, given a predetermined subgoal g_0, the agent might not successfully reach it, and instead end up in s_1. In this case, the subgoal set in hindsight is s_1 (updated from g_0).

Our method is distinguished in that the (sub)goal selection strategy is designed with human expertise, to give a fixed but suitable decomposition of the learning task. Furthermore, we exploit the underlying sequential relationship among the subgoals, as in the game some states are the prerequisites for others. Hence, certain actions are required to be performed in order. Another reason for introducing human expertise rather than using end-to-end learning alone is that, compared with the environments investigated in previous HRL works, SC2 encompasses a significantly larger state-action space that prohibits a sample-efficient end-to-end learning strategy. As a result, our method enjoys an added advantage of interpretability of the selected subgoals.
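The contrast between hindsight relabelling and our fixed, human-specified subgoal ordering can be sketched as follows (an illustrative sketch in our own notation; the helper names are not from the cited works):

from typing import List, Tuple

Transition = Tuple[object, object, float, object]               # (s, a, r, s')
GoalTransition = Tuple[object, object, float, object, object]   # (s, a, r, s', g)

def hindsight_relabel(trajectory: List[Transition]) -> List[GoalTransition]:
    """HER-style relabelling (cf. Andrychowicz et al. 2017): reuse an executed
    trajectory with the state actually reached standing in for the intended goal."""
    reached = trajectory[-1][3]   # final next-state s_T of the executed sequence
    return [(s, a, r, s_next, reached) for (s, a, r, s_next) in trajectory]

# In our method the subgoal sequence is neither sampled nor relabelled: it is a
# fixed, ordered list chosen by a human expert, respecting in-game prerequisites.
HUMAN_SUBGOAL_SEQUENCE = ["subgoal_0", "subgoal_1", "subgoal_2"]  # placeholder names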
Subtasks Implementations. We leverage the customizability of SC2 minigames to carefully design subtasks to enable training of the corresponding subpolicies, as suggested in (Barto and Mahadevan 2003). We illustrate with the Collect Minerals and Gas (CMAG) minigame, as shown and described in Figure 2. There are several distinct and sequential actions the player has to perform to score well: 1. commanding the Space Construction Vehicles (SCVs), the basic units of the game, to collect minerals; 2. having collected sufficient minerals, selecting SCVs to build the gas refinery (a prerequisite building for collecting vespene gas) on specific locations with existing gas wells; 3. commanding the SCVs to collect vespene gas from the constructed gas refinery; 4. producing additional SCVs (at a fixed cost) to optimize the mining efficiency. The minigame has a fixed time duration of 900 seconds. The challenge of CMAG is that all these actions/subpolicies should be performed in an optimized sequence for best performance. The optimality depends on the order, timing, and the number of repetitions of these actions. For instance, it is important not to under- or over-produce SCVs at a mineral site for optimal efficiency. Hence, we implemented the following subtasks: BuildRefinery, CollectGasWithRefineries and BuildRefineryAndCollectGas. In the first two subtasks, the agent learns the specific subpolicies to build refineries and to collect gas (from built refineries), respectively, while in the last subtask the agent learns to combine them. Based on the same idea, the complete decomposition for CMAG is given by [CMAG, BuildRefinery, CollectGasWithRefineries, BuildRefineryAndCollectGas, CMAG], where the first CMAG trains the agent to collect minerals, and the last CMAG trains it to combine all subpolicies and also 're-introduces' the reward signal for collecting minerals to avoid forgetting (Zaremba and Sutskever 2014). Similarly, for the BuildMarines (BM) minigame, shown in Figure 3, the sequential steps/actions are: 1. commanding the SCVs to collect minerals; 2. having collected sufficient minerals, selecting SCVs to build a supply depot (a prerequisite building for barracks and to increase the supplies limit); 3. having both sufficient minerals and a supply depot, selecting SCVs to build barracks; 4. having minerals, a supply depot and barracks, and with the current unit count less than the supplies limit, selecting the barracks to train marines. The fixed time duration for BM is 450 seconds. Therefore, we implemented the corresponding subtasks BuildSupplyDepots, BuildBarracks and BuildMarinesWithBarracks, and the complete decomposition for BM is [BuildSupplyDepots, BuildBarracks, BuildMarinesWithBarracks, BM]. Note we do not set BM as the first subtask, as we did with CMAG, because CMAG contains both reward signals for minerals and gas, so it is an adequately simple task for the agent to learn to collect minerals. However, BM has only the reward signal for training marines, and is thus too difficult as the first subtask.
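The two decompositions and their role as an implicit curriculum can be summarized in a small configuration sketch (a Python sketch of the lists above; pairing each subtask with a reward threshold assumes the thresholds in Table 1 are listed in the same order as the decompositions):

# Ordered (subtask, reward threshold) pairs forming the implicit curriculum.
CMAG_CURRICULUM = [
    ("CollectMineralsAndGas",      300),  # first pass: learn to collect minerals
    ("BuildRefinery",                5),
    ("CollectGasWithRefineries",     5),
    ("BuildRefineryAndCollectGas",   5),
    ("CollectMineralsAndGas",      500),  # final pass: combine all subpolicies
]

BM_CURRICULUM = [
    ("BuildSupplyDepots",            7),
    ("BuildBarracks",                7),
    ("BuildMarinesWithBarracks",     7),
    ("BuildMarines",                 2),
]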
Construct Experience Replay for Each Subtask. With the designed subtasks represented by our customized minigames, constructing experience replays is straightforward. For a subtask, a predetermined subgoal g_i is implicitly captured in its customized minigame (e.g., to build barracks, to manufacture SCVs, etc.) using a corresponding reward signal, so that the agent learns to reach g_i. For the immediate subsequent subtask, we set its initial conditions to be the completed subgoal g_i. So, the agent learns to continue on the basis of a completed g_i. It is an implicit process because, when learning to reach subgoal g_{i+1}, the agent does not see or interact directly with the reward signal corresponding to g_i. For example, between two ordered subtasks CollectMinerals and BuildRefinery, the agent learns to collect minerals first, and starts the latter with some collected minerals, with the sole objective of learning to build refineries.

Off-policy learning and PPO. Off-policy learning is a learning paradigm where the exploration and learning are decoupled and take place separately. Exploration is mainly used by the agent to collect experiences or 'data points' for its policy function or model. Learning is then conducted on these collected experiences, and Proximal Policy Optimization (PPO) (Schulman et al. 2017) is one such method. Its details are not the focus of this work and are omitted here.
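For reference, the clipped surrogate objective that PPO maximizes (Schulman et al. 2017) can be written as follows, with r_t(θ) the probability ratio between the updated and the behaviour policy and Â_t an advantage estimate (notation ours):

L^CLIP(θ) = E_t[ min( r_t(θ) Â_t, clip(r_t(θ), 1 − ε, 1 + ε) Â_t ) ], where r_t(θ) = π_θ(a_t | s_t) / π_{θ_old}(a_t | s_t).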
Algorithm. We describe the HRL algorithm with human expertise in subgoal selection here. The pseudo-code is given in Algorithm 1. For a learning task, a sequence of subtasks is designed with human expertise to implicitly define the subgoals, and we refer to our customized SC2 minigames as the subtasks Γ_i, 0 ≤ i < m, of the learning task. We pre-define reward thresholds, thresholds ∈ ℝ^m, for all subtasks. Once the agent's running average reward is higher than a threshold, the agent is considered to have learnt the corresponding subtask well and moves to the subsequent subtask. We use learner L to denote the agent and to describe how it makes decisions and takes actions. It can be represented by a deep neural network, parametrized by w_L. In addition, we define a sample count c and a sample limit n. The sample count c refers to the number of samples the agent has used for learning a subtask. The sample limit n refers to the total number of samples allowed for the agent for the entire learning task, i.e., for all subtasks combined. Together, c and n are used to demonstrate empirical sample efficiency.

With these definitions and initializations, the algorithm takes the defined sequence of subtasks Γ with corresponding thresholds and initiates learning on these subtasks in the same sequence. During the process, a running average of the agent's past achieved rewards is kept for each subtask, represented by the API call test(). For each subtask Γ_i, either the agent completely exhausts its assigned sample limit ⌊n/m⌋ or it successfully reaches thresholds_i. If the running average of past rewards ≥ thresholds_i, the agent completes learning on Γ_i and starts with Γ_{i+1}; the process continues until all subtasks are learned. We follow the exploration policy in the preliminaries and adopt an ε-greedy policy, represented by explore() in Algorithm 1.
Algorithm 1 HRL with Human Expertise in Subgoal Selection
Input: subtasks Γ_i, 0 ≤ i < m
Input: reward thresholds thresholds ∈ ℝ^m
Input: learner L, parametrized by w_L
Input: sample count c, sample limit n
for 0 ≤ i < m do
    c ← 0
    while c ≤ ⌊n/m⌋ do
        experiences ← explore(L, Γ_i)
        c ← c + |experiences|
        w_L′ ← PPO(w_L, experiences)        ▷ off-policy update
        if test(w_L′) ≥ thresholds_i then
            break                           ▷ go to next subtask
        end if
    end while
end for
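A compact Python rendering of Algorithm 1 (our own sketch of the pseudo-code; explore, ppo_update and test stand in for the corresponding calls and are not actual PySC2 or reaver APIs):

def train_with_human_subgoals(subtasks, thresholds, learner,
                              n_total, explore, ppo_update, test):
    """Learn the subtasks in their human-designed order. Each subtask receives an
    equal share of the total sample budget and is left early once the running
    average test reward reaches its threshold (cf. Algorithm 1)."""
    per_task_budget = n_total // len(subtasks)
    for subtask, threshold in zip(subtasks, thresholds):
        used = 0                                         # sample count c
        while used <= per_task_budget:                   # sample limit floor(n/m)
            experiences = explore(learner, subtask)      # epsilon-greedy rollouts
            used += len(experiences)
            learner = ppo_update(learner, experiences)   # off-policy update of w_L
            if test(learner, subtask) >= threshold:      # running average reward
                break                                    # go to the next subtask
    return learner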
Related Work

Experience Replay RL has achieved impressive developments in robotics (Singh et al. 2019), strategic games such as Go (Silver et al. 2017), real-time strategy games (Zambaldi et al. 2019; Vinyals et al. 2019), etc. Researchers have attempted in various ways to address the challenge of goal-learning and reward shaping, to get the 'agent' to learn to master the task and yet not overfit to the particular instances of the goals or reward signals. Experience Replay (Lin 1993) is a technique to store and re-use past records of executions (along with the signals from the environment) to train the 'agent', achieving efficient sample usage. Mnih et al. (2013) employed this technique together with Deep Q-Learning to produce state-of-the-art results in Atari, and subsequently Mnih et al. (2015) confirmed the effectiveness of such an approach under the stipulation that the 'agent' only sees what human players would see, i.e., the pixels from the screen and some scoring indices.

Curriculum Learning Bengio et al. (2009) hypothesized and empirically showed that introducing gradually more difficult examples speeds up online learning, using a manually designed, task-specific curriculum. Zaremba and Sutskever (2014) experimentally showed that it is important to mix in easy tasks to avoid forgetting. Justesen et al. (2018) demonstrated that training an RL agent over a simple curriculum with gradually increasing difficulty can effectively prevent overfitting and lead to better generalization.

Hierarchical Reinforcement Learning (HRL) HRL and its related concepts such as options (Sutton, Precup, and Singh 1999), macro-actions (Hauskrecht et al. 1998), or tasks (Li, Narayan, and Leong 2017) were introduced to decompose the problem, usually a Markov decision process (MDP), into smaller sub-parts to be efficiently solved. We refer the readers to (Barto and Mahadevan 2003; Hengst 2010) for more comprehensive treatments. We describe two tracks of related works most relevant to our problem. Bakker and Schmidhuber (2004) proposed a two-level hierarchy, using subgoal and subpolicy to describe the learning taking place at the lower level of the hierarchy. Levy et al. (2019) further articulated these ideas, and explicitly combined them with Hindsight Experience Replay (Andrychowicz et al. 2017) for better sample efficiency and performance. Another similarly inspired approach, called context sensitive reinforcement learning (CSRL) and introduced by Li, Narayan, and Leong (2017), employed the hierarchical structure to enable effective re-use of learnt knowledge of similar (sub)tasks in a probabilistic way. In CSRL, instead of Experience Replay, efficient simulations over constructed states are used in learning, which is able to learn both the tasks and the environment (the transition and reward functions). CSRL scales well with the state space, and is relatively easily parallelizable.

StarCraft II In addition to (Zambaldi et al. 2019), several works addressed some of the challenges presented by SC2. In a real-time strategy (RTS) game such as SC2, the hierarchical architecture is an intuitive solution concept, for its efficient representation and interpretability. Similar but different hierarchies were employed in two other works: Lee et al. (2018) designed the hierarchy with semantic meaning and from an operational perspective, while Pang et al. (2019) forwent explicit semantic meanings for higher flexibility. Both provided promising empirical results on the full-length games against built-in AIs. Instead of full-length SC2 games, our investigation targets the minigames, and we propose a way to integrate human expertise, the Curriculum Learning paradigm and the Experience Replay technique into the learning process.

Different from related works, our work adopts a principle-driven HRL approach with human expertise in the subgoal selection, and thus an implicit formulation of a curriculum for the agent, on SC2 minigames, in order to achieve empirical sample efficiency and to enhance interpretability.

Experiments

In the experiments, we specifically focus on two minigames, viz., BM and CMAG, to investigate the effectiveness of our method. We choose these two because the discrepancies in performance between trained RL agents and human experts are the most significant, as reported in (Vinyals et al. 2017), suggesting these two are the most challenging for non-hierarchical, end-to-end learning approaches. For both CMAG and BM, we have implemented our customized SC2 minigames (subtasks) as described in the proposed methodology section, and we pair them with pre-defined reward thresholds. In our experiments, the decompositions for BM and CMAG are [BuildSupplyDepots, BuildBarracks, BuildMarinesWithBarracks, BM] and [CMAG, BuildRefinery, CollectGasWithRefineries, BuildRefineryAndCollectGas, CMAG], respectively.

Experimental Setup

• Model Architecture and Hyperparameters. We follow the model architecture of the Fully Convolutional agent in (Vinyals et al. 2017) by utilizing an open-source implementation by Ring (2018). We use the hyperparameters listed in Table 1.

• Training & Testing. In order to evaluate the empirical sample efficiency of our method, we restrict the total number of training samples to 10 million. Note this is still significantly fewer than the 600 million in (Vinyals et al. 2017) or the 10 billion in (Zambaldi et al. 2019). Furthermore, we adopt their practice of training multiple agents and reporting the best results attained. After training, average and maximum scores of the trained model over 30 independent episodes are reported.

• Computing Resource. CPU: Intel(R) Core(TM) i9-10920X CPU @ 3.50GHz; RAM: 64 GB; GPU: GeForce RTX 2080 SUPER 8GB. The training time for a single model initialization is approximately 1.66 hours for CMAG and 1.5 hours for BM.
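The testing protocol in the second bullet can be written as a short sketch (assuming a hypothetical run_episode helper that plays one test episode and returns its total score):

import statistics

def evaluate(agent, minigame, run_episode, n_episodes: int = 30):
    """Report the average and maximum score over 30 independent test episodes."""
    scores = [run_episode(agent, minigame) for _ in range(n_episodes)]
    return statistics.mean(scores), max(scores)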

Table 1: Hyperparameters

                                   BM           CMAG
Learning rate                      0.0007       0.0007
Batch size                         32           32
Trajectory length                  40           40
Off-policy learning algorithm      PPO          PPO
Reward thresholds                  [7,7,7,2]    [300,5,5,5,500]

Table 2: Average Rewards Achieved

Minigame    SC2LE     DRL           Ours           Human Expert
CMAG        3,978     5,055         478.5 (527)    7,566
BM          3         123           6.7 (6.24)     133

Table 3: Maximum Rewards Achieved

Minigame    SC2LE     DRL           Ours           Human Expert
CMAG        4,130     unreported    1,825          7,566
BM          42        unreported    22             133

Table 4: Training Samples Required

Minigame    SC2LE     DRL           Ours           Human Expert
CMAG        6e8       1e10          1e7            N.A.
BM          6e8       1e10          3.4e6          N.A.

Figure 4: Collect Minerals And Gas learning curve.
Discussion

Our experimental results demonstrate similar trends to those shown in (Vinyals et al. 2017). The variance observed in the final performance achieved can be quite large, over different hyperparameter sets, different or identical model parameter initializations, and other stochasticity involved in learning. For Tables 2 and 3, higher values are better; for Table 4, lower values are better. Among the 5 agents for BM, the best performing agent achieves an average reward of 6.7 during testing, while the worst performing agent barely achieves 0.1. Note that the average reward of 6.7 is more than twice the average reward (3) of the best performing agent reported in (Vinyals et al. 2017) for BM.

In addition, our method allows for an in-depth investigation into the agent's learning curves to identify which part of the learning was not effective and led to sub-optimal final performance. We compare the best (average 6.7) and worst (average 0.1) agents based on their subgoal learning curves, and we find that the best agent learns effectively across all subgoals. From Figure 5, the learning curves in all subtasks show consistent progress with more samples, whereas the learning curves of the worst agent show substantially less progress, often flat at zero with very rare spikes, as shown in Figure 6. Especially for the BuildBarracks subtask, the worst agent's learning is ineffective and it only occasionally stumbles upon the correct actions of building barracks at random and receives a corresponding reward signal. Alternatively, the comparison between the running average rewards for these two agents clearly demonstrates that learning for the best agent on the BuildBarracks subtask is significantly more effective. The performance on this subtask also affects the final subtask BuildMarines since, without knowing how to build barracks, the agent cannot take the action of producing marines even if it has learnt this subpolicy. We believe such interpretability and explainability provided by our method are helpful in understanding and improving the learning process and the behavior of the agent.

On the other hand, the experimental results in CMAG show slightly less success. We believe this can be attributed to the difference in the setting of learning. In BM, the agent has to learn distinct skills and how to execute them in sequence in order to perform well, with relatively less emphasis on the degree of mastery of these skills. However, in CMAG the agent's mastery of the skills, including mining minerals and gas, directly and critically affects its final score, viz., the total amount of minerals and gas collected. This means that the agent has to be able to perform the skills well, i.e., optimize with respect to time and manufacturing cost, which in itself can be a separate and more complex learning task. Another experimental difficulty for CMAG lies in the reward scales: the subtasks for collecting minerals and gas have high reward ceilings (as high as several thousand), while those for building the gas refineries have comparatively low reward ceilings (less than one hundred). Due to this large difference in the scales of the reward signals between subtasks, the learning on the subtasks is even more difficult and can be unbalanced.
Figure 5: Build Marines learning curve (best agent).

Figure 6: Build Marines learning curve (worst agent).

Conclusion & Future Work

In this work, we examined the SC2 minigames and proposed a way to introduce human expertise to an HRL framework. By designing customized minigames to facilitate learning and leveraging the effectiveness of hierarchical structures in decomposing complex and large problems, we empirically showed that our approach is sample-efficient and enhances interpretability. This initial work invites several exploration directions: developing more efficient and effective ways of introducing human expertise; a more formal and principled state representation to further reduce the complexity of the state space (goal space), with theoretical analysis of its complexity; and a more efficient learning algorithm to pair with the HRL architecture, Experience Replay and Curriculum Learning.

Acknowledgments

This work was partially supported by an Academic Research Grant T1 251RES1827 from the Ministry of Education in Singapore and a grant from the Advanced Robotics Center at the National University of Singapore.

References

Andrychowicz, M.; Wolski, F.; Ray, A.; Schneider, J.; Fong, R.; Welinder, P.; McGrew, B.; Tobin, J.; Pieter Abbeel, O.; and Zaremba, W. 2017. Hindsight experience replay. In Guyon, I.; Luxburg, U. V.; Bengio, S.; Wallach, H.; Fergus, R.; Vishwanathan, S.; and Garnett, R., eds., Advances in Neural Information Processing Systems 30. Curran Associates, Inc. 5048–5058.

Andrychowicz, M.; Baker, B.; Chociej, M.; Józefowicz, R.; McGrew, B.; Pachocki, J.; Petron, A.; Plappert, M.; Powell, G.; Ray, A.; Schneider, J.; Sidor, S.; Tobin, J.; Welinder, P.; Weng, L.; and Zaremba, W. 2020. Learning dexterous in-hand manipulation. International Journal of Robotics Research 39(1):3–20.

Bakker, B., and Schmidhuber, J. 2004. Hierarchical reinforcement learning based on subgoal discovery and subpolicy specialization. In Proceedings of the 8th Conference on Intelligent Autonomous Systems, IAS-8, 438–445.

Barto, A. G., and Mahadevan, S. 2003. Recent advances in hierarchical reinforcement learning. Discrete Event Dynamic Systems 13(1–2):41–77.

Bengio, Y.; Louradour, J.; Collobert, R.; and Weston, J. 2009. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML '09, 41–48. New York, NY, USA: Association for Computing Machinery.

Buch, V.; Ahmed, I.; and Maruthappu, M. 2018. Artificial intelligence in medicine: Current trends and future possibilities. British Journal of General Practice 68:143–144.

Cath, C. 2018. Governing artificial intelligence: Ethical, legal and technical opportunities and challenges. Philosophical Transactions of The Royal Society A: Mathematical, Physical and Engineering Sciences 376:20180080.

Hauskrecht, M.; Meuleau, N.; Kaelbling, L. P.; Dean, T.; and Boutilier, C. 1998. Hierarchical solution of Markov decision processes using macro-actions. In Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence, UAI '98, 220–229. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc.

Hengst, B. 2010. Hierarchical reinforcement learning. In Sammut, C., and Webb, G. I., eds., Encyclopedia of Machine Learning. Boston, MA: Springer US. 495–502.
Justesen, N.; Torrado, R. R.; Bontrager, P.; Khalifa, A.; Togelius, J.; and Risi, S. 2018. Illuminating generalization in deep reinforcement learning through procedural level generation. In NeurIPS Workshop on Deep Reinforcement Learning.

Lee, D.; Tang, H.; Zhang, J. O.; Xu, H.; Darrell, T.; and Abbeel, P. 2018. Modular architecture for StarCraft II with deep reinforcement learning. In Rowe, J. P., and Smith, G., eds., Proceedings of the Fourteenth AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment, AIIDE 2018, November 13–17, 2018, Edmonton, Canada, 187–193. AAAI Press.

Levy, A.; Konidaris, G. D.; Platt Jr., R.; and Saenko, K. 2019. Learning multi-level hierarchies with hindsight. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6–9, 2019.

Li, Z.; Narayan, A.; and Leong, T. Y. 2017. An efficient approach to model-based hierarchical reinforcement learning. In 31st AAAI Conference on Artificial Intelligence, AAAI 2017, 3583–3589.

Lillicrap, T. P.; Hunt, J. J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; and Wierstra, D. 2016. Continuous control with deep reinforcement learning. In 4th International Conference on Learning Representations, ICLR 2016 - Conference Track Proceedings.

Lin, L.-J. 1993. Reinforcement learning for robots using neural networks. Ph.D. Dissertation, Carnegie Mellon University.

Mnih, V.; Kavukcuoglu, K.; Silver, D.; Graves, A.; Antonoglou, I.; Wierstra, D.; and Riedmiller, M. 2013. Playing Atari with deep reinforcement learning. In NIPS Deep Learning Workshop.

Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A. A.; Veness, J.; Bellemare, M. G.; Graves, A.; Riedmiller, M. A.; Fidjeland, A.; Ostrovski, G.; Petersen, S.; Beattie, C.; Sadik, A.; Antonoglou, I.; King, H.; Kumaran, D.; Wierstra, D.; Legg, S.; and Hassabis, D. 2015. Human-level control through deep reinforcement learning. Nature 518(7540):529–533.

Pang, Z.-J.; Liu, R.-Z.; Meng, Z.-Y.; Zhang, Y.; Yu, Y.; and Lu, T. 2019. On reinforcement learning for full-length game of StarCraft. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, 4691–4698.

Ring, R. 2018. Reaver: Modular deep reinforcement learning framework. https://github.com/inoryy/reaver.

Schaul, T.; Horgan, D.; Gregor, K.; and Silver, D. 2015. Universal value function approximators. In Proceedings of the 32nd International Conference on Machine Learning - Volume 37, ICML '15, 1312–1320. JMLR.org.

Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; and Klimov, O. 2017. Proximal policy optimization algorithms. CoRR abs/1707.06347.

Silver, D.; Huang, A.; Maddison, C. J.; Guez, A.; Sifre, L.; van den Driessche, G.; Schrittwieser, J.; Antonoglou, I.; Panneershelvam, V.; Lanctot, M.; Dieleman, S.; Grewe, D.; Nham, J.; Kalchbrenner, N.; Sutskever, I.; Lillicrap, T. P.; Leach, M.; Kavukcuoglu, K.; Graepel, T.; and Hassabis, D. 2016. Mastering the game of Go with deep neural networks and tree search. Nature 529:484–489.

Silver, D.; Schrittwieser, J.; Simonyan, K.; Antonoglou, I.; Huang, A.; Guez, A.; Hubert, T.; Baker, L.; Lai, M.; Bolton, A.; Chen, Y.; Lillicrap, T.; Hui, F.; Sifre, L.; van den Driessche, G.; Graepel, T.; and Hassabis, D. 2017. Mastering the game of Go without human knowledge. Nature 550(7676):354–359.

Singh, A.; Yang, L.; Finn, C.; and Levine, S. 2019. End-to-end robotic reinforcement learning without reward engineering. In Bicchi, A.; Kress-Gazit, H.; and Hutchinson, S., eds., Robotics: Science and Systems XV, University of Freiburg, Freiburg im Breisgau, Germany, June 22–26, 2019.

Sutton, R. S., and Barto, A. G. 2018. Reinforcement Learning: An Introduction. Cambridge, MA: A Bradford Book/MIT Press, second edition.

Sutton, R. S.; Precup, D.; and Singh, S. 1999. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence 112(1–2):181–211.

Vinyals, O.; Ewalds, T.; Bartunov, S.; Georgiev, P.; Vezhnevets, A. S.; Yeo, M.; Makhzani, A.; Küttler, H.; Agapiou, J. P.; Schrittwieser, J.; Quan, J.; Gaffney, S.; Petersen, S.; Simonyan, K.; Schaul, T.; van Hasselt, H.; Silver, D.; Lillicrap, T. P.; Calderone, K.; Keet, P.; Brunasso, A.; Lawrence, D.; Ekermo, A.; Repp, J.; and Tsing, R. 2017. StarCraft II: A new challenge for reinforcement learning. CoRR abs/1708.04782.

Vinyals, O.; Babuschkin, I.; Czarnecki, W. M.; Mathieu, M.; Dudzik, A.; Chung, J.; Choi, D. H.; Powell, R.; Ewalds, T.; Georgiev, P.; Oh, J.; Horgan, D.; Kroiss, M.; Danihelka, I.; Huang, A.; Sifre, L.; Cai, T.; Agapiou, J. P.; Jaderberg, M.; Vezhnevets, A. S.; Leblond, R.; Pohlen, T.; Dalibard, V.; Budden, D.; Sulsky, Y.; Molloy, J.; Paine, T. L.; Gulcehre, C.; Wang, Z.; Pfaff, T.; Wu, Y.; Ring, R.; Yogatama, D.; Wünsch, D.; McKinney, K.; Smith, O.; Schaul, T.; Lillicrap, T.; Kavukcuoglu, K.; Hassabis, D.; Apps, C.; and Silver, D. 2019. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature 575(7782):350–354.

Weng, L. 2020. Curriculum for reinforcement learning. lilianweng.github.io/lil-log.

Zambaldi, V. F.; Raposo, D.; Santoro, A.; Bapst, V.; Li, Y.; Babuschkin, I.; Tuyls, K.; Reichert, D. P.; Lillicrap, T. P.; Lockhart, E.; Shanahan, M.; Langston, V.; Pascanu, R.; Botvinick, M.; Vinyals, O.; and Battaglia, P. W. 2019. Deep reinforcement learning with relational inductive biases. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6–9, 2019.

Zaremba, W., and Sutskever, I. 2014. Learning to execute. CoRR abs/1410.4615.
