Multiplayer Residual Advantage Learning with General Function Approximation
Abstract
A new algorithm, advantage learning, is presented that improves on advantage
updating by requiring that a single function be learned rather than two.
Furthermore, advantage learning requires only a single type of update, the learning
update, while advantage updating requires two different types of updates, a learning
update and a normalization update. The reinforcement learning system uses the
residual form of advantage learning. An application of reinforcement learning to a
Markov game is presented. The test-bed has continuous states and nonlinear
dynamics. The game consists of two players, a missile and a plane; the missile
pursues the plane and the plane evades the missile. On each time step, each player
chooses one of two possible actions: turn left or turn right, resulting in a 90-degree
instantaneous change in the aircraft’s heading. Reinforcement is given only when
the missile hits the plane or the plane reaches an escape distance from the missile.
The advantage function is stored in a single-hidden-layer sigmoidal network.
Speed of learning is increased by a new algorithm, Incremental Delta-Delta (IDD),
which extends Jacobs’ (1988) Delta-Delta for use in incremental training, and
differs from Sutton’s Incremental Delta-Bar-Delta (1992) in that it does not require
the use of a trace and is amenable for use with general function approximation
systems. The advantage learning algorithm for optimal control is modified for
games in order to find the minimax point, rather than the maximum. Empirical
results gathered using the missile/aircraft test-bed validate theory that suggests
residual forms of reinforcement learning algorithms converge to a local minimum
of the mean squared Bellman residual when using general function approximation
systems. Also, to our knowledge, this is the first time an approximate second order
method has been used with residual algorithms. Empirical results are presented
comparing convergence rates with and without the use of IDD for the reinforcement
learning test-bed described above and for a supervised learning test-bed. The
results of these experiments demonstrate that IDD increased the rate of convergence and
resulted in an order of magnitude lower total asymptotic error than when using
backpropagation alone.
1 INTRODUCTION
In Harmon, Baird, and Klopf (1995) it was demonstrated that the residual gradient form of the
advantage updating algorithm could learn the optimal policy for a linear-quadratic differential
game using a quadratic function approximation system. We propose a simpler algorithm,
advantage learning, which retains the properties of advantage updating but requires only one
function to be learned rather than two. A faster class of algorithms, residual algorithms, is
proposed in Baird (1995). We present empirical results demonstrating the residual form of
advantage learning solving a nonlinear game using a general neural network. The game is a
Markov decision process (MDP) with continuous states and nonlinear dynamics. The game
consists of two players, a missile and a plane; the missile pursues the plane and the plane evades
the missile. On each time step each player chooses one of two possible actions: turn left or turn
right, which results in a 90-degree instantaneous change in heading for the aircraft.
Reinforcement is given only when the missile hits the plane or the plane escapes. The advantage
function is stored in a single-hidden-layer sigmoidal network. Rate of convergence is increased
by a new algorithm we call Incremental Delta-Delta (IDD), which extends Jacobs’ (1988) Delta-
Delta for use in incremental training, as opposed to epoch-wise training. IDD differs from
Sutton’s Incremental Delta-Bar-Delta (1992) in that it does not require the use of a trace
(an average of recent values) and is applicable to general function approximation systems. The
advantage learning algorithm for optimal control is modified for games in order to find the
minimax point, rather than the maximum. Empirical results gathered using the missile/aircraft
test-bed validate theory that suggests residual forms of reinforcement learning algorithms
converge to a local minimum of the mean squared Bellman residual when using general function
approximation systems. Also, to our knowledge, this is the first time an approximate second
order method has been used with residual algorithms, and we present empirical results
comparing convergence rates with and without the use of IDD for the reinforcement learning
test-bed described above and for a supervised-learning test-bed.
In Section 2 we present advantage learning and describe its improvements over advantage
updating. In Section 3 we review direct algorithms, residual gradient algorithms, and residual
algorithms. In Section 4 we present a brief discussion of game theory and review research in
which game theory has been applied to MDP-like environments. In Section 5 we present
Incremental Delta-Delta (IDD), an incremental, nonlinear extension to Jacobs’ (1988) Delta-
Delta algorithm. Also presented in Section 5 are empirical results generated from an application
of the IDD algorithm to a nonlinear supervised learning task. Section 6 explicitly describes the
reinforcement learning testbed and presents the update equations for residual advantage learning.
Simulation results generated using the missile/aircraft test-bed are presented and discussed in
Section 7. These results include diagrams of learned behavior, a comparison of the system’s
ability to reduce the mean squared Bellman error for different values of φ (including an adaptive
φ), and a comparison of the system’s performance with and without the use of IDD.
2 BACKGROUND
Advantage updating (Baird, 1993) learns two functions: the value function V(x), stored for each
state x, and the advantage function A(x,u), stored for each state x and action u. The advantage
A(x,u) represents an estimate of the degree to which the expected total discounted
reinforcement is increased by performing
of the degree to which the expected total discounted reinforcement is increased by performing
action u rather than the action currently considered best. The optimal value function V*(x)
represents the true value of each state. The optimal advantage function A*(x,u) will be zero if u
is the optimal action (because u confers no advantage relative to itself) and A*(x,u) will be
negative for any suboptimal u (because a suboptimal action has a negative advantage relative to
the best action). Advantage updating has been shown to learn faster than Q-learning (Watkins,
1989), especially for continuous-time problems (Baird, 1993; Harmon, Baird, & Klopf, 1995).
Equation (3) is exactly the TD(0) algorithm, and could also be called the direct
implementation of incremental value iteration, Q-learning, and advantage learning.
For a value function V, the mean squared Bellman residual is:

E = (1/n) Σ_x [ R + γV(x') − V(x) ]²    (5)
If A(x,u) is an approximation of A*(x,u), then the mean squared Bellman residual, E, is:
E = < [ < (R + γ^∆t max_u' A(x',u')) / (∆tK) + (1 − 1/(∆tK)) max_u' A(x,u') > − A(x,u) ]² >    (7)
where the inner <> is the expected value over all possible results of performing a given action u
in a given state x, and the outer <> is the expected value over all possible states and actions.
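As an illustration only, the following Python sketch estimates the mean squared Bellman residual of equation (7) from a batch of sampled transitions; the function approximator A, the sampled transitions, and the constants dt, K, and gamma are placeholders introduced here, not part of the paper.

    import numpy as np

    def advantage_bellman_residual(A, transitions, actions, gamma, dt, K):
        """Estimate the mean squared Bellman residual of equation (7) from samples.

        A            -- approximate advantage function, callable as A(x, u)
        transitions  -- list of (x, u, R, x_next) samples
        actions      -- list of candidate actions u'
        """
        residuals = []
        for (x, u, R, x_next) in transitions:
            # The inner expectation is approximated by the single sampled successor x'.
            target = ((R + gamma ** dt * max(A(x_next, up) for up in actions)) / (dt * K)
                      + (1.0 - 1.0 / (dt * K)) * max(A(x, up) for up in actions))
            residuals.append(target - A(x, u))
        # The outer expectation over states and actions becomes a sample average.
        return float(np.mean(np.square(residuals)))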
The parameter φ controls a trade-off between pure gradient descent (φ = 1) and the fast direct
algorithm (φ = 0); φ can be chosen as close to 0 as possible without the weights blowing up. A φ
of 1 is guaranteed to converge, and a φ of 0 might be expected to learn quickly if it is stable at
all. However, this may not be the best approach. It
requires an additional parameter to be chosen by trial and error, and it ignores the fact that the
best φ to use initially might not be the best φ to use later, after the system has learned for some
time. Fortunately, it is easy to calculate the φ that ensures a decreasing mean squared residual,
while bringing the weight change vector as close to the direct algorithm as possible (described in
Section 6).
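As a minimal sketch of the idea (not the paper's implementation), the residual update for one transition can be written as a φ-weighted blend of two gradient vectors; here grad_direct and grad_residual_gradient are hypothetical inputs computed elsewhere from the Bellman residual.

    import numpy as np

    def residual_weight_update(w, alpha, phi, grad_direct, grad_residual_gradient):
        """One residual-algorithm update: phi = 1 gives pure gradient descent on the
        Bellman residual (residual gradient), phi = 0 gives the fast direct algorithm."""
        delta_w = -alpha * (phi * grad_residual_gradient + (1.0 - phi) * grad_direct)
        return w + delta_w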
4 MULTI-PLAYER GAMES
The theory of Markov decision processes (Barto et al., 1989; Howard, 1960) is the basis for most
of the recent reinforcement learning theory. However, this body of theory assumes that the
learning system’s environment is stationary and, therefore, contains no other adaptive systems.
Game theory (von Neumann and Morgenstern, 1947) is explicitly designed for reasoning about
multi-player environments.
Differential games (Isaacs, 1965) are games played in continuous time, or use sufficiently small
time steps to approximate continuous time. Both players evaluate the given state and
simultaneously execute an action, with no knowledge of the other player's selected action. The
value of a game is the long-term, discounted reinforcement if both opponents play the game
optimally in every state. Consider a game in which player A tries to minimize the total
discounted reinforcement, while the opponent, player B, tries to maximize the total discounted
reinforcement. Given the advantage A(x,uA,uB) for each possible action in state x, it is useful to
define the minimax and maximin values for state x as:
minimax(x) = min_uA max_uB A(x, uA, uB)    (8)

maximin(x) = max_uB min_uA A(x, uA, uB)    (9)
If the minimax equals the maximin, then the minimax is called a saddlepoint and the optimal
policy for both players is to perform the actions associated with the saddlepoint. If a saddlepoint
does not exist, then the optimal policy is stochastic if an optimal policy exists at all. If a
saddlepoint does not exist, and a learning system treats the minimax as if it were a saddlepoint,
then the system will behave as if player A must choose an action on each time step, and then
player B chooses an action based upon the action chosen by A. For the algorithms described
below, a saddlepoint is assumed to exist. If a saddlepoint does not exist, this assumption confers
a slight advantage to player B.
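For the two-action game used here, equations (8) and (9) reduce to operations on a 2 × 2 table of advantages for a single state. The sketch below, with an arbitrary example table, computes the minimax and maximin values and checks for a saddlepoint.

    import numpy as np

    def minimax_maximin(A_xu):
        """A_xu[iA, iB] holds A(x, uA, uB) for one state x.
        Player A (rows) minimizes; player B (columns) maximizes."""
        minimax = np.min(np.max(A_xu, axis=1))   # min over uA of (max over uB)
        maximin = np.max(np.min(A_xu, axis=0))   # max over uB of (min over uA)
        return minimax, maximin

    # Illustrative values only.
    A_xu = np.array([[ 0.2, -0.1],
                     [-0.3,  0.4]])
    mm, Mm = minimax_maximin(A_xu)
    has_saddlepoint = np.isclose(mm, Mm)         # here False: no saddlepoint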
In Sutton’s (1992) Incremental Delta-Bar-Delta (IDBD) work, the learning system was the
Least-Mean-Square (LMS) rule, also known as the Widrow-Hoff rule
(Widrow and Stearns, 1985), and the IDBD algorithm was derived for linear function
approximation systems. Here, we present an extension to Jacobs’ (1988) Delta-Delta algorithm
that is appropriate for incremental training when using nonlinear function approximation
systems.
As in the IDBD algorithm, in IDD each parameter of the neural network has an associated
learning rate of the form
αi(t) = e^βi(t)    (10)
(where i indexes the associated parameter) that is updated after each step of learning. There are
two advantages to the exponential relationship between the learning rate, αi, and the memory
parameter that is actually modified, βi. First, it ensures that αi will always be positive.
Second, it allows geometric steps in αi: as βi is incremented or decremented by a fixed step-size,
αi moves up or down by a fraction of its current value, allowing αi to become very small. IDD
updates βi by
βi(t+1) = βi(t) + θ [∆wi(t+1) / αi(t)] ∆wi(t)    (11)
where ∆wi is a change in the weight parameter wi, and θ is the meta-learning rate. The
derivation of IDD is similar in principle to that of IDBD and is presented below. We start with
βi(t+1) = βi(t) − θ ∂[½E²(t+1)] / ∂βi(t)    (12)
where E² is the expected value of the mean squared error. Applying the chain rule we may
write
∂[½E²(t+1)] / ∂βi(t) = (∂[½E²(t+1)] / ∂wi(t+1)) (∂wi(t+1) / ∂αi(t)) (∂αi(t) / ∂βi(t))    (13)
By evaluating the last term of equation (13), shown in equation (14), and then substituting the
results we arrive at equation (15).
∂αi(t) / ∂βi(t) = ∂e^βi / ∂βi = e^βi = αi(t)    (14)
∂[½E²(t+1)] / ∂βi(t) = (∂[½E²(t+1)] / ∂wi(t+1)) (∂wi(t+1) / ∂αi(t)) αi(t)    (15)
By evaluating the next to last term of equation (15) and rearranging, we find the equality
described by equation (16).
∂wi(t+1) / ∂αi(t) = ∂/∂αi(t) [ wi(t) − αi(t) ∂[½E²(t+1)] / ∂wi(t) ] = − ∂[½E²(t+1)] / ∂wi(t)    (16)
Again, substituting the results of equation (16) into (15) produces
∂[½E²(t+1)] / ∂βi(t) = − (∂[½E²(t+1)] / ∂wi(t+1)) (∂[½E²(t+1)] / ∂wi(t)) αi(t)    (17)
Next, we define the change in the parameter wi and rearrange for substitution.
∆wi(t) = − αi(t) ∂[½E²(t+1)] / ∂wi(t)

− ∆wi(t) / αi(t) = ∂[½E²(t+1)] / ∂wi(t)    (18)
By substituting the left-hand side of the second half of equation (18) into equation (17), and then
deriving the equivalent of equation (18) for ∆ wi (t + 1) and substituting into equation (17), we
arrive at
∂[½E²(t+1)] / ∂βi(t) = − [∆wi(t+1) / αi(t)] ∆wi(t)    (19)
Thus
βi(t+1) = βi(t) + θ [∆wi(t+1) / αi(t)] ∆wi(t)    (20)
The right-hand side of equation (19) provides a true unbiased estimate of the gradient of the
error surface with respect to the memory parameter, βi. An equivalent of IDBD for nonlinear
systems can trivially be derived from IDD by replacing ∆wi(t) with the trace ∆̄wi(t), where
∆̄wi(t) is an exponentially weighted sum of the current and past changes to wi. The trace is
defined by equation (21).

∆̄wi(t) = (1 − ε) ∆wi(t−1) + ε ∆̄wi(t−1)    (21)
This form of IDBD for nonlinear systems includes another free parameter, ε, that determines the
decay rate of the trace. If it were possible to choose ε perfectly for each training example, IDBD
would, in the worst case, be equivalent to IDD, and would on average provide a better estimate
of the gradient. However, the optimal value of ε is a function of the rate of change in the
gradient of the error surface, and is therefore different for different regions of state space.
Moreover, Jacobs’ original motivations for using delta-bar-delta rather than delta-delta are no
longer relevant when each learning rate is defined according to equation (10). For these reasons
we used IDD to speed the convergence rate for our testbed.
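The derivation reduces to the per-parameter update of equations (10) and (20). The following Python sketch applies it to a generic parameter vector; the initial β value, the meta-learning rate, and the way the gradient is supplied are illustrative assumptions, not values taken from the paper.

    import numpy as np

    class IDD:
        """Incremental Delta-Delta: one adaptive learning rate per parameter."""

        def __init__(self, n_params, beta_init=-5.0, theta=0.01):
            self.beta = np.full(n_params, beta_init)   # memory parameters beta_i
            self.theta = theta                         # meta-learning rate
            self.prev_dw = np.zeros(n_params)          # previous weight changes

        def step(self, w, grad):
            """grad holds d(1/2 E^2)/dw_i for the current training example."""
            alpha = np.exp(self.beta)                  # eq. (10): alpha_i = exp(beta_i)
            dw = -alpha * grad                         # eq. (18): weight change
            # eq. (20): beta_i <- beta_i + theta * (dw_i(t+1) / alpha_i(t)) * dw_i(t)
            self.beta += self.theta * (dw / alpha) * self.prev_dw
            self.prev_dw = dw
            return w + dw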
5.2 IDD Supervised Learning Results
The capabilities of IDD were initially assessed using a supervised-learning task. The intent of
this experiment was to answer the question: Does the IDD algorithm perform better than the
ordinary backpropagation algorithm? The task involved six real-valued inputs (including a bias)
and one output. The inputs were chosen independently and randomly in the range [-1, 1]. The
objective function was the square of the first input summed with the second input. The function
approximator was a single-hidden-layer sigmoidal network with 5 hidden nodes. For each
algorithm, we trained the network for 50,000 iterations and then measured the asymptotic error.
This process was repeated 100 times using different initial random number seeds, and the
results were averaged.
(Figure: averaged asymptotic error plotted against learning rates α ranging from 0.35 to 0.9.)
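For concreteness, a sketch of the supervised task as described above: six random inputs in [-1, 1], a target equal to the square of the first input plus the second, and a 5-hidden-node sigmoidal network trained incrementally. The choice of tanh as the sigmoid, the weight initialization, and the fixed learning rate of the backpropagation baseline are assumptions; IDD would replace the fixed rate with the per-parameter rates of Section 5.

    import numpy as np

    rng = np.random.default_rng(0)

    def sample_example():
        x = rng.uniform(-1.0, 1.0, size=6)     # six real-valued inputs (incl. bias input)
        y = x[0] ** 2 + x[1]                   # square of first input plus second input
        return x, y

    # Single-hidden-layer sigmoidal network with 5 hidden nodes and a linear output.
    W1 = rng.normal(scale=0.1, size=(5, 6))
    w2 = rng.normal(scale=0.1, size=5)

    alpha = 0.05                               # fixed rate for the backprop baseline
    for _ in range(50_000):
        x, y = sample_example()
        h = np.tanh(W1 @ x)                    # assumed sigmoid with outputs in [-1, 1]
        y_hat = w2 @ h
        err = y_hat - y
        grad_w2 = err * h
        grad_W1 = err * np.outer(w2 * (1.0 - h ** 2), x)
        w2 -= alpha * grad_w2
        W1 -= alpha * grad_W1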
The parameter φ is a constant that controls a trade-off between pure gradient descent (when φ
equals 1) and a fast direct algorithm (when φ equals 0). A φ that ensures a decreasing mean
squared residual, while bringing the weight change vector as close to the direct algorithm as
possible can be calculated by maintaining an estimate of the epoch-wise weight change vectors.
These can be approximated by maintaining two scalar values, wd and wrg, associated with each
weight w in the function approximation system. These are traces (averages of recent values)
used to approximate ∆Wd and ∆Wrg. The traces are updated on each learning cycle according to:
wd ← (1 − µ) wd − µ [ (R + γ^∆t A_minmax(x',u)) / (∆tK) + (1 − 1/(∆tK)) A_minmax(x,u) − A(x,u) ] · [ − ∂A(x,u)/∂w ]    (23)

wrg ← (1 − µ) wrg − µ [ (R + γ^∆t A_minmax(x',u)) / (∆tK) + (1 − 1/(∆tK)) A_minmax(x,u) − A(x,u) ]
        · [ γ^∆t (∂A_minmax(x',u)/∂w) / (∆tK) + (1 − 1/(∆tK)) ∂A_minmax(x,u)/∂w − ∂A(x,u)/∂w ]    (24)
where µ is a small, positive constant that governs how fast the system forgets. On each time step
a stable φ is calculated by using equation (25). This ensures convergence while maintaining fast
learning:
φ = [ Σ_w wd wrg ] / [ Σ_w (wd − wrg) wrg ] + µ    (25)
It is important to note that this algorithm does not follow the negative gradient, which would be
the steepest path of descent. However, the algorithm does cause the mean squared residual to
decrease monotonically (for appropriate φ), thereby guaranteeing convergence to a local
minimum of the mean squared Bellman residual.
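A sketch of how the traces and the adaptive φ of equations (23) through (25) might be maintained in code. The Bellman residual and the two gradient factors are assumed to be computed by the advantage-learning update for the current transition; the zero-denominator guard and the clipping of φ to [0, 1] are precautions added here, not taken from the paper.

    import numpy as np

    def update_traces_and_phi(w_d, w_rg, bellman_err, grad_direct, grad_resid_grad, mu):
        """Update the w_d and w_rg traces (eqs. 23-24) and return the adaptive phi (eq. 25)."""
        w_d  = (1.0 - mu) * w_d  - mu * bellman_err * grad_direct      # eq. (23)
        w_rg = (1.0 - mu) * w_rg - mu * bellman_err * grad_resid_grad  # eq. (24)
        denom = np.dot(w_d - w_rg, w_rg)
        phi = 1.0 if denom == 0.0 else np.dot(w_d, w_rg) / denom + mu  # eq. (25)
        phi = float(np.clip(phi, 0.0, 1.0))
        return w_d, w_rg, phi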
8
player, and the positions of the players are updated. Pseudocode describing these dynamics
follows:
deg2rad - converts degrees to radians; // trigonometric functions measured in radians
Plane_action, Missile_action - action chosen by the plane and by the missile (0.5 = turn left, -0.5 = turn right);
Plane_theta, Missile_theta - current heading of the plane and of the missile in 2d space;
Missile_velocity_X - component of the missile's velocity in the x dimension;
Missile_velocity_Y - component of the missile's velocity in the y dimension;
1) If (Plane_action = 0.5) then Plane_theta = Plane_theta + 90 * deg2rad;
   else Plane_theta = Plane_theta - 90 * deg2rad;
2) Repeat step 1 for the missile.
3) Normalize Plane_theta and Missile_theta.
4) Missile_velocity_X = Missile_speed * cos(Missile_theta);
5) Missile_velocity_Y = Missile_speed * sin(Missile_theta);
6) Repeat steps 4 and 5 for the plane.
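The pseudocode covers the heading and velocity updates; the sketch below also advances the positions, as the surrounding text indicates. The Euler position step, the player speeds, and the time step are assumptions for illustration.

    import math

    def step_player(theta, action, speed, x, y, dt):
        """One 90-degree turn followed by a position update for a single player."""
        theta += math.radians(90.0) if action == 0.5 else -math.radians(90.0)
        theta %= 2.0 * math.pi                        # normalize the heading
        vx = speed * math.cos(theta)                  # component velocities
        vy = speed * math.sin(theta)
        return theta, x + vx * dt, y + vy * dt        # assumed Euler position step

    def step_game(state, plane_action, missile_action, plane_speed, missile_speed, dt):
        p_theta, p_x, p_y, m_theta, m_x, m_y = state
        p_theta, p_x, p_y = step_player(p_theta, plane_action, plane_speed, p_x, p_y, dt)
        m_theta, m_x, m_y = step_player(m_theta, missile_action, missile_speed, m_x, m_y, dt)
        return (p_theta, p_x, p_y, m_theta, m_x, m_y)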
The reinforcement function R is a function of the distance between the players. A reinforcement
of 1 is given when the Euclidean distance between the players is greater than 2 units (plane
escapes). A reinforcement of -1 is given when the distance is less than 0.25 units (missile hits
plane). No reinforcement is given when the distance is in the range [0.25,2]. The missile seeks
to minimize reinforcement, while the plane seeks to maximize reinforcement.
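The reinforcement function translates directly into code; a minimal sketch follows, with the Euclidean distance computation as the only added detail.

    import math

    def reinforcement(plane_xy, missile_xy):
        """+1 when the plane escapes (distance > 2), -1 when the missile hits
        (distance < 0.25), and 0 otherwise."""
        dist = math.dist(plane_xy, missile_xy)
        if dist > 2.0:
            return 1.0      # plane escapes
        if dist < 0.25:
            return -1.0     # missile hits plane
        return 0.0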
The advantage function is approximated by a single-hidden-layer neural network with 50 hidden
nodes. The hidden-layer nodes each have a sigmoidal activation function, the output of which
lies in the range [-1,1]. The output of the network is a linear combination of the outputs of the
hidden-layer nodes with their associated weights. To speed the rate of convergence we used
IDD as described in Section 5. There are 6 inputs to the network. The first 4 inputs describe the
state and are normalized to the range [-1,1]. They consist of the differences in positions and
velocities of the players in both the x and y dimensions. The remaining inputs describe the
action to be taken by each player; 0.5 and -0.5 indicate left and right turns, respectively.
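A sketch of the advantage network as described: 6 inputs (the 4 normalized state differences plus the two action codes), 50 sigmoidal hidden nodes with outputs in [-1, 1], and a linear output. The use of tanh as the [-1, 1] sigmoid and the weight initialization are assumptions.

    import numpy as np

    rng = np.random.default_rng(0)
    W_hidden = rng.normal(scale=0.1, size=(50, 6))    # 6 inputs -> 50 hidden nodes
    w_out = rng.normal(scale=0.1, size=50)            # linear output weights

    def advantage(state4, plane_action, missile_action):
        """state4: the 4 normalized position and velocity differences in x and y.
        Actions are encoded as 0.5 (left turn) and -0.5 (right turn)."""
        x = np.concatenate([state4, [plane_action, missile_action]])
        h = np.tanh(W_hidden @ x)                     # hidden outputs lie in [-1, 1]
        return float(w_out @ h)                       # linear combination of hidden outputs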
7 RESULTS
Experiments were formulated to accomplish three objectives. The first objective was to
determine heuristically to what degree residual advantage learning could learn a reasonable
policy for the missile/aircraft system. In Harmon, Baird, and Klopf (1995) it was possible to
calculate the optimal weights for the quadratic function approximator used to represent the
advantage function. This is not the case for the current system. The nonlinear dynamics of this
system require more representational capacity than a simple quadratic network to store the
advantage function. In using a single hidden-layer sigmoidal network we gain the
representational capacity needed but lose the ability to calculate the optimal parameters for the
network, which would have been useful as a metric. For this reason, our metric is reduced to
simple observation of the system behavior, and is analogous to the metric used to evaluate
Tesauro’s TD-Gammon (Tesauro, 1990). Also, it is possible this game might be made less
difficult to solve if expressed in an appropriate coordinate system, such as plane and missile
centered polar coordinates. However, the motivation for this experiment is to demonstrate the
ability of residual algorithms to solve difficult, nonlinear control problems using a general neural
network. For this reason, the game was explicitly structured to be difficult to solve.
The second objective was to analyze the performance of three different forms of advantage
learning: the residual gradient form, the direct form, and a weighted average of the two (values
of φ in the range [0, 1]). The third and final objective was to evaluate the utility of IDD for this
test-bed, and to address the following question: when using residual algorithms, which method
increases the rate of convergence the most: (1) using the residual gradient form of the algorithm
with a second-order method, (2) using the residual form of the algorithm with an adaptive φ, or
(3) some combination of the two?
Addressing the first objective, the reinforcement learning system implementing the residual form
of advantage learning produced a reasonable policy after 800,000 training cycles. The missile
learned to pursue the plane, and the plane learned to evade the missile. Interesting behavior was
exhibited by both players under certain initial conditions. First, the plane learned that in some
cases it is able to evade the missile indefinitely by continuously flying in circles within the
missile’s turn radius. Second, the missile learned to anticipate the position of the plane. Rather
than heading directly toward the plane, the missile learned to lead the plane under appropriate
circumstances.
Figure 2 (a) Demonstration of the missile leading the plane after learning and (b) ultimately hitting the plane.
Figure 2 (a) Demonstration of the ability of the plane to survive indefinitely by flying in continuous circles within the
missile’s turn radius. (b) Demonstration of the learned behavior of the plane to turn toward the missile to increase the
distance between the two in the long term.
In Experiment 2, the effects of different values of φ, the weighting factor used in the linear
combination of the residual gradient update vector and the direct method update vector, and the
use of IDD on the learning system’s convergence rate were compared. For ∆t values of 1.0 and
0.1, twelve different runs were performed, each using identical parameters with the exception
of the weighting factor φ and the use or non-use of IDD. Figure 3 presents the results of these
experiments. A φ of 1 yields advantage learning in the residual gradient form, while a φ of 0
yields advantage learning implemented in the direct form.
Figure 3: φ comparison. Asymptotic Bellman error for each value of φ tested (φ = 1 yields the residual gradient
form), shown with IDD (idd, ∆t = 1) and with backpropagation alone (bp, ∆t = 1).
For this set of experiments the Bellman error was minimized the most by combining the use of
an adaptive φ with the use of IDD. The resultant control policy from these experiments also
produced aircraft trajectories that looked reasonable. Using IDD resulted in a
lower mean squared Bellman error for all values of φ, including the adaptive φ. Why this is the
case and how these mechanisms interact will be explored in future research.
8 CONCLUSIONS
The results gathered using the missile/aircraft test-bed provide evidence that residual forms of
reinforcement learning algorithms produce reinforcement learning systems that are stable and
converge to a local minimum of the mean squared Bellman residual when using general function
approximation systems. In general, non-linear problems of this type are difficult to solve with
classical game theory and control theory, and therefore appear to be good applications for
reinforcement learning.
The data also suggest that the use of second-order methods may be desirable or even necessary,
as was the case for this test-bed, to generate the desired control policy. Although much research
has investigated approaches for speeding rates of convergence, the results
gathered from applying these methods in supervised learning tasks may not necessarily hold true
for reinforcement learning tasks. This stems from fundamental differences in the nature of
supervised and reinforcement learning. For this reason, we feel that a rigorous comparison of
these methods implemented in a reinforcement learning system is an appropriate topic for future
research.
Acknowledgments
This research was supported under Task 2312R1 by the United States Air Force Office of
Scientific Research. Thanks to Michael Littman for a useful discussion of Markov games. We
also wish to thank Matt Rizki and Lou Tambourino of the Adaptive Vision Laboratory at Wright
State University for making available the use of their equipment.
References
Baird, L.C. (1993). Advantage updating. Wright-Patterson Air Force Base, OH: Wright
Laboratory. (Technical Report WL-TR-93-1146; available from the Defense Technical
Information Center, Cameron Station, Alexandria, VA 22304-6145).
Barto, A. G., Sutton, R. S., & Watkins, C. J. C. H. (1989). Learning and sequential decision
making. Technical Report 89-95, Department of Computer and Information Science, University
of Massachusetts, Amherst, Massachusetts. Also published in Learning and Computational
Neuroscience: Foundations of Adaptive Networks, Michael Gabriel and John Moore, editors.
MIT Press, Cambridge MA (1991).
Boyan, J.A., and Moore, A.W. (1995). Generalization in reinforcement learning: Safely
approximating the value function. In Tesauro, G., Touretzky, D.S., and Leen, T.K. (eds.),
Advances in Neural Information Processing Systems 7. MIT Press, Cambridge MA.
Isaacs, Rufus (1965). Differential games. New York: John Wiley and Sons, Inc.
Jacobs, R. A. (1988). Increased rates of convergence through learning rate adaptation. Neural
Networks 1, pp. 295-307.
Harmon, M.E., Baird, L.C., & Klopf, A.H. (1995). Advantage updating applied to a differential
game. In Tesauro, G., Touretzky, D.S., and Leen, T.K. (eds.), Advances in Neural Information
Processing Systems 7. MIT Press, Cambridge MA.
Howard, R. A. (1960). Dynamic Programming and Markov Processes. MIT Press, Cambridge
MA.
Rajan, N., Prasad, U. R., and Rao, N. J. (1980). Pursuit-evasion of two aircraft in a horizontal
plane. Journal of Guidance and Control. 3(3), May-June, 261-267.
Rumelhart, D., Hinton, G., & Williams, R. (1986). Learning representations by backpropagating
errors. Nature. 323, 9 October, 533-536.
Von Neumann, J., and Morgenstern, O. (1947). Theory of Games and Economic Behavior.
Princeton University Press, Princeton NJ.
Widrow, B., and Stearns, S. D. (1985). Adaptive Signal Processing. Englewood Cliffs, NJ:
Prentice-Hall.