Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence

Expressing Arbitrary Reward Functions as Potential-Based Advice

Anna Harutyunyan (Vrije Universiteit Brussel), Sam Devlin (University of York), Peter Vrancx (Vrije Universiteit Brussel), Ann Nowé (Vrije Universiteit Brussel)

Abstract

Effectively incorporating external advice is an important problem in reinforcement learning, especially as it moves into the real world. Potential-based reward shaping is a way to provide the agent with a specific form of additional reward, with the guarantee of policy invariance. In this work we give a novel way to incorporate an arbitrary reward function with the same guarantee, by implicitly translating it into the specific form of dynamic advice potentials, which are maintained as an auxiliary value function learnt at the same time. We show that advice provided in this way captures the input reward function in expectation, and demonstrate its efficacy empirically.

Introduction

The term shaping in experimental psychology (dating at least as far back as (Skinner 1938)) refers to the idea of rewarding all behavior leading to the desired behavior, instead of waiting for the subject to exhibit it autonomously (which, for complex tasks, may take prohibitively long). For example, Skinner discovered that, in order to train a rat to push a lever, any movement in the direction of the lever had to be rewarded. Reinforcement learning (RL) is a framework where an agent learns from interaction with the environment, typically in a tabula rasa manner, guaranteed to learn the desired behavior eventually. As with Skinner's rat, the RL agent may take a very long time to stumble upon the target lever if the only reinforcement (or reward) it receives is after that fact, and shaping is used to speed up the learning process by providing additional rewards. Shaping in RL has been linked to reward functions from very early on: Mataric (1994) interpreted shaping as designing a more complex reward function, Dorigo and Colombetti (1997) used shaping on a real robot to translate expert instructions into reward for the agent as it executed a task, and Randløv and Alstrøm (1998) proposed learning a hierarchy of RL signals in an attempt to separate the extra reinforcement function from the base task. It is in the same paper that they uncover the issue of modifying the reward signals in an unconstrained way: when teaching an agent to ride a bicycle, and encouraging progress towards the goal, the agent would get "distracted", and instead learn to ride in a loop and collect the positive reward forever. This issue of positive reward cycles is addressed by Ng, Harada, and Russell (1999), where they devise their potential-based reward shaping (PBRS) framework, which constrains the shaping reward to have the form of a difference of a potential function of the transitioning states. In fact, they prove a stronger claim that such a form is necessary¹ for leaving the original task unchanged. This elegant and implementable framework led to an explosion of reward shaping research and proved to be extremely effective (Asmuth, Littman, and Zinkov 2008), (Devlin, Kudenko, and Grzes 2011), (Brys et al. 2014), (Snel and Whiteson 2014). Wiewiora, Cottrell, and Elkan (2003) extended PBRS to state-action advice potentials, and Devlin and Kudenko (2012) recently generalized PBRS to handle dynamic potentials, allowing potential functions to change online whilst the agent is learning.

Additive reward functions from early reward shaping research, while dangerous to policy preservation, were able to convey behavioral knowledge (e.g. expert instructions) directly. Potential functions require an additional abstraction, and restrict the form of the additional effective reward, but provide crucial theoretical guarantees. We seek to bridge this gap between the available behavioral knowledge and the effective potential-based rewards.

This paper gives a novel way to specify the effective shaping rewards directly through an arbitrary reward function, while implicitly maintaining the grounding in potentials that is necessary for policy invariance. For this, we first extend Wiewiora's advice framework to dynamic advice potentials. We then propose to learn, in parallel, a secondary value function w.r.t. a variant of our arbitrary reward function, and use its successive estimates as our dynamic advice potentials. We show that the effective shaping rewards then reflect the input reward function in expectation. Empirically, we first demonstrate our method to avoid the issue of positive reward cycles on a grid-world task, when given the same behavior knowledge that trapped the bicyclist from (Randløv and Alstrøm 1998). We then show an application, where our dynamic (PB) value-function advice outperforms other reward-shaping methods that encode the same knowledge, as well as a shaping w.r.t. a different popular heuristic.

¹ Given no knowledge of the MDP dynamics.

Copyright © 2015, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
Background

We assume the usual reinforcement learning framework (Sutton and Barto 1998), in which the agent interacts with its Markovian environment at discrete time steps t = 1, 2, . . .. Formally, a Markov decision process (MDP) (Puterman 1994) is a tuple M = ⟨S, A, γ, T, R⟩, where: S is a set of states, A is a set of actions, γ ∈ [0, 1] is the discounting factor, T = {P_sa(·) | s ∈ S, a ∈ A} are the next-state transition probabilities with P_sa(s′) specifying the probability of state s′ occurring upon taking action a from state s, and R : S × A → ℝ is the expected reward function with R(s, a) giving the expected (w.r.t. T) value of the reward that will be received when a is taken in state s. R(s, a, s′)² and r_{t+1} denote the components of R at transition (s, a, s′) and at time t, respectively.

A (stochastic) Markovian policy π : S × A → ℝ is a probability distribution over actions at each state, so that π(s, a) gives the probability of action a being taken from state s under policy π. We will use π(s, a) = 1 and π(s) = a interchangeably. Value-based methods encode policies through value functions (VF), which denote the expected cumulative reward obtained while following the policy. We focus on state-action value functions. In a discounted setting:

    Q^π(s, a) = E_{T,π}[ Σ_{t=0}^∞ γ^t r_{t+1} | s_0 = s, a_0 = a ]    (1)

We will omit the subscripts on E from now on, and imply all expectations to be w.r.t. T, π. A (deterministic) greedy policy is obtained by picking the action of maximum value at each state:

    π(s) = arg max_a Q(s, a)    (2)

A policy π* is optimal if its value is largest:

    Q*(s, a) = sup_π Q^π(s, a), ∀s, a

When the Q-values are accurate for a given policy π, they satisfy the following recursive relation (Bellman 1957):

    Q^π(s, a) = R(s, a) + γE[Q^π(s′, a′)]    (3)

The values can be learned incrementally by the following update:

    Q_{t+1}(s_t, a_t) = Q_t(s_t, a_t) + α_t δ_t    (4)

where Q_t denotes an estimate of Q^π at time t, α_t ∈ (0, 1) is the learning rate at time t, and

    δ_t = r_{t+1} + γQ_t(s_{t+1}, a_{t+1}) − Q_t(s_t, a_t)    (5)

is the temporal-difference (TD) error of the transition, in which a_t and a_{t+1} are both chosen according to π. This process is shown to converge to the correct value estimates (the TD-fixpoint) in the limit under standard approximation conditions (Jaakkola, Jordan, and Singh 1994).

² R is a convention from MDP literature. In reinforcement learning it is more common to stick with R(s, a, s′) for specification, but we will refer to the general form in our derivations.
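The update in Eqs. (4)-(5) is the standard on-policy Sarsa rule. A minimal tabular sketch is shown below; it is not the paper's code, and the dictionary-based Q table and the particular alpha and gamma values are illustrative assumptions.

```python
from collections import defaultdict

# Tabular Sarsa update implementing Eqs. (4)-(5):
#   delta_t = r_{t+1} + gamma * Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)
#   Q(s_t, a_t) <- Q(s_t, a_t) + alpha_t * delta_t
Q = defaultdict(float)   # Q[(state, action)], initialized to 0
gamma = 0.99             # discount factor (illustrative value)
alpha = 0.05             # learning rate (illustrative value)

def sarsa_update(s, a, r, s_next, a_next):
    """One on-policy TD update; a_next must be chosen by the behaviour policy."""
    delta = r + gamma * Q[(s_next, a_next)] - Q[(s, a)]   # TD error, Eq. (5)
    Q[(s, a)] += alpha * delta                            # Eq. (4)
    return delta
```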
Reward Shaping

The most general form of reward shaping in RL can be given as modifying the reward function of the underlying MDP:

    R′ = R + F

where R is the (transition) reward function of the base problem, and F is the shaping reward function, with F(s, a, s′) giving the additional reward on the transition (s, a, s′), and f_t defined analogously to r_t. We will always refer to the framework as reward shaping, and to the auxiliary reward itself as the shaping reward.

PBRS (Ng, Harada, and Russell 1999) maintains a potential function Φ : S → ℝ, and constrains the shaping reward function F to the following form:

    F(s, s′) = γΦ(s′) − Φ(s)    (6)

where γ is the discounting factor of the MDP. Ng et al. (1999) show that this form is both necessary and sufficient for policy invariance.

Wiewiora et al. (2003) extend PBRS to advice potential functions defined over the joint state-action space. Note that this extension adds a dependency of F on the policy being followed (in addition to the executed transition). The authors consider two types of advice: look-ahead and look-back, providing the theoretical framework for the former:

    F(s, a, s′, a′) = γΦ(s′, a′) − Φ(s, a)    (7)

Devlin and Kudenko (2012) generalize the form in Eq. (6) to dynamic potentials, by including a time parameter, and show that all theoretical properties of PBRS hold:

    F(s, t, s′, t′) = γΦ(s′, t′) − Φ(s, t)    (8)
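To make Eqs. (6) and (7) concrete, the shaping reward is simply a discounted difference of potentials. The following sketch is illustrative only; the placeholder potential and the function names are assumptions, not taken from the paper.

```python
# Potential-based shaping, Eq. (6): F(s, s') = gamma * Phi(s') - Phi(s)
def shaping_reward_state(phi, s, s_next, gamma):
    return gamma * phi(s_next) - phi(s)

# Look-ahead advice over state-action potentials, Eq. (7):
#   F(s, a, s', a') = gamma * Phi(s', a') - Phi(s, a)
def shaping_reward_advice(phi, s, a, s_next, a_next, gamma):
    return gamma * phi(s_next, a_next) - phi(s, a)

# Example placeholder state potential: negative Manhattan distance to a goal cell.
goal = (20, 20)
def phi_distance(s):
    x, y = s
    return -(abs(goal[0] - x) + abs(goal[1] - y))
```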
From Reward Functions to Dynamic Potentials

There are two (inter-related) problems in PBRS: efficacy and specification. The former has to do with designing the best potential functions, i.e. those that offer the quickest and smoothest guidance. The latter refers to capturing the available domain knowledge into a potential form, in the easiest and most effective way. This work primarily deals with that latter question.

Locking knowledge in the form of potentials is a convenient theoretical paradigm, but may be restrictive when considering all types of domain knowledge, in particular behavioral knowledge, which is likely to be specified in terms of actions. Say, for example, an expert wishes to encourage an action a in a state s. If, following the advice framework, she sets Φ(s, a) = 1, with Φ zero-valued elsewhere, the shaping reward associated with the transition (s, a, s′) will be F(s, a, s′, a′) = Φ(s′, a′) − Φ(s, a) = 0 − 1 = −1, so long as the pair (s′, a′) is different from (s, a).³ The favorable behavior (s, a) will then factually be discouraged. She could avoid this by further specifying Φ for state-actions reachable from s via a, but that would require knowledge of the MDP. What she would thus like to do is to be able to specify the desired effective shaping reward F directly, but without sacrificing the optimality guarantee provided by the potential-based framework.

This work formulates a framework to do just that. Given an arbitrary reward function R†, we wish to achieve F ≈ R†, while maintaining policy invariance. This question is equivalent to seeking a potential function Φ, based on R†, s.t. F^Φ ≈ R†, where (and in the future) we take F^Φ to mean a potential-based shaping reward w.r.t. Φ.

The core idea of our approach is to introduce a secondary (state-action) value function Φ, which, concurrently with the main process, learns on the negation of the expert-provided R†, and to use the consecutively updated values of Φ_t as a dynamic potential function, thus making the translation into potentials implicit. Formally:

    R^Φ = −R†    (9)

    Φ_{t+1}(s, a) = Φ_t(s, a) + β_t δ_t^Φ    (10)

    δ_t^Φ = r^Φ_{t+1} + γΦ_t(s_{t+1}, a_{t+1}) − Φ_t(s_t, a_t)    (11)

where β_t is the learning rate at time t, and a_{t+1} is chosen according to the policy π w.r.t. the value function Q of the main task. The shaping reward is then of the form:

    f_{t+1} = γΦ_{t+1}(s_{t+1}, a_{t+1}) − Φ_t(s_t, a_t)    (12)

The intuition of the correspondence between R† and F lies in the relation between the Bellman equation (for Φ):

    Φ^π(s, a) = −R†(s, a) + γΦ^π(s′, a′)    (13)

and shaping rewards from an advice potential function:

    F(s, a) = γΦ(s′, a′) − Φ(s, a) = R†(s, a)    (14)

This intuition will be made more precise later.

³ Assume the example is undiscounted for clarity.
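A compact sketch of this construction, under the assumption of a tabular representation, is given below. The Phi table, the step size beta, and the r_dagger callable are illustrative names, not the paper's code.

```python
from collections import defaultdict

gamma = 0.99              # discount factor shared with the main task (illustrative)
beta = 0.1                # learning rate for Phi (illustrative)
Phi = defaultdict(float)  # secondary value function of Eq. (10), initialized to 0

def dynamic_advice_step(s, a, s_next, a_next, r_dagger):
    """One step of the implicit translation of R† into dynamic advice potentials.

    r_dagger(s, a): the expert-provided reward R†(s, a).
    a_next must be chosen by the main task's policy, as the text above specifies.
    Returns f_{t+1}, the shaping reward of Eq. (12).
    """
    phi_old = Phi[(s, a)]
    r_phi = -r_dagger(s, a)                                       # Eq. (9): R^Phi = -R†
    delta_phi = r_phi + gamma * Phi[(s_next, a_next)] - phi_old   # Eq. (11)
    Phi[(s, a)] += beta * delta_phi                               # Eq. (10)
    # Eq. (12): f_{t+1} = gamma * Phi_{t+1}(s', a') - Phi_t(s, a)
    return gamma * Phi[(s_next, a_next)] - phi_old
```

The returned shaping reward would then be added to the environment reward in the main Q update, so the agent learns on R + F while Φ is learnt alongside it.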
Theory

This section is organized as follows. First, we extend the potential-based advice framework to dynamic potential-based advice, and ensure that the desired guarantees hold. (Our dynamic (potential-based) value-function advice is then an instance of dynamic potential-based advice.) We then turn to the question of correspondence between R† and F, showing that F captures R† in expectation. Finally, we ensure that these expectations are meaningful, by arguing convergence.

Dynamic Potential-Based Advice

Analogously to (Devlin and Kudenko 2012), we augment Wiewiora's look-ahead advice function (Eq. (7)) with a time parameter to obtain our dynamic potential-based advice: F(s, a, t, s′, a′, t′) = γΦ(s′, a′, t′) − Φ(s, a, t), where t/t′ is the time of the agent visiting state s/s′ and taking action a/a′. For notational compactness we rewrite the form as:

    F(s, a, s′, a′) = γΦ_{t′}(s′, a′) − Φ_t(s, a)    (15)

where we implicitly associate s with s_t, s′ with s_{t′}, and F(s, a, s′, a′) with F(s, a, t, s′, a′, t′). As with Wiewiora's framework, F is now not only a function of the transition (s, a, s′), but also of the following action a′, which adds a dependence on the policy the agent is currently evaluating.

We examine the change in the optimal Q-values of the original MDP, resulting from adding F to the base reward function R:

    Q*(s, a) = E[ Σ_{t=0}^∞ γ^t (r_{t+1} + f_{t+1}) | s_0 = s, a_0 = a ]
             = E[ Σ_{t=0}^∞ γ^t (r_{t+1} + γΦ_{t+1}(s_{t+1}, a_{t+1}) − Φ_t(s_t, a_t)) ]    (by Eq. (12))
             = E[ Σ_{t=0}^∞ γ^t r_{t+1} ] + E[ Σ_{t=1}^∞ γ^t Φ_t(s_t, a_t) ] − E[ Σ_{t=0}^∞ γ^t Φ_t(s_t, a_t) ]
             = E[ Σ_{t=0}^∞ γ^t r_{t+1} ] − Φ_0(s, a)    (16)

Thus, once the optimal policy w.r.t. R + F is learnt, to uncover the optimal policy w.r.t. R, one may use the biased greedy action selection (Wiewiora, Cottrell, and Elkan 2003) w.r.t. the initial values of the dynamic advice function:

    π(s) = arg max_a [ Q(s, a) + Φ_0(s, a) ]    (17)

Notice that when the advice function is initialized to 0, the biased greedy action selection above reduces to the basic greedy policy (Eq. (2)), allowing one to use dynamic advice as seamlessly as simple state potentials.

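A one-line illustration of the biased greedy selection in Eq. (17), assuming tabular Q and the stored initial advice values Phi0 (the names are illustrative, not the paper's):

```python
def biased_greedy_action(Q, Phi0, s, actions):
    """Eq. (17): pick argmax_a [Q(s, a) + Phi_0(s, a)].

    With Phi0 identically zero this reduces to the ordinary greedy policy of Eq. (2).
    """
    return max(actions, key=lambda a: Q[(s, a)] + Phi0[(s, a)])
```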
Shaping in Expectation

Let R† be an arbitrary reward function, and let Φ be the state-action value function that learns on R^Φ = −R†, while following some fixed policy π. The shaping reward at timestep t w.r.t. Φ as a dynamic advice function is given by:

    f_{t+1} = γΦ_{t+1}(s_{t+1}, a_{t+1}) − Φ_t(s_t, a_t)
            = γΦ_t(s_{t+1}, a_{t+1}) − Φ_t(s_t, a_t) + γΦ_{t+1}(s_{t+1}, a_{t+1}) − γΦ_t(s_{t+1}, a_{t+1})
            = δ_t^Φ − r^Φ_{t+1} + γΔΦ(s_{t+1}, a_{t+1})    (by Eq. (11))
            = r†_{t+1} + δ_t^Φ + γΔΦ(s_{t+1}, a_{t+1})    (18)

where ΔΦ(s_{t+1}, a_{t+1}) = Φ_{t+1}(s_{t+1}, a_{t+1}) − Φ_t(s_{t+1}, a_{t+1}) denotes the change in Φ caused by the update at time t. Now assume the process has converged to the TD-fixpoint Φ^π. Then:

    F(s, a, s′, a′) = γΦ^π(s′, a′) − Φ^π(s, a)
                    = γΦ^π(s′, a′) − R^Φ(s, a) − γE[Φ^π(s′, a′)]    (by Eq. (3))
                    = R†(s, a) + γ(Φ^π(s′, a′) − E[Φ^π(s′, a′)])    (by Eq. (9))    (19)

Thus, we obtain that the shaping reward F w.r.t. the converged values Φ^π reflects the expected designer reward R†(s, a) plus a bias term, which measures how different the sampled next state-action value is from the expected next state-action value. This bias will at each transition further encourage transitions that are "better than expected" (and vice versa), similarly, e.g., to "better-than-average" (and vice versa) rewards in R-learning (Schwartz 1993).

To obtain the expected shaping reward F(s, a), we take the expectation w.r.t. the transition matrix T, and the policy π with which a′ is chosen:

    F(s, a) = E[F(s, a, s′, a′)]
            = R†(s, a) + γE[Φ^π(s′, a′) − E[Φ^π(s′, a′)]]
            = R†(s, a)    (20)

Thus, Eq. (18) gives the shaping reward while the Φ values are not yet converged, (19) gives the component of the shaping reward on a transition after Φ^π is correct, and (20) establishes the equivalence of F and R† in expectation.⁴

Convergence of Φ

If the policy π is fixed, and the Q^π-estimates are correct, the expectations in the previous section are well-defined, and Φ converges to the TD-fixpoint. However, Φ is learnt at the same time as Q. This process can be shown to converge by formulating the framework on two timescales (Borkar 1997), and using the ODE method of Borkar and Meyn (2000). We thus require⁵ that the step size schedules {α_t} and {β_t} satisfy the following:

    lim_{t→∞} α_t / β_t = 0    (21)

Q and Φ correspond to the slower and faster timescales, respectively. Given that step-size schedule difference, we can rewrite the iterations (for Q and Φ) as one iteration with a combined parameter vector, and show that the assumptions (A1)-(A2) from Borkar and Meyn (2000) are satisfied, which allows us to apply their Theorem 2.2. This analysis is analogous to that of the convergence of TD with Gradient Correction (Theorem 2 in (Sutton et al. 2009)), and is left out for clarity of exposition.

Note that this convergence is needed to assure that Φ indeed captures the expert reward function R†. The form of general dynamic advice from Eq. (15) itself does not pose any requirements on the convergence properties of Φ to guarantee policy invariance.

⁴ Note that function approximation will introduce an (unavoidable) approximation-error term in these equations.
⁵ In addition to the standard stochastic approximation assumptions, common to all TD algorithms.
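The experiments below use an exponentially decaying α_t together with a constant β_t, which satisfies Eq. (21). A sketch of such a schedule follows; the default constants mirror the values reported later, but the generator itself is only an illustrative assumption.

```python
def step_sizes(alpha0=0.05, tau=0.999, beta=0.1):
    """Yield (alpha_t, beta_t) with alpha_t = alpha0 * tau**t and a constant beta_t.

    Since alpha_t / beta_t = (alpha0 / beta) * tau**t -> 0, Eq. (21) holds:
    Q moves on the slower timescale, Phi on the faster one.
    """
    t = 0
    while True:
        yield alpha0 * (tau ** t), beta
        t += 1
```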
Experiments

We first demonstrate our method correctly solving a grid-world task, as a simplified instance of the bicycle problem. We then assess the practical utility of our framework on a larger cart-pole benchmark, and show that our dynamic (PB) VF advice approach beats other methods that use the same domain knowledge, as well as a popular static shaping w.r.t. a different heuristic.

Grid-World

The problem described is a minimal working example of the bicycle problem (Randløv and Alstrøm 1998), illustrating the issue of positive reward cycles.

Given a 20 × 20 grid, the goal is located at the bottom right corner (20, 20). The agent must reach it from its initial position (0, 0) at the top left corner, upon which event it will receive a positive reward. The reward on the rest of the transitions is 0. The actions correspond to the 4 cardinal directions, and the state is the agent's position coordinates (x, y) in the grid. The episode terminates when the goal is found, or when 10000 steps have elapsed.

Given approximate knowledge of the problem, a natural heuristic to encourage is transitions that move the agent to the right, or down, as they advance the agent closer to the goal. A reward function R† encoding this heuristic can be defined as

    R†(s, right) = R†(s, down) = c,    c ∈ ℝ+, ∀s

When provided naïvely (i.e. with F = R†), the agent is at risk of getting "distracted": getting stuck in a positive reward cycle, and never reaching the goal. We apply our framework, and learn the corresponding Φ w.r.t. R^Φ = −R†, setting F accordingly (Eq. (12)). We compare that setting with the base learner and with the non-potential-based naïve learner.⁶

Learning was done via Sarsa with ε-greedy action selection, ε = 0.1. The learning parameters were tuned to the following values: γ = 0.99, c = 1, and α_{t+1} = τα_t decaying exponentially (so as to satisfy the condition in Eq. (21)), with α_0 = 0.05, τ = 0.999, and β_t = 0.1.

We performed 50 independent runs, 100 episodes each (Fig. 1(a)). Observe that the performance of the (non-PB) agent learning with F = R† actually got worse with time, as it discovered a positive reward cycle, and got more and more disinterested in finding the goal. Our agent, armed with the same knowledge, used it properly (in a true potential-based manner), and the learning was accelerated significantly compared to the base agent.

⁶ To illustrate our point more clearly in the limited space, we omit the static PB variant with (state-only) potentials Φ(x, y) = x + y. It depends on a different type of knowledge (about the state), while this experiment compares two ways to utilize the behavioral reward function R†. The static variant does not require learning, and hence performs better in the beginning.
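A sketch of the grid-world heuristic R† above, written as code; the function and constant names are illustrative (the constant c = 1 matches the tuned value reported above), and the cycle example in the comment is ours, not the paper's.

```python
# R†(s, right) = R†(s, down) = c for every state s; 0 for the other two actions.
c = 1.0
ENCOURAGED = {"right", "down"}

def r_dagger_gridworld(s, a):
    """Expert heuristic: encourage moving right or down, toward the goal at (20, 20)."""
    return c if a in ENCOURAGED else 0.0

# Naive (non-PB) advice adds this directly to the base reward (F = R†) and can create
# positive reward cycles (e.g. alternating right/left collects +c every two steps);
# the dynamic advice instead learns Phi on -R† and shapes with Eq. (12).
```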
[Figure 1: two learning-curve panels plotting Steps against Episodes; (a) Grid-world with variants Base, Non-PB advice, and Dynamic advice; (b) Cart-pole with variants Base, Non-PB advice, Myopic PB advice, Static PB (angle), and Dynamic (PB) VF advice.]

Figure 1: Mean learning curves. Shaded areas correspond to the 95% confidence intervals. The plot is smoothed by taking a running average of 10 episodes. (a) The same reward function added directly to the base reward function (non-PB advice) diverges from the optimal policy, whereas our automatic translation to dynamic-PB advice accelerates learning significantly. (b) Our dynamic (PB) VF advice learns to balance the pole the soonest, and has the lowest variance.

Cart-Pole

We now evaluate our approach on a more difficult cart-pole benchmark (Michie and Chambers 1968). The task is to balance a pole on top of a moving cart for as long as possible. The (continuous) state contains the angle ξ and angular velocity ξ̇ of the pole, and the position x and velocity ẋ of the cart. There are two actions: a small positive and a small negative force applied to the cart. A pole falls if |ξ| > π/4, which terminates the episode. The track is bounded within [−4, 4], but the sides are "soft"; the cart does not crash upon hitting them. The reward function penalizes a pole drop, and is 0 elsewhere. An episode terminates successfully if the pole was balanced for 10000 steps.

An intuitive behavior to encourage is moving the cart to the right (or left) when the pole is leaning rightward (or leftward). Let o : S × A → {0, 1} be the indicator function denoting such orientation of state s and action a. A reward function to encompass the rule can then be defined as:

    R†(s, a) = o(s, a) × c,    c ∈ ℝ+
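One possible reading of the indicator o(s, a) as code is sketched below; the state layout and the ±1 action encoding are assumptions for illustration, not taken from the paper's implementation, and c = 0.1 matches the tuned value reported later for this task.

```python
c = 0.1  # tuned value reported below (illustrative placement here)

def orientation(state, action):
    """o(s, a) = 1 if the force pushes the cart toward the side the pole leans to."""
    xi, xi_dot, x, x_dot = state   # pole angle/velocity, cart position/velocity (assumed layout)
    push_right = action == +1      # assume the two actions are encoded as +1 / -1
    return 1 if (xi > 0) == push_right else 0

def r_dagger_cartpole(state, action):
    """R†(s, a) = o(s, a) * c: reward pushes in the direction the pole is leaning."""
    return orientation(state, action) * c
```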
We compare the performance of our agent to the base learner and two other reward shaping schemes that reflect the same knowledge about the desired behavior, and one that uses different knowledge (about the angle of the pole).⁷ The variants are described more specifically below, and sketched in code after the list:

1. (Base) The base learner, F_1 := 0.

2. (Non-PB advice) Advice is received simply by adding R† to the main reward function, F_2 := R†. This method will lose some optimal policies.

3. (Myopic PB advice) Potentials are initialized and maintained with R†, i.e. F_3 := F^Φ with Φ = R†. This is closest to Wiewiora's look-ahead advice framework.

4. (Static PB shaping with angle) The agent is penalized proportionally to the angle with which it deviates from equilibrium: F_4 := F^Φ with Φ ∼ −|ξ|².

5. (Dynamic (PB) VF advice) We learn Φ as a value function w.r.t. R^Φ = −R†, and set F_5 = F^Φ accordingly (Eq. (12)).

⁷ Note that unlike our behavioral encouragement, the angle shaping requires precise information about the state, which is more demanding in a realistic setup, where the advice comes from an external observer.
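For concreteness, the five shaping rewards can be written side by side as below. This is a sketch under assumed interfaces: it reuses r_dagger_cartpole from the previous sketch, stands in callables for the tile-coded approximator of Φ, and none of the names come from the paper's code.

```python
def F1(s, a, s_next, a_next, gamma):
    """Base learner: no shaping."""
    return 0.0

def F2(s, a, s_next, a_next, gamma):
    """Non-PB advice: add R† directly (loses some optimal policies)."""
    return r_dagger_cartpole(s, a)

def F3(s, a, s_next, a_next, gamma):
    """Myopic PB advice: potentials fixed to R† itself."""
    return gamma * r_dagger_cartpole(s_next, a_next) - r_dagger_cartpole(s, a)

def F4(s, a, s_next, a_next, gamma):
    """Static PB shaping with angle: state potential proportional to -|xi|^2."""
    phi = lambda state: -abs(state[0]) ** 2
    return gamma * phi(s_next) - phi(s)

def F5(s, a, s_next, a_next, gamma, phi_next, phi_prev):
    """Dynamic (PB) VF advice, Eq. (12): consecutive estimates of the learned Phi."""
    return gamma * phi_next(s_next, a_next) - phi_prev(s, a)
```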

We used tile coding (Sutton and Barto 1998) with 10 tilings of 10 × 10 to represent the continuous state. Learning was done via Sarsa(λ) with eligibility traces and ε-greedy action selection, ε = 0.1. The learning parameters were tuned to the following: λ = 0.9, c = 0.1, and α_{t+1} = τα_t decaying exponentially (so as to satisfy the condition in Eq. (21)), with α_0 = 0.05, τ = 0.999, and β_t = 0.2. We found γ to affect the results differently across variants, with the following best values: γ_1 = 0.8, γ_2 = γ_3 = γ_4 = 0.99, γ_5 = 0.4. M_i is then the MDP ⟨S, A, γ_i, T, R + F_i⟩.

Fig. 1(b) gives the comparison across the M_i (i.e. the best γ values for each variant), whereas Table 1 also contains the comparison w.r.t. the base value γ = γ_1.

We performed 50 independent runs of 100 episodes each (Table 1). Our method beats the alternatives in both the fixed and tuned γ scenarios, converging to the optimal policy reliably after 30 episodes in the latter (Fig. 1(b)). Paired t-tests on the sums of steps of all episodes per run for each pair of variants confirm all variants as significantly different with p < 0.05. Notice that the non-potential-based variant for this problem does not perform as poorly as on the grid-world task. The reason for this is that getting stuck in a positive reward cycle can be good in cart-pole, as the goal is to continue the episode for as long as possible. However, consider the policy that keeps the pole at an equilibrium (at ξ = 0). While clearly optimal in the original task, this policy will not be optimal in M_2, as it will yield 0 additional rewards.

Table 1: Cart-pole results (steps balanced, mean ± standard error). The final performance refers to the last 10% of the run. Dynamic (PB) VF advice has the highest mean and lowest variance in both the tuned and fixed γ scenarios, and is the most robust, whereas myopic shaping proved to be especially sensitive to the choice of γ.

                             Best γ values                         Base γ = 0.8
  Variant                    Final             Overall             Final             Overall
  Base                       5114.7 ± 188.7    3121.8 ± 173.6      5114.7 ± 165.4    3121.8 ± 481.3
  Non-PB advice              9511.0 ± 37.2     6820.6 ± 265.3      6357.1 ± 89.1     3405.2 ± 245.2
  Myopic PB shaping          8618.4 ± 107.3    3962.5 ± 287.2      80.1 ± 0.3        65.8 ± 0.9
  Static PB                  9860.0 ± 56.1     8292.3 ± 261.6      3744.6 ± 136.2    2117.5 ± 102.0
  Dynamic (PB) VF advice     9982.4 ± 18.4     9180.5 ± 209.8      8662.2 ± 60.9     5228.0 ± 274.0

Discussion

Choice of R†  The given framework is general enough to capture any form of the reward function R†. Recall, however, that F = R† holds after the Φ values have converged. Thus, the simpler the provided reward function R†, the sooner the effective shaping reward will capture it. In this work, we have considered reward functions R† of the form R†(B) = c, c > 0, where B is the set of encouraged behavior transitions. This follows the convention of shaping in psychology, where punishment is implicit as the absence of positive encouragement. Due to the expectation terms in F, we expect such a form (of all-positive, or all-negative R†) to be more robust. Another assumption is that all encouraged behaviors are encouraged equally; one may easily extend this to varying preferences c_1 < . . . < c_k, and consider a choice between expressing them within a single reward function, or learning a separate value function for each signal c_i.

Role of discounting  Discounting factors γ in RL determine how heavily the future rewards are discounted, i.e. the reward horizon. Smaller γ's (i.e. heavier discounting) yield quicker convergence, but may be insufficient to convey long-term goals. In our framework, the value of γ plays two separate roles in the learning process, as it is shared between Φ and Q. Firstly, it determines how quickly the Φ values converge. Since we are only interested in the difference of consecutive Φ-values, smaller γ's provide a more stable estimate, without losses. On the other hand, if the value is too small, Q will lose sight of the long-term rewards, which is detrimental to performance if the rewards are for the base task alone. We, however, are considering the shaped rewards. Since shaped rewards provide informative immediate feedback, it becomes less important to look far ahead into the future. This notion is formalized by Ng (2003), who proves (in Theorem 3) that a "good" potential function shortens the reward horizon of the original problem. Thus γ, in a sense, balances the stability of learning Φ with the length of the shaped reward horizon of Q.

Related Work

The correspondence between value and potential functions has been known since the conception of the latter. Ng et al. (1999) point out that the optimal potential function is the true value function itself (as in that case the problem reduces to learning the trivial zero value function). With this insight, there have been attempts to simultaneously learn the base value function at coarser and finer granularities (of function approximation), and use the (quicker-to-converge) former as a potential function for the latter (Grzes and Kudenko 2008). Our approach is different in that our value functions learn on different rewards with the same state representation, and it tackles the question of specification rather than efficacy.

On the other hand, there has been a lot of research in human-provided advice (Thomaz and Breazeal 2006), (Knox et al. 2012). This line of research (interactive shaping) typically uses the human advice component heuristically as a (sometimes annealed) additive component in the reward function, which does not follow the potential-based framework (and thus does not preserve policies). Knox and Stone (2012) do consider PBRS as one of their methods, but (a) stay strictly myopic (similar to the third variant in the cart-pole experiment), and (b) limit themselves to state potentials. Our approach is different in that it incorporates the external advice through a value function, and stays entirely sound in the PBRS framework.

Conclusions and Outlook

In this work, we formulated a framework which allows one to specify the effective shaping reward directly. Given an arbitrary reward function, we learned a secondary value function, w.r.t. a variant of that reward function, concurrently to the main task, and used the consecutive estimates of that value function as dynamic advice potentials. We showed that the shaping reward resulting from this process captures the input reward function in expectation. We presented empirical evidence that the method behaves in a true potential-based manner, and that such encoding of the behavioral domain knowledge speeds up learning significantly more, compared to its alternatives. The framework induces little added complexity: the maintenance of the auxiliary value function is linear in time and space (Modayil et al. 2012), and, when initialized to 0, the optimal base value function is unaffected.

We intend to further consider inconsistent reward functions, with an application to humans directly providing advice. The challenges are then to analyze the expected effective rewards, as the convergence of a TD-process w.r.t. inconsistent rewards is less straightforward. In this work we identified the secondary discounting factor γ^Φ with the primary γ. This need not be the case, in general: γ^Φ = νγ. Such a modification adds an extra Φ-term into Eq. (20), potentially offering gradient guidance, which is useful if the expert reward is sparse.

Acknowledgments

Anna Harutyunyan is supported by the IWT-SBO project MIRAD (grant nr. 120057).

References

Asmuth, J.; Littman, M. L.; and Zinkov, R. 2008. Potential-based shaping in model-based reinforcement learning. In Proceedings of the 23rd AAAI Conference on Artificial Intelligence (AAAI-08), 604–609. AAAI Press.

Bellman, R. 1957. Dynamic Programming. Princeton, NJ, USA: Princeton University Press, 1st edition.

Borkar, V. S., and Meyn, S. P. 2000. The ODE method for convergence of stochastic approximation and reinforcement learning. SIAM Journal on Control and Optimization 38(2):447–469.

Borkar, V. S. 1997. Stochastic approximation with two time scales. Systems and Control Letters 29(5):291–294.

Brys, T.; Nowé, A.; Kudenko, D.; and Taylor, M. E. 2014. Combining multiple correlated reward and shaping signals by measuring confidence. In Proceedings of the 28th AAAI Conference on Artificial Intelligence (AAAI-14), 1687–1693. AAAI Press.

Devlin, S., and Kudenko, D. 2012. Dynamic potential-based reward shaping. In Proceedings of the 11th International Conference on Autonomous Agents and Multiagent Systems (AAMAS-12), 433–440. International Foundation for Autonomous Agents and Multiagent Systems.

Devlin, S.; Kudenko, D.; and Grzes, M. 2011. An empirical study of potential-based reward shaping and advice in complex, multi-agent systems. Advances in Complex Systems (ACS) 14(02):251–278.

Dorigo, M., and Colombetti, M. 1997. Robot Shaping: An Experiment in Behavior Engineering. MIT Press, 1st edition.

Grzes, M., and Kudenko, D. 2008. Multigrid reinforcement learning with reward shaping. In Kůrková, V.; Neruda, R.; and Koutník, J., eds., Artificial Neural Networks - ICANN 2008, volume 5163 of Lecture Notes in Computer Science. Springer Berlin Heidelberg. 357–366.

Jaakkola, T.; Jordan, M. I.; and Singh, S. P. 1994. On the convergence of stochastic iterative dynamic programming algorithms. Neural Computation 6(6):1185–1201.

Knox, W. B., and Stone, P. 2012. Reinforcement learning from simultaneous human and MDP reward. In Proceedings of the 11th International Conference on Autonomous Agents and Multiagent Systems (AAMAS-12), 475–482. International Foundation for Autonomous Agents and Multiagent Systems.

Knox, W. B.; Glass, B. D.; Love, B. C.; Maddox, W. T.; and Stone, P. 2012. How humans teach agents: A new experimental perspective. International Journal of Social Robotics 4:409–421.

Mataric, M. J. 1994. Reward functions for accelerated learning. In Proceedings of the Eleventh International Conference on Machine Learning (ICML-94), 181–189. Morgan Kaufmann.

Michie, D., and Chambers, R. A. 1968. Boxes: An experiment in adaptive control. In Dale, E., and Michie, D., eds., Machine Intelligence. Edinburgh, UK: Oliver and Boyd.

Modayil, J.; White, A.; Pilarski, P. M.; and Sutton, R. S. 2012. Acquiring a broad range of empirical knowledge in real time by temporal-difference learning. In Systems, Man, and Cybernetics (SMC), 2012 IEEE International Conference on, 1903–1910. IEEE.

Ng, A. Y.; Harada, D.; and Russell, S. 1999. Policy invariance under reward transformations: Theory and application to reward shaping. In Proceedings of the Sixteenth International Conference on Machine Learning (ICML-99), 278–287. Morgan Kaufmann.

Ng, A. Y. 2003. Shaping and policy search in reinforcement learning. Ph.D. Dissertation, University of California, Berkeley.

Puterman, M. 1994. Markov Decision Processes: Discrete Stochastic Dynamic Programming. New York, NY, USA: John Wiley & Sons, Inc., 1st edition.

Randløv, J., and Alstrøm, P. 1998. Learning to drive a bicycle using reinforcement learning and shaping. In Proceedings of the Fifteenth International Conference on Machine Learning, 463–471. San Francisco, CA, USA: Morgan Kaufmann.

Schwartz, A. 1993. A reinforcement learning method for maximizing undiscounted rewards. In ICML, 298–305. Morgan Kaufmann.

Skinner, B. F. 1938. The Behavior of Organisms: An Experimental Analysis. Appleton-Century.

Snel, M., and Whiteson, S. 2014. Learning potential functions and their representations for multi-task reinforcement learning. Autonomous Agents and Multi-Agent Systems 28(4):637–681.

Sutton, R., and Barto, A. 1998. Reinforcement Learning: An Introduction, volume 116. Cambridge Univ Press.

Sutton, R. S.; Maei, H. R.; Precup, D.; Bhatnagar, S.; Silver, D.; Szepesvári, C.; and Wiewiora, E. 2009. Fast gradient-descent methods for temporal-difference learning with linear function approximation. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML '09, 993–1000. New York, NY, USA: ACM.

Thomaz, A. L., and Breazeal, C. 2006. Reinforcement learning with human teachers: Evidence of feedback and guidance with implications for learning performance. In Proceedings of the 21st AAAI Conference on Artificial Intelligence (AAAI-06), 1000–1005. AAAI Press.

Wiewiora, E.; Cottrell, G. W.; and Elkan, C. 2003. Principled methods for advising reinforcement learning agents. In Machine Learning, Proceedings of the Twentieth International Conference (ICML-03), 792–799.

