Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence

Expressing Arbitrary Reward Functions as Potential-Based Advice

Anna Harutyunyan (Vrije Universiteit Brussel), Sam Devlin (University of York), Peter Vrancx (Vrije Universiteit Brussel), Ann Nowé (Vrije Universiteit Brussel)

Abstract

Effectively incorporating external advice is an important problem in reinforcement learning, especially as it moves into the real world. Potential-based reward shaping is a way to provide the agent with a specific form of additional reward, with the guarantee of policy invariance. In this work we give a novel way to incorporate an arbitrary reward function with the same guarantee, by implicitly translating it into the specific form of dynamic advice potentials, which are maintained as an auxiliary value function learnt at the same time. We show that advice provided in this way captures the input reward function in expectation, and demonstrate its efficacy empirically.

Introduction

The term shaping in experimental psychology (dating at least as far back as (Skinner 1938)) refers to the idea of rewarding all behavior leading to the desired behavior, instead of waiting for the subject to exhibit it autonomously (which, for complex tasks, may take prohibitively long). For example, Skinner discovered that, in order to train a rat to push a lever, any movement in the direction of the lever had to be rewarded. Reinforcement learning (RL) is a framework where an agent learns from interaction with the environment, typically in a tabula rasa manner, guaranteed to learn the desired behavior eventually. As with Skinner's rat, the RL agent may take a very long time to stumble upon the target lever if the only reinforcement (or reward) it receives is after that fact, and shaping is used to speed up the learning process by providing additional rewards. Shaping in RL has been linked to reward functions from very early on: Mataric (1994) interpreted shaping as designing a more complex reward function, Dorigo and Colombetti (1997) used shaping on a real robot to translate expert instructions into reward for the agent as it executed a task, and Randløv and Alstrøm (1998) proposed learning a hierarchy of RL signals in an attempt to separate the extra reinforcement function from the base task. It is in the same paper that they uncover the issue of modifying the reward signals in an unconstrained way: when teaching an agent to ride a bicycle, and encouraging progress towards the goal, the agent would get "distracted", and instead learn to ride in a loop and collect the positive reward forever. This issue of positive reward cycles is addressed by Ng, Harada, and Russell (1999), where they devise their potential-based reward shaping (PBRS) framework, which constrains the shaping reward to have the form of a difference of a potential function of the transitioning states. In fact, they prove a stronger claim that such a form is necessary¹ for leaving the original task unchanged. This elegant and implementable framework led to an explosion of reward shaping research and proved to be extremely effective (Asmuth, Littman, and Zinkov 2008), (Devlin, Kudenko, and Grzes 2011), (Brys et al. 2014), (Snel and Whiteson 2014). Wiewiora, Cottrell, and Elkan (2003) extended PBRS to state-action advice potentials, and Devlin and Kudenko (2012) recently generalized PBRS to handle dynamic potentials, allowing potential functions to change online whilst the agent is learning.

Additive reward functions from early reward shaping research, while dangerous to policy preservation, were able to convey behavioral knowledge (e.g. expert instructions) directly. Potential functions require an additional abstraction, and restrict the form of the additional effective reward, but provide crucial theoretical guarantees. We seek to bridge this gap between the available behavioral knowledge and the effective potential-based rewards.

This paper gives a novel way to specify the effective shaping rewards directly through an arbitrary reward function, while implicitly maintaining the grounding in potentials that is necessary for policy invariance. For this, we first extend Wiewiora's advice framework to dynamic advice potentials. We then propose to learn, in parallel, a secondary value function w.r.t. a variant of our arbitrary reward function, and use its successive estimates as our dynamic advice potentials. We show that the effective shaping rewards then reflect the input reward function in expectation. Empirically, we first demonstrate our method to avoid the issue of positive reward cycles on a grid-world task, when given the same behavior knowledge that trapped the bicyclist from (Randløv and Alstrøm 1998). We then show an application, where our dynamic (PB) value-function advice outperforms other reward-shaping methods that encode the same knowledge, as well as a shaping w.r.t. a different popular heuristic.

¹ Given no knowledge of the MDP dynamics.

Copyright © 2015, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
Background

We assume the usual reinforcement learning framework (Sutton and Barto 1998), in which the agent interacts with its Markovian environment at discrete time steps t = 1, 2, . . .. Formally, a Markov decision process (MDP) (Puterman 1994) is a tuple M = ⟨S, A, γ, T, R⟩, where: S is a set of states, A is a set of actions, γ ∈ [0, 1] is the discounting factor, T = {P_sa(·) | s ∈ S, a ∈ A} are the next-state transition probabilities with P_sa(s′) specifying the probability of state s′ occurring upon taking action a from state s, and R : S × A → ℝ is the expected reward function with R(s, a) giving the expected (w.r.t. T) value of the reward that will be received when a is taken in state s. R(s, a, s′)² and r_{t+1} denote the components of R at transition (s, a, s′) and at time t, respectively.

A (stochastic) Markovian policy π : S × A → ℝ is a probability distribution over actions at each state, so that π(s, a) gives the probability of action a being taken from state s under policy π. We will use π(s, a) = 1 and π(s) = a interchangeably. Value-based methods encode policies through value functions (VF), which denote the expected cumulative reward obtained while following the policy. We focus on state-action value functions. In a discounted setting:

    Q^π(s, a) = E_{T,π}[ Σ_{t=0}^∞ γ^t r_{t+1} | s_0 = s, a_0 = a ]    (1)

We will omit the subscripts on E from now on, and imply all expectations to be w.r.t. T, π. A (deterministic) greedy policy is obtained by picking the action of maximum value at each state:

    π(s) = arg max_a Q(s, a)    (2)

A policy π* is optimal if its value is largest:

    Q*(s, a) = sup_π Q^π(s, a), ∀s, a

When the Q-values are accurate for a given policy π, they satisfy the following recursive relation (Bellman 1957):

    Q^π(s, a) = R(s, a) + γE[Q^π(s′, a′)]    (3)

The values can be learned incrementally by the following update:

    Q_{t+1}(s_t, a_t) = Q_t(s_t, a_t) + α_t δ_t    (4)

where Q_t denotes an estimate of Q^π at time t, α_t ∈ (0, 1) is the learning rate at time t, and

    δ_t = r_{t+1} + γQ_t(s_{t+1}, a_{t+1}) − Q_t(s_t, a_t)    (5)

is the temporal-difference (TD) error of the transition, in which a_t and a_{t+1} are both chosen according to π. This process is shown to converge to the correct value estimates (the TD-fixpoint) in the limit under standard approximation conditions (Jaakkola, Jordan, and Singh 1994).

² R is a convention from MDP literature. In reinforcement learning it is more common to stick with R(s, a, s′) for specification, but we will refer to the general form in our derivations.
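The update in Eqs. (4)-(5) is the standard on-policy Sarsa rule. A minimal tabular sketch is shown below; it is not the paper's code, and the dictionary-based Q table and the particular alpha and gamma values are illustrative assumptions.

```python
from collections import defaultdict

# Tabular Sarsa update implementing Eqs. (4)-(5):
#   delta_t = r_{t+1} + gamma * Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)
#   Q(s_t, a_t) <- Q(s_t, a_t) + alpha_t * delta_t
Q = defaultdict(float)   # Q[(state, action)], initialized to 0
gamma = 0.99             # discount factor (illustrative value)
alpha = 0.05             # learning rate (illustrative value)

def sarsa_update(s, a, r, s_next, a_next):
    """One on-policy TD update; a_next must be chosen by the behaviour policy."""
    delta = r + gamma * Q[(s_next, a_next)] - Q[(s, a)]   # TD error, Eq. (5)
    Q[(s, a)] += alpha * delta                            # Eq. (4)
    return delta
```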
Reward Shaping

The most general form of reward shaping in RL can be given as modifying the reward function of the underlying MDP:

    R′ = R + F

where R is the (transition) reward function of the base problem, and F is the shaping reward function, with F(s, a, s′) giving the additional reward on the transition (s, a, s′), and f_t defined analogously to r_t. We will always refer to the framework as reward shaping, and to the auxiliary reward itself as the shaping reward.

PBRS (Ng, Harada, and Russell 1999) maintains a potential function Φ : S → ℝ, and constrains the shaping reward function F to the following form:

    F(s, s′) = γΦ(s′) − Φ(s)    (6)

where γ is the discounting factor of the MDP. Ng et al. (1999) show that this form is both necessary and sufficient for policy invariance.

Wiewiora et al. (2003) extend PBRS to advice potential functions defined over the joint state-action space. Note that this extension adds a dependency of F on the policy being followed (in addition to the executed transition). The authors consider two types of advice: look-ahead and look-back, providing the theoretical framework for the former:

    F(s, a, s′, a′) = γΦ(s′, a′) − Φ(s, a)    (7)

Devlin and Kudenko (2012) generalize the form in Eq. (6) to dynamic potentials, by including a time parameter, and show that all theoretical properties of PBRS hold:

    F(s, t, s′, t′) = γΦ(s′, t′) − Φ(s, t)    (8)
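To make Eqs. (6) and (7) concrete, the shaping reward is simply a discounted difference of potentials. The following sketch is illustrative only; the placeholder potential and the function names are assumptions, not taken from the paper.

```python
# Potential-based shaping, Eq. (6): F(s, s') = gamma * Phi(s') - Phi(s)
def shaping_reward_state(phi, s, s_next, gamma):
    return gamma * phi(s_next) - phi(s)

# Look-ahead advice over state-action potentials, Eq. (7):
#   F(s, a, s', a') = gamma * Phi(s', a') - Phi(s, a)
def shaping_reward_advice(phi, s, a, s_next, a_next, gamma):
    return gamma * phi(s_next, a_next) - phi(s, a)

# Example placeholder state potential: negative Manhattan distance to a goal cell.
goal = (20, 20)
def phi_distance(s):
    x, y = s
    return -(abs(goal[0] - x) + abs(goal[1] - y))
```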
From Reward Functions to Dynamic Potentials

There are two (inter-related) problems in PBRS: efficacy and specification. The former has to do with designing the best potential functions, i.e. those that offer the quickest and smoothest guidance. The latter refers to capturing the available domain knowledge into a potential form, in the easiest and most effective way. This work primarily deals with that latter question.

Locking knowledge in the form of potentials is a convenient theoretical paradigm, but may be restrictive when considering all types of domain knowledge, in particular behavioral knowledge, which is likely to be specified in terms of actions. Say, for example, an expert wishes to encourage an action a in a state s. If, following the advice framework, she sets Φ(s, a) = 1, with Φ zero-valued elsewhere, the shaping reward associated with the transition (s, a, s′) will be F(s, a, s′, a′) = Φ(s′, a′) − Φ(s, a) = 0 − 1 = −1, so long as the pair (s′, a′) is different from (s, a).³ The favorable behavior (s, a) will then factually be discouraged. She could avoid this by further specifying Φ for state-actions reachable from s via a, but that would require knowledge of the MDP. What she would thus like to do is to be able to specify the desired effective shaping reward F directly, but without sacrificing the optimality guarantee provided by the potential-based framework.

This work formulates a framework to do just that. Given an arbitrary reward function R†, we wish to achieve F ≈ R†, while maintaining policy invariance. This question is equivalent to seeking a potential function Φ, based on R†, s.t. F^Φ ≈ R†, where (and in the future) we take F^Φ to mean a potential-based shaping reward w.r.t. Φ.

The core idea of our approach is to introduce a secondary (state-action) value function Φ, which, concurrently with the main process, learns on the negation of the expert-provided R†, and to use the consecutively updated values of Φ_t as a dynamic potential function, thus making the translation into potentials implicit. Formally:

    R^Φ = −R†    (9)

    Φ_{t+1}(s, a) = Φ_t(s, a) + β_t δ_t^Φ    (10)

    δ_t^Φ = r^Φ_{t+1} + γΦ_t(s_{t+1}, a_{t+1}) − Φ_t(s_t, a_t)    (11)

where β_t is the learning rate at time t, and a_{t+1} is chosen according to the policy π w.r.t. the value function Q of the main task. The shaping reward is then of the form:

    f_{t+1} = γΦ_{t+1}(s_{t+1}, a_{t+1}) − Φ_t(s_t, a_t)    (12)

The intuition of the correspondence between R† and F lies in the relation between the Bellman equation (for Φ):

    Φ^π(s, a) = −R†(s, a) + γΦ^π(s′, a′)    (13)

and shaping rewards from an advice potential function:

    F(s, a) = γΦ(s′, a′) − Φ(s, a) = R†(s, a)    (14)

This intuition will be made more precise later.

³ Assume the example is undiscounted for clarity.
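A compact sketch of this construction, under the assumption of a tabular representation, is given below. The Phi table, the step size beta, and the r_dagger callable are illustrative names, not the paper's code.

```python
from collections import defaultdict

gamma = 0.99              # discount factor shared with the main task (illustrative)
beta = 0.1                # learning rate for Phi (illustrative)
Phi = defaultdict(float)  # secondary value function of Eq. (10), initialized to 0

def dynamic_advice_step(s, a, s_next, a_next, r_dagger):
    """One step of the implicit translation of R† into dynamic advice potentials.

    r_dagger(s, a): the expert-provided reward R†(s, a).
    a_next must be chosen by the main task's policy, as the text above specifies.
    Returns f_{t+1}, the shaping reward of Eq. (12).
    """
    phi_old = Phi[(s, a)]
    r_phi = -r_dagger(s, a)                                       # Eq. (9): R^Phi = -R†
    delta_phi = r_phi + gamma * Phi[(s_next, a_next)] - phi_old   # Eq. (11)
    Phi[(s, a)] += beta * delta_phi                               # Eq. (10)
    # Eq. (12): f_{t+1} = gamma * Phi_{t+1}(s', a') - Phi_t(s, a)
    return gamma * Phi[(s_next, a_next)] - phi_old
```

The returned shaping reward would then be added to the environment reward in the main Q update, so the agent learns on R + F while Φ is learnt alongside it.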
Theory

This section is organized as follows. First, we extend the potential-based advice framework to dynamic potential-based advice, and ensure that the desired guarantees hold. (Our dynamic (potential-based) value-function advice is then an instance of dynamic potential-based advice.) We then turn to the question of correspondence between R† and F, showing that F captures R† in expectation. Finally, we ensure that these expectations are meaningful, by arguing convergence.

Dynamic Potential-Based Advice

Analogously to (Devlin and Kudenko 2012), we augment Wiewiora's look-ahead advice function (Eq. (7)) with a time parameter to obtain our dynamic potential-based advice: F(s, a, t, s′, a′, t′) = γΦ(s′, a′, t′) − Φ(s, a, t), where t/t′ is the time of the agent visiting state s/s′ and taking action a/a′. For notational compactness we rewrite the form as:

    F(s, a, s′, a′) = γΦ_{t′}(s′, a′) − Φ_t(s, a)    (15)

where we implicitly associate s with s_t, s′ with s_{t′}, and F(s, a, s′, a′) with F(s, a, t, s′, a′, t′). As with Wiewiora's framework, F is now not only a function of the transition (s, a, s′), but also of the following action a′, which adds a dependence on the policy the agent is currently evaluating.

We examine the change in the optimal Q-values of the original MDP, resulting from adding F to the base reward function R:

    Q*(s, a) = E[ Σ_{t=0}^∞ γ^t (r_{t+1} + f_{t+1}) | s_0 = s, a_0 = a ]
             = E[ Σ_{t=0}^∞ γ^t (r_{t+1} + γΦ_{t+1}(s_{t+1}, a_{t+1}) − Φ_t(s_t, a_t)) ]    (by Eq. (12))
             = E[ Σ_{t=0}^∞ γ^t r_{t+1} ] + E[ Σ_{t=1}^∞ γ^t Φ_t(s_t, a_t) ] − E[ Σ_{t=0}^∞ γ^t Φ_t(s_t, a_t) ]
             = E[ Σ_{t=0}^∞ γ^t r_{t+1} ] − Φ_0(s, a)    (16)

Thus, once the optimal policy w.r.t. R + F is learnt, to uncover the optimal policy w.r.t. R, one may use the biased greedy action selection (Wiewiora, Cottrell, and Elkan 2003) w.r.t. the initial values of the dynamic advice function:

    π(s) = arg max_a [ Q(s, a) + Φ_0(s, a) ]    (17)

Notice that when the advice function is initialized to 0, the biased greedy action selection above reduces to the basic greedy policy (Eq. (2)), allowing one to use dynamic advice as seamlessly as simple state potentials.

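A one-line illustration of the biased greedy selection in Eq. (17), assuming tabular Q and the stored initial advice values Phi0 (the names are illustrative, not the paper's):

```python
def biased_greedy_action(Q, Phi0, s, actions):
    """Eq. (17): pick argmax_a [Q(s, a) + Phi_0(s, a)].

    With Phi0 identically zero this reduces to the ordinary greedy policy of Eq. (2).
    """
    return max(actions, key=lambda a: Q[(s, a)] + Phi0[(s, a)])
```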
Shaping in Expectation

Let R† be an arbitrary reward function, and let Φ be the state-action value function that learns on R^Φ = −R†, while following some fixed policy π. The shaping reward at timestep t w.r.t. Φ as a dynamic advice function is given by:

    f_{t+1} = γΦ_{t+1}(s_{t+1}, a_{t+1}) − Φ_t(s_t, a_t)
            = γΦ_t(s_{t+1}, a_{t+1}) − Φ_t(s_t, a_t) + γΦ_{t+1}(s_{t+1}, a_{t+1}) − γΦ_t(s_{t+1}, a_{t+1})
            = δ_t^Φ − r^Φ_{t+1} + γΔΦ(s_{t+1}, a_{t+1})    (by Eq. (11))
            = r†_{t+1} + δ_t^Φ + γΔΦ(s_{t+1}, a_{t+1})    (18)

where ΔΦ(s_{t+1}, a_{t+1}) = Φ_{t+1}(s_{t+1}, a_{t+1}) − Φ_t(s_{t+1}, a_{t+1}) denotes the change in Φ caused by the update at time t. Now assume the process has converged to the TD-fixpoint Φ^π. Then:

    F(s, a, s′, a′) = γΦ^π(s′, a′) − Φ^π(s, a)
                    = γΦ^π(s′, a′) − R^Φ(s, a) − γE[Φ^π(s′, a′)]    (by Eq. (3))
                    = R†(s, a) + γ(Φ^π(s′, a′) − E[Φ^π(s′, a′)])    (by Eq. (9))    (19)

Thus, we obtain that the shaping reward F w.r.t. the converged values Φ^π reflects the expected designer reward R†(s, a) plus a bias term, which measures how different the sampled next state-action value is from the expected next state-action value. This bias will at each transition further encourage transitions that are "better than expected" (and vice versa), similarly, e.g., to "better-than-average" (and vice versa) rewards in R-learning (Schwartz 1993).

To obtain the expected shaping reward F(s, a), we take the expectation w.r.t. the transition matrix T, and the policy π with which a′ is chosen:

    F(s, a) = E[F(s, a, s′, a′)]
            = R†(s, a) + γE[Φ^π(s′, a′) − E[Φ^π(s′, a′)]]
            = R†(s, a)    (20)

Thus, Eq. (18) gives the shaping reward while the Φ values are not yet converged, (19) gives the component of the shaping reward on a transition after Φ^π is correct, and (20) establishes the equivalence of F and R† in expectation.⁴

Convergence of Φ

If the policy π is fixed, and the Q^π-estimates are correct, the expectations in the previous section are well-defined, and Φ converges to the TD-fixpoint. However, Φ is learnt at the same time as Q. This process can be shown to converge by formulating the framework on two timescales (Borkar 1997), and using the ODE method of Borkar and Meyn (2000). We thus require⁵ that the step size schedules {α_t} and {β_t} satisfy the following:

    lim_{t→∞} α_t / β_t = 0    (21)

Q and Φ correspond to the slower and faster timescales, respectively. Given that step-size schedule difference, we can rewrite the iterations (for Q and Φ) as one iteration with a combined parameter vector, and show that the assumptions (A1)-(A2) from Borkar and Meyn (2000) are satisfied, which allows us to apply their Theorem 2.2. This analysis is analogous to that of the convergence of TD with Gradient Correction (Theorem 2 in (Sutton et al. 2009)), and is left out for clarity of exposition.

Note that this convergence is needed to assure that Φ indeed captures the expert reward function R†. The form of general dynamic advice from Eq. (15) itself does not pose any requirements on the convergence properties of Φ to guarantee policy invariance.

⁴ Note that function approximation will introduce an (unavoidable) approximation-error term in these equations.
⁵ In addition to the standard stochastic approximation assumptions, common to all TD algorithms.
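The experiments below use an exponentially decaying α_t together with a constant β_t, which satisfies Eq. (21). A sketch of such a schedule follows; the default constants mirror the values reported later, but the generator itself is only an illustrative assumption.

```python
def step_sizes(alpha0=0.05, tau=0.999, beta=0.1):
    """Yield (alpha_t, beta_t) with alpha_t = alpha0 * tau**t and a constant beta_t.

    Since alpha_t / beta_t = (alpha0 / beta) * tau**t -> 0, Eq. (21) holds:
    Q moves on the slower timescale, Phi on the faster one.
    """
    t = 0
    while True:
        yield alpha0 * (tau ** t), beta
        t += 1
```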
Experiments

We first demonstrate our method correctly solving a grid-world task, as a simplified instance of the bicycle problem. We then assess the practical utility of our framework on a larger cart-pole benchmark, and show that our dynamic (PB) VF advice approach beats other methods that use the same domain knowledge, as well as a popular static shaping w.r.t. a different heuristic.

Grid-World

The problem described is a minimal working example of the bicycle problem (Randløv and Alstrøm 1998), illustrating the issue of positive reward cycles.

Given a 20 × 20 grid, the goal is located at the bottom right corner (20, 20). The agent must reach it from its initial position (0, 0) at the top left corner, upon which event it will receive a positive reward. The reward on the rest of the transitions is 0. The actions correspond to the 4 cardinal directions, and the state is the agent's position coordinates (x, y) in the grid. The episode terminates when the goal is found, or when 10000 steps have elapsed.

Given approximate knowledge of the problem, a natural heuristic to encourage is transitions that move the agent to the right, or down, as they advance the agent closer to the goal. A reward function R† encoding this heuristic can be defined as

    R†(s, right) = R†(s, down) = c,    c ∈ ℝ+, ∀s

When provided naïvely (i.e. with F = R†), the agent is at risk of getting "distracted": getting stuck in a positive reward cycle, and never reaching the goal. We apply our framework, and learn the corresponding Φ w.r.t. R^Φ = −R†, setting F accordingly (Eq. (12)). We compare that setting with the base learner and with the non-potential-based naïve learner.⁶

Learning was done via Sarsa with ε-greedy action selection, ε = 0.1. The learning parameters were tuned to the following values: γ = 0.99, c = 1, and α_{t+1} = τα_t decaying exponentially (so as to satisfy the condition in Eq. (21)), with α_0 = 0.05, τ = 0.999, and β_t = 0.1.

We performed 50 independent runs, 100 episodes each (Fig. 1(a)). Observe that the performance of the (non-PB) agent learning with F = R† actually got worse with time, as it discovered a positive reward cycle, and got more and more disinterested in finding the goal. Our agent, armed with the same knowledge, used it properly (in a true potential-based manner), and the learning was accelerated significantly compared to the base agent.

⁶ To illustrate our point more clearly in the limited space, we omit the static PB variant with (state-only) potentials Φ(x, y) = x + y. It depends on a different type of knowledge (about the state), while this experiment compares two ways to utilize the behavioral reward function R†. The static variant does not require learning, and hence performs better in the beginning.
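A sketch of the grid-world heuristic R† above, written as code; the function and constant names are illustrative (the constant c = 1 matches the tuned value reported above), and the cycle example in the comment is ours, not the paper's.

```python
# R†(s, right) = R†(s, down) = c for every state s; 0 for the other two actions.
c = 1.0
ENCOURAGED = {"right", "down"}

def r_dagger_gridworld(s, a):
    """Expert heuristic: encourage moving right or down, toward the goal at (20, 20)."""
    return c if a in ENCOURAGED else 0.0

# Naive (non-PB) advice adds this directly to the base reward (F = R†) and can create
# positive reward cycles (e.g. alternating right/left collects +c every two steps);
# the dynamic advice instead learns Phi on -R† and shapes with Eq. (12).
```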
[Figure 1: two learning-curve panels plotting Steps against Episodes; (a) Grid-world with variants Base, Non-PB advice, and Dynamic advice; (b) Cart-pole with variants Base, Non-PB advice, Myopic PB advice, Static PB (angle), and Dynamic (PB) VF advice.]

Figure 1: Mean learning curves. Shaded areas correspond to the 95% confidence intervals. The plot is smoothed by taking a running average of 10 episodes. (a) The same reward function added directly to the base reward function (non-PB advice) diverges from the optimal policy, whereas our automatic translation to dynamic-PB advice accelerates learning significantly. (b) Our dynamic (PB) VF advice learns to balance the pole the soonest, and has the lowest variance.

Cart-Pole

We now evaluate our approach on a more difficult cart-pole benchmark (Michie and Chambers 1968). The task is to balance a pole on top of a moving cart for as long as possible. The (continuous) state contains the angle ξ and angular velocity ξ̇ of the pole, and the position x and velocity ẋ of the cart. There are two actions: a small positive and a small negative force applied to the cart. A pole falls if |ξ| > π/4, which terminates the episode. The track is bounded within [−4, 4], but the sides are "soft"; the cart does not crash upon hitting them. The reward function penalizes a pole drop, and is 0 elsewhere. An episode terminates successfully if the pole was balanced for 10000 steps.

An intuitive behavior to encourage is moving the cart to the right (or left) when the pole is leaning rightward (or leftward). Let o : S × A → {0, 1} be the indicator function denoting such orientation of state s and action a. A reward function to encompass the rule can then be defined as:

    R†(s, a) = o(s, a) × c,    c ∈ ℝ+
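One possible reading of the indicator o(s, a) as code is sketched below; the state layout and the ±1 action encoding are assumptions for illustration, not taken from the paper's implementation, and c = 0.1 matches the tuned value reported later for this task.

```python
c = 0.1  # tuned value reported below (illustrative placement here)

def orientation(state, action):
    """o(s, a) = 1 if the force pushes the cart toward the side the pole leans to."""
    xi, xi_dot, x, x_dot = state   # pole angle/velocity, cart position/velocity (assumed layout)
    push_right = action == +1      # assume the two actions are encoded as +1 / -1
    return 1 if (xi > 0) == push_right else 0

def r_dagger_cartpole(state, action):
    """R†(s, a) = o(s, a) * c: reward pushes in the direction the pole is leaning."""
    return orientation(state, action) * c
```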
We compare the performance of our agent to the base learner and two other reward shaping schemes that reflect the same knowledge about the desired behavior, and one that uses different knowledge (about the angle of the pole).⁷ The variants are described more specifically below, and sketched in code after the list:

1. (Base) The base learner, F_1 := 0.

2. (Non-PB advice) Advice is received simply by adding R† to the main reward function, F_2 := R†. This method will lose some optimal policies.

3. (Myopic PB advice) Potentials are initialized and maintained with R†, i.e. F_3 := F^Φ with Φ = R†. This is closest to Wiewiora's look-ahead advice framework.

4. (Static PB shaping with angle) The agent is penalized proportionally to the angle with which it deviates from equilibrium: F_4 := F^Φ with Φ ∼ −|ξ|².

5. (Dynamic (PB) VF advice) We learn Φ as a value function w.r.t. R^Φ = −R†, and set F_5 = F^Φ accordingly (Eq. (12)).

⁷ Note that unlike our behavioral encouragement, the angle shaping requires precise information about the state, which is more demanding in a realistic setup, where the advice comes from an external observer.
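For concreteness, the five shaping rewards can be written side by side as below. This is a sketch under assumed interfaces: it reuses r_dagger_cartpole from the previous sketch, stands in callables for the tile-coded approximator of Φ, and none of the names come from the paper's code.

```python
def F1(s, a, s_next, a_next, gamma):
    """Base learner: no shaping."""
    return 0.0

def F2(s, a, s_next, a_next, gamma):
    """Non-PB advice: add R† directly (loses some optimal policies)."""
    return r_dagger_cartpole(s, a)

def F3(s, a, s_next, a_next, gamma):
    """Myopic PB advice: potentials fixed to R† itself."""
    return gamma * r_dagger_cartpole(s_next, a_next) - r_dagger_cartpole(s, a)

def F4(s, a, s_next, a_next, gamma):
    """Static PB shaping with angle: state potential proportional to -|xi|^2."""
    phi = lambda state: -abs(state[0]) ** 2
    return gamma * phi(s_next) - phi(s)

def F5(s, a, s_next, a_next, gamma, phi_next, phi_prev):
    """Dynamic (PB) VF advice, Eq. (12): consecutive estimates of the learned Phi."""
    return gamma * phi_next(s_next, a_next) - phi_prev(s, a)
```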

We used tile coding (Sutton and Barto 1998) with 10 tilings of 10 × 10 to represent the continuous state. Learning was done via Sarsa(λ) with eligibility traces and ε-greedy action selection, ε = 0.1. The learning parameters were tuned to the following: λ = 0.9, c = 0.1, and α_{t+1} = τα_t decaying exponentially (so as to satisfy the condition in Eq. (21)), with α_0 = 0.05, τ = 0.999, and β_t = 0.2. We found γ to affect the results differently across variants, with the following best values: γ_1 = 0.8, γ_2 = γ_3 = γ_4 = 0.99, γ_5 = 0.4. M_i is then the MDP ⟨S, A, γ_i, T, R + F_i⟩.

Fig. 1(b) gives the comparison across the M_i (i.e. the best γ values for each variant), whereas Table 1 also contains the comparison w.r.t. the base value γ = γ_1.

We performed 50 independent runs of 100 episodes each (Table 1). Our method beats the alternatives in both the fixed and tuned γ scenarios, converging to the optimal policy reliably after 30 episodes in the latter (Fig. 1(b)). Paired t-tests on the sums of steps of all episodes per run for each pair of variants confirm all variants as significantly different with p < 0.05. Notice that the non-potential-based variant for this problem does not perform as poorly as on the grid-world task. The reason for this is that getting stuck in a positive reward cycle can be good in cart-pole, as the goal is to continue the episode for as long as possible. However, consider the policy that keeps the pole at an equilibrium (at ξ = 0). While clearly optimal in the original task, this policy will not be optimal in M_2, as it will yield 0 additional rewards.

Table 1: Cart-pole results (steps balanced, mean ± standard error). The final performance refers to the last 10% of the run. Dynamic (PB) VF advice has the highest mean and lowest variance in both the tuned and fixed γ scenarios, and is the most robust, whereas myopic shaping proved to be especially sensitive to the choice of γ.

                             Best γ values                         Base γ = 0.8
  Variant                    Final             Overall             Final             Overall
  Base                       5114.7 ± 188.7    3121.8 ± 173.6      5114.7 ± 165.4    3121.8 ± 481.3
  Non-PB advice              9511.0 ± 37.2     6820.6 ± 265.3      6357.1 ± 89.1     3405.2 ± 245.2
  Myopic PB shaping          8618.4 ± 107.3    3962.5 ± 287.2      80.1 ± 0.3        65.8 ± 0.9
  Static PB                  9860.0 ± 56.1     8292.3 ± 261.6      3744.6 ± 136.2    2117.5 ± 102.0
  Dynamic (PB) VF advice     9982.4 ± 18.4     9180.5 ± 209.8      8662.2 ± 60.9     5228.0 ± 274.0

Discussion

Choice of R†  The given framework is general enough to capture any form of the reward function R†. Recall, however, that F = R† holds after the Φ values have converged. Thus, the simpler the provided reward function R†, the sooner the effective shaping reward will capture it. In this work, we have considered reward functions R† of the form R†(B) = c, c > 0, where B is the set of encouraged behavior transitions. This follows the convention of shaping in psychology, where punishment is implicit as the absence of positive encouragement. Due to the expectation terms in F, we expect such a form (of all-positive, or all-negative R†) to be more robust. Another assumption is that all encouraged behaviors are encouraged equally; one may easily extend this to varying preferences c_1 < . . . < c_k, and consider a choice between expressing them within a single reward function, or learning a separate value function for each signal c_i.

Role of discounting  Discounting factors γ in RL determine how heavily the future rewards are discounted, i.e. the reward horizon. Smaller γ's (i.e. heavier discounting) yield quicker convergence, but may be insufficient to convey long-term goals. In our framework, the value of γ plays two separate roles in the learning process, as it is shared between Φ and Q. Firstly, it determines how quickly the Φ values converge. Since we are only interested in the difference of consecutive Φ-values, smaller γ's provide a more stable estimate, without losses. On the other hand, if the value is too small, Q will lose sight of the long-term rewards, which is detrimental to performance if the rewards are for the base task alone. We, however, are considering the shaped rewards. Since shaped rewards provide informative immediate feedback, it becomes less important to look far ahead into the future. This notion is formalized by Ng (2003), who proves (in Theorem 3) that a "good" potential function shortens the reward horizon of the original problem. Thus γ, in a sense, balances the stability of learning Φ with the length of the shaped reward horizon of Q.

Related Work

The correspondence between value and potential functions has been known since the conception of the latter. Ng et al. (1999) point out that the optimal potential function is the true value function itself (as in that case the problem reduces to learning the trivial zero value function). With this insight, there have been attempts to simultaneously learn the base value function at coarser and finer granularities (of function approximation), and use the (quicker-to-converge) former as a potential function for the latter (Grzes and Kudenko 2008). Our approach is different in that our value functions learn on different rewards with the same state representation, and it tackles the question of specification rather than efficacy.

On the other hand, there has been a lot of research in human-provided advice (Thomaz and Breazeal 2006), (Knox et al. 2012). This line of research (interactive shaping) typically uses the human advice component heuristically as a (sometimes annealed) additive component in the reward function, which does not follow the potential-based framework (and thus does not preserve policies). Knox and Stone (2012) do consider PBRS as one of their methods, but (a) stay strictly myopic (similar to the third variant in the cart-pole experiment), and (b) limit themselves to state potentials. Our approach is different in that it incorporates the external advice through a value function, and stays entirely sound in the PBRS framework.

Conclusions and Outlook

In this work, we formulated a framework which allows one to specify the effective shaping reward directly. Given an arbitrary reward function, we learned a secondary value function, w.r.t. a variant of that reward function, concurrently to the main task, and used the consecutive estimates of that value function as dynamic advice potentials. We showed that the shaping reward resulting from this process captures the input reward function in expectation. We presented empirical evidence that the method behaves in a true potential-based manner, and that such encoding of the behavioral domain knowledge speeds up learning significantly more, compared to its alternatives. The framework induces little added complexity: the maintenance of the auxiliary value function is linear in time and space (Modayil et al. 2012), and, when initialized to 0, the optimal base value function is unaffected.

We intend to further consider inconsistent reward functions, with an application to humans directly providing advice. The challenges are then to analyze the expected effective rewards, as the convergence of a TD-process w.r.t. inconsistent rewards is less straightforward. In this work we identified the secondary discounting factor γ^Φ with the primary γ. This need not be the case, in general: γ^Φ = νγ. Such a modification adds an extra Φ-term into Eq. (20), potentially offering gradient guidance, which is useful if the expert reward is sparse.

Acknowledgments

Anna Harutyunyan is supported by the IWT-SBO project MIRAD (grant nr. 120057).

References

Asmuth, J.; Littman, M. L.; and Zinkov, R. 2008. Potential-based shaping in model-based reinforcement learning. In Proceedings of the 23rd AAAI Conference on Artificial Intelligence (AAAI-08), 604–609. AAAI Press.

Bellman, R. 1957. Dynamic Programming. Princeton, NJ, USA: Princeton University Press, 1st edition.

Borkar, V. S., and Meyn, S. P. 2000. The ODE method for convergence of stochastic approximation and reinforcement learning. SIAM Journal on Control and Optimization 38(2):447–469.

Borkar, V. S. 1997. Stochastic approximation with two time scales. Systems and Control Letters 29(5):291–294.

Brys, T.; Nowé, A.; Kudenko, D.; and Taylor, M. E. 2014. Combining multiple correlated reward and shaping signals by measuring confidence. In Proceedings of the 28th AAAI Conference on Artificial Intelligence (AAAI-14), 1687–1693. AAAI Press.

Devlin, S., and Kudenko, D. 2012. Dynamic potential-based reward shaping. In Proceedings of the 11th International Conference on Autonomous Agents and Multiagent Systems (AAMAS-12), 433–440. International Foundation for Autonomous Agents and Multiagent Systems.

Devlin, S.; Kudenko, D.; and Grzes, M. 2011. An empirical study of potential-based reward shaping and advice in complex, multi-agent systems. Advances in Complex Systems (ACS) 14(02):251–278.

Dorigo, M., and Colombetti, M. 1997. Robot Shaping: An Experiment in Behavior Engineering. MIT Press, 1st edition.

Grzes, M., and Kudenko, D. 2008. Multigrid reinforcement learning with reward shaping. In Kůrková, V.; Neruda, R.; and Koutník, J., eds., Artificial Neural Networks - ICANN 2008, volume 5163 of Lecture Notes in Computer Science. Springer Berlin Heidelberg. 357–366.

Jaakkola, T.; Jordan, M. I.; and Singh, S. P. 1994. On the convergence of stochastic iterative dynamic programming algorithms. Neural Computation 6(6):1185–1201.

Knox, W. B., and Stone, P. 2012. Reinforcement learning from simultaneous human and MDP reward. In Proceedings of the 11th International Conference on Autonomous Agents and Multiagent Systems (AAMAS-12), 475–482. International Foundation for Autonomous Agents and Multiagent Systems.

Knox, W. B.; Glass, B. D.; Love, B. C.; Maddox, W. T.; and Stone, P. 2012. How humans teach agents: A new experimental perspective. International Journal of Social Robotics 4:409–421.

Mataric, M. J. 1994. Reward functions for accelerated learning. In Proceedings of the Eleventh International Conference on Machine Learning (ICML-94), 181–189. Morgan Kaufmann.

Michie, D., and Chambers, R. A. 1968. Boxes: An experiment in adaptive control. In Dale, E., and Michie, D., eds., Machine Intelligence. Edinburgh, UK: Oliver and Boyd.

Modayil, J.; White, A.; Pilarski, P. M.; and Sutton, R. S. 2012. Acquiring a broad range of empirical knowledge in real time by temporal-difference learning. In Systems, Man, and Cybernetics (SMC), 2012 IEEE International Conference on, 1903–1910. IEEE.

Ng, A. Y.; Harada, D.; and Russell, S. 1999. Policy invariance under reward transformations: Theory and application to reward shaping. In Proceedings of the Sixteenth International Conference on Machine Learning (ICML-99), 278–287. Morgan Kaufmann.

Ng, A. Y. 2003. Shaping and policy search in reinforcement learning. Ph.D. Dissertation, University of California, Berkeley.

Puterman, M. 1994. Markov Decision Processes: Discrete Stochastic Dynamic Programming. New York, NY, USA: John Wiley & Sons, Inc., 1st edition.

Randløv, J., and Alstrøm, P. 1998. Learning to drive a bicycle using reinforcement learning and shaping. In Proceedings of the Fifteenth International Conference on Machine Learning, 463–471. San Francisco, CA, USA: Morgan Kaufmann.

Schwartz, A. 1993. A reinforcement learning method for maximizing undiscounted rewards. In ICML, 298–305. Morgan Kaufmann.

Skinner, B. F. 1938. The Behavior of Organisms: An Experimental Analysis. Appleton-Century.

Snel, M., and Whiteson, S. 2014. Learning potential functions and their representations for multi-task reinforcement learning. Autonomous Agents and Multi-Agent Systems 28(4):637–681.

Sutton, R., and Barto, A. 1998. Reinforcement Learning: An Introduction, volume 116. Cambridge Univ Press.

Sutton, R. S.; Maei, H. R.; Precup, D.; Bhatnagar, S.; Silver, D.; Szepesvári, C.; and Wiewiora, E. 2009. Fast gradient-descent methods for temporal-difference learning with linear function approximation. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML '09, 993–1000. New York, NY, USA: ACM.

Thomaz, A. L., and Breazeal, C. 2006. Reinforcement learning with human teachers: Evidence of feedback and guidance with implications for learning performance. In Proceedings of the 21st AAAI Conference on Artificial Intelligence (AAAI-06), 1000–1005. AAAI Press.

Wiewiora, E.; Cottrell, G. W.; and Elkan, C. 2003. Principled methods for advising reinforcement learning agents. In Machine Learning, Proceedings of the Twentieth International Conference (ICML-03), 792–799.

