Learning Potential Functions and Their Representations for Multi-Task Reinforcement Learning
Abstract In multi-task learning, there are roughly two approaches to discovering repre-
sentations. The first is to discover task relevant representations, i.e., those that compactly
represent solutions to particular tasks. The second is to discover domain relevant represen-
tations, i.e., those that compactly represent knowledge that remains invariant across many
tasks. In this article, we propose a new approach to multi-task learning that captures domain-
relevant knowledge by learning potential-based shaping functions, which augment a task’s
reward function with artificial rewards. We address two key issues that arise when deriving
potential functions. The first is what kind of target function the potential function should
approximate; we propose three such targets and show empirically that which one is best
depends critically on the domain and learning parameters. The second issue is the represen-
tation for the potential function. This article introduces the notion of k-relevance, the ex-
pected relevance of a representation on a sample sequence of k tasks, and argues that this is
a unifying definition of relevance of which both task and domain relevance are special cases.
We prove formally that, under certain assumptions, k-relevance converges monotonically to
a fixed point as k increases, and use this property to derive Feature Selection Through Ex-
trapolation of k-relevance (FS-TEK), a novel feature-selection algorithm. We demonstrate
empirically the benefit of FS-TEK on artificial domains.
Keywords multi-task reinforcement learning · feature selection · abstraction · potential-
based shaping · transfer learning
M. Snel
Intelligent Systems Lab Amsterdam
Universiteit van Amsterdam
E-mail: [email protected]
S. Whiteson
Intelligent Systems Lab Amsterdam
Universiteit van Amsterdam
E-mail: [email protected]
1 Introduction
Real-world autonomous agents constantly face problems with various degrees of similar-
ity to problems encountered before. Often, exploiting these similarities is vital to solving
the new problem with acceptable cost in terms of time, money, or damage incurred. For
example, a person learning a new language can do so more quickly if he understands the
general structure of language; a robot navigating a new building should not need to crash
into obstacles to learn that doing so is bad.
This article considers how an agent facing a sequence of tasks can best exploit its ex-
perience with previous tasks to speed learning on new tasks, via two complementary ap-
proaches. First, by extracting knowledge from this experience in a way that can be leveraged
by learning algorithms, and second, by automatically discovering good representations for
that knowledge; for example, a subset of the agent’s sensory features. We consider a rein-
forcement learning (RL) [73] setting in which agents learn control policies, i.e., mappings
from states to actions, for sequential decision problems based on feedback in the form of
reward. Typically, the agent tries to estimate the value of a state under a given policy as the
sum of future rewards, or return, it can expect under that policy.
Traditionally, the aim has been to converge to the optimal policy, which maximizes
expected return. However, expected online return (return incurred while the agent is learning
and interacting with the environment), rather than convergence, is often more important. By
guiding the agent’s exploration strategy, the application of prior knowledge to the task is one
tool for improving online return.
In transfer learning, prior knowledge is derived from previous tasks seen by the agent
or other agents. More specifically, transfer learning aims to improve performance on a set
of target tasks by leveraging experience from a set of source tasks. Clearly, the target tasks
must be related to the source tasks for transfer to have an expected benefit. In multi-task
reinforcement learning (MTRL), this relationship is formalized through a domain, a distri-
bution over tasks from which the source and target tasks are independently drawn [76, 86].
Even so, relatedness can come in different forms. This article focuses on discovering shared
representational structure between tasks and the knowledge captured by that structure.
Discovering a single compact representation (e.g., a subset of the agent’s sensory fea-
tures) that is able to capture individual task solutions in the entire domain is the objective of
many multi-task methods (e.g. [30,18,83,70]). We call this a task-relevant representation: a
single representation with which a different function is learned in each task. Its power lies
in its compactness, which makes each task easier to learn.
Other approaches utilize a cross-task representation that, while not necessarily useful
for representing task solutions, captures task-invariant knowledge (e.g. [8, 20, 62, 34, 22]).
We call this a domain-relevant representation, which can serve as a basis for a single cross-
task function that captures (approximately) invariant domain knowledge. Roughly speaking,
such a function obviates the need to re-learn task-invariant knowledge and biases the agent’s
behavior.
The key insight underlying the work presented in this article is that task-relevant and
domain-relevant representations are fundamentally different: not only can they capture dif-
ferent concepts, but they can employ different features. For example, consider a robot-
navigation domain. In each task, the robot is placed in a different building in which it must
locate a bomb that is placed at a fixed location that depends on the building. An optimal
policy for a given task need condition only on the robot’s position in the building; posi-
tion is therefore task relevant. However, in each task, the robot must re-learn how position
relates to the bomb’s location. By contrast, while distance sensors are not needed in a task-
specific policy, they can be used to represent the task-invariant knowledge that crashing into
obstacles should be avoided, and are therefore domain relevant.
Some approaches learn a single new representation that may combine task-relevant and
domain-relevant representations [79,8,5,2,40]. For example, instead of training a separate
neural network per supervised learning (SL) task, Caruana [8] trains a single network on all
tasks in parallel. Learning benefits from the shared hidden layer between tasks, which cap-
tures task-invariant knowledge, but the network can also represent task-specific solutions
through the separate task weights. However, such approaches have several limitations. First,
task solutions may interfere with each other and with the task-invariant knowledge [9]. Sec-
ond, any penalty term for model complexity [2, 40] affects both task and domain-relevant
representations simultaneously. Lastly, mapping the original problem representation to a
new one makes it harder to interpret the solutions and decipher which knowledge is domain-
relevant.
These problems can be avoided by maintaining separate task-relevant and domain-relevant
representations. In each task, the agent learns a policy using the task-relevant representa-
tion and is aided by a cross-task function using the domain-relevant representation. This
approach also makes it easier to override the bias of the cross-task function, which is impor-
tant when that bias proves incorrect [9]. The cross-task function can be represented as, e.g.,
advice rules [81], or a shaping function [53,34], the approach we focus on in this article.
Shaping functions, which augment a task’s reward function with additional artificial
rewards, have proven effective in single-task [53], multi-agent [4] and multi-task [34, 68, 69]
problems, with applications to, e.g., foraging by real robots [15], robot soccer [10], and real-
time strategy games [52]. Ng et al. [53] showed that, to preserve the optimal policy of the
ground task, the shaping function should consist of a difference of potentials Φ : X → R,
which, like value functions, specify the desirability of a given state. In MTRL, shaping
functions can be learned automatically by deriving the potential function from the source
tasks [34,68,69], e.g., the navigating robot described above could learn a shaping function
that discourages bumping into obstacles.
The primary aim of this article is to address the two key steps involved in trying to
maximize online return through potential functions in MTRL. The first step is selecting the
appropriate target for the potential function to approximate. This question does not arise in
SL, since the targets are given and thus any single cross-task function strives to minimize a
loss function with respect to all tasks’ targets simultaneously. An intuitive choice of target
in MTRL is the optimal value function of each task; approximating this target leads to the
solution that is closest in expectation to the optimal solution of the unknown next task.
However, there is no guarantee that using a potential function that approximates this target
leads to the best online return. Therefore, in section 3, we propose three different targets.
Given a target, the second key step is to find the representation that approximates this
target as closely as possible while at the same time generalizing well given limited domain
information (observed tasks); i.e., to select a domain-relevant representation. Previous work
on multi-task shaping [34] relied on manually designed representations. Other MTRL work
that employed an analogue of domain-relevant representations also manually designed them
[63,22,21], or learned them but did not learn a cross-task function [20]. Therefore, in section
4, we introduce the notion of k-relevance, the expected relevance on a sequence of k tasks
sampled from the domain, and argue that this is a unifying definition of relevance of which
both task and domain relevance are special cases. We prove that, under certain assumptions,
k-relevance is an approximately exponential function of k, and use this property to derive
Feature Selection Through Extrapolation of k-relevance (FS-TEK), a novel feature-selection
algorithm that applies to both tabular representations and linear function approximators with
binary features. The key insight behind FS-TEK is that change in relevance observed on
task sequences of increasing length can be extrapolated to more accurately predict domain
relevance.
Finally, we present an empirical analysis that evaluates these two steps on multiple do-
mains, including one that requires function approximation. First, we show that which po-
tential function is best depends critically on the domain and learning parameters. Then, we
demonstrate empirically the benefit of FS-TEK on these same domains.
The remainder of the article is structured as follows. The next section addresses some
essential background on shaping functions and reinforcement learning. Thereafter, sections
3 and 4 discuss the theoretical aspects of the potential functions and notion of relevance that
we propose. Section 5 provides an experimental validation of these concepts. The article
concludes with an overview of related work in section 6 and a discussion in section 7.
2 Background
This section introduces the notation used in the article and provides further background on
shaping functions and the standard framework for solving reinforcement learning problems.
In the last subsection, we formalize the problem setting considered in this article.
where r_{t+1} is the reward received after taking action a_t in state x_t at time step t and γ is a
discount factor in [0, 1] that determines the relative importance of future rewards.
The return depends on the agent's policy π : X_m × A → [0, 1] s.t. ∑_a π(x, a) = 1, which
assigns a probability to action a in state x. The value of state-action pair (x, a) under policy π
is the expected return when starting in x, taking a, and following π thereafter, and is denoted
by Q^π_m : X+_m → R. The optimal value function, which defines the maximum expected return
the agent can accrue, is defined by the Bellman optimality equation:
Q*_m(x, a) = ∑_{x'∈X_m} P^{xax'}_m [ R^{xax'}_m + γ max_{a'} Q*_m(x', a') ]. (2)
An optimal policy π* achieves maximum expected return and can be derived from the optimal
Q-function by setting π*(x, argmax_a Q*_m(x, a)) = 1. It follows that the optimal value of
state x is V*_m(x) = max_a Q*_m(x, a).
Solving (2) explicitly requires a model of the MDP; this is the domain of planning
methods such as dynamic programming [6]. For example, policy iteration starts out with
a random policy, and iteratively computes the value function for the policy and makes the
policy greedy with respect to the new value function, until convergence. On the other hand,
value iteration works by iteratively computing solutions to (2) directly.
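To make the value-iteration baseline concrete, here is a minimal tabular sketch; the transition and reward arrays P and R, and the tolerance eps, are illustrative assumptions and not part of the article's setup.

```python
import numpy as np

def value_iteration(P, R, gamma=0.95, eps=1e-6):
    """Iteratively apply the Bellman optimality backup of Eq. (2).

    P[x, a, x2] is the transition probability and R[x, a, x2] the expected
    reward; returns the optimal Q-function as an |X| x |A| array.
    """
    n_states, n_actions, _ = P.shape
    Q = np.zeros((n_states, n_actions))
    while True:
        V = Q.max(axis=1)                                  # max_a' Q(x', a')
        Q_new = (P * (R + gamma * V[None, None, :])).sum(axis=2)
        if np.abs(Q_new - Q).max() < eps:                  # max-norm convergence test
            return Q_new
        Q = Q_new
```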
In RL problems, however, no model is available and the agent must learn about its
environment by interacting with it. Model-based RL methods do so by iteratively learning a
model of the environment and improving the policy by planning on the estimated model; in
contrast, model-free methods, which we employ in this article, estimate a value function or
policy directly from interaction with the environment [73].
A widely used class of model-free methods is temporal-difference (TD) learning [72].
When used for control, the TD update takes the form

Q(x, a) ← Q(x, a) + αδ, (3)

where α is a learning rate and δ the TD error. There are various approaches to computing
the latter, depending on which TD control algorithm is used. Two of the most popular are
Sarsa [58,64] and Q-Learning [84]. For Sarsa,

δ = r + γQ(x', a') − Q(x, a), (4)

where r is the reward incurred, x' the next state and a' the action the agent took at x'. For
Q-learning,

δ = r + γ max_{a*} Q(x', a*) − Q(x, a). (5)
Basic RL methods do not cope well with tasks with large state spaces. One tool for
speeding up learning in such problems is the use of eligibility traces [73]. Here, a state-
action pair gets assigned a trace with value 1 (so-called replacing traces) when visited;
each trace is decayed by a factor γλ on each time step, where γ is the discount factor of
the MDP and λ is a decay parameter. At each time step, instead of just updating the most
recent state-action pair, all state-action pairs with traces significantly above zero are updated
as Q(x, a) ← Q(x, a) + αδ e(x, a), where e(x, a) is the trace value. When combined with
eligibility traces, Sarsa and Q-Learning are called Sarsa(λ ) and Q(λ ), respectively.
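For illustration, a minimal tabular Sarsa(λ) episode with replacing traces follows; the ε-greedy helper, the Gym-style env.reset()/env.step() interface, and the hyperparameter values are assumptions of this sketch, not the article's implementation.

```python
import numpy as np

def epsilon_greedy(Q, x, epsilon, rng):
    """Pick a random action with probability epsilon, else a greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))
    return int(np.argmax(Q[x]))

def sarsa_lambda_episode(env, Q, alpha=0.1, gamma=0.99, lam=0.9, epsilon=0.1,
                         rng=np.random.default_rng(0)):
    """Run one episode of tabular Sarsa(lambda) with replacing traces."""
    e = np.zeros_like(Q)                      # eligibility traces
    x = env.reset()
    a = epsilon_greedy(Q, x, epsilon, rng)
    done = False
    while not done:
        x2, r, done = env.step(a)
        a2 = epsilon_greedy(Q, x2, epsilon, rng)
        delta = r + gamma * Q[x2, a2] * (not done) - Q[x, a]   # Sarsa TD error, Eq. (4)
        e[x, a] = 1.0                         # replacing trace for the visited pair
        Q += alpha * delta * e                # update all pairs with non-zero traces
        e *= gamma * lam                      # decay traces
        x, a = x2, a2
    return Q
```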
Function approximators can also improve performance in large state spaces and are re-
quired for continuous state spaces. Linear approximators represent the Q-value for a given
state-action pair as φ(x, a)^T θ, where φ(x, a) maps the state-action pair (possibly nonlinearly)
to a feature vector representation, and θ is the vector of weights. Combined with
eligibility traces, the update takes the form θ ← θ + αδ e, where e is the vector of traces. In
this article we employ tile coding [1,73], a linear function approximator based on a binary
feature vector.
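A minimal sketch of the corresponding linear update with binary features; the list of active feature indices would come from a tile coder, which is assumed here rather than implemented.

```python
import numpy as np

def q_value(theta, active_idx):
    """phi(x, a)^T theta for a binary feature vector: sum the active weights."""
    return theta[active_idx].sum()

def linear_td_lambda_update(theta, e, active_idx, delta,
                            alpha=0.1, gamma=0.99, lam=0.9):
    """One TD(lambda) step: theta <- theta + alpha * delta * e.

    theta and e are weight and trace vectors of equal length; active_idx are
    the indices of the binary features that are 1 for the current (x, a).
    """
    e *= gamma * lam            # decay all traces
    e[active_idx] = 1.0         # replacing traces on the active binary features
    theta += alpha * delta * e
    return theta, e
```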
2.2 Shaping
The concept of shaping stems from the field of operant conditioning, where it denotes a
training procedure of rewarding successive approximations to a desired behavior [67]. In RL,
it may refer either to training the agent on successive tasks of increasing complexity, until the
desired complexity is reached [61,27,65,59,57,17], or, more commonly, to supplementing
the MDP’s reward function with additional, artificial rewards [85, 14, 50, 53, 38, 49, 15, 3, 24].
This article employs shaping functions in the latter sense.
Because the shaping function modifies the rewards the agent receives, the shaped agent
may no longer learn an optimal policy for the original MDP (e.g. [57]). Ng et al. [53] show
that, in order to retain the optimal policy, a shaping function should consist of a difference
of potential functions over states. Like a value function, a potential function Φ : X → R
specifies the desirability of a given state. Potential-based shaping functions take the form
F^x_{x'} = γΦ(x') − Φ(x); hence, a positive reward is received when the agent moves from a low
to a high potential, and a negative reward when moving in the opposite direction. Similarly to
potentials in physics and potential field methods in mobile robot navigation [36], a shaping
potential thus results in a “force” encouraging the agent in a certain “direction”.
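As a minimal sketch, adding the shaping term to the environment reward looks as follows; Phi is whatever potential function the agent uses, and the convention that the potential is 0 at terminal states (standard for episodic tasks) is an assumption of the example.

```python
def shaped_reward(r, x, x_next, Phi, gamma, terminal=False):
    """Return r + F(x, x'), with F(x, x') = gamma * Phi(x') - Phi(x)."""
    phi_next = 0.0 if terminal else Phi(x_next)
    return r + gamma * phi_next - Phi(x)
```

The shaped reward is then fed to the TD error in place of r, leaving the rest of the learning algorithm unchanged.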
State potential functions can miss out on additional information provided by the actions
for a state. To address this, Wiewiora et al. [88,87] introduced shaping functions of the form
F^{xa}_{x'a'} = γΦ(x', a') − Φ(x, a), where a' is defined as in the learning rule. In this form, shaping
functions closely resemble advice-giving methods [45, 81] in that they bias an agent’s policy
towards the actions that the shaping function estimates as most valuable.1 Wiewiora et al.
show that using F is equivalent to initializing the agent’s Q-table to the potential function,
under the same experience history. However, some important differences remain. Unlike
shaping, initialization biases the agent’s actions before they are taken. In addition, shaping
can be applied to RL with function approximation and in cases where the experimenter does
not have access to the agent’s value function [88]. Either way, the results in this article apply
equally well to shaping as to initialization methods.
1 ... We use the term "shaping" for both methods, and let function arguments resolve any ambiguity.
We are interested in a scenario in which the agent is "interacting sequentially" with the domain,
similar to lifelong learning [79]. That is, starting with an empty history h and potential
function Φ : X+_d → 0, it goes through the following steps:
1. Receive a task m sampled with replacement from the domain according to D(m). The
task model is unknown.
2. Learn the solution to m with a model-free learning algorithm, aided by Φ. Add the
solution to the solution history h.
3. Update Φ based on h.
4. Go to step 1.
Nonetheless, this article applies equally well to batch scenarios in which the agent receives
a sequence of source tasks sampled from the domain upfront; the solutions to this sequence
then just become the history h.
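The protocol above can be sketched as a simple loop; sample_task, solve_task, and update_potential are placeholders for the domain sampler, the shaped model-free learner, and whichever estimate of Φ from section 3 is being used.

```python
def lifelong_loop(sample_task, solve_task, update_potential, n_tasks):
    """Sequential multi-task interaction with a learned shaping potential."""
    h = []                                    # solution history
    Phi = lambda x, a=None: 0.0               # initial potential: identically zero
    for _ in range(n_tasks):
        m = sample_task()                     # step 1: m ~ D(m), model unknown
        h.append(solve_task(m, Phi))          # step 2: learn m aided by Phi, store solution
        Phi = update_potential(h)             # step 3: recompute Phi from the history
    return Phi, h
```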
where R_m is the return accrued in task m and r_{t,m} is the immediate reward obtained on
timestep t in task m. Note that the task is essentially a hidden variable since the potential
function does not take it as input. Thus the potential function that satisfies (6) may perform
poorly in some tasks, though it performs best in expectation across tasks.
Since shaping with Φ is equivalent to initializing the Q-table with it, solving (6) is
equivalent to finding the best cross-task initialization of the Q-table. Unfortunately, because
of interacting unknown task and learning dynamics, there is no obvious way to compute
such a solution efficiently in the learning case, and search approaches quickly become im-
practical. However, it is possible to derive a solution for the planning case that provides the
lowest bound on the number of iterations needed to converge.
In the following sections, we first derive an expression for the optimal value table ini-
tialization given that the task models are available and solved using value iteration. We show
that the optimal initialization in this case minimizes, in expectation, the weighted geometric
mean of the max-norm distance to the optimal value function of the target task. We then discuss three
strategies for efficiently approximating Φ*_L for the learning case.
In the planning case, an optimal initialization is one that minimizes the expected number of
iterations to solve the target task.
Theorem 1 The initial value function Q*_0 that in expectation minimizes the number of iterations
needed to solve a given task m from a domain d by value iteration is, for γ ∈ (0, 1),

Q*_0 = argmax_{Q_0} ∑_m D(m) log_γ ||Q*_m − Q_0||_∞.
Proof By Banach’s theorem, the value iteration sequence for a single task converges at a
geometric rate [6]:
||Q*_m − Q^n_m||_∞ ≤ γ^n ||Q*_m − Q_0||_∞,

where Q^n_m is the value function on task m after n iterations and ||·||_∞ denotes the max-norm.
This equation provides a lower bound on the number of iterations n needed to get
within an arbitrary distance of Q*_m, in terms of γ and the initial value function Q_0. That is,
to get within ε of Q*_m, we need ||Q*_m − Q^n_m||_∞ ≤ ε, which is satisfied if γ^n ||Q*_m − Q_0||_∞ ≤ ε. Let
δ_m = ||Q*_m − Q_0||_∞. Assuming δ_m > 0 and ε < δ_m, then

γ^n δ_m ≤ ε
n ≥ log_γ(ε/δ_m) = log_γ ε − log_γ δ_m.
For multiple tasks the expected lower bound is

n̄ ≥ ∑_m D(m) [log_γ ε − log_γ δ_m] = log_γ ε − ∑_m D(m) log_γ δ_m. (7)

Minimizing this bound over Q_0 is therefore equivalent to maximizing ∑_m D(m) log_γ δ_m = ∑_m D(m) log_γ ||Q*_m − Q_0||_∞, which yields the expression in the theorem.
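As a quick numeric illustration of the bound in (7), with γ = 0.9 and ε = 10^-3 chosen only for the example:

```python
import math

gamma, eps = 0.9, 1e-3                     # illustrative values, not from the article
for delta_m in (1.0, 10.0, 100.0):         # delta_m = ||Q*_m - Q_0||_inf
    n = math.log(eps / delta_m, gamma)     # n >= log_gamma(eps / delta_m)
    print(f"delta_m = {delta_m:6.1f}: at least {math.ceil(n)} iterations")
```

A better initialization (smaller δ_m) directly lowers the bound, which is what the theorem optimizes in expectation over tasks.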
Intuitively, a good initialization of the Q-function is the one closest in expectation, according
to some norm, to some desired target value function Qm of the unknown next task m the agent
will face. This can be seen as a definition of a cross-task value function Qd that predicts the
expected value of a given state-action pair (x, a) on an unknown new task sampled from
the domain, given the target values observed for (x, a) on previous tasks. If the norm is
Euclidean, Qd gives the least-squared-error prediction of the value of (x, a) on the new task.
That is, it minimizes the mean squared error (MSE) across tasks:
Q_d = argmin_{Q_0} ∑_{m∈M} D(m) ∑_{(x,a)∈X+_m} Pr(x, a|m) [Q_m(x, a) − Q_0(x, a)]², (8)
where Pr(x, a|m) is a task-dependent weighting over state-action pairs that determines how
much each pair contributes to the error. As is common in least squares, one may want to
weight non-uniformly, in our case for example if some state-action pairs occur more often
than others. It is not immediately clear how to define Pr(x, a|m). In section 3.6, we discuss
four possible options for this distribution. For now, we assume Pr(x, a|m) is given. By setting
the gradient of (8) to zero and solving, we obtain:

Q_d(x, a) = ∑_{m∈M} Pr(m|x, a) Q_m(x, a), (9)

where Pr(m|x, a) = Pr(x, a|m)D(m)/Pr(x, a). Note that Pr(m|x, a) is a natural way of selecting the right tasks to average over for a given
pair: it is zero for any (x, a) ∉ X+_m, leaving such pairs out of the average.
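A minimal sketch of (9) for tabular per-task Q-functions, using the uniform weighting Pr(x, a|m) = 1/|X+_m| (option 4 in section 3.6); dictionaries keyed by (x, a) stand in for the tables and are an assumption of the example.

```python
from collections import defaultdict

def cross_task_q(task_qs, task_probs):
    """Compute Q_d(x, a) = sum_m Pr(m|x, a) Q_m(x, a), as in Eq. (9).

    task_qs: list of dicts mapping (x, a) -> Q_m(x, a), one per task m.
    task_probs: list of D(m) values in the same order.
    Uses uniform Pr(x, a|m) = 1/|X+_m|, renormalized over the tasks in which
    (x, a) actually occurs.
    """
    weighted_sum, weight_total = defaultdict(float), defaultdict(float)
    for Qm, Dm in zip(task_qs, task_probs):
        w = Dm / len(Qm)                     # D(m) * Pr(x, a|m) under uniform weighting
        for xa, q in Qm.items():
            weighted_sum[xa] += w * q
            weight_total[xa] += w
    return {xa: weighted_sum[xa] / weight_total[xa] for xa in weighted_sum}
```

Replacing the per-task tables with Q*_m or Q̃_m yields the targets discussed in the following subsections.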
For linear binary function approximators, solving (8) leads to
where w^i_d and w^i_m are the weight of feature i across tasks and in task m respectively, and f_i
is the value of feature i (either 0 or 1). See Appendix C for the derivation. Thus, instead of
averaging per state-action pair as in (9), Eq. 10 averages per binary feature. In the following
sections, we follow the format of (9) and use table-based value functions, unless mentioned
otherwise.
In supervised learning, the targets Qm are given. In reinforcement learning, an intuitive
choice of target might be the optimal value function of each task. However, there is no
guarantee that using a potential function that approximates this target leads to the best online
return. In the following three subsections, we propose three different types of Qm to use as
target, each leading to a different potential function.
By setting the target to be the optimal value function Q*_m of each task, we obtain:

Q*_d(x, a) = ∑_{m∈M} Pr(m|x, a) Q*_m(x, a). (11)
While approaches equivalent to (11) have been used successfully as potential functions
or initializations in previous work [68,74], they are not guaranteed to be optimal with respect
to the online return. Since Q∗d makes predictions based on the optimal policy of each task, it
may be too optimistic: First, it is likely to overestimate the value of some state-action pairs
in some tasks. Second, the actions that Q∗d estimates as best may not be cautious enough:
it assumes the optimal policy will be followed from the next timestep onward, and hence
ignores the uncertainty caused by exploration. This may negatively affect online return, in
particular in risky tasks, i.e., tasks where some actions may result in a large negative reward,
for example a helicopter crash or a fall down a steep cliff. Note that this “risky” ordering of
actions may occur even in cases where value is estimated correctly (for example, for states
that only occur in one task). We observe this phenomenon experimentally in the cliff domain
in section 5.1.
In some cases, the agent’s Q-function on a task m may never reach Q∗m , even after learning.
This may happen, for example, when using function approximation or an on-policy algo-
rithm with a soft policy. When this occurs, it may be better to use a potential function based
on an average over Q̃m , the value function to which the learning algorithm converges. The
derivation is the same as in the previous section, yielding:

Q̃_d(x, a) = ∑_{m∈M} Pr(m|x, a) Q̃_m(x, a). (12)
The previous two approaches to defining cross-task value for the potential function both
rely on value function for tasks that have already been solved. Such approaches may be too
optimistic, even if (12) is used instead of (11), since they are based on the result of learning
and implicitly assume a (near-) optimal policy will be followed from the next time step
onward, which is not typically the case during learning.
In this section, we propose another definition which, in a sense, more closely resembles
the traditional definition of value in that it estimates value of a single fixed cross-task policy
µ : X_d × A → [0, 1] s.t. ∑_a µ(x, a) = 1 that assigns the same probability to a given state-action
pair, regardless of the current task. This definition might also be more suitable for use as Φ
since, like Φ, it is fixed across tasks. The value function of the best possible cross-task
policy, µ ∗ , will typically make lower estimates than either Q∗d or Q̃d , since µ ∗ is usually
not optimal in every (or any) task. For example, consider a domain with a goal location in
an otherwise empty room. If the distribution over goal locations is uniform, and the state
provides no clue as to the goal position, then µ ∗ is a uniform distribution over actions in
every state. We define the cross-task value of a state under a stationary policy µ as
Q^µ_d(x, a) = ∑_{m∈M} Pr(m|x, a) Q^µ_m(x, a). (13)
Like (11) and (12), this follows (9), except that it averages over the values of a single policy
instead of multiple task-dependent ones.
The fact that µ is task-independent makes the task essentially a hidden variable and
MTRL similar to a POMDP (Partially Observable MDP), for which µ is a memoryless policy
that conditions only on the current observation and Q^µ_d is similar to the value of such a
policy as defined in [66]. Therefore, there need not exist a stationary policy (stochastic or
deterministic) that maximizes the value of each state simultaneously [66]. Consequently, tra-
ditional dynamic programming methods may not apply. One way to overcome this problem
is to assign a scalar value to each possible policy:
V^µ_d = ∑_{x∈X_d} Pr(x) V^µ_d(x), (14)

V^µ_d(x) = ∑_{a∈A} µ(x, a) Q^µ_d(x, a), (15)

where V^µ_d is the domain-wide value of µ and Pr(x) is a distribution that assigns an appropri-
ate measure of weight to each state x. Two options for Pr(x) are the start-state distribution or
the occupation probability of x, Pr(x|µ). For the POMDP case, Singh et al. [66] show that
defining the optimal policy as
All three proposed potential function types take the general form

Q_d(x, a) = ∑_{m∈M} Pr(m|x, a) Q_m(x, a),

where Pr(m|x, a) = Pr(x, a|m)D(m)/Pr(x, a). As indicated in section 3.2, it is not immedi-
ately clear how to define Pr(x, a|m). Four options are:
1. The stationary distribution induced by the policy corresponding to each Q_m. For
Q*_d and Q^µ_d, this would be π*_m and µ*, respectively. For Q̃_d it would be the soft policy
(e.g. ε-greedy). Since Q^µ_d averages over all Q^µ_m, this seems a good option. However, it
may be problematic for Q*_d and Q̃_d as it represents only the distribution over (x, a) after
learning. There may be state-action pairs that the policy never, or rarely, visits in some
tasks, but clearly the values in these tasks should still be included in the average.
2. The distribution induced by the learning process. Since we are interested in improv-
ing performance during learning, another choice may be a distribution over (x, a) during
learning. However, it is unclear how to define this distribution, since it depends, among
other things, on when learning is halted and the trajectory through state-action space
taken during learning. One possibility is to take one sample from all possible such tra-
jectories and define
Pr(x, a|m) = ∑_{t=1}^{T} Pr(x, a|m, π_t) Pr(π_t|m),

where Pr(π_t|m) = 1/T for the soft policy the agent was following at time step t when
learning task m, Pr(x, a|m, π_t) is the stationary distribution over (x, a) given π_t, and T
is a predetermined stopping criterion, such as a fixed number of learning steps or a
convergence threshold. Clearly, this option does not apply to Q^µ_d.
3. The start-state distribution and uniform random policy. This defines Pr(x, a|m) = Pr(x_0 = x|m) Pr(a),
where Pr(x_0 = x|m) is the start-state distribution of m and Pr(a) = 1/|A| for all a. This
definition appropriately captures the distribution over (x, a) when the agent enters a new
task without prior knowledge.
4. The uniform distribution. This defines Pr(x, a|m) = 1/|X+_m| for all (x, a) ∈ X+_m.
For Q∗d and Q̃d , the latter option seems the most sensible one. Under the first and third
definitions, a probability of zero might be assigned to a state-action pair in some tasks, or
even in all, which is clearly not desirable. Also, since (11) and (12) are concerned with pre-
diction of the optimal (approximate) value of a pair (x, a) in a new target task, the underlying
assumption is that after taking a in x, the optimal (soft) policy for that task will be followed.
However, the first two definitions in the list above make assumptions about the policy that
has been followed so far, which is irrelevant in this context. Empirical comparison of the
options revealed no significant difference for Q*_d and Q̃_d. For Q^µ_d, the stationary distribution
induced by µ is the natural choice.
4 Representation Selection
So far, the discussion has focused on different potential functions, without regard to the
knowledge the agent has of the domain; in fact, the formulas in the previous section are
based on the set of all tasks in the domain, thereby implicitly assuming full domain knowl-
edge. This discussion is important for selecting the right potential function; as section 5.1
will show, the best-performing potential function may depend on the domain and learning
parameters, even with full domain knowledge.
In practice, the agent’s knowledge of the domain is limited by the number of tasks it
has observed. Therefore, we now turn to the setting in which only a sample sequence of
tasks is available to the agent. A central aim in this setting is to generalize well from the
sample to new data. As in supervised learning, representation is key to generalization. This
section discusses the theory underlying the second key step in maximizing online return
through potential functions for MTRL: finding the representation that approximates this
target as closely as possible while at the same time generalizing well given limited domain
information (observed tasks). To this end, we propose a definition of relevance that can be
used for learning representations in MTRL. This definition applies to any potential function
type; in fact, it applies to any table-based function that maps vectors to scalars.
In MTRL, new data may consist of both new state-action pairs and new tasks. Our
central objective, therefore, is to discover which information is relevant across tasks, which
we call domain-relevant information. While our definition also captures which information
is relevant within tasks (task relevant), this subject has been extensively addressed in existing
research. Therefore, we focus primarily on discovering domain-relevant representations.
The robot navigation example in the introduction illustrates the distinction between task
and domain relevance. Recall that here, the robot needs to locate a bomb at a fixed task-
dependent location in buildings with different layouts, where each task is a different build-
ing. Now imagine that in addition, a building can be green or red; in red buildings, the robot
receives negative reward when standing still (e.g. because it is in enemy territory and needs
to keep moving). To describe an optimal policy for each task, only position, which constitutes
a Markov state representation, is needed to represent state. However, the value of a given
position needs to be re-learned in every task. Since position does not represent information
that can be retained between tasks, this feature is task relevant. Building color is not useful
within a given task, since its value is constant. However, across tasks it represents informa-
tion about standing still, which receives additional punishment in red buildings. Therefore,
this feature is domain relevant, but not task relevant. Finally, distance sensors are useful both
within and across tasks, and are therefore both task and domain relevant.
Projecting the full feature set onto a subset of the task-relevant features may create an ab-
stract state space that is smaller than the original state space, and can therefore help learn the
task more quickly, under certain conditions on the projection [43]. In the multi-task setting,
it is possible to define this abstract space before entering a new task, by identifying valid
abstractions (task-relevant representations) of previously experienced tasks and transferring
these to the new task (e.g. [83,40,19,37]).
Domain-relevant features also allow for a higher level of abstraction, but in addition
allow the agent to deduce rules from the abstract representation and reason with them,
right from the start of a new task, thereby obviating the need to re-learn this task-invariant
knowledge. Of course, it should be possible for the agent to override the heuristic rules given
new information garnered from interaction with a specific task. Shaping functions are a good
candidate for these kinds of rules, endowing the agent with prior knowledge that gradually
degrades as the agent accumulates experience of a task.
Li et al. [43] provide an overview and classification of several state abstraction schemes
in reinforcement learning. We employ their definition of a state abstraction function, which maps each ground state to an abstract state.
4.1 Relevance
Since our main goal is to find good representations for potential functions, our targets for
abstraction are Q*_d, Q̃_d, or Q^µ_d, the three types of potential function proposed in section 3.
Which of these is used does not matter for the theory, as any Q-function can serve as a basis
for relevance. Therefore, in what follows, we remain agnostic to the target for relevance.
Since the agent has only experienced a sample sequence of tasks from the domain, it
cannot compute this target Qd exactly. Instead, it can approximate Qd by computing a cross-
task value function based on the sequence. Let c = (c_1, c_2, ..., c_k) be a sequence of |c| tasks
sampled from M, and X_c = ∪_{c_i∈c} X_{c_i}.
In the following, we treat the action as just another state feature so that it can be ab-
stracted away if it is irrelevant. Let x+ ∈ X+_c denote a state-action pair. Let Q_c be a cross-task
Q-function computed on a sequence of tasks c, Q_c : X_c × A → R. If |c| = 1, Q_c = Q_m. All
of the definitions for Q_d in section 3 follow the general form

Q_c(x+) = ∑_{c_i∈c} Pr(c_i|x+, c) Q_{c_i}(x+), (17)

where

Pr(c_i|x+, c) = Pr(x+|c_i) Pr(c_i|c) / Pr(x+|c),
where Pr(c_i|c) = 1/|c| for all c_i, since the probability of sampling a task from the domain,
D(m), is already reflected in Pr(c). Furthermore, given a sample of tasks, the best we can
do given the prior of the sample is to assign each task a probability corresponding to its
frequency in the sample, which is accomplished by 1/|c|.
Several notions of predictive power on Qc are possible. For example, the Kullback-
Leibler divergence between the marginal distribution over Qc and the distribution condi-
tioned on the representation of which we wish to measure relevance [68], or the related
measures of conditional mutual information [29] or correlation [47]. The disadvantage of
these measures is that they do not take the magnitude of the impact of a representation into
account; for example, two representations may cause equal divergence, but the difference
in expected return associated with one set may be much larger than the other. To address
this problem, we propose a measure of relevance2 that is proportional to the squared error
in predicting Qc .
Let φ(X+_c) = Y_c denote all possible projections of X+_c onto Y using φ. For ease of
notation, denote the set of ground state-action pairs belonging to an abstract state-action
pair y ∈ Y_c, φ^{-1}(y), by X^y_c. Defining the Q-function of an abstract pair as the weighted
average of the values of its corresponding ground pairs is a natural definition:

Q̄_c(y) = ∑_{x+∈X^y_c} Pr(x+|y, c) Q_c(x+). (18)

It follows that the Q-value of the null abstract state, denoted Q̄^∅_c, which corresponds to the
empty set of features, is the mean of all Q-values.
The Q-value of a given abstract state-action pair y generally differs from those of at
least some ground state-action pairs corresponding to y. The error ε for a given abstract
state-action pair is the weighted mean squared error of ground pairs with respect to y,

ε(y) = ∑_{x+∈X^y_c} Pr(x+|y, c) [Q_c(x+) − Q̄_c(y)]², (19)

and the relevance ρ of the abstraction is the expected error over abstract pairs,

ρ(φ, Q_c) = ∑_{y∈Y_c} Pr(y|c) ε(y). (20)
2 Relevance is not a measure in the strict mathematical sense; because of dependence between feature sets,
ρ(F ∪ G) ≠ ρ(F) + ρ(G) for some disjoint feature sets F and G and relevance ρ.
When using an abstract Q-function as defined in (18), this equals the sum of weighted variances
of the Q-values of ground state-action pairs corresponding to a given y:

ρ(φ, Q_c) = ∑_{y∈Y_c} Pr(y|c) Var(Q_c(X^y_c)). (21)
Note that the aim is therefore to find an abstraction with low relevance: this abstracts
away non-relevant parts of the original representation and keeps the relevant parts. This is
in line with the terminology employed by Li et al. [43], who designate the abstraction that
preserves Q∗ as a Q∗ -irrelevance abstraction. Similarly, an abstraction with low relevance
on Qc should be thought of as a Qc -irrelevance abstraction.
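A sketch of computing (21) for a tabular Q_c; phi maps a ground state-action pair to its abstract pair, and a uniform Pr(x+|c) is assumed when no weights are supplied.

```python
from collections import defaultdict
import numpy as np

def relevance(Qc, phi, weights=None):
    """rho(phi, Qc) = sum_y Pr(y|c) Var(Qc(X^y_c)), as in Eq. (21).

    Qc: dict mapping ground pair x+ -> Q_c(x+).
    phi: function mapping x+ -> abstract pair y.
    weights: optional dict x+ -> Pr(x+|c); uniform if omitted.
    """
    if weights is None:
        weights = {xp: 1.0 / len(Qc) for xp in Qc}
    groups = defaultdict(list)
    for xp, q in Qc.items():
        groups[phi(xp)].append((q, weights[xp]))
    rho = 0.0
    for y, members in groups.items():
        qs = np.array([q for q, _ in members])
        ws = np.array([w for _, w in members])
        p_y = ws.sum()                               # Pr(y|c)
        mean = (ws * qs).sum() / p_y                 # abstract Q-value, Eq. (18)
        rho += p_y * ((ws / p_y) * (qs - mean) ** 2).sum()   # weighted variance
    return rho
```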
4.2 k-Relevance
The previous section defined the relevance of an abstraction with respect to a Q-function
computed on a sequence of tasks. This single definition captures both task and domain rel-
evance; the former just requires a one-task sequence. Naturally, relevance often depends on
both the length of the task sequence over which it is computed and on the tasks that the
sequence contains. For example, in the robot navigation domain as a whole, the position
feature is irrelevant since, given a certain position, there is no way to know what direction
to go. In other words, in Qd , the average value function computed over the whole domain,
Q-values will not vary with the position feature, and therefore the position feature will not
result in any error on Qd if it is discarded.
However, given a sample sequence of two tasks, it is likely that position is still rele-
vant, e.g., if the bomb is in the same general area of the building in both tasks. In practice,
the agent, while interacting with the domain, will have at its disposal a growing sequence
of tasks based on which it must construct a domain-relevant representation. Doing so is
challenging because representations may appear relevant given the observed sequence but
actually not be domain relevant. Intuitively, the longer the sequence, the better sequence
relevance approximates domain relevance. Therefore, instead of just computing relevance
on the currently observed sequence, we should try to predict how it changes with increas-
ing sequence length. Decreasing relevance could mean that the feature, while relevant on the
current sequence, is not relevant in the long run. However, it is not enough to simply observe
that the relevance of a given representation is decreasing because it may plateau above zero.
To predict how relevance changes as more tasks are observed, we use the definitions in
the previous section to define the k-relevance of a representation as the expected relevance
ρk over all possible sequences of length k. Let Ck be the set of all possible sequences c
of length k, i.e., |C_k| = |M|^k. The probability of sampling a given sequence c from C_k is
Pr(c) = ∏_{i=1}^{|c|} D(c_i). Then we have the following definition for k-relevance:
Definition 5 (k-relevance) The k-relevance of an abstraction φ : X+_d → Y_d in a domain
d = ⟨D, M⟩ is the expected relevance of φ on a given Q-function computed from a task
sequence of length k sampled from M according to D:

ρ_k(φ) = E[ρ(φ, Q_c) | c ∈ C_k] = ∑_{c∈C_k} Pr(c) ρ(φ, Q_c). (22)
Given this definition, an abstraction φ is strictly domain relevant, DR(φ), if and only if its
k-relevance ρ_k remains positive as k goes to infinity:

DR(φ) ⇔ lim_{k→∞} ρ_k(φ) > 0. (23)
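For a small domain, Definition 5 can be evaluated by brute-force enumeration of C_k, as in this sketch; cross_task_q and relevance are the routines sketched earlier, and tasks and D (the per-task Q-tables and task distribution) are assumptions of the example.

```python
from itertools import product

def k_relevance(tasks, D, phi, k, cross_task_q, relevance):
    """rho_k(phi) = sum over all |M|^k sequences c of Pr(c) * rho(phi, Q_c), Eq. (22)."""
    rho_k = 0.0
    for c in product(range(len(tasks)), repeat=k):
        pr_c = 1.0
        for ci in c:
            pr_c *= D[ci]                           # Pr(c) = prod_i D(c_i)
        Qc = cross_task_q([tasks[ci] for ci in c],
                          [1.0 / k] * k)            # Pr(c_i | c) = 1/|c|
        rho_k += pr_c * relevance(Qc, phi)
    return rho_k
```

Since |C_k| grows exponentially with k, BKR below instead works with a limited number of sampled sequences.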
Fig. 1 illustrates how k-relevance changes with k. Relevance of a given φ is the same on each
single-task sequence for a given task m, i.e. sequences for which each element c_i = m. For
k = 1, k-relevance comprises only single-task sequences, but the share of these sequences
decreases exponentially with k; in the example, it is D(1)^k + D(2)^k + D(3)^k. In general, the
sequences represent the true distribution over tasks ever more closely.
For k = 2, every task is combined with every task in the domain. Hence for k > 2, no new
task combinations arise; however, we need a way to quantify relevance on, e.g., c = (1, 2, 3)
given the relevance on (1, 2), (1, 3), and (2, 3). We show in Appendix B that it follows from
the expression of relevance in terms of variance (Eq. 21) that relevance on any sequence
equals the sum of covariances between the Q-functions of all task pairs involved in the
sequence. The right column of Fig. 1 visualizes this; the highlighted areas correspond to the
covariance matrix (and hence relevance) for k − 1; therefore, (k + 1)-relevance is a function
of k-relevance. The following two theorems follow directly from this, together with the fact
that sequences approximate the true distribution over tasks ever more closely.
Theorem 2 Let φ be an abstraction with abstract Q-function as in definition 2, and let
ρ_k = ρ_k(φ) for any k. Let d(x, y) = |x − y| be a metric on R, and let f(ρ_k) = ρ_{k+1} map
k-relevance to (k + 1)-relevance. Then f is a contraction; that is, for k > 1 there is a constant
κ ∈ (0, 1] such that

d(f(ρ_k), f(ρ_{k−1})) ≤ κ d(ρ_k, ρ_{k−1}).

Furthermore, if d(ρ_2, ρ_1) ≠ 0, then f is a strict contraction, i.e. there is a κ ∈ (0, 1) such
that the above holds.
Fig. 1: Left three columns (k = 1..3): probability of each sequence as k increases, shown as
portions of a rectangle with area 1, on a 3-task domain with D(1) = 0.5, D(2) = D(3) =
0.25. Probability of sequence (1, 2) is indicated as p12. For each k + 1, the area of each
sequence for k is split into new sequences with area proportional to D. Right column: co-
variance matrices for the sequences marked in bold on the left. The weight of each matrix
element is equal, namely 1/k^2. Relevance on a given sequence is the sum of all matrix
elements. In turn, k-relevance is the sum over all sequences.
FS-TEK exploits the notion of k-relevance introduced in the previous sections. It focuses on
finding domain-relevant representations, since determining task-relevant representations us-
ing k-relevance is relatively straightforward. As mentioned previously, the key idea behind
FS-TEK is to fit an exponential function to a candidate representation’s relevance based on
an observed sequence of tasks and then extrapolate it to estimate the representation’s domain
relevance.
Since it is a feature selection algorithm, FS-TEK constructs abstractions of the form
φ (x) = x[Y] = y, where Y ⊆ X and x[Y] denotes the values of the features Y in vector x.
That is, the abstract state y to which a state x is mapped consists of the values of the features
in some relevant set Y. Viewed another way, Y is the result of removing some irrelevant set
F = X − Y from X. In the following, when we refer to the relevance of a set F, we mean the
relevance of the abstraction that removes F from X.
The abstraction function φ (·) is fundamentally different for binary linear function ap-
proximator representations than table-based representations; in the former case, the function
needs to identify and cluster abstract states directly in feature space. However, FS-TEK itself
operates similarly for both representations.
FS-TEK’s main loop is based on an iterative backward elimination procedure (e.g. [32,
28]). Because of interdependence between features, it is not always sufficient to select fea-
tures based on their relevance in isolation. Features may be irrelevant on their own, but
relevant together with another feature; similarly, a feature that seems irrelevant may become
relevant once another feature is removed. Backward elimination starts with the full feature
set and iteratively removes features according to a measure of relevance (k-relevance, for
FS-TEK). In contrast, forward selection methods start with the empty set and iteratively
add features. While forward selection may yield a smaller final feature set, it may miss
interdependencies between features (e.g. [28]). Therefore, FS-TEK’s main procedure uses
backward elimination. Nonetheless, it attempts to combine the advantages of both methods
by using forward selection to decide which feature to remove when more than one feature is
marked for elimination.
Algorithm 1 specifies the main body of FS-TEK. It takes as input the sequence s of
solutions to observed tasks and a parameter α(k) that specifies the confidence level for the
statistical test on relevance, possibly depending on sample sequence length k.
In each iteration, FS-TEK starts by computing the k-relevance of each feature united
with the current set of features to be removed, with k ranging from 1 to the current number of
observed tasks (line 6; the details of BKR, Backward-k-Relevance, are provided in algorithm
2 below). For each feature, this results in a dataset with k as input and relevance as target
(each column of the matrix D). The algorithm subsequently does a nonlinear least-squares
fit of the exponential function f (k; θ ) to each feature’s relevance data (line 11).
Next, the function value and confidence interval are computed for the point where f
asymptotes (lines 12-15; in line 13, CI(a, α(|s|)) is the confidence interval at point a for the
length of the current observed sequence of tasks). If the confidence interval’s lower bound
is less than or equal to 0, the feature is classified as domain irrelevant and added to the set
of features to be removed (line 20).
This procedure often marks more than one feature for removal. FS-TEK uses forward
selection to decide which of the marked features to remove (lines 19-21). In this check, the
relevance of the set of features to remove is tested in isolation, without taking into account
the features that remain. Contrary to BKR, which computes relevance for a range of k, FR
(Forward Relevance) computes relevance only on the currently observed sequence of tasks.
Using forward selection can enable FS-TEK to select a smaller subset of relevant features
and provides a “second opinion” to supplement the noisy estimate of BKR.
Algorithm 1 FS-TEK
Input: a sequence s of task solutions Qsi , i ∈ {1, 2, . . . , |s|}; α(k), the confidence level
Output: A set R of features to remove
1:
2: R ← ∅
3: f (k; θ ) = θ1 + θ2 exp(−θ3 k)
4: repeat
5: F ← ∅ //Features marked for removal in this iteration
6: Calculate D, the |s| × |X+ − R| matrix of k-relevance per feature, using BKR
7:
8: // Each feature for which extrapolated function of relevance does not differ
9: // significantly from 0 is marked for removal
10: for all Xi ∈ X+ − R do
11: θ̂ ≈ argmin_θ̂ ∑_{k=1}^{|s|} [f(k; θ̂) − D(k, i)]² //least-squares fit
12: a = min{a : df(k; θ̂)/dk (a) = 0}
13: if f (a; θ̂ ) − CI(a, α(|s|)) ≤ 0 then
14: F ← F ∪ Xi
15: end if
16: end for
17:
18: // Of all marked features, the one with weakest forward relevance is removed
19: for all Xi ∈ F do
20: r(i) = FR(s, R ∪ Xi )
21: end for
22: R ← R ∪ argmini r(i)
23: until F = ∅
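The extrapolation step at lines 11-13 can be sketched with SciPy as follows; the asymptote of f(k; θ) = θ_1 + θ_2 exp(−θ_3 k) is θ_1, and a normal-approximation confidence bound from the fit covariance is used here as a stand-in for the article's CI(a, α(|s|)).

```python
import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import norm

def f(k, t1, t2, t3):
    """Exponential relevance model f(k; theta) = theta1 + theta2 * exp(-theta3 * k)."""
    return t1 + t2 * np.exp(-t3 * k)

def domain_irrelevant(relevances, alpha=0.05):
    """Fit f to one feature's k-relevance curve and test its asymptote.

    relevances[i] is the estimated (i+1)-relevance. Returns True if the lower
    confidence bound on the asymptote theta1 is <= 0, i.e. the feature would be
    classified as domain irrelevant.
    """
    ks = np.arange(1, len(relevances) + 1, dtype=float)
    popt, pcov = curve_fit(f, ks, relevances, p0=(0.0, relevances[0], 0.5),
                           maxfev=10000)
    theta1, se = popt[0], np.sqrt(pcov[0, 0])        # asymptote and its std. error
    lower = theta1 - norm.ppf(1 - alpha) * se        # one-sided lower bound
    return lower <= 0
```

In the full algorithm this test is applied per feature to the columns of D, and forward relevance then decides which of the marked features is actually removed.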
The forward and backward relevance methods operate essentially according to similar
principles. Algorithm 2 describes BKR, which computes, for each feature not yet marked
for removal, the k-relevance for the given k. Given a set of task sequences Ck , the average Q-
function for each sequence is computed according to equation 17 (lines 3-5). Each feature’s
k-relevance is then the average relevance taken over those Q-functions, as defined in (22)
(lines 8-12). Relevance is computed for the abstraction φi that removes from the full feature
set a feature set consisting of the features removed in previous iterations united with the
given feature.
FR (algorithm 3) is similar but only computes relevance based on the current task se-
quence. In addition, it adds features to the empty set instead of removing them from the
full set. It first computes an abstract function based on the set G of features to be removed
(lines 2-4). Forward relevance is then the relevance of the null abstraction (the empty set of
features) with respect to the Q-function based on G: i.e., the difference between the error
made by using G and that made by using no features.
Not taking into account the complexity of the nonlinear least-squares optimization pro-
cedure, FS-TEK’s complexity is mainly determined by BKR. Let the number of features
|X+| = P, the size of the state-action space |X+_s| = N, and the maximum number of sequences
max_seqs = M. BKR's worst-case time complexity is O(MkN + (P − |R|)(N +
MN)). However, in practice FS-TEK does not re-sample the sequences at each iteration and
computes the average Q-functions based on the sequences only once at the start of the algo-
rithm, so the MkN term can be removed. This results in a complexity of O(N(P − |R|)(1 +
M)) for BKR when used by FS-TEK.
In the worst case, all features in X+ need to be removed. FS-TEK’s first iteration incurs
a cost of O(|s|NP(1 + M)), since R is empty. On the second iteration, |R| = 1, and thus
the cost is O(|s|N(P − 1)(1 + M)). Since a total of P iterations is needed, the cost in terms
of P progresses as P + P − 1 + P − 2 + · · · + 1 = P(P + 1)/2. Therefore, the total cost is
O(|s|N(P2 + P)(1 + M)), i.e., quadratic in the number of features. Also, while the cost is
linear in the size of the state-action space N, in the worst case N scales exponentially with
the number of features P. In practice, however, usually not all features need to be removed,
and features frequently covary such that N does not scale exponentially with P.
5 Empirical Evaluations
In the previous two sections, we proposed solutions for the two key steps involved in max-
imizing online return in MTRL through potential functions: selection of a good potential
function, and finding a good representation for this function in order to generalize well given
limited domain knowledge. For the first step, we proposed three different types of potential
functions. For the second step, we proposed the novel feature selection algorithm, FS-TEK,
which extrapolates change in relevance to predict true domain relevance. In this section, we
evaluate these contributions empirically on multiple domains.
While most of these domains are simple, they enable the illustration of critical
factors in the performance of shaping functions in MTRL.
For comparison purposes, in this section we assume the agent has perfect knowledge of
each domain and thus compute each potential function using all tasks in the domain. Our
goal is to demonstrate the theoretical advantages of each type of potential function. Assum-
ing perfect domain knowledge enables comparisons of the potential functions’ maximum
performance, untainted by sampling error. In section 5.2, we consider the more realistic
setting in which only a sample sequence of tasks is available, and in which generalization
to unseen next tasks, and thus learning a good representation for the potential function, is
important.
To illustrate a scenario in which Q∗d , the average over optimal value functions, is not the
optimal potential function, we define a cliff domain based on the episodic cliff-walking grid
world from Sutton and Barto [73]. The agent starts in one corner of the grid and needs to
reach a goal location in the corner that is in the same row or column as the start state, while
avoiding stepping into the cliff that lies in between the start and goal.
The domain contains all permutations with the goal and start state in opposite corners
of the same row or column with a cliff between them (8 tasks in total). Each task is a 4x4
grid world with deterministic actions N, E, S, W, states (x, y), and a -1 step penalty. Falling
down the cliff results in -1000 reward and teleportation to the start state. The distribution
over tasks is uniform. We compute each potential function according to the definitions given
in section 3. Finding the cross-task policy µ* has been shown to be NP-hard for POMDPs
[82]. We use a genetic algorithm (GA) to approximate it.3
To illustrate how performance of a given Φ depends on the learning algorithm, we use
two standard RL algorithms, Sarsa and Q-Learning. Since for Q-Learning, Q∗d = Q̃d , we
use Sarsa’s solution for Q̃d for both algorithms. Both algorithms use an ε-greedy policy
with ε = 0.1, γ = 1, and the learning rate α = 1 for Q-Learning and α = 0.4 for Sarsa,
maximizing the performance of both algorithms for the given ε.
We also run an additional set of experiments in which the agent is given a cliff sensor
that indicates the direction of the cliff (N, E, S, W) if the agent is standing right next to it.
Note that the addition of this sensor makes no difference for learning a single task, since
the information it provides is already deducible from the agent’s position and the number of
states per task is not affected. However, the number of states in the domain does increase:
one result of adding the sensor is that tasks no longer have identical state spaces.4
For each potential function, we report the mean total reward incurred by sampling a task
from the domain, running the agent for 500 episodes, and repeating this 100 times. Table 1a
shows performance without the cliff sensor. On this domain, Q^µ_d performs very poorly; one
reason may be that the GA did not find µ ∗ , but a more likely one is that, due to the structure
of the domain, even µ ∗ would incur low return for each state, yielding a pessimistic potential
function.
3 We employ a standard real-valued GA with population size 100, no crossover and mutation with p = 0.5;
mutation adds a random value δ ∈ [−0.05, 0.05]. Policies are constructed by a softmax distribution over the
chromosome values.
4 Note that the addition of this sensor is not the same as the manual separation of state features for the
value and potential function as done in [34, 63] – see related work (section 6). In the experiments reported in
this section, both functions use the exact same set of features.
Table 1: Mean total reward and 95% confidence interval for various shaping configurations
and learning algorithms on the cliff domain. All numbers ×10^4.
As expected, Sarsa outperforms Q-Learning on this domain due to its on-policy nature:
because Q-Learning learns the optimal policy directly, it tends to take the path right next
to the cliff and is thus more likely to fall in. For Q-Learning, Q∗d and Q̃d do better than the
baseline, with the latter doing significantly better.
The situation changes when the cliff sensor is added (we did not retest the potential
function that did worse than the baseline), as shown in table 1b. Though the sensor does not
speed learning within a task, it provides useful information across tasks: whenever the cliff
sensor activates, the agent should not step in that direction. This information is reflected in
the average of the value functions and thus in the potential function. More precisely, the
state-action space X+_d is enlarged and fewer state-action pairs are shared between tasks.
Under these circumstances, both Q∗d and Q̃d significantly outperform baseline Sarsa, with
the latter, again, doing best. The picture for Q-Learning remains largely the same.
The continuing cliff domain is the same as the episodic version with the cliff sensor, except
that there are no terminal states. Instead there is a reward of 0 on every step and 10 for pass-
ing through the goal state and teleporting to the start state. We hypothesized an additional
benefit for shaping here, since the reward function is sparser (see e.g. [39] for a formal re-
lation between the benefit of shaping and sparsity of reward). To our surprise, however, this
was not always the case. Ironically, one main reason seems to be the sparse reward function.
In addition, the presence of an area of large negative reward next to the goal state makes
the task even more difficult to learn. For increasing exploration rate ε, the optimal ε-greedy
agent takes ever larger detours around the cliff, partly because of the sparse reward func-
tion; from around ε = 0.05, it huddles in the corner of the grid indefinitely, without ever
attempting to reach the goal. For this reason, we used ε = 0.01 for all experiments.
Figure 2 shows the mean cumulative reward of Sarsa and Q-learning under various learn-
ing rates and shaping regimes. Here, we used two different methods for computing Q̃d : one
uses the exact values of the optimal ε-greedy policy for each task for ε = 0.01, as computed
by a soft version of policy iteration5 , and the other uses solutions as computed by Sarsa run
on each task in the domain with ε = 0.01, α = 0.07, for 10⁷ steps. These two value functions
are different since, as it turns out, Sarsa converges to the wrong solution.
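Footnote 5 describes the only change to standard policy iteration; a minimal sketch of that ε-greedy improvement step (assuming a tabular S × A array of action values, not the authors' code):

```python
import numpy as np

def epsilon_greedy_improvement(Q, epsilon=0.01):
    """Policy-improvement step of the 'soft' policy iteration of footnote 5:
    rather than acting greedily, the improved policy puts probability
    1 - eps + eps/|A| on the greedy action and eps/|A| on every other action.
    Q is an (S, A) array of action values; returns an (S, A) stochastic policy."""
    S, A = Q.shape
    policy = np.full((S, A), epsilon / A)
    policy[np.arange(S), Q.argmax(axis=1)] += 1.0 - epsilon
    return policy
```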
The figure shows a markedly different picture from the episodic cliff world results. Even
at its optimal setting, Sarsa does not significantly outperform Q-learning on this domain.
Perhaps even more surprising, especially since Sarsa converges to the wrong solution, is that unshaped Sarsa at its optimal learning rate outperforms shaped Sarsa at its optimal setting; the shaped version dominates only from around α = 0.15, and generally not significantly.
5 In the policy improvement step, the policy is made only ε-greedy w.r.t. the value function.
[Figure: two panels (Sarsa and Q-learning), mean return plotted against learning rate α × 10⁻²; curves: no shaping, Q∗d, Q̃d (exact), Q̃d (Sarsa), and Qµd.]
Fig. 2: Mean return and 95% confidence intervals of Sarsa and Q-learning on the continuing
cliff domain, under various shaping regimes. Return is cumulative reward over 10⁵ steps and
averaged over all tasks in the domain (i.e. 8 runs).
(a) Shaped vs unshaped Sarsa (b) No shaping (c) Q̃d (exact) shaping
Fig. 3: Cumulative reward of unshaped and shaped Sarsa on the continuing cliff domain.
Mean curve (light gray, dashed) represents average over 100 runs; also shown are example
cumulative reward (black,solid) and TD error from a single run.
Fig. 3 reveals what is happening. Fig. 3a makes clear that, even though shaping has a
disadvantage when measured over a large number of timesteps, it does provide an initial per-
formance boost which lasts up to around 10⁴ steps. However, as the two rightmost graphs
show, the long-term learning dynamics of shaped Sarsa ultimately result in inferior perfor-
mance: it is plagued by long periods of stasis, in which it keeps far from the cliff and thus
the goal, and no reward is incurred. The likely reason is the same as for the eventual stasis of
unshaped Sarsa (not shown), which results from Sarsa’s inability, under the low exploration
rate, to escape from the strong local optimum in the corner of the grid (recall that under
higher exploration rates, the corner of the grid becomes the global optimum). Since shaped
Sarsa is closer to convergence than unshaped Sarsa, it discourages the agent from approaching any location that might contain a cliff, resulting in an initial performance boost but also earlier onsets of stasis.
[Figure: example triangle task (states 1–3 with actions L, R, U) and mean return plotted against learning rate α; curves: no shaping, Q∗d, Q̃d, and Qµd.]
Fig. 4: Example task and mean return of Q-learning on the triangle domain, under various shaping regimes. Return is cumulative reward over 10 episodes and averaged over all tasks in the domain (i.e. 12 runs). All differences between the shaping methods are significant, except between Q̃d and Qµd. Points shown are for ε = 0.01, the best-performing setting.
Each task in the episodic Triangle domain consists of three non-terminal states, in which
three actions can be taken (Fig. 4a). In addition to feature x1 , which corresponds to the state
numbers shown, the agent perceives two additional features x2 and x3 . Feature x2 is the
inverse of the square of the shortest distance to the goal, i.e. in the figure, the states would
be (1, 0.25), (2, 1), (3, 0.25). Feature x3 is a binary feature that indicates task color: red (1)
or green (0); if red, the agent receives a -10 penalty for self-transitions, in addition to the -1
step reward that is default in every task. x3 is constant within a task, but may change from
one task to the next. The goal may be at any state and the effect of actions L (dashed) and
R (solid) may be reversed. Action U (dotted) always results in either a self-transition or a
goal-transition (when the goal is next to the current state). There are thus 12 tasks in total.
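Since the text above fully specifies the feature semantics, a small sketch (hypothetical helpers, not the paper's code) of the 12 tasks and their features:

```python
from itertools import product

STATES = [1, 2, 3]

def features(state, goal_state, red_task):
    """Feature vector (x1, x2, x3) for one triangle task.
    x1: the state number; x2: inverse squared shortest distance to the goal
    (1 for the state next to the goal, 0.25 for the other two states);
    x3: task colour (1 = red, extra -10 penalty for self-transitions)."""
    d = 1 if state == goal_state else 2
    return (state, 1.0 / d ** 2, 1 if red_task else 0)

# 12 tasks: goal position (3) x whether L and R are reversed (2) x colour (2).
TASKS = list(product(STATES, (False, True), (False, True)))
assert len(TASKS) == 12
```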
We compare performance of the different shaping regimes on this domain with a Q-
learning agent with discount factor γ = 1. Not surprisingly, there is no significant difference
between the potentials in this domain: while Q∗d estimates values higher than Q̃d, which estimates higher than Qµd, differences in estimates are minimal and the ordering of actions is the same.
The binary stock-trading domain is an attractive domain for comparison since it is an estab-
lished benchmark [71,13,37], is stochastic, and has an easily varied number of states and
tasks. An example task is displayed in Fig. 5a. The domain consists of a number of sectors
S, such as telecom (s1 in the example) and pharmaceuticals (s2). Each contains E items
of equity (stock). Each stock is either rising (1) or falling (0). An agent can buy or sell all
stocks of one whole sector at a time; sector ownership is indicated by a 1 (owned) or 0 (not
owned) in the state vector. Therefore, if the agent owns pharma but not telecom, the part of
the state vector pertaining to stock in the example would be (0, 1, 1, 1, 0, 1, 0): the first two
elements indicate sector ownership, the next three what telecom stocks are doing, and the
final two what pharma stocks are doing. At each timestep, for each sector that it owns, the
agent receives a reward of +1 for each stock that is rising and −1 for each stock that is falling.
Thus in the example, the agent would earn a reward of 1, since there is 1 stock rising in
pharma.
The probability of stocks rising in a given sector s, Ps , depends on two factors: the
number of stocks rising in s in the previous timestep, and the influence of G global factors
(in the example, oil is the only global factor). How stocks and globals influence Ps is task-
dependent. In the example, the only telecom stock of influence is e2 ; in pharma, it is e1 . In
a given task, stocks within a sector may be influenced by any combination of stocks in that
sector. Stocks that are rising increase Ps ; stocks that are falling decrease it.
Globals behave just like stocks in that they can rise (1) or fall (0). However, globals
always affect all sectors simultaneously. The effect of a global varies per task; for a given
rising global, it increases Ps in half the tasks, and decreases it in the other half, making its
net cross-task effect zero. The exact formula for determining the probability that stocks in a
given sector s will rise in a given task m is:
$$P_s^m = 0.1 + 0.8\,\frac{R_s^m + 3R_g^m}{I_s^m + 3G}.$$
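Read concretely (and under the assumption that R_s^m counts the rising stocks that influence sector s, I_s^m the number of such influencing stocks, R_g^m the globals whose current value pushes P_s upward in task m, and G the total number of globals), the formula can be sketched as:

```python
def rise_probability(rising_influencers, num_influencers, favourable_globals, num_globals):
    """Sketch of the per-sector rise probability
        P = 0.1 + 0.8 * (R_s + 3*R_g) / (I_s + 3*G).
    rising_influencers (R_s): influencing stocks of this sector currently rising
    num_influencers    (I_s): number of stocks influencing this sector in this task
    favourable_globals (R_g): globals whose current value pushes P upward in this task
    num_globals        (G) : total number of global factors
    With R_s <= I_s and R_g <= G, the result always lies in [0.1, 0.9]."""
    return 0.1 + 0.8 * (rising_influencers + 3 * favourable_globals) / (num_influencers + 3 * num_globals)
```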
5.1.5 Summary
Our experiments showed that which potential type is best is highly dependent on the domain,
learning algorithm, and learning parameters. In the domains used in this section, prediction
of optimal value Q∗d never significantly outperforms the other shaping functions.
[Figure: (a) schematic of the stock-trading domain, with global G1, sectors S1 and S2, stocks E1 and E2 per sector, and a buy/sell action per sector; (b) mean return plotted against learning rate α; curves: no shaping, Q∗d, Q̃d, and Qµd.]
Fig. 5: Schematic overview of, and mean Q-learning return on, the stock-trading domain
under various shaping regimes. a) See main text. b) Return is cumulative reward over 300
learning steps and averaged over 20 runs for all tasks in the domain (i.e. 240 runs). Differ-
ences between the methods are generally significant, except between Q̃d and Q∗d .
One might expect that, since Q-Learning is an off-policy algorithm and Sarsa an on-policy one, Q∗d would be best suited for the former and Q̃d for the latter. However, Q-Learning combined with Q̃d or Qµd generally outperformed the other options, while no such effect could be
observed for Sarsa. Finally, on the continuing cliff domain, unshaped Sarsa outperforms shaped Sarsa for low learning rates. This shows that transfer can be negatively affected not only by task relatedness, but also by the learning algorithm and its parameters.
In this section, we test FS-TEK on the same domains used in section 5.1. We compare to
another backward elimination algorithm that also uses our definition of relevance, but does
not make use of the multi-task information by extrapolating. This algorithm, IBKR (Iterated
BKR) is identical to FS-TEK, except that it only computes relevance for k = |s|, the length
of the sequence of experienced tasks. In addition to IBKR, we compare FS-TEK to shaping
functions constructed without FS; with randomly selected features; and with fixed features,
in which only the true domain-relevant features are used.
In each experiment, a learning agent interacts sequentially with each domain. After each
task, the agent constructs a new potential function based on the solutions of the tasks solved
so far. The potential function is based on Q∗c as defined in (17), unless specified otherwise.
Likewise, relevance is computed based on Q∗c . Although section 5.1 showed that, in some
domains, using a potential based on Q̃d works better, doing so is impractical here as it would
require solving each task twice (once using Q-learning, and then again using, e.g., Sarsa to
compute the potential function).
For each domain, we use a fixed learning rate α and ε-greedy exploration, where ε =
0.01 and α is set to the best value found in section 5.1. For each new potential function (i.e.,
for each number of tasks seen), we test the agent on 10 tasks sampled from the domain;
the whole procedure is repeated for 500 runs. Performance is measured only after three tasks, since FS-TEK requires at least this many tasks (the exponential function to be estimated has three parameters).
[Figure: left, mean total return on a new task plotted against tasks seen, for FS-TEK, IBKR, no FS, fixed FS, and random FS; right, ROC space (FPR vs. TPR) for IBKR and FS-TEK, with marker labels giving the number of tasks seen.]
Fig. 6: Performance of shaping functions constructed using various FS methods (left) and
ROC space (right) of the IBKR and FS-TEK methods.
While our implementation of FS-TEK largely corresponds to that outlined in algorithms
1 and 2, there are some practical tweaks. Especially for a small number of experienced tasks,
the Jacobian of the estimated exponential may be ill-conditioned, often preventing reliable
extrapolation and computation of the confidence interval. When this happens, we mark the
feature as “unsure” until reliable estimates can be made in a subsequent iteration. In addition,
for a given call to FS-TEK, any feature that is neither marked for removal nor “unsure” in
any iteration, is kept indefinitely and not re-checked on subsequent iterations. This greatly
improves speed and yields equal or better performance on the domains under consideration.
We use the Levenberg-Marquardt algorithm [48] for the nonlinear least-squares fit.
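For the extrapolation step, a minimal sketch of such a fit using SciPy's Levenberg-Marquardt backend is shown below; the three-parameter model `relevance_model` and all names are assumptions for illustration, not the paper's exact parameterisation.

```python
import numpy as np
from scipy.optimize import curve_fit

def relevance_model(k, a, b, c):
    """Three-parameter exponential used to extrapolate k-relevance; the exact
    parameterisation used in the paper is not shown here, this is one common choice."""
    return a + b * np.exp(-c * k)

def extrapolate_relevance(ks, relevances, k_target=1e6):
    """Fit the model to the observed (k, relevance) pairs with Levenberg-Marquardt
    (SciPy's default for unconstrained problems) and return the parameters, their
    covariance (usable for confidence intervals), and the extrapolated relevance."""
    ks, rel = np.asarray(ks, float), np.asarray(relevances, float)
    p0 = (rel[-1], rel[0] - rel[-1], 0.5)              # crude initial guess
    params, cov = curve_fit(relevance_model, ks, rel, p0=p0, method='lm', maxfev=10000)
    return params, cov, relevance_model(k_target, *params)
```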
For this domain, we set the confidence level for IBKR to α = 0.15, while for FS-TEK, we
used α(k) = min(max(k − 4, 0) × 0.06 + 10−4 , 0.3), i.e., it is 10−4 up to k = 5, from which
point it linearly increases with k until a maximum of 0.3. An increasing confidence level for
FS-TEK of this kind, and a constant level for IBKR, were found to work best by a coarse
parameter search on this domain. Recall that higher confidence means that features are more
easily marked as relevant. Increasing α makes sense since as more tasks are observed, esti-
mates become more certain and the confidence interval can be tightened.
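For reference, the schedule above transcribes directly into code (a transcription of the stated formula, not the authors' implementation):

```python
def fs_tek_confidence(k):
    """Confidence level alpha(k) used for FS-TEK on this domain: effectively
    constant at 1e-4 for small k, then increasing linearly with k, capped at 0.3."""
    return min(max(k - 4, 0) * 0.06 + 1e-4, 0.3)

IBKR_CONFIDENCE = 0.15   # constant level used for IBKR on this domain
```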
The left panel of Fig. 6 shows the mean return of shaping functions constructed using
the various FS methods. FS-TEK achieves a significant performance improvement over both
regular shaping and shaping with IBKR and is the first to match the performance of fixed
FS, in which the correct features were hard-coded. The other methods also eventually reach
that performance since, once the experienced tasks and their observed frequency approach
the true set of tasks and distribution, the need for generalization disappears.
The right panel shows the ROC (Receiver Operating Characteristic) space of the IBKR
and FS-TEK method, plotting the False Positive Rate (FPR) against the True Positive Rate
(TPR). In the context of this article, the TPR indicates the ratio of features correctly classi-
fied as domain relevant out of all domain relevant features, while the FPR indicates the ratio
of features incorrectly classified as domain relevant out of all domain irrelevant features.
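In code, these ROC coordinates reduce to simple set ratios; the helper below is a sketch with hypothetical arguments, not the evaluation code used in the paper.

```python
def roc_point(selected, domain_relevant, all_features):
    """TPR: fraction of domain-relevant features classified as relevant;
    FPR: fraction of domain-irrelevant features classified as relevant.
    `selected` is the feature set a method marks as domain relevant."""
    irrelevant = all_features - domain_relevant
    tpr = len(selected & domain_relevant) / len(domain_relevant) if domain_relevant else 1.0
    fpr = len(selected & irrelevant) / len(irrelevant) if irrelevant else 0.0
    return fpr, tpr

# Ideal case: all and only the domain-relevant features are selected.
assert roc_point({'x2', 'x3'}, {'x2', 'x3'}, {'x1', 'x2', 'x3'}) == (0.0, 1.0)
```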
[Figure: left, mean total return on a new task plotted against tasks seen, for FS-TEK, IBKR, and no FS; right, ROC space (FPR vs. TPR) for IBKR and FS-TEK.]
Fig. 7: Cliff domain when Q∗d is used as potential. No fixed FS is shown since there are no
features to remove; random FS is not shown because of its inferior performance.
Ideally, TPR=1 and FPR=0. From here on, we denote ROC by FPR/TPR, e.g., 0/1 in the
ideal case. In the plot, the numbers next to the markers indicate the number of tasks seen.
The triangle domain contains one domain-irrelevant feature, namely the state number.
The other three features are DR. In this domain, IBKR mostly achieves an ROC of 1/1, which
amounts to never removing any features and is equivalent to the vanilla shaping function –
indeed, their performance is nearly identical. FS-TEK does a better job, achieving an ROC of
around 0.45/1 after five tasks seen, meaning it nearly always identifies the right DR features
(as did IBKR), and in addition removes the irrelevant feature 55% of the time.
While the cliff domain contains no domain-irrelevant features, the first two features (en-
coding position) are only very weakly domain relevant. For the ROC space plot, therefore,
we marked the position features as domain irrelevant; this shows that FS-TEK removes the
position features about 40% of the time (Fig. 7), which explains its slight performance gain.
Because of the peculiar progress of performance with number of observed tasks, we
plotted performance from 1 observed task onward. The initial increase and subsequent de-
crease in performance with number of observed tasks for all methods is interesting. Since
using Q̃d as potential was found to work better on this domain (section 5.1), we were curious
if the same trend would happen for Q̃d . Fig. 8 shows the results.
Clearly, the same trend does not happen for Q̃d , and the benefit of FS-TEK is greater
with this potential type. Moreover, while Q̃d does better than Q∗d in the long run, as shown
in section 5.1, Q∗d outperforms Q̃d for low numbers of observed tasks. The explanation must
therefore be sought in the difference between Q̃d and Q∗d. With respect to the cliff sensor, Q̃d
encourages the agent to move away from the cliff, while Q∗d does not. Of course, when a
cliff direction has not been encountered yet in previous tasks, both potentials have uniform
preference over actions. With respect to position (i.e. when the cliff sensor shows no read-
ing), Q∗d pushes the agent towards the center of the grid; Q̃d , on the other hand, pushes the
agent towards the edges of the world. In short, Q̃d encourages exploration, but to shy away
from a cliff once one is encountered; Q∗d encourages sitting in the center, but to stay near a
cliff once one is encountered.
[Figure: left, mean total return on a new task plotted against tasks seen, for FS-TEK, IBKR, and no FS; right, ROC space (FPR vs. TPR) for IBKR and FS-TEK.]
Fig. 8: Cliff domain when Q̃d is used as potential. No fixed FS is shown since there are no
features to remove; random FS is not shown because of its inferior performance.
The likelihood that the agent has encountered a given cliff increases with k. Therefore,
for Q∗d the likelihood that the agent will stick close to a cliff and fall into it increases with
k. At some point, this likelihood together with the tendency to push the agent towards the
center gains critical mass and performance declines. For Q̃d , performance increases, since
as the agent encounters more cliffs it is less likely to fall into them; in addition, this potential
function increasingly encourages exploring the edges of the world and thus discovering the
cliff in the current task sooner.
For the stock domain, we used the settings e = 1, o = 2 as detailed in section 5.1.4, but tested
for h ranging from 1 to 5. For h = 1, the size of the state-action space |X+ | = 32 and |M| = 6;
these numbers double with each increment of h until, for h = 5, |X+ | = 512 and |M| = 96. Recall
from section 5.1.4 that e represents sector ownership features, o represents number of stocks
per sector and h represents task-dependent global variables that positively or negatively
(depending on the task) affect the probability that stocks rise. For the current settings, stock
ownership is completely irrelevant, while the stock features and action are domain relevant.
The domain is challenging because the H features are strongly task relevant, much more so
than the other features, but domain irrelevant (across all tasks their effect cancels out). In
addition, the stock and action features are only weakly domain relevant. This means that the
H features appear strongly domain relevant when an insufficient number of tasks have been
experienced; moreover, they add noise to the shaping function when selected.
For h < 4, we used a confidence level α = 10−4 for FS-TEK and α = 10−3 for IBKR.
For higher h, we used an α(k) that increases with k for FS-TEK; this was found to result
in better performance. On the other hand, IBKR performed equally well for varying and
constant α, so we kept it constant at α = 10−2 for h ≥ 4.
The results are shown in Fig. 9. Generally, FS-TEK significantly outperforms all other
methods, but its performance deteriorates until, for h = 4 and higher, it performs about as
well as IBKR and vanilla shaping. It may seem that this is caused by the change in ROC
from h = 4 onward, which in turn is caused by the change in the confidence level setting.
However, the opposite is true: earlier experiments showed that a constant α = 10−4 resulted
in a similar ROC as for lower h, but also in a mean return that was significantly worse than any other method.
[Figure: five rows of plots, one per value of H; left column, mean total return on a new task plotted against tasks seen for FS-TEK, IBKR, no FS, fixed FS, and random FS; right column, ROC space (FPR vs. TPR) for FS-TEK and IBKR.]
Fig. 9: Stock domain for H = 1 (top) to 5 (bottom). Left column plots tasks seen versus
mean total return. Right column plots FPR versus TPR.
[Figure: left, Δ total return plotted against k for several values of σsensor; right, Δ return plotted against episode for k ∈ {4, 6, 8}.]
(a) Improvement in total return per k (b) Improvement per episode, for σsensor = 0.02
Fig. 10: Heat domain for σslip = 0.1. On the left the improvement in total return that FS-
TEK achieves over regular shaping for the first five episodes, per k. Each line represents
a different value for σsensor . On the right the improvement in return that FS-TEK achieves
over regular shaping per episode, for σsensor = 0.02.
Instead, the constant performance of FS-TEK in terms of ROC versus
the deteriorating performance shows that the task dynamics, rather than FS-TEK’s ability to
identify the correct features, are the cause of the performance decline. These dynamics are
such that for low H, it is more important to not mistakenly select domain irrelevant features
than to select the relevant ones (this also explains why FS-TEK does better than fixed FS for
H = 1); for higher H, having a low FPR decreases in importance while a high TPR increases
in importance. This makes sense since the more h variables there are, the more their effect
is diluted, and the more important (relatively) the domain relevant features become.
To assess FS-TEK’s ability to perform in a more challenging setting requiring function ap-
proximation, we also consider the heat domain, in which a circular agent with a radius of 1
learns to find a heat source in a 10x10 walled-off area. Note that here, we are using Eq. (10)
to compute the potential function, and are using an abstraction function φ (·) that computes
abstractions directly in feature space.
State is continuous and consists of (x,y) position, the robot’s heading in radians, and the
intensity of the heat emitted by the source. The agent moves by going forward or backward,
or turning left or right by a small amount.
Reward per step is −10/√(10 × 10 + 10 × 10) ≈ −0.71. An episode terminates when
the heat source is within the agent’s radius. The agent employs a jointly tile-coded Q-
function approximator with 4 overlapping tilings, resulting in a total of 1664 features. We
run a Sarsa(λ ) agent with λ = 0.9 and ε = 0.05 for 500 episodes, and compare the perfor-
mance of FS-TEK to regular shaping under various levels of noise. Transition noise adds
ξ ∼ N (0, σslip ) to the agent’s action and sensor noise adds ξ ∼ N (0, σsensor ) to all state
features. Since noise increases the chance of overfitting, our hypothesis is that FS-TEK
should result in a greater benefit for higher levels of noise.
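The noise model can be sketched as follows (how exactly slip perturbs the executed movement is an assumption of this sketch):

```python
import numpy as np

rng = np.random.default_rng(0)

def apply_noise(state, action_magnitude, sigma_slip=0.1, sigma_sensor=0.02):
    """Sketch of the heat-domain noise model: Gaussian transition noise perturbs
    the executed movement, and Gaussian sensor noise perturbs every observed
    state feature (x, y, heading, heat intensity)."""
    executed = action_magnitude + rng.normal(0.0, sigma_slip)
    observed = np.asarray(state, float) + rng.normal(0.0, sigma_sensor, size=len(state))
    return executed, observed
```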
Fig. 11: FS-TEK ROC space for various noise levels on the Heat domain. Clockwise starting
in the top left, noise levels σsensor are 0.01, 0.02, 0.04, 0.06.
Fig. 10 shows results for σslip = 0.1 and varying sensor noise. The left panel shows
that FS-TEK achieves a significant jump in performance over the initial episodes for lower
k, and as expected this benefit increases to some extent with the noise level. As observed
previously, FS performs on par with regular shaping for higher k. The reason that FS has
less benefit for noise levels above 0.02 seems to be that removing the right features becomes
more difficult. This is confirmed by the ROC plots (Fig. 11), which show that FS-TEK’s
false positive rate increases as noise increases.
The right panel of Fig. 10 shows the benefit of FS per episode, for σsensor = 0.02 and
various k. As k increases, there is a dip for early episodes; this dip is also the reason that FS
and no FS perform on par for higher k. The likely explanation is that there is some small
amount of information contained in the features that FS-TEK discards; therefore the regular
potential function is based on more information. That this happens for higher k stems
from there being less overfitting, given the larger amount of data (previous
experience) available. Since only the potential function and not the value function has a
reduced feature set, this slightly wrong bias is quickly overcome.
5.2.5 Summary
Our experiments have shown that FS-TEK compares favorably to selection methods that
do not explicitly exploit the history of experienced tasks. Although changes in potential
function and increases in task and state space size affect the relative online return of FS-
TEK compared to other methods, FS-TEK’s performance in terms of ROC remains fairly
constant. Our results furthermore suggest that FS-TEK can be flexibly tuned for a low FPR
or high TPR, according to what is best for the domain; IBKR, on the other hand, has more
trouble in filtering out domain irrelevant features, as expected. Lastly, we have successfully
applied FS-TEK to a continuous domain using a linear function approximator with binary
features.
6 Related Work
Our work is most related to shaping, multi-task (reinforcement) learning, and feature selec-
tion. This section reviews work done in each of these areas and discusses their relationship
to our own work.
The theoretical result of Ng et al. [53], which showed that potential-based shaping func-
tions preserve the optimal policy of the ground MDP for model-free RL, has recently been
extended in various ways. Grześ and Kudenko [25] demonstrate empirically that scaling
the potential can affect learning performance, and relate performance of a distance-to-goal-
based potential to the discount factor γ and task size, showing that as task size increases,
so should γ. Asmuth et al. [3] show that R-max, a popular model-based RL method, is still
PAC-MDP [31] when combined with an admissible potential function Φ(x, a) ≥ Q∗(x, a).
In multi-agent RL, potential-based shaping provably preserves the Nash equilibrium in
stochastic games [44,11]. Preservation of optimal policy and Nash equilibrium have also
been shown to hold for potential functions that change while the agent is learning [12]. In
practice, it has also been shown to improve strategies for the iterated Prisoner’s dilemma
[4] and robot soccer [10]. Learning a model of difference rewards, a concept in multi-agent
learning related to shaping, has been shown to improve performance in air traffic control
[56].
There have been a number of successes in learning potentials automatically. In single-
task RL, one approach is to construct an initial shaping function based on intuition [38] or an
initial task model [24], and refine it through interaction with the task. Elfwing et al. [15, 16]
evolve a shaping function that, when transferred to a real robot, results in better performance
than when transferring Q-values. Other work has learned a shaping function on abstractions
that are either provided [26] or also learned [49]. This latter approach is related to ours in
that it explores different representations for the potential and value function. However, our
work benefits from the MTRL setting in that it learns the abstractions offline, in between
tasks, and therefore does not incur a cost while interacting with the next task.
Konidaris and Barto [34,35] were the first to learn a shaping function automatically in
a multi-task environment by estimating the value function based on optimal solutions of
previously experienced tasks. They base the value function on problem space, the Markov
representation necessary for optimally solving tasks, and the potential function on agent
space, the representation that retains the same semantics across tasks. However, they pre-
specify both spaces. Similar pre-designed separate representations were employed in [63], in
which an optimal reward function is searched on a distribution of tasks. As mentioned ear-
lier, this article substantially extends this work by comparing different potential functions
and offering a method for automatically discovering both task and domain-relevant repre-
sentations.
The field of multi-task reinforcement learning has rapidly developed in recent years and is
too large to survey comprehensively. Instead, we focus on approaches most similar to ours;
see Taylor and Stone [76] for an extensive survey of the field.
In [74], the average optimal Q-function of previously experienced tasks is used to initial-
ize the model of a new task. Although the authors state that the average optimal Q-function
is always the best initialization, our article has shown otherwise. Mehta et al. [51] assume
fixed task dynamics and reward features, but different reward feature weights in each task.
Given the reward weights of a target task, they initialize the task with the value function of
the best stored source task policy given the new reward weights. In [77], the value function
of a single source task is transformed via inter-task mappings and then copied to the value
function of the target task. This work is a special case of ours since it considers only one
source task. In this case, the cross-task potential as defined in Eq. 9 equals the value function
of that source task by definition, and thus the initial policy based on the potential would be
the same as the source task policy. However, taking more source tasks into account leads
to significantly better performance on the target task. While we have not considered tasks
with different feature and action spaces, our method can be applied to such scenarios using
inter-task mappings, making the two approaches complementary.
Under a broader interpretation, initialization can also be performed by transferring source
task experience samples to a batch-learning method in the target task [42]. Similar to other
initialization methods, bias towards the source task knowledge decreases as experience with
the target task accumulates. Other forms of initialization are to use a population of evolved
source task policies as initial population for an evolutionary algorithm in the target task [78],
or to use source task information to set the prior for Bayesian RL [89, 41].
Advice-giving methods, which suggest an action to the agent given a state, are closely
related to potential-based shaping. Torrey et al. [81] identify actions in source tasks with
higher Q-values than others, and use this information to construct rules on action prefer-
ences that are added as constraints to a linear program for batch-learning Q-function weights
in the target task. In [80], this work is extended by using inductive logic programming to
extract the rules. Taylor and Stone [78] learn rules that summarize a learned source task
policy and incorporate these as an extra action in the target task, to be learned by a policy-
search method. All these approaches are flexible in that they can deal with different state
features and actions between tasks by a set of provided inter-task mappings. However, these
mappings have to be provided by humans, except in [78], where they can be learned if pro-
vided with a description of task-independent state clusters that describe different objects in
the domain. These clusters are similar to our notion of domain-relevant abstractions, which
FS-TEK discovers automatically.
Both feature selection (FS) and state abstraction methods have been applied to single-task
RL, with especially FS seeing a recent surge in interest [54, 33, 55, 29, 46, 23]. Although
similar methods have also been applied to transfer learning, most of these learn task-relevant
representations for supervised learning [5,2] or reinforcement learning [30, 18, 83, 40, 70, 75,
37,60]; the latter aim to reduce the state space of the target task, or find good representations
for value functions or policies. However, none of these approaches learn domain-relevant
representations.
In the supervised learning literature, work that does learn such representations includes
lifelong [79] and multitask [8] learning. Both paradigms develop a representation that is
shared between tasks by using training examples from a set of tasks instead of a single task,
and show that this representation can improve performance by generalizing over tasks, if
tasks are sufficiently similar. In RL, Foster and Dayan [20] identify shared task subspaces by
identifying shared structure between task value functions; by augmenting the value function
and policy representation with these subspaces, new tasks can be learned more quickly.
While the idea of shared structure in task value functions is similar to ours, a limitation
of this method is that it requires a single transfer function between tasks and only allows
changes in the reward function. Similar to the domain-relevant features defined in this article
and to the agent space of Konidaris and Barto [34], Frommberger and Wolter [22] define
structure space as the feature space that retains the same semantics across tasks, and learn a
structure space policy between tasks. Frommberger [21] applies the same concept to tile-
coding functions for generalization within and across tasks. However, in both cases the
structure space is hand-designed, as in [34].
While not explicitly concerned with representation learning, Sherstov and Stone [62]
construct a task-independent model of the transition and reward dynamics by identifying
shared outcomes and state classes between tasks, and using this for action transfer. This can
be viewed as a kind of model-based domain relevance, and thus an interesting direction for
future work on model-based MTRL.
There are two primary characteristics distinguishing our approach from the other cross-
task methods discussed here. First, it captures within-task and cross-task relevance within a
single definition. Second, it exploits multi-task structure by considering how representation
relevance changes with increasing task sequence length and using that to predict relevance
on the entire domain.
We have approached the multi-task learning problem by extracting shared information be-
tween tasks, identifying which state features are relevant to this information, and using po-
tential functions based on these features to capture the information and guide the agent in
new tasks.
This paper proposed and empirically compared three different methods for constructing po-
tential functions based on past experience, and showed that which one is best depends on
the domain, learning algorithm and parameters. Further studies are needed to determine
what domain factors influence the best potential type. We conjecture that risk may be one
such factor; since the optimal value function Q∗ does not take exploration risk into account,
potential functions based on Q∗ might make it more likely that the agent enters “disaster
states”, if these are present in the domain. We observed this experimentally in the cliff do-
main in section 5.1. In such domains, using more cautious potential functions such as Q̃d or Qµd may work better. In addition, Qµd seems to be more robust to changes in learning rate, which may be because it is constructed based on a single fixed cross-task policy.
Learning algorithm and parameters may also affect the benefit of transfer: in the cliff
domain, an unshaped Sarsa agent outperformed the shaped agent. However, we think this
scenario is rare, and applies in particular when an on-policy potential function is applied to a domain with both a sparse reward function (where it may be difficult for the agent to improve upon the initial policy suggested by the potential) and disaster states.
The latter may cause on-policy potential functions to too strongly bias the agent away from
disaster states, causing the agent to get stuck in areas with no reward.
While using a potential function based on a different value function than the agent has
learned may be beneficial, it may not always be practical. However, shaping, and the appli-
cation of prior knowledge in general, are most useful in scenarios where the cost (and risk)
of acquiring samples outweighs the cost of computation. In these cases, it may be well worth
the extra cost of computing a potential based on a different value function.
7.2 Representations
Acknowledgements
We thank George Konidaris, Hado van Hasselt, Eric Wiewiora, Lihong Li, Christos Dim-
itrakakis and Harm van Seijen for valuable discussions, and the anonymous reviewers for
suggesting improvements to the original article.
Appendix A: Glossary
As stated in Theorem 2, we are concerned with the case where the abstract Q-function is
defined as in (18), i.e., the weighted average over state-action pairs in a given cluster. In this
case, relevance equals the sum of weighted variances of the Q-values of ground state-action
pairs corresponding to a given cluster y (21). Before proving the theorems, we show how
relevance can be rewritten as a sum of covariances between Q-functions.
The variance of a weighted sum of n correlated random variables equals the weighted sum
of their pairwise covariances. We start by showing that any Qc is a weighted sum of random variables
(namely the Q-functions of each task in the sequence), and that therefore relevance can be
written in terms of a weighted sum over covariances. Equation 17 is already a weighted sum,
but we require a constant weight per random variable (task). Thus we rewrite (17) as
$$Q^c(x^+) = \sum_{i=1}^{k} \Pr(c_i|c)\, Q^c_{c_i}(x^+) \tag{26}$$
$$Q^c_{c_i}(x^+) = \frac{\Pr(x^+|c_i)}{\Pr(x^+|c)}\, Q_{c_i}(x^+), \tag{27}$$
where the last line is just a rescaling of Q_{c_i} depending on c, and k = |c|, the sequence length. Similarly, we define
$$Q^c_{y,c_i}(x^+) = \frac{\Pr(x^+|y,c_i)}{\Pr(x^+|y,c)}\, Q_{c_i}(x^+), \tag{28}$$
where Pr(x+|y, c_i) = 0 for all x+ ∉ X^{c_i}_y, as the values of Q^c_{c_i} on the domain X^{c_i}_y.
For ease of notation, we write Var(Qc ) for the variance Var(Qc (X+ c )), leaving the do-
main implicit, and similarly for the covariance. Note that Pr(ci |c) = 1/k. Then relevance
(21) can be written as
$$\rho(\phi, Q^c) = \sum_{y \in Y_c} \Pr(y|c)\, \mathrm{Var}\!\left(\frac{1}{k}\sum_{i=1}^{k} Q^c_{y,c_i}\right) = \frac{1}{k^2}\sum_{y \in Y_c} \Pr(y|c) \sum_{i=1}^{k}\sum_{j=1}^{k} \mathrm{Cov}\!\left(Q^c_{y,c_i},\, Q^c_{y,c_j}\right) \tag{29}$$
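Relevance thus reduces to a uniformly weighted covariance sum. The identity used here — the variance of the mean of k correlated variables equals the sum of all pairwise covariances divided by k² — is easy to sanity-check numerically; the snippet below is a standalone illustration, not part of the paper's derivation.

```python
import numpy as np

rng = np.random.default_rng(1)
k, n = 4, 100_000                                  # k correlated "task Q-functions"
Q = rng.normal(size=(k, n)) + rng.normal(size=n)   # shared term induces correlation

lhs = np.var(Q.mean(axis=0))          # Var( (1/k) * sum_i Q_i )
cov = np.cov(Q, bias=True)            # k x k sample covariance matrix
rhs = cov.sum() / k ** 2              # (1/k^2) * sum_ij Cov(Q_i, Q_j)
assert np.isclose(lhs, rhs)           # the two agree up to floating-point error
```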
To see how relevance changes from one sequence to the next, we need to know how
the covariance between two given tasks changes. For this purpose it is easier to write the
covariance as Cov(Xi , X j ) = E[Xi X j ] − E[Xi ] E[X j ]; then all we need to do is quantify both
expectations. Let (c, m) be the new sequence formed by appending a task m ∈ M to a given
sequence c. In the following, for ease of notation, assume an abstraction that leaves the orig-
inal Q-function intact, i.e. Qcy,ci = Qcci . The results can be extended to general abstractions
by substituting Xc = Xyc , Xm = Xym , Pr(x+ |c) = Pr(x+ |y, c), and Pr(x+ |m) = Pr(x+ |y, m).
Lemma 1 The expected value of a given Q-function for a given task ci in any sequence c is
the same as the expected value of the original Q-function on ci. That is, E[Q^c_{c_i}] = E[Q_{c_i}].

Proof
$$\mathrm{E}\big[Q^c_{c_i}\big] = \sum_{x^+ \in X^+_c} \Pr(x^+|c)\, Q^c_{c_i}(x^+) = \sum_{x^+ \in X^+_{c_i}} \frac{\Pr(x^+|c_i)}{\Pr(x^+|c)}\, \Pr(x^+|c)\, Q_{c_i}(x^+) \quad \big(\text{since } Q^c_{c_i}(x^+) = 0\ \forall x^+ \notin X^+_{c_i}\big)$$
$$= \sum_{x^+ \in X^+_{c_i}} \Pr(x^+|c_i)\, Q_{c_i}(x^+) = \mathrm{E}\big[Q_{c_i}\big]. \qquad \square$$
Lemma 2 For a given sequence c and new sequence (c, m) formed by appending a task m ∈ M to c,
$$0 \le \Big|\mathrm{E}\big[Q^{(c,m)}_{c_i} Q^{(c,m)}_{c_j}\big]\Big| \le \frac{k+1}{k}\,\Big|\mathrm{E}\big[Q^{c}_{c_i} Q^{c}_{c_j}\big]\Big|,$$
where | · | denotes absolute value.
Proof Let Q_{i·j}(x+) = Q_{c_i}(x+) Q_{c_j}(x+). Then
$$\mathrm{E}\big[Q^c_{c_i} Q^c_{c_j}\big] = \sum_{x^+} \Pr(x^+|c)\, Q^c_{c_i}(x^+)\, Q^c_{c_j}(x^+) = \sum_{x^+} \frac{\Pr(x^+|c_i)\,\Pr(x^+|c_j)}{\Pr(x^+|c)}\, Q_{i\cdot j}(x^+).$$
For a given task pair, the only quantity that changes from one sequence to the next is Pr(x+|c). Let f^c_{i,j}(x+) = Pr(x+|c_i) Pr(x+|c_j)/Pr(x+|c), and recall that, for a sequence of length k, Pr(x+|c) = (1/k) ∑_{i=1}^{k} Pr(x+|c_i). Therefore, on a new sequence (c, m):
$$\frac{f^{(c,m)}_{i,j}(x^+)}{f^{c}_{i,j}(x^+)} = \frac{\Pr(x^+|c)}{\Pr(x^+|(c,m))} = \frac{(k+1)\,\Pr(x^+|c)}{k\,\Pr(x^+|c) + \Pr(x^+|m)}. \tag{31}$$
If Pr(x+|m) is larger (smaller) than Pr(x+|c), this ratio is smaller (larger) than 1. It is largest when Pr(x+|m) = 0, namely (k + 1)/k, and at its smallest it is (k + 1) Pr(x+|c)/(k Pr(x+|c) + 1), attained when Pr(x+|m) = 1. Since Q_{i·j}(x+) is constant from one sequence to the next, this leads to the bounds as stated in the lemma. □
Note that especially the lower bound is quite loose, since usually Pr(x+ |c) will not be that
close to 0. However, for our present purposes this is sufficient. Given these facts, the proof
of Theorem 2 readily follows.
Theorem 2 Let φ be an abstraction with abstract Q-function as in definition 2, and let
ρk = ρk (φ ) for any k. Let d(x, y) = |x − y| be a metric on R, and let f (ρk ) = ρk+1 map
k-relevance to k + 1-relevance. Then f is a contraction; that is, for k > 1 there is a constant
κ ∈ (0, 1] such that
d(f(ρk), f(ρk−1)) ≤ κ d(ρk, ρk−1).
Furthermore, if d(ρ2, ρ1) ≠ 0, then f is a strict contraction, i.e. there is a κ ∈ (0, 1) such
that the above holds.
Proof We need to show that |ρk+1 −ρk | < |ρk −ρk−1 | for any k > 1. The relevance of a given
sequence consists of the sum of the elements of the covariance matrix for that sequence,
where each element has weight 1/k2 . As illustrated in Fig. 1, from one sequence c to a new
sequence (c, m), the ratio of additional covariances formed by the new task m with the tasks
already present in c is (2k − 1)/k2 and thus rapidly decreases with k. The same figure also
shows that change in relevance is caused by two factors: the expansion of the covariance
matrices as sequence length increases coupled with the change in sequence probability, and
change in the covariance between a given task pair from one sequence to the next. Suppose
that the covariance of any task pair does not change from one sequence to the next. Then
clearly, since the ratio of new covariance matrix elements changes with k as (2k − 1)/k2 and
in addition the probability of all new sequences (c, m) formed from a given c sums up to the
probability of c, |ρk+1 − ρk | ≤ |ρk − ρk−1 | for any k > 1.
Now suppose that covariances do change from one sequence to the next. As Lemma 1
and 2 show, the maximum change in covariance from any sequence c of length k to the next
is (k + 1) Cov(Qcci , Qcc j )/k for any i and j. This change also decreases with k, and therefore
|ρk+1 − ρk | ≤ |ρk − ρk−1 | for any k > 1. If |ρ2 − ρ1 | = 0, then by this property the difference
must stay 0 and |ρk+1 − ρk | = 0 for any k. In all other cases, the change in relevance is a
strict contraction, |ρk+1 − ρk| < |ρk − ρk−1|, by the above arguments. □
Lemma 3 The ratio of the number of type 1 covariances to the number of type 2 covariances
decreases with k. For a given sequence c of length k − 1, the ratio of new type 1 covariances
in all new sequences of length k formed from c is
$$\frac{2(k-1)+N}{N(2k-1)}, \tag{32}$$
where N = |M|.
Proof Let c be any sequence on a domain with N = |M| tasks. Assume task m ∈ {1, 2, . . . , N},
occurs om ∈ {0, 1, . . . , k} times in c. Then any c can be represented by an N-dimensional vec-
tor o: o = (o1 , o2 , . . . , oN ). Note that for a given sample size k, ∑i oi = k for any c. Lastly,
denote by σ the sum of elements in the last column and row of the covariance matrix – as
shown in Fig. 1, these are the elements added from one sequence c to the next (c, m).
Now take any sequence c of length k − 1, with task counts in vector o. Form N new
sequences of length k, where each sequence is formed by adding a task from M to c. To see
how the ratio of covariances changes between k − 1 and k, all that matters is the ratio in σ .
For any new sequence formed by adding task m to sequence c, there will be 2om + 1 type 1
covariances in σ . Hence, in total, taken over the N new sequences formed from c, there will
be 2(o1 + o2 + . . . + oN ) + N = 2(k − 1) + N new type 1 covariances. In total, taken over the
N new sequences, there are N(2k − 1) covariances. So the ratio of type 1 covariances in σ
for a given sample size k is
$$\frac{2(k-1)+N}{N(2k-1)}. \tag{33}$$
This ratio decreases with k. Therefore the ratio of type 2 covariances increases with k. □
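Lemma 3's monotonicity claim is also easy to verify numerically; the snippet below is purely illustrative and not part of the proof.

```python
def type1_ratio(k, N):
    """Ratio (32)/(33): fraction of type-1 covariances among the covariances
    added when extending all length-(k-1) sequences by one task, for |M| = N."""
    return (2 * (k - 1) + N) / (N * (2 * k - 1))

# The ratio strictly decreases with k whenever the domain has more than one task.
for N in (2, 5, 10):
    values = [type1_ratio(k, N) for k in range(2, 50)]
    assert all(a > b for a, b in zip(values, values[1:]))
```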
Theorem 3 If all tasks share the same distribution over state-action pairs Pr(x+ |m), then
ρk is a monotonically increasing or decreasing function of k, or is constant.
Proof The assumption of a single distribution over state-action pairs implies that covari-
ances do not change from one sequence to the next. This follows from Lemma 1 and Eq. 31:
since Pr(x+|m) = Pr(x+|c), the ratio resolves to 1 and E[Q^{(c,m)}_{c_i} Q^{(c,m)}_{c_j}] = E[Q^c_{c_i} Q^c_{c_j}].
The rest of the proof is by cases. Given k = 1, ρk+1 can either be smaller than, greater
than, or equal to ρk .
Case 1: ρ2 < ρ1 .
Since ρ2 < ρ1 , it follows that the expected value of a type 2 covariance is lower than that of
a type 1 covariance: ρ1 is made up of all possible type 1 covariances in the domain, while ρ2
in addition consists of all possible type 2 covariances. Since covariances do not change from
one k to the next, type 2 covariances must be lower on average. From Lemma 3, the ratio of
type 2 covariances increases with k. Within the type 2 covariances, the frequency of a given
task pair does not change, and the same holds for the type 1 covariances. Therefore, since
covariances do not change with k, ρk must get ever lower with k, and ρk is monotonically
decreasing with k.
Case 2: ρ2 > ρ1 .
By a similar argument to that for case 1, ρk is a monotonically increasing function of k.
Case 3: ρ2 = ρ1 .
Therefore |ρ2 − ρ1 | = 0, and |ρk+1 − ρk | must stay 0 by Theorem 2, which shows that ρk is
constant. □
This appendix derives an average cross-task linear function approximator from a set of linear
function approximators per task, where approximators are assumed to have binary features.
Let w_m be the weight vector of the function approximator in task m and let Q_m(x) = w_m^T f_x be the Q-value of x. We wish to find
$$Q_d = \arg\min_{Q_0} \sum_{m\in M} \Pr(m) \sum_{x} P(x|m)\,\big[Q_m(x) - Q_0(x)\big]^2 \tag{34}$$
$$\phantom{Q_d} = \arg\min_{w_0} \sum_{m\in M} P(m) \sum_{x} P(f_x|m)\,\big[(w_m - w_0)^T f_x\big]^2. \tag{35}$$
Let
$$g(w_0) = \sum_{m\in M} P(m) \sum_{x} P(f_x|m)\,\big[(w_m - w_0)^T f_x\big]^2.$$
Let P(f_x^i) indicate P(f_i = f_x^i). Furthermore, P(f_x|m) = P(f_x^1, . . . , f_x^N | m), which equals P(f_x^1|m) P(f_x^2|f_x^1, m) · · · P(f_x^N|f_x^{N−1}, . . . , f_x^1, m). So
$$\sum_{x} P(f_x|m)\,(f_x^1)^2 = P(f_1 = 1|m).$$
This holds for all features i and therefore, if we multiply g by 0.5 for convenience when taking the partial derivative,
$$g(w_0^i) = \frac{1}{2}\sum_{m\in M} P(f_i = 1, m)\,(w_m^i - w_0^i)^2, \qquad \frac{\partial g}{\partial w_0^i} = \sum_{m\in M} P(f_i = 1, m)\,(w_0^i - w_m^i).$$
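Setting this partial derivative to zero gives, for each feature i, w_0^i as the P(f_i = 1, m)-weighted average of the per-task weights w_m^i. A minimal numpy sketch of that final averaging step (array names are illustrative, not the authors' code):

```python
import numpy as np

def average_cross_task_weights(task_weights, feature_on_prob, task_prob):
    """task_weights:    (M, N) array; row m is the weight vector w_m of task m.
    feature_on_prob: (M, N) array; entry (m, i) is P(f_i = 1 | m).
    task_prob:       (M,)  array of task probabilities P(m).
    Setting dg/dw_0^i = 0 gives, per feature i, a weighted average of the w_m^i
    with weights P(f_i = 1, m) = P(f_i = 1 | m) P(m); returns that (N,) vector."""
    joint = feature_on_prob * task_prob[:, None]      # P(f_i = 1, m)
    return (joint * task_weights).sum(axis=0) / joint.sum(axis=0)
```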
References
1. Albus, J.S.: A theory of cerebellar function. Mathematical Biosciences 10, 25–61 (1971)
2. Argyriou, A., Evgeniou, T., Pontil, M.: Convex multi-task feature learning. Machine Learning 73(3),
243–272 (2008)
3. Asmuth, J., Littman, M., Zinkov, R.: Potential-based shaping in model-based reinforcement learning.
In: Proceedings of the 23rd AAAI Conference on Artificial Intelligence, pp. 604–609. The AAAI Press
(2008)
4. Babes, M., de Cote, E.M., Littman, M.L.: Social reward shaping in the prisoner’s dilemma. In: 7th
International Joint Conference on Autonomous Agents and Multiagent Systems (AAMAS 2008), pp.
1389–1392 (2008)
5. Baxter, J.: A model of inductive bias learning. J. Artif. Intell. Res. (JAIR) 12, 149–198 (2000)
6. Bertsekas, D.P.: Dynamic programming and optimal control. Athena (1995)
7. Boutilier, C., Dearden, R., Goldszmidt, M.: Stochastic dynamic programming with factored representa-
tions. Artif. Intell. 121(1-2), 49–107 (2000)
8. Caruana, R.: Multitask learning. Machine Learning 28(1), 41–75 (1997)
9. Caruana, R.: Inductive transfer retrospective and review. In: NIPS 2005 Workshop on Inductive Transfer:
10 Years Later (2005)
10. Devlin, S., Grzes, M., Kudenko, D.: Multi-agent, reward shaping for robocup keepaway. In: AAMAS,
pp. 1227–1228 (2011)
11. Devlin, S., Kudenko, D.: Theoretical considerations of potential-based reward shaping for multi-agent
systems. In: AAMAS, AAMAS ’11, pp. 225–232 (2011)
12. Devlin, S., Kudenko, D.: Dynamic potential-based reward shaping. In: AAMAS, pp. 433–440 (2012)
13. Diuk, C., Li, L., Leffler, B.R.: The adaptive k-meteorologists problem and its application to structure
learning and feature selection in reinforcement learning. In: ICML, p. 32 (2009)
14. Dorigo, M., Colombetti, M.: Robot shaping: developing autonomous agents through learning. Artificial
Intelligence 71(2), 321–370 (1994)
15. Elfwing, S., Uchibe, E., Doya, K., Christensen, H.: Co-evolution of shaping: Rewards and meta-
parameters in reinforcement learning. Adaptive Behavior 16(6), 400–412 (2008)
16. Elfwing, S., Uchibe, E., Doya, K., Christensen, H.I.: Darwinian embodied evolution of the learning
ability for survival. Adaptive Behavior 19(2), 101–120 (2011)
17. Erez, T., Smart, W.: What does shaping mean for computational reinforcement learning? In: Develop-
ment and Learning, 2008. ICDL 2008. 7th IEEE International Conference on, pp. 215 –219 (2008)
18. Ferguson, K., Mahadevan, S.: Proto-transfer learning in markov decision processes using spectral meth-
ods. In: ICML Workshop on Structural Knowledge Transfer for Machine Learning (2006)
19. Ferrante, E., Lazaric, A., Restelli, M.: Transfer of task representation in reinforcement learning using
policy-based proto-value functions. In: AAMAS, pp. 1329–1332 (2008)
20. Foster, D.J., Dayan, P.: Structure in the space of value functions. Machine Learning 49(2-3), 325–346
(2002)
21. Frommberger, L.: Task space tile coding: In-task and cross-task generalization in reinforcement learning.
In: Proceedings of the 9th European Workshop on Reinforcement Learning (EWRL9) (2011)
22. Frommberger, L., Wolter, D.: Structural knowledge transfer by spatial abstraction for reinforcement
learning agents. Adaptive Behavior 18(6), 507–525 (2010)
23. Geramifard, A., Doshi, F., Redding, J., Roy, N., How, J.P.: Online discovery of feature dependencies. In:
ICML, pp. 881–888 (2011)
24. Grześ, M., Kudenko, D.: Learning shaping rewards in model-based reinforcement learning. In: Proc.
AAMAS 2009 Workshop on Adaptive Learning Agents (2009)
25. Grzes, M., Kudenko, D.: Theoretical and empirical analysis of reward shaping in reinforcement learning.
In: ICMLA, pp. 337–344 (2009)
26. Grześ, M., Kudenko, D.: Online learning of shaping rewards in reinforcement learning. Neural Networks
23(4), 541 – 550 (2010)
27. Gullapalli, V., Barto, A.G.: Shaping as a method for accelerating reinforcement learning. In: Proc. IEEE
International Symposium on Intelligent Control, pp. 554–559 (1992)
28. Guyon, I., Elisseeff, A.: An introduction to variable and feature selection. Journal of Machine Learning
Research 3, 1157–1182 (2003)
29. Hachiya, H., Sugiyama, M.: Feature selection for reinforcement learning: Evaluating implicit state-
reward dependency via conditional mutual information. In: ECML/PKDD, pp. 474–489 (2010)
30. Jong, N.K., Stone, P.: State abstraction discovery from irrelevant state variables. In: IJCAI-05 (2005)
31. Kakade, S.M.: On the sample complexity of reinforcement learning. Ph.D. thesis, University College
London (2003)
32. Koller, D., Sahami, M.: Toward optimal feature selection. In: ICML, pp. 284–292 (1996)
33. Kolter, J.Z., Ng, A.Y.: Regularization and feature selection in least-squares temporal difference learning.
In: ICML, p. 66 (2009)
34. Konidaris, G., Barto, A.: Autonomous shaping: Knowledge transfer in reinforcement learning. In: Proc.
23rd International Conference on Machine Learning, pp. 489–496 (2006)
35. Konidaris, G., Scheidwasser, I., Barto, A.G.: Transfer in reinforcement learning via shared features.
Journal of Machine Learning Research 13, 1333–1371 (2012)
36. Koren, Y., Borenstein, J.: Potential field methods and their inherent limitations for mobile robot naviga-
tion. In: Proc. IEEE Conference on Robotics and Automation, pp. 1398–1404 (1991)
37. Kroon, M., Whiteson, S.: Automatic feature selection for model-based reinforcement learning in factored
MDPs. In: ICMLA 2009: Proceedings of the Eighth International Conference on Machine Learning and
Applications, pp. 324–330 (2009)
38. Laud, A., DeJong, G.: Reinforcement learning and shaping: Encouraging intended behaviors. In: Proc.
19th International Conference on Machine Learning, pp. 355–362 (2002)
39. Laud, A., DeJong, G.: The influence of reward on the speed of reinforcement learning: An analysis of
shaping. In: ICML, pp. 440–447 (2003)
40. Lazaric, A.: Knowledge transfer in reinforcement learning. Ph.D. thesis, Politecnico di Milano (2008)
41. Lazaric, A., Ghavamzadeh, M.: Bayesian multi-task reinforcement learning. In: ICML, pp. 599–606
(2010)
42. Lazaric, A., Restelli, M., Bonarini, A.: Transfer of samples in batch reinforcement learning. In: ICML,
pp. 544–551 (2008)
43. Li, L., Walsh, T.J., Littman, M.L.: Towards a unified theory of state abstraction for mdps. In: Artificial
Intelligence and Mathematics (2006)
44. Lu, X., Schwartz, H.M., Givigi, S.N.: Policy invariance under reward transformations for general-sum
stochastic games. Journal of Artificial Intelligence Research (JAIR) 41, 397–406 (2011)
45. Maclin, R., Shavlik, J.W.: Creating advice-taking reinforcement learners. Machine Learning 22(1-3),
251–281 (1996)
46. Mahadevan, S.: Representation discovery in sequential decision making. In: AAAI (2010)
47. Manoonpong, P., Wörgötter, F., Morimoto, J.: Extraction of reward-related feature space using
correlation-based and reward-based learning methods. In: ICONIP (1), pp. 414–421 (2010)
48. Marquardt, D.: An algorithm for least-squares estimation of nonlinear parameters. SIAM Journal of
Applied Mathematics 11, 431–441 (1963)
49. Marthi, B.: Automatic shaping and decomposition of reward functions. In: Proc. 24th International
Conference on Machine Learning, pp. 601–608 (2007)
50. Matarić, M.J.: Reward functions for accelerated learning. In: Proc. 11th International Conference on
Machine Learning (1994)
51. Mehta, N., Natarajan, S., Tadepalli, P., Fern, A.: Transfer in variable-reward hierarchical reinforcement
learning. Machine Learning 73(3), 289–312 (2008)
52. Midtgaard, M., Vinther, L., Christiansen, J.R., Christensen, A.M., Zeng, Y.: Time-based reward shaping
in real-time strategy games. In: Proceedings of the 6th international conference on Agents and data
mining interaction, ADMI’10, pp. 115–125. Springer-Verlag, Berlin, Heidelberg (2010)
53. Ng, A., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application
to reward shaping. In: Proc. 16th International Conference on Machine Learning (1999)
54. Parr, R., Li, L., Taylor, G., Painter-Wakefield, C., Littman, M.L.: An analysis of linear models, linear
value-function approximation, and feature selection for reinforcement learning. In: ICML, pp. 752–759
(2008)
55. Petrik, M., Taylor, G., Parr, R., Zilberstein, S.: Feature selection using regularization in approximate
linear programs for markov decision processes. In: ICML, pp. 871–878 (2010)
56. Proper, S., Tumer, K.: Modeling difference rewards for multiagent learning (extended abstract). In:
AAMAS. Valencia, Spain (2012)
57. Randløv, J., Alstrøm, P.: Learning to drive a bicycle using reinforcement learning and shaping. In: Proc.
15th International Conference on Machine Learning (1998)
58. Rummery, G., Niranjan, M.: On-line q-learning using connectionist systems. Tech. Rep. CUED/F-
INFENG-RT 116, Engineering Department, Cambridge University (1994)
59. Saksida, L.M., Raymond, S.M., Touretzky, D.S.: Shaping robot behavior using principles from instru-
mental conditioning. Robotics and Autonomous Systems 22(3-4), 231 – 249 (1997)
60. van Seijen, H., Whiteson, S., Kester, L.: Switching between representations in reinforcement learning.
In: Interactive Collaborative Information Systems, pp. 65–84 (2010)
61. Selfridge, O., Sutton, R.S., Barto, A.G.: Training and tracking in robotics. In: Proc. Ninth International
Joint Conference on Artificial Intelligence (1985)
62. Sherstov, A.A., Stone, P.: Improving action selection in MDP’s via knowledge transfer. In: Proceedings
of the Twentieth National Conference on Artificial Intelligence (2005)
63. Singh, S., Lewis, R., Barto, A.: Where do rewards come from? In: Proc. 31st Annual Conference of the
Cognitive Science Society, pp. 2601–2606 (2009)
64. Singh, S., Sutton, R.: Reinforcement learning with replacing eligibility traces. Machine Learning 22(1),
123–158 (1996)
65. Singh, S.P.: Transfer of learning by composing solutions of elemental sequential tasks. Machine Learning
8(3), 323–339 (1992)
66. Singh, S.P., Jaakkola, T., Jordan, M.I.: Learning without state-estimation in partially observable Marko-
vian decision processes. In: ICML, pp. 284–292 (1994)
67. Skinner, B.F.: The behavior of organisms: An experimental analysis. Appleton-Century-Crofts, New
York (1938)
68. Snel, M., Whiteson, S.: Multi-task evolutionary shaping without pre-specified representations. In: Ge-
netic and Evolutionary Computation Conference (GECCO’10) (2010)
69. Snel, M., Whiteson, S.: Multi-task reinforcement learning: Shaping and feature selection. In: Proceed-
ings of the European Workshop on Reinforcement Learning (EWRL) (2011)
70. Sorg, J., Singh, S.: Transfer via soft homomorphisms. In: Proc. 8th Int. Conf. on Autonomous Agents
and Multiagent Systems (AAMAS 2009), pp. 741–748 (2009)
71. Strehl, A.L., Diuk, C., Littman, M.L.: Efficient structure learning in factored-state MDPs. In: AAAI, pp.
645–650 (2007)
72. Sutton, R.: Learning to predict by the methods of temporal differences. Machine Learning 3, 9–44 (1988)
73. Sutton, R., Barto, A.: Reinforcement Learning: An Introduction. The MIT Press (1998)
74. Tanaka, F., Yamamura, M.: Multitask reinforcement learning on the distribution of MDPs. In: Proc.
2003 IEEE International Symposium on Computational Intelligence in Robotics and Automation (CIRA
2003), pp. 1108–1113 (2003)
75. Taylor, J., Precup, D., Panangaden, P.: Bounding performance loss in approximate MDP homomorphisms.
In: D. Koller, D. Schuurmans, Y. Bengio, L. Bottou (eds.) Advances in Neural Information Processing
Systems 21, pp. 1649–1656 (2009)
76. Taylor, M., Stone, P.: Transfer learning for reinforcement learning domains: A survey. Journal of Ma-
chine Learning Research 10(1), 1633–1685 (2009)
77. Taylor, M., Stone, P., Liu, Y.: Transfer learning via inter-task mappings for temporal difference learning.
Journal of Machine Learning Research 8(1), 2125–2167 (2007)
78. Taylor, M.E., Whiteson, S., Stone, P.: Transfer via inter-task mappings in policy search reinforcement
learning. In: AAMAS, p. 37 (2007)
79. Thrun, S.: Is learning the n-th thing any easier than learning the first? In: Advances in Neural Information
Processing Systems, pp. 640–646 (1995)
80. Torrey, L., Shavlik, J.W., Walker, T., Maclin, R.: Transfer learning via advice taking. In: Advances in
Machine Learning I, pp. 147–170. Springer (2010)
81. Torrey, L., Walker, T., Shavlik, J.W., Maclin, R.: Using advice to transfer knowledge acquired in one re-
inforcement learning task to another. In: Proceedings of the Sixteenth European Conference on Machine
Learning (ECML 2005), pp. 412–424 (2005)
82. Vlassis, N., Littman, M.L., Barber, D.: On the computational complexity of stochastic controller opti-
mization in POMDPs. CoRR abs/1107.3090 (2011)
83. Walsh, T.J., Li, L., Littman, M.L.: Transferring state abstractions between MDPs. In: ICML-06 Workshop
on Structural Knowledge Transfer for Machine Learning (2006)
84. Watkins, C.J.C.H., Dayan, P.: Q-learning. Machine Learning 8, 279–292 (1992)
85. Whitehead, S.D.: A complexity analysis of cooperative mechanisms in reinforcement learning. In: Pro-
ceedings AAAI-91, pp. 607–613 (1991)
86. Whiteson, S., Tanner, B., Taylor, M.E., Stone, P.: Protecting against evaluation overfitting in empirical
reinforcement learning. In: ADPRL 2011: Proceedings of the IEEE Symposium on Adaptive Dynamic
Programming and Reinforcement Learning, pp. 120–127 (2011)
87. Wiewiora, E.: Potential-based shaping and q-value initialization are equivalent. Journal of Artificial
Intelligence Research 19, 205–208 (2003)
88. Wiewiora, E., Cottrell, G., Elkan, C.: Principled methods for advising reinforcement learning agents. In:
Proc. 20th International Conference on Machine Learning, pp. 792–799 (2003)
89. Wilson, A., Fern, A., Ray, S., Tadepalli, P.: Multi-task reinforcement learning: A hierarchical Bayesian
approach. In: ICML, pp. 1015–1022 (2007)