Multi-Source Transfer Learning
Abstract
1 Introduction
Reinforcement learning (RL) offers a powerful framework for decision-making tasks ranging from games
to robotics, where agents learn from interactions with an environment to improve their performance over
time. The agent observes states and rewards from the environment and acts with a policy that maps states
to actions. The ultimate goal of the agent is to find a policy that maximizes the expected cumulative reward
corresponding to the task. A crucial challenge in RL is to learn this optimal policy with a limited amount of
data, referred to as sample efficiency. This is particularly relevant for real-world applications, where data
collection may be costly. In order to achieve sample efficiency, agents must effectively extract information
from their interactions with the environment and explore the state space efficiently.
Model-based reinforcement learning methods learn and utilize a model of the environment that can predict
the consequences of actions, which can be used for planning and generating synthetic data. Recent works
show that by learning behaviors from synthetic data, model-based algorithms exhibit remarkable sam-
ple efficiency in high-dimensional environments compared to model-free methods that do not construct a
model (Łukasz Kaiser et al., 2020; Schrittwieser et al., 2020; Hafner et al., 2021). However, a model-based
agent needs to learn an accurate model of the environment which still requires a substantial number of
interactions. The fact that RL agents typically learn a new task from scratch (i.e. without incorporating
prior knowledge) plays a major role in their sample inefficiency. Given the potential benefits of incor-
porating such prior knowledge (as demonstrated in the supervised learning domain (Yosinski et al., 2014;
Sharif Razavian et al., 2014; Zhuang et al., 2020)), transfer learning has recently gained an increasing amount
of interest in RL research (Zhu et al., 2020). However, since the performance of transfer learning is highly
dependent on the relevance of the data the model was initially trained on (source task) with respect to the data that will be faced afterward (target task), its application to the RL domain can be challenging. This is because RL environments often differ in several fundamental aspects (e.g. state-action spaces and reward functions), making the choice of a suitable source-target pair difficult and often dependent on the intuition of
the designer (Taylor & Stone, 2009). Moreover, existing transfer learning techniques are often specifically
designed for model-free algorithms (Wan et al., 2020; Heng et al., 2022), transfer within the same domain
(Sekar et al., 2020; Liu & Abbeel, 2021; Yuan et al., 2022), or transfer within the same task (Hinton et al.,
2015; Parisotto et al., 2015; Czarnecki et al., 2019; Agarwal et al., 2022).
This paper proposes to address the challenge of determining the most effective source task by automating
this process with multi-source transfer learning. Specifically, we introduce a model-based transfer learn-
ing framework that enables multi-source transfer learning for a state-of-the-art model-based reinforce-
ment learning algorithm across different domains. Rather than manually selecting a single source task, the
introduced techniques allow an agent to automatically extract the most relevant information from a set
of source tasks, regardless of differences between environments. We consider two different multi-source
transfer learning settings and provide solutions for each: a single agent that has mastered multiple tasks,
and multiple individual agents that each mastered a single task. As modern RL algorithms are often composed of several components, we also ensure that each of the proposed techniques is applicable in a modular fashion, i.e., to individual components of a given algorithm. The proposed meth-
ods are extensively evaluated in challenging cross-domain transfer learning experiments, demonstrating
resilience to differences in state-action spaces and reward functions. Note that the techniques are also
applicable in single-source transfer learning settings, but we encourage the usage of multi-source transfer
learning, in order to avoid the manual selection of an optimal source task. The overall contribution of this
paper is the adaptation of existing transfer learning approaches in combination with novel techniques, to
achieve enhanced sample efficiency for a model-based algorithm without the need to select an optimal
source task. The main contributions are summarized as follows:
• Fractional transfer learning: We introduce a type of transfer learning that transfers a fraction of the source parameters rather than discarding them through random initialization, resulting in substantially improved sample efficiency.
• Modular multi-task transfer learning: We demonstrate enhanced sample efficiency by training
a single model-based agent on multiple tasks and modularly transferring its components with
different transfer learning methods.
• Meta-model transfer learning: We propose a multi-source transfer learning technique that com-
bines models from individual agents trained in different environments into a shared feature space,
creating a meta-model that utilizes these additional input signals for the target task.
2 Preliminaries
Here 𝑝 denotes distributions that generate real environment samples, 𝑞 denotes distributions approximat-
ing those distributions in latent space, and 𝜃 denotes the jointly optimized parameters of the models. At a
given timestep 𝑡, the representation model maps a visual observation 𝑜𝑡 together with the previous latent
state 𝑠𝑡−1 and previous action 𝑎𝑡−1 to latent state 𝑠𝑡 . The transition model learns to predict 𝑠𝑡+1 from 𝑠𝑡 and 𝑎𝑡 ,
and the reward model learns to predict the corresponding reward 𝑟𝑡 . The observation model reconstructs
𝑠𝑡 to match 𝑜𝑡 , providing the learning signal for learning the feature space. This world model is called the
recurrent state space model (RSSM), and we refer the reader to Hafner et al. (2019) for further details. To
simulate a trajectory $\{(s_\tau, a_\tau)\}_{\tau=t}^{t+H}$ of length $H$, where $\tau$ denotes the imagined time index, the representa-
tion model maps an initial observation 𝑜𝑡 to latent state 𝑠𝜏 , which is combined with action 𝑎𝜏 yielded by
the policy, to predict 𝑠𝜏 +1 using the transition model. The reward model then predicts the corresponding
reward 𝑟𝜏 +1 , which is used for policy learning.
In order to learn a policy, Dreamer makes use of an actor-critic approach, where the action model $q_\phi(a_\tau \mid s_\tau)$ implements the policy, and the value model $v_\psi(s_\tau) \approx \mathbb{E}_{q(\cdot \mid s_\tau)}\left(\sum_{\tau=t}^{t+H} \gamma^{\tau-t} r_\tau\right)$ estimates the expected reward
that the action model achieves from a given state 𝑠𝜏 . Here 𝜙 and 𝜓 are neural network parameters for the
action and value model respectively, and 𝛾 is the discount factor. The reward, value, and actor models are
implemented as Gaussian distributions parameterized by feed-forward neural networks. The transition
model is a Gaussian distribution parameterized by a Gated Recurrent Unit (GRU; Bahdanau et al., 2014)
followed by feed-forward layers. The representation model is a variational encoder (Kingma & Welling,
2013; Rezende et al., 2014) combined with the GRU, followed by feed-forward layers. The observation model
is a transposed convolutional neural network (CNN; LeCun et al. (2015)). See Appendix F for further details
and pseudocode of the Dreamer algorithm.
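To make the imagination procedure concrete, the following sketch rolls out a trajectory of length 𝐻 in latent space and accumulates the discounted reward that the value model is trained to approximate (Dreamer actually uses λ-returns, which we omit here for brevity). This is a minimal illustration under simplifying assumptions: the policy, transition, and reward models are deterministic placeholder functions rather than the Gaussian neural networks described above.

```python
import numpy as np

def imagine_return(s0, policy, transition, reward_model, horizon=15, gamma=0.99):
    """Roll out an imagined trajectory from latent state s0 and return the
    discounted sum of predicted rewards over the imagination horizon."""
    s, ret = s0, 0.0
    for k in range(horizon):
        a = policy(s)             # action model q_phi(a | s)
        s = transition(s, a)      # transition model predicts the next latent state
        r = reward_model(s)       # reward model predicts the corresponding reward
        ret += (gamma ** k) * r
    return ret

# Toy usage with random linear "models" in a 30-dimensional latent space.
rng = np.random.default_rng(0)
W_t, W_r = rng.normal(size=(30, 34)), rng.normal(size=30)
policy = lambda s: np.tanh(s[:4])                                # 4-dimensional action
transition = lambda s, a: np.tanh(W_t @ np.concatenate([s, a]))
reward_model = lambda s: 0.01 * float(W_r @ s)
print(imagine_return(rng.normal(size=30), policy, transition, reward_model))
```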
3 Related Work
Transfer learning. In supervised learning, parameters of a neural network are transferred by either freez-
ing or retraining the parameters of the feature extraction layers, and by randomly initializing the param-
eters of the output layer to allow adaptation to the new task (Yosinski et al., 2014; Sabatelli et al., 2018;
Cao et al., 2021). Similarly, we can transfer policy and value models, depending on the differences in state-action spaces and reward functions between environments (Carroll & Peterson, 2002; Schaal et al.,
2004; Fernández & Veloso, 2006; Rusu et al., 2016; Zhang et al., 2018). We can also transfer autoencoders,
trained to map observations to latent states (Chen et al., 2021). Experience samples collected during the
source task training process can also be transferred to enhance the learning of the target task (Lazaric et al.,
2008; Tirinzoni et al., 2018).
Model-free transfer learning. Model-free reinforcement learning algorithms seem particularly suitable
for distillation techniques and transfer within the same domain and task (Hinton et al., 2015; Parisotto
et al., 2015; Czarnecki et al., 2019; Agarwal et al., 2022). By sharing a distilled policy across multiple agents
learning individual tasks, Teh et al. (2017) obtain robust sample efficiency gains. Sabatelli & Geurts (2021)
showed that cross-domain transfer learning in the Atari benchmark seems to be a less promising applica-
tion of deep model-free algorithms. More impressive learning improvements are observed in continuous
control settings for transfer within the same domain or transfer across fundamentally similar domains with
single and multiple policy transfers (Wan et al., 2020; Heng et al., 2022). Our techniques are instead appli-
cable to model-based algorithms and evaluated in challenging cross-domain settings with fundamentally
different environments. Ammar et al. (2014) introduce an autonomous similarity measure for MDPs based
on restricted Boltzmann machines for selecting an optimal source task, assuming the MDPs are within
the same domain. García-Ramírez et al. (2021) propose to select the best source models among multiple
model-free models using a regressor. In multiple-source policy transfer, researchers address similar issues
but focus on transferring a set of source policies with mismatching dynamics to some target policy in
model-free settings (Yang et al., 2020; Barekatain et al., 2020; Lee et al., 2022).
Model-based transfer learning. In model-based reinforcement learning one also needs to transfer a dy-
namics model, which can be fully transferred between different domains if the environments are sufficiently
similar (Eysenbach et al., 2021; Rafailov et al., 2021). Landolfi et al. (2019) perform a multi-task experiment,
where the dynamics model of a model-based agent is transferred to several novel tasks sequentially, and
show that this results in significant gains of sample efficiency. PlaNet (Hafner et al., 2019) was used in a
multi-task experiment, where the same agent was trained on tasks of six different domains simultaneously
using a single world model (Ha & Schmidhuber, 2018). A popular area of research for transfer learning
in model-based reinforcement learning is to task-agnostically pre-train the dynamics model and introduce a reward function for a downstream task afterward (Rajeswar et al., 2022; Yuan et al., 2022). In particular, in Plan2Explore (Sekar et al., 2020) a world model is task-agnostically pre-trained, and a Dreamer (Hafner et al., 2020) agent uses the model to zero-shot or few-shot learn a new task within the same domain. Note that Dreamer is not to be confused with DreamerV2 (Hafner et al., 2021), which is essentially the same algorithm adapted for discrete domains. Unlike previous works, we investigate multi-source transfer learning with a focus on such world model-based algorithms, by introducing techniques that enhance sample efficiency without the need to select a single source task, are applicable across domains, support source models pre-trained with reward functions, and allow each individual component to be transferred.
4 Methods
Here we present the main contributions of this work and describe their technical details. First, we introduce
multi-task learning as a multi-source setting and combine it with different types of transfer learning in a
modular fashion, after introducing a new type of transfer learning that allows for portions of information
to be transferred (Section 4.1). We follow this with an alternative setting, where we propose to trans-
fer components of multiple individual agents utilizing a shared feature space and a meta-model instead
(Section 4.2).
4.1 Modular Multi-Task Transfer Learning

In this section, we describe how a single agent can be trained on multiple MDPs simultaneously, introduce a
novel type of transfer learning, and provide insights on transfer learning in a modular manner. These three
concepts are combined to create a modular multi-source transfer learning technique that can autonomously
learn to extract relevant information for a given target task.
Simultaneous multi-task learning. We propose to train a single agent facing multiple unknown envi-
ronments simultaneously, each with different state-action spaces and reward functions. Intuitively, the
parameters of this agent will contain aggregated information from several source tasks, which can then be
transferred to a target task (similar to parameter fusion in the supervised learning domain (Kendall et al.,
2018)). We train the multi-task RL agent by padding its action space with unused elements up to the largest action dimension among the tasks. Note that we are required to have uniform action
dimensions across all source environments, as we use a single policy model. The agent collects one episode
of each task in collection phases to acquire a balanced amount of experiences in the replay buffer.
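For concreteness, the sketch below illustrates the padding scheme under the assumption of Gym-style environments with the classic 4-tuple step API; the `PaddedActionEnv` wrapper and `collect_round` helper are hypothetical names, not part of the authors' code base. The agent always emits an action vector of the maximum dimension, and the wrapper drops the unused padding before stepping each environment.

```python
import numpy as np

class PaddedActionEnv:
    """Present every task with a common (maximal) action dimension; the padded
    elements of the agent's action vector are simply discarded before stepping."""
    def __init__(self, env, max_action_dim):
        self.env = env
        self.native_dim = env.action_space.shape[0]
        self.max_action_dim = max_action_dim

    def reset(self):
        return self.env.reset()

    def step(self, padded_action):
        # Only the first `native_dim` elements drive the environment.
        return self.env.step(np.asarray(padded_action)[: self.native_dim])

def collect_round(envs, agent, replay_buffer):
    """Collect one episode from every source task per collection phase,
    keeping the replay buffer balanced across tasks."""
    for env in envs:
        obs, done = env.reset(), False
        while not done:
            action = agent.act(obs)                       # max_action_dim elements
            obs_next, reward, done, _ = env.step(action)
            replay_buffer.append((obs, action, reward, obs_next))
            obs = obs_next
```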
Fractional transfer learning. When two tasks are related but not identical, transferring all parameters of a neural network may not be beneficial, as it can prevent the network from adapting and continuing to learn the new task. Therefore, the feature extraction layers of a network are typically fully transferred, while the output layer is randomly initialized. We propose a simple alternative called fractional transfer learning (FTL). By transferring only a fraction of the parameters, the network can both benefit from the transferred knowledge and continue to learn and adapt to the new task. Let 𝜃𝑇 denote the target parameters, 𝜃𝜖 randomly initialized weights, 𝜆 the fraction parameter, and 𝜃𝑆 the source parameters; we then apply FTL by 𝜃𝑇 ← 𝜃𝜖 + 𝜆𝜃𝑆. That is, we add a fraction of the source parameters to the randomly initialized weights. This approach is similar to the shrink-and-perturb approach from supervised learning, where Gaussian noise is added to shrunken weights from the previous training iteration (Ash & Adams, 2020). However, we present a reformulation and simplification that omits the 'noise scale' parameter 𝜎 and simply uses Xavier uniform weight initialization (Glorot et al., 2011) as the perturbation, such that we only need to specify 𝜆. In addition to recent applications in the model-free setting (Liu et al., 2019; Shu et al., 2021), we are the first to demonstrate this type of transfer learning in a deep model-based RL setting.
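A minimal sketch of the FTL rule for a single weight matrix is given below, using NumPy and Xavier (Glorot) uniform initialization for the random component; it illustrates the update 𝜃𝑇 ← 𝜃𝜖 + 𝜆𝜃𝑆 rather than reproducing the authors' implementation. Note that 𝜆 = 0 recovers random initialization, while 𝜆 = 1 adds the full source parameters on top of the fresh initialization.

```python
import numpy as np

def xavier_uniform(shape, rng):
    """Xavier/Glorot uniform initialization for a (fan_in, fan_out) weight matrix."""
    fan_in, fan_out = shape
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=shape)

def fractional_transfer(theta_S, lam, rng):
    """Fractional transfer learning: theta_T = theta_eps + lam * theta_S,
    i.e. add a fraction of the source parameters to freshly initialized weights."""
    theta_eps = xavier_uniform(theta_S.shape, rng)
    return theta_eps + lam * theta_S

rng = np.random.default_rng(0)
theta_S = rng.normal(size=(256, 1))   # e.g. the output layer of a source reward model
theta_T = fractional_transfer(theta_S, lam=0.2, rng=rng)
```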
Figure 2: (a) We train a single agent in multiple (source) environments simultaneously by alternately interacting with each environment. (b) We transfer the parameters of this agent in a modular fashion for learning a novel (target) task. The representation, observation, and transition models are fully transferred. The reward and value models are fractionally transferred. The action model and the action-input parameters of the transition model are randomly initialized.

Modular transfer learning. Modern RL algorithm architectures often consist of several components, each relating to different elements of an MDP. Therefore, to transfer the parameters of such architectures, we need to consider transfer learning on a modular level. Using Dreamer as a reference, we discuss what type of transfer learning each component benefits most from. We consider three types of direct transfer learning (i.e. initialization techniques): random initialization, FTL, and full transfer learning, where the latter means we initialize the target parameters with the source parameters. As commonly done in transfer learning, we fully transfer the feature extraction layers of each component and only consider the output layer for alternative transfer learning strategies. We do not discuss the action parameters further, as action elements and dimensions do not match across different environments, meaning we simply randomly initialize them.
We found that fractionally transferring parameters of the reward and value models can result in substan-
tial performance gains (Appendix E). In this paper, we apply our methods to MDPs with similar reward
functions, meaning the parameters of these models consist of transferable information that enhances the
learning of a target task. This demonstrates the major benefits of FTL, as the experiments also showed that
fully transferring these parameters has a detrimental effect on learning (Appendix D). From the literature,
we know that fully transferring the parameters of the representation, observation, and transition model
often results in positive transfers. As we are dealing with visually similar environments, the generality
of convolutional features allows the full transfer of the representation and observation parameters (Chen
et al., 2021). Additionally, when environments share similar physical laws, transition models may be fully
transferred, provided that the weights connected to actions are reset (Landolfi et al., 2019). See Appendix F
for pseudocode and implementation details of the proposed modular multi-task transfer learning approach,
and see Figure 2 for an overview of our approach.
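As a concrete illustration of the modular scheme summarized in Figure 2, the sketch below applies a per-component transfer rule to a dictionary of source output-layer parameters. The component names and parameter layout are simplified placeholders mirroring Dreamer's structure, not the actual implementation; in practice the feature-extraction layers are always fully transferred and the action-input weights of the transition model are reset separately.

```python
import numpy as np

# Transfer rule for the output layer of each Dreamer component.
TRANSFER_RULES = {
    "representation": "full",
    "observation":    "full",
    "transition":     "full",        # action-input weights are reset separately
    "reward":         "fractional",
    "value":          "fractional",
    "action":         "random",
}

def glorot_uniform(shape, rng):
    fan_in, fan_out = shape
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=shape)

def init_target_params(source_params, lam=0.2, rng=None):
    """Initialize target output-layer parameters according to TRANSFER_RULES."""
    rng = rng or np.random.default_rng()
    target = {}
    for name, theta_S in source_params.items():
        theta_eps = glorot_uniform(theta_S.shape, rng)
        if TRANSFER_RULES[name] == "full":
            target[name] = theta_S.copy()
        elif TRANSFER_RULES[name] == "fractional":
            target[name] = theta_eps + lam * theta_S
        else:                                   # "random"
            target[name] = theta_eps
    return target

rng = np.random.default_rng(0)
source = {name: rng.normal(size=(200, 30)) for name in TRANSFER_RULES}
target = init_target_params(source, lam=0.2, rng=rng)
```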
4.2 Meta-Model Transfer Learning

We now consider an alternative multi-source transfer learning setting, where we have access to the param-
eters of multiple individual agents that have learned the optimal policy for different tasks. Transferring
components from agents trained in different environments represents a major obstacle for multi-source
transfer learning. To the best of our knowledge, we are the first to propose a solution that allows the
combination of deep model-based reinforcement learning components trained in different domains, from
which the most relevant information can autonomously be extracted for a given target task.
Universal feature space. As the focus of this paper is on world model-based agents that learn their com-
ponents in a compact latent feature space, we are required to facilitate a shared feature space across the
agents when combining and transferring their components. From the literature, we know that for visually
similar domains, an encoder of a converged autoencoder can be reused due to the generality of convolu-
tional features (Chen et al., 2021). Based on this intuition we introduce universal feature space (UFS): a fixed
latent feature space that is shared among several world model-based agents to enable the transferability
of their components. We propose to train a single agent in multiple environments simultaneously (as de-
scribed in Section 4.1), freeze its encoder, and reuse it for training both source and target agents. In this
paper, we use two different types of environments: locomotion and pendula tasks (Figure 4). Therefore, we
decided to train a single agent simultaneously in one locomotion and one pendulum environment, such
that we learn convolutional features for both types of environments. Note that in the case of Dreamer,
we do not transfer and freeze the other RSSM components, as this would prevent the Dreamer agent from
learning new reward and transition functions. Dreamer learns its latent space via reconstruction, meaning
the encoder is the main component of the representation model responsible for constructing the feature
space.
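The following sketch shows one possible way to reuse a universal feature space; the `make_agent` factory and the `encoder` attributes are hypothetical placeholders for whatever world-model implementation is used. The essential point is that the multi-task agent's converged encoder parameters are copied into every source and target agent and excluded from gradient updates.

```python
import copy

def build_agent_with_ufs(make_agent, ufs_encoder_params):
    """Create an agent whose encoder uses the shared, frozen UFS parameters."""
    agent = make_agent()
    agent.encoder.params = copy.deepcopy(ufs_encoder_params)
    agent.encoder.trainable = False      # exclude the encoder from gradient updates
    return agent

# 1) Train a multi-task agent (e.g. on Hopper and InvertedPendulumSwingup), then:
#    ufs_params = multi_task_agent.encoder.params
# 2) Reuse the frozen encoder for every source agent and for the target agent:
#    source_agents = [build_agent_with_ufs(make_agent, ufs_params) for _ in source_tasks]
#    target_agent  = build_agent_with_ufs(make_agent, ufs_params)
```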
Meta-model transfer learning. When an agent uses the UFS for training in a given target task, we can
combine and transfer components from agents that were trained using the UFS in different environments.
We propose to accomplish this by introducing meta-model transfer learning (MMTL). For a given com-
ponent of the target agent, we assemble the same component from all source agents into an ensemble
of frozen components. In addition to the usual input element(s), the target component also receives the
output signals provided by the frozen source components.
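The sketch below illustrates the meta-model idea for the reward component: the target agent's reward model receives the latent state together with the predictions of the frozen source reward models as additional input features, and learns by gradient descent to weight useful predictions and ignore irrelevant ones. The linear parameterization is a simplified placeholder, not the Gaussian reward head that Dreamer actually uses.

```python
import numpy as np

class MetaRewardModel:
    """Target reward model conditioned on the latent state plus the predictions
    of N frozen source reward models that share the universal feature space."""
    def __init__(self, latent_dim, source_models, rng):
        self.source_models = source_models               # frozen, never updated
        self.w = rng.normal(scale=0.01, size=latent_dim + len(source_models))
        self.b = 0.0

    def features(self, s):
        source_preds = np.array([m(s) for m in self.source_models])
        return np.concatenate([s, source_preds])         # [s_t, r_hat_1, ..., r_hat_N]

    def predict(self, s):
        return float(self.w @ self.features(s)) + self.b

    def sgd_step(self, s, r_target, lr=1e-3):
        # Squared-error gradient step; only the meta-model's own weights move.
        x = self.features(s)
        err = self.predict(s) - r_target
        self.w -= lr * err * x
        self.b -= lr * err

rng = np.random.default_rng(0)
frozen_sources = [lambda s, w=rng.normal(size=30): float(w @ s) for _ in range(4)]
meta = MetaRewardModel(latent_dim=30, source_models=frozen_sources, rng=rng)
meta.sgd_step(rng.normal(size=30), r_target=1.0)
```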
5 Experiments
Experimental setup. We evaluate the proposed methods using Dreamer² on six continuous control tasks:
Hopper, Ant, Walker2D, HalfCheetah, InvertedPendulumSwingup, and InvertedDoublePendulumSwingup
2 This work builds upon the code base of Dreamer: https://github.com/danijar/dreamer.
Figure 4: Visualization of the tasks learned in the experiments: Hopper, Walker2D, Ant, HalfCheetah, In-
vertedPendulumSwingup, and InvertedDoublePendulumSwingup. The MDPs differ from one another in all elements (𝒮, 𝒜, 𝑅, 𝑃), and are used in the multi-source transfer learning experiments as both source and target tasks.
Table 1: Overall average episode return of 1 million environment steps for FTL and MMTL, where param-
eters of 4 source tasks were transferred. We compare to a baseline Dreamer agent that learns from scratch.
Bold results indicate the method outperforms the baseline for a given task.
(Figure 4). Due to a restricted computational budget, we provide extensive empirical evidence for only
one model-based algorithm and leave the investigation of the potential benefits for other (model-based)
algorithms to future work. A detailed description of the differences between the MDPs can be found in Ap-
pendix A. We perform experiments for both multi-task (referred to as FTL) and multiple-agent (referred to
as MMTL) transfer learning settings. For each method, we run multi-source transfer learning experiments
using a different set of 𝑁 source tasks for each of the target environments (Appendix B). The selection of
source tasks for a given set was done such that each source set for a given target environment includes at
least one task from a different environment type, i.e., a pendulum task for a locomotion task and vice versa.
Similarly, each source set contains at least one task of the same environment type. We also ran preliminary
experiments (3 random seeds) for sets consisting of 𝑁 = [2, 3] to observe potential performance differences
that result from different 𝑁 , but we found no significant differences (Appendix C).
Hyperparameters. To demonstrate that our methods can autonomously extract useful knowledge from
a set of source tasks that includes at least one irrelevant source task, we apply our methods to the most
challenging setting (𝑁 = 4) for 9 random seeds. To the best of our knowledge there exist no available and
comparable multi-source transfer learning techniques in the literature that are applicable to cross-domain
transfer learning or applicable on a modular level to a world model-based algorithm such as Dreamer.
Therefore, for each run, we train the FTL and MMTL target agents for 1 million environment steps and
compare them to a baseline Dreamer agent that learns from scratch for 1 million environment steps, in order
to evaluate the sample efficiency gains of the transfer learning approaches. FTL is evaluated by training
multi-task agents for 2 million environment steps for a single run, after which we transfer the parameters
to the target agent as described in Section 4.1. We use a fraction of 𝜆 = 0.2 for FTL, as we observed in preliminary experiments that the largest performance gains occur in the range 𝜆 ∈ [0.1, 0.3] (see Appendix E).
Table 2: Average episode return of the final 1e5 environment steps for FTL and MMTL, where parameters
of 4 source tasks were transferred. We compare to a baseline Dreamer agent that learns from scratch. Bold
results indicate the method outperforms the baseline for a given task.
For creating the UFS for MMTL, we train a multi-task agent on the Hopper and InvertedPendulum task
for 2 million environment steps. We evaluate MMTL by training a single agent on each task of the set
of source environments for 1 million environment steps (all using the same UFS), after which we transfer
their reward models to the target agent as described in Section 4.2. We used a single Nvidia V100 GPU for
each training run, taking about 6 hours per 1 million environment steps.
Results. The overall aggregated return of both FTL and MMTL is reported in Table 1, which allows us
to take jumpstarts into account in the results. The aggregated return of the final 10% (1e5) environment
steps can be found in Table 2, allowing us to observe final performance improvements. Figure 5 shows the
corresponding learning curves.
6 Discussion
We now discuss the results of each of the proposed methods individually and reflect on the overall multi-
source transfer learning contributions of this paper. We would like to emphasize that we don’t compare
FTL and MMTL to each other, but to the baseline, as they are used for two entirely different settings.
Additionally, although the proposed techniques are applicable to single-source transfer learning settings
as well, empirical performance comparisons between multi-source and single-source transfer learning are
outside of the scope of this paper, and we leave this to future work.
Multi-source transfer learning. The overall transfer learning results show that the proposed solutions
for the multi-task setting (FTL) and multiple-agent setting (MMTL) result in positive transfers for 5 out
of 6 environments. We observe jumpstarts, overall performance improvements, and/or final performance
gains. This shows the introduced techniques allow agents to autonomously extract useful information
stemming from source agents trained in MDPs with different state-action spaces, reward functions, and
transition functions. We would like to emphasize that for each of the environments, there were at most
two environments that could provide useful transfer knowledge. Nevertheless, our methods are still able
to identify the useful information and result in a positive transfer with 4 source tasks. For the Ant test-
ing environment, we observe negative transfers when directly transferring parameters in the multi-task
setting, which logically follows from the environment being too different from all other environments in
terms of dynamics, as we fully transfer the transition model. Note that MMTL does not suffer significantly
from this difference in environments, as the transition model is not transferred in that setting.
Multi-task and fractional transfer learning. When training a single agent on multiple tasks and transfer-
ring its parameters to a target task in a fused and modular fashion, and applying FTL to the reward and
Figure 5: Learning curves for average episode return of 1 million environment steps for FTL and MMTL
using 4 source tasks, compared to a baseline Dreamer agent that was trained from scratch. Shaded areas
represent standard error across the 9 runs.
UFS enables the related source reward models to provide useful predictions for an environment they were
not trained with. Note that the source models are also not further adapted to the new environment as their
parameters are frozen. The overall performance improvements are less substantial, yet still present for the
pendula environments. As the preliminary experiments suggested, having just one related source task for the pendula environments means that adding additional unrelated source models degrades learning performance (Appendix C). To the best of our knowledge, we are the first to successfully combine individual
components of multiple individual agents trained in different environments as a multi-source transfer
learning technique that results in positive transfers, which is autonomously accomplished through gradient
descent and a shared feature space provided by the proposed UFS approach. Note that even though MMTL requires additional inference through several frozen reward models, it adds little computational overhead, as the frozen autoencoder does not require gradient updates, which compensates for the additional computation.
7 Conclusion
In this paper, we introduce several techniques that enable the application of multi-source transfer learning
to modern model-based algorithms, accomplished by adapting and combining both novel and existing con-
cepts from the supervised learning and reinforcement learning domains. The proposed methods address
two major obstacles in applying transfer learning to the RL domain: selecting an optimal source task and
deciding what type of transfer learning to apply to individual components. The introduced techniques are
applicable to several individual deep reinforcement learning components and allow the automatic extrac-
tion of relevant information from a set of source tasks. Moreover, the proposed methods cover two likely
cross-domain multi-source scenarios: transferring parameters from a single agent that mastered several
tasks and transferring parameters from multiple agents that each mastered a single task. Whereas existing tech-
niques on transfer learning generally focus on model-free algorithms, transfer within the same domain,
or transfer within the same task, we showed that our introduced methods are applicable to challenging
cross-domain settings and compatible with a state-of-the-art model-based reinforcement learning algo-
rithm, importantly without having to select an optimal source task.
First, we introduced fractional transfer learning, which allows parameters to be transferred partially, as
opposed to discarding information as commonly done by randomly initializing a layer of a neural net-
work. We then discussed, with FTL as one of the options, which type of transfer learning each component of a deep model-based reinforcement learning algorithm benefits from. The
conclusions of this discussion were empirically validated in the multi-task transfer learning setting, which
both showed that the modular transfer learning decisions result in significant performance improvements
for learning a novel target task, and that fused parameter transfer allows for autonomous extraction of
useful information from multiple different source tasks.
Next, we showed that by learning a universal feature space, we enable the combination and transfer of
individual components from agents trained in environments with different state-action spaces and reward
functions. We extend this concept by introducing meta-model transfer learning, which leverages the pre-
dictions of models trained by different agents in addition to the usual input signals, as a multi-source
transfer learning technique for multiple-agent settings. This again results in significant sample efficiency
gains in challenging cross-domain experiments while autonomously learning to leverage and ignore rele-
vant and irrelevant information, respectively.
A natural extension for future work is the application of these techniques beyond the experimental setup
used in this research, such as different environments (e.g. Atari 2600) and algorithms. The techniques in-
troduced in this paper are applicable on a modular level, indicating potential applicability to components
of other model-based algorithms, and potentially model-free algorithms as well. Another interesting di-
rection would be the application of the proposed techniques in a single-source fashion, and empirically
comparing the transfer learning performance to the multi-source approach.
Acknowledgments
We thank the Center for Information Technology of the University of Groningen for their support and for
providing access to the Peregrine high-performance computing cluster.
References
Rishabh Agarwal, Max Schwarzer, Pablo Samuel Castro, Aaron Courville, and Marc G Bellemare. Reincar-
nating reinforcement learning: Reusing prior computation to accelerate progress. In Alice H. Oh, Alekh
Agarwal, Danielle Belgrave, and Kyunghyun Cho (eds.), Advances in Neural Information Processing Sys-
tems, 2022.
Haitham Bou Ammar, Eric Eaton, Matthew E Taylor, Decebal Constantin Mocanu, Kurt Driessens, Gerhard
Weiss, and Karl Tuyls. An automated measure of mdp similarity for transfer in reinforcement learning.
In Workshops at the Twenty-Eighth AAAI Conference on Artificial Intelligence, 2014.
Jordan T. Ash and Ryan P. Adams. On warm-starting neural network training. In Proceedings of the 34th
International Conference on Neural Information Processing Systems, 2020.
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning
to align and translate. arXiv preprint arXiv:1409.0473, 2014.
Mohammadamin Barekatain, Ryo Yonetani, and Masashi Hamaya. Multipolar: Multi-source policy ag-
gregation for transfer reinforcement learning between diverse environmental dynamics. In Proceedings
of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI-20. International Joint
Conferences on Artificial Intelligence Organization, 2020.
Richard Bellman. A markovian decision process. Journal of Mathematics and Mechanics, 6(5):679–684, 1957.
Zhangjie Cao, Minae Kwon, and Dorsa Sadigh. Transfer reinforcement learning across homotopy classes.
IEEE Robotics and Automation Letters, 6(2):2706–2713, 2021.
J L Carroll and Todd S Peterson. Fixed vs. dynamic Sub-Transfer in reinforcement learning. ICMLA, 2002.
Lili Chen, Kimin Lee, Aravind Srinivas, and Pieter Abbeel. Improving computational efficiency in visual
reinforcement learning via stored embeddings. Advances in Neural Information Processing Systems, 34:
26779–26791, 2021.
Erwin Coumans and Yunfei Bai. Pybullet, a python module for physics simulation for games, robotics and
machine learning. http://pybullet.org, 2016–2021.
Wojciech M. Czarnecki, Razvan Pascanu, Simon Osindero, Siddhant Jayakumar, Grzegorz Swirszcz, and
Max Jaderberg. Distilling policy distillation. In Kamalika Chaudhuri and Masashi Sugiyama (eds.), Pro-
ceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics, volume 89
of Proceedings of Machine Learning Research, pp. 1331–1340. PMLR, 16–18 Apr 2019.
Benjamin Eysenbach, Shreyas Chaudhari, Swapnil Asawa, Sergey Levine, and Ruslan Salakhutdinov. Off-
dynamics reinforcement learning: Training for transfer with domain classifiers. In International Confer-
ence on Learning Representations, 2021.
F Fernández and M M Veloso. Reusing and building a policy library. ICAPS, 2006.
Jesús García-Ramírez, Eduardo Morales, and Hugo Jair Escalante. Multi-source transfer learning for deep
reinforcement learning. In Pattern Recognition: 13th Mexican Conference, MCPR 2021, Mexico City, Mexico,
June 23–26, 2021, Proceedings, pp. 131–140. Springer, 2021.
Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Deep sparse rectifier neural networks. In Proceedings of
the fourteenth international conference on artificial intelligence and statistics, pp. 315–323. JMLR Workshop
and Conference Proceedings, 2011.
David Ha and Jürgen Schmidhuber. Recurrent world models facilitate policy evolution. In S. Bengio, H. Wal-
lach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (eds.), Advances in Neural Information
Processing Systems, 2018.
Danijar Hafner, Timothy Lillicrap, Ian Fischer, Ruben Villegas, David Ha, Honglak Lee, and James David-
son. Learning latent dynamics for planning from pixels. In International conference on machine learning,
2019.
Danijar Hafner, Timothy Lillicrap, Jimmy Ba, and Mohammad Norouzi. Dream to control: Learning be-
haviors by latent imagination. In International Conference on Learning Representations, 2020.
Danijar Hafner, Timothy P Lillicrap, Mohammad Norouzi, and Jimmy Ba. Mastering atari with discrete
world models. In International Conference on Learning Representations, 2021.
You Heng, Tianpei Yang, Yan Zheng, Jianye Hao, and Matthew E. Taylor. Cross-domain adaptive trans-
fer reinforcement learning based on state-action correspondence. In The 38th Conference on Uncertainty
in Artificial Intelligence, 2022.
Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint
arXiv:1503.02531, 2015.
Alex Kendall, Yarin Gal, and Roberto Cipolla. Multi-task learning using uncertainty to weigh losses for
scene geometry and semantics. In Proceedings of the IEEE conference on computer vision and pattern
recognition, pp. 7482–7491, 2018.
Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114,
2013.
Nicholas C Landolfi, Garrett Thomas, and Tengyu Ma. A model-based approach for sample-efficient multi-
task reinforcement learning. arXiv preprint arXiv:1907.04964, 2019.
Alessandro Lazaric, Marcello Restelli, and Andrea Bonarini. Transfer of samples in batch reinforcement
learning. In Proceedings of the 25th international conference on Machine learning, pp. 544–551, 2008.
Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. nature, 521(7553):436–444, 2015.
Hyun-Rok Lee, Ram Ananth Sreenivasan, Yeonjeong Jeong, Jongseong Jang, Dongsub Shim, and Chi-Guhn
Lee. Multi-policy grounding and ensemble policy learning for transfer learning with dynamics mis-
match. In Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, IJCAI-22.
International Joint Conferences on Artificial Intelligence Organization, 2022.
Hao Liu and Pieter Abbeel. Unsupervised active pre-training for reinforcement learning, 2021.
Iou-Jen Liu, Jian Peng, and Alexander G. Schwing. Knowledge flow: Improve upon your teachers. In 7th
International Conference on Learning Representations, ICLR, 2019.
Emilio Parisotto, Jimmy Lei Ba, and Ruslan Salakhutdinov. Actor-mimic: Deep multitask and transfer
reinforcement learning. arXiv preprint arXiv:1511.06342, 2015.
Rafael Rafailov, Tianhe Yu, Aravind Rajeswaran, and Chelsea Finn. Visual adversarial imitation learning
using variational models. Advances in Neural Information Processing Systems, 2021.
Sai Rajeswar, Pietro Mazzaglia, Tim Verbelen, Alexandre Piché, Bart Dhoedt, Aaron Courville, and Alexan-
dre Lacoste. Unsupervised model-based pre-training for data-efficient reinforcement learning from pix-
els. In Decision Awareness in Reinforcement Learning Workshop at ICML, 2022.
Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approxi-
mate inference in deep generative models. In Proceedings of the 31st International Conference on Machine
Learning - Volume 32, ICML’14, pp. II–1278–II–1286. JMLR.org, 2014.
Andrei A Rusu, Neil C Rabinowitz, Guillaume Desjardins, Hubert Soyer, James Kirkpatrick, Koray
Kavukcuoglu, Razvan Pascanu, and Raia Hadsell. Progressive neural networks. arXiv preprint
arXiv:1606.04671, 2016.
Matthia Sabatelli and Pierre Geurts. On the transferability of deep-q networks. In Deep Reinforcement
Learning Workshop of the 35th Conference on Neural Information Processing Systems, 2021.
Matthia Sabatelli, Mike Kestemont, Walter Daelemans, and Pierre Geurts. Deep transfer learning for art
classification problems. In Proceedings of the European Conference on Computer Vision (ECCV) Workshops,
pp. 0–0, 2018.
Stefan Schaal, Auke Jan Ijspeert, Aude Billard, Sethu Vijayakumar, and Jean-Arcady Meyer. Estimating
future reward in reinforcement learning animats using associative learning. In From animals to animats
8: Proceedings of the Eighth International Conference on the Simulation of Adaptive Behavior, pp. 297–304.
MIT Press, 2004.
Julian Schrittwieser, Ioannis Antonoglou, Thomas Hubert, Karen Simonyan, Laurent Sifre, Simon Schmitt,
Arthur Guez, Edward Lockhart, Demis Hassabis, Thore Graepel, et al. Mastering atari, go, chess and
shogi by planning with a learned model. Nature, 588(7839):604–609, 2020.
Ramanan Sekar, Oleh Rybkin, Kostas Daniilidis, Pieter Abbeel, Danijar Hafner, and Deepak Pathak. Plan-
ning to explore via self-supervised world models. In International Conference on Machine Learning, pp.
8583–8592. PMLR, 2020.
Ali Sharif Razavian, Hossein Azizpour, Josephine Sullivan, and Stefan Carlsson. Cnn features off-the-shelf:
an astounding baseline for recognition. In Proceedings of the IEEE conference on computer vision and
pattern recognition workshops, pp. 806–813, 2014.
Yang Shu, Zhi Kou, Zhangjie Cao, Jianmin Wang, and Mingsheng Long. Zoo-tuning: Adaptive transfer
from A zoo of models. In Marina Meila and Tong Zhang (eds.), Proceedings of the 38th International
Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of
Machine Learning Research, pp. 9626–9637. PMLR, 2021.
Matthew E Taylor and Peter Stone. Transfer learning for reinforcement learning domains: A survey.
Journal of Machine Learning Research, 10(7), 2009.
Yee Teh, Victor Bapst, Wojciech M Czarnecki, John Quan, James Kirkpatrick, Raia Hadsell, Nicolas Heess,
and Razvan Pascanu. Distral: Robust multitask reinforcement learning. Advances in neural information
processing systems, 2017.
Andrea Tirinzoni, Andrea Sessa, Matteo Pirotta, and Marcello Restelli. Importance weighted transfer of
samples in reinforcement learning. In Jennifer Dy and Andreas Krause (eds.), Proceedings of the 35th
International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research,
pp. 4936–4945. PMLR, 10–15 Jul 2018.
Michael Wan, Tanmay Gangwani, and Jian Peng. Mutual information based knowledge transfer under
state-action dimension mismatch. In Jonas Peters and David Sontag (eds.), Proceedings of the 36th Con-
ference on Uncertainty in Artificial Intelligence (UAI), volume 124 of Proceedings of Machine Learning
Research, pp. 1218–1227. PMLR, 03–06 Aug 2020.
Tianpei Yang, Jianye Hao, Zhaopeng Meng, Zongzhang Zhang, Yujing Hu, Yingfeng Chen, Changjie Fan,
Weixun Wang, Wulong Liu, Zhaodong Wang, and Jiajie Peng. Efficient deep reinforcement learning
via adaptive policy transfer. In Christian Bessiere (ed.), Proceedings of the Twenty-Ninth International
Joint Conference on Artificial Intelligence, IJCAI-20, pp. 3094–3100. International Joint Conferences on
Artificial Intelligence Organization, 2020.
Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. How transferable are features in deep neural
networks? arXiv preprint arXiv:1411.1792, 2014.
Yifu Yuan, Jianye Hao, Fei Ni, Yao Mu, Yan Zheng, Yujing Hu, Jinyi Liu, Yingfeng Chen, and Changjie
Fan. Euclid: Towards efficient unsupervised reinforcement learning with multi-choice dynamics model.
arXiv preprint arXiv:2210.00498, 2022.
Amy Zhang, Harsh Satija, and Joelle Pineau. Decoupling dynamics and reward for transfer learning. arXiv
preprint arXiv:1804.10689, 2018.
Zhuangdi Zhu, Kaixiang Lin, and Jiayu Zhou. Transfer learning in deep reinforcement learning: A survey.
arXiv preprint arXiv:2009.07888, 2020.
Fuzhen Zhuang, Zhiyuan Qi, Keyu Duan, Dongbo Xi, Yongchun Zhu, Hengshu Zhu, Hui Xiong, and Qing
He. A comprehensive survey on transfer learning. Proceedings of the IEEE, 109(1):43–76, 2020.
Łukasz Kaiser, Mohammad Babaeizadeh, Piotr Miłos, Błażej Osiński, Roy H Campbell, Konrad Czechowski,
Dumitru Erhan, Chelsea Finn, Piotr Kozakowski, Sergey Levine, Afroz Mohiuddin, Ryan Sepassi, George
Tucker, and Henryk Michalewski. Model based reinforcement learning for atari. In International Con-
ference on Learning Representations, 2020.
A Environment Descriptions
In this paper, locomotion and pendulum balancing environments from PyBullet (Coumans & Bai, 2016–2021) are used for the experiments. The goal in the locomotion environments is to walk, as quickly as possible, to a target point located 1 kilometer from the starting position. Each environment features a different entity with a different number of limbs, and therefore has different state-action spaces and transition functions. The reward functions are similar, only slightly adapted for each entity, as the agent is penalized for certain limbs colliding with the ground. The objective in the pendula environments is to swing the initially hanging pendulum up and balance it in the upright position. The difference between the two environments used is that one of them has two pendula attached to each other. This environment is not included in the PyBullet framework as a swing-up balancing task, so we implemented it ourselves. The reward signal of the InvertedPendulumSwingup task for a given observation 𝑜 is:

𝑟𝑜 = cos Θ (1)

where Θ is the current angular position of the joint. For the InvertedDoublePendulumSwingup task, we simply add the cosine of the position of the second joint Γ to Equation 1:

𝑟𝑜 = cos Θ + cos Γ (2)
As such, these two environments also differ in state-action spaces, transition functions, and reward func-
tions.
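A small sketch of the two swing-up rewards is given below, assuming the joint angles are provided in radians with 0 corresponding to the upright position; it mirrors Equations 1 and 2 rather than the authors' PyBullet implementation.

```python
import numpy as np

def single_pendulum_reward(theta):
    """Equation 1: cosine of the first joint angle (1 when balanced upright)."""
    return np.cos(theta)

def double_pendulum_reward(theta, gamma_joint):
    """Equation 2: additionally reward the cosine of the second joint angle."""
    return np.cos(theta) + np.cos(gamma_joint)

print(single_pendulum_reward(0.0))           # 1.0: pendulum balanced upright
print(double_pendulum_reward(np.pi, 0.0))    # 0.0: first link hanging down
```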
B Source and Target Task Combinations

Table 3 shows the combinations of source and target tasks that were used for the experiments in this paper; they correspond to the results of Appendix C and Section 5.
Table 3: Source-target combinations used for FTL and MMTL experiments, using environments HalfChee-
tah (Cheetah), Hopper, Walker2D, InvertedPendulumSwingup (InvPend), InvertedDoublePendulum-
Swingup (InvDbPend), and Ant.
C Preliminary Experiments

Table 4 and Table 5 present the results of training on, and transferring from, 2 and 3 source tasks as described in Section 5 for both FTL and MMTL, showing the overall average episode return and the average episode return of the final 1e5 environment steps, respectively. Figure 6 shows the corresponding learning curves. These results are averages and standard deviations across 3 seeds for FTL, MMTL, and a baseline Dreamer trained from scratch. As just 3 seeds were used for these experiments, they are not conclusive. However, they do show that, regardless of the number of source tasks, our methods can extract useful information and enhance performance compared to the baseline.
Figure 6: Preliminary experiments: average episode return for 3 random seeds obtained across 1 million
environment steps by MMTL (red), FTL (blue), and a vanilla Dreamer (green), where for FTL 𝜆 = 0.2 is
used. FTL and MMTL both receive transfer from a combination of 2 (double) and 3 (triple) source tasks.
Shaded areas represent the standard deviation across the 3 runs.
Table 4: Preliminary experiments: overall average episode return of FTL using 𝜆 = 0.2 and MMTL
for the reward model. Parameters of 2 and 3 source tasks are transferred to the HalfCheetah, Hopper,
Walker2D, InvertedPendulumSwingup (InvPend), InvertedDoublePendulumSwingup (InvDbPend), and
Ant tasks, and compared to a baseline Dreamer agent that learns from scratch. Bold results indicate the
best performance across the methods and baseline for a given task.
Table 5: Preliminary experiments: average episode return of the final 1e5 environment steps of FTL us-
ing 𝜆 = 0.2 and MMTL for the reward model. Parameters of 2 and 3 source tasks are transferred to
the HalfCheetah, Hopper, Walker2D, InvertedPendulumSwingup (InvPend), InvertedDoublePendulum-
Swingup (InvDbPend), and Ant tasks, and compared to a baseline Dreamer agent that learns from scratch.
Bold results indicate the best performance across the methods and baseline for a given task.
D Full Transfer
Figure 7 shows learning curves for the HalfCheetah task where we fully transfer the parameters instead of using FTL. This results in a detrimental overfitting effect, where the agent does not appear to learn at all.
Figure 7: 4 seeds for full transfer learning on the HalfCheetah task across 2, 3, and 4 source tasks.
E Fraction Parameter Experiments

This appendix presents results that illustrate the effect of different fractions 𝜆. We present the
numerical results of transferring fractions of 𝜆 ∈ [0.0, 0.1, 0.2, 0.3, 0.4, 0.5] using 2, 3, and 4 source tasks.
That is, the overall performance over 3 random seeds for the HalfCheetah and Hopper, Walker2D and Ant,
and InvertedPendulumSwingup and InvertedDoublePendulumSwingup testing environments can be found
in Table 6, Table 7, and Table 8 respectively. Again, as just 3 seeds were used for these experiments, they
are not conclusive. However, they do indicate that the choice of 𝜆 can largely impact performance, and
generally, the best performance gains are yielded with 𝜆 ∈ [0.1, 0.3].
Table 6: Average return for fraction transfer of 2, 3, and 4 source tasks for the HalfCheetah and Hopper
tasks, with fractions 𝜆 ∈ [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]. Bold results indicate the best performing fraction per
number of source tasks for each target task.
HalfCheetah Hopper
𝜆 2 tasks 3 tasks 4 tasks 2 tasks 3 tasks 4 tasks
0.0 1841 ± 806 1752 ± 783 1742 ± 777 1917 ± 960 1585 ± 941 1263 ± 1113
0.1 𝟐𝟎𝟐𝟖 ± 𝟗𝟗𝟑 2074 ± 762 1500 ± 781 2041 ± 887 𝟔𝟓𝟒𝟐 ± 𝟒𝟒𝟔𝟗 3300 ± 3342
0.2 1982 ± 838 1967 ± 862 1773 ± 748 1911 ± 712 5538 ± 4720 1702 ± 1078
0.3 1899 ± 911 𝟐𝟎𝟗𝟒 ± 𝟖𝟓𝟗 1544 ± 771 𝟐𝟔𝟕𝟎 ± 𝟏𝟕𝟖𝟗 4925 ± 4695 𝟑𝟑𝟒𝟏 ± 𝟒𝟔𝟎𝟗
0.4 2008 ± 943 2015 ± 873 𝟐𝟏𝟔𝟐 ± 𝟕𝟖𝟗 2076 ± 803 2437 ± 2731 1451 ± 991
0.5 1961 ± 944 1647 ± 896 1635 ± 809 1975 ± 772 2014 ± 2328 3246 ± 3828
Baseline 1681 ± 726 1340 ± 1112
Table 7: Average return for fraction transfer of 2, 3, and 4 source tasks for the Walker2D and Ant tasks
with fractions 𝜆 ∈ [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]. Bold results indicate the best performing fraction per number
of source tasks for each target task.
Walker2D Ant
𝜆 2 tasks 3 tasks 4 tasks 2 tasks 3 tasks 4 tasks
0.0 546 ± 776 𝟓𝟒𝟓 ± 𝟏𝟎𝟏𝟐 157 ± 891 1070 ± 659 923 ± 559 897 ± 586
0.1 360 ± 803 441 ± 852 𝟒𝟕𝟖 ± 𝟏𝟏𝟏𝟑 806 ± 469 1022 ± 645 834 ± 652
0.2 𝟏𝟎𝟎𝟗 ± 𝟏𝟐𝟓𝟒 393 ± 813 200 ± 807 1124 ± 722 1052 ± 687 898 ± 616
0.3 535 ± 914 325 ± 812 323 ± 818 1162 ± 824 1080 ± 701 905 ± 554
0.4 742 ± 992 516 ± 1074 354 ± 862 1308 ± 735 885 ± 571 1022 ± 709
0.5 352 ± 853 355 ± 903 396 ± 829 951 ± 651 932 ± 609 683 ± 498
Baseline 116 ± 885 𝟏𝟓𝟖𝟗 ± 𝟕𝟕𝟏
Table 8: Average return for fraction transfer of 2, 3, and 4 source tasks for the InvertedPendulumSwingup
and InvertedDoublePendulumSwingup tasks with fractions 𝜆 ∈ [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]. Bold results indi-
cate the best-performing fraction per number of source tasks for each target task.
InvertedPendulumSwingup InvertedDoublePendulumSwingup
𝜆 2 tasks 3 tasks 4 tasks 2 tasks 3 tasks 4 tasks
0.0 847 ± 135 779 ± 259 𝟖𝟑𝟖 ± 𝟏𝟔𝟗 1255 ± 270 1303 ± 248 1208 ± 322
0.1 857 ± 145 834 ± 198 803 ± 239 1185 ± 316 1283 ± 272 1251 ± 272
0.2 856 ± 121 𝟖𝟓𝟎 ± 𝟏𝟑𝟗 782 ± 269 1209 ± 280 1299 ± 254 1235 ± 284
0.3 𝟖𝟕𝟏 ± 𝟗𝟑 832 ± 182 806 ± 213 1279 ± 216 1325 ± 225 1239 ± 259
0.4 858 ± 150 789 ± 285 790 ± 237 𝟏𝟑𝟎𝟕 ± 𝟐𝟐𝟕 1323 ± 243 𝟏𝟐𝟕𝟖 ± 𝟐𝟖𝟕
0.5 852 ± 126 830 ± 181 791 ± 304 1301 ± 216 𝟏𝟑𝟒𝟎 ± 𝟐𝟑𝟎 1206 ± 278
Baseline 723 ± 364 1194 ± 306
F Algorithms
Algorithm 1 presents pseudocode of the original Dreamer algorithm (adapted from Hafner et al. (2020)) with
seed episodes 𝑆, collect interval 𝐶, batch size 𝐵, sequence length 𝐿, imagination horizon 𝐻 , and learning
rate 𝛼.
Algorithm 2 presents the adaptation of Dreamer such that it learns multiple tasks simultaneously. Finally,
Algorithm 3 and Algorithm 4 present the procedures of fractional transfer learning using a multi-task agent
and meta-model transfer learning for the reward model of Dreamer, respectively. Note that Algorithm 1
is the base algorithm, and we omit identical parts of the pseudocode in the other algorithms to avoid
redundancy and enhance legibility.
Algorithm 1: Dreamer
Initialize dataset $\mathcal{D}$ with $S$ random seed episodes.
Initialize neural network parameters $\theta_{\text{REP}}, \theta_{\text{OBS}}, \theta_{\text{REW}}, \theta_{\text{TRANS}}, \phi, \psi$ randomly.
while not converged do
    for update step $c = 1 \ldots C$ do
        // Dynamics learning
        Draw $B$ data sequences $\{(a_t, o_t, r_t)\}_{t=k}^{k+L} \sim \mathcal{D}$.
        Compute model states $s_t \sim p_{\theta_{\text{REP}}}(s_t \mid s_{t-1}, a_{t-1}, o_t)$.
        Update $\theta$ using representation learning.
        // Behavior learning
        Imagine trajectories $\{(s_\tau, a_\tau)\}_{\tau=t}^{t+H}$ from each $s_t$.
        Predict rewards $\mathbb{E}_{q_{\theta_{\text{REW}}}}(r_\tau \mid s_\tau)$ and values $v_\psi(s_\tau)$.
        Compute value estimates $\mathrm{V}_\lambda(s_\tau)$.
        Update $\phi \leftarrow \phi + \alpha \nabla_\phi \sum_{\tau=t}^{t+H} \mathrm{V}_\lambda(s_\tau)$.
        Update $\psi \leftarrow \psi - \alpha \nabla_\psi \sum_{\tau=t}^{t+H} \tfrac{1}{2}\,\|v_\psi(s_\tau) - \mathrm{V}_\lambda(s_\tau)\|^2$.
    // Environment interaction
    $o_1 \leftarrow$ env.reset()
    for time step $t = 1 \ldots T$ do
        Compute $s_t \sim p_{\theta_{\text{REP}}}(s_t \mid s_{t-1}, a_{t-1}, o_t)$ from history.
        Compute $a_t \sim q_\phi(a_t \mid s_t)$ with the action model.
        Add exploration noise to action.
        $r_t, o_{t+1} \leftarrow$ env.step($a_t$).
    Add experience to dataset $\mathcal{D} \leftarrow \mathcal{D} \cup \{(o_t, a_t, r_t)_{t=1}^{T}\}$.