2023 ArXiv Unleash LLM For Offline RL
ABSTRACT
[Figure 1: bar chart. Legend: LaMo (Ours), DT, Wiki-RL, CQL, BC]
Figure 1: Normalized score on D4RL (Fu et al., 2021) dataset of Language Models for Motion
Control (LaMo), Decision Transformer (DT, Chen et al., 2021), Wiki-RL (Reid et al., 2022), Con-
servative Q-Learning (CQL, Kumar et al., 2020) and Behavior Cloning (BC). We average scores
over tasks and data sample ratios for each domain. (Medium for Mujoco and Atari, Complete and
Partial for Kitchen, of different sample ratios, described in Appendix B.)
1 INTRODUCTION
Offline reinforcement learning (RL) has gained significant attention in recent years due to its
potential for utilizing pre-collected datasets to improve agent performance (Lange et al., 2012;
Prudencio et al., 2023; Levine et al., 2020). Among the prominent algorithms in offline RL, Decision
Transformer (DT) (Chen et al., 2021) reframes RL as a conditional sequence modeling problem and
utilizes the Transformer architecture (Vaswani et al., 2023), showing the potential of sequence
models for decision making (Xu et al., 2022; Hu et al., 2023a;b; Xie et al., 2023; Laskin et al., 2022).
However, Transformers are known to be data hungry (Khan et al., 2022; Brown et al., 2020;
OpenAI, 2023), meaning that pre-training on massive amounts of data is usually required to achieve
satisfactory model ability (Touvron et al., 2021). One of the most pronounced applications of
Transformers, large language models (LLMs), has achieved significant progress in language
understanding recently, such as GPT (Radford et al., 2018; 2019; Brown et al., 2020; OpenAI, 2023),
ChatGPT (Ouyang et al., 2022), and LLaMA (Touvron et al., 2023a). Pre-trained on rich and
diverse linguistic data, LLMs gain great few-shot and zero-shot learning abilities (Brown et al., 2020;
Kojima et al., 2023).
∗ Equal contribution. Order is decided by coin flip.
A natural way to enhance Transformer-based sequential decision-making methods is thus
to introduce the power of pre-trained Language Models (LMs) into them, an idea initially explored by
many recent works (Ahn et al., 2022; Huang et al., 2022; Driess et al., 2023; Wu et al., 2023;
Li et al., 2022; Reed et al., 2022; Lin et al., 2023; Brohan et al., 2022; 2023; Tang et al., 2023;
Wang et al., 2023b). Among them, Li et al. (2022) propose to encode the environment states with
LLMs and learn a policy based on the decoded states, but their environment states are restricted
to language descriptions only, making the approach hard to apply to motion control. Reid et al. (2022)
address this weakness by directly utilizing a pre-trained LM as the initialization of DT and processing
low-level agent states and actions directly, instead of processing language descriptions. Their
architecture thus successfully utilizes pre-trained LMs in motion control tasks like locomotion (Fu et al., 2021).
However, despite its novelty, the method of Reid et al. (2022) still does not fully
unleash the power of LMs: its empirical performance is on par with pure DT methods and lags
behind CQL (Kumar et al., 2020). We thus ask,
Can we unleash the power of pre-trained LMs to solve sequential decision-making problems?
In this work, we propose Language Models for Motion Control (LaMo), a framework to effectively
utilize pre-trained LMs for offline RL. While the motivation is straightforward, it takes four crucial
designs to empower LaMo: 1) a pre-trained language model is used as the initial weights of DT; 2)
the pre-trained weights are frozen and the model is fine-tuned with the parameter-efficient finetuning
method LoRA (Hu et al., 2021) on 0.7% of the parameters; 3) we replace the input embeddings and
the output linear projections with Multi-Layer Perceptrons (MLPs); 4) we add a language prediction
loss as an auxiliary objective. We find that the four components combined
help LaMo preserve the prior knowledge and generalization ability acquired from pre-training
while adapting efficiently to the new domain of offline RL.
We conduct comprehensive experiments across three distinct environments: Kitchen (Gupta et al.,
2019), MuJoCo (Todorov et al., 2012), and Atari (Bellemare et al., 2013), spanning 8 tasks
altogether. These tasks range from sparse-reward to dense-reward, and from state inputs to image
inputs. For each task, we evaluate performance under varying data ratios to examine the influence
of sample amount on the outcomes. As shown in Figure 1, LaMo surpasses
both DT and value-based baselines in sparse-reward tasks; in dense-reward tasks, our method
significantly outperforms DT and closes the gap between value-based methods and DT-based
methods. In particular, we find that when the data scale is limited (e.g., 1% of the whole dataset), LaMo
demonstrates much more powerful learning ability, which could be credited to the inductive bias within
pre-trained LMs.
Our contributions are three-fold:
• We propose LaMo, a novel offline RL framework that unleashes the power of pre-trained lan-
guage models.
• To better utilize the cross-domain knowledge from language modeling, we propose 3 additional
techniques including LoRA finetuning, non-linear MLP projections, and an auxiliary language
loss. Each module is shown to contribute positively to the final results of LaMo.
• Through extensive experiments in 8 tasks across diverse domains, dataset scales, and reward
densities, we demonstrate the superiority of LaMo over DT-based and value-based offline RL
algorithms. Specifically, we find that LaMo could successfully handle the challenging low-data
regime while DT could not. This highlights the great potential of our cross-domain pre-training
for sequential modeling.
2 RELATED WORK
Transformers for decision making. Transformers have dominated the language tasks in the NLP
community (Radford et al., 2018; 2019; Brown et al., 2020; Devlin et al., 2018) and also started to
show potential in other domains, such as decision making. As one initial trial to introduce Trans-
formers into reinforcement learning (RL), Decision Transformer (DT, Chen et al., 2021) models the
elements such as states and actions into a sequence, thus framing the RL problem into a sequence
prediction problem. Many follow-up works make improvements under the framework
of DT (Xu et al., 2022; Hu et al., 2023b; Xie et al., 2023; Yamagata et al., 2023). For example,
Prompt DT (Xu et al., 2022) appends demonstrations into the sequence to achieve generalization
in new tasks; Xie et al. (2023) pre-train DT by leveraging future trajectory information; Q-learning
DT (Yamagata et al., 2023) refines the return-to-go in training data using Q-values, thereby imbuing
DT with Q-learning’s proficiency in handling sub-optimal data. The Trajectory Transformer (Janner
et al., 2021) trains on sequences of discretized states, actions, and rewards, indicating a more direct
solution. Our work focuses on utilizing the cross-domain knowledge, i.e., language pre-training, as
privileged information to enhance DT-based methods, which thus is orthogonal to these works.
Large Language Models (LLMs) have been the most pronounced application of the Transformer
architecture in recent years (Radford et al., 2018; 2019; Brown et al., 2020; OpenAI, 2023; Devlin
et al., 2018; Touvron et al., 2023a;b). Pre-trained on massive corpora, LLMs have shown
surprising few-shot and even zero-shot ability in language tasks, such as GPT series (Radford et al.,
2018; 2019; Brown et al., 2020; OpenAI, 2023). To personalize LLMs for different downstream
user applications with computational efficiency, researchers commonly utilize parameter-efficient
finetuning techniques (Hu et al., 2021; Zhang et al., 2023a; Li & Liang, 2021; Lester et al., 2021;
Liu et al., 2022; Wang et al., 2023a) to finetune LLMs. In this work, we use the GPT-2 architec-
ture (Radford et al., 2019) as the backbone due to its affordability and use LoRA (Hu et al., 2021)
for downstream finetuning.
LMs for decision making. The great success of LMs in language tasks also motivates researchers
to explore the potential of LMs for decision making problems (Ahn et al., 2022; Huang et al., 2022;
Driess et al., 2023; Wu et al., 2023). One line of works (Ahn et al., 2022; Huang et al., 2022; Driess
et al., 2023; Wu et al., 2023) utilizes LMs for high-level task decomposition and task planning,
while their low-level execution policy is learned or designed separately. Another line of works (Li
et al., 2022; Reed et al., 2022; Lin et al., 2023; Brohan et al., 2023; Tang et al., 2023; Wang et al.,
2023b) exploits the representation and generalization power of pre-trained LMs. Li et al. (2022)
adapt pre-trained LMs to generate policies for tasks where the inputs could be converted into word
sequences and point out the significance of sequential structure of inputs; Lin et al. (2023) use
a geometric feasibility planner to encourage LM to generate both mid-level and low-level plans
given language instruction; and Tang et al. (2023) design prompts for LMs to encode language
instructions. When multi-modal inputs are involved, one solution is transforming them into one
common embedding space (Brohan et al., 2023; Reed et al., 2022). For example, RT-2 (Brohan et al.,
2023) utilizes a Vision-Language Model pre-trained on massive language and vision-language data,
and also represents actions as text tokens on the Robot-Action Fine-tuning stage; GATO (Reed et al.,
2022) utilizes a Vision Transformer to encode the image inputs, and learns from a large multi-modal,
multi-task dataset to perform various tasks all in one model.
The work most relevant to ours is Wiki-RL (Reid et al., 2022), which also uses a pre-trained language
model as the initialization of DT for offline RL. However, its empirical results are
only close to DT and do not surpass CQL (Kumar et al., 2020). Therefore, our work tries to better
unleash the power of pre-trained LMs for offline RL.
3 PRELIMINARIES
the discount factor. The agent in this MDP follows a policy π(a|s), and the objective is:
$$J(\pi) = \mathbb{E}_{s \sim T,\, a \sim \pi}\left[\sum_{t=0}^{\infty} \gamma^{t} R(s_t, a_t)\right]. \tag{1}$$
In offline RL, interaction with the environment is no longer available, while the objective remains
$J(\pi)$. Agents can only learn from pre-collected trajectories
$D = \{(s_t^{(i)}, a_t^{(i)}, s_{t+1}^{(i)}, r_t^{(i)})\}$, which are
generated by an unknown behavior policy $\pi_B$. Here we introduce common properties of the dataset
$D$: 1) Sub-optimality. In many contexts, $\pi_B$ is not an optimal policy, i.e., $D$ would not contain
the optimal behaviors, and thus simple imitation may exhibit suboptimal performance; 2) Dense-reward
or sparse-reward. In the dense-reward environment, agents receive reward signals at each timestep
that indicate whether their behaviors are good, while in the sparse-reward
setting, positive reward signals might be given only when success is achieved,
and are otherwise zero. The sparse-reward setting is thus much more challenging but closer to
real-world scenarios.
Following Decision Transformer (DT), we frame the RL problem as a sequential modeling problem.
We consider each trajectory τ as a sequence of ordered return-to-go R̂, action a, and states s, defined
as follows,
$$\tau = (\hat{R}_{t_0}, s_{t_0}, a_{t_0}, \hat{R}_{t_0+1}, s_{t_0+1}, a_{t_0+1}, \ldots, \hat{R}_{t_0+K-1}, s_{t_0+K-1}, a_{t_0+K-1}), \tag{2}$$
where return-to-go $\hat{R}$ is defined as the sum of rewards from the current timestep to the future:
$\hat{R}_k = \sum_{i=k+1}^{T} r_i$, $T$ is the episode length, and $K$ is the context length. The learning objective of the
model is to predict the correct action $a_t$ given the history sequence and the current state $s_t$, written
as a simple squared error term:
$$L_{\text{decision}} = \sum_{t=t_0}^{t_0+K-1} \| a_t - a'_t \|_2^2. \tag{3}$$
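As a rough illustration of Eqs. (2) and (3), the sketch below computes returns-to-go, interleaves them with states and actions into a context window, and evaluates the squared-error loss. The function names and the tagged-tuple token format are hypothetical and not taken from the paper. The return-to-go follows the lower limit i = k+1 exactly as written above; if the reward at the current step should be included, the sum would start at i = k instead.

```python
from typing import List, Sequence, Tuple


def returns_to_go(rewards: Sequence[float]) -> List[float]:
    """R_hat[k] = sum of rewards[i] for i > k (lower limit i = k+1 as in the text)."""
    rtg = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each entry is filled in constant time.
    for k in range(len(rewards) - 1, -1, -1):
        rtg[k] = running
        running += rewards[k]
    return rtg


def build_context(rtg, states, actions, t0: int, K: int) -> List[Tuple[str, object]]:
    """Interleave (R_hat_t, s_t, a_t) for t = t0 .. t0+K-1, following Eq. (2)."""
    tokens = []
    for t in range(t0, t0 + K):
        tokens += [("rtg", rtg[t]), ("state", states[t]), ("action", actions[t])]
    return tokens


def decision_loss(pred_actions, true_actions) -> float:
    """Sum of squared L2 errors over the window, following Eq. (3)."""
    return sum(
        sum((p - a) ** 2 for p, a in zip(pred, true))
        for pred, true in zip(pred_actions, true_actions)
    )


rewards = [1.0, 0.0, 2.0, 1.0]
rtg = returns_to_go(rewards)  # [3.0, 3.0, 1.0, 0.0]
context = build_context(rtg, ["s0", "s1", "s2", "s3"], ["a0", "a1", "a2", "a3"], t0=1, K=2)
```

Each timestep contributes three tokens, so a context of length K yields a sequence of 3K tokens.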
4 METHOD
We propose Language Models for Motion Control (LaMo), an effective framework that incorpo-
rates pre-trained Language Models (LMs) into offline Reinforcement Learning, to leverage the rea-
soning and few-shot ability of LMs and solve challenging scenarios such as limited data and sparse
reward. An illustration of LaMo is given in Figure 2. LaMo encompasses several crucial designs:
1) We adopt a pre-trained LM (i.e., GPT-2 (Radford et al., 2019)) as the initialization of a Decision
Transformer (DT) (Chen et al., 2021); 2) We replace the linear embedding projections with MLPs
to augment representation learning capabilities for complicated tasks; 3) While training the offline
RL agents, we freeze the pre-trained parts and utilize the parameter-efficient fine-tuning technique
LoRA (Hu et al., 2021), where the trainable parameters account for only 0.7% of the entire model;
4) We introduce language prediction as an auxiliary objective while finetuning, in order to stabilize
the performance and maintain the language ability.
4.1 PRE-TRAINING ON LANGUAGE TASKS
We start by pre-training LMs based on a popular and computationally affordable pre-trained lan-
guage model, GPT-2 (Radford et al., 2019). We use the corpus dataset WikiText (Merity et al.,
2016) and the common next-token prediction objective
$$L_{\text{language}} = \sum_{i=1}^{s-1} -\log\left(T(w_{i+1} \mid w_1, \ldots, w_i)\right), \tag{4}$$
where $w_i$ is the $i$-th language token in one sentence. The pre-trained GPT-2 is accessible on Hugging
Face¹. To explore the effects of different pre-trained model quality on the downstream offline RL
tasks, we also explore three variants of models: 1) a model that is pre-trained for fewer steps; 2)
a model that is pre-trained on randomly shuffled text corpus; 3) a model with randomly initialized
weights. Our results in Section 5.5 and Appendix G show that high language pre-training quality is
helpful for downstream RL tasks, underscoring the importance and necessity of the pre-training.
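The next-token objective in Eq. (4) can be sketched in plain Python as follows. This is only a minimal illustration: the input format (one list of vocabulary logits per position) and the function names are assumptions made for the example, not the paper's implementation.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one vocabulary-sized score vector."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def language_loss(logits_per_position: Sequence[Sequence[float]],
                  token_ids: Sequence[int]) -> float:
    """Sum over i of -log T(w_{i+1} | w_1..w_i), following Eq. (4).

    logits_per_position[i] holds the (hypothetical) model scores for the
    token that follows the prefix token_ids[: i + 1].
    """
    loss = 0.0
    for i in range(len(token_ids) - 1):
        loss -= log_softmax(logits_per_position[i])[token_ids[i + 1]]
    return loss


# Uniform scores over a 4-token vocabulary: each prediction costs log(4).
uniform = [[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
loss = language_loss(uniform, token_ids=[1, 2, 3])
```

A sentence of s tokens contributes s - 1 prediction terms, which matches the summation limits in Eq. (4).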
¹ https://huggingface.co/gpt2
[Figure 2 diagram. Stage labels: large language model pre-train; downstream offline RL]
Figure 2: The overview of LaMo. LaMo mainly consists of two stages: (1) pre-training LMs on
language tasks, (2) freezing the pre-trained attention layers, replacing linear projections with MLPs,
and using LoRA to adapt to RL tasks. We also apply the language loss during the offline RL stage
as a regularizer.
Multi-layer perceptrons for embeddings. The pre-trained LMs process the input into latent vectors
and decode the latent vectors into the output via simple linear projections. We find that to effectively
utilize the pre-trained language model in offline RL, replacing the linear projections with MLPs is
essential to bridge the domain gap. Extensive ablations are provided in Section 5.5 to support the
importance of this non-linear module.
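To make the idea of swapping a linear projection for an MLP concrete, here is a minimal NumPy sketch of a two-layer MLP that maps raw states into the Transformer's embedding space. The depth, hidden width, GELU activation, initialization scale, and the example dimensions are all assumptions chosen for illustration; the text above does not specify them.

```python
import numpy as np

rng = np.random.default_rng(0)


def init_mlp(in_dim: int, hidden_dim: int, out_dim: int) -> dict:
    """Randomly initialised two-layer MLP (layer count is an assumption)."""
    return {
        "W1": rng.normal(0.0, 0.02, size=(in_dim, hidden_dim)),
        "b1": np.zeros(hidden_dim),
        "W2": rng.normal(0.0, 0.02, size=(hidden_dim, out_dim)),
        "b2": np.zeros(out_dim),
    }


def gelu(x: np.ndarray) -> np.ndarray:
    """tanh approximation of GELU (activation choice is an assumption)."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))


def mlp_embed(params: dict, x: np.ndarray) -> np.ndarray:
    """Non-linear replacement for a single linear embedding x @ W + b."""
    hidden = gelu(x @ params["W1"] + params["b1"])
    return hidden @ params["W2"] + params["b2"]


# Illustrative sizes: an 11-dimensional state and a 768-dimensional model width.
state_dim, hidden_dim, model_dim = 11, 256, 768
embed_state = init_mlp(state_dim, hidden_dim, model_dim)
states = rng.normal(size=(20, state_dim))   # 20 timesteps of states
tokens = mlp_embed(embed_state, states)     # shape (20, 768)
```

The same construction would apply, with different input and output sizes, to the return-to-go and action embeddings and to the output projection that decodes actions.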
Frozen weights and low rank adaptation. We apply the parameter-efficient training technique
LoRA (Hu et al., 2021), which constrains the gradient update process in a low-dimension space by
rewriting the weight matrix $W \in \mathbb{R}^{d \times k}$ as $W_0 + \Delta W = W_0 + BA$, where
$B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times k}$,
and $r \ll \min(d, k)$. We inject low-rank matrices into the attention weights $Q$, $K$, $V$ and freeze all
other weights of the Transformer.
Meanwhile, the model should retain the knowledge of the LMs. The trainable
parameters take up only 0.7% of the entire Transformer. We hypothesize that such a mechanism
lets the pre-trained model treat the inputs as language to the maximum extent while maintaining
adaptivity. Empirically, we find that either full-weight finetuning or fully freezing the Transformer
layers harms performance, as shown in Figure 5. More discussion is provided in Section 5.5.
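The low-rank reparameterization W0 + BA can be sketched in a few lines of NumPy. The rank r = 8 and the matrix sizes are illustrative assumptions; zero-initializing B so that the update starts at zero follows the original LoRA recipe, and the variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 768, 768, 8                      # r is an illustrative rank, r << min(d, k)

W0 = rng.normal(0.0, 0.02, size=(d, k))    # pre-trained weight, kept frozen
B = np.zeros((d, r))                       # trainable; zero init makes BA = 0 at the start
A = rng.normal(0.0, 0.01, size=(r, k))     # trainable


def lora_forward(x: np.ndarray) -> np.ndarray:
    """Apply (W0 + B A) to a batch of row vectors x of shape (batch, k).

    The low-rank path is computed as (x A^T) B^T, so the full d-by-k
    update matrix is never materialised.
    """
    return x @ W0.T + (x @ A.T) @ B.T


x = rng.normal(size=(4, k))
y = lora_forward(x)                        # equals x @ W0.T while B is still zero

# Trainable share for this single matrix (not the whole-model 0.7% figure).
trainable = A.size + B.size
fraction = trainable / (W0.size + trainable)
```

Only A and B would receive gradient updates; W0 stays fixed, which is how the trainable parameter count remains a small fraction of the model.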
Language prediction as an auxiliary objective. To further stabilize the training process and main-
tain the knowledge learned from languages, we simultaneously train the model on language predic-
tion tasks. The corpus we train on is WikiText (Merity et al., 2016), the same as in the pre-training stage.
To perform language prediction, we temporarily replace the input and output projections with
the projections of the pre-trained LM. Empirically, we find that this term markedly helps prevent
the model from overfitting. Intriguingly, for sparse-reward tasks such as Kitchen, the performance
of LaMo is critically enhanced to surpass recent strong baselines, as is shown in Figure 6b. Besides,
this objective could help preserve the language understanding ability, which means we could obtain
a model skilled at both language understanding and motion control as a side effect. A more detailed
discussion is in Section 5.5. The overall objective while training the offline RL agents is then
5 EXPERIMENTS
In this work, we delve into solving sequential decision-making problems while only offline inter-
action datasets are available during training, known as the Offline RL problem. We evaluate the
performance of LaMo on the standard benchmark D4RL (Fu et al., 2021) and also evaluate the
learning ability of LaMo under the low-data regime. To show the effectiveness of each component
in LaMo, extensive ablations are also conducted.
5.1 EXPERIMENT SETUP
We conduct our experiments on 8 tasks from 3 domains: MuJoCo, Atari, and Kitchen. Detailed task
descriptions are provided in Appendix C. We use datasets from D4RL (Fu et al., 2021) and more
details are provided in Appendix B. Due to the limitation of computation resources, we run each
experiment for 3 seeds with numbers 0, 1, 2 to ensure reproducibility.
We compare the performance of LaMo with various powerful baselines in offline reinforcement
learning: CQL (Kumar et al., 2020), IQL (Kostrikov et al., 2021), TD3+BC (Fujimoto & Gu, 2021),
BCQ (Fujimoto et al., 2019), NFQ (Riedmiller, 2005), Behavior Cloning (BC), and DT (Chen et al.,
2021). Besides, we compare with Wiki-RL (Reid et al., 2022), which also utilizes a pre-trained
language model in offline reinforcement learning. To systematically report the performance of all these
methods, we compute the average performance over the last 20K training steps out of a total of 100K
training steps with evaluations conducted every 2500 training steps. The scores we report are nor-
malized scores so that 100 represents an expert policy and 0 represents a random policy, following
the convention of Fu et al. (2021) and Hafner et al. (2022).
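The reporting protocol above can be sketched as follows: a normalization that maps a random policy to 0 and an expert policy to 100, and an average over the evaluations that fall in the last 20K of 100K training steps. The function names are hypothetical, and the assumption that evaluations happen at steps 2500, 5000, ..., 100000 (40 in total, of which the last 8 are averaged) is an interpretation of the text rather than a stated detail.

```python
from typing import Sequence


def normalized_score(raw: float, random_score: float, expert_score: float) -> float:
    """Map a raw return so that random -> 0 and expert -> 100."""
    return 100.0 * (raw - random_score) / (expert_score - random_score)


def reported_score(eval_scores: Sequence[float],
                   eval_every: int = 2_500,
                   total_steps: int = 100_000,
                   window_steps: int = 20_000) -> float:
    """Average the evaluations from the final `window_steps` of training."""
    n_evals = total_steps // eval_every      # 40 evaluations under the assumed schedule
    n_last = window_steps // eval_every      # the last 8 of them are averaged
    if len(eval_scores) != n_evals:
        raise ValueError(f"expected {n_evals} evaluations, got {len(eval_scores)}")
    last = eval_scores[-n_last:]
    return sum(last) / len(last)


# Toy numbers only: a raw return halfway between random and expert.
halfway = normalized_score(raw=60.0, random_score=10.0, expert_score=110.0)  # 50.0
final = reported_score(list(range(40)))                                      # 35.5
```

Because the normalization is linear, scores above 100 (as in some Atari results) simply mean the policy exceeded the reference expert return.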
5.2 SPARSE-REWARD TASKS
| Task | Dataset | Ratio | LaMo | DT | Wiki-RL | CQL | IQL | TD3+BC | BC |
|---|---|---|---|---|---|---|---|---|---|
| Kitchen | Partial | 1 | 46.6 ± 5.3 | 33.8 ± 14.5 | 20.4 ± 10.4 | 0.2 ± 1.0 | 45.7 ± 3.3 | 8.2 ± 6.5 | 1.1 ± 1.9 |
| Kitchen | Complete | 1 | 64.2 ± 5.3 | 52.8 ± 3.7 | 21.7 ± 6.6 | 0.0 ± 0.0 | 30.0 ± 1.5 | 0.6 ± 1.0 | 0.0 ± 0.0 |
| Reacher2d | Medium | 1 | 33.0 ± 8.3 | 22.8 ± 6.0 | 29.4 ± 8.5 | 31.5 ± 0.1 | 30.4 ± 1.0 | 31.2 ± 0.2 | 14.0 ± 7.4 |
| Average | | | 47.9 (↑31%) | 36.5 | 23.8 | 10.6 | 35.4 | 13.3 | 5.0 |

| Task | Dataset | Ratio | LaMo | DT | Wiki-RL | CQL | IQL | TD3+BC | BC |
|---|---|---|---|---|---|---|---|---|---|
| Kitchen | Partial | 0.01 | 11.6 ± 3.0 | 0.9 ± 0.9 | 9.2 ± 3.0 | 0.7 ± 1.0 | 5.5 ± 1.5 | 13.9 ± 3.2 | 1.6 ± 0.9 |
| Kitchen | Partial | 0.1 | 35.1 ± 5.2 | 22.6 ± 6.8 | 27.9 ± 3.6 | 0.0 ± 0.0 | 19.7 ± 3.3 | 17.0 ± 3.4 | 4.6 ± 2.2 |
| Kitchen | Complete | 0.3 | 45.9 ± 2.9 | 31.5 ± 4.5 | 32.8 ± 3.9 | 1.7 ± 0.8 | 29.5 ± 1.2 | 0.0 ± 0.0 | 0.0 ± 0.0 |
| Kitchen | Complete | 0.5 | 50.6 ± 6.1 | 36.6 ± 5.1 | 13.9 ± 5.1 | 17.6 ± 5.0 | 35.4 ± 2.5 | 0.1 ± 0.3 | 4.8 ± 1.9 |
| Reacher2d | Medium | 0.1 | 12.4 ± 3.8 | 2.3 ± 1.5 | 4.1 ± 2.6 | 15.8 ± 0.2 | 5.8 ± 0.8 | 8.7 ± 0.7 | 2.1 ± 2.1 |
| Reacher2d | Medium | 0.3 | 31.2 ± 7.6 | 6.4 ± 2.6 | 19.4 ± 7.4 | 30.0 ± 0.4 | 10.2 ± 1.1 | 24.5 ± 1.7 | 10.2 ± 3.8 |
| Average | | | 31.1 (↑86%) | 16.7 | 17.9 | 11.0 | 17.7 | 10.7 | 3.9 |
Table 1: Normalized score for sparse-reward tasks. We compare LaMo with DT, Wiki-RL, CQL,
IQL, TD3+BC, and BC. Mean of 3 seeds with number 0, 1, 2. Blue highlight indicates the highest
score, orange highlight indicates the second-highest score, and red numbers represent the improve-
ment of LaMo over DT.
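The Average rows and the improvement percentages in Table 1 appear to be simple means of the per-row scores and the relative gain of LaMo's average over DT's average. The short check below reproduces the reported figures from the table entries under that reading; the helper names are hypothetical.

```python
def mean(values):
    return sum(values) / len(values)


def relative_improvement(lamo: float, dt: float) -> float:
    """Percentage gain of LaMo's average score over DT's average score."""
    return 100.0 * (lamo - dt) / dt


# Full-data rows of Table 1 (Kitchen Partial, Kitchen Complete, Reacher2d Medium).
lamo_full = round(mean([46.6, 64.2, 33.0]), 1)   # 47.9
dt_full = round(mean([33.8, 52.8, 22.8]), 1)     # 36.5
gain_full = round(relative_improvement(lamo_full, dt_full))   # 31

# Reduced-data rows of Table 1, using the reported averages directly.
gain_low = round(relative_improvement(31.1, 16.7))            # 86
```

The same calculation applied to the reported averages in Tables 2 and 3 gives 37%, 128%, 9%, and 46%, matching the arrows shown there.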
Results for sparse-reward tasks including Kitchen and Reacher2d are given in Table 1. We select
strong baselines including CQL, IQL, TD3+BC, BC, DT and Wiki-RL. We observe that LaMo
shows an overwhelming advantage over Decision Transformer and Wiki-RL across all tasks and
datasets, which indicates that our approach effectively harnesses the power of the pre-trained model.
Overall, LaMo has improved the performance of DT by up to 50%. Compared with value-based
methods, our approach also demonstrates significant advantages in average performance. We have
achieved the best performance among all strong baselines in 7 tasks and second-place results in 2
tasks: Kitchen Partial with 1% data and Reacher2d Medium with 10% data.
Notably, in Kitchen tasks, CQL initially performs reasonably well, but as training progresses,
it overfits, causing a notable drop in its performance, as shown in Appendix F.
No such phenomenon occurs for LaMo, reflecting LaMo's success in
preventing overfitting.
5.3 DENSE-REWARD TASKS
| Task | Dataset | Ratio | LaMo | DT | Wiki-RL | CQL | BCQ | NFQ | BC |
|---|---|---|---|---|---|---|---|---|---|
| Breakout | Medium | 1 | 473.4 ± 195.6 | 350.3 ± 134.6 | 129.0 ± 105.9 | 367.8 ± 131.9 | 56.2 ± 19.2 | -4.5 ± 2.0 | 291.3 ± 114.8 |
| Qbert | Medium | 1 | 79.0 ± 13.1 | 28.9 ± 18.3 | 7.6 ± 6.5 | 83.3 ± 14.8 | 50.8 ± 16.3 | -0.3 ± 0.4 | 51.9 ± 11.2 |
| Pong | Medium | 1 | 125.6 ± 6.6 | 116.1 ± 10.4 | 98.1 ± 15.6 | 116.4 ± 9.5 | 89.1 ± 16.5 | -1.0 ± 0.0 | -1.0 ± 0.1 |
| Average | | | 226.0 (↑37%) | 165.1 | 78.2 | 189.1 | 65.3 | -1.9 | 114.1 |

| Task | Dataset | Ratio | LaMo | DT | Wiki-RL | CQL | BCQ | NFQ | BC |
|---|---|---|---|---|---|---|---|---|---|
| Breakout | Medium | 0.1 | 136.9 ± 91.1 | 24.9 ± 11.2 | 9.4 ± 6.9 | 58.1 ± 19.8 | 15.0 ± 6.5 | -3.7 ± 2.9 | 62.5 ± 16.2 |
| Qbert | Medium | 0.1 | 63.6 ± 17.2 | 26.1 ± 14.3 | 6.7 ± 6.1 | 62.0 ± 20.6 | 15.0 ± 11.0 | -0.6 ± 0.5 | -0.2 ± 0.1 |
| Pong | Medium | 0.1 | 114.8 ± 8.8 | 87.1 ± 19.7 | 22.7 ± 10.1 | 119.2 ± 9.6 | 57.6 ± 20.4 | -1.0 ± 0.0 | -1.0 ± 0.1 |
| Average | | | 105.1 (↑128%) | 46.0 | 13.0 | 79.8 | 29.2 | -1.8 | 20.5 |
Table 2: Normalized score for 3 dense-reward tasks in Atari. We compare LaMo with DT, Wiki-
RL, CQL, BCQ, NFQ and BC. Mean of 3 seeds with number 0, 1, 2. Blue highlight indicates the
highest score, orange highlight indicates the second-highest score, and red numbers represent the
improvement of LaMo over DT.
| Task | Dataset | Ratio | LaMo | DT | Wiki-RL | CQL | IQL | TD3+BC | BC |
|---|---|---|---|---|---|---|---|---|---|
| Hopper | Medium | 1 | 74.1 ± 5.3 | 60.9 ± 3.3 | 75.4 ± 5.9 | 61.6 ± 3.4 | 62.8 ± 3.2 | 58.7 ± 2.8 | 47.8 ± 5.3 |
| Halfcheetah | Medium | 1 | 42.5 ± 0.4 | 42.6 ± 0.5 | 41.9 ± 0.8 | 46.7 ± 0.2 | 48.3 ± 0.2 | 48.2 ± 0.1 | 42.2 ± 1.0 |
| Walker2d | Medium | 1 | 73.3 ± 3.1 | 70.2 ± 4.3 | 67.4 ± 8.1 | 81.1 ± 1.2 | 81.0 ± 3.1 | 84.0 ± 1.3 | 57.5 ± 9.5 |
| Average | | | 63.3 (↑9%) | 57.9 | 61.6 | 63.1 | 64.1 | 63.6 | 49.2 |

| Task | Dataset | Ratio | LaMo | DT | Wiki-RL | CQL | IQL | TD3+BC | BC |
|---|---|---|---|---|---|---|---|---|---|
| Hopper | Medium | 0.005 | 57.0 ± 7.1 | 35.8 ± 6.6 | 49.9 ± 5.0 | 37.9 ± 3.9 | 41.1 ± 2.7 | 40.1 ± 3.6 | 47.0 ± 4.2 |
| Hopper | Medium | 0.01 | 52.0 ± 4.6 | 41.9 ± 5.2 | 50.2 ± 5.0 | 39.8 ± 5.4 | 51.3 ± 2.4 | 51.0 ± 3.9 | 50.0 ± 12.6 |
| Hopper | Medium | 0.1 | 73.7 ± 3.5 | 57.3 ± 3.8 | 67.3 ± 4.9 | 59.8 ± 2.3 | 50.6 ± 3.1 | 56.9 ± 2.3 | 44.4 ± 7.7 |
| Halfcheetah | Medium | 0.005 | 39.0 ± 1.6 | 22.4 ± 5.2 | 37.6 ± 1.7 | 40.5 ± 1.0 | 34.9 ± 1.9 | 17.3 ± 3.0 | 34.8 ± 1.8 |
| Halfcheetah | Medium | 0.01 | 40.6 ± 1.3 | 29.6 ± 4.8 | 38.4 ± 2.1 | 41.9 ± 0.6 | 34.8 ± 2.0 | 24.3 ± 2.5 | 37.2 ± 2.3 |
| Halfcheetah | Medium | 0.1 | 42.1 ± 0.6 | 41.7 ± 0.8 | 40.5 ± 1.1 | 45.0 ± 0.5 | 46.7 ± 0.3 | 48.3 ± 0.2 | 42.0 ± 1.0 |
| Walker2d | Medium | 0.005 | 66.9 ± 5.4 | 16.7 ± 4.8 | 46.5 ± 20.4 | 51.9 ± 9.1 | 30.9 ± 6.0 | 3.4 ± 1.2 | 24.0 ± 12.5 |
| Walker2d | Medium | 0.01 | 74.5 ± 4.7 | 38.9 ± 9.3 | 60.2 ± 10.5 | 69.7 ± 4.2 | 44.5 ± 4.8 | 12.9 ± 4.1 | 65.3 ± 11.2 |
| Walker2d | Medium | 0.1 | 70.4 ± 4.2 | 70.2 ± 7.5 | 72.4 ± 2.6 | 75.2 ± 3.2 | 44.5 ± 4.8 | 12.9 ± 4.1 | 66.7 ± 10.1 |
| Average | | | 57.4 (↑46%) | 39.4 | 51.4 | 51.3 | 42.1 | 29.7 | 45.7 |
Table 3: Normalized score for 3 dense-reward tasks in MuJoCo. We compare LaMo with DT,
Wiki-RL, CQL, IQL, TD3+BC, and BC.
Results for dense-reward tasks are given in Table 2 and Table 3. For Atari, since IQL and TD3+BC
do not support discrete control (Seno & Imai, 2022), we select CQL, BCQ, and NFQ as baselines.
We observe that LaMo achieves the highest average scores in Atari and MuJoCo under the low-data
regime. However, we also notice that in the MuJoCo domain, when the data scale is relatively large
(10%, 100%), LaMo only comes close to DT and falls behind CQL in Halfcheetah and Walker2d.
In Qbert Medium (100%) and Pong Medium (10%), LaMo also does not surpass CQL. We attribute
this to the following reason: in sparse-reward tasks, the Bellman backups only slowly
propagate the information of rewards (Chen et al., 2021), which limits the performance of value-based
algorithms, whereas dense-reward tasks are extremely suitable for value-based methods such as CQL and
less favorable to DT, as empirically examined by Bhargava et al. (2023). Our experiments
support these observations and show that LaMo can further enhance the potential of DT, closing the
performance gap between DT and CQL in dense-reward tasks.
5.4 ABILITY IN LOW-DATA REGIME
[Figure panel titles: Hopper Medium; Breakout Medium; Kitchen C...]