TRAD: Enhancing LLM Agents with Step-Wise Thought Retrieval and Aligned Decision
Zhou and Yang, et al.
Shanghai Jiao Tong University · China Pacific Insurance
Shanghai, China
ABSTRACT
Numerous large language model (LLM) agents have been built for different tasks like web navigation and online shopping, owing to LLMs' wide knowledge and text-understanding ability. Among these works, many utilize in-context examples to achieve generalization without the need for fine-tuning, while few have considered how to select and effectively utilize these examples. Recently, methods based on trajectory-level retrieval with task meta-data and using trajectories as in-context examples have been proposed to improve the agent's overall performance in some sequential decision-making tasks. However, these methods can be problematic due to plausible examples retrieved without task-specific state transition dynamics, and long input with plenty of irrelevant context. In this paper, we propose a novel framework (TRAD) to address these issues. TRAD first conducts Thought Retrieval, achieving step-level demonstration selection via thought matching, leading to more helpful demonstrations and less irrelevant input noise. Then, TRAD introduces Aligned Decision, complementing retrieved demonstration steps with their previous or subsequent steps, which enables tolerance for imperfect thoughts and provides a choice to balance more context against less noise. Extensive experiments on ALFWorld and Mind2Web benchmarks show that TRAD not only outperforms state-of-the-art models but also effectively helps in reducing noise and promoting generalization. Furthermore, TRAD has been deployed in real-world scenarios of a global business insurance company and improves the success rate of robotic process automation. Our code is available at: https://github.com/skyriver-2000/TRAD-Official.

KEYWORDS
Large Language Model, LLM Agent, Sequential Decision Making, LLM Reasoning, Information Retrieval

1 INTRODUCTION
Large Language Models (LLMs) [3, 32] have achieved remarkable success on various tasks like question answering [45], chatbots [20], code synthesis [24], text ranking [7], table-based reasoning [43], and retrieval query expansion [17], due to their wide knowledge and excellent ability of text understanding and generation. Recently, a series of works have attempted to build powerful agents based on LLMs for various sequential decision-making tasks, including text-based games [41], online shopping [40], web navigation [4], and information retrieval [48].

Among existing LLM agents, some are trained with large-scale expert data by supervised fine-tuning (SFT) [8, 9, 18], while others are tuning-free and utilize in-context learning (ICL) with a few expert demonstration examples [13, 34, 42, 46]. In this paper, we focus on tuning-free ICL methods, as they are highly cost-effective and can seamlessly generalize to different tasks using only a small amount of expert samples. Most existing ICL-based agents are prompted with expert trajectories carefully selected by humans [28, 38, 42], which works well when only a few expert trajectories are available. However, when we have access to a large dataset of expert trajectories or an expert policy, the automatic and personalized selection of expert trajectories for each task instruction becomes necessary and can have an essential influence on task performance.

Recently, Zheng et al. [46] study the problem of demonstration selection and propose Synapse, which retrieves relevant expert trajectories by task meta-data and then prompts LLMs with these retrieved trajectories. Synapse performs well on computer control tasks (MiniWob++ [27]) and web navigation tasks (Mind2Web [4]). Nevertheless, retrieving and prompting with complete trajectories can be problematic in the following three aspects.

Plausible examples. Sometimes generalization to data from various domains can be critical. For example, in the cross-website and cross-domain subsets of Mind2Web, agents operate on websites unseen in the training set, i.e., the memory. In this case, retrieving trajectories with only task meta-data is very likely to provide plausible examples, which share similar task instructions with the current one but require totally different solutions. As shown by experiments in [46], plausible examples provide no more information than random examples and can usually mislead LLM agents to wrong decisions.
Figure 1: An overall illustration of the TRAD agent (on the ALFWorld [30] environment). TRAD first pre-processes expert trajectories, labeling each step with high-quality thoughts. At inference time, TRAD first conducts thought retrieval: it generates a thought with trajectory-wise retrieved demonstrations and uses thoughts as the query and keys for a more precise step-wise demonstration retrieval. Given the retrieved steps, TRAD employs the aligned decision module to complement them with their temporally neighboring steps and corresponding position information (Fig. 2). Finally, the next action is generated according to the enhanced demonstrations.
Context limit of LLMs. When facing tasks with long horizons and complex observations, prompting with complete trajectories will result in input sequences longer than the allowed length of LLMs. Synapse thus has to reduce the number of trajectory examples or may even fail to complete the task directly. Though some long-context LLMs can receive very long prompts, the performance can be harmed due to the issue of long-term forgetting [31].

Irrelevant information in prompts. LLMs are found to be sensitive to their prompts and can easily copy their recent input [11, 22]. The decision at the current timestep can be related to very few steps in a retrieved trajectory, while other steps do not provide any helpful information. Therefore, irrelevant steps will have unpredictable effects on the decisions of LLM agents. As shown by our experiments, they negatively impact the performance most of the time.

To address the problems of trajectory-wise retrieval and prompting, we delve into step-wise demonstration retrieval and prompting. We discover that, by demonstrating with relevant steps, the input context of the LLM agent can be significantly reduced. Thus, the issues of context limit and irrelevant information can be alleviated. Therefore, the critical part is to retrieve step demonstrations that are truly relevant and helpful. To achieve this, we utilize step-by-step reasoning, i.e., the Chain-of-Thought technique [38], to abstract the state at each timestep as retrieval queries and keys. The generated thoughts can involve historical information or future plans, which are more specific to state transitions and helpful in reducing plausible examples.

In this paper, we propose Thought Retrieval and Aligned Decision (TRAD), a novel framework that achieves step-wise demonstration retrieval via thought matching and enhances the context for action prediction with temporally neighboring steps and their order information. Our contributions are four-fold:
• We propose a thought retrieval method, where we label thoughts for expert demonstration steps in advance with an LLM, prompt LLM agents to reason at inference time, and achieve step-wise retrieval by a similarity search on thoughts. To the best of our knowledge, this is the first work that equips an LLM agent with thought retrieval techniques for sequential decision-making.
• Based on the thought retrieval operation, we further propose an aligned decision method, where we supply the retrieved steps with their temporal neighbors to overcome imperfect thoughts and enhance task-relevant information.
• We conduct extensive experiments and analysis on Mind2Web [4] tasks and ALFWorld [30], showing that TRAD achieves state-of-the-art (SoTA) performance compared to existing works. TRAD brings a 2.99% improvement over the strongest baseline (93.78% → 96.77%) to the success rate (SR) on ALFWorld. On Mind2Web, TRAD improves element accuracy, step SR, and SR remarkably over the powerful Synapse agent [46] by 2.1%, 1.4%, and 0.5%.
• We have deployed TRAD to the real-world robotic process automation scenarios of a global business insurance company, where TRAD enables the LLM agent to significantly improve the success rate on a bunch of practical tasks. On average, TRAD raises step SR from 90.2% to 98.1% and SR from 65.0% to 92.5%.

2 RELATED WORK
2.1 LLM Agents
In recent years, there has been a rapidly growing trend to utilize pre-trained LLMs as the central controller to obtain human-level decision-making capabilities [35]. Among these works: Nakano et al. [18] fine-tune the GPT-3 [3] model for question answering in a text-based web browsing environment. Yao et al. [40] develop WebShop, a simulated e-commerce website environment, and fine-tune a BERT [5] model with imitation learning and reinforcement learning. Yao et al. [42] insert a reasoning section between observation input and action output, significantly improving the performance on ALFWorld [30] and WebShop [40] tasks. Shinn et al. [28] further improve over [42] via verbally reflecting on linguistic task feedback signals. Schick et al. [26] teach LLMs to use external tools via simple APIs in a self-supervised learning way. Park et al. [21] introduce Generative Agents, extending LLMs with natural language memories and retrieving them dynamically to plan behavior. Wang et al. [37] propose DEPS, an interactive planning approach, which facilitates better error correction by integrating a description of the plan execution process and an explanation of failure feedback. Wang et al. [34] employ an exploration curriculum, a growing skill library, and a novel iterative prompting mechanism, leading to better proficiency in playing Minecraft. Deng et al. [4] construct the Mind2Web dataset from real-world webpages, which consists of three subsets requiring different degrees of generalization, and compare the performance of imitation learning and few-shot inference.

As can be seen above, most existing LLM agents focus on: 1) improving task performance by direct fine-tuning [4, 18, 40]; 2) enhancing planning or reasoning by explicit prompting [28, 37, 42]; 3) extending the application with an external memory or tool library [21, 26, 34]. However, providing more relevant information in prompts, as a fundamental way to elicit better task understanding, does not receive sufficient attention. When near-optimal demonstrations are accessible, selecting few-shot demonstrations properly can be a simple yet very effective way to improve task performance, which is investigated in our work.

2.2 In-Context Example Selection
LLMs have shown excellent few-shot learning ability [3], and the selection of in-context examples can yield a significant improvement in the overall performance. Liu et al. [16] first propose to retrieve the 𝑘-nearest neighbors (𝑘-NN) of the input as in-context examples, and achieve improvement over random retrieval baselines. Rubin et al. [25] select relevant samples with an encoder trained with label similarity, and obtain better performance over BM25 and pre-trained encoder baselines. Zhang et al. [44] consider selecting and labeling unlabeled examples as demonstrations to achieve the best performance, and view this problem as a sequential decision-making task to be solved by reinforcement learning. Wu et al. [39] further select examples in a subset recalled from 𝑘-NN search via minimizing the entropy of the output.

IRCoT [33] should be the most relevant work to ours, which retrieves relevant documents with reasoning steps on question-answering tasks. However, their method consists of retrieving with a complete historical trajectory and accumulating retrieved trajectories over time, which is not transferable to complex sequential decision-making tasks, and we propose a method different from theirs in that: (i) Our method focuses on both providing more relevant demonstrations and reducing irrelevant context for sequential decision-making tasks, while theirs is limited to question-answering tasks and only addresses the first issue. (ii) Our method retrieves completely different steps across timesteps and complements the retrieval results with temporal information, while theirs only accumulates relevant documents at every reasoning step and heuristically cuts off the earliest ones to fit the context limit of LLMs. (iii) Our method prepares pseudo-golden thoughts for expert trajectories in the memory to enable retrieval over trajectories without thoughts, and utilizes single-step thoughts as both queries and keys for precise retrieval, while theirs uses thoughts only as queries with raw documents as keys.

The selection of in-context examples has been studied thoroughly for non-sequential tasks like question answering and sentiment analysis. However, for sequential decision-making tasks, how to select the examples to improve the overall performance remains unclear. Zheng et al. [46] propose a trajectory-wise retrieval solution, while a more precise step-wise solution is still desired as discussed in Section 1, which motivates our work.

2.3 LLM Planning and Reasoning
Our work proposes to use thought, which can be viewed as a general abstraction of the current state, as queries and keys for retrieval. Nevertheless, plans, code comments, and any other text that extracts comprehensive information about the current state can serve as an alternative. Therefore, we particularly review some remarkable reasoning and planning works based on LLMs, most of which are complementary to our work.

Wei et al. [38] first introduce the concept of Chain-of-Thought (CoT): providing an explicit step-by-step reasoning process in example outputs improves performance on arithmetic, commonsense, and symbolic reasoning tasks. Wang et al. [36] further find that a single reasoning path can be sub-optimal, and propose self-consistency to address this problem by sampling multiple reasoning paths. For an efficient yet flexible search of reasoning paths, Yao et al. [41] apply tree search with self-evaluation to find globally excellent thoughts. Besta et al. [2] later extend the tree-search structure to a graph search for even better flexibility and overall performance.

The works mentioned above consider problems that are non-sequential or solvable by a single complete reasoning path after receiving the input. For harder sequential decision-making problems: Zhou et al. [47] introduce least-to-most prompting to solve hard problems by decomposing the problem and solving sub-problems sequentially. ReAct, proposed by Yao et al. [42], interacts with the environment in a reason-then-act style, which enriches the context for action prediction.
Figure 2: An illustration of our aligned decision method, where 𝐵 = 𝐹 = 1 and the 𝑖-th retrieved step is at time 𝑡_𝑖 in its trajectory. The aligned decision method consists of three sub-processes applied to the retrieved step demonstrations and the prompt: 1) Temporal Expansion: collect at most 𝐵 previous steps and 𝐹 subsequent steps for each retrieved step, and transform each step into a sequence of length 𝐵 + 𝐹 + 1 from 𝑡_𝑖 − 𝐵 to 𝑡_𝑖 + 𝐹; 2) Relative Order Mark: for each step in one demonstration step sequence, label its relative position to the retrieved step in this sequence, i.e., the previous one (𝑡_𝑖 − 1) with [Step -1] and the next one (𝑡_𝑖 + 1) with [Step 1]; 3) History Alignment: for the current episode, complement the current observation (and optionally the thought) with the 𝐵 + 𝐹 previous steps to enrich information and align with the demonstrations.
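For concreteness, the following is a minimal Python sketch of the three sub-processes summarized in the caption above (and detailed in Section 3.3). It assumes a simple dictionary layout for memory steps, and the helper names (expand_and_mark, align_history) are illustrative rather than taken from the released implementation.

```python
# Assumed data layout: each demonstration step is a dict with keys
# "traj" (list of steps in its source trajectory), "t" (its timestep),
# "obs", "thought", and "act". B/F are the backward/forward expansion sizes.

def expand_and_mark(retrieved_step, B=1, F=1):
    """Temporal expansion + relative order mark for one retrieved step."""
    traj, t = retrieved_step["traj"], retrieved_step["t"]
    lo, hi = max(0, t - B), min(len(traj) - 1, t + F)
    marked = []
    for j in range(lo, hi + 1):
        step = traj[j]
        # Relative order mark: position of step j w.r.t. the retrieved step t.
        marked.append(f"[Step {j - t}] {step['obs']}\n"
                      f"think: {step['thought']}\nact: {step['act']}")
    return "\n".join(marked)

def align_history(history, current_obs, B=1, F=1):
    """History alignment: prepend at most B+F previous input-output pairs."""
    recent = history[-(B + F):]
    lines = [f"{o}\nact: {a}" for o, a in recent]
    lines.append(current_obs)  # the current observation comes last
    return "\n".join(lines)
```

Keeping the marks relative to the retrieved step (rather than absolute timesteps) lets demonstrations taken from trajectories of different lengths be compared on an equal footing.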
Code-as-Policies [14] writes executable code for embodied control by hierarchically expanding undefined programs, which can be viewed as an implicit reasoning or CoT process. Liu et al. [15] propose to incorporate the strength of classical planners by translating the original problem into a PDDL [1] problem to be solved by classical planners. Hao et al. [10] and Ding et al. [6] share a similar insight that reasoning can indeed be implemented by planning, where [10] use LLMs as world models and [6] conduct MCTS for thought generation with a light-weight extra network.

To summarize, LLM planning and reasoning have continuously received huge attention from researchers in recent years. This makes our work flexible and improvable with more powerful planning and reasoning methods in the future.

3 THE TRAD FRAMEWORK
As discussed in Section 1, trajectory-wise retrieval and prompting lead to issues of plausible examples, LLM context limits, and irrelevant information. To resolve these issues, we propose a novel method called Thought Retrieval and Aligned Decision (TRAD), as illustrated in Fig. 1. Our TRAD agent utilizes thought, which is obtained by reasoning about its current state, to retrieve similar steps from expert trajectories, and is then complemented with steps temporally correlated to the retrieved ones and their temporal position information to predict the action. Formally, our TRAD agent can be summarized in one equation:

𝜋_TRAD(𝑎_𝑡 | 𝜉, 𝑜_{0:𝑡}, 𝑎_{0:𝑡−1}) = LLM(AD(TR(𝜏_𝑡, M), 𝜉, 𝑜_{0:𝑡}, 𝑎_{0:𝑡−1})),

where 𝜉 is the current task, 𝑜_{0:𝑡} and 𝑎_{0:𝑡−1} are the historical observations and actions, 𝜏_𝑡 is the thought generated by the LLM about the current state, TR and AD denote our thought retrieval and aligned decision modules, and M refers to the thought-enhanced memory. We will present each module of TRAD in the following subsections.
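As a roadmap for the following subsections, the equation above can be read as the composition sketched below. This is an illustrative outline only: the callables llm, thought_retrieval, and aligned_decision are placeholders whose concrete behavior is described in Sections 3.1-3.3, not the official implementation.

```python
def trad_step(task, obs_hist, act_hist, memory, llm,
              thought_retrieval, aligned_decision, K=3):
    """One TRAD decision step, mirroring
    pi_TRAD(a_t | xi, o_0:t, a_0:t-1) = LLM(AD(TR(tau_t, M), xi, o_0:t, a_0:t-1))."""
    # Reason about the current state to obtain the thought tau_t (Section 3.2).
    thought = llm(f"Task: {task}\n"
                  f"History: {list(zip(obs_hist, act_hist))}\n"
                  f"Observation: {obs_hist[-1]}\nthink:")
    # TR: step-wise retrieval from the thought-enhanced memory M.
    retrieved_steps = thought_retrieval(thought, memory, top_k=K)
    # AD: build the aligned prompt and predict the next action a_t.
    prompt = aligned_decision(retrieved_steps, task, obs_hist, act_hist)
    return llm(prompt)
```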
3.1 Thought Preparation
Most expert trajectories, collected by either humans or other expert agents, do not contain their reasoning process. Therefore, before we utilize thoughts for retrieval, we should prepare thoughts for each demonstration step in the memory. Specifically, we start from a small subset of expert demonstrations and provide thoughts written by human experts for each step in it. Given this small subset as few-shot examples in prompts, we can query LLMs to label thoughts for a large memory. Although ground-truth actions are not accessible at inference time, we can prompt LLMs with them here to generate thoughts of higher quality. In this way, LLMs produce pseudo-golden thoughts consistent with expert actions, and we obtain a thought-enhanced memory M supporting both trajectory-wise retrieval with task meta-data and step-wise retrieval with thoughts.
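A minimal sketch of this offline labeling pass is shown below, assuming an OpenAI-style chat-completion client and a hypothetical few-shot prompt template; the exact prompts used by TRAD are not reproduced here.

```python
from openai import OpenAI  # assumed client; any chat-completion API works

client = OpenAI()
FEW_SHOT = "..."  # small set of human-written (observation, action, thought) examples

def label_pseudo_golden_thoughts(trajectory, model="gpt-4"):
    """Label every step of one expert trajectory with a thought.

    Because the expert action of each step is known offline, it is included in
    the prompt so the generated thought stays consistent with the expert action
    ("pseudo-golden" thoughts).
    """
    thoughts = []
    for step in trajectory["steps"]:
        prompt = (f"{FEW_SHOT}\n"
                  f"Task: {trajectory['instruction']}\n"
                  f"Observation: {step['obs']}\n"
                  f"Expert action: {step['act']}\n"
                  f"Write the thought that leads to this action.\nthink:")
        resp = client.chat.completions.create(
            model=model, messages=[{"role": "user", "content": prompt}])
        thoughts.append(resp.choices[0].message.content.strip())
    return thoughts
```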
3.2 Thought Retrieval
Given pseudo-golden thoughts for all steps in the memory, which can serve as keys for a step-wise similarity search, we now present our thought retrieval method to select relevant demonstrations at inference time. To be specific, we first conduct trajectory-wise demonstration retrieval as in [46] for thought generation. With these trajectory demonstrations, at each timestep 𝑡 we prompt the LLM to generate a thought 𝜏_𝑡 for step-wise retrieval. Note that this process does not directly affect decision-making, hence it can be further simplified if necessary, and the issues mentioned in Section 1 will not impact the agent severely.

With the thought 𝜏_𝑡, which can be viewed as an abstraction of the current state, we conduct dense retrieval to find relevant steps in the thought-enhanced memory M. Here any encoder pre-trained on a large corpus for retrieval, e.g., Sentence-BERT [23] and DPR [12], can be utilized to encode the query thought and key thoughts into dense vectors. Using the cosine similarity between the query and keys, we then collect the top-𝐾 relevant steps that belong to mutually different trajectories, together with their corresponding task instructions.
Table 1: Success rate of different methods on the 6 types of ALFWorld tasks. We compare TRAD with ReAct [42], Synapse [46], and their strong combination. TRAD significantly outperforms all baselines in terms of overall performance, achieves the best performance on 5 out of 6 task types, and shows decent performance on the Heat task. The improvement of TRAD over all baselines on overall performance is statistically significant (measured by Student's t-test at 𝑝 < 0.05).
Method Put Examine Clean Heat Cool PutTwo All
ReAct (Random) 0.8472±0.0393 0.8333±0.0454 0.9570±0.0304 0.8841±0.0205 0.9841±0.0224 0.8431±0.0277 0.8980±0.0093
ReAct (Fixed) 0.7778±0.0708 0.9630±0.0262 0.9032±0.0263 0.9275±0.0205 1.0000±0.0000 0.8824±0.0480 0.9055±0.0186
Synapse 0.9444±0.0196 0.7037±0.0262 0.9355±0.0000 0.9130±0.0615 1.0000±0.0000 0.8039±0.0555 0.8955±0.0106
Synapse + ReAct 0.9167±0.0340 0.9444±0.0454 1.0000±0.0000 0.9130±0.0000 0.9524±0.0000 0.8627±0.0555 0.9378±0.0035
TRAD (Ours) 0.9583±0.0000 0.9630±0.0524 1.0000±0.0000 0.8986±0.0205 1.0000±0.0000 0.9804±0.0277 0.9677±0.0141
3.3 Aligned Decision
Now we have relevant demonstration steps from thought retrieval. However, the query thought can be imperfect due to the lack of expert action information at inference time. As we will show by ablation experiments in Section 4.4, directly using these steps to form single-step demonstrations does not provide satisfactory performance, which is similar to the plausible-example issue of trajectory-wise retrieval. Therefore, we propose an aligned decision method to incorporate more information into the decision-making process. Aligned decision complements LLM agents with steps temporally correlated to the retrieved ones and their temporal position information. As illustrated in Fig. 2, the aligned decision method can be decomposed into the following three sub-processes.

Temporal expansion. For each retrieved step, we first expand it into a step sequence involving 𝐵 previous steps and 𝐹 subsequent steps. When the number of previous or subsequent steps is smaller than 𝐵 or 𝐹, we simply take all previous or subsequent steps. This transforms each retrieved step into at most (𝐵 + 1 + 𝐹) temporally successive steps, allowing LLM agents to correct their imperfect thoughts by looking at more related steps at decision-making time.

Relative order mark. Given the 𝐾 expanded step sequences from temporal expansion, we insert a mark for each step (including the retrieved ones) indicating its relative position w.r.t. its corresponding retrieved step, and incorporate this marking rule in the prompt for decision. For example, the last step before the retrieved one will be marked as [Step -1], the retrieved step as [Step 0], and the first step after the retrieved one as [Step 1]. This provides temporal information about the (𝐵 + 1 + 𝐹) × 𝐾 demonstration steps, and promotes more accurate demonstration following.

History alignment. Sometimes the optimal policy for a task, as in ALFWorld, can be history-dependent, hence using single-step input for action prediction is unreasonable. Since we aim to reduce the input content for less forgetting and noise, we should not use all historical observations and actions either. Moreover, even if we include previous actions as auxiliary information, there exists a mismatch where expert demonstrations are given as sequences of length 𝐵 + 1 + 𝐹 while the current input is a single step. We thus propose to insert at most 𝐵 + 𝐹 previous input-output pairs (i.e., 𝑜_{𝑡−(𝐵+𝐹):𝑡−1}, 𝑎_{𝑡−(𝐵+𝐹):𝑡−1}) before the current input 𝑜_𝑡, transforming the current input into a sequence similar to the demonstrations.

4 EXPERIMENTS
In this section, we aim to answer the following research questions:
RQ1 How does TRAD perform against existing SoTA methods?
RQ2 Does thought retrieval help to reduce irrelevant context and improve the overall performance?
RQ3 Does aligned decision help to supply information when generalization is important?
RQ4 Diving into aligned decision, are all of temporal expansion (TE), relative order mark (ROM), and history alignment (HA) necessary for improvement?
RQ5 How will the performance and advantage of TRAD be affected by critical hyper-parameters?

4.1 Experiment Setup
To answer the above research questions, we conduct extensive experiments on ALFWorld [30] and Mind2Web [4] tasks. For each task, we introduce the details of evaluation as follows.

ALFWorld [30] is a text-based game aligned with the ALFRED [29] benchmark. It involves 6 types of tasks where an agent must take a series of actions (e.g., go to shelf 1, take vase 2 from shelf 1, put vase 2 in/on cabinet 5) to achieve a high-level goal given by a natural language instruction (e.g., put some vase on a cabinet). This environment is challenging in three aspects: 1) the agent should determine the likely places of a household object and explore them one by one to find it; 2) the agent should understand the usage of some objects like microwaves, fridges, and desklamps; 3) some tasks can take an agent more than 30 steps to solve, requiring substantial long-term memorization.

Following Shridhar et al. [30], we evaluate on the subset of 134 out-of-distribution tasks, comparing the task success rates of TRAD to ReAct [42] and Synapse [46] (without state abstraction, as observations are short). As ReAct and Synapse have provided sufficiently strong performances, we do not include more complex reasoning and planning baselines and corresponding variants of TRAD due to our API cost limit. Note that the original ReAct uses fixed rather than retrieved trajectories as demonstrations, hence we test two ReAct baselines to eliminate such an effect:
• ReAct (Fixed) uses fixed human-written trajectories as demonstrations;
• ReAct (Random) randomly samples trajectories from the memory as demonstrations.
Table 2: Results (%) of all methods on the Mind2Web benchmark. TRAD achieves the best overall performance and the most improvement on the two harder subsets, especially the most out-of-distribution Cross-Domain subset. The improvement of TRAD over all baselines on the three overall metrics is statistically significant (measured by Student's t-test with 𝑝 < 0.01).
Cross-Task Cross-Website Cross-Domain All
Method
Ele. Acc Step SR SR Ele. Acc Step SR SR Ele. Acc Step SR SR Ele. Acc Step SR SR
MindAct 20.3 17.4 0.8 19.3 16.2 0.6 21.0 18.6 1.0 20.6 18.0 0.9
ReAct (Random) 31.0 24.7 1.6 25.7 19.1 0.6 27.9 22.9 1.8 28.3 22.7 1.6
ReAct (Relevant) 31.3 26.0 1.2 26.7 20.5 0.6 28.0 23.1 1.6 28.5 23.4 1.4
Synapse w/o Retrieval 33.1 28.9 3.2 27.8 22.1 1.1 30.0 26.5 1.4 30.4 26.4 1.7
Synapse 34.4 30.6 2.0 28.8 23.4 1.1 29.4 25.9 1.6 30.4 26.6 1.6
TRAD (Ours) 35.2 30.8 3.6 30.4 24.0 0.6 32.0 28.0 2.0 32.5 28.0 2.1
We also evaluate a combination of the two (Synapse + ReAct), combining the trajectory-level retrieval in Synapse and the reasoning in ReAct. On ALFWorld, all methods are built with GPT-4 [19] and 2 in-context examples.

Mind2Web [4] is an HTML-based web navigation benchmark collected from real-world webpages, involving various tasks such as searching, trip booking, social network subscription, etc. It contains 3 subsets, i.e., cross-task, cross-website, and cross-domain. This environment is challenging in two aspects: 1) existing LLM agents can hardly understand HTML input well; 2) unseen tasks and websites can require substantial generalization. Deng et al. [4] find that the cross-website and cross-domain subsets are significantly harder due to the need for generalization to unseen websites.

Since Mind2Web was introduced only about half a year ago, there is a lack of suitable baseline algorithms, and thus we compare our TRAD agent to Synapse [46] and ReAct [42]. Following Zheng et al. [46], we evaluate on all 3 subsets, comparing the element accuracy (Ele. Acc), step success rate (Step SR), and trajectory success rate (SR). For a fair comparison, we follow [46] and summarize observations into 5 web elements with the pre-trained element ranker provided by [4] for all methods. Since the observations are still very complex on Mind2Web, including thoughts for every step in trajectories is not feasible, hence: 1) we do not include a Synapse + ReAct baseline; 2) TRAD generates thoughts and predicts actions by a single-step prompt with the current observation and previous actions (without previous observations). To eliminate the effect of prompting style and reasoning, we build two ReAct baselines using the same prompt format as TRAD:
• ReAct (Random), for which we prompt ReAct with completely random demonstration steps.
• ReAct (Relevant), for which we prompt ReAct with demonstration steps randomly chosen from trajectories retrieved by Synapse.
We do not include the ReAct (Fixed) baseline as it is hard to write or pick demonstrations commonly helpful for such diverse test sets. We also provide the results of the simplest MindAct [4] baseline without reasoning and retrieval for completeness. On Mind2Web, all methods are built with GPT-3.5-turbo and 3 in-context examples.

4.2 Evaluation on ALFWorld
The success rate of each method tested on ALFWorld is shown in Tab. 1. Generally, our TRAD agent achieves an average success rate of 96.77%, significantly outperforming ReAct (∼90%), Synapse (89.55%), and even their strong combination (93.78%). It is also worth noting that the worst trial of TRAD among 3 random seeds achieves a success rate of 94.8%, outperforming the best trial produced by any other method (94.0%).

Going down to the success rate on each type of task, we observe that the success rate of each method varies more on the simplest Put task and the hardest PutTwo task. We discuss the results of these two tasks respectively as follows:
• On the simplest Put task, ReAct performs even more poorly than on other, harder tasks. We find that the two vital reasons for ReAct's failure on the Put task are incorrect location and usage of objects, e.g., trying to put an object in a closed safe. As this issue can be alleviated through a combination with Synapse, the necessity of retrieving relevant demonstrations is thus justified.
• TRAD achieves the largest improvement on the hardest PutTwo task. PutTwo requires correcting the locations of two objects and a comprehensive understanding of the task process. Since TRAD's outstanding performance on this hardest task is obtained with a reduced input context at decision-making time, we can conclude that step-wise thought retrieval is helpful by reducing the noise of irrelevant steps and finding relevant examples more precisely.

4.3 Evaluation on Mind2Web
To verify the capability of TRAD under more realistic scenarios, we compare TRAD to ReAct and the current SoTA method, Synapse, on the Mind2Web benchmark; the results are shown in Tab. 2. We also include the results of Synapse without retrieval here to better illustrate the effect of different retrieval methods.

Generally, TRAD achieves the highest performance in terms of all 3 metrics averaged over the 3 subsets. Considering that the trajectory-level retrieval of Synapse only brings marginal boosts on the Cross-Task and Cross-Website subsets, and even slightly impacts the performance on the Cross-Domain subset, our TRAD method can thus be justified in two aspects:
• By reducing input context and utilizing step-wise relevant demonstrations, our step-wise thought retrieval helps more than the trajectory-wise retrieval with task meta-data in Synapse to improve on the simplest Cross-Task subset.
• By eliminating plausible examples and complementing temporally correlated steps, aligned decision helps to improve on the two harder subsets, especially the most out-of-distribution Cross-Domain subset.
Furthermore, we observe that the two ReAct baselines perform poorly on this task, which indicates that:
Table 3: Results (%) of ablation studies on the Mind2Web benchmark. TE builds the basic structure of aligned decision and is thus critical for the performance boost on all three subsets. HA and ROM work well to promote generalization on the two harder Cross-Website and Cross-Domain subsets but provide little help on the Cross-Task subset. The improvement of TRAD over all ablation baselines on Ele. Acc and Step SR is statistically significant (measured by Student's t-test with 𝑝 < 0.05).
Cross-Task Cross-Website Cross-Domain All
Method
Ele. Acc Step SR SR Ele. Acc Step SR SR Ele. Acc Step SR SR Ele. Acc Step SR SR
TRAD w/o TE 34.2 28.4 1.2 27.4 20.4 0.6 29.1 24.0 1.4 30.0 24.5 1.3
TRAD w/o HA 36.2 31.1 4.0 28.3 22.2 0.6 29.4 24.9 1.8 30.8 25.9 2.1
TRAD w/o ROM 35.7 30.5 3.6 28.9 22.3 0.6 31.5 27.2 1.9 32.1 27.2 2.0
TRAD (Ours) 35.2 30.8 3.6 30.4 24.0 0.6 32.0 28.0 2.0 32.5 28.0 2.1
• The thoughts generated by GPT-3.5-turbo on Mind2Web tasks are not sufficient for LLM agents to infer the correct action.
• The single-step prompting style, which removes previous observations, does not by itself benefit overall performance.
On the contrary, TRAD utilizes these imperfect thoughts for retrieval rather than direct decision-making, and is complemented with temporally correlated steps via aligned decision. Therefore, TRAD is not negatively impacted by the imperfect thoughts, but transforms them into helpful information.

Before we start the study on detailed design and hyper-parameter choices of TRAD, we can summarize our performance evaluation on the ALFWorld and Mind2Web benchmarks and answer the first three research questions as follows.
Answer to RQ1: On both householding (ALFWorld) and web navigation (Mind2Web) tasks, TRAD significantly outperforms current SoTA methods and becomes the new SoTA method.
Answer to RQ2: On the ALFWorld benchmark, Synapse + ReAct generates thoughts in exactly the same way as our TRAD, and uses entire relevant trajectories (more information than TRAD) as demonstrations for action prediction. However, TRAD shows an obvious advantage over this baseline. Therefore, we can conclude that TRAD benefits from the more relevant demonstrations and less irrelevant input context brought by thought retrieval.
Answer to RQ3: On the Mind2Web benchmark, TRAD achieves the most improvement over Synapse on the Cross-Domain subset, which requires the most generalization. Therefore, we can tell that the aligned decision method complements critical information for decision-making on unseen input.

4.4 Ablation Studies
We have verified the effectiveness of TRAD in two different scenarios, i.e., automatic householding and web navigation. Next, we examine the effect of each module in TRAD. Due to our limited budget for API usage, all ablation studies are conducted on the Mind2Web benchmark with GPT-3.5-turbo.

4.4.1 The Effect of Aligned Decision. First, we study the effect of the macro building blocks of TRAD. Since eliminating thought retrieval would disable aligned decision at the same time and break the framework fundamentally, we do not remove the thought retrieval module, but ablate each component of aligned decision, i.e., temporal expansion (TE), relative order mark (ROM), and history alignment (HA), and compare the corresponding performances. The results are shown in Tab. 3.

From Tab. 3, we observe that the performance without each component varies differently on the simplest Cross-Task subset and the two harder subsets:
• On the harder Cross-Website and Cross-Domain subsets, the elimination of any of the three modules in aligned decision results in a significant performance drop, and the effect of temporal expansion is the most significant. This is intuitive, since only retrieved steps are provided to the agent without TE, and thus the agent becomes more vulnerable to imperfect thoughts.
• On the simplest Cross-Task subset, however, history alignment and relative order mark are not that helpful and even cause a performance drop. As discussed earlier (Section 1 and Section 3.3), when the issue of plausible examples is not severe, reducing context and prompting with the most relevant demonstration becomes the dominant factor of the performance boost. Therefore, only temporal expansion remains beneficial for recovering from imperfect thoughts, while the other two components lead to sub-optimal performance.
Generally, the aligned decision method provides more information about the source trajectories of retrieved steps and the current trajectory, and helps especially in scenarios where generalization is essential. We can now summarize these observations and answer the fourth research question.
Answer to RQ4: Among the sub-processes in aligned decision, 1) temporal expansion provides tolerance for imperfect thoughts and improves the overall performance of TRAD consistently; 2) relative order mark and history alignment complement TRAD with temporal information about the trajectories of retrieved steps and the current trajectory, which serves as useful context for out-of-distribution decision-making but may become less useful for in-distribution decision-making.

4.4.2 The Effect of Expansion Steps 𝐵 and 𝐹. Next, we vary a critical hyper-parameter, the number of temporal expansion steps, and investigate how the overall performance changes accordingly. To avoid an expensive grid search over 𝐵 and 𝐹, we consider only one-side expansion by varying 𝐵 or 𝐹 from 0 to 4 with the other set to 0. The results over all 3 subsets are shown in Fig. 3.
From Fig. 3, we can make the following observations:
• Both forward expansion (𝐹 > 0) and backward expansion (𝐵 > 0) achieve improvement compared to no expansion (𝐹 = 𝐵 = 0). This justifies our design of aligned decision.
• Neither forward nor backward expansion benefits from further increasing an already large enough 𝐹 or 𝐵. This supports our hypothesis that irrelevant context too far from the current state is of little value and can even be noisy.
• Generally, forward expansion performs better than backward expansion when varying 𝐹 and 𝐵. The reason for this phenomenon might be that historical information has already been incorporated in thoughts, and thus future information helps more.
• TRAD achieves its best performance when 𝐹 = 2 and 𝐵 = 0, and consistently outperforms Synapse with forward expansion.

Figure 3: The effect of varying subsequent steps 𝐹 and previous steps 𝐵 on the Mind2Web benchmark. Panels: (a) varying 𝐹; (b) varying 𝐵. Solid lines correspond to the performance metrics (Ele. Acc, Step SR, SR) of TRAD given different 𝐹 and 𝐵, and the dashed lines correspond to the Synapse baseline. Forward expansion (𝐹 > 0) generally provides more improvement than backward expansion (𝐵 > 0) over no expansion (𝐹 = 𝐵 = 0) and the Synapse baseline. 𝐹 or 𝐵 does not help more when they are sufficiently large.

4.4.3 The Effect of Demonstration Amount 𝐾. Finally, we look into a common yet important hyper-parameter, the number of retrieved demonstrations 𝐾, and see how the advantage of TRAD over the baseline (Synapse) changes given different 𝐾 ∈ {1, 2, 3, 4, 5}. We show the results over all 3 subsets in Fig. 4. Note that the trajectory-wise prompting in Synapse frequently exceeds the context limit when 𝐾 = 5, and thus we omit this result.

Figure 4: The effect of varying the number of retrieved demonstrations 𝐾 on the Mind2Web benchmark. Solid lines correspond to the performance metrics (Ele. Acc, Step SR, SR) of TRAD given different 𝐾, and the dashed lines correspond to the Synapse baseline. 𝐾 has a mild effect on the performance of TRAD and Synapse, and the advantage of TRAD over Synapse remains stable when 𝐾 varies.

From Fig. 4, we observe that 𝐾 has a mild effect on the performance of TRAD and Synapse, and that the advantage of TRAD over Synapse consistently remains for all 𝐾 ∈ {1, 2, 3, 4}.
With the results in Section 4.4.2 and Section 4.4.3, we now respond to our last research question.
Answer to RQ5: The performance and advantage of TRAD generally remain stable under different hyper-parameter choices, i.e., the number of temporal expansion steps and the number of retrieved demonstrations. Its performance and advantage only degrade when using a long backward extension, which is possibly due to the fact that historical information has already been incorporated in thoughts and does not provide further help for decision-making.

4.5 Case Studies
At the end of this section, we present some representative trajectories or steps, from which we can intuitively learn the advantages of TRAD. We show two cases produced by Synapse and our TRAD agent on the cross-domain subset of Mind2Web in Fig. 5, to demonstrate: 1) the difference between task meta-data retrieval and thought retrieval; 2) the reason for retrieval rather than direct prediction with thoughts, and the tolerance for imperfect thoughts.

In Fig. 5a, the trajectory-wise retrieval of Synapse is obviously problematic: it only considers "search" in the task instructions, and the retrieved trajectories are completely irrelevant to the current one. However, when we use these irrelevant demonstrations for thought production and conduct thought retrieval afterwards, the retrieved demonstrations become much more relevant, as they all relate to baby (toddler) products and reflect the process of interacting with navigation links or buttons to unfold invisible web pages during web browsing. With the demonstrations from thought retrieval, TRAD is capable of making the correct decision.

In Fig. 5b, both Synapse and TRAD seem to retrieve relevant examples trying to find something in New York, but if we examine the trajectories retrieved by task meta-data, 2/3 of them fulfill the condition "New York" by clicking some link or button rather than typing in a text box. Unfortunately, the correct action under the current state is typing, not clicking, and thus Synapse fails to type the correct content. On the contrary, TRAD learns to type the correct content "New York" into the text box, even though its thought is incorrect. This also validates our hypothesis that using thoughts for retrieval instead of prediction helps to correct imperfect thoughts.

5 REAL-WORLD DEPLOYMENT OF TRAD
Since Dec. 2023, we have deployed our TRAD agent to automate some real-world office tasks in a mainstream insurance company, which owns a global business with approximately 170 million customers worldwide. We select 4 different websites and collect 100 expert trajectories for some representative tasks on each website as our memory. For evaluation, we collect 20 unseen tasks on each website, using step success rate (Step SR) and trajectory success rate (SR) as evaluation metrics. The tasks involve filling in insurance inquiry forms, implementing advanced information retrieval, etc. Since the websites are complex and contain thousands of web elements, prompting with complete trajectories is not feasible, hence we only consider single-step prompting with historical actions as auxiliary information.
Figure 5: Case studies on the cross-domain subset of Mind2Web, comparing the demonstrations retrieved via task meta-data (Synapse) with those retrieved via thoughts (TRAD), and the resulting predicted actions. (a) A search task where Synapse retrieves irrelevant shopping trajectories (bookdepository, target) and predicts a wrong click, while TRAD predicts the correct action. (b) A location-filtering task (new.mta.info, yellowpages) where Synapse types the wrong content, while TRAD types the correct content "New York" despite an imperfect thought.
To verify the effectiveness of TRAD, we use two different ReAct agents that the company has attempted as our baselines:
• ReAct-RD: randomly selects expert steps in random trajectories as demonstrations.
• ReAct-RV: randomly selects expert steps in relevant trajectories retrieved by task instruction as demonstrations.
To be specific, the difference between TRAD and ReAct-RV is the use of thought for a second, step-wise retrieval and the aligned decision module. To further investigate the effect of thought retrieval and aligned decision, we also deploy a TR agent which removes our aligned decision method (namely the TRAD w/o TE baseline); results are listed in Tab. 4.
Table 4: Evaluation results on real-world websites from a mainstream global business insurance company.

Method                              ReAct-RD  ReAct-RV  TR     TRAD (Ours)
Website 1 (form filling)  Step SR   0.843     0.826     0.941  0.950
                          SR        0.500     0.450     0.800  0.800
Website 2 (advanced IR)   Step SR   0.941     0.937     0.958  0.974
                          SR        0.900     0.850     0.850  0.900
Website 3 (advanced IR)   Step SR   0.962     0.987     1.000  1.000
                          SR        0.850     0.800     0.850  1.000
Website 4 (form filling)  Step SR   0.820     0.860     0.845  1.000
                          SR        0.350     0.350     0.400  1.000
Average                   Step SR   0.891     0.902     0.936  0.981
                          SR        0.650     0.613     0.725  0.925

As can be seen in Tab. 4, TRAD achieves the best performance on all 4 websites, showing that its advantage remains when deployed to real-world scenarios. Moreover, we observe that the TRAD w/o TE baseline also outperforms both ReAct agents, but exhibits noticeable disadvantages compared to the complete TRAD agent. This justifies our design of both thought retrieval and aligned decision.

Inference efficiency of TRAD. At inference time, our TRAD agent introduces only a little extra time consumption in thought retrieval compared to ReAct. We profile the inference process of TRAD and ReAct on all websites and tasks, and on average TRAD takes only 11.7% more time than ReAct-RD, which indicates that our method achieves improvement without much sacrifice in efficiency.

6 DISCUSSIONS
6.1 Limitations of TRAD
Although TRAD exhibits excellent performance over a diverse set of tasks, it still has limitations, such as the dependence on high-quality thoughts and the trade-off between information and noise in temporal expansion, which we briefly discuss here.

6.1.1 Dependence on high-quality thought. TRAD alleviates the issue of imperfect thoughts via its aligned decision module, but its capability still depends heavily on the quality of thoughts and the capability of the backbone LLM. To make such a step-wise retrieval-augmented method work well, the abstraction of the current state is critical since it serves as the query and key for retrieval, hence the LLM used to build a TRAD agent should at least have a decent understanding of the task.

6.1.2 Trade-off in temporal expansion. TRAD expects to keep relevant information but reduce irrelevant input context by step-wise thought retrieval, while preserving some chance for correcting imperfect thoughts by temporal expansion. There exists a trade-off: a longer temporal expansion brings not only more tolerance to imperfect thoughts, but also more irrelevant noise in demonstrations. This trade-off requires careful consideration for different tasks.

6.2 Future Directions
While ablation studies have been conducted to justify our design of TRAD, there are some promising ideas worth studying which can probably improve TRAD further. We leave them as future works and discuss them as follows.

6.2.1 Better Demonstrations For Reasoning. TRAD currently employs relevant trajectories or randomly chosen steps from them as demonstrations to generate thoughts, which still suffers from the issues discussed in Section 1 to some extent. Therefore, modifications can be made to generate thoughts of higher quality, and thus improve the overall performance of TRAD.

6.2.2 Better Representations For Retrieval. As we have discussed in Section 2.3, TRAD can utilize any other method to obtain a comprehensive abstraction of the current state in a sequential decision-making task, which can possibly serve as better queries and keys for the step-wise demonstration retrieval. Therefore, TRAD can be combined with more powerful LLM planning and reasoning methods, and even dense abstractions produced by LLMs pre-trained on domain-specific data like [8].

7 CONCLUSIONS
In this work, we propose a novel LLM agent augmented by step-wise demonstration retrieval (TRAD) for sequential decision-making tasks. TRAD first retrieves relevant step demonstrations by its thought about the current state, and then complements them with temporally correlated steps for more informative action prediction. Extensive experiments are conducted on two different sequential decision-making tasks to validate the effectiveness of our solution, and thorough ablation studies justify the design choices and stability of our method. We further present results from the real-world deployment of our method, showing its value in real-world applications.

ACKNOWLEDGMENTS
The Shanghai Jiao Tong University team is partially supported by Shanghai Municipal Science and Technology Major Project (2021SHZDZX0102) and National Natural Science Foundation of China (62322603, 62076161).

REFERENCES
[1] Constructions Aeronautiques, Adele Howe, Craig Knoblock, ISI Drew McDermott, Ashwin Ram, Manuela Veloso, Daniel Weld, David Wilkins SRI, Anthony Barrett, Dave Christianson, et al. 1998. PDDL — The Planning Domain Definition Language. Technical Report (1998).
[2] Maciej Besta, Nils Blach, Ales Kubicek, Robert Gerstenberger, Lukas Gianinazzi, Joanna Gajda, Tomasz Lehmann, Michal Podstawski, Hubert Niewiadomski, Piotr Nyczyk, and Torsten Hoefler. 2023. Graph of Thoughts: Solving Elaborate Problems with Large Language Models. arXiv preprint arXiv:2308.09687 (2023).
[3] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language Models are Few-Shot Learners. In Proceedings of the 34th Advances in Neural Information Processing Systems (NeurIPS).
[4] Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Samuel Stevens, Boshi Wang, Huan Sun, and Yu Su. 2023. Mind2Web: Towards a Generalist Agent for the Web. In Proceedings of the 37th Advances in Neural Information Processing Systems (NeurIPS).
[5] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv preprint arXiv:1810.04805 (2018).
[6] Ruomeng Ding, Chaoyun Zhang, Lu Wang, Yong Xu, Minghua Ma, Wei Zhang, Si Qin, Saravan Rajmohan, Qingwei Lin, and Dongmei Zhang. 2023. Everything of Thoughts: Defying the Law of Penrose Triangle for Thought Generation. arXiv preprint arXiv:2311.04254 (2023).
[7] Fernando Ferraretto, Thiago Laitz, Roberto Lotufo, and Rodrigo Nogueira. 2023. ExaRanker: Synthetic Explanations Improve Neural Rankers. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR). 2409–2414.
Piotr Nyczyk, and Torsten Hoefler. 2023. Graph of Thoughts: Solving Elaborate Problems with Large Language Models. arXiv preprint arXiv:2308.09687 (2023).
[3] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language Models are Few-Shot Learners. In Proceedings of the 34th Advances in Neural Information Processing Systems (NeurIPS).
[4] Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Samuel Stevens, Boshi Wang, Huan Sun, and Yu Su. 2023. Mind2Web: Towards a Generalist Agent for the Web. In Proceedings of the 37th Advances in Neural Information Processing Systems (NeurIPS).
[5] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv preprint arXiv:1810.04805 (2018).
[6] Ruomeng Ding, Chaoyun Zhang, Lu Wang, Yong Xu, Minghua Ma, Wei Zhang, Si Qin, Saravan Rajmohan, Qingwei Lin, and Dongmei Zhang. 2023. Everything of Thoughts: Defying the Law of Penrose Triangle for Thought Generation. arXiv preprint arXiv:2311.04254 (2023).
[7] Fernando Ferraretto, Thiago Laitz, Roberto Lotufo, and Rodrigo Nogueira. 2023. ExaRanker: Synthetic Explanations Improve Neural Rankers. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR). 2409–2414.
[8] Izzeddin Gur, Hiroki Furuta, Austin Huang, Mustafa Safdari, Yutaka Matsuo, Douglas Eck, and Aleksandra Faust. 2024. A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis. In Proceedings of the 12th International Conference on Learning Representations (ICLR).
[9] Izzeddin Gur, Ofir Nachum, Yingjie Miao, Mustafa Safdari, Austin Huang, Aakanksha Chowdhery, Sharan Narang, Noah Fiedel, and Aleksandra Faust. 2023. Understanding HTML with Large Language Models. In Findings of the Association for Computational Linguistics (EMNLP). 2803–2821.
[10] Shibo Hao, Yi Gu, Haodi Ma, Joshua Jiahua Hong, Zhen Wang, Daisy Zhe Wang, and Zhiting Hu. 2023. Reasoning with Language Model is Planning with World Model. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP). 8154–8173.
[11] Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. 2020. The Curious Case of Neural Text Degeneration. In Proceedings of the 8th International Conference on Learning Representations (ICLR).
[12] Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick S. H. Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. 2020. Dense Passage Retrieval for Open-Domain Question Answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). 6769–6781.
[13] Geunwoo Kim, Pierre Baldi, and Stephen McAleer. 2023. Language Models can Solve Computer Tasks. In Proceedings of the 37th Advances in Neural Information Processing Systems (NeurIPS).
[14] Jacky Liang, Wenlong Huang, Fei Xia, Peng Xu, Karol Hausman, Brian Ichter, Pete Florence, and Andy Zeng. 2023. Code as Policies: Language Model Programs for Embodied Control. In Proceedings of the 2023 IEEE International Conference on Robotics and Automation (ICRA). 9493–9500.
[15] Bo Liu, Yuqian Jiang, Xiaohan Zhang, Qiang Liu, Shiqi Zhang, Joydeep Biswas, and Peter Stone. 2023. LLM+P: Empowering Large Language Models with Optimal Planning Proficiency. arXiv preprint arXiv:2304.11477 (2023).
[16] Jiachang Liu, Dinghan Shen, Yizhe Zhang, Bill Dolan, Lawrence Carin, and Weizhu Chen. 2021. What Makes Good In-Context Examples for GPT-3? arXiv preprint arXiv:2101.06804 (2021).
[17] Iain Mackie, Shubham Chatterjee, and Jeffrey Dalton. 2023. Generative Relevance Feedback with Large Language Models. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR). 2026–2031.
[18] Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, et al. 2021. WebGPT: Browser-Assisted Question-Answering with Human Feedback. arXiv preprint arXiv:2112.09332 (2021).
[19] OpenAI. 2023. GPT-4 Technical Report. arXiv preprint arXiv:2303.08774 (2023).
[20] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training Language Models to Follow Instructions with Human Feedback. In Proceedings of the 36th Advances in Neural Information Processing Systems (NeurIPS). 27730–27744.
[21] Joon Sung Park, Joseph O'Brien, Carrie Jun Cai, Meredith Ringel Morris, Percy Liang, and Michael S. Bernstein. 2023. Generative Agents: Interactive Simulacra of Human Behavior. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology (UIST). 1–22.
[22] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language Models are Unsupervised Multitask Learners. OpenAI Blog (2019).
[23] Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 3980–3990.
[24] Baptiste Roziere, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, et al. 2023. Code Llama: Open Foundation Models for Code. arXiv preprint arXiv:2308.12950 (2023).
[25] Ohad Rubin, Jonathan Herzig, and Jonathan Berant. 2022. Learning To Retrieve Prompts for In-Context Learning. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT). 2655–2671.
[26] Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. 2023. Toolformer: Language Models Can Teach Themselves to Use Tools. In Proceedings of the 37th Advances in Neural Information Processing Systems (NeurIPS).
[27] Tianlin Shi, Andrej Karpathy, Linxi Fan, Jonathan Hernandez, and Percy Liang. 2017. World of Bits: An Open-Domain Platform for Web-Based Agents. In Proceedings of the 34th International Conference on Machine Learning (ICML), Vol. 70. 3135–3144.
[28] Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik R. Narasimhan, and Shunyu Yao. 2023. Reflexion: Language Agents with Verbal Reinforcement Learning. In Proceedings of the 37th Advances in Neural Information Processing Systems (NeurIPS).
[29] Mohit Shridhar, Jesse Thomason, Daniel Gordon, Yonatan Bisk, Winson Han, Roozbeh Mottaghi, Luke Zettlemoyer, and Dieter Fox. 2020. ALFRED: A Benchmark for Interpreting Grounded Instructions for Everyday Tasks. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 10737–10746.
[30] Mohit Shridhar, Xingdi Yuan, Marc-Alexandre Côté, Yonatan Bisk, Adam Trischler, and Matthew J. Hausknecht. 2021. ALFWorld: Aligning Text and Embodied Environments for Interactive Learning. In Proceedings of the 9th International Conference on Learning Representations (ICLR).
[31] The LongChat Team. 2023. How Long Can Open-Source LLMs Truly Promise on Context Length? https://lmsys.org/blog/2023-06-29-longchat/
[32] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. 2023. LLaMA: Open and Efficient Foundation Language Models. arXiv preprint arXiv:2302.13971 (2023).
[33] Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. 2023. Interleaving Retrieval with Chain-of-Thought Reasoning for Knowledge-Intensive Multi-Step Questions. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL). 10014–10037.
[34] Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. 2023. Voyager: An Open-Ended Embodied Agent with Large Language Models. arXiv preprint arXiv:2305.16291 (2023).
[35] Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai Lin, et al. 2023. A Survey on Large Language Model Based Autonomous Agents. arXiv preprint arXiv:2308.11432 (2023).
[36] Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V. Le, Ed H. Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. 2023. Self-Consistency Improves Chain of Thought Reasoning in Language Models. In Proceedings of the 11th International Conference on Learning Representations (ICLR).
[37] Zihao Wang, Shaofei Cai, Anji Liu, Xiaojian Ma, and Yitao Liang. 2023. Describe, Explain, Plan and Select: Interactive Planning with Large Language Models Enables Open-World Multi-Task Agents. In Proceedings of the 37th Advances in Neural Information Processing Systems (NeurIPS).
[38] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le, and Denny Zhou. 2022. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. In Proceedings of the 36th Advances in Neural Information Processing Systems (NeurIPS).
[39] Zhiyong Wu, Yaoxiang Wang, Jiacheng Ye, and Lingpeng Kong. 2023. Self-Adaptive In-Context Learning: An Information Compression Perspective for In-Context Example Selection and Ordering. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL). 1423–1436.
[40] Shunyu Yao, Howard Chen, John Yang, and Karthik Narasimhan. 2022. WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents. In Proceedings of the 36th Conference on Neural Information Processing Systems (NeurIPS).
[41] Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, and Karthik Narasimhan. 2023. Tree of Thoughts: Deliberate Problem Solving with Large Language Models. In Proceedings of the 37th Conference on Neural Information Processing Systems (NeurIPS).
[42] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R. Narasimhan, and Yuan Cao. 2023. ReAct: Synergizing Reasoning and Acting in Language Models. In Proceedings of the 11th International Conference on Learning Representations (ICLR).
[43] Yunhu Ye, Binyuan Hui, Min Yang, Binhua Li, Fei Huang, and Yongbin Li. 2023. Large Language Models are Versatile Decomposers: Decomposing Evidence and Questions for Table-based Reasoning. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR). 174–184.
[44] Yiming Zhang, Shi Feng, and Chenhao Tan. 2022. Active Example Selection for In-Context Learning. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP). 9134–9148.
[45] Huaixiu Steven Zheng, Swaroop Mishra, Xinyun Chen, Heng-Tze Cheng, Ed H. Chi, Quoc V. Le, and Denny Zhou. 2024. Step-Back Prompting Enables Reasoning Via Abstraction in Large Language Models. In Proceedings of the 12th International Conference on Learning Representations (ICLR).
[46] Longtao Zheng, Rundong Wang, Xinrun Wang, and Bo An. 2024. Synapse: Trajectory-as-Exemplar Prompting with Memory for Computer Control. In Proceedings of the 12th International Conference on Learning Representations (ICLR).
[47] Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Claire Cui, Olivier Bousquet, Quoc V. Le, and Ed H. Chi. 2023. Least-to-Most Prompting Enables Complex Reasoning in Large Language Models. In Proceedings of the 11th International Conference on Learning Representations (ICLR).
[48] Yutao Zhu, Huaying Yuan, Shuting Wang, Jiongnan Liu, Wenhan Liu, Chenlong Deng, Zhicheng Dou, and Ji-Rong Wen. 2023. Large Language Models for Information Retrieval: A Survey. arXiv preprint arXiv:2308.07107 (2023).
A PROMPT LIBRARY
A.1 Prompts on ALFWorld
ALFWorld includes six different types of tasks, and we only present the prompt for the Put task here.
A.1.1 Thought preparation. We write thoughts for the same demonstrations ($Demo 1 and $Demo 2) as the first two in ReAct [42] and use them for thought preparation.
You are an agent to interact with a household to solve a task. You will be given a task where you need to put an (two) object(s) to a target either directly or after an operation. Each time you first think about your current situation, then output an action, and wait for next observation.
Here is your action space:
* go to target: Move to the target, and you will observe what is in/on the target or know it is closed or opened.
* open target: Open the target when it is closed, and you will observe what is in/on the target. Only cabinets, drawers, fridges, safes, and microwaves can be opened.
* take object from target: Take the object from the target when the object is in/on the target. You can take only one object at the same time.
* put object in/on target: Put an object you have taken/picked up in/on the target. You should go to the target in your last action. You can put no matter there are other objects in/on the target or not.
* clean object with target: Clean an object you have taken/picked up with the target. The target should be a sinkbasin. You should go to the target in your last action. You can clean no matter there are other objects in/on the target or not.
* heat object with target: Heat an object you have taken/picked up with the target. The target should be a microwave. You should go to the target in your last action. You can heat no matter there are other objects in/on the target or not.
* cool object with target: Cool an object you have taken/picked up with the target. The target should be a fridge. You should go to the target in your last action. You can cool no matter there are other objects in/on the target or not.
* use object: Use the object. The object should be a desklamp. You should be in/on a place where the object appears.
* look: Look around and see what you are facing. Only look when nothing happens.
Here are two examples.
$Demo 1
$Demo 2
None
Objects I have taken:
None
To solve the task, I need to find and take a soapbottle, then put it on the toilet. First, I need to find a soapbottle. A soapbottle is more likely to appear in cabinets (1-4), countertop (1), or sinkbasin (1-2). I will start by checking the garbagecan 1, although it's less likely to find a soapbottle there, it's still possible.
> act: go to garbagecan 1
On the garbagecan 1, you see a soapbottle 1.
> think: I am now in/on: garbagecan 1
Critical objects I have found:
soapbottle 1 (garbagecan 1)
Objects I have taken:
None
I have found a soapbottle (1) in the garbagecan 1. Next, I need to take it.
> act: take soapbottle 1 from garbagecan 1
You pick up the soapbottle 1 from the garbagecan 1.
> think: I am now in/on: garbagecan 1
Critical objects I have found:
soapbottle 1 (garbagecan 1)
Objects I have taken:
soapbottle 1
Now I have taken the soapbottle (1). Next, I need to go to and put it on the toilet 1.
> act: go to toilet 1
On the toilet 1, you see a cloth 1.
> think: I am now in/on: toilet 1
Critical objects I have found:
soapbottle 1 (garbagecan 1)
Objects I have taken:
soapbottle 1
Now I am at the toilet 1. Next, I need to put the soapbottle 1 on it.
> act: put soapbottle 1 in/on toilet 1
A.1.3 Action prediction. For action prediction we use almost the same prompt as for thought generation. For TRAD we only use 𝐵 + 1 + 𝐹 steps in each demonstration, and we add the following sentence to tell the LLM the meaning of the relative order mark:
The mark [Step $i] in expert examples indicates a coarse relative position of expert demonstration steps to your situation. For example, [Step -1] means the last step, [Step 0] means the current step, and [Step 1] means the next step.
The inputs are presented in the same format as the demonstrations, without human-written reasons.
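As a rough illustration of how these pieces could be put together, the sketch below joins several expanded demonstrations (each a window of 𝐵 + 1 + 𝐹 marked steps, as in the expansion sketch above) with the current input into one action-prediction prompt. build_action_prompt and its arguments are our own names for illustration, not those of the released code.

def build_action_prompt(system_text, expanded_demos, current_input):
    """Assemble the action-prediction prompt from K expanded demonstrations.

    expanded_demos: list of lists of "[Step i] ..." strings (one list per demo),
    e.g. produced by a B + 1 + F window around each retrieved step.
    current_input: the current task, observation, and previous actions, formatted
    like a demonstration step but without a ground-truth action.
    """
    blocks = ["\n".join(steps) for steps in expanded_demos]
    return system_text + "\n\n" + "\n\n".join(blocks) + "\n\n" + current_input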
A.2.2 Thought generation.
You are a large language model trained to navigate the web. You will be given a task, an observation, and your previous actions. Each time you should output the next action and wait for the next observation. Here is the action space:
1. `CLICK [id]`: Click on an HTML element with its id.
2. `TYPE [id] [value]`: Type a string into the element with the id.
A.2.3 Action prediction. We use the same sentence as in ALFWorld to tell the LLM about the relative order mark.
You are a large language model trained to navigate the web. You will be given a task, an observation, and your previous actions. Each time you should output the next action and wait for the next observation. Here is the action space:
1. `CLICK [id]`: Click on an HTML element with its id.
2. `TYPE [id] [value]`: Type a string into the element with the id.
3. `SELECT [id] [value]`: Select a value for an HTML element by its id.
Now you are given some expert demonstrations, follow these demonstrations and make your decision.
The mark [Step $i] indicates a coarse relative position of expert demonstration steps to your situation. For example, [Step -1] means the last step, [Step 0] means the current step, and [Step 1] means the next step.
Note that you should take all previous actions into reasoning. In your output, the action should be quoted by a pair of '`'.
$Demo 1
$Demo 2
$Demo 3
$Input
The inputs are presented in the same format as the demonstrations, except that they have no ground-truth actions.
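Because the prompt constrains the output to a backtick-quoted CLICK/TYPE/SELECT command, the model's response can be parsed with a small pattern like the one sketched below; the regular expression and function name are illustrative assumptions rather than the actual post-processing used.

import re

# Hypothetical parser for the backtick-quoted actions the Mind2Web prompts ask for,
# e.g. "I will type into the search box. `TYPE [1204] [running shoes]`".
ACTION_RE = re.compile(r"`\s*(CLICK|TYPE|SELECT)\s*\[(\w+)\]\s*(?:\[([^\]]*)\])?\s*`")

def parse_action(llm_output: str):
    """Return (operation, element_id, value) from the model's response, or None."""
    match = ACTION_RE.search(llm_output)
    if match is None:
        return None
    op, element_id, value = match.groups()
    return op, element_id, value  # value is None for CLICK

print(parse_action("Reasoning... `SELECT [42] [Economy]`"))  # ('SELECT', '42', 'Economy')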
[Figure 6 plots: Ele. Acc, Step SR, and SR curves for TRAD and Synapse; see the caption below.]
Figure 6: The effect of varying subsequent steps 𝐹 and previous steps 𝐵 on the Mind2Web benchmark. Solid lines correspond to the performance metrics of TRAD given different 𝐹 and 𝐵, and the dashed lines correspond to the Synapse baseline. Forward expansion (𝐹 > 0) generally provides more improvement than backward expansion (𝐵 > 0) over no expansion (𝐹 = 𝐵 = 0) and the Synapse baseline. Neither 𝐹 nor 𝐵 helps further once it is sufficiently large.
[Figure 7 plots: Ele. Acc, Step SR, and SR curves for TRAD and Synapse versus the number of retrieved demonstrations K (1-5), on the (a) Cross-Task, (b) Cross-Website, (c) Cross-Domain, and (d) All splits; see the caption below.]
Figure 7: The effect of varying the number of retrieved demonstrations 𝐾 on the Mind2Web benchmark. Solid lines correspond to the performance metrics of TRAD given different 𝐾, and the dashed lines correspond to the Synapse baseline. 𝐾 has a mild effect on the performance of TRAD and Synapse, and the advantage of TRAD over Synapse remains stable as 𝐾 varies.