
Multi-Mission Tool Bench: Assessing the Robustness of LLM-based Agents through Related and Dynamic Missions

Peijie Yu1*†, Yifan Yang1*†, Jinjian Li1*, Zelong Zhang1, Haorui Wang1, Xiao Feng1, Feng Zhang1
1 Tencent HunYuan
Correspondence: {peijieyu, ioanyang}@tencent.com

Abstract

Large language models (LLMs) demonstrate strong potential as agents for tool invocation due to their advanced comprehension and planning capabilities. Users increasingly rely on LLM-based agents to solve complex missions through iterative interactions. However, existing benchmarks predominantly assess agents in single-mission scenarios, failing to capture real-world complexity. To bridge this gap, we propose the Multi-Mission Tool Bench. In the benchmark, each test case comprises multiple interrelated missions. This design requires agents to dynamically adapt to evolving demands. Moreover, the proposed benchmark explores all possible mission-switching patterns within a fixed mission number. Specifically, we propose a multi-agent data generation framework to construct the benchmark. We also propose a novel method to evaluate the accuracy and efficiency of agent decisions with dynamic decision trees. Experiments on diverse open-source and closed-source LLMs reveal critical factors influencing agent robustness and provide actionable insights to the tool invocation community.[1]

1 Introduction

In recent years, large language models (LLMs) have achieved significant progress in natural language processing. These models demonstrate strong capabilities to understand contextual information and user instructions, making them effective agents for mission completion.

Real-world applications require agents to handle dynamic user demands. As users frequently adjust their requests during conversations (Figure 1), agents must complete sequential missions with evolving requirements. This situation challenges the robustness of an agent's decision-making. However, existing benchmarks focus primarily on single-mission scenarios.

This paper presents the Multi-Mission Tool Bench, a benchmark that evaluates agent robustness in related and dynamic multi-mission scenarios. The benchmark addresses three core challenges: 1) it contains more mission types than other benchmarks, i.e., four major categories and six subcategories; 2) it includes all mission-type transition patterns for a prefixed mission number; 3) all successive missions have strong relations with prior dialogues, so agents are forced to extract information from previous missions. Therefore, it closely mirrors the complexity of the real world.

To simulate all mission-type switching patterns, we first define the mission types by their corresponding agent action-types. Agent actions are divided into four main types: using a single tool, using multiple tools, chatting with users, and using tools after clarifying parameters. An agent accomplishes a single mission by performing one of these actions. Therefore, we define four types of missions. For sequential missions, agents combine multiple action-types to reach the objectives. Figure 2 a) displays that the agent employs a combination of four action-types to complete the four sequential missions in Figure 1. Thus, we introduce the mission switching space to describe the transformations of mission types. Figure 2 b) shows that our benchmark thoroughly explores the proposed space with a prefixed mission number. This indicates that our benchmark includes all mission-type transition patterns. In contrast, other benchmarks have a more limited range of action diversity.

To construct the multi-mission benchmark, we propose a controllable data generation framework with multiple characters. The framework simulates the mission execution process through dialogic interactions among five agents: user, planner, tool, AI, and checker.

* Equal contributions.
† Corresponding authors.
[1] Available at https://github.com/yupeijei1997/MMTB.

Figure 1: A multi-mission example. It contains four related missions, and the mission types change dynamically. This figure presents the conversation between a user and an AI; the intermediate dialogues are hidden.

In each generation process, we assign the desirable mission type and mission relationship to guide the framework. Ultimately, our benchmark encompasses all potential combinations in the mission switching space for a set number of missions. Notably, a complete mission involves multiple rounds of dialogue.

To evaluate the proposed benchmark, we introduce a novel evaluation method. It assesses the accuracy and efficiency of agent decisions by employing dynamic decision trees.

Eventually, we evaluate a range of open-source and closed-source LLMs, encompassing both specialized and general LLMs. Our comprehensive experiments reveal numerous factors influencing the robustness of agent decision-making. These findings offer valuable insights for guiding future research on the development of LLM agents.

The main contributions of this paper are:

• To the best of our knowledge, this is the first benchmark that assesses agent robustness in related and dynamic multi-mission scenarios.

• We introduce a controllable multi-role data generation framework to explore the action-type space in multi-mission contexts.

• A novel testing method is proposed to evaluate the accuracy and efficiency of dynamic path planning.

• Comprehensive testing of open-source and closed-source LLMs is conducted, revealing various factors that affect the robustness of agent decision-making.

2 Related Work

2.1 Evaluation of LLMs

Recent benchmarks evaluate the capabilities of LLM-based agents from various points of view. Some research evaluates the generalizability of agents in various scenarios (Li et al., 2024; Trivedi et al., 2024; Liu et al., 2024c). Others (Du et al., 2024; Qin et al., 2024; Ye et al., 2024; Li et al., 2023) collect massive tool sets to investigate the impact of tool diversity on agent performance. Certain research (Zhuang et al., 2023; Guo et al., 2024; Xie et al., 2024) examines agents within specific domains. While some works (Shen et al., 2024b; Chen et al., 2024; Huang et al., 2024a) provide a comprehensive assessment of multiple agent abilities, others (Huang et al., 2024b; Tang et al., 2023; Qiao et al., 2024a) address specific issues such as the illusion (hallucination) problem (Patil et al., 2023) and multistep execution capabilities (Shen et al., 2024a; Yao et al., 2024).

Figure 2: Visualization of the mission switching space. a) Four distinct colors represent four different action-types. The green dot indicates that the agent sequentially selects four types of actions to execute four missions. b) The distribution of the proposed benchmark within the mission switching space. Each row corresponds to a different number of missions. Each dot indicates a specific combination of the current and preceding action-types. Colored dots indicate combinations included in the benchmark, while gray dots indicate their absence. c) Distribution of four other agent benchmarks in the space.

Our benchmark assesses agents' overall capabilities, emphasizing the challenges of related and dynamic multi-missions. Importantly, the multistep tasks discussed in previous studies align with our approach of employing multiple tools to complete a single mission.

The work most similar to ours is BFCL V3 (Charlie Cheng-Jie Ji, a). It also involves four types of agent actions and various user missions in one test case. However, BFCL V3 only covers a small part of the mission switching space. In contrast, our work simulates all possible mission transitions within a predefined set of missions. Moreover, in most test data of BFCL V3, missions have no information dependencies: agents can complete any given mission autonomously without relying on information from previous dialogues. In our case, all data contain related missions.

Other studies, WorfBench and TaskBench (Qiao et al., 2024a; Shen et al., 2024b), also introduce a graph-based evaluation method for multi-tool invocation. However, they only compute the similarity between the agent's planned path and the annotation through graph matching; they cannot explicitly determine its correctness or calculate the optimality of the agent's plan, as our work does.

Table 1 compares the mentioned benchmarks with our proposed one in various aspects.

2.2 LLM-as-Agent

User mission automation is a significant research area for LLMs. General LLMs with larger scale (Achiam et al., 2023; Sun et al., 2024; Yang et al., 2024; Team et al., 2024; GLM et al., 2024; Srivastava and Dames, 2024) primarily integrate it within a multi-task learning process, while there are also many agents based on smaller specialized LLMs.

We categorize agent research into various approaches. Some studies (Xu et al., 2024; Qiao et al., 2024b; Zhang et al., 2024b) equip agents with self-reflection and self-correction capabilities to improve their understanding of environmental feedback. Others (Zhang et al., 2024a; Han et al., 2024; Islam et al., 2024) introduce heuristic decision frameworks to solve complex problems. Further research (Shi et al., 2024; Schick et al., 2023; Liu et al., 2024b) focuses on strengthening agents' core skills. Concurrently, some works (meetkai; Lin et al., 2024; Liu et al., 2024b) generate more diverse training data with proposed frameworks. Our study also introduces a novel data generation framework. Unlike previous works, our framework uniquely specifies the desired agent action-types.

The proposed benchmark simulates real-world application scenarios, evaluates the core abilities of agents, and tests various LLMs.

(The six action-type columns form the "Rate of Mission-Types" group.)

Benchmark | MutMiss* | MSSS4 (%) | RelMiss† (%) | A_single | A_chat | A_clarity | A_multi^S | A_multi^P | A_multi^S+P
Ours | ✓ | 100 | 100 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓
BFCL v3 (Charlie Cheng-Jie Ji, a) | ✓ | 15.7 | 39.7 | ✓ | ✓ | ✓ | ✗ | ✓ | ✗
BFCL v1 (Patil et al., 2023) | ✗ | 0.0 | 0.9 | ✓ | ✓ | ✗ | ✗ | ✓ | ✗
BFCL v2 (Charlie Cheng-Jie Ji, b) | ✗ | 0.0 | 0.9 | ✓ | ✓ | ✗ | ✗ | ✓ | ✗
ToolBench (Qin et al., 2024) | ✗ | 0.0 | 0.0 | ✓ | ✗ | ✗ | ✓ | ✗ | ✗
AnyToolBench (Du et al., 2024) | ✗ | 0.0 | 0.0 | ✓ | ✗ | ✗ | ✓ | ✗ | ✗
τ-bench (Yao et al., 2024) | ✗ | 0.0 | 0.0 | ✓ | ✗ | ✗ | ✓ | ✗ | ✗
T-EVAL (Chen et al., 2024) | ✗ | 0.0 | 0.0 | ✓ | ✗ | ✗ | ✓ | ✗ | ✗
UltraTool (Huang et al., 2024a) | ✗ | 0.0 | 0.0 | ✓ | ✗ | ✗ | ✓ | ✗ | ✗

Table 1: Comparative analysis of the Multi-Mission Tool Bench against other benchmarks in the field. The symbol '*' indicates Multi-Mission, while '†' denotes Related Missions. Moreover, in the four-mission action-type space, the Mission Switching Space Scale (MSSS4) represents the proportion of combination coverage for each dataset relative to all possible combinations.

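A small sketch of how a coverage score like MSSS could be computed from this definition is shown below. It assumes the space contains every action-type combination for one up to four missions, as drawn in Figure 2 b); the exact definition used by the authors may differ, and covered_combinations is a placeholder for whatever combinations a benchmark actually contains.

```python
from itertools import product

ACTION_TYPES = ["A_single", "A_multi", "A_chat", "A_clarify"]

def msss(covered_combinations, n_missions=4):
    """Share of all action-type combinations (assumed here to span 1..n_missions
    missions) that a benchmark covers, in the spirit of the Table 1 caption."""
    space = [combo
             for n in range(1, n_missions + 1)
             for combo in product(ACTION_TYPES, repeat=n)]
    covered = set(covered_combinations) & set(space)
    return 100.0 * len(covered) / len(space)

# e.g. a benchmark containing only single-tool, single-mission cases:
print(msss([("A_single",)]))   # ~0.29 (1 of 340 combinations under this assumption)
```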

Figure 3: The multi-agent framework.

3 Terminologies

We use agent action-type to describe the mission-type switching patterns. In this section, we introduce the concepts of agent action-type and mission switching space.

As stated above, agents use four types of action to accomplish user missions: invoking a single tool, invoking multiple tools, chatting with the user, and invoking tools after clarifying their parameters. We denote these action-types as A_single, A_multi, A_chat, and A_clarify respectively. As inter-tool dependencies cause diverse execution sequences, we further divide A_multi into the following categories: serial execution, parallel execution, and a combination of both, represented as A_multi^S, A_multi^P, and A_multi^S+P.

Furthermore, we define the concept of mission switching space to describe the combination of action-types corresponding to serially related missions, labeled A^N = {A_0, A_1, ..., A_N}. Here, N is the total number of missions and A_i is the action-type corresponding to the i-th mission.

4 Benchmark Construction

To construct multi-mission test data and thoroughly explore the mission switching space, we propose a novel data generation framework. In this section, we explain the proposed framework and how we construct the benchmark. Subsection 4.1 presents the five roles in the framework and their interaction mechanism. Subsection 4.2 describes how these roles complete a mission, including specifying mission-types and setting up dependencies between earlier and later missions. In Subsection 4.3, we expand the scope from generating a single mission to creating test data with multiple related missions, and then thoroughly explore the mission switching space to construct the entire benchmark. Furthermore, Appendixes A and B present the method for collecting tools and the distribution of the test set.

4.1 Data Generation Framework

We employ five agents to generate multi-mission test data. We simulate this process with a single LLM. For each dialogue, we assign different roles and specific tasks to the LLM, denoted R. We define five roles: User, Planner, AI, Checker, and Tool, represented as R_user, R_planner, R_AI, R_checker, and R_tool respectively. The Planner is the key role for analyzing the mission, planning tool invocation paths, and deciding action-types. Figure 3 shows the interaction among these five roles.

In this framework, only R_AI communicates with R_user, and R_planner gets instructions from R_user. When R_planner starts A_single or A_multi, R_tool simulates tool feedback. For A_clarify or A_chat, R_AI asks about tool parameters or summarizes responses. R_checker checks the format and sequence of R_planner's plans, ensuring accurate planning. Note that R_checker is only involved in data generation. Moreover, R_user has different tasks at different stages.

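The following is a minimal sketch of how a single LLM could be driven to play the five roles described above. The role names follow the paper, but the llm() stub, the prompt strings, and the control flow are illustrative assumptions rather than the authors' released implementation.

```python
# Illustrative only: one LLM plays every role by receiving a role-specific
# system prompt (see Appendix E) plus the shared dialogue history.
ROLE_PROMPTS = {
    "user":    "You act as the user and propose or answer missions ...",
    "planner": "You analyze the mission, plan tool calls, and pick an action-type ...",
    "tool":    "You simulate the feedback of the invoked tools ...",
    "ai":      "You ask for missing parameters or summarize tool results ...",
    "checker": "You check the format and order of the Planner's plan ...",
}

def llm(system_prompt: str, history: list[str]) -> str:
    """Stand-in for a real LLM call; replace with an actual API client."""
    return f"[{system_prompt[:20]}...] placeholder reply"

def run_one_mission(history: list[str]) -> list[str]:
    """One simplified generation round: user -> planner -> checker -> tool/ai."""
    history.append("user: " + llm(ROLE_PROMPTS["user"], history))
    plan = llm(ROLE_PROMPTS["planner"], history)
    history.append("planner: " + plan)
    history.append("checker: " + llm(ROLE_PROMPTS["checker"], history))
    # Depending on the decided action-type, either tool feedback is simulated
    # (A_single / A_multi) or the AI clarifies or summarizes (A_clarify / A_chat).
    history.append("tool: " + llm(ROLE_PROMPTS["tool"], history))
    history.append("ai: " + llm(ROLE_PROMPTS["ai"], history))
    return history
```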

R_user^Q responds to generate a new mission, while R_user^A responds to answer the questions of R_AI.

We provide the prompts for the roles mentioned above in Appendix E.

4.2 Generate Single Mission

We first introduce how to construct a single mission using the proposed multi-agent framework.

In the generation process, we first generate user missions. When generating user missions, we first sample a tool list for the missions. To achieve a desirable mission type, we insert the predefined action-type A_i into the role prompt of R_user^Q.

To generate related missions, we generate several candidate missions and then employ expert refinement to obtain the final successive mission. We categorize mission relationships into three types: implicit understanding, ellipsis, and long-term memory, and insert the relationship types into R_user^Q to generate three candidate missions. The prompt of R_user^Q also contains the previous user-AI dialogues. Finally, we manually select and refine the candidate missions to achieve the final one.

With the user missions, we use the five roles mentioned above to complete the entire execution.

4.3 Construct the Whole Benchmark

In Subsection 4.2, we obtain the ability to generate a specific type of mission and create related missions. Subsequently, we apply this ability to construct the benchmark. The benchmark aims to fully demonstrate the diversity of mission switching in the test data. To achieve this goal, we employ the proposed method to explore the entire mission switching space for a prefixed mission number.

First, we identify all combinations of action-types for the given number of missions, represented as A^N = {A_1^1, A_1^2, ..., A_N^(4^N)}. Here, A_i^j indicates the j-th combination for i missions. For i missions, there exist 4^i combinations.

Subsequently, we generate test data independently for each action-type combination. If the action-type combination contains N elements, we use the aforementioned generation framework N times to construct the test data. It is important to note that the generation results from both R_tool and R_AI are crucial information provided to the agents during our testing process.

Figure 4: The dependencies among tools.

5 Dynamic Evaluation Method

The dependencies among tools lead to multiple possible execution sequences. This challenge becomes more pronounced in multi-mission scenarios. To address this, we propose a novel evaluation framework. This framework accurately verifies the correctness and optimality of agent actions. The method follows three steps: dependency analysis, decision tree construction, and path validation.

First, we manually identify tool dependencies. We then implement a topological sorting algorithm with depth-first search to generate all possible execution paths. Unlike previous methods (Qiao et al., 2024a; Shen et al., 2024b) that produce limited suboptimal paths, our algorithm constructs complete optimal and suboptimal sequences.

During agent testing, we perform incremental path matching against the decision tree. Each agent action triggers either 1) path termination for mismatched actions, or 2) subtree pruning for valid actions, narrowing subsequent options.

To illustrate the process clearly, take a simplified toy example. Consider a user aiming to create a PowerPoint presentation about the year's most popular movie. This task requires four tools: Tool 0 for creating the presentation, Tool 1 for retrieving the popular movie ranking, Tool 2 for gathering detailed movie information, and Tool 3 for transforming this information into slides, labeled as [0], [1], [2], and [3] respectively.

Analysis shows that [2] needs parameters from [1], and [3] depends on parameters from [0] and [2]. Figure 4 shows this dependency. Figure 5 a) shows the initial decision tree based on tool dependencies. Here, [0, 1] means tools [0] and [1] are called in parallel. This tree reveals five candidate paths to complete the task with three to four tool calls.

When the agent calls Tool [1] in the first step, we check whether this action is among the first-step candidate actions and then prune the sub-decision trees related to operations [0] and [0,1], obtaining an updated decision tree as in Figure 5 b). In the second step, when the agent calls Tool [0], we confirm the action's correctness and prune the sub-decision trees for candidate actions [2] and [0,2] in the second layer, as in Figure 5 c). At this point, only one candidate path remains, and we verify its correctness by sequential path matching.

Figure 5: Visualization of the dynamic decision tree during evaluation.

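As a concrete illustration of the incremental matching described above, the short Python sketch below hard-codes the five candidate paths of the toy example (Figure 5 a) and prunes them step by step as agent actions arrive. It is a simplified illustration under those assumptions, not the benchmark's released evaluation code.

```python
# Candidate paths for the toy example of Figure 5 a): each path is a list of
# steps, and each step is the set of tools called in parallel at that step.
CANDIDATE_PATHS = [
    [{0, 1}, {2}, {3}],          # optimal (3 steps)
    [{1}, {0, 2}, {3}],          # optimal (3 steps)
    [{0}, {1}, {2}, {3}],        # suboptimal (4 steps)
    [{1}, {0}, {2}, {3}],        # suboptimal (4 steps)
    [{1}, {2}, {0}, {3}],        # suboptimal (4 steps)
]

def evaluate(agent_steps, candidates=CANDIDATE_PATHS):
    """Match the agent's steps against the candidate paths, pruning the
    candidates that disagree at each step (the 'dynamic decision tree')."""
    remaining = list(candidates)
    for i, step in enumerate(agent_steps):
        remaining = [p for p in remaining if i < len(p) and p[i] == set(step)]
        if not remaining:                       # path termination
            return {"success": False, "optimal": False}
    # success only if the agent completed one whole candidate path
    success = any(len(p) == len(agent_steps) for p in remaining)
    min_len = min(len(p) for p in candidates)
    return {"success": success,
            "optimal": success and len(agent_steps) == min_len}

# The agent behaviour discussed in Section 5: [1], then [0], then [2], then [3].
print(evaluate([[1], [0], [2], [3]]))   # successful, but not optimal (4 > 3 steps)
print(evaluate([[1], [0, 2], [3]]))     # successful and optimal
```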

Additionally, we calculate two metrics. Success rate: the percentage of valid paths completed. Optimality rate: the percentage of paths that match the minimal number of tool invocations. Appendix C provides formal algorithm specifications.

6 Experiments

The Multi-Mission Tool Bench consists of 1,024 test entries, each containing one to four missions. We divide the test set into four subsets based on the number of missions, with each subset containing 256 entries.

We evaluated 24 state-of-the-art models on the test set, including closed-source general models, open-source general models, and specialized models. Specifically, the closed-source general models are o1-2024-12-17 (OpenAI), GPT-4o-2024-11-20 (Achiam et al., 2023), Gemini-1.5-Pro-002 (Team et al., 2024), Mistral-Large-2411 (Mistral), and doubao-1.5-pro-32k (Doubao). The open-source general models include the Qwen2.5-Instruction-Series (Yang et al., 2024), GLM-4-9B-Chat (GLM et al., 2024), DeepSeek-R1 (Guo et al., 2025), DeepSeek-V3 (Liu et al., 2024a), and Llama-3.3-70B-Instruct (Dubey et al., 2024). The specialized models are ToolACE (Liu et al., 2024b), the Hammer2.1-Series (Lin et al., 2024), watt-tool-8b (Shi et al., 2024), xLAM-7b-fc-r (Zhang et al., 2024a), and gorilla-openfunctions-v2 (Charlie Cheng-Jie Ji, b). Model sizes range from several hundred billion parameters down to 70b, 30b, and the smallest at 0.5b.

This section details the test results and analysis. Subsection 6.1 shows the overall performance. Subsection 6.2 analyzes the effects of the number of missions, mission action-types, and mission switching. Subsection 6.3 explores the impact of inter-mission relationship types. Further error analysis is detailed in Appendix D.

6.1 Overview

This subsection analyzes the accuracy of models on the whole dataset, with Figure 6 showing the accuracy of 15 models. The models are arranged from low to high accuracy, with different colored dots indicating model types and varying dot sizes representing model sizes.

From the analysis of Figure 6, we draw the following conclusions. The o1 model, with strong reasoning capabilities, shows a significant accuracy advantage. Open-source models, such as Qwen-72b, are narrowing the gap with the top closed-source models. General models like DeepSeek-V3 and doubao-1.5-pro perform well in other missions but have a clear weakness in tool utilization. Notably, small specialized models like ToolACE achieve performance comparable to large-scale general models.

Figure 7 illustrates the performance of different-scale models in the Qwen2.5-Instruction-Series and Hammer2.1-Series. As shown, there is a positive correlation between model scale and accuracy. Interestingly, specialized models experience a faster decline in accuracy. More research is needed to explain this phenomenon.

6.2 Impact of Mission Switching

This study examines the impact of mission quantity, mission-type, and mission transition on agent robustness.


Figure 6: Overall accuracy of agents on the whole benchmark.

Seven models with better overall performance were selected for detailed analysis, including four general models and three specialized models. Figure 8 presents the performance of these models on the various subsets of mission quantities. The results indicate that specialized models perform comparably to stronger general models on single missions but experience a rapid decline in accuracy in multi-mission scenarios. This confirms our hypothesis that current research overlooks the influence of multi-missions. Furthermore, even the most advanced o1 model demonstrates a noticeable decrease in capability when handling multiple missions.

Figure 7: The performance of two series of agents.

Figure 8: The impact of various mission numbers on the agents.

We further analyze the performance of the seven models across different action-type combinations. Following the structure of Figure 2 b), in Figure 9 we visualize the models' performance in the action-type space with heatmaps. Each heatmap pyramid represents a model's performance, with each layer corresponding to a sub-testset and its action-type combinations. Deeper colors signify higher accuracy. Greater color contrast within the same layer, with a larger proportion of lighter areas, indicates poorer robustness of the model. The findings reveal that the best-performing o1 model also exhibits the highest robustness. In contrast, the three specialized models show less stability than the general models.

6.3 Impact of Mission Types

Moreover, we divide the test set by mission action-type and analyze the performance of all models, as shown in Table 2. The heatmap reveals several observations: models exhibit varying strengths and weaknesses across different action-types. For instance, most models struggle to determine whether necessary parameters are missing (A_clarity). Although many models have the ability to handle A_multi missions, they still face challenges in complex scenarios such as tackling A_multi^S and A_multi^S+P missions.

For multi-tool invocation, we introduce two new metrics, with results displayed on the far right of Table 2. The first is the optimal path rate, where the general models perform notably well. Additionally, instead of using hard labels to indicate mission success, we propose the accomplished progress metric to assess model capability.


Figure 9: Visualization of the robustness of agents in the mission switching space.

Agent | A_single | A_chat | A_clarity | A_multi^P | A_multi^S | A_multi^S+P | Optimal Path Rate | Accomplished Progress
o1-2024-12-17† | 63.28 | 91.41 | 45.70 | 50.32 | 12.50 | 19.05 | 39.42 | 30.15
GPT-4o-2024-11-20† | 54.69 | 74.61 | 35.94 | 51.59 | 18.75 | 23.81 | 41.08 | 45.56
Gemini-1.5-Pro-002† | 49.61 | 77.73 | 35.94 | 37.58 | 6.25 | 8.33 | 26.14 | 16.58
Qwen2.5-72b-Instruct‡ | 56.25 | 74.61 | 27.34 | 45.22 | 18.75 | 7.14 | 30.29 | 19.43
ToolACE-8B⋆ | 43.75 | 87.11 | 22.66 | 35.67 | 0.00 | 3.57 | 24.07 | 9.55
Mistral-Large-2411† | 57.03 | 55.86 | 31.64 | 41.40 | 12.50 | 16.67 | 29.88 | 37.69
Hammer2.1-7b⋆ | 28.13 | 91.27 | 31.25 | 28.03 | 6.25 | 3.57 | 19.67 | 9.72
watt-tool-8b⋆ | 40.63 | 91.80 | 23.05 | 29.94 | 0.00 | 0.00 | 19.50 | 8.38
GLM-4-9B-Chat‡ | 30.08 | 89.84 | 10.16 | 12.10 | 12.50 | 0.00 | 0.00 | 12.23
DeepSeek-R1‡ | 27.50 | 68.27 | 13.39 | 44.19 | 33.33 | 6.06 | 33.61 | 39.17
doubao-1.5-pro-32k† | 60.16 | 25.78 | 5.86 | 36.94 | 18.75 | 9.52 | 5.39 | 38.53
xLAM-7b-fc-r⋆ | 14.45 | 86.33 | 5.08 | 7.64 | 0.00 | 1.19 | 4.56 | 9.55
gorilla-openfunctions-v2⋆ | 2.34 | 90.63 | 4.30 | 5.73 | 0.00 | 0.00 | 3.73 | 4.86
DeepSeek-V3‡ | 22.09 | 41.58 | 7.51 | 4.81 | 0.00 | 4.55 | 4.05 | 24.13
Llama-3.3-70B-Instruct‡ | 29.30 | 19.92 | 0.00 | 0.64 | 0.00 | 0.00 | 0.00 | 12.40

Table 2: The performance of agents on various types of missions, and the quantitative evaluation results on A_multi missions. Here, † and ‡ represent closed-source and open-source general models, and ⋆ represents specialized models.

6.4 Impact of Related Missions

This subsection examines how mission relationship types affect agent performance. As mentioned, all subsequent missions in our benchmark are closely related to preceding missions, and we have abstracted three types of mission relationships.

Table 3 presents the accuracy of all models on the three types of mission relationship. Long-term memory has the most significant impact on model performance, followed by the absence of core components in the problem (ellipsis).

Agent | Implicit | Ellipsis | Long-Term
o1-2024-12-17† | 57.31 | 54.17 | 43.58
GPT-4o-2024-11-20† | 42.69 | 52.92 | 34.64
Gemini-1.5-Pro-002† | 46.99 | 42.08 | 31.84
Qwen2.5-72b-Instruct‡ | 40.11 | 47.08 | 28.49
ToolACE-8B⋆ | 38.68 | 35.83 | 27.93
Mistral-Large-2411† | 35.24 | 39.17 | 30.17
Hammer2.1-7b⋆ | 43.55 | 34.58 | 27.93
watt-tool-8b⋆ | 40.97 | 32.92 | 26.26
GLM-4-9B-Chat‡ | 35.82 | 26.25 | 21.23
DeepSeek-R1‡ | 30.06 | 32.28 | 18.67
doubao-1.5-pro-32k† | 25.79 | 28.33 | 22.91
xLAM-7b-fc-r⋆ | 30.37 | 22.92 | 19.55
gorilla-openfunctions-v2⋆ | 29.80 | 21.67 | 16.20
DeepSeek-V3‡ | 17.28 | 18.07 | 13.39
Llama-3.3-70B-Instruct‡ | 9.17 | 13.33 | 11.17

Table 3: The impact of mission relation types on agent performance.

7 Conclusion

This paper introduces a novel multi-mission benchmark to evaluate the robustness of LLM-based agents. Evaluations reveal that current agents exhibit varying degrees of limitations when addressing multi-mission scenarios. Notably, while specialized agents achieve overall accuracy and single-mission performance comparable to general agents, a significant robustness gap emerges in multi-mission contexts. Moreover, all agents struggle with complex multi-tool invocation missions and have shortcomings in related mission handling. We believe that these findings offer valuable insights for guiding future research on the development of LLM-agents.

Limitations

In evaluating LLM-based agents from a multi-mission perspective, we identify specific limitations in both mission duration and the data generation framework.

Firstly, our study aims to enhance the diversity of test data in terms of mission variation, yet the diversity in the number of missions remains limited. Specifically, our test data comprises up to four missions. This limitation arises because the mission switching space expands exponentially with an increase in the number of missions, leading to a rapid enlargement of the test set size and additional workload. Moreover, we observe a swift decline in the precision of models' output as the number of missions increases, indicating that there is no immediate need to explore model performance across a larger number of missions.

Secondly, the proposed data generation framework employs multiple iterations and human intervention to ensure the quality of multi-turn dialogue production. This approach suffers from the limitations of LLMs in accurately following instructions.

In summary, these limitations emphasize the need for ongoing development in the field of LLM-based evaluations.

References

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. GPT-4 technical report. arXiv preprint.

Charlie Cheng-Jie Ji, Huanzhi Mao, Fanjia Yan, Shishir G. Patil, Tianjun Zhang, Ion Stoica, and Joseph E. Gonzalez. a. Gorilla BFCL V3. https://gorilla.cs.berkeley.edu/leaderboard.html. Accessed: 2025-01-17.

Charlie Cheng-Jie Ji, Huanzhi Mao, Fanjia Yan, Shishir G. Patil, Tianjun Zhang, Ion Stoica, and Joseph E. Gonzalez. b. Gorilla OpenFunctions v2. https://gorilla.cs.berkeley.edu//blogs/7_open_functions_v2.html. Accessed: 2025-01-17.

Zehui Chen, Weihua Du, Wenwei Zhang, Kuikun Liu, Jiangning Liu, Miao Zheng, Jingming Zhuo, Songyang Zhang, Dahua Lin, Kai Chen, et al. 2024. T-Eval: Evaluating the tool utilization capability of large language models step by step. Annual Meeting of the Association for Computational Linguistics, pages 9510–9529.

Doubao. Doubao 1.5 Pro. https://team.doubao.com/zh/special/doubao_1_5_pro. Accessed: 2025-02-14.

Yu Du, Fangyun Wei, and Hongyang Zhang. 2024. AnyTool: Self-reflective, hierarchical agents for large-scale API calls. International Conference on Machine Learning.

Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. 2024. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783.

Team GLM, Aohan Zeng, Bin Xu, Bowen Wang, Chenhui Zhang, Da Yin, Dan Zhang, Diego Rojas, Guanyu Feng, Hanlin Zhao, et al. 2024. ChatGLM: A family of large language models from GLM-130B to GLM-4 all tools. arXiv preprint.

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. 2025. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948.

Zishan Guo, Yufei Huang, and Deyi Xiong. 2024. CToolEval: A Chinese benchmark for LLM-powered agent evaluation in real-world API interactions. Annual Meeting of the Association for Computational Linguistics, pages 15711–15724.

Senyu Han, Lu Chen, Li-Min Lin, Zhengshan Xu, and Kai Yu. 2024. IBSEN: Director-actor agent collaboration for controllable and interactive drama script generation. Annual Meeting of the Association for Computational Linguistics.

Shijue Huang, Wanjun Zhong, Jianqiao Lu, Qi Zhu, Jiahui Gao, Weiwen Liu, Yutai Hou, Xingshan Zeng, Yasheng Wang, Lifeng Shang, et al. 2024a. Planning, creation, usage: Benchmarking LLMs for comprehensive tool utilization in real-world complex scenarios. arXiv preprint.

Yue Huang, Jiawen Shi, Yuan Li, Chenrui Fan, Siyuan Wu, Qihui Zhang, Yixin Liu, Pan Zhou, Yao Wan, Neil Zhenqiang Gong, et al. 2024b. MetaTool benchmark for large language models: Deciding whether to use tools and which to use. International Conference on Learning Representations.

Md Ashraful Islam, Mohammed Eunus Ali, and Md Rizwan Parvez. 2024. MapCoder: Multi-agent code generation for competitive problem solving. Annual Meeting of the Association for Computational Linguistics.

Manling Li, Shiyu Zhao, Qineng Wang, Kangrui Wang, Yu Zhou, Sanjana Srivastava, Cem Gokmen, Tony Lee, Li Erran Li, Ruohan Zhang, et al. 2024. Embodied agent interface: Benchmarking LLMs for embodied decision making. Conference on Neural Information Processing Systems.

Minghao Li, Yingxiu Zhao, Bowen Yu, Feifan Song, Hangyu Li, Haiyang Yu, Zhoujun Li, Fei Huang, and Yongbin Li. 2023. API-Bank: A comprehensive benchmark for tool-augmented LLMs. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 3102–3116.

Qiqiang Lin, Muning Wen, Qiuying Peng, Guanyu Nie, Junwei Liao, Jun Wang, Xiaoyun Mo, Jiamu Zhou, Cheng Cheng, Yin Zhao, et al. 2024. Hammer: Robust function-calling for on-device language models via function masking. arXiv preprint.

Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. 2024a. DeepSeek-V3 technical report. arXiv preprint arXiv:2412.19437.

Weiwen Liu, Xu Huang, Xingshan Zeng, Xinlong Hao, Shuai Yu, Dexun Li, Shuai Wang, Weinan Gan, Zhengying Liu, Yuanqing Yu, et al. 2024b. ToolACE: Winning the points of LLM function calling. arXiv preprint.

Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, et al. 2024c. AgentBench: Evaluating LLMs as agents. International Conference on Learning Representations.

meetkai. Functionary. https://functionary.meetkai.com/. Accessed: 2025-01-17.

Mistral. Au Large. https://mistral.ai/en/news/mistral-large. Accessed: 2025-02-14.

OpenAI. o1 and o1-mini. https://platform.openai.com/docs/models#o1. Accessed: 2025-02-14.

Shishir G. Patil, Tianjun Zhang, Xin Wang, and Joseph E. Gonzalez. 2023. Gorilla: Large language model connected with massive APIs. arXiv preprint.

Shuofei Qiao, Runnan Fang, Zhisong Qiu, Xiaobin Wang, Ningyu Zhang, Yong Jiang, Pengjun Xie, Fei Huang, and Huajun Chen. 2024a. Benchmarking agentic workflow generation. arXiv preprint.

Shuofei Qiao, Ningyu Zhang, Runnan Fang, Yujie Luo, Wangchunshu Zhou, Yuchen Eleanor Jiang, Chengfei Lv, and Huajun Chen. 2024b. AutoAct: Automatic agent learning from scratch via self-planning. Annual Meeting of the Association for Computational Linguistics.

Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, et al. 2024. ToolLLM: Facilitating large language models to master 16000+ real-world APIs. International Conference on Learning Representations.

Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. 2023. Toolformer: Language models can teach themselves to use tools. Advances in Neural Information Processing Systems, 36:68539–68551.

Haiyang Shen, Yue Li, Desong Meng, Dongqi Cai, Sheng Qi, Li Zhang, Mengwei Xu, and Yun Ma. 2024a. ShortcutsBench: A large-scale real-world benchmark for API-based agents. arXiv preprint.

Yongliang Shen, Kaitao Song, Xu Tan, Wenqi Zhang, Kan Ren, Siyu Yuan, Weiming Lu, Dongsheng Li, and Yueting Zhuang. 2024b. TaskBench: Benchmarking large language models for task automation. International Conference on Learning Representations Workshop on Large Language Model (LLM) Agents.

Wentao Shi, Mengqi Yuan, Junkang Wu, Qifan Wang, and Fuli Feng. 2024. Direct multi-turn preference optimization for language agents. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 2312–2324.

Alkesh K. Srivastava and Philip Dames. 2024. Speech-guided sequential planning for autonomous navigation using Large Language Model Meta AI 3 (Llama3). arXiv preprint.

Xingwu Sun, Yanfeng Chen, Yiqing Huang, Ruobing Xie, Jiaqi Zhu, Kai Zhang, Shuaipeng Li, Zhen Yang, Jonny Han, Xiaobo Shu, et al. 2024. Hunyuan-Large: An open-source MoE model with 52 billion activated parameters by Tencent. arXiv preprint.

Qiaoyu Tang, Ziliang Deng, Hongyu Lin, Xianpei Han, Qiao Liang, Boxi Cao, and Le Sun. 2023. ToolAlpaca: Generalized tool learning for language models with 3000 simulated cases. arXiv preprint.

Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, et al. 2024. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint.

Harsh Trivedi, Tushar Khot, Mareike Hartmann, Ruskin Manku, Vinty Dong, Edward Li, Shashank Gupta, Ashish Sabharwal, and Niranjan Balasubramanian. 2024. AppWorld: A controllable world of apps and people for benchmarking interactive coding agents. Annual Meeting of the Association for Computational Linguistics, pages 16022–16076.

Jian Xie, Kai Zhang, Jiangjie Chen, Tinghui Zhu, Renze Lou, Yuandong Tian, Yanghua Xiao, and Yu Su. 2024. TravelPlanner: A benchmark for real-world planning with language agents. International Conference on Machine Learning.

Heng-Da Xu, Xian-Ling Mao, Puhai Yang, Fanshu Sun, and He-Yan Huang. 2024. Rethinking task-oriented dialogue systems: From complex modularity to zero-shot autonomous agent. Annual Meeting of the Association for Computational Linguistics, pages 2748–2763.

An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, et al. 2024. Qwen2.5 technical report. arXiv preprint.

Shunyu Yao, Noah Shinn, Pedram Razavi, and Karthik Narasimhan. 2024. τ-bench: A benchmark for tool-agent-user interaction in real-world domains. arXiv preprint.

Junjie Ye, Guanyu Li, Songyang Gao, Caishuang Huang, Yilong Wu, Sixian Li, Xiaoran Fan, Shihan Dou, Qi Zhang, Tao Gui, et al. 2024. ToolEyes: Fine-grained evaluation for tool learning capabilities of large language models in real-world scenarios. arXiv preprint.

Jianguo Zhang, Tian Lan, Ming Zhu, Zuxin Liu, Thai Hoang, Shirley Kokane, Weiran Yao, Juntao Tan, Akshara Prabhakar, Haolin Chen, et al. 2024a. xLAM: A family of large action models to empower AI agent systems. arXiv preprint.

Wenqi Zhang, Ke Tang, Hai Wu, Mengna Wang, Yongliang Shen, Guiyang Hou, Zeqi Tan, Peng Li, Yueting Zhuang, and Weiming Lu. 2024b. Agent-Pro: Learning to evolve via policy-level reflection and optimization. Annual Meeting of the Association for Computational Linguistics.

Yuchen Zhuang, Yue Yu, Kuan Wang, Haotian Sun, and Chao Zhang. 2023. ToolQA: A dataset for LLM question answering with external tools. Conference on Neural Information Processing Systems, 36:50117–50143.

A Diverse Toolset Construction

We generate the toolset based on tool descriptions from public-apis, following the ToolAlpaca approach. This API repository contains 400 tool lists, corresponding to 1,600 tools in 50 categories.

In contrast to ToolAlpaca, our approach includes three strategies to enhance tool accuracy and parameter variety. Initially, we utilize LLMs such as GPT to refine tool descriptions, addressing the common issue of absent constraint parameters in generated tools. For instance, a tool description for querying Spanish weather does not mention Spain in any of its three specific functions, so the generated tool cannot validate the query location. Second, we expand parameter types to include complex data structures such as enumerations, arrays, and objects, aligning better with real-world scenarios. Finally, five LLM agent experts review the generated tools. These steps ensure the tools' accuracy and parameter diversity.

B Analysis of the Test Data

Figures 10, 11 and 12 present the proposed dataset from the following three perspectives.

Figure 10: Category distribution of tools.

Figure 11: Distribution of action-types.

Figure 12: Distribution of three mission relationship types.

B.1 Data Examples

We present two more examples of mission execution corresponding to the examples in Section 5. Figure 13 illustrates the execution of the optimal path, while Figure 14 shows a non-optimal path execution.

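To make the parameter-type expansion described in Appendix A concrete, the following is a minimal, hypothetical tool definition combining an enumeration, an array, and a nested object parameter. The tool name and every field are invented for illustration and are not taken from the released toolset.

```python
# A hypothetical tool schema illustrating the enriched parameter types
# described in Appendix A (enumerations, arrays, and nested objects).
EXAMPLE_TOOL = {
    "name": "get_weather_history",
    "description": "Query historical weather for cities in Spain only.",
    "parameters": {
        "type": "object",
        "properties": {
            "city": {"type": "string", "description": "City name; must be in Spain."},
            "granularity": {                      # enumeration parameter
                "type": "string",
                "enum": ["hourly", "daily", "monthly"],
            },
            "dates": {                            # array parameter
                "type": "array",
                "items": {"type": "string", "description": "Date in YYYY-MM-DD."},
            },
            "filters": {                          # object parameter
                "type": "object",
                "properties": {
                    "min_temperature": {"type": "number"},
                    "max_temperature": {"type": "number"},
                },
            },
        },
        "required": ["city", "dates"],
    },
}
```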

C Details of Proposed Evaluation Method

1. Initialize the graph G, the indegree table, the visitation table, the current path, and the set of all paths.

2. Perform topological sorting and depth-first traversal based on parallel combination and permutation.

2.1 For each search step, find all nodes with an indegree of 0 and arrange all possible combinations based on the number of such nodes. Specifically, since nodes with an indegree of 0 are independent, they can be combined arbitrarily. When the number of nodes in a combination is greater than 1, it indicates that these nodes can be called in parallel. It is this step that allows our algorithm to enumerate all possible paths, including parallel and serial-parallel calls, instead of being limited to serial calls as in naive topological sorting.

2.2 Traverse each combination, add the combination to the current path, and update the indegree and visitation tables.

2.3 Continue the depth-first traversal until the number of nodes in the path matches the number of nodes in the annotated answer, completing the generation of one path, and add it to the set of all paths.

2.4 Repeat the above steps until the traversal is complete.

3. Based on path length, divide the paths into optimal and suboptimal paths.

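The following is a minimal Python sketch of the enumeration procedure described above, under the assumption that tool dependencies are given as an acyclic mapping from each tool to the tools it requires. It is an illustration of the algorithm, not the benchmark's released implementation.

```python
from itertools import combinations

def enumerate_paths(num_tools, deps):
    """Enumerate all execution paths (sequences of parallel groups) for the
    given acyclic dependency map, e.g. deps = {2: [1], 3: [0, 2]} from Section 5.
    Each path is a list of tuples; tools inside a tuple run in parallel."""
    all_paths = []

    def dfs(done, path):
        if len(done) == num_tools:              # step 2.3: one complete path
            all_paths.append(list(path))
            return
        # step 2.1: tools whose dependencies are satisfied ("indegree 0")
        ready = [t for t in range(num_tools)
                 if t not in done and all(d in done for d in deps.get(t, []))]
        # any non-empty subset of ready tools may be called in parallel
        for size in range(1, len(ready) + 1):
            for group in combinations(ready, size):
                dfs(done | set(group), path + [group])   # step 2.2

    dfs(set(), [])
    min_len = min(len(p) for p in all_paths)
    optimal = [p for p in all_paths if len(p) == min_len]        # step 3
    suboptimal = [p for p in all_paths if len(p) > min_len]
    return optimal, suboptimal

if __name__ == "__main__":
    # Toy example from Section 5: [2] needs [1]; [3] needs [0] and [2].
    optimal, suboptimal = enumerate_paths(4, {2: [1], 3: [0, 2]})
    print(len(optimal) + len(suboptimal), "candidate paths")   # 5 candidate paths
```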

D Further Analysis of Agent Performance

In addition to the analytical perspectives mentioned in the main text, we analyze the error types of the agents.

We categorize errors into tool errors and parameter errors. Specifically, we further divide parameter errors into parameter name hallucinations, parameter value hallucinations, and parameter value errors. Table 4 lists these error classifications. Stronger agents show a relatively lower proportion of tool errors. Although parameter name hallucinations occur less frequently, they are serious and widespread. The most common parameter error occurs when the agent extracts parameter values.

Agent | Tool Errors | Param. Name Hallu. | Param. Value Hallu. | Param. Value Err.
o1-2024-12-17 | 83.33 | 0.24 | 5.07 | 11.36
GPT-4o-2024-11-20 | 75.87 | 0.20 | 8.05 | 15.49
Gemini-1.5-Pro-002 | 85.15 | 0.19 | 3.34 | 11.32
Qwen2.5-72b-Instruct | 80.90 | 0.37 | 6.31 | 12.43
ToolACE-8B | 90.56 | 0.17 | 1.75 | 7.52
Mistral-Large-2411 | 78.19 | 0.35 | 6.46 | 15.01
watt-tool-8b | 90.68 | 0.17 | 3.63 | 5.53
GLM-4-9B-Chat | 92.99 | 0.15 | 2.99 | 3.88
DeepSeek-R1 | 95.77 | 0.00 | 2.11 | 2.11
doubao-1.5-pro-32k | 82.35 | 0.28 | 10.69 | 6.67
xLAM-7b-fc-r | 96.36 | 0.27 | 1.35 | 1.89
gorilla-openfunctions-v2 | 98.83 | 0.00 | 0.26 | 0.90
DeepSeek-V3 | 96.57 | 0.00 | 0.90 | 2.53
Llama-3.3-70B-Instruct | 90.53 | 0.33 | 2.45 | 6.69

Table 4: The distribution of agent errors (%). Here, 'Hallu.' is short for hallucination.

E Role Prompts of Agents

E.1 Role Prompt of Mission Generation

We show the role prompt for single-mission generation in Figure 15.

E.2 Role Prompt of Planner

We show the role prompt of the Planner in Figures 16, 17, 18, 19, 20, 21 and 22.

E.3 Role Prompt of Tool

We show the role prompt of the Tool in Figure 23.

E.4 Role Prompt of AI

We show the role prompt of the AI in Figure 24.

Figure 13: An Optimal Path Example.

Figure 14: A Suboptimal Path Example.

Single Tool Invocation Mission Generation Prompt.

Please act as a user interacting with a super intelligent agent.


This super intelligent agent has access to a range of external tools and can use these tools to solve
the missions you propose.
Next, please propose 5 missions that you need the super intelligent agent to solve based on the
All 5 missions must require the use of {{{tool}}} from the [Tool List] to be completed, and each
mission should only require a single call to {{{tool}}}.
The missions should be specific and diverse.
Finally, please output the final result according to the [Format] without generating any extra text.
The required parameters for tool {{{tool}}} are: {{{tool_required}}}, and the optional parame-
ters are: {{{tool_no_required}}}.

[Requirements]="""
1. The description of the user’s mission must include information on all the required parameters
needed to call {{{tool}}}. For other optional parameters, please add them as you see fit, using
natural language.
2. The user’s missions should use different types of sentence structures: imperative, declarative,
interrogative, etc.
3. The user’s missions should include different tones: colloquial, formal, polite, direct, etc.
4. Ensure that the length of the user’s missions varies, gradually increasing from short to long.
5. Ensure that the user’s missions involve different themes/instances, different scenarios, and
different roles.
6. Extract common entities that appear in all descriptions from the [Tool List] and ensure that these
entities appear in the user’s missions.
7. Do not explicitly specify the tool {{{tool}}} in the user’s missions.
"""

[Tool List]="""
{{{tool}}}
"""

[Format]="""
{
"mission 1": "xxx",
"mission 2": "xxx",
"mission 3": "xxx",
"mission 4": "xxx",
"mission 5": "xxx",
}
"""

Figure 15: Single Tool Invocation Mission Generation Prompt.
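
The {{{tool}}}-style placeholders in the prompt above suggest simple string substitution. The snippet below shows one hypothetical way to instantiate such a template; fill_prompt and the example tool fields are assumptions for illustration, not the authors' released code.

```python
import json

def fill_prompt(template: str, values: dict) -> str:
    """Replace {{{key}}} placeholders in a role prompt with concrete values."""
    for key, value in values.items():
        filled = value if isinstance(value, str) else json.dumps(value)
        template = template.replace("{{{" + key + "}}}", filled)
    return template

# Hypothetical usage with the placeholders used in Figure 15.
template = ("All 5 missions must require the use of {{{tool}}} ... "
            "required: {{{tool_required}}}, optional: {{{tool_no_required}}}.")
print(fill_prompt(template, {
    "tool": "get_weather_history",
    "tool_required": ["city", "dates"],
    "tool_no_required": ["granularity", "filters"],
}))
```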

Planner Decision Generation Prompt Part-1.

Please act as a Planner within a super intelligent agent.


You have access to a series of external tools, and you can solve user missions by invoking these
external tools, as detailed in the [Tool List].
You are responsible for assessing the completion status of the current user mission and providing
thoughts, plans, and actions to be executed.
If the Checker_Planner indicates ‘no’ for correct, it means there is an issue with the decision you
made in the previous round. In this case, you should regenerate your decision based on the analysis
provided by the Checker_Planner.
However, please be mindful not to include explanations of previously generated incorrect results in
your Thoughts!
In your Plan, be sure not to mention the use of the prepare_to_answer tool and the
ask_user_for_required_parameters tool. Instead, describe these actions in natural language, as the
prepare_to_answer and ask_user_for_required_parameters tools are not to be exposed.
Refer to the [Planner Output Format] for the output format.
[Environmental Information]="""
{{{env_info}}}
"""

Figure 16: Planner Decision Generation Prompt Part-1.

Planner Decision Generation Prompt Part-2.

[Planner Output Format]="""


Planner:
{
"Mission_Finish": "Whether the user mission is completed, fill in ‘yes’ if completed, ‘no’ if
not completed",
"Thought": "Based on the [Requirements] and [Environmental Information], follow the steps
below to give the internal thought process when solving the user mission. You must provide an
analysis of the required and optional parameters for each tool that needs to be called.
First step, decompose the mission, first analyze whether a tool needs to be called to complete
it, and whether there is a suitable tool in the [Tool List].
If a tool needs to be called, which tool(s) should be used to complete the user mission, whether
one or multiple tools should be called.
If multiple tools are involved, please provide an analysis of the serial and parallel nature of
multiple tools.
Second step, provide an analysis of the required and optional parameters for the first tool that
needs to be called (now), in the following order.
1. First, list the required and optional parameters for each tool that needs to be called.
2. Based on the context and user mission, analyze the required parameters, check which
information for each tool’s required parameters is provided, and explain which are provided and
which are missing to ask the user.
3. Finally, analyze the optional parameters. If the user has provided information for optional
parameters, briefly explain the situation; otherwise, there is no need to explain.
Note:
1. The analysis process should not be too lengthy; it needs to be concise and clear.
2. Do not have too much redundant content that is repetitive of the Plan.",
"Plan": "Based on the [Requirements], [Environmental Information], Thought, context, and
user mission, provide a planning scheme.
Note:
1. When involving multiple tool calls, provide the overall plan and the plan for the first action
during the first Plan, and provide the plan for the current step in subsequent dialogues.
2. The Plan is a general explanation of the Thought. The Plan does not need to set the values
of the tool parameters; it only needs to explain which tools should be called to complete what
missions, only the purpose of calling the tools.
3. The format of the Plan needs to be consistent with the example given in the [Requirements].
4. Do not have too much redundant content that is repetitive of the Thought.",
"Action_List": [
{
"name": "Based on the [Requirements], provide the action to be taken, i.e., the
selected tool name",
"arguments": "Based on the [Requirements], [Environmental Information], and
[Tool List], provide the input parameters for the action to be taken, i.e., the tool’s input parameters.
Note: 1. Optional parameters not specified by the user do not need to be provided. 2. Use the
JSON format in terms of format, use a dictionary object, do not use strings, and there is no need to
provide comments for the parameters",
"tool_call_purpose": "The purpose of the tool call"
}
]
}
"""

Figure 17: Planner Decision Generation Prompt Part-2.
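
For concreteness, here is a hypothetical Planner output that follows the [Planner Output Format] defined above, using the movie-presentation toy example from Section 5. The tool names and argument values are invented for illustration only.

```python
# Hypothetical Planner output for the first step of the Section 5 toy mission;
# field names follow the [Planner Output Format], tool names are invented.
planner_output = {
    "Mission_Finish": "no",
    "Thought": "The mission needs a ranking query, movie details, presentation "
               "creation, and slide generation. The ranking tool and the "
               "presentation tool have no dependency, so they can run in parallel.",
    "Plan": "Concurrently invoke create_presentation and get_movie_ranking, then "
            "invoke get_movie_details, and finally invoke generate_slides.",
    "Action_List": [
        {
            "name": "create_presentation",
            "arguments": {"title": "Most Popular Movie of the Year"},
            "tool_call_purpose": "Create an empty presentation to hold the slides",
        },
        {
            "name": "get_movie_ranking",
            "arguments": {"top_k": 1},
            "tool_call_purpose": "Retrieve this year's most popular movie",
        },
    ],
}
```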


Planner Decision Generation Prompt Part-3.

[Requirements]="""
*** Special Attention ***
1. When making a decision, please ensure that the tool you invoke from the [Tool List] is suitable
for solving the user’s mission based on the definition of the tools in the list. Do not force the use of
inappropriate tools to solve the user’s mission; instead, call the appropriate tool from the [Tool
List] according to the user’s mission.
2. Ensure that the Action_List you provide does not contradict the Plan you have set out. The
order of tools in the given Action_List should be consistent with the sequence planned in the Plan.
3. For optional parameters, you only need to fill them in if the user has provided a value that is
different from the default or if there is no default value. Otherwise, there is no need to include
them in the arguments.
*** The prepare_to_answer tool needs to be called in the following two scenarios: ***
1. If you believe that the user’s mission can be completed, then call the prepare_to_answer tool to
provide a summary response, with the answer_type parameter set to ‘tool’.
2. If you believe that the user’s mission does not require the use of any tools from the [Tool List]
or that there is no suitable tool to solve the user’s mission and it can be answered directly, then call
the prepare_to_answer tool, with the answer_type parameter set to ‘chat’.
Note:
1) The absence of a suitable tool in the [Tool List] to solve the user’s mission does not mean
that you lack the ability to answer. Please respond based on the context information and the
knowledge you possess. Do not excessively refuse to answer, nor imagine knowledge you do not
have. Only refuse to answer when you cannot respond based on the context information and your
own knowledge.
2) The absence of a suitable tool in the [Tool List] to solve the user’s mission also includes the
following situation:
First, analyze the common entities that appear in each tool. For example, some tools can only
query data related to a certain entity A. If the user asks about entity B, it also means that there is
no suitable tool.
For instance:
- If the tools in the [Tool List] can only query and analyze population data for Denmark, and the
user asks for population data for Sweden, then you should also call the prepare_to_answer tool.
- If the tools in the [Tool List] can only query weather data for China, including current and
historical weather, and the user asks for weather data for the United States, then you should also
call the prepare_to_answer tool.

Figure 18: Planner Decision Generation Prompt Part-3.

Planner Decision Generation Prompt Part-4.

*** There are four scenarios in which the ask_user_for_required_parameters tool needs to be
invoked: ***
1. If you believe that a user’s mission requires the use of a tool from the [Tool List], but the user’s
mission is missing some required parameters from the tool, and the user needs to provide the
necessary information, then invoke the ask_user_for_required_parameters tool. Please do not
hallucinate parameters.
2. Please note that you are unable to deduce the values of some tool parameters on your own
and will need to invoke the ask_user_for_required_parameters tool to ask the user. Please do not
hallucinate parameters.
For example:
1) For the timestamp parameter, you do not have the ability to deduce the timestamp based
on time. However, you can deduce other time-related parameters (start_time, end_time,
etc.) on your own based on [Environmental Information], without needing to invoke the
ask_user_for_required_parameters tool.
2) For ID-type parameters (station_id, product_id, etc.), you do not have the ability to deduce the
corresponding ID based on the name.
3. Based on the context of the conversation, if you have already asked the user once to provide the
necessary information but the user still has not provided all the required parameters, then please
continue to invoke the ask_user_for_required_parameters tool.
4. If the user provides incomplete parameter values, such as the tool parameter being an IP address
(ip_address), but the user provides an incomplete IP address (e.g., 192.22), please continue to use
the ask_user_for_required_parameters tool to ask the user for the complete IP address.
Finally, if you confirm the need to invoke the ask_user_for_required_parameters tool, provide the
inquiry plan in the format: "Ask the user to provide xxx, in order to invoke the xxx tool to xxx" in
the Plan.

Figure 19: Planner Decision Generation Prompt Part-4.

Planner Decision Generation Prompt Part-5.

*** There are eight scenarios in which multiple tools need to be invoked: ***
If a user mission involves invoking multiple tools, please first analyze the dependency relation-
ships between the multiple invocation tools. For tools that do not have invocation dependencies,
perform concurrent invocations, and for tools that do have invocation dependencies, perform serial
invocations. Specifically, you can handle each of the following eight scenarios separately:
Concurrent invocation scenarios:
1. If you determine that the user mission requires multiple invocations of the same tool A, but
with different parameters for each invocation of tool A, then please invoke tool A concurrently and
provide the concurrent invocation plan in the Plan in the format: "Concurrently invoke tool A N
times for xxx."
2. If you determine that the user mission requires the invocation of different tools, such as tools
A and B, and there is no dependency between tool A and B, then please invoke tools A and B
concurrently, and provide the concurrent invocation plan in the Plan in the format: "Concurrently
invoke tool A for xxx, while invoking tool B for xxx."
Serial invocation scenarios:
3. If you determine that the user mission requires the invocation of different tools, such as tools A,
B, and C, and there are dependencies between these tools, then please invoke tools A, B, and C
serially, and provide the serial invocation plan in the Plan in the format: "First, invoke tool A for
xxx. Then, invoke tool B for xxx. Next, invoke tool C for xxx. Now, I will invoke tool A for xxx."
Serial invocation has the following two dependency scenarios:
3.1. Parameter dependency: For example, before invoking tool C, it is necessary to first invoke
tool B to obtain the result as an input parameter, and before invoking tool B, it is necessary to first
invoke tool A to obtain the result as an input parameter. Therefore, you need to first complete the
invocation of tool A to obtain the result, use it as the input parameter for invoking tool B, and
after obtaining the result from tool B, use it as the input parameter for invoking tool C, i.e., please
invoke tools A, B, and C serially.
3.2. Logical dependency: Even if there is no parameter dependency between the invocation of
tools A, B, and C, but there is a logical dependency, such as logically needing to invoke tool B
before tool C, and tool A before tool B, then please also invoke tools A, B, and C serially.

Figure 20: Planner Decision Generation Prompt Part-5.
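To make the Plan/Action_List contract concrete, a planner turn for serial scenario 3 might look like the following sketch. This is our illustration: the tool names, wording, and the exact keys of each action entry are hypothetical, while Thought, Plan, and Action_List are the fields named in the prompt.

# Hypothetical planner turn for serial scenario 3 (parameter dependency):
# get_weather needs the city_id returned by get_city_id, so the calls are serial
# and only the first tool appears in this turn's Action_List.
planner_turn = {
    "Thought": "get_city_id must run before get_weather, because get_weather "
               "needs the city_id returned by get_city_id.",
    "Plan": "First, invoke tool get_city_id to look up the ID for Beijing. "
            "Then, invoke tool get_weather to query the forecast for that ID. "
            "Now, I will invoke tool get_city_id to look up the ID for Beijing.",
    "Action_List": [
        {"name": "get_city_id", "arguments": {"city_name": "Beijing"}},
    ],
}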

Planner Decision Generation Prompt Part-6.

Combined serial and concurrent invocation scenarios:


4. If you determine that the user mission requires the invocation of different tools, such as tools
A, B, and C, and tool C depends on the invocation of tools A and B, but there is no dependency
between tools A and B, then please invoke tools A and B concurrently, followed by the serial
invocation of tool C, and provide the combined serial and concurrent invocation plan in the Plan in
the format: "Concurrently invoke tools A and B for xxx and xxx, respectively. Then, invoke tool C
for xxx. Now, I will concurrently invoke tools A and B for xxx and xxx."
5. If you determine that the user mission requires the invocation of different tools, such as tools A,
B, and C, and tools B and C depend on the invocation of tool A, but there is no dependency between
tools B and C, then please first invoke tool A serially, followed by the concurrent invocation of
tools B and C, and provide the combined serial and concurrent invocation plan in the Plan in the
format: "First, invoke tool A for xxx. Then, concurrently invoke tools B and C for xxx and xxx,
respectively. Now, I will invoke tool A for xxx."
6. If you determine that the user mission requires the invocation of different tools, such as tools A
and B, and there is a dependency between tools A and B, and tool A needs to be invoked multiple
times, then please first invoke tool A concurrently multiple times, followed by the serial invocation
of tool B, and provide the combined serial and concurrent invocation plan in the Plan in the format:
"First, concurrently invoke tool A N times for xxx. Then, invoke tool B for xxx. Now, I will
concurrently invoke tool A N times for xxx."
7. If you determine that the user mission requires the invocation of different tools, such as tools A
and B, and there is a dependency between tools A and B, and tool B needs to be invoked multiple
times, then please first invoke tool A serially, followed by the concurrent invocation of tool B
multiple times, and provide the combined serial and concurrent invocation plan in the Plan in the
format: "First, invoke tool A for xxx. Then, concurrently invoke tool B N times for xxx. Now, I
will invoke tool A for xxx."
Special scenarios:
8. The tools prepare_to_answer and ask_user_for_required_parameters cannot be invoked
concurrently with other tools and need to be invoked serially.

Figure 21: Planner Decision Generation Prompt Part-6.
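The combined scenarios above amount to a staging rule: tools whose dependencies are all satisfied can be invoked concurrently within a stage, and stages run serially. The following is a minimal Python sketch of that grouping, using the standard graphlib module; the dependency graph is a hypothetical instance of scenario 4 and not taken from the benchmark.

# Minimal sketch: group tool calls into serial stages of concurrent
# invocations from a dependency map (node -> set of predecessor nodes).
from graphlib import TopologicalSorter

deps = {"A": set(), "B": set(), "C": {"A", "B"}}  # hypothetical: C depends on A and B

ts = TopologicalSorter(deps)
ts.prepare()
stages = []
while ts.is_active():
    ready = list(ts.get_ready())  # tools whose dependencies are satisfied
    stages.append(ready)          # these can be invoked concurrently
    ts.done(*ready)

# stages -> [["A", "B"], ["C"]] (order within a stage may vary):
# invoke A and B concurrently, then invoke C serially.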

Planner Decision Generation Prompt Part-7.

Please also note:


1. The dependency relationship between tool invocations refers to the necessity of completing the
call to Tool A before running the call to Tool B.
2. For multiple invocations of the same tool, it is necessary to carefully analyze the dependency
relationship of each call, noting that even two calls to the same tool may be interdependent.
3. If you state in your Thought and Plan that tools need to be called in sequence, then the number
of tools in your given Action_List cannot exceed one; otherwise, there will be a logical
contradiction!
4. If you cannot ensure that parallel calls to multiple tools A, B, and C are free of parameter
dependencies and logical dependencies, then please call tools A, B, and C in sequence!
*** Special Circumstances ***
In the following three cases, there is no need to call the ask_user_for_required_parameters tool:
1. If a tool’s parameter is a country’s ISO code, and the user’s mission mentions a specific country,
such as China, you can directly deduce China’s ISO code and fill it in.
2. If a tool’s parameter is a longitude or latitude value, and the user’s mission mentions a specific
location, such as Beijing, you can directly deduce the approximate longitude and latitude values
for Beijing and fill them in.
3. If a tool’s parameter is a time-related parameter (such as start_time, end_time, or other
parameters that include year, month, and day) and not a timestamp type, you can deduce it based
on the current time in the [Environmental Information] and fill it in. At the same time, you need to
explain in your Thought how you deduced the value of the time-related parameter based on the
current time.
*** Other Notes: ***
1. Be sure not to provide comments for parameters, as providing parameter comments will cause
JSON to fail to parse.
"""
{{{all_tool_required_info}}}
[Tool List]="""
{{{tools}}}
"""

Figure 22: Planner Decision Generation Prompt Part-7.
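Special circumstance 3 above allows date-type parameters to be deduced from the current time in [Environmental Information] rather than asked of the user. A minimal sketch of that deduction follows; the helper name, the "last 7 days" mission, and the date formats are our assumptions.

# Minimal sketch (hypothetical helper): derive start_time for a "last 7 days"
# mission from the environment clock instead of asking the user.
from datetime import datetime, timedelta

def deduce_start_time(env_now: str, days_back: int = 7) -> str:
    now = datetime.strptime(env_now, "%Y-%m-%d %H:%M:%S")
    return (now - timedelta(days=days_back)).strftime("%Y-%m-%d %H:%M:%S")

deduce_start_time("2025-04-03 09:00:00")  # -> "2025-03-27 09:00:00"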

Tool Feedback Generation Prompt.

Please act as an external tool, Tool, within a super intelligent agent. These external tools can be
used to solve user missions, as detailed in the [Tool List].
Based on the tool name and input parameters output by the super intelligent agent’s Planner,
simulate the execution results of the tool.
If there are multiple tools in the Action_List provided by the Planner, please simulate each one
separately, ensuring the number matches the Action_List, and store the results in the Observation_List.
Refer to the [Tool Output Format] for the output format.
[Environmental Information]="""
{{{env_info}}}
"""

[Tool Invocation Result Requirements]="""


1. Validate the HTTP method and parameters in the request according to the OpenAPI specification.
2. Generate a response that strictly follows the format specified in the OpenAPI specification and
ensure it is in JSON format.
3. The response should contain real data, avoiding the use of placeholders.
4. Handle edge cases by providing appropriate error responses.
5. For requests without length limitations, such as the GET method, ensure the response returns
3 to 5 samples, and be careful not to use ellipses like // xxx, ... to omit sample information, as the
output must conform to JSON format; otherwise, it will cause JSON parsing errors!
6. Try to simulate responses in English.
"""
[Tool List]="""
{{{tools}}}
"""
[Tool Output Format]="""
Tool:
{
"Observation_List": [
{
"status_code": "Refer to [Tool Invocation Result Requirements] for the HTTP
response status code",
"response": "Refer to [Tool Invocation Result Requirements] to simulate the result
of the action execution. Ensure your response is in JSON format, contains real data, and complies
with the OpenAPI specification format."
}
]
}
"""

Figure 23: Tool Feedback Generation Prompt.
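As an illustration of the output contract above, a simulated Tool turn could be assembled as in the sketch below. This is our illustration: the tool, fields, and values are hypothetical; the only constraints taken from the prompt are that Observation_List matches the Action_List in length and that each response is valid JSON.

# Minimal sketch: one simulated observation per planner action.
import json

action_list = [
    {"name": "get_station_status", "arguments": {"station_id": "ST-102"}},
    {"name": "get_station_status", "arguments": {"station_id": "ST-205"}},
]

tool_turn = {
    "Observation_List": [
        {
            "status_code": 200,
            "response": json.dumps({
                "station_id": act["arguments"]["station_id"],  # hypothetical fields
                "status": "online",
                "bikes_available": 7,
            }),
        }
        for act in action_list
    ]
}

assert len(tool_turn["Observation_List"]) == len(action_list)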

AI Feedback Generation Prompt.

Please act as an Agent assistant within a super intelligent agent, which has a series of external
tools. The Planner within the super intelligent agent can solve user missions by calling external
tools, as detailed in the [Tool List].
You are responsible for interacting with the user. Based on the results returned by the Planner and
Tool, combined with the user mission and the context of the conversation, you provide answers,
and only your answers will be displayed to the user.
Refer to the [Agent Assistant Output Format] for the output format.
[Environmental Information]="""
{{{env_info}}}
"""
[Agent Assistant Output Format]="""
Agent Assistant: According to the [Requirements], reply to the most recent round of content
starting with "User:" in the context conversation information (do not repeat this sentence).
"""
[Requirements]="""
1. The reply must start with "Agent Assistant:".
2. Summarize the user mission from the most recent round starting with "User:" based on the
context conversation information.
3. Use markdown format, and be sure to pay attention to the layout to make it look neat, with two
line breaks between paragraphs.
4. Pay special attention! If the Observation given by the Tool is a list, and each item in the list has
its own ID, such as xxx_id or xxxId, then when summarizing the reply, please retain these IDs for
each item and inform the user!
5. Reply in English.
"""
{{{all_tool_required_info}}}
[Tool List]="""
{{{tools}}}
"""

Figure 24: AI Feedback Generation Prompt.
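Requirement 4 above, retaining per-item IDs when the observation is a list, can be illustrated with the small formatting sketch below; the field names, products, and wording are our hypothetical examples, not part of the released prompts.

# Minimal sketch: summarise a list-type observation while keeping each
# item's ID visible to the user, with blank lines between paragraphs.
items = [
    {"product_id": "P-001", "name": "Sensor A", "price": 19.9},
    {"product_id": "P-002", "name": "Sensor B", "price": 24.5},
]

header = "Agent Assistant: Here are the products I found:"
bullets = "\n".join(
    "- **{0}** (product_id: {1}), ${2}".format(it["name"], it["product_id"], it["price"])
    for it in items
)
reply = header + "\n\n" + bullets  # two line breaks between paragraphs, as required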

