Abstract

...ning capabilities. Users increasingly rely on LLM-based agents to solve complex missions through iterative interactions. However, existing benchmarks predominantly assess agents in single-mission scenarios, failing to capture real-world complexity. To bridge this gap, we propose the Multi-Mission Tool Bench. In the benchmark, each test case comprises multiple interrelated missions. This design requires agents to dynamically adapt to evolving demands. Moreover, the proposed benchmark explores all possible mission-switching patterns within a fixed mission number. Specifically, we propose a multi-agent data generation framework to construct the benchmark. We also propose a novel method to evaluate the accuracy and efficiency of agent decisions with dynamic decision trees. Experiments on diverse open-source and closed-source LLMs reveal critical factors influencing agent robustness and provide actionable insights for the tool-invocation community.¹

* Equal Contributions. † Corresponding authors.
¹ Available at https://github.com/yupeijei1997/MMTB.

1 Introduction

In recent years, large language models (LLMs) have achieved significant progress in natural language processing. These models demonstrate strong capabilities to understand contextual information and user instructions, making them effective agents for mission completion.

Real-world applications require agents to handle dynamic user demands. As users frequently adjust their requests during conversations (Figure 1), agents must complete sequential missions with evolving requirements. This situation challenges the robustness of agents. To address it, we propose the Multi-Mission Tool Bench. This benchmark evaluates agent robustness in related and dynamic multi-mission scenarios. The benchmark addresses three core challenges: 1) it contains more mission-types than other benchmarks, i.e., four major categories and six subcategories; 2) it includes all mission-type transition patterns within a fixed mission number; and 3) all successive missions have strong relations with prior dialogues, so agents are forced to extract information from previous missions. Therefore, it closely mirrors the complexity of real-world scenarios.

Figure 1: A multi-mission example. It contains four related missions, and the mission types change dynamically. This figure presents the conversation between a user and an AI. The intermediate dialogues are hidden.

To simulate all mission-type switching patterns, we first define the mission-types by their corresponding agent action-types. Agent actions are divided into four main types: using a single tool, using multiple tools, chatting with users, and using tools after clarifying parameters. An agent accomplishes a single mission by performing one of these actions. Therefore, we define four types of missions. For sequential missions, agents combine multiple action-types to reach the objectives. Figure 2 a) shows the agent employing a combination of four action-types to complete the four sequential missions in Figure 1. We thus introduce the mission switching space to describe the transformations of mission types. Figure 2 b) shows that our benchmark thoroughly explores the proposed space for a fixed mission number. This indicates that our benchmark includes all mission-type transition patterns. In contrast, other benchmarks have a more limited range of action diversity.

To construct the multi-mission benchmark, we propose a controllable data generation framework with multiple characters. The framework simulates the mission execution process through dialogic interactions among five agents: user, planner, tool, AI, and checker. In each generation process, we
assign the desirable mission type and mission relationship to guide the framework. Ultimately, our benchmark encompasses all potential combinations in the mission switching space for a set number of missions. Notably, a complete mission involves multiple rounds of dialogues.

To evaluate the proposed benchmark, we introduce a novel evaluation method. It assesses the accuracy and efficiency of agent decisions by employing dynamic decision trees.

Finally, we evaluate a range of open-source and closed-source LLMs, encompassing both specialized and general LLMs. Our comprehensive experiments reveal numerous factors influencing the robustness of agent decision-making. These findings offer valuable insights for guiding future research on the development of LLM-agents.

The main contributions of this paper are:

• To the best of our knowledge, this is the first benchmark that assesses agent robustness in related and dynamic multi-mission scenarios.

• We introduce a controllable multi-role data generation framework to explore the action-type space in multi-mission contexts.

• A novel testing method is proposed to evaluate the accuracy and efficiency of dynamic path planning.

• Comprehensive testing of open-source and closed-source LLMs is conducted, revealing various factors that affect the robustness of agent decision making.

Section 4 explains how we build the benchmark. It covers how to create related missions, predefine mission-types, and explore the mission switching space. Section 5 describes the evaluation methods we use for this benchmark. Section 6 shows the test results of LLMs and presents our analysis of these findings.

2 Related Work

2.1 Evaluation of LLMs

Recent benchmarks evaluate the capabilities of LLM-based agents from various points of view. Some research evaluates the generalizability of agents in various scenarios (Li et al., 2024; Trivedi et al., 2024; Liu et al., 2024c). Others (Du et al., 2024; Qin et al., 2024; Ye et al., 2024; Li et al., 2023) collect massive toolsets to investigate the impact of tool diversity on agent performance. Certain studies (Zhuang et al., 2023; Guo et al., 2024; Xie et al., 2024) examine agents within specific domains. While some works (Shen et al., 2024b; Chen et al., 2024; Huang et al., 2024a) provide a comprehensive assessment of multiple agent abilities, others (Huang et al., 2024b; Tang et al., 2023; Qiao et al., 2024a) address specific issues such as the hallucination problem (Patil et al., 2023) and multistep execution capabilities (Shen et al., 2024a; Yao et al., 2024).

Our benchmark assesses agents' overall capabilities, emphasizing the challenges of related and dynamic multi-missions. Importantly, the multistep tasks discussed in previous studies align with our
Figure 2: Visualization of the mission switching space. a) Four distinct colors represent four different action-types. The green dot indicates that the agent sequentially selects four types of actions to execute four missions. b) The distribution of the proposed benchmark within the mission switching space. Each row corresponds to a different number of missions. Each dot indicates a specific combination of the current and preceding action-types. Colored dots indicate combinations included in the benchmark, while gray dots indicate their absence. c) Distribution of four other agent benchmarks in the space.
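To make the size of this space concrete, the following minimal Python sketch enumerates the action-type sequences that Figure 2 b) visualizes row by row. The four type labels are shorthand for the action-types described in the introduction, not identifiers used by the benchmark itself.

from itertools import product

# Shorthand for the four action-types: single tool, multiple tools,
# chatting with the user, and tool use after clarifying parameters.
ACTION_TYPES = ("single", "multi", "chat", "clarify")

def mission_switching_space(max_missions=4):
    """Map each mission count k to all 4**k action-type sequences."""
    return {k: list(product(ACTION_TYPES, repeat=k))
            for k in range(1, max_missions + 1)}

for k, combos in mission_switching_space().items():
    print(k, "mission(s):", len(combos), "combinations")  # 4, 16, 64, 256

With four action-types, a row with k missions therefore contains 4^k combinations, i.e., 4, 16, 64, and 256 dots for one to four missions.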
Benchmark | MutMiss* | RelMiss† (%) | MSSS‡4 (%) | A_single | A_chat | A_clarity | A^S_multi | A^P_multi | A^(S+P)_multi
Ours | ✓ | 100 | 100 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓
BFCL v3 (Charlie Cheng-Jie Ji, a) | ✓ | 15.7 | 39.7 | ✓ | ✓ | ✓ | ✗ | ✓ | ✗
BFCL v1 (Patil et al., 2023) | ✗ | 0.0 | 0.9 | ✓ | ✓ | ✗ | ✗ | ✓ | ✗
BFCL v2 (Charlie Cheng-Jie Ji, b) | ✗ | 0.0 | 0.9 | ✓ | ✓ | ✗ | ✗ | ✓ | ✗
ToolBench (Qin et al., 2024) | ✗ | 0.0 | 0.0 | ✓ | ✗ | ✗ | ✓ | ✗ | ✗
AnyToolBench (Du et al., 2024) | ✗ | 0.0 | 0.0 | ✓ | ✗ | ✗ | ✓ | ✗ | ✗
τ-bench (Yao et al., 2024) | ✗ | 0.0 | 0.0 | ✓ | ✗ | ✗ | ✓ | ✗ | ✗
T-EVAL (Chen et al., 2024) | ✗ | 0.0 | 0.0 | ✓ | ✗ | ✗ | ✓ | ✗ | ✗
UltraTool (Huang et al., 2024a) | ✗ | 0.0 | 0.0 | ✓ | ✗ | ✗ | ✓ | ✗ | ✗

Table 1: Comparative analysis of the Multi-Mission Tool Bench against other benchmarks in the field. The six rightmost columns indicate the mission-types covered. The symbol '*' indicates Multi-Mission, while '†' denotes Related Missions. Moreover, in the four-mission action-type space, the Mission Switching Space Scale (MSSS4) represents the proportion of combination coverage for each dataset relative to all possible combinations.
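Read off the caption, the coverage statistic could be written as follows; this is a hedged reconstruction, since the excerpt does not print the formula, and it leaves open whether the four-mission space counts sequences of exactly four missions or of one to four missions:

\mathrm{MSSS}_4 = \frac{|\mathcal{C}_{\mathrm{covered}}|}{|\mathcal{C}_{\mathrm{all}}|}

where C_all denotes all action-type combinations in the four-mission switching space and C_covered the subset of combinations exercised by a given dataset.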
times to construct the test data. It is important to note that the generation results from both R_tool and R_AI are crucial information provided to the agents during our testing process.
Figure 5: Visualization of the dynamic decision tree during evaluation.
candidate path remains, and verify its correctness by sequential path matching.

Additionally, we calculate two metrics. Success rate: the percentage of valid paths completed. Optimality rate: the percentage of paths that match the minimal number of tool invocations. Appendix C provides formal algorithm specifications.

6 Experiments

The Multi-Mission Tool Bench consists of 1,024 test entries, each containing one to four missions. We divide the test set into four subsets based on the number of missions, with each subset containing 256 entries.

We evaluated 24 state-of-the-art models on the test set, including closed-source general models, open-source general models, and specialized models. Specifically, the closed-source general models are o1-2024-12-17 (OpenAI), GPT-4o-2024-11-20 (Achiam et al., 2023), Gemini-1.5-Pro-002 (Team et al., 2024), Mistral-Large-2411 (Mistral), and doubao-1.5-pro-32k (Doubao). The open-source general models include the Qwen2.5-Instruction-Series (Yang et al., 2024), GLM-4-9B-Chat (GLM et al., 2024), DeepSeek-R1 (Guo et al., 2025), DeepSeek-V3 (Liu et al., 2024a), and Llama-3.3-70B-Instruct (Dubey et al., 2024). The specialized models are ToolACE (Liu et al., 2024b), the Hammer2.1-Series (Lin et al., 2024), watt-tool-8b (Shi et al., 2024), xLAM-7b-fc-r (Zhang et al., 2024a), and gorilla-openfunctions-v2 (Charlie Cheng-Jie Ji, b). Model sizes range from several hundred billion parameters down to 70B, 30B, and the smallest at 0.5B.

This section details the test results and analysis. Subsection 6.1 shows the overall performance. Subsection 6.2 analyzes the effects of the number of missions, mission action-types, and mission switching. Subsection 6.3 explores the impact of inter-mission relationship types. Further error analysis is detailed in Appendix D.

6.1 Overview

This subsection analyzes the accuracy of models on the whole dataset, with Figure 6 showing the accuracy of 15 models. The models are arranged from low to high accuracy, with different colored dots indicating model types and varying dot sizes representing model sizes.

From the analysis of Figure 6, we draw the following conclusions. The o1 model, with strong reasoning capabilities, shows a significant accuracy advantage. Open-source models, such as Qwen-72b, are narrowing the gap with the top closed-source models. General models like DeepSeek-V3 and doubao-1.5-pro perform well in other missions but have a clear weakness in tool utilization. Notably, small specialized models like ToolACE achieve performance comparable to large-scale general models.

Figure 7 illustrates the performance of different-scale models in the Qwen2.5-Instruction-Series and Hammer2.1-Series. As shown, there is a positive correlation between model scale and accuracy. Interestingly, specialized models experience a faster decline in accuracy. More research is needed to explain this phenomenon.

6.2 Impact of Mission Switching

This study examines the impact of mission quantity, mission-type, and mission transition on agent robustness.

Seven models with better overall performance were selected for detailed analysis, including four general models and three specialized models. Figure 8 presents the performance of these models on the various subsets of mission quantities. The results indicate that specialized models perform comparably to stronger general models on single missions.
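To make the two metrics introduced above (success rate and optimality rate) concrete, a minimal sketch of how they could be computed is shown below. The per-case fields, i.e. the agent's executed tool path and the set of valid reference paths produced by the dynamic decision tree, are illustrative assumptions rather than the benchmark's actual data format.

from typing import Dict, List, Sequence

def evaluate(cases: List[dict]) -> Dict[str, float]:
    """Success rate: fraction of cases whose executed tool path matches one of
    the valid reference paths. Optimality rate: fraction whose executed path
    also uses the minimal number of tool invocations among the valid paths."""
    n_success = n_optimal = 0
    for case in cases:
        executed: Sequence[str] = case["executed_path"]
        valid: List[Sequence[str]] = case["valid_paths"]
        if any(list(executed) == list(p) for p in valid):
            n_success += 1
            if len(executed) == min(len(p) for p in valid):
                n_optimal += 1
    total = len(cases)
    return {"success_rate": n_success / total,
            "optimality_rate": n_optimal / total}

# Hypothetical usage with one test case that has two valid reference paths.
cases = [{"executed_path": ["tool_A", "tool_B"],
          "valid_paths": [["tool_A", "tool_B"], ["tool_A", "tool_C", "tool_B"]]}]
print(evaluate(cases))  # {'success_rate': 1.0, 'optimality_rate': 1.0}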
Figure 6: Overall accuracy of agents on the whole benchmark.
Figure 9: Visualization of the robustness of agents in the mission switching space.
Table 2: The performance of agents on various types of missions, and the quantitative evaluation results on A_multi missions. Here, † and ‡ represent closed-source and open-source general models, and ⋆ represents specialized models.
performance, followed by the absence of core components in the problem (ellipsis).

Table 3: The impact of mission relation types on agent performance.
7 Conclusion

This paper introduces a novel multi-mission benchmark to evaluate the robustness of LLM-based agents. Evaluations reveal that current agents exhibit varying degrees of limitations when addressing multi-mission scenarios. Notably, while specialized agents achieve overall accuracy and single-mission performance comparable to general agents, a significant robustness gap emerges in multi-mission contexts. Moreover, all agents struggle with complex multi-tool invocation missions and have shortcomings in related mission handling. We believe that these findings offer valuable insights for guiding future research on the development of LLM-agents.
Limitations

In evaluating LLM-based agents from a multi-mission perspective, we identify specific limitations in both mission duration and the data generation framework.

Firstly, our study aims to enhance the diversity of test data in terms of mission variation, yet the diversity in the number of missions remains limited. Specifically, our test data comprises up to four missions. This limitation arises because the mission switching space expands exponentially with an increase in the number of missions, leading to a rapid enlargement of the test set size and additional workload. Moreover, we observe a swift decline in the precision of the model's output as the number of missions increases, indicating that there is no immediate need to explore the model's performance across a larger number of missions.

Secondly, the proposed data generation framework employs multiple iterations and human intervention to ensure the quality of multi-turn dialogue production. This approach suffers from the limitations of LLMs in accurately following instructions.

In summary, these limitations emphasize the need for ongoing development in the field of LLM-based evaluations.

References

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. Gpt-4 technical report. arXiv preprint.

Charlie Cheng-Jie Ji, Huanzhi Mao, Fanjia Yan, Shishir G. Patil, Tianjun Zhang, Ion Stoica, and Joseph E. Gonzalez. a. Gorilla BFCL v3. https://gorilla.cs.berkeley.edu/leaderboard.html. Accessed: 2025-01-17.

Charlie Cheng-Jie Ji, Huanzhi Mao, Fanjia Yan, Shishir G. Patil, Tianjun Zhang, Ion Stoica, and Joseph E. Gonzalez. b. Gorilla openfunctions v2. https://gorilla.cs.berkeley.edu//blogs/7_open_functions_v2.html. Accessed: 2025-01-17.

Zehui Chen, Weihua Du, Wenwei Zhang, Kuikun Liu, Jiangning Liu, Miao Zheng, Jingming Zhuo, Songyang Zhang, Dahua Lin, Kai Chen, et al. 2024. T-eval: Evaluating the tool utilization capability of large language models step by step. Annual Meeting of the Association for Computational Linguistics, pages 9510–9529.

Doubao. Doubao 1.5pro. https://team.doubao.com/zh/special/doubao_1_5_pro. Accessed: 2025-02-14.

Yu Du, Fangyun Wei, and Hongyang Zhang. 2024. Anytool: Self-reflective, hierarchical agents for large-scale api calls. International Conference on Machine Learning.

Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. 2024. The llama 3 herd of models. arXiv preprint arXiv:2407.21783.

Team GLM, Aohan Zeng, Bin Xu, Bowen Wang, Chenhui Zhang, Da Yin, Dan Zhang, Diego Rojas, Guanyu Feng, Hanlin Zhao, et al. 2024. Chatglm: A family of large language models from glm-130b to glm-4 all tools. arXiv preprint.

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. 2025. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948.

Zishan Guo, Yufei Huang, and Deyi Xiong. 2024. Ctooleval: A chinese benchmark for llm-powered agent evaluation in real-world api interactions. Annual Meeting of the Association for Computational Linguistics, pages 15711–15724.

Senyu Han, Lu Chen, Li-Min Lin, Zhengshan Xu, and Kai Yu. 2024. Ibsen: Director-actor agent collaboration for controllable and interactive drama script generation. Annual Meeting of the Association for Computational Linguistics.

Shijue Huang, Wanjun Zhong, Jianqiao Lu, Qi Zhu, Jiahui Gao, Weiwen Liu, Yutai Hou, Xingshan Zeng, Yasheng Wang, Lifeng Shang, et al. 2024a. Planning, creation, usage: Benchmarking llms for comprehensive tool utilization in real-world complex scenarios. arXiv preprint.

Yue Huang, Jiawen Shi, Yuan Li, Chenrui Fan, Siyuan Wu, Qihui Zhang, Yixin Liu, Pan Zhou, Yao Wan, Neil Zhenqiang Gong, et al. 2024b. Metatool benchmark for large language models: Deciding whether to use tools and which to use. International Conference on Learning Representations.

Md Ashraful Islam, Mohammed Eunus Ali, and Md Rizwan Parvez. 2024. Mapcoder: Multi-agent code generation for competitive problem solving. Annual Meeting of the Association for Computational Linguistics.

Manling Li, Shiyu Zhao, Qineng Wang, Kangrui Wang, Yu Zhou, Sanjana Srivastava, Cem Gokmen, Tony Lee, Li Erran Li, Ruohan Zhang, et al. 2024. Embodied agent interface: Benchmarking llms for embodied decision making. Conference on Neural Information Processing Systems.

Minghao Li, Yingxiu Zhao, Bowen Yu, Feifan Song, Hangyu Li, Haiyang Yu, Zhoujun Li, Fei Huang, and Yongbin Li. 2023. Api-bank: A comprehensive benchmark for tool-augmented llms. Proceedings
of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 3102–3116.

Qiqiang Lin, Muning Wen, Qiuying Peng, Guanyu Nie, Junwei Liao, Jun Wang, Xiaoyun Mo, Jiamu Zhou, Cheng Cheng, Yin Zhao, et al. 2024. Hammer: Robust function-calling for on-device language models via function masking. arXiv preprint.

Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. 2024a. Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437.

Weiwen Liu, Xu Huang, Xingshan Zeng, Xinlong Hao, Shuai Yu, Dexun Li, Shuai Wang, Weinan Gan, Zhengying Liu, Yuanqing Yu, et al. 2024b. Toolace: Winning the points of llm function calling. arXiv preprint.

Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, et al. 2024c. Agentbench: Evaluating llms as agents. International Conference on Learning Representations.

meetkai. functionary-meetkai. https://functionary.meetkai.com/. Accessed: 2025-01-17.

Mistral. Au large. https://mistral.ai/en/news/mistral-large. Accessed: 2025-02-14.

OpenAI. o1 and o1-mini. https://platform.openai.com/docs/models#o1. Accessed: 2025-02-14.

Shishir G Patil, Tianjun Zhang, Xin Wang, and Joseph E Gonzalez. 2023. Gorilla: Large language model connected with massive apis. arXiv preprint.

Shuofei Qiao, Runnan Fang, Zhisong Qiu, Xiaobin Wang, Ningyu Zhang, Yong Jiang, Pengjun Xie, Fei Huang, and Huajun Chen. 2024a. Benchmarking agentic workflow generation. arXiv preprint.

Shuofei Qiao, Ningyu Zhang, Runnan Fang, Yujie Luo, Wangchunshu Zhou, Yuchen Eleanor Jiang, Chengfei Lv, and Huajun Chen. 2024b. Autoact: Automatic agent learning from scratch via self-planning. Annual Meeting of the Association for Computational Linguistics.

Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, et al. 2024. Toolllm: Facilitating large language models to master 16000+ real-world apis. International Conference on Learning Representations.

Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. 2023. Toolformer: Language models can teach themselves to use tools. Advances in Neural Information Processing Systems, 36:68539–68551.

Haiyang Shen, Yue Li, Desong Meng, Dongqi Cai, Sheng Qi, Li Zhang, Mengwei Xu, and Yun Ma. 2024a. Shortcutsbench: A large-scale real-world benchmark for api-based agents. arXiv preprint.

Yongliang Shen, Kaitao Song, Xu Tan, Wenqi Zhang, Kan Ren, Siyu Yuan, Weiming Lu, Dongsheng Li, and Yueting Zhuang. 2024b. Taskbench: Benchmarking large language models for task automation. International Conference on Learning Representations Workshop on Large Language Model (LLM) Agents.

Wentao Shi, Mengqi Yuan, Junkang Wu, Qifan Wang, and Fuli Feng. 2024. Direct multi-turn preference optimization for language agents. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 2312–2324.

Alkesh K Srivastava and Philip Dames. 2024. Speech-guided sequential planning for autonomous navigation using large language model meta ai 3 (llama3). arXiv preprint.

Xingwu Sun, Yanfeng Chen, Yiqing Huang, Ruobing Xie, Jiaqi Zhu, Kai Zhang, Shuaipeng Li, Zhen Yang, Jonny Han, Xiaobo Shu, et al. 2024. Hunyuan-large: An open-source moe model with 52 billion activated parameters by tencent. arXiv preprint.

Qiaoyu Tang, Ziliang Deng, Hongyu Lin, Xianpei Han, Qiao Liang, Boxi Cao, and Le Sun. 2023. Toolalpaca: Generalized tool learning for language models with 3000 simulated cases. arXiv preprint.

Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, et al. 2024. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint.

Harsh Trivedi, Tushar Khot, Mareike Hartmann, Ruskin Manku, Vinty Dong, Edward Li, Shashank Gupta, Ashish Sabharwal, and Niranjan Balasubramanian. 2024. Appworld: A controllable world of apps and people for benchmarking interactive coding agents. Annual Meeting of the Association for Computational Linguistics, pages 16022–16076.

Jian Xie, Kai Zhang, Jiangjie Chen, Tinghui Zhu, Renze Lou, Yuandong Tian, Yanghua Xiao, and Yu Su. 2024. Travelplanner: A benchmark for real-world planning with language agents. International Conference on Machine Learning.

Heng-Da Xu, Xian-Ling Mao, Puhai Yang, Fanshu Sun, and He-Yan Huang. 2024. Rethinking task-oriented dialogue systems: From complex modularity to zero-shot autonomous agent. Annual Meeting of the Association for Computational Linguistics, pages 2748–2763.

An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, et al. 2024. Qwen2.5 technical report. arXiv preprint.
Shunyu Yao, Noah Shinn, Pedram Razavi, and Karthik Narasimhan. 2024. τ-bench: A benchmark for tool-agent-user interaction in real-world domains. arXiv preprint.

Yuchen Zhuang, Yue Yu, Kuan Wang, Haotian Sun, and Chao Zhang. 2023. Toolqa: A dataset for llm question answering with external tools. Conference on Neural Information Processing Systems, 36:50117–50143.

A Diverse Toolset Construction

We generate the toolset based on tool descriptions from public-apis, following the ToolAlpaca approach. This API repository contains 400 tool lists, corresponding to 1,600 tools in 50 categories.

In contrast to ToolAlpaca, our approach includes three strategies to enhance tool accuracy and parameter variety. Initially, we utilize LLMs such as GPT to refine tool descriptions, addressing the common issue of missing constraint parameters in generated tools. For instance, a tool description for querying Spanish weather does not mention Spain in any of its three specific functions, so the generated tool cannot validate the query location. Second, we expand parameter types to include complex data structures such as enumerations, arrays, and objects, aligning better with real-world scenarios. Finally, five LLM agent experts review the generated tools. These steps ensure the tools' accuracy and parameter diversity.

B Analysis of the Test Data

Figures 10, 11, and 12 present the proposed dataset from the following three perspectives.

B.1 Data Examples

We present two more examples of mission execution corresponding to the examples in Section 5. Figure 13 illustrates the execution of the optimal path, while Figure 14 shows a non-optimal path execution.

C Details of Proposed Evaluation Method

1. Initialize the graph G, the indegree table, the visitation table, the current path, and the set of all paths.

2. Perform topological sorting and depth-first traversal based on parallel combination and permutation.

2.1 For each search, find all nodes with an indegree of 0 and arrange all possible combinations based on the number of nodes. Specifically, since nodes with an indegree of 0 are independent, they can be combined arbitrarily. When the number of nodes in a combination is greater than 1, it indicates that these nodes can be called in parallel. It is this method that allows our algorithm to enumerate all possible paths, including parallel and serial-parallel calls, as opposed to being limited to serial calls only as in naive topological sorting.

2.2 Traverse each combination, add the combination to the current path, and update the indegree and visitation tables.

2.3 Continue the depth-first traversal until the number of nodes in the path matches the number of nodes in the graph.
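A minimal Python sketch of the enumeration procedure in steps 1–2.3 is given below. The tool names and the edge-list representation of dependencies are illustrative assumptions, not the benchmark's actual interface; the point is to show how branching over every non-empty subset of zero-indegree nodes yields serial, parallel, and serial-parallel call orders.

from itertools import combinations

def enumerate_paths(nodes, edges):
    """Enumerate every serial / parallel / serial-parallel execution order of a
    tool-dependency DAG. A path is a list of steps; each step is a tuple of
    tools invoked in parallel (an edge (u, v) means v depends on u)."""
    indegree = {n: 0 for n in nodes}
    successors = {n: [] for n in nodes}
    for u, v in edges:
        indegree[v] += 1
        successors[u].append(v)

    all_paths = []

    def dfs(path, indeg, visited):
        if len(visited) == len(nodes):           # step 2.3: path covers the graph
            all_paths.append(list(path))
            return
        ready = sorted(n for n in nodes if n not in visited and indeg[n] == 0)
        # step 2.1: every non-empty subset of the ready nodes is one parallel group
        for size in range(1, len(ready) + 1):
            for group in combinations(ready, size):
                new_indeg = dict(indeg)
                for n in group:                  # step 2.2: update indegrees
                    for m in successors[n]:
                        new_indeg[m] -= 1
                dfs(path + [group], new_indeg, visited | set(group))

    dfs([], indegree, set())                     # step 1: start from the empty path
    return all_paths

# Hypothetical dependency graph: tool_C needs the outputs of tool_A and tool_B.
paths = enumerate_paths({"tool_A", "tool_B", "tool_C"},
                        [("tool_A", "tool_C"), ("tool_B", "tool_C")])
# Yields the fully parallel schedule [("tool_A", "tool_B"), ("tool_C",)] as well
# as the purely serial ones such as [("tool_A",), ("tool_B",), ("tool_C",)].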
E Partial Role Prompts of Agents

E.1 Role Prompt for Mission Generation

We show the role prompt for single-mission generation in Figure 15.
Figure 13: An Optimal Path Example.
Single Tool Invocation Mission Generation Prompt.
[Requirements]="""
1. The description of the user’s mission must include information on all the required parameters
needed to call {{{tool}}}. For other optional parameters, please add them as you see fit, using
natural language.
2. The user’s missions should use different types of sentence structures: imperative, declarative,
interrogative, etc.
3. The user’s missions should include different tones: colloquial, formal, polite, direct, etc.
4. Ensure that the length of the user’s missions varies, gradually increasing from short to long.
5. Ensure that the user’s missions involve different themes/instances, different scenarios, and
different roles.
6. Extract common entities that appear in all descriptions from the [Tool List] and ensure that these
entities appear in the user’s missions.
7. Do not explicitly specify the tool {{{tool}}} in the user’s missions.
"""
[Tool List]="""
{{{tool}}}
"""
[Format]="""
{
"mission 1": "xxx",
"mission 2": "xxx",
"mission 3": "xxx",
"mission 4": "xxx",
"mission 5": "xxx",
}
"""
Planner Decision Generation Prompt Part-1.
Planner Decision Generation Prompt Part-2.
[Requirements]="""
*** Special Attention ***
1. When making a decision, please ensure that the tool you invoke from the [Tool List] is suitable
for solving the user’s mission based on the definition of the tools in the list. Do not force the use of
inappropriate tools to solve the user’s mission; instead, call the appropriate tool from the [Tool
List] according to the user’s mission.
2. Ensure that the Action_List you provide does not contradict the Plan you have set out. The
order of tools in the given Action_List should be consistent with the sequence planned in the Plan.
3. For optional parameters, you only need to fill them in if the user has provided a value that is
different from the default or if there is no default value. Otherwise, there is no need to include
them in the arguments.
*** The prepare_to_answer tool needs to be called in the following two scenarios: ***
1. If you believe that the user’s mission can be completed, then call the prepare_to_answer tool to
provide a summary response, with the answer_type parameter set to ‘tool’.
2. If you believe that the user’s mission does not require the use of any tools from the [Tool List]
or that there is no suitable tool to solve the user’s mission and it can be answered directly, then call
the prepare_to_answer tool, with the answer_type parameter set to ‘chat’.
Note:
1) The absence of a suitable tool in the [Tool List] to solve the user’s mission does not mean
that you lack the ability to answer. Please respond based on the context information and the
knowledge you possess. Do not excessively refuse to answer, nor imagine knowledge you do not
have. Only refuse to answer when you cannot respond based on the context information and your
own knowledge.
2) The absence of a suitable tool in the [Tool List] to solve the user’s mission also includes the
following situation:
First, analyze the common entities that appear in each tool. For example, some tools can only
query data related to a certain entity A. If the user asks about entity B, it also means that there is
no suitable tool.
For instance:
- If the tools in the [Tool List] can only query and analyze population data for Denmark, and the
user asks for population data for Sweden, then you should also call the prepare_to_answer tool.
- If the tools in the [Tool List] can only query weather data for China, including current and
historical weather, and the user asks for weather data for the United States, then you should also
call the prepare_to_answer tool.
Planner Decision Generation Prompt Part-4.
*** There are four scenarios in which the ask_user_for_required_parameters tool needs to be
invoked: ***
1. If you believe that a user’s mission requires the use of a tool from the [Tool List], but the user’s
mission is missing some required parameters from the tool, and the user needs to provide the
necessary information, then invoke the ask_user_for_required_parameters tool. Please do not
hallucinate parameters.
2. Please note that you are unable to deduce the values of some tool parameters on your own
and will need to invoke the ask_user_for_required_parameters tool to ask the user. Please do not
hallucinate parameters.
For example:
1) For the timestamp parameter, you do not have the ability to deduce the timestamp based
on time. However, you can deduce other time-related parameters (start_time, end_time,
etc.) on your own based on [Environmental Information], without needing to invoke the
ask_user_for_required_parameters tool.
2) For ID-type parameters (station_id, product_id, etc.), you do not have the ability to deduce the
corresponding ID based on the name.
3. Based on the context of the conversation, if you have already asked the user once to provide the
necessary information but the user still has not provided all the required parameters, then please
continue to invoke the ask_user_for_required_parameters tool.
4. If the user provides incomplete parameter values, such as the tool parameter being an IP address
(ip_address), but the user provides an incomplete IP address (e.g., 192.22), please continue to use
the ask_user_for_required_parameters tool to ask the user for the complete IP address.
Finally, if you confirm the need to invoke the ask_user_for_required_parameters tool, provide the
inquiry plan in the format: "Ask the user to provide xxx, in order to invoke the xxx tool to xxx" in
the Plan.
Planner Decision Generation Prompt Part-5.
*** There are eight scenarios in which multiple tools need to be invoked: ***
If a user mission involves invoking multiple tools, please first analyze the dependency relation-
ships between the multiple invocation tools. For tools that do not have invocation dependencies,
perform concurrent invocations, and for tools that do have invocation dependencies, perform serial
invocations. Specifically, you can handle each of the following eight scenarios separately:
Concurrent invocation scenarios:
1. If you determine that the user mission requires multiple invocations of the same tool A, but
with different parameters for each invocation of tool A, then please invoke tool A concurrently and
provide the concurrent invocation plan in the Plan in the format: "Concurrently invoke tool A N
times for xxx."
2. If you determine that the user mission requires the invocation of different tools, such as tools
A and B, and there is no dependency between tool A and B, then please invoke tools A and B
concurrently, and provide the concurrent invocation plan in the Plan in the format: "Concurrently
invoke tool A for xxx, while invoking tool B for xxx."
Serial invocation scenarios:
3. If you determine that the user mission requires the invocation of different tools, such as tools A,
B, and C, and there are dependencies between these tools, then please invoke tools A, B, and C
serially, and provide the serial invocation plan in the Plan in the format: "First, invoke tool A for
xxx. Then, invoke tool B for xxx. Next, invoke tool C for xxx. Now, I will invoke tool A for xxx."
Serial invocation has the following two dependency scenarios:
3.1. Parameter dependency: For example, before invoking tool C, it is necessary to first invoke
tool B to obtain the result as an input parameter, and before invoking tool B, it is necessary to first
invoke tool A to obtain the result as an input parameter. Therefore, you need to first complete the
invocation of tool A to obtain the result, use it as the input parameter for invoking tool B, and
after obtaining the result from tool B, use it as the input parameter for invoking tool C, i.e., please
invoke tools A, B, and C serially.
3.2. Logical dependency: Even if there is no parameter dependency between the invocation of
tools A, B, and C, but there is a logical dependency, such as logically needing to invoke tool B
before tool C, and tool A before tool B, then please also invoke tools A, B, and C serially.
Planner Decision Generation Prompt Part-6.
Planner Decision Generation Prompt Part-7.
Tool Feedback Generation Prompt.
Please act as an external tool, Tool, within a super intelligent agent. These external tools can be
used to solve user missions, as detailed in the [Tool List].
Based on the tool name and input parameters output by the super intelligent agent’s Planner,
simulate the execution results of the tool.
If there are multiple tools in the Action_List provided by the Planner, please simulate each one sep-
arately, ensuring the number matches the Action_List, and store the results in the Observation_List.
Refer to the [Tool Output Format] for the output format.
[Environmental Information]="""
{{{env_info}}}
"""
AI Feedback Generation Prompt.
Please act as an Agent assistant within a super intelligent agent, which has a series of external
tools. The Planner within the super intelligent agent can solve user missions by calling external
tools, as detailed in the [Tool List].
You are responsible for interacting with the user. Based on the results returned by the Planner and
Tool, combined with the user mission and the context of the conversation, you provide answers,
and only your answers will be displayed to the user.
Refer to the [Agent Assistant Output Format] for the output format.
[Environmental Information]="""
{{{env_info}}}
"""
[Agent Assistant Output Format]="""
Agent Assistant: According to the [Requirements], reply to the most recent round of content
starting with "User:" in the context conversation information (do not repeat this sentence).
"""
[Requirements]="""
1. The reply must start with "Agent Assistant:".
2. Summarize the user mission from the most recent round starting with "User:" based on the
context conversation information.
3. Use markdown format, and be sure to pay attention to the layout to make it look neat, with two
line breaks between paragraphs.
4. Pay special attention! If the Observation given by the Tool is a list, and each item in the list has
its own ID, such as xxx_id or xxxId, then when summarizing the reply, please retain these IDs for
each item and inform the user!
5. Reply in English.
"""
{{{all_tool_required_info}}}
[Tool List]="""
{{{tools}}}
"""