On the Robustness of Agentic Function Calling

Ella Rabinovich                    Ateret Anaby-Tavor
IBM Research                       IBM Research

arXiv:2504.00914v1 [cs.CL] 1 Apr 2025
Abstract

Large Language Models (LLMs) are increasingly acting as autonomous agents, with function calling (FC) capabilities enabling them to invoke specific tools for tasks. While prior research has primarily focused on improving FC accuracy, little attention has been given to the robustness of these agents to perturbations in their input. We introduce a benchmark assessing FC robustness in two key areas: resilience to naturalistic query variations, and stability in function calling when the toolkit expands with semantically related tools. Evaluating best-performing FC models on a carefully expanded subset of the Berkeley function calling leaderboard (BFCL), we identify critical weaknesses in existing evaluation methodologies, and highlight areas for improvement in real-world agentic deployments.

1 Introduction

Large Language Models (LLMs) are reshaping artificial intelligence, shifting from static language processors to dynamic, task-oriented agents capable of planning, executing, and refining their actions. These agents hold the potential for transformative applications across various domains, including healthcare (Abbasian et al., 2023; Mehandru et al., 2024), finance (Li et al., 2024; Xiao et al., 2024; Ding et al., 2024), education (Yang et al., 2024; Xu et al., 2024), and customer support (Huang et al., 2024; Rome et al., 2024). LLM agents are increasingly positioned as routing systems that can act independently, make decisions, and perform tasks with minimal human intervention.

Agentic Function Calling  Function calling (FC), the process by which an agent autonomously selects and invokes a specific function to retrieve information or execute a task, serves as a fundamental building block of an agentic system. In this context, a full execution trajectory can be seen as a complex, multi-turn (i.e., involving user interaction) sequence of function calls, ultimately achieving a given goal. Models specifically optimized for FC are typically designed to generate a function call in response to a natural-language user request (Bai et al., 2023; Dubey et al., 2024; Zhang et al., 2024). The function (also known as a tool) is chosen from a predefined "toolkit", a compact set of function descriptions [1], provided as part of the model's prompt. The agent is expected to produce a syntactically correct tool invocation, ensuring that parameter values are appropriately assigned to function arguments (a process known as slot filling). For instance, given the query "What is the record for the highest number of points scored by a single player in an NBA game?" and the compact json tool description in Figure 1 (top), the model is expected to generate the invocation code shown in Figure 1 (bottom). Several datasets and evaluation methodologies have been proposed to assess LLMs' function calling capabilities (Patil et al., 2023; Liu et al., 2024), and various benchmarks have been created for evaluating a range of FC scenarios, with the BFCL leaderboard (Patil et al., 2023) among the most prominent.

[1] Descriptions are often provided in the json format.

Figure 1: Compact function definition example (top), and the agent's output triggering the function call with assigned parameter values (bottom), for the user request "What is the record for the highest number of points scored by a single player in an NBA game?".
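To make the setting concrete, a compact json definition in the spirit of Figure 1, together with the invocation the agent is expected to produce, might look as follows. This is an illustrative sketch rather than the actual BFCL entry; in particular, the league parameter and its description are hypothetical.

{
  "name": "basketball.most_points_single_game",
  "description": "Get the record for the most points scored by a single player in one game of a given basketball league.",
  "parameters": {
    "type": "object",
    "properties": {
      "league": {"type": "string", "description": "The basketball league, e.g., NBA."}
    },
    "required": ["league"]
  }
}

basketball.most_points_single_game(league="NBA")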
Robustness of Large Language Models  In the context of more "traditional" LLM usage, model robustness quantifies an LLM's ability to generate semantically equivalent outputs given semantically equivalent inputs (Raj et al., 2023; Rabinovich et al., 2023; Ackerman et al., 2024). Robustness benchmarks assess, among other factors, how well LLMs handle naturally-occurring, non-malicious perturbations in user input, such as paraphrased questions in a QA task, typos, or variations in punctuation, whitespace, or diacritics. Extending this notion to agentic FC would require a model to produce an equivalent tool invocation despite naturalistic, yet strictly meaning-preserving, perturbations in the input query. Considering Figure 1, a semantically equivalent paraphrase "What is the highest number of points ever scored by a single player in an NBA game?" should result in the same tool invocation as the original request.

Despite its clear practical significance, research on the robustness of agentic function calling remains sparse, with only two studies, to the best of our knowledge, examining agent resilience to modifications in tool descriptions. Ye et al. (2024) introduce a series of increasingly aggressive alterations to function names, parameter names, and their descriptions, to the point where a tool (or parameter) name or description becomes arbitrary or entirely uninformative about its functionality. Similarly, Lu et al. (2024) conduct multiple interventions, including tool distractions, within a different evaluation framework that evaluates tool sequencing at the system rather than function level. While these studies offer valuable insights, they provide limited evidence on agent resilience to real-world perturbations, as system developers typically exert substantial control over the faithfulness and level of detail in function and parameter names, along with their descriptions.

Moreover, a typical "toolkit" (the list of available functions) in these studies is limited to a single tool or a small number of unrelated tools. A realistic scenario may involve a system specification with thousands of available tools [2], which in practice is normally reduced to the top-K most relevant function definitions through a shortlisting module (Qin et al., 2023), such as semantic search over the set of tools, towards constructing the context (here, the prompt) of an FC agent. In the example toolkit in Figure 1 (top), additional tools may include: basketball.most_points_career(), basketball.most_points_single_season(), basketball.game_stats().

[2] A software engineering (SWE) agent fixing git issues has access to about 1.2K tools exposed through github docs.
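To illustrate what such a shortlisting module might look like, the sketch below ranks candidate tool definitions by the semantic similarity of their descriptions to the user query and keeps the top-K. It is a minimal example under stated assumptions: the paper does not prescribe a particular retriever, and the all-MiniLM-L6-v2 encoder, the toy tool descriptions, and the value of K are all illustrative.

from sentence_transformers import SentenceTransformer, util

def shortlist_tools(query: str, tool_descriptions: dict[str, str], k: int = 5) -> list[str]:
    """Return the names of the k tool definitions most similar to the query."""
    model = SentenceTransformer("all-MiniLM-L6-v2")  # assumption: any bi-encoder works here
    names = list(tool_descriptions)
    tool_emb = model.encode([tool_descriptions[n] for n in names], convert_to_tensor=True)
    query_emb = model.encode(query, convert_to_tensor=True)
    scores = util.cos_sim(query_emb, tool_emb)[0]     # cosine similarity of the query to each tool
    ranked = sorted(zip(names, scores.tolist()), key=lambda p: p[1], reverse=True)
    return [name for name, _ in ranked[:k]]

# Example usage with the toolkit sketched above (the descriptions are hypothetical).
tools = {
    "basketball.most_points_single_game": "Most points scored by a single player in one game.",
    "basketball.most_points_career": "Most points scored by a player over an entire career.",
    "basketball.game_stats": "Statistics of a specific basketball game.",
}
print(shortlist_tools("What is the record for the highest number of points "
                      "scored by a single player in an NBA game?", tools, k=2))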
Contribution  We focus on two aspects of robustness, capturing input variations that can be expected in real-world agentic deployments but are not easily controlled by a developer: (1) generating meaning-preserving rephrasings of user requests, and (2) expanding the toolkit to include a set of semantically related tools that are likely to be shortlisted by a selection module. Using one of the (single-turn) challenging BFCL (Patil et al., 2023) test sets as our starting point, we first carefully build a benchmark dataset comprising variations pertaining to the two aforementioned aspects (Section 2). Next, we evaluate the robustness of several best-performing LLMs [3], and discuss the breakdown of failures, highlighting (among others) prominent weaknesses of existing agentic FC evaluation benchmarks (Section 3). Our benchmark data is available at https://huggingface.co/datasets/ibm-research/BFCL-FC-robustness.

[3] According to the BFCL leaderboard (Jan 2025).

2 Dataset Generation

We next provide details on the generation of our benchmark dataset. Specifically, we describe (1) the creation of meaning-preserving rephrasings of user requests, and (2) the expansion of the toolkit to include a set of semantically related tools.

2.1 User Query Perturbations

Building on the study by Ackerman et al. (2024), who tested LLMs' sensitivity to paraphrased user queries in QA and classification settings, we investigate whether agents' FC capabilities remain robust to meaning-preserving variations in user requests. Here, the task presents an additional challenge, as the rewording must strictly maintain precise parameter values to ensure accurate slot filling for the sake of evaluation.
Figure 2: Toolkit expansion steps: (1) request variants are generated using the Llama3.1-70B model (Dubey et al., 2024), (2) function json definitions for executing these requests are generated using the Code-Llama-13B model (Roziere et al., 2023), and (3) a filtering step is applied to filter out tools semantically identical to any of the original functions. The process is complete when the expanded toolkit is created for testing the original query.

original request    What is the record for the most points scored by a single player in an NBA game?
original toolkit    basketball.most_points_single_game(...)
request variants    Who holds the record for the highest number of assists made by a female basketball player?
                    What is the longest winning streak in NBA history?
                    ...
additional tools    basketball.most_points_career(...)
                    basketball.records_history(...)
                    ...

Table 1: Toolkit expansion steps: request variants and the additional tools addressing those variants.

For instance, the request "Calculate the depreciated value of a property costing $200,000 with an annual depreciation rate of 3% for 5 years." can be safely rephrased as "Determine the value of a $200,000 asset which loses 3 percent of its worth each year, after five years." Contemporary LLMs handle this task effectively, and we used the Llama3.1-70B model (Dubey et al., 2024) with appropriate prompting and in-context learning. A manual review of 50 examples by one of the authors revealed no instances of semantic drift or parameter misalignment. Appendix 7.1 provides details on the prompt used for this task.

A substantial portion of the paraphrases targeted named entities, which are natural candidates for surface form variability. For instance, the user query "What is the humidity level in Miami,Florida in the upcoming 7 days?" was rephrased as "How will the humidity levels change over the next seven days in Miami,FL?". These seemingly minor modifications led to a notable drop in benchmark performance; we analyze and interpret this decline, and propose strategies to mitigate it, in Section 3.
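For concreteness, a rephrasing call of this kind might be issued as sketched below. The paper does not specify the inference stack, so the OpenAI-compatible endpoint, its URL, and the exact model identifier are assumptions; the system prompt is an abridged version of the one in Appendix 7.1.

from openai import OpenAI

REPHRASE_SYSTEM_PROMPT = (
    "You are a helpful assistant helping rephrasing user requests, while accurately "
    "preserving their meaning, including numbers and names if exist. Do not answer the "
    "requirement, just produce another one that is identical in meaning but is phrased "
    "differently. Produce ONLY the rephrased requirement, without further thoughts or "
    "explanations."
)

def rephrase(request: str) -> str:
    """Generate a strictly meaning-preserving rephrasing of a user request."""
    # Assumption: Llama3.1-70B served behind an OpenAI-compatible endpoint.
    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
    response = client.chat.completions.create(
        model="meta-llama/Llama-3.1-70B-Instruct",
        messages=[
            {"role": "system", "content": REPHRASE_SYSTEM_PROMPT},
            {"role": "user", "content": request},
        ],
        temperature=0.0,
    )
    return response.choices[0].message.content.strip()

print(rephrase("Calculate the depreciated value of a property costing $200,000 "
               "with an annual depreciation rate of 3% for 5 years."))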
2.2 Expanding Agent's Toolkit

Aiming at expanding the (originally) "thin" agent's toolkit, simulating the scenario where function definitions are retrieved by a shortlister, we follow the steps illustrated in Figure 2 and outlined in Table 1.

(1) We generate related yet different request variants using the Llama3.1-70B model (Dubey et al., 2024); see Appendix 7.2 for the detailed prompt.

(2) For each request variant, a tool definition is generated to enable request fulfillment. Here, we used the CodeLlama-13B model (Roziere et al., 2023) with a carefully designed prompt and few-shot examples, ensuring that the generated definitions conform not only to the required json format but also to the naming conventions, style, and level of detail in function and parameter descriptions. Notably, based on our manual inspection, the style of the generated tool definitions is indistinguishable from that of the original function(s).

(3) In rare cases, a generated tool was found to be strictly functionally equivalent to the original one, despite differences in name, description, or parameter order (see Appendix 7.3). We eliminate such cases by (a) concatenating the original tool properties into a "signature," and (b) filtering out any newly generated tool whose "signature" exceeded a predefined similarity threshold to the original tool, as measured via cosine similarity of their embeddings, computed using the sentence-transformers module (Reimers and Gurevych, 2019).
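The de-duplication in step (3) can be sketched as follows. The "signature" construction and the 0.8 threshold follow the description in this section and Appendix 7.3, while the choice of encoder and the exact json fields traversed are assumptions.

from sentence_transformers import SentenceTransformer, util

SIMILARITY_THRESHOLD = 0.8  # threshold reported in Appendix 7.3
encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumption: encoder not specified in the paper

def signature(tool: dict) -> str:
    """Concatenate the function name/description and parameter names/descriptions."""
    parts = [tool["name"], tool.get("description", "")]
    for pname, pspec in tool.get("parameters", {}).get("properties", {}).items():
        parts.append(pname)
        parts.append(pspec.get("description", ""))
    return " ".join(parts)

def is_duplicate(candidate: dict, original_tools: list[dict]) -> bool:
    """True if the candidate tool is semantically identical to any original tool."""
    cand_emb = encoder.encode(signature(candidate), convert_to_tensor=True)
    orig_emb = encoder.encode([signature(t) for t in original_tools], convert_to_tensor=True)
    return util.cos_sim(cand_emb, orig_emb).max().item() > SIMILARITY_THRESHOLD

def filter_generated_tools(generated: list[dict], original_tools: list[dict]) -> list[dict]:
    """Keep only generated tools that are distinct from every original tool."""
    return [t for t in generated if not is_duplicate(t, original_tools)]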
Table 1 presents an example original request (and its tool), along with the expansion process: additional (related but not strictly identical) request variants, and additional tools fulfilling those additional requests. The mean number of tools in the expanded toolkit is 5.6, compared to 2.7 (seemingly unrelated) tools in the original BFCL dataset, meaning that three semantically related functions were added on average to each of the 200 test cases. Next, we evaluate the FC performance of multiple agents using the generated benchmark.

3 Agentic FC Robustness Evaluation

3.1 Experimental Setup

Models  We evaluate several top-performing LLMs from the BFCL leaderboard, both API-accessible and locally hosted, as FC agents. Closed models include GPT4o-mini and o1-mini [4], as well as Claude-3.5-Haiku and Claude-3.5-Sonnet [5]. Locally hosted models include Llama3.1-70B and its more advanced version Llama3.3-70B (Dubey et al., 2024), Granite3.1-8B-instruct (Granite Team, 2024), DeepSeek-v2.5 (DeepSeek-AI, 2024), and Qwen2.5-72B (Qwen Team, 2024).

[4] https://platform.openai.com/docs/models
[5] https://www.anthropic.com/claude

Evaluation Approach  BFCL employs a two-phase FC evaluation approach: (1) assessment of the generated tool call through the tree-matching abstract syntax tree (AST) methodology, and (2) evaluation of the tool execution in a simulated environment (Patil et al., 2023). Our focus in this study is the evaluation of FC construction given interventions in the agent's input; we therefore adhere to the first evaluation phase, namely AST. A robust agent will generate a correct function call regardless of the precise request wording and of its toolkit size: "thin" (as it comes with the original benchmark), or expanded, simulating a shortlister selection.
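As a much-simplified illustration of AST-style matching (BFCL's actual tree matching handles richer structures, type coercion, and execution checks), the sketch below parses a generated call with Python's ast module and compares the function name and keyword arguments against a ground-truth specification; the ground-truth format and the days parameter in the usage example are assumptions.

import ast

def ast_match(generated_call: str, expected_fn: str, expected_args: dict) -> bool:
    """Check that a generated call invokes the expected function with acceptable keyword arguments."""
    try:
        node = ast.parse(generated_call, mode="eval").body
    except SyntaxError:
        return False                      # syntactically invalid call
    if not isinstance(node, ast.Call):
        return False
    fn_name = ast.unparse(node.func)      # e.g., "weather.humidity_forecast"
    if fn_name != expected_fn:
        return False
    args = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
    # Exact-match comparison of parameter values, allowing a list of acceptable values.
    for param, allowed in expected_args.items():
        allowed = allowed if isinstance(allowed, list) else [allowed]
        if args.get(param) not in allowed:
            return False
    return True

# The Miami example discussed below: "Miami, FL" is not among the listed options, so the check fails
# (the days parameter is illustrative).
print(ast_match('weather.humidity_forecast(location="Miami, FL", days=7)',
                "weather.humidity_forecast",
                {"location": ["Miami", "Miami, Florida", "FL"], "days": 7}))  # -> False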
3.2 Experimental Results

We report AST scores averaged over the 200 dataset examples for three variants: (a) the original version, (b) the original ("thin") toolkit + rephrased user request, and (c) the expanded toolkit + original user request. Table 2 (left) reports the results. Several insights can be drawn from the figures:

FC Evaluation Approach Weakness(es)  A notable (and somewhat unexpected) drop occurs when evaluating the original toolkit on a rephrased request. Closer examination of errors reveals a significant weakness in the common approach to FC evaluation, specifically in handling arguments that can accept several equivalently valid values (e.g., named entities). Consider the request: "What is the humidity level in Miami,Florida in the upcoming 7 days?". The expected response includes the function weather.humidity_forecast() and validates its location parameter by exact match to one of the predefined values: ["Miami", "Miami, Florida", "FL"]. When the request is rephrased as "How will the humidity levels change over the next seven days in Miami,FL?", agents assign the value "Miami, FL" to location, which does not match any of the (incompletely) listed options.

Further systematic analysis of the error type distribution reveals that 70-90% of errors indeed stem from a mismatch in parameter value assignment. We conclude that the majority of failures in this case can be attributed to a drawback of the evaluation approach rather than to agents' sensitivity.

We argue that this issue could potentially be mitigated by applying semantic similarity instead of exact match. Indeed, recent studies adopt a more holistic approach to the evaluation of a constructed function call; e.g., Zhong et al. (2025) use a multi-dimensional matching strategy, including FCs' embedding similarity and LLM-as-a-Judge matching, ensuring that a generated tool call meets its semantic requirements. We leave the exploration of this mitigation strategy in the context of the BFCL evaluation framework to future work.
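A minimal sketch of such a semantic-similarity check for parameter values is given below; this is our illustration of the proposed mitigation rather than the BFCL or Zhong et al. (2025) implementation, and the encoder and the 0.85 acceptance threshold are arbitrary assumptions.

from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumption: any sentence encoder

def value_matches(predicted: str, allowed_values: list[str], threshold: float = 0.85) -> bool:
    """Accept a parameter value if it is semantically close to any of the allowed values."""
    if predicted in allowed_values:                  # fall back to exact match first
        return True
    pred_emb = encoder.encode(predicted, convert_to_tensor=True)
    allowed_emb = encoder.encode(allowed_values, convert_to_tensor=True)
    return util.cos_sim(pred_emb, allowed_emb).max().item() >= threshold

# "Miami, FL" fails exact match against the listed options but is semantically equivalent;
# with a reasonable encoder and threshold this check would likely accept it.
print(value_matches("Miami, FL", ["Miami", "Miami, Florida", "FL"]))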
Agents' Sensitivity to Toolkit Expansion  Evidently, expanding an agent's toolkit with a set of related functions caused performance degradation across the board (Table 2, left). Here, objective agent failures span a range of error types: wrong function selected, wrong number of functions generated (typically two instead of one), wrong parameter assignment to a correctly-selected function, parameter hallucinations, etc. As an example, in response to the request "What is the ranking of Manchester United in Premier League?", an agent with the expanded toolkit produces football_league.ranking("premier league"), retrieving the complete ranking table of the league, instead of the more appropriate sports_ranking("Manchester United", "premier league"), answering the query.

Table 2 (right) presents the error breakdown for agents in this study in the expanded toolkit scenario, showing the proportion of each error type within the set of failures stemming from toolkit expansion. While no clear pattern dominates, it is evident that agents struggle with both accurate function selection and parameter assignment.
                           robustness evaluation                                 exp. toolkit + orig. query: error analysis (%)
model (agent)              original   orig. toolkit    exp. toolkit    wrong     wrong      wrong num.      wrong param.
                                      + reph. query    + orig. query   syntax    function   of functions    assignment
Llama3.1-70B               0.965      0.825 (-15%)     0.925 (-4%)     0.00      0.45       0.10            0.45
Llama3.3-70B               0.945      0.785 (-17%)     0.905 (-4%)     0.00      0.23       0.46            0.31
DeepSeek-V2.5              0.965      0.835 (-14%)     0.950 (-2%)     0.00      0.56       0.00            0.44
Qwen2.5-72B                0.975      0.850 (-13%)     0.965 (-1%)     0.00      0.29       0.00            0.71
Granite3.1-8B-instruct     0.945      0.770 (-19%)     0.870 (-8%)     0.09      0.50       0.18            0.23
Claude-3.5-Haiku           0.925      0.765 (-11%)     0.870 (-2%)     0.00      0.44       0.00            0.56
Claude-3.5-Sonnet          0.915      0.845 ( -8%)     0.890 (-3%)     0.00      0.29       0.00            0.71
gpt4o-mini                 0.925      0.765 (-17%)     0.870 (-6%)     0.26      0.42       0.00            0.32
o1-mini                    0.905      0.770 (-15%)     0.885 (-2%)     0.33      0.27       0.00            0.43

Table 2: Agentic FC robustness evaluation results. Models' AST performance drop is evident when rephrasing the original query, and also when using the original query with the extended toolkit (left); the relative percent drop is specified in brackets. Failures stemming from toolkit expansion vary mostly between wrong function selection and wrong parameter assignment (right). The best result in a column (the lowest performance drop) is boldfaced.
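For reference, the bracketed values appear to be relative drops with respect to the original score; a minimal check for the Llama3.1-70B row:

# Relative drop (%) of a perturbed score with respect to the original score.
def relative_drop(original: float, perturbed: float) -> int:
    return round(100 * (perturbed - original) / original)

print(relative_drop(0.965, 0.825))  # -15, rephrased-query column for Llama3.1-70B
print(relative_drop(0.965, 0.925))  # -4, expanded-toolkit column for Llama3.1-70B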

Finally, expanding an agent's toolkit with additional functions occasionally caused models to "repair" some of their original (baseline) failures. Interestingly, this observation highlights the stochastic, generative nature of LLM agents, where seemingly unrelated changes to a model's context may entail different output.

4 Conclusions and Future Work

We focus on two aspects of robustness, capturing input variations that can be expected in real-world agentic deployments: (1) meaning-preserving rephrasings of user requests and (2) expansion of the agent's toolkit to include a set of semantically related tools that are likely to be shortlisted by a selection module. We build a benchmark dataset, evaluate the robustness of several SOTA LLM agents, and discuss the breakdown of failures.

Our future work includes testing the robustness of agentic FC with additional and diverse datasets. Moreover, it has been shown that LLMs can be easily distracted by larger context (Shi et al., 2023; Levy et al., 2024). We plan to extend the set of experiments to scenarios where the agent's toolkit is expanded also with non-relevant tools, to compare the performance against the current setting.

5 Limitations

While our study provides valuable insights into measuring agents' robustness in the function calling scenario, it has several limitations. First, we evaluate our approach on a single dataset, sufficient for the focused contribution of a short paper, but requiring extension to additional datasets for a broader analysis. Second, our toolkit expansion scenario relies on multiple LLMs to generate related requests and corresponding tools, a time-consuming process currently performed offline. We are actively exploring ways to streamline this pipeline for improved efficiency and usability.

6 Ethical Considerations

We use publicly available datasets to study the robustness of agentic function calling. We did not make use of AI-assisted technologies while writing this paper. We also did not hire human annotators at any stage of the research.

Acknowledgements

We are deeply grateful to Michal Jacovi for her invaluable assistance in carrying out this study. We would like to thank Guy Uziel for his feedback on earlier versions of this paper. Finally, we are thankful to our anonymous reviewers for their useful comments and constructive feedback.

References

Mahyar Abbasian, Iman Azimi, Amir M Rahmani, and Ramesh Jain. 2023. Conversational health agents: A personalized llm-powered agent framework. arXiv preprint arXiv:2310.02374.

Samuel Ackerman, Ella Rabinovich, Eitan Farchi, and Ateret Anaby Tavor. 2024. A novel metric for measuring the robustness of large language models in non-adversarial scenarios. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 2794-2802, Miami, Florida, USA. Association for Computational Linguistics.
Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. 2023. Qwen technical report. arXiv preprint arXiv:2309.16609.

DeepSeek-AI. 2024. Deepseek-v2: A strong, economical, and efficient mixture-of-experts language model. arXiv preprint arXiv:2405.04434.

Han Ding, Yinheng Li, Junhao Wang, and Hang Chen. 2024. Large language model agent in financial trading: A survey. arXiv preprint arXiv:2408.06361.

Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. 2024. The llama 3 herd of models. arXiv preprint arXiv:2407.21783.

IBM Granite Team. 2024. Granite 3.0 language models.

Kung-Hsiang Huang, Akshara Prabhakar, Sidharth Dhawan, Yixin Mao, Huan Wang, Silvio Savarese, Caiming Xiong, Philippe Laban, and Chien-Sheng Wu. 2024. Crmarena: Understanding the capacity of llm agents to perform professional crm tasks in realistic environments. arXiv preprint arXiv:2411.02305.

Mosh Levy, Alon Jacoby, and Yoav Goldberg. 2024. Same task, more tokens: the impact of input length on the reasoning performance of large language models. arXiv preprint arXiv:2402.14848.

Yuan Li, Bingqiao Luo, Qian Wang, Nuo Chen, Xu Liu, and Bingsheng He. 2024. CryptoTrade: A reflective LLM-based agent to guide zero-shot cryptocurrency trading. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 1094-1106, Miami, Florida, USA. Association for Computational Linguistics.

Zuxin Liu, Thai Hoang, Jianguo Zhang, Ming Zhu, Tian Lan, Shirley Kokane, Juntao Tan, Weiran Yao, Zhiwei Liu, Yihao Feng, et al. 2024. Apigen: Automated pipeline for generating verifiable and diverse function-calling datasets. arXiv preprint arXiv:2406.18518.

Jiarui Lu, Thomas Holleis, Yizhe Zhang, Bernhard Aumayer, Feng Nan, Felix Bai, Shuang Ma, Shen Ma, Mengyu Li, Guoli Yin, et al. 2024. Toolsandbox: A stateful, conversational, interactive evaluation benchmark for llm tool use capabilities. arXiv preprint arXiv:2408.04682.

Nikita Mehandru, Brenda Y Miao, Eduardo Rodriguez Almaraz, Madhumita Sushil, Atul J Butte, and Ahmed Alaa. 2024. Evaluating large language models as agents in the clinic. NPJ Digital Medicine, 7(1):84.

Shishir G Patil, Tianjun Zhang, Xin Wang, and Joseph E Gonzalez. 2023. Gorilla: Large language model connected with massive apis. arXiv preprint arXiv:2305.15334.

Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, et al. 2023. Toolllm: Facilitating large language models to master 16000+ real-world apis. arXiv preprint arXiv:2307.16789.

Qwen Team. 2024. Qwen2.5: A party of foundation models.

Ella Rabinovich, Samuel Ackerman, Orna Raz, Eitan Farchi, and Ateret Anaby Tavor. 2023. Predicting question-answering performance of large language models through semantic consistency. In Proceedings of the Third Workshop on Natural Language Generation, Evaluation, and Metrics (GEM), pages 138-154.

Harsh Raj, Domenic Rosati, and Subhabrata Majumdar. 2023. Measuring reliability of large language models through semantic consistency. In Proceedings of the ML Safety Workshop, NeurIPS.

Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence embeddings using siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics.

Scott Rome, Tianwen Chen, Raphael Tang, Luwei Zhou, and Ferhan Ture. 2024. "Ask me anything": How Comcast uses LLMs to assist agents in real time. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 2827-2831.

Baptiste Roziere, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Romain Sauvestre, Tal Remez, et al. 2023. Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950.

Freda Shi, Xinyun Chen, Kanishka Misra, Nathan Scales, David Dohan, Ed H Chi, Nathanael Schärli, and Denny Zhou. 2023. Large language models can be easily distracted by irrelevant context. In International Conference on Machine Learning, pages 31210-31227. PMLR.

Yijia Xiao, Edward Sun, Di Luo, and Wei Wang. 2024. Tradingagents: Multi-agents llm financial trading framework. arXiv preprint arXiv:2412.20138.

Songlin Xu, Xinyu Zhang, and Lianhui Qin. 2024. Eduagent: Generative student agents in learning. arXiv preprint arXiv:2404.07963.

Kaiqi Yang, Yucheng Chu, Taylor Darwin, Ahreum Han, Hang Li, Hongzhi Wen, Yasemin Copur-Gencturk, Jiliang Tang, and Hui Liu. 2024. Content knowledge identification with multi-agent large language models (llms). In International Conference on Artificial Intelligence in Education, pages 284-292. Springer.

Junjie Ye, Yilong Wu, Songyang Gao, Caishuang Huang, Sixian Li, Guanyu Li, Xiaoran Fan, Qi Zhang, Tao Gui, and Xuanjing Huang. 2024. Rotbench: A multi-level benchmark for evaluating the robustness of large language models in tool learning. arXiv preprint arXiv:2401.08326.

Jianguo Zhang, Tian Lan, Ming Zhu, Zuxin Liu, Thai Hoang, Shirley Kokane, Weiran Yao, Juntao Tan, Akshara Prabhakar, Haolin Chen, et al. 2024. xlam: A family of large action models to empower ai agent systems. arXiv preprint arXiv:2409.03215.

Lucen Zhong, Zhengxiao Du, Xiaohan Zhang, Haiyi Hu, and Jie Tang. 2025. Complexfuncbench: Exploring multi-step and constrained function calling under long-context scenario. arXiv preprint arXiv:2501.10132.
7 Appendices

7.1 Prompt for Request Rephrasing

We used the following prompt for generating strictly meaning-preserving request rephrasings with the Llama3.1-70B model (Dubey et al., 2024):

SYSTEM: You are a helpful assistant helping rephrasing user requests, while accurately preserving their meaning, including numbers and names if exist. Do not answer the requirement, just produce another one that is identical in meaning but is phrased differently. Produce ONLY the rephrased requirement, without further thoughts or explanations. Consider the example below:

USER: Can I find the dimensions and properties of a triangle, if it is known that its three sides are 5 units, 4 units and 3 units long?

ASSISTANT: What are the dimensions and properties of a triangle whose three sides are 5, 4 and 3 units long?

7.2 Prompt for Similar Requests Generation

We used the following prompt for generating closely related but different requests with the Llama3.1-70B model (Dubey et al., 2024):

SYSTEM: You are a helpful assistant introduced with the following user query. Create a very similar query that refers to a very similar user need and is likely to be implemented in an enterprise as part of the same project. The new query should introduce one or two additional distinct parameter types. It should differ from the original query in a sense that a function that can be used to fulfill the original query is not fully appropriate for the new one and vise versa. As an example, generating 'Book a single room for two nights at the Hilton Hotel in Chicago' per the original query 'Book a double room for three nights at the Marriott hotel near OHare Airport in Chicago', is not sufficient since both queries can be answered using the same function call, invoked with different parameters. The query should contain all information needed for its computation. For instance, 'What is the capital of Brazil?' is a good query, while 'What is the capital of a country provided by user?' is not since one cannot generate a function call and populate its arguments using the info in the query alone. Output the newly generated query only, without explanation or interpretation. Consider the examples below:

USER: I need the schedules of matches happening on February 28, 2024.

ASSISTANT: I need the schedules of the college league matches happening during the winter 2024 season.

...

7.3 Example of Syntactically Different but Semantically Equivalent Tools

Although rare, distinct yet functionally equivalent tools pose a challenge for accurate evaluation, since the "labeled" BFCL data contains only one of these functions. As an example, the tool

sentence.translate(sentence: string, from: string, to: string)

is functionally equivalent to

translate_sent(orig_language: string, target_language: string, sentence: string).

As described in Section 2, we concatenate the function name and description, as well as parameter names and descriptions, into a tool "signature", and filter out generated tools exhibiting cosine similarity higher than a predefined threshold to the original one, aiming at a toolkit with distinct functions. The similarity threshold was set to 0.8.
