On the Robustness of Agentic Function Calling

Ella Rabinovich                    Ateret Anaby-Tavor
IBM Research                       IBM Research

arXiv:2504.00914v1 [cs.CL] 1 Apr 2025
Abstract

Large Language Models (LLMs) are increasingly acting as autonomous agents, with function calling (FC) capabilities enabling them to invoke specific tools for tasks. While prior research has primarily focused on improving FC accuracy, little attention has been given to the robustness of these agents to perturbations in their input. We introduce a benchmark assessing FC robustness in two key areas: resilience to naturalistic query variations, and stability in function calling when the toolkit expands with semantically related tools. Evaluating best-performing FC models on a carefully expanded subset of the Berkeley function calling leaderboard (BFCL), we identify critical weaknesses in existing evaluation methodologies, and highlight areas for improvement in real-world agentic deployments.

1 Introduction

Large Language Models (LLMs) are reshaping artificial intelligence, shifting from static language processors to dynamic, task-oriented agents capable of planning, executing, and refining their actions. These agents hold the potential for transformative applications across various domains, including healthcare (Abbasian et al., 2023; Mehandru et al., 2024), finance (Li et al., 2024; Xiao et al., 2024; Ding et al., 2024), education (Yang et al., 2024; Xu et al., 2024), and customer support (Huang et al., 2024; Rome et al., 2024). LLM agents are increasingly positioned as routing systems that can act independently, make decisions, and perform tasks with minimal human intervention.

Agentic Function Calling  Function calling (FC), the process by which an agent autonomously selects and invokes a specific function to retrieve information or execute a task, serves as a fundamental building block of an agentic system. In this context, a full execution trajectory can be seen as a complex, multi-turn (i.e., involving user interaction) sequence of function calls, ultimately achieving a given goal. Models specifically optimized for FC are typically designed to generate a function call in response to a natural-language user request (Bai et al., 2023; Dubey et al., 2024; Zhang et al., 2024). The function (also known as a tool) is chosen from a predefined "toolkit", a compact set of function descriptions [1], provided as part of the model's prompt. The agent is expected to produce a syntactically correct tool invocation, ensuring that parameter values are appropriately assigned to function arguments (a process known as slot filling). For instance, given the query "What is the record for the highest number of points scored by a single player in an NBA game?" and the compact json tool description in Figure 1 (top), the model is expected to generate the invocation code shown in Figure 1 (bottom). Several datasets and evaluation methodologies have been proposed to assess LLMs' function calling capabilities (Patil et al., 2023; Liu et al., 2024), and various benchmarks have been created for evaluating a range of FC scenarios, with the BFCL leaderboard (Patil et al., 2023) among the most prominent.

[1] Descriptions are often provided in the json format.

Figure 1: Compact function definition example (top), and the agent's output triggering the function call with assigned parameter values (bottom), for the user request "What is the record for the highest number of points scored by a single player in an NBA game?".
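To make the setting concrete, a compact json definition in the spirit of Figure 1, together with the invocation the agent is expected to produce, might look as follows. This is an illustrative sketch rather than the actual BFCL entry; in particular, the league parameter and its description are hypothetical.

{
  "name": "basketball.most_points_single_game",
  "description": "Get the record for the most points scored by a single player in one game of a given basketball league.",
  "parameters": {
    "type": "object",
    "properties": {
      "league": {"type": "string", "description": "The basketball league, e.g., NBA."}
    },
    "required": ["league"]
  }
}

basketball.most_points_single_game(league="NBA")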
Robustness of Large Language Models  In the context of more "traditional" LLM usage, model robustness quantifies an LLM's ability to generate semantically equivalent outputs given semantically equivalent inputs (Raj et al., 2023; Rabinovich et al., 2023; Ackerman et al., 2024). Robustness benchmarks assess, among other factors, how well LLMs handle naturally-occurring, non-malicious perturbations in user input, such as paraphrased questions in a QA task, typos, or variations in punctuation, whitespace, or diacritics. Extending this notion to agentic FC would require a model to produce an equivalent tool invocation despite naturalistic, yet strictly meaning-preserving, perturbations in the input query. Considering Figure 1, a semantically equivalent paraphrase "What is the highest number of points ever scored by a single player in an NBA game?" should result in the same tool invocation as the original request.

Despite its clear practical significance, research on the robustness of agentic function calling remains sparse, with only two studies, to the best of our knowledge, examining agent resilience to modifications in tool descriptions. Ye et al. (2024) introduce a series of increasingly aggressive alterations to function names, parameter names, and their descriptions, to the point where a tool (or parameter) name or description becomes arbitrary or entirely uninformative about its functionality. Similarly, Lu et al. (2024) conduct multiple interventions, including tool distractions, within a different evaluation framework that evaluates tool sequencing at the system rather than function level. While these studies offer valuable insights, they provide limited evidence on agent resilience to real-world perturbations, as system developers typically exert substantial control over the faithfulness and level of detail in function and parameter names, along with their descriptions.

Moreover, a typical "toolkit" (the list of available functions) in these studies is limited to a single tool or a small number of unrelated tools. A realistic scenario may involve a system specification with thousands of available tools [2], which in practice is normally reduced to the top-K most relevant function definitions through a shortlisting module (Qin et al., 2023), such as semantic search over the set of tools, towards constructing the context (here, the prompt) of an FC agent. In the example toolkit in Figure 1 (top), additional tools may include: basketball.most_points_career(), basketball.most_points_single_season(), basketball.game_stats().

[2] A software engineering (SWE) agent fixing git issues has access to about 1.2K tools exposed through github docs.
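To illustrate what such a shortlisting module might look like, the sketch below ranks candidate tool definitions by the semantic similarity of their descriptions to the user query and keeps the top-K. It is a minimal example under stated assumptions: the paper does not prescribe a particular retriever, and the all-MiniLM-L6-v2 encoder, the toy tool descriptions, and the value of K are all illustrative.

from sentence_transformers import SentenceTransformer, util

def shortlist_tools(query: str, tool_descriptions: dict[str, str], k: int = 5) -> list[str]:
    """Return the names of the k tool definitions most similar to the query."""
    model = SentenceTransformer("all-MiniLM-L6-v2")  # assumption: any bi-encoder works here
    names = list(tool_descriptions)
    tool_emb = model.encode([tool_descriptions[n] for n in names], convert_to_tensor=True)
    query_emb = model.encode(query, convert_to_tensor=True)
    scores = util.cos_sim(query_emb, tool_emb)[0]     # cosine similarity of the query to each tool
    ranked = sorted(zip(names, scores.tolist()), key=lambda p: p[1], reverse=True)
    return [name for name, _ in ranked[:k]]

# Example usage with the toolkit sketched above (the descriptions are hypothetical).
tools = {
    "basketball.most_points_single_game": "Most points scored by a single player in one game.",
    "basketball.most_points_career": "Most points scored by a player over an entire career.",
    "basketball.game_stats": "Statistics of a specific basketball game.",
}
print(shortlist_tools("What is the record for the highest number of points "
                      "scored by a single player in an NBA game?", tools, k=2))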
Contribution  We focus on two aspects of robustness, capturing input variations that can be expected in real-world agentic deployments but are not easily controlled by a developer: (1) generating meaning-preserving rephrasings of user requests, and (2) expanding the toolkit to include a set of semantically related tools that are likely to be shortlisted by a selection module. Using one of the (single-turn) challenging BFCL (Patil et al., 2023) test sets as our starting point, we first carefully build a benchmark dataset comprising variations pertaining to the two aforementioned aspects (Section 2). Next, we evaluate the robustness of several best-performing LLMs [3], and discuss the breakdown of failures, highlighting (among others) prominent weaknesses of existing agentic FC evaluation benchmarks (Section 3). Our benchmark data is available at https://huggingface.co/datasets/ibm-research/BFCL-FC-robustness.

[3] According to the BFCL leaderboard (Jan 2025).

2 Dataset Generation

We next provide details on the generation of our benchmark dataset. Specifically, we describe (1) the creation of meaning-preserving rephrasings of user requests, and (2) the expansion of the toolkit to include a set of semantically related tools.

2.1 User Query Perturbations

Building on the study by Ackerman et al. (2024), who tested LLMs' sensitivity to paraphrased user queries in QA and classification settings, we investigate whether agents' FC capabilities remain robust to meaning-preserving variations in user requests. Here, the task presents an additional challenge, as the rewording must strictly maintain precise parameter values to ensure accurate slot filling for the sake of evaluation.
Figure 2: Toolkit expansion steps: (1) request variants are generated using the Llama3.1-70B model (Dubey et al., 2024), (2) function json definitions for executing these requests are generated using the Code-Llama-13B model (Roziere et al., 2023), and (3) a filtering step is applied to filter out tools semantically identical to any of the original functions. The process is complete when the expanded toolkit is created for testing the original query.

original request    What is the record for the most points scored by a single player in an NBA game?
original toolkit    basketball.most_points_single_game(...)
request variants    Who holds the record for the highest number of assists made by a female basketball player?
                    What is the longest winning streak in NBA history?
                    ...
additional tools    basketball.most_points_career(...)
                    basketball.records_history(...)
                    ...

Table 1: Toolkit expansion steps: request variants and the additional tools addressing those variants.

For instance, the request "Calculate the depreciated value of a property costing $200,000 with an annual depreciation rate of 3% for 5 years." can be safely rephrased as "Determine the value of a $200,000 asset which loses 3 percent of its worth each year, after five years." Contemporary LLMs handle this task effectively, and we used the Llama3.1-70B model (Dubey et al., 2024) with appropriate prompting and in-context learning. A manual review of 50 examples by one of the authors revealed no instances of semantic drift or parameter misalignment. Appendix 7.1 provides details on the prompt used for this task.

A substantial portion of the paraphrases targeted named entities, which are natural candidates for surface form variability. For instance, the user query "What is the humidity level in Miami,Florida in the upcoming 7 days?" was rephrased as "How will the humidity levels change over the next seven days in Miami,FL?". These seemingly minor modifications led to a notable drop in benchmark performance; we analyze and interpret this decline, and propose strategies to mitigate it, in Section 3.
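For concreteness, a rephrasing call of this kind might be issued as sketched below. The paper does not specify the inference stack, so the OpenAI-compatible endpoint, its URL, and the exact model identifier are assumptions; the system prompt is an abridged version of the one in Appendix 7.1.

from openai import OpenAI

REPHRASE_SYSTEM_PROMPT = (
    "You are a helpful assistant helping rephrasing user requests, while accurately "
    "preserving their meaning, including numbers and names if exist. Do not answer the "
    "requirement, just produce another one that is identical in meaning but is phrased "
    "differently. Produce ONLY the rephrased requirement, without further thoughts or "
    "explanations."
)

def rephrase(request: str) -> str:
    """Generate a strictly meaning-preserving rephrasing of a user request."""
    # Assumption: Llama3.1-70B served behind an OpenAI-compatible endpoint.
    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
    response = client.chat.completions.create(
        model="meta-llama/Llama-3.1-70B-Instruct",
        messages=[
            {"role": "system", "content": REPHRASE_SYSTEM_PROMPT},
            {"role": "user", "content": request},
        ],
        temperature=0.0,
    )
    return response.choices[0].message.content.strip()

print(rephrase("Calculate the depreciated value of a property costing $200,000 "
               "with an annual depreciation rate of 3% for 5 years."))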
2.2 Expanding Agent's Toolkit

Aiming at expanding the (originally) "thin" agent's toolkit, simulating the scenario where function definitions are retrieved by a shortlister, we follow the steps illustrated in Figure 2 and outlined in Table 1.

(1) We generate related yet different request variants using the Llama3.1-70B model (Dubey et al., 2024); see Appendix 7.2 for the detailed prompt.

(2) For each request variant, a tool definition is generated to enable request fulfillment. Here, we used the CodeLlama-13B model (Roziere et al., 2023) with a carefully designed prompt and few-shot examples, ensuring that the generated definitions conform not only to the required json format but also to the naming conventions, style, and level of detail in function and parameter descriptions. Notably, based on our manual inspection, the style of the generated tool definitions is indistinguishable from that of the original function(s).

(3) In rare cases, a generated tool was found to be strictly functionally equivalent to the original one, despite differences in name, description, or parameter order (see Appendix 7.3). We eliminate such cases by (a) concatenating the original tool properties into a "signature," and (b) filtering out any newly generated tool whose "signature" exceeded a predefined similarity threshold to the original tool, as measured via cosine similarity of their embeddings, computed using the sentence-transformers module (Reimers and Gurevych, 2019).
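The de-duplication in step (3) can be sketched as follows. The "signature" construction and the 0.8 threshold follow the description in this section and Appendix 7.3, while the choice of encoder and the exact json fields traversed are assumptions.

from sentence_transformers import SentenceTransformer, util

SIMILARITY_THRESHOLD = 0.8  # threshold reported in Appendix 7.3
encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumption: encoder not specified in the paper

def signature(tool: dict) -> str:
    """Concatenate the function name/description and parameter names/descriptions."""
    parts = [tool["name"], tool.get("description", "")]
    for pname, pspec in tool.get("parameters", {}).get("properties", {}).items():
        parts.append(pname)
        parts.append(pspec.get("description", ""))
    return " ".join(parts)

def is_duplicate(candidate: dict, original_tools: list[dict]) -> bool:
    """True if the candidate tool is semantically identical to any original tool."""
    cand_emb = encoder.encode(signature(candidate), convert_to_tensor=True)
    orig_emb = encoder.encode([signature(t) for t in original_tools], convert_to_tensor=True)
    return util.cos_sim(cand_emb, orig_emb).max().item() > SIMILARITY_THRESHOLD

def filter_generated_tools(generated: list[dict], original_tools: list[dict]) -> list[dict]:
    """Keep only generated tools that are distinct from every original tool."""
    return [t for t in generated if not is_duplicate(t, original_tools)]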
Table 1 presents an example original request (and its tool), along with the expansion process: additional (related but not strictly identical) request variants, and additional tools fulfilling those additional requests. The mean number of tools in the expanded toolkit is 5.6, compared to 2.7 (seemingly unrelated) tools in the original BFCL dataset, meaning that three semantically related functions were added on average to each of the 200 test cases. Next, we evaluate the FC performance of multiple agents using the generated benchmark.

3 Agentic FC Robustness Evaluation

3.1 Experimental Setup

Models  We evaluate several top-performing LLMs from the BFCL leaderboard, both API-accessible and locally hosted, as FC agents. Closed models include GPT4o-mini and o1-mini [4], as well as Claude-3.5-Haiku and Claude-3.5-Sonnet [5]. Locally hosted models include Llama3.1-70B and its more advanced version Llama3.3-70B (Dubey et al., 2024), Granite3.1-8B-instruct (Granite Team, 2024), DeepSeek-v2.5 (DeepSeek-AI, 2024), and Qwen2.5-72B (Qwen Team, 2024).

[4] https://platform.openai.com/docs/models
[5] https://www.anthropic.com/claude

Evaluation Approach  BFCL employs a two-phase FC evaluation approach: (1) assessment of the generated tool call through the tree-matching abstract syntax tree (AST) methodology, and (2) evaluation of the tool execution in a simulated environment (Patil et al., 2023). Our focus in this study is the evaluation of FC construction given interventions in the agent's input; we therefore adhere to the first evaluation phase, namely AST. A robust agent will generate a correct function call regardless of the precise request wording and of its toolkit size: "thin" (as it comes with the original benchmark), or expanded, simulating a shortlister selection.
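As a much-simplified illustration of AST-style matching (BFCL's actual tree matching handles richer structures, type coercion, and execution checks), the sketch below parses a generated call with Python's ast module and compares the function name and keyword arguments against a ground-truth specification; the ground-truth format and the days parameter in the usage example are assumptions.

import ast

def ast_match(generated_call: str, expected_fn: str, expected_args: dict) -> bool:
    """Check that a generated call invokes the expected function with acceptable keyword arguments."""
    try:
        node = ast.parse(generated_call, mode="eval").body
    except SyntaxError:
        return False                      # syntactically invalid call
    if not isinstance(node, ast.Call):
        return False
    fn_name = ast.unparse(node.func)      # e.g., "weather.humidity_forecast"
    if fn_name != expected_fn:
        return False
    args = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
    # Exact-match comparison of parameter values, allowing a list of acceptable values.
    for param, allowed in expected_args.items():
        allowed = allowed if isinstance(allowed, list) else [allowed]
        if args.get(param) not in allowed:
            return False
    return True

# The Miami example discussed below: "Miami, FL" is not among the listed options, so the check fails
# (the days parameter is illustrative).
print(ast_match('weather.humidity_forecast(location="Miami, FL", days=7)',
                "weather.humidity_forecast",
                {"location": ["Miami", "Miami, Florida", "FL"], "days": 7}))  # -> False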
3.2 Experimental Results

We report AST scores averaged over the 200 dataset examples for three variants: (a) the original version, (b) the original ("thin") toolkit + rephrased user request, and (c) the expanded toolkit + original user request. Table 2 (left) reports the results. Several insights can be drawn from the figures:

FC Evaluation Approach Weakness(es)  A notable (and somewhat unexpected) drop occurs when evaluating the original toolkit on a rephrased request. Closer examination of errors reveals a significant weakness in the common approach to FC evaluation, specifically in handling arguments that can accept several equivalently valid values (e.g., named entities). Consider the request: "What is the humidity level in Miami,Florida in the upcoming 7 days?". The expected response includes the function weather.humidity_forecast() and validates its location parameter by exact match to one of the predefined values: ["Miami", "Miami, Florida", "FL"]. When the request is rephrased as "How will the humidity levels change over the next seven days in Miami,FL?", agents assign the value "Miami, FL" to location, which does not match any of the (incompletely) listed options.

Further systematic analysis of the error type distribution reveals that 70-90% of errors indeed stem from a mismatch in parameter value assignment. We conclude that the majority of failures in this case can be attributed to a drawback of the evaluation approach rather than to agents' sensitivity.

We argue that this issue could potentially be mitigated by applying semantic similarity instead of exact match. Indeed, recent studies adopt a more holistic approach to the evaluation of a constructed function call; e.g., Zhong et al. (2025) use a multi-dimensional matching strategy, including FCs' embedding similarity and LLM-as-a-Judge matching, ensuring that a generated tool call meets its semantic requirements. We leave the exploration of this mitigation strategy in the context of the BFCL evaluation framework to future work.
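A minimal sketch of such a semantic-similarity check for parameter values is given below; this is our illustration of the proposed mitigation rather than the BFCL or Zhong et al. (2025) implementation, and the encoder and the 0.85 acceptance threshold are arbitrary assumptions.

from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumption: any sentence encoder

def value_matches(predicted: str, allowed_values: list[str], threshold: float = 0.85) -> bool:
    """Accept a parameter value if it is semantically close to any of the allowed values."""
    if predicted in allowed_values:                  # fall back to exact match first
        return True
    pred_emb = encoder.encode(predicted, convert_to_tensor=True)
    allowed_emb = encoder.encode(allowed_values, convert_to_tensor=True)
    return util.cos_sim(pred_emb, allowed_emb).max().item() >= threshold

# "Miami, FL" fails exact match against the listed options but is semantically equivalent;
# with a reasonable encoder and threshold this check would likely accept it.
print(value_matches("Miami, FL", ["Miami", "Miami, Florida", "FL"]))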
Agents' Sensitivity to Toolkit Expansion  Evidently, expanding an agent's toolkit with a set of related functions caused performance degradation across the board (Table 2, left). Here, objective agent failures span a range of error types: wrong function selected, wrong number of functions generated (typically two instead of one), wrong parameter assignment to a correctly-selected function, parameter hallucinations, etc. As an example, in response to the request "What is the ranking of Manchester United in Premier League?", an agent with the expanded toolkit produces football_league.ranking("premier league"), retrieving the complete ranking table of the league, instead of the more appropriate sports_ranking("Manchester United", "premier league"), answering the query.

Table 2 (right) presents the error breakdown for agents in this study in the expanded toolkit scenario, showing the proportion of each error type within the set of failures stemming from toolkit expansion. While no clear pattern dominates, it is evident that agents struggle with both accurate function selection and parameter assignment.
                           robustness evaluation                                 exp. toolkit + orig. query: error analysis (%)
model (agent)              original   orig. toolkit    exp. toolkit    wrong     wrong      wrong num.      wrong param.
                                      + reph. query    + orig. query   syntax    function   of functions    assignment
Llama3.1-70B               0.965      0.825 (-15%)     0.925 (-4%)     0.00      0.45       0.10            0.45
Llama3.3-70B               0.945      0.785 (-17%)     0.905 (-4%)     0.00      0.23       0.46            0.31
DeepSeek-V2.5              0.965      0.835 (-14%)     0.950 (-2%)     0.00      0.56       0.00            0.44
Qwen2.5-72B                0.975      0.850 (-13%)     0.965 (-1%)     0.00      0.29       0.00            0.71
Granite3.1-8B-instruct     0.945      0.770 (-19%)     0.870 (-8%)     0.09      0.50       0.18            0.23
Claude-3.5-Haiku           0.925      0.765 (-11%)     0.870 (-2%)     0.00      0.44       0.00            0.56
Claude-3.5-Sonnet          0.915      0.845 ( -8%)     0.890 (-3%)     0.00      0.29       0.00            0.71
gpt4o-mini                 0.925      0.765 (-17%)     0.870 (-6%)     0.26      0.42       0.00            0.32
o1-mini                    0.905      0.770 (-15%)     0.885 (-2%)     0.33      0.27       0.00            0.43

Table 2: Agentic FC robustness evaluation results. Models' AST performance drop is evident when rephrasing the original query, and also when using the original query with the extended toolkit (left); the relative percent drop is specified in brackets. Failures stemming from toolkit expansion vary mostly between wrong function selection and wrong parameter assignment (right). The best result in a column (the lowest performance drop) is boldfaced.
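For reference, the bracketed values appear to be relative drops with respect to the original score; a minimal check for the Llama3.1-70B row:

# Relative drop (%) of a perturbed score with respect to the original score.
def relative_drop(original: float, perturbed: float) -> int:
    return round(100 * (perturbed - original) / original)

print(relative_drop(0.965, 0.825))  # -15, rephrased-query column for Llama3.1-70B
print(relative_drop(0.965, 0.925))  # -4, expanded-toolkit column for Llama3.1-70B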

Finally, expanding an agent's toolkit with additional functions occasionally caused models to "repair" some of their original (baseline) failures. Interestingly, this observation highlights the stochastic, generative nature of LLM agents, where seemingly unrelated changes to a model's context may entail different output.

4 Conclusions and Future Work

We focus on two aspects of robustness, capturing input variations that can be expected in real-world agentic deployments: (1) meaning-preserving rephrasings of user requests and (2) expansion of the agent's toolkit to include a set of semantically related tools that are likely to be shortlisted by a selection module. We build a benchmark dataset, evaluate the robustness of several SOTA LLM agents, and discuss the breakdown of failures.

Our future work includes testing the robustness of agentic FC with additional and diverse datasets. Moreover, it has been shown that LLMs can be easily distracted by larger context (Shi et al., 2023; Levy et al., 2024). We plan to extend the set of experiments to scenarios where the agent's toolkit is expanded also with non-relevant tools, to compare the performance against the current setting.

5 Limitations

While our study provides valuable insights into measuring agents' robustness in the function calling scenario, it has several limitations. First, we evaluate our approach on a single dataset, sufficient for the focused contribution of a short paper, but requiring extension to additional datasets for a broader analysis. Second, our toolkit expansion scenario relies on multiple LLMs to generate related requests and corresponding tools, a time-consuming process currently performed offline. We are actively exploring ways to streamline this pipeline for improved efficiency and usability.

6 Ethical Considerations

We use publicly available datasets to study the robustness of agentic function calling. We did not make use of AI-assisted technologies while writing this paper. We also did not hire human annotators at any stage of the research.

Acknowledgements

We are deeply grateful to Michal Jacovi for her invaluable assistance in carrying out this study. We would like to thank Guy Uziel for his feedback on earlier versions of this paper. Finally, we are thankful to our anonymous reviewers for their useful comments and constructive feedback.

References

Mahyar Abbasian, Iman Azimi, Amir M Rahmani, and Ramesh Jain. 2023. Conversational health agents: A personalized llm-powered agent framework. arXiv preprint arXiv:2310.02374.

Samuel Ackerman, Ella Rabinovich, Eitan Farchi, and Ateret Anaby Tavor. 2024. A novel metric for measuring the robustness of large language models in non-adversarial scenarios. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 2794-2802, Miami, Florida, USA. Association for Computational Linguistics.
Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. 2023. Qwen technical report. arXiv preprint arXiv:2309.16609.

DeepSeek-AI. 2024. Deepseek-v2: A strong, economical, and efficient mixture-of-experts language model. arXiv preprint arXiv:2405.04434.

Han Ding, Yinheng Li, Junhao Wang, and Hang Chen. 2024. Large language model agent in financial trading: A survey. arXiv preprint arXiv:2408.06361.

Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. 2024. The llama 3 herd of models. arXiv preprint arXiv:2407.21783.

IBM Granite Team. 2024. Granite 3.0 language models.

Kung-Hsiang Huang, Akshara Prabhakar, Sidharth Dhawan, Yixin Mao, Huan Wang, Silvio Savarese, Caiming Xiong, Philippe Laban, and Chien-Sheng Wu. 2024. Crmarena: Understanding the capacity of llm agents to perform professional crm tasks in realistic environments. arXiv preprint arXiv:2411.02305.

Mosh Levy, Alon Jacoby, and Yoav Goldberg. 2024. Same task, more tokens: the impact of input length on the reasoning performance of large language models. arXiv preprint arXiv:2402.14848.

Yuan Li, Bingqiao Luo, Qian Wang, Nuo Chen, Xu Liu, and Bingsheng He. 2024. CryptoTrade: A reflective LLM-based agent to guide zero-shot cryptocurrency trading. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 1094-1106, Miami, Florida, USA. Association for Computational Linguistics.

Zuxin Liu, Thai Hoang, Jianguo Zhang, Ming Zhu, Tian Lan, Shirley Kokane, Juntao Tan, Weiran Yao, Zhiwei Liu, Yihao Feng, et al. 2024. Apigen: Automated pipeline for generating verifiable and diverse function-calling datasets. arXiv preprint arXiv:2406.18518.

Jiarui Lu, Thomas Holleis, Yizhe Zhang, Bernhard Aumayer, Feng Nan, Felix Bai, Shuang Ma, Shen Ma, Mengyu Li, Guoli Yin, et al. 2024. Toolsandbox: A stateful, conversational, interactive evaluation benchmark for llm tool use capabilities. arXiv preprint arXiv:2408.04682.

Nikita Mehandru, Brenda Y Miao, Eduardo Rodriguez Almaraz, Madhumita Sushil, Atul J Butte, and Ahmed Alaa. 2024. Evaluating large language models as agents in the clinic. NPJ Digital Medicine, 7(1):84.

Shishir G Patil, Tianjun Zhang, Xin Wang, and Joseph E Gonzalez. 2023. Gorilla: Large language model connected with massive apis. arXiv preprint arXiv:2305.15334.

Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, et al. 2023. Toolllm: Facilitating large language models to master 16000+ real-world apis. arXiv preprint arXiv:2307.16789.

Qwen Team. 2024. Qwen2.5: A party of foundation models.

Ella Rabinovich, Samuel Ackerman, Orna Raz, Eitan Farchi, and Ateret Anaby Tavor. 2023. Predicting question-answering performance of large language models through semantic consistency. In Proceedings of the Third Workshop on Natural Language Generation, Evaluation, and Metrics (GEM), pages 138-154.

Harsh Raj, Domenic Rosati, and Subhabrata Majumdar. 2023. Measuring reliability of large language models through semantic consistency. In Proceedings of the ML Safety Workshop, NeurIPS.

Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence embeddings using siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics.

Scott Rome, Tianwen Chen, Raphael Tang, Luwei Zhou, and Ferhan Ture. 2024. "Ask me anything": How Comcast uses LLMs to assist agents in real time. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 2827-2831.

Baptiste Roziere, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Romain Sauvestre, Tal Remez, et al. 2023. Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950.

Freda Shi, Xinyun Chen, Kanishka Misra, Nathan Scales, David Dohan, Ed H Chi, Nathanael Schärli, and Denny Zhou. 2023. Large language models can be easily distracted by irrelevant context. In International Conference on Machine Learning, pages 31210-31227. PMLR.

Yijia Xiao, Edward Sun, Di Luo, and Wei Wang. 2024. Tradingagents: Multi-agents llm financial trading framework. arXiv preprint arXiv:2412.20138.

Songlin Xu, Xinyu Zhang, and Lianhui Qin. 2024. Eduagent: Generative student agents in learning. arXiv preprint arXiv:2404.07963.

Kaiqi Yang, Yucheng Chu, Taylor Darwin, Ahreum Han, Hang Li, Hongzhi Wen, Yasemin Copur-Gencturk, Jiliang Tang, and Hui Liu. 2024. Content knowledge identification with multi-agent large language models (llms). In International Conference on Artificial Intelligence in Education, pages 284-292. Springer.

Junjie Ye, Yilong Wu, Songyang Gao, Caishuang Huang, Sixian Li, Guanyu Li, Xiaoran Fan, Qi Zhang, Tao Gui, and Xuanjing Huang. 2024. Rotbench: A multi-level benchmark for evaluating the robustness of large language models in tool learning. arXiv preprint arXiv:2401.08326.

Jianguo Zhang, Tian Lan, Ming Zhu, Zuxin Liu, Thai Hoang, Shirley Kokane, Weiran Yao, Juntao Tan, Akshara Prabhakar, Haolin Chen, et al. 2024. xlam: A family of large action models to empower ai agent systems. arXiv preprint arXiv:2409.03215.

Lucen Zhong, Zhengxiao Du, Xiaohan Zhang, Haiyi Hu, and Jie Tang. 2025. Complexfuncbench: Exploring multi-step and constrained function calling under long-context scenario. arXiv preprint arXiv:2501.10132.
7 Appendices

7.1 Prompt for Request Rephrasing

We used the following prompt for generating strictly meaning-preserving request rephrasings with the Llama3.1-70B model (Dubey et al., 2024):

SYSTEM: You are a helpful assistant helping rephrasing user requests, while accurately preserving their meaning, including numbers and names if exist. Do not answer the requirement, just produce another one that is identical in meaning but is phrased differently. Produce ONLY the rephrased requirement, without further thoughts or explanations. Consider the example below:

USER: Can I find the dimensions and properties of a triangle, if it is known that its three sides are 5 units, 4 units and 3 units long?

ASSISTANT: What are the dimensions and properties of a triangle whose three sides are 5, 4 and 3 units long?

7.2 Prompt for Similar Requests Generation

We used the following prompt for generating closely related but different requests with the Llama3.1-70B model (Dubey et al., 2024):

SYSTEM: You are a helpful assistant introduced with the following user query. Create a very similar query that refers to a very similar user need and is likely to be implemented in an enterprise as part of the same project. The new query should introduce one or two additional distinct parameter types. It should differ from the original query in a sense that a function that can be used to fulfill the original query is not fully appropriate for the new one and vise versa. As an example, generating 'Book a single room for two nights at the Hilton Hotel in Chicago' per the original query 'Book a double room for three nights at the Marriott hotel near OHare Airport in Chicago', is not sufficient since both queries can be answered using the same function call, invoked with different parameters. The query should contain all information needed for its computation. For instance, 'What is the capital of Brazil?' is a good query, while 'What is the capital of a country provided by user?' is not since one cannot generate a function call and populate its arguments using the info in the query alone. Output the newly generated query only, without explanation or interpretation. Consider the examples below:

USER: I need the schedules of matches happening on February 28, 2024.

ASSISTANT: I need the schedules of the college league matches happening during the winter 2024 season.

...

7.3 Example of Syntactically Different but Semantically Equivalent Tools

Although rare, distinct yet functionally equivalent tools pose a challenge for accurate evaluation, since the "labeled" BFCL data contains only one of these functions. As an example, the tool

sentence.translate(sentence: string, from: string, to: string)

is functionally equivalent to

translate_sent(orig_language: string, target_language: string, sentence: string).

As described in Section 2, we concatenate the function name and description, as well as parameter names and descriptions, into a tool "signature", and filter out generated tools exhibiting cosine similarity higher than a predefined threshold to the original one, aiming at a toolkit with distinct functions. The similarity threshold was set to 0.8.
