
1 Introduction

In the evolving landscape of Large Language Models (LLMs), their use as reasoning agents and
function callers has garnered substantial attention. LLMs demonstrate the capacity to interpret and
respond to queries by calling external tools ([1], [2], [3]), a testament to their advanced language
comprehension. This capability to integrate tool usage represents a significant stride in enhancing the
scope and accuracy of LLMs in various applications.

Current state-of-the-art approaches to the tool-usage problem, which utilize GPT-4 (OpenAI) and
Claude-2 (Anthropic), demonstrate impressive results but are closed-source and computationally
expensive. Researchers have attempted to solve this problem by finetuning smaller language models
([3], [4], [5]). However, these models are ineffective at generalizing to new tools when provided in a
zero-shot manner, referred to as ‘dynamic tooling’ from here onwards. The discrepancy between the
generalized tool-use capabilities of large models and the more restricted capabilities of compact models
presents the motivation behind our exploration in this work: can we exploit the nature of this task to
train small open-source LLMs to generalize their tool-use abilities while keeping latency minimal?

[Figure 1 depicts the RTaC pipeline: the user query and the docstrings of the selected tools (converted from a tool database, with support for adding, modifying, and selecting tools) are assembled into a prompt, passed to an instruction-finetuned Coding-Base LLM, and the model's code output is deterministically converted to JSON.]

Figure 1: Overview of “Reimagining Tooling as Coding” (RTaC)

Addressing these challenges, we propose Reimagining Tooling as Coding (RTaC), which


reconceptualizes the task of tooling as a coding task to exploit the powerful code-comprehension
capabilities of LLMs. RTaC provides the tools to be used, in docstring format, to instruction-finetuned
Coding-Base LLMs. It then extracts the output in a Python-inspired code format and
deterministically converts it to JSON. RTaC promotes docstring-reading capability in the LLMs
and hence supports tool modification, addition, and deletion. Using RTaC, we achieve GPT-4
benchmark performance while employing smaller models, such as DeepSeek 1.3B and CodeLlama 7B
LLMs, despite a drastic (300x) reduction in parameter count. We also simultaneously achieve
significant (5x) cost reduction per query while matching GPT-4’s latency. Moreover, RTaC supports
the processing of complex conditional and iterative logic (Bonus), surpassing GPT-4's capabilities.

In Section 2 of this report, we present a thorough literature review to better understand the landscape
of tooling via LLMs. Section 3 details the hypothesis and components of our proposed RTaC pipeline
and provides an analysis of performance, cost, and latency improvements. Section 4 presents our
extensive experimentations, while Section 5 presents future research questions in the domain of
tooling. Lastly, we present our dataset details, choice of evaluation metrics, and code implementation
specifics in the Appendix.

1 Our deployed RTaC models can be found here.


2 Literature Review
2.1 Tool Augmented LLMs
Recent research proposes API Copilot models designed to excel in API interaction and tool usage.
Gorilla [4], built upon Llama 7B [6], surpasses models such as GPT-4 in AST accuracy and reduces
hallucinations on the task of making API calls. The authors propose novel retriever-aware training,
giving Gorilla the ability to adjust to API documentation changes from platforms such as TorchHub,
TensorHub, and HuggingFace. However, it cannot be finetuned on novel datasets. Using a text-to-text
interface, TALM [7] combines LLMs with non-differentiable tools. It is trained to generate tool
inputs, call APIs, and produce outputs based on tool results. It enhances performance on limited tool
data through iterative self-play focused on question-answering. Meanwhile, Graph-ToolFormer [8]
incorporates models such as Graph-BERT [9] and TransE [10] to integrate external graph data and
reasoning tools. This approach involves creating prompts with graph reasoning API calls and using
self-play for prompt generation. While these models use limited API databases, ToolLLM [3] utilizes
a more extensive dataset from RapidAPI, containing 16,464 APIs. It proposes a decision tree method
for enhanced planning and reasoning. Fine-tuning Llama-2 7B [11] on ToolBench [12] resulted in
the creation of ToolLLama, which excels in both single-tool and multi-tool scenarios.

2.2 Dataset Generation


The aforementioned papers incorporate data generation as their primary approach for adapting base
LLMs for tool usage. Gorilla introduces a comprehensive dataset called APIBench by utilizing Self-
Instruct [13], which proposes an automated pipeline to generate large-scale instruction datasets from a
small set of seed tasks. First, human experts provide sample instructions and API documentation as
context. A language model then generates new instructions that plausibly use the APIs, creating
instruction-API pairs. A vital benefit of this approach is that it does not require manual effort to label
training data. Gorilla uses GPT-4 for this data generation.

Although APIBench is built over massive APIs, it does not have multi-tool scenarios. ToolLLM
proposes an innovative data generation strategy supporting multi-tool interplay. The authors sample
API combinations by iterating through tools and sampling intra-category and intra-collection
combinations. ChatGPT is leveraged to generate instructions involving the sampled APIs, and its
behavior is regulated by prompting with documentation, task descriptions, and examples. Generated
APIs are validated against the original sample to filter out hallucinations.

[Figure 2 illustrates the process: API documentation and templated tasks (e.g., “Prioritize my {priority} issues and add them to the current sprint”, with template API calls works_list(issue.priority={p}), prioritize_objects(objects=$$PREV[1]), and get_sprint_id()) are filled by GPT-4 with values drawn from a random value pool, yielding concrete training pairs such as “Prioritize my P0 issues and add them to the current sprint” with works_list(issue.priority="P0").]

Figure 2: Self-Instruct data generation and Template filling

API-Bank [14] aims to tackle the susceptibility of LLM-based data generation to hallucination by
splitting the task of generation into multiple smaller tasks, each being tackled by a separate LLM agent.
One agent proposes domains, another generates plausible APIs, and others select APIs and make API
calls to address given queries. The end result is a synthetic collection of examples of API usage in
different domains. A key aspect is the modular approach, delegating distinct aspects of data generation
to different agents. ToolAlpaca [5] also proposes a multi-agent framework by simulating interactions
between three distinct agents—a user, an assistant, and a tool executor—all embodied by LLMs. The
user agent generates tool use instructions based on documentation, and the assistant agent determines
appropriate actions to take, invoking tool functions via the executor, which simulates execution using
the tool specifications. The iterative interplay between these agents produces varied and realistic tool
use instances.

An entirely different approach, not relying on LLMs, is presented in ToolQA [15], focusing on
generating question-answer pairs. First, templates are created that include placeholders for tool
attributes. These templates are then instantiated with actual values to generate diverse questions.
Answers are produced by predefined tool operators and toolchains that execute APIs. This allows
generating a broad range of synthetic queries and responses.

2.3 Prompting Methods


[Figure 3 contrasts the three prompting structures: Chain of Thought proceeds through a single linear chain of intermediate thoughts; Tree of Thought scores intermediate thoughts and can abandon branches or backtrack; Graph of Thought additionally allows refining and aggregating chains of thoughts and modeling dependencies between them.]

Figure 3: Chain of Thought (left), Tree of Thought (middle), Graph of Thought (right)

In order to enhance the task-specific performance of LLMs, there has been parallel research into
advanced prompting methods, which can be put into two distinct streams in the context of tool usage.
The first category involves prompting LLMs to deconstruct the tool usage task into a sequence of
logical reasoning steps through explicit instructions. Chain of Thought (CoT) [16] exemplifies this
by instigating multi-step reasoning and enhancing problem-solving in areas like mathematics and
programming. Skeleton of Thought (SoT) [17] proposes prompt engineering for basic reasoning
frameworks, aiding in complex cognitive tasks. Tree of Thought (ToT) [18] employs a decision-tree
structure for prompting, while Graph of Thought (GoT) [19] utilizes graph theory for tasks requiring
abstraction and creativity. The second category enhances the query by incorporating reasoning aids,
such as Knowledge Graphs (KG) [20], to facilitate the accurate mapping of argument values to APIs.
Knowledge graphs map relationships between various entities, such as objects, events, and concepts,
represented as nodes and edges in a graph. They have been found useful for context-aware content
generation, leveraging the relational data within the graph.

2.4 Instruction Finetuning of LLMs

Instruction Finetuning of an LLM is a domain adaptation strategy that enhances the model's
performance for specific prompt formats by training it on a targeted dataset. Through this process, the
LLM maintains its general abilities while improving its expertise in relevant, specialized tasks.
Parameter Efficient Fine Tuning (PEFT) [21] techniques selectively update a small subset of a pre-
trained model's parameters during finetuning. By keeping most parameters frozen, these methods
reduce memory footprint and computational cost while enabling specialized adaptation to new tasks.
Low-Rank Adaptation (LoRA) [22] performs on par with or better than full finetuning in model
quality while having far fewer trainable parameters and higher training throughput. Its linear design
allows the trainable matrices to be merged with the frozen weights at deployment, which, unlike
adapters, avoids additional inference latency. On modern GPUs, compute speed has outpaced memory
speed, and most operations in Transformers are bottlenecked by memory accesses. FlashAttention [23]
optimally accounts for reading and writing the attention matrix between fast GPU on-chip SRAM and
relatively slow GPU high-bandwidth memory.
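For context, the standard LoRA formulation from [22] (included here for reference, not specific to this report) writes the adapted weight as W = W_0 + ΔW = W_0 + B·A, where W_0 is the frozen pre-trained weight, B ∈ R^(d×r) and A ∈ R^(r×k) are the trainable low-rank matrices, and r << min(d, k). Only A and B are updated during finetuning, and at deployment the product B·A can be merged into W_0, which is why no additional inference latency is incurred.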

3 Reimagining Tooling as Coding (RTaC)


3.1 Design Framework

Reimagining tooling as a form of coding, in the context of LLMs, forms the cornerstone of our
pipeline design. This approach stems from the observation that tool utilization in LLMs essentially
involves executing function calls, assigning values to arguments, and efficiently linking these outputs,
mirroring the core elements of coding. This conceptual overlap extends beyond mere theory, as
evidenced by the proficiency of Copilot and CodeGen. Grounded in this insight, we adopt a training
strategy that treats the tooling challenge within the framework of a coding paradigm. Accordingly, we
prioritize finetuning LLMs with a foundational background in coding (such as DeepSeek-1.3B Code-
Instruct) as opposed to those exclusively trained on natural language processing tasks.

In our paradigm, tool descriptions are conveyed to LLMs in a docstring format during training, as
shown in Figure 4 (left), emulating standard coding practices. The expected output format is
structured as variable assignments from API calls (e.g., var_x = api_call(arguments)), as shown in
Figure 4 (right). This offers advantages over direct training on JSON outputs by reducing the number
of output tokens required and circumventing the need for additional training to correct JSON errors,
an issue prevalent in other methods.

Figure 4: Sample of tool docstring (left) and output in code format (right)
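As a concrete illustration of this format (a minimal sketch with illustrative tool and argument names, approximating the layout of Figure 4 rather than reproducing it), a tool docstring and the corresponding expected model output might look as follows:

def works_list(issue_priority=None, owned_by=None):
    """
    Returns a list of work items matching the given filters.

    Arguments:
        issue_priority (str): Priority of the issue, e.g. "P0".
        owned_by (list[str]): IDs of the users who own the work items.
    """

# Expected model output (code format) for the query
# "Prioritize my P0 issues and add them to the current sprint":
#
#     var_1 = works_list(issue_priority="P0")
#     var_2 = prioritize_objects(objects=var_1)
#     var_3 = get_sprint_id()
#     var_4 = add_work_items_to_sprint(work_ids=var_2, sprint_id=var_3)

Each line assigns the result of one API call to a variable, and later calls reference earlier variables directly, which is what makes the deterministic conversion to JSON straightforward.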

Conditional and iterative logic (Bonus) is handled by allowing the LLM to not only generate outputs in
the format of "var_x = api_calls()" but also to incorporate if-else statements and for-loop constructs.
In the parsed JSON object, we introduce two specialized magic tools, `conditional_magic` and
`iterational_magic`, with the capability for 'JSON in JSON' style argument values, as shown in
Figure 5. Such a format is crucial for managing multiple chains of tools that are dependent on specific
conditions or require iterative processes.

Figure 5: Sample code output and conversion using ‘JSON in JSON’ methodology
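A sketch of what this looks like (the code lines and the nested structure below are illustrative assumptions approximating Figure 5; the field names are not the exact schema produced by our converter):

# Model output in code format (illustrative):
# var_1 = works_list(issue_priority="P0")
# if len(var_1) > 0:
#     var_2 = prioritize_objects(objects=var_1)
# else:
#     var_2 = works_list(issue_priority="P1")

# One plausible 'JSON in JSON' conversion using the conditional_magic tool:
converted = [
    {"tool_name": "works_list", "arguments": {"issue_priority": "P0"}, "output_var": "var_1"},
    {
        "tool_name": "conditional_magic",
        "arguments": {
            "condition": "len(var_1) > 0",
            "if_true": [{"tool_name": "prioritize_objects", "arguments": {"objects": "$var_1"}, "output_var": "var_2"}],
            "if_false": [{"tool_name": "works_list", "arguments": {"issue_priority": "P1"}, "output_var": "var_2"}],
        },
    },
]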

3.2 Pipeline

Dataset Generation: We adopt the Self-Instruct [13] methodology, which utilizes GPT-4 to create a
dataset encompassing only the tools specified in the Problem Statement in multi-tool scenarios. A tool
sampling method is employed to introduce only the docstring of 3-5 tools at a time, thereby ensuring
comprehensive tool coverage in generating query-output pairs. Inspired by API-Bank [14]’s technique
to address the vulnerability of LLM-based data generation to hallucination, we split the task of query-
output generation between 2 LLM agents. The first agent generates queries. A second agent is then
fed these queries and generates ground truths. In this manner, 1800 query-output pairs were generated
by only employing the 9 tools in the Problem Statement. Further, 200 Out-of-Distribution
philosophical queries that cannot be answered using the given toolset were appended. We refer to
these 2000 query-output pairs as the Stage-1 Dataset. To train for bonus handling, we first generate 5
new tools that interface well with conditional and iterative logic. 100 query-output pairs using these
tools were then generated and appended with 500 pairs sampled from the Stage-1 Dataset, hereon
referred to as the Stage-2 Dataset. Both datasets were human-evaluated, and outputs were corrected
to finetune for peak performance.

[Figure 6 depicts the flow: tools are sampled from the tool database and their docstrings are given to GPT-4 Agent 1, which generates a query; GPT-4 Agent 2 then receives the query together with the selected tools and generates the ground-truth output, yielding a query-output pair.]

Figure 6: Dual agent dataset generation
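A minimal sketch of this dual-agent loop (the prompts are abbreviated and illustrative; it assumes the openai Python package with OPENAI_API_KEY set in the environment):

import random
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def call_gpt4(prompt: str) -> str:
    """Send a single-turn prompt to GPT-4 and return the text reply."""
    response = client.chat.completions.create(
        model="gpt-4", messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

def generate_pair(tool_db: dict[str, str]) -> dict:
    # Sample the docstrings of 3-5 tools to ensure broad tool coverage.
    names = random.sample(list(tool_db), k=random.randint(3, min(5, len(tool_db))))
    docstrings = "\n\n".join(tool_db[n] for n in names)

    # Agent 1 proposes a user query that plausibly uses the sampled tools.
    query = call_gpt4(
        "You write realistic user queries answerable with the tools below.\n"
        f"Tools:\n{docstrings}\n\nReturn exactly one query."
    )

    # Agent 2 sees the same docstrings plus the query and produces the ground truth
    # in the Python-inspired code format (var_x = tool_name(arg=value)).
    output = call_gpt4(
        f"Tools:\n{docstrings}\n\nQuery: {query}\n"
        "Answer only with code lines of the form var_x = tool_name(arg=value)."
    )
    return {"query": query, "output": output, "tools": names}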

Training Pipeline: We follow an instruction finetuning-based training approach wherein we first
prepare our dataset in a structure where an “Allowed Tools:” token is introduced, followed by the
docstrings for the tools to be used, as shown in A.5. We train the model under the LoRA [22] framework
using the PEFT [21] library with 4-bit quantization through the bitsandbytes framework. Training is
done in two stages. During the first stage, the model is trained for 5 epochs on queries from the
Stage-1 Dataset and the docstrings for the tools in the Problem Statement. The model is then further
trained for 5 epochs using the Stage-2 Dataset, where it sees the docstrings for the tools in the
Problem Statement along with the 5 new tools described above. This short instruction finetuning
instills docstring-reading capabilities in the LLM as well as adherence to our Python-inspired code
output format.
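A minimal sketch of this setup (the base checkpoint name, LoRA rank, alpha, and target modules below are illustrative assumptions, not the exact values used in our runs):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

base_model = "deepseek-ai/deepseek-coder-1.3b-instruct"  # illustrative checkpoint name

# 4-bit quantization via bitsandbytes.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(
    base_model, quantization_config=bnb_config, device_map="auto"
)

# LoRA adapter on top of the frozen 4-bit base model (rank and modules assumed).
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# The wrapped model is then instruction-finetuned for 5 epochs per stage
# with a standard causal-LM trainer over the Stage-1 and Stage-2 datasets.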

Inference: RTaC allows the user to add their own set of dynamic tools. These tools are appended to
the static tools in the prompt under the “Allowed Tools:” token and then passed to the LLM.
Similarly, in the case of modification and deletion of already added tools, updated docstrings are
passed under the “Allowed Tools:” token.
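A sketch of how the inference prompt is assembled (the exact template is given in A.5; the function below is an illustrative approximation):

def build_prompt(static_docstrings: list[str], dynamic_docstrings: list[str], query: str) -> str:
    """Assemble the inference prompt: all allowed tools (static plus dynamic, with any
    modifications or deletions already applied) followed by the user query."""
    tools_block = "\n\n".join(static_docstrings + dynamic_docstrings)
    return f"Allowed Tools:\n{tools_block}\n\nQuery: {query}\n"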

Code to JSON Conversion with Error Handling: RTaC allows us to integrate a ‘Code to JSON’
converter to obtain the final output as a JSON object. Inspired by compiler technology, this allows for
deterministic parsing with error control for data types and allowed parameter values and leads to
correct JSONs in every inference. Implementation specifics are detailed in A.1.

3.3 Result Analysis

We present averaged results over a test dataset spanning a wide range of tooling settings: Static Tooling
(200 data points), Dynamic Tooling (100 data points), Modifications and Deletions over Static Tools
(40 data points), and Bonus Handling (10 data points). Both RTaC CodeLlama 7B and RTaC
DeepSeek 1.3B match GPT-4 performance under the RTaC setting while being much more
economical. Latency is matched or improved as well.

Metrics/Models     GPT-4    GPT-4 within RTaC Framework   RTaC CodeLlama 7B #   RTaC DeepSeek 1.3B #
F1-Score           86.82    93.93                         93.22                 93.28
JSON Similarity    74.88    87.79                         87.42                 85.73
Cost/Query ($)     0.0341   0.0312                        0.0086                0.0060
Latency (s)        7.32     6.88                          7.56                  5.25

# - Inferred on A100 GPU via Replicate, billed at $0.00115/s.

Table 1: RTaC vs GPT-4 baseline results

4 Experiments and Results


We experiment with two broadly separate approaches to the Problem Statement. The first experiment
seeks to prompt-engineer closed-source LLMs to achieve peak task performance since finetuning them
is prohibitively expensive. We then experiment on smaller open-source LLMs with different
instruction finetuning and prompting setups to arrive at our proposed RTaC pipeline.

4.1 Closed-Source LLMs
We experimented on the GPT family of closed-source LLMs for both the JSON and RTaC Code
output creation end task. The prompt used to evaluate the baseline performance consisted of the
chatbot’s purpose as a tooling agent, tool descriptions, format of the output, and examples of chatbot
behavior. Further, we used five prompting techniques, seeking to boost baseline performance under the
JSON output setting: Query Augmentation with Knowledge Graph (KG) [20], Graph of Thought
(GoT) [19], Chain of Thought (CoT) [16], Skeleton of Thought (SoT) [17], and Tree of Thought
(ToT) [18].

Model                 JSON Similarity   Precision   Recall   F1-Score
GPT 3.5               67.23             82.34       87.32    84.76
GPT-4 Turbo           74.88             88.23       85.45    86.82
GPT-4 Turbo + SoT     75.23             84.52       89.43    86.91
GPT-4 Turbo + GoT     82.32             87.69       90.16    88.91
GPT-4 Turbo + CoT     79.11             88.12       85.84    86.97
GPT-4 Turbo + ToT     80.69             86.52       88.62    87.56
GPT-4 Turbo + KG      80.62             84.32       90.97    87.52
GPT-4 Turbo + RTaC    87.79             92.12       95.81    93.93

Table 2: Results - Closed-Source LLMs

4.2 Open-Source LLMs

In this section, we present three pipelines for our Problem Statement that utilize open-source LLMs,
incrementally building upon our hypothesis of “Reimagining Tooling as Coding”. This
experimentation helps us arrive at our proposed pipeline, RTaC, and also serves as an ablation study.

4.2.1 Few-Shot prompting of Coding-Base LLMs

We use this pipeline to investigate our hypothesis that Coding-Base LLMs outperform general-purpose
LLMs on this task. The LLMs are provided with a prompt similar to that referred to in Section 4.1, with
few-shot examples involving both static tool usage (5 examples) and the bonus structure (3 examples).

Dataset    Model            JSON Similarity   Precision   Recall   F1-Score
Static     CodeLlama 7B     68.78             81.25       77.85    79.51
           CodeLlama 13B    73.47             82.80       84.52    83.65
           DeepSeek 1.3B    60.28             73.33       63.00    67.77
           DeepSeek 6.7B    62.02             90.72       61.48    73.29
Dynamic    CodeLlama 7B     67.94             78.24       74.97    76.57
           CodeLlama 13B    71.12             79.84       85.41    82.53
           DeepSeek 1.3B    60.82             65.29       66.53    65.90
           DeepSeek 6.7B    63.96             92.10       57.53    70.82
Bonus      CodeLlama 7B     56.28             76.43       74.17    75.28
           CodeLlama 13B    59.22             78.13       83.91    80.87
           DeepSeek 1.3B    47.32             63.49       62.18    62.83
           DeepSeek 6.7B    54.87             87.56       59.25    70.68

Table 3: Results - Few-Shot prompting of Coding-Base LLMs

Discussion: Few-shot prompting of pre-trained LLMs like Llama-2 7B and Zephyr 7B (results
provided in A.6) is substantially surpassed by CodeLlama and DeepSeek, validating our choice of
Coding-Base LLMs. However, this pipeline is plagued by the high context length needed to explain the
output formats and chatbot behavior, which in turn leads to higher latencies. Output inspection
reveals that while the models tend to solve the query correctly, the output format is often non-convertible.

4.2.2 Tool Memorization with “Add Tool” token

We build upon ToolLLama style finetuning of LLMs, which finetunes LLMs using query-output pairs,
to include support for dynamic tooling. This is achieved using an “Added Tools:” token. Dynamic
tools provided at runtime are appended in docstring format after this token, while the query follows
the “Query:” token in the input prompt.

To instill an understanding of our added tokens, we first generate 50 dynamic tools and queries that
interface with them using the Self-Instruct [13] methodology. 300 such query-output-toolset tuples,
along with the 100 bonus query-output pairs and the Stage-1 Dataset, as described above, are used for
instruction finetuning over 10 epochs.

Dataset    Model            JSON Similarity   Precision   Recall   F1-Score
Static     CodeLlama 7B     85.89             92.36       94.31    93.32
           ToolAlpaca       68.45             81.51       73.85    77.49
           ToolLLama        69.57             86.22       78.13    81.98
           DeepSeek 1.3B    84.68             91.94       94.63    93.27
Dynamic    CodeLlama 7B     75.51             85.45       87.51    86.47
           ToolAlpaca       63.47             75.93       72.27    74.05
           ToolLLama        66.62             78.24       75.11    76.64
           DeepSeek 1.3B    75.83             82.12       81.85    81.98
Bonus      CodeLlama 7B     83.91             87.67       91.12    89.36
           ToolAlpaca       67.22             80.33       72.91    76.44
           ToolLLama        68.34             84.96       76.19    80.34
           DeepSeek 1.3B    81.35             89.11       90.71    89.90

Table 4: Results - Tool Memorization with “Add Tool” token

Discussion: Experiments with this pipeline reveal training instability. While longer training
makes the model excel in the static setting, dynamic tool comprehension and usage take a hit. On the
other hand, enabling dynamic tooling with controlled training length leads to parameter and data-type
hallucinations for the memorized static tools. Memorization further limits this pipeline’s ability to
modify and delete tools. All this motivates moving to a pipeline with the least tool memorization.
Instruction finetuning shows promising adherence to a code output format that is convertible to
JSON.

4.2.3 RTaC (Our proposed pipeline)

Here, we build upon the previous pipeline by replacing the “Added Tools:” token with the “Allowed
Tools:” token and appending docstrings for all tools, both static and those added dynamically at
runtime, after the token. This pipeline avoids tool memorization and instead promotes docstring
comprehension. Section 3 detailed our motivations and training details for this pipeline.

Dataset    Model            JSON Similarity   Precision   Recall   F1-Score
Static     DeepSeek 1.3B    87.73             94.38       93.28    93.82
           CodeGen 2B       74.23             81.33       78.43    79.85
           DeepSeek 6.7B    87.79             93.01       95.05    94.01
           CodeLlama 7B     89.91             94.19       94.59    94.38
Dynamic    DeepSeek 1.3B    81.47             90.67       88.88    89.76
           CodeGen 2B       67.43             65.58       65.01    65.29
           DeepSeek 6.7B    82.17             92.03       92.16    92.09
           CodeLlama 7B     85.57             91.11       93.37    92.22
Modified   DeepSeek 1.3B    82.98             91.49       90.14    90.80
           CodeGen 2B       63.84             69.79       60.91    65.04
           DeepSeek 6.7B    87.61             91.18       91.13    91.15
           CodeLlama 7B     86.34             92.77       93.23    92.99
Bonus      DeepSeek 1.3B    83.96             91.47       92.01    91.73
           CodeGen 2B       55.37             69.79       60.91    65.04
           DeepSeek 6.7B    83.17             91.66       93.12    92.38
           CodeLlama 7B     86.92             92.22       94.91    93.54

Table 5: Results - RTaC

Discussion: RTaC shows commendable results on both static and dynamic tooling. Further, models
under this pipeline gracefully handle the modification and deletion of static tools. This showcases the
models’ ability to comprehend the given docstrings. While the context length increases over the
previous pipeline due to the inclusion of static tool docstrings in each prompt, even small models such
as DeepSeek 1.3B perform well under this pipeline, leading to minimal latencies. It should also be
noted that CodeGen 2B is not a code-instruct model and does not perform well.

5 Future Scope
5.1 Improvements in evaluation
[Figure 7 sketches the proposed benchmark: a sample query is answered with a sequence of calls from a deterministic API set, the calls are executed against a bash backend operating on directories and files, and the resulting filesystem state is deterministically matched against the ground-truth state for supervision.]

Figure 7: Design for a bash-based deterministic evaluation benchmark

The current evaluation metrics in this research domain, such as JSON Similarity and F1-score, fail to
reliably evaluate critical aspects such as correctness and optimality. We find that string and AST-based
evaluation is not fit for the task of tooling. The same query can often be answered via multiple
sequences of tools, yet they are not scored as such during evaluation. To overcome this, we have been
working on a bash-based toolset that can act as a deterministic evaluation benchmark for tooling
LLMs. Our methodology involves creating an API set that can be mapped to bash operations on
directories and files. Models’ outputs can now be deterministically evaluated by state-matching after
API call execution.
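A minimal sketch of the idea (the API set, its mapping to bash commands, and the state comparison below are illustrative assumptions, not our final benchmark):

import random
import subprocess
import tempfile
from pathlib import Path

# Hypothetical mapping from benchmark API calls to bash operations.
API_TO_BASH = {
    "create_dir":  lambda name: f"mkdir -p {name}",
    "create_file": lambda name: f"touch {name}",
    "delete_file": lambda name: f"rm -f {name}",
}

def execute_calls(api_calls: list[tuple[str, str]], workdir: Path) -> None:
    """Run each (api_name, argument) pair as a bash command inside workdir."""
    for api, arg in api_calls:
        subprocess.run(API_TO_BASH[api](arg), shell=True, cwd=workdir, check=True)

def snapshot(root: Path) -> set[str]:
    """Relative paths of all files and directories under root (structure-only state)."""
    return {str(p.relative_to(root)) for p in root.rglob("*")}

def state_match(predicted: list[tuple[str, str]], reference: list[tuple[str, str]]) -> bool:
    """Execute both call sequences in fresh sandboxes and compare final filesystem states,
    so any tool sequence that reaches the same end state is scored as correct."""
    with tempfile.TemporaryDirectory() as a, tempfile.TemporaryDirectory() as b:
        execute_calls(predicted, Path(a))
        execute_calls(reference, Path(b))
        return snapshot(Path(a)) == snapshot(Path(b))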

5.2 Query Reformulation

Prior research [24] demonstrates that query quality has a great impact on model accuracy. This is
corroborated by our finding that well-articulated queries that follow the sequence of execution achieve
near-perfect accuracy. This motivates research into query-optimizing modules that can reformulate a
user's query into one that the model handles best.

5.3 Tool Retrievers

While docstring comprehension leads to high accuracies on a limited set of tools, scalability motivates
the addition of tool retrievers to the pipeline. This will empower RTaC to be scaled to massive API
sets and outperform current SOTA methods like Gorilla and ToolLLama, which are reliant on tool
memorization.

5.4 Robust Compilers for JSON Converters

In our experimentations, we find that using a coding paradigm for tooling instills complex sequence
generation capabilities in models. These range from nested logic to combinations of iterative and
conditional reasoning. While our current JSON converter is limited to fixed if-else and for constructs,
an implementation using stronger compiler grammar will greatly improve the pipeline’s ability to
answer complex queries.

6 Conclusion
In this study, we introduce the concept of Reimagining Tooling as Coding (RTaC). This idea
redefines tooling as a coding-related task. Our approach involves presenting tool descriptions to LLMs
in docstring format during their training phase, and the desired output is formatted as variable
assignments from API calls. We utilize a dual-agent dataset generation method, covering tool-usage in
static, dynamic and bonus settings. Coding-Base LLMs, naturally adept at understanding coding
elements, are instruction finetuned using this specially curated dataset in the RTaC format. This novel
strategy has allowed smaller, open-source Coding-Base LLMs like DeepSeek 1.3B and CodeLlama 7B
to achieve performance on par with leading models like GPT-4, but with notably reduced
computational demands and response times. RTaC offers a fresh way of addressing tool-usage
challenges with LLMs, setting the stage for future advancements in LLM applications.

References
1. Yan, F. (2023, November 13). Gorilla OpenFunctions.
https://gorilla.cs.berkeley.edu/blogs/4_open_functions.html
2. Eleti, A., Harris, J., & Kilpatrick, L. (2023, June 13). Function calling and other API updates.
https://openai.com/blog/function-calling-and-other-api-updates
3. Qin, Y., Liang, S., Ye, Y., Zhu, K., Yan, L., Lu, Y., Lin, Y., Cong, X., Tang, X., Qian, B., Zhao, S., Hong, L.,
Tian, R., Xie, R., Zhou, J., Gerstein, M., Li, D., Liu, Z., and Sun, M. (2023). Toolllm: Facilitating large
language models to master 16,000+ real-world APIs.
4. Patil, S. G., Zhang, T., Wang, X., and Gonzalez, J. E. (2023). Gorilla: Large language model connected with
massive APIs.
5. Tang, Q., Deng, Z., Lin, H., Han, X., Liang, Q., Cao, B., & Sun, L. (2023). ToolAlpaca: Generalized Tool
Learning for Language Models with 3000 Simulated Cases.
6. Touvron, H., Lavril, T., Izacard, G., Grave, E., & Lample, G. (2023, February 27). Introducing LLaMA: A
foundational, 65-billion-parameter large language model. https://ai.meta.com/blog/large-language-model-
llama-meta-ai/
7. Parisi, A., Zhao, Y., & Fiedel, N. (2022). TALM: Tool Augmented Language Models.
8. Zhang, J. (2023). Graph-ToolFormer: To Empower LLMs with Graph Reasoning Ability via Prompt
Augmented by ChatGPT.
9. Zhang, J., Zhang, H., Xia, C., & Sun, L. (2020). Graph-Bert: Only Attention is Needed for Learning Graph
Representations.
10. Bordes, A., Usunier, N., Garcia-Duran, A., Weston, J., & Yakhnenko, O. (2013). Translating Embeddings
for Modeling Multi-relational Data.
11. Touvron, H., Martin, L., Stone, K., & Scialom, T. (2023, July 18). Llama 2: Open Foundation and Fine-
Tuned Chat Models. https://ai.meta.com/research/publications/llama-2-open-foundation-and-fine-tuned-
chat-models/
12. Xu, Q., Hong, F., Li, B., Hu, C., Chen, Z., & Zhang, J. (2023). On the Tool Manipulation Capability of
Open-source Large Language Models.
13. Wang, Y., Kordi, Y., Mishra, S., Liu, A., Smith, N. A., Khashabi, D., & Hajishirzi, H. (2023). Self-Instruct:
Aligning Language Models with Self-Generated Instructions.
14. Li, M., Song, F., Yu, B., Yu, H., Li, Z., Huang, F., & Li, Y. (2023, October 25). API-Bank: A Benchmark for
Tool-Augmented LLMs.
15. Zhuang, Y., Yu, Y., Wang, K., Sun, H., & Zhang, C. (2023). ToolQA: A Dataset for LLM Question
Answering with External Tools.
16. Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E., Le, Q., & Zhou, D. (2023).
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models.
17. Ning, X., Lin, Z., Zhou, Z., Wang, Z., Yang, H., & Wang, Y. (2023). Skeleton-of-Thought: Large Language
Models Can Do Parallel Decoding.
18. Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T. L., Cao, Y., & Narasimhan, K. (2023). Tree of Thoughts:
Deliberate Problem Solving with Large Language Models.
19. Besta, M., Blach, N., Kubicek, A., Gerstenberger, R., Gianinazzi, L., Gajda, J., Lehmann, T., Podstawski,
M., Niewiadomski, H., Nyczyk, P., & Hoefler, T. (2023). Graph of Thoughts: Solving Elaborate Problems
with Large Language Models.
20. Hogan, A., Blomqvist, E., Cochez, M., D’Amato, C., De Melo, G., Gutierrez, C., Labra Gayo, J.E., Kirrane,
S., Neumaier, S., Polleres, A., Navigli, R., Ngonga Ngomo, A.C., Rashid, S.M., Rula, A., Schmelzeisen, L.,
Sequeda, J., Staab, S., Zimmermann, A. (2021, September 11). Knowledge Graphs.
21. Mangrulkar, S., & Paul, S. (2023, February 10). PEFT: Parameter-Efficient Fine-Tuning of Billion-Scale
Models on Low-Resource Hardware. https://huggingface.co/blog/peft
22. Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., & Chen, W. (2021). LoRA: Low-
Rank Adaptation of Large Language Models.
23. Dao, T., Fu, D. Y., Ermon, S., Rudra, A., & Ré, C. (2022). FlashAttention: Fast and Memory-Efficient
Exact Attention with IO-Awareness.
24. Ma, X., Gong, Y., He, P., Zhao, H., Duan, N. (2023, October 23). Query Rewriting for Retrieval-
Augmented Large Language Models.
Appendix
A.1 JSON Converter:

Our JSON converter is designed to compile model-generated code into a specific JSON format. The
script categorizes the model's output into two primary types. The first is the General Case, which
adheres to a standard variable assignment format using tool names and arguments. The second type is
the Bonus Case, which encompasses additional code structures like conditional statements and for-
loops. These structures are used for temporary variable assignments, further expanding the script's
capability to handle diverse output formats.

In processing these outputs, the script employs several functions. The process_tool function is used
for the General Case, while Bonus Cases are managed by specialized bonus handlers that also leverage
process_tool but with modified parameters. The make_tool function checks for the validity of tool
and argument names, ignoring invalid entries. Valid arguments are then processed by the
update_arg_val function. This function is responsible for determining if argument values are lists,
handling them recursively if so, and assessing the validity of each value. This includes scenarios where
values are function calls or reference outputs from previous calls, ensuring comprehensive and
accurate JSON conversion.
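A heavily simplified sketch of the general-case path (regex-based, with hypothetical JSON field names; the real converter additionally validates data types and allowed parameter values and handles the bonus constructs and nested lists):

import json
import re

# Matches lines of the form: var_1 = tool_name(arg1=value1, arg2=value2)
LINE_RE = re.compile(r"^(\w+)\s*=\s*(\w+)\((.*)\)\s*$")
ARG_RE = re.compile(r"(\w+)\s*=\s*([^,]+)")

def code_to_json(code: str, valid_tools: set[str]) -> str:
    calls = []
    for line in code.strip().splitlines():
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skip non-conforming lines
        var, tool, arg_str = m.groups()
        if tool not in valid_tools:
            continue  # drop hallucinated tool names (wrong_name_handler in the full converter)
        arguments = [
            {"argument_name": name, "argument_value": value.strip().strip('"')}
            for name, value in ARG_RE.findall(arg_str)
        ]
        calls.append({"tool_name": tool, "arguments": arguments, "output_var": var})
    return json.dumps(calls, indent=2)

# Example:
# print(code_to_json('var_1 = works_list(issue_priority="P0")', {"works_list"}))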

[The figure shows the converter's code flow: the Code Converter routes each output line either to process_tool() for the General Case or to a bonus_handler() for if/for constructs; invalid tool names are caught by wrong_name_handler(), while valid tools pass through make_tool() and their arguments through update_arg_val(), which calls itself recursively when an argument value is a list.]

Figure: Code flow for JSON conversion

A.2 Evaluation Metric

Abstract Syntax Tree (AST) based evaluation is hard to implement, especially when there are repeating
tools in the same tooling sequence. This, along with the fact that multiple arguments are passed to the
tools, makes an AST implementation complex and heuristic-based. No popular AST implementation
that supports tooling was found in our review. To ensure reproducibility, we use LangChain’s
JsonEditDistanceEvaluator, which computes a normalized Damerau-Levenshtein distance
between two “canonicalized” JSON strings. JSON Similarity is then calculated by subtracting the
JSON Edit Distance from one.
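A sketch of how this metric can be computed (assuming the langchain package and its rapidfuzz dependency are installed; the JSON field names in the example strings are illustrative):

from langchain.evaluation import JsonEditDistanceEvaluator

evaluator = JsonEditDistanceEvaluator()

prediction = '[{"tool_name": "works_list", "arguments": [{"argument_name": "issue_priority", "argument_value": "P0"}]}]'
reference  = '[{"tool_name": "works_list", "arguments": [{"argument_name": "issue_priority", "argument_value": "P1"}]}]'

# The evaluator returns a normalized Damerau-Levenshtein distance in [0, 1].
result = evaluator.evaluate_strings(prediction=prediction, reference=reference)
json_similarity = 1.0 - result["score"]
print(json_similarity)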

Precision quantifies the proportion of the tools present in the model's output that also appear in the
ground truth, whereas recall quantifies the proportion of the tools expected in the ground truth that are
indeed present in the output. The F1-Score is then calculated as the harmonic mean of these two
metrics, and hence measures whether the expected tool names appear in the output. We present
Precision, Recall, and F1-Scores as percentages out of 100.
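A minimal sketch of this tool-level computation (treating outputs as multisets of tool names; an illustrative simplification):

from collections import Counter

def tool_prf1(predicted: list[str], expected: list[str]) -> tuple[float, float, float]:
    """Precision, recall, and F1 over predicted vs. expected tool names."""
    overlap = sum((Counter(predicted) & Counter(expected)).values())
    precision = overlap / len(predicted) if predicted else 0.0
    recall = overlap / len(expected) if expected else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Example: one correct tool, one hallucinated tool, one missed tool.
print(tool_prf1(["works_list", "get_sprint_id"], ["works_list", "prioritize_objects"]))
# -> (0.5, 0.5, 0.5)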
A.3 Prompt for Section 4.2.1

A.4 Prompt for Section 4.2.2

A.5 Prompt for Section 4.2.3

A.6 Additional results from Section 4.2.1

Dataset    Model              JSON Similarity   Precision   Recall   F1-Score
Static     Llama-2 7B Chat    44.57             40.80       41.95    41.37
           Zephyr 7B Chat     50.74             61.58       79.74    69.49
Dynamic    Llama-2 7B Chat    37.94             35.18       38.17    36.61
           Zephyr 7B Chat     41.12             55.39       47.67    51.24

Table: Results - Few-Shot prompting of Llama-2 7B and Zephyr 7B (Section 4.2.1)
