Alopex
Abstract
The rapid advancement of Large Language Models (LLMs) has led to their increased integration
into mobile devices for personalized assistance, where calling external API functions can enhance
their performance. However, challenges such as data scarcity, ineffective question formatting,
and catastrophic forgetting hinder the development of on-device LLM agents. To tackle these issues,
we propose Alopex, a framework that enables precise on-device function calls using the Fox LLM.
Alopex introduces a logic-based method for generating high-quality training data and a novel “description-
question-output” format for fine-tuning, reducing risks of function information leakage. Additionally, a
data mixing strategy is used to mitigate catastrophic forgetting, combining function call data with textbook
datasets to enhance performance in various tasks. Experimental results show that Alopex improves
function call accuracy and significantly reduces catastrophic forgetting, providing a robust solution for
integrating function call capabilities into LLMs without manual intervention.
1 Introduction
With the rapid advancement of Large Language Models (LLMs) [24, 2, 14], their integration into software
applications has become increasingly widespread [25, 31]. Researchers and engineers from both academia
and industry are now focusing on developing LLM-powered agents [20, 35] for mobile devices to provide
personalized assistance to users. A key aspect of creating an on-device LLM agent is enabling LLMs to
call external API functions [36, 26] for enhanced performance. These API functions act as external tools,
grounding the LLMs to generate more accurate and personalized outputs.
Despite the growing interest, several challenges hinder the development of on-device LLM agents with
function call capabilities. We summarize these challenges from the perspectives of data scarcity, question
format, and catastrophic forgetting.
Scarcity of LLM Function Call Demonstrations. Training data that demonstrates successful LLM function
calls is scarce. Since the concept of LLM agents has only recently gained researchers’ attention, there is
a shortage of examples showing how LLMs can effectively use external API functions on mobile devices
to complete queries correctly. Consequently, current methods often rely on LLMs to generate synthetic
function call demonstrations based on function descriptions [5]. However, this synthetic data, as seen in
other domains [29], raises concerns about correctness and information leakage [9, 1]. Therefore, a manual
verification and regeneration process is necessary for these LLM-generated demonstrations, which impacts
overall workflow efficiency.
Effective Question Format Remains To Be Explored. The optimal way to format queries to an LLM for
successfully triggering the correct functions is still uncertain. This format needs to be consistent during
both the training and inference phases to ensure aligned performance. One current approach [5] uses a
“question-call-description” format, where the question, such as "Can I capture a high-resolution image from
the front camera?", is placed first. Following the question, the function call command to initiate the function
call procedure is placed next. Finally, the description of the triggered function is provided in the last section.
However, this format presents two main issues: i) the function call command is generated with limited prior
in-context knowledge of the function descriptions, leading to potential inaccuracies; and ii) an LLM fine-tuned
with this format will generate the function descriptions after the function call command during inference.
Although manual intervention can stop this generation, it leaves the system vulnerable to malicious attacks.
Therefore, we still need to identify an effective fine-tuning data format for LLMs in terms of function calls.
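To make the contrast concrete, below is a minimal sketch of the two layouts in Python. The field labels, delimiter strings, and the example call are illustrative assumptions, not the exact templates used by Octopus-v2 or Alopex.

# A minimal sketch contrasting the two fine-tuning data layouts discussed above.
def format_question_call_description(question: str, call: str, description: str) -> str:
    # "question-call-description": the call is generated before the model has
    # seen the function description, and the description is emitted last.
    return f"Question: {question}\nCall: {call}\nDescription: {description}"

def format_description_question_output(description: str, question: str, output: str) -> str:
    # "description-question-output": the description is part of the input
    # context, so generation can stop right after the function call output.
    return f"Description: {description}\nQuestion: {question}\nOutput: {output}"

sample = format_description_question_output(
    description="take_a_photo(camera, resolution): captures an image with the given camera and resolution.",
    question="Can I capture a high-resolution image from the front camera?",
    output="take_a_photo(camera='front', resolution='4K')",
)

Under the second layout, the model never has to generate the description itself, which removes the leakage channel described in issue ii) above.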
Significant Catastrophic Forgetting Happens in Function Call Fine-Tuning. Existing LLM fine-tuning
approaches for function calls may result in catastrophic forgetting [19, 27], which significantly impairs LLM
performance on reasoning tasks. For instance, when evaluating fine-tuned LLMs with function calls on the
MMLU dataset [11], a 16% drop in accuracy is observed, making the fine-tuned LLMs perform like random
guessing on high-school multiple-choice questions. This observation highlights the typical catastrophic forgetting
phenomenon during LLM fine-tuning for function calls. In on-device LLM agents, it is crucial to maintain
advanced LLM functionalities, such as commonsense and mathematical reasoning, alongside function calling.
Therefore, it is imperative to develop fine-tuning strategies that prevent catastrophic forgetting while enabling
LLMs for function calls.
To tackle these issues, we propose Alopex, a computational framework designed to enable on-device
function calls using Fox LLM [34] for precise, domain-specific responses. We summarize the contributions
of Alopex below.
• We argue that function call demonstrations follow strong logical patterns. Based on this observation, we
propose a simple yet effective method to generate high-quality training demonstrations for LLMs to use API
functions. Our study suggests that this logic-based demonstration generation approach leads to better fine-tuned
LLMs for function calls.
• Through our exploration, we present a new “description-question-output” data format for fine-tuning LLMs
for function calls. We observe that this format performs better than the existing formats. Additionally, this
format reduces the potential risks of leaking function information.
• We introduce a data mixing approach to overcome catastrophic forgetting in fine-tuning LLMs for function
calls. By combining our function call dataset with a textbook dataset, we demonstrate that the fine-tuned
LLMs perform better in both function calls and other LLM evaluation benchmarks.
Moreover, we propose a series of system-level optimizations to integrate function call capabilities into
LLMs without requiring manual verification and modification. Experimental results demonstrate that our
framework achieves better accuracy in function calls compared to existing fine-tuned LLMs. Additionally,
Alopex significantly reduces the catastrophic forgetting phenomenon observed in existing fine-tuned LLMs
for function calls. Furthermore, our Alopex framework supports an automatic LLM adaptation pipeline
encompassing data generation and LLM fine-tuning.
2 Related Work
We summarize related work in terms of dataset generation, training schema, and benchmarks.

Figure 1: Rule-based logic data generation workflow. For the detailed function description of take_a_photo,
please refer to §A in the appendix.

Dataset Generation. Recent works have developed function call dataset generation pipelines. Octopus-v2 [5]
presents a methodology that uses an LLM to sample queries relevant to functions from open-source docstrings.
Then, based on the sampled queries, an LLM is used to generate outputs. Dialogue State Tracking
(DST) employs LLMs to generate dialogue data to reduce dialogue collection and annotation costs [22].
APIGen [17] regenerates query data for APIs that have noisy and unusable descriptions, and designs an
automated pipeline to check and verify the resulting function-calling datasets. ToolAlpaca, MetaTool, Octo-planner,
and the "Self-Taught Reasoner" (STaR) also leverage the generative capabilities of LLMs to create comprehensive
documentation for each tool [13, 32, 6, 37]. These approaches all employ LLMs for data generation. However,
in practical implementations, considering the amount of data and the inference time of LLMs, generating data
with LLMs is not highly efficient. Moreover, the generated data may contain errors, such as duplicate queries,
incorrect function selection, incorrect parameter usage, and outputs irrelevant to the queries. To address these
data generation errors, the aforementioned works primarily rely on manual secondary checks or automated
detection pipelines, which incur additional time costs for the inspection process.
Training Schema. Previous works, such as ToolAlpaca [32] and Octopus-v2, rely on supervised fine-tuning
to equip LLMs with function call capabilities. However, this approach neglects the catastrophic
forgetting caused by fine-tuning's negative impact on LLMs. Enabling LLMs for function calls shares its spirit
with LLM alignment techniques [16].
Benchmarks. For this study, we adopt the Berkeley Function-Calling Leaderboard [36] as our evaluation
metric for single function call accuracy, and utilize the open-source LM Evaluation Harness [10] to evaluate
LLM performance on MMLU [11], GSM8K [8], ARC [7], TruthfulQA [15], HellaSwag [38],
and Winogrande [28].
3 Alopex Framework
We overview our framework in Figure 1. The framework contains three major components: i) function
call demonstration generation, ii) formatting function call demonstrations for LLM fine-tuning, and iii)
overcoming catastrophic forgetting.
Figure 2: Data Format (1) is based on the format speculated to be used in Octopus-v2, while Data Format (2)
is created using our approach.
Meanwhile, according to the training data format provided by gorilla-llm/gorilla-openfunctions-v1 [36],
it employs Data Format (2) in Figure 2, i.e., DF2.
So, which one is better? Before proceeding, we need to introduce a metric called
Out-of-Logic Function Call Accuracy (OOFC). Since we design the dataset generation rules manually, it is
inevitable that some cases will deviate from the pre-defined rules. For example, the existing rules generate
question parts that contain phrases such as “Can I”, “How do I”, “How can I”, “Is it possible to”, and “I
want to”. Phrases like “What’s the approach to”, “Is it achievable to”, “I wish to”, “Could you aid me”, or
“Would you assist me” are considered “out-of-logic”. Therefore, the quality of the LLM fine-tuning
data and the function call data can be evaluated by the OOFC performance. After conducting extensive
experiments, we discovered that the data structure following DF2 not only achieves 99% function call
accuracy when following the logic rules but also performs better than DF1 in out-of-logic scenarios.
Therefore, we have chosen to use the function description and the question as the combined input.
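For concreteness, below is a minimal sketch of how OOFC could be computed; the phrase list follows the examples above, while the prefix matching and scoring logic are our illustrative assumptions.

# A minimal sketch of Out-of-Logic Function Call accuracy (OOFC).
OUT_OF_LOGIC_PREFIXES = (
    "What's the approach to", "Is it achievable to", "I wish to",
    "Could you aid me", "Would you assist me",
)

def is_out_of_logic(question: str) -> bool:
    # A question is out-of-logic if it starts with a phrase that the
    # pre-defined generation rules never produce.
    return question.startswith(OUT_OF_LOGIC_PREFIXES)

def oofc_accuracy(questions, predicted_calls, reference_calls) -> float:
    # Function call accuracy restricted to out-of-logic questions.
    pairs = [(p, r) for q, p, r in zip(questions, predicted_calls, reference_calls)
             if is_out_of_logic(q)]
    return sum(p == r for p, r in pairs) / max(len(pairs), 1)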
4 Experiment
In this section, we want to validate the effectiveness of the Alopex framework. We would like to answer the
following questions:
• Q1: Without conducting an examination of the generated dataset, how much better is the Rule-Based
Logic method compared with the LLM-based method?
• Q2: Why does the data format DF2 perform better than DF1?
• Q3: How effective is mixing datasets in a 1:1 ratio at mitigating catastrophic forgetting?
4.1.1 Dataset
We referenced 10 Android APIs open-sourced from Octopus-v2 [5] to construct the function call datasets;
see §A in the appendix for the descriptions of the 10 functions. We generated the following three datasets for
our research questions.
• We generated 1,000 data points for each API using the Rule-Based Logic method. The format of the dataset
follows DF1. The dataset was then split into a training dataset containing 200 data items and a test dataset
containing 800 data items.
• We employed the GPT-3.5 API [23] to generate a dataset using the Octopus-v2 approach [5] (see the sketch
after this list). Since we did not have access to the open-source docstrings mentioned in Octopus-v2, we directly
used the queries generated in the first part of our dataset generation as the docstrings. We mixed and shuffled
the question dataset generated by the Rule-Based Logic method for the 10 functions. Then, using the GPT-3.5
API, we filtered the queries based on each function’s description. Subsequently, we used the GPT-3.5 API to
generate outputs containing the function name and the function parameters. Additionally, we appended
each function’s description after the output. We used the test dataset generated by the rule-based
logic to evaluate the performance of LLMs trained on the LLM-generated data.
• We generated a dataset with the same generation process but a different data format, DF2. Note that we
used the 10 function descriptions together with the user’s questions as inputs, instead of only the user’s
questions, since the model needs to select the most appropriate function from the 10 function descriptions
based on the user’s question to assist the user.
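As a reference point for the second dataset above, below is a minimal sketch of the GPT-3.5-based query filtering and output generation steps using the OpenAI Python client; the prompts and model name are illustrative assumptions, not the exact ones used in our pipeline.

# A sketch of the LLM-based generation baseline: filter shuffled queries by
# function description, then generate the function call output with GPT-3.5.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip()

def filter_queries(queries: list, description: str) -> list:
    # Keep only queries GPT-3.5 judges answerable by the described function.
    return [q for q in queries
            if ask(f"Function: {description}\nQuery: {q}\n"
                   "Can this function answer the query? Reply yes or no.").lower().startswith("yes")]

def generate_output(query: str, description: str) -> str:
    # Generate the function call (name plus parameters) for a filtered query.
    return ask(f"Function: {description}\nQuery: {query}\n"
               "Return the function call with its parameters.")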
| Model | Training Strategy | Func Call Acc | OOFC | MMLU | GSM8K | ARC | HellaSwag | Winogrande | TruthfulQA | Average Acc |
|---|---|---|---|---|---|---|---|---|---|---|
| Octopus-v2 | NA | 0.99048 | 0.990 | 0.2574 | 0.0023 | 0.3302 | 0.5000 | 0.5722 | 0.4149 | 0.775 |
| stabilityai/stablelm-2-1.6b | NA | 0.001 | 0.001 | 0.3916 | 0.1766 | 0.4326 | 0.7035 | 0.6535 | 0.3877 | 0.152 |
| stabilityai/stablelm-2-1.6b | Output + Describe | 0.996 | 0.92 | 0.2603 | 0.0008 | 0.3797 | 0.6077 | 0.6093 | 0.4100 | 0.764 |
| stabilityai/stablelm-2-1.6b | Mix(1:1) Output + Describe | 0.99 | 0.943 | 0.3655 | 0.1221 | 0.4027 | 0.7020 | 0.6085 | 0.3688 | 0.7871 |
| stabilityai/stablelm-2-1.6b | 10 Func Describe + question | 0.999 | 0.93 | 0.3304 | 0.0212 | 0.4002 | 0.6291 | 0.6298 | 0.3212 | 0.769 |
| stabilityai/stablelm-2-1.6b | Mix(1:1) Func Describe + question | 0.994 | 0.971 | 0.3477 | 0.1205 | 0.4002 | 0.6593 | 0.6148 | 0.3778 | 0.795 |
| Gemma2B | NA | 0.489 | 0.449 | 0.4174 | 0.1812 | 0.4923 | 0.7154 | 0.6575 | 0.3308 | 0.4679 |
| Gemma2B | Output + Describe | 0.9945 | 0.967 | 0.2985 | 0.0129 | 0.4352 | 0.6314 | 0.6212 | 0.3892 | 0.786 |
| Gemma2B | Mix(1:1) Output + Describe | 0.9945 | 0.965 | 0.3600 | 0.1236 | 0.4701 | 0.6811 | 0.6346 | 0.3745 | 0.80 |
| Gemma2B | 10 Func Describe + question | 0.997 | 0.984 | 0.2892 | 0.0334 | 0.4121 | 0.6319 | 0.6338 | 0.3607 | 0.7915 |
| Gemma2B | Mix(1:1) Func Describe + question | 0.999 | 0.986 | 0.3608 | 0.1440 | 0.4616 | 0.6420 | 0.6346 | 0.3812 | 0.807 |
| Qwen1.5-1.8B | NA | 0.004 | 0.001 | 0.4710 | 0.3366 | 0.3686 | 0.6173 | 0.6148 | 0.3937 | 0.157 |
| Qwen1.5-1.8B | Output + Describe | 0.9954 | 0.971 | 0.3685 | 0.0349 | 0.3464 | 0.5507 | 0.5912 | 0.3644 | 0.7808 |
| Qwen1.5-1.8B | Mix(1:1) Output + Describe | 0.981 | 0.929 | 0.4685 | 0.3275 | 0.3754 | 0.6313 | 0.6283 | 0.3927 | 0.793 |
| Qwen1.5-1.8B | 10 Func Describe + question | 0.997 | 0.972 | 0.4467 | 0.2790 | 0.3976 | 0.5812 | 0.6022 | 0.3864 | 0.805 |
| Qwen1.5-1.8B | Mix(1:1) Func Describe + question | 0.998 | 0.961 | 0.4615 | 0.3048 | 0.3848 | 0.6016 | 0.6038 | 0.4041 | 0.806 |
| tensoropera/Fox-1-1.6B | NA | 0.102 | 0.095 | 0.4303 | 0.3654 | 0.4087 | 0.6273 | 0.6062 | 0.3866 | 0.223 |
| tensoropera/Fox-1-1.6B | Output + Describe | 0.989 | 0.964 | 0.4196 | 0.2858 | 0.3933 | 0.6197 | 0.5998 | 0.3879 | 0.801 |
| tensoropera/Fox-1-1.6B | Mix(1:1) Output + Describe | 0.990 | 0.966 | 0.4090 | 0.3374 | 0.4215 | 0.6476 | 0.6188 | 0.4046 | 0.8097 |
| tensoropera/Fox-1-1.6B | 10 Func Describe + question | 0.997 | 0.975 | 0.4193 | 0.3245 | 0.4010 | 0.6182 | 0.5951 | 0.4246 | 0.8119 |
| tensoropera/Fox-1-1.6B | Mix(1:1) Func Describe + question | 0.999 | 0.976 | 0.4065 | 0.2851 | 0.3763 | 0.6038 | 0.5856 | 0.4047 | 0.806 |
We employed these three datasets to fine-tune Qwen1.5-1.8B, Gemma2B, stabilityai/stablelm-2-1.6b, and
tensoropera/Fox-1-1.6B.
4.1.2 Settings
Testbed. We conducted our experiments using two NVIDIA H100 80GB HBM3 GPUs.
Evaluation Benchmarks and Metrics. We utilized the algorithm from the Berkeley Function Calling Leader-
board [36] to calculate function call accuracy. We also employed the LM Evaluation Harness [10]
to measure MMLU, GSM8K, ARC, HellaSwag, Winogrande, TruthfulQA, and their average.
Additionally, we introduced a new metric called Function Call LLM Average Accuracy (FCLAA), which
averages the function call accuracy in the rule domain, the function call accuracy out of the rule domain
(OOFC), and the average accuracy over MMLU, GSM8K, ARC, HellaSwag, Winogrande, and TruthfulQA;
this is reported as “Average Acc” in the table above. Experiments revealed
that function call data generated by GPT-3.5 alone contains many errors: it achieves only an accuracy of
70%, requiring subsequent validation, modification, and regeneration. In contrast, the Rule-Based Logic method
is efficient and generates high-quality function call data, and LLMs fine-tuned on Rule-Based Logic function
call data can achieve 99% accuracy. Since our output does not use the function name directly but instead
uses special tokens, inspired by Octopus-v2 [5], to reduce the probability of spelling errors when predicting
function names, we used 10-shot prompts demonstrating the expected output when testing the ability of
pre-trained LLMs to perform function calls. The experiment revealed that, apart from Gemma2B, none of the
other three LLMs were able to perform function calls, with Gemma2B exhibiting a function call accuracy
of 48.9%.
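To make the metric concrete, the following minimal sketch computes FCLAA under the averaging reading above; for example, the Fox row with the Mix(1:1) Func Describe + question strategy yields (0.999 + 0.976 + 0.4437) / 3 ≈ 0.806, matching its Average Acc entry.

# A minimal sketch of FCLAA: the mean of in-rule function call accuracy,
# out-of-rule (OOFC) accuracy, and the average of the six benchmark scores.
BENCHMARKS = ("MMLU", "GSM8K", "ARC", "HellaSwag", "Winogrande", "TruthfulQA")

def fclaa(func_call_acc: float, oofc_acc: float, benchmark_scores: dict) -> float:
    bench_avg = sum(benchmark_scores[b] for b in BENCHMARKS) / len(BENCHMARKS)
    return (func_call_acc + oofc_acc + bench_avg) / 3

# Example with the Fox Mix(1:1) Func Describe + question row from the table:
fox = {"MMLU": 0.4065, "GSM8K": 0.2851, "ARC": 0.3763,
       "HellaSwag": 0.6038, "Winogrande": 0.5856, "TruthfulQA": 0.4047}
assert round(fclaa(0.999, 0.976, fox), 3) == 0.806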
4.2 Results
Exp 1: Comparisons between the Rule-Based Logic method and the LLM-based method. Based on
the experimental results in Table 1 and Table 2, we found that while using GPT-3.5 to generate function call
data allows automated selection of queries suitable for each function and automated batch generation of
function call outputs, it still produces a significant number of errors, which degrades the effectiveness of the
fine-tuned LLMs. Consequently, careful checking, validation, and secondary generation are required after
generating the datasets.
Exp 2: Comparisons between DF1 and DF2. Based on the experimental results in Table 1 and Table 2,
we observed that DF2 was more likely to achieve better function call accuracy, especially in the out-of-logic case.
Exp 3: Evaluation of the effectiveness of mixing datasets. Based on the results in Table 3, using open-
source textbook datasets helps recover the LLMs’ pre-fine-tuning performance. Although individual metrics
fluctuate, the average accuracy across all metrics shows an upward trend, except for Fox, which is more robust,
exhibits stable results, and achieves the highest average accuracy.
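For clarity, a minimal sketch of the 1:1 mixing step we evaluated is shown below; the down-sampling choice and sample layout are illustrative assumptions.

# A minimal sketch of 1:1 data mixing: draw an equal number of function call
# demonstrations and textbook samples, then shuffle so each training batch
# sees both task types. Down-sampling the larger source is our choice here.
import random

def mix_one_to_one(function_call_data: list, textbook_data: list, seed: int = 0) -> list:
    rng = random.Random(seed)
    n = min(len(function_call_data), len(textbook_data))
    mixed = rng.sample(function_call_data, n) + rng.sample(textbook_data, n)
    rng.shuffle(mixed)
    return mixed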
5 Conclusion
Alopex efficiently generates high-quality function call datasets without the need for data validation and
regeneration. It also alleviates the catastrophic forgetting phenomenon caused by the function call task in
LLMs. Experimental results show that, among the four small LLMs, Fox exhibits the highest level of
robustness and average accuracy.
References
[1] Harshvardhan Aditya, Siddansh Chawla, Gunika Dhingra, Parijat Rai, Saumil Sood, Tanmay Singh,
Zeba Mohsin Wase, Arshdeep Bahga, and Vijay K Madisetti. Evaluating privacy leakage and mem-
orization attacks on large language models (llms) in generative ai applications. Journal of Software
Engineering and Applications, 17(5):421–447, 2024.
[3] Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han,
Fei Huang, Binyuan Hui, Luo Ji, Mei Li, Junyang Lin, Runji Lin, Dayiheng Liu, Gao Liu, Chengqiang
Lu, Keming Lu, Jianxin Ma, Rui Men, Xingzhang Ren, Xuancheng Ren, Chuanqi Tan, Sinan Tan,
Jianhong Tu, Peng Wang, Shijie Wang, Wei Wang, Shengguang Wu, Benfeng Xu, Jin Xu, An Yang, Hao
Yang, Jian Yang, Shusheng Yang, Yang Yao, Bowen Yu, Hongyi Yuan, Zheng Yuan, Jianwei Zhang,
Xingxuan Zhang, Yichang Zhang, Zhenru Zhang, Chang Zhou, Jingren Zhou, Xiaohuan Zhou, and
Tianhang Zhu. Qwen technical report. arXiv preprint arXiv:2309.16609, 2023.
[4] Marco Bellagente, Jonathan Tow, Dakota Mahan, Duy Phung, Maksym Zhuravinskyi, Reshinth
Adithyan, James Baicoianu, Ben Brooks, Nathan Cooper, Ashish Datta, et al. Stable lm 2 1.6 b
technical report. arXiv preprint arXiv:2402.17834, 2024.
[5] Wei Chen and Zhiyuan Li. Octopus v2: On-device language model for super agent, 2024. URL
https://arxiv.org/abs/2404.01744.
[6] Wei Chen, Zhiyuan Li, Zhen Guo, and Yikang Shen. Octo-planner: On-device language model for
planner-action agents, 2024. URL https://arxiv.org/abs/2406.18082.
[7] Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and
Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge, 2018.
[8] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias
Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman.
Training verifiers to solve math word problems, 2021. URL https://arxiv.org/abs/2110.14168.
[9] Ronen Eldan and Mark Russinovich. Who’s harry potter? approximate unlearning in llms. arXiv
preprint arXiv:2310.02238, 2023.
[10] Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster,
Laurence Golding, Jeffrey Hsu, Alain Le Noac’h, Haonan Li, Kyle McDonell, Niklas Muennighoff,
Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika,
Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. A framework for few-shot language
model evaluation, 12 2023. URL https://zenodo.org/records/10256836.
[11] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob
Steinhardt. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300,
2020.
[12] Jianheng Huang, Leyang Cui, Ante Wang, Chengyi Yang, Xinting Liao, Linfeng Song, Junfeng Yao,
and Jinsong Su. Mitigating catastrophic forgetting in large language models with self-synthesized
rehearsal, 2024. URL https://arxiv.org/abs/2403.01244.
[13] Yue Huang, Jiawen Shi, Yuan Li, Chenrui Fan, Siyuan Wu, Qihui Zhang, Yixin Liu, Pan Zhou, Yao
Wan, Neil Zhenqiang Gong, and Lichao Sun. Metatool benchmark for large language models: Deciding
whether to use tools and which to use, 2024. URL https://arxiv.org/abs/2310.03128.
[14] Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot,
Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard
Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée
Lacroix, and William El Sayed. Mistral 7b, 2023. URL https://arxiv.org/abs/2310.06825.
[15] Stephanie Lin, Jacob Hilton, and Owain Evans. Truthfulqa: Measuring how models mimic human
falsehoods, 2022.
[16] Yong Lin, Lu Tan, Hangyu Lin, Zeming Zheng, Renjie Pi, Jipeng Zhang, Shizhe Diao, Haoxiang Wang,
Han Zhao, Yuan Yao, et al. Speciality vs generality: An empirical study on catastrophic forgetting in
fine-tuning foundation models. arXiv preprint arXiv:2309.06256, 2023.
[17] Zuxin Liu, Thai Hoang, Jianguo Zhang, Ming Zhu, Tian Lan, Shirley Kokane, Juntao Tan, Weiran Yao,
Zhiwei Liu, Yihao Feng, Rithesh Murthy, Liangwei Yang, Silvio Savarese, Juan Carlos Niebles, Huan
Wang, Shelby Heinecke, and Caiming Xiong. Apigen: Automated pipeline for generating verifiable and
diverse function-calling datasets, 2024. URL https://arxiv.org/abs/2406.18518.
[18] Yun Luo, Zhen Yang, Fandong Meng, Yafu Li, Jie Zhou, and Yue Zhang. An empirical study of
catastrophic forgetting in large language models during continual fine-tuning, 2024. URL
https://arxiv.org/abs/2308.08747.
[19] Michael McCloskey and Neal J Cohen. Catastrophic interference in connectionist networks: The
sequential learning problem. In Psychology of learning and motivation, volume 24, pages 109–165.
Elsevier, 1989.
[20] Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher
Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, Xu Jiang, Karl Cobbe, Tyna Eloundou,
Gretchen Krueger, Kevin Button, Matthew Knight, Benjamin Chess, and John Schulman. Webgpt:
Browser-assisted question-answering with human feedback, 2022. URL
https://arxiv.org/abs/2112.09332.
[22] Cheng Niu, Xingguang Wang, Xuxin Cheng, Juntong Song, and Tong Zhang. Enhancing dialogue state
tracking models through llm-backed user-agents simulation, 2024. URL
https://arxiv.org/abs/2405.13037.
[24] OpenAI, Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni
Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, Red Avila, Igor
Babuschkin, Suchir Balaji, Valerie Balcom, Paul Baltescu, Haiming Bao, Mohammad Bavarian, Jeff
Belgum, Irwan Bello, Jake Berdine, Gabriel Bernadett-Shapiro, Christopher Berner, Lenny Bogdonoff,
Oleg Boiko, Madelaine Boyd, Anna-Luisa Brakman, Greg Brockman, Tim Brooks, Miles Brundage,
Kevin Button, Trevor Cai, Rosie Campbell, Andrew Cann, Brittany Carey, Chelsea Carlson, Rory
Carmichael, Brooke Chan, Che Chang, Fotis Chantzis, Derek Chen, Sully Chen, Ruby Chen, Jason
Chen, Mark Chen, Ben Chess, Chester Cho, Casey Chu, Hyung Won Chung, Dave Cummings, Jeremiah
Currier, Yunxing Dai, Cory Decareaux, Thomas Degry, Noah Deutsch, Damien Deville, Arka Dhar,
David Dohan, Steve Dowling, Sheila Dunning, Adrien Ecoffet, Atty Eleti, Tyna Eloundou, David
Farhi, Liam Fedus, Niko Felix, Simón Posada Fishman, Juston Forte, Isabella Fulford, Leo Gao, Elie
Georges, Christian Gibson, Vik Goel, Tarun Gogineni, Gabriel Goh, Rapha Gontijo-Lopes, Jonathan
Gordon, Morgan Grafstein, Scott Gray, Ryan Greene, Joshua Gross, Shixiang Shane Gu, Yufei Guo,
Chris Hallacy, Jesse Han, Jeff Harris, Yuchen He, Mike Heaton, Johannes Heidecke, Chris Hesse,
Alan Hickey, Wade Hickey, Peter Hoeschele, Brandon Houghton, Kenny Hsu, Shengli Hu, Xin Hu,
Joost Huizinga, Shantanu Jain, Shawn Jain, Joanne Jang, Angela Jiang, Roger Jiang, Haozhun Jin,
Denny Jin, Shino Jomoto, Billie Jonn, Heewoo Jun, Tomer Kaftan, Łukasz Kaiser, Ali Kamali, Ingmar
Kanitscheider, Nitish Shirish Keskar, Tabarak Khan, Logan Kilpatrick, Jong Wook Kim, Christina
Kim, Yongjik Kim, Jan Hendrik Kirchner, Jamie Kiros, Matt Knight, Daniel Kokotajlo, Łukasz Kon-
draciuk, Andrew Kondrich, Aris Konstantinidis, Kyle Kosic, Gretchen Krueger, Vishal Kuo, Michael
Lampe, Ikai Lan, Teddy Lee, Jan Leike, Jade Leung, Daniel Levy, Chak Ming Li, Rachel Lim, Molly
Lin, Stephanie Lin, Mateusz Litwin, Theresa Lopez, Ryan Lowe, Patricia Lue, Anna Makanju, Kim
Malfacini, Sam Manning, Todor Markov, Yaniv Markovski, Bianca Martin, Katie Mayer, Andrew
Mayne, Bob McGrew, Scott Mayer McKinney, Christine McLeavey, Paul McMillan, Jake McNeil,
David Medina, Aalok Mehta, Jacob Menick, Luke Metz, Andrey Mishchenko, Pamela Mishkin, Vinnie
Monaco, Evan Morikawa, Daniel Mossing, Tong Mu, Mira Murati, Oleg Murk, David Mély, Ashvin
Nair, Reiichiro Nakano, Rajeev Nayak, Arvind Neelakantan, Richard Ngo, Hyeonwoo Noh, Long
Ouyang, Cullen O’Keefe, Jakub Pachocki, Alex Paino, Joe Palermo, Ashley Pantuliano, Giambattista
Parascandolo, Joel Parish, Emy Parparita, Alex Passos, Mikhail Pavlov, Andrew Peng, Adam Perelman,
Filipe de Avila Belbute Peres, Michael Petrov, Henrique Ponde de Oliveira Pinto, Michael Pokorny,
Michelle Pokrass, Vitchyr H. Pong, Tolly Powell, Alethea Power, Boris Power, Elizabeth Proehl, Raul
Puri, Alec Radford, Jack Rae, Aditya Ramesh, Cameron Raymond, Francis Real, Kendra Rimbach,
Carl Ross, Bob Rotsted, Henri Roussez, Nick Ryder, Mario Saltarelli, Ted Sanders, Shibani Santurkar,
Girish Sastry, Heather Schmidt, David Schnurr, John Schulman, Daniel Selsam, Kyla Sheppard, Toki
Sherbakov, Jessica Shieh, Sarah Shoker, Pranav Shyam, Szymon Sidor, Eric Sigler, Maddie Simens,
Jordan Sitkin, Katarina Slama, Ian Sohl, Benjamin Sokolowsky, Yang Song, Natalie Staudacher, Fe-
lipe Petroski Such, Natalie Summers, Ilya Sutskever, Jie Tang, Nikolas Tezak, Madeleine B. Thompson,
Phil Tillet, Amin Tootoonchian, Elizabeth Tseng, Preston Tuggle, Nick Turley, Jerry Tworek, Juan
Felipe Cerón Uribe, Andrea Vallone, Arun Vijayvergiya, Chelsea Voss, Carroll Wainwright, Justin Jay
Wang, Alvin Wang, Ben Wang, Jonathan Ward, Jason Wei, CJ Weinmann, Akila Welihinda, Peter
Welinder, Jiayi Weng, Lilian Weng, Matt Wiethoff, Dave Willner, Clemens Winter, Samuel Wolrich,
Hannah Wong, Lauren Workman, Sherwin Wu, Jeff Wu, Michael Wu, Kai Xiao, Tao Xu, Sarah Yoo,
Kevin Yu, Qiming Yuan, Wojciech Zaremba, Rowan Zellers, Chong Zhang, Marvin Zhang, Shengjia
Zhao, Tianhao Zheng, Juntang Zhuang, William Zhuk, and Barret Zoph. Gpt-4 technical report, 2024.
URL https://arxiv.org/abs/2303.08774.
[26] Aaron Parisi, Yao Zhao, and Noah Fiedel. Talm: Tool augmented language models, 2022. URL
https://arxiv.org/abs/2205.12255.
[27] Roger Ratcliff. Connectionist models of recognition memory: constraints imposed by learning and
forgetting functions. Psychological review, 97(2):285, 1990.
[28] Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. WINOGRANDE: an
adversarial winograd schema challenge at scale, 2019.
[29] Amrith Setlur, Saurabh Garg, Xinyang Geng, Naman Garg, Virginia Smith, and Aviral Kumar. Rl
on incorrect synthetic data scales the efficiency of llm math reasoning by eight-fold. arXiv preprint
arXiv:2406.14532, 2024.
[30] Haizhou Shi, Zihao Xu, Hengyi Wang, Weiyi Qin, Wenyuan Wang, Yibin Wang, Zifeng Wang, Sayna
Ebrahimi, and Hao Wang. Continual learning of large language models: A comprehensive survey, 2024.
URL https://arxiv.org/abs/2404.16789.
[32] Qiaoyu Tang, Ziliang Deng, Hongyu Lin, Xianpei Han, Qiao Liang, Boxi Cao, and Le Sun. Toolalpaca:
Generalized tool learning for language models with 3000 simulated cases, 2023. URL
https://arxiv.org/abs/2306.05301.
[33] Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak,
Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, Pouya Tafti, Léonard Hussenot,
Pier Giuseppe Sessa, Aakanksha Chowdhery, Adam Roberts, Aditya Barua, Alex Botev, Alex Castro-
Ros, Ambrose Slone, Amélie Héliou, Andrea Tacchetti, Anna Bulanova, Antonia Paterson, Beth Tsai,
Bobak Shahriari, Charline Le Lan, Christopher A. Choquette-Choo, Clément Crepy, Daniel Cer, Daphne
Ippolito, David Reid, Elena Buchatskaya, Eric Ni, Eric Noland, Geng Yan, George Tucker, George-
Christian Muraru, Grigory Rozhdestvenskiy, Henryk Michalewski, Ian Tenney, Ivan Grishchenko,
Jacob Austin, James Keeling, Jane Labanowski, Jean-Baptiste Lespiau, Jeff Stanway, Jenny Brennan,
Jeremy Chen, Johan Ferret, Justin Chiu, Justin Mao-Jones, Katherine Lee, Kathy Yu, Katie Millican,
Lars Lowe Sjoesund, Lisa Lee, Lucas Dixon, Machel Reid, Maciej Mikuła, Mateo Wirth, Michael
Sharman, Nikolai Chinaev, Nithum Thain, Olivier Bachem, Oscar Chang, Oscar Wahltinez, Paige
Bailey, Paul Michel, Petko Yotov, Rahma Chaabouni, Ramona Comanescu, Reena Jana, Rohan Anil,
Ross McIlroy, Ruibo Liu, Ryan Mullins, Samuel L Smith, Sebastian Borgeaud, Sertan Girgin, Sholto
Douglas, Shree Pandya, Siamak Shakeri, Soham De, Ted Klimenko, Tom Hennigan, Vlad Feinberg,
Wojciech Stokowiec, Yu hui Chen, Zafarali Ahmed, Zhitao Gong, Tris Warkentin, Ludovic Peran,
Minh Giang, Clément Farabet, Oriol Vinyals, Jeff Dean, Koray Kavukcuoglu, Demis Hassabis, Zoubin
Ghahramani, Douglas Eck, Joelle Barral, Fernando Pereira, Eli Collins, Armand Joulin, Noah Fiedel,
Evan Senter, Alek Andreev, and Kathleen Kenealy. Gemma: Open models based on gemini research
and technology, 2024. URL https://arxiv.org/abs/2403.08295.
[34] TensorOpera. Tensoropera unveils fox foundation model: A pioneering small language model (slm) for
cloud and edge. https://blog.tensoropera.ai/tensoropera-unveils-fox-foundation-model-a-pioneering-open-source-slm-leading-the-way-against-tech-giants/, 2024.
[35] Lilian Weng. Llm-powered autonomous agents. lilianweng.github.io, Jun 2023. URL
https://lilianweng.github.io/posts/2023-06-23-agent/.
[36] Fanjia Yan, Huanzhi Mao, Charlie Cheng-Jie Ji, Tianjun Zhang, Shishir G. Patil, Ion Stoica, and
Joseph E. Gonzalez. Berkeley function calling leaderboard. 2024.
[37] Eric Zelikman, Yuhuai Wu, Jesse Mu, and Noah D. Goodman. Star: Bootstrapping reasoning with
reasoning, 2022. URL https://arxiv.org/abs/2203.14465.
[38] Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. Hellaswag: Can a machine
really finish your sentence?, 2019.
Appendix
A Function Description
We present the question part, action part, and parameter part we designed for the Rule-Based Logic method for
the take_a_photo API. We also show the parameter dictionary mappings for the take_a_photo API.
To better compare with Octopus-v2 [5], the functions and their descriptions used here are taken from their
study.
B Experiment Parameters
In the experiment, we used the following configuration parameters:
• Learning Rate: 6e-5, 5e-5, 4e-5, 3e-5, 2e-5, 4e-4, 3e-4, 5e-4
• Batch Size: 1, 2
• Epochs: [3,8]
• Warmup Steps: 5
• Max sequence length: 2048
• LR scheduler type: linear
We conducted 10 experiments for each model using combinations of these hyperparameters and selected the
best result from these experiments for the report.
"actions" : [
"take a photo", "snap a picture", "capture an image", "shoot a photo", "get a snapshot",
"record a photo", "grab a picture", "click a photo", "take a selfie"
],
"camera_types" : [
"using the back camera", "with the rear camera", "using the front camera",
"with the front-facing camera", "on the rear camera", "on the front camera", ""
],
"resolutions" : [
"with the 720p resolution", "with the 1080p resolution", "with the 4K resolution", "", "at
a high resolution", "at a clear resolution", "at a relative low resolution"
],
"questions" : [
"Can I", "How do I", "How can I", "Is it possible to", "What's the process for",
"Is there a way to", "What's the easiest way to", "", "I want to", "Please help me", "
Could you help me"
]
camera_mapping = {
"front": "front",
"back": "back",
"rear": "back"
}
resolution_mapping = {
"720p": "720p",
"relative low resolution": "720p",
"1080p": "1080p",
"clear resolution": "1080p",
"4K": "4K",
"high resolution": "4K"
}
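To illustrate how these pieces fit together, below is a minimal sketch that assembles the template lists and mappings above into question/output pairs for take_a_photo. The assembly logic and the default parameter values are illustrative assumptions based on this appendix; only a few template entries are repeated here for brevity.

# A minimal sketch of the rule-based generator for take_a_photo.
import itertools

questions = ["Can I", "How do I", "I want to", ""]
actions = ["take a photo", "snap a picture"]
camera_types = ["using the back camera", "with the front-facing camera", ""]
resolutions = ["with the 720p resolution", "at a high resolution", ""]

camera_mapping = {"front": "front", "back": "back", "rear": "back"}
resolution_mapping = {
    "720p": "720p", "relative low resolution": "720p",
    "1080p": "1080p", "clear resolution": "1080p",
    "4K": "4K", "high resolution": "4K",
}

def build_question(question_phrase, action, camera_type, resolution):
    # Join the non-empty template parts into a natural-language question.
    parts = [question_phrase, action, camera_type, resolution]
    return " ".join(p for p in parts if p) + "?"

def build_output(camera_type, resolution):
    # Map surface phrases back to canonical parameter values; the fallback
    # defaults ('back', '1080p') are assumptions for the empty templates.
    camera = next((v for k, v in camera_mapping.items() if k in camera_type), "back")
    res = next((v for k, v in resolution_mapping.items() if k in resolution), "1080p")
    return f"take_a_photo(camera='{camera}', resolution='{res}')"

samples = [
    {"question": build_question(q, a, c, r), "output": build_output(c, r)}
    for q, a, c, r in itertools.product(questions, actions, camera_types, resolutions)
]

Enumerating the Cartesian product of the template lists is what makes the demonstrations cheap to generate at scale and guaranteed to pair each question with a correct function call, in contrast to the LLM-based generation baseline.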