0% found this document useful (0 votes)
28 views

AutoFlow - Automated Workflow Generation

AutoFlow - Automated Workflow Generation

Uploaded by

민냥
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
28 views

AutoFlow - Automated Workflow Generation

AutoFlow - Automated Workflow Generation

Uploaded by

민냥
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 11

AutoFlow: Automated Workflow Generation for

Large Language Model Agents

Zelong Li Shuyuan Xu Kai Mei


Rutgers University Rutgers University Rutgers University
[email protected] [email protected] [email protected]
arXiv:2407.12821v1 [cs.CL] 1 Jul 2024

Wenyue Hua Balaji Rama Om Raheja


Rutgers University Independent Researcher Independent Researcher
[email protected] [email protected] [email protected]

Hao Wang He Zhu Yongfeng Zhang


Rutgers University Rutgers University Rutgers University
[email protected] [email protected] [email protected]

Abstract
Recent advancements in Large Language Models (LLMs) have shown significant
progress in understanding complex natural language. One important application of
LLM is LLM-based AI Agent, which leverages the ability of LLM as well as exter-
nal tools for complex-task solving. To make sure LLM Agents follow an effective
and reliable procedure to solve the given task, manually designed workflows are
usually used to guide the working mechanism of agents. However, manually de-
signing the workflows requires considerable efforts and domain knowledge, making
it difficult to develop and deploy agents on massive scales. To address these issues,
we propose AutoFlow, a framework designed to automatically generate workflows
for agents to solve complex tasks. AutoFlow takes natural language program as
the format of agent workflow and employs a workflow optimization procedure to
iteratively optimize the workflow quality. Besides, this work offers two workflow
generation methods: fine-tuning-based and in-context-based methods, making the
AutoFlow framework applicable to both open-source and closed-source LLMs.
Experimental results show that our framework can produce robust and reliable
agent workflows. We believe that the automatic generation and interpretation of
workflows in natural language represent a promising paradigm for solving complex
tasks, particularly with the rapid development of LLMs. The source code of this
work is available at https://github.com/agiresearch/AutoFlow.

1 Introduction
Recent advancements in Large Language Models (LLMs) have demonstrated substantial progress
in understanding and processing complex natural language. These developments have opened up a
wide array of applications, among which the deployment of LLM-based AI agents stands out. These
agents leverage the capabilities of LLMs along with external tools to tackle intricate tasks, ranging
from data analysis [7], software development [23, 35], scientific research [2], travel planning [46] to
many other decision-making processes in various domains.
One of the critical aspects of ensuring that LLM-based AI agents operate effectively and reliably is
the design of workflows that guide their task-solving procedures. For example, an LLM-based agent
for fake news detection may execute under the following workflow designed by information and
Workflow New
User: Generation Generation
Query Query
Workflow Generated Update LLM
Generator Workflow with RL
LLM

Workflow Execute
Interpreter Workflow
LLM Get Reward
One Iteration (Epoch)

Figure 1: The overall generation process of AutoFlow using reinforcement learning reward for LLMs
communication experts [24]: 1) Check the URL, 2) Check the language, 3) Commonsense evaluation,
4) Standpoint evaluation, 5) Summarize the findings, and 6) Classification. The agent executes the
workflow step by step, and each step may call the LLM or external tools to gather useful information
for the final summarization and classification.
Traditionally, these workflows are manually crafted, requiring significant effort and deep domain
knowledge. This manual process poses a substantial barrier to the large-scale development and
deployment of AI agents, as it is both time-consuming and resource-intensive.
To address the challenges associated with manual workflow design, this paper proposes AutoFlow,
a novel framework aimed at the automatic generation of workflows for AI agents to solve complex
tasks. AutoFlow represents workflows in the form of natural language programs [47], facilitating
easier comprehension and interaction. Central to AutoFlow is a workflow optimization procedure
that iteratively refines the quality of the generated workflows, ensuring robustness and reliability.
Technically, AutoFlow introduces two innovative workflow generation methods: a fine-tuning-based
method and an in-context-based method. The fine-tuning-based approach customizes the workflow
generation process for specific tasks and domains by adjusting the parameters of the LLMs. In
contrast, the in-context-based method utilizes contextual information to guide the generation process
without the need for extensive fine-tuning, making it suitable for both open-source and closed-source
LLMs. More specifically, as shown in Figure 1, the user will provide a workflow generation query
to describe the type of tasks. Based on the query, the generator LLM generates a workflow and the
frozen interpreter LLM executes the generated workflow on the dataset, with evaluating performance
as the reward. Then, AutoFlow uses reinforcement learning (RL) to update the generator LLM with
the reward. This process can be seen as one training iteration and the generator LLM expects to learn
how to generate effective and optimal workflows after several iterations.
Our experimental results validate the effectiveness of the AutoFlow framework, showing that the
generated workflows by AutoFlow outperform manually designed ones while keeping readability, and
showcasing its ability to produce high-quality workflows that enable AI agents to perform complex
tasks with a high degree of reliability. The automatic generation and interpretation of workflows
in natural language not only streamline the development process but also represent a promising
paradigm for addressing complex problems, especially in the context of the rapid evolution of LLM
technologies. In summary, this paper makes the following contributions:

• We introduce AutoFlow, a framework that can automatically generate workflows in natural language
so that the workflows can be precisely interpreted by LLMs while reducing human efforts.
• We propose two methods, the fine-tuning method and the in-context learning method, to incorporate
RL in the workflow generation process for both open-source and closed-source LLMs.
• We conduct experiments through benchmark tasks to validate the AutoFlow framework, contributing
to higher valid plan rates and overall performance while keeping the generated natural language
workflow readable by humans.

In the following part of this paper, we first review the related work in Section 2. In Section 3,
we introduce how to represent workflows in natural language and our motivations. In Section 4,
we demonstrate the detailed design of our AutoFlow framework, including two learning methods,
fine-tuning and in-context learning methods. We provide and analyze the experimental results on

2
benchmark datasets in Section 5, and finally conclude our work and suggest potential avenues for
future research in Section 6.

2 Related Work
2.1 LLM Agents and Workflow

AI agent is an autonomous entity capable of making decisions and executing actions in a given
environment to effectively handle various complex tasks [30, 8, 38, 45]. Recently, with the rapid
advancement of Large Language Models (LLMs), LLM-based AI agents have become an important
type of agent for complex task solving [7, 23, 35], such as reasoning, planning and coding.
Reasoning: LLMs typically break down complex tasks into a series of steps, constituting a chain of
reasoning [40]. Approaches such as Chain of Thought (CoT) and its derivatives [40, 21], including
tree [49] and graph structures [1], are commonly used. The self-consistency method [39] samples
multiple reasoning paths and selects the most consistent outcome through voting.
Planning: Planning tasks require LLMs to generate a sequence of actions to achieve specific goals [9].
Recent studies have designed platforms to test LLMs’ planning abilities in areas such as expert model
integration [7], travel task planning [46], and tool usage [52]. However, a known issue is that LLMs
may generate non-executable, invalid or grammatically wrong plans, such as using a piece of text as
input to an image-processing tool. To solve the problem, some studies [7, 52] use post-processing
method to extract a chain of tools from the generated texts, which use LLM itself as a parser to
post-process the generated text. Further, recent attempts integrate finite state machines into LLMs to
enhance human’s controllability of LLM in planning [25, 44]. The ReAct approach [50] also uses
external tools such as search engines to improve LLM planning. In this work, we build on these ideas
to enhance the executability of the generated frameworks.
Coding: LLMs can generate code to solve complex tasks, reducing the need for manual programming
[29, 47, 15, 27, 5, 34, 31, 3]. However, the generated code may contain errors or fail to meet user
requirements. To mitigate these issues, workflow-based methods have been proposed, including
manually designed and automatically generated workflows [16, 43, 53]. Another research direction
involves using LLMs for natural language programming, leveraging their strong natural language
understanding abilities. A notable example is the CoRE language [47], which unifies natural language
programming, pseudo-code programming, and workflow programming under the same framework
using LLM as interpreter. Our work follows the workflow concept in natural language programming
and develops an automated workflow generation framework to reduce human labor.

2.2 Automated Machine Learning

Automated Machine Learning (AutoML) aims to reduce human labors in designing and deploying
machine learning techniques, simplifying the application of ML in real-world problems. There are
three main types of AutoML techniques [48, 26]:
Automated Model Selection: Tools such as Auto-sklearn [6] and Auto-WEKA [22] automatically
select the best machine learning model from a library of models and hyper-parameter settings.
Automated Feature Engineering: Tools such as Data Science Machine [17], ExploreKit [18], and
VEST [4] generate or select useful features without manual intervention, since feature engineering
significantly impacts model performance in many applications.
Neural Architecture Search (NAS): Methods such as ENAS [33], DARTS [28], NASH [37], GNAS
[12], and AmoebaNet-A [36] discover effective neural network architectures for specific tasks
without manual design. Experiments show that networks generated through NAS can match or even
outperform human-designed architectures across various tasks.
AutoML systems typically involve two main components for training: a controller, which is a machine
learning model responsible for sampling model selections, and a child model, which comprises the
parameters of the machine learning model to be created and used for the task at hand. In our work, we
follow this training paradigm, using a workflow generator LLM as the controller, and the generated
workflow along with a workflow interpreter LLM as the child model. More details of the proposed
technique are introduced in Section 4.

3
3 Preliminary and Background
3.1 Natural Language Programs as Workflows

In this section, we introduce how to use natural language programs as a representation of workflows.
Specifically, we will use the Code Representation and Execution (CoRE) system [47] as an example
to show how to construct workflows as natural language programs and how the LLM Agent follows
the workflow by executing the natural language program.

3.1.1 CoRE Language Syntax


The CoRE language defines four components to organize workflows as natural language instructions.

• Step Name is used to uniquely identify each step of the workflow.


• Step Type defines the type of instruction for each step. There are three different types of steps:
– Process: The process step transitions to the next specified step after executing the current step.
– Decision: Similar to conditional statements (e.g., “if-else”), the decision step is used for branch-
ing the program flow based on evaluated conditions.
– Terminal: The terminal step represents the end of the program.
• Step Instruction is a natural language instruction to be executed in the step.
• Step Connection points to the next step, which establishs the program execution flow.

An example workflow for image-text processing on the OpenAGI benchmark is shown as follows:
Step 1::: Process ::: Identify the input data type based on the
objective .::: next :: Step 2
Step 2::: Process ::: Identify the output data type based on the
objective .::: next :: Step 3
Step 3::: Process ::: Select tools in the provided tool list to generate
a plan .::: next :: Step 4
Step 4::: Decision ::: Check whether every tool in the plan is in the
provided tool list .::: Yes :: Step 5:: No :: Step 3
Step 5::: Decision ::: Check whether the output data type of the
previous tool is the input data type
of the next tool .::: Yes :: Step 6:: No :: Step 3
Step 6::: Terminal ::: Output the plan by listing the tool names .:::

In this paper, we use ‘:::’ to delimit the above four components in each step.

3.1.2 LLM as Interpreter for Workflow Execution


To process and execute the workflow in the CoRE language, the system uses an LLM as an interpreter.
The LLM interpreter executes instructions step by step. Concretely, the execution of one step can be
divided into four procedures in the CoRE system.
❶ First, the LLM decides which information from memory may be needed to execute the current step
and retrieves the relevant information from memory. ❷ After obtaining the relevant information, the
system integrates the information with the instruction of that step into a structured prompt, which the
LLM processes to generate a response. ❸ To extend LLM’s capability, the system may use external
tools to analyze the initial response of each step. According to the initial response to the current step,
the LLM determines whether external tools are required. If tool usage is confirmed, LLM will decide
the tool name and tool arguments, then execute the external tool, and finally incorporate the results
into the memory. ❹ After the execution of the current step, LLM will decide which is the next step to
execute based on the output of the current step.

3.2 Motivation

The CoRE system enables users to write workflows in natural language, which unifies natural
language programming, pseudo-code programming, and workflow programming. Although the entry
barrier is lower than coding in programming languages, constructing workflows in natural language

4
Example Workflow Example Workflow
Step 1:::Process:::identify the input data type based on the objective.:::next::step 2 Step 1:::Process:::identify the input data type based on the objective.:::next::step 2
Step 2:::Process:::identify the output data type based on the objective.:::next::Step 3 Step 2:::Process:::identify the output data type based on the objective.:::next::Step 3
Step 3:::Process:::Select models in provided models list to generate a to-do list.:::next::step 4 Step 3:::Process:::Select models in provided models list to generate a to-do list.:::next::step 4
Step 4:::decision:::Check whether every models in the to-do list is in the provided Step 4:::decision:::Check whether every models in the to-do list is in the provided
models.:::Yes::Step 5::No::Step 3 models.:::Yes::Step 5::No::Step 3
Step 5:::decision:::Check whether the previous model output data type is the input data type Step 5:::decision:::Check whether the previous model output data type is the input data type
of the next model.:::Yes::Step 6::No::Step 3 of the next model.:::Yes::Step 6::No::Step 3
Step 6:::terminal:::Output the to-do list solely with model name.::: Step 6:::terminal:::Output the to-do list solely with model name.:::
Iteration

User: Provide a workflow with several steps. The workflow can guide User: Provide a workflow with several steps. The workflow can guide
the LLM design of plans for a type of complex task using provided the LLM design of plans for a type of complex task using provided
neural network tools. {Tool List} {Task Example} neural network tools. {Tool List} {Task Example}
Iteration

Workflow Generator LLM: Workflow Generator LLM:


Step 1:::Process:::Establish the main purpose of the project.:::next::Step 2 Step 1:::Process:::Establish the main purpose of the project.:::next::Step 2
Step 2:::…… Step 2:::……
…… ……
Step n:::…… Step n:::……
GPT-4 as a parser to ensure the workflow matches CoRE grammar
Workflow Interpreter LLM: Execute workflow and calculate reward.
Workflow Interpreter LLM: Execute workflow and calculate reward.
User: The execution performance of given workflow is 0.6415. Provide a
Workflow Generator LLM: Update LoRA param with reward = 0.6415. new workflow in the same form of previous one.

(a) AutoFlow generation process based on fine-tuning (b) AutoFlow generation process based on in-context
method with RL reward for open-source LLMs. learning with RL reward for closed-source LLMs.
Figure 2: Overview for workflow generation with AutoFlow, using OpenAGI [7] tasks as an example

still requires much human labor and domain expertise. Inspired by Automated Machine Learning
(AutoML) [13], we would like to automatically learn the best workflow based on the given task
and training data. Considering the instructions in CoRE language are written in natural language
and LLM has a strong ability of natural language understanding, we also use LLM as the workflow
generator. To distinguish with the Interpreter LLM mentioned in 3.1, we denote the LLM that learns
to generate workflows as the Workflow Generator LLM, and name the LLM that interprets and
executes workflow as the Workflow Interpreter LLM, consistent with Figure 1. In this way, users only
need to provide a high-level description of the task and the corresponding dataset, and the generator
LLM can generate the optimal workflow in CoRE language for the interpreter LLM to execute on the
given task. This process expects to minimize human efforts and automatically pursue the optimal
workflow for LLM regardless of users’ knowledge on workflow design.

4 The AutoFlow Framework


In this section, we introduce the two methods of applying the AutoFlow framework to the workflow
generator LLM, i.e., the fine-tuning method for open-source LLMs and the in-context learning method
for closed-source LLMs.

4.1 Fine-tuning Method for Workflow Generation with Open-source LLMs

We use LoRA adapter [11] for fine-tuning open-souced LLMs as workflow generators. The training
process is shown in Figure 2a.
First, the workflow generator LLM receives a few-shot example workflow and a description of the task
from users as the input query. Although the CoRE language has minimal grammar requirements and
the instructions are written in natural language, which can be well learned and generated by LLMs,
an example workflow can help the workflow generator LLM better understand the grammar of the
CoRE language. The natural language description of the task is to help the generator LLM understand
the application scenarios of the workflow to be generated. Take the text and image processing tasks
in OpenAGI benchmark [7] as an example, the task description could be “Provide a workflow with
several steps. The workflow can guide the LLM to design plans for a type of complex tasks realted to
text and image processing using the provided tools”.
Second, the next step is to generate an executable workflow based on the input query. For closed-
source LLMs such as GPT-4, the model can directly generate a grammatically valid workflow given
the few-shot example. However, open-source LLMs such as Mixtral-8x7B cannot consistently
generate grammatically valid workflow even if few-shot example workflows are provided. To solve
the problem, we follow the post-processing strategy in previous work [7, 52] and use GPT-4 as a
parser to revise the output workflow into a grammatically valid one.

5
Third, the generated workflow will be executed by the interpreter LLM to obtain its performance on
the validation dataset. Then, the generator LLM is updated based on the workflow’s performance
on the validation dataset. Specifically, we use reinforcement learning (RL) to update the parameters
of the LoRA adapter of the generator LLM, with the average metrics of all data instances on the
validation dataset as the reward.
These three steps together consist of one iteration of the fine-tuning process. The fine-tuning process
will terminal and the final workflow will be produced when the terminal condition is met, when the
difference of reward between two consecutive iterations is smaller then a threshold. After the iterative
optimization process, the workflow generator LLM produces the optimal workflow for the task based
on the execution feedback.

4.2 In-context Learning Method for Workflow Generation with Closed-source LLMs

As for closed-source LLMs such as GPT-4, we use in-context learning to avoid fine-tuning the
parameters. As shown in Figure 2b, the AutoFlow framework also requires an example workflow
and a description of the task, and feeds them as the input query to the workflow generator LLM.
After the GPT-4 generates the workflow, we do not use a parser to revise the flow since GPT-4 can
well follow the CoRE grammar demonstrated by the example workflow. Then, the interpreter LLM
executes the workflow to evaluate its performance on the validation dataset as the reward, which is
the same process as the fine-tuning method. The difference is that, in the next step, the AutoFlow
framework directly includes the reward value in the query and prompts the generator LLM to generate
a new workflow given the performance of the previously generated workflow, such as “The execution
performance of the previous workflow is 0.6415. Provide a new workflow that can gain a better
performance”. The whole process is demonstrated in Figure 2b.
We will show in the experimentation that closed-source LLMs such as GPT-4 can well utilize the
reward values in the prompt to refine the workflow and finally obtainn the optimal workflow by using
the in-context learning method.

5 Experiments
5.1 Backbone Large Language Model (LLM)

We conduct experiments on both closed-source and open-source LLMs:


• GPT-4 [32] (Closed-source) is a generative pre-trained transformer of OpenAI. In this work, we
use the GPT-4-1106-preview version.
• Mixtral-8x7B [14] (Open-source) is a pre-trained generative Sparse Mixture of Experts with 46.7
billion parameters.
In our experiment, we apply these two types of LLMs for both workflow generator LLM and
interpreter LLM. Thus, there are four combinations in total.

5.2 Planning Schema of LLMs

We adopt the following LLM-based agent planning schema:


• Zero-shot Learning (Zero) directly inputs the query to the LLM.
• Chain-of-Thought (CoT) [40] induces the LLM to generate a coherent language sequence that
serves as a meaningful intermediate step bridging the input query and the output answer.
• Few-shot Learning (Few) presents a set of high-quality demonstrations in the prompt, each
consisting of both input and desired output on the target task.
• CoRE [47] uses a manually designed workflow with LLM as an interpreter.
• AutoFlow is our proposed framework that can automatically generate workflows.

5.3 Benchmark Datasets

We conduct experiments on a benchmark dataset, OpenAGI [7]. The OpenAGI benchmark tasks are
categorized based on their output type and ground-truth label type (Task 1, 2, and 3). Then, based

6
Metrics / Task Zero CoT Few CoRE AutoFlow (GPT) AutoFlow (Mixtral)
Task 1 (CLIP Score) 0.0 0.0 0.1839 0.1825 0.2441 0.1831
Task 2 (BERT Score) 0.1092 0.1987 0.0687 0.2593 0.3017 0.3133
Task 3 (ViT Score) 0.1949 0.1562 0.5501 0.2437 0.5720 0.4907
Average over tasks 0.1206 0.1736 0.1887 0.2483 0.3597 0.3442
Table 1: Performance on OpenAGI when using the open-source LLM, Mixtral, as the LLM interpreter
for all tasks and learning schema. Zero is for Zero-shot Learning, Few is for Few-shot Learning. The
boldface numbers denote the highest score under each task type using the same LLM.
Metrics / Task Zero CoT Few CoRE AutoFlow (GPT) AutoFlow (Mixtral)
Task 1 (CLIP Score) 0.0 0.2732 0.3055 0.1368 0.3049 0.3032
Task 2 (BERT Score) 0.2076 0.2266 0.6307 0.6505 0.6628 0.7014
Task 3 (ViT Score) 0.5058 0.6736 0.6480 0.6480 0.6899 0.6119
Average over tasks 0.2378 0.3359 0.5281 0.6104 0.6415 0.6501
Table 2: Performance on OpenAGI using the closed-source LLM, GPT-4, as the LLM interpreter for
all tasks and learning schema. Zero is for Zero-shot Learning, Few is for Few-shot Learning. The
boldface numbers denote the highest score under each task type using the same LLM.

on different task types, different metrics are employed to gauge the performance: CLIP Score [10],
assessing the similarity between text and image, is utilized for Text-to-Image tasks (Task 1); BERT
Score [54], evaluating text generation with BERT, is applied when both data labels and the expected
outputs are texts (Task 2); and ViT Score [42] gauges the similarity between the image label and
image output (Task 3).

5.4 Implementation Details

Our framework and all baselines are implemented by PyTorch, an open-source library. We follow
the implementation setting of the OpenAGI platform [7] for Zero-shot and few-shot learnings. We
leverage the DSPy framework [19, 20] to apply the CoT strategy to the OpenAGI platform. We
also tried Program-of-Thought [5] and ReAct [51] strategies on the OpenAGI platform. However,
the ReAct strategy requires text observation, which is unsuitable for our OpenAGI task since some
observations are in image format, and Program-of-Thought cannot generate executable codes. Thus,
we did not include them as the baselines.
For the hyper-parameter setting of the AutoFlow framework, we set the number of iterations for the
workflow generator LLM as 30. For the open-source LLM, Mixtral, as the generator LLM, we use
the REINFORCE [41] as the core reinforcement learning (RL) algorithm for the generator LLM,
with the average score on the training dataset as the reward. We use Adam as the optimizer with the
learning rate at 0.001 for RL. Also, we apply Low-Rank Adaptation (LoRA) [11] with the rank equal
to 8 to Mixtral for efficient fine-tuning.

5.5 Experimental Analysis

We conduct the experiments on the OpenAGI [7] benchmark dataset. For a fair comparison, we
show the results using the same workflow interpreter LLM in a table. Specifically, the results of
using the open-source LLM, Mixtral, as the LLM interpreter is shown in Table 1; and the results of
using the closed-source LLM, GPT-4, as the LLM interpreter is shown in Table 2. Each row stands
for a type of task, each column represents the planning schema of an LLM interpreter. From these
two tables, we can see that, after applying our AutoFlow framework, the average score over tasks is
significantly better than the baselines. Compared to the best baseline, CoRE, AutoFlow has over 40%
improvement when using Mixtral as the LLM interpreter, and over 5% improvement when using
GPT-4 as the interpreter LLM. For the score of each type of task, our AutoFlow also reaches the
highest one. Thus, the experiment results validate that AutoFlow is effective and can generate a
workflow with better performance than manually designed ones.
An interesting observation is that, the best average score when using Mixtral as the LLM interpreter,
is AutoFlow with GPT-4 as the workflow generator; and the best average score when using GPT-4 as
the LLM interpreter, is AutoFlow with Mixtral as the workflow generator. This observation suggests
that the combination of different systems (Mixtral and GPT-4) for the LLM interpreter and workflow
generator might lead to a kind of synergistic effect where the strengths of one system complement
the weaknesses of the other, which helps to better solve complex multi-step tasks.

7
6 Conclusions and Future Work
In this study, we introduce the AutoFlow framework to use Large Language Models (LLMs) for
automatically generating effective workflows for agents. We propose two learning methods for
AutoFlow, the fine-tuning method when using open-source LLM as workflow generator, and the
in-context learning method when using closed-source LLM as workflow generator. Compared to
manually designed workflows, automatically generated workflows can reach better performance and
significantly reduce the human labor, leading to a higher degree of automation.
Although AutoFlow demonstrates promising results, there is still space for improvement. For example,
the learning process for the workflow generator LLM uses reinforcement learning, which may not be
the most efficient compared to some gradient-based methods or few-shot learning methods. Future
studies may try to evaluate the efficacy of other learning methods. Another example in the AutoFlow
framework is that, the workflow generator and interpreter LLMs work together using a collaborative
learning paradigm. Instead, we may try other learning paradigms such as the teacher-student paradigm
or the adversarial learning paradigm.

References
[1] Maciej Besta, Nils Blach, Ales Kubicek, Robert Gerstenberger, Michal Podstawski, Lukas
Gianinazzi, Joanna Gajda, Tomasz Lehmann, Hubert Niewiadomski, Piotr Nyczyk, et al. 2024.
Graph of thoughts: Solving elaborate problems with large language models. In Proceedings of
the AAAI Conference on Artificial Intelligence, Vol. 38. 17682–17690.
[2] Daniil A. Boiko, Robert MacKnight, and Gabe Gomes. 2023. Emergent autonomous scientific
research capabilities of large language models. arXiv:2304.05332 [physics.chem-ph]
[3] Yuzhe Cai, Shaoguang Mao, Wenshan Wu, Zehua Wang, Yaobo Liang, Tao Ge, Chenfei Wu,
Wang You, Ting Song, Yan Xia, Jonathan Tien, Nan Duan, and Furu Wei. 2024. Low-code
LLM: Graphical User Interface over Large Language Models. arXiv:2304.08103 [cs.CL]
[4] Vitor Cerqueira, Nuno Moniz, and Carlos Soares. 2021. Vest: Automatic feature engineering
for forecasting. Machine Learning (2021), 1–23.
[5] Wenhu Chen, Xueguang Ma, Xinyi Wang, and William W. Cohen. 2023. Program of Thoughts
Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks. Trans-
actions on Machine Learning Research (2023).
[6] Matthias Feurer, Aaron Klein, Katharina Eggensperger, Jost Springenberg, Manuel Blum, and
Frank Hutter. 2015. Efficient and Robust Automated Machine Learning. In Advances in Neural
Information Processing Systems, C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett
(Eds.), Vol. 28. Curran Associates, Inc. https://proceedings.neurips.cc/paper/2015/
file/11d0e6287202fced83f79975ec59a3a6-Paper.pdf
[7] Yingqiang Ge, Wenyue Hua, Kai Mei, Jianchao Ji, Juntao Tan, Shuyuan Xu, Zelong Li, and
Yongfeng Zhang. 2023. OpenAGI: When LLM Meets Domain Experts. In Advances in Neural
Information Processing Systems (NeurIPS) (2023).
[8] Yingqiang Ge, Yujie Ren, Wenyue Hua, Shuyuan Xu, Juntao Tan, and Yongfeng Zhang. 2023.
LLM as OS, Agents as Apps: Envisioning AIOS, Agents and the AIOS-Agent Ecosystem.
arXiv e-prints (2023), arXiv–2312.
[9] Shibo Hao, Yi Gu, Haodi Ma, Joshua Jiahua Hong, Zhen Wang, Daisy Zhe Wang, and Zhiting
Hu. 2023. Reasoning with language model is planning with world model. arXiv preprint
arXiv:2305.14992 (2023).
[10] Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. 2021. CLIPScore:
A Reference-free Evaluation Metric for Image Captioning.
[11] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu
Wang, and Weizhu Chen. 2021. LoRA: Low-Rank Adaptation of Large Language Models.
arXiv:2106.09685 [cs.CL]

8
[12] Siyu Huang, Xi Li, Zhi-Qi Cheng, Zhongfei Zhang, and Alexander Hauptmann. 2018. Gnas:
A greedy neural architecture search method for multi-attribute learning. In Proceedings of the
26th ACM international conference on Multimedia. 2049–2057.
[13] Frank Hutter, Lars Kotthoff, and Joaquin Vanschoren. 2019. Automated machine learning:
methods, systems, challenges. Springer Nature.
[14] Albert Q Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris
Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand,
et al. 2024. Mixtral of experts. arXiv preprint arXiv:2401.04088 (2024).
[15] Ana Jojic, Zhen Wang, and Nebojsa Jojic. 2023. Gpt is becoming a turing machine: Here are
some ways to program it. arXiv preprint arXiv:2303.14310 (2023).
[16] Martin Josifoski, Lars Klein, Maxime Peyrard, Yifei Li, Saibo Geng, Julian Paul Schnitzler,
Yuxing Yao, Jiheng Wei, Debjit Paul, and Robert West. 2023. Flows: Building Blocks of
Reasoning and Collaborating AI. arXiv:2308.01285 [cs.AI]
[17] James Max Kanter and Kalyan Veeramachaneni. 2015. Deep feature synthesis: Towards
automating data science endeavors. In 2015 IEEE international conference on data science and
advanced analytics (DSAA). IEEE, 1–10.
[18] Gilad Katz, Eui Chul Richard Shin, and Dawn Song. 2016. Explorekit: Automatic feature
generation and selection. In 2016 IEEE 16th International Conference on Data Mining (ICDM).
IEEE, 979–984.
[19] Omar Khattab, Keshav Santhanam, Xiang Lisa Li, David Hall, Percy Liang, Christopher Potts,
and Matei Zaharia. 2022. Demonstrate-Search-Predict: Composing Retrieval and Language
Models for Knowledge-Intensive NLP. arXiv preprint arXiv:2212.14024 (2022).
[20] Omar Khattab, Arnav Singhvi, Paridhi Maheshwari, Zhiyuan Zhang, Keshav Santhanam, Sri
Vardhamanan, Saiful Haq, Ashutosh Sharma, Thomas T. Joshi, Hanna Moazam, Heather Miller,
Matei Zaharia, and Christopher Potts. 2023. DSPy: Compiling Declarative Language Model
Calls into Self-Improving Pipelines. arXiv preprint arXiv:2310.03714 (2023).
[21] Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. 2022.
Large language models are zero-shot reasoners. Advances in neural information processing
systems 35 (2022), 22199–22213.
[22] Lars Kotthoff, Chris Thornton, Holger H Hoos, Frank Hutter, and Kevin Leyton-Brown. 2019.
Auto-WEKA: Automatic model selection and hyperparameter optimization in WEKA. In
Automated Machine Learning. Springer, Cham, 81–95.
[23] Guohao Li, Hasan Abed Al Kader Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard
Ghanem. 2023. CAMEL: Communicative Agents for "Mind" Exploration of Large Language
Model Society. In Thirty-seventh Conference on Neural Information Processing Systems.
[24] Xinyi Li, Yongfeng Zhang, and Edward C Malthouse. 2024. Large Language Model Agent for
Fake News Detection. arXiv preprint arXiv:2405.01593 (2024).
[25] Zelong Li, Wenyue Hua, Hao Wang, He Zhu, and Yongfeng Zhang. 2024. Formal-LLM:
Integrating Formal Language and Natural Language for Controllable LLM-based Agents.
arXiv:2402.00798 (2024).
[26] Zelong Li, Jianchao Ji, Yingqiang Ge, and Yongfeng Zhang. 2022. AutoLossGen: Automatic
Loss Function Generation for Recommender Systems. SIGIR (2022).
[27] Bo Liu, Yuqian Jiang, Xiaohan Zhang, Qiang Liu, Shiqi Zhang, Joydeep Biswas, and Peter
Stone. 2023. Llm+ p: Empowering large language models with optimal planning proficiency.
arXiv preprint arXiv:2304.11477 (2023).
[28] Hanxiao Liu, Karen Simonyan, and Yiming Yang. 2019. DARTS: Differentiable Architecture
Search. In International Conference on Learning Representations. https://openreview.
net/forum?id=S1eYHoC5FX

9
[29] Qing Lyu, Shreya Havaldar, Adam Stein, Li Zhang, Delip Rao, Eric Wong, Marianna Apidi-
anaki, and Chris Callison-Burch. 2023. Faithful chain-of-thought reasoning. arXiv preprint
arXiv:2301.13379 (2023).

[30] Kai Mei, Zelong Li, Shuyuan Xu, Ruosong Ye, Yingqiang Ge, and Yongfeng Zhang. 2024.
AIOS: LLM Agent Operating System. arXiv (2024).

[31] Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese,
and Caiming Xiong. 2022. Codegen: An open large language model for code with multi-turn
program synthesis. arXiv preprint arXiv:2203.13474 (2022).

[32] Josh et al OpenAI. 2023. GPT-4 Technical Report. arXiv:2303.08774 [cs.CL]

[33] Hieu Pham, Melody Guan, Barret Zoph, Quoc Le, and Jeff Dean. 2018. Efficient neural
architecture search via parameters sharing. In International Conference on Machine Learning.
PMLR, 4095–4104.

[34] Gabriel Poesia, Oleksandr Polozov, Vu Le, Ashish Tiwari, Gustavo Soares, Christopher Meek,
and Sumit Gulwani. 2022. Synchromesh: Reliable code generation from pre-trained language
models. arXiv preprint arXiv:2201.11227 (2022).

[35] Chen Qian, Xin Cong, Wei Liu, Cheng Yang, Weize Chen, Yusheng Su, Yufan Dang, Jiahao
Li, Juyuan Xu, Dahai Li, Zhiyuan Liu, and Maosong Sun. 2023. Communicative Agents for
Software Development. arXiv:2307.07924 [cs.SE]

[36] Esteban Real, Alok Aggarwal, Yanping Huang, and Quoc V. Le. 2019. Regularized Evolution
for Image Classifier Architecture Search. Proceedings of the AAAI Conference on Artificial
Intelligence 33, 01 (Jul. 2019), 4780–4789. https://doi.org/10.1609/aaai.v33i01.
33014780

[37] Frank Hutter Thomas Elsken, Jan Hendrik Metzen. 2018. Simple and efficient architec-
ture search for Convolutional Neural Networks. https://openreview.net/forum?id=
SySaJ0xCZ

[38] Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen,
Jiakai Tang, Xu Chen, Yankai Lin, Wayne Xin Zhao, Zhewei Wei, and Ji-Rong Wen. 2023. A
Survey on Large Language Model based Autonomous Agents. arXiv:2308.11432 [cs.AI]

[39] Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha
Chowdhery, and Denny Zhou. 2022. Self-consistency improves chain of thought reasoning in
language models. arXiv preprint arXiv:2203.11171 (2022).

[40] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le,
Denny Zhou, et al. 2022. Chain-of-thought prompting elicits reasoning in large language models.
Advances in neural information processing systems 35 (2022), 24824–24837.

[41] Ronald J Williams. 1992. Simple statistical gradient-following algorithms for connectionist
reinforcement learning. Machine learning 8 (1992), 229–256.

[42] Bichen Wu, Chenfeng Xu, Xiaoliang Dai, Alvin Wan, Peizhao Zhang, Zhicheng Yan, Masayoshi
Tomizuka, Joseph Gonzalez, Kurt Keutzer, and Peter Vajda. 2020. Visual Transformers: Token-
based Image Representation and Processing for Computer Vision. arXiv:2006.03677 [cs.CV]

[43] Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun
Zhang, Shaokun Zhang, Jiale Liu, Ahmed Hassan Awadallah, Ryen W White, Doug Burger,
and Chi Wang. 2023. AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent
Conversation. arXiv:2308.08155 [cs.AI]

[44] Yiran Wu, Tianwei Yue, Shaokun Zhang, Chi Wang, and Qingyun Wu. 2024. StateFlow:
Enhancing LLM Task-Solving through State-Driven Workflows. arXiv:2403.11322 [cs.CL]

10
[45] Zhiheng Xi, Wenxiang Chen, Xin Guo, Wei He, Yiwen Ding, Boyang Hong, Ming Zhang,
Junzhe Wang, Senjie Jin, Enyu Zhou, Rui Zheng, Xiaoran Fan, Xiao Wang, Limao Xiong,
Yuhao Zhou, Weiran Wang, Changhao Jiang, Yicheng Zou, Xiangyang Liu, Zhangyue Yin,
Shihan Dou, Rongxiang Weng, Wensen Cheng, Qi Zhang, Wenjuan Qin, Yongyan Zheng,
Xipeng Qiu, Xuanjing Huang, and Tao Gui. 2023. The Rise and Potential of Large Language
Model Based Agents: A Survey. arXiv:2309.07864 [cs.AI]
[46] Jian Xie, Kai Zhang, Jiangjie Chen, Tinghui Zhu, Renze Lou, Yuandong Tian, Yanghua Xiao,
and Yu Su. 2024. Travelplanner: A benchmark for real-world planning with language agents.
arXiv preprint arXiv:2402.01622 (2024).
[47] Shuyuan Xu, Zelong Li, Kai Mei, and Yongfeng Zhang. 2024. CoRE: LLM as Interpreter for
Natural Language Programming, Pseudo-Code Programming, and Flow Programming of AI
Agents. arXiv:2405.06907 [cs.CL]
[48] Quanming Yao, Mengshuo Wang, Yuqiang Chen, Wenyuan Dai, Yu-Feng Li, Wei-Wei Tu,
Qiang Yang, and Yang Yu. 2018. Taking human out of learning applications: A survey on
automated machine learning. arXiv preprint arXiv:1810.13306 (2018).
[49] Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik
Narasimhan. 2024. Tree of thoughts: Deliberate problem solving with large language models.
Advances in Neural Information Processing Systems 36 (2024).
[50] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan
Cao. 2022. React: Synergizing reasoning and acting in language models. arXiv preprint
arXiv:2210.03629 (2022).
[51] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan
Cao. 2023. ReAct: Synergizing Reasoning and Acting in Language Models. In International
Conference on Learning Representations (ICLR).
[52] Siyu Yuan, Kaitao Song, Jiangjie Chen, Xu Tan, Yongliang Shen, Ren Kan, Dongsheng Li, and
Deqing Yang. 2024. EASYTOOL: Enhancing LLM-based Agents with Concise Tool Instruction.
arXiv preprint arXiv:2401.06201 (2024).
[53] Zhen Zeng, William Watson, Nicole Cho, Saba Rahimi, Shayleen Reynolds, Tucker Balch,
and Manuela Veloso. 2024. FlowMind: Automatic Workflow Generation with LLMs.
arXiv:2404.13050 [cs.CL]
[54] Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. 2020.
BERTScore: Evaluating Text Generation with BERT.

11

You might also like