
Self-Organized Agents: A LLM Multi-Agent Framework toward

Ultra Large-Scale Code Generation and Optimization

Yoichi Ishibashi, TsukushiAI ([email protected])
Yoshimasa Nishimura, TsukushiAI ([email protected])

arXiv:2404.02183v1 [cs.SE] 2 Apr 2024

Abstract

Recent advancements in automatic code generation using large language model (LLM) agents have brought us closer to the future of automated software development. However, existing single-agent approaches face limitations in generating and improving large-scale, complex codebases due to constraints in context length. To tackle this challenge, we propose the Self-Organized multi-Agent framework (SoA), a novel multi-agent framework that enables the scalable and efficient generation and optimization of large-scale code. In SoA, self-organized agents operate independently to generate and modify code components while seamlessly collaborating to construct the overall codebase. A key feature of our framework is the automatic multiplication of agents based on problem complexity, allowing for dynamic scalability. This enables the overall code volume to be increased indefinitely according to the number of agents, while the amount of code managed by each agent remains constant. We evaluate SoA on the HumanEval benchmark and demonstrate that, compared to a single-agent system, each agent in SoA handles significantly less code, yet the overall generated code is substantially greater. Moreover, SoA surpasses the powerful single-agent baseline by 5% in terms of Pass@1 accuracy. Our code will be available at https://github.com/tsukushiAI/self-organized-agent.

Figure 1: Left (single agent): A single agent is solely responsible for the entire implementation. As the codebase grows larger, the load increases for code generation, modification, and memory management, making it difficult to manage and develop. The larger the entire codebase becomes, the more it puts pressure on the context length during self-debugging, limiting the amount of code that can be managed. Right (SoA): The implementation is distributed among multiple agents. The agents are independent; code generation, modification, and memory management are separated from other agents. Each agent manages only its own part, allowing it to focus on the implementation regardless of the complexity of the entire codebase. Furthermore, agents automatically multiply according to the complexity of the problem. This allows for the generation and modification of complex and large-scale code while maintaining a constant amount of code management/generation/modification per agent.

1 Introduction
In recent years, research on agents using Large Language Models (LLMs) (Brown et al., 2020; OpenAI, 2023; Touvron et al., 2023), such as ReAct (Yao et al., 2023b), Reflexion (Shinn et al., 2023), Toolformer (Schick et al., 2023), and AutoGPT (https://github.com/Significant-Gravitas/), has been expanding the possibilities of automating human tasks. These advancements have particularly contributed to the rapid development of automatic code generation techniques in the field of automated application and tool development (Hong et al., 2023; Dong et al., 2023; Huang et al., 2023). Compared to non-agent-based methods (Muennighoff et al., 2023; Li et al., 2023b), these research achievements have led to remarkable performance improvements in automatic code generation (Zhong et al., 2024; Zhou et al., 2023).

Most recent research has focused on single-agent approaches for code generation. These single-agent code generation methods face limitations, especially in terms of scalability, when the implementation becomes complex and requires a large codebase. The main reason for this technical difficulty is that a single agent must manage the entire code generation process alone. For instance, implementing a machine learning algorithm involves several stages, such as data preprocessing, algorithm training, and result evaluation, which include many functions and classes. When these complex components are combined, the codebase inevitably becomes very large. However, there are limitations to the context length of LLMs, and as the number of input tokens increases, the inference performance decreases (Levy et al., 2024; Shaham et al., 2023; Li et al., 2023a). Consistently understanding and generating or modifying appropriate code for such an extensive codebase poses a significant challenge for a single agent in terms of comprehending and managing the context. Consequently, the single-agent approach struggles to efficiently generate and modify code as its complexity and size increase.

To tackle these challenges, we propose a self-organized multi-agent framework that can automatically generate and modify large-scale code (Figure 1). Self-organization (Ashby, 1947) is a phenomenon in which living organisms or matter create an orderly, large structure as a result of their individual autonomous behaviors, despite lacking the ability to oversee the entire system. In our framework, self-organized agents, each responsible for different code parts or tasks, independently generate and modify code. With the self-organization of agents, a single agent no longer needs to comprehend the entire codebase, making it possible to scale up large-scale code simply by increasing the number of agents. Another feature of our framework is that agents automatically multiply according to the complexity of the problem, allowing the overall codebase to expand while keeping the amount of code handled by each agent constant. These features enable the dynamic and flexible generation and modification of large-scale code, which was impossible with the traditional single-agent approach.

In our experiments, we evaluated the performance of this framework using HumanEval (Chen et al., 2021), a benchmark for code generation. The results show that our self-organized multi-agent framework outperformed Reflexion (Shinn et al., 2023), an existing powerful code generation agent (§4.1), demonstrating the effectiveness of our approach in generating and modifying code. Furthermore, through a detailed analysis of the experimental results, we revealed how agents automatically multiply according to the complexity of the problem, effectively scaling up the overall code volume while keeping the code generation per agent constant (§4.2). These experimental results support the contribution of our framework, which overcomes the scalability issues faced by single-agent approaches and provides a solution capable of handling larger projects.

2 Code Generation Task

The code generation task involves generating Python functions from docstrings (Chen et al., 2021). In this task, an agent is given a docstring that defines the types of the function's inputs and expected outputs, as well as the specific requirements that the function should meet. The agent is then required to generate the code for a function that fulfills the specified functionality. The generated code is verified for accuracy using unit tests, and the quality of the code is evaluated based on its ability to pass the test cases. As with previous studies (Shinn et al., 2023; Zhong et al., 2024; Zhou et al., 2023), we use the evaluation metric Pass@1 (Chen et al., 2021), where a problem is considered solved if any of the k code samples pass all test cases.
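For illustration, a problem in this setting pairs a function signature and docstring with hidden unit tests. The sketch below is paraphrased in the style of a HumanEval item; the docstring wording and the specific tests are illustrative rather than quoted from the benchmark, and the function body stands in for what the agent must produce.

```python
# The agent receives only the signature and docstring and must fill in the body.
def has_close_elements(numbers, threshold):
    """Check if the given list of numbers contains any two numbers
    closer to each other than the given threshold."""
    # --- body to be generated by the agent ---
    for i, a in enumerate(numbers):
        for b in numbers[i + 1:]:
            if abs(a - b) < threshold:
                return True
    return False

# Hidden unit tests decide whether the problem counts as solved. Under Pass@1
# (k = 1), the single generated sample must pass every test.
assert has_close_elements([1.0, 2.8, 3.0, 4.0], 0.3) == True
assert has_close_elements([1.0, 2.0, 3.0], 0.5) == False
```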
3 Self-organized Agent Framework

Our Self-organized Agents (SoA) framework enables efficient implementation of large-scale and complex code by having self-organized agents independently generate and modify small-scale and simple code. In this section, we introduce the important components of SoA, namely the agents and the layers responsible for more abstract processing than the agents, and finally introduce the code generation and modification protocols in the SoA framework.

3.1 Child Agent

Child agents implement a given function based on its docstrings. As shown in Figure 2, this agent has a simple structure consisting of two elements: an LLM and memory. The LLM generates code from the given docstrings and modifies the code based on the results of unit tests. The memory stores the code generated by the agent itself and retrieves the latest code to be input to the LLM along with the unit test feedback during code modification. If an agent has these minimal specifications, it is possible to use an off-the-shelf agent (e.g., Reflexion) as a Child agent. We deliberately use a simple agent to verify the effectiveness of SoA in a simple setup.
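A minimal sketch of such a Child agent is shown below. The two elements (LLM and memory) and the use of unit-test feedback plus the observed Mother state follow the description above and Figure 3; the llm callable, the prompt wording, and the memory layout are assumptions made for illustration, not the paper's released implementation.

```python
# Minimal sketch of a Child agent: an LLM plus a memory.
# `llm` stands in for a chat-completion call (e.g., to GPT-3.5-turbo).
class ChildAgent:
    def __init__(self, llm, docstring, unit_tests):
        self.llm = llm
        self.docstring = docstring      # specification of the assigned function
        self.unit_tests = unit_tests    # stored for later self-debugging
        self.memory = []                # history of generated code
        self.children = []              # empty for a Child; used by Mothers

    def generate(self):
        """Generate code from the docstring and store it in memory."""
        code = self.llm(f"Implement this function:\n{self.docstring}")
        self.memory.append(code)
        return code

    def modify(self, test_result, mother_state=None):
        """Fix the latest code using unit-test feedback and, unlike a purely
        self-reflective agent, the observed state of the Mother agent
        (her feedback and her code before/after modification)."""
        latest = self.memory[-1]
        prompt = (
            f"Current code:\n{latest}\n"
            f"Unit tests:\n{self.unit_tests}\n"
            f"Test result:\n{test_result}\n"
            f"Observations from the Mother agent:\n{mother_state}\n"
            "Propose a fixed implementation."
        )
        fixed = self.llm(prompt)
        self.memory.append(fixed)       # feedback loop: keep the newest code
        return fixed
```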
Code Generation. The main role of Child agents is to generate functions that meet the specifications based on the given function's docstrings. As shown in Figure 2, the agent follows the instructions to generate the rest of the function and complete it. The completed function implementation is stored in memory, and the unit tests for the function are also stored as they form the basis for future code modifications.

Code Modification: Empowering Child Agents with Self-Organization and Adaptability. One of the most remarkable aspects of agents in the SoA framework is their ability to autonomously improve their code based on the state of nearby agents. This process sets SoA apart from traditional agent approaches and showcases the power of self-organization in code modification. While existing agents like Reflexion (Shinn et al., 2023) rely solely on the results of unit tests, Child agents in SoA go beyond this limitation by independently observing the state of their mother agent, such as differences in modifications and feedback. By gathering this invaluable information from their surrounding environment, Child agents can adapt their behavior and make more informed decisions about code modification, even without explicit instructions. The modifications and feedback generated by the Mother agent serve as an important source of information for the Child agents. Armed with these insights, Child agents can more effectively modify their own code, contributing to the overall improvement of the codebase in a way that is both efficient and adaptive. Figure 3 illustrates this process, which begins with the execution of unit tests and the retrieval of the latest implementation from memory. The Child agent then harnesses the power of the LLM to create a code modification proposal, seamlessly combining the information observed from the Mother agent with the test results and the latest implementation details. By storing the modified code in memory, Child agents create a feedback loop that continuously refines and improves the codebase over time. This iterative process, driven by the principles of self-organization and adaptability, enables SoA to tackle complex code modification tasks with efficiency and effectiveness. As Child agents work in harmony with their Mother agent, they contribute to the creation of a more optimized and large codebase.

3.2 Mother Agent

The Mother is an agent that generates new agents (Mother or Child). Similar to Child agents, the Mother agent independently implements the specific Python function based on its given docstrings. The Mother has memory, code generation capabilities, and self-debugging functions, the same as Child agents. The unique feature of the Mother agent is its ability to generate multiple Child agents according to the complexity of the problem and delegate parts of the implementation to these agents. This structure allows the Mother agent to focus on implementing abstract processes, while the Child agents generated by the Mother agent concentrate on implementing concrete processes. This division of labor enhances the overall efficiency and flexibility of the SoA framework.

Code Generation. We explain the code generation process by the Mother agent using the implementation example of the is_sum_of_odds_ten function shown in Figure 2. The starting point is the function's docstrings and unit tests, which are memorized for reference in the later self-debugging phase. The first task of the Mother agent is to generate a skeleton of the implementation from the given docstrings, including subfunctions such as get_odd_numbers to extract odd numbers and sum_numbers to calculate their sum. The number and types of these subfunctions are automatically determined by the LLM based on the complexity of the problem.

It is important to note that these subfunctions are unimplemented, and the Mother agent does not directly implement them. Instead, it delegates the implementation of the subfunctions to other agents, allowing the Mother agent to focus on generating the skeleton and streamline its own code generation process. After the docstrings and unit tests for the subfunctions are generated, they are assigned to newly initialized agents for implementation. These agents proceed with the implementation of their respective functions without looking at the internals of the is_sum_of_odds_ten function implemented by the Mother agent. Since agents within the same Mother can work asynchronously, the overall code generation process is streamlined.
Figure 2: Overview of code generation. Child agents generate an executable Python function from a given docstring. The Mother agent generates the skeleton of the function. The Mother spawns a new initialized agent (Child or Mother) and delegates the unimplemented functions to it.
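To make the delegation concrete, Figure 2's example can be written out as follows. The skeleton and the subtask specifications are taken from the figure; the Boolean return reflects the corrected version shown after the feedback step in Figure 3 (the Mother's first draft returns "Yes"/"No"), and the way the subtasks are packaged into a Python list is an illustrative choice.

```python
# Skeleton produced by the Mother for Figure 2's example. The subfunctions are
# left unimplemented by the Mother and are delegated to newly spawned agents.
def is_sum_of_odds_ten(lst):
    """Checks if sum of odd numbers is 10."""
    odd_numbers = get_odd_numbers(lst)      # implemented by another agent
    sum_odds = sum_numbers(odd_numbers)     # implemented by another agent
    return True if sum_odds == 10 else False

# Subtask specifications the Mother generates for its children (STEP 3 in
# Figure 2). A spawned agent sees only its own docstring and unit test,
# never the skeleton above.
subtasks = [
    ('def get_odd_numbers(lst):\n    """Extracts the odd numbers."""',
     'assert get_odd_numbers([1, 2, 3]) == [1, 3]'),
    ('def sum_numbers(lst):\n    """Calculates the sum of numbers."""',
     'assert sum_numbers([1, 3]) == 4'),
]
```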

Code Modification. The Mother's code modification is almost the same as the Child's code modification (Figure 3). It observes information from the upper Mother and uses it to modify the functions it is responsible for. The only difference is that the feedback it generates and the code before and after modification are used by lower-level agents (Child or Mother).

3.3 Self-organized Agent Process

The Self-organized Agent (SoA) framework is a distributed framework in which multiple agents (including Mother agents and Child agents) repeatedly generate and modify functions. The core of this framework lies in the principle of self-organization, where each agent functions independently without the need to directly observe the entire codebase. The hierarchical combination of Mother agents and Child agents forms an agent network that effectively constructs a single large-scale codebase. In this hierarchical structure, Mother agents decompose complex problems into more manageable smaller problems by dividing tasks and delegating them to the agents they have generated. Although each agent is independent, the agents as a whole can work efficiently towards the implementation of a single function. Despite the fact that the amount of code each agent generates, modifies, and manages is always small, the number of agents scales, allowing the amount of code generated to be increased indefinitely according to the difficulty of the problem. Detailed algorithms are presented in Algorithm 1 in the appendix.

Code Generation. The code generation process in the SoA framework begins with the function's docstrings and unit tests. In the initial stage, there is only one initialized Mother agent, which is the root of the tree structure. Based on the input docstrings and unit tests, it generates docstrings and unit tests for subtasks and passes them to other agents it generates (see §3.2). If the tree structure reaches a predetermined depth, the tasks are passed to Child agents; otherwise, they are passed to newly generated Mother agents. By repeatedly proliferating and increasing the number of agents until the last agent, it is possible to generate large-scale code while keeping the amount of code managed by individual agents constant.

Code Modification. Once code generation is complete, the process transitions to the code modification phase. First, the implementations of all agents are combined to create the final implementation. This final implementation is evaluated using the unit tests provided to the root Mother, and feedback is generated from the results. Since there are no agents higher than this root Mother, information from higher-level agents as shown in Figure 3 is not used. The modification process starts based on this feedback and propagates information from the root Mother agent to the Child agents. Each agent updates its implementation based on the received feedback, generates new feedback, and transmits it to lower-level agents (see §3.2). Finally, the Child agents update their own implementations, and the process terminates (see §3.1). This series of processes is repeated until a predetermined maximum number of iterations is reached.
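A minimal sketch of one such modification round is given below. The control flow (combine all implementations, evaluate with the root's unit tests, propagate feedback and code changes downward) follows the description above; the helper functions and the agent interface (memory, unit_tests, children, modify) are illustrative assumptions, reusing the Child-agent sketch shown earlier.

```python
def combine_implementations(agents):
    """Concatenate every agent's latest function into one candidate program."""
    return "\n\n".join(agent.memory[-1] for agent in agents)

def run_unit_tests(code, unit_tests):
    """Execute the code together with its assert-style tests."""
    try:
        namespace = {}
        exec(code + "\n" + "\n".join(unit_tests), namespace)
        return "passed"
    except Exception as exc:
        return f"failed: {exc!r}"

def modification_round(root_mother, root_unit_tests, all_agents):
    # 1. Combine all agents' code and evaluate it with the root's unit tests.
    final_implementation = combine_implementations(all_agents)
    test_result = run_unit_tests(final_implementation, root_unit_tests)
    # 2. The root Mother has no upper agent, so it observes no upper state.
    propagate(root_mother, test_result, upper_state=None)

def propagate(agent, test_result, upper_state):
    # Each agent fixes its own function from the test result plus the observed
    # state of its upper agent (self-feedback, old code, new code) ...
    state = agent.modify(test_result, upper_state)
    # ... and passes its own state down to the agents it spawned.
    for child in getattr(agent, "children", []):
        child_result = run_unit_tests(child.memory[-1], child.unit_tests)
        propagate(child, child_result, state)
```

One call to modification_round corresponds to a single iteration of the loop; in the experiments (§4) the loop runs for at most 8 iterations.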
Figure 3: Overview of code modification. Agents (Mother/Child) observe the state of the Mother (feedback, old code, and updated code) and use this information to improve the functions for which they are responsible. The state of the upper agent is used to modify code by lower agents within the hierarchy. This state propagation promotes collaboration and information sharing throughout the hierarchy, enabling efficient code modification.

4 Experiments

LLM Selection. We used GPT-3.5-turbo (gpt3.5-turbo-1106) for code generation and feedback generation; GPT-4 was not selected due to the high experimental cost required.

Baselines. We compare SoA with several state-of-the-art code generation methods including AlphaCode (Li et al., 2022), Incoder (Fried et al., 2023), Codex (Chen et al., 2021), CoT (Wei et al., 2022), and Gemini Pro (Anil et al., 2023). Additionally, we evaluate the performance of various GPT-3.5-based agents, such as ChatGPT, Self-Edit (Zhang et al., 2023), and Reflexion (Shinn et al., 2023). These baselines are chosen to represent a diverse range of approaches, including single-agent and multi-agent systems, as well as those with and without self-debugging capabilities.

Agent Configuration. To evaluate the effectiveness of the SoA framework, we selected the Reflexion agent as a baseline. Reflexion iteratively modifies code based on the given docstrings and automatically generated unit tests until it reaches the maximum number of iterations or passes the unit tests. The main difference between Reflexion and SoA is that Reflexion is composed of a single agent, while SoA is composed of self-organized multiple agents. In the SoA configuration, we set the maximum number of iterations for the learning loop to 8 and the maximum tree depth to 2. Additionally, following Shinn et al. (2023), we provided a few-shot trajectory to the LLM.

Data and Tasks. To evaluate the performance of automatic code generation, we used the HumanEval (Chen et al., 2021) benchmark. HumanEval is a set that includes diverse programming problems designed to measure the functional correctness of generated code. We used the Python language set for evaluation and followed the evaluation methodology of Reflexion (Shinn et al., 2023). In this process, multiple test cases are created for each generated code, and n test cases are randomly selected to construct a test suite. This test suite is used to verify whether the generated code functions correctly. We set 6 unit tests for Reflexion and 1 unit test for SoA.
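As a sketch of what this verification step might look like (the paper does not publish its harness; the function names and the use of assert-style tests here are assumptions), the internal test suite can be assembled and checked as follows:

```python
import random

def build_test_suite(candidate_tests, n, seed=0):
    """Randomly select n of the automatically generated test cases
    (n = 6 for Reflexion and n = 1 for SoA in the experiments)."""
    rng = random.Random(seed)
    return rng.sample(candidate_tests, n)

def functions_correctly(code, test_suite):
    """Run the generated code against the sampled assert-style tests;
    this is the signal the agents use during self-debugging."""
    try:
        namespace = {}
        exec(code, namespace)              # define the generated function(s)
        for test in test_suite:
            exec(test, namespace)          # each test is an assert statement
        return True
    except Exception:
        return False
```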
4.1 Main Results

Table 1 compares the Pass@1 accuracy of the proposed method and the baselines. Comparing SoA with Reflexion, a strong baseline, SoA outperforms Reflexion by 5% in Pass@1. Considering that each agent in SoA does not see the entire code, this is a surprising result. This result suggests that self-organized agents can generate code that functions well as a whole, without needing to oversee the entire code, by independently implementing the functions assigned to them.
Figure 4: Overview of the SoA framework. Mother agents and Child agents hierarchically construct a network and perform function generation and modification. Mother agents delegate tasks to other Mother agents or Child agents, and each agent independently executes tasks while effectively implementing a single function as a whole.

Method                            SD   SO   Pass@1
AlphaCode (Li et al., 2022)       ✘    ✘    17.1
Incoder (Fried et al., 2023)      ✘    ✘    15.2
Codex (Chen et al., 2021)         ✘    ✘    47.0
Gemini Pro (Anil et al., 2023)    ✘    ✘    67.7
GPT-3.5-based:
CoT (Wei et al., 2022)            ✘    ✘    44.6
ChatGPT                           ✘    ✘    57.3
Self-Edit (Zhang et al., 2023)    ✔    ✘    62.2
Reflexion (Shinn et al., 2023)    ✔    ✘    66.5
SoA (ours)                        ✔    ✔    71.4

Table 1: Results of SoA and baselines on HumanEval. The score of ChatGPT is taken from Dong et al. (2023). SD indicates whether the agent uses self-debugging with unit tests, while SO denotes whether the agent employs self-organized multi-agent collaboration.

4.2 Analysis

One of the most critical aspects of our study is the efficiency of the self-organized multi-agent approach in large-scale code generation. To showcase the superior performance of SoA, we conducted a comprehensive comparative analysis between Reflexion, a state-of-the-art single-agent system, and our proposed multi-agent system. Using the HumanEval benchmark, we meticulously examined the overall scale of the code generated by both systems and the amount of code each agent independently generated and memorized. To ensure a fair comparison, we removed comments and docstrings from the HumanEval results and focused on the number of characters and tokens of pure code.

Figure 5 presents a visualization of the average amount of code generated by SoA and Reflexion from the perspective of individual functions and all functions. In the context of HumanEval, which requires the implementation of a single function, SoA's code amount is calculated by summing the code generated by each agent, while Reflexion's code amount is based on a single function. The code amount per function in SoA refers to the code generated by each individual agent, whereas in Reflexion, it is equivalent to the code amount of a single function. The results unequivocally demonstrate SoA's superiority over Reflexion in terms of the number of tokens per final code and the average number of characters per function. What is remarkable is that despite each agent in SoA handling significantly fewer tokens/characters compared to the single agent in Reflexion, the overall output generated by SoA is substantially greater. This finding underscores the exceptional scalability of SoA, indicating its ability to handle increasingly complex tasks by seamlessly adding more agents to the system. Our results suggest that by increasing the depth of the agent hierarchy and introducing more Mother agents, SoA can generate even larger-scale code by efficiently distributing the workload among multiple agents. As the tree structure becomes deeper, the system exhibits an infinite scaling potential, enabling the generation of increasingly complex and extensive codebases while ensuring that each agent handles a manageable portion of the code. Each agent can maintain a manageable amount of code while theoretically allowing for an indefinite increase in the overall code generation capacity.
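The paper does not include its measurement script, but the stripping step can be sketched as follows (Python 3.9+). Removing docstrings via the AST and regenerating the source, which also discards comments, is one reasonable implementation; token counts would additionally require the model's tokenizer (e.g., tiktoken), which is assumed rather than shown here.

```python
import ast

def strip_comments_and_docstrings(source: str) -> str:
    """Drop the docstring of every module, class, and function body; unparsing
    the AST also discards comments, leaving only 'pure' code."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, (ast.Module, ast.FunctionDef,
                             ast.AsyncFunctionDef, ast.ClassDef)):
            body = node.body
            if (body and isinstance(body[0], ast.Expr)
                    and isinstance(body[0].value, ast.Constant)
                    and isinstance(body[0].value.value, str)):
                node.body = body[1:] or [ast.Pass()]  # keep the body non-empty
    return ast.unparse(tree)

def character_count(source: str) -> int:
    """Character count of the stripped code, as used for the per-function and
    whole-codebase averages reported in Figure 5."""
    return len(strip_comments_and_docstrings(source))
```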
Figure 5: Comparison of code generation amount between SoA (multi-agent) and Reflexion (single agent). The panels report the average number of generated tokens and the average number of characters, both for the entire final code (all agents) and per function (per agent).

This distributed approach empowers SoA to significantly scale up its ability to tackle large-scale and complex coding tasks with remarkable efficiency and high quality, far surpassing the limitations encountered by single-agent systems like Reflexion, where a sole agent is responsible for managing and generating the entire codebase.
5 Related Work for large-scale code generation and modication.
LLM Agents. Recent advancements in LLM agents, such as ReAct (Yao et al., 2023b), Reflexion (Shinn et al., 2023), Toolformer (Schick et al., 2023), and Self-Refine (Madaan et al., 2023), have primarily focused on single-agent approaches, where one agent is responsible for both generation and modification tasks. Among these, Reflexion (Shinn et al., 2023) has gained significant attention in the field of code generation due to its outstanding performance. However, despite their strengths, these single-agent approaches face inherent limitations when it comes to generating and modifying large-scale codebases. To address these limitations and push the boundaries of what is possible with LLM agents, we propose SoA, a novel multi-agent framework that harnesses the power of self-organization and collaboration. While we intentionally adopted simple agents for SoA in this work, our framework is flexible enough to incorporate more sophisticated and powerful methods (Zhong et al., 2024; Zhou et al., 2023) and other state-of-the-art LLMs (e.g., https://claude.ai/), further enhancing its potential for large-scale code generation and modification.

Multi-Agent Collaboration for Software Development. In recent years, several multi-agent-based approaches have emerged as promising solutions for software development, such as MetaGPT (Hong et al., 2023), ChatDev (Qian et al., 2023), Self-collaboration (Dong et al., 2023), and AgentCoder (Huang et al., 2023). These methods typically personify agents and assign them specific names or occupational roles, such as programmers, project managers, or QA engineers, to allocate tasks. While this approach has shown promise, our method takes a different and more flexible approach. Instead of assigning fixed occupational roles, we subdivide agent capabilities based on code functionality, allowing each agent to demonstrate its expertise without being constrained by predefined roles. This fine-grained task allocation enables more flexible problem-solving and adaptation to the complexity of the software development process. Moreover, by incorporating the concepts of self-organization and self-proliferation, our agents can dynamically scale up the overall code volume based on the difficulty of the problem at hand, providing a highly adaptable and efficient framework for large-scale code generation and modification.

Macro vs. Micro Perspectives. While both multi-agent-based methods (Hong et al., 2023; Qian et al., 2023; Dong et al., 2023; Huang et al., 2023) and our proposed SoA framework share the common goal of automating software development, they address different technical aspects of the process. Existing multi-agent methods primarily focus on optimizing the macro structure of software development, such as project management and task allocation. In contrast, our method takes a more micro-level perspective, focusing on the elemental technologies of code generation and modification. These approaches are not mutually exclusive but rather complementary, offering a more comprehensive solution to the challenges faced in automatic software development. By combining the strengths of both macro and micro-level approaches, we can create a powerful and holistic framework that efficiently handles the complexities of large-scale code generation and modification.

Prompt Engineering. Tree-of-Thought (ToT) (Yao et al., 2023a) and Skeleton of Thought (SoT) (Ning et al., 2023) are prompt engineering techniques that utilize tree-like structures. ToT represents reasoning steps as nodes to explore correct reasoning paths, while SoT generates a skeleton of the answer and completes the contents in parallel to decrease generation latency. In contrast, SoA uses a tree structure with agents as nodes, focusing on their collaboration and self-organization to generate and modify code efficiently.
efciently. Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-
Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan
6 Conclusion Schalkwyk, Andrew M. Dai, Anja Hauth, Katie Mil-
lican, David Silver, Slav Petrov, Melvin Johnson,
Ioannis Antonoglou, Julian Schrittwieser, Amelia
In this work, we introduced Self-organized Agents Glaese, Jilin Chen, Emily Pitler, Timothy P. Lilli-
(SoA), a novel multi-agent framework for efcient crap, Angeliki Lazaridou, Orhan Firat, James Molloy,
and scalable automatic code generation and op- Michael Isard, Paul Ronald Barham, Tom Henni-
timization using large language models (LLMs). gan, Benjamin Lee, Fabio Viola, Malcolm Reynolds,
Yuanzhong Xu, Ryan Doherty, Eli Collins, Clemens
SoA addresses the limitations of single-agent ap-
Meyer, Eliza Rutherford, Erica Moreira, Kareem
proaches in handling large-scale, complex code- Ayoub, Megha Goel, George Tucker, Enrique Pi-
bases by leveraging the power of self-organization queras, Maxim Krikun, Iain Barr, Nikolay Savinov,
and distributed code generation. In SoA, self- Ivo Danihelka, Becca Roelofs, Anaïs White, Anders
organized agents operate independently to generate Andreassen, Tamara von Glehn, Lakshman Yagati,
Mehran Kazemi, Lucas Gonzalez, Misha Khalman,
and modify code components while seamlessly col- Jakub Sygnowski, and et al. 2023. Gemini: A fam-
laborating to construct the overall codebase. A key ily of highly capable multimodal models. CoRR,
feature of our framework is the automatic multi- abs/2312.11805.
plication of agents based on problem complexity,
W Ross Ashby. 1947. Principles of the self-organizing
allowing for dynamic scalability and enabling the dynamic system. The Journal of general psychology,
overall code volume to be increased indenitely ac- 37(2):125–128.
cording to the number of agents, while the amount
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie
of code managed by each agent remains constant.
Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind
We evaluated SoA on the HumanEval bench- Neelakantan, Pranav Shyam, Girish Sastry, Amanda
mark and demonstrated its superior performance Askell, Sandhini Agarwal, Ariel Herbert-Voss,
compared to Reexion, a state-of-the-art single- Gretchen Krueger, Tom Henighan, Rewon Child,
Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu,
agent system, with SoA achieving a 5% improve- Clemens Winter, Christopher Hesse, Mark Chen, Eric
ment in terms of Pass@1 accuracy. Furthermore, Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess,
our in-depth analysis revealed SoA’s remarkable Jack Clark, Christopher Berner, Sam McCandlish,
scalability, as each agent in SoA handles signi- Alec Radford, Ilya Sutskever, and Dario Amodei.
2020. Language models are few-shot learners. In Ad-
cantly less code compared to the single-agent base-
vances in Neural Information Processing Systems 33:
line, yet the overall generated code is substantially Annual Conference on Neural Information Process-
greater. These results highlight the effectiveness of ing Systems 2020, NeurIPS 2020, December 6-12,
SoA in generating and optimizing large-scale code 2020, virtual.
efciently and with high quality.
Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan,
However, it is essential to acknowledge the limi- Henrique Pondé de Oliveira Pinto, Jared Kaplan,
tations of the current implementation of SoA. The Harrison Edwards, Yuri Burda, Nicholas Joseph,
framework’s performance may be affected by the Greg Brockman, Alex Ray, Raul Puri, Gretchen
Krueger, Michael Petrov, Heidy Khlaaf, Girish Sas-
choice of LLM and the quality of the generated try, Pamela Mishkin, Brooke Chan, Scott Gray,
unit tests. Additionally, SoA has been evaluated Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz
on a limited set of programming tasks, and its ef- Kaiser, Mohammad Bavarian, Clemens Winter,
fectiveness in handling more complex, real-world Philippe Tillet, Felipe Petroski Such, Dave Cum-
mings, Matthias Plappert, Fotios Chantzis, Eliza-
software development projects remains to be in-
beth Barnes, Ariel Herbert-Voss, William Hebgen
vestigated. Furthermore, the communication and Guss, Alex Nichol, Alex Paino, Nikolas Tezak, Jie
collaboration mechanisms among agents in SoA Tang, Igor Babuschkin, Suchir Balaji, Shantanu Jain,
can be further optimized to improve efciency and William Saunders, Christopher Hesse, Andrew N.
fault tolerance. Carr, Jan Leike, Joshua Achiam, Vedant Misra, Evan
Morikawa, Alec Radford, Matthew Knight, Miles
Despite these limitations, we believe that the Brundage, Mira Murati, Katie Mayer, Peter Welinder,
SoA framework has signicant potential for future Bob McGrew, Dario Amodei, Sam McCandlish, Ilya
Sutskever, and Wojciech Zaremba. 2021. Evaluat- Lago, Thomas Hubert, Peter Choy, Cyprien de Mas-
ing large language models trained on code. CoRR, son d’Autume, Igor Babuschkin, Xinyun Chen, Po-
abs/2107.03374. Sen Huang, Johannes Welbl, Sven Gowal, Alexey
Cherepanov, James Molloy, Daniel J. Mankowitz,
Yihong Dong, Xue Jiang, Zhi Jin, and Ge Li. 2023. Self- Esme Sutherland Robson, Pushmeet Kohli, Nando
collaboration code generation via chatgpt. CoRR, de Freitas, Koray Kavukcuoglu, and Oriol Vinyals.
abs/2304.07590. 2022. Competition-level code generation with alpha-
code. CoRR, abs/2203.07814.
Daniel Fried, Armen Aghajanyan, Jessy Lin, Sida Wang,
Eric Wallace, Freda Shi, Ruiqi Zhong, Scott Yih,
Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler
Luke Zettlemoyer, and Mike Lewis. 2023. Incoder:
Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon,
A generative model for code inlling and synthesis.
Nouha Dziri, Shrimai Prabhumoye, Yiming Yang,
In The Eleventh International Conference on Learn-
Shashank Gupta, Bodhisattwa Prasad Majumder,
ing Representations, ICLR 2023, Kigali, Rwanda,
Katherine Hermann, Sean Welleck, Amir Yazdan-
May 1-5, 2023. OpenReview.net.
bakhsh, and Peter Clark. 2023. Self-rene: Itera-
Sirui Hong, Xiawu Zheng, Jonathan Chen, Yuheng tive renement with self-feedback. In Advances in
Cheng, Jinlin Wang, Ceyao Zhang, Zili Wang, Steven Neural Information Processing Systems 36: Annual
Ka Shing Yau, Zijuan Lin, Liyang Zhou, Chenyu Ran, Conference on Neural Information Processing Sys-
Lingfeng Xiao, and Chenglin Wu. 2023. Metagpt: tems 2023, NeurIPS 2023, New Orleans, LA, USA,
Meta programming for multi-agent collaborative December 10 - 16, 2023.
framework. CoRR, abs/2308.00352.
Niklas Muennighoff, Qian Liu, Armel Zebaze, Qinkai
Dong Huang, Qingwen Bu, Jie M. Zhang, Michael Luck, Zheng, Binyuan Hui, Terry Yue Zhuo, Swayam
and Heming Cui. 2023. Agentcoder: Multi-agent- Singh, Xiangru Tang, Leandro von Werra, and
based code generation with iterative testing and opti- Shayne Longpre. 2023. Octopack: Instruction tuning
misation. CoRR, abs/2312.13010. code large language models. CoRR, abs/2308.07124.
Mosh Levy, Alon Jacoby, and Yoav Goldberg. 2024. Xuefei Ning, Zinan Lin, Zixuan Zhou, Huazhong Yang,
Same task, more tokens: the impact of input length on and Yu Wang. 2023. Skeleton-of-thought: Large
the reasoning performance of large language models. language models can do parallel decoding. CoRR,
CoRR, abs/2402.14848. abs/2307.15337.
Jiaqi Li, Mengmeng Wang, Zilong Zheng, and Muhan
OpenAI. 2023. GPT-4 technical report. CoRR,
Zhang. 2023a. Loogle: Can long-context lan-
abs/2303.08774.
guage models understand long contexts? CoRR,
abs/2311.04939.
Chen Qian, Xin Cong, Cheng Yang, Weize Chen,
Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Yusheng Su, Juyuan Xu, Zhiyuan Liu, and Maosong
Muennighoff, Denis Kocetkov, Chenghao Mou, Sun. 2023. Communicative agents for software de-
Marc Marone, Christopher Akiki, Jia Li, Jenny velopment. CoRR, abs/2307.07924.
Chim, Qian Liu, Evgenii Zheltonozhskii, Terry Yue
Zhuo, Thomas Wang, Olivier Dehaene, Mishig Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta
Davaadorj, Joel Lamy-Poirier, João Monteiro, Oleh Raileanu, Maria Lomeli, Eric Hambro, Luke Zettle-
Shliazhko, Nicolas Gontier, Nicholas Meade, Armel moyer, Nicola Cancedda, and Thomas Scialom. 2023.
Zebaze, Ming-Ho Yee, Logesh Kumar Umapathi, Toolformer: Language models can teach themselves
Jian Zhu, Benjamin Lipkin, Muhtasham Oblokulov, to use tools. In Advances in Neural Information Pro-
Zhiruo Wang, Rudra Murthy V, Jason Stillerman, cessing Systems 36: Annual Conference on Neural
Siva Sankalp Patel, Dmitry Abulkhanov, Marco Information Processing Systems 2023, NeurIPS 2023,
Zocca, Manan Dey, Zhihan Zhang, Nour Moustafa- New Orleans, LA, USA, December 10 - 16, 2023.
Fahmy, Urvashi Bhattacharyya, Wenhao Yu, Swayam
Singh, Sasha Luccioni, Paulo Villegas, Maxim Ku- Uri Shaham, Maor Ivgi, Avia Efrat, Jonathan Berant,
nakov, Fedor Zhdanov, Manuel Romero, Tony Lee, and Omer Levy. 2023. Zeroscrolls: A zero-shot
Nadav Timor, Jennifer Ding, Claire Schlesinger, Hai- benchmark for long text understanding. In Find-
ley Schoelkopf, Jan Ebert, Tri Dao, Mayank Mishra, ings of the Association for Computational Linguis-
Alex Gu, Jennifer Robinson, Carolyn Jane Ander- tics: EMNLP 2023, Singapore, December 6-10, 2023,
son, Brendan Dolan-Gavitt, Danish Contractor, Siva pages 7977–7989. Association for Computational
Reddy, Daniel Fried, Dzmitry Bahdanau, Yacine Jer- Linguistics.
nite, Carlos Muñoz Ferrandis, Sean Hughes, Thomas
Wolf, Arjun Guha, Leandro von Werra, and Harm Noah Shinn, Federico Cassano, Ashwin Gopinath,
de Vries. 2023b. Starcoder: may the source be with Karthik Narasimhan, and Shunyu Yao. 2023. Re-
you! CoRR, abs/2305.06161. exion: language agents with verbal reinforcement
learning. In Advances in Neural Information Pro-
Yujia Li, David H. Choi, Junyoung Chung, Nate Kush- cessing Systems 36: Annual Conference on Neural
man, Julian Schrittwieser, Rémi Leblond, Tom Ec- Information Processing Systems 2023, NeurIPS 2023,
cles, James Keeling, Felix Gimeno, Agustin Dal New Orleans, LA, USA, December 10 - 16, 2023.
Hugo Touvron, Louis Martin, Kevin Stone, Peter Al- A Pseudocode
bert, Amjad Almahairi, Yasmine Babaei, Nikolay
Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti
Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton-
Ferrer, Moya Chen, Guillem Cucurull, David Esiobu,
Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller,
Cynthia Gao, Vedanuj Goswami, Naman Goyal, An-
thony Hartshorn, Saghar Hosseini, Rui Hou, Hakan
Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa,
Isabel Kloumann, Artem Korenev, Punit Singh Koura,
Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Di-
ana Liskovich, Yinghai Lu, Yuning Mao, Xavier Mar-
tinet, Todor Mihaylov, Pushkar Mishra, Igor Moly-
bog, Yixin Nie, Andrew Poulton, Jeremy Reizen-
stein, Rashi Rungta, Kalyan Saladi, Alan Schelten,
Ruan Silva, Eric Michael Smith, Ranjan Subrama-
nian, Xiaoqing Ellen Tan, Binh Tang, Ross Tay-
lor, Adina Williams, Jian Xiang Kuan, Puxin Xu,
Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan,
Melanie Kambadur, Sharan Narang, Aurélien Ro-
driguez, Robert Stojnic, Sergey Edunov, and Thomas
Scialom. 2023. Llama 2: Open foundation and ne-
tuned chat models. CoRR, abs/2307.09288.
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten
Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le,
and Denny Zhou. 2022. Chain-of-thought prompting
elicits reasoning in large language models. In Ad-
vances in Neural Information Processing Systems 35:
Annual Conference on Neural Information Process-
ing Systems 2022, NeurIPS 2022, New Orleans, LA,
USA, November 28 - December 9, 2022.
Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran,
Tom Grifths, Yuan Cao, and Karthik Narasimhan.
2023a. Tree of thoughts: Deliberate problem solving
with large language models. In Advances in Neural
Information Processing Systems 36: Annual Confer-
ence on Neural Information Processing Systems 2023,
NeurIPS 2023, New Orleans, LA, USA, December 10
- 16, 2023.
Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak
Shafran, Karthik R. Narasimhan, and Yuan Cao.
2023b. React: Synergizing reasoning and acting
in language models. In The Eleventh International
Conference on Learning Representations, ICLR 2023,
Kigali, Rwanda, May 1-5, 2023. OpenReview.net.
Kechi Zhang, Zhuo Li, Jia Li, Ge Li, and Zhi Jin. 2023.
Self-edit: Fault-aware code editor for code genera-
tion. In Proceedings of the 61st Annual Meeting of
the Association for Computational Linguistics (Vol-
ume 1: Long Papers), ACL 2023, Toronto, Canada,
July 9-14, 2023, pages 769–787. Association for
Computational Linguistics.
Lily Zhong, Zilong Wang, and Jingbo Shang. 2024.
LDB: A large language model debugger via ver-
ifying runtime execution step-by-step. CoRR,
abs/2402.16906.
Andy Zhou, Kai Yan, Michal Shlapentokh-Rothman,
Haohan Wang, and Yu-Xiong Wang. 2023. Language
agent tree search unies reasoning acting and plan-
ning in language models. CoRR, abs/2310.04406.
Algorithm 1 Generate Code with Self-organized Agent Framework

Input: docstrings: docstrings for the function; unit_tests: list of unit tests; max_depth: maximum depth of the agent hierarchy; max_iterations: maximum number of code modification iterations
Output: the final generated code

Initialize the root Mother agent with docstrings and unit_tests

function GenerateAgent(agent, depth, subtask_docstrings, subtask_unit_tests)
    if depth + 1 = max_depth then
        next_agent ← new ChildAgent
    else
        next_agent ← new MotherAgent
    end if
    Assign subtask_docstrings and subtask_unit_tests to next_agent
    Generate(next_agent, depth + 1)
end function

function Generate(agent, depth)
    if depth = 1 then                                      ▷ Root Mother
        skeleton ← generate skeleton from agent.docstrings and agent.unit_tests
        agent.code ← skeleton
        for each subtask_docstrings, subtask_unit_tests in subtasks do
            GenerateAgent(agent, depth, subtask_docstrings, subtask_unit_tests)
        end for
    else if depth = max_depth then                         ▷ Child
        Generate code for agent.subtask_docstrings and agent.subtask_unit_tests
        agent.code ← generated code
    else                                                   ▷ Mother
        Generate code for agent.subtask_docstrings and agent.subtask_unit_tests
        agent.code ← generated code
        for each subtask_docstrings, subtask_unit_tests in subtasks do
            GenerateAgent(agent, depth, subtask_docstrings, subtask_unit_tests)
        end for
    end if
end function

function Modify(agent, test_result, upper_agent_observation)
    Generate feedback for agent based on test_result and upper_agent_observation
    Update agent's code based on feedback
    for each subagent in agent.subagents do
        Evaluate subagent.code using subagent.unit_tests to get subagent_test_result
        Modify(subagent, subagent_test_result, feedback and code changes)
    end for
end function

Start code generation with Generate(root_mother, 1)

for each iteration in max_iterations do
    Combine implementations from all agents to create final_implementation
    Evaluate final_implementation using unit_tests to get test_result
    Modify the code starting from root_mother with Modify(root_mother, test_result, None)
end for

return the final implementation combined from all agents
