
CodeAgent: Autonomous Communicative Agents for Code Review

Xunzhu Tang¹, Kisub Kim², Yewei Song¹, Cedric Lothritz³, Bei Li⁴, Saad Ezzini⁵, Haoye Tian⁶·*, Jacques Klein¹, and Tegawendé F. Bissyandé¹

¹University of Luxembourg  ²Singapore Management University  ³Luxembourg Institute of Science and Technology  ⁴Northeastern University  ⁵Lancaster University  ⁶The University of Melbourne

*Corresponding author.
arXiv:2402.02172v5 [cs.SE] 24 Sep 2024

Abstract

Code review, which aims at ensuring the overall quality and reliability of software, is a cornerstone of software development. Unfortunately, while crucial, code review is a labor-intensive process that the research community is looking to automate. Existing automated methods rely on single input-output generative models and thus generally struggle to emulate the collaborative nature of code review. This work introduces CodeAgent, a novel multi-agent Large Language Model (LLM) system for code review automation. CodeAgent incorporates a supervisory agent, QA-Checker, to ensure that all the agents' contributions address the initial review question. We evaluated CodeAgent on critical code review tasks: (1) detecting inconsistencies between code changes and commit messages, (2) identifying vulnerability introductions, (3) validating code style adherence, and (4) suggesting code revisions. The results demonstrate CodeAgent's effectiveness, contributing to a new state of the art in code review automation. Our data and code are publicly available (https://github.com/Code4Agent/codeagent).

1 Introduction

Code review (Bacchelli and Bird, 2013; Bosu and Carver, 2013; Davila and Nunes, 2021) implements a process wherein software maintainers examine and assess code contributions to ensure quality and adherence to coding standards, and to identify potential bugs or improvements. In recent literature, various approaches (Tufano et al., 2021, 2022) have been proposed to enhance the performance of code review automation. Unfortunately, major approaches in the field ignore a fundamental aspect: the code review process is inherently interactive and collaborative (Bacchelli and Bird, 2013). Instead, they primarily focus on rewriting and adapting the submitted code (Watson et al., 2022; Thongtanunam et al., 2022; Staron et al., 2020). In this respect, an effective approach should not only address how to review the submitted code for specific needs (e.g., vulnerability detection (Chakraborty et al., 2021; Yang et al., 2024a)); other non-negligible aspects of code review should also be considered, such as detecting issues in code formatting or inconsistencies in code revisions (Oliveira et al., 2023; Tian et al., 2022; Panthaplackel et al., 2021). However, in a real code review scenario, processing these multiple sub-tasks requires interactions among employees in different roles, which makes it challenging to design a model that performs code review automatically.

Agent-based systems are an emerging paradigm and a computational framework in which autonomous entities (aka agents) interact with each other to perform a task (Li et al., 2023a; Qian et al., 2023; Hong et al., 2023). Agent-based approaches have been proposed to address a spectrum of software engineering tasks (Qian et al., 2023; Zhang et al., 2024; Tang et al., 2023; Tian et al., 2023), moving beyond the conventional single input-output paradigm thanks to their exceptional ability to simulate and model complex interactions and behaviors in dynamic environments (Xi et al., 2023; Yang et al., 2024b; Wang et al., 2023). Recently, multi-agent systems have leveraged the strengths of diverse agents to simulate human-like decision-making processes (Du et al., 2023; Liang et al., 2023; Park et al., 2023), leading to enhanced performance across various tasks (Chen et al., 2023; Li et al., 2023b; Hong et al., 2023). This paradigm is well-suited to the challenge of code review, where multiple reviewers, each with diverse skills and roles, collaborate to achieve a comprehensive review of the code.

This paper. Drawing from the success of agent-based collaboration, we propose a multi-agent framework, CodeAgent, to simulate the dynamics of a collaborative team engaged in the code review process, incorporating diverse roles such as code change authors, reviewers, and decision makers.
dynamics of a collaborative team engaged in the to propose an autonomous agent-based sys-
code review process, incorporating diverse roles tem for practical code review in the field of
such as code change authors, reviewers, and deci- software maintenance.
sion makers. In particular, A key contribution of
CodeAgent is that we address the challenge of • We build a new dataset comprising 3 545
prompt drifting (Zheng et al., 2024; Yang et al., real-world code changes and commit mes-
2024c), a common issue in multi-agent systems sages. This dataset, which includes all rel-
and Chain-of-Thought (CoT) reasoning. This is- evant files and details in a self-contained for-
sue, characterized by conversations that stray from mat, is valuable for evaluating advanced code
the main topic, highlights the need for strategies review tasks such as vulnerability detection,
to maintain focus and coherence (Greyling, 2023; code style detection, and code revision sug-
Chae et al., 2023). This drift, often triggered by gestions.
the model-inspired tangents or the randomness of
• We demonstrate the effectiveness of the QA-
Large Language Models (LLMs), necessitates the
Checker. This agent monitors the conversa-
integration of a supervisory agent. We employ an
tion flow to ensure alignment with the orig-
agent named QA-Checker (for "Question-Answer
inal intent, effectively addressing the com-
Checker") that monitors the conversation flow, en-
mon prompt drifting issues in multi-agent
suring that questions and responses stay relevant
systems.
and aligned with the dialogue’s intended objec-
tive. Such an agent not only refines queries but Experimental evaluation highlights the perfor-
also realigns answers to match the original intent, mance of CodeAgent: In vulnerability detec-
employing a systematic approach grounded in a tion, CodeAgent outperforms GPT-4 and Code-
mathematical framework. BERT by 3 to 7 percentage points in terms of
To evaluate the performance of CodeAgent, the number of vulnerabilities detected. For for-
we first assess its effectiveness for typical review mat alignment, CodeAgent outperforms ReAct
objectives such as detecting vulnerabilities 4.1 and by approximately 14% in recall for inconsistency
validating the consistency and alignment of the detection. On the code revision task, CodeAgent
code format 4.2. We then compare CodeAgent surpasses the state of the art in software engineer-
with state-of-the-art generic and code-specific lan- ing literature, achieving an average performance
guage models like ChatGPT (OPENAI, 2022) improvement of about 30% in the Edit Progress
and CodeBERT (Feng et al., 2020). Finally, metric (Zhou et al., 2023).
we assess the performance of CodeAgent com-
pared to the state-of-the-art tools for code revi- 2 CodeAgent
sion suggestions (Tufano et al., 2021; Thongta-
nunam et al., 2022; Tufano et al., 2022). Since This section details the methodology behind our
each of these related works presents a specific CodeAgent framework. We discuss tasks and
dataset, we also employ them toward a fair com- definition in Sec 2.1, pipeline in Section 2.2, de-
parison. Additionally, we also collect pull re- fined role cards in Section 2.3, and the design of
quests from GitHub, featuring an extensive array the QA-Checker in Sec 2.4.
of commits, messages, and comments to evaluate
2.1 Tasks
advanced capabilities.The experimental results re-
veal that CodeAgent significantly outperforms We define CA, VA, FA, and CR in as following:
the state-of-the-art, achieving a 41% increase in CA (Zhang et al., 2022): Consistency analysis be-
hit rate for detecting vulnerabilities. CodeAgent tween code change and commit message; the task
also excels in consistency checking and format is to detect cases where the commit message ac-
alignment, outperforming the target models. Fi- curate describes (in natural language) the intent of
nally, CodeAgent showcases its robustness for code changes (in programming language).
code revision by presenting superior average edit VA (Braz et al., 2022): Vulnerability analysis; the
progress. task is to identify cases where the code change in-
We summarize our contributions as follows: troduces a vulnerability in the code.
FA (Han et al., 2020): Format consistency analysis
• To the best of our knowledge, we are the first between commit and original files; the task is to
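For concreteness, here is a minimal sketch, ours rather than the paper's, of how a review request and the four task labels could be represented; all names are illustrative.

from dataclasses import dataclass
from enum import Enum

class ReviewTask(Enum):
    CA = "consistency_analysis"    # commit message vs. intent of the code change
    VA = "vulnerability_analysis"  # does the change introduce a vulnerability?
    FA = "format_analysis"         # does the change match the target file's style?
    CR = "code_revision"           # suggest a rewrite fixing the discovered issues

@dataclass
class ReviewRequest:
    code_change: str                 # the submitted commit / diff content
    commit_message: str              # natural-language description of the change
    original_files: dict[str, str]   # path -> file content (self-contained context)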
[Figure 1 appears here: the role data cards of the simulated team (User, CEO, CPO, CTO, Reviewer, Coder), the phases each role participates in, and excerpts of their conversations, including the User's request to inspect a piece of code for bugs and the Reviewer's consistency, security, and format analyses.]

Figure 1: A schematic diagram of the role data cards of the simulated code review team and their conversations within CodeAgent. We have six characters in CodeAgent across four phases: "Basic Info Sync", "Code Review", "Code Alignment", and "Document". Code review is a collaborative activity, so we design conversations between every two roles at every step to complete the task.

2.2 Pipeline

We define six characters and four phases for the framework. The roles of the characters are illustrated in Figure 1. Each phase contains multiple conversations, and each conversation happens between two agents. The four phases consist of 1) Basic Info Sync, in which the chief executive officer (CEO), the chief technology officer (CTO), and the Coder conduct modality and language analysis; 2) Code Review, asking the Coder and Reviewer to perform the actual code review (i.e., the target sub-tasks); 3) Code Alignment, supporting the Coder and Reviewer in correcting the commit through code revision and suggestions to the author; and 4) Document, finalizing the review by synthesizing the opinions of the CEO, CPO (Chief Product Officer), Coder, and Reviewer to provide the final comments. In addition to the six defined roles, the proposed architecture of CodeAgent consists of phase-level and conversation-level components. At the phase level, a waterfall model breaks the code review process into four sequential phases. At the conversation level, each phase is divided into atomic conversations. These atomic conversations involve task-oriented role-playing between two agents, promoting collaborative communication: one agent works as an instructor and the other as an assistant. Communication follows an instruction-following style, where agents interact to accomplish a specific subtask within each conversation, and each conversation is supervised by the QA-Checker. The QA-Checker is used to align the consistency of questions and answers between the instructor and the assistant in a conversation and to avoid digression; it will be introduced in Section 2.4. A sketch of this control flow is shown below.
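The phase and conversation structure described above can be read as a simple nested loop. The sketch below is our own illustration of that control flow, not the authors' implementation: the agent interface, the instructor/assistant pairings per phase, and the QA-Checker call are hypothetical stand-ins, while the 10-turn cap is the value reported in Appendix C.

# Hypothetical pairings; the paper only fixes which roles join each phase.
PHASES = [
    ("Basic Info Sync", [("CEO", "Coder"), ("CTO", "Coder")]),
    ("Code Review",     [("Reviewer", "Coder")]),
    ("Code Alignment",  [("Coder", "Reviewer")]),
    ("Document",        [("CEO", "CPO"), ("CPO", "Coder")]),
]
MAX_TURNS = 10  # maximum number of dialogue turns per conversation (Appendix C)

def run_pipeline(request, agents, qa_checker):
    """Waterfall over phases; each phase is a list of atomic two-agent conversations."""
    context = {"request": request}
    for phase, conversations in PHASES:
        for instructor, assistant in conversations:
            question = agents[instructor].instruct(phase, context)
            answer = None
            for _ in range(MAX_TURNS):
                answer = agents[assistant].respond(question, context)
                accepted, aai = qa_checker.check(question, answer)
                if accepted:
                    break
                question = question + aai  # q_{i+1} = CB(q_i + aai_i)
            context[(phase, instructor, assistant)] = answer
    return context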
[Figure 2 appears here: the full-conversation pipeline across the roles (User, CEO, CPO, CTO, Reviewer, Coder), with looped instructor/assistant conversations over the pull request code, commit message, and original file.]

Figure 2: CodeAgent's pipeline/scenario of a full conversation during the code review process among the different roles. "Basic Info Sync" demonstrates the basic information confirmation by the CEO, CTO, and Coder; "Code Review" shows the actual code review process; "Code Alignment" illustrates the potential code revision; and "Document" represents the summarizing and writing of the conclusion for all the stakeholders. All conversations are supervised by the Quality Assurance checker until they reach the maximum number of dialogue turns or meet all the requirements.

Figure 2 shows an illustrative example of the CodeAgent pipeline. CodeAgent receives a request to do the code review along with the submitted commit, commit message, and original files. In the first phase, the CEO, CTO, and Coder cooperate to recognize the modality of the input (e.g., document, code) and its language (e.g., Python, Java, and Go). In the second phase, with the help of the Coder, the Reviewer writes an analysis report covering consistency analysis, vulnerability analysis, format analysis, and suggestions for code revision. In the third phase, based on the analysis reports, the Coder aligns or revises the code if any incorrect snippets are identified, with assistance from the Reviewer. In the final phase, the Coder cooperates with the CPO and CEO to summarize the document and code of the whole code review.

2.3 Role Card Definition

As shown in Figure 1, we define six characters in our simulation system (CodeAgent), namely User, CEO, CPO, CTO, Reviewer, and Coder; each is defined for different specific tasks. All tasks are processed through the collaborative work of two agents in multi-round conversations. For example, the Reviewer's responsibility is to review the given code and files in three respects (tasks CA, VA, and FA in Section 2.1) and provide a detailed description of its observations. The Reviewer's code review activity is assisted by the Coder, as shown in Figure 2. Meanwhile, with the Reviewer's assistance, the Coder can process the code revision, as shown in the "Revised codes" part of the Coder card in Figure 1. Apart from the Reviewer, the Coder also cooperates with the CTO and CEO in the simulated team. Each role and conversation, and the input and output of each conversation, are laid out in Figure 1. Further information about the role definitions is provided in Appendix-Section C.1.

2.4 Self-Improving CoT with QA-Checker

[Figure 3 appears here, showing iterative rounds in which the QA-Checker combines the question with an added adjusted instruction, CB(q0 + aai0), CB(q1 + aai1), ..., between the instructor and the assistant over a shared memory, until an answer a_n is accepted.]

Figure 3: This diagram shows the architecture of our designed Chain-of-Thought (CoT) refinement: the Question-Answer Checker (QA-Checker). Legend: q = question (instruction), a = answer, aai = added adjusted instruction, CB = combination function.

The QA-Checker is an instruction-driven agent, designed to fine-tune the question inside a conversation so as to drive the generated answer to stay related to the question.
As shown in Figure 3, the initial question (task instruction) is represented as q_0, and the first answer of the conversation between Reviewer and Coder is represented as a_0. If the QA-Checker identifies that a_0 is inappropriate for q_0, it generates an additional instruction, attaches it to the original question (task instruction), and combines them to ask the agents to generate a different answer. The combination in Figure 3 is defined as q_1 = CB(q_0 + aai_0), where aai_0 is the attached additional instruction. The conversation between the two agents continues until the generated answer is judged appropriate by the QA-Checker or the conversation reaches the maximum number of dialogue turns. A minimal sketch of this accept-or-refine decision follows.
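The following is a minimal illustration of that decision, assuming the quality function Q and threshold τ defined in Appendix A; the scoring function and the instruction-generating model are hypothetical stand-ins.

def qa_check(question, answer, quality_fn, tau, supervisor_llm):
    """Return (accepted, aai): accept the answer or emit an added adjusted instruction."""
    score = quality_fn(question, answer)  # Q(q_i, a_i) in Appendix A
    if score >= tau:
        return True, None
    # Ask the supervisory model for a corrective instruction that steers the
    # next answer back toward the original intent (the "aai" of Figure 3).
    aai = supervisor_llm.generate(
        "The answer drifted from the question.\n"
        f"Question: {question}\nAnswer: {answer}\n"
        "Write a short corrective instruction."
    )
    return False, aai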
Theoretical Analysis of QA-Checker in Dialogue Refinement. The QA-Checker is an instruction-driven agent, crucial in refining questions and answers within a conversation to ensure relevance and precision. Its operation can be understood through the lemma and proof in Appendix A.

3 Experimental Setup

We evaluate the performance of CodeAgent through various qualitative and quantitative experiments across nine programming languages, using four distinct metrics. In this section, we discuss the experimental settings, including datasets, metrics, and baselines. For more information, please see Appendix C.

3.1 Datasets

To conduct a fair and reliable comparison on the code revision task, we employ the same datasets (i.e., Trans-Review_data, AutoTransform_data, and T5-Review_data) as the state-of-the-art study (Zhou et al., 2023). Furthermore, we collect and curate an additional dataset targeting the advanced tasks. Table 1 describes this new dataset, which includes over 3,545 commits and 2,933 pull requests from more than 180 projects, spanning nine programming languages: Python, Java, Go, C++, JavaScript, C, C#, PHP, and Ruby. It focuses on consistency and format detection, featuring both positive and negative samples segmented by the merged and closed status of pull requests across the various languages. Detailed information about the dataset can be found in Appendix-Section F.

Table 1: Comparison of positive and negative samples in CA and FA (CA and FA are defined in Section 2.1).

Samples                   | CA Merged | CA Closed | FA Merged | FA Closed
Positive (consistency)    |     2,089 |       820 |     2,238 |       861
Negative (inconsistency)  |       501 |       135 |       352 |        94

3.2 Metrics

• F1-Score and Recall. We utilize the F1-Score and recall to evaluate our method's effectiveness on tasks CA and FA. The F1-Score, a balance between precision and recall, is crucial for distinguishing between false positives and false negatives. Recall measures the proportion of actual positives correctly identified (Hossin and Sulaiman, 2015).

• Edit Progress (EP). EP evaluates the improvement of code transitioning from erroneous to correct by measuring the reduction in edit distance between the original code and the prediction on task CR. A higher EP indicates better efficiency in code generation (Dibia et al., 2022; Elgohary et al., 2021; Zhou et al., 2023). One plausible formalization is sketched after this list.

• Hit Rate (Rate). We also use the hit rate to evaluate, on task VA, the proportion of confirmed vulnerability issues among the issues found by each approach.
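The section does not spell out a formula for EP, so the following is one plausible reading, assuming EP is the relative reduction in edit distance to the ground-truth revision; under this assumption, a negative EP means the prediction moved farther from the target, which is consistent with the negative values reported in Table 5.

def edit_distance(a: str, b: str) -> int:
    """Word-level Levenshtein distance via dynamic programming."""
    xs, ys = a.split(), b.split()
    prev = list(range(len(ys) + 1))
    for i, x in enumerate(xs, 1):
        cur = [i]
        for j, y in enumerate(ys, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution
        prev = cur
    return prev[-1]

def edit_progress(original: str, prediction: str, target: str) -> float:
    """EP > 0 iff the prediction is closer to the target than the original code was."""
    base = edit_distance(original, target)
    if base == 0:
        return 0.0
    return (base - edit_distance(prediction, target)) / base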
3.3 State-of-the-Art Tools and Models

Our study evaluates various tools and models for code revision and modeling. Trans-Review (Tufano et al., 2021) employs src2abs for code abstraction, effectively reducing vocabulary size. AutoTransform (Thongtanunam et al., 2022) uses Byte-Pair Encoding for efficient vocabulary management in pre-review code revision. T5-Review (Tufano et al., 2022) leverages the T5 architecture, emphasizing improvement in code review through pre-training on code and text data. In handling both natural and programming languages, CodeBERT (Feng et al., 2020) adopts a bimodal approach, while GraphCodeBERT (Guo et al., 2021) incorporates code structure into its modeling. CodeT5 (Wang et al., 2021), based on the T5 framework, is optimized for identifier type awareness, aiding in generation-based tasks. Additionally, we compare these tools with GPT (OPENAI, 2022) by OpenAI, notable for its human-like text generation capabilities in natural language processing. Finally, we involve COT (Wei et al., 2022) and ReAct (Yao et al., 2022): COT is a method whereby language models are guided to solve complex problems by generating and following a series of intermediate reasoning steps, while ReAct synergistically enhances language models by interleaving reasoning and action generation, improving task performance and interpretability across various decision-making and language tasks.

4 Experimental Result Analysis

This section discusses the performance of CodeAgent on the four tasks considered in our experiments. In Appendix-Section E, we provide further analyses: we discuss the differences in the execution time of CodeAgent across languages and perform a capability analysis between CodeAgent and recent approaches.

4.1 Vulnerability Analysis

Compared to CA and FA, VA is a more complex code review subtask, covering more than 25 different aspects (please see Appendix-Section G), including buffer overflows, sensitive data exposure, configuration errors, data leakage, etc. Vulnerability analysis being a costly, time-consuming, resource-intensive, and sensitive activity, only a low proportion of commits are labeled. We therefore propose a proactive method for data annotation: we execute CodeAgent on the 3,545 samples (covering nine languages) and manually verify the identified cases to build a ground truth. Then, we apply CodeBERT (Feng et al., 2020) and GPT to the dataset with the task of binary vulnerability prediction.

Table 2: The number of vulnerabilities found by CodeAgent and the other approaches. As described in Appendix-Section F, we have 3,545 items to evaluate. Rate_cr represents the confirmed number divided by the number of findings, while Rate_ca is the confirmed number divided by the total evaluated number. "CodeAgent w/o" indicates the version without the QA-Checker. The greatest values for the confirmed number of vulnerabilities and the rates (shown in gray in the original) are those of CodeAgent.

         | CodeBERT | GPT-3.5 | GPT-4.0 | COT    | ReAct  | CodeAgent | CodeAgent w/o
Find     |    1,063 |     864 |     671 |    752 |    693 |       483 |           564
Confirm  |      212 |     317 |     345 |    371 |    359 |       449 |           413
Rate_cr  |   19.94% |  36.69% |  51.42% | 49.34% | 51.80% |    92.96% |        73.23%
Rate_ca  |    5.98% |   8.94% |   9.73% | 10.46% | 10.13% |    12.67% |        11.65%

Comparison. As shown in Table 2, CodeAgent successfully identified 483 potential vulnerabilities within a dataset of 3,545 samples, with an impressive 449 of these finally confirmed as high-risk vulnerabilities [1]. CodeBERT, a key pre-trained model for code-related tasks, with its parameters frozen for this experiment, initially identified 1,063 items as vulnerable, yet only 212 passed the stringent verification criteria. Similar trends were observed with GPT-3.5 and GPT-4.0, which confirmed 317 and 345 vulnerabilities out of 864 and 671 identified items, respectively. These outcomes are further quantified by the confirmation rates (Rate_cr) of 19.94% for CodeBERT, 36.69% for GPT-3.5, and 51.42% for GPT-4.0, while CodeAgent demonstrated a remarkable Rate_cr of 92.96%. Additionally, the analysis of confirmed vulnerabilities against all analyzed items (Rate_ca) yielded 5.98%, 8.94%, 9.73%, and 12.67% for CodeBERT, GPT-3.5, GPT-4.0, and CodeAgent, respectively. Evidently, Table 2 not only highlights CodeAgent's high precision in identifying vulnerable commits but also reveals the progressive improvement from GPT-3.5 to GPT-4.0, likely due to the latter's capacity to handle longer input sequences, with token limits of 4,096 and 32,768, respectively. The integration of mechanisms like CoT and the QA-Checker in CodeAgent has significantly enhanced its capabilities in vulnerability detection, surpassing the individual input-output efficiencies of GPT and CodeBERT. Appendix-Sections D and M highlight further details regarding the importance of the QA-Checker. Moreover, more experimental results across the nine languages are accessible in Appendix-Section J.

[1] The verification process involved a rigorous manual examination, extending beyond 120 working hours, with each sample validated by at least 2 people: a researcher and an engineer.
In addition, the analysis of vulnerabilities identified by the various models reveals interesting overlaps among the models. CodeBERT confirmed 212 vulnerabilities, whereas GPT-3.5, GPT-4.0, and CodeAgent confirmed 317, 345, and 449 vulnerabilities, respectively. Notably, the intersection of the vulnerabilities confirmed by CodeBERT and GPT-3.5 is 169, indicating a substantial overlap in their findings. Similarly, the intersection between CodeBERT and GPT-4.0 is 170, while a larger overlap of 212 vulnerabilities is observed between GPT-3.5 and GPT-4.0. The combined intersection among CodeBERT, GPT-3.5, and GPT-4.0 is 137, underscoring the commonalities in the vulnerabilities detected across these models. Further, the intersections of the vulnerabilities confirmed by CodeBERT, GPT-3.5, and GPT-4.0 with CodeAgent are 212, 317, and 334, respectively, highlighting the comprehensive coverage and detection capabilities of CodeAgent.

[Figure 4 appears here: a Venn diagram of the overlapping sets of confirmed vulnerabilities.]

Figure 4: Overlap of vulnerability detection by CodeBERT, GPT-3.5, GPT-4.0, and CodeAgent.

Ablation Study. As shown in Table 2, we conducted an ablation study to evaluate the effectiveness of the QA-Checker in CodeAgent. Specifically, we created a version of our tool without the QA-Checker, referred to as CodeAgent w/o. We then compared this version to the full version of CodeAgent that includes the QA-Checker. The results demonstrate that CodeAgent w/o is substantially less effective in identifying vulnerable issues, yielding lower hit rates (Rate_cr and Rate_ca). This reduction in performance highlights the critical role of the QA-Checker in enhancing CodeAgent's overall effectiveness. More detailed information about the ablation study can be found in Appendix-Section M.

4.2 Consistency and Format Detection

In this section, we discuss the performance of CodeAgent and the baselines in terms of the F1-Score and recall on tasks CA and FA. For CA and FA, our dataset is summarized in Table 1, and more detailed data information is shown in Figure 7 in the Appendix.

Code Change and Commit Message Consistency Detection. As illustrated in Table 3, we assess the efficacy of CodeAgent in detecting the consistency between code changes and commit messages, contrasting its performance with other prevalent methods like CodeBERT, GPT-3.5, and GPT-4.0. This evaluation specifically focuses on merged and closed commits in nine languages. In particular, CodeAgent exhibits remarkable performance, outperforming the other methods in both the merged and closed scenarios. In terms of Recall, CodeAgent achieved an impressive 90.11% for merged commits and 87.15% for closed ones, marking a considerable average improvement of 5.62 percentage points over the other models. Similarly, the F1-Score of CodeAgent stands at 93.89% for merged and 92.40% for closed commits, surpassing its counterparts with an average improvement of 3.79 percentage points. More detailed comparisons across the different languages are shown in Appendix-Section K.

Table 3: Comparison of CodeAgent with other methods on merged and closed commits across 9 languages on the CA task. 'Imp (pp)' represents the improvement in percentage points.

Merged   | CodeBERT | GPT-3.5 | GPT-4.0 | COT   | ReAct | CodeAgent | Imp (pp)
Recall   |    63.64 |   80.08 |   84.27 | 80.73 | 82.04 |     90.11 |     5.84
F1       |    75.00 |   87.20 |   90.12 | 87.62 | 88.93 |     93.89 |     3.77

Closed   | CodeBERT | GPT-3.5 | GPT-4.0 | COT   | ReAct | CodeAgent | Imp (pp)
Recall   |    64.80 |   79.05 |   81.75 | 81.77 | 83.42 |     87.15 |     5.21
F1       |    77.20 |   87.35 |   89.61 | 89.30 | 89.81 |     92.40 |     3.35

Average  | CodeBERT | GPT-3.5 | GPT-4.0 | COT   | ReAct | CodeAgent | Imp (pp)
Recall   |    64.22 |   79.57 |   83.01 | 81.25 | 82.73 |     88.63 |     5.62
F1       |    76.01 |   87.28 |   89.61 | 88.46 | 89.37 |     93.16 |     3.79

Format Consistency Detection. In our detailed evaluation of format consistency between commits and original files, CodeAgent's performance was benchmarked against established models like CodeBERT and the GPT variants across nine different languages. This comparative analysis, presented in Table 4, was centered around pivotal metrics such as Recall and F1-Score. CodeAgent demonstrated a significant edge over the state of the art, particularly in the merged category, with an impressive Recall of 89.34% and an F1-Score of 94.01%. These figures represent an average improvement of 10.81 percentage points in Recall and 6.94 percentage points in F1-Score over the other models. In the closed category, CodeAgent continued to outperform, achieving a Recall of 89.57% and an F1-Score of 94.13%, surpassing its counterparts with an improvement of 15.56 percentage points in Recall and 9.94 percentage points in F1-Score. The overall average performance of CodeAgent further accentuates its superiority, with a Recall of 89.46% and an F1-Score of 94.07%, marking an average improvement of 13.39 percentage points in Recall and 10.45 percentage points in F1-Score. These results underscore CodeAgent's exceptional capability in accurately detecting format consistency between commits and their original files.
Table 4: Comparison of CodeAgent with other methods on merged and closed commits across the 9 languages on the FA task. 'Imp (pp)' represents the improvement in percentage points.

Merged   | CodeBERT | GPT-3.5 | GPT-4.0 | COT   | ReAct | CodeAgent | Imp (pp)
Recall   |    60.59 |   60.72 |   78.53 | 70.39 | 71.21 |     89.34 |    10.81
F1       |    74.14 |   74.88 |   87.07 | 80.69 | 82.18 |     94.01 |     6.94

Closed   | CodeBERT | GPT-3.5 | GPT-4.0 | COT   | ReAct | CodeAgent | Imp (pp)
Recall   |    69.95 |   73.61 |   68.46 | 73.39 | 74.01 |     89.57 |    15.56
F1       |    80.49 |   84.19 |   80.16 | 83.65 | 83.90 |     94.13 |     9.94

Average  | CodeBERT | GPT-3.5 | GPT-4.0 | COT   | ReAct | CodeAgent | Imp (pp)
Recall   |    65.27 |   67.17 |   73.50 | 71.89 | 72.61 |     89.46 |    15.96
F1       |    77.32 |   79.54 |   83.62 | 82.17 | 83.04 |     94.07 |    10.45

4.3 Code Revision

We evaluate the effectiveness of CodeAgent in revision suggestion (i.e., bug fixing) based on the Edit Progress (EP) metric. We consider Trans-Review, AutoTransform, T5-Review, CodeBERT, GraphCodeBERT, and CodeT5 as the comparable state of the art. As detailed in Table 5, these approaches exhibit varied performance across the different datasets. In particular, CodeAgent shows remarkable performance on the T5-Review dataset, achieving the highest EP of 37.6%. This is a significant improvement over the other methods, which underlines the effectiveness of CodeAgent in handling complex code revision tasks. Furthermore, with an average EP of 31.6%, CodeAgent consistently outperforms its counterparts, positioning itself as a leading solution for automated code revision. Its ability to excel on T5-Review, a challenging benchmark dataset, indicates a strong capability to address complex bugs. In addition, its overall average performance surpasses the other state-of-the-art models, highlighting its robustness and reliability.

Table 5: Experimental results for the code revision (CR task) of CodeAgent and the state-of-the-art works. Bold (not reproduced here) indicates the best performers.

Approach       | Trans-Review_data EP | AutoTransform_data EP | T5-Review_data EP | Average EP
Trans-Review   |                -1.1% |                -16.6% |           -151.2% |     -56.3%
AutoTransform  |                49.7% |                 29.9% |              9.7% |      29.8%
T5-Review      |               -14.9% |                -71.5% |             13.8% |     -24.2%
CodeBERT       |                49.8% |                -75.3% |             22.3% |      -1.1%
GraphCodeBERT  |                50.6% |                -80.9% |             22.6% |      -2.6%
CodeT5         |                41.8% |                -67.8% |             25.6% |      -0.1%
CodeAgent      |                42.7% |                 14.4% |             37.6% |      31.6%

5 Related Work

Automating Code Review Activities. Our work contributes to automating code review activities, focusing on detecting source code vulnerabilities and maintaining code consistency. Related studies include Hellendoorn et al. (Hellendoorn et al., 2021), who addressed code change anticipation, and Siow et al. (Siow et al., 2020), who introduced CORE for code modification semantics. Hong et al. (Hong et al., 2022) proposed COMMENTFINDER for comment suggestions, while Tufano et al. (Tufano et al., 2021) and Li et al. (Li et al., 2022) developed tools for code review automation using models like T5CR and CodeReviewer, respectively. Recently, Lu et al. (Lu et al., 2023) incorporated large language models for code review, enhancing fine-tuning techniques.

Collaborative AI. Collaborative AI, involving AI systems working towards shared goals, has seen advancements in multi-agent LLMs (Talebirad and Nadiri, 2023; Qian et al., 2023), focusing on collective thinking, conversation dataset curation (Wei et al., 2023; Li et al., 2023a), and sociological phenomenon exploration (Park et al., 2023). Research by Akata et al. (Akata et al., 2023) and Cai et al. (Cai et al., 2023) further explores LLM cooperation and efficiency. However, there remains a gap in integrating these advancements with structured software engineering practices (Li et al., 2023a; Qian et al., 2023), a challenge our approach addresses by incorporating advanced human processes in multi-agent systems. For a complete overview of related work, please refer to Appendix-Section B.
6 Conclusion

In this paper, we introduced CodeAgent, a novel multi-agent framework that automates code reviews. CodeAgent leverages its novel QA-Checker system to maintain focus on the review's objectives and ensure alignment. Our experiments demonstrate CodeAgent's effectiveness in detecting vulnerabilities, enforcing code-message consistency, and promoting uniform code style. Furthermore, CodeAgent outperforms existing state-of-the-art solutions in code revision suggestions. By incorporating human-like conversational elements and considering the specific characteristics of code review, CodeAgent significantly improves both efficiency and accuracy. We believe this work opens exciting new avenues for research and collaboration practices in software development.

Limitations

Firstly, the generalizability of the system across different software development environments or industries may require further validation and testing. While the system has shown promising results on the provided datasets, its applicability to other contexts remains uncertain without additional empirical evidence. This limitation suggests that the findings may not be fully transferable to all settings within the software development domain. Secondly, the baseline tests used in the study might be insufficient. The current testing approach may not fully capture the system's performance, particularly in edge cases or more complex scenarios. This could result in an overestimation of the system's capabilities and an underestimation of its limitations. Furthermore, more comprehensive testing is needed to establish a more robust baseline and to ensure that the system performs reliably across a wider range of conditions.

Ethics Statement

This study was conducted in compliance with ethical guidelines and standards for research. The research did not involve human participants and therefore did not require informed consent or ethical review from an institutional review board. All data used in this study were publicly available, and no personal or sensitive information was accessed or processed. The development and evaluation of the CodeAgent system were performed with a focus on transparency, reproducibility, and the potential positive impact on the software development community.
References
Elif Akata, Lion Schulz, Julian Coda-Forno, Seong Joon Oh, Matthias Bethge, and Eric Schulz. 2023. Playing repeated games with large language models. arXiv preprint.

Alberto Bacchelli and Christian Bird. 2013. Expectations, outcomes, and challenges of modern code review. In 2013 35th International Conference on Software Engineering (ICSE), pages 712–721. IEEE.

Amiangshu Bosu and Jeffrey C Carver. 2013. Impact of peer code review on peer impression formation: A survey. In 2013 ACM/IEEE International Symposium on Empirical Software Engineering and Measurement, pages 133–142. IEEE.

Larissa Braz, Christian Aeberhard, Gül Çalikli, and Alberto Bacchelli. 2022. Less is more: supporting developers in vulnerability detection during code review. In Proceedings of the 44th International Conference on Software Engineering, pages 1317–1329.

Tianle Cai, Xuezhi Wang, Tengyu Ma, Xinyun Chen, and Denny Zhou. 2023. Large language models as tool makers. arXiv preprint.

Hyungjoo Chae, Yongho Song, Kai Tzu-iunn Ong, Taeyoon Kwon, Minjin Kim, Youngjae Yu, Dongha Lee, Dongyeop Kang, and Jinyoung Yeo. 2023. Dialogue chain-of-thought distillation for commonsense-aware conversational agents. arXiv preprint arXiv:2310.09343.

Saikat Chakraborty, Rahul Krishna, Yangruibo Ding, and Baishakhi Ray. 2021. Deep learning based vulnerability detection: Are we there yet? IEEE Transactions on Software Engineering, 48(9):3280–3296.

Weize Chen, Yusheng Su, Jingwei Zuo, Cheng Yang, Chenfei Yuan, Chen Qian, Chi-Min Chan, Yujia Qin, Yaxi Lu, Ruobing Xie, et al. 2023. Agentverse: Facilitating multi-agent collaboration and exploring emergent behaviors in agents. arXiv preprint arXiv:2308.10848.

Nicole Davila and Ingrid Nunes. 2021. A systematic literature review and taxonomy of modern code review. Journal of Systems and Software, 177:110951.

Victor Dibia, Adam Fourney, Gagan Bansal, Forough Poursabzi-Sangdeh, Han Liu, and Saleema Amershi. 2022. Aligning offline metrics and human judgments of value of ai-pair programmers. arXiv preprint arXiv:2210.16494.

Yilun Du, Shuang Li, Antonio Torralba, Joshua B Tenenbaum, and Igor Mordatch. 2023. Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325.

Ahmed Elgohary, Christopher Meek, Matthew Richardson, Adam Fourney, Gonzalo Ramos, and Ahmed Hassan Awadallah. 2021. NL-EDIT: Correcting semantic parse errors through natural language interaction. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 5599–5610, Online. Association for Computational Linguistics.

Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting Liu, Daxin Jiang, and Ming Zhou. 2020. Codebert: A pre-trained model for programming and natural languages. In Findings of the Association for Computational Linguistics: EMNLP 2020, Online Event, 16-20 November 2020, volume EMNLP 2020 of Findings of ACL, pages 1536–1547. Association for Computational Linguistics.

Cobus Greyling. 2023. Prompt drift and chaining.

Daya Guo, Shuo Ren, Shuai Lu, Zhangyin Feng, Duyu Tang, Shujie Liu, Long Zhou, Nan Duan, Alexey Svyatkovskiy, Shengyu Fu, Michele Tufano, Shao Kun Deng, Colin B. Clement, Dawn Drain, Neel Sundaresan, Jian Yin, Daxin Jiang, and Ming Zhou. 2021. Graphcodebert: Pre-training code representations with data flow. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net.

DongGyun Han, Chaiyong Ragkhitwetsagul, Jens Krinke, Matheus Paixao, and Giovanni Rosa. 2020. Does code review really remove coding convention violations? In 2020 IEEE 20th International Working Conference on Source Code Analysis and Manipulation (SCAM), pages 43–53. IEEE.

Vincent J Hellendoorn, Jason Tsay, Manisha Mukherjee, and Martin Hirzel. 2021. Towards automating code review at scale. In Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pages 1479–1482.

Sirui Hong, Xiawu Zheng, Jonathan Chen, Yuheng Cheng, Jinlin Wang, Ceyao Zhang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, Liyang Zhou, et al. 2023. Metagpt: Meta programming for multi-agent collaborative framework. arXiv preprint arXiv:2308.00352.

Yang Hong, Chakkrit Tantithamthavorn, Patanamon Thongtanunam, and Aldeida Aleti. 2022. Commentfinder: a simpler, faster, more accurate code review comments recommendation. In Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pages 507–519.

Mohammad Hossin and Md Nasir Sulaiman. 2015. A review on evaluation metrics for data classification evaluations. International Journal of Data Mining & Knowledge Management Process, 5(2):1.

Guohao Li, Hasan Abed Al Kader Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard Ghanem. 2023a. Camel: Communicative agents for "mind" exploration of large scale language model society. arXiv preprint arXiv:2303.17760.

Yuan Li, Yixuan Zhang, and Lichao Sun. 2023b. Metaagents: Simulating interactions of human behaviors for llm-based task-oriented coordination via collaborative generative agents. arXiv preprint arXiv:2310.06500.

Zhiyu Li, Shuai Lu, Daya Guo, Nan Duan, Shailesh Jannu, Grant Jenks, Deep Majumder, Jared Green, Alexey Svyatkovskiy, Shengyu Fu, et al. 2022. Codereviewer: Pre-training for automating code review activities. arXiv e-prints, pages arXiv–2203.

Tian Liang, Zhiwei He, Wenxiang Jiao, Xing Wang, Yan Wang, Rui Wang, Yujiu Yang, Zhaopeng Tu, and Shuming Shi. 2023. Encouraging divergent thinking in large language models through multi-agent debate. arXiv preprint arXiv:2305.19118.

Junyi Lu, Lei Yu, Xiaojia Li, Li Yang, and Chun Zuo. 2023. Llama-reviewer: Advancing code review automation with large language models through parameter-efficient fine-tuning. In 2023 IEEE 34th International Symposium on Software Reliability Engineering (ISSRE), pages 647–658. IEEE.

Delano Oliveira, Reydne Santos, Fernanda Madeiral, Hidehiko Masuhara, and Fernando Castor. 2023. A systematic literature review on the impact of formatting elements on code legibility. Journal of Systems and Software, 203:111728.

OPENAI. 2022. Chatgpt.

Sheena Panthaplackel, Junyi Jessy Li, Milos Gligoric, and Raymond J Mooney. 2021. Deep just-in-time inconsistency detection between comments and source code. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 427–435.

Joon Sung Park, Joseph O'Brien, Carrie Jun Cai, Meredith Ringel Morris, Percy Liang, and Michael S Bernstein. 2023. Generative agents: Interactive simulacra of human behavior. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, pages 1–22.

Chen Qian, Xin Cong, Cheng Yang, Weize Chen, Yusheng Su, Juyuan Xu, Zhiyuan Liu, and Maosong Sun. 2023. Communicative agents for software development. arXiv preprint arXiv:2307.07924.

Jing Kai Siow, Cuiyun Gao, Lingling Fan, Sen Chen, and Yang Liu. 2020. Core: Automating review recommendation for code changes. In 2020 IEEE 27th International Conference on Software Analysis, Evolution and Reengineering (SANER), pages 284–295. IEEE.

Miroslaw Staron, Mirosław Ochodek, Wilhelm Meding, and Ola Söder. 2020. Using machine learning to identify code fragments for manual review. In 2020 46th Euromicro Conference on Software Engineering and Advanced Applications (SEAA), pages 513–516. IEEE.

Yashar Talebirad and Amirhossein Nadiri. 2023. Multi-agent collaboration: Harnessing the power of intelligent llm agents.

Xunzhu Tang, Zhenghan Chen, Kisub Kim, Haoye Tian, Saad Ezzini, and Jacques Klein. 2023. Just-in-time security patch detection–llm at the rescue for data augmentation. arXiv preprint arXiv:2312.01241.

Patanamon Thongtanunam, Chanathip Pornprasit, and Chakkrit Tantithamthavorn. 2022. Autotransform: Automated code transformation to support modern code review process. In Proceedings of the 44th International Conference on Software Engineering, pages 237–248.

Haoye Tian, Weiqi Lu, Tsz On Li, Xunzhu Tang, Shing-Chi Cheung, Jacques Klein, and Tegawendé F Bissyandé. 2023. Is chatgpt the ultimate programming assistant–how far is it? arXiv preprint arXiv:2304.11938.

Haoye Tian, Xunzhu Tang, Andrew Habib, Shangwen Wang, Kui Liu, Xin Xia, Jacques Klein, and Tegawendé F Bissyandé. 2022. Is this change the answer to that problem? correlating descriptions of bug and code changes for evaluating patch correctness. arXiv preprint arXiv:2208.04125.

Rosalia Tufano, Simone Masiero, Antonio Mastropaolo, Luca Pascarella, Denys Poshyvanyk, and Gabriele Bavota. 2022. Using pre-trained models to boost code review automation. In Proceedings of the 44th International Conference on Software Engineering, pages 2291–2302.

Rosalia Tufano, Luca Pascarella, Michele Tufano, Denys Poshyvanyk, and Gabriele Bavota. 2021. Towards automating code review activities. In 2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE), pages 163–174. IEEE.

Yue Wang, Weishi Wang, Shafiq R. Joty, and Steven C. H. Hoi. 2021. Codet5: Identifier-aware unified pre-trained encoder-decoder models for code understanding and generation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 7-11 November, 2021, pages 8696–8708. Association for Computational Linguistics.

Zhenhailong Wang, Shaoguang Mao, Wenshan Wu, Tao Ge, Furu Wei, and Heng Ji. 2023. Unleashing cognitive synergy in large language models: A task-solving agent through multi-persona self-collaboration. arXiv preprint arXiv:2307.05300.

Cody Watson, Nathan Cooper, David Nader Palacio, Kevin Moran, and Denys Poshyvanyk. 2022. A systematic literature review on the use of deep learning in software engineering research. ACM Transactions on Software Engineering and Methodology (TOSEM), 31(2):1–58.

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837.

Jimmy Wei, Kurt Shuster, Arthur Szlam, Jason Weston, Jack Urbanek, and Mojtaba Komeili. 2023. Multi-party chat: Conversational agents in group settings with humans and models. arXiv preprint.

Zhiheng Xi, Wenxiang Chen, Xin Guo, Wei He, Yiwen Ding, Boyang Hong, Ming Zhang, Junzhe Wang, Senjie Jin, Enyu Zhou, et al. 2023. The rise and potential of large language model based agents: A survey. arXiv preprint arXiv:2309.07864.

Aidan ZH Yang, Haoye Tian, He Ye, Ruben Martins, and Claire Le Goues. 2024a. Security vulnerability detection with multitask self-instructed fine-tuning of large language models. arXiv preprint arXiv:2406.05892.

Boyang Yang, Haoye Tian, Weiguo Pian, Haoran Yu, Haitao Wang, Jacques Klein, Tegawendé F Bissyandé, and Shunfu Jin. 2024b. Cref: an llm-based conversational software repair framework for programming tutors. In Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis, pages 882–894.

Xiaoyu Yang, Jie Lu, and En Yu. 2024c. Adapting multi-modal large language model to concept drift in the long-tailed open world. arXiv preprint arXiv:2405.13459.

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. 2022. React: Synergizing reasoning and acting in language models. arXiv preprint.

Hongxin Zhang, Weihua Du, Jiaming Shan, Qinhong Zhou, Yilun Du, Joshua B Tenenbaum, Tianmin Shu, and Chuang Gan. 2023. Building cooperative embodied agents modularly with large language models. arXiv preprint.

Mengxi Zhang, Huaxiao Liu, Chunyang Chen, Yuzhou Liu, and Shuotong Bai. 2022. Consistent or not? an investigation of using pull request template in github. Information and Software Technology, 144:106797.

Yuntong Zhang, Haifeng Ruan, Zhiyu Fan, and Abhik Roychoudhury. 2024. Autocoderover: Autonomous program improvement. arXiv preprint arXiv:2404.05427.

Jonathan Zheng, Alan Ritter, and Wei Xu. 2024. Neo-bench: Evaluating robustness of large language models with neologisms. arXiv preprint arXiv:2402.12261.

Xin Zhou, Kisub Kim, Bowen Xu, DongGyun Han, Junda He, and David Lo. 2023. Generation-based code review automation: How far are we? arXiv preprint arXiv:2303.07221.
Contents (Appendix)

A Details of QA-Checker Algorithm
B Complete Related Work
C Experimental Details
  C.1 Role Definition
  C.2 Execute Time Across Languages
D Comparative Analysis of QA-Checker AI System and Recursive Self-Improvement Systems
  D.1 Comparison Table
  D.2 Differences and Implications
  D.3 Importance of QA-Checker in Role Conversations
  D.4 Conclusion
E Capabilities Analysis between CodeAgent and Other Methods
F Dataset
G Key Factors Leading to Vulnerabilities
H Data Leakage Statement
I Algorithmic Description of CodeAgent Pipeline with QA-Checker
J Detailed Performance of CodeAgent in Various Languages on VA task
K More detailed experimental results on CA and FA tasks
L Case Study
  L.1 Performance on 9 languages
  L.2 Difference of CodeAgent-3.5 and CodeAgent-4.0
M Ablation study
N Cost statement
O Tool

A Details of QA-Checker Algorithm

Lemma A.1. Let Q(Q_i, A_i) denote the quality assessment function of the QA-Checker for the question-answer pair (Q_i, A_i) in a conversation at the i-th iteration. Assume Q is twice differentiable and its Hessian matrix H(Q) is positive definite. If the QA-Checker modifies the question Q_i to Q_{i+1} by attaching an additional instruction aai_i, and this leads to a refined answer A_{i+1}, then the sequence {(Q_i, A_i)} converges to an optimal question-answer pair (Q*, A*), under specific regularity conditions.

Proof. The QA-Checker refines the question and answer using the rule

  Q_{i+1} = Q_i + aai_i,
  A_{i+1} = A_i - α H(Q(Q_i, A_i))^{-1} ∇Q(Q_i, A_i),

where α is the learning rate. To analyze convergence, we consider the Taylor expansion of Q around (Q_i, A_i):

  Q(Q_{i+1}, A_{i+1}) ≈ Q(Q_i, A_i)
    + ∇Q(Q_i, A_i) · (Q_{i+1} - Q_i, A_{i+1} - A_i)
    + (1/2) (Q_{i+1} - Q_i, A_{i+1} - A_i)^T H(Q(Q_i, A_i)) (Q_{i+1} - Q_i, A_{i+1} - A_i).

Substituting the update rule and rearranging, we get

  Q(Q_{i+1}, A_{i+1}) ≈ Q(Q_i, A_i)
    - α ∇Q(Q_i, A_i)^T H(Q(Q_i, A_i))^{-1} ∇Q(Q_i, A_i)
    + (α²/2) ∇Q(Q_i, A_i)^T H(Q(Q_i, A_i))^{-1} ∇Q(Q_i, A_i).

For sufficiently small α, this model suggests an increase in Q, implying convergence to an optimal question-answer pair (Q*, A*) as i → ∞. The convergence relies on the positive definiteness of H(Q) and the appropriate choice of α, ensuring that each iteration moves towards an improved quality of the question-answer pair.

In practical terms, this lemma and its proof underpin the QA-Checker's ability to refine answers iteratively. The QA-Checker assesses the quality of each answer concerning the posed question, employing optimization techniques modeled on the modified Newton-Raphson method to enhance answer quality. This framework ensures that, with each iteration, the system moves closer to the optimal answer, leveraging both first- and second-order derivatives for efficient and effective learning.
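To make the update rule concrete, here is a toy numeric sketch, ours rather than the paper's, of the damped Newton step A_{i+1} = A_i - α H^{-1} ∇Q from Lemma A.1. We use a quadratic quality function whose maximizer plays the role of the optimal answer embedding A*, so the iteration visibly converges there.

import numpy as np

# Toy quality function over an answer embedding A:
# Q(A) = -0.5 * (A - A_star)^T M (A - A_star), maximized at A_star.
A_star = np.array([1.0, -2.0, 0.5])
M = np.diag([2.0, 1.0, 4.0])  # positive definite curvature

def grad_Q(A):   # ∇Q(A) = -M (A - A_star)
    return -M @ (A - A_star)

def hess_Q(A):   # H(Q) = -M (constant for a quadratic)
    return -M

A = np.zeros(3)  # initial answer embedding A_0
alpha = 0.5      # learning rate / damping factor
for _ in range(20):
    # Damped Newton step: A <- A - alpha * H^{-1} grad
    A = A - alpha * np.linalg.solve(hess_Q(A), grad_Q(A))

print(np.round(A, 6))  # approaches A_star, the quality optimum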
Further Discussion. The QA-Checker computes Q(Q_i, A_i) at each iteration i and compares it to a predefined quality threshold τ. If Q(Q_i, A_i) < τ, the QA-Checker generates an additional instruction aai_i to refine the question to Q_{i+1} = Q_i + aai_i, prompting the agents to generate an improved answer A_{i+1}.

First, we assume that the quality assessment function Q(Q_i, A_i) is twice differentiable with respect to the question Q_i. This assumption is reasonable given the smooth nature of the component functions (relevance, specificity, and coherence) and the use of continuous word embeddings. Next, we apply the second-order Taylor approximation to Q(Q_{i+1}, A_{i+1}) around the point (Q_i, A_i):

  Q(Q_{i+1}, A_{i+1}) ≈ Q(Q_i, A_i) + ∇Q(Q_i, A_i)^T ΔQ_i
    + (1/2) ΔQ_i^T H(Q(Q_i, A_i)) ΔQ_i + R_2(ΔQ_i),

where ΔQ_i = Q_{i+1} - Q_i, H(Q(Q_i, A_i)) is the Hessian matrix of Q evaluated at (Q_i, A_i), and R_2(ΔQ_i) is the remainder term.

Assuming that the remainder term R_2(ΔQ_i) is negligible and that the Hessian matrix is positive definite, we can approximate the optimal step ΔQ_i* as

  ΔQ_i* ≈ -H(Q(Q_i, A_i))^{-1} ∇Q(Q_i, A_i).

Substituting this approximation into the Taylor expansion and using the fact that Q_{i+1} = Q_i + α ΔQ_i* (where α is the learning rate), we obtain

  Q(Q_{i+1}, A_{i+1}) ≈ Q(Q_i, A_i)
    - α ∇Q(Q_i, A_i)^T H(Q(Q_i, A_i))^{-1} ∇Q(Q_i, A_i)
    + (α²/2) ∇Q(Q_i, A_i)^T H(Q(Q_i, A_i))^{-1} ∇Q(Q_i, A_i).

The assumptions of twice differentiability, a negligible remainder term, and a positive definite Hessian matrix provide a more solid foundation for the approximation in Lemma A.1. For sufficiently small α, this approximation suggests an increase in Q, implying convergence to an optimal question-answer pair (Q*, A*) as i → ∞. The convergence relies on the positive definiteness of H(Q) and the appropriate choice of α, ensuring that each iteration moves towards an improved quality of the question-answer pair.

The quality assessment function Q used by the QA-Checker is defined as

  Q(Q_i, A_i) = α · Relevance(Q_i, A_i) + β · Specificity(A_i) + γ · Coherence(A_i),

where:

• Q_i and A_i represent the question and answer at the i-th iteration of the conversation.

• Relevance(Q_i, A_i) measures how well the answer A_i addresses the key points and intent of the question Q_i, computed as the cosine similarity

  Relevance(Q_i, A_i) = (Q⃗_i · A⃗_i) / (|Q⃗_i| |A⃗_i|),

  where Q⃗_i and A⃗_i are vector representations of Q_i and A_i.

• Specificity(A_i) assesses how specific and detailed the answer A_i is, calculated as

  Specificity(A_i) = ( Σ_{t ∈ ContentWords(A_i)} TechnicalityScore(t) ) / Length(A_i),

  where ContentWords(A_i) is the set of substantive content words in A_i, TechnicalityScore(t) is a measure of how technical or domain-specific the term t is, and Length(A_i) is the total number of words in A_i.

• Coherence(A_i) evaluates the logical flow and structural coherence of the answer A_i, computed as

  Coherence(A_i) = α · DiscourseConnectives(A_i) + β · CoreferenceConsistency(A_i) + γ · AnswerPatternAdherence(A_i),

  where DiscourseConnectives(A_i) is the density of discourse connectives in A_i, CoreferenceConsistency(A_i) measures the consistency of coreference chains in A_i, and AnswerPatternAdherence(A_i) assesses how well A_i follows the expected structural patterns for the given question type.

α, β, and γ are non-negative weights that sum to 1, with α = β = γ.
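A sketch of how this composite score could be computed in practice; the embedding vectors, the technicality scorer, and the coherence sub-score passed in here are hypothetical stand-ins for whatever instantiation the authors use.

import numpy as np

def relevance(q_vec: np.ndarray, a_vec: np.ndarray) -> float:
    """Cosine similarity between question and answer embeddings."""
    return float(q_vec @ a_vec / (np.linalg.norm(q_vec) * np.linalg.norm(a_vec)))

def specificity(tokens, technicality_score, is_content_word) -> float:
    """Summed technicality of content words, normalized by answer length."""
    total = sum(technicality_score(t) for t in tokens if is_content_word(t))
    return total / max(len(tokens), 1)

def quality(q_vec, a_vec, tokens, coherence_score,
            technicality_score, is_content_word,
            weights=(1 / 3, 1 / 3, 1 / 3)) -> float:
    """Q(Q_i, A_i) = α·Relevance + β·Specificity + γ·Coherence, with α = β = γ."""
    alpha, beta, gamma = weights
    return (alpha * relevance(q_vec, a_vec)
            + beta * specificity(tokens, technicality_score, is_content_word)
            + gamma * coherence_score)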
B Complete Related Work

Automating Code Review Activities. Our focus includes detecting source code vulnerabilities, ensuring style alignment, and maintaining commit message and code consistency. Other studies explore various aspects of code review. Hellendoorn et al. (Hellendoorn et al., 2021) addressed the challenge of anticipating code change positions. Siow et al. (Siow et al., 2020) introduced CORE, employing multi-level embeddings for code modification semantics and retrieval-based review suggestions. Hong et al. (Hong et al., 2022) proposed COMMENTFINDER, a retrieval-based method for suggesting comments during code reviews. Tufano et al. (Tufano et al., 2021) designed T5CR with SentencePiece, enabling work with raw source code without abstraction. Li et al. (Li et al., 2022) developed CodeReviewer, focusing on code diff quality, review comment generation, and code refinement using the T5 model. Recently, large language models have been incorporated; Lu et al. (Lu et al., 2023) fine-tuned LLaMA with prefix tuning for LLaMA-Reviewer, using parameter-efficient fine-tuning and instruction tuning in a code-centric domain.

Collaborative AI. Collaborative AI refers to artificially intelligent systems designed to achieve shared goals with humans or other AI systems. Previous research extensively explores the use of multiple LLMs in collaborative settings, as demonstrated by Talebirad et al. (Talebirad and Nadiri, 2023) and Qian et al. (Qian et al., 2023). These approaches rely on the idea that inter-agent interactions enable LLMs to collectively enhance their capabilities, leading to improved overall performance. The research covers various aspects of multi-agent scenarios, including collective thinking, conversation dataset curation, sociological phenomenon exploration, and collaboration for efficiency. Collective thinking aims to boost problem-solving abilities by orchestrating discussions among multiple agents. Researchers like Wei et al. (Wei et al., 2023) and Li et al. (Li et al., 2023a) have created conversational datasets through role-playing methodologies. Sociological phenomenon investigations, such as the work of Park et al. (Park et al., 2023), involve creating virtual communities with rudimentary language interactions and limited cooperative endeavors. In contrast, Akata et al. (Akata et al., 2023) scrutinized LLM cooperation through orchestrated repeated games. Collaboration for efficiency, proposed by Cai et al. (Cai et al., 2023), introduces a model for cost reduction through large models as tool-makers and small models as tool-users. Zhang et al. (Zhang et al., 2023) established a framework for verbal communication and collaboration, enhancing overall efficiency. However, Li et al. (Li et al., 2023a) and Qian et al. (Qian et al., 2023), presenting multi-agent frameworks for software development, primarily relied on natural language conversations, not standardized software engineering documentation, and lacked advanced human process management expertise. Challenges in multi-agent cooperation include maintaining coherence, avoiding unproductive loops, and fostering beneficial interactions. Our approach emphasizes integrating advanced human processes, like code review in software maintenance, within multi-agent systems.

C Experimental Details

In our work, the maximum number of conversation rounds is set to 10.

C.1 Role Definition

Six roles are defined as shown in Figure 5.

Apart from that, for the QA-Checker in CodeAgent, we define an initial prompt for it, which is shown as follows:
Role Specialization (Figure 5):

- User: "My primary responsibilities involve the integration of commit content, crafting commit messages, managing original files, and supplying necessary input information like commit details and code."

- CEO: "I'm Chief Executive Officer. Now, we are both working at CodeAgent and we share a common interest in collaborating to successfully complete the code review for commits or code. My main responsibilities include being a decision-maker in policy and strategy, a leader managing teams, and an effective communicator with management and employees. I also specialize in summarizing complex code reviews."

- CPO: "I am the Chief Product Officer at CodeAgent, collaborating closely with my team to complete code reviews successfully. I am responsible for assisting the CEO and Coder to summarize code review reports."

- CTO: "I am the CTO of CodeAgent, familiar with various programming languages and skilled in overarching technology strategies. My role involves collaborating on new customer tasks, making high-level IT decisions that align with our organization's goals, and working closely with IT staff in everyday operations."

- Reviewer: "I am a Code Reviewer at CodeAgent collaborating to ensure software quality by assessing code for defects, vulnerabilities, and consistency issues, fixing bugs, and suggesting improvements. I also collaborate with other staff to complete the code revision and summary of the code review."

- Coder: "I am a Coder at CodeAgent who actively reviews and revises code. I make decisions about code changes and ensure code quality by evaluating code for defects and suggesting improvements. I am proficient in various programming languages and platforms, including Python, Java, Go, C++, JavaScript, C, C#, PHP, and Ruby."

Figure 5: Specialization of six main characters in CodeAgent.
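A minimal sketch of how these role prompts could be attached to agents is shown below; the Agent class and the use of the prompts as LLM system prompts are our assumptions, not the released CodeAgent implementation (prompt texts abbreviated).

```python
from dataclasses import dataclass

@dataclass
class Agent:
    name: str
    system_prompt: str

    def ask(self, message: str) -> str:
        # Placeholder for an LLM call conditioned on this role's system prompt.
        raise NotImplementedError

ROLE_PROMPTS = {
    "User": "My primary responsibilities involve the integration of commit content, ...",
    "CEO": "I'm Chief Executive Officer. Now, we are both working at CodeAgent ...",
    "CPO": "I am the Chief Product Officer at CodeAgent ...",
    "CTO": "I am the CTO of CodeAgent ...",
    "Reviewer": "I am a Code Reviewer at CodeAgent ...",
    "Coder": "I am a Coder at CodeAgent who actively reviews and revises code ...",
}

# One agent per role, each carrying its specialization prompt.
agents = {role: Agent(role, prompt) for role, prompt in ROLE_PROMPTS.items()}
```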

Apart from that, for the QA-Checker in CodeAgent, we define an initial prompt for it, which is shown as follows:

"I'm the QA-Checker, an AI-driven agent specializing in ensuring quality and coherence in conversational dynamics, particularly in code review discussions at CodeAgent. My primary role involves analyzing and aligning conversations to maintain topic relevance, ensuring that all discussions about code commits and reviews stay focused and on track. As a sophisticated component of the AI system, I apply advanced algorithms, including Chain-of-Thought reasoning and optimization techniques, to evaluate and guide conversational flow. I am adept at identifying and correcting topic drifts, ensuring that every conversation adheres to its intended purpose. My capabilities extend to facilitating clear and effective communication between team members, making me an essential asset in streamlining code review processes and enhancing overall team collaboration and decision-making."

C.2 Execution Time Across Languages

As depicted in the data, we observe a significant trend in the average execution time for code reviews in CodeAgent across various programming languages. The analysis includes nine languages: Python, Java, Go, C++, JavaScript, C, C#, PHP, and Ruby. For each language, the average execution time of code reviews for both merged and closed pull requests (PRs) is measured. The results, presented in Figure 6, indicate that, on average, the execution time for merged PRs is longer than that for closed PRs by approximately 44.92 seconds. This considerable time difference can be attributed to several potential reasons. One primary explanation is that merged PRs likely undergo a more rigorous and detailed review process. They are intended to be integrated into the main codebase, and as such, contributors might be requested to update their commits in the PRs more frequently to adhere to the project's high-quality standards. On the other hand, closed PRs, which are not meant for merging, might not require such extensive review processes, leading to shorter review times on average, which may also be the reason they are not merged into main projects.

[Figure 6: Execution time with CodeAgent across different languages (in seconds), comparing merged and closed PRs for Python, Java, Go, C++, JavaScript, C, C#, PHP, and Ruby.]
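The measurement behind Figure 6 can be reproduced with a simple harness like the following sketch; the PR record fields and the run_review entry point are assumptions on our part.

```python
import time
from collections import defaultdict
from statistics import mean

def timed_review(run_review, pr) -> float:
    """Wall-clock time of one full CodeAgent review of a pull request."""
    start = time.perf_counter()
    run_review(pr)
    return time.perf_counter() - start

def average_time_by_status(prs, run_review):
    """Average review time separately for merged and closed PRs, as in Figure 6."""
    buckets = defaultdict(list)
    for pr in prs:
        buckets[pr["status"]].append(timed_review(run_review, pr))  # "merged" or "closed"
    return {status: mean(times) for status, times in buckets.items()}
```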
D Comparative Analysis of QA-Checker AI System and Recursive Self-Improvement Systems

In this section, we will delve into the differences between the QA-Checker and self-improvement systems (Hong et al., 2023), and underscore the importance of the QA-Checker in role conversations.

D.1 Comparison Table

We begin with a comparative overview presented in Table 6.

D.2 Differences and Implications

The key differences between these systems lie in their application scope, learning mechanisms, and improvement scopes. The QA-Checker is highly specialized, focusing on QA tasks with efficiency and precision. In contrast, recursive self-improvement systems boast a broader application range and adaptability, integrating experiences from diverse projects for systemic improvements.

D.3 Importance of QA-Checker in Role Conversations

In the context of role conversations, the QA-Checker plays a pivotal role. Its specialized nature makes it exceptionally adept at handling specific conversational aspects, such as accuracy, relevance, and clarity in responses. This specialization is crucial in domains where the quality of information is paramount, ensuring that responses are not only correct but also contextually appropriate and informative.

Furthermore, the efficiency of the QA-Checker in refining responses based on advanced optimization techniques makes it an invaluable tool in dynamic conversational environments. It can quickly adapt to the nuances of a conversation, providing high-quality responses that are aligned with the evolving nature of dialogue.

D.4 Conclusion

While recursive self-improvement systems offer broad adaptability and systemic learning, the QA-Checker stands out in its specialized role in QA tasks, particularly in role conversations. Its focused approach to improving answer quality and its efficiency in handling conversational nuances make it an essential component in AI-driven communication systems.
Table 6: Comparative Overview of QA-Checker AI System and Recursive Self-Improvement Systems

Feature/System | QA-Checker AI System | Recursive Self-Improvement System
Application Focus | Specialized for QA tasks with precise task execution | Broad scope, covering various dimensions like software development and learning algorithms
Learning Mechanism | Advanced optimization techniques for iterative improvement in QA | Multi-level learning: learning, meta-learning, and recursive self-improvement
Scope of Improvement | Focused on individual capability in specific QA tasks | Enhances the entire system, including multi-agent interactions and communication protocols
Experience Integration | Based on mathematical models to optimize answer quality | Utilizes experiences from past projects to improve overall performance
E Capabilities Analysis between CodeAgent and Other Methods

Compared to open-source baseline methods such as AutoGPT and autonomous agents such as ChatDev and MetaGPT, CodeAgent offers functions for code review tasks: consistency analysis, vulnerability analysis, and format analysis. As shown in Table 7, our CodeAgent encompasses a wide range of abilities to handle complex code review tasks efficiently. Incorporating the QA-Checker self-improvement module can significantly improve the conversation generation between agents and contribute to the improvement of code review. Compared to COT, the difference and the advantages of CodeAgent with the QA-Checker are shown in Section D.

Table 7: Comparison of capabilities for CodeAgent and other approaches. '✓' indicates the presence of a specific feature in the corresponding framework, '✗' its absence. ChatDev and MetaGPT are two representative multi-agent frameworks, GPT is a single-agent framework, and CodeBert is a representative pre-trained model.

Approaches | Consistency Analysis | Vulnerability Analysis | Format Analysis | Code Revision | COT | QA-Checker
ChatDev (Qian et al., 2023) | ✗ | ✗ | ✗ | ✗ | ✓ | ✗
MetaGPT (Hong et al., 2023) | ✗ | ✗ | ✗ | ✗ | ✓ | ✗
GPT (OPENAI, 2022) | ✓ | ✓ | ✓ | ✓ | ✗ | ✗
CodeBert (Feng et al., 2020) | ✓ | ✓ | ✓ | ✓ | ✗ | ✗
CodeAgent | ✓ | ✓ | ✓ | ✓ | ✓ | ✓

F Dataset

Previous Dataset Following Zhou et al. (2023), our study incorporates three distinct datasets for evaluating the performance of CodeAgent: Trans-Review_data, AutoTransform_data, and T5-Review_data. Trans-Review_data, compiled by Tufano et al. (Tufano et al., 2021), derives from Gerrit and GitHub projects, excluding noisy or overly lengthy comments and review data with new tokens in revised code not present in the initial submission. AutoTransform_data, collected by Thongtanunam et al. (Thongtanunam et al., 2022) from three Gerrit repositories, comprises only submitted and revised codes without review comments. Lastly, T5-Review_data, gathered by Tufano et al. (Tufano et al., 2022) from Java projects on GitHub, filters out noisy, non-English, and duplicate comments. These datasets are employed for Code Revision Before Review (CRB) and Code Revision After Review (CRA) tasks, with the exception of AutoTransform_data for CRA and Review Comment Generation (RCG) due to its lack of review comments.

Table 8: Statistics of Studied Datasets.

Dataset Statistics | #Train | #Valid | #Test
Trans-Review | 13,756 | 1,719 | 1,719
AutoTransform | 118,039 | 14,750 | 14,750
T5-Review | 134,239 | 16,780 | 16,780

New Dataset Design and Collection To enhance our model evaluation and avoid data leakage, we curated a new dataset, exclusively collecting data from repositories created after April 2023. This approach ensures the evaluation of our CodeAgent model on contemporary and relevant data, free from historical biases. The new dataset is extensive, covering a broad spectrum of software projects across nine programming languages.

Dataset Description Our dataset, illustrated in Fig. 8, encapsulates a detailed analysis of consistency and format detection in software development, spanning various programming languages. It includes CA (consistency between commit and commit message (see Sec 2.1)) and FA (format consistency between commit and original (see Sec 2.1)) data, segmented into positive and negative samples based on the merged and closed status of pull requests. For example, in Python, the dataset comprises 254 merged and 35 closed negative CA samples, alongside 803 merged and 213 closed positive CA samples, with corresponding distributions for other languages like Java, Go, C++, and more. Similarly, the FA data follows this pattern of positive and negative samples across languages. Figure 7 graphically represents this data, highlighting the distribution and comparison of merged versus closed samples in both CA and FA categories for each language. This comprehensive dataset, covering over 3,545 commits and nearly 2,933 pull requests from more than 180 projects, was meticulously compiled using a custom crawler designed for GitHub API interactions, targeting post-April 2023 repositories to ensure up-to-date and diverse data for an in-depth analysis of current software development trends.
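As an illustration of this collection step, a minimal sketch of such a crawler against the public GitHub search API is given below; the exact queries, pagination, and PR/commit harvesting used for our dataset are not reproduced here, so treat the parameters as assumptions.

```python
import requests

API = "https://api.github.com/search/repositories"

def recent_repos(language: str, token: str, per_page: int = 100) -> list[str]:
    """First page of repositories in `language` created after April 2023."""
    params = {
        "q": f"language:{language} created:>2023-04-30",
        "sort": "stars",
        "per_page": per_page,
    }
    headers = {
        "Authorization": f"Bearer {token}",
        "Accept": "application/vnd.github+json",
    }
    resp = requests.get(API, params=params, headers=headers, timeout=30)
    resp.raise_for_status()
    return [item["full_name"] for item in resp.json()["items"]]
```

Per-repository pull requests and commits would then be fetched through the corresponding REST endpoints, keeping the merged/closed status recorded with each sample.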
[Figure 7: Distribution of positive and negative samples, for both merged and closed data, across 9 languages (Python, Java, Go, C++, JavaScript, C, C#, PHP, Ruby). Panel (a) shows the CA task (Sec 2.1); panel (b) shows the FA task (Sec 2.1).]

[Figure 8: Comparative Visualization of Merged and Closed Commit Counts Across Various Programming Languages.]

G Key Factors Leading to Vulnerabilities

The following table outlines various key factors that can lead to vulnerabilities in software systems, along with their descriptions. These factors should be carefully considered and addressed to enhance the security of the system.

H Data Leakage Statement

As the new dataset introduced in Section F was collected entirely after April 2023, data leakage is avoided when we evaluate CodeAgent on the codeData dataset.

I Algorithmic Description of CodeAgent Pipeline with QA-Checker

This algorithm demonstrates the integration of the QA-Checker within the CodeAgent pipeline, employing mathematical equations to describe the QA-Checker's iterative refinement process.

Algorithm 1: Integrated Workflow of CodeAgent with QA-Checker

Input: Code submission, commit message, original files
Output: Refined code review document

Initialize phase p = 1
while p ≤ 4 do
    Switch: Phase p
    Case 1: Basic Info Sync
        Conduct initial information analysis
        Update: p = 2
    Case 2: Code Review
        Perform code review with Coder and Reviewer
        Update: p = 3
    Case 3: Code Alignment
        Apply code revisions based on feedback
        Update: p = 4
    Case 4: Document
        Finalize review document
        Update: p = 5 (End)
    QA-Checker Refinement (applies in Cases 2 and 3):
        Let Q_i be the current question and A_i the current answer
        Evaluate response quality: qScore = Q(Q_i, A_i)
        if qScore below threshold then
            Generate additional instruction a_i
            Update question: Q_{i+1} = Q_i + a_i
            Request new response: A_{i+1}
        end if
end while
Return: Refined code review document
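A minimal Python transcription of Algorithm 1's control flow is sketched below; the ask, quality, and refine callables stand in for the actual agents, QA-Checker scoring, and instruction generation, which the paper does not specify at this level. The ten-round cap mirrors the conversation limit stated in Appendix C.

```python
TAU = 0.5  # quality threshold; the paper does not report its concrete value

def qa_checked(ask, question, quality, refine, max_rounds=10):
    """One agent interaction under QA-Checker supervision (Cases 2 and 3)."""
    answer = ask(question)
    for _ in range(max_rounds - 1):
        if quality(question, answer) >= TAU:
            break
        question = question + refine(question, answer)   # Q_{i+1} = Q_i + a_i
        answer = ask(question)                           # improved answer A_{i+1}
    return answer

def code_review_pipeline(submission, ask, quality, refine):
    """Phases 1-4 of Algorithm 1."""
    doc = {}
    doc["info"] = ask(f"Summarize commit, message, and files: {submission}")   # Phase 1
    doc["review"] = qa_checked(ask, f"Review this change: {submission}",
                               quality, refine)                                # Phase 2
    doc["revision"] = qa_checked(ask, f"Revise the code per: {doc['review']}",
                                 quality, refine)                              # Phase 3
    doc["final"] = ask(f"Write the final review document from: {doc}")         # Phase 4
    return doc
```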
No. | Vulnerability Factor | Description
1 | Insufficient Input Validation | Check for vulnerabilities like SQL injection, Cross-Site Scripting (XSS), and command injection in new or modified code, especially where user input is processed.
2 | Buffer Overflows | Particularly in lower-level languages, ensure that memory management is handled securely to prevent overflows.
3 | Authentication and Authorization Flaws | Evaluate any changes in authentication and authorization logic for potential weaknesses that could allow unauthorized access or privilege escalation.
4 | Sensitive Data Exposure | Assess handling and storage of sensitive information like passwords, private keys, or personal data to prevent exposure.
5 | Improper Error and Exception Handling | Ensure that errors and exceptions are handled appropriately without revealing sensitive information or causing service disruption.
6 | Vulnerabilities in Dependency Libraries or Components | Review updates or changes in third-party libraries or components for known vulnerabilities.
7 | Cross-Site Request Forgery (CSRF) | Verify that adequate protection mechanisms are in place against CSRF attacks.
8 | Unsafe Use of APIs | Check for the use of insecure encryption algorithms or other risky API practices.
9 | Code Injection | Look for vulnerabilities related to dynamic code execution.
10 | Configuration Errors | Ensure that no insecure configurations or settings like open debug ports or default passwords have been introduced.
11 | Race Conditions | Analyze for potential data corruption or security issues arising from race conditions.
12 | Memory Leaks | Identify any changes that could potentially lead to memory leaks and resource exhaustion.
13 | Improper Resource Management | Check resource management, such as proper closure of file handles or database connections.
14 | Inadequate Security Configurations | Assess for any insecure default settings or unencrypted communications.
15 | Path Traversal and File Inclusion Vulnerabilities | Examine for risks that could allow unauthorized file access or execution.
16 | Unsafe Deserialization | Look for issues that could allow the execution of malicious code or tampering with application logic.
17 | XML External Entity (XXE) Attacks | Check if XML processing is secure against XXE attacks.
18 | Inconsistent Error Handling | Review error messages to ensure they do not leak sensitive system details.
19 | Server-Side Request Forgery (SSRF) | Analyze for vulnerabilities that could be exploited to attack internal systems.
20 | Unsafe Redirects and Forwards | Check for vulnerabilities leading to phishing or redirection attacks.
21 | Use of Deprecated or Unsafe Functions and Commands | Identify usage of any such functions and commands in the code.
22 | Code Leakages and Hardcoded Sensitive Information | Look for hardcoded passwords, keys, or other sensitive data in the code.
23 | Unencrypted Communications | Verify that data transmissions are securely encrypted to prevent interception and tampering.
24 | Mobile Code Security Issues | For mobile applications, ensure proper handling of permission requests and secure data storage.
25 | Cloud Service Configuration Errors | Review any cloud-based configurations for potential data leaks or unauthorized access.
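One way to operationalize this checklist, sketched below under the assumption that the Reviewer agent is prompted with it, is to keep the factors as structured data and assemble them into the review prompt; the helper names are ours, and the list is abbreviated.

```python
VULN_CHECKLIST = [
    ("Insufficient Input Validation",
     "Check for SQL injection, XSS, and command injection where user input is processed."),
    ("Buffer Overflows",
     "In lower-level languages, ensure memory management prevents overflows."),
    ("Sensitive Data Exposure",
     "Assess handling and storage of passwords, private keys, and personal data."),
    # ... remaining factors from the table above
]

def review_prompt(diff: str) -> str:
    """Assemble a Reviewer prompt that enumerates the risk factors."""
    bullets = "\n".join(f"- {name}: {desc}" for name, desc in VULN_CHECKLIST)
    return f"Review the following change for these risk factors:\n{bullets}\n\nDiff:\n{diff}"
```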
In this algorithm, Q(Q_i, A_i) represents the quality assessment function of the QA-Checker, which evaluates the relevance and accuracy of the answer A_i to the question Q_i. If the quality score qScore is below a predefined threshold, the QA-Checker intervenes by generating an additional instruction a_i to refine the question, prompting a more accurate response in the next iteration.

J Detailed Performance of CodeAgent in Various Languages on VA task

In our comprehensive analysis using CodeAgent, as detailed in Table 9, we observe a diverse landscape of confirmed vulnerabilities across different programming languages. The table categorizes these vulnerabilities into 'merged' and 'closed' statuses for languages such as Python, Java, Go, C++, JavaScript, C, C#, PHP, and Ruby. A significant finding is a markedly high number of 'merged' vulnerabilities in Python, potentially reflective of its extensive application or intrinsic complexities leading to security gaps. Conversely, languages like Go, Ruby, and C exhibit notably lower counts in both categories, perhaps indicating lesser engagement in complex applications or more robust security protocols.

Table 9 shows that the 'closed' category consistently presents lower vulnerabilities than 'merged' across most languages, signifying effective resolution mechanisms. However, an exception is noted in C, where 'closed' counts surpass those of 'merged', possibly indicating either delayed vulnerability identification or efficient mitigation strategies. Remarkably, Rate_close is generally observed to be higher than Rate_merge across the languages, exemplifying a significant reduction in vulnerabilities post-resolution. For example, Python demonstrates a Rate_merge of 14.00% against a higher Rate_close of 18.16%. This trend is consistent in most languages, emphasizing the importance of proactive vulnerability management. Rate_avg, representing the proportion of confirmed vulnerabilities against the total of both merged and closed items, further elucidates this point, with C++ showing the highest Rate_avg at 16.49%. These insights not only underline the diverse vulnerability landscape across programming languages but also highlight the adeptness of CodeAgent in pinpointing and verifying vulnerabilities in these varied contexts.

K More detailed experimental results on CA and FA tasks

Detailed experimental results of CA are shown in Figure 9 and Figure 10. Detailed experimental results of FA are shown in Figure 11 and Figure 12.

L Case Study

As shown in Table 10, we can easily locate the figure numbers of case studies for specific programming languages.

L.1 Performance on 9 languages

Table 10: Correlation Table between specific programming language and case study.

Programming Language | Figure No.
Python | 13
Java | 14
Go | 15
C++ | 16
JavaScript | 17
C | 18
C# | 19
PHP | 20
Ruby | 21

L.2 Difference of CodeAgent-3.5 and CodeAgent-4.0

CodeAgent-3.5 and CodeAgent-4.0 show no difference in general code review in this paper; however, CodeAgent-4.0 is more powerful in processing long input sequences and logic reasoning. As shown in Figure 22, we take one example of consistency detection between commit and commit message and find that CodeAgent-4.0 differs from CodeAgent-3.5 in the detailed explanation. CodeAgent-3.5 outputs a report with 15k lines while CodeAgent-4.0 outputs a report with more than 17.7k lines. Detailed data is shown in https://zenodo.org/records/10607925.
Table 9: Vulnerable problems (#) found by CodeAgent. Rate_merge is the confirmed count divided by the total number of merged items, Rate_close is the confirmed count divided by the total number of closed items, and Rate_avg is the confirmed count divided by the combined total of merged and closed items.

CodeAgent | Python | Java | Go | C++ | JavaScript | C | C# | PHP | Ruby
merged (total #) | 1,057 | 287 | 133 | 138 | 280 | 114 | 206 | 173 | 202
merged (confirmed #) | 148 | 17 | 11 | 19 | 34 | 9 | 21 | 28 | 20
Rate_merge | 14.00% | 5.92% | 8.27% | 13.77% | 12.14% | 7.89% | 10.19% | 16.18% | 9.90%
closed (total #) | 248 | 97 | 74 | 56 | 112 | 146 | 62 | 105 | 55
closed (confirmed #) | 45 | 10 | 5 | 13 | 16 | 26 | 7 | 15 | 5
Rate_close | 18.16% | 10.31% | 6.76% | 23.21% | 14.29% | 17.81% | 11.29% | 14.29% | 9.09%
Total number (#) | 1,305 | 384 | 207 | 194 | 392 | 260 | 268 | 278 | 257
Total confirmed (#) | 193 | 27 | 16 | 32 | 50 | 35 | 28 | 43 | 25
Rate_avg | 14.79% | 7.03% | 7.73% | 16.49% | 12.76% | 13.46% | 10.45% | 15.47% | 9.73%
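For clarity, the three rates in Table 9 can be computed directly from the raw counts, as in this small sketch (the function name is ours):

```python
def rates(merged_total, merged_confirmed, closed_total, closed_confirmed):
    """Rate_merge, Rate_close, and Rate_avg as defined in the Table 9 caption."""
    return {
        "rate_merge": merged_confirmed / merged_total,
        "rate_close": closed_confirmed / closed_total,
        "rate_avg": (merged_confirmed + closed_confirmed)
                    / (merged_total + closed_total),
    }

# e.g., the Python column of Table 9:
print(rates(1057, 148, 248, 45))  # rate_merge ≈ 0.1400, rate_avg ≈ 0.1479
```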

[Figure 9: Comparison of models (Recall and F1 for CodeBERT, ChatGPT-3.5, ChatGPT-4.0, and CodeAgent) on the merged data across 9 languages on the CA task.]


[Figure 10: Comparison of models on the closed data across 9 languages on the CA task.]

[Figure 11: Comparison of models on the merged data across 9 languages on the FA task.]
[Figure 12: Comparison of models on the closed data across 9 languages on the FA task.]

M Ablation study

In this section, we evaluate the performance of different parts of CodeAgent in vulnerability analysis. CodeAgent is based on chain-of-thought (COT) and a large language model (a.k.a. GPT). As shown in Section 4.1, CodeAgent outperforms baselines (a.k.a. CodeBERT, GPT-3.5, GPT-4.0) across 9 different languages. The performance mainly comes from the combination of COT and the QA-Checker. Thus, we design an additional version called CodeAgent w/o, which means CodeAgent without the QA-Checker. We then use CodeAgent w/o to do vulnerability analysis and compare it with CodeAgent. We first discuss the results of CodeAgent w/o and then discuss the comparison between CodeAgent and CodeAgent w/o.

Overview of Vulnerabilities in CodeAgent w/o Table 11 presents the findings of CodeAgent w/o, a variant of the original CodeAgent, in identifying vulnerabilities across different programming languages. The table showcases the number of 'merged' and 'closed' vulnerabilities in languages such as Python, Java, Go, C++, JavaScript, C, C#, PHP, and Ruby. Notably, Python leads in the 'merged' category with a total of 1,057 cases, of which 140 are confirmed, yielding a Rate_merge of 13.25%. In contrast, languages like Go and Ruby show lower vulnerability counts in both 'merged' and 'closed' categories. The table also includes Rate_close and Rate_avg, providing insights into the effectiveness of vulnerability management across these languages.

Detailed Comparison between CodeAgent and CodeAgent w/o Comparing the findings in Table 11 with those in Table 9, we observe some notable differences in vulnerability detection by CodeAgent and CodeAgent w/o. While the overall trend of higher 'merged' vulnerabilities in Python and lower counts in Go and Ruby remains consistent, Table 11 shows a slight reduction in Rate_merge for most languages, suggesting a more conservative confirmation approach in CodeAgent w/o. Similarly, the Rate_close and Rate_avg values in Table 11 generally indicate a lower proportion of confirmed vulnerabilities compared to Table 9, reflecting potentially different criteria or efficacy in vulnerability assessment. These variations highlight the impact of the QA-Checker in CodeAgent.

N Cost statement

As shown in Table 12, CodeAgent-4 has a higher query time and cost compared to CodeAgent-3.5 due to its increased complexity and capabilities. We acknowledge that the integration of AI models and multi-agent systems may introduce complexity and require specialized knowledge for implementation.
[Figure 13: Example in Python project with sha value "0d3b94bbd7528ed8649cdab4370a62e508bad82c". Commit message: "rename client_id and user_id into client and user in connection info". The CodeAgent summary finds the renamed default_info keys and the removed is_authenticated check in handle_refresh inconsistent with the commit message, flags missing input validation and error handling as security risks, notes minor indentation and comment-placement issues, and suggests aligning the commit message with the code and hardening input handling.]


[Figure 14: Example in Java project with sha value "6e3c6d17d943f5ac70b421653eb167e0c34b119f". Commit message: "Fixed issue where onDismiss() did not get called if Snackbar is swiped." The CodeAgent summary judges the changes (adding a private dismiss(boolean animate) overload) inconsistent with the stated swipe-dismiss fix, reports no format issues, and proposes a revised dismiss() implementation that matches the commit message.]


[Figure 15: Example in Go project with sha value "a6b00b097b1a8140bd8b60177247dc830374c8ed". Commit message: "fix: clean integration-test script". The CodeAgent summary reports a semantic inconsistency (the diff does not reflect the claimed cleanup), no explicit security vulnerabilities, and formatting drift from the original file, and provides a revised integration-test script.]


[Figure 16: Example in C++ project with sha value "09e88c7c20347cdf513323fa8c616aba0a1b9ea7". Commit message: "remove end-point to get events". The CodeAgent summary finds the removal of the "/analytics/events/:name" endpoint not fully reflected in the reviewed code, flags missing input validation and inconsistent indentation and spacing, and recommends applying the removal and organizing the code into logical sections.]

Table 11: Vulnerable problems (#) found by CodeAgent w/o.

CodeAgent w/o | Python | Java | Go | C++ | JavaScript | C | C# | PHP | Ruby
merged (total #) | 1,057 | 287 | 133 | 138 | 280 | 114 | 206 | 173 | 202
merged (confirmed #) | 140 | 17 | 10 | 12 | 28 | 9 | 21 | 28 | 17
Rate_merge | 13.25% | 5.92% | 7.52% | 8.70% | 10.00% | 7.89% | 10.19% | 16.18% | 8.42%
closed (total #) | 248 | 97 | 74 | 56 | 112 | 146 | 62 | 105 | 55
closed (confirmed #) | 36 | 9 | 5 | 12 | 16 | 26 | 7 | 15 | 5
Rate_close | 14.52% | 9.28% | 6.76% | 21.43% | 14.29% | 17.81% | 11.29% | 14.29% | 9.09%
Total number (#) | 1,305 | 384 | 207 | 194 | 392 | 260 | 268 | 278 | 257
Total confirmed (#) | 176 | 26 | 15 | 24 | 44 | 35 | 28 | 43 | 22
Rate_avg | 13.49% | 6.77% | 7.25% | 12.37% | 11.22% | 13.46% | 10.45% | 15.47% | 8.56%
[Figure 17: Example in JavaScript project with sha value "6e9a80d93b470aef4978677ed0b7c9996fddbe20". Commit message: "Addeed .replaceWith()". The CodeAgent summary finds the change consistent with the message, recommends an explicit type check for the val parameter in the added replaceWith, input validation and sanitization, improved error handling, and comments for complex sections, and includes revised code.]


[Figure 18: Example in C project with sha value "5f93cf43dd6f809d7927abb79884285ad77e8a58". Commit message: "remove ventoy.net in grub UI". The CodeAgent summary finds the change consistent with the message, flags missing input validation and error handling as potential risks, and recommends aligning the code formatting with the original file.]


[Figure 19: Example in C# project with sha value "0e231c7a81b318e9eade972f7b877e66128ed67d". Commit message: "change logger to concrete type". The CodeAgent summary notes that the switch of the _logger field from ILoggerFactory to XunitLogger<Kernel> is not mentioned in the commit message, flags missing input validation, finds the formatting consistent, and suggests updating the commit message accordingly.]


Commit Original
PHP Commit
Message PHP
File
...function get_posts( $args = null ) {
@@ -2425,6 +2425,7 @@ function get_posts( $args = null ) { add get_post_meta args to get_posts()
'meta_value' => '',
'post_type' => 'post',
'suppress_filters' => true,
+ 'get_post_meta' => false ...
); function get_posts( $args = null ) {
$parsed_args = wp_parse_args( $args, $defaults ); $defaults = array(
@@ -2449,7 +2450,20 @@ function get_posts( $args = null ) { 'numberposts' => 5,
$parsed_args['no_found_rows'] = true; 'category' => 0,
'orderby' => 'date',
$get_posts = new WP_Query(); 'order' => 'DESC',...
- return $get_posts->query( $parsed_args ); );
+ $get_posts = $get_posts->query($parsed_args); $parsed_args = wp_parse_args( $args,
+ $defaults );
+ if (!empty($parsed_args['get_post_meta'])) { PHP if ( empty( $parsed_args['post_status'] ) ) {
+ foreach ($get_posts as $post) { $parsed_args['post_status'] = (
+ $post_meta = []; 'attachment' === $parsed_args['post_type'] ) ?
+ foreach (get_post_meta($post->ID) as $key => $value) { 'inherit' : 'publish';
+ $post_meta[$key] = $value[0];} }
+ $post->meta_data = $post_meta; if ( ! empty( $parsed_args['numberposts'] )
+ $get_posts[] = $post; && empty( $parsed_args['posts_per_page'] )
+ } ){
+ }; $parsed_args['posts_per_page'] =
+ $parsed_args['numberposts'];
+ return $get_posts; }
} ...

...
PHP

CodeAgent Summary

Semantic Consistency Analysis

The semantic consistency between the code changes and the commit message seems to be accurate. The changes made in the
code align with the description provided in the commit message. There are no inconsistencies or potential hidden malicious
code identified in this aspect.

Security Analysis
The provided code requires a comprehensive security review to identify potential vulnerabilities. The analysis should focus on
the following areas:......Based on the provided code, it is difficult to perform a comprehensive security analysis. However, it is
recommended to conduct a thorough security review considering the above points to identify any potential vulnerabilities.

Format Analysis
The format of the code does not align with the writing style and format of the original file. There are inconsistencies in
indentation, spacing, and line breaks. These formatting inconsistencies can impact the overall readability and maintainability of
the project. It is recommended to follow a consistent coding style and use proper indentation and spacing throughout the
codebase.

Suggestion
1. In the `get_posts` function, the `get_post_meta` argument is not aligned properly with the other arguments. It should be aligned with the other arguments for better readability.
2. In the `get_posts` function, the `get_posts` variable is overwritten with a new value. It is recommended to use a different variable name to avoid confusion.
3. In the `get_posts` function, the code block inside the `if` condition can be simplified by using a dictionary comprehension. Instead of manually iterating over `get_post_meta`, you can use a dictionary comprehension to create the `post_meta` dictionary.

Figure 20: Example in PHP project with sha value: "6679f059b9a0972a90df198471188da652f5c547"
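
Suggestions 2 and 3 above can be sketched in PHP as follows. Note that PHP has no dictionary comprehensions (a Python idiom), so `array_map` is the closest equivalent; the sketch assumes a WordPress runtime providing `WP_Query` and `get_post_meta()`, and variable names other than those in the diff are hypothetical.

<?php
// Minimal sketch: use a distinct variable for the query result (suggestion 2)
// and flatten each post's meta values without a hand-written inner loop
// (suggestion 3). Assumes WordPress is loaded.
$query = new WP_Query();
$posts = $query->query( $parsed_args );

if ( ! empty( $parsed_args['get_post_meta'] ) ) {
    foreach ( $posts as $post ) {
        // get_post_meta( $id ) returns an array of value-arrays per meta key;
        // keep the first value of each, preserving the keys.
        $post->meta_data = array_map(
            static function ( $values ) { return $values[0]; },
            get_post_meta( $post->ID )
        );
    }
}
return $posts;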


Figure 21: Example in Ruby project with sha value: "584f72e7f4c65066ccbd2183af76bf380b6eb974"



[Figure input: a Java commit to a 655-line Snackbar class with message "Fixed issue where onDismiss() did not get called if Snackbar is swiped.", shown alongside the original Java file. The diff replaces the `finish()` call in the `onDismiss(View, Object)` callback with `dismiss(false)`, and splits `dismiss()` into a public no-argument method that delegates to a new private `dismiss(boolean animate)` method, which checks `mIsDismissing` and calls `finish()` when `animate` is false.]

CodeAgent-3.5

Semantic Consistency Analysis

The code changes in the commit do not accurately reflect the description provided in the commit message. The commit message mentions fixing an issue where onDismiss() did not get called if Snackbar is swiped, but the code changes do not address this issue. Instead, the code changes seem to be related to dismissing the Snackbar.

Security Analysis

No security analysis can be performed as the code is not provided.

Format Analysis

The format of the code aligns with the writing style and format of the original file. There are no formatting inconsistencies that impact the overall readability and maintainability of the project.

Suggestion

The highest priority comment is regarding the semantic consistency between the code changes and the commit message. The code changes should address the issue mentioned in the commit message, i.e., fixing the issue where onDismiss() did not get called if Snackbar is swiped. To fix this issue, the following changes can be made:

CodeAgent-4.0

Semantic Consistency Analysis

The commit message states "Fixed issue where onDismiss() did not get called if Snackbar is swiped." The code changes reflect this accurately. The `onDismiss()` method has been modified to call `dismiss(false)`, which will dismiss the Snackbar without animation when it is swiped. A new `dismiss()` method has also been added which takes a boolean parameter to determine whether to animate the dismissal or not. The changes in the code are consistent with the commit message.

Security Analysis

The provided code does not seem to have any security vulnerabilities. It does not handle any user input directly, so there is no risk of SQL injection, XSS, or command injection. It does not use any lower-level languages, so buffer overflows are not a concern. The code does not manage any sensitive data, so there is no risk of unauthorized access or data breaches. The code does not use any third-party libraries, so there are no potential vulnerabilities from dependencies. The code does not have any deprecated functions, hardcoded sensitive data, or code leakages. However, this is a preliminary analysis and a more thorough review would be needed to confirm these findings.

Format Analysis

The format of the code aligns with the writing style and format of the original file. The code uses the same indentation, naming conventions, and comment style as the original file. There are no formatting inconsistencies that would impact the readability or maintainability of the project.

Suggestion

The code changes are well-aligned with the rest of the codebase. The new `dismiss()` method follows the same naming conventions and style as the existing methods. The changes to the `onDismiss()` method are minimal and maintain the original structure of the method. No revisions are necessary.

Figure 22: Comparison of code review between CodeAgent-3.5 and CodeAgent-4.0 on one Java project with sha value: "6e3c6d17d943f5ac70b421653eb167e0c34b119f"
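
The refactor that CodeAgent-4.0 describes can be reconstructed from the diff excerpt roughly as follows. This is a dependency-free sketch: the Android `View` parameter is simplified to `Object`, and the body of `finish()` is elided.

// Minimal sketch of the dismiss() refactor (member names taken from the figure).
public class Snackbar {
    private boolean mAnimated = true;
    private boolean mIsDismissing;

    // Swipe callback: dismiss without animation instead of calling finish() directly.
    public void onDismiss(Object view, Object token) {
        if (view != null) {
            dismiss(false);
        }
    }

    // The public API keeps its no-argument signature and delegates.
    public void dismiss() {
        dismiss(mAnimated);
    }

    // New private overload: the flag decides whether the dismissal is animated.
    private void dismiss(boolean animate) {
        if (mIsDismissing) {
            return;
        }
        mIsDismissing = true;
        if (!animate) {
            finish();
        }
    }

    private void finish() { /* tear-down elided */ }
}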
Table 12: Average query time and cost per code review for CodeAgent-3.5 and CodeAgent-4.

Model           Query Time (min)   Cost (USD)
CodeAgent-3.5   3                  0.017
CodeAgent-4     5                  0.122

O Tool
We develop a website for CodeAgent, shown in Figure 23; it is also accessible via the following link:
https://code-agent-new.vercel.app/index.html

Figure 23: Website of CodeAgent
