
RTLFixer: Automatically Fixing RTL Syntax Errors with Large Language Models

YunDa Tsai*        Mingjie Liu*        Haoxing Ren
NVIDIA             NVIDIA              NVIDIA
*Equal contribution.

arXiv:2311.16543v3 [cs.AR] 20 May 2024

ABSTRACT
This paper presents RTLFixer, a novel framework enabling automatic syntax error fixing for Verilog code with Large Language Models (LLMs). Despite LLMs' promising capabilities, our analysis indicates that approximately 55% of errors in LLM-generated Verilog are syntax-related, leading to compilation failures. To tackle this issue, we introduce a novel debugging framework that employs Retrieval-Augmented Generation (RAG) and ReAct prompting, enabling LLMs to act as autonomous agents that interactively debug the code with feedback. This framework demonstrates exceptional proficiency in resolving syntax errors, successfully correcting about 98.5% of compilation errors in our debugging dataset, comprising 212 erroneous implementations derived from the VerilogEval benchmark. Our method leads to 32.3% and 10.1% increases in pass@1 success rates on the VerilogEval-Machine and VerilogEval-Human benchmarks, respectively. The source code and benchmark are available at https://github.com/NVlabs/RTLFixer.

ACM Reference Format:
YunDa Tsai, Mingjie Liu, and Haoxing Ren. 2024. RTLFixer: Automatically Fixing RTL Syntax Errors with Large Language Models. In 61st ACM/IEEE Design Automation Conference (DAC '24), June 23-27, 2024, San Francisco, CA, USA. ACM, New York, NY, USA, 7 pages. https://doi.org/10.1145/3649329.3657353

[Figure 1: system diagram. A prompt template combining (1) the design description, (2) the erroneous module code, and (3) compiler feedback is sent to the LLM (GPT-3.5 or GPT-4). ReAct plans intermediate Thought/Action/Observation steps and invokes Compiler and RAG tools (a retriever over a database of compiler logs and human guidance) in a feedback loop, storing failures and returning revised code until compilation passes.]
Figure 1: Overview of RTLFixer. The Autonomous Language Agent fixes the syntax error via a feedback loop. ReAct handles the iterative code refinement with intermediate reasoning and action steps. Human expert guidance is incorporated through RAG.

1 INTRODUCTION
Large language models (LLMs) present great promise in automating hardware design, especially in their capacity to understand design intentions and produce Verilog code from natural language [8]. Recent efforts, such as VeriGen [17] and VerilogEval [9], have primarily focused on zero-shot code generation. However, akin to numerous complex programming tasks, generating flawless code in a single attempt poses a significant challenge, with a high likelihood of errors. Consequently, there exists a clear need for robust debugging and refinement capabilities in Verilog code generated by Large Language Models. This need arises because, as with human programmers, achieving precision often requires multiple iterations. Most importantly, it is evident that Large Language Models encounter challenges in generating fully syntactically correct Verilog code. Surprisingly, our analysis reveals that a substantial 55% of the errors generated by LLMs for Verilog code are syntax errors, surpassing the occurrence of logic errors detected through simulation. Rectifying syntax errors not only enhances the overall accuracy of LLM-generated code but also holds the potential to alleviate manual effort for human engineers engaged in Verilog coding. The recognition and mitigation of syntax errors stand as imperative steps, not only for refining LLM capabilities but also for streamlining the coding process for human practitioners in the domain of Verilog development.

Despite such challenges, LLMs have showcased remarkable capabilities in reasoning and enhancing action plans to address exceptions. The ReAct framework [20] integrates reasoning and action synthesis in language models, demonstrating LLMs' capability to engage in reasoning processes and refine decision-making through interactive feedback. Similarly, SelfDebug [3] and SelfEvolve [6] illustrate the model's ability to self-identify mistakes by scrutinizing execution results and articulating generated code in natural language. It is essential to highlight that prior works do not explicitly address the correction of syntax errors and primarily center on improving the accuracy of generated code, particularly in Python, where language models excel syntactically.

On the other hand, Large Language Models are acknowledged for their inclination to produce factual errors, a phenomenon termed hallucination [5]. To mitigate this challenge, the Retrieval-Augmented Generation (RAG) paradigm [7] has been introduced, integrating
retrieval mechanisms to improve the precision of generated content by incorporating information from external knowledge sources.

In this paper, we introduce RTLFixer, an innovative debugging framework that utilizes LLMs as autonomous language agents in conjunction with RAG. We exclusively focus on addressing the challenge of rectifying syntax errors in RTL Verilog code—an essential problem with potential benefits for both LLMs and human engineers. As shown in Figure 1, our framework combines established human expertise stored in a retrieval database for correcting syntax errors while simultaneously harnessing the capabilities of LLMs as autonomous agents for reasoning and action planning (ReAct). Through incorporating human expertise, our approach provides explicit guidance and explanations when LLMs face challenges in error correction. The stored compiler messages and human expert guidance function as a persistent external non-parametric memory database, enhancing results through RAG. By empowering LLMs with ReAct, LLMs serve as autonomous agents adept at strategically planning intermediate steps for iterative debugging. We also create VerilogEval-syntax, a Verilog syntax debugging dataset derived from VerilogEval [9], containing 174 erroneous implementations.

Our contributions are summarized as follows:
• Our framework demonstrates an impressive 98.5% success rate in resolving syntax errors, resulting in noteworthy 32.3% and 10.1% improvements in the pass@1 metrics achieved solely by addressing syntax errors in the VerilogEval-Machine and VerilogEval-Human benchmarks, respectively.
• Our framework also improves the syntax success rate from 73% to 93% on the RTLLM benchmark [11], demonstrating the generalizability of this approach.
• Compared to One-shot generation, ReAct enhances syntax success rates by 25.7%, 26.4%, and 31.2% with iterative feedback from Simple, iverilog, and Quartus, respectively.
• RAG with human guidance significantly improves syntax success rates, by up to 31.2% and 18.6% with feedback from Quartus for One-shot and ReAct prompting, respectively.

The remainder of this paper is structured as follows. In Section 2, we present preliminaries on LLMs for Verilog code generation, ReAct, and RAG. Our debugging framework RTLFixer is elucidated in Section 3, where we empower LLMs as autonomous agents with ReAct and innovatively provide human guidance through RAG. Section 4 details our experimental results, showcasing the effectiveness of our method in correcting syntax errors and improving the pass rate. Finally, Section 6 summarizes and concludes the paper.

2 PRELIMINARIES
In this section, we begin by briefly exploring the advancements and applications of LLMs for Verilog code generation in Section 2.1. We then delve into the synthesis of reasoning and action in LLMs, elaborated in Section 2.2. Finally, we discuss Retrieval-Augmented Generation in Section 2.3.

2.1 LLMs for Verilog Code Generation
Large Language Models exhibit the capability to generate code, with Codex [2] standing as an early exemplar. GitHub Copilot [4], building upon such groundwork, played a crucial role in pioneering LLM-based code completion engines, contributing significantly to the domains of auto-completion and conversational code generation. DAVE [14] emerged as an early study of LLMs tailored for hardware design. VeriGen [17] further expanded the dataset scope and experimented with open-sourced models. Following suit, Chip-Chat [1], leveraging GPT-4, demonstrated the extensive potential of LLMs in collaboratively generating processors and other hardware designs. In parallel, benchmarks such as VerilogEval [9] and RTLLM [11] played a pivotal role in advancing the application of LLMs in Verilog code generation.

2.2 Reasoning and Action Synthesis of LLMs
Large Language Models (LLMs) have demonstrated proficiency in both reasoning and planning across various tasks. In terms of reasoning, methods like chain-of-thought [18] empower LLMs to break down intricate problems into logical steps, significantly enhancing their problem-solving abilities. LLMs also excel in interactive decision-making and formulating action plans, effectively leveraging digital tools, as shown in works such as ToolLLM [15].

The ReAct [20] framework represents a notable advancement in LLM capabilities by seamlessly integrating reasoning and action planning. This framework enables LLMs to generate both reasoning traces and specific actions, facilitating dynamic interaction with external information sources. This integration not only enhances LLMs' performance in complex tasks but also renders them more reliable and versatile as autonomous agents, capable of delivering more accurate and context-aware responses.

2.3 Retrieval-Augmented Generation
Retrieval-Augmented Generation (RAG) [7] represents a significant advancement in addressing the limitations of Large Language Models when handling knowledge-intensive tasks. LLMs, despite containing extensive factual knowledge, often encounter challenges in accessing and effectively manipulating this information. RAG leverages a combination of LLMs, which serve as parametric memory, and an external knowledge base, such as Wikipedia, functioning as non-parametric memory. This unique approach allows RAG to access and retrieve relevant documents or passages from the external knowledge base based on the input query. Consequently, this enriches the context available to the text generator, resulting in outputs that exhibit improved accuracy and factual consistency. In the context of code generation, several works such as ReACC [10] and RepoCoder [21] have successfully harnessed the capabilities of RAG to enhance the code generation proficiency of LLMs, showcasing its transformative potential.

3 RTLFIXER: RESOLVING SYNTAX ERRORS WITH LLM AGENTS AND RETRIEVAL
In this section, we explain the details of RTLFixer, which utilizes Autonomous Language Agents enhanced with ReAct and Retrieval-Augmented Generation (RAG). The framework's structure is outlined in Figure 1 (Section 3.1), and the application of ReAct is thoroughly discussed in Section 3.2. Furthermore, Section 3.3 explores the integration of human expert guidance using RAG. Finally, we explain the curation process for our VerilogEval-syntax error dataset.
3.1 Overview of RTLFixer
RTLFixer comprises an LLM for code generation, RAG for accessing human expert guidance, and ReAct for improved task decomposition, tool use, and planning. Our approach starts by formulating an input prompt that integrates a benchmark dataset problem into a template, after which the agent, utilizing RAG and ReAct, revises the erroneous Verilog code. If syntax errors persist, error logs from the compiler as well as retrieved human guidance from the database are provided as feedback. This interactive debugging loop can be repeated multiple times until all errors are resolved.

3.2 Reasoning and Action Planning through ReAct Iterative Prompting
We enable Large Language Models to function as autonomous agents for reasoning and action planning through the ReAct prompting mechanism [20]. In ReAct, LLMs generate both reasoning traces and task-specific actions in an interleaved manner. The input prompt, along with the ReAct instruction prompt, is provided to an LLM. Subsequently, the LLM initiates the generation of ReAct steps, each consisting of Thought, Action, and Observation components. An example of a ReAct instruction prompt is depicted in Figure 2b, while Figure 2c illustrates the self-prompting process, showcasing the intermediate steps within each iteration of ReAct.

During this process, the LLM prompts itself for thoughts on how to address the error and selects the next action. Potential actions include generating an explanation for the error, searching for a solution in the human expert guidance database, revising the code, and submitting the revised code to the compiler, among other possibilities. The output of the chosen action becomes the observation in the prompt. The agent continues prompting until the compilation is successful, selecting the Finish action to output the final response. If unsuccessful, the process iterates up to n times, where n is a user-selected hyperparameter. Our objective is to assess the effectiveness of a fully automated feedback-driven solution.

We employ One-shot prompting, illustrated in Figure 2a, as the baseline for comparison with ReAct. One-shot prompting involves only a single turn of feedback from the compiler message, but it excludes ReAct prompting, which decomposes syntax fixing through reasoning and action planning, and the iterative multi-round interactions with the compiler.

System Prompt:
Implement the Verilog module based on the following description. Assume that signals are positive clock/clk edge triggered unless otherwise stated.

Problem Description:
Implement the Verilog module based on the following description. Assume that signals are positive clock/clk edge triggered unless otherwise stated.
Given an 8-bit input vector [7:0], reverse its bit ordering.
module top_module (
    input [7:0] in,
    output [7:0] out
);

Erroneous Implementation:
assign {out[0],out[1],out[2],out[3],out[4],out[5],out[6],out[8]} = in;
endmodule

Feedback:
main.v:5: error: Index out[8] is out of range.
1 error(s) during elaboration.

(a) One-shot Prompt Template

Solve a task with interleaving Thought, Action, Observation steps. Thought can reason about the current situation, and Action can be the following types:
(1) Compiler[code], which compiles the input code and provide error message if there is syntax error.
(2) Finish[answer], which returns the answer and finished the task.
(3) RAG[logs], input the compiler log and retrieve expert solutions to fix the syntax error.

(b) ReAct Instruction (System Prompt)

Question:
What is the syntax error in the given Verilog module implementation and how to fix it?

Thought 1:
Implement a finite state machine that searches for message boundaries in a byte stream.
Action 1: Compiler
always @ (state or in) begin
  case (state)
    4'b0001: begin
Observation 1:
main.v:15: error: out is not a valid l-value in top_module.
Thought 2:
The code has a syntax error because the "out" signal was declared as a wire instead of a register. I need to correct this and run the compilation.
Action 2: RAG[..not a valid l-value..]
Use assign statements instead of always block if possible.
...
Action n: Finish
The revised code is as follow: ...

(c) ReAct Steps with Thought-Action-Observation.

Figure 2: Prompts used for ReAct. (a) shows the One-shot prompting template with feedback message. (b)-(c) demonstrate the example where LLMs serve as autonomous agents with ReAct to decompose syntax fixing problems with reasoning and planning.
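To make the loop behind Figure 2 concrete, the sketch below shows one way the Thought-Action-Observation cycle could be driven in Python. It is an illustration only, not the released implementation: llm and retrieve_guidance are placeholder callables, compile_verilog is a plain iverilog invocation standing in for the Compiler action, and the 10-iteration cap mirrors the budget described in Section 4.

# Minimal sketch of a ReAct-style debugging loop (illustrative, not the
# paper's released code). llm() and retrieve_guidance() are placeholders.
import os
import subprocess
import tempfile

MAX_ITERATIONS = 10  # iteration budget, matching the setup in Section 4

def compile_verilog(code: str) -> str:
    """Compile with iverilog and return its error log ('' when compilation passes)."""
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "main.v")
        with open(src, "w") as f:
            f.write(code)
        result = subprocess.run(
            ["iverilog", "-o", os.path.join(tmp, "a.out"), src],
            capture_output=True, text=True,
        )
        return "" if result.returncode == 0 else (result.stderr + result.stdout)

def react_fix(problem: str, code: str, llm, retrieve_guidance) -> str:
    """Iteratively revise `code` until it compiles or the iteration budget runs out."""
    transcript = f"Problem:\n{problem}\n\nErroneous implementation:\n{code}\n"
    for _ in range(MAX_ITERATIONS):
        log = compile_verilog(code)            # Compiler action
        if not log:                            # Finish action: compilation succeeded
            return code
        guidance = retrieve_guidance(log)      # RAG action: look up human expert advice
        transcript += (
            f"\nObservation (compiler log):\n{log}\n"
            f"Retrieved guidance:\n{guidance}\n"
            "Thought: explain the error, then output a revised module.\n"
        )
        code = llm(transcript)                 # the agent's next action revises the code
    return code                                # unresolved after the iteration budget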
3.3 Retrieval Augmented Generation (RAG)
We leverage Retrieval-Augmented Generation (RAG), a potent technique that notably enhances Large Language Models' capabilities by incorporating human expert guidance through a retriever. A key distinction from traditional RAG lies in our curated database, enriched with human instructions and demonstrations.

The retrieval database curation process involves a meticulous procedure of categorizing syntax errors and developing instructions and demonstrations for syntax error resolution. In the initial step, we categorize various syntax errors into groups using error number tags provided by compilers (such as Quartus) in the compiler logs. During the manual inspection of the LLM's struggle cases, it becomes evident that ambiguous error messages present a significant challenge, impeding the model's error resolution capabilities. The inclusion of clear instructions and demonstrations of possible solutions enables the LLM to adeptly address errors. To facilitate this, human experts offer detailed explanations for compiler logs, serving as human expert guidance. An illustrative example of common errors is showcased in Figure 3. Subsequently, all compiler logs,
error code segments, and corresponding human guidance undergo systematic storage in the database for future retrieval.

We integrate the human guidance and demonstration database with Large Language Models (LLMs) using Retrieval-Augmented Generation (RAG). RAG enables the retrieval of pertinent documents or data from a source, utilizing this information as context for the original input prompt. This approach allows the language model to access the latest information without the need for retraining, proving especially valuable in enhancing the model's capacity to generate more accurate and reliable outputs. Leveraging RAG further ensures that the language model has access to the most current and relevant information, including compiler logs and human guidance, facilitating effective error resolution.

Figure 3 illustrates two common error categories along with a demonstration of compiler logs and corresponding human guidance. For this task, common retrievers such as pattern-matching, fuzzy search, or similarity search with a vector database are suitable. In our experiments, we opted for an exact match to error tags for simplicity, given the limited number of error cases. We collected 7 common error categories with 30 entries for iverilog and 11 common error categories with 45 entries for Quartus in total.

Compiler Logs:
Object 'clk' is not declared. Verify the object name is correct. If the name is correct, declare the object.
Human Expert Guidance:
Check if 'clk' is an input. If not, and if 'clk' is used within the module, make sure the name is correct. If it's meant to trigger an 'always' block, replace 'posedge clk' with '*'.

Compiler Logs:
Index cannot fall outside the declared range for vector
Human Expert Guidance:
Carefully examine the index values to prevent encountering 'index out of bound' errors in your code. When utilizing parameters for indexing, try to use binary strings for performing the indexing operation instead.

Figure 3: Examples of common error categories that the LLM constantly could not solve and the corresponding human expert guidance in the retrieval database.
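Because retrieval keys on compiler error tags, the retriever itself can stay very small. The sketch below shows one way such an exact-match lookup could be organized; the tags and guidance strings are illustrative stand-ins for the curated 30- and 45-entry databases, not the actual entries. An iverilog-oriented variant would need different patterns, since iverilog logs carry no numeric tags (cf. Figure 5).

# Illustrative exact-match retriever keyed on Quartus-style error tags.
# The entries below are example stand-ins, not the curated database.
import re

GUIDANCE_DB = {
    "10161": "Check if the signal is an input. If it should trigger an "
             "'always' block, replace 'posedge clk' with '*'.",
    "10232": "Carefully examine index values to avoid 'index out of bound' "
             "errors; prefer binary strings when indexing with parameters.",
}

def retrieve_guidance(compiler_log: str) -> str:
    """Return stored guidance for every error tag found in the log."""
    tags = re.findall(r"Error \((\d+)\)", compiler_log)  # e.g. 'Error (10161): ...'
    hits = [GUIDANCE_DB[t] for t in tags if t in GUIDANCE_DB]
    return "\n".join(hits) if hits else "No stored guidance for this error."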
3.4 Debugging Dataset
We created a novel benchmark dataset, VerilogEval-syntax, based on the VerilogEval benchmark [9]. This dataset comprises flawed code implementations sourced from the VerilogEval problem set. Each entry includes the original problem description and an erroneous implementation containing syntax errors.

The dataset curation includes sampling, filtering, and clustering. Code samples were selected from VerilogEval problems using One-shot and ReAct prompting methods with the gpt-3.5-turbo model, retaining only error-inducing samples. In the filtering phase, we focus on code with compile errors and use the following processing and filtering criteria: extraction of code from markdown blocks, validation of module statements, and removal of samples with extraneous language or empty module bodies. The final step involved clustering using DBSCAN [16] with Jaccard distance [12], grouping similar implementations to select representative examples while ensuring a diverse representation of syntax errors. This results in a total of 212 erroneous implementations in the dataset.
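The clustering step can be reproduced in outline with scikit-learn's DBSCAN on a precomputed Jaccard distance matrix over token sets. The sketch below is a simplified illustration: the whitespace tokenization, eps, and min_samples values are assumptions rather than the exact settings used for the dataset.

# Simplified sketch of the dedup/clustering step: DBSCAN over pairwise
# Jaccard distances between token sets of erroneous implementations.
import numpy as np
from sklearn.cluster import DBSCAN

def jaccard_distance(a: set, b: set) -> float:
    union = a | b
    return 1.0 - (len(a & b) / len(union)) if union else 0.0

def cluster_samples(codes: list[str], eps: float = 0.3) -> list[str]:
    token_sets = [set(code.split()) for code in codes]  # assumed tokenization
    n = len(token_sets)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            dist[i, j] = dist[j, i] = jaccard_distance(token_sets[i], token_sets[j])
    labels = DBSCAN(eps=eps, min_samples=1, metric="precomputed").fit_predict(dist)
    # keep one representative implementation per cluster
    representatives = {}
    for label, code in zip(labels, codes):
        representatives.setdefault(label, code)
    return list(representatives.values())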
4 EXPERIMENTS
In this section, we first present the evaluation metrics in Section 4.1, followed by our primary findings showcased in Section 4.2. Within this section, we show the performance improvements and the impact of ReAct and RAG. Finally, Section 4.3 details ablation studies on the quality of the feedback message and the LLM.

Setup: We conduct all experiments with GPT-3.5 as the LLM through OpenAI APIs [13], except for the ablation experiment on different LLMs. We specifically used gpt-3.5-turbo-16k-0613. A simple rule-based syntax fixer is applied to every LLM-generated Verilog code, which avoids simple errors such as misplaced timescale directives. In all experiments, we set the sampling temperature to 0.4. For ReAct prompting, we restrict the LLM to a maximum of 10 iterations of Thought-Action-Observation, where an Action might involve interactions with the compiler. We consider the syntax error resolved if any of the generated code passes. To limit test variance, we repeat each experiment 10 times and report the average.

4.1 Evaluation Metric
Compile Fix rate: To demonstrate the debugging capability of our method, we calculate the expected fix rate, with c as the number of fixed samples out of all n = 10 samples:

    \text{fix rate} = \mathbb{E}_{\text{problems}}\left[\frac{c}{n}\right]    (1)

Functional Correctness: We follow recent work in directly measuring code functional correctness with simulation through the pass@k metric [2], where a problem is considered solved if any of the k samples passes the tests. We use the unbiased estimator as follows and ensure n = 20 is sufficiently large:

    \text{pass@}k = \mathbb{E}_{\text{problems}}\left[1 - \binom{n-c}{k} \Big/ \binom{n}{k}\right]    (2)
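Both metrics reduce to simple counting over the n samples of each problem. The sketch below computes them directly from Eq. (1) and Eq. (2); the per-problem counts in the example call are made up for illustration.

# Direct implementation of Eq. (1) and Eq. (2): expectation over problems of
# the per-problem fix rate and of the unbiased pass@k estimator.
from math import comb

def fix_rate(fixed_counts: list[int], n: int = 10) -> float:
    """Eq. (1): mean of c/n over problems, c = fixed samples out of n."""
    return sum(c / n for c in fixed_counts) / len(fixed_counts)

def pass_at_k(correct_counts: list[int], n: int = 20, k: int = 1) -> float:
    """Eq. (2): mean over problems of 1 - C(n-c, k) / C(n, k)."""
    return sum(1.0 - comb(n - c, k) / comb(n, k) for c in correct_counts) / len(correct_counts)

# Example with made-up counts for 3 problems, n = 20 samples each:
print(pass_at_k([20, 5, 0], n=20, k=1))   # (1.0 + 0.25 + 0.0) / 3 ≈ 0.4167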
4.2 Main Results

Prompt     RAG    Simple   iverilog   Quartus   GPT-4
One-shot   w/o    0.414    0.536      0.587     0.91
           w/     -        0.800      0.899     0.98
ReAct      w/o    0.671    0.731      0.799     0.92
           w/     -        0.820      0.985     0.99
Table 1: Fix rate for One-shot vs. ReAct, w/ and w/o RAG, ablation on feedback quality and LLMs on VerilogEval-syntax.

The main results in Table 1 show the effectiveness of ReAct and RAG, each of which provides a performance gain by a large margin. We note that One-shot generation includes only a single-turn interaction with either the simple or the compiler message as feedback. Table 2 shows the improvement of pass@{1,5} on the VerilogEval dataset after fixing syntax errors, with visualizations in Figure 4. The VerilogEval-Human benchmark statistics show that syntax errors constitute a significant 55% of errors in GPT-3.5 generated Verilog code, surpassing simulation errors. With just addressing syntax errors using our approach (shown in the inner circle), the pass rate increases from 26.7% to 36.8%. Table 3 shows that our approach can generalize across different benchmarks. We further detail our findings below.

Dataset    Set     pass@1               pass@5
                   original   fixed     original   fixed
Human      All     0.267      0.368     0.458      0.506
           easy    0.521      0.666     0.808      0.847
           hard    0.053      0.120     0.164      0.221
Machine    All     0.467      0.799     0.691      0.891
           easy    0.568      0.833     0.782      0.892
           hard    0.367      0.771     0.601      0.890
Table 2: Pass@k for simulation pass rate on the VerilogEval dataset after fixing syntax errors.

[Figure 4: two pie charts (VerilogEval-Machine and VerilogEval-Human) breaking down results into Passed (easy/hard), Compilation Error (easy/hard), and Simulation Error (easy/hard) before (inner) and after (outer) syntax fixing.]
Figure 4: VerilogEval pass@1 results prior (inner) and post (outer) syntax error fixing with RTLFixer.

LLM                  Syntax Success Rate   pass@1
GPT-3.5              73%                   11%
GPT-3.5 + RTLFixer   93%                   16%
Table 3: Improvements of syntax success rate and simulation pass@1 on the RTLLM benchmark using ReAct and RAG with the Quartus compiler.

Impact of ReAct: ReAct significantly outperforms One-shot generation. ReAct's ability to iteratively revise code with reasoning and planning results in superior performance. Moreover, even without explicit feedback from the compiler (Simple), the intermediate reasoning steps, similar to chain-of-thought, can still bring considerable improvements, from 41.4% to 67.1%. When compared with One-shot without RAG, ReAct enhances syntax success rates by 25.7%, 26.4%, and 31.2% with iterative feedback from Simple, iverilog, and Quartus, respectively. We also observe consistent improvement from ReAct, regardless of the compiler and the use of RAG.

Impact of RAG: Notably, the application of RAG with human expert guidance boosts the fix rate considerably and substantially enhances the solution's reliability. The results with the Quartus compiler in Table 1 show that RAG improves the fix rate by 31.2% (58.7% to 89.9%) for One-shot and 18.6% (79.9% to 98.5%) when using ReAct. We observe consistent improvement with RAG, regardless of the quality of the compiler feedback message and the LLM (GPT-4).

Simulation Correctness Improvement: Previous research [9, 11] evaluates the performance of LLM-generated Verilog code using the pass@k metric. However, this approach does not account for syntax errors in the code samples, which can skew accuracy. In our study, we evaluate functional correctness using the VerilogEval benchmark, specifically addressing fixes to syntax errors in the code samples. The results, displayed in Table 2, show the performance scores on the VerilogEval dataset and the improvements after rectifying syntax errors, with 32.3% and 10.1% improvement on the pass@1 metric for Machine and Human, respectively. We further divided the VerilogEval benchmark into two subsets: easy, comprising 71 problems, and hard, consisting of 85 problems. These subsets have been delineated based on a pass rate threshold of 0.1 on Human. For simple problems in Human and low-level descriptions in Machine, the correction of syntax errors significantly enhances the pass rate, reaching around 80% for pass@1. When contrasting the pass rate improvements between easy and hard problems in the Human descriptions, we observe a greater improvement for easy problems at 14.5% compared to hard problems at 6.7% for pass@1. This discrepancy suggests that LLMs still face challenges when advanced reasoning and problem-solving skills are required. Discussions on future work to address simulation errors are presented in Section 5.

Generalizability: Our method using ReAct and RAG can be generalized to other benchmarks. To account for potential overfitting during the design of the retrieval database, we also tested our method on the RTLLM [11] benchmark without deriving new human guidance for the retrieval database. As shown in Table 3, our framework improves the syntax success rate from 73% to 93%, demonstrating its capability to generalize.¹
¹The syntax success rate we collected differs from the original paper because we used the Verilog code sample provided in their repo for each problem.

4.3 Ablation Studies
Our research includes two ablation studies designed to evaluate the impact of feedback quality and the selection of LLMs on the effectiveness of syntax error correction.

4.3.1 Impact of Feedback Quality. We study the impact of feedback quality using the different feedback messages detailed below.
Simple: We only give an instruction prompt, "Correct the syntax error in the code.", without any explicit instruction on what the error is about and how to fix it.
Icarus Verilog (iverilog) [19]: Open-source Verilog simulator. The compiler occasionally encounters edge cases where it fails to provide informative logs, outputting messages such as "I give up." Logs lack clarity, making them challenging to decipher.
Quartus²: Commercial compiler for FPGAs. In contrast with the open-source counterpart, it delivers well-defined and clear logs, effectively identifying errors and often offering suggestions and validation tips, making it more user-friendly and informative.
²https://www.intel.com/content/www/us/en/products/details/fpga/development-tools/quartus-prime/resource.html

We deem Simple, iverilog, and Quartus to have increasing levels of feedback message quality and illustrate the difference between the two compilers with an example in Figure 5. Results depicted in Table 1 clearly demonstrate that compiler logs give better feedback than Simple feedback and that the quality of compiler output impacts the performance of LLM debugging. As the quality of the compiler message improves (iverilog vs. Quartus), the success rate of fixing
syntax errors also increases. Intriguingly, the disparity between iverilog and Quartus results is more pronounced when using ReAct with RAG. This discrepancy potentially suggests that high-quality compiler messages enhance the LLM's ability to more effectively utilize the retrieved human expert guidance.

Task ID: vector100r

Erroneous Implementation:
1   module top_module (
2       input [99:0] in,
3       output reg [99:0] out
4   );
5       always @(posedge clk) begin
6           for (int i = 0; i < 100; i = i + 1) begin
7               out[i] <= in[99 - i];
8           end
9       end
10  endmodule

iverilog:
vector100r.sv:5: error: Unable to bind wire/reg/memory 'clk' in 'top_module'
vector100r.sv:5: error: Failed to evaluate event expression 'posedge clk'.
2 error(s) during elaboration.

Quartus:
Error (10161): Verilog HDL error at vector100r.sv(5): object "clk" is not declared. Verify the object name is correct. If the name is correct, declare the object. File: /tmp/tmp4u6ib9ig/vector100r.sv Line: 5
Error: Quartus Prime Analysis & Synthesis was unsuccessful. 1 error, 1 warning

Figure 5: Example of compiler logs from iverilog and Quartus. Quartus feedback messages are more informative.

4.3.2 Impact of Different LLMs. In Table 1, we present the results obtained when utilizing GPT-4 as the underlying LLM with Quartus as the compiler. A notable improvement in syntax error resolution is observed when using GPT-4 compared to GPT-3.5, particularly with One-shot prompting with RAG, where the success rate increased from 89.9% to 98%. When comparing the results of GPT-4 between One-shot and ReAct, we observe minor improvements of approximately 1%, suggesting that GPT-4 is already a robust agent proficient in fixing syntax errors without the need for reasoning, action planning, and iterative refinement.

Nonetheless, it is important to highlight that our approach of empowering LLMs with ReAct and RAG can significantly narrow the gap between weaker LLMs and stronger ones, which is especially beneficial for weaker open-source models [9, 17] that may not be as performant as GPT-4 on programming tasks.

5 ANALYSIS AND DISCUSSION
In this section, we delve into a series of analyses and discussions, extracting valuable insights from our discoveries. Specifically, we provide analysis of failure cases and the effect of iterative code refinement. Additionally, we discuss the challenges associated with applying our method to debugging simulation logic errors.

Failure due to LLM's Incapability: Most of the cases that, even with the aid of ReAct and RAG, failed to correct syntax errors are due to the fundamental incapability of the LLM. Figure 6 illustrates one of the failure cases, which particularly requires arithmetic index calculations to solve the index out-of-range error. Some other notable failures occurred in cases where LLMs were confident in incorrect syntax, possibly due to it being accepted in C/C++.

Erroneous Implementation (Partial):
for (i = 0; i < 16; i = i + 1) begin : ROW
    for (j = 0; j < 16; j = j + 1) begin : COLUMN
        neighbors[0] = q[(i-1)*16 + (j-1)];
        row_above = q[((i-1) & 15)*16 + j];
        ...

Compile Error:
Error (10232): Verilog HDL error at conwaylife.sv(23): index -17 cannot fall outside the declared range [255:0] for vector "q"

Figure 6: An example in which the agent failed to fix a syntax error. The LLM failed to calculate array indices in the for loop and does not recognize the out-of-bound error.

Iterative Code Refinement: In Figure 7, we analyze the number of iterations ReAct requires to fix syntax errors. About 90% of problems are resolved in a single revision. For the remaining cases, additional code revisions are necessary, as new errors may surface after addressing the initial ones.

[Figure 7: log-scale histogram of the number of samples over the number of ReAct iterations (0-10).]
Figure 7: Distribution of iterations required by ReAct to fix syntax errors.

Challenges in Debugging Simulation Errors: While our framework could be readily adapted to employ LLMs for debugging simulation errors, our preliminary studies revealed limited improvements beyond syntax error fixes. Despite our efforts to provide simulation error logs as feedback to LLM agents, including summaries of output error counts and text-formatted waveform-like comparisons of erroneous versus solution output, we observed that LLMs had constrained capabilities to comprehend simulation feedback messages. They only exhibited proficiency in fixing logic implementation errors for simple problems but struggled with more complex questions, especially those involving high-level design functionality descriptions and advanced reasoning. Addressing the challenges associated with LLMs fixing erroneous implementations in such problems, particularly those requiring advanced reasoning and problem-solving skills, remains an exciting area for future research. This also highlights the need to improve LLMs' capabilities in reasoning and problem-solving related to hardware design.
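As a rough illustration of the kind of simulation feedback we experimented with, the sketch below formats mismatching output samples into a text, waveform-like comparison. The record structure and field names are hypothetical and do not reproduce the exact feedback format used in our preliminary study.

# Hypothetical formatter for simulation feedback: summarizes the mismatch
# count and lists time-stamped expected-vs-actual output values side by side.
def format_sim_feedback(mismatches: list[dict], total_checks: int) -> str:
    lines = [f"{len(mismatches)} of {total_checks} output checks mismatched."]
    lines.append(f"{'time':>8} {'signal':>10} {'expected':>10} {'actual':>10}")
    for m in mismatches[:20]:  # cap the excerpt so the prompt stays short
        lines.append(f"{m['time']:>8} {m['signal']:>10} {m['expected']:>10} {m['actual']:>10}")
    return "\n".join(lines)

# Example with made-up mismatch records:
print(format_sim_feedback(
    [{"time": 10, "signal": "out", "expected": "8'h3c", "actual": "8'h00"}],
    total_checks=128,
))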
6 CONCLUSION
Our framework, RTLFixer, demonstrates the significant impact of employing Retrieval-Augmented Generation (RAG) and advanced prompting methods like ReAct in debugging Verilog code with Large Language Models. Key findings indicate that these approaches notably enhance syntax error resolution, achieving success rates as high as 98.5%. This research not only offers a novel autonomous language agent for Verilog code debugging but also introduces a comprehensive dataset for further exploration.

REFERENCES
[1] Jason Blocklove, et al. 2023. Chip-Chat: Challenges and Opportunities in Con-
versational Hardware Design. arXiv preprint arXiv:2305.13243 (2023).
[2] Mark Chen, et al. 2021. Evaluating large language models trained on code. arXiv
preprint arXiv:2107.03374 (2021).
[3] Xinyun Chen, et al. 2023. Teaching large language models to self-debug. arXiv
preprint arXiv:2304.05128 (2023).
[4] Nat Friedman. 2021. Introducing GitHub Copilot: your AI pair program-
mer. (2021). https://github.blog/2021-06-29-introducing-github-copilot-ai-pair-programmer
[5] Ziwei Ji, et al. 2023. Survey of Hallucination in Natural Language Generation.
Comput. Surveys 55, 12 (March 2023), 1–38. https://doi.org/10.1145/3571730
[6] Shuyang Jiang, et al. 2023. SelfEvolve: A Code Evolution Framework via Large
Language Models. arXiv preprint arXiv:2306.02907 (2023).
[7] Patrick Lewis, et al. 2021. Retrieval-Augmented Generation for Knowledge-
Intensive NLP Tasks. arXiv:cs.CL/2005.11401
[8] Mingjie Liu, et al. 2023. ChipNeMo: Domain-Adapted LLMs for Chip Design.
arXiv:cs.CL/2311.00176
[9] Mingjie Liu, et al. 2023. VerilogEval: Evaluating Large Language Models for
Verilog Code Generation. arXiv preprint arXiv:2309.07544 (2023).
[10] Shuai Lu, et al. 2022. ReACC: A Retrieval-Augmented Code Completion Frame-
work. arXiv:cs.SE/2203.07722
[11] Yao Lu, et al. 2023. RTLLM: An Open-Source Benchmark for Design RTL Gener-
ation with Large Language Model. arXiv preprint arXiv:2308.05345 (2023).
[12] Suphakit Niwattanakul, et al. 2013. Using of Jaccard coefficient for keywords
similarity. In Proceedings of the international multiconference of engineers and
computer scientists, Vol. 1. 380–384.
[13] OpenAI. 2023. OpenAI models API. (2023). https://platform.openai.com/docs/models
[14] Hammond Pearce, et al. 2020. Dave: Deriving automatically verilog from english.
In Proceedings of the 2020 ACM/IEEE Workshop on Machine Learning for CAD.
27–32.
[15] Yujia Qin, et al. 2023. Toolllm: Facilitating large language models to master
16000+ real-world apis. arXiv preprint arXiv:2307.16789 (2023).
[16] Erich Schubert, et al. 2017. DBSCAN revisited, revisited: why and how you
should (still) use DBSCAN. ACM Transactions on Database Systems (TODS) 42, 3
(2017), 1–21.
[17] Shailja Thakur, et al. 2023. VeriGen: A Large Language Model for Verilog Code
Generation. arXiv preprint arXiv:2308.00708 (2023).
[18] Jason Wei, et al. 2023. Chain-of-Thought Prompting Elicits Reasoning in Large
Language Models. arXiv:cs.CL/2201.11903
[19] Stephen Williams et al. 2002. Icarus verilog: open-source verilog more than a
year later. Linux Journal 2002, 99 (2002), 3.
[20] Shunyu Yao, et al. 2022. ReAct: Synergizing Reasoning and Acting in Language
Models. In The Eleventh International Conference on Learning Representations.
[21] Fengji Zhang, et al. 2023. Repocoder: Repository-level code completion through
iterative retrieval and generation. arXiv preprint arXiv:2303.12570 (2023).
