
Invited Paper: VerilogEval: Evaluating Large Language Models for Verilog Code Generation

Mingjie Liu, Nathaniel Pinckney, Brucek Khailany, and Haoxing Ren
NVIDIA Corporation
{mingjiel, npinckney, bkhailany, haoxingr}@nvidia.com

Abstract—The increasing popularity of large language models (LLMs) has paved the way for their application in diverse domains. This paper proposes a benchmarking framework tailored specifically for evaluating LLM performance in the context of Verilog code generation for hardware design and verification. We present a comprehensive evaluation dataset consisting of 156 problems from the Verilog instructional website HDLBits. The evaluation set consists of a diverse set of Verilog code generation tasks, ranging from simple combinational circuits to complex finite state machines. The Verilog code completions can be automatically tested for functional correctness by comparing the transient simulation outputs of the generated design with a golden solution. We also demonstrate that the Verilog code generation capability of pretrained language models can be improved with supervised fine-tuning by bootstrapping with LLM-generated synthetic problem-code pairs.
Fig. 1 VerilogEval uses a sandbox environment for simple and reproducible evaluation of LLM Verilog code generation.

I. INTRODUCTION

The escalating popularity of Large Language Models (LLMs), characterized by their remarkable capacity to comprehend and generate human-like text, has opened up a realm of possibilities across diverse domains [1]-[3]. LLMs tailored for specific domains have garnered significant attention owing to their impressive performance on both general-purpose benchmarks and specialized tasks within domains like financial engineering [4], biomedical studies [5], [6], and general scientific research [7]. When it comes to coding, LLMs can assist developers by suggesting code snippets, offering solutions to common programming challenges, and even explaining complex concepts in a more accessible manner [8], [9].

In the realm of Electronic Design Automation, LLMs offer the potential to aid engineers in designing and verifying digital systems, providing insights into Verilog coding, optimizing circuits, and automating time-consuming tasks [10], [11]. A number of studies have initiated the evaluation of LLMs' potential in generating Verilog code. Thakur et al. [12] fine-tuned CodeGen [9] models, which were evaluated on 17 designs. A later follow-up work [10] further demonstrated the ability to design chip-level circuits with ChatGPT. RTLLM [13] proposes a benchmark framework with 30 designs, focusing on increasing benchmark design scalability; its authors further improved solution quality with simple and effective prompt engineering techniques.

While LLMs have proven to be powerful tools, their pretraining phase, characterized by an unsupervised training loss, often lacks alignment with specific tasks. To enhance performance, supervised fine-tuning (SFT) is used [14], involving task-specific data to adapt to requirements. Ensuring model alignment is imperative for achieving improved performance, driving the investigation of increasingly computationally demanding techniques, such as reinforcement learning with human feedback [15], [16]. The cost associated with acquiring labeled data also remains a barrier, prompting a growing interest in alternative annotation-free alignment techniques. Self-Instruct [17] starts with a seed task, using LLMs to create more instructions and instances. WizardLM's Evol-Instruct [18] evolves instructions into a diverse dataset, which is further applied to code generation [19]. Additionally, a recent study utilized GPT-4 to generate a high-quality synthetic textbook dataset, achieving superior coding capabilities at 1/100th of the cost of other models [20]. Within the realm of Verilog coding, there remains a notable gap in the exploration of supervised fine-tuning for model enhancement.

Moreover, notwithstanding commendable endeavors, recent research in Verilog code benchmarking has revealed limitations concerning its comprehensiveness, quantity, and the diversity of problems studied. Effective benchmarks should exhibit diversity, encompassing a wide array of topics, to mitigate testing variance. Furthermore, they should offer unambiguous problem descriptions, ensuring that solutions can be assessed with clear distinctions regarding correctness. In addition, reliability and automation are key factors, enabling the straightforward evaluation of generated code through robust testing procedures.

Our research addresses these gaps through the introduction of VerilogEval1, an open-source benchmark that encompasses a diverse array of questions, offers clear and unambiguous problem descriptions, and incorporates automated, easily reproducible testing procedures. This contribution significantly enhances the robustness and effectiveness of the evaluation framework for Verilog code generation and assessment. Our specific contributions are as follows:

• We present a comprehensive evaluation dataset comprising 156 problems sourced from HDLBits. These problems have undergone meticulous curation, ensuring both clarity and diversity.
• We developed a benchmarking framework wherein Verilog code completions are subjected to automatic functional correctness testing.
• We constructed a synthetic supervised fine-tuning dataset by leveraging LLMs to generate problem descriptions paired with Verilog code. This dataset is employed in extensive experiments on SFT, further enhancing the model's proficiency in Verilog coding tasks.

II. EVALUATION FRAMEWORK

In this section we discuss the details of our evaluation framework and evaluation dataset collection. Our work closely follows the widely adopted Python coding benchmark HumanEval [21] for best practices. VerilogEval is presented in Fig. 1, where we develop a sandbox environment for simple and reproducible evaluation of LLM Verilog code generation.

A. VerilogEval Evaluation Set

We evaluate functional correctness on a selected set of problems from the Verilog instructional website HDLBits2. HDLBits is a collection of digital circuit design exercises and an online judge for learning digital logic using the Verilog hardware description language. The evaluation set consists of diverse Verilog coding tasks, ranging from module implementation of simple combinational circuits to complex finite state machines, code debugging, and testbench construction.

We focus on generating self-contained3 Verilog modules from natural language text descriptions. We define a Verilog module as self-contained if the module implementation does not require instantiation of any other modules. We emphasize the significance of module instantiation as a crucial capability in Verilog, playing an essential role in constructing extensive system-level designs; it is important to note that our evaluation does not delve into this topic. However, while most problems in VerilogEval are intentionally concise, they demand that the LLM possess a comprehension of hardware design along with adept problem-solving skills in areas encompassing circuits, Boolean logic, state transitions, and more.

Fig. 2 shows an example of the problem vectorr. The Problem Description includes both the natural language description and the module header with IO definitions. Including the module header removes ambiguity such as the bit width of signals. The Question Prompt is concatenated with the Problem Description and sent to the LLM for inference. The Canonical Solution is provided as the golden solution for testing.

System Prompt (Optional):
You only complete chats with syntax correct Verilog code. End the Verilog module code completion with 'endmodule'. Do not include module, input and output definitions.

Question Prompt:
Implement the Verilog module based on the following description. Assume that signals are positive clock/clk edge triggered unless otherwise stated.

Problem Description:
Given an 8-bit input vector [7:0], reverse its bit ordering.
module top_module (
    input [7:0] in,
    output [7:0] out
);

Canonical Solution:
assign {out[0],out[1],out[2],out[3],out[4],out[5],out[6],out[7]} = in;
endmodule

Fig. 2 Example of vectorr in VerilogEval-human. The Problem Description includes both the natural language description and the module header, input, and output definition.

B. Problem Descriptions

Although HDLBits serves as a valuable resource for Verilog coding challenges, a significant portion of the website's problem descriptions are not readily compatible with text-only language models. These problem descriptions rely on various modalities, frequently incorporating circuit schematic images, state transition diagram graphs, Boolean logic tables, and Karnaugh maps. We explore two methods for generating text-only problem descriptions for these problem sets.

1) VerilogEval-machine: We completely disregard the descriptions on the website and opt to utilize LLMs for the automated creation of problem descriptions. We employ the prompt template depicted in Fig. 3, using gpt-3.5-turbo. Initially, we create all problem descriptions using zero-shot methods. We validate these descriptions by using the LLM to produce code solutions. Problem descriptions are considered invalid if none of the generated completions succeed across 100 samples, and such descriptions are then discarded. Surprisingly, among the pool of 156 candidate problems, 108 of them yield successful solutions upon initial sampling. Subsequent to this, we use the valid generated descriptions as few-shot examples (4-shot) and proceed to further sample unsolved problems. In this phase, we iteratively sample descriptions along with their corresponding code completions (8 completions per description). Descriptions are labeled as valid as soon as any of the corresponding code completions pass testing. Sampling for each problem is halted upon reaching an allocated sampling budget, resulting in an increase of 35 additional solutions. In total we generated 143 valid problem descriptions.

1 https://github.com/NVlabs/verilog-eval
2 https://hdlbits.01xz.net/wiki/Problem sets
3 Example of a removed question that is not self-contained: https://hdlbits.01xz.net/wiki/Module cseladd.
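To make the prompting and validation flow above concrete, the following sketch assembles the full prompt from the system prompt, question prompt, and problem description of Fig. 2, and applies the validity rule used for machine-generated descriptions (a description is kept only if at least one sampled completion passes the testbench). This is a minimal illustration rather than the released harness; the function names and the generate and simulate_passes callables are assumptions introduced here for clarity.

# Minimal sketch of the VerilogEval prompting and validation flow.
# `generate` (LLM sampling) and `simulate_passes` (testbench check) are
# placeholders for whatever inference API and simulator wrapper is used.
from typing import Callable, List

SYSTEM_PROMPT = (
    "You only complete chats with syntax correct Verilog code. End the "
    "Verilog module code completion with 'endmodule'. Do not include "
    "module, input and output definitions."
)

QUESTION_PROMPT = (
    "Implement the Verilog module based on the following description. "
    "Assume that signals are positive clock/clk edge triggered unless "
    "otherwise stated."
)

def build_prompt(description: str, module_header: str) -> str:
    """Concatenate the question prompt with the problem description and
    module header, as in Fig. 2; the model completes the module body."""
    return f"{QUESTION_PROMPT}\n\n{description}\n{module_header}\n"

def description_is_valid(description: str,
                         module_header: str,
                         generate: Callable[[str, str, int], List[str]],
                         simulate_passes: Callable[[str], bool],
                         n_samples: int = 100) -> bool:
    """A machine-generated description is kept only if at least one of
    n_samples completions passes the reference testbench."""
    prompt = build_prompt(description, module_header)
    completions = generate(SYSTEM_PROMPT, prompt, n_samples)
    return any(simulate_passes(module_header + body) for body in completions)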

System Prompt:
Explain the high-level functionality of the Verilog module. Use as many high-level concepts that are directly applicable to describe the code, say at the level of an undergraduate EECS major, but do not include extraneous details that aren't immediately applicable. Use text-based truth tables and state transition graphs when necessary. Speak concisely as if this was a specification for a circuit designer to implement. You should only reply with descriptive natural language and not use any code.

Few-shot Examples (Optional):
Question: Explain the high-level functionality of the Verilog module.
module top_module (
    input [7:0] in,
    output [7:0] out
);
assign {out[0],out[1],out[2],out[3],out[4],out[5],out[6],out[7]} = in;
endmodule
Answer: This top Verilog module is a 8-bit unidirectional data port. It takes an 8-bit input in, and outputs an 8-bit signal out. The output of the module is assigned to out based on the value of in. If in = a,b,c,d,e,f,g,h, then out = h,g,f,e,d,c,b,a.

User Prompt:
Question: Explain the high-level functionality of the Verilog module.
module top_module (
    input [5:0] y,
    input w,
    output Y1,
    output Y3
);
assign Y1 = y[0]&w;
assign Y3 = (y[1]|y[2]|y[4]|y[5]) & ~w;
endmodule

GPT-3.5 Response:
Answer: This Verilog module is a combinational logic circuit that takes in 6-bit input vector y and a single bit input w. The module then outputs two signals Y1 and Y3. Y1 is the logical AND of the first bit of the input vector y and the single bit input w. Y3 is the logical AND of the logical OR of the second, third, fifth and sixth bits of the input vector y and the logical NOT of the single bit input w. In other words, Y1 is true if both the first bit of the input vector y and the single bit input w are true, and Y3 is true if at least one of the second, third, fifth and sixth bits of the input vector y is true and the single bit input w is false.

Fig. 3 VerilogEval-machine uses gpt-3.5-turbo to generate problem descriptions for 2012_q2b.

2) VerilogEval-human: We engaged in manual review and conversion of problem descriptions from the website into a text-only structure. We dedicated particular attention to addressing ambiguity within the problem descriptions, particularly when precisely determining attributes such as the clock's posedge or negedge triggering, whether reset and enable signals are active high or active low, and whether they operate synchronously or asynchronously. Boolean logic tables and Karnaugh maps were transformed into textual tabular formats. Circuit schematic diagrams were translated into natural language explanations of the connections between logic gates. For sequential waveforms, we meticulously detailed all signal values at each transition edge of the clock, presented in a tabular layout with an added column for time steps. One particular challenge we confronted revolved around the task of converting state transition graphs into a text-based representation. To tackle this, we turned to ChatGPT for guidance, as depicted in Fig. 4. We ultimately adopted the edge-list-based format to depict these state transition graphs. Examples of manually converted descriptions are shown in Fig. 5. Initial explorations were conducted regarding Verilog code completion using the converted formats; notably, ChatGPT exhibited the capability to generate meaningful code from these formats for simple problems. We manually converted 156 problem descriptions in total.

User Prompt:
I want you to write Verilog from a state transition diagram. How can I give you this diagram in a text format?

GPT-3.5 Response:
For a more graphical representation using text, you can specify the states with their output values and the transitions like this:
StateA (0) --0--> StateA
StateA (0) --1--> StateB
StateB (1) --0--> StateC
StateB (1) --1--> StateB
StateC (0) --0--> StateA
StateC (1) --1--> StateB

Fig. 4 ChatGPT guidance on state transition diagrams.

Comparing the descriptions between machine and human, we find that machine descriptions are often more verbose (vectorr in Figs. 2 and 3). Although the model is directed to generate high-level explanations, the produced machine descriptions frequently delve into low-level details. These descriptions tend to mirror the code's implementation line by line, rather than focusing on the overarching functionality of the circuit (2012_q2b in Figs. 3 and 5). Furthermore, although we have taken steps to ensure that all machine descriptions are capable of producing passing solutions through LLMs, we cannot guarantee the absence of ambiguity and errors. Nevertheless, VerilogEval-machine remains a valuable benchmark, particularly for assessing an LLM's competence in comprehending low-level instructions and generating syntactically and functionally accurate Verilog code.

Benchmark            | Language | Train | Test
HumanEval [21]       | Python   | -     | 164
MBPP [22]            | Python   | 374   | 500
APPS [23]            | Python   | 5,000 | 5,000
VerilogEval-machine  | Verilog  | 8,502 | 143
VerilogEval-human    | Verilog  | -     | 156

TABLE I VerilogEval in comparison with popular (Python) coding benchmarks.

TABLE I summarizes our VerilogEval benchmark and compares it with existing popular coding benchmarks in Python. We further detail our method of generating the synthetic supervised fine-tuning dataset in Section III-A.

C. Automated Testing Environment

Unlike software programming languages such as Python, the behavior of Verilog modules is typically evaluated through simulations. To enable automated testing, we compare simulation results between generated code completions and golden reference solutions. We assert output signal correctness at clock (posedge and/or negedge) transition edges for sequential circuits, while for combinational circuits we validate outputs whenever any input signal changes. Our testbench incorporates two categories of input signals for each problem: manually crafted test patterns of significance, and randomly generated test patterns. Randomly generated test patterns may span from a few hundred clock cycles for simple problems to several thousand cycles for more complex ones.

We adapted the sandbox environment for safely running untrusted programs from HumanEval [21]. We built and installed the open-source ICARUS Verilog [24] simulator in a docker container. We note that our evaluation of Verilog syntax is limited by the simulator, which might not include all features of the Verilog HDL IEEE-1364 standard. Simulation and testing are handled under the hood, and results can be produced using just a single command.
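The single-command flow described above can be reproduced with a small wrapper around the Icarus Verilog toolchain. The sketch below is an illustrative approximation rather than the released harness; the file names, timeout value, and the "Mismatches: 0" pass criterion are assumptions made for this example.

# Sketch: compile a generated completion against the reference testbench
# with Icarus Verilog (iverilog) and run it with vvp, flagging any
# mismatch reported by the testbench. File names are illustrative.
import subprocess
import tempfile
from pathlib import Path

def run_simulation(testbench: str, reference: str, completion: str,
                   timeout_s: int = 60) -> bool:
    """Return True if the completed module passes the testbench.

    Assumption of this sketch: the testbench instantiates both the
    reference and the generated module and prints a line containing
    "Mismatches: 0" when all compared outputs agree."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        tmp = Path(tmp_dir)
        (tmp / "tb.v").write_text(testbench)
        (tmp / "ref.v").write_text(reference)
        (tmp / "dut.v").write_text(completion)

        # Compile; a syntax error in the completion fails here.
        compile_cmd = ["iverilog", "-o", str(tmp / "sim.vvp"),
                       str(tmp / "tb.v"), str(tmp / "ref.v"), str(tmp / "dut.v")]
        if subprocess.run(compile_cmd, capture_output=True).returncode != 0:
            return False

        # Simulate; the run is bounded by a timeout inside the sandbox.
        try:
            sim = subprocess.run(["vvp", str(tmp / "sim.vvp")],
                                 capture_output=True, text=True,
                                 timeout=timeout_s)
        except subprocess.TimeoutExpired:
            return False
        return "Mismatches: 0" in sim.stdout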

(a) 2012_q2b:
Consider the state machine shown below:
A (0) --1--> B
A (0) --0--> A
B (0) --1--> C
B (0) --0--> D
C (0) --1--> E
C (0) --0--> D
D (0) --1--> F
D (0) --0--> A
E (1) --1--> E
E (1) --0--> D
F (1) --1--> C
F (1) --0--> D
Assume that a one-hot code is used with the state assignment y[5:0] = 000001(A), 000010(B), 000100(C), 001000(D), 010000(E), 100000(F). Write Verilog for the signal Y1, which is the input of state flip-flop y[1], and for the signal Y3, which is the input of state flip-flop y[3]. Derive the Verilog by inspection assuming a one-hot encoding.

(b) kmap4:
Implement the circuit described by the Karnaugh map below.
          ab
  cd    00  01  11  10
  00   | 0 | 1 | 0 | 1 |
  01   | 1 | 0 | 1 | 0 |
  11   | 0 | 1 | 0 | 1 |
  10   | 1 | 0 | 1 | 0 |

(c) ece241_2014_q4:
Given the finite state machine circuit described below, assume that the D flip-flops are initially reset to zero before the machine begins. Build this circuit in Verilog.
Input x goes to three different two-input gates: a XOR, an AND, and a OR gate. Each of the three gates is connected to the input of a D flip-flop and then the flip-flop outputs all go to a three-input XNOR, whose output is Z. The second input of the XOR is its corresponding flip-flop's output, the second input of the AND is its corresponding flip-flop's complemented output, and finally the second input of the OR is its corresponding flip-flop's complementary output.

Fig. 5 Examples of VerilogEval-human descriptions. We show original website descriptions alongside the manually converted text format.

D. Evaluation Metric

Early works on code evaluation report match-based metrics such as BLEU score [25]. However, recent works [21], [26] have argued that such metrics do not correlate well with functional correctness. In Fig. 6 we show that Verilog coding exhibits similar issues, where the distributions of correct versus wrong solutions are not clearly separable based on BLEU score probability densities.

Fig. 6 BLEU score probability densities for correct and wrong solutions from codegen-16B-verilog [12] for 2 tasks from VerilogEval-human.

We follow recent work in directly measuring code functional correctness through the pass@k metric [21], [22], [27], where a problem is considered solved if any of the k samples passes the unit tests. We also suggest using the unbiased estimator from [21]:

pass@k := \mathbb{E}_{\text{Problems}}\left[ 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}} \right],    (1)

where we generate n >= k samples per task of which c <= n samples pass testing. In Fig. 7 we show that the number of samples n needs to be sufficiently large to produce low-variance estimates for pass@k.
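Equation (1) is commonly computed with the numerically stable product form introduced alongside HumanEval [21]; the short function below is a standard implementation of that estimator (the function name and the NumPy usage are choices made for this sketch, not taken from the paper).

import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator of Eq. (1): the probability that at least
    one of k samples drawn from n generated samples (of which c are
    correct) passes the unit tests."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a passing sample
    # 1 - C(n-c, k)/C(n, k), computed as a stable running product
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Example with the paper's setting of n = 20 samples per problem:
# if 4 of 20 completions pass, pass@1 = 0.20 and pass@10 is roughly 0.96.
print(pass_at_k(20, 4, 1), pass_at_k(20, 4, 10))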

Fig. 7 Variance in estimating pass@k with n. Samples from codegen-16B-verilog [12] for VerilogEval-human.

III. SUPERVISED FINE-TUNING

This section provides our findings concerning the supervised fine-tuning (SFT) of Large Language Models (LLMs) for Verilog coding. We describe our approach to generating synthetic SFT data by utilizing LLMs to create problem descriptions in Section III-A. Subsequently, Section III-B comprises a comprehensive suite of supervised fine-tuning experiments, showcasing its potential for improving model performance.

A. Synthetic SFT Data Generation

In this work, we investigate the creation of synthetic SFT data through a bootstrapping process involving code descriptions generated by LLMs. To be precise, we undertake the task of identifying and refining self-contained Verilog modules sourced from Github data [12]. Subsequently, we employ the prompt template depicted in Fig. 3 to generate corresponding descriptive texts for each of these Verilog modules, effectively creating machine description and code pairs. It is worth noting that our approach to synthetic data generation is straightforward in its implementation, and we acknowledge the potential for more advanced techniques as a promising future direction.

We leverage Pyverilog [28] to extract the abstract syntax tree from Verilog code and employ the following filtering process to identify self-contained Verilog modules from open-sourced Github Verilog code [12]:

• We verify that the filtered code contains the module and endmodule keywords, positioned at the beginning and end of the code, respectively.
• We remove Verilog modules with more than 200 lines of code or exceeding 1024 tokens.
• We ascertain that the code includes at least one of the essential keywords: always, assign, always_ff, always_comb, always_latch.
• We ensure extracted modules are self-contained without any module instantiation.

We further perform approximate deduplication based on the MinHash algorithm [29] using a Jaccard similarity threshold of 0.8 as in [30]. We used gpt-3.5-turbo to generate code descriptions based on the prompt template in Fig. 3, using the VerilogEval-human descriptions of shift18, rule110, lemmings1, and fsm3onehot as few-shot examples. We selected these examples with the aim of encompassing a wide range of design instances and of natural language description styles, including those presented in tabular formats. In total we generated 8,502 problem description and code pairs.

B. Results on Supervised Fine-tuning

We conduct extensive experiments on fine-tuning with the generated synthetic SFT data. Including both descriptions and code, our SFT data is 11MB in file size, compared with ~700MB of Github Verilog data used in [12]. For the fine-tuning process, we employed the Adam optimizer with hyperparameters beta1 = 0.9, beta2 = 0.999, and epsilon = 1e-8. We set the learning rate to lr = 2e-5 and the effective batch size to 1M tokens, and opted not to apply weight decay. For all our experiments, we sample n = 20 code completions for measuring pass@k for k = {1, 5, 10} using Equation (1). We use nucleus sampling [31] with top_p = 0.95, temperature temp = 0.8, and a context length of 2048. We used a single NVIDIA DGX node with 8 A100s and 2TB RAM.

Our experimentation primarily focuses on the CodeGen model series [9] and its Verilog-trained counterparts in [12]. These experiments encompass model sizes of 350M, 2B, 6B, and 16B. We use -sft to indicate models fine-tuned with our synthetic SFT data. We clarify our notation for base models as follows:

• codegen-nl [9]: Natural language model. Trained on ThePile [32], an 825.18GB English text corpus.
• codegen-multi [9]: Code model. Initialized from codegen-nl and further trained on the BigQuery multi-lingual code dataset consisting of C, C++, Go, Java, JavaScript, and Python.
• codegen-verilog [12]: Verilog code model. Initialized from codegen-multi and further trained on ~300MB of Github Verilog and 400MB of textbook data.

Furthermore, we conducted comparisons with the gpt-3.5-turbo and gpt-4 models through the OpenAI APIs [33]. Our analysis specifically involved the default 4k context length models from the 0613 checkpoints.

1) Training Epochs: Fig. 8 depicts the pass rate on VerilogEval with different SFT training epochs. Dashed lines indicate gpt-3.5 results. Results show that machine descriptions correlate well with human ones, demonstrating that synthetically generated benchmarks could be a good indicator of downstream task performance.

In most cases, we observe that the performance metric pass@1 continues to exhibit improvement as the supervised fine-tuning (SFT) training epochs progress, whereas the metrics pass@5 and pass@10 begin to deteriorate. This trend suggests that with an increase in training epochs, the model tends to overfit to the SFT data, limiting its ability to generate diverse solutions for tackling complex challenges. Interestingly, this overfitting also leads to an increase in the model's confidence and success rate when dealing with simpler problems, highlighting a discernible trade-off between the pass@1 and pass@10 metrics. Consequently, we encourage future research to report both of these metrics, particularly for models post-alignment, to provide a more comprehensive assessment of their performance.

Throughout the remainder of this study, we conduct supervised fine-tuning (SFT) using 10 epochs for multi and 5 epochs for verilog models.
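Returning to the data pipeline of Section III-A, the filtering and deduplication steps can be approximated with a short script. The paper extracts a proper abstract syntax tree with Pyverilog; the sketch below substitutes simple keyword and regex heuristics for the AST checks and uses the datasketch library for MinHash, so both the heuristics and the library choice are assumptions of this illustration rather than the authors' implementation.

# Approximate sketch of the self-contained-module filter and MinHash
# deduplication from Section III-A. The paper's pipeline uses Pyverilog's
# AST; here regex/keyword heuristics stand in for those structural checks.
import re
from typing import Dict, List
from datasketch import MinHash, MinHashLSH

BEHAVIORAL_KEYWORDS = ("always", "assign", "always_ff",
                       "always_comb", "always_latch")
# Keywords that legitimately start "identifier identifier (" style lines.
DECLARATION_KEYWORDS = ("module", "input", "output", "inout", "wire", "reg",
                        "logic", "integer", "parameter", "localparam",
                        "function", "task", "genvar")

def looks_like_instantiation(line: str) -> bool:
    """Crude check for 'some_module instance_name (' style lines."""
    m = re.match(r"\s*([A-Za-z_]\w*)\s+[A-Za-z_]\w*\s*\(", line)
    return bool(m) and m.group(1) not in DECLARATION_KEYWORDS

def keep_module(code: str, max_lines: int = 200, max_tokens: int = 1024) -> bool:
    """Heuristic stand-in for the four filtering rules of Section III-A."""
    text = code.strip()
    tokens = text.split()
    if not (text.startswith("module") and text.endswith("endmodule")):
        return False
    if len(text.splitlines()) > max_lines or len(tokens) > max_tokens:
        return False
    if not any(kw in tokens for kw in BEHAVIORAL_KEYWORDS):
        return False
    # Self-contained: reject anything that appears to instantiate a module.
    if any(looks_like_instantiation(line) for line in text.splitlines()[1:]):
        return False
    return True

def deduplicate(modules: Dict[str, str], threshold: float = 0.8) -> List[str]:
    """Approximate MinHash deduplication at Jaccard similarity 0.8."""
    lsh = MinHashLSH(threshold=threshold, num_perm=128)
    kept = []
    for name, code in modules.items():
        mh = MinHash(num_perm=128)
        for tok in set(code.split()):
            mh.update(tok.encode("utf-8"))
        if not lsh.query(mh):  # no near-duplicate kept so far
            lsh.insert(name, mh)
            kept.append(name)
    return kept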

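For reference, the optimizer and sampling settings reported in Section III-B map directly onto standard training and generation configurations. The sketch below expresses them with the Hugging Face transformers API; the paper does not state which training framework or batch decomposition was used, so the Trainer-style arguments, the checkpoint name, and the per-device/accumulation split are assumptions of this example and only the numeric values come from the paper.

from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments

# Optimizer settings reported in Section III-B; with 8 GPUs, 2048-token
# sequences, a per-device batch of 4, and 16 accumulation steps, one
# optimizer step covers roughly 1M tokens.
training_args = TrainingArguments(
    output_dir="codegen-verilog-sft",
    learning_rate=2e-5,
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    weight_decay=0.0,
    num_train_epochs=5,              # 5 for verilog base models, 10 for multi
    per_device_train_batch_size=4,   # assumed split of the 1M-token batch
    gradient_accumulation_steps=16,
)
# training_args would be handed to a Trainer together with the synthetic
# problem-description/code pairs produced in Section III-A.

# Sampling n = 20 completions per problem with nucleus sampling.
checkpoint = "Salesforce/codegen-16B-multi"   # assumed base checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

prompt = ("Implement the Verilog module based on the following description.\n"
          "Given an 8-bit input vector [7:0], reverse its bit ordering.\n"
          "module top_module (\n  input [7:0] in,\n  output [7:0] out\n);\n")
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    do_sample=True,
    top_p=0.95,
    temperature=0.8,
    max_length=2048,
    num_return_sequences=20,
    pad_token_id=tokenizer.eos_token_id,
)
completions = tokenizer.batch_decode(outputs, skip_special_tokens=True)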
Fig. 8 SFT training epochs and pass rate on VerilogEval. Dashed lines are gpt-3.5 results. (Panels: (a) codegen-16B-multi-sft, (b) codegen-16B-verilog-sft.)

2) Model Size and Base Model: Fig. 9 illustrates the pass rates for the VerilogEval task using various model sizes and base models. The base model denotes the initial model checkpoint prior to SFT. It is worth noting that we have omitted the results for models with a size of 350M, either due to their unavailability or because their pass rates are insufficient to demonstrate statistical significance. Our results suggest that more capable and larger models generally result in better Verilog coding capabilities.

Fig. 9 VerilogEval results for different model sizes. Solid lines are sft models, dotted lines are corresponding base models without SFT, dashed lines are gpt-3.5 results. (Panels: (a) codegen-multi-sft, (b) codegen-verilog-sft.)

In most instances, SFT using synthetically generated data yields notable enhancements in downstream model performance. These improvements are particularly pronounced in the case of multi models, where the original model was not explicitly trained on a substantial corpus of Verilog code. In the case of verilog models, VerilogEval-machine exhibited significant performance gains, whereas VerilogEval-human displayed comparatively less improvement and, at times, even slight deteriorations. Our SFT data is sourced from the GitHub Verilog corpus, and thus does not introduce additional Verilog code that the verilog models did not already encounter during their training. However, by providing problem-code pairs, this data facilitates better alignment of the model, resulting in improved outcomes on VerilogEval-machine. Despite incorporating few-shot prompting during the generation of SFT data (as discussed in Section III-A), the generated descriptions tend to be primarily low-level, lacking the textual diversity found in human examples, such as state transition graphs, waveforms, Karnaugh maps, and similar elements. This "misalignment" between the SFT data and VerilogEval-human might have caused verilog-sft models to degrade slightly in performance. We envision that increasing SFT (and Verilog pretraining) data diversity and quality would further lead to increased performance.

Model        | VerilogEval-machine        | VerilogEval-human
             | pass@1  pass@5  pass@10    | pass@1  pass@5  pass@10
gpt-3.5      | 46.7    69.1    74.1       | 26.7    45.8    51.7
gpt-4        | 47.9    67.8    72.9       | 27.0    45.8    52.0
verilog-sft  | 46.2    67.3    73.7       | 28.8    45.9    52.3

TABLE II Results on gpt models, compared with codegen-16B-verilog-sft.

In TABLE II we present the results obtained from both the gpt-3.5 and gpt-4 models for the VerilogEval task. Additionally, we demonstrate that our top-performing model, codegen-16B-verilog-sft, exhibits performance that is on par with the gpt models.

Model                  | VerilogEval-machine
                       | pass@1  pass@5  pass@10
codegen-16B-nl-sft     | 33.9    51.9    58.1
codegen-16B-multi-sft  | 37.1    55.0    61.1

TABLE III Comparing nl and multi as SFT base models.

In TABLE III we present a comparison of results between sft models utilizing two distinct base models: codegen-nl and codegen-multi. The tokenizer of the codegen-nl model is inefficient in handling whitespace, consequently preventing some of the VerilogEval-human problems from fitting within the limited context window of 2048 tokens; thus we only display results for VerilogEval-machine. Despite the fact that multi models undergo pretraining on an extensive corpus of multi-lingual code data, they exhibit only marginal enhancements of approximately 3% when applied to the Verilog coding task.

This observation potentially suggests that there is limited positive knowledge transfer between software programming languages like C++ and hardware description languages such as Verilog. It highlights the significance of pretraining on substantial Verilog corpora, as this can significantly enhance model performance in Verilog-related tasks.

3) SFT Data Quality: We conducted a comparative experiment aimed at assessing the significance of data quality in SFT. In this experiment, we shuffled problem descriptions with incongruous Verilog code solutions, resulting in the creation of erroneous problem-code pairs denoted sft-error. The outcomes, as presented in TABLE IV, compare the performance obtained by fine-tuning the codegen-2B-verilog models and evaluating on the VerilogEval-machine task. The results clearly demonstrate that the inclusion of incorrect problem-code pairs detrimentally impacts model performance, underscoring the critical importance of maintaining high-quality SFT data.

Model                         | VerilogEval-machine
                              | pass@1  pass@5  pass@10
codegen-2B-verilog            | 20.1    46.0    55.9
codegen-2B-verilog-sft        | 35.9    59.0    65.7
codegen-2B-verilog-sft-error  | 21.4    38.8    46.1

TABLE IV Comparative experiment on SFT data quality. Incorrect low-quality SFT data degrades model performance.

IV. LIMITATIONS AND FUTURE DIRECTIONS

In VerilogEval, our primary focus centers on harnessing Large Language Models (LLMs) to generate self-contained Verilog modules directly from natural language text descriptions. While we incorporate a wide array of hardware design topics through human-generated descriptions, it is important to note that our current evaluations are confined to boilerplate code generation for relatively small-scale designs. We emphasize the significance of module instantiation as a crucial capability in Verilog, as it plays a pivotal role in constructing complex system-level designs, albeit one currently absent from our benchmark. Recent advancements in LLM-based coding benchmarking, as seen in [34], are starting to explore pragmatic code generation beyond standalone functions. It is worth mentioning that our testing environment solely assesses functional correctness and does not ensure that the generated Verilog code adheres to synthesizable formatting standards. We also do not evaluate the quality of downstream circuit implementations, a gap that is addressed by the work presented in [13].

Additionally, it is crucial to recognize that boilerplate Hardware Description Language (HDL) code generation, as currently addressed in VerilogEval and similar endeavors, inherently operates within an exceedingly limited scope within the broader landscape of hardware design. Hardware design, in its entirety, necessitates a multidisciplinary approach that draws on the expertise of domain professionals ranging from transistor devices and circuit design to hardware system architecture. This holistic understanding is indispensable, as it allows design teams to navigate the intricacies of hardware design effectively. Furthermore, it is important to highlight that a significant portion of the hardware design process revolves around optimizing the Power, Performance, and Area (PPA) metrics. These three factors, power consumption, computational performance, and physical chip area, are paramount considerations in modern hardware design. Achieving the right balance among them is a formidable challenge that requires meticulous planning, advanced simulation, and iterative refinement. Equally critical is the extensive effort invested in design verification, aimed at ensuring the reliability and yield of the hardware. Verifying that a design functions as intended under diverse conditions and corner cases is vital to mitigate the risk of costly errors and to guarantee that the final product meets its specifications. In essence, the successful realization of hardware designs hinges on the convergence of domain expertise, PPA optimization, and robust verification practices.

Nonetheless, Large Language Models (LLMs) present an exciting opportunity for future research to revolutionize the hardware design process. This transformative potential lies in their ability to collaborate with domain experts in formulating novel problems and devising innovative solutions. By leveraging the vast knowledge and natural language understanding of LLMs, domain experts can work in tandem with these models to explore uncharted territories in hardware design, potentially leading to breakthroughs that enhance the efficiency, reliability, and agility of the design process. The fusion of human expertise with machine intelligence using LLMs in this collaborative endeavor promises an exhilarating avenue for future research, one that holds the potential to reshape the very fabric of the hardware design research landscape.

V. CONCLUSION

The growing prominence of Large Language Models (LLMs) has ushered in a new era of their application across various domains. In this paper we introduce a specialized benchmarking framework meticulously designed to assess LLM performance within the realm of Verilog code generation for hardware design. The cornerstone of this contribution lies in the creation of a robust evaluation dataset comprising 156 distinct problems sourced from HDLBits. Furthermore, we have demonstrated that the Verilog code generation capabilities of pretrained language models can be enhanced through supervised fine-tuning, facilitated by the generation of synthetic problem-code pairs using LLMs. These findings not only advance the state of the art in Verilog code generation but also underscore the vast potential of LLMs in shaping the future of hardware design and verification.

ACKNOWLEDGMENT

The authors would like to thank Henry Wong ([email protected]), the creator of HDLBits, for his invaluable assistance in providing reference solutions and testbenches for the problems used in this paper.

REFERENCES

[1] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell et al., "Language models are few-shot learners," Advances in Neural Information Processing Systems, vol. 33, pp. 1877-1901, 2020.
[2] OpenAI, "GPT-4 technical report," 2023.
[3] S. Bubeck, V. Chandrasekaran, R. Eldan, J. Gehrke, E. Horvitz, E. Kamar, P. Lee, Y. T. Lee, Y. Li, S. Lundberg et al., "Sparks of artificial general intelligence: Early experiments with GPT-4," arXiv preprint arXiv:2303.12712, 2023.
[4] S. Wu, O. Irsoy, S. Lu, V. Dabravolski, M. Dredze, S. Gehrmann, P. Kambadur, D. Rosenberg, and G. Mann, "BloombergGPT: A large language model for finance," arXiv preprint arXiv:2303.17564, 2023.
[5] H.-C. Shin, Y. Zhang, E. Bakhturina, R. Puri, M. Patwary, M. Shoeybi, and R. Mani, "BioMegatron: Larger biomedical domain language model," arXiv preprint arXiv:2010.06060, 2020.
[6] E. Bolton, D. Hall, M. Yasunaga, T. Lee, C. Manning, and P. Liang, "BioMedLM." [Online]. Available: https://crfm.stanford.edu/2022/12/15/biomedlm.html
[7] R. Taylor, M. Kardas, G. Cucurull, T. Scialom, A. Hartshorn, E. Saravia, A. Poulton, V. Kerkez, and R. Stojnic, "Galactica: A large language model for science," arXiv preprint arXiv:2211.09085, 2022.
[8] A. Mastropaolo, L. Pascarella, E. Guglielmi, M. Ciniselli, S. Scalabrino, R. Oliveto, and G. Bavota, "On the robustness of code generation techniques: An empirical study on GitHub Copilot," arXiv preprint arXiv:2302.00438, 2023.
[9] E. Nijkamp, B. Pang, H. Hayashi, L. Tu, H. Wang, Y. Zhou, S. Savarese, and C. Xiong, "CodeGen: An open large language model for code with multi-turn program synthesis," ICLR, 2023.
[10] J. Blocklove, S. Garg, R. Karri, and H. Pearce, "Chip-Chat: Challenges and opportunities in conversational hardware design," arXiv preprint arXiv:2305.13243, 2023.
[11] K. Chang, Y. Wang, H. Ren, M. Wang, S. Liang, Y. Han, H. Li, and X. Li, "ChipGPT: How far are we from natural language hardware design," arXiv preprint arXiv:2305.14019, 2023.
[12] S. Thakur, B. Ahmad, Z. Fan, H. Pearce, B. Tan, R. Karri, B. Dolan-Gavitt, and S. Garg, "Benchmarking large language models for automated Verilog RTL code generation," in 2023 Design, Automation & Test in Europe Conference & Exhibition (DATE). IEEE, 2023, pp. 1-6.
[13] Y. Lu, S. Liu, Q. Zhang, and Z. Xie, "RTLLM: An open-source benchmark for design RTL generation with large language model," arXiv preprint arXiv:2308.05345, 2023.
[14] J. Wei, M. Bosma, V. Y. Zhao, K. Guu, A. W. Yu, B. Lester, N. Du, A. M. Dai, and Q. V. Le, "Finetuned language models are zero-shot learners," arXiv preprint arXiv:2109.01652, 2021.
[15] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray et al., "Training language models to follow instructions with human feedback," Advances in Neural Information Processing Systems, vol. 35, pp. 27730-27744, 2022.
[16] Y. Bai, A. Jones, K. Ndousse, A. Askell, A. Chen, N. DasSarma, D. Drain, S. Fort, D. Ganguli, T. Henighan et al., "Training a helpful and harmless assistant with reinforcement learning from human feedback," arXiv preprint arXiv:2204.05862, 2022.
[17] Y. Wang, Y. Kordi, S. Mishra, A. Liu, N. A. Smith, D. Khashabi, and H. Hajishirzi, "Self-Instruct: Aligning language model with self generated instructions," arXiv preprint arXiv:2212.10560, 2022.
[18] C. Xu, Q. Sun, K. Zheng, X. Geng, P. Zhao, J. Feng, C. Tao, and D. Jiang, "WizardLM: Empowering large language models to follow complex instructions," arXiv preprint arXiv:2304.12244, 2023.
[19] Z. Luo, C. Xu, P. Zhao, Q. Sun, X. Geng, W. Hu, C. Tao, J. Ma, Q. Lin, and D. Jiang, "WizardCoder: Empowering code large language models with Evol-Instruct," arXiv preprint arXiv:2306.08568, 2023.
[20] S. Gunasekar, Y. Zhang, J. Aneja, C. C. T. Mendes, A. Del Giorno, S. Gopi, M. Javaheripi, P. Kauffmann, G. de Rosa, O. Saarikivi et al., "Textbooks are all you need," arXiv preprint arXiv:2306.11644, 2023.
[21] M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. de Oliveira Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman et al., "Evaluating large language models trained on code," arXiv preprint arXiv:2107.03374, 2021.
[22] J. Austin, A. Odena, M. Nye, M. Bosma, H. Michalewski, D. Dohan, E. Jiang, C. Cai, M. Terry, Q. Le et al., "Program synthesis with large language models," arXiv preprint arXiv:2108.07732, 2021.
[23] D. Hendrycks, S. Basart, S. Kadavath, M. Mazeika, A. Arora, E. Guo, C. Burns, S. Puranik, H. He, D. Song et al., "Measuring coding challenge competence with APPS," arXiv preprint arXiv:2105.09938, 2021.
[24] S. Williams. (2023) The ICARUS Verilog compilation system. [Online]. Available: https://github.com/steveicarus/iverilog
[25] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, "BLEU: a method for automatic evaluation of machine translation," in Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 2002, pp. 311-318.
[26] S. Lu, D. Guo, S. Ren, J. Huang, A. Svyatkovskiy, A. Blanco, C. B. Clement, D. Drain, D. Jiang, D. Tang, G. Li, L. Zhou, L. Shou, L. Zhou, M. Tufano, M. Gong, M. Zhou, N. Duan, N. Sundaresan, S. K. Deng, S. Fu, and S. Liu, "CodeXGLUE: A machine learning benchmark dataset for code understanding and generation," CoRR, vol. abs/2102.04664, 2021.
[27] S. Kulal, P. Pasupat, K. Chandra, M. Lee, O. Padon, A. Aiken, and P. S. Liang, "SPoC: Search-based pseudocode to code," Advances in Neural Information Processing Systems, vol. 32, 2019.
[28] S. Takamaeda-Yamazaki, "Pyverilog: A Python-based hardware design processing toolkit for Verilog HDL," in Applied Reconfigurable Computing: 11th International Symposium, ARC 2015, Bochum, Germany, April 13-17, 2015, Proceedings 11. Springer, 2015, pp. 451-460.
[29] A. Z. Broder, "On the resemblance and containment of documents," in Proceedings. Compression and Complexity of SEQUENCES 1997 (Cat. No. 97TB100171). IEEE, 1997, pp. 21-29.
[30] K. Lee, D. Ippolito, A. Nystrom, C. Zhang, D. Eck, C. Callison-Burch, and N. Carlini, "Deduplicating training data makes language models better," arXiv preprint arXiv:2107.06499, 2021.
[31] A. Holtzman, J. Buys, L. Du, M. Forbes, and Y. Choi, "The curious case of neural text degeneration," arXiv preprint arXiv:1904.09751, 2019.
[32] L. Gao, S. Biderman, S. Black, L. Golding, T. Hoppe, C. Foster, J. Phang, H. He, A. Thite, N. Nabeshima et al., "The Pile: An 800GB dataset of diverse text for language modeling," arXiv preprint arXiv:2101.00027, 2020.
[33] OpenAI. (2023) OpenAI models API. [Online]. Available: https://platform.openai.com/docs/models
[34] H. Yu, B. Shen, D. Ran, J. Zhang, Q. Zhang, Y. Ma, G. Liang, Y. Li, T. Xie, and Q. Wang, "CoderEval: A benchmark of pragmatic code generation with generative pre-trained models," 2023.

