
IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY, VOL. 19, 2024

On Hardware Security Bug Code Fixes by Prompting Large Language Models

Baleegh Ahmad, Graduate Student Member, IEEE, Shailja Thakur, Member, IEEE, Benjamin Tan, Member, IEEE, Ramesh Karri, Fellow, IEEE, and Hammond Pearce, Member, IEEE

Abstract— Novel AI-based code-writing Large Language Models (LLMs) such as OpenAI's Codex have demonstrated capabilities in many coding-adjacent domains. In this work, we consider how LLMs may be leveraged to automatically repair identified security-relevant bugs present in hardware designs by generating replacement code. We focus on bug repair in code written in Verilog. For this study, we curate a corpus of domain-representative hardware security bugs. We then design and implement a framework to quantitatively evaluate the performance of any LLM tasked with fixing the specified bugs. The framework supports design space exploration of prompts (i.e., prompt engineering) and identifying the best parameters for the LLM. We show that an ensemble of LLMs can repair all fifteen of our benchmarks. This ensemble outperforms a state-of-the-art automated hardware bug repair tool on its own suite of bugs. These results show that LLMs have the ability to repair hardware security bugs and the framework is an important step towards the ultimate goal of an automated end-to-end bug repair tool.

Index Terms— Hardware security, large language models, bug repair.

Manuscript received 3 July 2023; revised 21 December 2023 and 5 February 2024; accepted 6 February 2024. Date of publication 7 March 2024; date of current version 2 May 2024. This work was supported in part by Intel Corporation and in part by the Natural Sciences and Engineering Research Council of Canada (NSERC) under Grant RGPIN-2022-03027. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Stjepan Picek. (Corresponding author: Baleegh Ahmad.)
Baleegh Ahmad, Shailja Thakur, and Ramesh Karri are with the Department of Electrical and Computer Engineering, New York University Tandon School of Engineering, Brooklyn, NY 11201 USA (e-mail: [email protected]).
Benjamin Tan is with the Department of Electrical and Software Engineering, University of Calgary, Calgary, AB T2N 1N4, Canada.
Hammond Pearce is with the Department of Electrical and Computer Engineering, University of New South Wales, Sydney, NSW 2052, Australia.
Digital Object Identifier 10.1109/TIFS.2024.3374558

I. INTRODUCTION

"BUGS" are inevitable when writing large quantities of code. Fixing them is laborious: automated tools are thus designed and employed to both identify bugs and then patch and repair them [1]. While considerable effort has explored software repair, for Hardware Description Languages (HDLs), the state of the art is less mature.

In this study, we focus on repairing security-relevant hardware bugs after a bug has been identified by some means. While there are several approaches and tools for detecting potential design bugs [2], [3], [4], [5], [6], [7], few techniques address the automated repair of hardware bugs. The recently proposed CirFix [8] develops automatic repair of functional hardware bugs and, to the best of our knowledge, is the only relevant effort in this context thus far. Further efforts need to be made to support the automated repair of functional and security bugs in hardware. Unlike software bugs, security bugs in hardware are more problematic because they cannot be patched once the chip is fabricated; this is especially concerning as hardware is typically the root of trust for a system.

Large Language Models (LLMs) are neural networks trained over millions of lines of text and code [9]. LLMs that are fine-tuned over open-source code repositories can generate code, where a user "prompts" the LLM with some text (e.g., code and comments) to guide the code generation. In contrast to previous code repair techniques that involve mutation, repeated checks against an "oracle," or source code templates, we propose that an LLM trained on code and natural language could potentially generate fixes, given an appropriate prompt that could draw from a designer's expertise. As LLMs are exposed to a wide variety of code examples during training, they should be able to assist designers in fixing bugs in different types of hardware designs and styles, with natural language guidance. One distinction is that LLMs do not need an oracle to generate a repair, although one could be useful for evaluating whether the generated repair was successful. LLMs may generate a few potential fixes, leaving the designers to choose which repair is optimal. In prior work [10], [11], LLMs have been used to generate functional Verilog code. Machine learning-based techniques such as Neural Machine Translation [12] and pre-trained transformers [13] are explored in the software domain for bug fixes. Xia et al. [14] use LLMs to successfully generate repairs for software bugs in 3 different languages. Pearce et al. [15] use this approach to repair two scenarios of security weaknesses in Verilog code.

Thus, in this work, we investigate the use of LLMs to generate repairs for hardware security bugs. We study the performance of OpenAI's GPT (generative pre-trained transformer), Codex, and CodeGen LLMs on instances of hardware security bugs. We offer insights into how best to use LLMs for successful repairs. A Register Transfer Level (RTL) designer can spot a security weakness and the LLM can help to find a fix. Our contributions are as follows:

• An automated framework for using LLMs to generate repairs and evaluate them. We make the framework and artifacts produced in this study available as open-source [16] (further details also included in the Appendix).


• An exploration of bugs and LLM parameters (model, temperature, prompt) to see how to use LLMs in repair. They are posed as research questions answered in Section V.
• A demonstration of how the repair mechanism could be coupled with bug detectors to form an end-to-end solution to detect and repair bugs. This is presented in Section VII.

II. BACKGROUND AND RELATED WORK

To provide context for our work, we discuss some overarching concepts in Section II-A. We present some differences between our work and other efforts in Section II-B.

A. Background

The code repair problem is well-explored in the software domain. Software code repair techniques continue to evolve (interested readers can see Monperrus' living review [17]). Generally, techniques try to fix errors by using program mutations and repair templates paired with tests to validate any changes [18], [19]. Feedback loops are constructed with a reference implementation to guide the repair process [20]. Other domain-specific tools may also be built to deal with areas like build scripts, web, and software models.

Security bugs are defects that can lead to vulnerable systems. While functional bugs can be detected using classical testing, security bugs are more difficult to detect, and proving their presence or absence is challenging. This has led to more "creative" kinds of bug repair, including AI-based machine-learning techniques such as neural transfer learning [21] and example-based approaches [22]. ML-based approaches, including neural networks, allow a greater ability to suggest repairs for "unseen" code. Example-based approaches start off with a dataset comprising pairs of bugs and their repairs. Then, matching algorithms are applied to spot the best repair candidate from the dataset. Efforts in repair are also explored in other domains like recompilable decompiled code [23]. These approaches give credence to the ability of neural networks to learn from a larger set of correct code and inform repair on an instance of incorrect code.

We focus on hardware bugs originating at the Register-Transfer Level (RTL) stage. RTL designs, typically coded in HDLs such as Verilog, are high-level behavioral descriptions of hardware circuits specifying how data is transformed, transferred, and stored. RTL logic features two types of elements, sequential and combinational. Sequential elements (e.g., registers, counters, RAMs) tend to synchronize the circuit according to clock edges and retain values using memory components. Combinational logic (e.g., simple combinations of gates) changes its outputs near-instantaneously according to the inputs. While software code describes programs that will be executed from beginning to end, RTL specified in HDL describes components that run independently in parallel. Like software, hardware designs have security bugs. By definition, RTL is insecure if the security objectives of the circuit are unmet. These may include confidentiality and integrity requirements [24]. Confidentiality is violated if data that should not be seen/read under certain conditions is exposed. For example, improper memory protection allows encryption keys to be read by user code. Integrity is violated if data that should not be modifiable under certain conditions is modifiable. For example, user code can write into registers that specify the access control policy. Secure computation is also a concern, and the synthesis and optimization of secure circuits starts with the description of designs with HDLs [25].

Linters [26], [27] and formal verification tools [3], [4] cover a large proportion of functional bugs. Although formal verification tools like Synopsys FSV can be used for security verification in the design process, they can sometimes have limited success [28]. With the ever-growing complexity of modern processors, software-exploitable hardware bugs are becoming pernicious [29], [30]. This has resulted in the exploration of many detection techniques such as fuzzing [5], information flow tracking [6], unique program execution checking [7] and static analysis [2].

Security-related issues that arise because of bugs in hardware are taxonomized in the form of Common Weakness Enumerations (CWEs). MITRE [31] is a not-for-profit that works with academia and industry to develop a list of CWEs that represent categories of vulnerabilities in hardware and software. A weakness is an element in a digital product's software, firmware, hardware, or service that can be exploited for malicious purposes. The CWE list provides a general taxonomy and categorization of these elements that allow a common language to be used for discussion. It helps developers and researchers search for the existence of these weaknesses in their designs and compare various tools they use to detect vulnerabilities in their designs and products.

We select a subset of the hardware CWEs based on their clarity of explanation and relevance to RTL. A large subset of CWEs are not related to the RTL as they cover post-silicon issues or outdated technologies or firmware. For some of the remaining CWEs, their descriptions can be vague and imprecise, making it difficult to reason about their presence with a great degree of confidence. In this work, we focus on the following CWEs:

1234: Hardware Internal or Debug Modes Allow Override of Locks. System configuration controls, e.g., memory protection, are set after a power reset and then locked to prevent modification. This is done using a lock-bit signal. If the system allows debugging operations and the lock-bit can be overridden in a debug mode, the system configuration controls are not properly protected.

1271: Uninitialized Value on Reset for Registers Holding Security Settings. Security-critical information stored in registers should have a known value when being brought out of reset. If that is not the case, these registers may have unknown values that put the system in a vulnerable state.

1280: Access Control Check Implemented After Asset is Accessed. Access control checks are required in hardware before security-sensitive assets like keys are accessed. If this check is implemented after the access, it is useless.

1276: Hardware Child Block Incorrectly Connected to Parent System. If an input is incorrectly connected, it affects security attributes like resets while maintaining correct function, and the integrity of the data of the child block can be violated.


1245: Improper Finite State Machines (FSMs) in Hardware Logic. FSMs are used in hardware to carry out different functions in different states. When FSMs are used in modules that control the level of security a system is in, it is important that the FSM does not have any undefined states. These states may allow an adversary to carry out functionality that requires higher privileges. An improper FSM can present itself as unreachable states, FSM deadlock, or missing states.

1298: Hardware Logic Contains Race Conditions. Race conditions may result in a timing error or glitch that causes the output to enter an unknown or unwanted state before settling to its desired value. If this happens in access control logic or sensitive security flows, an attacker may use it to bypass protections.

1231: Improper Prevention of Lock Bit Modification. A trusted lock bit may be used to restrict access to registers, address regions, device configuration controls and other resources after their values are assigned. If the lock bit can be modified after assignments to these resources, attackers may be able to access and modify the assets the lock bit is trying to protect.

1311: Improper Translation of Security Attributes by Fabric Bridge. A bridge allows IP blocks supporting different fabric protocols to be integrated into the system. If the translation of security attributes in this bridge is incorrect, the identity of an IP could be ascribed as trusted as opposed to untrusted. This exposes the system to control bypass, privilege escalation or denial of service.

1254: Incorrect Comparison Logic Granularity. Comparison logic is used for purposes like password checks. If the comparison is done over a series of steps where the comparison failure at one of the steps breaks the operation, it may be vulnerable to a timing attack.
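To make the CWE 1254 pattern concrete, the sketch below is a minimal, hedged illustration rather than one of the paper's benchmark designs; the module and signal names are invented. It performs the comparison over the full word in a single step, so the check takes the same time no matter how many bytes of the guess are correct. The vulnerable variant described by the CWE would instead compare one byte per cycle and abort at the first mismatch, making the response latency depend on the secret.

module password_check (
  input  wire        clk,
  input  wire        rst_n,
  input  wire        start,
  input  wire [31:0] guess,
  input  wire [31:0] secret,
  output reg         pass_ok,
  output reg         done
);
  // Full-width, single-step comparison: latency does not depend on how
  // many bytes of the guess are correct, avoiding the timing side channel
  // described for CWE 1254.
  always @(posedge clk or negedge rst_n) begin
    if (!rst_n) begin
      pass_ok <= 1'b0;
      done    <= 1'b0;
    end else begin
      done    <= start;
      pass_ok <= start ? (guess == secret) : pass_ok;
    end
  end
endmodule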
1224: Improper Restriction of Write-Once Bit Fields. Hardware design control registers have to be initialized at reset to defined values. To prevent modification by software, they can be written into only once, after which they become read-only. Failure to implement write-once restrictions allows them to be modifiable.

B. Related Work

CirFix [8] attempts to localize bugs in RTL designs and then repair them. CirFix uses an iterative stochastic search with an instrumented testbench that captures the behavior of the circuit. This testbench is an oracle that requires the input parameters and expected outputs. Another related work is VeriSketch [32]. Here, synthesis, information flow tracking, and code-repair techniques are coupled with function and security properties to produce secure RTL. While VeriSketch does borrow the repair-oriented idea of iteratively generating RTL code till it is secure, it is a secure-by-design approach.

The differences between our work and CirFix and VeriSketch are outlined in Table I. CirFix performs localization/identification of the bug and the repair. These two parts can be examined independently, e.g., Tarsel [6] uses hardware-specific timing information and the program spectrum and captures the changes of executed statements to locate faults effectively. VeriSketch in contrast aims at neither localization nor repair; it uses repair techniques to generate designs that are secure according to the properties provided by security architects/designers.

TABLE I: Comparison with CirFix and VeriSketch. [Table not reproduced in this excerpt.]

In our work, we focus on repair, using LLM-generated replacement code. We leverage some of the designer's expertise, and examine how much can be achieved with LLM-based repair, where generation does not need an "oracle". While CirFix instruments an oracle to use the correct outputs to guide repairs, LLMs rely on the large body of RTL code seen during training to produce a corrected version of the buggy code. VeriSketch requires functional properties to generate the design, which function as an oracle. We present an empirical comparison of LLM-based replacement code to CirFix in Section VI.

The first work that used LLMs to repair code (C, Python, and Verilog) was done by Pearce et al. [15]. For hardware, the authors covered 2 CWEs by looking at 2 bug instances in simple designs. We build on their basic idea by focusing specifically on hardware and covering more bugs and CWEs to gain more insights about using LLMs for repair of RTL. We cover 10 CWEs and 15 security-related bugs, and explore functional repair capabilities of LLMs as well by comparing the performance of LLMs with CirFix. We also combine the repair ability of LLMs with the static hardware bug detector tool CWEAT [2], to show how a bug can be detected and repaired using our approach.

III. DESIGNS AND BUGS

To explore using LLMs to fix HW security bugs, we collate and prepare a set of 15 hardware security bug benchmark designs from three sources: CWE descriptions on the MITRE website [31], the OpenTitan System-on-Chip (SoC) [33] and the Hack@DAC 2021 SoC [34]. The bugs from MITRE and the Hack@DAC 2021 SoC were available previously. We inserted bugs into the OpenTitan designs. Each bug is represented in a design, as described in Table II.

A. MITRE's CWE Examples

We use examples in MITRE's hardware design list to develop simple designs representing CWE(s). 4 of the 9 bugs and corresponding fixes for this source are shown in Figure 1 and detailed below. Bugs 5-9 are described in Table II but their coded examples are skipped for brevity. Their details may be found at [16].


TABLE II: Bugs overview. We assign a CWE to each bug and give a description of the design. [Table not reproduced in this excerpt.]

1) Locked Register: This design has a register that is protected by a lock bit. The contents of the register may only be changed when the lock_status bit is low. In Figure 1(a), a debug_unlocked signal overrides the lock_status signal, allowing the locked register to be written into even if lock_status is asserted.
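Figure 1 is not reproduced in this excerpt. As a hedged sketch of the kind of design Figure 1(a) describes (the benchmark's actual code may differ; everything other than lock_status and debug_unlocked is assumed), a CWE 1234-style override might look like the following:

module locked_register (
  input  wire        clk,
  input  wire        resetn,
  input  wire        write_en,
  input  wire        lock_status,
  input  wire        debug_unlocked,
  input  wire [15:0] data_in,
  output reg  [15:0] locked_reg
);
  always @(posedge clk or negedge resetn) begin
    if (!resetn)
      locked_reg <= 16'h0000;
    // BUG (CWE 1234): the debug-mode signal overrides the lock, so the
    // register can be written even while lock_status is asserted.
    else if (write_en && (~lock_status || debug_unlocked))
      locked_reg <= data_in;
    // FIX: drop the override, e.g.
    //   else if (write_en && ~lock_status) locked_reg <= data_in;
  end
endmodule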
2) Lock on Reset: This design has a register that holds sensitive information. This register should be assigned a known value on reset. In Figure 1(b), the locked register should have a value on reset, but in this case, there is no reset.
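Again as a hedged illustration only (Figure 1(b) is not reproduced and the benchmark may differ; port names are assumptions), the lock-on-reset weakness and its repair reduce to giving the security-critical register a defined reset value:

module lock_on_reset (
  input  wire clk,
  input  wire resetn,
  input  wire unlock,
  input  wire d,
  output reg  locked
);
  // BUG (CWE 1271): with no reset branch, `locked` powers up unknown:
  //   always @(posedge clk) if (unlock) locked <= d;
  // FIX: give the security-critical register a defined value on reset.
  always @(posedge clk or negedge resetn) begin
    if (!resetn)     locked <= 1'b0;
    else if (unlock) locked <= d;
  end
endmodule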
3) Grant Access: This design has a register that should only be modifiable if the usr_id input is correct. In Figure 1(c), the register data_out is assigned a new value if the grant_access signal is asserted. This should happen when usr_id is correct, but since the check happens after writing into data_out in blocking assignments, data_out may be modified when the usr_id is incorrect.
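A minimal sketch of this pattern follows; Figure 1(c) is not reproduced, and the register widths, the allowed_id input, and the module name are invented for illustration. The commented lines show the buggy ordering described above, where the blocking write to data_out happens before grant_access is recomputed:

module grant_access_reg (
  input  wire       clk,
  input  wire [7:0] usr_id,
  input  wire [7:0] allowed_id,  // hypothetical: the ID permitted to write
  input  wire [7:0] data_in,
  output reg  [7:0] data_out
);
  reg grant_access;
  always @(posedge clk) begin
    // BUG (CWE 1280): the asset is written before the access-control
    // check updates grant_access:
    //   if (grant_access) data_out = data_in;
    //   grant_access = (usr_id == allowed_id);
    // FIX: evaluate the check first, then perform the guarded write.
    grant_access = (usr_id == allowed_id);
    if (grant_access) data_out = data_in;
  end
endmodule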
4) Trustzone Peripheral: This design contains a peripheral instantiated in an SoC. To distinguish between trusted and untrusted entities, a signal is used to assign the security level of the peripheral. This is also described as a privilege bit used in Arm TrustZone to define the security level of all connected IPs. In Figure 1(d), the security level of the instantiated peripheral is grounded to zero, which could lead to incorrect privilege escalation of all input data.

B. Google's OpenTitan

OpenTitan is an open-source project to develop a silicon root of trust with implementations of SoC security measures. Since OpenTitan does not have declared bugs, we inject bugs by tweaking the RTL of these security measures in different modules. These are measures implemented in the HDL code that mitigate attacks on assets in security-critical Intellectual Properties (IPs). The OpenTitan taxonomy presents a countermeasure in the following form: [UNIQUIFIER.]ASSET.CM_TYPE. Here, ASSET is the element that is being protected, e.g., a key or internal states of a processor's control flow. Each protection mechanism is named with CM_TYPE, e.g., multi-bit encoded signal, scrambled asset, or access to asset limited according to life-cycle state. The UNIQUIFIER is a custom prefix label to make the identifier unique after identifying the IP. The bugs we produced using these countermeasures and their corresponding fixes are shown in Figure 2.

1) ROM Control: This design contains a module that acts as an interface between the Read Only Memory (ROM) and the system bus. The ROM has scrambled contents, and the controller descrambles the content while serving memory requests. We target the COMPARE.CTRL_FLOW.CONSISTENCY security measure in the rom_ctrl_compare module. Here, the asset is CTRL_FLOW, referring to the control flow of the ROM Control module. The countermeasure is CONSISTENCY, checking consistency of the control flow other than by associating integrity bits. A part of this measure is that the start_i signal should only be asserted in the Waiting state, otherwise, an alert signal is asserted. In Figure 2(a), because of our induced bug, the alert signal is incorrectly asserted when start_i is high in any state other than Waiting.
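Figure 2 is not reproduced in this excerpt, and the actual OpenTitan rom_ctrl_compare logic is considerably more involved. The following generic sketch only shows the shape of such a control-flow consistency check; the state encoding, port names, and module name are assumptions, not OpenTitan code:

module start_consistency_chk (
  input  wire       clk_i,
  input  wire       rst_ni,
  input  wire       start_i,
  input  wire [1:0] state_q,
  output reg        alert_o
);
  localparam [1:0] Waiting = 2'd0;  // hypothetical state encoding
  always @(posedge clk_i or negedge rst_ni) begin
    if (!rst_ni)
      alert_o <= 1'b0;
    // Countermeasure: start_i is only legal in the Waiting state; seeing
    // it anywhere else is treated as a control-flow violation and the
    // alert is latched.
    else if (start_i && (state_q != Waiting))
      alert_o <= 1'b1;
  end
endmodule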


Fig. 1. MITRE CWE bugs and repairs. A repair (green) replaces the bug (red). [Figure not reproduced in this excerpt.]
Fig. 2. OpenTitan bugs: The repair (green) replaces the bug (red) for a successful fix. [Figure not reproduced in this excerpt.]

2) OTP Control: This is a One-Time Programmable memory controller that provides the programmability for the device's life cycle. It ensures that the correct life cycle transitions are implemented as the entity of the SoC changes among the 4 – Silicon Creator, Silicon Owner, Application Provider, and the End User. We target the LCI.FSM.LOCAL_ESC security measure in the otp_ctrl_lci module. Here, the asset is FSM, referring to the Finite State Machine of the OTP Control module. The countermeasure is LOCAL_ESC, ensuring a trigger when an attack is detected. A part of this measure is that the FSM jumps to an error state if the escalation signal is asserted. In Figure 2(b), no error is raised in such a case because of our induced bug.

3) Keymanager KMAC: This design performs the Keccak Message Authentication Code (KMAC) and Secure Hashing Algorithm 3 (SHA3) functionality. It is responsible for checking the integrity of the incoming message with the signature produced from the same secret key. We target the KMAC_IF_DONE.CTRL.CONSISTENCY security measure in the keymgr_kmac_if module. Here, the asset is CTRL, referring to the logic used to steer the hardware behavior of the KMAC module. The countermeasure is CONSISTENCY, checking the consistency of the control hardware other than by associating integrity bits. A part of this measure is that the kmac done signal should not be asserted outside the accepted window, i.e., when the FSM is in the done state. In Figure 2(c), because of our induced bug, the kmac done signal is incorrectly asserted in the transmission state StTx.

C. Hack@DAC-21

This is a hackathon for finding vulnerabilities at the RTL level for a System-on-Chip (SoC). The bugs and fixes for this source are shown in Figure 3.

1) CSR RegFile: This design contains a module that carries out changes in control and status registers according to the system's state. This includes changes in privilege levels, incoming interrupts, virtualization, and cache support. We consider the module's function pertaining to the stalling of the core on receiving an interrupt and/or debug request. In Figure 3(a), the debug signal overrides interrupt signals.

2) DMA: This design has a Direct Memory Access module common to all blocks. It uses the memory address as input and performs read/write according to the Physical Memory Protection (PMP) configuration. We consider the PMP access mechanism as the relevant security implementation.


In Figure 3(b), the PMP register is not assigned any value on reset.

3) AES 2 Interface: This design incorporates the AES module to produce cipher text while using an FSM for managing interactions. However, the case statement in Figure 3(c) lacks enough cases and does not include a default statement.

Fig. 3. Hack@DAC bugs: Repair code (green) replaces buggy code (red). [Figure not reproduced in this excerpt.]
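Figure 3(c) is not reproduced here. The following hedged sketch shows the general class of bug and fix, a case statement that does not cover all inputs, repaired by adding a default arm; the module, port, and state names are invented, and the real Hack@DAC interface FSM is larger:

module cmd_fsm (
  input  wire       clk,
  input  wire       rst_n,
  input  wire [2:0] cmd,
  output reg  [1:0] state
);
  localparam [1:0] IDLE = 2'd0, LOAD = 2'd1, RUN = 2'd2, DONE = 2'd3;
  always @(posedge clk or negedge rst_n) begin
    if (!rst_n) state <= IDLE;
    else case (cmd)
      3'b001: state <= LOAD;
      3'b010: state <= RUN;
      3'b100: state <= DONE;
      // BUG: the remaining cmd encodings are unhandled, so the FSM
      // silently keeps its old state for unexpected inputs.
      // FIX: add a default arm so every input maps to a safe state.
      default: state <= IDLE;
    endcase
  end
endmodule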
IV. EXPERIMENTAL METHOD

To test the capability of LLMs to generate successful repairs, we perform experiments using the designs and bugs detailed in Section III. In this section, we present our framework that automates the execution of our experiments, starting from the location of bugs to the evaluation of the repairs.

A. LLM-Based Repair Evaluation Framework

Figure 4 provides an overview of our experiments, with three main components: the Sources, Repair Generator, and Evaluator. The Sources were discussed in Section III.

1) Repair Generator: This block takes the buggy file, location and CWE of the bug as the input from the source. We assume that the location of the bugs, i.e., starting and ending line numbers and the filepath of the buggy file, is known. The location is only used to construct the content window for the prompt and is not a feature learned by the LLM. For each bug, we develop instructions to assist the repair: these are in the form of comments, which will be inserted before and after the buggy code to assist the LLMs in generating an appropriate repair for that bug. The Prompt Generator combines the code before the bug, the buggy code in comments, and instructions to form the Prompt to the LLM. This can be worded as "what the LLM sees". An example of this construction for the bug in Figure 1(c) is shown in Figure 5 (a)-(b). The instructions are broken down into 'Bug Instruction' and 'Fix Instruction'. The former describes the nature of the bug and lets the LLM know that the bug follows. The latter follows the bug in comments and instructs the LLM on how to fix the bug. These instructions are varied in different degrees of detail according to the bug as discussed in Section IV-B.1. The LLM takes the Prompt as input and outputs the Repairs. The repairs produced may be correct or incorrect. Some of the repairs generated using the prompt in Figure 5 (b) are shown in Figure 5 (c)-(e).

2) Evaluator: This block takes the Repairs generated by the LLM and verifies their correctness by evaluating their functionality and security. A repair is successful if it is both functional and secure. We use the ModelSim simulator [35] as a part of Xilinx Vivado 2022.2 to simulate the designs. Functional Evaluation uses Verilog testbenches we made for each design, comprising tests for various input vectors. A failed testbench indicates a failure of at least one test or a syntax error in the design. For MITRE designs, we develop testbenches that cover the design's entire intended functionality. For OpenTitan and Hack@DAC designs, we cover partial functionality for inputs and outputs that pertain to the buggy code. These designs require another step of forming the Device Under Test (DUT) before simulation. We prepare for the simulator a list of files instantiated by the buggy file and the files that need to be analyzed prior to the buggy file.

Security Evaluation involves a combination of testbenches (for MITRE and OpenTitan) and a static analysis tool (for Hack@DAC) discussed in Section VII. For MITRE designs, we design tests based on weaknesses mentioned on the MITRE website for each bug. For OpenTitan, we use the security countermeasures defined in relevant ".hjson" files for the peripherals. It is difficult to verify the security countermeasure completely because that requires simulating the SoC through the software for Design Verification by OpenTitan. This method is a work in progress for the OpenTitan team. Moreover, the countermeasures that can currently be verified completely require a lot of simulation time. Hence, we develop custom testbenches that verify specific functionality for the bugs we introduce in the OpenTitan.

B. Experimental Parameters

LLMs have several controllable parameters which affect output generation. We change the prompt (as discussed in Section IV-A.1) according to the bug and Instructions. We also vary the Temperature and Models while keeping the top_p, number_of_completions (n) and max_tokens constant at 1, 20 and 200 respectively. top_p is an alternative to sampling with temperature, called nucleus sampling, where only results with probability mass of top_p are considered. n is the number of completions generated by the LLM per request. max_tokens is the maximum number of tokens that can be generated per completion [9].
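Figure 5 is not reproduced in this excerpt. As a hedged reconstruction of the prompt format described above (the exact instruction wording used in the study may differ, and allowed_id is an illustrative name), a prompt for Bug 3 (Grant Access) gives the LLM the code before the bug, a Bug Instruction, the buggy lines as comments, and a Fix Instruction, then stops so the model completes the repair:

module grant_access_reg (
  input  wire       clk,
  input  wire [7:0] usr_id,
  input  wire [7:0] allowed_id,
  input  wire [7:0] data_in,
  output reg  [7:0] data_out
);
  reg grant_access;
  always @(posedge clk) begin
    // BUG: Access control check is implemented after the asset is accessed.
    // if (grant_access) data_out = data_in;
    // grant_access = (usr_id == allowed_id);
    // FIX: Compare usr_id against allowed_id first; only write data_out
    // when the comparison passes.

The completion returned by the LLM is expected to supply the repaired lines (and any closing end/endmodule, subject to the stop keywords discussed under the experimental parameters), which the framework then splices in place of the buggy lines before evaluation.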


Fig. 4. Our experimental framework has 3 components: Sources with bugs, LLM-based Repair Generator to create fixes, and Evaluator to verify repairs. [Figure not reproduced in this excerpt.]

TABLE III: Instruction variations. We develop 5 types to assist repair of bugs. Variation a is the base variation with no assistance. The level of detail/assistance increases from variation a to e. [Table not reproduced in this excerpt.]

1) Instruction Variation: We test five instruction variants to guide the repair of bugs. They are described in Table III. Each variation has 2 parts – Bug Instruction and Fix Instruction. The former describes the nature of the bug and precedes the commented bug. The latter follows the bug in comments and represents guidance to the LLM on how to fix the bug. Variation a provides no assistance and is the same across all bugs. Here, the Bug Instruction is "//BUG:" and the Fix Instruction is "//FIX:". The Bug Instruction for the remaining variations is a description of the nature of the bug. We take inspiration from the MITRE website and cater them according to the CWE they represent. This is an attempt to increase the generalizability of some variations of prompts, allowing the repair of different instances of the same 'kind/CWE' of bug. For 9 of the 10 CWEs, we use the description of the CWE as the Bug Instruction for variations b, c, d and e. We make an exception for CWE 1245 because it covers a vast possibility of issues. CWE 1245 is "Improper FSMs in Hardware Logic", which may include incomplete case statements, vulnerable transitions, missing transitions, FSM deadlocks, incorrect alert mechanisms, etc. For variation e this description is appended with an example of a 'generalized' bug in comments and its fix without comments. This generalization is done through using more common signal names and coding patterns. The Fix Instruction for b and e is the same as that for a. For c, it is preceded by a 'prescriptive' instruction, which means that natural language is used to assist the fix. For d, however, it is preceded by a 'descriptive' instruction, which means that language resembling pseudo-code is used to assist the fix. The components of instruction that change are shown in Table IV.

Fig. 5. Prompt to LLM and sample repairs produced for Bug 3 - Grant Access. Sub-figures (a)-(b) show how the bug is combined with instructions to generate the prompt that the LLM gets as one of its inputs. Sub-figures (c)-(e) show some actual repairs generated by an LLM. [Figure not reproduced in this excerpt.]

2) Temperature (t): A higher value means that the LLM takes more risks and yields more creative completions. We use t ∈ {0.1, 0.3, 0.5, 0.7, 0.9}.

3) Models: We use seven LLMs, five of which are made available by OpenAI [36] and two are open-source models available through [10] and [37]. The OpenAI Codex models are derived from GPT-3 and were trained on millions of public GitHub repositories.

TABLE IV
D ETAILS OF I NSTRUCTION VARIATIONS AND S TOP K EYWORDS U SED . T HE S AME B UG I NSTRUCTION I S U SED FOR VARIATIONS b,c,d , S HOWN IN
C OLUMN 2. I N C ASE OF VARIATION e, T HIS B UG I NSTRUCTION ( IN C OLUMN 2) I S A PPENDED BY AN E XAMPLE OF A B UG AND I TS R EPAIR
IN C OMMENTS , S HOWN IN C OLUMN 3. F IX I NSTRUCTIONS FOR VARIATIONS C AND D P RECEDE THE S TRING “FIX:”, S HOWN IN C OLUMNS
4 AND 5 R ESPECTIVELY. A DDITIONAL S TOP K EYWORDS T HAT T ERMINATE THE F URTHER G ENERATION OF T OKENS BY LLM S A RE
S HOWN IN C OLUMN 6


They can ingest and generate code, and also translate natural language to code. We use/evaluate the gpt-3.5-turbo, gpt-4, code-davinci-001, code-davinci-002 and code-cushman-001 models. From Hugging Face, we evaluate the model CodeGen-16B-multi, which we refer to as CodeGen. It is an autoregressive LLM for program synthesis trained sequentially on The Pile and BigQuery. We also evaluate the fine-tuned version of CodeGen, trained over a Verilog corpus comprising open-source Verilog code in GitHub repositories [10], referred to as VGen.

4) Number of Lines Before Bug: Another parameter to consider is in the prompt preparation: the number of lines of existing code given to the LLM. Some files may be too large for the entire code before the bug to be sent to the LLM. We, therefore, select a minimum of 25 and a maximum of 50 lines of code before the bug as part of the prompt. In Figure 5 (b), this would be lines 1–5 (inclusive). If there are more than 25 lines above the bug, we include enough lines that go up to the beginning of the block the bug is in. This block could be an always block, module, or case statement, etc. If the bug is too large, however, the lines before the bug and the bug may exceed the token limit of the LLM. Then the proposed scheme will not be able to repair it. In our work though, we did not run into this problem.

5) Stop Keywords: They are not included in the response. We developed a strategy that works well with the set of bugs. The default stop keyword is endmodule. Keywords used are in the column Stop keywords in Table IV.

V. EXPERIMENTAL RESULTS

We set up our experimental framework for each LLM, generating 20 responses for every combination of bug, temperature, and instruction variation. The responses are counted as successful repairs if they pass functional and security tests. The number of successful repairs is shown as heatmaps in Figure 6. The maximum value for each element is 20, i.e., when all responses were successful repairs.

A. RQ1: Can Out-of-the-Box LLMs Fix Hardware Security Bugs?

Results show that LLMs can repair simple security bugs. gpt-4, code-davinci-002, and code-cushman-001 yielded at least one successful repair for every bug in our dataset. code-davinci-001, gpt-3.5-turbo, CodeGen and VGen were successful for 14, 13, 11 and 10 out of 15 bugs. In total, we requested 52,500 repairs out of which 15,063 were correct, a success rate of 28.7%. The key here lies in selecting the best-observed parameters for each LLM. code-davinci-002 performs best at variation e, temperature 0.1, producing 69% correct repairs. gpt-4, gpt-3.5-turbo, code-davinci-001, code-cushman-001, CodeGen and VGen perform best at (e, 0.5), (d, 0.1), (d, 0.1), (d, 0.1), (e, 0.3) and (c, 0.3) with success rates of 67%, 44%, 53%, 51%, 17% and 8.3% respectively. Performance of these LLMs across bugs is shown in Figure 7.

B. RQ2: How Important Are Prompt Details?

The 5 instruction variations from a to e increase in the level of detail. Apart from CodeGen and VGen, the LLMs do better with more detail when generating a repair, as shown in Figure 8. Variations c-e perform better than variations a and b. They include a fix instruction after the buggy code in comments, giving credence to the use of two separate instructions per prompt (one before and one after the bug in comments). Variation d has the highest success rate among OpenAI LLMs and is therefore our recommendation for bug fixes. The use of a fix instruction in "pseudo-code" (designer intent using mostly natural language) leads to the best results. There is variation within LLMs for the best-observed instruction variation, e.g., gpt-4, code-davinci-002 and CodeGen perform best at e. Excluding the results of CodeGen and VGen, because they perform very poorly, the success rates for variations a-e across OpenAI models increase by 20, 41, 11 and −14% respectively for each successive variation. As an example, going from variation a to b yields 20% more successful repairs and going from b to c yields 41% more successful repairs. From these numbers, the most significant jump is going from b to c, showing the importance of including a Fix Instruction in the prompt. We also observe that a coded example of a repair in the form of variation e decreases the success rates of OpenAI LLMs. Instructions with natural language guidance do better than coded examples.

C. RQ3: What Bugs Appear Amenable to Repair?

The cumulative number of correct repairs for each bug for OpenAI LLMs is shown in Figure 9. Bugs 3 and 4 were the best candidates for repair with success rates of over 75%. These are examples from MITRE where the signal names indicate their intended purposes. For the Grant Access module, the signals of concern are grant_access and usr_id, used in successive lines. LLMs preserved the functionality that the usr_id should be compared before granting access. Most successful repairs either flipped the order of blocking assignments or lumped them into an assignment using the ternary operator. Similarly, Trustzone Peripheral uses signal names data_in_security_level and rdata_security_level which illustrate their function. Bugs 5, 6 and 12 were the hardest to repair with success rates < 10%. Bug 6 was the toughest to repair because of the specificity required for the correct case statement. A correct repair would require all 32 possibilities of the security signal to be correctly translated to the 4 possible values of the output security signal. Bug 5 was difficult to repair because the models refused to acknowledge that a glitch existed and kept generating the same code as the bug. On many occasions gpt-4 produced variations of the following comment accompanying the code it generated: "No bug here, already correct." Bug 12 was the only bug that required a line to be removed without replacement as a fix. Bugs 8 and 14 were moderately difficult to repair with success rates over 10% but less than 20%. Bug 8 proved difficult because of its complexity. The bug spanned 20 lines and a typical repair required 4 if statements.
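For reference, the two repair styles reported above for the Grant Access module can be sketched as follows; this is a hedged reconstruction of the kinds of completions shown in Figure 5(c)-(e), not verbatim LLM output, and allowed_id is an illustrative name. Both fragments would sit inside the design's always block:

// Style 1: flip the order of the blocking assignments.
grant_access = (usr_id == allowed_id);
if (grant_access) data_out = data_in;

// Style 2: lump check and write into one ternary assignment.
grant_access = (usr_id == allowed_id);
data_out     = grant_access ? data_in : data_out;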


Fig. 6. Results for all LLMs, temperature and instruction variation configurations represented as heatmaps. The maximum value for each small box is 20. A higher value indicates more success by the LLM in generating a repair and is highlighted with a darker shade. All bugs were repaired at least once by at least one LLM. [Figure not reproduced in this excerpt.]


Bug 14 had the bug of a register holding security settings not initialized under reset. This was difficult to repair because a fix needs an added always block with an appropriate reset, while re-creating the previous intended functionality.

Fig. 7. Results showing the performance of each LLM across all bugs in the form of heatmaps. Each small square shows the number of correct repairs for the corresponding instruction variation and temperature of the LLM. The maximum possible value is 300. A higher value indicates more success in generating repairs and is shaded in a darker color. [Figure not reproduced in this excerpt.]

Fig. 8. Trends across models. The top graph shows the number of correct repairs for LLMs for specified instruction variations. The bottom graph shows the number of correct repairs for LLMs for specified temperatures. The maximum value for each data point is 1500. [Figure not reproduced in this excerpt.]

Fig. 9. Number of correct repairs per bug for OpenAI LLMs. The number above each bar shows the sum of successful repairs across all LLMs for the corresponding bug. The maximum possible value is 2500. A higher value indicates that the bug was repaired more times. Annotations represent the percentages of successful repairs. [Figure not reproduced in this excerpt.]

D. RQ4: Does the LLM Temperature Matter?

A higher temperature allows the LLM to be more "creative." Overall, Figure 8 shows that the LLMs perform better at lower temperatures, with a temperature of 0.1 yielding the highest number of successful repairs. The performance of most LLMs decreases with increasing temperature. For gpt-4 and gpt-3.5-turbo, however, temperature does not seem to have an impact, with success rates being stable regardless of the temperature. gpt-4 is the only LLM which performs best at a temperature of 0.5. gpt-3.5 and VGen perform best at 0.3. The remaining perform best at 0.1. A lower temperature leads to less variation in responses, suggesting that the less creative responses are more likely to be correct repairs.

E. RQ5: Are Some LLMs Better Than Others?

The gpt-4 LLM was the best performing, producing 3862 correct repairs out of 7500, giving it a success rate of 51.5%. code-davinci-002, gpt-3.5-turbo, code-davinci-001, code-cushman-001, CodeGen and VGen had success rates of 43.4%, 31.6%, 30.6%, 29.1%, 9.9% and 4.7% respectively. The difference between the OpenAI LLMs and CodeGen + VGen is caused by CodeGen being a smaller LLM, with 16 billion parameters compared to GPT-4's 1.7 trillion and GPT-3's ∼175B parameters (the number of parameters for the OpenAI LLMs are not public). While the larger models tend to perform better, the relationship is not completely linear. For instance, while the code-davinci models and gpt-3.5-turbo are based on the same underlying GPT-3 model, code-davinci-002 performs much better than gpt-3.5-turbo. This is because it is fine-tuned specifically for programming applications, whereas gpt-3.5-turbo has had additional fine-tuning to align it for instruction following. We believe that having a "large enough" model which has been fine-tuned appropriately would provide a suitable foundation for most applications. Counter-intuitively, the fine-tuned VGen performs worse than CodeGen. code-cushman-001 is slightly inferior to the davinci LLMs, possibly because it was designed to be quicker (smaller), i.e., it has fewer parameters, was trained over less data, or both.

VI. COMPARISON TO PRIOR WORK

To further validate the effectiveness of our approach, we performed a detailed comparison of our work with CirFix [8] (Table V). We use the best-performing LLM (gpt-4) at t = 0.1 and generate one repair each for instruction variations a and b. Recall that a has no instruction, meaning the LLM is not guided at all. This is done to closely mirror the use case of CirFix. By comparing the first example produced by the LLM, we evaluate only one attempt at repair. This attempt is manually evaluated for correctness. Variation a produces 19 correct repairs as compared to CirFix's 16. To elicit the power of LLMs, we use variation b, which includes a description of the type of bug. We use the brief descriptions of bugs provided in CirFix's GitHub repository. Variation b fixes 22 of the 32 benchmarks.


TABLE V: Comparison on CirFix benchmarks. A successful repair is shown as Y. We use two instruction variations for this comparison. An element "- | Y" means that the repair using variation a was not successful but using variation b was. The element 1/2 means that 2 errors were used in the description of a single fault/bug and 1 out of 2 was repaired. [Table not reproduced in this excerpt.]

Fig. 10. Overview of the framework used in the end-to-end solution with CWEAT. This was used for bugs 13, 14, and 15. The source is the Hack@DAC 2021 SoC. CWEAT is used for both detection and evaluation. [Figure not reproduced in this excerpt.]

VII. ON INCORPORATING A BUG DETECTOR

The work so far has discussed bug repair assuming that the location of the bug is already known. A detector could also be used after repair for security re-evaluation. To explore this possibility, we present an end-to-end framework for some CWEs in Verilog using CWEAT [2], which can detect a bug, generate a repair using our methodology, and then evaluate its correctness. This pipeline is shown in Figure 10.

CWEAT [2] is a static analysis tool that can detect some security weaknesses in RTL. We use the methods described in [2] to traverse the Abstract Syntax Trees (ASTs) generated by the Verific parser. Each node of the tree represents a syntactical element of the RTL code with various information about identifiers, types, values, and conditions. The ASTs are traversed using keywords and patterns to indicate potential vulnerabilities in CWEs 1234, 1271, and 1245. We ran this tool over the Hack@DAC 2021 SoC and selected three instances, one per CWE, for the purposes of this paper. These are bugs 13, 14, and 15 in Table II. We use the same tool for security evaluation of the generated responses. We replace the buggy code with the repaired code in the SoC and run the tool again. If the same bug is picked up, i.e., the same location and CWE, we can determine that the repair is not successful. If that is not the case, we infer that the repair is adequate (i.e., the bug was removed). We produced the results in Figure 6 for bugs 13, 14, and 15 using this end-to-end solution.

We envision RTL designers using this (or a similar) LLM-infused end-to-end solution as they write HDL code in the early stages of design. CWEAT can highlight a weakness to the designer, run it through the LLM to produce repairs, choose the ones that are secure, and present suggestions to the designer. Detection and repair can be treated as separate tasks and implemented using separate tools. A range of tools may be used for detection, e.g., commercial linting tools, hardware fuzzers, information flow tracking, and formal verification. The bugs found may be repaired by methods including LLMs and CirFix. This hybrid approach is likely to detect the most bugs and produce the most successful repairs.

VIII. DISCUSSION AND LIMITATIONS

This study shows that LLMs have the potential for bug repair. Presently, some assistance is required from the designer to identify a bug's location/nature. This can be overcome by using tools to localize bugs and better design practices such as comments explaining functionality. Currently, designers may need to pick from options produced by LLMs. Static analysis tools like CWEAT can help select or refine fixes. Even LLMs can be used for identifying vulnerabilities. LLM4SecHW [38] is an LLM-based hardware debugging framework aiming to identify bugs and provide debugging suggestions during the hardware design iteration process. This, however, requires curation of relevant data and fine-tuning using version control histories of relevant repositories.

We believe LLMs have the potential for automating bug repair. Selecting the right LLM is the first and easiest choice. GPT-4 works the best because it has the largest number of parameters. Knowing some information about the bug significantly improves the performance. LLMs do not work well for certain kinds of bugs but do very well on others. Our work shows that GPT-4 has a success rate of 40% or more for 9 of the 15 bugs and 70% or more for 6. We observe that as the nature of the security bug becomes more "elusive," e.g., race conditions and incorrect comparison level granularity, LLMs perform significantly worse than on simpler bugs. A vulnerability could be present as a result of multiple bugs collectively producing a security issue. We hypothesize that these would be difficult to identify as they may be present in multiple locations. Additionally, in a production environment, we could use the repair mechanism within a validation loop. This is the primary reason we have included the coupling of a static detector with our approach in Section VII.


the LLM is suggesting an incorrect repair, the detector will the best suggestion as the repair. In our experiments, we faced
identify the repair as incorrect. In this scenario, if an LLM is some challenge because of token limits set by the OpenAI API.
able to produce even one correct repair for a bug, it will be Since we were generating thousands of requests with a limited
able to find a successful fix. number of token keys, we had to wait for a minute ever time
A limitation of our study is the informal instruction we reached the limit. This raised our generation of repair time
variations. Although Bug instructions are inspired by the to ∼20 minutes per LLM.
descriptions in CWEs, our Fix instructions are devised accord-
ing to the experience of the authors. Our work reveals the IX. C ONCLUSION AND F UTURE W ORK
importance of these variations, as subtle changes can affect By selecting appropriate parameters and prompts, LLMs
the LLM response quality. Devising 5 categories is an attempt can effectively fix hardware bugs in our dataset. Each bug
to systematize this process, but future work can explore more had at least one successful repair, and 13 of the 15 had
varieties. Moreover, instructions are challenging to generalize perfect responses given the best parameter set. LLMs excel
across different bugs. Ideally, a designer would want variation when signal names and comments indicate functionality but
a to fix all bugs because no instructions are needed. struggle with multi-line fixes or when a buggy line must
Another limitation of our study is that the functional and be removed. Providing detailed, pseudo-code-like instructions
security evaluations are not exhaustive. Security evaluation is improves repair success rates. Bigger LLMs and LLMs at
dependent on design-specific security objectives and cannot lower temperatures perform better than those with fewer
be exhaustive. With this in mind, we limit the security eval- parameters and at higher temperatures. LLMs outperform
uation to the bug that makes the design insecure. Functional CirFix in fixing function-related bugs in Verilog, even when
evaluation is needed because a design that is secure but not detailed instructions are not provided. We suggest the follow-
functional is useless. For the CWE examples, we were able to ing areas for future research:
build exhaustive testbenches because the designs were low in • Employ a hybrid method for security bug detection with
complexity and had only one or two modules. Ideal functional linters, formal verification, fuzzing, fault localization,
testbenches should be exhaustive for other examples too but and static analysis tools. For repair, use LLMs and
are impractical. It would be a difficult task to write testbenches oracle-guided modifying algorithms. Combining tech-
for these complex SoCs and simulating the designs according niques is likely to yield better results than using just one.
to the software provided by OpenTitan and Hack@DAC. • Fine-tune LLMs over HDLs and see if their performance
It takes an hour to exhaustively simulate an IP on OpenTitan. improves. This improves quality of functional code [10].
Since we generate 3500 potential repairs for each bug, this • Explore the repair of functional bugs using LLMs with
would take 150 days for each bug. Therefore, we chose to build the full sweep of parameters. We only used one set of
custom testbenches that test the code a repair could impact. parameters that performed the best in our experiments.
The choice of end-tokens influences the success rate of
repairs. Some strategies are intuitive, like using the end line A PPENDIX
token as an end token for a bug that is present in only one
Compute Environment
line. Others may require more creativity because some lines
of code can be written in multiple ways. An LLM might not All experiments were conducted on an Intel Core i5-10400T
generate a repair that spans multiple conditional statements, CPU @2GHzx12 processor with 16 GB RAM, using Ubuntu
e.g., 20.04.5 LTS.

if (~resetn) begin locked <= 0; end Open Source Details


else if(unlock) begin locked <= d; end
else begin locked <= locked; end There are a few parts of our experimental framework where
we could not provide fully open-source access. We used Verific
if the keyword end is used as a stop token. Conversely, libraries provided by Verific under an academic license. Please
not limiting a response with an appropriate stop token may contact Verific for access.
mean that the LLM produces the correct repair but then adds We used CWEAT code from “Don’t CWEAT It:
superfluous code. We use a post-processing script to minimize Toward CWE Analysis Techniques in Early Stages
We use a post-processing script to minimize syntax errors. This involves adding or removing the end keyword as needed. When the LLM generates a repair, that repair is a substitute for the bug only. The numbers of begin and end keywords in the repair are counted. If the counts are equal, nothing needs to be done and the repair is inserted in place of the bug. If there are n more begin keywords than end keywords, end is appended to the repair n times. If there are n more end keywords than begin keywords, the first n instances of end are removed.
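A minimal sketch of this balancing step is shown below. It is a reconstruction of the behaviour described above rather than our released post-processing script, and it assumes that begin and end appear as whole words outside of strings and comments in the candidate repair.

  import re

  def balance_end_keywords(repair: str) -> str:
      """Heuristically balance begin/end keywords in a candidate repair."""
      begins = len(re.findall(r"\bbegin\b", repair))
      # Word boundaries avoid matching endcase, endmodule, etc.
      ends = len(re.findall(r"\bend\b", repair))
      if begins > ends:
          # Append the missing end keywords so every begin is closed.
          repair += "\nend" * (begins - ends)
      elif ends > begins:
          # Remove the first (ends - begins) surplus end keywords.
          repair = re.sub(r"\bend\b", "", repair, count=ends - begins)
      return repair

  # Example: one unmatched begin, so a closing end is appended.
  fixed = balance_end_keywords("if (~resetn) begin locked <= 0;")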
The LLMs are quick in generating repairs: the 20 responses per request are generated in under a minute. While trying to find a repair for a bug, a Verilog designer therefore has plenty of suggested repairs very quickly, and can then choose from among them.

LLMs with more parameters and at lower temperatures perform better than those with fewer parameters and at higher temperatures. LLMs outperform CirFix in fixing function-related bugs in Verilog, even when detailed instructions are not provided. We suggest the following areas for future research:
• Employ a hybrid method for security bug detection with linters, formal verification, fuzzing, fault localization, and static analysis tools. For repair, use LLMs and oracle-guided modifying algorithms. Combining techniques is likely to yield better results than using just one.
• Fine-tune LLMs over HDLs and see if their performance improves; fine-tuning has been shown to improve the quality of functional code generation [10].
• Explore the repair of functional bugs using LLMs with the full sweep of parameters. We used only the single set of parameters that performed best in our experiments.

APPENDIX

Compute Environment

All experiments were conducted on an Intel Core i5-10400T CPU @ 2 GHz × 12 with 16 GB RAM, running Ubuntu 20.04.5 LTS.

Open Source Details

There are a few parts of our experimental framework where we could not provide fully open-source access. We used Verific libraries provided by Verific under an academic license. Please contact Verific for access.

We used CWEAT code from "Don't CWEAT It: Toward CWE Analysis Techniques in Early Stages of Hardware Design" [2]. The paper is available at https://dl.acm.org/doi/abs/10.1145/3508352.3549369. Please contact the authors for use of or help with their codebase.

We used the CirFix benchmarks and results in the GitHub repository provided by the authors of "CirFix: Automatically repairing defects in hardware design code" [8], https://github.com/hammad-a/verilog_repair. Please contact the authors for their tools. Their paper is available at https://dl.acm.org/doi/10.1145/3503222.3507763.

We use the Hack@DAC SoC from the 2021 competition. Please contact the organizers at [email protected] for more information/access.
ACKNOWLEDGMENT

The authors would like to thank Verific Design Automation for generously providing academic access to linkable libraries, examples, and documentation for their RTL parsers. This work does not in any way constitute an Intel endorsement of a product or supplier.

REFERENCES

[1] C. L. Goues, M. Pradel, and A. Roychoudhury, "Automated program repair," Commun. ACM, vol. 62, no. 12, pp. 56–65, Nov. 2019, doi: 10.1145/3318162.

[2] B. Ahmad et al., "Don't CWEAT It: Toward CWE analysis techniques in early stages of hardware design," in Proc. IEEE/ACM Int. Conf. Comput. Aided Design (ICCAD), New York, NY, USA: Association for Computing Machinery, Oct. 2022, pp. 1–9.

[3] (2022). VC Formal. [Online]. Available: https://www.synopsys.com/verification/static-and-formal-verification/vc-formal.html

[4] Cadence. (2022). Jasper RTL Apps. [Online]. Available: https://www.cadence.com/en_US/home/tools/system-design-and-verification/formal-and-static-verification/jasper-gold-verification-platform.html

[5] T. Trippel, K. G. Shin, A. Chernyakhovsky, G. Kelly, D. Rizzo, and M. Hicks, "Fuzzing hardware like software," 2021, arXiv:2102.02308.

[6] J. Wu et al., "Fault localization for hardware design code with time-aware program spectrum," in Proc. IEEE 40th Int. Conf. Comput. Design (ICCD), Oct. 2022, pp. 537–544.

[7] M. R. Fadiheh, D. Stoffel, C. Barrett, S. Mitra, and W. Kunz, "Processor hardware security vulnerabilities and their detection by unique program execution checking," in Proc. Design, Autom. Test Eur. Conf. Exhib. (DATE), Mar. 2019, pp. 994–999.

[8] H. Ahmad, Y. Huang, and W. Weimer, "CirFix: Automatically repairing defects in hardware design code," in Proc. 27th ACM Int. Conf. Architectural Support Program. Lang. Operating Syst., New York, NY, USA: Association for Computing Machinery, Feb. 2022, pp. 990–1003, doi: 10.1145/3503222.3507763.

[9] M. Chen et al., "Evaluating large language models trained on code," 2021, arXiv:2107.03374.

[10] S. Thakur et al., "Benchmarking large language models for automated Verilog RTL code generation," in Proc. Design, Autom. Test Eur. Conf. Exhib. (DATE), Apr. 2023, pp. 1–6.

[11] H. Pearce, B. Tan, and R. Karri, "DAVE: Deriving automatically Verilog from English," in Proc. ACM/IEEE 2nd Workshop Mach. Learn. CAD (MLCAD), Nov. 2020, pp. 27–32.

[12] M. Tufano, C. Watson, G. Bavota, M. D. Penta, M. White, and D. Poshyvanyk, "An empirical study on learning bug-fixing patches in the wild via neural machine translation," ACM Trans. Softw. Eng. Methodol., vol. 28, no. 4, pp. 1–29, Sep. 2019, doi: 10.1145/3340544.

[13] D. Drain, C. Wu, A. Svyatkovskiy, and N. Sundaresan, "Generating bug-fixes using pretrained transformers," in Proc. 5th ACM SIGPLAN Int. Symp. Mach. Program., Jun. 2021, pp. 1–8.

[14] C. S. Xia, Y. Wei, and L. Zhang, "Automated program repair in the era of large pre-trained language models," in Proc. IEEE/ACM 45th Int. Conf. Softw. Eng. (ICSE), May 2023, pp. 1482–1494.

[15] H. Pearce, B. Tan, B. Ahmad, R. Karri, and B. Dolan-Gavitt, "Examining zero-shot vulnerability repair with large language models," in Proc. IEEE Symp. Secur. Privacy (SP), May 2023, pp. 2339–2356.

[16] A. F. Rev. (2023). Artifacts for "On Hardware Security Bug Code Fixes By Querying Large Language Models." [Online]. Available: https://zenodo.org/records/10416865

[17] M. Monperrus, The Living Review on Automated Program Repair, document hal-01956501, HAL Arch. Ouvertes, 2018.

[18] W. Wang, Z. Meng, Z. Wang, S. Liu, and J. Hao, "LoopFix: An approach to automatic repair of buggy loops," J. Syst. Softw., vol. 156, pp. 100–112, Oct. 2019, doi: 10.1016/j.jss.2019.06.076.

[19] X. D. Le and Q. L. Le, "ReFixar: Multi-version reasoning for automated repair of regression errors," in Proc. IEEE 32nd Int. Symp. Softw. Rel. Eng. (ISSRE), Oct. 2021, pp. 162–172.

[20] Y. Lu, N. Meng, and W. Li, "FAPR: Fast and accurate program repair for introductory programming courses," 2021, arXiv:2107.06550.

[21] Z. Chen, S. Kommrusch, and M. Monperrus, "Neural transfer learning for repairing security vulnerabilities in C code," IEEE Trans. Softw. Eng., vol. 49, no. 1, pp. 147–165, Jan. 2023.

[22] S. Ma, F. Thung, D. Lo, C. Sun, and R. H. Deng, "VuRLE: Automatic vulnerability detection and repair by learning from examples," in Computer Security—ESORICS. Cham: Springer, 2017, pp. 229–246.

[23] P. Reiter, H. J. Tay, W. Weimer, A. Doupé, R. Wang, and S. Forrest, "Automatically mitigating vulnerabilities in binary programs via partially recompilable decompilation," 2022, arXiv:2202.12336.

[24] N. Potlapally, "Hardware security in practice: Challenges and opportunities," in Proc. IEEE Int. Symp. Hardware-Oriented Secur. Trust, Jun. 2011, pp. 93–98.

[25] D. Demmler, G. Dessouky, F. Koushanfar, A.-R. Sadeghi, T. Schneider, and S. Zeitouni, "Automated synthesis of optimized circuits for secure computation," in Proc. 22nd ACM SIGSAC Conf. Comput. Commun. Secur., New York, NY, USA: Association for Computing Machinery, Oct. 2015, pp. 1504–1517, doi: 10.1145/2810103.2813678.

[26] vclint. (2022). Synopsys VC SpyGlass Lint. [Online]. Available: https://www.synopsys.com/verification/static-and-formal-verification/vc-spyglass/vc-spyglass-lint.html

[27] Jasperlint. (2022). Jasper Superlint App. [Online]. Available: https://www.cadence.com/en_US/home/tools/system-design-and-verification/formal-and-static-verification/jasper-gold-verification-platform/jaspergold-superlint-app.html

[28] G. Dessouky et al., "HardFails: Insights into software-exploitable hardware bugs," in Proc. USENIX Secur. Symp., 2019, pp. 213–230.

[29] M. Lipp et al. (2018). Meltdown: Reading Kernel Memory From User Space. [Online]. Available: https://www.usenix.org/conference/usenixsecurity18/presentation/lipp

[30] P. Kocher et al., "Spectre attacks: Exploiting speculative execution," in Proc. IEEE Symp. Secur. Privacy (SP), May 2019, pp. 1–19.

[31] T. M. Corp. CWE-1194: Hardware Design (4.1). [Online]. Available: https://cwe.mitre.org/data/definitions/1194.html

[32] A. Ardeshiricham, Y. Takashima, S. Gao, and R. Kastner, "VeriSketch: Synthesizing secure hardware designs with timing-sensitive information flow properties," in Proc. ACM SIGSAC Conf. Comput. Commun. Secur., New York, NY, USA: Association for Computing Machinery, Nov. 2019, pp. 1623–1638, doi: 10.1145/3319535.3354246.

[33] (2019). Hardware—OpenTitan Documentation. [Online]. Available: https://docs.opentitan.org/hw/

[34] HACK@EVENT. (2022). HACK@DAC21. [Online]. Available: https://hackatevent.org/hackdac21/

[35] (2020). ModelSim—Vivado Design Suite Reference Guide: Model-Based DSP Design Using System Generator (UG958). AMD Adaptive Computing Documentation Portal. [Online]. Available: https://docs.xilinx.com/r/en-U.S./ug958-vivado-sysgen-ref/ModelSim

[36] OpenAI. (2021). OpenAI Codex. [Online]. Available: https://openai.com/blog/openai-codex/

[37] E. Nijkamp et al., "CodeGen: An open large language model for code with multi-turn program synthesis," 2022, arXiv:2203.13474.

[38] W. Fu, K. Yang, R. G. Dutta, X. Guo, and G. Qu, "LLM4SecHW: Leveraging domain-specific large language model for hardware debugging," in Proc. Asian Hardw. Oriented Secur. Trust Symp. (AsianHOST), Dec. 2023, pp. 1–6.

Baleegh Ahmad (Graduate Student Member, IEEE) received the B.Sc. degree in electrical engineering from New York University Abu Dhabi, Abu Dhabi, United Arab Emirates, in 2020. He is currently pursuing the Ph.D. degree in electrical and computer engineering with the New York University Tandon School of Engineering, Brooklyn, NY, USA. His research interests include hardware security and verification, with a particular focus on bug detection and repair.
Shailja Thakur (Member, IEEE) received the Ph.D. degree from the University of Waterloo, Waterloo, ON, Canada, in 2022. She is currently a Post-Doctoral Research Associate with the Department of Electrical and Computer Engineering and the Center for Cybersecurity, New York University Tandon School of Engineering, Brooklyn, NY, USA. Her research interests include cyber-physical systems, electronic design automation, and large language models.

Benjamin Tan (Member, IEEE) received the B.E. degree (Hons.) in computer systems engineering and the Ph.D. degree from The University of Auckland, Auckland, New Zealand, in 2014 and 2019, respectively. In 2018, he was a Professional Teaching Fellow with the Department of Electrical and Computer Engineering, The University of Auckland. From 2019 to 2021, he was with New York University (NYU), Brooklyn, NY, USA, where he was a Post-Doctoral Associate and then a Research Assistant Professor, affiliated with the NYU Center for Cybersecurity. He is currently an Assistant Professor with the Department of Electrical and Software Engineering, University of Calgary, Calgary, AB, Canada. His research interests include computer engineering, hardware security, and electronic design automation.

Ramesh Karri (Fellow, IEEE) received the B.E. degree in electrical and computer engineering from Andhra University and the Ph.D. degree in computer science from the University of California San Diego. He is currently a Professor of electrical and computer engineering with the New York University (NYU) Tandon School of Engineering. He co-directs the NYU Center for Cyber Security, co-founded the Trust Hub, and founded the Embedded Systems Challenge, the Annual Red Team Blue Team Event. He has published over 300 articles in leading journals and conference proceedings. His research and education in hardware cybersecurity include trustworthy ICs, processors, and cyber-physical systems; security-aware computer-aided design, test, verification, and nano meets security; hardware security competitions, benchmarks, and metrics; and additive manufacturing security.

Hammond Pearce (Member, IEEE) received the B.E. (Hons.) and Ph.D. degrees in computer systems engineering from The University of Auckland, Auckland, New Zealand. From 2020 to 2023, he was with New York University, Brooklyn, NY, USA, where he was a Post-Doctoral Research Associate and then a Research Assistant Professor with the Department of Electrical and Computer Engineering and the NYU Center for Cybersecurity. He is currently a Lecturer with the School of Computer Science and Engineering, University of New South Wales, Sydney, NSW, Australia. His research interests include cybersecurity of embedded and industrial systems, including additive manufacturing and industrial informatics.