On Hardware Security Bug Code Fixes by Prompting Large Language Models

Abstract— Novel AI-based code-writing Large Language Models (LLMs) such as OpenAI's Codex have demonstrated capabilities in many coding-adjacent domains. In this work, we consider how LLMs may be leveraged to automatically repair identified security-relevant bugs present in hardware designs by generating replacement code. We focus on bug repair in code written in Verilog. For this study, we curate a corpus of domain-representative hardware security bugs. We then design and implement a framework to quantitatively evaluate the performance of any LLM tasked with fixing the specified bugs. The framework supports design space exploration of prompts (i.e., prompt engineering) and identifying the best parameters for the LLM. We show that an ensemble of LLMs can repair all fifteen of our benchmarks. This ensemble outperforms a state-of-the-art automated hardware bug repair tool on its own suite of bugs. These results show that LLMs have the ability to repair hardware security bugs and the framework is an important step towards the ultimate goal of an automated end-to-end bug repair tool.

Index Terms— Hardware security, large language models, bug repair.

I. INTRODUCTION

… relevant effort in this context thus far. Further efforts need to be made to support the automated repair of functional and security bugs in hardware. Unlike software bugs, security bugs in hardware are more problematic because they cannot be patched once the chip is fabricated; this is especially concerning as hardware is typically the root of trust for a system.

Large Language Models (LLMs) are neural networks trained over millions of lines of text and code [9]. LLMs that are fine-tuned over open-source code repositories can generate code, where a user "prompts" the LLM with some text (e.g., code and comments) to guide the code generation. In contrast to previous code repair techniques that involve mutation, repeated checks against an "oracle," or source code templates, we propose that an LLM trained on code and natural language could potentially generate fixes, given an appropriate prompt that could draw from a designer's expertise. As LLMs are exposed to a wide variety of code examples during training, they should be able to assist designers in fixing bugs in different types of hardware designs and styles, with natural language guidance. One distinction is that LLMs do not need …
• An exploration of bugs and LLM parameters (model, temperature, prompt) to see how to use LLMs in repair. They are posed as research questions answered in Section V.
• A demonstration of how the repair mechanism could be coupled with bug detectors to form an end-to-end solution to detect and repair bugs. This is presented in Section VII.

II. BACKGROUND AND RELATED WORK

To provide context for our work, we discuss some overarching concepts in Section II-A. We present some differences between our work and other efforts in Section II-B.

A. Background

The code repair problem is well-explored in the software domain. Software code repair techniques continue to evolve (interested readers can see Monperrus' living review [17]). Generally, techniques try to fix errors by using program mutations and repair templates paired with tests to validate any changes [18], [19]. Feedback loops are constructed with a reference implementation to guide the repair process [20]. Other domain-specific tools may also be built to deal with areas like build scripts, web, and software models.

Security bugs are defects that can lead to vulnerable systems. While functional bugs can be detected using classical testing, security bugs are more difficult to detect, and proving their presence or absence is challenging. This has led to more "creative" kinds of bug repair, including AI-based machine-learning techniques such as neural transfer learning [21] and example-based approaches [22]. ML-based approaches, including neural networks, allow a greater ability to suggest repairs for "unseen" code. Example-based approaches start off with a dataset comprising pairs of bugs and their repairs. Then, matching algorithms are applied to spot the best repair candidate from the dataset. Efforts in repair are also explored in other domains like recompilable decompiled code [23]. These approaches give credence to the ability of neural networks to learn from a larger set of correct code and inform repair on an instance of incorrect code.

We focus on hardware bugs originating at the Register-Transfer Level (RTL) stage. RTL designs, typically coded in HDLs such as Verilog, are high-level behavioral descriptions of hardware circuits specifying how data is transformed, transferred, and stored. RTL logic features two types of elements, sequential and combinational. Sequential elements (e.g., registers, counters, RAMs) tend to synchronize the circuit according to clock edges and retain values using memory components. Combinational logic (e.g., simple combinations of gates) changes its outputs near-instantaneously according to the inputs. While software code describes programs that will be executed from beginning to end, RTL specified in HDL describes components that run independently in parallel. Like software, hardware designs have security bugs. By definition, RTL is insecure if the security objectives of the circuit are unmet. These may include confidentiality and integrity requirements [24]. Confidentiality is violated if data that should not be seen/read under certain conditions is exposed. For example, improper memory protection allows encryption keys to be read by user code. Integrity is violated if data that should not be modifiable under certain conditions is modifiable. For example, user code can write into registers that specify the access control policy. Secure computation is also a concern, and the synthesis and optimization of secure circuits starts with the description of designs with HDLs [25].

Linters [26], [27] and formal verification tools [3], [4] cover a large proportion of functional bugs. Although formal verification tools like Synopsys FSV can be used for security verification in the design process, they can sometimes have limited success [28]. With the ever-growing complexity of modern processors, software-exploitable hardware bugs are becoming pernicious [29], [30]. This has resulted in the exploration of many detection techniques such as fuzzing [5], information flow tracking [6], unique program execution checking [7] and static analysis [2].

Security-related issues that arise because of bugs in hardware are taxonomized in the form of Common Weakness Enumerations (CWEs). MITRE [31] is a not-for-profit that works with academia and industry to develop a list of CWEs that represent categories of vulnerabilities in hardware and software. A weakness is an element in a digital product's software, firmware, hardware, or service that can be exploited for malicious purposes. The CWE list provides a general taxonomy and categorization of these elements that allows a common language to be used for discussion. It helps developers and researchers search for the existence of these weaknesses in their designs and compare the various tools they use to detect vulnerabilities in their designs and products.

We select a subset of the hardware CWEs based on their clarity of explanation and relevance to RTL. A large subset of CWEs are not related to RTL as they cover post-silicon issues or outdated technologies or firmware. For some of the remaining CWEs, the descriptions can be vague and imprecise, making it difficult to reason about their presence with a great degree of confidence. In this work, we focus on the following CWEs:

1234: Hardware Internal or Debug Modes Allow Override of Locks. System configuration controls, e.g., memory protection, are set after a power reset and then locked to prevent modification. This is done using a lock-bit signal. If the system allows debugging operations and the lock-bit can be overridden in a debug mode, the system configuration controls are not properly protected.

1271: Uninitialized Value on Reset for Registers Holding Security Settings. Security-critical information stored in registers should have a known value when being brought out of reset. If that is not the case, these registers may have unknown values that put the system in a vulnerable state.

1280: Access Control Check Implemented After Asset is Accessed. Access control checks are required in hardware before security-sensitive assets like keys are accessed. If this check is implemented after the access, it is useless.

1276: Hardware Child Block Incorrectly Connected to Parent System. If an input is incorrectly connected, it affects security attributes like resets while maintaining correct function, and the integrity of the data of the child block can be violated.
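As a concrete illustration of the first of these weaknesses, consider the following minimal Verilog sketch of CWE-1234. The module, port names, and widths are our own illustrative assumptions, not code from the benchmark designs:

```verilog
// Sketch of CWE-1234: a debug-mode signal overrides the lock bit
// that should protect a configuration register after reset.
module lock_override_sketch (
  input  wire        clk,
  input  wire        lock_status,     // 1 = register is locked
  input  wire        debug_unlocked,  // asserted in debug mode
  input  wire        wen,
  input  wire [15:0] d,
  output reg  [15:0] q
);
  always @(posedge clk) begin
    // BUG: debug mode bypasses the lock, so locked contents can change.
    if (wen & (~lock_status | debug_unlocked))
      q <= d;
    // FIX: drop the override so the lock always wins:
    //   if (wen & ~lock_status) q <= d;
  end
endmodule
```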
TABLE II
Bugs overview. We assign a CWE to each bug and give a description of the design.
1) Locked Register: This design has a register that is protected by a lock bit. The contents of the register may only be changed when the lock_status bit is low. In Figure 1(a), a debug_unlocked signal overrides the lock_status signal, allowing the locked register to be written into even if lock_status is asserted.

2) Lock on Reset: This design has a register that holds sensitive information. This register should be assigned a known value on reset. In Figure 1(b), the locked register should have a value on reset, but in this case, there is no reset.

3) Grant Access: This design has a register that should only be modifiable if the usr_id input is correct. In Figure 1(c), the register data_out is assigned a new value if the grant_access signal is asserted. This should happen when usr_id is correct, but since the check happens after writing into data_out in blocking assignments, data_out may be modified when the usr_id is incorrect (see the sketch at the end of this section).

4) Trustzone Peripheral: This design contains a peripheral instantiated in an SoC. To distinguish between trusted and untrusted entities, a signal is used to assign the security level of the peripheral. This is also described as a privilege bit, used in Arm TrustZone to define the security level of all connected IPs. In Figure 1(d), the security level of the instantiated peripheral is grounded to zero, which could lead to incorrect privilege escalation of all input data.

B. Google's OpenTitan

OpenTitan is an open-source project to develop a silicon root of trust with implementations of SoC security measures. Since OpenTitan does not have declared bugs, we inject bugs by tweaking the RTL of these security measures in different modules. These are measures implemented in the HDL code that mitigate attacks on assets in security-critical Intellectual Properties (IPs). The OpenTitan taxonomy presents a countermeasure in the following form: [UNIQUIFIER.]ASSET.CM_TYPE. Here, ASSET is the element that is being protected, e.g., a key or the internal states of a processor's control flow. Each protection mechanism is named with CM_TYPE, e.g., multi-bit encoded signal, scrambled asset, or access to asset limited according to life-cycle state. The UNIQUIFIER is a custom prefix label to make the identifier unique after identifying the IP. The bugs we produced using these countermeasures and their corresponding fixes are shown in Figure 2.

1) ROM Control: This design contains a module that acts as an interface between the Read Only Memory (ROM) and the system bus. The ROM has scrambled contents, and the controller descrambles the content while serving memory requests. We target the COMPARE.CTRL_FLOW.CONSISTENCY security measure in the rom_ctrl_compare module. Here, the asset is CTRL_FLOW, referring to the control flow of the ROM Control module. The countermeasure is CONSISTENCY, checking consistency of the control flow other than by associating integrity bits. A part of this measure is that the start_i signal should only be asserted in the Waiting state; otherwise, an alert signal is asserted. In Figure 2(a), because of our induced bug, the alert signal is incorrectly asserted when start_i is high in any state other than Waiting.
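Returning to design 3 (Grant Access) from the MITRE examples above, the following minimal Verilog sketch shows the bug and one possible repair. The always-block structure mirrors the description above, while the module interface and the expected usr_id constant are our own illustrative assumptions rather than the exact Figure 1(c) code:

```verilog
// Sketch of design 3 (Grant Access), CWE-1280. Blocking assignments
// execute in order, so the write uses a stale grant_access value that
// was computed before the current usr_id check.
module grant_access_sketch (
  input  wire [7:0] usr_id,
  input  wire [7:0] data_in,
  output reg  [7:0] data_out
);
  reg grant_access;
  always @(*) begin
    // BUG: asset is written before the access-control check runs.
    if (grant_access) data_out = data_in;
    grant_access = (usr_id == 8'h5A);
  end
  // FIX seen in successful repairs: flip the order of the blocking
  // assignments, or fold them into a single ternary assignment:
  //   grant_access = (usr_id == 8'h5A);
  //   data_out = grant_access ? data_in : data_out;
endmodule
```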
Fig. 2. OpenTitan bugs: The repair (green) replaces the bug (red) for a
successful fix.
Fig. 4. Our experimental framework has 3 components: Sources with bugs, LLM-based Repair Generator to create fixes, and Evaluator to verify repairs.
TABLE III
Instruction variations. We develop 5 types to assist repair of bugs. Variation a is the base variation with no assistance. The level of detail/assistance increases from variation a to e.
TABLE IV
Details of instruction variations and stop keywords used. The same bug instruction is used for variations b, c, d, shown in column 2. In case of variation e, this bug instruction (in column 2) is appended by an example of a bug and its repair in comments, shown in column 3. Fix instructions for variations c and d precede the string "FIX:", shown in columns 4 and 5, respectively. Additional stop keywords that terminate the further generation of tokens by LLMs are shown in column 6.
on millions of public GitHub repositories. They can ingest and generate code, and also translate natural language to code. We use/evaluate the gpt-3.5-turbo, gpt-4, code-davinci-001, code-davinci-002 and code-cushman-001 models. From Hugging Face, we evaluate the model CodeGen-16B-multi, which we refer to as CodeGen. It is an autoregressive LLM for program synthesis trained sequentially on The Pile and BigQuery. We also evaluate the fine-tuned version of CodeGen, trained over a Verilog corpus comprising open-source Verilog code from GitHub repositories [10], referred to as VGen.

4) Number of Lines Before Bug: Another parameter to consider in prompt preparation is the number of lines of existing code given to the LLM. Some files may be too large for the entire code before the bug to be sent to the LLM. We therefore select a minimum of 25 and a maximum of 50 lines of code before the bug as part of the prompt. In Figure 5(b), this would be lines 1–5 (inclusive). If there are more than 25 lines above the bug, we include enough lines to reach the beginning of the block the bug is in. This block could be an always block, module, or case statement, etc. If the bug is too large, however, the lines before the bug plus the bug itself may exceed the token limit of the LLM, and the proposed scheme will not be able to repair it. We did not run into this problem in our work.

5) Stop Keywords: Stop keywords terminate token generation and are not included in the response. We developed a strategy that works well with our set of bugs. The default stop keyword is endmodule. The keywords used are listed in the Stop keywords column of Table IV.

V. EXPERIMENTAL RESULTS

We set up our experimental framework for each LLM, generating 20 responses for every combination of bug, temperature, and instruction variation. The responses are counted as successful repairs if they pass functional and security tests. The number of successful repairs is shown as heatmaps in Figure 6. The maximum value for each element is 20, i.e., when all responses were successful repairs.

A. RQ1: Can Out-of-the-Box LLMs Fix Hardware Security Bugs?

Results show that LLMs can repair simple security bugs. gpt-4, code-davinci-002, and code-cushman-001 yielded at least one successful repair for every bug in our dataset. code-davinci-001, gpt-3.5-turbo, CodeGen and VGen were successful for 14, 13, 11 and 10 out of 15 bugs. In total, we requested 52,500 repairs, of which 15,063 were correct, a success rate of 28.7%. The key lies in selecting the best-observed parameters for each LLM. code-davinci-002 performs best at variation e, temperature 0.1, producing 69% correct repairs. gpt-4, gpt-3.5-turbo, code-davinci-001, code-cushman-001, CodeGen and VGen perform best at (e, 0.5), (d, 0.1), (d, 0.1), (d, 0.1), (e, 0.3) and (c, 0.3), with success rates of 67%, 44%, 53%, 51%, 17% and 8.3%, respectively. Performance of these LLMs across bugs is shown in Figure 7.

B. RQ2: How Important Are Prompt Details?

The 5 instruction variations from a to e increase in the level of detail. Apart from CodeGen and VGen, the LLMs do better with more detail when generating a repair, as shown in Figure 8. Variations c–e perform better than variations a and b. They include a fix instruction after the buggy code in comments, giving credence to the use of two separate instructions per prompt (one before and one after the bug in comments). Variation d has the highest success rate among OpenAI LLMs and is therefore our recommendation for bug fixes. The use of a fix instruction in "pseudo-code" (designer intent using mostly natural language) leads to the best results. There is variation within LLMs for the best-observed instruction variation, e.g., gpt-4, code-davinci-002 and CodeGen perform best at e. Excluding the results of CodeGen and VGen, because they perform very poorly, the success rates for variations a–e across OpenAI models change by 20%, 41%, 11% and −14%, respectively, for each successive variation. As an example, going from variation a to b yields 20% more successful repairs, and going from b to c yields 41% more successful repairs. From these numbers, the most significant jump is going from b to c, showing the importance of including a Fix Instruction in the prompt. We also observe that a coded example of a repair in the form of variation e decreases the success rates of OpenAI LLMs. Instructions with natural language guidance do better than coded examples.

C. RQ3: What Bugs Appear Amenable to Repair?

The cumulative number of correct repairs for each bug for OpenAI LLMs is shown in Figure 9. Bugs 3 and 4 were the best candidates for repair, with success rates of over 75%. These are examples from MITRE where the signal names indicate their intended purposes. For the Grant Access module, the signals of concern are grant_access and usr_id, used in successive lines. LLMs preserved the functionality that the usr_id should be compared before granting access. Most successful repairs either flipped the order of blocking assignments or lumped them into an assignment using the ternary operator. Similarly, Trustzone Peripheral uses the signal names data_in_security_level and rdata_security_level, which illustrate their function. Bugs 5, 6 and 12 were the hardest to repair, with success rates < 10%. Bug 6 was the toughest to repair because of the specificity required for the correct case statement. A correct repair would require all 32 possibilities of the security signal to be correctly translated to the 4 possible values of the output security signal. Bug 5 was difficult to repair because the models refused to acknowledge that a glitch existed and kept generating the same code as the bug. On many occasions gpt-4 produced variations of the following comment accompanying the code it generated: "No bug here, already correct." Bug 12 was the only bug that required a line to be removed without replacement as a fix. Bugs 8 and 14 were moderately difficult to repair, with success rates over 10% but less than 20%. Bug 8 proved difficult because of its complexity: the bug spanned 20 lines and a typical repair required 4 if statements. Bug 14 had the bug of a register
Fig. 6. Results for all LLM, temperature, and instruction variation configurations, represented as heatmaps. The maximum value for each small box is 20. A higher value indicates more success by the LLM in generating repairs and is highlighted with a darker shade. All bugs were repaired at least once by at least one LLM.
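To make the request-generation setup of Section IV concrete, the following is a minimal sketch of how one batch of candidate repairs could be requested. It assumes the legacy (pre-1.0) openai Python library used with the Codex-era models, and the prompt text is our own paraphrase of a variation-d-style prompt (bug instruction before the buggy code, fix instruction after it), not the exact text of Table IV:

```python
import openai  # legacy pre-1.0 client; reads OPENAI_API_KEY from the environment

# Prompt: 25-50 lines of preceding code, the buggy code in comments with a
# bug instruction, then a pseudo-code fix instruction followed by "FIX:".
prompt = """\
module locked_register ( ... );
  // BUG: debug_unlocked overrides the lock_status signal.
  //   if (write_en & (~lock_status | debug_unlocked))
  //     data <= data_in;
  // FIX: only allow the write when lock_status is low.
  // FIX:
"""

response = openai.Completion.create(
    model="code-davinci-002",
    prompt=prompt,
    temperature=0.1,     # best-observed temperature for code-davinci-002
    max_tokens=200,
    n=20,                # 20 candidate responses per configuration
    stop=["endmodule"],  # default stop keyword; see Table IV for others
)
candidates = [choice.text for choice in response.choices]
```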
TABLE V
Comparison on CirFix benchmarks. A successful repair is shown as Y. We use two instruction variations for this comparison. An element "- | Y" means that the repair using variation a was not successful but using variation b was. The element 1/2 means that 2 errors were used in the description of a single fault/bug and 1 out of 2 was repaired.

… about identifiers, types, values, and conditions. The ASTs are traversed using keywords and patterns to indicate potential vulnerabilities in CWEs 1234, 1271, and 1245. We ran this tool over the Hack@DAC 2021 SoC and selected three instances, one per CWE, for the purposes of this paper. These are bugs 13, 14, and 15 in Table II. We use the same tool for security evaluation of the generated responses. We replace the buggy code with the repaired code in the SoC and run the tool again. If the same bug is picked up, i.e., the same location and CWE, we can determine that the repair is not successful. If that is not the case, we infer that the repair is adequate (i.e., the bug was removed). We produced results in Figure 6 for bugs 13, 14, and 15 using this end-to-end solution.
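Sketched below is the detect-repair-recheck loop just described; run_cweat, generate_repairs, and splice are hypothetical wrappers around the CWEAT analysis, an LLM completion endpoint, and simple text patching, respectively:

```python
# Minimal sketch of the end-to-end detect-repair-recheck loop.
def end_to_end_repair(soc_src: str, n_candidates: int = 20) -> list[str]:
    findings = run_cweat(soc_src)  # hypothetical: (location, CWE) findings
    accepted = []
    for bug in findings:
        for candidate in generate_repairs(soc_src, bug, n=n_candidates):
            patched = splice(soc_src, bug.location, candidate)
            # A repair is judged adequate only if the *same* finding
            # (same location and CWE) no longer appears on re-analysis.
            if bug not in run_cweat(patched):
                accepted.append(candidate)
    return accepted
```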
We envision RTL designers using this (or a similar) LLM-infused end-to-end solution as they write HDL code in the early stages of design. CWEAT can highlight a weakness to the designer, run it through the LLM to produce repairs, choose the ones that are secure, and present suggestions to the designer. Detection and repair can be treated as separate tasks and implemented using separate tools. A range of tools may be used for detection, e.g., commercial linting tools, hardware fuzzers, information flow tracking, and formal verification. The bugs found may be repaired by methods including LLMs and CirFix. This hybrid approach is likely to detect the most bugs and produce the most successful repairs.
… the LLM is suggesting an incorrect repair, the detector will identify the repair as incorrect. In this scenario, if an LLM is able to produce even one correct repair for a bug, it will be able to find a successful fix.

A limitation of our study is the informal instruction variations. Although Bug instructions are inspired by the descriptions in CWEs, our Fix instructions are devised according to the experience of the authors. Our work reveals the importance of these variations, as subtle changes can affect the LLM response quality. Devising 5 categories is an attempt to systematize this process, but future work can explore more varieties. Moreover, instructions are challenging to generalize across different bugs. Ideally, a designer would want variation a to fix all bugs because no instructions are needed.

Another limitation of our study is that the functional and security evaluations are not exhaustive. Security evaluation is dependent on design-specific security objectives and cannot be exhaustive. With this in mind, we limit the security evaluation to the bug that makes the design insecure. Functional evaluation is needed because a design that is secure but not functional is useless. For the CWE examples, we were able to build exhaustive testbenches because the designs were low in complexity and had only one or two modules. Ideal functional testbenches should be exhaustive for the other examples too, but this is impractical. It would be a difficult task to write testbenches for these complex SoCs and to simulate the designs using the software provided by OpenTitan and Hack@DAC. It takes an hour to exhaustively simulate an IP on OpenTitan. Since we generate 3500 potential repairs for each bug, this would take 150 days for each bug. Therefore, we chose to build custom testbenches that test the code a repair could impact.

The choice of end-tokens influences the success rate of repairs. Some strategies are intuitive, like using the end-of-line token as an end token for a bug that is present in only one line. Others may require more creativity because some lines of code can be written in multiple ways. An LLM might not generate a repair that spans multiple conditional statements, e.g., …

… the best suggestion as the repair. In our experiments, we faced some challenges because of the token limits set by the OpenAI API. Since we were generating thousands of requests with a limited number of token keys, we had to wait for a minute every time we reached the limit. This raised our repair-generation time to ∼20 minutes per LLM.

IX. CONCLUSION AND FUTURE WORK

By selecting appropriate parameters and prompts, LLMs can effectively fix hardware bugs in our dataset. Each bug had at least one successful repair, and 13 of the 15 had perfect responses given the best parameter set. LLMs excel when signal names and comments indicate functionality but struggle with multi-line fixes or when a buggy line must be removed. Providing detailed, pseudo-code-like instructions improves repair success rates. Bigger LLMs and LLMs at lower temperatures perform better than those with fewer parameters and at higher temperatures. LLMs outperform CirFix in fixing function-related bugs in Verilog, even when detailed instructions are not provided. We suggest the following areas for future research:
• Employ a hybrid method for security bug detection with linters, formal verification, fuzzing, fault localization, and static analysis tools. For repair, use LLMs and oracle-guided modifying algorithms. Combining techniques is likely to yield better results than using just one.
• Fine-tune LLMs over HDLs and see if their performance improves. This improves the quality of functional code [10].
• Explore the repair of functional bugs using LLMs with the full sweep of parameters. We only used the one set of parameters that performed best in our experiments.

APPENDIX
COMPUTE ENVIRONMENT

All experiments were conducted on an Intel Core i5-10400T CPU @ 2 GHz × 12 processor with 16 GB RAM, using Ubuntu 20.04.5 LTS.
ACKNOWLEDGMENT

The authors would like to thank Verific Design Automation for generously providing academic access to linkable libraries, examples, and documentation for their RTL parsers. This work does not in any way constitute an Intel endorsement of a product or supplier.

REFERENCES

[1] C. L. Goues, M. Pradel, and A. Roychoudhury, "Automated program repair," Commun. ACM, vol. 62, no. 12, pp. 56–65, Nov. 2019, doi: 10.1145/3318162.
[2] B. Ahmad et al., "Don't CWEAT it: Toward CWE analysis techniques in early stages of hardware design," in Proc. IEEE/ACM Int. Conf. Comput. Aided Design (ICCAD), New York, NY, USA: Association for Computing Machinery, Oct. 2022, pp. 1–9.
[3] (2022). VC Formal. [Online]. Available: https://www.synopsys.com/verification/static-and-formal-verification/vc-formal.html
[4] Cadence. (2022). Jasper RTL Apps. [Online]. Available: https://www.cadence.com/en_US/home/tools/system-design-and-verification/formal-and-static-verification/jasper-gold-verification-platform.html
[5] T. Trippel, K. G. Shin, A. Chernyakhovsky, G. Kelly, D. Rizzo, and M. Hicks, "Fuzzing hardware like software," 2021, arXiv:2102.02308.
[6] J. Wu et al., "Fault localization for hardware design code with time-aware program spectrum," in Proc. IEEE 40th Int. Conf. Comput. Design (ICCD), Oct. 2022, pp. 537–544.
[7] M. R. Fadiheh, D. Stoffel, C. Barrett, S. Mitra, and W. Kunz, "Processor hardware security vulnerabilities and their detection by unique program execution checking," in Proc. Design, Autom. Test Eur. Conf. Exhib. (DATE), Mar. 2019, pp. 994–999.
[8] H. Ahmad, Y. Huang, and W. Weimer, "CirFix: Automatically repairing defects in hardware design code," in Proc. 27th ACM Int. Conf. Architectural Support Program. Lang. Operating Syst., New York, NY, USA: Association for Computing Machinery, Feb. 2022, pp. 990–1003, doi: 10.1145/3503222.3507763.
[9] M. Chen et al., "Evaluating large language models trained on code," 2021, arXiv:2107.03374.
[10] S. Thakur et al., "Benchmarking large language models for automated Verilog RTL code generation," in Proc. Design, Autom. Test Eur. Conf. Exhibition (DATE), Apr. 2023, pp. 1–6.
[11] H. Pearce, B. Tan, and R. Karri, "DAVE: Deriving automatically Verilog from English," in Proc. ACM/IEEE 2nd Workshop Mach. Learn. CAD (MLCAD), Nov. 2020, pp. 27–32.
[12] M. Tufano, C. Watson, G. Bavota, M. D. Penta, M. White, and D. Poshyvanyk, "An empirical study on learning bug-fixing patches in the wild via neural machine translation," ACM Trans. Softw. Eng. Methodol., vol. 28, no. 4, pp. 1–29, Sep. 2019, doi: 10.1145/3340544.
[13] D. Drain, C. Wu, A. Svyatkovskiy, and N. Sundaresan, "Generating bug-fixes using pretrained transformers," in Proc. 5th ACM SIGPLAN Int. Symp. Mach. Program., Jun. 2021, pp. 1–8.
[14] C. S. Xia, Y. Wei, and L. Zhang, "Automated program repair in the era of large pre-trained language models," in Proc. IEEE/ACM 45th Int. Conf. Softw. Eng. (ICSE), May 2023, pp. 1482–1494.
[15] H. Pearce, B. Tan, B. Ahmad, R. Karri, and B. Dolan-Gavitt, "Examining zero-shot vulnerability repair with large language models," in Proc. IEEE Symp. Secur. Privacy (SP), May 2023, pp. 2339–2356.
[16] A. F. Rev. (2023). Artifacts for "On Hardware Security Bug Code Fixes by Querying Large Language Models." [Online]. Available: https://zenodo.org/records/10416865
[17] M. Monperrus, The Living Review on Automated Program Repair, document hal-01956501, HAL Arch. Ouvertes, 2018.
[18] W. Wang, Z. Meng, Z. Wang, S. Liu, and J. Hao, "LoopFix: An approach to automatic repair of buggy loops," J. Syst. Softw., vol. 156, pp. 100–112, Oct. 2019, doi: 10.1016/j.jss.2019.06.076.
[19] X. D. Le and Q. L. Le, "ReFixar: Multi-version reasoning for automated repair of regression errors," in Proc. IEEE 32nd Int. Symp. Softw. Rel. Eng. (ISSRE), Oct. 2021, pp. 162–172.
[20] Y. Lu, N. Meng, and W. Li, "FAPR: Fast and accurate program repair for introductory programming courses," 2021, arXiv:2107.06550.
[21] Z. Chen, S. Kommrusch, and M. Monperrus, "Neural transfer learning for repairing security vulnerabilities in C code," IEEE Trans. Softw. Eng., vol. 49, no. 1, pp. 147–165, Jan. 2023.
[22] S. Ma, F. Thung, D. Lo, C. Sun, and R. H. Deng, "VuRLE: Automatic vulnerability detection and repair by learning from examples," in Computer Security—ESORICS. Cham, Switzerland: Springer, 2017, pp. 229–246.
[23] P. Reiter, H. J. Tay, W. Weimer, A. Doupé, R. Wang, and S. Forrest, "Automatically mitigating vulnerabilities in binary programs via partially recompilable decompilation," 2022, arXiv:2202.12336.
[24] N. Potlapally, "Hardware security in practice: Challenges and opportunities," in Proc. IEEE Int. Symp. Hardware-Oriented Secur. Trust, Jun. 2011, pp. 93–98.
[25] D. Demmler, G. Dessouky, F. Koushanfar, A.-R. Sadeghi, T. Schneider, and S. Zeitouni, "Automated synthesis of optimized circuits for secure computation," in Proc. 22nd ACM SIGSAC Conf. Comput. Commun. Secur., New York, NY, USA: Association for Computing Machinery, Oct. 2015, pp. 1504–1517, doi: 10.1145/2810103.2813678.
[26] (2022). Synopsys VC SpyGlass Lint. [Online]. Available: https://www.synopsys.com/verification/static-and-formal-verification/vc-spyglass/vc-spyglass-lint.html
[27] (2022). Jasper Superlint App. [Online]. Available: https://www.cadence.com/en_US/home/tools/system-design-and-verification/formal-and-static-verification/jasper-gold-verification-platform/jaspergold-superlint-app.html
[28] G. Dessouky et al., "HardFails: Insights into software-exploitable hardware bugs," in Proc. USENIX Secur. Symp., 2019, pp. 213–230.
[29] M. Lipp et al. (2018). Meltdown: Reading Kernel Memory From User Space. [Online]. Available: https://www.usenix.org/conference/usenixsecurity18/presentation/lipp
[30] P. Kocher et al., "Spectre attacks: Exploiting speculative execution," in Proc. IEEE Symp. Secur. Privacy (SP), May 2019, pp. 1–19.
[31] The MITRE Corporation. CWE-1194: Hardware Design (4.1). [Online]. Available: https://cwe.mitre.org/data/definitions/1194.html
[32] A. Ardeshiricham, Y. Takashima, S. Gao, and R. Kastner, "VeriSketch: Synthesizing secure hardware designs with timing-sensitive information flow properties," in Proc. ACM SIGSAC Conf. Comput. Commun. Secur., New York, NY, USA: Association for Computing Machinery, Nov. 2019, pp. 1623–1638, doi: 10.1145/3319535.3354246.
[33] (2019). Hardware — OpenTitan Documentation. [Online]. Available: https://docs.opentitan.org/hw/
[34] HACK@EVENT. (2022). HACK@DAC21. [Online]. Available: https://hackatevent.org/hackdac21/
[35] (2020). ModelSim. Vivado Design Suite Reference Guide: Model-Based DSP Design Using System Generator (UG958). AMD Adaptive Computing Documentation Portal. [Online]. Available: https://docs.xilinx.com/r/en-U.S./ug958-vivado-sysgen-ref/ModelSim
[36] OpenAI. (2021). OpenAI Codex. [Online]. Available: https://openai.com/blog/openai-codex/
[37] E. Nijkamp et al., "CodeGen: An open large language model for code with multi-turn program synthesis," 2022, arXiv:2203.13474.
[38] W. Fu, K. Yang, R. G. Dutta, X. Guo, and G. Qu, "LLM4SecHW: Leveraging domain-specific large language model for hardware debugging," in Proc. Asian Hardw. Oriented Secur. Trust Symp. (AsianHOST), Dec. 2023, pp. 1–6.

Baleegh Ahmad (Graduate Student Member, IEEE) received the B.Sc. degree in electrical engineering from New York University Abu Dhabi, Abu Dhabi, United Arab Emirates, in 2020. He is currently pursuing the Ph.D. degree in electrical and computer engineering with the New York University Tandon School of Engineering, Brooklyn, NY, USA. His research interests include hardware security and verification with a particular focus on bug detection and repair.
Shailja Thakur (Member, IEEE) received the Ph.D. degree from the University of Waterloo, Waterloo, ON, Canada, in 2022. She is currently a Post-Doctoral Research Associate with the Department of Electrical and Computer Engineering and the Center for Cybersecurity, New York University Tandon School of Engineering, Brooklyn, NY, USA. Her research interests include cyber-physical systems, electronic design automation, and large language models.

Ramesh Karri (Fellow, IEEE) received the B.E. degree in electrical and computer engineering from Andhra University and the Ph.D. degree in computer science from the University of California San Diego. He is currently a Professor of electrical and computer engineering with New York University (NYU) Tandon School of Engineering. He co-directs the NYU Center for Cyber Security, co-founded the Trust Hub, and founded the Embedded Systems Challenge, the Annual Red Team Blue Team Event. He has published over 300 articles in leading journals and conference proceedings. His research and education in hardware cybersecurity include trustworthy ICs, processors, and cyber-physical systems; security-aware computer-aided design, test, verification, and nano meets security; hardware security competitions, benchmarks, and metrics; and additive manufacturing security.