
Invited Paper: Software/Hardware Co-design for LLM and Its Application for Design Verification

Lily Jiaxin Wan1*, Yingbing Huang1*, Yuhong Li1, Hanchen Ye1, Jinghua Wang1, Xiaofan Zhang2, Deming Chen1
1University of Illinois Urbana-Champaign, 2Google
{wan25, yh21, leeyh, hanchen8, jinghua3, dchen}@illinois.edu, [email protected]
* These authors contributed equally to this work.

Abstract—The widespread adoption of Large Language Models (LLMs) is impeded by their demanding compute and memory resources. The first task of this paper is to explore optimization strategies to expedite LLMs, including quantization, pruning, and operation-level optimizations. One unique direction is to optimize LLM inference through novel software/hardware co-design methods. Given the accelerated LLMs, the second task of this paper is to study LLMs' performance in the usage scenario of circuit design and verification. Specifically, we place a particular emphasis on functional verification. Through automated prompt engineering, we harness the capabilities of the established LLM, GPT-4, to generate High-Level Synthesis (HLS) designs with predefined errors based on 11 open-source synthesizable HLS benchmark suites. This dataset is a comprehensive collection of over 1000 function-level designs, each of which is afflicted with up to 45 distinct combinations of defects injected into the source code. This dataset, named Chrysalis, expands upon what's available in current HLS error models, offering a rich resource for training to improve how LLMs debug code. The dataset can be accessed at: https://github.com/UIUC-ChenLab/Chrysalis-HLS.

Index Terms—Large Language Models, software/hardware co-design, functional verification
I. INTRODUCTION

The rapid evolution of machine learning, particularly through the advancements in neural network architectures, has precipitated significant breakthroughs in diverse fields, encompassing computer vision [1] and natural language processing [2]. Among various neural network designs, the transformer architecture [3] stands out, offering unparalleled performance on sequence-to-sequence tasks. Instead of using traditional recurrent layers, this innovative structure harnesses the power of the attention mechanism. The transformer model serves as the foundation for the emergence of Large Language Models (LLMs) such as OpenAI's GPT series [4], Meta's LLaMA [5], and Google's BARD [6]. Encompassing billions of parameters and informed by extensive textual datasets sourced from the internet, these models possess the capability to interpret human-generated text and yield contextually pertinent and logically coherent outputs to a wide spectrum of prompts.

In the domain of Electronic Design Automation (EDA), the potential application of LLMs is becoming increasingly evident. They are poised to support hardware engineers in various hardware design stages, ranging from the initial conception and verification to optimization and coordination of the complete design flow. Chip-Chat [7] established a set of eight foundational benchmarks, aiming to delineate both the capabilities and limitations of current state-of-the-art (SOTA) LLMs in hardware design. Concurrently, there is a trend in academic circles towards the development of robust benchmark frameworks to rigorously gauge LLM performance. For instance, [8] introduced an open-source suite comprising 30 intricate hardware designs and enhanced the quality of LLMs' feedback through advanced prompt engineering techniques. In a related effort, Thakur et al. [9] fine-tuned the CodeGen model [10] using 17 Verilog codes. Additional research has focused on dataset construction through practical RTL tutorial exercises. For instance, VerilogEval [11] developed a comprehensive benchmark framework that includes 156 problems derived from the educational HDLBits platform. Moreover, this framework was employed to fine-tune the CodeGen model [10], significantly improving its proficiency in generating RTL codes. In a distinct approach, ChatEDA [12] was architected to generate codes capable of navigating EDA tools based on natural language cues.

While LLMs benefit from a vast number of parameters, they also grapple with challenges related to sparsity and computational overhead. Because of the extensive usage of LLMs in time-sensitive applications like Internet of Things (IoT) devices, it is essential to ensure that LLMs deliver optimal inference performance without compromising their multi-task solving and language generation ability. In this paper, we explore several cutting-edge optimization techniques for LLM inference, with a focus on quantization and pruning. Numerous studies have highlighted the potential advantages of these methods individually, including activation outlier handling and structural and contextual sparsity reduction. It is still challenging to achieve consistent optimization benefits across diverse LLMs and ensure adaptability in real-world scenarios. Thus, we point out a potential solution in software/hardware co-design for this challenge, inspired by its proven power and effectiveness in the areas of Deep Neural Networks (DNNs) [13]–[16], Graph Neural Networks (GNNs) [17], and conventional machine-learning solutions [18].

Overall, this paper introduces a unified optimization framework for LLMs, with a particular emphasis on functional verification in EDA, with the following contributions:
• We study the integration of both LLM quantization and pruning techniques, yielding greater benefits than their standalone applications while compensating for their respective limitations. Additionally, we anticipate the potential integration of this approach with domain-specific hardware accelerators.
• We pioneer an innovative methodology that harnesses the capabilities of GPT-4 to inject bugs into HLS codes. We design a set of tailored prompts to guide GPT-4 in generating consistent and compliant buggy codes within the EDA domain.
• Leveraging the above methodologies, we create the Chrysalis dataset that includes both correct source codes and intentionally injected buggy codes. This dataset is meticulously organized, comprising over 1000 function-level designs sourced from 11 open-source synthesizable HLS benchmark suites. Each design undergoes a controlled injection of up to 45 distinct combinations of bugs. It represents an indispensable tool for the assessment and refinement of LLM-based HLS domain-specific debugging assistants.

Fig. 1. Time consumption breakdown: a Llama-2-7b [5] decoder layer during training. Attention significantly dominates the latency when sequence length becomes longer.

II. LLM ACCELERATION


A. Challenges
The immense scale of LLMs, often encompassing billions or trillions of parameters, necessitates significant compute and memory resources for both training and inference. This not only amplifies energy consumption but also poses a major obstacle for time-sensitive or real-time applications using LLMs. From our benchmark analysis of the time consumption of Vicuna-7B in Figs. 1 and 2, LLMs suffer from different bottlenecks during training and inference. In this section, we focus on optimizing LLM inference, which could provide an intuition for more efficiently leveraging the power of existing LLMs such as GPT-4 and LLaMA2. For inference, as shown in Fig. 2, the Multi-Layer Perceptron (MLP) predominantly dictates the latency, regardless of input length. Additionally, the latency attributed to attention layers grows more pronounced as input lengths increase. In this section, we aim to illustrate that although existing methods have considerably optimized these two parts, there remains a need for more advanced software/hardware co-design methods to fully leverage the capabilities of hardware platforms.

Fig. 2. Time consumption breakdown: a decoder layer of Llama-2-7b during inference. The MLPs consistently take up the most time across all input lengths. The attention mechanism and normalization operations scale with input length; their computational times increase notably as the input becomes longer, highlighting their sensitivity to input size.
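As a concrete illustration of how such a breakdown can be measured, the sketch below times the attention and MLP sub-blocks of a single, simplified decoder layer across input lengths. It is a minimal sketch only: the generic PyTorch modules and the scaled-down layer sizes are assumptions and do not reproduce the exact Llama-2/Vicuna benchmark behind Figs. 1 and 2.

```python
# Minimal sketch: wall-clock time of the attention vs. MLP sub-blocks of one
# simplified decoder layer across input lengths. Sizes are scaled down; a
# 7B-class model would use roughly hidden=4096, heads=32, ffn=11008.
import time
import torch
import torch.nn as nn

hidden, heads, ffn = 1024, 16, 4096
attn = nn.MultiheadAttention(hidden, heads, batch_first=True)
mlp = nn.Sequential(nn.Linear(hidden, ffn), nn.GELU(), nn.Linear(ffn, hidden))

def bench(fn, reps=5):
    fn()                                   # warm-up
    start = time.perf_counter()
    for _ in range(reps):
        fn()
    return (time.perf_counter() - start) / reps

with torch.no_grad():
    for seq_len in (128, 512, 2048):
        x = torch.randn(1, seq_len, hidden)
        t_attn = bench(lambda: attn(x, x, x, need_weights=False))
        t_mlp = bench(lambda: mlp(x))
        print(f"len={seq_len:5d}  attention={t_attn*1e3:8.2f} ms  mlp={t_mlp*1e3:8.2f} ms")
```

On this kind of micro-benchmark, the MLP cost grows roughly linearly with sequence length while the attention cost grows faster, which is consistent with the trends summarized in Fig. 2.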
B. Existing Methods

Pruning and quantization are key techniques in optimizing neural networks, particularly in addressing the computational challenges presented by LLMs. These methods aim to reduce the size of models and the computational demands during inference without significantly compromising performance.

1) Quantization of LLMs: Besides the benefits of memory and inference latency reduction, quantization's potential for greater parallelism taps into the capabilities of hardware accelerators like FPGAs, further amplifying throughput. LLM.int8 [19] develops a two-part quantization procedure with vector-wise quantization and a new mixed-precision decomposition scheme. It can preserve perplexity for models with 125M to 13B parameters, at the cost of longer inference latency. SmoothQuant [20] uses a per-channel smoothing factor to handle outliers in activations and achieves lower latency compared to FP16. The results demonstrate that SmoothQuant can match the FP16 accuracy with INT8 quantization across various LLM sizes up to 530B parameters.
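To make the per-channel smoothing idea concrete, the sketch below re-implements it in a simplified form: a factor s_j = max|X_j|^α / max|W_j|^(1−α) migrates activation outliers into the weights before symmetric INT8 quantization. This is an illustrative sketch under simplified assumptions (per-tensor quantization scales, random calibration data), not the released SmoothQuant code.

```python
# Minimal sketch of SmoothQuant-style smoothing [20]: per-channel factors move
# activation outliers into the weights so that both quantize well to INT8.
import torch

def smooth(weight, act_sample, alpha=0.5):
    """weight: (out, in) of a linear layer; act_sample: (tokens, in) calibration activations."""
    act_max = act_sample.abs().amax(dim=0).clamp(min=1e-5)   # per input channel
    w_max = weight.abs().amax(dim=0).clamp(min=1e-5)         # per input channel
    s = act_max.pow(alpha) / w_max.pow(1.0 - alpha)          # smoothing factor s_j
    return weight * s, act_sample / s                        # W' = W * s, X' = X / s

def quant_int8(t):
    # Symmetric quantization with a single per-tensor scale (a simplification).
    scale = t.abs().max().clamp(min=1e-5) / 127.0
    return (t / scale).round().clamp(-127, 127).to(torch.int8), scale

# Toy usage: the INT8 product of the smoothed pair approximates X @ W^T.
W = torch.randn(512, 256)
X = torch.randn(64, 256) * (torch.rand(256) * 10)            # channel-wise outliers
W_s, X_s = smooth(W, X)
qW, sW = quant_int8(W_s)
qX, sX = quant_int8(X_s)
approx = (qX.float() @ qW.float().t()) * sX * sW
print("mean abs error:", (approx - X @ W.t()).abs().mean().item())
```

Because X' = X / s and W' = W · s act on the same input channel, the product X'W'^T is mathematically unchanged; only the dynamic ranges seen by the quantizer are rebalanced.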
Fig. 3. OPT-13b's accuracy on the Lambada dataset with different quantization methods. W8A8-Naive denotes the naive 8-bit weight and activation quantization of LLM.int8. W4A4-SQ denotes 4-bit weight and activation quantization of SmoothQuant.

Fig. 4. The activeness of heads profiled on different datasets. Here, we measure the contribution of different heads based on their variance over the input sequence. Inactive heads show low variance, which eventually leads to contextual sparsity.

2) Pruning of LLMs: A pruned model not only reduces memory requirements but also accelerates inference. LLM-Pruner [21] uses structural pruning by dividing weights into independent groups and employs low-rank approximation for recovery; at the same time, it preserves the diverse capacities of LLMs. SparseGPT [22] solves the Row-Hessian challenge, i.e., the computational difficulty of calculating and storing individual rows of the Hessian matrix, by reusing Hessians between rows and distinct pruning masks, with negligible accuracy drop. However, SparseGPT only demonstrated its fine-tuning-free inference speed on a narrow range of large models like OPT-175B and BLOOM-176B. Deja Vu [23] reveals the existence of contextual sparsity and proposes a real-time sparsity predictor applied during inference, which can reduce the inference latency of OPT-175B. The evaluations on OPT-66B and OPT-175B show no compromise in the quality of the models.
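For intuition, the sketch below shows the basic structural-pruning step that such methods build on: attention heads are treated as prunable groups, each group receives an importance score, and the lowest-scoring groups are removed. The weight-norm score and the 25% ratio are deliberate simplifications, not the actual LLM-Pruner or SparseGPT criteria.

```python
# Minimal sketch of structural (group-wise) pruning: treat each attention head's
# slice of the output projection as one group, score groups by weight norm, and
# zero out the lowest-scoring ones.
import torch
import torch.nn as nn

hidden, num_heads = 1024, 16
head_dim = hidden // num_heads
o_proj = nn.Linear(hidden, hidden, bias=False)            # attention output projection

with torch.no_grad():
    w = o_proj.weight.view(hidden, num_heads, head_dim)   # (out, heads, head_dim)
    scores = w.pow(2).sum(dim=(0, 2)).sqrt()              # one importance score per head
    n_prune = num_heads // 4                              # prune 25% of the heads
    pruned = scores.argsort()[:n_prune]
    w[:, pruned, :] = 0.0                                 # remove whole groups at once
    print("pruned heads:", sorted(pruned.tolist()))
```

Real systems additionally recover accuracy after this step, for example with the low-rank fine-tuning that LLM-Pruner applies during its recovery stage.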
C. Software/Hardware Co-design

1) Motivations: Beyond the evident advantages of pruning and quantization in LLMs, the complexities of their effective implementation remain. The core challenge with quantization lies in preserving the model's accuracy: a reduction in precision can compromise accuracy. Likewise, excessive pruning can significantly diminish model accuracy. Additionally, the irregular sparsity introduced by pruning may not align well with specific hardware architectures, potentially resulting in less-than-ideal performance enhancements. SmoothQuant provides a turn-key solution. However, achieving consistent performance in LLMs with 4-bit quantization remains challenging, as shown in Fig. 3, despite claims of its universal optimality in previous work [24]. We tested the performance of SmoothQuant on OPT-13B [25]. Fig. 3 illustrates the accuracy levels SmoothQuant can attain in comparison to FP16, but W4A4 (4-bit quantization of weights and activations) and W4A16 cannot achieve accuracy comparable to the original model before quantization. After testing with different values of the scale hyperparameter α, the accuracy of these two models could vary between 0 and 0.3. These two models are sensitive to the hyperparameter and need further investigation. At the same time, the existing pruning methods are mostly tested on models with over 175B parameters. These large models are often suspected of under-training, resulting in a high percentage of inactive parameters in the first place. Deja Vu only demonstrates the preservation of accuracy on these large models, showing insufficient evidence of its effectiveness on smaller models. Based on our experimental results, after pruning 50% of the parameters in attention layers and 30% in MLP layers, the OPT-13B model's accuracy drops approximately 15% on the WinoGrande dataset. Meanwhile, the existing pruning methods working with smaller LLMs overlook their generalization abilities over data in different domains. For example, LLM-Pruner runs the risk of overfitting in the recovery stage after pruning.

2) Our Ideas: From experiments involving state-of-the-art LLMs, solely relying on either quantization or pruning proves challenging for meeting real-world requirements and adapting to diverse LLMs and hardware platforms. This indicates the potential benefits of integrating these two prevalent techniques while considering hardware characteristics. Thus, we suggest exploring pruning-aware quantization for LLM optimization. As indicated in [23] and our experiments on OPT-13B, inactive attention heads are uniformly distributed across the input sequence, referred to as the contextual sparsity of LLMs. A typical strategy to capitalize on this sparsity is pruning. However, different LLMs often exhibit varied patterns of contextual sparsity, constraining the effectiveness of conventional pruning. At the same time, quantization is sensitive to the range and distribution of parameters, making standalone quantization with a uniform precision a non-ideal solution.

Based on these considerations, our pruning-aware quantization method aims to tackle these challenges by choosing pruning or quantization precisions according to the behavior of different LLMs over various datasets. Specifically, the pruning-aware quantization method will profile over a relatively small amount of data and identify the importance and scale of parameters in attention matrices and neurons. Subsequently, guided by the profiling results, our approach can select between pruning (0 bits) or varying quantization precisions (4, 8, and 16 bits) for each layer. Additionally, these choices are tailored to align with the efficient computation patterns of different hardware architectures, achieving the co-optimization of both software and hardware. Our approach can reduce the memory and computational cost while achieving enhanced inference accuracy compared with existing pruning methods. Additionally, as shown in Fig. 4, the activeness of attention heads exhibits a similar distribution over multiple datasets, which suggests the preservation of the multi-task solving and language generation ability of our approach.
Furthermore, our approach can also be combined with state-of-the-art hardware-aware LLM acceleration frameworks, such as FlashAttention-2 [26]. Based on the preliminary results illustrated in Fig. 5, we are able to achieve higher throughput than both PyTorch and standalone FlashAttention-2.

Fig. 5. The preliminary result of forward throughput improvement. Flash2 hmask is the result of combining FlashAttention-2 and our pruning-aware quantization approach.
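A minimal sketch of how such a profiling-driven precision assignment could look is given below. The variance-based activeness score, the rank thresholds, and the per-head granularity are illustrative assumptions for exposition, not the finalized co-design method.

```python
# Illustrative sketch of pruning-aware quantization: profile attention-head
# "activeness" (variance of head outputs over the input sequence) on a small
# calibration set, then map each head to 0 bits (prune) or 4/8/16-bit precision.
import torch

def head_activeness(head_outputs):
    """head_outputs: (num_heads, seq_len, head_dim) collected from one layer.
    Returns one activeness score per head: variance over the sequence."""
    return head_outputs.var(dim=1).mean(dim=-1)            # (num_heads,)

def assign_precision(scores, q_prune=0.2, q_low=0.6, q_high=0.9):
    """Map the rank of each head's activeness score to a bit-width."""
    ranks = scores.argsort().argsort().float() / (len(scores) - 1)   # 0..1
    bits = torch.full_like(ranks, 8)
    bits[ranks < q_prune] = 0                  # least active heads: prune
    bits[(ranks >= q_prune) & (ranks < q_low)] = 4
    bits[ranks >= q_high] = 16                 # most active heads: keep high precision
    return bits.int()

# Toy usage with synthetic "profiled" head outputs for a 32-head layer.
profile = torch.randn(32, 512, 128) * torch.linspace(0.1, 2.0, 32).view(32, 1, 1)
bits = assign_precision(head_activeness(profile))
print(bits.tolist())
```

In a hardware-aware deployment, the resulting per-head (or per-layer) bit-widths would additionally be constrained to the mixed-precision formats that the target accelerator executes efficiently.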

III. LLM FOR DESIGN VERIFICATION

Fig. 7. Comparison of LLM-Based vs. Traditional HLS Functional Verification Flows.

A. Challenges
Circuit design involves stages such as RTL synthesis, logic synthesis, and placement and routing [27]. Most of these stages are equipped with corresponding verification processes to ensure the implemented hardware matches its specifications. Verification methods fall into two main categories: formal verification and simulation-based verification [28]. While formal verification provides mathematical assurance, its limitations, like scalability issues and the necessity for niche coding skills, often position simulation-based verification as the industry's preferred method.

Nonetheless, simulation-based verification has its shortcomings, as it is very time-consuming, especially for large-scale hardware designs. Verification can account for 45% to 55% of the total design cycle [29], making it the most extensive phase in hardware development. Moore's Law [30] suggests that transistor counts in integrated circuits (ICs) double approximately every two years. For reference, the transistor count of the AMD Ryzen 9 series processors [31], [32] scaled up by 1.78 times from 2019 to 2021. Yet, as depicted in Fig. 6 [33], manual hardware design productivity (quantified in Gates/Day) has not paralleled these technological leaps. This divergence reveals an expanding productivity gap, intensified by growing system intricacies and nearing physical limitations.

Fig. 6. The great productivity gap between hardware design productivity (purple line), verification productivity (red line), and technology capacity (green line) [33].

Traditional hardware simulation-based functional verification uses test vectors, as shown in Fig. 7, to ensure a system's expected behavior. Engineers often manually design test vectors, create test benches, and set legality constraints. After reviewing coverage reports, they adjust parameters and continue to the next test iteration. The process of crafting these test vectors, though reliable to a certain extent, has always been labor-intensive, requiring significant domain expertise. Moreover, the growing complexity of modern hardware makes exhaustive state examination computationally unviable due to the vast search space. Recognizing these limitations, our approach harnesses the power of LLMs to automatically detect, on top of the source code, common bug patterns on which humans easily get stuck.

In light of the aforementioned challenges of the widening productivity gap and the increasing complexity of hardware verification, HLS enhances the design verification landscape. HLS, operating at an elevated abstraction level, facilitates streamlined bug identification in intricate designs by emphasizing algorithmic and architectural considerations, harmonizing with the LLM's strength in context comprehension. Furthermore, HLS abbreviates the design-to-verification trajectory, fostering swift issue recognition and expediting debugging. Collectively, HLS furnishes engineers with an optimized, intuitive, and holistic verification platform.

B. LLM-based Chrysalis Dataset Generation

Within our functional verification framework, we employ the power of LLMs to streamline the debugging process, commencing with the precise localization of erroneous code lines.
A foundational requirement for this approach is the establishment of an LLM-centric dataset crafted specifically to facilitate a robust evaluation or fine-tuning of varying LLMs on HLS code debugging tasks. By deliberately introducing known bugs and monitoring an LLM's prowess in pinpointing them, we are empowered to assess its efficiency, accuracy, and reliability comprehensively. This assessment does not just offer a window into the diagnostic capabilities of the LLM; it is also useful in refining a domain-specific LLM.

Historically, the research community has grappled with the absence of an open-source dataset catering specifically to buggy HLS code. To address this gap, we have created the Chrysalis dataset, a comprehensive benchmark tailored to catalyze synergistic advancements between the LLM and HLS domains. The dataset is named "Chrysalis" to metaphorically represent the evolution of buggy code, akin to the transformative chrysalis stage in a butterfly's life cycle, culminating in the emergence of refined code. This dataset is exclusively crafted to underpin LLM-guided HLS debugging endeavors targeting FPGA platforms.

To closely mirror genuine, real-world coding errors often overlooked by engineers, we have enumerated a series of logical bugs. These particular bugs are crafted to elude detection by conventional HLS synthesis and compilation tools. We utilize GPT-4's [34] nuanced natural language capabilities, combined with HLS source code and curated prompts, to enhance our dataset creation. Our dataset is a comprehensive collection, offering both flawless benchmark suites and those with injected bugs. It features detailed classifications of errors and includes precise annotations that trace the exact origins of the faults within the code. Our objective is to embed one or two specific bug types in each benchmark, targeting up to 45 unique scenarios: 9 with a single bug type and 36 with two bug types. If certain bug types are infeasible to inject for a particular benchmark, the total number of buggy code instances may be fewer than 45. To augment the variety of errors, variables can be systematically varied to spawn a multitude of faulty code samples.

The steps of generating the Chrysalis dataset include HLS design collection, logic bug simulation and injection, and bug injection validation. The following sub-sections elucidate the detailed implementation strategies underpinning our Chrysalis dataset creation.

1) HLS Design Collection: For the creation of our Chrysalis dataset, we have meticulously curated a collection of real-world HLS applications based on open-source projects. This comprehensive set of benchmark suites encompasses a diverse array of synthesizable HLS applications, each drawn from the following reputable sources: FINN [35], GNNBuilder [36], H.264 [37], HLS4ML [38], MachSuite [39], Open-Source-IPs [40], Polybench [41], Rosetta [42], Vitis HLS introductory examples [43], Vitis libraries [44], and Tacle-Bench [45]. Our Chrysalis dataset primarily focuses on function-level tasks, with over 1,000 individual HLS programs extracted from these sources. This selection ensures the representation of various coding styles and complexities.

While constructing the Chrysalis dataset, we consider the intricacies of handling functions that include header files or make calls to other functions. We recognize that engineers may occasionally overlook critical values or functions, potentially resulting in inter-functional errors. In order to facilitate a comprehensive understanding of the contexts in which errors can be injected, we have incorporated all instances of the #define syntax, ensuring that the entirety of each function's macro definitions is accounted for. Additionally, in each function-level design, we have incorporated details of all the functions that a particular function calls, ensuring comprehensive and clear documentation. This holistic approach ensures that the LLM is well-equipped to comprehend the full scope of interdependencies and contextual intricacies, thereby enhancing its effectiveness in error injection tasks.

By adhering to this rigorous procedure, our methodology for developing the Chrysalis dataset serves a dual purpose. Firstly, it exemplifies a systematic and automated method for LLM-targeted dataset generation. Secondly, it serves as a valuable resource for advancing the capacity of LLMs in the realm of HLS verification.

2) Logic Bug Simulation: After thoroughly analyzing the source code and extracting all functions, our next step is to simulate pre-silicon logic bugs. These are the kind of errors hardware designers might inadvertently introduce when crafting the HLS version of a design, leading to deviations from the intended specifications. The error types employed in our work align closely with the categorization introduced in [46], and we categorize the potential error types as follows: (1) OOB: Out-of-bounds array access; (2) INIT: Accessing an uninitialized variable; (3) SHFT: Bit shift by an out-of-bounds amount; (4) INF: An infinite loop arising from an incorrect loop termination condition; (5) *++: Misunderstanding of operator precedence, erroneously assuming that dereference (*) has higher precedence than post-increment (++); (6) MLU: Errors in manual loop unrolling, leading to the omission of one iteration; (7) BUF: Copying from the wrong half of a split buffer; (8) ZERO: A variable initialized to zero when it should have a nonzero initializer; (9) USE: Unintended sign extension.

To emulate real-world buggy scenarios, each function-level design has one or two of these errors injected. Through this process, we create a dataset comprising over 1,000 function-level designs, each including 1-2 buggy instances out of 45 possible combinations. This comprehensive dataset will provide a robust platform to evaluate the performance and accuracy of LLM-based debugging tools.
3) LLM-Driven Bug Injection Methodology for HLS Code: Capitalizing on the accurate code version combined with a defined error type, we harness the capabilities of GPT-4 for a precise and systematic introduction of errors into HLS design structures. The foundation of our methodology rests on a refined prompt template, fashioned to facilitate GPT-4 in generating erroneous code with both stability and predictability. The template is strategically partitioned into three key sections:

• Context: This section offers a snapshot of the surrounding environment where the HLS code is set to operate. It emphasizes the essence of studying inadvertent human-induced errors that mirror real-world scenarios.
• Requirement: This delineates the characteristics of the intended error or bug, offering comprehensive insight into its nature. Beyond being a mere directive for error generation, it elucidates the rationale behind the need for such errors, providing a deeper understanding of the specific anomalies being introduced.
• Complementary Rules: Serving as a structured framework, these rules ensure the synthesized bugs are not only consistent with the original intention but also do not stray from the designated error category. To streamline the process, outputs are structured in a JSON format, encompassing fields like the error type and the content of the faulty code line, among others, for automated parsing (a sketch of such a template and its parsing follows this list). Moreover, to eschew exhaustive and aimless fault injection, the system is programmed to directly output 'No' if the conditions are unsuitable for the stipulated error manifestation.
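A minimal sketch of how such a three-part prompt could be assembled and its JSON reply validated is shown below. The field names, the exact wording, and the query_llm helper are illustrative assumptions, not the released Chrysalis prompts or schema.

```python
# Illustrative sketch of assembling the three-part bug-injection prompt
# (Context / Requirement / Complementary Rules) and validating the JSON reply.
import json

TEMPLATE = """Context: You are reviewing synthesizable HLS C/C++ code. We study
inadvertent, human-like coding errors that mirror real-world scenarios.

Requirement: Inject exactly one error of type "{error_type}" ({error_desc})
into the function below, changing as little code as possible.

Complementary Rules: Reply with a JSON object containing the fields
"error_type", "buggy_line", and "buggy_function". If the function offers no
suitable site for this error type, reply with the single word: No.

Function under test:
{function_code}
"""

def build_prompt(error_type, error_desc, function_code):
    return TEMPLATE.format(error_type=error_type, error_desc=error_desc,
                           function_code=function_code)

def parse_reply(reply):
    """Return the parsed record, or None when the model declined with 'No'."""
    if reply.strip().lower().startswith("no"):
        return None
    record = json.loads(reply)
    assert {"error_type", "buggy_line", "buggy_function"} <= record.keys()
    return record

# Usage, with query_llm standing in for whichever GPT-4 API wrapper is used:
#   prompt = build_prompt("OOB", "out-of-bounds array access", gemm_source)
#   record = parse_reply(query_llm(prompt))
```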
In each function-level task within our Chrysalis dataset, we introduce one to two types of errors. This strategy is specifically devised to emulate the nuances of human oversight or negligence, ensuring the results resonate with real-world scenarios.

Fig. 8 showcases an example of the process of injecting an OOB error into the function 'gemm_4096'. Upon introducing a bug into the system, we conduct a validation process to ascertain whether the bug is correctly injected following the prompt. This will involve a comparative analysis between the original, error-free source code and the generated buggy code. To maintain the integrity of our findings, any redundant or recurring errors will be systematically eliminated.

Fig. 8. An illustration showcasing the procedure for directing GPT-4 to introduce an Out-Of-Bounds (OOB) error into the 'gemm_4096' source function. The code line emphasized with a blue underline indicates the injection point of the error. For automated generation of multiple erroneous function codes, the terms highlighted in yellow must be substituted with the appropriate code identifiers and contents. To introduce varying error types, the descriptors highlighted in grey should be adjusted accordingly.

C. LLM-Based Bug Detection

Our Chrysalis dataset presents a promising platform for evaluating the proficiency of existing LLMs in HLS bug localization. Specifically, engineers can assess the models' precision and efficiency by comparing the errors detected by the LLMs to predefined error labels. These labels provide detailed information, including the types of errors and the specific locations of the incorrect code lines.

We plan to develop a lightweight, domain-specific LLM trained on the Chrysalis dataset. Such a model would not merely detect anomalies but might also possess the capability to rectify them. This domain-centric LLM could integrate as an extension within development environments like VSCode. This would parallel tools such as Copilot, offering hardware engineers intuitive suggestions to identify and rectify pitfalls in their HLS codes, thus augmenting the debugging process.
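As an illustration of how such a comparison against the predefined labels could be scored, the sketch below computes precision and recall of bug localization; the record format and field names are assumptions kept consistent with the injection sketch above, not a prescribed evaluation format.

```python
# Illustrative scoring sketch: compare LLM-reported bugs against the dataset's
# ground-truth labels (design, error type, line number).
def score(predictions, labels):
    """predictions / labels: lists of dicts with 'design', 'error_type', 'line'."""
    pred = {(p["design"], p["error_type"], p["line"]) for p in predictions}
    gold = {(g["design"], g["error_type"], g["line"]) for g in labels}
    tp = len(pred & gold)                       # exactly-matching detections
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    return precision, recall

labels = [{"design": "gemm_4096", "error_type": "OOB", "line": 42}]
predictions = [{"design": "gemm_4096", "error_type": "OOB", "line": 42},
               {"design": "gemm_4096", "error_type": "INIT", "line": 7}]
print(score(predictions, labels))               # -> (0.5, 1.0)
```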
IV. CONCLUSION

In this paper, we studied the acceleration strategies of LLMs through software/hardware co-design and LLMs' application in the domain of design verification, particularly focusing on functional verification in HLS. We have explored several state-of-the-art techniques for expediting LLMs, such as quantization and pruning, and proposed an approach that synergizes these methods to overcome their individual limitations. Our preliminary results suggest that this integrated approach can yield considerable improvements in inference performance without compromising accuracy, indicating a promising direction for future research.

Furthermore, we have addressed the critical challenge of functional verification in hardware design, which has become increasingly complex and time-consuming. By leveraging the capabilities of LLMs, specifically GPT-4, we have created the comprehensive Chrysalis dataset for HLS that includes over 1000 function-level designs with up to 45 distinct combinations of defects. This Chrysalis dataset serves as a valuable resource for evaluating and fine-tuning LLM-based HLS domain-specific debugging assistants.

In conclusion, our research contributes a methodological foundation and a practical toolkit for harnessing the power of LLMs in the design verification domain. As we continue to refine these techniques and expand the capabilities of LLMs, we anticipate a future where the co-design of software and hardware is seamlessly interwoven into the fabric of hardware development, leading to faster, more efficient, and error-resilient design cycles.

ACKNOWLEDGEMENTS

This work is supported in part by the IBM-Illinois Discovery Accelerator Institute (IIDAI), NSF grant 2117997 through the A3D3 institute, and Semiconductor Research Corporation (SRC) grant 2023-CT-3175.

REFERENCES
[1] Y. Li et al., "CSRNet: Dilated convolutional neural networks for understanding the highly congested scenes," in Proc. of CVPR, 2018.
[2] Y. Goldberg, "A primer on neural network models for natural language processing," Journal of Artificial Intelligence Research, 2016.
[3] A. Vaswani et al., "Attention is all you need," Advances in Neural Information Processing Systems, 2017.
[4] L. Ouyang et al., "Training language models to follow instructions with human feedback," Advances in Neural Information Processing Systems, 2022.
[5] H. Touvron et al., "Llama 2: Open foundation and fine-tuned chat models," arXiv preprint arXiv:2307.09288, 2023.
[6] S. Pichai, "An important next step on our AI journey," 2023.
[7] J. Blocklove et al., "Chip-Chat: Challenges and opportunities in conversational hardware design," arXiv preprint arXiv:2305.13243, 2023.
[8] Y. Lu et al., "RTLLM: An open-source benchmark for design RTL generation with large language model," arXiv preprint arXiv:2308.05345, 2023.
[9] S. Thakur et al., "Benchmarking large language models for automated Verilog RTL code generation," in Proc. of DATE, 2023.
[10] E. Nijkamp et al., "CodeGen: An open large language model for code with multi-turn program synthesis," arXiv preprint arXiv:2203.13474, 2022.
[11] M. Liu et al., "VerilogEval: Evaluating large language models for Verilog code generation," arXiv preprint arXiv:2309.07544, 2023.
[12] Z. He et al., "ChatEDA: A large language model powered autonomous agent for EDA," arXiv preprint arXiv:2308.10204, 2023.
[13] X. Zhang et al., "DNNBuilder: An automated tool for building high-performance DNN hardware accelerators for FPGAs," in Proc. of ICCAD, 2018.
[14] H. Ye et al., "HybridDNN: A framework for high-performance hybrid DNN accelerator design and implementation," in Proc. of DAC, 2020.
[15] X. Zhang et al., "DNNExplorer: A framework for modeling and exploring a novel paradigm of FPGA-based DNN accelerator," in Proc. of ICCAD, 2020.
[16] C. Zhuge et al., "Face recognition with hybrid efficient convolution algorithms on FPGAs," in Proc. of GLSVLSI, 2018.
[17] X. Chen et al., "ThunderGP: HLS-based graph processing framework on FPGAs," in Proc. of FPGA, 2021.
[18] S. Liu et al., "Real-time object tracking system on FPGAs," in Proc. of SAAHPC, 2011.
[19] T. Dettmers et al., "LLM.int8(): 8-bit matrix multiplication for transformers at scale," arXiv preprint arXiv:2208.07339, 2022.
[20] G. Xiao et al., "SmoothQuant: Accurate and efficient post-training quantization for large language models," in Proc. of ICML, 2023.
[21] E. Frantar et al., "SparseGPT: Massive language models can be accurately pruned in one-shot," 2023.
[22] X. Ma et al., "LLM-Pruner: On the structural pruning of large language models," arXiv preprint arXiv:2305.11627, 2023.
[23] Z. Liu et al., "Deja Vu: Contextual sparsity for efficient LLMs at inference time," in Proc. of ICML, 2023.
[24] T. Dettmers et al., "The case for 4-bit precision: k-bit inference scaling laws," in Proc. of ICML, 2023.
[25] S. Zhang et al., "OPT: Open pre-trained transformer language models," 2022.
[26] T. Dao, "FlashAttention-2: Faster attention with better parallelism and work partitioning," arXiv preprint arXiv:2307.08691, 2023.
[27] L. Lavagno et al., EDA for IC Implementation, Circuit Design, and Process Technology. CRC Press, 2018.
[28] A. Piziali, Functional Verification Coverage Measurement and Analysis. Springer Science & Business Media, 2007.
[29] H. Foster, "Part 8: The 2020 Wilson Research Group functional verification study," https://blogs.sw.siemens.com/verificationhorizons/2021/01/06/part-8-the-2020-wilson-research-group-functional-verification-study/, 2021.
[30] R. R. Schaller, "Moore's law: past, present and future," IEEE Spectrum, vol. 34, no. 6, pp. 52–59, 1997.
[31] P. Alcorn, "AMD Ryzen 9 3900X and Ryzen 7 3700X review," https://www.tomshardware.com/reviews/ryzen-9-3900x-7-3700x-review,6214.html, 2019.
[32] TechPowerUp, "AMD Ryzen 7 5800H specifications," https://www.techpowerup.com/cpu-specs/ryzen-7-5800h.c2368, 2021.
[33] R. Bahar et al., "Workshops on Extreme Scale Design Automation (ESDA) challenges and opportunities for 2025 and beyond," arXiv preprint arXiv:2005.01588, 2020.
[34] OpenAI, "GPT-4 technical report," 2023. [Online]. Available: https://doi.org/10.48550/arXiv.2303.08774
[35] Y. Umuroglu et al., "FINN: A framework for fast, scalable binarized neural network inference," in Proc. of FPGA, 2017.
[36] S. Abi-Karam et al., "GNNBuilder: An automated framework for generic graph neural network accelerator generation, simulation, and optimization," arXiv preprint arXiv:2303.16459, 2023.
[37] X. Liu et al., "High level synthesis of complex applications: An H.264 video decoder," in Proc. of FPGA, 2016.
[38] F. Fahim et al., "hls4ml: An open-source codesign workflow to empower scientific low-power machine learning devices," arXiv preprint arXiv:2103.05579, 2021.
[39] B. Reagen et al., "MachSuite: Benchmarks for accelerator design and customized architectures," in Proc. of IISWC, 2014.
[40] X. Liu et al., "HLS based open-source IPs for deep neural network acceleration," https://github.com/DNN-Accelerators/Open-Source-IPs, 2019.
[41] J. Karimov et al., "PolyBench: The first benchmark for polystores," in Performance Evaluation and Benchmarking for the Era of Artificial Intelligence: 10th TPC Technology Conference, TPCTC 2018, Rio de Janeiro, Brazil, August 27–31, 2018, Revised Selected Papers 10, 2019.
[42] Y. Zhou et al., "Rosetta: A realistic high-level synthesis benchmark suite for software programmable FPGAs," in Proc. of FPGA, 2018.
[43] Xilinx, "Vitis-HLS-Introductory-Examples," https://github.com/Xilinx/Vitis-HLS-Introductory-Examples, 2023.
[44] Xilinx, "Vitis libraries," https://github.com/Xilinx/Vitis_Libraries, 2019.
[45] Tacle, "Tacle Bench," https://github.com/tacle/tacle-bench, 2017.
[46] K. A. Campbell, "Robust and reliable hardware accelerator design through high-level synthesis," Ph.D. dissertation, University of Illinois at Urbana-Champaign, 2017.
