Invited Paper, ASP-DAC 2024
Abstract—The widespread adoption of Large Language Models (LLMs) is impeded by their demanding compute and memory resources. The first task of this paper is to explore optimization strategies to expedite LLMs, including quantization, pruning, and operation-level optimizations. One unique direction is to optimize LLM inference through novel software/hardware co-design methods. Given the accelerated LLMs, the second task of this paper is to study LLMs' performance in the usage scenario of circuit design and verification, with a particular emphasis on functional verification. Through automated prompt engineering, we harness the capabilities of the established LLM, GPT-4, to generate High-Level Synthesis (HLS) designs with predefined errors based on 11 open-source synthesizable HLS benchmark suites. The resulting dataset is a comprehensive collection of over 1000 function-level designs, each of which is afflicted with up to 45 distinct combinations of defects injected into the source code. This dataset, named Chrysalis, expands upon what is available in current HLS error models, offering a rich resource for training LLMs to better debug code. The dataset can be accessed at: https://github.com/UIUC-ChenLab/Chrysalis-HLS.

Index Terms—Large Language Models, software/hardware co-design, functional verification

I. INTRODUCTION

The rapid evolution of machine learning, particularly through advancements in neural network architectures, has precipitated significant breakthroughs in diverse fields, encompassing computer vision [1] and natural language processing [2]. Among various neural network designs, the transformer architecture [3] stands out, offering unparalleled performance on sequence-to-sequence tasks. Instead of using traditional recurrent layers, this innovative structure harnesses the power of the attention mechanism. The transformer model serves as the foundation for the emergence of Large Language Models (LLMs) such as OpenAI's GPT series [4], Meta's LLaMA [5], and Google's BARD [6]. Encompassing billions of parameters and informed by extensive textual datasets sourced from the internet, these models possess the capability to interpret human-generated text and yield contextually pertinent and logically coherent outputs to a wide spectrum of prompts.

In the domain of Electronic Design Automation (EDA), the potential application of LLMs is becoming increasingly evident. They are poised to support hardware engineers in various hardware design stages, ranging from the initial conception and verification to optimization and coordination of the complete design flow. Chip-Chat [7] established a set of eight foundational benchmarks, aiming to delineate both the capabilities and limitations of current state-of-the-art (SOTA) LLMs in hardware design. Concurrently, there is a trend in academic circles towards the development of robust benchmark frameworks to rigorously gauge LLM performance. For instance, [8] introduced an open-source suite comprising 30 intricate hardware designs and enhanced the quality of LLM feedback through advanced prompt engineering techniques. In a related effort, Shailja et al. [9] fine-tuned the CodeGen model [10] using 17 Verilog codes. Additional research has focused on dataset construction through practical RTL tutorial exercises. For instance, VerilogEval [11] developed a comprehensive benchmark framework that includes 156 problems derived from the educational HDLBits platform. Moreover, this framework was employed to fine-tune the CodeGen model [10], significantly improving its proficiency in generating RTL code. In a distinct approach, ChatEDA [12] was architected to generate code capable of navigating EDA tools based on natural-language cues.

While LLMs benefit from a vast number of parameters, they also grapple with challenges related to sparsity and computational overhead. Because of the extensive usage of LLMs in time-sensitive applications such as Internet of Things (IoT) devices, it is essential to ensure that LLMs deliver optimal inference performance without compromising their multi-task solving and language generation abilities. In this paper, we explore several cutting-edge optimization techniques for LLM inference, with a focus on quantization and pruning. Numerous studies have highlighted the potential advantages of these methods individually, including activation outlier handling and structural and contextual sparsity reduction. It remains challenging, however, to achieve consistent optimization benefits across diverse LLMs and to ensure adaptability in real-world scenarios. Thus, we point out a potential solution to this challenge in software/hardware co-design, inspired by its proven power and effectiveness in the areas of Deep Neural Networks (DNNs) [13]–[16], Graph Neural Networks (GNNs) [17], and conventional machine-learning solutions [18].

Overall, this paper introduces a unified optimization framework for LLMs, with a particular emphasis on functional verification in EDA, making the following contributions:
* These authors contributed equally to this work.

• We study the integration of both LLM quantization and pruning techniques, yielding greater benefits than their standalone applications while compensating for their respective limitations. Additionally, we anticipate the potential integration of this approach with domain-specific hardware accelerators.
• We pioneer an innovative methodology that harnesses the
capabilities of GPT-4 to inject bugs into HLS codes.
We design a set of tailored prompts to guide GPT-4 in
generating consistent and compliant buggy codes within
the EDA domain.
• Leveraging the above methodologies, we create the Chrysalis dataset, which includes both correct source code and intentionally injected buggy code. This dataset is meticulously organized, comprising over 1000 function-level designs sourced from 11 open-source synthesizable HLS benchmark suites. Each design undergoes a controlled injection of up to 45 distinct combinations of bugs. It represents an indispensable tool for the assessment and refinement of LLM-based HLS domain-specific debugging assistants.

Fig. 1. Time consumption breakdown: a Llama-2-7b [5] decoder layer during training. Attention significantly dominates the latency when the sequence length becomes longer.
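To make the bug-injection and assessment contributions concrete, the sketch below shows how an injected defect could be recorded and how an LLM's reports could be scored against predefined labels. This is a toy illustration only: the JSON field names, the function names, and the precision/recall check are our own assumptions, not the actual Chrysalis schema or evaluation harness.

```python
import json

# Hypothetical record for one injected bug (field names are illustrative
# assumptions, not the actual Chrysalis schema).
record = {
    "error_type": "OOB",                           # out-of-bounds access
    "function": "gemm_example",                    # hypothetical design name
    "original_line": "for (int k = 0; k < N; k++)",
    "buggy_line": "for (int k = 0; k <= N; k++)",  # injected defect
}
print(json.dumps(record, indent=2))

# Toy scoring: compare (function, line) pairs reported by an LLM against
# the predefined error labels attached to each buggy design.
labels = {("gemm_example", 12), ("fft_example", 34)}
reported = {("gemm_example", 12), ("fir_example", 7)}
hits = labels & reported
precision = len(hits) / len(reported)  # 0.5: one of two reports is correct
recall = len(hits) / len(labels)       # 0.5: one of two bugs was found
```

Because each buggy design carries such a label, an LLM-based debugging assistant can be checked automatically rather than by manual inspection.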
• Requirement: This delineates the characteristics of the intended error or bug, offering comprehensive insight into its nature. Beyond a mere directive for error generation, it elucidates the rationale behind the need for such errors, providing a deeper understanding of the specific anomalies being introduced.
• Complementary Rules: Serving as a structured framework, these rules ensure the synthesized bugs are not only consistent with the original intention but also do not stray from the designated error category. To streamline the process, outputs are structured in a JSON file format, encompassing fields such as the error type and the content of the faulty code line, among others, for automated parsing. Moreover, to eschew exhaustive and aimless fault injection, the system is programmed to directly output 'No' if the conditions are unsuitable for the stipulated error manifestation.

In each function-level task within our Chrysalis dataset, we introduce one to two types of errors. This strategy is specifically devised to emulate the nuances of human oversight or negligence, ensuring the results resonate with real-world scenarios.

Fig. 8 showcases an example of the process of injecting an OOB error into the function "gemm_4096". Upon introducing a bug into the system, we conduct a validation process to ascertain whether the bug is correctly injected following the prompt. This will involve a comparative analysis between the original, error-free source code and the generated buggy code. To maintain the integrity of our findings, any redundant or recurring errors will be systematically eliminated.

C. LLM-Based Bug Detection

Our Chrysalis dataset presents a promising platform for evaluating the proficiency of existing LLMs in HLS bug localization. Specifically, engineers can assess the models' precision and efficiency by comparing the errors detected by the LLMs to predefined error labels. These labels provide detailed information, including the types of errors and the specific locations of the incorrect code lines.

We plan to develop a lightweight, domain-specific LLM trained on the Chrysalis dataset. Such a model would not merely detect anomalies but might also possess the capability to rectify them. This domain-centric LLM could integrate as an extension within development environments such as VSCode, paralleling tools such as Copilot and offering hardware engineers intuitive suggestions to identify and rectify pitfalls in their HLS code, thus augmenting the debugging process.

IV. CONCLUSION

In this paper, we studied acceleration strategies for LLMs through software/hardware co-design and the application of LLMs in the domain of design verification, particularly focusing on functional verification in HLS.
We have explored several state-of-the-art techniques for expediting LLMs, such as quantization and pruning, and proposed an approach that synergizes these methods to overcome their individual limitations. Our preliminary results suggest that this integrated approach can yield considerable improvements in inference performance without compromising accuracy, indicating a promising direction for future research.

Furthermore, we have addressed the critical challenge of functional verification in hardware design, which has become increasingly complex and time-consuming. By leveraging the capabilities of LLMs, specifically GPT-4, we have created the comprehensive Chrysalis dataset for HLS, which includes over 1000 function-level designs with up to 45 distinct combinations of defects. This dataset serves as a valuable resource for evaluating and fine-tuning LLM-based HLS domain-specific debugging assistants.

In conclusion, our research contributes a methodological foundation and a practical toolkit for harnessing the power of LLMs in the design verification domain. As we continue to refine these techniques and expand the capabilities of LLMs, we anticipate a future where the co-design of software and hardware is seamlessly interwoven into the fabric of hardware development, leading to faster, more efficient, and error-resilient design cycles.

ACKNOWLEDGEMENTS

This work is supported in part by the IBM-Illinois Discovery Accelerator Institute (IIDAI), NSF grant 2117997 through the A3D3 institute, and Semiconductor Research Corporation (SRC) grant 2023-CT-3175.

REFERENCES

[1] Y. Li et al., "CSRNet: Dilated convolutional neural networks for understanding the highly congested scenes," in Proc. of CVPR, 2018.
[2] Y. Goldberg, "A primer on neural network models for natural language processing," Journal of Artificial Intelligence Research, 2016.
[3] A. Vaswani et al., "Attention is all you need," in Advances in Neural Information Processing Systems, 2017.
[4] L. Ouyang et al., "Training language models to follow instructions with human feedback," in Advances in Neural Information Processing Systems, 2022.
[5] H. Touvron et al., "Llama 2: Open foundation and fine-tuned chat models," arXiv preprint arXiv:2307.09288, 2023.
[6] S. Pichai, "An important next step on our AI journey," 2023.
[7] J. Blocklove et al., "Chip-Chat: Challenges and opportunities in conversational hardware design," arXiv preprint arXiv:2305.13243, 2023.
[8] Y. Lu et al., "RTLLM: An open-source benchmark for design RTL generation with large language model," arXiv preprint arXiv:2308.05345, 2023.
[9] S. Thakur et al., "Benchmarking large language models for automated Verilog RTL code generation," in Proc. of DATE, 2023.
[10] E. Nijkamp et al., "CodeGen: An open large language model for code with multi-turn program synthesis," arXiv preprint arXiv:2203.13474, 2022.
[11] M. Liu et al., "VerilogEval: Evaluating large language models for Verilog code generation," arXiv preprint arXiv:2309.07544, 2023.
[12] Z. He et al., "ChatEDA: A large language model powered autonomous agent for EDA," arXiv preprint arXiv:2308.10204, 2023.
[13] X. Zhang et al., "DNNBuilder: An automated tool for building high-performance DNN hardware accelerators for FPGAs," in Proc. of ICCAD, 2018.
[14] H. Ye et al., "HybridDNN: A framework for high-performance hybrid DNN accelerator design and implementation," in Proc. of DAC, 2020.
[15] X. Zhang et al., "DNNExplorer: A framework for modeling and exploring a novel paradigm of FPGA-based DNN accelerator," in Proc. of ICCAD, 2020.
[16] C. Zhuge et al., "Face recognition with hybrid efficient convolution algorithms on FPGAs," in Proc. of GLSVLSI, 2018.
[17] X. Chen et al., "ThunderGP: HLS-based graph processing framework on FPGAs," in Proc. of FPGA, 2021.
[18] S. Liu et al., "Real-time object tracking system on FPGAs," in Proc. of SAAHPC, 2011.
[19] T. Dettmers et al., "LLM.int8(): 8-bit matrix multiplication for transformers at scale," arXiv preprint arXiv:2208.07339, 2022.
[20] G. Xiao et al., "SmoothQuant: Accurate and efficient post-training quantization for large language models," in Proc. of ICML, 2023.
[21] E. Frantar et al., "SparseGPT: Massive language models can be accurately pruned in one-shot," 2023.
[22] X. Ma et al., "LLM-Pruner: On the structural pruning of large language models," arXiv preprint arXiv:2305.11627, 2023.
[23] Z. Liu et al., "Deja Vu: Contextual sparsity for efficient LLMs at inference time," in Proc. of ICML, 2023.
[24] T. Dettmers et al., "The case for 4-bit precision: k-bit inference scaling laws," in Proc. of ICML, 2023.
[25] S. Zhang et al., "OPT: Open pre-trained transformer language models," 2022.
[26] T. Dao, "FlashAttention-2: Faster attention with better parallelism and work partitioning," arXiv preprint arXiv:2307.08691, 2023.
[27] L. Lavagno et al., EDA for IC Implementation, Circuit Design, and Process Technology. CRC Press, 2018.
[28] A. Piziali, Functional Verification Coverage Measurement and Analysis. Springer Science & Business Media, 2007.
[29] H. Foster, "Part 8: The 2020 Wilson Research Group functional verification study," https://blogs.sw.siemens.com/verificationhorizons/2021/01/06/part-8-the-2020-wilson-research-group-functional-verification-study/, 2021.
[30] R. R. Schaller, "Moore's law: Past, present and future," IEEE Spectrum, vol. 34, no. 6, pp. 52–59, 1997.
[31] P. Alcorn, "AMD Ryzen 9 3900X and Ryzen 7 3700X review," https://www.tomshardware.com/reviews/ryzen-9-3900x-7-3700x-review,6214.html, 2019.
[32] TechPowerUp, "AMD Ryzen 7 5800H specifications," https://www.techpowerup.com/cpu-specs/ryzen-7-5800h.c2368, 2021.
[33] R. Bahar et al., "Workshops on Extreme Scale Design Automation (ESDA) challenges and opportunities for 2025 and beyond," arXiv preprint arXiv:2005.01588, 2020.
[34] OpenAI, "GPT-4 technical report," 2023. [Online]. Available: https://doi.org/10.48550/arXiv.2303.08774
[35] Y. Umuroglu et al., "FINN: A framework for fast, scalable binarized neural network inference," in Proc. of FPGA, 2017.
[36] S. Abi-Karam et al., "GNNBuilder: An automated framework for generic graph neural network accelerator generation, simulation, and optimization," arXiv preprint arXiv:2303.16459, 2023.
[37] X. Liu et al., "High level synthesis of complex applications: An H.264 video decoder," in Proc. of FPGA, 2016.
[38] F. Fahim et al., "hls4ml: An open-source codesign workflow to empower scientific low-power machine learning devices," arXiv preprint arXiv:2103.05579, 2021.
[39] B. Reagen et al., "MachSuite: Benchmarks for accelerator design and customized architectures," in Proc. of IISWC, 2014.
[40] X. Liu et al., "HLS based open-source IPs for deep neural network acceleration," https://github.com/DNN-Accelerators/Open-Source-IPs, 2019.
[41] J. Karimov et al., "PolyBench: The first benchmark for polystores," in Performance Evaluation and Benchmarking for the Era of Artificial Intelligence: 10th TPC Technology Conference (TPCTC 2018), Revised Selected Papers, 2019.
[42] Y. Zhou et al., "Rosetta: A realistic high-level synthesis benchmark suite for software programmable FPGAs," in Proc. of FPGA, 2018.
[43] Xilinx, "Vitis-HLS-Introductory-Examples," https://github.com/Xilinx/Vitis-HLS-Introductory-Examples, 2023.
[44] Xilinx, "Vitis Libraries," https://github.com/Xilinx/Vitis_Libraries, 2019.
[45] TACLe, "TACLeBench," https://github.com/tacle/tacle-bench, 2017.
[46] K. A. Campbell, "Robust and reliable hardware accelerator design through high-level synthesis," Ph.D. dissertation, University of Illinois at Urbana-Champaign, 2017.