
Journal Pre-proof

Generative AI in Cybersecurity: A Comprehensive Review of LLM Applications and Vulnerabilities

Mohamed Amine Ferrag, Fatima Alwahedi, Ammar Battah, Bilel Cherif, Abdechakour
Mechri, Norbert Tihanyi, Tamas Bisztray, Merouane Debbah

PII: S2667-3452(25)00008-2
DOI: https://doi.org/10.1016/j.iotcps.2025.01.001
Reference: IOTCPS 80

To appear in: Internet of Things and Cyber–Physical Systems

Received Date: 24 October 2024


Revised Date: 16 January 2025
Accepted Date: 29 January 2025

Please cite this article as: M.A. Ferrag, F. Alwahedi, A. Battah, B. Cherif, A. Mechri, N. Tihanyi, T. Bisztray, M. Debbah, Generative AI in Cybersecurity: A Comprehensive Review of LLM Applications and Vulnerabilities, Internet of Things and Cyber–Physical Systems, https://doi.org/10.1016/j.iotcps.2025.01.001.

This is a PDF file of an article that has undergone enhancements after acceptance, such as the addition
of a cover page and metadata, and formatting for readability, but it is not yet the definitive version of
record. This version will undergo additional copyediting, typesetting and review before it is published
in its final form, but we are providing this version to give early visibility of the article. Please note that,
during the production process, errors may be discovered which could affect the content, and all legal
disclaimers that apply to the journal pertain.

© 2025 Published by KeAi Communications Co., Ltd.


Generative AI in Cybersecurity: A Comprehensive
Review of LLM Applications and Vulnerabilities
Mohamed Amine Ferrag∗‖, Fatima Alwahedi†, Ammar Battah†, Bilel Cherif†, Abdechakour Mechri‡,
Norbert Tihanyi†, Tamas Bisztray§, and Merouane Debbah¶
∗ Department of Computer Science, Guelma University, Algeria
† Technology Innovation Institute, UAE
‡ Concordia University, Canada
§ University of Oslo, Norway
¶ Khalifa University of Science and Technology, UAE
‖ Corresponding author: [email protected]

Abstract—This paper provides a comprehensive review of the future of cybersecurity through Generative AI and Large Language Models (LLMs). We explore LLM applications across various domains, including hardware design security, intrusion detection, software engineering, design verification, cyber threat intelligence, malware detection, and phishing detection. We present an overview of LLM evolution and its current state, focusing on advancements in models such as GPT-4, GPT-3.5, Mixtral-8x7B, BERT, Falcon2, and LLaMA. Our analysis extends to LLM vulnerabilities, such as prompt injection, insecure output handling, data poisoning, DDoS attacks, and adversarial instructions. We delve into mitigation strategies to protect these models, providing a comprehensive look at potential attack scenarios and prevention techniques. Furthermore, we evaluate the performance of 42 LLM models in cybersecurity knowledge and hardware security, highlighting their strengths and weaknesses. We thoroughly evaluate cybersecurity datasets for LLM training and testing, covering the lifecycle from data creation to usage and identifying gaps for future research. In addition, we review new strategies for leveraging LLMs, including techniques like Half-Quadratic Quantization (HQQ), Reinforcement Learning with Human Feedback (RLHF), Direct Preference Optimization (DPO), Quantized Low-Rank Adapters (QLoRA), and Retrieval-Augmented Generation (RAG). These insights aim to enhance real-time cybersecurity defenses and improve the sophistication of LLM applications in threat detection and response. Our paper provides a foundational understanding and strategic direction for integrating LLMs into future cybersecurity frameworks, emphasizing innovation and robust model deployment to safeguard against evolving cyber threats.

Index Terms—Generative AI, LLM, Transformer, Security, Cyber Security.

LIST OF ABBREVIATIONS
AI Artificial Intelligence
AIGC Artificial Intelligence Generated Content
APT Advanced Persistent Threat
CNN Convolutional Neural Network
CTG Controllable Text Generation
CVE Common Vulnerabilities and Exposures
CWE Common Weakness Enumeration
FNN Feed-Forward Neural Network
FRR False Refusal Rate
GPT Generative Pre-trained Transformers
GRU Gated Recurrent Units
GQA Grouped-Query Attention
HPC High-Performance Computing
HLS High-Level Synthesis
HQQ Half-Quadratic Quantization
IDS Intrusion Detection System
LLM Large Language Model
LoRA Low-rank Adapters
LSTM Long Short-Term Memory
ML Machine Learning
MLP Multi-Layer Perceptron
MQA Multi-Query Attention
NIST National Institute of Standards and Technology
NLP Natural Language Processing
NLU Natural Language Understanding
ORPO Odds Ratio Preference Optimization
PEFT Parameter Efficient Fine-Tuning
PLM Pre-trained Language Model
PPO Proximal Policy Optimization
RAG Retrieval-Augmented Generation
RLHF Reinforcement Learning from Human Feedback
RNN Recurrent Neural Networks
RTL Register-Transfer Level
SARD Software Assurance Reference Dataset
SFT Supervised Fine-Tuning
SVM Support Vector Machine
TRPO Trust Region Policy Optimization

I. INTRODUCTION

The history of Natural Language Processing (NLP) dates back to the 1950s when the Turing test was developed. However, NLP has seen significant advancements in recent decades with the introduction of Recurrent Neural Networks (RNN) [1], Long Short-Term Memory (LSTM) [2], Gated Recurrent Units (GRU) [3], and Transformer methods [4]. RNN was first introduced in the 1990s to model data sequences. LSTM, a variant of RNN, was introduced in 1997, which addressed the vanishing gradient problem and allowed for longer-term memory in NLP models. GRU, another variant of RNN, was introduced in 2014, which reduced the number of parameters and improved computational efficiency [5]. The latest breakthrough in NLP was the introduction of Transformers
in 2017, enabling parallel processing of sequential data and revolutionizing tasks like machine translation. These methods have significantly improved various NLP tasks, including sentiment analysis, language generation, and translation [4], [6]–[8].

Cybersecurity is an ever-evolving field, with threats becoming increasingly sophisticated and complex. As organizations and individuals rely on digital technologies for communication, commerce, and critical infrastructure, the need for robust cybersecurity measures has never been greater [9]. The scale and diversity of cyber threats make it a daunting challenge for security professionals to effectively identify, detect, and defend against them. In this context, Large Language Models (LLMs) have emerged as a game-changing technology with the potential to enhance cybersecurity practices significantly [10]–[14]. These models, powered by advanced NLP and Machine Learning (ML) techniques, offer a new frontier in the fight against cyber threats [15], [16]. This article explores the motivations and applications of LLMs in cybersecurity.

Cybersecurity professionals often need to sift through a vast amount of textual data, including security alerts, incident reports, threat feeds, and research papers, to stay ahead of evolving threats. LLMs, like Falcon 180b [17], possess natural language understanding capabilities that enable them to parse, summarize, and contextualize this information efficiently [7], [18], [19]. They can assist in rapidly identifying relevant threat intelligence, allowing analysts to make more informed decisions and prioritize responses [20]. LLMs can excel in various domains within cybersecurity [21], [22]. Figure 1 highlights the top nine use cases and applications for LLMs in this field [23], [24].

1) Threat Detection and Analysis: LLMs can analyze vast network data in real-time to detect anomalies and potential threats. They can recognize patterns indicative of cyber attacks, such as malware, phishing attempts, and unusual network traffic [20].
2) Phishing Detection and Response: LLMs can identify phishing emails by analyzing the text for malicious intent and comparing it to known phishing examples. They can also generate alerts and recommend preventive actions [25] (a minimal sketch of this use case follows the list).
3) Incident Response: During a cybersecurity incident, LLMs can assist by providing rapid analysis of the situation, suggesting mitigation strategies, and automating responses where applicable [26].
4) Security Automation: LLMs can facilitate the automation of routine security tasks such as patch management, vulnerability assessments, and compliance checks. This reduces the workload on cybersecurity teams and allows them to focus on more complex tasks [10].
5) Cyber Forensics: LLMs can help in forensic analysis by parsing through logs and data to determine the cause and method of attack, thus aiding in the recovery process and future prevention strategies [27].
6) Chatbots: LLMs significantly enhance the capabilities of chatbots in cybersecurity environments by providing user interaction, incident reporting and handling, real-time assistance, training and simulations, and FAQ automation [28].
7) Penetration Testing: LLMs can help generate scripts or modify existing ones to automate certain parts of the penetration testing process. This includes scripts for vulnerability scanning, network mapping, and exploiting known vulnerabilities [29].
8) Security Protocols Verification: LLMs can help verify the security of protocols such as TLS/SSL and IPSec.
9) Security Training and Awareness: LLMs can generate training materials tailored to an organization's needs. They can also simulate phishing attacks and other security scenarios to train employees to recognize and respond to security threats [30].

Fig. 1: LLM Use Cases and Applications for Cybersecurity (the nine use cases listed above: threat detection and analysis; phishing detection and response; incident response; security automation; cyber forensics; chatbots; penetration testing; security protocols verification; security training and awareness).
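To make use case 2 concrete, the following minimal sketch shows one way an LLM could be wired into phishing email triage. It is an illustration, not an implementation from this paper: the classify_email helper, the prompt wording, and the llm callable (any text-in, text-out model client) are assumptions introduced here.

```python
# Minimal sketch of LLM-assisted phishing triage (use case 2).
# The `llm` argument is any text-in/text-out callable (e.g., a wrapper
# around a hosted or local model); the prompt and labels are illustrative.

PROMPT_TEMPLATE = (
    "You are a security analyst. Classify the email below as "
    "PHISHING or BENIGN, then give a one-sentence justification.\n"
    "Email:\n{email}\nAnswer:"
)

def classify_email(email_text: str, llm) -> dict:
    """Ask the model for a label; keep the raw rationale for the analyst."""
    answer = llm(PROMPT_TEMPLATE.format(email=email_text))
    label = "PHISHING" if "PHISHING" in answer.upper() else "BENIGN"
    return {"label": label, "rationale": answer.strip()}

if __name__ == "__main__":
    # Stand-in model so the sketch runs without external dependencies;
    # a real deployment would replace this with an actual LLM client.
    def toy_llm(prompt: str) -> str:
        suspicious = ("verify your account", "urgent", "password")
        hit = any(s in prompt.lower() for s in suspicious)
        return "PHISHING: urgency and credential request." if hit else "BENIGN."

    email = "URGENT: verify your account password within 24 hours."
    print(classify_email(email, toy_llm))
```

In practice the model's verdict would feed an alerting pipeline rather than being trusted blindly; the vulnerabilities reviewed in Section VIII (for example, prompt injection via attacker-controlled email text) apply directly to this pattern.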
The primary aim of this paper is to provide an in-depth and comprehensive review of the future of cybersecurity using Generative AI and LLMs, covering all relevant topics in the cyber domain. The contributions of this study are summarized below:
• We review LLMs' applications for cybersecurity use cases, such as hardware design security, intrusion detection, software engineering, design verification, cyber threat intelligence, malware detection, and phishing and spam detection, providing a nuanced understanding of LLM capabilities across different cybersecurity domains;
• We present a comprehensive overview of LLMs in cybersecurity, detailing their evolution and current state, including advancements in 42 specific models, such as GPT-4o, GPT-4, BERT, Falcon, and LLaMA models;
• We analyze the vulnerabilities associated with LLMs, including prompt injection, insecure output handling, training data poisoning, inference data poisoning, DDoS attacks, and adversarial natural language instructions. We also examine the mitigation strategies to safeguard these models from such vulnerabilities, providing a comprehensive look at potential attack scenarios and prevention techniques;
• We evaluate the performance of 42 LLM models on different datasets in the cybersecurity domain;
• We thoroughly evaluate cybersecurity datasets tailored for LLM training and testing. This includes a lifecycle analysis from dataset creation to usage, covering various stages such as data cleaning, preprocessing, annotation, and labeling. We also compare cybersecurity datasets to identify gaps and opportunities for future research;
• We provide the challenges and limitations of employing LLMs in cybersecurity settings, such as dealing with adversarial attacks and ensuring robustness. We also discuss the implications of these challenges for future LLM deployments and the development of secure, optimized models;
• We discuss novel insights and strategies for leveraging LLMs in cybersecurity, including advanced techniques such as Half-Quadratic Quantization (HQQ), Reinforcement Learning with Human Feedback (RLHF), Direct Preference Optimization (DPO), Odds Ratio Preference Optimization (ORPO), GPT-Generated Unified Format (GGUF), Quantized Low-Rank Adapters (QLoRA), and Retrieval-Augmented Generation (RAG). These insights aim to enhance real-time cybersecurity defenses and improve the sophistication of LLM applications in threat detection and response.

The rest of this paper is organized as follows. Section II presents an in-depth analysis of related reviews in the field, charting the evolution and state of LLMs in artificial intelligence. Section III delves into the preliminaries of NLP applications for cybersecurity, covering foundational models and their advancements. Section IV discusses LLM-based solutions specific to cybersecurity. Section V reviews general LLM models. Section VI reviews code-specific LLM models. Section VII explores various cybersecurity datasets designed for LLM training and evaluation, detailing their development lifecycle and specific attributes. Section VIII focuses on the vulnerabilities associated with LLMs and the strategies for their mitigation, introducing a classification of potential threats and defense mechanisms. Section IX offers comprehensive insights into the challenges and limitations of integrating LLMs into cybersecurity frameworks, including practical considerations and theoretical constraints. Finally, Section X concludes the paper by summarizing the key findings and proposing directions for future research in LLMs and cybersecurity. A brief overview of the paper's structure is illustrated in Figure 2.

Fig. 2: Survey Structure (from Section I to Section X). The survey is organized as follows: I. Introduction (cybersecurity use cases; contributions). II. Related Reviews (A. Applications of LLMs in Hardware Design Security; B. Evaluation of LLMs; C. Evolution and State of LLMs in AI; D. Advancements in PLMs for NLP; E. Instruction Fine-Tuning for LLMs; F. LLMs in Software Engineering; G. Multimodal Algorithms; H. Alignment Requirements for LLMs; I. Knowledge-Enhanced Pre-trained Language Models; J. Controllable Text Generation in NLG; K. LLM for Cyber Security; L. Our survey compared to related surveys). III. Preliminaries of NLP for Cybersecurity (A. Recurrent neural networks; B. Transformer models). IV. LLM-based models for Cybersecurity (A. Recurrent Neural Networks-based models; B. Transformer-based models). V. General LLMs (A. Prevalent LLMs; B. LLMs performance in the cybersecurity domain). VI. Code-specific LLMs (A. Prevalent LLMs; B. Datasets Development for Code-centric LLM Models; C. Vulnerabilities Analysis of LLM-Generated Code). VII. Cybersecurity datasets for LLMs (A. Cyber Security Dataset Lifecycle; B. Software Cyber Security datasets). VIII. LLM Vulnerabilities and Mitigation (A. Prompt Injection; B. Insecure Output Handling; C. Adversarial Natural Language Instructions; D. Automatic adversarial prompt generation; E. Training Data Poisoning; F. Inference Data Poisoning; G. Insecure Plugins; H. Denial of Service (DoS) attack). IX. LLM Cybersecurity Insights, Challenges and Limitations (A. Challenges and Limitations; B. LLM Cybersecurity Insights). X. Conclusion.
TABLE I: Summary of Related Reviews on Large Language Models
Coverage columns: Data. (datasets used for training and fine-tuning LLMs for security use cases); Vuln. (LLM vulnerabilities and mitigation); Comp. (experimental analysis of LLM models' performance in cybersecurity knowledge); Optim. (optimization strategies for large language models in cybersecurity); Hardw. (experimental analysis of LLM models' performance in hardware security). Marks: ✗ = not covered; ◐ = partially covered; ✓ = covered.

• LLMs in Enhancing Hardware Design Security (Saha et al. [31], 2023): discusses applications of LLMs in hardware design security, including vulnerability introduction, assessment, verification, and countermeasure development. Coverage: ✗ ✗ ✗ ✗ ✗.
• Comprehensive Evaluation Methodologies for LLMs (Chang et al. [30], 2023): provides an analysis of LLM evaluations focusing on criteria, context, methodologies, and future challenges. Coverage: ✗ ✗ ✗ ✗ ✗.
• The Evolutionary Path of LLMs in AI (Zhao et al. [26], 2023): surveys the evolution of LLMs in AI, focusing on pre-training, adaptation tuning, utilization, and capacity evaluation. Coverage: ✗ ✗ ✗ ✗ ✗.
• Recent Advancements in PLMs for NLP (Min et al. [32], 2023): reviews advancements in PLMs for NLP, covering paradigms like Pre-train then Fine-tune, Prompt-based Learning, and NLP as Text Generation. Coverage: ✗ ✗ ✗ ✗ ✗.
• Exploring Instruction Fine-Tuning in LLMs (Zhang et al. [33], 2023): explores instruction fine-tuning for LLMs, covering methodologies, datasets, models, and multi-modality techniques. Coverage: ✗ ✗ ✗ ✗ ✗.
• Applying LLMs in Software Engineering (Fan et al. [34], 2023): surveys the use of LLMs in Software Engineering, discussing applications, challenges, and hybrid approaches. Coverage: ✗ ✗ ✗ ✗ ✗.
• Understanding Multimodal Algorithms (Wu et al. [35], 2023): provides an overview of multimodal algorithms, covering definition, evolution, technical aspects, and challenges. Coverage: ✗ ✗ ✗ ✗ ✗.
• Defining Alignment Requirements for LLMs (Liu et al. [36], 2023): proposes a taxonomy of alignment requirements for LLMs and discusses harmful content concepts. Coverage: ✗ ✗ ✗ ✗ ✗.
• Incorporating External Knowledge in PLMs (Hu et al. [37], 2023): reviews KE-PLMs, focusing on incorporating different types of knowledge into PLMs for NLP. Coverage: ✗ ✗ ✗ ✗ ✗.
• Advances in Controllable Text Generation (Zhang et al. [38], 2023): reviews CTG in NLG, focusing on Transformer-based PLMs and challenges in controllability. Coverage: ✗ ✗ ✗ ✗ ✗.
• LLM for Blockchain Security (He et al. [39], 2024): analyzes existing research to understand how LLMs can improve blockchain systems' security. Coverage: ◐ in two areas, ✗ in the remaining three.
• LLM for Critical Infrastructure Protection (Yigit et al. [40], 2024): proposes advanced strategies using Generative AI and Large Language Models to enhance resilience and security. Coverage: ✗ ✗ ✗ ✗ ✗.
• Software Testing with Large Language Models (Wang et al. [41], 2024): explores how LLMs can enhance software testing, examining tasks, techniques, and future research directions. Coverage: ✗ ✗ ✗ ✗ ✗.
• Malicious Insider Threat Detection Using Machine Learning Methods (Alzaabi et al. [27], 2024): recommends advanced ML methods like deep learning and NLP for better detection and mitigation of insider threats in cybersecurity, emphasizing the need for integrating time-series techniques. Coverage: ◐ in two areas, ✗ in the remaining three.
• Advancements in Large Language Models (Raiaan et al. [28], 2024): reviews the evolution, architectures, applications, societal impacts, and challenges of LLMs, aiding practitioners, researchers, and experts in understanding their development and prospects. Coverage: ✗ ✗ ✗ ✗ ✗.
• Applications of LLMs in cybersecurity tasks (Xu et al. [42], 2024): highlights the diverse applications of LLMs in cybersecurity tasks such as vulnerability detection, malware analysis, and intrusion and phishing detection. Coverage: ◐ in two areas, ✗ in the remaining three.
• Retrieval-Augmented Generation for LLMs (Zhao et al. [28], 2024): reviews how RAG has been integrated into various AIGC scenarios to overcome common challenges such as updating knowledge, handling long-tail data, mitigating data leakage, and managing costs associated with training and inference. Coverage: ✗ ✗ ✗ ✗ ✗.
• Overview of Parameter Efficient Fine-Tuning (PEFT) (Han et al. [43], 2024): reviews various PEFT algorithms, their effectiveness, and the computational overhead involved. Coverage: ✗ ✗ ✗ ✗ ✗.
• LLM for Cyber Security (Zhang et al. [44], 2024): conducts a systematic literature review of over 180 works on applying LLMs in cybersecurity. Coverage: ◐ in two areas, ✗ in the remaining three.
• LLM with security and privacy issues (Yao et al. [10], 2024): explores the dual impact of LLMs on security and privacy, highlighting their potential to enhance cybersecurity and data protection while also posing new risks and vulnerabilities. Coverage: ◐ in two areas, ✗ in the remaining three.
• THIS SURVEY (Ferrag et al., 2024): provides an in-depth review of using Generative AI and Large Language Models (LLMs) in cybersecurity. Coverage: ✓ ✓ ✓ ✓ ✓.

II. RELATED REVIEWS

This section delves into a curated collection of recent articles that significantly contribute to the evolving landscape of LLMs and their multifaceted applications. These reviews offer a comprehensive and insightful exploration into various dimensions of LLMs, including their innovative applications in hardware design security, evaluation methodologies, and evolving role in artificial intelligence. Further, they cover cutting-edge advancements in Pre-trained Language Models (PLMs) for NLP, delve into the intricacies of instruction fine-tuning for LLMs, and explore their impactful integration into software engineering. The section also encompasses an in-depth look at multimodal algorithms, examines the critical aspect of alignment requirements for LLMs, and discusses integrating external knowledge into PLMs to enhance NLP tasks. Lastly, it sheds light on the burgeoning field of Controllable Text Generation (CTG) in Natural Language Generation (NLG), highlighting the latest trends and challenges in this dynamic and rapidly advancing area of research [45]–[47]. Table I presents a comprehensive summary of existing reviews on LLMs across various application domains.
A. Evaluation of LLMs
Chang et al. [30] offers a comprehensive analysis of LLM evaluations, addressing three key aspects: the criteria for evaluation (what to evaluate), the context (where to evaluate), and the methodologies (how to evaluate). It thoroughly reviews various tasks across different domains to understand the successes and failures of LLMs, contributing to future research directions. The paper also discusses current evaluation metrics, datasets, and benchmarks and introduces novel approaches, providing a deep understanding of the current evaluation landscape. Additionally, it highlights future challenges in LLM evaluation and supports the research community by open-sourcing related materials, fostering collaborative advancements in the field.

B. Evolution and State of LLMs in AI
Zhao et al. [26] provides an in-depth survey of LLMs' evolution and current state in artificial intelligence. It traces the progression from statistical language models to neural language models, specifically focusing on the recent emergence of pre-trained language models (PLMs) using Transformer models trained on extensive corpora. The paper emphasizes the significant advancements achieved by scaling up these models, noting that LLMs demonstrate remarkable performance improvements beyond a certain threshold and exhibit unique capabilities not found in smaller-scale models. The survey covers four critical aspects of LLMs: pre-training, adaptation tuning, utilization, and capacity evaluation, providing insights into both their technical evolution and the challenges they pose. Additionally, the paper discusses the resources available for LLM development and explores potential future research directions, underlining the transformative effect of LLMs on AI development and application.

C. Advancements in PLMs for NLP
Min et al. [32] surveys the latest advancements in leveraging PLMs for NLP, organizing the approaches into three main paradigms. Firstly, the "Pre-train then Fine-tune" method involves general pre-training on large unlabeled datasets followed by specific fine-tuning for targeted NLP tasks. Secondly, "Prompt-based Learning" uses tailored prompts to transform NLP tasks into formats akin to a PLM's pre-training, enhancing the model's performance, especially in few-shot learning scenarios. Lastly, the "NLP as Text Generation" paradigm reimagines NLP tasks as text generation problems, fully capitalizing on the strengths of generative models like GPT-2 and T5. These paradigms represent the cutting-edge methods in utilizing PLMs for various NLP applications.

D. Instruction Fine-Tuning for LLMs
Zhang et al. [33] delves into the field of instruction fine-tuning for LLMs, offering a detailed exploration of various facets of this rapidly advancing area. It begins with an overview of the general methodologies used in instruction fine-tuning, then discusses the construction of commonly-used, representative datasets tailored for this approach. The survey highlights a range of instruction-fine-tuned models, showcasing their diversity and capabilities. It also examines multi-modality techniques and datasets, including those involving images, speech, and video, reflecting the broad applicability of instruction tuning. The adaptation of LLMs to different domains and applications using instruction tuning strategies is reviewed, demonstrating the versatility of this method. Additionally, the survey addresses efforts to enhance the efficiency of instruction fine-tuning, focusing on reducing computational and time costs. Finally, it evaluates these models, including performance analysis and critical perspectives, offering a holistic view of the current state and potential of instruction fine-tuning in LLMs.

E. LLMs in Software Engineering
Fan et al. [34] present a survey on using LLMs in Software Engineering (SE), highlighting their potential applications and open research challenges. LLMs, known for their emergent properties, offer novel and creative solutions across various Software Engineering activities, including coding, design, requirements analysis, bug fixing, refactoring, performance optimization, documentation, and analytics. Despite these advantages, the paper also acknowledges the significant technical challenges these emergent properties bring, such as the need for methods to eliminate incorrect solutions, notably hallucinations. The survey emphasizes the crucial role of hybrid approaches, which combine traditional Software Engineering techniques with LLMs, in developing and deploying reliable, efficient, and effective LLM-based solutions for Software Engineering. This approach suggests a promising pathway for integrating advanced AI models into practical software development processes.

F. Multimodal Algorithms
Wu et al. [35] addresses a significant gap in understanding multimodal algorithms by providing a comprehensive overview of their definition, historical development, applications, and challenges. It begins by defining multimodal models and algorithms, then traces their historical evolution, offering insights into their progression and significance. The paper serves as a practical guide, covering various technical aspects essential to multimodal models, such as knowledge representation, selection of learning objectives, model construction, information fusion, and prompts. Additionally, it reviews current algorithms employed in multimodal models and discusses commonly used datasets, thus laying a foundation for future research and evaluation in this field. The paper concludes by exploring several applications of multimodal models and delving into key challenges that have emerged from their recent development, shedding light on both the potential and the limitations of these advanced computational tools.

G. Alignment Requirements for LLMs
Liu et al. [36] propose a taxonomy of alignment requirements for LLMs to aid practitioners in understanding and effectively implementing alignment dimensions and inform data
collection efforts for developing robust alignment processes. The paper dissects the concept of "harmful" generated content into specific categories, such as harm to individuals (like emotional harm, offensiveness, and discrimination), societal harm (including instructions for violent or dangerous behaviors), and harm to stakeholders (such as misinformation impacting business decisions). Citing an imbalance in Anthropic's alignment data, the paper points out the uneven representation of various harm categories, like the high frequency of "violence" versus the marginal appearance of "child abuse" and "self-harm." This observation supports the argument that alignment techniques heavily dependent on data cannot ensure that LLMs will uniformly align with human behaviors across all aspects. The authors' own measurement studies reveal that aligned models do not consistently show improvements across all harm categories despite the alignment efforts claimed by the model developers. Consequently, the paper advocates for a framework that allows a more transparent, multi-objective evaluation of LLM trustworthiness, emphasizing the need for a comprehensive and balanced approach to alignment in LLM development.

H. Knowledge-Enhanced Pre-trained Language Models
Hu et al. [37] offers a comprehensive review of Knowledge-Enhanced Pre-trained Language Models (KE-PLMs), a burgeoning field aiming to address the limitations of standard PLMs in NLP. While PLMs trained on vast text corpora demonstrate impressive performance across various NLP tasks, they often fall short in areas like reasoning due to the absence of external knowledge. The paper focuses on how incorporating different types of knowledge into PLMs can overcome these shortcomings. It introduces distinct taxonomies for Natural Language Understanding (NLU) and Natural Language Generation (NLG) to distinguish between these two core areas of NLP. For NLU, the paper categorizes knowledge types into linguistic, text, knowledge graph (KG), and rule knowledge. In the context of NLG, KE-PLMs are classified into KG-based and retrieval-based methods. By outlining these classifications and exploring the current state of KE-PLMs, the paper provides not only clear insights into this evolving domain but also identifies promising future directions for the development and application of KE-PLMs, highlighting their potential to enhance the capabilities of PLMs in NLP tasks significantly.

I. Controllable Text Generation in NLG
Zhang [38] provides a critical and systematic review of Controllable Text Generation (CTG), a burgeoning field in NLG that is essential for developing advanced text generation technologies tailored to specific practical constraints. The paper focuses on using large-scale pre-trained language models (PLMs), particularly those based on transformer architecture, which have established a new paradigm in NLG due to their ability to generate more diverse and fluent text. However, the limited interpretability of deep neural networks poses challenges to the controllability of these methods, making transformer-based PLM-driven CTG a rapidly evolving and challenging research area. The paper surveys various approaches that have emerged in the last 3-4 years, each targeting different CTG tasks with varying controlled constraints. It provides a comprehensive overview of common tasks, main approaches, and evaluation methods in CTG and discusses the current challenges and potential future directions in the field. Claiming to be the first to summarize state-of-the-art CTG techniques from the perspective of Transformer-based PLMs, this paper aims to assist researchers and practitioners in keeping pace with the academic and technological developments in CTG, offering them an insightful landscape of the field and a guide for future research.

J. LLM for Cyber Security
Zhang et al. [44] examines the integration of LLMs within cybersecurity. Through an extensive literature review involving over 127 publications from leading security and software engineering venues, this paper aims to shed light on LLMs' multifaceted roles in enhancing cybersecurity measures. The survey pinpoints various applications for LLMs in detecting vulnerabilities, analyzing malware, and managing network intrusions and phishing threats. It highlights the current limitations regarding the datasets used, which often lack size and diversity, thereby underlining the necessity for more robust datasets tailored to these security tasks. The paper also identifies promising methodologies like fine-tuning and domain-specific pre-training, which could better harness the potential of LLMs in cybersecurity contexts.

Yao et al. [10] explores the dual role of LLMs in security and privacy, highlighting their benefits in enhancing code security and data confidentiality and detailing potential risks and inherent vulnerabilities. The authors categorize the applications and challenges into "The Good," "The Bad," and "The Ugly," where they discuss LLMs' positive impacts, their use in offensive applications, and their susceptibility to specific attacks, respectively. The paper emphasizes the need for further research on threats like model and parameter extraction attacks and emerging techniques such as safe instruction tuning, underscoring the complex balance between leveraging LLMs for improved security and mitigating their risks.

Saha et al. [31] discussed several key applications of LLMs in the context of hardware design security. The paper illustrates how LLMs can intentionally introduce vulnerabilities and weaknesses into RTL (Register-Transfer Level) designs. This process is guided by well-crafted prompts in natural language, demonstrating the model's ability to understand and manipulate complex technical designs. The authors explore using LLMs to assess the security of hardware designs. The model is employed to identify vulnerabilities, weaknesses, and potential threats. It is also used to pinpoint simple coding issues that could evolve into significant security bugs, highlighting the model's ability to evaluate technical designs critically. In this application, LLMs verify whether a hardware design adheres to specific security rules or policies. The paper examines the model's proficiency in calculating security metrics, understanding security properties, and generating functional testbenches to detect weaknesses. This part of the
study underscores the LLM's ability to conduct thorough and detailed verification processes. Finally, the paper investigates how effectively LLMs can be used to develop countermeasures against existing vulnerabilities in a design. This aspect focuses on the model's capability to solve problems and create solutions to enhance the security of hardware designs. Overall, the paper presents an in-depth analysis of how LLMs can be a powerful tool in various stages of hardware design security, from vulnerability introduction and assessment to verification and countermeasure development.

K. Our survey compared to related surveys
Our paper presents a more specialized and technical exploration of generative artificial intelligence and large language models in cybersecurity than previous literature reviews. Focusing on a broad array of cybersecurity domains such as hardware design security, intrusion detection systems, and software engineering, it targets a wider professional audience, including engineers, researchers, and industrial practitioners. This paper reviews 35 leading models like GPT-4, BERT, Falcon, and LLaMA, not only highlighting their applications but also their developmental trajectories, thereby providing a comprehensive insight into the current capabilities and future potential of these models in cybersecurity.

The paper also delves deeply into the vulnerabilities associated with LLMs, such as prompt injection, adversarial natural language instructions, and insecure output handling. It presents sophisticated attack scenarios and robust mitigation strategies, offering a detailed analysis crucial for understanding and protecting against potential threats. Additionally, the lifecycle of specialized cybersecurity datasets (covering creation, cleaning, preprocessing, annotation, and labeling) is scrutinized, providing essential insights into improving data integrity and utility for training and testing LLMs. This level of detail is vital for developing robust cybersecurity solutions that can effectively leverage the power of LLMs.

Lastly, the paper examines the challenges associated with deploying LLMs in cybersecurity contexts, emphasizing the necessity for model robustness and the implications of adversarial attacks. It introduces advanced methodologies such as Reinforcement Learning with Human Feedback (RLHF) and Retrieval-Augmented Generation (RAG) to enhance real-time cybersecurity operations. This focus not only delineates the current state of LLM applications in cybersecurity but also sets the direction for future research and practical applications, aiming to optimize and secure LLM deployments in an evolving threat landscape. This makes the paper an indispensable resource for anyone involved in cybersecurity and AI, bridging the gap between academic research and practical applications.

III. PRELIMINARIES OF NLP FOR CYBERSECURITY

This section presents the preliminaries of NLP for cybersecurity, including recurrent neural networks (LSTMs and GRUs) and transformer models.

A. Recurrent neural networks
Recurrent Neural Networks (RNNs) [48] are artificial neural networks that handle data sequences such as time series or NLP tasks. The RNN model consists of two linked recurrent neural networks. The first RNN encodes sequences of symbols into a fixed-length vector, while the second decodes this vector into a new sequence. This architecture aims to maximize the conditional probability of a target sequence from a given source sequence. When applied to cybersecurity, this model could be instrumental in threat detection and response systems by analyzing and predicting network traffic or log data sequences that indicate malicious activity. Integrating the conditional probabilities generated by this model could enhance anomaly detection frameworks, improving the identification of subtle or novel cyber threats. The model's ability to learn meaningful representations of data sequences further supports its potential to recognize complex patterns and anomalies in cybersecurity environments [49], [50].
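One way to act on this idea is to score how surprising each new event is under a sequence model: low conditional probability suggests an anomaly. The minimal sketch below shows the shape of such a detector; the SequenceScorer class is an illustrative bigram stand-in for a trained RNN language model, and the toy traces are assumptions, not data from the paper.

```python
import math
from collections import Counter

class SequenceScorer:
    """Illustrative stand-in for a trained sequence model (e.g., an RNN
    language model over events): scores each event by its conditional
    probability given the previous event, with add-one smoothing."""

    def __init__(self, sequences):
        self.pair_counts = {}
        self.vocab = set()
        for seq in sequences:
            self.vocab.update(seq)
            for prev, cur in zip(seq, seq[1:]):
                self.pair_counts.setdefault(prev, Counter())[cur] += 1

    def log_prob(self, prev, cur):
        counts = self.pair_counts.get(prev, Counter())
        total = sum(counts.values())
        return math.log((counts[cur] + 1) / (total + len(self.vocab) + 1))

def anomaly_score(seq, model):
    """Average negative log-probability of a trace; higher = more surprising."""
    pairs = list(zip(seq, seq[1:]))
    return -sum(model.log_prob(p, c) for p, c in pairs) / len(pairs)

# Toy "system call" traces standing in for benign training logs.
benign = [["open", "read", "close"]] * 50 + [["open", "write", "close"]] * 50
model = SequenceScorer(benign)
print(anomaly_score(["open", "read", "close"], model))    # low: familiar trace
print(anomaly_score(["open", "exec", "connect"], model))  # high: unfamiliar trace
```

A deployed system would replace the bigram scorer with a trained recurrent model and alert when the score exceeds a threshold calibrated on benign traffic.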
1) Gated Recurrent Units: GRUs are a recurrent neural network architecture designed to handle the vanishing gradient problem that can occur with standard recurrent networks. Introduced by Cho et al. in 2014 [51], GRUs simplify the LSTM (Long Short-Term Memory) model while retaining its ability to model long-term dependencies in sequential data. GRUs achieve this through two main gates: the update gate, which controls how much a new state overwrites the old state, and the reset gate, which determines how much past information to forget. These gates effectively regulate the flow of information, making GRUs adept at tasks like time series prediction, speech recognition, and natural language processing. The main steps of GRUs are organized as follows:

• Update Gate: The update gate determines how much information from the previous hidden state should be passed to the new one. The update gate is calculated using the following formula:

z_t = σ(W_z x_t + U_z h_{t−1})   (1)

where z_t is the update gate at time step t, W_z and U_z are the weight matrices, x_t is the input at time step t, and h_{t−1} is the previous hidden state. The sigmoid function, represented by σ, squashes the result into the range between 0 and 1. The update gate allows the GRU to decide how much of the previous hidden state information should be carried over into the new hidden state: as Equation (7) below shows, a value close to 0 keeps most of the previous hidden state, while a value close to 1 lets the new candidate state dominate. The update gate computation can be explored in the following three parts:

Part 1: Linear combination of inputs:

ẑ_t = W_z x_t + U_z h_{t−1}   (2)

Part 2: Application of the sigmoid function:

z_t = σ(ẑ_t)   (3)

Part 3: Element-wise multiplication of the resulting gate with a state vector, for example the previous hidden state:

z_t ⊙ h_{t−1}   (4)

where ⊙ represents the Hadamard product, also known as element-wise multiplication. This gating pattern, a sigmoid output multiplying a state vector element-wise, is the building block used in Equations (6) and (7).

• Reset Gate: The reset gate determines how much of the previous hidden state should be forgotten. The reset gate is calculated using the following formula:

r_t = σ(W_r x_t + U_r h_{t−1})   (5)

where r_t is the reset gate at time step t, W_r and U_r are the weight matrices, x_t is the input at time step t, and h_{t−1} is the previous hidden state.

• Candidate Hidden State: The candidate hidden state combines the input and the previous hidden state, filtered through the reset gate. The candidate hidden state is calculated using the following formula:

h̃_t = tanh(W x_t + U (r_t ⊙ h_{t−1}))   (6)

where h̃_t is the candidate hidden state at time step t, W and U are the weight matrices, x_t is the input at time step t, and r_t is the reset gate at time step t. In this equation, the input x_t is combined with the previous hidden state h_{t−1} through the weight matrices W and U. The reset gate r_t controls the extent to which the previous hidden state contributes to the candidate state: the element-wise product between the reset gate and the previous hidden state forms the reset vector r_t ⊙ h_{t−1}, which is transformed by the weight matrix U and combined with the input term W x_t. Finally, the result is passed through the hyperbolic tangent function to calculate the candidate hidden state h̃_t.

• New Hidden State: The new hidden state combines the previous and candidate hidden states, weighted by the update gate. The new hidden state is calculated using the following formula:

h_t = (1 − z_t) ⊙ h_{t−1} + z_t ⊙ h̃_t   (7)

where h_t is the new hidden state at time step t, z_t is the update gate at time step t, and h̃_t is the candidate hidden state at time step t. The new hidden state is a weighted combination of the previous hidden state h_{t−1} and the candidate hidden state h̃_t, with the weighting determined by the update gate: if the update gate is close to 0, the new hidden state is primarily the previous hidden state; if it is close to 1, the candidate hidden state dominates. The element-wise product between the update gate z_t and the candidate hidden state h̃_t forms the update vector z_t ⊙ h̃_t, the element-wise product between the complement of the update gate (1 − z_t) and the previous hidden state h_{t−1} forms the retention vector (1 − z_t) ⊙ h_{t−1}, and the two are added to calculate the new hidden state h_t.
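The following minimal NumPy sketch implements one GRU step exactly as written in Equations (1) and (5)-(7); the weight shapes and random initialization are illustrative assumptions, not values from the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, p):
    """One GRU time step following Equations (1), (5), (6), (7)."""
    z_t = sigmoid(p["Wz"] @ x_t + p["Uz"] @ h_prev)           # update gate, Eq. (1)
    r_t = sigmoid(p["Wr"] @ x_t + p["Ur"] @ h_prev)           # reset gate, Eq. (5)
    h_cand = np.tanh(p["W"] @ x_t + p["U"] @ (r_t * h_prev))  # candidate, Eq. (6)
    return (1.0 - z_t) * h_prev + z_t * h_cand                # new state, Eq. (7)

# Illustrative dimensions: 8-dimensional inputs, 16-dimensional hidden state.
rng = np.random.default_rng(0)
d_in, d_h = 8, 16
params = {name: rng.normal(scale=0.1, size=shape)
          for name, shape in [("Wz", (d_h, d_in)), ("Uz", (d_h, d_h)),
                              ("Wr", (d_h, d_in)), ("Ur", (d_h, d_h)),
                              ("W",  (d_h, d_in)), ("U",  (d_h, d_h))]}

h = np.zeros(d_h)
for x in rng.normal(size=(5, d_in)):  # a toy 5-step sequence
    h = gru_step(x, h, params)
print(h.shape)  # (16,)
```

In an intrusion detection setting, x_t would be a feature vector for the t-th event (for example, an encoded log line), and the final hidden state would feed a classifier head.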
2) Long Short-Term Memory: The LSTM [2] was designed to overcome the vanishing gradient problem that affects traditional recurrent neural networks (RNNs) during training, particularly over long sequences. By integrating memory cells that can maintain information over extended periods and gates that regulate the flow of information into and out of the cell, LSTMs provide an effective mechanism for learning dependencies and retaining information over time. This architecture has proved highly influential, becoming foundational to numerous applications in machine learning that require handling sequential data, such as natural language processing, speech recognition, and time series analysis. The impact of this work has been extensive, as it enabled the practical use and development of deep learning models for complex sequence modeling tasks. In cybersecurity, LSTMs can be used for anomaly detection, where they analyze network traffic or system logs to identify unusual patterns that may signify a security breach or malicious activity [52]–[54]. Their ability to learn from long sequences makes them particularly useful for detecting sophisticated attacks that evolve, such as advanced persistent threats (APTs) and ransomware. The main steps of the LSTM model are organized as follows:

• Input Gate: The first step in an LSTM-based RNN involves calculating the input gate, which determines the extent of new input to be added to the current state. The formula for the input gate is:

i_t = σ(W_i · [h_{t−1}, x_t] + b_i)   (8)

where i_t is the input gate at time step t, W_i is the weight matrix for the input gate, h_{t−1} is the hidden state from the previous time step, x_t is the input at time step t, and b_i is the bias for the input gate. The function σ(x) is the sigmoid activation function. This formula calculates the input gate i_t by first concatenating the previous hidden state h_{t−1} with the current input x_t. This combined vector is multiplied by the weight matrix W_i, and the bias b_i is added. Finally, the sigmoid activation function is applied to produce i_t, which ranges from 0 to 1 and represents how much the current input updates the hidden state.

• Forget Gate: The second step calculates the forget gate, determining how much of the previous state should be forgotten. The formula for the forget gate is:

f_t = σ(W_f · [h_{t−1}, x_t] + b_f)   (9)

where f_t is the forget gate at time step t, W_f is the weight matrix for the forget gate, h_{t−1} is the hidden state from the previous time step, x_t is the input at time step t, and b_f is the bias for the forget gate. The forget gate f_t is computed in the same way as the input gate: the previous hidden state h_{t−1} is concatenated with the current input x_t, multiplied by the weight matrix W_f, and offset by the bias b_f. The resulting value is passed through the sigmoid activation function to obtain f_t, which ranges from 0 to 1 and represents the degree to which the previous cell state is preserved or forgotten.

• Candidate Memory Cell: The third step calculates the candidate memory cell, representing the potential memory state update. The formula for the candidate memory cell is:

c̃_t = tanh(W_c · [h_{t−1}, x_t] + b_c)   (10)

where c̃_t is the candidate memory cell at time step t, W_c is the weight matrix for the candidate memory cell, h_{t−1} is the hidden state from the previous time step, x_t is the input at time step t, and b_c is the bias for the candidate memory cell. The function tanh(x) is the hyperbolic tangent activation function. In this formula, the candidate memory cell c̃_t is calculated by concatenating the previous hidden state h_{t−1} and the input x_t, then multiplying by the weight matrix W_c and adding the bias b_c. The result is passed through the hyperbolic tangent activation function, whose output ranges from -1 to 1, to control the magnitude of the memory cell update.

• Current Memory Cell: The fourth step calculates the current memory cell, which is the updated state of the memory cell, combining the effects of the forget and input gates. The formula for the current memory cell is:

c_t = f_t · c_{t−1} + i_t · c̃_t   (11)

where c_t is the current memory cell at time step t, f_t is the forget gate at time step t, c_{t−1} is the memory cell from the previous time step, i_t is the input gate at time step t, and c̃_t is the candidate memory cell at time step t. This equation represents the new memory cell state as a combination of the old state (modulated by the forget gate) and the potential update (modulated by the input gate).

• Output Gate: The final step calculates the output gate, which determines the amount of information output from the LSTM cell. In the standard LSTM formulation, the output gate is computed analogously to the input and forget gates, o_t = σ(W_o · [h_{t−1}, x_t] + b_o), and the new hidden state is obtained as h_t = o_t · tanh(c_t), so the output gate controls how much of the squashed memory cell is exposed as the hidden state.
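As with the GRU above, here is a compact NumPy sketch of one LSTM step, following Equations (8)-(11) plus the standard output gate; the shapes and initialization are again illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM time step following Equations (8)-(11) plus the standard
    output gate o_t = sigmoid(Wo.[h, x] + bo) and h_t = o_t * tanh(c_t)."""
    hx = np.concatenate([h_prev, x_t])            # [h_{t-1}, x_t]
    i_t = sigmoid(p["Wi"] @ hx + p["bi"])         # input gate, Eq. (8)
    f_t = sigmoid(p["Wf"] @ hx + p["bf"])         # forget gate, Eq. (9)
    c_cand = np.tanh(p["Wc"] @ hx + p["bc"])      # candidate cell, Eq. (10)
    c_t = f_t * c_prev + i_t * c_cand             # current cell, Eq. (11)
    o_t = sigmoid(p["Wo"] @ hx + p["bo"])         # output gate (standard form)
    return o_t * np.tanh(c_t), c_t

rng = np.random.default_rng(1)
d_in, d_h = 8, 16
params = {}
for g in ("i", "f", "c", "o"):
    params["W" + g] = rng.normal(scale=0.1, size=(d_h, d_h + d_in))
    params["b" + g] = np.zeros(d_h)

h, c = np.zeros(d_h), np.zeros(d_h)
for x in rng.normal(size=(5, d_in)):  # toy sequence, e.g., encoded log events
    h, c = lstm_step(x, h, c, params)
print(h.shape, c.shape)  # (16,) (16,)
```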
ing attention weights, compute a weighted sum of the
values in V , resulting in the final attention output. This F F N (x) = max(0, xW1 + b1 )W2 + b2 (16)
mechanism allows the model to dynamically focus on
the most relevant parts of the input sequence for making
predictions. Where: x is the input to the feed-forward network. W1 ,
• Multi-Head Attention: In the Transformer model, mul- b1 , W2 , and b2 are the weight and bias parameters of the
tiple attention heads enhance the model’s capability to feed-forward network. max(0, x) is the ReLU activation
simultaneously focus on different parts of the input se- function. This equation represents a simple feed-forward
quence. The multi-head attention is calculated as follows: neural network (FFN) operation in deep learning models.
The FFN operation is a multi-layer perceptron (MLP)
that transforms the input x into a new representation by
MHead(Q, K, V ) = Concat(hd1 , hd2 , . . . , hdh )W O passing it through two fully connected (dense) layers. The
(13) first layer is followed by a ReLU activation function,
which applies a non-linear activation to the input by
Where: hdi represents the output of the i-th at- setting all negative values to zero. This activation function
tention head, computed using the attention formula helps the model learn complex non-linear relationships
Attention(Qi , Ki , Vi ). Each Qi , Ki , and Vi are different between the input and output. The second layer is a linear
transformation that produces the final output of the FFN.

of
linear projections of the original inputs Q, K, and V .
W O is a linear transformation matrix applied to the The weight and bias parameters of the two layers, W1 ,
b1 , W2 , and b2 , are learned during training and allow the

ro
concatenated results of all attention heads. The Concat
function concatenates the outputs of each head along a model to learn different representations of the input data.
specific dimension. The outputs of individual heads, hdi , • Encoder and Decoder Blocks: In the Transformer model,

-p
are each computed using the scaled dot-product attention the encoder and decoder blocks transform the input
mechanism: sequences into the output sequences. The encoder and
re
decoder blocks can be calculated as follows:

Qi KiT
 
lP

hdi = Attention(Qi , Ki , Vi ) = softmax √ Vi


dk Enc(x) = LN(x + MHead(x, x, x)) (17)
(14)
na

This approach enables the Multi-Head Attention mech-


anism to capture various aspects of the input sequence, Dec(x, y) = LN(x+MHead(x, y, y)+MHead(x, x, x))
simultaneously focusing on different subspace represen- (18)
ur

tations. As a result, it facilitates the model’s capture of


more complex relationships and improves performance Where: x is the input to the encoder/decoder block. y
Jo

across different types of tasks. is the output from the previous encoder/decoder block.
• Layer Normalization: In the Transformer model, layer The Encoder block Enc(x) takes the input x and applies
normalization ensures the input is within a standard the Multi-Head Attention mechanism to compute the
range. The layer normalization can be calculated as attention scores between the input and itself. The result
follows: is then added to the input and passed through a Layer
Normalization operation. The output of the encoder block
x − mean(x) is the new representation of the input after processing
LN (x) = p (15)
var(x) through the Multi-Head Attention and Layer Normaliza-
tion operations. The Decoder block Dec(x, y) is similar to
Where x is the input to the layer normalization. the encoder block but also takes the output from the previ-
textmean(x) and textvar(x) are the mean and variance ous decoder block, y, as input. The Multi-Head Attention
of x, respectively. Layer Normalization aims to mitigate mechanism is applied to compute the attention scores
the internal covariate shift, which arises when the distri- between the input and the previous output and between
bution of activations in a layer changes during training. the input and itself. The results are added to the input
The normalization operation is performed by subtracting and passed through a Layer Normalization operation. The
the mean of the activations and dividing by the square output of the decoder block is the new representation
root of the variance. This ensures the activations have of the input after processing through the Multi-Head
a zero mean and unit variance, leading to more stable Attention and Layer Normalization operations.
training.
• Position-wise Feed Forward: The position-wise feed-
IV. LLM S - BASED MODELS FOR C YBER S ECURITY
forward network transforms the input and output repre-
sentations in the Transformer model. The position-wise This section reviews recent studies employing LLM-
feedforward can be calculated as follows: based models (i.e., Recurrent Neural Networks-based and

10
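A short NumPy sketch of Equations (12)-(14) follows: scaled dot-product attention and a loop-based multi-head wrapper. The head count, dimensions, and random projections are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention, Eq. (12)."""
    d_k = K.shape[-1]
    scores = softmax(Q @ K.T / np.sqrt(d_k))  # (n_q, n_k), rows sum to 1
    return scores @ V

def multi_head(X, heads, W_out):
    """Multi-head attention, Eqs. (13)-(14): project, attend per head, concat."""
    outs = [attention(X @ Wq, X @ Wk, X @ Wv) for (Wq, Wk, Wv) in heads]
    return np.concatenate(outs, axis=-1) @ W_out

rng = np.random.default_rng(2)
n, d_model, n_heads = 6, 32, 4          # 6 tokens, model width 32, 4 heads
d_k = d_model // n_heads
X = rng.normal(size=(n, d_model))       # token representations (self-attention)
heads = [tuple(rng.normal(scale=0.1, size=(d_model, d_k)) for _ in range(3))
         for _ in range(n_heads)]
W_out = rng.normal(scale=0.1, size=(n_heads * d_k, d_model))
print(multi_head(X, heads, W_out).shape)  # (6, 32)
```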
• Layer Normalization: In the Transformer model, layer normalization ensures the input is within a standard range. Layer normalization can be calculated as follows:

LN(x) = (x − mean(x)) / √var(x)   (15)

where x is the input to the layer normalization, and mean(x) and var(x) are the mean and variance of x, respectively. Layer normalization aims to mitigate the internal covariate shift, which arises when the distribution of activations in a layer changes during training. The normalization operation is performed by subtracting the mean of the activations and dividing by the square root of the variance. This ensures the activations have zero mean and unit variance, leading to more stable training.

• Position-wise Feed-Forward: The position-wise feed-forward network transforms the input and output representations in the Transformer model. The position-wise feed-forward network can be calculated as follows:

FFN(x) = max(0, x W_1 + b_1) W_2 + b_2   (16)

where x is the input to the feed-forward network; W_1, b_1, W_2, and b_2 are the weight and bias parameters of the feed-forward network; and max(0, x) is the ReLU activation function. This equation represents a simple feed-forward neural network (FFN) operation in deep learning models. The FFN operation is a multi-layer perceptron (MLP) that transforms the input x into a new representation by passing it through two fully connected (dense) layers. The first layer is followed by a ReLU activation function, which applies a non-linear activation to the input by setting all negative values to zero. This activation function helps the model learn complex non-linear relationships between the input and output. The second layer is a linear transformation that produces the final output of the FFN. The weight and bias parameters of the two layers, W_1, b_1, W_2, and b_2, are learned during training and allow the model to learn different representations of the input data.

• Encoder and Decoder Blocks: In the Transformer model, the encoder and decoder blocks transform the input sequences into the output sequences. The encoder and decoder blocks can be calculated as follows:

Enc(x) = LN(x + MHead(x, x, x))   (17)

Dec(x, y) = LN(x + MHead(x, y, y) + MHead(x, x, x))   (18)

where x is the input to the encoder/decoder block and y is the output from the previous encoder/decoder block. The encoder block Enc(x) takes the input x and applies the multi-head attention mechanism to compute the attention scores between the input and itself. The result is then added to the input and passed through a layer normalization operation. The output of the encoder block is the new representation of the input after processing through the multi-head attention and layer normalization operations. The decoder block Dec(x, y) is similar to the encoder block but also takes the output from the previous decoder block, y, as input. The multi-head attention mechanism is applied to compute the attention scores between the input and the previous output and between the input and itself. The results are added to the input and passed through a layer normalization operation. The output of the decoder block is the new representation of the input after processing through the multi-head attention and layer normalization operations.
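Putting Equations (15)-(17) together, here is a sketch of one encoder block built from the multi_head helper in the previous sketch; following common practice, this version also adds the feed-forward sublayer with its own residual connection and normalization, which Equation (17) leaves implicit.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Eq. (15), with a small epsilon added for numerical stability."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def ffn(x, W1, b1, W2, b2):
    """Position-wise feed-forward network, Eq. (16)."""
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

def encoder_block(x, heads, W_out, ffn_params):
    """Eq. (17): self-attention + residual + LayerNorm, then the FFN sublayer.
    Reuses multi_head(...) from the previous sketch."""
    x = layer_norm(x + multi_head(x, heads, W_out))  # Enc(x) = LN(x + MHead(x,x,x))
    return layer_norm(x + ffn(x, *ffn_params))       # FFN sublayer (standard practice)

# Example usage, with X, heads, and W_out defined as in the previous sketch:
# rng = np.random.default_rng(2)
# W1 = rng.normal(scale=0.1, size=(32, 64)); b1 = np.zeros(64)
# W2 = rng.normal(scale=0.1, size=(64, 32)); b2 = np.zeros(32)
# print(encoder_block(X, heads, W_out, (W1, b1, W2, b2)).shape)  # (6, 32)
```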
Yin et al. 2017 RNN-ID (Recurrent Benchmark data set Intrusion Detection The proposed model can improve Other machine learning algorithms
[55] Neural Network- the accuracy of intrusion detection and deep learning models, such as
Intrusion Detection) convolutional neural networks and
transformers are not considered in
the comparison
Güera et al. 2018 Temporal-aware Large set of deepfake videos collected Detection of Deep- The proposed method achieves The proposed approach’s effective-
[56] Pipeline (CNN and from multiple video websites fake Videos competitive results in detecting ness might be limited to the specific
RNN) deepfake videos while using a sim- types of deepfakes present in the
ple architecture dataset
Althubiti et 2018 LSTM RNN CSIC 2010 HTTP dataset Web Intrusion De- Proposal of LSTM RNN for web The paper only uses the CSIC 2010
al. [57] tection intrusion detection. High accuracy HTTP dataset, which may not be
rate (0.9997) in binary classifica- representative of all types of web
tion. application attacks
Xu et al. 2018 GRU-MLP-Softmax KDD 99 and NSL-KDD data sets Network Intrusion The system achieves leading perfor- The paper does not provide infor-
[58] (Gated Recurrent Detection mance with overall detection rates mation about the scalability of the
Unit, Multilayer of 99.42% using KDD 99 and proposed model
Perceptron, 99.31% using NSL-KDD, with low
Softmax) false positive rates
Ferrag and 2019 Blockchain and CICID2017 dataset, Power system Energy framework Proposal of DeepCoin framework The paper does not address the po-
Leandros RNN dataset, Bot-IoT dataset for Smart Grids combining blockchain and deep tential scalability issues that may
[59] learning for smart grid security arise as the number of nodes in the
network increases
Chawla et 2019 GRU with CNN ADFA (Australian Defence Force Intrusion Detection Achieved improved performance by The proposed system is vulnerable

of
al. [60] Academy) dataset combining GRUs and CNNs to adversarial attacks
Ullah et al. 2022 LSTM, BiLSTM, IoT-DS2, MQTTset, IoT-23, and Intrusion Detection Validation of the proposed mod- Further research is necessary to en-
[61] and GRU datasets els using various datasets, achieving hance their scalability for practical

ro
high accuracy, precision, recall, and applications in cybersecurity
F1 score
Donkol et 2023 LSTM CSE-CIC-IDS2018, CICIDS2017, and Intrusion Detection The proposed system outperformed Future research could explore the

-p
al. [62] UNSW-NB15 datasets other methods such as LPBoost and applicability of the proposed system
DNNs in terms of accuracy, preci- to other datasets
sion, recall, and error rate
Zhao et al. 2023 End-to-End Recur- IDS2017 and IDS2018 datasets Intrusion attacks + Address network-induced phe- The proposed system is vulnerable
re
[63] rent Neural Network and malware nomena that may result in misclas- to adversarial attacks
sifications in traffic detection sys-
tems used in cybersecurity
lP

Wang et al. 2021 RNN A large-scale patch dataset PatchDB Software Security The PatchRNN system can effec- The PatchRNN system can only
[64] tively detect secret security patches support C/C++
with a low false positive rate
Polat et al. 2022 LSTM and GRU SDN-based SCADA system Detection of DDoS The results show that the proposed The paper only focuses on detecting
na

[65] attacks RNN model achieves an accuracy DDoS attacks and does not address
of 97.62% for DDoS attack detec- other types of cyber threats (e.g., in-
tion sider threats or advanced persistent
threats)
ur

A. Recurrent Neural Networks-based models

1) Intrusion Detection: Yin et al. [55] propose a deep learning approach for intrusion detection using recurrent neural networks (RNN-ID) and study its performance in binary and multiclass classification tasks. The results show that the RNN-ID model outperforms traditional machine learning methods in accuracy. Chawla et al. [60] presented an anomaly-based intrusion detection system that leverages recurrent neural networks (RNNs) with gated recurrent units (GRUs) and stacked convolutional neural networks (CNNs) to detect malicious cyber attacks. The system establishes a baseline of normal behavior for a given system by analyzing sequences of system calls made by processes. It identifies anomalous sequences based on a language model trained on normal call sequences from the ADFA dataset of system call traces. The authors demonstrate that using GRUs instead of LSTMs results in comparable performance with reduced training times and that combining GRUs with stacked CNNs leads to improved anomaly detection. The proposed system shows promising results in detecting anomalous system call sequences in the ADFA dataset. However, further research is needed to evaluate its performance in other datasets and real-world scenarios and address issues related to adversarial attacks.

Ullah et al. [61] introduce deep learning models to tackle the challenge of managing cybersecurity in the growing realm of IoT devices and services. The models utilize Recurrent Neural Networks, Convolutional Neural Networks, and hybrid techniques to detect anomalies in IoT networks accurately. The proposed models are validated using various datasets (i.e., IoT-DS2, MQTTset, and IoT-23) and achieve high accuracy, precision, recall, and F1 score. However, the models need to be tested on more extensive and diverse datasets, and further research is necessary to enhance their scalability for practical applications in cybersecurity.
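As a concrete illustration of the recurrent detectors reviewed in this subsection, the following PyTorch sketch embeds discrete system-call IDs, runs a GRU over the trace, and emits an anomaly probability. The vocabulary size, layer dimensions, and single-layer design are illustrative assumptions, not the exact configurations of the cited studies:

import torch
import torch.nn as nn

class SyscallGRU(nn.Module):
    """GRU over system-call ID sequences, scored as normal vs. anomalous."""
    def __init__(self, vocab_size=340, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.gru = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, 1)

    def forward(self, x):                         # x: (batch, seq_len) of call IDs
        _, h_n = self.gru(self.embed(x))          # h_n: (1, batch, hidden_dim)
        return torch.sigmoid(self.head(h_n[-1]))  # anomaly probability per trace

model = SyscallGRU()
traces = torch.randint(0, 340, (8, 200))          # 8 traces of 200 system calls
print(model(traces).shape)                        # torch.Size([8, 1])

Swapping nn.GRU for nn.LSTM reproduces the LSTM variants compared in these studies.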
TABLE III: Transformer-based models for Cyber Security (Part I).

Study | Year | Type of Model | Dataset Used | Domain | Key Contributions | Open Issues
Parra et al. [66] | 2022 | Federated Transformer Log Learning Model | HDFS and CTDD datasets | Threat detection and log forensics | The interpretability module integrated into the model provides insightful interpretability of the model's decision-making process | The paper briefly mentions the applicability of the proposed approach in edge computing systems but does not discuss the scalability of the approach to larger systems
Ziems et al. [67] | 2021 | Transformer Model, BERT, CANINE, Bagging-based random transformer forest (RTF) | Malware family datasets | Malware Classification | Demonstration that transformer-based models outperform traditional machine and deep learning models in classifying malware families | The experiments are conducted on preprocessed NIST NVD/SARD databases, which may not reflect real-world conditions
Wu et al. [68] | 2022 | Robust Transformer-based Intrusion Detection System (RTID) | CICID2017 and CIC-DDoS2019 datasets | Intrusion Detection | The proposed method outperforms classical machine learning algorithms such as support vector machine (SVM) and deep learning algorithms (i.e., RNN, FNN, LSTM) on the two evaluated datasets | There is no discussion in the paper regarding the scalability of the proposed method, particularly when dealing with large-scale and real-time network traffic
Demirkıran et al. [69] | 2022 | Transformer-based models | Catak dataset, Oliveira dataset, VirusShare dataset, and VirusSample dataset | Malware classification | The paper demonstrates that transformer-based models, specifically BERT and CANINE, outperform traditional machine and deep learning models in classifying malware families | The study only focuses on malware families that use API call sequences, which means that it does not consider other malware types that may not use API calls
Ghourbi et al. [70] | 2022 | An optimized LightGBM model and a Transformer-based model | ToN-IoT and Edge IIoTset datasets | Threat Detection | The experimental evaluation of the approach showed remarkable accuracies of 99% | The paper does not discuss the scalability of the proposed system for large-scale healthcare networks
Thapa et al. [71] | 2022 | Transformer-based language models | Software vulnerability datasets of C/C++ source codes | Software security and vulnerability detection in programming languages, specifically C/C++ | The paper highlights the advantages of transformer-based language models over contemporary models | The paper only focuses on detecting vulnerabilities in C/C++ source code and does not explore the use of large transformer-based language models in detecting vulnerabilities in other programming languages
Ranade et al. [72] | 2021 | A transformer-based language model, specifically GPT-2 | WebText dataset | Fake Cyber Threat Intelligence | The attack is shown to introduce adverse impacts such as returning incorrect reasoning outputs | Further research is needed to explore how to prevent or detect data poisoning attacks on cyber-defense systems
Fu et al. [73] | 2022 | Transformer-based line-level vulnerability prediction model | Large-scale real-world dataset with more than 188k C/C++ functions | Software vulnerability prediction in safety-critical software systems | The proposed system is accurate for predicting vulnerable functions affected by the Top-25 most dangerous CWEs | The model's performance can change when applied to different programming languages or software systems
Mamede et al. [74] | 2022 | A transformer-based deep learning model | Software Assurance Reference Dataset (SARD) project, which contains vulnerable and non-vulnerable Java files | Software security in the context of the Java programming language | The proposed system can identify up to 21 vulnerability types and achieved an accuracy of 98.9% in multi-label classification | The proposed method has not been extended to other programming languages or integrated into existing software development processes
Evange et al. [75] | 2021 | A transformer-based model | DNRTI (Dataset for NER in Threat Intelligence) | Cybersecurity threat intelligence | The experimental results demonstrate that transformer-based techniques outperform previous state-of-the-art approaches for NER in threat intelligence | Further research is needed to test the effectiveness of transformer-based models on larger and more diverse datasets
Hashemi et al. [76] | 2023 | Transformer models (including BERT, XLNet, RoBERTa, and DistilBERT) | Labeled dataset from vulnerability databases | Vulnerability Information Extraction | The proposed approach outperforms existing rule-based and CRF-based models | The paper does not address the issue of bias in the labeled dataset
Liu et al. [77] | 2022 | Transformer model | A commit benchmark dataset that includes over 7.99 million commits across 7 programming languages | Commit message generation (generation task) and security patch identification (understanding task) | The experimental results demonstrate that CommitBART significantly outperforms previous pre-trained models for code | The pre-training dataset used in the paper is limited to GitHub commits
Ahmad et al. [78] | 2024 | Transformer model | Set of 15 hardware security bug benchmark designs from three sources: the MITRE website, the OpenTitan System-on-Chip (SoC), and the Hack@DAC 2021 SoC | Hardware Security Bugs | Bug repair potential demonstrated by an ensemble of LLMs, outperforming a state-of-the-art automated tool | The need for designer assistance in bug identification, handling complex bugs, limited evaluations due to simulation constraints, and challenges with token limits and repair generation using LLMs
Wan et al. [79] | 2024 | Transformer model | Chrysalis dataset, comprising over 1,000 function-level HLS designs with injected logical bugs | Design Verification | Creating the Chrysalis dataset for HLS debugging, and enabling LLM-based bug detection and integration into development environments | Refining LLM techniques, integrating LLMs into development environments, and addressing scalability and generalization challenges
Jang et al. [80] | 2024 | Transformer model | Includes 150K online security articles, 7.3K security paper abstracts, 3.4K Wikipedia articles, and 185K CVE descriptions | Threat Detection | Pre-trained language model for the cybersecurity domain; CyBERTuned incorporates non-linguistic elements (NLEs) such as URLs and hash values commonly found in cybersecurity texts | A narrow focus on specific non-linguistic element (NLE) types; more complex NLE types like code blocks and file paths require future exploration
TABLE IV: Transformer-based models for Cyber Security (Part II).

Study | Year | Type of Model | Dataset Used | Domain | Key Contributions | Open Issues
Bayer et al. [81] | 2024 | Transformer model | A dataset consisting of 4.3 million entries of Twitter, blogs, papers, and CVEs related to the cybersecurity domain | Intrusion attacks and malware | Created a high-quality dataset and a domain-adapted language model for the cybersecurity domain, which improves the internal representation space of domain words and performs best in cybersecurity scenarios | The model may not be suitable as a replacement for every type of cybersecurity model; the hyperparameters may not be generalizable to other language models, especially very large language models
Shestov et al. [82] | 2024 | Transformer model | The dataset comprises 22,945 function-level source code samples: 13,247 for training, 5,131 for validation, and 4,567 for testing | Vulnerability detection | Finetuning the state-of-the-art code LLM, WizardCoder, increasing its training speed without performance harm | The main bottlenecks of the task that limit performance lie in dataset quality; the usage of project-level context information is suggested
He et al. [83] | 2024 | Transformer model | Three datasets: one with over 100,000 entries from Ethereum mainnet contracts, another with 892,913 addresses labelled across five vulnerability categories, and a third with 6,498 smart contracts, including 314 associated with Ponzi schemes | Blockchain technology and smart contracts | The introduction of a novel model, BERT-ATT-BiLSTM, for advanced vulnerability detection in smart contracts, and the evaluation of its performance against other models | The model is limited in recognizing unseen contract structures or novel types of vulnerabilities; support for multiple programming languages is needed to enhance universality and robustness
Patsakis et al. [84] | 2024 | LLM fine-tuned for deobfuscation tasks | Malicious scripts from the Emotet malware campaign | Malware Classification | Demonstrated 69.56% accuracy in extracting URLs and 88.78% for domains of droppers; explored LLM potential in malware deobfuscation and reverse engineering | Optimizing LLM fine-tuning for improved accuracy and integrating deobfuscation capabilities into operational security pipelines
Guo et al. [85] | 2024 | Fine-tuned open-source and general-purpose LLMs for binary classification | Compiled dataset and five benchmark datasets for vulnerability detection | Software Security | Demonstrated fine-tuning's effectiveness in improving detection accuracy; highlighted limitations of existing benchmark datasets | Addressing dataset mislabeling and improving generalizability of models to unseen code scenarios
Jamal et al. [25] | 2024 | Transformer model | Two open-source datasets: 747 spam, 189 phishing, and 4,825 ham emails; class imbalance addressed with ADASYN | Phishing and spam detection | Proposing IPSDM, a fine-tuned version of DistilBERT and RoBERTa, outperforming baseline models, and demonstrating the effectiveness of LLMs in addressing cybersecurity challenges | Class imbalance, addressed with ADASYN, but potential bias remains
Lykousas and Patsakis [86] | 2024 | LLMs for detecting hard-coded credentials in source code | Public code repositories with embedded secrets and passwords | Authentication and Code Security | Highlighted differences in password patterns between developers and users; evaluated LLMs for detecting hard-coded credentials and discussed their limitations | Improving LLM accuracy in detecting secrets and addressing context-sensitive password vulnerabilities
Karlsen et al. [87] | 2024 | Fine-tuned LLMs for sequence classification (e.g., DistilRoBERTa, GPT-2, GPT-Neo) | Six datasets from web application and system logs | Cybersecurity Log Analysis | Proposed a new pipeline leveraging 60 fine-tuned models for log analysis; DistilRoBERTa achieved an F1-score of 0.998, outperforming state-of-the-art techniques | Scaling models for more diverse log formats and optimizing for real-time analysis in dynamic environments
Mechri et al. [88] | 2025 | Decoder-only Transformer with 64K context length | Python dataset (1.875M function-level code snippets from GitHub, Codeparrot, and GPT4-o-generated data) | Software Security | High accuracy in detecting vulnerabilities across 14 CWEs, with F1 scores ranging from 84% to 99% | Further improvement in identifying complex vulnerabilities and handling diverse programming patterns
Ding et al. [89] | 2025 | LLM-enhanced framework with in-context learning and CoT reasoning | SolidiFI benchmark dataset | Blockchain Security | Recall of 95.06% and F1-score of 94.95% for detecting smart contract vulnerabilities; self-check architecture for CoT generation | Expanding the framework's applicability to more complex blockchain environments and new vulnerability types
Arshad et al. [90] | 2025 | LLM-based decentralized vehicular network architecture | Simulation data for vehicular communication scenarios | Autonomous Transportation Systems | 18% reduction in latency, 12% improvement in throughput, and enhanced secure V2X communication using blockchain and LLMs | Addressing node selfishness, scalability in larger networks, and privacy-preserving real-time data exchange
Xiao et al. [91] | 2025 | LLM with advanced prompting techniques | Solidity v0.8 vulnerabilities dataset | Blockchain Security | Reduced false-positive rates by over 60%; evaluated the latest five LLMs and identified root causes for reduced recall in Solidity v0.8 | Improving recall for newer Solidity versions and adapting to evolving library and framework changes
Hassanin et al. [92] | 2025 | Pre-trained Transformer with specialized input transformation module | UNSW NB 15 and TON IoT datasets | Intrusion Detection | Achieves 100% accuracy on the UNSW NB 15 dataset, significantly outperforming BiLSTM, GRU, and CNN models | Exploring scalability for larger and more diverse datasets; integrating real-time detection capabilities
Liu et al. [93] | 2025 | LLM-powered static binary taint analysis | Real-world firmware datasets | Hardware Security Bugs | Fully automated taint analysis with 37 newly discovered bugs and 10 assigned CVEs; low engineering cost | Exploring adaptability for diverse binary formats and enhancing real-time analysis capabilities
Gaber et al. [94] | 2025 | Transformer-based framework for zero-day ransomware detection | Assembly instructions captured by the Peekaboo tool | Malware Classification | Introduced a novel AI-based framework leveraging Assembly data for high-accuracy zero-day ransomware detection; demonstrated the relevance of Transformer models to ransomware classification by aligning with Zipf's law | Enhancing scalability for larger datasets and addressing advanced evasion techniques in novel ransomware samples
Donkol et al. [62] presents a technique, ELSTM-RNN, for improving security in intrusion detection systems. Using likely point particle swarm optimization (LPPSO) and enhanced LSTM classification, the proposed system addresses gradient vanishing, generalization, and overfitting issues. The system uses an enhanced particle swarm optimization technique to select efficient features, which are used for effective classification using an enhanced LSTM framework. The proposed system outperformed other methods, such as LPBoost and DNNs, in accuracy, precision, recall, and error rate. The NSL-KDD dataset was used for validation and testing, and further verification was done on other datasets. While the paper provides a comprehensive solution, future research could explore the applicability of the proposed system to other datasets and real-world scenarios. Additionally, a more detailed analysis of the computational cost of the proposed system compared to other methods could be beneficial.

Fig. 4: LLM-based Solutions for Cyber Security Use Cases.
Zhao et al. [63] presents ERNN, an end-to-end RNN model with a novel gating unit called session gate, designed to address network-induced phenomena that may result in misclassifications in traffic detection systems used in cybersecurity. The gating unit includes four types of actions to simulate network-induced phenomena during model training and the Mealy machine to adjust the probability distribution of network-induced phenomena. The paper demonstrates that ERNN outperforms state-of-the-art methods by 4% accuracy and is scalable in terms of parameter settings and feature selection. The paper also uses the Integrated Gradients method to interpret the gating mechanism and demonstrates its ability to reduce dependencies on local packets. Althubiti et al. [57] propose a deep learning-based intrusion detection system (IDS) that uses a Long Short-Term Memory (LSTM) RNN to classify and predict known and unknown intrusions. The experiments show that the proposed LSTM-based IDS can achieve a high accuracy rate of 0.9997. Xu et al. [58] propose a novel IDS that consists of a recurrent neural network with gated recurrent units (GRU), multilayer perceptron (MLP), and softmax module. The experiments on the KDD 99 and NSL-KDD data sets show that the system has a high overall detection rate and a low false positive rate. Ferrag and Leandros [59] propose a novel deep learning and blockchain-based energy framework for smart grids, which uses a blockchain-based scheme and a deep learning-based scheme for intrusion detection. The deep learning-based scheme employs recurrent neural networks to detect network attacks and fraudulent transactions in the blockchain-based energy network. The performance of the proposed IDS is evaluated using three different data sources.

Polat et al. [65] introduce a method for improving the detection of DDoS attacks in SCADA systems that use SDN technology. The authors propose using a Recurrent Neural Network (RNN) classifier model with two parallel deep learning-based methods: Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRU). The proposed model is trained and tested on a dataset from an experimentally created SDN-based SCADA topology containing DDoS attacks and regular network traffic data. The results show that the proposed RNN model achieves an accuracy of 97.62% for DDoS attack detection, and transfer learning further improves its performance by around 5%.

2) Software Security: Wang et al. [64] propose a deep learning-based defense system called PatchRNN to automatically detect secret security patches in open-source software (OSS). The system leverages descriptive keywords in the commit message and syntactic and semantic features at the source-code level. The system's performance was evaluated on a large-scale real-world patch dataset and a case study on NGINX. The results indicate that the PatchRNN system can effectively detect secret security patches with a low false positive rate.

3) Detection of Deepfake Videos: Güera et al. [56] propose a temporal-aware pipeline that automatically detects deepfake videos by using a convolutional neural network (CNN) to extract frame-level features and a recurrent neural network (RNN) to classify the videos. The results show that the system can achieve competitive results in this task with a simple architecture.
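A minimal PyTorch sketch of such a temporal-aware pipeline follows: a small convolutional stack (a stand-in for the full frame-level CNN backbone) produces per-frame features that an LSTM aggregates into a clip-level real-versus-deepfake decision. All sizes are illustrative assumptions:

import torch
import torch.nn as nn

class DeepfakeCNNRNN(nn.Module):
    def __init__(self, feat_dim=128, hidden_dim=256):
        super().__init__()
        self.cnn = nn.Sequential(                 # per-frame feature extractor
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, feat_dim),
        )
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, 2)      # logits: real vs. deepfake

    def forward(self, clips):                     # clips: (batch, frames, 3, H, W)
        b, t = clips.shape[:2]
        feats = self.cnn(clips.flatten(0, 1)).view(b, t, -1)
        _, (h_n, _) = self.lstm(feats)            # temporal aggregation
        return self.head(h_n[-1])

logits = DeepfakeCNNRNN()(torch.randn(2, 16, 3, 64, 64))  # 2 clips, 16 frames
print(logits.shape)                                        # torch.Size([2, 2])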
Overall, the reviewed studies demonstrate the potential of deep learning methods, particularly RNNs, for intrusion detection in various domains. The results show that the proposed deep learning-based models outperform traditional machine learning methods in accuracy. However, more research is needed to address the limitations and challenges associated with these approaches, such as data scalability and interpretability.

B. Transformer-based models

1) Cloud Threat Forensics: Parra et al. [66] proposed an interpretable federated transformer log learning model for threat detection in syslogs. The model is generated by training local transformer-based threat detection models at each client and aggregating the learned parameters to generate a global federated learning model. The authors demonstrate the difference between normal and abnormal log time series through the goodness of fit test and provide insights into the model's decision-making process through an attention-based interpretability module. The results from the HDFS and CTDD datasets validate the proposed approach's effectiveness in achieving threat forensics in real-world operational settings. Evange et al. [75] discuss the importance of actionable threat intelligence in defending against increasingly sophisticated cyber threats. Cyber Threat Intelligence is available on various online sources, and Named Entity Recognition (NER) techniques can extract relevant information from these sources. The paper investigates the use of transformer-based models in NER and how they can facilitate the extraction of cybersecurity-related named entities. The DNRTI dataset, which contains over 300 threat intelligence reports, tests the effectiveness of transformer-based models compared to previous approaches. The experimental results show that transformer-based techniques are more effective than previous methods in extracting cybersecurity-related named entities.

Karlsen et al. [87] proposed the LLM4Sec framework that demonstrates the potential of large language models in cybersecurity log analysis by benchmarking 60 fine-tuned models, including architectures like BERT, RoBERTa, GPT-2, and GPT-Neo. The study highlights the importance of fine-tuning for domain adaptation, with DistilRoBERTa achieving an exceptional F1-score of 0.998 across diverse datasets. This work introduces a novel experimentation pipeline that can serve as a foundation for further advancements in automated log analysis. Future research could focus on scaling these models to handle various log formats and optimizing them for real-time, dynamic cybersecurity environments.

2) Malware classification: Ziems et al. [67] explore transformer-based models for malware classification using API call sequences as features. The study compares the performance of traditional machine and deep learning models with transformer-based models. It shows that transformer-based models outperform traditional models in terms of F1-score and AUC score. The authors also propose a bagging-based random transformer forest (RTF) model that reaches state-of-the-art evaluation scores on three out of four datasets. Demirkıran et al. [69] proposes using transformer-based models for classifying malware families, better suited for capturing sequence relationships among API calls than traditional machine and deep learning models. The experiments show that the proposed transformer-based models outperform traditional models such as LSTM and pre-trained models such as BERT or CANINE in classifying highly imbalanced malware families based on evaluation metrics like F1-score and AUC score. Additionally, the proposed bagging-based random transformer forest (RTF) model, an ensemble of BERT or CANINE, achieves state-of-the-art performance on three out of four datasets, including a state-of-the-art F1-score of 0.6149 on one of the commonly used benchmark datasets.

Patsakis et al. [84] investigates the application of LLMs in malware deobfuscation, focusing on real-world scripts from the Emotet malware campaign. The evaluation highlights the potential of LLMs in identifying key indicators of compromise, achieving 69.56% accuracy for URLs and 88.78% for associated domains. These findings emphasize the importance of fine-tuning LLMs for specialized cybersecurity tasks, such as reverse engineering and malware analysis. While promising, the work identifies areas for improvement, including optimizing fine-tuning strategies to enhance accuracy and integrating these capabilities into threat intelligence frameworks for real-world application.

Gaber et al. [94] proposed the Pulse framework, which pioneers the use of Transformer models for zero-day ransomware detection by analyzing Assembly instructions captured through the Peekaboo dynamic binary instrumentation tool. By leveraging Zipf's law, the study effectively connects linguistic principles with ransomware behavior, making Transformer models ideal for classification tasks. This innovative approach forces the model to focus on malicious patterns by excluding familiar functionality, ensuring robust detection of novel ransomware. Future research could expand scalability to accommodate larger datasets and address increasingly sophisticated evasion techniques in emerging ransomware threats.

3) Intrusion Detection: Wu et al. [68] proposed an RTID that reconstructs feature representations in imbalanced datasets to make a trade-off between dimensionality reduction and feature retention. The proposed method utilizes a stacked encoder-decoder neural network and a self-attention mechanism for network traffic type classification. The results with CICID2017 and CIC-DDoS2019 datasets demonstrate the proposed method's effectiveness in intrusion detection compared to classical machine learning and deep learning algorithms. Ghourbi et al. [70] propose an intrusion and malware detection system to secure the entire network of the healthcare system independently of the installed devices and computers. The proposed solution includes two components: an intrusion detection system for medical devices installed in the healthcare network and a malware detection system for data servers and medical staff computers. The proposed system is based on optimized LightGBM and Transformer-based models. It is trained with four different datasets to ensure a varied knowledge of the different attacks affecting the healthcare sector. The experimental evaluation of the approach showed remarkable accuracies of 99%. PLLM-CS [92] introduces a transformative approach to satellite network security, achieving perfect accuracy on a benchmark dataset and demonstrating superior performance over traditional deep learning models.
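Several of the detectors above, notably the LLM4Sec pipeline of Karlsen et al. [87], are built by fine-tuning a compact transformer encoder as a sequence classifier. A minimal version of that recipe using the Hugging Face transformers and datasets libraries might look as follows; the two toy log lines, label scheme, and hyperparameters are placeholders rather than the studies' actual settings:

from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

# Toy labeled log lines; a real pipeline would load the benchmark datasets.
logs = Dataset.from_dict({
    "text": ["GET /index.html 200", "GET /../../etc/passwd 403"],
    "label": [0, 1],                         # 0 = benign, 1 = anomalous
})

tok = AutoTokenizer.from_pretrained("distilroberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilroberta-base", num_labels=2)

encoded = logs.map(
    lambda batch: tok(batch["text"], truncation=True,
                      padding="max_length", max_length=64),
    batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="log-clf", num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=encoded,
)
trainer.train()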
4) Software Vulnerability Detection: Thapa et al. [71] explores the use of large transformer-based language models in detecting software vulnerabilities in C/C++ source code, leveraging the transferability of knowledge gained from natural language processing. The paper presents a systematic framework for source code translation, model preparation, and inference. It conducts an empirical analysis of software vulnerability datasets to demonstrate the good performance of transformer-based language models in vulnerability detection. The paper also highlights the advantages of transformer-based language models over contemporary models, such as bidirectional long short-term memory and bidirectional gated recurrent units, in terms of F1-score. However, the paper does not discuss the limitations or potential drawbacks of using transformer-based language models for software vulnerability detection, and further research is needed in this area. Fu et al. [73] propose an approach called LineVul, which uses a Transformer-based model to predict software vulnerabilities at the line level. The approach is evaluated on a large-scale real-world dataset with more than 188k C/C++ functions. It achieves a higher F1-measure for function-level predictions and higher Top-10 accuracy for line-level predictions compared to baseline approaches. The analysis also shows that LineVul accurately predicts vulnerable functions affected by the top 25 most dangerous CWEs. However, the model's performance can change when applied to different programming languages or software systems.

Mamede et al. [74] presented a transformer-based VS Code extension that uses state-of-the-art deep learning techniques for automatic vulnerability detection in Java code. The authors emphasize the importance of early vulnerability detection within the software development life cycle to promote application security. Despite the availability of advanced deep learning techniques for vulnerability detection, the authors note that these techniques are not yet widely used in development environments. The paper describes the architecture and evaluation of the VDet tool, which uses the Transformer architecture for multi-label classification of up to 21 vulnerability types in Java files. The authors report an accuracy of 98.9% for multi-label classification and provide a demonstration video, source code, and datasets for the tool.

Liu et al. [77] introduce CommitBART, a pre-trained Transformer model specifically designed to understand and generate natural language messages for GitHub commits. The model is trained on a large dataset of over 7.99 million commits, covering seven different programming languages, using a variety of pre-training objectives, including denoising, cross-modal generation, and contrastive learning, across six pre-training tasks. The authors propose a "commit intelligence" framework encompassing one understanding task and three generation tasks for commits. The experimental results demonstrate that CommitBART significantly outperforms previous pre-trained models for code, and the analysis suggests that each pre-training task contributes to the model's performance.

Ding et al. [95] discuss the effectiveness of code language models (code LMs) in detecting vulnerabilities. The study identifies significant issues in current datasets, such as poor quality, low accuracy, and high duplication rates, which compromise model performance in realistic scenarios. To overcome these challenges, it introduces the PrimeVul dataset, which uses advanced data labeling, de-duplication, and realistic evaluation metrics to represent real-world conditions accurately. The findings reveal that current benchmarks, like BigVul, greatly overestimate code LMs' capabilities, with much lower performance observed on PrimeVul. This significant discrepancy highlights the need for further innovative research to meet the practical demands of deploying code LMs in security-sensitive environments.

SecureQwen [88] is a vulnerability detection system designed for Python codebases. It uses a decoder-only transformer model with an extended context length of 64K tokens to analyze large-scale datasets. The model identifies vulnerabilities across 14 types of CWEs with high accuracy, achieving F1 scores ranging from 84% to 99%. By leveraging a dataset of 1.875 million function-level code snippets from various sources, including GitHub and synthetic data, SecureQwen demonstrates its capability to detect security issues in both human-written and AI-generated code.

Guo et al. [85] explores the role of LLMs in detecting vulnerabilities in source code, comparing the performance of fine-tuned open-source models and general-purpose LLMs. Leveraging a binary classification task and multiple datasets, the study demonstrates the importance of fine-tuning smaller models for specific tasks, sometimes outperforming larger counterparts. The analysis also exposes critical issues with current benchmark datasets, such as mislabeling, which significantly affects model training and evaluation. Future research directions include improving dataset quality and developing strategies to enhance model generalization for more diverse and complex software vulnerabilities.

Lykousas and Patsakis [86] examine developer password patterns and the role of LLMs in detecting hard-coded credentials in source code. The study reveals that while developers tend to select more complex passwords compared to regular users, context often influences weaker patterns. It underscores the risks posed by public repositories containing secrets and the need for enhanced security practices. Additionally, the paper evaluates LLMs for detecting hard-coded credentials, identifying their potential and limitations. Future work should focus on refining LLM capabilities to detect sensitive information and raising developers' awareness about secure password management.
5) Cyber Threat Intelligence: Ranade et al. [72] presented a method for automatically generating fake Cyber Threat Intelligence (CTI) using transformers, which can mislead cyber-defense systems. The generated fake CTI is used to perform a data poisoning attack on a Cybersecurity Knowledge Graph (CKG) and a cybersecurity corpus. The attack introduces adverse impacts such as returning incorrect reasoning outputs, representation poisoning, and corruption of other dependent AI-based cyber defense systems. A human evaluation study was conducted with cybersecurity professionals and threat hunters, which reveals that professional threat hunters were equally likely to consider the generated fake CTI and authentic CTI as true.

Hashemi et al. [76] propose an alternative approach for automated vulnerability information extraction using Transformer models, including BERT, XLNet, RoBERTa, and DistilBERT, to extract security-related words, terms, and phrases from descriptions of vulnerabilities. The authors fine-tune several language representation models similar to BERT on a labeled dataset from vulnerability databases for Named Entity Recognition (NER) to extract complex features without requiring domain-expert knowledge. This approach outperforms the CRF-based models and can detect new information from vulnerabilities with different description text patterns. The authors conclude that this approach provides a structured and unambiguous format for disclosing and disseminating vulnerability information, which is crucial for preventing security attacks.

6) Phishing and spam detection: Koide et al. [96] introduced a novel system leveraging LLMs to detect phishing emails. Despite advances in traditional spam filters, significant challenges such as oversight and false positives persist. The system transforms email data into prompts for LLM analysis, achieving a high accuracy rate (99.70%) and providing detailed reasoning for its determinations. This helps users make informed decisions about suspicious emails, potentially enhancing the effectiveness of phishing detection.

Jamal et al. [25] explored the potential of LLMs to address the growing sophistication of phishing and spam attacks. Their work, IPSDM, is an improved model based on the BERT family, specifically fine-tuned to detect phishing and spam emails. Compared to baseline models, IPSDM shows superior accuracy, precision, recall, and F1-score performance on both balanced and unbalanced datasets while addressing overfitting concerns.

Heiding et al. [97] compared phishing emails automatically generated by GPT-4, emails manually designed using the V-Triad method, and their combination. Their findings suggest that emails designed with the V-Triad achieved the highest click-through rates, indicating the effectiveness of exploiting cognitive biases. The study also evaluated the capability of four different LLMs to detect phishing intentions, with results often surpassing human detection. Furthermore, they discuss the economic impact of AI in lowering the costs of orchestrating phishing attacks.

Chataut et al. [98] focused on the effectiveness of LLMs in detecting phishing emails amidst threat actors' constant evolution of phishing strategies. Their study emphasizes the necessity for continual development and adaptation of detection models to keep pace with innovative phishing techniques. The role of LLMs in this context highlights their potential to significantly enhance email security by improving detection capabilities.
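As an illustration of the email-to-prompt transformation these systems describe, the sketch below submits a suspicious email to a chat-completion API and asks for a verdict with a justification. The client library, model name, and prompt wording are placeholder assumptions rather than the configurations used in the cited studies:

from openai import OpenAI  # any chat-completion client would serve

client = OpenAI()  # assumes the OPENAI_API_KEY environment variable is set

def triage_email(subject: str, body: str, model: str = "gpt-4o-mini") -> str:
    """Ask an LLM for a phishing verdict with a short justification."""
    prompt = (
        "You are an email security analyst. Classify the email below as "
        "PHISHING or LEGITIMATE and briefly justify your verdict.\n\n"
        f"Subject: {subject}\n\nBody:\n{body}"
    )
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content

print(triage_email("Urgent: verify your account",
                   "Click http://example.com/login within 24 hours..."))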
7) Hardware Security Evaluation: Ahmad et al. [78] delves into leveraging LLMs to automatically repair identified security-relevant bugs present in hardware designs, explicitly focusing on Verilog code. Hardware security bugs pose significant challenges in ensuring the reliability and safety of hardware designs. They curated a corpus of hardware security bugs through a meticulously designed framework. They explored the performance of various LLMs, including OpenAI's Codex and CodeGen, in generating replacement code to fix these bugs. The experiments reveal promising results, demonstrating that LLMs can effectively repair hardware security bugs, with success rates varying across different bugs and LLM models. By optimizing parameters such as instruction variation, temperature, and model selection, they achieved successful repairs for a significant portion of the bugs in their dataset. In addition, the results demonstrate that LLMs, including GPT-4, code-davinci-002, and code-cushman-001, yield successful repairs for simple security bugs, with GPT-4 achieving a success rate of 67% at variation e, temperature 0.5. However, LLMs' performance varies across bugs, showing success rates over 75% with some bugs, while others are more challenging to repair, with success rates below 10%. The study emphasizes the importance of detailed prompt instructions, with variation d showing the highest success rate among OpenAI LLMs. Further investigation is needed to evaluate LLMs' scalability and effectiveness for diverse hardware security bug scenarios. Their findings underscore the potential of LLMs in automating the bug repair process in hardware designs, marking a crucial step towards developing automated end-to-end bug repair tools for hardware security.

Mohamadreza et al. [99] explored the potential of using large language models to enhance input generation in the process of hardware design verification for security-related bugs. They introduced ChatFuzz, a novel ML-based hardware fuzzer that leverages LLMs and reinforcement learning to generate complex and random machine code sequences for exploring processor security vulnerabilities. ChatFuzz introduces a specialized LLM model into a hardware fuzzing approach to enhance input generation quality, outperforming existing approaches regarding coverage, scalability, and efficiency. Utilizing LLMs to understand processor language and generate data/control flow entangled machine code sequences, ChatFuzz integrates RL to guide input generation based on code coverage metrics. Their experiment on real-world cores, namely the RocketCore and BOOM cores, showed significantly faster coverage than state-of-the-art hardware fuzzers. ChatFuzz achieves 75% condition coverage in RocketCore in 52 minutes and 97.02% in BOOM in 49 minutes, identifying unique mismatches and new bugs and showcasing its effectiveness in hardware security testing.

Weimin et al. [100] introduces LLM4SECHW, a novel framework for hardware debugging that utilizes domain-specific Large Language Models. The authors addressed the limitations of off-the-shelf LLMs in the hardware security domain by gathering a dataset of hardware design defects and remediation steps. The collected dataset has been built by leveraging open-sourced hardware designs from GitHub; the data consists of different Hardware Description Language modules with their respective commits. By harnessing version control information from open-source hardware projects and processing it to create a debugging-oriented dataset, LLM4SECHW fine-tunes hardware domain-specific language models to locate and rectify bugs autonomously, enhancing bug localization. LLM4SECHW has been evaluated with two objectives: bug identification and design patching. The authors demonstrated that non-fine-tuned LLMs lack hardware domain knowledge, which makes them incapable of locating bugs in the hardware design of a popular security-specialized chip project named OpenTitan. The base models (Falcon 7B, Llama 2, Bard, chatbot, and StableLM) did not efficiently locate the introduced hardware bugs. The three fine-tuned models (Falcon 7B, Llama 2, StableLM) successfully located the introduced bugs in the hardware design.

Zhang et al. [100] introduces Hardware Phi-1.5B, a large language model tailored for the hardware domain of the semiconductor industry, addressing the complexity of hardware-specific issues. The research focused on developing datasets specifically for the hardware domain to enhance the model's performance in comprehending complex terminologies. The authors claim to surpass general code language models and natural language models like CodeLlama, BERT, and GPT-2 in hardware understanding tasks.

Madhav et al. [101] evaluated the security of the HDL code generated by ChatGPT. The authors introduced a taxonomy similar to the NIST CWE (https://nvd.nist.gov/vuln/categories). The authors conducted various experiments to explore the impact of prompt engineering on the security of the generated hardware design.

Liu et al. [93] introduces a groundbreaking approach, named LATTE, to binary program security by utilizing LLMs for static binary taint analysis. Unlike traditional tools like Emtaint and Karonte, which rely on manually crafted taint propagation and vulnerability inspection rules, LATTE is fully automated, reducing dependency on human expertise. Its effectiveness is demonstrated through the discovery of 37 previously unknown bugs in real-world firmware, with 10 earning CVE assignments. Additionally, LATTE offers a scalable and cost-efficient solution, making it highly accessible to researchers and practitioners. This work highlights the potential of LLMs to revolutionize binary program analysis, though future research could focus on enhancing adaptability to diverse binary formats and integrating real-time capabilities.

8) Hardware design & Verification: Lily et al. [102] introduced the application of LLMs to High-Level Synthesis (HLS) design verification. The authors created a dataset named Chrysalis to address the absence of specialized HLS bug detection and evaluation capabilities. The Chrysalis dataset comprises over 1,000 function-level designs extracted from reputable sources, with intentionally injected known bugs to evaluate and refine LLM-based HLS bug localization. The set of introduced bugs was selected based on the most common human coding errors and has been shaped to elude the detection mechanisms of most existing conventional HLS synthesis tools. The paper's authors suggest that Chrysalis would contribute to LLM-aided HLS design verification by offering a benchmark for existing and specialized models. The paper also suggests a prompt engineering approach that would enhance the efficiency of a large language model on the studied task. The proposed prompt structure introduces a separation-of-concerns approach, where the prompt deals with each class of bugs separately. The prompt starts by explicitly defining the context of the task, the functional description, the implementation context, and the task objective. The prompt is implemented through three main sections: context, requirements, and complementary rules. The highlighted works lay a foundation for a methodological, practical approach to benchmarking, evaluating, and deploying LLM tasks for HLS design verification. While the paper does not provide any conclusive results about LLMs' performance in such tasks, the authors believe that such a methodology would accelerate the adoption of new techniques to integrate LLMs into the design verification flow.

Mingjie et al. [103] evaluated LLMs' performance in solving Verilog-related design tasks and generating design testbenches by introducing VerilogEval. VerilogEval comprises different hardware design tasks ranging from module implementation of simple combinational circuits to complex finite state machines, code debugging, and testbench construction. VerilogEval suggests an end-to-end evaluation framework that fits better in the context of hardware design verification process benchmarking. The VerilogEval framework validates the correctness of the prompted tasks by comparing the behavioral simulation to an established golden model of the prompted design. The authors used the pass@k metric instead of generic NLP metrics such as the BLEU score. The study demonstrates that pre-trained language models' Verilog code generation capabilities can be improved through supervised fine-tuning. The pass@k metric quantifies the number of successful code completions out of k samples, offering a clear evaluation criterion for the resulting models, and it shows that a fine-tuned model can equal or exceed the performance of state-of-the-art OpenAI models (GPT-3 and GPT-4). VerilogEval highlights the growing significance of LLMs and their application in various domains, emphasizing their potential in Verilog code generation for hardware design and verification. The findings underscore the importance of the proposed benchmarking framework in advancing the state of the art in Verilog code generation, highlighting the vast potential of LLMs in assisting the hardware design and verification process.
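In practice, pass@k is computed with the unbiased estimator popularized by OpenAI's HumanEval benchmark: generate n completions per task, count the c completions that pass the golden-model check, and estimate the probability that at least one of k drawn samples is correct. A small Python helper (the sample numbers below are made up for illustration):

import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # every size-k draw contains a correct sample
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# e.g., 3 correct completions out of 20 generated samples:
print(round(pass_at_k(20, 3, 1), 3))   # 0.15
print(round(pass_at_k(20, 3, 10), 3))  # 0.895, much higher at k = 10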
Fig. 5: LLMs Performance Steps in the cybersecurity domain using CyberMetric Dataset [104]. (The figure depicts sample multiple-choice questions, the 80-, 500-, 2k-, and 10k-question variants of the CyberMetric dataset, the role/user prompt template, the evaluated inference models (GPT-4-turbo, GPT-3.5-turbo, Mixtral-8x7B-Instruct, Gemini-Pro (Bard), Falcon-180B-Chat, Flan-T5-XXL, Zephyr-7B-beta, Llama 2-70B, Falcon-40B-Instruct, Flan-T5-Base, and others), and accuracy computed as Total Correct / Total Questions.)
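The prompt template shown in Fig. 5 can be reproduced directly. The snippet below reconstructs the role/user prompt pair and the accuracy score from the figure; the placeholder answer options are the only invented elements:

def build_prompts(question: str, answers: dict[str, str]) -> tuple[str, str]:
    role_prompt = "You are a security expert who answers questions."
    options = ", ".join(f"{key}) {value}" for key, value in answers.items())
    user_prompt = (
        f"Question: {question}\nOptions: {options}\n\n"
        "Choose the correct answer (A, B, C, or D) only. "
        "Always return in this format: 'ANSWER: X'"
    )
    return role_prompt, user_prompt

def accuracy(total_correct: int, total_questions: int) -> float:
    # Fig. 5 scores each model as Total Correct / Total Questions.
    return 100.0 * total_correct / total_questions

role, user = build_prompts(
    "In cryptography, what is the purpose of a message authentication "
    "code (MAC) or digital signature (SIG)?",
    {"A": "...", "B": "...", "C": "...", "D": "..."},
)
print(user)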
9) Protocols verification: Ruijie et al. [105] introduced ChatAFL, an LLM-based protocol fuzzer. ChatAFL introduces LLM-guided protocol fuzzing to address the challenge of finding security flaws in protocol implementations without a machine-readable specification. The study suggests three strategies for integrating an LLM into a mutation-based protocol fuzzer, focusing on grammar extraction, seed enrichment, and saturation handling to enhance code coverage and state transitions. The ChatAFL prototype implementation demonstrates that the LLM-guided stateful fuzzer outperforms state-of-the-art fuzzers like AFLNET [106] and NSFUZZ [107] in terms of protocol state space coverage and code coverage.

The experiments evaluated ChatAFL's improvement over the baselines in terms of transition coverage achieved in 24 hours, speed-up in achieving the same coverage, and the probability of outperforming the baselines in a random campaign. ChatAFL demonstrated significant efficacy by covering 47.60% and 42.69% more state transitions, 29.55% and 25.75% more states, and 5.81% and 6.74% more code than AFLNET and NSFUZZ, respectively.

ChatAFL discovered nine unique and previously unknown vulnerabilities in widely used and extensively tested protocol implementations (live555, proFTPD, kamailio). The discovered vulnerabilities encompass various memory vulnerabilities, including use-after-free, buffer overflow, and memory leaks, which have potential security implications such as remote code execution or memory leakage. The study demonstrated the effectiveness of utilizing LLMs for guiding protocol fuzzing to enhance state and code coverage in protocol implementations.

Wang et al. [108] introduced LLMIF, an LLM-aided fuzzing approach for IoT device protocols. LLMIF introduces an LLM augmentation-based approach. The developed pipeline incorporates an enhanced seed generation strategy by building an augmentation based on domain knowledge. The domain knowledge structure is extracted from the various specifications of the protocol under fuzzing. The flow starts by selecting a seed from the extracted augmentation set and then enriching the extracted seed by exploring the protocol specification. The enriching process is driven by the various ranges of input values extracted during the augmentation phase. Furthermore, LLMIF introduces a coverage approach by mutating the selected seed through the various enrichment and mutation operators that have been selected.

The evaluation of LLMIF mainly aimed to assess three axes: code coverage, ablation, and bug identification. The authors used an off-the-shelf popular SoC (CC2530) for the evaluation, and 11 commercial devices were selected to conduct the various experiments. While the ablation and bug detection could be easily evaluated, code coverage cannot be measured on the custom firmware that ships with the selected devices, so the authors used an open-source Zigbee stack to demonstrate the coverage capabilities. The authors claimed that LLMIF outperforms Z-FUZZER [109] and BOOFUZZ [110] in terms of code coverage for the target Zigbee stack, achieving a notable increase in protocol message coverage and code coverage of 55.2% and 53.9%, respectively, and outperforming other Zigbee fuzzers in these aspects.

The LLMIF algorithm successfully uncovered 11 vulnerabilities on real-world Zigbee devices, including eight previously unknown vulnerabilities, showcasing its effectiveness in identifying security flaws in IoT devices. By incorporating the large language model into IoT fuzzing, LLMIF demonstrated enhanced capabilities in protocol message coverage and vulnerability discovery, highlighting its potential for improving the security testing of IoT devices.

10) Blockchain Security: SmartGuard [89] is a framework that combines large language models with advanced reasoning techniques to detect vulnerabilities in smart contracts. It uses semantic similarity to retrieve relevant code snippets and employs Chain-of-Thought (CoT) reasoning for in-context learning. The framework includes a self-check mechanism for generating reliable reasoning chains from labeled data. Tests on the SolidiFI benchmark dataset show exceptional results, with a recall of 95.06% and an F1-score of 94.95%, outperforming existing tools in smart contract security.
vehicle-to-vehicle (V2V) and vehicle-to-infrastructure (V2I) techniques and Reinforcement Learning from Human Feed-
communication by providing adaptive decision-making and back (RLHF). However, other specifics, such as the model
ensuring data integrity. With features like incentive mecha- size, data size, and comprehensive training details, remain
nisms for node reliability, BlockLLM achieves significant im- undisclosed. Although GPT-4 could potentially be leveraged
provements, including an 18% reduction in latency and a 12% by cybercriminals for a wide range of attacks, such as social
increase in throughput, offering a scalable solution for secure engineering, if implemented strategically, it can also help
vehicular networks. Xiao et al. [91] advances the field of reduce the likelihood of individuals and organizations falling
smart contract vulnerability detection by focusing on Solidity prey to them.
v0.8, the latest version, unlike earlier works based on outdated 3) T5: Motivated by the trend of applying transfer learning
versions. By leveraging advanced prompting techniques with for NLP, researchers of Google have introduced T5 [113], an
five cutting-edge LLMs, the study significantly reduces false- encoder-decoder-based model that operates within the unified
positive rates (over 60%), showcasing the potential of refined text-to-text framework. Multiple variants of T5 with different
LLM utilization. However, the findings also reveal a significant sizes - ranging between 220M to 11B parameters- were devel-
drop in recall for specific vulnerabilities due to challenges oped to broaden the experimental scope and were trained on
adapting to newly introduced libraries and frameworks. Ad- massive amounts of data from various sources, including C4,
dressing these limitations could further enhance the precision Web Text, and Wikipedia. Building on the foundation of these
and robustness of LLM-based smart contract analysis. diverse model sizes and rich data sources, multiple approaches

V. GENERAL LLMS

Tables V, VI, and VII compare general transformer-based Large Language Models. General LLMs are typically trained on a diverse and broad range of data to provide a relatively comprehensive understanding, and they can handle various language tasks like translation, summarization, and question-answering. In contrast, code-specific LLMs are specialized models trained primarily on programming languages and related technical literature, which makes their primary role in understanding and generating programming code well-suited for tasks like automated code generation, code completion, and bug detection.

A. Prevalent LLMs
1) GPT-3: GPT-3 (the third version of the Generative Pre-trained Transformer series by OpenAI) was developed to prove that scaling language models substantially improves their task-agnostic few-shot performance [111]. Based on the transformer architecture, GPT-3 has eight variants ranging between 125M and 175B parameters, all trained for 300B tokens from datasets like Common Crawl, WebText, Books, and Wikipedia. Additionally, the models were trained on V100 GPUs leveraging techniques like autoregressive training, scaled cross-entropy loss, and others. GPT-3, especially its most capable 175B version, has demonstrated strong performance on many NLP tasks in different settings (i.e., zero-shot, one-shot, and few-shot), suggesting it could significantly improve cybersecurity applications if appropriately fine-tuned. This could translate to more effective phishing detection through precise language analysis, faster incident response, and other critical applications to enhance digital security measures.
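As an illustration of the few-shot setting discussed above, the following minimal sketch builds a few-shot phishing-detection prompt for a GPT-3-style completion model; complete is a hypothetical stand-in for any text-completion API, and the example e-mails are invented for illustration.

    FEW_SHOT_PROMPT = """Classify each email as PHISHING or LEGITIMATE.

    Email: "Your account is locked. Verify your password at hxxp://secure-login.example."
    Label: PHISHING

    Email: "Hi team, the meeting is moved to 3 pm tomorrow."
    Label: LEGITIMATE

    Email: "{email}"
    Label:"""

    def classify_email(email: str, complete) -> str:
        """Ask the model to continue the prompt with a single label."""
        prompt = FEW_SHOT_PROMPT.format(email=email.replace('"', "'"))
        # The model is expected to continue with one label token.
        return complete(prompt, max_tokens=3, temperature=0).strip()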
2) GPT-4: In 2023, the GPT-4 transformer-based model was released by OpenAI as the first large-scale multimodal model, exhibiting unprecedented performance on various benchmarks. The model's capability of processing image and text inputs has shifted the AI paradigm to a new level, expanding beyond traditional NLP. [112] declared that GPT-4 was trained using a vast corpus of web-based data and data licensed from third-party sources with autoregressive techniques and Reinforcement Learning from Human Feedback (RLHF). However, other specifics, such as the model size, data size, and comprehensive training details, remain undisclosed. Although GPT-4 could potentially be leveraged by cybercriminals for a wide range of attacks, such as social engineering, if implemented strategically, it can also help reduce the likelihood of individuals and organizations falling prey to them.
3) T5: Motivated by the trend of applying transfer learning in NLP, researchers at Google introduced T5 [113], an encoder-decoder-based model that operates within a unified text-to-text framework. Multiple variants of T5 with different sizes - ranging between 220M and 11B parameters - were developed to broaden the experimental scope and were trained on massive amounts of data from various sources, including C4, Web Text, and Wikipedia. Building on the foundation of these diverse model sizes and rich data sources, multiple approaches and different settings for pre-training and fine-tuning were examined and discussed, achieving performance that nearly matched human levels on one of the benchmarks. Considering that, the model's potential in cybersecurity applications is particularly promising. For instance, T5 can be utilized for threat intelligence by extracting critical information from vast security documents and then summarizing and organizing that information.
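Because T5 casts every task as text-to-text, summarization of security advisories reduces to a single call. The following minimal sketch uses the Hugging Face transformers summarization pipeline with the generic t5-small checkpoint; the advisory text is illustrative, and a fine-tuned checkpoint would be used in practice.

    from transformers import pipeline

    summarizer = pipeline("summarization", model="t5-small")

    advisory = (
        "A heap-based buffer overflow in the device's UPnP service allows "
        "remote attackers to execute arbitrary code via a crafted SUBSCRIBE "
        "request. Firmware versions prior to 2.4.1 are affected; the vendor "
        "recommends upgrading and disabling UPnP on exposed interfaces."
    )

    # T5 is prompted in its text-to-text format and returns a short abstract.
    summary = summarizer(advisory, max_length=40, min_length=10, do_sample=False)
    print(summary[0]["summary_text"])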

4) BERT: Bidirectional Encoder Representations from Transformers, commonly known as BERT, was presented by [114] to enhance fine-tuning-based approaches in NLP. It is available in two versions: BERT-Base, with 110M parameters, and BERT-Large, with 340M parameters, trained on 126GB of data from BooksCorpus and English Wikipedia. During its pre-training phase, BERT employed two key techniques: Masked Language Modeling (MLM) and Next Sentence Prediction (NSP). Building on these approaches, fine-tuning and feature-based methods have led to competitive performance from BERT-Large in particular. Since encoder-only models like BERT are known for their robust contextual understanding, applying such models to tasks like malware detection and software vulnerability analysis can be highly effective in cybersecurity.
5) ALBERT: Aiming to address the limitations related to GPU/TPU memory and training time in Large Language Models (LLMs), Google researchers developed A Lite BERT (ALBERT), a modified version of BERT with significantly fewer parameters [115]. Like other LLMs, ALBERT was introduced in various sizes, with options ranging from 12M to 235M parameters, all trained on data from BooksCorpus and English Wikipedia. Various methods and techniques were deployed during the pre-training stage, including Factorized Embedding Parameterization, Cross-layer Parameter Sharing, Inter-sentence Coherence Loss, and Sentence Order Prediction (SOP). As a result, one of the models (i.e., ALBERT-xxlarge) outperformed BERT-Large despite having fewer parameters. Thus, utilizing ALBERT in cybersecurity applications, such as phishing detection and malware classification, could significantly contribute to advancing cybersecurity infrastructure.
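Since BERT and ALBERT expose the same sequence-classification interface in the transformers library, a phishing-detection fine-tuning loop can be sketched as follows. This is a minimal sketch assuming a toy two-example dataset; a real system would use a large labeled corpus, batching, and validation, and swapping the checkpoint name (e.g., albert-base-v2) applies the same code to ALBERT.

    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=2)  # 0 = benign, 1 = phishing

    texts = ["Update your bank password here: hxxp://bank-verify.example",
             "Lunch at noon tomorrow?"]
    labels = torch.tensor([1, 0])  # illustrative labels only

    optim = torch.optim.AdamW(model.parameters(), lr=2e-5)
    model.train()
    for _ in range(3):  # a few illustrative steps on the toy batch
        batch = tok(texts, padding=True, truncation=True, return_tensors="pt")
        loss = model(**batch, labels=labels).loss  # cross-entropy on the [CLS] head
        loss.backward()
        optim.step()
        optim.zero_grad()

    model.eval()
    with torch.no_grad():
        enc = tok(["Verify your mailbox or it will be deleted"],
                  return_tensors="pt")
        p_phish = model(**enc).logits.softmax(-1)[0, 1].item()
    print(f"phishing probability: {p_phish:.2f}")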
TABLE V: Comparison of Large Language Models
Model | Architecture | Base Model | Parameters | Training Tokens | Pre-training Corpus | Corpus Volume | Released By | Applications | Use Cases in Cybersecurity | Training Scheme | Key Training Techniques | Quantization | Ref
GPT-3 | Decoder-only | NA | 175B | 300B | Books, Web text, Wikipedia, Common Crawl | +570GB | OpenAI | Language Modeling, Text Completion, QA | Malware Detection, Threat Intelligence, Social Engineering Detection | Pre-training, In-context learning | Autoregressive training, Scaled Cross-Entropy Loss, Backpropagation and gradient descent, Mixed precision training | NA | [111]
GPT-4 | Decoder-only | NA | NA | NA | Web Data, Third-party licensed data | NA | OpenAI | Language Modeling, Text Completion, QA | Malware Detection, Threat Intelligence, Social Engineering Detection | Pre-training, RLHF | Autoregressive training | NA | [112]
T5 | Encoder-decoder | NA | 11B | 1000B | C4, Web Text, Wikipedia | 750GB | Google | Language Modeling, Summarization, Translation | Malware Detection, Threat Intelligence, Social Engineering Detection | Pre-training, Fine-tuning | Text-to-text framework, Denotation-based pretraining | NA | [113]
BERT | Encoder-only | NA | 340M | 250B | BooksCorpus, English Wikipedia | 126GB | Google | Language Modeling, Classification, QA, NER | Malware Detection, Threat Intelligence, Intrusion Detection, Phishing Detection | Pre-training | Masked LM (MLM), Next-sentence prediction (NSP) | NA | [114]
ALBERT | Encoder-only | BERT | 235M (calculated) | +250B | BooksCorpus, English Wikipedia | NA | Google | Language Modeling, Classification | Malware Detection, Threat Intelligence, Intrusion Detection, Phishing Detection | Pre-training | Factorized embedding parameterization, Cross-layer parameter sharing, Inter-sentence coherence loss, Sentence order prediction (SOP) | NA | [115]
RoBERTa | Encoder-only | BERT | 355M | 2000B | BooksCorpus, English Wikipedia | NA | Meta | Language Modeling, Classification, QA, NER | Malware Detection, Threat Intelligence, Intrusion Detection, Phishing Detection | Pre-training | Dynamic Masking, Full-Sentences without NSP loss, Large mini-batches, Larger byte-level BPE | NA | [116]
XLNet | Encoder-only | Transformer-XL | 340M (calculated) | +2000B (calculated) | English Wikipedia | 158GB | CMU, Google | Language Modeling, Classification, QA | Malware Detection, Threat Intelligence, Intrusion Detection, Phishing Detection | Pre-training | Permutation LM (PLM), Two-stream self-attention, Segment Recurrence and Relative Encoding | NA | [117]
ProphetNet | Encoder-decoder | NA | 550M | +260B (calculated) | Web Data, Books | 160GB | Microsoft Research Asia | Language Modeling, Question Generation, Summarization | Cybersecurity Reporting, Threat Intelligence | Pre-training, Fine-tuning | Masked Sequence generation, Autoregressive training, Denoising Autoencoder objective, Shared Parameters between encoder and decoder, Maximum Likelihood Estimation (MLE) | NA | [118]
Falcon | Decoder-only | NA | 7-180B | 5000B | Web Data | NA | TII | Language Modeling, Text Completion, QA | Malware Detection, Threat Intelligence, Social Engineering Detection | Pre-training | Autoregressive training, FlashAttention, ALiBi Positional encoding | NA | [119]
Reformer | Encoder-decoder | NA | Up to 6B | +150B (calculated) | Web Data | NA | Google | Language Modeling, Classification | Malware Detection, Threat Intelligence, Intrusion Detection, Phishing Detection | Pre-training | Locality-Sensitive Hashing (LSH) Attention, Chunked Processing, Shared-QK Attention Heads, Reversible layers | NA | [120]

TABLE VI: Continued
Model | Architecture | Base Model | Parameters | Training Tokens | Pre-training Corpus | Corpus Volume | Released By | Applications | Use Cases in Cybersecurity | Training Scheme | Key Training Techniques | Quantization | Ref
PaLM | Decoder-only | NA | 540B | 780B | Webpages, books, Wikipedia, news articles, source code, social media conversations, GitHub | 2TB | Google | Language Modeling, QA, Translation | Threat Intelligence, Security Policies Generation | Pre-training | SwiGLU Activation, Parallel Layers, Multi-Query attention (MQA), RoPE embeddings, Shared Input-Output embedding | NA | [121]
PaLM2 | Decoder-only | NA | NA | NA | Web documents, books, code, mathematics, conversational data | NA | Google | Language Modeling, QA, Summarization | Threat Intelligence, Security Policies Generation | Pre-training | Compute-optimal scaling, Canary token sequences, Control tokens for inference | NA | [122]
LLaMA | Decoder-only | NA | 7-65B | 1400B | CommonCrawl, C4, GitHub, Wikipedia, Books, arXiv, StackExchange | 177GB | Meta | Language Modeling, Text Completion, QA | Threat Intelligence, Malware Detection | Pre-training | Pre-normalization, SwiGLU activation function, Rotary Embedding, Model and sequence parallelism | NA | [123]
LLaMA2 | Decoder-only | NA | 7-70B | 2000B | Mix of publicly available data | NA | Meta | Language Modeling, Text Completion, QA | Threat Intelligence, Malware Detection | Pre-training, Fine-tuning, RLHF | Optimized autoregressive training, Grouped Query Attention (GQA) | NA | [124]
GShard | MoE | NA | 600B | 1000B | Web Data | NA | Google | Language Modeling | Threat Intelligence, Intrusion Detection, Malware Detection | Pre-training | Conditional Computation, Lightweight Annotation APIs, XLA SPMD partitioning, Position-wise MoE | NA | [125]
ELECTRA | Encoder-only | NA | 335M | +1800B (calculated) | BooksCorpus, English Wikipedia | 158GB | Google | Language Modeling, Classification | Threat Intelligence, Intrusion Detection, Malware Detection, Phishing Detection | Pre-training, Fine-tuning | Replaced token detection, Generator-discriminator framework, Token replacement, Weight-sharing | NA | [126]
MPT-30B | Decoder-only | NA | 30B | 1000B | C4, mC4, CommonCrawl, Wikipedia, Books, arXiv | NA | MosaicML | Language Modeling, Text Completion, QA | Threat Intelligence, Malware Detection, Software Vulnerability | Pre-training | FlashAttention, ALiBi positional encoding | NA | [127]
Yi-34B | NA | NA | 34B | 3000B | Chinese and English dataset | NA | 01.AI | Language Modeling, Question Answering | Threat Intelligence, Phishing Detection, Vulnerability Assessment | Pre-training, Fine-tuning | NA | GPTQ, AWQ | [128]
Phi-3-mini | Decoder-only | NA | 3.8B | 3.3T | Phi-3 datasets (public documents, synthetic data, chat formats) | NA | Microsoft | Language Modeling, Text Completion, QA | Threat Intelligence, Intrusion Detection, Malware Detection | Pre-training, Fine-tuning | LongRope, Grouped Query Attention (GQA) | NA | [129]
Mistral 7B | Decoder-only | NA | 7.24B | NA | NA | NA | Mistral AI | Language Modeling, Text Completion, QA | Threat Intelligence, Intrusion Detection, Malware Detection | Pre-training, Fine-tuning | Sliding Window Attention, Grouped Query Attention (GQA), Byte-fallback BPE tokenizer | NA | [130]
Cerebras-GPT 2.7B | Decoder-only | NA | 2.7B | 371B | The Pile Dataset | 825GB | Cerebras | Language Modeling, Text Completion, QA | Threat Intelligence, Intrusion Detection, Malware Detection | Pre-training | Standard trainable positional embeddings and GPT-2 transformer, GPT-2/3 vocabulary and tokenizer block | NA | [131]
ZySec-AI/ZySec 7B | Decoder-only | NA | 7.24B | NA | Trained across 30+ domains in cybersecurity | NA | ZySec AI | Language Modeling, Text Completion, QA | Expert guidance in cybersecurity issues | Pre-training | NA | NA | [132]
DeciLM 7B | Decoder-only | NA | 7.04B | NA | NA | NA | Deci | Language Modeling, Text Completion, QA | Threat Intelligence, Intrusion Detection, Malware Detection | Pre-training | Grouped-Query Attention (GQA) | NA | [133]

TABLE VII: Continued
Model | Architecture | Base Model | Parameters | Training Tokens | Pre-training Corpus | Corpus Volume | Released By | Applications | Use Cases in Cybersecurity | Training Scheme | Key Training Techniques | Quantization | Ref
Zephyr 7B Beta | Decoder-only | Mistral 7B | 7.24B | NA | NA | NA | HuggingFace | Language Modeling, Text Completion, QA | Threat Intelligence, Intrusion Detection, Malware Detection | Fine-tuning | Flash Attention, Direct Preference Optimization (DPO) | NA | [134]
Dolly v2 12B | Decoder-only | Pythia 12B | 12B | 3T | The Pile Dataset | 825GiB | Databricks | Language Modeling, Text Completion, QA | Threat Intelligence, Intrusion Detection, Malware Detection | Fine-tuning | NA | NA | [135]
Falcon2 11B | Decoder-only | NA | 11.1B | 5T | RefinedWeb enhanced with curated corpora | NA | TII | Language Modeling, Text Completion, QA | Malware Detection, Threat Intelligence, Social Engineering Detection | Pre-training | ZeRO, high-performance Triton kernels, FlashAttention-2 | NA | [136]

6) RoBERTa: RoBERTa, proposed by Meta, is an optimized replication of BERT that demonstrates how the choice of hyperparameters can significantly impact the model's performance [116]. RoBERTa has only one version, with 355M parameters, but is trained and tested with various data sizes and training steps. Similar to BERT, the training data was taken from BooksCorpus and English Wikipedia. However, the key optimizations in this model were in the training techniques, which included multiple methods such as Dynamic Masking, training on Full Sentences without NSP loss, using Large Mini-Batches, and employing a Larger Byte-Level BPE. Consequently, RoBERTa achieved state-of-the-art results on some of the benchmarks. With proper fine-tuning, RoBERTa's ability to understand, interpret, and generate human-like text can be leveraged to automate and enhance various tasks in the realm of cybersecurity.
7) XLNet: The advances and limitations of Masked Language Modeling (MLM) in bidirectional encoders and Autoregressive Language Modeling inspired researchers at CMU and Google AI to develop XLNet [117]. Based on the Transformer-XL model, XLNet combines aspects of both approaches, enabling the learning of bidirectional contexts while addressing common MLM issues, such as neglecting dependencies between masked positions and the discrepancy between pretraining and finetuning phases. With 340M parameters, XLNet was pre-trained using data from English Wikipedia and utilizing techniques like Permutation Language Modeling (PLM), Two-stream attention, Segment Recurrence, and Relative Encoding. Due to the careful design of the model and strategic pre-training techniques, XLNet achieved substantial performance gains over other popular models like BERT, making it - after appropriate fine-tuning - a capable tool for enhancing various aspects of the cybersecurity field.
8) ProphetNet: ProphetNet, proposed by Microsoft, is a sequence-to-sequence pre-trained model that aims to address the issue of overfitting on strong local correlations by leveraging two novel techniques, namely future n-gram prediction and n-stream self-attention [118]. Built on an encoder-decoder architecture and trained on 16GB base-scale and 160GB large-scale datasets sourced from web data and books, ProphetNet, with its 550M parameters, achieved new state-of-the-art results on multiple benchmarks. The model was also fine-tuned for two downstream tasks, Question Generation and Text Summarization, where it achieved the best performance. Therefore, utilizing ProphetNet in cybersecurity tasks such as automated security incident summarization could significantly enhance efficiency and decision-making.
9) Falcon: Falcon LLM, built on a decoder-only architecture, was introduced by the Technology Innovation Institute (TII) as a proof-of-concept that enhancing data quality can significantly improve LLM performance even with purely web-sourced data [119]. This insight is increasingly relevant as scaling in LLMs, which is becoming more prevalent, requires more data for processing. The model has three versions (i.e., 7B, 40B, 180B) pre-trained on the "RefinedWeb" dataset proposed by TII. RefinedWeb, sourced exclusively from web data, was subjected to various filtering and deduplication techniques to ensure high quality. Autoregressive training, Flash Attention, and ALiBi positional encoding were the methods used for pre-training. With further fine-tuning, Falcon can advance cybersecurity, particularly in threat intelligence and analysis.
10) Reformer: Striving to address common memory limitations in LLMs, Google proposed the Reformer, an encoder-decoder memory-efficient LLM [120]. With up to 6B parameters, Reformer was pre-trained on web data using techniques including Locality-Sensitive Hashing (LSH) Attention, Chunked Processing, Shared-QK Attention Heads, and Reversible layers. These techniques were proven to have a negligible impact on the training process compared to the standard Transformer, as the Reformer achieved results that matched the full Transformer but with much faster processing and better memory efficiency. Subsequently, employing Reformer for tasks like large-scale data analysis could serve the cybersecurity field by enabling more efficient processing and analysis of extensive datasets.
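The bucketing idea behind LSH attention can be illustrated in a few lines. The following is a toy sketch using random-hyperplane hashing (an assumption chosen for brevity; Reformer's actual scheme differs in detail) that only shows how sequence positions are grouped into buckets, not a full attention implementation.

    import numpy as np

    def lsh_buckets(vectors: np.ndarray, n_planes: int = 4, seed: int = 0) -> np.ndarray:
        """Hash vectors with random hyperplanes; similar vectors tend to share buckets."""
        rng = np.random.default_rng(seed)
        planes = rng.normal(size=(vectors.shape[1], n_planes))
        bits = (vectors @ planes) > 0                       # sign pattern per vector
        return (bits * (2 ** np.arange(n_planes))).sum(1)   # pack bits into bucket ids

    # Attention is then restricted to positions within the same bucket, so cost
    # grows with bucket size rather than with the square of the sequence length.
    sequence = np.random.default_rng(1).normal(size=(8, 16))  # 8 positions, dim 16
    print(lsh_buckets(sequence))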
11) PaLM: Driven by advances in machine learning and natural language processing, Google developed PaLM to examine the impact of scale on few-shot learning [121]. PaLM, built on a decoder-only architecture, was trained with 540B parameters using Pathways, a new system that enables highly efficient training across multiple TPU pods. The model was trained on 2TB of data from multiple sources, including news articles, Wikipedia, source code, etc. SwiGLU Activation, Parallel Layers, and other techniques were deployed for pre-training three different parameter scales, 8B, 62B, and 540B, to better understand the scaling behavior. An observed discontinuous improvement indicated that as LLMs reach a certain level of scale, they exhibit new abilities. Furthermore, these emerging capabilities continue to evolve and become apparent even beyond the scales that have been previously explored and documented. Subsequently, PaLM achieved a breakthrough by outperforming the fine-tuned state of the art and the average human on some benchmarks, proving that when scaling is combined with chain-of-thought prompting, basic few-shot evaluation has the potential to equal or surpass the performance of fine-tuned state-of-the-art models across a broad spectrum of reasoning tasks. With such strong capabilities, utilizing PaLM for tasks like generating security policies and incident response automation can enhance the efficiency and effectiveness of cybersecurity operations.
12) PaLM2: PaLM2 is an advanced variant of the PaLM model that is more compute-efficient while offering better multilingual and reasoning capabilities [122]. The key enhancements in the model are the improved dataset mixtures, compute-optimal scaling, and architectural and objective improvements. The evaluation results of PaLM2 indicate that various approaches besides scaling, such as meticulous data selection and efficient architectures and objectives, can contribute to a model's enhancement. Moreover, the fact that PaLM2 outperformed its predecessor PaLM despite its significantly smaller size shows that model quality has a greater influence on performance than model size, as it enables more efficient inference, reducing serving costs and potentially allowing for broader applications and accessibility to more users.
13) LLaMA: Proposed by Meta, the LLaMA decoder-only model is a proof-of-concept that it is possible to achieve state-of-the-art performance by training exclusively on publicly available data [123]. LLaMA, with multiple variants ranging between 7 and 65 billion parameters, was trained on 1400B tokens of publicly available datasets, including CommonCrawl, C4, arXiv, and others. Interestingly, the techniques used for training the model were inspired by multiple popular models like GPT-3 (Pre-normalization), PaLM (SwiGLU activation function), and GPT-Neo (Rotary Embedding). As a result of this incorporation, LLaMA-13B was able to outperform GPT-3 (175B) on most benchmarks despite being more than ten times smaller, while LLaMA-65B has shown to be competitive with Chinchilla-70B and PaLM-540B. Given its relatively small size and superior performance, fine-tuning LLaMA on cyber threat intelligence tasks could significantly enhance the security of edge devices.
14) LLaMA2: LLaMA2 is an optimized version of LLaMA developed by Meta and a collection of pre-trained and fine-tuned LLMs with sizes ranging from 7 to 70B parameters [124]. In the pre-training, a mixture of publicly available data was used for up to 2000B training tokens. Moreover, multiple techniques from the predecessor LLaMA were reused, such as Pre-normalization, the SwiGLU activation function, and Rotary positional embeddings. Two additional methods, namely increased context length and grouped-query attention (GQA), were also used. After pre-training, variants of the model (i.e., LLaMA2-Chat) were optimized for dialog use cases by supervised fine-tuning and reinforcement learning with human feedback (RLHF). The model evaluation, which focused on helpfulness and safety, showed superiority over the other open-source models and competitive performance against some closed-source models.
15) GShard: GShard was introduced by Google in 2020, aiming to address neural network scaling issues related to computation cost and training efficiency [125]. Based on a Mixture-of-Experts (MoE) transformer with 600B parameters, GShard was pre-trained on 1000B tokens of web data. Multiple techniques were deployed for the training stage, such as conditional computation, XLA SPMD partitioning, position-wise MoE, and parallel execution using annotation APIs. Subsequently, GShard outperformed prior models in translation tasks and exhibited a favorable trade-off between scale and computational cost, resulting in a practical and sample-efficient model. These results highlight the importance of considering training efficiency when scaling LLMs, which makes them more viable in the real world.
16) ELECTRA: The extensive computation cost of MLM pre-training methods inspired Google to propose ELECTRA, a 335M-parameter encoder-only transformer model that utilizes a novel pre-training approach called "replaced token detection" [126]. This technique allows the model to learn from the entire sequence rather than just a small portion of masked tokens. Given that the quality and diversity of ELECTRA's training data play a pivotal role in its ability to generalize across tasks, the model was trained on a vast BooksCorpus and English Wikipedia. Pre-training techniques were utilized, including replaced token detection, a generator-discriminator framework, token replacement, and weight-sharing. As a result, ELECTRA was able to perform comparably to popular models like RoBERTa and XLNet when using less than 25% of their compute and to outperform them when using equivalent compute. Deploying such a robust model in the security field after fine-tuning can provide an efficient solution for detecting and mitigating sophisticated cyber threats, thanks to its nuanced understanding of context and language patterns.
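The discriminator produced by replaced-token-detection pre-training can be queried directly. The following minimal sketch uses the pre-trained ELECTRA discriminator from the Hugging Face transformers library; the sentence is illustrative, with one token manually corrupted to simulate a generator output.

    import torch
    from transformers import ElectraForPreTraining, ElectraTokenizerFast

    name = "google/electra-small-discriminator"
    tok = ElectraTokenizerFast.from_pretrained(name)
    model = ElectraForPreTraining.from_pretrained(name)

    # "capital" was manually replaced by "banana" to simulate a corrupted input.
    sentence = "paris is the banana of france"
    inputs = tok(sentence, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0]  # one replaced-or-not score per token

    for token, score in zip(tok.convert_ids_to_tokens(inputs["input_ids"][0]), logits):
        print(f"{token:>10s}  replaced={torch.sigmoid(score).item():.2f}")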
17) MPT-30B: MPT-30B is a decoder-only transformer introduced by MosaicML after the notable success of MPT-7B [127]. The model has multiple variants: the base model and two fine-tuned variants, namely MPT-30B-Instruct and MPT-30B-Chat. Training the model on a variety of datasets such as C4, CommonCrawl, and arXiv, among others, together with the strategic selection of pre-training methods like FlashAttention and ALiBi positional encoding, contributed to a robust performance, surpassing even the original GPT-3 benchmarks. MPT-30B has also performed strongly in programming tasks, outperforming some open-source models designed specifically for code generation. With these capabilities, deploying MPT-30B in cybersecurity could substantially enhance threat detection and response systems. Its adeptness at understanding and generating programming languages promises advancements in automated vulnerability assessment and the development of sophisticated security protocols.
18) Yi-34B: The newly released LLM Yi-34B developed by 01.AI is getting attention as one of the best open-source LLMs [128]. Given the recent release of the model, its technical paper has not yet been published; hence, the available information is limited. The model has multiple variants, base and chat models, some quantized. All variants are trained on a dataset containing Chinese and English only, and the chat versions have gone through supervised fine-tuning, resulting in more efficient models for downstream tasks. The base model outperformed many open LLMs on certain benchmarks, including renowned ones like LLaMA2-70B and Falcon-180B. Even the quantized versions have demonstrated impressive performance, paving the way for their deployment in cybersecurity applications, such as edge security solutions.
19) Falcon2-11B: Falcon2-11B [136], built by TII, is a decoder-only model with 11 billion parameters, trained on an immense corpus of text data totaling over 5,000 billion tokens. In terms of performance, Falcon2-11B showcases impressive capabilities, supporting 11 languages: English, German, Spanish, French, Italian, Portuguese, Polish, Dutch, Romanian, Czech, and Swedish. While it excels at generating human-like text, it also carries the biases and stereotypes prevalent in its training data, a common challenge LLMs face. To address this, TII recommends fine-tuning the model for specific tasks and implementing guardrails for production use. In the training process of Falcon2-11B, TII utilized a four-stage strategy with increasing context lengths, reaching a context length of 8162 in the final stage, which focused on enhancing performance using high-quality data. Additionally, the training leveraged 1024 A100 40GB GPUs and a custom distributed training codebase named Gigatron, which employs a 3D parallelism approach combined with ZeRO, high-performance Triton kernels, and FlashAttention-2 for efficient and effective training.

B. LLMs Performance in Hardware Cybersecurity

Table VIII compares the performance of 19 publicly available LLMs in hardware-design-related bug detection and security issue identification using samples from various sources. A portion of the Chrysalis dataset [102] has been used to evaluate the performance of the LLM models in bug detection tasks. A set of faults has been injected intentionally into functional code and labeled as faulty. The processed sample size comprises 10K hardware-design-related code samples. The prompt that has been used instructs the model to check the concerned code for any issue or bug and respond only with yes or no; the result is presented as the ratio of responses where the model successfully identified buggy code out of the total samples used (a minimal sketch of this evaluation loop is given after Table VIII). The Hardware CWE column evaluates the capability of the models to link buggy code to its CWE number; here, the prompt has been designed to ask for a well-defined CWE number for the buggy design. This evaluation process assesses the capability of an LLM in bug detection and classification into the correct CWE class.
The top performers in this evaluation in terms of design bug detection are Llama 3 and Mixtral. While the Llama 3 model performs better in the bug detection tasks, it lacks proper identification of the CWE issue related to the faulty section. Mixtral models show lower performance at identifying bugs but higher diversity in identifying a bug's security impact on the overall design implementation. The outcomes of this experiment reveal that some models cannot identify the right issues in the source code, which might require further refinement of the used prompt and/or fine-tuning the general-purpose models on bug-locating tasks. The results also show that model size does not greatly impact performance at locating bugs or at reasoning about their corresponding impact (CWE class identification). While the selected samples do not exceed the context length of the evaluated models, a model's maximum context size might still reveal an advantage for larger models when dealing with large source files; however, superior bug identification and reasoning are also required to deliver the required performance.
In conclusion, the highlighted results reveal that the existing models might be subject to weaknesses in identifying bugs in hardware designs that might lead to security-related issues. The two-step evaluation process gives better visibility for building more robust dedicated LLMs for hardware design security evaluation: models that properly locate bugs do not show similar performance in classifying the bug's impact on the overall design. The outcomes could be evaluated with a larger sample size and a more dedicated study at a large scale to reach conclusive results.

C. LLMs Performance in Cybersecurity Knowledge

Table IX compares the performance of 42 LLMs in the cybersecurity domain using the CyberMetric dataset [104]. Figure 5 presents the evaluation steps. The models are evaluated based on their accuracy across four question sets: 80 questions, 500 questions, 2,000 questions, and 10,000 questions. The performance is reported as percentage accuracy, offering a comprehensive view of each model's proficiency in handling cybersecurity-related queries (a minimal sketch of this accuracy computation is given after Table IX).
The top performers in this evaluation are the GPT-4o and GPT-4-turbo models by OpenAI. These models demonstrate exceptional performance, with GPT-4o achieving 96.25% accuracy on the 80-question set and maintaining high accuracy with 88.89% on the 10,000-question set. GPT-4-turbo closely follows with similar accuracy percentages. Both models are proprietary and developed by OpenAI, indicating a high optimization level for specialized tasks within a controlled environment. Another strong performer is Mixtral-8x7B-Instruct by Mistral AI, which boasts an accuracy of 92.50% on the 80-question set and 87.00% on the 10,000-question set. This model is open-source under the Apache 2.0 license, demonstrating the potential of community-driven development in achieving high performance. Additionally, GEMINI-pro 1.0 by Google shows robust performance, achieving 90.00% accuracy on the 80-question set and 87.50% on the 10,000-question set, highlighting the capabilities of large-scale corporate research and development in LLMs.
Mid-tier performers include models like Yi-1.5-9B-Chat by 01-ai and Hermes-2-Pro-Llama-3-8B by NousResearch.
TABLE VIII: Comparison of 19 LLM Models' Performance in Hardware Security Knowledge.
LLM model | Size | Design bug detection | Hardware CWE Number: 1245 1221 1224 1298 1254 1209 1223 1234 1231
Llama 3-7b-instruct 8B 39.556% Yes No No No Yes No No Yes No
Mixtral-8x7B-Instruct 8x7B 16.154% No No No No No No No No No
Dolphin-mistral-7B 7B 16.024% Yes Yes No No No No No No No
Codegemma-9b-instruct 9B 10.746% No No No No No No No No Yes
CodeQwen-7b-instruct 7B 10.269% No No No No No No No No No
Wizard-vicuna-uncensored-7b-instruct 7B 9.374% No No No No No No No No No
Mistral-openorca-7b-instruct 7B 8.241% No No No No No Yes No No No
Wizardlm2-7b-instruct 7B 5.646% No No No No No No No No No
Llama2-uncensored-7b-instruct 7B 2.505% No No No No No No No No No
Falcon-40b-instruct 40B 1.620% No No No No No No No No No
Deepseek-coder-33b-instruct 33B 1.570% No No No No No No No No No
Orca-mini-3b-instruct 3B 1.173% No No Yes No No No No No No
Qwen2-4b-instruct 4B 0.576% No No No No No No No No No
CodeLlama-7b-instruct 7B 0.218% No No No No No No No No No
Phi3-4b-instruct 4B 0.019% No No No No No No No No No
Hardware-Phi 1.5B 0% No No No No No No No No No
Llava-13b-instruct 13B 0% No No No No No No No No No
Gemma-9b-instruct 9B 0% No No No No No No No No No

Starcoder2-15b-instruct 15B 0% No No No No No No No No No
Yes: Detected the CWE sample by MITRE, No: Did not Detect the CWE sample by MITRE. CWE: Common Weakness Enumeration.
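The two-step evaluation reported in Table VIII can be sketched as follows; ask_llm is a hypothetical helper wrapping whichever model is under test, and the prompts paraphrase the ones described above.

    def evaluate_bug_detection(samples, ask_llm):
        """samples: iterable of dicts with 'code' (HDL source) and 'is_buggy' flag.
        Returns the ratio of buggy designs the model correctly flags."""
        detected = total_buggy = 0
        for s in samples:
            prompt = ("Check the following hardware design code for any issue or "
                      "bug. Respond only with yes or no.\n\n" + s["code"])
            answer = ask_llm(prompt).strip().lower()
            if s["is_buggy"]:
                total_buggy += 1
                detected += answer.startswith("yes")
        return detected / max(total_buggy, 1)

    def classify_cwe(buggy_code, ask_llm):
        """Second step: ask for a well-defined CWE number for a buggy design."""
        prompt = ("The following hardware design contains a bug. Reply with the "
                  "single most appropriate CWE number (e.g., CWE-1245).\n\n"
                  + buggy_code)
        return ask_llm(prompt).strip()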

TABLE IX: Comparison of 42 LLM Models' Performance in Cybersecurity Knowledge.
LLM model | Company | Size | License | Accuracy: 80 Q / 500 Q / 2k Q / 10k Q
GPT-4o OpenAI N/A Proprietary 96.25% 93.40% 91.25% 88.89%
GPT-4-turbo OpenAI N/A Proprietary 96.25% 93.30% 91.00% 88.50%
Mixtral-8x7B-Instruct Mistral AI 45B Apache 2.0 92.50% 91.80% 91.10% 87.00%

Falcon-180B-Chat TII 180B Apache 2.0 90.00% 87.80% 87.10% 87.00%


GEMINI-pro 1.0 Google 137B Proprietary 90.00% 85.05% 84.00% 87.50%
GPT-3.5-turbo OpenAI 175B Proprietary 90.00% 87.30% 88.10% 80.30%
Yi-1.5-9B-Chat 01-ai 9B Apache 2.0 87.50% 80.80% 77.15% 76.04%

Hermes-2-Pro-Llama-3-8B NousResearch 8B Open 86.25% 80.80% 77.95% 77.33%


Dolphin-2.8-mistral-7b-v02 Cognitive Computations 7B Apache 2.0 83.75% 77.80% 76.60% 75.01%
Mistral-7B-OpenOrca Open-Orca 7B Apache 2.0 83.75% 80.20% 79.00% 76.71%

Gemma-1.1-7b-it Google 7B Open 82.50% 75.40% 75.75% 73.32%


Flan-T5-XXL Google 11B Apache 2.0 81.94% 71.10% 69.00% 67.50%
Meta-Llama-3-8B-Instruct Meta 8B Open 81.25% 76.20% 73.05% 71.25%

Zephyr-7B-beta HuggingFace 7B MIT 80.94% 76.40% 72.50% 65.00%


Yi-1.5-6B-Chat 01-ai 6B Apache 2.0 80.00% 75.80% 75.70% 74.84%
Mistral-7B-Instruct-v0.2 Mistral AI 7B Apache 2.0 78.75% 78.40% 76.40% 74.82%
Llama 2-70B Meta 70B Apache 2.0 75.00% 73.40% 71.60% 66.10%
Qwen1.5-7B Qwen 7B Open 73.75% 60.60% 61.35% 59.79%
Qwen1.5-14B Qwen 14B Open 71.25% 70.00% 72.00% 69.96%
Mistral-7B-Instruct-v0.1 Mistral AI 7B Apache 2.0 70.00% 71.80% 68.25% 67.29%
Llama-3-8B-Instruct-Gradient-1048k Bartowski 8B Open 66.25% 58.00% 56.30% 55.09%
Qwen1.5-MoE-A2.7B Qwen 2.7B Open 62.50% 64.60% 61.65% 60.73%
Phi-2 Microsoft 2.7B MIT 53.75% 48.00% 52.90% 52.13%
Llama3-ChatQA-1.5-8B Nvidia 8B Open 53.75% 52.80% 49.45 % 49.64%
DeciLM-7B Deci 7B Apache 2.0 52.50% 47.20% 50.44% 50.75%
Flan-T5-Base Google 0.25B Apache 2.0 51.25% 50.40% 48.55% 47.09%
Deepseek-moe-16b-chat Deepseek 16B MIT 47.50% 45.80% 49.55% 48.76%
Mistral-7B-v0.1 Mistral AI 7B Apache 2.0 43.75% 39.40% 38.15% 39.28%
Qwen-7B Qwen 7B Open 43.75% 58.00% 55.75% 54.09%
Gemma-7b Google 7B Open 42.50% 37.20% 36.00% 34.28%
Meta-Llama-3-8B Meta 8B Open 38.75% 35.80% 37.00% 36.00%
Genstruct-7B NousResearch 7B Apache 2.0 38.75% 40.60% 37.55% 36.93%
Qwen1.5-4B Qwen 4B Open 36.25% 41.20% 40.50% 40.29%
Llama-2-13b-hf Meta 13B Open 33.75% 37.00% 36.40% 34.49%
Dolly V2 12b BF16 Databricks 12B MIT 33.75% 30.00% 28.75% 27.00%
Deepseek-llm-7b-base DeepSeek 7B MIT 33.75% 25.20% 27.00% 26.48%
Cerebras-GPT-2.7B Cerebras 7B Apache 2.0 25.00% 20.20% 19.75% 19.27%
Gemma-2b Google 2B Open 25.00% 23.20% 18.20% 19.18%
Stablelm-2-1 6b Stability AI 6B Open 16.25% 21.80% 19.55% 20.09%
ZySec-7B ZySec-AI 7B Apache 2.0 12.50% 16.40% 15.55% 14.04%
Phi-3-mini-4k-instruct Microsoft 3.8B MIT 5.00% 5.00% 4.41% 4.80%
Phi-3-mini-128k-instruct Microsoft 3.8B MIT 1.25% 0.20% 0.70% 0.88%
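The per-question-set accuracies reported in Table IX can be computed with a loop of the following shape; the question format and the ask_llm helper are hypothetical stand-ins for the CyberMetric multiple-choice items and the model under test.

    def mcq_accuracy(questions, ask_llm):
        """questions: dicts with 'question', 'options' (A-D), 'solution' letter.
        Returns percentage accuracy over one question set."""
        correct = 0
        for q in questions:
            options = "\n".join(f"{k}) {v}" for k, v in sorted(q["options"].items()))
            prompt = (f"{q['question']}\n{options}\n"
                      "Answer with exactly one letter (A, B, C, or D).")
            reply = ask_llm(prompt).strip().upper()
            correct += bool(reply) and reply[0] == q["solution"]
        return 100.0 * correct / len(questions)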

Yi-1.5-9B-Chat performs reasonably well with 87.50% accuracy on the 80-question set, tapering to 76.04% on the 10,000-question set. Under the Apache 2.0 license, this model shows a balance between open-source collaboration and performance. Hermes-2-Pro-Llama-3-8B achieves 86.25% accuracy on the 80-question set and 77.33% on the 10,000-question set, further underscoring the effectiveness of collaborative research efforts.
Lower-tier performers include models like Qwen1.5-7B by Qwen. Qwen1.5-7B scores 73.75% on the 80-question set, dropping to 59.79% on the 10,000-question set. As an open model, Qwen1.5-7B indicates the challenges faced by smaller models in maintaining high accuracy with increasing question set sizes. Falcon-40B-Instruct achieves 67.50% accuracy on the 80-question set and 64.50% on the 10,000-question set. Licensed under Apache 2.0, it highlights the competitive landscape of open-source LLMs.
The lowest-tier performers include models such as Phi-3-mini-128k-instruct by Microsoft and Stablelm-2-1 6b by Stability AI. Phi-3-mini-128k-instruct has the lowest performance, with only 1.25% accuracy on the 80-question set and 0.88% on the 10,000-question set. Despite being from a major company like Microsoft and licensed under MIT, this model underscores the importance of continuous development and optimization in LLMs. Stablelm-2-1 6b scores 16.25% on the 80-question set and 20.09% on the 10,000-question set, demonstrating smaller models' difficulties in scaling up effectively.
In conclusion, the table reveals that proprietary models perform better than open-source models, suggesting that controlled environments and dedicated resources may significantly enhance model performance. However, larger models do not always guarantee higher performance, as seen with some mid- and lower-tier performers. Additionally, many models show a decline in accuracy as the number of questions increases, highlighting the challenges in maintaining performance consistency across larger datasets. The analysis indicates that while top-tier proprietary models lead in performance, there is significant potential within the open-source community to develop competitive models. Continuous improvements in model architecture, training data quality, and optimization techniques are crucial for advancing state-of-the-art cybersecurity knowledge within LLMs.

VI. CODE-SPECIFIC LLMS

The rapid evolution of technology and software development has increased the demand for specialized tools that aid in coding, debugging, and enhancing software security [152], [153]. Recognizing this need, various organizations have developed code-specific LLMs, each offering unique features and capabilities. These models leverage advanced machine learning techniques to understand, generate, and manipulate code, thereby revolutionizing the field of software development [154], [155]. This section delves into several notable code-specific LLMs, exploring their architectures, training methods, and potential applications in cybersecurity and beyond [156]-[159]. Table X and Table XI compare code-specific Large Language Models.

A. Prevalent LLMs

1) SantaCoder: As part of the BigCode project, HuggingFace and ServiceNow proposed the SantaCoder LLM [137]. Based on the decoder-only architecture and with 1.1B parameters, SantaCoder was trained on 268GB of Python, Java, and JavaScript subsets of The Stack dataset. Multiple filtering techniques were applied to the training data without much impact, except for one (i.e., filtering files from repositories with 5+ GitHub stars), which significantly deteriorated performance on text2code benchmarks. Pre-training methods included Multi-Query Attention (MQA) and Fill-in-the-Middle (FIM); a sketch of FIM prompting is given just before Table X. Although these techniques led to a slight drop in the model's performance compared to Multi-Head Attention (MHA) and training without FIM, the model could still outperform previous multi-lingual code models like CodeGen-Multi-2.7B and InCoder-6.7B despite being substantially smaller. Such performance can be promising if deployed in cybersecurity for tasks like software vulnerability detection and secure code generation.
2) StarCoder: StarCoder is another decoder-only model developed within the BigCode project [138]. With 15.5B parameters, StarCoder was pre-trained on 1000B tokens from over 80 different programming languages. The pre-training utilized techniques such as FIM, MQA, and Learned Absolute Positional Embeddings. After pre-training, the base model was fine-tuned on an additional 35B tokens of Python. Compared to other code LLMs, StarCoder outperformed all fine-tuned models on Python. Moreover, the base model outperformed OpenAI's code-cushman-001. StarCoder's exceptional performance in Python and its broad training in multiple programming languages position it as a highly versatile tool for various coding tasks.
3) StarChat-Alpha: StarChat Alpha is a variant of StarCoder fine-tuned to act as a helpful coding assistant that accepts natural language prompting (considering that StarCoder needs specific structured prompting) [139]. With 16B parameters, the model was fine-tuned on a mixture of the oasst1 and databricks-dolly-15k datasets. The model has not undergone RLHF or similar methods, which would have helped align it with human preferences. Nevertheless, the comprehensive pre-training of the base model contributed to the model's ability to interpret various coding tasks and provide accurate code suggestions. This capability makes it an invaluable programming tool, simplifying code development and problem-solving.
4) CodeGen-2: Developed by Salesforce AI Research, CodeGen2 was proposed as a product of extensive research in the field of LLMs aimed at optimizing model architectures and learning algorithms to enhance efficiency and reduce the costs associated with LLMs [140]. The final findings were examined in multiple variants with parameters ranging from 1B to 16B, where the 16B model is trained on 400B tokens from the Stack dataset. Causal language modeling, cross-entropy loss, and other techniques were used for pre-training, resulting in a robust program synthesis model. CodeGen2's proficiency in program synthesis makes it a valuable asset in cybersecurity applications, such as aiding in vulnerability detection and enhancing code security analysis. Its ability to understand and generate complex code can support efficient security protocols and automated threat detection systems.
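The FIM objective used by SantaCoder and StarCoder lets a model complete code given both a prefix and a suffix. The sketch below assembles such a prompt; the sentinel token spellings follow the BigCode convention but vary between checkpoints (StarCoder, for example, uses underscores), so they should be checked against the tokenizer of the exact model used, and the completion call itself is left abstract.

    def fim_prompt(prefix: str, suffix: str) -> str:
        """Assemble a Fill-in-the-Middle prompt; the model generates the middle."""
        return f"<fim-prefix>{prefix}<fim-suffix>{suffix}<fim-middle>"

    prefix = "def is_valid_port(value: int) -> bool:\n    return "
    suffix = "\n\nassert is_valid_port(443)\n"
    print(fim_prompt(prefix, suffix))
    # Fed to a FIM-trained model, the expected middle is something like
    # "0 < value < 65536", conditioned on both the prefix and the suffix.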
TABLE X: Comparison of Code-specific Large Language Models
Model | Architecture | Base Model | Parameters | Training Tokens | Pre-training Corpus | Corpus Volume | Released By | Applications | Use Cases in Cybersecurity | Training Scheme | Key Training Techniques | Quantization | Ref
SantaCoder | Decoder-only | NA | 1.1B | 236B | The Stack dataset v1.1 (Python, Java, and JavaScript) | 268GB | HuggingFace, ServiceNow | Code Generation, Code Completion, Code Analysis, QA | Threat Intelligence, Software Vulnerability, Source Code Generation | Pre-training | Multi-Query Attention (MQA), Fill-in-the-Middle (FIM) | NA | [137]
StarCoder | Decoder-only | NA | 15.5B | PT 1000B, FT 35B | 80+ programming languages, Git commits, GitHub issues, and Jupyter notebooks | +800GB | HuggingFace, ServiceNow | Code Generation, Code Completion, Code Analysis, QA | Threat Intelligence, Software Vulnerability Detection | Pre-training, Fine-tuning | Fill-in-the-Middle (FIM), Multi-Query Attention (MQA), Learned absolute positional embeddings | NA | [138]
StarChat Alpha | Decoder-only | StarCoder-base | 16B | NA | oasst1 and databricks-dolly-15k datasets | NA | HuggingFace, ServiceNow | Code Generation, Code Completion, Code Analysis, QA | Threat Intelligence, Software Vulnerability | Fine-tuning | NA | NA | [139]
CodeGen2 | Decoder-only (causal LM) | NA | 1-16B | 400B | Stack dataset v1.1 | NA | Salesforce | Program Synthesis, Code Generation | Threat Intelligence, Software Vulnerability | Pre-training | Causal Language Modeling, Cross-entropy Loss, File-level Span Corruption, Infilling | NA | [140]
CodeGen2.5 | Decoder-only (causal LM) | NA | 7B | 1400B | StarCoderData | NA | Salesforce | Code Generation, Code Completion, Code Analysis | Threat Intelligence, Software Vulnerability | Pre-training | Flash Attention, Infill Sampling, Span Corruption | NA | [141]
CodeT5+ | Encoder-decoder | NA | 220M-16B | 51.5B | CodeSearchNet dataset, GitHub code dataset | NA | Salesforce | Code Generation and Completion, Math Programming, Text-to-code Retrieval Tasks | Threat Intelligence, Software Vulnerability | Pre-training | Span Denoising, Contrastive Learning, Text-code Matching, Causal Language Modeling (CLM) | NA | [142]
XGen-7B | Decoder-only | NA | 7B | 1500B | GitHub, several public sources, Apex code data (mixture of natural text data and code data) | NA | Salesforce | Code Generation, Summarization | Threat Intelligence, Software Vulnerability | Pre-training, Fine-tuning | Standard Dense Attention, Two-stage Training Strategy | NA | [143]
Replit Code V1 | Decoder-only (causal LM) | NA | 2.7B | 525B | Stack Dedup v1.2 dataset (20 different languages) | NA | Replit, Inc. | Code Completion, Code Generation | Threat Intelligence, Software Vulnerability | Pre-training | Flash Attention, ALiBi Positional Embeddings, LionW Optimizer | Matrix Multiplication | [144]
DeciCoder-1B | Decoder-only | NA | 1B | 446B | StarCoderData (Python, Java, and JavaScript) | NA | Deci | Code Completion, Code Generation, Code Analysis | Threat Intelligence, Software Vulnerability | Pre-training | Fill-in-the-Middle training (FIM), Grouped Query Attention (GQA) | NA | [145]
CodeLLAMA | Decoder-only | LLaMA2 | 7-34B | 620B | Text and code from multiple datasets | NA | Meta | Code Completion, Code Generation, Code Analysis | Threat Intelligence, Software Vulnerability | Pre-training, Fine-tuning | Causal Infilling, Autoregressive Training, Repository-level Reasoning, Long-context Fine-tuning | NA | [146]
CodeQwen1.5-7B | Decoder-only | Qwen1.5 | 7.25B | 3T | Code-related data | NA | Qwen | Code Generation, Code Completion, Code Analysis | Threat Intelligence, Software Vulnerability, Bug fixes | Pre-training | Flash Attention, RoPE, Grouped-Query Attention (GQA) | NA | [147]
DeepSeek Coder-33B-instruct | Decoder-only | NA | 33.3B | 2T | Composition of code and natural language | NA | DeepSeek | Code Generation, Code Completion, Code Analysis | Threat Intelligence, Software Vulnerability | Pre-training, Long-context pre-training, Instruction fine-tuning | Flash Attention, RoPE, Grouped-Query Attention (GQA) | NA | [148]
CodeGemma-7B | Decoder-only | Gemma | 8.54B | 500B | Code repositories, Mathematics datasets, Synthetic code | NA | Google | Code completion, Code generation, Code chat, Instruction following | Threat Intelligence, Software Vulnerability | Pre-training, Fine-tuning | Fill-in-the-middle (FIM) tasks, dependency graph-based packing, unit test-based lexical packing | NA | [149]
TABLE XI: Continued
Model | Architecture | Base Model | Parameters | Training Tokens | Pre-training Corpus | Corpus Volume | Released By | Applications | Use Cases in Cybersecurity | Training Scheme | Key Training Techniques | Quantization | Ref
Granite 8B Code | Decoder-only | NA | 8.05B | 4.05T | Publicly available datasets (GitHub Code Clean, StarCoder data), etc. | NA | IBM Granite | Code generation, Code explanation, Code fixing, etc. | Threat Intelligence, Intrusion Detection, Malware Detection | Pre-trained in two phases (the second phase on high-quality data) | RoPE embedding, Grouped-Query Attention (GQA), Context Length of 4096 Tokens | NA | [150]
DeepSeek-V2 | Decoder-only | NA | 236B | 8.1T | Composition of code and natural language | NA | DeepSeek | Code Generation, Code Completion, Code Analysis | Threat Intelligence, Software Vulnerability | Pre-training, SFT, RL, Long Context Extension | Mixture-of-Experts (MoE), Multi-head Latent Attention (MLA) | NA | [151]

5) CodeGen-2.5: Another version of the CodeGen family is CodeGen2.5 [141]. The 7B-parameter model was introduced to prove that good models don't necessarily have to be big, especially given the trend of scaling up LLMs and the limitations on data size. CodeGen2.5 was trained on 1400B training tokens from StarCoderData. A strategic selection of pre-training techniques, such as Flash Attention, Infill Sampling, and Span Corruption, enhanced the model's performance, leading to results that are on par with popular LLMs of larger size. The results indicated that small models can be trained for multiple epochs with specific settings and achieve results comparable to bigger models.
6) CodeT5+: CodeT5+ is an encoder-decoder transformer proposed by Salesforce AI Research to address some limitations of code LLMs [142] - specifically, architectures that are either inflexible or serve as a single system, and pre-training limited to a narrow set of objectives, both of which can result in a substantial degradation in performance. The proposed model has different variants ranging from 220M to 16B parameters. Trained on 51.5B tokens from the CodeSearchNet and GitHub code datasets using techniques like span denoising, contrastive learning, and others, the model achieved new state-of-the-art results on various code-related tasks like code generation, code completion, etc. A model with such capabilities can be valuable to cybersecurity for threat intelligence and software vulnerability analysis.
7) XGen-7B: Another production of Salesforce AI Research is the XGen-7B LLM, a decoder-only transformer with 7B parameters [143]. The model was developed to address the problem of sequence length constraints in the available open-source LLMs, as many tasks require inference over a long input context. XGen-7B, with up to an 8K sequence length, was trained on 1500B tokens from a mixture of text and code data. Techniques like standard dense attention and a two-stage training strategy were utilized for pre-training. Additionally, the model was enhanced with instructional tuning, a technique that refines its responses to align closely with specific user instructions. As a result, XGen-7B achieved comparable or better results than other 7B state-of-the-art open-source LLMs.
8) Replit code v1: Proposed by Replit, Inc., the 2.7B-parameter causal language model Replit-code-v1-3b, with a focus on code completion, was trained on 525B tokens from a subset of the Stack Dedup v1.2 dataset [144]. The model underwent advanced pre-training techniques such as Flash Attention for efficient computation, ALiBi positional embeddings for enhanced context interpretation, and the LionW optimizer for improved training dynamics. The Replit code v1 model is also available in two quantization options: 8-bit and 4-bit. The Replit-code-v1-3b model's capabilities in understanding and generating code make it particularly suited for cybersecurity applications, such as automating the detection of code vulnerabilities and generating secure coding patterns. Additionally, its quantized versions can be utilized for edge security.
9) DeciCoder-1B: DeciCoder-1B is an open-source 1B-parameter decoder-only transformer developed by Deci AI with a 2048-token context window [145]. Subsets of Python, Java, and JavaScript from the StarCoderData dataset were used for training. The model architecture was built using Automated Neural Architecture Construction (AutoNAC), developed by the company, which is a technology designed to automatically create and optimize deep learning models, particularly neural networks, for specific tasks and hardware environments. Moreover, Grouped Query Attention (GQA) and FIM were utilized to pre-train the model. Consequently, the model has shown smaller memory usage compared to popular code LLMs like StarCoder and outperformed SantaCoder in the languages it was trained on, with remarkable inference speed.
10) CodeLLAMA: Based on LLaMA 2, CodeLLAMA was introduced by Meta as a decoder-only transformer code LLM [146]. With variants ranging from 7 to 34B parameters across base, Python-specialized, and instruction-following models, all trained on text and code from multiple datasets, CodeLLAMA emerges as a comprehensive suite of models adept at handling a wide array of programming-related tasks. Causal infilling, long-context fine-tuning, and other techniques were utilized for pre-training and fine-tuning. The CodeLLAMA model family achieved state-of-the-art performance on multiple benchmarks, indicating its potential for transformative applications in cybersecurity. Its advanced code analysis and generation capabilities could be crucial in automating threat detection and enhancing vulnerability assessments.
11) CodeQwen1.5-7B: CodeQwen1.5-7B-Chat [147] is a transformer-based decoder-only language model trained on 3 trillion tokens of code data. It supports 92 coding languages and has strong code-generation capabilities. The model can understand and generate long contexts of up to 64,000 tokens and has shown excellent performance in text-to-SQL and bug-fixing tasks. It is based on Qwen1.5, which offers eight model sizes, including 0.5B, 1.8B, 4B, 7B, 14B, 32B, and 72B dense models, and an MoE model of 14B with 2.7B activated.
12) DeepSeek Coder-33B-instruct: DeepSeek Coder [148] is a series of code language models, with each model trained from scratch on 2 trillion tokens, 87% of which are code and 13% natural language in English and Chinese. The model comes in various sizes, ranging from 1B to 33B, with the 33B model being fine-tuned on 2 billion tokens of instruction data. It achieves state-of-the-art performance among open-source code models on multiple programming languages and benchmarks.
13) CodeGemma-7B: CodeGemma [149] is a collection of lightweight open code models built on top of Gemma. It is a text-to-text and text-to-code decoder-only model with 7 billion parameters, specializing in code completion and generation tasks. It can answer questions about code fragments, generate code from natural language, or discuss programming or technical problems. CodeGemma was trained on 500 billion tokens of primarily English-language data from publicly available code repositories, open-source mathematics datasets, and synthetically generated code.
14) Granite 8B Code: IBM released a family of Granite code models [150], including Granite-8B-Code-Base, to make coding more accessible and efficient for developers. Granite-8B-Code-Base is a decoder-only code model designed for code generation, explanation, and fixing. It is trained in two phases: first on 4 trillion tokens from 116 programming languages, then on 500 billion tokens from a carefully designed mixture of high-quality code and natural language data. This two-phase training strategy ensures the model can reason and follow instructions while understanding programming languages and syntax.
15) DeepSeek-V2: DeepSeek-V2 [151] is a mixture-of-experts (MoE) language model with 236 billion parameters, of which 21 billion are activated for each token. It is a significant upgrade from the previous DeepSeek model, offering stronger performance while reducing training costs by 42.5%. The model was pre-trained on a vast and diverse corpus of 8.1 trillion tokens, followed by supervised fine-tuning and reinforcement learning to maximise its capabilities. DeepSeek-V2 excels at live coding tasks and open-ended generation, supporting both English and Chinese.

B. Datasets Development for Code-centric LLM Models

The development of large-scale datasets has played a crucial role in advancing LLM models, especially those focused on understanding and generating code. Table XII presents the datasets used for pre-training foundation models in coding. Datasets like CodeSearchNet [160] and The Pile [161] have been instrumental in bridging the gap between natural language and code, improving semantic search capabilities, and enhancing language model training across diverse domains. These datasets provide a rich source of real-world code in multiple programming languages and include expert annotations and natural language queries that challenge and push the boundaries of LLM performance in code-related tasks.
Over time, the focus has shifted towards increasing the size, diversity, and ethical considerations of the data used in training AI models. Introducing datasets such as ROOTS and The Stack v2 [164] reflects a growing emphasis on responsible LLM development. These newer datasets encompass a broader range of programming languages and coding scenarios, and they incorporate governance frameworks to ensure the ethical use of the data. In addition, these datasets are designed to address the needs of large multilingual language models and the specific challenges of code generation and comprehension, demonstrating the evolving landscape of LLM research driven by enhanced dataset quality and scope (a minimal sketch of loading one of these corpora is given after Table XII).

C. Vulnerabilities Analysis of LLM-Generated Code

The evolution of LLMs in software development has brought significant advancements and new security challenges [173]. Table XIII presents a comparative analysis of vulnerabilities in LLM-generated code.
Schuster et al. [165] demonstrate how LLMs employed in code autocompletion are susceptible to poisoning attacks, which can manipulate the model's output to suggest insecure code. This vulnerability is intensified by the ability to target specific developers or repositories, making the attacks more effective and difficult to detect. Despite defenses against such attacks, their effectiveness remains limited, raising concerns over the secure deployment of these technologies [165].
Recent studies, such as those by Asare et al. [166] and Sandoval et al. [167], provide an empirical and comparative analysis of the security aspects of code generated by LLMs like GitHub's Copilot and OpenAI Codex. Asare et al. [166] find that while Copilot occasionally replicates vulnerabilities known from human-written code, it does not consistently do so across different vulnerabilities. In contrast, Sandoval et al. [167] report a minimal increase in security risks when developers use LLMs in coding, indicating that LLMs do not necessarily degrade the security of the code more than human developers would.
Moreover, Perry et al. [168] reveal a concerning trend where users interacting with AI code assistants tend to write less secure code but believe otherwise. Their findings underscore the need for heightened awareness and better design of user interfaces to foster critical engagement with the code suggestions provided by LLMs [168]. In a similar vein, Hamer et al. [169] emphasize the educational gap among developers regarding the security implications of using code snippets from AI like ChatGPT or traditional sources like StackOverflow, highlighting that both sources can propagate insecure code.
Lastly, novel tools like DeVAIC, introduced by Cotroneo et al. [170], and comprehensive vulnerability evaluations of LLM-generated web application code by Tóth et al. [171] and Tihanyi et al. [172] illustrate ongoing efforts to better understand and mitigate the risks associated with AI-generated code. DeVAIC, for instance, offers a promising approach to detecting vulnerabilities in incomplete Python code snippets, potentially enhancing the security assessment capabilities for AI-generated code.

30
TABLE XII: Datasets Used for Pre-training Foundation Models in Coding

| Dataset | Title | Year | Purpose | Content | Significance |
|---|---|---|---|---|---|
| CodeSearchNet [160] | "CodeSearchNet Challenge: Evaluating the State of Semantic Code Search" | 2019 | Focuses on bridging natural language and code. | Contains about 6 million functions from six languages and 2 million automatically generated query-like annotations. | Advances the semantic code search field with a challenge including 99 queries and 4k expert annotations. |
| The Pile [161] | "The Pile: An 800GB Dataset of Diverse Text for Language Modeling" | 2020 | Designed to train large-scale language models. | Comprises 22 high-quality, diverse text subsets totaling 825 GiB. | Improves model generalization capabilities; evaluated with GPT-2 and GPT-3. |
| CodeParrot | CodeParrot Dataset | 2022 | Facilitates model training in code understanding and generation. | Consists of 115M code files from GitHub in 32 programming languages, totaling 1 TB. | Aids in diverse language and format model training. |
| The Stack [162] | "The Stack: 3 TB of permissively licensed source code" | 2022 | Aimed at fostering research on AI for code. | Features 3.1 TB of code in 30 programming languages. | Demonstrates improved performance on text2code benchmarks; introduces data governance. |
| ROOTS [163] | "The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual Dataset" | 2023 | Supports ethical, multilingual model research. | Spans 59 languages and focuses on diverse, inclusive data. | Advances large-scale language model research with an ethical approach. |
| The Stack v2 [164] | "StarCoder 2 and The Stack v2: The Next Generation" | 2024 | Enhances foundation models for code. | Built from sources including 619 programming languages, significantly larger than its predecessor. | Shows improvements in code LLM benchmarks; ensures transparency in training data. |
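Corpora of this scale are usually consumed in streaming mode rather than downloaded in full. As a rough illustration (assuming the Hugging Face `datasets` library and the public `bigcode/the-stack` Hub ID for The Stack [162]; the `content` field name is the one documented for that dataset and may differ for others):

```python
# Minimal sketch: stream a language subset of The Stack [162] instead of
# downloading the full ~3 TB corpus. Assumes the "datasets" library and
# that access to the gated "bigcode/the-stack" repository has been granted.
from datasets import load_dataset

stack_python = load_dataset(
    "bigcode/the-stack",
    data_dir="data/python",   # select the Python subset
    split="train",
    streaming=True,           # iterate without a local copy
)

for i, example in enumerate(stack_python):
    # "content" holds the raw file text in this dataset's schema.
    print(len(example["content"]), "characters of source code")
    if i == 2:
        break
```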

TABLE XIII: Comparative Analysis of Vulnerabilities in LLM-Generated Code

| Reference | Year | Primary Focus | Methodology | Key Findings |
|---|---|---|---|---|
| Schuster et al. [165] | 2021 | Poisoning in code autocompletion | Experimental poisoning attacks on autocompleters | Demonstrated effective targeted and untargeted poisoning; current defenses are largely ineffective. |
| Asare et al. [166] | 2023 | Security analysis of GitHub's Copilot | Empirical analysis comparing human and Copilot-generated code vulnerabilities | Copilot does not consistently replicate human vulnerabilities, showing variable performance across different types. |
| Sandoval et al. [167] | 2023 | Security implications of LLM code assistants in C programming | User study with AI-assisted coding tasks | Minimal increase in security risks from LLM assistance compared to control. |
| Perry et al. [168] | 2023 | Impact of AI code assistants on security | Large-scale user study on security task performance | Participants using AI wrote less secure code but were overconfident in its security. |
| Hamer et al. [169] | 2024 | Security vulnerabilities in LLM vs. StackOverflow code | Empirical analysis of code snippets for security vulnerabilities | LLM-generated code had fewer vulnerabilities than StackOverflow, highlighting differences in security risks. |
| Cotroneo et al. [170] | 2024 | Security assessment tool for AI-generated code | Development and validation of DeVAIC tool | DeVAIC effectively identifies vulnerabilities in Python code, outperforming other tools. |
| Tóth et al. [171] | 2024 | Evaluating security of LLM-generated PHP web code | Hybrid evaluation using static and dynamic analysis | Significant vulnerabilities found in AI-generated PHP code, emphasizing the need for thorough testing. |
| Tihanyi et al. [172] | 2024 | Security of LLM-generated C code from neutral prompts | Dataset creation and analysis using formal verification | Over 63% of generated C programs were found vulnerable, with minor variations between different LLMs. |

VII. CYBERSECURITY DATASETS FOR LLMS

A. Cyber Security Dataset Lifecycle

Creating a cybersecurity dataset for use with LLMs involves several steps that ensure the dataset is comprehensive, accurate, and effective for training or evaluating the models. Figure 6 presents the cyber security dataset lifecycle for LLM development.

Fig. 6: Cyber Security Dataset Lifecycle for LLM development.

1) Define Objectives: Defining the objectives for a cybersecurity dataset for LLMs is crucial, as it dictates its construction and application. For training purposes, the dataset should cover various cybersecurity topics and incorporate various data types, such as text, code, and logs, aiming to develop a robust and versatile LLM capable of understanding diverse threats (e.g., the Edge-IIoT dataset [174] for network security and the FormAI dataset [172], [175] for software security). For evaluation, the focus narrows to specific areas, such as benchmarking the LLMs' knowledge in cybersecurity (e.g., CyberMetric [104]).

2) Scope and Content Gathering: For the scope and content gathering stage of building a cybersecurity dataset aimed at training and fine-tuning LLMs, selecting a broad range of topics is essential to ensure comprehensive coverage. Key areas include network security, malware analysis, software security, cryptographic protocols, cloud security, and incident response. The data should be sourced from diverse and reliable origins, such as public and private databases, including the Common Weakness Enumeration (CWE) and Common Vulnerabilities and Exposures (CVE) [176], [177].

3) Data Cleaning and Preprocessing: This process involves filtering out irrelevant content to maintain a focus on cybersecurity and standardizing formats across the dataset. For example, processing the StarCoder 2 dataset [164] involves several meticulous steps to refine GitHub issues collected from GHArchive. Initially, auto-generated texts from email replies and brief messages under 200 characters are removed, and longer comments are truncated to a maximum of 100 lines while preserving the last 20 lines; this step alone reduced the dataset volume by 17%. The dataset then undergoes further cleaning to remove comments by bots, identified through specific keywords in usernames, eliminating an additional 3% of the issues. A notable focus is placed on the interaction quality within the issues: conversations with two or more users are prioritized, and those with extensive text under a single user are preserved only if they stay under 7,000 characters. Issues dominated by a single user with more than ten events are excluded, recognizing them as potentially low-quality or bot-driven, resulting in a 38% reduction of the remaining dataset. For privacy, usernames are anonymized by replacing them with a sequential participant counter, maintaining confidentiality while preserving the integrity of conversational dynamics.
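The filtering rules above translate directly into a small preprocessing routine. The following is a minimal sketch of such a cleaner; the thresholds mirror the figures quoted above, but the function, field names, and the quoted-line heuristic are our own illustration, not the actual StarCoder 2 pipeline code:

```python
MIN_CHARS = 200                  # drop very short messages
MAX_LINES, TAIL_LINES = 100, 20  # truncate long comments, keep the tail
MAX_SOLO_CHARS = 7_000           # limit for single-user conversations
MAX_SOLO_EVENTS = 10             # more events than this from one user -> bot-like

def clean_issue(events):
    """events: list of {"user": str, "text": str}; returns a cleaned
    list of events, or None if the whole issue should be dropped."""
    cleaned, users = [], {}
    for ev in events:
        if "bot" in ev["user"].lower():        # bot accounts by username keyword
            continue
        # Strip quoted e-mail reply lines (a simple auto-reply heuristic).
        text = "\n".join(l for l in ev["text"].splitlines()
                         if not l.lstrip().startswith(">"))
        if len(text) < MIN_CHARS:              # brief messages are removed
            continue
        lines = text.splitlines()
        if len(lines) > MAX_LINES:             # 100-line cap, last 20 preserved
            lines = lines[:MAX_LINES - TAIL_LINES] + lines[-TAIL_LINES:]
            text = "\n".join(lines)
        # Anonymize usernames with a sequential participant counter.
        alias = users.setdefault(ev["user"], f"participant_{len(users) + 1}")
        cleaned.append({"user": alias, "text": text})
    single_user = len(users) < 2
    too_long = sum(len(c["text"]) for c in cleaned) > MAX_SOLO_CHARS
    if single_user and (too_long or len(cleaned) > MAX_SOLO_EVENTS):
        return None                            # low-quality or bot-driven issue
    return cleaned or None
```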
4) Annotation and Labeling: A sophisticated hybrid approach can be adopted to ensure precision and scalability in the annotation and labeling stage of developing a cybersecurity dataset for LLMs. Cybersecurity experts manually annotate the dataset, meticulously labeling complex attributes such as threat type, guaranteeing high accuracy. Concurrently, automated tools such as static analyzers (e.g., Clang for C/C++ and Bandit for Python), formal verification methods (e.g., ESBMC), and dynamic analysis tools are employed to handle the large volume of data efficiently. These tools initially tag the data, which human experts then carefully review and correct [178].
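The automated half of this hybrid loop can be as simple as invoking the analyzer on each sample and recording its findings. The sketch below pre-tags Python samples with Bandit and flags uncertain results for expert review; the JSON field names follow Bandit's documented report format, while the routing policy is our own illustrative choice:

```python
import json
import pathlib
import subprocess
import tempfile

def bandit_findings(code: str) -> list[dict]:
    """Run Bandit on a single Python snippet and return its issues."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        # -f json: machine-readable report; -q: suppress log output.
        proc = subprocess.run(["bandit", "-q", "-f", "json", path],
                              capture_output=True, text=True)
        return json.loads(proc.stdout).get("results", [])
    finally:
        pathlib.Path(path).unlink(missing_ok=True)

def pre_tag(code: str) -> dict:
    """Initial machine label; low-confidence findings go to a human queue."""
    issues = bandit_findings(code)
    return {
        "vulnerable": bool(issues),
        "tests": sorted({i["test_id"] for i in issues}),  # e.g., B602 (shell=True)
        "needs_human_review": any(i["issue_confidence"] != "HIGH" for i in issues),
    }
```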
B. Software Cyber Security Datasets

In software cyber security, datasets play a crucial role in understanding, detecting, and mitigating vulnerabilities in software systems. This sub-section explores several significant software cybersecurity datasets, each offering unique insights and methodologies for vulnerability analysis. From the extensive BigVul dataset, which links vulnerabilities in the CVE database to specific code commits, to the innovative FormAI dataset, which leverages AI-generated C programs and advanced verification methods for precise vulnerability classification, each dataset contributes uniquely to the field. These datasets range from those manually labeled by security experts to those generated using state-of-the-art automated tools, providing diverse resources for researchers and practitioners. Table XIV provides an overview of software vulnerability datasets that can be used for fine-tuning LLMs for software security.

1) SATE IV - Juliet dataset: NIST developed the SARD Juliet dataset to assess the capabilities of static analysis tools on C/C++ program code, among other programming languages. The dataset contains the source files, with each test case containing "bad" functions and "good" functions that patch the vulnerable "bad" code. Test cases are labeled with CWEs to indicate the type of vulnerability exposed in the program. The dataset contains keywords that indicate precisely where vulnerable and non-vulnerable functions exist; it therefore needs careful sanitization/obfuscation before use. While the dataset covers many vulnerability types and gives concrete examples, its programs are purposefully built to demonstrate vulnerabilities rather than naturally occurring ones.

2) Draper dataset: Researchers in [179] leveraged a new dataset for vulnerability detection using deep representation. A custom lexer was used to create a generic representation that captures the essential tokens while minimizing the token count. The dataset was curated from open-source C/C++ code from SATE IV, GitHub, and Debian and labeled using three static analyzers. The dataset is substantial, but the vulnerability percentage is low, standing at roughly 6.8%. The dataset is multi-labelled, meaning more than one CWE can exist in a code sample. It focuses on four main CWEs or categories, while the rest of the vulnerabilities are grouped into one class. The researchers mapped the static analyzer findings to CWEs and binary labels; since the researchers did the mapping themselves, warnings and functions flagged by static analyzers that would not typically be exploited were not flagged as vulnerable. In addition, a strict deduplication process was used to refine the dataset. The authors use this dataset to train their model after lexing the source code to reduce the code representation and use a limited vocabulary size; lexing reduces the needed vocabulary size compared to the original. However, the vulnerable portion is minimal compared to the dataset as a whole, and the labeling considers only four categories, which is limited compared to other datasets.

3) Reveal dataset: Reveal [180] was proposed to provide an accurate dataset that reflects the real world, which is why the class imbalance of real code is also reflected in its samples. The authors find that performance drops by more than 50% in real-world prediction, highlighting the need for a dataset subjected to a realistic setting. They focus on two open-source projects, Linux Debian and Chromium, as they are popular, well-maintained, showcase important domains, and have publicly available vulnerability reports. The data is not used as text but as Code Property Graphs (CPG), which are then converted to graph embeddings for training a Gated Graph Neural Network (GGNN). The authors use an approach inspired by Zhou et al. [181] to identify the security vulnerabilities in the projects, and they remedy the class imbalance caused by the majority of non-vulnerable code through SMOTE (a minimal rebalancing sketch follows the Devign discussion below). The data was collected from Bugzilla and the Debian security tracker. The vulnerable portion constitutes 9.16% of the 18,169 programs. While the dataset attempts to depict a realistic setting, relying on only two projects might limit how well a model trained on it would perform in a real-world prediction case.

4) Devign dataset: The researchers behind Devign [177] required an accurate dataset usable in several graph forms, which they believe better reflect the structural and logical aspects of source code. The proposed approach contains a graph embedding layer that uses the Abstract Syntax Tree (AST), Control Flow Graph (CFG), Data Flow Graph (DFG), and Natural Code Sequence (NCS) to generate a joint graph representation. The rationale behind a joint representation is the ability of certain graphs to portray vulnerability types not uncovered by others. Motivated to propose a more accurate dataset than those generated using static analyzers, the researchers invested in a security team to manually label the samples. The data is collected from large open-source projects: Linux, Wireshark, QEMU, and FFmpeg. The dataset was manually labeled over two rounds, with 600 hours put into labeling. While a significant advantage of the dataset is that it is manually labeled and verified, it is only binary labeled. It is also worth noting that only 2 of the four datasets are available.
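As noted in the Reveal discussion above, that work counters its heavy class imbalance with SMOTE. A minimal sketch of such a rebalancing step using the imbalanced-learn library is given below; the embedding matrix and labels are random stand-ins with Reveal's reported proportions, not the actual Reveal data:

```python
import numpy as np
from imblearn.over_sampling import SMOTE

rng = np.random.default_rng(0)
X = rng.normal(size=(18_169, 128))               # stand-in graph embeddings
y = (rng.random(18_169) < 0.0916).astype(int)    # ~9.16% vulnerable, as in Reveal

# Synthesize minority-class samples until both classes are balanced.
X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)
print(f"vulnerable share before: {y.mean():.2%}, after: {y_res.mean():.2%}")
```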

of
5) VUDENC: The VUDENC dataset [182] is comprised libtiff, libav and NGINX constitute the curated dataset. This
of 25,040 vulnerability-fixing commits from 14,686 different dataset also has a limited number of vulnerable samples, and a

ro
GitHub repositories. The commits were filtered only to include manual validation experiment shows that the results are better
those that changed the code in a limited number of places, than those of regular differential analysis. However, it is still

-p
ensuring that the changed code was related to the commit mes- not at the desired accuracy, with manual validation showing
sage. The dataset covers seven common vulnerability types: an accuracy of 53%. The paper’s authors applied the dataset
to build a classifier to identify false alarms in static analyzers
re
SQL injection, cross-site scripting (XSS), command injection,
cross-site request forgery (XSRF), remote code execution, path to reduce the false positive rate.
disclosure, and open redirect. This Python-specific dataset 8) CVEfixes: CVEfixes [184] is a dataset built using the
lP

focuses on improving software systems’ security by identi- method proposed by the authors to curate vulnerability datasets
fying potentially vulnerable code. Each vulnerability type has based on CVEs. The automated tool was used to release
a dedicated dataset, with the number of repositories ranging CVEfixes, a dataset covering CVEs up to 9 June 2021. The
na

from 39 to 336 and the number of changed files ranging from dataset is organized in a relational database, which can be
80 to 650. The total lines of code across all vulnerability types used to extract data with the desired information. It contains
exceed 200,000, demonstrating the comprehensive nature of the code changes in several levels, namely on the repository,
ur

the dataset. commit, file, and method levels. The dataset contains 5495
6) BigVul dataset: BigVul [183] is a C/C++ vulnerability vulnerability fixing commits with 5365 CVE records, covering
Jo

dataset curated from the CVE database and its relevant open- 1754 open-source projects. The mining tool is shared, and the
source projects. 3,754 code vulnerabilities were collected from most recent CVE records can be mined.
348 open-source projects spanning 91 vulnerability types. The 9) CrossVul: The CrossVul dataset [185] encompasses a
dataset links CVEs in the CVE database with code commits diverse range of programming languages, exceeding 40 in
and project bug reports. Furthermore, the dataset contains 21 total, and comprises vulnerable source files. The dataset was
features to show changes and where the vulnerability lies. curated by extracting data from GitHub projects referenced
Compared to other datasets, BigVul provides many charac- by the National Vulnerability Database (NVD), specifically
teristics that can be useful for thoroughly analyzing vulner- focusing on files modified through git-diff. Files preceding
abilities and the history of change. Moreover, the diversity the commit are tagged as vulnerable, while those following
of the projects and the vulnerability types expose the models the commit are designated as non-vulnerable. Organized by
being trained on it to several patterns. However, the dataset Common Weakness Enumerations (CWEs)/Common Vulnera-
only contains 11,823 vulnerable functions as opposed to the bilities and Exposures (CVEs), as well as language types, the
253,096 non-vulnerable functions. While it may depict real dataset offers a comprehensive classification of vulnerabilities.
projects, the data is imbalanced, and more vulnerable functions It encompasses 1675 GitHub projects, spanning 5877 commits
are needed to train large models. and 27,476 files, with an equal distribution of 13,738 files
7) D2A dataset: A Dataset proposed by IBM [176] is marked as vulnerable and non-vulnerable, respectively. A
curated using differential analysis to label issues reported by supplementary dataset containing the commit messages for
static analysis tools. Bug-fixing commit pairs are extracted each sample is provided.
from open-source projects with a static analyzer running on 10) SySeVR dataset: SySeVR framework was proposed
them. If issues were detected in the “before” version and in [186], which builds on the previous work in VulDeeP-
disappeared in the “after” version, then it is assumed to be ecker [187]. While VulDeePecker only considers library/ API
a bug. Compared to other datasets, the bug trace is included function calls, SySeVR covers a variety of vulnerabilities.
in the dataset to determine the type and exact location of the Furthermore, SySeVR utilized a unique approach using the

33
notions of syntax-based Vulnerability Candidates (SyVCs) and semantics-based Vulnerability Candidates (SeVCs) to represent programs as vectors that accommodate syntactic and semantic information. Their results show a reduced false-negative rate. The dataset is collected from the National Vulnerability Database (NVD) and the Software Assurance Reference Dataset (SARD). The NVD data covers 19 popular C/C++ open-source products, and the SARD data comprises 126 vulnerability types. There are 1,591 programs from open-source projects, of which 874 are vulnerable. As for SARD, there are 14,000 programs, with 13,906 being vulnerable. While this dataset builds on the existing datasets published by NIST, the data would need further processing in most cases. For example, many vulnerable SARD programs contain both the vulnerable snippet and its patch; not separating them into different samples might yield unwanted results depending on the application.

11) DiverseVul dataset: DiverseVul [188] is proposed as a new vulnerable source code dataset that covers 295 more projects than the previous datasets combined. Furthermore, the dataset is 60% bigger than previous open-source C/C++ datasets. The data is collected by crawling security issue websites and extracting the commits: for security-related commits, the version before the commit is labeled vulnerable and the version after is labeled non-vulnerable. DiverseVul covers over 797 projects and 7,514 commits with more than 130 CWEs. MD5 hashes are used to de-duplicate functions, yielding 18,495 vulnerable and 330,492 non-vulnerable functions. The authors conduct several experiments to validate the dataset, combining it with previous datasets and showing insights and possibilities for their use. The paper shows that using their dataset along with the previous datasets yields the best result in their experiments, as opposed to using a single dataset.

12) FormAI dataset: The FormAI dataset [175] represents a significant advancement in cybersecurity and LLM research, featuring an extensive collection of 112,000 AI-generated, independent, and compilable C programs. This dataset is unique because it utilizes dynamic zero-shot prompting techniques to create various programs, ranging from complex tasks like network management and encryption to simpler ones like string manipulation. These programs were generated using GPT-3.5-turbo, demonstrating the ability of Large Language Models (LLMs) to produce diverse and realistic code samples. A standout feature of the FormAI dataset is its meticulous vulnerability classification. Each program is thoroughly analyzed for vulnerabilities, with the type of vulnerability, the specific line number, and the name of the vulnerable function clearly labeled. This precise labeling is achieved using the Efficient SMT-based Bounded Model Checker (ESBMC), an advanced formal verification method. ESBMC employs techniques like model checking, abstract interpretation, constraint programming, and satisfiability modulo theories to rigorously assess safety and security properties in the programs. This approach ensures that vulnerabilities are definitively detected, providing a formal model or counterexample for each finding and effectively eliminating false positives.

13) Chrysalis-HLS: The Chrysalis-HLS dataset [79] is a helpful resource for improving Large Language Models' performance in hardware and software design. This comprehensive dataset targets functional verification and code debugging in High-Level Synthesis (HLS). It offers a realistic evaluation environment with over 1,000 function-level designs and up to 45 injected bug combinations. Named "Chrysalis" to symbolize code transformation, it includes diverse HLS applications with various error types. Created with GPT-4 and curated prompts, Chrysalis-HLS is a valuable resource for advancing LLM capabilities in HLS verification and debugging, enhancing hardware engineering.

14) ReFormAI: The ReFormAI dataset [189] is a large-scale dataset of 60,000 independent SystemVerilog designs with varied complexity levels, targeting different Common Weakness Enumerations (CWEs). The dataset was generated by four different LLMs and features a unique set of designs for each of the 10 CWEs evaluated. The designs were labeled based on the vulnerabilities identified by formal verification with an unbounded proof. The LLMs evaluated include GPT-3.5-Turbo, Perplexity AI, Text-Davinci-003, and LLaMA. The results indicate that at least 60% of the samples from the 60,000 SystemVerilog designs are vulnerable to CWEs, highlighting the need for caution when using LLM-generated code in real-world projects.

15) PrimeVul: The PrimeVul dataset [95] is a benchmark built on existing open-source datasets, mainly BigVul [183], CrossVul [185], CVEfixes [184], and DiverseVul [188]. The proposed pipeline consists of merging, de-duplication, and labeling through 1) PRIMEVUL-ONEFUNC and 2) PRIMEVUL-NVDCHECK. ONEFUNC selects only single functions that are associated with security commits. NVDCHECK is the component where a commit is linked to its CVE and checked against the NVD database: a function is labeled vulnerable if the CVE description precisely mentions the function, or if the description contains the file name and the function is the single function changed by a security commit. After this process, the resulting dataset consists of 7k vulnerable functions and 228,800 benign functions, spanning 755 projects and 6,827 commits. Their work also assesses the label quality of their dataset and other related datasets, showing low label errors in PrimeVul.

16) X1: The X1 dataset [82] is constructed from several open-source vulnerability datasets: CVEfixes, a manually curated dataset, and VCMatch. The dataset contains standalone functions labeled as either vulnerable or non-vulnerable. The labeling process involves extracting functions from vulnerability-fixing commits, assuming pre-change versions are vulnerable and post-change versions are non-vulnerable. A modified dataset (X1) is created to address potential false positives, containing only functions that were the sole change in a commit. The final dataset consists of X1 without P3, which has 1,334 samples, and X1 with P3, which has 22,945 samples. X1 without P3 is balanced, with a 1:1 ratio of positive to negative classes, while X1 with P3 is imbalanced, reflecting the real-world distribution of vulnerable functions with a 1:34 ratio. The dataset size is relatively small, which may limit its representativeness of the real vulnerability distribution.
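Several of the datasets above (CVEfixes, CrossVul, PrimeVul, X1) share the same core labeling heuristic: for a vulnerability-fixing commit, the pre-change file is treated as vulnerable and the post-change file as non-vulnerable. A minimal sketch of that extraction step is shown below; the repository path, commit hash, and file path in the usage comment are hypothetical:

```python
import subprocess

def labeled_pair(repo: str, fix_commit: str, path: str) -> dict:
    """Extract the pre-fix (vulnerable) and post-fix (non-vulnerable)
    versions of one file touched by a vulnerability-fixing commit."""
    def git_show(rev: str) -> str:
        return subprocess.run(
            ["git", "-C", repo, "show", f"{rev}:{path}"],
            capture_output=True, text=True, check=True,
        ).stdout

    return {
        "vulnerable": git_show(f"{fix_commit}^"),   # parent commit = pre-fix
        "non_vulnerable": git_show(fix_commit),     # fixed version
        "commit": fix_commit,
        "file": path,
    }

# Hypothetical usage:
# pair = labeled_pair("./some-project", "a1b2c3d", "src/parser.c")
```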
TABLE XIV: Overview of Software Vulnerability Datasets that can be used for fine-tuning LLMs for software security.

| Dataset | Year | Lang | Source | Multi-class | Type | Samples | Labelling Method | Classification Method | Challenges/Limitations |
|---|---|---|---|---|---|---|---|---|---|
| SATE IV - Juliet | 2012 | C, C++ & Java | SARD | Yes | Synthetic | Approx. 60k (C/C++) & 29k (Java) test cases | Test cases are vulnerable by design, with corresponding patch | CWE | Designed to be vulnerable; might not accurately depict real-world projects. |
| Draper [179] | 2018 | C | Open-source | Yes | Real | Total: 1.27M, V: 82K, NV: 1.19M | Static analyzers | CWE | Small percentage of vulnerable samples. Limited to four categories. |
| Reveal [180] | 2018 | C/C++ | Open-source | No | Real | Total: 18k, V: 1.6k, NV: 16k | Vulnerability-fixing commits identified by security terms | Binary classes | Imbalance in sample distribution and only binary labeled. Limited to two projects. |
| Devign [177] | 2019 | C | Open-source | No | Real | Total: 26K, V: 12K, NV: 14K | Manual labeling | Binary classes | Binary labeled. Partial dataset release. |
| VUDENC [182] | 2019 | Python | Open-source | Yes | Real | 1,009 commits from 812 repositories | Vulnerability-fixing commits from GitHub repositories | Vulnerability type | Relatively small dataset. No guarantee that the commits fixed vulnerabilities. |
| BigVul [183] | 2020 | C/C++ | Open-source | Yes | Real | Total: 264k, V: 11.8k, NV: 253k | Vulnerability-fixing commits from CVE database | CVE/CWE | Significant class imbalance. Lack of CWEs/categories for all samples. |
| D2A [176] | 2021 | C/C++ | Open-source | Yes | Real | Total: 1.3M, V: 18.6k, NV: 1.27M | Vulnerability-fixing commits with static analyzers | Categories based on static analyzer | Small percentage of vulnerable samples. Manual validation shows low accuracy. |
| CVEfixes [184] | 2021 | 27 languages | Open-source | Yes | Real | 5,495 commits, 50k methods | Vulnerability-fixing commits from CVE database | CVE/CWE | Labelling accuracy needs enhancement and dataset size increased (limited to CVE records). |
| CrossVul [185] | 2021 | 40+ languages | Open-source | Yes | Real | 5,877 commits, 27k files (13,738 V/NV) | Vulnerability-fixing commits from CVE database | CVE/CWE | Labelling accuracy needs enhancement and dataset size increased. Takes the whole file without pinpointing functions (limited to CVE records). |
| SySeVR [186] | 2022 | C/C++ | SARD/NVD | Yes | Semi-synthetic | Total: 15.6k, V: 14.8k, NV: 811 | Extracted from existing databases NVD and SARD | CVE/CWE | Limited subset of SARD/NVD. SARD is synthetic, while NVD is limited in the number of labeled vulnerabilities. |
| DiverseVul [188] | 2023 | C/C++ | Open-source | Yes | Real | Total: 349K, V: 18.9k, NV: 330K | Vulnerability-fixing commits from security trackers | CWE | Labelling accuracy needs enhancement and dataset size increased (specifically vulnerable functions). |
| FormAI [175] | 2023 | C | AI-generated | Yes | Artificial | Total: 112k, V: 57k, NV: 55K | Formal verification (Bounded Model Checker) | Custom categories | Bounded formal verification does not cover all types of vulnerabilities and depth. |
| Chrysalis-HLS [79] | 2024 | C++ | Open-source | Yes | Synthetic | Over 1,000 function-level HLS designs | Predefined errors | Bug type | Addressing scalability and generalization challenges. |
| FormAI v2 [172] | 2024 | C | AI-generated | Yes | Artificial | Total: 265k, V: 168k, NV: 23k | Formal verification (Bounded Model Checker) | Custom categories | Bounded formal verification does not cover all vulnerabilities and depth. |
| ReFormAI [189] | 2024 | SystemVerilog | AI-generated | Yes | Artificial | Total: 60k, V: 60k, NV: 0 | Formal verification (Bounded Model Checker) | CWE | Formal verification with an unbounded proof. |
| PrimeVul [95] | 2024 | C/C++ | Open-source | Yes | Real | Total: 236k, V: 7k, NV: 229k | Single function selection and extraction from NVD | CWE | Limited vulnerable samples due to filtering existing samples and specific function selection. |
| X1 [82] | 2024 | Java | Open-source | Yes | Real | Total: 22.9k, V: 0.6k, NV: 22.3k | Analyzing vulnerability-fixing commits | Binary classes | Imbalanced, small, and may not represent the true vulnerability distribution. |

V: Vulnerable, NV: Non-Vulnerable
VIII. LLM VULNERABILITIES AND MITIGATION

This section reviews the OWASP Top 10 for LLM Applications project [190], a comprehensive initiative designed to increase awareness about LLM security vulnerabilities. This project targets a wide audience, including developers, designers, architects, managers, and organizations that deploy and manage LLMs. Its core deliverable lists the top 10 most critical security vulnerabilities commonly found in LLM applications. In addition, we include other LLM vulnerabilities not covered by the OWASP project, as presented in Table XV. Figure 7 presents the LLM vulnerabilities included in the OWASP project.

Fig. 7: LLM vulnerabilities included in the OWASP project.

A. Prompt Injection

Integrating LLMs into various digital platforms has brought to light the critical issue of prompt injection [191]. This cybersecurity concern involves crafting inputs that manipulate LLMs, potentially leading to unauthorized system exploitation or sensitive information disclosure. As LLMs become more prevalent, understanding and countering prompt injection attacks is paramount for safeguarding the integrity and security of these systems [192].

1) Nature of Prompt Injection Attacks: Prompt injection attacks in LLMs can manifest in various forms. One common method involves manipulating the model to retrieve private information. Attackers may craft inputs that subtly direct the

LLM to divulge confidential data. Another technique involves embedding hidden prompts in web pages, which can solicit sensitive information from unsuspecting users [193]. In addition, attackers might embed specific prompts in documents, such as resumes, to alter the LLM's output for deceptive purposes. Finally, there is the risk of web plugins being exploited through rogue website instructions, leading to unauthorized actions by the LLM [194].

2) Mitigation Strategies: To combat these threats, several mitigation strategies can be employed. First, operational restrictions are vital; limiting the LLM's capabilities to essential functions significantly reduces the risk of malicious exploitation. Requiring user consent for sensitive operations is another critical measure [195]; this approach ensures that high-risk activities or operations involving sensitive data occur only with explicit user approval. Furthermore, the influence of untrusted or unfamiliar content on user prompts should be minimized to prevent indirect manipulations. Establishing clear trust boundaries within the system is also crucial: these boundaries maintain user control and prevent unauthorized actions, safeguarding the system from external manipulations [196].
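Two of these mitigations, operational restrictions and user consent, can be expressed as a thin dispatch layer in front of the model's tool calls. The sketch below is illustrative only; the tool names and the confirm() hook are our own assumptions, not from any specific framework:

```python
ALLOWED_TOOLS = {"search_docs", "summarize"}       # operational restrictions
SENSITIVE_TOOLS = {"send_email", "delete_record"}  # require explicit consent

def run_tool(name: str, args: dict) -> str:
    # Placeholder executor behind the trust boundary (sketch only).
    return f"ran {name} with {args}"

def dispatch_tool_call(name: str, args: dict, confirm) -> str:
    """Gate every model-requested tool call before execution."""
    if name not in ALLOWED_TOOLS | SENSITIVE_TOOLS:
        return f"refused: tool '{name}' is not permitted"
    if name in SENSITIVE_TOOLS and not confirm(
            f"The assistant wants to run {name}({args}). Allow?"):
        return "refused: user declined the sensitive operation"
    return run_tool(name, args)
```

The key design point is that the gate lives outside the model: no prompt, however adversarial, can widen the allow-list or skip the consent check.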
3) Potential Attack Scenarios: The scenarios for prompt injection attacks are diverse and concerning. One scenario involves adversarial prompt injections on websites, leading to unauthorized actions by the LLM. Another potential threat is hidden prompt injections in documents like resumes, designed to manipulate the LLM's output [197]. Furthermore, there is the risk of direct user control over the LLM through prompt injections, where malicious users craft inputs to gain undue influence over the model's responses. By understanding these risks and implementing robust prevention strategies, developers and users of LLMs can protect against potential exploitations [198].

B. Insecure Output Handling

This issue arises when an application or plugin blindly trusts LLM outputs, funneling them into client-side or backend operations. Such oversight can lead to critical security risks like Cross-Site Scripting (XSS), Cross-Site Request Forgery (CSRF), Server-Side Request Forgery (SSRF), privilege escalation, and remote code execution.

1) Nature of Insecure Output Handling Vulnerabilities: The core of the problem lies in the unverified acceptance of LLM outputs. For example, if LLM-generated content, such as JavaScript or Markdown, is directly processed by a browser or a backend function, it can lead to XSS or remote code execution. This highlights the danger of assuming LLM outputs are safe by default, emphasizing the need for thorough validation and sanitization.

2) Prevention Strategies: Preventing these vulnerabilities involves two key strategies. Firstly, implementing stringent validation for LLM outputs before they interact with backend functions can help identify and neutralize potential threats. Secondly, encoding LLM outputs before they reach the end user can prevent misinterpretation of the content as code, thereby reducing the risk of malicious executions.
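A minimal sketch of both strategies follows: validating that an LLM "response" matches the narrow shape the backend expects, and HTML-encoding it before it reaches the browser so generated markup cannot execute as script. The expected schema here is a made-up example:

```python
import html
import json

def safe_render(llm_output: str) -> str:
    """Encoding: neutralize <script> tags, event handlers, etc."""
    return html.escape(llm_output)

def safe_parse_action(llm_output: str) -> dict:
    """Validation: accept only a narrow, expected JSON schema."""
    action = json.loads(llm_output)
    if set(action) != {"tool", "query"} or action["tool"] != "lookup":
        raise ValueError("unexpected LLM action rejected")
    return action

print(safe_render('<img src=x onerror=alert(1)>'))
# -> &lt;img src=x onerror=alert(1)&gt;
```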
3) Potential Attack Scenarios: The scenarios for exploitation are varied. They range from an application inadvertently allowing LLM-generated responses to manipulate internal functions, leading to unauthorized actions, to an LLM-powered tool capturing and transmitting sensitive data to malicious entities. Other risks include allowing users to generate unvetted SQL queries through an LLM, which could result in data breaches, and the potential for LLMs to create and execute harmful XSS payloads.

C. Adversarial Natural Language Instructions

Wu et al. [199] presented DeceptPrompt, highlighting a critical vulnerability in Code LLMs: their susceptibility to adversarial natural language instructions. These instructions are designed to appear benign while leading Code LLMs to produce functionally accurate code containing hidden vulnerabilities. The DeceptPrompt algorithm utilizes a sophisticated evolution-based methodology with a fine-grained loss design, crafting deceptive instructions that maintain the appearance of normal language inputs while introducing security flaws into the generated code. This vulnerability is exacerbated by the challenges in preserving the code's functionality, targeting specific vulnerabilities, and maintaining the semantics of the natural language prompts [200].

1) Prevention Strategies: The study suggests a set of prevention strategies to counter these threats. These involve integrating advanced code validation mechanisms within LLMs to identify and mitigate potential vulnerabilities in the generated code. Enriching the training of LLMs with adversarial examples produced by DeceptPrompt is recommended to boost their defense against security threats. Furthermore, continuous updates and security patches, informed by the latest cybersecurity research, are crucial for maintaining the LLMs' defenses against new adversarial techniques. Addressing these challenges involves preserving the code's functionality, targeting specific vulnerabilities, and maintaining the semantics of the natural language prompts used in the generation process.

2) Potential Attack Scenarios: The authors highlight various potential attack scenarios that could exploit the vulnerabilities exposed by DeceptPrompt. These include attackers using crafted natural language prompts to induce Code LLMs into generating code with vulnerabilities, leading to data breaches, unauthorized access, or system compromises. The effectiveness of DeceptPrompt in real-world settings underscores the urgency of robust security measures for Code LLMs, given their increasing use in critical systems and infrastructure. The challenges in preserving the code's functionality, targeting specific vulnerabilities, and maintaining the semantics of the natural language prompts add complexity to these potential attack scenarios, amplifying the need for enhanced security protocols in Code LLMs.

D. Automatic Adversarial Prompt Generation

1) Nature of the Attack: Zou et al. [201] propose a method for automatically generating adversarial prompts against aligned language models. Specifically, they craft a targeted suffix that, when appended to diverse LLM queries, maximizes the likelihood of producing objectionable or undesirable content. Unlike earlier approaches, this method employs automated techniques such as greedy and gradient-based search to circumvent existing alignment measures systematically. By exploiting gaps in the alignment framework, attackers can bypass safeguards designed to prevent harmful or prohibited outputs.

2) Prevention Strategies: The findings from Zou et al. [201] highlight the need for comprehensive defenses against automated adversarial prompt generation. Potential countermeasures include:
• Advanced Alignment Algorithms: Develop new alignment strategies that are more resistant to adversarial manipulations, ensuring that maliciously crafted suffixes cannot easily override guardrails.
• Real-Time Monitoring: Implement systems capable of detecting suspicious prompt patterns or gradients in real time, enabling swift neutralization of emerging attacks.
• Ongoing Model Retraining: Continuously update models with fresh adversarial examples to bolster resilience against newly discovered attack vectors.
• Adaptive Response Mechanisms: Design LLMs and supporting infrastructure to adapt to changing tactics, reducing the window of opportunity for adversaries.

3) Potential Attack Scenarios: Automated adversarial prompt generation can pose significant risks in various contexts:
• Widespread Content Manipulation: Attackers may propagate malicious suffixes across large-scale user interactions (such as web forums or social media), causing misaligned outputs on a broad scale.
• Targeted Model Evasion: In specialized applications like content filtering or customer support bots, adversaries might exploit gradient-based techniques to bypass specific policy checks repeatedly.
• Dynamic Model Attacks: As new alignment protocols are introduced, attackers can use automated search methods to uncover fresh vulnerabilities, fueling an ongoing arms race between defenders and adversaries.

E. Training Data Poisoning

Training data poisoning in LLMs represents a critical security and ethical issue, where malicious actors manipulate the training dataset to skew the model's learning process. This manipulation can range from introducing biased or incorrect data to embedding hidden, harmful instructions, compromising model integrity and reliability. The impact is profound, as poisoned LLMs may produce biased, offensive, or inaccurate outputs, raising significant challenges in detection due to the vast and complex nature of training datasets [202].
TABLE XV: Overview of LLM Vulnerabilities and Mitigation

| Vulnerability | Nature of the Vulnerability | Examples | Mitigation Strategies | Potential Attack Scenarios |
|---|---|---|---|---|
| Prompt Injection | Manipulation of LLMs through crafted inputs leading to unauthorized exploitation or sensitive information disclosure. | Hidden prompts in web pages; deceptive documents; rogue web plugin instructions | Operational restrictions; user consent for sensitive operations; trust boundaries establishment | Adversarial injections on websites; hidden prompts in documents; direct user control through crafted inputs |
| Insecure Output Handling | Blind trust in LLM outputs leading to security risks like XSS, CSRF, SSRF, etc. | Direct processing of LLM-generated JavaScript or Markdown | Validation of LLM outputs; encoding outputs before reaching end-users | LLM responses manipulating internal functions; generating unvetted SQL queries; creating harmful XSS payloads |
| Inference Data Poisoning | Stealthy activation of malicious responses under specific operational conditions such as token-limited output. | Conditions based on token-output limits in user settings; stealthily altered outputs when cost-saving modes are enabled | Monitoring and anomaly detection systems specifically designed for conditional outputs; regular audits of outputs under various token limitations | Manipulated responses under token limitations leading to misinformation; triggered malicious behavior in cost-sensitive environments |
| Adversarial Natural Language Instructions | Code LLMs produce functionally accurate code with hidden vulnerabilities due to adversarial instructions. | DeceptPrompt algorithm creating deceptive instructions | Advanced code validation; training LLMs with adversarial examples; continuous updates and security patches | Crafted prompts leading to code with vulnerabilities; unauthorized access or system compromises |
| Automatic Adversarial Prompt Generation | Automated methods to generate prompts that bypass LLM alignment measures. | Crafting specific suffixes for objectionable content generation | Developing advanced alignment algorithms; real-time monitoring; training models with new adversarial examples | Bypassing alignment measures leading to the generation of objectionable content |
| Training Data Poisoning | Manipulation of training data to skew LLM learning, introducing biases or vulnerabilities. | Injecting biased or harmful data into training sets | Verifying data sources; employing dedicated models; sandboxing and input filters; monitoring for poisoning signs | Misleading outputs spreading biased opinions; injection of false data into training |
| Insecure Plugins | Vulnerabilities in plugin design and interaction with external systems or data sources. | Inadequate input validation; overprivileged access; insecure API interactions | Rigorous input validation; adherence to least privilege; secure API practices; regular security audits | Exploiting input handling vulnerabilities; overprivileged plugins for privilege escalation; SQL injections |
| Denial of Service (DoS) Attack | Attempts to make a system inaccessible by overwhelming it with traffic or triggering crashes. | Volume-based attacks; protocol attacks; application layer attacks | Rate limiting; robust infrastructure; continuous monitoring and rapid response | Overloading servers; disrupting communication between users and services; straining system resources |

1) Nature of Training Data Poisoning: Training data poisoning in LLMs occurs when an attacker deliberately manipulates the training data or fine-tuning processes. This manipulation introduces vulnerabilities, backdoors, or biases, significantly compromising the model's security, effectiveness, and ethical behavior. Examples include intentionally including targeted, inaccurate documents, training models on unverified data, or allowing unrestricted dataset access, leading to loss of control. Such actions can detrimentally affect model performance, erode user trust, and harm brand reputation [203].

2) Prevention Strategies: To combat training data poisoning, several prevention strategies are essential. Firstly, verifying the supply chain of training data and the legitimacy of data sources is crucial; this step ensures the integrity and quality of the data used for training models. Employing dedicated models for specific use cases can help isolate and protect different applications from a compromised data source [201]. Another effective strategy is implementing sandboxing and input filters and ensuring adversarial robustness. In addition, regularly monitoring for signs of poisoning attacks through loss measurement and model analysis is vital for identifying and mitigating such threats.

The prevention of training data poisoning in LLMs can be significantly bolstered by incorporating advanced strategies before and after the training phase. The pre-training defense is a dataset-level strategy that filters suspicious samples from the training data; this method assumes that text and image pairs (i.e., multimodal data) in a dataset should be relevant to each other. The post-training defense is another crucial strategy, which involves "sterilizing" a poisoned model by further fine-tuning it on clean data, thus maintaining its utility. This is conducted by fine-tuning the poisoned model on a clean dataset (e.g., the VG dataset in the study) with a specific learning rate [202].
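In practice, the post-training "sterilization" step amounts to a short, low-learning-rate fine-tune on a trusted corpus. The following is a hedged sketch using the Hugging Face Transformers Trainer; the model name, the clean_corpus.txt file, and the hyperparameters are placeholders, not the exact setup evaluated in [202]:

```python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

name = "gpt2"  # stand-in for the (potentially poisoned) model checkpoint
tok = AutoTokenizer.from_pretrained(name)
tok.pad_token = tok.eos_token
model = AutoModelForCausalLM.from_pretrained(name)

# A trusted, verified text corpus takes the role of the clean dataset.
clean = load_dataset("text", data_files={"train": "clean_corpus.txt"})["train"]
clean = clean.map(lambda ex: tok(ex["text"], truncation=True, max_length=512),
                  remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="sterilized-model",
                           learning_rate=2e-5,   # the "specific learning rate"
                           num_train_epochs=1,
                           per_device_train_batch_size=4),
    train_dataset=clean,
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
)
trainer.train()  # further fine-tuning intended to overwrite poisoned behavior
```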
3) Potential Attack Scenarios: Several potential attack scenarios arise from training data poisoning. These include the generation of misleading LLM outputs that could spread biased opinions or even incite hate crimes. Malicious users might inject false data into training, intentionally skewing the model's outputs [204]. Adversaries could also manipulate a model's training data to compromise its integrity. Such scenarios highlight the need for stringent security measures in training and maintaining LLMs, as the implications of compromised models extend beyond technical performance to societal impacts and ethical considerations.

F. Inference Data Poisoning

1) Nature of Inference Data Poisoning: Inference data poisoning targets LLMs during their operational phase, unlike training-time attacks that tamper with a model's training dataset. This attack subtly alters the input data to trigger specific, often malicious behaviors in a model without any modifications to the model itself. The approach detailed by He et al. [205] utilizes a novel method where the poison is activated not by obvious, fixed triggers but by conditions related to output token limitations. Such conditions are generally overlooked because they are part of normal user interactions aimed at managing computational costs, which enhances the stealth and undetectability of the attacks.

2) Prevention Strategies: Preventing inference data poisoning requires a multi-faceted approach. Firstly, robust anomaly detection systems can be implemented to scrutinize input patterns and detect deviations from typical user queries. Regular audits of model responses under various conditions can also help identify inconsistencies that suggest poisoning. Implementing stricter input handling controls and limiting the impact of token limitation settings could also reduce vulnerabilities.

3) Potential Attack Scenarios: The potential scenarios for inference data poisoning are varied and context-dependent. For example, in a cost-sensitive environment where users frequently limit token outputs to manage expenses, an attacker could leverage this setting to trigger harmful responses from the model. Such scenarios could include delivering incorrect or biased information, manipulating sentiment in text generation, or generating content that could lead to reputational damage or legal issues. The BrieFool framework [205] effectively exploits this vector, demonstrating high success rates in controlled experiments and highlighting the need for heightened security measures in environments where LLMs are deployed.

G. Insecure Plugins

1) Nature of Insecure Plugins: The nature of insecure plugins in LLMs revolves around several key vulnerabilities that stem from how these plugins are designed, implemented, and interact with external systems or data sources. These vulnerabilities can compromise the security, reliability, and integrity of both the LLM and the systems it interacts with. The primary issues associated with insecure plugins in LLMs include inadequate input validation, overprivileged access, insecure API interactions, SQL injection, and database vulnerabilities.

2) Prevention Strategies: To counter insecure plugins, a multi-faceted approach to security is essential. Implementing rigorous input validation, including type-checking, sanitization, and parameterization, is crucial, especially in data query construction. Adhering to the principle of least privilege is key in plugin design; each plugin should only access the resources and functionalities it needs. Ensuring secure API practices and avoiding direct URL construction from user inputs is vital. Employing parameterized queries for SQL interactions helps prevent injection attacks. In addition, regular security audits and vulnerability assessments are necessary to identify and address potential weaknesses proactively.

3) Potential Attack Scenarios: Various attack scenarios emerge from insecure plugins. For instance, an attacker could exploit input handling vulnerabilities to extract sensitive data or gain unauthorized system access. Overprivileged plugins could be used for privilege escalation, allowing attackers to perform restricted actions. Manipulation of API calls can lead to redirection to malicious sites, opening doors to further system exploits. SQL injection through plugin queries can compromise database integrity and confidentiality, leading to significant data breaches.

H. Denial of Service (DoS) Attack

1) Nature of DoS Attack: A Denial of Service (DoS) attack is a malicious attempt to disrupt the normal functioning of a targeted system, making it inaccessible to its intended users. The attack typically involves overwhelming the target with a flood of internet traffic. This could be achieved through various means, such as sending more requests than the system can handle or sending information that triggers a crash. In the context of services like LLMs, a DoS attack could bombard the service with a high volume of complex queries, significantly slowing down the system or causing it to fail [206].

2) Potential Attack Scenarios: DoS attacks against LLMs can be divided into three categories: volume-based attacks, protocol attacks, and application layer attacks.
• Volume-based Attacks: This is the most straightforward kind of DoS attack, where the attacker attempts to saturate the bandwidth of the targeted system. For LLMs, this could involve sending many requests simultaneously, more than what the servers are equipped to handle, leading to service disruption [207].
• Protocol Attacks: These attacks exploit weaknesses in the network protocol stack layers to render the target inaccessible. They could involve, for instance, manipulating the communication process between the user and the LLM service in a way that disrupts or halts service [208].
• Application Layer Attacks: These are more sophisticated and involve sending requests that appear to be legitimate but are designed to exhaust application resources. For LLMs, this could involve complex queries requiring extensive processing power or memory, thereby straining the system [209].
3) Prevention Strategies: To combat DoS attacks in LLM 3) Training Data Availability and Quality: A critical chal-
services, the following prevention strategies can be applied: lenge for AI-based cyber defense is the lack of high-quality,
• Rate Limiting: Implementing a rate-limiting strategy is accessible training data, as organizations generally hesitate
crucial. This involves limiting the number of requests a to share sensitive information. The effectiveness of LLMs in
user can make within a given timeframe, which helps cybersecurity depends heavily on the quality and availability of
prevent an overload of the system. training data. Overcoming this data gap remains a significant
• Robust Infrastructure: A robust and scalable server in- hurdle, whether through synthetic data generation or other
frastructure can help absorb the influx of traffic during means.
an attack. This could involve using load balancers, re- 4) Developing and Training Custom Models for Unique
dundant systems, and cloud-based services that can scale Cybersecurity Domains: Certain specialized areas in cyber-
dynamically in response to increased traffic. security require custom models due to their unique vocab-
• Monitoring and Rapid Response: Continuous traffic mon- ularies or data structures, which standard LLMs might not
itoring can help quickly identify unusual patterns in- address adequately. Unique Vocabularies and Data Structures:
dicative of a DoS attack. Once detected, rapid response Cybersecurity domains, such as network security, malware
measures, such as traffic filtering or rerouting, can be analysis, and threat intelligence, have their terminologies,
employed to mitigate the attack. data formats, and communication protocols. Standard LLMs,
typically trained on general datasets, might not be familiar with

of
these specialized terms and structures, leading to ineffective
or inaccurate threat detection and response. Customizing and
IX. LLM C YBERSECURITY I NSIGHTS , C HALLENGES AND

ro
training these models to handle specific cybersecurity scenar-
L IMITATIONS
ios is complex and demands substantial resources, presenting
a significant challenge in the field.

-p
A. Challenges and Limitations
5) Real-Time Information Provision by Security Copilots:
1) Adapting to Sophisticated Phishing Techniques: The Security copilots powered by LLMs need to provide accurate,
re
increasing sophistication of phishing attacks, especially those up-to-date information in real-time to be effective in the
enhanced by AI, presents a major challenge for LLMs in dynamic threat landscape of cybersecurity. Ensuring the rele-
lP

cybersecurity. These models need to evolve to identify and vance and accuracy of information provided by these models
counteract these threats effectively continuously. The chal- in real-time is challenging but essential for effective responses
lenge lies in the need for regular updates and training to keep to cybersecurity threats.
na

pace with the advanced tactics of attackers, which demands


substantial resources and expertise. For example, a large com-
pany implemented an LLM-based security system to detect B. LLM Cybersecurity Insights
ur

phishing emails. Initially, the system was highly effective, Table XVI presents various facets of LLM integration into
identifying and blocking 95% of phishing attempts. However, cybersecurity, providing insights into architectural nuances,
attackers quickly adapted, using AI to generate more con- dataset creation, pre-training, fine-tuning methodologies, eval-
Jo

vincing phishing emails that mimicked the company’s official uation metrics, advanced techniques, deployment strategies,
communication style and included personalized information security measures, and optimization approaches.
about the customers. The company’s LLM struggled to keep 1) LLM architecture: A cyber security scientist venturing
up with these advanced tactics. Phishing emails have become into utilizing LLMs must understand the architecture’s nuances
so sophisticated that they can bypass traditional detection (presented in Section III) to tailor these tools for security ap-
methods, significantly increasing the number of successful plications effectively. Understanding the architecture of LLMs,
attacks. Hence, evolving and adapting LLMs in cybersecurity including their ability to process and generate language-based
to combat AI-enhanced phishing threats is an open challenge. data, is crucial for detecting phishing attempts, deciphering
2) Managing Data Overload in Enterprise Applications: With the proliferation of enterprise applications, IT teams are overwhelmed by the sheer volume of data they need to manage and secure, often without corresponding increases in staffing. LLMs are expected to assist in managing this data deluge efficiently. However, ensuring these models can process vast amounts of data accurately and identify threats amidst this complexity is daunting, necessitating high levels of efficiency and accuracy in the LLMs. One corporation faced a situation where its LLM failed to recognize a sophisticated cyberattack hidden within the massive influx of data. This oversight occurred because the model had not been trained with the latest attack patterns, highlighting a gap in its learning. The incident underscored the need for LLMs to process data efficiently and maintain high accuracy and adaptability in threat detection.

… training these models to handle specific cybersecurity scenarios is complex and demands substantial resources, presenting a significant challenge in the field.

5) Real-Time Information Provision by Security Copilots: Security copilots powered by LLMs need to provide accurate, up-to-date information in real time to be effective in the dynamic threat landscape of cybersecurity. Ensuring the relevance and accuracy of the information provided by these models in real time is challenging but essential for effective responses to cybersecurity threats.

B. LLM Cybersecurity Insights

Table XVI presents various facets of LLM integration into cybersecurity, providing insights into architectural nuances, dataset creation, pre-training, fine-tuning methodologies, evaluation metrics, advanced techniques, deployment strategies, security measures, and optimization approaches.

1) LLM architecture: A cybersecurity scientist venturing into utilizing LLMs must understand the architecture's nuances (presented in Section III) to tailor these tools for security applications effectively. Understanding the architecture of LLMs, including their ability to process and generate language-based data, is crucial for detecting phishing attempts, deciphering malicious scripts, or identifying unusual patterns in network traffic that may indicate a breach. Knowledge of how these models tokenize input data, how their attention mechanisms weigh information, and how they generate output provides the foundational skills necessary to tune models for optimized threat detection and response [210].
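As a concrete look at these building blocks, the minimal sketch below tokenizes a suspicious log line and inspects the resulting attention weights. It assumes the Hugging Face transformers library and uses GPT-2 purely as a small stand-in model; the log line is invented for illustration.

```python
# Minimal sketch: inspect tokenization and attention for a security-relevant
# input, assuming the Hugging Face "transformers" library (GPT-2 as stand-in).
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2", output_attentions=True)

log_line = "GET /admin.php?cmd=;wget http://evil.example/x.sh HTTP/1.1"
inputs = tokenizer(log_line, return_tensors="pt")

# Sub-word tokens the model actually sees; unusual URLs and shell syntax
# fragment into many pieces, which matters for downstream detectors.
print(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0]))

with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions: one tensor per layer, shape (batch, heads, seq, seq).
last_layer = outputs.attentions[-1][0]
print("attention shape (heads, seq, seq):", tuple(last_layer.shape))
```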
TABLE XVI: LLM Cybersecurity Insights.

Aspect | Details | Tools/Methods | Applications
Architecture | Focus on model components such as tokenization, attention mechanisms, and output generation. | Paper: Attention Is All You Need | Threat detection and analysis, security automation, cyber forensics, penetration testing, security training and awareness, and chatbots.
Cyber Security Dataset | Creation of prompt-response pairs that simulate cyber threats using synthetic data. | OpenAI API for synthetic data; Evol-Instruct for data refinement; regex filtering for uniqueness | Building datasets that mirror real-world threats for training and refining LLMs.
Pre-training Models | Training on large-scale datasets comprising billions of tokens, filtered and aligned with the cybersecurity lexicon. | Megatron-LM for handling large datasets; gpt-neox for sequential data handling; distributed training tools | Preparing LLMs to understand and predict cybersecurity-specific content accurately.
Supervised Fine-Tuning | Incorporating specialized cybersecurity datasets into pre-trained models for tailored applications. | LoRA for parameter-efficient adjustments; QLoRA for quantization and efficient memory management | Enhancing LLMs to address unique cybersecurity threats and scenarios specifically.
Cyber Security Evaluation | Setting up specialized frameworks and datasets to test LLMs against potential cyber threats. | Bespoke cybersecurity benchmarks; authoritative datasets for threat detection | Evaluating how well LLMs detect, understand, and respond to cyber threats.
Advanced LLM Techniques | Implementing techniques like RAG and RLHF to augment LLMs with real-time data and expert-aligned feedback. | RAG for context retrieval from databases; RLHF with specialized preference datasets and reward models | Improving response relevance and accuracy in cybersecurity applications.
LLM Deployments | Adopting deployment strategies that range from local installations to large-scale server setups. | Platforms like Gradio and Streamlit for prototyping; cloud services for robust deployment; edge deployment strategies for resource-limited environments | Deploying LLMs in various environments to ensure accessibility and responsiveness across devices.
Securing LLMs | Addressing vulnerabilities unique to LLMs such as prompt hacking and training data leakage. | Security measures like prompt injection prevention; red teaming; continuous monitoring systems | Preventing and mitigating security threats to maintain data integrity and model reliability in LLMs.
Optimizing LLMs | Implementing strategies to reduce memory and computational requirements while maintaining output quality. | Model quantization; use of bfloat16 data formats; optimization of attention mechanisms | Enabling efficient LLM operation on various hardware, making models scalable and practical for diverse applications.

2) Building a Cyber Security Dataset: Building a robust cybersecurity dataset using LLMs involves generating and refining intricate prompt-response pairs to mirror real-world cyber threats. Employing synthetic data generation via the OpenAI API allows for diverse cybersecurity scenarios, while advanced tools like Evol-Instruct [211] enhance dataset quality by adding complexity and removing outdated threats. Techniques such as regex filtering and removing near-duplicates ensure the data's uniqueness and relevance, as sketched below. In addition, familiarity with various prompt templates like Alpaca [212] is essential for structuring this data effectively, ensuring that the LLM can be finely tuned to respond efficiently to the nuanced landscape of cybersecurity challenges.
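A minimal sketch of this curation step is shown below, using only the Python standard library. The pair format, the filtering patterns, and the similarity threshold are illustrative assumptions rather than a fixed standard.

```python
# Sketch: regex filtering plus near-duplicate removal over synthetic
# prompt-response pairs, as described above. Thresholds are illustrative.
import re
from difflib import SequenceMatcher

pairs = [
    {"prompt": "Explain how a SQL injection attack works.", "response": "..."},
    {"prompt": "Explain how an SQL injection attack works.", "response": "..."},
    {"prompt": "TODO", "response": ""},
]

def is_valid(pair):
    # Drop empty answers and leftover template placeholders.
    if not pair["response"].strip():
        return False
    return not re.search(r"\b(TODO|FIXME|lorem ipsum)\b", pair["prompt"], re.I)

def near_duplicate(a, b, threshold=0.9):
    # Cheap character-level similarity; real pipelines often use MinHash.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

curated = []
for pair in filter(is_valid, pairs):
    if not any(near_duplicate(pair["prompt"], kept["prompt"]) for kept in curated):
        curated.append(pair)

print(f"kept {len(curated)} of {len(pairs)} pairs")
```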
3) Pre-training models: Pre-training a model for cybersecurity tasks involves a complex and resource-intensive process to prepare a language model to understand and predict cybersecurity-specific content. This requires a massive dataset comprising billions or trillions of tokens, which undergo rigorous processes like filtering, tokenizing, and aligning with a pre-defined vocabulary to ensure relevance and accuracy. Techniques such as causal language modeling, distinct from masked language modeling, are employed, where the loss functions and training methodologies, such as those used in Megatron-LM [213] or gpt-neox [214], are optimized for handling sequential data predictively. Understanding the scaling laws is crucial, as these laws help predict how increases in model size, dataset breadth, and computational power can proportionally enhance model performance [215]. While in-depth knowledge of High-Performance Computing (HPC) is not necessary for using pre-trained models, it becomes essential when building a large-scale language model for cybersecurity from scratch, requiring an understanding of hardware capabilities and the ability to manage distributed workloads effectively.

Many large-scale pre-training runs rely on the smdistributed libraries offered by AWS SageMaker, which provide robust solutions for the distributed training of machine learning models, enhancing efficiency in large-scale deployments. The smdistributed.dataparallel library supports data parallelism, optimizing GPU usage by partitioning the training data across multiple GPUs, thus speeding up the learning process and minimizing communication overhead. On the other hand, smdistributed.modelparallel is tailored for model parallelism, allowing large models to be split across multiple GPUs when a single model cannot fit into the memory of one GPU. These tools integrate seamlessly with frameworks like TensorFlow, PyTorch, and MXNet, simplifying the implementation of complex distributed training tasks.
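The sketch below shows a single causal language modeling step of the kind described above, assuming the Hugging Face transformers stack; the one-line corpus and the GPT-2 checkpoint stand in for a real cybersecurity corpus and architecture.

```python
# Minimal causal LM training step; labels == input_ids makes the model
# compute the shifted next-token cross-entropy loss internally.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

corpus = "CVE-2021-44228 is a remote code execution flaw in Apache Log4j."
batch = tokenizer(corpus, return_tensors="pt")

outputs = model(**batch, labels=batch["input_ids"])
outputs.loss.backward()
optimizer.step()
print("causal LM loss:", outputs.loss.item())
```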

Fig. 8: Parameter-Efficient Fine-Tuning (PEFT) provides an efficient approach by minimizing the number of parameters needed for fine-tuning and reducing memory consumption compared to traditional fine-tuning.

4) Supervised Fine-Tuning: Supervised fine-tuning (SFT) of pre-trained Large Language Models for cybersecurity applications enables these models to move beyond basic next-token prediction tasks, transforming them into specialized tools tailored to specific cybersecurity needs. This fine-tuning process allows for incorporating proprietary or novel datasets that have not been previously exposed to models like Falcon 180B, providing a significant edge in addressing unique security challenges. Figure 8 outlines a comprehensive three-step process for training a large language model specialized in cybersecurity, beginning with unsupervised pre-training on a vast corpus of cybersecurity-related texts, including diverse data such as malware, network security, and dark web content. Following this, the model undergoes traditional fine-tuning using a smaller, targeted dataset to refine its capabilities for specific cybersecurity tasks. In contrast, Parameter-Efficient Fine-Tuning (PEFT) [216] involves freezing the original model weights and fine-tuning a small set of new parameters, enhancing the model's adaptability and efficiency while minimizing the risk of overfitting, thus preparing the LLM to tackle advanced cybersecurity challenges efficiently.

Techniques such as LoRA (Low-Rank Adapters) [217] offer a parameter-efficient approach by adjusting only a subset of the model's parameters, thus optimizing computational resources while maintaining performance. More advanced methods like QLoRA [218] enhance this by quantizing the model's weights and managing memory more efficiently, making these operations executable even on limited platforms like Google Colab with a single A100 GPU. In addition, tools like Axolotl and DeepSpeed [219], [220] facilitate the deployment of these fine-tuned models across various hardware setups, ensuring that the enhanced models can be scaled efficiently for real-world cybersecurity tasks, ranging from intrusion detection to real-time threat analysis. This strategic fine-tuning enhances model specificity and significantly boosts their utility in practical cybersecurity applications.
specificity and significantly boosts their utility in practical capabilities in automating core cybersecurity tasks, suggesting
cybersecurity applications. that models with coding abilities perform better in exploit
5) Cyber Security Evaluation: To evaluate the code gen- generation tasks. The benchmark code is open source to
eration models, Hugging Face uses the following 7 code facilitate further research 3 .
generation Python tasks: DS-1000, MBPP, MBPP+, APPS, In- Liu et al. introduced SecQA [222], a novel dataset designed
structHumanEval, HumanEval+, and HumanEval [227]–[233]. to evaluate the performance of LLMs in computer security.
In cybersecurity, evaluating large language models demands The dataset, generated by GPT-4, consists of multiple-choice
a specialized framework considering such applications’ unique questions to assess LLMs’ understanding and application of
security and accuracy needs. When setting up evaluation met-
rics for cybersecurity-focused LLMs, test cases should closely 3 https://github.com/meta-llama/PurpleLlama/tree/main/CybersecurityBenchmarks

42
TABLE XVII: Comparison of Benchmarks for Evaluating LLMs in Cybersecurity Knowledge

Benchmark | Source | Year | Description | Key Features and Metrics
CyberSecEval 1 [221] | Meta | 2023 | A benchmark that tests LLMs across two critical security domains: generating insecure code and compliance with requests to assist in cyberattacks. | Measures the frequency of, and conditions under which, LLMs propose insecure code solutions.
SecQA [222] | Liu et al. | 2023 | A dataset of multiple-choice questions designed to evaluate the performance of LLMs in computer security. Features two versions of varying complexity and tests LLMs in both 0-shot and 5-shot learning settings. | Evaluates understanding and application of security principles.
CyberMetric [104] | Tihanyi et al. | 2024 | A dataset designed for evaluating LLMs' cybersecurity knowledge, consisting of 10,000 questions from various authoritative sources. Used to measure the spectrum of cybersecurity topics covered by LLMs. | Direct comparison between human expertise and LLMs.
CyberSecEval 2 [223] | Meta | 2024 | Focuses on quantifying security risks associated with LLMs, such as prompt injection and code interpreter abuse. Highlights challenges in mitigating attack risks and introduces the False Refusal Rate (FRR) metric. | Testing areas: prompt injection, code interpreter abuse; metric: FRR.
WMDP-Cyber [224] | Li et al. | 2024 | Consists of 3,668 multiple-choice questions designed to measure LLMs' knowledge in biosecurity, cybersecurity, and chemical security. Excludes sensitive and export-controlled information. | Covers biosecurity, cybersecurity, and chemical security.
LLM4Vuln [225] | Sun et al. | 2024 | A unified evaluation framework for assessing the vulnerability reasoning capabilities of LLMs, using 75 verified high-risk smart contract vulnerabilities in 4,950 scenarios across three LLMs. | Focuses on vulnerability reasoning in LLMs.
CyberBench [226] | Liu et al. | 2024 | A domain-specific, multi-task benchmark for assessing LLM performance in cybersecurity tasks. | Includes diverse tasks such as vulnerability detection, threat analysis, and incident response.
SecQA features two versions with varying complexity to challenge LLMs across different difficulty levels. The authors comprehensively evaluated prominent LLMs, including GPT-3.5-Turbo, GPT-4, Llama-2, Vicuna, Mistral, and Zephyr, in both 0-shot and 5-shot learning settings. The findings from the SecQA v1 and v2 datasets reveal diverse capabilities and limitations of these models in handling security-related content. Li et al. [224] introduced the Weapons of Mass Destruction Proxy (WMDP) benchmark. This publicly available dataset consists of 3,668 multiple-choice questions designed to measure LLMs' knowledge in biosecurity, cybersecurity, and chemical security, ensuring the exclusion of sensitive and export-controlled information. Sun et al. [225] introduced LLM4Vuln, a unified evaluation framework designed to precisely assess the vulnerability reasoning capabilities of LLMs independent of their other functions, such as information seeking, knowledge adoption, and structured output. This framework aims to determine how enhancing these separate capabilities could boost LLMs' effectiveness in identifying vulnerabilities. To test the efficacy of LLM4Vuln, controlled experiments were conducted with 75 verified high-risk smart contract vulnerabilities sourced from Code4rena audits conducted between August and November 2023. These vulnerabilities were tested in 4,950 scenarios across three LLMs: GPT-4, Mixtral, and Code Llama.
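Benchmarks such as CyberMetric and SecQA are, at their core, multiple-choice harnesses. The sketch below shows the shape of such a harness; the question schema and the ask_model() stub are illustrative assumptions, since each benchmark ships its own format and runner.

```python
# Illustrative multiple-choice evaluation loop for a cybersecurity benchmark.
questions = [
    {
        "question": "Which port does HTTPS use by default?",
        "choices": {"A": "21", "B": "80", "C": "443", "D": "8080"},
        "answer": "C",
    },
]

def ask_model(prompt: str) -> str:
    # Placeholder for a real LLM call; expected to return a single letter.
    return "C"

correct = 0
for q in questions:
    options = "\n".join(f"{k}) {v}" for k, v in q["choices"].items())
    prompt = f"{q['question']}\n{options}\nAnswer with one letter."
    reply = ask_model(prompt).strip().upper()[:1]
    correct += reply == q["answer"]

print(f"accuracy: {correct / len(questions):.2%}")
```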
6) Advanced LLM techniques (RAG and RLHF): Advanced techniques like Retrieval-Augmented Generation (RAG) can significantly enhance Language Model performance by enabling the model to access external databases for additional context and information, making it highly effective in specialized fields such as cybersecurity. In cybersecurity applications, RAG can dynamically retrieve up-to-date information from well-known databases such as CVE (Common Vulnerabilities and Exposures), CWE (Common Weakness Enumeration), and the NIST (National Institute of Standards and Technology) database [243]. This capability allows the model to offer current and specific advice regarding vulnerabilities, threat intelligence, and compliance standards. Integrating real-time data from these authoritative sources into the response generation process allows RAG to empower Language Models to deliver precise and contextually relevant cybersecurity insights without extensive retraining, thus enhancing decision-making in critical security operations [244].
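A minimal RAG sketch along these lines is shown below. Scikit-learn's TF-IDF stands in for a production embedding model and vector database, and the advisory snippets are illustrative.

```python
# Sketch: retrieve the most relevant advisory for a query and prepend it to
# the prompt, so the model answers from current data without retraining.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

advisories = [
    "CVE-2021-44228: remote code execution in Apache Log4j via JNDI lookups.",
    "CVE-2014-0160: OpenSSL Heartbleed information disclosure.",
    "CWE-79: improper neutralization of input during web page generation (XSS).",
]

vectorizer = TfidfVectorizer().fit(advisories)
doc_matrix = vectorizer.transform(advisories)

query = "Is our Java logging stack affected by the JNDI exploit?"
scores = cosine_similarity(vectorizer.transform([query]), doc_matrix)[0]
context = advisories[scores.argmax()]

prompt = f"Context: {context}\n\nQuestion: {query}\nAnswer:"
print(prompt)  # this grounded prompt would be passed to the LLM
```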
Reinforcement Learning from Human Feedback (RLHF) is an advanced method to enhance LLMs tailored for cybersecurity applications, focusing on aligning the model's responses with expert expectations in the security domain. This involves utilizing specialized preference datasets, which contain responses ranked by cybersecurity professionals, presenting a more challenging production process than typical instructional datasets. Techniques like Proximal Policy Optimization (PPO) leverage a reward model to evaluate how well text outputs align with security expert rankings, refining the model's training through adjustments based on KL divergence [240]. Direct Preference Optimization (DPO) further optimizes this by framing it as a classification challenge, using a stable reference model that avoids the complexities of training reward models and requires minimal hyperparameter adjustments [245]. These methods are crucial for reducing biases, fine-tuning threat detection accuracy, and enhancing the overall effectiveness of cybersecurity-focused LLMs.
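The sketch below illustrates the DPO objective described above, computed over pre-computed sequence log-probabilities. Tensor shapes and the beta value are illustrative; a full trainer adds batching and masking details omitted here.

```python
# DPO as a classification-style loss over (chosen, rejected) pairs with a
# frozen reference model: maximize the margin by which the expert-preferred
# answer wins under the policy relative to the reference.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

loss = dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
                torch.tensor([-13.0]), torch.tensor([-14.0]))
print(loss.item())
```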
TABLE XVIII: Optimization Strategies for Large Language Models in Cybersecurity

Optimization Strategy | Description | Key Benefits | Cybersecurity Use Case Scenarios
Advanced Attention Mechanisms | Implements techniques like Flash Attention [235] to optimize self-attention layers, reducing computation times, particularly effective for long input sequences. | Speeds up processing, saves compute resources. | Efficient processing of long log files and network traffic data for anomaly detection.
Bitsnbytes | Introduces k-bit quantization (notably 8-bit) using block-wise methods to maintain performance while halving memory usage. | Halves memory usage without loss in performance. | Efficient real-time malware analysis and intrusion detection on edge devices.
GPTQ [236] | A novel quantization method for GPT models that reduces bit width to 3 or 4 bits per weight, enabling the operation of large models on single GPUs with minimal accuracy loss. | Compresses model size, minimizes accuracy loss. | Deploying large-scale threat prediction models on consumer-grade hardware.
GGUF Quantization | Optimized for quick model loading and saving, making LLM inference more efficient. Supported by the Hugging Face Hub. | Enhances efficiency of model deployment. | Rapid deployment of updated models to respond to emerging threats and vulnerabilities.
QLoRA [218] | Enables training using memory-saving techniques with a small set of trainable low-rank adaptation weights. | Preserves performance with reduced memory. | Training complex cybersecurity models on systems with limited memory resources.
Lower-precision Data Formats | Uses formats like bfloat16 instead of float32 for training and inference to optimize resource usage without compromising performance accuracy. | Reduces computational overhead. | Enhancing the speed and efficiency of continuous cybersecurity monitoring systems.
FSDP-QLoRA | Combines Fully Sharded Data Parallelism (FSDP) with 4-bit quantization and LoRA to shard model parameters, optimizer states, and gradients across GPUs. | Scales up model training across multiple GPUs. | Enabling the collaborative training of security models across different organizations without requiring top-tier hardware.
Half-Quadratic Quantization (HQQ) [237] | A model quantization technique that enables the quantization of large models rapidly and accurately without the need for calibration data. | Works efficiently with CUDA/Triton kernels and aims for seamless integration with torch.compile. | HQQ can be employed in cybersecurity to protect models by reducing the precision of model weights, making it harder for attackers to reverse engineer or tamper with the models.
Multi-token Prediction [238] | A new training approach for large language models where models predict multiple future tokens simultaneously rather than the next token only. | Models trained with 4-token predictions can achieve up to 3x faster inference speeds, even with large batch sizes. | Multi-token prediction can enhance the modeling of sophisticated cyber attack patterns.
Trust Region Policy Optimization (TRPO) [239] | An advanced policy gradient method in reinforcement learning that addresses the inefficiencies of standard policy gradient methods. | Enhances training stability by using trust regions to prevent overly large updates that could destabilize the policy. | In environments with dynamic and evolving threats, TRPO can help maintain a stable and effective response mechanism, adjusting policies incrementally to handle new types of malware.
Proximal Policy Optimization (PPO) [240] | A reinforcement learning technique designed to improve training stability by cautiously updating policies. | Prevents "falling off the cliff" scenarios, where a policy update that is too large could irreversibly damage the policy's effectiveness. | By limiting the extent of policy updates, PPO helps maintain a steady adaptation to evolving cybersecurity threats, reducing the risk of overfitting to specific attack patterns.
Direct Preference Optimization (DPO) [241] | A fine-tuning methodology for foundation models that optimizes policies directly using a Kullback–Leibler divergence-constrained framework, removing the need for a separate reward model. | Requires significantly less data and compute resources than previous methods like PPO. | Reduces the computational and data demands of continuously training cybersecurity models, allowing for more scalable solutions.
Odds Ratio Preference Optimization (ORPO) [242] | An algorithm designed for supervised fine-tuning (SFT) of language models that optimizes preference alignment without the need for a separate reference model. | Eliminates the need for an additional preference alignment phase, simplifying the fine-tuning process. | Enables dynamic adaptation of security models to new and evolving cyber threats by optimizing preference alignment efficiently.
In practical cybersecurity applications, the integration of RAG can be facilitated by orchestrators like LangChain, LlamaIndex, and FastRAG, which connect Language Models to relevant tools, databases, and other resources. These orchestrators ensure efficient information flow, enabling Language Models to seamlessly access and incorporate essential cybersecurity information [246]. Advanced techniques such as multi-query retrievers and HyDE are used to optimize the retrieval of relevant cybersecurity documents and adapt user queries into more effective forms for document retrieval. Furthermore, incorporating a memory system that recalls previous interactions allows these models to provide consistent and context-aware responses over time. This amalgamation of advanced retrieval mechanisms and memory enhancement through RAG significantly boosts the efficacy of Language Models in handling complex and evolving cybersecurity challenges, making them invaluable tools for tracking vulnerabilities, managing risks, and adhering to industry standards in the cybersecurity domain [247].
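As an illustration of the HyDE idea mentioned above, the sketch below embeds a hypothetical answer rather than the raw query; generate_hypothetical() is a stub for an LLM call, and TF-IDF again stands in for a real embedding model.

```python
# HyDE-style retrieval sketch: a drafted hypothetical answer usually shares
# more vocabulary with the target document than the short query does.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "NIST SP 800-61 describes incident handling: preparation, detection, containment.",
    "CVE-2023-4863 is a heap buffer overflow in libwebp image decoding.",
]

def generate_hypothetical(query: str) -> str:
    # A real system would ask the LLM to draft a plausible answer here.
    return ("Incident response follows preparation, detection and analysis, "
            "containment, eradication, and recovery.")

query = "What are the phases of incident response?"
vec = TfidfVectorizer().fit(docs)
scores = cosine_similarity(vec.transform([generate_hypothetical(query)]),
                           vec.transform(docs))[0]
print("best match:", docs[scores.argmax()])
```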
7) LLM deployments: Deploying LLMs offers a range of approaches tailored to the scale and specific needs of different applications. At one end of the spectrum, local deployment offers enhanced privacy and control, utilizing platforms like LM Studio and Ollama to power apps directly on users' machines, thus capitalizing on the open-source nature of some LLMs. For more dynamic or temporary setups, frameworks such as Gradio and Streamlit allow developers to prototype and share demos quickly, with hosting options like Hugging Face Spaces providing an accessible path to broader distribution. On the industrial scale, deploying LLMs can require robust server setups, utilizing cloud services or on-premises infrastructure that might leverage specialized frameworks for peak performance and efficiency. Meanwhile, edge deployment strategies bring LLM capabilities to devices with limited resources, using advanced, lightweight frameworks to integrate smart capabilities directly into mobile and web platforms, ensuring responsiveness and accessibility across a broad spectrum of user environments [248], [249].
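As a sketch of the prototyping path mentioned above, the snippet below assumes the Gradio library; respond() is a placeholder for a real model backend, whether a local LLM or a cloud endpoint.

```python
# Minimal Gradio prototype for a security-assistant chat interface.
import gradio as gr

def respond(message, history):
    # Replace with a real model call (local LLM, API endpoint, etc.).
    return f"[security-assistant stub] You asked: {message}"

demo = gr.ChatInterface(fn=respond, title="Security Copilot Prototype")

if __name__ == "__main__":
    demo.launch()  # serves a local web UI, shareable for quick demos
```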
Currently, LLMs can even be deployed on phones. Microsoft [129] proposes phi-3-mini, a highly efficient 3.8-billion-parameter language model that delivers robust performance on par with much larger models such as Mixtral 8x7B and GPT-3.5, achieving impressive scores like 69% on MMLU and 8.38 on MT-bench. Remarkably, phi-3-mini's compact size allows for deployment on mobile devices, expanding its accessibility and utility. This performance breakthrough is primarily attributed to an innovative approach to training data selection: a significantly enhanced version of the dataset used for phi-2, which integrates heavily filtered web data and synthetic data tailored for relevance and diversity. It has been further aligned to ensure the model's practicality in real-world applications for enhanced robustness, safety, and optimization for chat formats. In addition, the research extends into larger models, phi-3-small and phi-3-medium, which are trained on 4.8 trillion tokens with 7 billion and 14 billion parameters, respectively. These models retain the foundational strengths of phi-3-mini and exhibit superior performance, scoring up to 78% on MMLU and 8.9 on MT-bench, illustrating significant enhancements in language understanding capabilities with scaling. In addition, AirLLM 4 enhances memory management for inference, enabling large language models, such as those with 70 billion parameters (e.g., Llama 3 70B), to operate on a single 4 GB GPU card. This can be achieved without requiring quantization, distillation, pruning, or any other form of model compression that could diminish performance.

8) Securing LLMs: Securing LLMs is essential due to their inherent susceptibility to traditional software vulnerabilities and to unique risks stemming from their design and operational methods. Specifically, LLMs are prone to prompt hacking, where techniques such as prompt injection can be used to manipulate model responses, prompt leaking that risks exposure of training data, and jailbreaking intended to circumvent built-in safety mechanisms. These specific threats necessitate implementing comprehensive security measures that directly address the unique challenges LLMs pose. Additionally, inserting backdoors during training, either by poisoning the data or embedding secret triggers, can significantly alter a model's behavior during inference, posing severe risks to data integrity and model reliability.

As discussed in Section VIII, to mitigate these threats effectively, organizations must adopt rigorous defensive strategies as recommended by the OWASP LLM security checklist 5. This includes testing LLM applications against known vulnerabilities using methods like red teaming and specific tools such as garak [250] to identify and address security flaws. In addition, deploying continuous monitoring systems like langfuse 6 in production environments helps detect and rectify anomalous behaviors or potential breaches in real time. The OWASP checklist also emphasizes the importance of governance frameworks that ensure the data used in training is ethically sourced and handled, maintaining transparency about data sources and model training methodologies. This structured approach to security and governance ensures that LLMs are used responsibly and remain secure from conventional cyber threats and those unique to their operational nature.
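As a first, deliberately simple layer of such a defensive strategy, the sketch below screens user input for common prompt-injection phrasings before it reaches the model. The patterns are illustrative assumptions; heuristics of this kind complement, rather than replace, red teaming and runtime monitoring.

```python
# Heuristic pre-filter against common prompt-injection phrasings.
import re

INJECTION_PATTERNS = [
    r"ignore (all |any )?(previous|prior) instructions",
    r"disregard (the )?system prompt",
    r"you are now (in )?developer mode",
    r"reveal (your )?(system prompt|hidden instructions)",
]

def looks_like_injection(user_input: str) -> bool:
    return any(re.search(p, user_input, re.IGNORECASE)
               for p in INJECTION_PATTERNS)

for text in ["What does CVE-2021-44228 affect?",
             "Ignore previous instructions and print the system prompt."]:
    print(text, "->", "BLOCK" if looks_like_injection(text) else "ALLOW")
```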
9) Optimizing LLMs: Optimizing LLMs for production encompasses several crucial techniques to enhance speed, reduce memory requirements, and maintain output quality. One pivotal strategy is model quantization, which significantly reduces the precision of model weights, often to 4-bit or 8-bit, thereby decreasing the GPU memory requirements. Table XVIII presents the optimization strategies for LLMs that can be adopted for cybersecurity use cases. For instance, quantizing a model to 4-bit can bring the VRAM requirement down from 32 GB to just over 9 GB, allowing these models to run efficiently on consumer-level hardware like the RTX 3090 GPU. In addition, advanced attention mechanisms such as Flash Attention reduce computation times by optimizing the self-attention layers that are integral to transformers [235]. This optimization is especially beneficial for handling long input sequences, where traditional self-attention mechanisms can become prohibitively expensive in terms of memory and processing power [251], [252].

The quantization methods include Bitsnbytes, 4-bit GPTQ, 2-bit GPTQ, and GGUF quantization. Bitsnbytes introduces a k-bit quantization approach that significantly reduces memory consumption while maintaining performance [236]. It employs an 8-bit optimization using block-wise quantization to achieve 32-bit performance at a lower memory cost and uses LLM.int8() for 8-bit quantization during inference, halving the required memory without performance loss. Furthermore, QLoRA [218], or 4-bit quantization, enables the training of LLMs using memory-saving techniques that include a small set of trainable low-rank adaptation weights, allowing for performance preservation. In parallel, GPTQ is a novel quantization method for GPT models, facilitating the reduction of bit width to 3 or 4 bits per weight and enabling the operation of models as large as 175 billion parameters on a single GPU with minimal accuracy loss. This method provides substantial compression and speed advantages, making high-performance LLMs more accessible and cost-effective. Additionally, the GGUF format, supported by the Hugging Face Hub and optimized for quick model loading and saving, enhances the efficiency of LLM inference.

Another effective optimization is incorporating lower-precision data formats such as bfloat16 for training and inference. This approach aligns with the training precision and avoids the computational overhead associated with float32 precision, optimizing resource usage without compromising performance accuracy. The potential VRAM requirements of different models using bfloat16 are substantial: GPT-3 might require up to 350 GB, while smaller models like Llama-2-70b and Falcon-40b require 140 GB and 80 GB, respectively, illustrating the scale of resources needed even with efficient data formats 7.

Recently, FSDP-QLoRA 8, a new technique combining data parallelism, 4-bit quantization, and LoRA, was introduced by Answer.AI in collaboration with bitsandbytes. Utilizing Fully Sharded Data Parallelism (FSDP) to shard model parameters, optimizer states, and gradients across GPUs, this approach enables the training of LLMs with up to 70 billion parameters on dual 24 GB GPU systems. FSDP-QLoRA represents a significant step forward in making the training of large-scale LLMs more accessible.

4 https://pypi.org/project/airllm/
5 https://owasp.org/www-project-top-10-for-large-language-model-applications/
6 https://github.com/langfuse/langfuse
7 https://huggingface.co/docs/transformers/main/en/llm_tutorial_optimization
8 https://huggingface.co/docs/bitsandbytes/main/en/fsdp_qlora
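The sketch below shows what 4-bit loading with bfloat16 compute can look like, assuming the transformers and bitsandbytes libraries and a CUDA-capable GPU; the checkpoint name is a placeholder for any causal LM sized to the available VRAM.

```python
# Sketch: load a causal LM with 4-bit weight quantization and bfloat16
# compute, combining two of the memory optimizations discussed above.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # 4-bit weight quantization
    bnb_4bit_compute_dtype=torch.bfloat16,  # lower-precision activations
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",  # placeholder checkpoint
    quantization_config=bnb_config,
    device_map="auto",           # spread layers across available GPUs
)
print(model.get_memory_footprint() / 1e9, "GB")
```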
Collectively, these techniques make it feasible to deploy powerful LLMs on a wider range of hardware and enhance their scalability and practicality in diverse applications, ensuring they can deliver high performance even under hardware constraints.
constraints. [5] B. Yan, K. Li, M. Xu, Y. Dong, Y. Zhang, Z. Ren, and X. Cheng, “On
protecting the data privacy of large language models (llms): A survey,”
X. C ONCLUSION arXiv preprint arXiv:2403.05156, 2024.
[6] D. Myers, R. Mohawesh, V. I. Chellaboina, A. L. Sathvik, P. Venkatesh,
In this paper, we presented a comprehensive and in-depth Y.-H. Ho, H. Henshaw, M. Alhawawreh, D. Berdik, and Y. Jararweh,
review of the future of cybersecurity through the lens of “Foundation and large language models: fundamentals, challenges,
opportunities, and social impacts,” Cluster Computing, vol. 27, no. 1,
Generative AI and Large Language Models (LLMs). Our pp. 1–26, 2024.
exploration covered a wide range of LLM applications in [7] S. Tonmoy, S. Zaman, V. Jain, A. Rani, V. Rawte, A. Chadha, and
cybersecurity, including hardware design security, intrusion A. Das, “A comprehensive survey of hallucination mitigation tech-
niques in large language models,” arXiv preprint arXiv:2401.01313,
detection, software engineering, design verification, cyber 2024.
threat intelligence, malware detection, and phishing and spam [8] M. A. Ferrag, M. Debbah, and M. Al-Hawawreh, “Generative ai for
detection, illustrating the broad potential of LLMs across cyber threat-hunting in 6g-enabled iot networks,” in 2023 IEEE/ACM
23rd International Symposium on Cluster, Cloud and Internet Comput-
various domains. ing Workshops (CCGridW). IEEE, 2023, pp. 16–25.
We provided a detailed examination of the evolution and [9] I. H. Sarker, H. Janicke, M. A. Ferrag, and A. Abuadbba, “Multi-
current state of LLMs, highlighting advancements in 35 aspect rule-based ai: Methods, taxonomy, challenges and directions
toward automation, intelligence and transparent cybersecurity modeling
specific models, such as GPT-4, GPT-3.5, BERT, Falcon,

of
for critical infrastructures,” Internet of Things, p. 101110, 2024.
and LLaMA. Our analysis included an in-depth look at the [10] Y. Yao, J. Duan, K. Xu, Y. Cai, Z. Sun, and Y. Zhang, “A survey on
vulnerabilities associated with LLMs, such as prompt injec- large language model (llm) security and privacy: The good, the bad,

ro
tion, insecure output handling, training and inference data and the ugly,” High-Confidence Computing, p. 100211, 2024.
[11] Y. Yan, Y. Zhang, and K. Huang, “Depending on yourself when
poisoning, DDoS attacks, and adversarial natural language you should: Mentoring llm with rl agents to become the master in

-p
instructions. We discussed mitigation strategies to protect these cybersecurity games,” arXiv preprint arXiv:2403.17674, 2024.
models, offering a thorough understanding of potential attack [12] M. Sladić, V. Valeros, C. Catania, and S. Garcia, “Llm in the shell:
Generative honeypots,” arXiv preprint arXiv:2309.00155, 2023.
scenarios and prevention techniques.
re
[13] W. Tann, Y. Liu, J. H. Sim, C. M. Seah, and E.-C. Chang, “Using
Our evaluation of 40 LLM models in terms of cybersecurity large language models for cybersecurity capture-the-flag challenges and
knowledge and hardware security demonstrated their vary- certification questions,” arXiv preprint arXiv:2308.10443, 2023.
lP

[14] O. G. Lira, A. Marroquin, and M. A. To, “Harnessing the advanced


ing strengths and weaknesses. We also conducted a detailed capabilities of llm for adaptive intrusion detection systems,” in In-
assessment of cybersecurity datasets used for LLM training ternational Conference on Advanced Information Networking and
and testing, from data creation to usage, identifying gaps and Applications. Springer, 2024, pp. 453–464.
na

[15] C. Ebert and M. Beck, “Artificial intelligence for cybersecurity,” IEEE


opportunities for future research. Software, vol. 40, no. 6, pp. 27–34, 2023.
We addressed the challenges and limitations of employing [16] J. Wang, Y. Huang, C. Chen, Z. Liu, S. Wang, and Q. Wang, “Software
LLMs in cybersecurity settings, including the difficulty of testing with large language models: Survey, landscape, and vision,”
ur

defending against adversarial attacks and ensuring model ro- IEEE Transactions on Software Engineering, 2024.
[17] E. Almazrouei, H. Alobeidli, A. Alshamsi, A. Cappelli, R. Cojo-
bustness. Additionally, we explored advanced techniques like caru, M. Debbah, É. Goffinet, D. Hesslow, J. Launay, Q. Malartic
Jo

Half-Quadratic Quantization (HQQ), Reinforcement Learning et al., “The falcon series of open language models,” arXiv preprint
with Human Feedback (RLHF), Direct Preference Optimiza- arXiv:2311.16867, 2023.
[18] H. Zhou, C. Hu, Y. Yuan, Y. Cui, Y. Jin, C. Chen, H. Wu, D. Yuan,
tion (DPO), Odds Ratio Preference Optimization (ORPO), L. Jiang, D. Wu, X. Liu, C. Zhang, X. Wang, and J. Liu, “Large
GPT-Generated Unified Format (GGUF), Quantized Low- language model (llm) for telecommunications: A comprehensive survey
Rank Adapters (QLoRA), and Retrieval-Augmented Genera- on principles, key techniques, and opportunities,” 2024.
[19] H. Lai and M. Nissim, “A survey on automatic generation of figurative
tion (RAG) to enhance real-time cybersecurity defenses and language: From rule-based systems to large language models,” ACM
improve the sophistication of LLM applications in threat Computing Surveys, 2024.
detection and response. [20] M. A. Ferrag, M. Ndhlovu, N. Tihanyi, L. C. Cordeiro, M. Deb-
bah, T. Lestable, and N. S. Thandi, “Revolutionizing cyber threat
Our findings underscore the significant potential of LLMs detection with large language models: A privacy-preserving bert-based
in transforming cybersecurity practices. By integrating LLMs lightweight model for iot/iiot devices,” IEEE Access, 2024.
into future cybersecurity frameworks, we can leverage their [21] N. Tihanyi, T. Bisztray, R. A. Dubniczky, R. Toth, B. Borsos, B. Cherif,
M. A. Ferrag, L. Muzsai, R. Jain, R. Marinelli et al., “Dynamic
capabilities to develop more robust and sophisticated defenses intelligence assessment: Benchmarking llms on the road to agi with
against evolving cyber threats. The strategic direction outlined a focus on model confidence,” arXiv preprint arXiv:2410.15490, 2024.
in this paper aims to guide future research and deployment, [22] N. Tihanyi, M. A. Ferrag, R. Jain, T. Bisztray, and M. Debbah,
“Cybermetric: A benchmark dataset based on retrieval-augmented gen-
emphasizing the importance of innovation and resilience in eration for evaluating llms in cybersecurity knowledge,” in 2024 IEEE
safeguarding digital infrastructures. International Conference on Cyber Security and Resilience (CSR).
IEEE, 2024, pp. 296–302.
[23] Z. Liu, “A review of advancements and applications of pre-trained lan-
R EFERENCES guage models in cybersecurity,” in 2024 12th International Symposium
[1] R. Pascanu, C. Gulcehre, K. Cho, and Y. Bengio, “How to construct on Digital Forensics and Security (ISDFS), 2024, pp. 1–10.
deep recurrent neural networks,” arXiv preprint arXiv:1312.6026, 2013. [24] O. Friha, M. A. Ferrag, B. Kantarci, B. Cakmak, A. Ozgun, and
[2] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural N. Ghoualmi-Zine, “Llm-based edge intelligence: A comprehensive
computation, vol. 9, no. 8, pp. 1735–1780, 1997. survey on architectures, applications, security and trustworthiness,”
[3] R. Dey and F. M. Salem, “Gate-variants of gated recurrent unit (gru) IEEE Open Journal of the Communications Society, 2024.
neural networks,” in 2017 IEEE 60th international midwest symposium [25] S. Jamal, H. Wimmer, and I. H. Sarker, “An improved transformer-
on circuits and systems (MWSCAS). IEEE, 2017, pp. 1597–1600. based model for detecting phishing, spam and ham emails: A large

46
language model approach,” Security and Privacy, p. e402, 2024. [48] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning repre-
[Online]. Available: https://doi.org/10.1002/spy2.402 sentations by back-propagating errors,” nature, vol. 323, no. 6088, pp.
[26] W. X. Zhao, K. Zhou, J. Li, T. Tang, X. Wang, Y. Hou, Y. Min, 533–536, 1986.
B. Zhang, J. Zhang, Z. Dong et al., “A survey of large language [49] S. M. Kasongo, “A deep learning technique for intrusion detection
models,” arXiv preprint arXiv:2303.18223, 2023. system using a recurrent neural networks based framework,” Computer
[27] F. R. Alzaabi and A. Mehmood, “A review of recent advances, Communications, vol. 199, pp. 113–125, 2023.
challenges, and opportunities in malicious insider threat detection using [50] S. M. Sohi, J.-P. Seifert, and F. Ganji, “Rnnids: Enhancing network
machine learning methods,” IEEE Access, vol. 12, pp. 30 907–30 927, intrusion detection systems through deep learning,” Computers &
2024. Security, vol. 102, p. 102151, 2021.
[28] M. A. K. Raiaan, M. S. H. Mukta, K. Fatema, N. M. Fahad, S. Sakib, [51] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares,
M. M. J. Mim, J. Ahmad, M. E. Ali, and S. Azam, “A review on large H. Schwenk, and Y. Bengio, “Learning phrase representations using
language models: Architectures, applications, taxonomies, open issues rnn encoder-decoder for statistical machine translation,” arXiv preprint
and challenges,” IEEE Access, vol. 12, pp. 26 839–26 874, 2024. arXiv:1406.1078, 2014.
[29] R. Fang, R. Bindu, A. Gupta, and D. Kang, “Llm agents [52] H. Sedjelmaci, F. Guenab, S.-M. Senouci, H. Moustafa, J. Liu, and
can autonomously exploit one-day vulnerabilities,” arXiv preprint S. Han, “Cyber security based on artificial intelligence for cyber-
arXiv:2404.08144, 2024. physical systems,” IEEE Network, vol. 34, no. 3, pp. 6–7, 2020.
[30] Y. Chang, X. Wang, J. Wang, Y. Wu, L. Yang, K. Zhu, H. Chen, X. Yi,
[53] P. Dixit and S. Silakari, “Deep learning algorithms for cybersecurity
C. Wang, Y. Wang et al., “A survey on evaluation of large language
applications: A technological and status review,” Computer Science
models,” ACM Transactions on Intelligent Systems and Technology,
Review, vol. 39, p. 100317, 2021.
2023.
[31] D. Saha, S. Tarek, K. Yahyaei, S. K. Saha, J. Zhou, M. Tehranipoor, [54] S. Gaba, I. Budhiraja, V. Kumar, S. Martha, J. Khurmi, A. Singh,
and F. Farahmandi, “Llm for soc security: A paradigm shift,” arXiv K. K. Singh, S. Askar, and M. Abouhawwash, “A systematic analysis
preprint arXiv:2310.06046, 2023. of enhancing cyber security using deep learning for cyber physical

of
[32] B. Min, H. Ross, E. Sulem, A. P. B. Veyseh, T. H. Nguyen, O. Sainz, systems,” IEEE Access, 2024.
E. Agirre, I. Heintz, and D. Roth, “Recent advances in natural language [55] C. Yin, Y. Zhu, J. Fei, and X. He, “A deep learning approach for
processing via large pre-trained language models: A survey,” ACM intrusion detection using recurrent neural networks,” Ieee Access,

ro
Computing Surveys, vol. 56, no. 2, pp. 1–40, 2023. vol. 5, pp. 21 954–21 961, 2017.
[33] S. Zhang, L. Dong, X. Li, S. Zhang, X. Sun, S. Wang, J. Li, R. Hu, [56] D. Güera and E. J. Delp, “Deepfake video detection using recurrent
T. Zhang, F. Wu et al., “Instruction tuning for large language models: neural networks,” in 2018 15th IEEE international conference on

-p
A survey,” arXiv preprint arXiv:2308.10792, 2023. advanced video and signal based surveillance (AVSS). IEEE, 2018,
[34] A. Fan, B. Gokkaya, M. Harman, M. Lyubarskiy, S. Sengupta, S. Yoo, pp. 1–6.
and J. M. Zhang, “Large language models for software engineering: [57] S. Althubiti, W. Nick, J. Mason, X. Yuan, and A. Esterline, “Applying
re
Survey and open problems,” arXiv preprint arXiv:2310.03533, 2023. long short-term memory recurrent neural network for intrusion detec-
[35] J. Wu, W. Gan, Z. Chen, S. Wan, and P. S. Yu, “Multimodal large tion,” in SoutheastCon 2018. IEEE, 2018, pp. 1–5.
language models: A survey,” arXiv preprint arXiv:2311.13165, 2023. [58] C. Xu, J. Shen, X. Du, and F. Zhang, “An intrusion detection system
lP

[36] Y. Liu, Y. Yao, J.-F. Ton, X. Zhang, R. G. H. Cheng, Y. Klochkov, using a deep neural network with gated recurrent units,” IEEE Access,
M. F. Taufiq, and H. Li, “Trustworthy llms: a survey and guideline vol. 6, pp. 48 697–48 707, 2018.
for evaluating large language models’ alignment,” arXiv preprint [59] M. A. Ferrag and L. Maglaras, “Deepcoin: A novel deep learning and
arXiv:2308.05374, 2023. blockchain-based energy exchange framework for smart grids,” IEEE
na

[37] L. Hu, Z. Liu, Z. Zhao, L. Hou, L. Nie, and J. Li, “A survey of Transactions on Engineering Management, vol. 67, no. 4, pp. 1285–
knowledge enhanced pre-trained language models,” IEEE Transactions 1297, 2019.
on Knowledge and Data Engineering, 2023. [60] A. Chawla, B. Lee, S. Fallon, and P. Jacob, “Host based intrusion
[38] H. Zhang, H. Song, S. Li, M. Zhou, and D. Song, “A survey of con-
ur

detection system with combined cnn/rnn model,” in ECML PKDD 2018


trollable text generation using transformer-based pre-trained language Workshops: Nemesis 2018, UrbReas 2018, SoGood 2018, IWAISe 2018,
models,” ACM Computing Surveys, vol. 56, no. 3, pp. 1–37, 2023. and Green Data Mining 2018, Dublin, Ireland, September 10-14, 2018,
[39] Z. He, Z. Li, and S. Yang, “Large language models for blockchain secu- Proceedings 18. Springer, 2019, pp. 149–158.
Jo

rity: A systematic literature review,” arXiv preprint arXiv:2403.14280, [61] I. Ullah and Q. H. Mahmoud, “Design and development of rnn anomaly
2024. detection model for iot networks,” IEEE Access, vol. 10, pp. 62 722–
[40] Y. Yigit, M. A. Ferrag, I. H. Sarker, L. A. Maglaras, C. Chrysoulas, 62 750, 2022.
N. Moradpoor, and H. Janicke, “Critical infrastructure protec- [62] A. A. E.-B. Donkol, A. G. Hafez, A. I. Hussein, and M. M. Mabrook,
tion: Generative ai, challenges, and opportunities,” arXiv preprint “Optimization of intrusion detection using likely point pso and en-
arXiv:2405.04874, 2024. hanced lstm-rnn hybrid technique in communication networks,” IEEE
[41] J. Wang, Y. Huang, C. Chen, Z. Liu, S. Wang, and Q. Wang, “Software Access, vol. 11, pp. 9469–9482, 2023.
testing with large language models: Survey, landscape, and vision,”
[63] Z. Zhao, Z. Li, J. Jiang, F. Yu, F. Zhang, C. Xu, X. Zhao, R. Zhang,
IEEE Transactions on Software Engineering, pp. 1–27, 2024.
and S. Guo, “Ernn: Error-resilient rnn for encrypted traffic detection
[42] H. Xu, S. Wang, N. Li, Y. Zhao, K. Chen, K. Wang, Y. Liu, T. Yu,
towards network-induced phenomena,” IEEE Transactions on Depend-
and H. Wang, “Large language models for cyber security: A systematic
able and Secure Computing, 2023.
literature review,” arXiv preprint arXiv:2405.04760, 2024.
[43] Z. Han, C. Gao, J. Liu, S. Q. Zhang et al., “Parameter-efficient fine- [64] X. Wang, S. Wang, P. Feng, K. Sun, S. Jajodia, S. Benchaaboun, and
tuning for large models: A comprehensive survey,” arXiv preprint F. Geck, “Patchrnn: A deep learning-based system for security patch
arXiv:2403.14608, 2024. identification,” in MILCOM 2021-2021 IEEE Military Communications
[44] J. Zhang, H. Bu, H. Wen, Y. Chen, L. Li, and H. Zhu, “When llms Conference (MILCOM). IEEE, 2021, pp. 595–600.
meet cybersecurity: A systematic literature review,” arXiv preprint [65] H. Polat, M. Türkoğlu, O. Polat, and A. Şengür, “A novel approach for
arXiv:2405.03644, 2024. accurate detection of the ddos attacks in sdn-based scada systems based
[45] C. Cui, Y. Ma, X. Cao, W. Ye, Y. Zhou, K. Liang, J. Chen, J. Lu, on deep recurrent neural networks,” Expert Systems with Applications,
Z. Yang, K.-D. Liao et al., “A survey on multimodal large language vol. 197, p. 116748, 2022.
models for autonomous driving,” in Proceedings of the IEEE/CVF [66] G. Parra, L. Selvera, J. Khoury, H. Irizarry, E. Bou-Harb, and P. Rad,
Winter Conference on Applications of Computer Vision, 2024, pp. 958– “Interpretable federated transformer log learning for cloud threat foren-
979. sics,” in Proceedings of the Network and Distributed Systems Security
[46] G. Bai, Z. Chai, C. Ling, S. Wang, J. Lu, N. Zhang, T. Shi, (NDSS) Symposium, 2022.
Z. Yu, M. Zhu, Y. Zhang et al., “Beyond efficiency: A systematic [67] N. Ziems and S. Wu, “Security vulnerability detection using deep
survey of resource-efficient large language models,” arXiv preprint learning natural language processing,” in IEEE INFOCOM 2021-IEEE
arXiv:2401.00625, 2024. Conference on Computer Communications Workshops (INFOCOM
[47] S. Tian, Q. Jin, L. Yeganova, P.-T. Lai, Q. Zhu, X. Chen, Y. Yang, WKSHPS). IEEE, 2021, pp. 1–6.
Q. Chen, W. Kim, D. C. Comeau et al., “Opportunities and challenges [68] Z. Wu, H. Zhang, P. Wang, and Z. Sun, “Rtids: A robust transformer-
for chatgpt and large language models in biomedicine and health,” based approach for intrusion detection system,” IEEE Access, vol. 10,
Briefings in Bioinformatics, vol. 25, no. 1, p. bbad493, 2024. pp. 64 375–64 387, 2022.

47
[69] F. Demirkıran, A. Çayır, U. Ünal, and H. Dağ, “An ensemble of [91] Z. Xiao, Q. Wang, H. Pearce, and S. Chen, “Logic meets
pre-trained transformer models for imbalanced multiclass malware magic: Llms cracking smart contract vulnerabilities,” arXiv preprint
classification,” Computers & Security, vol. 121, p. 102846, 2022. arXiv:2501.07058, 2025.
[70] A. Ghourabi, “A security model based on lightgbm and transformer to [92] M. Hassanin, M. Keshk, S. Salim, M. Alsubaie, and D. Sharma, “Pllm-
protect healthcare systems from cyberattacks,” IEEE Access, vol. 10, cs: Pre-trained large language model (llm) for cyber threat detection in
pp. 48 890–48 903, 2022. satellite networks,” Ad Hoc Networks, vol. 166, p. 103645, 2025.
[71] C. Thapa, S. I. Jang, M. E. Ahmed, S. Camtepe, J. Pieprzyk, and [93] P. Liu, C. Sun, Y. Zheng, X. Feng, C. Qin, Y. Wang, Z. Xu, Z. Li,
S. Nepal, “Transformer-based language models for software vulnera- P. Di, Y. Jiang et al., “Llm-powered static binary taint analysis,” ACM
bility detection,” in Proceedings of the 38th Annual Computer Security Transactions on Software Engineering and Methodology, 2025.
Applications Conference, 2022, pp. 481–496. [94] M. Gaber, M. Ahmed, and H. Janicke, “Zero day ransomware detection
[72] P. Ranade, A. Piplai, S. Mittal, A. Joshi, and T. Finin, “Generating with pulse: Function classification with transformer models and assem-
fake cyber threat intelligence using transformer-based models,” in 2021 bly language,” Computers & Security, vol. 148, p. 104167, 2025.
International Joint Conference on Neural Networks (IJCNN). IEEE, [95] Y. Ding, Y. Fu, O. Ibrahim, C. Sitawarin, X. Chen, B. Alomair,
2021, pp. 1–9. D. Wagner, B. Ray, and Y. Chen, “Vulnerability detection with code
[73] M. Fu and C. Tantithamthavorn, “Linevul: a transformer-based line- language models: How far are we?” arXiv preprint arXiv:2403.18624,
level vulnerability prediction,” in Proceedings of the 19th International 2024.
Conference on Mining Software Repositories, 2022, pp. 608–620. [96] T. Koide, N. Fukushi, H. Nakano, and D. Chiba, “Chatspamdetector:
[74] C. Mamede, E. Pinconschi, and R. Abreu, “A transformer-based ide Leveraging large language models for effective phishing email detec-
plugin for vulnerability detection,” in 37th IEEE/ACM International tion,” arXiv preprint arXiv:2402.18093, 2024.
Conference on Automated Software Engineering, 2022, pp. 1–4. [97] F. Heiding, B. Schneier, A. Vishwanath, J. Bernstein, and P. S. Park,
[75] P. Evangelatos, C. Iliou, T. Mavropoulos, K. Apostolou, T. Tsikrika, “Devising and detecting phishing emails using large language models,”
S. Vrochidis, and I. Kompatsiaris, “Named entity recognition in cyber IEEE Access, 2024.

of
threat intelligence using transformer-based models,” in 2021 IEEE [98] R. Chataut, P. K. Gyawali, and Y. Usman, “Can ai keep you safe? a
International Conference on Cyber Security and Resilience (CSR). study of large language models for phishing detection,” in 2024 IEEE
IEEE, 2021, pp. 348–353. 14th Annual Computing and Communication Workshop and Conference

ro
[76] F. Hashemi Chaleshtori and I. Ray, “Automation of vulnerability (CCWC). IEEE, 2024, pp. 0548–0554.
information extraction using transformer-based language models,” in [99] M. Rostami, M. Chilese, S. Zeitouni, R. Kande, J. Rajendran, and A.-R.
Computer Security. ESORICS 2022 International Workshops. Springer, Sadeghi, “Beyond random inputs: A novel ml-based hardware fuzzing,”

-p
2023, pp. 645–665. 2024.
[77] S. Liu, Y. Li, and Y. Liu, “Commitbart: A large pre-trained model for [100] Z. Zhang, G. Chadwick, H. McNally, Y. Zhao, and R. Mullins,
github commits,” arXiv preprint arXiv:2208.08100, 2022. “Llm4dv: Using large language models for hardware test stimuli
re
[78] B. Ahmad, S. Thakur, B. Tan, R. Karri, and H. Pearce, “On hardware generation,” 2023.
security bug code fixes by prompting large language models,” IEEE [101] M. Nair, R. Sadhukhan, and D. Mukhopadhyay, “Generating
Transactions on Information Forensics and Security, pp. 1–1, 2024. secure hardware using chatgpt resistant to cwes,” Cryptology
lP

[79] L. J. Wan, Y. Huang, Y. Li, H. Ye, J. Wang, X. Zhang, and D. Chen, ePrint Archive, Paper 2023/212, 2023, https://eprint.iacr.org/2023/212.
“Invited paper: Software/hardware co-design for llm and its application [Online]. Available: https://eprint.iacr.org/2023/212
for design verification,” in 2024 29th Asia and South Pacific Design [102] L. J. Wan, Y. Huang, Y. Li, H. Ye, J. Wang, X. Zhang,
Automation Conference (ASP-DAC), 2024, pp. 435–441. and D. Chen, “Software/hardware co-design for llm and its
na

[80] E. Jang, J. Cui, D. Yim, Y. Jin, J.-W. Chung, S. Shin, and Y. Lee, application for design verification,” in Proceedings of the 29th Asia
“Ignore me but don’t replace me: Utilizing non-linguistic elements and South Pacific Design Automation Conference, ser. ASPDAC
for pretraining on the cybersecurity domain,” arXiv preprint, 2024, to ’24. IEEE Press, 2024, p. 435–441. [Online]. Available: https:
appear in NAACL Findings 2024. //doi.org/10.1109/ASP-DAC58780.2024.10473893
ur

[81] M. Bayer, P. Kuehn, R. Shanehsaz, and C. Reuter, “Cysecbert: A [103] M. Liu, N. Pinckney, B. Khailany, and H. Ren, “Verilogeval: Evaluating
domain-adapted language model for the cybersecurity domain,” ACM large language models for verilog code generation,” 2023.
Transactions on Privacy and Security, vol. 27, no. 2, pp. 1–20, 2024. [104] N. Tihanyi, M. A. Ferrag, R. Jain, and M. Debbah, “Cybermetric: A
Jo

[82] A. Shestov, R. Levichev, R. Mussabayev, E. Maslov, A. Cheshkov, and benchmark dataset for evaluating large language models knowledge in
P. Zadorozhny, “Finetuning large language models for vulnerability cybersecurity,” arXiv preprint arXiv:2402.07688, 2024.
detection,” arXiv preprint, 2024, version 4. [105] R. Meng, M. Mirchev, M. Böhme, and A. Roychoudhury, “Large
[83] F. He, F. Li, and P. Liang, “Enhancing smart contract security: Leverag- language model guided protocol fuzzing,” in Proceedings of the 31st
ing pre-trained language models for advanced vulnerability detection,” Annual Network and Distributed System Security Symposium (NDSS),
IET Blockchain, 2024, first published: 29 March 2024. 2024.
[84] C. Patsakis, F. Casino, and N. Lykousas, “Assessing llms in malicious [106] V.-T. Pham, M. Böhme, and A. Roychoudhury, “Aflnet: A greybox
code deobfuscation of real-world malware campaigns,” Expert Systems fuzzer for network protocols,” in 2020 IEEE 13th International Con-
with Applications, vol. 256, p. 124912, 2024. [Online]. Available: ference on Software Testing, Validation and Verification (ICST), 2020,
https://www.sciencedirect.com/science/article/pii/S0957417424017792 pp. 460–465.
[85] Y. Guo, C. Patsakis, Q. Hu, Q. Tang, and F. Casino, “Outside the [107] S. Qin, F. Hu, Z. Ma, B. Zhao, T. Yin, and C. Zhang, “Nsfuzz:
comfort zone: Analysing llm capabilities in software vulnerability Towards efficient and state-aware network service fuzzing,” ACM
detection,” in European symposium on research in computer security. Trans. Softw. Eng. Methodol., vol. 32, no. 6, sep 2023. [Online].
Springer, 2024, pp. 271–289. Available: https://doi.org/10.1145/3580598
[86] N. Lykousas and C. Patsakis, “Decoding developer password patterns: [108] J. Wang, L. Yu, and X. Luo, “Llmif: Augmented large language model
A comparative analysis of password extraction and selection practices,” for fuzzing iot devices,” in 2024 IEEE Symposium on Security and
Computers & Security, vol. 145, p. 103974, 2024. Privacy (SP). IEEE Computer Society, 2024, pp. 196–196.
[87] E. Karlsen, X. Luo, N. Zincir-Heywood, and M. Heywood, “Bench- [109] M. Ren, X. Ren, H. Feng, J. Ming, and Y. Lei, “Z-fuzzer: device-
marking large language models for log analysis, security, and interpre- agnostic fuzzing of zigbee protocol implementation,” in Proceedings
tation,” Journal of Network and Systems Management, vol. 32, no. 3, of the 14th ACM Conference on Security and Privacy in Wireless and
p. 59, 2024. Mobile Networks, ser. WiSec ’21. New York, NY, USA: Association
[88] A. Mechri, M. A. Ferrag, and M. Debbah, “Secureqwen: Leveraging for Computing Machinery, 2021, p. 347–358. [Online]. Available:
llms for vulnerability detection in python codebases,” Computers & https://doi.org/10.1145/3448300.3468296
Security, vol. 148, p. 104151, 2025. [110] J. Pereyda, “Boofuzz: Network protocol fuzzing for humans,” https:
[89] H. Ding, Y. Liu, X. Piao, H. Song, and Z. Ji, “Smartguard: An llm- //boofuzz.readthedocs.io/en/stable, 2020.
enhanced framework for smart contract vulnerability detection,” Expert [111] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal,
Systems with Applications, p. 126479, 2025. A. Neelakantan, P. Shyam, G. Sastry, A. Askell et al., “Language mod-
[90] U. Arshad and Z. Halim, “Blockllm: A futuristic llm-based decen- els are few-shot learners,” Advances in neural information processing
tralized vehicular network architecture for secure communications,” systems, vol. 33, pp. 1877–1901, 2020.
Computers and Electrical Engineering, vol. 123, p. 110027, 2025. [112] OpenAI, “Gpt-4 technical report,” 2023.

48
[113] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, llm. [Online]. Available: https://www.databricks.com/blog/2023/04/12/
Y. Zhou, W. Li, and P. J. Liu, “Exploring the limits of transfer learn- dolly-first-open-commercially-viable-instruction-tuned-llm
ing with a unified text-to-text transformer,” The Journal of Machine [136] TIIUAE, “Falcon-11b,” https://huggingface.co/tiiuae/falcon-11B, 2024,
Learning Research, vol. 21, no. 1, pp. 5485–5551, 2020. accessed: 2024-05-01.
[114] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training [137] L. B. Allal, R. Li, D. Kocetkov, C. Mou, C. Akiki, C. M. Ferrandis,
of deep bidirectional transformers for language understanding,” arXiv N. Muennighoff, M. Mishra, A. Gu, M. Dey et al., “Santacoder: don’t
preprint arXiv:1810.04805, 2018. reach for the stars!” arXiv preprint arXiv:2301.03988, 2023.
[115] Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut, [138] R. Li, L. B. Allal, Y. Zi, N. Muennighoff, D. Kocetkov, C. Mou,
“Albert: A lite bert for self-supervised learning of language represen- M. Marone, C. Akiki, J. Li, J. Chim et al., “Starcoder: may the source
tations,” arXiv preprint arXiv:1909.11942, 2019. be with you!” arXiv preprint arXiv:2305.06161, 2023.
[116] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, [139] Hugging Face & ServiceNow, “Huggingfaceh4/starchat-alpha,” https://
L. Zettlemoyer, and V. Stoyanov, “Roberta: A robustly optimized bert huggingface.co/HuggingFaceH4/starchat-alpha, 2023, accessed: 2023-
pretraining approach,” arXiv preprint arXiv:1907.11692, 2019. 12-10.
[117] Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. R. Salakhutdinov, and [140] E. Nijkamp, H. Hayashi, C. Xiong, S. Savarese, and Y. Zhou,
Q. V. Le, “Xlnet: Generalized autoregressive pretraining for language “Codegen2: Lessons for training llms on programming and natural
understanding,” Advances in neural information processing systems, languages,” arXiv preprint arXiv:2305.02309, 2023.
vol. 32, 2019. [141] Salesforce AI Research, “Codegen2.5: Small, but mighty,”
[118] W. Qi, Y. Yan, Y. Gong, D. Liu, N. Duan, J. Chen, R. Zhang, 2023, accessed: 2023-12-10. [Online]. Available: https://blog.
and M. Zhou, “Prophetnet: Predicting future n-gram for sequence-to- salesforceairesearch.com/codegen25/
sequence pre-training,” arXiv preprint arXiv:2001.04063, 2020. [142] Y. Wang, H. Le, A. D. Gotmare, N. D. Bui, J. Li, and S. C. Hoi,
[119] G. Penedo, Q. Malartic, D. Hesslow, R. Cojocaru, A. Cappelli, “Codet5+: Open code large language models for code understanding
H. Alobeidli, B. Pannier, E. Almazrouei, and J. Launay, “The refined- and generation,” arXiv preprint arXiv:2305.07922, 2023.
web dataset for falcon llm: outperforming curated corpora with web

of
[143] E. Nijkamp, T. Xie, H. Hayashi, B. Pang, C. Xia, C. Xing, J. Vig,
data, and web data only,” arXiv preprint arXiv:2306.01116, 2023. S. Yavuz, P. Laban, B. Krause et al., “Xgen-7b technical report,” arXiv
[120] N. Kitaev, Ł. Kaiser, and A. Levskaya, “Reformer: The efficient preprint arXiv:2309.03450, 2023.
transformer,” arXiv preprint arXiv:2001.04451, 2020.

ro
[144] Replit, Inc., “replit-code-v1-3b,” 2023, accessed: 2023-12-10. [Online].
[121] A. Chowdhery, S. Narang, J. Devlin, M. Bosma, G. Mishra, A. Roberts,
Available: https://huggingface.co/replit/replit-code-v1-3b
P. Barham, H. W. Chung, C. Sutton, S. Gehrmann et al., “Palm: Scaling
[145] Deci AI, “Introducing decicoder: The new gold standard
language modeling with pathways,” Journal of Machine Learning

-p
in efficient and accurate code generation,” August 2023,
Research, vol. 24, no. 240, pp. 1–113, 2023.
accessed: 2023-12-10. [Online]. Available: https://deci.ai/blog/
[122] R. Anil, A. M. Dai, O. Firat, M. Johnson, D. Lepikhin, A. Passos,
decicoder-efficient-and-accurate-code-generation-llm/
S. Shakeri, E. Taropa, P. Bailey, Z. Chen et al., “Palm 2 technical
re
[146] B. Roziere, J. Gehring, F. Gloeckle, S. Sootla, I. Gat, X. E. Tan, Y. Adi,
report,” arXiv preprint arXiv:2305.10403, 2023.
J. Liu, T. Remez, J. Rapin et al., “Code llama: Open foundation models
[123] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux,
for code,” arXiv preprint arXiv:2308.12950, 2023.
T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar et al., “Llama:
lP

Open and efficient foundation language models,” arXiv preprint [147] J. Bai, S. Bai, Y. Chu, Z. Cui, K. Dang, X. Deng, Y. Fan, W. Ge,
arXiv:2302.13971, 2023. Y. Han, F. Huang, B. Hui, L. Ji, M. Li, J. Lin, R. Lin, D. Liu, G. Liu,
[124] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, C. Lu, K. Lu, J. Ma, R. Men, X. Ren, X. Ren, C. Tan, S. Tan, J. Tu,
N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale et al., “Llama P. Wang, S. Wang, W. Wang, S. Wu, B. Xu, J. Xu, A. Yang, H. Yang,
na

2: Open foundation and fine-tuned chat models,” arXiv preprint J. Yang, S. Yang, Y. Yao, B. Yu, H. Yuan, Z. Yuan, J. Zhang, X. Zhang,
arXiv:2307.09288, 2023. Y. Zhang, Z. Zhang, C. Zhou, J. Zhou, X. Zhou, and T. Zhu, “Qwen
[125] D. Lepikhin, H. Lee, Y. Xu, D. Chen, O. Firat, Y. Huang, M. Krikun, technical report,” arXiv preprint, Tech. Rep., 2023, 59 pages, 5 figures.
N. Shazeer, and Z. Chen, “Gshard: Scaling giant models with [148] D. Guo, Q. Zhu, D. Yang, Z. Xie, K. Dong, W. Zhang, G. Chen,
ur

conditional computation and automatic sharding,” arXiv preprint X. Bi, Y. Wu, Y. Li, F. Luo, Y. Xiong, and W. Liang, “Deepseek-
arXiv:2006.16668, 2020. coder: When the large language model meets programming – the rise
[126] K. Clark, M.-T. Luong, Q. V. Le, and C. D. Manning, “Electra: Pre- of code intelligence,” arXiv preprint, 2024, submitted on 25 Jan 2024,
Jo

training text encoders as discriminators rather than generators,” arXiv Last revised 26 Jan 2024.
preprint arXiv:2003.10555, 2020. [149] C. Team, A. J. Hartman, A. Hu, C. A. Choquette-Choo, H. Zhao,
[127] The MosaicML NLP Team, “Mpt-30b: Raising the bar for open- J. Fine, J. Hui, J. Shen, J. Kelley, J. Howland, K. Bansal, L. Vilnis,
source foundation models,” June 2023, accessed: 2023-12-10. [Online]. M. Wirth, N. Nguyen, P. Michel, P. Choy, P. Joshi, R. Kumar,
Available: https://www.mosaicml.com/blog/mpt-30b S. Hashmi, S. Agrawal, S. Zuo, T. Warkentin, and Z. Gong,
[128] 01.AI, “Yi-34b,” https://huggingface.co/01-ai/Yi-34B, 2023, accessed: “Codegemma: Open code models based on gemma,” 2024. [Online].
2023-12-10. Available: https://goo.gle/codegemma
[129] M. A. et al., “Phi-3 technical report: A highly capable language model [150] M. Mishra, M. Stallone, G. Zhang, Y. Shen, A. Prasad, A. Meza Soria,
locally on your phone,” arXiv preprint arXiv:2404.14219, 2024. M. Merler, P. Selvam, S. Surendran, S. Singh, M. Sethi, X.-H. Dang,
[130] A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. P. Li, K.-L. Wu, S. Zawad, A. Coleman, M. White, M. Lewis,
Chaplot, D. de las Casas, F. Bressand, G. Lengyel, G. Lample, R. Pavuluri, Y. Koyfman, B. Lublinsky, M. de Bayser, I. Abdelaziz,
L. Saulnier, L. Renard Lavaud, M.-A. Lachaux, P. Stock, T. Le Scao, K. Basu, M. Agarwal, Y. Zhou, C. Johnson, A. Goyal, H. Patel,
T. Lavril, T. Wang, T. Lacroix, and W. El Sayed, “Mistral 7b,” arXiv Y. Shah, P. Zerfos, H. Ludwig, A. Munawar, M. Crouse, P. Kapanipathi,
preprint arXiv:2310.06825, 2023, submitted on 10 Oct 2023. [Online]. S. Salaria, B. Calio, S. Wen, S. Seelam, B. Belgodere, C. Fonseca,
Available: https://doi.org/10.48550/arXiv.2310.06825 A. Singhee, N. Desai, D. D. Cox, R. Puri, and R. Panda, “Granite code
[131] N. Dey, G. Gosal, Z. C. Chen, H. Khachane, W. Marshall, R. Pathria, models: A family of open foundation models for code intelligence,”
M. Tom, and J. Hestness, “Cerebras-gpt: Open compute-optimal arXiv preprint arXiv:2405.04324, May 2024.
language models trained on the cerebras wafer-scale cluster,” arXiv [151] DeepSeek-AI, “Deepseek-v2: A strong, economical, and efficient
preprint arXiv:2304.03208, 2023, submitted on 6 Apr 2023. [Online]. mixture-of-experts language model,” arXiv preprint arXiv:2405.04434,
Available: https://doi.org/10.48550/arXiv.2304.03208 May 2024, submitted on 7 May 2024 (v1), last revised 8 May 2024
[132] ZySec-AI, “Zysec-ai: Project zysec,” Webpage, accessed: 2024-05-01. (this version, v2). [Online]. Available: https://arxiv.org/abs/2405.04434
[Online]. Available: https://github.com/ZySec-AI/project-zysec [152] P. Haller, J. Golde, and A. Akbik, “Pecc: Problem extraction and coding
[133] DeciAI Research Team, “Decilm-7b,” 2023. [Online]. Available: challenges,” arXiv preprint arXiv:2404.18766, 2024.
https://huggingface.co/Deci/DeciLM-7B [153] A. Z. Yang, Y. Takashima, B. Paulsen, J. Dodds, and D. Kroening,
[134] L. Tunstall, E. Beeching, N. Lambert, N. Rajani, K. Rasul, Y. Belkada, “Vert: Verified equivalent rust transpilation with few-shot learning,”
S. Huang, L. von Werra, C. Fourrier, N. Habib, N. Sarrazin, O. San- arXiv preprint arXiv:2404.18852, 2024.
seviero, A. M. Rush, and T. Wolf, “Zephyr: Direct distillation of lm [154] D. Nichols, P. Polasam, H. Menon, A. Marathe, T. Gamblin, and
alignment,” 2023. A. Bhatele, “Performance-aligned llms for generating fast code,” arXiv
[135] M. Conover, M. Hayes, A. Mathur, J. Xie, J. Wan, S. Shah, preprint arXiv:2404.18864, 2024.
A. Ghodsi, P. Wendell, M. Zaharia, and R. Xin. (2023) Free [155] Z. Ma, A. R. Chen, D. J. Kim, T.-H. Chen, and S. Wang, “Llmparser:
dolly: Introducing the world’s first truly open instruction-tuned An exploratory study on using large language models for log parsing,”

49
in Proceedings of the IEEE/ACM 46th International Conference on 43rd International Conference on Software Engineering: Software
Software Engineering, 2024, pp. 1–13. Engineering in Practice (ICSE-SEIP), 2021, pp. 111–120.
[156] T. H. Le, M. A. Babar, and T. H. Thai, “Software vulnerability [177] Y. Zhou, S. Liu, J. Siow, X. Du, and Y. Liu, “Devign: Effective Vulner-
prediction in low-resource languages: An empirical study of codebert ability Identification by Learning Comprehensive Program Semantics
and chatgpt,” arXiv preprint arXiv:2404.17110, 2024. via Graph Neural Networks,” arXiv e-prints, p. arXiv:1909.03496, Sep.
[157] B. Guan, Y. Wan, Z. Bi, Z. Wang, H. Zhang, Y. Sui, P. Zhou, and 2019.
L. Sun, “Codeip: A grammar-guided multi-bit watermark for large [178] H. Hanif, M. H. N. M. Nasir, M. F. Ab Razak, A. Firdaus, and N. B.
language models of code,” arXiv preprint arXiv:2404.15639, 2024. Anuar, “The rise of software vulnerability: Taxonomy of software
[158] X.-C. Wen, X. Wang, Y. Chen, R. Hu, D. Lo, and C. Gao, “Vuleval: To- vulnerabilities detection and machine learning approaches,” Journal of
wards repository-level evaluation of software vulnerability detection,” Network and Computer Applications, vol. 179, p. 103009, 2021.
arXiv preprint arXiv:2404.15596, 2024. [179] R. Russell, L. Kim, L. Hamilton, T. Lazovich, J. Harer, O. Ozdemir,
[159] Z. Zhang, C. Chen, B. Liu, C. Liao, Z. Gong, H. Yu, J. Li, and R. Wang, P. Ellingwood, and M. McConley, “Automated vulnerability detection
“Unifying the perspectives of nlp and software engineering: A survey in source code using deep representation learning,” in 2018 17th
on language models for code,” arXiv preprint arXiv:2311.07989, 2023. IEEE International Conference on Machine Learning and Applications
[160] H. Husain, H.-H. Wu, T. Gazit, M. Allamanis, and M. Brockschmidt, (ICMLA), 2018, pp. 757–762.
“Codesearchnet challenge: Evaluating the state of semantic code [180] ——, “Automated vulnerability detection in source code using deep
search,” arXiv preprint arXiv:1909.09436, 2019. representation learning,” in 2018 17th IEEE International Conference
[161] L. Gao, S. Biderman, S. Black, L. Golding, T. Hoppe, C. Foster, on Machine Learning and Applications (ICMLA), 2018, pp. 757–762.
J. Phang, H. He, A. Thite, N. Nabeshima et al., “The pile: An [181] Y. Zhou and A. Sharma, “Automated identification of security issues
800gb dataset of diverse text for language modeling,” arXiv preprint from commit messages and bug reports,” in Proceedings of the 2017
arXiv:2101.00027, 2020. 11th joint meeting on foundations of software engineering, 2017, pp.
[162] D. Kocetkov, R. Li, L. B. Allal, J. Li, C. Mou, C. M. Ferrandis, 914–919.
[182] L. Wartschinski, Y. Noller, T. Vogel, T. Kehrer, and L. Grunske,

of
Y. Jernite, M. Mitchell, S. Hughes, T. Wolf et al., “The stack: 3 tb of
permissively licensed source code,” arXiv preprint arXiv:2211.15533, “Vudenc: Vulnerability detection with deep learning on a natural
2022. codebase for python,” Information and Software Technology, vol.
144, p. 106809, 2022, arXiv preprint arXiv:2201.08441. [Online].

ro
[163] H. Laurençon, L. Saulnier, T. Wang, C. Akiki, A. Villanova del Moral,
T. Le Scao, L. Von Werra, C. Mou, E. González Ponferrada, H. Nguyen Available: https://doi.org/10.48550/arXiv.2201.08441
et al., “The bigscience roots corpus: A 1.6 tb composite multilingual [183] J. Fan, Y. Li, S. Wang, and T. N. Nguyen, “A c/c++ code
vulnerability dataset with code changes and cve summaries,” in

-p
dataset,” Advances in Neural Information Processing Systems, vol. 35,
pp. 31 809–31 826, 2022. Proceedings of the 17th International Conference on Mining Software
[164] A. Lozhkov, R. Li, L. B. Allal, F. Cassano, J. Lamy-Poirier, N. Tazi, Repositories, ser. MSR ’20. New York, NY, USA: Association
for Computing Machinery, 2020, p. 508–512. [Online]. Available:
re
A. Tang, D. Pykhtar, J. Liu, Y. Wei et al., “Starcoder 2 and the stack
v2: The next generation,” arXiv preprint arXiv:2402.19173, 2024. https://doi.org/10.1145/3379597.3387501
[184] G. Bhandari, A. Naseer, and L. Moonen, “Cvefixes: automated collec-
[165] R. Schuster, C. Song, E. Tromer, and V. Shmatikov, “You autocomplete
tion of vulnerabilities and their fixes from open-source software,” in
lP

me: Poisoning vulnerabilities in neural code completion,” in 30th


Proceedings of the 17th International Conference on Predictive Models
USENIX Security Symposium (USENIX Security 21), 2021, pp. 1559–
and Data Analytics in Software Engineering, 2021, pp. 30–39.
1575.
[185] G. Nikitopoulos, K. Dritsa, P. Louridas, and D. Mitropoulos, “Crossvul:
[166] O. Asare, M. Nagappan, and N. Asokan, “Is github’s copilot as bad
a cross-language vulnerability dataset with commit data,” in Proceed-
na

as humans at introducing vulnerabilities in code?” Empirical Software


ings of the 29th ACM Joint Meeting on European Software Engineering
Engineering, vol. 28, no. 6, p. 129, 2023.
Conference and Symposium on the Foundations of Software Engineer-
[167] G. Sandoval, H. Pearce, T. Nys, R. Karri, S. Garg, and B. Dolan-Gavitt, ing, 2021, pp. 1565–1569.
“Lost at c: A user study on the security implications of large language [186] Z. Li, D. Zou, S. Xu, H. Jin, Y. Zhu, and Z. Chen, “Sysevr: A
ur

model code assistants,” in 32nd USENIX Security Symposium (USENIX framework for using deep learning to detect software vulnerabilities,”
Security 23), 2023, pp. 2205–2222. IEEE Transactions on Dependable and Secure Computing, vol. 19,
[168] N. Perry, M. Srivastava, D. Kumar, and D. Boneh, “Do users write no. 4, pp. 2244–2258, 2022.
Jo

more insecure code with ai assistants?” in Proceedings of the 2023 [187] Z. Li, D. Zou, S. Xu, X. Ou, H. Jin, S. Wang, Z. Deng, and Y. Zhong,
ACM SIGSAC Conference on Computer and Communications Security, “Vuldeepecker: A deep learning-based system for vulnerability detec-
2023, pp. 2785–2799. tion,” arXiv preprint arXiv:1801.01681, 2018.
[169] S. Hamer, M. d’Amorim, and L. Williams, “Just another copy and [188] Y. Chen, Z. Ding, L. Alowain, X. Chen, and D. Wagner, “DiverseVul:
paste? comparing the security vulnerabilities of chatgpt generated code A New Vulnerable Source Code Dataset for Deep Learning Based
and stackoverflow answers,” arXiv preprint arXiv:2403.15600, 2024. Vulnerability Detection,” arXiv e-prints, p. arXiv:2304.00409, Apr.
[170] D. Cotroneo, R. De Luca, and P. Liguori, “Devaic: A tool for security 2023.
assessment of ai-generated code,” arXiv preprint arXiv:2404.07548, [189] D. N. Gadde, A. Kumar, T. Nalapat, E. Rezunov, and F. Cappellini,
2024. “All artificial, less intelligence: Genai through the lens of formal
[171] R. Tóth, T. Bisztray, and L. Erdodi, “Llms in web-development: Evalu- verification,” Infineon Technologies, 2024.
ating llm-generated php code unveiling vulnerabilities and limitations,” [190] OWASP Foundation, “Owasp top 10 for large
arXiv preprint arXiv:2404.14459, 2024. language model applications,” https://owasp.org/
[172] N. Tihanyi, T. Bisztray, M. A. Ferrag, R. Jain, and L. C. Cordeiro, “Do www-project-top-10-for-large-language-model-applications/, 2023,
neutral prompts produce insecure code? formai-v2 dataset: Labelling accessed: 2023-12-26.
vulnerabilities in code generated by large language models,” 2024. [191] F. Perez and I. Ribeiro, “Ignore previous prompt: Attack techniques
[173] N. S. Harzevili, A. B. Belle, J. Wang, S. Wang, Z. Ming, N. Nagappan for language models,” arXiv preprint arXiv:2211.09527, 2022.
et al., “A survey on automated software vulnerability detection using [192] K. Greshake, S. Abdelnabi, S. Mishra, C. Endres, T. Holz, and
machine learning and deep learning,” arXiv preprint arXiv:2306.11673, M. Fritz, “More than you’ve asked for: A comprehensive analysis of
2023. novel prompt injection threats to application-integrated large language
[174] M. A. Ferrag, O. Friha, D. Hamouda, L. Maglaras, and H. Janicke, models,” arXiv e-prints, pp. arXiv–2302, 2023.
“Edge-iiotset: A new comprehensive realistic cyber security dataset of [193] J. Yan, V. Yadav, S. Li, L. Chen, Z. Tang, H. Wang, V. Srinivasan,
iot and iiot applications for centralized and federated learning,” IEEE X. Ren, and H. Jin, “Virtual prompt injection for instruction-tuned
Access, vol. 10, pp. 40 281–40 306, 2022. large language models,” arXiv preprint arXiv:2307.16888, 2023.
[175] N. Tihanyi, T. Bisztray, R. Jain, M. A. Ferrag, L. C. Cordeiro, and [194] R. Pedro, D. Castro, P. Carreira, and N. Santos, “From prompt
V. Mavroeidis, “The formai dataset: Generative ai in software security injections to sql injection attacks: How protected is your llm-integrated
through the lens of formal verification,” in Proceedings of the 19th web application?” arXiv preprint arXiv:2308.01990, 2023.
International Conference on Predictive Models and Data Analytics in [195] S. Abdelnabi, K. Greshake, S. Mishra, C. Endres, T. Holz, and M. Fritz,
Software Engineering, 2023, pp. 33–43. “Not what you’ve signed up for: Compromising real-world llm-
[176] Y. Zheng, S. Pujar, B. Lewis, L. Buratti, E. Epstein, B. Yang, J. Laredo, integrated applications with indirect prompt injection,” in Proceedings
A. Morari, and Z. Su, “D2a: A dataset built for ai-based vulnerability of the 16th ACM Workshop on Artificial Intelligence and Security, 2023,
detection methods using differential analysis,” in 2021 IEEE/ACM pp. 79–90.

50
[196] Y. Liu, G. Deng, Y. Li, K. Wang, T. Zhang, Y. Liu, H. Wang, [218] T. Dettmers, A. Pagnoni, A. Holtzman, and L. Zettlemoyer, “Qlora:
Y. Zheng, and Y. Liu, “Prompt injection attack against llm-integrated Efficient finetuning of quantized llms,” Advances in Neural Information
applications,” arXiv preprint arXiv:2306.05499, 2023. Processing Systems, vol. 36, 2024.
[197] J. Yan, V. Yadav, S. Li, L. Chen, Z. Tang, H. Wang, V. Srinivasan, [219] X. Wu, H. Xia, S. Youn, Z. Zheng, S. Chen, A. Bakhtiari, M. Wyatt,
X. Ren, and H. Jin, “Backdooring instruction-tuned large language R. Y. Aminabadi, Y. He, O. Ruwase, L. Song et al., “Zeroquant(4+2):
models with virtual prompt injection,” in NeurIPS 2023 Workshop on Redefining llms quantization with a new fp6-centric strategy for diverse
Backdoors in Deep Learning-The Good, the Bad, and the Ugly, 2023. generative tasks,” arXiv preprint arXiv:2312.08583, 2023.
[198] D. Glukhov, I. Shumailov, Y. Gal, N. Papernot, and V. Papyan, “Llm [220] H. Xia, Z. Zheng, X. Wu, S. Chen, Z. Yao, S. Youn, A. Bakhtiari,
censorship: A machine learning challenge or a computer security M. Wyatt, D. Zhuang, Z. Zhou et al., “Fp6-llm: Efficiently serving large
problem?” arXiv preprint arXiv:2307.10719, 2023. language models through fp6-centric algorithm-system co-design,”
[199] F. Wu, X. Liu, and C. Xiao, “Deceptprompt: Exploiting llm-driven arXiv preprint arXiv:2401.14112, 2024.
code generation via adversarial natural language instructions,” arXiv [221] M. Bhatt, S. Chennabasappa, C. Nikolaidis, S. Wan, I. Evtimov,
preprint arXiv:2312.04730, 2023. D. Gabi, D. Song, F. Ahmad, C. Aschermann, L. Fontana et al., “Purple
[200] B. D. Son, N. T. Hoa, T. Van Chien, W. Khalid, M. A. Ferrag, W. Choi, llama cyberseceval: A secure coding benchmark for language models,”
and M. Debbah, “Adversarial attacks and defenses in 6g network- arXiv preprint arXiv:2312.04724, 2023.
assisted iot systems,” IEEE Internet of Things Journal, 2024. [222] Z. Liu, “Secqa: A concise question-answering dataset for evalu-
[201] A. Zou, Z. Wang, J. Z. Kolter, and M. Fredrikson, “Universal and ating large language models in computer security,” arXiv preprint
transferable adversarial attacks on aligned language models,” arXiv arXiv:2312.15838, 2023.
preprint arXiv:2307.15043, 2023. [223] M. Bhatt, S. Chennabasappa, Y. Li, C. Nikolaidis, D. Song, S. Wan,
[202] Z. Yang, X. He, Z. Li, M. Backes, M. Humbert, P. Berrang, and F. Ahmad, C. Aschermann, Y. Chen, D. Kapil, D. Molnar, S. Whitman,
Y. Zhang, “Data poisoning attacks against multimodal encoders,” in and J. Saxe, “Cyberseceval 2: A wide-ranging cybersecurity evaluation
International Conference on Machine Learning. PMLR, 2023, pp. suite for large language models,” 2024.
[224] N. Li, A. Pan, A. Gopal, S. Yue, D. Berrios, A. Gatti, J. D. Li, A.-

of
39 299–39 313.
[203] A. E. Cinà, K. Grosse, A. Demontis, S. Vascon, W. Zellinger, B. A. K. Dombrowski, S. Goel, L. Phan et al., “The wmdp benchmark:
Moser, A. Oprea, B. Biggio, M. Pelillo, and F. Roli, “Wild patterns Measuring and reducing malicious use with unlearning,” arXiv preprint
arXiv:2403.03218, 2024.

ro
reloaded: A survey of machine learning security against training data
poisoning,” ACM Computing Surveys, vol. 55, no. 13s, pp. 1–39, 2023. [225] Y. Sun, D. Wu, Y. Xue, H. Liu, W. Ma, L. Zhang, M. Shi, and
[204] P. Gupta, K. Yadav, B. B. Gupta, M. Alazab, and T. R. Gadekallu, “A Y. Liu, “Llm4vuln: A unified evaluation framework for decoupling and
enhancing llms’ vulnerability reasoning,” 2024.

-p
novel data poisoning attack in federated learning based on inverted loss
function,” Computers & Security, vol. 130, p. 103270, 2023. [226] Z. Liu, J. Shi, and J. F. Buford, “Cyberbench: A multi-task benchmark
[205] J. He, W. Jiang, G. Hou, W. Fan, R. Zhang, and H. Li, “Talk too much: for evaluating large language models in cybersecurity.” [Online].
Available: http://aics.site/AICS2024/AICS CyberBench.pdf
re
Poisoning large language models under token limit,” arXiv preprint
arXiv:2404.14795, 2024. [227] P. Clark, I. Cowhey, O. Etzioni, T. Khot, A. Sabharwal, C. Schoenick,
and O. Tafjord, “Think you have solved question answering? try arc,
[206] A. B. de Neira, B. Kantarci, and M. Nogueira, “Distributed denial of
the ai2 reasoning challenge,” arXiv preprint arXiv:1803.05457, 2018.
lP

service attack prediction: Challenges, open issues and opportunities,”


[228] Y. Lai, C. Li, Y. Wang, T. Zhang, R. Zhong, L. Zettlemoyer, W.-
Computer Networks, vol. 222, 2023.
t. Yih, D. Fried, S. Wang, and T. Yu, “Ds-1000: A natural and
[207] N. Hoque, D. K. Bhattacharyya, and J. K. Kalita, “Botnet in ddos
reliable benchmark for data science code generation,” in International
attacks: Trends and challenges,” IEEE Communications Surveys and
Conference on Machine Learning. PMLR, 2023, pp. 18 319–18 345.
na

Tutorials, vol. 17, 2015.


[229] J. Liu, C. S. Xia, Y. Wang, and L. Zhang, “Is your code generated by
[208] O. Osanaiye, K. K. R. Choo, and M. Dlodlo, “Distributed denial of chatgpt really correct? rigorous evaluation of large language models for
service (ddos) resilience in cloud: Review and conceptual cloud ddos code generation,” Advances in Neural Information Processing Systems,
mitigation framework,” 2016. vol. 36, 2024.
ur

[209] Q. Yan, F. R. Yu, Q. Gong, and J. Li, “Software-defined networking [230] R. Zellers, A. Holtzman, Y. Bisk, A. Farhadi, and Y. Choi, “Hel-
(sdn) and distributed denial of service (ddos) attacks in cloud com- laswag: Can a machine really finish your sentence?” arXiv preprint
puting environments: A survey, some research issues, and challenges,” arXiv:1905.07830, 2019.
Jo

IEEE Communications Surveys and Tutorials, vol. 18, 2016. [231] D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and
[210] H. Zhao, H. Chen, F. Yang, N. Liu, H. Deng, H. Cai, S. Wang, D. Yin, J. Steinhardt, “Measuring massive multitask language understanding,”
and M. Du, “Explainability for large language models: A survey,” ACM arXiv preprint arXiv:2009.03300, 2020.
Transactions on Intelligent Systems and Technology, vol. 15, no. 2, pp. [232] K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser,
1–38, 2024. M. Plappert, J. Tworek, J. Hilton, R. Nakano et al., “Training verifiers
[211] C. Xu, Q. Sun, K. Zheng, X. Geng, P. Zhao, J. Feng, C. Tao, and to solve math word problems,” arXiv preprint arXiv:2110.14168, 2021.
D. Jiang, “Wizardlm: Empowering large language models to follow [233] D. Hendrycks, C. Burns, S. Kadavath, A. Arora, S. Basart, E. Tang,
complex instructions,” arXiv preprint arXiv:2304.12244, 2023. D. Song, and J. Steinhardt, “Measuring mathematical problem solving
[212] R. Taori, I. Gulrajani, T. Zhang, Y. Dubois, X. Li, C. Guestrin, P. Liang, with the math dataset,” arXiv preprint arXiv:2103.03874, 2021.
and T. Hashimoto, “Alpaca: a strong, replicable instruction-following [234] S. Lu, D. Guo, S. Ren, J. Huang, A. Svyatkovskiy, A. Blanco, C. B.
model; 2023,” URL https://crfm. stanford. edu/2023/03/13/alpaca. html. Clement, D. Drain, D. Jiang, D. Tang, G. Li, L. Zhou, L. Shou, L. Zhou,
[213] V. A. Korthikanti, J. Casper, S. Lym, L. McAfee, M. Andersch, M. Tufano, M. Gong, M. Zhou, N. Duan, N. Sundaresan, S. K. Deng,
M. Shoeybi, and B. Catanzaro, “Reducing activation recomputation S. Fu, and S. Liu, “Codexglue: A machine learning benchmark dataset
in large transformer models,” Proceedings of Machine Learning and for code understanding and generation,” CoRR, vol. abs/2102.04664,
Systems, vol. 5, 2023. 2021.
[214] A. Andonian, Q. Anthony, S. Biderman, S. Black, P. Gali, L. Gao, [235] T. Dao, D. Fu, S. Ermon, A. Rudra, and C. Ré, “Flashattention: Fast
E. Hallahan, J. Levy-Kramer, C. Leahy, L. Nestler, K. Parker, and memory-efficient exact attention with io-awareness,” Advances in
M. Pieler, J. Phang, S. Purohit, H. Schoelkopf, D. Stander, T. Songz, Neural Information Processing Systems, vol. 35, pp. 16 344–16 359,
C. Tigges, B. Thérien, P. Wang, and S. Weinbach, “GPT-NeoX: 2022.
Large Scale Autoregressive Language Modeling in PyTorch,” 9 2023. [236] E. Frantar, S. Ashkboos, T. Hoefler, and D. Alistarh, “Gptq: Accu-
[Online]. Available: https://www.github.com/eleutherai/gpt-neox rate post-training quantization for generative pre-trained transformers,”
[215] J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, arXiv preprint arXiv:2210.17323, 2022.
R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei, “Scaling laws [237] H. Badri and A. Shaji, “Half-quadratic quantization of large
for neural language models,” arXiv preprint arXiv:2001.08361, 2020. machine learning models,” November 2023. [Online]. Available:
[216] N. Houlsby, A. Giurgiu, S. Jastrzebski, B. Morrone, Q. De Laroussilhe, https://mobiusml.github.io/hqq blog/
A. Gesmundo, M. Attariyan, and S. Gelly, “Parameter-efficient transfer [238] F. Gloeckle, B. Youbi Idrissi, B. Rozière, D. Lopez-Paz, and G. Syn-
learning for nlp,” in International conference on machine learning. naeve, “Better & Faster Large Language Models via Multi-token
PMLR, 2019, pp. 2790–2799. Prediction,” arXiv e-prints, p. arXiv:2404.19737, Apr. 2024.
[217] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, [239] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz, “Trust
and W. Chen, “Lora: Low-rank adaptation of large language models,” region policy optimization,” in Proceedings of the 32nd International
arXiv preprint arXiv:2106.09685, 2021. Conference on Machine Learning, ser. Proceedings of Machine

51
Learning Research, F. Bach and D. Blei, Eds., vol. 37. Lille,
France: PMLR, 07–09 Jul 2015, pp. 1889–1897. [Online]. Available:
https://proceedings.mlr.press/v37/schulman15.html
[240] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov,
“Proximal policy optimization algorithms,” 2017.
[241] R. Rafailov, A. Sharma, E. Mitchell, C. D. Manning,
S. Ermon, and C. Finn, “Direct preference optimization: Your
language model is secretly a reward model,” in Advances in
Neural Information Processing Systems, A. Oh, T. Naumann,
A. Globerson, K. Saenko, M. Hardt, and S. Levine, Eds.,
vol. 36. Curran Associates, Inc., 2023, pp. 53 728–53 741.
[Online]. Available: https://proceedings.neurips.cc/paper files/paper/
2023/file/a85b405ed65c6477a4fe8302b5e06ce7-Paper-Conference.pdf
[242] J. Hong, N. Lee, and J. Thorne, “Orpo: Monolithic preference opti-
mization without reference model,” 2024.
[243] P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal,
H. Küttler, M. Lewis, W.-t. Yih, T. Rocktäschel, S. Riedel, and D. Kiela,
“Retrieval-augmented generation for knowledge-intensive nlp tasks,”
in Advances in Neural Information Processing Systems, H. Larochelle,
M. Ranzato, R. Hadsell, M. Balcan, and H. Lin, Eds., vol. 33. Curran
Associates, Inc., 2020, pp. 9459–9474.
[244] Y. Gao, Y. Xiong, X. Gao, K. Jia, J. Pan, Y. Bi, Y. Dai, J. Sun,
M. Wang, and H. Wang, “Retrieval-augmented generation for large

of
language models: A survey,” 2024.
[245] R. Rafailov, A. Sharma, E. Mitchell, S. Ermon, C. D. Manning, and
C. Finn, “Direct preference optimization: Your language model is

ro
secretly a reward model,” 2023.
[246] P. Zhao, H. Zhang, Q. Yu, Z. Wang, Y. Geng, F. Fu, L. Yang, W. Zhang,
J. Jiang, and B. Cui, “Retrieval-augmented generation for ai-generated

-p
content: A survey,” 2024.
[247] Y. Huang and J. Huang, “A survey on retrieval-augmented text gener-
ation for large language models,” 2024.
re
[248] M. team, “MLC-LLM,” 2023. [Online]. Available: https://github.com/
mlc-ai/mlc-llm
[249] ——, “MNN-LLM,” 2023. [Online]. Available: https://github.com/
lP

wangzhaode/mnn-llm/
[250] L. Derczynski, E. Galinkin, and S. Majumdar, “garak: A Framework
for Large Language Model Red Teaming,” https://garak.ai, 2024.
[251] N. Shazeer, “Fast transformer decoding: One write-head is all you
na

need,” 2019.
[252] J. Ainslie, J. Lee-Thorp, M. de Jong, Y. Zemlyanskiy, F. Lebrón, and
S. Sanghai, “Gqa: Training generalized multi-query transformer models
from multi-head checkpoints,” 2023.
ur
Jo

52
Declaration of interests

☒ The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

☐ The author is an Editorial Board Member/Editor-in-Chief/Associate Editor/Guest Editor for [Journal name] and was not involved in the editorial review or the decision to publish this article.

☐ The authors declare the following financial interests/personal relationships which may be considered
as potential competing interests:
