A Survey of Large Language Models in Cybersecurity
Abstract
Large Language Models (LLMs) have quickly risen to prominence due to their ability to perform at
or close to the state-of-the-art in a variety of fields while handling natural language. An important
field of research is the application of such models in the cybersecurity context. This survey aims to
identify where in the field of cybersecurity LLMs have already been applied, the ways in which they
are being used and their limitations in the field. Finally, suggestions are made on how to improve
such limitations and what can be expected from these systems once these limitations are overcome.
1 Introduction
1.1 Motivation
The cybersecurity industry is a field that, by its nature, requires keeping up with the state-of-the-art constantly, given
that every new technology provides new avenues for malicious exploitation and, as such, may increase the available
attack surface.
Software vulnerabilities are errors introduced in software, accidentally or deliberately, allowing threat actors to exploit
the behavior of a system in a harmful way that is not intended or expected by its developers. The exploitation of software vulnerabilities may lead to user data exposure or corruption, denial-of-service, or code execution that can result in the complete takeover of a system [1]. As such, companies commonly conduct controlled attempts at invading their own systems in an effort to detect and fix vulnerabilities before malicious actors do [2], a practice commonly referred to as vulnerability assessment or, interchangeably, red teaming and penetration testing.
The evolution of artificial intelligence and its effective usage in a wide variety of fields as a generic tool has also made
its way to cybersecurity, on both the offensive and the defensive side, and its application has become the norm in
most commercial and academic settings [3, 4].
1.2 Justifications
The rise of neural networks and deep learning has led to the development of AI-based malware and intrusion detection and prevention systems that analyze deviations from expected patterns or binary features which may indicate malware [5, 6, 7, 8, 9, 10, 11, 12]. Although these approaches have been successful and are deployed commercially, they still suffer from high false-positive rates and a lack of explainability, and are not well suited for more complex tasks [13].
Recent advances in neural network architectures and the introduction of Large Language Models (LLMs) have shown a
great capability to generalize beyond their original pre-training settings [14]. Foundation LLMs provide a good starting point, as they are pre-trained on a broad language corpus that includes conversational, scientific, question-and-answer, and code data, and can benefit from further fine-tuning on such tasks [15, 16].
The high specificity and advanced skill set necessary to perform a vulnerability assessment make it a difficult, albeit desirable, task to automate [17, 18, 19].
1.3 Objectives
An overview of the basic concepts is first provided, outlining both the fields of cybersecurity and artificial intelligence
and their intersections.
Next, a survey is carried out of papers in which LLMs are used for cybersecurity applications, covering the ways in which they are used and the areas in which they are well suited or still need to improve.
Finally, the paper suggests a new approach through which the usage of LLMs could contribute to the cybersecurity field, in an attempt to address the limitations of current systems, and discusses where this approach may lead.
2 Basic concepts
2.1 Cybersecurity
The cybersecurity field deals with protecting from and responding to threats in computer systems and their related networks and components. Modern infrastructure is increasingly reliant on computer systems, and both industrial and home devices are increasingly connected to the internet, meaning that protecting systems from outside tampering has become a must [20].
Cyber threats, a diverse array of malicious activities, pose significant risks to individuals, organizations, and even
nations. Common threats include malware, which encompasses viruses, worms, and ransomware, as well as phishing
attacks that exploit human vulnerabilities through deceptive means. Understanding the various attack vectors, such as
exploiting software vulnerabilities, social engineering tactics, and physical breaches, is essential in crafting effective
cybersecurity strategies [21].
To counteract these threats, cybersecurity relies on a combination of preventive, detective, and corrective measures.
Encryption, firewalls, antivirus software, and other tools serve as key components of a comprehensive cybersecurity
toolkit. In addition to technological safeguards, the establishment and enforcement of security policies and procedures,
as well as rigorous risk management practices, are crucial in creating a resilient security posture [22]. Organizations
also need to be cognizant of regulatory frameworks, as various industries and regions may have specific cybersecurity
requirements.
As technology evolves, so do the challenges and opportunities in cybersecurity. Emerging trends, such as the inte-
gration of artificial intelligence and machine learning [5, 6, 7, 8, 9, 10, 11, 12], are reshaping the landscape, offering
both new solutions and potential vulnerabilities. Ethical considerations, including privacy concerns and responsible
disclosure of vulnerabilities, add an additional layer of complexity to the field.
In the proactive pursuit of cybersecurity excellence, organizations often employ penetration testing and vulnerability
assessments as indispensable tools to fortify their defenses. Penetration testing, commonly known as ethical hacking,
involves simulated cyber-attacks conducted by skilled professionals to identify and exploit potential vulnerabilities
in systems, networks, and applications. This hands-on approach allows organizations to assess the resilience of their
security measures in a controlled environment, uncovering weaknesses that malicious actors might exploit [23, 24].
Vulnerability assessments, on the other hand, focus on systematically identifying, quantifying, and prioritizing vul-
nerabilities within an IT environment. By conducting regular assessments, organizations can stay ahead of emerging
threats, address weaknesses promptly, and enhance their overall security posture [25]. These proactive measures play
a vital role in maintaining a dynamic and adaptive defense against the evolving landscape of cyber threats.
2.2 Neural Networks
Artificial neural networks were first introduced as a mathematical model of biological neurons, in an attempt to develop a machine with the same capabilities as a biological brain [26, 27]. The basic neuron is composed of a set of weighted inputs, an activation function and an output; a network is constructed by the interconnection of neurons.
Although results were modest at first, further advances in learning algorithms [28, 29] and the stacking of several layers [30], combined with larger memories, greater data availability and faster computers, have turned neural networks into a somewhat generic algorithm capable of reaching the state of the art in a wide variety of fields and applications, as long as there is enough data available for training. This approach is often called "Deep Learning" and has become ubiquitous in the field [31].
The most common approach with neural networks is supervised learning. In this approach, the network goes through an initial training phase, in which previously annotated data is shown together with its desired output and the weights are adjusted over the entire dataset. This produces a set of weights which, when deployed, allow the correct labeling of previously unseen data. The performance of the network depends on the quality of the training data as well as on the amount of training it has gone through.
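To make the supervised workflow described above concrete, the following is a minimal sketch in Python; the synthetic dataset, network size and train/test split are illustrative assumptions and are not tied to any system surveyed here.

    # Minimal supervised-learning sketch: fit a small network on annotated data,
    # then label previously unseen data. Dataset and layer sizes are placeholders.
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.neural_network import MLPClassifier

    X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    # One hidden layer of 32 neurons; weights are adjusted over the training set.
    model = MLPClassifier(hidden_layer_sizes=(32,), max_iter=300, random_state=0)
    model.fit(X_train, y_train)

    # The resulting set of weights is then used to label unseen data.
    print("accuracy on unseen data:", model.score(X_test, y_test))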
Natural Language Processing (NLP) is a field that deals with the understanding of natural language by computing
machines [32]. Although other approaches are also possible, the neural network approach has also reached state-of-
the-art in NLP tasks and has become the preferred method.
The introduction of the transformer architecture and its attention mechanism [33] greatly advanced the field and led to
the creation of large language models (LLMs) [15, 16]. LLMs come in a range of parameter sizes and are trained on extremely large and varied datasets. These datasets may include questions and answers from websites, encyclopedias, books,
source codes and so on. Training is done through "next-word prediction", in which the network is shown a phrase with
a missing word or sequence of words and has to correctly identify the missing piece.
In general, the training process involves the following steps:
Tokenization: The data is pre-processed into numerical tokens representing the smallest possible textual data point carrying meaning. This step is also required for inference.
Unsupervised learning: The models are trained to predict the next token from a sequence of tokens in the training data. This creates a "base" model, from which further training may or may not be performed.
Supervised fine-tuning: The model receives further training through demonstrations of desired request/response pairs created by humans.
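As a toy illustration of the tokenization and next-token prediction steps above, the sketch below builds (context, target) pairs from a short sentence; the whitespace tokenizer and tiny vocabulary are simplifications, since real LLMs use subword tokenizers and transformer networks.

    # Toy sketch of tokenization and the next-token prediction objective.
    text = "scan the target host for open ports"

    # Tokenization: naive whitespace split mapped to integer ids
    # (a stand-in for a real subword tokenizer).
    vocab = {word: idx for idx, word in enumerate(sorted(set(text.split())))}
    tokens = [vocab[word] for word in text.split()]

    # Next-token prediction: each prefix is paired with the token that follows it;
    # the model is trained to assign high probability to the target.
    for i in range(1, len(tokens)):
        context, target = tokens[:i], tokens[i]
        print(f"context={context} -> target={target}")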
Because the training process requires such a huge amount of memory and processing power, it remains accessible only to a few big industry players that can afford the necessary computing resources. However, increased competition and the possibility of contributions from the community have led to the creation of openly available language models, generally referred to as foundation models. These models have already gone through the training process and are provided as a set of weights, generally in varying parameter sizes so they can be run on more accessible consumer hardware. Further specialization of such models is possible by fine-tuning [34, 35, 36] or specific
prompting [37, 38, 39, 40].
3 Related work
Deep neural networks are already used extensively in cybersecurity for threat and vulnerability detection [5, 6, 7, 8],
network intrusion prevention [9, 10], password guessing [41], and others [11, 12], although a lack of explainability and an excess of false positives undermine trust in these systems [13].
The introduction of the transformer architecture [42] and subsequent construction of large language models trained
on a varied corpus of text data have shown a remarkable ability for further model specialization [14]. Fine-tuning pre-trained foundation models on task-specific data achieves performance close to or better than the state of the art when compared with other machine learning techniques [15, 16, 43]. Fine-tuning large language models on publicly available code databases has led to better performance in code writing [44, 45, 46], vulnerable code fixing [47, 48], and finding vulnerable code [49].
Large language models have also been trained to navigate a browser to look for answers [50] and use external tools
through Application Programming Interfaces (APIs) [51, 52]. Autonomous agents with a mix of tool usage and code
execution capabilities have also been shown to perform well [53].
In order to analyze interest in and relevance of LLMs over time, a search of the keywords "Large Language Models"
and "Large Language Models" AND "Cybersecurity" was performed on Google Scholar, starting from the year in
which the transformer architecture [42] was first published, i.e. 2017, up until the date of publication of this article,
i.e. 2023, as shown in Figure 1. It becomes clear how quickly interest in this research topic has grown, mostly attributed to the successes of commercial large language models such as GPT-4, BERT and LLaMA. However, the amount of research that also includes "cybersecurity" lags far behind, demonstrating not only a worrying lack of oversight into such systems but also a gap in applying this new technology to the cybersecurity context.
[Figure 1: Number of publications per year (2017-2023) returned by Google Scholar for the searches "Large Language Models" and "Large Language Models" AND "Cybersecurity".]
It is important to note the distinction between "LLMs in cybersecurity" and "cybersecurity of LLMs", with the former
being the focus of this survey.
A survey of papers leveraging LLMs in the context of cybersecurity was carried out by searching for the keywords "Large Language Models" AND "Cybersecurity" and further selecting papers that used LLMs as a tool for cybersecurity in offensive or defensive contexts. An overview of the papers is shown in Table 1. The papers discussed in the following subsections were selected for further analysis due to their increased applicability as autonomous vulnerability assessment agents in the cybersecurity context.
Table 1: Summary of Publications on Large Language Models in Cybersecurity

Paper | Application | Models
Detecting Phishing Sites Using ChatGPT [60] | Phishing website detection | GPT-3.5; GPT-4
ChatIDS: Explainable Cybersecurity Using Generative AI [61] | IDS alert explainability | GPT-3.5-turbo
Devising and Detecting Phishing: Large Language Models vs. Smaller Human Models [62] | Phishing email detection | GPT-4; Claude; PaLM; LLaMA
Large Language Models Can Be Used To Effectively Scale Spear Phishing Campaigns [63] | Spear phishing email creation | GPT-3.5; GPT-4
Getting pwn’d by AI: Penetration Testing with Large Language Models [64] | Penetration testing and vulnerability detection | GPT-3.5-turbo
Revolutionizing Cyber Threat Detection with Large Language Models [65] | Threat detection and response | SecureBERT
SecureFalcon: The Next Cyber Reasoning System for Cyber Security [66] | Source code vulnerability detection | FalconLLM
Prompting Is All You Need: Automated Android Bug Replay with Large Language Models [67] | Android bug replay | GPT-3.5
On the Uses of Large Language Models to Interpret Ambiguous Cyberattack Descriptions [68] | TTP interpretation | BERT; SecureBERT
PentestGPT: An LLM-empowered Automatic Penetration Testing Tool [69] | Penetration testing | GPT-3.5; GPT-4; LaMDA
From Text to MITRE Techniques: Exploring the Malicious Use of Large Language Models for Generating Cyber Attack Payloads [70] | Malicious payload generation | GPT-3.5; LaMDA
RatGPT: Turning online LLMs into Proxies for Malware Attacks [71] | Attack proxy | GPT-3.5
Transformer-Based Language Models for Software Vulnerability Detection [72] | Source code vulnerability detection | BERT; DistilBERT; RoBERTa; CodeBERT; GPT-2; MegatronBERT; MegatronGPT-2; GPT-J
VulRepair: A T5-based Automated Software Vulnerability Repair [73] | Source code vulnerability fixing | CodeT5
3.1 PentestGPT: An LLM-empowered Automatic Penetration Testing Tool
The study first builds on current base models (GPT-3.5, GPT-4 and Bard) through an iterative approach, i.e. passing information back and forth between the tested system and the LLM through prompting, in order to test their usability in a pentesting scenario. The authors find that the LLMs are able to provide a good "intuition" on how to proceed with the task; however, a major drawback is that context gets lost as prompting advances, and the system loses its ability to correctly probe and decide on further tasks.
In order to solve these issues, the authors propose a framework with tooling and benchmarks, with its most significant
constituent being the "PentestGPT", an LLM with reasoning, generation and parsing modules. Each module repre-
sents a role within a red team and was developed through Chain-of-Thought prompting that defined its role on the
commercially available models GPT-3.5 and GPT-4. As shown in benchmarking, the "PentestGPT-GPT4" approach
presented the best results.
3.2 Getting pwn’d by AI: Penetration Testing with Large Language Models
Two approaches were taken in this study: as a "sparring" partner for penetration testing, in which the LLM acts in
a Q&A fashion; and as an automated agent connected to a virtual machine (VM) in a scenario where the pentester has already obtained initial access and is attempting further privilege escalation.
The LLM performed well in the Q&A approach and replied with realistic and feasible suggestions in order to perform
the initial pentesting, although the content filtering present in commercial models sometimes impacted the results.
For the second approach, a script was made that kept the model in a loop with the VM so that it could issue commands
and get the output from the terminal. The model was successfully able to gain root privilege multiple times within
the vulnerable machine, but failed to perform multi-step exploitation. Besides the previous issue with filtering, this approach also suffered from hallucinations, in which the model tried to run scripts that did not exist in the environment, although the authors found this did not occur too often.
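The command/feedback loop described above can be pictured with the sketch below. This is not the authors' script; query_llm and run_in_vm are hypothetical placeholders for an LLM API call and for command execution inside an isolated, authorized lab VM.

    # Sketch of an LLM-driven command loop against a lab VM (illustrative only).
    def query_llm(prompt: str) -> str:
        # placeholder for a real chat-completion API call
        return "id"

    def run_in_vm(command: str) -> str:
        # placeholder for executing the command inside the authorized lab VM
        return "uid=1000(user) gid=1000(user)"

    history = "Goal: you have a low-privilege shell; suggest one command at a time.\n"
    for _ in range(10):  # bound the number of steps to avoid endless loops
        command = query_llm(history + "Next command:")
        output = run_in_vm(command)
        history += f"$ {command}\n{output}\n"  # feed terminal output back to the model
        if "uid=0(root)" in output:  # crude check for successful privilege escalation
            break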
3.3 Examining Zero-Shot Vulnerability Repair with Large Language Models
The study compares several commercial foundation models already pre-trained on code tasks. Zero-shot refers to direct prompting of the model, in contrast with few-shot prompting, in which the prompt is engineered to include a few examples. Because the models used were already trained mostly on code, they are better suited for code tasks with less prompting.
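The difference between zero-shot and few-shot prompting can be made concrete with the toy prompts below; the code snippet and wording are illustrative and not taken from the study.

    # Zero-shot vs. few-shot prompting for a code-fixing request (toy example).
    task = "Fix the vulnerability in: strcpy(buf, user_input);"

    zero_shot_prompt = task  # the model is asked directly, with no examples

    few_shot_prompt = (
        "Example 1:\n"
        "Vulnerable: gets(line);\n"
        "Fixed: fgets(line, sizeof(line), stdin);\n\n"
        "Example 2:\n"
        "Vulnerable: sprintf(out, fmt, name);\n"
        "Fixed: snprintf(out, sizeof(out), fmt, name);\n\n"
        + task  # the same request, now preceded by worked examples
    )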
The authors found that, although often successful, most commercial models struggle with more complex tasks, and results
were greatly dependent on the quality of the prompt. They also trained their own local model which, due to its reduced
size, did not perform as well as the commercial models. However, in general, the models were able to fix vulnerable
code, even if it took more than one attempt.
3.4 Out of the Cage: How Stochastic Parrots Win in Cyber Security Environments
The authors used commercial, pre-trained models to create single-prompt agents. The agents were prompted only with the instructions and rules of a game developed as a reinforcement learning playground, along with state and memory information, a few examples of valid actions, and a query for the next step. They found that memory and temperature (which controls the randomness of the output) were important factors in keeping the agent from getting stuck. There were also hallucination issues, in which the agent would suggest impossible actions.
The latest commercial model, GPT-4, also performed significantly better than GPT-3.5-turbo and showed better reasoning skills, getting stuck less often.
3.5 Using Large Language Models for Cybersecurity Capture-The-Flag Challenges and Certification
Questions
The study focuses on two scenarios of using LLMs: Q&A at the certification level; and CTF challenge solving.
In the first scenario, the LLM performed better on factual questions than on conceptual ones. The authors tested on Cisco certifications ranging from the associate to the expert level.
For the second scenario, the authors prompted the models with CTF challenges. Three commercially available models,
namely GPT-3.5, PaLM 2 and Prometheus, were tested. A back-and-forth conversation was held, in which the authors prompted the models with a description of the CTF challenge, asked for further direction and replied with feedback from exploring the challenge.
All models were successful in aiding with the CTF challenges, although one major limitation was security filtering,
which was bypassed through jailbreaks.
4 Limitations and mitigation approaches
Although current state-of-the-art LLMs excel at generating small to medium text snippets, the quality of the generated output tends to degrade quickly as the conversation evolves and when dealing with complex tasks where additional context may "leak" into the prompt. Notably, this issue is present even in models with large context windows [74].
Similarly, the way in which context is given to a model and even where relevant information is located, e.g. at the
beginning of the query vs at the end vs repeated throughout, may change the quality of the output provided by the
model [75].
This loss of context becomes especially challenging when trying to provide agency to a system in which the inputs and outputs cannot always be controlled and improved by a human supervisor, leading to a quick degradation of the model's performance on the task.
Another common issue with LLMs is hallucination, which happens when the model makes up content that is not grounded in reality or that contradicts previous content and context [76, 77, 78, 79, 80].
Hallucinations are harder to tackle and represent one of the most significant issues of current models. Some hallucinations are harder to pinpoint, as they can simply be considered bias from the dataset. This behaviour may even be desirable in some cases, such as for models in which creative generation is the desired output.
The most common approaches for mitigating these issues are given below:
4.1 Fine-tuning
Because current models are prohibitively expensive to fully retrain when new data becomes available, a further post-training step may be carried out instead [81]. This also allows private data, such as internal company documents and personal information, to be added to the models [82].
Several approaches are being researched to perform fine-tuning in a more efficient way [35, 34, 83, 84, 85, 86], and although it is orders of magnitude cheaper than training from scratch, its cost still remains high.
A more significant issue is catastrophic forgetting, in which the model loses some of its original capability after fine-tuning [87, 88, 89].
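As an illustration of parameter-efficient fine-tuning, the sketch below attaches LoRA adapters to a small causal language model using the Hugging Face transformers and peft libraries; the choice of base model, target modules and hyperparameters are assumptions that would need to be adapted to the foundation model actually used.

    # Sketch of LoRA-based parameter-efficient fine-tuning (hyperparameters are
    # illustrative; only the low-rank adapter weights are trained).
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import LoraConfig, get_peft_model

    base = AutoModelForCausalLM.from_pretrained("gpt2")   # small stand-in model
    tokenizer = AutoTokenizer.from_pretrained("gpt2")

    config = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
                        target_modules=["c_attn"])        # GPT-2 attention projection
    model = get_peft_model(base, config)
    model.print_trainable_parameters()  # reports the small fraction of trainable weights
    # A standard training loop (or the transformers Trainer) would then be run on
    # the task-specific dataset, updating only the adapter weights.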
4.2 Few-shot prompting
By providing the model with a few examples of what its expected behavior should look like, the model tends to perform better than if the examples were not shown to it [90, 40].
This technique is striking due to its simplicity as well as its effectiveness, and it can somewhat approximate the fine-tuning of a model without the more computationally intensive training step, providing instead guidance towards the desired behavior.
However, hallucinations are still common, and the examples provided via the prompt end up consuming tokens that could instead be used for task-relevant information, which decreases the effective prompting window.
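The token cost of in-context examples can be sketched as follows; the four-characters-per-token ratio and the assumed context window are rough assumptions used only to illustrate how the effective prompting window shrinks.

    # Rough illustration of how in-context examples consume the prompt budget.
    CONTEXT_WINDOW = 4096  # assumed context size of the model, in tokens

    def approx_tokens(text: str) -> int:
        return max(1, len(text) // 4)  # crude 4-characters-per-token estimate

    examples = "\n".join(f"Input: sample alert {i}\nOutput: label {i}" for i in range(8))
    question = "Classify the following alert: failed ssh login from 10.0.0.5"

    used = approx_tokens(examples) + approx_tokens(question)
    print("tokens spent on examples:", approx_tokens(examples))
    print("remaining for task input and answer:", CONTEXT_WINDOW - used)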
4.3 Retrieval-Augmented Generation (RAG)
By combining the LLM with a vector database, which indexes text by semantic proximity, it is possible to merge the results of the model's generation with the contents of the database, which are grounded in factual information [91, 92, 93, 94, 95].
This technique is especially relevant as it can also provide citations indicating where in the database the information comes from. It still has high costs, as the information must be processed and stored in the specialized database, and the retrieved information may be semantically related yet not factually relevant.
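A minimal sketch of the retrieval step is given below; embed is a hypothetical placeholder for a real sentence-embedding model and the three stored passages are toy examples, so this only illustrates how the retrieved text is prepended to the prompt.

    # Minimal retrieval-augmented generation sketch: retrieve the most similar
    # stored passage and prepend it to the prompt as grounding context.
    import numpy as np

    def embed(text: str) -> np.ndarray:
        # placeholder embedder: normalized character histogram (not a real model)
        vec = np.zeros(64)
        for byte in text.encode():
            vec[byte % 64] += 1
        return vec / (np.linalg.norm(vec) + 1e-9)

    documents = [
        "CVE-2021-44228: the Log4j JNDI lookup allows remote code execution.",
        "Port 22 is commonly used by the SSH service.",
        "SQL injection is mitigated by using parameterized queries.",
    ]
    doc_vectors = np.stack([embed(d) for d in documents])

    query = "How can SQL injection be prevented?"
    scores = doc_vectors @ embed(query)        # cosine similarity (vectors are unit norm)
    best = documents[int(np.argmax(scores))]

    prompt = f"Context: {best}\n\nQuestion: {query}\nAnswer citing the context."
    print(prompt)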
4.4 Chain-of-Verification (CoVe)
This approach involves tasking the model with creating a first draft of the response; questioning its own output and fact-
checking its responses independently, i.e. without context of the draft in order to avoid bias; and, finally, generating a
final, verified response [96]. Although the quality of the responses increases and hallucination is greatly reduced, this approach does not eliminate hallucinations completely and significantly increases the computational load, as each query must be repeated together with the verification prompts.
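The verification pipeline can be sketched as follows; llm is a hypothetical placeholder for a single model call and the prompts are simplified relative to the original paper, so this only illustrates the draft/verify/revise structure.

    # Simplified Chain-of-Verification sketch: draft, verify independently, revise.
    def llm(prompt: str) -> str:
        return "placeholder response"  # stand-in for a real chat-completion call

    def chain_of_verification(question: str) -> str:
        draft = llm(f"Answer the question:\n{question}")
        # Plan verification questions based on the draft.
        checks = llm(f"List factual questions that would verify this answer:\n{draft}")
        # Answer each check independently, without the draft, to avoid biasing the check.
        verified = [llm(f"Answer briefly:\n{q}") for q in checks.splitlines() if q.strip()]
        # Produce the final answer conditioned on the independently verified facts.
        return llm(f"Question: {question}\nDraft: {draft}\n"
                   f"Verified facts: {verified}\nWrite a corrected final answer.")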
5 Proposed solution
Mixture-of-Experts (MoE) is a technique already used in earlier deep learning systems [97] that is making its way into current state-of-the-art LLMs [98, 99] and allows for high-quality inference without the high costs associated with other techniques.
The step-by-step overview of this approach is as follows:
Subtask identification: The main task is subdivided into subtasks which may be addressed independently;
Expert modelling: Each subtask is assigned to an expert model;
Gating modelling: A model is trained to recognize the initial task and the expert model to which it should be routed;
Combination: The output is produced by combining the results of the experts with the gating model, as expressed below.
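In its generic form, with notation chosen here only for illustration, the combination step can be written as a gate-weighted sum of the expert outputs:

    y(x) = \sum_{i=1}^{N} g_i(x)\, e_i(x), \qquad \sum_{i=1}^{N} g_i(x) = 1,

where e_i denotes the i-th expert model and g_i(x) is the weight the gating model assigns to that expert for input x.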
The proposed solution is a Mixture-of-Experts-based approach in which each expert is a task-specific foundation
model. Foundation models are usually trained for a desired task, such as instruction following, conversation and code
generation. The combination of specialized models may lead to reduced hallucinations, better context management and better specificity in the results.
This technique is particularly well suited for vulnerability assessment tasks, given that they require a wide-ranging skill set.
The approach is similar to PentestGPT [69], in which three different modules are presented: Reasoning, Generation
and Parsing. Each module has its own set of tasks; however, specialization is acquired merely by keeping each module within a narrow task space at inference time, i.e. the models themselves are not specially trained. Although not as in-depth, this approach already manages to be effective.
We propose greater specialization and model-localization by leveraging different foundation LLMs for each task as
expert models, with one acting as the gating model which chooses the expert for the task at hand.
Instruction-tuned models [100] are ideal for the reasoning aspects of the tasks as they are already aligned with sen-
tences structured as instructions.
Models specialized in code [101] may then receive instructions to interpret one section of code, find vulnerabilities or
develop exploits which can then loop back to the other models as a response or tool to be used.
The gating model should be trained on the widest range of subjects so that it can understand the given task and to which expert model it should be delegated, e.g. it needs to understand what code is, but it does not need to excel at coding. Such gating models can be seen as the "generic", commercial foundation models [102, 15, 103].
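A minimal sketch of the proposed gating-plus-experts arrangement is given below; all of the *_llm functions are hypothetical placeholders for calls to the corresponding foundation models, and the routing labels are illustrative only.

    # Sketch of the proposed routing: a gating LLM labels the subtask and delegates
    # it to a task-specific expert model (all model calls are placeholders).
    def gating_llm(task: str) -> str:
        # would ask a generic foundation model to classify the request as one of
        # "reasoning", "code" or "parsing"; a fixed label stands in here
        return "parsing"

    def reasoning_llm(task: str) -> str:
        return "placeholder plan"         # instruction-tuned expert

    def code_llm(task: str) -> str:
        return "placeholder code review"  # code-specialized expert

    def parsing_llm(task: str) -> str:
        return "placeholder summary"      # expert for parsing tool output

    EXPERTS = {"reasoning": reasoning_llm, "code": code_llm, "parsing": parsing_llm}

    def handle(task: str) -> str:
        label = gating_llm(task)
        expert = EXPERTS.get(label, reasoning_llm)  # fall back to the reasoning expert
        return expert(task)                         # expert output can loop back as feedback

    print(handle("Summarize this nmap scan output."))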
6 Final remarks
In conclusion, the Mixture-of-Experts framework presented in this paper represents a significant stride towards enhanc-
ing cybersecurity practices, leveraging the remarkable capabilities of Large Language Models (LLMs) for pen testing
and vulnerability assessment. By orchestrating the collaboration of specialized LLMs, each adept in distinct cyberse-
curity domains, we have introduced a scalable and adaptable solution that holds promise in addressing the multifaceted
challenges posed by modern cyber threats. The system’s ability to allocate specific tasks to LLMs specializing in code
analysis, reasoning, and anomaly detection ensures a comprehensive evaluation of security postures.
The implications of this research extend beyond theoretical frameworks, offering tangible prospects for fortifying
digital infrastructures. The adaptability of the Mixture-of-Experts architecture positions it as a dynamic tool that can
evolve alongside the ever-changing cybersecurity landscape. Nevertheless, it is imperative to acknowledge the current
limitations, such as the dependence on the quality of LLMs and the necessity for comprehensive training datasets.
Practical validation in diverse environments is essential to substantiate the efficacy of the proposed framework in
real-world cybersecurity scenarios.
[Figure: Overview of the proposed architecture, showing the gating model, the expert models, the probed system and the feedback loop between them.]
As the cybersecurity community grapples with increasingly sophisticated threats, the collaborative intelligence har-
nessed by the Mixture-of-Experts framework provides a foundation for future innovations. Through continuous refine-
ment and validation, this approach has the potential to not only augment existing security measures but also lay the
groundwork for intelligent, adaptive defense mechanisms against emerging digital risks.
7 Future directions
While this research has laid the groundwork for integrating LLMs into cybersecurity practices, numerous avenues
for future exploration emerge. To further enhance the Mixture-of-Experts framework, a concerted effort is needed to
refine the expertise of specialized LLMs. This involves continuous fine-tuning, leveraging domain-specific datasets,
and exploring techniques to mitigate biases that may arise during the training process.
Expanding the scope of the framework to encompass additional cybersecurity domains is paramount. Future research
should explore the integration of LLMs specialized in threat intelligence, incident response, and even regulatory com-
pliance. This expansion will contribute to a more holistic and multifaceted approach to cybersecurity, addressing not
only vulnerabilities and threats but also the broader landscape of risk management and compliance.
Empirical validation remains a crucial aspect of future research endeavors. Conducting extensive experiments in di-
verse and realistic cybersecurity environments will provide valuable insights into the practical effectiveness of the
Mixture-of-Experts framework. This includes evaluating the framework’s performance in scenarios involving sophis-
ticated adversaries, varied network architectures, and different industry sectors.
Furthermore, as ethical considerations surrounding AI and cybersecurity continue to gain prominence, future research
should delve into the development of responsible AI frameworks within the proposed system. Ensuring transparency,
accountability, and fairness in decision-making processes will be essential for gaining trust in the deployment of such
intelligent systems within critical security infrastructures.
In summary, the future trajectory of this research should focus on refining, expanding, and validating the Mixture-
of-Experts framework, paving the way for a new era of intelligent and collaborative cybersecurity defenses. As we
navigate the complexities of the digital landscape, the ongoing development of advanced frameworks will be instru-
mental in safeguarding our interconnected world against evolving cyber threats.
References
[1] Irena Bojanova and Carlos Eduardo C Galhardo. Bug, fault, error, or weakness: Demystifying software security
vulnerabilities. IT Prof., 25(1):7–12, January 2023.
[2] Fabian M. Teichmann and Sonia R. Boticiu. An overview of the benefits, challenges, and legal aspects of
penetration testing and red teaming. International Cybersecurity Law Review, 4(4):387–397, September 2023.
[3] Jian hua Li. Cyber security meets artificial intelligence: a survey. Frontiers of Information Technology &
Electronic Engineering, 19(12):1462–1474, December 2018.
[4] Zhimin Zhang, Huansheng Ning, Feifei Shi, Fadi Farha, Yang Xu, Jiabo Xu, Fan Zhang, and Kim-Kwang Ray-
mond Choo. Artificial intelligence in cyber security: research advances, challenges, and opportunities. Artificial
Intelligence Review, 55(2):1029–1053, March 2021.
[5] Jacob A Harer, Louis Y Kim, Rebecca L Russell, Onur Ozdemir, Leonard R Kosta, Akshay Rangamani, Lei H
Hamilton, Gabriel I Centeno, Jonathan R Key, Paul M Ellingwood, Erik Antelman, Alan Mackay, Marc W
McConley, Jeffrey M Opper, Peter Chin, and Tomo Lazovich. Automated software vulnerability detection with
machine learning. February 2018.
[6] Guanjun Lin, Sheng Wen, Qing-Long Han, Jun Zhang, and Yang Xiang. Software vulnerability detection using
deep neural networks: A survey. Proc. IEEE Inst. Electr. Electron. Eng., 108(10):1825–1848, October 2020.
[7] Bingchang Liu, Liang Shi, Zhuhua Cai, and Min Li. Software vulnerability discovery techniques: A survey. In
2012 Fourth International Conference on Multimedia Information Networking and Security. IEEE, November
2012.
[8] Francesco Lomio, Emanuele Iannone, Andrea De Lucia, Fabio Palomba, and Valentina Lenarduzzi. Just-in-time
software vulnerability detection: Are we there yet? Journal of Systems and Software, 188:111283, 2022.
[9] Deepaa Selva, Balakrishnan Nagaraj, Danil Pelusi, Rajendran Arunkumar, and Ajay Nair. Intelligent network
intrusion prevention feature collection and classification algorithms. Algorithms, 14(8), 2021.
[10] Hui Wang, Zijian Cao, and Bo Hong. A network intrusion detection system based on convolutional neural
network. J. Intell. Fuzzy Syst., 38(6):7623–7637, June 2020.
[11] Iqbal H Sarker, Md Hasan Furhad, and Raza Nowrozy. AI-driven cybersecurity: An overview, security intelli-
gence modeling and research directions. SN Comput. Sci., 2(3), May 2021.
[12] Murat Kuzlu, Corinne Fair, and Ozgur Guler. Role of artificial intelligence in the internet of things (IoT)
cybersecurity. Discov. Internet Things, 1(1), December 2021.
[13] Mariarosaria Taddeo, Tom McCutcheon, and Luciano Floridi. Trusting artificial intelligence in cybersecurity is
a double-edged sword. Nat. Mach. Intell., 1(12):557–560, November 2019.
[14] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are
unsupervised multitask learners. 2018.
[15] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix,
Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard
Grave, and Guillaume Lample. LLaMA: Open and efficient foundation language models. February 2023.
[16] OpenAI. GPT-4 technical report. March 2023.
[17] Zhenguo Hu, Razvan Beuran, and Yasuo Tan. Automated penetration testing using deep reinforcement learning.
In 2020 IEEE European Symposium on Security and Privacy Workshops (EuroS&PW), pages 2–10, 2020.
[18] Dean Richard McKinnel, Tooska Dargahi, Ali Dehghantanha, and Kim-Kwang Raymond Choo. A systematic
literature review and meta-analysis on artificial intelligence in penetration testing and vulnerability assessment.
Computers & Electrical Engineering, 75:175–188, 2019.
[19] Jonathon Schwartz and Hanna Kurniawati. Autonomous penetration testing using reinforcement learning.
CoRR, abs/1905.05965, 2019.
[20] R A Kemmerer. Cybersecurity. In 25th International Conference on Software Engineering, 2003. Proceedings.
IEEE, 2003.
[21] Mamoona Humayun, Mahmood Niazi, NZ Jhanjhi, Mohammad Alshayeb, and Sajjad Mahmood. Cyber se-
curity threats and vulnerabilities: A systematic mapping study. Arabian Journal for Science and Engineering,
45(4):3171–3189, January 2020.
[22] Kutub Thakur, Meikang Qiu, Keke Gai, and Md Liakat Ali. An investigation on cyber security threats and
security models. In 2015 IEEE 2nd International Conference on Cyber Security and Cloud Computing, pages
307–311, 2015.
[23] Aileen G Bacudio, Xiaohong Yuan, Bei-Tseng Bill Chu, and Monique Jones. An overview of penetration
testing. International Journal of Network Security & Its Applications, 3(6):19, 2011.
[24] Sugandh Shah and Babu M Mehtre. An overview of vulnerability assessment and penetration testing techniques.
Journal of Computer Virology and Hacking Techniques, 11:27–49, 2015.
[25] Prashant S Shinde and Shrikant B Ardhapurkar. Cyber security analysis using vulnerability assessment and
penetration testing. In 2016 World Conference on Futuristic Trends in Research and Innovation for Social
Welfare (Startup Conclave), pages 1–5. IEEE, 2016.
[26] F. Rosenblatt. The perceptron - a perceiving and recognizing automaton. Technical Report 85-460-1, Cornell
Aeronautical Laboratory, Ithaca, New York, January 1957.
[27] F. Rosenblatt. The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65(6):386–408, 1958.
[28] David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. Learning representations by back-propagating errors.
[29] David C. Plaut and Geoffrey E. Hinton. Learning sets of filters using back-propagation. 2(1):35–61.
[30] Geoffrey E. Hinton. Learning multiple layers of representation. 11(10):428–434.
[31] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. 521(7553):436–444.
[32] Yoav Goldberg. A primer on neural network models for natural language processing, 2015.
[33] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser,
and Illia Polosukhin. Attention Is All You Need.
[34] Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. QLoRA: Efficient Finetuning of Quan-
tized LLMs.
[35] Edward Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu
Chen. LoRA: Low-rank adaptation of large language models.
[36] Renrui Zhang, Jiaming Han, Aojun Zhou, Xiangfei Hu, Shilin Yan, Pan Lu, Hongsheng Li, Peng Gao, and
Yu Qiao. LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention.
[37] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language Models are
Unsupervised Multitask Learners.
[38] Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten
Bosma, Denny Zhou, Donald Metzler, Ed H. Chi, Tatsunori Hashimoto, Oriol Vinyals, Percy Liang, Jeff Dean,
and William Fedus. Emergent abilities of large language models, 2022.
[39] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H Chi, Quoc V Le, and
Denny Zhou. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models.
[40] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Nee-
lakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen
Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter,
Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christo-
pher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language Models are Few-
Shot Learners.
[41] Briland Hitaj, Paolo Gasti, Giuseppe Ateniese, and Fernando Perez-Cruz. PassGAN: A deep learning approach
for password guessing. September 2017.
[42] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser,
and Illia Polosukhin. Attention is all you need. June 2017.
[43] Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L Wainwright, Pamela Mishkin, Chong Zhang, Sand-
hini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie
Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. Training language models
to follow instructions with human feedback. March 2022.
[44] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri
Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael
Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov,
Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winter, Philippe Tillet, Felipe Petroski Such,
Dave Cummings, Matthias Plappert, Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, William Hebgen
Guss, Alex Nichol, Alex Paino, Nikolas Tezak, Jie Tang, Igor Babuschkin, Suchir Balaji, Shantanu Jain, William
Saunders, Christopher Hesse, Andrew N Carr, Jan Leike, Josh Achiam, Vedant Misra, Evan Morikawa, Alec
Radford, Matthew Knight, Miles Brundage, Mira Murati, Katie Mayer, Peter Welinder, Bob McGrew, Dario
Amodei, Sam McCandlish, Ilya Sutskever, and Wojciech Zaremba. Evaluating large language models trained
on code. July 2021.
[45] Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting
Liu, Daxin Jiang, and Ming Zhou. CodeBERT: A pre-trained model for programming and natural languages.
February 2020.
[46] Jingxuan He and Martin Vechev. Controlling large language models to generate secure and vulnerable code.
February 2023.
[47] Anastasiia Grishina. Enabling automatic repair of source code vulnerabilities using data-driven methods. Febru-
ary 2022.
[48] Michael Fu, Chakkrit Tantithamthavorn, Trung Le, Van Nguyen, and Dinh Phung. VulRepair: a t5-based
automated software vulnerability repair. In Proceedings of the 30th ACM Joint European Software Engineering
Conference and Symposium on the Foundations of Software Engineering, New York, NY, USA, November 2022.
ACM.
[49] Chandra Thapa, Seung Ick Jang, Muhammad Ejaz Ahmed, Seyit Camtepe, Josef Pieprzyk, and Surya Nepal.
Transformer-based language models for software vulnerability detection. April 2022.
[50] Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse,
Shantanu Jain, Vineet Kosaraju, William Saunders, Xu Jiang, Karl Cobbe, Tyna Eloundou, Gretchen Krueger,
Kevin Button, Matthew Knight, Benjamin Chess, and John Schulman. WebGPT: Browser-assisted question-
answering with human feedback. December 2021.
[51] Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola
Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. February
2023.
[52] Yujia Qin, Shengding Hu, Yankai Lin, Weize Chen, Ning Ding, Ganqu Cui, Zheni Zeng, Yufei Huang, Chaojun
Xiao, Chi Han, Yi Ren Fung, Yusheng Su, Huadong Wang, Cheng Qian, Runchu Tian, Kunlun Zhu, Shihao
Liang, Xingyu Shen, Bokai Xu, Zhen Zhang, Yining Ye, Bowen Li, Ziwei Tang, Jing Yi, Yuzhang Zhu, Zhen-
ning Dai, Lan Yan, Xin Cong, Yaxi Lu, Weilin Zhao, Yuxiang Huang, Junxi Yan, Xu Han, Xian Sun, Dahai
Li, Jason Phang, Cheng Yang, Tongshuang Wu, Heng Ji, Zhiyuan Liu, and Maosong Sun. Tool learning with
foundation models, 2023.
[53] Daniil A Boiko, Robert MacKnight, and Gabe Gomes. Emergent autonomous scientific research capabilities of
large language models. April 2023.
[54] Wesley Tann, Yuancheng Liu, Jun Heng Sim, Choon Meng Seah, and Ee-Chien Chang. Using Large Language
Models for Cybersecurity Capture-The-Flag Challenges and Certification Questions.
[55] Tanmay Singla, Dharun Anandayuvaraj, Kelechi G. Kalu, Taylor R. Schorlemmer, and James C. Davis. An
Empirical Study on Using Large Language Models to Analyze Software Supply Chain Security Failures.
[56] Maria Rigaki, Ondřej Lukáš, Carlos A. Catania, and Sebastian Garcia. Out of the Cage: How Stochastic Parrots
Win in Cyber Security Environments.
[57] Hammond Pearce, Benjamin Tan, Baleegh Ahmad, Ramesh Karri, and Brendan Dolan-Gavitt. Examining Zero-
Shot Vulnerability Repair with Large Language Models. In 2023 IEEE Symposium on Security and Privacy
(SP), pages 2339–2356.
[58] Marwan Omar. VulDefend: A Novel Technique based on Pattern-exploiting Training for Detecting Software
Vulnerabilities Using Language Models. In 2023 IEEE Jordan International Joint Conference on Electrical
Engineering and Information Technology (JEEIT), pages 287–293.
[59] Timothy McIntosh, Tong Liu, Teo Susnjak, Hooman Alavizadeh, Alex Ng, Raza Nowrozy, and Paul Watters.
Harnessing GPT-4 for Generation of Cybersecurity GRC Policies: A Focus on Ransomware Attack Mitigation.
page 103424.
[60] Takashi Koide, Naoki Fukushi, Hiroki Nakano, and Daiki Chiba. Detecting Phishing Sites Using ChatGPT.
[61] Victor Jüttner, Martin Grimmer, and Erik Buchmann. ChatIDS: Explainable Cybersecurity Using Generative
AI.
[62] Fredrik Heiding, Bruce Schneier, Arun Vishwanath, and Jeremy Bernstein. Devising and Detecting Phishing:
Large language models vs. Smaller Human Models.
[63] Julian Hazell. Large Language Models Can Be Used To Effectively Scale Spear Phishing Campaigns.
[64] Andreas Happe and Jürgen Cito. Getting pwn’d by AI: Penetration Testing with Large Language Models.
[65] Mohamed Amine Ferrag, Mthandazo Ndhlovu, Norbert Tihanyi, Lucas C. Cordeiro, Merouane Debbah, and
Thierry Lestable. Revolutionizing Cyber Threat Detection with Large Language Models.
[66] Mohamed Amine Ferrag, Ammar Battah, Norbert Tihanyi, Merouane Debbah, Thierry Lestable, and Lucas C.
Cordeiro. SecureFalcon: The Next Cyber Reasoning System for Cyber Security.
[67] Sidong Feng and Chunyang Chen. Prompting Is All You Need: Automated Android Bug Replay with Large
Language Models.
[68] Reza Fayyazi and Shanchieh Jay Yang. On the Uses of Large Language Models to Interpret Ambiguous Cyber-
attack Descriptions.
[69] Gelei Deng, Yi Liu, Víctor Mayoral-Vilches, Peng Liu, Yuekang Li, Yuan Xu, Tianwei Zhang, Yang Liu, Martin
Pinzger, and Stefan Rass. PentestGPT: An LLM-empowered Automatic Penetration Testing Tool.
[70] P. V. Sai Charan, Hrushikesh Chunduri, P. Mohan Anand, and Sandeep K. Shukla. From Text to MITRE
Techniques: Exploring the Malicious Use of Large Language Models for Generating Cyber Attack Payloads.
[71] Mika Beckerich, Laura Plein, and Sergio Coronado. RatGPT: Turning online LLMs into Proxies for Malware
Attacks.
[72] Chandra Thapa, Seung Ick Jang, Muhammad Ejaz Ahmed, Seyit Camtepe, Josef Pieprzyk, and Surya Nepal.
Transformer-Based Language Models for Software Vulnerability Detection. In Proceedings of the 38th Annual
Computer Security Applications Conference, pages 481–496. ACM.
[73] Michael Fu, Chakkrit Tantithamthavorn, Trung Le, Van Nguyen, and Dinh Phung. VulRepair: A T5-based
automated software vulnerability repair. In Proceedings of the 30th ACM Joint European Software Engineering
Conference and Symposium on the Foundations of Software Engineering, pages 935–947. ACM.
[74] Freda Shi, Xinyun Chen, Kanishka Misra, Nathan Scales, David Dohan, Ed Chi, Nathanael Schärli, and Denny
Zhou. Large language models can be easily distracted by irrelevant context, 2023.
[75] Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang.
Lost in the middle: How language models use long contexts, 2023.
[76] Niels Mündler, Jingxuan He, Slobodan Jenko, and Martin Vechev. Self-contradictory hallucinations of large
language models: Evaluation, detection and mitigation, 2023.
[77] Nouha Dziri, Sivan Milton, Mo Yu, Osmar Zaiane, and Siva Reddy. On the origin of hallucinations in conver-
sational models: Is it the datasets or the models?, 2022.
[78] Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Ye Jin Bang, Andrea Madotto,
and Pascale Fung. Survey of hallucination in natural language generation. ACM Computing Surveys, 55(12):1–
38, mar 2023.
[79] Yue Zhang, Yafu Li, Leyang Cui, Deng Cai, Lemao Liu, Tingchen Fu, Xinting Huang, Enbo Zhao, Yu Zhang,
Yulong Chen, Longyue Wang, Anh Tuan Luu, Wei Bi, Freda Shi, and Shuming Shi. Siren’s song in the ai ocean:
A survey on hallucination in large language models, 2023.
[80] Joshua Maynez, Shashi Narayan, Bernd Bohnet, and Ryan McDonald. On faithfulness and factuality in abstrac-
tive summarization, 2020.
[81] Shengyu Zhang, Linfeng Dong, Xiaoya Li, Sen Zhang, Xiaofei Sun, Shuhe Wang, Jiwei Li, Runyi Hu, Tianwei
Zhang, Fei Wu, and Guoyin Wang. Instruction tuning for large language models: A survey, 2023.
[82] Rouzbeh Behnia, Mohammadreza Reza Ebrahimi, Jason Pacheco, and Balaji Padmanabhan. EW-tune: A
framework for privately fine-tuning large language models with differential privacy. In 2022 IEEE International
Conference on Data Mining Workshops (ICDMW). IEEE, nov 2022.
[83] Sadhika Malladi, Tianyu Gao, Eshaan Nichani, Alex Damian, Jason D. Lee, Danqi Chen, and Sanjeev Arora.
Fine-Tuning Language Models with Just Forward Passes.
[84] Yukang Chen, Shengju Qian, Haotian Tang, Xin Lai, Zhijian Liu, Song Han, and Jiaya Jia. Longlora: Efficient
fine-tuning of long-context large language models, 2023.
[85] Ning Ding, Yujia Qin, Guang Yang, Fuchao Wei, Zonghan Yang, Yusheng Su, Shengding Hu, Yulin Chen, Chi-
Min Chan, Weize Chen, Jing Yi, Weilin Zhao, Xiaozhi Wang, Zhiyuan Liu, Hai-Tao Zheng, Jianfei Chen, Yang
Liu, Jie Tang, Juanzi Li, and Maosong Sun. Parameter-efficient fine-tuning of large-scale pre-trained language
models. Nature Machine Intelligence, 5(3):220–235, March 2023.
[86] Kai Lv, Yuqing Yang, Tengxiao Liu, Qinghui Gao, Qipeng Guo, and Xipeng Qiu. Full parameter fine-tuning
for large language models with limited resources, 2023.
[87] Yong Lin, Lu Tan, Hangyu Lin, Zeming Zheng, Renjie Pi, Jipeng Zhang, Shizhe Diao, Haoxiang Wang, Han
Zhao, Yuan Yao, and Tong Zhang. Speciality vs generality: An empirical study on catastrophic forgetting in
fine-tuning foundation models, 2023.
[88] Yun Luo, Zhen Yang, Fandong Meng, Yafu Li, Jie Zhou, and Yue Zhang. An empirical study of catastrophic
forgetting in large language models during continual fine-tuning, 2023.
[89] Yuexiang Zhai, Shengbang Tong, Xiao Li, Mu Cai, Qing Qu, Yong Jae Lee, and Yi Ma. Investigating the
catastrophic forgetting in multimodal large language models, 2023.
[90] Jerry Wei, Jason Wei, Yi Tay, Dustin Tran, Albert Webson, Yifeng Lu, Xinyun Chen, Hanxiao Liu, Da Huang,
Denny Zhou, and Tengyu Ma. Larger language models do in-context learning differently, 2023.
[91] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich
Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. Retrieval-augmented
generation for knowledge-intensive nlp tasks. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and
H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 9459–9474. Curran
Associates, Inc., 2020.
[92] Deng Cai, Yan Wang, Lemao Liu, and Shuming Shi. Recent advances in retrieval-augmented text generation.
In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information
Retrieval, SIGIR ’22, page 3417–3419, New York, NY, USA, 2022. Association for Computing Machinery.
[93] Yuning Mao, Pengcheng He, Xiaodong Liu, Yelong Shen, Jianfeng Gao, Jiawei Han, and Weizhu Chen.
Generation-augmented retrieval for open-domain question answering, 2021.
[94] Huayang Li, Yixuan Su, Deng Cai, Yan Wang, and Lemao Liu. A survey on retrieval-augmented text generation,
2022.
[95] Shangqing Liu, Yu Chen, Xiaofei Xie, Jingkai Siow, and Yang Liu. Retrieval-augmented generation for code
summarization via hybrid gnn, 2021.
[96] Shehzaad Dhuliawala, Mojtaba Komeili, Jing Xu, Roberta Raileanu, Xian Li, Asli Celikyilmaz, and Jason
Weston. Chain-of-verification reduces hallucination in large language models, 2023.
[97] Seniha Esen Yuksel, Joseph N. Wilson, and Paul D. Gader. Twenty years of mixture of experts. IEEE Transac-
tions on Neural Networks and Learning Systems, 23(8):1177–1193, 2012.
[98] Yanqi Zhou, Tao Lei, Hanxiao Liu, Nan Du, Yanping Huang, Vincent Zhao, Andrew Dai, Zhifeng Chen, Quoc
Le, and James Laudon. Mixture-of-experts with expert choice routing, 2022.
[99] Joan Puigcerver, Carlos Riquelme, Basil Mustafa, and Neil Houlsby. From sparse to soft mixtures of experts,
2023.
[100] Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang,
Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Mad-
die Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. Training language
models to follow instructions with human feedback, 2022.
[101] Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu
Liu, Tal Remez, Jérémy Rapin, Artyom Kozhevnikov, Ivan Evtimov, Joanna Bitton, Manish Bhatt, Cristian Can-
ton Ferrer, Aaron Grattafiori, Wenhan Xiong, Alexandre Défossez, Jade Copet, Faisal Azhar, Hugo Touvron,
Louis Martin, Nicolas Usunier, Thomas Scialom, and Gabriel Synnaeve. Code llama: Open foundation models
for code, 2023.
[102] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bash-
lykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer,
Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia
Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin
Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne
Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mi-
haylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan
Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang,
Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan,
Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom.
Llama 2: Open Foundation and Fine-Tuned Chat Models.
[103] OpenAI. GPT-4 Technical Report.