LLM_Honeypot_Leveraging_Large_Language_Models_as_A
models have the potential to revolutionize honeypot technology by enabling the creation of highly realistic and interactive honeypots. Leveraging LLMs, honeypots can engage with attackers in a more sophisticated manner, providing more secure and intelligent responses. Recent research demonstrates the feasibility of using LLMs as dynamic honeypot servers by employing pre-trained models like ChatGPT. Even without extensive fine-tuning, well-crafted prompts can allow these models to observe and study attacker behaviors and tactics effectively [9]–[11]. However, a significant challenge in using chatbots as honeypots is the potential for attackers to detect and identify the honeypot due to static elements or predictable behaviors. One way to address this problem is to make honeypot environments more dynamic and realistic, use advanced behavioral analysis, and create continuous learning models that can better adapt to new attack patterns [9]. These advancements would enhance the capability of honeypots to remain covert and effective in detecting sophisticated threats.

Therefore, in this study, we aim to develop an interactive LLM-Honeypot system based on Large Language Models (LLMs) capable of accurately mimicking the behavior of a Linux server. By fine-tuning a pre-trained open-source language model using a dataset of attacker-generated commands and responses, we seek to create an interactive honeypot system that can engage with attackers in a highly realistic and informative manner. This approach has the potential to significantly enhance the effectiveness of honeypot technology, offering cybersecurity professionals a powerful tool to detect and analyze malicious activities.

II. METHODOLOGY

This study develops an LLM-based honeypot to interact with attackers and gather insights into their tactics. As shown in Figure 1, a multi-stage pipeline was created. It starts with collecting and preprocessing a dataset of attacker commands and responses. This data is used for Supervised Fine-Tuning (SFT) on a pre-trained language model, enhancing its ability to mimic a Linux server. The fine-tuned model is then rigorously evaluated to ensure it effectively engages with attackers and provides valuable security insights. Finally, the optimized honeypot is deployed to a public IP address for real-world interaction with potential threats.

A. Data Collection and Processing

To develop the honeypot, we used log records from a Cowrie honeypot on a public cloud endpoint [12]. Cowrie, a medium-interaction honeypot, logs brute-force attacks and shell commands via SSH and Telnet [13], simulating system compromises and subsequent interactions [14].

We parsed terminal commands from a public honeypot dataset [10], providing real-world attacker data. To enhance the dataset, we included commonly used Linux commands [15], ensuring the model could respond accurately to various scenarios. Additionally, we added 283 command-explanation pairs [16], providing context to improve the model's ability to generate accurate responses. This comprehensive approach improved the model's performance and engagement with attackers.

Overall, the combination of real-world attacker data, common Linux commands, and detailed command explanations formed a robust training dataset. This combined dataset played a crucial role in fine-tuning the language model to function effectively as a honeypot, capable of providing realistic and intelligent interactions with attackers:

• Dataset #1: consisting of 174 commands parsed from the cloud-deployed Cowrie honeypot logs [10].
• Dataset #2: comprising the top 100 Linux commands [15] with manually populated variations, totaling 160 commands.
• Dataset #3: summaries of 283 Linux commands' man pages [16].

We processed the collected data to prepare it for language model training, essential for developing our fine-tuned LLM to mimic a honeypot. This involved transforming raw data into a format suitable for effective training. Initially, we combined multiple datasets to create a collection of 617 Linux commands, covering a wide range of scenarios to ensure model robustness. Using a local Cowrie honeypot system, we simulated command execution via SSH, capturing responses in a controlled environment and saving these interactions as logs. This resulted in a substantial dataset of command-response pairs.

Next, we performed text preprocessing on the dataset, including tokenizing the text data and converting tokens into a standardized format for training. These preprocessing steps were crucial for maintaining dataset quality and consistency. By transforming raw data into a structured format, we laid a foundation for training our LLM, contributing to the development of a realistic and interactive honeypot.

B. Prompt Engineering

By analyzing the prompts utilized in prior research [9]–[11], we rigorously tested and refined our prompts to ensure they aligned with our objectives. This iterative process of prompt engineering was essential to optimizing the model's interaction with the dataset, ultimately contributing to the development of a highly effective honeypot system.

    You are mimicking a Linux server. Respond
    with what the terminal would respond when
    a code is given. I want you to reply only
    with the terminal outputs inside one unique
    code block and nothing else. Do not write
    any explanations. Do not type any commands
    unless I instruct you to do so.

Listing 1. Prompt 1

For the first part of the dataset, we designed prompts that mimicked a Linux terminal prompt, instructing the model to respond as a Linux server would. These prompts ensured that the model generated responses that were concise, accurate, and formatted like a Linux terminal output. A sample prompt for the first part is provided in Listing 1.

    You are a Linux expert. You understand
    what every Linux terminal command does
    and you reply with the explanation
    when asked.

Listing 2. Prompt 2

For the second part of the dataset, we created a set of prompts that positioned the model as a Linux expert, capable of providing detailed explanations of Linux terminal commands. These prompts enabled the model to generate informative and context-rich responses that demonstrated a deep understanding of Linux commands and their applications. Again, a sample prompt is provided in Listing 2.

C. Model Selection

The rapid development of Large Language Models (LLMs) has provided powerful tools for various applications, including honeypot mimicry. Selecting the correct model is critical for accurately simulating interactions and balancing computational efficiency with performance for real-time deployment.

We tested several recent models, including Llama3 [17], Phi 3 [18], CodeLlama [19], and Codestral [20]. Llama3, with its 8B and 70B variants, offers scalable language processing, while Phi 3, CodeLlama, and Codestral are notable for their focus on code-development tasks. However, our experiments showed that code-centric models like Phi 3, CodeLlama, and Codestral were less effective for honeypot simulation. Larger models (8B and 70B) were too slow, emphasizing the need for computational speed. Smaller models demonstrated sufficient capability, suggesting the importance of balancing model size and efficiency.

These findings highlight the need for a model that excels in linguistic proficiency and meets practical demands for speed and resource management. Therefore, we chose the Llama3 8B model for our honeypot LLM.

D. Supervised Fine-Tuning (SFT)

Supervised Fine-Tuning (SFT) is essential for adapting large pre-trained models to specific tasks. We fine-tuned the foundation models using LlamaFactory [21] with our curated dataset. To enhance training efficiency, we employed Low-Rank Adaptation (LoRA) [22], which reduces the number of trainable parameters by decomposing the weight matrices into lower-rank representations, allowing for efficient training without sacrificing performance.
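The parameter savings that LoRA provides can be illustrated with a short, self-contained sketch. The dimensions, rank, and scaling factor below are illustrative examples only, not the configuration used in our training runs:

```python
import numpy as np

# Illustrative LoRA update: instead of training a full d_out x d_in weight
# update, LoRA trains two low-rank factors B (d_out x r) and A (r x d_in).
# The sizes and alpha below are examples, not our training configuration.
d_in, d_out, r, alpha = 4096, 4096, 8, 16

full_params = d_out * d_in            # parameters in a full weight update
lora_params = d_out * r + r * d_in    # parameters LoRA actually trains
print(f"full: {full_params:,}  LoRA: {lora_params:,}  "
      f"reduction: {full_params // lora_params}x")
# prints: full: 16,777,216  LoRA: 65,536  reduction: 256x

# Effective weight at inference: W + (alpha / r) * (B @ A). B starts at
# zero, so fine-tuning begins exactly at the pre-trained weights W.
rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64))     # small stand-in for a frozen weight
A = rng.standard_normal((r, 64)) * 0.01
B = np.zeros((64, r))
W_adapted = W + (alpha / r) * (B @ A)
assert np.allclose(W_adapted, W)      # zero-initialized B leaves W unchanged
```

With these example dimensions, LoRA trains roughly 0.4% of the parameters a full update would, which is what makes fine-tuning an 8B-parameter model tractable on a small number of GPUs.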
Fig. 2. Interactive LLM-Honeypot Server Framework
Quantized Low-Rank Adapters (QLoRA) further optimized the model by quantizing it to 8-bit precision, reducing its size and computational load while maintaining accuracy. To prevent overfitting and improve generalization, we incorporated NEFTune noise [23], a regularization technique that introduces noise during training. Additionally, Flash Attention 2 [24] was integrated to enhance attention mechanism efficiency, critical for processing long sequences.

The final model, fine-tuned to respond like a honeypot server, achieves a balance between efficiency and accuracy using these advanced techniques. The model is publicly accessible on Huggingface (huggingface.co/hotal/honeypot-llama3-8B) and our GitHub page (github.com/AI-in-Complex-Systems-Lab/LLM-Honeypot).

III. EXPERIMENTAL RESULTS

The experimental results of our proposed approach include an analysis of training losses, evaluation metrics, and a comparative performance assessment of different models. We begin by detailing the experimental setup, which involved significant computational resources, utilizing 2 × NVIDIA RTX A6000 (40GB VRAM) GPUs for training the models. Following this, we provide a comprehensive analysis of the results, highlighting the effectiveness and efficiency of our fine-tuned model in mimicking honeypot behavior.

A. Interactive LLM-Honeypot Framework

Large Language Models (LLMs) are primarily designed to process and generate natural language text, and as such, they do not natively understand network traffic data. To bridge this gap and leverage LLMs' capabilities for cybersecurity applications, we developed a wrapper that interfaces the LLM with network traffic at the IP (Layer 3) level. This wrapper enables the system to act as a vulnerable server, capable of engaging with attackers through realistic interactions.

In Figure 2, we illustrate the architecture of our LLM-based honeypot system, which integrates an SSH server with a Large Language Model (LLM) to simulate realistic interactions with potential attackers. The setup involves the following components:

1) Attacker Interface: Represented by the icon on the left, this interface depicts the external entity attempting to interact with the honeypot system via the SSH (Secure Shell) protocol. Attackers use this interface to execute commands and probe the system.
2) SSH Server: The central component of the system, highlighted in purple, is the SSH server. This server acts as the entry point for all incoming SSH connections from attackers. It is configured to handle authentication, manage sessions, and relay commands to the integrated LLM.
3) Large Language Model (LLM): Embedded within the SSH server and shown in green, the LLM is fine-tuned to mimic the behavior of a typical Linux server. Upon receiving commands from the SSH server, the LLM processes these commands and generates appropriate responses. This model leverages pre-trained data and fine-tuning techniques to provide realistic and contextually relevant replies.
4) Interaction Flow: The arrows indicate the flow of interactions. The attacker initiates a connection and sends commands to the SSH server, which then forwards these commands to the LLM. The LLM processes the commands and generates responses, which are sent back to the SSH server and subsequently relayed to the attacker.

By combining the SSH server with a sophisticated LLM, our system can engage attackers in a realistic manner, capturing valuable data on their tactics and techniques. This architecture not only enhances the honeypot's ability to simulate genuine server interactions but also provides a robust framework for analyzing attacker behavior and improving overall cybersecurity defenses.

B. Custom SSH Server Wrapper

To deploy the final model as a functional honeypot server, we crafted a custom SSH server using Python's Paramiko library [25]. This server integrates our fine-tuned language model to generate realistic responses.
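The relay-and-logging core of this wrapper can be sketched as follows. This is a simplified, hypothetical illustration: generate_response is a canned stand-in for querying the fine-tuned Llama3 model, and the transport and authentication machinery that Paramiko provides in the real server is omitted.

```python
import json
from datetime import datetime, timezone

def generate_response(command: str) -> str:
    """Stand-in for the fine-tuned LLM; returns canned terminal output."""
    canned = {
        "whoami": "root",
        "echo 'hello world'": "hello world",
    }
    fallback = f"bash: {command.split()[0]}: command not found"
    return canned.get(command, fallback)

class HoneypotSession:
    """Relays attacker commands to the model and logs every interaction."""

    def __init__(self, client_ip: str, username: str, password: str):
        self.client_ip = client_ip
        # Any credential pair is accepted and recorded for later analysis.
        self.credentials = (username, password)
        self.log = []

    def handle(self, command: str) -> str:
        response = generate_response(command)
        self.log.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "ip": self.client_ip,
            "command": command,
            "response": response,
        })
        return response

session = HoneypotSession("203.0.113.7", "root", "123456")
print(session.handle("whoami"))  # prints: root
print(json.dumps([entry["command"] for entry in session.log]))
# prints: ["whoami"]
```

In the deployed server, a Paramiko ServerInterface accepts the SSH connection, records the offered credentials, and hands each received command line to a handler of this shape, so that every command and model response is captured alongside the attacker's IP address.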
Figure 3 displays an example SSH connection and the corresponding responses to issued commands. The custom SSH server operates as follows:
1) SSH Connection: The user connects to the honeypot server using ssh -T -p 2222 root@localhost, simulating an attack.
2) Authentication: The server prompts for a password; upon success, the user accesses the honeypot's command-line interface.
3) Command Execution: The user runs Linux commands (e.g., ls -al, echo 'hello world', ifconfig), which the SSH server forwards to the integrated LLM.
4) LLM Response Generation: The LLM generates responses mimicking a real Linux server (e.g., listing directory contents, outputting text, displaying network configuration).
5) Interaction Logging: The honeypot logs all commands and responses, capturing data on attacker behavior for cybersecurity analysis.

For generating inferences, we utilized the Huggingface Transformers library [26]. Our custom SSH server is capable of collecting the IP addresses of incoming SSH connections, username-password pairs (for authentication), and logs of every command along with the responses generated by the model. By incorporating an LLM, our custom SSH server can engage attackers in a realistic manner, providing insights into their actions and enhancing the honeypot's overall functionality.

C. Training Loss Analysis

As illustrated in Figure 4, the training losses of our fine-tuned model exhibit a steady decline over the training steps. This trend indicates that the model effectively learned from our dataset and adapted well to the task of mimicking a Linux server. During the fine-tuning phase, we employed a learning rate of 5 × 10^-4 and conducted a total of 36 training steps. The entire training process was completed in 14 minutes. The consistent decrease in training loss demonstrates the model's capability to improve its performance progressively, thereby enhancing its ability to generate realistic and contextually appropriate responses.

Fig. 4. Training losses over 36 steps in Supervised Fine-Tuning

D. Similarity Analysis with Cowrie Outputs

To evaluate the performance of our fine-tuned language model, Llama3-8B, we employed multiple metrics to measure the similarity between the expected (Cowrie) and generated terminal outputs. We used cosine similarity to quantify the cosine of the angle between two vectors in high-dimensional space, with higher scores indicating better performance. Additionally, we used the Jaro-Winkler similarity, which measures the similarity based on matching characters and necessary transpositions, with higher scores indicating closer matches. Finally, we utilized the Levenshtein distance, which calculates the minimum number of single-character edits needed to transform one string into another, with lower scores indicating closer matches. These diverse metrics provided a comprehensive evaluation of our model's performance.

TABLE I
SIMILARITY AND DISTANCE METRICS

                             Mean Score
Metric                    Base    Fine-Tuned
Cosine Similarity         0.663   0.695
Jaro-Winkler Similarity   0.534   0.599
Levenshtein Distance      0.332   0.285
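The three metrics can be computed with standard, dependency-free implementations such as the sketch below. The example strings are illustrative, not drawn from our evaluation set, and the simple bag-of-words vectorization here stands in for whatever vector representation is used when computing cosine similarity over terminal outputs:

```python
import math
from collections import Counter

def cosine_similarity(a: str, b: str) -> float:
    """Cosine of the angle between bag-of-words count vectors."""
    va, vb = Counter(a.split()), Counter(b.split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = (math.sqrt(sum(v * v for v in va.values()))
            * math.sqrt(sum(v * v for v in vb.values())))
    return dot / norm if norm else 0.0

def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def jaro_winkler(a: str, b: str, p: float = 0.1) -> float:
    """Jaro similarity boosted by a shared-prefix bonus (Winkler)."""
    if a == b:
        return 1.0
    if not a or not b:
        return 0.0
    window = max(len(a), len(b)) // 2 - 1
    used = [False] * len(b)
    matched_a = []
    for i, ca in enumerate(a):          # greedy matching within the window
        for j in range(max(0, i - window), min(len(b), i + window + 1)):
            if not used[j] and b[j] == ca:
                used[j] = True
                matched_a.append(ca)
                break
    matched_b = [b[j] for j in range(len(b)) if used[j]]
    m = len(matched_a)
    if m == 0:
        return 0.0
    t = sum(x != y for x, y in zip(matched_a, matched_b)) // 2
    jaro = (m / len(a) + m / len(b) + (m - t) / m) / 3
    prefix = 0                          # shared prefix, capped at 4 chars
    for x, y in zip(a, b):
        if x != y or prefix == 4:
            break
        prefix += 1
    return jaro + prefix * p * (1 - jaro)

expected = "-rw-r--r-- 1 root root 0 Jul 1 12:00 notes.txt"
generated = "-rw-r--r-- 1 root root 4096 Jul 1 12:00 notes.txt"
print(round(cosine_similarity(expected, generated), 3))  # prints: 0.923
print(levenshtein("kitten", "sitting"))                  # prints: 3
print(round(jaro_winkler("MARTHA", "MARHTA"), 4))        # prints: 0.9611
```

A distance in [0, 1], as reported in Table I, can be obtained by dividing the raw edit distance by the length of the longer of the two strings.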