LLM_Honeypot_Leveraging_Large_Language_Models_as_A
models have the potential to revolutionize honeypot technology by enabling the creation of highly realistic and interactive honeypots. Leveraging LLMs, honeypots can engage with attackers in a more sophisticated manner, providing more secure and intelligent responses. Recent research demonstrates the feasibility of using LLMs as dynamic honeypot servers by employing pre-trained models like ChatGPT. Even without extensive fine-tuning, well-crafted prompts can allow these models to observe and study attacker behaviors and tactics effectively [9]–[11]. However, a significant challenge in using chatbots as honeypots is the potential for attackers to detect and identify the honeypot due to static elements or predictable behaviors. One way to address this problem is to make honeypot environments more dynamic and realistic, use advanced behavioral analysis, and create continuous learning models that can better adapt to new attack patterns [9]. These advancements would enhance the capability of honeypots to remain covert and effective in detecting sophisticated threats.

Therefore, in this study, we aim to develop an interactive LLM-Honeypot system based on Large Language Models (LLMs) capable of accurately mimicking the behavior of a Linux server. By fine-tuning a pre-trained open-source language model using a dataset of attacker-generated commands and responses, we seek to create an interactive honeypot system that can engage with attackers in a highly realistic and informative manner. This approach has the potential to significantly enhance the effectiveness of honeypot technology, offering cybersecurity professionals a powerful tool to detect and analyze malicious activities.

II. METHODOLOGY

This study develops an LLM-based honeypot to interact with attackers and gather insights into their tactics. As shown in Figure 1, a multi-stage pipeline was created. It starts with collecting and preprocessing a dataset of attacker commands and responses. This data is used for Supervised Fine-Tuning (SFT) on a pre-trained language model, enhancing its ability to mimic a Linux server. The fine-tuned model is then rigorously evaluated to ensure it effectively engages with attackers and provides valuable security insights. Finally, the optimized honeypot is deployed to a public IP address for real-world interaction with potential threats.

A. Data Collection and Processing

To develop the honeypot, we used log records from a Cowrie honeypot on a public cloud endpoint [12]. Cowrie, a medium-interaction honeypot, logs brute-force attacks and shell commands via SSH and Telnet [13], simulating system compromises and subsequent interactions [14].

We parsed terminal commands from a public honeypot dataset [10], providing real-world attacker data. To enhance the dataset, we included commonly used Linux commands [15], ensuring the model could respond accurately to various scenarios. Additionally, we added 283 command-explanation pairs [16], providing context to improve the model's ability to generate accurate responses. This comprehensive approach improved the model's performance and engagement with attackers.

Overall, the combination of real-world attacker data, common Linux commands, and detailed command explanations formed a robust training dataset. This combined dataset played a crucial role in fine-tuning the language model to function effectively as a honeypot, capable of providing realistic and intelligent interactions with attackers:

• Dataset #1: consisting of 174 commands parsed from the cloud-deployed Cowrie honeypot logs [10].
• Dataset #2: comprising the top 100 Linux commands [15] with manually populated variations, totaling 160 commands.
• Dataset #3: summaries of 283 Linux commands' man pages [16].

We processed the collected data to prepare it for language model training, essential for developing our fine-tuned LLM to mimic a honeypot. This involved transforming raw data into a format suitable for effective training. Initially, we combined multiple datasets to create a collection of 617 Linux commands, covering a wide range of scenarios to ensure model robustness. Using a local Cowrie honeypot system, we simulated command execution via SSH, capturing responses in a controlled environment and saving these interactions as logs. This resulted in a substantial dataset of command-response pairs.

Next, we performed text preprocessing on the dataset, including tokenizing the text data and converting tokens into a standardized format for training. These preprocessing steps were crucial for maintaining dataset quality and consistency. By transforming raw data into a structured format, we laid a foundation for training our LLM, contributing to the development of a realistic and interactive honeypot.

B. Prompt Engineering

By analyzing the prompts utilized in prior research [9]–[11], we rigorously tested and refined our prompts to ensure they aligned with our objectives. This iterative process of prompt engineering was essential to optimizing the model's interaction with the dataset, ultimately contributing to the development of a highly effective honeypot system.

    You are mimicking a Linux server. Respond
    with what the terminal would respond when
    a code is given. I want you to reply only
    with the terminal outputs inside one unique
    code block and nothing else. Do not write
    any explanations. Do not type any commands
    unless I instruct you to do so.

Listing 1. Prompt 1

For the first part of the dataset, we designed prompts that mimicked a Linux terminal prompt, instructing the model to respond as a Linux server would. These prompts ensured that the model generated responses that were concise, accurate, and formatted like a Linux terminal output. A sample prompt for the first part is provided in Listing 1.

    You are a Linux expert. You understand
    what every Linux terminal command does
    and you reply with the explanation
    when asked.

Listing 2. Prompt 2

For the second part of the dataset, we created a set of prompts that positioned the model as a Linux expert, capable of providing detailed explanations of Linux terminal commands. These prompts enabled the model to generate informative and context-rich responses that demonstrated a deep understanding of Linux commands and their applications. Again, a sample prompt is provided in Listing 2.

C. Model Selection

The rapid development of Large Language Models (LLMs) has provided powerful tools for various applications, including honeypot mimicry. Selecting the correct model is critical for accurately simulating interactions and balancing computational efficiency with performance for real-time deployment.

We tested several recent models, including Llama3 [17], Phi 3 [18], CodeLlama [19], and Codestral [20]. Llama3, with its 8B and 70B variants, offers scalable language processing, while Phi 3, CodeLlama, and Codestral are notable for their focus on code-development tasks. However, our experiments showed that code-centric models like Phi 3, CodeLlama, and Codestral were less effective for honeypot simulation. Larger models (8B and 70B) were too slow, emphasizing the need for computational speed. Smaller models demonstrated sufficient capability, suggesting the importance of balancing model size and efficiency.

These findings highlight the need for a model that excels in linguistic proficiency and meets practical demands for speed and resource management. Therefore, we chose the Llama3 8B model for our honeypot LLM.

D. Supervised Fine-Tuning (SFT)

Supervised Fine-Tuning (SFT) is essential for adapting large pre-trained models to specific tasks. We fine-tuned the foundation models using LlamaFactory [21] with our curated dataset. To enhance training efficiency, we employed Low-Rank Adaptation (LoRA) [22], which reduces the number of trainable parameters by decomposing the weight matrices into lower-rank representations, allowing for efficient training without sacrificing performance.
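The parameter savings that LoRA provides can be illustrated with a short, self-contained sketch. The dimensions, rank, and scaling factor below are illustrative examples only, not the configuration used in our training runs:

```python
import numpy as np

# Illustrative LoRA update: instead of training a full d_out x d_in weight
# update, LoRA trains two low-rank factors B (d_out x r) and A (r x d_in).
# The sizes and alpha below are examples, not our training configuration.
d_in, d_out, r, alpha = 4096, 4096, 8, 16

full_params = d_out * d_in            # parameters in a full weight update
lora_params = d_out * r + r * d_in    # parameters LoRA actually trains
print(f"full: {full_params:,}  LoRA: {lora_params:,}  "
      f"reduction: {full_params // lora_params}x")
# prints: full: 16,777,216  LoRA: 65,536  reduction: 256x

# Effective weight at inference: W + (alpha / r) * (B @ A). B starts at
# zero, so fine-tuning begins exactly at the pre-trained weights W.
rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64))     # small stand-in for a frozen weight
A = rng.standard_normal((r, 64)) * 0.01
B = np.zeros((64, r))
W_adapted = W + (alpha / r) * (B @ A)
assert np.allclose(W_adapted, W)      # zero-initialized B leaves W unchanged
```

With these example dimensions, LoRA trains roughly 0.4% of the parameters a full update would, which is what makes fine-tuning an 8B-parameter model tractable on a small number of GPUs.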
Fig. 2. Interactive LLM-Honeypot Server Framework
Quantized Low-Rank Adapters (QLoRA) further optimized the model by quantizing it to 8-bit precision, reducing its size and computational load while maintaining accuracy. To prevent overfitting and improve generalization, we incorporated NEFTune noise [23], a regularization technique that introduces noise during training. Additionally, Flash Attention 2 [24] was integrated to enhance attention mechanism efficiency, critical for processing long sequences.

The final model, fine-tuned to respond like a honeypot server, achieves a balance between efficiency and accuracy using these advanced techniques. The model is publicly accessible on Huggingface (huggingface.co/hotal/honeypot-llama3-8B) and our GitHub page (github.com/AI-in-Complex-Systems-Lab/LLM-Honeypot).

III. EXPERIMENTAL RESULTS

The experimental results of our proposed approach include an analysis of training losses, evaluation metrics, and a comparative performance assessment of different models. We begin by detailing the experimental setup, which involved significant computational resources, utilizing 2 × NVIDIA RTX A6000 (40GB VRAM) GPUs for training the models. Following this, we provide a comprehensive analysis of the results, highlighting the effectiveness and efficiency of our fine-tuned model in mimicking honeypot behavior.

A. Interactive LLM-Honeypot Framework

Large Language Models (LLMs) are primarily designed to process and generate natural language text, and as such, they do not natively understand network traffic data. To bridge this gap and leverage LLMs' capabilities for cybersecurity applications, we developed a wrapper that interfaces the LLM with network traffic at the IP (Layer 3) level. This wrapper enables the system to act as a vulnerable server, capable of engaging with attackers through realistic interactions.

In Figure 2, we illustrate the architecture of our LLM-based honeypot system, which integrates an SSH server with a Large Language Model (LLM) to simulate realistic interactions with potential attackers. The setup involves the following components:

1) Attacker Interface: Represented by the icon on the left, this interface depicts the external entity attempting to interact with the honeypot system via the SSH (Secure Shell) protocol. Attackers use this interface to execute commands and probe the system.
2) SSH Server: The central component of the system, highlighted in purple, is the SSH server. This server acts as the entry point for all incoming SSH connections from attackers. It is configured to handle authentication, manage sessions, and relay commands to the integrated LLM.
3) Large Language Model (LLM): Embedded within the SSH server and shown in green, the LLM is fine-tuned to mimic the behavior of a typical Linux server. Upon receiving commands from the SSH server, the LLM processes these commands and generates appropriate responses. This model leverages pre-trained data and fine-tuning techniques to provide realistic and contextually relevant replies.
4) Interaction Flow: The arrows indicate the flow of interactions. The attacker initiates a connection and sends commands to the SSH server, which then forwards these commands to the LLM. The LLM processes the commands and generates responses, which are sent back to the SSH server and subsequently relayed to the attacker.

By combining the SSH server with a sophisticated LLM, our system can engage attackers in a realistic manner, capturing valuable data on their tactics and techniques. This architecture not only enhances the honeypot's ability to simulate genuine server interactions but also provides a robust framework for analyzing attacker behavior and improving overall cybersecurity defenses.

B. Custom SSH Server Wrapper

To deploy the final model as a functional honeypot server, we crafted a custom SSH server using Python's Paramiko library [25]. This server integrates our fine-tuned language model to generate realistic responses.
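The relay-and-logging core of this wrapper can be sketched as follows. This is a simplified, hypothetical illustration: generate_response is a canned stand-in for querying the fine-tuned Llama3 model, and the transport and authentication machinery that Paramiko provides in the real server is omitted.

```python
import json
from datetime import datetime, timezone

def generate_response(command: str) -> str:
    """Stand-in for the fine-tuned LLM; returns canned terminal output."""
    canned = {
        "whoami": "root",
        "echo 'hello world'": "hello world",
    }
    fallback = f"bash: {command.split()[0]}: command not found"
    return canned.get(command, fallback)

class HoneypotSession:
    """Relays attacker commands to the model and logs every interaction."""

    def __init__(self, client_ip: str, username: str, password: str):
        self.client_ip = client_ip
        # Any credential pair is accepted and recorded for later analysis.
        self.credentials = (username, password)
        self.log = []

    def handle(self, command: str) -> str:
        response = generate_response(command)
        self.log.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "ip": self.client_ip,
            "command": command,
            "response": response,
        })
        return response

session = HoneypotSession("203.0.113.7", "root", "123456")
print(session.handle("whoami"))  # prints: root
print(json.dumps([entry["command"] for entry in session.log]))
# prints: ["whoami"]
```

In the deployed server, a Paramiko ServerInterface accepts the SSH connection, records the offered credentials, and hands each received command line to a handler of this shape, so that every command and model response is captured alongside the attacker's IP address.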
Figure 3 displays an example SSH connection and the corresponding responses to issued commands. The custom SSH server operates as follows:
1) SSH Connection: The user connects to the honeypot server using ssh -T -p 2222 root@localhost, simulating an attack.
2) Authentication: The server prompts for a password; upon success, the user accesses the honeypot's command-line interface.
3) Command Execution: The user runs Linux commands (e.g., ls -al, echo 'hello world', ifconfig), which the SSH server forwards to the integrated LLM.
4) LLM Response Generation: The LLM generates responses mimicking a real Linux server (e.g., listing directory contents, outputting text, displaying network configuration).
5) Interaction Logging: The honeypot logs all commands and responses, capturing data on attacker behavior for cybersecurity analysis.

For generating inferences, we utilized the Huggingface Transformers library [26]. Our custom SSH server is capable of collecting the IP addresses of incoming SSH connections, username-password pairs (for authentication), and logs of every command along with the responses generated by the model. By incorporating an LLM, our custom SSH server can engage attackers in a realistic manner, providing insights into their actions and enhancing the honeypot's overall functionality.

C. Training Loss Analysis

As illustrated in Figure 4, the training losses of our fine-tuned model exhibit a steady decline over the training steps. This trend indicates that the model effectively learned from our dataset and adapted well to the task of mimicking a Linux server. During the fine-tuning phase, we employed a learning rate of 5 × 10^-4 and conducted a total of 36 training steps. The entire training process was completed in 14 minutes. The consistent decrease in training loss demonstrates the model's capability to improve its performance progressively, thereby enhancing its ability to generate realistic and contextually appropriate responses.

Fig. 4. Training losses over 36 steps in Supervised Fine-Tuning

D. Similarity Analysis with Cowrie Outputs

To evaluate the performance of our fine-tuned language model, Llama3-8B, we employed multiple metrics to measure the similarity between the expected (Cowrie) and generated terminal outputs. We used cosine similarity to quantify the cosine of the angle between two vectors in high-dimensional space, with higher scores indicating better performance. Additionally, we used the Jaro-Winkler similarity, which measures the similarity based on matching characters and necessary transpositions, with higher scores indicating closer matches. Finally, we utilized the Levenshtein distance, which calculates the minimum number of single-character edits needed to transform one string into another, with lower scores indicating closer matches. These diverse metrics provided a comprehensive evaluation of our model's performance.

TABLE I
SIMILARITY AND DISTANCE METRICS

                             Mean Score
Metric                    Base    Fine-Tuned
Cosine Similarity         0.663   0.695
Jaro-Winkler Similarity   0.534   0.599
Levenshtein Distance      0.332   0.285
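The three metrics can be computed with standard, dependency-free implementations such as the sketch below. The example strings are illustrative, not drawn from our evaluation set, and the simple bag-of-words vectorization here stands in for whatever vector representation is used when computing cosine similarity over terminal outputs:

```python
import math
from collections import Counter

def cosine_similarity(a: str, b: str) -> float:
    """Cosine of the angle between bag-of-words count vectors."""
    va, vb = Counter(a.split()), Counter(b.split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = (math.sqrt(sum(v * v for v in va.values()))
            * math.sqrt(sum(v * v for v in vb.values())))
    return dot / norm if norm else 0.0

def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def jaro_winkler(a: str, b: str, p: float = 0.1) -> float:
    """Jaro similarity boosted by a shared-prefix bonus (Winkler)."""
    if a == b:
        return 1.0
    if not a or not b:
        return 0.0
    window = max(len(a), len(b)) // 2 - 1
    used = [False] * len(b)
    matched_a = []
    for i, ca in enumerate(a):          # greedy matching within the window
        for j in range(max(0, i - window), min(len(b), i + window + 1)):
            if not used[j] and b[j] == ca:
                used[j] = True
                matched_a.append(ca)
                break
    matched_b = [b[j] for j in range(len(b)) if used[j]]
    m = len(matched_a)
    if m == 0:
        return 0.0
    t = sum(x != y for x, y in zip(matched_a, matched_b)) // 2
    jaro = (m / len(a) + m / len(b) + (m - t) / m) / 3
    prefix = 0                          # shared prefix, capped at 4 chars
    for x, y in zip(a, b):
        if x != y or prefix == 4:
            break
        prefix += 1
    return jaro + prefix * p * (1 - jaro)

expected = "-rw-r--r-- 1 root root 0 Jul 1 12:00 notes.txt"
generated = "-rw-r--r-- 1 root root 4096 Jul 1 12:00 notes.txt"
print(round(cosine_similarity(expected, generated), 3))  # prints: 0.923
print(levenshtein("kitten", "sitting"))                  # prints: 3
print(round(jaro_winkler("MARTHA", "MARHTA"), 4))        # prints: 0.9611
```

A distance in [0, 1], as reported in Table I, can be obtained by dividing the raw edit distance by the length of the longer of the two strings.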