Vul-RAG: Enhancing LLM-based Vulnerability Detection via Knowledge-level RAG

Xueying Du, Geng Zheng, Kaixin Wang, Jiayi Feng, Wentai Deng, Mingwei Liu, Bihuan Chen, Xin Peng, Tao Ma, and Yiling Lou
Fudan University
China
ABSTRACT
Vulnerability detection is essential for software quality assurance. In recent years, deep learning models (especially large language models) have shown promise in vulnerability detection. In this work, we propose a novel LLM-based vulnerability detection technique Vul-RAG, which leverages a knowledge-level retrieval-augmented generation (RAG) framework to detect vulnerabilities in the given code in three phases. First, Vul-RAG constructs a vulnerability knowledge base by extracting multi-dimension knowledge via LLMs from existing CVE instances; second, for a given code snippet, Vul-RAG retrieves the relevant vulnerability knowledge from the constructed knowledge base based on functional semantics; third, Vul-RAG leverages LLMs to check the vulnerability of the given code snippet by reasoning about the presence of the vulnerability causes and fixing solutions in the retrieved vulnerability knowledge. Our evaluation of Vul-RAG on our constructed benchmark PairVul shows that Vul-RAG substantially outperforms all baselines by 12.96%/110% relative improvement in accuracy/pairwise accuracy. In addition, our user study shows that the vulnerability knowledge generated by Vul-RAG can serve as high-quality explanations which can improve the manual detection accuracy from 0.60 to 0.77.

1 INTRODUCTION
Security vulnerabilities in software leave open doors for disruptive attacks, resulting in serious consequences during software execution. To date, there has been a large body of research on automated vulnerability detection techniques. In addition to leveraging traditional program analysis, deep learning has been incorporated into vulnerability detection techniques given the recent advance in the artificial intelligence domain.

Learning-based vulnerability detection techniques [1–6] mainly formulate vulnerability detection as a binary classification task for the given code: they first train different models (e.g., graph neural networks or pre-trained language models) on existing vulnerable code and benign code, and then predict the vulnerability for the given code. More recently, the rapid progress in large language models (LLMs) has further boosted learning-based vulnerability detection techniques. Due to their strong code and text comprehension capabilities, LLMs show promising effectiveness in analyzing malicious behaviors (e.g., bugs or vulnerabilities) in code [7–11]. For example, existing LLM-based vulnerability detection techniques incorporate prompt engineering (e.g., chain-of-thought [12, 13] and few-shot learning [14]) to facilitate more accurate vulnerability detection.

Preliminary Study. However, due to the limited interpretability of deep learning models, it remains unclear whether existing learning-based vulnerability detection techniques really understand and capture the code semantics related to vulnerable behaviors, especially when the only outputs of the models are binary labels (i.e., vulnerable or benign). To fill this knowledge gap, we first perform a preliminary study based on the assumption that "if the technique can precisely distinguish a pair of vulnerable code and non-vulnerable code with high lexical similarity (i.e., only differing in several tokens), we consider the technique with the better capability of capturing the vulnerability-related semantics in code". As two lexically-similar code snippets can differ in code semantics, it is likely that models have captured the high-level vulnerability-related semantics if they can precisely distinguish between them. As there is no existing vulnerability detection benchmark focusing on such pairs of vulnerable code and non-vulnerable code with high lexical similarity, we first construct a new benchmark PairVul which contains 4,314 pairs of vulnerable and patched code functions across 2,073 CVEs. We then evaluate three representative learning-based techniques (i.e., LLMAO [8], LineVul [6], and DeepDFA [3]) along with one static analysis technique (i.e., Cppcheck [15]) on our constructed benchmark to study their capability of distinguishing such code pairs. Based on the results, existing learning-based techniques actually exhibit rather limited effectiveness in distinguishing such lexically-similar code pairs. In particular, the accuracy on our benchmark PairVul drops to 0.50 ∼ 0.54, which is much lower than that reported on previous benchmarks (e.g., 0.99 accuracy of LineVul [6] on BigVul [16]). The results demonstrate that existing trained models have limited capabilities of capturing the high-level code semantics related to vulnerable behaviors in the given code.

Technique. Inspired by the observation in our preliminary study, our insight is to distinguish the vulnerable code from the similar-but-correct code with high-level vulnerability knowledge. In particular, based on how developers manually identify a vulnerability, understanding a vulnerability often involves the code semantics from three dimensions: (i) the functionality the code is implementing, (ii) the causes for the vulnerability, and (iii) the fixing solution for the vulnerability. Such high-level code semantics can serve as the vulnerability knowledge for vulnerability detection.

To this end, we propose a novel LLM-based vulnerability detection technique Vul-RAG, which leverages a knowledge-level retrieval-augmented generation (RAG) framework to detect vulnerability in the given code. The main idea of Vul-RAG is to leverage the LLM to reason for vulnerability detection based on similar vulnerability knowledge from existing vulnerabilities. In particular, Vul-RAG consists of three phases. First, Vul-RAG constructs a vulnerability knowledge base by extracting multi-dimension knowledge (i.e., functional semantics, causes, and fixing solutions) via LLMs from existing CVE instances; second, for a given code snippet, Vul-RAG retrieves the relevant vulnerability knowledge from the constructed knowledge base based on functional semantics; third, Vul-RAG leverages LLMs to check the vulnerability of the given code snippet by reasoning about the presence of the vulnerability causes and fixing solutions in the retrieved vulnerability knowledge. The main technical novelties of Vul-RAG include: (i) a novel representation of multi-dimension vulnerability knowledge that focuses on more high-level code semantics rather than lexical details, and (ii) a novel knowledge-level RAG framework for LLMs that first retrieves relevant knowledge based on functional semantics and then detects vulnerability by reasoning about the vulnerability causes and fixing solutions.

Evaluation. We further evaluate Vul-RAG on our benchmark PairVul. First, we compare Vul-RAG with three representative learning-based vulnerability detection techniques and one static analysis technique. The results show that Vul-RAG substantially outperforms all baselines by more precisely identifying the pairs of vulnerable code and similar-but-correct code, e.g., 12.96% improvement in accuracy and 110% improvement in pairwise accuracy (i.e., the ratio of pairs whose non-vulnerable code and vulnerable code are both correctly identified). Second, we evaluate the usefulness of our vulnerability knowledge by comparing Vul-RAG with both the basic GPT-4 and GPT-4 enhanced with code-level RAG. The results show that Vul-RAG consistently outperforms the two GPT-4-based variants in all metrics. Third, we further perform a user study of vulnerability detection with/without the vulnerability knowledge generated by Vul-RAG. The results show that the vulnerability knowledge can improve the manual detection accuracy from 0.60 to 0.77, and the user feedback also shows the high quality of the generated knowledge regarding helpfulness, preciseness, and generalizability. In summary, the evaluation results confirm the two-fold benefits of the proposed knowledge-level RAG framework: (i) enhancing automated vulnerability detection by better retrieving and utilizing existing vulnerability knowledge, and (ii) enhancing manual vulnerability detection by providing developer-friendly explanations for understanding vulnerable or non-vulnerable code.

In summary, this paper makes the following contributions:
• Benchmark. We construct a new benchmark PairVul that exclusively contains pairs of vulnerable code and similar-but-correct code.
• Preliminary Study. We perform the first study to find that existing learning-based techniques have limited capabilities of understanding and capturing the vulnerability-related code semantics.
• Technique. We construct a vulnerability knowledge base based on the proposed multi-dimension knowledge representation, and propose a novel knowledge-level RAG framework Vul-RAG for vulnerability detection.
• Evaluation. We evaluate Vul-RAG and find the vulnerability knowledge generated by Vul-RAG useful for both automated and manual vulnerability detection.

2 BACKGROUND

2.1 CVE and CWE
Existing vulnerability classification systems, such as Common Vulnerabilities and Exposures (CVE) [17] and Common Weakness Enumeration (CWE) [18], provide a comprehensive taxonomy for categorizing and managing vulnerabilities. CVE is a publicly disclosed list of common security vulnerabilities. Each vulnerability is assigned a unique identifier (CVE ID). A single CVE ID may be associated with multiple distinct code snippets.

CWE is a publicly accessible classification system of common software and hardware security vulnerabilities. Each weakness type within this enumeration is assigned a unique identifier (CWE ID). While CWE provides a broad classification of vulnerability types, the specific code behaviors leading to a vulnerability under a given CWE category may vary widely. For example, CWE-416 (Use After Free) [19] signifies the issue of referencing memory after it has been freed. The root cause of this vulnerability might stem from improper synchronization under race conditions (e.g., CVE-2023-30772 [20]), or errors in reference counting leading to premature object destruction (e.g., CVE-2023-3609 [21]).

2.2 Learning-based Vulnerability Detection
The recent advance in deep learning has boosted many learning-based vulnerability detection techniques.

GNN-based Vulnerability Detection [1–4] typically represents the code snippets under detection as graph-based intermediate representations, such as Abstract Syntax Trees (ASTs) or Control Flow Graphs (CFGs). Graph neural networks (GNNs) are then applied to these abstracted code representations for feature extraction. The features learned by the models are subsequently fed into a binary classifier for vulnerability detection.

PLM-based Vulnerability Detection [5, 6] typically involves fine-tuning existing PLMs on vulnerability detection datasets. In this way, code snippets are tokenized and processed by Pre-trained Language Models (PLMs, e.g., RoBERTa [22]), which serve as the encoder. The extracted features are then used for binary classification.

LLM-based Vulnerability Detection. This category leverages large language models (LLMs) for vulnerability detection via prompt engineering or fine-tuning [7–9]. The former leverages different prompting strategies, e.g., Chain-of-Thought (CoT) [12, 13] and few-shot learning [14, 23], for more accurate LLM-based vulnerability detection, without modifying the original LLM parameters; the latter updates LLM parameters by training on vulnerability detection datasets, to learn the features of vulnerable code.
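To make the shared shape of these pipelines concrete (encode the code into features, then apply a binary classifier), here is a minimal toy sketch. It is not any of the cited techniques: a bag-of-tokens encoder and hand-set linear weights stand in for a pre-trained encoder (e.g., RoBERTa) and a learned classification head, and the vocabulary and weights below are invented purely for illustration.

```python
# Toy sketch of the encode-then-classify pipeline used by learning-based
# detectors: tokenize code, encode it into a feature vector, then apply
# a binary classifier. A real system would use a pre-trained encoder and
# a trained classification head; here a bag-of-tokens encoder and a
# hand-set weight vector stand in for both (illustrative only).
import re
from collections import Counter

# Hypothetical vocabulary; real encoders learn subword vocabularies.
VOCAB = ["strcpy", "memcpy", "malloc", "free", "len", "check"]

def tokenize(code: str) -> list:
    """Split source code into identifier-like tokens."""
    return re.findall(r"[A-Za-z_]\w*", code)

def encode(code: str) -> list:
    """Bag-of-tokens feature vector over the fixed vocabulary."""
    counts = Counter(tokenize(code))
    return [float(counts[t]) for t in VOCAB]

def classify(features, weights, bias: float) -> int:
    """Linear binary classifier: 1 = vulnerable, 0 = benign."""
    score = sum(f * w for f, w in zip(features, weights)) + bias
    return int(score > 0.0)

# Hypothetical weights: unchecked copy APIs push toward "vulnerable",
# length checks push toward "benign".
weights = [2.0, 1.5, 0.5, 0.5, -1.0, -2.0]
snippet = "strcpy(dst, src);"
label = classify(encode(snippet), weights, bias=-1.0)
print(label)  # -> 1
```

The binary label output is exactly the interpretability problem raised in the preliminary study below: the prediction reveals nothing about which semantics the model actually used.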
2.3 Retrieval-Augmented Generation
Retrieval-Augmented Generation (RAG) is a general paradigm which enhances LLMs by including relevant information retrieved from external databases in the input [24]. RAG typically consists of three phases: indexing, retrieval, and generation. First, the indexing phase constructs external databases and their retrieval indices from external data sources; second, given a user query, the retrieval system utilizes these indices to fetch the relevant document chunks as context; third, the retrieved context is integrated into the input prompt for LLMs, which then generate the final output based on the augmented inputs. RAG has been widely used in various domains [25–28]. For example, RAG has been specialized to software engineering tasks such as code generation [27, 28], which retrieves similar code from the code base and augments the prompt with the retrieved code for model inference.

3 PRELIMINARY STUDY
Although existing learning-based vulnerability detection techniques show promising effectiveness, it still remains unclear whether these techniques really understand and capture the code semantics related to vulnerable behaviors, due to the weak interpretability of deep learning models. To fill this knowledge gap, in this preliminary study, we make the assumption that "if the technique can precisely distinguish a pair of vulnerable code and non-vulnerable code with high lexical similarity (i.e., only differing in several tokens), we consider the technique with the better capability of capturing the vulnerability-related semantics in code". As shown by the example in Figure 1, the vulnerable code is fixed by moving the statement inet_frag_lru_add(nf, qp) into the lock-protected code block; the pair of vulnerable code and non-vulnerable code share high lexical similarity but differ in semantics. To this end, we first propose to construct a benchmark that contains pairs of vulnerable code and its corresponding patched code, as patched code often shares high similarity with the original code; we then evaluate existing learning-based techniques on our constructed benchmark to study their capability of distinguishing such code pairs.

3.1 Benchmark PairVul
For the preliminary study, we first construct a new benchmark PairVul, as it is challenging and effort-intensive to prepare pairs of vulnerable code and patched code from existing vulnerability detection benchmarks. For example, Table 1 shows the detailed statistics of three widely-used benchmarks released between Jan. 2019 and Apr. 2024, i.e., BigVul [16], Devign [1], and Reveal [2]. The last row presents our constructed benchmark PairVul for comparison. In particular, existing datasets do not focus on pairs of vulnerable code and patched code, e.g., some do not include patched code while some contain non-similar vulnerable code and correct code with significantly different code lengths. Therefore, we construct a new benchmark PairVul that exclusively contains pairs of vulnerable code and patched code. In particular, in this work, we focus on function-level vulnerability detection given that it has been widely studied in previous learning-based research [3–5, 29].
Data Format. Specifically, our benchmark contains the following information for each vulnerability:
• CVE ID: The unique identifier assigned to a reported vulnerability in the Common Vulnerabilities and Exposures (CVE) list.
• CVE Description: Descriptions of the vulnerability provided by the CVE system, including the manifestation, the potential impact, and the environment where the vulnerability may occur.
• CWE ID: The Common Weakness Enumeration identifier that categorizes the type of weakness the vulnerability exploits.
• Vulnerable Code: The source code snippet containing the vulnerability, which will be modified in the commit.
• Patched Code: The source code snippet that has been committed to fix the vulnerability in the vulnerable code.
• Patch Diff: A detailed line-level difference between the vulnerable and patched code, consisting of added and deleted lines.
Construction Procedure. Given the representativeness of the Linux kernel in modern complex software systems, we use Linux kernel CVEs as the data source for our benchmark. The specific benchmark construction process involves the following two steps:
• Vulnerable and Patched Code Collection. We first collect all the CVEs related to the Linux kernel from Linux Kernel CVEs [30], an open-source project dedicated to automatically tracking CVEs within the upstream Linux kernel. Based on the list of collected CVE IDs, we further extract corresponding CWE IDs and CVE descriptions from the National Vulnerability Database (NVD), enriching our dataset with detailed vulnerability
categorizations and descriptions. Based on the CVE ID list, we then parse the commit information for each CVE to extract function-level vulnerable and patched code pairs. Vulnerable code snippets prior to the commit diffs are labeled as positive samples and the patched code snippets as negative samples. In this way, we initially obtain a dataset of 4,667 function pairs of vulnerable and patched code across 2,174 CVEs.
• Patched Code Verification. The patched code cannot always be assumed to be non-vulnerable, thus it is important to double-check the correctness of the patched code. To this end, we further implement a filtering process for the patched code by ensuring that it has not been subsequently reverted or modified by other commits.

Figure 1: A pair of vulnerable code and similar non-vulnerable code (the patched code). Both versions of inet_frag_intern are nearly identical: the vulnerable code calls inet_frag_lru_add(nf, qp) after releasing the locks (spin_unlock(&hb->chain_lock); read_unlock(&f->lock);), while the patched, non-vulnerable code calls it before unlocking, inside the lock-protected block.
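The Patched Code Verification step can be sketched as a pure filtering function: a pair is kept only if no commit after the fixing commit touches the patched function again. The commit records below are hypothetical stand-ins for the actual commit-history data (the paper does not specify its tooling); only the filtering logic is illustrated.

```python
# Sketch of the patched-code verification idea: keep a (vulnerable,
# patched) pair only if no commit after the fixing commit modifies the
# patched function again (which could revert or alter the fix).
# Commit fields are hypothetical stand-ins for `git log` metadata.
from dataclasses import dataclass, field

@dataclass
class Commit:
    sha: str
    timestamp: int                      # commit time, e.g., Unix epoch
    touched_functions: set = field(default_factory=set)

def patch_still_intact(fix: Commit, later_history, func: str) -> bool:
    """Return True if no commit after the fix modifies `func` again."""
    return all(
        func not in c.touched_functions
        for c in later_history
        if c.timestamp > fix.timestamp
    )

fix = Commit("abc123", 100, {"inet_frag_intern"})
history = [
    Commit("def456", 150, {"other_func"}),
    Commit("789aaa", 200, {"inet_frag_intern"}),  # later rework of the fix
]
print(patch_still_intact(fix, history, "inet_frag_intern"))       # False: pair excluded
print(patch_still_intact(fix, [history[0]], "inet_frag_intern"))  # True: pair kept
```

A real implementation would derive the touched-function sets from the repository history (e.g., by diffing each commit), but the keep/exclude decision reduces to this predicate.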
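The premise of PairVul, pairs that differ in only a few tokens, can be illustrated with a quick lexical-similarity check. The snippets below are shortened stand-ins modeled on the Figure 1 pair, not the full functions; only the placement of one call differs, yet the locking semantics change.

```python
# Lexical similarity of a vulnerable/patched pair (illustrative).
# The two stand-in snippets differ only in where inet_frag_lru_add is
# called relative to the unlock, mirroring the Figure 1 example.
import difflib

vulnerable = """spin_lock(&hb->chain_lock);
hlist_add_head(&qp->list, &hb->chain);
spin_unlock(&hb->chain_lock);
inet_frag_lru_add(nf, qp);
return qp;"""

patched = """spin_lock(&hb->chain_lock);
hlist_add_head(&qp->list, &hb->chain);
inet_frag_lru_add(nf, qp);
spin_unlock(&hb->chain_lock);
return qp;"""

ratio = difflib.SequenceMatcher(None, vulnerable, patched).ratio()
print(ratio)  # high character-level similarity despite different semantics
```

A purely lexical model sees two nearly identical strings here, which is exactly why token-level classifiers struggle on such pairs.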
Table 1: Existing Benchmarks for Vulnerability Detection. "Positive Number/Ratio" is the number/portion of vulnerable samples, "#CVE" is the number of CVEs, "Positive LOC"/"Negative LOC" is the average number of lines of vulnerable/non-vulnerable code, "Patched Code Included" means whether the patched code of the vulnerability is included, and "Patched Code Verified" means whether the patched code is verified as correct.

Benchmark | Time | Positive Number/Ratio | #CVE  | Positive LOC | Negative LOC | Patched Code Included | Patched Code Verified
BigVul    | 2020 | 10,900 (5.78%)        | 3,285 | 73.47        | 23.83        | N                     | /
Devign    | 2019 | 12,460 (45.61%)       | /     | 54.50        | 49.53        | N                     | /
ReVeal    | 2020 | 2,240 (9.85%)         | /     | 67.73        | 28.69        | Y                     | N
PairVul   | 2024 | 1,923 (50.00%)        | 896   | 68.58        | 70.25        | Y                     | Y
Benchmark Statistics. As a result, we obtain a new benchmark PairVul of 4,314 pairs of vulnerable and patched code functions across 2,073 CVEs. In this work, we focus on the top-5 CWEs in our benchmark given the non-trivial costs of model execution and manual analysis. In particular, as this work focuses on learning-based techniques which often require training datasets, we further divide the benchmark into a training set and a testing set in the following steps. For each CVE, we randomly select one instance into the testing set, with the remaining instances (if any) of the CVE going into the training set. We exclude cases where the code length exceeds the current token limit of GPT-3.5-turbo (i.e., 16,384 tokens). The final training set includes 896 CVEs with 1,462 pairs of vulnerable and patched code functions, while the testing set includes 373 CVEs with 592 pairs. The statistics of each CWE category in our benchmark are shown in Table 2.

Table 2: Statistics of each CWE in PairVul

         | Training Set               | Test Set
CWE      | CVE Num. | Func. Pair Num. | CVE Num. | Func. Pair Num.
CWE-416  | 339      | 587             | 145      | 267
CWE-476  | 194      | 262             | 60       | 89
CWE-362  | 169      | 280             | 81       | 121
CWE-119  | 129      | 163             | 42       | 53
CWE-787  | 122      | 170             | 45      | 62

3.2 Studied Baselines
We evaluate the following state-of-the-art (SOTA) vulnerability detection techniques on our benchmark PairVul.
• LLMAO [8]: An LLM-based fault localization approach fine-tuning an LLM (i.e., CodeGen), which has also been fine-tuned on the Devign dataset for vulnerability detection.
• LineVul [6]: A PLM-based vulnerability detection model, offering both function-level and line-level detection granularity.
• DeepDFA [3]: A GNN-based detection technique with a data flow analysis-guided graph learning framework, which is designed for function-level vulnerability detection.
• Cppcheck [15]: A widely-used open-source static analysis tool.
We directly utilize the public implementations of all baselines. To adapt the techniques to our benchmark, we fine-tune the learning-based baselines on our training set for 10 epochs. Additionally, as LLMAO is a line-level vulnerability detection technique, we adapt it to the function level by regarding any function that has a line with a higher-than-0.5 suspiciousness score as vulnerable.

3.3 Metrics
To further evaluate the capability in distinguishing a pair of vulnerable code and non-vulnerable code with high lexical similarity, we develop a new metric pairwise accuracy, which calculates the ratio of pairs whose vulnerable and patched code are both correctly identified. Besides, we also use six commonly-used metrics in vulnerability detection tasks, i.e., FN, FP, accuracy, precision, recall, and F1. FN is the ratio of false negatives; FP is the ratio of false positives; accuracy is the proportion of correctly detected instances; precision is the proportion of true positive predictions among all positive predictions; recall is the proportion of true positive predictions among all vulnerable instances; and F1-score is the harmonic mean of precision and recall, which balances both values.

Table 3: Effectiveness of SOTA techniques in PairVul

CWE      | Tech.         | FN    | FP    | Acc. | Pair Acc. | Precis. | Recall | F1
CWE-416  | CppCheck      | 50.0% | 0.0%  | 0.50 | 0.00      | /       | 0.00   | /
         | DeepDFA       | 9.3%  | 40.3% | 0.50 | 0.02      | 0.50    | 0.81   | 0.62
         | LineVul       | 0.0%  | 50.0% | 0.50 | 0.04      | 0.50    | 1.00   | 0.67
         | LLMAO         | 24.5% | 20.4% | 0.55 | 0.14      | 0.56    | 0.51   | 0.53
CWE-476  | CppCheck      | 48.9% | 0.6%  | 0.51 | 0.01      | 0.67    | 0.02   | 0.04
         | DeepDFA       | 8.5%  | 42.6% | 0.49 | 0.01      | 0.49    | 0.83   | 0.62
         | LineVul       | 12.9% | 33.7% | 0.54 | 0.09      | 0.53    | 0.75   | 0.62
         | LLMAO         | 44.9% | 3.4%  | 0.52 | 0.03      | 0.60    | 0.10   | 0.17
CWE-362  | CppCheck      | 49.6% | 0.0%  | 0.50 | 0.01      | 1.00    | 0.01   | 0.02
         | DeepDFA       | 5.9%  | 45.1% | 0.49 | 0.00      | 0.49    | 0.88   | 0.63
         | LineVul       | 10.7% | 40.9% | 0.49 | 0.02      | 0.49    | 0.79   | 0.61
         | LLMAO         | 16.9% | 30.2% | 0.53 | 0.11      | 0.52    | 0.66   | 0.58
CWE-119  | CppCheck      | 49.1% | 0.9%  | 0.50 | 0.00      | 0.50    | 0.02   | 0.04
         | DeepDFA       | 11.5% | 37.5% | 0.51 | 0.00      | 0.50    | 0.76   | 0.60
         | LineVul       | 19.8% | 32.1% | 0.49 | 0.04      | 0.49    | 0.62   | 0.55
         | LLMAO         | 45.3% | 2.8%  | 0.52 | 0.04      | 0.63    | 0.09   | 0.16
CWE-787  | CppCheck      | 48.4% | 1.6%  | 0.50 | 0.02      | 0.50    | 0.03   | 0.06
         | DeepDFA       | 9.8%  | 40.7% | 0.50 | 0.00      | 0.49    | 0.80   | 0.61
         | LineVul       | 4.0%  | 46.8% | 0.50 | 0.02      | 0.50    | 0.92   | 0.65
         | LLMAO         | 41.9% | 2.4%  | 0.56 | 0.11      | 0.77    | 0.16   | 0.27
Overall  | CppCheck      | 49.5% | 0.3%  | 0.50 | 0.01      | 0.60    | 0.01   | 0.02
         | DeepDFA       | 8.7%  | 41.4% | 0.50 | 0.01      | 0.49    | 0.82   | 0.62
         | LineVul       | 6.3%  | 43.8% | 0.50 | 0.02      | 0.50    | 0.87   | 0.64
         | LLMAO         | 29.7% | 16.4% | 0.54 | 0.10      | 0.55    | 0.41   | 0.47
         | Uniform Guess | 0.0%  | 100%  | 0.50 | 0.00      | 0.50    | 1.00   | 0.67

3.4 Results
As shown in Table 3, existing techniques exhibit limited effectiveness on our benchmark PairVul. In particular, compared to the effectiveness reported on previous benchmarks (e.g., 0.99 accuracy of LineVul on BigVul [6]), existing techniques perform much poorer on PairVul (ranging from 0.50 to 0.54 accuracy), showing even lower accuracy and F1 than the uniform guess (i.e., identifying all instances as vulnerable). In particular, the pairwise accuracy ranges from 0.01 to 0.10, indicating that existing learning-based techniques fail to capture the subtle differences between similar vulnerable code and non-vulnerable code. The observations imply that the learning-based models have limited capability of understanding the semantics related to the vulnerability.
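The pairwise accuracy metric defined in Section 3.3 can be sketched as follows; this is an illustrative reimplementation, not the paper's evaluation script. A prediction of 1 means "vulnerable" and 0 means "non-vulnerable".

```python
# Pairwise accuracy: a pair counts as correct only if the vulnerable
# member is predicted 1 (vulnerable) AND the patched member is
# predicted 0 (non-vulnerable). Illustrative sketch.
def pairwise_accuracy(predictions) -> float:
    """predictions: list of (pred_for_vulnerable, pred_for_patched) per pair."""
    correct = sum(1 for vuln_pred, patch_pred in predictions
                  if vuln_pred == 1 and patch_pred == 0)
    return correct / len(predictions)

# A uniform guess that labels everything vulnerable gets 0.50 plain
# accuracy on balanced pairs but 0.00 pairwise accuracy, as in Table 3.
uniform = [(1, 1)] * 4
print(pairwise_accuracy(uniform))  # -> 0.0

mixed = [(1, 0), (1, 1), (0, 0), (1, 0)]
print(pairwise_accuracy(mixed))    # -> 0.5
```

This is why pairwise accuracy is the stricter metric: it rewards only models that separate the two lexically-similar members of a pair.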
Our insight. In fact, two code snippets with subtle lexical differences can have different semantics (i.e., different functionalities). Therefore, identifying vulnerabilities based on high-level code semantics can help better distinguish the vulnerable code from the similar-but-correct code. In particular, based on how developers manually identify a vulnerability, understanding a vulnerability often involves the code semantics from three dimensions: (i) the functionality the code is implementing, (ii) the causes for the vulnerability, and (iii) the fixing solution for the vulnerability. Such high-level code semantics can serve as the knowledge for vulnerability detection. Therefore, in this work, we propose to distinguish the vulnerable code from the similar-but-correct code by enhancing LLMs with high-level vulnerability knowledge. In particular, we first leverage LLMs to automatically construct a vulnerability knowledge base from existing vulnerability instances, which is further utilized to boost LLMs in vulnerability detection.

4 APPROACH

4.1 Overview
In this work, we present a novel LLM-based vulnerability detection technique Vul-RAG, which leverages a knowledge-level RAG framework to detect vulnerability in the given code. The main idea of Vul-RAG is to leverage the LLM to reason for vulnerability detection based on similar vulnerability knowledge from existing vulnerabilities. Figure 2 shows the overview of our approach, which includes the following three phases.
• Phase-1: Offline Vulnerability Knowledge Base Construction (Section 4.2): Vul-RAG first constructs a vulnerability knowledge base by extracting multi-dimension knowledge via LLMs from existing CVE instances.
• Phase-2: Online Vulnerability Knowledge Retrieval (Section 4.3): For a given code snippet, Vul-RAG retrieves the relevant vulnerability knowledge from the constructed knowledge base based on functional semantics.
• Phase-3: Online Knowledge-Augmented Vulnerability Detection (Section 4.4): Vul-RAG leverages LLMs to check the vulnerability of the given code snippet by reasoning about the presence of the vulnerability causes and fixing solutions in the retrieved vulnerability knowledge.

Figure 2: Overview of Vul-RAG. The diagram shows the three phases: ① vulnerability knowledge base construction (functional semantics extraction, vulnerability causes and fixing solutions extraction, and knowledge abstraction over the CVE corpus, forming the vulnerability knowledge base); ② vulnerability knowledge retrieval (functional semantics extraction over the query source code, then retrieval of the top-N related vulnerability knowledge); ③ knowledge-augmented vulnerability detection (an LLM-based detection prompt over the source code under detection, producing detection results and related vulnerability knowledge for the user).

4.2 Vulnerability Knowledge Base Construction
To comprehensively summarize a vulnerability, we propose a three-dimension representation of vulnerability knowledge (Section 4.2.1). Based on this knowledge representation, Vul-RAG leverages LLMs to extract the relevant vulnerability knowledge from existing CVE instances, which further forms the knowledge base (Section 4.2.2).

4.2.1 Vulnerability Knowledge Representation. Vul-RAG represents the vulnerability knowledge of a CVE instance from three dimensions: functional semantics, vulnerability causes, and fixing solutions. Figure 3 exemplifies the three-dimension representation for CVE-2022-38457. In this case, the vulnerable code accesses a shared data structure within an RCU read lock context without a proper synchronization mechanism, allowing a race condition and a use-after-free vulnerability. To fix this vulnerability, the patched code adds a spin lock to protect the shared data structure.
• Functional Semantics: It summarizes the high-level functionality (i.e., what this code is doing) of the vulnerable code, including:
– Abstract purpose: brief summary of the code intention.
– Detailed behavior: detailed description of the code behavior.
• Vulnerability Causes: It describes the reasons for triggering vulnerable behaviors by comparing the vulnerable code and its corresponding patch. We consider causes described from different perspectives, including:
– Abstract vulnerability description: brief summary of the cause.
– Detailed vulnerability description: more concrete description of the causes.
– Triggering action: the direct action triggering the vulnerability, e.g., "concurrent access to shared data structures" in Figure 3.
• Fixing Solutions: It summarizes the fixing of the vulnerability by comparing the vulnerable code and its corresponding patch.

4.2.2 Knowledge Extraction. For each existing vulnerability instance (i.e., the vulnerable code and its patch), Vul-RAG prompts LLMs to extract the three-dimension knowledge, and then abstracts the extracted knowledge to facilitate a more general representation. We then explain each step in detail.
Functional Semantics Extraction. Given the vulnerable code snippet, Vul-RAG prompts LLMs with the following instructions to summarize the abstract purpose and the detailed behavior respectively, where the placeholder "[Vulnerable Code]" denotes the vulnerable code snippet.

Prompt for Abstract Purpose Extraction: [Vulnerable Code] What is the purpose of the function in the above code snippet? Please summarize the answer in one sentence with the following format: "Function purpose:".

Prompt for Detailed Behavior Extraction: [Vulnerable Code] Please summarize the functions of the above code snippet in the list format without any other explanation: "The functions of the code snippet are: 1. 2. 3..."

The example output of functional semantics is shown in Figure 3.
Vulnerability Causes and Fixing Solutions Extraction. As the causes and fixing solutions are often logically connected, Vul-RAG extracts them together so that the reasoning capabilities of LLMs can be better utilized. In particular, Vul-RAG incorporates two rounds of extraction: the first round asks LLMs to explain why the modification of the vulnerable code snippet is necessary, and the second round asks LLMs to further summarize the causes and fixing solutions based on the explanations generated in the first round. Such a two-step strategy is based on the CoT paradigm, which inspires LLM reasoning capabilities by thinking step-by-step and further results in better extraction [12, 13, 31, 32].
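The two-round extraction strategy above can be sketched with an abstract model call. The prompt wording below is paraphrased and the ask_llm/fake_llm functions are hypothetical stubs; the paper's exact prompts for this step are not reproduced here.

```python
# Sketch of the two-round causes/fixing-solution extraction: round one
# asks the LLM why the patch is necessary; round two summarizes causes
# and fixing solutions from that explanation. `ask_llm` is a stub for
# a real model call; the prompt wording is paraphrased, not verbatim.
def build_round1_prompt(vulnerable_code: str, patch_diff: str) -> str:
    return (
        "This code snippet is vulnerable and was fixed by the patch below.\n"
        f"Vulnerable code:\n{vulnerable_code}\n"
        f"Patch:\n{patch_diff}\n"
        "Why is this modification necessary?"
    )

def build_round2_prompt(explanation: str) -> str:
    return (
        f"Based on the following explanation:\n{explanation}\n"
        "Summarize (1) the vulnerability causes and (2) the fixing solution."
    )

def extract_knowledge(vulnerable_code: str, patch_diff: str, ask_llm) -> str:
    # Round 1: elicit a step-by-step explanation (CoT-style).
    explanation = ask_llm(build_round1_prompt(vulnerable_code, patch_diff))
    # Round 2: condense the explanation into causes and fixing solutions.
    return ask_llm(build_round2_prompt(explanation))

# Stubbed model call for demonstration only.
def fake_llm(prompt: str) -> str:
    return ("causes and fixing solution summary"
            if "Summarize" in prompt else "explanation")

print(extract_knowledge("rcu_read_lock(); ...", "+ spin_lock(...);", fake_llm))
```

Chaining the two prompts, rather than asking for causes and fixes directly, is what lets the second round condition on an explicit explanation.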
CVE ID
A use-after-free(UAF) vulnerability was found in function 'vmw_cmd_res_check' in
Abstract Purpose: Look up a TTM base object associated with a given
drivers/gpu/vmxgfx/vmxgfx_execbuf.c in Linux kernel's vmwgfx driver with device key in a TTM object file.
file '/dev/dri/renderD128 (or Dxxx)'. CVE Description Detailed Behavior: 1. Look up a TTM base object using a key in a TTM
struct ttm_base_object *ttm_base_object_lookup(struct ttm_object_file
object file. 2.Acquire a reference to the base object if found successfully.
*tfile, uint64_t key) 3. Return the base object if a reference is acquired, otherwise return
{ NULL. Functional Semantics
... Functional Semantics
rcu_read_lock(); Extraction Prompt
ret = ttm_tfile_find_ref_rcu(tfile, key, &hash);
if (likely(ret == 0)) { Abstract Vulnerability Description: Use of RCU read lock without proper
base = hlist_entry(hash, struct ttm_ref_object, hash)->obj; synchronization mechanism to protect shared data structures.
if (!kref_get_unless_zero(&base->refcount))
base = NULL; }
Trigger Action: Concurrent access to the shared data structure while
rcu_read_unlock(); Knowledge under RCU read lock context.
return base;
} ❌ Vulnerable Code Abstraction
Prompt
Detailed Vulnerability Description: The code accesses a shared data
structure within an RCU read lock context without ensuring exclusive
struct ttm_base_object *ttm_base_object_lookup(struct ttm_object_file access, allowing a race condition where the object can be freed while
*tfile, uint64_t key) still being accessed. Vulnerability Causes
{ Initial Extracted
... Vulnerability
spin_lock(&tfile->lock); Patch Diff Cause
ret = ttm_tfile_find_ref(tfile, key, &hash);
Initial Extracted
Solution Description: To mitigate the vulnerability, it is necessary to
if (likely(ret == 0)) {
Fixing Solution use a synchronization mechanism, such as a spin lock, to protect the
base = hlist_entry(hash, struct ttm_ref_object, hash)->obj;
if (!kref_get_unless_zero(&base->refcount))
shared data structure from concurrent modifications. By replacing the
base = NULL;} RCU read lock with a spin lock, exclusive access to the data structure
spin_unlock(&tfile->lock); is ensured, preventing the race condition and use-after-free
Vulnerability Cause vulnerability....
return base; & Fixing Solution Fixing Solution
} √ Patch Code Extraction Prompt
the modification of the vulnerable code snippet is necessary and the second round asks LLMs to further summarize the causes and fixing solutions based on the explanations generated in the first round. Such a two-step strategy is based on the CoT paradigm, which inspires LLM reasoning capabilities by thinking step-by-step and further results in better extraction [12, 13, 31, 32]. In addition, to enable LLMs to summarize the causes and solutions in the proper formats, Vul-RAG incorporates few-shot learning by including two demonstration examples of vulnerability causes and fixing solutions, due to the limited input length of GPT models. Following the vulnerability knowledge representation outlined in Section 4.4, we manually construct two examples. The detailed prompts are as follows, where the placeholders "[Vulnerable Code]", "[Patched Code]", and "[Patch Diff]" denote the vulnerable code, the patched code, and the code diff of the given vulnerability, and [CVE ID] and [CVE Description] denote the details of the given vulnerability.

Extraction Prompt in Round 1: This is a code snippet with a vulnerability [CVE ID]: [Vulnerable Code] The vulnerability is described as follows: [CVE Description] The correct way to fix it is by [Patch Diff] The code after modification is as follows: [Patched Code] Why is the above modification necessary?

Extraction Prompt in Round 2: I want you to act as a vulnerability detection expert and organize vulnerability knowledge based on the above vulnerability repair information. Please summarize the generalizable specific behavior of the code that leads to the vulnerability and the specific solution to fix it. Format your findings in JSON. Here are some examples to guide you on the level of detail expected in your extraction: [Vulnerability Causes and Fixing Solution Example 1] [Vulnerability Causes and Fixing Solution Example 2]

Knowledge Abstraction. Different vulnerability instances might share common high-level knowledge (e.g., similar causes and fixing solutions), and thus abstracting the high-level commonality among the extracted vulnerability knowledge can further distill a more general knowledge representation that is less bound to concrete code implementation details.

To this end, Vul-RAG leverages LLMs to abstract high-level knowledge by abstracting the following concrete code elements (i.e., method invocations, variable names, and types) in the extracted vulnerability causes and fixing solutions. We do not abstract functional semantics, as it is utilized only during the retrieval phase and is not provided as enhanced knowledge to LLMs during the vulnerability detection process. We then describe the knowledge abstraction guidelines and examples as follows.

• Abstracting Method Invocations. The extracted knowledge might contain concrete method invocations with detailed function identifiers (e.g., the io_worker_handle_work function) and parameters (e.g., mutex_lock(&dmxdev->mutex)), which can be abstracted into a generalized description (e.g., "during handling of IO work processes" and "employing a locking mechanism akin to mutex_lock()").

• Abstracting Variable Names and Types. The extracted knowledge might contain concrete variable names or types (e.g., "without &dev->ref initialization"), which can be abstracted into a more general description (e.g., "without proper reference counter initialization").

Vul-RAG incorporates the following prompt to leverage LLMs for knowledge abstraction, which queries LLMs to abstract the method invocations and variable names.

Prompt for Knowledge Abstraction: With the detailed vulnerability knowledge extracted from the previous stage, your task is to abstract and generalize this knowledge to enhance its applicability across different scenarios. Please adhere to the following guidelines and examples provided: [Knowledge Abstraction Guidelines and Examples] ...

The final output is the three-dimension knowledge of each vulnerability instance (i.e., denoted as a knowledge item). In particular, given a set of existing vulnerability instances (i.e., the training set constructed from PairVul as mentioned in Section 3.1), we repeat the extraction procedure for each vulnerability instance and aggregate the extracted knowledge items of all instances as the final vulnerability knowledge base.

4.3 Vulnerability Knowledge Retrieval

For a given code snippet for vulnerability detection, Vul-RAG retrieves relevant vulnerability knowledge items from the constructed
vulnerability knowledge base in a three-step retrieval process: query generation, candidate knowledge retrieval, and candidate knowledge re-ranking.

Query Generation. Instead of relying solely on the code as the retrieval query, Vul-RAG incorporates both the code and its functional semantics as a multi-dimension query. Firstly, Vul-RAG prompts LLMs to extract the functional semantics of the given code, as described in the knowledge base construction (Section 4.2.2). The abstract purpose, the detailed behavior, and the code itself form the query for the subsequent retrieval.

Candidate Knowledge Retrieval. Vul-RAG conducts similarity-based retrieval using three query elements: the code, the abstract purpose, and the detailed behavior. It separately retrieves the top-n (where n = 10 in our experiments) knowledge items for each query element. Consequently, Vul-RAG retrieves a total of 10 to 30 candidate knowledge items (accounting for potential duplicates among the items retrieved across the different query elements). The retrieval is based on the similarity between each query element and the corresponding elements of the knowledge items. Vul-RAG adopts BM25 [33] for similarity calculation, a method widely used in search engines due to its efficiency and effectiveness [11]. Given a query q and a document d for retrieval, BM25 calculates the similarity score between q and d based on the following Equation 1, where f(w_i, q) is the word w_i's term frequency in query q and IDF(w_i) is the inverse document frequency of word w_i. The hyperparameters k and b (where k = 1.2 and b = 0.75) are used to normalize term frequencies and control the influence of document length. Before calculating BM25 similarity, both the query and the retrieval documents undergo standard preprocessing procedures, including tokenization, lemmatization, and stop word removal [34].

Sim_BM25(q, d) = \sum_{i=1}^{n} \frac{IDF(w_i) \cdot f(w_i, q) \cdot (k+1)}{f(w_i, q) + k \cdot (1 - b + b \cdot |q| / avgdl)}    (1)

Candidate Knowledge Re-ranking. We re-rank candidate knowledge items with the Reciprocal Rank Fusion (RRF) strategy. For each retrieved knowledge item k, we calculate its re-rank score by aggregating the reciprocal of its rank across all three query elements. If a knowledge item k is not retrieved by a particular query element, we assign its rank as infinity. The re-rank score for k is calculated using the following Equation 2, where E denotes the set of all query elements (i.e., the code, the abstract purpose, and the detailed behavior), and rank_t(k) denotes the rank of knowledge item k based on query element t.

ReRankScore_k = \sum_{t \in E} \frac{1}{rank_t(k)}    (2)

In the end, we obtain the top 10 candidate knowledge items with the highest re-rank scores as the final knowledge items to be provided to the LLMs for vulnerability detection.

4.4 Knowledge-Augmented Vulnerability Detection

Based on the retrieved knowledge items, Vul-RAG leverages LLMs to reason whether the given code is vulnerable. However, directly incorporating all the retrieved knowledge items into one prompt can hinder the effectiveness of the models, as LLMs often perform poorly on lengthy contexts [35]. Therefore, Vul-RAG iteratively enhances LLMs with each retrieved knowledge item by sequentially checking whether the given code exhibits the same vulnerability cause or the same fixing solution.

If the given code exhibits the same vulnerability cause as the knowledge item but lacks the relevant fixing solution, it is identified as vulnerable. Otherwise, Vul-RAG cannot identify the code as vulnerable with the current knowledge item and proceeds to the next iteration (i.e., using the next retrieved knowledge item). If the code cannot be identified as vulnerable with any of the retrieved knowledge items, it is finally identified as non-vulnerable. The iteration process terminates when (i) the code is identified as vulnerable or (ii) all the retrieved knowledge items have been considered.

In particular, the prompts used for identifying the existence of vulnerability causes and fixing solutions are as follows.

Prompt for Finding Vulnerability Causes: Given the following code and related vulnerability causes, please detect if there is a vulnerability cause in the code. [Code Snippet]. In a similar code scenario, the following vulnerabilities have been found: [Vulnerability causes][fixing solutions]. Please use your own knowledge of vulnerabilities and the above vulnerability knowledge to detect whether there is a vulnerability in the code.

Prompt for Finding Fixing Solutions: Given the following code and related vulnerability fixing solutions, please detect if there is a vulnerability in the code. [Code Snippet]. In a similar code scenario, the following vulnerabilities have been found: [Vulnerability causes][fixing solutions]. Please use your own knowledge of vulnerabilities and the above vulnerability knowledge to detect whether there is a corresponding fixing solution in the code.

5 EVALUATION SETUP

We evaluate the effectiveness and usefulness of Vul-RAG by answering the following four research questions:

• RQ1: Compared to SOTA techniques: How does Vul-RAG perform compared to state-of-the-art (SOTA) vulnerability detection techniques?
• RQ2: Compared to GPT-4-based techniques: How does Vul-RAG perform compared to GPT-4-based detection techniques?
• RQ3: Usefulness for developers: Can the vulnerability knowledge generated by Vul-RAG help developers in manual vulnerability detection?
• RQ4: Bad Case Analysis: Why does Vul-RAG fail in detecting some vulnerabilities?

5.1 Implementation

We build Vul-RAG on top of the GPT series models. In particular, for the offline knowledge base construction, given the large number of vulnerability knowledge items to be generated, we use the gpt-3.5-turbo-0125 model [36] due to its rapid response and cost-effectiveness [11]; for the online knowledge-augmented detection, we use the GPT-4 model [37] as it is currently one of the most effective LLMs with superior understanding and logical reasoning capabilities [38]. For the knowledge retrieval process, we utilize Elasticsearch [39] as our search engine, which is based on the Lucene library and uses BM25 as the default score function.
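The BM25 scoring of Equation 1 and the RRF re-ranking of Equation 2 can be sketched in plain Python as follows. This is only an illustrative sketch: the pre-tokenized toy data, the Lucene-style IDF variant, and the helper names are assumptions of ours (the actual implementation delegates BM25 scoring to Elasticsearch).

```python
import math
from collections import Counter

def bm25_score(query, doc, corpus, k=1.2, b=0.75):
    """BM25 similarity following Equation 1 as printed (k = 1.2, b = 0.75).

    Note: Equation 1 scores the term frequency f(w_i, q) on the query side
    and normalizes by |q| / avgdl. The IDF formula below (Lucene-style) is
    an assumption, since the paper does not define IDF(w_i).
    """
    avgdl = sum(len(d) for d in corpus) / len(corpus)  # average doc length
    tf = Counter(query)
    score = 0.0
    for w in set(query) & set(doc):                    # shared terms only
        df = sum(1 for d in corpus if w in d)          # docs containing w
        idf = math.log(1 + (len(corpus) - df + 0.5) / (df + 0.5))
        f = tf[w]
        score += idf * f * (k + 1) / (f + k * (1 - b + b * len(query) / avgdl))
    return score

def rrf_rerank(ranked_lists, top_k=10):
    """Reciprocal Rank Fusion (Equation 2): sum of 1 / rank_t(k) over the
    query elements t. An item absent from a list has rank infinity, i.e.
    it simply contributes nothing for that query element."""
    scores = {}
    for ranking in ranked_lists:                       # one list per query element
        for rank, item in enumerate(ranking, start=1):
            scores[item] = scores.get(item, 0.0) + 1.0 / rank
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# Three ranked lists: retrieved by code, abstract purpose, detailed behavior.
fused = rrf_rerank([["k1", "k2", "k3"], ["k2", "k1"], ["k2", "k4"]])
```

In the toy fusion above, "k2" wins because it is ranked by all three query elements, even though it is not first in every list, which is exactly the robustness RRF is chosen for.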
[Figure: GPT-4-based detection under three prompt settings on the same code snippet (da9150_charger_remove). Left (basic prompt, no retrieved knowledge): GPT-4 answers YES but points to an unchecked platform_get_irq_byname() return value, failing to identify the root cause of the vulnerability. Middle (code-level RAG, retrieving the code snippet cedrus_remove and its patch diff adding cancel_delayed_work_sync(&dev->watchdog_work)): GPT-4 answers NO, failing to identify the associations between the retrieved but irrelevant code (a different vulnerability) and the code under detection. Right (Vul-RAG, retrieving knowledge-level causes and fixing solutions — cause: lack of proper cancellation of pending work associated with a specific functionality during device removal, which can lead to a use-after-free; fixing solution: cancel any pending work related to the specific functionality before proceeding with further cleanup during device removal): GPT-4 answers YES and successfully identifies the root cause. The bottom row contrasts the retrieved snippets: hci_loglink_complete_evt and btsdio_remove are irrelevant code with different vulnerabilities, while xgene_hwmon_remove is relevant code with a similar vulnerability.]
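The iterative checking loop of Section 4.4 can be sketched as follows. Here `ask` is a hypothetical stand-in for one LLM call issued with a cause-checking or fix-checking prompt, and the abbreviated prompt strings are illustrative, not the paper's exact prompts.

```python
def detect_vulnerability(code, knowledge_items, ask):
    """Knowledge-augmented detection loop (Section 4.4, sketch).

    `ask(prompt) -> bool` stands in for one LLM call. The code is flagged
    as vulnerable on the first knowledge item whose vulnerability cause is
    present in the code while its fixing solution is absent; otherwise the
    loop moves on to the next retrieved knowledge item.
    """
    for item in knowledge_items:  # the top-10 re-ranked knowledge items
        has_cause = ask(
            f"Cause check: does the code exhibit this cause?\n"
            f"{item['cause']}\n{code}"
        )
        has_fix = ask(
            f"Fix check: does the code apply this fixing solution?\n"
            f"{item['fix']}\n{code}"
        )
        if has_cause and not has_fix:
            return True   # vulnerable: cause present, fix missing
    return False          # no item matched: finally identified as non-vulnerable
```

Because the loop short-circuits on the first matching knowledge item, each LLM call only ever sees one knowledge item, which is how Vul-RAG avoids the lengthy-context degradation mentioned above.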
• Generalizability: The vulnerability knowledge maintains a degree of general applicability, eschewing overly specific descriptions that diminish its broad utility (e.g., narratives overly reliant on variable names from the source code).

6.3.2 Results. Compared to the basic setting, participants provided with the vulnerability knowledge generated by Vul-RAG can more precisely identify vulnerable and non-vulnerable code (i.e., 77% detection accuracy with knowledge vs. 60% detection accuracy without knowledge). It indicates that the vulnerability knowledge generated by Vul-RAG is indeed helpful for developers to better understand the semantics and vulnerabilities in the given code. In addition, based on the survey feedback, participants rate the helpfulness, preciseness, and generalizability with average scores of 3.00, 3.20, and 2.97, respectively. The results further indicate the high quality and usefulness of the vulnerability knowledge generated by Vul-RAG.

Table 5: FN/FP analysis in CWE-119

Type | Reason | Number
FN | Inaccurate vulnerability knowledge descriptions. | 5
FN | Unretrieved relevant vulnerability knowledge. | 2
FN | Non-existent relevant vulnerability knowledge. | 12
FP | Mismatched fixing solutions. | 11
FP | Irrelevant vulnerability knowledge retrieval. | 10

6.4 RQ4: Bad Case Analysis

To understand the limitations of Vul-RAG, we further manually analyse the bad cases (i.e., false negatives and false positives reported by Vul-RAG). In particular, we include all 19 FN and 21 FP cases from CWE-119 for manual analysis. Table 5 summarizes the reasons and distributions. In particular, the reasons for false negatives are classified into three primary categories:

• Inaccurate Vulnerability Knowledge Descriptions. We observe that for 5 instances (26.3%), Vul-RAG successfully retrieves relevant vulnerability knowledge but fails to detect the vulnerability due to the imprecise knowledge descriptions. For example, given the vulnerable code snippet of CVE-2021-4204, although Vul-RAG successfully retrieves the relevant knowledge of the same CVE, it yields a false negative due to the vague descriptions of the vulnerability knowledge (i.e., only briefly mentioning "lacks proper bounds checking" in the vulnerability cause and fixing solution description without explicitly stating what kind of bounds checking should be performed).

• Unretrieved Relevant Vulnerability Knowledge. We observe that for 2 cases (10.5%) Vul-RAG fails to retrieve relevant vulnerability knowledge, thus leading to false negatives. Although there are instances in the knowledge base that share similar vulnerability root causes and fixing solutions with the given code, their functional semantics are significantly different. Therefore, Vul-RAG fails to retrieve them from the knowledge base.

• Non-existent Relevant Vulnerability Knowledge. Based on our manual checking, the 12 cases (63.2%) in this category are caused by the absence of relevant vulnerability knowledge in our knowledge base. Even when there are other vulnerable and patched code pairs of the same CVE, their vulnerability behaviors and fixing solutions are dissimilar, rendering these cases unsolvable with the current knowledge base. This limitation is inherent to the RAG-based framework. In future work, we will further extend the knowledge base by extracting more CVE information to mitigate this issue.

In addition, the reasons for false positives can be classified into the following two categories:

• Mismatched Fixing Solutions. There are 11 cases (52.4%) in which, although Vul-RAG successfully retrieves relevant vulnerability knowledge, the code snippet is still considered vulnerable because it is regarded as not applying the fixing solution of the retrieved knowledge. This is because one vulnerability can be fixed by more than one alternative solution.

• Irrelevant Vulnerability Knowledge Retrieval. There are 10 false positives (47.6%) caused by Vul-RAG retrieving irrelevant vulnerability knowledge. Based on our manual inspection, these incorrectly-retrieved knowledge descriptions often generally contain "missing proper validation of specific values", which is too general for GPT-4 to precisely identify the vulnerability.

7 THREATS TO VALIDITY

Threats in benchmarks. There might be a potential data leakage issue between the vulnerability benchmark and the GPT-4 training data. Nevertheless, the substantial improvements of Vul-RAG over the basic GPT-4 show that the effectiveness of Vul-RAG is not simply due to data memorization.

Threats in generalization. Our benchmark focuses on Linux kernel CVEs due to their prevalence and rich vulnerability information [41], which might limit the generalization of the results. However, our approach is not limited to Linux kernel CVEs and can be extended to CVEs of other systems in the future. In addition, another generalizability issue of Vul-RAG occurs in cases where the constructed knowledge base does not contain the relevant knowledge for the given code under detection, which raises concerns about whether the extracted vulnerability knowledge can generalize to detect code snippets from different CVEs. To mitigate this threat, we manually compile a small-scale benchmark comprising 60 code functions (30 positive and 30 negative samples) across 30 unique CVEs. For each case in this benchmark, we manually verify the presence of relevant vulnerability knowledge extracted from other CVEs in the knowledge base. The performance of Vul-RAG on this benchmark (i.e., a recall rate of 0.83 and a precision rate of 0.76) demonstrates the generalizability of the extracted vulnerability knowledge across different CVEs.

8 RELATED WORK

DL-based Vulnerability Detection. Most DL-based work mainly leverages graph neural network (GNN) models and pre-trained language models (PLMs) for vulnerability detection. Devign [1] employs GNN to efficiently extract useful features in a joint graph, and REVEAL [2] conceptualizes function-level code as a Code Property Graph (CPG) and uses GGNN for CPG embedding. VulChecker [4] uses program slicing and a message-passing GNN to precisely locate vulnerabilities in code and classify their type (CWE). DeepDFA [3] uses a data flow analysis-guided graph learning framework to simulate data flow computation. For PLM-based vulnerability detection, VulBERTa [5] uses the RoBERTa model [22] as the encoder, while LineVul [6] uses attention scores for line-level prediction.
LLM-based Vulnerability Detection. Wu et al. [42] and Zhou et al. [43] explore the effectiveness and limits of ChatGPT in software security applications; Gao et al. [44] build a comprehensive vulnerability benchmark VulBench to evaluate the effectiveness of 16 LLMs in vulnerability detection. Zhang et al. [7] investigate various prompts to improve ChatGPT in vulnerability detection. Yang et al. [8] and Shestov et al. [9] fine-tune LLMs for vulnerability detection. Additionally, Li et al. [10] and Sun et al. [11] combine LLMs with static analysis for vulnerability detection. Wang et al. [45] boost static analysis with LLM-based intention inference to detect resource leaks. To the best of our knowledge, we propose the first vulnerability detection technique based on a knowledge-level RAG framework. In addition, we also make the first attempt to evaluate existing techniques on distinguishing vulnerable code and similar-but-benign code.

9 CONCLUSION

In this work, we propose a novel LLM-based vulnerability detection technique, Vul-RAG, which leverages a knowledge-level retrieval-augmented generation (RAG) framework to detect vulnerabilities in the given code. Overall, compared to four representative baselines, Vul-RAG shows substantial improvements (i.e., a 12.96% improvement in accuracy and a 110% improvement in pairwise accuracy). Our user study results show that the vulnerability knowledge can improve the manual detection accuracy from 0.60 to 0.77, and the user feedback also shows the high quality of the generated knowledge regarding helpfulness, preciseness, and generalizability.

REFERENCES
[1] Y. Zhou, S. Liu, J. K. Siow, X. Du, and Y. Liu, "Devign: Effective vulnerability identification by learning comprehensive program semantics via graph neural networks," in Advances in Neural Information Processing Systems 32 (NeurIPS 2019), 2019, pp. 10197–10207.
[2] S. Chakraborty, R. Krishna, Y. Ding, and B. Ray, "Deep learning based vulnerability detection: Are we there yet?" IEEE Trans. Software Eng., vol. 48, no. 9, pp. 3280–3296, 2022.
[3] B. Steenhoek, H. Gao, and W. Le, "Dataflow analysis-inspired deep learning for efficient vulnerability detection," in Proceedings of the 46th IEEE/ACM International Conference on Software Engineering (ICSE 2024). ACM, 2024, pp. 16:1–16:13.
[4] Y. Mirsky, G. Macon, M. D. Brown, C. Yagemann, M. Pruett, E. Downing, S. Mertoguno, and W. Lee, "VulChecker: Graph-based vulnerability localization in source code," in 32nd USENIX Security Symposium (USENIX Security 2023). USENIX Association, 2023, pp. 6557–6574.
[5] H. Hanif and S. Maffeis, "VulBERTa: Simplified source code pre-training for vulnerability detection," in International Joint Conference on Neural Networks (IJCNN 2022). IEEE, 2022, pp. 1–8.
[6] M. Fu and C. Tantithamthavorn, "LineVul: A transformer-based line-level vulnerability prediction," in 19th IEEE/ACM International Conference on Mining Software Repositories (MSR 2022). ACM, 2022, pp. 608–620.
[7] C. Zhang, H. Liu, J. Zeng, K. Yang, Y. Li, and H. Li, "Prompt-enhanced software vulnerability detection using ChatGPT," CoRR, vol. abs/2308.12697, 2023.
[8] A. Z. H. Yang, C. L. Goues, R. Martins, and V. J. Hellendoorn, "Large language models for test-free fault localization," in Proceedings of the 46th IEEE/ACM International Conference on Software Engineering (ICSE 2024). ACM, 2024, pp. 17:1–17:12.
[9] A. Shestov, A. Cheshkov, R. Levichev, R. Mussabayev, P. Zadorozhny, E. Maslov, C. Vadim, and E. Bulychev, "Finetuning large language models for vulnerability detection," CoRR, vol. abs/2401.17010, 2024.
[10] H. Li, Y. Hao, Y. Zhai, and Z. Qian, "The hitchhiker's guide to program analysis: A journey with large language models," 2023.
[11] Y. Sun, D. Wu, Y. Xue, H. Liu, H. Wang, Z. Xu, X. Xie, and Y. Liu, "When GPT meets program analysis: Towards intelligent detection of smart contract logic vulnerabilities in GPTScan," 2023.
[12] J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou et al., "Chain-of-thought prompting elicits reasoning in large language models," Advances in Neural Information Processing Systems, vol. 35, pp. 24824–24837, 2022.
[13] Z. Zhang, A. Zhang, M. Li, and A. Smola, "Automatic chain of thought prompting in large language models," arXiv preprint arXiv:2210.03493, 2022.
[14] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell et al., "Language models are few-shot learners," Advances in Neural Information Processing Systems, vol. 33, pp. 1877–1901, 2020.
[15] (2024) Cppcheck. [Online]. Available: http://cppcheck.net/
[16] J. Fan, Y. Li, S. Wang, and T. N. Nguyen, "A C/C++ code vulnerability dataset with code changes and CVE summaries," in MSR '20: 17th International Conference on Mining Software Repositories. ACM, 2020, pp. 508–512.
[17] (2024) The website of Common Vulnerabilities and Exposures. [Online]. Available: https://cve.mitre.org/
[18] (2024) The website of Common Weakness Enumeration. [Online]. Available: https://cwe.mitre.org/
[19] (2024) The website of CWE-416. [Online]. Available: https://cwe.mitre.org/data/definitions/416.html
[20] (2024) The website of CVE-2023-30772. [Online]. Available: https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2023-30772
[21] (2024) The website of CVE-2023-3609. [Online]. Available: https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2023-3609
[22] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov, "RoBERTa: A robustly optimized BERT pretraining approach," CoRR, vol. abs/1907.11692, 2019.
[23] T. Ahmed and P. Devanbu, "Few-shot training LLMs for project-specific code-summarization," in Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering, 2022, pp. 1–5.
[24] Y. Gao, Y. Xiong, X. Gao, K. Jia, J. Pan, Y. Bi, Y. Dai, J. Sun, Q. Guo, M. Wang, and H. Wang, "Retrieval-augmented generation for large language models: A survey," 2024.
[25] P. S. H. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W. Yih, T. Rocktäschel, S. Riedel, and D. Kiela, "Retrieval-augmented generation for knowledge-intensive NLP tasks," in Advances in Neural Information Processing Systems 33 (NeurIPS 2020), 2020.
[26] E. Shi, Y. Wang, W. Tao, L. Du, H. Zhang, S. Han, D. Zhang, and H. Sun, "RACE: Retrieval-augmented commit message generation," in Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP 2022). Association for Computational Linguistics, 2022, pp. 5520–5530.
[27] F. Zhang, B. Chen, Y. Zhang, J. Keung, J. Liu, D. Zan, Y. Mao, J. Lou, and W. Chen, "RepoCoder: Repository-level code completion through iterative retrieval and generation," in Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP 2023). Association for Computational Linguistics, 2023, pp. 2471–2484.
[28] S. Lu, N. Duan, H. Han, D. Guo, S. won Hwang, and A. Svyatkovskiy, "ReACC: A retrieval-augmented code completion framework," ArXiv, vol. abs/2203.07722, 2022.
[29] A. Sejfia, S. Das, S. Shafiq, and N. Medvidovic, "Toward improved deep learning-based vulnerability detection," in Proceedings of the 46th IEEE/ACM International Conference on Software Engineering (ICSE 2024). ACM, 2024, pp. 62:1–62:12.
[30] (2024) The website of Linux kernel CVEs. [Online]. Available: https://www.linuxkernelcves.com/
[31] J. Li, G. Li, Y. Li, and Z. Jin, "Structured chain-of-thought prompting for code generation," arXiv preprint arXiv:2305.06599, 2023.
[32] Y. Nong, M. Aldeen, L. Cheng, H. Hu, F. Chen, and H. Cai, "Chain-of-thought prompting of large language models for discovering and fixing software vulnerabilities," CoRR, vol. abs/2402.17230, 2024.
[38] OpenAI, "GPT-4 technical report," CoRR, vol. abs/2303.08774, 2023.
[33] S. E. Robertson and S. Walker, “Some simple effective approximations to the [39] (2023) Elasticsearch. [Online]. Available: https://github.com/elastic/elasticsearch
2-poisson model for probabilistic weighted retrieval,” in Proceedings of the [40] R. Likert, “A technique for the measurement of attitudes.” Archives of psychology,
17th Annual International ACM-SIGIR Conference on Research and Development 1932.
in Information Retrieval. Dublin, Ireland, 3-6 July 1994 (Special Issue of [41] M. Jimenez, M. Papadakis, and Y. L. Traon, “An empirical analysis of
the SIGIR Forum). ACM/Springer, 1988, pp. 232–241. [Online]. Available: vulnerabilities in openssl and the linux kernel,” in 23rd Asia-Pacific Software
https://doi.org/10.1016/0306-4573(88)90021-0 Engineering Conference, APSEC 2016, Hamilton, New Zealand, December 6-9, 2016,
[34] M. Çagatayli and E. Çelebi, “The effect of stemming and stop-word-removal A. Potanin, G. C. Murphy, S. Reeves, and J. Dietrich, Eds. IEEE Computer Society,
on automatic text classification in turkish language,” in Neural Information 2016, pp. 105–112. [Online]. Available: https://doi.org/10.1109/APSEC.2016.025
Processing - 22nd International Conference, ICONIP 2015, Istanbul, Turkey, [42] F. Wu, Q. Zhang, A. P. Bajaj, T. Bao, N. Zhang, R. Wang, and C. Xiao, “Exploring
November 9-12, 2015, Proceedings, Part I, ser. Lecture Notes in Computer Science, the limits of chatgpt in software security applications,” CoRR, vol. abs/2312.05275,
S. Arik, T. Huang, W. K. Lai, and Q. Liu, Eds., vol. 9489. Springer, 2015, pp. 2023. [Online]. Available: https://doi.org/10.48550/arXiv.2312.05275
168–176. [Online]. Available: https://doi.org/10.1007/978-3-319-26532-2_19 [43] X. Zhou, T. Zhang, and D. Lo, “Large language model for vulnerability detection:
[35] N. F. Liu, K. Lin, J. Hewitt, A. Paranjape, M. Bevilacqua, F. Petroni, and P. Liang, Emerging results and future directions,” CoRR, vol. abs/2401.15468, 2024. [Online].
“Lost in the middle: How language models use long contexts,” CoRR, vol. Available: https://doi.org/10.48550/arXiv.2401.15468
abs/2307.03172, 2023. [Online]. Available: https://doi.org/10.48550/arXiv.2307. [44] Z. Gao, H. Wang, Y. Zhou, W. Zhu, and C. Zhang, “How far have we gone in
03172 vulnerability detection using large language models,” CoRR, vol. abs/2311.12420,
[36] (2023) Gpt-3-5-turbo documentation. [Online]. Available: https://platform.openai. 2023. [Online]. Available: https://doi.org/10.48550/arXiv.2311.12420
com/docs/models/gpt-3-5-turbo [45] C. Wang, J. Liu, X. Peng, Y. Liu, and Y. Lou, “Boosting static resource leak detection
[37] (2023) Gpt-4 documentation. [Online]. Available: https://platform.openai.com/ via llm-based resource-oriented intention inference,” CoRR, vol. abs/2311.04448,
docs/models/gpt-4-and-gpt-4-turbo 2023. [Online]. Available: https://doi.org/10.48550/arXiv.2311.04448