0% found this document useful (0 votes)

25 views

Vul-RAG Enhancing LLM-based Vulnerability Detection Via Knowledge-Level RAG

Uploaded by

zexyzm1201

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

25 views

Vul-RAG Enhancing LLM-based Vulnerability Detection Via Knowledge-Level RAG

Uploaded by

zexyzm1201

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 12

Vul-RAG: Enhancing LLM-based Vulnerability Detection

via Knowledge-level RAG

Xueying Du Geng Zheng Kaixin Wang
Fudan University Alibaba Group Fudan University
China China China
Jiayi Feng Wentai Deng Mingwei Liu
Fudan University Nanjing University Sun Yat-sen University
China China China
arXiv:2406.11147v2 [cs.SE] 19 Jun 2024

Bihuan Chen Xin Peng Tao Ma

Fudan University Fudan University Alibaba Group
China China China

Yiling Lou
Fudan University
China
ABSTRACT large language models (LLMs) further boosts the learning-based
Vulnerability detection is essential for software quality assurance. In vulnerability detection techniques. Due to the strong code and text
recent years, deep learning models (especially large language mod- comprehension capabilities, LLMs show promising effectiveness
els) have shown promise in vulnerability detection. In this work, we in analyzing the malicious behaviors (e.g., bugs or vulnerabilities)
propose a novel LLM-based vulnerability detection technique Vul- in the code [7–11]. For example, existing LLM-based vulnerability
RAG, which leverages knowledge-level retrieval-augmented gener- detection techniques incorporate prompt engineering (e.g., chain-
ation (RAG) framework to detect vulnerability for the given code in of-thought [12, 13] and few-shot learning [14]) to facilitate more
three phases. First, Vul-RAG constructs a vulnerability knowledge accurate vulnerability detection.
base by extracting multi-dimension knowledge via LLMs from ex- Preliminary Study. However, due to the limited interpretability of
isting CVE instances; second, for a given code snippet, Vul-RAG deep learning models, it remains unclear whether existing learning-
retrieves the relevant vulnerability knowledge from the constructed based vulnerability detection techniques really understand and
knowledge base based on functional semantics; third, Vul-RAG capture the code semantics related to vulnerable behaviors, espe-
leverages LLMs to check the vulnerability of the given code snip- cially when the only outputs of the models are binary labels (i.e.,
pet by reasoning the presence of vulnerability causes and fixing vulnerable or benign). To fill this knowledge gap, we first perform
solutions of the retrieved vulnerability knowledge. Our evaluation a preliminary study based on the assumption that “if the technique
of Vul-RAG on our constructed benchmark PairVul shows that can precisely distinguish a pair of vulnerable code and non-vulnerable
Vul-RAG substantially outperforms all baselines by 12.96%/110% code with high lexical similarity (i.e., only differing in several tokens),
relative improvement in accuracy/pairwise-accuracy. In addition, we consider the technique with the better capability of capturing the
our user study shows that the vulnerability knowledge generated vulnerability-related semantics in code”. As two lexically-similar
by Vul-RAG can serve as high-quality explanations which can code snippets can differ in code semantics, it is likely that models
improve the manual detection accuracy from 0.60 to 0.77. have captured the high-level vulnerability-related semantics if the
models can precisely distinguish between them. As there is no such
existing vulnerability detection benchmark focusing on such pairs
1 INTRODUCTION of vulnerable code and non-vulnerable code with high lexical simi-
Security vulnerabilities in software leave open doors for the dis- larity, we first construct a new benchmark PairVul which contains
ruptive attacks, resulting in serious consequences during software 4,314 pairs of vulnerable and patched code functions across 2,073
execution. To date, there has been a large body of research on auto- CVEs. We then evaluate the three representative learning-based
mated vulnerability detection techniques. In addition to leveraging techniques (i.e., LLMAO [8], LineVul [6], and DeepDFA [3]) along
the traditional program analysis, deep learning has been incorpo- with one static analysis technique (i.e., Cppcheck [15]) on our con-
rated into the vulnerability detection techniques given the recent structed benchmark to study their distinguishing capability for such
advance in the artificial intelligence domain. code pairs. Based on the results, existing learning-based techniques
Learning-based vulnerability detection techniques [1–6] mainly actually exhibit rather limited effectiveness in distinguishing within
formulate the vulnerability detection as the binary classification such lexically-similar code pairs. In particular, the accuracy on our
task for the given code, which first train different models (e.g., benchmark PairVul drops to 0.50 ∼ 0.54, which are much lower
graph neural networks or pre-trained language models) on existing than that reported in previous benchmarks (e.g., 0.99 accuracy of
vulnerable code and benign code, and then predict the vulnera- LineVul [6] on BigVul [16]). The results demonstrate that existing
bility for the given code. More recently, the advanced progress in
,, Xueying Du, Geng Zheng, Kaixin Wang, Jiayi Feng, Wentai Deng, Mingwei Liu, Bihuan Chen, Xin Peng, Tao Ma, and Yiling Lou

trained models have limited capabilities of capturing the high-level • Benchmark. We construct a new benchmark PairVul that exclu-
code semantics related to vulnerable behaviors in the given code. sively contains pairs of vulnerable code and similar-but-correct
Technique. Inspired by the observation in our preliminary study, code.
our insight is to distinguish the vulnerable code from the similar- • Preliminary Study. We perform the first study to find that
but-correct code with high-level vulnerability knowledge. In partic- existing learning-based techniques have limited capabilities of
ular, based on how developers manually identify a vulnerability, understanding and capturing the vulnerability-related code se-
understanding a vulnerability often involves the code semantics mantics.
from the three dimensions: (i) the functionality the code is imple- • Technique. We construct a vulnerability knowledge base based
menting, (ii) the causes for the vulnerability, and (iii) the fixing on the proposed multi-dimension knowledge representation, and
solution for the vulnerability. Such high-level code semantics can propose a novel knowledge-level RAG framework Vul-RAG for
serve as the vulnerability knowledge for vulnerability detection. vulnerability detection.
To this end, we propose a novel LLM-based vulnerability detec- • Evaluation. We evaluate Vul-RAG and find the usefulness of
tion technique Vul-RAG, which leverages knowledge-level retrieval- vulnerability knowledge generated by Vul-RAG for both auto-
augmented generation (RAG) framework to detect vulnerability in mated and manual vulnerability detection.
the given code. The main idea of Vul-RAG is to leverage LLM to
reason for vulnerability detection based on the similar vulnerability 2 BACKGROUND
knowledge from existing vulnerabilities. In particular, Vul-RAG
2.1 CVE and CWE
consists of three phases. First, Vul-RAG constructs a vulnerabil-
ity knowledge base by extracting multi-dimension knowledge (i.e., Existing vulnerability classification systems, such as Common Vul-
functional semantics, causes, and fixing solutions) via LLMs from nerabilities and Exposures (CVE) [17] and Common Weakness Enu-
existing CVE instances; second, for a given code snippet, Vul-RAG meration (CWE) [18], provide a comprehensive taxonanomy of
retrieves the relevant vulnerability knowledge from the constructed categorizing and managing vulnerabilities. CVE is a publicly dis-
knowledge base based on functional semantics; third, Vul-RAG closed list of common security vulnerabilities. Each vulnerability
leverages LLMs to check the vulnerability of the given code snippet is assigned a unique identifier (CVE ID). A single CVE ID may be
by reasoning the presence of vulnerability causes and fixing solu- associated with multiple distinct code snippets.
tions of the retrieved vulnerability knowledge. The main technical CWE is a publicly accessible classification system of common
novelties of Vul-RAG include: (i) a novel representation of multi- software and hardware security vulnerabilities. Each weakness type
dimension vulnerability knowledge that focuses on more high-level within this enumeration is assigned a unique identifier (CWE ID).
code semantics rather than lexical details, and (ii) a novel knowledge- While CWE provides a broad classification of vulnerability types,
level RAG framework for LLMs that first retrieves relevant knowledge the specific code behaviors leading to a vulnerability under a given
based on functional semantics and then detects vulnerability by rea- CWE category may vary widely. For example, CWE-416 (Use After
soning from the vulnerability causes and fixing solutions. Free) [19] signifies the issue of referencing memory after it has
Evaluation. We further evaluate Vul-RAG on our benchmark been freed. The root cause of this vulnerability might stem from
PairVul. First, we compare Vul-RAG with three representative improper synchronization under race conditions (e.g., CVE-2023-
learning-based vulnerability detection techniques and one static 30772 [20]), or errors in reference counting leading to premature
analysis technique. The results show that Vul-RAG substantially object destruction (e.g., CVE-2023-3609 [21]).
outperforms all baselines by more precisely identifying the pairs of
vulnerable code and similar-but-correct code, e.g., 12.96% improve- 2.2 Learning-based Vulnerability Detection
ment in accuracy and 110% improvement in pairwise accuracy The recent advance in deep learning has boosted many learning-
(i.e., the ratio of both non-vulnerable code and vulnerable code based vulnerability detection techniques.
in one pair being correctly identified ). Second, we evaluate the GNN-based Vulnerability Detection [1–4] typically represents
usefulness of our vulnerability knowledge by comparing Vul-RAG the code snippets under detection as graph-based intermediate
with both the basic GPT-4 and the GPT-4 enhanced with code-level representations, such as Abstract Syntax Trees (ASTs) or Control
RAG. The results show that Vul-RAG consistently outperforms Flow Graphs (CFGs). Graph neural networks (GNN) are then applied
two GPT-4-based variants in all metrics. Third, we further perform to these abstracted code representations for feature extraction. The
a user study of vulnerability detection with/without the vulnerabil- features learned by the models are subsequently fed into a binary
ity knowledge generated by Vul-RAG. The results show that the classifier for vulnerability detection.
vulnerability knowledge can improve the manual detection accu- PLM-based Vulnerability Detection [5, 6] typically involves
racy from 0.6 to 0.77, and the user feedback also shows the high fine-tuning existing PLMs on vulnerability detection datasets. In
quality of generated knowledge regarding the helpfulness, precise- this way, code snippets are tokenized and processed by Pre-trained
ness, and generalizability. In summary, the evaluation results confirm Language Models (PLM, e.g., RoBERTa [22]) to serves as the encoder.
two-fold benefits of the proposed knowledge-level RAG framework: (i) The extracted features are then used for binary classification.
enhancing automated vulnerability detection by better retrieving and LLM-based Vulnerability Detection. This category leverages
utilizing existing vulnerability knowledge, and (ii) enhancing manual large language models (LLMs) for vulnerability detection via prompt
vulnerability detection by providing developer-friendly explanations engineering or fine-tuning [7–9]. The former leverages different
for understanding vulnerable or non-vulnerable code. prompting strategies, e.g., Chain-of-Thought (CoT) [12, 13] and few-
In summary, this paper makes the following contributions: shot learning [14, 23] for more accurate LLM-based vulnerability
Vul-RAG: Enhancing LLM-based Vulnerability Detection
via Knowledge-level RAG ,,

detection, without modifying the original LLM parameters; the 3.1 Benchmark PairVul
latter updates LLM parameters by trained on vulnerability detection For the preliminary study, we first construct a new benchmark
datasets, to learn the features of vulnerable code. PairVul as it is challenging and efforts-intensive to prepare pairs
of vulnerable code and patched code from existing vulnerability
detection benchmarks. For example, Table 1 shows the detailed sta-
2.3 Retrieval-Augmented Generation tistics of three widely-used benchmarks released between Jan. 2019
Retrieval-Augmented Generation (RAG) is a general paradigm which and Apr. 2024, i.e., BigVul [16], Devign [1], and Reveal [2]. The last
enhances LLMs by including relevant information retrieved from row presents our constructed benchmark PairVul for comparison.
external databases into the input [24]. RAG typically consists of In particular, existing datasets do not focus on pairs of vulnera-
three phases: indexing, retrieval, and generation. First, the index- ble code and patched code, e.g., some do not include patched code
ing phase constructs external databases and their retrieval index while some contain non-similar vulnerable code and correct code
from external data sources; Second, given a user query, the retrieval with significantly different code length. Therefore, we construct a
system then utilizes these index to fetch the relevant document new benchmark PairVul that exclusively contains pairs of vulner-
chunks as context. Third, the retrieved context is then integrated able code and patched code. In particular, in this work, we focus
into the input prompt for LLMs, and LLMs then generate the final on function-level vulnerability detection given it has been widely
output based on the augmented inputs. RAG has been widely used studied in previous learning-based research [3–5, 29].
in various domains [25–28]. For example, RAG has been special- Data Format. Specifically, our benchmark contains the following
ized to software engineering tasks such as code generation [27, 28], information for each vulnerability:
which retrieves the similar code from the code base and augments • CVE ID: The unique identifier assigned to a reported vulnera-
the prompt with the retrieved code for model inference. bility in the Common Vulnerabilities and Exposures (CVE).
• CVE Description: Descriptions of the vulnerability provided
by the CVE system, including the manifestation, the potential
3 PRELIMINARY STUDY impact, and the environment where the vulnerability may occur.
Although existing learning-based vulnerability detection tech- • CWE ID: The Common Weakness Enumeration identifier that
niques show promising effectiveness, it still remains unclear whether categorizes the type of the vulnerability exploits.
these techniques really understand and capture the code semantics
• Vulnerable Code: The source code snippet containing the
related to vulnerable behaviors, due to the weak interpretability
vulnerability, which will be modified in the commit.
of deep learning models. To fill this knowledge gap, in this pre-
liminary study, we make the assumption that “if the technique can • Patched Code: The source code snippet that has been commit-
precisely distinguish a pair of vulnerable code and non-vulnerable ted to fix the vulnerability in the vulnerable code.
code with high lexical similarity (i.e., only differing in several tokens), • Patch Diff: A detailed line-level difference between the vulner-
we consider the technique with the better capability of capturing the able and patched code, consisting of added and deleted lines.
vulnerability-related semantics in code”. As shown by the example Construction Procedure. Given the representativeness of the
in Figure 1, the vulnerable code is fixed by moving the statement Linux kernel in modern complex software systems, we use Linux
inet_frag_lru_add(nf, qp) into the lock-protected code block, kernel CVEs as the data source for our benchmark. The specific
while the pair of vulnerable code and the non-vulnerable code share benchmark construction process involves the following two steps:
high lexical similarity but differ in the semantics. To this end, we • Vulnerble and Patched Code Collection. We first collect all the
first propose to construct a benchmark that contains pairs of vulner- CVEs related to the Linux kernel from Linux Kernel CVEs[30],
able code and its corresponding patched code, as patched code often an open-source project dedicated to automatically tracking
shares high similarity as the original code; then we evaluate the CVEs within the upstream Linux kernel. Based on the list of
existing learning-based techniques on our constructed benchmark collected CVE IDs, we further extract corresponding CWE IDs
to study their distinguishing capability for such code pairs. and CVE descriptions from the National Vulnerability Data-
base (NVD), enriching our dataset with detailed vulnerability
static struct inet_frag_queue *inet_frag_intern(struct static struct inet_frag_queue *inet_frag_intern(struct
netns_frags *nf, struct inet_frag_queue *qp_in, struct
inet_frags *f, void *arg)
netns_frags *nf, struct inet_frag_queue *qp_in, struct
inet_frags *f, void *arg)
categorizations and descriptions. Based on the CVE ID list, we
{ {
... ...
then parse the commit information for each CVE to extract
read_lock(&f->lock); /* Protects against hash read_lock(&f->lock); /* Protects against hash
rebuild */ rebuild */ function-level vulnerable and patched code pairs. Vulnerable
hash = f->hashfn(qp_in); hash = f->hashfn(qp_in);
hb = &f->hash[hash]; hb = &f->hash[hash];
spin_lock(&hb->chain_lock);
code snippets prior to the commit diffs are labeled as positive
spin_lock(&hb->chain_lock);
...
...
qp = qp_in; ❌ Vulnerable Code qp = qp_in; √ Non-vulnerable Code samples and the patched code snippets as negative samples. In
if (!mod_timer(&qp->timer, jiffies + nf- if (!mod_timer(&qp->timer, jiffies + nf-
>timeout)) >timeout)) this way, we initially obtain a dataset of 4,667 function pairs of
atomic_inc(&qp->refcnt); atomic_inc(&qp->refcnt);
atomic_inc(&qp->refcnt);
hlist_add_head(&qp->list, &hb->chain);
atomic_inc(&qp->refcnt);
hlist_add_head(&qp->list, &hb->chain);
vulnerable and patched code across 2,174 CVEs.
inet_frag_lru_add(nf, qp);
spin_unlock(&hb->chain_lock);
read_unlock(&f->lock);
spin_unlock(&hb->chain_lock); • Patched Code Verification. The patched code cannot always be
read_unlock(&f->lock);
inet_frag_lru_add(nf, qp);
return qp; return qp; non-vulnerable, thus it is important to double check the correct-
} }
Diff :Call “inet_frag_lru_add” Diff: Call “inet_frag_lru_add” ness of the patched code. To this end, we further implement a
after unlocking before unlocking
filtering process for the patched code by ensuring that it has
Figure 1: A pair of vulnerable code and similar non- not been subsequently reverted or modified by other commits.
vulnerable code (the patched code)
,, Xueying Du, Geng Zheng, Kaixin Wang, Jiayi Feng, Wentai Deng, Mingwei Liu, Bihuan Chen, Xin Peng, Tao Ma, and Yiling Lou

Table 1: Existing Benchmarks for Vulnerability Detection. “Positive Number/Ratio” is the number/portion of vulnerable samples,
“#CVE” is the number of CVEs, “Positive LOC”/“Negative LOC” is the average lines numbers of vulnerable/non-vulnerable code,
“Patched Code Included” means whether the patched code of the vulnerability is included, and “Patched Code Verified” means
whether patched code is verified as correct.
Positive Negative Patched Code Patched Code
Benchmark Time Positive Number/Ratio #CVE
LOC LOC Included Verified
BigVul 2020 10,900(5.78%) 3,285 73.47 23.83 N /
Devign 2019 12,460 (45.61%) / 54.50 49.53 N /
ReVeal 2020 2,240 ( 9.85%) / 67.73 28.69 Y N
PairVul 2024 1,923 (50.00%) 896 68.58 70.25 Y Y

Benchmark Statistics. As a result, we obtain a new benchmark we develop a new metric pairwise accuracy, which calculates the
PairVul of 4,314 pairs of vulnerable and patched code functions ratio of pairs whose vulnerable and patched code are both correctly
across 2,073 CVEs. In this work, we focus on the top-5 CWEs in identified. Besides, we also use six commonly-used metrics in vul-
our benchmark given the non-trivial costs of model execution and nerability detection tasks, i.e., FN, FP, accuracy, precision, recall,
manual analysis. In particular, as this work focuses on learning- and F1. FN is the ratio of false negatives; FP is the ratio of false
based techniques which often require training datasets, we further positives; accuracy is the proportion of correctly detected instances;
divide the benchmark into the training set and the testing set in the precision is the proportion of true positive predictions among all
following steps. For each CVE, we randomly select one instance positive predictions; recall is the proportion of true positive predic-
into the testing set with the remaining instances (if has any) of the tions among all instances; and F1-score is the harmonic mean of
CVE into the training set. We exclude cases where the code length the precision and recall, which balances both values.
exceeds the current token limit of GPT-3.5-turbo (i.e., 16,384 tokens). Table 3: Effectiveness of SOTA techniques in PairVul
The final training set includes 896 CVEs with 1,462 pairs of vulner-
CWE Tech. FN FP Acc. Pair Acc. Precis. Recall F1
able and patched code functions, while the testing set includes 373 CppCheck 50.0% 0.0% 0.50 0.00 / 0.00 /
CVEs with 592 pairs. The statistics of each CWE category in our DeepDFA 9.3% 40.3% 0.50 0.02 0.50 0.81 0.62
CWE-416 LineVul 0.0% 50.0% 0.50 0.04 0.50 1.00 0.67
benchmark is shown in Table 2. LLMAO 24.5% 20.4% 0.55 0.14 0.56 0.51 0.53
CppCheck 48.9% 0.6% 0.51 0.01 0.67 0.02 0.04
Table 2: Statistics of each CWE in PairVul DeepDFA 8.5% 42.6% 0.49 0.01 0.49 0.83 0.62
CWE-476 LineVul 12.9% 33.7% 0.54 0.09 0.53 0.75 0.62
Training Set Test Set LLMAO 44.9% 3.4% 0.52 0.03 0.60 0.10 0.17
CWE
CVE Num. Func. Pair Num. CVE Num. Func. Pair Num. CppCheck 49.6% 0.0% 0.50 0.01 1.00 0.01 0.02
CWE-416 339 587 145 267 DeepDFA 5.9% 45.1% 0.49 0.00 0.49 0.88 0.63
CWE-362 LineVul 10.7% 40.9% 0.49 0.02 0.49 0.79 0.61
CWE-476 194 262 60 89 LLMAO 16.9% 30.2% 0.53 0.11 0.52 0.66 0.58
CWE-362 169 280 81 121 CppCheck 49.1% 0.9% 0.50 0.00 0.50 0.02 0.04
CWE-119 129 163 42 53 DeepDFA 11.5% 37.5% 0.51 0.00 0.50 0.76 0.60
CWE-119 LineVul 19.8% 32.1% 0.49 0.04 0.49 0.62 0.55
CWE-787 122 170 45 62 LLMAO 45.3% 2.8% 0.52 0.04 0.63 0.09 0.16
CppCheck 48.4% 1.6% 0.50 0.02 0.50 0.03 0.06
DeepDFA 9.8% 40.7% 0.50 0.00 0.49 0.80 0.61
CWE-787 LineVul 4.0% 46.8% 0.50 0.02 0.50 0.92 0.65
3.2 Studied Baselines LLMAO 41.9% 2.4% 0.56 0.11 0.77 0.16 0.27
We evaluate the following state-of-the-art (SOTA) vulnerability CppCheck 49.5% 0.3% 0.50 0.01 0.60 0.01 0.02
DeepDFA 8.7% 41.4% 0.50 0.01 0.49 0.82 0.62
detection techniques on our benchmark PairVul. Overall LineVul 6.3% 43.8% 0.50 0.02 0.50 0.87 0.64
LLMAO 29.7% 16.4% 0.54 0.10 0.55 0.41 0.47
• LLMAO [8]: An LLM-based fault localization approach fine- Uniform Guess 0.0% 100% 0.50 0.00 0.50 1.00 0.67
tuning LLM (i.e., CodeGen), which has also been fine-tuned on
the Devign dataset for vulnerability detection.
3.4 Results
• LineVul [6]: A PLM-based vulnerability detection model, offer-
ing both function-level and line-level detection granularity. As shown in Table 3, existing techniques exhibit limited effective-
ness on our benchmark PairVul. In particular, compared to the
• DeepDFA [3]: A GNN-based detection technique with data flow
effectiveness reported in previous benchmark (e.g., 0.99 accuracy of
analysis-guided graph learning framework, which is designed
LineVul on BigVul [6]), existing techniques perform much poorer
for function-level vulnerability detection.
on PairVul (ranging from 0.50 to 0.54 accuracy), which shows even
• Cppcheck [15]: A widely-used open-source static analysis tool. lower accuracy and F1 than the uniform guess (i.e., identifying
We directly utilize the public implementation of all baselines. To all instances as vulnerable). In particular, the pairwise accuracy
adapt the techniques to our benchmark, we fine-tune the learning- ranges from 0.01 to 0.10, indicating that existing learning-based
based baselines on our training set for 10 epochs. Additionally, as techniques fail to capture the subtle difference between similar vul-
LLMAO is a line-level vulnerability detection technique, we adapt nerable code and non-vulnerable code. The observations imply that
it to the function level by regarding the function that has any line the learning-based models have limited capability of understanding
with higher-than-0.5 suspiciousness scores as vulnerable. the semantics related to the vulnerability.
Our insight. In fact, two code snippets with subtle lexical dif-
3.3 Metrics ference can have different semantics (i.e., different functionalities).
To further evaluate the capability in distinguishing a pair of vul- Therefore, identifying vulnerability based on the high-level code
nerable code and non-vulnerable code with high lexical similarity, semantics can help better distinguish the vulnerable code from the
Vul-RAG: Enhancing LLM-based Vulnerability Detection
via Knowledge-level RAG ,,

similar-but-correct code. In particular, based on how developers 4.2 Vulnerability Knowledge Base Construction
manually identify a vulnerability, understanding a vulnerability To comprehensively summarize a vulnerability, we propose a three-
often involves the code semantics from the three dimensions: (i) dimension representation of vulnerability knowledge (in Section 4.2.1).
the functionality the code is implementing, (ii) the causes for the Based on the knowledge representation, Vul-RAG leverages LLMs
vulnerability, and (iii) the fixing solution for the vulnerability. Such to extract the relevant vulnerability knowledge from existing CVE
high-level code semantics can serve as the knowledge for vulnera- instances, which further forms the knowledge base (in Section 4.2.2).
bility detection. Therefore, in this work, we propose to distinguish
the vulnerable code from the similar-but-correct code by enhanc- 4.2.1 Vulnerability Knowledge Representation. Vul-RAG represents
ing LLMs with high-level vulnerability knowledge. In particular, the vulnerability knowledge of a CVE instance from three dimen-
we first leverages LLMs to automatically construct a vulnerabil- sions: functional semantics, vulnerability causes, and fixing solu-
ity knowledge base from existing vulnerability instances, which tions. Figure 3 exemplifies the three-dimension representation for
further are utilized to boost LLMs in vulnerability detection. CVE-2022-38457. In this case, the vulnerable code accesses a shared
data structure within an RCU read lock context without proper
synchronization mechanism, allowing a race condition and use-
after-free vulnerability. To fix this vulnerability, the patched code
add a spin lock to protect the shared data structure.
4 APPROACH • Functional Semantics: It summarizes the high-level functional-
4.1 Overview ity (i.e., what this code is doing) of the vulnerable code, including:
In this work, we present a novel LLM-based vulnerability detection – Abstract purpose: brief summary of the code intention.
technique Vul-RAG, which leverages knowledge-level RAG frame- – Detailed behavior: detailed description of the code behavior.
work to detect vulnerability in the given code. The main idea of • Vulnerability Causes: It describes the reasons for triggering
Vul-RAG is to leverage LLM to reason for vulnerability detection vulnerable behaviors by comparing the vulnerable code and its
based on the similar vulnerability knowledge from existing vul- corresponding patch. We consider causes described in different
nerabilities. Figure 2 shows the overview of our approach, which perspectives, including:
includes the following three phases. – Abstract vulnerability description: brief summary of the cause.
• Phase-1: Offline Vulnerability Knowledge Base Construc- – Detailed vulnerability description: more concrete descriptions
tion (Section 4.2): Vul-RAG first constructs a vulnerability of the causes.
knowledge base by extracting multi-dimension knowledge via – Triggering action: the direct action triggering the vulnerability,
LLMs from existing CVE instances. e.g., “concurrent access to shared data structures” in Figure 3.
• Phase-2: Online Vulnerability Knowledge Retrieval (Sec- • Fixing Solutions: It summarizes the fixing of the vulnerability
tion 4.3). For a given code snippet, Vul-RAG retrieves the rele- by comparing the vulnerable code and its corresponding patch.
vant vulnerability knowledge from the constructed knowledge
base based on functional semantics. 4.2.2 Knowledge Extraction. For each existing vulnerability in-
• Phase-3: Online Knowledge-Augmented Vulnerability De- stance (i.e., the vulnerable code and its patch), Vul-RAG prompts
tection (Section 4.4). Vul-RAG leverages LLMs to check the LLMs to extract three-dimension knowledge, and then abstracts
vulnerability of the given code snippet by reasoning the pres- the extracted knowledge to facilitate more general representation.
ence of vulnerability causes and fixing solutions of the retrieved We then explain each step in detail.
vulnerability knowledge. Functional Semantics Extraction. Given the vulnerable code
snippet, Vul-RAG prompts LLMs with the following instructions
to summarize both the abstract purpose and the detailed behavior
Functional Vulnerability Causes and
Knowledge respectively, where the placeholder “[Vulnerable Code]” denotes
Semantics Fixing Solutions
Abstraction
Extraction Extraction
the vulnerable code snippet.
CVE Corpus Prompt for Abstract Purpose Extraction: [Vulnerable Code]
Vulnerability
Knowledge Vulnerability What is the purpose of the function in the above code snippet? Please
Extraction Knowledge Base
summarize the answer in one sentence with the following format:
① Vulnerability Knowledge Base Construction “Function purpose:”.
Functional Vulnerability Prompt for Detailed Behavior Extraction: [Vulnerable Code]
Retrieval Documents/
Top N Related
Documents/
Semantics Knowledge Please summarize the functions of the above code snippet in the list
Query Source Code
vulnerability
Source Code
Extraction Retrieval
knowledge format without any other explanation: “The functions of the code
② Vulnerability Knowledge Retrieval
snippet are: 1. 2. 3...”
LLM-based Vulnerability
Vulnerability Detection The example output of functional semantics is in Figure 3.
Detection Prompt
Source Code
User Detection Results & Vulnerability Causes and Fixing Solutions Extraction. As the
Related Vulnerability
Under Detection Knowledge causes and fixing solutions are often logically connected, Vul-RAG
③ Knowledge-Augmented Vulnerability Detection
extracts them together so that the reasoning capabilities of LLMs
Figure 2: Overview of Vul-RAG can be better utilized. In particular, Vul-RAG incorporates two
rounds of extraction, i.e., the first round asks LLMs to explain why
,, Xueying Du, Geng Zheng, Kaixin Wang, Jiayi Feng, Wentai Deng, Mingwei Liu, Bihuan Chen, Xin Peng, Tao Ma, and Yiling Lou

CVE ID
A use-after-free(UAF) vulnerability was found in function 'vmw_cmd_res_check' in
Abstract Purpose: Look up a TTM base object associated with a given
drivers/gpu/vmxgfx/vmxgfx_execbuf.c in Linux kernel's vmwgfx driver with device key in a TTM object file.
file '/dev/dri/renderD128 (or Dxxx)'. CVE Description Detailed Behavior: 1. Look up a TTM base object using a key in a TTM
struct ttm_base_object *ttm_base_object_lookup(struct ttm_object_file
object file. 2.Acquire a reference to the base object if found successfully.
*tfile, uint64_t key) 3. Return the base object if a reference is acquired, otherwise return
{ NULL. Functional Semantics
... Functional Semantics
rcu_read_lock(); Extraction Prompt
ret = ttm_tfile_find_ref_rcu(tfile, key, &hash);
if (likely(ret == 0)) { Abstract Vulnerability Description: Use of RCU read lock without proper
base = hlist_entry(hash, struct ttm_ref_object, hash)->obj; synchronization mechanism to protect shared data structures.
if (!kref_get_unless_zero(&base->refcount))
base = NULL; }
Trigger Action: Concurrent access to the shared data structure while
rcu_read_unlock(); Knowledge under RCU read lock context.
return base;
} ❌ Vulnerable Code Abstraction
Prompt
Detailed Vulnerability Description: The code accesses a shared data
structure within an RCU read lock context without ensuring exclusive
struct ttm_base_object *ttm_base_object_lookup(struct ttm_object_file access, allowing a race condition where the object can be freed while
*tfile, uint64_t key) still being accessed. Vulnerability Causes
{ Initial Extracted
... Vulnerability
spin_lock(&tfile->lock); Patch Diff Cause
ret = ttm_tfile_find_ref(tfile, key, &hash);
Initial Extracted
Solution Description: To mitigate the vulnerability, it is necessary to
if (likely(ret == 0)) {
Fixing Solution use a synchronization mechanism, such as a spin lock, to protect the
base = hlist_entry(hash, struct ttm_ref_object, hash)->obj;
if (!kref_get_unless_zero(&base->refcount))
shared data structure from concurrent modifications. By replacing the
base = NULL;} RCU read lock with a spin lock, exclusive access to the data structure
spin_unlock(&tfile->lock); is ensured, preventing the race condition and use-after-free
Vulnerability Cause vulnerability....
return base; & Fixing Solution Fixing Solution
} √ Patch Code Extraction Prompt

Instance-level Vulnerability Knowledge Extraction Input Extracted Vulnerability Knowledge

Figure 3: An Example of Vulnerability Knowledge Extraction from Historical Commit of CVE-2022-38457

the modification of the vulnerable code snippet is necessary and the vulnerability causes and fixing solutions. We do not abstract func-
second round asks LLMs to further summarize causes and fixing so- tional semantics, as it is utilized only during the retrieval phase,
lutions based on the explanations generated in the first round. Such and not provided as enhanced knowledge to LLMs during vulnera-
a two-step strategy is based on the CoT paradigm, which inspires bility detection process. We then describe the knowledge abstraction
LLM reasoning capabilities by thinking step-by-step and further guidelines and examples as follows.
results in better extraction [12, 13, 31, 32]. In addition, to enable • Abstracting Method Invocations. The extracted knowledge might
LLMs to summarize the causes and solutions in the proper formats, contain concrete method invocations with detailed function
Vul-RAG incorporates few-shot learning by including two demon- identifiers (e.g., io_worker_handle_work function) and pa-
stration examples of vulnerability causes and fixing solutions due rameters (e.g., mutex_lock(&dmxdev->mutex)), which can be
to the limited input length of GPT models. Following the vulnerabil- abstracted into the generalized description (e.g., “during handling
ity knowledge representation outlined in Section 4.4, we manually of IO work processes” and “employing a locking mechanism akin
construct two examples. The detailed prompts are as follows, where to mutex_lock()”).
the placeholders “[Vulnerable Code]”, “[Patched Code]”, and • Abstracting Variable Names and Types. The extracted knowledge
“[Patch Diff]” denote the vulnerable code, the patched code, and might contain concrete variable names or types (e.g., “without
the code diff of the given vulnerability, and [CVE ID] and [CVE &dev->ref initialization”), which can be abstracted into the more
Description] denote the details of the given vulnerability. general description (e.g., “without proper reference counter ini-
Extraction Prompt in Round 1: This is a code snippet with tialization”).
a vulnerability [CVE ID]: [Vulnerable Code] The vulnerability is Vul-RAG incorporates the following prompt to leverage LLMs
described as follows:[CVE Description] The correct way to fix it is by for knowledge extraction, which queries LLMs to abstract the
[Patch Diff] The code after modification is as follows: [Patched method invocations and variable names.
Code] Why is the above modification necessary?
Prompt for Knowledge Abstraction: With the detailed vulner-
Extraction Prompt in Round 2: I want you to act as a vulnera-
ability knowledge extracted from the previous stage, your task is to
bility detection expert and organize vulnerability knowledge based
abstract and generalize this knowledge to enhance its applicability
on the above vulnerability repair information. Please summarize the
across different scenarios. Please adhere to the following guidelines
generalizable specific behavior of the code that leads to the vulnera-
and examples provided:
bility and the specific solution to fix it. Format your findings in JSON.
[Knowledge Abstraction Guidelines and Examples] ...
Here are some examples to guide you on the level of detail expected
in your extraction: [Vulnerability Causes and Fixing Solution Example The final output is the three-dimension knowledge of each vul-
1] [Vulnerability Causes and Fixing Solution Example 2] nerability instance (i.e., denoted as a knowledge item). In particular,
Knowledge Abstraction. Different vulnerability instances might given a set of existing vulnerability instances (i.e., the training
share common high-level knowledge (e.g., the similar causes and constructed from PairVul as mentioned in Section 3.1), we repeat
fixing solutions), and thus abstracting the high-level commonality the extraction procedure for each vulnerability instance and ag-
among the extracted vulnerability knowledge can further distill gregate the extracted knowledge items of all instances as the final
more general knowledge representation less bonded to concrete vulnerability knowledge base.
code implementation details.
To this end, Vul-RAG leverages LLMs to abstract high-level 4.3 Vulnerability Knowledge Retrieval
knowledge by abstracting the following concrete code elements For a given code snippet for vulnerability detection, Vul-RAG re-
(i.e., method invocations, variable names, and types) in the extracted trieves relevant vulnerability knowledge items from the constructed
Vul-RAG: Enhancing LLM-based Vulnerability Detection
via Knowledge-level RAG ,,

vulnerability knowledge base in a three-step retrieval process: query enhances LLMs with each retrieved knowledge item by sequentially
generation, candidate knowledge retrieval, and candidate knowl- checking whether the given code exhibits the same vulnerability
edge re-ranking. cause or the same fixing solutions.
Query Generation. Instead of relying solely on the code as the If the given code exhibits the same vulnerability cause of the
retrieval query, Vul-RAG incorporates both the code and its func- knowledge item but lacks the relevant fixing solution, it is identi-
tional semantics as a multi-dimension query. Firstly, Vul-RAG fied as vulnerable. Otherwise, Vul-RAG cannot identify the code as
prompts LLMs to extract the functional semantics of the given code, vulnerable with the current knowledge item and proceeds to the
as described in the knowledge base construction (Section 4.2.2). The next iteration (i.e., using the next retrieved knowledge item). If the
abstract purpose, detailed behavior, and the code itself, form the code cannot be identified as vulnerable with any of the retrieved
query for the subsequent retrieval. knowledge items, it is finally identified as non-vulnerable. The itera-
Candidate Knowledge Retrieval. Vul-RAG conducts similarity- tion process terminates when (i) the code is identified as vulnerable
based retrieval using three query elements: the code, the abstract or (ii) all the retrieved knowledge items have been considered.
purpose, and the detailed behavior. It separately retrieves the top- In particular, the prompts used for identifying the existence of
n (where n = 10 in our experiments) knowledge items for each vulnerability causes and the fixing solutions are as follows.
query element. Consequently, Vul-RAG retrieves a total of 10 to Prompt for Finding Vulnerability Causes: Given the follow-
30 candidate knowledge items (accounting for potential duplicates ing code and related vulnerability causes, please detect if there is a
among the items retrieved across the different query elements). The vulnerability cause in the code. [Code Snippet]. In a similar code sce-
retrieval is based on the similarity between each query element nario, the following vulnerabilities have been found: [Vulnerability
and the corresponding elements of the knowledge items. Vul-RAG causes][fixing solutions]. Please use your own knowledge of vulner-
adopts BM25 [33] for similarity calculation, a method widely used abilities and the above vulnerability knowledge to detect whether
in search engines due to its efficiency and effectiveness [11]. Given a there is a vulnerability in the code.
query 𝑞 and the documentation 𝑑 for retrieval, BM25 calculates the Prompt for Finding Fixing Solutions: Given the following
code and related vulnerability fixing solutions, please detect if there
similarity score between 𝑞 and 𝑑 based on the following Equation 1,
is a vulnerability in the code. [Code Snippet]. In a similar code sce-
where 𝑓 (𝑤𝑖 , 𝑞) is the word 𝑤𝑖 ’s term frequency in query 𝑞, 𝐼𝐷𝐹 (𝑤𝑖 ) nario, the following vulnerabilities have been found: [Vulnerability
is the inverse document frequency of word 𝑤𝑖 . The hyperparame- causes][fixing solutions]. Please use your own knowledge of vulner-
ters 𝑘 and 𝑏 (where k=1.2 and b=0.75) are used to normalize term abilities and the above vulnerability knowledge to detect whether
frequencies and control the influence of document length. Before there is a corresponding fixing solution in the code.
calculating BM25 similarity, both the query and the retrieval docu-
mentation undergo standard preprocessing procedures, including
tokenization, lemmatization, and stop word removal [34].
𝑛 5 EVALUATION SETUP
∑︁ IDF (wi ) × f (wi , q) × (k + 1)
𝑆𝑖𝑚𝐵𝑀25 (𝑞, 𝑑 ) = (1) We evaluate the effectiveness and usefulness of Vul-RAG by an-
|q|
𝑖=1 f (wi , q) + k × 1-b + b × avgdl
swering the following four research questions:
Candidate Knowledge Re-ranking. We re-rank candidate knowl-
• RQ1: Compared to SOTA techniques: How does Vul-RAG
edge items with the Reciprocal Rank Fusion (RRF) strategy. For each
perform compared to state-of-the-art (SOTA) vulnerability de-
retrieved knowledge item 𝑘, we calculate its re-rank score by aggre-
tection techniques?
gating the reciprocal of its rank across all three query elements. If a
• RQ2: Compared to GPT-4-based techniques: How does Vul-
knowledge item 𝑘 is not retrieved by a particular query element, we
RAG perform compared to GPT4-based detection techniques?
assign its rank as infinity. The re-rank score for 𝑘 is calculated using
• RQ3: Usefulness for developers: Can the vulnerability knowl-
the following Equation 2. 𝐸 denotes the set of all query elements (i.e.,
edge generated by Vul-RAG help developers in manual vulner-
the code, the abstract purpose, and the detailed behavior), 𝑟𝑎𝑛𝑘𝑡 (𝑘)
ability detection?
denotes the rank of knowledge item 𝑘 based on query element 𝑡.
• RQ4: Bad Case Analysis: Why does Vul-RAG fail in detecting
some vulnerabilities?
∑︁ 1
𝑅𝑒𝑅𝑎𝑛𝑘𝑆𝑐𝑜𝑟𝑒𝑘 = (2)
𝑟𝑎𝑛𝑘𝑡 (𝑘)
𝑡 ∈𝐸
In this end, we obtain the top 10 candidate knowledge items 5.1 Implementation
with the highest re-rank scores as the final knowledge items to be
We build Vul-RAG on the top of GPT series models. In particu-
provided to the LLMs for vulnerability detection.
lar, for the offline knowledge base construction, given the large
number of vulnerability knowledge items to be generated, we use
4.4 Knowledge-Augmented Vulnerability the gpt-3.5-turbo-0125 model [36] due to its rapid response and
Detection cost-effectiveness [11]; for the online knowledge-augmented detec-
Based on the retrieved knowledge items, Vul-RAG leverages LLMs tion, we use the GPT-4 model [37] as it is currently one of the most
to reason whether the given code is vulnerable. However, directly effective LLMs with superior understanding and logical reasoning
incorporating all the retrieved knowledge items into one prompt capabilities [38]. For the knowledge retrieval process, we utilize
can hinder the effectiveness of the models, as LLMs often perform Elasticsearch [39] as our search engine, which based on the Lucene
poorly on lengthy contexts [35]. Therefore, Vul-RAG iteratively library using BM25 as the default score function.
,, Xueying Du, Geng Zheng, Kaixin Wang, Jiayi Feng, Wentai Deng, Mingwei Liu, Bihuan Chen, Xin Peng, Tao Ma, and Yiling Lou

6 RESULTS Table 4: Effectiveness of Vul-RAG and GPT-4-based tech-

niques (RQ1& RQ2)
6.1 RQ1: Compared to SOTA techniques
CWE Tech. FN FP Acc. Pair Acc. Precis. Recall F1
In RQ1, we evaluate Vul-RAG with the same setting of our pre- Basic GPT-4 42.5 % 6.4 % 0.52 0.05 0.55 0.19 0.29
CWE-416 Code-based 38.2 % 11.6 % 0.50 0.01 0.50 0.24 0.32
liminary study (Section 3), including the same benchmark (i.e., Vul-RAG 17.8 % 21.2 % 0.61 0.22 0.60 0.64 0.62
PairVul), same metrics, and the same baselines (i.e., LLMAO, Line- Basic GPT-4 43.3 % 7.9 % 0.47 0.04 0.42 0.17 0.24
CWE-476 Code-based 37.1 % 12.9 % 0.50 0.00 0.50 0.26 0.34
Vul, DeepDFA, and Cppcheck). To space limits, we do not repeat Vul-RAG 23.0 % 15.2 % 0.62 0.22 0.64 0.54 0.59
the results of the baselines (previously presented in Table 3) and Basic GPT-4 42.1 % 9.1 % 0.47 0.04 0.44 0.18 0.26
CWE-362 Code-based 40.1 % 9.9 % 0.50 0.01 0.50 0.20 0.28
we present the results of Vul-RAG in Table 4. Based on two tables, Vul-RAG 19.6 % 21.7 % 0.59 0.20 0.58 0.61 0.60
we have the following findings. Basic GPT-4 34.0 % 20.8 % 0.46 0.09 0.45 0.34 0.39
CWE-119 Code-based 37.7 % 12.3 % 0.50 0.02 0.50 0.25 0.33
First, Vul-RAG achieves the highest accuracy (i.e., 0.61) and Vul-RAG 17.9 % 19.8 % 0.62 0.23 0.62 0.64 0.63
pairwise accuracy (0.21) among all baselines, which substantially Basic GPT-4 36.3 % 10.5 % 0.52 0.06 0.55 0.29 0.38
CWE-787 Code-based 37.1 % 11.3 % 0.52 0.03 0.54 0.26 0.35
outperforms the best baseline LLMAO by 12.96% and 110% relative Vul-RAG 22.1 % 17.2 % 0.61 0.18 0.62 0.56 0.59
improvements. The significant improvements in pairwise accuracy Basic GPT-4 41.1 % 8.9 % 0.49 0.05 0.49 0.21 0.30
Overall Code-based 38.3 % 11.5 % 0.50 0.01 0.51 0.23 0.32
shows the advantage of Vul-RAG in distinguishing between vul- Vul-RAG 19.4 % 19.8 % 0.61 0.21 0.61 0.61 0.61
nerable code and similar-but-correct code. (A)). GPT-4 incorrectly suggests that the absence of a return value
Additionally, Vul-RAG achieves the best trade-off between recall check in “platform_get_irq_byname()” could cause a vulnerabil-
and precision, with these two metrics both being 0.61, respectively. ity, whereas such a check is not required here. However, it overlooks
Although these scores are not the highest individually, other base- the true issue, which is the improper handling of asynchronous
lines with higher scores in one metric often fail short in another. events resulting in a race condition and subsequently a use-after-
For example, LineVul with the highest recall of 0.87, tends to predict free vulnerability. This misunderstanding continues as GPT-4 de-
most code as vulnerable, leading to a low precision of 0.50 (same tects the corresponding patched code, leading to false positives and
as the uniform guess). In particular, we consider the F1 metric affecting the pairwise accuracy. Enhancing GPT-4 with code-based
with less practical insight in our benchmark, as the uniform guess RAG also fails to detect the vulnerability. As shown in Figure 4 (B),
achieves the highest F1 (with 1.0 recall and 0.5 precision). However, although the retrieved code pair contains a similar functional se-
it provides limited practical benefit to suggest all code as vulnerable mantic and vulnerability cause, GPT-4 still struggles to associate the
for developers. Nevertheless, the overall limited effectiveness of all vulnerability knowledge implied in the retrieved source code with
techniques indicates that capturing the subtle semantic difference the target code under detection. In contrast, providing the distilled
is very challenging, which calls for more awareness from the future high-level vulnerability knowledge from our approach Vul-RAG,
work. GPT-4 not only successfully detects the vulnerability root cause
in the vulnerable code but also accurately identifies the patched
6.2 RQ2: Compared to GPT-4-based techniques code (Figure 4 (C)). The comparison demonstrates the high-level
RQ2 evaluates the usefulness of the knowledge-level RAG frame- vulnerability knowledge can effectively help LLMs understand the
work by comparing Vul-RAG with two GPT-4-based baselines, i.e., behavior of the vulnerable code, thereby improving the accuracy
the basic GPT-4-based one and the GPT4-based one enhanced with of vulnerability detection.
code-level RAG. Retrieval Strategy. Figure 5 compares the retrieving outcomes
of code-based retrieval (i.e., retrieving only by code snippet) and
6.2.1 Baselines. our retrieval strategy (i.e., retrieving by both code snippet and ex-
• Basic GPT-4 directly queries GPT-4 for vulnerability detection, tracted functional semantics) for the given code snippet. As shown
with a role-play prompt design outlined in Figure 4 (a). It indi- in Figure 5, when detecting a given code snippet from CVE-2023-
cates the basic capabilities of GPT-4 in vulnerability detection. 1989, the code-based retrieval finds a code snippet (from CVE-2021-
• Code-based RAG enhances the basic GPT-4 by retrieving simi- 33034) that shares more operational resources with the target code
lar code snippets from training datasets into the prompts. The (highlighted in yellow), but differ significantly in their functional
prompt design is detailed as Figure 4 (b). Comparing Vul-RAG semantics, leading to disparate root causes of vulnerabilities. In
with this baseline can study the contribution of knowledge-level contrast, our retrieval strategy finds a code snippet (from CVE-
RAG over Code-level RAG. 2023-1855) that shares more semantic similarity with the target
6.2.2 Results. Table 4 shows the comparison between Vul-RAG code (highlighted in green). Furthermore, they share an identical
and two GPT-4-based baselines. Overall, Vul-RAG consistently vulnerability root cause, which lies in the failure to adequately
outperforms two baselines regarding all metrics, demonstrating the handle asynchronous events during the device removal process.
effectiveness of our knowledge-level RAG framework. Specifically, This indicates that our retrieval strategy can help LLMs find code
we use two examples that Vul-RAG can successfully detect the pairs with more similar vulnerability causes.
vulnerability but two baselines cannot, to explain the superiority
of Vul-RAG over the two baselines.
Knowledge Representation. Figure 4 illustrates an example to 6.3 RQ3: Usefulness for Developers
show the benefits of our knowledge representation. When detect- In RQ3, we conduct a user study to investigate whether the vul-
ing the given code from CVE-2023-30772, the basic GPT-4 fails to nerability knowledge generated by Vul-RAG can help developers
identify the real cause of the vulnerability (as shown in Figure 4 identify vulnerable code more precisely.
Vul-RAG: Enhancing LLM-based Vulnerability Detection
via Knowledge-level RAG ,,

Q: I want you to act as a vulnerability detection expert. Q: I want you to act as a vulnerability detection expert, Q: I want you to act as a vulnerability detection expert,
Given the following code, please detect whether there is a given the following code snippet and related vulnerability given the following code snippet and related vulnerability
vulnerability in the following code snippet: knowledge, please detect whether there is a vulnerability in knowledge, please detect whether there is a vulnerability
the code snippet: [Code Snippet] ... in the code snippet: [Code Snippet] In a similar code
static int da9150_charger_remove(struct platform_device *pdev)
Related vulnerable code snippet: scenario, the following vulnerabilities have been found.
{
Related vulnerability knowledge:
struct da9150_charger *charger = static int cedrus_remove(struct platform_device *pdev)
platform_get_drvdata(pdev); { Vulnerability Causes:
int irq; struct cedrus_dev *dev = platform_get_drvdata(pdev); ...Lack of proper cancellation of pending work associated
/* Make sure IRQs are released before unregistering with a specific functionality during device removal
power supplies */ Patch diff：cancel_delayed_work_sync process...The code does not cancel delayed work
irq = platform_get_irq_byname(pdev, "CHG_VBUS"); (&dev-> watchdog_work);
if (media_devnode_is_registered(dev->mdev.devnode)) associated with a specific functionality before proceeding
free_irq(irq, charger); { with device removal. This can result in a scenario where the
... media_device_unregister(&dev->mdev); device structure is accessed after it has been freed,
if (!IS_ERR_OR_NULL(charger->usb_phy)) v4l2_m2m_unregister_media_controller(dev- potentially leading to a use-after-free vulnerability....
usb_unregister_notifier(charger->usb_phy, >m2m_dev);
&charger->otg_nb); media_device_cleanup(&dev->mdev); Fixing Solution:
}
Patch diff：add cancel_work_sync To mitigate the vulnerability, it is necessary to cancel any
(&charger->otg_work); v4l2_m2m_release(dev->m2m_dev); pending work related to the specific functionality
power_supply_unregister(charger->battery); video_unregister_device(&dev->vfd); before proceeding with further cleanup during device
power_supply_unregister(charger->usb); v4l2_device_unregister(&dev->v4l2_dev); removal. This ensures that the code handles asynchronous
events properly and prevents the use-after-free
/* Release ADC channels */ cedrus_hw_remove(dev); [Vulnerable Source vulnerability... [Vulnerability Knowledge]
... [Code Snippet] return 0;
Code Snippet] ...Please answer YES or NO...
return 0; }
} Related Patched code snippet: [Patched Code Snippet]
A: YES, ..., its pattern of handling shared resources aligns
Please answer YES or NO, and explain why you think so. ...Please answer YES or NO...
with the preconditions and behaviors that can lead to
A: YES, there is a potential vulnerability in the code snippet A: NO, ..., Without clear evidence of asynchronous vulnerabilities as described. Without explicit
provided... The code does not check the return value of operations that need to be canceled in the same way as the synchronization or cancellation of asynchronous
`platform _get_irq_byname()` before calling `free_irq()`... xgene_hwmon device, it's difficult to directly label this operations before resource release, the code is potentially
function as vulnerable based solely on the provided vulnerable to race conditions and use-after-free issues...
❌ Fails to identify the root information...
cause of the vulnerability. ❌ Fails to identify the associations therein √ Successfully identify the root cause.

(A) (B) (C)

Figure 4: An example of vulnerability knowledge representation

Code Purpose: Handle a logical link complete event in the Code Purpose: To remove a Bluetooth SDIO Code Purpose: To remove the xgene_hwmon device
Bluetooth stack. function and free associated resources. and free associated resources.
Code Function: 1. Log a logical link complete event. 2. Look up a Code Function: 1. Remove a Bluetooth SDIO Code Function: 1. Remove the xgene_hwmon device
HCI connection based on the physical handle. 3... 6. Confirm the function. 2.Retrieve the Bluetooth SDIO data. 3. associated with the platform device. 2. Unregister the
logical link for a BREDR channel. 7. Hold the HCI connection. Unregister and free the HCI device. hardware monitoring device. 3. ...

static void hci_loglink_complete_evt(struct hci_dev *hdev, struct sk_buff static void btsdio_remove(struct sdio_func *func) static int xgene_hwmon_remove(struct platform_device
*skb) { *pdev)
{ struct btsdio_data *data = {
...
sdio_get_drvdata(func); struct xgene_hwmon_dev *ctx =
BT_DBG("%s log_handle 0x%4.4x phy_handle 0x%2.2x status
0x%2.2x", hdev->name, le16_to_cpu(ev->handle), ev->phy_handle, ev- platform_get_drvdata(pdev);
struct hci_dev *hdev;
>status);
BT_DBG("func %p", func); hwmon_device_unregister(ctx->hwmon_dev);
hcon = hci_conn_hash_lookup_handle(hdev, ev->phy_handle);
if (!data) kfifo_free(&ctx->async_msg_fifo);
if (!hcon)
return; if (acpi_disabled)
return;
... hdev = data->hdev; mbox_free_channel(ctx->mbox_chan);
hchan->handle = le16_to_cpu(ev->handle); sdio_set_drvdata(func, NULL); else
BT_DBG("hcon %p mgr %p hchan %p", hcon, hcon->amp_mgr, hci_unregister_dev(hdev); pcc_mbox_free_channel(ctx->pcc_chan);
hchan); hci_free_dev(hdev);
... ❌ Irrelevant code with return 0; √ Relevant code with
} }
} different vulnerability similar vulnerability

CVE-2021-33034 CVE-2023-1989 CVE-2023-1855

Code-based Retrieval Result Code Snippet Under Retrieval Functional Semantics based Retrieval Result

Figure 5: An example of knowledge retrieval strategy

6.3.1 User study methodology. Tasks. We select 10 cases from the and the vulnerability knowledge generated by Vul-RAG. In particu-
benchmark PairVul for a user study. Specifically, we randomly select lar, the participants in 𝐺𝐴 are tasked to identify vulnerability in 𝑇𝐴
two cases from each of the five CWE categories PairVul, including with the knowledge-accompanied setting, and to identify vulnera-
both true positive (i.e., genuinely vulnerable code snippets) and bility in 𝑇𝐵 with the basic setting; conversely, the participants in 𝐺 𝐵
false positive (i.e., correct code snippets mistakenly predicted by are tasked to identify vulnerability in 𝑇𝐴 with the basic setting, and
Vul-RAG as vulnerable) instances. To ensure a balanced evaluation, to identify vulnerability in 𝑇𝐵 with the knowledge-accompanied
we randomly assign the two cases from each CWE category into setting. In addition to recording the outputs (i.e., vulnerable or
two equal groups (𝑇𝐴 and 𝑇𝐵 ), with each group comprising 5 cases. not) of each participant, we further survey the participants on
Participants. We invite 6 participants with 3-5 years c/c++ pro- the helpfulness, preciseness, and generalizability of the vulnera-
gramming experience for the user study. We conduct a pre-experiment bility knowledge on a 4-points Likert scale [40] (i.e., 1-disagree;
survey on their c/c++ programming expertise, based on which they 2-somewhat disagree; 3-somewhat agree; 4-agree) as follows:
are divided into two participant groups (𝐺𝐴 and 𝐺 𝐵 ) of similar • Helpfulness: The vulnerability knowledge provided by Vul-
expertise distribution. RAG is helpful in understanding the vulnerability and verifying
Procedure. Each participant is tasked to identify whether the given detection labels.
code snippet is vulnerable. For comparison, participants are asked to • Preciseness: The vulnerability knowledge offer precise and
identify vulnerability in two settings, i.e., (i) basic setting: provided detailed descriptions of the vulnerability, avoiding overly generic
with the given code snippets and the detection labels generated by narratives that do not adequately identify the root cause.
Vul-RAG, or (ii) knowledge-accompanied setting: provided with
the given code snippets, the detection labels generated by Vul-RAG,
,, Xueying Du, Geng Zheng, Kaixin Wang, Jiayi Feng, Wentai Deng, Mingwei Liu, Bihuan Chen, Xin Peng, Tao Ma, and Yiling Lou

• Generalizability: The vulnerability knowledge maintain a de- with the current knowledge base. This limitation is inherent
gree of general applicability, eschewing overly specific descrip- to the RAG-based framework. In future work, we will further
tions that diminish its broad utility (e.g., narratives overly reliant extend the knowledge base by extracting more CVE information
on variable names from the source code). to mitigate this issue.
In addition, the reasons for false positive can be classified into
6.3.2 Results. Compared to the basic setting, participants provided the following two categories:
with vulnerability knowledge generated by Vul-RAG can more • Mismatched Fixing Solutions. There are 11 cases (52.4 %) that
precisely identify the vulnerable and non-vulnerable code (i.e., 77% although Vul-RAG successfully retrieves relevant vulnerability
detection accuracy with knowledge v.s. 60% detection accuracy knowledge, the code snippet is stilled considered as vulnerable,
without knowledge). It indicates that the vulnerability knowledge as it is considered not applied the fixing solution of the retrieved
generated by Vul-RAG is indeed helpful for developers to better knowledge. It is because one vulnerability can be fixed by more
understand the semantics and vulnerabilities in the given code. than one alternative solutions.
In addition, based on the survey feedback, participants rate the • Irrelevant Vulnerability Knowledge Retrieval. There are 10
helpfulness, preciseness, and generalizability with average scores (47.6%) false positives caused by Vul-RAG retrieving irrelevant
of 3.00, 3.20, and 2.97, respectively. The results further indicate vulnerability knowledge. Based on our manual inspection, these
the high quality and usefulness of the vulnerability knowledge incorrectly-retrieved knowledge descriptions often generally
generated by Vul-RAG. contain “missing proper validation of specific values”, which is
Table 5: FN/FP analysis in CWE-119 too general for GPT4 to precisely identify the vulnerability.
Type Reason Number
FN
Inaccurate vulnerability knowledge descriptions. 5 7 THREATS TO VALIDITY
Unretrieved relevant vulnerability knowledge. 2
Non-existent relevant vulnerability knowledge. 12
Threats in benchmarks. There might be potential data leakage issue
Mismatched fixing solutions. 11 between the vulnerability benchmark and the GPT-4 training data.
FP
Irrelevant vulnerability knowledge retrieval 10 Nevertheless, the substantial improvements of Vul-RAG over the
basic GPT-4 can show the effectiveness of Vul-RAG is not simply
due to data memorization. Threats in generalization. Our benchmark
6.4 RQ4: Bad Case Analysis focuses on the Linux kernel CVEs due to its prevalence and rich
To understand the limitation of Vul-RAG, we further manually vulnerability information[41], which might limit the generalization
analyse the bad cases (i.e., false negatives and false positives re- of results. However, our approach is not limited to the Linux kernel
ported by Vul-RAG). In particular, we include all 19 FN and 21 CVEs and can be extended to CVEs of other systems in the future.
FP cases from CWE-119 for manual analysis. Table 5 summarizes In addition, another generalizability issue of Vul-RAG occurs in
the reasons and distributions. In particular, the reasons for false cases that the constructed knowledge base does not contain the rel-
negatives are classified into three primary categories: evant knowledge for the given code under detection, which raises
• Inaccurate Vulnerability Knowledge Descriptions. We ob- concerns about whether the extracted vulnerability knowledge can
serve that for 5 instances (26.3%), Vul-RAG successfully retrieves generalize to detect code snippet from different CVEs. To mitigate
relevant vulnerability knowledge but fails to detect the vulnera- this threat, we manually compile a small-scale benchmark compris-
bility due to the imprecise knowledge descriptions. For example, ing 60 code functions (30 positive and 30 negative samples) across
given the vulnerable code snippet of CVE-2021-4204, although 30 unique CVEs. For each case in this benchmark, we manually
Vul-RAG successfully retrieves the relevant knowledge of the verify the presence of relevant vulnerability knowledge extracted
same CVE, it yields a false negative due to the vague descriptions from other CVEs in the knowledge base. The performance of Vul-
of vulnerability knowledge (i.e., only briefly mentioning “lacks RAG on this benchmark (i.e., a recall rate of 0.83 and a precision
proper bounds checking” in the vulnerability cause and fixing rate of 0.76), demonstrates the generalizability of the extracted
solution description with explicitly stating what kind of bound vulnerability knowledge across different CVEs.
checking should be performed).
• Unretrieved Relevant Vulnerability Knowledge. We observe 8 RELATED WORK
that for 2 cases (15.8%) Vul-RAG fails to retrieve relevant vul- DL-based Vulnerability Detection. Most DL-based work mainly
nerability knowledge, thus leading to false negatives. Although leverages graph neural network (GNN) models and pre-trained
there are instances in the knowledge base that share the similar language models (PLMs) for vulnerability detection. Devign [1] em-
vulnerability root causes and fixing solutions of the given code, ploys GNN to efficiently extract useful features in a joint graph and
their functional semantics are significantly different. Therefore, REVEAL [2] conceptualizes function-level code as a Code Property
Vul-RAG fails to retrieve them from the knowledge base. Graph (CPG) and uses GGNN for CPG embedding. VulChecker [4]
• Non-existent Relevant Vulnerability Knowledge. Based on uses program slicing and a message-passing GNN to precisely locate
our manual checking, the 12 cases (63.2 %) in this category is vulnerabilities in code and classify their type (CWE). DeepDFA [3]
cased by the absence of relevant vulnerability knowledge in our uses a data flow analysis-guided graph learning framework to simu-
knowledge base. Even there are other vulnerable and patched late data flow computation. For PLM-based vulnerability detection,
code pairs of the same CVE, the vulnerability behaviors and VulBERTa [5] uses the RoBERTa model [22] as the encoder, while
fixing solutions are dissimilar, rendering these cases unsolvable Linevul [6] uses attention scores for line-level prediction.
Vul-RAG: Enhancing LLM-based Vulnerability Detection
via Knowledge-level RAG ,,

LLM-based Vulnerability Detection. Wu et al. [42] and Zhou et [9] A. Shestov, A. Cheshkov, R. Levichev, R. Mussabayev, P. Zadorozhny,
al. [43] explore the effectiveness and limits of ChatGPT in software E. Maslov, C. Vadim, and E. Bulychev, “Finetuning large language models for
vulnerability detection,” CoRR, vol. abs/2401.17010, 2024. [Online]. Available:
security applications; Gao et al. [44] build a comprehensive vul- https://doi.org/10.48550/arXiv.2401.17010
nerability benchmark VulBench to evaluate the effectiveness of 16 [10] H. Li, Y. Hao, Y. Zhai, and Z. Qian, “The hitchhiker’s guide to program analysis:
A journey with large language models,” 2023.
LLMs in vulnerability detection. Zhang et al. [7] investigate various [11] Y. Sun, D. Wu, Y. Xue, H. Liu, H. Wang, Z. Xu, X. Xie, and Y. Liu, “When gpt
prompts to improve ChatGPT in vulnerability detection. Yang et meets program analysis: Towards intelligent detection of smart contract logic
al. [8] and Shestov et al. [9] fine-tune LLMs for vulnerability detec- vulnerabilities in gptscan,” 2023.
[12] J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou
tion. Additionally, Li et al. [10] and Sun et al. [11] combine LLMs et al., “Chain-of-thought prompting elicits reasoning in large language models,”
with static analysis for vulnerability detection. Wang et al. [45] Advances in Neural Information Processing Systems, vol. 35, pp. 24 824–24 837,
boosts static analysis with LLM-based intention inference to de- 2022.
[13] Z. Zhang, A. Zhang, M. Li, and A. Smola, “Automatic chain of thought prompting
tect resource leaks. To the best of our knowledge, we are the first in large language models,” arXiv preprint arXiv:2210.03493, 2022.
vulnerability detection technique based on knowledge-level RAG [14] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakan-
tan, P. Shyam, G. Sastry, A. Askell et al., “Language models are few-shot learners,”
framework. In addition, we also make the first attempt to eval- Advances in neural information processing systems, vol. 33, pp. 1877–1901, 2020.
uate existing techniques on distinguishing vulnerable code and [15] (2024) Cppcheck. [Online]. Available: http://cppcheck.net/
similar-but-benign code. [16] J. Fan, Y. Li, S. Wang, and T. N. Nguyen, “A C/C++ code vulnerability dataset with
code changes and CVE summaries,” in MSR ’20: 17th International Conference on
Mining Software Repositories, Seoul, Republic of Korea, 29-30 June, 2020, S. Kim,
9 CONCLUSION G. Gousios, S. Nadi, and J. Hejderup, Eds. ACM, 2020, pp. 508–512. [Online].
Available: https://doi.org/10.1145/3379597.3387501
In this work, we propose a novel LLM-based vulnerability detection [17] (2024) The website of common vulnerabilities and exposures. [Online]. Available:
technique Vul-RAG, which leverages knowledge-level retrieval- https://cve.mitre.org/
augmented generation (RAG) framework to detect vulnerability for [18] (2024) The website of ommon weakness enumeration. [Online]. Available:
https://cwe.mitre.org/
the given code. Overall, compared to four representative baselines, [19] (2024) The website of cwe-416. [Online]. Available: https://cwe.mitre.org/data/
Vul-RAG shows substantial improvements (i.e., 12.96% improve- definitions/416.html
[20] (2024) The website of cve-2023-30772. [Online]. Available: https://cve.mitre.org/
ment in accuracy and 110% improvement in pairwise accuracy). cgi-bin/cvename.cgi?name=CVE-2023-30772
Our user study results show that the vulnerability knowledge can [21] (2024) The website of cve-2023-3609. [Online]. Available: https://cve.mitre.org/
improve the manual detection accuracy from 0.6 to 0.77, and the cgi-bin/cvename.cgi?name=CVE-2023-3609
[22] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis,
user feedback also shows the high quality of generated knowledge L. Zettlemoyer, and V. Stoyanov, “Roberta: A robustly optimized BERT
regarding the helpfulness, preciseness, and generalizability. pretraining approach,” CoRR, vol. abs/1907.11692, 2019. [Online]. Available:
http://arxiv.org/abs/1907.11692
[23] T. Ahmed and P. Devanbu, “Few-shot training llms for project-specific code-
REFERENCES summarization,” in Proceedings of the 37th IEEE/ACM International Conference on
[1] Y. Zhou, S. Liu, J. K. Siow, X. Du, and Y. Liu, “Devign: Effective vulnerability Automated Software Engineering, 2022, pp. 1–5.
identification by learning comprehensive program semantics via graph neural [24] Y. Gao, Y. Xiong, X. Gao, K. Jia, J. Pan, Y. Bi, Y. Dai, J. Sun, Q. Guo, M. Wang, and
networks,” in Advances in Neural Information Processing Systems 32: Annual H. Wang, “Retrieval-augmented generation for large language models: A survey,”
Conference on Neural Information Processing Systems 2019, NeurIPS 2019, 2024.
December 8-14, 2019, Vancouver, BC, Canada, H. M. Wallach, H. Larochelle, [25] P. S. H. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal,
A. Beygelzimer, F. d’Alché-Buc, E. B. Fox, and R. Garnett, Eds., 2019, pp. H. Küttler, M. Lewis, W. Yih, T. Rocktäschel, S. Riedel, and D. Kiela,
10 197–10 207. [Online]. Available: https://proceedings.neurips.cc/paper/2019/ “Retrieval-augmented generation for knowledge-intensive NLP tasks,” in
hash/49265d2447bc3bbfe9e76306ce40a31f-Abstract.html Advances in Neural Information Processing Systems 33: Annual Conference
[2] S. Chakraborty, R. Krishna, Y. Ding, and B. Ray, “Deep learning based vulnerability on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12,
detection: Are we there yet?” IEEE Trans. Software Eng., vol. 48, no. 9, pp. 2020, virtual, H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin,
3280–3296, 2022. [Online]. Available: https://doi.org/10.1109/TSE.2021.3087402 Eds., 2020. [Online]. Available: https://proceedings.neurips.cc/paper/2020/hash/
[3] B. Steenhoek, H. Gao, and W. Le, “Dataflow analysis-inspired deep learning 6b493230205f780e1bc26945df7481e5-Abstract.html
for efficient vulnerability detection,” in Proceedings of the 46th IEEE/ACM [26] E. Shi, Y. Wang, W. Tao, L. Du, H. Zhang, S. Han, D. Zhang, and H. Sun, “RACE:
International Conference on Software Engineering, ICSE 2024, Lisbon, Portugal, retrieval-augmented commit message generation,” in Proceedings of the 2022
April 14-20, 2024. ACM, 2024, pp. 16:1–16:13. [Online]. Available: https: Conference on Empirical Methods in Natural Language Processing, EMNLP 2022,
//doi.org/10.1145/3597503.3623345 Abu Dhabi, United Arab Emirates, December 7-11, 2022, Y. Goldberg, Z. Kozareva,
[4] Y. Mirsky, G. Macon, M. D. Brown, C. Yagemann, M. Pruett, E. Downing, and Y. Zhang, Eds. Association for Computational Linguistics, 2022, pp.
S. Mertoguno, and W. Lee, “Vulchecker: Graph-based vulnerability localization 5520–5530. [Online]. Available: https://doi.org/10.18653/v1/2022.emnlp-main.372
in source code,” in 32nd USENIX Security Symposium, USENIX Security 2023, [27] F. Zhang, B. Chen, Y. Zhang, J. Keung, J. Liu, D. Zan, Y. Mao, J. Lou, and W. Chen,
Anaheim, CA, USA, August 9-11, 2023, J. A. Calandrino and C. Troncoso, “Repocoder: Repository-level code completion through iterative retrieval and
Eds. USENIX Association, 2023, pp. 6557–6574. [Online]. Available: https: generation,” in Proceedings of the 2023 Conference on Empirical Methods in Natural
//www.usenix.org/conference/usenixsecurity23/presentation/mirsky Language Processing, EMNLP 2023, Singapore, December 6-10, 2023, H. Bouamor,
[5] H. Hanif and S. Maffeis, “Vulberta: Simplified source code pre-training for J. Pino, and K. Bali, Eds. Association for Computational Linguistics, 2023, pp.
vulnerability detection,” in International Joint Conference on Neural Networks, 2471–2484. [Online]. Available: https://aclanthology.org/2023.emnlp-main.151
IJCNN 2022, Padua, Italy, July 18-23, 2022. IEEE, 2022, pp. 1–8. [Online]. [28] S. Lu, N. Duan, H. Han, D. Guo, S. won Hwang, and A. Svyatkovskiy, “Reacc: A
Available: https://doi.org/10.1109/IJCNN55064.2022.9892280 retrieval-augmented code completion framework,” ArXiv, vol. abs/2203.07722,
[6] M. Fu and C. Tantithamthavorn, “Linevul: A transformer-based line-level 2022. [Online]. Available: https://api.semanticscholar.org/CorpusID:247450969
vulnerability prediction,” in 19th IEEE/ACM International Conference on Mining [29] A. Sejfia, S. Das, S. Shafiq, and N. Medvidovic, “Toward improved deep
Software Repositories, MSR 2022, Pittsburgh, PA, USA, May 23-24, 2022. ACM, learning-based vulnerability detection,” in Proceedings of the 46th IEEE/ACM
2022, pp. 608–620. [Online]. Available: https://doi.org/10.1145/3524842.3528452 International Conference on Software Engineering, ICSE 2024, Lisbon, Portugal,
[7] C. Zhang, H. Liu, J. Zeng, K. Yang, Y. Li, and H. Li, “Prompt-enhanced software April 14-20, 2024. ACM, 2024, pp. 62:1–62:12. [Online]. Available: https:
vulnerability detection using chatgpt,” CoRR, vol. abs/2308.12697, 2023. [Online]. //doi.org/10.1145/3597503.3608141
Available: https://doi.org/10.48550/arXiv.2308.12697 [30] (2024) The website of linux kernel cves. [Online]. Available: https://www.
[8] A. Z. H. Yang, C. L. Goues, R. Martins, and V. J. Hellendoorn, “Large linuxkernelcves.com/
language models for test-free fault localization,” in Proceedings of the 46th [31] J. Li, G. Li, Y. Li, and Z. Jin, “Structured chain-of-thought prompting for code
IEEE/ACM International Conference on Software Engineering, ICSE 2024, Lisbon, generation,” arXiv preprint arXiv:2305.06599, 2023.
Portugal, April 14-20, 2024. ACM, 2024, pp. 17:1–17:12. [Online]. Available: [32] Y. Nong, M. Aldeen, L. Cheng, H. Hu, F. Chen, and H. Cai, “Chain-of-
https://doi.org/10.1145/3597503.3623342 thought prompting of large language models for discovering and fixing
,, Xueying Du, Geng Zheng, Kaixin Wang, Jiayi Feng, Wentai Deng, Mingwei Liu, Bihuan Chen, Xin Peng, Tao Ma, and Yiling Lou

software vulnerabilities,” CoRR, vol. abs/2402.17230, 2024. [Online]. Available: [38] OpenAI, “GPT-4 technical report,” CoRR, vol. abs/2303.08774, 2023. [Online].
https://doi.org/10.48550/arXiv.2402.17230 Available: https://doi.org/10.48550/arXiv.2303.08774
[33] S. E. Robertson and S. Walker, “Some simple effective approximations to the [39] (2023) Elasticsearch. [Online]. Available: https://github.com/elastic/elasticsearch
2-poisson model for probabilistic weighted retrieval,” in Proceedings of the [40] R. Likert, “A technique for the measurement of attitudes.” Archives of psychology,
17th Annual International ACM-SIGIR Conference on Research and Development 1932.
in Information Retrieval. Dublin, Ireland, 3-6 July 1994 (Special Issue of [41] M. Jimenez, M. Papadakis, and Y. L. Traon, “An empirical analysis of
the SIGIR Forum). ACM/Springer, 1988, pp. 232–241. [Online]. Available: vulnerabilities in openssl and the linux kernel,” in 23rd Asia-Pacific Software
https://doi.org/10.1016/0306-4573(88)90021-0 Engineering Conference, APSEC 2016, Hamilton, New Zealand, December 6-9, 2016,
[34] M. Çagatayli and E. Çelebi, “The effect of stemming and stop-word-removal A. Potanin, G. C. Murphy, S. Reeves, and J. Dietrich, Eds. IEEE Computer Society,
on automatic text classification in turkish language,” in Neural Information 2016, pp. 105–112. [Online]. Available: https://doi.org/10.1109/APSEC.2016.025
Processing - 22nd International Conference, ICONIP 2015, Istanbul, Turkey, [42] F. Wu, Q. Zhang, A. P. Bajaj, T. Bao, N. Zhang, R. Wang, and C. Xiao, “Exploring
November 9-12, 2015, Proceedings, Part I, ser. Lecture Notes in Computer Science, the limits of chatgpt in software security applications,” CoRR, vol. abs/2312.05275,
S. Arik, T. Huang, W. K. Lai, and Q. Liu, Eds., vol. 9489. Springer, 2015, pp. 2023. [Online]. Available: https://doi.org/10.48550/arXiv.2312.05275
168–176. [Online]. Available: https://doi.org/10.1007/978-3-319-26532-2_19 [43] X. Zhou, T. Zhang, and D. Lo, “Large language model for vulnerability detection:
[35] N. F. Liu, K. Lin, J. Hewitt, A. Paranjape, M. Bevilacqua, F. Petroni, and P. Liang, Emerging results and future directions,” CoRR, vol. abs/2401.15468, 2024. [Online].
“Lost in the middle: How language models use long contexts,” CoRR, vol. Available: https://doi.org/10.48550/arXiv.2401.15468
abs/2307.03172, 2023. [Online]. Available: https://doi.org/10.48550/arXiv.2307. [44] Z. Gao, H. Wang, Y. Zhou, W. Zhu, and C. Zhang, “How far have we gone in
03172 vulnerability detection using large language models,” CoRR, vol. abs/2311.12420,
[36] (2023) Gpt-3-5-turbo documentation. [Online]. Available: https://platform.openai. 2023. [Online]. Available: https://doi.org/10.48550/arXiv.2311.12420
com/docs/models/gpt-3-5-turbo [45] C. Wang, J. Liu, X. Peng, Y. Liu, and Y. Lou, “Boosting static resource leak detection
[37] (2023) Gpt-4 documentation. [Online]. Available: https://platform.openai.com/ via llm-based resource-oriented intention inference,” CoRR, vol. abs/2311.04448,
docs/models/gpt-4-and-gpt-4-turbo 2023. [Online]. Available: https://doi.org/10.48550/arXiv.2311.04448

Contents - Explore ATtiny Microcontrollers Using C and Assembly Language
100% (2)
Contents - Explore ATtiny Microcontrollers Using C and Assembly Language
15 pages
Howtoget1 Million Streams On: Written by
100% (2)
Howtoget1 Million Streams On: Written by
23 pages
Crossproject Transfer Representation Learning For Vulnerable Fun 2018
No ratings yet
Crossproject Transfer Representation Learning For Vulnerable Fun 2018
9 pages
final research paper
No ratings yet
final research paper
9 pages
Understanding the Effectiveness of Large Language Models in Detecting Security Vulnerabilities
No ratings yet
Understanding the Effectiveness of Large Language Models in Detecting Security Vulnerabilities
18 pages
2412.15905v2
No ratings yet
2412.15905v2
21 pages
Your Instructions Are Not Always Helpfu
No ratings yet
Your Instructions Are Not Always Helpfu
10 pages
Security Vulnerability Detection With Multitask Se
No ratings yet
Security Vulnerability Detection With Multitask Se
11 pages
Buffer Overflow
No ratings yet
Buffer Overflow
12 pages
Diverse Vu l
No ratings yet
Diverse Vu l
15 pages
Finetuning LLM for vulnerability detection
No ratings yet
Finetuning LLM for vulnerability detection
12 pages
Cleanvul: Automatic Function-Level Vulnerability Detection in Code Commits Using LLM Heuristics
No ratings yet
Cleanvul: Automatic Function-Level Vulnerability Detection in Code Commits Using LLM Heuristics
25 pages
Li 2021
No ratings yet
Li 2021
16 pages
2308.10345v1
No ratings yet
2308.10345v1
18 pages
applsci-14-09697
No ratings yet
applsci-14-09697
14 pages
QNLP
No ratings yet
QNLP
20 pages
Binary Code Vulnerability Detection Based On Multi-Level Feature Fusion
No ratings yet
Binary Code Vulnerability Detection Based On Multi-Level Feature Fusion
12 pages
Automated Vulnerability Detectionin Source Code Using Deep Representation Learning
No ratings yet
Automated Vulnerability Detectionin Source Code Using Deep Representation Learning
7 pages
LineVul A Transformer-Based Line-Level Vulnerability Prediction
No ratings yet
LineVul A Transformer-Based Line-Level Vulnerability Prediction
13 pages
Benchmarking LLMs and LLM-based Agents in Practical Vulnerability Detection for Code Repositories
No ratings yet
Benchmarking LLMs and LLM-based Agents in Practical Vulnerability Detection for Code Repositories
17 pages
E-GVD Efficient Software Vulnerability Detection T-1
No ratings yet
E-GVD Efficient Software Vulnerability Detection T-1
9 pages
2310.01152v2
No ratings yet
2310.01152v2
10 pages
Software Vulnerability Analysis and Discovery Using Deep Learning Techniques A Survey
No ratings yet
Software Vulnerability Analysis and Discovery Using Deep Learning Techniques A Survey
15 pages
Auto Patch 080525
No ratings yet
Auto Patch 080525
20 pages
2308.14434v1
No ratings yet
2308.14434v1
8 pages
Llm Oakland2024
No ratings yet
Llm Oakland2024
19 pages
Usenixsecurity23 Mirsky
No ratings yet
Usenixsecurity23 Mirsky
19 pages
The Rise of Software Vulnerability Taxonomy of Sof - 2021 - Journal of Network
No ratings yet
The Rise of Software Vulnerability Taxonomy of Sof - 2021 - Journal of Network
24 pages
ndss2018 03A-2 Li Paper
No ratings yet
ndss2018 03A-2 Li Paper
15 pages
Auto-Detection of Programming Code Vulnerabilities with Natural L
No ratings yet
Auto-Detection of Programming Code Vulnerabilities with Natural L
37 pages
2105.02388v1
No ratings yet
2105.02388v1
6 pages
B—BinVulDet：通过反编译伪代码和 BiLSTM-attention 检测二进制程序中的漏洞
No ratings yet
B—BinVulDet：通过反编译伪代码和 BiLSTM-attention 检测二进制程序中的漏洞
15 pages
DetectLlama
No ratings yet
DetectLlama
21 pages
Jimenez VPMLinuxKernel
No ratings yet
Jimenez VPMLinuxKernel
10 pages
Meta-Path Based Attentional Graph Learning Model f
No ratings yet
Meta-Path Based Attentional Graph Learning Model f
13 pages
2404.18353v2
No ratings yet
2404.18353v2
47 pages
1 s2.0 S0167404822004096 Main
No ratings yet
1 s2.0 S0167404822004096 Main
11 pages
Mishra_Thesis_AI_Augmented_Vulnerability
No ratings yet
Mishra_Thesis_AI_Augmented_Vulnerability
96 pages
2309.15324v1
No ratings yet
2309.15324v1
13 pages
OpenSCV - An Open Hierarchical Taxonomy For Smart Contract Vulnerabilities
No ratings yet
OpenSCV - An Open Hierarchical Taxonomy For Smart Contract Vulnerabilities
55 pages
Pattern Based Vulnerability Discovery
No ratings yet
Pattern Based Vulnerability Discovery
151 pages
When Chatgpt Meets Smart Contract Vulnerability Detection: How Far Are We?
No ratings yet
When Chatgpt Meets Smart Contract Vulnerability Detection: How Far Are We?
30 pages
2307.06616v3 (1)
No ratings yet
2307.06616v3 (1)
18 pages
2024ist - A Vulnerability Detection Framework by Focusing On Critical Execution Paths
No ratings yet
2024ist - A Vulnerability Detection Framework by Focusing On Critical Execution Paths
16 pages
Vulnerabilities Classification Machine Learning Paper SIMARGL
No ratings yet
Vulnerabilities Classification Machine Learning Paper SIMARGL
16 pages
Bachelor_thesis
No ratings yet
Bachelor_thesis
61 pages
Machine Learning For Source Code Vulnerability Detection: What Works and What Isn't There Yet
No ratings yet
Machine Learning For Source Code Vulnerability Detection: What Works and What Isn't There Yet
17 pages
AIBug Hunter
No ratings yet
AIBug Hunter
34 pages
Towards An Improved Understanding of Software Vulnerability Assessment Using Data-Driven Approaches-Le2022 - PHD
No ratings yet
Towards An Improved Understanding of Software Vulnerability Assessment Using Data-Driven Approaches-Le2022 - PHD
176 pages
2501.18908v1
No ratings yet
2501.18908v1
16 pages
2308.04662v3
No ratings yet
2308.04662v3
14 pages
2407.08708
No ratings yet
2407.08708
16 pages
Vuln Sense
No ratings yet
Vuln Sense
12 pages
Automated_Software_Vulnerability_Assessment_with_Concept_Drift
No ratings yet
Automated_Software_Vulnerability_Assessment_with_Concept_Drift
12 pages
2304.07232v1
No ratings yet
2304.07232v1
6 pages
8. How Well Do Large Language Models Serve as End-to-End Secure
No ratings yet
8. How Well Do Large Language Models Serve as End-to-End Secure
13 pages
Vulnerability Detection On Android Apps-Inspired B
No ratings yet
Vulnerability Detection On Android Apps-Inspired B
15 pages
1-s2.0-S016412122300314X-main
No ratings yet
1-s2.0-S016412122300314X-main
14 pages
Shirley Yang Masc Thesis
No ratings yet
Shirley Yang Masc Thesis
65 pages
Usage of Machine Learning in Software Testing: Sumit Mahapatra and Subhankar Mishra
No ratings yet
Usage of Machine Learning in Software Testing: Sumit Mahapatra and Subhankar Mishra
16 pages
An Improved Model for Analysis of Host Network Vul
No ratings yet
An Improved Model for Analysis of Host Network Vul
6 pages
Advanced Penetration Testing with Kali Linux: Unlocking industry-oriented VAPT tactics (English Edition)
From Everand
Advanced Penetration Testing with Kali Linux: Unlocking industry-oriented VAPT tactics (English Edition)
Ummed Meel
No ratings yet
Samplers Lecture
No ratings yet
Samplers Lecture
4 pages
2020 - 04 - Introduction To Kubernetes RBAC
No ratings yet
2020 - 04 - Introduction To Kubernetes RBAC
41 pages
Malik Zeeshan Arsahad
No ratings yet
Malik Zeeshan Arsahad
4 pages
Wireless Web Development 2nd Edition Ray Rischpater (Auth.) pdf download
100% (1)
Wireless Web Development 2nd Edition Ray Rischpater (Auth.) pdf download
78 pages
Practical 16 - MySQL Basic Commands
No ratings yet
Practical 16 - MySQL Basic Commands
4 pages
Network Basics Survival Guide
No ratings yet
Network Basics Survival Guide
87 pages
Optimus Prime Solutions Web Design
No ratings yet
Optimus Prime Solutions Web Design
1 page
Mathematics Complementary Elective Courses For Bca Programme
No ratings yet
Mathematics Complementary Elective Courses For Bca Programme
12 pages
Connecting a PS5 Controller with RaspberryPi via Bluetooth | by Shyam Padia | Me
No ratings yet
Connecting a PS5 Controller with RaspberryPi via Bluetooth | by Shyam Padia | Me
13 pages
The Internet 123456
No ratings yet
The Internet 123456
2 pages
Smart Energy Management System[1]
No ratings yet
Smart Energy Management System[1]
11 pages
ML MCQ Unit 1
No ratings yet
ML MCQ Unit 1
8 pages
MAME - RetroPie Docs
No ratings yet
MAME - RetroPie Docs
16 pages
Learn 7zip Command Examples in Linux
No ratings yet
Learn 7zip Command Examples in Linux
3 pages
Epm7000 Gek-113584d
No ratings yet
Epm7000 Gek-113584d
282 pages
Smart Shader
No ratings yet
Smart Shader
13 pages
Cyber Security Labs
No ratings yet
Cyber Security Labs
8 pages
Computer Science: Sigaram Thoduvom
No ratings yet
Computer Science: Sigaram Thoduvom
52 pages
Mykov240 ServiceManual LIS-SetUp
No ratings yet
Mykov240 ServiceManual LIS-SetUp
20 pages
Xdocs - Club Tesis Hospital San Juan de Dios
No ratings yet
Xdocs - Club Tesis Hospital San Juan de Dios
256 pages
Active Distribution Networks
No ratings yet
Active Distribution Networks
6 pages
KVR32S22S8/16: Memory Module Specifi Cations
No ratings yet
KVR32S22S8/16: Memory Module Specifi Cations
2 pages
Digital Weighing Indicator User Manual: Downloaded From Manuals Search Engine
No ratings yet
Digital Weighing Indicator User Manual: Downloaded From Manuals Search Engine
75 pages
Group 3 Project: Android Auto in Toyota and Lexus
No ratings yet
Group 3 Project: Android Auto in Toyota and Lexus
29 pages
User Guide Nokia t21 User Guide
No ratings yet
User Guide Nokia t21 User Guide
37 pages
Shivani Pandey: Education
No ratings yet
Shivani Pandey: Education
2 pages
ELE551 Embedded Systems and IOT Fundamentals Mid Sem Project Work Project Title: Arduino Sunflower
No ratings yet
ELE551 Embedded Systems and IOT Fundamentals Mid Sem Project Work Project Title: Arduino Sunflower
9 pages
Model TI984 Telephone Interface: Able of Ontents
No ratings yet
Model TI984 Telephone Interface: Able of Ontents
16 pages

Vul-RAG Enhancing LLM-based Vulnerability Detection Via Knowledge-Level RAG

Uploaded by

Vul-RAG Enhancing LLM-based Vulnerability Detection Via Knowledge-Level RAG

Uploaded by

Vul-RAG: Enhancing LLM-based Vulnerability Detection

via Knowledge-level RAG

Bihuan Chen Xin Peng Tao Ma

Instance-level Vulnerability Knowledge Extraction Input Extracted Vulnerability Knowledge

Figure 3: An Example of Vulnerability Knowledge Extraction from Historical Commit of CVE-2022-38457

6 RESULTS Table 4: Effectiveness of Vul-RAG and GPT-4-based tech-

(A) (B) (C)

Figure 4: An example of vulnerability knowledge representation

CVE-2021-33034 CVE-2023-1989 CVE-2023-1855

Figure 5: An example of knowledge retrieval strategy

You might also like