RAGTruth: A Hallucination Corpus for Developing Trustworthy Retrieval-Augmented Language Models
Abstract

Retrieval-augmented generation (RAG) has become a main technique for alleviating hallucinations in large language models (LLMs). Despite the integration of RAG, LLMs may still present unsupported or contradictory claims to the retrieved contents. In order to develop effective hallucination prevention strategies under RAG, it is important to create benchmark datasets that can measure the extent of hallucination. This paper presents RAGTruth, a corpus tailored for analyzing word-level hallucinations in various domains and tasks within the standard RAG frameworks for LLM applications. RAGTruth comprises nearly 18,000 naturally generated responses from diverse LLMs using RAG. These responses have undergone meticulous manual annotations at both the individual case and word levels, incorporating evaluations of hallucination intensity. We not only benchmark hallucination frequencies across different LLMs, but also critically assess the effectiveness of several existing hallucination detection methodologies. We show that using a high-quality dataset such as RAGTruth, it is possible to finetune a relatively small LLM and achieve competitive hallucination detection performance when compared to existing prompt-based approaches using state-of-the-art LLMs such as GPT-4. Furthermore, the finetuned model can effectively mitigate hallucination in LLM responses.

CONTEXT:
...to an extra coat of paint for your bedroom. 3 To get good pictures, 3D ultrasounds are best performed between 20 and 32 weeks, and the best pictures are between 24 and 30 weeks.
...

RESPONSE: Based on the given passages, here's how to prepare for an ultrasound:
* For a gender reveal ultrasound, wait until at least 17 weeks of pregnancy to ensure accurate results.
* For 3D ultrasounds, schedule the appointment between 20 and 32 weeks of pregnancy for the best pictures ...

ANNOTATION
Span: between 20 and 32 weeks of pregnancy for the best pictures
Type: Evident Conflict
Reason: Original: "the best pictures are between 24 and 30 weeks"; Generative: "between 20 and 32 weeks of pregnancy for the best pictures"

Table 1: An example of RAGTruth data from the question answering task. It contains the context, the response generated by an LLM, and the span-level annotation.

1 Introduction

Large language models (LLMs) have achieved remarkable success in a variety of tasks, including text generation (Li et al., 2022), machine translation (Kocmi and Federmann, 2023), and question answering (Zhao et al., 2023). However, one of the key challenges in deploying LLMs in real-world applications is their tendency to hallucinate (Kaddour et al., 2023). Hallucination in the context of LLMs usually refers to a situation where the model generates content that is not based on factual or accurate information (Rawte et al., 2023). The occasional generation of outputs that appear plausible but are factually incorrect significantly undermines the reliability of LLMs in real-world scenarios, such as medical diagnoses (Pal et al., 2023) and news summarization (Shen et al., 2023).
To reduce hallucination, various methods have been developed that can be applied at different stages of the LLM lifecycle, including pre-training (Brown et al., 2020), supervised fine-tuning (Zhou et al., 2023; Zhang et al., 2023a), RLHF (Ouyang et al., 2022; Lin et al., 2022), and inference (Dhuliawala et al., 2023; Gao et al., 2023). In terms of detection, methods have been developed by examining the model's intrinsic state (Guo et al., 2017), comparing it with external data and tools (Chern et al., 2023), or leveraging the LLM's inherent powerful capabilities for self-checking (Agrawal et al., 2023; Manakul et al., 2023). Retrieval-augmented generation (RAG) is extensively used to supply LLMs with updated, relevant knowledge, significantly mitigating hallucination (Varshney et al., 2023). Nevertheless, even with RAG and other enhancements, LLMs still produce statements that are either unfounded or contradict the information provided in the retrieved references (Shuster et al., 2021).

Despite the growing awareness of the hallucination phenomenon, the understanding of hallucination in LLMs is still in its early stages. One key challenge is the lack of high-quality, large-scale datasets specifically designed for hallucination detection. This issue is particularly acute in RAG settings: due to the relatively low hallucination ratio, substantially more annotation resources are needed. Existing datasets for LLM hallucination detection are predominantly synthesized (Li et al., 2023). For instance, in Liu and Liu (2023); Longpre et al. (2021), prompts conflicting with conventional knowledge are purposely generated to trigger hallucinations. While these approaches are efficient at generating hallucinations, the resulting artificial hallucinations can substantially differ from those that naturally occur. In Chen et al. (2023); Hu et al. (2023), hallucination datasets are developed by manual annotation of naturally produced LLM responses. However, these datasets are of limited size and are not specifically focused on the RAG scenario.

In this paper, we introduce a large-scale, high-quality dataset specifically designed for word-level hallucination detection in RAG applications. Using this dataset, we have conducted an extensive benchmarking of mainstream LLMs to assess their tendency to generate hallucinations, as well as evaluated current methods for hallucination detection. Additionally, we have demonstrated superior performance in identifying hallucinations by fine-tuning an LLM with the RAGTruth dataset. Our key contributions are:

(i) We propose RAGTruth, a large-scale word-level hallucination evaluation dataset specifically for the RAG scenario across several common tasks. It consists of nearly 18,000 fully annotated natural responses generated from major open-source and closed-source LLMs.

(ii) We perform a comprehensive comparison of different hallucination detection methods at both the passage and word levels.

(iii) We present a baseline method of fine-tuning an LLM for hallucination detection. It is shown that by fine-tuning the Llama-2-13B model on the RAGTruth training data, we can achieve results competitive with the existing prompt-based approaches using GPT-4. This shows the potential of developing better hallucination detection methods using RAGTruth.

(iv) We show that by using our finetuned hallucination detector, it is possible to significantly reduce the occurrence of hallucinations in the responses from LLMs. The improvement holds even for models with inherently low hallucination rates, such as GPT-4.

2 Related Work

2.1 Hallucination of Large Language Models

Though hallucination in traditional natural language generation (NLG) contexts has been widely studied (Ji et al., 2023), comprehending and tackling this problem in the context of LLMs presents distinct challenges (Zhang et al., 2023b). Existing research has demonstrated that incorporating up-to-date, relevant knowledge in the prompt can effectively reduce fact-conflicting hallucination (Vu et al., 2023; Lewis et al., 2021). This approach, referred to as Retrieval-Augmented Generation (RAG), is widely used in real-world LLM applications. For instance, Google Bard (https://bard.google.com) and Microsoft Bing Chat (https://www.bing.com) have implemented this technique.

2.2 Hallucination Evaluation Datasets

Extensive research has focused on hallucination benchmarks within conventional Natural Language Generation settings (Dziri et al., 2022; Zhong et al., 2021; Durmus et al., 2020; Lin et al., 2022). With the rise of LLMs, the detection of hallucinations has become increasingly challenging, necessitating the development of high-quality datasets for LLM evaluation (Chen and Shu, 2023). Contributions in this domain include HaluEval (Li et al., 2023), which introduced datasets encompassing both synthetically and naturally generated LLM responses, and FELM (Chen et al., 2023), which concentrated on naturally generated LLM responses across multiple domain tasks. RefChecker (Hu et al., 2023), a
distinctive approach, breaks down claims in LLM responses into triples and utilizes human annotation to assess the truthfulness of facts. Notably, these works primarily focus on annotating factual hallucinations in LLM responses. In contrast to previous research, our work centers on the evaluation of LLMs within RAG settings.

2.3 Hallucination Detection Methods

Researchers have been exploring various methods to enhance the reliability of LLMs by detecting hallucinations. In Azaria and Mitchell (2023); Xiao and Wang (2021); Malinin and Gales (2021), intrinsic model uncertainty metrics such as token-level probability and entropy are used to detect hallucinations. When direct access to output uncertainty is not feasible, as is the case with limited APIs like GPT-4, an alternative approach involves employing a fully accessible LLM as a proxy (Manakul et al., 2023). In Falke et al. (2019); Barrantes et al. (2020), natural language inference modules are adapted to check the information consistency between articles and their summaries, and it has been shown that external knowledge is helpful for detecting factual hallucinations (Guo et al., 2022; Mallen et al., 2022). Additionally, methods that leverage the inherent capabilities of LLMs have been proposed for self-checking, such as verbalization-based and consistency-based methods (Xiong et al., 2023; Manakul et al., 2023). These techniques aim to detect hallucinations without relying on internal states or external data and tools.

3 Construction Process of RAGTruth

We established a data generation and annotation pipeline as shown in Figure 1.

Figure 1: Data gathering pipeline. Taking the data-to-text writing task as an example, our data gathering pipeline includes two steps: 1) response generation, in which we generated responses with multiple LLMs and natural prompts; and 2) human annotation, in which human labelers annotated hallucinated spans in the LLM responses.

3.1 Hallucination Taxonomy

Different from open-ended generation, under the RAG setting the prompt contains rich context information, and the model is generally required to generate text based on the provided context. Detecting and mitigating inconsistencies between the retrieved information and the responses therefore emerges as a central concern, since such inconsistencies are a significant source of hallucination.

As outlined below, we categorize the hallucinations in the RAG setting into four types. For concrete examples of each type, please refer to Appendix A.

Evident Conflict: when generative content presents direct contradiction or opposition to the provided information. These conflicts are easily verifiable without extensive context, often involving clear factual errors, misspelled names, incorrect numbers, etc.

Subtle Conflict: when generative content presents a departure or divergence from the provided information, altering the intended contextual meaning. These conflicts often involve substitution of terms that carry different implications or severity, requiring a deeper understanding of their contextual applications.

Evident Introduction of Baseless Information: when generated content includes information not substantiated in the provided information. It involves the creation of hypothetical, fabricated, or hallucinatory details lacking evidence or support.

Subtle Introduction of Baseless Information: when generated content extends beyond the provided information by incorporating inferred details, insights, or sentiments. This additional information lacks verifiability and might include subjective assumptions or commonly observed norms rather than explicit facts.
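For readers who want to work with these categories programmatically, the following minimal sketch shows one way to represent a span-level annotation such as the one in Table 1. The enum values mirror the four types above, while the field names and offsets are illustrative assumptions, not the released corpus schema.

```python
from dataclasses import dataclass
from enum import Enum


class HallucinationType(str, Enum):
    """The four hallucination types defined in Section 3.1."""
    EVIDENT_CONFLICT = "Evident Conflict"
    SUBTLE_CONFLICT = "Subtle Conflict"
    EVIDENT_BASELESS_INFO = "Evident Introduction of Baseless Information"
    SUBTLE_BASELESS_INFO = "Subtle Introduction of Baseless Information"


@dataclass
class HallucinatedSpan:
    """One annotated span inside an LLM response (illustrative field names)."""
    start: int                # character offset of the span in the response
    end: int                  # exclusive end offset
    text: str                 # the hallucinated text itself
    label: HallucinationType  # one of the four categories above
    reason: str = ""          # optional annotator rationale, as in Table 1


# The Table 1 example, encoded with this structure (the offset is illustrative).
span_text = "between 20 and 32 weeks of pregnancy for the best pictures"
example = HallucinatedSpan(
    start=165,
    end=165 + len(span_text),
    text=span_text,
    label=HallucinationType.EVIDENT_CONFLICT,
    reason='Original: "the best pictures are between 24 and 30 weeks"',
)
```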
3.2 Response Generation

Tasks and Data Sources We selected three widely recognized generation tasks with RAG settings for response generation: Question Answering, Data-to-text Writing, and News Summarization.

For the question answering task, we conducted a random sampling from the training set of MS MARCO (Nguyen et al., 2016). To reduce the difficulty of annotation, we selected only those questions related to daily life, and preserved only three retrieved passages for each question. We then prompted LLMs to generate answers for each question solely based on the retrieved passages.

For the data-to-text writing task, we prompted LLMs to generate an objective overview for a randomly sampled business in the restaurant and nightlife categories from the Yelp Open Dataset (Yelp, 2021). In this dataset, information pertaining to a business is represented using structured data. To streamline the annotation process, we focused only on the following business information fields: BusinessParking, RestaurantsReservations, OutdoorSeating, WiFi, RestaurantsTakeOut, RestaurantsGoodForGroups, Music, and Ambience. In addition to the structured data, we also included up to three business-related user reviews to enrich the context information. In the prompt, this information is represented in JSON format.

For the news summarization task, we randomly selected documents from the training set of the well-known CNN/Daily Mail dataset (See et al., 2017), as well as recent news articles from a prestigious news platform. LLMs were prompted to generate a summary for each source news article.

Models The following six models with strong instruction-following ability are used for response generation: GPT-3.5-turbo-0613 and GPT-4-0613 from OpenAI (OpenAI, 2023); Mistral-7b-Instruct from Mistral AI (Jiang et al., 2023); and Llama-2-7B-chat, Llama-2-13B-chat, and Llama-2-70B-chat (4-bit quantized; https://huggingface.co/TheBloke/Llama-2-70B-Chat-AWQ) from Meta (Touvron et al., 2023). To ensure a fair comparison, the prompts used for response generation are kept straightforward, with only subtle differences among the various models to optimize their performance. We provide the detailed prompts in Appendix B.

For each sample, we collected one response from each model. As a result, we obtained a total of 6 responses for each input sample.

3.3 Human Annotation

Identifying AI-generated hallucinations is a challenging task. It requires a strong capacity for critical thinking to understand the logical flow of various texts, along with meticulous attention to detail for spotting subtle inaccuracies and inconsistencies. Moreover, a certain level of media literacy and knowledge of current affairs is crucial to grasp the subjects discussed in news-related sample data. Therefore, we chose annotators who are proficient in English and possess a bachelor's degree in English, Communications, or relevant fields, to ensure the accuracy and reliability of the annotation results. We recruited annotators from a professional vendor and paid them at a rate of $25 per hour per individual.

The annotators were invited to perform annotation tasks using Label Studio (Tkachenko et al., 2020-2022). Each labeling task is presented within one page, comprising the following components: 1) the context provided to the AI models; 2) a set of 6 responses generated by different AI models. Our annotation interface is shown in Appendix C.

Their task was to annotate the specific spans of the generated text that contain hallucinated information and to categorize them into the four types. To ensure the quality of the annotations, each response is independently labeled by two annotators. The consistency rate between the two annotators was 91.8% at the response level and 78.8% at the span level. In cases where there is a considerable difference between the two annotations, a third review is undertaken.

3.4 Annotations for Adaptive Evaluation

In different contexts, the definition and criteria for hallucination vary, and the annotation of hallucination is not always straightforward. In contentious cases, additional annotations are provided to accurately reflect these situations. This approach enables users to adopt various evaluation strategies tailored to their specific application circumstances. Please refer to Appendix C for more statistical information about these annotations.

Implicit Truth The extensive world knowledge and ability of LLMs is a significant advantage in open-ended generation scenarios. But in the context of this paper, which focuses on the relatively strict RAG scenarios, we have labeled information that is not mentioned in the reference but may be truthful as hallucinations. Examples include mentioning
a local officer's name not present in the reference, or claiming that a restaurant accepts credit card payments without any basis.

This decision is based on the observation that LLMs have a relatively high chance of making errors when generating detailed facts, partly because their embedded knowledge can be outdated. Therefore, RAG applications usually instruct LLMs not to generate factual content without the support of references. In addition, we provided an extra span-level annotation named implicit_true for these spans to accommodate different application needs.

Differences in Handling Null Value In the data-to-text writing task, certain fields sometimes have null values. We observed that in the generated results, null is often interpreted as false by some models. Since the more common expressions for negation in our dataset are the boolean value False or the text No, we labeled these instances as hallucinations (evident introduction of baseless info) and provided a special span-level annotation named due_to_null for these spans. In the subsequent hallucination detection experiments, our prompts are aligned with this standard.

4 Hallucination Benchmark Analysis

4.1 Basic Statistics

We present detailed statistics of RAGTruth in Table 2. Compared to existing datasets for hallucination detection (Cao et al., 2023; Kamoi et al., 2023), the RAGTruth dataset is considerably larger in scale. The corpus contains a total of 2,965 instances of data, which include 989 instances for question answering, 1,033 instances for data-to-text writing, and 943 instances for news summarization. Each instance comprises responses from 6 different models. As shown in Table 2, the RAGTruth dataset also features longer prompt and response lengths than existing datasets for hallucination detection (Wang et al., 2020).

4.2 Hallucination Statistics

Figure 2: Frequency of different types of hallucination by task (stacked bars for QA, Data-to-text Writing, and Summarization; categories: Evident baseless info, Subtle baseless info, Evident conflict, Subtle conflict).

Hallucination Types As shown in Figure 2, the generation of information baseless in the context was significantly more prevalent than the generation of information conflicting with the context, especially for the question answering tasks. Within the two major categories of baseless info and conflict, the more severe hallucinations, namely Evident baseless info and Evident conflict, respectively, account for a significant portion. This observation highlights the importance and challenges of LLM hallucination mitigation, even in RAG settings.

Hallucination vs Tasks As shown in Table 2, across the three tasks, the data-to-text writing task exhibited the highest frequency of hallucinations in its responses. Inconsistent handling of JSON-format data, especially time and attributes, contributed to a significant number of hallucinations in this task. Interestingly, the models did not show a higher rate of hallucinations for recent news compared to outdated news. This could be attributed to the shorter context length in the recent news subtask compared to the CNN/DM subtask.

Hallucination vs Models Table 3 illustrates that, among the data we collected, OpenAI's two models demonstrated notably lower hallucination rates compared to others. Specifically, GPT-4-0613 exhibited the lowest hallucination frequency. To more clearly compare the hallucination rates of different models, we calculated the hallucination density for each model across the three tasks. Hallucination density is defined as the average number of hallucination spans per hundred words in the responses. In the Llama-2 series, a clear negative correlation was observed between model scale and hallucination density, aside from the data-to-text writing tasks. Despite its strong performance in various benchmarks and leaderboards (Zheng et al., 2023), the Mistral-7B-Instruct model generated the highest number of responses containing hallucinations.

Hallucination vs Length After removing the top and bottom 5% of outliers, we partitioned the data for each task type into three equal-sized groups according to the length of the context/response. We then computed the average number of hallucinated spans per response within each group.
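As a concrete illustration of the hallucination density metric used above (average number of hallucination spans per hundred response words), a minimal helper might look as follows. The record fields are assumed for illustration and do not necessarily match the released corpus files.

```python
from typing import Iterable


def hallucination_density(responses: Iterable[dict]) -> float:
    """Average number of annotated hallucination spans per 100 response words.

    Each record is assumed to look like {"response": "...", "spans": [{...}, ...]}.
    """
    total_spans = 0
    total_words = 0
    for rec in responses:
        total_spans += len(rec["spans"])
        total_words += len(rec["response"].split())
    return 100.0 * total_spans / max(total_words, 1)


# Example: two toy responses, one clean and one with a single annotated span.
toy = [
    {"response": "The cafe offers outdoor seating and free WiFi.", "spans": []},
    {"response": "The cafe takes reservations.", "spans": [{"label": "Evident Conflict"}]},
]
print(round(hallucination_density(toy), 2))  # 1 span over 12 words -> 8.33
```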
Task                          # Instance   # Resp.   Context Length (Mean / Max)   Resp. Length (Mean / Max)   Hallucination (# Resp. / % Resp. / # Span)
Question Answering                   989      5934                     243 / 509                   119 / 381                          1724 / 29.1% / 2927
Data-to-text Writing                1033      6198                    354 / 1253                   159 / 369                          4254 / 68.6% / 9290
Summarization (CNN/DM)               628      3768                    648 / 1749                   124 / 632                          1165 / 30.9% / 1474
Summarization (Recent News)          315      1890                     369 / 481                    89 / 240                           521 / 27.6% /  598
Overall                             2965     17790                    381 / 1749                   131 / 632                         7664 / 43.1% / 14289
Table 2: The basic statistics of RAGTruth. Here "Resp." stands for "Response".
Table 3: Hallucination counts and density of models. †: We used 4-bit quantized version of Llama-2-70B-chat.
Table 4: Average number of hallucinations per response in different context length buckets (CLB) and response length buckets (RLB) for the three types of tasks. The subscript denotes the minimum and maximum length of this bucket.

Figure 3: Heatmaps of normalized hallucination occurrence positions. The probability of hallucinations occurring is higher in brighter areas.
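The normalized occurrence positions visualized in Figure 3 can be approximated from span annotations as sketched below; the binning choice and record field names are assumptions, not the authors' exact procedure.

```python
def position_histogram(responses, num_bins=10):
    """Histogram of where hallucinated spans start, normalized by response length.

    Each record is assumed to look like {"response": str, "spans": [{"start": int}, ...]}.
    Returns a list of `num_bins` fractions summing to 1.0 (or all zeros if no spans).
    """
    counts = [0] * num_bins
    for rec in responses:
        length = max(len(rec["response"]), 1)
        for span in rec["spans"]:
            rel = min(span["start"] / length, 0.999)  # relative position in [0, 1)
            counts[int(rel * num_bins)] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else counts
```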
As shown in Table 4, there is a clear overall trend of an increase in the average number of hallucinations as the response length grows. Only the average number of hallucinations in news summarization tasks significantly increases with the length of the context. This may be because the contexts in the other two tasks are more structured, and an increase in length does not significantly raise the difficulty of understanding the content.

Location of Hallucinations In Figure 3, we present the heatmap of the hallucination occurrence positions. Hallucinations are significantly more likely to occur towards the end of responses in the question-answering and news summarization tasks. Compared to other tasks, the data-to-text writing task has a relatively higher occurrence of hallucinations in the first half. In that bright area, hallucinations concerning business attributes frequently occur.

5 Experimental Setup

5.1 Hallucination Detection Algorithms

Using RAGTruth, we conducted experiments with the following four distinct algorithms for hallucination detection:

Hallucination Detection Prompt: Hallucination detection prompts are manually crafted to instruct LLMs (GPT-4-turbo and GPT-3.5-turbo) to assess whether a given reference-response pair contains hallucinated content and to identify the corresponding hallucinated spans in the response. For detailed information about these prompts, please refer to Appendix D.

SelfCheckGPT (Manakul et al., 2023): SelfCheckGPT employs a zero-resource, sampling-based method to fact-check the responses of black-box models. When processing each response in RAGTruth, 3 extra responses from the same model were sampled and served as references, and GPT-3.5-turbo was used to verify consistency. We detected hallucinations sentence by sentence within a response, and then aggregated these results to the response level.
Methods                          Question Answering          Data-to-text Writing        Summarization               Overall
                                 Precision  Recall  F1       Precision  Recall  F1       Precision  Recall  F1       Precision  Recall  F1
Prompt (gpt-3.5-turbo)                18.8    84.4  30.8          65.1    95.5  77.4          23.4    89.2  37.1          37.1    92.3  52.9
Prompt (gpt-4-turbo)                  33.2    90.6  45.6          64.3   100.0  78.3          31.5    97.6  47.6          46.9    97.9  63.4
SelfCheckGPT (gpt-3.5-turbo)          35.0    58.0  43.7          68.2    82.8  74.8          31.1    56.5  40.1          49.7    71.9  58.8
LMvLM (gpt-4-turbo)                   18.7    76.9  30.1          68.0    76.7  72.1          23.3    81.9  36.2          36.2    77.8  49.4
Finetuned Llama-2-13B                 61.6    76.3  68.2          85.4    91.0  88.1          64.0    54.9  59.1          76.9    80.7  78.7

Table 5: The response-level hallucination detection performance for each baseline method across different tasks and different models.
Table 6: The span-level detection performance for each baseline method across different tasks and different models.
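Span-level results such as those summarized in Table 6 are typically scored by comparing predicted and gold spans at the character level. The sketch below shows one common way to compute character-level precision and recall from span offsets; it is an illustration, not necessarily the exact protocol used for the reported numbers.

```python
def char_level_precision_recall(pred_spans, gold_spans):
    """Character-level precision/recall between predicted and gold spans.

    Spans are (start, end) character-offset pairs with `end` exclusive.
    """
    pred_chars = set()
    for start, end in pred_spans:
        pred_chars.update(range(start, end))
    gold_chars = set()
    for start, end in gold_spans:
        gold_chars.update(range(start, end))

    overlap = len(pred_chars & gold_chars)
    precision = overlap / len(pred_chars) if pred_chars else 0.0
    recall = overlap / len(gold_chars) if gold_chars else 0.0
    return precision, recall


# Example: the prediction covers part of one gold span plus some extra characters.
print(char_level_precision_recall([(10, 30)], [(15, 40)]))  # (0.75, 0.6)
```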
Table 7: Utilizing the finetuned hallucination detector to sample from two responses can significantly reduce the
rate of hallucinations. The numbers within the brackets in the group column represent the model’s hallucination rate.
†: Some instances did not have responses that met the required criteria.
Figure 4: The span-level recalls of different models on four types of hallucinations (methods compared: Prompt (GPT-3.5-turbo), Prompt (GPT-4-turbo), and Finetuned Llama-2-13B).

Figure 4 breaks the results down by the four different types of hallucination spans. At the current stage, as we have not differentiated the types of detected hallucinations, we only report the char-level recall for the different types of hallucinations. As indicated in the figure, the detection of evident hallucinations proves more effective compared to that of subtle hallucinations.

6.3 Hallucination Suppression

We tested the effectiveness of hallucination suppression using our finetuned hallucination detection model. For the 450 instances in the test set, we employed two strategies to select a final output from two responses generated by two different models with similar hallucination densities. The first strategy involved selecting the response with fewer predicted hallucination spans. The second strategy, more stringent, mandated that the selected response have no detected hallucination spans. When the number of hallucination spans detected in both candidate responses is the same, one is chosen at random. Due to limited response candidates, not all instances have a response that conforms to the second strategy. In practical scenarios, this issue can be addressed by increasing the number of candidate responses. We employed random selection as a simple baseline for comparison.

As shown in Table 7, for the Llama-2 and Mistral-7B-Instruct models, compared to random selection, the first strategy reduced the hallucination rate by 21.6%, while the second strategy achieved a reduction of 63.2%. Even for models with a low hallucination rate, specifically GPT-3.5-Turbo and GPT-4, employing the finetuned hallucination detector for sampling can still further reduce the rate of hallucinations. The two strategies yielded reductions in hallucination rates of 42.9% and 51.0%, respectively. These results demonstrate the potential of an efficient hallucination detection model in developing trustworthy RAG LLMs.
corpus of naturally generated hallucinations, fea-
pared to that of subtle hallucinations.
turing detailed word-level annotations tailored for
RAG scenarios. Our work includes an in-depth
6.3 Hallucination Suppression
analysis of the interplay between hallucinations
We tested the effectiveness of hallucination sup- and various factors, such as task types, models be-
pression using our finetuned hallucination detec- ing used, and contextual settings.
tion model. For the 450 instances in the test set, Additionally, we conduct empirical benchmarks
we employed two strategies to select a final output of several hallucination detection approaches using
from two responses generated by two different mod- our corpus. We show that fine-tuning Llama with
els with similar hallucination densities. The first RAGTruth leads to competitive performance. This
strategy involved selecting the response with fewer implies that by using a high-quality dataset such
predicted hallucination spans. The second strategy, as RAGTruth, it is possible to develop specialized
more stringent, mandated that the selected response hallucination detection models that are highly ef-
have no detected hallucination spans. When the fective when compared to prompt-based methods
number of hallucination spans detected in both can- using general models such as GPT-4.
didate responses is the same, one will be chosen at Simultaneously, our findings reveal that identi-
random. Due to limited response candidates, not fying hallucinations in RAG contexts, particularly
all instances have a response that conforms to the at the span level, remains a formidable challenge,
second strategy. In practical scenarios, this issue with current methods still falling short of reliable
can be addressed by increasing the number of can- detection. We hope that RAGTruth, can assist the
didate responses. We employed random selection development of hallucination detection techniques
as a simple baseline for comparison. for retrieval augmented generation.
8 Limitations

The study of hallucination in large language models is a rapidly advancing field, characterized by the continuous evolution of application scenarios, sources of hallucination, and techniques for detecting and preventing them. While our work represents the first attempt to benchmark hallucination in the RAG setting, there may be situations not addressed by this research that are nonetheless significant for certain practical applications.

9 Ethical Considerations

This work is in full compliance with the Ethics Policy of the ACL. We acknowledge that responses generated by LLMs in this study may contain inaccuracies. Aside from this, to the best of our knowledge, there are no additional ethical issues associated with this paper.

10 Acknowledgements

We appreciate the valuable feedback and assistance from Shizhe Diao. We thank Doris Li for her support in creating the illustrations for this research.

References

Ayush Agrawal, Mirac Suzgun, Lester Mackey, and Adam Tauman Kalai. 2023. Do language models know when they're hallucinating references?

Amos Azaria and Tom Mitchell. 2023. The internal state of an LLM knows when it's lying.

Mario Barrantes, Benedikt Herudek, and Richard Wang. 2020. Adversarial NLI for factual correctness in text summarisation models. arXiv preprint arXiv:2005.11739.

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners.

Zouying Cao, Yifei Yang, and Hai Zhao. 2023. AutoHall: Automated hallucination dataset generation for large language models. ArXiv, abs/2310.00259.

Canyu Chen and Kai Shu. 2023. Can LLM-generated misinformation be detected? arXiv preprint arXiv:2309.13788.

Shiqi Chen, Yiran Zhao, Jinghan Zhang, I-Chun Chern, Siyang Gao, Pengfei Liu, and Junxian He. 2023. FELM: Benchmarking factuality evaluation of large language models.

I-Chun Chern, Steffi Chern, Shiqi Chen, Weizhe Yuan, Kehua Feng, Chunting Zhou, Junxian He, Graham Neubig, and Pengfei Liu. 2023. FacTool: Factuality detection in generative AI – a tool augmented framework for multi-task and multi-domain scenarios.

Roi Cohen, May Hamri, Mor Geva, and Amir Globerson. 2023. LM vs LM: Detecting factual errors via cross examination.

Shehzaad Dhuliawala, Mojtaba Komeili, Jing Xu, Roberta Raileanu, Xian Li, Asli Celikyilmaz, and Jason Weston. 2023. Chain-of-verification reduces hallucination in large language models.

Esin Durmus, He He, and Mona Diab. 2020. FEQA: A question answering evaluation framework for faithfulness assessment in abstractive summarization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5055–5070, Online. Association for Computational Linguistics.

Nouha Dziri, Hannah Rashkin, Tal Linzen, and David Reitter. 2022. Evaluating attribution in dialogue systems: The BEGIN benchmark. Transactions of the Association for Computational Linguistics, 10:1066–1083.

Tobias Falke, Leonardo F. R. Ribeiro, Prasetya Ajie Utama, Ido Dagan, and Iryna Gurevych. 2019. Ranking generated summaries by correctness: An interesting but challenging application for natural language inference. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2214–2220, Florence, Italy. Association for Computational Linguistics.

Luyu Gao, Zhuyun Dai, Panupong Pasupat, Anthony Chen, Arun Tejasvi Chaganty, Yicheng Fan, Vincent Zhao, Ni Lao, Hongrae Lee, Da-Cheng Juan, and Kelvin Guu. 2023. RARR: Researching and revising what language models say, using language models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 16477–16508, Toronto, Canada. Association for Computational Linguistics.

Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. 2017. On calibration of modern neural networks.

Zhijiang Guo, Michael Schlichtkrull, and Andreas Vlachos. 2022. A survey on automated fact-checking.

Xiangkun Hu, Dongyu Ru, Qipeng Guo, Lin Qiu, and Zheng Zhang. 2023. RefChecker for fine-grained hallucination detection.

Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Ye Jin Bang, Andrea Madotto, and Pascale Fung. 2023. Survey of hallucination in natural language generation. ACM Computing Surveys, 55(12).

Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. 2023. Mistral 7B.

Andrey Malinin and Mark Gales. 2021. Uncertainty estimation in autoregressive structured prediction.

Alex Mallen, Akari Asai, Victor Zhong, Rajarshi Das, Daniel Khashabi, and Hannaneh Hajishirzi. 2022. When not to trust language models: Investigating effectiveness of parametric and non-parametric memories.

Potsawee Manakul, Adian Liusie, and Mark J. F. Gales. 2023. SelfCheckGPT: Zero-resource black-box hallucination detection for generative large language models.

Neeraj Varshney, Wenlin Yao, Hongming Zhang, Jianshu Chen, and Dong Yu. 2023. A stitch in time saves nine: Detecting and mitigating hallucinations of LLMs by validating low-confidence generation.

Tu Vu, Mohit Iyyer, Xuezhi Wang, Noah Constant, Jerry Wei, Jason Wei, Chris Tar, Yun-Hsuan Sung, Denny Zhou, Quoc Le, and Thang Luong. 2023. FreshLLMs: Refreshing large language models with search engine augmentation.

Yue Zhang, Yafu Li, Leyang Cui, Deng Cai, Lemao Liu, Tingchen Fu, Xinting Huang, Enbo Zhao, Yu Zhang, Yulong Chen, Longyue Wang, Anh Tuan Luu, Wei Bi, Freda Shi, and Shuming Shi. 2023b. Siren's song in the AI ocean: A survey on hallucination in large language models.

Chunting Zhou, Pengfei Liu, Puxin Xu, Srini Iyer, Jiao Sun, Yuning Mao, Xuezhe Ma, Avia Efrat, Ping Yu, Lili Yu, Susan Zhang, Gargi Ghosh, Mike Lewis, Luke Zettlemoyer, and Omer Levy. 2023. LIMA: Less is more for alignment.
A Examples of Hallucination Types

EVIDENT CONFLICT
Example 1
Original content: They’ve sent about 500 of the most seriously injured people for treatment in Kathmandu already – the majority
by road.
Generative content: The team airlifted 500 injured people... to Kathmandu for medical treatment.
Example 2
Original content: "RestaurantsReservations": false
Generative content: The restaurant offers outdoor seating and takes reservations
SUBTLE CONFLICT
Example 1
Original content: This is the section that lets the government deny trademark protection to a mark that is "immoral," "scandalous"
or "disparaging."
Generative content: . . . violates Section 2(a) of the trademark act, which prohibits marks that are "immoral, scandalous, or
disqualified."
Example 2
Original content: It was quite a hassle to find a table since this place is crowded all the time, but well, nothing can stand between
me and that absolutely delightful lobster roll
Generative content: The restaurant is located in a crowded area, but the delicious lobster rolls make it worth the hassle to find a
table.
EVIDENT INTRODUCTION OF BASELESS INFORMATION
Example 1
Original content: The couple divorced in 1968... She had no contact with the surviving members of the band until meeting up at
the 2006 Las Vegas premiere of "The Beatles Love."
Generative content: The couple divorced in 1968, but later reconciled and attended the 2006 Las Vegas premiere of "The Beatles
Love."
Example 2
Original content: “RestaurantsTakeOut”: true
Generative content: Customers can also enjoy takeout and delivery options.
SUBTLE INTRODUCTION OF BASELESS INFORMATION
Example 1
Original content: Amazing happy hour and great people.
Generative content: One of the standout features of Killer Shrimp is their happy hour, which is available daily from 4 pm to 7 pm.
Example 2
Original content: Place mashed potatoes in a medium bowl. Stir in onions, cheese, salt, pepper, and bacon.
Generative content: Place your mashed potatoes in a medium bowl. Stir in cooked onions, cheese, salt, pepper, and bacon.
B Prompts for Response Generation

QUESTION ANSWERING
Answer the following question:
{question}
Bear in mind that your response should be strictly based on the following 3 passages:
{passages}
In case the passages do not contain the necessary information to answer the question, please reply with: "Unable to answer based
on given passages."
DATA-TO-TEXT WRITING
Instruction:
Write an objective overview about the following local business based only on the provided structured data in the JSON format.
You should include details and cover the information mentioned in the customers’ review. The overview should be 100 - 200
words. Don’t make up information.
Structured data:
{json_data}
Overview:
SUMMARIZATION
Summarize the following news within {word_num} words:
{news}
output:
Table 9: Prompts for generating responses for the three types of tasks. word_num is min(200, word_num_of_news//4). The word count requirement is only used to control the length of the generated summarization; it does not serve as the basis for hallucination annotation.
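For illustration, the question-answering template above can be filled and sent to one of the OpenAI models roughly as follows. This is a sketch assuming the openai>=1.0 Python client and an OPENAI_API_KEY in the environment, not the authors' exact generation script; sampling settings are assumptions.

```python
from openai import OpenAI  # assumes the openai>=1.0 client with OPENAI_API_KEY set

# Template copied from Table 9 (question answering).
QA_TEMPLATE = """Answer the following question:
{question}
Bear in mind that your response should be strictly based on the following 3 passages:
{passages}
In case the passages do not contain the necessary information to answer the question, please reply with: "Unable to answer based on given passages."
"""


def generate_answer(question: str, passages: list[str], model: str = "gpt-3.5-turbo-0613") -> str:
    """Fill the QA template and request one response from the chosen model."""
    prompt = QA_TEMPLATE.format(question=question, passages="\n\n".join(passages))
    client = OpenAI()
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,  # sampling settings are not specified in the paper
    )
    return resp.choices[0].message.content
```

The other models listed in Section 3.2 would be called through their own APIs or local inference stacks with the corresponding templates from Table 9.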
C Annotation Details
Figure 5: Annotation interface. For privacy reasons, we have masked the full names of the annotators in the
screenshot.
Task                   Model                 # Hallucination Span   implicit_true (# Span / % Span)   due_to_null (# Span / % Span)
Question Answering     GPT-3.5-turbo-0613                      89                        33 / 0.371
                       GPT-4-0613                              51                        15 / 0.294
                       Llama-2-7B-chat                       1010                       251 / 0.249
                       Llama-2-13B-chat                       654                       215 / 0.329
                       Llama-2-70B-chat                       529                       168 / 0.318
                       Mistral-7B-Instruct                    594                       164 / 0.276
Data-to-text Writing   GPT-3.5-turbo-0613                     384                        52 / 0.135                      69 / 0.180
                       GPT-4-0613                             354                        24 / 0.068                     209 / 0.590
                       Llama-2-7B-chat                       1775                       195 / 0.110                     230 / 0.130
                       Llama-2-13B-chat                      2803                       260 / 0.090                     439 / 0.157
                       Llama-2-70B-chat                      1834                       274 / 0.149                     272 / 0.148
                       Mistral-7B-Instruct                   2140                       102 / 0.048                     423 / 0.198
Summarization          GPT-3.5-turbo-0613                      60                        14 / 0.233
                       GPT-4-0613                              80                        10 / 0.125
                       Llama-2-7B-chat                        517                        44 / 0.085
                       Llama-2-13B-chat                       342                        28 / 0.082
                       Llama-2-70B-chat                       245                        27 / 0.110
                       Mistral-7B-Instruct                    828                        52 / 0.063
Overall                                                      14289                     1928 / 0.135                    1642 / 0.115

Table 10: Detailed statistical information for the labels implicit_true and due_to_null. The majority of implicit truths appear in two types of tasks: question answering and data-to-text writing. About 17.7% of hallucination spans in the data-to-text writing tasks are related to null values in the JSON data.
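To make the extra labels concrete, the snippet below shows one plausible JSON encoding of a single annotated span carrying the due_to_null flag; the field names are hypothetical and may differ from the released corpus files.

```python
import json

# Hypothetical encoding of one annotated response from the data-to-text task.
annotated_response = {
    "task_type": "Data-to-text Writing",
    "model": "GPT-4-0613",
    "response": "The restaurant offers outdoor seating but does not take reservations.",
    "labels": [
        {
            "start": 42,
            "end": 68,
            "text": "does not take reservations",
            "label_type": "Evident Introduction of Baseless Information",
            "implicit_true": False,
            # The source JSON had "RestaurantsReservations": null (unknown),
            # so stating a negation is unsupported by the reference.
            "due_to_null": True,
        }
    ],
}
print(json.dumps(annotated_response, indent=2))
```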
D Hallucination Detection Prompts
SUMMARIZATION
Below is the original news:
{article}
Below is a summary of the news:
{summary}
Your task is to determine whether the summary contains either or both of the following two types of hallucinations:
1. conflict: instances where the summary presents direct contradiction or opposition to the original news;
2. baseless info: instances where the generated summary includes information which is not substantiated by or inferred from the
original news.
Then, compile the labeled hallucinated spans into a JSON dict, with a key "hallucination list" and its value is a list of
hallucinated spans. If there exist potential hallucinations, the output should be in the following JSON format: {"hallucination
list": [hallucination span1, hallucination span2, ...]}. Otherwise, leave the value as a empty list as following: {"hallucination
list": []}.
Output:
QUESTION ANSWERING
Below is a question:
{question}
Below are related passages:
{passages}
Below is an answer:
{answer}
Your task is to determine whether the answer contains either or both of the following two types of hallucinations:
1. conflict: instances where the answer presents direct contradiction or opposition to the passages;
2. baseless info: instances where the answer includes information which is not substantiated by or inferred from the passages.
Then, compile the labeled hallucinated spans into a JSON dict, with a key "hallucination list" and its value is a list of
hallucinated spans. If there exist potential hallucinations, the output should be in the following JSON format: {"hallucination
list": [hallucination span1, hallucination span2, ...]}. Otherwise, leave the value as a empty list as following: {"hallucination
list": []}.
Output:
DATA-TO-TEXT WRITING
Below is a structured data in the JSON format:
{business info}
Below is an overview article written in accordance with the structured data:
{overview}
Your task is to determine whether the overview contains either or both of the following two types of hallucinations:
1. conflict: instances where the overview presents direct contradiction or opposition to the structured data;
2. baseless info: instances where the generated overview includes information which is not substantiated by or inferred from the
structured data.
In JSON, "null" or "None" represents an unknown value rather than a negation.
Then, compile the labeled hallucinated spans into a JSON dict, with a key "hallucination list" and its value is a list of
hallucinated spans. If there exist potential hallucinations, the output should be in the following JSON format: {"hallucination
list": [hallucination span1, hallucination span2, ...]}. Otherwise, leave the value as a empty list as following: {"hallucination
list": []}.
Output:
Table 11: Prompts for detecting hallucination for the three types of tasks. In the prompt for data-to-text writing, we
clarified that null or None in JSON should be treated as unknown rather than a negation.
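The detection prompts above ask the judged model to return a JSON object with a "hallucination list". A small post-processing sketch like the following could turn that output into character-level spans for scoring; it assumes the model's reply is valid JSON and that each listed span appears verbatim in the response, which is not guaranteed in practice.

```python
import json


def parse_detection_output(model_output: str, response: str):
    """Map the detector's {"hallucination list": [...]} output to (start, end) offsets."""
    try:
        spans = json.loads(model_output).get("hallucination list", [])
    except json.JSONDecodeError:
        return []  # malformed output is treated as "no hallucination detected"

    offsets = []
    for span in spans:
        start = response.find(span)
        if start != -1:  # skip spans the model paraphrased rather than copied
            offsets.append((start, start + len(span)))
    return offsets


output = '{"hallucination list": ["between 20 and 32 weeks of pregnancy for the best pictures"]}'
response = "For 3D ultrasounds, schedule the appointment between 20 and 32 weeks of pregnancy for the best pictures."
print(parse_detection_output(output, response))  # one (start, end) pair covering the flagged span
```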