0% found this document useful (0 votes)

33 views

RAGTruth- A Hallucination Corpus for Developing Trustworthy Retrieval-Augmented Language Models

The document introduces RAGTruth, a large-scale corpus designed to analyze hallucinations in retrieval-augmented generation (RAG) applications of large language models (LLMs). It contains nearly 18,000 annotated responses from various LLMs, allowing for the benchmarking of hallucination frequencies and the effectiveness of detection methodologies. The study demonstrates that fine-tuning a smaller LLM with RAGTruth can achieve competitive hallucination detection performance compared to state-of-the-art models like GPT-4.

Uploaded by

w4nderlust

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

33 views

RAGTruth- A Hallucination Corpus for Developing Trustworthy Retrieval-Augmented Language Models

Uploaded by

w4nderlust

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 16

RAGTruth: A Hallucination Corpus for Developing Trustworthy

Retrieval-Augmented Language Models

Cheng Niu1 , Yuanhao Wu1 , Juno Zhu1 , Siliang Xu1 , Kashun Shum1 ,
Randy Zhong1 , Juntong Song1 , and Tong Zhang2
1
NewsBreak
2
University of Illinois Urbana-Champaign
[email protected]

Abstract Q UESTION : how to prepare to get an ultrasound?

R EFERENCE : Passage 1: 1 If you are just interested in
Retrieval-augmented generation (RAG) has be- checking the gender, you can go as early as 17 weeks. 2
come a main technique for alleviating halluci- Any earlier, and the results may not be accurate, leading
arXiv:2401.00396v2 [cs.CL] 17 May 2024

nations in large language models (LLMs). De- to an extra coat of paint for your bedroom. 3 To get good
pictures, 3D ultrasounds are best performed between 20
spite the integration of RAG, LLMs may still
and 32 weeks, and the best pictures are between 24 and
present unsupported or contradictory claims 30 weeks.
to the retrieved contents. In order to develop ...
effective hallucination prevention strategies un- R ESPONSE : Based on the given passages, here’s how to
der RAG, it is important to create benchmark prepare for an ultrasound:
datasets that can measure the extent of halluci- * For a gender reveal ultrasound, wait until at least 17
nation. This paper presents RAGTruth, a cor- weeks of pregnancy to ensure accurate results.
pus tailored for analyzing word-level hallucina- * For 3D ultrasounds, schedule the appointment between
tions in various domains and tasks within the 20 and 32 weeks of pregnancy for the best pictures ...
standard RAG frameworks for LLM applica-
A NNOTATION
tions. RAGTruth comprises nearly 18,000 nat-
Span: between 20 and 32 weeks of pregnancy for the best
urally generated responses from diverse LLMs pictures
using RAG. These responses have undergone Type: Evident Conflict
meticulous manual annotations at both the indi- Reason: Original: "the best pictures are between 24 and
vidual case and word levels, incorporating eval- 30 weeks", Generative: "between 20 and 32 weeks of
uations of hallucination intensity. We not only pregnancy for the best pictures"
benchmark hallucination frequencies across dif-
ferent LLMs, but also critically assess the ef- Table 1: An example of RAGTruth data from the ques-
fectiveness of several existing hallucination de- tion answering task. It contains context, response gener-
tection methodologies. We show that using a ated by LLM with and span-level annotation.
high-quality dataset such as RAGTruth, it is
possible to finetune a relatively small LLM and
achieve a competitive hallucination detection
tual or accurate information (Rawte et al., 2023).
performance when compared to the existing
prompt-based approaches using state-of-the-art The occasional generation of outputs that appear
LLMs such as GPT-4. Furthermore, the fine- plausible but are factually incorrect significantly
tuned model can effectively mitigate hallucina- undermine the reliability of LLMs in real-world
tion in LLM responses. scenarios, such as medical diagnoses (Pal et al.,
2023) and news summarization (Shen et al., 2023).
1 Introduction
To reduce hallucination, various methods have
Large language models (LLMs) have achieved re- been developed that can be applied at differ-
markable success in a variety of tasks, including ent stages of LLM lifecycle, including pre-
text generation (Li et al., 2022), machine transla- training (Brown et al., 2020), supervised fine-
tion (Kocmi and Federmann, 2023), and question tuning (Zhou et al., 2023; Zhang et al., 2023a),
answering (Zhao et al., 2023). However, one of the RLHF (Ouyang et al., 2022; Lin et al., 2022),
key challenges in deploying LLMs in real-world and inference (Dhuliawala et al., 2023; Gao et al.,
applications is their tendency to hallucinate (Kad- 2023). In terms of detection, methods are devel-
dour et al., 2023). Hallucination in the context oped by examining the model’s intrinsic state (Guo
of LLMs usually refers to a situation where the et al., 2017), comparing it with external data
model generates content that is not based on fac- and tools (Chern et al., 2023), or leveraging the
LLM’s inherent powerful capabilities for self- different hallucination detection methods at
checking (Agrawal et al., 2023; Manakul et al., both the passage and word levels.
2023). Retrieval-augmented generation (RAG) is
extensively used to supply LLMs with updated, (iii) We present a baseline method of fine-tuning
relevant knowledge, significantly mitigating hal- LLM for hallucination detection. It is shown
lucination (Varshney et al., 2023). Nevertheless, that by fine-tuning the Llama-2-13B model on
even with RAG and other enhancements, LLMs the RAGTruth training data, we can achieve
still produce statements that are either unfounded results competitive to the existing prompt-
or contradict the information provided in the re- based approaches using GPT-4. This shows
trieved references (Shuster et al., 2021). the potential of developing better hallucina-
tion detection methods using RAGTruth.
Despite the growing awareness of the hallucina-
tion phenomenon, the understanding of hallucina- (iv) We show that by using our finetuned hallucina-
tion in LLMs is still in its early stages. One key tion detector, it is possible to significantly re-
challenge is the lack of high-quality, large-scale duce the occurrence of hallucinations in the re-
datasets specifically designed for hallucination de- sponses from LLMs. The improvement holds
tection. This issue is particularly acute in RAG even for models with inherently low halluci-
settings. Due to the relatively low hallucination nation rates, such as GPT-4.
ratio, a substantial increase in annotation resources
is needed. Existing datasets for LLM hallucina- 2 Related Work
tion detection are predominantly synthesized (Li
2.1 Hallucination of Large Language Models
et al., 2023). For instance, in Liu and Liu (2023);
Longpre et al. (2021), prompts conflicting with Though hallucination in traditional natural lan-
conventional knowledge are purposely generated guage generation (NLG) contexts has been widely
to trigger hallucinations. While these approaches studied(Ji et al., 2023), comprehending and tack-
are efficient at generating hallucinations, the re- ling this problem in the context of LLMs presents
sulting artificial hallucinations can substantially distinct challenges(Zhang et al., 2023b). Existing
differ from those that naturally occur. In Chen et al. research has demonstrated that incorporating up-
(2023); Hu et al. (2023), hallucination datasets are to-date, relevant knowledge in the prompt can ef-
developed by manual annotations of naturally pro- fectively reduce fact-conflicting hallucination (Vu
duced LLM responses. However, these datasets are et al., 2023; Lewis et al., 2021). This approach,
of limited size and are not specifically focused on referred to as Retrieval-Augmented Generation
the RAG scenario. (RAG), is widely used in real-world LLM applica-
In this paper, we introduce a large-scale high- tions. For instance, Google Bard 1 and Microsoft
quality dataset specifically designed for word-level BingChat 2 have implemented this technique.
hallucination detection for RAG applications. Us- 2.2 Hallucination Evaluation Datasets
ing this dataset, we have conducted an extensive
Extensive research has focused on hallucination
benchmarking of mainstream LLMs to assess their
benchmarks within conventional Natural Language
tendency to generate hallucinations, as well as
Generation settings (Dziri et al., 2022; Zhong et al.,
evaluate current methods for hallucination detec-
2021; Durmus et al., 2020; Lin et al., 2022). With
tion. Additionally, we have demonstrated supe-
the rise of LLMs, the detection of hallucinations
rior performance in identifying hallucinations by
has become increasingly challenging, necessitating
fine-tuning LLM with RAGTruth dataset. Our key
the development of high-quality datasets for LLM
contributions are:
evaluation (Chen and Shu, 2023). Contributions
(i) We propose RAGTruth, a large-scale word- in this domain include HaluEval (Li et al., 2023),
level hallucination evaluation dataset specifi- which introduced datasets encompassing both syn-
cally for the RAG scenario across several com- thetically and naturally generated LLM responses,
mon tasks. It consists of nearly 18,000 fully and FELM (Chen et al., 2023), which concentrated
annotated natural responses generated from on naturally generated LLM responses across mul-
major open-source and closed-source LLMs. tiple domain tasks. RefChecker (Hu et al., 2023), a
1
https://bard.google.com
2
(ii) We perform a comprehensive comparison of https://www.bing.com
distinctive approach, breaks down claims in LLM vided information. These conflicts are easily ver-
responses into triples and utilizes human annota- ifiable without extensive context, often involving
tion to assess the truthfulness of facts. Notably, clear factual errors, misspelled names, incorrect
these works primarily focus on annotating factual numbers, etc.
hallucinations in LLM responses. Distinguishing
from previous research, our work centers on the Subtle Conflict: for when generative content
evaluation of LLMs within RAG settings. presents a departure or divergence from the pro-
vided information, altering the intended contextual
2.3 Hallucination Detection Methods meaning. These conflicts often involve substitu-
Researchers have been exploring various methods tion of terms that carry different implications or
to enhance the reliability of LLMs by detecting hal- severity, requiring a deeper understanding of their
lucinations. In Azaria and Mitchell (2023); Xiao contextual applications.
and Wang (2021); Malinin and Gales (2021), intrin- Evident Introduction of Baseless Information:
sic model uncertainty metrics such as token-level for when generated content includes information
probability and entropy are used to detect halluci- not substantiated in the provided information. It
nations. When direct access to output uncertainty involves the creation of hypothetical, fabricated, or
is not feasible, as in the case with limited APIs like hallucinatory details lacking evidence or support.
GPT-4, an alternative approach involves employing
a fully accessible LLM as a proxy (Manakul et al., Subtle Introduction of Baseless Information:
2023). In Falke et al. (2019); Barrantes et al. (2020), is when generated content extends beyond the pro-
natural language inference modules are adapted to vided information by incorporating inferred details,
check the information consistency between the ar- insights, or sentiments. This additional informa-
ticles and their summaries, and it has been shown tion lacks verifiability and might include subjective
that external knowledge is helpful for detecting fac- assumptions or commonly observed norms rather
tual hallucinations. (Guo et al., 2022; Mallen et al., than explicit facts.
2022). Additionally, methods that leverage the in-
3.2 Response Generation
herent capabilities of LLMs have been proposed
for self-checking, such as verbalization-based and Tasks and Data Sources We selected three
consistency-based methods (Xiong et al., 2023; widely recognized generation tasks with RAG set-
Manakul et al., 2023). These techniques aim to tings for response generation: Question Answering,
detect hallucinations without relying on internal Data-to-text Writing, and News Summarization.
states or external data and tools. For the task of question answering, we con-
ducted a random sampling from the training set
3 Construction Process of RAGTruth of MS MARCO (Nguyen et al., 2016). To reduce
the difficulty of annotation, we selected only those
We established a data generation and annotation
questions related to daily life, and preserved only
pipeline as shown in Figure 1.
three retrieved passages for each question. Then
3.1 Hallucination Taxonomy we prompted LLMs to generate answers for each
question solely based on the retrieved passages.
Different from open-end generation, under RAG
For the data-to-text writing task, we prompted
setting, the prompt contains rich context informa-
LLMs to generate an objective overview for
tion, and the model is generally required to gen-
a randomly sampled business in the restaurant
erate text based on the provided context. The de-
and nightlife categories from the Yelp Open
tection and mitigation of inconsistencies between
Dataset (Yelp, 2021). In this dataset, information
retrieved information and responses emerge as sig-
pertaining to a business is represented using struc-
nificant sources of hallucination.
tured data. To streamline the annotation process,
As outlined below, we categorize the halluci-
we focused only on the following business informa-
nation in the RAG setting into four types. For
tion fields: BusinessParking, RestaurantsReserva-
concrete examples of each type, please refer to
tions, OutdoorSeating, WiFi, RestaurantsTakeOut,
Appendix A.
RestaurantsGoodForGroups, Music, and Ambience.
Evident Conflict: for when generative content In addition to the structured data, we have also in-
presents direct contraction or opposition to the pro- cluded up to three business-related user reviews
Figure 1: Data gathering pipeline. Taking a data-to-text writing task as an example, our data gathering pipeline
includes 2 steps: 1) response generation. We generated responses with multiple LLMs and natural prompts. 2)
human annotation. Human labeler annotated hallucinated spans in LLM responses.

to enrich the context information. In the prompt, the accuracy and reliability of the annotation re-
these information is represented in JSON format. sults. We recruited annotators from a professional
For the news summarization task, we randomly vendor and paid them at a rate of $25 per hour per
selected documents from the training set of the individual.
well-known CNN/Daily Mail dataset (See et al., The annotators are invited to perform annotation
2017) as well as recent news articles from a pres- tasks using Label Studio (Tkachenko et al., 2020-
tigious news platform. LLMs were prompted to 2022). Each labeling task is presented within one
generate a summary for each of the source news. page, comprising the following components: 1) the
context provided to the AI models; 2) a set of 6
Models The following six models with strong
responses, generated by different AI models. Our
instruction-following ability are used for response
annotation interface is available in Appendix C.
generation: GPT-3.5-turbo-0613 and GPT-4-0613
Their task was to annotate the specific spans of
from OpenAI (OpenAI, 2023); Mistral-7b-Instruct
the generated text that contains hallucinated infor-
from Mistral AI (Jiang et al., 2023); Llama-2-
mation and categorize them into the four types. To
7B-chat, Llama-2-13B-chat and Llama-2-70B-chat
ensure the quality of the annotations, each response
(4bit quantized)3 from Meta (Touvron et al., 2023).
is independently labeled by two annotators. The
To ensure a fair comparison, the prompts used
consistency rate of two annotators was 91.8% at the
for response generation are kept straightforward
response level and 78.8% at the span level. In cases
with subtle differences among various models to
where there is a considerable difference between
optimize their performance. We provide detailed
the two annotations, a third review is undertaken.
prompts in the Appendix B.
For each sample, we collected one response from 3.4 Annotations for Adaptive Evaluation
each model. As a result, we got a total of 6 re-
In different contexts, the definition and criteria for
sponses for each input sample.
hallucination vary, and the annotation of hallucina-
3.3 Human Annotation tion is not always straightforward. In contentious
cases, additional annotations are provided to accu-
Identifying AI-generated hallucinations is a chal-
rately reflect these situations. This approach en-
lenging task. It requires a strong capacity for criti-
ables users to adopt various evaluation strategies
cal thinking to understand the logical flow of vari-
tailored to their specific application circumstances.
ous texts, along with meticulous attention to detail
Please refer to Appendix C for more statistical in-
for spotting subtle inaccuracies and inconsisten-
formation about these annotations.
cies. Moreover, a certain level of media literacy
and knowledge of current affairs is crucial to grasp Implicit Truth The extensive world knowledge
the subjects discussed in news-related sample data. and ability of LLMs is a significant advantage in
Therefore, we chose annotators who are proficient open-ended generation scenarios. But in the con-
in English and possess a bachelor’s degree in En- text of this paper, which focuses on the relatively
glish, Communications, or relevant fields to ensure strict RAG scenarios, we have labeled information
3
https://huggingface.co/TheBloke/ that is not mentioned in the reference but may be
Llama-2-70B-Chat-AWQ truthful as hallucinations. For instance, mentioning
Subtle baseless info Subtle conflict
Evident baseless info Evident conflict lengths than existing datasets for hallucination de-
tection (Wang et al., 2020).
QA
4.2 Hallucination Statistics
Data-to-text Hallucination Types As shown in Figure 2, the
Writing
generation of information baseless in the context
Sum. was significantly more prevalent than the gener-
ation of information conflicting with the context,
0% 25% 50% 75% 100% especially for the question answering tasks. Within
Figure 2: Frequency of different types of hallucination the two major categories of baseless info and con-
by task. flict, the more severe hallucinations, namely Evi-
dent baseless info and Evident conflict, respectively,
account for a significant portion. This observation
a local officer’s name not present in the reference highlights the importance and challenges of LLMs
or claiming that a restaurant accepts credit card hallucination mitigation, even in RAG settings.
payments without any basis.
The decision is based on the observation that Hallucination vs Tasks As shown in Table 2,
LLMs have a relatively high chance of making er- across the three tasks, the date-to-text writing task
rors when generating detailed facts, partly because exhibited the highest frequency of hallucinations in
their embedded knowledge can be outdated. There- its responses. Inconsistent handling of JSON for-
fore, RAG applications usually instruct LLMs not mat data, especially time and attributes, contributed
to generate factual content without the support of to a significant number of hallucinations in this task.
references. Besides, we provided an additional Interestingly, the models did not show a higher rate
span-level annotation named implicit_true for these of hallucinations for recent news compared to out-
spans to accommodate different application needs. dated news. This could be attributed to the shorter
context length in the recent news subtask compared
Differences in Handling Null Value In the data- to the CNN/DM subtask.
to-text writing task, certain fields sometimes are
with null values. We observed that in the generated Hallucination vs Models Table 3 illustrates that
results, null is often interpreted as false by some among the data we collected, OpenAI’s two mod-
models. Since the more common expressions for els demonstrated notably lower hallucination rates
negation in our dataset are the boolean value False compared to others. Specifically, GPT-4-0613 ex-
or the text No, we labeled these instances as hal- hibited the lowest hallucination frequency.
lucinations (evident introduction of baseless info) To more clearly compare the hallucination rate
and provided a special span-level annotation named of different models, we calculated the hallucination
due_to_null for these spans. In the subsequent hal- density for each model across three tasks. Hallu-
lucination detection experiments, our prompts will cination density is defined as the average number
be aligned with this standard. of hallucination spans per hundred words in the
responses. In the Llama2 series, a clear negative
4 Hallucination Benchmark Analysis correlation was observed between the model scale
4.1 Basic Statistics and hallucination density, aside from the data-to-
text writing tasks. Despite its strong performance
We presented detailed statistics of RAGTruth in in various benchmarks and leaderboards (Zheng
Table 2. Compared to existing datasets for hallu- et al., 2023), the Mistral-7B-Instruct model gener-
cination detection (Cao et al., 2023; Kamoi et al., ated the highest number of responses containing
2023), the RAGTruth dataset is considerably large hallucinations.
in scale. The corpus contains a total of 2,965 in-
stances of data, which include 989 instances for Hallucination vs Length After removing the top
question answering, 1,033 instances for date-to-text and bottom 5% of outliers, we partitioned the data
writing, and 943 instances for news summarization. for each task type into three equal-sized groups
Each instance comprises responses from 6 differ- according to the length of the context/response. We
ent models. As shown in Table 2, the RAGTruth then computed the average number of hallucinated
dataset also features longer prompt and response spans per response within each group. As shown
C ONTEXT L ENGTH R ESP. L ENGTH H ALLUCINATION
Task # Instance # Resp.
Mean Max Mean Max # Resp. % Resp. # Span
Question Answering 989 5934 243 509 119 381 1724 29.1% 2927
Data-to-text Writing 1033 6198 354 1253 159 369 4254 68.6% 9290
Summarization(CNN/DM) 628 3768 648 1749 124 632 1165 30.9% 1474
Summarization(Recent News) 315 1890 369 481 89 240 521 27.6% 598
Overall 2965 17790 381 1749 131 632 7664 43.1% 14289

Table 2: The basic statistics of RAGTruth. Here "Resp." stands for "Response".

Q UESTION A NSWERING DATA - TO - TEXT W RITING S UMMARIZATION OVERALL

Model
# Resp. # Span Density # Resp. # Span Density # Resp. # Span Density # Resp. # Span
GPT-3.5-turbo-0613 75 89 0.12 272 384 0.18 54 60 0.05 401 533
GPT-4-0613 48 51 0.06 290 354 0.27 74 80 0.08 406 485
Llama-2-7B-chat 510 1010 0.59 888 1775 1.27 434 517 0.58 1832 3302
Llama-2-13B-chat 399 654 0.48 983 2803 1.53 295 342 0.41 1677 3799
Llama-2-70B-chat† 320 529 0.40 863 1834 1.15 212 245 0.26 1395 2608
Mistral-7B-Instruct 378 594 0.59 958 2140 1.51 617 828 0.86 1953 3562

Table 3: Hallucination counts and density of models. †: We used 4-bit quantized version of Llama-2-70B-chat.

CLB S UMMARIZATION D2T W RITING QA

QA
1 0.29(176,368] 1.51(178,273] 0.50(131,187]
2 0.36(368,587] 1.48(273,378] 0.51(187,288] Data-to-text
3 0.44(587,1422] 1.49(378,731] 0.49(288,400] Writing
RLB S UMMARIZATION D2T W RITING QA Sum.
1 0.34(44,87] 1.20(93,131] 0.21(19,93] 0% 25% 50% 75% 100%
2 0.32(87,119] 1.59(131,175] 0.37(93,138] Normalized position
3 0.44(119,245] 1.69(175,258] 0.87(138,257]

Table 4: Average number of hallucinations per response Figure 3: Heatmaps of normalized hallucination oc-
in different context length buckets (CLB) and response currence positions. The probability of hallucinations
length buckets (RLB) for the three types of tasks. The occurring is higher in brighter areas.
subscript denotes the minimum and maximum length of
this bucket.
5 Experimental Setup
5.1 Hallucination Detection Algorithms
in Table 4, there is a clear overall trend of an in-
Using RAGTruth, we conducted experiments with
crease in the average number of hallucinations as
the following four distinct algorithms for halluci-
the response length grows. Only the average num-
nation detection:
ber of hallucinations in news summarization tasks
Hallucination Detection Prompt: Hallucination
significantly increases with the length of the con-
detection prompts are manually crafted to instruct
text. This may be because the contexts in the other
LLMs (GPT-4-turbo and GPT-3.5-turbo) in assess-
two tasks are more structured, and an increase in
ing whether a given reference-response pair con-
length does not significantly raise the difficulty of
tains hallucinated content and to identify the corre-
understanding the content.
sponding hallucinated spans in the response. For
detailed information about these prompts, please
Location of Hallucinations In Figure 3, we refer to Appendix D.
present the heatmap of the hallucination occurrence SelfCheckGPT (Manakul et al., 2023): Self-
positions. Hallucinations are significantly more CheckGPT employs a zero-resource, sampling-
likely to occur towards the end of responses in based method to fact-check the responses of black-
question-answering and news summarization tasks. box models. When processing each response in
Compared to other tasks, the data-to-text writing RAGTruth, 3 extra responses from the same model
task has a relatively higher occurrence of halluci- were sampled and served as references, and GPT-
nations in the first half. In that bright area, hallu- 3.5-turbo was used to verify consistency. We de-
cinations concerning business attributes frequently tected hallucinations sentence-by-sentence within
occur. a response, and then aggregated these results to
Q UESTION A NSWERING DATA - TO - TEXT W RITING S UMMARIZATION OVERALL
Methods
Precision Recall F1 Precision Recall F1 Precision Recall F1 Precision Recall F1
Promptgpt-3.5-turbo 18.8 84.4 30.8 65.1 95.5 77.4 23.4 89.2 37.1 37.1 92.3 52.9
Promptgpt-4-turbo 33.2 90.6 45.6 64.3 100.0 78.3 31.5 97.6 47.6 46.9 97.9 63.4
SelfCheckGPTgpt-3.5-turbo 35.0 58.0 43.7 68.2 82.8 74.8 31.1 56.5 40.1 49.7 71.9 58.8
LMvLMgpt-4-turbo 18.7 76.9 30.1 68.0 76.7 72.1 23.3 81.9 36.2 36.2 77.8 49.4
Finetuned Llama-2-13B 61.6 76.3 68.2 85.4 91.0 88.1 64.0 54.9 59.1 76.9 80.7 78.7

Table 5: The response-level hallucination detection performance for each baseline method across different tasks and
different models.

Q UESTION A NSWERING DATA - TO - TEXT W RITING S UMMARIZATION OVERALL

Methods
Precision Recall F1 Precision Recall F1 Precision Recall F1 Precision Recall F1
Prompt Baselinegpt-3.5-turbo 7.9 25.1 12.1 8.7 45.1 14.6 6.1 33.7 10.3 7.8 35.3 12.8
Prompt Baselinegpt-4-turbo 23.7 52.0 32.6 17.9 66.4 28.2 14.7 65.4 24.1 18.4 60.9 28.3
Finetuned Llama-2-13B 55.8 60.8 58.2 56.5 50.7 53.5 52.4 30.8 38.8 55.6 50.2 52.7

Table 6: The span-level detection performance for each baseline method across different tasks and different models.

provide a response-level detection outcome. 6 Experimental Results

LMvLM (Cohen et al., 2023): LMvLM is an ap-
6.1 Response-level Detection
proach that employs a multi-turn interaction be-
tween two Language Models that aim to discover The results in Table 5 reveal that hallucination
inconsistencies through cross-examination. detection remains a significant challenge in the
LLM Finetuning: Llama-2-13B has been fine- context of RAG for all existing detection methods.
tuned using the training set from RAGTruth. The Even when reference information is available, the
model takes the context-response pair with proper responses generated may still include hallucina-
instructions as the input and treats the hallucinate tions, which current LLMs cannot reliably identify.
span as the targeted generation output. We em- The most advanced LLM, GPT-4-turbo, achieves
ployed full training with an initial learning rate only an average F1 score of 63.4%. For another
of 2e-5, and limiting the training to 1 epochs, all notable baseline, SelfCheckGPT also shows unsat-
conducted on 4 A100 GPUs. isfactory performance in this regard, achieving an
average F1 score of 58.8% with GPT-3.5-turbo.
5.2 Data Split By utilizing our high-quality training set, a fine-
All detection algorithms are tested on the same tuned Llama-2-13B can achieve the best perfor-
RAGTruth test set, which consists of 450 instances mance with an average 78.7% f1 score. This
in total, derived by randomly selecting 150 in- shows the effectiveness of our data in improving
stances from each task type. The rest of the data the model’s hallucination detection ability.
is used to fine-tune the LLama-2-13B model, as
6.2 Span-level Detection
previously mentioned.
RAGTruth, as a hallucination corpus with fine-
5.3 Evaluation Metrics grained span labels, enables us to present exper-
It is a more challenging and significant task to iden- imental results for span-level detection, serving as
tify the locations of hallucinations within the re- a baseline for future research. As shown in Ta-
sponse than only determining whether a response ble 6, the overall performance of the current detec-
contains hallucinations. We assess hallucination tion method is sub-optimal, highlighting the chal-
detection at both the response and span levels. lenges in span-level detection. Even the advanced
GPT-4-turbo tends to incorrectly classify many
Response-level Detection We report precision,
non-hallucinated contents with a low precision of
recall, and F1 score for each detection algorithm
18.4%. While our fine-tuned model shows im-
and its variants across different tasks.
proved capability in identifying hallucinated spans
Span-level Detection We calculate the overlap by achieving an averaged f1 score of 52.7%, it still
between the detected span and human-labeled span falls short of perfect detection, emphasizing the
and report the precision, recall, and f1 score at the inherent difficulties of this task.
char-level. We also report the detection performance across
G ROUP S ELECTION S TRATEGY VALID R ESPONSE N UM H ALLUCINATION R ATE
Random 450 52.4(-)
Llama-2-7B-chat (51.8)
Select the response with fewer detected hallucination spans 450 41.1(↓21.6%)
Mistral-7B-Instruct (57.6)
Select the response with no detected hallucination spans 328† 19.3(↓63.2%)

Random 450 9.8(-)

GPT-3.5-Turbo-0613 (10.9)
Select the response with fewer detected hallucination spans 450 5.6(↓42.9%)
GPT-4-0613 (9.3)
Select the response with no detected hallucination spans 448† 4.8(↓51.0%)

Table 7: Utilizing the finetuned hallucination detector to sample from two responses can significantly reduce the
rate of hallucinations. The numbers within the brackets in the group column represent the model’s hallucination rate.
†: Some instances did not have responses that met the required criteria.

The results shown in Table 7 indicate that with

70
63.4 66.3
Subtle Conflict
Evident Conflict 60.4 the help of the hallucination detector, both strate-
60
52.9 55.8
Subtle Baseless Info
Evident Baseless Info 49.8 gies can significantly reduce the hallucination
50
rate. For the relatively small Llama-2-7B-chat
Recall(%)

40 36.2 38.3
35.3
and Mistral-7B-Instruct models, compared to ran-
30 25.4 dom selection, the first strategy reduced the hallu-
20
11.5 cination rate by 21.6%, while the second strategy
10 achieved a reduction of 63.2%. Even for models
2.5
0 with a low hallucination rate, specifically GPT-3.5-
Prompt(GPT-3.5-turbo) Prompt(GPT-4-turbo) Finetuned Llama-2-13B
Turbo and GPT-4, employing the finetuned hallu-
Figure 4: The span-level recalls of different models on cination detector for sampling can still further re-
four types of hallucinations. duce the rate of hallucinations. The two strategies
yielded a reduction in hallucination rates of 42.9%
and 51.0%, respectively. These results demonstrate
four different types of hallucination spans. In the the potential of an efficient hallucination detection
current stage, as we have not differentiated the model in developing trustworthy RAG LLMs.
types of detected hallucinations, we only report the
char-level recall for different types of hallucina- 7 Conclusion
tions. As indicated in Figure 5, the detection of
In this paper, we introduce RAGTruth, a large-scale
evident hallucinations proves more effective com-
corpus of naturally generated hallucinations, fea-
pared to that of subtle hallucinations.
turing detailed word-level annotations tailored for
RAG scenarios. Our work includes an in-depth
6.3 Hallucination Suppression
analysis of the interplay between hallucinations
We tested the effectiveness of hallucination sup- and various factors, such as task types, models be-
pression using our finetuned hallucination detec- ing used, and contextual settings.
tion model. For the 450 instances in the test set, Additionally, we conduct empirical benchmarks
we employed two strategies to select a final output of several hallucination detection approaches using
from two responses generated by two different mod- our corpus. We show that fine-tuning Llama with
els with similar hallucination densities. The first RAGTruth leads to competitive performance. This
strategy involved selecting the response with fewer implies that by using a high-quality dataset such
predicted hallucination spans. The second strategy, as RAGTruth, it is possible to develop specialized
more stringent, mandated that the selected response hallucination detection models that are highly ef-
have no detected hallucination spans. When the fective when compared to prompt-based methods
number of hallucination spans detected in both can- using general models such as GPT-4.
didate responses is the same, one will be chosen at Simultaneously, our findings reveal that identi-
random. Due to limited response candidates, not fying hallucinations in RAG contexts, particularly
all instances have a response that conforms to the at the span level, remains a formidable challenge,
second strategy. In practical scenarios, this issue with current methods still falling short of reliable
can be addressed by increasing the number of can- detection. We hope that RAGTruth, can assist the
didate responses. We employed random selection development of hallucination detection techniques
as a simple baseline for comparison. for retrieval augmented generation.
8 Limitations Shiqi Chen, Yiran Zhao, Jinghan Zhang, I-Chun Chern,
Siyang Gao, Pengfei Liu, and Junxian He. 2023.
The study of hallucination in large language mod- Felm: Benchmarking factuality evaluation of large
els is a rapidly advancing field, characterized by language models.
the continuous evolution of application scenarios,
I-Chun Chern, Steffi Chern, Shiqi Chen, Weizhe Yuan,
sources of hallucination, and techniques for detect- Kehua Feng, Chunting Zhou, Junxian He, Graham
ing and preventing them. While our work repre- Neubig, and Pengfei Liu. 2023. Factool: Factuality
sents the first attempt to benchmark hallucination detection in generative ai – a tool augmented frame-
in the RAG setting, there may be situations not work for multi-task and multi-domain scenarios.
addressed by this research that are nonetheless sig-
Roi Cohen, May Hamri, Mor Geva, and Amir Glober-
nificant for certain practical applications. son. 2023. Lm vs lm: Detecting factual errors via
cross examination.
9 Ethical considerations
Shehzaad Dhuliawala, Mojtaba Komeili, Jing Xu,
This work is in full compliance with the Ethics Pol- Roberta Raileanu, Xian Li, Asli Celikyilmaz, and
icy of the ACL. We acknowledge that responses Jason Weston. 2023. Chain-of-verification reduces
generated by LLMs in this study may contain in- hallucination in large language models.
accuracies. Aside from this, to the best of our
Esin Durmus, He He, and Mona Diab. 2020. FEQA: A
knowledge, there are no additional ethical issues question answering evaluation framework for faith-
associated with this paper. fulness assessment in abstractive summarization. In
Proceedings of the 58th Annual Meeting of the Asso-
10 Acknowledgement ciation for Computational Linguistics, pages 5055–
5070, Online. Association for Computational Lin-
We appreciate the valuable feedback and assistance guistics.
from Shizhe Diao. We thank Doris Li for her sup-
port in creating the illustrations for this research. Nouha Dziri, Hannah Rashkin, Tal Linzen, and David
Reitter. 2022. Evaluating attribution in dialogue sys-
tems: The begin benchmark. Transactions of the
Association for Computational Linguistics, 10:1066–
References 1083.
Ayush Agrawal, Mirac Suzgun, Lester Mackey, and
Adam Tauman Kalai. 2023. Do language models Tobias Falke, Leonardo F. R. Ribeiro, Prasetya Ajie
know when they’re hallucinating references? Utama, Ido Dagan, and Iryna Gurevych. 2019. Rank-
ing generated summaries by correctness: An interest-
Amos Azaria and Tom Mitchell. 2023. The Internal ing but challenging application for natural language
State of an LLM Knows When its Lying. inference. In Proceedings of the 57th Annual Meet-
ing of the Association for Computational Linguistics,
Mario Barrantes, Benedikt Herudek, and Richard pages 2214–2220, Florence, Italy. Association for
Wang. 2020. Adversarial nli for factual correct- Computational Linguistics.
ness in text summarisation models. arXiv preprint
arXiv:2005.11739. Luyu Gao, Zhuyun Dai, Panupong Pasupat, Anthony
Chen, Arun Tejasvi Chaganty, Yicheng Fan, Vincent
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Zhao, Ni Lao, Hongrae Lee, Da-Cheng Juan, and
Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Kelvin Guu. 2023. RARR: Researching and revising
Neelakantan, Pranav Shyam, Girish Sastry, Amanda what language models say, using language models.
Askell, Sandhini Agarwal, Ariel Herbert-Voss, In Proceedings of the 61st Annual Meeting of the
Gretchen Krueger, Tom Henighan, Rewon Child, Association for Computational Linguistics (Volume 1:
Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Long Papers), pages 16477–16508, Toronto, Canada.
Clemens Winter, Christopher Hesse, Mark Chen, Eric Association for Computational Linguistics.
Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess,
Jack Clark, Christopher Berner, Sam McCandlish,
Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Wein-
Alec Radford, Ilya Sutskever, and Dario Amodei.
berger. 2017. On calibration of modern neural net-
2020. Language models are few-shot learners.
works.
Zouying Cao, Yifei Yang, and Hai Zhao. 2023. Auto-
hall: Automated hallucination dataset generation for Zhijiang Guo, Michael Schlichtkrull, and Andreas Vla-
large language models. ArXiv, abs/2310.00259. chos. 2022. A survey on automated fact-checking.

Canyu Chen and Kai Shu. 2023. Can llm-generated Xiangkun Hu, Dongyu Ru, Qipeng Guo, Lin Qiu, and
misinformation be detected? arXiv preprint Zheng Zhang. 2023. Refchecker for fine-grained
arXiv:2309.13788. hallucination detection.
Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Andrey Malinin and Mark Gales. 2021. Uncertainty
Su, Yan Xu, Etsuko Ishii, Ye Jin Bang, Andrea estimation in autoregressive structured prediction.
Madotto, and Pascale Fung. 2023. Survey of halluci-
nation in natural language generation. ACM Comput. Alex Mallen, Akari Asai, Victor Zhong, Rajarshi Das,
Surv., 55(12). Daniel Khashabi, Hannaneh, and Hajishirzi. 2022.
When Not to Trust Language Models: Investigat-
Albert Q. Jiang, Alexandre Sablayrolles, Arthur Men- ing Effectiveness of Parametric and Non-Parametric
sch, Chris Bamford, Devendra Singh Chaplot, Diego Memories.
de las Casas, Florian Bressand, Gianna Lengyel, Guil-
laume Lample, Lucile Saulnier, Lélio Renard Lavaud, Potsawee Manakul, Adian Liusie, and Mark J. F. Gales.
Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, 2023. SelfCheckGPT: Zero-Resource Black-Box
Thibaut Lavril, Thomas Wang, Timothée Lacroix, Hallucination Detection for Generative Large Lan-
and William El Sayed. 2023. Mistral 7b. guage Models.

Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng

Jean Kaddour, Joshua Harris, Maximilian Mozes, Her-
Gao, Saurabh Tiwary, Rangan Majumder, and
bie Bradley, Roberta Raileanu, and Robert McHardy.
Li Deng. 2016. MS MARCO: A human gener-
2023. Challenges and applications of large language
ated machine reading comprehension dataset. CoRR,
models.
abs/1611.09268.
Ryo Kamoi, Tanya Goyal, Juan Diego Rodriguez, and OpenAI. 2023. Gpt-4 technical report.
Greg Durrett. 2023. Wice: Real-world entailment
for claims in wikipedia. In Conference on Empirical Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Car-
Methods in Natural Language Processing. roll L. Wainwright, Pamela Mishkin, Chong Zhang,
Sandhini Agarwal, Katarina Slama, Alex Ray, John
Tom Kocmi and Christian Federmann. 2023. Large Schulman, Jacob Hilton, Fraser Kelton, Luke Miller,
language models are state-of-the-art evaluators of Maddie Simens, Amanda Askell, Peter Welinder,
translation quality. arXiv preprint arXiv:2302.14520. Paul Christiano, Jan Leike, and Ryan Lowe. 2022.
Training language models to follow instructions with
Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio human feedback.
Petroni, Vladimir Karpukhin, Naman Goyal, Hein-
rich Küttler, Mike Lewis, Wen tau Yih, Tim Rock- Ankit Pal, Logesh Kumar Umapathi, and Malaikannan
täschel, Sebastian Riedel, and Douwe Kiela. 2021. Sankarasubbu. 2023. Med-halt: Medical domain
Retrieval-augmented generation for knowledge- hallucination test for large language models.
intensive nlp tasks.
Vipula Rawte, Amit Sheth, and Amitava Das. 2023. A
Junyi Li, Xiaoxue Cheng, Wayne Xin Zhao, Jian-Yun Survey of Hallucination in Large Foundation Models.
Nie, and Ji-Rong Wen. 2023. HaluEval: A Large-
Scale Hallucination Evaluation Benchmark for Large Abigail See, Peter J. Liu, and Christopher D. Manning.
Language Models. 2017. Get to the point: Summarization with pointer-
generator networks. In Proceedings of the 55th An-
Junyi Li, Tianyi Tang, Wayne Xin Zhao, Jian-Yun Nie, nual Meeting of the Association for Computational
and Ji-Rong Wen. 2022. Pretrained language mod- Linguistics (Volume 1: Long Papers), pages 1073–
els for text generation: A survey. arXiv preprint 1083, Vancouver, Canada. Association for Computa-
arXiv:2201.05273. tional Linguistics.

Jiaming Shen, Jialu Liu, Dan Finnie, Negar Rahmati,

Stephanie Lin, Jacob Hilton, and Owain Evans. 2022.
Michael Bendersky, and Marc Najork. 2023. "
TruthfulQA: Measuring how models mimic human
why is this misleading?": Detecting news headline
falsehoods. In Proceedings of the 60th Annual Meet-
hallucinations with explanations. arXiv preprint
ing of the Association for Computational Linguistics
arXiv:2302.05852.
(Volume 1: Long Papers), pages 3214–3252, Dublin,
Ireland. Association for Computational Linguistics. Kurt Shuster, Spencer Poff, Moya Chen, Douwe Kiela,
and Jason Weston. 2021. Retrieval augmentation
Alisa Liu and Jiacheng Liu. 2023. The memo- reduces hallucination in conversation. arXiv preprint
trap dataset. https://github.com/liujch1998/ arXiv:2104.07567.
memo-trap.
Maxim Tkachenko, Mikhail Malyuk, Andrey
Shayne Longpre, Kartik Perisetla, Anthony Chen, Holmanyuk, and Nikolai Liubimov. 2020-
Nikhil Ramesh, Chris DuBois, and Sameer Singh. 2022. Label Studio: Data labeling soft-
2021. Entity-based knowledge conflicts in question ware. Open source software available from
answering. In Proceedings of the 2021 Conference https://github.com/heartexlabs/label-studio.
on Empirical Methods in Natural Language Process-
ing, pages 7052–7063, Online and Punta Cana, Do- Hugo Touvron, Louis Martin, Kevin Stone, Peter Al-
minican Republic. Association for Computational bert, Amjad Almahairi, Yasmine Babaei, Nikolay
Linguistics. Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti
Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ruiyang Ren, Yifan Li, Xinyu Tang, Zikang Liu,
Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Peiyu Liu, Jian-Yun Nie, and Ji-Rong Wen. 2023. A
Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, survey of large language models.
Cynthia Gao, Vedanuj Goswami, Naman Goyal, An-
thony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan
Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin,
Isabel Kloumann, Artem Korenev, Punit Singh Koura, Zhuohan Li, Dacheng Li, Eric. P Xing, Hao Zhang,
Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Di- Joseph E. Gonzalez, and Ion Stoica. 2023. Judging
ana Liskovich, Yinghai Lu, Yuning Mao, Xavier Mar- llm-as-a-judge with mt-bench and chatbot arena.
tinet, Todor Mihaylov, Pushkar Mishra, Igor Moly-
bog, Yixin Nie, Andrew Poulton, Jeremy Reizen- Ming Zhong, Da Yin, Tao Yu, Ahmad Zaidi, Mutethia
stein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Mutuma, Rahul Jha, Ahmed Hassan Awadallah, Asli
Ruan Silva, Eric Michael Smith, Ranjan Subrama- Celikyilmaz, Yang Liu, Xipeng Qiu, and Dragomir
nian, Xiaoqing Ellen Tan, Binh Tang, Ross Tay- Radev. 2021. QMSum: A new benchmark for query-
lor, Adina Williams, Jian Xiang Kuan, Puxin Xu, based multi-domain meeting summarization. In Pro-
Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, ceedings of the 2021 Conference of the North Amer-
Melanie Kambadur, Sharan Narang, Aurelien Ro- ican Chapter of the Association for Computational
driguez, Robert Stojnic, Sergey Edunov, and Thomas Linguistics: Human Language Technologies, pages
Scialom. 2023. Llama 2: Open foundation and fine- 5905–5921, Online. Association for Computational
tuned chat models. Linguistics.

Neeraj Varshney, Wenlin Yao, Hongming Zhang, Jian- Chunting Zhou, Pengfei Liu, Puxin Xu, Srini Iyer, Jiao
shu Chen, and Dong Yu. 2023. A stitch in time saves Sun, Yuning Mao, Xuezhe Ma, Avia Efrat, Ping Yu,
nine: Detecting and mitigating hallucinations of llms Lili Yu, Susan Zhang, Gargi Ghosh, Mike Lewis,
by validating low-confidence generation. Luke Zettlemoyer, and Omer Levy. 2023. Lima: Less
is more for alignment.
Tu Vu, Mohit Iyyer, Xuezhi Wang, Noah Constant, Jerry
Wei, Jason Wei, Chris Tar, Yun-Hsuan Sung, Denny
Zhou, Quoc Le, and Thang Luong. 2023. Freshllms:
Refreshing large language models with search engine
augmentation.

Alex Wang, Kyunghyun Cho, and Mike Lewis. 2020.

Asking and answering questions to evaluate the fac-
tual consistency of summaries.

Yijun Xiao and William Yang Wang. 2021. On hal-

lucination and predictive uncertainty in conditional
language generation.

Miao Xiong, Zhiyuan Hu, Xinyang Lu, Yifei Li, Jie

Fu, Junxian He, and Bryan Hooi. 2023. Can llms
express their uncertainty? an empirical evaluation of
confidence elicitation in llms.

Yelp. 2021. Yelp open dataset. https://www.yelp.

com/dataset. Accessed: 2023-11-03.

Hanning Zhang, Shizhe Diao, Yong Lin, Yi R Fung,

Qing Lian, Xingyao Wang, Yangyi Chen, Heng Ji,
and Tong Zhang. 2023a. R-tuning: Teaching large
language models to refuse unknown questions. arXiv
preprint arXiv:2311.09677.

Yue Zhang, Yafu Li, Leyang Cui, Deng Cai, Lemao Liu,
Tingchen Fu, Xinting Huang, Enbo Zhao, Yu Zhang,
Yulong Chen, Longyue Wang, Anh Tuan Luu, Wei
Bi, Freda Shi, Shuming Shi, and Tencent AI Lab.
2023b. Siren’s Song in the AI Ocean: A Survey on
Hallucination in Large Language Models.

Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang,

Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen
Zhang, Junjie Zhang, Zican Dong, Yifan Du, Chen
Yang, Yushuo Chen, Zhipeng Chen, Jinhao Jiang,
A Hallucination Examples

E VIDENT C ONFLICT
Example 1
Original content: They’ve sent about 500 of the most seriously injured people for treatment in Kathmandu already – the majority
by road.
Generative content: The team airlifted 500 injured people. . . to Kathmandu for medical treatment.

Example 2
Original content: "RestaurantsReservations": false
Generative content: The restaurant offers outdoor seating and takes reservations
S UBTLE C ONFLICT
Example 1
Original content: This is the section that lets the government deny trademark protection to a mark that is "immoral," "scandalous"
or "disparaging."
Generative content: . . . violates Section 2(a) of the trademark act, which prohibits marks that are "immoral, scandalous, or
disqualified."

Example 2
Original content: It was quite a hassle to find a table since this place is crowded all the time, but well, nothing can stand between
me and that absolutely delightful lobster roll
Generative content: The restaurant is located in a crowded area, but the delicious lobster rolls make it worth the hassle to find a
table.
E VIDENT I NTRODUCTION OF BASELESS I NFORMATION
Example 1
Original content: The couple divorced in 1968. . . She had no contact with the surviving members of the band until meeting up at
the 2006 Las Vegas premiere of "The Beatles Love."
Generative content: The couple divorced in 1968, but later reconciled and attended the 2006 Las Vegas premiere of "The Beatles
Love."

Example 2
Original content: “RestaurantsTakeOut”: true
Generative content: Customers can also enjoy takeout and delivery options.
S UBTLE I NTRODUCTION OF BASELESS I NFORMATION
Example 1
Original content: Amazing happy hour and great people.
Generative content: One of the standout features of Killer Shrimp is their happy hour, which is available daily from 4 pm to 7 pm.

Example 2
Original content: Place mashed potatoes in a medium bowl. Stir in onions, cheese, salt, pepper, and bacon.
Generative content: Place your mashed potatoes in a medium bowl. Stir in cooked onions, cheese, salt, pepper, and bacon.

Table 8: Examples of the four types of hallucinations.

B Response Generation Prompts

Q UESTION ANSWERING
Answer the following question:
{question}
Bear in mind that your response should be strictly based on the following 3 passages:
{passages}
In case the passages do not contain the necessary information to answer the question, please reply with: "Unable to answer based
on given passages."
DATA - TO - TEXT W RITING
Instruction:
Write an objective overview about the following local business based only on the provided structured data in the JSON format.
You should include details and cover the information mentioned in the customers’ review. The overview should be 100 - 200
words. Don’t make up information.
Structured data:
{json_data}
Overview:
S UMMARIZATION
Summarize the following news within {word_num} words:
{news}
output:

Table 9: Prompts for generating responses for the three types of tasks. word_num is min(200,
word_num_of_news//4). The word count requirement is only to control the length of the generated summa-
rization, it will not serve as the basis for hallucination annotation.
C Annotation Details

Figure 5: Annotation interface. For privacy reasons, we have masked the full names of the annotators in the
screenshot.
implicit_true due_to_null
Task Model # Hallucination Span
# Span % Span # Span % Span
GPT-3.5-turbo-0613 89 33 0.371
GPT-4-0613 51 15 0.294
Llama-2-7B-chat 1010 251 0.249
Question Answering
Llama-2-13B-chat 654 215 0.329
Llama-2-70B-chat 529 168 0.318
Mistral-7B-Instruct 594 164 0.276
GPT-3.5-turbo-0613 384 52 0.135 69 0.180
GPT-4-0613 354 24 0.068 209 0.590
Llama-2-7B-chat 1775 195 0.110 230 0.130
Data-to-text Writing
Llama-2-13B-chat 2803 260 0.09 439 0.157
Llama-2-70B-chat 1834 274 0.149 272 0.148
Mistral-7B-Instruct 2140 102 0.048 423 0.198
GPT-3.5-turbo-0613 60 14 0.233
GPT-4-0613 80 10 0.125
Llama-2-7B-chat 517 44 0.085
Summarization
Llama-2-13B-chat 342 28 0.082
Llama-2-70B-chat 245 27 0.110
Mistral-7B-Instruct 828 52 0.063
Overall 14289 1928 0.135 1642 0.115

Table 10: Detailed statistical information for the labels implicit_true and due_to_null. The majority of implicit
truths appear in two types of tasks: question answering and data-to-text writing. About 17.7% hallucination spans
in the data-to-text writing tasks are related to null values in the JSON data.
D Hallucination Detection Prompts

S UMMARIZATION
Below is the original news:
{article}
Below is a summary of the news:
{summary}
Your task is to determine whether the summary contains either or both of the following two types of hallucinations:
1. conflict: instances where the summary presents direct contraction or opposition to the original news;
2. baseless info: instances where the generated summary includes information which is not substantiated by or inferred from the
original news.
Then, compile the labeled hallucinated spans into a JSON dict, with a key "hallucination list" and its value is a list of
hallucinated spans. If there exist potential hallucinations, the output should be in the following JSON format: {"hallucination
list": [hallucination span1, hallucination span2, ...]}. Otherwise, leave the value as a empty list as following: {"hallucination
list": []}.
Output:
Q UESTION ANSWERING
Below is a question:
{question}
Below are related passages:
{passages}
Below is an answer:
{answer}
Your task is to determine whether the answer contains either or both of the following two types of hallucinations:
1. conflict: instances where the answer presents direct contraction or opposition to the passages;
2. baseless info: instances where the answer includes information which is not substantiated by or inferred from the passages.
Then, compile the labeled hallucinated spans into a JSON dict, with a key "hallucination list" and its value is a list of
hallucinated spans. If there exist potential hallucinations, the output should be in the following JSON format: {"hallucination
list": [hallucination span1, hallucination span2, ...]}. Otherwise, leave the value as a empty list as following: {"hallucination
list": []}.
Output:
DATA - TO - TEXT W RITING
Below is a structured data in the JSON format:
{business info}
Below is an overview article written in accordance with the structured data:
{overview}
Your task is to determine whether the overview contains either or both of the following two types of hallucinations:
1. conflict: instances where the overview presents direct contraction or opposition to the structured data;
2. baseless info: instances where the generated overview includes information which is not substantiated by or inferred from the
structured data.
In JSON, "null" or "None" represents an unknown value rather than a negation.
Then, compile the labeled hallucinated spans into a JSON dict, with a key "hallucination list" and its value is a list of
hallucinated spans. If there exist potential hallucinations, the output should be in the following JSON format: {"hallucination
list": [hallucination span1, hallucination span2, ...]}. Otherwise, leave the value as a empty list as following: {"hallucination
list": []}.
Output:

Table 11: Prompts for detecting hallucination for the three types of tasks. In the prompt for data-to-text writing, we
clarified that null or None in JSON should be treated as unknown rather than a negation.

Immediate download (Ebook) Generative AI in Action (MEAP V02) by Amit Bahree ISBN 9781633436947, 1633436942 ebooks 2024
100% (2)
Immediate download (Ebook) Generative AI in Action (MEAP V02) by Amit Bahree ISBN 9781633436947, 1633436942 ebooks 2024
81 pages
MindStudio Documentation
No ratings yet
MindStudio Documentation
78 pages
Ai Tools PDF
No ratings yet
Ai Tools PDF
273 pages
ChatGPT Broke The Turing Test
No ratings yet
ChatGPT Broke The Turing Test
16 pages
RAG-HAT - A Hallucination-Aware Tuning Pipeline For LLM in Retrieval-Augmented Generation
No ratings yet
RAG-HAT - A Hallucination-Aware Tuning Pipeline For LLM in Retrieval-Augmented Generation
11 pages
Luna: An Evaluation Foundation Model To Catch Language Model Hallucinations With High Accuracy and Low Cost
No ratings yet
Luna: An Evaluation Foundation Model To Catch Language Model Hallucinations With High Accuracy and Low Cost
13 pages
2023 Findingsemnlp 59
No ratings yet
2023 Findingsemnlp 59
14 pages
Towards Reliable Medical Question Answering: Techniques and Challenges in Mitigating Hallucinations in Language Models
No ratings yet
Towards Reliable Medical Question Answering: Techniques and Challenges in Mitigating Hallucinations in Language Models
9 pages
2408.15533v2-1
No ratings yet
2408.15533v2-1
12 pages
A Survey On Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions
No ratings yet
A Survey On Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions
49 pages
102_Rethinking_Hallucinations_
No ratings yet
102_Rethinking_Hallucinations_
18 pages
2311.05232v2
No ratings yet
2311.05232v2
58 pages
Cognitive Mirage A Review of Hallucinations in Large Language Models
No ratings yet
Cognitive Mirage A Review of Hallucinations in Large Language Models
21 pages
Hallucibot: Is There No Such Thing As A Bad Question?: William Watson Nicole Cho
No ratings yet
Hallucibot: Is There No Such Thing As A Bad Question?: William Watson Nicole Cho
26 pages
2412.05223v1
No ratings yet
2412.05223v1
12 pages
Detecting Hallucinations in Large Language Models Using Semantic Entropy
No ratings yet
Detecting Hallucinations in Large Language Models Using Semantic Entropy
12 pages
Hallucination Reduction in Large Language Models With Retrieval-Augmented Generation Using Wikipedia Knowledge
No ratings yet
Hallucination Reduction in Large Language Models With Retrieval-Augmented Generation Using Wikipedia Knowledge
16 pages
RAG幻觉抑制
No ratings yet
RAG幻觉抑制
12 pages
Auto Hall
No ratings yet
Auto Hall
13 pages
A Comprehensive Survey of Hallucination Mitigation Techniques in Large Language Models
No ratings yet
A Comprehensive Survey of Hallucination Mitigation Techniques in Large Language Models
19 pages
2502.12769v1
No ratings yet
2502.12769v1
18 pages
LLM Hallucination Reasoning With Zero-Shot Knowledge Test: Seongmin@gatech - Edu Hsiang - Hsu, Richard - Cf.chen
No ratings yet
LLM Hallucination Reasoning With Zero-Shot Knowledge Test: Seongmin@gatech - Edu Hsiang - Hsu, Richard - Cf.chen
12 pages
Hallucination Is Inevitable - An Innate Limitation of Large Language Models
No ratings yet
Hallucination Is Inevitable - An Innate Limitation of Large Language Models
26 pages
Trapping LLM "Hallucinations" Using Tagged Context Prompts: Philip Feldman, James R. Foulds, and Shimei Pan
No ratings yet
Trapping LLM "Hallucinations" Using Tagged Context Prompts: Philip Feldman, James R. Foulds, and Shimei Pan
14 pages
controlling-large-language-model-hallucination-based-on-agent-ai-with-lang-graph
No ratings yet
controlling-large-language-model-hallucination-based-on-agent-ai-with-lang-graph
7 pages
Scaling_Transformer_Paradigms__A_Robust_Framework_for_Hallucination_Detection_in_Resource_Limited_NLP_Systems
No ratings yet
Scaling_Transformer_Paradigms__A_Robust_Framework_for_Hallucination_Detection_in_Resource_Limited_NLP_Systems
10 pages
LLM Hallucinations in Practical Code Generation
No ratings yet
LLM Hallucinations in Practical Code Generation
13 pages
2410.18270v1
No ratings yet
2410.18270v1
16 pages
2502.17125v1
No ratings yet
2502.17125v1
8 pages
Hallucination Ai Rootcauses
No ratings yet
Hallucination Ai Rootcauses
4 pages
2404.18930v1
No ratings yet
2404.18930v1
30 pages
86583dacea8d2d7ff5b3b0637afcb2cb34c3
No ratings yet
86583dacea8d2d7ff5b3b0637afcb2cb34c3
12 pages
The troubling emergence of hallucination in large language models
No ratings yet
The troubling emergence of hallucination in large language models
33 pages
Mit
No ratings yet
Mit
19 pages
Large Legal Fictions
No ratings yet
Large Legal Fictions
37 pages
Anshika AI Intern Assignment
No ratings yet
Anshika AI Intern Assignment
4 pages
3613904.3642428
No ratings yet
3613904.3642428
13 pages
2501.03995
No ratings yet
2501.03995
8 pages
Hallucinations 2312.17249
No ratings yet
Hallucinations 2312.17249
19 pages
LLM hallucinations
No ratings yet
LLM hallucinations
6 pages
2407.17468v1
No ratings yet
2407.17468v1
16 pages
ReDeEP Detecting Hallucination in Retrieval-Augmen
No ratings yet
ReDeEP Detecting Hallucination in Retrieval-Augmen
23 pages
2023 Findings-Emnlp 764
No ratings yet
2023 Findings-Emnlp 764
15 pages
Benchmarking Hallucination in Large Language Models based on Unanswerable Math Word Problem
No ratings yet
Benchmarking Hallucination in Large Language Models based on Unanswerable Math Word Problem
11 pages
NLP810_Project Proposal Report_Muhammad Cendekia Airlangga
No ratings yet
NLP810_Project Proposal Report_Muhammad Cendekia Airlangga
3 pages
2406.16338v1
No ratings yet
2406.16338v1
34 pages
2024 09 27 24314506 Full
No ratings yet
2024 09 27 24314506 Full
19 pages
LLMs Know More Than They Show - On The Intrinsic Representation of LLM Hallucinations
No ratings yet
LLMs Know More Than They Show - On The Intrinsic Representation of LLM Hallucinations
31 pages
A stitch in time
No ratings yet
A stitch in time
23 pages
Trustful LLMS: Customizing and Grounding Text Generation With Knowledge Bases and Dual Decoders
No ratings yet
Trustful LLMS: Customizing and Grounding Text Generation With Knowledge Bases and Dual Decoders
11 pages
A Review On Debiasing and Dehallucinating in Large Language Models
No ratings yet
A Review On Debiasing and Dehallucinating in Large Language Models
50 pages
2503.05777v1
No ratings yet
2503.05777v1
95 pages
Do Language Models Know When They're Hallucinating References?
No ratings yet
Do Language Models Know When They're Hallucinating References?
18 pages
230513534 v 1
No ratings yet
230513534 v 1
13 pages
Hallucinations_in_LLMs_Understanding_and_Addressing_Challenges
No ratings yet
Hallucinations_in_LLMs_Understanding_and_Addressing_Challenges
5 pages
2022acl_long236
No ratings yet
2022acl_long236
15 pages
2022 Acl-Long 464
No ratings yet
2022 Acl-Long 464
15 pages
Honest AI Fine-Tuning Small Language Models To Say I Don't Know, and Reducing Hallucination in RAG
No ratings yet
Honest AI Fine-Tuning Small Language Models To Say I Don't Know, and Reducing Hallucination in RAG
8 pages
Sources of Hallucination by Large Language Models On Inference Tasks
No ratings yet
Sources of Hallucination by Large Language Models On Inference Tasks
17 pages
A Survey On Retrieval-Augmented Text Generation For Large Language Models
No ratings yet
A Survey On Retrieval-Augmented Text Generation For Large Language Models
18 pages
hallucination by gpt
No ratings yet
hallucination by gpt
2 pages
SLM Meets LLM
No ratings yet
SLM Meets LLM
11 pages
2504.07640v1
No ratings yet
2504.07640v1
11 pages
The Uplift Project: Enhancing and Propagating Intelligence and Longevity
From Everand
The Uplift Project: Enhancing and Propagating Intelligence and Longevity
R. Robert Holson
No ratings yet
Declarative_Machine_Learning_Systems
No ratings yet
Declarative_Machine_Learning_Systems
11 pages
Tips for Writing Technical Papers
No ratings yet
Tips for Writing Technical Papers
1 page
complexity_vs_users
No ratings yet
complexity_vs_users
1 page
Relational_Deep_Learningpdf
No ratings yet
Relational_Deep_Learningpdf
29 pages
Social Dilemma' Star Tristan Harris Responds To Criticisms of The Film, Netflix's Algorithm, and More
No ratings yet
Social Dilemma' Star Tristan Harris Responds To Criticisms of The Film, Netflix's Algorithm, and More
3 pages
Generative AI Tools - Attempt Review
No ratings yet
Generative AI Tools - Attempt Review
4 pages
The Impact of ChatGPT On Human Society
No ratings yet
The Impact of ChatGPT On Human Society
7 pages
Hunyuan_DiT_Tech_Report_05140553
No ratings yet
Hunyuan_DiT_Tech_Report_05140553
25 pages
2407.17788v1
No ratings yet
2407.17788v1
12 pages
Technical Interview Task
No ratings yet
Technical Interview Task
3 pages
Artificial-Analysis-State-of-AI-Q1-2025
No ratings yet
Artificial-Analysis-State-of-AI-Q1-2025
29 pages
1-s2.0-S1544612323011583-main
No ratings yet
1-s2.0-S1544612323011583-main
8 pages
Spring 2305.15486
No ratings yet
Spring 2305.15486
305 pages
usenixsecurity24-deng
No ratings yet
usenixsecurity24-deng
19 pages
Llava Med
No ratings yet
Llava Med
17 pages
Ai Detector
No ratings yet
Ai Detector
1 page
Scalexm - Ai: A Compact Guide To Large Language Models
No ratings yet
Scalexm - Ai: A Compact Guide To Large Language Models
9 pages
ChatGPT and Prompt Injection Attacks
No ratings yet
ChatGPT and Prompt Injection Attacks
20 pages
Chat GPT Bible - Lawyers and Legal Professionals Special Edition (Lucas Foster)
No ratings yet
Chat GPT Bible - Lawyers and Legal Professionals Special Edition (Lucas Foster)
114 pages
ChatGPT for Java: A Hands-on Developer’s Guide to ChatGPT and Open AI APIs 1st Edition Bruce Hopkins download
100% (1)
ChatGPT for Java: A Hands-on Developer’s Guide to ChatGPT and Open AI APIs 1st Edition Bruce Hopkins download
78 pages
Improving The Capabilities of Large Language Model Based Marketing Analytics Copilots With Semantic Search and Fine-Tuning
No ratings yet
Improving The Capabilities of Large Language Model Based Marketing Analytics Copilots With Semantic Search and Fine-Tuning
17 pages
1-Aydin, Ö. & Karaarslan E. 2023.
No ratings yet
1-Aydin, Ö. & Karaarslan E. 2023.
17 pages
Emerging Technology
No ratings yet
Emerging Technology
32 pages
50 Examples of How Brands Are Using AI Plus AI Survey - Sweathead
No ratings yet
50 Examples of How Brands Are Using AI Plus AI Survey - Sweathead
86 pages
Medical Graph RAG: Towards Safe Medical Large Language Model Via Graph Retrieval-Augmented Generation
No ratings yet
Medical Graph RAG: Towards Safe Medical Large Language Model Via Graph Retrieval-Augmented Generation
10 pages
2024-03-AUSTRALIA NATIONAL AGENCY-AI-FOUNDATION-MODELS
No ratings yet
2024-03-AUSTRALIA NATIONAL AGENCY-AI-FOUNDATION-MODELS
36 pages
cj
No ratings yet
cj
21 pages
FELIX: Automatic and Interpretable Feature Engineering Using LLMs
No ratings yet
FELIX: Automatic and Interpretable Feature Engineering Using LLMs
17 pages
Glyphtic-com-Ultimate-ChatGPT-Mastery-Guide-Inc-2dffb5fdf0fa44718de0edd320a3e9c0
No ratings yet
Glyphtic-com-Ultimate-ChatGPT-Mastery-Guide-Inc-2dffb5fdf0fa44718de0edd320a3e9c0
17 pages
The AI Revolution in Medicine: GPT-4 and Beyond Peter Lee - Quickly download the ebook to read anytime, anywhere
No ratings yet
The AI Revolution in Medicine: GPT-4 and Beyond Peter Lee - Quickly download the ebook to read anytime, anywhere
59 pages
Navigating The Jagged Technological Frontier - Field Experimental Evidence of The Effects of AI On Knowledge Worker Productivity and Quality
No ratings yet
Navigating The Jagged Technological Frontier - Field Experimental Evidence of The Effects of AI On Knowledge Worker Productivity and Quality
58 pages