Hallucination Reduction in Large Language Models With Retrieval-Augmented Generation Using Wikipedia Knowledge
Abstract
Natural language understanding and generation have seen great progress, yet
the persistent issue of hallucination undermines the reliability of model outputs.
Introducing retrieval-augmented generation (RAG) with external knowledge
sources, such as Wikipedia, presents a novel and significant approach to enhancing
factual accuracy and coherence in generated content. By dynamically integrating
relevant information, the Mistral model demonstrates substantial improvements
in precision, recall, and overall quality of responses. This research offers a robust
framework for mitigating hallucinations, providing valuable insights for deploying
reliable AI systems in critical applications. The comprehensive evaluation under-
scores the potential of RAG to advance the performance and trustworthiness of
large language models.
1 Introduction
Large language models (LLMs) have made significant strides in natural lan-
guage understanding and generation, transforming various applications ranging from
machine translation to conversational agents. These models have demonstrated
remarkable capabilities in processing and generating human-like text, which has facil-
itated their adoption in a multitude of sectors. However, a persistent challenge that
undermines their reliability and trustworthiness is the phenomenon of hallucination,
where models generate information that is factually incorrect or fabricated. This issue
arises due to the probabilistic nature of their predictions and the inherent limitations
within their training datasets, which often contain biases or incomplete information.
This problem is particularly concerning in domains requiring high factual accuracy
and trust, such as healthcare, law, and education. In these fields, the dissemination
of incorrect information can lead to severe consequences, underscoring the need for
robust mechanisms to enhance the accuracy of LLM outputs. Addressing hallucina-
tion in LLMs is crucial for advancing their utility and ensuring their safe deployment
in critical applications, thereby fostering greater confidence in their use.
1.1 Background
Large language models, exemplified by the Mistral model, leverage vast amounts of
textual data to learn linguistic patterns and generate coherent and contextually rel-
evant responses. These models operate by predicting the next word in a sequence,
which enables them to produce fluid and natural-sounding text. Despite their remark-
able capabilities, LLMs are prone to generating hallucinated content, which arises
due to the probabilistic nature of their predictions and the limitations of their train-
ing data. The training datasets often encompass a broad range of topics and sources,
but they may not always provide the depth or accuracy required for specific queries.
The Mistral model, being an open-source large language model, presents an ideal can-
didate for exploring methods to mitigate hallucination. By augmenting LLMs with
external knowledge sources, such as Wikipedia, researchers aim to enhance the fac-
tual accuracy of generated outputs, thereby reducing the incidence of hallucinations.
Wikipedia’s extensive and continually updated repository of information serves as a
valuable resource, enabling the model to cross-reference and verify the information it
generates, thus improving its reliability.
1.2 Motivation
The motivation behind employing retrieval-augmented generation (RAG) with
Wikipedia to address hallucination in LLMs stems from the need to improve the
factual reliability of model outputs. LLMs, despite their advanced capabilities, can
still produce outputs that are not grounded in reality, leading to misinformation.
Wikipedia, as a comprehensive and frequently updated repository of human knowl-
edge, provides a valuable resource for supplementing the model’s internal knowledge
with verifiable information. This integration allows for the retrieval of contextually
relevant and accurate data, which the model can then use to generate more factu-
ally grounded responses. This approach not only mitigates the risk of hallucination
but also enhances the overall quality and trustworthiness of the model’s outputs. By
ensuring that the information generated is backed by a reliable source, the model’s
credibility is significantly bolstered. Furthermore, the use of RAG helps in bridging
the gap between the model’s training data and real-world knowledge, making it more
adaptable and accurate in diverse scenarios.
1.3 Contributions
The research presented in this paper offers several key contributions to the field of nat-
ural language processing and the ongoing efforts to improve the reliability of LLMs.
Firstly, the implementation of a RAG framework within the Mistral model demon-
strates the practical feasibility and effectiveness of using external knowledge sources
to reduce hallucination. This implementation showcases the potential for RAG to be
integrated into existing models, paving the way for broader applications and improve-
ments. Secondly, the use of Wikipedia as a knowledge base exemplifies how freely
available and authoritative resources can be leveraged to enhance the factual accu-
racy of LLMs. Wikipedia’s extensive database provides a rich source of information
that can be dynamically accessed and utilized by the model, ensuring that its out-
puts are based on up-to-date and accurate data. Thirdly, the experimental results
provide valuable insights into the quantitative and qualitative benefits of RAG, offer-
ing a robust evaluation of its impact on reducing hallucination. The results highlight
the improvements in the model’s performance, demonstrating the tangible benefits of
integrating RAG. Lastly, the research outlines potential future directions for further
refining RAG techniques and exploring their applicability across different models and
knowledge domains. These future directions include optimizing the retrieval mech-
anisms, expanding the range of knowledge bases used, and testing the approach in
various practical applications to fully understand its capabilities and limitations.
2 Related Work
This section reviews the literature on hallucination mitigation in LLMs and on retrieval-augmented generation.
Hallucination in LLMs has been a persistent challenge, leading to the development
of various techniques aimed at mitigating this issue. Approaches focusing on enhancing
the training datasets to ensure greater factual accuracy were often employed to reduce
the likelihood of generating incorrect information [1, 2]. Implementing constraint-based
generation methods achieved more controlled outputs by enforcing specific rules dur-
ing the text generation process [3]. The utilization of reinforcement learning techniques
enabled models to receive feedback based on the factual correctness of their out-
puts, thereby iteratively improving their reliability [4]. Efforts to incorporate external
knowledge bases into the training process allowed models to access verified informa-
tion, significantly decreasing the frequency of hallucinated content [5, 6]. Utilizing
fact-checking algorithms during post-processing stages provided an additional layer
of verification, ensuring the outputs align with known facts [7, 8]. Adaptive learning
rates and regularization techniques were applied to maintain a balance between
creativity and factuality in generated responses [9]. Incorporating context-aware gen-
eration techniques achieved better alignment with real-world knowledge and culture
by dynamically adjusting responses based on the input context [10, 11]. Employing
multi-task learning strategies facilitated the simultaneous training of LLMs on various
datasets, enhancing their ability to discern factual information from unreliable sources
[12–14]. Implementing domain-specific fine-tuning protocols allowed models to special-
ize in particular areas, further reducing the incidence of hallucination by leveraging
specialized knowledge [15]. Techniques leveraging adversarial training setups improved
model robustness by exposing them to scenarios designed to trigger hallucination, thus
enabling them to learn how to avoid generating inaccurate information [16].
Retrieval-augmented generation (RAG) has been widely recognized for its potential
to enhance the factual accuracy of LLM outputs by integrating real-time information
retrieval with text generation. Implementations of RAG combined a retrieval module
with generative models, enabling the incorporation of relevant external data into the
response generation process [17, 18]. Leveraging extensive databases allowed models
to access up-to-date information, significantly enhancing the relevance and accuracy
of generated content [19]. The integration of sophisticated search algorithms within
the retrieval module facilitated the identification of the most pertinent information,
improving the quality of responses [20, 21]. Utilizing indexing techniques optimized
the retrieval process, ensuring rapid access to large volumes of data [22, 23]. Fine-
tuning generative models on retrieved datasets ensured better alignment between
the model’s internal knowledge and external information [24]. Adopting hybrid mod-
els that blend retrieval and generation components achieved a seamless integration
of factual data into the text generation process [25]. The application of attention
mechanisms within RAG setups improved the model’s ability to focus on the most
relevant information, thereby enhancing output accuracy [26]. Employing dynamic
retrieval strategies enabled models to adapt to varying contexts, ensuring the retrieval
of the most contextually appropriate information [27]. Combining RAG with natu-
ral language understanding tasks facilitated more coherent and contextually accurate
responses [28–30]. Techniques optimizing the synergy between retrieval and generation
processes enhanced the overall efficiency and effectiveness of RAG implementations,
making them a robust solution for mitigating hallucination in LLMs [31, 32].
3 Methods
This section provides a detailed explanation of the methodology employed in this
research, encompassing the model architecture, data collection process, RAG imple-
mentation, training pipeline, and evaluation metrics.
The following figure illustrates the sophisticated model architecture of the modi-
fied Mistral model, highlighting the interaction between the retrieval and generative
components:
[Figure: transformer architecture of the modified Mistral model, showing the input query flowing through encoder and decoder stacks of transformer layers.]
The figure depicts the flow of information within the modified Mistral model. The
input query is first processed by the encoder, which comprises several layers of trans-
former architecture designed to capture intricate linguistic patterns. The encoder’s
output is then fed into the retrieval module, which dynamically fetches relevant infor-
mation from Wikipedia. This retrieved information is subsequently integrated into the
decoder, where it is used to inform the text generation process. The final output is a
generated response that leverages both the internal knowledge of the model and the
external data retrieved from Wikipedia, thereby enhancing factual accuracy and reduc-
ing the likelihood of hallucinations. This seamless integration between the retrieval
and generative components is pivotal for ensuring the reliability and trustworthiness
of the generated content.
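The encoder-retrieval-decoder flow described above can be sketched end to end in a few lines. The encoder, retriever, and generator below are toy stand-ins (bag-of-words encoding with token-overlap scoring over a two-passage corpus), assumed purely for illustration; they are not the Mistral model or a real Wikipedia index.

```python
# Toy sketch of the pipeline: encode the query, retrieve the best-matching
# passage, and condition generation on the retrieved evidence.

def encode(query: str) -> set[str]:
    """Stand-in encoder: represent the query as a bag of lowercase tokens."""
    return set(query.lower().split())

def retrieve(query_repr: set[str], corpus: list[str], k: int = 1) -> list[str]:
    """Score each passage by token overlap with the query and return the top k."""
    scored = sorted(corpus,
                    key=lambda p: len(query_repr & set(p.lower().split())),
                    reverse=True)
    return scored[:k]

def generate(query: str, passages: list[str]) -> str:
    """Stand-in decoder: the answer is explicitly grounded in the evidence."""
    context = " ".join(passages)
    return f"Q: {query} | grounded in: {context}"

corpus = [
    "Paris is the capital of France.",
    "The mitochondrion is the powerhouse of the cell.",
]
query = "What is the capital of France?"
answer = generate(query, retrieve(encode(query), corpus))
print(answer)
```

In a real deployment the retriever would query an indexed Wikipedia snapshot and the generator would be the Mistral decoder, but the control flow is the same.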
The following table outlines the various sources and aspects of data collection,
including the specific pre-processing steps and the frequency of updates:
The table illustrates the meticulous data collection and pre-processing methodol-
ogy employed to ensure the reliability and accuracy of the dataset used by the RAG
framework. Each aspect of data collection is addressed with specific pre-processing
steps to maintain the integrity and relevance of the information. The tokenization step
involves splitting the text into tokens, enabling efficient processing and retrieval. Nor-
malization ensures that the text format is standardized, facilitating consistent data
handling. The removal of redundant and irrelevant content eliminates unnecessary
information, creating a clean and focused dataset. Regular updates, including real-
time incorporation of the latest edits, ensure that the dataset remains current and
reflective of the most recent knowledge. Indexing creates structured representations,
optimizing the retrieval process and enabling rapid access to pertinent information.
The comprehensive and systematic approach to data collection and pre-processing
underscores the importance of maintaining a robust and reliable knowledge base for
the RAG framework, ultimately enhancing the factual accuracy and reliability of the
generated content.
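The pre-processing steps above (normalization, tokenization, removal of redundant content, and indexing) can be sketched as follows; the function names and the inverted-index layout are illustrative assumptions rather than the paper's actual pipeline.

```python
# Minimal sketch of knowledge-base pre-processing: normalize text, drop exact
# duplicates, tokenize, and build an inverted index for fast retrieval.
import re
from collections import defaultdict

def normalize(text: str) -> str:
    """Standardize the text: lowercase and collapse whitespace."""
    return re.sub(r"\s+", " ", text.lower()).strip()

def tokenize(text: str) -> list[str]:
    """Split normalized text into word tokens."""
    return re.findall(r"[a-z0-9]+", text)

def build_index(passages: list[str]) -> dict[str, set[int]]:
    """Remove exact duplicates, then map each token to the passage ids containing it."""
    seen, index = set(), defaultdict(set)
    for i, passage in enumerate(passages):
        norm = normalize(passage)
        if norm in seen:          # skip redundant content
            continue
        seen.add(norm)
        for tok in tokenize(norm):
            index[tok].add(i)
    return index

index = build_index(["Paris is the capital of France.",
                     "paris  is the capital of France."])
print(sorted(index["paris"]))   # only the first copy of the duplicate survives
```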
The RAG implementation employs a dynamic attention mechanism that focuses on the most relevant retrieved information while generating responses. This approach ensures that the generated text is grounded in verifiable data, significantly reducing the likelihood of hallucinations.
The following algorithm outlines the detailed steps of the RAG implementation,
emphasizing the interaction between the retrieval and generation components:
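As an illustration of how such an algorithm can couple the two components, the sketch below weights retrieved passages with a softmax over relevance scores, loosely mirroring the dynamic attention mechanism discussed in this paper; the scoring function and weighting scheme are assumptions, not the authors' implementation.

```python
# Sketch of a retrieve-then-weight step: score passages against the query,
# turn the scores into softmax attention weights, and pick the best evidence.
import math

def relevance(query: str, passage: str) -> float:
    """Fraction of query tokens that appear in the passage (illustrative score)."""
    q, p = set(query.lower().split()), set(passage.lower().split())
    return len(q & p) / (len(q) or 1)

def attention_weights(scores: list[float]) -> list[float]:
    """Softmax over relevance scores, so weights sum to 1."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

passages = ["Paris is the capital of France.", "France is in Europe."]
query = "capital of France"
scores = [relevance(query, p) for p in passages]
weights = attention_weights(scores)
best = passages[max(range(len(weights)), key=weights.__getitem__)]
print(best)
```

In the full model these weights would modulate cross-attention inside the decoder rather than select a single passage, but the intuition carries over.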
The retrieval module fetches relevant passages from Wikipedia. The generation module is fine-tuned to balance the creativity and factuality of the responses, thereby reducing the likelihood of hallucinations and improving the model's reliability.
Quantitative metrics include precision, recall, and F1-score, which measure the accuracy of the retrieved information and its integration into the generated text. Additionally, the model's performance is evaluated using BLEU and ROUGE scores, which assess the fluency and coherence of the generated responses.
Qualitative metrics involve human evaluations of the text, focusing on the relevance,
factuality, and overall quality of the generated content. Together, the quantitative and qualitative metrics provide a robust evaluation framework, ensuring a thorough and nuanced assessment of the model's performance and of the RAG approach's effectiveness in reducing hallucinations.
Table 2 lists the novel metrics used to evaluate the effectiveness of the RAG
framework:
4 Experiments and Results
This section presents and discusses the experimental results, offering a comprehensive
comparison of the Mistral model’s performance with and without the integration of
retrieval-augmented generation (RAG). The results are illustrated through a series
of quantitative and qualitative analyses, using a variety of metrics and examples to
demonstrate the reduction in hallucinations and the enhancement of factual accuracy.
[Figure: comparison of evaluation scores (0 to 1 scale) for the Mistral model with and without RAG.]
Example generated responses from the Mistral model with and without RAG are presented and
analyzed to highlight the differences in factual accuracy and coherence.
The table above provides clear examples of how the integration of RAG enhances
the factual accuracy of the generated responses. Without RAG, the Mistral model
produces hallucinated content, such as incorrect capitals, genomic information, and
Nobel Prize details. In contrast, with the RAG framework, the model retrieves and
integrates accurate information, resulting in factually correct and coherent responses.
The experiments and results presented in this section underscore the significant
improvements achieved through the integration of the RAG framework into the Mis-
tral model. Quantitative metrics demonstrate substantial enhancements in precision,
recall, F1-score, BLEU, and ROUGE scores, while qualitative analyses provide con-
crete examples of the reduction in hallucination and the enhancement of factual
accuracy. These findings highlight the efficacy of the RAG framework in improving the
reliability and trustworthiness of the generated content, thereby ensuring the model’s
suitability for deployment in critical applications requiring high factual accuracy.
5 Discussion
This section provides an in-depth discussion of the results and their implications,
examining the effectiveness of retrieval-augmented generation (RAG) in reducing hal-
lucination, identifying the limitations encountered during the research, and suggesting
future research directions.
5.3 Broader Impact and Ethical Considerations
The broader impact of this research extends beyond technical advancements, encom-
passing important ethical considerations and societal implications. By reducing
hallucinations and enhancing the factual accuracy of LLM outputs, the integration
of RAG contributes to mitigating the risks associated with the dissemination of mis-
information and false information. This is particularly critical in domains where the
consequences of incorrect information can be severe, such as public health, legal advice,
and education. However, the dependency on external knowledge sources also raises
ethical concerns related to data quality, bias, and representativeness. Ensuring the
transparency and accountability of the retrieval and generation processes is essential to
address these concerns and build trust in the use of LLMs. Additionally, the computa-
tional resources required for implementing RAG highlight the need for sustainable and
efficient approaches to model development and deployment. Future research should
continue to explore the ethical implications of LLMs and develop guidelines and best
practices to ensure their responsible and equitable use.
5.4 Limitations
Despite the significant improvements achieved through the integration of RAG, sev-
eral limitations were encountered during the research. One of the primary challenges
involves the dependency on the quality and comprehensiveness of the external knowl-
edge sources. Although Wikipedia provides a vast repository of information, it is not
immune to inaccuracies and biases, which can affect the reliability of the retrieved
information. Additionally, the retrieval process, while sophisticated, is not infallible
and may occasionally fetch irrelevant or outdated data, impacting the quality of the
generated responses. The computational overhead associated with the dynamic atten-
tion mechanism and the retrieval process also presents a limitation, as it can increase
the latency and resource requirements of the model. Furthermore, the evaluation
metrics, although comprehensive, may not fully capture the nuances of human judg-
ment in assessing the factual accuracy and relevance of the generated content. These
limitations highlight the need for ongoing refinement and optimization of the RAG
framework to address these challenges and enhance the overall performance of the
model.
Incorporating techniques such as knowledge distillation and transfer learning may provide
additional improvements in the model's performance. Finally, conducting extensive user
studies and real-world evaluations can offer valuable insights into the practical applica-
tions and limitations of the RAG framework, guiding future research and development
efforts.
6 Conclusion
The research presented in this paper provides a comprehensive exploration of the
integration of retrieval-augmented generation (RAG) into the Mistral model, highlight-
ing the significant improvements in factual accuracy and coherence achieved through
this approach. The findings demonstrate that leveraging external knowledge sources,
such as Wikipedia, allows the model to dynamically retrieve and incorporate rele-
vant information, thereby substantially reducing the occurrence of hallucinations. This
advancement addresses one of the critical challenges in the field of natural language
processing, enhancing the reliability and trustworthiness of large language model out-
puts. Through a detailed analysis involving both quantitative and qualitative metrics,
the study showcases the enhanced performance of the Mistral model with the RAG
framework. The quantitative results, indicated by improved precision, recall, F1-score,
BLEU, and ROUGE scores, underscore the model’s ability to generate more accurate
and contextually appropriate responses. The qualitative examples further illustrate
the tangible benefits of the RAG integration, providing clear evidence of the model’s
improved capability to produce factually correct and coherent content. The integra-
tion of retrieval-augmented generation into the Mistral model represents a significant
advancement in the quest to enhance the factual accuracy and coherence of large
language model outputs. The research provides a robust framework for addressing
the challenges associated with hallucinations, paving the way for more reliable and
trustworthy AI-driven applications. The comprehensive analysis and promising results
highlight the potential of RAG to significantly improve the performance of large lan-
guage models, ensuring their effective and responsible deployment in various critical
fields.
References
[1] Guan, X., Liu, Y., Lin, H., Lu, Y., He, B., Han, X., Sun, L.: Mitigating large lan-
guage model hallucinations via autonomous knowledge graph-based retrofitting.
In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, pp.
18126–18134 (2024)
[2] Boztemir, Y., Çalışkan, N.: Analyzing and mitigating cultural hallucinations of
commercial language models in Turkish. Authorea Preprints (2024)
[3] Wong, H.-t., Yip, G.-l.: Improving generalization beyond training data with
compositional generalization in large language models (2024)
[4] Jovanovic, M., Voss, P.: Trends and challenges of real-time learning in large
language models: A critical review. arXiv preprint arXiv:2404.18311 (2024)
[6] Bae, Y.S., Kim, H.R., Kim, J.H.: Equipping Llama with Google Query API for
improved accuracy and reduced hallucination (2024)
[7] Ledaal, B.V.: Tsetlin machine for fake news detection: Enhancing accuracy and
reliability (2023)
[9] Caballero Hinojosa, A.: Exploring the power of large language models: News
intention detection using adaptive learning prompting (2023)
[10] Horst, R.: User simulation in task-oriented dialog systems based on large language
models via in-context learning (2024)
[11] McIntosh, T.R., Liu, T., Susnjak, T., Watters, P., Ng, A., Halgamuge, M.N.: A
culturally sensitive test to evaluate nuanced GPT hallucination. IEEE Transactions
on Artificial Intelligence (2023)
[12] Zhao, Z., Fan, W., Li, J., Liu, Y., Mei, X., Wang, Y., Wen, Z., Wang, F., Zhao,
X., Tang, J., et al.: Recommender systems in the era of large language models
(LLMs). IEEE Transactions on Knowledge and Data Engineering (2024)
[13] Gupta, H.: Instruction tuned models are quick learners with instruction equipped
data on downstream tasks (2023)
[15] Liuska, J.: Enhancing large language models for data analytics through domain-
specific context creation (2024)
[16] Greshake, K., Abdelnabi, S., Mishra, S., Endres, C., Holz, T., Fritz, M.: Not
what you’ve signed up for: Compromising real-world llm-integrated applications
with indirect prompt injection. In: Proceedings of the 16th ACM Workshop on
Artificial Intelligence and Security, pp. 79–90 (2023)
[17] Fazlija, G.: Toward optimising a retrieval augmented generation pipeline using
large language model (2024)
[18] Sticha, A.: Utilizing large language models for question answering in task-oriented
dialogues (2023)
[19] Menon, K.: Utilizing open-source AI to navigate and interpret technical docu-
ments: leveraging RAG models for enhanced analysis and solutions in product
documentation (2024)
[20] Muludi, K., Fitria, K.M., Triloka, J., et al.: Retrieval-augmented generation
approach: Document question answering using large language model. Interna-
tional Journal of Advanced Computer Science & Applications 15(3) (2024)
[21] Pichai, K.: A retrieval-augmented generation based large language model bench-
marked on a novel dataset. Journal of Student Research 12(4) (2023)
[22] Gao, Y., Xiong, Y., Gao, X., Jia, K., Pan, J., Bi, Y., Dai, Y., Sun, J., Wang,
H.: Retrieval-augmented generation for large language models: A survey. arXiv
preprint arXiv:2312.10997 (2023)
[24] Buchmann, R., Eder, J., Fill, H.-G., Frank, U., Karagiannis, D., Laurenzi, E.,
Mylopoulos, J., Plexousakis, D., Santos, M.Y.: Large language models: Expecta-
tions for semantics-driven systems engineering. Data & Knowledge Engineering,
102324 (2024)
[25] Xiong, X., Zheng, M.: Merging mixture of experts and retrieval augmented
generation for enhanced information retrieval and reasoning (2024)
[26] Borgeaud, S., Mensch, A., Hoffmann, J., Cai, T., Rutherford, E., Millican, K., Van
Den Driessche, G.B., Lespiau, J.-B., Damoc, B., Clark, A., et al.: Improving lan-
guage models by retrieving from trillions of tokens. In: International Conference
on Machine Learning, pp. 2206–2240 (2022). PMLR
[27] Bucur, M.: Exploring large language models and retrieval augmented generation
for automated form filling (2023)
[28] Liu, T.: Towards augmenting and evaluating large language models (2024)
[29] Yang, K.: Controlling long-form large language model outputs (2023)
[30] Yu, W.: Knowledge augmented methods for natural language processing and
beyond (2023)
[31] Ferri-Molla, I., Linares-Pellicer, J., Izquierdo-Domenech, J.: Virtual reality and
language models, a new frontier in learning (2024)