Long-Context LLMs Meet RAG: Overcoming Challenges For Long Inputs in RAG
Retrieval-augmented generation (RAG) empowers large language models (LLMs) to utilize external
knowledge sources. The increasing capacity of LLMs to process longer input sequences opens up avenues
for providing more retrieved information, potentially enhancing the quality of generated outputs. It
is plausible to assume that a larger retrieval set contains more relevant information (higher recall),
which might result in improved performance. However, our empirical findings demonstrate that for
many long-context LLMs, the quality of generated output initially improves but then
declines as the number of retrieved passages increases. This paper investigates this phenomenon,
identifying the detrimental impact of retrieved "hard negatives" as a key contributor. To mitigate this
and enhance the robustness of long-context LLM-based RAG, we propose both training-free and training-
based approaches. We first showcase the effectiveness of retrieval reordering as a simple yet powerful
training-free optimization. Furthermore, we explore training-based methods, specifically RAG-specific
implicit LLM fine-tuning and RAG-oriented fine-tuning with intermediate reasoning, demonstrating their
capacity for substantial performance gains. Finally, we conduct a systematic analysis of design choices
for these training-based methods, including data distribution, retriever selection, and training context
length.
1. Introduction
Retrieval-augmented generation (RAG) (Gao et al., 2023) empowers large language models (LLMs)
to utilize external information sources by selecting the most relevant pieces from a large corpus
(Zhao et al., 2023), thereby enhancing their effectiveness, customizability and efficiency in complex
problem-solving. RAG can also mitigate issues such as factual inaccuracies (Augenstein et al., 2023)
and hallucinations (Huang et al., 2023), which LLMs often exhibit when confronted with knowledge-
intensive tasks. RAG systems typically employ a retriever to identify relevant information from a
corpus, which is then presented in the context of an LLM as the generator.
Recent advances in computational resources and methodological innovations have enabled the
development of LLMs that support increasingly longer context (Dubey et al., 2024; Reid et al., 2024).
This even opens up new avenues for directly inputting entire corpora or knowledge bases into
the LLMs. Yet, this remains infeasible for very large corpora (e.g., Wikipedia) and can incur substantially higher
computational costs. Despite extensive research on RAG (Lee et al., 2024; Li et al., 2024; Xu et al.,
2023), the interplay with long-context LLMs, particularly how to design RAG systems that use
them effectively, remains under-explored. Existing works (Asai et al., 2024; Lin et al., 2024; Yoran
et al., 2024) propose tuning LLMs for RAG, but predominantly focus on a limited number of retrieved
passages (fewer than 10). Intuitively, longer context would allow for the inclusion of more retrieved
passages, leading to higher recall and potentially improved performance. However, our findings reveal
that this does not always hold true and highlight the need for a careful re-evaluation of standard RAG
designs when utilizing long-context LLMs. We demonstrate that achieving optimal performance in
such systems, and fully utilizing the opportunities provided by these LLMs, requires a holistic rethinking
and novel approaches that address their unique challenges.
This paper presents comprehensive analyses on long-context LLMs in RAG systems. Contrary to
the suggestions of previous work (Li et al., 2024; Xu et al., 2023), our research reveals that increasing
the number of retrieved passages does not consistently improve performance with long-context LLMs
(Section 3.1). Instead, we observe that the generative modeling performance initially increases and
then declines – simply providing more retrieved passages does not guarantee better outcomes. Using
stronger retrievers is also not a mitigation mechanism – indeed the performance degradation can
even be more severe with them. For a deeper understanding of the phenomenon, we conduct further
investigations, which reveal that increasing the number of retrieved passages can introduce irrelevant
information (“noise”) that misleads the LLM generation (Section 3.2). We also examine the impact
of “hard negatives” from different retrievers on the LLMs, and show that there are scenarios where the
hard negatives from stronger retrievers confuse the LLM generation even more than those
from weaker retrievers (Section 3.3).
To address the challenges identified in our analyses, we propose three methods, encompassing
both training-free and training-based approaches, to enhance the performance of long-context LLMs
in RAG applications: (1) Retrieval reordering: recognizing the "lost-in-the-middle" phenomenon
observed for long-context LLMs (Liu et al., 2024), we propose reordering retrieved documents based
on their retrieval scores. By prioritizing documents with higher scores at the beginning and end of
the input sequences, we guide the LLMs’ attention towards more relevant information and mitigate
the impact of hard negatives. (2) Implicit robustness fine-tuning: given that the ability to handle noisy
retrieved context is not explicitly acquired during standard LLM training, we propose tuning the
LLMs with data comprising queries and retrieved documents, including those with potential noise.
This encourages the LLMs to implicitly learn robustness to hard negatives. (3) Explicit relevance
fine-tuning: while the previous method implicitly enhances robustness, it does not explicitly teach
the LLMs to identify relevant documents. Therefore, we propose augmenting the LLM tuning with an
intermediate reasoning step, where the LLMs are trained to analyze the retrieved documents and
explicitly identify relevant information before generating the final output. This approach aims to
improve the LLMs’ ability to discern relevant information from noise within the retrieved context.
Overall, the main contributions can be summarized as follows:
• Systematic analysis of long-context RAG: we systematically analyze the use of long-context LLMs in
RAG systems, specifically examining the impact of retrieved "hard negatives" on performance.
• Novel methods for robust RAG: we propose three methods to improve the robustness of long-
context LLMs in RAG: (1) a training-free method based on retrieval reordering, (2) implicit
tuning for robustness to hard negatives and (3) explicit tuning with intermediate reasoning for
relevance identification. Overall, our proposed approaches show significant accuracy and robustness
improvements in long-context RAG performance.
• Comprehensive study of RAG-specific LLM tuning: we conduct a thorough investigation into
various factors influencing the effectiveness of RAG-specific tuning, including data distribution, the
employed retriever, and training context length.
2. Related Work
Large language models (LLMs) can be prone to hallucinations, especially on knowledge-intensive tasks
(Augenstein et al., 2023; Huang et al., 2023; Zhao et al., 2023). Retrieval-augmented generation
(RAG) addresses this by incorporating external knowledge sources to provide accurate and relevant
information (Gao et al., 2023). Traditional RAG systems comprise a retriever to identify relevant
information and a generator to synthesize the answer (Zhao et al., 2024; Zhu et al., 2021). While
previous research focused on improving either the retriever (Izacard et al., 2021; Karpukhin et al.,
2020; Wang et al., 2022) or the generator (Agarwal et al., 2024; Dong et al., 2022; Liu et al., 2024) in
isolation, we take a holistic approach. Conducting comprehensive analyses of the entire RAG system,
we focus on the challenges and opportunities presented by using long-context LLMs as generators.
We propose novel solutions to better employ them in long-context RAG.
Increased computational resources and advancements in efficient training methods have enabled
LLMs to support increasingly longer inputs (Wang et al., 2024; Zhou et al., 2024). While long-context LLMs
(Reid et al., 2024) have demonstrated impressive performance on benchmarks like "needle-in-the-haystack"
(Kamradt, 2023) and RULER (Hsieh et al., 2024a), these benchmarks often rely on random negative
examples and do not accurately reflect the challenges posed by the "hard negatives" encountered in
real-world RAG scenarios (Cuconasu et al., 2024). Furthermore, existing studies on long-context LLMs
in multi-document settings (Liu et al., 2024; Shi et al., 2023) often assume a single "golden" document
and random negatives, which differs from the RAG context where multiple relevant passages and
hard negatives may exist (Cuconasu et al., 2024; Hsieh et al., 2024b). Although some research has
explored the relationship between RAG and long-context LLMs (Lee et al., 2024; Li et al., 2024;
Xu et al., 2023), these works take different perspectives. They mainly focus on (1) the
trade-offs between RAG and long-context LLMs (Xu et al., 2023), (2) routers to manage RAG and
long-context LLMs (Li et al., 2024), and (3) the potential for LLMs to replace retrieval entirely (Lee
et al., 2024), while leaving long-context LLMs as generators in RAG under-explored. We delve deeper
into the potential benefits of long-context LLMs for RAG and investigate how to optimize these LLMs
specifically for this application.
Previous research has explored adapting LLMs for RAG using instruction tuning (Zhang et al.,
2023). RetRobust (Yoran et al., 2024) fine-tunes LLMs with one retrieved relevant passage or one random
negative passage to make them robust to irrelevant passages. RA-DIT (Lin et al., 2024) conducts dual
instruction tuning so that the LLM leverages retrieved information more effectively and the retriever
provides results better aligned with the LLM's preferences. Self-RAG (Asai et al., 2024) introduces a framework
to train an LM that dynamically retrieves passages, generates content, and evaluates the retrieved
passages for improved performance. RAFT (Zhang et al., 2024) trains the LLMs to improve their ability
to answer questions in “open-book” in-domain settings. More recently, RankRAG (Yu et al., 2024)
tunes an LLM for the dual purpose of context ranking and answer generation in RAG. InstructRAG (Wei
et al., 2024) fine-tunes the LLM to generate self-synthesized rationales rather than directly answering
the question. However, these existing efforts primarily focus on tuning with a limited number of
retrieved passages (typically fewer than 10) and do not fully leverage the potential of long-context
LLMs. This work aims to address this gap by specifically investigating how to optimize long-context
LLMs for large-scale RAG, where the number of retrieved passages can be significantly higher.
3. Analysis of Long-Context LLMs in RAG

3.1. The effect of retrieved context size

This subsection investigates the relationship between the number of retrieved passages and the
performance of long-context LLMs in RAG systems.
Research question. Long-context LLMs offer the potential to incorporate more retrieved passages
into RAG systems. This raises a crucial question: Does a larger volume of retrieved context consistently
translate to better performance when using long-context LLMs in RAG?
Experimental setting. We evaluate the performance of RAG systems on the Natural Questions (NQ)
(Kwiatkowski et al., 2019) dataset using two different retrievers (BM25 (Robertson et al., 2009) and
e5 (Wang et al., 2022), where e5 exhibits higher performance on NQ (Recall@40 is 0.90 with e5 and
0.73 with BM25)) and four long-context LLMs (Gemma-7B-Chat (Team et al., 2024a), Gemma-2-9B-
Chat (Team et al., 2024b), Mistral-Nemo-12B-Instruct (Jiang et al., 2023) and Gemini-1.5-pro (Reid
et al., 2024)). We systematically vary the number of passages retrieved by each retriever.
[Figure 1 plots RAG accuracy against the number of retrieved passages (log scale) for the four LLMs; panel (a) shows RAG performance with the e5 retriever and panel (b) with the BM25 retriever.]
Figure 1 | Impact of retrieved context size on RAG performance with 4 different LLMs on NQ. Increasing
the number of retrieved passages initially improves performance but then leads to a decline. This
degradation is more pronounced using a retriever (e5) that exhibits higher recall@k on NQ compared
to BM25 (Recall@40 is 0.90 with e5 and 0.73 with BM25). The maximum number of retrieved
passages varies across LLMs due to differences in their maximum token limits.
Observations. Figure 1 presents the following key observations: 1) Strong Retriever (e5): Across all
LLMs, increasing the number of retrieved passages initially improves performance, but then leads to a
sharp decline or plateau. 2) Weak Retriever (BM25): Performance generally exhibits a continuous
increase or a slight decrease as the number of retrieved passages increases. While these observations
may appear counter-intuitive, given that one might expect monotonic improvements due to higher
recall (i.e., a greater chance of retrieving relevant information), the inclusion of additional documents
can reduce precision, with irrelevant or misleading passages distracting the LLMs and degrading overall performance.
Comparisons of different retrievers and results on other datasets are shown in Appendices A and B.1.
Insights. The effectiveness of increasing retrieved context size in RAG depends on the strength of
the retriever. With a strong retriever, performance exhibits an “inverted-U pattern”, while a weak
retriever shows more consistent, albeit potentially limited, improvement. This suggests that factors
beyond simply the amount of retrieved information are at play.
3.2. Retrieval quality vs. the LLM's use of retrieved context

This subsection delves into the factors hindering the performance of long-context LLMs in RAG, aiming
to discern whether limitations arise from retrieval quality or from the LLM's ability to process the retrieved
information.
Research question. Do the observed performance bottlenecks originate from limitations in the retriever’s
ability to identify relevant information, or from the long-context LLM’s capacity to effectively utilize the
retrieved context?
Experimental setting. We analyze the relationship between RAG performance and retrieval quality,
specifically recall and precision, using the Gemma-2-9B-Chat LLM with both e5 and BM25 retrievers
(Figure 2). Recall@k measures the presence of relevant passages within the top-k retrieved passages,
while precision@k quantifies the proportion of relevant passages among them.
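Concretely, letting $\mathcal{D}_k$ denote the top-$k$ retrieved passages for a query, and calling a passage relevant if it contains the answer, these metrics can be written as follows (a standard formulation, averaged over queries):

$$\text{recall@}k = \mathbb{1}\left[\exists\, d \in \mathcal{D}_k \text{ such that } d \text{ is relevant}\right], \qquad \text{precision@}k = \frac{\left|\{\, d \in \mathcal{D}_k : d \text{ is relevant} \,\}\right|}{k}.$$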
[Figure 2 plots RAG accuracy (left axis) alongside retrieval recall and precision (right axis) against the number of retrieved passages; panel (a) uses the e5 retriever and panel (b) uses the BM25 retriever.]
Figure 2 | Analyzing the relationship between RAG performance and retrieval quality (recall/precision)
using Gemma-2-9B-Chat with e5 and BM25 retrievers. (1) Accuracy vs. Recall: RAG accuracy
consistently falls below retrieval recall for both retrievers, indicating that the presence of relevant
information does not guarantee correct answers. This highlights the detrimental impact of irrelevant
passages on LLM performance. (2) Precision and hard negatives: Despite higher precision with e5, the
performance degradation with increasing retrieval size is more pronounced compared to BM25. This
demonstrates that precision alone is an insufficient metric for assessing the impact of "hard negatives,"
as the nature of irrelevant information significantly influences LLM performance.
Observations. Increasing the number of retrieved passages consistently leads to higher recall but
lower precision, irrespective of the retriever used. Crucially, the overall accuracy of the RAG system
falls below the recall across all retrieval sizes. This indicates that even when relevant information is
present in the retrieved context, the LLM may fail to generate the correct answer. This demonstrates
that the irrelevant retrieved passages can sometimes mislead the LLM. Furthermore, despite exhibiting
higher precision, the e5 retriever leads to a more pronounced performance degradation as the number
of retrieved passages increases compared to BM25.
Insights. These observations yield two key insights: (1) Influence of irrelevant passages: The dis-
crepancy between retrieval recall and RAG accuracy underscores the detrimental effect of irrelevant
retrieved passages ("hard negatives") on the LLMs’ performance. Even when relevant information is
available, the presence of hard negatives can mislead the LLMs and hinder their ability to generate
accurate answers. (2) Limitations of precision as a metric: The contrasting performance trends observed
with e5 and BM25, despite the former’s higher precision, reveal that precision alone is an inadequate
measure of retrieval quality in this context, when the end-to-end performance is considered. The
specific characteristics of the irrelevant passages, rather than just their quantity, significantly impact
the LLMs’ performance. Retrievers can differ significantly in how they prioritize such irrelevant passages, and
this difference might not be fully captured in metrics like precision. In this scenario, we observe that “hard
negatives” retrieved by a stronger retriever (e5) can be even more detrimental to the LLM than those
retrieved by a weaker retriever (BM25).
3.3. The impact of hard negatives

This subsection investigates the impact of "hard negatives" on the performance of long-context LLMs
in RAG, highlighting the need for more robust evaluation methodologies.
Research question. In long-context RAG scenarios, where a vast knowledge source necessitates
retrieving numerous passages, the likelihood of including relevant information (i.e. obtaining high
recall) increases. However, this also elevates the risk of introducing hard negatives. This raises two
critical questions: (1) How robust are current long-context LLMs to these hard negatives? and (2) Does
the impact of hard negatives vary with the retriever used?
Experimental setting. This study investigates the effect of hard negative passages on long-context
LLM performance in a controlled setting. We tasked three LLMs (Gemma-2-9B-Chat, Mistral-Nemo-12B-Instruct,
and Gemini-1.5-Pro) with answering queries based on a context comprising a single
golden passage and a varying number of hard negative passages retrieved using different methods (e5,
Contriever, BM25, and random sampling). This synthetic experiment, detailed in Figure 3, isolates
the impact of hard negatives by holding the golden passage constant and intentionally excluding
scenarios with multiple golden passages, which are common in real-world RAG systems. See Appendix
C for a complete illustration of the experimental setup.
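To make the setup concrete, the following is a minimal sketch of how a single evaluation context might be assembled (illustrative rather than the authors' code; the helper name `build_context` and the policy of shuffling passage positions are assumptions):

```python
import random

def build_context(golden_passage: str, hard_negatives: list[str],
                  num_negatives: int, seed: int = 0) -> str:
    """Combine one golden passage with `num_negatives` hard negatives.

    The golden passage is held constant while the negatives vary, so changes
    in answer accuracy can be attributed to the negatives themselves.
    """
    rng = random.Random(seed)
    passages = [golden_passage] + hard_negatives[:num_negatives]
    rng.shuffle(passages)  # assumed placement policy; the golden passage is not pinned to one slot
    return "\n\n".join(f"Doc {i + 1}: {p}" for i, p in enumerate(passages))

# Hard negatives could come from the top-ranked non-answer passages of a retriever
# (e5, Contriever, BM25) or from random sampling over the corpus.
context = build_context("golden passage text", ["neg 1", "neg 2", "neg 3"], num_negatives=2)
prompt = f"Answer the question based on the context.\n\n{context}\n\nQuestion: ...\nAnswer:"
```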
[Figure 3 panels: (a) Retrievers (precision vs. recall on NQ); (b) Gemma2-9B-Chat, (c) Mistral-12B-Instruct, (d) Gemini-1.5-Pro, each plotting RAG accuracy against the number of passages for negatives from e5, contriever, BM25, and random sampling.]
Figure 3 | Evaluating the impact of hard negatives on long-context LLMs. (a) The retriever performance
on NQ dataset: e5 > contriever > BM25. (b)(c)(d) For each query, a single golden passage (containing
the correct answer) is combined with varying numbers of hard negative passages retrieved by different
methods: e5, Contriever, BM25, and random sampling. The LLMs are then tasked with answering the
query based on this context. This setup allows us to assess the robustness of LLMs to hard negatives
and the influence of retriever characteristics on their overall impact.
Observations. (1) Sensitivity to hard negatives: Across all LLMs, increasing the number of hard
negative passages generally leads to a decline in RAG answer accuracy. (2) Retriever strength and
hard negative difficulty: The strength of the retriever directly correlates with the difficulty of the
retrieved hard negatives. LLMs struggle more with hard negatives from stronger retrievers (e.g., e5)
compared to those from weaker retrievers (e.g., BM25) or random sampling. (3) Distinguishing
random and hard negatives: While Gemini-1.5-Pro demonstrates robustness to random negatives, it
remains susceptible to the influence of hard negatives. More results on other datasets and qualitative
studies can be found in Appendix B.2 and D.
Insights. Existing benchmarks for evaluating long-context LLMs, such as "needle-in-the-haystack"
(Kamradt, 2023) and RULER (Hsieh et al., 2024a), predominantly utilize random negatives. Our
findings demonstrate that such benchmarks may not adequately capture the challenges posed by
hard negatives, which are prevalent in real-world RAG applications, so their takeaways have limitations.
This highlights the need for new evaluation methodologies that incorporate hard negatives (specific to
the employed retrievers) to provide a more comprehensive and realistic assessment of
long-context LLM performance in RAG.
4. Retrieval Reordering

Motivated by the "lost-in-the-middle" phenomenon (Liu et al., 2024), we reorder the retrieved passages
based on their retrieval scores so that higher-scoring passages are placed at the beginning and end of the
input sequence, while lower-scoring ones are placed in the middle. This reordering strategy aims to guide
the LLM's attention towards the most relevant passages, thereby reducing the influence of hard negatives
positioned in the middle of the sequence. The pseudo-code for retrieval reordering can be found in Appendix E.
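As a concrete illustration, here is a minimal sketch, assuming the passages arrive sorted by descending retrieval score; the exact interleaving used in the paper is the one given in Appendix E:

```python
def reorder_passages(passages: list[str]) -> list[str]:
    """Reorder passages (sorted by descending retrieval score) so that
    higher-scoring passages sit at the beginning and end of the context,
    while lower-scoring ones end up in the middle."""
    front, back = [], []
    for i, passage in enumerate(passages):
        if i % 2 == 0:
            front.append(passage)  # 1st, 3rd, 5th, ... highest-scoring passages go to the front
        else:
            back.append(passage)   # 2nd, 4th, 6th, ... are reserved for the back
    return front + back[::-1]      # reverse the back half so scores increase toward the end

# Example: scores decrease from d1 to d6.
print(reorder_passages(["d1", "d2", "d3", "d4", "d5", "d6"]))
# ['d1', 'd3', 'd5', 'd6', 'd4', 'd2']
```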
[Figure 4 panels plot RAG accuracy for the original order vs. reordering against the number of retrieved passages; labeled panels include (a) NQ: Gemma2+e5, (b) NQ: Gemma2+BM25, (c) NQ: Mistral+e5, (d) NQ: Mistral+BM25, with corresponding PopQA panels.]
Figure 4 | Evaluating the effectiveness of retrieval reordering in various RAG configurations. Results
demonstrate that reordering retrieved passages consistently enhances performance, particularly
when the number of retrieved passages is large. (Retrievers: e5, BM25; LLMs: Gemma2-9b-Chat,
Mistral-Nemo-12B-Instruct; Datasets: NQ, PopQA)
Retrieval reordering significantly improves RAG performance, particularly with larger numbers
of retrieved passages. To assess the effectiveness of retrieval reordering, we conduct experiments
with two retrievers (e5 and BM25), two long-context LLMs (Gemma-2-9B-Chat and Mistral-Nemo-
12B-Instruct), and two datasets (NQ and PopQA). As illustrated in Figure 4, retrieval reordering yields
negligible improvements with smaller retrieval sets, but significantly and consistently outperforms
the original ordering when the number of retrieved passages is large. This behavior is attributed to
the interplay of two factors that become increasingly significant with larger retrieval sets: (1) the
amplified "lost-in-the-middle" phenomenon, where LLMs prioritize information at the beginning and
end of the input sequence, and (2) the increased prevalence of hard negatives, which can hinder
accurate answer generation. By strategically placing passages, retrieval reordering mitigates these
issues, highlighting the potential of position engineering as a complementary technique to prompt
engineering for optimizing long-context LLMs in RAG.
5. RAG-Specific Fine-Tuning

While the retrieval reordering strategy presented in Section 4 mitigates the detrimental impact of
hard negatives, it does not inherently enhance the LLM's ability to handle such irrelevant information
within the context. To address this, we conduct a systematic investigation into RAG-specific tuning as
a means of improving long-context LLMs for RAG applications.
5.1. Implicit robustness fine-tuning

Our tuning paradigm involves training the LLM to generate the correct answer ($a$) given a comprehensive
input comprising an instruction ($I$), a query ($q$), and a set of retrieved passages ($d_1, d_2, \ldots, d_k$):

$$(I, q, d_1, d_2, \ldots, d_k) \;\rightarrow\; a.$$

This approach aims to implicitly enhance the LLM's robustness to hard negatives by exposing it to a
diverse range of retrieved contexts during fine-tuning, thus enabling it to learn to effectively identify
and utilize relevant information even in the presence of noise.
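To illustrate the data construction, the sketch below assembles one training example by concatenating the instruction, the query, and the top-k retrieved passages as the input, with the gold answer as the target (the prompt layout here is an assumption; the actual instruction and answer templates are listed in Tables 4 and 5):

```python
def build_rag_ft_example(instruction: str, question: str,
                         passages: list[str], answer: str) -> dict:
    """Build one supervised fine-tuning example for implicit RAG tuning.

    The retrieved passages are included as-is, relevant ones and hard negatives
    alike, so the model is implicitly exposed to noisy context during training.
    """
    context = "\n\n".join(f"Doc {i + 1}: {p}" for i, p in enumerate(passages))
    prompt = f"{instruction}\n\n{context}\n\nQuestion: {question}\nAnswer:"
    return {"prompt": prompt, "completion": f" {answer}"}

example = build_rag_ft_example(
    instruction="Answer the question based on the retrieved documents.",
    question="Which film features the Dawes Tomes Mousley Grubbs Fidelity Fiduciary Bank?",
    passages=["passage about Fidelity Fiduciary Bank ...", "an unrelated hard-negative passage ..."],
    answer="Mary Poppins",
)
```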
To assess the generalization capabilities of RAG-tuned LLMs, we fine-tune Gemma-2-9B-Base,
Mistral-Nemo-12B-Base and Gemini-1.0-Pro using a diverse dataset comprising NQ, WoW, Fever, and
MMLU. We then evaluate on a range of unseen datasets, including TriviaQA, PopQA, HotpotQA,
2wikimultihopqa, Webquestions, Bamboogle, ASQA, T-REx, and zsRE. We compare the performance
of the RAG-tuned model (RAG FT) with two types of baselines: (1) Chat model with retrieval
augmentation: the Gemma-2-9B-Chat/Mistral-Nemo-12B-Instruct/Gemini-1.0-Pro w. RAG; (2) Direct
SFT: the ones fine-tuned with standard supervised fine-tuning (SFT) on question-answer pairs without
retrieved context (Direct FT w/o RAG). Further details regarding the datasets and experimental setup
can be found in Appendix F and G.
Figure 5 shows three key observations: (1) Consistent improvement over baselines: RAG
FT consistently outperforms the chat model w. RAG and the Direct FT model across all evaluated
datasets. (2) Robustness to hard negatives: the curve of RAG FT is generally flatter than that of the
chat model, which demonstrates that our fine-tuned LLM is more robust to hard negatives as
the number of retrieved passages increases. (3) Superiority over direct fine-tuning: In most cases,
RAG FT demonstrates superior performance compared to Direct FT. This indicates that RAG FT not
only enables the LLM to "memorize" knowledge during training but also equips it with the ability
to effectively "extract" relevant information from retrieved context during inference. These findings
highlight the effectiveness of RAG-specific tuning in enhancing the generalization capabilities of LLMs
for knowledge-intensive tasks. Separate results for those three LLMs are shown in Appendices J, K, and L.
Qualitative studies can be found in Appendix I.
[Figure 5 panels plot RAG accuracy against the number of retrieved passages on (a) TriviaQA, (b) PopQA, (c) HotpotQA, (d) 2wikimultihopqa, (e) Bamboogle, and (f) ASQA.]
Figure 5 | Generalization ability of LLMs fine-tuned with RAG-specific data (RAG FT). RAG FT
consistently outperforms the chat LLM w. RAG and the model fine-tuned directly on question-answer
pairs (Direct FT). This demonstrates the effectiveness of RAG FT in enabling the LLM to effectively
extract knowledge from retrieved context on unseen tasks. Note that Direct FT is evaluated without
retrieval to align with its training paradigm and all others are evaluated with retrieval augmentation.
(LLMs: Gemma-2-9B-Base, Mistral-Nemo-12B-Base, Gemini-1.0-Pro)
5.2. Explicit relevance fine-tuning with intermediate reasoning

While the fine-tuning approach described in Section 5.1 implicitly enhances the LLM's robustness to
hard negatives, it does not explicitly train the model to differentiate between relevant and irrelevant
passages within the retrieved context. To address this, we investigate the effectiveness of incorporating
an intermediate reasoning step into the fine-tuning process.
This modified paradigm involves training the LLM to generate both a reasoning paragraph ($r$)
that explicitly identifies the relevant passages for the given query ($q$) and the final answer ($a$):

$$(I, q, d_1, d_2, \ldots, d_k) \;\rightarrow\; (r, a).$$

During training, the LLMs are provided with labeled reasoning paragraphs to guide their learning
process. During inference, the LLMs are instructed to first generate the reasoning paragraph and then
utilize this analysis to produce the answer.
[Figure 6 panels plot RAG accuracy against the number of retrieved passages on (a) TriviaQA, (b) PopQA, (c) HotpotQA, (d) 2wikimultihopqa, and (e) ASQA; panel (f) shows the legend: Gemma-2-9B-Chat, Gemma-2 RAG FT, Gemma-2 RAG FT w. Int, Gemma-2 Direct FT, Gemini-1.0-Pro, Gemini RAG FT, Gemini RAG FT w. Int, Gemini Direct FT.]
Figure 6 | Evaluating the impact of intermediate reasoning on the performance of RAG-tuned LLMs.
Results demonstrate that fine-tuning with an intermediate reasoning step (RAG FT w. Int) leads to
further improvements compared to implicit RAG fine-tuning (RAG FT) and direct fine-tuning (Direct
FT). Direct FT is evaluated without retrieval to align with its training and all others are evaluated
with retrieval augmentation. Due to the computational complexity of inference with reasoning
augmentation, results are shown for 1000 randomly-sampled queries from each dataset. (LLMs:
Gemma-2-9B-Base and Gemini-1.0-Pro, more results in Appendix J, K and L)
This approach aims to explicitly enhance the LLMs' ability
to discern relevant information from noise within the retrieved context, thereby improving their overall
performance in RAG.
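For illustration, here is a sketch of how such a training target might be assembled, with the reasoning paragraph prepended to the answer (the exact formatting is an assumption; the actual templates are given in Tables 7 and 8):

```python
def build_reasoning_ft_example(instruction: str, question: str, passages: list[str],
                               reasoning: str, answer: str) -> dict:
    """Build one training example for explicit relevance fine-tuning.

    The target contains an intermediate reasoning paragraph, identifying which
    retrieved passages are relevant and why, followed by the final answer, so the
    model learns to analyze the retrieved context before answering.
    """
    context = "\n\n".join(f"Doc {i + 1}: {p}" for i, p in enumerate(passages))
    prompt = f"{instruction}\n\n{context}\n\nQuestion: {question}"
    target = f"Reasoning: {reasoning}\nAnswer: {answer}"
    return {"prompt": prompt, "completion": target}
```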
We utilize the same training data mixture as in Section 5.1 and augment it with reasoning labels
generated by Gemini-1.5-Pro for each question-passage pair. These labels provide explicit guidance
on identifying relevant passages. Further details of the experimental setup and the generation of
reasoning labels can be found in Appendix H.
Figure 6 demonstrates the effectiveness of this approach. The LLM fine-tuned with explicit
intermediate reasoning consistently outperforms training with implicit RAG data. This improvement
can be attributed to two key factors: (1) Explicit relevance training: Providing intermediate reasoning
labels during training explicitly teaches the LLM to differentiate between relevant and irrelevant
passages, enhancing its ability to discern crucial information from noise. (2) Structured reasoning for
enhanced understanding: Generating a reasoning paragraph before answering introduces a structured
approach to processing the retrieved context. This step, akin to chain-of-thought reasoning (Wei et al.,
2022), helps decompose the complex information and facilitates a more focused analysis, ultimately
leading to improved performance. These findings highlight the value of incorporating explicit reasoning
mechanisms in RAG tuning to enhance the LLM's ability to effectively utilize retrieved context. More
results on the Gemma-2-9B models, Mistral-Nemo-12B models, and Gemini-1.0-Pro models are shown in
Appendices J, K, and L. Qualitative studies can be found in Appendix I.
6. Analysis of RAG-Specific Tuning Design Choices

[Figure 7 panels plot: (a) analysis of training data distribution (test: HotpotQA), comparing mixed training data against NQ-only, WoW-only, FEVER-only, and MMLU-only; (b) influence of retriever variations on fine-tuning effectiveness (NQ), comparing fine-tuning with e5 vs. a retriever mix, evaluated with BM25, contriever, bge, e5, and their average at inference; (c) investigation of the optimal number of passages for training, comparing 25% max, 50% max, 100% max, 0-100% max, and 50-100% max configurations.]
Figure 7 | (a) Impact of training data distribution: A diverse mix of training data sources enhances the
generalization ability of the LLM. (b) Influence of the retriever choice: Fine-tuning with data retrieved
from multiple retrievers improves generalization to unseen retrievers during inference. (c) Effect of
training context length: Fine-tuning with the maximum context length yields optimal performance
across varying numbers of retrieved passages during inference. (LLM: Gemma-2-9B-Base)
Training context length. Another design choice is the number of retrieved passages included in each
training example, which determines the training context length and may affect performance across
different retrieval sizes during inference. We investigate this aspect with the Gemma-2-9B-Base model,
which has a maximum input sequence length of 8192 tokens (corresponding to approximately 40
passages). We evaluate five different training configurations: (1) fixed 10 retrieved passages (25%
max); (2) fixed 20 retrieved passages (50% max); (3) fixed 40 retrieved passages, i.e., the maximum input
capacity (100% max); (4) dynamic 0-40 retrieved passages (0-100% max); (5) dynamic 20-40
retrieved passages (50-100% max); a sketch of how passage counts might be drawn under each configuration is given below.
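A minimal sketch of how the number of passages per training example could be drawn under each configuration (names mirror the list above; this is illustrative only):

```python
import random

MAX_PASSAGES = 40  # roughly the 8192-token limit of Gemma-2-9B-Base

def sample_num_passages(config: str, rng: random.Random) -> int:
    """Return the number of retrieved passages to include in one training example."""
    if config == "25% max":
        return 10
    if config == "50% max":
        return 20
    if config == "100% max":
        return MAX_PASSAGES
    if config == "0-100% max":
        return rng.randint(0, MAX_PASSAGES)   # dynamic 0-40 passages
    if config == "50-100% max":
        return rng.randint(20, MAX_PASSAGES)  # dynamic 20-40 passages
    raise ValueError(f"unknown configuration: {config}")

rng = random.Random(0)
print([sample_num_passages("0-100% max", rng) for _ in range(5)])
```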
Figure 7(c) presents the results on NQ, demonstrating that fine-tuning with the maximum number
of retrieved passages (100% max) consistently yields the best performance across various retrieval sizes
during inference. This suggests that training with the full context capacity enhances the LLM’s ability
to effectively handle varying amounts of retrieved information, leading to improved generalization
and robustness. More analyses of RAG-specific tuning can be found in Appendices M and N.
7. Conclusions
This paper investigates the impact of increasing the number of retrieved passages on the performance
of long-context LLMs in retrieval-augmented generation (RAG) systems. Contrary to expectations, we
observe that performance initially improves but then degrades as more passages are included. This
phenomenon is attributed to the detrimental influence of retrieved "hard negatives". To mitigate
this issue, we propose and evaluate three solutions: training-free retrieval reordering, RAG-specific
implicit LLM fine-tuning, and RAG-oriented LLM fine-tuning with intermediate reasoning. A systematic
analysis of the training-based methods explores the effects of training data distribution, the retriever used for training,
and training context length. Interesting future directions include exploring (automated) position
optimization with more advanced retrieval ordering methods, and fine-tuning the LLMs for RAG with
more fine-grained and multi-step reasoning chains.
References
R. Agarwal, A. Singh, L. M. Zhang, B. Bohnet, S. Chan, A. Anand, Z. Abbas, A. Nova, J. D. Co-Reyes,
E. Chu, et al. Many-shot in-context learning. arXiv preprint arXiv:2404.11018, 2024.
A. Asai, Z. Wu, Y. Wang, A. Sil, and H. Hajishirzi. Self-rag: Learning to retrieve, generate, and critique
through self-reflection. In The Twelfth International Conference on Learning Representations, 2024.
I. Augenstein, T. Baldwin, M. Cha, T. Chakraborty, G. L. Ciampaglia, D. Corney, R. DiResta, E. Ferrara,
S. Hale, A. Halevy, et al. Factuality challenges in the era of large language models. arXiv preprint
arXiv:2310.05189, 2023.
J. Chen, S. Xiao, P. Zhang, K. Luo, D. Lian, and Z. Liu. Bge m3-embedding: Multi-lingual, multi-
functionality, multi-granularity text embeddings through self-knowledge distillation. arXiv preprint
arXiv:2402.03216, 2024.
F. Cuconasu, G. Trappolini, F. Siciliano, S. Filice, C. Campagnano, Y. Maarek, N. Tonellotto, and
F. Silvestri. The power of noise: Redefining retrieval for rag systems. In Proceedings of the 47th
International ACM SIGIR Conference on Research and Development in Information Retrieval, pages
719–729, 2024.
Q. Dong, L. Li, D. Dai, C. Zheng, Z. Wu, B. Chang, X. Sun, J. Xu, and Z. Sui. A survey on in-context
learning. arXiv preprint arXiv:2301.00234, 2022.
A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Yang,
A. Fan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.
Y. Gao, Y. Xiong, X. Gao, K. Jia, J. Pan, Y. Bi, Y. Dai, J. Sun, and H. Wang. Retrieval-augmented
generation for large language models: A survey. arXiv preprint arXiv:2312.10997, 2023.
C.-P. Hsieh, S. Sun, S. Kriman, S. Acharya, D. Rekesh, F. Jia, and B. Ginsburg. Ruler: What’s the real
context size of your long-context language models? arXiv preprint arXiv:2404.06654, 2024a.
C.-Y. Hsieh, Y.-S. Chuang, C.-L. Li, Z. Wang, L. T. Le, A. Kumar, J. Glass, A. Ratner, C.-Y. Lee, R. Krishna,
et al. Found in the middle: Calibrating positional attention bias improves long context utilization.
arXiv preprint arXiv:2406.16008, 2024b.
L. Huang, W. Yu, W. Ma, W. Zhong, Z. Feng, H. Wang, Q. Chen, W. Peng, X. Feng, B. Qin, et al. A
survey on hallucination in large language models: Principles, taxonomy, challenges, and open
questions. arXiv preprint arXiv:2311.05232, 2023.
J. Lee, A. Chen, Z. Dai, D. Dua, D. S. Sachan, M. Boratko, Y. Luan, S. M. Arnold, V. Perot, S. Dalmia,
et al. Can long-context language models subsume retrieval, rag, sql, and more? arXiv preprint
arXiv:2406.13121, 2024.
Z. Li, C. Li, M. Zhang, Q. Mei, and M. Bendersky. Retrieval augmented generation or long-context
llms? a comprehensive study and hybrid approach. arXiv preprint arXiv:2407.16833, 2024.
X. V. Lin, X. Chen, M. Chen, W. Shi, M. Lomeli, R. James, P. Rodriguez, J. Kahn, G. Szilvasy, M. Lewis,
et al. Ra-dit: Retrieval-augmented dual instruction tuning. In The Twelfth International Conference
on Learning Representations, 2024.
N. F. Liu, K. Lin, J. Hewitt, A. Paranjape, M. Bevilacqua, F. Petroni, and P. Liang. Lost in the middle:
How language models use long contexts. Transactions of the Association for Computational Linguistics,
12:157–173, 2024.
S. Robertson, H. Zaragoza, et al. The probabilistic relevance framework: Bm25 and beyond. Founda-
tions and Trends® in Information Retrieval, 3(4):333–389, 2009.
F. Shi, X. Chen, K. Misra, N. Scales, D. Dohan, E. H. Chi, N. Schärli, and D. Zhou. Large language
models can be easily distracted by irrelevant context. In International Conference on Machine
Learning, pages 31210–31227. PMLR, 2023.
L. Wang, N. Yang, X. Huang, B. Jiao, L. Yang, D. Jiang, R. Majumder, and F. Wei. Text embeddings by
weakly-supervised contrastive pre-training. arXiv preprint arXiv:2212.03533, 2022.
X. Wang, M. Salmani, P. Omidi, X. Ren, M. Rezagholizadeh, and A. Eshaghi. Beyond the limits:
A survey of techniques to extend the context length in large language models. arXiv preprint
arXiv:2402.02244, 2024.
J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou, et al. Chain-of-thought
prompting elicits reasoning in large language models. Advances in neural information processing
systems, 35:24824–24837, 2022.
Z. Wei, W.-L. Chen, and Y. Meng. Instructrag: Instructing retrieval-augmented generation with explicit
denoising. arXiv preprint arXiv:2406.13629, 2024.
P. Xu, W. Ping, X. Wu, L. McAfee, C. Zhu, Z. Liu, S. Subramanian, E. Bakhturina, M. Shoeybi, and
B. Catanzaro. Retrieval meets long context large language models. arXiv preprint arXiv:2310.03025,
2023.
O. Yoran, T. Wolfson, O. Ram, and J. Berant. Making retrieval-augmented language models robust to
irrelevant context. In The Twelfth International Conference on Learning Representations, 2024.
Y. Yu, W. Ping, Z. Liu, B. Wang, J. You, C. Zhang, M. Shoeybi, and B. Catanzaro. Rankrag: Unifying
context ranking with retrieval-augmented generation in llms. arXiv preprint arXiv:2407.02485,
2024.
S. Zhang, L. Dong, X. Li, S. Zhang, X. Sun, S. Wang, J. Li, R. Hu, T. Zhang, F. Wu, et al. Instruction
tuning for large language models: A survey. arXiv preprint arXiv:2308.10792, 2023.
T. Zhang, S. G. Patil, N. Jain, S. Shen, M. Zaharia, I. Stoica, and J. E. Gonzalez. Raft: Adapting
language model to domain specific rag. arXiv preprint arXiv:2403.10131, 2024.
W. X. Zhao, K. Zhou, J. Li, T. Tang, X. Wang, Y. Hou, Y. Min, B. Zhang, J. Zhang, Z. Dong, et al. A
survey of large language models. arXiv preprint arXiv:2303.18223, 2023.
W. X. Zhao, J. Liu, R. Ren, and J.-R. Wen. Dense text retrieval based on pretrained language models:
A survey. ACM Transactions on Information Systems, 42(4):1–60, 2024.
Z. Zhou, X. Ning, K. Hong, T. Fu, J. Xu, S. Li, Y. Lou, L. Wang, Z. Yuan, X. Li, et al. A survey on efficient
inference for large language models. arXiv preprint arXiv:2404.14294, 2024.
F. Zhu, W. Lei, C. Wang, J. Zheng, S. Poria, and T.-S. Chua. Retrieving and reading: A comprehensive
survey on open-domain question answering. arXiv preprint arXiv:2101.00774, 2021.
Appendix
A. Retriever performance and similarity
We analyze the performance and similarity of four retrievers (BM25, contriever, e5 and bge) on the
NQ dataset shown in Figure 8. Each data point corresponds to a retrieval (recall, precision) pair for a
specific number of retrieved passages. The overall retrieval performance on NQ is observed as e5 >
bge > contriever > BM25, with contriever performing similarly to BM25 and bge performing similarly
to e5 (as their curves are closer).
[Figure 8 plots precision against recall on NQ for BM25, contriever, e5, and bge.]
Figure 8 | Retriever performance on NQ. (1) Retrieval performance: e5 > bge > contriever > BM25;
(2) Contriever is more similar to BM25, while bge is more similar to e5 (since their curves are closer
respectively).
Observations. Figure 9 presents the following key observations similar to that in Section 3.1: 1)
Strong Retriever (e5): Across all LLMs, increasing the number of retrieved passages initially enhances
performance, but subsequently results in either a sharp decline or a plateau. 2) Weak Retriever
(BM25): Performance generally shows a continuous improvement or a slight decrease as the number
of retrieved passages increases. While these observations may appear counter-intuitive - given that
one might expect monotonic improvements due to higher recall (i.e., a greater chance of retrieving
relevant information) - the inclusion of additional documents can reduce precision, with irrelevant or
misleading passages detracting LLMs from overall performance.
Observations. Figure 10 shows the following observations similar to that in Section 3.3: (1) Sensitivity
to hard negatives: Across all LLMs, increasing the number of hard negative passages generally results
[Figure 9 plots RAG accuracy against the number of retrieved passages on PopQA for Gemma-7B-Chat, Gemma-2-9B-Chat, Mistral-Nemo-12B-Instruct, and Gemini-1.5-Pro.]
Figure 9 | Impact of retrieved context size on RAG performance (on PopQA) with 4 different LLMs.
Increasing the number of retrieved passages initially improves performance but then leads to a decline.
This degradation is more pronounced using a retriever (e5) that exhibits higher recall@k on PopQA
compared to BM25 (Recall@40 is 0.85 with e5 and 0.57 with BM25). The maximum number of
retrieved passages varies across LLMs due to differences in their maximum token limits.
[Figure 10 panels plot retriever precision/recall on PopQA and RAG accuracy against the number of passages for negatives from the different retrieval methods.]
Figure 10 | Evaluating the impact of hard negatives on long-context LLMs. (a) The retriever perfor-
mance on PopQA dataset: e5 > contriever > BM25. (b)(c)(d) For each query, a single golden passage
(containing the correct answer) is combined with varying numbers of hard negative passages retrieved
by different methods (e5, Contriever, BM25, and random sampling). The LLMs are then tasked with
answering the query based on this context. This setup allows us to assess the robustness of LLMs to
hard negatives and the influence of retriever strength on their impact.
in a decline in RAG answer accuracy. (2) Retriever strength and hard negative difficulty: The strength
of the retriever is directly correlated with the difficulty of the retrieved hard negatives. LLMs struggle
more with hard negatives generated by stronger retrievers (e.g., e5) compared to those produced by
weaker retrievers (e.g., BM25) or through random sampling. (3) Distinguishing random and hard
negatives: While all the LLMs demonstrate robustness to random negatives, they remain susceptible to
the influence of hard negatives.
Doc 5 (Title: "Geography of Nigeria") ... Nigeria, has a temperature range of to , and an
annual rainfall of about with a single rainfall maxima in September. The single Dry season
experienced in this climate, the tropical savanna climate in central Nigeria beginning from
December to march, is hot and dry with the Harmattan wind, a continental tropical (CT)
airmass laden with dust from the Sahara Desert prevailing throughout this period. With the
Intertropical Convergence Zone (ITCZ) swinging northward over West Africa from the
Southern Hemisphere in April, heavy showers coming from pre-monsoonal convective clouds
mainly in the form of ... [Related but Irrelevant]
w. bm25 Doc 1 (Title: "Oron people") ... Civil War. Oron is found in the flood plain of South Eastern
Nigeria, with the land mainly intersected by numerous streams and tributaries flowing into
Cross River. The entire coastline stretches from Uya Oron to Udung Uko. Oron is in the
tropical region and has a uniformly high temperature all the year round. The two main
seasons are the dry which spans between October and April and wet season which starts
around May and ends in September. There are also two prevailing winds – the South-West
onshore winds which brings heavy rains and the ... [Not Related]
Doc 2 (Title: "South Equatorial Current") ... is driven directly by the trade winds which blow
from east to west. In the Indian Ocean, the westward-flowing South Equatorial Current is
well-developed only south of the equator. Directly on the equator, the winds reverse twice a
year due to the monsoons, and so the surface current can be either eastward or westward.
South Equatorial Current Ocean current in the Pacific, Atlantic, and Indian Ocean that flows
east-to-west between the equator and about 20 degrees south. In the Pacific and Atlantic
Oceans, it extends across the equator to about 5 degrees north. Within the southern
hemisphere, the South Equatorial ... [Related but Irrelevant]
Doc 3 (Title: "Wind direction") ... Wind direction Wind direction is reported by the direction
from which it originates. For example, a ""northerly"" wind blows from the north to the south.
Wind direction is usually reported in cardinal directions or in azimuth degrees. Wind
direction is measured in degrees clockwise from due north. Consequently, a wind blowing
from the north has a wind direction of 0; a wind blowing from the east has a wind direction
of 90; a wind blowing from the south has a wind direction of 180; and a wind blowing from
the west has a wind direction of 270 ... [Related but Irrelevant]
Doc 4 (Title: "Gulf Stream") ... this current interacts with the northeastern coast of South
America, the current forks into two branches. One passes into the Caribbean Sea, while a
second, the Antilles Current, flows north and east of the West Indies. These two branches
rejoin north of the Straits of Florida. The trade winds blow westward in the tropics, and the
westerlies blow eastward at mid-latitudes. This wind pattern applies a stress to the
subtropical ocean surface with negative curl across the north Atlantic Ocean. The resulting
Sverdrup transport is equatorward. Because of conservation of potential vorticity caused by
the northward-moving winds on the subtropical ... [Not Related]
Doc 5 (Title: "Climate of the United Kingdom") ... climate that western parts of the UK
experience. The high latitude and proximity to a large ocean to the west means that the
United Kingdom experiences strong winds. The prevailing wind is from the south-west, but it
may blow from any direction for sustained periods of time. Winds are strongest near westerly
facing coasts and exposed headlands. Gales — which are defined as winds with speeds of —
are strongly associated with the passage of deep depressions across the country. The Hebrides
experience on average 35 days of gale a year (a day where there are gale-force winds) while
inland ... [Not Related]
w. random sampling Doc 1 (Title: "Queen of Peace, Bray") ... of Bugisi in Tanzania for a number of years. The
parish has a very close relationship with St Cronan’s B.N.S., Scoil Chualann and Gaelscoil Uí
Chéadaigh. The parish provides the sacraments of Communion and Confirmation to the
children in the schools. They also help to raise funds for the twin parish of Bugisi. Queen of
Peace, Bray The Queen of Peace is a Catholic church situated at the junction of the Putland
Road and the Vevay Road in Bray, Co. Wicklow, Ireland. The present church was built in 1946
by TJ Macken of St Patrick’s Street, Dún Laoghaire, ... [Not Related]
Doc 2 (Title: "Cordova Congressional Internship Program") ... Puerto Rico’s Constitutional
Convention from 1951 to 1952. By 2012, over 670 students from colleges and universities in
Puerto Rico had enjoyed internships under the program, and the Spring 2009 class included
a record 24 members. A private sector committee, recently headed by Univision Puerto Rico
president Larry Sands, provides private funds to supplement the 350,000 annual grant
provided by the Puerto Rico Legislative Assembly. Under the auspices of TWC, seventeen
states have since established similar legislative-funded Congressional internship programs.
The Center established in 2008 the McClintock Award to the State Legislator of the Year ...
[Not Related]
Doc 3 (Title: "V bomber") ... Puerto Rico’s Constitutional Convention from 1951 to 1952. By
2012, over 670 students from colleges and universities in Puerto Rico had enjoyed
internships under the program, and the Spring 2009 class included a record 24 members. A
private sector committee, recently headed by Univision Puerto Rico president Larry Sands,
provides private funds to supplement the 350,000 annual grant provided by the Puerto Rico
Legislative Assembly. Under the auspices of TWC, seventeen states have since established
similar legislative-funded Congressional internship programs. The Center established in 2008
the McClintock Award to the ... [Not Related]
Doc 4 (Title: "Defence Materials and Stores Research and Development Establishment") ...
materials for the Indian Armed Forces. DMSRDE has developed Nuclear Shielding Pad, Boot
Anti Mine, Blast Protection Suit, Bullet Proof Jackets, etc.. ""The Defence Material and Stores
Research Development Establishment in Kanpur has developed a new NBC suit that would be
proved effective against any kind of dangerous weapons or chemicals and protect soldiers
from any sort of attack,"" DMSRDE Director Arvind Kumar Saxena was quoted by
media-persons. 40,000 pieces of NBC suits costing about Rs 30,000 had been requested by
Indian army. ""the further progress on the other two suits are going on."" further ... [Not
Related]
Doc 5 (Title: "Chess title") ... retain the title of Candidate Master, if it was earned according
to criteria above). This is in contrast to international titles awarded by FIDE, which are
awarded for life. In European countries the term of ""expert"" is not used. Instead, players of
that level are called ""Candidate Masters"", although the FIDE Candidate Master title
generally requires a higher rating (2200 FIDE). It is possible (and common), however, for
players in the United States to have a rating that places them in the ’expert’ category while
still retaining the title of ’Life Master’ or ’National Master’ ... [Not Related]
E. Retrieval reordering
F. Datasets
In this section, we discuss the datasets for RAG-specific LLM training and evaluation.
We select a series of fine-tuning data designed to enhance the model’s robustness to hard negatives
in the retrieval context and improve its contextual awareness in generating predictions. The training
data are from four sources with different answer types: Natural Questions (short-form), Wizard of
Wikipedia (long-form), FEVER (true/false), and MMLU (closed-set). The statistics of the training data
mix can be found in Table 1.
To comprehensively evaluate our methods, we select testing datasets across different tasks including:
(1) Question-answering: TriviaQA, PopQA, WebQuestions; (2) Multi-hop tasks: HotpotQA, 2WikiMul-
tiHopQA, Bamboogle; (3) Long-form tasks: ASQA; (4) Slot filling: T-REx, Zero-shot RE. The statistics
of all the datasets can be found in Table 2.
Following Karpukhin et al. (2020), we use the text chunks from the 2018 Wikipedia dump as the retrieval
corpus. The articles are split by section, and long sections are further split into equally sized text chunks
containing fewer than 100 words each, leading to a total of 21M text chunks.
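A minimal sketch of this chunking scheme (assuming whitespace tokenization into words; the exact splitting rules of Karpukhin et al. (2020) may differ):

```python
def chunk_section(section_text: str, max_words: int = 100) -> list[str]:
    """Split one article section into roughly equal-sized chunks of at most max_words words."""
    words = section_text.split()
    if len(words) <= max_words:
        return [section_text]
    num_chunks = -(-len(words) // max_words)   # ceiling division: how many chunks are needed
    chunk_size = -(-len(words) // num_chunks)  # distribute words as evenly as possible
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]
```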
Hyperparameters. We use the top-40 retrieved text chunks for a given example to generate the
fine-tuning samples and use e5 as the retriever for the main results. We fine-tune both Gemma-2-9B-
Base and Mistral-Nemo-12B-Base using 8 H100 GPUs. For both models, we use the chat template
corresponding to Gemma-2-9B-Chat and Mistral-Nemo-12B-Instruct respectively when tuning the
models. We use the axolotl1 codebase for their tuning. For Gemini-1.0-Pro tuning, we use the Google
Cloud Tuning API2 with the default settings. The hyperparameters can be found in Table 3.
Training RAG instruction templates. The RAG instruction templates for different training datasets
can be found in Table 4.
1 https://github.com/axolotl-ai-cloud/axolotl
2 https://cloud.google.com/vertex-ai/generative-ai/docs/model-reference/tuning
Training RAG answer templates. The RAG answer templates for different training datasets can be
found in Table 5.
Hyperparameters. For all the compared LLMs, we conduct top-p sampling (p = 1) and set the maximum
number of generated tokens to 32. For Gemma-2 series models, we use the huggingface
inference pipeline3. For Gemini series models, we use the Google Cloud Inference API4. For other
series of LLMs, we utilize the vLLM5 codebase for efficient generation.
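As an illustration of this generation setup, a minimal vLLM-based sketch (the model name and prompt are placeholders, not the exact configuration used in the paper):

```python
from vllm import LLM, SamplingParams

# Placeholder model; any HF-format long-context LLM supported by vLLM could be used here.
llm = LLM(model="mistralai/Mistral-Nemo-Instruct-2407")
sampling_params = SamplingParams(top_p=1.0, max_tokens=32)  # settings described above

prompts = ["<instruction + retrieved passages>\nQuestion: ...\nAnswer:"]
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)
```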
Evaluation RAG instruction templates. The RAG instruction templates for different testing datasets
can be found in Table 6.
Evaluation RAG answer templates. The RAG answer templates for different testing datasets are all:
"Question: {question}. Answer:"
3 https://huggingface.co/docs/transformers/
4 https://cloud.google.com/vertex-ai/generative-ai/docs/model-reference/inference
5 https://github.com/vllm-project/vllm
Table 7 | Training instruction templates for RAG tuning with intermediate reasoning.
Training RAG Answer templates. The RAG answer templates for different training datasets can be
found in Table 8.
Table 8 | Training answer templates for RAG tuning with intermediate reasoning.
Instructions to generate intermediate reasoning from Gemini-1.5-pro. The prompt that guides
Gemini-1.5-pro for intermediate reasoning generation can be found in Table 9.
Table 9 | Prompts for generating intermediate reasoning with Gemini-1.5-pro, by task.

Task: NQ
Prompt:
Read the following documents relevant to the given question: {question}
{reference}
Please identify documents that are useful to answer the given question: {question}, and explain how the contents lead to the answer: {answers}.
If none of the documents is aligned with the answer, in that case, you have to explain the answer only based on your own knowledge, without referring to the provided information.
Note that the question may be compositional and require intermediate analysis to deduce the final answer.
Make sure your response is grounded and provides clear reasoning details followed by a concise conclusion.

Task: Wizard of Wikipedia
Prompt:
Read the following documents relevant to the given conversation: {question}
{reference}
Please identify documents that are useful to provide a response to a conversation: {question} and explain how the contents lead to the response: {answers}.
If none of the documents is aligned with the response, in that case, you have to explain the response only based on your own knowledge, without referring to the provided information.
Make sure your response is grounded and provides clear reasoning details followed by a concise conclusion.

Task: FEVER
Prompt:
Read the following documents relevant to the given question: {question}
{reference}
Please identify documents that are useful to verify a fact: {question} (Return SUPPORTS if it is correct and return REFUTES if it is not correct.), and explain how the contents lead to the answer: {answers}.
If none of the documents is aligned with the answer, in that case, you have to explain the answer only based on your own knowledge without referring to the provided information.
Make sure your response is grounded and provides clear reasoning details followed by a concise conclusion.

Task: MMLU
Prompt:
Read the following documents relevant to the given question: {question}
{reference}
Please identify documents that are useful to answer the given question: {question} with options: {choices}, and explain how the contents lead to the answer: {answers}.
If none of the documents is aligned with the answer, in that case, you have to explain the answer only based on your own knowledge, without referring to the provided information.
Make sure your response is grounded and provides clear reasoning details followed by a concise conclusion.
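The sketch below shows how such a prompt could be sent to Gemini-1.5-pro through the Vertex AI Python SDK to collect intermediate reasoning traces. The project/location values are placeholders, the prompt is abridged (see the NQ row above for the full text), and the SDK usage is an assumption on our part rather than the authors' exact pipeline.

# Sketch of generating intermediate reasoning with Gemini-1.5-pro via the
# Vertex AI Python SDK (assumed interface; project and location are placeholders).
import vertexai
from vertexai.generative_models import GenerativeModel

vertexai.init(project="my-project", location="us-central1")
model = GenerativeModel("gemini-1.5-pro")

# Abridged NQ prompt from Table 9; the full prompt text appears above.
NQ_PROMPT = (
    "Read the following documents relevant to the given question: {question}\n"
    "{reference}\n"
    "Please identify documents that are useful to answer the given question: {question}, "
    "and explain how the contents lead to the answer: {answers}. ..."
)

def generate_reasoning(question: str, reference: str, answers: str) -> str:
    prompt = NQ_PROMPT.format(question=question, reference=reference, answers=answers)
    return model.generate_content(prompt).text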
Hyperparameters. For all the compared LLMs, we use top-p sampling (p = 1) and set the maximum
number of generated tokens to 256. For Gemma-2 series models, we use the Hugging Face inference
pipeline. For other LLM families, we use the vLLM codebase for efficient generation.
Evaluation RAG instruction templates. The RAG instruction templates for different testing datasets
can be found in Table 10.
Table 10 | Testing instruction templates for RAG tuning with intermediate reasoning.
Evaluation RAG answer templates. The RAG answer templates for different testing datasets are all:
"Question: {question}. Answer:"
Case study (question, ground truth, retrieved passages, and model predictions):

Question: Which film features the Dawes Tomes Mousley Grubbs Fidelity Fiduciary Bank?

Ground Truth: Mary Poppins

Retrieved Passages:
Doc 1 (Title: "Fidelity Fiduciary Bank"): Fidelity Fiduciary Bank "Fidelity Fiduciary Bank" is a song from Walt Disney's film "Mary Poppins", and it is composed by Richard M. Sherman and Robert B. Sherman. The song sung by the stodgy old bankers at the "Dawes, Tomes, Mousely, Grubbs Fidelity Fiduciary Bank", led by the "Elder Mr. Dawes" (Nackvid Keyed), to George Banks's two children, Jane and Michael, in an attempt to get Michael Banks to invest his tuppence in the bank. As the song continues the pressure is on George, a junior clerk at the bank, to sway Michael. When Michael finally, and ever so slightly, opens ...
Doc 2 (Title: "Fidelity Fiduciary Bank"): convince Michael to invest his money goes like this: has as its origins an essay by C. C. Turner titled 'Money London' in the book edited by G. R. Sims called "Living London" (London: 1903): Fidelity Fiduciary Bank "Fidelity Fiduciary Bank" is a song from Walt Disney's film "Mary Poppins", and it is composed by Richard M. Sherman and Robert B. Sherman. The song sung by the stodgy old bankers at the "Dawes, Tomes, Mousely, Grubbs Fidelity Fiduciary Bank", led by the "Elder Mr. Dawes" (Nackvid Keyed), to George Banks's two children, Jane and Michael, in an attempt to get ...
...
Doc 40 (Title: "The Firm (soundtrack)"): The Firm (soundtrack) The Firm is an album by American pianist Dave Grusin released in 1993, recorded for the GRP label. This album is a soundtrack to the Motion Picture "The Firm" directed by Sidney Pollack. It reached No. 131 on "Billboard"'s Contemporary Jazz chart. This soundtrack was a 1993 Academy Award nominee for Best Original Score. All tracks composed by Dave Grusin; except where indicated In 2015 La-La Land Records issued a remastered and expanded edition, featuring the film score on disc one and the 1993 soundtrack album and bonus tracks on disc two. Italicised tracks on disc one ...

Model Predictions:
Gemma-2-9B-Chat: Fidelity Fiduciary Bank
Implicit RAG FT: Mary Poppins (1964 film)
RAG FT with Reasoning: Documents 1 and 2 are useful to answer the question. Both documents explicitly state that the song "Fidelity Fiduciary Bank" from the musical film *Mary Poppins* features the bank "Dawes, Tomes, Mousely, Grubbs Fidelity Fiduciary Bank". Therefore, the film that features the mentioned bank is Mary Poppins.
[Figure 11 plots: RAG accuracy vs. number of retrieved passages (0-40) on (a) TriviaQA, (b) PopQA, (c) HotpotQA, (d) 2WikiMultiHopQA, (e) WebQuestions, (f) Bamboogle, (g) ASQA, (h) T-REx, (i) zsRE, comparing Gemma-2-9B-Chat, RAG FT, and Direct FT.]
Figure 11 | Generalization ability of LLMs fine-tuned with RAG-specific data (RAG FT). RAG FT
consistently outperforms the chat LLM w. RAG and the model fine-tuned directly on question-answer
pairs (Direct FT), demonstrating that RAG FT enables the LLM to effectively extract knowledge from
retrieved context on unseen tasks. Note that Direct FT is evaluated without retrieval to align with its
training paradigm and all others are evaluated with retrieval augmentation. (LLM: Gemma-2-9B-Base)
Due to space limitations, Figure 6 shows the results of RAG fine-tuning with intermediate reasoning
on only five datasets. The full results on all nine datasets with Gemma-2-9B models can be found in
Figure 12. Note that due to the computational complexity of inference with reasoning augmentation,
results are shown for 1000 randomly sampled queries for each dataset.
[Figure 12 plots: RAG accuracy vs. number of retrieved passages (0-40) on nine datasets, including (a) TriviaQA, (b) PopQA, (c) HotpotQA, (d) 2WikiMultiHopQA, (e) WebQuestions, (f) Bamboogle, comparing Gemma-2-9B-Chat, RAG FT, RAG FT w. Int, and Direct FT.]
Figure 12 | Evaluating the impact of intermediate reasoning on the performance of RAG-tuned LLMs.
Results demonstrate that fine-tuning with an intermediate reasoning step (RAG FT w. Int) leads
to further improvements compared to implicit RAG fine-tuning (RAG FT) and direct fine-tuning
(Direct FT). Direct FT is evaluated without retrieval to align with its training paradigm and all others
are evaluated with retrieval augmentation. Due to the computational complexity of inference with
reasoning augmentation, results are shown for 1000 randomly-sampled queries from each dataset.
(LLM: Gemma-2-9B-Base)
[Figure 13 plots (caption not recoverable from the source text): RAG accuracy vs. number of retrieved passages (0-40) on (a) TriviaQA, (b) PopQA, (c) HotpotQA, and additional datasets, comparing Mistral-Nemo-12B-Chat, RAG FT, RAG FT w. Int, and Direct FT.]
[Figure 14 plots: RAG accuracy vs. number of retrieved passages (0-40) on (a) TriviaQA, (b) PopQA, (c) HotpotQA, (d) 2WikiMultiHopQA, (e) WebQuestions, (f) Bamboogle, comparing Gemini-1.0-Pro, RAG FT, RAG FT w. Int, and Direct FT.]
Figure 14 | Evaluating RAG-specific tuning with Gemini-1.0-Pro models. Results demonstrate that fine-
tuning with an intermediate reasoning step (RAG FT w. Int) leads to further improvements compared
to implicit RAG fine-tuning (RAG FT), while implicit RAG fine-tuning outperforms LLMs without RAG-
specific tuning (Gemini-1.0-Pro) and direct fine-tuning (Direct FT). Direct FT is evaluated without
retrieval to align with its training paradigm and all others are evaluated with retrieval augmentation.
Due to the Gemini-1.0-Pro API call credit limitation, results are shown for 1000 randomly-sampled
queries from each dataset. (LLM: Gemini-1.0-Pro)
To investigate the influence of training data size on the effectiveness of RAG-specific tuning, we
fine-tune the Gemma-2-9B-Base model using varying amounts (5k to 200k samples) of mixed training
data from NQ, WoW, FEVER, and MMLU. Table 11 presents the evaluation results on the NQ dataset,
demonstrating a clear positive correlation between the scale of training data and the performance of
the resulting LLM in RAG. Increasing the amount of training data consistently leads to improved
accuracy, highlighting the benefits of leveraging larger datasets for fine-tuning LLMs in RAG
applications.
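The sketch below illustrates the shape of this data-scaling ablation: subsample the mixed RAG-specific training set to a target size before fine-tuning. The subsampling helper and the fine-tuning entry point are our own placeholders; only the 5k and 200k endpoints are stated in the text, with the remaining sizes reported in Table 11.

# Illustrative sketch of the data-scaling ablation (not the authors' code).
import random

def subsample(mixed_training_data: list[dict], size: int, seed: int = 0) -> list[dict]:
    # Draw a fixed-size random subset of the mixed NQ/WoW/FEVER/MMLU training data.
    rng = random.Random(seed)
    return rng.sample(mixed_training_data, size)

# sizes = [5_000, ..., 200_000]           # endpoints from the text; full grid in Table 11
# for size in sizes:
#     subset = subsample(mixed_training_data, size)
#     fine_tune("google/gemma-2-9b", subset)   # placeholder fine-tuning entry point
#     evaluate_on_nq(...)                      # results reported in Table 11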
Table 12 | Combining RAG-specific data with general SFT data for enhanced LLM performance in
RAG.
Having established the effectiveness of RAG-specific fine-tuning for improving LLM performance
in RAG tasks, we now investigate whether combining RAG-specific data with general SFT data can
further enhance performance while preserving the LLM's general capabilities (e.g., reasoning and
long-form generation), as a way to assess whether the proposed tuning methods could be useful for
building foundation models. We train the Gemma-2-9B model using two different strategies:
(1) SFT data only: the LLM is trained solely on general SFT data (Ultrachat 200k). (2) SFT data
+ RAG-specific data: the LLM is trained on a combination of Ultrachat 200k and 50k RAG-specific
samples (the same data used in Figure 5). We evaluate the resulting models on MT-Bench to assess
their general language capabilities and on NQ and TriviaQA to measure their RAG performance.
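A minimal sketch of the two training mixtures is given below, assuming the Hugging Face datasets library. The dataset identifiers and the local file name for the RAG-specific data are illustrative, and the sketch assumes the RAG-specific samples have already been converted to the same chat format as Ultrachat 200k.

# Sketch of the two training mixtures compared above (illustrative only).
from datasets import load_dataset, concatenate_datasets

ultrachat = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")

# (1) SFT data only
sft_only = ultrachat

# (2) SFT data + RAG-specific data (assumes the 50k RAG-specific samples are
# stored locally in the same schema as the SFT data)
rag_specific = load_dataset("json", data_files="rag_specific_50k.json", split="train")
sft_plus_rag = concatenate_datasets([ultrachat, rag_specific])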
Table 12 presents the results, demonstrating that incorporating RAG-specific data into the SFT pro-
cess can significantly improve the LLM’s performance on RAG tasks while maintaining its performance
on general language tasks. This finding suggests that combining task-specific and general-purpose
data during fine-tuning can be a viable strategy for enhancing LLMs in specialized applications without
compromising their overall capabilities.