LLMs之RAG:《Inference Scaling for Long-Context Retrieval Augmented Generation长上下文检索增强生成的推理扩展》翻译与解读
导读:这篇论文的核心主题是长文本检索增强生成 (RAG) 的推理规模扩展。论文针对现有 RAG 方法在处理长文本上下文时存在的效率和有效性问题,提出了一种新的推理规模扩展策略,并构建了一个计算分配模型来优化资源利用。
>> 背景痛点:
● 长文本 LLMs 的推理计算成本高:虽然长文本大型语言模型 (LLMs) 可以处理更长的上下文,但其推理计算成本随着上下文长度的增加而急剧增长。
● 现有 RAG 方法的局限性:现有的 RAG 方法主要关注于增加检索到的知识数量(例如,检索更多或更长的文档),而忽略了如何有效利用这些知识。单纯增加上下文长度并不总是能提高性能,甚至可能因为噪声增加而导致性能下降。 长文本 LLMs 难以在复杂任务中有效定位相关信息,并且超过一定阈值(如top-10文档)的检索结果反而会降低性能。
>> 具体的解决方案:论文提出了两种推理规模扩展策略:
● 基于示例的 RAG (DRAG):将多个 RAG示例作为演示提供给 LLM,利用 LLMs 的长文本能力进行一次性生成。通过增加检索到的文档数量和上下文示例数量来扩展推理计算。通过示例学习如何定位相关信息并应用到响应生成中。
● 迭代式基于示例的 RAG (IterDRAG):将复杂的查询分解成更简单的子查询,并通过交错检索和生成来迭代地回答这些子查询。通过增加迭代次数来扩展推理计算。构建推理链以弥合多跳查询的组合性差距。
>> 核心思路步骤:
(1) 提出 DRAG 和 IterDRAG 两种策略:这两种策略分别通过增加上下文示例和迭代生成来扩展推理计算。
(2) 定义有效上下文长度:作为衡量推理计算的指标,它包含所有迭代中的输入 token 总数。
(3) 寻找最佳性能:对于给定的计算预算(最大有效上下文长度),通过枚举不同的推理参数组合,找到能够最大化性能的最佳参数配置。
(4) 建立推理扩展定律:通过实验,发现最佳配置下的 RAG 性能与有效上下文长度之间存在近乎线性的关系。
(5) 构建计算分配模型:建立一个模型来量化 RAG 性能与不同推理参数之间的关系,该模型可以预测不同推理配置下的 RAG 性能,并指导最佳计算分配。
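下面给出一个极简的示意代码(假设性实现,并非论文官方代码),用于说明上述第 (2)~(3) 步的思路:按简化方式估算有效上下文长度,在给定预算内枚举不同的推理参数组合(检索文档数、上下文示例数、迭代次数),并选出实测性能最优的配置。其中 evaluate_config、各参数取值范围以及 token 估算方式均为假设。

```python
from itertools import product

def effective_context_length(num_docs, num_shots, num_iters,
                             tokens_per_doc=1000, tokens_per_shot=2000):
    """有效上下文长度:所有迭代中输入 token 的总和(此处为简化估算)。"""
    per_step = num_docs * tokens_per_doc + num_shots * tokens_per_shot
    return per_step * num_iters

def best_config_under_budget(budget, evaluate_config,
                             doc_options=(5, 10, 20, 50, 100),
                             shot_options=(0, 1, 2, 4, 8),
                             iter_options=(1, 2, 4)):
    """在有效上下文长度不超过 budget 的配置中,返回实测性能最高的参数组合。"""
    best_cfg, best_score = None, float("-inf")
    for k_docs, m_shots, n_iters in product(doc_options, shot_options, iter_options):
        if effective_context_length(k_docs, m_shots, n_iters) > budget:
            continue  # 超出计算预算的配置直接跳过
        score = evaluate_config(k_docs, m_shots, n_iters)  # 假设的评测函数:返回数据集上的指标
        if score > best_score:
            best_cfg, best_score = (k_docs, m_shots, n_iters), score
    return best_cfg, best_score

# 用法示例:以一个随意构造的打分函数演示调用方式(实际应替换为真实评测)
cfg, score = best_config_under_budget(
    budget=100_000,
    evaluate_config=lambda d, s, i: 0.1 * d ** 0.5 + 0.05 * s + 0.2 * i,
)
print(cfg, score)
```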
>> 优势:
● 显著提升 RAG 性能:与基线方法相比,DRAG 和 IterDRAG 在多个基准数据集上取得了显著的性能提升,最高可达 58.9%。
● 优越的扩展性:DRAG 和 IterDRAG 比单纯增加文档数量具有更好的扩展性,能够更有效地利用测试时间的计算资源。
● 计算分配模型的实用性:该模型可以预测不同推理配置下的 RAG 性能,并指导最佳计算资源分配,提高了资源利用效率。
● 良好的泛化能力:计算分配模型在不同数据集上的泛化性能良好,能够应用于各种知识密集型任务。
>> 论文结论和观点:
● 在最佳推理参数配置下,长文本 RAG 的性能与测试时间计算量之间存在近乎线性的关系。
● DRAG 和 IterDRAG 两种策略能够有效地扩展推理计算,并显著提高 RAG 的性能。
● 计算分配模型能够准确地预测不同推理配置下的 RAG 性能,并指导最佳计算资源分配。
● 检索质量对 RAG 性能有重要影响,需要改进检索方法以提高相关性和减少噪声。
● 长文本建模能力的提升对于进一步提高 RAG 性能至关重要。
总而言之,这篇论文系统地研究了长文本 RAG 的推理规模扩展问题,提出了两种有效的推理策略和一个能够优化计算资源分配的模型,为提高长文本 RAG 的性能提供了新的思路和方法。 研究结果表明,在合理的计算资源分配下,增加计算量可以近乎线性地提高 RAG 的性能,这为未来长文本 RAG 的研究提供了重要的理论指导和实践参考。
《Inference Scaling for Long-Context Retrieval Augmented Generation》翻译与解读
地址 | 论文地址:https://arxiv.org/pdf/2410.04343 |
时间 | 2024年10月6日 |
作者 | Google DeepMind团队等 |
Abstract
The scaling of inference computation has unlocked the potential of long-context large language models (LLMs) across diverse settings. For knowledge-intensive tasks, the increased compute is often allocated to incorporate more external knowledge. However, without effectively utilizing such knowledge, solely expanding context does not always enhance performance. In this work, we investigate inference scaling for retrieval augmented generation (RAG), exploring strategies beyond simply increasing the quantity of knowledge. We focus on two inference scaling strategies: in-context learning and iterative prompting. These strategies provide additional flexibility to scale test-time computation (e.g., by increasing retrieved documents or generation steps), thereby enhancing LLMs' ability to effectively acquire and utilize contextual information. We address two key questions: (1) How does RAG performance benefit from the scaling of inference computation when optimally configured? (2) Can we predict the optimal test-time compute allocation for a given budget by modeling the relationship between RAG performance and inference parameters? Our observations reveal that increasing inference computation leads to nearly linear gains in RAG performance when optimally allocated, a relationship we describe as the inference scaling laws for RAG. Building on this, we further develop the computation allocation model to estimate RAG performance across different inference configurations. The model predicts optimal inference parameters under various computation constraints, which align closely with the experimental results. By applying these optimal configurations, we demonstrate that scaling inference compute on long-context LLMs achieves up to 58.9% gains on benchmark datasets compared to standard RAG. | 推理计算的扩展解锁了长上下文大语言模型(LLMs)在各种环境中的潜力。对于知识密集型任务,增加的计算资源通常用于纳入更多的外部知识。然而,如果不能有效地利用这些知识,单纯地扩展上下文并不总是能提高性能。在这项工作中,我们研究了检索增强生成(RAG)的推理扩展,探索超越简单增加知识量的策略。我们专注于两种推理扩展策略:上下文学习和迭代提示。这些策略提供了额外的灵活性来扩展测试时计算(例如,通过增加检索文档或生成步骤),从而增强LLMs有效获取和利用上下文信息的能力。我们探讨了两个关键问题:(1) 当配置最优时,RAG性能如何从推理计算的扩展中受益?(2) 我们能否通过建模RAG性能与推理参数之间的关系来预测给定预算下的最佳测试时计算分配?我们的观察表明,在优化分配的情况下,增加推理计算几乎可以线性地提升RAG性能,我们将这种关系描述为RAG的推理扩展定律。基于此,我们进一步开发了计算分配模型,以估计不同推理配置下的RAG性能。该模型预测在各种计算约束下的最佳推理参数,这些预测与实验结果高度一致。通过应用这些最优配置,我们证明了在长上下文LLMs上扩展推理计算在基准数据集上相比标准RAG可实现高达58.9%的性能提升。 |
Keywords: Inference scaling, Retrieval augmented generation, Long-context LLMs | 关键词:推理扩展,检索增强生成,长上下文LLMs |
1 Introduction
Long-context large language models (LLMs) are designed to handle extended input sequences, enabling them to process and understand longer context (e.g., Gemini 1.5 Pro with up to 2M tokens) (Achiam et al., 2023; Reid et al., 2024; Team et al., 2023). Combined with increased inference computation, long-context LLMs demonstrate improved performance across various downstream tasks (Agarwal et al.; Snell et al., 2024). For example, many-shot in-context learning (ICL) can match the performance of supervised fine-tuning by providing extensive in-context examples (Bertsch et al., 2024). Particularly for knowledge-intensive tasks that leverage retrieval augmented generation (RAG), increasing the quantity or size of retrieved documents up to a certain threshold consistently enhances the performance (Jiang et al., 2024; Ram et al., 2023; Xu et al., 2024). Previous studies on inference scaling for RAG focus on expanding the retrieved knowledge by increasing the number or lengths of retrieved documents (Jiang et al., 2024; Shao et al., 2024; Xu et al., 2024). However, only emphasizing on the knowledge quantity without providing further guidance presents certain limitations. On one hand, current long-context LLMs still have limited ability to effectively locate relevant information in ultra-long sequences upon challenging tasks (Kuratov et al., 2024; Li et al., 2024). For instance, the optimal performance of long-context LLMs is often achieved without fully utilizing the maximum length (Agarwal et al.). On the other hand, numerous studies show that retrieving over soft thresholds (e.g., top-10 documents) leads to a performance plateau and may even cause declines (Kuratov et al., 2024; Lee et al., 2024a; Ram et al., 2023). Such performance drops may be traced back to the increased noise within context, which causes distraction and adversely affects generation (Yoran et al., 2024; Zhang et al., 2024). As a result, inference scaling of long-context RAG remains challenging for existing methods. | 长上下文大语言模型(LLMs)设计用于处理扩展的输入序列,使它们能够处理和理解更长的上下文(例如,Gemini 1.5 Pro 最多支持2M个token)(Achiam等人,2023;Reid等人,2024;Team等人,2023)。结合增加的推理计算,长上下文LLMs在各种下游任务中表现出改进的性能(Agarwal等人;Snell等人,2024)。例如,多示例上下文学习(ICL)可以通过提供大量上下文示例来匹配监督微调的性能(Bertsch等人,2024)。特别是对于利用检索增强生成(RAG)的知识密集型任务,增加检索文档的数量或大小直至某个阈值会持续提升性能(Jiang等人,2024;Ram等人,2023;Xu等人,2024)。 以往关于RAG推理扩展的研究集中在通过增加检索文档的数量或长度来扩大检索到的知识(Jiang等人,2024;Shao等人,2024;Xu等人,2024)。然而,仅强调知识量而不提供进一步指导存在一定的局限性。一方面,当前的长上下文LLMs在面对具有挑战性的任务时,在超长序列中有效定位相关信息的能力仍然有限(Kuratov等人,2024;Li等人,2024)。例如,长上下文LLMs的最佳性能往往是在未充分利用最大长度的情况下实现的(Agarwal等人)。另一方面,许多研究表明,检索数量超过软阈值(如前10篇文档)会导致性能进入平台期,甚至出现下降(Kuratov等人,2024;Lee等人,2024a;Ram等人,2023)。这种性能下降可能源于上下文中噪声的增加,噪声会分散模型的注意力并对生成产生负面影响(Yoran等人,2024;Zhang等人,2024)。因此,对现有方法而言,长上下文RAG的推理扩展仍然具有挑战性。 |
In this work, we leverage a broader range of strategies to comprehensively explore how RAG benefits from the scaling of inference computation. A straightforward strategy is demonstration-based RAG (DRAG), where multiple RAG examples are provided as demonstrations to utilize the long-context capabilities of LLMs (Brown et al., 2020). DRAG allows models to learn (in-context) how to locate relevant information and apply it to response generation. Nevertheless, the quality of one-step retrieval varies across tasks and often fails to provide sufficient information. Inspired by iterative methods (Trivedi et al., 2023; Yoran et al., 2024), we develop iterative demonstration-based RAG (IterDRAG). IterDRAG learns to decompose input queries into simpler sub-queries and answer them using interleaved retrieval. By iteratively retrieving and generating upon sub-queries, LLMs construct reasoning chains that bridge the compositionality gap for multi-hop queries. Together, these strategies provide additional flexibility in scaling inference computation for RAG, allowing long-context LLMs to more effectively address complex knowledge-intensive queries. Building on these strategies, we investigate multiple ways to scale up inference computation. Here, we measure computation by considering the total number of input tokens across all iterations, referred to as the effective context length. In DRAG, scaling the effective context length can be done by increasing two inference parameters: the number of retrieved documents and in-context examples. In IterDRAG, test-time compute can be further extended by introducing additional generation steps. Since different combinations of inference parameters result in varied allocations of computational resources, our goal is to establish the relationship between RAG performance, different scales and allocations of inference computation. Through extensive experiments on benchmark QA datasets, we demonstrate an almost linear relationship between RAG performance and the scale of effective context length by combining both RAG strategies, as shown in Figure 1 (right). Moreover, our RAG strategies exhibit improved performance than merely scaling the number of documents, achieving state-of-the-art performance with the compact Gemini 1.5 Flash (See evaluation in Figure 2). | 在这项工作中,我们利用更广泛的策略,全面探索RAG如何从推理计算的扩展中受益。一个直接的策略是基于演示的RAG(DRAG),其中提供多个RAG示例作为演示,以利用LLMs的长上下文能力(Brown等人,2020)。DRAG允许模型(在上下文中)学习如何定位相关信息并将其应用于响应生成。然而,单步检索的质量因任务而异,通常无法提供足够的信息。受迭代方法的启发(Trivedi等人,2023;Yoran等人,2024),我们开发了迭代式基于演示的RAG(IterDRAG)。IterDRAG学习将输入查询分解为更简单的子查询,并使用交错检索来回答它们。通过基于子查询的迭代检索与生成,LLMs构建了推理链,弥合了多跳查询的组合性差距。这些策略共同为RAG的推理计算扩展提供了额外的灵活性,使长上下文LLMs能够更有效地解决复杂的知识密集型查询。 基于这些策略,我们探讨了多种扩展推理计算的方法。在这里,我们通过统计所有迭代中的输入token总数来衡量计算量,称之为有效上下文长度。在DRAG中,可以通过增加两个推理参数来扩展有效上下文长度:检索文档的数量和上下文示例的数量。在IterDRAG中,可以通过引入额外的生成步骤来进一步扩展测试时计算。由于不同的推理参数组合导致计算资源分配的变化,我们的目标是建立RAG性能与不同规模及分配方式的推理计算之间的关系。通过在基准QA数据集上的广泛实验,我们展示了在结合两种RAG策略时,RAG性能与有效上下文长度的规模之间几乎呈线性关系,如图1(右)所示。此外,我们的RAG策略比单纯增加文档数量表现出更好的性能,使用紧凑的Gemini 1.5 Flash达到了最先进的性能(见图2中的评估)。 |
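下面是 IterDRAG 迭代流程的一个极简示意(假设性代码,并非论文或任何库的真实 API):retrieve、generate 为假设的检索与生成接口,build_prompt、count_tokens 为简化的辅助函数;示意了"分解子查询、交错检索、迭代生成"的基本流程,以及将所有迭代的输入 token 累加为有效上下文长度的计量方式。

```python
def count_tokens(text):
    # 粗略的 token 计数(按空白切分),仅用于示意
    return len(text.split())

def build_prompt(query, docs, steps):
    # 极简的提示拼接:检索文档 + 已有的 (子查询, 中间答案) 推理链 + 当前问题
    doc_block = "\n".join(docs)
    step_block = "\n".join(f"Sub-query: {q}\nIntermediate answer: {a}" for q, a in steps)
    return f"{doc_block}\n{step_block}\nQuestion: {query}"

def iter_drag(query, retrieve, generate, max_iters=4):
    """IterDRAG 流程示意:交错执行检索与生成,直到模型输出最终答案标记。
    retrieve(q) 返回文档字符串列表,generate(prompt) 返回模型文本输出,均为假设接口。"""
    docs = retrieve(query)                  # 初始检索
    steps = []                              # (子查询, 中间答案) 构成的推理链
    effective_tokens = 0                    # 有效上下文长度:所有迭代的输入 token 总和

    for _ in range(max_iters):
        prompt = build_prompt(query, docs, steps)
        effective_tokens += count_tokens(prompt)
        output = generate(prompt)
        if output.startswith("FINAL:"):     # 假设模型用 "FINAL:" 前缀标记最终答案
            return output[len("FINAL:"):].strip(), effective_tokens
        sub_query = output                  # 否则将输出视为下一个子查询
        docs = docs + retrieve(sub_query)   # 为子查询交错补充检索结果
        sub_prompt = build_prompt(sub_query, docs, steps)
        effective_tokens += count_tokens(sub_prompt)
        steps.append((sub_query, generate(sub_prompt)))

    # 达到迭代上限后,利用已有推理链强制生成最终答案
    final_prompt = build_prompt(query, docs, steps)
    effective_tokens += count_tokens(final_prompt)
    return generate(final_prompt), effective_tokens
```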
Drawing from our observations, we examine the relationship between RAG performance and inference computation, which we quantify as the inference scaling laws for RAG. These observed inference scaling laws reveal that RAG performance consistently improves with the expansion of the effective context length under optimal configurations. Consequently, we take a deeper dive into modeling RAG performance with respect to various inference computation allocations. Our goal is to predict the optimal set of inference parameters that maximize the performance across different RAG tasks. To achieve this, we quantitatively model the relationship between RAG performance and varying inference configurations with the computation allocation model for RAG. Using the estimated computation allocation model, the optimal configurations can be empirically determined and generalize well for various scenarios, thereby maximizing the utilization of the computation budget. We summarize our contributions as follows: >> We systematically investigate inference scaling for long-context RAG, for which we introduce two scaling strategies, DRAG and IterDRAG, to effectively scale inference computation. >> We comprehensively evaluate DRAG and IterDRAG, where they not only achieve state-of-the-art performance, but also exhibit superior scaling properties compared to solely increasing the quantity of documents. >> Through extensive experiments on benchmark QA datasets, we demonstrate that when test-time compute is optimally allocated, long-context RAG performance can scale almost linearly with the increasing order of magnitude of the computation budget. >> We quantitatively model the relationship between RAG performance and different inference parameters, deriving the computation allocation model. This model aligns closely with our experimental results and generalize well across scenarios, providing practical guidance for optimal computation allocation in long-context RAG. | 根据我们的观察,我们检查了RAG性能与推理计算之间的关系,我们将其量化为RAG的推理扩展定律。这些观察到的推理扩展定律揭示了在最佳配置下,随着有效上下文长度的扩展,RAG性能持续改善。因此,我们深入研究了针对不同推理计算分配的RAG性能建模。我们的目标是预测最大化不同RAG任务性能的最佳推理参数集。为了实现这一点,我们定量建模了RAG性能与变化的推理配置之间的关系,即RAG的计算分配模型。使用估计的计算分配模型,可以实证确定最佳配置,并在各种情况下表现良好,从而最大化计算预算的利用。我们总结我们的贡献如下: >> 我们系统地研究了长上下文RAG的推理扩展,为此我们引入了两种扩展策略,DRAG和IterDRAG,以有效扩展推理计算。 >> 我们全面评估了DRAG和IterDRAG,它们不仅实现了最先进水平的性能,而且相比单纯增加文档数量,展现了更优的扩展特性。 >> 通过对基准QA数据集的广泛实验,我们证明了当测试时计算最优分配时,长上下文RAG性能几乎可以随着计算预算量级的增加而线性扩展。 >> 我们定量建模了RAG性能与不同推理参数之间的关系,推导出了计算分配模型。该模型与我们的实验结果高度一致,并且在各种场景中表现良好,为长上下文RAG中的最佳计算分配提供了实用指导。 |
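论文中计算分配模型的具体形式此处并未给出,下面仅提供一个极简的示意(假设性代码):用最小二乘法将归一化性能对有效上下文长度的 log10 做线性拟合,对应"最优配置下性能随计算量量级近似线性增长"的观察;其中的数值为虚构的示意数据,并非论文实验结果。

```python
import numpy as np

# 假设的观测数据:各最优配置下的有效上下文长度(token 数)与对应的归一化性能
context_lengths = np.array([1e3, 1e4, 1e5, 1e6])
performance = np.array([0.35, 0.48, 0.62, 0.74])

# 以 log10(有效上下文长度) 为自变量做最小二乘线性拟合
X = np.stack([np.log10(context_lengths), np.ones_like(context_lengths)], axis=1)
coef, *_ = np.linalg.lstsq(X, performance, rcond=None)

def predict_performance(effective_length):
    """用拟合出的简化模型预测给定有效上下文长度下的性能(示意)。"""
    return coef[0] * np.log10(effective_length) + coef[1]

print(predict_performance(5e5))
```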
Figure 1 | Normalized performance vs. effective context lengths on MuSiQue. Each line represents a fixed configuration, scaled by adjusting the number of documents. Red dots and dashed lines represent the optimal configurations and their fitting results. Standard RAG plateaus early at 10^4 tokens; in contrast, DRAG and IterDRAG show near-linear improvement as the effective context length grows.图1 | 在MuSiQue上,归一化性能与有效上下文长度的关系。每条线代表一个固定配置,通过调整文档数量进行缩放。红色点和虚线表示最优配置及其拟合结果。标准RAG在约10^4个token处便早早达到平台期;相比之下,DRAG和IterDRAG随着有效上下文长度的增长表现出接近线性的改善。
Figure 2 | Evaluation accuracy of Gemini 1.5 Flash using different methods: zero-shot QA, many-shot QA, RAG (with an optimal number of documents), DRAG and IterDRAG on benchmark QA datasets. By scaling up inference compute (up to 5M tokens), DRAG consistently outperforms baselines, while IterDRAG improves upon DRAG through interleaving retrieval and iterative generation.图2 | 使用不同方法的Gemini 1.5 Flash在基准QA数据集上的评估准确性:零样本QA,多样本QA,RAG(具有最佳文档数量),DRAG和IterDRAG。通过扩展推理计算(最多达5M个token),DRAG始终优于基线,而IterDRAG则通过交错检索和迭代生成进一步改进DRAG。
6、Discussion
In our experiments, we observe consistent benefits of inference scaling using DRAG and IterDRAG. Combined with the computation allocation model for RAG, this approach enables the derivation of a (nearly) optimal solution for long-context RAG given computation constraints. In the following, we discuss additional factors that may influence the scaling of long-context RAG. | 在我们的实验中,我们观察到使用DRAG和IterDRAG进行推理扩展能够带来一致的收益。结合RAG的计算分配模型,这种方法能够在给定计算约束的情况下,为长上下文RAG得出(近乎)最优的解决方案。接下来,我们将讨论可能影响长上下文RAG扩展的其他因素。 |
Retrieval. One critical factor in improving performance of RAG lies in the quality of the retrieved documents. To study how retrieval impacts final accuracy, we analyze retrieval performance and report the results across different document sizes in Appendix A. In all datasets, recall scores demonstrate improvements as the number of documents increases, approaching near-perfect scores with large document sets (e.g., ∼1k). Despite consistent gains in recall, the results show diminishing returns on discounted ranking metrics like NDCG, indicating increasing distraction within the context. This trend is also evident in Figure 5b, where RAG performance peaks between 100 and 500 documents. Our observations suggest the necessity of refining retrieval (e.g., through re-ranking) to further optimize the document relevance, particularly in cases of complex, multi-hop queries. However, how the inference scaling behavior discovered in this paper would change in the presence of such a refining component remains unknown. Alternatively, iterative retrieval, as seen in IterDRAG, improves recall performance by using simpler, straightforward sub-queries to collect additional context for each intermediate answer. In summary, retrieving more documents improves recall but does not necessarily lead to better generation quality if the documents are not effectively ranked or filtered. This highlights the need for retrieval methods that dynamically adjust to minimize irrelevant content. | 检索。提高RAG性能的一个关键因素在于检索文档的质量。为了研究检索如何影响最终的准确率,我们在附录A中分析了检索性能,并报告了不同文档数量下的结果。在所有数据集中,召回率得分随着文档数量的增加而提高,在使用大型文档集(例如,约1k篇文档)时接近满分。尽管召回率持续提升,但结果显示,在折扣排名指标(如NDCG)上收益递减,这表明上下文内的干扰在增加。这一趋势在图5b中也很明显:RAG性能在100到500篇文档之间达到峰值。我们的观察表明,有必要改进检索(例如,通过重排序)以进一步优化文档的相关性,特别是在复杂的多跳查询场景中。然而,在引入这类改进组件后,本文发现的推理扩展行为会发生怎样的变化仍是未知的。另一方面,IterDRAG中的迭代检索通过使用更简单直接的子查询为每个中间答案收集额外上下文,从而提高了召回性能。总之,检索更多文档可以提高召回率,但如果文档没有被有效地排序或过滤,则不一定能带来更好的生成质量。这凸显了对能够动态调整以尽量减少无关内容的检索方法的需求。 |
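作为补充,下面用一个极简的示意代码(假设性实现)说明上文提到的两类检索指标的计算方式:Recall@k 随检索文档数增加而单调不降,而带位置折扣的 NDCG@k 对排序质量更敏感;其中的文档 ID 与标注数据均为虚构示例。

```python
import math

def recall_at_k(retrieved_ids, gold_ids, k):
    """Recall@k:前 k 个检索结果覆盖了多少比例的标准答案文档。"""
    hits = len(set(retrieved_ids[:k]) & set(gold_ids))
    return hits / max(len(gold_ids), 1)

def ndcg_at_k(retrieved_ids, gold_ids, k):
    """NDCG@k:带位置折扣的排名指标,相关文档排得越靠前得分越高(二值相关性)。"""
    gold = set(gold_ids)
    dcg = sum(1.0 / math.log2(i + 2) for i, doc in enumerate(retrieved_ids[:k]) if doc in gold)
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(len(gold), k)))
    return dcg / ideal if ideal > 0 else 0.0

# 示例:检索更多文档(k 增大)时 Recall 上升,但 NDCG 的提升往往有限
retrieved_ids = ["d3", "d9", "d1", "d7", "d2", "d5"]   # 假设的检索结果排序
gold_ids = ["d1", "d2"]                                 # 假设的标准答案文档
for k in (2, 4, 6):
    print(k, recall_at_k(retrieved_ids, gold_ids, k), round(ndcg_at_k(retrieved_ids, gold_ids, k), 3))
```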
Error Analysis. Despite overall improvements, our error analysis in Appendix F reveals that certain errors persist, particularly in cases of compositional reasoning tasks where multiple hops of reasoning are required. The common errors fall into four categories: (1) inaccurate or outdated retrieval; (2) incorrect or lack of reasoning; (3) hallucination or unfaithful reasoning; and (4) evaluation issues or refusal to answer. The first category highlights the need for enhancing retrieval methods and maintaining a reliable & up-to-date knowledge base, especially for complex questions that rely on multiple supporting facts. In addition, incorrect or missing reasoning steps often result in errors or partially correct answers. In our experiments, we observe that both (1) and (2) are substantially improved with IterDRAG, suggesting the importance of interleaving retrieval and iterative generation for multi-hop queries. Moreover, developing faithful LLMs and strategies to mitigate hallucination could further enhance RAG performance. Finally, we note that existing metrics fail in certain cases (e.g., abbreviations), underscoring the need for more robust and reliable evaluation methods. | 错误分析。尽管总体有所改进,但附录F中的错误分析显示某些错误仍然存在,特别是在需要多跳推理的组合推理任务中。常见的错误分为四类:(1) 不准确或过时的检索;(2) 错误或缺失的推理;(3) 幻觉或不忠实的推理;以及 (4) 评估问题或拒绝回答。第一类错误凸显了增强检索方法、维护可靠且及时更新的知识库的必要性,特别是对于依赖多个支持事实的复杂问题。此外,错误或缺失的推理步骤经常导致错误或仅部分正确的答案。在我们的实验中,我们观察到使用IterDRAG时,(1) 和 (2) 均得到显著改善,这表明交错检索与迭代生成对多跳查询的重要性。此外,开发更忠实的LLMs以及缓解幻觉的策略可以进一步提升RAG性能。最后,我们注意到现有指标在某些情况下(例如,缩写词)会失效,这强调了需要更加稳健和可靠的评估方法。 |
Long-Context Modeling. We also discuss the impact of long-context modeling w.r.t. RAG performance. In summary, we find that retrieving more documents is generally beneficial for RAG performance, as demonstrated in Section 4. Nevertheless, naïvely extending the context length in each generation step does not always lead to better results. Specifically, DRAG performance peaks at around 10^5 tokens, while IterDRAG achieves optimal performance at around 10^6 tokens by leveraging multiple rounds of generation. For instance, as seen in the performance plateau in Figure 1 and Figure 10, LLMs struggle to effectively utilize very long contexts (≥ 10^5 tokens) in each iteration, potentially due to inherent limitations of long-context modeling. Our observations suggest that: (1) the model's ability to identify relevant information from extensive context remains to be improved, especially when presented with large quantity of "similar" documents; (2) the long-context modeling should be further refined to enhance in-context learning capabilities, where multiple lengthy demonstrations are provided. | 长上下文建模。我们还讨论了长上下文建模对RAG性能的影响。总的来说,我们发现检索更多文档通常对RAG性能有益,如第4节所示。然而,在每个生成步骤中简单地延长上下文长度并不总能带来更好的结果。具体来说,DRAG在大约10^5个token处性能达到峰值,而IterDRAG通过利用多轮生成,在大约10^6个token处达到最佳性能。例如,从图1和图10中的性能平台期可以看出,LLMs在每次迭代中难以有效利用非常长的上下文(≥ 10^5个token),这可能源于长上下文建模的固有限制。我们的观察表明:(1) 模型从海量上下文中识别相关信息的能力仍有待提高,尤其是在面对大量"相似"文档时;(2) 长上下文建模应进一步完善,以增强在提供多个长篇演示时的上下文学习能力。 |
Conclusion
In this paper, we explore inference scaling in long-context RAG. By systematically studying the performance with different inference configurations, we demonstrate that RAG performance improves almost linearly with the increasing order of magnitude of the test-time compute under optimal inference parameters. Based on our observations, we derive inference scaling laws for RAG and the corresponding computation allocation model, designed to predict RAG performance on varying hyperparameters. Through extensive experiments, we show that optimal configurations can be accurately estimated and align closely with the experimental results. These insights provide a strong foundation for future research in optimizing inference strategies for long-context RAG. | 在本文中,我们探讨了长上下文RAG中的推理扩展。通过系统地研究不同推理配置下的性能,我们证明了在最优推理参数下,RAG性能几乎可以随着测试时计算量级的增加而线性提升。基于我们的观察,我们推导了RAG的推理扩展定律及相应的计算分配模型,该模型旨在预测不同超参数下的RAG性能。通过大量的实验,我们展示了可以准确估计最优配置,并且与实验结果高度吻合。这些见解为未来在优化长上下文RAG的推理策略方面的研究提供了坚实的基础。 |