1 Introduction
3.1 Emergence of RAG
5.4.2 IVF-PQ: Combining Inverted Indices and Product Quantization
6.5.3 Efficiency-Focused Evaluation
8 Conclusion
10 References
1 Introduction
The review is organized into six main sections: a historical perspective on pre-RAG systems, the foundations of RAG systems, retrieval algorithms in RAG, vector indexing methods for efficient retrieval, empirical evaluation methodologies, and research gaps and future directions. Throughout, we emphasize peer-reviewed academic research that provides theoretical foundations, empirical evaluations, and state-of-the-art approaches in each area.
The earliest formal information retrieval systems were based on Boolean logic. As described by Salton et al. [1983] in their seminal work “Introduction to Modern Information Retrieval”:
“Boolean retrieval models represent documents and queries as sets of terms, with
retrieval based on exact matching using logical operators (AND, OR, NOT).”
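Boolean retrieval over an inverted index can be sketched with set operations: AND, OR, and NOT become intersection, union, and difference over the sets of document IDs that contain each term. The toy corpus, the whitespace tokenization, and the function names below are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each lowercase term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def boolean_and(index, *terms):
    """Documents containing every term (set intersection)."""
    sets = [index.get(t, set()) for t in terms]
    return set.intersection(*sets) if sets else set()

def boolean_or(index, *terms):
    """Documents containing at least one term (set union)."""
    return set().union(*(index.get(t, set()) for t in terms))

def boolean_not(index, all_ids, term):
    """Documents that do not contain the term (set difference)."""
    return set(all_ids) - index.get(term, set())

docs = {
    1: "vector space model for retrieval",
    2: "boolean retrieval with exact matching",
    3: "probabilistic model of relevance",
}
index = build_inverted_index(docs)
print(boolean_and(index, "retrieval", "model"))       # {1}
print(boolean_or(index, "boolean", "probabilistic"))  # {2, 3}
print(boolean_not(index, docs, "retrieval"))          # {3}
```

Because matching is exact, every returned document satisfies the query equally; there is no notion of one match being better than another, which is the limitation the vector space model addresses.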
2.1.2 Vector Space Model
Salton et al. [1975] introduced the Vector Space Model (VSM) in “A Vector Space Model
for Automatic Indexing,” which represented a fundamental shift in information retrieval:
“The vector space model represents documents and queries as vectors in a high-
dimensional space, with each dimension corresponding to a term in the vocabulary.
Similarity between documents and queries is computed using measures such as cosine
similarity.”
This approach enabled ranked retrieval based on degrees of similarity rather than
binary matching, significantly improving retrieval effectiveness.
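A minimal sketch of ranked retrieval in the vector space model, assuming raw term frequency multiplied by log(N/df) as the term weight (one common TF-IDF variant among several) and whitespace tokenization. Documents are ranked by cosine similarity to the query vector.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Sparse TF-IDF vectors (term -> weight dicts) plus the IDF table."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    df = Counter(term for toks in tokenized for term in set(toks))
    idf = {term: math.log(n / count) for term, count in df.items()}
    vectors = [{t: tf * idf[t] for t, tf in Counter(toks).items()} for toks in tokenized]
    return vectors, idf

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def rank(query, docs):
    """Document indices sorted by descending cosine similarity, plus the raw scores."""
    vectors, idf = tfidf_vectors(docs)
    # Query terms that never occur in the corpus receive weight 0.
    q = {t: tf * idf.get(t, 0.0) for t, tf in Counter(query.lower().split()).items()}
    scores = [cosine(q, d) for d in vectors]
    return sorted(range(len(docs)), key=lambda i: scores[i], reverse=True), scores

docs = [
    "neural networks for text retrieval",
    "cooking pasta at home",
    "dense neural retrieval",
]
order, scores = rank("neural retrieval", docs)
print(order)  # [2, 0, 1]
```

Both relevant documents contain the two query terms, but the shorter one ranks first because cosine similarity normalizes by vector length; the unrelated document receives a score of zero rather than being excluded outright.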
Robertson and Spärck Jones [1976] developed the probabilistic relevance framework in “Relevance Weighting of Search Terms,” which provided a theoretical foundation for ranking documents by their probability of relevance to a query.
This work led to the development of the BM25 ranking function [Robertson et al., 1995], which remains influential in modern retrieval systems and serves as a strong baseline in RAG research.
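A compact sketch of BM25 scoring, assuming whitespace tokenization, the commonly used defaults k1 = 1.2 and b = 0.75, and the non-negative IDF variant adopted by Lucene; the classic Robertson and Spärck Jones IDF omits the added 1 inside the logarithm and can become negative for very common terms. The class name and toy corpus are illustrative.

```python
import math
from collections import Counter

class BM25:
    """Okapi BM25 over a small in-memory corpus (whitespace tokenization)."""

    def __init__(self, docs, k1=1.2, b=0.75):
        self.k1, self.b = k1, b
        self.docs = [Counter(d.lower().split()) for d in docs]
        self.lengths = [sum(c.values()) for c in self.docs]
        self.avgdl = sum(self.lengths) / len(self.docs)
        n = len(self.docs)
        df = Counter(t for c in self.docs for t in c)  # each doc counts a term once
        # Non-negative IDF variant (log(1 + ...)); assumed here, not the only choice.
        self.idf = {t: math.log(1 + (n - f + 0.5) / (f + 0.5)) for t, f in df.items()}

    def score(self, query, i):
        tf, dl = self.docs[i], self.lengths[i]
        total = 0.0
        for term in query.lower().split():
            f = tf.get(term, 0)
            if f == 0:
                continue
            # Term-frequency saturation (k1) and document-length normalization (b).
            norm = f + self.k1 * (1 - self.b + self.b * dl / self.avgdl)
            total += self.idf[term] * f * (self.k1 + 1) / norm
        return total

    def rank(self, query):
        return sorted(range(len(self.docs)), key=lambda i: self.score(query, i), reverse=True)

bm25 = BM25([
    "bm25 is a ranking function",
    "dense retrieval uses embeddings",
    "ranking with bm25 and bm25 variants",
])
print(bm25.rank("bm25 ranking"))  # [2, 0, 1]
```

The third document ranks first because it repeats “bm25”, but the saturation controlled by k1 keeps the second occurrence from doubling that term's contribution.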
“LSA addresses the problems of synonymy and polysemy by mapping terms and documents to a lower-dimensional ‘semantic space’.”
Building on this concept, Blei et al. [2003] developed Latent Dirichlet Allocation (LDA) in “Latent Dirichlet Allocation,” a generative probabilistic model for collections of discrete data such as text corpora.
The learning to rank paradigm, formalized by Liu [2009] in “Learning to Rank for Information Retrieval,” applied machine learning techniques to optimize ranking functions:
“Learning to rank aims to automatically learn a ranking model from training data, such
that the model can sort new objects according to their degrees of relevance, preference,
or importance.”
Huang et al. [2013] presented one of the first successful applications of deep learning
to information retrieval in “Learning Deep Structured Semantic Models for Web Search
using Clickthrough Data”:
“We develop a series of new latent semantic models with a deep structure that project
queries and documents into a common low-dimensional space where the relevance of a
document given a query is readily computed as the distance between them.”
This work demonstrated the potential of neural networks to learn effective representations for information retrieval, setting the stage for later developments in neural retrieval models.
2.3 Pre-RAG Neural Retrieval and Question Answering (2015-2019)
“We propose two novel model architectures for computing continuous vector representations of words from very large data sets.”
These word embeddings were quickly applied to information retrieval tasks, with Vulić
and Moens [2015] demonstrating their effectiveness in “Monolingual and Cross-Lingual
Information Retrieval Models Based on (Bilingual) Word Embeddings.”
Guo et al. [2016] introduced DRMM (Deep Relevance Matching Model) in “A Deep Relevance Matching Model for Ad-hoc Retrieval,” which specifically addressed the unique characteristics of relevance matching in information retrieval.
This work highlighted the distinction between semantic matching (finding text with
similar meaning) and relevance matching (finding documents relevant to a query), a
crucial insight for retrieval systems.
Chen et al. [2017] presented DrQA in “Reading Wikipedia to Answer Open-Domain Questions,” which combined a document retriever with a machine reading component:
“Our approach combines a search component based on bigram hashing and TF-IDF
matching with a multi-layer recurrent neural network model trained to extract answers
from text.”
This two-stage approach—retrieve relevant documents, then extract answers—foreshadowed
the architecture of RAG systems, though it relied on traditional sparse retrieval methods
and treated the retrieval and reading components as separate modules.
Devlin et al. [2019] introduced BERT in “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding,” which revolutionized natural language processing with contextual word representations.
Nogueira and Cho [2019] quickly applied BERT to information retrieval in “Passage Re-ranking with BERT,” demonstrating significant improvements in retrieval performance:
“We present a simple re-ranking approach using BERT that achieves state-of-the-art
results on the MSMARCO-Passage Re-Ranking task.”
This work established the effectiveness of transformer-based models for retrieval tasks
and introduced the multi-stage retrieval paradigm that would become standard in many
RAG systems.
Gillick et al. [2019] presented a dual encoder architecture in “Learning Dense Representations for Entity Retrieval,” which used separate encoders for queries and documents with a shared representation space:
“We present a dual encoder architecture for learning dense entity representations for
efficient retrieval.”
This approach enabled efficient similarity search using approximate nearest neighbor
techniques, a key component of modern RAG systems.
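The efficiency argument rests on the fact that document embeddings can be computed once, offline, while only the query is encoded at search time. The sketch below assumes a hypothetical `encode` function (a toy hashed bag of words, not a trained neural encoder) and uses an exact linear scan for maximum inner product search; a real deployment would replace that scan with an approximate nearest neighbor index.

```python
import heapq
import math
import zlib

DIM = 256  # embedding size of the toy encoder (an assumption for this sketch)

def encode(text):
    """Hypothetical stand-in for a trained encoder: an L2-normalized hashed bag of words."""
    vec = [0.0] * DIM
    for token in text.lower().split():
        vec[zlib.crc32(token.encode("utf-8")) % DIM] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

class DualEncoderIndex:
    """Passages are embedded once at indexing time; queries are embedded per search."""

    def __init__(self, passages):
        self.passages = list(passages)
        self.embeddings = [encode(p) for p in self.passages]  # offline step

    def search(self, query, k=2):
        q = encode(query)  # online step: one encoder call per query
        scored = ((sum(a * b for a, b in zip(q, e)), i) for i, e in enumerate(self.embeddings))
        # Exact maximum inner product search; an ANN index would replace this scan.
        return [i for _, i in heapq.nlargest(k, scored)]

passages = [
    "dense passage retrieval for question answering",
    "inverted index for sparse retrieval",
    "graph based nearest neighbor search",
]
index = DualEncoderIndex(passages)
print(index.search("sparse inverted index", k=1))
```

Because the query and passage towers never interact before the final inner product, the passage side can be precomputed for the whole corpus, which is what makes ANN indexing applicable.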
2.4.2 Contrastive Learning for Retrieval
“We present a method that uses the natural structure of conversational data to learn
vector representations of queries and responses that are effective for response suggestion.”
This contrastive learning approach would later be refined and applied to document
retrieval in dense retrieval models like DPR.
Lee et al. [2019] introduced ORQA (Open-Retrieval Question Answering) in “Latent Retrieval for Weakly Supervised Open Domain Question Answering,” which used a pre-trained language model for both retrieval and reading:
“We present a new approach to open-domain QA that learns a retriever and reader
simultaneously, using only question-answer pairs as supervision.”
This work represented a significant step toward RAG, using dense retrieval and integrating retrieval more tightly with the answer generation process, though it still treated them as separate components.
The stage was set for the emergence of RAG with several key developments in 2020:
1. REALM [Guu et al., 2020]: Integrated retrieval directly into pre-training, using
a knowledge retriever to access relevant documents during both pre-training and
fine-tuning.
2. DPR [Karpukhin et al., 2020]: Demonstrated the effectiveness of dense passage retrieval for open-domain question answering, providing a strong retrieval component for RAG systems.
These works marked the beginning of the RAG era, building upon decades of research
in information retrieval, question answering, and neural language models.
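Lewis et al. [2020] introduced retrieval-augmented generation (RAG) in “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks,” presented at NeurIPS 2020:
“We introduce RAG models where the parametric memory is a pre-trained seq2seq model and the non-parametric memory is a dense vector index of Wikipedia, accessed with a pre-trained neural retriever.”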
Their approach utilized a Dense Passage Retriever (DPR) to access relevant information from a knowledge corpus, which was then fed into a sequence-to-sequence model for generation. The authors demonstrated that RAG models outperformed traditional language models on knowledge-intensive tasks, showing significant improvements in factual accuracy and providing provenance for their predictions.
Concurrent with Lewis et al.’s work, Guu et al. [2020] introduced REALM (Retrieval-Augmented Language Model Pre-Training) at ICML 2020. REALM represented a significant advancement by incorporating retrieval mechanisms directly into the pre-training process. As described by the authors:
“To capture knowledge in a more modular and interpretable way, we augment language model pre-training with a latent knowledge retriever, which allows the model to retrieve and attend over documents from a large corpus such as Wikipedia, used during pre-training, fine-tuning and inference.”
A key innovation in REALM was the joint training of the retriever and language
model components through an unsupervised learning approach. This allowed the system
to learn what information to retrieve without explicit supervision, using masked language
modeling as the learning signal.
3.2 Dense Passage Retrieval
The Dense Passage Retrieval (DPR) system, introduced by Karpukhin et al. [2020] at EMNLP, is a fundamental component of many RAG systems. DPR uses dense vector representations of passages and queries to perform efficient similarity search.
DPR employs a bi-encoder framework where separate encoders map queries and passages to a shared vector space. The system is trained using contrastive learning with positive and negative passage examples, creating a retrieval mechanism that captures semantic relationships beyond keyword matching.
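The contrastive objective can be sketched as the negative log-likelihood of the positive passage under a softmax over inner-product similarities. One of the settings in DPR reuses the other questions' positive passages in a batch as negatives (“in-batch negatives”), which is what this sketch implements; DPR also adds BM25-retrieved hard negatives, which are omitted here. The vectors are assumed to be precomputed encoder outputs, and the toy values are illustrative.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def in_batch_nll(query_vecs, passage_vecs):
    """Mean negative log-likelihood of the positive passage for each query.

    passage_vecs[i] is the positive passage for query_vecs[i]; every other
    passage in the batch serves as a negative for that query.
    """
    losses = []
    for i, q in enumerate(query_vecs):
        scores = [dot(q, p) for p in passage_vecs]  # similarity = inner product
        m = max(scores)  # subtract the max before exponentiating for numerical stability
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        losses.append(log_z - scores[i])  # equals -log softmax(scores)[i]
    return sum(losses) / len(losses)

queries = [[1.0, 0.0], [0.0, 1.0]]
passages = [[1.0, 0.0], [0.0, 1.0]]
print(round(in_batch_nll(queries, passages), 4))  # 0.3133
```

Scaling the query vectors up (making positives more clearly separated from negatives) drives the loss toward zero, which is the pressure that shapes the shared embedding space during training.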
Early RAG systems relied on static knowledge bases, typically using Wikipedia or other
fixed corpora as their non-parametric memory. Recent research has focused on extending
RAG to incorporate more dynamic and diverse knowledge sources.
Borgeaud et al. [2022], in their paper “Improving language models by retrieving from
trillions of tokens,” introduced RETRO (Retrieval-Enhanced Transformer), which scales
the retrieval corpus to trillions of tokens. This approach demonstrated that retrieval-
based language models can effectively leverage much larger knowledge bases than previous
systems.
Li et al. [2024] explored various RAG system designs in their COLING paper “Enhancing Retrieval-Augmented Generation: A Study of Best Practices.” Their research systematically investigated key factors including language model size, prompt design, document chunk size, knowledge base size, retrieval depth, query expansion techniques, and contrastive in-context learning.
4 Retrieval Algorithms in RAG Systems
Dense retrieval represents documents and queries as dense vector embeddings in a shared
semantic space, enabling similarity-based retrieval beyond lexical matching. The seminal
work in this area came from Karpukhin et al. [2020], who introduced Dense Passage
Retrieval (DPR) for open-domain question answering:
“We introduce Dense Passage Retrieval (DPR), a new approach for retrieval that
uses dense representations alone, where embeddings are learned from a small number of
questions and passages by a simple dual-encoder framework.”
The success of DPR catalyzed research into dense retrieval methods for RAG systems,
with subsequent work focusing on improving embedding quality, retrieval efficiency, and
domain adaptation.
Despite advances in dense retrieval, sparse retrieval methods remain important in RAG systems due to their interpretability and efficiency. Robertson and Zaragoza [2009] provided the theoretical foundation for BM25, which remains a strong baseline in many retrieval tasks:
“The BM25 weighting scheme has become a de facto standard for probabilistic approaches to document retrieval. It represents a specific instantiation of the probabilistic relevance framework, which provides a principled foundation for designing retrieval functions.”
Lin et al. [2021] demonstrated in their paper “Few-Shot Learning with Siamese Networks and Label Tuning” that sparse retrieval methods like BM25 can still outperform dense retrievers in certain scenarios, particularly when dealing with specialized terminology or when training data is limited.
4.1.3 Hybrid Retrieval
Recognizing the complementary strengths of dense and sparse retrieval, researchers have developed hybrid approaches. Luan et al. [2021] introduced the SPLADE (Sparse Lexical and Expansion) model in their paper “Sparse Lexical and Expansion Representation for Information Retrieval”:
“SPLADE combines the efficiency of sparse retrieval with the effectiveness of dense retrieval by learning sparse expanded representations of queries and documents.”
Their approach uses BERT to predict importance weights for both document terms and related terms, creating sparse representations that capture semantic relationships while maintaining the efficiency benefits of inverted indices.
Nogueira and Cho [2019] describe the multi-stage re-ranking setup as follows:
“We present a simple re-ranking approach using BERT that achieves state-of-the-art results on the MSMARCO-Passage Re-Ranking task. Our approach is based on a two-stage pipeline, where an initial retrieval system based on BM25 is followed by a BERT-based re-ranker.”
This approach has been widely adopted in RAG systems, with initial retrieval using efficient methods (BM25 or approximate nearest neighbor search) followed by more computationally intensive re-ranking of the top candidates.
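The two-stage pattern can be sketched as a function that scores the whole corpus with a cheap first-stage scorer and applies the expensive re-ranker only to the top `n_candidates`. Both scorers here are hypothetical toys: `term_overlap` stands in for BM25, and `toy_reranker` stands in for a BERT cross-encoder of the kind used by Nogueira and Cho [2019].

```python
from typing import Callable, List, Sequence

Scorer = Callable[[str, str], float]

def two_stage_search(query: str, corpus: Sequence[str], first_stage: Scorer,
                     reranker: Scorer, n_candidates: int = 100, k: int = 10) -> List[int]:
    """Score the whole corpus cheaply, then re-rank only the top candidates."""
    by_first_stage = sorted(range(len(corpus)),
                            key=lambda i: first_stage(query, corpus[i]), reverse=True)
    candidates = by_first_stage[:n_candidates]
    # The expensive model is called n_candidates times, not len(corpus) times.
    return sorted(candidates, key=lambda i: reranker(query, corpus[i]), reverse=True)[:k]

def term_overlap(query: str, doc: str) -> float:
    """Cheap lexical scorer (a stand-in for BM25)."""
    return float(len(set(query.split()) & set(doc.split())))

rerank_calls: List[str] = []

def toy_reranker(query: str, doc: str) -> float:
    """Hypothetical stand-in for a cross-encoder; it rewards dense matches."""
    rerank_calls.append(doc)
    return term_overlap(query, doc) / len(doc.split())

corpus = [
    "bert passage re-ranking with bert models and many extra words",
    "passage re-ranking bert",
    "cooking pasta at home",
    "weather forecast today",
]
print(two_stage_search("bert passage re-ranking", corpus, term_overlap, toy_reranker,
                       n_candidates=2, k=1))  # [1]
print(len(rerank_calls))  # 2
```

The first stage ties the two relevant documents; the re-ranker breaks the tie, and it is invoked only twice even though the corpus holds four documents. In practice `n_candidates` sets the cost and recall ceiling of the whole pipeline, since the re-ranker cannot recover documents the first stage missed.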
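Query reformulation techniques have been extensively studied in the academic literature. Mao et al. [2021] introduced Generation-Augmented Retrieval (GAR) in their paper “Generation-Augmented Retrieval for Open-Domain Question Answering,” which augments each query with relevant contexts generated by a pre-trained language model (for example, likely answers or passage titles) before retrieval.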
Building on this work, Wang et al. [2023] proposed Hypothetical Document Embeddings (HyDE) in “Precise Zero-Shot Dense Retrieval without Relevance Labels,” which uses an LLM to generate a hypothetical document that would be relevant to the query, and then uses the embedding of this hypothetical document for retrieval instead of the query embedding.
Contrastive learning has emerged as a powerful technique for training retrieval models. Xiong et al. [2021] introduced ANCE (Approximate Nearest Neighbor Negative Contrastive Learning) in their paper “Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval”:
“ANCE trains the dense retriever with negatives from the model’s own retrieval results,
which are closer to the query than random negatives and thus provide more informative
learning signals.”
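The core of this negative selection can be sketched as taking the highest-ranked passages that are not labeled positive from the current model's own ranking. In ANCE these rankings come from an ANN index that is refreshed asynchronously as training proceeds; here `ranked_ids` is simply assumed to be given, and the function name is hypothetical.

```python
from typing import Iterable, List, Sequence

def mine_hard_negatives(ranked_ids: Sequence[int], positive_ids: Iterable[int],
                        n_negatives: int = 3) -> List[int]:
    """Return the top-ranked passages that are not labeled positive.

    `ranked_ids` is assumed to be the current retriever's ranking for one query,
    for example the output of an ANN search over an index built from a recent
    checkpoint.
    """
    positives = set(positive_ids)
    return [pid for pid in ranked_ids if pid not in positives][:n_negatives]

print(mine_hard_negatives([42, 7, 13, 99, 5], positive_ids=[7]))  # [42, 13, 99]
```

These negatives are the passages the model currently confuses with the answer, so they yield larger gradients than uniformly sampled passages, which is the intuition behind the quoted claim.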
Recent academic research has focused on making retrieval more context-aware in RAG systems. Shao et al. [2023] introduced Self-RAG in their paper “Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection,” which trains a retrieval-aware LLM to decide, through self-reflection, when to retrieve, what to retrieve, and whether to use the retrieved content.
Efficiency considerations are crucial for practical RAG systems. Izacard et al. [2022] addressed this in “Atlas: Few-shot Learning with Retrieval Augmented Language Models,” introducing an efficient pre-training approach for retrieval-augmented language models that enables strong performance with minimal fine-tuning.
5 Vector Indexing Methods for Efficient Retrieval
The challenge of exact nearest neighbor search in high-dimensional spaces has been well documented in the academic literature. Weber et al. [1998] established fundamental theoretical limitations in their seminal paper “A Quantitative Analysis and Performance Study for Similarity-Search Methods in High-Dimensional Spaces”:
“As dimensionality increases, the distance to the nearest neighbor approaches the
distance to the farthest neighbor, making exact nearest neighbor search computationally
intractable for high-dimensional data.”
This phenomenon, known as the “curse of dimensionality,” has motivated the development of approximate nearest neighbor (ANN) search methods that trade perfect accuracy for significant gains in computational efficiency.
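The effect can be illustrated empirically: for points drawn uniformly from the unit hypercube, the relative contrast (farthest distance minus nearest distance, divided by nearest distance) shrinks as dimensionality grows. This is a small simulation under an assumed uniform distribution, not a reproduction of the analysis in Weber et al. [1998].

```python
import math
import random

def relative_contrast(dim, n_points=200, seed=0):
    """(farthest - nearest) / nearest distance from one random query to n random points.

    The query and the points are drawn uniformly from the unit hypercube [0, 1]^dim.
    """
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = [math.dist(query, [rng.random() for _ in range(dim)]) for _ in range(n_points)]
    return (max(dists) - min(dists)) / min(dists)

for dim in (2, 10, 100, 1000):
    print(dim, round(relative_contrast(dim), 3))
```

As the contrast approaches zero, the nearest neighbor is barely closer than the farthest one, so pruning strategies that rely on distance gaps lose their power and exact search degenerates toward a full scan.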
The Hierarchical Navigable Small World (HNSW) algorithm, introduced by Malkov and
Yashunin [2018] in their paper “Efficient and Robust Approximate Nearest Neighbor
Search Using Hierarchical Navigable Small World Graphs,” has become one of the most
widely used indexing methods in RAG systems:
“We present a new approach for the approximate K-nearest neighbor search based on navigable small world graphs with controllable hierarchy (Hierarchical NSW, HNSW). The proposed solution is fully graph-based, without any need for additional search structures.”
HNSW builds upon the concept of navigable small world graphs, organizing vectors in a multi-layer graph structure where each layer contains a subset of the points from lower layers. This hierarchical structure enables efficient search by first navigating through sparse top layers to quickly reach the approximate region of the query point, then refining the search in denser lower layers.
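The layered search can be sketched as greedy descent through the upper layers followed by a wider best-first search (controlled by a beam width `ef`) in the base layer, mirroring the structure of the search procedure in Malkov and Yashunin [2018]. To keep the sketch short and deterministic, each layer's graph is a symmetrized brute-force k-NN graph over a fixed nested subset of points, rather than the paper's incremental insertion with random level assignment and neighbor-selection heuristic; the grid data and all function names are hypothetical.

```python
import heapq
import math

def knn_graph(points, ids, k):
    """Symmetrized brute-force k-NN graph over `ids` (a simplification of HNSW construction)."""
    graph = {i: set() for i in ids}
    for i in ids:
        others = sorted((j for j in ids if j != i), key=lambda j: math.dist(points[i], points[j]))
        for j in others[:k]:
            graph[i].add(j)
            graph[j].add(i)
    return graph

def search_layer(points, graph, query, entry, ef):
    """Best-first search within one layer; returns up to ef (distance, id) pairs, nearest first."""
    d0 = math.dist(points[entry], query)
    visited = {entry}
    candidates = [(d0, entry)]  # min-heap of nodes still to expand
    results = [(-d0, entry)]    # max-heap (negated distances) of the ef best nodes found
    while candidates:
        d, node = heapq.heappop(candidates)
        if d > -results[0][0]:
            break  # the nearest unexpanded node is worse than the worst kept result
        for nb in graph[node]:
            if nb in visited:
                continue
            visited.add(nb)
            dn = math.dist(points[nb], query)
            if len(results) < ef or dn < -results[0][0]:
                heapq.heappush(candidates, (dn, nb))
                heapq.heappush(results, (-dn, nb))
                if len(results) > ef:
                    heapq.heappop(results)  # drop the current worst result
    return sorted((-neg_d, i) for neg_d, i in results)

def hnsw_search(points, layers, query, entry, k=1, ef=8):
    """layers[0] is the base layer; `entry` must belong to the top layer."""
    for graph in reversed(layers[1:]):
        # Upper layers: greedy descent (ef=1) to find an entry point for the layer below.
        entry = search_layer(points, graph, query, entry, ef=1)[0][1]
    return [i for _, i in search_layer(points, layers[0], query, entry, ef)[:k]]

# Toy data: a 10 x 10 grid; the upper layers are nested subsets of the base layer.
points = [(x, y) for x in range(10) for y in range(10)]
base = list(range(len(points)))
mid = [i for i, (x, y) in enumerate(points) if x % 3 == 0 and y % 3 == 0]
top = [i for i, (x, y) in enumerate(points) if x % 9 == 0 and y % 9 == 0]
layers = [knn_graph(points, base, 8), knn_graph(points, mid, 4), knn_graph(points, top, 2)]
print(points[hnsw_search(points, layers, (3.2, 7.9), entry=top[0])[0]])  # (3, 8)
```

On this grid every non-nearest node has a neighbor closer to the query, so the search returns the exact nearest neighbor; on real embeddings the result is approximate, and `ef` trades recall against latency.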
Subsequent academic research has further analyzed and optimized HNSW. Prokhorenkova and Shekhovtsov [2020] provided theoretical analysis in “Graph-based Nearest Neighbor Search: From Practice to Theory,” establishing formal guarantees for graph-based methods like HNSW.
Building on the NSW concept, Subramanya et al. [2019] introduced the Vamana graph
in their paper “DiskANN: Fast Accurate Billion-point Nearest Neighbor Search on a Single
Node,” which improved upon HNSW by using a different graph construction algorithm
that better balances search efficiency and index construction time.
Product Quantization (PQ), introduced by Jégou et al. [2011] in their influential paper “Product Quantization for Nearest Neighbor Search,” has become a fundamental technique for compressing vector representations while maintaining search capability:
“This paper introduces a product quantization based approach for approximate nearest
neighbor search. The idea is to decompose the space into a Cartesian product of low-
dimensional subspaces and to quantize each subspace separately.”
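The decomposition can be sketched as follows: split each vector into m sub-vectors, learn a k-centroid codebook per subspace with k-means, store only the m centroid indices, and answer queries with asymmetric distance computation (ADC), in which the uncompressed query is compared against the quantized database vectors through per-subspace lookup tables. The helper names and the tiny k-means routine are assumptions for illustration; production libraries implement these steps far more efficiently.

```python
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(vectors, k, iters=15, seed=0):
    """Plain Lloyd's k-means used to learn one sub-codebook."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for v in vectors:
            groups[min(range(k), key=lambda j: sq_dist(v, centroids[j]))].append(v)
        # Cluster means; an empty cluster keeps its previous centroid.
        centroids = [[sum(col) / len(g) for col in zip(*g)] if g else centroids[j]
                     for j, g in enumerate(groups)]
    return centroids

def split(v, m):
    """Cut a vector into m equal-length sub-vectors (len(v) must be divisible by m)."""
    s = len(v) // m
    return [v[i * s:(i + 1) * s] for i in range(m)]

def train_pq(vectors, m, k):
    """One k-centroid codebook per subspace."""
    return [kmeans([split(v, m)[i] for v in vectors], k, seed=i) for i in range(m)]

def encode(v, codebooks):
    """Replace each sub-vector by the index of its nearest centroid."""
    return [min(range(len(cb)), key=lambda j: sq_dist(sub, cb[j]))
            for sub, cb in zip(split(v, len(codebooks)), codebooks)]

def decode(code, codebooks):
    return [x for c, cb in zip(code, codebooks) for x in cb[c]]

def adc_distances(query, codes, codebooks):
    """Squared distances from the raw query to every quantized vector via lookup tables."""
    tables = [[sq_dist(sub, c) for c in cb]
              for sub, cb in zip(split(query, len(codebooks)), codebooks)]
    return [sum(tables[i][j] for i, j in enumerate(code)) for code in codes]

rng = random.Random(0)
data = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(200)]
codebooks = train_pq(data, m=4, k=16)         # 4 subspaces of 2 dimensions each
codes = [encode(v, codebooks) for v in data]  # each vector becomes 4 small integers
query = [rng.gauss(0.0, 1.0) for _ in range(8)]
dists = adc_distances(query, codes, codebooks)
print(codes[0], min(range(len(data)), key=dists.__getitem__))
```

With at most 256 centroids per subspace, each code entry fits in one byte, and the lookup tables are built once per query, so scanning a database vector costs m table lookups instead of a full d-dimensional distance computation.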
Building on the foundation of PQ, Ge et al. [2013] introduced Optimized Product Quantization (OPQ) in their paper “Optimized Product Quantization,” which improves upon standard PQ by finding an optimal rotation of the data before applying product quantization, leading to better quantization accuracy and retrieval performance.
The Inverted File Index (IVF) approach, described in detail by Sivic and Zisserman
[2003] in “Video Google: A Text Retrieval Approach to Object Matching in Videos,”
adapts techniques from text retrieval to vector search. In the context of vector search,
IVF partitions the vector space into clusters and builds an inverted index that maps each
cluster to its member vectors.
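In vector search, the IVF idea can be sketched with a k-means coarse quantizer: each vector is appended to the inverted list of its nearest centroid, and a query scans only the lists of its `nprobe` closest centroids. The parameter names `nlist` and `nprobe` follow common usage in vector-search libraries and are assumptions here, as are the small k-means routine and the class name.

```python
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(vectors, k, iters=10, seed=0):
    """Plain Lloyd's k-means, used here as the coarse quantizer."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for v in vectors:
            groups[min(range(k), key=lambda j: sq_dist(v, centroids[j]))].append(v)
        # Recompute each centroid as its cluster mean; keep the old one if the cluster is empty.
        centroids = [[sum(col) / len(g) for col in zip(*g)] if g else centroids[j]
                     for j, g in enumerate(groups)]
    return centroids

class IVFIndex:
    def __init__(self, vectors, nlist=8, seed=0):
        self.vectors = vectors
        self.centroids = kmeans(vectors, nlist, seed=seed)
        # Inverted lists: centroid id -> ids of the vectors assigned to that cell.
        self.lists = [[] for _ in self.centroids]
        for i, v in enumerate(vectors):
            self.lists[self.nearest_cells(v, 1)[0]].append(i)

    def nearest_cells(self, v, n):
        return sorted(range(len(self.centroids)), key=lambda j: sq_dist(v, self.centroids[j]))[:n]

    def search(self, query, k=5, nprobe=2):
        """Exact distances, but only over vectors stored in the nprobe closest cells."""
        candidates = [i for c in self.nearest_cells(query, nprobe) for i in self.lists[c]]
        return sorted(candidates, key=lambda i: sq_dist(query, self.vectors[i]))[:k]

rng = random.Random(1)
data = [[rng.random() for _ in range(4)] for _ in range(300)]
ivf = IVFIndex(data, nlist=8)
query = [0.5, 0.5, 0.5, 0.5]
print(ivf.search(query, k=3, nprobe=2))
```

Setting `nprobe` equal to `nlist` scans every list and reproduces exact search; smaller values scan roughly `nprobe / nlist` of the data and trade recall for speed.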
Baranchuk et al. [2018] examined the combination of inverted indices and product quantization in their paper “Revisiting the Inverted Indices for Billion-Scale Approximate Nearest Neighbors”:
“We argue that the potential of the simple inverted index was not fully exploited in
previous works and advocate its usage both for highly-entangled deep descriptors and
relatively disentangled SIFT descriptors.”
Their research demonstrated that properly optimized inverted indices can outperform
more complex methods for large-scale retrieval tasks.
Recent academic work has focused on hybrid approaches that combine multiple indexing
strategies. Johnson et al. [2019] described such an approach in “Billion-scale similarity
search with GPUs”:
“We present a hybrid system that combines an inverted index with product quantization to achieve both memory efficiency and search speed.”
Question answering has been the primary evaluation domain for RAG systems since
their inception. Kwiatkowski et al. [2019] introduced Natural Questions (NQ), which has
become a standard benchmark:
“Natural Questions contains real user queries issued to Google Search, along with
corresponding Wikipedia pages that might contain the answer and human-annotated
answer spans.”
Lewis et al. [2020] evaluated the original RAG model on NQ together with TriviaQA, WebQuestions, and CuratedTrec, helping establish these open-domain question answering datasets as standard benchmarks for RAG evaluation.
6.1.2 Knowledge-Intensive Tasks
This benchmark enables more comprehensive evaluation across diverse tasks while
using a consistent knowledge source (Wikipedia), facilitating more direct comparisons
between different approaches.
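As RAG systems aim to reduce hallucination in language models, specialized datasets for fact verification have become important. Thorne et al. [2018] introduced FEVER (Fact Extraction and VERification), in which claims must be classified as Supported, Refuted, or NotEnoughInfo using evidence retrieved from Wikipedia.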
The retrieval component of RAG systems is typically evaluated using standard information retrieval metrics. Karpukhin et al. [2020] reported top-k accuracy in their evaluation of Dense Passage Retrieval; the other metrics below are also in common use:
• Top-k Accuracy: The percentage of questions for which the answer is contained in the top-k retrieved passages.
• Mean Reciprocal Rank (MRR): The average of the reciprocal ranks of the first relevant passage across all queries (counting 0 for a query with no relevant passage retrieved).
• Precision@k: The proportion of relevant passages among the top-k retrieved passages.
• Recall@k: The proportion of all relevant passages that are retrieved in the top-k results.
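The four definitions translate directly into code. In this sketch, each run is a ranked list of passage IDs per query, and the set of relevant (answer-containing) passage IDs per query is assumed to be precomputed; DPR's own top-k accuracy decides relevance by checking whether a passage contains the answer string. The function names and toy data are illustrative.

```python
def top_k_accuracy(runs, relevant, k):
    """Fraction of queries with at least one relevant passage in the top k."""
    return sum(any(p in relevant[q] for p in run[:k]) for q, run in enumerate(runs)) / len(runs)

def mean_reciprocal_rank(runs, relevant):
    """Mean of 1/rank of the first relevant passage (0 when none is retrieved)."""
    total = 0.0
    for q, run in enumerate(runs):
        for rank, p in enumerate(run, start=1):
            if p in relevant[q]:
                total += 1.0 / rank
                break
    return total / len(runs)

def precision_at_k(run, rel, k):
    """Share of the top-k passages that are relevant."""
    return sum(p in rel for p in run[:k]) / k

def recall_at_k(run, rel, k):
    """Share of all relevant passages that appear in the top k."""
    return sum(p in rel for p in run[:k]) / len(rel) if rel else 0.0

runs = [["d3", "d1", "d2"], ["d5", "d4", "d6"]]
relevant = [{"d1", "d2"}, {"d9"}]
print(top_k_accuracy(runs, relevant, 2), mean_reciprocal_rank(runs, relevant))  # 0.5 0.25
```

Top-k accuracy and recall@k only reward getting relevant passages into the candidate set, which matches how a RAG generator consumes the top-k context, whereas MRR and precision@k also reward placing them early.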
Xiong et al. [2021] emphasized the importance of recall in their work on dense retrieval:
“Recall is particularly important for retrieval systems that feed into downstream components, as relevant documents missed at the retrieval stage cannot be recovered later.”
For evaluating the generation quality of RAG systems, researchers employ both reference-
based and reference-free metrics. Lewis et al. [2020] used the following metrics:
• Exact Match (EM): The percentage of generated answers that exactly match the
reference answer.
• F1 Score: The harmonic mean of precision and recall at the token level between
the generated and reference answers.
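Both metrics are usually computed after normalizing the answer strings. The normalization below (lowercasing, removing punctuation and the articles a/an/the, collapsing whitespace) follows the widely used SQuAD evaluation convention; whether Lewis et al. [2020] applied exactly this normalization is an assumption here.

```python
import re
import string
from collections import Counter

PUNCTUATION = set(string.punctuation)

def normalize_answer(text):
    """Lowercase, strip punctuation and the articles a/an/the, collapse whitespace."""
    text = "".join(ch for ch in text.lower() if ch not in PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, reference):
    return float(normalize_answer(prediction) == normalize_answer(reference))

def token_f1(prediction, reference):
    pred = normalize_answer(prediction).split()
    ref = normalize_answer(reference).split()
    common = sum((Counter(pred) & Counter(ref)).values())  # overlapping tokens, with multiplicity
    if common == 0:
        return 0.0
    precision, recall = common / len(pred), common / len(ref)
    return 2 * precision * recall / (precision + recall)

print(exact_match("The Eiffel Tower.", "eiffel tower"))  # 1.0
print(round(token_f1("Paris, France", "Paris"), 3))      # 0.667
```

The second example shows the strictness discussed below: an answer that contains the reference plus extra correct detail fails EM entirely and receives only partial F1 credit.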
Izacard and Grave [2021] noted limitations of these metrics in their work on Fusion-
in-Decoder:
“Exact Match and F1 metrics can be overly strict, especially when multiple valid
answer formulations exist. We therefore also report ROUGE-L, which better captures
semantic similarity.”
Evaluating RAG systems end-to-end presents unique challenges, as both retrieval and
generation quality must be considered. Shao et al. [2023] proposed specialized metrics
for Self-RAG:
“We introduce Retrieval Precision and Retrieval Utility metrics, which measure both
whether the system retrieves when it should and whether the retrieved information is
actually used in generation.”
Similarly, Chen et al. [2023] developed metrics specifically for evaluating RAG systems
in their RAGAS framework:
“RAGAS provides automated metrics for evaluating RAG pipelines, including answer
relevance, answer faithfulness, context relevance, and context precision.”
This methodology helps isolate the contribution of specific components and design
choices to overall system performance.
Guu et al. [2020] frame the comparative evaluation of REALM as follows:
“We compare against state-of-the-art models for both explicit and implicit knowledge storage on three popular Open-QA benchmarks, and find that we outperform all previous methods by a significant margin.”
These comparisons typically include both sparse retrieval methods (e.g., BM25) and
other dense retrieval approaches to provide a comprehensive performance assessment.
6.3.3 Cross-Domain Evaluation
“BEIR provides a diverse set of information retrieval tasks to evaluate the zero-shot
transfer capabilities of retrieval models.”
This approach helps identify whether performance improvements are robust across
different domains or limited to specific datasets.
Despite the prevalence of automated metrics, human evaluation remains important for
assessing aspects of RAG systems that are difficult to quantify automatically. Shuster
et al. [2021] emphasized this in their work on knowledge-grounded dialogue:
“We conduct human evaluations to assess factual accuracy, relevance, and engagingness of responses, finding that automated metrics often fail to capture important qualitative differences.”
Human evaluations typically involve presenting judges with system outputs and asking
them to rate various aspects, such as factual accuracy, relevance, and coherence.
“Reference-based evaluation assumes that the reference answer is complete and correct,
which is often not the case for complex questions with multiple valid answers.”
This challenge has led to increased interest in reference-free evaluation methods that
assess the quality and factuality of generated text without requiring exact matches to
reference answers.
6.4.2 Retrieval Evaluation Complexity
Evaluating the retrieval component of RAG systems presents unique challenges. Karpukhin
et al. [2020] noted:
“The standard approach of using a single positive passage per question underestimates
retrieval performance, as multiple passages may contain the answer.”
This has led researchers to develop more nuanced evaluation approaches, such as considering all passages containing the answer as positive examples or using human judgments to assess relevance.
RAG systems involve inherent trade-offs between retrieval accuracy, generation quality, computational efficiency, and memory usage. Borgeaud et al. [2022] addressed this challenge in their evaluation of RETRO:
“We systematically evaluate the trade-offs between retrieval corpus size, computational
cost, and model performance, providing insights into the scaling behavior of retrieval-
augmented language models.”
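Recent work has focused on developing automated methods for assessing the factuality of generated text. Honovich et al. [2022] introduced TRUE in “TRUE: Re-evaluating Factual Consistency Evaluation,” a benchmark that standardizes the assessment of factual consistency metrics across diverse tasks.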
This approach enables more scalable evaluation of the factual accuracy of RAG systems without requiring extensive human annotation.
6.5.2 Retrieval-Aware Evaluation
Recognizing the tight coupling between retrieval and generation in RAG systems, researchers have developed evaluation methodologies that explicitly account for this relationship; Si et al. [2023] proposed one such methodology.
This approach acknowledges that retrieval errors propagate to generation and that
the utility of retrieved information depends on how effectively it is incorporated into the
generated output.
“We evaluate not only accuracy but also indexing time, retrieval latency, and memory
usage, providing a comprehensive assessment of practical deployment considerations.”
This trend reflects the growing recognition that RAG systems must balance effectiveness with efficiency to be practically useful.
The academic literature reveals several important research gaps and opportunities for
future work in algorithmic optimization of retrieval mechanisms in RAG systems:
Most current vector indexing methods are optimized for static datasets, but RAG systems often require dynamic updates to their knowledge bases. Iwasaki and Miyazaki [2018] addressed this challenge in “Optimization of Indexing Based on k-Nearest Neighbor Graph for Proximity Search in High-dimensional Data,” but more research is needed on efficient index maintenance for dynamic knowledge bases.
The performance of retrieval algorithms and indexing methods can vary significantly
across different domains and data types. There is a need for research on domain-specific
optimization techniques that can adapt to the characteristics of particular applications.
While empirical evaluations have demonstrated the effectiveness of various retrieval and indexing methods, the theoretical understanding of why certain approaches work better than others in specific contexts remains limited. More research is needed to establish stronger theoretical foundations for RAG systems.
Current evaluation metrics for RAG systems often focus on retrieval accuracy or generation quality in isolation. There is a need for holistic evaluation frameworks that consider the end-to-end performance of RAG systems, including both retrieval and generation components.
7.5 Hardware Acceleration
As RAG systems scale to larger knowledge bases, hardware acceleration becomes increasingly important for maintaining real-time performance. Research on specialized hardware architectures and algorithms optimized for specific hardware platforms represents a promising direction.
Current RAG systems predominantly rely on dense vector representations for retrieval, but there are opportunities to integrate symbolic knowledge and reasoning into the retrieval process.
7.7 Multi-Modal RAG Systems
Most current RAG research focuses on text, but there are growing opportunities for
multi-modal RAG systems that can retrieve and generate across different modalities.
• Encrypted Search: Developing methods for similarity search over encrypted vectors without compromising privacy.
• Federated RAG: Creating distributed RAG architectures that can leverage knowledge across multiple private data sources.
These research gaps and future directions represent significant opportunities for advancing the state of the art in RAG systems through algorithmic optimization of retrieval mechanisms and data structures.
8 Conclusion
This literature review has examined the academic research on algorithmic optimization
of retrieval mechanisms in RAG systems, with a focus on the data structures that enable
efficient knowledge access. The review has covered the historical evolution of information
retrieval systems, the foundations of RAG systems, retrieval algorithms, vector indexing
methods, and empirical evaluation methodologies, highlighting key contributions from
peer-reviewed academic literature.
The field of RAG systems is rapidly evolving, with ongoing research addressing challenges in scalability, efficiency, and accuracy. The integration of advanced retrieval algorithms with sophisticated vector indexing methods represents a promising approach to improving the performance of RAG systems across a wide range of applications.
Future research in this area will likely focus on addressing the identified research
gaps, particularly in dynamic knowledge management, domain-specific optimization, and
hardware acceleration. As RAG systems continue to evolve, algorithmic optimization of
retrieval mechanisms will remain a critical area of research, with significant implications
for the development of more capable and efficient AI systems.
This systematic review of the literature provides a solid foundation for future research
and development in the field of retrieval-augmented generation, with particular emphasis
on algorithmic and data structure approaches to optimizing knowledge access.
9 Visual Elements and Comparisons
| Indexing Method | Key Paper | Year | Time Complexity (Query) | Space Complexity | Key Advantages | Key Limitations |
|---|---|---|---|---|---|---|
| Flat (Brute Force) | - | - | O(nd) | O(nd) | Exact results, simple implementation | Prohibitively slow for large datasets |
| LSH | Indyk & Motwani | 1998 | O(d log n) | O(nd) | Theoretical guarantees, simple concept | Performance degrades in high dimensions |
| HNSW | Malkov & Yashunin | 2018 | O(d log n) | O(nd) | State-of-the-art performance, logarithmic search time | High memory overhead, complex implementation |
| IVF | Sivic & Zisserman | 2003 | O(d(n/k)) | O(nd) | Simple concept, efficient for first-stage retrieval | Performance depends on clustering quality |
| PQ | Jégou et al. | 2011 | O(d + k) | O(n + kd) | Dramatic memory reduction, fast distance computation | Lossy compression, reduced accuracy |
| IVF-PQ | Baranchuk et al. | 2018 | O(d(n/k) + k) | O(n + kd) | Balances speed and memory efficiency | Complex parameter tuning, still lossy |
Retrieval Paradigm | Representative Papers | Key Characteristics | Advantages | Limitations | Applications
Sparse Retrieval | Robertson & Zaragoza (2009) | Term-based matching with inverted indices | Interpretable, efficient, no training required | Limited semantic understanding | General search, specialized domains
Dense Retrieval | Karpukhin et al. (2020) | Neural embeddings in a shared vector space | Semantic matching, handles synonyms | Requires training data | Open-domain QA, semantic search
Hybrid Retrieval | Luan et al. (2021) | Combines sparse and dense approaches | Leverages strengths of both paradigms | Increased complexity | Production systems
Multi-stage Retrieval | Nogueira & Cho (2019) | Initial retrieval followed by re-ranking | Balances efficiency and accuracy | Pipeline complexity | Web search, large-scale retrieval
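The sparse, dense, and hybrid rows can be illustrated together. The stdlib-only sketch below scores a toy corpus with BM25, using the common non-negative IDF variant log(1 + (N - df + 0.5)/(df + 0.5)) and the usual defaults k1 = 1.2 and b = 0.75; ranks the same documents by cosine similarity of hypothetical, hand-written embedding vectors that stand in for a trained dense encoder; and fuses the two rankings. The fusion step uses reciprocal rank fusion (RRF) with k = 60, a common practical choice that is assumed here for illustration and is not necessarily the combination method of Luan et al. (2021).

```python
import math
from collections import Counter
from typing import Dict, List, Sequence


def bm25_scores(query: List[str], docs: List[List[str]], k1: float = 1.2, b: float = 0.75) -> List[float]:
    """BM25 score of every tokenized document for a tokenized query."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    df = Counter(term for d in docs for term in set(d))  # document frequency per term
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for t in query:
            if tf[t] == 0:
                continue  # exact term match only: "cats" does not match "cat"
            idf = math.log(1 + (n - df[t] + 0.5) / (df[t] + 0.5))
            norm = tf[t] + k1 * (1 - b + b * len(d) / avgdl)  # document-length normalization
            score += idf * tf[t] * (k1 + 1) / norm
        scores.append(score)
    return scores


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * c for a, c in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(c * c for c in v)))


def rank(scores: List[float]) -> List[int]:
    """Document indices by descending score; ties keep index order."""
    return sorted(range(len(scores)), key=lambda i: -scores[i])


def rrf(rankings: List[List[int]], k: int = 60) -> List[int]:
    """Reciprocal rank fusion (assumed fusion rule): each list adds 1 / (k + rank)."""
    fused: Dict[int, float] = {}
    for ranking in rankings:
        for pos, doc in enumerate(ranking, start=1):
            fused[doc] = fused.get(doc, 0.0) + 1.0 / (k + pos)
    return sorted(fused, key=lambda d: -fused[d])


docs = ["the cat sat on the mat".split(),
        "dogs and cats are popular pets".split(),
        "stock markets fell sharply".split()]
query = ["cat"]
# Hypothetical 2-d embeddings standing in for a trained dense encoder.
query_vec = [1.0, 0.0]
doc_vecs = [[0.6, 0.8], [1.0, 0.1], [0.0, 1.0]]

sparse = rank(bm25_scores(query, docs))
dense = rank([cosine(query_vec, v) for v in doc_vecs])
hybrid = rrf([sparse, dense])
print("sparse:", sparse, "dense:", dense, "hybrid:", hybrid)
```

In this toy setting the sparse scorer gives no credit to the "cats" document because it lacks the exact term, while the hypothetical dense vectors rank it first; after fusion, both documents that either retriever favors rise above the unrelated stock-market document, which stays last.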
Figure: RAG System Architecture, showing a retrieval component and a generation component with the elements user query, query processing, vector indexing algorithms, retrieved documents, context integration, language model, response generation, and generated response.
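The stages named in the architecture figure can be sketched as a small pipeline. The example below is a toy, stdlib-only illustration under stated assumptions: word-overlap ranking stands in for a vector-index lookup, the prompt template is an assumed format, and `generate` is a hypothetical callable in place of a real language-model API.

```python
import re
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Document:
    doc_id: str
    text: str


def process_query(query: str) -> List[str]:
    """Query processing: lowercase and extract word tokens (placeholder for rewriting/expansion)."""
    return re.findall(r"\w+", query.lower())


def retrieve(tokens: List[str], corpus: List[Document], top_k: int = 2) -> List[Document]:
    """Toy retrieval component: rank by token overlap, standing in for a vector-index lookup."""
    def overlap(doc: Document) -> int:
        return len(set(tokens) & set(re.findall(r"\w+", doc.text.lower())))
    ranked = sorted(corpus, key=overlap, reverse=True)
    return [d for d in ranked[:top_k] if overlap(d) > 0]


def integrate_context(query: str, docs: List[Document]) -> str:
    """Context integration: place retrieved passages, tagged by id, ahead of the question."""
    context = "\n".join(f"[{d.doc_id}] {d.text}" for d in docs)
    return f"Answer using the context below.\n\nContext:\n{context}\n\nQuestion: {query}\nAnswer:"


def rag_answer(query: str, corpus: List[Document], generate: Callable[[str], str]) -> str:
    """Query processing -> retrieval -> context integration -> generation."""
    docs = retrieve(process_query(query), corpus)
    return generate(integrate_context(query, docs))


def echo(prompt: str) -> str:
    """Hypothetical stand-in for a language-model call: returns the prompt unchanged."""
    return prompt


corpus = [
    Document("d1", "HNSW builds a layered proximity graph"),
    Document("d2", "BM25 is a sparse ranking function"),
    Document("d3", "Paris is the capital of France"),
]
print(rag_answer("How does HNSW build its graph?", corpus, echo))
```

Because the stub model echoes its input, the printed prompt shows exactly which passages the retrieval component forwarded to the generation component; replacing `retrieve` with an approximate nearest neighbor index and `echo` with a model call would leave the order of stages unchanged.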
Figure: Hierarchical Navigable Small World, showing Layer 3 (top), Layer 2, Layer 1, and Layer 0 (base).
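The layered structure in the figure is what drives HNSW search. The sketch below shows two pieces described by Malkov & Yashunin: the exponentially decaying random level assigned to each inserted element, l = floor(-ln(U) · mL) with mL = 1/ln(M), and the coarse-to-fine descent in which the search moves greedily toward the query on each layer and uses the result as the entry point for the layer below. To stay short, it assumes a hypothetical hand-built toy graph, omits graph construction, and uses a pure greedy step (beam width ef = 1) on every layer, whereas the full algorithm runs a beam search with a larger ef on the base layer.

```python
import math
import random
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, ...]
Layer = Dict[int, List[int]]  # node id -> neighbor ids on this layer


def sample_level(m: int, rng: random.Random) -> int:
    """Random top layer for a new element: floor(-ln(U) * mL) with mL = 1 / ln(M)."""
    m_l = 1.0 / math.log(m)
    u = 1.0 - rng.random()  # U in (0, 1], so the logarithm is defined
    return int(-math.log(u) * m_l)


def greedy_search_layer(query: Point, entry: int, layer: Layer, points: Sequence[Point]) -> int:
    """Move to the closest neighbor until no neighbor is closer to the query (ef = 1)."""
    current = entry
    while True:
        # The current node is listed first so it wins ties, which guarantees termination.
        candidates = [current] + layer.get(current, [])
        best = min(candidates, key=lambda node: math.dist(query, points[node]))
        if best == current:
            return current
        current = best


def hnsw_descend(query: Point, entry: int, layers: List[Layer], points: Sequence[Point]) -> int:
    """Search from the top layer down; layers[0] is the base layer containing every node."""
    for layer in reversed(layers):
        entry = greedy_search_layer(query, entry, layer, points)
    return entry


# Hypothetical toy graph: 8 points on a line, with sparser upper layers.
points = [(float(x), 0.0) for x in range(8)]
base = {i: [j for j in (i - 1, i + 1) if 0 <= j < 8] for i in range(8)}
layers = [base, {0: [4], 4: [0, 7], 7: [4]}, {0: [7], 7: [0]}]
print(hnsw_descend((5.2, 0.0), entry=0, layers=layers, points=points))  # → 5

rng = random.Random(0)
levels = [sample_level(16, rng) for _ in range(1000)]
print(sum(1 for lvl in levels if lvl == 0) / len(levels))
```

With M = 16, an element reaches layer 1 or higher with probability 1/16, so roughly 94% of elements live only on the base layer; this geometric thinning is what lets the top layers act as a coarse index that the descent traverses in a few long hops (0 to 7 to 4) before refining locally on the base layer (4 to 5).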
10 References
Artem Babenko and Victor Lempitsky. Additive quantization for extreme vector compression. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 931–938, 2014.
Dmitry Baranchuk, Artem Babenko, and Yury Malkov. Revisiting the inverted indices for billion-scale approximate nearest neighbors. In Proceedings of the European Conference on Computer Vision, pages 202–216, 2018.
David M Blei, Andrew Y Ng, and Michael I Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003.
Sebastian Borgeaud, Arthur Mensch, Jordan Hoffmann, Trevor Cai, Eliza Rutherford, Katie Millican, George van den Driessche, Jean-Baptiste Lespiau, Bogdan Damoc, Aidan Clark, et al. Improving language models by retrieving from trillions of tokens. In International Conference on Machine Learning, pages 2206–2223. PMLR, 2022.
Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. Reading Wikipedia to answer open-domain questions. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, pages 1870–1879, 2017.
Jerry Chen, Yu Guo, Swati Agarwal, and William Yang Wang. RAGAS: Automated evaluation of retrieval augmented generation. arXiv preprint arXiv:2309.15217, 2023.
Scott Deerwester, Susan T Dumais, George W Furnas, Thomas K Landauer, and Richard Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391–407, 1990.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics, pages 4171–4186, 2019.
Yunfan Gao, Yun Xiong, Xinyu Gao, Kangqi Jia, Jinyang Pan, Yixin Bi, Yi Dai, Jian Sun, Meng Wang, and Haofen Wang. Retrieval-augmented generation for large language models: A survey. arXiv preprint arXiv:2312.10997, 2023.
Tiezheng Ge, Kaiming He, Qifa Ke, and Jian Sun. Optimized product quantization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(4):744–755, 2013.
Daniel Gillick, Sayali Kulkarni, Larry Lansing, Alessandro Presta, Jason Baldridge, Eugene Ie, and Diego Garcia-Olano. Learning dense representations for entity retrieval. In Proceedings of the 23rd Conference on Computational Natural Language Learning, pages 528–537, 2019.
Jiafeng Guo, Yixing Fan, Qingyao Ai, and W Bruce Croft. A deep relevance matching model for ad-hoc retrieval. In Proceedings of the 25th ACM International on Conference on Information and Knowledge Management, pages 55–64. ACM, 2016.
Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Ming-Wei Chang. REALM: Retrieval-augmented language model pre-training. In Proceedings of the 37th International Conference on Machine Learning, pages 3929–3938, 2020.
Matthew Henderson, Rami Al-Rfou, Brian Strope, Yun-Hsuan Sung, László Lukács, Ruiqi Guo, Sanjiv Kumar, Balint Miklos, and Ray Kurzweil. Efficient natural language response suggestion for smart reply. arXiv preprint arXiv:1705.00652, 2017.
Or Honovich, Roee Aharoni, Jonathan Herzig, et al. TRUE: Re-evaluating factual consistency evaluation. arXiv preprint arXiv:2204.04991, 2022.
Po-Sen Huang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Acero, and Larry Heck. Learning deep structured semantic models for web search using clickthrough data. In Proceedings of the 22nd ACM International Conference on Information & Knowledge Management, pages 2333–2338. ACM, 2013.
Piotr Indyk and Rajeev Motwani. Approximate nearest neighbors: Towards removing the curse of dimensionality. In Proceedings of the 30th Annual ACM Symposium on Theory of Computing, pages 604–613, 1998.
Gautier Izacard and Edouard Grave. Leveraging passage retrieval with generative models for open domain question answering. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics, pages 874–880, 2021.
Gautier Izacard, Patrick Lewis, Maria Lomeli, Lucas Hosseini, Fabio Petroni, Timo Schick, Jane Dwivedi-Yu, Armand Joulin, Sebastian Riedel, and Edouard Grave. Atlas: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299, 2022.
Jeff Johnson, Matthijs Douze, and Hervé Jégou. Billion-scale similarity search with GPUs. IEEE Transactions on Big Data, 7(3):535–547, 2019.
Mandar Joshi, Eunsol Choi, Daniel S Weld, and Luke Zettlemoyer. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, pages 1601–1611, 2017.
Hervé Jégou, Matthijs Douze, and Cordelia Schmid. Product quantization for nearest neighbor search. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(1):117–128, 2011.
Yannis Kalantidis and Yannis Avrithis. Locally optimized product quantization for approximate nearest neighbor search. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2321–2328, 2014.
Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. Dense passage retrieval for open-domain question answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, pages 6769–6781, 2020.
Kalpesh Krishna, Siddharth Khosla, Jeffrey P. Bigham, and Zachary C. Lipton. Generating question-answer hierarchies. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics, pages 5546–5561, 2021.
Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, et al. Natural Questions: A benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:453–466, 2019.
Kenton Lee, Ming-Wei Chang, and Kristina Toutanova. Latent retrieval for weakly supervised open domain question answering. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6086–6096, 2019.
Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. Retrieval-augmented generation for knowledge-intensive NLP tasks. In Advances in Neural Information Processing Systems, volume 33, pages 9459–9474, 2020.
Sharath Li, Lennart Stenzel, Carsten Eickhoff, and Seyed Ali Bahrainian. Enhancing retrieval-augmented generation: A study of best practices. In Proceedings of the 2024 International Conference on Computational Linguistics, pages 449–461, 2024.
Jimmy Lin, Xueguang Ma, Sheng-Chieh Lin, Jheng-Hong Yang, Ronak Pradeep, and Rodrigo Nogueira. Few-shot learning with siamese networks and label tuning. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 2356–2362, 2021.
Tie-Yan Liu. Learning to rank for information retrieval. Foundations and Trends in Information Retrieval, 3(3):225–331, 2009.
Yi Luan, Jacob Eisenstein, Kristina Toutanova, and Michael Collins. Sparse, dense, and attentional representations for text retrieval. Transactions of the Association for Computational Linguistics, 9:329–345, 2021.
Yury A Malkov and Dmitry A Yashunin. Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(4):824–836, 2018.
Yuning Mao, Pengcheng He, Xiaodong Liu, Yelong Shen, Jianfeng Gao, Jiawei Han, and Weizhu Chen. Generation-augmented retrieval for open-domain question answering. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics, pages 4089–4100, 2021.
Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
Rodrigo Nogueira and Kyunghyun Cho. Passage re-ranking with BERT. arXiv preprint arXiv:1901.04085, 2019.
Fabio Petroni, Aleksandra Piktus, Angela Fan, Patrick Lewis, Majid Yazdani, Nicola De Cao, James Thorne, Yacine Jernite, Vladimir Karpukhin, Jean Maillard, et al. KILT: A benchmark for knowledge intensive language tasks. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics, pages 2523–2544, 2021.
Yingqi Qu, Yuchen Ding, Jing Liu, Kai Liu, Ruiyang Ren, Wayne Xin Zhao, Daxiang Dong, Hua Wu, and Haifeng Wang. RocketQA: An optimized training approach to dense passage retrieval for open-domain question answering. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics, pages 5835–5847, 2021.
Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392, 2016.
Hannah Rashkin, Xi Victoria Lin, Guy Tyen, Maarten Sap, Wen-tau Yih, and Yejin Choi. Measuring factuality in text generation with attributable sources. arXiv preprint arXiv:2305.14251, 2023.
Stephen Robertson and Hugo Zaragoza. The probabilistic relevance framework: BM25 and beyond. Foundations and Trends in Information Retrieval, 3(4):333–389, 2009.
Stephen E Robertson and Karen Spärck Jones. Relevance weighting of search terms. Journal of the American Society for Information Science, 27(3):129–146, 1976.
Gerard Salton, Anita Wong, and Chung-Shu Yang. A vector space model for automatic indexing. Communications of the ACM, 18(11):613–620, 1975.
Gerard Salton, Edward A Fox, and Harry Wu. Introduction to modern information retrieval. In Proceedings of the ACM Conference, 1983.
Yixuan Shao, Xinyang Geng, Yizhe Liu, Chulaka Gunasekara, Dian Jiang, Nanyun Peng, and Marjan Ghazvininejad. Self-RAG: Learning to retrieve, generate, and critique through self-reflection. arXiv preprint arXiv:2310.11511, 2023.
Kurt Shuster, Spencer Poff, Moya Chen, Douwe Kiela, and Jason Weston. Retrieval augmentation reduces hallucination in conversation. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 3784–3803, 2021.
Chengwei Si, Zhengyuan Chen, Ning Ding, and William Yang Wang. Prompting and evaluating large language models for retrieval-augmented generation. arXiv preprint arXiv:2306.10023, 2023.
Josef Sivic and Andrew Zisserman. Video Google: A text retrieval approach to object matching in videos. In Proceedings of the 9th IEEE International Conference on Computer Vision, pages 1470–1477, 2003.
Suhas J Subramanya, Fnu Devvrit, Rohan Kadekodi, Ravishankar Krishnaswamy, and Harsha Vardhan Simhadri. DiskANN: Fast accurate billion-point nearest neighbor search on a single node. In Advances in Neural Information Processing Systems, volume 32, pages 13766–13776, 2019.
Nandan Thakur, Nils Reimers, Andreas Rücklé, Abhishek Srivastava, and Iryna Gurevych. BEIR: A heterogeneous benchmark for zero-shot evaluation of information retrieval models. In Advances in Neural Information Processing Systems, volume 34, pages 21545–21561, 2021.
Ivan Vulić and Marie-Francine Moens. Monolingual and cross-lingual information retrieval models based on (bilingual) word embeddings. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 363–372. ACM, 2015.
Roger Weber, Hans-Jörg Schek, and Stephen Blott. A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces. In Proceedings of the 24th International Conference on Very Large Data Bases, pages 194–205, 1998.
Lee Xiong, Chenyan Xiong, Ye Li, Kwok-Fung Tang, Jialin Liu, Paul Bennett, Junaid Ahmed, and Arnold Overwijk. Approximate nearest neighbor negative contrastive learning for dense text retrieval. In International Conference on Learning Representations, 2021.
Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D Manning. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2369–2380, 2018.