IR and RL NLP
CLASSICAL IR
Step-by-Step Explanation
1. User Query
● The user submits a free-text query (e.g., "When was Stanford University founded?"), which is broken into query terms such as "Stanford," "University," and "founded."
2. Term Lookup
● For each query term, the system performs a lookup in an inverted index, which maps terms
to the list of documents where they appear.
● Each term (e.g., "founded") retrieves a list of documents that contain the term:
○ Example: "founded" → {doc47, doc39, doc41, ...}
● Synonyms, stemming, or misspellings may result in multiple similar terms being searched
(e.g., "Stanford" and "Stamford").
○ Example: "Stanford" → {doc47, doc39, doc68, ...} and "Stamford" → {doc21,
doc11, doc17, ...}
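The lookup step can be illustrated with a minimal in-memory inverted index; the documents, IDs, and query below are toy stand-ins, not the actual corpus from the example.
```python
from collections import defaultdict

# Toy corpus (document IDs and contents are illustrative).
docs = {
    "doc39": "stanford university was founded in 1885",
    "doc47": "leland stanford founded a railroad company",
    "doc21": "stamford is a city in connecticut",
}

# Build the inverted index: term -> set of document IDs containing that term.
inverted_index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        inverted_index[term].add(doc_id)

# Term lookup: each query term retrieves its posting list.
query = "stanford university founded"
for term in query.split():
    print(term, "->", sorted(inverted_index.get(term, set())))
```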
3. Document Scoring
● Once the documents corresponding to all query terms are retrieved, the system assigns a
relevance score to each document based on the query terms.
Scoring Process:
● Each candidate document is scored by how many of the query terms it contains and how informative those terms are; weighting schemes such as TF-IDF and BM25 (described below) make this precise.
Example:
● Document doc39 has all three terms ("Stanford," "University," and "founded"), so it receives
a high relevance score.
● Document doc47 has two terms ("Stanford" and "founded"), so it receives a slightly lower
score.
● Document doc64 contains relevant terms but fewer matches, resulting in a lower score.
4. Ranked Results
● Documents are sorted by relevance score in descending order and the top-ranked results are returned to the user (here: doc39, then doc47, then doc64).
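A sketch of the scoring and ranking steps, counting matching query terms per document; the posting lists are toy values chosen to mirror the example above, and TF-IDF and BM25 below refine this simple count into a weighted score.
```python
from collections import Counter

# Posting lists from the term-lookup step (toy values matching the example).
postings = {
    "stanford": {"doc39", "doc47", "doc68"},
    "university": {"doc39", "doc64"},
    "founded": {"doc39", "doc47", "doc41"},
}

# Score each candidate document by the number of query terms it contains,
# then rank documents by score.
scores = Counter()
for term, doc_ids in postings.items():
    for doc_id in doc_ids:
        scores[doc_id] += 1

print(scores.most_common())  # doc39 (3 terms) ranks above doc47 (2 terms), etc.
```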
TF-IDF
1. TF (Term Frequency):
○ Measures how frequently a term appears in a document.
○ Formula:
■ TF(t, d) = count(t, d) / |d|, where count(t, d) is the number of times term t appears in document d and |d| is the total number of terms in d.
○ Intuition:
■ The more a term appears in a document, the more important it is for that
document.
2. IDF (Inverse Document Frequency):
○ Measures how unique or rare a term is across the entire corpus.
○ Formula:
■ IDF(t) = log( N / (df(t) + 1) ), where N is the total number of documents in the corpus and df(t) is the number of documents containing t.
■ The "+1" in the denominator avoids division by zero when a term is not present in any document.
○ Intuition:
■ Common terms like "the," "and," or "is" appear in almost every document
and are less useful for distinguishing documents.
■ Rare terms (e.g., "Stanford" or "university") are more informative for
identifying relevant documents.
3. TF-IDF:
○ Combines TF and IDF to assign a weight to each term in a document.
○ Formula:
TF-IDF(t,d)=TF(t,d)⋅IDF(t)
○ Interpretation:
■ High TF-IDF: Terms that appear frequently in a document but rarely across
the corpus.
■ Low TF-IDF: Terms that appear often across the corpus or are rare in the
document.
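A minimal TF-IDF scoring sketch under the formulas above (length-normalized TF and the "+1"-smoothed IDF); the corpus and query are toy examples.
```python
import math

# Toy corpus (contents are illustrative).
docs = {
    "doc39": "stanford university was founded in 1885".split(),
    "doc47": "leland stanford founded a railroad company".split(),
    "doc21": "stamford is a city in connecticut".split(),
    "doc64": "the university offers many courses".split(),
}
N = len(docs)

def tf(term, tokens):
    # Term frequency, normalized by document length.
    return tokens.count(term) / len(tokens)

def idf(term):
    # Inverse document frequency; "+1" in the denominator avoids division by zero.
    df = sum(1 for tokens in docs.values() if term in tokens)
    return math.log(N / (df + 1))

def tfidf(term, tokens):
    return tf(term, tokens) * idf(term)

# Score each document as the sum of TF-IDF weights of the query terms.
query = "stanford university founded".split()
scores = {d: sum(tfidf(t, toks) for t in query) for d, toks in docs.items()}
print(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))
```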
BM25
1. Smoother IDF
○ BM25 replaces the TF-IDF IDF with a smoothed variant:
■ IDF(t) = log( (N − df(t) + 0.5) / (df(t) + 0.5) + 1 )
2. Scoring
○ The BM25 score of document d for query q sums the IDF-weighted term weights over the query terms:
■ BM25(q, d) = Σ_{t ∈ q} IDF(t) ⋅ weight(t, d)
3. BM25 Weight
○ The per-term weight saturates term frequency and normalizes for document length:
■ weight(t, d) = tf(t, d) ⋅ (k1 + 1) / ( tf(t, d) + k1 ⋅ (1 − b + b ⋅ |d| / avgdl) ), where tf(t, d) is the raw count of t in d, |d| is the length of d, and avgdl is the average document length in the corpus.
Summary of Parameters
● k1 (default: 1.2):
○ Controls the saturation of term frequency.
○ Higher k1: Repeated terms have more impact on the score.
○ Lower k1: Repeated terms have less impact.
● b (default: 0.75):
○ Controls document length normalization.
○ Higher b: Longer documents are penalized more.
○ Lower b: Document length has less effect on scoring.
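A BM25 scoring sketch with the default parameters k1 = 1.2 and b = 0.75 and the smoother IDF above; the corpus and query are again toy examples.
```python
import math

docs = {
    "doc39": "stanford university was founded in 1885".split(),
    "doc47": "leland stanford founded a railroad company".split(),
    "doc21": "stamford is a city in connecticut".split(),
    "doc64": "the university offers many courses".split(),
}
N = len(docs)
avgdl = sum(len(tokens) for tokens in docs.values()) / N  # average document length

def bm25_idf(term):
    # Smoother IDF used by BM25.
    df = sum(1 for tokens in docs.values() if term in tokens)
    return math.log((N - df + 0.5) / (df + 0.5) + 1)

def bm25(query_terms, tokens, k1=1.2, b=0.75):
    score = 0.0
    for t in query_terms:
        tf = tokens.count(t)
        # k1 controls term-frequency saturation; b controls length normalization.
        denom = tf + k1 * (1 - b + b * len(tokens) / avgdl)
        score += bm25_idf(t) * tf * (k1 + 1) / denom
    return score

query = "stanford university founded".split()
print(sorted(docs, key=lambda d: bm25(query, docs[d]), reverse=True))
```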
NEURAL IR
Cross encoder
1. Joint Input:
○ The query and the document are concatenated into a single input: [query; document].
○ This allows the model to consider token-level interactions between the query and the document directly.
2. Single Encoder (BERT-Style):
○ BERT-style encoder with N layers processes the concatenated input, creating a joint
contextualized representation for the query and document.
○ The output is a dense vector representation that captures their joint relationship.
3. Representation Scoring:
○ The relevance score between q and doc is computed as:
■ score(q, doc) = Dense(Enc([q; doc]))
■ Enc: The output of the encoder for the concatenated query and document.
■ Dense: A dense layer applied to the final representation to compute the
relevance score.
4. Loss Function:
○ Cross-encoders use a negative log-likelihood loss to maximize the score of the positive passage while minimizing the scores of negative passages:
■ L = −log( exp(s(q, d+)) / ( exp(s(q, d+)) + Σj exp(s(q, dj−)) ) ), where s(q, d) is the relevance score above, d+ is the positive passage, and dj− are the negative passages.
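A minimal cross-encoder scoring sketch in PyTorch with Hugging Face Transformers; bert-base-uncased and the single-output dense head are illustrative choices, and the head is untrained here, so the scores are only meaningful after fine-tuning with the loss above.
```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")
dense = torch.nn.Linear(encoder.config.hidden_size, 1)  # Dense scoring head (untrained)

def score(query: str, document: str) -> float:
    # Concatenate query and document into one input: [CLS] query [SEP] document [SEP]
    inputs = tokenizer(query, document, return_tensors="pt", truncation=True)
    with torch.no_grad():
        joint = encoder(**inputs).last_hidden_state[:, 0]  # joint [CLS] representation
    return dense(joint).item()  # relevance score = Dense(Enc([q; doc]))

print(score("when was stanford university founded",
            "Stanford University was founded in 1885."))
```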
Strengths of Cross-Encoders
1. Deep Interaction:
○ By encoding the query and document together, cross-encoders enable rich
token-level interactions, capturing complex relationships between terms in
the query and document.
2. High Quality:
○ Cross-encoders typically outperform other architectures (e.g., bi-encoders)
in terms of relevance ranking accuracy because of their ability to process the
query and document jointly.
Challenges of Cross-Encoders
1. Scalability Issues:
○ For a single query, the model must process every potential query-document
pair. This is computationally expensive for large corpora, as the BERT
encoder must run separately for each pair.
2. Infeasibility for Large-Scale Retrieval:
○ While cross-encoders are effective for ranking a small set of candidate
documents, they cannot scale to tasks where millions of documents must be
scored in real-time.
DPR
How DPR Works
Step 1: Encoding
● Each query q and document doc is passed through its respective encoder:
○ EncQ(q): Outputs a dense vector for the query.
○ EncD(doc): Outputs a dense vector for the document.
Step 2: Similarity Scoring
● The relevance of a document to the query is measured by the similarity (dot product) of the two vectors:
○ sim(q, doc) = EncQ(q) ⋅ EncD(doc)
Step 3: Retrieval
● Documents are ranked based on their similarity scores with the query.
● The top k documents are returned as the most relevant results.
Step 4: Training
● During training, the model is optimized using the negative log-likelihood loss, which
encourages high similarity scores for positive query-document pairs and low scores for
negative pairs.
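A bi-encoder retrieval sketch of these steps in PyTorch; bert-base-uncased with [CLS] pooling stands in for DPR's separately trained question and passage encoders, and the documents and query are illustrative.
```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
enc_q = AutoModel.from_pretrained("bert-base-uncased")  # EncQ (query encoder)
enc_d = AutoModel.from_pretrained("bert-base-uncased")  # EncD (document encoder)

def embed(encoder, texts):
    # Encode a batch of texts into single dense vectors (the [CLS] output).
    inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        return encoder(**inputs).last_hidden_state[:, 0]  # (batch, hidden)

# Steps 1-2: precompute document embeddings offline, embed the query online.
docs = ["Stanford University was founded in 1885.",
        "Stamford is a city in Connecticut.",
        "BM25 is a classical ranking function."]
doc_vecs = embed(enc_d, docs)
query_vec = embed(enc_q, ["when was stanford university founded"])

# Step 3: rank documents by dot-product similarity and return the top k.
scores = query_vec @ doc_vecs.T  # sim(q, doc) = EncQ(q) . EncD(doc)
topk = torch.topk(scores, 2).indices[0]
print([docs[i.item()] for i in topk])
```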
Advantages of DPR
1. High Scalability:
○ Precomputing document embeddings allows fast retrieval using approximate nearest neighbor (ANN) search (see the index sketch after this list).
○ Efficient for large-scale corpora.
2. Decoupled Query and Document Encoding:
○ Query and document encoders work independently, enabling efficient retrieval
without recomputing document embeddings for every query.
3. Contextual Representation:
○ By using BERT-style encoders, DPR captures rich contextual information for both
queries and documents.
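A sketch of the precompute-and-search pattern using FAISS with an exact inner-product index; production systems typically use approximate indexes (e.g., IVF or HNSW), and the random vectors here stand in for actual DPR embeddings.
```python
import faiss
import numpy as np

# Stand-ins for precomputed document embeddings and a query embedding (float32).
doc_vecs = np.random.randn(1000, 768).astype("float32")
query_vec = np.random.randn(1, 768).astype("float32")

index = faiss.IndexFlatIP(doc_vecs.shape[1])  # exact inner-product index
index.add(doc_vecs)                           # index the precomputed document embeddings

scores, ids = index.search(query_vec, 5)      # top-5 most similar documents
print(ids[0], scores[0])
```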
Limitations of DPR
1. Single-Vector Bottleneck:
○ Compressing each query and document into a single dense vector loses token-level detail and limits fine-grained matching; this is the gap that late-interaction models such as ColBERT (below) address.
ColBERT
Contextualized Embeddings:
● Both the query and the document are independently encoded using a BERT-style encoder to
produce dense, contextualized token-level embeddings.
● This ensures that the embeddings capture semantic context for each token in both the query
and the document.
Late Interaction:
● ColBERT performs late interaction by matching each token in the query with all tokens in the document after encoding.
● This allows rich token-level interaction while maintaining scalability.
MaxSim Scoring:
● For each token in the query, ColBERT calculates the maximum similarity with all tokens in the document. This is achieved using a dot product between query and document token embeddings.
● The relevance score for the query-document pair is the sum of MaxSim scores for all query tokens:
○ score(q, d) = Σ_{i ∈ q} max_{j ∈ d} ( q_i ⋅ d_j ), where q_i and d_j are the contextualized embeddings of the i-th query token and the j-th document token.
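A MaxSim scoring sketch in PyTorch over precomputed token embeddings; the random, normalized vectors stand in for the BERT-produced contextualized token embeddings described above.
```python
import torch
import torch.nn.functional as F

def maxsim_score(query_emb: torch.Tensor, doc_emb: torch.Tensor) -> torch.Tensor:
    # query_emb: (num_query_tokens, dim); doc_emb: (num_doc_tokens, dim).
    sim = query_emb @ doc_emb.T                  # token-level dot products
    max_per_query_token = sim.max(dim=1).values  # MaxSim: best document token per query token
    return max_per_query_token.sum()             # relevance = sum of MaxSim over query tokens

# Toy example: random vectors in place of real contextualized token embeddings.
q_tokens = F.normalize(torch.randn(4, 128), dim=-1)
d_tokens = F.normalize(torch.randn(50, 128), dim=-1)
print(maxsim_score(q_tokens, d_tokens).item())
```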
Training with Contrastive Loss:
● ColBERT is trained with a contrastive objective: given a query, a positive passage, and negative passages, the loss raises the MaxSim score of the positive passage relative to the negatives (analogous to the negative log-likelihood losses above).
Strengths of ColBERT
1. Rich Token-Level Matching:
○ Late interaction captures finer-grained relevance signals than single-vector bi-encoders such as DPR.
2. Scalability:
○ Document token embeddings can be precomputed and indexed, so ranking is much cheaper than running a cross-encoder over every query-document pair.
Limitations of ColBERT
1. Simplified Interaction:
○ Although ColBERT captures token-level interactions, it lacks the full
cross-attention seen in cross-encoders, which can capture deeper
relationships between query and document tokens.
2. Storage Overhead:
○ Requires storing token-level embeddings for all documents, which can be
memory-intensive for large corpora.
3. Complexity in Retrieval:
○ During inference, calculating MaxSim for token-level matching adds
computational overhead compared to simple dot-product similarity in
bi-encoders.
Policy gradient