IR and RL NLP

INFORMATION RETRIEVAL

CLASSICAL IR

Step-by-Step Explanation

1. User Query

● The user submits a query: "When was Stanford University founded?"


● The query is tokenized into terms: "founded," "Stanford," "University," etc.

2. Term Lookup

● For each query term, the system performs a lookup in an inverted index, which maps terms
to the list of documents where they appear.

Key Points in Term Lookup:

● Each term (e.g., "founded") retrieves a list of documents that contain the term:
○ Example: "founded" → {doc47, doc39, doc41, ...}
● Synonyms, stemming, or misspellings may result in multiple similar terms being searched
(e.g., "Stanford" and "Stamford").
○ Example: "Stanford" → {doc47, doc39, doc68, ...} and "Stamford" → {doc21,
doc11, doc17, ...}
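
A minimal sketch of the term lookup above, with a toy inverted index and a hypothetical tokenize helper (the document IDs mirror the example and are illustrative, not real data):

# Toy inverted index: each term maps to the set of documents containing it.
inverted_index = {
    "founded":    {"doc47", "doc39", "doc41"},
    "stanford":   {"doc47", "doc39", "doc68"},
    "stamford":   {"doc21", "doc11", "doc17"},
    "university": {"doc39", "doc47", "doc64"},
}

def tokenize(query):
    # Lowercase and split on whitespace; real systems also stem and strip punctuation.
    return query.lower().replace("?", "").split()

query_terms = tokenize("When was Stanford University founded?")

# Candidate documents: the union of the posting lists of all query terms.
candidates = set()
for term in query_terms:
    candidates |= inverted_index.get(term, set())

print(candidates)   # e.g. {'doc39', 'doc41', 'doc47', 'doc64', 'doc68'}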

3. Document Scoring

● Once the documents corresponding to all query terms are retrieved, the system assigns a
relevance score to each document based on the query terms.

Scoring Process:

● A document’s relevance is determined by:


○ Term Frequency (TF): How often each query term appears in the document.
○ Inverse Document Frequency (IDF): How unique or rare the term is across all
documents in the collection.
○ TF-IDF Weighting: Combines term frequency and rarity to compute a score for each
term in the document.
○ Boolean or Vector Space Model: Determines relevance based on exact matches
(Boolean model) or cosine similarity (Vector space model).

Example:

● Document doc39 has all three terms ("Stanford," "University," and "founded"), so it receives
a high relevance score.
● Document doc47 has two terms ("Stanford" and "founded"), so it receives a slightly lower
score.
● Document doc64 contains relevant terms but fewer matches, resulting in a lower score.

4. Ranked Results

● Documents are ranked by their relevance scores.


● The top-ranked documents are presented to the user as the most likely to answer the query.
○ Example:
■ doc39: "A History of Stanford University."
■ doc47: "Stanford University Wikipedia."
■ doc64: "Stanford University About Page."
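
A minimal sketch of the scoring-and-ranking step above, counting matching query terms as a crude stand-in for the TF-IDF weights described next (the document contents are hypothetical):

# Toy term sets per candidate document (hypothetical contents).
doc_terms = {
    "doc39": {"stanford", "university", "founded", "history"},
    "doc47": {"stanford", "founded", "wikipedia"},
    "doc64": {"stanford", "about"},
}

query_terms = {"stanford", "university", "founded"}

# Score = number of query terms the document contains.
scores = {doc: len(terms & query_terms) for doc, terms in doc_terms.items()}

# Rank documents by score, highest first.
ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
print(ranked)   # [('doc39', 3), ('doc47', 2), ('doc64', 1)]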

TF-IDF

1. TF (Term Frequency):
○ Measures how frequently a term appears in a document.
○ Formula:

TF(t, d) = (number of occurrences of term t in document d) / (total number of terms in document d)

○ Intuition:
■ The more a term appears in a document, the more important it is for that
document.
2. IDF (Inverse Document Frequency):
○ Measures how unique or rare a term is across the entire corpus.
○ Formula:

IDF(t) = log( N / (number of documents containing term t + 1) ), where N is the total number of documents in the collection

■ The "+1" is used to avoid division by zero when a term is not present in any
document.
○ Intuition:
■ Common terms like "the," "and," or "is" appear in almost every document
and are less useful for distinguishing documents.
■ Rare terms (e.g., "Stanford") are more informative for identifying relevant
documents.
3. TF-IDF:
○ Combines TF and IDF to assign a weight to each term in a document.
○ Formula:

TF-IDF(t,d)=TF(t,d)⋅IDF(t)

○ Interpretation:
■ High TF-IDF: Terms that appear frequently in a document but rarely across
the corpus.
■ Low TF-IDF: Terms that appear often across the corpus or are rare in the
document.
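
A minimal sketch of TF-IDF scoring on a toy corpus, following the formulas above (including the "+1" in the IDF denominator); the document texts are hypothetical:

import math

docs = {
    "doc39": "stanford university was founded in 1885 a history of stanford university",
    "doc47": "stanford university founded leland stanford wikipedia",
    "doc64": "about stanford university",
}

N = len(docs)
tokenized = {d: text.split() for d, text in docs.items()}

def tf(term, doc_id):
    # Term frequency: occurrences of the term divided by document length.
    tokens = tokenized[doc_id]
    return tokens.count(term) / len(tokens)

def idf(term):
    # Inverse document frequency with the +1 smoothing from the formula above.
    # Note: terms that appear in every document get a slightly negative IDF under
    # this smoothing; this is one motivation for BM25's smoother IDF.
    df = sum(1 for tokens in tokenized.values() if term in tokens)
    return math.log(N / (df + 1))

def tf_idf(term, doc_id):
    return tf(term, doc_id) * idf(term)

query = ["stanford", "university", "founded"]
scores = {d: sum(tf_idf(t, d) for t in query) for d in docs}
print(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))   # doc39 ranks first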

BM25 (Best Match 25)

BM25 scores a document as a sum, over the query terms, of a per-term weight built from three pieces:

1. Smoother IDF: a smoothed variant of IDF that damps very common terms and avoids division by zero and strongly negative weights.

2. Scoring: a saturating term-frequency component, TF(t,d) ⋅ (k1 + 1) / ( TF(t,d) + k1 ⋅ (1 - b + b ⋅ |d| / avgdl) ), where |d| is the document length and avgdl is the average document length in the collection.

3. BM25 Weight: the per-term weight is the smoothed IDF multiplied by the saturated term-frequency component; summing these weights over all query terms gives the document's BM25 score.

Summary of Parameters

● k1 (default: 1.2):
○ Controls the saturation of term frequency.
○ Higher k1: Repeated terms have more impact on the score.
○ Lower k1: Repeated terms have less impact.
● b (default: 0.75):
○ Controls document length normalization.
○ Higher b: Longer documents are penalized more.
○ Lower b: Document length has less effect on scoring.
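
A minimal sketch of one common BM25 formulation (an Okapi/Lucene-style variant), showing where k1 and b enter; the corpus statistics and term counts below are illustrative assumptions:

import math

def bm25_term_weight(raw_tf, df, N, doc_len, avgdl, k1=1.2, b=0.75):
    # Smoother IDF: damps common terms and avoids negative weights.
    idf = math.log((N - df + 0.5) / (df + 0.5) + 1)
    # Saturating term-frequency component; b controls length normalization.
    tf_component = raw_tf * (k1 + 1) / (raw_tf + k1 * (1 - b + b * doc_len / avgdl))
    return idf * tf_component

def bm25_score(query_term_stats, doc_len, avgdl, N, k1=1.2, b=0.75):
    # query_term_stats: list of (term frequency in the document, document frequency in the corpus).
    return sum(bm25_term_weight(tf, df, N, doc_len, avgdl, k1, b)
               for tf, df in query_term_stats)

# Example: "stanford", "university", "founded" in a 120-word document (toy statistics).
stats = [(4, 1200), (3, 25000), (1, 8000)]
print(bm25_score(stats, doc_len=120, avgdl=300, N=1_000_000))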

NEURAL IR

Cross-encoder

1. Query and Document Concatenation:


○ In cross-encoders, the query (q) and document (doc) are concatenated into a single
input sequence:

[query;document]

○ This allows the model to consider token-level interactions between the query and
the document directly.
2. Single Encoder (BERT-Style):
○ A BERT-style encoder with N layers processes the concatenated input, producing a joint
contextualized representation of the query and document.
○ The output is a dense vector representation that captures their joint relationship.
3. Representation Scoring:
○ The relevance score between q and doc is computed as:

score(q, doc) = Dense(Enc([query; document]))

where:
■ Enc: The output of the encoder for the concatenated query and document.
■ Dense: A dense layer applied to the final representation to compute the
relevance score.
4. Loss Function:
○ Cross-encoders use a negative log-likelihood loss that maximizes the score of the positive
passage while minimizing the scores of negative passages:

L = -log( exp(score(q_i, doc_i+)) / ( exp(score(q_i, doc_i+)) + Σ_j exp(score(q_i, doc_i,j-)) ) )
■ doc_i+: the positive document, relevant to query q_i.
■ doc_i,j-: the negative (irrelevant) documents.
■ The model assigns higher scores to relevant documents and lower scores to
irrelevant ones.
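
A minimal inference sketch for a cross-encoder re-ranker, using a publicly available MS MARCO checkpoint from the Hugging Face Hub (the checkpoint name and the candidate documents are assumptions; training with the loss above is not shown):

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "cross-encoder/ms-marco-MiniLM-L-6-v2"   # assumed public re-ranking checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

query = "When was Stanford University founded?"
docs = [
    "Stanford University was founded in 1885 by Leland and Jane Stanford.",
    "Stamford is a city in Connecticut.",
]

# Each query-document pair is concatenated into one input sequence ([query; document]).
features = tokenizer([query] * len(docs), docs, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    scores = model(**features).logits.squeeze(-1)   # one relevance score per pair

print(scores)   # higher score = more relevant

Because every candidate pair must pass through the full encoder, this pattern is typically used to re-rank a small candidate set rather than to search the whole corpus.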

Strengths of Cross-Encoders
1. Deep Interaction:
○ By encoding the query and document together, cross-encoders enable rich
token-level interactions, capturing complex relationships between terms in
the query and document.
2. High Quality:
○ Cross-encoders typically outperform other architectures (e.g., bi-encoders)
in terms of relevance ranking accuracy because of their ability to process the
query and document jointly.

Challenges of Cross-Encoders

1. Scalability Issues:
○ For a single query, the model must process every potential query-document
pair. This is computationally expensive for large corpora, as the BERT
encoder must run separately for each pair.
2. Infeasibility for Large-Scale Retrieval:
○ While cross-encoders are effective for ranking a small set of candidate
documents, they cannot scale to tasks where millions of documents must be
scored in real-time.

DPR (Dense Passage Retrieval)
How DPR Works

Step 1: Query and Document Encoding

● Each query q and document doc is passed through its respective encoder:
○ EncQ(q): Outputs a dense vector for the query.
○ EncD(doc): Outputs a dense vector for the document.

Step 2: Precomputed Document Embeddings

● Document embeddings EncD(doc) can be precomputed and stored in a vector index.


● During retrieval, the query embedding EncQ(q) is computed on the fly and matched against
precomputed document embeddings using dot-product similarity.

Step 3: Retrieval

● Documents are ranked based on their similarity scores with the query.
● The top k documents are returned as the most relevant results.

Step 4: Training

● During training, the model is optimized using the negative log-likelihood loss, which
encourages high similarity scores for positive query-document pairs and low scores for
negative pairs.
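
A minimal retrieval sketch with the public DPR checkpoints in Hugging Face transformers (the checkpoint names and toy passages are assumptions; in practice the passage embeddings would be precomputed and stored in an ANN index such as FAISS):

import torch
from transformers import (DPRContextEncoder, DPRContextEncoderTokenizer,
                          DPRQuestionEncoder, DPRQuestionEncoderTokenizer)

q_name = "facebook/dpr-question_encoder-single-nq-base"   # assumed public checkpoints
c_name = "facebook/dpr-ctx_encoder-single-nq-base"
q_tokenizer = DPRQuestionEncoderTokenizer.from_pretrained(q_name)
q_encoder = DPRQuestionEncoder.from_pretrained(q_name).eval()
c_tokenizer = DPRContextEncoderTokenizer.from_pretrained(c_name)
c_encoder = DPRContextEncoder.from_pretrained(c_name).eval()

passages = [
    "Stanford University was founded in 1885 by Leland and Jane Stanford.",
    "Stamford is a city in Connecticut.",
]

with torch.no_grad():
    # EncD(doc): precomputable passage embeddings.
    doc_inputs = c_tokenizer(passages, padding=True, truncation=True, return_tensors="pt")
    doc_emb = c_encoder(**doc_inputs).pooler_output              # (num_docs, hidden)
    # EncQ(q): query embedding computed at query time.
    q_inputs = q_tokenizer("When was Stanford University founded?", return_tensors="pt")
    q_emb = q_encoder(**q_inputs).pooler_output                  # (1, hidden)

scores = q_emb @ doc_emb.T                                       # dot-product similarity
topk = torch.topk(scores.squeeze(0), k=1)
print(topk.indices, topk.values)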

Advantages of DPR

1. High Scalability:
○ Precomputing document embeddings allows fast retrieval using approximate nearest
neighbor (ANN) search.
○ Efficient for large-scale corpora.
2. Decoupled Query and Document Encoding:
○ Query and document encoders work independently, enabling efficient retrieval
without recomputing document embeddings for every query.
3. Contextual Representation:
○ By using BERT-style encoders, DPR captures rich contextual information for both
queries and documents.

Limitations of DPR

1. Limited Query-Document Interaction:


○ Query and document embeddings are computed independently, which means
token-level interactions between the query and document are not considered.
○ This can lead to suboptimal relevance scoring compared to models like
cross-encoders.
2. Training Complexity:
○ Requires carefully selected negative samples (doc_i-) for contrastive learning.
○ The quality of retrieval depends heavily on the training data and negative sampling
strategy.
3. Storage Requirements:
○ Precomputing and storing dense embeddings for large document collections can be
memory-intensive.

ColBERT

Contextualized Embeddings:

● Both the query and the document are independently encoded using a BERT-style encoder to
produce dense, contextualized token-level embeddings.
● This ensures that the embeddings capture semantic context for each token in both the query
and the document.

Token-Level Matching with Late Interaction:

● CoLBERT performs late interaction by matching each token in the query with all tokens in the
document after encoding.
● This allows rich token-level interaction while maintaining scalability.

MaxSim Scoring:

● For each token in the query, CoLBERT calculates the maximum similarity with all tokens in
the document. This is achieved using a dot product between query and document token
embeddings.
● The relevance score for the query-document pair is the sum of MaxSim scores over all query
tokens:

score(q, doc) = Σ_i max_j ( Eq_i ⋅ Ed_j )

where Eq_i is the contextualized embedding of the i-th query token and Ed_j is the embedding of the j-th document token.
Training with Contrastive Loss:

● CoLBERT is trained with a negative log-likelihood loss over a positive document and a set of negative documents:

L = -log( exp(score(q, doc+)) / ( exp(score(q, doc+)) + Σ exp(score(q, doc-)) ) )
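
A minimal sketch of the MaxSim late-interaction score, with random tensors standing in for the BERT token embeddings of one query and one document (the sizes are illustrative; real ColBERT additionally projects embeddings to a smaller dimension):

import torch

torch.manual_seed(0)
dim = 128                                   # illustrative embedding size
query_emb = torch.randn(6, dim)             # 6 query tokens x dim (stand-in for Eq)
doc_emb = torch.randn(80, dim)              # 80 document tokens x dim (stand-in for Ed)

# L2-normalize so the dot product behaves like cosine similarity.
query_emb = torch.nn.functional.normalize(query_emb, dim=-1)
doc_emb = torch.nn.functional.normalize(doc_emb, dim=-1)

# Token-level similarity matrix: (query tokens x document tokens).
sim = query_emb @ doc_emb.T

# MaxSim: for each query token, keep its best-matching document token, then sum.
score = sim.max(dim=1).values.sum()
print(score.item())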

Strengths of CoLBERT

1. Rich Contextualized Representations:


○ Uses BERT-style embeddings for tokens, ensuring that both query and
document tokens are understood in context.
2. Efficient Late Interaction:
○ Instead of performing expensive full cross-attention between queries and
documents (as in cross-encoders), CoLBERT only calculates token-level dot
products, making it scalable for large datasets.
3. Precomputable Document Embeddings:
○ Document embeddings can be precomputed and stored, allowing efficient
retrieval during inference.
4. Highly Scalable:
○ Combines the efficiency of bi-encoders (separate encoding of queries and
documents) with the richness of token-level interactions.

Limitations of CoLBERT

1. Simplified Interaction:
○ Although CoLBERT captures token-level interactions, it lacks the full
cross-attention seen in cross-encoders, which can capture deeper
relationships between query and document tokens.
2. Storage Overhead:
○ Requires storing token-level embeddings for all documents, which can be
memory-intensive for large corpora.
3. Complexity in Retrieval:
○ During inference, calculating MaxSim for token-level matching adds
computational overhead compared to simple dot-product similarity in
bi-encoders.

Policy gradient
