IR and RL NLP
CLASSICAL IR
Step-by-Step Explanation
1. User Query
● The user submits a free-text query (e.g., "When was Stanford University founded?"), which is broken into query terms such as "Stanford," "University," and "founded."
2. Term Lookup
● For each query term, the system performs a lookup in an inverted index, which maps terms
to the list of documents where they appear.
● Each term (e.g., "founded") retrieves a list of documents that contain the term:
○ Example: "founded" → {doc47, doc39, doc41, ...}
● Synonyms, stemming, or misspellings may result in multiple similar terms being searched
(e.g., "Stanford" and "Stamford").
○ Example: "Stanford" → {doc47, doc39, doc68, ...} and "Stamford" → {doc21,
doc11, doc17, ...}
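The lookup step can be illustrated with a minimal in-memory inverted index; the documents, IDs, and query below are toy stand-ins, not the actual corpus from the example.
```python
from collections import defaultdict

# Toy corpus (document IDs and contents are illustrative).
docs = {
    "doc39": "stanford university was founded in 1885",
    "doc47": "leland stanford founded a railroad company",
    "doc21": "stamford is a city in connecticut",
}

# Build the inverted index: term -> set of document IDs containing that term.
inverted_index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        inverted_index[term].add(doc_id)

# Term lookup: each query term retrieves its posting list.
query = "stanford university founded"
for term in query.split():
    print(term, "->", sorted(inverted_index.get(term, set())))
```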
3. Document Scoring
● Once the documents corresponding to all query terms are retrieved, the system assigns a
relevance score to each document based on the query terms.
Scoring Process:
● Each candidate document is scored by how many of the query terms it contains and how informative those terms are; weighting schemes such as TF-IDF and BM25 (described below) make this precise.
Example:
● Document doc39 has all three terms ("Stanford," "University," and "founded"), so it receives
a high relevance score.
● Document doc47 has two terms ("Stanford" and "founded"), so it receives a slightly lower
score.
● Document doc64 contains relevant terms but fewer matches, resulting in a lower score.
4. Ranked Results
● Documents are sorted by relevance score in descending order and the top-ranked results are returned to the user (here: doc39, then doc47, then doc64).
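A sketch of the scoring and ranking steps, counting matching query terms per document; the posting lists are toy values chosen to mirror the example above, and TF-IDF and BM25 below refine this simple count into a weighted score.
```python
from collections import Counter

# Posting lists from the term-lookup step (toy values matching the example).
postings = {
    "stanford": {"doc39", "doc47", "doc68"},
    "university": {"doc39", "doc64"},
    "founded": {"doc39", "doc47", "doc41"},
}

# Score each candidate document by the number of query terms it contains,
# then rank documents by score.
scores = Counter()
for term, doc_ids in postings.items():
    for doc_id in doc_ids:
        scores[doc_id] += 1

print(scores.most_common())  # doc39 (3 terms) ranks above doc47 (2 terms), etc.
```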
TF-IDF
1. TF (Term Frequency):
○ Measures how frequently a term appears in a document.
○ Formula:
■ TF(t, d) = count(t, d) / |d|, where count(t, d) is the number of times term t appears in document d and |d| is the total number of terms in d.
○ Intuition:
■ The more a term appears in a document, the more important it is for that
document.
2. IDF (Inverse Document Frequency):
○ Measures how unique or rare a term is across the entire corpus.
○ Formula:
■ IDF(t) = log( N / (df(t) + 1) ), where N is the total number of documents in the corpus and df(t) is the number of documents containing t.
■ The "+1" in the denominator avoids division by zero when a term is not present in any document.
○ Intuition:
■ Common terms like "the," "and," or "is" appear in almost every document
and are less useful for distinguishing documents.
■ Rare terms (e.g., "Stanford" or "university") are more informative for
identifying relevant documents.
3. TF-IDF:
○ Combines TF and IDF to assign a weight to each term in a document.
○ Formula:
TF-IDF(t,d)=TF(t,d)⋅IDF(t)
○ Interpretation:
■ High TF-IDF: Terms that appear frequently in a document but rarely across
the corpus.
■ Low TF-IDF: Terms that appear often across the corpus or are rare in the
document.
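A minimal TF-IDF scoring sketch under the formulas above (length-normalized TF and the "+1"-smoothed IDF); the corpus and query are toy examples.
```python
import math

# Toy corpus (contents are illustrative).
docs = {
    "doc39": "stanford university was founded in 1885".split(),
    "doc47": "leland stanford founded a railroad company".split(),
    "doc21": "stamford is a city in connecticut".split(),
    "doc64": "the university offers many courses".split(),
}
N = len(docs)

def tf(term, tokens):
    # Term frequency, normalized by document length.
    return tokens.count(term) / len(tokens)

def idf(term):
    # Inverse document frequency; "+1" in the denominator avoids division by zero.
    df = sum(1 for tokens in docs.values() if term in tokens)
    return math.log(N / (df + 1))

def tfidf(term, tokens):
    return tf(term, tokens) * idf(term)

# Score each document as the sum of TF-IDF weights of the query terms.
query = "stanford university founded".split()
scores = {d: sum(tfidf(t, toks) for t in query) for d, toks in docs.items()}
print(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))
```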
BM25
1. Smoother IDF
○ BM25 replaces the TF-IDF IDF with a smoothed variant:
■ IDF(t) = log( (N − df(t) + 0.5) / (df(t) + 0.5) + 1 )
2. Scoring
○ The BM25 score of document d for query q sums the IDF-weighted term weights over the query terms:
■ BM25(q, d) = Σ_{t ∈ q} IDF(t) ⋅ weight(t, d)
3. BM25 Weight
○ The per-term weight saturates term frequency and normalizes for document length:
■ weight(t, d) = tf(t, d) ⋅ (k1 + 1) / ( tf(t, d) + k1 ⋅ (1 − b + b ⋅ |d| / avgdl) ), where tf(t, d) is the raw count of t in d, |d| is the length of d, and avgdl is the average document length in the corpus.
Summary of Parameters
● k1 (default: 1.2):
○ Controls the saturation of term frequency.
○ Higher k1: Repeated terms have more impact on the score.
○ Lower k1: Repeated terms have less impact.
● b (default: 0.75):
○ Controls document length normalization.
○ Higher b: Longer documents are penalized more.
○ Lower b: Document length has less effect on scoring.
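A BM25 scoring sketch with the default parameters k1 = 1.2 and b = 0.75 and the smoother IDF above; the corpus and query are again toy examples.
```python
import math

docs = {
    "doc39": "stanford university was founded in 1885".split(),
    "doc47": "leland stanford founded a railroad company".split(),
    "doc21": "stamford is a city in connecticut".split(),
    "doc64": "the university offers many courses".split(),
}
N = len(docs)
avgdl = sum(len(tokens) for tokens in docs.values()) / N  # average document length

def bm25_idf(term):
    # Smoother IDF used by BM25.
    df = sum(1 for tokens in docs.values() if term in tokens)
    return math.log((N - df + 0.5) / (df + 0.5) + 1)

def bm25(query_terms, tokens, k1=1.2, b=0.75):
    score = 0.0
    for t in query_terms:
        tf = tokens.count(t)
        # k1 controls term-frequency saturation; b controls length normalization.
        denom = tf + k1 * (1 - b + b * len(tokens) / avgdl)
        score += bm25_idf(t) * tf * (k1 + 1) / denom
    return score

query = "stanford university founded".split()
print(sorted(docs, key=lambda d: bm25(query, docs[d]), reverse=True))
```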
NEURAL IR
Cross encoder
1. Joint Input:
○ The query and the document are concatenated into a single input: [query; document].
○ This allows the model to consider token-level interactions between the query and the document directly.
2. Single Encoder (BERT-Style):
○ BERT-style encoder with N layers processes the concatenated input, creating a joint
contextualized representation for the query and document.
○ The output is a dense vector representation that captures their joint relationship.
3. Representation Scoring:
○ The relevance score between q and doc is computed as:
■ score(q, doc) = Dense(Enc([q; doc]))
■ Enc: The output of the encoder for the concatenated query and document.
■ Dense: A dense layer applied to the final representation to compute the
relevance score.
4. Loss Function:
○ Cross-encoders use a negative log-likelihood loss to maximize the score of the positive passage while minimizing the scores of negative passages:
■ L = −log( exp(s(q, d+)) / ( exp(s(q, d+)) + Σj exp(s(q, dj−)) ) ), where s(q, d) is the relevance score above, d+ is the positive passage, and dj− are the negative passages.
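A minimal cross-encoder scoring sketch in PyTorch with Hugging Face Transformers; bert-base-uncased and the single-output dense head are illustrative choices, and the head is untrained here, so the scores are only meaningful after fine-tuning with the loss above.
```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")
dense = torch.nn.Linear(encoder.config.hidden_size, 1)  # Dense scoring head (untrained)

def score(query: str, document: str) -> float:
    # Concatenate query and document into one input: [CLS] query [SEP] document [SEP]
    inputs = tokenizer(query, document, return_tensors="pt", truncation=True)
    with torch.no_grad():
        joint = encoder(**inputs).last_hidden_state[:, 0]  # joint [CLS] representation
    return dense(joint).item()  # relevance score = Dense(Enc([q; doc]))

print(score("when was stanford university founded",
            "Stanford University was founded in 1885."))
```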
Strengths of Cross-Encoders
1. Deep Interaction:
○ By encoding the query and document together, cross-encoders enable rich
token-level interactions, capturing complex relationships between terms in
the query and document.
2. High Quality:
○ Cross-encoders typically outperform other architectures (e.g., bi-encoders)
in terms of relevance ranking accuracy because of their ability to process the
query and document jointly.
Challenges of Cross-Encoders
1. Scalability Issues:
○ For a single query, the model must process every potential query-document
pair. This is computationally expensive for large corpora, as the BERT
encoder must run separately for each pair.
2. Infeasibility for Large-Scale Retrieval:
○ While cross-encoders are effective for ranking a small set of candidate
documents, they cannot scale to tasks where millions of documents must be
scored in real-time.
DPR
How DPR Works
Step 1: Encoding
● Each query q and document doc is passed through its respective encoder:
○ EncQ(q): Outputs a dense vector for the query.
○ EncD(doc): Outputs a dense vector for the document.
Step 2: Similarity Scoring
● The relevance of a document to the query is measured by the similarity (dot product) of the two vectors:
○ sim(q, doc) = EncQ(q) ⋅ EncD(doc)
Step 3: Retrieval
● Documents are ranked based on their similarity scores with the query.
● The top k documents are returned as the most relevant results.
Step 4: Training
● During training, the model is optimized using the negative log-likelihood loss, which
encourages high similarity scores for positive query-document pairs and low scores for
negative pairs.
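A bi-encoder retrieval sketch of these steps in PyTorch; bert-base-uncased with [CLS] pooling stands in for DPR's separately trained question and passage encoders, and the documents and query are illustrative.
```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
enc_q = AutoModel.from_pretrained("bert-base-uncased")  # EncQ (query encoder)
enc_d = AutoModel.from_pretrained("bert-base-uncased")  # EncD (document encoder)

def embed(encoder, texts):
    # Encode a batch of texts into single dense vectors (the [CLS] output).
    inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        return encoder(**inputs).last_hidden_state[:, 0]  # (batch, hidden)

# Steps 1-2: precompute document embeddings offline, embed the query online.
docs = ["Stanford University was founded in 1885.",
        "Stamford is a city in Connecticut.",
        "BM25 is a classical ranking function."]
doc_vecs = embed(enc_d, docs)
query_vec = embed(enc_q, ["when was stanford university founded"])

# Step 3: rank documents by dot-product similarity and return the top k.
scores = query_vec @ doc_vecs.T  # sim(q, doc) = EncQ(q) . EncD(doc)
topk = torch.topk(scores, 2).indices[0]
print([docs[i.item()] for i in topk])
```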
Advantages of DPR
1. High Scalability:
○ Precomputing document embeddings allows fast retrieval using approximate nearest neighbor (ANN) search (see the index sketch after this list).
○ Efficient for large-scale corpora.
2. Decoupled Query and Document Encoding:
○ Query and document encoders work independently, enabling efficient retrieval
without recomputing document embeddings for every query.
3. Contextual Representation:
○ By using BERT-style encoders, DPR captures rich contextual information for both
queries and documents.
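A sketch of the precompute-and-search pattern using FAISS with an exact inner-product index; production systems typically use approximate indexes (e.g., IVF or HNSW), and the random vectors here stand in for actual DPR embeddings.
```python
import faiss
import numpy as np

# Stand-ins for precomputed document embeddings and a query embedding (float32).
doc_vecs = np.random.randn(1000, 768).astype("float32")
query_vec = np.random.randn(1, 768).astype("float32")

index = faiss.IndexFlatIP(doc_vecs.shape[1])  # exact inner-product index
index.add(doc_vecs)                           # index the precomputed document embeddings

scores, ids = index.search(query_vec, 5)      # top-5 most similar documents
print(ids[0], scores[0])
```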
Limitations of DPR
1. Single-Vector Bottleneck:
○ Compressing each query and document into a single dense vector loses token-level detail and limits fine-grained matching; this is the gap that late-interaction models such as ColBERT (below) address.
ColBERT
Contextualized Embeddings:
● Both the query and the document are independently encoded using a BERT-style encoder to
produce dense, contextualized token-level embeddings.
● This ensures that the embeddings capture semantic context for each token in both the query
and the document.
Late Interaction:
● ColBERT performs late interaction by matching each token in the query with all tokens in the document after encoding.
● This allows rich token-level interaction while maintaining scalability.
MaxSim Scoring:
● For each token in the query, ColBERT calculates the maximum similarity with all tokens in the document. This is achieved using a dot product between query and document token embeddings.
● The relevance score for the query-document pair is the sum of MaxSim scores for all query tokens:
○ score(q, d) = Σ_{i ∈ q} max_{j ∈ d} ( q_i ⋅ d_j ), where q_i and d_j are the contextualized embeddings of the i-th query token and the j-th document token.
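A MaxSim scoring sketch in PyTorch over precomputed token embeddings; the random, normalized vectors stand in for the BERT-produced contextualized token embeddings described above.
```python
import torch
import torch.nn.functional as F

def maxsim_score(query_emb: torch.Tensor, doc_emb: torch.Tensor) -> torch.Tensor:
    # query_emb: (num_query_tokens, dim); doc_emb: (num_doc_tokens, dim).
    sim = query_emb @ doc_emb.T                  # token-level dot products
    max_per_query_token = sim.max(dim=1).values  # MaxSim: best document token per query token
    return max_per_query_token.sum()             # relevance = sum of MaxSim over query tokens

# Toy example: random vectors in place of real contextualized token embeddings.
q_tokens = F.normalize(torch.randn(4, 128), dim=-1)
d_tokens = F.normalize(torch.randn(50, 128), dim=-1)
print(maxsim_score(q_tokens, d_tokens).item())
```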
Training with Contrastive Loss:
● ColBERT is trained with a contrastive objective: given a query, a positive passage, and negative passages, the loss raises the MaxSim score of the positive passage relative to the negatives (analogous to the negative log-likelihood losses above).
Strengths of ColBERT
1. Rich Token-Level Matching:
○ Late interaction captures finer-grained relevance signals than single-vector bi-encoders such as DPR.
2. Scalability:
○ Document token embeddings can be precomputed and indexed, so ranking is much cheaper than running a cross-encoder over every query-document pair.
Limitations of ColBERT
1. Simplified Interaction:
○ Although ColBERT captures token-level interactions, it lacks the full
cross-attention seen in cross-encoders, which can capture deeper
relationships between query and document tokens.
2. Storage Overhead:
○ Requires storing token-level embeddings for all documents, which can be
memory-intensive for large corpora.
3. Complexity in Retrieval:
○ During inference, calculating MaxSim for token-level matching adds
computational overhead compared to simple dot-product similarity in
bi-encoders.
Policy gradient