TYCS Sem VI Information Retrieval Final Notes (www.profajaypashankar.com)
AJAY PASHANKAR
Retrieval Models: Boolean model: Boolean operators, query processing, Vector space
model: TF-IDF, cosine similarity, query-document matching, Probabilistic model:
Bayesian retrieval, relevance feedback
Additional Reference(s):
1. Ricci, F., Rokach, L., Shapira, B., Kantor, P., "Recommender Systems Handbook", First Edition, Springer.
2. Bruce Croft, Donald Metzler, and Trevor Strohman, Search Engines: Information Retrieval in
Practice, Pearson Education.
3. Stefan Buttcher, Charlie Clarke, Gordon Cormack, Information Retrieval: Implementing and
Evaluating Search Engines, MIT Press.
An information retrieval (IR) model can be classified into one of the following three types −
Classical IR Model
This is the simplest IR model and the easiest to implement. It is based on mathematical foundations that are easily recognized and understood. The Boolean, vector, and probabilistic models are the three
classical IR models.
Non-Classical IR Model
In contrast to the classical IR model, such models are based on principles other than similarity, probability, and Boolean operations. The information logic model, the situation theory model,
and interaction models are examples of non-classical IR models.
Alternative IR Model
It is an enhancement of the classical IR model that makes use of specific techniques from other fields. The cluster model, the fuzzy model, and latent semantic indexing (LSI) are examples of
alternative IR models.
An IR system may also provide features that support the user in searching, such as:
• Search intermediary
• Domain knowledge
• Relevance feedback
• Natural language interface
• Graphical query language
1. Inverted Index
The primary data structure of most IR systems is the inverted index. An inverted index is a data structure that lists, for every word, all documents that contain it and the frequency of its occurrences in each document. It makes it easy to search for 'hits' of a query word.
2. Stop Word Elimination
Stop words are high-frequency words (such as a, an, the, of) that carry little value for retrieval and are usually eliminated before indexing.
3. Stemming
Stemming, the simplified form of morphological analysis, is the heuristic process of extracting the base
form of words by chopping off the ends of words. For example, the words laughing, laughs, laughed
would be stemmed to the root word laugh.
4. Crawling
Crawling is the process of gathering different web pages to index them to support a search engine. The
purpose of crawling is to quickly and efficiently gather as many relevant web pages as possible, together with the link structure that interconnects them.
5. Query
Queries are search statements that describe the information need submitted to a search engine. A query rarely identifies one particular result; instead, it matches many results to different degrees.
6. Relevance Feedback
Relevance feedback takes the results that are initially returned for a query, gathers the user's feedback on whether those results are relevant, and uses that information to formulate a new, improved query.
Precision and recall are two metrics used to evaluate the performance of an information
retrieval system, such as a search engine. Precision is the fraction of retrieved results that are
relevant, while recall is the fraction of all relevant results that the system was able to return. In other
words, precision measures the accuracy of the results returned, while recall measures their
completeness. A system tuned for high precision returns fewer results, but they are more
likely to be relevant. A system tuned for high recall returns more results, but more of them are likely to be
irrelevant. For example, if a search engine returns 100 results and 80 of them are relevant, then the
precision is 80%. If the collection contains 200 relevant results in total and the engine returns all of them,
then the recall is 100%.
Information retrieval has many widespread applications, which can be categorized into three types.
General Applications -
• Digital Libraries
• Media Search
• Search Engines
Domain-specific Applications
Common types of IR services include the following:
1. Search engines: These are the most common type of IR service, and they allow users to search
the Internet for websites, documents, and other types of information. Some examples of search
engines include Google, Bing, and Yahoo.
2. Library catalogs: These IR services allow users to search for books, journals, and other
materials in a library's collection.
3. Document databases: These IR services allow users to search for documents within a specific
database or collection, such as a database of research papers or legal documents.
4. Specialized IR services: These are IR services that are designed to search specific types of
information, such as medical literature or patents.
IR services use various techniques to index and retrieve information, including keyword searches,
natural language processing, and machine learning algorithms. They may also use metadata, such as
author names, publication dates, and subject tags, to help users find relevant information.
Common methods of information storage and retrieval include:
1. File systems: A file system is a way of organizing and storing files on a computer or other digital
device. It typically includes a hierarchy of folders and subfolders, and users can access and
retrieve files by navigating through the folder structure.
2. Databases: A database is a collection of structured data that can be searched, queried, and
accessed using a specialized software application. Databases can be used to store and retrieve a
wide range of information, including customer data, financial records, and product information.
3. Cloud storage: Cloud storage refers to the practice of storing data on remote servers that are
accessed over the Internet. This allows users to access and retrieve their data from any device
with an Internet connection.
4. Optical storage: Optical storage refers to the use of lasers or other light-based technologies to
store and retrieve data on media such as CDs, DVDs, and Blu-ray discs.
Regardless of the method used, effective information storage and retrieval systems should be efficient,
reliable, and secure. They should also be easy to use and allow users to access and retrieve the
information they need quickly and easily.
-------------------------------------------------------------------------------------------------------------------
Indexing (Creating document representation)
- Indexing is the manual or automated process of making statements about a document, a lesson, a
person, and so on.
- For example: author wise, subject wise, text wise, etc.
- Index can be:
i. Document oriented: - the indexer describes what the document is about, capturing its subject content as fully as possible.
ii. Request oriented: - the indexer assesses the document's relevance to subjects and other features of
interest to users, describing it in terms of the requests it might answer.
- Automated indexing begins with feature extraction, such as extracting all words from a text, followed
by refinements such as eliminating stop words (a, an, the, of), stemming (walking → walk), counting
the most frequent words, and mapping concepts using a thesaurus (tube → pipe).
-------------------------------------------------------------------------------------------------------------------
BUILDING AN INVERTED INDEX
- An inverted index, also called a postings file or inverted file, is an index data structure storing a mapping
from content, such as words or numbers, to its locations in a database file, a document, or a set of
documents.
- An index always maps back from terms to the parts of a document where they occur.
- Then for each term, a list is maintained in which documents the term occurs in.
- Each item in the list which records that a term appeared in a document is called a posting.
- The dictionary will be sorted alphabetically and each postings list is sorted by document ID.
DOC 1 = new home sales top forecasts
DOC 2 = home sales rise in July
DOC 3 = increase in home sales in July
DOC 4 = July new home sales rise
Dictionary term → postings list (sorted by document ID):
forecasts → DOC 1
home → DOC 1, DOC 2, DOC 3, DOC 4
in → DOC 2, DOC 3
increase → DOC 3
July → DOC 2, DOC 3, DOC 4
new → DOC 1, DOC 4
rise → DOC 2, DOC 4
sales → DOC 1, DOC 2, DOC 3, DOC 4
top → DOC 1
INDEXING ARCHITECTURE
BIWORD INDEX
- Index every consecutive pair of terms in the text as a phrase.
- Example: Friends, Romans, Countrymen would generate two bi-words “Friends Romans” and
“Romans Countrymen”.
- Each of these bi-words is now a vocabulary term.
POSITIONAL INDEXES
- In a positional index, each posting consists of a docID and a list of positions at which the term occurs.
- Example:
cat, 100 → < 1, 6: <7, 18, 33, 72, 86, 231>; 2, 5: <1, 17, 74, 222, 255>; 4, 2: <8, 16>; … >
The word "cat" has a document frequency of 100; it occurs 6 times in document 1 at positions 7, 18,
33, 72, 86, 231; 5 times in document 2; and so on.
SPARSE VECTORS
- Most documents and queries do not contain most words, so vectors are sparse.
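As a concrete illustration, here is a minimal Python sketch of a positional index, assuming simple lowercase whitespace tokenization and 1-based word positions; build_positional_index and the two sample documents are illustrative names, not part of any standard library.

from collections import defaultdict

def build_positional_index(docs):
    # docs: dict of doc_id -> text.
    # Returns term -> {doc_id: [positions]}.
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split(), start=1):
            index[term].setdefault(doc_id, []).append(pos)
    return index

docs = {1: "new home sales top forecasts",
        2: "home sales rise in July"}
index = build_positional_index(docs)
print(index["sales"])   # {1: [3], 2: [2]}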
An Inverted Index is a data structure used in information retrieval systems to efficiently retrieve
documents or web pages containing a specific term or set of terms. In an inverted index, the index is
organized by terms (words), and each term points to a list of documents or web pages that contain
that term.
Inverted indexes are widely used in search engines, database systems, and other applications where
efficient text search is required. They are especially useful for large collections of documents, where
searching through all the documents would be prohibitively slow.
An inverted index is an index data structure storing a mapping from content, such as words or
numbers, to its locations in a document or a set of documents. In simple words, it is a hashmap-like
data structure that directs you from a word to a document or a web page.
Example: Consider the following documents.
Document 1: The quick brown fox jumped over the lazy dog.
Document 2: The lazy dog slept in the sun.
To create an inverted index for these documents, we first tokenize the documents into terms, as
follows.
Document 1: The, quick, brown, fox, jumped, over, the, lazy, dog.
Document 2: The, lazy, dog, slept, in, the, sun.
Next, we create an index of the terms, where each term points to a list of documents that contain that
term, as follows.
The -> Document 1, Document 2
Quick -> Document 1
Brown -> Document 1
Fox -> Document 1
Jumped -> Document 1
Over -> Document 1
Lazy -> Document 1, Document 2
Dog -> Document 1, Document 2
Slept -> Document 2
In -> Document 2
Sun -> Document 2
To search for documents containing a particular term or set of terms, the search engine queries the
inverted index for those terms and retrieves the list of documents associated with each term. The
search engine can then use this information to rank the documents based on relevance to the query
and present them to the user in order of importance.
There are two types of inverted indexes:
• Record-Level Inverted Index: Record Level Inverted Index contains a list of references to
documents for each word.
• Word-Level Inverted Index: Word Level Inverted Index additionally contains the positions of
each word within a document. The latter form offers more functionality but needs more
processing power and space to be created.
Suppose we want to search the texts "hello everyone", "this article is based on inverted index",
and "which is hashmap-like data structure". If we index by (text, word within the text), the index
with a location in the text is:
hello (1, 1)
everyone (1, 2)
this (2, 1)
article (2, 2)
is (2, 3); (3, 2)
based (2, 4)
on (2, 5)
inverted (2, 6)
index (2, 7)
which (3, 1)
hashmap (3, 3)
like (3, 4)
data (3, 5)
structure (3, 6)
The word “hello” is in document 1 (“hello everyone”) starting at word 1, so has an entry (1, 1), and the
word “is” is in documents 2 and 3 at ‘3rd’ and ‘2nd’ positions respectively (here position is based on
the word).
The index may have weights, frequencies, or other indicators.
Steps to Build an Inverted Index
• Fetch the Document: Collect and prepare the text of the document to be indexed.
• Remove Stop Words: Stop words are the most frequently occurring and least useful
words in documents, like "I", "the", "we", "is", and "an".
• Stemming of Root Word: Whenever we search for "cat", we also want to see documents
that contain "cats" or "catty", not only the exact word "cat". To relate such words, we chop
off part of every word we read so that we obtain the "root word". There are standard tools
for performing this, like "Porter's Stemmer".
• Record Document IDs: If the word is already present in the index, add a reference to this
document to its entry; otherwise, create a new entry. Also record additional information like
the frequency and location of the word.
Example:
Words Document
ant doc1
demo doc2
world doc1, doc2
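To make the steps above concrete, here is a minimal Python sketch of an index builder. The stop-word list and the crude suffix-stripping simple_stem helper are illustrative stand-ins; a real system would use a proper stemmer such as Porter's.

import re
from collections import defaultdict

STOP_WORDS = {"i", "the", "we", "is", "an", "a", "of", "in", "on"}

def simple_stem(word):
    # Crude suffix stripping, standing in for a real stemmer like Porter's.
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def build_inverted_index(docs):
    # docs: dict of doc_id -> text. Returns term -> {doc_id: frequency}.
    index = defaultdict(lambda: defaultdict(int))
    for doc_id, text in docs.items():
        for token in re.findall(r"[a-z]+", text.lower()):
            if token in STOP_WORDS:                 # stop-word removal
                continue
            index[simple_stem(token)][doc_id] += 1  # record doc ID and frequency
    return index

docs = {"doc1": "ant world", "doc2": "demo of the world"}
index = build_inverted_index(docs)
print(dict(index["world"]))  # {'doc1': 1, 'doc2': 1}, as in the table above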
Compression techniques:
-------------------------------------------------------------------------------------------------------------
Why compression (in general)?
Compressed postings lists occupy less disk space and, more importantly, take less time to transfer
from disk into memory, so a larger fraction of the index can be cached in main memory.
Decompression is usually fast enough that the saved I/O outweighs the extra CPU work.
Index Pruning
To this point, we have discussed lossless approaches to inverted index compression. A lossy
approach is static index pruning. The basic idea is that posting list entries may
be removed, or pruned, without significantly degrading precision. Experiments have been done with both term-specific pruning and uniform pruning. With term-specific pruning, a different level of pruning is chosen
for each term. Uniform pruning simply eliminates posting list entries in the same proportion regardless of
the term. It has been shown that pruning at levels of nearly seventy percent of the full inverted index did
not significantly affect average precision.
-------------------------------------------------------------------------------------------------------------
DOCUMENT REPRESENTATION AND TERM WEIGHTING:
- A query can specify text words or phrases the system should look for.
- The query description is transformed, manually or automatically, into a formal query representation,
ready to be matched with the document representations.
4. Selection
- User examines the results and selects the relevant items.
Document Representation:
Document representation refers to how documents are represented within an information retrieval
system. In most cases, documents are represented as a bag-of-words model, where each document is
treated as an unordered collection of words or terms. Other representations, such as vector space
models, can also be used.
Bag-of-Words Model:
• In the bag-of-words model, a document is represented as a vector where each dimension
corresponds to a unique term in the vocabulary.
• The value of each dimension (term) in the vector typically indicates the frequency of the
corresponding term in the document.
• Stop words (common words like "the", "and", "of", etc.) are often removed to reduce noise in
the representation.
• Stemming and lemmatization may also be applied to reduce inflected or derived words to their
base or dictionary form.
Term Weighting:
Term weighting involves assigning weights to terms in the document representation to reflect their
importance in distinguishing relevant documents from irrelevant ones. The goal is to give higher
weights to terms that are more discriminative and informative.
Term Frequency-Inverse Document Frequency (TF-IDF):
• TF-IDF is a popular term weighting scheme used in information retrieval.
• It calculates the importance of a term within a document relative to its importance across all
documents in the corpus.
• The weight of a term t in a document d is calculated as the product of two components:
1. Term Frequency (TF): The frequency of term t in document d, usually normalized by
the total number of terms in d to account for document length.
2. Inverse Document Frequency (IDF): The logarithmically scaled inverse fraction of the
documents that contain term t among the documents in the corpus, IDF(t) = log(N / DF(t)). It measures the
informativeness of a term; terms that occur in fewer documents tend to have higher IDF
weights.
• The TF-IDF weight for term t in document d is given by:
TF-IDF(t, d) = TF(t, d) × IDF(t)
-------------------------------------------------------------------------------------------------------------------
Boolean Model:
The Boolean model is one of the oldest and simplest information retrieval models used to retrieve
documents that match a Boolean query. It operates on the principle of set theory and allows queries to
be formulated using Boolean operators such as AND, OR, and NOT. Here's how the Boolean model
works:
Principles of the Boolean Model:
1. Document Representation:
• In the Boolean model, each document and query is represented as a set of index terms
(words or phrases).
2. Binary Representation:
• Each term in a document or query is either present (1) or absent (0), resulting in binary
representation.
• The presence of a term indicates that it occurs at least once in the document or query.
3. Boolean Operators:
• Boolean operators (AND, OR, NOT) are used to construct queries to retrieve documents
based on the presence or absence of terms.
• AND Operator: Retrieves documents containing all terms in the query.
• OR Operator: Retrieves documents containing at least one of the terms in the query.
• NOT Operator: Excludes documents containing the specified term.
Example:
Consider a small document collection with the following documents:
• Document 1: "information retrieval techniques"
• Document 2: "document indexing methods"
• Document 3: "retrieval models in IR"
Suppose we want to retrieve documents related to "information retrieval" and "models" using Boolean
queries.
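Before working through the query formation below, here is a minimal Python sketch of how the Boolean operators map onto set operations, using the three documents above; boolean_index is an illustrative helper name, not a standard API.

def boolean_index(docs):
    # Map each term to the set of document IDs containing it.
    index = {}
    for doc_id, text in docs.items():
        for term in set(text.lower().split()):
            index.setdefault(term, set()).add(doc_id)
    return index

docs = {
    1: "information retrieval techniques",
    2: "document indexing methods",
    3: "retrieval models in IR",
}
idx = boolean_index(docs)
all_docs = set(docs)

print(idx["information"] & idx["retrieval"])          # AND -> {1}
print(idx["retrieval"] | idx["models"])               # OR  -> {1, 3}
print(idx["retrieval"] & (all_docs - idx["models"]))  # AND NOT -> {1}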
1. Boolean Query Formation:
• The query information AND retrieval retrieves only Document 1, the only document containing both terms.
• The query retrieval OR models retrieves Documents 1 and 3.
• The query retrieval AND NOT models retrieves Document 1.
TF-IDF Weighting Example:
For a related three-document collection, calculate the IDF for each term using the formula IDF(t) = log(N / DF(t)),
where N is the total number of documents and DF(t) is the number of documents
containing term t.
Term DF(t) IDF(t)
Information 2 log(3/2)
Retrieval 3 log(3/3)
Is 1 log(3/1)
Important 1 log(3/1)
Techniques 1 log(3/1)
Are 1 log(3/1)
Used 1 log(3/1)
In 2 log(3/2)
Search 2 log(3/2)
Engines 2 log(3/2)
Algorithms 1 log(3/1)
For 1 log(3/1)
TF-IDF Calculation:
• Multiply TF by IDF for each term in each document to get the TF-IDF weight.
Term Document 1 Document 2 Document 3
Information TF * IDF TF * IDF 0
Retrieval TF * IDF TF * IDF TF * IDF
Is TF * IDF 0 0
Important TF * IDF 0 0
Techniques 0 TF * IDF 0
Are 0 TF * IDF 0
Used 0 TF * IDF 0
In 0 TF * IDF TF * IDF
Search 0 TF * IDF TF * IDF
Engines 0 TF * IDF TF * IDF
Algorithms 0 0 TF * IDF
For 0 0 TF * IDF
This table shows the TF-IDF weights for each term in each document. The values are calculated by
multiplying the TF of each term by its corresponding IDF.
This process assigns higher weights to terms that are important within a document but occur
infrequently across the entire collection, making them more discriminative for retrieval purposes.
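As a sketch of the computation behind these tables, here is a small Python function using raw term counts for TF and IDF(t) = log(N / DF(t)). The three sample documents are hypothetical reconstructions for illustration and are not guaranteed to match the exact collection behind the tables.

import math
from collections import Counter

def tf_idf(docs):
    # docs: dict of doc_id -> list of tokens.
    # Returns doc_id -> {term: tf-idf weight}, with raw TF
    # and IDF(t) = log(N / DF(t)).
    n = len(docs)
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))          # document frequency per term
    weights = {}
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)            # raw term frequency
        weights[doc_id] = {t: tf[t] * math.log(n / df[t]) for t in tf}
    return weights

docs = {  # hypothetical three-document collection
    1: "information retrieval is important".split(),
    2: "information retrieval techniques are used in search engines".split(),
    3: "retrieval algorithms for search engines".split(),
}
w = tf_idf(docs)
print(round(w[1]["information"], 3))  # 1 * log(3/2) ≈ 0.405
print(w[1]["retrieval"])              # 1 * log(3/3) = 0.0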
-------------------------------------------------------------------------------------------------------------------
Cosine similarity:
Cosine similarity is a widely used metric in Information Retrieval (IR) and Natural Language Processing
(NLP) for measuring the similarity between two vectors. In the context of IR, cosine similarity is
commonly used to determine the relevance of documents to a query. Here's how cosine similarity
works in IR:
Cosine Similarity Formula:
Given two vectors A and B, the cosine similarity similarity(A, B) is calculated as:
similarity(A, B) = (A · B) / (||A|| × ||B||) = Σ AᵢBᵢ / (√Σ Aᵢ² × √Σ Bᵢ²)
where A · B is the dot product of the vectors and ||A||, ||B|| are their magnitudes.
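A direct translation of this formula into Python might look as follows; the four-dimensional vectors are hypothetical TF-IDF weights used only for illustration.

import math

def cosine_similarity(a, b):
    # Cosine of the angle between vectors a and b (same length).
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

doc = [2, 1, 0, 1]    # hypothetical TF-IDF weights for a document
query = [1, 0, 0, 1]  # hypothetical weights for a query
print(round(cosine_similarity(doc, query), 3))  # 0.866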
-----------------------------------------------------------------------------------------------------------------
Query-document matching
Query-document matching is a fundamental concept in Information Retrieval (IR) systems, where the
goal is to retrieve and rank documents based on their relevance to a given user query. The process
involves comparing the content of documents against the content of the query to identify relevant
documents. Here's how query-document matching typically works:
1. Query Processing:
• The user submits a query to the IR system.
• The query undergoes preprocessing steps, including tokenization, stemming, stop-word
removal, and possibly other normalization techniques to prepare it for matching against the
documents.
2. Term Matching:
• The preprocessed query terms are matched against the indexed documents to identify
documents containing the query terms.
• Documents that contain all or some of the query terms are candidates for retrieval.
3. Scoring and Ranking:
• Once candidate documents are identified, a relevance score is assigned to each document based
on its similarity to the query.
• Various scoring methods, such as TF-IDF, BM25, or machine learning-based models, may be
used to calculate the relevance score.
• The documents are ranked based on their relevance scores, with the most relevant documents
appearing at the top of the search results.
4. Retrieval and Presentation:
• The top-ranked documents are retrieved from the index and presented to the user as search
results.
• The user can then review the search results and select relevant documents based on their
information needs.
Techniques for Query-Document Matching:
1. Exact Matching:
• Documents are retrieved only if they contain all the terms in the query.
• The Boolean model is an example of exact matching.
2. Partial Matching:
• Documents are retrieved based on the presence of some terms in the query.
• Vector space model and probabilistic models often allow for partial matching.
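One way to sketch the whole matching pipeline (preprocessing, term weighting, scoring, ranking) is with scikit-learn, assuming it is installed; this is an illustrative sketch, not the only way to implement query-document matching.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "information retrieval techniques",
    "document indexing methods",
    "retrieval models in IR",
]
query = "retrieval models"

vectorizer = TfidfVectorizer()               # tokenization + TF-IDF weighting
doc_vectors = vectorizer.fit_transform(docs)
query_vector = vectorizer.transform([query])

scores = cosine_similarity(query_vector, doc_vectors).ravel()
ranked = sorted(zip(scores, range(len(docs))), reverse=True)
for score, i in ranked:                      # most relevant first
    print(f"doc {i}: {score:.3f}  {docs[i]}")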
The following simple flow outlines the process of handling spelling errors in a search engine:
1. User Inputs Query: The user inputs a query into the search engine.
2. Spell Check: The search engine performs a spell check on the query. If the query contains a
spelling error, it proceeds to suggest a correction.
3. Did you mean: The search engine suggests a corrected query to the user based on the spell
check results. If the user accepts the correction, the search engine proceeds to search for
documents related to the corrected query.
4. Search Results: The search engine retrieves and presents relevant search results based on the
corrected query.
This flow illustrates a basic approach to handling spelling errors in a search engine, including
spell checking, suggestion, and search result presentation. More sophisticated systems may
incorporate additional steps, such as fuzzy matching algorithms and user feedback mechanisms, to
further improve search accuracy and user experience.
-------------------------------------------------------------------------------------------------------------------
In addition to the strategies mentioned earlier, there are several other considerations and
techniques relevant to addressing spelling errors in queries and documents:
1. Language Models: Advanced language models like BERT (Bidirectional Encoder
Representations from Transformers) and GPT (Generative Pre-trained Transformer) have been
trained on vast amounts of text data and can assist in understanding context even with
misspelled words. Fine-tuning these models on domain-specific data can enhance their ability to
handle spelling variations.
2. Word Embeddings: Word embedding techniques such as Word2Vec and GloVe can capture
semantic relationships between words. By leveraging word embeddings, it's possible to identify
similar words or phrases that may correspond to misspelled terms, thereby improving search
accuracy.
3. Probabilistic Models: Probabilistic models like the noisy channel model and edit distance
algorithms (e.g., Levenshtein distance) estimate the likelihood of certain spelling corrections
given the observed misspelled words. These models are widely used in spell checking and
correction systems.
-------------------------------------------------------------------------------------------------------------------
Edit distance and string similarity measures
Edit distance and string similarity measures are fundamental concepts in computer science and natural
language processing that quantify the similarity between two strings. They are widely used in tasks
such as spell checking, fuzzy string matching, and information retrieval. Let's explore each concept:
Edit Distance:
Edit distance, also known as Levenshtein distance, measures the minimum number of single-character
edits (insertions, deletions, or substitutions) required to transform one string into another.
For example, the edit distance between "kitten" and "sitting" is 3, achieved by the following
transformations:
1. Substituting 's' for 'k'
2. Substituting 'i' for 'e'
3. Inserting 'g' at the end
The computation of edit distance is typically done using dynamic programming algorithms, such as the
Wagner-Fischer algorithm, which efficiently computes the minimum edit distance between two strings.
Let's walk through an example using edit distance and string similarity measures:
Example: Suppose we have two strings: "kitten" and "sitting".
1. Edit Distance: We want to find the minimum number of single-character edits (insertions,
deletions, or substitutions) required to transform "kitten" into "sitting".
Using dynamic programming, we can compute the edit distance between "kitten" and "sitting", which is 3.
2. String Similarity Measures: Let's explore some string similarity measures between "kitten"
and "sitting".
• Jaccard Similarity: This measure compares the similarity between two sets. Let's
consider the sets of characters in each string:
• Set 1: {'k', 'i', 't', 'e', 'n'}
• Set 2: {'s', 'i', 't', 'n', 'g'}
The Jaccard similarity is the size of the
intersection divided by the size of the union: J(S1, S2) = |S1 ∩ S2| / |S1 ∪ S2| =
|{'i', 't', 'n'}| / |{'k', 'i', 't', 'e', 'n', 's', 'g'}| = 3/7.
• Cosine Similarity: We represent the strings as vectors in a vector space. Each
dimension represents the frequency of a character in the string. Then, we calculate the
cosine of the angle between the vectors.
• Vector 1 ("kitten"): (1, 1, 2, 1, 1, 0, 0) (frequencies of 'k', 'i', 't', 'e', 'n', 's', 'g')
• Vector 2 ("sitting"): (0, 2, 2, 0, 1, 1, 1) over the same dimensions. The cosine similarity is the dot product of the
vectors divided by the product of their magnitudes: 7 / (√8 × √11) ≈ 0.75.
• Jaro-Winkler Similarity: This measure takes into account the number of matching
characters and transpositions. It is more complex and involves a formula to calculate a
similarity score between two strings.
These measures provide different perspectives on the similarity between "kitten" and "sitting" based on
various criteria, including character overlap, sequence alignment, and phonetic similarity. Each
measure has its strengths and weaknesses depending on the specific task and context of the
comparison.
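The Jaccard and character-frequency cosine values above can be checked with a few lines of Python; jaccard and char_cosine are illustrative helper names.

import math
from collections import Counter

def jaccard(s1, s2):
    # Set-based overlap of the characters in the two strings.
    a, b = set(s1), set(s2)
    return len(a & b) / len(a | b)

def char_cosine(s1, s2):
    # Cosine similarity over character-frequency vectors.
    c1, c2 = Counter(s1), Counter(s2)
    chars = set(c1) | set(c2)
    dot = sum(c1[ch] * c2[ch] for ch in chars)
    return dot / (math.sqrt(sum(v * v for v in c1.values())) *
                  math.sqrt(sum(v * v for v in c2.values())))

print(round(jaccard("kitten", "sitting"), 4))      # 0.4286 (= 3/7)
print(round(char_cosine("kitten", "sitting"), 3))  # 0.746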
Let's delve deeper into the concepts of edit distance and string similarity measures:
Edit Distance:
Edit distance, also known as Levenshtein distance, is a metric used to quantify the similarity between
two strings. It measures the minimum number of single-character edits (insertions, deletions, or
substitutions) required to transform one string into another.
Calculation of Edit Distance:
The calculation of edit distance is typically done using dynamic programming algorithms, such as the
Wagner-Fischer algorithm. Here's a high-level overview of how the algorithm works:
1. Create a matrix where the rows correspond to characters of the first string and the columns
correspond to characters of the second string.
2. Initialize the first row and column with incremental values representing the number of
characters in each string.
3. Traverse the matrix row by row, filling in each cell with the minimum of the following three
operations:
• If the characters at the current positions match, take the value from the diagonal cell
(representing no edit).
• Otherwise, take the minimum of the value from the cell above (representing insertion),
the value from the cell to the left (representing deletion), and the diagonal value
(representing substitution), and add one.
4. The value in the bottom-right cell of the matrix represents the edit distance between the two
strings.
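Here is a minimal Python implementation of the Wagner-Fischer algorithm described above; the matrix orientation (rows for the first string, columns for the second) follows the steps just listed.

def edit_distance(s, t):
    # Wagner-Fischer dynamic programming: O(len(s) * len(t)) time.
    m, n = len(s), len(t)
    # dp[i][j] = edit distance between s[:i] and t[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i          # delete all of s[:i]
    for j in range(n + 1):
        dp[0][j] = j          # insert all of t[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s[i - 1] == t[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]           # match: no edit needed
            else:
                dp[i][j] = 1 + min(dp[i - 1][j],      # deletion
                                   dp[i][j - 1],      # insertion
                                   dp[i - 1][j - 1])  # substitution
    return dp[m][n]

print(edit_distance("kitten", "sitting"))  # 3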
-------------------------------------------------------------------------------------------------------------------
-------------------------------------------------------------------------------------------------------------------
Edit Distance using Recursion
Subproblems in Edit Distance:
The idea is to process the characters one by one, starting from either the left or the right end of both
strings.
Let us process from the right end of the strings. There are two possibilities for every pair of characters
being traversed: either they match or they don't. If the last characters of both strings match,
then no operation is needed, so we recursively calculate the answer for the rest of
the strings. When the last characters do not match, we can perform any of the three operations
(insert, replace, remove) to match the last characters, and we recursively calculate the
result for the remaining part of the string. We then select the
minimum among the three answers.
The recursion proceeds as follows:
When the last characters of the strings match: make a recursive call EditDistance(M-1, N-1) to
calculate the answer for the remaining part of the strings.
When the last characters of the strings don't match: make three recursive calls, as shown below:
• Insert str1[N-1] at the end of str2: EditDistance(M, N-1)
• Replace str2[M-1] with str1[N-1]: EditDistance(M-1, N-1)
• Remove str2[M-1]: EditDistance(M-1, N)
Recurrence Relations for Edit Distance:
• Case 1: When the last characters of both strings are the same:
EditDistance(str1, str2, M, N) = EditDistance(str1, str2, M-1, N-1)
• Case 2: When the last characters are different:
EditDistance(str1, str2, M, N) = 1 + min{ EditDistance(str1, str2, M-1, N-1),
EditDistance(str1, str2, M, N-1), EditDistance(str1, str2, M-1, N) }
Base Cases for Edit Distance:
• Case 1: When str1 becomes empty, i.e., M = 0:
• return N, as it requires N insertions to convert an empty string into a string of size N
• Case 2: When str2 becomes empty, i.e., N = 0:
• return M, as it requires M insertions to convert an empty string into a string of size M
• Time Complexity: O(3^m) in the worst case, when none of the characters of the two strings
match, since each call can branch into three further calls.
Auxiliary Space: O(1), apart from the recursion stack, which can grow to O(M + N)
• Edit Distance Using Dynamic Programming (Memoization):
• In the above recursive approach, there are several overlapping subproblems:
Edit_Distance(M-1, N-1) is called three times
Edit_Distance(M-1, N-2) is called two times
Edit_Distance(M-2, N-1) is called two times, and so on
• So, we can use the memoization technique to store the result of each subproblem and avoid
recalculating it again and again, as in the sketch below.
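A minimal top-down sketch of the memoized recursion, using Python's functools.lru_cache to store subproblem results; the base cases and recurrence follow the ones given above.

from functools import lru_cache

def edit_distance_memo(str1, str2):
    # Top-down version of the recurrence above; memoization collapses
    # the overlapping subproblems to O(M * N) distinct states.

    @lru_cache(maxsize=None)
    def solve(m, n):
        if m == 0:                        # base case: str1 exhausted
            return n
        if n == 0:                        # base case: str2 exhausted
            return m
        if str1[m - 1] == str2[n - 1]:    # last characters match
            return solve(m - 1, n - 1)
        return 1 + min(solve(m, n - 1),      # insert
                       solve(m - 1, n - 1),  # replace
                       solve(m - 1, n))      # remove

    return solve(len(str1), len(str2))

print(edit_distance_memo("kitten", "sitting"))  # 3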
By employing these techniques, IR systems can effectively correct spelling errors and enhance the
accuracy of search results, improving the overall user experience.
In addition to the techniques mentioned earlier, there are a few more advanced approaches and
considerations for spelling correction in IR systems:
By incorporating these advanced techniques and considerations, IR systems can achieve robust and
efficient spelling correction capabilities across diverse contexts and languages.
Example:
Let's consider an example scenario of spelling correction in an information retrieval system:
Suppose we have an IR system that allows users to search for documents in a large collection of
scientific papers related to biology and genetics. A user enters the query "gene expretion regulation"
into the search bar, intending to find documents about gene expression regulation.
Here's how the system might process the query using various spelling correction techniques:
1. Dictionary-Based Correction: The system first checks the query words against its dictionary
of correctly spelled words. It identifies "expretion" as a misspelled word since it's not found in
the dictionary. The system suggests corrections based on similar words like "expression."
2. Edit Distance Algorithms: Using an edit distance algorithm like Levenshtein distance, the
system calculates the distance between "expretion" and words in the dictionary. It finds that
"expression" has a low edit distance and suggests it as the correction.
3. Phonetic Matching: The system may employ a phonetic matching algorithm like Soundex or
Metaphone to find phonetically similar words. Even though "expretion" and "expression" might
not be phonetically similar, such techniques can help in other cases where phonetic similarity is
more apparent.
4. N-gram Language Models: The system analyzes the surrounding words and their frequency
in the document collection. It observes that "gene expression regulation" is a common phrase in
the corpus, and "expretion" is likely a misspelling based on the context.
5. User Feedback and Learning: If users frequently click on search results related to "gene
expression regulation" after typing "gene expretion regulation," the system learns from this
feedback and may prioritize "expression" as the correct spelling in future corrections.
In this example, the system applies a combination of techniques to identify and correct the spelling
error in the user query, ultimately improving the relevance and accuracy of search results returned to
the user.
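As a toy version of the dictionary-based step, the sketch below uses Python's standard difflib module to pick the closest vocabulary word. The vocabulary list is hypothetical, and difflib's ratio is a similarity heuristic rather than a true edit distance, so this is only an illustration of the idea.

import difflib

# Hypothetical domain vocabulary; a real system would derive this
# from the indexed collection.
vocabulary = ["gene", "expression", "regulation", "genetics", "protein"]

def correct(word, vocab):
    # Suggest the closest dictionary word, if any is similar enough.
    matches = difflib.get_close_matches(word, vocab, n=1, cutoff=0.7)
    return matches[0] if matches else word

query = "gene expretion regulation"
corrected = " ".join(correct(w, vocabulary) for w in query.split())
print(corrected)  # gene expression regulation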
-------------------------------------------------------------------------------------------------------------------
Terminology
▪ These are character bigrams:
▪ st, pr, an …
▪ These are word bigrams:
▪ palo alto, flying from, road repairs
▪ In today’s class, we will generally deal with word bigrams
▪ In the accompanying Coursera lecture, we mostly deal with character bigrams (because we
cover stuff complementary to what we’re discussing here)
▪ Similarly we speak of trigrams and, in general, k-grams
▪ Isolated (independent) word spelling correction considers each query word on its own, without its surrounding context
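Extracting character and word bigrams takes only a couple of lines of Python; the helper names below are illustrative.

def char_bigrams(word):
    # Consecutive pairs of characters.
    return [word[i:i + 2] for i in range(len(word) - 1)]

def word_bigrams(text):
    # Consecutive pairs of words.
    tokens = text.split()
    return list(zip(tokens, tokens[1:]))

print(char_bigrams("strap"))             # ['st', 'tr', 'ra', 'ap']
print(word_bigrams("flying from palo alto"))
# [('flying', 'from'), ('from', 'palo'), ('palo', 'alto')]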
-------------------------------------------------------------------------------------------------------------------
CHAPTER V: PERFORMANCE EVALUATION
Topics Covered: Evaluation metrics: precision, recall, F-measure, average precision, Test collections
and relevance judgments, Experimental design and significance testing
-------------------------------------------------------------------------------------------------------------------
Precision measures the relevance of the retrieved documents/items among all the retrieved ones. It
helps answer the question: "Of all the items retrieved, how many are relevant?"
Formula:
Precision = (Number of relevant documents retrieved) / (Total number of documents retrieved)
Precision focuses on the accuracy of the retrieval system. A high precision indicates that a large
proportion of the retrieved documents are relevant to the user's query.
Recall:
Recall measures the completeness of the retrieval system by quantifying the proportion of relevant
documents retrieved out of all the relevant documents available. It answers the question: "Of all the
relevant items available, how many were retrieved?"
Formula:
Recall = (Number of relevant documents retrieved) / (Total number of relevant documents in the collection)
• Interpretation: While precision and recall are informative individually, they are often
considered together to give a more comprehensive understanding of the system's performance.
• User-Centric: The interpretation of precision and recall can vary based on user needs. For
instance, in a medical information retrieval system, high recall may be more critical to ensure
all relevant research papers are retrieved, even if it means a few irrelevant ones slip through.
In summary, precision and recall are crucial metrics in evaluating the effectiveness of IR systems. They
provide insights into how well the system retrieves relevant information and how accurately it filters
out irrelevant content, ultimately contributing to the overall user satisfaction and utility of the system.
-------------------------------------------------------------------------------------------------------------------
Recall and precision are two important metrics used to evaluate the performance of systems,
particularly in the context of information retrieval, search engines, and machine learning classifiers.
While both metrics measure aspects of a system's effectiveness, they focus on different aspects of
performance:
1. Recall:
• Recall, also known as sensitivity or true positive rate, measures the ability of a system to
retrieve all relevant items from the total pool of relevant items.
• It answers the question: "Of all the relevant items that exist, how many did the system
retrieve?"
• Mathematically, recall is calculated as the ratio of the number of true positive results to
the total number of relevant items:
• Recall = TP / (TP + FN), where TP is true positives and FN is false negatives
• A high recall value indicates that the system is successfully retrieving a large proportion
of the relevant items, even if it may also retrieve some irrelevant ones.
2. Precision:
• Precision measures the proportion of retrieved items that are relevant among all the
retrieved items.
• It answers the question: "Of all the items retrieved by the system, how many are
relevant?"
• Mathematically, precision is calculated as the ratio of true positive results to the total
number of retrieved items:
• Precision = TP / (TP + FP), where FP is false positives
• A high precision value indicates that a large proportion of the retrieved items are
relevant to the user's query, minimizing the presence of irrelevant results.
In summary, recall focuses on the system's ability to capture all relevant items, regardless of how
many irrelevant items are retrieved along with them, while precision focuses on the system's ability to
retrieve relevant items accurately, minimizing the inclusion of irrelevant items in the results. These
metrics are often used together to provide a comprehensive evaluation of system performance,
particularly in tasks such as information retrieval, document classification, and search engine
evaluation.
-------------------------------------------------------------------------------------------------------------------
1. F-Measure:
The F-measure combines precision and recall into a single metric. The F1 score is the harmonic mean of precision and recall:
F1 = 2 × (Precision × Recall) / (Precision + Recall)
The F1 score ranges from 0 to 1, where 1 indicates perfect precision and recall, while 0 indicates poor
performance in either precision or recall.
2. Average Precision:
Average precision is a metric used to evaluate the performance of an IR system, especially in ranked
retrieval scenarios. It calculates the average precision at each relevant document retrieved.
Calculation:
1. For each relevant document in the retrieved list, calculate the precision at that point.
2. Average all the precisions across the relevant documents.
Average precision is particularly useful in scenarios where the system returns a ranked list of
documents, such as web search engines. It provides a measure of how well the system ranks relevant
documents.
3. Test Collections:
Test collections are curated datasets used to evaluate the performance of IR systems. These collections
typically contain:
• A set of queries: These are representative search queries that users might enter into the
system.
• A set of documents: A collection of documents that the system can retrieve results from.
• Relevance judgments: Annotations indicating which documents in the collection are relevant for
each query.
Test collections provide a standardized way to evaluate the effectiveness of IR systems across different
algorithms and approaches.
4. Relevance Judgments:
Relevance judgments are annotations that indicate the relevance of documents to specific queries
within a test collection. They are typically provided by human assessors who evaluate the documents
based on predefined relevance criteria.
Relevance judgments can be binary (relevant or non-relevant) or graded (with degrees of relevance).
They serve as ground truth labels against which the performance of IR systems is evaluated.
In summary, F-measure, average precision, test collections, and relevance judgments are essential
components of the evaluation process in Information Retrieval. They help measure the effectiveness
and performance of IR systems, providing insights into their precision, recall, ranking capabilities, and
overall retrieval quality.
-------------------------------------------------------------------------------------------------------------------
Let's provide examples for each of the concepts discussed in the context of Information Retrieval (IR):
1. F-measure:
Suppose we have an IR system designed to retrieve relevant documents for a set of queries. After
running the system, we evaluate its performance using precision and recall. Let's say the precision is
0.75 and the recall is 0.80. We can calculate the F1 score as follows:
F1 = 2 × (0.75 × 0.80) / (0.75 + 0.80) = 1.2 / 1.55 ≈ 0.774
2. Average Precision:
Imagine we have a test collection consisting of 10 queries and corresponding relevant documents.
After running our IR system for each query, we obtain ranked lists of retrieved documents. We then
calculate precision at each relevant document position and average them across all queries.
For example, if the average precision for query 1 is 0.8, for query 2 is 0.6, and so forth, we
sum these values and divide by the total number of queries to obtain the mean average precision (MAP) over the query set.
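The worked numbers above can be reproduced with a few small Python functions; the ranked list and relevance judgments in the usage lines are hypothetical.

def precision(retrieved, relevant):
    return len(set(retrieved) & set(relevant)) / len(retrieved)

def recall(retrieved, relevant):
    return len(set(retrieved) & set(relevant)) / len(relevant)

def f1(p, r):
    # Harmonic mean of precision and recall.
    return 2 * p * r / (p + r) if p + r else 0.0

def average_precision(ranked, relevant):
    # Average of the precision values taken at the rank of each
    # relevant document in the ranked result list.
    relevant = set(relevant)
    hits, precisions = 0, []
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(relevant) if relevant else 0.0

print(round(f1(0.75, 0.80), 3))  # 0.774, matching the example above
print(average_precision(["d1", "d2", "d3", "d4"], ["d1", "d3"]))
# precision is 1/1 at d1 and 2/3 at d3, so AP = (1 + 2/3) / 2 ≈ 0.833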
Experimental Design:
Experimental design refers to the process of planning and conducting experiments to evaluate the
performance of IR systems. It involves several key steps:
1. Formulating Research Questions: Clearly define the research questions and objectives of the
study. What aspects of the IR system do you want to evaluate? What hypotheses are you
testing?
2. Selection of Evaluation Measures: Choose appropriate evaluation measures based on the
research questions and the specific task of the IR system. Common measures include precision,
recall, F1 score, mean average precision (MAP), normalized discounted cumulative gain (nDCG),
etc.
3. Selection of Test Collections: Choose suitable test collections that reflect the characteristics
of the real-world data and the tasks the IR system is designed for. Test collections should
include queries, relevant documents, and relevance judgments.
4. Experimental Setup: Define the experimental setup, including the selection of baseline
methods, parameter settings, preprocessing techniques, and experimental protocols.
5. Cross-Validation and Replication: Use techniques like cross-validation to ensure the
robustness and generalizability of the results. Replicate experiments with different datasets and
settings to validate the findings.
6. Controlled Variables: Control for variables that could impact the results, such as hardware
configurations, indexing techniques, retrieval algorithms, and user interfaces.
------------------------------------------------------------------------------------------------------------------
Significance Testing:
Significance testing is used to determine whether observed differences or effects in experimental data
are statistically significant or simply due to chance. It helps researchers make inferences about the
population based on sample data.
1. Hypothesis Formulation: Formulate null and alternative hypotheses based on the research
questions. The null hypothesis typically assumes that there is no significant difference between
groups or conditions, while the alternative hypothesis suggests otherwise.
2. Selection of Statistical Test: Choose an appropriate statistical test based on the research
design, data distribution, and nature of the variables being analyzed. Common tests include t-
tests, chi-square tests, ANOVA, Mann-Whitney U test, etc.
3. Calculation of P-value: Perform the statistical test and calculate the p-value, which
represents the probability of observing the data or more extreme results under the assumption
that the null hypothesis is true.
4. Interpretation of Results: Compare the obtained p-value with the significance level (alpha),
typically set at 0.05. If the p-value is less than alpha, the null hypothesis is rejected, indicating
that the observed difference is statistically significant.
5. Effect Size: Consider the effect size in addition to statistical significance to assess the practical
importance or magnitude of the observed differences.
6. Multiple Comparisons: Adjust for multiple comparisons if conducting multiple tests to control
the family-wise error rate or false discovery rate.
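As a minimal sketch of steps 2-4, the snippet below runs a paired t-test over hypothetical per-query scores for two systems, assuming SciPy is available; the score lists are invented for illustration.

from scipy import stats

# Hypothetical per-query average precision for two systems
# evaluated on the same 10 queries.
system_a = [0.61, 0.55, 0.70, 0.48, 0.66, 0.59, 0.73, 0.52, 0.64, 0.58]
system_b = [0.58, 0.49, 0.66, 0.50, 0.60, 0.55, 0.69, 0.47, 0.61, 0.54]

t_stat, p_value = stats.ttest_rel(system_a, system_b)  # paired t-test
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
if p_value < 0.05:
    print("Reject the null hypothesis: the difference is significant.")
else:
    print("Fail to reject the null hypothesis.")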
In summary, experimental design and significance testing play pivotal roles in the evaluation and
validation of IR systems. They provide a structured framework for conducting experiments, interpreting
results, and drawing meaningful conclusions about system performance and effectiveness.
• User Experience (UX) Evaluation: Evaluate the overall user experience, including ease of
use, system responsiveness, interface design, and relevance of retrieved results.
Contextual Evaluation:
• Task-Based Evaluation: Evaluate the IR system in the context of specific tasks or user
scenarios to assess its practical utility and effectiveness.
• Domain-Specific Evaluation: Consider the unique characteristics and requirements of
different application domains when designing experiments and evaluating system performance.
Error Analysis:
• Error Analysis: Conduct thorough error analysis to identify common sources of errors, such as
false positives, false negatives, and misclassifications.
• Root Cause Analysis: Investigate the underlying reasons behind errors and explore strategies
to mitigate them.
Longitudinal Studies:
• Long-Term Evaluation: Conduct longitudinal studies to assess the stability and performance
of the IR system over time, considering factors such as system drift, user dynamics, and
evolving information needs.
Multi-Modal Evaluation:
• Multi-Modal IR: Evaluate IR systems that support multiple modalities, such as text, image,
audio, and video, considering the unique challenges and evaluation metrics associated with
each modality.
• Open Science Practices: Embrace open science principles by sharing code, datasets, and
research findings openly to foster collaboration and transparency in the IR community.
By considering these additional factors and best practices, researchers and practitioners can conduct
more comprehensive and rigorous evaluations of IR systems, leading to more meaningful insights and
advancements in the field.
-------------------------------------------------------------------------------------------------------------------
CHAPTER VI: TEXT CATEGORIZATION AND FILTERING
Topics covered: Text classification algorithms: Naive Bayes, Support Vector Machines, Feature
selection and dimensionality reduction, Applications of text categorization and filtering
In Information Retrieval (IR), text classification algorithms are essential for tasks such as document
categorization, sentiment analysis, spam detection, and more. Here's an overview of some common
algorithms and techniques used in text classification within IR:
1. Naive Bayes (NB):
• Naive Bayes classifiers are based on Bayes' theorem and assume that features are
conditionally independent given the class label.
• In text classification, Naive Bayes is often used due to its simplicity, efficiency, and
effectiveness, especially with large feature spaces.
• It works well with text data and can handle high-dimensional feature spaces efficiently.
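A minimal Naive Bayes text classifier can be sketched with scikit-learn as below, assuming it is installed; the tiny spam/ham training set is hypothetical, and with so little data the predictions are only indicative.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical training data for spam filtering.
texts = [
    "win a free prize now", "limited offer click here",
    "meeting agenda for monday", "project report attached",
]
labels = ["spam", "spam", "ham", "ham"]

# Bag-of-words counts feed a multinomial Naive Bayes model.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)

print(model.predict(["free prize offer"]))        # likely ['spam']
print(model.predict(["monday project meeting"]))  # likely ['ham']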
-------------------------------------------------------------------------------------------------------------------
Applications of text categorization and filtering
Text categorization and filtering have numerous applications across various domains. Here are some
prominent examples:
1. Email Spam Filtering:
• Example: Email providers classify incoming messages as spam or legitimate mail based on their
content, sender information, and other features, filtering unwanted messages out of the inbox.
Here are a few more examples of text categorization and filtering applications:
11. Product Review Analysis:
• Example: E-commerce platforms like Amazon categorize and analyze product reviews to provide
insights to manufacturers and other consumers about the quality, features, and satisfaction
levels associated with different products.
12. Content Recommendation Systems:
• Example: Streaming services like Netflix and Spotify use text categorization algorithms to
analyze user preferences and behavior, categorize content based on genres, themes, and
attributes, and recommend personalized movies, shows, and music playlists to users.
13. Content Tagging and Metadata Management:
• Example: Content management systems and digital asset management platforms use text
categorization to automatically tag and classify digital assets such as images, videos, and
documents, making it easier to search, organize, and retrieve content.
14. Online Advertising Targeting:
• Example: Ad networks and digital marketing platforms analyze website content and user
behavior to categorize web pages and target advertisements based on user interests,
demographics, and preferences.
15. Language Identification and Translation:
• Example: Language identification algorithms classify text into different languages, enabling
multilingual search engines, translation services, and global communication platforms to
accurately detect and translate text across language barriers.
16. Knowledge Base Construction and Ontology Development:
• Example: Text categorization algorithms are used to extract and categorize information from
unstructured text sources such as websites, documents, and articles, enabling the construction
of knowledge bases and the development of ontologies for knowledge representation and
semantic web applications.
17. Event Detection and Trend Analysis:
• Example: Social media monitoring tools analyze text data from social media platforms to detect
and categorize events, trends, and discussions in real-time, helping organizations and
governments track public opinion, emerging issues, and crisis events.
These additional examples demonstrate the diverse range of applications and industries where text
categorization and filtering techniques are utilized to automate processes, extract insights, and
enhance decision-making capabilities.
Let's illustrate K-means and hierarchical clustering with an example in the context of Information
Retrieval (IR):
Example: Document Clustering in IR
Suppose we have a collection of news articles from different categories such as sports, politics,
technology, and entertainment. Our goal is to cluster these articles based on their content using K-
means and hierarchical clustering techniques.
Step 1: Document Representation
We represent each document using the TF-IDF (Term Frequency-Inverse Document Frequency) vector
representation. Each document becomes a high-dimensional vector where each dimension represents
the importance of a term in that document relative to the entire corpus.
Step 2: K-means Clustering
Let's say we want to group the documents into 4 clusters (K=4).
1. Initialization: Randomly select 4 initial cluster centroids.
2. Assignment Step: Assign each document to the cluster whose centroid is closest to it based
on cosine similarity or Euclidean distance.
3. Update Step: Recalculate the centroids of the clusters based on the mean of the documents in
each cluster.
4. Iterations: Repeat the assignment and update steps until convergence or until a maximum
number of iterations is reached.
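A compact sketch of steps 1-4 using scikit-learn (assuming it is installed) is shown below; for brevity it clusters four hypothetical snippets into K=2 clusters rather than K=4, and the exact cluster label numbers may vary with initialization.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Hypothetical news snippets from two broad topics.
articles = [
    "the team won the championship game last night",
    "star player scores in the final match",
    "parliament passed the new budget bill",
    "the minister announced election reforms",
]

# Step 1: TF-IDF document representation.
vectors = TfidfVectorizer(stop_words="english").fit_transform(articles)

# Steps 2-4: initialize, assign, update, iterate until convergence.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
labels = kmeans.fit_predict(vectors)
print(labels)  # e.g. [0 0 1 1]: sports vs. politics articles grouped together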
Example Summary:
Let's tie these concepts together with an example:
Suppose we're developing a news aggregation platform. We apply K-means clustering to group news
articles into topics. Evaluation metrics such as silhouette score and cluster purity help assess the
quality of clustering. Users searching for "climate change" can benefit from query expansion, where
terms from clusters related to "environmental science" and "sustainability" are added to the query.
When presenting search results, clustering aids in grouping articles under different facets like "policy
implications," "scientific research," and "public awareness campaigns," enhancing user exploration and
comprehension.
This example illustrates how evaluation, query expansion, and result grouping with clustering can
enhance the Information Retrieval process, providing users with more relevant and organized
information.
1] Web crawler
Search engine bots, web robots, and spiders are other names for web crawlers. Crawling is crucial to
search engine optimization (SEO) strategy. A crawler is essentially a piece of software that browses the
web, downloads pages, and gathers their content.
The following web crawler features can affect the search results:
o Included Pages
o Excluded Pages
o Document Types
o Frequency of Crawling
2] Database
A search engine database is an example of a non-relational database.
It is where all of the crawled web data is kept, and it holds a large number of online resources. Amazon
Elasticsearch Service and Splunk are two of the most well-known search engine databases.
The following two characteristics of the database may have an impact on search results:
o The size (dimensions) of the database
o The recentness (freshness) of the database
3] Search interfaces
The search interface is one of the most crucial elements of a search engine. It serves as the user's
interface to the database and, in essence, helps users formulate and submit their queries.
The following search interface features can affect the search results:
o Operators
o Phrase Searching
o Truncation
4] Ranking algorithms
Search engines such as Google use ranking algorithms to determine the order of web pages in their search results.
The following ranking factors have an impact on the search results:
o Location and frequency of query terms
o Link evaluation
o Clickthrough analysis
Architecture:
1. Crawling: The process of discovering and fetching web pages from the internet.
2. Indexing: Analyzing and storing the content of web pages in a searchable index.
3. Query Processing: Interpreting and executing user queries against the indexed data.
4. Ranking: Determining the relevance of indexed pages to the user query and ranking them
accordingly.
5. User Interface: Presenting search results to the user in a user-friendly manner.
Challenges:
1. Scale: The web is vast and constantly expanding, requiring search engines to crawl and index
billions of pages.
2. Freshness: Keeping indexed content up-to-date with the rapidly changing web.
3. Relevance: Providing users with relevant and diverse search results for their queries.
4. Spam and Manipulation: Dealing with spam, low-quality content, and attempts to manipulate
search engine rankings.
5. Diversity of Content: Indexing and retrieving various types of content, including text, images,
videos, and multimedia.
Web search faces several challenges due to the dynamic nature of the web, the vast amount of
information available, and the diverse needs and behaviors of users. Some of the key challenges in
web search include:
Information Overload: The web contains an enormous volume of information, and users often
struggle to find relevant content amidst the abundance of data. Information overload can lead to user
frustration and difficulties in locating specific information.
Quality and Trustworthiness: Ensuring the quality and trustworthiness of information retrieved from
the web is a significant challenge. The web contains a mix of reliable, authoritative sources and
unreliable or misleading content. Users may encounter misinformation, fake news, and biased
perspectives, which can undermine the credibility of search results.
Dynamic and Evolving Content: The web is constantly evolving, with new content being created,
updated, and removed at a rapid pace. Search engines must continuously crawl, index, and update
their databases to reflect the latest information available on the web.
Multimedia Content: The increasing prevalence of multimedia content, including images, videos,
audio files, and interactive media, presents challenges for search engines in effectively indexing,
analyzing, and retrieving non-textual content.
Multilingual and Multicultural Content: The web is a global platform with content available in
multiple languages and tailored to diverse cultural contexts. Search engines must support multilingual
search capabilities and account for cultural differences in language usage, terminology, and
preferences.
Personalization and Privacy: Balancing the need for personalized search experiences with user
privacy concerns is a challenge for search engines. While personalized search results can enhance user
satisfaction and relevance, they also raise privacy issues related to data collection, tracking, and
profiling.
Semantic Understanding: Improving the semantic understanding of search queries and web content
is an ongoing challenge. Search engines must go beyond keyword matching and incorporate natural
language processing, entity recognition, and semantic analysis techniques to better understand the
meaning and context of user queries and web documents.
Mobile and Voice Search: The increasing prevalence of mobile devices and voice-activated assistants
has transformed user search behavior. Search engines must adapt to the unique characteristics of
mobile and voice search, including shorter queries, location-based information, and conversational
language.
Addressing these challenges requires ongoing research, innovation, and collaboration among search
engine providers, information retrieval experts, web developers, and other stakeholders to enhance the
quality, relevance, and accessibility of web search experiences.
-------------------------------------------------------------------------------------------------------------------
Crawling and Indexing Web Pages:
Crawling:
• Crawlers, also known as spiders or bots, systematically navigate the web by following links from
one page to another.
• They discover new pages and update existing ones by fetching and analyzing their content.
• Example: Googlebot, the crawler used by Google, navigates the web, discovering and fetching
web pages to be indexed.
Web crawling is the process by which we gather pages from the Web, in order to index them and
support a search engine. The objective of crawling is to quickly and efficiently gather as many useful
web pages as possible, together with the link structure that interconnects them.
Robustness: The Web contains servers that create spider traps, which are generators of web pages
that mislead crawlers into getting stuck fetching an infinite number of pages in a particular domain.
Crawlers must be designed to be resilient to such traps. Not all such traps are malicious; some
are the inadvertent side-effect of faulty website development.
Politeness: Web servers have both implicit and explicit policies regulating the rate at which a crawler
can visit them. These politeness policies must be respected.
Crawler architecture
The simple scheme outlined above for crawling demands several modules that fit together as shown in
Figure 20.1.
1. The URL frontier, containing URLs yet to be fetched in the current crawl (in the case of continuous
crawling, a URL may have been fetched previously but is back in the frontier for re-fetching). We
describe this further
in Section 20.2.3.
2. A DNS resolution module that determines the web server from which to fetch the page specified by
a URL. We describe this further in Section 20.2.2.
3. A fetch module that uses the http protocol to retrieve the web page at a URL.
4. A parsing module that extracts the text and set of links from a fetched web page.
5. A duplicate elimination module that determines whether an extracted link is already in the URL
frontier or has recently been fetched.
Crawling is performed by anywhere from one to potentially hundreds of threads, each of which loops
through the logical cycle in Figure 20.1. These threads may be run in a single process, or be
partitioned amongst multiple processes running at different nodes of a distributed system. We begin by
assuming that the URL frontier is in place and non-empty and defer our description of the
implementation of the URL frontier to Section 20.2.3. We follow the progress of a single URL through
the cycle of being fetched, passing through various checks and filters, then finally (for continuous
crawling) being returned to the URL frontier.
A crawler thread begins by taking a URL from the frontier and fetching the web page at that URL,
generally using the http protocol.
The fetched page is then written into a temporary store, where a number of operations are performed
on it. Next, the page is parsed and the text as well as the links in it are extracted.
The text (with any tag information – e.g., terms in boldface) is passed on to the indexer.
Link information including anchor text is also passed on to the indexer for use in ranking in ways that
are described in Chapter 21. In addition, each extracted link goes through a series of tests to
determine whether the link should be added to the URL frontier.
First, the thread tests whether a web page with the same content has already been seen at another
URL. The simplest implementation for this would use a simple fingerprint such as a checksum (placed
in a store labelled "Doc FP’s" in Figure 20.1). A more sophisticated test would use shingles instead.
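To make the crawl cycle concrete, here is a minimal single-threaded sketch in Python. It is an illustration only, not a production crawler: it uses a FIFO URL frontier, a checksum fingerprint store for exact-duplicate pages, and a crude regular expression in place of a real parsing module, and it omits DNS caching, politeness delays, and robots.txt handling.

import hashlib
import re
from collections import deque
import requests  # assumed available; any HTTP client would do

def crawl(seed_urls, max_pages=20):
    frontier = deque(seed_urls)     # URL frontier: URLs yet to be fetched
    seen_urls = set(seed_urls)      # URL-level duplicate elimination
    doc_fps = set()                 # content fingerprints ("Doc FP's")
    while frontier and max_pages > 0:
        url = frontier.popleft()
        try:
            html = requests.get(url, timeout=5).text   # fetch module
        except requests.RequestException:
            continue
        # Checksum fingerprint: skip pages whose exact content was seen
        # before (a real system would use shingles for near-duplicates)
        fp = hashlib.md5(html.encode("utf-8", "ignore")).hexdigest()
        if fp in doc_fps:
            continue
        doc_fps.add(fp)
        max_pages -= 1
        # Parsing module (crude): pull absolute links out of the page
        for link in re.findall(r'href="(https?://[^"]+)"', html):
            if link not in seen_urls:
                seen_urls.add(link)
                frontier.append(link)   # new URL enters the frontier
    return seen_urls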
Indexing web pages is a crucial step in the process of organizing and making web content searchable.
Search engines like Google, Bing, and Yahoo use indexing to create a searchable database of web
pages that users can access through their search interfaces. Here's how indexing web pages typically
works:
1. Crawling: As mentioned earlier, web crawling is the process of systematically browsing the
internet to discover and retrieve web pages. Crawlers, also known as spiders or bots, visit web
pages by following hyperlinks from one page to another. They download the content of each
page they visit, including text, images, links, and metadata.
2. Parsing and Analyzing Content: Once the web crawler downloads the content of a web page,
it parses the HTML and extracts relevant information such as text content, headings, metadata, and outgoing links.
-------------------------------------------------------------------------------------------------------------------
Link Analysis and PageRank Algorithm in IR:
Link Analysis:
• Link analysis examines the structure of the web, particularly the network of hyperlinks between
web pages.
• It seeks to understand the relationships and importance of pages based on how they are linked
to by other pages.
• Example: Analyzing inbound links to a web page can provide insights into its popularity and
authority within the web ecosystem.
Link analysis in information retrieval (IR) refers to the process of analyzing the relationships between
documents based on hyperlinks. It's a fundamental concept used in various applications, particularly in
web search engines like Google, Bing, and Yahoo. Link analysis helps search engines understand the
structure and relevance of web pages by examining how they are linked together. Here's how link
analysis works in IR:
1. Hyperlink Structure: On the web, hyperlinks are used to connect one webpage to another.
Each hyperlink represents a relationship or connection between two web pages. By analyzing
these hyperlinks, search engines can uncover valuable information about the relationships and
authority of web pages.
2. PageRank Algorithm: One of the most famous algorithms used for link analysis is PageRank,
developed by Larry Page and Sergey Brin, the founders of Google. PageRank assigns a
numerical weight to each webpage based on the quantity and quality of inbound links it receives
from other pages. Pages with higher PageRank scores are considered more authoritative and
are likely to appear higher in search results.
3. Link-based Relevance: Search engines use link analysis to assess the relevance and
importance of a webpage based on its inbound and outbound links. A webpage that receives
many inbound links from other reputable and relevant sites is considered more authoritative
and trustworthy on a particular topic. Similarly, a webpage that links to other authoritative
pages may also gain credibility.
4. Anchor Text Analysis: In addition to analyzing the number and quality of links, search
engines also examine the anchor text (the clickable text of a hyperlink) used in inbound links to
determine the relevance and context of the linked content. Pages with descriptive and relevant
anchor text are likely to be considered more authoritative and relevant for specific search
queries.
5. Link Structure Analysis: Search engines analyze the overall structure of the link graph to
identify patterns, clusters, and communities of related web pages. This analysis helps improve the quality and relevance of search results.
-------------------------------------------------------------------------------------------------------------------
PageRank Algorithm:
• PageRank is a link analysis algorithm developed by Larry Page and Sergey Brin, the founders of
Google.
• It assigns a numerical value (PageRank score) to each web page based on the quantity and
quality of inbound links.
• Pages with higher PageRank scores are considered more important and are likely to rank higher
in search engine results.
• Example: Suppose Page A has many high-quality inbound links from reputable websites, while
Page B has fewer links from less authoritative sources. Page A is likely to have a higher
PageRank score and rank higher in search results.
PageRank Algorithm:
1. Concept of Page Importance:
• The PageRank algorithm views the web as a network of interconnected pages, where
each page is considered a node.
• The importance of a page is determined by the number and quality of inbound links it
receives from other pages.
• Pages with many inbound links from high-quality and authoritative sources are
considered more important.
2. Iterative Calculation:
• PageRank is calculated iteratively, over multiple rounds of computation.
• At each iteration, the PageRank score of each page is updated based on the PageRank
scores of the pages linking to it.
3. Damping Factor:
• The PageRank algorithm incorporates a damping factor, typically set to 0.85, to model
the behavior of web users who may randomly jump from one page to another.
• The damping factor ensures that even pages with no outbound links receive a small
fraction of PageRank from every page on the web.
4. Formula for PageRank Calculation:
• The PageRank score PR(u) of a page u is calculated from the PageRank scores of all pages v linking to u, each divided by the number of outbound links L(v) from page v and scaled by the damping factor d, plus a small teleportation share:
• PR(u) = (1 − d)/N + d · Σ_{v→u} PR(v)/L(v)
• where N is the total number of pages and the sum runs over all pages v that link to u.
5. Iterative Process:
• The PageRank scores are initially set to a uniform value for all pages.
• The PageRank calculation is performed iteratively until convergence, where the
PageRank scores stabilize and stop changing significantly.
-------------------------------------------------------------------------------------------------------------------
Example of PageRank Algorithm:
Let's consider a simple example with four web pages (A, B, C, D) connected by links:
• Page A has outbound links to pages B, C, and D.
• Page B has an inbound link from page A and outbound links to pages C and D.
• Page C has inbound links from pages A and B, and no outbound links.
• Page D has inbound links from pages A and B, and no outbound links.
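A short Python sketch of the iterative calculation on exactly this four-page graph (damping factor d = 0.85, uniform initial scores). One common convention, assumed here, is to treat dead ends such as C and D as if they linked to every page:

def pagerank(links, d=0.85, iterations=50):
    pages = list(links)
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}              # uniform initial scores
    for _ in range(iterations):
        new_pr = {p: (1 - d) / n for p in pages}  # teleportation share
        for p, outs in links.items():
            if outs:                              # split rank over out-links
                for q in outs:
                    new_pr[q] += d * pr[p] / len(outs)
            else:                                 # dead end: spread evenly
                for q in pages:
                    new_pr[q] += d * pr[p] / n
        pr = new_pr
    return pr

links = {"A": ["B", "C", "D"], "B": ["C", "D"], "C": [], "D": []}
print(pagerank(links))
# C and D come out highest: each receives links from both A and B,
# while A receives no inbound links at all.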
-------------------------------------------------------------------------------------------------------------------
Topics covered: Algorithms and Techniques, Supervised learning for ranking: RankSVM, RankBoost,
Pairwise and listwise learning to rank approaches, Evaluation metrics for learning to rank
-------------------------------------------------------------------------------------------------------------------
Algorithms and Techniques:
PageRank is a key algorithm developed by Larry Page and Sergey Brin, the founders of Google, as part
of their early work on the Google search engine. It revolutionized web search by introducing a method
for ranking web pages based on their importance and relevance, as determined by the structure of the
web itself. Here are some of the key algorithms and techniques used in PageRank:
1. Link Analysis: PageRank is fundamentally a link analysis algorithm. It assigns a numerical
weight, or PageRank score, to each webpage in a network of hyperlinked documents based on
the quantity and quality of inbound links it receives from other pages.
2. Random Walk Model: The PageRank algorithm models web users as random surfers
navigating the web by following hyperlinks from one page to another. The probability that a
surfer will move from one page to another is determined by the number and quality of links on
each page.
3. Graph Theory: PageRank views the web as a directed graph, where web pages are nodes and
hyperlinks between pages are edges. The algorithm applies principles from graph theory to
analyze the structure of the web graph and compute PageRank scores for individual pages.
4. Transition Matrix: PageRank represents the web graph as a transition matrix, where each
element represents the probability of transitioning from one page to another via a hyperlink.
The transition matrix is typically sparse and can be efficiently manipulated using matrix
operations.
5. Iterative Algorithm: PageRank is computed iteratively using an iterative algorithm that
repeatedly updates the PageRank scores of web pages until convergence is achieved. In each
iteration, the PageRank scores are recalculated based on the current estimates and the link
structure of the web graph.
6. Damping Factor: To model the behavior of real users who may occasionally jump to a random
page instead of following a hyperlink, PageRank introduces a damping factor (usually denoted
as d) between 0 and 1. The damping factor represents the probability that a random surfer
will continue browsing the web rather than following a link.
7. Teleportation: The damping factor also introduces the concept of teleportation, where a
random surfer has a probability 1 − d of jumping to any page in the web graph, regardless of
its link structure. Teleportation helps ensure that the PageRank algorithm converges to a unique
solution even for disconnected or poorly connected web graphs.
8. Convergence Criteria: PageRank iterates until the PageRank scores stabilize, indicating that
the algorithm has converged to a stable solution. Convergence criteria may include a maximum
number of iterations or a threshold for the change in PageRank scores between iterations.
9. Handling Dead Ends and Spider Traps: PageRank algorithms incorporate techniques to
handle dead ends (pages with no outgoing links) and spider traps (cycles of links that trap the
random surfer). These techniques ensure that the algorithm converges and produces
meaningful results even for complex web graphs.
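The transition-matrix view (points 4 to 7 above) can be sketched as power iteration on the matrix G = d·M + (1 − d)/N, where M is the column-stochastic transition matrix of the web graph and dead-end columns are replaced by uniform columns. Reusing the four-page example from earlier:

import numpy as np

# Same four-page example: columns = from-page, rows = to-page.
# C and D have no out-links, so their columns are made uniform.
M = np.array([
    [0,   0,   0.25, 0.25],   # A
    [1/3, 0,   0.25, 0.25],   # B
    [1/3, 0.5, 0.25, 0.25],   # C
    [1/3, 0.5, 0.25, 0.25],   # D
], dtype=float)

d, n = 0.85, 4
G = d * M + (1 - d) / n        # teleportation mixed into every entry
pr = np.full(n, 1 / n)         # uniform initial scores
for _ in range(50):            # iterate until (approximate) convergence
    pr = G @ pr
print(pr)                      # matches the loop-based computation above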
-------------------------------------------------------------------------------------------------------------------
Supervised learning for ranking refers to the process of training machine learning models to rank items
or documents based on their relevance to a given query or context. Several algorithms have been
developed for this purpose, including RankSVM, RankBoost, and pairwise learning methods. Here's an
overview of each approach:
1. RankSVM (Ranking Support Vector Machine):
• RankSVM is an extension of the traditional Support Vector Machine (SVM) algorithm
tailored for ranking tasks.
• In RankSVM, the goal is to learn a ranking function that maps input features (e.g.,
document features, query-document features) to a ranking score that reflects the
relevance of documents to a query.
• RankSVM optimizes a loss function that penalizes the deviation of the predicted ranking
from the true ranking based on labeled training data.
• The optimization process involves solving a constrained optimization problem to find the
optimal separating hyperplane between relevant and irrelevant documents.
• RankSVM is capable of learning complex non-linear ranking functions and can handle
large feature spaces effectively.
Ranking Support Vector Machine (RankSVM) is a supervised learning algorithm designed specifically for
ranking tasks. It extends the traditional Support Vector Machine (SVM) algorithm to learn a ranking function from labeled preference data.
Pairwise learning in Information Retrieval (IR) refers to a supervised learning approach where models
are trained to rank items or documents based on their pairwise relationships. In pairwise learning,
training examples consist of pairs of items or documents, and the model learns to predict which item in
the pair is more relevant or preferable to a given query or context. Here's how pairwise learning works
in IR:
1. Training Data:
• Pairwise learning requires labeled training data in the form of query-document pairs, where
each pair is labeled with a relevance judgment or preference.
• Each pair consists of two documents: one document that is considered more relevant or
preferable (positive example) and another document that is considered less relevant or
preferable (negative example).
2. Feature Extraction:
• For each query-document pair, a numeric feature vector is extracted (e.g., term-matching scores such as TF-IDF or BM25, document length, or link-based signals); the model is then trained on differences between the vectors of the preferred and non-preferred documents, as sketched below.
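A minimal pairwise sketch in the spirit of RankSVM, assuming invented two-dimensional feature vectors (say, a TF-IDF score and a BM25 score): each training example is the difference between the feature vectors of a preferred and a non-preferred document for the same query, and a linear SVM learns a weight vector whose dot product with a document's features serves as the ranking score.

import numpy as np
from sklearn.svm import LinearSVC

# (query, doc) feature vectors, e.g. [TF-IDF score, BM25 score]; made up
relevant   = np.array([[0.9, 0.8], [0.7, 0.9], [0.8, 0.6]])
irrelevant = np.array([[0.2, 0.1], [0.3, 0.4], [0.1, 0.3]])

# Pairwise difference vectors in both directions, labeled +1 / -1
X = np.vstack([relevant - irrelevant, irrelevant - relevant])
y = np.array([1] * len(relevant) + [-1] * len(irrelevant))

model = LinearSVC().fit(X, y)      # learns the ranking weight vector w
w = model.coef_[0]

# Rank unseen documents by their scores w . x (higher = more relevant)
docs = np.array([[0.6, 0.7], [0.2, 0.2]])
print(docs @ w)                    # the first document should score higher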
-------------------------------------------------------------------------------------------------------------------
The web graph is a conceptual representation of the World Wide Web, where web pages are
represented as nodes and hyperlinks between pages are represented as edges. It forms the backbone
of web search engines and plays a crucial role in various web-related tasks. Here's an overview of web
graph representation and link analysis algorithms:
Web graph representation is a fundamental concept in web science and information retrieval, where
the structure of the World Wide Web is abstracted into a graph-like structure. Here's a detailed
explanation of web graph representation:
1. Nodes and Edges:
• Nodes: In the web graph, nodes represent web pages or documents accessible on the World
Wide Web. Each node corresponds to a unique URL.
• Edges: Edges represent hyperlinks between web pages. If page A contains a hyperlink pointing
to page B, there exists a directed edge from node A to node B in the web graph.
2. Directed Graph:
• The web graph is a directed graph because hyperlinks have directionality. A hyperlink from page
A to page B does not imply a link from page B to page A.
• This directed nature reflects the inherent structure of the web, where pages can link to other
pages without reciprocation.
3. Representation:
• Adjacency List: One common representation of the web graph is using an adjacency list. In
this representation, each node is associated with a list of its outgoing links (nodes it points to).
• Adjacency Matrix: Another representation is using an adjacency matrix, where rows and
columns correspond to nodes, and entries indicate the presence or absence of edges between
nodes.
Link analysis algorithms in Information Retrieval (IR) are techniques used to analyze the relationships
and interconnections between web pages or documents. These algorithms help assess the importance,
authority, and relevance of documents based on the structure of hyperlinks between them. Here's a
detailed explanation of link analysis algorithms in IR:
1. PageRank Algorithm:
• Concept: PageRank, developed by Larry Page and Sergey Brin at Google, assigns a numerical
weight (PageRank score) to each page in the web graph based on the quantity and quality of
inbound links it receives from other pages.
• Working: PageRank operates on the principle that pages with higher inbound link counts from
authoritative pages are likely to be more important and relevant.
• Algorithm: It iteratively computes PageRank scores until convergence, taking into account
damping factors and teleportation to handle dead ends and spider traps.
• Application: PageRank is widely used by search engines to rank search results based on the
importance and relevance of web pages.
2. TrustRank Algorithm:
• Concept: TrustRank is a variant of PageRank that aims to combat web spam and identify
trustworthy pages.
• Working: It starts with a seed set of trusted pages and propagates trust scores through the
web graph, discounting the influence of untrustworthy pages.
• Algorithm: TrustRank helps search engines prioritize trustworthy pages in search results and
improve the quality of search engine rankings.
• Application: TrustRank is used to enhance the credibility and reliability of search results by
identifying and filtering out spammy or low-quality web pages.
-------------------------------------------------------------------------------------------------------------------
Topics covered: Web page crawling techniques: breadth-first, depth-first, focused crawling, Near-
duplicate page detection algorithms, Handling dynamic web content during crawling
-------------------------------------------------------------------------------------------------------------------
Web page crawling techniques
Web page crawling is the process of systematically browsing the World Wide Web to discover and
retrieve web pages for indexing by search engines or other purposes. Here are some common web
page crawling techniques:
1. Breadth-First Crawling:
• Breadth-first crawling starts with a set of seed URLs and systematically explores web pages by
visiting pages at each level of depth before moving to the next level.
• It ensures that pages closer to the seed URLs are crawled first, gradually expanding the crawl
frontier outward.
2. Depth-First Crawling:
• Depth-first crawling prioritizes exploring pages at deeper levels of the web page hierarchy
before visiting pages at shallower levels.
• It may be useful for certain scenarios, but it can lead to deep, narrow crawls that may not cover
a wide range of content.
3. Focused Crawling:
• Focused crawling aims to crawl specific areas of the web that are relevant to a particular topic,
domain, or set of keywords.
• It uses heuristics, content analysis, or relevance feedback to identify and prioritize pages
related to the target topic.
4. Parallel Crawling:
• Parallel crawling involves running multiple crawlers concurrently to crawl different parts of the
web simultaneously.
• It improves the efficiency and speed of crawling by distributing the workload across multiple
threads, processes, or machines.
5. Incremental Crawling:
• Incremental crawling focuses on updating the index with new or modified content since the last
crawl.
• It uses techniques such as timestamp comparison, change detection, or crawling frequency
adjustments to identify and prioritize pages that have been updated or added recently.
6. Politeness and Crawling Ethics:
• Politeness policies regulate the rate and frequency of requests sent to web servers to avoid
overloading servers or causing disruptions.
• Crawlers often adhere to the robots.txt protocol, which specifies guidelines for web crawlers
regarding which pages to crawl and which to avoid.
7. Duplicate Content Detection:
• Crawlers may implement techniques to detect and avoid crawling duplicate or near-duplicate
content to maintain index quality and reduce redundancy.
• Techniques include using checksums, fingerprints, or similarity measures to identify duplicate
content.
8. Dynamic Page Handling:
• Crawlers must handle dynamically generated pages, AJAX content, and other dynamically
loaded resources to ensure comprehensive coverage of the web.
• Techniques include executing JavaScript, interpreting AJAX requests, or analyzing embedded
content to discover and crawl dynamically generated content.
9. Link Analysis and Page Ranking:
• Crawlers may prioritize crawling pages based on link analysis algorithms such as PageRank or
HITS to focus on high-quality or authoritative content.
• Page ranking algorithms influence the crawling strategy by determining which pages are more
likely to be relevant or important.
10. Crawl Frontier Management:
• Crawl frontier management involves maintaining a queue or priority list of URLs to be crawled
and managing crawl scheduling, prioritization, and resource allocation.
• Techniques include URL scheduling algorithms, crawl budget allocation, and dynamic
adjustment of crawl priorities based on content freshness or importance.
Effective web page crawling requires a combination of these techniques, along with careful
consideration of scalability, efficiency, relevance, and ethical considerations. Modern web crawlers
employ sophisticated algorithms and strategies to navigate the vast and dynamic landscape of the
World Wide Web efficiently while respecting the guidelines and constraints set by web servers and
website owners.
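A tiny sketch of how the first two techniques differ in practice: breadth-first and depth-first crawling share the same frontier data structure and differ only in which end of it the next URL is taken from.

from collections import deque

def next_url(frontier, strategy="bfs"):
    # frontier is a deque of URLs discovered so far
    if strategy == "bfs":
        return frontier.popleft()   # FIFO: crawl level by level from seeds
    return frontier.pop()           # LIFO: follow one branch deeply first

frontier = deque(["https://example.com/a", "https://example.com/b"])
print(next_url(frontier, "bfs"))    # https://example.com/a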
-------------------------------------------------------------------------------------------------------------------
Several of these techniques in more detail:
2. Depth-First Crawling:
• Description: Depth-first crawling prioritizes exploring pages at deeper levels of the web page
hierarchy before visiting shallower levels.
• Working: It focuses on visiting pages as deeply as possible along a single branch of the web
page hierarchy before exploring other branches.
• Advantages: Can lead to more focused and efficient crawls, especially when targeting specific
areas of the web.
• Disadvantages: May miss important content located at shallower levels of the hierarchy,
potentially leading to incomplete coverage.
3. Focused Crawling:
• Description: Focused crawling aims to crawl specific areas of the web that are relevant to a
particular topic, domain, or set of keywords.
• Working: Uses heuristics, content analysis, or relevance feedback to identify and prioritize
pages related to the target topic.
• Advantages: Enables efficient discovery and retrieval of content relevant to specific
information needs or user queries.
• Disadvantages: Requires sophisticated algorithms and heuristics to determine relevance, and
may miss valuable content outside the defined focus area.
4. Parallel Crawling:
• Description: Parallel crawling involves running multiple crawlers concurrently to crawl different
parts of the web simultaneously.
• Working: Distributes the workload across multiple threads, processes, or machines to improve
efficiency and speed.
• Advantages: Accelerates the crawling process, enabling faster discovery and retrieval of web
content.
• Disadvantages: Requires infrastructure and resource management to coordinate parallel
crawlers and avoid duplication or conflicts.
5. Incremental Crawling:
• Description: Incremental crawling focuses on updating the index with new or modified content
since the last crawl.
• Working: Uses techniques such as timestamp comparison, change detection, or crawling
frequency adjustments to identify and prioritize pages that have been updated or added
recently.
• Advantages: Helps maintain index freshness and relevance by prioritizing recently updated or
added content.
• Disadvantages: Requires efficient mechanisms for detecting changes and managing crawl
scheduling to ensure timely updates.
-------------------------------------------------------------------------------------------------------------------
Text Summarization:
Summarization is one of the most common Natural Language Processing (NLP) tasks. With the
amount of new content generated every day by billions of people and their smartphones, we are
inundated with an ever-increasing amount of data. Humans can only consume a finite amount of
information and need a way to separate the wheat from the chaff and find the information that
matters. Text summarization can help achieve that for textual information: we can separate the
signal from the noise and take meaningful action on it.
In this article, we explore different methods to implement this task and some of the learnings that we
have come across on the way. We hope this will be helpful to other folks who would like to implement
basic summarization in their data science pipeline for solving different business problems.
Python provides some excellent libraries and modules to perform Text Summarization. We will provide
a simple example of generating Extractive Summarization using the Gensim and HuggingFace modules
in this article.
Uses of Summarization
It may be tempting to use summarization for all texts to get useful information from them and spend
less time reading. However, for now, NLP summarization has been a successful use case in only a few
areas.
Extractive
Extractive summarization methods take the text, rank all of its sentences according to their
relevance within the text, and present you with the most important sentences.
This method does not create new words or phrases, it just takes the already existing words and
phrases and presents only that. You can imagine this as taking a page of text and marking the most
important sentences using a highlighter.
Abstractive
Abstractive summarization, on the other hand, tries to guess the meaning of the whole text and
presents the meaning to you.
It creates words and phrases, puts them together in a meaningful way, and along with that, adds the
most important facts found in the text. This way, abstractive summarization techniques are more
complex than extractive summarization techniques and are also computationally more expensive.
The best way to illustrate these types is through an example. Here we have run the Input Text below
through both types of summarization and the results are shown below.
Input Text:
China’s Huawei overtook Samsung Electronics as the world’s biggest seller of mobile phones in the
second quarter of 2020, shipping 55.8 million devices compared to Samsung’s 53.7 million, according
to data from research firm Canalys. While Huawei’s sales fell 5 per cent from the same quarter a year
earlier, South Korea’s Samsung posted a bigger drop of 30 per cent, owing to disruption from the
coronavirus in key markets such as Brazil, the United States and Europe, Canalys said. Huawei’s
overseas shipments fell 27 per cent in Q2 from a year earlier, but the company increased its
dominance of the China market which has been faster to recover from COVID-19 and where it now
sells over 70 per cent of its phones. “Our business has demonstrated exceptional resilience in these
difficult times,” a Huawei spokesman said. “Amidst a period of unprecedented global economic
slowdown and challenges, we’ve continued to grow and further our leadership position.” Nevertheless,
Huawei’s position as number one seller may prove short-lived once other markets recover…
We summarize the article above using a transformer model from HuggingFace.
from transformers import pipeline
We have to load the pre-trained summarization model into the pipeline:
summarizer = pipeline("summarization")
Next, to use this model, we pass the text (stored here in a variable input_text), the minimum
length, and the maximum length parameters, and print the result:
summary = summarizer(input_text, min_length=30, max_length=300)
Output:
China’s Huawei overtook Samsung Electronics as the world’s biggest seller of mobile phones in the
second quarter of 2020, shipping 55.8 million devices compared to Samsung’s 53.7 million. Samsung
posted a bigger drop of 30 per cent, owing to disruption from coronavirus in key markets such as
Brazil, the United States and Europe.
Text summarization is the process of distilling the most important information from a text while
preserving its key meaning and content. There are two primary approaches to text summarization:
extractive summarization and abstractive summarization.
1. Extractive Summarization:
• Extractive summarization involves selecting and extracting key sentences or passages
directly from the original text to create a summary. The extracted sentences are typically
the ones that contain the most relevant information or represent the main ideas of the
text.
• Extractive summarization methods often use statistical techniques, natural language
processing (NLP), and machine learning algorithms to identify important sentences
based on criteria such as word frequency, sentence position, and semantic similarity.
• Advantages of extractive summarization include the preservation of the original wording
and the ability to generate coherent summaries quickly. However, extractive methods are limited to sentences already present in the source text and may not capture its full semantic context.
-------------------------------------------------------------------------------------------------------------------
Text summarization is the process of distilling the most important information from a source text to
produce a condensed version while retaining the key ideas and meaning. There are two primary
approaches to text summarization: extractive summarization and abstractive summarization. Let's
delve into each approach in detail:
Extractive Summarization:
Extractive summarization involves selecting a subset of sentences or passages from the source text
and combining them to create a summary. The selected sentences are usually the most informative
and representative of the content of the original text. Here's how extractive summarization works:
1. Sentence Ranking: Extractive summarization algorithms analyze the source text to identify
sentences that contain important information. Various features can be used to assess the
importance of sentences, such as word frequency, sentence length, position in the text, and the
presence of keywords.
2. Scoring and Selection: Once the sentences are identified, each sentence is assigned a score
based on its importance or relevance to the overall content. Common techniques for scoring
sentences include algorithms like TextRank, which is based on graph-based ranking algorithms
similar to Google's PageRank algorithm, and TF-IDF (Term Frequency-Inverse Document
Frequency), which measures the importance of words in a document relative to a corpus of
documents.
3. Sentence Selection: The sentences with the highest scores are then selected to form the
summary. These selected sentences are typically arranged in the same order as they appear in
the original text to maintain coherence and readability.
4. Generation of Summary: The selected sentences are concatenated to form the final
summary, which provides a condensed representation of the main ideas and key points of the
source text.
Abstractive Summarization:
Abstractive summarization goes beyond merely selecting and rearranging sentences from the source
text. Instead, it aims to generate a summary that captures the essence of the original content in a
more human-like manner. Abstractive summarization involves the following steps:
1. Understanding the Text: Abstractive summarization algorithms employ natural language
processing (NLP) techniques to comprehend the meaning and context of the source text. This
may involve parsing the text, identifying entities and relationships, and understanding the
semantic structure of the content.
Comparison:
• Extractive Summarization:
• Pros:
• Retains the original wording and structure of the text.
• Generally produces grammatically correct summaries.
• Cons:
• Limited to sentences present in the source text.
• May not capture the semantic meaning or context of the original text
comprehensively.
• Abstractive Summarization:
• Pros:
• Can generate summaries that go beyond the original text.
• Captures the semantic meaning and context more effectively.
• Cons:
• Challenging to generate grammatically correct and coherent summaries.
• Requires more advanced natural language processing techniques and language
models.
In summary, while extractive summarization focuses on selecting and rearranging existing content,
abstractive summarization aims to understand the text and generate new content that effectively
conveys the main ideas and key points of the source text. Each approach has its strengths and
limitations, and the choice between extractive and abstractive summarization depends on the specific
requirements and constraints of the task at hand.
-------------------------------------------------------------------------------------------------------------------
Question answering (QA) involves finding precise answers to user queries or questions, typically posed
in natural language. There are several approaches for finding precise answers in QA systems:
1. Information Retrieval (IR)-based QA: In this approach, the QA system retrieves relevant
documents or passages from a large corpus in response to the user's question. The system uses
keyword matching, vector space models, or other IR techniques to identify documents
containing potential answers. Once the documents are retrieved, the system may employ
techniques such as passage extraction or document ranking to select the most relevant
information for answering the question.
2. Text Matching and Similarity: This approach involves analyzing the similarity between the
user's question and textual content in the corpus. Techniques such as cosine similarity,
semantic similarity, or word embeddings are used to measure the similarity between the
question and candidate answers. The system selects the answer that best matches the semantic
meaning or context of the question.
3. Machine Learning and Natural Language Processing (NLP): Machine learning models,
particularly deep learning architectures such as recurrent neural networks (RNNs), convolutional
neural networks (CNNs), and transformer models like BERT, have been employed in QA
systems. These models are trained on large datasets to understand the relationship between
questions and answers and to generate responses based on learned patterns in the data.
4. Semantic Parsing: Semantic parsing involves analyzing the syntactic and semantic structure
of the question to understand its meaning and intent. Techniques such as dependency parsing,
semantic role labelling, and entity recognition are used to extract relevant entities,
relationships, and constraints from the question. The parsed representation of the question is
then used to query structured or unstructured data sources to find precise answers.
5. Knowledge Graphs: Knowledge graphs represent structured information about entities and
their relationships in a graph-based format. QA systems can leverage knowledge graphs to find
precise answers by traversing the graph to identify relevant entities and relationships based on
the user's question. Techniques such as graph-based inference and query expansion can be
used to infer additional information and improve answer precision.
Question answering (QA) involves finding precise answers to user queries or questions posed in natural
language. Various approaches exist for achieving this goal, ranging from rule-based systems to
advanced machine learning models. Here, I'll detail several approaches for finding precise answers in
QA systems:
1. Keyword Matching: One of the simplest approaches to QA involves matching keywords in the
user's question with keywords in a corpus of documents or a knowledge base. The system
retrieves documents containing the relevant keywords and extracts sentences or passages that
contain matching keywords. While straightforward, this approach may not capture nuanced or
complex queries effectively.
2. Information Retrieval (IR) + Passage Retrieval: In this approach, the QA system first
retrieves relevant documents using information retrieval techniques such as TF-IDF (Term
Frequency-Inverse Document Frequency) or BM25 (Best Matching 25). Then, it selects relevant
passages or sentences from the retrieved documents based on their relevance to the user's
question. Passage retrieval methods may consider factors such as semantic similarity,
document context, and language models to identify relevant passages.
3. Named Entity Recognition (NER): Named Entity Recognition identifies entities such as
people, organizations, locations, and dates mentioned in the user's question and in the corpus
of documents. QA systems can use NER to extract relevant entities and then search for
sentences or passages containing these entities to provide answers. NER enhances precision by
focusing on specific entities mentioned in the question.
4. Semantic Parsing and Structured Knowledge Bases: Some QA systems leverage
structured knowledge bases such as Wikidata, Freebase, or DBpedia. Semantic parsing
techniques are used to translate the user's question into a structured query language (e.g.,
SPARQL for RDF knowledge bases). The system then executes the query against the knowledge
base to retrieve precise answers. Structured knowledge bases offer rich semantic information
and enable precise retrieval of factual knowledge.
5. Machine Learning Models: Modern QA systems often employ machine learning models,
particularly deep learning architectures, to understand and answer questions. These models
include:
• Sequence-to-Sequence Models: Seq2Seq models, based on recurrent neural networks
(RNNs) or transformer architectures, can map the user's question to an answer directly.
These models learn to generate answers based on input questions and can handle both
extractive and abstractive QA tasks.
• BERT and Transformers: Bidirectional Encoder Representations from Transformers
(BERT) and other transformer-based models have shown remarkable performance in QA
tasks. These models can understand the context and semantics of the question and the
document corpus, enabling accurate answer extraction (a short sketch follows this list).
• BERT-based Fine-tuning: Pre-trained language models like BERT can be fine-tuned on
QA datasets using techniques such as extractive summarization. During fine-tuning, the
model learns to extract the most relevant spans of text from documents to generate
precise answers to questions.
6. Ensemble Approaches: Some QA systems combine multiple approaches mentioned above to
improve precision and robustness. Ensemble methods integrate outputs from different models
or techniques to generate more accurate answers. For example, an ensemble model may
combine results from keyword matching, IR-based passage retrieval, and machine learning
models to provide precise answers across a range of queries.
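To make the transformer-based approach (point 5 above) concrete, here is a short extractive QA sketch using the HuggingFace pipeline, in the same style as the summarization example earlier. Which pre-trained model the library loads by default is left to the library and is an assumption here.

from transformers import pipeline

qa = pipeline("question-answering")
result = qa(
    question="Who overtook Samsung as the biggest seller of mobile phones?",
    context="China's Huawei overtook Samsung Electronics as the world's "
            "biggest seller of mobile phones in the second quarter of 2020.",
)
print(result["answer"])   # an answer span extracted from the context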
-------------------------------------------------------------------------------------------------------------------
Recommender systems are information filtering systems that aim to predict user preferences and
recommend items (such as movies, products, or articles) that users are likely to be interested in. Two
primary approaches to building recommender systems are collaborative filtering and content-based
filtering.
1. Collaborative Filtering:
Collaborative filtering (CF) recommends items to users based on the preferences of other users. The
underlying assumption is that users who have preferred similar items in the past will likely prefer
similar items in the future. Collaborative filtering methods can be further categorized into two types:
a. Memory-Based Collaborative Filtering: Memory-based CF techniques compute similarities
between users or items based on their historical interactions. One common method is user-based
collaborative filtering, where recommendations for a user are generated based on the preferences of similar users, as sketched below.
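A minimal memory-based (user-based) collaborative filtering sketch: cosine similarity between users' rating vectors, followed by a similarity-weighted average of neighbours' ratings to predict an unseen rating. The ratings matrix is invented for illustration (0 = not rated).

import numpy as np

ratings = np.array([            # rows = users, columns = items
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
])

def predict(user, item):
    sims = []
    for other in range(len(ratings)):
        if other == user or ratings[other, item] == 0:
            continue
        u, v = ratings[user], ratings[other]
        sim = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
        sims.append((sim, ratings[other, item]))
    num = sum(s * r for s, r in sims)
    den = sum(s for s, _ in sims)
    return num / den if den else 0.0

# Low prediction: user 0 closely resembles user 1, who rated item 2 poorly
print(predict(user=0, item=2))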
A Content-Based Recommender works on data that we take from the user, either explicitly (ratings)
or implicitly (clicking on a link). From this data we create a user profile, which is then used to
make suggestions to the user; as the user provides more input or takes more actions on the
recommendations, the engine becomes more accurate.
User Profile: In the user profile, we create vectors that describe the user's preferences. To build
it, we use the utility matrix, which describes the relationship between users and items. With this
information, the best estimate we can make about which items the user likes is some aggregation of
the profiles of those items.
Item Profile: In a content-based recommender, we must build a profile for each item, representing
its important characteristics. For example, if the item is a movie, then its actors, director,
release year, and genre are its most significant features. We can also add its rating from IMDB
(Internet Movie Database) to the item profile.
Utility Matrix: The utility matrix signifies the user's preference for certain items. From the data
gathered from the user, we have to find some relation between the items the user likes and those
they dislike; for this purpose we use the utility matrix. In it, we assign a particular value,
known as the degree of preference, to each user-item pair, and then draw a matrix of users against
items to identify their preference relationships.
-------------------------------------------------------------------------------------------------------------------
Content-based filtering
The content-based recommendation system works with two methods, each using different models and
algorithms. One uses the vector space method (method 1), while the other uses a classification
model (method 2).
1. The vector space method
Suppose you read a crime thriller book by Agatha Christie and review it on the internet. Along with
it, you review a fictional book of the comedy genre, rating the crime thrillers as good and the
comedy one as bad.
Now, a rating system is built from the information you provided. On a scale from 0 to 9, the crime
thriller and detective genres are ranked at 9, other serious books fall somewhere between 0 and 9,
and the comedy ones lie at the bottom, possibly with negative weights.
With this information, the next book recommended to you will most probably be a crime thriller, as
that is the highest-rated genre for you.
For this ranking system, a user vector is created from the information you provided. After this, an
item vector is created for each book, ranking it according to its genres.
Every book is then assigned a score by taking the dot product of the user vector and the item
vector, and this score is used for the recommendation. The scores of all the available books are
ranked, and the top 5 or top 10 books are recommended (see the sketch below).
This method of content-based filtering was the first one used by content-based recommendation
systems to recommend items to the user.
2. Classification method
The second method of content based filtering is the classification method. In it, we can create a
decision tree and find out if the user wants to read a book or not.
For example, consider a book, say The Alchemist.
Based on the user data, we first look at the author's name, and it is not Agatha Christie. Then the
genre is not a crime thriller, nor is it a type of book you have ever reviewed. With these
classifications, we conclude that this book shouldn't be recommended to you (a small sketch
follows).
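A sketch of the classification method: a decision tree trained on simple item features predicts whether to recommend a book. The features and labels are invented for illustration: [matches a reviewed author, matches a liked genre].

from sklearn.tree import DecisionTreeClassifier

X = [[1, 1], [1, 0], [0, 1], [0, 0]]   # author match, genre match
y = [1, 1, 1, 0]                       # 1 = recommend, 0 = don't

tree = DecisionTreeClassifier().fit(X, y)
# "The Alchemist": not by a reviewed author, not a liked genre
print(tree.predict([[0, 0]]))          # [0] -> not recommended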
-------------------------------------------------------------------------------------------------------------------
Cross-lingual retrieval is a challenging yet essential task for information retrieval in multilingual
environments. Addressing vocabulary mismatch, syntax variations, and resource scarcity requires
sophisticated techniques such as machine translation, cross-lingual information retrieval models, and
cross-lingual transfer learning. By leveraging these techniques, cross-lingual retrieval systems can
effectively retrieve relevant information across different languages, enabling users to access
information regardless of linguistic barriers.
Example of cross-lingual retrieval using machine translation and cross-lingual information retrieval
models:
Scenario: Suppose we have a multinational company with offices in English-speaking countries and
Spanish-speaking countries. Employees across these offices need to access documents and information
stored in the company's knowledge base, which is available in both English and Spanish. To facilitate
efficient information retrieval across languages, the company wants to implement a cross-lingual
retrieval system.
Challenges:
• Vocabulary Mismatch: English and Spanish have different vocabularies, and direct translations
may not capture the exact meaning of queries or documents.
• Syntax Variations: English and Spanish have different word orders and syntactic structures,
making it challenging to match queries and documents accurately.
• Resource Scarcity: The company may not have sufficient bilingual resources or parallel corpora
to develop effective cross-lingual retrieval models.
Solution: The company decides to implement a cross-lingual retrieval system using a combination of
machine translation and cross-lingual information retrieval models.
Implementation Steps:
1. Query Translation:
• When a user submits a query in one language (e.g., English), the system translates the
query into the target language (e.g., Spanish) using a machine translation service such
as Google Translate or Microsoft Translator.
• For example, if a user in an English-speaking country searches for "project
management," the system translates the query to "gestión de proyectos" in Spanish.
2. Cross-lingual Information Retrieval:
• The system uses cross-lingual information retrieval models to match the translated
query with relevant documents in the target language.
• It employs techniques such as cross-lingual word embeddings or multilingual models to
capture semantic similarities across languages and retrieve relevant documents.
• For instance, if the translated query "gestión de proyectos" is matched with Spanish
documents containing similar terms, such as "técnicas de gestión de proyectos" or
"mejores prácticas en gestión de proyectos," those documents are retrieved and
presented to the user.
3. Evaluation and Refinement:
• The system continuously evaluates the relevance of retrieved documents based on user
feedback and adjusts its retrieval algorithms accordingly.
• It may refine the cross-lingual retrieval models by incorporating additional linguistic
features, optimizing parameters, or leveraging user interactions to improve retrieval
accuracy.
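A sketch of Implementation Steps 1 and 2 above: translate an English query to Spanish with a pre-trained machine translation model, then retrieve Spanish documents by TF-IDF cosine similarity. The model name and the toy document collection are assumptions for illustration.

from transformers import pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Step 1: query translation (English -> Spanish)
translate = pipeline("translation", model="Helsinki-NLP/opus-mt-en-es")
query_es = translate("project management")[0]["translation_text"]

# Step 2: match the translated query against Spanish documents
docs_es = [
    "técnicas de gestión de proyectos",
    "mejores prácticas en gestión de proyectos",
    "informe anual de ventas",
]
vec = TfidfVectorizer()
doc_matrix = vec.fit_transform(docs_es)
sims = cosine_similarity(vec.transform([query_es]), doc_matrix)[0]
for doc, s in sorted(zip(docs_es, sims), key=lambda x: -x[1]):
    print(f"{s:.2f}  {doc}")   # the project-management documents rank first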
Benefits:
• Seamless Access to Information: Employees across English and Spanish-speaking regions can
access relevant documents and information regardless of their language preferences.
• Improved Efficiency: The cross-lingual retrieval system reduces the time and effort required to
manually search for information in different languages.
• Enhanced Collaboration: Employees from diverse linguistic backgrounds can collaborate more
effectively by sharing and accessing documents across languages.
By implementing a cross-lingual retrieval system that combines machine translation and cross-lingual
information retrieval techniques, the multinational company can overcome language barriers and
facilitate efficient access to knowledge and information across its global workforce. This example
demonstrates how cross-lingual retrieval solutions can address the challenges of vocabulary mismatch,
syntax variations, and resource scarcity in multilingual environments.
-------------------------------------------------------------------------------------------------------------------
Machine translation for Information Retrieval (IR) involves using automated translation systems
to translate queries or documents from one language to another in order to facilitate cross-lingual
search and retrieval. This approach allows users to search for information in languages they are not
proficient in and enables access to a wider range of resources. Let's explore machine translation for IR
in detail:
1. Basic Components of Machine Translation:
Machine translation systems typically consist of the following components:
• Text Analysis: Breaking down the input text into its constituent parts, such as words, phrases,
and sentences.
• Translation Model: Generating a translation based on statistical or neural models that capture
the relationships between source and target languages.
• Language Generation: Reconstructing the translated text in the target language, ensuring
fluency and coherence.
2. Types of Machine Translation:
• Statistical Machine Translation (SMT): Based on statistical models that learn translation
patterns from bilingual corpora. SMT systems rely on word alignments and phrase-based
translation techniques.
• Neural Machine Translation (NMT): Utilizes deep learning models, such as recurrent neural
networks (RNNs) or transformer models, to learn translation mappings directly from source to
target languages. NMT has shown significant improvements in translation quality over
traditional SMT approaches.
3. Integration of Machine Translation with IR:
Machine translation can be integrated into the IR process in various ways:
• Query Translation: Translating user queries from the source language to the target language
before retrieving relevant documents. For example, translating an English query into Spanish
before searching for relevant documents in Spanish databases.
• Document Translation: Translating documents retrieved in the target language back to the
source language for user comprehension. This allows users to understand the content of
documents written in languages they are not proficient in.
• Cross-Lingual Retrieval: Facilitating retrieval of documents across multiple languages by
translating queries and documents between source and target languages as part of the retrieval
process.
4. Example Scenario:
Consider a multinational company with offices in English-speaking countries and Japanese-speaking
countries. Employees across these offices need to access documents and information stored in the
company's knowledge base, which is available in both English and Japanese.
• Query Translation: An employee in the English-speaking office submits a query in English,
such as "sales report analysis." The machine translation system translates the query into
Japanese, generating the equivalent query in Japanese, "売上レポート分析."
• Cross-Lingual Retrieval: The translated query is used to retrieve relevant documents written
in Japanese from the company's knowledge base. These documents could include sales reports,
market analyses, and financial summaries.
• Document Translation: The retrieved Japanese documents can be translated back into
English for the English-speaking employee to understand the content and extract relevant
information.
5. Challenges and Considerations:
• Translation Quality: The accuracy and fluency of machine translation can significantly impact
the effectiveness of cross-lingual retrieval. Poor translations may lead to irrelevant search
results and user frustration.
• Domain Specificity: Machine translation systems may struggle with domain-specific
terminology and context. Customizing translation models for specific domains can improve
translation quality and relevance.
• Resource Availability: Availability of bilingual corpora and language resources can impact the
development and performance of machine translation systems, particularly for low-resource
languages.
In conclusion, machine translation plays a crucial role in enabling cross-lingual search and retrieval in
Information Retrieval systems. By leveraging machine translation technologies, users can access and
retrieve information in languages they are not proficient in.
-------------------------------------------------------------------------------------------------------------------
Multilingual document representations are techniques used to represent textual documents in a way
that allows for meaningful comparisons, analysis, and retrieval across multiple languages. They enable
systems to process and understand documents written in different languages, facilitating tasks such as
cross-lingual information retrieval, machine translation, and cross-lingual document classification. Here
are some key approaches and methods used in multilingual document representations:
1. Word Embeddings:
• Cross-lingual word embeddings map words from different languages into a shared vector space, so that translations of the same term lie close together and documents can be compared across languages.
-------------------------------------------------------------------------------------------------------------------
Evaluation of Information Retrieval systems:
1. Relevance Judgments:
• Relevance judgments involve human assessors determining the relevance of documents
returned by the IR system to a set of user queries.
• Assessors typically assign relevance judgments based on predefined criteria, such as
whether the document contains information that satisfies the user's information need.
• Relevance judgments serve as the ground truth against which the performance of the IR
system is measured.
2. Precision and Recall:
• Precision measures the proportion of relevant documents retrieved by the system among
all documents retrieved. It is computed as:
Precision = (Number of relevant documents retrieved) / (Total number of documents retrieved)
• Recall measures the proportion of relevant documents retrieved by the system among all
relevant documents in the collection. It is computed as:
Recall = (Number of relevant documents retrieved) / (Total number of relevant documents in the collection)
• Precision and recall are typically computed at different levels, such as document level,
query level, or user session level.
3. F-measure:
• The F-measure (or F1 score) is the harmonic mean of precision and recall and provides a
single measure that balances the two metrics.
• It is computed as:
• F1 = 2 · (Precision · Recall) / (Precision + Recall)
4. Mean Average Precision (MAP):
• MAP is the mean of the average precision (AP) scores over all queries:
• MAP = (1/N) · Σ_{i=1..N} APi
• Where N is the total number of queries, and APi is the average precision for query i.
5. Mean Reciprocal Rank (MRR):
• MRR measures the average of the reciprocal ranks of the first relevant document
retrieved for each query.
• It is particularly useful for tasks where only the top-ranked document matters (e.g.,
question answering).
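Minimal implementations of the metrics above, over a ranked result list and a set of relevant document ids (toy data for illustration):

def precision(retrieved, relevant):
    return len([d for d in retrieved if d in relevant]) / len(retrieved)

def recall(retrieved, relevant):
    return len([d for d in retrieved if d in relevant]) / len(relevant)

def f1(p, r):
    return 2 * p * r / (p + r) if p + r else 0.0

def average_precision(ranked, relevant):
    hits, total = 0, 0.0
    for i, d in enumerate(ranked, start=1):
        if d in relevant:
            hits += 1
            total += hits / i          # precision at each relevant hit
    return total / len(relevant)

def reciprocal_rank(ranked, relevant):
    for i, d in enumerate(ranked, start=1):
        if d in relevant:
            return 1 / i               # rank of the first relevant document
    return 0.0

ranked = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}
p, r = precision(ranked, relevant), recall(ranked, relevant)
print(p, r, f1(p, r))                        # 0.5 1.0 0.667
print(average_precision(ranked, relevant))   # (1/2 + 2/4) / 2 = 0.5
print(reciprocal_rank(ranked, relevant))     # first relevant at rank 2 -> 0.5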
User studies in Information Retrieval (IR) involve systematic investigations into how users interact with
IR systems, their information-seeking behavior, and their satisfaction with the retrieval outcomes.
-------------------------------------------------------------------------------------------------------------------
Test collections and benchmarking play a pivotal role in Information Retrieval (IR) research and
development. They provide standardized datasets and evaluation metrics for assessing the
performance of IR systems. Here's a detailed explanation of test collections and benchmarking in IR:
1. Test Collections:
Test collections are curated datasets that consist of:
1. Document Collection: A corpus of documents typically crawled from the web, scientific
papers, news articles, or other sources relevant to the domain of interest.
2. Queries: A set of queries or search topics formulated based on real-world information needs or
specific test scenarios.
3. Relevance Judgments: For each query, human assessors provide relevance judgments,
indicating which documents in the collection are relevant to the query and to what degree.
2. Benchmarking:
Benchmarking involves evaluating the performance of IR systems using test collections. The process
usually includes the following steps:
1. System Retrieval: IR systems retrieve documents from the test collection in response to the
given queries.
2. Evaluation: The retrieved documents are compared against the relevance judgments to assess
the system's effectiveness using various evaluation metrics.
3. Metrics: Common evaluation metrics include precision, recall, F1 score, mean average
precision (MAP), mean reciprocal rank (MRR), normalized discounted cumulative gain (NDCG),
and precision-recall curves.
4. Comparison: The performance of different IR systems can be compared based on their scores
on these metrics.
Beyond these core steps, benchmarking serves several broader purposes in IR research and
development:
• Comparison of Techniques: Benchmarking allows researchers to compare the effectiveness of
different retrieval models, algorithms, and techniques under controlled conditions. This helps
identify the strengths and weaknesses of various approaches and informs the development of
more effective retrieval systems.
• Evaluation Measures: Benchmarking typically involves the computation of evaluation
measures such as precision, recall, F1-score, MAP, mean reciprocal rank (MRR), and normalized
discounted cumulative gain (NDCG). These measures provide quantitative assessments of
retrieval performance and help researchers understand the trade-offs between precision and
recall.
• Statistical Analysis: Benchmarking studies often include statistical analysis, such as a
paired significance test over per-query scores, to determine whether observed differences
in retrieval performance are statistically significant rather than due to random chance (a
minimal sketch follows this list).
• Publication and Sharing: Benchmarking studies are often published in academic conferences
and journals, allowing researchers to share their findings with the broader IR community. Test
collections and evaluation results are also shared publicly to facilitate reproducibility and further
research in the field.
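As flagged in the Statistical Analysis bullet above, a common approach is a paired t-test over per-query scores; a hedged sketch using SciPy (the AP numbers are made up for illustration):

```python
from scipy import stats

# Per-query average precision scores for two systems on the same 8 queries
# (hypothetical numbers, for illustration only).
ap_system_A = [0.62, 0.48, 0.70, 0.55, 0.66, 0.40, 0.58, 0.73]
ap_system_B = [0.50, 0.45, 0.61, 0.52, 0.60, 0.38, 0.49, 0.65]

# Paired t-test: the same queries are evaluated under both systems.
t_stat, p_value = stats.ttest_rel(ap_system_A, ap_system_B)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
# A small p-value (e.g. below 0.05) suggests the difference in mean AP
# is unlikely to be due to random chance alone.
```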
In summary, test collections and benchmarking provide essential resources and methodologies for
evaluating and comparing the performance of IR systems. They help researchers assess the
effectiveness of retrieval techniques, identify areas for improvement, and advance the state-of-the-art
in information retrieval.
-------------------------------------------------------------------------------------------------------------------
Online evaluation methods in Information Retrieval:
Online evaluation methods in Information Retrieval (IR) involve assessing the performance of retrieval
systems using live user interactions and real-world data. Unlike offline evaluation methods that rely on
predefined test collections, online evaluation leverages actual user queries, interactions, and feedback
to measure system effectiveness. Here's a detailed overview of online evaluation methods in IR:
1. Click-Through Rate (CTR):
• Click-through rate measures the proportion of users who click on a search result after
issuing a query.
• In online evaluation, CTR is used to assess the relevance and attractiveness of search
results presented to users.
• Higher CTRs indicate that the displayed results are more relevant and engaging to users.
2. Dwell Time:
• Dwell time refers to the amount of time users spend interacting with a search result
page or a specific document after clicking on it.
• Longer dwell times often indicate that users find the content relevant and engaging.
• Dwell time can be used as a proxy for user satisfaction and result relevance.
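Both signals are straightforward to compute from an interaction log; a minimal sketch over a hypothetical log schema:

```python
# Hypothetical interaction log: one record per result impression.
log = [
    {"query": "ir metrics", "doc": "d1", "clicked": True,  "dwell_seconds": 95},
    {"query": "ir metrics", "doc": "d2", "clicked": False, "dwell_seconds": 0},
    {"query": "ir metrics", "doc": "d3", "clicked": True,  "dwell_seconds": 12},
]

impressions = len(log)
clicks = [r for r in log if r["clicked"]]

ctr = len(clicks) / impressions                              # click-through rate
avg_dwell = sum(r["dwell_seconds"] for r in clicks) / len(clicks)

print(f"CTR = {ctr:.2f}, mean dwell = {avg_dwell:.1f}s")     # CTR = 0.67, mean dwell = 53.5s
```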
3. User Engagement Metrics:
• Broader engagement signals, such as query abandonment rate, query reformulation rate,
and session length, capture how users interact with the system over a whole session
rather than a single result.
4. A/B Testing:
• A/B testing is a powerful and widely used technique in IR for empirically evaluating and
optimizing search algorithms, user interfaces, and system configurations.
• By systematically comparing different variants, typically a control and a treatment, and
measuring their impact on predefined metrics, A/B testing helps drive data-driven
decision-making and continuous improvement in IR systems (a minimal sketch follows).
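A hedged sketch of one common building block, deterministic user bucketing, so that each user consistently sees either the control or the treatment variant (the experiment and user names are hypothetical):

```python
import hashlib

def ab_bucket(user_id: str, experiment: str) -> str:
    """Deterministically assign a user to 'control' or 'treatment'.

    Hashing the (experiment, user) pair keeps the assignment stable across
    sessions and independent across different experiments."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return "treatment" if int(digest, 16) % 2 == 0 else "control"

# Route each query through the ranker for the user's bucket, then compare
# a predefined metric (e.g. CTR) between the two groups.
print(ab_bucket("user_42", "new_ranker_v2"))
```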
-------------------------------------------------------------------------------------------------------------------
Interleaving experiments in Information Retrieval (IR):
Interleaving experiments in Information Retrieval (IR) are a method used to compare the performance
of different ranking algorithms or search strategies in a live environment. These experiments involve
interleaving results from multiple algorithms and presenting them to users in a randomized order to
evaluate which algorithm provides better user satisfaction or relevance. Here's a detailed overview of
interleaving experiments in IR:
1. Objective:
• The primary objective of interleaving experiments is to compare the effectiveness of
different ranking algorithms or search strategies in providing relevant and satisfying
search results to users.
• Interleaving helps identify which algorithm or strategy performs better in real-world
scenarios based on user feedback and preferences.
2. Setup:
• Interleaving experiments involve presenting search results to users in a randomized
order, with results interleaved from different algorithms or strategies.
• Each user query triggers the execution of multiple ranking algorithms or strategies, and
the top-ranked results from each algorithm are combined and presented to the user in a
randomized order.
• Users then interact with the interleaved results as they normally would; their clicks
serve as implicit feedback indicating which algorithm contributed the preferred results.
3. Interleaving Methods:
• Team Draft Interleaving: In team draft interleaving, two or more algorithms are
treated as competing "teams." Results from each team are combined into a single
interleaved list by taking turns selecting the next result (a minimal sketch follows
this list).
• Probabilistic Interleaving: Probabilistic interleaving assigns probabilities to each
result from different algorithms and samples results based on these probabilities to
create the interleaved list.
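A minimal Python sketch of team draft interleaving as described above (function and variable names are illustrative):

```python
import random

def team_draft_interleave(ranking_a, ranking_b, k=10, rng=random):
    """Team draft interleaving of two ranked lists (a hedged sketch).

    Each round, a coin flip decides which team picks first; each team then
    contributes its highest-ranked document not already shown. The team
    credited with each document is recorded so that clicks can later be
    attributed to ranker A or ranker B."""
    interleaved, credit, seen = [], {}, set()
    pool = set(ranking_a) | set(ranking_b)
    while len(interleaved) < k and len(seen) < len(pool):
        order = ["A", "B"] if rng.random() < 0.5 else ["B", "A"]
        for team in order:
            source = ranking_a if team == "A" else ranking_b
            doc = next((d for d in source if d not in seen), None)
            if doc is not None and len(interleaved) < k:
                interleaved.append(doc)
                credit[doc] = team
                seen.add(doc)
    return interleaved, credit

# Clicks on documents credited to "A" count as wins for ranker A, and vice versa.
docs, credit = team_draft_interleave(["d1", "d2", "d3"], ["d3", "d5", "d1"], k=4)
```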
Interleaving experiments provide valuable insights into the comparative performance of different
ranking algorithms or search strategies in real-world scenarios. By systematically evaluating and
comparing interleaved search results based on user feedback, interleaving experiments help inform
data-driven decision-making and optimization of IR systems.