IR Cs Sem 6
Definition:
• Information Retrieval is the process of obtaining information from large collections stored on
computers in an unstructured way. It mainly focuses on searching for and retrieving documents,
based on full-text or other content-based indexing.
Objectives:
• The general objective of an Information Retrieval System is to minimize the overhead of a user
locating needed information.
Importance of IR:
• Today millions of people use web search engines every day, without relying on librarians or
professional searchers to retrieve information.
• The IR system notifies the user of the existence and location of documents that might contain
the required information.
• Information retrieval also supports users in browsing or filtering document collections and in
processing a set of retrieved documents.
• An IR system has the ability to represent, store, organize, and access information items.
Characteristics of IR:
Advantages of IR:
• It saves readers' time when they search for needed information.
• The searching process is easy to understand.
• Current information is available in the storage database.
• Users can access multiple databases and use multiple keywords/concepts at the same time.
Disadvantages of IR:
• High establishment cost.
• Many library users and staff lack the IT knowledge needed to run such a system.
• Lack of training facilities.
• Electricity supply problems.
• Lack of networking and internet facilities.
• Slow internet speeds delay retrieval.
• It sometimes returns irrelevant information.
Applications of IR:
• Digital libraries.
• Information filtering.
• Recommender systems.
• Media search.
• Blog search.
• Image retrieval.
• 3D retrieval.
• Music retrieval.
• News search.
• Speech retrieval.
• Video retrieval.
• Search engines.
• Site search.
Although searching the World Wide Web (web search) is by far the most common application involving
information retrieval, search is also a crucial part of applications in corporations, government, and many
other domains.
Types of Search:
1) Vertical Search: a specialized form of web search where the domain of the search is restricted to
a particular topic.
2) Enterprise Search: finding the required information in the huge variety of computer files
scattered across a corporate intranet.
3) Desktop Search: the personal version of enterprise search, where the information
sources are the files stored on an individual computer, including email messages and
recently browsed web pages.
4) Peer-to-peer search: finding information in networks of nodes or computers
without any centralized control.
5) Ad hoc search: includes text-based tasks such as filtering, classification, and question answering.
• The earliest computer-based searching systems were built in the late 1940s and were inspired by
pioneering innovation in the first half of the 20th century.
• The idea of using computers to search for relevant pieces of information was popularized in the 1945
article “As We May Think” by Vannevar Bush.
• Information retrieval, or rather machines able to fetch information, was first heard of in 1948,
when Holmstrom described the first such machine, called the Univac. It could record specific
symbols on a magnetic steel tape, fetch the document filed under those symbols, and retype its content.
• Automated systems were introduced barely two years later, in 1950, and by the end of the 1950s one
had already appeared in a movie, Desk Set (1957).
• In the 1960s, the first large information retrieval research group was formed by Gerard Salton at Cornell.
• By the 1970s several different retrieval techniques had been shown to perform well on small text
corpora such as the Cranfield collection.
• Large-scale retrieval systems, such as the Lockheed Dialog system, came into use early in the 1970s.
• In 1992, the US Department of Defense, along with the National Institute of Standards and
Technology (NIST), cosponsored the Text REtrieval Conference (TREC) as part of the TIPSTER text
program.
• Early Developments: As the need for large amounts of information grew, it became necessary
to build data structures for faster access. The index is the data structure for faster retrieval of
information. Over the centuries, indexes were built by manual categorization into hierarchies.
• Information Retrieval in Libraries: Libraries were the first to adopt IR systems for information
retrieval. The first generation consisted of automation of previous technologies, with search based
on author name and title. The second generation added searching by subject heading, keywords,
etc. The third generation introduced graphical interfaces, electronic forms, hypertext features, etc.
• The Web and Digital Libraries: The web is cheaper than many other sources of information, provides
greater access to networks through digital communication, and gives free access to publish on a larger medium.
Components of Information Retrieval
Information retrieval is concerned with representing, searching, and manipulating large collections of
electronic text and other human-language data.
Figure: Components of IR
An information retrieval system thus has three major components:
1. Document sub-system
2. User sub-system
3. Searching/Retrieval sub-system
Document sub-system
a) Acquisition:
It is the process of selection of documents and other objects from various web resources that consist of
text-based documents. This data is collected by web crawlers and stored in the database.
b) Representation:
It consists of indexing, which covers free-text terms, controlled vocabulary, and both manual and
automatic techniques. Examples: abstracting, which involves summarizing, and bibliographic
description, which covers author, title, sources, date, and metadata.
c) File organization:
There are three file organization methods:
o Sequential: organized document by document.
o Inverted: organized term by term, with a list of records under each term.
o Combination: consists of both term-ordered and document-ordered files.
User sub-system
a) Problem:
The user's information need may change and evolve during the search, so the IR process must
support adjustments along the way.
b) Representation:
This is responsible for the following:
o Converting a concept into a query
o Expressing what we search for
o Query terms are stemmed and corrected using a dictionary
o Focusing toward a good result
o Subject to change through feedback
c) Query:
o Queries translate the user's requirement into a form the system can process.
o They allow the user to interact with the computer.
o Queries take vocabulary terms as input and generate feedback.
Searching/Retrieval sub-system
a) Matching:
It is the process search engines use to identify sets of words that should be treated as a cohesive
unit when scanning a search index for the most relevant documents. Various algorithms are used
for matching and searching: for exact matching the Boolean model is appropriate, for best matching
a ‘ranking by relevance’ technique is used, and sometimes both techniques are combined.
b) Retrieved objects:
Documents can be retrieved in sorted order (e.g., LIFO) or in ranked order.
An information retrieval (IR) model can be classified into the following three types:
Classical IR Model:
It is the simplest IR model and the easiest to implement. It is based on mathematical knowledge
that is easily recognized and understood. Boolean, vector, and probabilistic are the three classical
IR models.
Non-Classical IR Model:
It is the complete opposite of the classical IR model. Such IR models are based on principles other
than similarity, probability, and Boolean operations. The information logic model, situation theory
model, and interaction model are examples of non-classical IR models.
Alternative IR Model:
It enhances the classical IR model with specific techniques from other fields. The cluster model,
fuzzy model, and latent semantic indexing (LSI) model are examples of alternative IR models.
Basic Terms:
1) Collection: the group of documents that retrieval is performed on, e.g., Wikipedia.
2) Document: the unit of information that we want to return as the result of a query, e.g.,
a newspaper article.
3) Term: the smallest unit of information in a query, e.g., a token.
4) Information need: the topic about which the user desires to know more; it is distinct
from the query.
5) Query: what the user conveys to the computer in an attempt to communicate the information need.
6) Inverted index: also called an inverted file; an index that maps back from terms to the
parts of the documents in which they occur.
7) Posting: an item in such a list recording that a term appeared in a document.
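The terms above can be illustrated with a minimal inverted index. This is a sketch over a toy collection (the document texts and function name are invented for illustration):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to a sorted postings list of document IDs."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

# Toy collection (invented for illustration)
docs = {1: "Everest is in Nepal", 2: "Nepal borders India", 3: "Everest expedition"}
index = build_inverted_index(docs)
print(index["everest"])  # -> [1, 3]
print(index["nepal"])    # -> [1, 2]
```

Each entry in a postings list (e.g., document 1 under "everest") is a posting in the sense of term 7 above.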
Boolean model:
The Boolean model of information retrieval is a classical IR model, and the first and most widely
adopted one. It is used by virtually all commercial IR systems today.
Exact vs Best match
• In exact match, a query specifies precise criteria. Each document either matches or fails to match
the query. The result is a set of documents, without ranking.
• In best match, a query describes good or best-matching documents, and the result is a
ranked list of documents. The Boolean model discussed here is the most common
exact-match model.
Basic Assumption of Boolean Model
An index term is either present (1) or absent (0) in a document.
All index terms provide equal evidence with respect to information needs.
Queries are Boolean combinations of index terms.
X AND Y: represents documents that contain both X and Y
X OR Y: represents documents that contain either X or Y
NOT X: represents documents that do not contain X
Boolean Queries Example
User information need: Interested to know about Everest and Nepal
User Boolean query: Everest AND Nepal
Implementation Part
Example of Input collection
Doc1= English tutorial and fast track
Doc2 = learning latent semantic indexing
Doc3 = Book on semantic indexing
Doc4 = Advance in structure and semantic indexing
Doc5 = Analysis of latent structures
Query: advance AND structure AND NOT analysis
Boolean Model Index Construction
First we build the term-document incidence matrix, which lists all the distinct terms and their
presence in each document (incidence vector). If a document contains the term, the corresponding
entry is 1, otherwise 0.
So now we have a 0/1 vector for each term. To answer the query we take the vectors
for advance, structure, and analysis, complement the last, and then do a bitwise AND.
With the five documents above: 00010 AND 00010 AND 11110 = 00010, so the answer is Doc4.
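The construction above can be sketched with integer bitmasks standing in for the incidence vectors (the document texts are copied from the example; the function name and bitmask encoding are implementation choices of this sketch):

```python
def incidence_vectors(docs, terms):
    """Build a 0/1 incidence vector (as an int bitmask) per term.
    Bit i of a term's vector is 1 if the term occurs in document i."""
    vectors = {}
    for term in terms:
        bits = 0
        for i, text in enumerate(docs):
            if term in text.lower().split():
                bits |= 1 << i
        vectors[term] = bits
    return vectors

docs = [
    "English tutorial and fast track",
    "learning latent semantic indexing",
    "Book on semantic indexing",
    "Advance in structure and semantic indexing",
    "Analysis of latent structures",
]
v = incidence_vectors(docs, ["advance", "structure", "analysis"])
all_docs = (1 << len(docs)) - 1
# advance AND structure AND NOT analysis: complement the last vector, then AND
result = v["advance"] & v["structure"] & (all_docs & ~v["analysis"])
matches = [i + 1 for i in range(len(docs)) if result >> i & 1]
print(matches)  # -> [4], i.e. Doc4
```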
Advantages:
• It is easy to implement.
• It is easy to understand why the document is retrieved or not.
• Users can determine whether the query is too specific or too broad.
Disadvantages:
• The Boolean operators are too strict and ways need to be found to soften them.
• The standard Boolean approach has no provision for ranking.
• The Boolean model does not support the assignment of weights to the query or document terms.
Extended Boolean model:
It combines the characteristics of the vector space model with the properties of Boolean algebra and
ranks the similarity between queries and documents. This way a document may be somewhat relevant if
it matches some of the queried terms and will be returned as a result, whereas in the standard Boolean
model it would not be.
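This graded behavior can be sketched with the p-norm similarity functions from Salton's extended Boolean scheme. The term weights below are invented for illustration, not computed from real documents:

```python
def pnorm_or(weights, p=2):
    """Similarity for an OR query over term weights in [0, 1]."""
    return (sum(w ** p for w in weights) / len(weights)) ** (1 / p)

def pnorm_and(weights, p=2):
    """Similarity for an AND query over term weights in [0, 1]."""
    return 1 - (sum((1 - w) ** p for w in weights) / len(weights)) ** (1 / p)

# A document matching only one of two AND'ed terms gets a partial score,
# instead of being excluded outright as in the standard Boolean model.
print(pnorm_and([1.0, 0.0]))  # approx. 0.293, not 0
print(pnorm_or([1.0, 0.0]))   # approx. 0.707, not 1
```

As p grows toward infinity, both functions approach strict Boolean AND/OR behavior.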
Extended Boolean model vs Ranked Retrieval:
• The Boolean retrieval model contrasts with ranked retrieval models such as the vector space
model, in which users issue free-text queries, typing one or more words rather than building
up query expressions in a precise language with operators.
• A proximity operator specifies that two terms in a query must occur close to each
other in a document, where closeness may be measured by limiting the allowed number of
intervening words or by reference to a structural unit such as a sentence or paragraph.
Types of Queries in IR Systems
1. Keyword Queries:
• Simplest and most common queries.
• The user enters just keyword combinations to retrieve documents.
• These keywords are implicitly connected by a logical AND operator.
• All retrieval models provide support for keyword queries.
2. Boolean Queries:
• Some IR systems allow using +, -, AND, OR, NOT, and ( ) Boolean operators in combination with
keyword formulations.
• No ranking is involved, because a document either satisfies such a query or does not.
• A document is retrieved for a Boolean query if the query evaluates as logically true for that
document (exact match).
3. Phrase Queries:
• When documents are represented using an inverted keyword index for searching, the relative order
of terms in a document is lost.
• To perform exact phrase retrieval, phrases must be encoded in the inverted index or implemented
differently.
• This query consists of a sequence of words that make up a phrase.
• It is generally enclosed within double quotes.
4. Proximity Queries:
• Proximity refers to a search that accounts for how close multiple terms should be to each
other within a record.
• The most commonly used proximity search option is a phrase search that requires terms to be in
exact order.
• Other proximity operators can specify how close terms should be to each other; some also specify
the order of the search terms.
• Search engines use various operator names such as NEAR, ADJ (adjacent), or AFTER.
• However, providing support for complex proximity operators becomes expensive, as it requires
time-consuming pre-processing of documents, so it is more suitable for smaller document
collections than for the web.
5. Wildcard Queries:
• It supports regular expressions and pattern-matching-based searching in text.
• Retrieval models do not directly support this query type.
• In IR systems, certain kinds of wildcard search support may be implemented; usually
a word prefix followed by arbitrary trailing characters (e.g., hyp*).
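Trailing-wildcard queries of this kind can be answered with a sorted vocabulary and binary search, one common implementation choice (the vocabulary here is invented):

```python
import bisect

def trailing_wildcard(vocab_sorted, prefix):
    """Return all vocabulary terms starting with `prefix` (query 'prefix*')."""
    lo = bisect.bisect_left(vocab_sorted, prefix)
    # '\uffff' sorts after any ordinary character, bounding the prefix range
    hi = bisect.bisect_left(vocab_sorted, prefix + "\uffff")
    return vocab_sorted[lo:hi]

vocab = sorted(["hype", "hyphen", "hypothesis", "index", "retrieval"])
print(trailing_wildcard(vocab, "hyp"))  # -> ['hype', 'hyphen', 'hypothesis']
```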
6. Natural Language Queries:
• There are only a few natural language search engines that aim to understand the structure and
meaning of queries written in natural language text, generally as a question or narrative.
• The system tries to formulate answers for these queries from the retrieved results.
• Semantic models can provide support for this query type.
Binary Tree:
• A tree data structure is a non-linear data structure because it does not store data in a sequential
manner.
• It is a hierarchical structure, as elements in a tree are arranged in multiple levels.
• In the tree data structure, the topmost node is known as the root node.
• Each node contains some data, and the data can be of any type.
• A tree whose elements have at most 2 children is called a binary tree. Since each element in a
binary tree can have only 2 children, we typically name them the left and right child.
• A binary tree node contains the following parts:
1. Data
2. Pointer to left child
3. Pointer to right child
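The three parts above, as a minimal sketch (the class name and sample terms are invented):

```python
class Node:
    """A binary tree node: data plus left and right child pointers."""
    def __init__(self, data):
        self.data = data       # 1. data
        self.left = None       # 2. pointer to left child
        self.right = None      # 3. pointer to right child

# A tiny term dictionary: smaller terms to the left, larger to the right
root = Node("m")
root.left = Node("d")
root.right = Node("t")
print(root.left.data, root.data, root.right.data)  # -> d m t
```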
Pros and cons of Tree data structure:
Pros:
• Solves the prefix problem (e.g., terms starting with “hyp”)
Cons:
1. Slower: O(log M) [and this requires a balanced tree]
2. Rebalancing binary trees is expensive.
3. B-trees mitigate the rebalancing problem.
Solution: transform wild-card queries so that the *’s always occur at the end.
This gives rise to the Permuterm Index.
Permuterm:
• The Permuterm index [Garfield 1976] is a time-efficient and elegant solution to the string dictionary
problem in which pattern queries may include one wildcard symbol (the tolerant retrieval
problem).
• Unfortunately, the permuterm index is space-inefficient because it roughly quadruples the
dictionary size.
• In a permuterm index, the lexicon contains many entries (one rotation per character of each
term), which makes query processing costly. To overcome this problem, the k-gram technique
was introduced.
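A sketch of the permuterm idea: every rotation of term + "$" (an end-of-word marker) is added to the dictionary, and a query like mo*n is rotated so the wildcard lands at the end, turning it into the prefix lookup n$mo. The helper names and toy vocabulary are invented; a real index would use a prefix-searchable structure rather than a full scan:

```python
def permuterm_rotations(term):
    """All rotations of term + '$' (the end-of-word marker)."""
    t = term + "$"
    return [t[i:] + t[:i] for i in range(len(t))]

def build_permuterm(vocab):
    """Map each rotation back to its source term."""
    index = {}
    for term in vocab:
        for rot in permuterm_rotations(term):
            index.setdefault(rot, set()).add(term)
    return index

def wildcard_lookup(index, query):
    """Answer a single-wildcard query like 'mo*n' by rotating it so the
    '*' falls at the end, then matching the remaining prefix."""
    head, tail = query.split("*")
    prefix = tail + "$" + head           # e.g. 'n$mo' for 'mo*n'
    hits = set()
    for rot, terms in index.items():     # linear scan; a real index uses prefix search
        if rot.startswith(prefix):
            hits |= terms
    return sorted(hits)

idx = build_permuterm(["moon", "man", "moron", "melon"])
print(wildcard_lookup(idx, "mo*n"))  # -> ['moon', 'moron']
```

The quadrupling of dictionary size mentioned above comes from storing one rotation per character of each term.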
K-gram index:
• In a k-gram index, the dictionary contains all k-grams that occur in any term in the
vocabulary. Each postings list points from a k-gram to all vocabulary terms containing that
k-gram.
• K-grams are k-length substrings of a string. Here k can be 1, 2, 3, and so on. For k=1,
each resulting substring is called a “unigram”; for k=2, a “bigram”; and for k=3, a
“trigram”. These are the most widely used k-grams for spelling correction, but the value of k
really depends on the situation and context.
• Following are the k-grams of the term “catastrophic”:
o Unigrams: [“c”, “a”, “t”, “a”, “s”, “t”, “r”, “o”, “p”, “h”, “i”, “c”]
o Bigrams: [“ca”, “at”, “ta”, “as”, “st”, “tr”, “ro”, “op”, “ph”, “hi”, “ic”]
o Trigrams: [“cat”, “ata”, “tas”, “ast”, “str”, “tro”, “rop”, “oph”, “phi”, “hic”]
• A k-gram index maps a k-gram to a postings list of all vocabulary terms that
contain it. The figure below shows the k-gram postings list corresponding to the bigram
“ur”.
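A minimal bigram (k=2) index can be sketched as follows; the vocabulary is invented, but the lookup mirrors the “ur” postings list described above:

```python
from collections import defaultdict

def kgrams(term, k):
    """All k-length substrings of a term."""
    return [term[i:i + k] for i in range(len(term) - k + 1)]

def build_kgram_index(vocab, k=2):
    """Map each k-gram to the set of vocabulary terms containing it."""
    index = defaultdict(set)
    for term in vocab:
        for g in kgrams(term, k):
            index[g].add(term)
    return index

idx = build_kgram_index(["turn", "urgent", "nurse"])
print(sorted(idx["ur"]))  # -> ['nurse', 'turn', 'urgent']
```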
Spelling Correction:
Two principles used for spelling correction:
• Correcting the document(s) being indexed
• Correcting user queries to retrieve the “right” answers
We focus on two specific forms of spelling correction:
• Isolated-term correction:
o In isolated-term correction, we attempt to correct a single query term at a time even
when we have a multiple-term query.
o Check each word on its own for misspelling.
o This will not catch typos that result in correctly spelled words, e.g., from → form.
• Context-sensitive correction:
o Isolated-term correction would fail to correct typographical errors such as flew form
Heathrow, where all three query terms are correctly spelled. When a phrase such as
this retrieves few documents, a search engine may like to offer the corrected query
flew from Heathrow.
o The simplest way to do this is to enumerate corrections of each of the three query
terms even though each query term is correctly spelled, then try substitutions of each
correction in the phrase.
o For the example flew form Heathrow, we enumerate such phrases as:
▪ flew from heathrow
▪ fled form heathrow
▪ flea form heathrow
For each such substitute phrase, the search engine runs the query and determines the number of
matching results.
We begin by examining two techniques for addressing isolated-term correction: edit distance, and k-
gram overlap. We then proceed to context-sensitive correction.
Levenshtein /Edit Distance:
• In computational linguistics and computer science, edit distance is a way of quantifying how
dissimilar two strings (e.g., words) are to one another by counting the minimum number of
operations required to transform one string into the other.
• Applications:
o Natural language processing, where automatic spelling correction can determine candidate
corrections for a misspelled word by selecting words from a dictionary that have a low distance
to the word in question.
o In bioinformatics, it can be used to quantify the similarity of DNA sequences, which can be
viewed as strings of the letters A, C, G and T.
• The Levenshtein distance is a string metric for measuring the difference between two sequences.
• Informally, the Levenshtein distance between two words is the minimum number of single-
character edits (i.e. insertions, deletions or substitutions) required to change one word into the
other.
• It is named after Vladimir Levenshtein, who considered this distance in 1965.
• Levenshtein distance may also be referred to as edit distance, although it may also denote a
larger family of distance metrics.
• It is closely related to pairwise string alignments.
• Definition: Mathematically, the Levenshtein distance between two strings a and b (of length |a|
and |b| respectively) is given by lev_a,b(|a|, |b|), where
  lev_a,b(i, j) = max(i, j) if min(i, j) = 0, and otherwise
  lev_a,b(i, j) = min( lev_a,b(i−1, j) + 1,
                       lev_a,b(i, j−1) + 1,
                       lev_a,b(i−1, j−1) + 1(a_i ≠ b_j) ).
Here 1(a_i ≠ b_j) is the indicator function, equal to 0 when a_i = b_j and to 1 otherwise, and
lev_a,b(i, j) is the distance between the first i characters of a and the first j characters of b.
• Note that the first element in the minimum corresponds to deletion (from a to b), the second to
insertion and the third to match or mismatch, depending on whether the respective symbols are
the same.
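The recurrence above translates directly into a standard dynamic-programming table. A minimal sketch, not tied to any particular library:

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, and
    substitutions required to turn string a into string b."""
    m, n = len(a), len(b)
    # dist[i][j] = edit distance between a[:i] and b[:j]
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i              # delete all i characters of a[:i]
    for j in range(n + 1):
        dist[0][j] = j              # insert all j characters of b[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1   # indicator 1(a_i != b_j)
            dist[i][j] = min(dist[i - 1][j] + 1,          # deletion
                             dist[i][j - 1] + 1,          # insertion
                             dist[i - 1][j - 1] + cost)   # match / substitution
    return dist[m][n]

print(levenshtein("flew", "fled"))       # -> 1
print(levenshtein("kitten", "sitting"))  # -> 3
```

The three arguments of min correspond, in order, to the deletion, insertion, and match-or-mismatch cases noted above.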
Phonetic Correction:
• A phonetic algorithm is an algorithm for indexing words by their pronunciation.
• Most phonetic algorithms were developed for English and are not useful for indexing
words in other languages.
• Phonetic hashing is mainly used to correct phonetic misspellings in proper nouns. Algorithms
for such phonetic hashing are commonly known collectively as soundex algorithms.
• Soundex Algorithm:
1. Retain the first letter of the name and drop all other occurrences of a, e, i, o, u, y, h, w.
2. Replace consonants with digits as follows (after the first letter):
o b, f, p, v → 1
o c, g, j, k, q, s, x, z → 2
o d, t → 3
o l→4
o m, n → 5
o r→6
3. If two or more letters with the same number were adjacent in the original name (before step
1), retain only the first; two letters with the same number separated by 'h' or 'w' are
coded as a single number, whereas such letters separated by a vowel are coded twice.
This rule also applies to the first letter.
4. If there are too few letters in the word to assign three numbers, append zeros until there
are three; if there are four or more numbers, retain only the first three.
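The four steps above can be sketched as follows. This follows the standard American Soundex; descriptions of the edge cases in rule 3 vary slightly between sources, and the function name is invented:

```python
def soundex(name):
    """American Soundex: first letter followed by three digits."""
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    name = name.lower()
    first = name[0].upper()
    digits = []
    prev = codes.get(name[0], "")   # rule 3 also applies to the first letter
    for ch in name[1:]:
        if ch in "hw":
            continue                # h and w do not break a run of equal codes
        code = codes.get(ch, "")    # vowels and y map to no code
        if code and code != prev:
            digits.append(code)
        prev = code                 # a vowel resets prev, so repeats after it are coded twice
    return first + ("".join(digits) + "000")[:3]   # rule 4: pad or truncate

print(soundex("Robert"))   # -> R163
print(soundex("Rupert"))   # -> R163
print(soundex("Tymczak"))  # -> T522
print(soundex("Pfister"))  # -> P236
```

Robert and Rupert hashing to the same code R163 is exactly the behavior that lets Soundex match phonetic misspellings of proper nouns.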