Information Retrieval

This document provides an introduction to information retrieval concepts and systems. It discusses key concepts like queries, documents, document substitutes, and file structures. It also outlines some common operations in information retrieval systems, including query operations like parsing and feedback, term operations like stemming and truncation, and document operations like parsing and ranking. The document presents information retrieval as the process of matching user queries to stored documents through the use of search engines and retrieval systems.

Uploaded by

Chuks Valentine

CSC 231: Data Management I

Information Retrieval (IR)

Introduction
Imagine that you want to find a Microsoft Word document on your system. You open the
search tool and type a keyword from the document, and the tool returns that document
along with other related documents. The search tool is able to retrieve the document
because you had saved it on the system. This is the concept of information retrieval.

This study session introduces IR concepts, and presents a domain model of IR systems that
describes their similarities and differences. The relationship of IR systems to other
information systems will be discussed and also the evaluation of IR systems.


2.1 Concept of Information Retrieval


Automated information retrieval (IR) systems were originally developed to help manage the
huge scientific literature that has developed since the 1940s. Many university, corporate, and
public libraries now use IR systems to provide access to books, journals, and other documents.

An IR system matches user queries (formal statements of information needs) to documents stored
in a database. A document is a data object, usually textual, though it may also contain other types
of data such as photographs, graphs, and so on.

Box 2.1: Definition of Information Retrieval

Information Retrieval is the activity of obtaining information resources relevant to an
information need from a collection of information resources. Searches can be based on metadata
(data that describes other data) or on full-text indexing.

Often, the documents themselves are not stored directly in the IR system, but are represented in
the system by document substitutes. This study material, for example, is a document and could
be stored in its entirety in an IR database. You might instead choose to create a document
substitute for it consisting of the title, author, and abstract.

This is typically done for efficiency, that is, to reduce the size of the database and searching time.
Document substitutes are also called documents. An IR system must support certain basic
operations. There must be a way to enter documents into a database, change the documents, and
delete them. There must also be some way to search for documents, and present them to a user.

An information retrieval process begins when a user enters a query into the system. The query
is then matched against the information in the database. Depending on the application, the data
objects may be, for example, text documents, images, audio, mind maps or videos. As noted above,
the documents themselves are often not stored directly in the IR system, but are instead
represented in the system by document substitutes or metadata.


Box 2.2: Query

Queries are formal statements of information needs, for example search strings in web search
engines. In information retrieval, a query does not uniquely identify a single object in the
collection (an object is an entity that is represented by information in a database). Instead,
several objects may match the query, perhaps with different degrees of relevance.

Most IR systems compute a numeric score for how well each object in the database matches the
query, and rank the objects according to this value. The top-ranking objects are then shown to the
user. The process may then be iterated if the user wishes to refine the query. Figure 2.1
shows the pathway of information from the query to the retrieval of the records.

Figure 2.1: Abstract Model of Information Retrieval
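
The score-and-rank step described above can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the score is the number of distinct query terms found in a document, and the function names and sample documents are hypothetical. Real systems use more refined scoring.

```python
def score(query, document):
    """Count how many distinct query terms appear in the document."""
    query_terms = set(query.lower().split())
    doc_terms = set(document.lower().split())
    return len(query_terms & doc_terms)


def rank(query, documents, top_k=3):
    """Return (doc_id, score) pairs for matching documents, best first."""
    scored = [(doc_id, score(query, text)) for doc_id, text in documents.items()]
    matches = [pair for pair in scored if pair[1] > 0]
    # Sort by descending score; ties are broken by document identifier.
    matches.sort(key=lambda pair: (-pair[1], pair[0]))
    return matches[:top_k]


# Hypothetical three-document collection.
documents = {
    "d1": "inverted files speed up retrieval",
    "d2": "signature files represent documents",
    "d3": "retrieval of documents with inverted files",
}
print(rank("inverted files retrieval", documents))
```

Several documents match the query with different scores, which is the point made in Box 2.2: a query does not identify a single object.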


2.2 Domain Analysis of Information Retrieval (IR) Systems

The first steps in domain analysis are to identify important concepts and vocabulary in the
domain, define them, and organize them with a faceted classification. Table 2.1 is a faceted
classification for IR systems, containing important IR concepts and vocabulary. The first row of
the table specifies the facets.

Table 2.1: Faceted Classification of IR Systems

Conceptual model   File structure   Query operations   Term operations   Document operations   Hardware operations
Boolean            Inverted file    Feedback           Stem              Parse                 von Neumann
Extended Boolean   Signature        Parse              Weight            Display, cluster      Parallel
Probabilistic      Pat trees        Boolean            Thesaurus         Field mask            IR specific
String search      Graphs           Cluster            Stoplist          Rank                  Optical disk
Vector space       Hashing                             Truncation        Sort                  Mag. disk

2.2.1 File Structures

When designing an IR system, choosing the type of file structure is a fundamental decision you
need to make.

The file structures used in IR systems are:

1. Flat files,


2. Inverted files,

3. Signature files,

4. PAT trees,

5. Graphs.

Though it is possible to keep file structures in main memory, in practice they are usually stored
on disk because of their size.

Using a flat file approach, one or more documents are stored in a file, usually as ASCII
(American Standard Code for Information Interchange) or EBCDIC (Extended Binary Coded
Decimal Interchange Code) text.

Flat file searching is usually done via pattern matching. On UNIX, for example, you can store a
document collection with one document per file in a UNIX directory, and search it using pattern
searching tools such as grep or awk. An example of a flat file structure is shown in Figure 2.3.

Figure 2.3: Flat file structure
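
As a rough Python counterpart to searching a directory with grep, the sketch below scans every file in a directory for a regular expression. The function name and the one-document-per-file layout are assumptions made for illustration.

```python
import re
from pathlib import Path


def search_flat_files(directory, pattern):
    """Return the sorted names of files whose text matches the regular expression."""
    regex = re.compile(pattern, re.IGNORECASE)
    hits = []
    for path in sorted(Path(directory).iterdir()):
        # Every file is read in full on every search: there is no index.
        if path.is_file() and regex.search(path.read_text(encoding="utf-8")):
            hits.append(path.name)
    return hits
```

Because each search reads the whole collection, the cost grows with the size of the collection, which is one reason indexed structures such as inverted files are used.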

An inverted file is a kind of indexed file. The structure of an inverted file entry is usually
keyword, document-ID, and field-ID. A keyword is an indexing term that describes the
document, document-ID is a unique identifier for a document, and field-ID is a unique name that
indicates from which field in the document the keyword came. Figure 2.4 shows what an inverted
file looks like.

Figure 2.4: Inverted File Structure

Source: http://ads.harvard.edu/pubs/A+AS/2000A+AS..143...85A/img10.gif

Some systems also include information about the paragraph and sentence location where the
term occurs. Searching is done by looking up query terms in the inverted file.
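
A minimal sketch of the keyword, document-ID, field-ID structure is given below. It assumes each document is a dictionary of named text fields; the function name and sample documents are hypothetical.

```python
from collections import defaultdict


def build_inverted_file(documents):
    """Map each keyword to a sorted list of (document-ID, field-ID) postings."""
    postings = defaultdict(set)
    for doc_id, fields in documents.items():
        for field_id, text in fields.items():
            for word in text.lower().split():
                postings[word].add((doc_id, field_id))
    return {keyword: sorted(entries) for keyword, entries in postings.items()}


# Hypothetical two-document collection with title and abstract fields.
documents = {
    "doc1": {"title": "information retrieval", "abstract": "retrieval of documents"},
    "doc2": {"title": "file structures", "abstract": "inverted file retrieval"},
}
inverted_file = build_inverted_file(documents)
print(inverted_file["retrieval"])
```

Looking up a query term is then a single dictionary access rather than a scan of every document.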

Signature files contain signatures that represent documents. There are various ways of
constructing signatures. In one common signature method, for example, documents are split
into logical blocks, each containing a fixed number of distinct significant (that is, non-stoplist)
words.

Each word in the block is hashed to give a signature, a bit pattern with some of the bits set to 1.
The signatures of the words in a block are OR'ed together to create a block signature. The block
signatures are then concatenated to produce the document signature. Searching is done by
comparing the signatures of queries with document signatures.
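
The hashing and OR'ing steps can be sketched as follows. The signature width, the number of bits set per word, and the use of SHA-256 as the hash are all assumptions chosen for illustration, not values taken from this session.

```python
import hashlib

SIGNATURE_BITS = 64  # assumed signature width
BITS_PER_WORD = 3    # assumed number of bits set per word


def word_signature(word):
    """Hash a word to a bit pattern with up to BITS_PER_WORD bits set."""
    digest = hashlib.sha256(word.lower().encode("utf-8")).digest()
    signature = 0
    for i in range(BITS_PER_WORD):
        signature |= 1 << (digest[i] % SIGNATURE_BITS)
    return signature


def block_signature(words):
    """OR the word signatures together to form the block signature."""
    signature = 0
    for word in words:
        signature |= word_signature(word)
    return signature


def may_contain(block_sig, word):
    """True if every bit of the word's signature is set in the block signature.

    True can be a false match caused by other words setting the same bits,
    so candidate blocks still need checking against the text. False means
    the word is definitely not in the block.
    """
    sig = word_signature(word)
    return (block_sig & sig) == sig
```

The comparison is a cheap bitwise test, which is why signature files can act as a fast filter in front of the actual text.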

PAT trees are Patricia trees constructed over all sistrings in a text. If a document collection is
viewed as a sequentially numbered array of characters, a sistring is a subsequence of characters
from the array starting at a given point and extending an arbitrary distance to the right. A Patricia
tree is a digital tree where the individual bits of the keys are used to decide branching.
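
To make the idea of a sistring concrete, the sketch below lists the sistrings of a short text and finds those that start with a given prefix. It deliberately swaps the Patricia tree for a plain sorted list with binary search, so it illustrates sistrings only, not the tree itself; the function names are hypothetical.

```python
import bisect


def sistrings(text):
    """Return (start position, sistring) pairs, one per character position."""
    return [(i, text[i:]) for i in range(len(text))]


def prefix_search(text, prefix):
    """Return the start positions of every sistring that begins with prefix."""
    ordered = sorted(sistrings(text), key=lambda pair: pair[1])
    keys = [s for _, s in ordered]
    # All sistrings sharing a prefix are adjacent once sorted.
    index = bisect.bisect_left(keys, prefix)
    positions = []
    while index < len(keys) and keys[index].startswith(prefix):
        positions.append(ordered[index][0])
        index += 1
    return sorted(positions)


print(sistrings("abc"))
print(prefix_search("banana", "an"))
```

A PAT tree supports the same kind of prefix lookup, but by branching on the bits of the sistrings instead of sorting them.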

Graphs, or networks, are ordered collections of nodes connected by arcs. They can be used to
represent documents in various ways. For example, a kind of graph called a semantic net can be
used to represent the semantic relationships in text often lost in the indexing systems above.

Although interesting, graph-based techniques for IR are impractical now because of the amount
of manual effort that would be needed to represent a large document collection in this form.

2.2.2 Query Operations

Queries are formal statements of information needs put to the IR system by users. The operations
on queries are obviously a function of the type of query, and the capabilities of the IR system.
One common query operation is parsing, that is, breaking the query into its constituent elements.

Boolean queries, for example, must be parsed into their constituent terms and operators. The set
of document identifiers associated with each query term is retrieved, and the sets are then
combined according to the Boolean operators.
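
The parse-then-combine procedure can be sketched with Python sets. The sketch assumes a flat query of the form "term OPERATOR term ...", evaluated strictly left to right with no parentheses or operator precedence; the function name and the sample postings are hypothetical.

```python
def evaluate_boolean(query, postings):
    """Evaluate a flat query such as 'a AND b NOT c' strictly left to right."""
    tokens = query.split()
    result = set(postings.get(tokens[0].lower(), set()))
    # The remaining tokens come in (operator, term) pairs.
    for operator, term in zip(tokens[1::2], tokens[2::2]):
        docs = postings.get(term.lower(), set())
        if operator == "AND":
            result &= docs
        elif operator == "OR":
            result |= docs
        elif operator == "NOT":
            result -= docs
        else:
            raise ValueError(f"unknown operator: {operator}")
    return result


# Hypothetical sets of document identifiers for three terms.
postings = {"information": {1, 2, 3}, "retrieval": {2, 3}, "database": {3, 4}}
print(evaluate_boolean("information AND retrieval NOT database", postings))
```

AND maps to set intersection, OR to union, and NOT to set difference, which is why document identifier sets are a natural fit for Boolean retrieval.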

In feedback, information from previous searches is used to modify queries. For example, terms
from relevant documents found by a query may be added to the query, and terms from
non-relevant documents deleted. There is some evidence that feedback can significantly improve IR
performance. Figure 2.5 shows an example of a Boolean query.


Figure 2.5: A Boolean Query

Source: http://pdfgold.helpmax.net/en/search-and-index/searching-pdfs/

2.2.3 Term Operations

Several types of operations can be performed on terms in an IR system.

They include:

• Stemming,

• Truncation,

• Weighting,

• Stoplist and

• Thesaurus operations

Stemming is the automated conflation (fusing or combining) of related words, by reducing the
words to a common root form.

Truncation is manual conflation of terms by using wildcard characters in the word, so that the
truncated term will match multiple words. For example, if you are interested in finding
documents about truncation, you might enter the term "truncat?" which would match terms such
as truncate, truncated, and truncation.
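
A right-truncated term can be expanded against the system's vocabulary as sketched below. The sketch assumes, as in the example above, that a trailing "?" stands for any ending; the function name and the vocabulary are hypothetical.

```python
def expand_truncation(term, vocabulary):
    """Return the vocabulary words matched by a right-truncated term such as 'truncat?'."""
    if not term.endswith("?"):
        # No wildcard: the term only matches itself.
        return [word for word in vocabulary if word == term]
    stem = term[:-1]
    return sorted(word for word in vocabulary if word.startswith(stem))


vocabulary = ["tree", "truncate", "truncated", "truncation", "trunk"]
print(expand_truncation("truncat?", vocabulary))
```

The expanded words can then be looked up individually and their document sets combined with OR.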

Another way of conflating related terms is to use a thesaurus, which lists synonymous terms,
and sometimes the relationships among them.

A Stoplist is a list of words considered to have no indexing value, used to eliminate potential
indexing terms. Each potential indexing term is checked against the stoplist and eliminated if
found there.

In weighting, indexing or query terms are assigned numerical values, usually based on information
about the statistical distribution of terms, that is, the frequencies with which terms occur in
documents, document collections, or subsets of document collections such as documents
considered relevant to a query.
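
One widely used weighting scheme based on term frequencies is tf-idf, sketched below. The session does not name a particular scheme, so the choice of tf-idf, the formula weight = tf * log(N / df), and the function name are assumptions made for illustration.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return {doc_id: {term: weight}} using weight = tf * log(N / df)."""
    n_docs = len(documents)
    term_counts = {
        doc_id: Counter(text.lower().split()) for doc_id, text in documents.items()
    }
    # Document frequency: the number of documents in which each term occurs.
    doc_freq = Counter()
    for counts in term_counts.values():
        doc_freq.update(counts.keys())
    return {
        doc_id: {
            term: tf * math.log(n_docs / doc_freq[term]) for term, tf in counts.items()
        }
        for doc_id, counts in term_counts.items()
    }
```

A term that occurs in every document gets weight 0, reflecting that it does not help to tell documents apart, while a term that is frequent in one document but rare in the collection gets a high weight.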

2.2.4 Document Operations

Documents are the primary objects in IR systems and there are many operations for them. In
many types of IR systems, documents added to a database must be given unique identifiers,
parsed into their constituent fields, and those fields broken into field identifiers and terms. Once
documents are in the database, you may wish to mask off certain fields for searching and display.

For example, you may wish to search only the title and abstract fields of documents for a given
query, or may wish to see only the title and author of retrieved documents. You may also wish to
sort retrieved documents by some field, for example by author.
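
Field masking and sorting can be sketched as below, assuming each document is stored as a dictionary of named fields. The function names, field names, and sample records are hypothetical.

```python
def search_in_fields(records, term, fields):
    """Return records whose named fields contain the term (case-insensitive)."""
    term = term.lower()
    return [r for r in records if any(term in r.get(f, "").lower() for f in fields)]


def display(records, fields, sort_by):
    """Sort records by one field and mask off all but the requested fields."""
    ordered = sorted(records, key=lambda r: r.get(sort_by, ""))
    return [{f: r[f] for f in fields if f in r} for r in ordered]


# Hypothetical records; only the second mentions "signature" in its body alone.
records = [
    {"title": "Signature files", "author": "Okoro",
     "abstract": "Hashing words into bit patterns", "body": "..."},
    {"title": "Inverted files", "author": "Adams",
     "abstract": "An index of keywords", "body": "compared with signature files"},
    {"title": "Ranking", "author": "Bello",
     "abstract": "Signature and index scores", "body": ""},
]
hits = search_in_fields(records, "signature", ["title", "abstract"])
print(display(hits, ["title", "author"], sort_by="author"))
```

The second record is not retrieved because the body field is masked off for searching, and the output shows only the title and author of each hit, sorted by author.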

Display operations include printing the documents, and displaying them on a CRT. Using
information about term distributions, it is possible to assign a probability of relevance to each
document in a retrieved set, allowing retrieved documents to be ranked in order of probable
relevance. Term distribution information can also be used to cluster similar documents in a
document space.

Display is an important document operation. The user interface of an IR system, as with
any other type of information system, is critical to its successful use. Note, however, that user
interface algorithms and data structures are not IR specific.

2.2.5 Functional View of Paradigm IR System

Figure 2.6 shows the activities associated with a common type of Boolean IR system, chosen
because it represents the operational standard for IR systems.


Figure 2.6: Example of Boolean IR system

When building the database, documents are taken one by one, and their text is broken into words.
The words from the documents are compared against a stoplist--a list of words thought to have
no indexing value.

Words from the document not found in the stoplist may next be stemmed. Words may then also
be counted, since the frequencies of words in documents and in the database as a whole are often
used for ranking retrieved documents.
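
The steps just described (break the text into words, remove stoplist words, stem, and count) can be sketched as follows. The stoplist and the suffix-stripping stemmer are tiny, hypothetical stand-ins chosen for illustration; production systems use much larger stoplists and more careful stemmers.

```python
import re
from collections import Counter

STOPLIST = {"a", "an", "and", "in", "of", "the", "to"}  # assumed, tiny stoplist
SUFFIXES = ("ing", "ed", "s")  # assumed suffixes for a crude stemmer


def stem(word):
    """Strip one common suffix, keeping at least three characters of the word."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def index_terms(text):
    """Break text into words, drop stoplist words, stem, and count the rest."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(stem(w) for w in words if w not in STOPLIST)


print(index_terms("The engineer engineered the engineering of engines"))
```

In this example "engineer", "engineered", and "engineering" are all counted under the single term "engineer", while "engines" is reduced to "engine".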


Finally, the words and associated information such as the documents, fields within the
documents, and counts are put into the database. The database then might consist of pairs of
document identifiers and keywords as follows.

keyword1 - document1-Field_2
keyword2 - document1-Field_2, 5
keyword2 - document3-Field_1, 2
keyword3 - document3-Field_3, 4

keyword-n - document-n-Field_i, j

Such a structure is called an inverted file. In an IR system, each document must have a unique
identifier, and its fields, if field operations are supported, must have unique field names.

To search the database, you must enter a query consisting of a set of keywords connected by
Boolean operators. The query is parsed into its constituent terms and Boolean operators. These
terms are then looked up in the inverted file and the list of document identifiers corresponding to
them are combined according to the specified Boolean operators.

If frequency information has been kept, the retrieved set may be ranked in order of probable
relevance. The result of the search is then presented to you.

In some systems, you make judgments about the relevance of the retrieved documents, and this
information is used to modify the query automatically by adding terms from relevant documents
and deleting terms from non-relevant documents. Systems such as this give remarkably good
retrieval performance given their simplicity, but their performance is far from perfect. Many
techniques to improve them have been proposed.
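
The automatic query modification described above can be sketched as below. The sketch assumes one simple policy: every term from a relevant document is added, and a term is deleted only if it occurs in a non-relevant document and in no relevant one. The function name and sample texts are hypothetical, and real feedback methods usually weight terms rather than add or delete them outright.

```python
def apply_feedback(query_terms, relevant_docs, non_relevant_docs):
    """Add terms from relevant documents and drop terms seen only in non-relevant ones."""
    relevant_terms = set()
    for text in relevant_docs:
        relevant_terms.update(text.lower().split())
    non_relevant_terms = set()
    for text in non_relevant_docs:
        non_relevant_terms.update(text.lower().split())
    revised = set(query_terms) | relevant_terms
    # Only remove terms that never occur in a relevant document.
    revised -= non_relevant_terms - relevant_terms
    return sorted(revised)


print(apply_feedback(["jaguar", "car"], ["jaguar cat speed"], ["jaguar car engine speed"]))
```

Here the user judged the document about the animal relevant, so "cat" and "speed" are added and "car" is dropped from the query.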

One such technique aims to establish a connection between morphologically related terms.
Stemming is a technique for conflating term variants so that the semantic closeness of words like
"engineer," "engineered," and "engineering" will be recognized in searching. Another way to
relate terms is via thesauri, or synonym lists.

2.3 Information Retrieval System Evaluations


IR systems can be evaluated in terms of many criteria including execution efficiency, storage
efficiency, retrieval effectiveness, and the features they offer the user. The relative
importance of these factors must be decided by the designers of the system, and the selection of
appropriate data structures and algorithms for implementation will depend on these decisions.

Execution efficiency is measured by the time it takes a system, or part of a system, to perform a
computation. This can be measured in C-based systems by using profiling tools such as prof on
UNIX. Execution efficiency has always been a major concern of IR systems since most of them
are interactive, and a long retrieval time will interfere with the usefulness of the system.

Storage efficiency is measured by the number of bytes needed to store data. Space overhead, a
common measure of storage efficiency, is the ratio of the size of the index files plus the size of
the document files to the size of the document files. Space overhead ratios of 1.5 to 3 are
typical for IR systems based on inverted files.
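
As a worked example of the space overhead ratio, the short function below applies the definition directly. The function name and the byte counts are hypothetical.

```python
def space_overhead(index_bytes, document_bytes):
    """Return (index size + document size) / document size."""
    if document_bytes <= 0:
        raise ValueError("document_bytes must be positive")
    return (index_bytes + document_bytes) / document_bytes


# A 100 MB document collection with a 50 MB index.
print(space_overhead(50_000_000, 100_000_000))  # 1.5
```

A ratio of 1.5 means the index adds half as much storage again on top of the documents themselves; a ratio of 1.0 would mean there is no index at all.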

Most IR experimentation has focused on retrieval effectiveness, usually based on document
relevance judgments. This has been a problem since relevance judgments are subjective and
unreliable. That is, different judges may assign different relevance values to a document retrieved
in response to a given query.

The seriousness of the problem is the subject of debate, with many IR researchers arguing that
the relevance judgment reliability problem is not sufficient to invalidate the experiments that use
relevance judgments.

Many measures of retrieval effectiveness have been proposed. The most commonly used are
recall and precision.


Box 2.3: Recall and Precision

Recall is the ratio of relevant documents retrieved for a given query over the number of relevant
documents for that query in the database. Except for small test collections, this denominator is
generally unknown and must be estimated by sampling or some other method.

Precision is the ratio of the number of relevant documents retrieved over the total number of
documents retrieved. Both recall and precision take on values between 0 and 1.
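
Both ratios can be computed directly from the set of retrieved documents and the set of relevant documents, as in the sketch below. The function name and the document identifiers are hypothetical, and the sketch assumes the full set of relevant documents is known, which, as noted above, is usually only true for small test collections.

```python
def recall_precision(retrieved, relevant):
    """Return (recall, precision) for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    recall = hits / len(relevant) if relevant else 0.0
    precision = hits / len(retrieved) if retrieved else 0.0
    return recall, precision


# Four documents retrieved, five relevant in the database, two in common.
print(recall_precision({1, 2, 3, 4}, {2, 4, 5, 6, 7}))  # (0.4, 0.5)
```

Recall is 2/5 because only two of the five relevant documents were found; precision is 2/4 because only two of the four retrieved documents were relevant.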

Since you often wish to compare IR performance in terms of both recall and precision, methods
for evaluating them simultaneously have been developed. One method involves the use of
recall-precision graphs, which are bivariate plots where one axis is recall and the other precision.

Recall-precision plots show that recall and precision are inversely related. That is, when
precision goes up, recall typically goes down and vice-versa.

Figure 2.7: Recall-precision Graph

A combined measure of recall and precision, the evaluation measure E, is defined as:

$$E = 1 - \frac{(1 + b^2)\,P\,R}{b^2 P + R}$$

where P = precision, R = recall, and b is a measure of the relative importance, to a user, of recall
and precision. Experimenters choose values of b that they hope will reflect the recall and
precision interests of the typical user. For example, b levels of 0.5, indicating that a user is twice
as interested in precision as recall, and 2, indicating that a user is twice as interested in recall
as precision, might be used.
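
The sketch below computes E, assuming the usual form of the measure, E = 1 - (1 + b^2)PR / (b^2 P + R), in which lower values of E indicate better performance. The function name and the sample values are hypothetical.

```python
def e_measure(precision, recall, b=1.0):
    """Return E = 1 - (1 + b^2) * P * R / (b^2 * P + R); lower is better."""
    if precision == 0 and recall == 0:
        return 1.0  # nothing relevant was retrieved: the worst possible value
    b_squared = b * b
    return 1 - ((1 + b_squared) * precision * recall) / (b_squared * precision + recall)


# With b = 0.5 the system with the higher precision gets the lower (better) E.
print(e_measure(0.8, 0.4, b=0.5))
print(e_measure(0.4, 0.8, b=0.5))
```

The two calls swap precision and recall; with b = 0.5 the first result is smaller, showing that this setting favours precision over recall.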

IR experiments often use test collections, which consist of a document database and a set of
queries for the database for which relevance judgments are available. The number of documents
in test collections has tended to be small, typically a few hundred to a few thousand documents.
Test collections are available on optical disk. Table 2.3 lists some of them.

Table 2.3: IR Test Collections

Collection   Subject                  Documents   Queries
ADI          Information science             82        35
CACM         Computer science              3200        64
CISI         Library science               1460        76
CRAN         Aeronautics                   1400       225
LISA         Library science               6004        35
MED          Medicine                      1033        30
NLM          Medicine                      3078       155
NPL          Electrical engineering       11429       100
TIME         General articles               423        83

Summary:

1. An IR system matches user queries (formal statements of information needs) to documents
stored in a database. A document is a data object, usually textual, though it may also contain
other types of data such as photographs, graphs, and so on.

2. An IR conceptual model is a general approach to IR systems. Several taxonomies for IR
conceptual models have been proposed. Faloutsos gives three basic approaches: text pattern
search, inverted file search, and signature search. Belkin and Croft categorize IR
conceptual models differently, dividing retrieval techniques first into exact match and inexact
match.

3. The type of file structure to use is an important decision to make before using a document
database. The file structures used in IR systems are flat files, inverted files, signature files, PAT
trees, and graphs. Though it is possible to keep file structures in main memory, in practice IR
databases are usually stored on disk because of their size.

5. Queries are formal statements of information needs put to the IR system by users. The
operations on queries are a function of the type of query and the capabilities of the IR
system. One common query operation is parsing, that is, breaking the query into its constituent
elements.

6. Operations on terms in an IR system include stemming, truncation, weighting, stoplist and
thesaurus operations.

7. Documents are the primary objects in IR systems and there are many operations for them. In
many types of IR systems, documents added to a database must be given unique identifiers,
parsed into their constituent fields, and those fields broken into field identifiers and terms.

8. IR systems can be evaluated in terms of many criteria including execution efficiency, storage
efficiency, retrieval effectiveness, and the features they offer a user. The relative importance of
these factors must be decided by the designers of the system, and the selection of appropriate
data structures and algorithms for implementation will depend on these decisions.


Glossary of Terms
Information Retrieval (IR) is the activity of obtaining information resources relevant to an
information need from a collection of information resources. Searches can be based on metadata
(data that describes other data) or on full-text indexing.

Queries are formal statements of information needs, for example search strings in web search
engines.
