Unit 17
Information Retrieval Processes and Techniques

17.0 OBJECTIVES

The aim of any Information Storage and Retrieval (ISAR) system, irrespective of its type, is to retrieve the desired information. Thus, the processes and

The concept of information retrieval presupposes that there are some documents or records containing information that have been organized in an order suitable for easy retrieval. The documents or records that we are concerned with contain bibliographic information, which is quite different from other kinds of information or data. We may take a simple example. If we have a database of information pertaining to an office or a company, all we have are the different kinds of records and related facts, like names of employees, their positions, salary, and so on, or in the case of a manufacturing company, names of different items, prices, quantity, and so on. The retrieval system here is designed to search for and retrieve specific facts or data, like the salary of a particular manager, or the price of a perfume, and so on. The major objective of an information retrieval system, on the other hand, is to retrieve the information – either the actual information or, through surrogates, the documents containing the information – that fully or partially matches the user’s query. Thus, the search output may contain bibliographic details of the documents that match the query, or the actual text, image, video, etc. that contains the required information. The database in the case of an information retrieval system may contain abstracts or full texts of documents, like newspaper articles, handbooks, dictionaries, encyclopedias, legal documents, statistics, etc., as well as audio, images, and video information.

Whatever may be the nature of the database – bibliographic, full-text or multimedia – the system presupposes that there is a group of users for whom the system is designed. Users are considered to have certain queries or information needs, and when they put forward their requirement to the system, the latter should be able to provide the necessary bibliographic references of those documents containing the required information, or the actual text in the case of a full-text retrieval system. Alternative models of (knowledge-based) information retrieval seek to provide the user with the information directly rather than just the citations, the abstract or the full text.

Self Check Exercises

1) Mention some synonyms of information retrieval.
2) How does the original connotation of IR differ from that of modern information retrieval?

Note: i) Write your answers in the space given below.
ii) Check your answers with the answers given at the end of the Unit.

...............................................................................................................
...............................................................................................................
...............................................................................................................
...............................................................................................................
...............................................................................................................

17.3 DATABASES

An information retrieval system deals with databases, and so does a database management system. So, what is the difference between an information retrieval system and a database management system? Before we discuss these differences, we need to have some basic idea of a database and its various components, types, etc., which are discussed in the following sections.

17.3.1 Data

Data are discrete facts; when processed, they become information. However, in the context of an information retrieval system, we may consider information as a logical set of data. The word ‘data’ refers to a set of given facts. Information in a form that can be processed by a computer is called data. The term data has long been used to refer to scientific measurements, but words also constitute data just as numbers do. A list of names is data, a set of keywords is data, a doctor’s record of his patients is data, and figures relating to temperature, humidity, etc., or sales of a company, are data.

17.3.2 The Database

A database can be conceived as a system whose base, whose key concept, is simply a particular way of handling data. In other words, a database is nothing more than a computer-based record-keeping system. The overall objective of a database is to record and maintain information. The Macmillan Dictionary of Information Technology [Longley and Shain, 1989] defines a database as ‘a collection of interrelated data stored so that it may be accessed by authorised users with simple user-friendly dialogues’. The Chambers Science and Technology Dictionary [Walker, 1988] provides a simpler definition of a database: ‘a collection of structured data independent of any particular application’. It may be noted from the above definitions that a database contains some data that are structured and integrated. Ellingen [1991] defines a database as ‘a collection of information that can be searched as a single entity’. According to Oxborrow [1989], a database can be considered as ‘an organised collection of related sets of data, managed in such a way as to enable the user or application program to view the complete collection, or a logical subset of the collection, as a single unit’.

From the above definitions we can simplify the definition of a database as an organised collection of related sets of data that can be accessed by more than one user by simple means and can be searched to reveal those that touch upon a particular need [Chowdhury, 2004]. In the computer world we frequently deal with files, which are the outer identifying boundary or a sort of folder containing data. Thus, a file is equivalent to an ordinary address book, if we are talking about names and addresses. A file in a computer is given a unique name by which it is addressed.

17.3.3 Records and Fields

A record is a collection of related information. A database is an organised collection of units of information, and each unit of information in a database is called a record. A record is generally what a user wants to find while searching a database. An example of a record is the main entry in a library’s catalogue, which describes the book’s title, author, subject, etc. A collection of database records constitutes a database file. Identifying what the record is to be is one of the early tasks in designing a database. If the database is a bibliographic one, the bibliographic information about each document is the unit of information or record.

A stored record is a named collection of associated stored fields [Date, 1981]. Each record is made up of particular segments or elements of information, each of which is called a field. A field holds a particular type of information within a record that can be separately addressed. The different items of information in a bibliographic record may be author, title, subject heading, etc. Thus, the different fields in a bibliographic record can be the ‘author field’ containing name(s) of author(s), the ‘title field’ containing the title, and so on. A field may be subdivided into still smaller units called subfields. For example, if the ‘imprint’ of a publication is regarded as a field in a bibliographic database, then the different components of the imprint – the publisher’s name, place of publication, and date of publication – can be called subfields.

A record is, thus, composed of fields and subfields. Identifying what fields and subfields are to be included in each record is an important task in the database
design process. Each field is given a unique identifier at the design stage, called a field tag, which is then used throughout for data input, editing, searching, printing, and so on. Several standards have been developed to help the designers of information retrieval systems in this regard. For example, in the case of an online catalogue, or more specifically an OPAC (Online Public Access Catalogue), standard bibliographic formats like MARC21 [2002] (MARC stands for Machine Readable Catalogue or Cataloguing; several different types of MARC formats have been developed and MARC21 is the most recent and the most heavily used MARC format), CCF (Common Communication Format) [1992], and so on, specify the fields and the corresponding field tags to be used while preparing catalogue entries for bibliographic items.

17.3.4 Properties of Databases

A database is designed to avoid duplication of data as well as to permit retrieval of information to satisfy a wide variety of user information needs. Major properties of a database can be summarised as follows:

• it is integrated with provisions for different applications;
• it eliminates or reduces data duplication;
• it enhances data independence by permitting application programs to be insensitive to changes in the database;
• it permits shared access;
• it permits finer granularity; and
• it provides facilities for centralised control of accessing and security control functions.

17.3.5 Kinds of Databases

In discussing databases, it is sometimes useful to classify them by the type of data record contained and sometimes by subject coverage. The two major divisions are reference databases and source databases. Reference databases lead the users to the source of the information: a document, a person or an organisation. They can be divided into three categories:

a) bibliographic databases, which include citations or bibliographic references, and sometimes abstracts of literature;
b) catalogue databases, which show the catalogue of a given library or a group of libraries in a network; and
c) referral databases, which offer references to information such as the name, address, specialisation, etc., of persons, institutions, information systems, etc.

Source databases provide the answer with no need for the user to refer elsewhere. These databases contain the information sought in electronic form and, therefore, the user can get access to the information instantly as a result of a search. Source databases can be grouped according to their content, for example:

a) numeric databases, which contain numerical data of various kinds, including statistics and survey data;
b) full-text databases, which contain the full text of documents; and
c) text-numeric databases, which contain a combination of textual and numerical data, such as a company annual report and handbook data.

Bibliographic databases form the basis of most of the information retrieval systems available today, be they home-grown or available on CD-ROM or through online access. Bibliographic databases can be divided into five broad categories:

a) large discipline-oriented databases;
b) interdisciplinary databases with coverage based on key or core journals;
c) cross-disciplinary databases;
d) smaller, more specialized databases serving a particular technology or application area; and
e) databases covering specific types of publication.

However, there could be many more kinds of bibliographic databases, such as:

– Specific subjects/disciplines: CASearch, BIOSIS, ERIC, MEDLINE, ENERGYLINE, LISA, ISA, and so on;
– Multidisciplinary: SCI SEARCH, SOCIAL SCISEARCH;
– Mission-oriented: NASA;
– Problem-oriented: ENVIROLINE, TOXLINE;
– Referral: Foundations Directory, Fine Chemicals Directory, Ulrich’s International Periodicals Directory;
– Factual: PTS Forecasts, CARIS/FAO (Ongoing Research);
– Textual references: DRUGLINE; and so on.

Many of these databases are available online through the web and in CD-ROM versions.

17.3.6 Information Retrieval vs. Database Management Systems

The technology that helps to process and manipulate data of various kinds is broadly termed database management technology, and the resulting software packages are known as database management systems (DBMSs). A database management system stores and retrieves discrete data elements that are structured, as opposed to a typical information retrieval system, which is designed to deal with unstructured data, e.g. the full texts of documents.

Typically a search in a database management environment produces one or more records that are stored in the database. One may argue that an information retrieval system also stores discrete data elements, like author, title, keyword, etc., in the form of a structured database. While this is true, an information retrieval system also handles unstructured data, for example a large chunk of text, and this is where a typical database management system differs from an information retrieval system. Many more differences between the two systems can be noticed, especially in the search and retrieval aspects. For example, in a typical database management search, we expect to retrieve discrete data, e.g. the price of an item, date of birth of an employee, and so on, whereas in an information retrieval search we retrieve an entire document, or part of it, containing the information required by the user. The major differences between a typical database management system and an information retrieval system are shown in Table 17.1.
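The contrast drawn in Section 17.3.6 between value-based, exact-match lookup and language-based, partial-match search can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample rows, the function names `dbms_lookup` and `ir_search`, and the word-overlap ranking are hypothetical and do not come from the Unit.

```python
# Hypothetical sample data; names and values are illustrative only.
employees = [
    {"name": "A. Rao", "position": "manager", "salary": 52000},
    {"name": "B. Sen", "position": "clerk", "salary": 21000},
]
documents = [
    {"id": 1, "text": "File structure and design for text retrieval"},
    {"id": 2, "text": "Introduction to artificial intelligence"},
]


def dbms_lookup(rows, field, value):
    """Value-based query: a row qualifies only if the field equals the value."""
    return [row for row in rows if row.get(field) == value]


def ir_search(docs, query):
    """Language-based query: keep documents sharing at least one query word,
    ordered by the number of shared words (a simple assumed ranking)."""
    terms = set(query.lower().split())
    scored = []
    for doc in docs:
        words = set(doc["text"].lower().split())
        overlap = len(terms & words)
        if overlap:  # a partial match is enough to retrieve the document
            scored.append((overlap, doc["id"]))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored]


managers = dbms_lookup(employees, "position", "manager")
hits = ir_search(documents, "artificial retrieval")
```

Here `dbms_lookup` returns nothing unless the stored value matches exactly, whereas `ir_search` returns both documents for the query "artificial retrieval" because each contains one of the two query words.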
Table 17.1: Difference Between IRS and DBMS

| Information Retrieval Systems | Database Management Systems |
|---|---|
| Designed to deal with unstructured data | Deals with structured data |
| An item may be retrieved if it exactly or partially matches the query | An item will be retrieved only when it exactly matches a query |
| Queries are usually language-based, e.g. a keyword, an author name, etc. | Queries are mostly value-based, e.g. salary or date of birth of a person |
| Vocabulary is very important and usually some vocabulary control tools are used | No vocabulary control tool is required |
| A number of advanced search techniques are used, for example proximity search | Exact match of search term and field value is expected |

Self Check Exercises

3) What is a database? Give examples of three bibliographic databases.
4) Discuss two major differences between a DBMS and an IRS.

Note: i) Write your answers in the space given below.
ii) Check your answers with the answers given at the end of the Unit.

...............................................................................................................
...............................................................................................................

a) a writer presents a set of ideas in a document using a set of concepts;
b) somewhere there will be some users who require the ideas but may not be able to identify those. In other words, there will be some persons who lack the ideas put forth by the author in his/her work; and
c) information retrieval systems serve to match the writer’s ideas expressed in the document with the users’ requirements or demands for those.

Thus, an information retrieval system serves as a bridge between the world of creators or generators of information and the users of that information. The information resources are processed, indexed and stored in an appropriate way. The users interact with the system through a user interface. The user queries, submitted through the interface, are matched with the index and the matching items are retrieved. A number of activities are involved in the processes of information processing, indexing and matching. These will be discussed later in this Unit.

As shown in Figure 17.1, the major functions of an information retrieval system can be divided into two categories: (a) organisation and representation of information, and (b) retrieval of information. In the organisation part, although the specific techniques vary from one information retrieval system to another, the basic task is to create an index, called the inverted index, of the potential search terms (keywords and phrases). The index terms may be assigned by a human indexer or by an automatic process, or may be derived automatically from the document texts based on some selection criteria.

[Figure 17.1: diagram of an information retrieval system. Legible labels: Information Resources and Sources; Organized Information, Index File(s); Users; Query; Retrieved (search output). The remaining labels are truncated in the source.]

In a typical information retrieval environment, the users’ queries are not matched with the documents per se; instead, they are matched with an index file. The actual documents are stored in a separate sequence, and once a match is found between an index term and a user search term, the pointer from the index file is followed to retrieve the document.

The elementary units of a text retrieval system are the document records. Each document record comprises a number of fields and subfields, each one of which contains a particular unit of information – author’s name, publisher’s name, title, keyword(s), class number, ISBN, and so on. The document record may also contain an abstract or full text of the document concerned. A text retrieval system is designed to provide fast access to the records through any of the sought keys or access points. This means that there should be a mechanism for fast access to the document records. What should the basic mechanism be for accessing the document records through some key values – by chosen keyword(s), or by author’s name, say? To answer this question, we should first understand how document records are physically stored in the computer.
Document records are stored one after another in the computer memory: this is actually the virtual structure of the database file. Imagine a text database that stores a few, say ten, document records. Now, suppose a user wants to check if there is a document in the database that is written by G.G. Chowdhury; another user wants to know if there is a book on the Internet. What would the approach be? The simplest way would be to open each document record one after another and to check each and every field; if there is a match, then the system retrieves that document. This process continues until all the document records have been checked. It may be a very simple approach, but one can very well imagine that this will be an extremely slow process, even for a fast computer, when the text database is relatively large, and will be an impossible proposition for a database that has some hundreds or thousands of document records or more.

What is the solution then? How can we retrieve the desired document records? Let’s take a common example. What do we do when we want to locate a particular term or phrase, say the word ‘computer’, or the phrase ‘information retrieval’, in a book? Do we start from the first line on the first page and continue up to the last line on the last page of the book? No: we use a simple tool – the back-of-the-book index. What is such an index? It is a simple alphabetical list of all the potential index terms, drawn from the text of the book, each having a pointer showing the occurrence(s) of the term. Thus, we look into the index with the required search term, locate it and then move to the page(s) indicated for the actual information. A similar approach is taken in a text retrieval system: an index file is created that contains all the potential index terms arranged in an appropriate order. This index file is called an inverted file. Users looking for particular information are required to consult the inverted file first, which then leads to the main database where the document records are stored.

Besides the inverted file, two more file structures exist for representation of and access to information. These are the sequential file and the indexed sequential file.

Sequential File: In a sequential file the records are arranged in order of a key field, and the computer can use a searching technique, like a binary search, to access a specific record. A sequential file is designed for efficient processing of records in sorted order on some search key. In this file structure, records are chained together by pointers to permit fast retrieval in search key order. Each pointer points to the next record in order. Records are stored physically in search key order (or as close to this as possible). This minimises the number of block accesses.

Indexed Sequential File: An indexed sequential file is a type of file access in which an index is used to obtain the address of the block containing the required record. In indexed sequential files each record of a file has a key field which uniquely identifies that record. The file has an index consisting of keys and addresses. Indexed sequential files are important for applications where data needs to be accessed either sequentially or randomly using the index.

Example: A library may store details about its users as an indexed sequential file. Sometimes the file is accessed sequentially, for example when the whole of the file is processed to produce overdue statistics at the end of the month. At other times it is accessed randomly, for example when a user changes address, or changes surname after marriage. An indexed sequential file can only be stored on a random access device, e.g., magnetic disc, compact disc (CD).

17.5.1 Inverted File

In an inverted file system of text retrieval, each database consists of two files. One is the text file, which contains what we would expect to find, that is, the document records in their normal form – the form in which they are entered into the database. The other is the inverted file, which contains all the index terms, drawn automatically from the document records according to the indexing technique adopted for the purpose. Each index term in the inverted file is associated with a pointer which shows the record number in which the index term occurs.

The indexing technique, i.e., the technique adopted to draw index terms from the records, determines the order in which index terms will appear in the index file. Different techniques may be required for the purpose: for example, index entries may be required for:

• each and every term occurring in a given field, for example, all the words occurring in the title field. However, there is a risk: some unwanted terms, like ‘a’, ‘an’, ‘and’, ‘the’, etc., occurring in the document titles may also be indexed. To avoid this problem, text retrieval systems usually incorporate a stop-word file which prevents unwanted terms from being indexed;
• the whole field as it is, for example, the full title as it occurs in the document record;
• each occurrence of a repeatable field, for example, names of all the authors;
• some selected words or phrases from a field or subfield, for example, some terms and phrases occurring in the title field, etc.

Thus, for each significant index term in the database the inverted file contains an entry along with a reference list which specifies the position(s) in the database where that term appears. Therefore, in an inverted file system, the searcher first consults the index file, which then refers to the position in the main text database where the desired record appears. The inverted file system is, thus, an example of indirect file access. If the terms are arranged alphabetically in the inverted file, then the file represents an example of indirect sequential file organization. Figure 17.3 presents a very simple example of such an inverted file, which will help us understand the basic concept of an inverted file. However, an inverted file may contain a lot of other information along with each entry, such as the number of occurrences of the term in a given record, or position information, such as the field in which the term/phrase occurs, where the term/phrase occurs in a given sentence/paragraph, and so on. As shown in Figure 17.3, index entries are drawn from all four sample document records for the author, title, publisher, and keyword fields. Titles have been indexed as they are, while each occurrence of the author and the keyword field in the document records has been indexed.

Document records

Document no: 1
Author: Cunningham, M.
Title: File structure and design
Publisher: Chartwell-Bratt
Year: 1985
Keywords: File structure; File organization

Document no: 2
Author: Tharp, A.
Title: File organization and processing
Publisher: John Wiley
Year: 1988
Keywords: File structure; File organization

Document no: 3
Author: Ford, N.
Title: Expert systems and artificial intelligence
Publisher: Library Association
Year: 1991
Keywords: Expert systems; Artificial intelligence; Knowledge-based systems

Document no: 4
Author: Charniak, E.; McDermott, D.
Title: Introduction to artificial intelligence
Publisher: Addison-Wesley
Year: 1985
Keywords: Artificial intelligence; Expert systems

Fig. 17.2: Sample document records
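The construction of an inverted file from the sample records of Fig. 17.2 can be sketched in Python as follows. This is a minimal illustration, not the method of any particular retrieval package. The assumptions are: the field tags (20 author, 30 title, 40 publisher, 60 keywords) are inferred from the entries in Fig. 17.3; the keywords of document 4 are taken from the index entries in Fig. 17.3; and each posting is reduced to a hypothetical triple of document number, field tag and occurrence within the field.

```python
from collections import defaultdict

# The four sample records of Fig. 17.2 (keywords of record 4 as implied by Fig. 17.3).
RECORDS = [
    {"no": 1, "author": ["Cunningham, M."], "title": "File structure and design",
     "publisher": "Chartwell-Bratt",
     "keywords": ["File structure", "File organization"]},
    {"no": 2, "author": ["Tharp, A."], "title": "File organization and processing",
     "publisher": "John Wiley",
     "keywords": ["File structure", "File organization"]},
    {"no": 3, "author": ["Ford, N."],
     "title": "Expert systems and artificial intelligence",
     "publisher": "Library Association",
     "keywords": ["Expert systems", "Artificial intelligence",
                  "Knowledge-based systems"]},
    {"no": 4, "author": ["Charniak, E.", "McDermott, D."],
     "title": "Introduction to artificial intelligence",
     "publisher": "Addison-Wesley",
     "keywords": ["Artificial intelligence", "Expert systems"]},
]

# Field tags inferred from Fig. 17.3 (an assumption, not stated in the text).
FIELD_TAGS = {"author": 20, "title": 30, "publisher": 40, "keywords": 60}


def build_inverted_file(records):
    """Map each index term to a list of (document no, field tag, occurrence)."""
    index = defaultdict(list)
    for rec in records:
        for field, tag in FIELD_TAGS.items():
            value = rec[field]
            # Titles and publishers are indexed whole; repeatable fields
            # (authors, keywords) are indexed once per occurrence.
            values = value if isinstance(value, list) else [value]
            for occurrence, term in enumerate(values, start=1):
                index[term.lower()].append((rec["no"], tag, occurrence))
    # Alphabetical order of terms, as in a printed index.
    return dict(sorted(index.items()))


inverted_file = build_inverted_file(RECORDS)
for term, postings in inverted_file.items():
    print(term, postings)
```

Looking up `inverted_file["expert systems"]` gives the postings for documents 3 and 4, which is the pointer list a searcher would follow to the main text file. The third column shown in Fig. 17.3 is not modelled here because its meaning is not stated in the figure.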
Index file

4  40  1  1  Addison-Wesley
3  60  1  2  Artificial Intelligence
4  60  1  1  Artificial Intelligence
4  20  1  1  Charniak, E.
1  40  1  1  Chartwell-Bratt
1  20  1  1  Cunningham, M.
3  60  1  1  Expert Systems
4  60  1  2  Expert Systems
3  30  1  1  Expert Systems and Artificial Intelligence
1  60  1  2  File Organization
2  60  1  2  File Organization
2  30  1  1  File Organization and Processing
1  60  1  1  File Structure
2  60  1  1  File Structure
1  30  1  1  File Structure and Design
4  30  1  1  Introduction to Artificial Intelligence
3  60  1  3  Knowledge-based Systems
3  40  1  1  Library Association
4  20  1  2  McDermott, D.
2  20  1  1  Tharp, A.

Fig. 17.3: Sample inverted index file

The field tag is used to denote the field where the given term/phrase occurs. This information is used in field-specific searches (discussed in Unit 19). Similarly, the position information is used for proximity or adjacency searching (discussed in Unit 19). Other types of information may also be stored along with each entry, and each such item of information facilitates a particular type of search. Nevertheless, the more such information is added to each entry, the bulkier the inverted file becomes, and it therefore takes more storage space and processing time. In this example, a user looking for the term ‘expert systems’ will retrieve two records, document numbers 3 and 4, from the database, while another user looking for a book written by ‘Tharp, A.’ will retrieve document number 2. A complex query with search terms combined by Boolean operators will follow the same path. For example, a user with a query ‘expert systems OR file organization’ will retrieve all four document records, while the query ‘artificial intelligence AND knowledge-based systems’ will retrieve document record number 3. In the first example, as the search terms are joined by the logical operator ‘OR’, the system will consult the inverted file for each term and then merge the document numbers retrieved in each case, while in the second case, because the terms are joined by the logical operator ‘AND’, the retrieved document numbers for both terms will be matched to locate the common document numbers, that is, the ones where both terms are present. Figure 17.3 shows that an index term may occur in several document records, and in each case, several items of information, such as its frequency of occurrence, field(s) in which it has occurred, position information, and so on, have to be stored in the index file. Thus, conceptually the structure of an inverted file may look like the one shown in Figure 17.3.

17.5.2 Access to Inverted Files

The user may pose a single key query or a multiple key query. In the former case, the value of a single search key (say the name of the author) is used as the retrieval criterion, whereas in a multiple key search a number of search keys are used (say the name of the author, subject name, date of publication, and so on, as in the query ‘papers written by Salton on information retrieval systems between 1980 and 1990’). For single key searches, the whole file can be maintained in an order according to the value of the given single set of keys. In a telephone directory, for example, users search through the names of subscribers and therefore the names of subscribers are arranged in alphabetical order. File access in multi-key searches is complicated by the fact that it is not possible to order the file simultaneously in accordance with the values of the different search keys. For example, a users’ file in a library can be arranged according to the name of the user, occupation or specialisation, address or department, and so on, and in each case the resulting arrangement of the records within one field will be different from the other.

In the case of a multi-key search, a principal key is to be identified and the file can be ordered in accordance with the values of that key. When the principal key is used as part of a search statement, the subsection of the file corresponding to the given principal key value can then be isolated and subjected to a separate search based on the values of any secondary keys also included in the search query.

A catalogue of a library can be considered as a multi-key file, where the keys are the author, title, publisher, subject, etc. In such a file, the principal key is usually the author, i.e., the file is ordered in accordance with the name (surname) of the authors. From each record in the main file there may be a number of pointers giving access to secondary keys, like publisher, title, etc. A simple file of authors and publishers can be ordered according to the author’s name as the principal key, with a sparse index giving access to a chain of pointers for each publisher name. Documents published by a given publisher can be found by following the pointer chain. Pointer chains can be provided for all secondary keys in addition to the primary keys attached to the records; each given record can be traced through the pointer chain for any of the keys. This type of record organisation is known as a multi-list [Chowdhury, 2004].

Multi-list organisation, however, becomes too time-consuming when each query key is attached to a large number of records. One solution to this might be to use large indexes that provide one pointer for each record exhibiting a given key value. Such an index is called an inverted index or an inverted file. Inverted files are widely used in operational information retrieval situations. The advantage of using inverted files is that such files allow extremely rapid search and retrieval operations, based only on the information provided in the index rather than on data from the main record file.

One important issue for the inverted file system is the size of the index file. If each and every term occurring in the document database is indexed, then the index file will be very large, comparable in size to the main document database. Therefore, in order to facilitate fast searching, we need to have a method that allows fast access to the terms/phrases in the inverted file. In other words, we need to have an efficient file organization technique.

Self Check Exercises

5) What is an inverted file? What role does it play in an information retrieval process?
6) What is the difference between a single key and a multi-key query?

Note: i) Write your answers in the space given below.
ii) Check your answers with the answers given at the end of the Unit.

...............................................................................................................
...............................................................................................................
...............................................................................................................
...............................................................................................................
...............................................................................................................
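The Boolean examples discussed under the inverted file (Section 17.5.1) can be sketched as set operations on posting lists: ‘OR’ merges the document numbers retrieved for each term, and ‘AND’ keeps only the document numbers common to all terms. The sketch below is illustrative only; the postings are copied from the keyword entries of Fig. 17.3, and the function names `search_or` and `search_and` are hypothetical.

```python
# Term -> set of document numbers, copied from the keyword entries of Fig. 17.3.
POSTINGS = {
    "artificial intelligence": {3, 4},
    "expert systems": {3, 4},
    "file organization": {1, 2},
    "file structure": {1, 2},
    "knowledge-based systems": {3},
}


def lookup(term):
    """Consult the inverted file; unknown terms have an empty posting list."""
    return POSTINGS.get(term.lower(), set())


def search_or(*terms):
    """OR: merge the document numbers retrieved for each term."""
    result = set()
    for term in terms:
        result |= lookup(term)
    return sorted(result)


def search_and(*terms):
    """AND: keep only the document numbers present for every term."""
    if not terms:
        return []
    result = set(lookup(terms[0]))
    for term in terms[1:]:
        result &= lookup(term)
    return sorted(result)


print(search_or("expert systems", "file organization"))
print(search_and("artificial intelligence", "knowledge-based systems"))
```

The first query returns all four document numbers and the second returns only document 3, in agreement with the worked example in the text. Note that only the index is consulted; the document records themselves are fetched afterwards by following the retrieved numbers.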
17.6 VOCABULARY CONTROL

Vocabulary control is one of the most important components of an information
retrieval system. As we have noted from its simple model given in Figure 17.1,
an information retrieval system tries to match user queries with the stored
documents (the inverted index file to be precise) and retrieves those that match.
In order to match the contents of the user requirements (the search terms) with
the contents of the stored documents (the index entries), one must follow a
vocabulary that is common to both. In other words, user requirements need to
be translated and put to the retrieval system in the same language (using the
same terms, for example) as was used to express the contents of the document
records. This leads us to the concept of using a standard or controlled vocabulary
in an information retrieval environment.

Indexing may be thought of as a process of labelling items for future reference.
Considerable order can be introduced into the process by standardizing the terms
that are to be used as labels. This standardization is known as vocabulary control,
the systematic selection of preferred terms.

Lancaster [1986] suggests that the process of subject indexing involves two quite
distinct intellectual steps: the ‘conceptual analysis’ of the documents and
‘translation’ of the conceptual analysis into a particular vocabulary. The second
step in any information retrieval environment involves a ‘controlled vocabulary’,
that is, a limited set of terms that must be used to represent the subject matter
of documents. Similarly, the process of preparing the search strategy also involves
two stages: conceptual analysis and translation. The first step involves an analysis
of the request (submitted by the user) to determine what the user is really looking
for, and the second step involves translation of the conceptual analysis into the
vocabulary of the system.

There are two major objectives of vocabulary control in an information retrieval
environment:

a) to promote the consistent representation of subject matter by indexers and
   searchers, thereby avoiding the dispersion of related materials. This is achieved
   through the control (merging) of synonymous and nearly synonymous expressions
   and by distinguishing among homographs; and
b) to facilitate the conduct of a comprehensive search on some topic by linking together
   terms whose meanings are related.

Lancaster [1986] adds that indexing tends to be more consistent when the
vocabulary used is controlled, because indexers are more likely to agree on the
terms needed to describe a particular topic if they are selected from a pre-established
list than when given a free hand to use any terms they wish. Similarly,
from the searcher’s point of view, it is easier to identify the terms appropriate
to information needs if these terms must be selected from a definitive list. Thus
a controlled vocabulary tends to match the language of indexers and searchers. The
various aspects of vocabulary control have been discussed in Unit 2 of this course.

A number of vocabulary control tools have been designed over the years. They
differ in their structure and design features, but they all have the same purpose
in an information retrieval environment. A number of software packages are now
available that allow the record creator to automatically switch to one or more
chosen online vocabulary control tools in order to select appropriate terms for
representing the document in hand. For example, OCLC’s Connexion (an
integrated cataloguing suite of tools) and OCLC’s CatExpress (a simple copy
cataloguing suite of tools) provide such a facility. This helps in a number of ways
– the document records not only contain a number of terms that are
representative of the contents of the document, but these terms are also standardized
(in terms of their usage, spelling, form, and so on) and are likely to be chosen
by the user for searching purposes. Similarly, there are programs available by
which end-users may go to the appropriate page of a particular online vocabulary
control tool in order to choose the most appropriate term(s) for preparing the
search expression. Vocabulary control tools also help end-users modify their
previously formulated search expressions by either widening or narrowing down
the search expressions.

17.6.1 Vocabulary Control Tools

As the name suggests, these are the tools used to control the vocabulary of
indexing and retrieval. What an indexer and an index user need is a set of
guidelines for the proper selection of terms. Syntactic structures are devices that
provide these guidelines by showing the relationships among terms or concepts,
and they fall into two major categories: (i) classification schemes, and (ii) subject
heading lists and thesauri. A combination of the two categories has also been
developed.

Classification schemes, being tools for organising knowledge, could be of great
help for vocabulary control, but the main body of classification schemes is organised
in an artificial language (called notations, which may contain numbers, alphabets,
punctuation marks, or a combination of them) whereas for vocabulary control we
need natural language representation. Indexes to classification schemes could
serve the role of vocabulary control, but here terms appear alphabetically and
thus the logical (semantic) organisation of knowledge is not available. Some
attempts have been made to combine the features of the main arrangement in
classification schemes with those that appear in the index to the classification
scheme to generate some kind of faceted or classified thesaurus such as
thesaurofacet. Further discussion of these tools is available in Units 2 & 3.

Subject heading lists were initially developed to prepare entries/headings in a
subject catalogue that could replicate the classified arrangement of document
records. Therefore, they include rather broader subject terms or headings. On
the other hand, thesauri have been developed for specific subject fields with a
view to bringing together the various representations of terms (synonyms, spelling
variants, homonyms, etc.) along with an indication of a mapping of that term in
the universe of knowledge by indicating the broader (superordinate), narrower
(subordinate), and related (coordinate and collateral) terms. However, this
distinction has gradually faded, and the latest Library of Congress subject headings
list indicates the terms’ features as shown in normal thesauri.

17.6.2 Controlled vs. Natural Language Indexing

Controlled indexing languages are those in which both the terms that are used
to represent subjects and the process whereby terms are assigned to particular
documents are controlled or executed by a person. Normally there is a list of
terms (a subject headings list or a thesaurus) that acts as the authority list in
identifying terms that may be assigned to documents, and indexing involves the
assignment of terms from this list to specific documents. The searcher is expected
to consult the same controlled list during formulation of a search strategy. In
natural language indexing, any term that appears in the title, abstract or text of
a document record may be an index term. There is no mechanism to control the
use of terms for such indexing. Similarly, the searcher is not expected to use any
controlled list of terms.

Whether to use a controlled vocabulary or to use natural language indexing has
been an age-old debate in information retrieval. The major debates in natural
versus controlled vocabulary indexing are shown in Table 17.2 [Rowley, 1994;
Svenonius, 1986].
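The difference between controlled and natural language indexing described in Section 17.6.2 can be made concrete with a small sketch. Everything in it is an assumption made for illustration: the miniature thesaurus, its terms and the function names (preferred, controlled_index, natural_index, expand) are hypothetical and are not taken from LCSH or any published vocabulary control tool.

```python
# Hypothetical miniature thesaurus; all terms and relationships are invented
# for illustration only.
USE = {                      # non-preferred entry term -> preferred term
    "automobiles": "cars",
    "motor cars": "cars",
    "colour": "color",
}
NARROWER = {                 # preferred term -> narrower (subordinate) terms
    "vehicles": ["cars", "bicycles"],
}


def preferred(term):
    """Translate a term into the controlled vocabulary (unchanged if not listed)."""
    term = term.lower().strip()
    return USE.get(term, term)


def controlled_index(assigned_terms):
    """Controlled indexing: terms are translated through the authority list."""
    return {preferred(term) for term in assigned_terms}


def natural_index(text):
    """Natural language indexing: every word of the text may be an index term."""
    return set(text.lower().split())


def expand(term):
    """Widen a search by adding the narrower terms of the preferred term."""
    head = preferred(term)
    return {head, *NARROWER.get(head, [])}


doc_terms = controlled_index(["Automobiles"])   # what the indexer stores
query_term = preferred("motor cars")            # what the searcher submits
```

Because indexer and searcher both pass through the same list, the document indexed under ‘Automobiles’ and the query ‘motor cars’ meet at the preferred term, whereas under natural language indexing they would match only if the same word happened to be used on both sides.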
Table 17.2: The Four Eras of Debate on Controlled Vs. Natural Language Indexing

Era One   – controlled vocabulary such as
Era Two   – comparisons of natural and controlled language: major experimental studies
            noted that natural language can perform as well as controlled vocabulary, but
            other factors, such as the number of access points, are also significant.
Era Three – many case studies of limited generalizability. Searching online databases was
            considered. It was noted that the best performance can be achieved by a
            combination of controlled and natural language; the number of access points
            was reaffirmed to have a significant effect; full-text and bibliographic databases
            were noted to have produced different results.
Era Four  – new advances in user-based systems including OPACs. The value of controlled
            vocabulary in the context of user-friendly interfaces and the development of
            knowledge bases were noted.

Aitchison and Gilchrist [2000] provide a detailed comparison of natural and
controlled language indexing. The details have been provided in Unit 2 of this
course. However, despite much debate extending over more than a century,
together with a range of research projects, information scientists have failed to
resolve the issue concerning the relative merits and demerits of controlled and
natural language. Evidence from practice and research suggests
that controlled language and natural language may be used in conjunction with
one another.

Self Check Exercises
7) What is the role of a vocabulary control tool in an information retrieval process?
8) What is the difference between a subject heading list and a thesaurus from the
   perspectives of information retrieval?
Note: i) Write your answers in the space given below.
      ii) Check your answers with the answers given at the end of the Unit.
...............................................................................................................
...............................................................................................................
...............................................................................................................
...............................................................................................................
...............................................................................................................
...............................................................................................................

17.7 SEARCHING

While searching for information in a database, users may approach with some
keys. For a bibliographic database, such keys can be author name, title, ISBN,
subject keywords, etc. In a non-bibliographic database, these keys will depend
on the nature of the database concerned.

In a bibliographic information retrieval environment, searches can be divided into
two main classes: known item search and unknown item search. A known item
search is one where the user knows something about the item being sought.
This may be any key like author, title, publisher, ISBN, and so on. In such a case,
the user can enter the appropriate key and get the full details of the item
concerned. For example, the user can enter the author name to retrieve the full
record. However, very few users actually know the author, title, etc., of
the item that they might need at a given instance. Consequently most
searches are unknown item searches. An unknown item search is one where
users are not aware of the existence of any document that may solve their
problems. In other words, users do not know whether or not any item exists
that can meet their information requirements. For various aspects of search
strategy, methods and techniques, you may refer to Unit 19 of this course.

17.7.1 Exact Match Search

In exact match search, the search engine will only match query terms exactly;
it does not allow for truncation, wildcards, or stemming. An exact match option is
nowadays available in Internet-based databases to retrieve more relevant
information. Phrase search can be characterized as exact match search: a phrase
is given in the search query, and only records containing the whole phrase are retrieved.

17.7.2 Best Match Search

In best match search, the search engine will match query terms closely, if not
exactly. It may allow for truncation, wildcards, or stemming. A best match search
is performed when an exact match search does not fetch a sufficient amount of relevant
information. One way of speeding up a best match search, described for
self-organizing maps, is to construct a tree-structured self-organizing map
in which each level of the tree consists of a separate, progressively larger
self-organizing map. The search for the best match then proceeds level by level,
each time restricting the search to a subset of units that is governed by the
location of the best match in the previous, smaller level. The map is taught one
level at a time, starting from the smallest level. The best match search can be
done even more quickly if the data set is relatively small: the location of the best
match in the previous level can be tabulated for each input sample.

17.7.3 Partial Match Search

In regular expression matching, a partial match is one that matches one or more
characters at the end of the text input but does not match all of the regular
expression, although it might have done so had more input been available.
Partial matches are typically used when validating data input (checking each
character as it is entered on the keyboard), or when searching texts that are
either too long to load into memory (or even into a memory-mapped file) or
are of indeterminate length, for example when the source is a socket. Some
information retrieval systems perform partial match search.

17.8 INFORMATION SEEKING AND USER INTERFACES

The user interface forms an important component of an information retrieval
system since it connects the users to the organised information resources. A user
interface is the means by which information is transferred between the user and
the computer and vice-versa. Well-designed user interfaces should allow the
users to better find and fully use the information that the information system
provides access to. In fact a good user interface greatly enhances the quality of
interactions with information systems [Chowdhury, 2004].

User interfaces basically perform two major functions: (a) they allow users to
search or browse an information collection, and (b) they display the results of a
search, and often allow users to perform further tasks, like sorting, saving and/or
printing the search results, modifying the search query, and so on. The user
interface therefore is the most important component of an information retrieval
system that a user can see and interact with. The success of an information
retrieval system depends significantly on the design and usefulness of the user
interface. Hence a significant amount of research has taken place in the past few
decades on the design, use and evaluation of user interfaces to various kinds of
information retrieval systems.
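Returning to the kinds of matching described in Sections 17.7.1 and 17.7.2, the following sketch contrasts exact matching with the looser devices (right truncation, wildcards and stemming) that a best match search may allow. It is a minimal sketch under stated assumptions: the list of index terms is invented, the function names are hypothetical, and crude_stem is a deliberately naive suffix stripper written for this illustration, not a published stemming algorithm.

```python
import fnmatch

# Hypothetical index terms, for illustration only.
INDEX_TERMS = ["retrieval", "retrieve", "retrieved", "retriever", "index", "indexing"]


def exact_match(query, terms):
    """Exact match: the query term must equal the index term."""
    return [t for t in terms if t == query]


def right_truncation(stem, terms):
    """Right truncation: any index term beginning with the given stem."""
    return [t for t in terms if t.startswith(stem)]


def wildcard(pattern, terms):
    """Wildcards: '?' stands for one character, '*' for any run of characters."""
    return [t for t in terms if fnmatch.fnmatchcase(t, pattern)]


def crude_stem(word):
    """Naive suffix stripping (an assumption of this sketch, not a real stemmer)."""
    for suffix in ("ing", "ed", "al", "er", "e", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def stem_match(query, terms):
    """Stemming: terms match when they reduce to the same stem as the query."""
    target = crude_stem(query)
    return [t for t in terms if crude_stem(t) == target]
```

With these definitions, exact_match("retrieve", INDEX_TERMS) finds only ‘retrieve’, whereas stem_match("retrieving", INDEX_TERMS) also brings in ‘retrieval’, ‘retrieved’ and ‘retriever’, which is the kind of widening a best match search relies on.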
17.8.1 Information Need and Information Seeking

The user is the focal point of all information retrieval systems because the sole
objective of any information storage and retrieval system is to transfer information
from the source (the database) to the user. Information need is often a vague
concept. It is often a result of some unresolved problem(s). It may arise when
an individual recognizes that his/her current state of knowledge is insufficient to
cope with the task in hand, or to resolve conflicts in a subject area, or to fill a
void in some area of knowledge. Information needed by the user to accomplish
a goal – to resolve a problem, to answer a specific question, or to satisfy a
curiosity – may vary from quick and brief information to the most exhaustive
and detailed information.

Figure 17.4 shows a simple model of information access. Although it appears to
be a very simple model, in essence several complex processes take place
throughout the process. Some of these processes are technological and are
related to the information retrieval system, user interfaces, etc. Other processes
relate to the nature and characteristics of the content as well as the user
concerned. The process may take more or less time, and may become simple or
complex depending on the nature of the users – their cognitive abilities, background,
specific nature of the information need, and so on.

[Figure 17.4: A simple model of information access. Box labels as far as they are
legible in the source: ‘Information n…’, ‘Query Formul…’, ‘Submit query to th…’,
‘Receive search r…’, ‘Study search re…’]

Self Check Exercises
9) What is a user interface?
10) What are the two major functions of the user interface in an information retrieval
    system?
Note: i) Write your answers in the space given below.
      ii) Check your answers with the answers given at the end of the Unit.
...............................................................................................................
...............................................................................................................
...............................................................................................................
...............................................................................................................
...............................................................................................................
...............................................................................................................

17.9 FEATURES OF INFORMATION RETRIEVAL SYSTEMS

Based on accessibility of information, two broad categories of information retrieval
systems can be identified: in-house and online. In-house information retrieval
systems are set up by a particular library or information centre to serve mainly
the users within the organization. One particular type of in-house database is the
library catalogue. Online public access catalogues (OPACs) provide facilities for
library users to carry out online catalogue searches, and then to check the
availability of the item required.

By online information retrieval systems we mean those that have been designed
to provide access to remote database(s) to a variety of users. Such services are
available mostly on a commercial basis, and there are a number of vendors that
handle this sort of service. With the development of optical storage technology,
another type of information retrieval system appeared on CD-ROM (compact
disc read-only memory). Information retrieval systems based on CD-ROM
technology are available mostly on a commercial basis, though there have been
some free and in-house developments too. Basic techniques for search and
retrieval of information from the in-house or CD-ROM and online information
retrieval systems are more or less the same, except that the online system is
linked to users at a distance through the electronic communication network.
17.9.2 Information Retrieval Features of Online Search Services

Traditional online information search systems that began about four decades ago
were designed to provide access to remote databases, often through a database
vendor or service provider. These systems were expensive to use. They were
not quite suitable for searching directly by the end-users, and in most cases were
used by information intermediaries on behalf of, or in cooperation with, the
end-users. Online search services have been provided by database producers, but
more commonly by service providers or vendors like Dialog, Ovid, etc. The
major characteristics of this type of online information retrieval system are as
follows:

• users get access to remote databases that are often many in number and large in
  size;
• many databases can be searched using a single search interface;
• database records mainly contain bibliographic details of records with abstracts,
  and sometimes with additional information, such as citations, etc.; only some
  databases contain full text information;
• service providers have their own search interface with good search and retrieval
  capabilities;
• users need to register with the service providers;
• users are charged for searching as well as for the content; and
• modern online service providers have web interfaces with good search features
  and hyperlinked records/information.

Although each online search service provider, such as Dialog, Ovid, STN, etc.,
has its own proprietary retrieval engine and user interface, the commonly available
search and retrieval features are as follows [Chowdhury and Chowdhury, 2001a]:

• Users can select one or more databases to search;
• Novice and expert search modes are available;

• OPACs allow users to search for the bibliographic records contained within a
  library’s collection;
• Nowadays, some OPACs also provide access to electronic resources and
  databases, in addition to the typical bibliographic records;
• Searches take place on the metadata of the records in the library’s collection;
• Sometimes users can search more than one collection (within the same library or
  in different libraries);
• They have a relatively simple search interface; and
• OPACs are nowadays available through the web.

Although each OPAC has a search interface and retrieval engine that is proprietary
to the company providing the software for the purpose, the following information
retrieval features are commonly available in OPACs:

• Browse and search facilities;
• Keyword and phrase search facilities;
• Indexers assign subject headings to the records by using a subject heading list like
  LCSH (Library of Congress Subject Headings), and users can search by these
  assigned subject headings;
• Boolean search is usually limited to the keyword search option; in other words, only
  keywords can be combined with Boolean operators;
• Proximity search is also limited to the keyword search option;
• Search results are usually not ranked;
• Records can be searched through selected keys – author, title, ISBN, call number, etc.;
  these are searched as phrases, and are usually automatically right truncated; and
• Some searches can be limited by date, collection, language, etc.
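Two of the OPAC features listed above, searching a key as an automatically right-truncated phrase and proximity search on keywords, can be sketched as follows. This is only a rough illustration under stated assumptions: the author headings are sample data, the function names search_key and near are hypothetical, and real OPAC software implements these features in its own proprietary way.

```python
import bisect

# Sample author headings, kept sorted as in a browse index (illustrative data).
AUTHOR_HEADINGS = sorted([
    "chowdhury, g. g.",
    "chowdhury, sudatta",
    "lancaster, f. w.",
    "rowley, jennifer",
    "salton, gerard",
])


def search_key(prefix, headings):
    """Search a key as a phrase with automatic right truncation."""
    prefix = prefix.lower()
    start = bisect.bisect_left(headings, prefix)   # first heading >= prefix
    matches = []
    for heading in headings[start:]:
        if not heading.startswith(prefix):
            break                                  # sorted list: no later match possible
        matches.append(heading)
    return matches


def near(text, first, second, distance=3):
    """Proximity search: do the two keywords occur within `distance` words?"""
    words = text.lower().split()
    first_pos = [i for i, w in enumerate(words) if w == first]
    second_pos = [i for i, w in enumerate(words) if w == second]
    return any(abs(i - j) <= distance for i in first_pos for j in second_pos)
```

For example, search_key("Chowdhury", AUTHOR_HEADINGS) returns both headings beginning with that surname, because the key is treated as a phrase truncated on the right.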
17.9.4 Information Retrieval Features of e-Journal Services

Electronic journals, or e-journals, form a very important part of the collection of
today’s libraries. Nowadays there are two major categories of e-journals: those
that have printed counterparts, for example, the Journal of Documentation,
and those that are available only in electronic format, for example, D-Lib
Magazine. Access to electronic journals is provided either by the publishers themselves
or by aggregators. The benefits of getting access to an individual publisher’s journals
are value-added features and the absence of intermediaries. Aggregators, on the
other hand, bring together the journals of several publishers under one interface and
search system.

Each publisher and aggregator of e-journals has a proprietary retrieval engine
and search interface that can be used to search one or more e-journals. Common
information retrieval features of e-journals are:

• users can browse each issue or can search the entire collection;
• there are usually novice and expert search modes;
• word and phrase search facilities are available;
• common search facilities include: Boolean search, truncation, field search, limiting
  search and range search;
• searches can be conducted on metadata (author, title, etc.) or on the full texts of
  articles; and
• output is available in one or more formats – HTML, PDF, etc.

17.9.5 Information Retrieval Features of Digital Libraries

Information retrieval services are at the heart of digital libraries [Fox and Urs,
2002]. A digital library can provide access to one or more of the information
resources separately using the search interface of each respective system.
Alternatively, there may be a single search interface to allow users to conduct
searches across all the systems with just one query.

17.9.5.1 Common Features

Based on the study of some selected digital libraries, the following general IR
features of digital libraries are observed [Chowdhury and Chowdhury, 2000;
Meyyappan, Chowdhury and Foo, 2000]:

• Users can access the collections of a digital library by either of two modes: browsing
  and searching;
• While most digital libraries allow users to search the local digital library collections,
  some digital libraries, e.g., NDLTD, provide facilities for federated search or search
  across a number of digital libraries;
• Boolean, proximity and truncation search facilities are the commonly available
  search options in digital libraries, though the operators vary. Some digital
  libraries provide options like ‘also must contain’, ‘or may contain’, ‘but not contain’,
  ‘should contain’, ‘must contain’ and so on, to activate a Boolean search;
• Keyword and phrase search are the common facilities of digital libraries,
  though the techniques for conducting a phrase search differ. In some cases, for
  example in BUILDER, users can enter a phrase in the ‘Phrase Search Box’,
  while in others, for example, in DIGILIB (at University of Queensland, Australia;
  http://www.architect.uq.edu.au/digilib/), NCSTRL (Networked Computer Science
  Technical Reference Library; http://www.ncstrl.org), etc., a search phrase has
  to be entered within double quotes;
• Right truncation and wild card search facilities are common in many digital libraries,
  and a variety of operators, such as ‘%’, ‘*’, ‘@’, and ‘?’ are used for the purpose.
  However, some digital libraries provide specific truncation search facilities. For
  example, in THOMAS and American Memory, the ‘include word variants’ option
  is used for truncation;
• Many digital libraries support proximity search differently. Basically, there are
  two options: one is through the use of proximity operators, but the operators vary,
  e.g., ‘Near’, ‘Nearby’, ‘Sentence’, ‘Paragraph’, and so on;
• Most of the chosen digital libraries allow users to conduct searches on specific
  fields;
• While most digital libraries allow users to specify the maximum number of hits,
  the output is not always ranked, except for a few like NDLTD;
• In some cases, for example, in the ACM Digital Library, users can sort the results using
  some chosen keys; and
• Usually the system comes up with a brief output that can lead to the full records.
  However, in many cases, an output format can be chosen by the user.

17.9.5.2 Special Features

In addition to the common features mentioned above, some digital libraries have
some special information retrieval features. For example,

• ACM Digital Library has some unique search features, such as Stem expansion,
  Fuzzy expansion (spelled like), and sounds like search.
• DeLiver (outcome of a DLI1 project at the University of Illinois) offers some
  unique search features. It allows users to search and view specific parts of the
  article, such as the figures or references. Thus the user can ‘fine tune’ a search and
  get more relevant results.
• GEMS (Nanyang Technological University, Singapore, digital library; presently
  called iGEMS) allows users to set up their own profiles for future searches and for
  obtaining SDI services. It also allows instant opening of a CD-ROM and provides
  access to an online journal or database.
• HEADLINE (a hybrid library project in the UK, http://www.headline.ac.uk) is unique
  in two respects:
  – It automatically creates an information page, called the Subject Page, on the
    subject of interest of the user. The necessary information is gathered from
    the user’s log-in screen.
  – It allows the user to customize the Subject Page to create his/her own subject
    page.
• IEL (IEEE Electronic Library) allows users to choose options to match similar
  subjects, or to search for the latest additions to IEL. The search interface allows users to
  browse and select search terms from the displayed list. Superscript, subscript and
  special characters can be searched.
• New Zealand Digital Library has developed digital library software and makes it
  freely available, i.e., Greenstone Digital Library Software (GSDL).
• NDLTD uses the InfoSeek search engine, and therefore a number of good search
  features are available. Users can search a specific site or can conduct a
  federated search across the digital libraries that are members of the NDLTD
  Federation.
• THOMAS uses a probabilistic information retrieval system called ‘InQuery’.
• The UC Berkeley DL uses Cha-Cha and Cheshire II search systems and has two
  unique features:
  – Natural language search facilities
  – Image retrieval by image content
• The Universal Library (at Carnegie-Mellon University; http://www.ul.cs.cmu.edu/)
  has a unique feature called the hyperbolic tree that
  has a unique visualisation effect, and users search the collection through this
  hyperbolic tree.

17.10 WEB INFORMATION RETRIEVAL SYSTEMS

Web information retrieval is significantly different from the traditional text retrieval
systems. These differences are mainly due to a number of typical characteristics
of the world wide web such as the distributed architecture of the web, the
variety of information available on the web, growth of the web, the distribution
of information as well as the users, and so on. Major characteristics of the web
that make Internet information retrieval different from other information retrieval
systems are discussed below.

17.10.1 Characteristics of Web Information Retrieval

• Distributed nature of the web: The web resources are distributed all over the
  world. Hence complex measures are required to locate, index and retrieve the
  information resources. The fact that the computers that are interconnected have
  different architectures, and the information resources are created using different
  platforms, software and standards, makes the matter more complex. Most text
  retrieval systems deal with a set of information resources that is several times
  smaller in volume compared to the web. In addition, text retrieval systems usually
  deal with a set of documents that have been created using a set of standards –
  hardware, software and processing standards. Although information retrieval in
  the case of OPACs has to deal with distributed information, the problems are tackled
  by use of several standards for processing information, such as the MARC formats.
  No such uniform standard is used for the creation and processing of web information
  resources.
• Size and growth of the web: The web has grown exponentially over the past
  decade. The processes of identifying, indexing and retrieving information become
  more complex as the size of the web, and hence the volume of information on the
  web, increases. Conventional text retrieval systems have to be tested and modified
  to make them suitable for handling the large volume of data on the web.
• Deep vs. the surface web: Information resources on the web can be accessed at
  two different levels. While millions of web information resources can be accessed
  by anyone, a lot of information is accessible either through authorised access
  (information that is password protected, say) or can be generated only by activating
  an appropriate program. Researchers call the former the surface web and the
  latter the deep web, with a note that the deep web is several times larger than the
  surface web.
• Type and format of the documents: While text retrieval systems deal with textual

  it does not change its content, at the most the entire document is removed from the
  system. Keeping track of the changes in the millions of web pages, and making
  necessary changes in the information retrieval system, is a major challenge. Another
  major problem with the web is that the resources (the web pages) often move.
  This information needs to be tracked by the retrieval system in order to facilitate
  proper retrieval.
• Ownership: Information resources that are accessible through the web have
  different access requirements: while some information can be accessed and used
  for free, others require specific permission or access rights, often through payment
  of fees. Identifying the rights to access is a major challenge for web information
  retrieval.
• Distributed users: Most text retrieval systems are designed to meet the information
  needs of a specific user community. Hence text retrieval systems usually have an
  idea of the nature, characteristics, information needs, search behaviour, etc., of the
  target user community. Web information is in sharp contrast with this. Ideally the
  users of an information resource on the web may be anyone, located anywhere in
  the world. This imposes a significant challenge since the designer of a web
  information retrieval system will have no idea about the target users, their nature,
  characteristics, location, information search behaviour, etc.
• Multiple languages: Since the web is distributed all over the world, the language
  of the information resources as well as the users vary significantly. An ideal web
  information retrieval system should be able to retrieve the required information
  irrespective of the language of the query or the source information. This diversity
  of language poses a tremendous challenge for web information retrieval.
• Resource requirements: A massive amount of resources is required to build and
  run an effective and efficient web information retrieval system. The matter is
  worsened by the fact that there is no single body that would fund these resources,
  and yet everyone wants a good information retrieval system for access to web
  information resources.

17.10.2 Information Retrieval Features of Web Search Engines

Search engines are the most commonly used tools for finding information on the
web. Digital libraries usually provide links to one or more search engines to allow
users to search for web information resources. A search engine allows the user
to enter search terms – keywords and/or phrases – that are run against a
database containing information on the web pages collected automatically by
programs called Spiders. At the end of a search session, the search engine
retrieves web pages from its database that match the search terms entered by
the searcher.

There are three main components of a search engine: (1) the Spider, i.e., the
program that automatically collects information about millions of pages on the
web, (2) the Index that stores information collected by the spider on the various
information, the web contains from simple text to multimedia information. Again web pages, and (3) the Search engine software and interface with which the
these information resources appear in a variety of formats thereby making the users interact to conduct a web search [Chowdhury and Chowdhury, 2001b].
task of indexing and retrieval more complex. Search engines can be categorised in a number of ways. Two broad categories
l Quality of information: Since anyone can publish almost anything on the web, it is are: search engines and meta search engines, the later category refers to tools
very difficult to assess the quality of the information resources. As opposed to the that allow users to conduct concurrent searches on more than one search engines.
conventional text retrieval systems that deal with published information resources Some people also categorise search engines based on their characteristics of
which are somewhat quality-controlled, web information retrieval systems have to indexing. For example, Search engines can be categorised as full-text search
deal with both the controlled and uncontrolled sets of information resources. tools, extracting search tools, subject-specific search tools and meta search tools.
SearchEngineWatch.com, the most up-to-date and widely used information
l Frequency of changes: Web pages change quite frequently. This is in sharp contrast
with the input of the conventional text retrieval systems that deal with relatively resource on web search engines, categorises search engines as follows [Sullivan,
static information; once an information resource is added to a text retrieval system, 2004]:
487 488
• The Major Search Engines, e.g., AltaVista, AOL Search, Google, HotBot, etc.
• Kids Search Engines, e.g., Yahooligans, KidsClick, etc.
• News Search Engines, e.g., AltaVista News, Ananova, Yahoo News, etc.
• Specialty Search Engines, e.g., AskJeeves, Allexperts.com, CNETDownload.com, etc.
• Multimedia Search Engines, e.g., AltaVista Photofinder, FAST multimedia search,
Ditto, Napster, Gnutella, etc.
• Search Utilities, e.g., Copernic, LexiBot, SearchWolf, Subject Search Spider, etc.

• Translate: Automatic translation of web pages from selected languages is available.
Some search engines also allow users to enter text in a given language, which can be
instantly translated into another chosen language.
• Family Filter: Can be turned on or off to allow or avoid retrieval of unwanted
materials.

Self Check Exercises

11) What is meant by online information retrieval?
12) How does an OPAC search differ from an online search?
13) What is a web search engine and what role does it play in the context of the web?
14) How does traditional online information retrieval differ from web information
retrieval?

Note: i) Write your answers in the space given below or in a notebook.
ii) Check your answers with the answers given at the end of the Unit.

...............................................................................................................
...............................................................................................................
...............................................................................................................

Lancaster and Warner [2001] provide an excellent review of the applications of expert
systems and related intelligent technologies in different areas of library and
information science. They note that the major applications of intelligent technologies
in the field of library and information science include the following:

• cataloguing
• subject indexing
• collection management
• reference services including:
  - referral of users to appropriate information resources
  - selection of an appropriate database for searching information to meet a specific
    information need