Module 1 - Introduction
Dr.D.SARASWATHI 1
• Introduction to Information Storage and Information Retrieval (IR) –
Definition and Objectives – Functional overview – Relationship to
Database Management Systems – Digital libraries – Data Warehouses
Definition
An Information Retrieval System is a system that is
capable of storage, retrieval, and maintenance of
information.
Web Search
EXCALIBUR VISUAL RETRIEVALWARE FINDS PIX IN A DATABASE - Tech Monitor
• An Information Retrieval System consists of a software
program that facilitates a user in finding the information the
user needs.
What does IR success depend on?
• The overhead of a search: the time required to find the information
needed, excluding the time for actually reading the relevant data.
• Search composition, search execution, and reading non-relevant items
are all part of this overhead.
• The first Information Retrieval Systems
originated with the need to organize information
in central repositories (e.g., libraries) (Hyman-82).
Excite
webseek
Objectives of Information Retrieval Systems
• To minimize the overhead of a user locating needed information.
• In information retrieval the term “relevant” item is
used to represent an item containing the needed
information.
Effects of Search on Total Document Space
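• When a user decides to issue a search looking for information on a topic,
the total database is logically divided into four segments: relevant
retrieved, relevant not retrieved, non-relevant retrieved, and non-relevant
not retrieved.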
• Relevant items are those documents that contain information that helps
the searcher in answering his question.
• Non-relevant items are those items that do not provide any directly useful
information.
• The two major measures commonly associated
with information systems are
1) Precision
2) Recall
Precision and Recall
• Precision is directly affected by the retrieval of non-
relevant items: as more and more of the database is
retrieved, it drops to a number close to zero.
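The two measures can be sketched in a few lines. This is a minimal illustration, not part of the original slides: the function name and the sample item IDs are assumptions. Precision is taken as the fraction of retrieved items that are relevant, and recall as the fraction of relevant items that were retrieved.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single search."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant items that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical database of 1000 items, 4 of which are relevant.
relevant = {1, 2, 3, 4}

# A focused search: 3 items retrieved, 2 of them relevant.
focused = precision_recall({1, 2, 5}, relevant)

# Retrieving the whole database: recall is perfect, precision collapses.
everything = precision_recall(range(1, 1001), relevant)
```

Here `focused` gives a precision of 2/3 and a recall of 0.5, while `everything` gives a recall of 1.0 but a precision of only 0.004, which is the "close to zero" behaviour described above.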
Ideal Precision/Recall Graph
Achievable Precision/Recall Graph
• Information Retrieval Systems such as RetrievalWare, TOPIC,
AltaVista, Infoseek and INQUERY show that accepting
natural language queries is becoming a standard system
feature.
• This allows users to state in natural language what they are
interested in finding.
• But the completeness of the user specification is limited by
the user’s willingness to construct long natural language
queries.
• Most users on the Internet enter one or two search terms.
Vocabulary Domains
Functional Overview
• A total Information Storage and Retrieval System is
composed of four major functional processes:
1) Item Normalization
2) Selective Dissemination of Information (i.e., “Mail”)
3) Archival Document Database Search
4) Index Database Search, along with the Automatic File Build
process that supports Index Files.
1) Item Normalization
• The first step in any integrated system is to normalize
the incoming items to a standard format.
• Item normalization provides logical restructuring of the
item.
• Additional operations during item normalization are
needed to create a searchable data structure:
• identification of processing tokens (e.g., words),
• characterization of the tokens, and
• stemming (e.g., removing word endings) of the tokens.
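The stemming step can be illustrated with a deliberately naive suffix stripper. This is only a sketch under stated assumptions: the suffix list and the minimum stem length are invented for illustration, and real systems use far more careful algorithms.

```python
# Assumed, deliberately tiny suffix list; longer suffixes are tried first.
SUFFIXES = ("ing", "ed", "es", "s")


def naive_stem(token):
    """Lower-case a token and strip one common English word ending."""
    token = token.lower()
    for suffix in SUFFIXES:
        # Keep at least three characters so short words are left alone.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


stems = [naive_stem(t) for t in ["Searching", "searched", "searches", "search"]]
```

All four variants reduce to the same processing token, `search`, so a query on any one form can match items containing the others.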
Total Information Retrieval
Text Normalization Process
• The processing tokens and their characterization
are used to define the searchable text from the
total received text.
• Standardizing the input takes the different external
formats of input data and performs the translation
to the formats acceptable to the system.
• A system may have a single format for all items or
allow multiple formats.
• One example of standardization could be translation of
foreign languages into Unicode.
• Every language has a different internal binary encoding
for the characters in the language. One standard
encoding that covers English, French, Spanish, etc. is
ISO-Latin.
• To assist the users in generating indexes, especially the
professional indexers, the system provides a process
called Automatic File Build (AFB).
• Multi-media adds an extra dimension to the normalization
process.
• In addition to normalizing the textual input, the multi-media
input also needs to be standardized.
• Higher quality video: MPEG-2, MPEG-1, AVI
• Lower quality video: Real Media
• Audio: WAV or Real Media (Real Audio)
• Images: formats ranging from JPEG to BMP
• The next process is to parse the item into logical
sub-divisions that have meaning to the user.
• This process, called “Zoning,” is visible to the
user and used to increase the precision of a
search and optimize the display.
Zoning
• A typical item is subdivided into zones, which
may overlap and can be hierarchical, such as
Title, Author, Abstract, Main Text, Conclusion,
and References.
• The zoning information is passed to the
processing token identification operation to
store the information, allowing searches to be
restricted to a specific zone.
• For example, if the user is interested in
articles discussing “Einstein” then the
search should not include the Bibliography,
which could include references to articles
written by “Einstein.”
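A zone-restricted search can be sketched as follows. The item layout (a dictionary of zone name to text), the function name, and both sample documents are hypothetical and only illustrate the idea.

```python
def search_zones(items, term, zones):
    """Return IDs of items where any of the given zones contains the term."""
    term = term.lower()
    return [
        item_id
        for item_id, item in items.items()
        if any(term in item.get(zone, "").lower() for zone in zones)
    ]


# Hypothetical items, already split into zones.
items = {
    "doc1": {
        "Title": "A study of Einstein",
        "Main Text": "Einstein's work on relativity.",
        "References": "",
    },
    "doc2": {
        "Title": "Quantum notes",
        "Main Text": "An overview of quantum theory.",
        "References": "Einstein, A. (hypothetical entry)",
    },
}

all_zones = search_zones(items, "Einstein", ["Title", "Main Text", "References"])
no_refs = search_zones(items, "Einstein", ["Title", "Main Text"])
```

Searching every zone returns both documents, while excluding the References zone returns only `doc1`, the item that actually discusses Einstein. This is how zoning raises precision.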
• Systems determine words by dividing input symbols
into three classes: valid word symbols, inter-word
symbols, and special processing symbols.
Word
• A word is defined as a contiguous set of word
symbols bounded by inter-word symbols.
Examples
• word symbols - alphabetic characters and
numbers
• possible inter-word symbols - blanks, periods,
semicolons, and apostrophes
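The definition of a word can be sketched with a regular expression. As a simplifying assumption, this example treats ASCII letters and digits as word symbols and every other character (blanks, periods, semicolons, apostrophes, and so on) as an inter-word symbol; the function name and sample sentence are illustrative.

```python
import re

# Assumption: word symbols are ASCII letters and digits; anything else
# is treated as an inter-word symbol that ends the current word.
WORD = re.compile(r"[A-Za-z0-9]+")


def tokenize(text):
    """Return the contiguous runs of word symbols found in text."""
    return WORD.findall(text)


tokens = tokenize("Einstein's theory; published 1905.")
```

The result is `["Einstein", "s", "theory", "published", "1905"]`. Note how treating the apostrophe as an inter-word symbol splits the possessive into two tokens, which is why the choice of inter-word symbols depends on the language domain.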
Stop List/Algorithm
• Applied to the list of potential processing tokens.
• The objective of the Stop function is to save system
resources by eliminating from the set of searchable
processing tokens those that have little value to the
system.
• Given the significant increase in available cheap
memory, storage and processing power, the need to
apply the Stop function to processing tokens is
decreasing.
Examples of Stop algorithms
• Stop all numbers greater than “999999” (this was
selected to allow dates to be searchable).
• Stop any processing token that has numbers and
characters intermixed.
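The two example Stop algorithms can be expressed as a single predicate. This is a minimal sketch; the function name and the sample tokens are assumptions made for illustration.

```python
def is_stopped(token):
    """Return True if the token should be dropped from the searchable set."""
    if token.isdecimal():
        # Rule 1: stop all numbers greater than 999999.
        return int(token) > 999999
    # Rule 2: stop tokens that intermix digits and letters.
    has_digit = any(c.isdigit() for c in token)
    has_alpha = any(c.isalpha() for c in token)
    return has_digit and has_alpha


tokens = ["1905", "12345678", "report", "x86", "19990101"]
kept = [t for t in tokens if not is_stopped(t)]
```

Only `1905` and `report` survive. The year stays searchable because it is below the threshold, while the long numbers and the mixed token `x86` are eliminated.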
2) Selective Dissemination (Distribution,
Spreading) of Information
• Selective Dissemination of Information (SDI) is a
service that delivers information to users based on
their interests.
• It is a proactive approach to information dissemination:
the provider creates a profile of the user’s information
needs and regularly delivers new publications, research
papers, news articles, or any other relevant material
matching that profile.
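Profile matching can be sketched as comparing each incoming item against stored interest terms. The user names, the profiles, and the simple word-overlap rule are all assumptions for illustration; real SDI profiles are usually much richer than a set of keywords.

```python
# Hypothetical user profiles: each user maps to a set of interest terms.
profiles = {
    "alice": {"retrieval", "indexing"},
    "bob": {"warehouse"},
}


def disseminate(item_text, profiles):
    """Return the users whose profile shares at least one word with the item."""
    words = set(item_text.lower().split())
    return sorted(user for user, interests in profiles.items() if interests & words)


recipients = disseminate("New indexing standards for digital libraries", profiles)
```

Here the new item is delivered only to `alice`, because `indexing` appears in her profile. The item is pushed to the user as it arrives; the user never has to issue a search.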
• Every user can have one or more Private Index files leading to a very
large number of files.
• Each Private Index file references only a small subset of the total
number of items in the Document Database.
• Public Index files are maintained by professional library services
personnel and typically index every item in the Document Database.
• There is a small number of Public Index files.
• These files have access lists (i.e., lists of users and their privileges)
that allow anyone to search or retrieve data.
• Private Index files typically have very limited access lists.
• To assist the users in generating indexes, especially the professional
indexers, the system provides a process called Automatic File Build
(also called Information Extraction).
5) Multimedia Database Search
• From a system perspective, the multi-media data is
not logically its own data structure, but an
augmentation to the existing structures in the
Information Retrieval System.
Relationship to DBMS
• From a practical standpoint, the integration of DBMS’s
and Information Retrieval Systems is very important.
• Commercial database companies have already
integrated the two types of systems.
• One of the first commercial databases to integrate the
two systems into a single view is the INQUIRE DBMS.
• This has been available for over fifteen years.
• A more current example is the ORACLE DBMS, which now offers
an embedded capability called CONVECTIS: an
information retrieval system that uses a comprehensive
thesaurus as the basis for generating “themes” for
a particular item.
• The INFORMIX DBMS has the ability to link to RetrievalWare
to provide integration of structured data and information
along with functions associated with Information Retrieval
Systems.
Digital Libraries and Data Warehouses
(DataMarts)
• As the Internet continued its exponential growth and project funding
became available, the topic of Digital Libraries has grown.
• By 1995 enough research and pilot efforts had started to support the
1st ACM International Conference on Digital Libraries (Fox-96).
• Indexing is one of the critical disciplines in library science and
significant effort has gone into the establishment of indexing and
cataloging standards.
• Migration of many of the library products to a digital format
introduces both opportunities and challenges.
• Information Storage and Retrieval technology has addressed a small
subset of the issues associated with Digital Libraries.
• Data warehouses are similar to information storage and
retrieval systems in that they both have a need for search
and retrieval of information.
• But a data warehouse is more focused on structured data
and decision support technologies.
• In addition to the normal search process, a complete
system provides a flexible set of analytical tools to “mine”
the data.
• Data mining (originally called Knowledge Discovery in
Databases - KDD) is a search process that automatically
analyzes data and extracts relationships and dependencies
that were not part of the database design.