Module 1 - Introduction

Dr.D.SARASWATHI 1
• Introduction to Information Storage and Information Retrieval (IR) –
Definition and Objectives – Functional overview – Relationship to
Database Management Systems – Digital libraries – Data Warehouses

Definition
An Information Retrieval System is a system that is
capable of storage, retrieval, and maintenance of
information.

• Information in this context can be composed of text (including numeric and date data), images, audio,
video and other multi-media objects.

Web Search

EXCALIBUR VISUAL RETRIEVALWARE FINDS PIX IN A DATABASE - Tech Monitor

• An Information Retrieval System consists of a software
program that helps a user find the information the
user needs.

• The system may use standard computer hardware or
specialized hardware to support the search subfunction and
to convert non-textual sources to a searchable medium (e.g.,
transcription of audio to text).

What determines IR success?

• The gauge of success of an information
system is how well it can minimize the
overhead for a user to find the needed
information.

• Overhead is the time required to find the information needed, excluding the time for
actually reading the relevant data.
• It covers search composition, search execution, and reading non-relevant items.
• The first Information Retrieval Systems
originated with the need to organize information
in central repositories (e.g., libraries) (Hyman-82).

• Catalogues were created to facilitate the
identification and retrieval of items.
Web Search Engine
• Provides access to terabytes of information: over 800 million indexable pages

Excite

WebSEEk

Objectives of Information Retrieval Systems
• To minimize the overhead of a user locating needed information.

• Overhead is the time a user spends on query generation, query execution,
scanning the results of the query to select items to read, and reading non-relevant items.

• In information retrieval the term “relevant” item is
used to represent an item containing the needed
information.

• From a user’s perspective “relevant” and “needed”
are synonymous.

Effects of Search on Total Document Space

• When a user decides to issue a search looking for information on a topic,
the total database is logically divided into four segments.

• Relevant items are those documents that contain information that helps
the searcher in answering his question.

• Non-relevant items are those items that do not provide any directly useful
information.

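• There are two possibilities with respect to each item:
it can be retrieved or not retrieved by the user’s query.

• Combined with relevance, this gives the four segments: relevant retrieved,
relevant not retrieved, non-relevant retrieved, and non-relevant not retrieved.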
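
The division of the database into four segments can be sketched with set operations. This is only an illustration: the function name `partition_document_space`, the segment labels and the document identifiers are made up for the example, and in practice the set of relevant items is known only from user judgments.

```python
def partition_document_space(all_items, relevant, retrieved):
    """Split the database into the four segments induced by one search."""
    all_items, relevant, retrieved = set(all_items), set(relevant), set(retrieved)
    return {
        "relevant_retrieved": relevant & retrieved,
        "relevant_not_retrieved": relevant - retrieved,
        "non_relevant_retrieved": retrieved - relevant,
        "non_relevant_not_retrieved": all_items - relevant - retrieved,
    }


# Hypothetical six-item database and one hypothetical query result.
database = {"d1", "d2", "d3", "d4", "d5", "d6"}
segments = partition_document_space(
    database,
    relevant={"d1", "d2", "d3"},
    retrieved={"d2", "d3", "d4"},
)

for name, items in segments.items():
    print(name, sorted(items))
```

Every item falls into exactly one segment, so the four sets together always reproduce the whole database.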

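• The two major measures commonly associated
with information systems are

1) Precision: the fraction of the retrieved items that are relevant
2) Recall: the fraction of all relevant items in the database that are retrieved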

Precision and Recall
• Once all relevant items have been retrieved, every further item retrieved is
non-relevant. Precision is directly affected by the retrieval of non-relevant
items and drops toward zero.

• Recall is not affected by the retrieval of non-relevant
items and thus remains at 100 per cent once
achieved.
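
A small numerical sketch of this behaviour, assuming the usual definitions (precision = relevant retrieved / total retrieved, recall = relevant retrieved / total relevant). The ranking below is hypothetical: two relevant items are returned first, followed by 98 non-relevant ones.

```python
def precision(retrieved, relevant):
    """Fraction of retrieved items that are relevant."""
    retrieved, relevant = set(retrieved), set(relevant)
    return len(retrieved & relevant) / len(retrieved) if retrieved else 0.0


def recall(retrieved, relevant):
    """Fraction of all relevant items that were retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    return len(retrieved & relevant) / len(relevant) if relevant else 0.0


relevant = {"d1", "d2"}
# Hypothetical result order: relevant items first, then non-relevant ones.
ranking = ["d1", "d2"] + [f"n{i}" for i in range(1, 99)]

history = []
for k in (1, 2, 4, 10, 100):
    top_k = ranking[:k]
    history.append((k, precision(top_k, relevant), recall(top_k, relevant)))

for k, p, r in history:
    print(f"retrieved={k:3d}  precision={p:.2f}  recall={r:.2f}")
```

After the second item recall stays at 1.00, while precision falls from 1.00 to 0.02 as non-relevant items accumulate.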
Ideal Precision and Recall

Ideal Precision/Recall Graph

Achievable Precision/Recall Graph

• Information Retrieval Systems such as RetrievalWare, TOPIC,
AltaVista, Infoseek and INQUERY show that accepting
natural language queries is becoming a standard system
feature.
• This allows users to state in natural language what they are
interested in finding.
• But the completeness of the user specification is limited by
the user’s willingness to construct long natural language
queries.
• Most users on the Internet enter only one or two search terms.
Vocabulary Domains

Functional Overview
• A total Information Storage and Retrieval System is
composed of four major functional processes:

1) Item Normalization
2) Selective Dissemination of Information (i.e., “Mail”)
3) Archival Document Database Search
4) Index Database Search, along with the Automatic File Build
process that supports Index Files.
1) Item Normalization
• The first step in any integrated system is to normalize
the incoming items to a standard format.
• Item normalization provides logical restructuring of the
item.
• Additional operations during item normalization are
needed to create a searchable data structure:
• identification of processing tokens (e.g., words),
• characterization of the tokens, and
• stemming (e.g., removing word endings) of the tokens.
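
The three operations listed above can be sketched as a small pipeline. This is a toy illustration, not the algorithm of any particular system: the function names, the three token classes and the suffix list are assumptions, and the stemmer is a naive suffix stripper rather than a standard algorithm such as Porter's.

```python
import re

# Assumption: a tiny suffix list, checked in this order, for illustration only.
SUFFIXES = ("ing", "ed", "es", "s")


def identify_tokens(text):
    """Identify processing tokens as runs of letters and digits."""
    return re.findall(r"[A-Za-z0-9]+", text)


def characterize(token):
    """Attach a simple class to a token (hypothetical classes)."""
    if token.isdigit():
        return "number"
    if token.isalpha():
        return "word"
    return "mixed"


def stem(token):
    """Strip one common word ending, keeping at least three characters."""
    word = token.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def normalize(text):
    """Return (stemmed token, token class) pairs for one item."""
    return [(stem(t), characterize(t)) for t in identify_tokens(text)]


result = normalize("Indexing 3 retrieved items")
print(result)
# [('index', 'word'), ('3', 'number'), ('retriev', 'word'), ('item', 'word')]
```

The stem "retriev" shows that a stem need not be a dictionary word; it only has to be produced consistently for items and queries.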
Total Information Retrieval

Text Normalization Process

• The processing tokens and their characterization
are used to define the searchable text from the
total received text.
• Standardizing the input takes the different external
formats of input data and performs the translation
to the formats acceptable to the system.
• A system may have a single format for all items or
allow multiple formats.

• One example of standardization could be the translation of
foreign-language text into Unicode.
• Languages have used different internal binary encodings
for their characters. One standard
encoding that covers English, French, Spanish, etc. is
ISO-Latin.
• To assist the users in generating indexes, especially the
professional indexers, the system provides a process
called Automatic File Build (AFB).
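
As a minimal sketch of the encoding standardization mentioned above: the same word arrives as different bytes depending on the source encoding, and decoding both into Unicode strings gives one internal form. "ISO-Latin" is taken here to mean ISO 8859-1 (Python codec name `latin-1`), which is an assumption, and `standardize` is a hypothetical helper name.

```python
def standardize(raw_bytes, source_encoding):
    """Decode bytes in a known source encoding into a Unicode string."""
    return raw_bytes.decode(source_encoding)


# The same word as received from two hypothetical sources.
incoming = [
    (b"caf\xe9", "latin-1"),              # ISO 8859-1: one byte for "é"
    ("café".encode("utf-8"), "utf-8"),    # UTF-8: two bytes for "é"
]

standardized = [standardize(raw, enc) for raw, enc in incoming]
print(standardized)  # ['café', 'café']
```

The source encoding has to be known or detected beforehand; decoding with the wrong encoding produces garbled text rather than an error in many cases.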
• Multi-media adds an extra dimension to the normalization
process.
• In addition to normalizing the textual input, the multi-media
input also needs to be standardized.
• For higher quality video: MPEG-2, MPEG-1, AVI
• For lower quality video: Real Media
• Audio standards: WAV or Real Media (Real Audio)
• Image formats vary from JPEG to BMP
• The next process is to parse the item into logical
sub-divisions that have meaning to the user.
• This process, called “Zoning,” is visible to the
user and used to increase the precision of a
search and optimize the display.

Zoning
• A typical item is subdivided into zones, which
may overlap and can be hierarchical, such as
Title, Author, Abstract, Main Text, Conclusion,
and References.
• The zoning information is passed to the
processing token identification operation to
store the information, allowing searches to be
restricted to a specific zone.
• For example, if the user is interested in
articles discussing “Einstein” then the
search should not include the Bibliography,
which could include references to articles
written by “Einstein.”
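
The “Einstein” example can be sketched as a zone-restricted search. The two items, their zone contents and the `search` function are invented for illustration; a real system would search an index rather than scan the text.

```python
# Hypothetical items, each already parsed into named zones.
items = {
    "a1": {
        "Title": "Relativity revisited",
        "Main Text": "Einstein proposed that simultaneity is relative.",
        "References": "Einstein, A. (1905). On the electrodynamics of moving bodies.",
    },
    "a2": {
        "Title": "Quantum dots",
        "Main Text": "Semiconductor nanocrystals emit size-dependent light.",
        "References": "Einstein, A. (1917). On the quantum theory of radiation.",
    },
}


def search(items, term, zones=None):
    """Return ids of items containing term, optionally only within given zones."""
    term = term.lower()
    hits = []
    for item_id, item_zones in items.items():
        for zone, text in item_zones.items():
            if zones is not None and zone not in zones:
                continue
            if term in text.lower():
                hits.append(item_id)
                break
    return hits


print(search(items, "Einstein"))                               # ['a1', 'a2']
print(search(items, "Einstein", zones={"Title", "Main Text"}))  # ['a1']
```

Excluding the References zone removes the item that merely cites Einstein, which is the precision gain zoning is meant to provide.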

• Systems determine words by dividing input symbols
into 3 classes:

1) Valid word symbols
2) Inter-word symbols
3) Special processing symbols.

Word
• A word is defined as a contiguous set of word
symbols bounded by inter-word symbols.

• In many systems inter-word symbols are non-searchable
and should be carefully selected.

Examples
• Word symbols: alphabetic characters and
numbers
• Possible inter-word symbols: blanks, periods,
semicolons and apostrophes
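
A sketch of word identification based on the three symbol classes. Treating the hyphen as the special processing symbol, and the rule used to resolve it, are assumptions made for this example; the slides do not specify them.

```python
SPECIAL = {"-"}  # assumption: the hyphen needs special processing


def classify(ch):
    """Assign one input symbol to one of the three classes."""
    if ch.isalnum():
        return "word"
    if ch in SPECIAL:
        return "special"
    return "inter-word"  # blanks, periods, semicolons, apostrophes, ...


def split_words(text):
    """Return the contiguous runs of word symbols bounded by inter-word symbols."""
    words, current = [], []
    for i, ch in enumerate(text):
        kind = classify(ch)
        if kind == "special":
            # Hypothetical rule: a hyphen joining two word symbols stays in the word.
            joins = bool(current) and i + 1 < len(text) and classify(text[i + 1]) == "word"
            kind = "word" if joins else "inter-word"
        if kind == "word":
            current.append(ch)
        elif current:
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words


print(split_words("The user's well-known query; done."))
# ['The', 'user', 's', 'well-known', 'query', 'done']
```

The stray token "s" shows why inter-word symbols should be chosen carefully: treating the apostrophe as a separator splits possessives into two tokens.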

Stop List/Algorithm
• Applied to the list of potential processing tokens.
• The objective of the Stop function is to save system
resources by eliminating from the set of searchable
processing tokens those that have little value to the
system.
• Given the significant increase in available cheap
memory, storage and processing power, the need to
apply the Stop function to processing tokens is
decreasing.
Examples of Stop algorithms
• Stop all numbers greater than “999999” (this was
selected to allow dates to be searchable)
• Stop any processing token that has numbers and characters
intermixed
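
The two example Stop algorithms can be written as a single predicate. This is a sketch; the function name `is_stopped` and the sample tokens are illustrative only.

```python
def is_stopped(token):
    """Return True if the token should be dropped from the searchable tokens."""
    if token.isdecimal():
        # Rule 1: stop numbers greater than 999999, so six-digit dates survive.
        return int(token) > 999999
    has_digit = any(c.isdigit() for c in token)
    has_alpha = any(c.isalpha() for c in token)
    # Rule 2: stop tokens that intermix numbers and characters.
    return has_digit and has_alpha


tokens = ["index", "1999", "991231", "1000000", "x86", "mp3", "retrieval"]
kept = [t for t in tokens if not is_stopped(t)]
print(kept)  # ['index', '1999', '991231', 'retrieval']
```

Rule 2 also removes tokens such as "mp3" that a user might well search for, which is one reason the need for Stop functions is shrinking as resources get cheaper.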

2) Selective Dissemination (Distribution,
Spreading) of Information
• Selective Dissemination of Information (SDI) is a
service that delivers information to users based on
their interests.

• This can be done through various methods like email,
RSS feeds, or newsletters.

• It is a proactive approach to information dissemination:
the provider creates a profile of the user’s information
needs and regularly sends new publications, research
papers, news articles, or any other relevant material
matching that profile.

• SDI helps users stay up to date with the latest information in
their field of interest, which is valuable in
rapidly changing fields.
• The SDI system has been used extensively in academic and
research settings to support researchers, scholars, and
students.
• It has also been used in various fields, such as libraries,
business, law, healthcare, and government, to provide
relevant information to decision-makers.
• The system can be customized to match the user’s specific
needs, ensuring that they receive only the information that is
relevant to them.
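
The core of SDI, comparing each incoming item against stored user profiles, can be sketched as follows. The profiles, user names and the simple term-overlap rule are assumptions for illustration; real systems use richer profiles and matching.

```python
# Hypothetical interest profiles: user -> set of terms of interest.
profiles = {
    "alice": {"retrieval", "indexing"},
    "bob": {"warehouse"},
}


def disseminate(item_text, profiles):
    """Return the users whose profile shares at least one term with the item."""
    tokens = set(item_text.lower().split())
    return sorted(user for user, interests in profiles.items() if interests & tokens)


print(disseminate("New indexing methods for text retrieval", profiles))  # ['alice']
print(disseminate("Data warehouse design", profiles))                     # ['bob']
print(disseminate("Cooking tips", profiles))                              # []
```

Unlike a database search, the profiles are the standing "queries" here and each new item is tested against them once as it arrives.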
3) Document Database Search
• The Document Database Search Process provides the
capability for a query to search against all items received by
the system.
• The Document Database Search process is composed of the
search process, user-entered queries (typically ad hoc
queries) and the document database, which contains all
items that have been received, processed and stored by the
system.
• Typically items in the Document Database do not change
(i.e., are not edited) once received.
4) Index Database Search
• When an item is determined to be of interest, a user may
want to save it for future reference.
• This is in effect filing it.
• In an information system this is accomplished via the index
process.
• In this process the user can logically store an item in a file
along with additional index terms and descriptive text the
user wants to associate with the item.
• The Index Database Search Process provides the capability to
create indexes and search them.
• There are 2 classes of index files:

1) Public Index files

2) Private Index files

• Every user can have one or more Private Index files leading to a very
large number of files.
• Each Private Index file references only a small subset of the total
number of items in the Document Database.
• Public Index files are maintained by professional library services
personnel and typically index every item in the Document Database.
• There is a small number of Public Index files.
• These files have access lists (i.e., lists of users and their privileges)
that allow anyone to search or retrieve data.
• Private Index files typically have very limited access lists.
• To assist the users in generating indexes, especially the professional
indexers, the system provides a process called Automatic File Build
(also called Information Extraction).
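
The difference between Public and Private Index files can be sketched with a small class. The `IndexFile` class, its fields and the use of a `public` flag in place of an open access list are assumptions for illustration, not the design of any particular system.

```python
from dataclasses import dataclass, field


@dataclass
class IndexFile:
    name: str
    public: bool
    access_list: set = field(default_factory=set)  # users allowed if not public
    entries: dict = field(default_factory=dict)    # index term -> set of item ids

    def add(self, item_id, terms):
        """File an item under each of the given index terms."""
        for term in terms:
            self.entries.setdefault(term.lower(), set()).add(item_id)

    def search(self, user, term):
        """Return item ids filed under term, if the user may read this file."""
        if not self.public and user not in self.access_list:
            raise PermissionError(f"{user} may not search {self.name}")
        return sorted(self.entries.get(term.lower(), set()))


public = IndexFile("library-subject-index", public=True)
public.add("doc1", ["Precision", "Recall"])

private = IndexFile("alice-notes", public=False, access_list={"alice"})
private.add("doc1", ["precision"])

print(public.search("guest", "Precision"))   # ['doc1']
print(private.search("alice", "precision"))  # ['doc1']
```

A search of `private` by any user other than "alice" raises `PermissionError`, reflecting the very limited access lists of Private Index files.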
5) Multimedia Database Search
• From a system perspective, the multi-media data is
not logically its own data structure, but an
augmentation to the existing structures in the
Information Retrieval System.

Relationship to DBMS
• From a practical standpoint, the integration of DBMSs
and Information Retrieval Systems is very important.
• Commercial database companies have already
integrated the two types of systems.
• One of the first commercial databases to integrate the
two systems into a single view is the INQUIRE DBMS.
• This has been available for over fifteen years.

• A more current example is the ORACLE DBMS, which now offers
an embedded capability called CONVECTIS, an
information retrieval system that uses a comprehensive
thesaurus to generate “themes” for
a particular item.
• The INFORMIX DBMS has the ability to link to RetrievalWare
to provide integration of structured data and information
along with functions associated with Information Retrieval
Systems.
Digital Libraries and Data Warehouses
(DataMarts)
• As the Internet continued its exponential growth and project funding
became available, the topic of Digital Libraries has grown.
• By 1995 enough research and pilot efforts had started to support the
First ACM International Conference on Digital Libraries (Fox-96).
• Indexing is one of the critical disciplines in library science and
significant effort has gone into the establishment of indexing and
cataloging standards.
• Migration of many of the library products to a digital format
introduces both opportunities and challenges.
• Information Storage and Retrieval technology has addressed a small
subset of the issues associated with Digital Libraries.
• Data warehouses are similar to information storage and
retrieval systems in that they both have a need for search
and retrieval of information.
• But a data warehouse is more focused on structured data
and decision support technologies.
• In addition to the normal search process, a complete
system provides a flexible set of analytical tools to “mine”
the data.
• Data mining (originally called Knowledge Discovery in
Databases, KDD) is a search process that automatically
analyzes data and extracts relationships and dependencies
that were not part of the database design.