Unit 1: Introduction
By
Mr.K.Yellaswamy
Assistant Professor
Department of Computer Science & Engineering
CMR College of Engineering & Technology
Information Retrieval Systems
• Information
What is “information”?
• Retrieval
What do we mean by “retrieval”?
What are the different types of information needs?
• Systems
How do computer systems fit into the human
information seeking process?
K.YELLASWAMY Asst.Professor
2
CMRCET
Information Hierarchy
(listed from most to least refined and abstract)
Wisdom
Knowledge
Information
Data
Information Hierarchy
Data
The raw material of information
Information
Data organized and presented in a particular manner
Knowledge
“Justified true belief”
Information that can be acted upon
Wisdom
Distilled and integrated knowledge
Demonstrative of high-level “understanding”
A (Facetious) Example
Data
98.6°F, 99.5°F, 100.3°F, 101°F, …
Information
Hourly body temperature: 98.6°F, 99.5°F, 100.3°F, 101°F, …
Knowledge
If you have a temperature above 100°F, you most likely have a fever
Wisdom
If you don't feel well, go see a doctor
What types of information?
• Text
• Structured documents (e.g., XML)
• Images
• Audio (sound effects, songs, etc.)
• Video
• Programs
• Services
Outline of Unit-1
• Definition of IR Systems
• Objectives of IR Systems
• Functional Overview
• Relationship to DBMS
Definition of IR Systems
• An IR system is a system capable of storage,
retrieval, and maintenance of information.
Information: text, image, audio, video, and other
multimedia objects
The focus here is on textual information
Item:
The smallest complete textual unit processed and
manipulated by an IR system
Depends on how a specific source treats information
Book? Chapter? Paragraph?
"Item" and "document" are used interchangeably
Supporting the Search Process
[Figure: the stages of the search process. Labels: Source Selection, Query Formulation, Query, IR System, Examination, Document, Delivery, Source Reselection; Predict, Nominate, Choose.]
Supporting the Search Process
[Figure: the search process with the internals of the IR system. Labels: Source Selection, Query Formulation, Query, IR System, Selection, Document, Examination, Delivery; inside the IR system: Acquisition, Collection, Indexing, Index.]
Objectives of IR Systems
Overview
• The general objective of an IR system is to
minimize the overhead of a user locating
needed information
• The two major measures commonly
associated with information systems are
precision and recall
• Support for user search generation
• Presenting the search results in a format
that helps the user determine which items
are relevant
Basic Measures for Text Retrieval
• Precision: the percentage of retrieved documents that are in fact relevant to the
query (i.e., “correct” responses)
$$\text{precision} = \frac{|\{\text{Relevant}\} \cap \{\text{Retrieved}\}|}{|\{\text{Retrieved}\}|}$$
• Recall: the percentage of documents that are relevant to the query and were, in
fact, retrieved
$$\text{recall} = \frac{|\{\text{Relevant}\} \cap \{\text{Retrieved}\}|}{|\{\text{Relevant}\}|}$$
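The two measures can be computed directly from sets of document identifiers. The following is a minimal sketch; the function name `precision_recall` and the document IDs are made up for illustration and do not come from the slides.

```python
def precision_recall(relevant, retrieved):
    """Return (precision, recall) for two sets of document IDs."""
    hits = relevant & retrieved  # relevant documents that were retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical document IDs, chosen only for illustration.
relevant = {"d1", "d2", "d3", "d4"}
retrieved = {"d2", "d3", "d7", "d8", "d9"}

precision, recall = precision_recall(relevant, retrieved)
print(f"precision = {precision:.2f}, recall = {recall:.2f}")
# prints: precision = 0.40, recall = 0.50
```

Two of the five retrieved documents are relevant (precision 2/5), and two of the four relevant documents were retrieved (recall 2/4).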
Precision and Recall
              Retrieved                  Not Retrieved
Relevant      Relevant, Retrieved        Relevant, Not Retrieved
Non-Relevant  Non-Relevant, Retrieved    Non-Relevant, Not Retrieved
Measuring Precision and Recall
Assume there are a total of 14 relevant documents.

Hits 1-10
Rank       1     2     3     4     5     6     7     8     9     10
Precision  1/1   1/2   1/3   1/4   2/5   3/6   3/7   4/8   4/9   4/10
Recall     1/14  1/14  1/14  1/14  2/14  3/14  3/14  4/14  4/14  4/14

Hits 11-20
Rank       11    12    13    14    15    16    17    18    19    20
Precision  5/11  5/12  5/13  5/14  5/15  6/16  6/17  6/18  6/19  6/20
Recall     5/14  5/14  5/14  5/14  5/14  6/14  6/14  6/14  6/14  6/14

Relevant documents appear at ranks 1, 5, 6, 8, 11, and 16 (the ranks where the hit count increases).
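The precision and recall rows can be reproduced with a short loop over the ranked hits. This is a sketch only: the relevant ranks (1, 5, 6, 8, 11, 16) are inferred from where the numerators in the table increase, and the function name `precision_recall_by_rank` is an assumption for illustration.

```python
TOTAL_RELEVANT = 14                    # stated on the slide
RELEVANT_RANKS = {1, 5, 6, 8, 11, 16}  # inferred from the table, not given explicitly


def precision_recall_by_rank(relevant_ranks, total_relevant, depth):
    """Yield (rank, hits_so_far, precision, recall) for ranks 1..depth."""
    hits = 0
    for rank in range(1, depth + 1):
        if rank in relevant_ranks:
            hits += 1
        yield rank, hits, hits / rank, hits / total_relevant


for rank, hits, precision, recall in precision_recall_by_rank(
    RELEVANT_RANKS, TOTAL_RELEVANT, 20
):
    print(
        f"rank {rank:2d}: precision {hits}/{rank} = {precision:.2f}, "
        f"recall {hits}/{TOTAL_RELEVANT} = {recall:.2f}"
    )
```

Recall never decreases as more hits are examined, while precision drops whenever a non-relevant document is added to the list.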
Graphing Precision and Recall
[Figure: precision (vertical axis) plotted against recall (horizontal axis), showing the tradeoff between the two.]
Precision and Recall (Cont.)
• Precision
Measures retrieval overhead for a particular
query
In the WWW world, precision is more
important than recall
• Recall
How well a system is able to retrieve the
relevant items
Functional Overview
• Four major functional processes
Item Normalization
Selective Dissemination of Information
Archival Document Database Search
Index Database Search, along with the Automatic File Build
process (supports index files)
Total IR System
[Figure: block diagram of the total IR system. Labels: Input, Item Normalization, Document File Creation, Document File, Selective Dissemination of Information (Mail), Profiles, Mail Files, Public Indexing, Private Indexing, Public Index File, Private Index File, Automatic File Build (AFB), AFB Profiles, Candidate Index Records.]
Item Normalization
• Normalize incoming items to a standard
format
Language encoding
Different file formats…
• Logical restructuring – zoning
• Create a searchable data structure (Indexing)
Identification of processing tokens
Characterization of the tokens – single words or
phrases
Stemming of the tokens
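The normalization steps above can be sketched in a few lines. This is an illustrative toy, not the design of any particular system: the zoning rule (first line is the title, the rest is the body), the stop-word list, and all function names are assumptions. Stemming is left out here to keep the sketch short.

```python
import re
from collections import defaultdict

STOP_WORDS = {"a", "an", "and", "in", "of", "the", "to"}  # assumed, deliberately tiny


def zone(item_text):
    """Split an item into a 'title' zone (first line) and a 'body' zone (the rest)."""
    title, _, body = item_text.partition("\n")
    return {"title": title, "body": body}


def processing_tokens(text):
    """Lowercase the text and return its tokens, in order, minus stop words."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


def build_index(items):
    """Map each token to the set of (item_id, zone_name) pairs that contain it."""
    index = defaultdict(set)
    for item_id, text in items.items():
        for zone_name, zone_text in zone(text).items():
            for token in processing_tokens(zone_text):
                index[token].add((item_id, zone_name))
    return index


# Hypothetical items, invented for this example.
items = {
    1: "Retrieval of Information\nThe system stores and retrieves items.",
    2: "Database Systems\nA database stores structured data in tables.",
}
index = build_index(items)
print(sorted(index["stores"]))
# prints: [(1, 'body'), (2, 'body')]
```

Keeping the zone name in each index entry is what lets a later query be restricted to, say, titles only.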
Functional Overview –
Item Normalization
Overview
[Figure: item normalization steps. Labels: Standardize Input, Logical Subsetting (Zoning), Identify Processing Tokens, Apply Stop Algorithm, Characterize Tokens, Apply Stemming, Create Searchable Data Structure, Update Document File.]
Standardize Input
Logical Subsetting (Zoning)
Stemming Algorithm
Normalize the token to a standard semantic
representation
Computer, Compute, Computers, Computing → Comput
Reduce the number of unique words the system
has to contain
ex: “computable”, “computation”, “computability”
Storage savings shrink as the database grows:
small database: 32%
1.6 MB database: 20%
50 MB database: 13.5%
Improves the efficiency of the IR system and
improves recall, but lowers precision
Expand a search term to similar token representations at
run time?
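The idea of collapsing word variants onto one stem can be shown with a naive suffix stripper. This is a sketch for illustration only: the suffix list and the minimum stem length are assumptions chosen to fit the examples on this slide, and it is not the Porter algorithm or any other published stemmer.

```python
# Suffixes are tried in this order, longest first (assumed list, tuned to the slide's examples).
SUFFIXES = ("ability", "ation", "able", "ers", "ing", "er", "es", "e", "s")
MIN_STEM_LENGTH = 3  # assumed guard so that very short words are left alone


def naive_stem(token):
    """Strip the first matching suffix, provided a long enough stem remains."""
    token = token.lower()
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= MIN_STEM_LENGTH:
            return token[: -len(suffix)]
    return token


words = [
    "Computer", "Compute", "Computers", "Computing",
    "computable", "computation", "computability",
]
print({naive_stem(w) for w in words})
# prints: {'comput'}
```

All seven variants share one index entry after stemming, which is where the storage saving and the recall gain come from. The same merging can also pull in unrelated words, which is why precision can drop.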
Create Searchable Data Structure
Functional Overview – Selective
Dissemination of Information
Selective Dissemination of
Information (SDI)
• Provides the capability to dynamically
compare newly received items in the
information system against standing
statements of interest of users and deliver the
item to those users whose statement of
interest matches the contents of the items
• Consists of
Search process
User statements of interest (Profile)
User mail file
Selective Dissemination of
Information (Cont.)
A profile contains a typically broad search
statement along with a list of user mail files
that will receive the document if the search
statement in the profile is satisfied
As each item is received, it is processed against
every user's profile
When the search statement is satisfied, the item is
placed in the mail file(s) associated with the
profile
User search profiles differ from ad hoc
queries in that they contain significantly more
search terms and cover a wider range of interests
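The dissemination step can be sketched as a loop that compares each incoming item against every standing profile. Everything concrete here is assumed for illustration: the profile contents, the mail file names, and the rule that a search statement is satisfied when any one of its terms appears in the item.

```python
from collections import defaultdict

# Hypothetical profiles: a set of search terms plus the mail files that receive matches.
PROFILES = [
    {"terms": {"retrieval", "indexing", "stemming"}, "mail_files": ["alice_inbox"]},
    {"terms": {"database", "sql"}, "mail_files": ["bob_inbox", "team_db"]},
]


def disseminate(item_id, item_text, profiles, mail_files):
    """Place item_id in the mail files of every profile the item satisfies."""
    tokens = set(item_text.lower().split())
    for profile in profiles:
        if profile["terms"] & tokens:  # assumed rule: any shared term satisfies the profile
            for name in profile["mail_files"]:
                mail_files[name].append(item_id)


mail_files = defaultdict(list)
disseminate("item-1", "A survey of stemming and indexing methods", PROFILES, mail_files)
disseminate("item-2", "Tuning SQL queries in a relational database", PROFILES, mail_files)
print(dict(mail_files))
# prints: {'alice_inbox': ['item-1'], 'bob_inbox': ['item-2'], 'team_db': ['item-2']}
```

Unlike an ad hoc query, the profiles stay fixed and the items stream past them, so the matching work is done once per incoming item.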
Functional Overview –
Document Database Search
Provides the capability for a query to search
against all items received by the system
Composed of the search process, user-entered
queries, and the document database.
The document database contains all items that have
been received, processed, and stored by the system
Usually items in the Document DB do not change
May be partitioned by time and allow for archiving by
the time partitions
Queries differ from profiles in that they are
typically short and focused on a specific area of
interest
Functional Overview –
Index Database Search
When an item is determined to be of interest, a user
may want to save it (file it) for future reference
Accomplished via the index process
In the index process, the user can logically store an
item in a file along with additional index terms and
descriptive text the user wants to associate with the
item
An index can reference the original item, or contain
substantive information on the original item
Similar to a card catalog in a library
The Index Database Search Process provides the
capability to create indexes and search them
Functional Overview –
Index Database Search (Cont.)
• The user may search the index and retrieve
the index and/or the document it references
• The system also provides the capability to
search the index and then search the items
referenced by the index records that satisfied
the index portion of the query
Combined file search
• In an ideal system the index record could
reference portions of items versus the total
item
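A combined file search can be pictured as two filters applied in sequence: first the index records, then the items those records reference. The sketch below uses invented items, index records, and a hypothetical `combined_search` function purely to illustrate that flow.

```python
# Hypothetical document database: item ID -> item text.
ITEMS = {
    "doc-1": "precision and recall measure retrieval quality",
    "doc-2": "stemming reduces the number of unique words",
    "doc-3": "recall improves when stemming is applied",
}

# Hypothetical index records: each references one item and carries user-assigned terms.
INDEX_RECORDS = [
    {"item_id": "doc-1", "index_terms": {"evaluation"}},
    {"item_id": "doc-2", "index_terms": {"normalization"}},
    {"item_id": "doc-3", "index_terms": {"evaluation", "normalization"}},
]


def combined_search(index_term, item_term):
    """Return IDs of items whose index record has index_term and whose text has item_term."""
    matches = []
    for record in INDEX_RECORDS:
        if index_term in record["index_terms"]:  # index portion of the query
            item_text = ITEMS[record["item_id"]]
            if item_term in item_text.split():   # item portion of the query
                matches.append(record["item_id"])
    return matches


print(combined_search("evaluation", "stemming"))
# prints: ['doc-3']
```

Only the items that pass the index filter are scanned, so the second search runs over a much smaller set than the whole document database.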
Functional Overview –
Index Database Search (Cont.)
Two classes of index files: public and private index
files
Every user can have one or more private index files
leading to a very large number of files, and each
private index file references only a small subset of the
total number of items in the Document database
Public index files are maintained by professional library
services personnel and typically index every item in
the Document database
The capability to create private and public index
files is frequently implemented via a structured
(relational) Database Management System (RDBMS)
Functional Overview –
Index Database Search (Cont.)
To assist the users in generating indexes, the
system provides a process called Automatic File
Build (Information Extraction)
Processes selected incoming documents and automatically
determines potential indexing for the item
e.g., authors, date of publication, source, and references
The rules that govern which documents are processed
for extraction of index information and the index term
extraction process are stored in Automatic File Build
Profiles
When an item is processed it results in creation of
Candidate Index Records for review and edit by a
user prior to actual update of an index file
What about databases?
• What are examples of databases?
Banks storing account information
Retailers storing inventories
Universities storing student grades
• What exactly is a (relational) database?
Think of them as a collection of tables
They model some aspect of “the world”
A (Simple) Database Example
Student Table
Student ID  Last Name  First Name  Department ID  Email
1           Kandula    Ashok       EE             [email protected]
2           R          Vishnu      HIST           [email protected]
3           Kandula    Srinandan   HIST           [email protected]
4           Kandula    Yellaswamy  CLIS           [email protected]