unit-1introduction

Ids notes

Uploaded by

22tq1a6740

INFORMATION RETRIEVAL SYSTEMS

IV B.Tech (CSE)-I Sem

By
Mr.K.Yellaswamy
Assistant Professor
Department of Computer Science & Engineering
CMR College of Engineering & Technology
Information Retrieval Systems

• Information
What is “information”?
• Retrieval
What do we mean by “retrieval”?
What are the different types of information needs?
• Systems
How do computer systems fit into the human
information seeking process?

K.YELLASWAMY Asst.Professor
2
CMRCET
Information Hierarchy
More refined and abstract

Wisdom

Knowledge

Information

Data

Information Hierarchy
Data
The raw material of information
Information
Data organized and presented in a particular manner
Knowledge
“Justified true belief”
Information that can be acted upon
Wisdom
Distilled and integrated knowledge
Demonstrative of high-level “understanding”

A (Facetious) Example
Data
98.6° F, 99.5° F, 100.3° F, 101° F, …
Information
Hourly body temperature: 98.6° F, 99.5° F, 100.3° F,
101° F, …
Knowledge
If you have a temperature above 100° F, you most
likely have a fever
Wisdom
If you don't feel well, go see a doctor
What types of information?

• Text
• Structured documents (e.g., XML)
• Images
• Audio (sound effects, songs, etc.)
• Video
• Programs
• Services

Outline of Unit-1

• Definition of IR Systems
• Objectives of IR Systems
• Functional Overview
• Relationship to DBMS

Definition of IR Systems
• An IR System is a system capable of storage,
retrieval, and maintenance of information.
Information: text, image, audio, video, and other
multi-media objects
Focus on textual information here
Item:
The smallest complete textual unit processed and
manipulated by an IR system
Depends on how a specific source treats information
  Book? Chapter? Paragraph?
"Item" and "Document" are used interchangeably

Definition of IR Systems (Cont.)

An IR system helps a user find the information
the user needs.
Success measure (Objectives of an IR System)
Minimize the overhead for finding information
Overhead: the time a user spends in all of the steps
leading to reading an item containing needed
information, excluding the time for actually
reading the relevant data
Query generation
Search composition
Search execution
Scanning results of query to select items to read

Supporting the Search Process
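[Figure: the search process. Stages: Source Selection → Query Formulation → Search → Selection → Examination → Delivery. Intermediate products: Query, Ranked List, Document. Feedback loops: Query Reformulation and Relevance Feedback; Source Reselection. The IR system's roles are labeled Predict, Nominate, Choose.]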
Supporting the Search Process
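[Figure: the search process with the IR system's internals. User side: Source Selection → Query Formulation → Search → Selection → Examination → Delivery, with Query, Ranked List, and Document as intermediate products. System side: Acquisition builds the Collection, and Indexing builds the Index.]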
Objectives of IR Systems

Overview
• The general objective of an IR system is to
minimize the overhead of a user locating
needed information
• The two major measures commonly
associated with information systems are
precision and recall
• Support of user search generation
• How to present the search results in a format
that helps the user determine
relevant items
Basic Measures for Text Retrieval

[Figure: Venn diagram inside the set of all documents, showing the Relevant set, the Retrieved set, and their overlap (Relevant & Retrieved)]

• Precision: the percentage of retrieved documents that are in fact relevant to the
query (i.e., “correct” responses)
$$\text{precision} = \frac{|\{\text{Relevant}\} \cap \{\text{Retrieved}\}|}{|\{\text{Retrieved}\}|}$$
• Recall: the percentage of documents that are relevant to the query and were, in
fact, retrieved
$$\text{recall} = \frac{|\{\text{Relevant}\} \cap \{\text{Retrieved}\}|}{|\{\text{Relevant}\}|}$$
Precision and Recall

Relevant Retrieved        Relevant Not Retrieved
Non-Relevant Retrieved    Non-Relevant Not Retrieved

$$\text{Precision} = \frac{\text{Number\_Retrieved\_Relevant}}{\text{Number\_Total\_Retrieved}}$$
$$\text{Recall} = \frac{\text{Number\_Retrieved\_Relevant}}{\text{Number\_Possible\_Relevant}}$$
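The two ratios above can be computed directly from sets of document identifiers. The following is a minimal sketch; the function name `precision_recall` and the sample identifiers are made up for illustration and do not come from the slides.

```python
def precision_recall(relevant: set, retrieved: set) -> tuple[float, float]:
    """Return (precision, recall) for one query, given two sets of document IDs."""
    hits = relevant & retrieved  # documents that are both relevant and retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical sample data: 4 relevant documents, 5 retrieved, 2 in common.
relevant = {"d1", "d2", "d3", "d4"}
retrieved = {"d2", "d3", "d7", "d8", "d9"}
precision, recall = precision_recall(relevant, retrieved)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Returning 0.0 for an empty denominator is a convention chosen here to keep the sketch simple; other conventions are possible.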

Measuring Precision and Recall
Assume there are a total of 14 relevant documents

Hit        1     2     3     4     5     6     7     8     9     10
Precision  1/1   1/2   1/3   1/4   2/5   3/6   3/7   4/8   4/9   4/10
Recall     1/14  1/14  1/14  1/14  2/14  3/14  3/14  4/14  4/14  4/14

Hit        11    12    13    14    15    16    17    18    19    20
Precision  5/11  5/12  5/13  5/14  5/15  6/16  6/17  6/18  6/19  6/20
Recall     5/14  5/14  5/14  5/14  5/14  6/14  6/14  6/14  6/14  6/14

Relevant documents appear at hits 1, 5, 6, 8, 11, and 16 (the positions where the numerators increase).
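The table can be reproduced by walking down the ranked list and counting relevant documents seen so far. This is a sketch only: the relevant positions (1, 5, 6, 8, 11, 16) are read off the table by noting where the precision numerator increases, and the function name is hypothetical.

```python
def precision_recall_at_each_hit(relevant_hits: set[int], num_hits: int, total_relevant: int):
    """Return (hit, precision, recall) rows, with the ratios kept as unreduced fractions."""
    rows = []
    found = 0  # relevant documents seen so far in the ranked list
    for hit in range(1, num_hits + 1):
        if hit in relevant_hits:
            found += 1
        rows.append((hit, f"{found}/{hit}", f"{found}/{total_relevant}"))
    return rows


rows = precision_recall_at_each_hit({1, 5, 6, 8, 11, 16}, num_hits=20, total_relevant=14)
for hit, precision, recall in rows:
    print(f"hit {hit:2d}  precision {precision:5s}  recall {recall}")
```

Precision falls whenever a non-relevant document is retrieved, while recall can only stay flat or rise as more hits are examined.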

Graphing Precision and Recall

• Plot each (recall, precision) point on a graph
• Visually represent the precision/recall tradeoff
[Figure: Precision (y-axis, 0 to 1) plotted against Recall (x-axis, 0 to 0.5)]

Precision and Recall (Cont.)
• Precision
Measures retrieval overhead for a particular
query
In the WWW world, precision is more
important than recall
• Recall
How well a system is able to retrieve the
relevant items for users
• Ideal Precision and Recall
[Figure: ideal precision and recall curves; y-axis Percent (0 to 100), x-axis Number Items Retrieved (0 to N)]
Two More Objectives of IR Systems

Support of user search generation
How to specify the information a user needs
  Language ambiguities – "field"
  Vocabulary corpus of a user and item authors
Must assist users automatically and through interaction
in developing a search specification that represents the
need of users and the writing style of diverse authors
How to present the search results in a format that
helps the user determine relevant items
Ranking in order of potential relevance
Item clustering and link analysis…

Functional Overview

Functional Overview
• Four major functional processes
Item Normalization
Selective Dissemination of Information
Archival Document Database Search
Index Database Search + Automatic File Build
Process (Support index files)

Total IR System
[Figure: total IR system. Components: Input, Item Normalization, Document File Creation, Document File; Selective Dissemination of Information (Mail) with Profiles and Mail Files; Private Indexing with Private Index File; Public Indexing with Public Index File; Automatic File Build (AFB) with AFB Profiles and Candidate Index Records.]

Item Normalization
• Normalize incoming items to a standard
format
Language encoding
Different file formats…
• Logical restructuring – zoning
• Create a searchable data structure (Indexing)
Identification of processing tokens
Characterization of the tokens – single words or
phrases
Stemming of the tokens
Functional Overview –
Item Normalization

Overview
[Figure: item normalization pipeline. Standardize Input → Logical Subsetting (Zoning) → Identify Processing Tokens → Apply Stop Algorithm → Characterize Tokens → Apply Stemming → Create Searchable Data Structure; a separate box is labeled Update Document File.]

Standardize Input

Standardizing the input takes the different
external formats of input data and translates them
into the formats acceptable to the system.
Translate foreign languages into Unicode
Allows a single browser to display the languages and
potentially a single search system to search them
Translate multi-media input into a standard
format
Video: MPEG-2, MPEG-1, AVI, Real Video…
Audio: WAV, Real Audio
Image: GIF, JPEG, BMP…
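For text, standardizing the input can be sketched as decoding each item from its source encoding into Unicode and then applying one normalization form, so that equivalent character sequences compare equal. The function name `standardize_text`, the choice of NFC, and the sample bytes are assumptions for illustration; the slides do not prescribe a specific procedure.

```python
import unicodedata


def standardize_text(raw: bytes, source_encoding: str) -> str:
    """Decode bytes from a known source encoding and normalize to Unicode NFC."""
    text = raw.decode(source_encoding)
    return unicodedata.normalize("NFC", text)


# The same word arriving in two external formats:
latin1_item = "caf\u00e9".encode("latin-1")  # precomposed e-acute, Latin-1 bytes
utf8_item = "cafe\u0301".encode("utf-8")     # 'e' plus combining acute accent, UTF-8 bytes

a = standardize_text(latin1_item, "latin-1")
b = standardize_text(utf8_item, "utf-8")
print(a == b)  # both items now share one internal representation
```

After this step a single search system can match the two items with one query term, which is the benefit the slide describes.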

Logical Subsetting (Zoning)

Parse the item into logical sub-divisions that have
meaning to the user
Title, Author, Abstract, Main Text, Conclusion,
References, Country, Keyword…
Visible to the user and used to increase the
precision of a search and optimize the display
The zoning information is passed to the processing
token identification operation to store the information,
allowing searches to be restricted to a specific zone
Display the minimum data required from each item to
allow determination of the possible relevance of that
item (display zones such as Title, Abstract…)
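Zoning and zone-restricted search can be sketched as follows. This assumes, purely for illustration, that each zone starts on a line of the form `Label: text`; real items need format-specific parsing. The names `parse_zones` and `search_zone` and the sample items are hypothetical.

```python
ZONE_LABELS = ("Title", "Author", "Abstract", "Main Text")


def parse_zones(item_text: str) -> dict[str, str]:
    """Split an item into zones, assuming each zone begins with 'Label: ...'."""
    zones: dict[str, str] = {}
    current = None
    for line in item_text.splitlines():
        label, sep, rest = line.partition(":")
        if sep and label.strip() in ZONE_LABELS:
            current = label.strip()
            zones[current] = rest.strip()
        elif current is not None:
            # Continuation line: append it to the zone currently being read.
            zones[current] = (zones[current] + " " + line.strip()).strip()
    return zones


def search_zone(items: dict[int, dict[str, str]], zone: str, term: str) -> list[int]:
    """Return IDs of items whose given zone contains the term as a whole word."""
    term = term.lower()
    return [item_id for item_id, zones in items.items()
            if term in zones.get(zone, "").lower().split()]


# Made-up sample items.
item_1 = "Title: Retrieval of medical records\nAuthor: A. Writer\nAbstract: We study indexing\nof clinical text."
item_2 = "Title: Indexing methods\nAuthor: B. Author\nAbstract: A survey of retrieval systems."
items = {1: parse_zones(item_1), 2: parse_zones(item_2)}

print(search_zone(items, "Title", "retrieval"))
print(search_zone(items, "Abstract", "retrieval"))
```

Restricting the term to the Title zone excludes item 2, whose only mention of the term is in its abstract; this is how zoning can raise precision.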
Identify Processing Tokens

• Identify the information that is used in the
search process: Processing Tokens (a better term than
Words)
• The first step is to determine a word
Dividing input symbols into three classes
• Valid word symbols: alphabetic characters, numbers
• Inter-word symbols: blanks, periods, semicolons (non-searchable)
• Special processing symbols: hyphen (-)
A word is defined as a contiguous set of word
symbols bounded by inter-word symbols
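The three symbol classes can be turned into a small tokenizer. The slide does not say how the hyphen is specially processed, so this sketch assumes one possible rule: a hyphen is kept when it sits between two word symbols and is otherwise treated as an inter-word symbol. The function name `tokenize` is hypothetical.

```python
def tokenize(text: str) -> list[str]:
    """Split text into words: runs of word symbols bounded by inter-word symbols."""
    tokens: list[str] = []
    current: list[str] = []
    for i, ch in enumerate(text):
        next_is_word = i + 1 < len(text) and text[i + 1].isalnum()
        if ch.isalnum():
            # Valid word symbol: letters and digits.
            current.append(ch)
        elif ch == "-" and current and next_is_word:
            # Assumed special-processing rule: keep a hyphen inside a word.
            current.append(ch)
        else:
            # Inter-word symbol: close the current word, if any.
            if current:
                tokens.append("".join(current))
                current = []
    if current:
        tokens.append("".join(current))
    return tokens


print(tokenize("State-of-the-art systems; fast - and cheap."))
```

A different hyphen rule (for example, always splitting on it) would only require changing the middle branch.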
Stop Algorithm

• Save system resources by eliminating
from the set of searchable processing
tokens those that have little value to the search
Tokens whose frequency and/or semantic use make
them of no use as searchable tokens
• Any word found in almost every item
• Any word found only once or twice in the
database
Frequency * Rank = Constant (Zipf's law)
Stop algorithm vs. stop list
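A frequency-based stop algorithm, as opposed to a fixed stop list, can be sketched as below. The thresholds (`max_doc_fraction`, `min_total_count`), the function name, and the sample items are assumptions chosen for illustration; the slides give only the two qualitative criteria.

```python
from collections import Counter


def find_stop_tokens(items: list[list[str]],
                     max_doc_fraction: float = 0.9,
                     min_total_count: int = 3) -> set[str]:
    """Flag tokens found in almost every item, or found fewer than min_total_count times."""
    total = Counter()     # occurrences of each token across the whole database
    doc_freq = Counter()  # number of items in which each token appears
    for tokens in items:
        total.update(tokens)
        doc_freq.update(set(tokens))
    n_items = len(items)
    return {token for token in total
            if doc_freq[token] / n_items >= max_doc_fraction
            or total[token] < min_total_count}


# Made-up sample database of four tokenized items.
items = [["the", "cat", "sat"],
         ["the", "dog", "ran"],
         ["the", "cat", "ran"],
         ["the", "zebra", "ran"]]
stop_tokens = find_stop_tokens(items)
print(sorted(stop_tokens))
```

With a stop list, the set of eliminated tokens is fixed in advance; here it is derived from the database itself, so it changes as items are added.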
Characterize Tokens

• Identify any specific word characteristics
Word sense disambiguation
Part of speech tagging
Uppercase – proper names, acronyms, and
organizations
Numbers and dates

Stemming Algorithm
Normalize the token to a standard semantic
representation
Computer, Compute, Computers, Computing
→ Comput
Reduce the number of unique words the system
has to contain
  ex: "computable", "computation", "computability"
  small database: saves 32 percent of storage
  larger database: 1.6 MB → 20%
  50 MB → 13.5%
Improves the efficiency of the IR system and
improves recall → precision declines
Expand a search term to similar token representations at
run time?
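The Computer/Compute/Computers/Computing example can be reproduced with a toy suffix-stripping stemmer. This is not the Porter stemmer or any production algorithm; the suffix list, the minimum stem length, and the name `naive_stem` are assumptions picked so that the slide's four words collapse to one stem.

```python
# Checked in order, so longer suffixes are listed before their shorter tails.
SUFFIXES = ("ers", "ing", "er", "es", "e", "s")


def naive_stem(token: str, min_stem_length: int = 3) -> str:
    """Strip the first matching suffix, provided the remaining stem is long enough."""
    word = token.lower()
    for suffix in SUFFIXES:
        stem = word[: -len(suffix)]
        if word.endswith(suffix) and len(stem) >= min_stem_length:
            return stem
    return word


words = ["Computer", "Compute", "Computers", "Computing"]
stems = [naive_stem(w) for w in words]
print(stems)
print(len(set(words)), "unique words ->", len(set(stems)), "unique stem")
```

Collapsing four dictionary entries into one is the storage saving the slide refers to, and matching all four forms with one query term is what raises recall at some cost in precision.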

Create Searchable Data Structure

• Processing tokens → Stemming Algorithm
→ update to the searchable data structure
• Internal representation (not visible to user)
Signature file, Inverted list, PAT Tree…
• Contains
Semantic concepts that represent the items in the
database
Limits what a user can find as a result of the
search
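Of the structures named above, the inverted list is the simplest to sketch: each processing token maps to the list of items that contain it. The function names and the sample stemmed tokens are hypothetical, and the query is assumed to require all tokens (a Boolean AND).

```python
from collections import defaultdict


def build_inverted_list(items: dict[int, list[str]]) -> dict[str, list[int]]:
    """Map each processing token to the sorted IDs of the items containing it."""
    postings = defaultdict(set)
    for item_id, tokens in items.items():
        for token in tokens:
            postings[token].add(item_id)
    return {token: sorted(ids) for token, ids in postings.items()}


def search_all(index: dict[str, list[int]], query_tokens: list[str]) -> list[int]:
    """Return IDs of items containing every query token."""
    if not query_tokens:
        return []
    result = set(index.get(query_tokens[0], []))
    for token in query_tokens[1:]:
        result &= set(index.get(token, []))
    return sorted(result)


# Made-up items, already tokenized and stemmed.
items = {
    1: ["comput", "retriev", "system"],
    2: ["comput", "network"],
    3: ["retriev", "index", "system"],
}
index = build_inverted_list(items)
print(index["comput"])
print(search_all(index, ["retriev", "system"]))
```

Only tokens stored in the index can ever be matched, which is why the searchable data structure limits what a user can find.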

Functional Overview – Selective
Dissemination of Information

Selective Dissemination of
Information (SDI)
• Provides the capability to dynamically
compare newly received items in the
information system against standing
statements of interest of users and deliver the
item to those users whose statement of
interest matches the contents of the items
• Consists of
Search process
User statements of interest (Profile)
User mail file

Selective Dissemination of
Information (Cont.)
A profile contains a typically broad search
statement along with a list of user mail files
that will receive the document if the search
statement in the profile is satisfied
As each item is received, it is processed against
every user's profile
When the search statement is satisfied, the item is
placed in the mail file(s) associated with the
profile
User search profiles differ from ad hoc
queries in that they contain significantly more
search terms and cover a wider range of interests
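The dissemination loop can be sketched as follows. It assumes, for illustration only, that a profile's search statement is satisfied when the item contains at least one of the profile's terms; real profiles use richer search statements. The class `Profile`, the function `disseminate`, and all sample data are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Profile:
    owner: str
    search_terms: set[str]  # the standing statement of interest
    mail_files: list[str]   # mail files that receive matching items


def disseminate(item_id: str, item_tokens: list[str],
                profiles: list[Profile], mail_files: dict[str, list[str]]) -> None:
    """Compare one newly received item against every profile and deliver it on a match."""
    tokens = set(item_tokens)
    for profile in profiles:
        if profile.search_terms & tokens:  # assumed rule: any shared term satisfies the statement
            for name in profile.mail_files:
                mail_files.setdefault(name, []).append(item_id)


profiles = [
    Profile("user_a", {"retrieval", "indexing", "stemming"}, ["user_a_inbox"]),
    Profile("user_b", {"database", "sql"}, ["user_b_inbox", "team_inbox"]),
]
mail_files: dict[str, list[str]] = {}
disseminate("item-1", ["stemming", "improves", "recall"], profiles, mail_files)
disseminate("item-2", ["sql", "database", "tuning"], profiles, mail_files)
disseminate("item-3", ["weather", "report"], profiles, mail_files)
print(mail_files)
```

Unlike an ad hoc query, the profiles stay fixed while the items stream past them, so the loop runs once per arriving item rather than once per user request.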

Functional Overview –
Document Database Search
Provides the capability for a query to search
against all items received by the system
Composed of the search process, user-entered
queries, and the document database.
The document database contains all items that have
been received, processed, and stored by the system
Usually items in the Document DB do not change
May be partitioned by time and allow for archiving by
the time partitions
Queries differ from profiles in that they are
typically short and focused on a specific area of
interest

Functional Overview –
Index Database Search
• When an item is determined to be of interest, a user
may want to save it (file it) for future reference
Accomplished via the index process
• In the index process, the user can logically store an
item in a file along with additional index terms and
descriptive text the user wants to associate with the
item
An index can reference the original item, or contain
substantive information on the original item
Similar to card catalog in a library
• The Index Database Search Process provides the
capability to create indexes and search them

Functional Overview –
Index Database Search (Cont.)
• The user may search the index and retrieve
the index and/or the document it references
• The system also provides the capability to
search the index and then search the items
referenced by the index records that satisfied
the index portion of the query
Combined file search
• In an ideal system the index record could
reference portions of items versus the total
item
Functional Overview –
Index Database Search (Cont.)
Two classes of index files: public and private index
files
Every user can have one or more private index files
leading to a very large number of files, and each
private index file references only a small subset of the
total number of items in the Document database
Public index files are maintained by professional library
services personnel and typically index every item in
the Document database
The capability to create private and public index
files is frequently implemented via a structured
Database Management System (RDBMS)
Functional Overview –
Index Database Search (Cont.)
To assist the users in generating indexes, the
system provides a process called Automatic File
Build (Information Extraction)
Processes selected incoming documents and automatically
determines potential indexing for the item
  Authors, date of publication, source, and references
The rules that govern which documents are processed
for extraction of index information and the index term
extraction process are stored in Automatic File Build
Profiles
When an item is processed, it results in the creation of
Candidate Index Records for review and editing by a
user prior to the actual update of an index file
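The flow from AFB profile to candidate index record can be sketched as below. Everything concrete here is an assumption made for illustration: the profile selects documents by a `source` field, extraction uses regular expressions, and the class name, field names, patterns, and sample item are all hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass
class AFBProfile:
    required_source: str      # rule: which incoming documents to process
    patterns: dict[str, str]  # index field -> regex with one capture group


def build_candidate_record(item: dict, profile: AFBProfile):
    """Return a candidate index record for user review, or None if the item is not selected."""
    if item["source"] != profile.required_source:
        return None
    record = {"item_id": item["id"], "status": "pending review"}
    for field_name, pattern in profile.patterns.items():
        match = re.search(pattern, item["text"], re.MULTILINE)
        record[field_name] = match.group(1) if match else None
    return record


profile = AFBProfile(
    required_source="newswire",
    patterns={"author": r"^By (.+)$", "date": r"(\d{4}-\d{2}-\d{2})"},
)
# Made-up incoming item.
item = {"id": 7, "source": "newswire",
        "text": "By A. Writer\n2024-01-15\nIndexing systems are widely used."}
candidate = build_candidate_record(item, profile)
print(candidate)
```

The record is only a candidate: nothing is written to an index file until a user has reviewed and edited it.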

What about databases?
• What are examples of databases?
Banks storing account information
Retailers storing inventories
Universities storing student grades
• What exactly is a (relational) database?
Think of them as a collection of tables
They model some aspect of “the world”

A (Simple) Database Example
Student Table
Student ID  Last Name  First Name  Department ID  email
1           Kandula    Ashok       EE             [email protected]
2           R          Vishnu      HIST           [email protected]
3           Kandula    Srinandan   HIST           [email protected]
4           Kandula    Yellaswamy  CLIS           [email protected]

Department Table
Department ID  Department
EE             Electrical Engineering
HIST           History
CLIS           Information Studies

Course Table
Course ID  Course Name
lbsc690    Information Technology
ee750      Communication
hist405    American History

Enrollment Table
Student ID  Course ID  Grade
1           lbsc690    90
1           ee750      95
2           lbsc690    95
2           hist405    80
3           hist405    90
4           lbsc690    98
Database Queries
• What would you want to know from a
database?
What classes is John Arrow enrolled in?
Who has the highest grade in LBSC 690?
Who's in the history department?
Of all the non-CLIS students taking LBSC 690
who have a last name shorter than six characters
and were born on a Monday, who has the
longest email address?
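Two of these questions can be answered exactly with SQL over the Student and Enrollment tables from the database example. The sketch below uses Python's built-in `sqlite3` module with an in-memory database; the table and column names are chosen here for illustration, and the email column is omitted because the addresses are not legible in the source.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE student (student_id INTEGER PRIMARY KEY, last_name TEXT,
                      first_name TEXT, department_id TEXT);
CREATE TABLE enrollment (student_id INTEGER, course_id TEXT, grade INTEGER);
INSERT INTO student VALUES
    (1, 'Kandula', 'Ashok', 'EE'), (2, 'R', 'Vishnu', 'HIST'),
    (3, 'Kandula', 'Srinandan', 'HIST'), (4, 'Kandula', 'Yellaswamy', 'CLIS');
INSERT INTO enrollment VALUES
    (1, 'lbsc690', 90), (1, 'ee750', 95), (2, 'lbsc690', 95),
    (2, 'hist405', 80), (3, 'hist405', 90), (4, 'lbsc690', 98);
""")

# Who has the highest grade in LBSC 690?
top = conn.execute("""
    SELECT s.first_name, s.last_name, e.grade
    FROM enrollment AS e JOIN student AS s ON s.student_id = e.student_id
    WHERE e.course_id = 'lbsc690'
    ORDER BY e.grade DESC LIMIT 1
""").fetchone()

# Who's in the history department?
history = [row[0] for row in conn.execute(
    "SELECT first_name FROM student WHERE department_id = 'HIST' ORDER BY student_id")]

print(top)
print(history)
```

Each query is formally defined and has exactly one correct answer for the stored data, in contrast to an IR query, where relevance is a matter of degree.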
Databases vs. IR
                         Databases                           IR
What we’re retrieving    Structured data. Clear semantics    Mostly unstructured. Free text
                         based on a formal model.            with some metadata.
Queries we’re posing     Formally (mathematically) defined   Vague, imprecise information needs
                         queries. Unambiguous.               (often expressed in natural language).
Results we get           Exact. Always correct in a          Sometimes relevant, often not.
                         formal sense.
Interaction with system  One-shot queries.                   Interaction is important.
Other issues             Concurrency, recovery, atomicity    Issues downplayed.
                         are all critical.