IRSUnit-1

Information Retrieval Systems (IRS) focus on efficiently extracting relevant information from various data types, including text and multimedia. The system's success is measured by its ability to minimize user overhead in locating needed information, utilizing precision and recall metrics. IRS encompasses processes like item normalization, selective dissemination, and document/database searches, integrating with database management systems and addressing challenges in user query generation.

Uploaded by

sudharani.am


Introduction to Information Retrieval Systems

• Information Retrieval Systems is the formal study of efficient and effective ways to extract the right bit of information from a collection.
• An Information Retrieval System is a system capable of storage, retrieval and maintenance of information.
• Information in this context can be composed of text, images, audio, video and other multimedia objects.
• The term “item” plays an important role: it represents the smallest complete unit that is processed and manipulated by the system.
• The definition of an item varies with how a specific source treats information. It may be a complete document such as a book, newspaper or magazine.
• An Information Retrieval System consists of a software program that helps a user find the information the user needs.
• It may also use standard computers or specialized hardware to support the search subfunction and to convert non-textual sources to a searchable medium.
• The gauge of success of an information retrieval system is how well it can minimize the overhead for a user to find needed information.
• Based upon the information needed by the user, two types of retrieval are considered: comprehensive retrieval and reasonable retrieval.
• A comprehensive retrieval returns more information, which can be a negative feature because it overloads the user.
• In contrast, a reasonable retrieval returns only a limited amount of information, enough to satisfy the user.
Objectives of Information Retrieval Systems
• The general objective of an Information Retrieval System is to minimize the overhead of a user locating needed information.
• Overhead can be expressed as the time a user spends in all of the steps leading to reading an item containing the needed information (e.g., query generation, query execution, scanning the results of the query to select items to read, and reading non-relevant items).
• The success of an information system is very subjective, based upon what information is needed and the willingness of the user to accept the overhead.
• Needed information can be defined as all information in the system that relates to a user’s need.
• In Information Retrieval, the term relevant is used to represent an item containing needed information.
• From a user’s perspective, relevant and needed are synonyms. From a system perspective, information could be relevant to a search statement even though it is not needed by, or relevant to, the user.
• The two major measures commonly associated with information systems are precision and recall.
• When a user issues a search for information on a topic, the total database is logically divided into four segments: Relevant Retrieved, Relevant Not Retrieved, Non-Relevant Retrieved and Non-Relevant Not Retrieved.
• Relevant items are those documents that contain information that helps the searcher answer the question. Non-relevant items are those that do not provide any useful information.
• Precision and Recall are defined as

Precision = Number_retrieved_relevant / Number_total_retrieved

Recall = Number_retrieved_relevant / Number_possible_relevant

where Number_possible_relevant is the number of relevant items in the database, Number_total_retrieved is the total number of items retrieved by the query, and Number_retrieved_relevant is the number of items retrieved that are relevant to the user’s need.
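The two ratios can be computed directly from the set of retrieved items and the set of relevant items. The following is a minimal sketch; the function name `precision_recall` and the document ids are hypothetical, and returning 0.0 for an empty set is an assumption made only to avoid division by zero.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: ids of the items the search returned
    relevant:  ids of all items in the database relevant to the user's need
    """
    retrieved = set(retrieved)
    relevant = set(relevant)
    number_retrieved_relevant = len(retrieved & relevant)
    # Assumption: an empty denominator yields 0.0 instead of raising an error.
    precision = number_retrieved_relevant / len(retrieved) if retrieved else 0.0
    recall = number_retrieved_relevant / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: 4 items retrieved, 3 of them relevant,
# and 6 relevant items exist in the whole database.
p, r = precision_recall(
    retrieved=["d1", "d2", "d3", "d9"],
    relevant=["d1", "d2", "d3", "d4", "d5", "d6"],
)
print(p, r)  # 0.75 0.5
```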

• Precision measures one aspect of information retrieval overhead for a user associated with a particular search. If a search has 85 percent precision, then 15 percent of the user’s effort is overhead spent reviewing non-relevant items.
• Recall estimates how well a system processing a particular query is able to retrieve the relevant items that the user is interested in seeing.
Fig : Ideal Precision and Recall graph
• The basic properties of precision and recall can be seen in the ideal case. Precision starts at 100 percent and stays there as long as only relevant items are retrieved, then falls to a number close to zero as non-relevant items are retrieved. Recall starts close to zero and increases as relevant items are retrieved, until all possible relevant items have been retrieved.
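The shape of the ideal curves can be reproduced by computing precision and recall after each item in a ranked hit list. This is only an illustrative sketch: the function name and the six-item ranking are made up, and with such a short list precision declines gradually rather than dropping close to zero as it would over a large database.

```python
def precision_recall_by_rank(ranked_relevance, number_possible_relevant):
    """Return a list of (precision, recall) pairs, one per rank.

    ranked_relevance: booleans, True where the item at that rank is relevant.
    """
    points = []
    hits = 0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
        points.append((hits / rank, hits / number_possible_relevant))
    return points


# Ideal ordering (hypothetical): the 3 relevant items come first,
# followed by 3 non-relevant items.
ideal = [True, True, True, False, False, False]
curve = precision_recall_by_rank(ideal, number_possible_relevant=3)
for precision, recall in curve:
    print(f"precision={precision:.2f} recall={recall:.2f}")
```

Precision stays at 1.00 for the first three ranks while recall climbs to 1.00, after which precision falls with every non-relevant item retrieved.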
• Assume that there are 100 relevant items in the database. On the graph, a precision of 0.3 has an associated recall of 0.5. The recall value means there are 50 relevant items in the hit file. A precision of 30 percent means the user would likely have to review about 167 items (50 / 0.3) to find those 50 relevant items.
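The arithmetic in this example follows from rearranging the two definitions: the number of relevant items in the hit file is recall times the number of possible relevant items, and the number of items to review is that count divided by precision. A small sketch (the function name `items_to_review` is hypothetical):

```python
import math


def items_to_review(number_possible_relevant, recall, precision):
    """Return (relevant items in the hit file, items the user must review)."""
    number_retrieved_relevant = recall * number_possible_relevant
    number_total_retrieved = number_retrieved_relevant / precision
    # Round up: a fraction of an item still has to be reviewed as a whole item.
    return number_retrieved_relevant, math.ceil(number_total_retrieved)


found, reviewed = items_to_review(100, recall=0.5, precision=0.3)
print(found, reviewed)  # 50.0 167
```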
• The first objective of Information Retrieval Systems is support of user search generation. There are natural obstacles to specifying the information a user needs, which come from ambiguities inherent in languages. Homographs and acronyms allow the same word to have multiple meanings.
• Many users have trouble generating a good search statement. Not all users know how to write a search statement using Boolean logic, as implemented in database management systems. Only with the introduction of Information Retrieval Systems such as RetrievalWare, TOPIC, AltaVista, Infoseek and INQUERY has accepting natural language queries become a standard system feature.
• Many complexities arise in generating a natural language query. One is that the user may not be an expert in the area being searched and may lack the domain-specific vocabulary of that subject area, which leads to an inaccurate query with misleading search terms.
• Information Retrieval Systems must provide tools to help overcome these search specification problems. The search tools must assist the user, both automatically and through system interaction, in developing a search specification that represents the user’s need.
• Information Retrieval Systems provide functions that present results in order of potential relevance to the user.
• Multimedia information retrieval adds a significant layer of complexity in how to display multi-modal results.
Functional Overview of IRS

• A total Information Storage and Retrieval System is composed of four major functional processes: Item Normalization, Selective Dissemination of Information, Document Database Search and Index Database Search, along with the Automatic File Build process that supports Index Files.
• All four processes are integrated and implemented in public or general systems.
• Commercial systems have not integrated these capabilities but supply them as independent capabilities.
• The following figure shows the logical view of an Information Retrieval System. Boxes are used in the diagram to represent functions, while disks are used to represent data storage.
Item Normalization:
• The first step in any integrated system is to normalize the incoming items to a standard format.
• Item normalization provides logical restructuring of the item.
• Additional operations during item normalization are needed to create a searchable data structure.
• Those operations are identification of processing tokens, application of stop list algorithms, characterization of tokens, and stemming (e.g., removing word endings).
• Standardizing the input takes the different external formats of input data and translates them to the formats acceptable to the system.
• One example of standardization is translation of foreign languages into Unicode. One standard encoding that covers English, French, Spanish, etc. is ISO-Latin. There are other internal encodings for other language groups such as Russian, Japanese and Arabic.
• If the input is video, the likely digital standards will be either MPEG or AVI. Audio standards are typically WAV or Real Media. Images vary from JPEG to BMP.
• The next process is to parse the items into logical sub-divisions that have meaning to the user. This process is called zoning.
• Once standardization and zoning have been completed, the information used in the search process needs to be identified. Identification of processing tokens means determining what constitutes a word.
• Systems determine words by dividing input symbols into three classes: word symbols, inter-word symbols and special symbols.
• A word is defined as a contiguous set of word symbols bounded by inter-word symbols. In many searches, inter-word symbols are non-searchable.
• Examples of word symbols are alphabetic characters and numbers; examples of inter-word symbols are blanks, periods and semicolons.
• Finally, there are some symbols that may require special processing. For example, a hyphen (-) can change the meaning: “small-business man” is interpreted as a man running a small business, whereas “small business man” can be interpreted as a small man who runs a business.
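The three symbol classes can be sketched as a simple character scanner. This is an illustrative tokenizer, not the algorithm of any particular system: it assumes that letters and digits are word symbols, that the hyphen is the only special symbol, and that a hyphen is kept only when it sits between two word symbols. Everything else is treated as an inter-word symbol.

```python
def tokenize(text, special_symbols="-"):
    """Split text into processing tokens using three symbol classes."""
    tokens = []
    current = []
    for i, ch in enumerate(text):
        if ch.isalnum():
            # Word symbol: extend the current token.
            current.append(ch)
        elif (ch in special_symbols and current
              and i + 1 < len(text) and text[i + 1].isalnum()):
            # Special symbol between two word symbols: keep it in the token.
            current.append(ch)
        else:
            # Inter-word symbol: close the current token, if any.
            if current:
                tokens.append("".join(current))
                current = []
    if current:
        tokens.append("".join(current))
    return tokens


print(tokenize("The small-business man; a small business man."))
# ['The', 'small-business', 'man', 'a', 'small', 'business', 'man']
```

Keeping “small-business” as one token preserves the distinction described above; a system could equally decide to split on hyphens and lose it.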
• Next, a stop list algorithm is applied to the list of processing tokens. The objective of the stop list algorithm is to save system resources by eliminating from the set of searchable processing tokens those that have little value.
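A stop list can be sketched as a set lookup applied to each processing token. The list below is a tiny illustrative sample chosen for this example, not a standard stop list, and the case-insensitive comparison is an assumption.

```python
# Illustrative stop list (assumption): a real system would use a longer one.
STOP_LIST = frozenset({"a", "an", "and", "in", "of", "the", "to"})


def remove_stop_words(tokens, stop_list=STOP_LIST):
    """Drop tokens that appear in the stop list (compared case-insensitively)."""
    return [t for t in tokens if t.lower() not in stop_list]


print(remove_stop_words(["The", "retrieval", "of", "information"]))
# ['retrieval', 'information']
```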
• The next step in finalizing the processing tokens is identification of any specific word characteristics. The characteristic is used in systems to assist in disambiguation of a particular word. Morphological analysis of the processing token’s part of speech is included here. Thus, for the word “plane”, the system understands that it could mean “level or flat” as an adjective, “aircraft or facet” as a noun, or “the act of smoothing or evening” as a verb.
• Once the potential processing token has been identified and characterized, most systems apply stemming algorithms to normalize the token to a standard semantic representation. The system must either keep singular, plural and past-tense forms as separate searchable tokens, or keep only the stem of the word with its endings removed.
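To make the idea of removing word endings concrete, here is a deliberately naive suffix stripper. It is not the Porter stemmer or any production algorithm; the suffix list, the minimum stem length and the function name are all assumptions chosen for illustration, and it will mis-stem many English words.

```python
# Suffixes are checked in this order; the list is illustrative only.
SUFFIXES = ("ing", "ed", "es", "s")


def naive_stem(token, min_stem_length=3):
    """Strip one common ending so related word forms share a token."""
    word = token.lower()
    for suffix in SUFFIXES:
        # Only strip when enough of the word remains to be meaningful.
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_length:
            return word[: -len(suffix)]
    return word


print([naive_stem(w) for w in ["search", "searches", "searched", "searching"]])
# ['search', 'search', 'search', 'search']
```

With stemming, a query for any of the four forms matches items containing the others, which tends to raise recall at some cost in precision.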
• Once the processing tokens have been finalized, based upon the stemming algorithm, they are used as updates to the searchable data structure.
• This structure contains the semantic concepts that represent the items in the database and limits what a user can find as a result of a search.
Selective Dissemination of Information
• The Selective Dissemination of Information process provides the capability to dynamically compare newly received items against the standing statements of interest of users, and to deliver each item to those users whose statement of interest matches its contents.
• The Mail process is composed of the search process, user statements of interest (profiles) and user mail files.
• As each item is received, it is processed against every user’s profile.
• A profile contains a typically broad search statement along with a list of user mail files that will receive the document if the search statement in the profile is satisfied.
• When the search statement is satisfied, the item is placed in the mail files associated with the profile.
• Items in the mail files are typically viewed in time-of-receipt order and automatically deleted after a specified time period.
• The dynamic asynchronous updating of mail files makes it difficult to present the results of dissemination in estimated order of likelihood of relevance to the user.
• The general assumption has been that the only knowledge available in deciding whether an incoming item is of interest is the user’s profile and the incoming item.
• Selective Dissemination of Information has not yet been applied to multimedia sources. In some cases where audio is transformed into text, existing textual algorithms have been applied to the transcribed text.
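The dissemination flow (each incoming item is checked against every profile, and matches are appended to the profile’s mail files) can be sketched as follows. The `Profile` class, the mail file names and the matching rule are hypothetical: here a profile is satisfied when all of its terms occur in the item, which is a much simpler search statement than a real system would support.

```python
from collections import defaultdict


class Profile:
    """A standing statement of interest plus the mail files it feeds."""

    def __init__(self, required_terms, mail_files):
        self.required_terms = {t.lower() for t in required_terms}
        self.mail_files = list(mail_files)

    def matches(self, item_text):
        # Assumption: the search statement is "all terms must appear".
        words = set(item_text.lower().split())
        return self.required_terms <= words


def disseminate(item_id, item_text, profiles, mail_files):
    """Process one incoming item against every profile."""
    for profile in profiles:
        if profile.matches(item_text):
            for name in profile.mail_files:
                mail_files[name].append(item_id)  # time-of-receipt order


mail_files = defaultdict(list)
profiles = [
    Profile(["retrieval"], ["alice_inbox"]),
    Profile(["database", "index"], ["bob_inbox"]),
]
disseminate("item-1", "Index structures for database retrieval", profiles, mail_files)
disseminate("item-2", "Video retrieval servers", profiles, mail_files)
print(dict(mail_files))
# {'alice_inbox': ['item-1', 'item-2'], 'bob_inbox': ['item-1']}
```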
Document Database Search
• The Document Database Search process provides the capability for a query to search against all items received by the system.
• The Document Database Search is composed of the search process, user-entered queries, and the document database, which contains all items that have been received, processed and stored by the system.
• It is the retrospective search source for the system. Any search for information that has already been processed into the system can be considered a retrospective search.
• The searches span greater time periods.
• Each query is processed against the total document database.
• The document database can be very large: hundreds of millions of items or more.
• Items in the document database do not change once received. The value of much information decreases quickly over time.
• The database is therefore partitioned by time, and searching is done using the time-based partitions.
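Time-based partitioning can be sketched as a dictionary of partitions keyed by period, so that a query touches only the partitions inside the requested time range. The class name, the choice of one partition per year and the whole-word matching are assumptions made for this illustration.

```python
from collections import defaultdict


class PartitionedDatabase:
    """Document database partitioned by year of receipt (assumed granularity)."""

    def __init__(self):
        self.partitions = defaultdict(list)  # year -> [(item_id, text)]

    def add(self, year, item_id, text):
        self.partitions[year].append((item_id, text))

    def search(self, term, first_year, last_year):
        """Search only the partitions that fall inside the time range."""
        term = term.lower()
        hits = []
        for year in sorted(self.partitions):
            if first_year <= year <= last_year:
                for item_id, text in self.partitions[year]:
                    if term in text.lower().split():
                        hits.append(item_id)
        return hits


db = PartitionedDatabase()
db.add(2019, "a", "digital libraries")
db.add(2021, "b", "digital video search")
db.add(2023, "c", "digital warehouses")
recent = db.search("digital", 2021, 2023)
print(recent)  # ['b', 'c']
```

Older partitions are skipped entirely for recent queries, which is where the saving comes from when the value of old items has decayed.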
Index Database Search
• The Index Database Search process provides the capability to create indexes and search them.
• The user may search the index and retrieve the index and/or the document it references.
• The system also provides the capability to search the index and then search the items referenced by the index records that satisfied the index portion of the query. This is called combined file search.
• There are two classes of index files: public and private.
• Every user can have one or more private index files, leading to a very large number of index files. Each private index file references only a small subset of the total items in the database.
• Public index files are maintained by professional library services personnel and index every item in the document database. These files have access lists that allow anyone to search or retrieve data.
• To assist users in generating indexes, the system provides a process called Automatic File Build.
• This capability processes selected incoming documents and automatically determines potential indexing for the item.
• The rules that govern which documents are processed for extraction of index information, and the index term extraction process itself, are stored in Automatic File Build profiles. When an item is processed, it results in the creation of candidate index records.
• Placement in an index helps normalize searching and assists the user in finding items.
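A combined file search can be sketched in two stages: first the index is searched, then the text of only those documents referenced by the matching index records is searched. The index terms, document ids and texts below are invented for illustration, and the data structures are plain dictionaries rather than anything a real system would use.

```python
# Hypothetical document database: id -> text.
documents = {
    "d1": "precision and recall of a search system",
    "d2": "stemming algorithms for search",
    "d3": "recall of transcribed audio",
}

# Hypothetical index file: assigned index term -> ids of referenced documents.
index = {
    "evaluation": {"d1", "d3"},
    "normalization": {"d2"},
}


def index_search(index_term):
    """Return the ids of documents referenced by the matching index records."""
    return set(index.get(index_term, set()))


def combined_file_search(index_term, text_term):
    """Search the index first, then the text of the referenced documents."""
    return {
        doc_id
        for doc_id in index_search(index_term)
        if text_term in documents[doc_id].split()
    }


print(combined_file_search("evaluation", "search"))  # {'d1'}
```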
Multimedia Database Search
• Multimedia search is implemented against different modalities of information.
• From a system perspective, the multimedia data is not logically its own data structure, but an augmentation to the existing file structures in the information retrieval system.
• The specialized indexes that allow search of the multimedia will be augmented search structures. The original source will be kept as the normalized digital source for access, possibly on its own specialized retrieval servers (e.g., Real Media server, ORACLE video server).
• The correlation between the multimedia and textual domains will be via either time or positional synchronization.
• Time synchronization applies, for example, to text transcribed from audio or composite video sources.
• Positional synchronization is where the multimedia is localized by a hyperlink in a textual item.
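Time synchronization can be illustrated by storing each transcribed word together with its start time in the audio, so that a text hit maps back to a playback position. The transcript data and the function name are hypothetical; a real transcript would come from a speech-to-text process.

```python
# Hypothetical transcript: (start time in seconds, transcribed word).
transcript = [
    (0.0, "welcome"), (0.6, "to"), (0.9, "information"),
    (1.7, "retrieval"), (12.4, "retrieval"), (13.1, "systems"),
]


def find_offsets(transcript, term):
    """Return the audio offsets (seconds) at which the term was spoken."""
    term = term.lower()
    return [start for start, word in transcript if word.lower() == term]


print(find_offsets(transcript, "retrieval"))  # [1.7, 12.4]
```

The returned offsets could be handed to a media server to start playback at the point where the search term occurs.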
Relationship to Database Management Systems
DBMS | IRS
-----|-----
A DBMS is characterized by a structured data format. | An IRS is characterized by structured as well as unstructured formats.
The user must follow query formats by learning query languages. | The user provides the query using natural language.
Data is stored on and retrieved from one centralized server. | Information is retrieved from one or more servers.
A DBMS offers advanced data modelling facilities, including DDL and DML for modelling and manipulating data. | An IRS does not offer advanced data modelling facilities and is restricted to classification of objects.
DBMSs provide precise semantics. | An IRS most of the time provides imprecise semantics.
Query specification is complete. | Query specification is incomplete.
A DBMS has the capability to define integrity constraints. | In an IRS, such validation mechanisms are not available.
• The integration of DBMS and IRS is very important. Commercial database companies have already integrated these two types of systems.
• One of the first commercial databases to integrate the two systems into a single view is the INQUIRE DBMS, which has been available for over fifteen years.
• The ORACLE DBMS now offers an embedded capability called CONVECTIS, which is an information retrieval system.
• The INFORMIX DBMS has the ability to link to RetrievalWare to provide integration of structured data and information along with functions associated with Information Retrieval Systems.
Digital Libraries and Data Warehouses
• Two other systems in the context of information retrieval are Digital Libraries and Data Warehouses.
• All three systems are repositories of information, and their primary goal is to satisfy user needs.
• Libraries serve as the repositories of the intellectual wealth of society. As such, libraries have always been concerned with storing and retrieving information in the media it is created on.
• As quantities of information grew exponentially, libraries were forced to make use of electronic tools to facilitate the storage and retrieval process. During this process, the terminology evolved from electronic libraries to digital libraries, in which information is stored in digital form.
• Since the collection is digital, the library no longer needs to own a copy of the information as long as it provides access. Indexing and cataloguing are important standards in library science.
• Information storage and retrieval technology has addressed only a small subset of the issues associated with digital libraries.
• The conversion of existing hardcopy text, images and analog data, and the storage and retrieval of the digital version, is a major concern for digital libraries.
• The term Data Warehouse comes more from the commercial sector than from academic sources.
• A data warehouse consists of the data; an information directory that describes the content and meaning of the data being stored; an input function that captures data and moves it to the data warehouse; data search and manipulation tools that give users the means to access and analyse the warehouse data; and a delivery mechanism to export data to other warehouses.
• Data warehouses are similar to information storage and retrieval systems in that both need search and retrieval of information.
• But a data warehouse is more focused on structured data and decision support technologies.
• Pattern recognition and artificial intelligence algorithms are applied to uncover hidden relationships in the data.
• This differs from clustering in information retrieval, in that clustering is based upon known characteristics of items.
Thank you!!
