IRSUnit-1

Information Retrieval Systems (IRS) focus on efficiently extracting relevant information from various data types, including text and multimedia. The system's success is measured by its ability to minimize user overhead in locating needed information, utilizing precision and recall metrics. IRS encompasses processes like item normalization, selective dissemination, and document/database searches, integrating with database management systems and addressing challenges in user query generation.

Uploaded by

sudharani.am

Introduction to Information Retrieval Systems

• Information Retrieval Systems (IRS) is the formal study of efficient and effective ways to extract the right piece of information from a collection.

• An Information Retrieval System is a system capable of storage, retrieval and maintenance of information.

• Information in this context can be composed of text, images, audio, video and
other multimedia objects.

• The term “item” plays an important role: it represents the smallest complete unit that is processed and manipulated by the system.

• The definition of an item varies by how a specific source treats information. It may be a complete document, such as a book, newspaper or magazine.
• An Information Retrieval System consists of a software program that helps a user find the information the user needs.
• It may also use standard computer hardware or specialized hardware to support the search subfunction and to convert non-textual sources to a searchable medium.
• The gauge of success of an information retrieval system is how well it can minimize the overhead for a user to find needed information.
• Based upon the information needed by the user, two types of retrieval are considered: comprehensive retrieval and reasonable retrieval.
• A comprehensive retrieval returns more information, which can be a negative feature because it overloads the user.
• In contrast, a reasonable retrieval returns only a limited amount of information, enough to satisfy the user.
Objectives of Information Retrieval Systems
• The general objective of Information Retrieval Systems is to minimize the overhead of a user locating needed information.
• Overhead can be expressed as the time a user spends in all of the steps leading to reading an item containing the needed information (e.g., query generation, query execution, scanning the results of the query to select items to read, and reading non-relevant items).
• The success of an information system is very subjective, based upon what information is needed and the willingness of the user to accept the overhead.
• Needed information can be defined as all information in the system that relates to a user’s need.
• In Information Retrieval, the term relevant is used to represent an item containing needed information.
• From the user’s perspective, relevant and needed are synonyms. From the system’s perspective, information could be relevant to a search statement even though it is not needed by (relevant to) the user.
• The two major measures commonly associated with information systems are precision and recall.
• When a user decides to issue a search looking for information on a topic, the total database is logically divided into four segments: Relevant Retrieved, Relevant Not Retrieved, Non-Relevant Retrieved and Non-Relevant Not Retrieved.

• Relevant items are those documents that contain information that helps the searcher answer their question. Non-relevant items are those that do not provide any useful information.
• Precision and Recall are defined as

Precision = Number_retrieved_relevant / Number_total_retrieved

Recall = Number_retrieved_relevant / Number_possible_relevant

where Number_possible_relevant is the number of relevant items in the database, Number_total_retrieved is the total number of items retrieved by the query, and Number_retrieved_relevant is the number of retrieved items that are relevant to the user’s need.

• Precision measures one aspect of the information retrieval overhead for a user associated with a particular search. If a search has 85 percent precision, then 15 percent of the user’s effort is overhead spent reviewing non-relevant items.
• Recall estimates how well a system processing a particular query is able to retrieve the relevant items that the user is interested in seeing.
Fig: Ideal Precision and Recall graph
• The basic properties of precision and recall can be observed in the graph. Precision starts at 100 percent and keeps that value as long as only relevant items are retrieved, then drops towards zero once non-relevant items are retrieved. Recall starts close to zero and increases as relevant items are retrieved, until all possible relevant items have been retrieved.
• Assume that there are 100 relevant items in the database and that, from the graph, a precision of 0.3 has an associated recall of 0.5. The recall value means there are 50 relevant items in the hit file. A precision of 30 percent means the user would likely review 167 items (50 / 0.3) to find those 50 relevant items.
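The arithmetic in the example above can be checked with a short sketch. The function names are illustrative only (they are not from the slides), and rounding up to a whole number of reviewed items is an assumption made here to match the figure of 167.

```python
import math

def precision(retrieved_relevant: int, total_retrieved: int) -> float:
    # Fraction of the retrieved items that are relevant.
    return retrieved_relevant / total_retrieved if total_retrieved else 0.0

def recall(retrieved_relevant: int, possible_relevant: int) -> float:
    # Fraction of all relevant items in the database that were retrieved.
    return retrieved_relevant / possible_relevant if possible_relevant else 0.0

def items_to_review(possible_relevant: int, recall_value: float, precision_value: float):
    # Relevant items in the hit file follow from recall;
    # the number of items the user must review follows from precision.
    relevant_in_hit_file = round(possible_relevant * recall_value)
    total_reviewed = math.ceil(relevant_in_hit_file / precision_value)
    return relevant_in_hit_file, total_reviewed

relevant_found, reviewed = items_to_review(100, 0.5, 0.3)
print(relevant_found, reviewed)
```

With 100 relevant items, recall 0.5 and precision 0.3, the sketch reproduces the 50 relevant items and roughly 167 reviewed items quoted above.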
• The first objective of Information Retrieval Systems is support of user search generation. There are natural obstacles to specifying the information a user needs, which come from ambiguities inherent in languages. Homographs and acronyms allow the same word to have multiple meanings.
• Many users have trouble generating a good search statement. Not all users know how to write a search statement using Boolean logic, as implemented in database management systems. It is only with the introduction of Information Retrieval Systems like RetrievalWare, TOPIC, AltaVista, Infoseek and INQUERY that accepting natural language queries is becoming a standard system feature.
• Generating a natural language query has its own complexities. One is that the user may not be an expert in the area being searched and may lack the domain-specific vocabulary of that subject area, which leads to an inaccurate query with misleading search terms.
• Information Retrieval Systems must provide tools to help overcome these search specification problems. The search tools must assist the user, automatically and through system interaction, in developing a search specification that represents the user’s need.
• Information Retrieval Systems provide functions that present results in order of potential relevance to the user.
• Multimedia information retrieval adds a significant layer of complexity in how to display multi-modal results.
Functional Overview of IRS

• A total Information Storage and Retrieval System is composed of four major functional processes: Item Normalization, Selective Dissemination of Information, Document Database Search and Index Database Search, along with the Automatic File Build process that supports Index Files.
• These four processes are integrated and implemented in public or general systems.
• Commercial systems have not integrated these capabilities but supply them as independent capabilities.
• The following figure shows the logical view of an Information Retrieval System. Boxes are used in the diagram to represent functions, while disks are used to represent data storage.
Item Normalization:
• The first step in any integrated system is to normalize the incoming items to a standard format.
• Item normalization provides logical restructuring of the item.
• Additional operations during item normalization are needed to create a searchable data structure.
• These operations are identification of processing tokens, application of stop list algorithms, characterization of tokens and stemming (e.g., removing word endings).
• Standardizing the input takes the different external formats of input data and translates them to the formats acceptable to the system.
• One example of standardization is the translation of foreign languages into Unicode. One standard encoding that covers English, French, Spanish, etc. is ISO-Latin. There are other internal encodings for other language groups such as Russian, Japanese and Arabic.
• If the input is video, the likely digital standards are MPEG or AVI. Audio standards are typically WAV or Real Media. Image formats vary from JPEG to BMP.
• The next process is to parse the items into logical sub-divisions that have meaning to the user. This process is called zoning.
• Once standardization and zoning have been completed, the information that is used in the search process needs to be identified. Identification of processing tokens means determining what counts as a word.
• Systems determine words by dividing input symbols into three classes: word symbols, inter-word symbols and special symbols.
• A word is defined as a contiguous set of word symbols bounded by inter-word symbols. In many systems, inter-word symbols are non-searchable.
• Examples of word symbols are alphabetic characters and numbers. Examples of inter-word symbols are blanks, periods and semicolons.
• Finally, some symbols may require special processing. For example, a hyphen (-) can be used in many ways: “small-business man” is interpreted as a man who runs a small business, whereas “small business man” may be interpreted as a small man who runs a business.
• Next, a stop list algorithm is applied to the list of processing tokens. The objective of the stop list algorithm is to save system resources by eliminating from the set of searchable processing tokens those that have little value.
• The next step in finalizing the processing tokens is identification of any specific word characteristics. The characteristic is used by systems to assist in disambiguating a particular word. Morphological analysis of the processing token’s part of speech is included here. Thus, for the word “plane”, the system understands that it could mean “level or flat” as an adjective, “aircraft or facet” as a noun, or “the act of smoothing or evening” as a verb.
• Once the potential processing token has been identified and characterized, most systems apply stemming algorithms to normalize the token to a standard semantic representation. Without stemming, the system must keep singular, plural and past-tense forms as separate searchable tokens; with stemming, it keeps only the stem of the word.
• Once the processing tokens have been finalized, based upon the stemming algorithm, they are used as updates to the searchable data structure.
• This structure contains the semantic concepts that represent the items in the database and limits what a user can find as a result of their search.
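The token identification, stop list and stemming steps described above can be sketched as a small pipeline. This is a toy illustration under stated assumptions: the stop list and the suffix rules are hypothetical and far smaller than what a real system would use, and the hyphen is simply treated as an inter-word symbol rather than given the special processing discussed above.

```python
import re

# Hypothetical stop list; real systems use much longer lists.
STOP_WORDS = {"a", "an", "the", "of", "and", "is", "are", "in", "to"}

# Hypothetical suffix rules, checked in order; a toy stand-in for a real stemmer.
SUFFIXES = ("ing", "ed", "s")

def tokenize(text: str) -> list:
    # Word symbols are letters and digits; every other character
    # (blanks, periods, semicolons, hyphens) acts as an inter-word symbol.
    return [token.lower() for token in re.findall(r"[A-Za-z0-9]+", text)]

def stem(token: str) -> str:
    # Strip the first matching suffix, keeping at least three characters.
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def normalize(text: str) -> list:
    # Tokenize, drop stop words, then stem what remains.
    return [stem(token) for token in tokenize(text) if token not in STOP_WORDS]

print(normalize("The planes are landing in the fields."))
```

The resulting stems are what would be written to the searchable data structure, so a query for "plane" and a query for "planes" would match the same entry.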
Selective Dissemination of Information
• The Selective Dissemination of Information process provides the capability to dynamically compare newly received items against users’ standing statements of interest and deliver each item to those users whose statement of interest matches its contents.

• This process, also called the Mail process, is composed of the search process, user statements of interest (profiles) and user mail files.

• As each item is received, it is processed against every user’s profile.

• A profile contains a typically broad search statement along with a list of user mail files that will receive the document if the search statement in the profile is satisfied.
• When the search statement is satisfied, the item is placed in the mail files associated with the profile.
• Items in the mail files are typically viewed in order of time of receipt and automatically deleted after a specified time period.
• The dynamic, asynchronous updating of mail files makes it difficult to present the results of dissemination in estimated order of likelihood of relevance to the user.
• The general assumption has been that the only knowledge available for deciding whether an incoming item is of interest is the user’s profile and the incoming item itself.
• Selective Dissemination of Information has not yet been applied to multimedia sources. In some cases where audio is transformed into text, existing textual algorithms have been applied to the transcribed text.
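The dissemination flow described above (each incoming item is checked against every profile, and matching items are placed in that profile's mail files) can be sketched as follows. The `Profile` class, the mail file names and the "any term matches" rule are assumptions made for illustration; a real profile would hold a much richer search statement.

```python
from dataclasses import dataclass

@dataclass
class Profile:
    user: str
    terms: set        # standing statement of interest, simplified to a set of terms
    mail_files: list  # names of the mail files that receive matching items

def disseminate(item_id: str, item_text: str, profiles: list, mail_files: dict) -> None:
    # Process one newly received item against every user's profile.
    words = set(item_text.lower().split())
    for profile in profiles:
        if profile.terms & words:  # the profile's search statement is satisfied
            for name in profile.mail_files:
                mail_files.setdefault(name, []).append(item_id)

profiles = [
    Profile("alice", {"retrieval", "indexing"}, ["alice-inbox"]),
    Profile("bob", {"warehouse"}, ["bob-inbox", "team-inbox"]),
]
mail_files = {}
disseminate("item-1", "Advances in information retrieval", profiles, mail_files)
disseminate("item-2", "Data warehouse design notes", profiles, mail_files)
print(mail_files)
```

Because items are appended as they arrive, each mail file ends up in time-of-receipt order, which matches how the slides describe mail files being viewed.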
Document Database Search
• The Document Database Search process provides the capability for a query to search against all items received by the system.
• The Document Database Search is composed of the search process, user-entered queries and the document database, which contains all items that have been received, processed and stored by the system.
• It is the retrospective search source for the system. Any search for information that has already been processed into the system can be considered a retrospective search.
• These searches span greater time periods than the dissemination process, which only sees newly received items.
• Each query is processed against the total document database.
• The document database can be very large: hundreds of millions of items or more.
• Items in the document database do not change once received. The value of much information decreases quickly over time.
• The database is therefore partitioned by time, and searching uses this time-based partitioning.
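One way to picture time-based partitioning is sketched below. The class name, the monthly partition granularity and the exact-word matching rule are all assumptions for illustration; the slides do not specify how partitions are chosen or searched.

```python
from collections import defaultdict
from datetime import date

class PartitionedDatabase:
    """Toy document database partitioned by (year, month) of receipt."""

    def __init__(self):
        self._partitions = defaultdict(list)  # (year, month) -> [(item_id, text)]

    def add(self, item_id: str, received: date, text: str) -> None:
        # Items are stored once and never changed afterwards.
        self._partitions[(received.year, received.month)].append((item_id, text))

    def search(self, term: str, start: date, end: date) -> list:
        # Only partitions overlapping the requested period are scanned.
        # Filtering is at partition (month) level, not by exact day.
        term = term.lower()
        first, last = (start.year, start.month), (end.year, end.month)
        hits = []
        for key in sorted(self._partitions):
            if first <= key <= last:
                hits.extend(
                    item_id
                    for item_id, text in self._partitions[key]
                    if term in text.lower().split()
                )
        return hits

db = PartitionedDatabase()
db.add("d1", date(2023, 1, 15), "retrieval systems overview")
db.add("d2", date(2023, 6, 2), "index retrieval")
db.add("d3", date(2024, 3, 9), "retrieval of multimedia")
print(db.search("retrieval", date(2023, 5, 1), date(2024, 12, 31)))
```

Restricting a query to recent partitions avoids scanning older items whose value has already declined, while a full retrospective search simply widens the date range.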
Index Database Search
• The Index Database Search process provides the capability to create indexes and search them.
• The user may search the index and retrieve the index record and/or the document it references.
• The system also provides the capability to search the index and then search the items referenced by the index records that satisfied the index portion of the query. This is called a combined file search.
• There are two classes of index files: public and private index files.
• Every user can have one or more private index files, leading to a very large number of index files. Each private index file references only a small subset of the total items in the database.
• Public index files are maintained by professional library services personnel and index every item in the document database. These files have access lists that allow anyone to search or retrieve data.
• To assist users in generating indexes, the system provides a process called Automatic File Build.
• This capability processes selected incoming documents and automatically determines potential indexing for each item.
• The rules that govern which documents are processed for extraction of index information, and the index term extraction process itself, are stored in Automatic File Build profiles. When an item is processed, the result is the creation of candidate index records.
• Placement in an index normalizes the search and so helps the user find items.
Multimedia Database Search
• Multimedia search is implemented against different modalities of information.
• From a system perspective, the multimedia data is not logically its own data structure, but an augmentation to the existing structures in the information retrieval system.
• The specialized indexes that allow multimedia to be searched will be augmented search structures. The original source will be kept in a normalized digital form for access, possibly on its own specialized retrieval servers (e.g., Real Media server, ORACLE video server).
• The correlation between the multimedia and textual domains will be via either time or positional synchronization.
• An example of time synchronization is text transcribed from audio or composite video sources.
• Positional synchronization is where the multimedia is localized by a hyperlink in a textual item.
Relationship to Database Management Systems
DBMS | IRS
---- | ---
DBMS is characterized by a structured data format. | IRS is characterized by structured as well as unstructured formats.
The user must follow query formats by learning query languages. | The user provides the query in natural language.
Data is stored in and retrieved from one centralized server. | Information is retrieved from one or more servers.
DBMS offers advanced data modelling facilities, including DDL and DML, for modelling and manipulating data. | IRS does not offer advanced data modelling facilities and is restricted to classification of objects.
DBMS provides precise semantics. | IRS mostly provides imprecise semantics.
Query specification is complete. | Query specification is incomplete.
DBMS has the capability to define integrity constraints. | IRS has no such validation mechanisms.
• The integration of DBMS and IRS is very important. Commercial database companies have already integrated these two types of systems.

• One of the first commercial databases to integrate the two systems into a single view is the INQUIRE DBMS, which has been available for over fifteen years.

• The ORACLE DBMS now offers an embedded capability called CONVECTIS, which is an information retrieval system.

• The INFORMIX DBMS has the ability to link to RetrievalWare to provide integration of structured data and information, along with functions associated with Information Retrieval Systems.
Digital Libraries and Data Warehouses
• Two other systems in the context of information retrieval are Digital Libraries and Data Warehouses.
• All three systems are repositories of information, and their primary goal is to satisfy user needs.
• Libraries serve as the repositories of the intellectual wealth of society. As such, libraries have always been concerned with storing and retrieving information in the media it is created on.
• As quantities of information grew exponentially, libraries were forced to use electronic tools to facilitate the storage and retrieval process. During this process, the terminology evolved from electronic libraries to digital libraries, in which information is stored in digital form.
• Since the collection is digital, the library no longer needs to own a copy of the information as long as it provides access. Indexing and cataloguing are important standards in library science.
• Information storage and retrieval technology has addressed only a small subset of the issues associated with digital libraries.
• The conversion of existing hardcopy text, images and analog data, and the storage and retrieval of the digital version, is a major concern for digital libraries.
• The term Data Warehouse comes more from the commercial sector than from academic sources.
• A data warehouse consists of the data, an information directory that describes the content and meaning of the data being stored, an input function that captures data and moves it to the data warehouse, data search and manipulation tools that give users the means to access and analyse the warehouse data, and a delivery mechanism to export data to other warehouses.
• Data warehouses are similar to information storage and retrieval systems in that both need search and retrieval of information.
• But a data warehouse is more focused on structured data and decision support technologies.
• Pattern recognition and artificial intelligence algorithms are applied to discover hidden relationships in the data.
• This differs from clustering in information retrieval in that clustering is based upon known characteristics of items.
Thank you!!
