IRSUnit-1
• Information in this context can be composed of text, images, audio, video and
other multimedia objects.
• The term “item” plays an important role: it represents the smallest complete
unit that is processed and manipulated by the system.
Relevant items are those documents that contain information that helps the searcher
answer a question. Non-relevant items are those items that do not provide any
useful information.
Precision and Recall are defined as:
Precision = Number_retrieved_relevant / Number_total_retrieved
Recall = Number_retrieved_relevant / Number_possible_relevant
Precision measures one aspect of the information retrieval overhead for a user associated
with a particular search. If a search has 85 percent precision, then 15 percent of the user's
effort is overhead spent reviewing non-relevant items.
Recall estimates how well a system processing a particular query is able to retrieve the
relevant items that the user is interested in seeing.
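The two definitions above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original notes: the function name precision_recall and the item ids are assumptions, chosen so that the result matches the 85 percent precision example.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: ids of the items the system returned (the hit file)
    relevant:  ids of every relevant item in the database
    """
    retrieved = set(retrieved)
    relevant = set(relevant)
    retrieved_relevant = retrieved & relevant
    # Guard against empty sets so the sketch never divides by zero.
    precision = len(retrieved_relevant) / len(retrieved) if retrieved else 0.0
    recall = len(retrieved_relevant) / len(relevant) if relevant else 0.0
    return precision, recall


# Assumed query: 20 items retrieved, 17 of them relevant,
# out of 34 relevant items in the whole database.
retrieved = range(1, 21)                                 # items 1..20
relevant = list(range(1, 18)) + list(range(101, 118))    # 17 retrieved + 17 missed
p, r = precision_recall(retrieved, relevant)
print(p, r)
```

Here 3 of the 20 retrieved items are non-relevant, which is the 15 percent overhead described above, while half of the relevant items were never retrieved.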
Fig : Ideal Precision and Recall graph
The basic properties of precision and recall can be observed in the graph. Precision
starts off at 100 percent and maintains that value as long as relevant items are
retrieved, then drops toward zero as non-relevant items are retrieved. Recall starts
off close to zero and increases as long as relevant items are retrieved, until all
possible relevant items have been retrieved.
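The shape of the ideal graph can be reproduced with a small sketch. The ranked list below is assumed data (four relevant items followed by six non-relevant ones), and precision_recall_curve is a hypothetical helper name.

```python
def precision_recall_curve(ranked_relevance, total_relevant):
    """Precision and recall after each retrieved item.

    ranked_relevance: list of booleans, True if the item at that rank is relevant
    total_relevant:   number of relevant items in the whole database
    """
    points = []
    hits = 0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
        points.append((hits / rank, hits / total_relevant))
    return points


# Ideal ordering (assumed data): 4 relevant items first, then 6 non-relevant.
ideal = [True] * 4 + [False] * 6
curve = precision_recall_curve(ideal, total_relevant=4)
for rank, (p, r) in enumerate(curve, start=1):
    print(f"rank {rank:2d}  precision {p:.2f}  recall {r:.2f}")
```

Precision stays at 1.00 for the first four ranks while recall climbs to 1.00. After that recall cannot rise further, and precision falls with every non-relevant item retrieved; over a large database it approaches zero.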
Assume that there are 100 relevant items in the database. From the graph, at a precision
of 0.3 there is an associated recall of 0.5. The recall value means there are 50 relevant
items in the hit file. A precision of 30 percent means the user would likely
review 167 items to find those 50 relevant items.
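The figure of 167 follows from dividing the number of relevant items in the hit file by the precision. A short check, using only the numbers given in the text (the variable names are illustrative):

```python
import math

total_relevant = 100   # relevant items in the database
precision = 0.3
recall = 0.5

# Recall fixes how many relevant items are in the hit file.
relevant_in_hit_file = recall * total_relevant

# Precision = relevant_in_hit_file / items_reviewed, so solve for items_reviewed
# and round up to a whole number of items.
items_to_review = math.ceil(relevant_in_hit_file / precision)
print(relevant_in_hit_file, items_to_review)
```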
The first objective of Information Retrieval Systems is to support user search
generation. There are natural obstacles to specifying the information a user needs
that come from ambiguities inherent in languages. Homographs and acronyms allow
the same word to have multiple meanings.
Many users have trouble generating a good search statement. Not all users know how
to write a search statement using Boolean logic, which is implemented in
database management systems. It is only with the introduction of Information Retrieval
Systems like RetrievalWare, TOPIC, AltaVista, Infoseek and INQUERY that the idea of
accepting natural language queries is becoming a standard system feature.
Many complexities arise in generating a natural language query. One is that
the user may not be an expert in the area being searched and may lack the domain-specific
vocabulary unique to that subject area, which leads to an inaccurately specified query
with misleading search terms.
Information Retrieval Systems must provide tools to help overcome the search
specification problems. The search tools must assist the user automatically and
through system interaction in developing a search specification that represents the
need of the user.
Information Retrieval Systems provide functions that present
results in order of potential relevance to the user.
Multimedia information retrieval adds a significant layer of complexity on how to display
multi-modal results.
Functional Overview of IRS
The Mail process is composed of the search process, user statements of interest
(profiles) and user mail files.
A profile contains a typically broad search statement along with a list of user mail
files that will receive the document if the search statement in the profile is
satisfied.
When the search statement is satisfied, the item is placed in the mail files
associated with the profile.
Items in the mail files are typically viewed in order of time of receipt and
automatically deleted after a specified time period.
The dynamic asynchronous updating of mail files makes it difficult to present
the results of dissemination in estimated order of likelihood of relevance to
the user.
The general assumption has been that the only knowledge available in making
decisions on whether an incoming item is of interest is the user’s profile and
the incoming item.
Selective Dissemination of Information has not yet been applied to
multimedia sources. In some cases where the audio is transformed into text,
existing textual algorithms have been applied to the transcribed text.
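The Mail process described above can be sketched as follows. This is a simplified illustration under stated assumptions: the Profile and MailSystem classes are hypothetical, and a profile's search statement is reduced to "the item contains any of these words".

```python
from dataclasses import dataclass, field


@dataclass
class Profile:
    """A user's standing statement of interest (hypothetical structure)."""
    owner: str
    terms: set          # broad search statement, simplified to a set of words
    mail_files: list    # names of the mail files that receive matching items


@dataclass
class MailSystem:
    profiles: list
    mail_files: dict = field(default_factory=dict)   # mail file name -> items

    def disseminate(self, item_text):
        """Compare one incoming item against every profile."""
        words = set(item_text.lower().split())
        delivered = []
        for profile in self.profiles:
            if profile.terms & words:                # search statement satisfied
                for name in profile.mail_files:
                    # Items are appended as they arrive, i.e. in time-of-receipt order.
                    self.mail_files.setdefault(name, []).append(item_text)
                    delivered.append(name)
        return delivered


system = MailSystem(profiles=[
    Profile("alice", {"retrieval", "indexing"}, ["alice-ir"]),
    Profile("bob", {"video"}, ["bob-media"]),
])
delivered = system.disseminate("New indexing method for text retrieval")
print(delivered)
```

Note that the decision uses only the profile and the incoming item, which is the general assumption stated above. Automatic deletion after a time period is left out of the sketch.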
Document Database Search
The Document Database Search Process provides the capability for a query to
search against all items received by the system.
The Document Database search is composed of the search process, user-entered
queries and the document database, which contains all items that have been
received, processed and stored by the system.
It is the retrospective search source for the system. Any search for
information that has already been processed into the system can be
considered a retrospective search for information.
These searches typically span far greater time periods.
Each query is processed against the total document database.
The Document Database can be very large, hundreds of millions of items or
more.
Items in the document database do not change once received. The value of
much information quickly decreases over time.
The database is partitioned based on time, and searching is done using
time-based partitioning.
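A minimal sketch of time-based partitioning is shown below. The monthly granularity, the PartitionedDatabase class and the substring match are assumptions made for illustration; a real system would use its own partition scheme and search engine.

```python
from collections import defaultdict
from datetime import date


class PartitionedDatabase:
    """Document store partitioned by (year, month) of receipt (assumed granularity)."""

    def __init__(self):
        self.partitions = defaultdict(list)   # (year, month) -> [(received, text)]

    def add(self, received, text):
        self.partitions[(received.year, received.month)].append((received, text))

    def search(self, term, start, end):
        """Scan only the partitions that overlap the period [start, end]."""
        first = (start.year, start.month)
        last = (end.year, end.month)
        hits = []
        for key in sorted(self.partitions):
            if first <= key <= last:           # skip partitions outside the period
                for received, text in self.partitions[key]:
                    if start <= received <= end and term.lower() in text.lower():
                        hits.append(text)
        return hits


db = PartitionedDatabase()
db.add(date(2023, 1, 15), "Boolean search basics")
db.add(date(2024, 6, 2), "Ranked search results")
db.add(date(2024, 6, 20), "Video indexing")
hits = db.search("search", date(2024, 1, 1), date(2024, 12, 31))
print(hits)
```

Because items do not change once received, old partitions never need to be rewritten, and a query restricted to a time period can ignore every partition outside it.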
Index Database Search
The index database search provides the capability to create indexes and
search them.
The user may search the index and retrieve the index and/or the document it
references.
The system also provides the capability to search the index and then search the
items referenced by the index records that satisfied the index portion of the
query. This is called combined file search.
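Combined file search can be illustrated with a small sketch. The documents, the index records and the function name combined_file_search are all hypothetical; the point is only the two-stage filter, first on the index and then on the referenced items.

```python
# Hypothetical data: a document store and index records that reference documents.
documents = {
    1: "Precision and recall of ranked retrieval",
    2: "Video server configuration notes",
    3: "Recall improvements from query expansion",
}
index_records = [
    {"doc_id": 1, "terms": {"evaluation", "metrics"}},
    {"doc_id": 2, "terms": {"multimedia"}},
    {"doc_id": 3, "terms": {"evaluation", "query"}},
]


def combined_file_search(index_term, text_term):
    """Return ids of documents whose index record has index_term
    and whose text contains text_term."""
    results = []
    for record in index_records:
        if index_term in record["terms"]:              # index portion of the query
            text = documents[record["doc_id"]]
            if text_term.lower() in text.lower():      # search the referenced item
                results.append(record["doc_id"])
    return results


print(combined_file_search("evaluation", "precision"))
```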
There are two classes of index files: public and private index files.
Every user can have one or more private index files, leading to a very large
number of index files. Each private index file references only a small subset of
the total items in the database.
Public index files are maintained by professional library services personnel
and index every item in the document database. These files have access lists
that allow anyone to search or retrieve data.
To assist users in generating indexes, the system provides a process called
Automatic File Build.
This capability processes selected incoming documents and automatically
determines potential indexing for each item.
The rules that govern which documents are processed for extraction of index
information and the index term extraction process are stored in automatic
build profiles. When an item is processed, it results in creation of candidate
index records.
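The idea of an automatic build profile can be sketched as follows. The profile format here (one selection keyword plus a controlled vocabulary) is an assumption made for illustration and is not the format of any particular system.

```python
import re

# Hypothetical automatic build profile: which items to process
# and which terms may be extracted as index terms.
build_profile = {
    "select_if_contains": "retrieval",
    "vocabulary": {"precision", "recall", "indexing", "ranking"},
}


def automatic_file_build(doc_id, text, profile):
    """Return a candidate index record for the item, or None if the
    profile's selection rule does not apply to it."""
    lowered = text.lower()
    if profile["select_if_contains"] not in lowered:
        return None                                   # item is not selected
    words = set(re.findall(r"[a-z]+", lowered))
    terms = sorted(words & profile["vocabulary"])     # extracted index terms
    return {"doc_id": doc_id, "candidate_terms": terms}


record = automatic_file_build(
    7, "Retrieval systems trade precision for recall.", build_profile
)
print(record)
```

The output is only a candidate record; in this sketch it is left to the user to accept, edit or discard it before it enters an index file.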
The placement in an index facilitates normalizing the searching, assisting the
user in finding items.
Multimedia Database Search
Multimedia search is implemented against different modalities of information.
From a system perspective, the multimedia data is not logically its own data
structure, but an augmentation to the existing structures in the information
retrieval system.
The specialized indexes that allow search of the multimedia will be augmented
search structures. The original source will be kept as a normalized digital
source for access, possibly in its own specialized retrieval server (e.g., Real
media server, ORACLE video server).
The correlation between multimedia and textual domains will be either via
time or positional synchronization.
An example of time synchronization is text transcribed from audio or
composite video sources.
Positional synchronization is where the multimedia is localized by a hyperlink
in a textual item.
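Time synchronization can be sketched as a transcript whose segments carry the start time of the corresponding audio. The transcript data and the function find_time_offsets are assumptions for illustration.

```python
# Hypothetical transcript: (start_seconds, text) segments produced from an audio source.
transcript = [
    (0.0, "welcome to the lecture on information retrieval"),
    (12.5, "precision measures the overhead of reviewing items"),
    (31.0, "recall measures how many relevant items were found"),
]


def find_time_offsets(term, segments):
    """Return the start times of the segments whose text contains the term."""
    return [start for start, text in segments if term.lower() in text.lower()]


offsets = find_time_offsets("recall", transcript)
print(offsets)
```

A textual search on the transcript therefore yields positions in the original audio or video, which a media server could use to start playback at the matching point.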
Relationship to Database Management Systems
DBMS IRS
One of the first commercial databases to integrate the two systems into a
single view is the INQUIRE DBMS, which has been available for over fifteen years.