IRS UNIT-IV

Uploaded by

teamkiller334

UNIT -5

TEXT SEARCH ALGORITHMS AND MULTIMEDIA INFORMATION RETRIEVAL

5.1 TEXT SEARCH ALGORITHMS

5.1.1 Introduction

Text Streaming Architecture:

• The basis of a text scanning system is that one or more users enter queries.
• After a query is entered, the text to be searched is accessed & compared with the query terms.
• The query is complete when all of the text has been accessed.
• An advantage of this architecture is that as soon as an item satisfies a query, the result can be presented to the
user for retrieval.
• The architecture of a text streaming search system is shown in the figure below.

• The elements of the architecture include,


❖ Database
❖ Term Detector
❖ Query Resolver
❖ User Interface
• Database: It contains the full text of the items.
• Term Detector:
❖ It is special hardware or software.
❖ It contains all the terms being searched for.
❖ It may also contain the logic between the terms.
❖ The text is given as input to the term detector, which detects the presence of the search terms.
❖ The detected terms are output and sent to the Query Resolver.
❖ This allows the Query Resolver to perform the final logical processing of a query against an item.

• Query Resolver:
❖ The Query Resolver performs 2 functions:
1. It accepts search statements from the user, extracts the logic and the search terms, & passes the
search terms to the detectors.
2. It accepts the results from the detectors, determines which queries are satisfied by the item, &
identifies the weight associated with the hit.
❖ The Query Resolver passes this information to the user interface.
• User Interface:
❖ It continuously updates the user on the status of the search and, on request, retrieves any item that
satisfies the user's search statement.
❖ The entire process is focused on finding at least one or all occurrences of a pattern of text in a text
stream.
• In hardware search machines, multiple term detectors may work against the same data stream, which allows
more queries to be processed, or against different data streams, which reduces the time to access the complete
database.
• In software systems, multiple detectors may execute at the same time.
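
The flow described above (term detector, then query resolver, then user interface) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `detect_terms`, `resolve`, and `stream_search`, the whitespace tokenization, and the restriction to simple AND queries are all invented for the example and do not describe any particular system.

```python
from typing import Iterable, Iterator

def detect_terms(text: str, terms: set[str]) -> set[str]:
    """Term detector: report which search terms occur in the item text."""
    words = set(text.lower().split())
    return {term for term in terms if term in words}

def resolve(detected: set[str], required: set[str]) -> bool:
    """Query resolver: apply the query logic (assumed here to be a plain AND)."""
    return required <= detected

def stream_search(items: Iterable[tuple[str, str]], query: str) -> Iterator[str]:
    """Stream every item past the detector and yield each hit as soon as it is found."""
    required = set(query.lower().split())
    for item_id, text in items:
        if resolve(detect_terms(text, required), required):
            yield item_id

# Hypothetical three-item database of (item id, full text).
database = [
    ("doc1", "hardware text search machines"),
    ("doc2", "software text search algorithms"),
    ("doc3", "multimedia retrieval"),
]
hits = list(stream_search(database, "text search"))
print(hits)
```

Because `stream_search` is a generator, a caller can show each hit to the user before the rest of the database has been scanned, which mirrors the streaming behaviour described in the notes.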
• Approaches to Data Stream:
• There are 2 approaches to the data stream:
❖ In the 1st approach, the complete database is sent to the detectors, which function as a search of the
database.
❖ In the 2nd approach, items retrieved by random access are passed to the detectors.
• The idea of the 2nd approach is to perform an index search of the database first.
• The text streamer then performs the additional search logic that the index search cannot
satisfy.
• Limits of Index Search: examples of the limits of index search are,
❖ Searching for stop words.
❖ Searching for exact matches when stemming has been performed.
❖ Searching for terms that contain both leading & trailing “don’t cares”.
❖ Searching for symbols that are on the interword symbol list (e.g., “ , ;).

Disadvantage of search on Text Streaming:

• The major disadvantage of searching by streaming the text is that the search depends on the slowest module
in the computer, the I/O module.
• Indexes gain their speed in 2 ways:

1. By reducing the amount of data to be retrieved.

2. By providing the best ratio between the total number of items delivered to the user & the total number of
items retrieved in response to a query.

• Full text search does not need any additional storage overhead.
• Inversion systems, by contrast, can require a storage overhead of 50% to 300% of the original database.

Advantages of search on text streaming:

• With text streaming, hits are returned to the user as soon as they are found.
• In an index system, the query must be processed completely before the hits are determined.
• Text streaming also provides an accurate estimate of the current search status, from time to time, until the
query is completed.

Other problems:

• Indexes also have problems with fuzzy searches & embedded string query terms.
• It is difficult to identify all the possible index values short of searching the complete dictionary of
possible terms.
• Most streaming algorithms can identify embedded query terms.
• Some algorithms & hardware search units can also perform fuzzy searches.

Finite State Automata:

• Finite state automata are used by many hardware and software text searchers.
• A finite state automaton is a logical machine that contains 5 elements: I, S, P, S0, SF

WHERE,

I → Set of input symbols from the alphabet.

S → Set of possible states.

P → Set of productions.

S0 → Initial state.

SF → Set of final states.

• The productions define the next state on the basis of the present state & the input symbol.
• A directed graph is used to represent the finite state automaton.
• The graph consists of a set of nodes that represent the states; the edges between the nodes
represent the transitions defined by the set of productions.
• The symbol on each edge defines the input that allows a transition from node Si to node Sj.
• The figure below shows a finite state automaton that identifies the string “ALU” in any input stream.
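
Since the figure is not reproduced here, the following Python sketch shows one possible way to build and run such an automaton for “ALU”. The productions are stored as a dictionary from (present state, input symbol) to next state; the helper names `build_productions` and `find_all`, and the convention that any symbol without a production returns the machine to S0, are assumptions made for this example rather than details taken from the figure.

```python
def build_productions(pattern: str) -> dict[tuple[int, str], int]:
    """Build P: (present state, input symbol) -> next state.

    State k means "the last k symbols read equal the first k symbols of the pattern".
    """
    productions = {}
    m = len(pattern)
    for state in range(m + 1):
        for symbol in set(pattern):
            seen = pattern[:state] + symbol
            # Next state = longest prefix of the pattern that is a suffix of what was seen.
            k = min(m, len(seen))
            while k > 0 and not seen.endswith(pattern[:k]):
                k -= 1
            productions[(state, symbol)] = k
    return productions

def find_all(text: str, pattern: str) -> list[int]:
    """Run the automaton over the input stream and report where each match starts."""
    productions = build_productions(pattern)
    final_state = len(pattern)          # SF contains a single state here
    state = 0                           # S0
    hits = []
    for position, symbol in enumerate(text):
        # Symbols with no production (not in the pattern) send the machine back to S0.
        state = productions.get((state, symbol), 0)
        if state == final_state:
            hits.append(position - len(pattern) + 1)
    return hits

print(find_all("VALUE ALUMNI SALUTE", "ALU"))
```

Each input symbol is examined exactly once, which is why this style of machine suits streaming hardware and software searchers.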


5.1.2 Software Text Search Algorithms:


• In software streaming techniques, the item to be searched is read into memory & then an
algorithm is applied to it.
• Although nothing restricts software from applying many simultaneous searches against the same item, the
system more frequently resolves a particular search against a particular item.
• There are 4 main algorithms associated with software text search. They are,
1. Brute Force approach
2. Knuth-Morris-Pratt
3. Boyer-Moore
4. Shift-OR algorithm
• Of all the algorithms, Boyer-Moore has been the fastest, requiring at most O(n+m) comparisons.
• Both Boyer-Moore & Knuth-Morris-Pratt require O(n) preprocessing of the search strings.

Brute Force Approach:

• Brute Force approach is the simplest algorithm of string matching.


• The basic idea is that the search string is compared against the input string.
• If a mismatch occurs during the comparison, the input string is shifted by one
position & the comparison is started again.
• The expected number of comparisons when searching an input string of n characters for a pattern of m
characters is given by Nc:

$$N_c = \frac{c}{c-1}\left(1 - \frac{1}{c^m}\right)(n - m + 1) + O(1)$$

Where,

Nc → expected number of comparisons

c → size of the alphabet for the text.
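
A minimal Python sketch of the brute force approach follows. It counts character comparisons so the result can be set beside the expected-comparison estimate above. The function names are illustrative, and `expected_comparisons` drops the O(1) term, so it is only an approximation for random text.

```python
def brute_force_search(text: str, pattern: str) -> tuple[int, int]:
    """Return (index of first match or -1, number of character comparisons made)."""
    n, m = len(text), len(pattern)
    comparisons = 0
    for shift in range(n - m + 1):
        matched = 0
        while matched < m:
            comparisons += 1
            if text[shift + matched] != pattern[matched]:
                break                    # mismatch: move one position and start again
            matched += 1
        if matched == m:
            return shift, comparisons
    return -1, comparisons

def expected_comparisons(n: int, m: int, c: int) -> float:
    """Expected comparisons for a full scan of random text, without the O(1) term."""
    return c / (c - 1) * (1 - 1 / c**m) * (n - m + 1)

print(brute_force_search("ABABAC", "ABAC"))
print(round(expected_comparisons(1000, 5, 26), 2))
```

For a 26-letter alphabet the estimate stays close to n, because most alignments are rejected on the first character; the worst case of roughly n*m comparisons only appears with highly repetitive text.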

Knuth Morris Pratt (KMP) Algorithm:


Fig: Example of Knuth Morris Pratt Algorithm
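
As the figure is not reproduced, the Python sketch below shows the standard textbook form of Knuth-Morris-Pratt; it is not taken from these notes. The pattern is preprocessed into a failure table, so that after a mismatch the search resumes from the longest prefix of the pattern that still matches, and no character of the input is read twice. The names `failure_table` and `kmp_search` are illustrative.

```python
def failure_table(pattern: str) -> list[int]:
    """fail[i] = length of the longest proper prefix of pattern[:i+1] that is also its suffix."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail

def kmp_search(text: str, pattern: str) -> int:
    """Return the index of the first occurrence of pattern in text, or -1."""
    if not pattern:
        return 0
    fail = failure_table(pattern)
    k = 0                                # number of pattern characters currently matched
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = fail[k - 1]              # fall back without re-reading the text
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            return i - k + 1
    return -1

print(failure_table("ABABAC"))
print(kmp_search("ABABABAC", "ABABAC"))
```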

Boyer Moore Algorithm:

Step-1: Construct 'Bad Match Table'


Step-2: Compare the rightmost character of the pattern with the corresponding character of the given
string, using the 'value' from the bad match table.
Step-3: If there is a mismatch, shift the pattern to the right by the 'value' that the
bad match table gives for the character.
While constructing the bad match table, use the following formula for each value:
value = length of pattern - index - 1, and last value = length of pattern.
Here the letter 'A' occurs twice, so the value of its later occurrence replaces the earlier one. The same
applies to M. T is the last character in the pattern, so its value = 8 (the length of the pattern).
Mismatch here, so move 8 characters to the right.

Mismatch, so move 1 character to the right.
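
The three steps can be sketched in Python as follows. Note that this is the simplified Horspool variant of Boyer-Moore, which uses only a bad match table; choosing the shift from the text character aligned with the last pattern position is an assumption of this sketch, and the pattern "TOOTH" is a made-up example, not the 8-character pattern from the missing figure.

```python
def bad_match_table(pattern: str) -> dict[str, int]:
    """value = length of pattern - index - 1; characters not in the table use the length."""
    m = len(pattern)                      # pattern is assumed to be non-empty
    table = {}
    for index, ch in enumerate(pattern[:-1]):
        table[ch] = m - index - 1         # a later occurrence overwrites an earlier one
    table.setdefault(pattern[-1], m)      # last character: length, unless already set
    return table

def horspool_search(text: str, pattern: str) -> int:
    """Return the index of the first occurrence of pattern in text, or -1."""
    m, n = len(pattern), len(text)
    if m == 0:
        return 0
    table = bad_match_table(pattern)
    start = 0
    while start <= n - m:
        j = m - 1
        while j >= 0 and text[start + j] == pattern[j]:
            j -= 1                        # compare right to left
        if j < 0:
            return start
        # Shift by the table value of the text character under the pattern's last position.
        start += table.get(text[start + m - 1], m)
    return -1

print(bad_match_table("TOOTH"))
print(horspool_search("TRUSTHARDTOOTHBRUSHES", "TOOTH"))
```

Large shifts occur whenever the text character under the end of the pattern does not appear in the pattern at all, which is what lets this family of algorithms skip much of the input.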


5.1.3 Hardware Text Search Systems
Software text search is applicable to many circumstances but has encountered restrictions on the
ability to handle many search terms simultaneously against the same text and limits due to I/O
speeds. One approach that offloaded the resource-intensive searching from the main processors
was to have a specialized hardware machine perform the searches and pass the results to the
main computer, which supported the user interface and retrieval of hits. Since the searcher is
hardware based, scalability is achieved by increasing the number of hardware search devices.
Another major advantage of using a hardware text search unit is the elimination of the
index that represents the document database. Typically the indexes are 70% the size of the actual
items. Other advantages are that new items can be searched as soon as they are received by the system, rather
than waiting for the index to be created, and that the search speed is deterministic.
Figure 9.1 represents hardware as well as software text search
solutions. The algorithmic part of the system is focused on the term detector. There have been three
approaches to implementing term detectors: parallel comparators or associative memory, a cellular
structure, and a universal finite state automaton.
When the term comparator is implemented with parallel comparators, each term in the
query is assigned to an individual comparison element and input
data are serially streamed into the detector. When a match occurs, the term comparator informs the
external query resolver (usually in the main computer) by setting status flags.

Specialized hardware that interfaces with computers and is used to search secondary storage
devices was developed from the early 1970s with the most recent product being the Parallel
Searcher (previously the Fast Data Finder). The typical hardware configuration is shown in
Figure 9.9 in the dashed box. The speed of search is then based on the speed of the I/O.
One of the earliest hardware text string search units was the Rapid Search Machine
developed by General Electric. The machine consisted of a special purpose search unit where a
single query was passed against a magnetic tape containing the documents. A more sophisticated
search unit was developed by Operating Systems Inc. called the Associative File Processor (AFP).
It is capable of searching against multiple queries at the same time. Following that initial
development, OSI, using a different approach, developed the High Speed Text Search (HSTS)
machine. It uses an algorithm similar to the Aho-Corasick software finite state machine algorithm
except that it runs three parallel state machines. One state machine is dedicated to contiguous word
phrases, another to embedded term matches, and the final one to exact word matches.

In parallel with that development effort, GE redesigned their Rapid Search Machine into the
GESCAN unit. The GESCAN system uses a text array processor (TAP) that simultaneously
matches many terms and conditions against a given text stream. The TAP receives the query
information from the user’s computer and directly accesses the textual data from secondary storage.
The TAP consists of a large cache memory and an array of four to 128 query processors. The text is
loaded into the cache and searched by the query processors (Figure 9.10). Each query processor is
independent and can be loaded at any time. A complete query is handled by each query processor.

A query processor performs two operations in parallel: matching query terms to input text and
Boolean logic resolution. Term matching is performed by a series of character cells each containing
one character of the query. A string of character cells is implemented on the same LSI chip and the
chips can be connected in series for longer strings. When a word or phrase of the query is matched,
a signal is sent to the resolution sub-process on the LSI chip. The resolution chip is responsible for
resolving the Boolean logic between terms and proximity requirements. If the item satisfies the
query, the information is transmitted to the user’s computer.

The text array processor uses these chips in a matrix arrangement as shown in Figure 9.10.
Each row of the matrix is a query processor in which the first chip performs the query resolution
while the remaining chips match query terms. The maximum number of characters in a query is
restricted by the length of a row, while the number of rows limits the number of simultaneous queries
that can be processed.
Another approach for hardware searchers is to augment disk storage. The augmentation is a
generalized associative search element placed between the read and write heads on the disk. The
content addressable segment sequential memory (CASSM) system uses these search elements in
parallel to obtain structured data from a database. The CASSM system was developed at the
University of Florida as a general purpose search device. It can be used to perform string searching
across the database. Another special search machine is the relational associative processor (RAP)
developed at the University of Toronto. Like CASSM, it performs searches across a secondary storage
device using a series of cells comparing data in parallel.

The Fast Data Finder (FDF) is the most recent specialized hardware text search unit still in
use in many organizations. It was developed to search text and has been used to search English and
foreign languages. The early Fast Data Finders consisted of an array of programmable text
processing cells connected in series forming a pipeline hardware search processor. The cells are
implemented using a VLSI chip. In the TREC tests each chip contained 24 processor cells, with a
typical system containing 3600 cells. Each cell is a comparator for a single character, limiting
the total number of characters in a query to the number of cells.

The cells are interconnected with an 8-bit data path and an approximately 20-bit control path.
The text to be searched passes through each cell in a pipeline fashion until the complete database
has been searched. As data is analyzed at each cell, the states of the 20 control lines are modified
depending upon their current state and the results from the comparator. An example of a Fast Data
Finder system is shown in Figure 9.11.
A cell is composed of both a register cell (Rs) and a comparator (Cs). The input
from the document database is controlled and buffered by the microprocessor/memory
and fed through the comparators. The search characters are stored in the registers. The
connection between the registers reflects the control lines that are also passing state
information. Groups of cells are used to detect query terms, along with logic between the
terms, by appropriate programming of the control lines. When a pattern match is
detected, a hit is passed to the internal microprocessor that passes it back to the host
processor, allowing immediate access by the user to the hit item.
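
The idea of a pipeline of single-character cells can be imitated in software. The Python sketch below is only a rough analogy under strong simplifying assumptions: each cell holds one query character in its register, and the roughly 20 control lines are reduced to a single match signal passed from one cell to the next. It does not model the real Fast Data Finder control logic, and the name `pipeline_search` is invented for the example.

```python
def pipeline_search(text: str, query: str) -> list[int]:
    """Simulate a chain of character cells; return the start index of every match."""
    cells = list(query)                   # one register (search character) per cell
    signal = [False] * len(cells)         # match signal leaving each cell
    hits = []
    for position, ch in enumerate(text):
        # Update from the last cell back to the first so that each cell still
        # sees the signal its upstream neighbour produced on the previous character.
        for j in range(len(cells) - 1, -1, -1):
            upstream = signal[j - 1] if j > 0 else True
            signal[j] = upstream and cells[j] == ch
        if signal and signal[-1]:
            hits.append(position - len(cells) + 1)   # report the hit immediately
    return hits

print(pipeline_search("banana", "ana"))
```

As in the hardware, the query length is limited by the number of cells, and overlapping matches are found because every cell works on every character of the stream.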
The functions supported by the Fast Data Finder are:
➢ Boolean Logic including negation
➢ Proximity on an arbitrary pattern
➢ Variable length “don’t cares”
➢ Term counting and thresholds
➢ Fuzzy matching
➢ Term weights
➢ Numeric ranges
Multimedia Information Retrieval

Definition: Multimedia information retrieval is the process of satisfying a user’s stated
information need by identifying all relevant text, graphics, audio (speech & non-speech
audio), imagery, or video documents or portions of documents from a document collection.

Multimedia:

➢ Multimedia data contains different data types such as text, images, graphics and
sound.
➢ Multimedia data has become very important in many applications such as offices,
CAD/CAM applications, and commercial, medical & entertainment applications.
➢ Hence multimedia information systems are one of the important fields in the area
of information management.
➢ As the main characteristic of multimedia is that it handles a variety of data, the
development of a multimedia information system is more complex than that of a
traditional one.
➢ Multimedia systems should have the ability to store, retrieve, transport, & present
the data.
➢ The data has heterogeneous characteristics, such as text, images, graphics,
& sound.
➢ Systems that deal with simple data types such as integers & strings are known as
conventional systems.
➢ Thus, in order to provide support for such complex multimedia structures, systems
known as Multimedia Information Retrieval Systems must be developed.
Spoken Language Audio Retrieval

➢ Just as a user may wish to search the archives of a large text collection, the ability to
search the content of audio sources such as speeches, radio broadcasts, &
conversations would be valuable for a range of applications.
➢ An assortment of techniques has been developed to support the automated
recognition of speech.
➢ These are applicable to a range of application areas such as speaker verification,
transcription & command & control.
➢ For example, Jones (1997) reports a comparative evaluation of speech and text
retrieval in the context of the Video Mail Retrieval (VMR) project.
➢ While speech transcription word error rates may be high, redundancy in the source
material helps offset these error rates & still supports effective retrieval.
➢ Some recent efforts have focused on the automated transcription of broadcast news.
➢ For example, the figure below illustrates BBN’s Rough and Ready prototype, which aims to
provide information access to spoken language from audio & video sources.
➢ Rough and Ready creates a summarization of speech that is ready for browsing.
➢ The above figure illustrates a January 31, 1998 sample from ABC’s World News Tonight in which
the left-hand column indicates the speaker and the center column shows the transcription with
highlighted named entities (i.e., people, organizations, locations).
➢ The rightmost column lists the topic of discussion.
➢ Rough and Ready’s transcription is created by the BYBLOS large vocabulary speech
recognition system, a continuous-density Hidden Markov Model (HMM) system that has
been competitively tested in annual formal evaluations for the past 12 years.
➢ BYBLOS runs at 3 times real time, uses a 60,000-word dictionary, & most recently
reported word error rates of 18.8% for the broadcast news transcription task.
Non-speech audio Retrieval:

➢ In addition to content-based access to speech audio, noise/sound retrieval is also important in
fields such as movie/video production.
➢ User-extensible sound classification & retrieval systems draw on several disciplines, including signal
processing, speech recognition, computer music, & multimedia databases.
➢ Just as image indexing algorithms use visual feature vectors to index & match images, sound retrieval
uses acoustic features to index & match sounds.

➢ The above figure shows the analysis of male laughter along several dimensions including
amplitude, brightness, bandwidth & pitch.
➢ The figure below shows content-based access to audio.

➢ The above figure shows an end user's content-based retrieval application that enables a user
to browse and/or query a sound database by acoustic (e.g., pitch, duration) and/or
perceptual properties (e.g., “scratchy”) and/or query by example.
➢ For example, SoundFisher supports such complex content queries as: find all AIFF-encoded
files with animal or human vocal sounds that are similar to barking sounds,
without regard to duration or amplitude.
➢ The performance of the SoundFisher system was evaluated using a database of 400
widely ranging sound files (e.g., captured from nature, animals, instruments, speech).
➢ Additional requirements identified by this research include the need for sound displays,
sound synthesis (a kind of query formulation/refinement tool), sound separation &
matching of trajectories of features over time.
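
Query by example over acoustic features can be pictured as a nearest-neighbour search over feature vectors. The Python sketch below is a toy illustration only: the sound names, the feature values (assumed to be normalised to the range 0 to 1), the Euclidean distance, and the function names are all hypothetical and do not describe how SoundFisher works internally. The `ignore` argument mimics a query made "without regard to duration or amplitude".

```python
import math

FEATURES = ("amplitude", "brightness", "bandwidth", "pitch", "duration")

def distance(a: dict, b: dict, ignore: tuple = ()) -> float:
    """Euclidean distance over the acoustic features that are not ignored."""
    used = [f for f in FEATURES if f not in ignore]
    return math.sqrt(sum((a[f] - b[f]) ** 2 for f in used))

def query_by_example(example: dict, database: dict, ignore: tuple = (), top_k: int = 2) -> list:
    """Return the names of the top_k sounds closest to the example sound."""
    ranked = sorted(database, key=lambda name: distance(example, database[name], ignore))
    return ranked[:top_k]

# Hypothetical feature vectors, invented for the example.
sounds = {
    "dog_bark":  {"amplitude": 0.9, "brightness": 0.6, "bandwidth": 0.5, "pitch": 0.4, "duration": 0.2},
    "wolf_howl": {"amplitude": 0.3, "brightness": 0.4, "bandwidth": 0.3, "pitch": 0.6, "duration": 0.9},
    "violin":    {"amplitude": 0.5, "brightness": 0.9, "bandwidth": 0.9, "pitch": 0.9, "duration": 0.6},
}
example = {"amplitude": 0.8, "brightness": 0.55, "bandwidth": 0.45, "pitch": 0.45, "duration": 0.3}

print(query_by_example(example, sounds, ignore=("duration", "amplitude")))
```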
Graph Retrieval:

➢ Another important media class is graphics, to include tables & charts.


➢ Graphs are constructed from more primitive data elements such as points, lines, &
labels.
➢ An innovative example of a graph retrieval system is SageBook.
➢ SageBook enables both search & customization of stored data graphics.
➢ Just as audio retrieval may require an audio query, SageBook supports data-graphic
query, representation (i.e., content description), indexing, search & adaptation
capabilities.
➢ The figure below shows a graphical query and the data graphics returned for that query.

➢ As the bottom left-hand side of the figure shows, queries are formulated via a graphical
direct manipulation interface (SageBrush) by selecting & arranging spaces (e.g., charts,
tables), objects contained within those spaces (e.g., marks, bars) & object properties (e.g.,
color, size, shape, position).
➢ The right-hand side of the figure displays the relevant graphics retrieved by matching the
underlying content and/or properties of the graphical query at the bottom left of the figure
with those of the graphics stored in a library.
➢ Both exact matching & similarity-based matching are performed on graphical
elements (graphemes) as well as on the underlying data represented by the graphic.
➢ For example, in the query & in the responses in the figure, for 2 graphemes to match they
must be of the same class & use the same properties (i.e., color, shape, size, width) to encode data.
➢ All the data graphics shown were returned by a “close graphics matching strategy”.
➢ SageBook maintains an internal representation of the syntax & semantics of data graphics,
which includes spatial relationships between objects, relationships between data domains (e.g.,
interval, 2D coordinate), & the various graphic & data attributes.
➢ Search is performed on both graphical & data properties, with 3 & 4 alternative search
strategies respectively, to enable varying degrees of match relaxation.
➢ Just as in large text and imagery collections, several data-graphic grouping techniques based on data &
graphical properties were designed to enable clustering for browsing large collections.
➢ Finally, SageBook provides automatic adaptation techniques that can modify the retrieved
graphics (e.g., eliminating graphical elements that do not match the specified query).

Imagery Retrieval:

➢ Increasing volumes of imagery, from web page images to personal collections from digital
cameras, have escalated the need for more effective and efficient imagery access.
➢ Researchers have identified needs for indexing & search of not only the metadata associated
with the imagery (e.g., captions, annotations) but also retrieval directly on the content of
the imagery.
➢ Initial algorithm development has focused on the content of the imagery, which can be
used as a means for retrieving similar images without the burden of manual indexing.
➢ Query By Image Content (QBIC) supports access to imagery collections on the basis of
visual properties such as color, shape, texture, and sketches.
➢ In their approach, query facilities for specifying color parameters, drawing desired
shapes or selecting textures replace the traditional keyword query found in text
retrieval.
➢ The below fig shows a query to a database of all US stamps prior to 1995 in which QBIC
is asked to retrieve red images.
➢ The “red stamps” results are displayed in below fig.

➢ For example, if we further refine this search by adding the keyword “president”, we
obtain the results shown in the figure below, in which all stamps are both red in color &
related to “president”.
➢ The female stamp in the bottom right hand corner of below fig is of Martha Washington
from the presidential stamp collection.

➢ Additional research in image processing has addressed specific kinds of content-based
retrieval problems.
➢ Consider face processing, where we distinguish face detection, face recognition, & face
retrieval.
➢ Researchers have also developed systems to track human movement (e.g., heads, hands,
feet) and to differentiate human expressions such as a smile, surprise, anger, or disgust.
➢ This expression recognition is related to research in emotion recognition in the context
of human-computer interaction.
➢ Face recognition is also important in video retrieval.
Video Retrieval:

➢ The ability to support content-based access to video promises access to video mail, videotaped
meetings, surveillance video, & broadcast television.

➢ Broadcast News Navigator (BNN) is a web-based tool that automatically captures,
annotates, segments, summarizes & visualizes stories from broadcast news video.
➢ BNN is to broadcast news video.
➢ BNN integrates text, speech, and image processing technologies to perform multistream
analysis of video to support content-based search & retrieval.
➢ BNN addresses the problem of time-consuming, manual video annotation techniques that
frequently result in inconsistent, error-prone or incomplete video catalogues.
➢ The figure below shows BNN’s video query page.
➢ From this web page, the user can select to search among 30 national or local news
sources, specify an absolute or relative date range, search closed captions or speech
transcriptions, run a pre-specified profile, search on text keywords, or search on concepts
that express topics or so-called named entities such as people, organizations, &
locations.
➢ In the figure below, the user has selected to search all news video sources over a 2-week
period for location tags.

➢ As the figure below shows, BNN automatically generates a custom query web page which
includes menus of people and location names from content extracted over the relevant
time period to ease query formulation by the user.
➢ In the above figure, the user has selected “George Bush” & “George W. Bush” from the people
menu, “New York” & “New York City” from the location menu, & the key words
“presidential primary”.
*********************
