irs unit-1 modified
An Information Retrieval System (IRS) is a system designed to store, retrieve, and maintain
information. This information can be text, images, audio, video, or other multimedia. Modern
techniques allow for searching across different media types (e.g., EXCALIBUR's Visual
RetrievalWare).
An "item" refers to the smallest complete unit processed by the system, such as a document, video,
or audio program. The system helps users find the information they need, sometimes using
specialized hardware to process non-text data (e.g., converting audio to text). The process involves
search composition, search execution, and filtering out non-relevant items, all of which contribute
to retrieval overhead.
With advancements in computing and storage, large databases are now accessible to the average
user. The growth of the Internet and advanced search engines like INFOSEEK and EXCITE has
made it easier to access huge amounts of information. Media like images and audio are also
searchable, and organizations like BBC and Disney use transcription and video indexing for easier
access to content.
Objectives of Information Retrieval Systems
The main objective of an Information Retrieval System is to reduce the time users spend finding
relevant information. This time includes generating the query, executing it, reviewing the results,
and reading through irrelevant items. The goal is to improve user efficiency in locating the needed
information.
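Two measures are commonly used to evaluate how well a system retrieves information:
1. Precision: Measures the proportion of relevant items retrieved out of all retrieved items. Precision
drops when many non-relevant items are retrieved.
2. Recall: Measures the proportion of relevant items retrieved out of all relevant items in the
database. Recall is not affected by the number of non-relevant items retrieved: once a relevant item
has been retrieved, it counts toward recall regardless of what else is returned.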
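The two measures can be worked through on a small example. The sketch below assumes that retrieved and relevant items are given as sets of item ids; the function name and the sample ids are illustrative only.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single query.

    retrieved: ids of the items the system returned
    relevant:  ids of the items that are actually relevant
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant items that were retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical data: 4 relevant items exist; 5 items are returned, 3 of them relevant.
relevant = {"d1", "d2", "d3", "d4"}
retrieved = {"d1", "d2", "d3", "d8", "d9"}
print(precision_recall(retrieved, relevant))  # (0.6, 0.75)
```

Retrieving more non-relevant items would lower the first number but leave the second unchanged.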
Modern retrieval systems, like AltaVista and Infoseek, allow users to enter natural language queries.
This makes searching more intuitive, though most users typically enter only one or two keywords
instead of long queries.
Functional Overview:
A total Information Storage and Retrieval System is composed of four major functional
processes:
1) Item Normalization
2) Selective Dissemination of Information (i.e., “Mail”)
3) Archival Document Database Search
4) Index Database Search, along with the Automatic File Build process that
supports Index Files
1. Item Normalization
*Item Normalization* is the process of converting incoming data into a standard format that the
system can understand and work with. This step is essential for ensuring that all data, regardless of
its original format, is compatible with the system’s processing and search capabilities.
---
*Selective Dissemination of Information (SDI)* is a system process that automatically delivers new
and relevant information to users based on their specified interests. This method ensures users
receive personalized updates without needing to search manually.
---
2. *User Profiles*:
- Profiles define what each user is interested in.
- Broad and flexible search statements allow users to receive diverse but relevant information.
3. *Mail Files*:
- These are storage spaces where matched documents are sent.
- Each user has dedicated mail files to organize the received information.
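The interaction between profiles and mail files can be sketched as follows. This is a minimal illustration, assuming that a profile is just a set of terms and that an item matches when it contains at least one of them; the names `profiles`, `mail_files`, and `disseminate` are hypothetical.

```python
from collections import defaultdict

# Hypothetical profiles: each user lists terms of interest (a broad, OR-style search statement).
profiles = {
    "alice": {"retrieval", "indexing"},
    "bob": {"warehouse"},
}

mail_files = defaultdict(list)  # user -> ids of items delivered to that user's mail file


def disseminate(item_id, text):
    """Compare one incoming item against every profile and file it for each match."""
    words = set(text.lower().split())
    matched = []
    for user, terms in profiles.items():
        if terms & words:  # at least one profile term occurs in the item
            mail_files[user].append(item_id)
            matched.append(user)
    return matched


disseminate("item-1", "New indexing methods for text retrieval")
disseminate("item-2", "Quarterly data warehouse report")
print(dict(mail_files))  # {'alice': ['item-1'], 'bob': ['item-2']}
```

Note the direction of the comparison: each new item is checked against the stored profiles, rather than a new query being checked against stored items.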
---
### *Limitations*
1. *Broad Search Statements*: Profiles might lead to receiving some irrelevant items if the search
criteria are too broad.
2. *Lack of Multimedia Support*: Currently, SDI systems focus mainly on text-based information
and do not fully support multimedia data like videos or images.
---
*Simplified Explanation:*
The *Document Database Search* allows users to search through all the information stored in the
system. This includes documents, articles, or any data received and saved. The key components of
this process are:
1. *Query-Based Search:*
- Users input queries to find relevant documents.
- These queries are often created as needed (called *ad hoc queries*).
2. *Stored Documents:*
- The system keeps all received documents in a database, known as the *Document Database*.
- Once stored, these documents are not edited or changed.
3. *Search Process:*
- The system scans the database to find items matching the user’s query.
*Significance:*
- This process ensures users can access all the documents the system has received.
- It is useful for retrieving past information quickly and accurately.
*Example:*
If a user wants to find all documents related to “Artificial Intelligence,” they can enter the query,
and the system will search the database to retrieve relevant documents.
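A minimal sketch of such an ad hoc search, assuming the Document Database is a simple mapping from item id to stored text and that a document matches when it contains every query word (the sample documents and the function name are made up for illustration):

```python
# Hypothetical Document Database: item id -> stored text (not edited once stored).
document_db = {
    "doc-1": "Artificial Intelligence in medical diagnosis",
    "doc-2": "A history of the printing press",
    "doc-3": "Search methods in artificial intelligence",
}


def ad_hoc_search(query, database):
    """Scan every stored document and return the ids that contain all query words."""
    terms = query.lower().split()
    results = []
    for doc_id, text in database.items():
        words = set(text.lower().split())
        if all(term in words for term in terms):
            results.append(doc_id)
    return results


print(ad_hoc_search("Artificial Intelligence", document_db))  # ['doc-1', 'doc-3']
```

A real system would normally consult an inverted index rather than scan every document, but the scan makes the matching rule easy to see.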
*Simplified Explanation:*
The *Index Database Search* allows users to save important documents for future use by filing
them with specific tags or descriptions. This process involves:
1. *Indexing:*
- Users can *file documents* into special categories called *index files*.
- Additional tags or descriptions (index terms) can be added to make searching easier later.
3. *Search Process:*
- Users can search both private and public index files to find saved documents.
- The system helps users generate indexes automatically through *Automatic File Build
(Information Extraction)*.
*Significance:*
- Indexing organizes documents for quick and efficient retrieval.
- Private files enable personalized storage, while public files serve broader organizational needs.
*Example:*
- A researcher can create a private index file for documents on “Machine Learning” and tag them
with specific keywords. Later, they can search this index file to find those documents easily.
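The researcher example could look like the sketch below. It assumes an index file is a mapping from index terms to the ids of filed documents; the class `IndexFile` and its methods are hypothetical names, not part of any particular system.

```python
from collections import defaultdict


class IndexFile:
    """A named index file that maps user-assigned index terms to document ids."""

    def __init__(self, name, public=False):
        self.name = name
        self.public = public  # private files belong to one user, public files are shared
        self._terms = defaultdict(set)

    def file_document(self, doc_id, index_terms):
        """File a document under each of the given index terms."""
        for term in index_terms:
            self._terms[term.lower()].add(doc_id)

    def search(self, term):
        """Return the ids of documents filed under the term, in sorted order."""
        return sorted(self._terms.get(term.lower(), set()))


ml = IndexFile("Machine Learning", public=False)
ml.file_document("doc-7", ["neural networks", "classification"])
ml.file_document("doc-9", ["classification"])
print(ml.search("classification"))  # ['doc-7', 'doc-9']
```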
Integrating Database Management Systems (DBMS) with Information Retrieval (IR) systems is
crucial for managing both structured data and unstructured information effectively.
Practical Importance: Combining DBMS and IR systems improves data handling, enabling users to
search and retrieve both structured (tables, records) and unstructured (text, documents) data
seamlessly.
Commercial Integration: Many commercial database systems have integrated these capabilities:
- INQUIRE DBMS: One of the first to combine DBMS and IR systems, available for over 15 years.
- ORACLE DBMS: Offers an embedded IR tool called CONVECTIS, which uses a thesaurus to
generate "themes" for items, improving search accuracy.
- INFORMIX DBMS: Links to RetrievalWare, enabling the integration of structured data with
advanced IR functions.
Key Benefits:
1. Enhanced Search: Users can query structured data alongside related unstructured data for a
complete view.
2. Efficient Retrieval: Features like thesaurus-based searches (e.g., in ORACLE) allow for more
intuitive and comprehensive results.
3. Unified Systems: Integration reduces the need for separate systems, saving time and improving
user experience.
This integration bridges the gap between traditional databases and modern information retrieval
needs, offering a more powerful and flexible approach to managing diverse types of data.
Digital Libraries
*Digital libraries* are online platforms where books, research papers, videos, images, and other
materials are stored digitally. These libraries let users access information from anywhere, anytime,
using devices like smartphones, laptops, or tablets. They aim to make knowledge more accessible
and preserve important materials in a digital format.
Data Warehouses
A *data warehouse* is a large storage system used by businesses to collect and manage data from
various sources. Unlike regular storage, it organizes data in a way that makes it easy to analyze and
create reports, helping businesses make informed decisions. It’s like a giant library for business
information.
Browse capabilities help users explore search results efficiently by summarizing and organizing the
information. They allow users to identify and select relevant items for display, making the process
of finding what they need easier and more intuitive.
---
#### 2.3.2 Iterative Search and Search History Log
Iterative search makes refining results easier by applying new conditions to previous search
outcomes. It helps users focus on relevant items without starting over. Relevance feedback lets users
improve results further. The search history log saves all searches from the session, enabling quick
access to modify or revisit previous queries.
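The idea can be sketched as a session object that searches either the whole database or only the results of an earlier search, and logs every query. The class and parameter names below are illustrative assumptions.

```python
class SearchSession:
    """Keeps a history log and lets a new search run over earlier results."""

    def __init__(self, database):
        self.database = database  # item id -> text
        self.history = []         # list of (term, result ids) for this session

    def search(self, term, within=None):
        """Search for term; if `within` is given, look only at those item ids."""
        candidates = self.database if within is None else within
        results = [doc_id for doc_id in candidates
                   if term.lower() in self.database[doc_id].lower()]
        self.history.append((term, results))
        return results


# Hypothetical data
db = {"d1": "computer design basics", "d2": "computer networks", "d3": "garden design"}
session = SearchSession(db)
first = session.search("computer")               # ['d1', 'd2']
second = session.search("design", within=first)  # ['d1']
print(session.history)
```

Because each entry in `history` keeps its result list, any earlier search can be picked up again and refined without re-running it.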
The goal of search capabilities is to map a user's needs to relevant items in a database. Users can
input queries using natural language or Boolean logic. Some systems allow search terms to be
"weighted" based on importance to improve results.
2.1.1 Boolean Logic
Boolean logic connects search terms using operators like AND, OR, and NOT to retrieve relevant
information. These operations work by set intersection (AND), set union (OR), and set difference
(NOT). Special Boolean searches like "M of N" retrieve items that contain at least M of the N
listed terms.
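These operators map directly onto set operations. The sketch below assumes a small inverted index (term to set of item ids) with made-up contents, and reads "M of N" as "at least M of the N terms"; `m_of_n` is a hypothetical helper name.

```python
# Hypothetical inverted index: term -> ids of the items containing that term.
index = {
    "computer": {1, 2, 3},
    "design": {2, 3, 4},
    "network": {3, 5},
}

print(sorted(index["computer"] & index["design"]))   # AND -> [2, 3]
print(sorted(index["computer"] | index["network"]))  # OR  -> [1, 2, 3, 5]
print(sorted(index["computer"] - index["network"]))  # NOT -> [1, 2]


def m_of_n(terms, m, index):
    """Return the ids of items that contain at least m of the given terms."""
    counts = {}
    for term in terms:
        for item in index.get(term, set()):
            counts[item] = counts.get(item, 0) + 1
    return {item for item, n in counts.items() if n >= m}


print(sorted(m_of_n(["computer", "design", "network"], 2, index)))  # [2, 3]
```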
2.1.2 Proximity
Proximity restricts how close search terms must be in a document to be considered related,
improving precision. For example, terms like "COMPUTER" and "DESIGN" close together suggest
relevance.
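A proximity test can be sketched by comparing word positions. The example assumes distance is counted in words and that order does not matter; real systems may also support sentence or paragraph units and ordered proximity.

```python
def within_proximity(text, term_a, term_b, max_distance):
    """True if term_a and term_b occur within max_distance words of each other."""
    words = text.lower().split()
    pos_a = [i for i, w in enumerate(words) if w == term_a.lower()]
    pos_b = [i for i, w in enumerate(words) if w == term_b.lower()]
    return any(abs(a - b) <= max_distance for a in pos_a for b in pos_b)


print(within_proximity("computer aided design tools", "COMPUTER", "DESIGN", 3))  # True
print(within_proximity("computer sales rose while fashion design fell",
                       "COMPUTER", "DESIGN", 3))                                 # False
```

Both sample texts would satisfy COMPUTER AND DESIGN; only the first satisfies the proximity condition, which is how proximity raises precision.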
2.1.3 Contiguous Word Phrases
A Contiguous Word Phrase treats multiple words as a single unit, like "United States of America."
This search method is similar to proximity but allows for greater specificity in querying exact
phrases.
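A contiguous phrase match is the strict case of proximity: the words must be adjacent and in the stated order. A minimal sketch, assuming whitespace tokenization and case-insensitive comparison:

```python
def contains_phrase(text, phrase):
    """True if the words of `phrase` appear in `text` adjacent and in order."""
    words = text.lower().split()
    target = phrase.lower().split()
    n = len(target)
    return any(words[i:i + n] == target for i in range(len(words) - n + 1))


print(contains_phrase("the United States of America signed", "United States of America"))  # True
print(contains_phrase("states of the united america", "United States of America"))         # False
```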
2.1.4 Fuzzy Searches
Fuzzy searches help find terms with similar spellings, which is useful for handling typos. For
example, a search for "computer" might also find "compiter" or "conputer," improving recall but
reducing precision.
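One common way to measure spelling similarity is edit distance. The sketch below uses Levenshtein distance with a threshold of one edit; this is an assumed technique for illustration, not necessarily the one a given system uses, and the vocabulary is made up.

```python
def edit_distance(a, b):
    """Levenshtein distance: fewest insertions, deletions, or substitutions to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # delete ch_a
                               current[j - 1] + 1,       # insert ch_b
                               previous[j - 1] + cost))  # substitute or keep
        previous = current
    return previous[-1]


def fuzzy_match(query, vocabulary, max_edits=1):
    """Return the vocabulary terms within max_edits of the query."""
    return [w for w in vocabulary if edit_distance(query, w) <= max_edits]


vocabulary = ["computer", "compiter", "conputer", "commuter", "company"]
print(fuzzy_match("computer", vocabulary))
# ['computer', 'compiter', 'conputer', 'commuter']
```

The match on "commuter" shows the precision cost: it is a correctly spelled but unrelated word that happens to be one edit away.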
2.1.5 Term Masking
Term masking replaces part of a search term with a mask symbol that matches any characters,
expanding the search. For example, "COMPUTER*" can find terms like "COMPUTER" and
"COMPUTERS." Masking can be used for prefix, suffix, or embedded searches.
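The three mask positions can be tried out with shell-style wildcards. The sketch assumes `*` as the mask symbol and uses `fnmatchcase` from Python's standard `fnmatch` module; the vocabulary is made up for illustration.

```python
from fnmatch import fnmatchcase

vocabulary = ["COMPUTER", "COMPUTERS", "MINICOMPUTER", "MICROCOMPUTERS", "COMPUTE"]


def masked_search(pattern, terms):
    """Return the terms matching the pattern, where * stands for any run of characters."""
    return [t for t in terms if fnmatchcase(t, pattern)]


print(masked_search("COMPUTER*", vocabulary))   # mask at the end
# ['COMPUTER', 'COMPUTERS']
print(masked_search("*COMPUTER", vocabulary))   # mask at the start
# ['COMPUTER', 'MINICOMPUTER']
print(masked_search("*COMPUTER*", vocabulary))  # mask on both sides (embedded string)
# ['COMPUTER', 'COMPUTERS', 'MINICOMPUTER', 'MICROCOMPUTERS']
```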
2.1.6 Numeric and Date Ranges
Masking doesn't work for numeric or date ranges. Instead, specialized range queries are used to find
values like numbers greater than 125.
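A short sketch of why masking falls short here and what a range query does instead. It assumes `*` as the mask symbol and that values have already been parsed into numbers or dates; `in_range` and the sample data are hypothetical.

```python
from datetime import date
from fnmatch import fnmatchcase

values = [12, 100, 125, 126, 300, 1250]

# Masking treats numbers as strings: "12*" matches 12, 125, 126 and 1250 alike.
print([v for v in values if fnmatchcase(str(v), "12*")])  # [12, 125, 126, 1250]

# A range query compares the values themselves.
print([v for v in values if v > 125])  # [126, 300, 1250]


def in_range(value, low=None, high=None):
    """True if low <= value <= high; a bound of None leaves that side open."""
    if low is not None and value < low:
        return False
    if high is not None and value > high:
        return False
    return True


dates = [date(1999, 12, 31), date(2000, 6, 15), date(2001, 1, 1)]
print([d.isoformat() for d in dates if in_range(d, date(2000, 1, 1), date(2000, 12, 31))])
# ['2000-06-15']
```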
2.1.7 Concept/Thesaurus Expansion
Thesaurus and concept class expansion help find terms related in meaning. A thesaurus expands
search terms with words of similar meaning, while concept classes expand them with related ideas
organized in a structured tree.
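The difference between the two kinds of expansion can be sketched with two small lookup tables. The thesaurus entries and the concept tree below are invented for illustration, and expanding a concept to everything beneath it in the tree is one assumed policy among several.

```python
# Hypothetical thesaurus: term -> words with a similar meaning.
thesaurus = {"computer": ["machine", "processor"]}

# Hypothetical concept tree: concept -> narrower concepts beneath it.
concept_tree = {
    "computer": ["hardware", "software"],
    "hardware": ["cpu", "memory"],
}


def expand_with_thesaurus(term):
    """Return the term followed by its thesaurus entries."""
    return [term] + thesaurus.get(term, [])


def expand_with_concepts(term):
    """Return the term plus every concept below it in the tree (depth-first)."""
    expanded = [term]
    for child in concept_tree.get(term, []):
        expanded.extend(expand_with_concepts(child))
    return expanded


print(expand_with_thesaurus("computer"))  # ['computer', 'machine', 'processor']
print(expand_with_concepts("computer"))   # ['computer', 'hardware', 'cpu', 'memory', 'software']
```

The expanded term list would then be searched as an OR of its members.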
2.1.8 Natural Language Queries
Natural language queries allow users to input questions directly. While they improve recall, they
can reduce precision, especially when the query contains negation.