
irs unit-1 modified

An Information Retrieval System (IRS) is designed to store, retrieve, and maintain various types of information, including text and multimedia, while improving user efficiency in finding relevant data. Key processes include item normalization, selective dissemination of information, document database search, and index database search, each contributing to effective information management. The integration of IRS with database management systems enhances data handling, allowing seamless access to both structured and unstructured data.

Uploaded by

Balle Manasa
Copyright
© All Rights Reserved

Definition of Information Retrieval System

An Information Retrieval System (IRS) is a system designed to store, retrieve, and maintain
information. This information can be text, images, audio, video, or other multimedia. Modern
techniques allow for searching across different media types (e.g., EXCALIBUR's Visual
RetrievalWare).

An "item" refers to the smallest complete unit processed by the system, such as a document, video,
or audio program. The system helps users find the information they need, sometimes using
specialized hardware to process non-text data (e.g., converting audio to text). Search composition,
search execution, and filtering out non-relevant items all contribute to retrieval
overhead.

With advancements in computing and storage, large databases are now accessible to the average
user. The growth of the Internet and advanced search engines like INFOSEEK and EXCITE have
made it easier to access huge amounts of information. Media like images and audio are also
searchable, and organizations like the BBC and Disney use transcription and video indexing for easier
access to content.

Objectives of Information Retrieval Systems

The main objective of an Information Retrieval System is to reduce the time users spend finding
relevant information. This includes query generation, execution, reviewing results, and avoiding
irrelevant items. The goal is to improve user efficiency in locating the needed information.

Key Measures: Precision and Recall

1. Precision: Measures the proportion of relevant items retrieved out of all retrieved items. Precision
drops when many non-relevant items are retrieved.

2. Recall: Measures the proportion of relevant items retrieved out of all relevant items in the
database. Recall depends only on how many relevant items are retrieved, so retrieving additional
non-relevant items does not lower it.
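
The two measures can be computed directly from the set of retrieved items and the set of relevant items. The sketch below is only an illustration; the function name and the item ids are made up for the example.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: set of item ids the system returned
    relevant:  set of item ids in the database that are actually relevant
    """
    hits = len(retrieved & relevant)  # relevant items that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: 4 relevant items exist, 5 items come back, 3 of them relevant.
relevant = {"d1", "d2", "d3", "d4"}
retrieved = {"d1", "d2", "d3", "d8", "d9"}
print(precision_recall(retrieved, relevant))  # (0.6, 0.75)
```

Adding more non-relevant ids to `retrieved` lowers the first number but leaves the second unchanged.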

Natural Language Queries

Modern IR systems, like AltaVista and Infoseek, allow users to enter natural language queries.
This makes searching more intuitive, though most users typically enter only one or two keywords
instead of long queries.

Functional Overview:
A total Information Storage and Retrieval System is composed of four major functional
processes:
1) Item Normalization
2) Selective Dissemination of Information (i.e., “Mail”)
3) Archival Document Database Search
4) Index Database Search, along with the Automatic File Build process that supports
Index Files

1) Item Normalization

*Item Normalization* is the process of converting incoming data into a standard format that the
system can understand and work with. This step is essential for ensuring that all data, regardless of
its original format, is compatible with the system’s processing and search capabilities. Here's a
breakdown of its key aspects:

---

### *1. Logical Restructuring and Standardization*


- *Purpose*: To ensure all incoming data is in a consistent format.
- *Example*: Foreign language text can be converted to Unicode, a universal encoding system.
Similarly, video files can be converted to formats like MPEG-2, and images to formats like JPEG.
- *Benefits*: Allows the system to process diverse types of data, including text, audio, video,
and images.

---

### *2. Processing Tokens*


- *What are Tokens?*: Tokens are the smallest meaningful units in the text, such as words or
symbols.
- *Steps in Token Processing*:
1. *Identification*: Recognize and separate the individual tokens in the text (e.g., “information retrieval” into “information” and “retrieval”).
2. *Stemming*: Remove endings from words to reduce them to their base form (e.g., “running” →
“run”).
3. *Characterization*: Understand the role of tokens (e.g., is it a valid word, a separator, or a
special character?).
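
The three steps can be sketched in a few lines. This is a toy illustration, not a production tokenizer: the suffix list, the token classes, and the function names are assumptions, and the stemmer is a naive suffix stripper rather than a standard algorithm such as Porter's.

```python
import re

SUFFIXES = ("ing", "ed", "s")  # toy suffix list, assumed for illustration


def identify_tokens(text):
    """Split text into tokens made of letters, digits, or apostrophes."""
    return re.findall(r"[A-Za-z0-9']+", text)


def characterize(token):
    """Label a token with a coarse class."""
    if token.isalpha():
        return "word"
    if token.isdigit():
        return "number"
    return "special"  # e.g. contains an apostrophe or mixes letters and digits


def naive_stem(token):
    """Strip one common suffix, then collapse a doubled final consonant (runn -> run)."""
    word = token.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)]
            if len(word) >= 2 and word[-1] == word[-2] and word[-1] not in "aeiouls":
                word = word[:-1]
            break
    return word


for tok in identify_tokens("Running systems, indexed in 1999."):
    print(tok, characterize(tok), naive_stem(tok))
```

The first line printed is `Running word run`, matching the stemming example in the list.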

---

### *3. Zoning*


- *Definition*: Dividing an item into logical sections like Title, Author, Abstract, Main Text, and
References.
- *Purpose*: Makes searches more precise.
- *Example*: If searching for "Einstein," the system can avoid matches in the Bibliography zone and
focus on the Main Text.
- *Structure*: Zones can overlap and be hierarchical, which helps users refine searches further.
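
A zone-restricted search can be sketched by storing each item as a mapping from zone name to text. The zone names and the sample item below are invented for the example, and overlapping or hierarchical zones are not modeled.

```python
# Hypothetical zoned item: zone name -> text of that zone.
item = {
    "title": "A Short History of Relativity",
    "main_text": "Einstein published the special theory in 1905.",
    "references": "Einstein, A. (1905). On the electrodynamics of moving bodies.",
}


def search_zones(item, term, zones):
    """Return the requested zones whose text contains the term (case-insensitive)."""
    term = term.lower()
    return [zone for zone in zones if term in item.get(zone, "").lower()]


print(search_zones(item, "Einstein", ["main_text"]))  # ['main_text']
print(search_zones(item, "Einstein", list(item)))     # ['main_text', 'references']
```

Restricting the search to `main_text` skips the match in the reference list, which is the precision gain zoning is meant to give.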

---

### *4. Handling Symbols*


- *Classes of Symbols*:
1. *Valid Word Symbols*: Alphabetic characters, numbers, etc.
2. *Inter-word Symbols*: Blanks, periods, semicolons, etc., that separate words.
3. *Special Processing Symbols*: Apostrophes, hyphens, or domain-specific symbols.
- *Customization*: Each language or domain has unique requirements for recognizing symbols
(e.g., an apostrophe is important in names like O'Connor).

---

### *5. Stop Lists and Stop Algorithms*


- *Purpose*: Save resources by removing tokens with little value, like common words (“the,”
“and”) or irrelevant numbers.
- *Examples of Stop Algorithms*:
- Ignore numbers greater than 999999 (except dates).
- Discard tokens with mixed letters and numbers if irrelevant.
- *Current Trend*: With advancements in memory and storage, the use of Stop Lists is becoming
less critical.
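
A stop list and the two stop algorithms above can be combined into one filter. This is a minimal sketch under stated assumptions: the stop list is tiny, every mixed letter-and-digit token is discarded (the notes only say "if irrelevant"), and the exception for dates is not modeled.

```python
STOP_WORDS = {"the", "and", "of", "a", "in"}  # tiny illustrative stop list


def is_stopped(token):
    """Return True if the token should be dropped before indexing."""
    lowered = token.lower()
    if lowered in STOP_WORDS:
        return True  # stop list: common words
    if lowered.isdecimal() and int(lowered) > 999999:
        return True  # stop algorithm: very large numbers
    has_alpha = any(c.isalpha() for c in lowered)
    has_digit = any(c.isdigit() for c in lowered)
    if has_alpha and has_digit:
        return True  # stop algorithm: mixed letters and digits
    return False


tokens = ["The", "retrieval", "of", "12345678", "1999", "x9k2", "items"]
print([t for t in tokens if not is_stopped(t)])  # ['retrieval', '1999', 'items']
```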

---

### *Why Item Normalization is Important*


1. *Standardization*: Ensures all data is processed consistently, no matter its source or type.
2. *Search Optimization*: Helps the system focus on relevant content and discard unnecessary data.
3. *Improved Precision*: By zoning and token processing, the system provides more accurate
search results.
4. *Resource Efficiency*: Saves system resources by removing unimportant data.

2) Selective Dissemination of Information (SDI)

*Selective Dissemination of Information (SDI)* is a system process that automatically delivers new
and relevant information to users based on their specified interests. This method ensures users
receive personalized updates without needing to search manually.

---

### *How SDI Works*


1. *User Profiles*:
- Each user creates a *profile* containing their topics of interest.
- A profile is a broad search statement (e.g., "Artificial Intelligence in Healthcare").
- It also includes the user’s designated mail files, where matched items will be sent.

2. *New Items Processing*:


- When new information (like documents or articles) enters the system, it is compared against all
user profiles.
- This process is dynamic and happens for every newly received item.

3. *Matching and Dissemination*:


- If an item matches a user’s profile, it is sent to their mail file.
- This ensures that users only receive information relevant to their interests.
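
The profile, matching, and mail file steps can be sketched as follows. The user names, the profile terms, and the rule "every profile term must appear in the item" are assumptions made for the example; real profiles are broader search statements.

```python
from collections import defaultdict

# Hypothetical profiles: user -> set of terms standing in for a search statement.
profiles = {
    "alice": {"machine", "learning"},
    "bob": {"healthcare"},
}
mail_files = defaultdict(list)  # user -> ids of items delivered to that user


def disseminate(item_id, text):
    """Compare one new item against every profile and file it for each match."""
    words = set(text.lower().split())
    matched = [user for user, terms in profiles.items() if terms <= words]
    for user in matched:
        mail_files[user].append(item_id)
    return matched


disseminate("doc-1", "Machine learning methods for healthcare records")
disseminate("doc-2", "A survey of database indexing")
print(dict(mail_files))  # {'alice': ['doc-1'], 'bob': ['doc-1']}
```

Note the direction of the search: in SDI the stored profiles play the role of queries and each arriving item is tested against them, the reverse of an ad hoc search.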

---

### *Key Components of the SDI Process*


1. *Search Process*:
- The system continuously matches new data against user profiles.
- It performs searches dynamically as new items are added to the system.

2. *User Profiles*:
- Profiles define what each user is interested in.
- Broad and flexible search statements allow users to receive diverse but relevant information.

3. *Mail Files*:
- These are storage spaces where matched documents are sent.
- Each user has dedicated mail files to organize the received information.

---

### *Advantages of SDI*


1. *Personalization*: Users receive only the information that matches their specific needs and
interests.
2. *Time-Saving*: Users don’t need to manually search for updates—they are delivered
automatically.
3. *Efficiency*: The system processes data in real time, ensuring timely delivery of relevant
information.
4. *Scalability*: SDI can handle multiple user profiles and a large influx of new data.

---

### *Limitations*

1. *Broad Search Statements*: Profiles might lead to receiving some irrelevant items if the search
criteria are too broad.
2. *Lack of Multimedia Support*: Currently, SDI systems focus mainly on text-based information
and do not fully support multimedia data like videos or images.

---

### *Example in Real Life*


Consider an academic researcher interested in the latest articles on "Machine Learning." They set up
an SDI profile with this topic. When new articles or papers are added to the database, the system
identifies those matching their profile and sends them directly to their mailbox.

3) Document Database Search

*Simplified Explanation:*
The *Document Database Search* allows users to search through all the information stored in the
system. This includes documents, articles, or any data received and saved. The key components of
this process are:

1. *Query-Based Search:*
- Users input queries to find relevant documents.
- These queries are often created as needed (called *ad hoc queries*).

2. *Stored Documents:*
- The system keeps all received documents in a database, known as the *Document Database*.
- Once stored, these documents are not edited or changed.

3. *Search Process:*
- The system scans the database to find items matching the user’s query.

*Significance:*
- This process ensures users can access all the documents the system has received.
- It is useful for retrieving past information quickly and accurately.

*Example:*
If a user wants to find all documents related to “Artificial Intelligence,” they can enter the query,
and the system will search the database to retrieve relevant documents.

4) Index Database Search

*Simplified Explanation:*
The *Index Database Search* allows users to save important documents for future use by filing
them with specific tags or descriptions. This process involves:

1. *Indexing:*
- Users can *file documents* into special categories called *index files*.
- Additional tags or descriptions (index terms) can be added to make searching easier later.

2. *Types of Index Files:*


- *Private Index Files:*
- Created by individual users.
- These files reference only a small subset of the total database.
- Access is limited to specific users.

- *Public Index Files:*


- Created and maintained by library professionals.
- These files often reference every document in the database.
- Access is available to a larger group of users, depending on permissions.

3. *Search Process:*
- Users can search both private and public index files to find saved documents.
- The system helps users generate indexes automatically through *Automatic File Build
(Information Extraction)*.
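
Filing and searching an index file might look like the sketch below. The `IndexFile` class, its fields, and the owner-only access rule for private files are hypothetical; they only illustrate the private/public distinction, not any particular system.

```python
from dataclasses import dataclass, field


@dataclass
class IndexFile:
    name: str
    owner: str
    public: bool = False  # private unless stated otherwise
    entries: dict = field(default_factory=dict)  # doc id -> set of index terms

    def file_document(self, doc_id, terms):
        """File a document under the given index terms."""
        self.entries.setdefault(doc_id, set()).update(t.lower() for t in terms)

    def search(self, term, user):
        """Return ids of filed documents tagged with the term."""
        if not self.public and user != self.owner:
            raise PermissionError(f"{user} may not read {self.name}")
        return sorted(d for d, terms in self.entries.items() if term.lower() in terms)


ml = IndexFile(name="ml-papers", owner="researcher")
ml.file_document("doc-7", ["Machine Learning", "neural networks"])
ml.file_document("doc-9", ["machine learning"])
print(ml.search("machine learning", user="researcher"))  # ['doc-7', 'doc-9']
```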

*Significance:*
- Indexing organizes documents for quick and efficient retrieval.
- Private files enable personalized storage, while public files serve broader organizational needs.

*Example:*
- A researcher can create a private index file for documents on “Machine Learning” and tag them
with specific keywords. Later, they can search this index file to find those documents easily.

Relationship to Database Management Systems

Integrating Database Management Systems (DBMS) with Information Retrieval (IR) systems is
crucial for managing both structured data and unstructured information effectively.

Practical Importance: Combining DBMS and IR systems improves data handling, enabling users to
search and retrieve both structured (tables, records) and unstructured (text, documents) data
seamlessly.

Commercial Integration: Many commercial database systems have integrated these capabilities:

INQUIRE DBMS: One of the first to combine DBMS and IR systems, available for over 15 years.

ORACLE DBMS: Offers an embedded IR tool called CONVECTIS, which uses a thesaurus to
generate "themes" for items, improving search accuracy.

INFORMIX DBMS: Links to RetrievalWare, enabling the integration of structured data with
advanced IR functions.

Key Benefits:

1. Enhanced Search: Users can query structured data alongside related unstructured data for a
complete view.

2. Efficient Retrieval: Features like thesaurus-based searches (e.g., in ORACLE) allow for more
intuitive and comprehensive results.

3. Unified Systems: Integration reduces the need for separate systems, saving time and improving
user experience.

This integration bridges the gap between traditional databases and modern information retrieval
needs, offering a more powerful and flexible approach to managing diverse types of data.

Digital Libraries

*Digital libraries* are online platforms where books, research papers, videos, images, and other
materials are stored digitally. These libraries let users access information from anywhere, anytime,
using devices like smartphones, laptops, or tablets. They aim to make knowledge more accessible
and preserve important materials in a digital format.

#### *Significance of Digital Libraries*


1. *Convenience*: You can access resources 24/7 without going to a physical location.
2. *Preservation*: Rare and valuable resources are digitized, preventing wear and tear.
3. *Global Access*: Resources can be shared across the world, helping remote learners.
4. *Search Efficiency*: Tools like keywords and filters make finding information quick.
5. *Cost Savings*: No need for physical space or maintenance, reducing costs for institutions.
6. *Eco-Friendly*: Reduces the need for paper and printing, supporting a greener planet.

Data Warehouses

A *data warehouse* is a large storage system used by businesses to collect and manage data from
various sources. Unlike regular storage, it organizes data in a way that makes it easy to analyze and
create reports, helping businesses make informed decisions. It’s like a giant library for business
information.

#### *Significance of Data Warehouses*


1. *Centralized Data*: Combines information from different departments, like sales, marketing, and
finance.
2. *Improved Decision-Making*: Helps businesses analyze trends and make better plans.
3. *Fast Insights*: Processes massive amounts of data quickly for real-time reporting.
4. *Historical Records*: Stores older data, which can be used for long-term analysis.
5. *Better Performance*: Reduces the workload on regular databases by handling heavy data
analysis separately.
6. *Accuracy*: Provides reliable and consistent data for critical business strategies.

*2.2 Browse Capabilities*

Browse capabilities help users explore search results efficiently by summarizing and organizing the
information. They allow users to identify and select relevant items for display, making the process
of finding what they need easier and more intuitive.

---

#### *Displaying Results*


- *Line Item Status:* Results are displayed as a list, where each line represents a summary of an
item.
- *Data Visualization:* Uses visual tools like 2D or 3D graphs to represent search results. Each
point on the graph represents an item, and its position shows how closely it relates to the user’s
query. Clusters of points indicate related topics, making it easier to browse similar items.

---

#### *2.2.1 Ranking*


Ranking organizes results based on relevance scores, helping users quickly find the most important
items.
- *Relevance Scores:* Scores range from 0.0 (least relevant) to 1.0 (most relevant), indicating how
well an item matches the query.
- *Collaborative Filtering:* This technique, used by platforms like Amazon, MovieFinder, and
CDNow, suggests results based on what other users with similar interests have selected.
- *Graphical Ranking:* Relevance can also be visualized using color or position on a graph. For
example, clusters of related items can be highlighted to make browsing easier.
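
Ranking by relevance score can be illustrated with a toy scoring function. The score used here (the fraction of query terms found in the item) is an assumption chosen only because it stays between 0.0 and 1.0; real systems use more elaborate weighting.

```python
def relevance(query, text):
    """Toy score in [0.0, 1.0]: fraction of distinct query terms present in the text."""
    query_terms = set(query.lower().split())
    words = set(text.lower().split())
    return len(query_terms & words) / len(query_terms) if query_terms else 0.0


def rank(query, items):
    """Return (score, item_id) pairs with the highest score first."""
    scored = [(relevance(query, text), item_id) for item_id, text in items.items()]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


items = {
    "a": "computer design principles",
    "b": "garden design",
    "c": "cooking for beginners",
}
print(rank("computer design", items))  # [(1.0, 'a'), (0.5, 'b'), (0.0, 'c')]
```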

---

#### *2.2.2 Zoning*


Zoning focuses on specific sections (zones) of a document, like headings, passages, or specific
paragraphs, to display the most relevant parts of an item.
- *Locality-Based Searches:* Ensures users only see the most meaningful sections of a document,
saving time and reducing unnecessary reading.

---

#### *2.2.3 Highlighting*


Highlighting emphasizes key words or phrases in search results, making it easier to find relevant
content.
- *Starting Points:* Browsing starts with the first highlighted word or section, and users can jump to
the next highlight as needed.
- *Color and Intensity:* Different colors or shades indicate how important a word or section is to
the search.
- *Paragraph-Based Highlights:* Some systems, like DCARS, allow users to browse results in the
order of paragraphs or words that contributed most to the item’s relevance score.
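
In a plain-text setting, highlighting can be imitated by wrapping matched terms in a marker. The `**` marker is an assumption for the example; an interactive system would use color or intensity instead.

```python
import re


def highlight(text, terms, marker="**"):
    """Wrap whole-word, case-insensitive matches of any term in the marker."""
    if not terms:
        return text
    pattern = r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b"
    return re.sub(
        pattern,
        lambda m: f"{marker}{m.group(0)}{marker}",
        text,
        flags=re.IGNORECASE,
    )


print(highlight("Computer design and computer graphics", ["computer"]))
# **Computer** design and **computer** graphics
```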

---

### *Why Browse Capabilities Matter*


1. *Simplifies Search:* Makes large sets of results easier to understand and navigate.
2. *Improves Accuracy:* Tools like ranking, zoning, and highlighting help users focus on what’s
most relevant.
3. *Visual Aid:* Graphical displays and highlighting reduce the effort needed to sift through
unrelated content.
4. *User-Friendly:* Collaborative filtering and clustering provide personalized and intuitive
browsing experiences.

2.3 Miscellaneous Capabilities


Miscellaneous capabilities refer to additional tools that enhance search systems, making them more
user-friendly, efficient, and flexible. These features help users refine searches, explore word
relationships, and save time by reusing previous queries.

#### 2.3.1 Vocabulary Browse


Vocabulary Browse allows users to explore the words in the database in alphabetical order. It
displays unique words, their occurrences, and helps users understand the impact of search variations
like using wildcards (e.g., “compul*” for “compulsion,” “compulsive,” etc.). It also identifies
common errors, like typing “computen” instead of “computer.” This feature simplifies the process
of constructing accurate search queries.

#### 2.3.2 Iterative Search and Search History Log

Iterative search makes refining results easier by applying new conditions to previous search
outcomes. It helps users focus on relevant items without starting over. Relevance feedback lets users
improve results further. The search history log saves all searches from the session, enabling quick
access to modify or revisit previous queries.

#### 2.3.3 Canned Query


Canned queries save time by allowing users to store and reuse frequently used searches. Once
saved, these queries can be refined or expanded with new criteria as needed. Variables can be added
to canned queries, offering flexibility to adjust search parameters during execution. This feature is
especially useful for repetitive tasks or ongoing research.
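
A canned query with variables can be sketched with Python's `string.Template`. The query syntax (`AND`, `year >=`) and the helper names are hypothetical; the sketch only shows storing a query once and filling in its variables at execution time.

```python
from string import Template

canned = {}  # query name -> stored query template


def save_query(name, query_text):
    """Store a query; $name placeholders mark the variables."""
    canned[name] = Template(query_text)


def run_canned(name, **variables):
    """Return the stored query with its variables filled in."""
    return canned[name].substitute(**variables)


save_query("recent-topic", "$topic AND year >= $year")
print(run_canned("recent-topic", topic="machine learning", year=2020))
# machine learning AND year >= 2020
```

`substitute` raises `KeyError` when a variable is missing, which is a reasonable way to catch an incomplete canned query before it runs.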

2.1 Search Capabilities

The goal of search capabilities is to map a user's needs to relevant items in a database. Users can
input queries using natural language or Boolean logic. Some systems allow search terms to be
"weighted" based on importance to improve results.

2.1.1 Boolean Logic

Boolean logic connects search terms using operators like AND, OR, and NOT to retrieve relevant
information. These operations work by set intersection (AND), set union (OR), and set difference
(NOT). Special Boolean searches like "M of N" retrieve items that contain any M of the N listed
terms.
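
The set view of Boolean operators maps directly onto Python sets. The inverted index below is invented for the example, and "M of N" is read here as "at least M of the N terms", which is an assumption.

```python
# Hypothetical inverted index: term -> ids of the items containing it.
index = {
    "computer": {1, 2, 3},
    "design": {2, 3, 4},
    "garden": {4, 5},
}

both = index["computer"] & index["design"]    # AND -> set intersection
either = index["computer"] | index["garden"]  # OR  -> set union
without = index["design"] - index["garden"]   # NOT -> set difference


def m_of_n(terms, m):
    """Return ids of items that contain at least m of the given terms."""
    counts = {}
    for term in terms:
        for item_id in index.get(term, set()):
            counts[item_id] = counts.get(item_id, 0) + 1
    return {item_id for item_id, n in counts.items() if n >= m}


print(sorted(both), sorted(either), sorted(without))
# [2, 3] [1, 2, 3, 4, 5] [2, 3]
print(sorted(m_of_n(["computer", "design", "garden"], 2)))  # [2, 3, 4]
```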

2.1.2 Proximity

Proximity restricts how close search terms must be in a document to be considered related,
improving precision. For example, terms like "COMPUTER" and "DESIGN" close together suggest
relevance.
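
A word-distance proximity test can be sketched as below. Measuring distance in words and ignoring term order are assumptions; systems may also measure in sentences or paragraphs and may require a direction.

```python
def within(text, term_a, term_b, max_distance):
    """True if the two terms occur within max_distance words of each other."""
    words = text.lower().split()
    pos_a = [i for i, w in enumerate(words) if w == term_a.lower()]
    pos_b = [i for i, w in enumerate(words) if w == term_b.lower()]
    return any(abs(i - j) <= max_distance for i in pos_a for j in pos_b)


print(within("the computer aided design tool", "COMPUTER", "DESIGN", 2))
# True
print(within("computer sales fell while garden design grew", "COMPUTER", "DESIGN", 2))
# False
```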

2.1.3 Contiguous Word Phrases

A Contiguous Word Phrase treats multiple words as a single unit, like "United States of America."
This search method is similar to proximity but allows for greater specificity in querying exact
phrases.

2.1.4 Fuzzy Searches

Fuzzy searches help find terms with similar spellings, useful for handling typos. For example, a
search for "computer" might also find "compiter" or "conputer," improving recall but reducing
precision.
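
One way to approximate a fuzzy search is the standard library's `difflib.get_close_matches`, which compares a term against a vocabulary by string similarity. This is a stand-in technique, not the method of any particular IR system; the vocabulary and the 0.8 cutoff are assumptions.

```python
from difflib import get_close_matches

# Hypothetical vocabulary containing two misspellings and one unrelated near match.
vocabulary = ["computer", "compiter", "conputer", "commuter", "design"]

matches = get_close_matches("computer", vocabulary, n=5, cutoff=0.8)
print(matches)
```

The result contains the two misspellings, which raises recall, but also "commuter", a correctly spelled and unrelated word, which is the loss of precision described above.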

2.1.5 Term Masking

Term masking allows part of a search term to be replaced by a mask symbol, expanding the search.
For example, "COMPUTER*" can find terms like "COMPUTER" and "COMPUTERS." Masking can be used for
prefix, suffix, or embedded searches.
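
Mask expansion against a vocabulary can be sketched with the standard library's `fnmatch` module, assuming `*` is the mask symbol and stands for any run of characters. The vocabulary is invented for the example. Note that `fnmatch` also gives special meaning to `?` and `[...]`, which a real masking facility may not.

```python
from fnmatch import fnmatchcase

vocabulary = [
    "computer", "computers", "computerize",
    "minicomputer", "microcomputers", "compute",
]


def expand_mask(mask):
    """Return the vocabulary terms that match the mask (case-insensitive)."""
    return [term for term in vocabulary if fnmatchcase(term, mask.lower())]


print(expand_mask("COMPUTER*"))   # mask at the end
# ['computer', 'computers', 'computerize']
print(expand_mask("*COMPUTER"))   # mask at the front
# ['computer', 'minicomputer']
print(expand_mask("*COMPUTER*"))  # mask on both sides
# ['computer', 'computers', 'computerize', 'minicomputer', 'microcomputers']
```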

2.1.6 Numeric and Date Ranges

Masking doesn't work for numeric or date ranges. Instead, specialized range queries are used to find
values like numbers greater than 125.

2.1.7 Concept/Thesaurus Expansion

Thesaurus and concept class expansion help find terms related in meaning. A thesaurus expands
search terms based on language, while concept classes expand based on related ideas in a structured
tree.
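
The difference between the two kinds of expansion can be sketched with a synonym table and a small concept tree. Both structures and all of their entries are hypothetical.

```python
# Hypothetical thesaurus (synonyms) and concept tree (concept -> narrower concepts).
thesaurus = {"car": {"automobile", "auto"}}
concept_tree = {"vehicle": ["car", "truck"], "car": ["sedan", "hatchback"]}


def expand_with_thesaurus(term):
    """Return the term together with its listed synonyms."""
    return {term} | thesaurus.get(term, set())


def expand_with_concepts(term):
    """Return the term plus every concept below it in the tree."""
    expanded, stack = set(), [term]
    while stack:
        current = stack.pop()
        if current not in expanded:
            expanded.add(current)
            stack.extend(concept_tree.get(current, []))
    return expanded


print(sorted(expand_with_thesaurus("car")))
# ['auto', 'automobile', 'car']
print(sorted(expand_with_concepts("vehicle")))
# ['car', 'hatchback', 'sedan', 'truck', 'vehicle']
```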

2.1.8 Natural Language Queries

Natural language queries allow users to input questions directly. While they improve recall, they
can reduce precision, especially when using negation.
