Module 1 - Introduction

The document provides an overview of information retrieval systems, including their definition, objectives, and key components and processes like normalization, indexing, and searching. It also discusses measures like precision and recall.

Uploaded by

Jaswanthh Krishna

Module 1- Introduction

Dr.D.SARASWATHI 1
• Introduction to Information Storage and Information Retrieval (IR) –
Definition and Objectives – Functional overview – Relationship to
Database Management Systems – Digital libraries – Data Warehouses

Definition
An Information Retrieval System is a system that is
capable of storage, retrieval, and maintenance of
information.

• Information in this context includes text (including numeric and
date data), images, audio, video and other multi-media objects.
Web Search

EXCALIBUR VISUAL RETRIEVALWARE FINDS PIX IN A DATABASE - Tech Monitor

• An Information Retrieval System consists of a software
program that facilitates a user in finding the information the
user needs.

• The system may use standard computer hardware or specialized
hardware to support the search subfunction and to convert
non-textual sources to a searchable medium (e.g., transcription
of audio to text).
What does IR success depend on?

• The gauge of success of an information system is how well it can
minimize the overhead for a user to find the needed information.
• Overhead is the time required to find the information needed,
excluding the time for actually reading the relevant data.
• Search composition, search execution, and reading non-relevant
items are all part of this overhead.
• The first Information Retrieval Systems originated with the need
to organize information in central repositories (e.g., libraries)
(Hyman-82).
• Catalogues were created to facilitate the identification and
retrieval of items.
Web Search Engine
• Web search engines give access to terabytes of information: over
800 million indexable pages.
Excite

webseek

Objectives of Information Retrieval Systems
• To minimize the overhead of a user locating needed information.

• Overhead: query generation, query execution, scanning the results
of the query to select items to read, and reading non-relevant items.
• In information retrieval the term “relevant” item is
used to represent an item containing the needed
information.

• From a user’s perspective “relevant” and “needed” are synonymous.
Effects of Search on Total Document Space

• When a user decides to issue a search looking for information on a topic,
the total database is logically divided into four segments.

• Relevant items are those documents that contain information that helps
the searcher in answering his question.

• Non-relevant items are those items that do not provide any directly useful
information.

• There are two possibilities with respect to each item:

• it can be retrieved or not retrieved by the user’s query
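
The four segments can be illustrated with a short Python sketch. The function name `segment_database` and the item numbers are hypothetical, chosen only for illustration; in practice a system does not know the full set of relevant items in advance.

```python
def segment_database(all_items, relevant, retrieved):
    """Split the database into the four logical segments created by a search."""
    all_items, relevant, retrieved = set(all_items), set(relevant), set(retrieved)
    return {
        "relevant_retrieved": relevant & retrieved,
        "relevant_not_retrieved": relevant - retrieved,
        "non_relevant_retrieved": retrieved - relevant,
        "non_relevant_not_retrieved": all_items - relevant - retrieved,
    }


# Hypothetical database of ten items, three of which are relevant.
segments = segment_database(range(10), relevant={1, 2, 3}, retrieved={2, 3, 4, 5})
for name, items in segments.items():
    print(name, sorted(items))
```

Every item falls into exactly one segment, because each item is either relevant or not, and either retrieved or not.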

• The two major measures commonly associated
with information systems are

1) Precision
2) Recall

Precision and Recall
• Precision is directly affected by the retrieval of non-relevant
items: as more of them are retrieved, it drops towards a number
close to zero.
• Recall is not affected by the retrieval of non-relevant items and
thus remains at 100 per cent once achieved.
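
A minimal Python sketch of the two measures, assuming the usual definitions: precision is the number of relevant items retrieved divided by the total number retrieved, and recall is the number of relevant items retrieved divided by the total number of relevant items in the database. The item numbers are made up for illustration.

```python
def precision(retrieved, relevant):
    """Fraction of the retrieved items that are relevant."""
    return len(retrieved & relevant) / len(retrieved) if retrieved else 0.0


def recall(retrieved, relevant):
    """Fraction of the relevant items that were retrieved."""
    return len(retrieved & relevant) / len(relevant) if relevant else 0.0


relevant = set(range(5))        # hypothetical: items 0..4 are relevant
exact_hit = set(range(5))       # a search that returns exactly those items
whole_db = set(range(100))      # a search that returns the entire database

print(precision(exact_hit, relevant), recall(exact_hit, relevant))
print(precision(whole_db, relevant), recall(whole_db, relevant))
```

Retrieving the whole database keeps recall at 100 per cent but pushes precision down to the share of relevant items in the database.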
Ideal Precision and Recall

Ideal Precision/Recall Graph

Achievable Precision/Recall Graph
• Information Retrieval Systems such as RetrievalWare, TOPIC,
AltaVista, Infoseek and INQUERY show that accepting natural
language queries is becoming a standard system feature.
• This allows users to state in natural language what they are
interested in finding.
• But the completeness of the user specification is limited by
the user’s willingness to construct long natural language
queries.
• Most users on the Internet enter one or two search terms.
Vocabulary Domains

Functional Overview
• A total Information Storage and Retrieval System is
composed of four major functional processes:

1) Item Normalization
2) Selective Dissemination of Information (i.e., “Mail”)
3) Archival Document Database Search
4) Index Database Search, along with the Automatic File Build
process that supports Index Files.
1) Item Normalization
• The first step in any integrated system is to normalize
the incoming items to a standard format.
• Item normalization provides logical restructuring of the
item.
• Additional operations during item normalization are
needed to create a searchable data structure:
• identification of processing tokens (e.g., words),
• characterization of the tokens, and
• stemming (e.g., removing word endings) of the tokens.
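
The three operations listed above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not how any particular system does it: the suffix list is deliberately tiny, and "characterization" is reduced here to labelling a token as a number or a word. All function names are hypothetical.

```python
import re

# Toy suffix list (an assumption for this sketch); real stemmers use many more rules.
SUFFIXES = ("ing", "ed", "es", "s")


def identify_tokens(text):
    """Identify processing tokens: runs of letters or digits, lower-cased."""
    return re.findall(r"[a-z0-9]+", text.lower())


def characterize(token):
    """Very coarse characterization: is the token a number or a word?"""
    return "number" if token.isdecimal() else "word"


def stem(token):
    """Remove one known word ending, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def normalize(text):
    return [(stem(t), characterize(t)) for t in identify_tokens(text)]


print(normalize("Searching 61 indexed items"))
```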
Total Information Retrieval

Text Normalization Process
• The processing tokens and their characterization
are used to define the searchable text from the
total received text.
• Standardizing the input takes the different external
formats of input data and performs the translation
to the formats acceptable to the system.
• A system may have a single format for all items or
allow multiple formats.

• One example of standardization could be translation of
foreign languages into Unicode.
• Every language has a different internal binary encoding
for the characters in the language. One standard
encoding that covers English, French, Spanish, etc. is
ISO-Latin.
• To assist the users in generating indexes, especially the
professional indexers, the system provides a process
called Automatic File Build (AFB).
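
The encoding standardization described above can be sketched in Python, whose `str` type holds Unicode text. The sketch assumes that "ISO-Latin" refers to ISO 8859-1 (Python codec name `latin-1`); the helper name `to_unicode` and the sample text are hypothetical.

```python
def to_unicode(raw: bytes, source_encoding: str) -> str:
    """Translate bytes in an external encoding into the internal Unicode form."""
    return raw.decode(source_encoding)


# The same text arriving from two sources in two different external encodings.
latin1_bytes = "café señor".encode("latin-1")
utf8_bytes = "café señor".encode("utf-8")

print(latin1_bytes != utf8_bytes)  # the raw bytes differ
print(to_unicode(latin1_bytes, "latin-1") == to_unicode(utf8_bytes, "utf-8"))
```

Once both inputs are in the same internal form, the same search and indexing code can run over either of them.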
• Multi-media adds an extra dimension to the normalization
process.
• In addition to normalizing the textual input, the multi-media
input also needs to be standardized.
• Higher quality video: MPEG-2, MPEG-1, AVI
• Lower quality video: Real Media
• Audio: WAV or Real Media (Real Audio)
• Images: formats range from JPEG to BMP
• The next process is to parse the item into logical
sub-divisions that have meaning to the user.
• This process, called “Zoning,” is visible to the
user and used to increase the precision of a
search and optimize the display.

Zoning
• A typical item is subdivided into zones, which
may overlap and can be hierarchical, such as
Title, Author, Abstract, Main Text, Conclusion,
and References.
• The zoning information is passed to the
processing token identification operation to
store the information, allowing searches to be
restricted to a specific zone.
• For example, if the user is interested in
articles discussing “Einstein” then the
search should not include the Bibliography,
which could include references to articles
written by “Einstein.”
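
A minimal Python sketch of a zone-restricted search, using the "Einstein" example above. The two items, their zone contents, and the `search` function are hypothetical and only meant to show the effect on precision.

```python
# Two hypothetical items, each already split into zones.
items = {
    "doc1": {
        "Title": "Relativity explained",
        "Main Text": "Einstein proposed that the speed of light is constant.",
        "References": "Bohr, N. Collected works.",
    },
    "doc2": {
        "Title": "Notes on wave functions",
        "Main Text": "A short survey of quantum mechanics.",
        "References": "Einstein, A. Selected papers.",
    },
}


def search(items, term, zones=None):
    """Return ids of items containing term, optionally only within given zones."""
    term = term.lower()
    hits = []
    for item_id, item_zones in items.items():
        for zone, text in item_zones.items():
            if (zones is None or zone in zones) and term in text.lower():
                hits.append(item_id)
                break
    return hits


print(search(items, "einstein"))
print(search(items, "einstein", zones={"Title", "Main Text"}))
```

Without the zone restriction, `doc2` is retrieved only because Einstein appears in its references.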

• Systems determine words by dividing input symbols
into 3 classes:

1) Valid word symbols

2) Inter-word symbols
3) Special processing symbols.

Word
• A word is defined as a contiguous set of word
symbols bounded by inter-word symbols.

• In many systems inter-word symbols are non-searchable and
should be carefully selected.
Examples
• word symbols - alphabetic characters and
numbers
• possible inter-word symbols - blanks, periods,
semicolons and apostrophes
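
The definition of a word can be sketched in Python using the example symbol classes above. One assumption is made loudly here: any symbol that is neither a word symbol nor an inter-word symbol is treated as a special processing symbol and simply kept inside the word. Real systems apply their own rules to such symbols.

```python
import string

WORD_SYMBOLS = set(string.ascii_letters + string.digits)
INTER_WORD_SYMBOLS = set(" .;'")  # blank, period, semicolon, apostrophe


def split_words(text):
    """Return contiguous runs of symbols bounded by inter-word symbols."""
    words, current = [], []
    for ch in text:
        if ch in WORD_SYMBOLS:
            current.append(ch)
        elif ch in INTER_WORD_SYMBOLS:
            if current:
                words.append("".join(current))
                current = []
        else:
            # Assumption for this sketch: special processing symbols stay in the word.
            current.append(ch)
    if current:
        words.append("".join(current))
    return words


print(split_words("The user's multi-media item; done."))
```

Because the apostrophe is an inter-word symbol in this sketch, "user's" becomes the two words "user" and "s", which illustrates why inter-word symbols should be carefully selected.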

Stop List/Algorithm
• Applied to the list of potential processing tokens.
• The objective of the Stop function is to save system
resources by eliminating from the set of searchable
processing tokens those that have little value to the
system.
• Given the significant increase in available cheap
memory, storage and processing power, the need to
apply the Stop function to processing tokens is
decreasing.
Examples of Stop algorithms
• Stop all numbers greater than “999999” (this was
selected to allow dates to be searchable).
• Stop any processing token that has numbers and
characters intermixed.
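
The two example Stop algorithms can be sketched as a single Python predicate. The function name `is_stopped` is hypothetical, and "characters" is read here as alphabetic characters, which is an assumption.

```python
def is_stopped(token: str) -> bool:
    """Return True if the token should be dropped from the searchable set."""
    if token.isdecimal():
        # Rule 1: stop all numbers greater than 999999.
        return int(token) > 999999
    has_digit = any(ch.isdecimal() for ch in token)
    has_letter = any(ch.isalpha() for ch in token)
    # Rule 2: stop tokens that intermix numbers and letters.
    return has_digit and has_letter


candidates = ["1999", "999999", "1000000", "x86", "retrieval"]
print([t for t in candidates if not is_stopped(t)])
```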

2) Selective Dissemination (Distribution,
Spreading) of Information
• Selective Dissemination of Information (SDI) is a
service that delivers information to users based on
their interests.

• This can be done through various methods like email, RSS feeds,
or newsletters.
• It is a proactive approach to information dissemination: the
provider creates a profile of the user’s information needs and
regularly delivers new publications, research papers, news
articles, or any other relevant material matching that profile.
• SDI helps users stay up to date with the latest information in
their field of interest, which is valuable when the field
changes rapidly.
• The SDI system has been used extensively in academic and
research settings to support researchers, scholars, and
students.
• It has also been used in various fields, such as library,
business, law, healthcare, and government, to provide
relevant information to decision-makers.
• The system can be customized to match the user’s specific
needs, ensuring that they receive only the information that is
relevant to them.
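
A minimal Python sketch of profile matching in an SDI service. The user names, profile terms, and the `disseminate` function are hypothetical; real profiles are usually richer than a set of terms, and matching is usually more than a word overlap.

```python
# Hypothetical user profiles: each user's interests as a set of terms.
profiles = {
    "alice": {"retrieval", "indexing"},
    "bob": {"warehouse", "mining"},
}


def disseminate(item_text, profiles):
    """Return the users whose profile shares at least one term with the item."""
    words = set(item_text.lower().split())
    return sorted(user for user, terms in profiles.items() if terms & words)


print(disseminate("New indexing standards for digital libraries", profiles))
print(disseminate("Data mining meets text retrieval", profiles))
```

Each incoming item is compared against every stored profile, which is the reverse of a normal search, where one query is compared against every stored item.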
3) Document Database Search
• The Document Database Search process provides the
capability for a query to search against all items received by
the system.
• It is composed of the search process, user-entered queries
(typically ad hoc queries) and the document database, which
contains all items that have been received, processed and
stored by the system.
• Typically, items in the Document Database do not change
(i.e., are not edited) once received.
4) Index Database Search
• When an item is determined to be of interest, a user may
want to save it for future reference.
• This is in effect filing it.
• In an information system this is accomplished via the index
process.
• In this process the user can logically store an item in a file
along with additional index terms and descriptive text the
user wants to associate with the item.
• The Index Database Search Process provides the capability to
create indexes and search them.
• There are 2 classes of index files:

1) Public Index files

2) Private Index files

• Every user can have one or more Private Index files leading to a very
large number of files.
• Each Private Index file references only a small subset of the total
number of items in the Document Database.
• Public Index files are maintained by professional library services
personnel and typically index every item in the Document Database.
• There is a small number of Public Index files.
• These files have access lists (i.e., lists of users and their privileges)
that allow anyone to search or retrieve data.
• Private Index files typically have very limited access lists.
• To assist the users in generating indexes, especially the professional
indexers, the system provides a process called Automatic File Build
(also called Information Extraction).
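
The difference between Public and Private Index files can be sketched with a small Python class. Everything here is an illustrative assumption: the class `IndexFile`, the privilege names `search` and `update`, and the use of `"*"` to mean "anyone".

```python
from dataclasses import dataclass, field


@dataclass
class IndexFile:
    name: str
    access: dict = field(default_factory=dict)   # user -> set of privileges
    entries: dict = field(default_factory=dict)  # item id -> set of index terms

    def _check(self, user, privilege):
        allowed = self.access.get(user, set()) | self.access.get("*", set())
        if privilege not in allowed:
            raise PermissionError(f"{user} may not {privilege} {self.name}")

    def file_item(self, user, item_id, terms):
        """Logically store an item in the file with its index terms."""
        self._check(user, "update")
        self.entries.setdefault(item_id, set()).update(t.lower() for t in terms)

    def search(self, user, term):
        self._check(user, "search")
        return sorted(i for i, terms in self.entries.items() if term.lower() in terms)


# Public file: anyone may search, only the professional indexer may update.
public = IndexFile("public", access={"*": {"search"}, "indexer": {"search", "update"}})
# Private file: a very limited access list.
private = IndexFile("alice-private", access={"alice": {"search", "update"}})

public.file_item("indexer", "doc1", ["Zoning", "Precision"])
private.file_item("alice", "doc1", ["read later"])
print(public.search("bob", "zoning"))
```

In this sketch a search of `private` by any user other than `alice` raises `PermissionError`, while any user can search `public`.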
5) Multimedia Database Search
• From a system perspective, the multi-media data is
not logically its own data structure, but an
augmentation to the existing structures in the
Information Retrieval System.

Relationship to DBMS
• From a practical standpoint, the integration of DBMS’s
and Information Retrieval Systems is very important.
• Commercial database companies have already
integrated the two types of systems.
• One of the first commercial databases to integrate the
two systems into a single view is the INQUIRE DBMS.
• This has been available for over fifteen years.

• A more current example is the ORACLE DBMS, which now offers
an embedded capability called CONVECTIS: an information
retrieval system that uses a comprehensive thesaurus as the
basis for generating “themes” for a particular item.
• The INFORMIX DBMS has the ability to link to RetrievalWare
to provide integration of structured data and information
along with functions associated with Information Retrieval
Systems.
Digital Libraries and Data Warehouses
(DataMarts)
• As the Internet continued its exponential growth and project funding
became available, the topic of Digital Libraries has grown.
• By 1995 enough research and pilot efforts had started to support the
First ACM International Conference on Digital Libraries (Fox-96).
• Indexing is one of the critical disciplines in library science and
significant effort has gone into the establishment of indexing and
cataloging standards.
• Migration of many of the library products to a digital format
introduces both opportunities and challenges.
• Information Storage and Retrieval technology has addressed a small
subset of the issues associated with Digital Libraries.
• Data warehouses are similar to information storage and
retrieval systems in that they both have a need for search
and retrieval of information.
• But a data warehouse is more focused on structured data
and decision support technologies.
• In addition to the normal search process, a complete
system provides a flexible set of analytical tools to “mine”
the data.
• Data mining (originally called Knowledge Discovery in
Databases - KDD) is a search process that automatically
analyzes data and extracts relationships and dependencies
that were not part of the database design.