Information Retrieval

This document provides an introduction to information retrieval concepts and systems. It discusses key concepts like queries, documents, document substitutes, and file structures. It also outlines some common operations in information retrieval systems, including query operations like parsing and feedback, term operations like stemming and truncation, and document operations like parsing and ranking. The document presents information retrieval as the process of matching user queries to stored documents through the use of search engines and retrieval systems.

Uploaded by

Chuks Valentine

CSC 231: Data Management I

Information Retrieval (IR)

Introduction
Suppose you want to find a Microsoft Word document on your computer. You open the
system's search tool and type a keyword from the document, and the search tool returns
that document along with other related documents. It can do this because the document
was saved where the search tool could find it. This is the basic idea of information retrieval.

This study session introduces IR concepts and presents a domain model of IR systems that
describes their similarities and differences. It also discusses the relationship of IR systems
to other information systems and the evaluation of IR systems.

2.1 Concept of Information Retrieval

Automated information retrieval (IR) systems were originally developed to help manage the
huge scientific literature that has developed since the 1940s. Many university, corporate, and
public libraries now use IR systems to provide access to books, journals, and other documents.

An IR system matches user queries (formal statements of information needs) to documents stored
in a database. A document is a data object, usually textual, though it may also contain other types
of data such as photographs, graphs, and so on.

Box 2.1: Definition of Information Retrieval

Information Retrieval is the activity of obtaining information resources relevant to an
information need from a collection of information resources. Searches can be based on metadata
(data that describes other data) or on full-text indexing.

Often, the documents themselves are not stored directly in the IR system, but are represented in
the system by document substitutes. This study material, for example, is a document and could
be stored in its entirety in an IR database. You might instead choose to create a document
substitute for it consisting of the title, author, and abstract.

This is typically done for efficiency, that is, to reduce the size of the database and searching time.
Document substitutes are also called documents. An IR system must support certain basic
operations. There must be a way to enter documents into a database, change the documents, and
delete them. There must also be some way to search for documents, and present them to a user.

An information retrieval process begins when a user enters a query into the system. The query
is matched against the information in the database. Depending on the application, the data
objects may be, for example, text documents, images, audio, mind maps, or videos. As noted
above, the documents themselves are often not stored directly in the IR system but are
represented in the system by document substitutes or metadata.

Box 2.2: Query

Queries are formal statements of information needs, for example search strings in web search
engines. In information retrieval, a query does not uniquely identify a single object in the
collection (an object is an entity that is represented by information in a database). Instead,
several objects may match the query, perhaps with different degrees of relevance.

Most IR systems compute a numeric score for how well each object in the database matches the
query, and rank the objects according to this value. The top-ranking objects are then shown to the
user. The process may be repeated if the user wishes to refine the query. Figure 2.1 below shows
the path of information from the query to the retrieval of the records.

Figure 2.1: Abstract Model of Information Retrieval
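The score-and-rank step described above can be sketched in a few lines of Python. This is only an illustration: the scoring rule (counting occurrences of query terms) and the names score, rank, and docs are assumptions made for the example, not part of any particular IR system.

```python
from collections import Counter

def score(query_terms, doc_text):
    """Assumed crude score: how many times the query terms occur in the document."""
    counts = Counter(doc_text.lower().split())
    return sum(counts[t] for t in query_terms)

def rank(query, docs, top_k=3):
    """Return up to top_k (document id, score) pairs, best score first."""
    terms = query.lower().split()
    scored = [(score(terms, text), doc_id) for doc_id, text in docs.items()]
    # Keep only documents that match at least one query term.
    scored = [(s, d) for s, d in scored if s > 0]
    # Sort by descending score; ties are broken by document id.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(d, s) for s, d in scored[:top_k]]

# Hypothetical three-document collection.
docs = {
    "d1": "information retrieval matches queries to documents",
    "d2": "a database stores documents and documents have fields",
    "d3": "signal processing on audio",
}
print(rank("retrieval documents", docs))
```

Several documents can match the same query with different scores, which is why the result is a ranked list rather than a single object.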

2.2 Domain Analysis of Information Retrieval (IR) Systems

The first steps in domain analysis are to identify important concepts and vocabulary in the
domain, define them, and organize them with a faceted classification. Table 2.1 is a faceted
classification for IR systems, containing important IR concepts and vocabulary. The first row of
the table specifies the facets.
Table 2.1: Faceted Classification of IR Systems

Conceptual model   File structure   Query operations   Term operations   Document operations   Hardware
Boolean            Inverted file    Feedback           Stem              Parse                 von Neumann
Extended Boolean   Signature        Parse              Weight            Display, Cluster      Parallel
Probabilistic      PAT trees        Boolean            Thesaurus         Field mask            IR specific
String search      Graphs           Cluster            Stoplist          Rank                  Optical disk
Vector space       Hashing          -                  Truncation        Sort                  Mag. disk

2.2.1 File Structures

When designing an IR system, choosing the type of file structure is a fundamental decision.

The file structures used in IR systems are:

1. Flat files,
2. Inverted files,
3. Signature files,
4. PAT trees,
5. Graphs.

Though it is possible to keep file structures in main memory, in practice they are usually stored
on disk because of their size.

Using a flat file approach, one or more documents are stored in a file, usually as ASCII
(American Standard Code for Information Interchange) or EBCDIC (Extended Binary Coded
Decimal Interchange Code) text.

Flat file searching is usually done via pattern matching. On UNIX, for example, you can store a
document collection with one document per file in a UNIX directory, and search it using pattern
searching tools such as grep or awk. An example of a flat file structure is shown below.

Figure 2.3: Flat file structure
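As a rough Python counterpart to searching a directory with grep, the sketch below stores one document per file and scans every file for a regular expression. The file names, sample texts, and the function name grep_collection are invented for the example.

```python
import re
import tempfile
from pathlib import Path

def grep_collection(directory, pattern):
    """Return the sorted names of files whose text matches the regular expression."""
    regex = re.compile(pattern, re.IGNORECASE)
    hits = []
    for path in Path(directory).iterdir():
        if path.is_file() and regex.search(path.read_text(encoding="utf-8")):
            hits.append(path.name)
    return sorted(hits)

# Build a throwaway collection: one document per file (hypothetical contents).
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "doc1.txt").write_text("Inverted files speed up searching.", encoding="utf-8")
    Path(tmp, "doc2.txt").write_text("Flat files are searched by pattern matching.", encoding="utf-8")
    flat_hits = grep_collection(tmp, r"pattern\s+match")
    print(flat_hits)
```

Every query reads every file, which is simple but becomes slow as the collection grows. That cost is the motivation for the indexed structures that follow.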

An inverted file is a kind of indexed file. The structure of an inverted file entry is usually
keyword, document-ID, and field-ID. A keyword is an indexing term that describes the
document, document-ID is a unique identifier for a document, and field-ID is a unique name that
indicates from which field in the document the keyword came. Figure 2.4 shows what an inverted
file looks like.

Figure 2.4: Inverted File Structure

Source: http://ads.harvard.edu/pubs/A+AS/2000A+AS..143...85A/img10.gif

Some systems also include information about the paragraph and sentence location where the
term occurs. Searching is done by looking up query terms in the inverted file.
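A minimal sketch of such a structure, assuming each document is a small dictionary of named fields, maps every keyword to its (document-ID, field-ID) entries. The sample documents and the name build_inverted_file are illustrative only.

```python
from collections import defaultdict

def build_inverted_file(documents):
    """Map each keyword to a sorted list of (document-ID, field-ID) postings."""
    index = defaultdict(set)
    for doc_id, fields in documents.items():
        for field_id, text in fields.items():
            for word in text.lower().split():
                index[word].add((doc_id, field_id))
    return {word: sorted(postings) for word, postings in index.items()}

# Hypothetical two-document collection with title and abstract fields.
documents = {
    "doc1": {"title": "boolean retrieval", "abstract": "queries use boolean operators"},
    "doc2": {"title": "signature files", "abstract": "hashing words into signatures"},
}
inverted = build_inverted_file(documents)

# Searching is a dictionary lookup instead of a scan over every document.
print(inverted["boolean"])
print(inverted.get("hashing", []))
```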

Signature files contain signatures that represent documents. There are various ways of
constructing signatures. In one common method, for example, documents are split into logical
blocks, each containing a fixed number of distinct significant (that is, non-stoplist) words.

Each word in the block is hashed to give a signature, a bit pattern with some of the bits set to 1.
The signatures of the words in a block are OR'ed together to create a block signature. The block
signatures are then concatenated to produce the document signature. Searching is done by
comparing the signatures of queries with document signatures.
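The hashing and OR-ing steps can be illustrated as follows. The signature width, the number of bits set per word, and the use of MD5 as the hash function are all assumptions chosen for the sketch, and a Python list of block signatures stands in for the concatenated document signature.

```python
import hashlib

SIGNATURE_BITS = 64   # assumed signature width
BITS_PER_WORD = 3     # assumed number of bit positions set per word

def word_signature(word):
    """Hash a word to a bit pattern with up to BITS_PER_WORD bits set to 1."""
    sig = 0
    for i in range(BITS_PER_WORD):
        digest = hashlib.md5(f"{i}:{word.lower()}".encode("utf-8")).digest()
        position = int.from_bytes(digest[:4], "big") % SIGNATURE_BITS
        sig |= 1 << position
    return sig

def block_signature(words):
    """OR the word signatures of one block together."""
    sig = 0
    for w in words:
        sig |= word_signature(w)
    return sig

def document_signature(blocks):
    """One signature per block, in order (the list stands in for concatenation)."""
    return [block_signature(words) for words in blocks]

def document_may_contain(doc_sig, word):
    """True if some block has every bit of the word's signature set.
    True can be a false positive; False means the word is definitely absent."""
    ws = word_signature(word)
    return any((block & ws) == ws for block in doc_sig)

blocks = [["signature", "files", "represent"], ["documents", "hashed", "words"]]
doc_sig = document_signature(blocks)
print(document_may_contain(doc_sig, "hashed"))
```

Because different words can set the same bits, a signature match only says a document might contain the query word, so matching documents normally have to be checked against their text.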

PAT trees are Patricia trees constructed over all sistrings (semi-infinite strings) in a text. If a
document collection is viewed as a sequentially numbered array of characters, a sistring is a
subsequence of characters from the array starting at a given point and extending an arbitrary
distance to the right. A Patricia tree is a digital tree where the individual bits of the keys are
used to decide branching.
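To make the idea of a sistring concrete, the sketch below lists the sistrings that start at word boundaries and finds those beginning with a given prefix. Note the substitution: a sorted list searched with binary search stands in for the Patricia tree, which would branch on individual bits instead. The sample text and function names are assumptions for the example.

```python
from bisect import bisect_left

def sistrings(text):
    """Return (start position, sistring) pairs for sistrings starting at word boundaries."""
    starts = [i for i in range(len(text))
              if text[i] != " " and (i == 0 or text[i - 1] == " ")]
    return [(i, text[i:]) for i in starts]

def prefix_search(text, prefix):
    """Return the start positions of all sistrings that begin with prefix."""
    ordered = sorted(sistrings(text), key=lambda pair: pair[1])
    keys = [s for _, s in ordered]
    # All sistrings sharing a prefix are adjacent in sorted order.
    pos = bisect_left(keys, prefix)
    hits = []
    while pos < len(keys) and keys[pos].startswith(prefix):
        hits.append(ordered[pos][0])
        pos += 1
    return sorted(hits)

text = "once upon a time in a far away land"
print(prefix_search(text, "a"))
print(prefix_search(text, "a f"))
```

Because a sistring extends to the right without a fixed end, a prefix search can match phrases that span several words, as the second query shows.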

Graphs, or networks, are ordered collections of nodes connected by arcs. They can be used to
represent documents in various ways. For example, a kind of graph called a semantic net can be
used to represent the semantic relationships in text often lost in the indexing systems above.

Although interesting, graph-based techniques for IR are impractical now because of the amount
of manual effort that would be needed to represent a large document collection in this form.

2.2.2 Query Operations

Queries are formal statements of information needs put to the IR system by users. The operations
on queries depend on the type of query and the capabilities of the IR system.
One common query operation is parsing, that is, breaking the query into its constituent elements.

Boolean queries, for example, must be parsed into their constituent terms and operators. The set
of document identifiers associated with each query term is retrieved, and the sets are then
combined according to the Boolean operators.
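The parse-then-combine procedure might look like the following in Python. It is a deliberately simplified sketch: it assumes a flat query with no parentheses, evaluates operators strictly left to right, and uses a hand-made postings dictionary in place of a real inverted file.

```python
def evaluate_boolean(query, postings):
    """Evaluate a flat query such as 'retrieval AND index NOT signature'.
    Assumed simplification: no parentheses and no operator precedence."""
    tokens = query.split()
    # Start with the set of document identifiers for the first term.
    result = set(postings.get(tokens[0].lower(), set()))
    i = 1
    while i < len(tokens) - 1:
        op, term = tokens[i].upper(), tokens[i + 1].lower()
        docs = postings.get(term, set())
        if op == "AND":
            result &= docs      # set intersection
        elif op == "OR":
            result |= docs      # set union
        elif op == "NOT":
            result -= docs      # set difference
        else:
            raise ValueError(f"unknown operator: {op}")
        i += 2
    return sorted(result)

# Hypothetical postings: term -> set of document identifiers.
postings = {"retrieval": {1, 2, 3}, "index": {2, 3}, "signature": {3}}
print(evaluate_boolean("retrieval AND index NOT signature", postings))
print(evaluate_boolean("index OR signature", postings))
```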

In feedback, information from previous searches is used to modify queries. For example, terms
from relevant documents found by a query may be added to the query, and terms from
non-relevant documents deleted. There is some evidence that feedback can significantly improve
IR performance.

Figure 2.5 below shows the Boolean operators in a query.

Figure 2.5: A Boolean Query

Source: http://pdfgold.helpmax.net/en/search-and-index/searching-pdfs/
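The feedback operation described earlier in this section can also be sketched briefly. The rule used here (add terms that occur only in relevant documents, drop terms that occur only in non-relevant ones) is one simple assumed interpretation, and the sample texts are invented.

```python
def apply_feedback(query_terms, relevant_docs, nonrelevant_docs):
    """Return a modified query as a sorted list of terms."""
    relevant_terms = set()
    for text in relevant_docs:
        relevant_terms.update(text.lower().split())
    nonrelevant_terms = set()
    for text in nonrelevant_docs:
        nonrelevant_terms.update(text.lower().split())
    # Terms seen only in relevant documents are added;
    # terms seen only in non-relevant documents are removed.
    added = relevant_terms - nonrelevant_terms
    removed = nonrelevant_terms - relevant_terms
    return sorted((set(query_terms) | added) - removed)

new_query = apply_feedback(
    ["jaguar", "car"],
    relevant_docs=["jaguar cat habitat"],
    nonrelevant_docs=["jaguar car dealer"],
)
print(new_query)
```

Practical feedback schemes usually weight terms rather than adding or deleting them outright, but the direction of the change is the same.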

2.2.3 Term Operations

There are different types of operations on terms in an IR system.

They include:

• Stemming,
• Truncation,
• Weighting,
• Stoplist and
• Thesaurus operations

Stemming is the automated conflation (fusing or combining) of related words, by reducing the
words to a common root form.
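A toy illustration of stemming is given below. It simply strips one of a few common suffixes, which is an assumption made to keep the example short. Production stemmers use much fuller rule sets.

```python
SUFFIXES = ("ing", "ed", "es", "s")  # assumed, deliberately tiny suffix list

def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters of stem."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

variants = ["engineer", "engineered", "engineering", "engineers"]
print({naive_stem(w) for w in variants})
```

All four variants reduce to the same root, so a query containing any one of them can match documents containing the others.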

Truncation is manual conflation of terms by using wildcard characters in the word, so that the
truncated term will match multiple words. For example, if you are interested in finding
documents about truncation, you might enter the term "truncat?" which would match terms such
as truncate, truncated, and truncation.
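The matching behaviour in this example can be reproduced with a regular expression. The sketch assumes, as in the example above, that "?" stands for any run of characters. Wildcard symbols differ between systems, so this is a convention chosen for illustration.

```python
import re

def truncation_matches(pattern, vocabulary):
    """Return the vocabulary words matched by a truncated term, treating '?' as any run of characters."""
    parts = pattern.lower().split("?")
    regex = re.compile(".*".join(re.escape(part) for part in parts))
    return sorted(w for w in vocabulary if regex.fullmatch(w.lower()))

vocabulary = ["truncate", "truncated", "truncation", "trunk", "stem"]
print(truncation_matches("truncat?", vocabulary))
```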

Another way of conflating related terms is to use a thesaurus, which lists synonymous terms
and sometimes the relationships among them.

A stoplist is a list of words considered to have no indexing value, used to eliminate potential
indexing terms. Each potential indexing term is checked against the stoplist and eliminated if
found there.

In weighting, indexing or query terms are assigned numerical values, usually based on information
about the statistical distribution of terms, that is, the frequencies with which terms occur in
documents, document collections, or subsets of document collections such as documents
considered relevant to a query.
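One widely used weighting of this kind is tf-idf, which multiplies a term's frequency in a document by the logarithm of the inverse of the fraction of documents containing it. The text above does not name a specific scheme, so tf-idf is shown here only as an assumed example, with invented sample documents.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return {doc_id: {term: weight}} using weight = tf * log(N / df)."""
    n = len(docs)
    tokenized = {d: text.lower().split() for d, text in docs.items()}
    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for words in tokenized.values():
        df.update(set(words))
    weights = {}
    for d, words in tokenized.items():
        tf = Counter(words)  # term frequency within this document
        weights[d] = {t: tf[t] * math.log(n / df[t]) for t in tf}
    return weights

weights = tf_idf({"d1": "boolean query boolean", "d2": "query ranking"})
print(weights["d1"])
```

A term that occurs in every document ("query" here) gets weight zero, reflecting that it does nothing to distinguish one document from another.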

2.2.4 Document Operations

Documents are the primary objects in IR systems and there are many operations for them. In
many types of IR systems, documents added to a database must be given unique identifiers,
parsed into their constituent fields, and those fields broken into field identifiers and terms. Once
documents are in the database, you may wish to mask off certain fields for searching and display.

For example, you may wish to search only the title and abstract fields of documents for a given
query, or may wish to see only the title and author of retrieved documents. You may also wish to
sort retrieved documents by some field, for example by author.
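Field masking and sorting can be sketched with plain dictionaries. The records, field names, and author labels below are hypothetical and exist only to show the three operations: searching selected fields, displaying selected fields, and sorting by a field.

```python
# Hypothetical records; the field names and values are invented for illustration.
records = [
    {"id": 1, "title": "Signature files", "author": "Author C", "abstract": "hashing words into bit patterns"},
    {"id": 2, "title": "Inverted files", "author": "Author A", "abstract": "keyword lookup with postings"},
    {"id": 3, "title": "Ranking", "author": "Author B", "abstract": "ordering inverted file results"},
]

def search_fields(records, term, search_in=("title", "abstract")):
    """Search only the fields named in search_in (all other fields are masked off)."""
    term = term.lower()
    return [r for r in records if any(term in r[f].lower() for f in search_in)]

def display(records, show=("title", "author"), sort_by="author"):
    """Show only the chosen fields, sorted by one field."""
    ordered = sorted(records, key=lambda r: r[sort_by])
    return [{f: r[f] for f in show} for r in ordered]

found = search_fields(records, "inverted")
print(display(found))
```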

Display operations include printing the documents and displaying them on a screen such as a CRT.
Using information about term distributions, it is possible to assign a probability of relevance to
each document in a retrieved set, allowing retrieved documents to be ranked in order of probable
relevance. Term distribution information can also be used to cluster similar documents in a
document space.

Display deserves particular attention. The user interface of an IR system, as with
any other type of information system, is critical to its successful usage. Note that user interface
algorithms and data structures are not IR specific.

2.2.5 Functional View of Paradigm IR System

This section describes the activities associated with a common type of Boolean IR system, chosen
because it represents the operational standard for IR systems. Figure 2.6 below shows an example
of a Boolean IR system.

Figure 2.6: Example of Boolean IR system

When building the database, documents are taken one by one, and their text is broken into words.
The words from the documents are compared against a stoplist, a list of words thought to have
no indexing value.

Words from the document not found in the stoplist may next be stemmed. Words may then also
be counted, since the frequencies of words in documents and in the database as a whole are often
used for ranking retrieved documents.

Finally, the words and associated information such as the documents, fields within the
documents, and counts are put into the database. The database then might consist of pairs of
document identifiers and keywords as follows.

keyword1 - document1-Field_2
keyword2 - document1-Field_2, 5
keyword2 - document3-Field_1, 2
keyword3 - document3-Field_3, 4

keyword-n - document-n-Field_i, j

Such a structure is called an inverted file. In an IR system, each document must have a unique
identifier, and its fields, if field operations are supported, must have unique field names.
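The database-building steps of this section (break text into words, drop stoplist words, stem, count, store) can be combined into one short sketch. The stoplist, the toy stemmer, and the sample documents are assumptions for illustration, and field handling is left out to keep the example small.

```python
from collections import defaultdict

STOPLIST = {"a", "an", "and", "the", "of", "to", "in", "is", "are"}  # assumed sample stoplist

def stem(word):
    # Assumed toy stemmer: strip one common suffix; real systems use fuller rule sets.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def build_database(documents):
    """Return {keyword: {document id: count}} after stoplist removal and stemming."""
    database = defaultdict(lambda: defaultdict(int))
    for doc_id, text in documents.items():
        for word in text.lower().split():
            if word in STOPLIST:
                continue  # words on the stoplist have no indexing value
            database[stem(word)][doc_id] += 1
    return {keyword: dict(counts) for keyword, counts in database.items()}

documents = {
    "document1": "the engineer engineered a bridge",
    "document3": "engineering of bridges is hard",
}
database = build_database(documents)
print(database["engineer"])
print(database["bridge"])
```

The stored counts are the frequency information that later allows a retrieved set to be ranked.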

To search the database, you enter a query consisting of a set of keywords connected by
Boolean operators. The query is parsed into its constituent terms and Boolean operators. These
terms are then looked up in the inverted file, and the lists of document identifiers corresponding
to them are combined according to the specified Boolean operators.

If frequency information has been kept, the retrieved set may be ranked in order of probable
relevance. The result of the search is then presented to you.

In some systems, you make judgments about the relevance of the retrieved documents, and this
information is used to modify the query automatically by adding terms from relevant documents
and deleting terms from non-relevant documents. Systems such as this give remarkably good
retrieval performance given their simplicity, but their performance is far from perfect. Many
techniques to improve them have been proposed.

One such technique aims to establish a connection between morphologically related terms.
Stemming is a technique for conflating term variants so that the semantic closeness of words like
"engineer," "engineered," and "engineering" will be recognized in searching. Another way to
relate terms is via thesauri, or synonym lists.

2.3 Information Retrieval System Evaluations

IR systems can be evaluated in terms of many criteria including execution efficiency, storage
efficiency, retrieval effectiveness, and the features they offer the user. The relative
importance of these factors must be decided by the designers of the system, and the selection of
appropriate data structures and algorithms for implementation will depend on these decisions.

Execution efficiency is measured by the time it takes a system, or part of a system, to perform a
computation. This can be measured in C-based systems by using profiling tools such as prof on
UNIX. Execution efficiency has always been a major concern of IR systems since most of them
are interactive, and a long retrieval time will interfere with the usefulness of the system.

Storage efficiency is measured by the number of bytes needed to store data. Space overhead, a
common measure of storage efficiency, is the ratio of the size of the index files plus the size of
the document files to the size of the document files. Space overhead ratios of 1.5 to 3 are
typical for IR systems based on inverted files.
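As a small worked example of the space overhead ratio, the byte counts below are hypothetical figures, chosen only to land at the low end of the typical range.

```python
def space_overhead(index_bytes, document_bytes):
    """(size of index files + size of document files) / size of document files."""
    return (index_bytes + document_bytes) / document_bytes

# Hypothetical sizes: a 50 MB index over 100 MB of documents.
overhead = space_overhead(index_bytes=50_000_000, document_bytes=100_000_000)
print(overhead)
```

A ratio of 1.5 means the index adds half as much storage again on top of the documents themselves. A ratio of 1.0 would mean no index at all.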

Most IR experimentation has focused on retrieval effectiveness, usually based on document
relevance judgments. This has been a problem since relevance judgments are subjective and
unreliable. That is, different judges will assign different relevance values to a document retrieved
in response to a given query.

The seriousness of the problem is the subject of debate, with many IR researchers arguing that
the relevance judgment reliability problem is not sufficient to invalidate the experiments that use
relevance judgments.

Many measures of retrieval effectiveness have been proposed. The most commonly used are
recall and precision.

Box 2.3: Recall and Precision

Recall is the ratio of the number of relevant documents retrieved for a given query to the number
of relevant documents for that query in the database. Except for small test collections, this
denominator is generally unknown and must be estimated by sampling or some other method.

Precision is the ratio of the number of relevant documents retrieved to the total number of
documents retrieved. Both recall and precision take on values between 0 and 1.
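The two ratios are easy to compute once the retrieved set and the set of relevant documents are known. The document identifiers in this sketch are made up, and the function name is only illustrative.

```python
def recall_precision(retrieved, relevant):
    """Return (recall, precision) for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    recall = hits / len(relevant) if relevant else 0.0
    precision = hits / len(retrieved) if retrieved else 0.0
    return recall, precision

# Four documents retrieved; five documents in the database are relevant.
recall, precision = recall_precision(retrieved=[1, 2, 3, 4], relevant=[2, 4, 5, 6, 7])
print(recall, precision)
```

Here two of the five relevant documents were found (recall 0.4) and two of the four retrieved documents were relevant (precision 0.5).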

Since you often wish to compare IR performance in terms of both recall and precision, methods
for evaluating them simultaneously have been developed. One method involves the use of
recall-precision graphs, which are bivariate plots where one axis is recall and the other precision.

Recall-precision plots show that recall and precision are inversely related. That is, when
precision goes up, recall typically goes down and vice-versa.

Figure 2.7: Recall-precision Graph

A combined measure of recall and precision, E, is defined as:

$$E = 1 - \frac{(1 + b^{2})\,P\,R}{b^{2}P + R}$$

where P = precision, R = recall, and b is a measure of the relative importance, to a user, of recall
and precision. Experimenters choose values of b that they hope will reflect the recall and
precision interests of the typical user. For example, b levels of 0.5, indicating that a user is twice
as interested in precision as in recall, and 2, indicating that a user is twice as interested in recall
as in precision, might be used.
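The effect of b can be checked numerically. The sketch below assumes the usual form of the combined measure, E = 1 - ((1 + b^2) P R) / (b^2 P + R), in which lower values of E indicate better performance. The precision and recall values are hypothetical.

```python
def e_measure(precision, recall, b=1.0):
    """E = 1 - ((1 + b^2) * P * R) / (b^2 * P + R); lower is better."""
    if precision == 0 and recall == 0:
        return 1.0  # nothing relevant retrieved: worst possible value
    return 1 - ((1 + b**2) * precision * recall) / (b**2 * precision + recall)

# Hypothetical system with precision 0.5 and recall 0.4.
for b in (0.5, 1.0, 2.0):
    print(b, round(e_measure(0.5, 0.4, b), 4))
```

Because this system's recall is lower than its precision, E is worse (higher) at b = 2, where recall counts more, than at b = 0.5.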

IR experiments often use test collections, which consist of a document database and a set of
queries for the database for which relevance judgments are available. The number of documents
in test collections has tended to be small, typically a few hundred to a few thousand documents.
Test collections are available on an optical disk.

Table 2.3: IR Test Collections

Collection   Subject                  Documents   Queries
ADI          Information science      82          35
CACM         Computer science         3200        64
CISI         Library science          1460        76
CRAN         Aeronautics              1400        225
LISA         Library science          6004        35
MED          Medicine                 1033        30
NLM          Medicine                 3078        155
NPL          Electrical engineering   11429       100
TIME         General articles         423         83

Summary:

1. An IR system matches user queries (formal statements of information needs) to documents
stored in a database. A document is a data object, usually textual, though it may also contain
other types of data such as photographs, graphs, and so on.

2. An IR conceptual model is a general approach to IR systems. Several taxonomies for IR
conceptual models have been proposed. Faloutsos gives three basic approaches: text pattern
search, inverted file search, and signature search. Belkin and Croft categorize IR
conceptual models differently: they divide retrieval techniques first into exact match and inexact
match.

3. The type of file structure to use is an important decision to make before using a document
database. The file structures used in IR systems are flat files, inverted files, signature files, PAT
trees, and graphs. Though it is possible to keep file structures in main memory, in practice IR
databases are usually stored on disk because of their size.

5. Queries are formal statements of information needs put to the IR system by users. The
operations on queries depend on the type of query and the capabilities of the IR
system. One common query operation is parsing, that is, breaking the query into its constituent
elements.

6. Operations on terms in an IR system include stemming, truncation, weighting, stoplist and
thesaurus operations.

7. Documents are the primary objects in IR systems and there are many operations for them. In
many types of IR systems, documents added to a database must be given unique identifiers,
parsed into their constituent fields, and those fields broken into field identifiers and terms.

8. IR systems can be evaluated in terms of many criteria including execution efficiency, storage
efficiency, retrieval effectiveness, and the features they offer a user. The relative importance of
these factors must be decided by the designers of the system, and the selection of appropriate
data structures and algorithms for implementation will depend on these decisions.

Glossary of Terms
Information Retrieval (IR) is the activity of obtaining information resources relevant to an
information need from a collection of information resources. Searches can be based on metadata
(data that describes other data) or on full-text indexing.

Queries are formal statements of information needs, for example search strings in web search
engines.
