Unit - 1

Uploaded by

Sree Dhathri

Topics:

• Introduction: Definitions
• Objectives
• Functional Overview
• Relationship to DBMS
• Digital libraries and Data Warehouses
Introduction to Information Retrieval
Simple view of the IR process

[Figure: a user with an information need; a document collection; the set of relevant documents retrieved from it]

The set of documents in the answer MUST be relevant to
the user’s information need. Otherwise the IR process
results in complete failure.
Information Retrieval
[Figure: the user submits a query to the IR system. The system compares the information need with the information, generates a ranking which reflects relevance, and returns a ranked list of documents. The user's feedback flows back to the system.]
Definitions of Data and Information
• Data is raw, unorganized facts that need to be
processed.
• When data is processed, organized, structured or
presented in a given context so as to make it
useful, it is called Information.
• Information may be text (including numeric and
date data), images, video and other multimedia
objects.
Information Retrieval
• The process of actively seeking information
relevant to a topic of interest.
• “Document” is the generic term for an information
holder (book, chapter, article, webpage, etc.).
• Information retrieval locates relevant documents
on the basis of user input such as keywords or
phrases.
Definition of IRS
• An Information Retrieval System is a system that
is capable of storage, retrieval and maintenance of
information.
E.g. web search engines
• The smallest complete textual unit processed and
manipulated by an IR system is called an item.
What is a Search Engine?
• It is a program that helps in locating
information stored on the World Wide Web.
• Two types of search engines:
• Crawler-based
-Create their listings automatically.
-Examples: Google, Yahoo
• Human-based
-Depend on humans for their listings.
-Example: MSN
Components of Search Engine
• They have three major components:
• Crawler or spider
-Finds and retrieves web pages.
• Index or catalog
-Indexes every web page that the crawler finds.
• Search engine software
-Searches entries in the index to find a match.
-Ranks the matches based on relevance.
What about web search?
• First you need to get all the documents of the
web (crawlers).
• Then you have to index them (inverted files).
• Find the web pages that are relevant to the
query.
• Report the pages with their links in sorted
order.
The search process
[Figure: Query entered -> Query interpreted -> Index searched -> Items retrieved -> Results ranked]
About Google
• The name "Google" is a play on the word "googol",
which refers to the number represented by 1
followed by one hundred zeros.
• Google receives over 200 million queries each day
through its various services.
• As of January 2012, Google had indexed 9.7 billion
web pages, 1.3 billion images, and over one billion
Usenet messages, approximately 12 billion items
in total.
How Google works
• Google maintains (probably) the world's largest
Linux cluster (over 15,000 servers).
• These are partitioned between index
servers and page servers.
• Index servers resolve the queries.
• Page servers deliver the results of the queries.
Google finds important pages
• The idea is that the documents on the web
have different degrees of "importance".
• Google will show the most important pages
first.
• The idea is that more important pages are
likely to be more relevant to any query
than non-important pages.
Google Relevance Factors
Google considers over 100 factors, including:
1. PageRank algorithm.
2. Popularity of page.
3. Position and size of the search terms within page.
4. Unique content.
5. Term order.
6. Page size and load time.
7. Error-free websites.
8. Important incoming links.
9. Website optimization.
• The three components of the information
retrieval environment:
-User
-Process
-Collection
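Information Retrieval Black box
[Figure: the user's query and the documents of the collection enter the IR process (a black box), which produces the results]
Inside Black box
[Figure: the query passes through a representation function to give a query representation; the documents pass through a representation function to give document representations, which are stored in the index; a comparison function matches the query representation against the index and produces the results]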
Information Hierarchy
• Data
-The raw material of information
• Information
-Data organized and presented in a particular manner
• Knowledge
-“Justified true belief”
-Information that can be acted upon
• Wisdom
-Distilled and integrated knowledge
-Demonstrative of high-level “understanding”
Example
• Data
98.6 °F, 99.5 °F, 100.3 °F, 101 °F, …
• Information
Hourly body temperature: 98.6 °F, 99.5 °F,
100.3 °F, 101 °F
• Knowledge
If you have a temperature above 100 °F, you
most likely have a fever.
• Wisdom
If you don’t feel well, go see a doctor.
Objectives of IRS
• To minimize the overhead of a user locating
needed information.
• Efficient retrieval of documents. Two major
measures used are:
1. Precision: Measure of how many of the
retrieved documents were in fact relevant.
2. Recall: Measure of how many of the
relevant documents were retrieved.
• Support of user search generation.
• Present the search results in a format that
facilitates the user in determining relevant
items.
Minimize the overhead for
finding information
• Overhead: The time a user spends in all of
the steps leading to reading an item
containing needed information, excluding the
time for actually reading the relevant data.
• Query generation
• Search composition
• Search execution
• Scanning results of query to select items to
read
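Precision and Recall
• When a user decides to issue a search looking for
information on a topic, the total database is logically
divided into four segments: relevant retrieved, relevant
not retrieved, non-relevant retrieved and non-relevant
not retrieved.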
Precision and Recall
Precision: the percentage of retrieved documents that
are in fact relevant to the query.
Recall: the percentage of documents that are relevant
to the query and were, in fact, retrieved.
Measuring Precision and Recall
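The two measures defined above can be sketched in a few lines of Python. This is only an illustration: the function name `precision_recall` and the document identifiers are hypothetical, and the set of relevant documents is assumed to be known in advance, which in practice requires human judgments.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: ids of the documents the system returned
    relevant:  ids of the documents that are actually relevant
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant documents that were retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: 4 documents retrieved, 3 relevant overall, 2 in common.
retrieved = {"d1", "d2", "d3", "d4"}
relevant = {"d2", "d4", "d7"}
p, r = precision_recall(retrieved, relevant)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.50 recall=0.67
```

Retrieving every document in the collection would push recall to 1.0 while precision falls, which is why the two measures are reported together.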
Ideal Precision and Recall Graph
Two More Objectives of IR Systems
• Support of user search generation
-The system should help generate a good search
statement to specify the information a user needs.
• To present the search results in a format
that facilitates the user in determining
relevant items.
-Ranking in order of potential relevance.
Functional Overview
• A total Information Storage and Retrieval
System is composed of four major functional
processes:
1. Item Normalization
2. Selective Dissemination of Information
3. Document Database Search
4. Index Database Search along with the
Automatic File Build process that supports
Index Files.
Total Information Retrieval System
1. Item Normalization
1. Normalizes the incoming items to a standard format.
2. Provides logical restructuring of the item.
3. Operations:
-Identification of processing tokens
-Characterization of tokens
-Stemming of the tokens
Figure: The Text Normalization Process
Standardize Input
• Standardizing the input takes the different
external formats of input data and translates
them into formats acceptable to the system.
• Translate foreign-language text into Unicode.
• Translate multimedia input into a
standard format.
-Video: MPEG-2, MPEG-1, AVI, Real Video…
-Audio: WAV, Real Audio
-Image: GIF, JPEG, BMP…
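For text input, the translation into Unicode might look like the following sketch. It assumes the source encoding of each item is known, and it additionally applies NFC normalization so that equivalent character sequences compare equal; the helper name `standardize_text` is hypothetical, and the slides do not prescribe a particular normalization form.

```python
import unicodedata


def standardize_text(raw: bytes, encoding: str) -> str:
    """Decode bytes in a known external encoding into a Unicode string
    and normalize it to NFC (composed form)."""
    text = raw.decode(encoding)
    return unicodedata.normalize("NFC", text)


# The same word arriving in two different external formats.
latin1_bytes = "caf\u00e9".encode("latin-1")        # precomposed e-acute, Latin-1
utf8_decomposed = "cafe\u0301".encode("utf-8")      # 'e' + combining accent, UTF-8

a = standardize_text(latin1_bytes, "latin-1")
b = standardize_text(utf8_decomposed, "utf-8")
print(a == b)  # True: both inputs end up as the same Unicode string
```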
Logical Subsetting (Zoning)
• Parse the item into logical sub-divisions that
have meaning to the user.
-Title, Author, Abstract, Main Text, Conclusion, References,
Country, Keyword…
• Zones are visible to the user and are used to increase
the precision of a search and optimize the
display.
• The zoning information is passed to the next
operation to store the information, allowing
searches to be restricted to a specific zone.
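A small sketch of how zoning can raise precision: the same term is searched once across all zones and once restricted to a single zone. The items, zone contents and the function `search_zone` are hypothetical and only serve to illustrate the idea.

```python
# Each item is stored as a mapping from zone name to zone text (made-up data).
ITEMS = {
    "item1": {
        "Title": "Inverted file structures",
        "Author": "A. Writer",
        "Abstract": "Survey of index structures for text retrieval.",
    },
    "item2": {
        "Title": "Digital libraries",
        "Author": "B. Index",
        "Abstract": "Collections stored in digital formats.",
    },
}


def search_zone(items, term, zone=None):
    """Return ids of items containing `term`; if `zone` is given,
    only that zone of each item is examined."""
    term = term.lower()
    hits = []
    for item_id, zones in items.items():
        texts = [zones.get(zone, "")] if zone else zones.values()
        if any(term in text.lower().split() for text in texts):
            hits.append(item_id)
    return hits


print(search_zone(ITEMS, "index"))                 # ['item1', 'item2']
print(search_zone(ITEMS, "index", zone="Author"))  # ['item2']
```

A user looking for the author named Index gets one false hit without zoning and none when the search is restricted to the Author zone.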
Identify Processing Tokens
• Identifying the information (words) that
are used in the search process.
• The first step is to determine a word by dividing
input symbols into three classes:
1. Valid word symbols: alphabetic characters, numbers
2. Inter-word symbols: blanks, periods, semicolons (non-searchable)
3. Special processing symbols: hyphen (-)
• A word is defined as a contiguous set of word
symbols bounded by inter-word symbols.
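The three symbol classes can be sketched as a small scanner. The treatment of the hyphen is an assumption made for this example (a hyphen is kept only when it joins two word symbols); real systems apply their own special-processing rules, and the function names are hypothetical.

```python
def classify(ch):
    """Assign a character to one of the three symbol classes."""
    if ch.isalnum():
        return "word"        # valid word symbol
    if ch == "-":
        return "special"     # special processing symbol
    return "inter-word"      # blanks, periods, semicolons, ...


def processing_tokens(text):
    """Split text into words: contiguous word symbols bounded by
    inter-word symbols. Assumption: a hyphen between two word symbols
    stays inside the token; any other hyphen acts as a separator."""
    tokens, current = [], []
    for i, ch in enumerate(text):
        kind = classify(ch)
        if kind == "word":
            current.append(ch)
        elif (kind == "special" and current
              and i + 1 < len(text) and classify(text[i + 1]) == "word"):
            current.append(ch)
        else:
            if current:
                tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens


print(processing_tokens("Full-text retrieval; B-52 in 2021."))
# ['Full-text', 'retrieval', 'B-52', 'in', '2021']
```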
Stop Algorithm
• Next, a Stop List/Algorithm is applied to the
list of potential processing tokens.
• The objective is to save system
resources and processing power by
eliminating the tokens of little value to
the search.
-Any word found in almost every item, i.e. frequently
occurring words.
-Any word found only once or twice in the database.
• The rank-frequency law of Zipf is:
Frequency * Rank = constant
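The two elimination rules above can be sketched as a frequency-based stop list. The thresholds (`max_doc_ratio=0.75` for "almost every item" and `min_total=3` for "once or twice") and the sample items are assumptions chosen for this illustration, not values given in the slides.

```python
from collections import Counter


def build_stop_list(items, max_doc_ratio=0.75, min_total=3):
    """Return the tokens to eliminate.

    items: list of token lists, one per item.
    A token is stopped if it occurs in at least `max_doc_ratio` of the
    items, or fewer than `min_total` times in the whole database.
    """
    doc_freq, total_freq = Counter(), Counter()
    for tokens in items:
        total_freq.update(tokens)       # occurrences over the database
        doc_freq.update(set(tokens))    # number of items containing the token
    n = len(items)
    return {t for t in total_freq
            if doc_freq[t] / n >= max_doc_ratio or total_freq[t] < min_total}


items = [
    ["the", "retrieval", "of", "items"],
    ["the", "signature", "of", "items"],
    ["the", "retrieval", "of", "signature"],
    ["the", "items", "of", "zipf"],
    ["the", "retrieval", "query"],
]
stop = build_stop_list(items)
print(sorted(stop))  # ['of', 'query', 'signature', 'the', 'zipf']
```

Here "the" and "of" are dropped because they occur in almost every item, while "signature", "zipf" and "query" are dropped because they occur fewer than three times; "retrieval" and "items" remain as processing tokens.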
Characterize Tokens
• Identification of any specific
word characteristics.
• The characterization is used in systems to
assist in disambiguation of a particular word.
Stemming Algorithm
• To normalize the token to a
standard semantic representation.
Computer, Compute, Computers, Computing -> Comput
• Reduces the number of unique words the
system has to contain.
• Improves the efficiency of the IR System.
• Should be applied without leading to retrieval
of non-relevant items.
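The example on this slide can be reproduced with a deliberately naive suffix stripper. This is not the Porter stemmer or any other published algorithm; the suffix list and the minimum stem length are assumptions picked so that the four words above collapse to the same stem.

```python
# Suffixes are tried in order; longer ones come first (hypothetical list).
SUFFIXES = ("ers", "ing", "er", "es", "s", "e")


def stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least `min_stem` letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


words = ["Computer", "Compute", "Computers", "Computing"]
print([stem(w) for w in words])  # ['comput', 'comput', 'comput', 'comput']
```

The same crude rule turns "string" into "str", which shows why stemming has to be applied carefully: over-aggressive conflation retrieves non-relevant items.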
Create Searchable Data Structure
• Processing tokens -> Stemming Algorithm ->
updates to the searchable data structure.
• Internal representation (not visible to user).
-Signature file, Inverted list, PAT Tree…
2. Selective Dissemination of Information
• The Selective Dissemination of Information
(Mail) Process provides the capability to
dynamically compare newly received items in
the information system against standing
statements of interest of users and deliver the
item to those users whose statement of
interest matches the contents of the item.
• Consists of:
-Search process
-User statements of interest (Profile)
-User mail file
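The three parts listed above (search process, profiles, mail files) can be sketched as follows. Profiles are modelled here as simple sets of terms and a match means sharing at least one term; both are simplifying assumptions, and the user names and the function `disseminate` are hypothetical.

```python
# Standing statements of interest, one per user (made-up profiles).
PROFILES = {
    "alice": {"retrieval", "index"},
    "bob": {"warehouse"},
}


def disseminate(item_text, profiles, mail_files):
    """Compare a newly received item against every profile and append it
    to the mail file of each user whose profile matches."""
    terms = set(item_text.lower().split())
    for user, interest in profiles.items():
        if interest & terms:  # at least one profile term occurs in the item
            mail_files.setdefault(user, []).append(item_text)
    return mail_files


mail = disseminate("New index structures for text retrieval", PROFILES, {})
print(mail)  # {'alice': ['New index structures for text retrieval']}
```

Unlike a database search, the profiles are static and the items are the dynamic input: each arriving item is run against all profiles.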
3. Document Database (DDB) Search
• The DDB Search Process provides the
capability for a query to search against all items
received by the system.
• Composed of the search process, user-entered
queries and the document database.
4. Index Database Search
• In the index process the user can logically store
an item in a file along with additional index
terms and descriptive text the user wants to
associate with the item.
• The Index Database Search Process provides
the capability to create indexes and search
them.
• There are two classes of index files: Public and
Private Index files.
• Each Private Index file references only a small
subset of the total number of items in the DDB,
whereas the Public Index files typically index
every item in the DDB.
Document D1: “yes we got no bananas”
Document D2: “Johnny Appleseed planted apple seeds.”
Document D3: “we like to eat, eat, eat apples and
bananas”

Vocabulary    Postings List
yes           D1
we            D1, D3
got           D1
no            D1
bananas       D1, D3
Johnny        D2
Appleseed     D2
planted       D2
apple         D2, D3
seeds         D2
like          D3
to            D3
eat           D3
and           D3

Query “apples bananas” (“apples” is stemmed to the term “apple”):
“apples” -> {D2, D3}
“bananas” -> {D1, D3}
Whole query gives the intersection:
{D2, D3} ∩ {D1, D3} = {D3}
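The example above can be reproduced with a short inverted-index sketch. The normalization step is an assumption made for illustration: it lowercases each word and strips a trailing plural "s" from words longer than three letters, as a crude stand-in for stemming, so the stored terms differ slightly from the vocabulary shown on the slide (for instance "banana" instead of "bananas").

```python
import re
from collections import defaultdict

DOCS = {
    "D1": "yes we got no bananas",
    "D2": "Johnny Appleseed planted apple seeds.",
    "D3": "we like to eat, eat, eat apples and bananas",
}


def normalize(word):
    """Lowercase and strip a plural 's' (crude stand-in for stemming)."""
    word = word.lower()
    if len(word) > 3 and word.endswith("s"):
        word = word[:-1]
    return word


def tokenize(text):
    return [normalize(w) for w in re.findall(r"[A-Za-z0-9]+", text)]


def build_index(docs):
    """Map every term to the set of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """AND query: intersect the postings of all query terms."""
    postings = [index.get(term, set()) for term in tokenize(query)]
    return set.intersection(*postings) if postings else set()


index = build_index(DOCS)
print(sorted(index["apple"]))                   # ['D2', 'D3']
print(sorted(index["banana"]))                  # ['D1', 'D3']
print(sorted(search(index, "apples bananas")))  # ['D3']
```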
Relationship to DBMS
• Information retrieval (IR) is supported by an IRS, whereas
data retrieval (DR) is supported by a DBMS.
• DBMS
-Precise semantics
-SQL
-Structured data
-Expects a reasonable number of updates.
-Generates the full answer
• IRS
-Imprecise semantics
-Keyword search
-Unstructured data format
-Read-mostly. Docs are added/updated occasionally.
-Page through top k results.
Information Retrieval vs. Data Retrieval

Design Issues            Data Retrieval    Information Retrieval
Matching                 Exact match       Partial (best) match
Model                    Deterministic     Probabilistic
Classification approach  Monothetic        Polythetic
Query language           Artificial        Natural
Query specification      Complete          Incomplete
Items wanted             Matching          Relevant
Error response           Sensitive         Insensitive
Data representation      Schema            (Mostly) index terms
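The "Matching" row of the table can be illustrated with a small sketch that contrasts an exact-match lookup with a best-match ranking. The records, the word-overlap score and both function names are hypothetical; real IR systems use far richer ranking functions.

```python
RECORDS = [
    {"id": 1, "title": "information retrieval systems"},
    {"id": 2, "title": "database management systems"},
    {"id": 3, "title": "retrieval of digital library information"},
]


def exact_match(records, title):
    """Data retrieval style: a record either satisfies the condition or not."""
    return [r["id"] for r in records if r["title"] == title]


def best_match(records, query, k=2):
    """Information retrieval style: rank records by the number of shared
    words and return the top k with a non-zero score."""
    q = set(query.lower().split())
    scored = [(len(q & set(r["title"].lower().split())), r["id"])
              for r in records]
    ranked = sorted((s for s in scored if s[0] > 0),
                    key=lambda s: (-s[0], s[1]))
    return [doc_id for _, doc_id in ranked[:k]]


print(exact_match(RECORDS, "information retrieval"))  # []
print(best_match(RECORDS, "information retrieval"))   # [1, 3]
```

The exact-match lookup returns nothing because no title equals the query string, while the best-match ranking still returns the two partially matching records.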
Digital Libraries and Data Warehouses
• Three systems which are the repositories of
information are:
1. Digital Libraries
2. Data Warehouses (or Data Marts).
3. Information Storage and Retrieval System.
Digital Libraries
• A digital library is a library in which
information collections are stored in digital
formats (as opposed to print or other media)
and accessible via computers.
• The digital content may be stored locally or
accessed remotely via a network.
Traditional vs. Digital Library
• No physical boundary.
• Round-the-clock availability.
• Multiple access.
• Preservation and conservation.
• More storage space.
• Added value.
• Easily accessible.
Data Warehouses
• A data warehouse is a repository of an organization's
electronically stored data in support of the
decision-making process.
• Data warehouses are designed to facilitate
reporting and analysis.
• The process of transforming data into information and
making it available to the user is known as data
warehousing.
Data Mining: A KDD Process
• Data mining: the core of the knowledge
discovery process.
[Figure: Databases -> Data Cleaning and Data Integration -> Data Warehouse -> Selection -> Task-relevant Data -> Data Mining -> Pattern Evaluation]
