Unit - 1

Uploaded by

Sree Dhathri

Topics:

• Introduction: Definitions
• Objectives
• Functional Overview
• Relationship to DBMS
• Digital libraries and Data Warehouses
Introduction to Information Retrieval
Simple view of the IR process

Figure: The user's information need is matched against the document collection to yield the set of relevant documents.

The set of documents in the answer MUST be relevant to the user’s information need. Otherwise the IR process results in complete failure.
Information Retrieval
Figure: The user sends a query to the IR system. The system compares the information need with the information, generates a ranking which reflects relevance, and returns a ranked list of documents to the user, who can give feedback.
Definitions of Data and Information

• Data is raw, unorganized facts that need to be processed.
• When data is processed, organized, structured or presented in a given context so as to make it useful, it is called Information.
• Information may be text (including numeric and date data), images, video and other multimedia objects.
Information Retrieval
• The process of actively seeking out information relevant to a topic of interest.
• “Document” is the generic term for an information holder (book, chapter, article, webpage, etc.).
• Information retrieval locates relevant documents on the basis of user input such as keywords or phrases.
Definition of IRS
• An Information Retrieval System is a system that is capable of storage, retrieval and maintenance of information.
E.g., web search engines
• The smallest complete textual unit processed and manipulated by an IR system is called an item.
What is a Search Engine?
• It is a program that helps in locating information stored on the World Wide Web.
• Two types of search engines:
• Crawler-based
  - Create their listings automatically.
  - Examples: Google, Yahoo
• Human-based
  - Depend on humans for their listings.
  - Example: MSN
Components of Search Engine
• They have three major components:
• Crawler or spider
  - Finds and retrieves web pages.
• Index or catalog
  - Indexes every web page that the crawler finds.
• Search engine software
  - Searches entries in the index to find a match.
  - Ranks the matches based on relevance.
What about web search?
• First you need to get all the documents of the web (crawlers).
• Then you have to index them (inverted files).
• Find the web pages that are relevant to the query.
• Report the pages with their links in sorted order.
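The four steps above can be sketched on a toy, in-memory "web". This is only an illustration under stated assumptions, not how a production search engine works: the WEB dictionary, its URLs, and the function names crawl, build_index and search are all hypothetical, and ranking by the number of matched query terms is an assumed stand-in for a real relevance function.

```python
from collections import deque

# Hypothetical in-memory "web": url -> (page text, outgoing links).
WEB = {
    "a.example": ("information retrieval systems", ["b.example", "c.example"]),
    "b.example": ("retrieval of web pages", ["c.example"]),
    "c.example": ("cooking recipes", []),
}

def crawl(seed):
    """Step 1: breadth-first traversal of links starting from a seed URL."""
    seen, queue, pages = {seed}, deque([seed]), {}
    while queue:
        url = queue.popleft()
        text, links = WEB[url]
        pages[url] = text
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return pages

def build_index(pages):
    """Step 2: inverted file mapping each word to the set of URLs containing it."""
    index = {}
    for url, text in pages.items():
        for word in text.lower().split():
            index.setdefault(word, set()).add(url)
    return index

def search(index, query):
    """Steps 3 and 4: find pages matching any query term, sorted by match count."""
    scores = {}
    for word in query.lower().split():
        for url in index.get(word, set()):
            scores[url] = scores.get(url, 0) + 1
    # Sort by descending score, then by URL so the order is deterministic.
    return sorted(scores, key=lambda u: (-scores[u], u))

index = build_index(crawl("a.example"))
results = search(index, "information retrieval")
```

Here results lists a.example before b.example, because the first page contains both query terms and the second contains only one.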
The search process

Figure: Query entered → Query interpreted → Index searched → Items retrieved → Results ranked
About Google?
• The name "Google" is a play on the word "googol",
which refers to the number represented by 1
followed by one hundred zeros.
• Google receives over 200 million queries each day
through its various services.
• As of January 2012, Google has indexed 9.7 billion
web pages, 1.3 billion images, and over one billion
Usenet messages — in total, approximately 12
billion items.
How Google works
• Google maintains (probably) the world's largest Linux cluster (over 15,000 servers).
• These are partitioned between index servers and page servers.
• Index servers resolve the queries.
• Page servers deliver the results of the queries.
Google finds important pages
• The idea is that the documents on the web have different degrees of "importance".
• Google will show the most important pages first.
• More important pages are likely to be more relevant to any query than non-important pages.

Google Relevance Factors
Google considers over 100 factors, including:
1. PageRank algorithm.
2. Popularity of page.
3. Position and size of the search terms within page.
4. Unique Content.
5. Terms order.
6. Page size and load time.
7. Error free websites.
8. Important incoming links.
9. Website Optimization.
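The first factor, PageRank, can be illustrated with the simplified textbook formulation computed by power iteration. This is a sketch only: the damping factor 0.85, the iteration count, the four-page link graph, and the function name pagerank are assumptions for illustration, and the code makes no claim about how Google's production ranking works.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by power iteration.

    links maps each page to the list of pages it links to.
    Assumes every page appears as a key and has at least one outgoing link.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share ...
        new_rank = {p: (1.0 - damping) / n for p in pages}
        # ... plus an equal split of the rank of each page linking to it.
        for page, outgoing in links.items():
            share = damping * rank[page] / len(outgoing)
            for target in outgoing:
                new_rank[target] += share
        rank = new_rank
    return rank

# Hypothetical four-page web: every other page links to "home".
links = {
    "home": ["about"],
    "about": ["home"],
    "blog": ["home", "about"],
    "contact": ["home"],
}
rank = pagerank(links)
ranked_pages = sorted(rank, key=rank.get, reverse=True)
```

In this toy graph "home" ends up with the highest score because it has the most incoming links, which matches the idea of important incoming links in the list above.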

• The three components of the information retrieval environment:
  - User
  - Process
  - Collection
Information Retrieval Black box

Figure: The user's query and the documents of the collection enter the process (the black box), which produces the results.
Inside Black box

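Figure: The query passes through a representation function to give a query representation. The documents pass through a representation function to give a document representation, which is stored in the index. A comparison function matches the query representation against the index and produces the results.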
Information Hierarchy

• Data
  - The raw material of information
• Information
  - Data organized and presented in a particular manner
• Knowledge
  - “Justified true belief”
  - Information that can be acted upon
• Wisdom
  - Distilled and integrated knowledge
  - Demonstrative of high-level “understanding”
Example

• Data
  - 98.6º F, 99.5º F, 100.3º F, 101º F, …
• Information
  - Hourly body temperature: 98.6º F, 99.5º F, 100.3º F, 101º F
• Knowledge
  - If you have a temperature above 100º F, you most likely have a fever.
• Wisdom
  - If you don’t feel well, go see a doctor.
Objectives of IRS
• To minimize the overhead of a user locating needed information.
• Efficient retrieval of documents. The two major measures used are:
  1. Precision: the measure of how many of the retrieved documents were in fact relevant.
  2. Recall: the measure of how many of the relevant documents were retrieved.
• Support of user search generation.
• Present the search results in a format that facilitates the user in determining relevant items.
Minimize the overhead for finding information

• Overhead: the time a user spends in all of the steps leading to reading an item containing needed information, excluding the time for actually reading the relevant data. These steps are:
  - Query generation
  - Search composition
  - Search execution
  - Scanning results of query to select items to read
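Precision and Recall
• When a user decides to issue a search looking for information on a topic, the total database is logically divided into four segments: relevant retrieved, relevant not retrieved, non-relevant retrieved, and non-relevant not retrieved.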
Precision: the percentage of retrieved documents that are in fact relevant to the query.

Recall: the percentage of documents relevant to the query that were, in fact, retrieved.
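The two definitions can be computed directly from sets of document identifiers. The sketch below is illustrative only: the function name precision_recall and the document IDs are hypothetical, and it returns fractions between 0 and 1 rather than percentages.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for sets of retrieved and relevant document IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant documents that were retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical search: 4 documents retrieved, 5 relevant in the whole database.
retrieved = {"d1", "d2", "d3", "d4"}
relevant = {"d2", "d4", "d5", "d6", "d7"}
precision, recall = precision_recall(retrieved, relevant)
# precision = 2/4 = 0.5 and recall = 2/5 = 0.4
```

Two of the four retrieved documents are relevant, so precision is 0.5. Two of the five relevant documents were found, so recall is 0.4.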
Figure: Measuring Precision and Recall
Figure: Ideal Precision and Recall Graph
Two More Objectives of IR Systems

• Support of user search generation
  - The system should help generate a good search statement to specify the information a user needs.
• To present the search results in a format that facilitates the user in determining relevant items.
  - Ranking in order of potential relevance.
Functional Overview
• A total Information Storage and Retrieval System is composed of four major functional processes:
  1. Item Normalization
  2. Selective Dissemination of Information
  3. Document Database Search
  4. Index Database Search, along with the Automatic File Build process that supports Index Files.
Total Information Retrieval System
1. Item Normalization
1. Normalizes the incoming items to a standard format.
2. Provides logical restructuring of the item.
3. Operations:
   - Identification of processing tokens
   - Characterization of tokens
   - Stemming of the tokens
Figure: The Text Normalization Process
Standardize Input

• Standardizing the input takes the different external formats of input data and translates them into the formats acceptable to the system.
• Translate foreign-language text into Unicode.
• Translate multimedia input into a standard format.
  - Video: MPEG-2, MPEG-1, AVI, Real Video…
  - Audio: WAV, Real Audio
  - Image: GIF, JPEG, BMP…
Logical Subsetting (Zoning)

• Parse the item into logical subdivisions that have meaning to the user.
  - Title, Author, Abstract, Main Text, Conclusion, References, Country, Keyword…
• Zones are visible to the user and are used to increase the precision of a search and optimize the display.
• The zoning information is passed to the next operation to store the information, allowing searches to be restricted to a specific zone.
Identify Processing Tokens

• Identifying the information (words) that is used in the search process.
• The first step is to determine words by dividing input symbols into three classes:
  1. Valid word symbols: alphabetic characters, numbers
  2. Inter-word symbols: blanks, periods, semicolons (non-searchable)
  3. Special processing symbols: hyphen (-)
• A word is defined as a contiguous set of word symbols bounded by inter-word symbols.
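The three symbol classes can be turned into a small tokenizer. This is a sketch under assumptions that the slide does not state: any character that is neither a word symbol nor a hyphen is treated as an inter-word symbol, and a hyphen is kept only when it joins word symbols inside a word. The function names classify and tokenize are hypothetical.

```python
SPECIAL = {"-"}  # special processing symbols

def classify(ch):
    """Assign a character to one of the three symbol classes."""
    if ch.isalnum():
        return "word"
    if ch in SPECIAL:
        return "special"
    # Assumption: everything else (blanks, periods, semicolons, ...) separates words.
    return "inter-word"

def tokenize(text):
    """Split text into words: contiguous word symbols bounded by inter-word symbols."""
    tokens, current = [], []
    for ch in text:
        kind = classify(ch)
        if kind == "word":
            current.append(ch)
        elif kind == "special" and current:
            # Assumed rule: a hyphen is kept only when it follows a word symbol.
            current.append(ch)
        elif current:
            # Inter-word symbol (or stray hyphen): close the current word.
            tokens.append("".join(current).strip("-"))
            current = []
    if current:
        tokens.append("".join(current).strip("-"))
    return tokens

tokens = tokenize("State-of-the-art retrieval; version 2.0 - done.")
```

With these rules the period in "2.0" splits the number into two tokens, which shows why real systems need more careful handling of special symbols.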
Stop Algorithm

• Next, a Stop List/Algorithm is applied to the list of potential processing tokens.
• The objective is to save system resources and processing power by eliminating the tokens of little value to the search:
  - Any word found in almost every item, i.e., frequently occurring words.
  - Any word found only once or twice in the database.
• The rank-frequency law of Zipf is:
  Frequency * Rank = constant
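A stop list and the rank-frequency product can be sketched as follows. The token stream, the stop list, and the function names rank_frequency and apply_stop_list are hypothetical. Zipf's law is an empirical regularity of large collections, so a toy input like this one only shows how the product is computed, not that it stays constant.

```python
from collections import Counter

def rank_frequency(tokens):
    """Return (rank, word, frequency, rank * frequency) rows, most frequent first."""
    counts = Counter(tokens)
    # Ties in frequency are broken alphabetically to keep the order deterministic.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(rank, word, freq, rank * freq)
            for rank, (word, freq) in enumerate(ordered, start=1)]

def apply_stop_list(tokens, stop_words):
    """Drop tokens that appear in the stop list."""
    return [t for t in tokens if t not in stop_words]

# Hypothetical token stream and stop list, for illustration only.
tokens = "the cat and the dog and the bird".split()
stop_words = {"the", "and"}

table = rank_frequency(tokens)
kept = apply_stop_list(tokens, stop_words)
```

The most frequent words ("the", "and") land at the top of the table and are exactly the ones the stop list removes, leaving "cat", "dog" and "bird" as processing tokens.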
Characterize Tokens

• Identification of any specific word characteristics.
• The characterization is used in systems to assist in disambiguation of a particular word.
Stemming Algorithm

• To normalize the token to a standard semantic representation.
  - Computer, Compute, Computers, Computing → Comput
• Reduces the number of unique words the system has to contain.
• Improves the efficiency of the IR system.
• Should be applied without leading to retrieval of non-relevant items.
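The conflation in the example above can be reproduced with a deliberately naive suffix-stripping rule. This is not the Porter stemmer or any other published algorithm: the suffix list, the minimum stem length, and the function name naive_stem are assumptions chosen only to make the slide's example work.

```python
# Assumed suffix list, checked in this order so longer suffixes are tried first.
SUFFIXES = ("ers", "ing", "er", "es", "ed", "e", "s")

def naive_stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least min_stem characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

words = ["Computer", "Compute", "Computers", "Computing"]
stems = {naive_stem(w) for w in words}
```

All four words reduce to the single stem "comput". The same rule also turns "string" into "str", an example of the over-stemming that can lead to retrieval of non-relevant items.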
Create Searchable Data Structure

• Processing tokens -> Stemming Algorithm -> updates to the searchable data structure.
• Internal representation (not visible to user).
  - Signature file, Inverted list, PAT Tree…
2. Selective Dissemination of Information

• The Selective Dissemination of Information (Mail) Process provides the capability to dynamically compare newly received items in the information system against standing statements of interest of users and deliver the item to those users whose statement of interest matches the contents of the item.
• Consists of:
  - Search process
  - User statements of interest (Profile)
  - User mail file
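The three parts (search process, profiles, mail files) can be sketched in a few lines. The matching rule used here, that an item must contain every term of a profile, is an assumption for illustration, as are the user names, the items, and the function names matches and disseminate.

```python
def matches(profile_terms, item_text):
    """Assumed search process: the item contains every term of the profile."""
    words = set(item_text.lower().split())
    return all(term in words for term in profile_terms)

def disseminate(item_id, item_text, profiles, mail_files):
    """Deliver a newly received item to each user whose profile matches it."""
    for user, terms in profiles.items():
        if matches(terms, item_text):
            mail_files.setdefault(user, []).append(item_id)

# Hypothetical standing statements of interest (profiles).
profiles = {"alice": {"retrieval"}, "bob": {"data", "warehouse"}}
mail_files = {}  # user -> list of delivered item IDs

disseminate("item-1", "Information retrieval evaluation", profiles, mail_files)
disseminate("item-2", "Building a data warehouse", profiles, mail_files)
disseminate("item-3", "Data retrieval in a DBMS", profiles, mail_files)
```

Unlike an ordinary search, the profiles stay fixed while the items arrive one at a time, so each new item is compared against all profiles as it is received.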
3. Document Database (DDB) Search

• The DDB Search Process provides the capability for a query to search against all items received by the system.
• Composed of the search process, user-entered queries, and the document database.
4. Index Database Search
• In the index process, the user can logically store an item in a file along with additional index terms and descriptive text the user wants to associate with the item.
• The Index Database Search Process provides the capability to create indexes and search them.
• There are two classes of index files: Public and Private Index files.
• Each Private Index file references only a small subset of the total number of items in the DDB, but Public Index files typically index every item in the DDB.
Document D1: “yes we got no bananas”
Document D2: “Johnny Appleseed planted apple seeds.”
Document D3: “we like to eat, eat, eat apples and bananas”

Vocabulary   →  Postings List
yes          →  D1
we           →  D1, D3
got          →  D1
no           →  D1
bananas      →  D1, D3
Johnny       →  D2
Appleseed    →  D2
planted      →  D2
apple        →  D2, D3
seeds        →  D2
like         →  D3
to           →  D3
eat          →  D3
and          →  D3

Query “apples bananas”:
“apples”  →  {D2, D3}
“bananas” →  {D1, D3}
Whole query gives the intersection:
{D2, D3} ∩ {D1, D3} = {D3}
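The example can be reproduced with a small inverted index. One assumption is needed: the index lists D3 under "apple" even though D3 contains "apples", so some normalization is implied. The sketch uses a hypothetical one-entry lookup table (EQUIVALENT) for that, where a real system would apply a stemmer. It also lowercases the vocabulary, and the function names are illustrative.

```python
import re

DOCS = {
    "D1": "yes we got no bananas",
    "D2": "Johnny Appleseed planted apple seeds.",
    "D3": "we like to eat, eat, eat apples and bananas",
}

# Hypothetical normalization table standing in for a stemmer.
EQUIVALENT = {"apples": "apple"}

def normalize(word):
    """Lowercase a word and map it to its index term."""
    word = word.lower()
    return EQUIVALENT.get(word, word)

def build_index(docs):
    """Map each normalized word to the set of documents containing it."""
    index = {}
    for doc_id, text in docs.items():
        for word in re.findall(r"[A-Za-z]+", text):
            index.setdefault(normalize(word), set()).add(doc_id)
    return index

def and_query(index, query):
    """Intersect the postings lists of all query terms."""
    postings = [index.get(normalize(w), set()) for w in query.split()]
    return set.intersection(*postings) if postings else set()

index = build_index(DOCS)
result = and_query(index, "apples bananas")
```

The query returns only D3, the single document that appears in both postings lists.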
Relationship to DBMS
• Information retrieval (IR) is supported by an IRS, whereas data retrieval (DR) is supported by a DBMS.
• DBMS
  - Precise semantics
  - SQL
  - Structured data
  - Expects a reasonable number of updates
  - Generates the full answer
• IRS
  - Imprecise semantics
  - Keyword search
  - Unstructured data format
  - Read-mostly; documents are added or updated occasionally
  - Pages through the top k results
Information Retrieval vs. Data Retrieval

Design Issue              Data Retrieval    Information Retrieval
Matching                  Exact match       Partial (best) match
Model                     Deterministic     Probabilistic
Classification approach   Monothetic        Polythetic
Query language            Artificial        Natural
Query specification       Complete          Incomplete
Items wanted              Matching          Relevant
Error response            Sensitive         Insensitive
Data representation       Schema            (Mostly) index terms
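The "Matching" row of the table can be made concrete with a small contrast between exact match and partial (best) match. The records, the scoring rule (share of query terms found in a title), and the function names data_retrieval and information_retrieval are all assumptions for illustration.

```python
# Hypothetical records, for illustration only.
RECORDS = [
    {"id": 1, "title": "information retrieval systems"},
    {"id": 2, "title": "database management systems"},
    {"id": 3, "title": "retrieval of multimedia information"},
]

def data_retrieval(records, title):
    """Exact match: a record either satisfies the condition or it does not."""
    return [r["id"] for r in records if r["title"] == title]

def information_retrieval(records, query):
    """Partial (best) match: rank records by the share of query terms they contain."""
    terms = query.lower().split()
    scored = []
    for r in records:
        words = set(r["title"].split())
        score = sum(t in words for t in terms) / len(terms)
        if score > 0:
            scored.append((score, r["id"]))
    # Best score first; ties are broken by record ID.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored]

exact = data_retrieval(RECORDS, "information retrieval")
ranked = information_retrieval(RECORDS, "information retrieval systems")
```

The exact-match lookup returns nothing because no title equals the condition, while the best-match search returns every record sharing at least one term with the query, ordered 1, 3, 2.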

Digital Libraries and Data Warehouses

• Three systems that are repositories of information:
  1. Digital Libraries
  2. Data Warehouses (or Data Marts)
  3. Information Storage and Retrieval Systems
Digital Libraries

• A digital library is a library in which information collections are stored in digital formats (as opposed to print or other media) and are accessible via computers.
• The digital content may be stored locally or accessed remotely via a network.
Traditional vs. Digital Library

Compared with a traditional library, a digital library offers:
• No physical boundary.
• Round-the-clock availability.
• Multiple access.
• Preservation and conservation.
• More storage space.
• Added value.
• Easy accessibility.
Data Warehouses

• A data warehouse is a repository of an organization's electronically stored data in support of the decision-making process.
• Data warehouses are designed to facilitate reporting and analysis.
• The process of transforming data into information and making it available to the user is known as data warehousing.
Data Mining: A KDD Process

• Data mining: the core of the knowledge discovery process.
Figure: Databases → Data Cleaning and Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation
