Unit - 1
• Introduction: Definitions
• Objectives
• Functional Overview
• Relationship to DBMS
• Digital libraries and Data Warehouses
Introduction to Information Retrieval
Simple view of the IR process
[Figure: User (information need) → Query → IR System → Ranked list of documents → User, with a feedback loop]
Definitions of Data and Information
Query entered → Index searched → Items retrieved
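The three steps above (query entered, index searched, items retrieved) can be illustrated with a toy inverted index. This is only a minimal sketch: the sample documents, the whitespace tokenization, and the function names `build_index` and `search` are assumptions made for illustration, not part of any particular IR system.

```python
from collections import defaultdict


def build_index(docs):
    """Map each lowercase term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every query term (Boolean AND)."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical three-document collection.
docs = {
    1: "information retrieval systems",
    2: "database management systems",
    3: "digital libraries and information",
}
index = build_index(docs)                     # built once, before any query
items = search(index, "information systems")  # query entered, index searched
print(items)                                  # items retrieved
```

The point of the index is that a query only touches the entries for its own terms instead of scanning every document.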
About Google?
• The name "Google" is a play on the word "googol",
which refers to the number represented by 1
followed by one hundred zeros.
• Google receives over 200 million queries each day
through its various services.
• As of January 2012, Google has indexed 9.7 billion
web pages, 1.3 billion images, and over one billion
Usenet messages — in total, approximately 12
billion items.
How Google works
• Google maintains (probably) the
world's largest Linux cluster (over 15,000 servers).
• These are partitioned between index
servers and page servers
• Index servers resolve the queries.
• Page servers deliver the results of the queries
Google finds important pages
The idea is that the documents on the web
have different degrees of "importance".
Google will show the most important pages
first.
The idea is that more important pages are
likely to be more relevant to any query
than non-important pages.
Google Relevance Factors
Google considers over 100 factors, including:
1. PageRank algorithm.
2. Popularity of page.
3. Position and size of the search terms within page.
4. Unique Content.
5. Terms order.
6. Page size and load time.
7. Error free websites.
8. Important incoming links.
9. Website Optimization.
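The slides name PageRank without describing it. As a rough illustration of how "importance" can be derived from incoming links, here is a minimal sketch of the commonly published power-iteration formulation. The damping factor of 0.85, the iteration count, and the four-page link graph are assumptions for illustration; this is not Google's production ranking.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank by power iteration.

    links maps each page to the list of pages it links to; every
    link target is assumed to also appear as a key.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if outlinks:
                # A page splits its damped rank among the pages it links to.
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank evenly.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical link graph: C has the most incoming links, D has none.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this graph, page C collects links from three pages and ends up with the highest score, while D, which nothing links to, keeps only the baseline share.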
• The three components of the information
retrieval environment:
User
Process
Collection
Information Retrieval Black box
[Figure: the Process shown as a black box that produces Results]
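Inside Black box
[Figure: the Query and the Documents are each converted to a Representation; the document representations form the Index; a Comparison Function matches the query representation against the Index to produce Results]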
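The components inside the black box can be sketched in a few lines. In this illustrative example the representation is assumed to be a bag of words and the comparison function is assumed to be cosine similarity; real systems use many other choices. The names `represent`, `cosine`, and `rank`, and the sample documents, are hypothetical. For brevity the sketch compares against every document directly instead of going through an index.

```python
import math
from collections import Counter


def represent(text):
    """Representation step: turn text into a bag of lowercase words."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Comparison function: cosine similarity between two bags of words."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query, docs):
    """Results step: ids of matching documents, best score first."""
    q = represent(query)
    scored = [(cosine(q, represent(text)), doc_id) for doc_id, text in docs.items()]
    scored = [(score, doc_id) for score, doc_id in scored if score > 0]
    return [doc_id for score, doc_id in sorted(scored, key=lambda p: (-p[0], p[1]))]


# Hypothetical collection.
docs = {
    "d1": "information retrieval ranks documents",
    "d2": "database systems store records",
    "d3": "retrieval of information from digital libraries",
}
print(rank("information retrieval", docs))
```

Both d1 and d3 contain the two query terms, but d1 is shorter, so the query terms make up a larger part of it and it scores higher.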
Information Hierarchy
• Data
- The raw material of information
• Information
- Data organized and presented in a particular manner
• Knowledge
- “Justified true belief”
- Information that can be acted upon
• Wisdom
- Distilled and integrated knowledge
- Demonstrative of high-level “understanding”
Example
• Data
98.6 °F, 99.5 °F, 100.3 °F, 101 °F, …
• Information
Hourly body temperature: 98.6 °F, 99.5 °F,
100.3 °F, 101 °F
• Knowledge
If you have a temperature above 100 °F, you
most likely have a fever.
• Wisdom
If you don’t feel well, go see a doctor.
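The first three levels of this example can be mirrored in code. This is only an illustrative sketch: the hour numbering (1 to 4) and the names `hourly` and `likely_fever` are assumptions, and the wisdom level is left out because it does not reduce to a simple rule.

```python
# Data: raw numbers with no context.
readings_f = [98.6, 99.5, 100.3, 101.0]

# Information: the same numbers organized as hourly body temperatures.
# Hour labels starting at 1 are an assumption for illustration.
hourly = {hour: temp for hour, temp in enumerate(readings_f, start=1)}

# Knowledge: a rule that can be acted upon.
FEVER_THRESHOLD_F = 100.0


def likely_fever(temp_f):
    """Apply the slide's rule: above 100 °F most likely means a fever."""
    return temp_f > FEVER_THRESHOLD_F


fever_hours = [hour for hour, temp in hourly.items() if likely_fever(temp)]
print(fever_hours)
```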
Objectives of IRS
• To minimize the overhead of a user locating
needed information.
• Efficient retrieval of documents. Two major
measures are used:
1. Precision: the fraction of retrieved
documents that are in fact relevant.
2. Recall: the fraction of relevant
documents that were retrieved.
• Support of user search generation.
• Present the search results in a format that
helps the user determine which items are
relevant.
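The precision and recall measures defined above can be computed directly from two sets: the documents the system retrieved and the documents that are actually relevant. The following is a minimal sketch; the document ids and the function name `precision_recall` are made up for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = relevant retrieved / all retrieved
    recall    = relevant retrieved / all relevant
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: 4 documents retrieved, 5 relevant in the collection,
# 2 documents in common.
retrieved = ["d1", "d2", "d3", "d4"]
relevant = ["d1", "d3", "d5", "d6", "d7"]
p, r = precision_recall(retrieved, relevant)
print(p, r)  # 0.5 0.4
```

Here half of what was retrieved is relevant (precision 0.5), but only two of the five relevant documents were found (recall 0.4). Retrieving more documents tends to raise recall at the cost of precision.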
Minimize the overhead for finding information
Digital Libraries and Data Warehouses
• No physical boundary.
• Round-the-clock availability.
• Multiple access.
• Preservation and conservation.
• More storage space.
• Added value.
• Easily accessible.
Data Warehouses
• Data mining: the core of the knowledge discovery process.
[Figure: knowledge discovery process, with stages labeled Databases, Data Cleaning, Data Integration, Task-relevant Data, Data Mining, Pattern Evaluation]