1520784495 Lec5 Ir Introduction

The document outlines a course on advanced topics in information retrieval and web search, including an introduction to the field and its significance in daily web activities. It discusses various aspects of information retrieval, including definitions, dimensions, and applications, as well as the architecture of search engines and traditional retrieval models. Additionally, it highlights advanced retrieval models and specific tasks within the field, such as personalized search and question answering.

Uploaded by

hoanglinh90198

ADVANCED TOPICS

IN INFORMATION RETRIEVAL
AND WEB SEARCH

Lecture 1:
Introduction

S. M. Vahidipour
Outline

□ Introduction to the Course


□ Overview of the Semester

2
Text Books

Search Engines:
Information Retrieval in Practice

W. Bruce Croft, Donald Metzler, Trevor Strohman


Pearson Education, 2010

3
Text Books

Modern Information Retrieval:


The Concepts and Technology behind Search
(2nd Edition)

Ricardo Baeza-Yates, Berthier Ribeiro-Neto


ACM Press Books, 2010

4
Text Books

Introduction to Information Retrieval

C. Manning, P. Raghavan, and H. Schütze


Cambridge University Press, 2008

5
Search and Information Retrieval

● Search on the Web is a daily activity for many people throughout the world
  □ Google: 40,000 searches per second (3.5 billion per day; 1.2 trillion per year)
  □ Yahoo: 3,200 searches per second (280 million per day; 8.4 billion per month)
  □ Bing: 927 searches per second (80 million per day; 2.4 billion per month)

10^6: million, 10^9: billion, 10^12: trillion, 10^15: quadrillion, 10^18: quintillion, …
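The per-day figures follow from the per-second rates by multiplying by the 86,400 seconds in a day. A minimal sketch of that arithmetic, using only the rates quoted on the slide (the function name `per_day` is introduced here for illustration):

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds

def per_day(per_second: float) -> float:
    """Convert a searches-per-second rate into searches per day."""
    return per_second * SECONDS_PER_DAY

# Rates as quoted on the slide.
rates = {"Google": 40_000, "Yahoo": 3_200, "Bing": 927}

for engine, per_second in rates.items():
    millions = per_day(per_second) / 1e6
    print(f"{engine}: about {millions:,.0f} million searches per day")
```

The slide rounds the results, so the computed values agree with it only approximately.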

6
Search and Information Retrieval

□ Search and communication are the most popular uses of the computer.
□ Applications involving search are everywhere.
□ The field of computer science that is most involved with R&D for search is information retrieval (IR).

7
Information Retrieval

“Information retrieval is a field concerned with the structure, analysis, organization, storage, searching, and retrieval of information.” (Salton, 1968)

□ General definition that can be applied to many types of information and search applications
□ Still appropriate after 40 years
□ Primary focus of IR since the 50s has been on text and documents

8
Data/Information

□ Storage

□ Search

9
Data/Information

□ Structured

□ Unstructured

10
Structured vs. Unstructured Data

11
What is a Document?

● Examples:
  □ Web pages, email, books, news stories, scholarly papers, text messages, Word™, PowerPoint™, PDF, forum postings, patents, IM (Instant Messages) sessions, etc.
● Common properties
  □ Significant text content
  □ Some structure (≈ attributes in a DB)
    □ Papers: title, author, date
    □ Email: subject, sender, destination, date

12
Comparing Text

Comparing the query text to the document text and determining what is a good match is the core issue of information retrieval.
Exact matching of words is not enough
● Many different ways to write the same thing in a “natural language” like English
  □ Does a news story containing the text “karl benz built the first automobile in 1886” match the query “car inventor”?
● Defining the meaning of a word, a sentence, a paragraph, or a story is more difficult than defining the meaning of a database field.
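A small sketch of why exact word matching falls short on the news-story example. The query is read here as "car inventor" (an assumption about the intended wording), and the whitespace tokenizer is a deliberately naive stand-in for real text processing:

```python
def tokenize(text: str) -> set[str]:
    """Naive tokenizer: lowercase and split on whitespace."""
    return set(text.lower().split())

def shared_words(query: str, document: str) -> set[str]:
    """Words that appear verbatim in both the query and the document."""
    return tokenize(query) & tokenize(document)

document = "karl benz built the first automobile in 1886"
query = "car inventor"  # assumed reading of the slide's query

overlap = shared_words(query, document)
print(f"shared words: {sorted(overlap)}")
```

The document is clearly about the inventor of the car, yet the two texts share no word, so a system that relies on exact matching would not retrieve it.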

13
Dimensions of IR

IR is more than just text, and more than just web search, although these are central.
People doing IR work with different media, different types of search applications, and different tasks.

Three dimensions of IR
□ Content
□ Applications
□ Tasks

14
The Content Dimension

Textual data, but…


New applications increasingly involve new media
□ Video, photos, music, speech
□ Scanned documents (for legal purposes)
Like text, content is difficult to describe and compare
□ Text may be used to represent them (e.g., tags)
IR approaches to search and evaluation are appropriate

15
The Application Dimension

● Web search
  □ Most common
● Vertical search
  □ Restricted domain/topic
  □ Books, movies, suppliers
● Enterprise search
  □ Corporate intranet
  □ Databases, emails, web pages, documentation, code, wikis, tags, directories, presentations, spreadsheets …
● Desktop search
  □ Personal enterprise search
  □ See above plus recent web pages
● P2P search
  □ No centralized control
  □ File sharing, shared locality
● Literature search
● Forum search

16
The Task Dimension

● User queries / ad-hoc search
  □ Range of queries is enormous, not pre-specified
● Filtering
  □ Given a profile (interests), notify about interesting news stories
  □ Identify relevant user profiles for a new document
● Classification / categorization
  □ Automatically assign text to one or more classes of a given set
  □ Identify relevant labels for documents
● Question answering
  □ Similar to search
  □ Automatically answer a question posed in natural language
  □ Provide a concrete answer, not a list of documents

17
Main Issues in IR

Relevance
□ A relevant document contains the information a user was looking for when
he/she submitted the query
Evaluation
□ How well does the ranking meet the expectations of the user?
Users and information needs
□ Users of a search engine are the ultimate judges of quality

18
IR and Search Engines

A search engine is the practical application of information retrieval techniques to large scale text collections
Big issues include main IR issues but also some others…

Information Retrieval
● Relevance: Effective ranking
● Evaluation: Testing and measuring
● Information needs: User interaction

Search Engines (additional issues)
● Performance: Efficient search and indexing
● Incorporating new data: Coverage and freshness
● Scalability: Growing with data and users
● Adaptability: Tuning for applications
● Specific problems: e.g., Spam

19
Outline

□ Introduction to the Course


□ Overview of the Semester

20
Search Engine

● Basic architecture
● Main issues
● Indexing
  □ Text acquisition
  □ Text transformation
  □ Index creation
● Querying
  □ User interaction
  □ Ranking
  □ Evaluation

21
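The indexing and querying steps named on the Search Engine slide can be sketched with a toy inverted index. This is a minimal illustration under stated assumptions, not the course's implementation: the three documents, the function names, and the regex tokenizer are all made up for the example, and querying is reduced to returning documents that contain every query term.

```python
import re
from collections import defaultdict

def transform(text: str) -> list[str]:
    """Text transformation: lowercase and keep alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs: dict[int, str]) -> dict[str, set[int]]:
    """Index creation: map each term to the ids of documents containing it."""
    index: dict[str, set[int]] = defaultdict(set)
    for doc_id, text in docs.items():
        for term in transform(text):
            index[term].add(doc_id)
    return index

def search(index: dict[str, set[int]], query: str) -> set[int]:
    """Querying: return ids of documents that contain every query term."""
    terms = transform(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

# Text acquisition: a hypothetical three-document collection.
docs = {
    1: "Web search engines index web pages",
    2: "Desktop search indexes personal files",
    3: "Enterprise search covers the corporate intranet",
}

index = build_index(docs)
print(sorted(search(index, "web search")))
print(sorted(search(index, "search")))
```

Ranking and evaluation are left out here; the sketch only shows how acquisition, transformation, and index creation feed a query lookup.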
Overview of Traditional Retrieval Models

Boolean retrieval
Vector space model
Probabilistic models
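As a rough illustration of the vector space model, the sketch below ranks documents by cosine similarity between TF-IDF vectors. The slide only names the model, so the weighting scheme (raw term frequency times log(N/df)), the toy documents, and all function names are assumptions chosen for the example.

```python
import math
from collections import Counter

def build_vectors(docs):
    """Return one TF-IDF vector (term -> weight) per document, plus the IDF table."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    idf = {term: math.log(n_docs / df) for term, df in doc_freq.items()}
    vectors = [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]
    return vectors, idf

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def rank(query, docs):
    """Score every document against the query, best match first."""
    vectors, idf = build_vectors(docs)
    q_vec = {t: tf * idf.get(t, 0.0) for t, tf in Counter(query).items()}
    scored = [(cosine(q_vec, vec), i) for i, vec in enumerate(vectors)]
    return sorted(scored, reverse=True)

# Hypothetical pre-tokenized documents.
docs = [
    ["web", "search", "engine"],
    ["desktop", "search"],
    ["digital", "library"],
]

ranking = rank(["web", "search"], docs)
print([doc_id for _, doc_id in ranking])
```

Unlike Boolean retrieval, which only says whether a document matches, this produces a graded score, so the second document still receives partial credit for sharing one query term.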

22
Overview of Evaluation Metrics

● Effectiveness metrics
● Efficiency metrics
● Training, testing, and statistics

23
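The slide does not name specific effectiveness metrics. As an assumption, the sketch below uses three standard ones (precision, recall, and average precision) on a made-up ranked list with made-up relevance judgments.

```python
def precision_recall(retrieved: list[int], relevant: set[int]) -> tuple[float, float]:
    """Precision: share of retrieved docs that are relevant.
    Recall: share of relevant docs that were retrieved."""
    hits = sum(1 for doc_id in retrieved if doc_id in relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

def average_precision(ranked: list[int], relevant: set[int]) -> float:
    """Mean of the precision values at each rank where a relevant doc appears."""
    hits = 0
    total = 0.0
    for position, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / position
    return total / len(relevant) if relevant else 0.0

# Hypothetical ranking and relevance judgments.
ranked = [1, 2, 3, 4]
relevant = {1, 3, 5}

precision, recall = precision_recall(ranked, relevant)
print(f"precision={precision:.2f} recall={recall:.2f}")
print(f"average precision={average_precision(ranked, relevant):.2f}")
```

Precision and recall ignore the order of results, while average precision rewards placing relevant documents near the top of the ranking.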
Advanced Retrieval Models

Language model-based retrieval


Learning to rank

24
Word Mismatch Problem

Language model-based approaches


□ Translation model
□ Topic model
□ Word cluster model
□ WordNet
□ Dependency model

Query expansion approaches
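One simple form of query expansion adds related words to the query before matching. The sketch below is only an illustration: the synonym table is hand-made and hypothetical (a real system might derive it from a thesaurus, query logs, or co-occurrence statistics), and the query is assumed to be "car inventor".

```python
# Hypothetical, hand-made synonym table used only for this example.
SYNONYMS = {
    "car": {"automobile", "vehicle"},
}

def expand_query(query: str, synonyms: dict[str, set[str]]) -> set[str]:
    """Return the query terms plus any listed synonyms."""
    terms = set(query.lower().split())
    expanded = set(terms)
    for term in terms:
        expanded |= synonyms.get(term, set())
    return expanded

def matching_terms(query_terms: set[str], document: str) -> set[str]:
    """Query terms that occur verbatim in the document."""
    return query_terms & set(document.lower().split())

document = "karl benz built the first automobile in 1886"
query = "car inventor"

original_terms = set(query.lower().split())
print(sorted(matching_terms(original_terms, document)))
print(sorted(matching_terms(expand_query(query, SYNONYMS), document)))
```

The original query shares no word with the document, while the expanded query matches on "automobile", which is the kind of word mismatch these approaches aim to bridge.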

25
Advanced/Specific IR Tasks

● Query log and query suggestion
● Personalized search
● Information extraction
● Cross-language IR
● Question answering
● Recommendation systems
● Enterprise search
● Digital library
● Structured text retrieval
● Multimedia retrieval

26
Query Log and Query Suggestion

27
Personalized Search

28
Information Extraction

29
Cross-language Retrieval

30
Question Answering

31
Recommendation Systems

32
Enterprise Search

33
Digital Library

34
Structured Text Retrieval

35
Multimedia Retrieval

36
Questions?

37
