2 Introduction To Information Retrieval
Definition of IR
IR is finding material (usually documents) of
an unstructured nature (usually text) that
satisfies an information need from within
large collections (usually stored on
computers).
Unstructured data?
Information need?
Information Access Process
Information need → formulate a query → send to system → receive results → evaluate results.
If the results do not satisfy the need, reformulate the query and repeat; once done, stop.
Information need
Information need is the topic about which the
user desires to know more.
Query is what the user conveys to the
computer in an attempt to communicate the
information need.
A document is ‘relevant’ if it is one that the
user perceives as containing information of
value with respect to their personal
information need.
Evaluation of IR
Retrieval Effectiveness
To assess the effectiveness of an IR system (i.e. the
quality of its search results), precision and recall can
be used.
Precision ( 查準率 ): what fraction of the returned
results are relevant to the information need?
Recall ( 查全率 ): what fraction of the relevant
documents in the collection were returned by the
system?
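As a small sketch of these two measures (the document IDs and relevance judgments below are made up for illustration):

```python
# Hypothetical retrieved set and relevance judgments for one query.
retrieved = {"d1", "d2", "d3", "d4"}   # what the system returned
relevant = {"d1", "d3", "d5"}          # what actually satisfies the need

true_positives = len(retrieved & relevant)
precision = true_positives / len(retrieved)  # 2/4 = 0.5
recall = true_positives / len(relevant)      # 2/3 ≈ 0.67
```

Note the trade-off: returning every document in the collection gives perfect recall but poor precision.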
Users of IR
Used to be reference librarians and
professional searchers
Now: hundreds of millions of people engage
in IR every day when they use a web search
engine
Information Retrieval Systems
[Diagram: a query string and a document corpus are fed into the IR system, which returns a ranked list of documents (1. Doc1, 2. Doc2, 3. Doc3, …).]
IR Systems
IR also covers supporting users in browsing
or filtering document collections or further
processing a set of retrieved documents, e.g.
summarization.
Clustering is the task of grouping a set of
documents based on their contents. (For
example, arranging e-books based on their
topics.)
Text Browsing
IR Systems
Can be broadly distinguished by:
- Web search: to provide search over billions
of documents stored on millions of computers
- Personal IR: e.g. spam (junk mail) filter (or
e-mail classification)
- Domain-specific search: e.g. corporation’s
internal documents, a database of patents,
research articles on biochemistry.
An Example IR Problem
Which plays (scenes) of Shakespeare contain
the words Brutus AND Caesar AND NOT
Calpurnia?
A term-document incidence matrix
            Anthony and  Julius   The      Hamlet  Othello  Macbeth  …
            Cleopatra    Caesar   Tempest
Anthony          1          1        0        0       0        1
Brutus           1          1        0        1       0        0
Caesar           1          1        0        1       1        1
Calpurnia        0          1        0        0       0        0
Cleopatra        1          0        0        0       0        0
mercy            1          0        1        1       1        1
worser           1          0        1        1       1        0
…
A term-document incidence matrix
Matrix element (t, d) is 1 if the play in column d
contains the word in row t, and is 0 otherwise.
In IR, the indexed units are called terms (here, simply words).
As a result, we get a vector for each term (showing the documents it appears in) and a vector for each document (showing the terms that occur in it).
The answer
Brutus AND Caesar AND NOT Calpurnia
110100 AND 110111 AND 101111 = 100100
Boolean retrieval model: terms are combined
with the operators AND, OR, and NOT
The model views each document as just a set
of words.
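The slide's answer can be reproduced by treating each term's incidence row as a bit vector over the six plays and combining the rows with bitwise operators (a sketch; the bit strings are copied from the matrix above):

```python
# Incidence rows over the six plays (Anthony and Cleopatra, Julius Caesar,
# The Tempest, Hamlet, Othello, Macbeth), read left to right.
brutus    = 0b110100
caesar    = 0b110111
calpurnia = 0b010000
mask      = 0b111111  # six documents, so NOT must stay within six bits

# Brutus AND Caesar AND NOT Calpurnia
answer = brutus & caesar & (~calpurnia & mask)
print(format(answer, "06b"))  # 100100 -> Anthony and Cleopatra, Hamlet
```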
Inverted Index
A more realistic representation than the incidence matrix.
A 500K-term × 1M-document matrix has half a trillion 0's and 1's: far too many to fit in a computer's memory, and almost all of them are 0's.
Inverted index = inverted file = index.
Inverted Index
We keep a ‘dictionary’ of terms (sometimes also
referred to as a ‘vocabulary’ or ‘lexicon’)
Then for each term, we have a list that records which
documents the term occurs in.
Each item in the list – which records that a term
appeared in a document – is called a ‘posting’.
The list is then called a ‘postings list’ (or inverted list).
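A minimal sketch of building such an index from a toy corpus (the documents below are invented for illustration; real index construction must also handle tokenization and scale):

```python
from collections import defaultdict

# Toy corpus; document IDs assigned in order.
docs = {
    1: "caesar and brutus",
    2: "brutus killed caesar",
    3: "calpurnia dreamed",
}

# Dictionary of terms -> sorted postings list of document IDs.
index = defaultdict(list)
for doc_id in sorted(docs):
    for term in sorted(set(docs[doc_id].split())):
        index[term].append(doc_id)

print(index["brutus"])  # [1, 2]
```

Processing documents in increasing ID order keeps each postings list sorted, which the intersection algorithm below relies on.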
Inverted Index
Brutus -> 1 2 4 11 31 45 173 174
Caesar -> 1 2 4 5 6 16 57 132 …
Calpurnia -> 2 31 54 101
…
"Dictionary" -> Postings (document IDs)
Inverted Index
Processing Boolean Queries
Brutus AND Calpurnia
Brutus -> 1 2 4 11 31 45 173 174
Calpurnia -> 2 31 54 101
Intersection => 2 31
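Because both postings lists are sorted, the intersection can be computed with a single linear merge (a sketch of the standard two-pointer algorithm):

```python
def intersect(p1, p2):
    """Intersect two sorted postings lists in O(len(p1) + len(p2)) time."""
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])  # doc contains both terms
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

brutus = [1, 2, 4, 11, 31, 45, 173, 174]
calpurnia = [2, 31, 54, 101]
print(intersect(brutus, calpurnia))  # [2, 31]
```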
Natural Language Processing
Stemming: reducing inflected or derived words to a common root form
- computational → comput
Stop words: extremely common words that carry little content and are often excluded from the index
- the, it, a, etc.
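A crude sketch of both steps (the suffix list and stop-word list below are tiny, invented examples; production systems use a full stemmer such as Porter's algorithm and much larger stop lists):

```python
STOP_WORDS = {"the", "it", "a", "an", "of", "to"}  # tiny illustrative list

def naive_stem(word):
    # Strip one common suffix; a real stemmer applies ordered rule phases.
    for suffix in ("ational", "ation", "ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: len(word) - len(suffix)]
    return word

tokens = "the computational model".split()
print([naive_stem(t) for t in tokens if t not in STOP_WORDS])
# ['comput', 'model']
```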
The Vector-Space Model
Assume t distinct terms remain after preprocessing;
call them index terms or the vocabulary.
These “orthogonal” terms form a vector space.
Dimension = t = |vocabulary|
Each term, i, in a document or query, j, is given a
real-valued weight, wij.
Both documents and queries are expressed as
t-dimensional vectors:
dj = (w1j, w2j, …, wtj)
Graphic Representation
Similarity Measure
Euclidean distance
Cosine similarity
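Cosine similarity compares the directions of two weight vectors rather than their lengths, so a long document is not penalized merely for being long. A minimal sketch over hypothetical 3-term vectors:

```python
import math

def cosine_similarity(d, q):
    """Cosine of the angle between two t-dimensional weight vectors."""
    dot = sum(wd * wq for wd, wq in zip(d, q))
    norm_d = math.sqrt(sum(w * w for w in d))
    norm_q = math.sqrt(sum(w * w for w in q))
    return dot / (norm_d * norm_q)

# [2, 4, 0] points in the same direction as [1, 2, 0], just scaled:
print(cosine_similarity([1.0, 2.0, 0.0], [2.0, 4.0, 0.0]))  # 1.0
```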
Term Frequency (TF) and Inverse
Document Frequency (IDF)
fij = frequency of term i in document j
dfi = document frequency of term i
    = number of documents containing term i
idfi = inverse document frequency of term i
     = log2(N / dfi), where N is the total number of documents
Computing TF-IDF -- An Example
Given a document containing terms with given frequencies:
A(3), B(2), C(1)
Assume collection contains 10,000 documents and
document frequencies of these terms are:
A(50), B(1300), C(250)
Then:
A: tf = 3/(3+2+1) = 0.5; idf = log2(10000/50) ≈ 7.64; tf-idf ≈ 3.82
B: tf = 2/(3+2+1) ≈ 0.33; idf = log2(10000/1300) ≈ 2.94; tf-idf ≈ 0.98
C: tf = 1/(3+2+1) ≈ 0.17; idf = log2(10000/250) ≈ 5.32; tf-idf ≈ 0.89
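The worked example can be checked directly in code (same numbers as above; small differences from the slide's rounded figures are expected):

```python
import math

N = 10_000                            # documents in the collection
freqs = {"A": 3, "B": 2, "C": 1}      # term frequencies in this document
df = {"A": 50, "B": 1300, "C": 250}   # document frequencies in the collection
total = sum(freqs.values())

tfidf = {t: (freqs[t] / total) * math.log2(N / df[t]) for t in freqs}
for term, w in tfidf.items():
    print(term, round(w, 2))
```

Note how A, despite being only three times as frequent as C, ends up with a much larger weight because it is rare across the collection.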
Relevance Feedback
Relevance feedback: user feedback on relevance of
docs in initial set of results
User issues a (short, simple) query
The user marks some results as relevant and/or non-relevant.
The system computes a better representation of the
information need based on feedback.
Relevance feedback can go through one or more iterations.
Idea: it may be difficult to formulate a good query when
you don’t know the collection well, so iterate
Relevance Feedback: Example
Image search engine
http://nayana.ece.ucsb.edu/imsearch/imsearch.html
Results for Initial Query
Relevance Feedback
Results after Relevance Feedback
Relevance feedback on initial query
[Figure: documents plotted in vector space; x marks known non-relevant documents, o marks known relevant documents. The revised query moves away from the initial query, toward the region of the known relevant documents.]
Rocchio Algorithm
Used in practice:

q_m = α·q_0 + β·(1/|D_r|)·Σ_{d_j ∈ D_r} d_j − γ·(1/|D_nr|)·Σ_{d_j ∈ D_nr} d_j

where q_0 is the original query vector, D_r and D_nr are the sets of known relevant and non-relevant documents, and α, β, γ weight the original query, positive feedback, and negative feedback.
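A sketch of the Rocchio update over plain Python lists (the default weights α=1, β=0.75, γ=0.15 are commonly used values, not mandated by the formula):

```python
def rocchio(q0, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Move the query toward the centroid of relevant documents and
    away from the centroid of non-relevant documents."""
    t = len(q0)
    qm = [alpha * w for w in q0]
    for docs, coef in ((relevant, beta), (nonrelevant, -gamma)):
        if docs:  # skip empty feedback sets
            for dim in range(t):
                centroid = sum(d[dim] for d in docs) / len(docs)
                qm[dim] += coef * centroid
    # Negative term weights are usually clipped to 0.
    return [max(0.0, w) for w in qm]

# Hypothetical 2-term query with one relevant document marked:
print(rocchio([1.0, 0.0], relevant=[[0.0, 1.0]], nonrelevant=[]))
# [1.0, 0.75]
```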
Query assist
Automatic Thesaurus Generation
Example