2 Introduction to Information Retrieval

Information Retrieval (IR)

Definition of IR
 IR is finding material (usually documents) of
an unstructured nature (usually text) that
satisfies an information need from within
large collections (usually stored on
computers).
 Unstructured data?
 Information need?
Information Access Process
 Information Need → Query → Send to System → Receive Results → Evaluate Results → Done?
 If the results do not satisfy the information need, reformulate the query and repeat; otherwise, stop.
Information need
 Information need is the topic about which the
user desires to know more.
 Query is what the user conveys to the
computer in an attempt to communicate the
information need.
 A document is ‘relevant’ if it is one that the
user perceives as containing information of
value with respect to their personal
information need.
Evaluation of IR
 Retrieval Effectiveness
 To assess the effectiveness of an IR system (i.e. the
quality of its search results), precision and recall can
be used.
 Precision ( 查準率 ): what fraction of the returned
results are relevant to the information need?
 Recall ( 查全率 ): what fraction of the relevant
documents in the collection were returned by the
system?
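As a minimal sketch (assuming the returned results and the relevant documents are given as sets of document IDs; the function name is illustrative), the two measures can be computed as:

def precision_recall(retrieved, relevant):
    # Precision: fraction of returned results that are relevant.
    # Recall: fraction of relevant documents that were returned.
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# 3 of the 4 returned docs are relevant; 3 of the 6 relevant docs were found.
print(precision_recall({1, 2, 5, 9}, {1, 2, 5, 7, 8, 10}))  # (0.75, 0.5)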
Users of IR
 Used to be reference librarians and
professional searchers
 Now: hundreds of millions of people engage
in IR every day when they use a web search
engine
Information Retrieval Systems
 [Diagram: a query string and a document corpus are fed into the IR system, which returns a ranked list of documents: 1. Doc1, 2. Doc2, 3. Doc3, …]
IR Systems
 IR also covers supporting users in browsing
or filtering document collections or further
processing a set of retrieved documents, e.g.
summarization.
 Clustering is the task of grouping a set of
documents based on their contents. (For
example, arranging e-books based on their
topics.)
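As an illustrative sketch of such clustering (assuming scikit-learn is available; the sample titles are invented):

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

titles = ["intro to linear algebra", "matrix theory basics",
          "cooking pasta at home", "easy pasta recipes"]
vectors = TfidfVectorizer().fit_transform(titles)      # one TF-IDF vector per book
labels = KMeans(n_clusters=2, n_init=10).fit_predict(vectors)
print(labels)  # e.g. [0 0 1 1]: math books in one group, cooking books in the other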
Text Browsing
IR Systems
 IR systems can be broadly distinguished by scale:
 - Web search: search over billions of documents stored on millions of computers
 - Personal IR: e.g. a spam (junk mail) filter, or e-mail classification
 - Domain-specific search: e.g. a corporation's internal documents, a database of patents, research articles on biochemistry
An Example IR Problem
 Which plays (scenes) of Shakespeare contain
the words Brutus AND Caesar AND NOT
Calpurnia?
A term-document incidence matrix
            Antony & Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth  …
 Antony             1                 1             0          0       0        1
 Brutus             1                 1             0          1       0        0
 Caesar             1                 1             0          1       1        1
 Calpurnia          0                 1             0          0       0        0
 Cleopatra          1                 0             0          0       0        0
 mercy              1                 0             1          1       1        1
 worser             1                 0             1          1       1        0
 …
A term-document incidence matrix
 Matrix element (t, d) is 1 if the play in column d
contains the word in row t, and is 0 otherwise.
 In IR we speak of terms rather than words; for now, treat the two as equivalent.
 As a result, we can have a vector for each term
(which shows the documents it appears in) or a
vector for each document (showing the terms
that occur in it)
The answer
 Brutus AND Caesar AND NOT Calpurnia
 110100 AND 110111 AND 101111 = 100100, i.e. Antony and Cleopatra and Hamlet
 Boolean retrieval model: terms are combined
with the operators AND, OR, and NOT
 The model views each document as just a set
of words.
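A minimal sketch of this computation with Python integers, where the six bits correspond to the six plays in the incidence matrix above:

brutus    = 0b110100
caesar    = 0b110111
calpurnia = 0b010000
mask      = 0b111111                    # six documents in the collection

# AND the term vectors, complementing Calpurnia's for the NOT.
answer = brutus & caesar & (mask & ~calpurnia)
print(format(answer, '06b'))            # 100100: Antony and Cleopatra, Hamlet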
Inverted Index
 A more realistic approach than the full term-document matrix.
 A 500k-term × 1M-document matrix has half a trillion 0's and 1's: far too many to fit in a computer's memory, and almost all of them are 0's.
 Inverted index = inverted file (often simply called the index); building it is the task of index construction.
Inverted Index
 We keep a ‘dictionary’ of terms (sometimes also
referred to as a ‘vocabulary’ or ‘lexicon’)
 Then for each term, we have a list that records which
documents the term occurs in.
 Each item in the list – which records that a term
appeared in a document – is called a ‘posting’.
 The list is then called a ‘postings list’ (or inverted list).
Inverted Index
 Brutus    → 1  2  4  11  31  45  173  174
 Caesar    → 1  2  4  5  6  16  57  132  …
 Calpurnia → 2  31  54  101

 Dictionary → Postings lists (document IDs)
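A minimal sketch of building such an index (document IDs and tokenization are simplified; real systems normalize terms first):

from collections import defaultdict

def build_inverted_index(docs):
    # Map each term to a sorted postings list of document IDs.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

docs = {1: "brutus killed caesar", 2: "caesar married calpurnia"}
print(build_inverted_index(docs)["caesar"])  # [1, 2]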
Processing Boolean Queries
 Brutus AND Calpurnia
 Brutus    → 1  2  4  11  31  45  173  174
 Calpurnia → 2  31  54  101
 Intersection ⇒ 2  31
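Since both postings lists are kept sorted, they can be intersected in a single linear pass; a sketch of the standard merge:

def intersect(p1, p2):
    # Walk both sorted postings lists in parallel, keeping common doc IDs.
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

print(intersect([1, 2, 4, 11, 31, 45, 173, 174], [2, 31, 54, 101]))  # [2, 31]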
Natural Language Processing
 Stemming: reduce words to a common root form
 - computational → comput
 Stop words: extremely common words that are usually dropped from the index
 - the, it, a, etc.
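A small sketch of both steps using NLTK's Porter stemmer (assuming nltk is installed; the stop-word list here is a toy one):

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
stop_words = {"the", "it", "a"}         # toy stop-word list

tokens = "the computational model".split()
terms = [stemmer.stem(t) for t in tokens if t not in stop_words]
print(terms)  # ['comput', 'model']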
The Vector-Space Model
 Assume t distinct terms remain after preprocessing;
call them index terms or the vocabulary.
 These “orthogonal” terms form a vector space.
 Dimension = t = |vocabulary|
 Each term, i, in a document or query, j, is given a
real-valued weight, wij.
 Both documents and queries are expressed as
t-dimensional vectors:
 dj = (w1j, w2j, …, wtj)
Graphic Representation
 [figure omitted]
Similarity Measure
 Euclidean distance: the straight-line distance between two vectors
 Cosine similarity: sim(d, q) = (d · q) / (|d| |q|), the cosine of the angle between the two vectors; unlike Euclidean distance, it is insensitive to vector length
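A minimal numpy sketch of the cosine measure (the example weights are invented):

import numpy as np

def cosine_similarity(d, q):
    # cos(d, q) = (d . q) / (|d| * |q|)
    return np.dot(d, q) / (np.linalg.norm(d) * np.linalg.norm(q))

d = np.array([0.5, 0.8, 0.3])
q = np.array([0.9, 0.4, 0.0])
print(cosine_similarity(d, q))  # ~0.79; closer to 1 means more similar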
Term Frequency (TF) and Inverse
Document Frequency (IDF)
 fij = frequency of term i in document j
 dfi = document frequency of term i
 = number of documents containing term i
 idfi = inverse document frequency of term i
 = log2(N/dfi), where N is the total number of documents
Computing TF-IDF -- An Example
 Given a document containing terms with given frequencies:
A(3), B(2), C(1)
 Assume collection contains 10,000 documents and
document frequencies of these terms are:
A(50), B(1300), C(250)
Then:
A: tf = 3/(3+2+1); idf = log2(10000/50) ≈ 7.6; tf-idf ≈ 3.8
B: tf = 2/(3+2+1); idf = log2(10000/1300) ≈ 2.9; tf-idf ≈ 0.98
C: tf = 1/(3+2+1); idf = log2(10000/250) ≈ 5.3; tf-idf ≈ 0.89
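These numbers can be reproduced directly from the formulas above (a sketch; the printed values are rounded):

import math

N = 10_000                              # documents in the collection
tf_counts = {"A": 3, "B": 2, "C": 1}    # term frequencies in the document
df = {"A": 50, "B": 1300, "C": 250}     # document frequencies in the collection

total_terms = sum(tf_counts.values())   # 6 term occurrences in the document
for term, f in tf_counts.items():
    tf = f / total_terms
    idf = math.log2(N / df[term])
    print(term, round(tf * idf, 2))     # A 3.82, B 0.98, C 0.89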
Relevance Feedback
 Relevance feedback: user feedback on relevance of
docs in initial set of results
 User issues a (short, simple) query
 The user marks some results as relevant and/or non-relevant.
 The system computes a better representation of the
information need based on feedback.
 Relevance feedback can go through one or more iterations.
 Idea: it may be difficult to formulate a good query when
you don’t know the collection well, so iterate
Relevance Feedback: Example
 Image search engine:
 http://nayana.ece.ucsb.edu/imsearch/imsearch.html
Results for Initial Query
 [figure omitted]
Relevance Feedback
 [figure omitted]
Results after Relevance Feedback
 [figure omitted]
Relevance feedback on initial query
 [Figure: documents in the vector space; x = known non-relevant documents, o = known relevant documents. The revised query vector moves away from the initial query, toward the o's and away from the x's.]
Rocchio Algorithm
 Used in practice:

 qm = α·q0 + β·(1/|Dr|)·Σ_{dj ∈ Dr} dj − γ·(1/|Dnr|)·Σ_{dj ∈ Dnr} dj

 Dr = set of known relevant document vectors
 Dnr = set of known non-relevant document vectors
 qm = modified query vector; q0 = original query vector
 α, β, γ = weights (hand-chosen or set empirically)
 New query moves toward relevant documents
and away from irrelevant documents
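A compact numpy sketch of the formula above (the α, β, γ values are illustrative; negative term weights are clipped to zero, as is common in practice):

import numpy as np

def rocchio(q0, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    # q_m = alpha*q0 + beta*centroid(relevant) - gamma*centroid(nonrelevant)
    qm = alpha * q0
    if len(relevant):
        qm = qm + beta * np.mean(relevant, axis=0)
    if len(nonrelevant):
        qm = qm - gamma * np.mean(nonrelevant, axis=0)
    return np.maximum(qm, 0)            # clip negative weights to zero

q0 = np.array([1.0, 0.0, 0.0])
rel = np.array([[0.8, 0.6, 0.0]])       # one known relevant doc vector
nonrel = np.array([[0.0, 0.0, 0.9]])    # one known non-relevant doc vector
print(rocchio(q0, rel, nonrel))         # [1.6 0.45 0.]: toward rel, away from nonrel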
Relevance Feedback: Problems
 Long queries are inefficient for a typical IR engine. Why?
 - Long response times for the user.
 - High cost for the retrieval system.
 Partial solution: only reweight certain prominent terms, perhaps the top 20 by term frequency.
 Users are often reluctant to provide explicit feedback.
 It's often harder to understand why a particular document was retrieved after applying relevance feedback.
Two Solutions
 Pseudo relevance feedback
 Implicit (or indirect) relevance feedback
Query Expansion
 In relevance feedback, users give additional
input (relevant/non-relevant) on documents,
which is used to reweight terms in the
documents
 In query expansion, users give additional
input (good/bad search term) on words or
phrases
Query assist
 Would you expect such a feature to increase the query volume at a search engine?
How do we augment the user query?
 Manual thesaurus
 - E.g. MedLine: physician, syn: doc, doctor, MD, medico
 - Entries can be full queries rather than just synonyms
 - Ontology
 Global analysis (static; over all documents in the collection)
 - Automatically derived thesaurus (co-occurrence statistics)
 - Refinements based on query-log mining (common on the web)
 Local analysis (dynamic)
 - Analysis of documents in the result set
Automatic Thesaurus Generation Example
 [figure omitted]
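A toy sketch of the co-occurrence idea behind automatic thesaurus generation (the mini term-document matrix is invented): terms that tend to appear in the same documents receive high association scores.

import numpy as np

# Rows are terms, columns are documents; C = A @ A.T counts shared documents.
terms = ["doctor", "physician", "pasta"]
A = np.array([[1, 1, 0, 1],             # doctor
              [1, 0, 0, 1],             # physician
              [0, 0, 1, 0]])            # pasta
C = A @ A.T
print(C[0, 1], C[0, 2])  # 2 0: "doctor" associates with "physician", not "pasta"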
