2 Introduction To Information Retrieval
Definition of IR
IR is finding material (usually documents) of
an unstructured nature (usually text) that
satisfies an information need from within
large collections (usually stored on
computers).
Unstructured data?
Information need?
Information Access Process
Information need → formulate a query → send to system → receive results → evaluate results.
If the results do not satisfy the need, reformulate the query and repeat; once done, stop.
Information need
Information need is the topic about which the
user desires to know more.
Query is what the user conveys to the
computer in an attempt to communicate the
information need.
A document is ‘relevant’ if it is one that the
user perceives as containing information of
value with respect to their personal
information need.
Evaluation of IR
Retrieval Effectiveness
To assess the effectiveness of an IR system (i.e. the
quality of its search results), precision and recall can
be used.
Precision ( 查準率 ): what fraction of the returned
results are relevant to the information need?
Recall ( 查全率 ): what fraction of the relevant
documents in the collection were returned by the
system?
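As a small sketch of these two measures (the document IDs and relevance judgments below are made up for illustration):

```python
# Hypothetical retrieved set and relevance judgments for one query.
retrieved = {"d1", "d2", "d3", "d4"}   # what the system returned
relevant = {"d1", "d3", "d5"}          # what actually satisfies the need

true_positives = len(retrieved & relevant)
precision = true_positives / len(retrieved)  # 2/4 = 0.5
recall = true_positives / len(relevant)      # 2/3 ≈ 0.67
```

Note the trade-off: returning every document in the collection gives perfect recall but poor precision.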
Users of IR
Used to be reference librarians and
professional searchers
Now: hundreds of millions of people engage
in IR every day when they use a web search
engine
Information Retrieval Systems
[Diagram: a query string and a document corpus are fed into the IR system, which returns a ranked list of documents (1. Doc1, 2. Doc2, 3. Doc3, …).]
IR Systems
IR also covers supporting users in browsing
or filtering document collections or further
processing a set of retrieved documents, e.g.
summarization.
Clustering is the task of grouping a set of
documents based on their contents. (For
example, arranging e-books based on their
topics.)
Text Browsing
IR Systems
Can be broadly distinguished by:
- Web search: to provide search over billions
of documents stored on millions of computers
- Personal IR: e.g. spam (junk mail) filter (or
e-mail classification)
- Domain-specific search: e.g. corporation’s
internal documents, a database of patents,
research articles on biochemistry.
An Example IR Problem
Which plays (scenes) of Shakespeare contain
the words Brutus AND Caesar AND NOT
Calpurnia?
A term-document incidence matrix
            Anthony and  Julius   The      Hamlet  Othello  Macbeth  …
            Cleopatra    Caesar   Tempest
Anthony          1          1        0        0       0        1
Brutus           1          1        0        1       0        0
Caesar           1          1        0        1       1        1
Calpurnia        0          1        0        0       0        0
Cleopatra        1          0        0        0       0        0
mercy            1          0        1        1       1        1
worser           1          0        1        1       1        0
…
A term-document incidence matrix
Matrix element (t, d) is 1 if the play in column d
contains the word in row t, and is 0 otherwise.
In IR, the indexed units are called terms (here, simply words).
As a result, we get a vector for each term (showing the documents it appears in) and a vector for each document (showing the terms that occur in it).
The answer
Brutus AND Caesar AND NOT Calpurnia
110100 AND 110111 AND 101111 = 100100
Boolean retrieval model: terms are combined
with the operators AND, OR, and NOT
The model views each document as just a set
of words.
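The slide's answer can be reproduced by treating each term's incidence row as a bit vector over the six plays and combining the rows with bitwise operators (a sketch; the bit strings are copied from the matrix above):

```python
# Incidence rows over the six plays (Anthony and Cleopatra, Julius Caesar,
# The Tempest, Hamlet, Othello, Macbeth), read left to right.
brutus    = 0b110100
caesar    = 0b110111
calpurnia = 0b010000
mask      = 0b111111  # six documents, so NOT must stay within six bits

# Brutus AND Caesar AND NOT Calpurnia
answer = brutus & caesar & (~calpurnia & mask)
print(format(answer, "06b"))  # 100100 -> Anthony and Cleopatra, Hamlet
```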
Inverted Index
A more realistic representation than the incidence matrix.
A 500K-term × 1M-document matrix has half a trillion 0's and 1's: far too many to fit in a computer's memory, and almost all of them are 0's.
Inverted index = inverted file = index.
Inverted Index
We keep a ‘dictionary’ of terms (sometimes also
referred to as a ‘vocabulary’ or ‘lexicon’)
Then for each term, we have a list that records which
documents the term occurs in.
Each item in the list – which records that a term
appeared in a document – is called a ‘posting’.
The list is then called a ‘postings list’ (or inverted list).
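A minimal sketch of building such an index from a toy corpus (the documents below are invented for illustration; real index construction must also handle tokenization and scale):

```python
from collections import defaultdict

# Toy corpus; document IDs assigned in order.
docs = {
    1: "caesar and brutus",
    2: "brutus killed caesar",
    3: "calpurnia dreamed",
}

# Dictionary of terms -> sorted postings list of document IDs.
index = defaultdict(list)
for doc_id in sorted(docs):
    for term in sorted(set(docs[doc_id].split())):
        index[term].append(doc_id)

print(index["brutus"])  # [1, 2]
```

Processing documents in increasing ID order keeps each postings list sorted, which the intersection algorithm below relies on.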
Inverted Index
Brutus -> 1 2 4 11 31 45 173 174
Caesar -> 1 2 4 5 6 16 57 132 …
Calpurnia -> 2 31 54 101
…
"Dictionary" -> Postings (document IDs)
Inverted Index
Processing Boolean Queries
Brutus AND Calpurnia
Brutus -> 1 2 4 11 31 45 173 174
Calpurnia -> 2 31 54 101
Intersection => 2 31
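Because both postings lists are sorted, the intersection can be computed with a single linear merge (a sketch of the standard two-pointer algorithm):

```python
def intersect(p1, p2):
    """Intersect two sorted postings lists in O(len(p1) + len(p2)) time."""
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])  # doc contains both terms
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

brutus = [1, 2, 4, 11, 31, 45, 173, 174]
calpurnia = [2, 31, 54, 101]
print(intersect(brutus, calpurnia))  # [2, 31]
```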
Natural Language Processing
Stemming: reducing inflected or derived words to a common root form
- computational → comput
Stop words: extremely common words that carry little content and are often excluded from the index
- the, it, a, etc.
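A crude sketch of both steps (the suffix list and stop-word list below are tiny, invented examples; production systems use a full stemmer such as Porter's algorithm and much larger stop lists):

```python
STOP_WORDS = {"the", "it", "a", "an", "of", "to"}  # tiny illustrative list

def naive_stem(word):
    # Strip one common suffix; a real stemmer applies ordered rule phases.
    for suffix in ("ational", "ation", "ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: len(word) - len(suffix)]
    return word

tokens = "the computational model".split()
print([naive_stem(t) for t in tokens if t not in STOP_WORDS])
# ['comput', 'model']
```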
The Vector-Space Model
Assume t distinct terms remain after preprocessing;
call them index terms or the vocabulary.
These “orthogonal” terms form a vector space.
Dimension = t = |vocabulary|
Each term, i, in a document or query, j, is given a
real-valued weight, wij.
Both documents and queries are expressed as
t-dimensional vectors:
dj = (w1j, w2j, …, wtj)
Graphic Representation
Similarity Measure
Euclidean distance
Cosine similarity
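Cosine similarity compares the directions of two weight vectors rather than their lengths, so a long document is not penalized merely for being long. A minimal sketch over hypothetical 3-term vectors:

```python
import math

def cosine_similarity(d, q):
    """Cosine of the angle between two t-dimensional weight vectors."""
    dot = sum(wd * wq for wd, wq in zip(d, q))
    norm_d = math.sqrt(sum(w * w for w in d))
    norm_q = math.sqrt(sum(w * w for w in q))
    return dot / (norm_d * norm_q)

# [2, 4, 0] points in the same direction as [1, 2, 0], just scaled:
print(cosine_similarity([1.0, 2.0, 0.0], [2.0, 4.0, 0.0]))  # 1.0
```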
Term Frequency (TF) and Inverse
Document Frequency (IDF)
fij = frequency of term i in document j
dfi = document frequency of term i
    = number of documents containing term i
idfi = inverse document frequency of term i
     = log2(N / dfi), where N is the total number of documents
Computing TF-IDF -- An Example
Given a document containing terms with given frequencies:
A(3), B(2), C(1)
Assume collection contains 10,000 documents and
document frequencies of these terms are:
A(50), B(1300), C(250)
Then:
A: tf = 3/(3+2+1) = 0.5; idf = log2(10000/50) ≈ 7.64; tf-idf ≈ 3.82
B: tf = 2/(3+2+1) ≈ 0.33; idf = log2(10000/1300) ≈ 2.94; tf-idf ≈ 0.98
C: tf = 1/(3+2+1) ≈ 0.17; idf = log2(10000/250) ≈ 5.32; tf-idf ≈ 0.89
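The worked example can be checked directly in code (same numbers as above; small differences from the slide's rounded figures are expected):

```python
import math

N = 10_000                            # documents in the collection
freqs = {"A": 3, "B": 2, "C": 1}      # term frequencies in this document
df = {"A": 50, "B": 1300, "C": 250}   # document frequencies in the collection
total = sum(freqs.values())

tfidf = {t: (freqs[t] / total) * math.log2(N / df[t]) for t in freqs}
for term, w in tfidf.items():
    print(term, round(w, 2))
```

Note how A, despite being only three times as frequent as C, ends up with a much larger weight because it is rare across the collection.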
Relevance Feedback
Relevance feedback: user feedback on relevance of
docs in initial set of results
User issues a (short, simple) query
The user marks some results as relevant and/or non-relevant.
The system computes a better representation of the
information need based on feedback.
Relevance feedback can go through one or more iterations.
Idea: it may be difficult to formulate a good query when
you don’t know the collection well, so iterate
Relevance Feedback: Example
Image search engine
http://nayana.ece.ucsb.edu/imsearch/imsearch.html
Results for Initial Query
Relevance Feedback
Results after Relevance Feedback
Relevance feedback on initial query
[Figure: documents plotted in vector space; x marks known non-relevant documents, o marks known relevant documents. The revised query moves away from the initial query, toward the region of the known relevant documents.]
Rocchio Algorithm
Used in practice:

q_m = α·q_0 + β·(1/|D_r|)·Σ_{d_j ∈ D_r} d_j − γ·(1/|D_nr|)·Σ_{d_j ∈ D_nr} d_j

where q_0 is the original query vector, D_r and D_nr are the sets of known relevant and non-relevant documents, and α, β, γ weight the original query, positive feedback, and negative feedback.
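A sketch of the Rocchio update over plain Python lists (the default weights α=1, β=0.75, γ=0.15 are commonly used values, not mandated by the formula):

```python
def rocchio(q0, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Move the query toward the centroid of relevant documents and
    away from the centroid of non-relevant documents."""
    t = len(q0)
    qm = [alpha * w for w in q0]
    for docs, coef in ((relevant, beta), (nonrelevant, -gamma)):
        if docs:  # skip empty feedback sets
            for dim in range(t):
                centroid = sum(d[dim] for d in docs) / len(docs)
                qm[dim] += coef * centroid
    # Negative term weights are usually clipped to 0.
    return [max(0.0, w) for w in qm]

# Hypothetical 2-term query with one relevant document marked:
print(rocchio([1.0, 0.0], relevant=[[0.0, 1.0]], nonrelevant=[]))
# [1.0, 0.75]
```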
Query assist
Automatic Thesaurus Generation
Example