week6

Information Retrieval (IR) involves finding unstructured documents that meet a user's information needs, commonly associated with web search but applicable in various contexts like email and corporate databases. The document discusses key concepts such as the classic search model, precision and recall metrics, and the construction of inverted indexes for efficient query processing. It also covers advanced topics like phrase queries and positional indexes to enhance search accuracy and relevance.


Introduction to Information Retrieval

Introduction to
Information Retrieval
Introducing Information Retrieval
and Web Search
Introduction to Information Retrieval

Information Retrieval
 Information Retrieval (IR) is finding material (usually
documents) of an unstructured nature (usually text)
that satisfies an information need from within large
collections (usually stored on computers).

 These days we frequently think first of web search, but
there are many other cases:
 E-mail search
 Searching your laptop
 Corporate knowledge bases
 Legal information retrieval

2
Introduction to Information Retrieval

Unstructured (text) vs. structured (database) data in the mid-nineties

[Bar chart comparing unstructured vs. structured data on two measures: Data volume and Market Cap]

3
Introduction to Information Retrieval

Unstructured (text) vs. structured (database) data today

[Bar chart comparing unstructured vs. structured data on two measures: Data volume and Market Cap]

4
Introduction to Information Retrieval Sec. 1.1

Basic assumptions of Information Retrieval


 Collection: A set of documents
 Assume it is a static collection for the moment

 Goal: Retrieve documents with information that is relevant
to the user’s information need and helps the user complete a task

5
Introduction to Information Retrieval

The classic search model


User task: Get rid of mice in a politically correct way
        ↓ (misconception?)
Info need: Info about removing mice without killing them
        ↓ (misformulation?)
Query: how trap mice alive
        ↓
Search engine → searches the Collection → Results
        ↺ Query refinement (loop from Results back to Query)
Introduction to Information Retrieval Sec. 1.1

How good are the retrieved docs?


 Precision : Fraction of retrieved docs that are
relevant to the user’s information need
 Recall : Fraction of relevant docs in collection that
are retrieved

 More precise definitions and measurements to follow later

7
Introduction to Information Retrieval

Introduction to
Information Retrieval
Term-document incidence matrices
Introduction to Information Retrieval Sec. 1.1

Unstructured data in 1620


 Which plays of Shakespeare contain the words Brutus
AND Caesar but NOT Calpurnia?
 One could grep all of Shakespeare’s plays for Brutus
and Caesar, then strip out lines containing Calpurnia?
 Why is that not the answer?
 Slow (for large corpora)
 NOT Calpurnia is non-trivial
 Other operations (e.g., find the word Romans near
countrymen) not feasible
 Ranked retrieval (best documents to return)
 Later lectures
10
Introduction to Information Retrieval Sec. 1.1

Term-document incidence matrices

Antony and Cleopatra Julius Caesar The Tempest Hamlet Othello Macbeth
Antony 1 1 0 0 0 1
Brutus 1 1 0 1 0 0
Caesar 1 1 0 1 1 1
Calpurnia 0 1 0 0 0 0
Cleopatra 1 0 0 0 0 0
mercy 1 0 1 1 1 1
worser 1 0 1 1 1 0

Query: Brutus AND Caesar BUT NOT Calpurnia

Entry is 1 if the play contains the word, 0 otherwise.
Introduction to Information Retrieval Sec. 1.1

Incidence vectors
 So we have a 0/1 vector for each term.
 To answer the query: take the vectors for Brutus, Caesar
and Calpurnia (complemented) → bitwise AND.

        Brutus:                  110100
        Caesar:                  110111
        Calpurnia (complemented): 101111
        bitwise AND:             100100

The columns are, in order: Antony and Cleopatra, Julius Caesar,
The Tempest, Hamlet, Othello, Macbeth – so the 1s pick out
Antony and Cleopatra and Hamlet.

12
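To make the bitwise AND concrete, here is a minimal Python sketch (my code, not part of the slides) that evaluates the query over the six plays; the bit patterns are the term vectors above, with the leftmost bit corresponding to Antony and Cleopatra:

# Minimal sketch: "Brutus AND Caesar AND NOT Calpurnia" on 0/1 incidence vectors.
plays = ["Antony and Cleopatra", "Julius Caesar", "The Tempest",
         "Hamlet", "Othello", "Macbeth"]

brutus    = 0b110100
caesar    = 0b110111
calpurnia = 0b010000

# Complement Calpurnia within the 6-bit universe, then AND everything.
result = brutus & caesar & (~calpurnia & 0b111111)

# Read off the plays whose bit is set (leftmost bit = first play).
matches = [play for i, play in enumerate(plays) if result >> (5 - i) & 1]
print(matches)   # ['Antony and Cleopatra', 'Hamlet']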
Introduction to Information Retrieval Sec. 1.1

Answers to query
 Antony and Cleopatra, Act III, Scene ii
Agrippa [Aside to DOMITIUS ENOBARBUS]: Why, Enobarbus,
When Antony found Julius Caesar dead,
He cried almost to roaring; and he wept
When at Philippi he found Brutus slain.

 Hamlet, Act III, Scene ii


Lord Polonius: I did enact Julius Caesar I was killed i’ the
Capitol; Brutus killed me.

13
Introduction to Information Retrieval Sec. 1.1

Bigger collections
 Consider N = 1 million documents, each with about
1000 words.
 Avg 6 bytes/word including spaces/punctuation
 6GB of data in the documents.
 Say there are M = 500K distinct terms among these.

14
Introduction to Information Retrieval Sec. 1.1

Can’t build the matrix


 500K x 1M matrix has half-a-trillion 0’s and 1’s.

 But it has no more than one billion 1’s. Why?


 matrix is extremely sparse.

 What’s a better representation?


 We only record the 1 positions.

15
Introduction to Information Retrieval

Introduction to
Information Retrieval
The Inverted Index
The key data structure underlying modern IR
Introduction to Information Retrieval Sec. 1.2

Inverted index
 For each term t, we must store a list of all documents
that contain t.
 Identify each doc by a docID, a document serial number
 Can we use fixed-size arrays for this?
Brutus 1 2 4 11 31 45 173 174
Caesar 1 2 4 5 6 16 57 132
Calpurnia 2 31 54 101

What happens if the word Caesar is added to document 14?
18
Introduction to Information Retrieval Sec. 1.2

Inverted index
 We need variable-size postings lists
 On disk, a continuous run of postings is normal and best
 In memory, can use linked lists or variable length arrays
 Some tradeoffs in size/ease of insertion

Brutus    → 1 2 4 11 31 45 173 174
Caesar    → 1 2 4 5 6 16 57 132
Calpurnia → 2 31 54 101

Dictionary (terms on the left), Postings (lists on the right);
each docID entry in a list is a posting.
Sorted by docID (more later on why).
19
Introduction to Information Retrieval Sec. 1.2

Inverted index construction


Documents to be indexed: Friends, Romans, countrymen.
        ↓ Tokenizer
Token stream: Friends Romans Countrymen
        ↓ Linguistic modules
Modified tokens: friend roman countryman
        ↓ Indexer
Inverted index:
        friend → 2, 4
        roman → 1, 2
        countryman → 13, 16
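The pipeline above can be sketched in a few lines of Python. This is a simplified illustration rather than the indexer the slides describe: tokenization and the linguistic modules are reduced to lower-casing and splitting on letters.

import re
from collections import defaultdict

def build_inverted_index(docs):
    """docs: dict mapping docID -> text. Returns term -> sorted list of docIDs."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Crude stand-in for the tokenizer and linguistic modules.
        for term in re.findall(r"[a-z]+", text.lower()):
            index[term].add(doc_id)
    # Postings are kept sorted by docID (needed for the merge later).
    return {term: sorted(ids) for term, ids in index.items()}

docs = {1: "I did enact Julius Caesar: I was killed i' the Capitol; Brutus killed me.",
        2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious."}
index = build_inverted_index(docs)
print(index["brutus"])   # [1, 2]
print(index["capitol"])  # [1]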
Introduction to Information Retrieval

Initial stages of text processing


 Tokenization
 Cut character sequence into word tokens
 Deal with “John’s”, a state-of-the-art solution
 Normalization
 Map text and query term to same form
 You want U.S.A. and USA to match
 Stemming
 We may wish different forms of a root to match
 authorize, authorization
 Stop words
 We may omit very common words (or not)
 the, a, to, of
Introduction to Information Retrieval Sec. 1.2

Indexer steps: Token sequence


 Sequence of (Modified token, Document ID) pairs.

Doc 1: I did enact Julius Caesar I was killed i’ the Capitol;
Brutus killed me.

Doc 2: So let it be with Caesar. The noble Brutus hath told you
Caesar was ambitious
Introduction to Information Retrieval Sec. 1.2

Indexer steps: Sort


 Sort by terms
 And then docID

Core indexing step


Introduction to Information Retrieval Sec. 1.2

Indexer steps: Dictionary & Postings


 Multiple term
entries in a single
document are
merged.
 Split into Dictionary
and Postings
 Doc. frequency
information is
added.

Why frequency?
Will discuss later.
Introduction to Information Retrieval Sec. 1.2

Where do we pay in storage?


 Terms and counts (the dictionary)
 Lists of docIDs (the postings)
 Pointers from dictionary entries to postings lists

IR system implementation questions:
• How do we index efficiently?
• How much storage do we need?
26
Introduction to Information Retrieval

Introduction to
Information Retrieval
Query processing with an inverted index
Introduction to Information Retrieval Sec. 1.3

The index we just built


 How do we process a query? (our focus)
 Later - what kinds of queries can we process?

29
Introduction to Information Retrieval Sec. 1.3

Query processing: AND


 Consider processing the query:
Brutus AND Caesar
 Locate Brutus in the Dictionary;
 Retrieve its postings.
 Locate Caesar in the Dictionary;
 Retrieve its postings.
 “Merge” the two postings (intersect the document sets):

2 4 8 16 32 64 128 Brutus
1 2 3 5 8 13 21 34 Caesar

30
Introduction to Information Retrieval Sec. 1.3

The merge
 Walk through the two postings simultaneously, in
time linear in the total number of postings entries

2 4 8 16 32 64 128 Brutus
1 2 3 5 8 13 21 34 Caesar
Intersection: 2 8

If the list lengths are x and y, the merge takes O(x+y)


operations.
Crucial: postings sorted by docID.
32
Introduction to Information Retrieval

Intersecting two postings lists (a “merge” algorithm)

33
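The figure for this slide (the textbook's intersection pseudocode) is not reproduced in this transcript; the following Python sketch implements the same linear merge, assuming postings are lists sorted by docID:

def intersect(p1, p2):
    """Intersect two postings lists sorted by docID in O(x + y) time."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])   # docID present in both lists
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1                 # advance the pointer with the smaller docID
        else:
            j += 1
    return answer

print(intersect([2, 4, 8, 16, 32, 64, 128], [1, 2, 3, 5, 8, 13, 21, 34]))  # [2, 8]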
Introduction to Information Retrieval

Introduction to
Information Retrieval
Phrase queries and positional indexes
Introduction to Information Retrieval Sec. 2.4

Phrase queries
 We want to be able to answer queries such as
“stanford university” – as a phrase
 Thus the sentence “I went to university at Stanford”
is not a match.
 The concept of phrase queries has proven easily
understood by users; one of the few “advanced search”
ideas that works
 Many more queries are implicit phrase queries
 For this, it no longer suffices to store only
<term : docs> entries
Introduction to Information Retrieval Sec. 2.4.1

A first attempt: Biword indexes


 Index every consecutive pair of terms in the text as a
phrase
 For example the text “Friends, Romans,
Countrymen” would generate the biwords
 friends romans
 romans countrymen
 Each of these biwords is now a dictionary term
 Two-word phrase query-processing is now
immediate.
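A rough sketch of the biword idea in Python (my code; the docIDs 7 and 9 are made up for illustration):

from collections import defaultdict

def build_biword_index(docs):
    """docs: docID -> list of (already normalized) tokens. Returns biword -> sorted docIDs."""
    index = defaultdict(set)
    for doc_id, tokens in docs.items():
        for w1, w2 in zip(tokens, tokens[1:]):
            index[f"{w1} {w2}"].add(doc_id)
    return {bw: sorted(ids) for bw, ids in index.items()}

docs = {7: ["friends", "romans", "countrymen"],
        9: ["romans", "and", "countrymen"]}
index = build_biword_index(docs)
print(index["friends romans"])     # [7]
print(index["romans countrymen"])  # [7] -- doc 9 does not contain this biword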
Introduction to Information Retrieval Sec. 2.4.1

Longer phrase queries


 Longer phrases can be processed by breaking them
down
 stanford university palo alto can be broken into the
Boolean query on biwords:
stanford university AND university palo AND palo alto

Without the docs, we cannot verify that the docs matching
the above Boolean query do contain the phrase.
Can have false positives!
Introduction to Information Retrieval Sec. 2.4.1

Issues for biword indexes


 False positives, as noted before
 Index blowup due to bigger dictionary
 Infeasible for more than biwords, big even for them

 Biword indexes are not the standard solution (for all
biwords) but can be part of a compound strategy
Introduction to Information Retrieval Sec. 2.4.2

Solution 2: Positional indexes


 In the postings, store, for each term the position(s) in
which tokens of it appear:

<term, number of docs containing term;


doc1: position1, position2 … ;
doc2: position1, position2 … ;
etc.>
Introduction to Information Retrieval Sec. 2.4.2

Positional index example

<be: 993427;
 1: 7, 18, 33, 72, 86, 231;
 2: 3, 149;
 4: 17, 191, 291, 430, 434;
 5: 363, 367, …>

Which of docs 1, 2, 4, 5 could contain “to be or not to be”?

 For phrase queries, we use a merge algorithm recursively
at the document level
 But we now need to deal with more than just equality
Introduction to Information Retrieval Sec. 2.4.2

Processing a phrase query


 Extract inverted index entries for each distinct term:
to, be, or, not.
 Merge their doc:position lists to enumerate all
positions with “to be or not to be”.
 to:
 2:1,17,74,222,551; 4:8,16,190,429,433; 7:13,23,191; ...
 be:
 1:17,19; 4:17,191,291,430,434; 5:14,19,101; ...
 Same general method for proximity searches
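A simplified Python sketch of the position-level merge (my code, not the textbook's positional-intersect algorithm): given the sorted position lists of two terms inside one document, it reports places where the second term occurs exactly k positions after the first; k = 1 corresponds to an adjacent phrase such as "to be", while a /k proximity operator would instead accept any pair of positions within distance k.

def positions_within(pos1, pos2, k=1):
    """Return (p1, p2) pairs where the second term appears at p2 = p1 + k.
    pos1, pos2 are sorted position lists for one document; k=1 means adjacent."""
    matches = []
    j = 0
    for p1 in pos1:
        # Skip positions of the second term that are too early to ever match p1.
        while j < len(pos2) and pos2[j] < p1 + k:
            j += 1
        if j < len(pos2) and pos2[j] == p1 + k:
            matches.append((p1, pos2[j]))
    return matches

# Doc 4 from the example entries: to -> [8, 16, 190, 429, 433], be -> [17, 191, 291, 430, 434]
print(positions_within([8, 16, 190, 429, 433], [17, 191, 291, 430, 434]))
# [(16, 17), (190, 191), (429, 430), (433, 434)]  -- occurrences of the phrase "to be"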
Introduction to Information Retrieval Sec. 2.4.2

Proximity queries
 LIMIT! /3 STATUTE /3 FEDERAL /2 TORT
 Again, here, /k means “within k words of”.
 Clearly, positional indexes can be used for such
queries; biword indexes cannot.
 Exercise: Adapt the linear merge of postings to
handle proximity queries. Can you make it work for
any value of k?
 This is a little tricky to do correctly and efficiently
 See Figure 2.12 of IIR
Introduction to Information Retrieval Sec. 2.4.2

Positional index size


 A positional index expands postings storage
substantially
 Even though indices can be compressed
 Nevertheless, a positional index is now standardly
used because of the power and usefulness of phrase
and proximity queries … whether used explicitly or
implicitly in a ranking retrieval system.
Introduction to Information Retrieval Sec. 2.4.2

Positional index size


 Need an entry for each occurrence, not just once per
document
 Index size depends on average document size Why?
 Average web page has <1000 terms
 SEC filings, books, even some epic poems … easily 100,000
terms
 Consider a term with frequency 0.1%
Document size Postings Positional postings
1000 1 1
100,000 1 100
Introduction to Information Retrieval Sec. 2.4.2

Rules of thumb
 A positional index is 2–4 times as large as a non-positional index

 Positional index size is 35–50% of the volume of the original text

 Caveat: all of this holds for “English-like” languages


Introduction to Information Retrieval Sec. 2.4.3

Combination schemes
 These two approaches can be profitably combined
 For particular phrases (“Michael Jackson”, “Britney
Spears”) it is inefficient to keep on merging positional
postings lists
 Even more so for phrases like “The Who”
 Williams et al. (2004) evaluate a more sophisticated
mixed indexing scheme
 A typical web query mixture was executed in ¼ of the time
of using just a positional index
 It required 26% more space than having a positional index
alone
Introduction to Information Retrieval

Introduction to
Information Retrieval
Introducing ranked retrieval
Introduction to Information Retrieval Ch. 6

Ranked retrieval
 Thus far, our queries have all been Boolean.
 Documents either match or don’t.
 Good for expert users with precise understanding of
their needs and the collection.
 Also good for applications: Applications can easily
consume 1000s of results.
 Not good for the majority of users.
 Most users incapable of writing Boolean queries (or they
are, but they think it’s too much work).
 Most users don’t want to wade through 1000s of results.
 This is particularly true of web search.
Introduction to Information Retrieval Ch. 6

Problem with Boolean search: feast or famine
 Boolean queries often result in either too few (≈0) or
too many (1000s) results.
 Query 1: “standard user dlink 650” → 200,000 hits
 Query 2: “standard user dlink 650 no card found” → 0 hits
 It takes a lot of skill to come up with a query that
produces a manageable number of hits.
 AND gives too few; OR gives too many
Introduction to Information Retrieval

Ranked retrieval models


 Rather than a set of documents satisfying a query
expression, in ranked retrieval models, the system
returns an ordering over the (top) documents in the
collection with respect to a query
 Free text queries: Rather than a query language of
operators and expressions, the user’s query is just
one or more words in a human language
 In principle, there are two separate choices here, but
in practice, ranked retrieval models have normally
been associated with free text queries and vice versa
4
Introduction to Information Retrieval Ch. 6

Feast or famine: not a problem in ranked retrieval
 When a system produces a ranked result set, large
result sets are not an issue
 Indeed, the size of the result set is not an issue
 We just show the top k ( ≈ 10) results
 We don’t overwhelm the user

 Premise: the ranking algorithm works


Introduction to Information Retrieval Ch. 6

Scoring as the basis of ranked retrieval


 We wish to return in order the documents most likely
to be useful to the searcher
 How can we rank-order the documents in the
collection with respect to a query?
 Assign a score – say in [0, 1] – to each document
 This score measures how well document and query
“match”.
Introduction to Information Retrieval Ch. 6

Query-document matching scores


 We need a way of assigning a score to a
query/document pair
 Let’s start with a one-term query
 If the query term does not occur in the document:
score should be 0
 The more frequent the query term in the document,
the higher the score (should be)
 We will look at a number of alternatives for this
Introduction to Information Retrieval

Introduction to
Information Retrieval
Scoring with the Jaccard coefficient
Introduction to Information Retrieval Ch. 6

Take 1: Jaccard coefficient


 A commonly used measure of overlap of two sets A
and B is the Jaccard coefficient
 jaccard(A,B) = |A ∩ B| / |A ∪ B|
 jaccard(A,A) = 1
 jaccard(A,B) = 0 if A ∩ B = ∅
 A and B don’t have to be the same size.
 Always assigns a number between 0 and 1.
Introduction to Information Retrieval Ch. 6

Jaccard coefficient: Scoring example


 What is the query-document match score that the
Jaccard coefficient computes for each of the two
documents below?
 Query: ides of march
 Document 1: caesar died in march
 Document 2: the long march
 Document 3: the long ides march
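A small Python sketch (mine, not the slides') that computes these scores:

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

query = "ides of march".split()
doc1 = "caesar died in march".split()
doc2 = "the long march".split()
doc3 = "the long ides march".split()

print(jaccard(query, doc1))  # 1/6 ≈ 0.17
print(jaccard(query, doc2))  # 1/5 = 0.2
print(jaccard(query, doc3))  # 2/5 = 0.4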

Introduction to Information Retrieval Ch. 6

Issues with Jaccard for scoring


 It doesn’t consider term frequency (how many times
a term occurs in a document)
 Rare terms in a collection are more informative than
frequent terms
 Jaccard doesn’t consider this information
 We need a more sophisticated way of normalizing for
length
 Later in this lecture, we’ll use |A ∩ B| / √(|A ∪ B|)
. . . instead of |A ∩ B| / |A ∪ B| (Jaccard) for length
normalization.
Introduction to Information Retrieval

Introduction to
Information Retrieval
Term frequency weighting
Introduction to Information Retrieval Sec. 6.2

Recall: Binary term-document


incidence matrix
Antony and Cleopatra Julius Caesar The Tempest Hamlet Othello Macbeth

Antony 1 1 0 0 0 1
Brutus 1 1 0 1 0 0
Caesar 1 1 0 1 1 1
Calpurnia 0 1 0 0 0 0
Cleopatra 1 0 0 0 0 0
mercy 1 0 1 1 1 1
worser 1 0 1 1 1 0

 Each document is represented by a binary vector ∈ {0,1}^|V|
Introduction to Information Retrieval Sec. 6.2

Term-document count matrices


 Consider the number of occurrences of a term in a
document:
 Each document is a count vector in ℕ^|V|: a column below

Antony and Cleopatra Julius Caesar The Tempest Hamlet Othello Macbeth

Antony 157 73 0 0 0 0
Brutus 4 157 0 1 0 0
Caesar 232 227 0 2 1 1
Calpurnia 0 10 0 0 0 0
Cleopatra 57 0 0 0 0 0
mercy 2 0 3 5 5 1
worser 2 0 1 1 1 0
Introduction to Information Retrieval

Bag of words model


 Vector representation doesn’t consider the ordering
of words in a document
 John is quicker than Mary and Mary is quicker than
John have the same vectors

 This is called the bag of words model.


 In a sense, this is a step back: The positional index
was able to distinguish these two documents
 We will look at “recovering” positional information later on
 For now: bag of words model
Introduction to Information Retrieval

Term frequency tf
 The term frequency tft,d of term t in document d is
defined as the number of times that t occurs in d.
 We want to use tf when computing query-document
match scores. But how?
 Raw term frequency is not what we want:
 A document with 10 occurrences of the term is more
relevant than a document with 1 occurrence of the term.
 But not 10 times more relevant.
 Relevance does not increase proportionally with
term frequency.
NB: frequency = count in IR
Introduction to Information Retrieval Sec. 6.2

Log-frequency weighting
 The log frequency weight of term t in d is

        w_t,d = 1 + log10(tf_t,d)   if tf_t,d > 0
        w_t,d = 0                   otherwise

 Score for a document-query pair: sum over terms t in both q and d:

        score(q, d) = Σ_{t ∈ q ∩ d} (1 + log10 tf_t,d)

 The score is 0 if none of the query terms is present in the document.
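A minimal Python sketch of this scoring rule (my code; the term counts are toy values):

import math

def log_tf_weight(tf):
    return 1 + math.log10(tf) if tf > 0 else 0.0

def overlap_score(query_terms, doc_tf):
    """doc_tf: dict mapping term -> raw count in the document."""
    return sum(log_tf_weight(doc_tf.get(t, 0)) for t in query_terms)

doc_tf = {"caesar": 10, "brutus": 1}
print(overlap_score(["brutus", "caesar"], doc_tf))  # (1 + 0) + (1 + 1) = 3.0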
Introduction to Information Retrieval

Introduction to
Information Retrieval
(Inverse) Document frequency weighting
Introduction to Information Retrieval Sec. 6.2.1

Document frequency
 Rare terms are more informative than frequent terms
 Recall stop words
 Consider a term in the query that is rare in the
collection (e.g., arachnocentric)
 A document containing this term is very likely to be
relevant to the query arachnocentric
 → We want a high weight for rare terms like
arachnocentric.
Introduction to Information Retrieval Sec. 6.2.1

Document frequency, continued


 Frequent terms are less informative than rare terms
 Consider a query term that is frequent in the
collection (e.g., high, increase, line)
 A document containing such a term is more likely to
be relevant than a document that doesn’t
 But it’s not a sure indicator of relevance.
 → For frequent terms, we want positive weights for
words like high, increase, and line
 But lower weights than for rare terms.
 We will use document frequency (df) to capture this.
Introduction to Information Retrieval Sec. 6.2.1

idf weight
 df_t is the document frequency of t: the number of
documents that contain t
 df_t is an inverse measure of the informativeness of t
 df_t ≤ N
 We define the idf (inverse document frequency) of t by

        idf_t = log10(N / df_t)

 We use log(N/df_t) instead of N/df_t to “dampen” the effect
of idf.

Will turn out the base of the log is immaterial.


Introduction to Information Retrieval Sec. 6.2.1

idf example, suppose N = 1 million


term          df_t          idf_t
calpurnia     1             6
animal        100           4
sunday        1,000         3
fly           10,000        2
under         100,000       1
the           1,000,000     0

        idf_t = log10(N / df_t)


There is one idf value for each term t in a collection.
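The idf values in the table follow directly from the formula; a quick check in Python with N = 1,000,000:

import math

N = 1_000_000
dfs = {"calpurnia": 1, "animal": 100, "sunday": 1_000,
       "fly": 10_000, "under": 100_000, "the": 1_000_000}

for term, df in dfs.items():
    print(term, math.log10(N / df))
# calpurnia 6.0, animal 4.0, sunday 3.0, fly 2.0, under 1.0, the 0.0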
Introduction to Information Retrieval

Effect of idf on ranking


 Question: Does idf have an effect on ranking for one-
term queries, like
 iPhone

28
Introduction to Information Retrieval

Effect of idf on ranking


 Question: Does idf have an effect on ranking for one-
term queries, like
 iPhone
 idf has no effect on ranking one term queries
 idf affects the ranking of documents for queries with at
least two terms
 For the query capricious person, idf weighting makes
occurrences of capricious count for much more in the final
document ranking than occurrences of person.

29
Introduction to Information Retrieval Sec. 6.2.1

Collection vs. Document frequency


 The collection frequency of t is the number of
occurrences of t in the collection, counting
multiple occurrences.
 Example:
Word Collection frequency Document frequency
insurance 10440 3997
try 10422 8760

 Which word is a better search term (and should get a higher weight)?
Introduction to Information Retrieval

Introduction to
Information Retrieval
tf-idf weighting
Introduction to Information Retrieval Sec. 6.2.2

tf-idf weighting
 The tf-idf weight of a term is the product of its tf
weight and its idf weight.
        w_t,d = (1 + log10 tf_t,d) × log10(N / df_t)
 Best known weighting scheme in information retrieval
 Note: the “-” in tf-idf is a hyphen, not a minus sign!
 Alternative names: tf.idf, tf x idf
 Increases with the number of occurrences within a
document
 Increases with the rarity of the term in the collection
Introduction to Information Retrieval Sec. 6.2.2

Final ranking of documents for a query

        Score(q, d) = Σ_{t ∈ q ∩ d} tf-idf_t,d

34
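Putting tf and idf together, a minimal Python sketch (my code; the df and tf numbers are toy values, with N = 1,000,000):

import math

def tf_idf(tf, df, N):
    """tf-idf weight: (1 + log10 tf) * log10(N / df); 0 if the term is absent."""
    if tf == 0:
        return 0.0
    return (1 + math.log10(tf)) * math.log10(N / df)

def score(query_terms, doc_tf, df, N):
    return sum(tf_idf(doc_tf.get(t, 0), df[t], N) for t in query_terms if t in df)

N = 1_000_000
df = {"arachnocentric": 50, "the": 1_000_000}
doc_tf = {"arachnocentric": 3, "the": 120}
print(score(["arachnocentric", "the"], doc_tf, df, N))
# ≈ 6.35 -- "the" contributes 0 because its idf is 0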
Introduction to Information Retrieval Sec. 6.3

Binary → count → weight matrix


Antony and Cleopatra Julius Caesar The Tempest Hamlet Othello Macbeth

Antony 5.25 3.18 0 0 0 0.35


Brutus 1.21 6.1 0 1 0 0
Caesar 8.59 2.54 0 1.51 0.25 0
Calpurnia 0 1.54 0 0 0 0
Cleopatra 2.85 0 0 0 0 0
mercy 1.51 0 1.9 0.12 5.25 0.88
worser 1.37 0 0.11 4.15 0.25 1.95

Each document is now represented by a real-valued vector of
tf-idf weights ∈ R^|V|
Introduction to Information Retrieval

Introduction to
Information Retrieval
The Vector Space Model (VSM)
Introduction to Information Retrieval Sec. 6.3

Documents as vectors
 Now we have a |V|-dimensional vector space
 Terms are axes of the space
 Documents are points or vectors in this space
 Very high-dimensional: tens of millions of
dimensions when you apply this to a web search
engine
 These are very sparse vectors – most entries are zero
Introduction to Information Retrieval Sec. 6.3

Queries as vectors
 Key idea 1: Do the same for queries: represent them
as vectors in the space
 Key idea 2: Rank documents according to their
proximity to the query in this space
 proximity = similarity of vectors
 proximity ≈ inverse of distance
 Recall: We do this because we want to get away from
the you’re-either-in-or-out Boolean model
 Instead: rank more relevant documents higher than
less relevant documents
Introduction to Information Retrieval Sec. 6.3

Formalizing vector space proximity


 First cut: distance between two points
 ( = distance between the end points of the two vectors)
 Euclidean distance?
 Euclidean distance is a bad idea . . .
 . . . because Euclidean distance is large for vectors of
different lengths.
Introduction to Information Retrieval Sec. 6.3

Why distance is a bad idea


The Euclidean distance between q and d2 is large even though the
distribution of terms in the query q and the distribution of terms
in the document d2 are very similar.
Introduction to Information Retrieval Sec. 6.3

Use angle instead of distance


 Thought experiment: take a document d and append
it to itself. Call this document d′.
 “Semantically” d and d′ have the same content
 The Euclidean distance between the two documents
can be quite large
 The angle between the two documents is 0,
corresponding to maximal similarity.

 Key idea: Rank documents according to angle with query.
Introduction to Information Retrieval Sec. 6.3

From angles to cosines


 The following two notions are equivalent.
 Rank documents in decreasing order of the angle between
query and document
 Rank documents in increasing order of
cosine(query,document)

 Cosine is a monotonically decreasing function for the
interval [0°, 180°]
Introduction to Information Retrieval Sec. 6.3

From angles to cosines

 But how – and why – should we be computing cosines?


Introduction to Information Retrieval Sec. 6.3

Length normalization
 A vector can be (length-) normalized by dividing each
of its components by its length – for this we use the L2 norm:

        ‖x‖₂ = √( Σ_i x_i² )

 Dividing a vector by its L2 norm makes it a unit (length)
vector (on the surface of the unit hypersphere)
 Effect on the two documents d and d′ (d appended
to itself) from earlier slide: they have identical
vectors after length-normalization.
 Long and short documents now have comparable weights
Introduction to Information Retrieval Sec. 6.3

cosine(query,document)
        cos(q, d) = (q • d) / (|q| |d|) = (q/|q|) • (d/|d|)
                  = Σ_{i=1}^{|V|} q_i d_i  /  ( √(Σ_{i=1}^{|V|} q_i²) · √(Σ_{i=1}^{|V|} d_i²) )

(q • d is the dot product; q/|q| and d/|d| are unit vectors.)

q_i is the tf-idf weight of term i in the query
d_i is the tf-idf weight of term i in the document

cos(q, d) is the cosine similarity of q and d … or, equivalently,
the cosine of the angle between q and d.
Introduction to Information Retrieval

Cosine for length-normalized vectors


 For length-normalized vectors, cosine similarity is
simply the dot product (or scalar product):

        cos(q, d) = q • d = Σ_{i=1}^{|V|} q_i d_i

for q, d length-normalized.

47
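As a sketch in Python (my code; dense toy vectors over a shared vocabulary):

import math

def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else v

def cosine(q, d):
    """Cosine similarity = dot product of the length-normalized vectors."""
    q, d = l2_normalize(q), l2_normalize(d)
    return sum(qi * di for qi, di in zip(q, d))

print(cosine([1.0, 2.0, 0.0], [2.0, 4.0, 0.0]))  # ≈ 1.0 -- same direction
print(cosine([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))  # 0.0 -- orthogonal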
Introduction to Information Retrieval

Cosine similarity illustrated

48
Introduction to Information Retrieval Sec. 6.3

Cosine similarity amongst 3 documents


How similar are the novels
SaS: Sense and Sensibility,
PaP: Pride and Prejudice, and
WH: Wuthering Heights?

Term frequencies (counts):

term        SaS   PaP   WH
affection   115   58    20
jealous     10    7     11
gossip      2     0     6
wuthering   0     0     38

Note: To simplify this example, we don’t do idf weighting.


Introduction to Information Retrieval Sec. 6.3

3 documents example contd.


Log frequency weighting:
term        SaS    PaP    WH
affection   3.06   2.76   2.30
jealous     2.00   1.85   2.04
gossip      1.30   0      1.78
wuthering   0      0      2.58

After length normalization:
term        SaS     PaP     WH
affection   0.789   0.832   0.524
jealous     0.515   0.555   0.465
gossip      0.335   0       0.405
wuthering   0       0       0.588

cos(SaS,PaP) ≈ 0.789 × 0.832 + 0.515 × 0.555 + 0.335 × 0.0 + 0.0 × 0.0 ≈ 0.94
cos(SaS,WH) ≈ 0.79
cos(PaP,WH) ≈ 0.69
Why do we have cos(SaS,PaP) > cos(SaS,WH)?
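The numbers above can be reproduced with a short Python sketch (my code): log tf weighting, no idf, cosine normalization, as stated on the slide.

import math

counts = {"SaS": {"affection": 115, "jealous": 10, "gossip": 2, "wuthering": 0},
          "PaP": {"affection": 58,  "jealous": 7,  "gossip": 0, "wuthering": 0},
          "WH":  {"affection": 20,  "jealous": 11, "gossip": 6, "wuthering": 38}}

def log_tf_vector(tf_by_term):
    return {t: (1 + math.log10(tf)) if tf > 0 else 0.0 for t, tf in tf_by_term.items()}

def normalize(vec):
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()}

vecs = {name: normalize(log_tf_vector(c)) for name, c in counts.items()}

def cosine(v1, v2):
    return sum(v1[t] * v2[t] for t in v1)

print(round(cosine(vecs["SaS"], vecs["PaP"]), 2))  # 0.94
print(round(cosine(vecs["SaS"], vecs["WH"]), 2))   # 0.79
print(round(cosine(vecs["PaP"], vecs["WH"]), 2))   # 0.69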
Introduction to Information Retrieval

Introduction to
Information Retrieval
Calculating tf-idf cosine scores
in an IR system
Introduction to Information Retrieval Sec. 6.4

tf-idf weighting has many variants


Introduction to Information Retrieval Sec. 6.4

Weighting may differ in queries vs. documents
 Many search engines allow for different weightings
for queries vs. documents
 SMART Notation: denotes the combination in use in
an engine, with the notation ddd.qqq, using the
acronyms from the previous table
 A very standard weighting scheme is: lnc.ltc
 Document: logarithmic tf (l as first character), no idf
(a bad idea?), and cosine normalization
 Query: logarithmic tf (l in leftmost column), idf (t in
second column), cosine normalization …
Introduction to Information Retrieval Sec. 6.4

tf-idf example: lnc.ltc


Document: car insurance auto insurance
Query: best car insurance
           --------------- Query ---------------   -------- Document --------   Prod
Term       tf-raw  tf-wt  df      idf  wt   n’lize  tf-raw  tf-wt  wt    n’lize
auto       0       0      5000    2.3  0    0       1       1      1     0.52     0
best       1       1      50000   1.3  1.3  0.34    0       0      0     0        0
car        1       1      10000   2.0  2.0  0.52    1       1      1     0.52     0.27
insurance  1       1      1000    3.0  3.0  0.78    2       1.3    1.3   0.68     0.53

Exercise: what is N, the number of docs?
Doc length = √(1² + 0² + 1² + 1.3²) ≈ 1.92
Score = 0 + 0 + 0.27 + 0.53 = 0.8
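A Python sketch of the lnc.ltc computation for this example (my code; N is taken as 1,000,000, which is consistent with the idf column):

import math

N = 1_000_000  # consistent with the idf column above (e.g. log10(N/1000) = 3.0)
df = {"auto": 5000, "best": 50000, "car": 10000, "insurance": 1000}
query_tf = {"auto": 0, "best": 1, "car": 1, "insurance": 1}
doc_tf   = {"auto": 1, "best": 0, "car": 1, "insurance": 2}

def log_tf(tf):
    return 1 + math.log10(tf) if tf > 0 else 0.0

def normalize(vec):
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()}

# ltc for the query: log tf, times idf, then cosine-normalize
q = normalize({t: log_tf(tf) * math.log10(N / df[t]) for t, tf in query_tf.items()})
# lnc for the document: log tf, no idf, cosine-normalize
d = normalize({t: log_tf(tf) for t, tf in doc_tf.items()})

print(round(sum(q[t] * d[t] for t in q), 2))  # 0.8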
Introduction to Information Retrieval Sec. 6.3

Computing cosine scores
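The figure for this slide (the term-at-a-time cosine-scoring algorithm with accumulators) is not included in this transcript. Below is a rough Python sketch of the idea under some assumptions: postings already carry per-document term weights, and a document length is available for normalization; the postings and lengths shown are toy values.

import heapq
from collections import defaultdict

def cosine_scores(query_weights, postings, doc_length, k=10):
    """Term-at-a-time scoring with accumulators.
    query_weights: term -> weight of the term in the query
    postings: term -> list of (docID, weight of the term in that doc)
    doc_length: docID -> vector length used for normalization."""
    scores = defaultdict(float)
    for term, w_tq in query_weights.items():
        for doc_id, w_td in postings.get(term, []):
            scores[doc_id] += w_tq * w_td      # accumulate partial dot products
    for doc_id in scores:
        scores[doc_id] /= doc_length[doc_id]   # length-normalize
    return heapq.nlargest(k, scores.items(), key=lambda kv: kv[1])

postings = {"brutus": [(1, 2.2), (4, 0.7)], "caesar": [(1, 1.1), (2, 3.0)]}
doc_length = {1: 5.0, 2: 4.0, 4: 2.5}
print(cosine_scores({"brutus": 1.0, "caesar": 0.5}, postings, doc_length, k=3))
# doc 1 ranks first, then doc 2, then doc 4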


Introduction to Information Retrieval

Summary – vector space ranking


 Represent the query as a weighted tf-idf vector
 Represent each document as a weighted tf-idf vector
 Compute the cosine similarity score for the query
vector and each document vector
 Rank documents with respect to the query by score
 Return the top K (e.g., K = 10) to the user
Introduction to Information Retrieval

Introduction to
Information Retrieval
Evaluating search engines
Introduction to Information Retrieval Sec. 8.6

Measures for a search engine


 How fast does it index
 Number of documents/hour
 (Average document size)
 How fast does it search
 Latency as a function of index size
 Expressiveness of query language
 Ability to express complex information needs
 Speed on complex queries
 Uncluttered UI
 Is it free?
61
Introduction to Information Retrieval Sec. 8.6

Measures for a search engine


 All of the preceding criteria are measurable: we can
quantify speed/size
 we can make expressiveness precise
 The key measure: user happiness
 What is this?
 Speed of response/size of index are factors
 But blindingly fast, useless answers won’t make a user
happy
 Need a way of quantifying user happiness with the
results returned
 Relevance of results to user’s information need
62
Introduction to Information Retrieval Sec. 8.1

Evaluating an IR system
 An information need is translated into a query
 Relevance is assessed relative to the information
need not the query
 E.g., Information need: I’m looking for information on
whether drinking red wine is more effective at
reducing your risk of heart attacks than white wine.
 Query: wine red white heart attack effective
 You evaluate whether the doc addresses the
information need, not whether it has these words

63
Introduction to Information Retrieval Sec. 8.4

Evaluating ranked results


 Evaluation of a result set:
 If we have
 a benchmark document collection
 a benchmark set of queries
 assessor judgments of whether documents are relevant to queries
Then we can use Precision/Recall/F measure as before
 Evaluation of ranked results:
 The system can return any number of results
 By taking various numbers of the top returned documents
(levels of recall), the evaluator can produce a precision-
recall curve
64
Introduction to Information Retrieval

Recall/Precision
Top 10 returned docs, ranked top-down.
Assume 10 relevant docs in the collection.

Rank  Rel?  Recall  Precision
 1    R     0.1     1.0
 2    N     0.1     0.5
 3    N     0.1     0.33
 4    R     0.2     0.5
 5    R     0.3     0.6
 6    N
 7    R
 8    N
 9    N
 10   N
Introduction to Information Retrieval Sec. 8.4

Two current evaluation measures…


 Mean average precision (MAP)
 AP: Average of the precision value obtained for the top k
documents, each time a relevant doc is retrieved
 Avoids interpolation, use of fixed recall levels
 Does weight most accuracy of top returned results
 MAP for set of queries is arithmetic average of APs
 Macro-averaging: each query counts equally

Ranked results:  R  N  N  R  R  N  R  R  …
Precision at each relevant doc:  1.0  0.5  0.6  0.58  0.62  …
AP = avg(1.0, 0.5, 0.6, 0.58, 0.62, …)
MAP = avg(AP1, AP2, AP3, …)
66
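A Python sketch (my code) that reproduces the precision/recall values from the earlier ranked list and the average precision above:

def precision_recall_at_ranks(rels, total_relevant):
    rows, hits = [], 0
    for k, rel in enumerate(rels, start=1):
        hits += rel
        rows.append((k, hits / total_relevant, hits / k))  # (rank, recall, precision)
    return rows

def average_precision(rels):
    hits, precisions = 0, []
    for k, rel in enumerate(rels, start=1):
        if rel:
            hits += 1
            precisions.append(hits / k)   # precision at each relevant doc
    return sum(precisions) / len(precisions) if precisions else 0.0

# Ranked list from the Recall/Precision slide (1 = relevant), 10 relevant docs in total
rels = [1, 0, 0, 1, 1, 0, 1, 0, 0, 0]
print(precision_recall_at_ranks(rels, total_relevant=10)[:5])
# [(1, 0.1, 1.0), (2, 0.1, 0.5), (3, 0.1, 0.33...), (4, 0.2, 0.5), (5, 0.3, 0.6)]

# Ranked list from the MAP slide (truncated at the 8 docs shown): R N N R R N R R
print(round(average_precision([1, 0, 0, 1, 1, 0, 1, 1]), 2))  # ≈ 0.66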
