0% found this document useful (0 votes)

1 views

week6

Information Retrieval (IR) involves finding unstructured documents that meet a user's information needs, commonly associated with web search but applicable in various contexts like email and corporate databases. The document discusses key concepts such as the classic search model, precision and recall metrics, and the construction of inverted indexes for efficient query processing. It also covers advanced topics like phrase queries and positional indexes to enhance search accuracy and relevance.

Uploaded by

ceyu1213

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

1 views

week6

Uploaded by

ceyu1213

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 98

Introduction to Information Retrieval

Introduction to
Information Retrieval
Introducing Information Retrieval
and Web Search
Introduction to Information Retrieval

Information Retrieval
 Information Retrieval (IR) is finding material (usually
documents) of an unstructured nature (usually text)
that satisfies an information need from within large
collections (usually stored on computers).

 These days we frequently think first of web search, but

there are many other cases:
 E-mail search
 Searching your laptop
 Corporate knowledge bases
 Legal information retrieval

2
Introduction to Information Retrieval

Unstructured (text) vs. structured

(database) data in the mid-nineties

250

200

150
Unstructured
100 Structured

0
Data volume Market Cap

3
Introduction to Information Retrieval

Unstructured (text) vs. structured

(database) data today

250

200

150
Unstructured
100 Structured

0
Data volume Market Cap

4
Introduction to Information Retrieval Sec. 1.1

Basic assumptions of Information Retrieval

 Collection: A set of documents
 Assume it is a static collection for the moment

 Goal: Retrieve documents with information that is

relevant to the user’s information need and helps the
user complete a task

5
Introduction to Information Retrieval

The classic search model

User task Get rid of mice in a
politically correct way
Misconception?

Info need
Info about removing mice
without killing them
Misformulation?

Query Searc
how trap mice alive
h

Search
engine

Query Results
Collection
refinement
Introduction to Information Retrieval Sec. 1.1

How good are the retrieved docs?

 Precision : Fraction of retrieved docs that are
relevant to the user’s information need
 Recall : Fraction of relevant docs in collection that
are retrieved

 More precise definitions and measurements to follow later

7
Introduction to Information Retrieval

Introduction to
Information Retrieval
Term-document incidence matrices
Introduction to Information Retrieval Sec. 1.1

Unstructured data in 1620

 Which plays of Shakespeare contain the words Brutus
AND Caesar but NOT Calpurnia?
 One could grep all of Shakespeare’s plays for Brutus
and Caesar, then strip out lines containing Calpurnia?
 Why is that not the answer?
 Slow (for large corpora)
 NOT Calpurnia is non-trivial
 Other operations (e.g., find the word Romans near
countrymen) not feasible
 Ranked retrieval (best documents to return)
 Later lectures
10
Introduction to Information Retrieval Sec. 1.1

Term-document incidence matrices

Antony and Cleopatra Julius Caesar The Tempest Hamlet Othello Macbeth
Antony 1 1 0 0 0 1
Brutus 1 1 0 1 0 0
Caesar 1 1 0 1 1 1
Calpurnia 0 1 0 0 0 0
Cleopatra 1 0 0 0 0 0
mercy 1 0 1 1 1 1
worser 1 0 1 1 1 0

Brutus AND Caesar BUT NOT 1 if play contains

Calpurnia word, 0 otherwise
Introduction to Information Retrieval Sec. 1.1

Incidence vectors
 So we have a 0/1 vector for each term.
 To answer query: take the vectors for Brutus, Caesar
and Calpurnia (complemented)  bitwise AND.
 110100 AND
 110111 AND
 101111 = Antony
Antony and Cleopatra
1
Julius Caesar
1
The Tempest
0
Hamlet
0
Othello
0
Macbeth
1

 100100 Brutus
Caesar
1
1
1
1
0
0
1
1
0
1
0
1
Calpurnia 0 1 0 0 0 0
Cleopatra 1 0 0 0 0 0
mercy 1 0 1 1 1 1
worser 1 0 1 1 1 0

12
Introduction to Information Retrieval Sec. 1.1

Answers to query
 Antony and Cleopatra, Act III, Scene ii
Agrippa [Aside to DOMITIUS ENOBARBUS]: Why, Enobarbus,
When Antony found Julius Caesar dead,
He cried almost to roaring; and he wept
When at Philippi he found Brutus slain.

 Hamlet, Act III, Scene ii

Lord Polonius: I did enact Julius Caesar I was killed i’ the
Capitol; Brutus killed me.

13
Introduction to Information Retrieval Sec. 1.1

Bigger collections
 Consider N = 1 million documents, each with about
1000 words.
 Avg 6 bytes/word including spaces/punctuation
 6GB of data in the documents.
 Say there are M = 500K distinct terms among these.

14
Introduction to Information Retrieval Sec. 1.1

Can’t build the matrix

 500K x 1M matrix has half-a-trillion 0’s and 1’s.

 But it has no more than one billion 1’s. Why?

 matrix is extremely sparse.

 What’s a better representation?

 We only record the 1 positions.

15
Introduction to Information Retrieval

Introduction to
Information Retrieval
The Inverted Index
The key data structure underlying modern IR
Introduction to Information Retrieval Sec. 1.2

Inverted index
 For each term t, we must store a list of all documents
that contain t.
 Identify each doc by a docID, a document serial number
 Can we used fixed-size arrays for this?
Brutus 1 2 4 11 31 45 173 174
Caesar 1 2 4 5 6 16 57 132
Calpurnia 2 31 54 101

What happens if the word Caesar

is added to document 14?
18
Introduction to Information Retrieval Sec. 1.2

Inverted index
 We need variable-size postings lists
 On disk, a continuous run of postings is normal and best
 In memory, can use linked lists or variable length arrays
 Some tradeoffs in size/ease of insertion Posting

Brutus 1 2 4 11 31 45 173 174

Caesar 1 2 4 5 6 16 57 132
Calpurnia 2 31 54 101

Dictionary Postings
Sorted by docID (more later on why).
19
Introduction to Information Retrieval Sec. 1.2

Inverted index construction

Documents to Friends, Romans, countrymen.
be indexed

Tokenizer
Token stream Friends Romans Countrymen
Linguistic
modules
Modified tokens friend roman countryman

Indexer friend 2 4
roman 1 2
Inverted index
countryman 13 16
Introduction to Information Retrieval

Initial stages of text processing

 Tokenization
 Cut character sequence into word tokens
 Deal with “John’s”, a state-of-the-art solution
 Normalization
 Map text and query term to same form
 You want U.S.A. and USA to match
 Stemming
 We may wish different forms of a root to match
 authorize, authorization
 Stop words
 We may omit very common words (or not)
 the, a, to, of
Introduction to Information Retrieval Sec. 1.2

Indexer steps: Token sequence

 Sequence of (Modified token, Document ID) pairs.

Doc 1 Doc 2

I did enact Julius So let it be with

Caesar I was killed Caesar. The noble
i’ the Capitol; Brutus hath told you
Brutus killed me. Caesar was ambitious
Introduction to Information Retrieval Sec. 1.2

Indexer steps: Sort

 Sort by terms
 And then docID

Core indexing step

Introduction to Information Retrieval Sec. 1.2

Indexer steps: Dictionary & Postings

 Multiple term
entries in a single
document are
merged.
 Split into Dictionary
and Postings
 Doc. frequency
information is
added.

Why frequency?
Will discuss later.
Introduction to Information Retrieval Sec. 1.2

Where do we pay in storage?

Lists of
docIDs

Terms
and
counts IR system
implementation
• How do we
index efficiently?
• How much
storage do we
need?
Pointers 26
Introduction to Information Retrieval

Introduction to
Information Retrieval
Query processing with an inverted index
Introduction to Information Retrieval Sec. 1.3

The index we just built

 How do we process a query? Our focus
 Later - what kinds of queries can we process?

29
Introduction to Information Retrieval Sec. 1.3

Query processing: AND

 Consider processing the query:
Brutus AND Caesar
 Locate Brutus in the Dictionary;
 Retrieve its postings.
 Locate Caesar in the Dictionary;
 Retrieve its postings.
 “Merge” the two postings (intersect the document sets):

2 4 8 16 32 64 128 Brutus
1 2 3 5 8 13 21 34 Caesar

30
Introduction to Information Retrieval Sec. 1.3

The merge
 Walk through the two postings simultaneously, in
time linear in the total number of postings entries

2 4 8 16 32 64 128 Brutus
1 2 3 5 8 13 21 34 Caesar

If the list lengths are x and y, the merge takes O(x+y)

operations.
Crucial: postings sorted by docID.
31
Introduction to Information Retrieval Sec. 1.3

The merge
 Walk through the two postings simultaneously, in
time linear in the total number of postings entries

2 4 8 16 32 64 128 Brutus
2 8
1 2 3 5 8 13 21 34 Caesar

If the list lengths are x and y, the merge takes O(x+y)

operations.
Crucial: postings sorted by docID.
32
Introduction to Information Retrieval

Intersecting two postings lists

(a “merge” algorithm)

33
Introduction to Information Retrieval

Introduction to
Information Retrieval
Phrase queries and positional indexes
Introduction to Information Retrieval Sec. 2.4

Phrase queries
 We want to be able to answer queries such as
“stanford university” – as a phrase
 Thus the sentence “I went to university at Stanford”
is not a match.
 The concept of phrase queries has proven easily
understood by users; one of the few “advanced search”
ideas that works
 Many more queries are implicit phrase queries
 For this, it no longer suffices to store only
<term : docs> entries
Introduction to Information Retrieval Sec. 2.4.1

A first attempt: Biword indexes

 Index every consecutive pair of terms in the text as a
phrase
 For example the text “Friends, Romans,
Countrymen” would generate the biwords
 friends romans
 romans countrymen
 Each of these biwords is now a dictionary term
 Two-word phrase query-processing is now
immediate.
Introduction to Information Retrieval Sec. 2.4.1

Longer phrase queries

 Longer phrases can be processed by breaking them
down
 stanford university palo alto can be broken into the
Boolean query on biwords:
stanford university AND university palo AND palo alto

Without the docs, we cannot verify that the docs

matching the above Boolean query do contain the
phrase.
Can have false positives!
Introduction to Information Retrieval Sec. 2.4.1

Issues for biword indexes

 False positives, as noted before
 Index blowup due to bigger dictionary
 Infeasible for more than biwords, big even for them

 Biword indexes are not the standard solution (for all

biwords) but can be part of a compound strategy
Introduction to Information Retrieval Sec. 2.4.2

Solution 2: Positional indexes

 In the postings, store, for each term the position(s) in
which tokens of it appear:

<term, number of docs containing term;

doc1: position1, position2 … ;
doc2: position1, position2 … ;
etc.>
Introduction to Information Retrieval Sec. 2.4.2

Positional index example

<be: 993427;
1: 7, 18, 33, 72, 86, 231; Which of docs 1,2,4,5
2: 3, 149; could contain “to be
4: 17, 191, 291, 430, 434; or not to be”?
5: 363, 367, …>

 For phrase queries, we use a merge algorithm

recursively at the document level
 But we now need to deal with more than just
equality
Introduction to Information Retrieval Sec. 2.4.2

Processing a phrase query

 Extract inverted index entries for each distinct term:
to, be, or, not.
 Merge their doc:position lists to enumerate all
positions with “to be or not to be”.
 to:
 2:1,17,74,222,551; 4:8,16,190,429,433; 7:13,23,191; ...
 be:
 1:17,19; 4:17,191,291,430,434; 5:14,19,101; ...
 Same general method for proximity searches
Introduction to Information Retrieval Sec. 2.4.2

Proximity queries
 LIMIT! /3 STATUTE /3 FEDERAL /2 TORT
 Again, here, /k means “within k words of”.
 Clearly, positional indexes can be used for such
queries; biword indexes cannot.
 Exercise: Adapt the linear merge of postings to
handle proximity queries. Can you make it work for
any value of k?
 This is a little tricky to do correctly and efficiently
 See Figure 2.12 of IIR
Introduction to Information Retrieval Sec. 2.4.2

Positional index size

 A positional index expands postings storage
substantially
 Even though indices can be compressed
 Nevertheless, a positional index is now standardly
used because of the power and usefulness of phrase
and proximity queries … whether used explicitly or
implicitly in a ranking retrieval system.
Introduction to Information Retrieval Sec. 2.4.2

Positional index size

 Need an entry for each occurrence, not just once per
document
 Index size depends on average document size Why?
 Average web page has <1000 terms
 SEC filings, books, even some epic poems … easily 100,000
terms
 Consider a term with frequency 0.1%
Document size Postings Positional postings
1000 1 1
100,000 1 100
Introduction to Information Retrieval Sec. 2.4.2

Rules of thumb
 A positional index is 2–4 as large as a non-positional
index

 Positional index size 35–50% of volume of original

text

 Caveat: all of this holds for “English-like” languages

Introduction to Information Retrieval Sec. 2.4.3

Combination schemes
 These two approaches can be profitably combined
 For particular phrases (“Michael Jackson”, “Britney
Spears”) it is inefficient to keep on merging positional
postings lists
 Even more so for phrases like “The Who”
 Williams et al. (2004) evaluate a more sophisticated
mixed indexing scheme
 A typical web query mixture was executed in ¼ of the time
of using just a positional index
 It required 26% more space than having a positional index
alone
Introduction to Information Retrieval

Introduction to
Information Retrieval
Introducing ranked retrieval
Introduction to Information Retrieval Ch. 6

Ranked retrieval
 Thus far, our queries have all been Boolean.
 Documents either match or don’t.
 Good for expert users with precise understanding of
their needs and the collection.
 Also good for applications: Applications can easily
consume 1000s of results.
 Not good for the majority of users.
 Most users incapable of writing Boolean queries (or they
are, but they think it’s too much work).
 Most users don’t want to wade through 1000s of results.
 This is particularly true of web search.
Introduction to Information Retrieval Ch. 6

Problem with Boolean search:

feast or famine
 Boolean queries often result in either too few (≈0) or
too many (1000s) results.
 Query 1: “standard user dlink 650” → 200,000 hits
 Query 2: “standard user dlink 650 no card found” → 0 hits
 It takes a lot of skill to come up with a query that
produces a manageable number of hits.
 AND gives too few; OR gives too many
Introduction to Information Retrieval

Ranked retrieval models

 Rather than a set of documents satisfying a query
expression, in ranked retrieval models, the system
returns an ordering over the (top) documents in the
collection with respect to a query
 Free text queries: Rather than a query language of
operators and expressions, the user’s query is just
one or more words in a human language
 In principle, there are two separate choices here, but
in practice, ranked retrieval models have normally
been associated with free text queries and vice versa
4
Introduction to Information Retrieval Ch. 6

Feast or famine: not a problem in

ranked retrieval
 When a system produces a ranked result set, large
result sets are not an issue
 Indeed, the size of the result set is not an issue
 We just show the top k ( ≈ 10) results
 We don’t overwhelm the user

 Premise: the ranking algorithm works

Introduction to Information Retrieval Ch. 6

Scoring as the basis of ranked retrieval

 We wish to return in order the documents most likely
to be useful to the searcher
 How can we rank-order the documents in the
collection with respect to a query?
 Assign a score – say in [0, 1] – to each document
 This score measures how well document and query
“match”.
Introduction to Information Retrieval Ch. 6

Query-document matching scores

 We need a way of assigning a score to a
query/document pair
 Let’s start with a one-term query
 If the query term does not occur in the document:
score should be 0
 The more frequent the query term in the document,
the higher the score (should be)
 We will look at a number of alternatives for this
Introduction to Information Retrieval

Introduction to
Information Retrieval
Scoring with the Jaccard coefficient
Introduction to Information Retrieval Ch. 6

Take 1: Jaccard coefficient

 A commonly used measure of overlap of two sets A
and B is the Jaccard coefficient
 jaccard(A,B) = |A ∩ B| / |A ∪ B|
 jaccard(A,A) = 1
 jaccard(A,B) = 0 if A ∩ B = 0
 A and B don’t have to be the same size.
 Always assigns a number between 0 and 1.
Introduction to Information Retrieval Ch. 6

Jaccard coefficient: Scoring example

 What is the query-document match score that the
Jaccard coefficient computes for each of the two
documents below?
 Query: ides of march
 Document 1: caesar died in march
 Document 2: the long march
 Document 3: the long ides march
ଶ
ଵ
ଷ
Introduction to Information Retrieval Ch. 6

Issues with Jaccard for scoring

 It doesn’t consider term frequency (how many times
a term occurs in a document)
 Rare terms in a collection are more informative than
frequent terms
 Jaccard doesn’t consider this information
 We need a more sophisticated way of normalizing for
length
 Later in this lecture, we’ll use | A  B | / | A  B |
. . . instead of |A ∩ B|/|A ∪ B| (Jaccard) for length
normalization.
Introduction to Information Retrieval

Introduction to
Information Retrieval
Term frequency weighting
Introduction to Information Retrieval Sec. 6.2

Recall: Binary term-document

incidence matrix
Antony and Cleopatra Julius Caesar The Tempest Hamlet Othello Macbeth

Antony 1 1 0 0 0 1
Brutus 1 1 0 1 0 0
Caesar 1 1 0 1 1 1
Calpurnia 0 1 0 0 0 0
Cleopatra 1 0 0 0 0 0
mercy 1 0 1 1 1 1
worser 1 0 1 1 1 0

Each document is represented by a binary vector ∈ {0,1}|V|

Introduction to Information Retrieval Sec. 6.2

Term-document count matrices

 Consider the number of occurrences of a term in a
document:
 Each document is a count vector in ℕ|V|: a column below

Antony and Cleopatra Julius Caesar The Tempest Hamlet Othello Macbeth

Antony 157 73 0 0 0 0
Brutus 4 157 0 1 0 0
Caesar 232 227 0 2 1 1
Calpurnia 0 10 0 0 0 0
Cleopatra 57 0 0 0 0 0
mercy 2 0 3 5 5 1
worser 2 0 1 1 1 0
Introduction to Information Retrieval

Bag of words model

 Vector representation doesn’t consider the ordering
of words in a document
 John is quicker than Mary and Mary is quicker than
John have the same vectors

 This is called the bag of words model.

 In a sense, this is a step back: The positional index
was able to distinguish these two documents
 We will look at “recovering” positional information later on
 For now: bag of words model
Introduction to Information Retrieval

Term frequency tf
 The term frequency tft,d of term t in document d is
defined as the number of times that t occurs in d.
 We want to use tf when computing query-document
match scores. But how?
 Raw term frequency is not what we want:
 A document with 10 occurrences of the term is more
relevant than a document with 1 occurrence of the term.
 But not 10 times more relevant.
 Relevance does not increase proportionally with
term frequency.
NB: frequency = count in IR
Introduction to Information Retrieval Sec. 6.2

Log-frequency weighting
 The log frequency weight of term t in d is
 1 + log 10 tf t,d , if tf t,d > 0
w t,d = 
 0, otherwise

 Score for a document-query pair: sum over terms t in

both q and d:

 score = t∈q∩d (1 + log tf t ,d )

 The score is 0 if none of the query terms is present in

the document.
Introduction to Information Retrieval

Introduction to
Information Retrieval
(Inverse) Document frequency weighting
Introduction to Information Retrieval Sec. 6.2.1

Document frequency
 Rare terms are more informative than frequent terms
 Recall stop words
 Consider a term in the query that is rare in the
collection (e.g., arachnocentric)
 A document containing this term is very likely to be
relevant to the query arachnocentric
 → We want a high weight for rare terms like
arachnocentric.
Introduction to Information Retrieval Sec. 6.2.1

Document frequency, continued

 Frequent terms are less informative than rare terms
 Consider a query term that is frequent in the
collection (e.g., high, increase, line)
 A document containing such a term is more likely to
be relevant than a document that doesn’t
 But it’s not a sure indicator of relevance.
 → For frequent terms, we want posi ve weights for
words like high, increase, and line
 But lower weights than for rare terms.
 We will use document frequency (df) to capture this.
Introduction to Information Retrieval Sec. 6.2.1

idf weight
 dft is the document frequency of t: the number of
documents that contain t
 dft is an inverse measure of the informativeness of t
 dft ≤ N
 We define the idf (inverse document frequency) of t
by
idf t = log10 ( N/df t )
 We use log (N/dft) instead of N/dft to “dampen” the effect
of idf.

Will turn out the base of the log is immaterial.

Introduction to Information Retrieval Sec. 6.2.1

idf example, suppose N = 1 million

term dft idft
calpurnia 1
animal 100
sunday 1,000
fly 10,000
under 100,000
the 1,000,000

idf t = log10 ( N/df t )

There is one idf value for each term t in a collection.
Introduction to Information Retrieval

Effect of idf on ranking

 Question: Does idf have an effect on ranking for one-
term queries, like
 iPhone

28
Introduction to Information Retrieval

Effect of idf on ranking

 Question: Does idf have an effect on ranking for one-
term queries, like
 iPhone
 idf has no effect on ranking one term queries
 idf affects the ranking of documents for queries with at
least two terms
 For the query capricious person, idf weighting makes
occurrences of capricious count for much more in the final
document ranking than occurrences of person.

29
Introduction to Information Retrieval Sec. 6.2.1

Collection vs. Document frequency

 The collection frequency of t is the number of
occurrences of t in the collection, counting
multiple occurrences.
 Example:
Word Collection frequency Document frequency
insurance 10440 3997
try 10422 8760

 Which word is a better search term (and should

get a higher weight)?
Introduction to Information Retrieval

Introduction to
Information Retrieval
tf-idf weighting
Introduction to Information Retrieval Sec. 6.2.2

tf-idf weighting
 The tf-idf weight of a term is the product of its tf
weight and its idf weight.
w t ,d = (1 + log tf t ,d ) × log10 ( N / df t )
 Best known weighting scheme in information retrieval
 Note: the “-” in tf-idf is a hyphen, not a minus sign!
 Alternative names: tf.idf, tf x idf
 Increases with the number of occurrences within a
document
 Increases with the rarity of the term in the collection
Introduction to Information Retrieval Sec. 6.2.2

Final ranking of documents for a query

Score(q,d) =  tf.idft,d
t ∈q∩d

34
Introduction to Information Retrieval Sec. 6.3

Binary → count → weight matrix

Antony and Cleopatra Julius Caesar The Tempest Hamlet Othello Macbeth

Antony 5.25 3.18 0 0 0 0.35

Brutus 1.21 6.1 0 1 0 0
Caesar 8.59 2.54 0 1.51 0.25 0
Calpurnia 0 1.54 0 0 0 0
Cleopatra 2.85 0 0 0 0 0
mercy 1.51 0 1.9 0.12 5.25 0.88
worser 1.37 0 0.11 4.15 0.25 1.95

Each document is now represented by a real-valued

vector of tf-idf weights ∈ R|V|
Introduction to Information Retrieval

Introduction to
Information Retrieval
The Vector Space Model (VSM)
Introduction to Information Retrieval Sec. 6.3

Documents as vectors
 Now we have a |V|-dimensional vector space
 Terms are axes of the space
 Documents are points or vectors in this space
 Very high-dimensional: tens of millions of
dimensions when you apply this to a web search
engine
 These are very sparse vectors – most entries are zero
Introduction to Information Retrieval Sec. 6.3

Queries as vectors
 Key idea 1: Do the same for queries: represent them
as vectors in the space
 Key idea 2: Rank documents according to their
proximity to the query in this space
 proximity = similarity of vectors
 proximity ≈ inverse of distance
 Recall: We do this because we want to get away from
the you’re-either-in-or-out Boolean model
 Instead: rank more relevant documents higher than
less relevant documents
Introduction to Information Retrieval Sec. 6.3

Formalizing vector space proximity

 First cut: distance between two points
 ( = distance between the end points of the two vectors)
 Euclidean distance?
 Euclidean distance is a bad idea . . .
 . . . because Euclidean distance is large for vectors of
different lengths.
Introduction to Information Retrieval Sec. 6.3

Why distance is a bad idea

The Euclidean
distance between q
and d2 is large even
though the
distribution of terms
in the query q and the
distribution of
terms in the
document d2 are
very similar.
Introduction to Information Retrieval Sec. 6.3

Use angle instead of distance

 Thought experiment: take a document d and append
it to itself. Call this document d′.
 “Semantically” d and d′ have the same content
 The Euclidean distance between the two documents
can be quite large
 The angle between the two documents is 0,
corresponding to maximal similarity.

 Key idea: Rank documents according to angle with

query.
Introduction to Information Retrieval Sec. 6.3

From angles to cosines

 The following two notions are equivalent.
 Rank documents in decreasing order of the angle between
query and document
 Rank documents in increasing order of
cosine(query,document)

 Cosine is a monotonically decreasing function for the

interval [0o, 180o]
Introduction to Information Retrieval Sec. 6.3

From angles to cosines

 But how – and why – should we be computing cosines?

Introduction to Information Retrieval Sec. 6.3

Length normalization
 A vector can be (length-) normalized by dividing each
of its components by its length – for this we use the
L2 norm: 
x 2 = i xi2

 Dividing a vector by its L2 norm makes it a unit

(length) vector (on surface of unit hypersphere)
 Effect on the two documents d and d′ (d appended
to itself) from earlier slide: they have identical
vectors after length-normalization.
 Long and short documents now have comparable weights
Introduction to Information Retrieval Sec. 6.3

cosine(query,document)
Dot product Unit vectors
  


V
  q•d q d q di
i =1 i
cos( q , d ) =   =  •  =
q d
qd
 i=1 i
V 2 V
2
q
i =1 i
d

qi is the tf-idf weight of term i in the query

di is the tf-idf weight of term i in the document

cos(q,d) is the cosine similarity of q and d … or,

equivalently, the cosine of the angle between q and d.
Introduction to Information Retrieval

Cosine for length-normalized vectors

 For length-normalized vectors, cosine similarity is
simply the dot product (or scalar product):

   
cos(q, d ) = q • d =  qi di
V

i=1

for q, d length-normalized.

47
Introduction to Information Retrieval

Cosine similarity illustrated

48
Introduction to Information Retrieval Sec. 6.3

Cosine similarity amongst 3 documents

How similar are
the novels term SaS PaP WH

SaS: Sense and affection 115 58 20

Sensibility jealous 10 7 11

gossip 2 0 6
PaP: Pride and
wuthering 0 0 38
Prejudice, and
WH: Wuthering Term frequencies (counts)
Heights?

Note: To simplify this example, we don’t do idf weighting.

Introduction to Information Retrieval Sec. 6.3

3 documents example contd.

Log frequency weighting After length normalization

term SaS PaP WH term SaS PaP WH

affection 3.06 2.76 2.30 affection 0.789 0.832 0.524
jealous 2.00 1.85 2.04 jealous 0.515 0.555 0.465
gossip 1.30 0 1.78 gossip 0.335 0 0.405
wuthering 0 0 2.58 wuthering 0 0 0.588

cos(SaS,PaP) ≈
0.789 × 0.832 + 0.515 × 0.555 + 0.335 × 0.0 + 0.0 × 0.0 ≈
0.94
cos(SaS,WH) ≈ 0.79
cos(PaP,WH) ≈ 0.69
Why do we have cos(SaS,PaP) > cos(SAS,WH)?
Introduction to Information Retrieval

Introduction to
Information Retrieval
Calculating tf-idf cosine scores
in an IR system
Introduction to Information Retrieval Sec. 6.4

tf-idf weighting has many variants

Introduction to Information Retrieval Sec. 6.4

Weighting may differ in queries vs

documents
 Many search engines allow for different weightings
for queries vs. documents
 SMART Notation: denotes the combination in use in
an engine, with the notation ddd.qqq, using the
acronyms from the previous table
 A very standard weighting scheme is: lnc.ltc
 Document: logarithmic tf (l as first character), no idf
and cosine normalization
A bad idea?
 Query: logarithmic tf (l in leftmost column), idf (t in
second column), cosine normalization …
Introduction to Information Retrieval Sec. 6.4

tf-idf example: lnc.ltc

Document: car insurance auto insurance
Query: best car insurance
Term Query Document Pro
d
tf- tf-wt df idf wt n’liz tf-raw tf-wt wt n’liz
raw e e
auto 0 0 5000 2.3 0 0 1 1 1 0.52 0
best 1 1 50000 1.3 1.3 0.34 0 0 0 0 0
car 1 1 10000 2.0 2.0 0.52 1 1 1 0.52 0.27
insurance 1 1 1000 3.0 3.0 0.78 2 1.3 1.3 0.68 0.53
Exercise: what is N, the number of docs?
Doc length = 12 + 0 2 + 12 + 1.32 ≈ 1.92
Score = 0+0+0.27+0.53 = 0.8
Introduction to Information Retrieval Sec. 6.3

Computing cosine scores

Introduction to Information Retrieval

Summary – vector space ranking

 Represent the query as a weighted tf-idf vector
 Represent each document as a weighted tf-idf vector
 Compute the cosine similarity score for the query
vector and each document vector
 Rank documents with respect to the query by score
 Return the top K (e.g., K = 10) to the user
Introduction to Information Retrieval

Introduction to
Information Retrieval
Evaluating search engines
Introduction to Information Retrieval Sec. 8.6

Measures for a search engine

 How fast does it index
 Number of documents/hour
 (Average document size)
 How fast does it search
 Latency as a function of index size
 Expressiveness of query language
 Ability to express complex information needs
 Speed on complex queries
 Uncluttered UI
 Is it free?
61
Introduction to Information Retrieval Sec. 8.6

Measures for a search engine

 All of the preceding criteria are measurable: we can
quantify speed/size
 we can make expressiveness precise
 The key measure: user happiness
 What is this?
 Speed of response/size of index are factors
 But blindingly fast, useless answers won’t make a user
happy
 Need a way of quantifying user happiness with the
results returned
 Relevance of results to user’s information need
62
Introduction to Information Retrieval Sec. 8.1

Evaluating an IR system
 An information need is translated into a query
 Relevance is assessed relative to the information
need not the query
 E.g., Information need: I’m looking for information on
whether drinking red wine is more effective at
reducing your risk of heart attacks than white wine.
 Query: wine red white heart attack effective
 You evaluate whether the doc addresses the
information need, not whether it has these words

63
Introduction to Information Retrieval Sec. 8.4

Evaluating ranked results

 Evaluation of a result set:
 If we have
 a benchmark document collection
 a benchmark set of queries
 assessor judgments of whether documents are relevant to queries
Then we can use Precision/Recall/F measure as before
 Evaluation of ranked results:
 The system can return any number of results
 By taking various numbers of the top returned documents
(levels of recall), the evaluator can produce a precision-
recall curve
64
Introduction to Information Retrieval

Recall/Precision
R P
 1 R 0.1 1.0
 2 N 0.1 0.5 Top 10 returned
 3 N 0.1 0.33 docs ranked top-down
 4 R 0.2 0.5
 5 R 0.3 0.6
 6 N Assume 10 rel docs
…
 7 R in collection
 8 N
 9 N
 10 N
Introduction to Information Retrieval Sec. 8.4

Two current evaluation measures…

 Mean average precision (MAP)
 AP: Average of the precision value obtained for the top k
documents, each time a relevant doc is retrieved
 Avoids interpolation, use of fixed recall levels
 Does weight most accuracy of top returned results
 MAP for set of queries is arithmetic average of APs
 Macro-averaging: each query counts equally

R N N R R N R R …
P: 1.0 0.5 0.6 0.58 0.62
AP=avg(1.0,0.5,0.6,0.58,0.62,…)
MAP=avg(AP1, AP2, AP3, …) 66

Clare - Neuropsychological Rehabilitation and People With Dementia - 2008
No ratings yet
Clare - Neuropsychological Rehabilitation and People With Dementia - 2008
192 pages
Status of Watershed Management in Brgy. Dolos, Bulan, Sorsogon As Perceived by LGU Officials, BWD Employees and Its Residents For C.Y 2017-2018.
No ratings yet
Status of Watershed Management in Brgy. Dolos, Bulan, Sorsogon As Perceived by LGU Officials, BWD Employees and Its Residents For C.Y 2017-2018.
17 pages
Introduction To: Information Retrieval
No ratings yet
Introduction To: Information Retrieval
46 pages
Introduction To: Information Retrieval
No ratings yet
Introduction To: Information Retrieval
54 pages
Lecture2 Intro Boolean 6per
No ratings yet
Lecture2 Intro Boolean 6per
9 pages
Introduction To: Information Retrieval
No ratings yet
Introduction To: Information Retrieval
51 pages
Ir 1
No ratings yet
Ir 1
59 pages
Information Retrival Systems
No ratings yet
Information Retrival Systems
50 pages
Introduction To: Information Retrieval
No ratings yet
Introduction To: Information Retrieval
33 pages
Introduction To: Information Retrieval
No ratings yet
Introduction To: Information Retrieval
33 pages
Introduction To: Information Retrieval
No ratings yet
Introduction To: Information Retrieval
48 pages
Boolean Retrieval
No ratings yet
Boolean Retrieval
34 pages
Introduction To: Information Retrieval
No ratings yet
Introduction To: Information Retrieval
48 pages
IR Summary Lec 1 - Introduction
No ratings yet
IR Summary Lec 1 - Introduction
54 pages
C1 Intro
No ratings yet
C1 Intro
10 pages
Introduction To: Information Retrieval
No ratings yet
Introduction To: Information Retrieval
42 pages
Module 4-Boolean Retrieval Models
No ratings yet
Module 4-Boolean Retrieval Models
52 pages
Boolean Retrieval PPT Updated
No ratings yet
Boolean Retrieval PPT Updated
30 pages
Lecture1 Intro Handout 1 Per
No ratings yet
Lecture1 Intro Handout 1 Per
57 pages
Introduction To: Information Retrieval
No ratings yet
Introduction To: Information Retrieval
57 pages
IR-Lec1 - Ch1-2023
No ratings yet
IR-Lec1 - Ch1-2023
41 pages
lecture1-intro
No ratings yet
lecture1-intro
60 pages
Introduction To: Information Retrieval
No ratings yet
Introduction To: Information Retrieval
30 pages
Lec 1 IR
No ratings yet
Lec 1 IR
42 pages
Introduction To: Information Retrieval
No ratings yet
Introduction To: Information Retrieval
38 pages
Lecture01 Intro
No ratings yet
Lecture01 Intro
45 pages
Lecture1-Intro - Realted To Ch1
No ratings yet
Lecture1-Intro - Realted To Ch1
60 pages
2
No ratings yet
2
50 pages
Lecture1 Introduction
No ratings yet
Lecture1 Introduction
67 pages
Chapter4 Indexconstruction
No ratings yet
Chapter4 Indexconstruction
49 pages
Lecture 2-Boolean Retrieval
No ratings yet
Lecture 2-Boolean Retrieval
29 pages
Intro To IRE
No ratings yet
Intro To IRE
48 pages
Introduction To: Information Retrieval
No ratings yet
Introduction To: Information Retrieval
69 pages
Lecture4-Indexconstruction Ch2 and Ch4
No ratings yet
Lecture4-Indexconstruction Ch2 and Ch4
49 pages
CSCI 7000 Modern Information Retrieval: Lecture 1: Introduction
No ratings yet
CSCI 7000 Modern Information Retrieval: Lecture 1: Introduction
16 pages
On Information Retrival
No ratings yet
On Information Retrival
23 pages
Introduction To: Information Retrieval
No ratings yet
Introduction To: Information Retrieval
31 pages
Information Retrieval
No ratings yet
Information Retrieval
57 pages
03 Dictionaries
No ratings yet
03 Dictionaries
112 pages
03 Dictionaries
No ratings yet
03 Dictionaries
112 pages
Unit 1 Intro to IR
No ratings yet
Unit 1 Intro to IR
32 pages
Lecture 2 - Boolean Retrieval
No ratings yet
Lecture 2 - Boolean Retrieval
49 pages
lecture1-intro-boolean
No ratings yet
lecture1-intro-boolean
42 pages
Introduction To: Information Retrieval
No ratings yet
Introduction To: Information Retrieval
115 pages
Module 4-Boolean Retrieval Models-Edit Distance, Spelling Correction
No ratings yet
Module 4-Boolean Retrieval Models-Edit Distance, Spelling Correction
124 pages
Lecture 4-Indexconstruction
No ratings yet
Lecture 4-Indexconstruction
45 pages
Term Vocabulary and Postings List
No ratings yet
Term Vocabulary and Postings List
64 pages
C3 IndexConstruction
No ratings yet
C3 IndexConstruction
46 pages
IR Lecture 1b
No ratings yet
IR Lecture 1b
54 pages
Unit 1
No ratings yet
Unit 1
181 pages
Document Indexing in Information Retrieval:
No ratings yet
Document Indexing in Information Retrieval:
19 pages
Index Construction
No ratings yet
Index Construction
48 pages
Lecture 1: Introduction and Overview: Information Retrieval Computer Science Tripos Part II
No ratings yet
Lecture 1: Introduction and Overview: Information Retrieval Computer Science Tripos Part II
38 pages
Introduction To: Information Retrieval
No ratings yet
Introduction To: Information Retrieval
50 pages
Introduction To: Information Retrieval
No ratings yet
Introduction To: Information Retrieval
49 pages
Unit I
No ratings yet
Unit I
83 pages
Lecture 5-Dictionaries and Tolerant Retrieval
No ratings yet
Lecture 5-Dictionaries and Tolerant Retrieval
48 pages
chapter2-MA212-Indexing+&+Preprocessing
No ratings yet
chapter2-MA212-Indexing+&+Preprocessing
68 pages
Lecture2 Dictionary
No ratings yet
Lecture2 Dictionary
62 pages
600 Computer Mcqs
No ratings yet
600 Computer Mcqs
23 pages
IR Lec04 Skip Ptrs Phrase Queries Indexing
No ratings yet
IR Lec04 Skip Ptrs Phrase Queries Indexing
18 pages
Machine Reasoning: Fundamentals and Applications
From Everand
Machine Reasoning: Fundamentals and Applications
Fouad Sabry
No ratings yet
Production Analysis
No ratings yet
Production Analysis
11 pages
SM Notes UNIT-3
No ratings yet
SM Notes UNIT-3
7 pages
Flight Dynamics of Aeroelastic Vehicles
No ratings yet
Flight Dynamics of Aeroelastic Vehicles
13 pages
Panasonic_Quickref_Displays_January_2024
No ratings yet
Panasonic_Quickref_Displays_January_2024
2 pages
36 - Z. Pavlov - Single-Domed Mosques in Macedonia
No ratings yet
36 - Z. Pavlov - Single-Domed Mosques in Macedonia
10 pages
Research Instrument - Media Literacy
No ratings yet
Research Instrument - Media Literacy
2 pages
District Collector Project
No ratings yet
District Collector Project
15 pages
(N.R VI : Tanfac
No ratings yet
(N.R VI : Tanfac
4 pages
Gds Sabre
No ratings yet
Gds Sabre
1 page
Jkijkljlkjlk
No ratings yet
Jkijkljlkjlk
4 pages
Ri PKM
No ratings yet
Ri PKM
29 pages
Vega - USFIV Bible
No ratings yet
Vega - USFIV Bible
61 pages
V Shale
No ratings yet
V Shale
14 pages
Community Health Paper Intro Page
No ratings yet
Community Health Paper Intro Page
2 pages
Notes On The Development of The Urban Heritage Management Concept in Contemporary Policies
No ratings yet
Notes On The Development of The Urban Heritage Management Concept in Contemporary Policies
9 pages
AI-QP-1
No ratings yet
AI-QP-1
7 pages
Malvern Property News 19/08/2011
No ratings yet
Malvern Property News 19/08/2011
22 pages
Hydrosphere PDF
No ratings yet
Hydrosphere PDF
44 pages
Austrian Briefing Aspern Essling PDF
100% (1)
Austrian Briefing Aspern Essling PDF
16 pages
Custom Apps List
No ratings yet
Custom Apps List
2 pages
SOP 2023 - Artigo
No ratings yet
SOP 2023 - Artigo
33 pages
Leakage Answers
No ratings yet
Leakage Answers
56 pages
Preposition File
No ratings yet
Preposition File
8 pages
Structural Geology
No ratings yet
Structural Geology
203 pages
ENGL 4000 Syllabus Spring 2019
No ratings yet
ENGL 4000 Syllabus Spring 2019
7 pages
HashedIn-by-Deloitte-Off-Campus
No ratings yet
HashedIn-by-Deloitte-Off-Campus
3 pages
Final BMS 840 Module
No ratings yet
Final BMS 840 Module
62 pages
5.21 Effect in General of Sequestration or Liquidation On Contracts Subsisting at Commencement of Co
No ratings yet
5.21 Effect in General of Sequestration or Liquidation On Contracts Subsisting at Commencement of Co
24 pages