IR Lecture 1b
Information Retrieval
Traditionally it has been accepted that an
information retrieval system does not return
the actual information itself, but the documents
containing that information, drawn from a large corpus.
‘An information retrieval system does not
inform (i.e. change the knowledge of) the user
on the subject of her inquiry. It merely informs
on the existence (or non-existence) and
whereabouts of documents relating to her
request.’
Information Retrieval Process
IR vs. databases: structured vs. unstructured data
Structured data tends to refer to information in "tables":

    Employee   Manager   Salary
    Smith      Jones     50000
    Chang      Smith     60000
    Ivy        Smith     50000

Structured data typically allows numerical-range and
exact-match (for text) queries, e.g.,
Salary < 60000 AND Manager = Smith (sketched below).
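As a rough illustration, that query can be sketched in Python over an in-memory version of the table; this is a stand-in for a real DBMS and SQL, not how databases are implemented:

    # Hypothetical in-memory version of the table above; a real system
    # would use SQL: SELECT ... WHERE Salary < 60000 AND Manager = 'Smith'
    employees = [
        {"Employee": "Smith", "Manager": "Jones", "Salary": 50000},
        {"Employee": "Chang", "Manager": "Smith", "Salary": 60000},
        {"Employee": "Ivy",   "Manager": "Smith", "Salary": 50000},
    ]

    # Numerical-range + exact-match query: Salary < 60000 AND Manager = Smith
    hits = [e for e in employees
            if e["Salary"] < 60000 and e["Manager"] == "Smith"]
    print(hits)  # [{'Employee': 'Ivy', 'Manager': 'Smith', 'Salary': 50000}]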
Unstructured data
Typically refers to free text. Allows:
• Keyword queries, including operators
• More sophisticated "concept" queries, e.g.:
  • find all web pages dealing with drug abuse
This is the classic model for searching text documents.
Semi-structured data
In fact, almost no data is truly "unstructured".
E.g., this slide has distinctly identified zones
such as the Title and Bullets.
This facilitates "semi-structured" search such as:
• Title contains data AND Bullets contain search
Unstructured (text) vs. structured (database) data in 2006
[Bar chart comparing unstructured and structured data on two measures: data volume and market capitalization.]
IR: An Example
Which plays of Shakespeare contain the words
Brutus AND Caesar but NOT Calpurnia?
The simplest approach is to grep all of
Shakespeare's plays for Brutus and Caesar,
then strip out lines containing Calpurnia. But:
• Slow (for large corpora)
• NOT Calpurnia is non-trivial
• Other operations (e.g., find the word Romans near
countrymen) are not feasible
• No ranked retrieval (choosing the best documents to return)
How to avoid linear scanning?
Index the documents in advance
Indexing
The process of transforming document
text into some representation of it is known
as indexing.
Different index structures might be used.
One data structure commonly used by IR
systems is the inverted index.
Information Retrieval Model
An IR model is a pattern that defines
several aspects of the retrieval procedure,
for example:
• how documents and users' queries are represented,
• how the system retrieves relevant documents
according to users' queries, and
• how retrieved documents are ranked.
IR Model
An IR model consists of:
- a model for documents,
- a model for queries,
- a matching function which compares
queries to documents, and
- a ranking function.
Classification of IR Models
IR models can be classified as:
• Classical models of IR
• Non-classical models of IR
• Alternative models of IR
Classical IR Model
The classical models of IR are the Boolean,
vector space, and probabilistic models.
Non-classical models of IR
Non-classical information retrieval models
are based on principles other than the
similarity, probability, and Boolean operations
on which classical retrieval models are based.
Examples: the information logic model, the situation
theory model, and the interaction model.
Alternative IR models
Alternative models are enhancements of
classical models, making use of specific
techniques from other fields.
Examples: the cluster model, the fuzzy model, and
latent semantic indexing (LSI) models.
Information Retrieval Model
The actual text of the document and query is
not used in the retrieval process; instead,
some representation of each is used.
The document representation is matched with the
query representation to perform retrieval.
One frequently used method is to represent a
document as a set of index terms or keywords,
as in the sketch below.
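A minimal Python sketch of this idea (the document text and matching rule are illustrative choices, not part of the slides): both document and query are reduced to sets of index terms, and retrieval operates on the representations, not the raw text:

    # Toy illustration: represent a document and a query as sets of index terms.
    doc_text = "The noble Brutus hath told you Caesar was ambitious"
    doc_terms = set(doc_text.lower().split())   # document representation

    query_terms = {"brutus", "caesar"}          # query representation

    # Here a document matches if it contains every query term.
    print(query_terms <= doc_terms)             # True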
Boolean model
The Boolean model views each document as a set of
terms and each query as a Boolean expression over
terms; a document is retrieved exactly when it
satisfies the query expression.
Basics of the Boolean IR model
Which plays of Shakespeare contain the
words Brutus AND Caesar but NOT
Calpurnia?
Document collection: a collection of
Shakespeare's works.
Binary term-document matrix

               Antony and  Julius  The      Hamlet  Othello  Macbeth
               Cleopatra   Caesar  Tempest
    Antony         1          1       0        0       0        1
    Brutus         1          1       0        1       0        0
    Caesar         1          1       0        1       1        1
    Calpurnia      0          1       0        0       0        0
    Cleopatra      1          0       0        0       0        0
    mercy          1          0       1        1       1        1
    worser         1          0       1        1       1        0

    (1 if the play contains the word, 0 otherwise)
So we have a 0/1 vector for each term.
To answer the query, take the vectors for
Brutus, Caesar, and Calpurnia
(complemented) and do a bitwise AND:
110100 AND 110111 AND 101111 = 100100.
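A short Python sketch of this computation; the encoding (each row as a 6-bit integer, leftmost bit = Antony and Cleopatra) is an illustrative choice:

    # Rows of the term-document matrix above as 6-bit vectors.
    plays = ["Antony and Cleopatra", "Julius Caesar", "The Tempest",
             "Hamlet", "Othello", "Macbeth"]
    brutus    = 0b110100
    caesar    = 0b110111
    calpurnia = 0b010000

    # Brutus AND Caesar AND NOT Calpurnia:
    # complement Calpurnia's vector, then bitwise-AND the three.
    result = brutus & caesar & (~calpurnia & 0b111111)
    print(f"{result:06b}")   # 100100

    # Read the answer off the 1 bits.
    print([p for i, p in enumerate(plays) if result & (1 << (5 - i))])
    # ['Antony and Cleopatra', 'Hamlet']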
Answers to query
Antony and Cleopatra, Act III, Scene ii
…
Bigger corpora
Consider N = 1M documents, each with
about 1K terms.
At an average of 6 bytes/term (incl. spaces/punctuation):
• 6 GB of data in the documents (1M × 1K × 6 bytes).
Say there are m = 500K distinct terms
among these.
Can’t build the matrix
A 500K x 1M term-document matrix has half a
trillion (500,000 × 1,000,000) 0’s and 1’s.
But it has no more than one billion 1’s, since each
of the 1M documents contains only ~1K terms:
• the matrix is extremely sparse.
What’s a better representation?
• We only record the 1 positions.
Inverted index
For each term T, we must store a list of
all documents that contain T:

    Brutus    → 2 4 8 16 32 64 128
    Calpurnia → 1 2 3 5 8 13 21 34
    Caesar    → 13 16

The indexing pipeline (more on these steps later):

    Token stream (from the tokenizer):          Friends Romans Countrymen
    Modified tokens (from linguistic modules):  friend roman countryman
    Inverted index (from the indexer):
        friend     → 2 4
        roman      → 1 2
        countryman → 13 16
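A toy Python sketch of the first two pipeline stages; the suffix rules here are naive stand-ins for real linguistic modules (stemming/lemmatization), chosen only so the slide's example comes out right:

    # Toy pipeline: tokenizer -> linguistic modules -> modified tokens.
    def tokenize(text):
        return text.split()              # token stream

    def normalize(token):                # naive stand-in for linguistic modules
        t = token.lower()
        if t.endswith("men"):
            return t[:-3] + "man"        # countrymen -> countryman
        if t.endswith("s"):
            return t[:-1]                # friends -> friend, romans -> roman
        return t

    print([normalize(t) for t in tokenize("Friends Romans Countrymen")])
    # ['friend', 'roman', 'countryman']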
Indexer steps
Sequence of (modified token, docID) pairs, one per token occurrence:

    Doc 1: (I,1) (did,1) (enact,1) (julius,1) (caesar,1) (I,1) (was,1)
           (killed,1) (i',1) (the,1) (capitol,1) (brutus,1) (killed,1) (me,1)
    Doc 2: (so,2) (let,2) (it,2) (be,2) (with,2) (caesar,2) (the,2) (noble,2)
           (brutus,2) (hath,2) (told,2) (you,2) (caesar,2) (was,2) (ambitious,2)

Sort by term (the core indexing step):

    (ambitious,2) (be,2) (brutus,1) (brutus,2) (capitol,1) (caesar,1)
    (caesar,2) (caesar,2) (did,1) (enact,1) (hath,2) (I,1) (I,1) (i',1)
    (it,2) (julius,1) (killed,1) (killed,1) (let,2) (me,1) (noble,2) (so,2)
    (the,1) (the,2) (told,2) (you,2) (was,1) (was,2) (with,2)
Multiple term entries in a single document are merged,
and frequency information is added:

    Term       Doc #   Freq
    ambitious    2      1
    be           2      1
    brutus       1      1
    brutus       2      1
    capitol      1      1
    caesar       1      1
    caesar       2      2
    did          1      1
    enact        1      1
    hath         2      1
    I            1      2
    i'           1      1
    it           2      1
    julius       1      1
    killed       1      2
    let          2      1
    me           1      1
    noble        2      1
    so           2      1
    the          1      1
    the          2      1
    told         2      1
    you          2      1
    was          1      1
    was          2      1
    with         2      1
The result is split into a Dictionary file
and a Postings file.

    Term       N docs  Coll freq   Postings (doc #, freq)
    ambitious    1        1        (2,1)
    be           1        1        (2,1)
    brutus       2        2        (1,1) (2,1)
    capitol      1        1        (1,1)
    caesar       2        3        (1,1) (2,2)
    did          1        1        (1,1)
    enact        1        1        (1,1)
    hath         1        1        (2,1)
    I            1        2        (1,2)
    i'           1        1        (1,1)
    it           1        1        (2,1)
    julius       1        1        (1,1)
    killed       1        2        (1,2)
    let          1        1        (2,1)
    me           1        1        (1,1)
    noble        1        1        (2,1)
    so           1        1        (2,1)
    the          2        2        (1,1) (2,1)
    told         1        1        (2,1)
    you          1        1        (2,1)
    was          2        2        (1,1) (2,1)
    with         1        1        (2,1)
In this split, the dictionary file stores the terms, and each
dictionary entry carries a pointer to that term's list in the
postings file.
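The whole construction can be sketched in Python. The document texts are the two sentences the (term, docID) pairs above come from; tokenization and sort order are simplified relative to the tables (everything is lowercased here):

    # Sketch of the indexer steps: (term, docID) pairs -> sort -> merge with
    # frequencies -> split into dictionary and postings.
    from collections import Counter

    docs = {
        1: "I did enact Julius Caesar I was killed i' the Capitol Brutus killed me",
        2: "So let it be with Caesar The noble Brutus hath told you Caesar was ambitious",
    }

    # Step 1: sequence of (modified token, docID) pairs.
    pairs = [(tok.lower(), doc_id)
             for doc_id, text in docs.items() for tok in text.split()]

    # Step 2: sort by term, then docID (the core indexing step).
    pairs.sort()

    # Step 3: merge multiple entries for the same (term, doc); record frequency.
    freq = Counter(pairs)                # (term, docID) -> within-doc frequency

    # Step 4: split into a dictionary and a postings structure.
    postings = {}                        # term -> [(docID, freq), ...]
    for (term, doc_id), f in sorted(freq.items()):
        postings.setdefault(term, []).append((doc_id, f))

    dictionary = {term: (len(pl), sum(f for _, f in pl))   # (#docs, coll freq)
                  for term, pl in postings.items()}

    print(dictionary["caesar"], postings["caesar"])  # (2, 3) [(1, 1), (2, 2)]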
The index we just built
How do we process a query?
Query processing: AND
Consider processing the query: Brutus AND Caesar
• Locate Brutus in the Dictionary; retrieve its postings.
• Locate Caesar in the Dictionary; retrieve its postings.
• "Merge" (intersect) the two postings:

    Brutus → 2 4 8 16 32 64 128
    Caesar → 1 2 3 5 8 13 21 34
The merge
Walk through the two postings
simultaneously, in time linear in the total
number of postings entries:

    Brutus       → 2 4 8 16 32 64 128
    Caesar       → 1 2 3 5 8 13 21 34
    Intersection → 2 8
Merging Algorithm
Merge(p, q)
1. answer ← ( )
2. while p ≠ NIL and q ≠ NIL
3.   do if docID(p) = docID(q)
4.        then ADD(answer, docID(p))    // add to result and advance both pointers
5.             p ← next(p), q ← next(q)
6.      else if docID(p) < docID(q)
7.        then p ← next(p)              // advance the pointer with the smaller docID
8.      else q ← next(q)
9. return answer
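A runnable Python version of the same merge, applied to the postings lists from the slides above:

    # Linear-time intersection of two sorted postings lists.
    def intersect(p, q):
        answer, i, j = [], 0, 0
        while i < len(p) and j < len(q):
            if p[i] == q[j]:
                answer.append(p[i])      # match: record docID, advance both
                i += 1; j += 1
            elif p[i] < q[j]:
                i += 1                   # advance the list with the smaller docID
            else:
                j += 1
        return answer

    brutus = [2, 4, 8, 16, 32, 64, 128]
    caesar = [1, 2, 3, 5, 8, 13, 21, 34]
    print(intersect(brutus, caesar))     # [2, 8]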
Boolean queries: Exact match
In the Boolean retrieval model, we can pose any
query that is a Boolean expression:
• Boolean queries use AND, OR and
NOT to join query terms
• Views each document as a set of words
• Is precise: a document either matches the
condition or it does not.
Boolean retrieval was the primary commercial retrieval
tool for 3 decades.
Professional searchers (e.g., lawyers) still like
Boolean queries.
Example: WestLaw (http://www.westlaw.com/)
Merging: More general merges
Consider an arbitrary Boolean formula:
(Brutus OR Caesar) AND NOT
(Antony OR Cleopatra)
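Such formulas also need linear-time union (for OR) and difference (for AND NOT) merges over sorted postings lists; a sketch, reusing the intersect function from the previous slide:

    # Union merge for OR.
    def union(p, q):
        answer, i, j = [], 0, 0
        while i < len(p) and j < len(q):
            if p[i] == q[j]:
                answer.append(p[i]); i += 1; j += 1
            elif p[i] < q[j]:
                answer.append(p[i]); i += 1
            else:
                answer.append(q[j]); j += 1
        return answer + p[i:] + q[j:]    # append whichever list remains

    # Difference merge for AND NOT: docIDs in p but not in q.
    def and_not(p, q):
        answer, i, j = [], 0, 0
        while i < len(p) and j < len(q):
            if p[i] == q[j]:
                i += 1; j += 1           # present in q: excluded
            elif p[i] < q[j]:
                answer.append(p[i]); i += 1
            else:
                j += 1
        return answer + p[i:]

    # The formula above then becomes (given the four postings lists):
    # and_not(union(brutus, caesar), union(antony, cleopatra))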
Query optimization
What is the best order for query processing?
Consider a query that is an AND of t terms
(e.g., Brutus AND Calpurnia AND Caesar).
For each of the t terms, get its postings, then AND
them together.

    Brutus    → 2 4 8 16 32 64 128
    Calpurnia → 1 2 3 5 8 13 21 34
    Caesar    → 13 16
Query optimization example
Process terms in order of increasing document frequency:
• start with the smallest postings set, then keep cutting
further.
(This is why we kept the frequency in the dictionary.)

    Brutus    → 2 4 8 16 32 64 128
    Calpurnia → 1 2 3 5 8 13 21 34
    Caesar    → 13 16

Here, execute the query as (Caesar AND Brutus) AND
Calpurnia, since Caesar has the shortest postings list;
see the sketch below.
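A sketch of the frequency-ordered AND, reusing intersect from the merge slide; the early exit shows why keeping intermediate results small pays off:

    # AND of several terms, processed in order of increasing postings-list
    # length (i.e., document frequency), with early termination.
    def and_query(postings_lists):
        ordered = sorted(postings_lists, key=len)   # rarest term first
        result = ordered[0]
        for plist in ordered[1:]:
            if not result:               # intermediate result empty: done
                break
            result = intersect(result, plist)
        return result

    brutus    = [2, 4, 8, 16, 32, 64, 128]
    calpurnia = [1, 2, 3, 5, 8, 13, 21, 34]
    caesar    = [13, 16]
    # (Caesar AND Brutus) = [16]; AND Calpurnia then leaves nothing,
    # so for this toy data no document matches all three terms.
    print(and_query([brutus, calpurnia, caesar]))   # []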
Beyond term search
Phrases?
• e.g., Indian Institute of Information Technology
Proximity: Find Murty NEAR Infosys.
• Need the index to capture position information in
docs.
Zones in documents: Find documents with (author =
Zufrasky) AND (text contains Retrieval).
What else to consider?
1 vs. 0 occurrences of a search term:
• 2 vs. 1 occurrence
• 3 vs. 2 occurrences, etc.
• Usually, more occurrences seems better.
We need term frequency information in docs.
Ranking search results
Boolean queries only give inclusion or
exclusion of docs.
They require a precise language for building
query expressions (instead of free text).
Often we want to rank/group results.
Clustering and classification
Clustering: given a set of docs, group them into
clusters based on their contents.
Classification: given a set of topics, plus a new doc D,
decide which topic(s) D belongs to.
The web and its challenges
Unusual and diverse documents
Unusual and diverse users, queries,
information needs
Beyond terms, exploit ideas from social networks:
• link analysis, clickstreams ...
How do search engines work? And
how can we make them better?
More sophisticated information retrieval
Cross-language information retrieval
Question answering
Summarization
Text mining
…