lecture1-intro-boolean
Information Retrieval
Computer Science Tripos Part II
Ronan Cummins
2016
Adapted from Simone Teufel’s original slides
Overview
1 Motivation
Definition of “Information Retrieval”
IR: beginnings to now
What is Information Retrieval?
Document Collections
What we mean here by document collections
Manning et al, 2008:
IR Basics
[Diagram: a query and a document collection feed into the IR system, which returns the set of relevant documents]
IR Basics
[Diagram: a query and web pages feed into the IR system, which returns the set of relevant web pages]
Structured vs Unstructured Data
Unstructured data means that a formal, semantically overt, easy-for-computer structure is missing.
This contrasts with the rigidly structured data used in database-style searching (e.g. product inventories, personnel records):
SELECT *
FROM business_catalogue
WHERE category = 'florist'
AND city_zip = 'cb1'
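To make the contrast concrete, here is a minimal sketch of the structured case using Python's built-in sqlite3 module. The table and column names are written with underscores, and the three rows are invented purely for illustration; none of this data comes from the slides.

```python
import sqlite3

# In-memory database; nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE business_catalogue (name TEXT, category TEXT, city_zip TEXT)"
)

# Hypothetical rows, made up for this sketch.
rows = [
    ("Bloom & Co", "florist", "cb1"),
    ("Petal Works", "florist", "cb2"),
    ("Mill Road Bakery", "bakery", "cb1"),
]
conn.executemany("INSERT INTO business_catalogue VALUES (?, ?, ?)", rows)

# Every record has explicit fields, so the query can match on them exactly.
matches = conn.execute(
    "SELECT name FROM business_catalogue "
    "WHERE category = ? AND city_zip = ? ORDER BY name",
    ("florist", "cb1"),
).fetchall()
print(matches)
```

With unstructured text there are no such fields to match on, which is why the rest of the lecture works with terms and documents instead.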
Information Needs and Relevance
Types of information needs
Known-item search
Precise information seeking search
Open-ended search (“topical search”)
Information scarcity vs. information abundance
. . . when a servant had spilled an urn of hot coffee over his legs, he replied to
the distressed inquiries of the lady of the house, ‘Thank you, madam, the
agony is somewhat abated.’ [not Lord Byron, but Lord Macaulay]
Relevance
How well has the system performed?
IR today
Web search
The search ground consists of billions of documents on millions of computers.
Issues: spidering; efficient indexing and search; malicious manipulation to boost search engine rankings.
Link analysis is covered in Lecture 8.
A short history of IR
[Figure: plot with axes labelled “precision” and “no. items retrieved”]
IR for non-textual media
Similarity Searches
Areas of IR
Brutus AND Caesar AND NOT Calpurnia
The term-document incidence matrix
Query “Brutus AND Caesar AND NOT Calpurnia”
We compute the result for our query as the bitwise AND of the vectors for Brutus, Caesar and the complement of the vector for Calpurnia:
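The bitwise computation can be sketched in a few lines of Python. The incidence vectors and play titles below are an assumption: they follow the textbook example in Manning et al. (2008), since the matrix itself is not reproduced in this text. The helper names are hypothetical.

```python
# One bit per play, leftmost bit = first play in PLAYS.
# ASSUMPTION: values follow the example in Manning et al. (2008).
PLAYS = [
    "Antony and Cleopatra", "Julius Caesar", "The Tempest",
    "Hamlet", "Othello", "Macbeth",
]
INCIDENCE = {
    "Brutus":    0b110100,
    "Caesar":    0b110111,
    "Calpurnia": 0b010000,
}

def boolean_and_not(include, exclude, incidence, n_docs):
    """AND the vectors of `include` with the complements of `exclude`."""
    mask = (1 << n_docs) - 1      # restricts the complement to n_docs bits
    result = mask
    for term in include:
        result &= incidence[term]
    for term in exclude:
        result &= ~incidence[term] & mask
    return result

def matching_plays(bits, plays):
    """Map the set bits back to play titles."""
    n = len(plays)
    return [p for i, p in enumerate(plays) if (bits >> (n - 1 - i)) & 1]

hits = boolean_and_not(["Brutus", "Caesar"], ["Calpurnia"], INCIDENCE, len(PLAYS))
print(format(hits, "06b"), matching_plays(hits, PLAYS))
# prints: 100100 ['Antony and Cleopatra', 'Hamlet']
```

The mask matters because Python integers are unbounded: without it, the complement of a vector would be a negative number rather than a six-bit pattern.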
The results: two documents
Bigger collections
Can’t build the Term-Document incidence matrix
The inverted index
Calpurnia 2 31 54 101
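As a rough illustration of the data structure, an inverted index can be sketched as a dictionary from each term to a sorted list of the IDs of documents that contain it. The toy collection below is hypothetical; its document IDs are chosen only so that "calpurnia" ends up with the postings list shown above, and the whitespace tokenisation is a deliberate simplification.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to a sorted list of IDs of documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():   # naive whitespace tokenisation
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

# Hypothetical toy collection (doc ID -> text), not taken from the slides.
docs = {
    1: "brutus caesar",
    2: "brutus caesar calpurnia",
    31: "caesar calpurnia brutus",
    54: "calpurnia caesar",
    101: "calpurnia",
}
index = build_inverted_index(docs)
print(index["calpurnia"])   # [2, 31, 54, 101]
```

Only the terms that actually occur in a document are stored, which avoids the mostly empty cells of a full term-document matrix.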
Processing Boolean Queries: conjunctive queries
Locate the postings lists of both query terms and intersect them.
Calpurnia 2 31 54 101
Intersection 2 31
Algorithm for intersection of two postings
Calpurnia 2 31 54 101
Intersection 2 31
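A minimal sketch of the standard merge-style intersection, assuming both postings lists are sorted by ascending document ID. The Calpurnia list is the one shown above; the Brutus list is an assumption taken from the textbook example in Manning et al. (2008), as it does not survive in this text.

```python
def intersect(p1, p2):
    """Intersect two postings lists sorted by ascending document ID."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:        # document contains both terms
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:       # advance whichever pointer is behind
            i += 1
        else:
            j += 1
    return answer

brutus = [1, 2, 4, 11, 31, 45, 173, 174]   # ASSUMED postings list
calpurnia = [2, 31, 54, 101]
print(intersect(brutus, calpurnia))        # [2, 31]
```

Each step advances at least one pointer, so the loop runs at most len(p1) + len(p2) times: the intersection is linear in the combined length of the two lists.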
Complexity of the Intersection Algorithm
Query Optimisation: conjunctive terms
Organise the order in which the postings lists are accessed so that the least
work needs to be done
Brutus AND Caesar AND Calpurnia
Calpurnia (document frequency 4): 2 31 54 101
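One common heuristic, sketched here under assumptions, is to process the terms in order of increasing postings-list length, so that every intermediate result is no longer than the shortest list. The Brutus and Caesar lists are assumed for illustration; only the Calpurnia list appears above. The function names are hypothetical.

```python
def intersect(p1, p2):
    """Merge-style intersection of two sorted postings lists."""
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

def intersect_many(terms, index):
    """AND together several terms, shortest postings list first."""
    if not terms:
        return []
    ordered = sorted(terms, key=lambda t: len(index[t]))
    result = index[ordered[0]]
    for term in ordered[1:]:
        if not result:            # nothing left to match, stop early
            break
        result = intersect(result, index[term])
    return result

index = {
    "Brutus":    [1, 2, 4, 11, 31, 45, 173, 174],     # ASSUMED
    "Caesar":    [1, 2, 4, 5, 6, 16, 57, 132, 179],   # ASSUMED
    "Calpurnia": [2, 31, 54, 101],
}
print(intersect_many(["Brutus", "Caesar", "Calpurnia"], index))   # [2]
```

Here the lists are visited in the order Calpurnia, Brutus, Caesar, so the first intersection already shrinks the working set to two documents.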
Query Optimisation: disjunctive terms
Practical Boolean Search
Examples
Does Google use the Boolean Model?
Reading