
Lect 2 - Boolean Retrieval

An information retrieval model selects and ranks documents relevant to a user's query. The model represents documents and queries and uses a matching function to compare them and a ranking function to order documents. Key components include acquisition, representation, file organization, and queries. Common models include Boolean, vector space, and probabilistic models. The indexing process involves preprocessing text, creating an inverted index of terms to documents, and storing the index for fast retrieval.

Uploaded by

Miral Elnakib
Copyright
© © All Rights Reserved

INFORMATION RETRIEVAL AND SEARCH ENGINES

Dr. Ahmed El-Shaer


Week 2

Text Processing and Tokenization

❑ Preprocessing techniques: Lowercasing, Stop word removal, Stemming, Lemmatization
❑ Tokenization and term frequency calculations
❑ Inverted index creation and compression strategies

DS414 information Retrieval & Search Engines 2


2
IR models
Information Retrieval Model
An IR model is a pattern that defines several aspects of
the retrieval procedure, for example:

⚫ how the documents and the user’s queries are represented,

⚫ how the system retrieves relevant documents according to users’ queries, and

⚫ how the retrieved documents are ranked.
IR models
❑ An Information Retrieval (IR) model selects and ranks the documents that the user
has asked for in the form of a query. The documents and the queries are represented
in a similar manner, so that document selection and ranking can be formalized by a
matching function that returns a retrieval status value (RSV) for each document in
the collection. Many Information Retrieval systems represent document contents by a
set of descriptors, called terms, belonging to a vocabulary V.
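The idea of a matching function that returns an RSV per document can be illustrated with a minimal Python sketch. The scoring rule used here (count of shared terms) and the names `rsv`, `docs` and `query` are assumptions chosen for illustration only; the slides do not prescribe a particular matching function.

```python
def rsv(query_terms, doc_terms):
    """Toy matching function (an assumption, not a standard model):
    the RSV is the number of query terms that occur in the document."""
    return len(set(query_terms) & set(doc_terms))


# Documents and the query use the same representation: a set of terms.
docs = {
    "d1": {"new", "schizophrenia", "drug"},
    "d2": {"new", "hopes", "for", "schizophrenia", "patients"},
}
query = {"schizophrenia", "drug"}

# Rank documents by decreasing RSV.
ranking = sorted(docs, key=lambda d: rsv(query, docs[d]), reverse=True)
print(ranking)
```

Because documents and queries share one representation, the same function can score any query against any document in the collection.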




Components of Information Retrieval / IR Model
❑ Acquisition: In this step, documents and other objects are selected from various web
resources that consist of text-based documents.
❑ Representation: This consists of indexing, which covers free-text terms, controlled
vocabulary, and both manual and automatic techniques. Examples: abstracting, which
involves summarizing, and bibliographic description, which covers author, title,
sources, data, and metadata.
❑ File Organization: There are two types of file organization methods: sequential and
inverted.
❑ Query: An IR process starts when a user enters a query into the system.


IR models
An IR model consists of
⚫ a model for documents,
⚫ a model for queries,
⚫ a matching function which compares queries to
documents, and
⚫ a ranking function.


Classical IR Model
IR models can be classified as:
⚫ Classical models of IR
⚫ Non-Classical models of IR
⚫ Alternative models of IR
Classical IR Model
⚫ based on mathematical knowledge that was easily
recognized and well understood
⚫ simple, efficient and easy to implement
⚫ The three classical information retrieval models are:
-Boolean
-Vector space
-Probabilistic models



Non-Classical models of IR
Non-classical information retrieval models are based on
principles other than similarity, probability, Boolean
operations, etc., on which classical retrieval models are
based.
Example:
Information logic model, situation theory model and
interaction model.


Alternative IR models
⚫ Alternative models are enhancements of classical
models making use of specific techniques from other
fields.
Example:
Cluster model, fuzzy model and latent semantic indexing
(LSI) models.


Information Retrieval Model
⚫ The actual text of the document and query is not used in the
retrieval process. Instead, some representation of it is used.

⚫ The document representation is matched with the query
representation to perform retrieval.

⚫ One frequently used method is to represent a document as a set
of index terms or keywords.
Boolean retrieval model
❑ Users are required to express their queries as a Boolean expression consisting of
keywords connected with Boolean logical operators (AND, OR, NOT).

❑ Retrieval is performed based on whether or not a document contains the query
terms.


Boolean retrieval model
Example: Build a Term-Document Incidence Matrix
o Which term appears in which document
o Rows are terms
o Columns are documents
Given example collection:
d1 He likes to wink, he likes to drink
d2 He likes to drink, and drink, and drink
d3 The thing he likes to drink is ink
d4 The ink he likes to drink is pink
d5 He likes to wink, and drink pink ink
Term-Document Incidence Matrix
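As a rough illustration, the incidence matrix for the five-document example collection can be built with a few lines of Python. The tokenization rule (lowercase, letters only) and the names `tokenize`, `doc_terms` and `matrix` are assumptions made for this sketch.

```python
import re

docs = {
    "d1": "He likes to wink, he likes to drink",
    "d2": "He likes to drink, and drink, and drink",
    "d3": "The thing he likes to drink is ink",
    "d4": "The ink he likes to drink is pink",
    "d5": "He likes to wink, and drink pink ink",
}


def tokenize(text):
    # Assumed rule: lowercase the text and keep runs of letters as tokens.
    return re.findall(r"[a-z]+", text.lower())


# The set of distinct terms per document.
doc_terms = {d: set(tokenize(text)) for d, text in docs.items()}

# Rows are terms (sorted), columns are documents d1..d5.
vocabulary = sorted(set().union(*doc_terms.values()))
matrix = {
    term: [int(term in doc_terms[d]) for d in docs]
    for term in vocabulary
}

print("term  ", *docs)
for term, row in matrix.items():
    print(f"{term:<6}", *row, sep="  ")
```

Each row is the incidence vector of one term: a 1 marks a document that contains the term, a 0 one that does not.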


Boolean retrieval model
Any given query divides the collection into two sets:

● retrieved (matching)

● not-retrieved (not matching)

Returns a set of documents that “exactly” satisfy the query (Boolean expression)

● Called “Exact-Match” retrieval

Is it still used?

● Many search systems still in use are Boolean

● e.g., email, library catalogs, Mac OS X Spotlight, legal search
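The exact-match behaviour can be sketched with Python sets. The docID sets below are taken from the wink/drink/ink example collection (d1 to d5 numbered 1 to 5); the query itself, ink AND pink AND NOT wink, is an illustrative assumption and does not appear in the slides.

```python
# Documents d1..d5 from the example collection, identified by number.
all_docs = {1, 2, 3, 4, 5}

# Which documents contain each term (read off the example collection).
postings = {
    "wink": {1, 5},
    "pink": {4, 5},
    "ink": {3, 4, 5},
}

# Hypothetical query: ink AND pink AND NOT wink
# AND is set intersection, NOT is set difference from the matching set.
retrieved = (postings["ink"] & postings["pink"]) - postings["wink"]

# Every other document falls in the not-retrieved set; there is no ranking.
not_retrieved = all_docs - retrieved

print("retrieved:", sorted(retrieved))
print("not retrieved:", sorted(not_retrieved))
```

A document either satisfies the Boolean expression or it does not, so the two sets always partition the collection.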


Boolean retrieval model
Example:
Consider these documents:

Doc 1 breakthrough drug for schizophrenia

Doc 2 new schizophrenia drug

Doc 3 new approach for treatment of schizophrenia

Doc 4 new hopes for schizophrenia patients

a. Draw the term-document incidence matrix for this document collection.

b. What are the returned results for these queries?

   i. schizophrenia AND drug
   ii. for AND NOT(drug OR approach)
2
Indexing &
Preprocessing
Basic components of search engines



Indexing process (offline)
❑ Search engine indexing refers to the process where a search engine
(such as Google) organizes and stores online content in a central
database (its index). The search engine can then analyze and
understand the content, and serve it to readers in ranked lists on
its Search Engine Results Pages (SERPs).


Indexing process (offline)
document → unique ID
⚫ What data do we want to search? web-crawling, RSS feeds, emails
⚫ What can you store? disk space? rights? compression?
⚫ a lookup table for quickly finding all docs containing a word
⚫ format conversion. international? which part contains “meaning”? word units? stopping? stemming?
Search process (online)

❑ Steps in Search Engine


Bigger Collections …
Consider N = 1 million documents, each with about 1000 words.

Say there are M = 500K distinct terms among these.

500K x 1M term-doc incidence matrix has half-a-trillion 0’s and 1’s.

But it has no more than one billion 1’s.

● matrix is extremely sparse.
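The arithmetic behind these figures can be checked with a short Python sketch; the variable names are illustrative.

```python
N = 1_000_000          # documents
words_per_doc = 1_000  # approximate words per document
M = 500_000            # distinct terms

# Size of the term-document incidence matrix (one cell per term/document pair).
cells = M * N

# Each word occurrence can set at most one cell to 1,
# so the number of 1s is bounded by the total number of word occurrences.
max_ones = N * words_per_doc

density = max_ones / cells

print(f"cells:    {cells:,}")      # half a trillion
print(f"max 1s:   {max_ones:,}")   # one billion
print(f"density:  {density:.1%}")  # at most 0.2% of the cells are 1
```

At most 0.2% of the cells can be 1, which is why storing only the positions of the 1s is far cheaper than storing the full matrix.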



2
Indexing
Inverted index
❑ For each term T, we must store a list of all documents that contain T.
● Identify each by a docID, a document serial number
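A minimal sketch of this data structure in Python, assuming a dictionary that maps each term to a sorted list of docIDs. The function name `build_inverted_index` and the simple lowercase, letters-only tokenization are assumptions made for illustration.

```python
import re
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the sorted list of docIDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Assumed tokenization: lowercase, runs of letters.
        for term in re.findall(r"[a-z]+", text.lower()):
            index[term].add(doc_id)
    # Keep postings sorted by docID and terms in alphabetical order.
    return {term: sorted(ids) for term, ids in sorted(index.items())}


docs = {
    1: "He likes to wink, he likes to drink",
    2: "He likes to drink, and drink, and drink",
    3: "The thing he likes to drink is ink",
}
index = build_inverted_index(docs)

for term, posting_list in index.items():
    print(term, "->", posting_list)
```

Only the documents that actually contain a term are stored, which avoids the zeros of the incidence matrix.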



Construction of inverted index
Documents to be indexed: He likes to wink, he likes to drink
Tokenizer → token stream: He likes to wink he likes to drink
Normalizer → modified tokens (terms): like wink he like drink
Indexer → inverted index:
He → 2 4
likes → 1 2
wink → 3 2
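The tokenizer and normalizer stages can be sketched in Python as follows. The stop list containing only "to" and the crude rule that strips a trailing "s" are assumptions standing in for a real stop list and a real stemmer; they are chosen only so that the example sentence normalizes in the same way as on the slide.

```python
import re

STOP_WORDS = {"to"}  # hypothetical stop list, for illustration only


def tokenize(text):
    # Split the raw text into word tokens, dropping punctuation.
    return re.findall(r"[A-Za-z']+", text)


def normalize(tokens):
    terms = []
    for token in tokens:
        term = token.lower()              # lowercasing
        if term in STOP_WORDS:            # stop word removal
            continue
        if len(term) > 3 and term.endswith("s"):
            term = term[:-1]              # toy stand-in for stemming
        terms.append(term)
    return terms


tokens = tokenize("He likes to wink, he likes to drink")
terms = normalize(tokens)

print(tokens)
print(terms)
```

A production system would typically replace the toy suffix rule with an established stemmer or lemmatizer.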
Indexer steps
Doc 1: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious
⚫ Preprocessing produces a sequence of (modified token, document ID) pairs:
I 1, did 1, enact 1, julius 1, caesar 1, I 1, was 1, killed 1, i' 1, the 1, capitol 1, brutus 1, killed 1, me 1,
so 2, let 2, it 2, be 2, with 2, caesar 2, the 2, noble 2, brutus 2, hath 2, told 2, you 2, caesar 2, was 2, ambitious 2
⚫ Sort by terms (alphabetic). This is the core indexing step.
⚫ Multiple term entries in a single document are merged.
⚫ Frequency information is added.
⚫ The result: dictionary & postings.
Indexer steps
Inverted matrix
Term Doc # Term freq
ambitious 2 1
be 2 1
brutus 1 1
brutus 2 1
capitol 1 1
caesar 1 1
caesar 2 2
did 1 1
enact 1 1
hath 2 1
I 1 2
i' 1 1
it 2 1
julius 1 1
killed 1 2
let 2 1
me 1 1
noble 2 1
so 2 1
the 1 1
the 2 1
told 2 1
you 2 1
was 1 1
was 2 1
with 2 1
Indexer steps

⚫ The result is split into a Dictionary file and a Postings file.
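The indexer steps (collect pairs, sort, merge, add frequencies, split) can be sketched in Python for the two example documents. Lowercasing every token and storing the number of documents per term in the dictionary are assumptions of this sketch; the names `pairs`, `postings` and `dictionary` are illustrative.

```python
import re

docs = {
    1: "I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious",
}

# Step 1: sequence of (modified token, docID) pairs.
# Assumption: every token is lowercased; "i'" keeps its apostrophe.
pairs = [
    (token.lower(), doc_id)
    for doc_id, text in docs.items()
    for token in re.findall(r"[A-Za-z']+", text)
]

# Step 2: sort by term, then by docID (the core indexing step).
pairs.sort()

# Step 3: merge repeated entries of a term within one document
# and record the term frequency per document.
postings = {}
for term, doc_id in pairs:
    doc_freqs = postings.setdefault(term, {})
    doc_freqs[doc_id] = doc_freqs.get(doc_id, 0) + 1

# Step 4: split into a dictionary (term -> number of documents, an assumed
# layout) and the postings (term -> {docID: term frequency}).
dictionary = {term: len(doc_freqs) for term, doc_freqs in postings.items()}

for term in dictionary:
    print(f"{term:<10} docs={dictionary[term]} postings={postings[term]}")
```

In a real system the dictionary is usually kept in memory while the much larger postings are stored on disk.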


Boolean retrieval model
Example (solved):
Consider these documents:

Doc 1 breakthrough drug for schizophrenia

Doc 2 new schizophrenia drug

Doc 3 new approach for treatment of schizophrenia

Doc 4 new hopes for schizophrenia patients

a. Draw the term-document incidence matrix for this document collection.

b. Draw the inverted index representation for this collection.

c. What are the returned results for these queries?

   i. schizophrenia AND drug
   ii. for AND NOT(drug OR approach)
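One way to check the answers to part c is a short Python sketch that builds the inverted index with sets and evaluates both queries. The whitespace tokenization and the variable names are assumptions of this sketch.

```python
docs = {
    1: "breakthrough drug for schizophrenia",
    2: "new schizophrenia drug",
    3: "new approach for treatment of schizophrenia",
    4: "new hopes for schizophrenia patients",
}

# Inverted index: term -> set of docIDs (whitespace tokenization is enough here).
index = {}
for doc_id, text in docs.items():
    for term in text.split():
        index.setdefault(term, set()).add(doc_id)

all_docs = set(docs)

# Query i: schizophrenia AND drug
q1 = index["schizophrenia"] & index["drug"]

# Query ii: for AND NOT(drug OR approach)
# NOT is the complement with respect to the whole collection.
q2 = index["for"] & (all_docs - (index["drug"] | index["approach"]))

print("schizophrenia AND drug:", sorted(q1))
print("for AND NOT(drug OR approach):", sorted(q2))
```

The first query returns Doc 1 and Doc 2; the second returns only Doc 4, since Doc 1 contains "drug" and Doc 3 contains "approach".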
The index we just built



Query processing: AND (intersection)
⚫ Consider processing the query:
Brutus AND Caesar
• Locate Brutus in the Dictionary;
• Retrieve its postings.
• Locate Caesar in the Dictionary;
• Retrieve its postings.
• “Merge” the two postings:
Brutus → 2 4 8 16 32 64 128
Caesar → 1 2 3 5 8 13 21 34


The merge: AND (intersection)
❑ Walk through the two postings simultaneously, in time linear in the total number
of postings entries

Brutus → 2 4 8 16 32 64 128
Caesar → 1 2 3 5 8 13 21 34
Intersection → 2 8

If the list lengths are x and y, the merge takes O(x+y) operations.

Crucial: postings sorted by docID.



Merging Algorithm: AND (intersection)
❑ Intersecting Two Postings Lists (a “merge” algorithm):
INTERSECT(p1, p2)
  answer ← ( )
  while p1 != NULL and p2 != NULL
    if docID(p1) = docID(p2)
      then ADD(answer, docID(p1))
           p1 ← next(p1)
           p2 ← next(p2)
    else if docID(p1) < docID(p2)
      then p1 ← next(p1)
      else p2 ← next(p2)
  return answer
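The INTERSECT pseudocode translates fairly directly into Python. This sketch assumes that postings are plain Python lists of docIDs sorted in increasing order, with list indices playing the role of the pointers p1 and p2.

```python
def intersect(p1, p2):
    """Intersect two postings lists sorted by docID in O(x + y) steps."""
    answer = []
    i = j = 0  # positions in p1 and p2 (the two "pointers")
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            # Same docID in both lists: keep it and advance both pointers.
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            # Advance the pointer that sits on the smaller docID.
            i += 1
        else:
            j += 1
    return answer


brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
print(intersect(brutus, caesar))
```

For the Brutus and Caesar postings above the result is the list containing docIDs 2 and 8. The linear running time depends on both lists being sorted by docID; unsorted lists would make this walk incorrect.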
Quiz: try to solve!!
Example:
Consider these documents:

Doc 1 new home sales top forecasts

Doc 2 home sales rise in july

Doc 3 increase in home sales in july

Doc 4 july new home sales rise

a. Draw the term-document incidence matrix for this document collection.

b. Draw the inverted index representation for this collection.

c. What are the returned results for this query?

   Sales AND increase NOT July
