
Lecture 5: Information Retrieval – Models and Evaluation (I)

Dr. YI Cheng (易成)


School of Economics and Management
Mar 25, 2024
Information Life Cycle

[Cycle diagram, beginning from analyzing user requirements: Creation →
Collection/Capture → Organization/Indexing → Storage/Retrieval →
Distribution/Dissemination → Reuse/Leverage, all governed by Information
Life Cycle Management.]


The Retrieval Process

[Flow diagram: the user states an information need through the user
interface; text operations convert both documents and the query to their
logical views; query operations produce the query while the indexing module,
under the DB manager module, builds the inverted file; searching matches the
query against the index and retrieves documents from the text database;
ranking orders the retrieved documents, and user feedback can refine the
query.]
Classic IR Models
• Set Theoretic Models
– Boolean
– Fuzzy
• Vector Models (Algebraic)
• Probabilistic Models (probabilistic)
• Others (e.g., neural networks, etc.)



Query Languages
• A way to express the question (information
need)
• Types:
– Boolean
– Natural Language
– Stylized Natural Language
– Form-Based (GUI)



Boolean Model for IR
• Based on Boolean Logic (Algebra of Sets).
• Fundamental principles established by George
Boole in the 1850’s
• Deals with set membership and operations on
sets
• Set membership in IR systems is usually based
on whether (or not) a document contains a
keyword (term)
– Exact match
Boolean
• Dominant language in commercial systems until
WWW
• Many enterprise search, patent search, and online
library catalogs are Boolean systems
• Database systems use Boolean logic for searching



Simple Query Language: Boolean
– Terms + Connectors (or operators)
– Terms
• Words
• Normalized (stemmed) words
• Phrases
• Thesaurus terms
– Connectors
• AND e.g. (Cat AND Dog)
• OR e.g. (Cat OR Dog)
• NOT e.g. NOT (Cat)
Boolean Queries
• Usually expressed as INFIX operators in IR
– ((a AND b) OR (c AND b))
• NOT is an UNARY PREFIX operator
– ((a AND b) OR (c AND (NOT b)))
• AND and OR can be n-ary operators
– (a AND b AND c AND d)
• Some rules (De Morgan's laws):
– NOT(a) AND NOT(b) = NOT(a OR b)
– NOT(a) OR NOT(b) = NOT(a AND b)
– NOT(NOT(a)) = a
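These identities are easy to check mechanically. Below is a minimal Python sketch of Boolean retrieval over sets of document IDs; the collection and postings are hypothetical illustrations, not data from the lecture.

```python
ALL_DOCS = set(range(1, 11))           # a hypothetical collection, docs 1..10
postings = {                           # term -> set of docs containing it
    "a": {1, 2, 3, 7},
    "b": {2, 4, 7, 9},
}

def AND(x, y): return x & y            # set intersection
def OR(x, y):  return x | y            # set union
def NOT(x):    return ALL_DOCS - x     # complement w.r.t. the collection

a, b = postings["a"], postings["b"]
assert AND(NOT(a), NOT(b)) == NOT(OR(a, b))   # NOT(a) AND NOT(b) = NOT(a OR b)
assert OR(NOT(a), NOT(b)) == NOT(AND(a, b))   # NOT(a) OR NOT(b)  = NOT(a AND b)
assert NOT(NOT(a)) == a                       # NOT(NOT(a)) = a
print("De Morgan identities hold on this collection")
```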


Inverted Files Revisited
• Boolean queries can be answered using an
inverted index.
Postings (term, doc #, freq):

a        2  1        manor    2  1
aid      1  1        men      1  1
all      1  1        midnight 2  1
and      2  1        night    2  1
come     1  1        now      1  1
country  1  1        of       1  1
country  2  1        past     2  1
dark     2  1        stormy   2  1
for      1  1        the      1  2
good     1  1        the      2  2
in       2  1        their    1  1
is       1  1        time     1  1
it       2  1        time     2  1
                     to       1  2
                     was      2  2

Merged dictionary (term, n docs, total freq):

a        1  1        manor    1  1
aid      1  1        men      1  1
all      1  1        midnight 1  1
and      1  1        night    1  1
come     1  1        now      1  1
country  2  2        of       1  1
dark     1  1        past     1  1
for      1  1        stormy   1  1
good     1  1        the      2  4
in       1  1        their    1  1
is       1  1        time     2  2
it       1  1        to       1  2
                     was      1  2
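A small Python sketch of the idea: it builds an inverted index with per-document term frequencies and answers a Boolean AND query by intersecting postings lists. The two sample sentences are assumed stand-ins consistent with the postings above.

```python
from collections import defaultdict

docs = {
    1: "now is the time for all good men to come to the aid of their country",
    2: "it was a dark and stormy night in the country manor the time was past midnight",
}

index = defaultdict(dict)              # term -> {doc_id: term frequency}
for doc_id, text in docs.items():
    for term in text.split():
        index[term][doc_id] = index[term].get(doc_id, 0) + 1

def boolean_and(*terms):
    """Return IDs of docs containing every term (exact match, no ranking)."""
    result = None
    for t in terms:
        postings = set(index.get(t, {}))
        result = postings if result is None else result & postings
    return sorted(result) if result else []

print(boolean_and("time", "country"))   # [1, 2]
print(boolean_and("time", "stormy"))    # [2]
print(dict(index["the"]))               # {1: 2, 2: 2} -> 2 docs, total freq 4
```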
Boolean Logic

[Venn diagram: documents D1–D11 distributed over the regions formed by the
three term sets t1, t2, and t3. The regions m1–m8 are the eight minterms,
i.e., the conjunctions of t1, t2, t3 and their complements.]


Ordering of Retrieved Documents
• Pure Boolean has no ordering
• In practice:
– Order chronologically
– Order by total number of “hits” on query
terms
• What if one term has more hits than others?
• Is it better to have one of each term or many of
one term?



Basic Concepts for Extended Boolean
• Instead of binary values, terms in documents
and queries have a weight (importance or
some other statistical property)
• Instead of binary set membership, sets are
“fuzzy” and the weights are used to determine
degree of membership.
• Degree of set membership can be used to rank
the results of a query
Fuzzy Sets
• Introduced by Zadeh in 1965.
• If each object (i.e., document) in set {A} (i.e.,
term A's set) has membership value v(A) and
in set {B} has value v(B), where 0 ≤ v ≤ 1
• v(A ∩ B) = min(v(A), v(B))
• v(A ∪ B) = max(v(A), v(B))
• v(¬A) = 1 − v(A)



Fuzzy Sets
• If we have three documents and three terms (Vt1,
Vt2, Vt3)
– Set membership function can be the relative term
frequency (or TF-IDF or others) within a document
– D1 = (.4, .2, 1), D2 = (0, 0, .8), D3 = (.7, .4, 0)

For the search t1 ∨ t2 ∨ t3:        For the search t1 ∧ t2 ∧ t3:
RSV(D1) = max(.4, .2, 1) = 1        RSV(D1) = min(.4, .2, 1) = .2
RSV(D2) = max(0, 0, .8) = .8        RSV(D2) = min(0, 0, .8) = 0
RSV(D3) = max(.7, .4, 0) = .7       RSV(D3) = min(.7, .4, 0) = 0

RSV: retrieval status value
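A minimal sketch of the fuzzy evaluation above, using the slide's membership values with min for AND and max for OR.

```python
D = {  # document -> (v(t1), v(t2), v(t3)) membership values from the slide
    "D1": (0.4, 0.2, 1.0),
    "D2": (0.0, 0.0, 0.8),
    "D3": (0.7, 0.4, 0.0),
}

def rsv_or(v):  return max(v)   # t1 OR t2 OR t3
def rsv_and(v): return min(v)   # t1 AND t2 AND t3

for name, v in D.items():
    print(name, "OR:", rsv_or(v), "AND:", rsv_and(v))
# D1 OR: 1.0 AND: 0.2
# D2 OR: 0.8 AND: 0.0
# D3 OR: 0.7 AND: 0.0
```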


Fuzzy Sets
• Fuzzy set membership of a term in a document is
a function f(A) → [0, 1]
• D1 = {(mesons, .8), (scattering, .4)}
• D2 = {(mesons, .5), (scattering, .6)}
• Query = MESONS AND SCATTERING
• RSV(D1) = MIN(.8,.4) = .4
• RSV(D2) = MIN(.5,.6) = .5
• D2 is ranked before D1 in the result set.



Critique of Fuzzy Sets
• D1 = {(mesons, .4), (scattering, .4)}
• D2 = {(mesons, .39), (scattering, .99)}
• Query = MESONS AND SCATTERING
• RSV(D1) = MIN(.4,.4) = .4
• RSV(D2) = MIN(.39,.99) = .39
• Consistent with the Boolean model:
– Query = t1 ∧ t2 ∧ t3 ∧ … ∧ t100
– If D is not indexed by t1 then it fails, even if D is
indexed by t2, …, t100
Critique of Fuzzy Sets
• The rank of a document depends entirely on
the lowest or highest weighted term in an
AND or OR operation
• Still suffers from a lack of discrimination
among the retrieval results



Boolean Model
• Advantages
– Simple queries are easy to understand
– Relatively easy to implement
• Disadvantages
– Difficult to specify what is wanted, particularly in complex
situations
– Too much returned, or too little
– Ordering not well determined in Traditional Boolean
– Ordering may be problematic in extended Boolean



Non-Boolean IR
• Need to measure some similarity between the
query and the document
• The basic notion is that documents that are
somehow similar to a query are likely to be
relevant responses for that query
• To measure similarity we…
– Need to consider the characteristics of the document
and the query
– Make the assumption that similarity of language use
between the query and the document implies
similarity of topic and hence, potential relevance.
Classic IR Models
• Set Theoretic Models
– Boolean
– Fuzzy
• Vector Models (Algebraic)
• Probabilistic Models (probabilistic)



Vector Space Model
• Documents are represented as vectors in term
space
– Terms are usually stems
– Documents represented by binary or weighted
vectors of terms
• Queries represented the same as documents
• Distance measure between query and
documents is used to rank retrieved documents
– Documents nearer in vector space are more similar



Vector Representation
• Documents (D) and Queries (Q) are
represented as vectors
• Position 1 corresponds to term 1, position 2
to term 2, position t to term t
• The weight of the term is stored in each
position
$$ D_i = (w_{d_{i1}}, w_{d_{i2}}, \ldots, w_{d_{it}}) $$
$$ Q = (w_{q_1}, w_{q_2}, \ldots, w_{q_t}) $$

w = 0 if a term is absent
Document Vectors

[Table: documents A–G represented as weighted vectors over the terms nova,
star, heat, h'wood, film, role, diet, and fur; e.g., document A (about
astronomy) weights nova, star, and heat; document G (about movie stars)
weights star and the film terms; document E (about mammal behavior) weights
diet and fur.]

[Plot in Star–Diet space: A (astronomy) lies near the Star axis, E (mammal
behavior) near the Diet axis, and G (movie stars) in between.]
Vector Space Documents and Queries

docs   t1   t2   t3   Q·Di
D1      1    0    1     4
D2      1    0    0     1
D3      0    1    1     5
D4      1    0    0     1
D5      1    1    1     6
D6      1    1    0     3
D7      0    1    0     2
D8      0    1    0     2
D9      0    0    1     3
D10     0    1    1     5
D11     1    0    1     4
Q       1    2    3
       q1   q2   q3

Q is a query – also represented as a vector; Q·Di is its dot product with
each document vector.
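A short sketch reproducing the Q·Di column above: each score is the dot product of the query weights Q = (1, 2, 3) with the document's binary term vector.

```python
docs = {
    "D1": (1, 0, 1), "D2": (1, 0, 0), "D3": (0, 1, 1), "D4": (1, 0, 0),
    "D5": (1, 1, 1), "D6": (1, 1, 0), "D7": (0, 1, 0), "D8": (0, 1, 0),
    "D9": (0, 0, 1), "D10": (0, 1, 1), "D11": (1, 0, 1),
}
Q = (1, 2, 3)

scores = {d: sum(q * w for q, w in zip(Q, vec)) for d, vec in docs.items()}
for d, s in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(d, s)    # D5=6, then D3=D10=5, D1=D11=4, ...
```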


Documents in Vector Space

[3-D plot: documents D1–D11 and the query Q as points in the space spanned
by the terms t1, t2, and t3.]
Vector Space Model
• Documents ranked by distance between
points representing query and documents
– Similarity measure is more common than a
distance or dissimilarity measure
– E.g., cosine correlation:

$$ \mathrm{sim}(Q, D_i) = \frac{\sum_{j=1}^{t} w_{q_j}\, w_{d_{ij}}}{\sqrt{\sum_{j=1}^{t} w_{q_j}^2}\ \sqrt{\sum_{j=1}^{t} w_{d_{ij}}^2}} $$

(dot product in the numerator, normalization in the denominator)


Cosine vs. Degrees

[Plot: the cosine value as a function of the angle between two vectors,
measured in degrees.]
Example: Similarity Calculation

D1 = (0.8, 0.3), D2 = (0.2, 0.7), Q = (0.4, 0.8)

[Plot: Q, D1, and D2 as vectors in the unit square; θ1 is the angle between
Q and D1, θ2 the angle between Q and D2.]

cos θ1 = 0.74
cos θ2 = 0.98


Example: Similarity Calculation

$$ D_i = (w_{d_{i1}}, w_{d_{i2}}, \ldots, w_{d_{it}}), \qquad Q = (w_{q_1}, w_{q_2}, \ldots, w_{q_t}) $$

Q = (0.4, 0.8), D1 = (0.8, 0.3), D2 = (0.2, 0.7)

$$ \mathrm{sim}(Q, D_i) = \frac{\sum_{j=1}^{t} w_{q_j} w_{d_{ij}}}{\sqrt{\sum_{j=1}^{t} (w_{q_j})^2}\ \sqrt{\sum_{j=1}^{t} (w_{d_{ij}})^2}} $$

$$ \mathrm{sim}(Q, D_2) = \frac{(0.4 \times 0.2) + (0.8 \times 0.7)}{\sqrt{[(0.4)^2 + (0.8)^2] \times [(0.2)^2 + (0.7)^2]}} = \frac{0.64}{\sqrt{0.424}} \approx 0.98 $$

$$ \mathrm{sim}(Q, D_1) = \frac{0.56}{\sqrt{0.584}} \approx 0.74 $$
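A minimal cosine-similarity sketch in Python, checked against the numbers above (it prints 0.73 for D1, which the slide rounds to 0.74).

```python
import math

def cosine(q, d):
    """Cosine similarity: dot product over the product of vector lengths."""
    dot = sum(qj * dj for qj, dj in zip(q, d))
    norm = math.sqrt(sum(qj ** 2 for qj in q)) * math.sqrt(sum(dj ** 2 for dj in d))
    return dot / norm

Q, D1, D2 = (0.4, 0.8), (0.8, 0.3), (0.2, 0.7)
print(round(cosine(Q, D1), 2))   # 0.73 (~0.74 on the slide)
print(round(cosine(Q, D2), 2))   # 0.98
```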
Exercise: Similarity Calculation
– Consider two documents D1, D2 and a query Q
• D1 = (0.5, 0.8, 0.3), D2 = (0.9, 0.4, 0.2), Q = (1.5, 1.0, 0)



Term Weights in Vector Space Model
1. Binary Weights i.e., 0 or 1
2. Raw term frequency
3. tf*idf i.e., term freq * inverse doc freq
– Recall the Zipf distribution
– Want to weight terms highly if they are
• Frequent in relevant documents … BUT
• Infrequent in the collection as a whole



Vector Space Similarity
• Combine tf and idf into a similarity measure
$$ D_i = (w_{d_{i1}}, w_{d_{i2}}, \ldots, w_{d_{it}}) \qquad Q = (w_{q_1}, w_{q_2}, \ldots, w_{q_t}) $$

w = 0 if a term is absent

If term weights are normalized:

$$ \mathrm{sim}(Q, D_i) = \sum_{j=1}^{t} w_{q_j} \cdot w_{d_{ij}} $$

Otherwise, normalization and the similarity comparison are combined:

$$ \mathrm{sim}(Q, D_i) = \frac{\sum_{j=1}^{t} w_{q_j} \cdot w_{d_{ij}}}{\sqrt{\sum_{j=1}^{t} (w_{q_j})^2} \cdot \sqrt{\sum_{j=1}^{t} (w_{d_{ij}})^2}} $$


tf*idf Normalization
• Normalize the term weights (so longer vectors are
not unfairly given more weight)
– Normalizing usually means forcing all values to fall
within a certain range, usually between 0 and 1,
inclusive

$$ w_{ik} = \frac{tf_{ik} \cdot \log(N / n_k)}{\sqrt{\sum_{k=1}^{t} (tf_{ik})^2\,[\log(N / n_k)]^2}} $$
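A sketch of this weighting in Python. The collection statistics (N, n_k) and term frequencies below are hypothetical; the point is that the resulting tf*idf vector has unit length.

```python
import math

N = 1000                        # hypothetical total documents in the collection
n = [50, 500, 5]                # n_k: number of docs containing term k
tf = [3, 10, 1]                 # tf_ik: frequencies of each term in document i

# Raw tf*idf weights, then divide by the length of the weight vector.
raw = [tf_k * math.log(N / n_k) for tf_k, n_k in zip(tf, n)]
length = math.sqrt(sum(w * w for w in raw))
w_norm = [w / length for w in raw]

print([round(w, 3) for w in w_norm])              # unit-length tf*idf vector
assert abs(sum(w * w for w in w_norm) - 1.0) < 1e-9
```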
Recall: The Jazz Musician Example (L3)
• Final TFxIDF representation of “famous jazz
saxophonist born in Kansas who played bebop
and latin”
• Used in feature vector
representation of this
sample document



Recall: The Jazz Musician Example
• Assume the sample phrase “famous jazz
saxophonist born in Kansas who played bebop
and latin” was a search query
• Which musician in the collection is the best
match to the query?
– Cosine similarity between
the query and each
musician’s biography



Homework 3b (Due 5pm Apr 1)
• What’s your intuition of the ideal ranking?



A Note on Similarity (Distance)
Measurement
• Similarity underlies many methods and
solutions to business problems
• Once an object can be represented as a
feature vector, we can compute similarity
between objects as “distance” between
objects
– The closer the two objects in feature space, the
more similar



A Note on Similarity (Distance)
Measurement
• There are many other distance measures
• Cosine distance is mainly used for measuring text
similarity
– It ignores differences in scale (i.e., the magnitude of
vectors) and focuses on the textual content


A Note on Similarity (Distance)
Measurement
• A basic geometric interpretation for
measuring distance:

• General Euclidean distance:

$$ d(X, Y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} $$
A Note on Similarity (Distance)
Measurement
• Applications of Euclidean distance
– Classification (spatial technique): k-nearest
neighbor search



A Note on Similarity (Distance)
Measurement
• Applications of Euclidean distance
– Classification: predict the target value by finding
the nearest neighbors (and, e.g., looking at the
majority vote or using weighted voting by
assigning more weight to more similar neighbors)

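A minimal k-nearest-neighbor sketch with Euclidean distance and majority vote, as described above; the training vectors and labels are hypothetical.

```python
import math
from collections import Counter

def euclidean(x, y):
    """General Euclidean distance between two feature vectors."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def knn_predict(query, examples, k=3):
    """examples: list of (feature_vector, label); majority vote over k nearest."""
    nearest = sorted(examples, key=lambda ex: euclidean(query, ex[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

train = [((1.0, 1.0), "A"), ((1.2, 0.9), "A"),
         ((5.0, 5.1), "B"), ((4.8, 5.3), "B"), ((5.2, 4.9), "B")]
print(knn_predict((1.1, 1.0), train))   # -> "A"
```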


A Note on Similarity (Distance)
Measurement
• Applications of Euclidean distance
– To find similar customers, products, etc. (k-
nearest neighbor search)
• Netflix: the movie X was recommended based on
your interest in A, B, and C.
• Amazon: customers with similar searches
purchased…



Problems with Vector Space
• There is no real theoretical basis for the
assumption of a term space
– It is more for visualization than having any real basis
– There is no explicit definition of relevance; the
implicit assumption is that relevance is related to the
similarity of query and documents

• Terms are not really orthogonal dimensions
– Terms are not independent of all other terms

• Word ordering is lost, in query or document



Classic IR Models
• Set Theoretic Models
– Boolean
– Fuzzy
• Vector Models (Algebraic)
• Probabilistic Models (probabilistic)



Probabilistic Models
• Rigorous formal model which attempts to predict
the probability that a given document will be
relevant to a given query
• Ranks retrieved documents according to the
probability of relevance to query
• Relies on accurate estimates of probabilities (that
are difficult to get)
• Dominant paradigm today



IR as Classification

P(R|D): given the representation of the document, the probability that it is
relevant

P(NR|D): given the representation of the document, the probability that it is
non-relevant


Example Question
• Probabilistic Retrieval Example
– D1: “Cost of paper is up.” (relevant)
– D2: “Cost of jellybeans is up.” (not relevant)
– D3: “Salaries of CEO’s are up.” (not relevant)
– D4: “Paper: CEO’s labor cost up.” (????)



Bayes’ theorem
p( A) p( B | A)
p( A | B) =
For example:
A: disease
p( B) B: symptom

p ( A | B ) : probability of A given B

p ( A) : probability of A

p (B ) : probability of B

p ( B | A) : probability of B given A
Dr. Yi, C., Tsinghua SEM 51
Bayes Classifier
• Bayes Decision Rule
– A document D is relevant if P(R|D) > P(NR|D)
• Estimating probabilities
– Use Bayes' rule:

$$ P(R \mid D) = \frac{P(D \mid R)\, P(R)}{P(D)} $$

– Classify a document as relevant if

$$ P(D \mid R)\, P(R) > P(D \mid NR)\, P(NR), \quad \text{i.e.,} \quad \frac{P(D \mid R)}{P(D \mid NR)} > \frac{P(NR)}{P(R)} $$

• The left-hand side is the likelihood ratio, used as the ranking score


Binary Independence

• Assume independence:

$$ P(D \mid R) = \prod_{i=1}^{t} P(d_i \mid R) $$

• Binary independence model
– Document represented by a vector of binary features
indicating term occurrence (or non-occurrence)
– p_i is the probability that term i occurs (i.e., has value 1) in a
relevant document; s_i is the probability of term i's
occurrence in a non-relevant document
• That is, we already know the indexing term distribution in
relevant/irrelevant documents


Binary Independence

• The scoring function is

$$ \sum_{i\,:\, d_i = q_i = 1} \log \frac{p_i\,(1 - s_i)}{s_i\,(1 - p_i)} $$

• In some cases, the query provides the only
information about the relevant set
• Then the score of a document is determined by
the matching terms (i.e., terms in both the document
and the query)
– Terms not in the query are assumed to have the same probability
of occurrence in the relevant and non-relevant sets
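A sketch of this matching-term score in Python; the p_i and s_i estimates below are hypothetical.

```python
import math

p = {"paper": 0.7, "cost": 0.6, "up": 0.5}   # hypothetical P(term | relevant)
s = {"paper": 0.1, "cost": 0.3, "up": 0.5}   # hypothetical P(term | non-relevant)

def bim_score(doc_terms, query_terms):
    """Sum the log-odds contribution of each term in both doc and query."""
    score = 0.0
    for t in doc_terms & query_terms:        # matching terms only
        score += math.log(p[t] * (1 - s[t]) / (s[t] * (1 - p[t])))
    return score

print(round(bim_score({"paper", "cost", "up"}, {"paper", "cost", "up"}), 3))
# 'up' contributes log(1) = 0: a term equally likely in both sets adds nothing.
```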
Parameters in Computing Term
Weight
• N = total number of documents in collection
• R = total number of relevant documents for
a query
• n = number of documents that contain the
query term
• r = number of relevant documents that
contain the query term



Robertson-Sparck Jones Term Weights

Given a term t and a query q:

                      Relevant    Non-relevant     Total
Term present (+)      r           n − r            n
Term absent (−)       R − r       N − n − R + r    N − n
Total                 R           N − R            N


Robertson-Sparck Jones Term Weights

• Retrospective formulation:

$$ w = \log \frac{r / (R - r)}{(n - r) / (N - n - R + r)} $$

The numerator is the odds that t occurs in the relevant set; the denominator
is the odds that t occurs in the non-relevant set.

Meaning of the log: if the document contains t, what are the odds that the
document is relevant?
Robertson-Sparck Jones Term Weights

• Predictive formulation:

$$ w^{(1)} = \log \frac{(r + 0.5) / (R - r + 0.5)}{(n - r + 0.5) / (N - n - R + r + 0.5)} $$

What if we don't have information about R or r?
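A sketch of the predictive weight, checked against the "cost" example below (N=3, R=1, n=2, r=1 gives odds of 3 to 1, i.e., w = log 3).

```python
import math

def rsj_weight(N, R, n, r):
    """Predictive Robertson-Sparck Jones term weight with 0.5 smoothing."""
    return math.log(((r + 0.5) / (R - r + 0.5)) /
                    ((n - r + 0.5) / (N - n - R + r + 0.5)))

w = rsj_weight(N=3, R=1, n=2, r=1)     # the 'cost' term in the example below
print(round(math.exp(w), 2))           # 3.0 -> odds of 3 to 1
```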
Example
• Probabilistic Retrieval Example
– D1: “Cost of paper is up.” (relevant)
– D2: “Cost of jellybeans is up.” (not relevant)
– D3: “Salaries of CEO’s are up.” (not relevant)
– D4: “Paper: CEO’s labor cost up.” (????)

Term     Relevant    Not relevant    Evidence        Odds of Relevance
Paper    1           0               for (strong)    (1.5/0.5)/(0.5/2.5) = 15
CEO      0           1/2             against         (0.5/1.5)/(1.5/1.5) = 1/3
labor    0           0               none            (0.5/1.5)/(0.5/2.5) = 5/3
cost     1           1/2             for (weak)      (1.5/0.5)/(1.5/1.5) = 3
up       1           1               none            (1.5/0.5)/(2.5/0.5) = 3/5

TOTAL ODDS (product of the individual odds) = ?


Example (Cost)

$$ w^{(1)} = \log \frac{(r + 0.5) / (R - r + 0.5)}{(n - r + 0.5) / (N - n - R + r + 0.5)} $$
• R=1, r=1, n=2, N=3
• cost appears in 1 of 1 relevant document
– odds are (1+.5)/(0+.5) = 3 to 1 that cost will
appear
• cost appears in 1 of 2 non-relevant documents
– odds are (1+.5)/(1+.5) = 1 to 1 that cost will
appear
• If cost appears in D, then the odds are (3/1)/(1/1)
i.e., 3 to 1 that D is relevant.



Example
• Probabilistic Retrieval Example
– D1: “Cost of paper is up.” (relevant)
– D2: “Cost of jellybeans is up.” (not relevant)
– D3: “Salaries of CEO’s are up.” (not relevant)
– D4: “Paper: CEO’s labor cost up.” (????)

Term     Relevant    Not relevant    Evidence        Odds of Relevance
Paper    1           0               for (strong)    (1.5/0.5)/(0.5/2.5) = 15
CEO      0           1/2             against         (0.5/1.5)/(1.5/1.5) = 1/3
labor    0           0               none            (0.5/1.5)/(0.5/2.5) = 5/3
cost     1           1/2             for (weak)      (1.5/0.5)/(1.5/1.5) = 3
up       1           1               none            (1.5/0.5)/(2.5/0.5) = 3/5

TOTAL ODDS (product of the individual odds) = 15


Adjusting Term Weight to Include Term Frequency: Okapi BM-25

$$ \mathrm{score}(D, Q) = \sum_{T \in Q} w^{(1)} \cdot \frac{(k_1 + 1)\,tf}{K + tf} \cdot \frac{(k_2 + 1)\,qtf}{k_2 + qtf} $$

Where:
• Q is a query containing terms T
• tf is the frequency of the term in a specific document
• qtf is the frequency of the term in a topic from which Q was derived
• k1, b and k2 are parameters, usually 1.2, 0.75, and 0-1000 respectively
• dl and avdl are the document length and the average document length,
measured in some convenient unit (e.g., bytes)
• K is k1((1 − b) + b · dl / avdl)
• w(1) is the Robertson-Sparck Jones weight
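A sketch of this scoring function with the parameter values named above (k2 = 100 is an arbitrary pick from the 0-1000 range); the document statistics and w(1) weight below are hypothetical.

```python
import math

def bm25(query_terms, doc_tf, query_tf, w1, dl, avdl,
         k1=1.2, b=0.75, k2=100):
    """Okapi BM-25 score for one document against one query."""
    K = k1 * ((1 - b) + b * dl / avdl)       # length-normalized k1
    score = 0.0
    for t in query_terms:
        tf, qtf = doc_tf.get(t, 0), query_tf.get(t, 0)
        score += (w1[t]
                  * (k1 + 1) * tf / (K + tf)
                  * (k2 + 1) * qtf / (k2 + qtf))
    return score

# Hypothetical one-term illustration:
print(round(bm25({"lincoln"}, {"lincoln": 4}, {"lincoln": 1},
                 w1={"lincoln": 2.0}, dl=120, avdl=100), 2))
```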
BM-25 Example

[Worked example: BM-25 scores computed for the query terms "President" and
"Lincoln".]


BM-25 Example
• Effect of term frequencies





A Variant of Probabilistic Model in Classification Problems (Naïve Bayes)

• We want to classify a new document D that has w2, w3, and w4.
• For the positive class, compute Pr(Class = 1 | D); for the negative
class, compute Pr(Class = 0 | D).
• Compare the two probabilities: here, D belongs to the negative class!
A Variant of Probabilistic Model in
Classification Problems
• We can consider each term weight as an
evidence lift: P (t|Class) / P(t)
• Probability is a product of evidence lifts
– If a lift is greater than one, then the probability is
increased
– If a lift is less than one, then the probability is diminished
• In general, evidence lift is defined as:

$$ \mathrm{lift}(t, c) = \frac{P(t \mid c)}{P(t)} $$
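A minimal sketch of the evidence-lift view with hypothetical probabilities: lifts above 1 raise the class probability, lifts below 1 lower it.

```python
p_term = {"w2": 0.4, "w3": 0.5, "w4": 0.2}            # hypothetical P(t)
p_term_given_pos = {"w2": 0.6, "w3": 0.5, "w4": 0.1}  # hypothetical P(t | Class=1)

def lift(t):
    """Evidence lift of term t for the positive class."""
    return p_term_given_pos[t] / p_term[t]

product = 1.0
for t in ["w2", "w3", "w4"]:
    print(t, "lift:", round(lift(t), 2))   # 1.5, 1.0, 0.5
    product *= lift(t)
print("product of lifts:", round(product, 2))   # 0.75 < 1: evidence against Class=1
```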


Example: Evidence Lifts from
Facebook “Likes”
• What people "Like" on Facebook can be
predictive of their traits, e.g., IQ
• What are the Likes that give strong evidence
lifts for “high IQ”?
– Binary target class: IQ > 130 (high)



Example: Evidence Lifts from Facebook "Likes"

[Table: the Likes giving the strongest evidence lifts for the high-IQ class.]


Homework 3cd (Due 5pm Apr 1)



Probabilistic Models

Advantages:
• Strong theoretical basis
• In principle should supply the best predictions of relevance given
available information
• Can be implemented similarly to vector models (identical to the vector
model when the relevant set = {query} and the non-relevant set = {})

Disadvantages:
• Relevance information is required
• Important indicators of relevance may not be terms, though usually only
terms are used
• Ideally requires on-going collection of relevance information


Vector and Probabilistic Models
• Support “natural language” queries
• Treat documents and queries the same
• Support relevance feedback searching
• Support ranked retrieval
• Differ primarily in theoretical basis and in how the
ranking is calculated
– Vector assumes relevance follows from similarity
– Probabilistic relies on relevance judgments or
estimates



Recommended Readings
• Bruce Croft, Donald Metzler and Trevor
Strohman, Search Engines: Information
Retrieval in Practice, 1st Ed., Addison-Wesley,
2009
– Chapter 7: retrieval models



Course Schedule

Week   Date           Lesson Topics
1      Feb 26         Introduction
2      Mar 4          Metadata and subject analysis (metadata schemes, controlled vocabularies)
3      Mar 11         Information categorization; computational classification: text processing basics
4      Mar 18         Computational classification: decision tree; information retrieval: inverted indexes
5-6    Mar 25, Apr 1  Information retrieval: models (Boolean, vector space, probabilistic) and evaluation
7      Apr 8          Project presentation
8-9    Apr 15, 22     Web search (link analysis, paid search)
10     Apr 29         Test 1; guest lecture
11-12  May 6, 13      Information and social network (information cascades, social network analysis)
13-14  May 20, 27     Social and ethical issues (pricing of information, information goods market, IP issues); review
15     Jun 3          Test 2
