Lecture 5
[Figure: the information life cycle: creation → collection/capture → organization/indexing → storage/retrieval → distribution/dissemination → reuse/leverage]
[Figure: IR system architecture: text operations build the index and text database; searching the index returns retrieved docs, which ranking orders into ranked docs]
Dr. Yi, C., Tsinghua SEM 3
Classic IR Models
• Set Theoretic Models
– Boolean
– Fuzzy
• Vector Models (Algebraic)
• Probabilistic Models
• Others (e.g., neural networks, etc.)
[Figure: Boolean model: a Venn diagram of terms t1, t2, t3 partitions documents D1–D11 into eight minterms m1–m8, each minterm a conjunction of t1, t2, t3 or their negations]
Vector Space Model
• Documents ranked by distance between
points representing query and documents
– A similarity measure is more common than a distance or dissimilarity measure
– e.g., cosine correlation
sim(Q, D) = cos θ = (Q · D) / (|Q| × |D|)

where the numerator (Q · D) is the dot product and the denominator normalizes by the two vector lengths.

[Figure: cosine similarity as a function of the angle between vectors, in degrees]
Dr. Yi, C., Tsinghua SEM 30
Example: Similarity Calculation
• D1 = (0.8, 0.3), D2 = (0.2, 0.7), Q = (0.4, 0.8)

sim(Q, D2) = [(0.4)(0.2) + (0.8)(0.7)] / √{[(0.4)² + (0.8)²][(0.2)² + (0.7)²]}
           = 0.64 / √0.424 ≈ 0.98

sim(Q, D1) = [(0.4)(0.8) + (0.8)(0.3)] / √{[(0.4)² + (0.8)²][(0.8)² + (0.3)²]}
           = 0.56 / √0.584 ≈ 0.73

[Figure: Q, D1, D2 plotted in two-term space; the angle θ2 between Q and D2 is smaller than θ1 between Q and D1, matching cos θ2 ≈ 0.98 > cos θ1 ≈ 0.73]
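These similarities can be reproduced with a short script (a sketch; the helper name is my own):

```python
import math

def cosine_sim(q, d):
    """Cosine similarity: dot product divided by the product of vector lengths."""
    dot = sum(qi * di for qi, di in zip(q, d))
    return dot / (math.hypot(*q) * math.hypot(*d))

Q, D1, D2 = (0.4, 0.8), (0.8, 0.3), (0.2, 0.7)
print(round(cosine_sim(Q, D1), 2))  # 0.73
print(round(cosine_sim(Q, D2), 2))  # 0.98
```

D2 is ranked above D1 because its vector points in nearly the same direction as the query.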
Exercise: Similarity Calculation
– Consider two documents D1, D2 and a query Q
• D1 = (0.5, 0.8, 0.3), D2 = (0.9, 0.4, 0.2), Q = (1.5, 1.0, 0)
• Term weight with cosine (vector-length) normalization:

  w_ik = tf_ik · log(N / n_k) / √( Σ_{k=1}^{t} (tf_ik)² [log(N / n_k)]² )

  where tf_ik is the frequency of term k in document i, N is the number of documents in the collection, and n_k is the number of documents containing term k.
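The weighting can be sketched directly in code (function name and the toy collection counts are my own):

```python
import math

def tfidf_weights(tf, N, n):
    """Length-normalized tf-idf weights for one document (the w_ik formula above).

    tf : term frequencies tf_ik for document i
    N  : number of documents in the collection
    n  : document frequencies n_k for each term
    """
    raw = [tf_k * math.log(N / n_k) for tf_k, n_k in zip(tf, n)]
    norm = math.sqrt(sum(w * w for w in raw))
    return [w / norm for w in raw]

# Hypothetical counts: three terms, a collection of N = 1000 documents.
w = tfidf_weights(tf=[3, 1, 2], N=1000, n=[50, 400, 10])
print([round(x, 3) for x in w])
```

Note how the rare term (n_k = 10) ends up with the largest weight even though its raw frequency is not the highest, and the resulting vector has unit length.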
Recall: The Jazz Musician Example (L3)
• Final TFxIDF representation of “famous jazz
saxophonist born in Kansas who played bebop
and latin”
• Used in feature vector
representation of this
sample document
A Note on Similarity (Distance)
Measurement
• Applications of Euclidean distance
– Classification (spatial technique): k-nearest
neighbor search
p(A|B): probability of A given B
p(A): probability of A
p(B): probability of B
p(B|A): probability of B given A
Bayes Classifier
• Bayes Decision Rule
– A document D is relevant if P(R|D) > P(NR|D)
• Estimating probabilities
– Use Bayes Rule
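A toy numeric check of the decision rule (all probabilities invented for illustration):

```python
# Bayes' rule: P(R|D) = P(D|R) P(R) / P(D), and likewise for NR.
# All numbers below are invented for illustration.
p_r, p_nr = 0.3, 0.7          # priors P(R), P(NR)
p_d_r, p_d_nr = 0.8, 0.1      # likelihoods P(D|R), P(D|NR)

p_d = p_d_r * p_r + p_d_nr * p_nr   # total probability P(D)
p_r_d = p_d_r * p_r / p_d           # posterior P(R|D)
p_nr_d = p_d_nr * p_nr / p_d        # posterior P(NR|D)

# Decision rule: relevant iff P(R|D) > P(NR|D)
print("relevant" if p_r_d > p_nr_d else "not relevant")
```

Because P(D) appears in both posteriors, comparing P(R|D) with P(NR|D) only requires comparing P(D|R)P(R) against P(D|NR)P(NR).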
– Contingency table (rows: document indexed by term t or not; columns: relevant or not):

                relevant    non-relevant      total
  t present     r           n − r             n
  t absent      R − r       N − n − R + r     N − n
  total         R           N − R             N
w = log [ (r / (R − r)) / ((n − r) / (N − n − R + r)) ]

– The numerator is the odds that t occurs in the relevant set; the denominator is the odds that t occurs in the non-relevant set.
– Adding 0.5 to each cell avoids zero counts:

w^(1) = log [ ((r + 0.5) / (R − r + 0.5)) / ((n − r + 0.5) / (N − n − R + r + 0.5)) ]
What if we don’t have information about R or r?
Example
• Probabilistic Retrieval Example
– D1: “Cost of paper is up.” (relevant)
– D2: “Cost of jellybeans is up.” (not relevant)
– D3: “Salaries of CEO’s are up.” (not relevant)
– D4: “Paper: CEO’s labor cost up.” (????)
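One way to score D4 is to sum the smoothed w^(1) weight of each of its terms, treating D1 as the only known relevant document (a sketch; the naive lowercase tokenization and the scoring-by-summing choice are mine):

```python
import math

def rsj_weight(r, n, R, N):
    """Smoothed Robertson-Sparck Jones weight w^(1)."""
    return math.log(((r + 0.5) / (R - r + 0.5)) /
                    ((n - r + 0.5) / (N - n - R + r + 0.5)))

docs = {
    "D1": "cost of paper is up",
    "D2": "cost of jellybeans is up",
    "D3": "salaries of ceo's are up",
}
relevant = {"D1"}
N, R = len(docs), len(relevant)

score = 0.0
for term in "paper ceo's labor cost up".split():   # terms of D4
    n = sum(term in text.split() for text in docs.values())
    r = sum(term in docs[name].split() for name in relevant)
    score += rsj_weight(r, n, R, N)
print(round(score, 2))  # positive overall score: D4 leans relevant
```

"paper" and "cost" pull the score up, "ceo's" and "up" pull it down, and the net is positive, so D4 would be judged closer to the relevant class.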
• BM25 ranking function:

  Σ_{T ∈ Q} w^(1) · [(k1 + 1) tf / (K + tf)] · [(k2 + 1) qtf / (k2 + qtf)]
Where:
• Q is a query containing terms T
• tf is the frequency of the term in a specific document
• qtf is the frequency of the term in a topic from which Q was
derived
• k1, b, and k2 are parameters, usually set to 1.2, 0.75, and a value in 0–1000 respectively
• dl and avdl are the document length and the average document
length measured in some convenient unit (e.g. bytes)
• K is k1((1-b) + b * dl / avdl)
• w(1) is the Robertson-Sparck Jones weight.
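The definitions above can be turned into a scoring function (a sketch; the inputs are invented, and a plain idf is used in place of w^(1), a common substitution when no relevance information about R and r is available):

```python
import math

def bm25_score(query_tf, doc_tf, idf, dl, avdl, k1=1.2, b=0.75, k2=100):
    """BM25 score of one document for one query (the formula above).

    query_tf : {term: qtf}, doc_tf : {term: tf}, idf : {term: weight}
    dl, avdl : document length and average document length
    """
    K = k1 * ((1 - b) + b * dl / avdl)   # length normalization
    score = 0.0
    for t, qtf in query_tf.items():
        tf = doc_tf.get(t, 0)
        score += (idf.get(t, 0.0)
                  * ((k1 + 1) * tf / (K + tf))
                  * ((k2 + 1) * qtf / (k2 + qtf)))
    return score

# Invented example: query "president lincoln" against one 90-word document.
idf = {"president": 2.0, "lincoln": 5.0}
s = bm25_score({"president": 1, "lincoln": 1},
               {"president": 3, "lincoln": 2}, idf, dl=90, avdl=100)
print(round(s, 2))
```

The tf factor saturates as K + tf grows, so repeated occurrences of a term add progressively less to the score.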
BM-25 Example
– Query terms: "President", "Lincoln"
• We want to classify a new document D that contains w2, w3, and w4.
• For the positive class, compute Pr(Class = 1 | D); for the negative class, compute Pr(Class = 0 | D).
• Compare! Here the negative-class probability is larger, so D belongs to the negative class.
A Variant of Probabilistic Model in
Classification Problems
• We can consider each term weight as an
evidence lift: P (t|Class) / P(t)
• Probability is a product of evidence lifts
– If a lift is greater than one, then the probability is
increased
– If a lift is less than one, then the probability is diminished
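A toy sketch of the product-of-lifts computation (all numbers invented):

```python
# Under the independence assumption, the posterior is approximately the
# prior times the product of each term's evidence lift P(t|Class) / P(t).
# All numbers below are invented for illustration.
prior = 0.4                                  # P(Class = 1)
lifts = {"w2": 1.8, "w3": 0.9, "w4": 1.3}    # P(t|Class=1) / P(t) per term

posterior = prior
for term, lift in lifts.items():
    posterior *= lift        # lift > 1 raises the probability, < 1 lowers it
print(round(posterior, 3))
```

Here w2 and w4 raise the probability while w3 slightly diminishes it, and the net effect is an increase over the prior.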
• In general, the evidence lift of a term t is defined as the ratio P(t|Class) / P(t).