Chapter 2 (Information Storage & Retrieval)
Word distribution: Zipf's Law
• Zipf's Law states that when the distinct words in a text are arranged
in decreasing order of their frequency of occurrence (most frequent
words first), the occurrence characteristics of the vocabulary can be
characterized by the constant rank-frequency law of Zipf.
• If the words w in a collection are ranked by their frequency, with
rank r and frequency f, they roughly fit the relation:

      f ∝ 1/r,   i.e.   r * f = c

  – Different collections have different constants c.
• The table shows the most frequently occurring words from a 336,310-document
corpus containing 125,720,891 total words, of which 508,209 are unique words.
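The relation r * f ≈ c can be checked on any ranked word-frequency list. A minimal Python sketch, using a toy corpus whose word frequencies are deliberately constructed to satisfy the law exactly (all names here are illustrative):

```python
from collections import Counter

def zipf_products(text):
    """Rank words by frequency and return (rank, word, freq, rank*freq) tuples.

    Under Zipf's law the rank*freq products should be roughly constant.
    """
    freqs = Counter(text.lower().split())
    ranked = sorted(freqs.items(), key=lambda kv: kv[1], reverse=True)
    return [(r, w, f, r * f) for r, (w, f) in enumerate(ranked, start=1)]

# Toy corpus: frequencies engineered to follow f = c / r with c = 12
corpus = " ".join(["the"] * 12 + ["of"] * 6 + ["and"] * 4 + ["to"] * 3)
for rank, word, freq, product in zipf_products(corpus):
    print(rank, word, freq, product)  # product is 12 at every rank
```

On real corpora the products are only approximately constant, especially at the very top and very bottom of the ranking.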
Explanations for Zipf’s Law
• The law has been explained by the “principle of least effort,”
which makes it easier for a speaker or writer of a language
to repeat certain words instead of coining new and
different words.
  – Zipf’s own explanation was his “principle of least effort”: a
    balance between the speaker’s desire for a small vocabulary and
    the hearer’s desire for a large one.
Word significance: Luhn’s Ideas
• Luhn's idea (1958): the frequency of word occurrence in a
text furnishes a useful measurement of word significance.
• Luhn suggested that both extremely common and extremely
uncommon words are not very useful for indexing.
• For this, Luhn specified two cutoff points, an upper and a
lower cutoff, based on which non-significant words are
excluded:
  – Words exceeding the upper cutoff are considered too common.
  – Words below the lower cutoff are considered too rare.
  – Hence neither group contributes significantly to the content of
    the text.
  – The ability of words to discriminate content reaches a peak at a
    rank-order position halfway between the two cutoffs.
• Let f be the frequency of occurrence of words in a text, and r their
rank in decreasing order of word frequency; a plot relating f and r
yields the curve on which Luhn's two cutoffs are drawn.
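Luhn's two-cutoff idea can be sketched as a simple frequency filter. The cutoff values below are arbitrary assumptions for illustration, since Luhn gave no fixed prescription for choosing them:

```python
from collections import Counter

def significant_words(text, lower_cutoff, upper_cutoff):
    """Keep only words whose frequency lies between Luhn's two cutoffs.

    Words above upper_cutoff are treated as too common, words below
    lower_cutoff as too rare; both groups are excluded from indexing.
    """
    freqs = Counter(text.lower().split())
    return {w for w, f in freqs.items() if lower_cutoff <= f <= upper_cutoff}

text = "the the the the data index index retrieval zipf"
print(significant_words(text, lower_cutoff=2, upper_cutoff=3))  # {'index'}
```

Here "the" (frequency 4) is cut as too common, while the words occurring once are cut as too rare.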
Why use term weighting?
• Binary weights are too limiting:
  – Terms are either present or absent.
  – They do not allow ordering documents according to their level of
    relevance for a given query.
Exercise
• Calculate the idf value for each term. Suppose N = 1 million.

  term          df_t        idf_t
  computer      1
  information   100
  storage       1,000
  retrieval     10,000
  system        100,000
  science       1,000,000
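One common convention (an assumption here, since the slide does not fix the logarithm base) is idf_t = log10(N / df_t). A minimal sketch that fills in the table:

```python
import math

def idf(N, df):
    """Inverse document frequency: idf_t = log10(N / df_t)."""
    return math.log10(N / df)

N = 1_000_000
terms = [("computer", 1), ("information", 100), ("storage", 1_000),
         ("retrieval", 10_000), ("system", 100_000), ("science", 1_000_000)]
for term, df in terms:
    print(f"{term:12s} {idf(N, df):.0f}")
# computer 6, information 4, storage 3, retrieval 2, system 1, science 0
```

Note that a term appearing in every document (df_t = N) gets idf 0, i.e. no discriminating power.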
[Table: term frequencies of the terms bullet, dish, distance, rabbit,
record, and roast in documents D1–D5, alongside a second table of the
corresponding TFIDF weights per term and document]
• Dot product
  – The dot product is also known as the scalar product or
    inner product.
  – The dot-product similarity of a document dj and a query q is the
    sum, over all index terms, of the products of their weights:

      sim(dj, q) = dj · q = sum_i (w_i,j * w_i,q)
        k1   k2   k3   |  q · dj
  d1     1    0    1   |    2
  d2     1    0    0   |    1
  d3     0    1    1   |    2
  d4     1    0    0   |    1
  d5     1    1    1   |    3
  d6     1    1    0   |    2
  d7     0    1    0   |    1
  q      1    1    1   |
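The q · dj column can be reproduced with a one-line inner product. A minimal sketch over the binary vectors above:

```python
def dot(u, v):
    """Inner-product similarity between two term-weight vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))

docs = {"d1": [1, 0, 1], "d2": [1, 0, 0], "d3": [0, 1, 1], "d4": [1, 0, 0],
        "d5": [1, 1, 1], "d6": [1, 1, 0], "d7": [0, 1, 0]}
q = [1, 1, 1]
print({name: dot(q, d) for name, d in docs.items()})
# {'d1': 2, 'd2': 1, 'd3': 2, 'd4': 1, 'd5': 3, 'd6': 2, 'd7': 1}
```

With a binary query, the dot product simply counts how many query terms each document contains.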
Inner Product: Exercise
[Figure: documents d1–d7 plotted as points in the k1–k2 term plane]
        k1   k2   k3   |  q · dj
  d1     1    0    1   |    ?
  d2     1    0    0   |    ?
  d3     0    1    1   |    ?
  d4     1    0    0   |    ?
  d5     1    1    1   |    ?
  d6     1    1    0   |    ?
  d7     0    1    0   |    ?
  q      1    2    3   |
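To check your answers, the same inner product applies, now with the weighted query q = (1, 2, 3). A minimal sketch:

```python
def dot(u, v):
    """Inner product of two term-weight vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))

docs = {"d1": [1, 0, 1], "d2": [1, 0, 0], "d3": [0, 1, 1], "d4": [1, 0, 0],
        "d5": [1, 1, 1], "d6": [1, 1, 0], "d7": [0, 1, 0]}
q = [1, 2, 3]
print({name: dot(q, d) for name, d in docs.items()})
# {'d1': 4, 'd2': 1, 'd3': 5, 'd4': 1, 'd5': 6, 'd6': 3, 'd7': 2}
```

Unlike the binary query on the previous slide, the weighted query rewards documents containing k3 most heavily.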
Exercise
• Index terms with tfidf weights (table below)
• Cosine similarity normalizes the dot product by the vector
  magnitudes:

      sim(dj, q) = (dj · q) / (|dj| * |q|)
                 = sum_{i=1..n} (w_i,j * w_i,q)
                   / ( sqrt(sum_{i=1..n} w_i,j^2) * sqrt(sum_{i=1..n} w_i,q^2) )

• Or, between two documents:

      sim(dj, dk) = (dj · dk) / (|dj| * |dk|)
                  = sum_{i=1..n} (w_i,j * w_i,k)
                    / ( sqrt(sum_{i=1..n} w_i,j^2) * sqrt(sum_{i=1..n} w_i,k^2) )
[Figure: query Q and documents D1, D2 plotted as vectors, with
cos θ1 = 0.74 and cos θ2 = 0.98]
  Terms      D1      D2      D3
  affection  0.996   0.993   0.847
  jealous    0.087   0.120   0.466
  gossip     0.017   0.000   0.254
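Cosine similarity over these tf-idf vectors can be computed directly. A minimal sketch using the table's values (vectors ordered as affection, jealous, gossip):

```python
import math

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of magnitudes."""
    dot = sum(ui * vi for ui, vi in zip(u, v))
    norm_u = math.sqrt(sum(ui * ui for ui in u))
    norm_v = math.sqrt(sum(vi * vi for vi in v))
    return dot / (norm_u * norm_v)

# tf-idf vectors over (affection, jealous, gossip) from the table above
D1 = [0.996, 0.087, 0.017]
D2 = [0.993, 0.120, 0.000]
D3 = [0.847, 0.466, 0.254]
print(round(cosine(D1, D2), 3))  # ≈ 0.999: D1 and D2 point in nearly the same direction
print(round(cosine(D1, D3), 3))  # ≈ 0.889: D3 is the least similar to D1
```

Because cosine similarity depends only on vector direction, documents of very different lengths can still score close to 1 if their term proportions match.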
Exercise
• A database collection consists of 1 million documents, of
which 200,000 contain the term holiday while 250,000
contain the term season. A document repeats holiday 7
times and season 5 times. It is known that holiday is
repeated more than any other term in the document.
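The exercise statement stops before posing its question, but the data supports a tf-idf computation. A sketch under explicit assumptions: maximum-frequency tf normalization (tf = f / max_f, suggested by the note that holiday is the most frequent term in the document) and a base-10 idf.

```python
import math

def tfidf(f, max_f, N, df):
    """tf-idf with maximum-frequency tf normalization: (f / max_f) * log10(N / df).

    The normalization scheme and log base are assumptions; the exercise
    only supplies the raw counts.
    """
    return (f / max_f) * math.log10(N / df)

N = 1_000_000
w_holiday = tfidf(7, 7, N, 200_000)  # tf = 7/7 = 1.0, idf = log10(5)
w_season = tfidf(5, 7, N, 250_000)   # tf = 5/7,       idf = log10(4)
print(round(w_holiday, 3), round(w_season, 3))  # 0.699 0.43
```

Under these assumptions holiday outweighs season both because it occurs more often in the document and because it is rarer in the collection.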
Any Questions?
Thank You!