2_text operation
Word distribution: Zipf's Law
• Zipf's Law states that when the distinct words in a text are
arranged in decreasing order of their frequency of occurrence (most
frequent words first), the occurrence characteristics of the vocabulary
can be characterized by the constant rank-frequency law of Zipf.
• If the words, w, in a collection are ranked, r, by their frequency, f,
they roughly fit the relation:
r * f = c
– Different collections have different constants c.
• The table shows the most frequently occurring words from a 336,310-document corpus
containing 125,720,891 total words, of which 508,209 are unique words.
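As a minimal sketch of the relation r * f = c, the following Python snippet counts the words in a text, ranks them by decreasing frequency, and reports the product r * f for each rank. The function name `rank_frequency`, the regular-expression tokenization, and the toy text are illustrative assumptions and are not taken from the corpus described above.

```python
import re
from collections import Counter

def rank_frequency(text):
    """Return (rank, word, frequency, rank * frequency) tuples, most frequent first."""
    # Assumption: lowercase the text and treat runs of letters/apostrophes as words.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    # most_common() sorts by decreasing frequency; ties keep first-seen order.
    return [
        (rank, word, freq, rank * freq)
        for rank, (word, freq) in enumerate(counts.most_common(), start=1)
    ]

sample = "the cat and the dog and the bird"  # toy text, far too small for a real fit
for rank, word, freq, product in rank_frequency(sample):
    print(rank, word, freq, product)
```

On a text this small the products are not expected to be constant; the relation is only a rough fit, and it is normally examined on large collections.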
Another Example: Zipf’s Law
• Illustration of Rank-Frequency Law. Let the total number of word
occurrences in the sample N = 1,000,000
Rank (R)   Term   Frequency (F)   R*(F/N)
 1         the    69,971          0.070
 2         of     36,411          0.073
 3         and    28,852          0.086
 4         to     26,149          0.104
 5         a      23,237          0.116
 6         in     21,341          0.128
 7         that   10,595          0.074
 8         is     10,099          0.081
 9         was     9,816          0.088
10         he      9,543          0.095
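The last column of the table can be recomputed directly from the rank and frequency columns. The sketch below does that arithmetic in Python; the rows and N are copied from the table, while the helper name `zipf_products` is a hypothetical choice for illustration.

```python
N = 1_000_000  # total word occurrences in the sample, as stated above

# (rank, term, frequency) rows copied from the table
rows = [
    (1, "the", 69971), (2, "of", 36411), (3, "and", 28852),
    (4, "to", 26149), (5, "a", 23237), (6, "in", 21341),
    (7, "that", 10595), (8, "is", 10099), (9, "was", 9816),
    (10, "he", 9543),
]

def zipf_products(rows, total):
    """Return R * (F / N) for each (rank, term, frequency) row."""
    return [rank * freq / total for rank, _term, freq in rows]

products = zipf_products(rows, N)
for (rank, term, freq), product in zip(rows, products):
    # Printed values are rounded to three decimals.
    print(f"{rank:>2} {term:<5} {freq:>6} {product:.3f}")
```

Because the printed values are rounded, the last digit may differ slightly from the table in a few rows. The products stay within roughly 0.07 to 0.13, which illustrates that the relation holds only approximately.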
Methods that Build on Zipf's Law
• Stop lists:
– Ignore the most frequent words (upper cut-off).
Used by almost all systems.
• Significant words:
– Take the words between the most frequent (upper
cut-off) and the least frequent (lower cut-off).
• Term weighting:
– Give differing weights to terms based on their
frequency, with the most frequent words weighted
less. Used by almost all ranking methods.
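A stop list and a frequency-based weight can be sketched in a few lines of Python. Everything named here is an assumption made for illustration: the helpers `build_stop_list` and `frequency_weights`, the choice of the top k words as the upper cut-off, and the weighting formula log(1 + max_freq / freq), which was picked only because it decreases as frequency grows.

```python
import math
from collections import Counter

def build_stop_list(counts, k):
    """Return the k most frequent words as a stop list (upper cut-off by rank)."""
    return {word for word, _freq in counts.most_common(k)}

def frequency_weights(counts):
    """Assign each word a weight that shrinks as its frequency grows.

    Assumption: weight = log(1 + max_freq / freq); illustrative only.
    """
    max_freq = max(counts.values())
    return {word: math.log(1 + max_freq / freq) for word, freq in counts.items()}

tokens = "the cat sat on the mat and the dog sat".split()  # toy input
counts = Counter(tokens)

stop_list = build_stop_list(counts, k=1)
weights = frequency_weights(counts)
kept = [token for token in tokens if token not in stop_list]

print(stop_list)
print(kept)
# Words listed from lowest weight (most frequent) to highest weight.
print(sorted(weights.items(), key=lambda item: item[1]))
```

In this toy input "the" is the most frequent word, so it lands in the stop list and also receives the lowest weight.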
Word significance: Luhn’s Ideas
• Luhn’s idea (1958): the frequency of word occurrence in a
text furnishes a useful measurement of word significance.
• Luhn suggested that both extremely common and extremely
uncommon words were not very useful for indexing.
• For this, Luhn specifies two cutoff points, an upper and a
lower cutoff, based on which non-significant words are
excluded:
– Words exceeding the upper cutoff are considered to be
too common
– Words below the lower cutoff are considered to be too rare
– Neither group contributes significantly to the content of the
text
– The ability of words to discriminate content reaches a peak at a
rank-order position halfway between the two cutoffs
• Let f be the frequency of occurrence of words in a text, and r their
rank in decreasing order of word frequency; then a plot relating f and r
yields the following curve
Luhn’s Ideas
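A minimal sketch of Luhn’s two cutoffs, assuming both are expressed as word frequencies: words above the upper cutoff are dropped as too common, words below the lower cutoff as too rare, and the remainder are kept as significant. The function name `significant_words` and the cutoff values used in the example are arbitrary assumptions; the source does not give specific values.

```python
from collections import Counter

def significant_words(tokens, upper_cutoff, lower_cutoff):
    """Keep words whose frequency lies between the two cutoffs (inclusive).

    Words with frequency above upper_cutoff are treated as too common,
    words with frequency below lower_cutoff as too rare.
    """
    counts = Counter(tokens)
    return {
        word: freq
        for word, freq in counts.items()
        if lower_cutoff <= freq <= upper_cutoff
    }

# Toy input: "the" occurs 6 times, "index" and "term" 3 times,
# "luhn" twice, and "rare" once.
tokens = ("the " * 6 + "index " * 3 + "term " * 3 + "luhn " * 2 + "rare").split()

# Hypothetical cutoffs chosen only for this example.
print(significant_words(tokens, upper_cutoff=4, lower_cutoff=2))
```

With these cutoffs, "the" is excluded as too common and "rare" as too rare, leaving "index", "term", and "luhn" as the significant words.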