Chapter Two

Text/Document Operations and Automatic Indexing
Document representation
• An IR system/search engine does not scan each
document to check whether it satisfies the query
• Instead, it uses an index to quickly locate the relevant
documents
• Index: a list of concepts with pointers to the
documents that discuss them
– What goes into the index is important
• Document representation: deciding which
concepts should go into the index
Document representation
Two options:
• Controlled vocabulary – a set of manually
constructed concepts that describe the major
topics covered in the collection
• Free-text indexing – the set of individual terms
that occur in the collection
Statistical Properties of Words in a Text
• How is the frequency of different words distributed?
• How fast does vocabulary size grow with the size of a
corpus?
• Such properties of a text collection greatly affect the
performance of an IR system and can be used to select suitable
term weights and other aspects of the system
• Three well-known results describe the statistical properties of
words in text:
– Zipf’s Law: models word distribution in a text corpus
– Luhn’s idea: measures word significance
– Heaps’ Law: shows how vocabulary size grows with the growth
of corpus size
Word Distribution/Frequency
• A few words are very common.
The 2 most frequent words (e.g. “the”, “of”) can
account for about 10% of word occurrences.
• Most words are very rare.
About half the words in a corpus appear only once;
such words are called hapax legomena (Greek for
“said only once”).
Word distribution: Zipf's Law
• Zipf's Law, named after the Harvard linguistics professor
George Kingsley Zipf (1902-1950), attempts to capture the
distribution of the frequencies (i.e. number of occurrences)
of the words within a text
• For all the words in a collection of documents, for each word w
f : the frequency of w
r : the rank of w in order of frequency (the most frequently
occurring word has rank 1, etc.)
[Figure: Zipf’s distribution of sorted word frequencies – frequency f plotted against rank r; a word w with rank r has frequency f according to Zipf’s law.]
Word distribution: Zipf's Law
• Zipf's Law states that when the distinct words in a text are arranged
in decreasing order of their frequency of occurrence (most frequent
words first), the occurrence characteristics of the vocabulary can be
characterized by the constant rank-frequency law of Zipf.
• If the words w in a collection are ranked by their frequency f, with
rank r, they roughly fit the relation:

  f ∝ 1/r,   i.e.   r × f = c

– Different collections have different constants c.
• The table (not reproduced here) shows the most frequently occurring words
from a 336,310-document corpus containing 125,720,891 total words, of
which 508,209 are unique words
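As an illustration, the rank-frequency relation can be checked empirically with a short script. The sketch below is not from the slides; it assumes a plain-text corpus file (the name corpus.txt is hypothetical) and uses a crude regex tokenizer.

```python
# A minimal sketch that checks Zipf's law on a plain-text corpus:
# for each word, rank * frequency should stay roughly constant.
from collections import Counter
import re

def zipf_table(text, top_n=10):
    words = re.findall(r"[a-z]+", text.lower())   # crude tokenization, for illustration only
    freqs = Counter(words).most_common(top_n)     # sorted by frequency (rank 1 first)
    for rank, (word, f) in enumerate(freqs, start=1):
        print(f"{rank:>4} {word:<15} f={f:<8} r*f={rank * f}")

# Usage (assumes a local file 'corpus.txt' exists -- a hypothetical path):
# zipf_table(open("corpus.txt", encoding="utf-8").read())
```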
Explanations for Zipf’s Law
• The law has been explained by “principle of least effort”
which makes it easier for a speaker or writer of a language
to repeat certain words instead of coining new and
different words.
– Zipf’s explanation was his “principle of least effort”, which
balances the speaker’s desire for a small vocabulary against the
hearer’s desire for a large one.
Word significance: Luhn’s Ideas
• Luhn’s idea (1958): the frequency of word occurrence in a
text furnishes a useful measurement of word significance
• Luhn suggested that both extremely common and extremely
uncommon words are not very useful for indexing
• For this, Luhn specifies two cutoff points: an upper and a
lower cutoff, based on which non-significant words are
excluded
– Words above the upper cutoff are considered too common
– Words below the lower cutoff are considered too rare
– Hence neither contributes significantly to the content of the text
– The ability of words to discriminate content reaches a peak at a
rank-order position halfway between the two cutoffs
• Let f be the frequency of occurrence of words in a text, and r their
rank in decreasing order of word frequency; then a plot relating f and r
yields the following curve
Luhn’s Ideas
• Luhn (1958) suggested that both extremely common and extremely
uncommon words were not very useful for document representation and
indexing

[Figure: frequency-vs-rank curve with Luhn’s upper and lower cutoffs; word significance peaks between the two cutoffs.]
Text Operations
• Not all words in a document are equally significant for representing the
contents/meaning of the document
– Some words carry more meaning than others
– Nouns are typically the most representative of a document's content
• Therefore, the text of the documents in a collection needs to be
preprocessed before being used for index terms
• Using the set of all words in a collection to index documents
creates too much noise for the retrieval task
– Reducing noise means reducing the set of words that can be used to refer to a
document
• Preprocessing is the process of controlling the size of the
vocabulary, i.e. the number of distinct words used as index terms
– Preprocessing generally leads to an improvement in information retrieval
performance
• However, some search engines on the Web omit preprocessing
– Every word in the document is an index term
Text Operations
• Text operations are the process of transforming text into
logical representations
• 5 main operations for selecting index terms, i.e. for choosing
words/stems (or groups of words) to be used as indexing
terms:
– Tokenization of the text: generate a set of words from the text
collection
– Elimination of stop words - filter out words which are not useful
in the retrieval process
– Normalization – bringing tokens to one form – e.g. downcasing
– Stemming words - remove affixes (prefixes and suffixes) and
group together word variants with similar meaning
– Construction of term categorization structures such as a thesaurus,
to capture term relationships and allow expansion of the original
query with related terms
Tokenization
• Tokenization is one of the steps used to convert the text of the
documents into a sequence of words w1, w2, …, wn to be
adopted as index terms
– It is the process of demarcating and possibly classifying
sections of a string of input characters into words
– For example,
• The quick brown fox jumps over the lazy dog
• Objective - identify the words in the text
– What counts as a word?
• Is it any sequence of alphabetic, numeric, or alphanumeric characters?
– How do we identify the set of words that exist in a text
document?
• Tokenization issues
– numbers, hyphens, punctuation marks, apostrophes …
Issues in Tokenization
• One word or multiple: How to handle special cases involving hyphens,
apostrophes, punctuation marks, etc.? C++, C#, URLs, e-mail, …
– Sometimes punctuation (e-mail), numbers (1999), and case (Republican vs.
republican) can be a meaningful part of a token.
– However, frequently they are not.
• Two words may be connected by hyphens.
– Should two words connected by a hyphen be taken as one token or two?
Break up the hyphenated sequence into two tokens?
• In most cases the hyphen is broken up (e.g. state-of-the-art → state of the
art), but some words, e.g. MS-DOS, B-49, are unique terms that require
the hyphen
• Two words may be connected by punctuation marks.
– Punctuation marks: remove entirely unless significant, e.g. program code:
x.exe vs. xexe. What about Kebede’s, www.command.com?
• Two words (a phrase) may be separated by a space.
– E.g. Addis Ababa, San Francisco, Los Angeles
• The same term may be written in different ways
– lowercase, lower-case, lower case? data base, database, data-base?
Tokenization
• Analyze text into a sequence of discrete tokens (words)
• Input: “Friends, Romans and Countrymen”
• Output: Tokens (a token is an instance of a sequence of characters
that is grouped together as a useful semantic unit for processing)
– Friends
– Romans
– and
– Countrymen
• Each such token is now a candidate for an index entry,
after further processing
• But what are valid tokens to emit?
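As a concrete illustration (a simplified sketch only, which does not handle the hyphen/apostrophe cases discussed above), a regex-based tokenizer might look like this in Python:

```python
# Minimal tokenization sketch: split raw text into word tokens,
# keeping simple alphanumeric sequences and dropping punctuation.
import re

def tokenize(text):
    # \w+ keeps letters, digits and underscores; punctuation is discarded
    return re.findall(r"\w+", text.lower())

print(tokenize("Friends, Romans and Countrymen"))
# ['friends', 'romans', 'and', 'countrymen']
```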
Exercise: Tokenization
• The cat slept peacefully in the living room. It’s
a very old cat.
• The instructor (Dr. O’Neill) thinks that the
boys’ stories about Chile’s capital aren’t
amusing.
Stopword Removal
• A stopword is a term that is discarded from the document
representation
• Stopwords: words that we ignore because we expect them not to be
useful in distinguishing between relevant and non-relevant documents
for any query
• Stopwords are extremely common words across document
collections that have no discriminatory power
• Assumption: stopwords are unimportant because they are frequent in
every document
– They may occur in 80% of the documents in a collection.
– They appear to be of little value in helping select documents matching a user need
and need to be filtered out as potential index terms
• Stopwords are typically function words:
– Examples of stopwords are articles, prepositions, conjunctions, etc.:
• articles (a, an, the); pronouns (I, he, she, it, their, his)
• Some prepositions (on, of, in, about, besides, against), conjunctions/connectors (and, but,
for, nor, or, so, yet), verbs (is, are, was, were), adverbs (here, there, out, because, soon,
after) and adjectives (all, any, each, every, few, many, some) can also be treated as
stopwords
• Stopwords are language dependent
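A minimal stopword-removal sketch; the stopword set below is a tiny illustrative sample, not a standard list:

```python
# Drop tokens that appear in a predefined stopword list.
STOPWORDS = {"a", "an", "the", "of", "in", "and", "is", "it", "to"}

def remove_stopwords(tokens):
    return [t for t in tokens if t not in STOPWORDS]

print(remove_stopwords(["the", "cat", "slept", "in", "the", "living", "room"]))
# ['cat', 'slept', 'living', 'room']
```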
How to detect a stopword?
• One method: Sort terms (in decreasing order) by
document frequency (DF) and take the most frequent
ones, based on a cutoff point
– In a collection about insurance practices, “insurance”
would be a stopword
• Another method: Build a stopword list that contains
a set of articles, pronouns, etc.
– Why do we need stop lists: With a stop list, we can
compare against it and exclude the commonest words
from the index terms entirely.
– Can you identify common words in Amharic and build a
stop list?
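A sketch of the first method above, assuming a toy tokenized collection: count the document frequency of each term and take the top-k as candidate stopwords.

```python
# Rank terms by document frequency (DF) and return the most frequent ones.
from collections import Counter

def candidate_stopwords(tokenized_docs, k=3):
    df = Counter()
    for doc in tokenized_docs:
        df.update(set(doc))          # count each term at most once per document
    return [term for term, _ in df.most_common(k)]

docs = [["the", "insurance", "policy"], ["the", "insurance", "claim"],
        ["the", "new", "insurance", "rates"]]
print(candidate_stopwords(docs))      # e.g. ['the', 'insurance', ...]
```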
Trends in Stopwords
• Stopword elimination used to be standard in older IR
systems. But the trend is away from doing this nowadays.
• Most web search engines index stopwords:
– Good query optimization techniques mean you pay little at query
time for including stop words.
– You need stopwords for:
• Phrase queries: “King of Denmark”
• Various song titles, etc.: “Let it be”, “To be or not to be”
• “Relational” queries: “flights to London”
– Elimination of stopwords might reduce recall (e.g. “To be or not
to be” – all eliminated except “be” – no or irrelevant retrieval)
Normalization
• Normalization is canonicalizing tokens so that matches occur
despite superficial differences in the character
sequences of the tokens
– Need to “normalize” terms in the indexed text as well as query
terms into the same form
– Example: We want to match U.S.A. and USA, by deleting
periods in a term
• Case Folding: Often best to lowercase everything,
since users will use lowercase regardless of ‘correct’
capitalization…
– Republican vs. republican
– Fasil vs. fasil vs. FASIL
– Anti-discriminatory vs. antidiscriminatory
– Car vs. Automobile?
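A small normalization sketch for the examples above, assuming that lowercasing plus deleting periods is sufficient (real systems apply more rules):

```python
# Canonicalize a token: case folding plus period removal.
def normalize(token):
    return token.lower().replace(".", "")

print(normalize("U.S.A."))       # 'usa'
print(normalize("Republican"))   # 'republican'
```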
Stemming/Morphological analysis
• Stemming reduces tokens to their “root” form in order to
recognize morphological variation.
– The process involves removal of affixes (i.e. prefixes and suffixes)
with the aim of reducing variants to the same stem
– It often removes the inflectional and derivational morphology of a word
• Inflectional morphology: varies the form of a word in order to
express grammatical features, such as singular/plural or
past/present tense. E.g. boy → boys, cut → cutting.
• Derivational morphology: makes new words from old ones.
E.g. creation is formed from create, but they are two separate
words. Likewise, destruction → destroy.
• Compounding: combining words to form new ones, e.g.
beefsteak
• Stemming is language dependent
Stemming
• The final output of a conflation algorithm is a
set of classes, one for each stem detected.
– A stem: the portion of a word which is left after
the removal of its affixes (i.e., prefixes and/or
suffixes).
– Example: ‘connect’ is the stem for {connected,
connecting, connection, connections}
– Thus, {automate, automatic, automation} all
reduce to the stem ‘automat’
Ways to implement stemming
There are basically two ways to implement stemming:
– The first approach is to create a big dictionary that
maps words to their stems
– The second approach is to use a set of rules that
extract stems from words
But, since stemming is imperfectly defined anyway,
occasional mistakes are tolerable, and the
rule-based approach is the one that is generally
chosen
Porter Stemmer
• Stemming is the operation of stripping the suffixes from a
word, leaving its stem
– Google, for instance, uses stemming to search for web pages
containing the words connected, connecting, connection and
connections when users ask for a web page that contains the word
connect.
• In 1979, Martin Porter developed a stemming algorithm that
uses a set of rules to extract stems from words, and though it
makes some mistakes, most common words seem to work out
right
– Porter describes his algorithm and provides a reference
implementation in C at
https://tartarus.org/~martin/PorterStemmer/index.html
Porter stemmer
• Most common algorithm for stemming English words to
their common grammatical root
• It is a simple procedure for removing known affixes in
English without using a dictionary. To get rid of plurals,
the following rules are used:
– SSES → SS      caresses → caress
– IES → I        ponies → poni
– SS → SS        caress → caress
– S →            cats → cat
– EMENT →        (delete final “ement” if what remains is longer than
1 character)
replacement → replac
cement → cement
Porter stemmer
• The Porter stemmer works in steps.
– While step 1a gets rid of plurals -s and -es, step 1b removes
-ed or -ing.
– e.g.
agreed -> agree        disabled -> disable
matting -> mat         mating -> mate
meeting -> meet        milling -> mill
messing -> mess        meetings -> meet
feed -> feed
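The sketch below implements only the step 1a plural rules listed earlier, as an illustration of rule-based stemming; the full Porter algorithm has several more steps (see Porter's reference implementation).

```python
# Step 1a of the Porter stemmer only: strip common plural suffixes.
def porter_step_1a(word):
    if word.endswith("sses"):
        return word[:-2]          # caresses -> caress
    if word.endswith("ies"):
        return word[:-2]          # ponies -> poni
    if word.endswith("ss"):
        return word               # caress -> caress
    if word.endswith("s"):
        return word[:-1]          # cats -> cat
    return word

for w in ["caresses", "ponies", "caress", "cats"]:
    print(w, "->", porter_step_1a(w))
```

If the NLTK library is available, nltk.stem.PorterStemmer provides a complete implementation of the algorithm.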
Stemming: challenges
• May produce unusual stems that are not English words:
– Removing ‘UAL’ from FACTUAL and EQUAL
• May conflate (reduce to the same token) words
that are actually distinct.
– “computer”, “computational”, “computation” are all
reduced to the same token “comput”
• May not recognize all morphological derivations.
Term weighting and similarity measures
Term Weighting
• The terms of a document are not equally
useful for describing the document contents
• There are properties of an index term which
are useful for evaluating the importance of
the term in a document
– For instance, a word which appears in all
documents of a collection is completely useless
for retrieval tasks

Why use term weighting?
• Binary weights are too limiting
– Terms are either present or absent
– They do not allow ordering documents according to their level of
relevance for a given query
• Non-binary weights allow us to model partial matching
– Partial matching allows retrieval of documents that
approximate the query
• Term weighting helps to apply best matching, which
improves the quality of the answer set
– Term weighting enables ranking of retrieved documents,
so that the best matching documents are ordered at the top as
they are more relevant than others.
Binary Weights
• Only the presence (1) or absence (0) of a term is
included in the vector
• The binary formula gives every word that appears in a
document equal relevance
• It can be useful when frequency is not important
• Binary weights formula:

  wij = 1 if freqij > 0
  wij = 0 if freqij = 0

• Example term-document matrix (binary weights):

docs  t1  t2  t3
D1     1   0   1
D2     1   0   0
D3     0   1   1
D4     1   0   0
D5     1   1   1
D6     1   1   0
D7     0   1   0
D8     0   1   0
D9     0   0   1
D10    0   1   1
D11    1   0   1
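A tiny sketch of building such a binary term-document matrix from tokenized documents (toy data assumed):

```python
# 1 if the term occurs in the document at least once, else 0.
def binary_weights(docs, vocabulary):
    return [[1 if term in doc else 0 for term in vocabulary] for doc in docs]

docs = [["t1", "t3"], ["t1"], ["t2", "t3"]]
print(binary_weights(docs, ["t1", "t2", "t3"]))
# [[1, 0, 1], [1, 0, 0], [0, 1, 1]]   (matches D1-D3 above)
```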
Term Weighting: Term Frequency (TF)
• TF (term frequency): count the number of times a term occurs in a
document
  fij = frequency of term i in document j
• The more times a term t occurs in document d, the more likely it is
that t is relevant to the document, i.e. more indicative of the topic
– If used alone, it favors common words and long documents
– It gives too much credit to words that appear more frequently
• May want to normalize term frequency (tf)
• Example term-document matrix (raw term frequencies):

docs  t1  t2  t3
D1     2   0   3
D2     1   0   0
D3     0   4   7
D4     3   0   0
D5     1   6   3
D6     3   5   0
D7     0   8   0
D8     0  10   0
D9     0   0   1
D10    0   3   5
D11    4   0   1
Inverse Document Frequency (IDF)
• IDF measures the rarity of a term in the collection
• The IDF is a measure of the general importance of the term
– It inverts the document frequency
• It diminishes the weight of terms that occur very
frequently in the collection and increases the weight of
terms that occur rarely.
– Gives full weight to terms that occur in one document only.
– Gives lowest weight to terms that occur in all documents.
– Terms that appear in many different documents are less
indicative of the overall topic.
• idfi = inverse document frequency of term i
      = log2(N / dfi)   (N: total number of documents)
Inverse Document Frequency
• E.g.: given a collection of 1000 documents and the document
frequency of each word, compute the IDF for each word:

Word    N     DF    IDF
the     1000  1000  0
some    1000  100   3.322
car     1000  10    6.644
merge   1000  1     9.966

• IDF provides high values for rare words and low values for common words
• IDF is an indication of a term’s discrimination power.
• The log is used to dampen the effect relative to tf.
• What is the difference between document frequency and corpus (total) frequency?
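A short sketch reproducing the IDF table above, assuming N = 1000 documents:

```python
# idf_i = log2(N / df_i)
import math

def idf(N, df):
    return math.log2(N / df)

for word, df in [("the", 1000), ("some", 100), ("car", 10), ("merge", 1)]:
    print(f"{word:<6} df={df:<5} idf={idf(1000, df):.3f}")
# the: 0.000, some: 3.322, car: 6.644, merge: 9.966
```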
Exercise
• Calculate the idf value for each term, supposing N = 1 million documents:

term          dft        idft
computer      1
information   100
storage       1,000
retrieval     10,000
system        100,000
science       1,000,000

  idft = log2(N / dft)
TF*IDF Weighting
• The most commonly used term-weighting scheme is tf*idf:

  wij = tfij × idfi = tfij × log2(N / dfi)

• A term occurring frequently in the document but
rarely in the rest of the collection is given high weight.
– The tf-idf value for a term is always greater than or equal to zero.
• Experimentally, tf*idf has been found to work well.
– It is often used in the vector space model together with
cosine similarity to determine the similarity between
two documents.
TF*IDF weighting
• When does TF*IDF register a high weight? When a term t
occurs many times within a small number of documents
– The highest tf*idf values arise when a term has a high term frequency (in
the given document) and a low document frequency (in the whole collection
of documents)
– The weights hence tend to filter out common terms
– Thus lending high discriminating power to those documents
• A lower TF*IDF is registered when the term occurs
fewer times in a document, or occurs in many documents
– Thus offering a less pronounced relevance signal
• The lowest TF*IDF is registered when the term occurs in
virtually all documents
Computing TF-IDF: An Example
• Assume the collection contains 10,000 documents and
statistical analysis shows that:
• the document frequencies (DF) of three terms are: A(50),
B(1300), C(250), and the term frequencies (TF) of these
terms are: A(3), B(2), C(1)
• Compute TF*IDF for each term:
– A: tf = 3/3 = 1.00; idf = log2(10000/50) = 7.644;   tf*idf = 7.644
– B: tf = 2/3 = 0.67; idf = log2(10000/1300) = 2.943; tf*idf = 1.962
– C: tf = 1/3 = 0.33; idf = log2(10000/250) = 5.322;  tf*idf = 1.774
• The query vector is typically treated as a document and also tf-idf weighted.
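A sketch reproducing the worked example above, assuming tf is normalized by the maximum raw term frequency in the document and idf = log2(N/df):

```python
# tf*idf with max-tf normalization, N = 10,000 documents assumed.
import math

N = 10_000
raw_tf = {"A": 3, "B": 2, "C": 1}
df = {"A": 50, "B": 1300, "C": 250}

max_tf = max(raw_tf.values())
for term in raw_tf:
    tf = raw_tf[term] / max_tf
    weight = tf * math.log2(N / df[term])
    print(f"{term}: tf={tf:.2f} tf*idf={weight:.3f}")
# A: 7.644, B: 1.962, C: 1.774 (as on the slide)
```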
More Example
• Consider a document containing 100 words in which the
word cow appears 3 times. Now, assume we have 10
million documents and cow appears in one thousand of these.
– The term frequency (TF) for cow:
  3/100 = 0.03
– The inverse document frequency is
  log2(10,000,000 / 1,000) = 13.288
– The TF*IDF score is the product of these:
  0.03 × 13.288 ≈ 0.399
Exercise
• Let C = number of times a given word appears in a document;
• TW = total number of words in a document;
• TD = total number of documents in the corpus; and
• DF = total number of documents containing a given word.
• Compute the TF, IDF and TF*IDF score for each term:

Word       C   TW  TD  DF   TF   IDF   TFIDF
airplane   5   46   3   1
blue       1   46   3   1
chair      7   46   3   3
computer   3   46   3   1
forest     2   46   3   1
justice    7   46   3   3
love       2   46   3   1
might      2   46   3   1
perl       5   46   3   2
rose       6   46   3   3
shoe       4   46   3   1
thesis     2   46   3   2
Exercise
• Given the raw term frequencies of the terms bullet, dish, distance, rabbit,
record and roast across documents D1–D5 (bullet: 1, 1; dish: 1, 1;
distance: 3, 2, 2, 1; rabbit: 1, 1; record: 1, 1, 1; roast: –):
• Calculate the normalized TF and IDF for each term
• Compute the TF*IDF value for each term in each document
Similarity Measure
• We now have vectors for all documents in the
collection and a vector for the query; how do we
compute similarity?
• A similarity measure is a function that
computes the degree of similarity (or distance)
between a document vector and the query vector.
• Using a similarity measure between the query
and each document:
– It is possible to rank the retrieved documents in
the order of presumed relevance.
– It is possible to enforce a certain threshold so
that the size of the retrieved set can be controlled.

[Figure: documents D1, D2 and query Q plotted as vectors in the term space t1, t2, t3.]
Similarity Measure: Techniques
• Euclidean distance
– It is the most common distance measure.
– Euclidean distance examines the root of the squared differences
between the coordinates of a pair of document and query vectors.
• Dot product
– The dot product is also known as the scalar product or
inner product
– The dot product is defined as the sum of the products of the
corresponding components of the query and document vectors
• Cosine similarity (or normalized inner product)
– It projects document and query vectors into the term space and
calculates the cosine of the angle between them.
Euclidean Distance
• Similarity between the vectors for document dj and
query q can be computed as:

  sim(dj, q) = |dj − q| = √( Σi=1..n (wij − wiq)² )

– where wij is the weight of term i in document j
and wiq is the weight of term i in the query
– Example: Determine the Euclidean distance between
the document vector (0, 3, 2, 1, 10) and the query
vector (2, 7, 1, 0, 0). (0 means the corresponding term is not
found in the document or the query.)

  √( (0−2)² + (3−7)² + (2−1)² + (1−0)² + (10−0)² ) = √122 ≈ 11.05
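A sketch reproducing the Euclidean-distance example above:

```python
# Euclidean distance between two weight vectors of equal length.
import math

def euclidean(d, q):
    return math.sqrt(sum((wd - wq) ** 2 for wd, wq in zip(d, q)))

print(euclidean([0, 3, 2, 1, 10], [2, 7, 1, 0, 0]))   # ~11.045
```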
Exercise
• A collection consisting of three “documents” (D = 3) is searched
for the query [gold silver truck].
D1 = Shipment of gold damaged in a fire.
D2 = Delivery of silver arrived in a silver truck.
D3 = Shipment of gold arrived in a truck.
• Index terms with raw counts and idf values:

Index terms  Q  D1  D2  D3  Df  Idf
arrive       0   0   1   1   2  0.176
damage       0   1   0   0   1  0.477
deliver      0   0   1   0   1  0.477
fire         0   1   0   0   1  0.477
gold         1   1   0   1   2  0.176
silver       1   0   2   0   1  0.477
shipment     0   1   1   1   3  0
truck        1   0   1   1   2  0.176
Exercise
• Index terms with tf*idf weights:

Index terms  Q      D1     D2     D3
arrive       0      0      0.176  0.176
damage       0      0.477  0      0
deliver      0      0      0.477  0
fire         0      0.477  0      0
gold         0.176  0.176  0      0.176
silver       0.477  0      0.954  0
shipment     0      0      0      0
truck        0.176  0      0.176  0.176

1. Compute the Euclidean distance between each
document vector and the query vector
2. Rank the documents according to the result
Inner Product
• Similarity between the vectors for document dj and query q
can be computed as the vector inner product:

  sim(dj, q) = dj • q = Σi=1..n wij × wiq

– where wij is the weight of term i in document j and
wiq is the weight of term i in the query q
• For binary vectors, the inner product is the number of
matched query terms in the document (the size of the intersection)
• For weighted term vectors, it is the sum of the products of
the weights of the matched terms
Inner Product -- Examples
• Binary weights:
– Size of vector = size of vocabulary = 7
– sim(D, Q) = 3

     Retrieval  Database  Term  Computer  Text  Manage  Data
D        1         1       1       0       1      1      0
Q        1         0       1       0       0      1      1

• Term weighted:

     Retrieval  Database  Architecture
D1       2         3          5
D2       3         7          1
Q        1         0          2
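A sketch reproducing both inner-product examples above (the weighted results 12 and 5 are computed here, not given on the slide):

```python
# Inner (dot) product of two weight vectors.
def inner_product(d, q):
    return sum(wd * wq for wd, wq in zip(d, q))

# Binary example: sim(D, Q) = number of matched terms
print(inner_product([1, 1, 1, 0, 1, 1, 0], [1, 0, 1, 0, 0, 1, 1]))  # 3

# Weighted example (D1 and D2 against Q)
print(inner_product([2, 3, 5], [1, 0, 2]))  # 12
print(inner_product([3, 7, 1], [1, 0, 2]))  # 5
```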
Inner Product: Example 1
[Figure: documents d1–d7 positioned in the space of index terms k1, k2, k3.]

      k1  k2  k3   q • dj
d1     1   0   1     2
d2     1   0   0     1
d3     0   1   1     2
d4     1   0   0     1
d5     1   1   1     3
d6     1   1   0     2
d7     0   1   0     1

q      1   1   1
Inner Product: Exercise
[Figure: documents d1–d7 positioned in the space of index terms k1, k2, k3.]

      k1  k2  k3   q • dj
d1     1   0   1     ?
d2     1   0   0     ?
d3     0   1   1     ?
d4     1   0   0     ?
d5     1   1   1     ?
d6     1   1   0     ?
d7     0   1   0     ?

q      1   2   3
Exercise
• Index terms with tf*idf weights:

Index terms  Q      D1     D2     D3
arrive       0      0      0.176  0.176
damage       0      0.477  0      0
deliver      0      0      0.477  0
fire         0      0.477  0      0
gold         0.176  0.176  0      0.176
silver       0.477  0      0.954  0
shipment     0      0      0      0
truck        0.176  0      0.176  0.176

1. Compute the inner product between each document
vector and the query vector
2. Rank the documents according to the result
3. Repeat the computation with normalized tf and compare with the
unnormalized tf
Cosine Similarity
• Measures the similarity between dj and q, captured by the
cosine of the angle between them:

  sim(dj, q) = (dj • q) / (|dj| |q|)
             = Σi=1..n wij·wiq / ( √(Σi=1..n wij²) × √(Σi=1..n wiq²) )

• Or, between two documents dj and dk:

  sim(dj, dk) = (dj • dk) / (|dj| |dk|)
              = Σi=1..n wij·wik / ( √(Σi=1..n wij²) × √(Σi=1..n wik²) )

• The denominator involves the lengths of the vectors
– So the cosine measure is also known as the normalized inner product

  Length |dj| = √( Σi=1..n wij² )
Example: Computing Cosine Similarity
• Say we have the query vector Q = (0.4, 0.8) and the
document vector D2 = (0.2, 0.7). Compute their similarity
using the cosine measure:

  sim(Q, D2) = ((0.4 × 0.2) + (0.8 × 0.7))
               / √( [(0.4)² + (0.8)²] × [(0.2)² + (0.7)²] )
             = 0.64 / √0.424 ≈ 0.64 / 0.651 ≈ 0.98
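A sketch reproducing the cosine computation above (the second call previews D1 from the next slide):

```python
# Cosine similarity: dot product divided by the product of vector lengths.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

print(cosine([0.4, 0.8], [0.2, 0.7]))   # ~0.983
print(cosine([0.4, 0.8], [0.8, 0.3]))   # ~0.733
```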
Example: Computing Cosine Similarity
• Say we have two documents in our corpus, D1 = (0.8, 0.3)
and D2 = (0.2, 0.7). Given the query vector Q = (0.4, 0.8),
determine which document is the most relevant for the query.

  cos θ1 = sim(Q, D1) ≈ 0.74
  cos θ2 = sim(Q, D2) ≈ 0.98

• Since cos θ2 > cos θ1, D2 is more similar to the query than D1.

[Figure: Q, D1 and D2 plotted as 2-d vectors; θ1 is the angle between Q and D1, θ2 the angle between Q and D2.]
Exercise
• Given three documents D1, D2 and D3 with the
corresponding TF*IDF weights below,
• which documents are most similar to each other under the three
similarity measures?

Terms      D1     D2     D3
affection  0.996  0.993  0.847
jealous    0.087  0.120  0.466
gossip     0.017  0.000  0.254
Exercise
• A database collection consists of 1 million documents, of
which 200,000 contain the term holiday while 250,000
contain the term season. A document repeats holiday 7
times and season 5 times. It is known that holiday is
repeated more than any other term in the document.
• Calculate the weight of both terms in this document using
the different term weighting methods:
(i) normalized and unnormalized TF;
(ii) TF*IDF based on normalized and unnormalized TF
Any Questions?

Thank You!
