
Search Engines

Information Retrieval in Practice

All slides ©Addison Wesley, 2008


Processing Text
• Converting documents to index terms
• Why?
– Matching the exact string of characters typed by
the user is too restrictive
• i.e., it doesn’t work very well in terms of effectiveness
– Not all words are of equal value in a search
– Sometimes not clear where words begin and end
• Not even clear what a word is in some languages
– e.g., Chinese, Korean
Text Statistics
• A huge variety of words is used in text, but many
statistical characteristics of word
occurrences are predictable
– e.g., distribution of word counts
• Retrieval models and ranking algorithms
depend heavily on statistical properties of
words
– e.g., important words occur often in documents
but are not high frequency in collection
Zipf’s Law
• Distribution of word frequencies is very
skewed
– a few words occur very often, many words hardly
ever occur
– e.g., two most common words (“the”, “of”) make
up about 10% of all word occurrences in text
documents
Zipf’s Law
• The frequency of any word is inversely proportional to
its rank in the frequency table:
– the most frequent word occurs approximately twice as
often as the second most frequent word, three times as
often as the third most frequent word, etc.
• For example, in the Brown Corpus (slightly over 1 million
words), "the" is the most frequent word and by itself
accounts for nearly 7% of all word occurrences (69,971).
• True to Zipf's Law, the second-place word "of" accounts for
slightly over 3.5% of words (36,411 occurrences), followed
by "and" (28,852).
Zipf’s Law

[figure slide]
Vocabulary Growth
• As corpus grows, so does vocabulary size
– Fewer new words when corpus is already large
• Observed relationship (Heaps’ Law):

v = k · n^β

where v is the vocabulary size (number of unique words),
n is the number of words in the corpus, and k, β are
parameters that vary for each corpus (typical values:
10 ≤ k ≤ 100 and β ≈ 0.5)
AP89 Example

[table slide: Heaps’ Law parameters k and β fit to the
AP89 collection]
Heaps’ Law Predictions
• Predictions for TREC collections are accurate
for large numbers of words
– e.g., first 10,879,522 words of the AP89 collection
scanned
– prediction is 100,151 unique words
– actual number is 100,024
• Predictions for small numbers of words (i.e.
< 1000) are much worse
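A minimal sketch of the Heaps' Law prediction above. The parameter values are an assumption (k ≈ 62.95, β ≈ 0.455, the AP89 fit reported in the accompanying textbook); they reproduce the predicted count to within a fraction of a percent.

```python
# Heaps' Law: predicted vocabulary size v = k * n**beta.
# k and beta below are assumed AP89 fits, not values from this slide.
def heaps(n, k=62.95, beta=0.455):
    return k * n ** beta

n = 10_879_522            # words scanned from the AP89 collection
print(round(heaps(n)))    # ~100,000 -- close to the predicted
                          # 100,151 (actual observed: 100,024)
```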
GOV2 (Web) Example

[figure slide: vocabulary growth for the GOV2 collection]
Web Example
• Heaps’ Law works with very large corpora
– new words still occurring even after seeing 30 million!
– parameter values different from typical TREC values
• New words come from a variety of sources
– spelling errors, invented words (e.g., product and
company names), code, other languages, email
addresses, etc.
• Search engines must deal with these large and
growing vocabularies
Estimating Result Set Size

• How many pages contain all of the query terms?

• Assuming that terms occur independently, for the
query “a b c”:

f_abc = N · (f_a / N) · (f_b / N) · (f_c / N) = (f_a · f_b · f_c) / N²

where
– f_abc is the estimated size of the result set
– f_a, f_b, f_c are the number of documents that terms
a, b, and c occur in
– N is the number of documents in the collection
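A minimal sketch of this estimate; the term frequencies below are made-up illustrative values, and N is the GOV2 collection size from the next slide.

```python
def estimate_result_size(doc_freqs, n_docs):
    """Estimate |a AND b AND c| assuming terms occur independently:
    multiply N by each term's probability of occurrence f/N."""
    prob = 1.0
    for f in doc_freqs:
        prob *= f / n_docs
    return prob * n_docs

N = 25_205_179                               # GOV2 collection size
f_a, f_b, f_c = 1_000_000, 500_000, 300_000  # hypothetical doc frequencies
print(round(estimate_result_size([f_a, f_b, f_c], N)))  # ~236
```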
GOV2 Example

[table slide: estimated result set sizes, f_ab = (f_a · f_b) / N,
compared with actual counts for pairs of query terms]

Collection size (N) is 25,205,179

Tokenizing
• Forming words from a sequence of characters
• Surprisingly complex in English; can be even harder
in other languages
• Early IR systems:
– any sequence of alphanumeric characters of
length 3 or more
– terminated by a space or other special character
– upper-case changed to lower-case
Tokenizing
• Example:
– “Bigcorp's 2007 bi-annual report showed profits
rose 10%.” becomes
– “bigcorp 2007 annual report showed profits rose”
• Why is this a problem? Too much information is lost
– Small decisions in tokenizing can have major
impact on effectiveness of some queries
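A minimal sketch of the early-IR tokenizer just described (alphanumeric runs of length 3 or more, lower-cased); it reproduces the Bigcorp example above.

```python
import re

def tokenize_early(text):
    # Runs of alphanumeric characters, lower-cased,
    # keeping only tokens of length 3 or more.
    return [t for t in re.findall(r"[a-z0-9]+", text.lower())
            if len(t) >= 3]

print(tokenize_early("Bigcorp's 2007 bi-annual report showed profits rose 10%."))
# ['bigcorp', '2007', 'annual', 'report', 'showed', 'profits', 'rose']
```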
Tokenizing Problems
• Small words can be important in some queries,
usually in combinations
• xp, ma, pm, ben e king, el paso, master p, gm, j lo, world
war II
• Both hyphenated and non-hyphenated forms of
many words are common
– Sometimes hyphen is not needed
• e-bay, wal-mart, active-x, cd-rom, t-shirts
– At other times, hyphens should be considered either
as part of the word or a word separator
• winston-salem, mazda rx-7, e-cards, pre-diabetes, t-mobile,
spanish-speaking
Tokenizing Problems
• Special characters are an important part of tags,
URLs, code in documents
• Capitalized words can have different meaning
from lower case words
– Bush, Apple
• Apostrophes can be a part of a word, a part of a
possessive, or just a mistake
– rosie o'donnell, can't, don't, 80's, 1890's, men's straw
hats, master's degree, england's ten largest cities,
shriner's
Tokenizing Problems
• Numbers can be important, including decimals
– nokia 3250, top 10 courses, united 93, quicktime
6.5 pro, 92.3 the beat, 288358
• Periods can occur in numbers, abbreviations,
URLs, ends of sentences, and other situations
– I.B.M., Ph.D., cs.umass.edu, F.E.A.R.
• Note: tokenizing steps for queries must be
identical to steps for documents
Tokenizing Process
• The first step is to use a parser to identify the
appropriate parts of a document to tokenize
• Defer complex decisions to other components
– word is any sequence of alphanumeric characters,
terminated by a space or special character, with
everything converted to lower-case
– everything indexed
– example: 92.3 → 92 3 but search finds documents
with 92 and 3 adjacent
Tokenizing Process
• Not that different from the simple tokenizing
process used in the past
• Examples of rules used with TREC
– Apostrophes in words ignored
• o’connor → oconnor, bob’s → bobs
– Periods in abbreviations ignored
• I.B.M. → ibm, Ph.D. → ph d
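A minimal sketch of these two rules; the exact heuristics are assumptions chosen to reproduce the examples on the slide.

```python
def normalize(token):
    # Rule 1: apostrophes in words are ignored.
    token = token.lower().replace("'", "").replace("\u2019", "")
    # Rule 2: periods in abbreviations are ignored.
    if "." in token:
        parts = [p for p in token.split(".") if p]
        if all(len(p) == 1 for p in parts):
            return "".join(parts)    # i.b.m. -> ibm
        return " ".join(parts)       # ph.d. -> ph d
    return token

for t in ["o'connor", "bob's", "I.B.M.", "Ph.D."]:
    print(t, "->", normalize(t))
```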
Stopping
• Function words (determiners, prepositions)
have little meaning on their own
• High occurrence frequencies
• Treated as stopwords (i.e. removed)
– reduce index space, improve response time,
improve effectiveness
• Can be important in combinations
– e.g., “to be or not to be”
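A minimal sketch of stopword removal; the list here is a tiny illustrative subset, not a standard stopword list. Note how it destroys the example phrase.

```python
STOPWORDS = {"the", "of", "to", "a", "and", "in", "is", "be", "or", "not"}

def remove_stopwords(tokens):
    return [t for t in tokens if t not in STOPWORDS]

print(remove_stopwords("to be or not to be".split()))
# [] -- every token is a stopword, so the phrase vanishes entirely
```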
Stopping
• Stopword list can be created from high-
frequency words or based on a standard list
• Lists are customized for applications, domains,
and even parts of documents
– e.g., “click” is a good stopword for anchor text
• Best policy is to index all words in documents and
make decisions about which words to use at
query time
Stemming
• Many morphological variations of words
– inflectional (plurals, tenses)
– derivational (e.g., turning verbs into nouns)
• In most cases, these have the same or very
similar meanings
• Stemmers attempt to reduce morphological
variations of words to a common stem
– usually involves removing suffixes
• Can be done at indexing time or as part of
query processing (like stopwords)
Stemming
• Generally a small but significant effectiveness
improvement
– can be crucial for some languages
– e.g., 5-10% improvement for English, up to 50% in
Arabic

[table slide: Words with the Arabic root ktb]

Stemming
• Two basic types
– Dictionary-based: uses lists of related words
– Algorithmic: uses program to determine related
words
• Algorithmic stemmers
– suffix-s: remove ‘s’ endings assuming plural
• e.g., cats → cat, lakes → lake, wiis → wii
• Many false negatives: supplies → supplie
• Some false positives: ups → up
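A minimal sketch of the suffix-s stemmer, showing the error types listed above.

```python
def suffix_s(word):
    # Naive rule: strip a trailing 's', assuming it marks a plural.
    return word[:-1] if word.endswith("s") else word

for w in ["cats", "lakes", "wiis", "supplies", "ups"]:
    print(w, "->", suffix_s(w))
# supplies -> supplie   (false negative: fails to match "supply")
# ups -> up             (false positive: "ups" is not a plural)
```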
Porter Stemmer
• Algorithmic stemmer used in IR experiments
since the 70s
• Consists of a series of rules designed to strip off
the longest possible suffix at each step
• Effective in TREC evaluations
• Produces stems, not words
• Makes a number of errors and is difficult to
modify
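The Porter stemmer itself is not shown on the slides; one widely available implementation is in NLTK, which illustrates the "stems, not words" behaviour.

```python
from nltk.stem import PorterStemmer   # pip install nltk

stemmer = PorterStemmer()
for w in ["generalization", "ponies", "running"]:
    print(w, "->", stemmer.stem(w))
# generalization -> gener   (a stem, not a dictionary word)
# ponies -> poni
# running -> run
```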
Krovetz Stemmer
• Hybrid algorithmic-dictionary
– Word checked in dictionary
• If present, either left alone or replaced with “exception”
• If not present, word is checked for suffixes that could be
removed
• After removal, dictionary is checked again
• Produces words, not stems
• Effectiveness comparable to the Porter stemmer
• Lower false positive rate, somewhat higher false
negative rate
Stemmer Comparison

[figure slide: sample text as output by different stemmers]
Phrases
• Many queries are 2-3 word phrases
• Phrases are
– More precise than single words
• e.g., documents containing “black sea” vs. two words
“black” and “sea”
– Less ambiguous
• e.g., “big apple” vs. “apple”
• Can be difficult for ranking
– e.g., given the query “fishing supplies”, how do we
score documents with the exact phrase many times, the
exact phrase just once, the individual words in the same
sentence, the same paragraph, or the whole document, or
with variations on the words?
Document Structure and Markup
• Some parts of documents are more important
than others
• Document parser recognizes structure using
markup, such as HTML tags
– Headers, anchor text, bolded text all likely to be
important
– Metadata can also be important
– Links used for link analysis
Example Web Page

[figure slide: a web page about hypertext]

Example Web Page

[figure slide: the same example page, continued]
Link Analysis
• Links are a key component of the Web
• Important for navigation, but also for search
– e.g., <a href="http://example.com">Example
website</a>
– “Example website” is the anchor text
– “http://example.com” is the destination link
– both are used by search engines
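A minimal sketch (not from the slides) of extracting anchor text and destination links from HTML, using Python's standard-library parser.

```python
from html.parser import HTMLParser

class AnchorExtractor(HTMLParser):
    """Collect (anchor_text, destination) pairs from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:   # inside an <a> ... </a> element
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None

p = AnchorExtractor()
p.feed('<a href="http://example.com">Example website</a>')
print(p.links)   # [('Example website', 'http://example.com')]
```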
Anchor Text
• Used as a description of the content of the
destination page
– i.e., collection of anchor text in all links pointing to
a page used as an additional text field
• Anchor text tends to be short, descriptive, and
similar to query text
• Retrieval experiments have shown that anchor
text has significant impact on effectiveness for
some types of queries
PageRank
• Billions of web pages, some more informative
than others
• Links can be viewed as information about the
popularity (authority?) of a web page
– can be used by ranking algorithm
• Inlink count could be used as simple measure
• Link analysis algorithms like PageRank provide
more reliable ratings
Dangling Links
• Random jump prevents getting stuck on
pages that
– do not have links
– contain only links that no longer point to
other pages
– have links forming a loop
• Links that point to the first two types of
pages are called dangling links
– may also be links to pages that have not yet
been crawled
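A minimal sketch (not from the slides) of PageRank power iteration with the random jump described above; dangling pages spread their score uniformly, which is one standard way to handle them. The tiny link graph is made up for illustration.

```python
def pagerank(links, n_iter=50, jump=0.15):
    """links maps each page to the list of pages it points to."""
    pages = list(links)
    pr = {p: 1 / len(pages) for p in pages}
    for _ in range(n_iter):
        # Every page receives the random-jump share first.
        nxt = {p: jump / len(pages) for p in pages}
        for p, outs in links.items():
            if not outs:                        # dangling page: spread
                for q in pages:                 # its rank everywhere
                    nxt[q] += (1 - jump) * pr[p] / len(pages)
            else:
                for q in outs:
                    nxt[q] += (1 - jump) * pr[p] / len(outs)
        pr = nxt
    return pr

links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": []}  # D is dangling
print(pagerank(links))
```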
Link Quality
• Link quality is affected by spam and other
factors
– e.g., link farms to increase PageRank
– trackback links in blogs can create loops
– links from comments section of popular blogs
• Blog services modify comment links to contain the
rel="nofollow" attribute
• e.g., “Come visit my <a rel="nofollow"
href="http://www.page.com">web page</a>.”
Trackback Links

[figure slide]
Internationalization
• 2/3 of the Web is in English
• About 50% of Web users do not use English as
their primary language
• Many (maybe most) search applications have
to deal with multiple languages
– monolingual search: search in one language, but
with many possible languages
– cross-language search: search in multiple
languages at the same time
Internationalization
• Many aspects of search engines are language-
neutral
• Major differences:
– Text encoding (converting to Unicode)
– Tokenizing (many languages have no word
separators)
– Stemming
• Cultural differences may also impact interface
design and features provided
Chinese “Tokenizing”

[figure slide: segmenting a Chinese sentence into words]
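Not from the slides: a hedged example using jieba, one popular open-source Chinese segmenter, to show word segmentation where the text has no separators.

```python
import jieba   # pip install jieba

# Segment "text processing for search engines" written in Chinese;
# the exact segmentation may vary by dictionary version.
print(jieba.lcut("搜索引擎的文本处理"))
# e.g. ['搜索引擎', '的', '文本', '处理']
```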
END
