Topic 4 W4 - Text Processing
$v = k \cdot n^{\beta}$
where v is vocabulary size (number of unique words),
n is the number of words in corpus, k, β are
parameters that vary for each corpus (typical values
given are 10 ≤ k ≤ 100 and β ≈ 0.5)
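As a quick illustration, the formula above can be evaluated directly. This is a minimal sketch: the function name `heaps_vocab` and the sample values k = 50, β = 0.5 are assumptions picked from the typical ranges quoted above, not parameters fitted to any real collection.

```python
def heaps_vocab(n, k, beta):
    """Predicted vocabulary size v = k * n**beta for a corpus of n words."""
    if n < 0:
        raise ValueError("n must be non-negative")
    return k * n ** beta


# Hypothetical parameters inside the typical ranges (10 <= k <= 100, beta ~ 0.5).
for n in (1_000, 1_000_000, 100_000_000):
    print(n, round(heaps_vocab(n, k=50, beta=0.5)))
```

Because β is below 1, the predicted vocabulary keeps growing with n, but more and more slowly.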
AP89 Example
Heaps’ Law Predictions
• Predictions for TREC collections are accurate
for large numbers of words
– e.g., first 10,879,522 words of the AP89 collection
scanned
– prediction is 100,151 unique words
– actual number is 100,024
• Predictions for small numbers of words (i.e.,
fewer than 1,000) are much worse
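The slides do not say how k and β are obtained for a collection. One common approach, shown here only as a sketch, is a least-squares line fit in log space, since log v = log k + β log n. The function name `fit_heaps` and the sample points are hypothetical.

```python
import math


def fit_heaps(points):
    """Fit k and beta from (n, v) observations via log v = log k + beta * log n."""
    xs = [math.log(n) for n, _ in points]
    ys = [math.log(v) for _, v in points]
    m = len(points)
    mean_x = sum(xs) / m
    mean_y = sum(ys) / m
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    beta = sxy / sxx                          # slope of the fitted line
    k = math.exp(mean_y - beta * mean_x)      # intercept, mapped back from log space
    return k, beta


# Synthetic observations generated from k = 30, beta = 0.5 (not real corpus data).
observations = [(n, 30 * n ** 0.5) for n in (100, 10_000, 1_000_000)]
k, beta = fit_heaps(observations)
```

In practice the (n, v) pairs would come from scanning a corpus and recording the number of unique words seen after each block of n words.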
GOV2 (Web) Example
Web Example
• Heaps’ Law works with very large corpora
– new words still occur even after 30 million words have been seen!
– parameter values different than typical TREC
values
• New words come from a variety of sources
– spelling errors, invented words (e.g., product and company
names), code, other languages, email addresses, etc.
• Search engines must deal with these large and
growing vocabularies
Estimating Result Set Size
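• Assuming terms a and b occur independently, the estimated
number of documents containing both is
$f_{ab} = (f_a \cdot f_b)/N$
where $f_a$ and $f_b$ are the numbers of documents containing
a and b, and N is the number of documents in the collection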
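A small sketch of the estimate (fa · fb)/N. The function name and the sample counts are assumptions for illustration; the estimate treats the two terms as statistically independent, which real query terms often are not.

```python
def estimate_result_size(f_a, f_b, n_docs):
    """Estimate how many documents contain both terms, assuming independence.

    f_a, f_b: number of documents containing term a and term b.
    n_docs:   total number of documents in the collection (N).
    """
    if n_docs <= 0:
        raise ValueError("n_docs must be positive")
    return f_a * f_b / n_docs


# Hypothetical counts: 1,000 and 2,000 matching documents out of 1,000,000.
print(estimate_result_size(1_000, 2_000, 1_000_000))
```

When two terms tend to occur together, the true count will be higher than this estimate.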
Link Analysis
• Links are a key component of the Web
• Important for navigation, but also for search
– e.g., <a href="http://example.com">Example
website</a>
– “Example website” is the anchor text
– “http://example.com” is the destination link
– both are used by search engines
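To make the two parts of a link concrete, here is a minimal sketch that pulls the destination and the anchor text out of HTML using Python's standard `html.parser` module. The class and function names are hypothetical, and a production crawler would need to handle nested tags and malformed markup more carefully.

```python
from html.parser import HTMLParser


class AnchorExtractor(HTMLParser):
    """Collects (destination, anchor text) pairs from <a href=...> elements."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # destination of the link currently being read
        self._text = []     # text fragments seen inside that link

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse line breaks and repeated spaces inside the anchor text.
            text = " ".join("".join(self._text).split())
            self.links.append((self._href, text))
            self._href = None


def extract_links(html):
    parser = AnchorExtractor()
    parser.feed(html)
    parser.close()
    return parser.links


print(extract_links('<a href="http://example.com">Example\nwebsite</a>'))
```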
Anchor Text
• Used as a description of the content of the
destination page
– i.e., collection of anchor text in all links pointing to
a page used as an additional text field
• Anchor text tends to be short, descriptive, and
similar to query text
• Retrieval experiments have shown that anchor
text has significant impact on effectiveness for
some types of queries
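The idea of treating anchor text as an additional text field can be sketched as a simple grouping step: collect the anchor text of every link by its destination page. The function name, the triple layout, and the sample URLs are assumptions for illustration.

```python
from collections import defaultdict


def build_anchor_fields(links):
    """Group anchor text by destination page.

    links: iterable of (source_url, destination_url, anchor_text) triples.
    Returns a dict mapping each destination to its combined anchor text.
    """
    fields = defaultdict(list)
    for _source, destination, text in links:
        fields[destination].append(text)
    return {dest: " ".join(texts) for dest, texts in fields.items()}


# Hypothetical link data.
links = [
    ("http://a.example/", "http://example.com", "Example website"),
    ("http://b.example/", "http://example.com", "example home page"),
    ("http://b.example/", "http://other.example/", "another site"),
]
print(build_anchor_fields(links))
```

The combined text could then be indexed alongside the page's own title and body fields.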
PageRank
• Billions of web pages, some more informative
than others
• Links can be viewed as information about the
popularity (authority?) of a web page
– can be used by ranking algorithm
• Inlink count could be used as simple measure
• Link analysis algorithms like PageRank provide
more reliable ratings
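A minimal sketch of PageRank computed by repeated iteration over a small link graph. The function name, the random-jump probability of 0.15, the fixed iteration count, and the choice to spread the rank of pages without outlinks evenly over all pages are assumptions made for this example, not details given in these slides.

```python
def pagerank(graph, jump=0.15, iterations=50):
    """Iteratively estimate PageRank.

    graph: dict mapping each page to the list of pages it links to.
           Every link target is assumed to also appear as a key.
    jump:  probability of jumping to a random page instead of following a link.
    """
    pages = list(graph)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Rank held by pages with no outlinks is shared among all pages.
        dangling = sum(rank[p] for p in pages if not graph[p])
        new_rank = {p: jump / n + (1 - jump) * dangling / n for p in pages}
        for p in pages:
            for q in graph[p]:
                # Each page passes its rank equally along its outlinks.
                new_rank[q] += (1 - jump) * rank[p] / len(graph[p])
        rank = new_rank
    return rank


# Hypothetical three-page graph.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
print(pagerank(graph))
```

Unlike a raw inlink count, the score a page passes on depends on its own score, so a link from a highly ranked page counts for more.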
Dangling Links
• Random jump prevents getting stuck on
pages that
– do not have links
– contain only links that no longer point to
other pages
– have links forming a loop
• Links that point to the first two types of
pages are called dangling links
– may also be links to pages that have not yet
been crawled
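The definition above can be sketched as a check over a crawled link graph. In this simplified model a link is flagged as dangling when its target either has no outlinks or is missing from the graph (treated here as not yet crawled). The function name and the sample graph are hypothetical, and the sketch does not check whether a target's links are dead.

```python
def dangling_links(graph):
    """Return (source, target) pairs whose target has no outlinks or is uncrawled.

    graph: dict mapping each crawled page to the list of pages it links to.
    """
    result = []
    for source, targets in graph.items():
        for target in targets:
            # graph.get(target) is None for uncrawled pages and [] for
            # crawled pages without links; both count as dangling here.
            if not graph.get(target):
                result.append((source, target))
    return result


# Hypothetical graph: B has no outlinks, D has not been crawled.
graph = {"A": ["B", "C"], "B": [], "C": ["A", "D"]}
print(dangling_links(graph))
```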
Link Quality
• Link quality is affected by spam and other
factors
– e.g., link farms to increase PageRank
– trackback links in blogs can create loops
– links from comments section of popular blogs
• Blog services modify comment links to contain
rel=nofollow attribute
– e.g., “Come visit my <a rel=nofollow
href="http://www.page.com">web page</a>.”
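A sketch of how a link analysis step might separate links marked rel=nofollow from ordinary links, using the standard `html.parser` module. The class and function names are assumptions; how a given search engine actually treats nofollow links is not specified in these slides.

```python
from html.parser import HTMLParser


class FollowedLinkParser(HTMLParser):
    """Splits link destinations into followed and nofollow lists."""

    def __init__(self):
        super().__init__()
        self.followed = []
        self.nofollow = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        href = attr.get("href")
        if href is None:
            return
        # rel may hold several space-separated tokens, e.g. "nofollow noopener".
        rel_tokens = (attr.get("rel") or "").lower().split()
        if "nofollow" in rel_tokens:
            self.nofollow.append(href)
        else:
            self.followed.append(href)


def split_links(html):
    parser = FollowedLinkParser()
    parser.feed(html)
    parser.close()
    return parser.followed, parser.nofollow


comment = 'Come visit my <a rel=nofollow href="http://www.page.com">web page</a>.'
print(split_links(comment))
```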
Trackback Links
Internationalization
• 2/3 of the Web is in English
• About 50% of Web users do not use English as
their primary language
• Many (maybe most) search applications have
to deal with multiple languages
– monolingual search: search in one language, but
with many possible languages
– cross-language search: search in multiple
languages at the same time
Internationalization
• Many aspects of search engines are language-
neutral
• Major differences:
– Text encoding (converting to Unicode)
– Tokenizing (many languages have no word
separators)
– Stemming
• Cultural differences may also impact interface
design and features provided
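As an illustration of the text encoding step, the sketch below decodes raw bytes in a known encoding into a Unicode string. The function name is hypothetical, and the extra NFC normalization step (so that equivalent character sequences compare equal) is an assumption added here, not something the slides prescribe.

```python
import unicodedata


def to_unicode(raw, encoding):
    """Decode bytes in the given encoding and normalize the result to NFC."""
    text = raw.decode(encoding)
    return unicodedata.normalize("NFC", text)


# The same word stored as Latin-1 bytes and as UTF-8 with a combining accent.
latin1_bytes = "caf\u00e9".encode("latin-1")
utf8_bytes = "cafe\u0301".encode("utf-8")
print(to_unicode(latin1_bytes, "latin-1") == to_unicode(utf8_bytes, "utf-8"))
```

Detecting the encoding of a page when it is not declared is a separate problem not covered by this sketch.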
Chinese “Tokenizing”
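Because Chinese text has no word separators, one simple alternative to full word segmentation is to index overlapping character bigrams. The sketch below shows that approach only as an example; the function name and the sample string are assumptions, and the slide itself does not state which method is used.

```python
def char_bigrams(text):
    """Split text into overlapping character bigrams, ignoring whitespace."""
    chars = [c for c in text if not c.isspace()]
    if len(chars) < 2:
        return ["".join(chars)] if chars else []
    return [chars[i] + chars[i + 1] for i in range(len(chars) - 1)]


# Sample four-character string meaning "search engine".
print(char_bigrams("搜索引擎"))
```

Bigrams avoid the need for a dictionary, at the cost of producing some units that are not real words.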
END