04 - Lect4 - Text Transformation

Biomedical IR

Search Engine Architecture [part 3]


Lecture 4

Dr. Ebtsam AbdelHakam

Minia University
Indexing Process

[Figure: the indexing process — document acquisition (web crawling, provider feeds, RSS, desktop/email) feeds a data store (unique document IDs; questions of what to store, disk space, rights, compression), followed by text transformation (format conversion, internationalisation, word units, stopping, stemming — which part carries "meaning"?) and index creation: a lookup table for quickly finding all documents containing a word. © Addison Wesley, 2008]

Walid Magdy, TTDS 2017/2018


Text Transformation (Pre-processing)

• Standard text pre-processing steps:


1. Tokenisation
2. Stop word removal
3. Normalization
4. Stemming



Getting ready for indexing?
• Pre-processing steps before indexing:
• Tokenisation
• Stopping
• Stemming
• Objective  identify the optimal form of the term to
be indexed to achieve the best retrieval performance



Tokenisation
• Tokenizer: A document is converted to a stream of tokens,
e.g. individual words.

• Sentence  tokenization (splitting)  tokens


• A token is an instance of a sequence of characters
• Typical technique: split at non-letter characters (space)
• Each such token is now a candidate for an index entry
(term), after further processing

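The "split at non-letter characters" technique above can be sketched in a few lines of Python (the function name `tokenise` is illustrative, not from the lecture):

```python
import re

def tokenise(sentence):
    """Split a sentence into tokens at non-letter characters."""
    # Runs of letters become tokens; spaces, digits and punctuation
    # all act as separators.
    return re.findall(r"[A-Za-z]+", sentence)

tokens = tokenise("This is a very exciting lecture on the technologies of text!")
```

Each element of `tokens` is a candidate index term, pending stopping, normalisation and stemming.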


Stopping: stop words

• This is a very exciting lecture on the technologies of text
• Stop words: the most common words in a collection
  → the, a, is, he, she, I, him, for, on, to, very, …
• There are a lot of them: ≈ 30–40% of text
• New stop words appear in specific domains
  • Tweets: RT → "RT @realDonalTrump Mexico will …"
  • Patents: said, claim → "a said method that extracts …"
• Stop words
  • influence sentence structure
  • have less influence on topic (aboutness)
• Common practice in many applications: remove them
• You still need them for:
  • Phrase queries:
    "King of Denmark", "Let it be", "To be or not to be"
  • "Relational" queries: "flights to London"
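Stop-word removal is a simple set-membership filter; a minimal sketch, using a toy subset of the stop list from the slide (a real system would use a much larger list, and would keep positions if phrase queries must be supported):

```python
# Toy stop list taken from the examples above.
STOP_WORDS = {"the", "a", "is", "he", "she", "i", "him", "for", "on", "to", "very"}

def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list (case-insensitive)."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

# Stop words carry sentence structure but little "aboutness":
terms = remove_stop_words(["this", "is", "a", "very", "exciting", "lecture"])
```

Note how removing them here would break a phrase query like "to be or not to be", which consists entirely of stop words.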
• Objective  make words with different
surface forms look the same
• Document: “this is my CAR!!”
Query: “car”
should “car” match “CAR”?
• Sentence  tokenisation  tokens  normalisation
 terms to be indexed
• Same tokenisation/normalisation steps should be
applied to documents & queries



Stemming

• Search for: "play"
  should it match: "played", "playing", "player"?
• Stemmers attempt to reduce morphological variations of words to a common stem
  • usually involves removing suffixes (in English)
• Many morphological variations of words
  • inflectional (plurals, tenses)
  • derivational (making verbs into nouns, etc.)
• In most cases, aboutness does not change
• Can be done at indexing time or as part of query processing (like stopping)
• Two basic types of stemmers
• Dictionary-based: uses lists of related words
• Algorithmic: uses program to determine related words
• Algorithmic stemmers
• suffix-s: remove ‘s’ endings assuming plural
• e.g., cats → cat, lakes → lake, windows → window
• Many false negatives: supplies → supplie
• Some false positives: James → Jame

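The suffix-s stemmer above is small enough to write out in full, which makes its failure modes easy to see; a sketch:

```python
def suffix_s_stem(word):
    """Naive suffix-s stemmer: strip a trailing 's', assuming a plural.

    Keeps words ending in 'ss' (e.g. 'process') intact, but still
    produces false negatives (supplies -> supplie, not 'supply')
    and false positives (James -> Jame, which was never a plural).
    """
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word
```

For example, `suffix_s_stem("cats")` gives `"cat"`, but `suffix_s_stem("supplies")` gives `"supplie"` and `suffix_s_stem("James")` gives `"Jame"`, exactly the errors listed above.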


Porter stemmer

• Most common algorithm for stemming English
• Conventions + 5 phases of reductions
  • phases applied sequentially
  • each phase consists of a set of commands
  • sample convention: of the rules in a compound command, select the one that applies to the longest suffix
• Example rules in the Porter stemmer
  • sses → ss (processes → process)
  • y → i (reply → repli)
  • ies → i (replies → repli)
  • ement → null (replacement → replac)

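The longest-suffix convention can be sketched with just the four sample rules above; this is not the real Porter stemmer (which adds conditions such as a word-measure test before removing "ement", plus four more phases), only an illustration of one compound command:

```python
# The four sample rules from the slide, as (suffix, replacement) pairs.
RULES = [("sses", "ss"), ("ies", "i"), ("ement", ""), ("y", "i")]

def apply_longest_suffix(word):
    """Apply the rule whose suffix matches and is longest, if any."""
    matches = [(suf, rep) for suf, rep in RULES if word.endswith(suf)]
    if not matches:
        return word
    # Convention: of the rules that apply, select the longest suffix,
    # so "replies" uses ies -> i rather than y -> i.
    suf, rep = max(matches, key=lambda m: len(m[0]))
    return word[: len(word) - len(suf)] + rep
```

This reproduces the slide's examples: processes → process, reply → repli, replies → repli, replacement → replac.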


• Irregular verbs:
  • saw → see
  • went → go
• Different spellings:
  • colour vs. color
  • tokenisation vs. tokenization
  • television vs. TV
• Synonyms:
  • car vs. vehicle
  • UK vs. Britain

• Solution → query expansion …


Text pre-processing before IR:
Tokenisation → Stopping → Stemming

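The whole chain can be sketched by composing the three steps (with a toy stop list and the naive suffix-s stemmer; a real pipeline would use a full stop list and a proper stemmer such as Porter's):

```python
import re

STOP_WORDS = {"the", "a", "is", "on", "of", "this", "very", "to"}

def preprocess(text):
    """Tokenise -> stop -> stem, producing the terms to be indexed."""
    # Tokenisation (split at non-letters) plus case-folding normalisation.
    tokens = re.findall(r"[A-Za-z]+", text.lower())
    # Stopping: drop the most common, low-aboutness words.
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # Naive suffix-s stemming: strip a trailing 's' unless the word ends in 'ss'.
    return [t[:-1] if t.endswith("s") and not t.endswith("ss") else t
            for t in tokens]

terms = preprocess("This is a very exciting lecture on the technologies of text")
```

The same `preprocess` function must be applied to both documents and queries so that their terms meet in the index.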


Index
Creation

Storing Document Statistics

‣ The counts and positions of document terms are stored.

‣ Common index types:

1. Forward index: the key is the document; the value is a list of terms
and term positions. Easiest for the crawler to build.

2. Inverted index: the key is a term; the value is a list of documents
and term positions. Provides faster processing at query time.
Forward index

- The rationale behind building a forward index is that, as documents
are parsed, it is convenient to store the words per document as an
intermediate step.

- The forward index is sorted to transform it into an inverted index.

- The forward index is essentially a list of (document, word) pairs,
collated by document.
Inverted index

• In its simplest (boolean) form, this index can only determine whether a
word occurs in a particular document, since it stores no information
about the frequency or position of the word.

• Such an index determines which documents match a query but does
not rank the matched documents.
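The forward-to-inverted transformation described above can be sketched as follows (a toy positional index; the document IDs and function names are illustrative):

```python
def build_forward_index(docs):
    """Forward index: document -> list of (term, position) pairs.

    Easy to build while parsing, since each document is handled in isolation.
    """
    return {doc_id: [(term, pos) for pos, term in enumerate(text.lower().split())]
            for doc_id, text in docs.items()}

def invert(forward):
    """Inverted index: term -> {document: [positions]}.

    Regrouping the forward index's (document, term) pairs by term is the
    'sorting' step that turns it into an inverted index.
    """
    inverted = {}
    for doc_id, postings in forward.items():
        for term, pos in postings:
            inverted.setdefault(term, {}).setdefault(doc_id, []).append(pos)
    return inverted

docs = {"d1": "to be or not to be", "d2": "let it be"}
index = invert(build_forward_index(docs))
# index["be"] -> {"d1": [1, 5], "d2": [2]}
```

Because positions are kept, this index supports phrase queries and ranking, unlike the boolean variant described above.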
