03 -Lect3 search engines-part2

The document outlines the processes involved in information storage and retrieval, focusing on indexing and text pre-processing techniques such as tokenization, stopping, and stemming. It discusses the importance of creating an inverted index for efficient search engine performance and the various types of indexing methods used. Additionally, it explains the significance of document statistics and term weighting in optimizing search results.

Information Storage and Retrieval (CS418)
Search Engine Architecture [2]
Lecture 3

Dr. Ebtsam AbdelHakam
Computer Science Dept., Minia University
Indexing

[Figure: the indexing process. Document acquisition (web crawling, provider feeds, RSS "feeds", desktop/email) fills a document data store, where each processed document gets a unique ID; this raises questions of what can be stored, disk space, content rights, and compression. Text transformation (format conversion, internationalisation, choice of word units, stopping, stemming) extracts the parts of a document that carry "meaning". Index creation then builds a lookup table for quickly finding all docs containing a word. © Addison Wesley, 2008]

Walid Magdy, TTDS 2017/2018


Text Transformation (Pre-processing)

• Standard text pre-processing steps:


1. Tokenisation
2. Stopping
3. Normalization
4. Stemming



Getting ready for indexing?
• Pre-processing steps before indexing:
• Tokenisation
• Stopping
• Stemming
• Objective → identify the optimal form of the term to
be indexed, to achieve the best retrieval performance



Tokenisation
• Tokenizer: A document is converted to a stream of tokens,
e.g. individual words.

• Sentence → tokenisation (splitting) → tokens


• A token is an instance of a sequence of characters
• Typical technique: split at non-letter characters (space)
• Each such token is now a candidate for an index entry
(term), after further processing

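The splitting rule above can be sketched in a few lines; this is an illustrative assumption, not the lecture's exact code:

```python
import re

# Illustrative tokeniser: split at non-letter characters, as described
# above. Digits and punctuation act as separators; each surviving
# piece is a candidate index term after further processing.
def tokenize(text):
    return [tok for tok in re.split(r"[^A-Za-z]+", text) if tok]

print(tokenize("Brutus killed me; Caesar was ambitious!"))
# ['Brutus', 'killed', 'me', 'Caesar', 'was', 'ambitious']
```

Real tokenisers handle apostrophes, hyphens, and non-Latin scripts more carefully; this sketch only shows the basic split-at-non-letters idea.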


Stopping: stop words

• Example: "This is a very exciting lecture on the technologies of text"
• Stop words: the most common words in a collection
  → the, a, is, he, she, I, him, for, on, to, very, …
• There are a lot of them: ≈ 30-40% of text
• New stop words appear in specific domains
  • Tweets: RT → "RT @realDonalTrump Mexico will …"
  • Patents: said, claim → "a said method that extracts …"
• Stop words
  • influence sentence structure
  • have less influence on topic (aboutness)
• Common practice in many applications is to remove them
• But you need them for:
  • Phrase queries: "King of Denmark", "Let it be", "To be or not to be"
  • "Relational" queries: "flights to London"
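Filtering against the small stoplist above might look like this (a sketch; production stoplists are much longer):

```python
# Stopword removal sketch using the small stoplist from the slide.
# Real systems use larger lists, and may keep stopwords so that
# phrase queries like "to be or not to be" still work.
STOPWORDS = {"the", "a", "is", "he", "she", "i", "him", "for", "on", "to", "very"}

def remove_stopwords(tokens):
    return [t for t in tokens if t.lower() not in STOPWORDS]

tokens = "this is a very exciting lecture on the technologies of text".split()
print(remove_stopwords(tokens))
# ['this', 'exciting', 'lecture', 'technologies', 'of', 'text']
```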
Normalisation

• Objective → make words with different surface forms look the same
• Document: "this is my CAR!!"
  Query: "car"
  Should "car" match "CAR"?
• Sentence → tokenisation → tokens → normalisation → terms to be indexed
• The same tokenisation/normalisation steps should be applied to documents & queries

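The CAR/car matching problem above can be sketched with a simple case-folding normaliser (an illustrative assumption, not a complete normalisation scheme):

```python
import re

# Normalisation sketch: lowercase and strip punctuation, so the
# document token "CAR!!" and the query term "car" map to the same
# indexed form. The same function must run on documents AND queries.
def normalise(token):
    return re.sub(r"[^a-z0-9]", "", token.lower())

print(normalise("CAR!!") == normalise("car"))  # True
```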


Stemming

• Search for: "play"
  Should it match "played", "playing", "player"?
• Many morphological variations of words
  • inflectional (plurals, tenses)
  • derivational (making verbs nouns, etc.)
• In most cases, aboutness does not change
• Stemmers attempt to reduce morphological variations of words to a common stem
  • usually involves removing suffixes (in English)
• Can be done at indexing time or as part of query processing (like stopwords)
Stemmers

• Two basic types
  • Dictionary-based: uses lists of related words
  • Algorithmic: uses a program to determine related words
• Algorithmic stemmers
  • suffix-s: remove 's' endings, assuming a plural
    • e.g., cats → cat, lakes → lake, windows → window
    • Many false negatives: supplies → supplie
    • Some false positives: James → Jame
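The suffix-s stemmer above is simple enough to write out in full, errors and all:

```python
# suffix-s stemmer sketch: strip a trailing 's', assuming a plural.
def suffix_s(word):
    return word[:-1] if word.endswith("s") else word

for w in ["cats", "lakes", "windows", "supplies", "James"]:
    print(w, "->", suffix_s(w))
# cats -> cat, lakes -> lake, windows -> window,
# supplies -> supplie (false negative), James -> Jame (false positive)
```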



Porter Stemmer

• Most common algorithm for stemming English
• Conventions + 5 phases of reductions
  • phases applied sequentially
  • each phase consists of a set of commands
  • sample convention: of the rules in a compound command, select the one that applies to the longest suffix
• Example rules in the Porter stemmer
  • sses → ss (processes → process)
  • y → i (reply → repli)
  • ies → i (replies → repli)
  • ement → null (replacement → replac)


What stemming cannot fix

• Irregular verbs:
  • saw → see
  • went → go
• Different spellings
  • colour vs. color
  • tokenisation vs. tokenization
  • television vs. TV
• Synonyms
  • car vs. vehicle
  • UK vs. Britain
• Solution → query expansion …

Walid Magdy, TTDS 2017/2018
Summary

• Text pre-processing before IR:
  Tokenisation → Stopping → Stemming


Indexing Process
■ Indexing is where processed information from crawled pages gets
added to the search index.
■ The search index is what you search when you use a search engine.
That’s why getting indexed in major search engines like Google and
Bing is so important.
■ Users can’t find you unless you’re in the index.
■ How to add your website to Google index? (Assignment)
■ The purpose of storing an index is to optimize speed and
performance in finding relevant documents for a search query.
■ For example, while an index of 10,000 documents can be queried
within milliseconds, a sequential scan of every word in 10,000 large
documents could take hours.
Index data structures
Search engine architectures vary in the way indexing is
performed and in methods of index storage to meet
the various design factors.
1. Inverted index
Stores a list of occurrences of each atomic search criterion, typically in
the form of a hash table or binary tree.
2. Citation index
Stores citations or hyperlinks between documents to support citation
analysis, a subject of bibliometrics.
3. n-gram index
Stores sequences of length n of data to support other types of retrieval
or text mining.
4. Document-term matrix
Used in latent semantic analysis, stores the occurrences of words in
documents in a two-dimensional sparse matrix.
Index Creation
Storing Document Statistics

‣ The counts and positions of document terms are stored.

‣ Common index types:


1. Forward Index: Key is the document, value is a list of terms
and term positions. Easiest for the crawler to build.

2. Inverted Index: Key is a term, value is a list of documents


and term positions. Provides faster processing at query time.
Forward index

- The rationale behind developing a forward index is that, as documents are parsed, it is better to intermediately store the words per document.
- The forward index is sorted to transform it into an inverted index.
- The forward index is essentially a list of pairs consisting of a document and a word, collated by the document.
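The forward-to-inverted transformation above can be sketched on a two-document toy collection (the documents are an illustrative assumption):

```python
from collections import defaultdict

# Build a forward index while parsing, then invert it.
docs = {
    1: "i did enact julius caesar",
    2: "so let it be with caesar",
}

# forward index: docID -> list of (position, term), stored per document
forward = {d: list(enumerate(text.split())) for d, text in docs.items()}

# invert: term -> postings list of (docID, position), sorted by docID
inverted = defaultdict(list)
for d in sorted(forward):
    for pos, term in forward[d]:
        inverted[term].append((d, pos))

print(inverted["caesar"])  # [(1, 4), (2, 5)]
```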
Inverted index

• This index can only determine whether a word exists within a particular document, since it stores no information regarding the frequency and position of the word; it is therefore considered a boolean index.
• Such an index determines which documents match a query but does not rank matched documents.
Index Creation
Storing Document Statistics

‣ The counts and positions of document terms are stored

‣ Term weights are calculated and stored with the terms.


‣ The weight estimates the term’s importance to the document.

‣ The weights are used by ranking algorithms
  • e.g. TF-IDF ranks documents by the Term Frequency of the query term within the document times the Inverse Document Frequency of the term across all documents.
  • Higher scores mean the document contains more occurrences of query terms that appear in few other documents.
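The TF-IDF score described above can be sketched as tf(t, d) * log(N / df(t)); exact variants (log-scaled tf, smoothing) differ between systems, and the toy collection below is an illustrative assumption:

```python
import math

# TF-IDF sketch over a tiny toy collection.
docs = {
    1: ["caesar", "brutus", "caesar"],
    2: ["brutus", "calpurnia"],
    3: ["mercy", "worser"],
}
N = len(docs)

def df(term):
    # document frequency: how many documents contain the term
    return sum(1 for terms in docs.values() if term in terms)

def tf_idf(term, doc_id):
    tf = docs[doc_id].count(term)  # term frequency in this document
    return tf * math.log(N / df(term))

# "caesar" occurs twice in doc 1 and appears in 1 of the 3 documents
print(round(tf_idf("caesar", 1), 3))  # 2 * ln(3) ≈ 2.197
```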
Index vs Grep
• Say we have a collection of Shakespeare plays
• We want to find all plays that contain:
QUERY:
Brutus AND Caesar AND NOT Calpurnia

• Grep: start at the 1st play, read everything, and filter out plays that don't match the criteria (a linear scan over ~1M words)
• Index (a.k.a. Inverted Index): build the index data structure off-line; quick lookup at query-time.
These examples/figures are from: Manning, Raghavan, Schütze, Intro to Information Retrieval, CUP, 2008
The Shakespeare collection as Term-Document Matrix

Matrix element (t, d) is:
  1 if term t occurs in document d,
  0 otherwise

The Shakespeare collection as Term-Document Matrix

QUERY: Brutus AND Caesar AND NOT Calpurnia
Answer: "Antony and Cleopatra" (d=1), "Hamlet" (d=4)
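The query above can be evaluated directly on the incidence matrix by bitwise operations over the term rows; the rows below follow the Manning et al. Shakespeare example cited here:

```python
# Boolean retrieval over the term-document incidence matrix, using
# Python ints as bit vectors (leftmost bit = the first document).
docs = ["Antony and Cleopatra", "Julius Caesar", "The Tempest",
        "Hamlet", "Othello", "Macbeth"]
incidence = {
    "Brutus":    0b110100,
    "Caesar":    0b110111,
    "Calpurnia": 0b010000,
}
n = len(docs)
mask = (1 << n) - 1  # complement within the 6 documents only

# QUERY: Brutus AND Caesar AND NOT Calpurnia
result = (incidence["Brutus"] & incidence["Caesar"]
          & (~incidence["Calpurnia"] & mask))

answer = [docs[i] for i in range(n) if result & (1 << (n - 1 - i))]
print(answer)  # ['Antony and Cleopatra', 'Hamlet']
```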
Inverted Index Data Structure

[Figure: each term t points to a sorted list of document ids d; e.g. "Brutus" occurs in d = 1, 2, 4, … Importantly, it is a sorted list.]
Inverted Index
• Each index term is associated with an
inverted list
– Contains lists of documents, or lists of word
occurrences in documents, and other
information
– Each entry is called a posting
– The part of the posting that refers to a specific
document or location is called a pointer
– Each document in the collection is given a
unique number
– Lists are usually document-ordered (sorted by
document number)
Sec. 1.2

Inverted index

■ For each term t, we must store a list of all documents that contain t.
  – Identify each doc by a docID, a document serial number
■ Can we use fixed-size arrays for this?

  Brutus → 1, 2, 4, 11, 31, 45, 173, 174
  Caesar → 1, 2, 4, 5, 6, 16, 57, 132
  Calpurnia → 2, 31, 54, 101

What happens if the word Caesar is added to document 14?
Can we use fixed-size arrays for this?

■ The term-document matrix is sparse, since not all words are present in each document.
■ To reduce memory requirements, the index is stored differently from a two-dimensional array.
■ How can we reduce the index size? (Assignment)
Inverted index

■ We need variable-size postings lists
  – On disk, a continuous run of postings is normal and best
  – In memory, can use linked lists or variable-length arrays
■ Some tradeoffs in size / ease of insertion

  Brutus → 1, 2, 4, 11, 31, 45, 173, 174
  Caesar → 1, 2, 4, 5, 6, 16, 57, 132
  Calpurnia → 2, 31, 54, 101

Dictionary (terms) and Postings (docID lists); sorted by docID (more later on why).
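One reason postings are kept docID-sorted: two sorted lists can be intersected with a single merge pass, which is how a two-term AND query is answered. A sketch:

```python
# Merge-style intersection of two docID-sorted postings lists,
# running in O(len(a) + len(b)).
def intersect(a, b):
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

brutus = [1, 2, 4, 11, 31, 45, 173, 174]
caesar = [1, 2, 4, 5, 6, 16, 57, 132]
print(intersect(brutus, caesar))  # [1, 2, 4]  (Brutus AND Caesar)
```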

Indexer steps: Token sequence

■ Sequence of (Modified token, Document ID) pairs.

Doc 1: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious.
Indexer steps: Sort

■ Sort by terms
  – At least conceptually
■ And then by docID

This is the core indexing step.
Indexer steps: Dictionary & Postings

■ Multiple term entries in a single document are merged.
■ Split into Dictionary and Postings.
■ Document frequency information is added.
  (Why frequency? Will discuss later.)
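The indexer steps above (sort, merge duplicates, split into dictionary and postings with document frequency) can be sketched end-to-end; the (term, docID) pairs below are an illustrative assumption:

```python
# Sort (term, docID) pairs, merge duplicate entries within a document,
# then split into a dictionary (term -> document frequency) and
# docID-sorted postings lists.
pairs = [("caesar", 1), ("i", 1), ("caesar", 2), ("brutus", 1),
         ("caesar", 1), ("brutus", 2)]

pairs.sort()  # by term, then docID: the core indexing step

dictionary, postings = {}, {}
for term, doc_id in pairs:
    plist = postings.setdefault(term, [])
    if not plist or plist[-1] != doc_id:  # merge duplicate entries
        plist.append(doc_id)
for term, plist in postings.items():
    dictionary[term] = len(plist)  # document frequency

print(dictionary)           # {'brutus': 2, 'caesar': 2, 'i': 1}
print(postings["caesar"])   # [1, 2]
```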
Final inverted index

[Figure: the final index. The dictionary stores terms and counts, with pointers to postings lists of docIDs.]

IR system implementation:
• How do we index efficiently?
• How much storage do we need?
