03 -Lect3 search engines-part2

The document outlines the processes involved in information storage and retrieval, focusing on indexing and text pre-processing techniques such as tokenization, stopping, and stemming. It discusses the importance of creating an inverted index for efficient search engine performance and the various types of indexing methods used. Additionally, it explains the significance of document statistics and term weighting in optimizing search results.

Information Storage and Retrieval (CS418)
Search Engine Architecture [2]
Lecture 3

Dr. Ebtsam AbdelHakam
Computer Science Dept., Minia University
Indexing

[Figure: the indexing process. Document acquisition (web crawling, provider feeds, RSS "feeds", desktop/email) fills a document data store, where each processed document gets a unique ID; this raises questions of what can be stored, disk space, content rights, and compression. Text transformation (format conversion, internationalisation, choice of word units, stopping, stemming) extracts the parts of a document that carry "meaning". Index creation then builds a lookup table for quickly finding all docs containing a word. © Addison Wesley, 2008]

Walid Magdy, TTDS 2017/2018


Text Transformation (Pre-processing)

• Standard text pre-processing steps:


1. Tokenisation
2. Stopping
3. Normalization
4. Stemming



Getting ready for indexing?
• Pre-processing steps before indexing:
• Tokenisation
• Stopping
• Stemming
• Objective → identify the optimal form of the term to
be indexed, to achieve the best retrieval performance



Tokenisation
• Tokenizer: A document is converted to a stream of tokens,
e.g. individual words.

• Sentence → tokenisation (splitting) → tokens


• A token is an instance of a sequence of characters
• Typical technique: split at non-letter characters (space)
• Each such token is now a candidate for an index entry
(term), after further processing

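The splitting rule above can be sketched in a few lines; this is an illustrative assumption, not the lecture's exact code:

```python
import re

# Illustrative tokeniser: split at non-letter characters, as described
# above. Digits and punctuation act as separators; each surviving
# piece is a candidate index term after further processing.
def tokenize(text):
    return [tok for tok in re.split(r"[^A-Za-z]+", text) if tok]

print(tokenize("Brutus killed me; Caesar was ambitious!"))
# ['Brutus', 'killed', 'me', 'Caesar', 'was', 'ambitious']
```

Real tokenisers handle apostrophes, hyphens, and non-Latin scripts more carefully; this sketch only shows the basic split-at-non-letters idea.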


Stopping: stop words

• Example: "This is a very exciting lecture on the technologies of text"
• Stop words: the most common words in a collection
  → the, a, is, he, she, I, him, for, on, to, very, …
• There are a lot of them: ≈ 30-40% of text
• New stop words appear in specific domains
  • Tweets: RT → "RT @realDonalTrump Mexico will …"
  • Patents: said, claim → "a said method that extracts …"
• Stop words
  • influence sentence structure
  • have less influence on topic (aboutness)
• Common practice in many applications is to remove them
• But you need them for:
  • Phrase queries: "King of Denmark", "Let it be", "To be or not to be"
  • "Relational" queries: "flights to London"
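Filtering against the small stoplist above might look like this (a sketch; production stoplists are much longer):

```python
# Stopword removal sketch using the small stoplist from the slide.
# Real systems use larger lists, and may keep stopwords so that
# phrase queries like "to be or not to be" still work.
STOPWORDS = {"the", "a", "is", "he", "she", "i", "him", "for", "on", "to", "very"}

def remove_stopwords(tokens):
    return [t for t in tokens if t.lower() not in STOPWORDS]

tokens = "this is a very exciting lecture on the technologies of text".split()
print(remove_stopwords(tokens))
# ['this', 'exciting', 'lecture', 'technologies', 'of', 'text']
```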
Normalisation

• Objective → make words with different surface forms look the same
• Document: "this is my CAR!!"
  Query: "car"
  Should "car" match "CAR"?
• Sentence → tokenisation → tokens → normalisation → terms to be indexed
• The same tokenisation/normalisation steps should be applied to documents & queries

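The CAR/car matching problem above can be sketched with a simple case-folding normaliser (an illustrative assumption, not a complete normalisation scheme):

```python
import re

# Normalisation sketch: lowercase and strip punctuation, so the
# document token "CAR!!" and the query term "car" map to the same
# indexed form. The same function must run on documents AND queries.
def normalise(token):
    return re.sub(r"[^a-z0-9]", "", token.lower())

print(normalise("CAR!!") == normalise("car"))  # True
```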


Stemming

• Search for: "play"
  Should it match "played", "playing", "player"?
• Many morphological variations of words
  • inflectional (plurals, tenses)
  • derivational (making verbs nouns, etc.)
• In most cases, aboutness does not change
• Stemmers attempt to reduce morphological variations of words to a common stem
  • usually involves removing suffixes (in English)
• Can be done at indexing time or as part of query processing (like stopwords)
Stemmers

• Two basic types
  • Dictionary-based: uses lists of related words
  • Algorithmic: uses a program to determine related words
• Algorithmic stemmers
  • suffix-s: remove 's' endings, assuming a plural
    • e.g., cats → cat, lakes → lake, windows → window
    • Many false negatives: supplies → supplie
    • Some false positives: James → Jame
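The suffix-s stemmer above is simple enough to write out in full, errors and all:

```python
# suffix-s stemmer sketch: strip a trailing 's', assuming a plural.
def suffix_s(word):
    return word[:-1] if word.endswith("s") else word

for w in ["cats", "lakes", "windows", "supplies", "James"]:
    print(w, "->", suffix_s(w))
# cats -> cat, lakes -> lake, windows -> window,
# supplies -> supplie (false negative), James -> Jame (false positive)
```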



Porter Stemmer

• Most common algorithm for stemming English
• Conventions + 5 phases of reductions
  • phases applied sequentially
  • each phase consists of a set of commands
  • sample convention: of the rules in a compound command, select the one that applies to the longest suffix
• Example rules in the Porter stemmer
  • sses → ss (processes → process)
  • y → i (reply → repli)
  • ies → i (replies → repli)
  • ement → null (replacement → replac)


What stemming cannot fix

• Irregular verbs:
  • saw → see
  • went → go
• Different spellings
  • colour vs. color
  • tokenisation vs. tokenization
  • television vs. TV
• Synonyms
  • car vs. vehicle
  • UK vs. Britain
• Solution → query expansion …

Walid Magdy, TTDS 2017/2018
Summary

• Text pre-processing before IR:
  Tokenisation → Stopping → Stemming


Indexing Process
■ Indexing is where processed information from crawled pages gets
added to the search index.
■ The search index is what you search when you use a search engine.
That’s why getting indexed in major search engines like Google and
Bing is so important.
■ Users can’t find you unless you’re in the index.
■ How to add your website to Google index? (Assignment)
■ The purpose of storing an index is to optimize speed and
performance in finding relevant documents for a search query.
■ For example, while an index of 10,000 documents can be queried
within milliseconds, a sequential scan of every word in 10,000 large
documents could take hours.
Index data structures
Search engine architectures vary in the way indexing is
performed and in methods of index storage to meet
the various design factors.
1. Inverted index
Stores a list of occurrences of each atomic search criterion, typically in
the form of a hash table or binary tree.
2. Citation index
Stores citations or hyperlinks between documents to support citation
analysis, a subject of bibliometrics.
3. n-gram index
Stores sequences of length n of data to support other types of retrieval
or text mining.
4. Document-term matrix
Used in latent semantic analysis, stores the occurrences of words in
documents in a two-dimensional sparse matrix.
Index Creation
Storing Document Statistics

‣ The counts and positions of document terms are stored.

‣ Common index types:


1. Forward Index: Key is the document, value is a list of terms
and term positions. Easiest for the crawler to build.

2. Inverted Index: Key is a term, value is a list of documents


and term positions. Provides faster processing at query time.
Forward index

- The rationale behind developing a forward index is that, as documents are parsed, it is better to intermediately store the words per document.
- The forward index is sorted to transform it into an inverted index.
- The forward index is essentially a list of pairs consisting of a document and a word, collated by the document.
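The forward-to-inverted transformation above can be sketched on a two-document toy collection (the documents are an illustrative assumption):

```python
from collections import defaultdict

# Build a forward index while parsing, then invert it.
docs = {
    1: "i did enact julius caesar",
    2: "so let it be with caesar",
}

# forward index: docID -> list of (position, term), stored per document
forward = {d: list(enumerate(text.split())) for d, text in docs.items()}

# invert: term -> postings list of (docID, position), sorted by docID
inverted = defaultdict(list)
for d in sorted(forward):
    for pos, term in forward[d]:
        inverted[term].append((d, pos))

print(inverted["caesar"])  # [(1, 4), (2, 5)]
```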
Inverted index

• This index can only determine whether a word exists within a particular document, since it stores no information regarding the frequency and position of the word; it is therefore considered a boolean index.
• Such an index determines which documents match a query but does not rank matched documents.
Index Creation
Storing Document Statistics

‣ The counts and positions of document terms are stored

‣ Term weights are calculated and stored with the terms.


‣ The weight estimates the term’s importance to the document.

‣ The weights are used by ranking algorithms
  • e.g. TF-IDF ranks documents by the Term Frequency of the query term within the document times the Inverse Document Frequency of the term across all documents.
  • Higher scores mean the document contains more occurrences of query terms that appear in few other documents.
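The TF-IDF score described above can be sketched as tf(t, d) * log(N / df(t)); exact variants (log-scaled tf, smoothing) differ between systems, and the toy collection below is an illustrative assumption:

```python
import math

# TF-IDF sketch over a tiny toy collection.
docs = {
    1: ["caesar", "brutus", "caesar"],
    2: ["brutus", "calpurnia"],
    3: ["mercy", "worser"],
}
N = len(docs)

def df(term):
    # document frequency: how many documents contain the term
    return sum(1 for terms in docs.values() if term in terms)

def tf_idf(term, doc_id):
    tf = docs[doc_id].count(term)  # term frequency in this document
    return tf * math.log(N / df(term))

# "caesar" occurs twice in doc 1 and appears in 1 of the 3 documents
print(round(tf_idf("caesar", 1), 3))  # 2 * ln(3) ≈ 2.197
```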
Index vs Grep
• Say we have a collection of Shakespeare plays
• We want to find all plays that contain:
QUERY:
Brutus AND Caesar AND NOT Calpurnia

• Grep: start at the 1st play, read everything, and filter out plays that don't match the criteria (a linear scan over ~1M words)
• Index (a.k.a. Inverted Index): build the index data structure off-line; quick lookup at query-time.
These examples/figures are from: Manning, Raghavan, Schütze, Intro to Information Retrieval, CUP, 2008
The Shakespeare collection as Term-Document Matrix

Matrix element (t, d) is:
  1 if term t occurs in document d,
  0 otherwise

The Shakespeare collection as Term-Document Matrix

QUERY: Brutus AND Caesar AND NOT Calpurnia
Answer: "Antony and Cleopatra" (d=1), "Hamlet" (d=4)
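The query above can be evaluated directly on the incidence matrix by bitwise operations over the term rows; the rows below follow the Manning et al. Shakespeare example cited here:

```python
# Boolean retrieval over the term-document incidence matrix, using
# Python ints as bit vectors (leftmost bit = the first document).
docs = ["Antony and Cleopatra", "Julius Caesar", "The Tempest",
        "Hamlet", "Othello", "Macbeth"]
incidence = {
    "Brutus":    0b110100,
    "Caesar":    0b110111,
    "Calpurnia": 0b010000,
}
n = len(docs)
mask = (1 << n) - 1  # complement within the 6 documents only

# QUERY: Brutus AND Caesar AND NOT Calpurnia
result = (incidence["Brutus"] & incidence["Caesar"]
          & (~incidence["Calpurnia"] & mask))

answer = [docs[i] for i in range(n) if result & (1 << (n - 1 - i))]
print(answer)  # ['Antony and Cleopatra', 'Hamlet']
```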
Inverted Index Data Structure

[Figure: each term t points to a sorted list of document ids d; e.g. "Brutus" occurs in d = 1, 2, 4, … Importantly, it is a sorted list.]
Inverted Index
• Each index term is associated with an
inverted list
– Contains lists of documents, or lists of word
occurrences in documents, and other
information
– Each entry is called a posting
– The part of the posting that refers to a specific
document or location is called a pointer
– Each document in the collection is given a
unique number
– Lists are usually document-ordered (sorted by
document number)
Sec. 1.2

Inverted index

■ For each term t, we must store a list of all documents that contain t.
  – Identify each doc by a docID, a document serial number
■ Can we use fixed-size arrays for this?

  Brutus → 1, 2, 4, 11, 31, 45, 173, 174
  Caesar → 1, 2, 4, 5, 6, 16, 57, 132
  Calpurnia → 2, 31, 54, 101

What happens if the word Caesar is added to document 14?
Can we use fixed-size arrays for this?

■ The term-document matrix is sparse, since not all words are present in each document.
■ To reduce memory requirements, the index is stored differently from a two-dimensional array.
■ How can we reduce the index size? (Assignment)
Inverted index

■ We need variable-size postings lists
  – On disk, a continuous run of postings is normal and best
  – In memory, can use linked lists or variable-length arrays
■ Some tradeoffs in size / ease of insertion

  Brutus → 1, 2, 4, 11, 31, 45, 173, 174
  Caesar → 1, 2, 4, 5, 6, 16, 57, 132
  Calpurnia → 2, 31, 54, 101

Dictionary (terms) and Postings (docID lists); sorted by docID (more later on why).
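One reason postings are kept docID-sorted: two sorted lists can be intersected with a single merge pass, which is how a two-term AND query is answered. A sketch:

```python
# Merge-style intersection of two docID-sorted postings lists,
# running in O(len(a) + len(b)).
def intersect(a, b):
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

brutus = [1, 2, 4, 11, 31, 45, 173, 174]
caesar = [1, 2, 4, 5, 6, 16, 57, 132]
print(intersect(brutus, caesar))  # [1, 2, 4]  (Brutus AND Caesar)
```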

Indexer steps: Token sequence

■ Sequence of (Modified token, Document ID) pairs.

Doc 1: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious.
Indexer steps: Sort

■ Sort by terms
  – At least conceptually
■ And then by docID

This is the core indexing step.
Indexer steps: Dictionary & Postings

■ Multiple term entries in a single document are merged.
■ Split into Dictionary and Postings.
■ Document frequency information is added.
  (Why frequency? Will discuss later.)
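The indexer steps above (sort, merge duplicates, split into dictionary and postings with document frequency) can be sketched end-to-end; the (term, docID) pairs below are an illustrative assumption:

```python
# Sort (term, docID) pairs, merge duplicate entries within a document,
# then split into a dictionary (term -> document frequency) and
# docID-sorted postings lists.
pairs = [("caesar", 1), ("i", 1), ("caesar", 2), ("brutus", 1),
         ("caesar", 1), ("brutus", 2)]

pairs.sort()  # by term, then docID: the core indexing step

dictionary, postings = {}, {}
for term, doc_id in pairs:
    plist = postings.setdefault(term, [])
    if not plist or plist[-1] != doc_id:  # merge duplicate entries
        plist.append(doc_id)
for term, plist in postings.items():
    dictionary[term] = len(plist)  # document frequency

print(dictionary)           # {'brutus': 2, 'caesar': 2, 'i': 1}
print(postings["caesar"])   # [1, 2]
```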
Final inverted index

[Figure: the final index. The dictionary stores terms and counts, with pointers to postings lists of docIDs.]

IR system implementation:
• How do we index efficiently?
• How much storage do we need?
