04 - Lect4 - Text Transformation

Biomedical IR

Search Engine Architecture [part 3]


Lecture 4

Dr. Ebtsam AbdelHakam

Minia University
Indexing Process

[Figure: the indexing process — document acquisition (web crawling, provider feeds, RSS, desktop/email) feeds a data store (unique document IDs; questions of what to store, disk space, rights, compression), followed by text transformation (format conversion, internationalisation, word units, stopping, stemming — which part carries "meaning"?) and index creation: a lookup table for quickly finding all documents containing a word. © Addison Wesley, 2008]

Walid Magdy, TTDS 2017/2018


Text Transformation (Pre-processing)

• Standard text pre-processing steps:


1. Tokenisation
2. Stop word removal
3. Normalization
4. Stemming



Getting ready for indexing?
• Pre-processing steps before indexing:
• Tokenisation
• Stopping
• Stemming
• Objective  identify the optimal form of the term to
be indexed to achieve the best retrieval performance



Tokenisation
• Tokenizer: A document is converted to a stream of tokens,
e.g. individual words.

• Sentence  tokenization (splitting)  tokens


• A token is an instance of a sequence of characters
• Typical technique: split at non-letter characters (space)
• Each such token is now a candidate for an index entry
(term), after further processing

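The "split at non-letter characters" technique above can be sketched in a few lines of Python (the function name `tokenise` is illustrative, not from the lecture):

```python
import re

def tokenise(sentence):
    """Split a sentence into tokens at non-letter characters."""
    # Runs of letters become tokens; spaces, digits and punctuation
    # all act as separators.
    return re.findall(r"[A-Za-z]+", sentence)

tokens = tokenise("This is a very exciting lecture on the technologies of text!")
```

Each element of `tokens` is a candidate index term, pending stopping, normalisation and stemming.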


Stopping: stop words

• This is a very exciting lecture on the technologies of text
• Stop words: the most common words in a collection
  → the, a, is, he, she, I, him, for, on, to, very, …
• There are a lot of them: ≈ 30–40% of text
• New stop words appear in specific domains
  • Tweets: RT → "RT @realDonalTrump Mexico will …"
  • Patents: said, claim → "a said method that extracts …"
• Stop words
  • influence sentence structure
  • have less influence on topic (aboutness)
• Common practice in many applications: remove them
• You still need them for:
  • Phrase queries:
    "King of Denmark", "Let it be", "To be or not to be"
  • "Relational" queries: "flights to London"
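Stop-word removal is a simple set-membership filter; a minimal sketch, using a toy subset of the stop list from the slide (a real system would use a much larger list, and would keep positions if phrase queries must be supported):

```python
# Toy stop list taken from the examples above.
STOP_WORDS = {"the", "a", "is", "he", "she", "i", "him", "for", "on", "to", "very"}

def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list (case-insensitive)."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

# Stop words carry sentence structure but little "aboutness":
terms = remove_stop_words(["this", "is", "a", "very", "exciting", "lecture"])
```

Note how removing them here would break a phrase query like "to be or not to be", which consists entirely of stop words.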
• Objective  make words with different
surface forms look the same
• Document: “this is my CAR!!”
Query: “car”
should “car” match “CAR”?
• Sentence  tokenisation  tokens  normalisation
 terms to be indexed
• Same tokenisation/normalisation steps should be
applied to documents & queries



Stemming

• Search for: "play"
  should it match: "played", "playing", "player"?
• Stemmers attempt to reduce morphological variations of words to a common stem
  • usually involves removing suffixes (in English)
• Many morphological variations of words
  • inflectional (plurals, tenses)
  • derivational (making verbs into nouns, etc.)
• In most cases, aboutness does not change
• Can be done at indexing time or as part of query processing (like stopping)
• Two basic types of stemmers
• Dictionary-based: uses lists of related words
• Algorithmic: uses program to determine related words
• Algorithmic stemmers
• suffix-s: remove ‘s’ endings assuming plural
• e.g., cats → cat, lakes → lake, windows → window
• Many false negatives: supplies → supplie
• Some false positives: James → Jame

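The suffix-s stemmer above is small enough to write out in full, which makes its failure modes easy to see; a sketch:

```python
def suffix_s_stem(word):
    """Naive suffix-s stemmer: strip a trailing 's', assuming a plural.

    Keeps words ending in 'ss' (e.g. 'process') intact, but still
    produces false negatives (supplies -> supplie, not 'supply')
    and false positives (James -> Jame, which was never a plural).
    """
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word
```

For example, `suffix_s_stem("cats")` gives `"cat"`, but `suffix_s_stem("supplies")` gives `"supplie"` and `suffix_s_stem("James")` gives `"Jame"`, exactly the errors listed above.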


Porter stemmer

• Most common algorithm for stemming English
• Conventions + 5 phases of reductions
  • phases applied sequentially
  • each phase consists of a set of commands
  • sample convention: of the rules in a compound command, select the one that applies to the longest suffix
• Example rules in the Porter stemmer
  • sses → ss (processes → process)
  • y → i (reply → repli)
  • ies → i (replies → repli)
  • ement → null (replacement → replac)

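The longest-suffix convention can be sketched with just the four sample rules above; this is not the real Porter stemmer (which adds conditions such as a word-measure test before removing "ement", plus four more phases), only an illustration of one compound command:

```python
# The four sample rules from the slide, as (suffix, replacement) pairs.
RULES = [("sses", "ss"), ("ies", "i"), ("ement", ""), ("y", "i")]

def apply_longest_suffix(word):
    """Apply the rule whose suffix matches and is longest, if any."""
    matches = [(suf, rep) for suf, rep in RULES if word.endswith(suf)]
    if not matches:
        return word
    # Convention: of the rules that apply, select the longest suffix,
    # so "replies" uses ies -> i rather than y -> i.
    suf, rep = max(matches, key=lambda m: len(m[0]))
    return word[: len(word) - len(suf)] + rep
```

This reproduces the slide's examples: processes → process, reply → repli, replies → repli, replacement → replac.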


• Irregular verbs:
  • saw → see
  • went → go
• Different spellings:
  • colour vs. color
  • tokenisation vs. tokenization
  • television vs. TV
• Synonyms:
  • car vs. vehicle
  • UK vs. Britain

• Solution → query expansion …


Text pre-processing before IR:
Tokenisation → Stopping → Stemming

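The whole chain can be sketched by composing the three steps (with a toy stop list and the naive suffix-s stemmer; a real pipeline would use a full stop list and a proper stemmer such as Porter's):

```python
import re

STOP_WORDS = {"the", "a", "is", "on", "of", "this", "very", "to"}

def preprocess(text):
    """Tokenise -> stop -> stem, producing the terms to be indexed."""
    # Tokenisation (split at non-letters) plus case-folding normalisation.
    tokens = re.findall(r"[A-Za-z]+", text.lower())
    # Stopping: drop the most common, low-aboutness words.
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # Naive suffix-s stemming: strip a trailing 's' unless the word ends in 'ss'.
    return [t[:-1] if t.endswith("s") and not t.endswith("ss") else t
            for t in tokens]

terms = preprocess("This is a very exciting lecture on the technologies of text")
```

The same `preprocess` function must be applied to both documents and queries so that their terms meet in the index.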


Index
Creation

Storing Document Statistics

‣ The counts and positions of document terms are stored.

‣ Common index types:

1. Forward index: the key is the document; the value is a list of terms
and term positions. Easiest for the crawler to build.

2. Inverted index: the key is a term; the value is a list of documents
and term positions. Provides faster processing at query time.
Forward index

- The rationale behind building a forward index is that, as documents
are parsed, it is convenient to store the words per document as an
intermediate step.

- The forward index is sorted to transform it into an inverted index.

- The forward index is essentially a list of (document, word) pairs,
collated by document.
Inverted index

• In its simplest (boolean) form, this index can only determine whether a
word occurs in a particular document, since it stores no information
about the frequency or position of the word.

• Such an index determines which documents match a query but does
not rank the matched documents.
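The forward-to-inverted transformation described above can be sketched as follows (a toy positional index; the document IDs and function names are illustrative):

```python
def build_forward_index(docs):
    """Forward index: document -> list of (term, position) pairs.

    Easy to build while parsing, since each document is handled in isolation.
    """
    return {doc_id: [(term, pos) for pos, term in enumerate(text.lower().split())]
            for doc_id, text in docs.items()}

def invert(forward):
    """Inverted index: term -> {document: [positions]}.

    Regrouping the forward index's (document, term) pairs by term is the
    'sorting' step that turns it into an inverted index.
    """
    inverted = {}
    for doc_id, postings in forward.items():
        for term, pos in postings:
            inverted.setdefault(term, {}).setdefault(doc_id, []).append(pos)
    return inverted

docs = {"d1": "to be or not to be", "d2": "let it be"}
index = invert(build_forward_index(docs))
# index["be"] -> {"d1": [1, 5], "d2": [2]}
```

Because positions are kept, this index supports phrase queries and ranking, unlike the boolean variant described above.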
