IR Chap3

The document discusses different indexing structures for organizing terms extracted from documents. It describes sequential files, which list terms alphabetically with their associated documents but lack weights or linking. Inverted files list terms with pointers to their corresponding documents, allowing faster retrieval of relevant documents for a query. The document also mentions suffix trees and tries, which support additional applications like string matching. Overall, the document provides an overview of common indexing structures and their basic functionality.

Uploaded by

biniam teshome


Indexing structure

Abdo A.
2019/20

Outline
• Major Steps in Index Construction
• Index File Evaluation Metrics
• Building the Index File
  • Sequential File
  • Inverted File
  • Suffix Tree
  • Suffix Trie
• Suffix Tree Applications

March 8, 2020
Indexing: Basic Concepts
• Indexing is an arrangement of index terms that permits fast searching and reduces memory space requirements.
• It is used to speed up access to the desired information in a document collection in response to a user's query, so that:
  • retrieval becomes more efficient in terms of time: relevant documents are searched and retrieved quickly.
• An index file usually keeps its index terms in sorted order.
• Which list is easier to search?
  fox pig zebra hen ant cat dog lion ox
  ant cat dog fox hen lion ox pig zebra
Indexing: Basic Concepts
• An index file consists of records, called index entries.
• Index files are much smaller than the original file.
  • Remember Heaps' law: in a 1 GB text collection the vocabulary has a size of only about 5 MB. This size may be further reduced by linguistic pre-processing (text operations).
• The usual unit for indexing is the word.
• Index terms are used to look up records in a file.
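Heaps' law is usually written as V = K * n^beta, where n is the number of tokens in the collection and V the vocabulary size. The sketch below only illustrates the sublinear growth that makes the vocabulary so small; the function name and the default constants K = 50 and beta = 0.5 are illustrative assumptions, not values taken from these slides.

```python
def heaps_vocabulary_size(n_tokens, k=50.0, beta=0.5):
    """Estimate the number of distinct terms with Heaps' law, V = K * n**beta.

    k and beta are empirical, collection-dependent constants; the defaults
    here are assumed values chosen only for illustration.
    """
    return k * n_tokens ** beta


# Growing the collection 100-fold grows the estimated vocabulary far less.
small = heaps_vocabulary_size(1_000_000)
large = heaps_vocabulary_size(100_000_000)
print(small, large, large / small)
```

With beta = 0.5, a collection that is 100 times larger has an estimated vocabulary only about 10 times larger.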
Major Steps in Index Construction
• Source file: a collection of text documents
  • A document can be described by a set of representative keywords called index terms.
• Index term selection: apply text operations (preprocessing)
  • Tokenize: identify the words in a document, so that each document is represented by a list of keywords or attributes.
  • Stop-word removal: words with very high frequency carry little content and need to be removed from the text collection.
Major Steps in Index Construction …
  • Stemming: reduce the morphological variants of a word to their common stem (root word).
  • Term relevance weight: different index terms have varying relevance when used to describe document contents. This effect is captured by assigning a numerical weight to each index term of a document. There are different term-weighting methods, including TF, IDF, TF*IDF, …
• Indexing structure: the set of index terms (the vocabulary) is organized in an index file so that the documents in which each term occurs can be identified easily.
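As a rough illustration of term weighting, the sketch below computes one common TF*IDF variant, weight = tf * ln(N / df). The slides do not say which variant is intended, so the formula, the function name tf_idf and the toy documents are all assumptions.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return {doc_id: {term: weight}} using weight = tf * ln(N / df).

    docs maps a document ID to its list of (already preprocessed) index terms.
    """
    n_docs = len(docs)
    df = Counter()
    for terms in docs.values():
        df.update(set(terms))          # count each term once per document
    weights = {}
    for doc_id, terms in docs.items():
        tf = Counter(terms)            # raw term frequency in this document
        weights[doc_id] = {t: tf[t] * math.log(n_docs / df[t]) for t in tf}
    return weights


# Hypothetical toy collection.
toy_docs = {"d1": ["easy", "task", "easy"], "d2": ["difficult", "task"]}
toy_weights = tf_idf(toy_docs)
print(toy_weights)
```

A term that occurs in every document ("task" here) gets weight 0, which reflects the idea that it does not discriminate between documents.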
Basic Indexing Process
Documents to be indexed:    Friends, Romans, countrymen.
   ↓ Tokenizer
Token stream:               Friends  Romans  countrymen
   ↓ Linguistic preprocessor
Modified tokens:            friend  roman  countryman
   ↓ Indexer
Index file (inverted file): friend      → 2, 4
                            roman       → 1, 2
                            countryman  → 13, 16
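The three stages of this process (tokenizer, linguistic preprocessor, indexer) can be sketched as three small functions. This is a minimal sketch under stated assumptions: the normalization rule (lowercasing, a one-entry irregular-plural table, stripping a final "s") is a hypothetical stand-in for a real stemmer, and the document IDs are made up.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Tokenizer: split raw text into a stream of word tokens."""
    return re.findall(r"[A-Za-z]+", text)


def normalize(token):
    """Linguistic preprocessor (toy): lowercase and reduce plurals."""
    token = token.lower()
    irregular = {"countrymen": "countryman"}   # hypothetical lookup table
    if token in irregular:
        return irregular[token]
    if token.endswith("s") and len(token) > 3:
        return token[:-1]
    return token


def build_index(docs):
    """Indexer: map each normalized term to the sorted list of doc IDs."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        for term in {normalize(t) for t in tokenize(docs[doc_id])}:
            index[term].append(doc_id)
    return {term: index[term] for term in sorted(index)}


tokens = tokenize("Friends, Romans, countrymen.")
modified = [normalize(t) for t in tokens]
toy_index = build_index({1: "Friends, Romans", 2: "Romans, countrymen"})
print(tokens, modified, toy_index)
```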
Index file Evaluation Metrics
• Running time of the main operations
  • Access/search time
    • How long does it take to find the required search key in the list?
  • Update time (insertion time, deletion time)
    • How long does it take to update existing records when adding new terms or deleting unnecessary ones?
    • Does the indexing structure allow incremental updates, or does it require re-indexing?
• Space overhead
  • Computer storage space consumed for keeping the list.
Building Index file
• An index file of a document collection is a file consisting of a list of index terms, each with a link to the one or more documents that contain that term.
• An index file is a list of search terms organized for associative look-up, i.e., to answer user queries such as:
  • In which documents does a specified search term appear?
  • Where within each document does each term appear? (There may be several occurrences.)
• For organizing the index file for a collection of documents, various options are available:
  • Decide which data structure and/or file structure to use: a sequential file, an inverted file, a suffix tree, etc.
Sequential File

• A sequential file is the most primitive file structure.
• It has neither a separate vocabulary nor linking pointers.
• The records are generally arranged serially, one after another, but in lexicographic order on the value of some key field, i.e.:
  • a particular attribute is chosen as the primary key, whose value determines the order of the records;
  • when the primary key fails to discriminate among records, a second key is chosen to order them.
Example:
• Given a collection of documents, they are parsed to extract words, and these are saved with the document ID.

Doc 1: Negative affect can make it harder to do even easy tasks. so make it easy
Doc 2: positive affect can make it easier to do difficult tasks
Sorting the Vocabulary: Sequential File
• After all documents have been tokenized, stop words are removed, and normalization and stemming are applied to generate the index terms.
• These index terms are then sorted in alphabetical order in the sequential file.
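A sequential file for the two example documents can be sketched as a sorted list of (term, document ID) records, with the term as primary key and the document ID as secondary key. The stop-word list and the stem lookup table below are assumptions chosen to suit this tiny example; they are not a real stop list or stemmer.

```python
import re

# Assumed stop words and a hypothetical stem table, for this example only.
STOP_WORDS = {"can", "it", "to", "even", "so"}
STEMS = {"harder": "hard", "easier": "easy", "tasks": "task"}


def index_terms(text):
    """Tokenize, drop stop words and map each token to its (toy) stem."""
    terms = []
    for token in re.findall(r"[a-z]+", text.lower()):
        if token not in STOP_WORDS:
            terms.append(STEMS.get(token, token))
    return terms


def build_sequential_file(docs):
    """One record per term occurrence, sorted by term and then by doc ID."""
    records = [(term, doc_id)
               for doc_id, text in docs.items()
               for term in index_terms(text)]
    return sorted(records)   # tuple order: primary key term, secondary doc ID


example_docs = {
    1: "Negative affect can make it harder to do even easy tasks. so make it easy",
    2: "positive affect can make it easier to do difficult tasks",
}
sequential_file = build_sequential_file(example_docs)
for record in sequential_file:
    print(record)
```

Repeated occurrences of a term stay as separate records here, which is why the structure carries no weights and no per-term pointers.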
Sequential File

• To access records, search serially:
  • starting at the first record, read and examine each succeeding record until the required record is found or the end of the file is reached.
• Update options: does the index need to be rebuilt, or is incremental update supported?
Sequential File …

Its main advantages:
• It is easy to implement.
• It provides fast access to the next record in lexicographic order.
• When the sorted records can be read at arbitrary positions (for example, in memory), it can be searched quickly using binary search, in O(log n) time.

Its disadvantages:
• No weights are attached to terms.
• Random access is slow: since every occurrence of a term is indexed as an individual record, all the records that match the query must be found.
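The two access strategies for a sorted sequential file can be compared in a short sketch. The record layout (term, document ID) and the function names are assumptions; the binary search relies on Python's bisect module and on the records being held in a list that supports direct positioning.

```python
from bisect import bisect_left


def serial_search(records, term):
    """Scan from the first record; stop once the sorted order passes term."""
    hits = []
    for rec_term, doc_id in records:
        if rec_term == term:
            hits.append(doc_id)
        elif rec_term > term:
            break
    return hits


def binary_search(records, term):
    """Find the first matching record in O(log n), then read its neighbours."""
    i = bisect_left(records, (term,))   # (term,) sorts before (term, doc_id)
    hits = []
    while i < len(records) and records[i][0] == term:
        hits.append(records[i][1])
        i += 1
    return hits


sorted_records = [("affect", 1), ("affect", 2), ("do", 1), ("easy", 1),
                  ("easy", 1), ("easy", 2), ("task", 2)]
print(serial_search(sorted_records, "easy"))
print(binary_search(sorted_records, "easy"))
```

Both functions return the same document IDs; the difference is only in how many records are examined before the first match.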
Inverted file
• A word-oriented indexing mechanism based on a sorted list of keywords, with each keyword having links to the documents that contain it.
• Building and maintaining an inverted index is a relatively low-cost task. On a text of n words, an inverted index can be built in O(n) time.
• The list is "inverted" from a list of terms in location order to a list of terms in alphabetical order.

Original documents → word extraction → inverted files (word IDs → document IDs):
• W1: d1, d2, d3
• W2: d2, d4, d7, d9
• …
• Wn: di, … dn
Use of Inverted Files for Calculating Similarities
In the term vector space, if q is a query and dj a document, then q and dj have no terms in common iff q · dj = 0.
1. To calculate all the non-zero similarities, find R, the set of all the documents dj that contain at least one term in the query.
2. To establish the set R, merge the inverted lists for each term ti in the query with a logical OR.
3. For each dj ∈ R, calculate Similarity(q, dj), using appropriate weights.
4. Return the elements of R in ranked order.
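The four steps above can be sketched as follows, using the plain dot product q · dj as the similarity. The slides leave the similarity measure and weighting scheme open, so the dot product, the function name rank and the toy weights are all assumptions.

```python
def rank(query_weights, inverted_file):
    """Rank the documents that share at least one term with the query.

    inverted_file maps term -> {doc_id: weight of the term in that document}.
    query_weights maps term -> weight of the term in the query.
    """
    # Steps 1-2: R is the union (logical OR) of the query terms' inverted lists.
    candidates = set()
    for term in query_weights:
        candidates |= set(inverted_file.get(term, {}))
    # Step 3: similarity as the dot product of query and document weights.
    scores = {}
    for doc_id in candidates:
        scores[doc_id] = sum(
            w_q * inverted_file.get(term, {}).get(doc_id, 0.0)
            for term, w_q in query_weights.items()
        )
    # Step 4: highest score first; ties broken by document ID.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical weights for three documents.
toy_inverted_file = {
    "easy": {1: 2.0, 2: 1.0},
    "difficult": {2: 1.0},
    "negative": {3: 1.0},
}
ranking = rank({"easy": 1.0, "difficult": 2.0}, toy_inverted_file)
print(ranking)
```

Document 3 never enters R because it shares no term with the query, so its similarity is never computed.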
Inverted file
Data to be held in the inverted file includes:
• The vocabulary (list of terms):
  • is the set of all distinct words (index terms) in the text collection;
  • having information about the vocabulary speeds up searching for relevant documents.
• For each term, the inverted file contains information related to:
  • Location: all the text locations/positions where the word occurs.
  • Frequency: the frequency of occurrence of the term in the document collection.
Enhancements to Inverted Files: Concept
• Location: each posting holds information about the location of each term within the document.
  Uses:
  • user interface design: highlight the location of the search term
  • adjacency and near operators (in Boolean searching)
• Frequency: each inverted list includes the number of postings for each term.
  Uses:
  • term weighting
  • query processing optimization
Inverted file
• Having information about the location of each term within the document helps with:
  • user interface design: highlighting the location of the search term
  • proximity-based ranking: adjacency and near operators (in Boolean searching)
• Having information about frequency is used for:
  • calculating term weights (such as TF, TF*IDF, …)
  • optimizing query processing
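To show why stored locations matter, here is a minimal sketch of a NEAR operator over a positional index. The index layout (term → document ID → list of positions), the function name near and the sample positions are hypothetical.

```python
def near(positional_index, term_a, term_b, k=1):
    """Return the doc IDs in which term_a and term_b occur within k positions.

    With k=1 this behaves like an adjacency operator (in either order).
    """
    postings_a = positional_index.get(term_a, {})
    postings_b = positional_index.get(term_b, {})
    hits = set()
    # Only documents that contain both terms can match.
    for doc_id in postings_a.keys() & postings_b.keys():
        if any(abs(pa - pb) <= k
               for pa in postings_a[doc_id]
               for pb in postings_b[doc_id]):
            hits.add(doc_id)
    return sorted(hits)


# Hypothetical positions of two terms in three documents.
toy_positions = {
    "computer": {1: [4], 2: [8], 3: [6]},
    "science": {1: [5], 3: [10]},
}
print(near(toy_positions, "computer", "science", k=1))
print(near(toy_positions, "computer", "science", k=4))
```

Without the location field in each posting, the index could only say that both terms occur in documents 1 and 3, not how close together they are.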
Inverted File
Documents are organized by the terms/words they contain. This is called an index file. Text operations are performed before building the index.

Term     CF   Doc ID   TF   Location
term 1   3    2        1    66
              19       1    213
              29       1    45
term 2   4    3        1    94
              19       2    7, 212
              22       1    56
term 3   1    5        1    43
term 4   3    11       2    3, 70
              34       1    40

CF is the total frequency of term tj in the corpus.
Is it possible to keep all this information during searching?
Construction of Inverted file
An inverted index consists of two files: the vocabulary file and the posting file.
• A vocabulary file (word list):
  • stores all of the distinct terms (keywords) that appear in any of the documents, in lexicographical order (i.e., like a dictionary), and
  • for each word, a pointer to the posting file.
• The record kept for each term j in the vocabulary (word list) contains the following:
  • term j
  • number of documents in which term j occurs (DFj)
  • collection frequency of term j (CFj)
  • pointer to the inverted (postings) list for term j

Postings File (Inverted List)
• For each distinct term in the vocabulary, the posting file stores a list of pointers to the documents that contain that term.
• Each element in an inverted list is called a posting, i.e., the occurrence of a term in a document.
• Each list consists of one or more individual postings.
Advantage of dividing the inverted file into vocabulary and postings:
• Keeping a pointer in the vocabulary to the list in the posting file allows the vocabulary to be kept in memory at search time even for a large text collection, while the posting file is kept on disk and reached through the pointers to the documents.
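The split into a vocabulary and a posting file can be sketched with two plain lists. In this sketch the "pointer" is simply an offset into a flat postings list, standing in for a position in a file on disk; that representation, the function names and the input format (document ID → list of index terms) are assumptions.

```python
from bisect import bisect_left
from collections import Counter


def build_inverted_file(docs):
    """Return (vocabulary, postings).

    vocabulary: sorted list of (term, DF, CF, pointer) records.
    postings:   flat list of (doc_id, TF) entries; pointer is an offset in it.
    """
    per_term = {}
    for doc_id in sorted(docs):
        for term, tf in Counter(docs[doc_id]).items():
            per_term.setdefault(term, []).append((doc_id, tf))
    vocabulary, postings = [], []
    for term in sorted(per_term):
        plist = per_term[term]
        df = len(plist)                       # documents containing the term
        cf = sum(tf for _, tf in plist)       # occurrences in the collection
        vocabulary.append((term, df, cf, len(postings)))
        postings.extend(plist)
    return vocabulary, postings


def lookup(vocabulary, postings, term):
    """Binary-search the vocabulary, then follow the pointer to the postings."""
    i = bisect_left(vocabulary, (term,))
    if i == len(vocabulary) or vocabulary[i][0] != term:
        return []
    _, df, _, pointer = vocabulary[i]
    return postings[pointer:pointer + df]


# Index terms of the two example documents after (assumed) text operations.
term_docs = {
    1: ["negative", "affect", "make", "hard", "do", "easy", "task", "make", "easy"],
    2: ["positive", "affect", "make", "easy", "do", "difficult", "task"],
}
vocab, posts = build_inverted_file(term_docs)
print(lookup(vocab, posts, "easy"))
```

Only the vocabulary list is searched; the postings are touched once, at the offset the matching record points to.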
General Structure of Inverted File
• The layout below shows the general structure of an inverted index file: each vocabulary entry holds a pointer to its inverted list in the postings, and the postings in turn refer to the documents.

Organization of Index File

Vocabulary (word list)                  Postings (inverted list)   Documents
Term     DF   CF   Pointer to posting
term 1   3    3    →                    inverted lists
term 2   3    4    →
term 3   1    1    →
term 4   2    3    →
Example:
• Given a collection of documents, they are parsed to extract words, and these are saved with the document ID.

Doc 1: Negative affect can make it harder to do even easy tasks. so make it easy
Doc 2: positive affect can make it easier to do difficult tasks
Sorting the Vocabulary
• After all documents have been tokenized, the inverted file is sorted by terms.
• Steps:
  • Extract the terms in each document.
  • Sort the terms.
  • Compile the terms, i.e., collect the frequencies for each term.
Remove stop words, apply stemming, and compute frequency
• Multiple entries of a term in a single document are merged, and frequency information is added.
Vocabulary and Postings File
The file is commonly split into a dictionary (vocabulary) and a posting file; each vocabulary entry holds a pointer to its postings.

Vocabulary                  Postings
Term        DF   CF         (Doc #, TF)
affect      2    2     →    (1, 1), (2, 1)
difficult   1    1     →    (2, 1)
do          2    2     →    (1, 1), (2, 1)
easy        2    3     →    (1, 2), (2, 1)
hard        1    1     →    (1, 1)
make        2    3     →    (1, 2), (2, 1)
negative    1    1     →    (1, 1)
positive    1    1     →    (2, 1)
task        2    2     →    (1, 1), (2, 1)
Searching on Inverted File
• Since the whole index file is divided into two parts, searching can be done faster by loading the vocabulary list, which takes little memory even for a large document collection.
• Using binary search, searching the vocabulary list takes logarithmic time.
• Updating an inverted file is complex:
  • both the vocabulary and the posting file need to be updated.
Example: Create Inverted file
• Create an inverted file (both the vocabulary list and the posting file) for the following document collection:
D1 The Department of Computer Science was established in 1984.
D2 The Department launched its first BSc in Computer Studies in 1987.
D3 Followed by the MSc in Computer Science which was started in 1991.
D4 The Department also produced its first PhD graduate in 1994.
D5 Our staff have contributed intellectually and professionally to the advancements in these fields.
Example: Create Inverted File (continued)
• After text operations (stop-word removal and stemming) have been performed, only the content-bearing terms of each document remain as index terms:
  • D1 = department comput science establish
  • D2 = department launch bsc comput study
  • D3 = follow msc comput science start
  • D4 = department produce phd graduat
  • D5 = staff contribut intellect profession advance field
Vocabulary                         Postings
Word         DF   CF   WID         Doc #   TF   mTF   Loc
advance      1    1    W1     →    5       1    1     5
bsc          1    1    W2     →    2       1    1     3
comput       3    3    W3     →    1       1    1     2
                                   2       1    1     4
                                   3       1    1     2
contribut    1    1    W4     →    5       1    1     4
department   3    3    W5     →    1       1    1     1
                                   2       1    1     1
                                   4       1    1     1
establish    1    1    W6
field        1    1    W7
follow       1    1
graduat      1    1
intellect    1    1
launch       1    1
msc          1    1
phd          1    1
produce      1    1
profession   1    1
science      2    2
staff        1    1
start        1    1
study        1    1

(The postings continue in the same way for the remaining terms.)
All term-specific information (max TF, TF, TF-IDF, location, etc.) is stored in the posting file. The postings point into the document file: W1: d5; W2: d2; W3: d1, d2, d3; …; Wn: di, … dn.
Suffix trie
• A suffix trie is an ordinary trie in which the input strings are all possible suffixes of a text.
  – Principle: the idea behind the suffix trie is to assign to each symbol in a text an index corresponding to its position in the text (i.e., the first symbol has index 1 and the last symbol has index n, the number of symbols in the text).
  • To build the suffix trie we use these indices instead of the actual objects.
• The structure has several advantages:
  • Storing indices instead of the suffixes themselves requires less storage space.
  • We do not have to worry about how the text is represented (binary, ASCII, etc.).
  • We do not have to store the same object twice (no duplicates).
Suffix Trie
• Construct a suffix trie for the following string: GOOGOL
• We begin by assigning a position to every suffix in the text, from left to right, according to where its first character occurs in the string.
  TEXT:     G O O G O L $
  POSITION: 1 2 3 4 5 6 7
• Build a suffix trie for all n suffixes of the text.
• Note: the resulting trie has n leaves and height n.
• This structure is particularly useful for any application requiring prefix-based ("starts with") pattern matching.
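A suffix trie for GOOGOL$ can be sketched with nested dictionaries, one level per symbol, inserting the suffixes one at a time. The representation (a dict per node, with the key "leaf" holding the 1-based start position) is an assumption chosen for brevity; this naive insertion takes quadratic time and space in the worst case.

```python
def build_suffix_trie(text):
    """Insert every suffix of text (which must end in a unique '$')."""
    root = {}
    for start in range(len(text)):
        node = root
        for symbol in text[start:]:
            node = node.setdefault(symbol, {})
        node["leaf"] = start + 1      # 1-based position of the suffix
    return root


def count_leaves(node):
    """Number of leaves below node; each leaf marks one suffix."""
    if "leaf" in node:
        return 1
    return sum(count_leaves(child) for child in node.values())


def height(node):
    """Length in symbols of the longest root-to-leaf path."""
    children = [child for key, child in node.items() if key != "leaf"]
    if not children:
        return 0
    return 1 + max(height(child) for child in children)


googol_trie = build_suffix_trie("GOOGOL$")
print(sorted(googol_trie))            # first symbols of the suffixes
print(count_leaves(googol_trie), height(googol_trie))
```

For the 7 suffixes of GOOGOL$ the sketch yields 7 leaves and a height of 7, in line with the note that the trie has n leaves and height n.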
Suffix tree
• A suffix tree is a compact form of the suffix trie: it represents all the suffixes of a string S.
• The suffix tree is created by compacting the unary nodes (nodes with a single child) of the suffix trie.
• We store pointers rather than words in the leaves.
• It is also possible to replace the string on every edge by a pair (a, b), where a and b are the beginning and end indices of that string in the text, e.g., for GOOGOL$:
  (3,7) for OGOL$
  (1,2) for GO
  (7,7) for $
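The compaction and the (a, b) edge labels can be sketched with a naive recursive construction: suffixes are grouped by their next symbol, and each group's edge is extended for as long as all suffixes in the group agree. This is an illustrative quadratic-time sketch, not a linear-time suffix tree algorithm; the function names and the dictionary representation are assumptions.

```python
def build_suffix_tree(text, starts=None, depth=0):
    """Return a node as {(a, b): child}, with 1-based inclusive index pairs.

    A child is either another node (dict) or, for a leaf, the 1-based start
    position of the suffix. text must end in a unique terminator such as '$'.
    """
    if starts is None:
        starts = list(range(len(text)))
    groups = {}
    for s in starts:                          # group suffixes by next symbol
        groups.setdefault(text[s + depth], []).append(s)
    node = {}
    for group in groups.values():
        first = group[0]
        if len(group) == 1:                   # leaf: edge runs to end of text
            node[(first + depth + 1, len(text))] = first + 1
            continue
        length = 1                            # extend while all suffixes agree
        while all(text[s + depth + length] == text[first + depth + length]
                  for s in group):
            length += 1
        label = (first + depth + 1, first + depth + length)
        node[label] = build_suffix_tree(text, group, depth + length)
    return node


def edge_label(text, a, b):
    """Recover the string that the pair (a, b) stands for."""
    return text[a - 1:b]


googol_tree = build_suffix_tree("GOOGOL$")
print(sorted(googol_tree))
print(edge_label("GOOGOL$", 3, 7), edge_label("GOOGOL$", 1, 2))
```

For GOOGOL$ the root gets the edges (1,2), (2,2), (6,7) and (7,7), and below the GO edge the pair (3,7) leads to the leaf for suffix 1, matching the pairs listed above.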
Example: Suffix tree
• Let s = abab. A suffix tree of s is a compressed trie of all suffixes of s$ = abab$:
  { $, b$, ab$, bab$, abab$ }
• We label each leaf with the starting point of the corresponding suffix.

  root
    ab
      ab$  → leaf 1
      $    → leaf 3
    b
      ab$  → leaf 2
      $    → leaf 4
    $      → leaf 5
Generalized suffix tree
• Given a set of strings S, a generalized suffix tree of S is a compressed trie of all suffixes of every s ∈ S.
• To make the suffixes prefix-free, we add a special character, $, at the end of s.
• To associate each suffix with a unique string in S, add a different special symbol to each s.
  • Build a suffix tree for the string s1$s2#, where '$' and '#' are special terminators for s1 and s2.
• Example: let s1 = abab and s2 = aab. The suffixes are:
  s1: 1. abab$   2. bab$   3. ab$   4. b$   5. $
  s2: 1. aab#    2. ab#    3. b#    4. #
  and a generalized suffix tree for s1 and s2 is:

  root
    a
      ab#    → s2, leaf 1
      b
        ab$  → s1, leaf 1
        $    → s1, leaf 3
        #    → s2, leaf 2
    b
      ab$    → s1, leaf 2
      $      → s1, leaf 4
      #      → s2, leaf 3
    $        → s1, leaf 5
    #        → s2, leaf 4
Search in suffix tree
• Searching for all occurrences of a pattern S in a text is easy with a suffix tree, since every substring of the text is a prefix of some suffix.
• Pseudo-code for searching in a suffix tree:
  • Start at the root.
  • Go down the tree, taking at each step the path that matches the next symbols of S.
  • If S ends at a node x, return all the leaves in the subtree rooted at x:
    • the places where S can be found are given by the pointers in all the leaves of that subtree.
  • If a NIL pointer is encountered before reaching the end of S, then S is not in the tree.
Search in suffix tree
Example (text GOOGOL$):
1. Find GO
2. Find OR
• If S = "GO", we take the GO path and return the suffixes GOOGOL$ and GOL$ (positions 1 and 4).
• If S = "OR", we take the O path and then hit a NIL pointer, so "OR" is not in the tree.
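The search procedure can be sketched on an uncompressed suffix trie, which keeps the code short; a real suffix tree would match whole edge labels instead of single symbols. The nested-dictionary representation, with the key "leaf" holding the 1-based suffix position, and the function names are assumptions.

```python
def build_suffix_trie(text):
    """Nested-dict suffix trie; 'leaf' stores the 1-based suffix position."""
    root = {}
    for start in range(len(text)):
        node = root
        for symbol in text[start:]:
            node = node.setdefault(symbol, {})
        node["leaf"] = start + 1
    return root


def collect_leaves(node):
    """Gather the positions stored in all leaves below node."""
    positions = []
    for key, child in node.items():
        if key == "leaf":
            positions.append(child)
        else:
            positions.extend(collect_leaves(child))
    return positions


def find(trie, pattern):
    """Walk down from the root; a missing branch plays the role of NIL."""
    node = trie
    for symbol in pattern:
        if symbol not in node:
            return []                 # pattern does not occur in the text
        node = node[symbol]
    return sorted(collect_leaves(node))


search_trie = build_suffix_trie("GOOGOL$")
print(find(search_trie, "GO"))
print(find(search_trie, "OR"))
```

The walk itself takes time proportional to the length of the pattern; collecting the leaves adds time that depends on the size of the subtree reached.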
Exercise

• Given the following index terms: worker, word and world, construct an index file using a suffix tree.
Suffix Tree Applications
• Suffix trees can be used to solve a large number of string problems that occur in:
  • text editing,
  • free-text search,
  • etc.
• Some examples of string problems are given below:
  • String matching
  • Longest common substring
  • Longest repeated substring
  • Palindromes
  • etc.
Complexity Analysis
• The suffix tree for a string of length n can be built in O(n²) time by inserting the suffixes one at a time; linear-time construction algorithms (such as Ukkonen's) also exist.
• Searching is very fast: searching for a substring[1..m] in string[1..n] takes O(m) time, i.e., time linear in the length of the search string.
• The number of leaves is n+1, where n is the length of the input string: one leaf per suffix, including the suffix that consists only of the terminator.
• Furthermore, in the leaves, we may store either the strings themselves or pointers to the strings (that is, integers).

Thank you
