Chapter 3 Part 1
Designing an IR System
Our focus during IR system design:
• Improving the effectiveness of the system
–The concern here is retrieving more relevant documents for a user's query
–Effectiveness of the system is measured in terms of precision, recall, …
–Main emphasis: stemming, stopword removal, weighting schemes, matching algorithms
• Searching
–An online process that scans the document corpus to find relevant documents that match the user's query
Indexing Subsystem
documents
 → assign document identifier (document IDs)
 → tokenization (tokens)
 → stopword removal (non-stoplist tokens)
 → stemming & normalization (stemmed terms)
 → term weighting (weighted index terms)
 → Index File
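The indexing pipeline can be sketched in a few lines of Python. This is a minimal illustration only: the stopword list and the plural-stripping "stemmer" are toy stand-ins, not the techniques a production system would use.

```python
# Minimal sketch of the indexing pipeline. The stopword list and the
# plural-stripping "stemmer" are simplified stand-ins for illustration.
STOPWORDS = {"the", "a", "an", "is", "in", "to", "for", "of"}

def stem(token):
    # Toy stemmer: strip a plural "s"; real systems use e.g. Porter's algorithm.
    return token[:-1] if token.endswith("s") and len(token) > 3 else token

def index_document(doc_id, text, index):
    tokens = text.lower().split()                        # tokenization
    tokens = [t for t in tokens if t not in STOPWORDS]   # stopword removal
    terms = [stem(t) for t in tokens]                    # stemming & normalization
    for term in terms:                                   # term weighting: raw TF
        index.setdefault(term, {})
        index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return index
```

Each call adds one document's weighted terms to the index structure (a dict mapping term → {document ID: term frequency}).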
Searching Subsystem
query
 → parse query (query tokens)
 → stopword removal (non-stoplist tokens)
 → stemming & normalization (stemmed terms)
 → term weighting (query terms)
 → similarity measure against the index terms in the Index File
 → ranking (ranked document set)
 → relevant document set
Basic assertion
Indexing and searching are inexorably connected:
– you cannot search what was not first indexed in some manner or other
– documents or objects are indexed in order to be searchable
• there are many ways to do indexing
– to index, one needs an indexing language
• there are many indexing languages
• even taking every word in a document is an indexing language
•Indexing text
–Organizing indexes
• What techniques to use? How to select them?
–Storage of indexes
• Is compression required? Do we store in memory or on disk?
•Accessing text
–Accessing indexes
• How to access the indexes? What data/file structure to use?
–Processing indexes
• How to search a given query in the index? How to update the index?
–Accessing documents
Text Compression
•Text compression is about finding ways to represent the text in fewer bits or bytes
•Advantages:
–saves storage space
–speeds up document transmission
–searching the compressed text can take less time
•Common compression methods
–Statistical methods: require statistical information about the frequency of occurrence of symbols in the document
E.g. Huffman coding
•Estimate probabilities of symbols, code one symbol at a time, assign shorter codes to high-probability symbols
–Adaptive methods: construct the dictionary in the course of compression
E.g. Ziv-Lempel compression:
•Replace words or symbols with a pointer to dictionary entries
Huffman coding
•Developed in the 1950s by David Huffman; widely used for text compression, multimedia codecs and message transmission
•The problem: given a set of n symbols and their weights (or frequencies), construct a tree structure (a binary tree for binary codes) with the objective of reducing memory space & decoding time per symbol
•Huffman coding is constructed based on the frequency of occurrence of letters in text documents
•Example code tree (0 on left branches, 1 on right branches) yields the codes:
 D1 = 000
 D2 = 001
 D3 = 01
 D4 = 1
How to construct Huffman coding
Step 1: Create a forest of single-node trees, one for each symbol t1, t2, … tn
Step 2: Sort the forest of trees according to falling probabilities of symbol occurrence
Step 3: WHILE more than one tree exists DO
– Merge the two trees t1 and t2 with the least probabilities p1 and p2
– Label their root with the sum p1 + p2
– Associate binary codes: 1 with the right branch and 0 with the left branch
Step 4: Create a unique codeword for each symbol by traversing the tree from the root to the leaf
– Concatenate all 0s and 1s encountered during the traversal
• The resulting tree has a probability of 1 at its root and the symbols at its leaf nodes.
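The steps above can be sketched in Python, using a min-heap as the sorted forest. Symbol weights may be probabilities or raw counts; this is a sketch, and tie-breaking (the extra counter in each heap entry) may produce different but equally optimal trees than a hand-drawn construction.

```python
import heapq

def huffman_codes(weights):
    """Build a Huffman code table from a {symbol: weight} mapping."""
    # Steps 1-2: a forest of single-node trees kept sorted by weight (min-heap).
    # A tree is either a symbol (leaf) or a (left, right) pair (internal node).
    heap = [(w, i, sym) for i, (sym, w) in enumerate(sorted(weights.items()))]
    heapq.heapify(heap)
    count = len(heap)
    # Step 3: repeatedly merge the two least-weight trees; label root with the sum.
    while len(heap) > 1:
        w1, _, t1 = heapq.heappop(heap)
        w2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, count, (t1, t2)))
        count += 1
    # Step 4: traverse root-to-leaf, concatenating 0 (left) / 1 (right).
    codes = {}
    def walk(tree, prefix):
        if isinstance(tree, tuple):
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:
            codes[tree] = prefix or "0"   # lone-symbol edge case
    walk(heap[0][2], "")
    return codes
```

For the count table given further below (a: 16, b: 5, c: 12, d: 17, e: 10, t: 25), the resulting codes encode the 85 characters in 212 bits, versus 85 × 8 = 680 bits uncompressed.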
Example
• Consider the 7-symbol alphabet given in the following table and construct the Huffman coding.

Symbol  Probability
a       0.05
b       0.05
c       0.1
d       0.2
e       0.3
f       0.2
g       0.1

• The Huffman encoding algorithm picks, at each step, the two symbols (or subtrees) with the smallest frequencies to combine.
Huffman code tree
[Figure: Huffman code tree for the example. The root (probability 1) splits into a 0.4 subtree = d (0.2) + f (0.2) and a 0.6 subtree = e (0.3) + a 0.3 subtree, where 0.3 = g (0.1) + 0.2, 0.2 = c (0.1) + 0.1, and 0.1 = a (0.05) + b (0.05). Left branches are labelled 0 and right branches 1.]
Character:  a   b   c   d   e   t
Frequency:  16  5   12  17  10  25
•Ziv-Lempel compression
–Does not rely on previous knowledge about the data
–Rather, builds this knowledge in the course of data transmission/data storage
–The Ziv-Lempel algorithm (called LZ) uses a table of code-words created during data transmission;
•each time, it replaces strings of characters with a reference to a previous occurrence of the string.
Lempel-Ziv Compression Algorithm
• The multi-symbol patterns are of the form C0C1…Cn-1Cn. The prefix of a pattern consists of all the pattern symbols except the last: C0C1…Cn-1
• Example strings:
1. Mississippi
2. ABBCBCABABCAABCAAB
3. SATATASACITASA
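A minimal LZ78-style sketch of this idea follows. The dictionary of phrases is built during compression, and the output is a list of (dictionary index, next character) pairs; the exact output format of real LZ variants differs.

```python
def lz78_compress(text):
    """Compress text into (dictionary index, next char) pairs (LZ78-style sketch)."""
    dictionary = {}   # phrase -> index (index 0 means "empty prefix")
    result = []
    phrase = ""
    for ch in text:
        if phrase + ch in dictionary:
            phrase += ch                      # extend the current match
        else:
            result.append((dictionary.get(phrase, 0), ch))
            dictionary[phrase + ch] = len(dictionary) + 1
            phrase = ""
    if phrase:                                # leftover phrase at end of input
        result.append((dictionary[phrase], ""))
    return result

def lz78_decompress(codes):
    """Rebuild the text by replaying the dictionary construction."""
    dictionary = {0: ""}
    out = []
    for idx, ch in codes:
        entry = dictionary[idx] + ch
        out.append(entry)
        dictionary[len(dictionary)] = entry
    return "".join(out)
```

For instance, "Mississippi" compresses to the phrase sequence M, i, s, si, ss, ip, p, i, and decompression reproduces the original string exactly.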
Indexing structures: Exercise
• Discuss in detail the theoretical and algorithmic concepts (including construction, various operations, complexity, etc.) of the following commonly used data structures.
[Figure: a tokenizer converting the text "Friends Romans countrymen" into a token stream; index entries record term positions, e.g. countryman → 13, 16.]
Building Index file
•An index file of a document collection is a file consisting of a list of index terms and a link to one or more documents that contain each index term
–A good index file maps each keyword Ki to the set of documents Di that contain the keyword
•An index file is a list of search terms organized for associative look-up, i.e., to answer a user's query:
–In which documents does a specified search term appear?
–Where within each document does each term appear? (There may be several occurrences.)
•For organizing an index file for a collection of documents, various options are available:
–Decide what data structure and/or file structure to use: a sequential file, inverted file, suffix array, signature file, etc.?
Index file Evaluation Metrics
• Running time
–Indexing time
–Access/search time
–Update time (insertion time, deletion time, modification time, …)
• Space overhead
–Computer storage space consumed
Doc 2: So let it be with Caesar. The noble Brutus has told you Caesar was ambitious
Sorting the Vocabulary
• After all documents have been tokenized, stopwords are removed, and normalization and stemming are applied, to generate index terms
• These index terms in the sequential file are sorted in alphabetical order

Token stream (Term, Doc #):
I 1, did 1, enact 1, julius 1, caesar 1, I 1, was 1, killed 1, I 1, the 1, capitol 1, brutus 1, killed 1, me 1, so 2, let 2, it 2, be 2, with 2, caesar 2, the 2, noble 2, brutus 2, hath 2, told 2, you 2, caesar 2, was 2, ambitious 2

Sequential file (Term No., Term, Doc #):
1   ambition  2
2   brutus    1
3   brutus    2
4   capitol   1
5   caesar    1
6   caesar    2
7   caesar    2
8   enact     1
9   julius    1
10  kill      1
11  kill      1
12  noble     2
Sequential File
• Its main advantages are:
– easy to implement;
– provides fast access to the next record using lexicographic order;
– instead of a linear-time search, one can search in logarithmic time using binary search.
• Its disadvantages:
– difficult to update: the index must be rebuilt if a new term is added, and inserting a new record may require moving a large proportion of the file;
– random access is extremely slow.
• The problem of update can be solved:
– by ordering records by date of acquisition rather than by key value; hence, the newest entries are added at the end of the file and therefore pose no difficulty to updating. But searching becomes very tough: it requires linear time.
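The logarithmic-time lookup on a sorted sequential file is ordinary binary search; a short sketch using Python's standard bisect module:

```python
from bisect import bisect_left

def search_term(sorted_terms, term):
    """Binary search in a sorted sequential file: O(log n) per lookup.

    Returns the record position of the term, or -1 if it is absent.
    """
    pos = bisect_left(sorted_terms, term)
    if pos < len(sorted_terms) and sorted_terms[pos] == term:
        return pos
    return -1
```

For example, searching "caesar" in the sorted list ["ambition", "brutus", "caesar", "capitol", "enact"] returns position 2, while a term not in the file returns -1.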
Inverted file
• A word-oriented indexing mechanism based on a sorted list of keywords, with each keyword having links to the documents containing it
–Building and maintaining an inverted index is a relatively low-cost task: on a text of n words, an inverted index can be built in O(n) time
•Why location?
– Having information about the location of each term within the document helps with:
•user interface design: highlighting the location of search terms
•proximity-based ranking: adjacency and NEAR operators (in Boolean searching)
•Why frequencies?
– Having information about frequency is used for:
–calculating term weights (like TF, TF*IDF, …)
–optimizing query processing
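As a concrete illustration of frequency-based weighting, one common TF*IDF variant (the exact formula varies by system) multiplies the raw term frequency by the log inverse document frequency:

```python
import math

def tf_idf(tf, df, n_docs):
    """One common TF*IDF variant: w = tf * log10(N / df).

    tf     -- frequency of the term in the document
    df     -- number of documents containing the term
    n_docs -- total number of documents in the collection (N)
    """
    return tf * math.log10(n_docs / df)
```

A term occurring in every document gets weight 0 (it carries no discriminating power), while rare terms are boosted.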
Inverted File
Documents are organized by the terms/words they contain. This is called an index file. Text operations are performed before building the index.

Term   CF  Doc ID  TF  Location
auto   3   2       1   66
           19      1   213
           29      1   45
bus    4   3       1   94
           19      2   7, 212
           22      1   56
taxi   1   5       1   43
train  3   11      2   3, 70
           34      1   40
Construction of Inverted file
•An inverted index consists of two files:
–a vocabulary file
–a posting file
Advantage of dividing the inverted file:
•Keeping a pointer in the vocabulary to the list in the posting file allows:
– the vocabulary to be kept in memory at search time, even for large text collections, and
– the posting file to be kept on disk for accessing the documents
Inverted index storage
•Separation of the inverted file into a vocabulary file and a posting file is a good idea.
–Vocabulary: for searching we need only the word list. This allows the vocabulary to be kept in memory at search time, since the space required for the vocabulary is small.
• The vocabulary grows as O(n^β), where β is a constant between 0 and 1.
• Example: a collection of 1,000,000,000 documents may contain only 1,000,000 distinct words. Hence, the size of the vocabulary is about 100 MB, which can easily be held in the memory of a dedicated computer.
Sorting the Vocabulary
• After all documents have been tokenized, the inverted file is sorted by terms.

Unsorted (Term, Doc #):
I 1, did 1, enact 1, julius 1, caesar 1, I 1, was 1, killed 1, I 1, the 1, capitol 1, brutus 1, killed 1, me 1, so 2, let 2, it 2, be 2, with 2, caesar 2, the 2, noble 2, brutus 2, has 2, told 2, you 2, caesar 2, was 2, ambitious 2

Sorted by term (Term, Doc #):
ambitious 2, be 2, brutus 1, brutus 2, capitol 1, caesar 1, caesar 2, caesar 2, did 1, enact 1, has 2, I 1, I 1, I 1, it 2, julius 1, killed 1, killed 1, let 2, me 1, noble 2, so 2, the 1, the 2, told 2, was 1, was 2, with 2, you 2
Remove stopwords, apply stemming & compute term frequency
•Multiple term entries in a single document are merged and frequency information is added
•Counting the number of occurrences of terms in the collection helps to compute TF

After stopword removal and stemming (Term, Doc #):
ambition 2, brutus 1, brutus 2, capitol 1, caesar 1, caesar 2, caesar 2, enact 1, julius 1, kill 1, kill 1, noble 2

After merging entries (Term, Doc #, TF):
ambition  2  1
brutus    1  1
brutus    2  1
capitol   1  1
caesar    1  1
caesar    2  2
enact     1  1
julius    1  1
kill      1  2
noble     2  1
Vocabulary and postings file
The file is commonly split into a dictionary (vocabulary) and a posting file.

Vocabulary (Term, DF, CF)    Postings (Doc #, TF)
ambition  1  1            →  (2, 1)
brutus    2  2            →  (1, 1), (2, 1)
capitol   1  1            →  (1, 1)
caesar    2  3            →  (1, 1), (2, 2)
enact     1  1            →  (1, 1)
julius    1  1            →  (1, 1)
kill      1  2            →  (1, 2)
noble     1  1            →  (2, 1)

Each vocabulary entry holds a pointer to its postings list.
Complexity Analysis
• The inverted index can be built in O(n) time, where n is the number of words in the collection.
• Since the terms in the vocabulary file are sorted, searching takes logarithmic time.
• To update the inverted index, incremental indexing can be applied, which requires O(k) time, where k is the number of new index terms.
Exercises
• Construct the inverted index for the following document collection.
Doc 1 : New home to home sales forecasts
Doc 2 : Rise in home sales in July
Doc 3 : Home sales rise in July for new homes
Doc 4 : July new home sales rise
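One possible starting point for this exercise is sketched below. It applies only lowercasing plus a naive plural-stripping normalization (so "homes" and "home" merge); no stopword removal is done here, and a real system would use a proper stemmer.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Build an inverted index: term -> {doc_id: term frequency}.

    Normalization is deliberately naive (lowercase + strip a plural "s").
    """
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            term = token[:-1] if token.endswith("s") and len(token) > 3 else token
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return index

docs = {
    1: "New home to home sales forecasts",
    2: "Rise in home sales in July",
    3: "Home sales rise in July for new homes",
    4: "July new home sales rise",
}
index = build_inverted_index(docs)
```

With this normalization, "home" has the postings {1: 2, 2: 1, 3: 2, 4: 1}, "rise" appears in documents 2, 3 and 4, and "sale" appears once in every document.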