Learning Guide Unit 2
1 of 15 11/27/2024, 3:22 PM
Learning Guide Unit 2 | Home https://my.uopeople.edu/mod/book/tool/print/index.php?id=443823
• Dictionaries
◦ Binary Trees
◦ B-Tree
• Wildcard queries
◦ Permuterm index
◦ K-gram index
• Spelling correction
◦ Isolated term correction
◦ Context sensitive correction
• Index Construction
• Dynamic indexing
• Corpus
• Single pass in-memory indexing
• Distributed indexing
• Ranked retrieval
1. Recognize the dictionary data structure and be able to implement a dictionary within an information retrieval system.
2. Describe the different options for implementing wildcard queries including:
◦ Permuterm indexes
◦ K-gram indexes
3. Articulate different approaches for the implementation of spelling correction including:
◦ Isolated-term correction approaches
▪ Edit distance
▪ K-gram overlap
◦ Context-sensitive correction
▪ Phonetic correction using Soundex algorithms
4. Describe computer hardware limitations and their impact on indexing processes
5. Describe Indexer architectures including:
◦ Blocked sort-based indexing
◦ Single-pass in-memory indexing
◦ Distributed indexing
◦ Dynamic Indexing
Unit 2 explores some of the most important concepts in information retrieval. In Chapter 3 of the text we learn about the concept of a
dictionary and how it is used in information retrieval. In Chapter 4 we examine different algorithms for creating the inverted index,
the structure that we use to search for specific terms and retrieve documents. Together these topics are foundational
for information retrieval.
The image that we have of a dictionary is typically of a large book that lists words in alphabetical order and gives, for each word,
a definition of its meaning. This picture is helpful in understanding the dictionary data
structure that is used in information retrieval. In IR, the dictionary contains the terms or words extracted from the collection of
documents, which we refer to as the corpus. In the IR dictionary we also define each word, but in terms of the documents that the word
appears in. The dictionary structure may be used for terms from the documents as well as for the names of documents.
The basic idea of a dictionary is that each unique term or document appears in the dictionary only once and that the dictionary itself is
structured in a way that allows a specific item to be searched for and found efficiently. Much of Chapter 3 in our text deals with
techniques to search the dictionary efficiently. One aspect that is covered is the structure of the dictionary. If the dictionary were an
unsorted list of terms, search would be very inefficient because the entire list would potentially have to be scanned to find the
correct entry. Sorting the list makes search more efficient because it becomes possible to skip through
the list without examining every item. A tree structure such as a binary tree or a B-tree goes further: it narrows the search at every
node, much as a binary search does, and it remains efficient as terms are added.
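The difference between scanning an unsorted list and searching a sorted one can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the term list is made up, the function names are hypothetical, and a sorted Python list stands in for the dictionary (a real system would more likely use a tree or hash structure).

```python
from bisect import bisect_left


def linear_lookup(terms, target):
    """Scan an unsorted list; in the worst case every entry is examined."""
    for position, term in enumerate(terms):
        if term == target:
            return position
    return -1


def sorted_lookup(sorted_terms, target):
    """Binary search over a sorted list; roughly log2(n) comparisons."""
    position = bisect_left(sorted_terms, target)
    if position < len(sorted_terms) and sorted_terms[position] == target:
        return position
    return -1


# Hypothetical vocabulary used only for this illustration.
terms = ["retrieval", "index", "corpus", "posting", "term"]
sorted_terms = sorted(terms)

print(linear_lookup(terms, "posting"))         # position in the unsorted list
print(sorted_lookup(sorted_terms, "posting"))  # position in the sorted list
```

Both functions return -1 when the term is absent, but only the second one can rule a term out without looking at every entry.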
Other aspects of efficient searching that are covered include wildcard searches and spelling correction, both for terms that are
indexed and for the terms that are supplied in a query. Both of these concepts contribute to search efficiency. A successful search
is one where you can find the information that you are looking for and do so efficiently. Providing the ability to search imprecisely for
a term raises the possibility of getting results. The idea behind spelling correction is that a search will only find matching terms
(unless wildcards are used), so it makes sense to ensure that terms are spelled correctly, which improves the chances that the
search terms and the index terms match.
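As an illustration of how a wildcard search can be supported, here is a minimal sketch of a k-gram index, one of the two wildcard techniques named in this unit's objectives. The vocabulary, the function names, the choice of k = 2, and the use of `$` as a word-boundary marker are all assumptions made for the example. Candidates found through the k-grams are filtered against the original pattern, because sharing k-grams does not guarantee a match.

```python
from collections import defaultdict
from fnmatch import fnmatchcase


def kgrams(text, k=2):
    """Return the set of k-character substrings of text."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def build_kgram_index(vocabulary, k=2):
    """Map each k-gram (with '$' boundary markers) to the terms containing it."""
    index = defaultdict(set)
    for term in vocabulary:
        for gram in kgrams("$" + term + "$", k):
            index[gram].add(term)
    return index


def wildcard_lookup(pattern, index, k=2):
    """Find terms matching a pattern in which '*' stands for any characters."""
    grams = set()
    for piece in ("$" + pattern + "$").split("*"):
        grams |= kgrams(piece, k)
    if grams:
        # Terms must contain every k-gram taken from the pattern.
        candidates = set.intersection(*(index.get(g, set()) for g in grams))
    else:
        # The pattern gave no k-grams (for example "*"): consider every term.
        candidates = set().union(*index.values())
    # Post-filter: keep only terms that really match the pattern.
    return sorted(t for t in candidates if fnmatchcase(t, pattern))


# Hypothetical vocabulary used only for this illustration.
vocabulary = ["information", "inform", "index", "jeopardy", "lost"]
kgram_index = build_kgram_index(vocabulary)

print(wildcard_lookup("inf*", kgram_index))
print(wildcard_lookup("*ion", kgram_index))
```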
In Chapter 4 we learn about the actual construction of the inverted index. Four algorithms are discussed: block sort-based indexing,
single-pass in-memory indexing, distributed indexing, and dynamic indexing.
Block sort-based indexing is a technique that can be used on a corpus (document collection) that is relatively small and static (doesn’t
experience a lot of change). The block sort-based index algorithm passes through the corpus identifying all of the term id/document id
pairs. The idea of a term id is that each term is stored in a dictionary structure that contains the term and its term id, which is
typically a number assigned to each term. Similarly, the document dictionary is a structure that contains each document name and a
number, the document id, assigned to uniquely identify that document.
The pairs are then sorted by term id and then by document id. This two-step process is referred to as an inversion. The combination of a term
with a document id and the associated frequency for that term / document id pair is called a posting. The completed inverted index (which
is built entirely in memory) can then be written to disk. Note that all three structures (the term dictionary, the document dictionary,
and the postings) are required to use the index. This
algorithm can only be used when all of the index processing can be completed within memory, so it can only be used on relatively small
collections.
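The steps described above (assign ids, collect term id/document id pairs, sort them, and group them into postings with frequencies) can be sketched as follows. This is a simplified, in-memory illustration, not the textbook's full algorithm: the two sample documents, the whitespace tokenizer, and the function name are assumptions made for the example.

```python
from itertools import groupby


def build_sorted_index(documents):
    """Build term ids, document ids and postings from {name: text}."""
    term_ids = {}   # term -> term id
    doc_ids = {}    # document name -> document id
    pairs = []      # (term id, document id), one per token occurrence

    for name, text in documents.items():
        doc_id = doc_ids.setdefault(name, len(doc_ids) + 1)
        for token in text.lower().split():
            term_id = term_ids.setdefault(token, len(term_ids) + 1)
            pairs.append((term_id, doc_id))

    # Inversion: sort by term id, then by document id.
    pairs.sort()

    # Identical adjacent pairs collapse into one posting with a frequency.
    postings = {}
    for (term_id, doc_id), group in groupby(pairs):
        postings.setdefault(term_id, []).append((doc_id, len(list(group))))
    return term_ids, doc_ids, postings


# Hypothetical two-document corpus.
documents = {"d1.txt": "new home sales", "d2.txt": "home sales rise home"}
term_ids, doc_ids, postings = build_sorted_index(documents)

print(term_ids)
print(doc_ids)
print(postings)
```

Note that the postings hold only integers; the term and document names appear once each, in their own dictionaries.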
Figure 1 illustrates the data structures created with the block sort-based indexing algorithm. This approach creates two dictionaries, one
for terms and a second one for documents. The actual indexing is accomplished in the postings, which map terms to documents and record the
frequency with which each term appears within each document. You may wonder why three structures are used. Why not just build an index
that has terms, document names, and frequencies? The answer is that this approach makes better use of limited memory resources. The
block sort-based index is limited by the space in memory, and repeating terms and document names would consume far more memory
than the integers that are used to represent them.
A second algorithm for building an inverted index is called single-pass in-memory indexing (SPIMI). SPIMI solves the memory problem
that limits the block sort-based index algorithm. As you can see in Figure 2, SPIMI uses a term dictionary structure to point to a
postings list of document ids. Each time a new term is encountered it is added to the dictionary, and each time a new document is
encountered it is added to the document dictionary and an entry is added to the postings list for the term. What is very different about the
SPIMI algorithm is that these structures are dynamic: the postings list allocates more storage as it is required to hold additional
document ids. The algorithm monitors memory availability, and when the available memory has been used, the term dictionary structure is sorted
by term and written out to disk. Memory is then reinitialized, a new term dictionary is created, and the process repeats until all documents in
the collection have been indexed.
When this portion of the processing has been completed, the ‘blocks’ which have been written out to disk are then merged back together,
as illustrated in Figure 3, to form the complete index.
In this approach there are only two structures that are written out to disk. The first is the document dictionary, which contains the name
of each document and its document id, and the second is the postings file, which contains each term and the documents that the term
occurs in. The advantage of this algorithm is that it can process a corpus of any size, limited only by the amount of disk space available.
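The block-and-merge behaviour of SPIMI can be sketched as below. Several simplifications are assumed: blocks are kept as Python lists rather than written to disk, "memory is full" is approximated by a hypothetical limit on the number of postings per block, and the stream of (term, document id) pairs is made up for the example.

```python
import heapq


def spimi_blocks(token_stream, max_postings_per_block):
    """Yield sorted blocks; each block is a list of (term, [doc ids])."""
    dictionary = {}
    used = 0
    for term, doc_id in token_stream:
        postings = dictionary.setdefault(term, [])   # new term: new list
        if not postings or postings[-1] != doc_id:
            postings.append(doc_id)                  # list grows on demand
            used += 1
        if used >= max_postings_per_block:           # stand-in for "memory full"
            yield sorted(dictionary.items())         # sort terms, "write" block
            dictionary, used = {}, 0
    if dictionary:
        yield sorted(dictionary.items())


def merge_blocks(blocks):
    """Merge sorted blocks into one index of term -> [doc ids]."""
    merged = {}
    for term, postings in heapq.merge(*blocks, key=lambda entry: entry[0]):
        target = merged.setdefault(term, [])
        for doc_id in postings:
            if not target or target[-1] != doc_id:
                target.append(doc_id)
    return merged


# Hypothetical stream of (term, document id) pairs in document order.
stream = [("home", 1), ("sales", 1), ("home", 2),
          ("rise", 2), ("sales", 2), ("home", 3)]
blocks = list(spimi_blocks(stream, max_postings_per_block=2))
index = merge_blocks(blocks)

print(len(blocks))
print(index)
```

Because each block is already sorted by term, the merge only has to read the blocks side by side, which is why the final step does not need the whole index in memory at once.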
The corpus that we will be using contains 2,476 articles published by the Reuters news service. These articles are segmented into
directories for different topic areas, so we will need to ‘walk’ through the directory structure to find and index all of the documents. The
corpus contains approximately 3.09 Mbytes of text. The size of the collection on disk is larger because of all the
files and directories that the data is segmented into. This collection is not very large: it is a set of files randomly extracted from a larger
Reuters collection that contained over 11,000 articles. However, on a relatively current personal computer the indexing process for the
collection requires two hours. Of this, 9 minutes were required to build the index in memory and the remainder to write the
index out to disk. These results demonstrate that indexing is an intensive process and that with large collections the capabilities of
more than one computer system will be required.
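Walking a directory structure to find every document can be done with Python's standard `os.walk`. The sketch below builds a small temporary directory tree so that it can run anywhere; the directory and file names are invented for the example and are not those of the actual corpus.

```python
import os
import tempfile


def list_corpus_files(root_directory):
    """Return the path of every file below root_directory, in sorted order."""
    paths = []
    for directory, _subdirectories, filenames in os.walk(root_directory):
        for filename in filenames:
            paths.append(os.path.join(directory, filename))
    return sorted(paths)


# Build a tiny stand-in corpus with two hypothetical topic directories.
with tempfile.TemporaryDirectory() as root:
    for topic, filename in (("topic_a", "one.txt"), ("topic_b", "two.txt")):
        os.makedirs(os.path.join(root, topic))
        with open(os.path.join(root, topic, filename), "w") as handle:
            handle.write("sample text")
    found = list_corpus_files(root)

print(len(found))
print([os.path.basename(path) for path in found])
```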
Distributed indexing addresses this problem by splitting the indexing workload among multiple, perhaps many, different systems, each of
which indexes a portion of the entire collection. There are two approaches to splitting the workload among multiple computers:
term partitioning and document partitioning.
In term partitioning, the collection is segmented by terms, with each system processing some range of terms. A document-partitioned
system segments the collection by documents. In either approach the distributed indexing solution employs a master node
that determines the partitioning of the collection and allocates segments or ‘splits’ of the work to the available servers for processing.
Index processing employs standard indexer algorithms against each portion of the workload. In a distributed indexing model, the mapping
of terms to term ids becomes more complex because a common set of terms must be developed. One approach to solving this
problem is to maintain a dictionary of common terms that is used by all indexer processes. The only new terms maintained
by any one indexer process would then be those infrequent terms that are part of the ‘split’ being indexed. These new terms
would then be merged with the terms of the common dictionary to form a complete dictionary of terms.
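The two ways of splitting the work can be illustrated with a short sketch. The assignment rules used here (document id modulo the number of nodes, and term ranges defined by first-letter boundaries) are assumptions chosen for simplicity, not the scheme of any particular system, and the function names are hypothetical.

```python
from bisect import bisect_right


def partition_by_document(doc_ids, node_count):
    """Document partitioning: each node indexes a subset of the documents."""
    splits = [[] for _ in range(node_count)]
    for doc_id in doc_ids:
        splits[doc_id % node_count].append(doc_id)
    return splits


def partition_by_term(terms, boundaries):
    """Term partitioning: each node handles a range of terms.

    boundaries holds the first letter at which each later range starts,
    so ["h", "q"] gives three ranges: a-g, h-p and q-z.
    Terms are assumed to be non-empty lowercase strings.
    """
    splits = [[] for _ in range(len(boundaries) + 1)]
    for term in terms:
        splits[bisect_right(boundaries, term[0])].append(term)
    return splits


document_splits = partition_by_document([1, 2, 3, 4, 5], node_count=2)
term_splits = partition_by_term(["corpus", "index", "posting", "term"], ["h", "q"])

print(document_splits)
print(term_splits)
```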
The final indexing algorithm explored in Unit 2 is dynamic indexing. All of the previous algorithms assume that the
corpus of documents is static. By static we mean that the documents do not change and new documents are not typically added to the
collection. Although some collections have these characteristics, many do not. Consider the document corpus that
we are using for our development project: it contains a series of classic works of literature where each book is contained in a file. The
contents of the file do not change; after all, who would change the writings of William Shakespeare, Daniel Defoe, or Mark Twain? Further,
the books that are in the collection do not change either. When we index the collection, the index is complete. This is not the case when
the corpus is more dynamic. Take the World Wide Web as a corpus or collection of documents. The collection is under constant
change as new pages or web sites are added and the content of existing pages changes. The techniques that we have viewed so
far could not accommodate such a dynamic corpus.
In dynamic indexing, change is the basic assumption. As documents are indexed, they are scheduled to be indexed again in the future to
accommodate the changes that may occur.
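One way to accommodate new documents without rebuilding everything is the auxiliary index, which appears among this unit's key terms: new documents go into a small in-memory index that is periodically merged into the main index, and searches consult both. The sketch below is a simplified illustration under stated assumptions. The class name, the merge threshold, and the whitespace tokenizer are hypothetical, both indexes are held in memory, and deletions and updates of existing documents are not handled.

```python
class DynamicIndex:
    """A main index plus a small auxiliary index for newly added documents."""

    def __init__(self, auxiliary_limit):
        self.main = {}            # term -> set of doc ids (stands in for disk)
        self.auxiliary = {}       # term -> set of doc ids (recent additions)
        self.auxiliary_limit = auxiliary_limit
        self.auxiliary_size = 0   # number of postings held in the auxiliary index

    def add_document(self, doc_id, text):
        for term in set(text.lower().split()):
            postings = self.auxiliary.setdefault(term, set())
            if doc_id not in postings:
                postings.add(doc_id)
                self.auxiliary_size += 1
        if self.auxiliary_size >= self.auxiliary_limit:
            self.merge()

    def merge(self):
        """Fold the auxiliary index into the main index and start afresh."""
        for term, doc_ids in self.auxiliary.items():
            self.main.setdefault(term, set()).update(doc_ids)
        self.auxiliary, self.auxiliary_size = {}, 0

    def search(self, term):
        """Results come from both indexes, so new documents are visible at once."""
        return sorted(self.main.get(term, set()) | self.auxiliary.get(term, set()))


index = DynamicIndex(auxiliary_limit=3)
index.add_document(1, "home sales")
index.add_document(2, "home prices")   # limit reached, triggers a merge
index.add_document(3, "home")          # stays in the auxiliary index for now

print(index.search("home"))
print(index.auxiliary)
```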
Manning, C.D., Raghavan, P., & Schütze, H. (2009). An Introduction to Information Retrieval (Online ed.). Cambridge, MA: Cambridge
University Press. Available at http://nlp.stanford.edu/IR-book/information-retrieval-book.html
• Wildcard query
• B-Tree
• Permuterm index
• K-gram index
• Edit distance
• Levenshtein Distance
• Jaccard Coefficient
• Phonetic correction
• Soundex algorithm
• Indexing
• Indexer
• External sorting algorithm
• Inversion
• Posting
• MapReduce
• Key-Value Pairs
• Parser
• Segment file
• Inverter
• Auxiliary index
• Logarithmic Merging
In your own words, describe block sort-based indexing and single-pass in-memory indexing. As part of your discussion you must address how
they differ from each other and the key limitations or constraints of each.
The following is an example of what our text calls single-pass in-memory indexing. This indexer has been developed using
the Python scripting language. Your assignment will be to use this code to gain an understanding of how to generate an inverted index.
This simple Python code will read through a directory of documents, tokenize each document, and add the terms extracted from the files to an
index. The program will generate metrics from the corpus and will generate two files: a document dictionary file and a terms dictionary
file.
The terms dictionary file will contain each unique term discovered in the corpus in sorted order and will have a unique index number
assigned to each term. The document dictionary will contain each document discovered in the corpus and will have a unique index
number assigned to each document.
From our reading assignments, we should recognize that a third file is required that will link the terms to the documents they were
discovered in using the index numbers. Generating this third file will be a future assignment.
We will be using a small corpus of files that contain article and author information from articles submitted to the journal “Communications
of the ACM”.
The corpus is in a zip file in the resources section of this unit, as is the example Python code.
You will need to have the current version of Python .x installed on your computer to complete the assignment.
You will need to modify the code to change the directory where the files are found to match your environment. Download the
example Python code or copy its contents to your blank “Untitled” Python file and save it with the name “Code_indexer.py”.
The areas where you must update the code are marked with the comment “#TO BE EDITED IF REQUIRED”. You should modify these parts
to work for your environment. If you are working in Linux, remember that backslashes must be changed to forward slashes.
• You must modify and execute the indexer against the CACM corpus. Although this will not build a complete index, it will
demonstrate key concepts such as
◦ Reading the document and extracting and tokenizing all of the text
As we will see in coming units, the ability to count terms and documents and to compile other metrics is vital to information retrieval, and this
first assignment demonstrates some of those processes.
• Your terms dictionary and documents dictionary files must be stored on disk and uploaded as part of your completed assignment.
• Your indexer must tokenize the contents of each document and extract terms.
• Your indexer must report statistics on its processing and must print these statistics as the final output of the program.
• Number of documents processed
• Total number of terms parsed from all documents
• Total number of unique terms found and added to the index
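The three statistics listed above can be gathered in a single pass over the documents. The sketch below is not the course's Code_indexer.py; it is a separate illustration in which the function name, the regular-expression tokenizer, and the two sample documents are assumptions.

```python
import re


def index_statistics(documents):
    """Count documents, parsed terms and unique terms in {name: text}."""
    term_ids = {}      # unique term -> index number
    total_terms = 0
    for text in documents.values():
        # Assumed tokenizer: lowercase runs of letters and digits.
        tokens = re.findall(r"[a-z0-9]+", text.lower())
        total_terms += len(tokens)
        for token in tokens:
            term_ids.setdefault(token, len(term_ids) + 1)
    return {
        "documents": len(documents),
        "terms_parsed": total_terms,
        "unique_terms": len(term_ids),
    }


# Hypothetical documents standing in for files read from the corpus.
sample_documents = {"a.txt": "Home sales rise.", "b.txt": "Home prices"}
statistics = index_statistics(sample_documents)

print("Number of documents processed:", statistics["documents"])
print("Total number of terms parsed:", statistics["terms_parsed"])
print("Total number of unique terms:", statistics["unique_terms"])
```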
When you have completed coding and testing your indexer program, you must execute your indexer against the corpus of documents in
the cacm.zip file which can be downloaded in the resources section of Unit 2.
Capture the statistics output from your program after running it against the corpus. Your statistics must include all of the statistics listed
above. You can capture the statistics by copying and pasting the output of your program directly into a document that you upload
as part of your assignment, or you can manually record each statistic and include it when you post your assignment.
It is suggested that you use the course forum to post any difficulties you are having and seek the help of peers.
As you work with the indexer and corpus, make note of your observations and provide a summary of them when posting your
assignment. Examples of observations might include the content of the data, running time, efficiency of the program, and other observations.
This assignment will have four elements for peer assessment. Keep in mind that as part of your assessment process you should review
and respond to the assessment questions and provide substantive feedback. Your instructor will be monitoring the quality of the
feedback that you provide and a portion of your grade will be based upon the feedback that you provide to your peers. Feedback can take
the form of suggestions on how to improve the project, providing assistance to help fellow students complete their assignment, sharing
best practices, tips, or resources that you have found useful or explaining concepts to your peers.
Upon completion of this assignment, you will be able to demonstrate the use of edit distance and k-gram overlap to determine spelling
corrections in a given text. You will further be able to determine the correct spelling (ignoring case) of the words in a given
document using an existing dictionary of vocabulary terms.
1. Assume there exists a dictionary on a local server with the following four vocabulary terms (note: this is a demo dictionary; an
actual dictionary contains millions of words/terms):
▪ Information
▪ Jeopardy
▪ Lost
▪ Mount Everest
A sample document, Doc1, is also available on the server with incorrect spellings:
Doc1: At Mount Everest, 12 people were lost. This inforomation about jopardy raised fear.
2. Describe the approach used to correct the spellings of ‘information’ and ‘jeopardy’ in the document (Doc1) using Levenshtein
distance and k-gram overlap (you may choose the value of k). Create a table, as shown in the example Levenshtein distance
computation, to compute the edit distance between each pair of strings using the Levenshtein distance metric.
3. In this situation, the dictionary is sorted. If the dictionary were not sorted, what would the impact be on the process you chose in Q2?
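As a reminder of how the two measures named in Q2 are computed, here is a minimal sketch of Levenshtein distance (using the usual row-by-row table) and of k-gram overlap scored with the Jaccard coefficient. The function names and the default k = 2 are assumptions, and the sketch does not add word-boundary markers to the k-grams. It is intended as a way to check hand-built tables, not as a replacement for the written explanation the assignment asks for.

```python
def levenshtein(source, target):
    """Edit distance with unit-cost insert, delete and substitute."""
    previous = list(range(len(target) + 1))       # row for the empty prefix
    for i, source_char in enumerate(source, start=1):
        current = [i]
        for j, target_char in enumerate(target, start=1):
            cost = 0 if source_char == target_char else 1
            current.append(min(previous[j] + 1,          # delete from source
                               current[j - 1] + 1,       # insert into source
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


def kgram_jaccard(first, second, k=2):
    """Jaccard coefficient of the two words' k-gram sets."""
    grams_first = {first[i:i + k] for i in range(len(first) - k + 1)}
    grams_second = {second[i:i + k] for i in range(len(second) - k + 1)}
    union = grams_first | grams_second
    return len(grams_first & grams_second) / len(union) if union else 0.0


# The demo dictionary from the assignment, lowercased because case is ignored.
dictionary = ["information", "jeopardy", "lost", "mount everest"]

for misspelled in ("inforomation", "jopardy"):
    for candidate in dictionary:
        print(misspelled, candidate,
              levenshtein(misspelled, candidate),
              round(kgram_jaccard(misspelled, candidate), 3))
```

A small edit distance and a large Jaccard coefficient both point to the same candidate for each misspelled word.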
• Submit a document that is of 500-1000 words (the word count does not include the title and the reference list), double-spaced
using 12-point Times New Roman font.
• Use sources to support your arguments. Use high-quality, credible, relevant sources to develop ideas that are appropriate for the
discipline and genre of the writing.
• Use APA citations and references to support your work. For assistance with APA formatting, view the Learning Resource Center:
Academic Writing.
Manning, C.D., Raghavan, P., & Schütze, H. (2009). Chapter 3. Dictionaries and tolerant retrieval. In An introduction to information retrieval.
Figure 3.6, p.59. https://nlp.stanford.edu/IR-book/pdf/03dict.pdf
The Self-Quiz gives you an opportunity to self-assess your knowledge of what you have learned so far.
The results of the Self-Quiz do not count towards your final grade, but the quiz is an important part of the University’s learning process
and it is expected that you will take it to ensure understanding of the materials presented. Reviewing and analyzing your results will help
you perform better on future Graded Quizzes and the Final Exam.
Please access the Self-Quiz on the main course homepage; it will be listed inside the Unit.
Post a response to the discussion question and respond to at least three of your peers' postings.