Learning Guide Unit 2

Learning Guide Unit 2 focuses on key concepts in information retrieval, including dictionary structures, wildcard queries, spelling correction, and various indexing techniques. It covers algorithms for constructing inverted indexes, such as block sort-based indexing and SPIMI, and discusses the challenges of indexing dynamic corpora. By the end of the unit, students will be able to implement these concepts in practical assignments and understand the impact of hardware limitations on indexing processes.


Learning Guide Unit 2 | Home https://my.uopeople.edu/mod/book/tool/print/index.php?id=443823

Site: University of the People Printed by: Patrick Rolemodel Asante

Course: CS 3308-01 Information Retrieval - AY2025-T2 Date: Wednesday, 27 November 2024, 3:20 PM
Book: Learning Guide Unit 2


• Dictionaries
◦ Binary Trees
◦ B-Tree
• Wildcard queries
◦ Permuterm index
◦ K-gram index
• Spelling correction
◦ Isolated term correction
◦ Context sensitive correction
• Index Construction
• Dynamic indexing
• Corpus
• Single pass in-memory indexing
• Distributed indexing
• Ranked retrieval

By the end of this Unit, you will be able to:

1. Recognize the dictionary data structure and be able to implement a dictionary within an information retrieval system.
2. Describe the different options for implementing wildcard queries including:
◦ Permuterm indexes
◦ K-gram indexes
3. Articulate different approaches for the implementation of spelling correction including:
◦ Isolated-term correction approaches
▪ Edit distance
▪ K-gram overlap
◦ Context-sensitive correction
▪ Phonetic correction using soundex algorithms
4. Describe computer hardware limitations and their impact on indexing processes
5. Describe Indexer architectures including:
◦ Blocked sort-based indexing
◦ Single-pass in-memory indexing
◦ Distributed indexing
◦ Dynamic Indexing

• Read the Learning Guide and Reading Assignments

• Complete and submit the Programming Assignment
• Make entries to the Learning Journal
• Take the Self-Quiz


Unit 2 explores some of the most important concepts in information retrieval. In Chapter 3 of the text we learn about the concept of a
dictionary and how it is used in information retrieval. In Chapter 4 we examine different algorithms for creating the inverted index,
the structure that we use to search for specific terms in order to retrieve documents. Together these topics are foundational
for information retrieval.

The image that we have of a dictionary is typically of a large book that lists words in alphabetical order and gives, for each word,
a definition of its meaning. This picture is helpful in understanding the dictionary data
structure that is used in information retrieval. In IR, the dictionary contains the terms or words extracted from the collection of
documents, which we refer to as the corpus. In the IR dictionary we also define the word, but in terms of the documents that the word
appears in. The dictionary structure may be used for terms from the documents as well as for the names of documents.

The basic idea of a dictionary is that each unique term or document appears in the dictionary only once and that the dictionary is
structured in a way that allows a specific item to be found efficiently. Much of Chapter 3 in our text deals with
techniques for searching the dictionary efficiently. One aspect that is covered is the structure of the dictionary. If the dictionary were an
unsorted list of terms, search would be very inefficient because the entire list might have to be scanned to find the
correct entry. Sorting the list makes search more efficient, because a binary search can skip through
the list without examining every item. More efficient still, particularly when terms are added and removed, is a tree structure such as a
binary search tree or a B-tree, which keeps the terms in order and supports lookups in logarithmic time.
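The difference between scanning an unsorted list and searching a sorted one can be illustrated with a small sketch. The function names and the six-term vocabulary below are hypothetical and are not taken from the text; the sorted lookup relies on Python's standard `bisect` module.

```python
from bisect import bisect_left

def linear_lookup(terms, target):
    """Scan an unsorted list; returns (position, comparisons made)."""
    for comparisons, term in enumerate(terms, start=1):
        if term == target:
            return comparisons - 1, comparisons
    return -1, len(terms)

def sorted_lookup(sorted_terms, target):
    """Binary search over a sorted list; returns the position or -1."""
    i = bisect_left(sorted_terms, target)
    if i < len(sorted_terms) and sorted_terms[i] == target:
        return i
    return -1

# Hypothetical vocabulary used only for illustration.
terms = ["retrieval", "index", "corpus", "posting", "term", "query"]
sorted_terms = sorted(terms)
```

In the unsorted list, finding "query" takes six comparisons because it is the last entry; the binary search needs at most about log2(n) steps regardless of where the term sits.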

Other aspects of searching covered in Chapter 3 include wildcard searches and spelling correction, both for the terms that are
indexed and for the terms supplied in a query. Both concepts make search more effective. A successful search
is one where you find the information that you are looking for and do so efficiently. Providing the ability to search imprecisely for
a term raises the likelihood of getting results. The idea behind spelling correction is that a search only finds exactly matching terms
(unless wildcards are used), so it makes sense to ensure that terms are spelled correctly, which improves the chances that the
query terms and the index terms match.
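As a rough illustration of how a wildcard search can be supported, the sketch below builds a k-gram index (one of the two wildcard techniques listed in the unit objectives) over a small vocabulary. The function names, the choice of k = 2, the `$` boundary marker, and the sample vocabulary are assumptions made for this example; only the `*` wildcard is handled, and candidates are post-filtered with the standard `fnmatch` module.

```python
from collections import defaultdict
from fnmatch import fnmatchcase

def kgrams(text, k=2):
    """Return the set of k-character substrings of `text`."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def build_kgram_index(vocabulary, k=2):
    """Map each k-gram (with $ marking term boundaries) to the terms containing it."""
    index = defaultdict(set)
    for term in vocabulary:
        for gram in kgrams(f"${term}$", k):
            index[gram].add(term)
    return index

def wildcard_lookup(index, pattern, k=2):
    """Find terms matching a pattern that uses `*` as its only wildcard."""
    grams = set()
    for piece in f"${pattern}$".split("*"):
        grams |= kgrams(piece, k)
    if grams:
        # Candidate terms must contain every k-gram of the pattern.
        candidates = set.intersection(*(index.get(g, set()) for g in grams))
    else:
        candidates = set().union(*index.values())
    # Post-filter: sharing all k-grams does not guarantee a true match.
    return sorted(t for t in candidates if fnmatchcase(t, pattern))

# Hypothetical vocabulary used only for illustration.
vocabulary = ["information", "inform", "index", "retrieval", "infer", "jeopardy"]
kgram_index = build_kgram_index(vocabulary)
```

For example, `wildcard_lookup(kgram_index, "inf*")` intersects the postings for the grams `$i`, `in`, and `nf` and then filters the candidates against the original pattern.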

In Chapter 4 we learn about the actual construction of the inverted index. Four algorithms are discussed:

Block sort-based indexing is a technique that can be used on a corpus (document collection) that is relatively small and static (it does not
experience a lot of change). The block sort-based index algorithm passes through the corpus identifying all of the term id/document id
pairs. Each term is stored in a dictionary structure that contains the term and its term id, which is
typically a number assigned to each term. Similarly, the document dictionary is a structure that contains each document name and a
number assigned to uniquely identify the document, which we refer to as the document id.

The pairs are then sorted by term id and then by document id. This two-step process is referred to as an inversion. The combination of a term id,
a document id, and the associated frequency for that term/document pair is called a posting. The completed inverted index (which
is built entirely in memory) can then be written to disk. Note that all three of the structures are required to use the index. This
algorithm can only be used when all of the index processing can be completed within memory, so it can only be used on relatively small
collections.


Figure 1 illustrates the data structures created with the block sort-based indexing algorithm. This approach creates two dictionaries, one
for terms and a second one for documents. The actual indexing is accomplished in the postings, which map terms to documents and record the
frequency with which each term appears within each document. You may wonder why three structures are used. Why not just build a single index
that holds terms, document names, and frequencies? The answer is that this approach makes better use of limited memory resources. The
block sort-based index is limited by the space in memory, and repeating terms and document names would consume far more memory
than the integers that are used to represent them.
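The three structures and the inversion step described above can be sketched in a few lines. This is a minimal illustration that follows the description in this guide (everything held in memory), not the indexer used in the course; the function name, the whitespace tokenization, and the two sample documents are assumptions.

```python
from collections import Counter

def build_sorted_index(docs):
    """docs maps document name -> text. Returns (term_dict, doc_dict, postings)."""
    term_dict = {}   # term -> term id
    doc_dict = {}    # document name -> document id
    pairs = []       # one (term id, document id) pair per token occurrence

    for name, text in docs.items():
        doc_id = doc_dict.setdefault(name, len(doc_dict) + 1)
        for token in text.lower().split():
            term_id = term_dict.setdefault(token, len(term_dict) + 1)
            pairs.append((term_id, doc_id))

    # Inversion: sort by term id, then document id, and collapse
    # repeated pairs into a single posting with a frequency.
    pairs.sort()
    postings = [(t, d, freq) for (t, d), freq in Counter(pairs).items()]
    return term_dict, doc_dict, postings

# Hypothetical two-document corpus used only for illustration.
docs = {"d1.txt": "new home sales", "d2.txt": "home sales rise home"}
term_dict, doc_dict, postings = build_sorted_index(docs)
```

Each posting holds only three integers, which is the memory saving the paragraph above refers to: the strings "home" and "d2.txt" are stored once in their dictionaries, however many postings mention them.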

A second algorithm for building an inverted index is called single-pass in-memory indexing (SPIMI). SPIMI solves the memory problem
that limits the block sort-based index algorithm. As you can see in Figure 2, SPIMI uses a term dictionary structure to point to a
postings list of document ids. Each time a new term is encountered it is added to the term dictionary, each time a new document is
encountered it is added to the document dictionary, and an entry is added to the postings list for the term. What is very different about the
SPIMI algorithm is that these structures are dynamic: the postings list allocates more storage as it is required to hold additional
document ids. The algorithm monitors memory availability, and when the available memory has been used, the term dictionary structure is sorted
by term and written out to disk. Memory is then reinitialized, a new term dictionary is created, and the process repeats until all documents in
the collection have been indexed.


When this portion of the processing has been completed, the ‘blocks’ which have been written out to disk are then merged back together,
as illustrated in Figure 3, to form the complete index.

In this approach there are only two structures that are written out to disk. The first is the document dictionary, which contains the name
of each document and its document id, and the second is the postings file, which contains each term and the documents that the term
occurs in. The advantage of this algorithm is that it can process any size of corpus, limited only by the amount of disk space available.
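The block-and-merge cycle described above can be sketched as follows. This is a simplified illustration under stated assumptions: `max_postings` is a hypothetical stand-in for the memory check, the blocks are kept as Python lists rather than written to files on disk, and the input is a ready-made stream of (term, document id) pairs.

```python
import heapq
from itertools import groupby

def spimi_blocks(token_stream, max_postings):
    """Yield sorted blocks of (term, [doc ids]); max_postings simulates the memory limit."""
    dictionary, used = {}, 0
    for term, doc_id in token_stream:
        postings = dictionary.setdefault(term, [])    # new term: new postings list
        if not postings or postings[-1] != doc_id:
            postings.append(doc_id)                   # list grows as needed
            used += 1
        if used >= max_postings:                      # "memory" is full: flush the block
            yield sorted(dictionary.items())
            dictionary, used = {}, 0
    if dictionary:
        yield sorted(dictionary.items())

def merge_blocks(blocks):
    """Merge sorted blocks into one index: term -> sorted list of doc ids."""
    first = lambda item: item[0]
    merged = {}
    for term, group in groupby(heapq.merge(*blocks, key=first), key=first):
        doc_ids = set()
        for _, postings in group:
            doc_ids.update(postings)
        merged[term] = sorted(doc_ids)
    return merged

# Hypothetical token stream used only for illustration.
stream = [("home", 1), ("sales", 1), ("home", 2),
          ("rise", 2), ("sales", 3), ("home", 3)]
blocks = list(spimi_blocks(stream, max_postings=3))
index = merge_blocks(blocks)
```

With a limit of three postings the stream produces two blocks, and the merge step combines the two partial postings lists for "home" and for "sales" into complete ones.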

The corpus that we will be using contains 2,476 articles published by the Reuters news service. These articles are segmented into
directories for different topic areas, so we will need to ‘walk’ through the directory structure to find and index all of the documents. The
corpus contains approximately 3.09 Mbytes of text in its 2,476 files. The size of the collection on disk is larger because of all the
files and directories that the data is segmented into. This collection is not very large; it is a set of files randomly extracted from a larger
Reuters collection that contained over 11,000 articles. However, on a relatively current personal computer the indexing process for the
collection requires two hours. Of this, 9 minutes were required to build the index in memory and the remainder to write the
index out to the disk. These results demonstrate that indexing is an intensive process and that with large collections the capabilities of
more than one computer system will be required.
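Walking a directory tree of this kind can be done with `os.walk` from the Python standard library. The sketch below is only an illustration: the function names are hypothetical, and the root directory is whatever location the corpus has been unpacked to.

```python
import os

def walk_corpus(root):
    """Yield the path of every file below `root`, descending into subdirectories."""
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()                      # visit subdirectories in a stable order
        for filename in sorted(filenames):
            yield os.path.join(dirpath, filename)

def corpus_size(root):
    """Return (number of files, total size in bytes) for the corpus under `root`."""
    paths = list(walk_corpus(root))
    total_bytes = sum(os.path.getsize(path) for path in paths)
    return len(paths), total_bytes
```

Calling `corpus_size` on the corpus directory gives a quick check that the file count matches what is expected before any indexing starts.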

Distributed indexing addresses this problem by splitting the indexing workload among multiple, perhaps many, different systems, each of
which indexes a portion of the entire collection. There are two approaches to splitting the workload among multiple computers: term
partitioning and document partitioning.

In term partitioning, the collection is segmented by terms, with each system processing some range of terms. A document-partitioned
system segments the collection by documents. In either approach the distributed indexing solution employs a master node
that determines the partitioning of the collection and allocates segments or ‘splits’ of the work to the available servers for processing.

Index processing employs standard indexer algorithms against each portion of the workload. In a distributed indexing model, the mapping
of terms to term ids becomes more complex because a common set of terms must be developed. One approach to solving this
problem is to maintain a dictionary of common terms that is used by all indexer processes. The only new terms that would be maintained
by any one of the indexer processes would be those infrequent terms that are a part of the ‘split’ that is being indexed. These new terms
would then be merged with the terms of the common dictionary to form a complete dictionary of terms.
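The two ways of dividing the work can be contrasted with a small sketch. Everything here is illustrative: the function names are hypothetical, documents are assigned round-robin, and terms are routed by their first letter against a list of boundary letters. A real master node could use any assignment policy.

```python
from bisect import bisect_left
from collections import defaultdict

def document_partition(doc_ids, num_nodes):
    """Assign documents to `num_nodes` splits in round-robin order."""
    splits = [[] for _ in range(num_nodes)]
    for position, doc_id in enumerate(doc_ids):
        splits[position % num_nodes].append(doc_id)
    return splits

def term_partition(pairs, boundaries):
    """Route (term, doc id) pairs to nodes by term range.

    `boundaries` is a sorted list of first letters; with ["m"], terms
    starting with a letter up to "m" go to node 0 and the rest to node 1.
    """
    partitions = [defaultdict(list) for _ in range(len(boundaries) + 1)]
    for term, doc_id in pairs:
        node = bisect_left(boundaries, term[0])
        postings = partitions[node][term]
        if doc_id not in postings:
            postings.append(doc_id)
    return [dict(partition) for partition in partitions]

# Hypothetical data used only for illustration.
splits = document_partition([1, 2, 3, 4, 5], num_nodes=2)
by_term = term_partition(
    [("home", 1), ("sales", 1), ("home", 2), ("rise", 2)], boundaries=["m"])
```

Under document partitioning each node ends up with a complete small index for its own documents, while under term partitioning each node holds the full postings lists for its range of terms.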

The final indexing algorithm explored in Unit 2 is dynamic indexing. All of the previous algorithms make the assumption that the
corpus of documents is static. By static we mean that the documents do not change and new documents are not typically added to the
collection. Although there are some collections that have these characteristics, many do not. If you consider the document corpus that
we are using for our development project, it contains a series of classic works of literature where each book is contained in a file. The
contents of the file do not change; after all, who would change the writings of William Shakespeare, Daniel Defoe, or Mark Twain? Further,
the books that are in the collection do not change either. When we index the collection the index is complete. This is not the case when
the corpus is more dynamic. Take the World Wide Web as a corpus or collection of documents. The collection is under constant
change as new pages or web sites are added and the content of existing pages changes. The techniques that we have viewed so
far could not accommodate such a dynamic corpus.

In dynamic indexing, change is the basic assumption. As documents are indexed, they are scheduled to be indexed again in the future to
accommodate any change that may occur.
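One common way to accommodate additions without rebuilding the whole index is an auxiliary index, a term that appears among this unit's key terms. The sketch below is an assumption about how such a scheme might look, not a description of the mechanism in the paragraph above: the class name and the `max_aux_postings` threshold are hypothetical, the main index is a dictionary standing in for an on-disk index, and deletions or changed documents are not handled.

```python
class DynamicIndex:
    """Main index plus a small auxiliary index for newly added documents."""

    def __init__(self, max_aux_postings=4):
        self.main = {}            # term -> sorted doc ids (stands in for the on-disk index)
        self.aux = {}             # term -> doc ids for documents added since the last merge
        self.aux_size = 0
        self.max_aux_postings = max_aux_postings

    def add_document(self, doc_id, text):
        for term in set(text.lower().split()):
            self.aux.setdefault(term, []).append(doc_id)
            self.aux_size += 1
        if self.aux_size >= self.max_aux_postings:
            self.merge()

    def merge(self):
        """Fold the auxiliary index into the main index and start a fresh one."""
        for term, doc_ids in self.aux.items():
            combined = set(self.main.get(term, [])) | set(doc_ids)
            self.main[term] = sorted(combined)
        self.aux, self.aux_size = {}, 0

    def search(self, term):
        """Consult both indexes so that new documents are visible immediately."""
        hits = set(self.main.get(term, [])) | set(self.aux.get(term, []))
        return sorted(hits)
```

New documents become searchable as soon as they are added, and the cost of rewriting the main index is paid only when the auxiliary index fills up.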


Manning, C.D., Raghavan, P., & Schütze, H. (2009). An Introduction to Information Retrieval (Online ed.). Cambridge, MA: Cambridge
University Press. Available at http://nlp.stanford.edu/IR-book/information-retrieval-book.html

Chapter 3: Dictionaries and Tolerant Retrieval

Chapter 4: Index Construction

• Wildcard query
• B-Tree
• Permuterm index
• K-gram index
• Edit distance
• Levenshtein Distance
• Jaccard Coefficient
• Phonetic correction
• Soundex algorithm
• Indexing
• Indexer
• External sorting algorithm
• Inversion
• Posting
• MapReduce
• Key-Value Pairs
• Parser
• Segment file
• Inverter
• Auxiliary index
• Logarithmic Merging


In your own words, describe block sort-based indexing and single-pass in-memory indexing. As part of your discussion you must address how
they differ from each other and the key limitations or constraints of each.


The following is an example of what our text calls single-pass in-memory indexing. This indexer has been developed using
the Python scripting language. Your assignment will be to use this code to gain an understanding of how to generate an inverted index.

This simple Python code will read through a directory of documents, tokenize each document, and add terms extracted from the files to an
index. The program will generate metrics from the corpus and will generate two files: a document dictionary file and a terms dictionary
file.

The terms dictionary file will contain each unique term discovered in the corpus in sorted order and will have a unique index number
assigned to each term. The document dictionary will contain each document discovered in the corpus and will have a unique index
number assigned to each document.

From our reading assignments, we should recognize that a third file is required that will link the terms to the documents they were
discovered in using the index numbers. Generating this third file will be a future assignment.

We will be using a small corpus of files that contain article and author information from articles submitted to the journal “Communications
of the ACM”.
The corpus is in a zip file in the resources section of this unit, as is the example Python code.

You will need to have the current version of Python .x installed on your computer to complete the assignment.

You will need to modify the code to change the directory where the files are found to match your environment. Download the
example code file or copy its contents to your blank “Untitled” Python file and save it with the name “Code_indexer.py”.

The areas where you must update the code are marked with the comment ”#TO BE EDITED IF REQUIRED”. You should modify these parts
to work for your environment. If you are working in Linux, remember that backslashes in file paths must be changed to forward slashes.

The requirements for this assignment include:

• You must modify and execute the indexer against the CACM corpus. Although this will not build a complete index it will
demonstrate key concepts such as

o Traversing a directory of documents

o Reading the document and extracting and tokenizing all of the text

o Computing counts of documents and terms

o Building a dictionary of unique terms that exist within the corpus

o Writing out to a disk file, a sorted term dictionary

As we will see in coming units, the ability to count terms and documents and to compile other metrics is vital to information retrieval, and this
first assignment demonstrates some of those processes.

• Your terms dictionary and documents dictionary files must be stored on disk and uploaded as part of your completed assignment.
• Your indexer must tokenize the contents of each document and extract terms.
• Your indexer must report statistics on its processing and must print these statistics as the final output of the program.
• Number of documents processed
• Total number of terms parsed from all documents
• Total number of unique terms found and added to the index
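The three statistics listed above can be gathered in a single pass over the documents. The sketch below is a generic illustration, not the course's example indexer: the function names, the simple regular-expression tokenizer, and the in-memory (name, text) input are all assumptions.

```python
import re

# Assumed tokenizer: runs of lowercase letters and digits count as terms.
TOKEN_PATTERN = re.compile(r"[a-z0-9]+")

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return TOKEN_PATTERN.findall(text.lower())

def index_statistics(documents):
    """documents is an iterable of (name, text). Returns (doc_dict, term_dict, stats)."""
    doc_dict, vocabulary, total_terms = {}, set(), 0
    for name, text in documents:
        doc_dict[name] = len(doc_dict) + 1          # unique number per document
        tokens = tokenize(text)
        total_terms += len(tokens)
        vocabulary.update(tokens)
    # Sorted term dictionary with a unique number per term.
    term_dict = {term: i for i, term in enumerate(sorted(vocabulary), start=1)}
    stats = {
        "documents": len(doc_dict),
        "terms": total_terms,
        "unique_terms": len(term_dict),
    }
    return doc_dict, term_dict, stats

# Hypothetical documents used only for illustration.
sample = [("a.txt", "Index the index."), ("b.txt", "Terms, terms & documents")]
doc_dict, term_dict, stats = index_statistics(sample)
```

Here the six parsed terms reduce to four unique terms, which shows why the total and the unique counts are reported separately.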

When you have completed coding and testing your indexer program, you must execute your indexer against the corpus of documents in
the cacm.zip file, which can be downloaded in the resources section of Unit 2.

Capture the statistics output from your program after running it against the corpus. Your statistics must include all of the statistics listed
above. You can capture the statistics by copying and pasting the output of your program directly into a document which you can upload
as part of your assignment or you can manually record each statistic and include in your posting of your assignment.

It is suggested that you use the course forum to post any difficulties you are having and seek the help of peers.


As you work with the indexer and corpus, make note of your observations and provide a summary of your observations when posting your
assignment. Examples of observations might include content of the data, running time, efficiency of the program, and other observations.

Peer Assessment Criteria

This assignment will have four elements for peer assessment. Keep in mind that as part of your assessment process you should review
and respond to the assessment questions and provide substantive feedback. Your instructor will be monitoring the quality of the
feedback that you provide and a portion of your grade will be based upon the feedback that you provide to your peers. Feedback can take
the form of suggestions on how to improve the project, providing assistance to help fellow students complete their assignment, sharing
best practices, tips, or resources that you have found useful or explaining concepts to your peers.

The four elements required of the assignment include:

• The indexer python code uploaded as part of the submission

• The documents.dat and index.dat uploaded as part of the submission
• The metrics produced when the indexer was executed
• A description of the assignment and observations made while running the indexer against the corpus


Upon completion of this assignment, you will be able to demonstrate the use of edit distance and k-gram overlap to determine spelling
corrections in the given text. You will further be able to determine the correct spelling (there is no case-sensitivity) for the words in a given
document using an existing dictionary that has vocabulary terms.

Assume there exists a dictionary on a local server with the following four vocabulary terms (note: this is a demo dictionary; an actual
dictionary may contain millions of words/terms):

▪ Information
▪ Jeopardy
▪ Lost
▪ Mount Everest

A sample document, Doc1, is also available on the server with incorrect spellings:

Doc1: At Mount Everest, 12 people were lost. This inforomation about jopardy raised fear.

Using the above data, answer the following questions.

1. How can the given dictionary be helpful for spelling corrections in Doc1?

2. Describe the approach used to correct the spellings of ‘information’ and ‘jeopardy’ in the document (Doc1) using Levenshtein
distance and k-gram overlap (you may choose the value of k). Create the table, as shown in the Example Levenshtein distance
computation to compute the edit distance using Levenshtein distance metric among the pair of strings.

3. In this situation, the dictionary is sorted. If the dictionary were not sorted, what would be the impact on the process you chose in Q2?
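The two measures named in question 2 can be sketched generically as follows. This is an illustration of the standard unit-cost Levenshtein distance and of the Jaccard coefficient over k-grams; the function names, the default k = 2, and the sample words are assumptions, and the words from the assignment are deliberately not used.

```python
def levenshtein(a, b):
    """Edit distance with unit-cost insert, delete, and substitute."""
    previous = list(range(len(b) + 1))           # distances from "" to prefixes of b
    for i, char_a in enumerate(a, start=1):
        current = [i]                            # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                     # delete char_a
                current[j - 1] + 1,                  # insert char_b
                previous[j - 1] + (char_a != char_b),  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]

def kgram_jaccard(a, b, k=2):
    """Jaccard coefficient between the k-gram sets of two strings."""
    grams_a = {a[i:i + k] for i in range(len(a) - k + 1)}
    grams_b = {b[i:i + k] for i in range(len(b) - k + 1)}
    union = grams_a | grams_b
    return len(grams_a & grams_b) / len(union) if union else 0.0

def suggest(word, vocabulary):
    """Return the vocabulary term with the smallest edit distance (case-insensitive)."""
    word = word.lower()
    return min(vocabulary, key=lambda term: levenshtein(word, term.lower()))
```

Each row of the dynamic-programming computation corresponds to one row of the edit distance table: `previous` holds the row above and `current` the row being filled in. With a sorted dictionary, k-gram overlap can narrow the candidates before the more expensive edit distance is computed for each of them.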

• Submit a document that is of 500-1000 words (the word count does not include the title and the reference list), double-spaced
using 12-point Times New Roman font.

• Use sources to support your arguments. Use high-quality, credible, relevant sources to develop ideas that are appropriate for the
discipline and genre of the writing.

• Use APA citations and references to support your work. For assistance with APA formatting, view the Learning Resource Center:
Academic Writing.

Manning, C.D., Raghavan, P., & Schütze, H. (2009). Chapter 3. Dictionaries and tolerant retrieval. In An introduction to information retrieval.
Figure 3.6, p.59. https://nlp.stanford.edu/IR-book/pdf/03dict.pdf


The Self-Quiz gives you an opportunity to self-assess your knowledge of what you have learned so far.

The results of the Self-Quiz do not count towards your final grade, but the quiz is an important part of the University’s learning process
and it is expected that you will take it to ensure understanding of the materials presented. Reviewing and analyzing your results will help
you perform better on future Graded Quizzes and the Final Exam.

Please access the Self-Quiz on the main course homepage; it will be listed inside the Unit.


Read the Learning Guide and Reading Assignments

Post response to discussion question and respond to at least three of your peer's postings

Complete and submit the Programming Assignment

Make entries to the Learning Journal

Take the Self-Quiz
