
Site: University of the People
Course: CS 3308-01 Information Retrieval - AY2025-T2
Book: Learning Guide Unit 4
Printed by: Patrick Rolemodel Asante
Date: Tuesday, 10 December 2024, 12:02 PM


Learning Guide Unit 4


This unit covers the following topics:

• Parametric and Zone indexes
• Weighted Zone Scoring
• Ranked Boolean Retrieval
• Learning weights
• The optimal weight g
• Term frequency
• Tf-idf weighting
• Vector space model for scoring
• Cosine similarity
• Term-document matrix

By the end of this Unit, you will be able to:

1. Identify the differences between parametric and zone indexes.
2. Compare different weighting schemes and be able to implement each scheme.
3. Determine and calculate inverse document frequency, term frequency, document and query vector scores, and cosine similarity between documents and between a document and a query.
4. Identify different tf-idf variants.

• Read the Learning Guide and Reading Assignments
• Complete and submit the Programming Assignment
• Make entries to the Learning Journal
• Take the Self-Quiz


Unit 4 covers the important concept of weighting as it relates to information retrieval. In Units 2 and 3 we learned about the inverted index, which is the most fundamental data structure in information retrieval. The inverted index provides a structure that can be used to search for specific terms in documents in order to find documents that match the information we are looking for.

In the Indexer Part 1 assignment that we constructed and executed against the CACM corpus, we discovered that a term might appear in many documents. Because of this, a search for terms might retrieve many documents, some of which are more relevant than others. In this unit we will explore techniques to determine the relevance of the documents identified by a query. The approach that we will use to determine the relevance of a document is referred to as weighting. In this approach a weight is assigned to each term; documents containing terms with higher weights are assumed to be more relevant than documents containing terms with lower weights, and as such should be presented first in the responses to a query.

Our text describes a few different approaches to assigning term weights, but in our development assignment for this unit we will be using the inverse document frequency and term frequency to calculate the weight of a term, which essentially assumes that a higher frequency of a term indicates that it is more relevant.

The inverse document frequency is calculated using the following formula:

idft = log10(N / dft)
In this formula N refers to the total number of documents in the corpus. In our CACM corpus there are 731 documents, so in this case N would be 731. The dft in the formula refers to the document frequency; the subscripted t refers to the term, so dft is the number of documents that a particular term appears in. For example, the term ‘mortgage’ appears in 8 documents in the collection. In this case the idft for the term ‘mortgage’ would be found by taking the base 10 logarithm of 731 divided by 8, which is 1.9608.
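
As a quick illustration, here is a minimal Python sketch of this calculation, using the corpus statistics quoted above:

import math

N = 731      # total number of documents in the CACM corpus
df_t = 8     # number of documents containing the term 'mortgage'

idf_t = math.log10(N / df_t)
print(round(idf_t, 4))   # prints 1.9608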

The objective of idft is to reduce the weight of a term by a factor that grows with the term's frequency across the collection. This is important because a term that exists in most documents is probably not very relevant. On the other hand, a term that appears frequently in only a few documents is probably very relevant.

To incorporate this concept into our weight, we use the idft, which gets smaller as the frequency of the term across the collection grows, together with the term frequency (tft,d), which is simply defined as the number of times that a particular term occurs in a particular document. The following formula is used to calculate the tf-idf weight:

tf-idft,d = tft,d × idft

The tf-idf weight, which is often represented as tf-idft,d, provides a weight for a particular term in a particular document. This weight can then be used to calculate a vector for the document. The vector score is simply the accumulation of the weights of each query term found in the document.

Imagine, if you would, that you wanted to find information on home mortgages: you would use the two query terms ‘home’ and ‘mortgage’. We already determined that the idft for the term ‘mortgage’ is 1.9608, which is then multiplied by tft,d to determine tf-idft,d. This value is a component of the vector score, which is calculated with the following formula:

Score(q, d) = Σt∈q tf-idft,d
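
As a minimal sketch of this scoring step in Python (the term counts here are illustrative assumptions, not values taken from the CACM corpus):

idf = {'home': 1.6335, 'mortgage': 1.9608}   # idf values from the discussion above
tf = {'home': 1, 'mortgage': 1}              # assumed counts of each term in a document

# accumulate the tf-idf component for each query term into the vector score
score = sum(tf[t] * idf[t] for t in idf)
print(round(score, 4))   # prints 3.5943 when each term occurs once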


What this formula indicates is that the tf-idft,d values for each term in the query must be added together to get a score. In the case of our home mortgage query, the tf-idft,d value for home (1.6335) must be added to the tf-idft,d value for mortgage (1.9608) to obtain the vector score of 3.5943.

This calculation of vector scores must be repeated for every document in the collection that matches each of the search terms in the
query to develop a list of potential documents for retrieval.

When the list of potential documents has been built, the vector score for the query must be determined. The vector score for the query is determined in much the same way as the vector scores for the documents that match the query terms, with the exception that each query term is assumed to occur just once, so each idft value is multiplied by a term frequency of 1.

In the case of our ‘home mortgages’ query, the idft for home was 1.6335 and the idft for mortgage was 1.9608 so the vector score for the
query is:

Vector score = 1.6335 + 1.9608 = 3.5943

Once we have both the vector score for the query and the vector scores for all of the documents, we can apply the weighting. We do this by calculating the cosine similarity between each document vector and the query vector using the following formula:

sim(q, d) = (V(q) · V(d)) / (|V(q)| × |V(d)|)

In this equation two vectors are compared: the vector for the query and the vector for the document. The numerator is their dot product, and the denominator is the product of their Euclidean lengths.

Assume that a particular document has a tf-idft,d value for ‘home’ of 3.9216 and a tf-idft,d value for ‘mortgage’ of 14.7012. Knowing also that for the query ‘home mortgage’ we have a tf-idft,d score for home of 1.6335 and a tf-idft,d score for mortgage of 1.9608, the dot product can be calculated as follows:

V(q) · V(d) = (1.6335 × 3.9216) + (1.9608 × 14.7012) = 6.4059 + 28.8261 = 35.2320

It might be easier to understand this formula if we use our example. The first vector is the query vector, which is composed of the tf-idft,d scores for home (1.6335) and mortgage (1.9608). The second vector is the document vector, which is composed of the tf-idft,d scores for home (3.9216) and mortgage (14.7012). To calculate the Euclidean length of each vector we square each of its scores, add the squares, and take the square root of the sum:

|V(q)| = √(1.6335² + 1.9608²) = √6.5131 ≈ 2.5520
|V(d)| = √(3.9216² + 14.7012²) = √231.5042 ≈ 15.2153

Dividing the dot product by the product of the two lengths gives the cosine similarity: 35.2320 / (2.5520 × 15.2153) ≈ 0.907.

The resulting similarity ranges from −1 meaning exactly opposite, to 1 meaning exactly the same, with 0 usually indicating independence,
and in-between values indicating intermediate similarity or dissimilarity.
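
The whole comparison can be sketched in a few lines of Python using the example numbers above:

import math

query = {'home': 1.6335, 'mortgage': 1.9608}    # query tf-idf vector
doc = {'home': 3.9216, 'mortgage': 14.7012}     # document tf-idf vector

dot = sum(query[t] * doc[t] for t in query)              # dot product
q_len = math.sqrt(sum(w * w for w in query.values()))    # Euclidean length of the query
d_len = math.sqrt(sum(w * w for w in doc.values()))      # Euclidean length of the document

print(round(dot / (q_len * d_len), 4))   # prints 0.9073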

This same calculation must be performed for each document in the collection that contains the query terms, and the resulting cosine similarity scores are sorted from largest to smallest. The result of the query will then return the highest-scoring documents. This process is introduced in Unit 4, but we will be implementing these techniques in a development project assigned in Unit 5. Please take the time to really understand the process of computing the tf-idf weight and how it can be used to calculate vectors and the cosine similarity score.

The recommended reading resources in this unit can be reviewed as they provide more details to help you understand the calculation of
tf-idf, vectors, and cosine similarity.


Manning, C. D., Raghavan, P., & Schütze, H. (2009). Introduction to information retrieval (Online ed.). Cambridge University Press. Available at http://nlp.stanford.edu/IR-book/information-retrieval-book.html

Chapter 6: Scoring, Term Weighting and the Vector Space Model

• Term Vector Calculations: A Fast Track Tutorial by Dr. E. Garcia: http://en.youscribe.com/catalogue/tous/knowledge/term-vector-fast-track-tutorial-521048
• Wikipedia entry for calculating cosine similarity: http://en.wikipedia.org/wiki/Cosine_similarity

• Term frequency
• Bag of words
• Inverse document frequency
• Tf-idf weighting
• Document vector
• Query vector
• Vector space model
• Cosine similarity
• Euclidean Length


What is involved in determining and calculating the inverse document frequency? Also, what are the differences between parametric and zone indexes?

You must post your initial response before being able to review other students’ responses. Once you have made your first response, you will be able to reply to other students’ posts. You are expected to make a minimum of 3 responses to your fellow students’ posts.


Over the past two units (Units 2 and 3) you have developed and refined a basic process to build an inverted index. In this unit you will extend that work by adding new features to your indexer process.

Your Indexer must apply editing to the terms (tokens) extracted from the document collection as follows:

• Develop a routine to identify and remove stop words
• Implement a Porter stemmer to stem the tokens processed by your indexer routine. You do not have to write a stemmer algorithm; you can use the code that is provided and integrate it into your own routine.
• Remove (do not process) any term that begins with a punctuation character (a sketch of this editing step appears after this list)
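
One possible shape for this editing step, sketched in Python (the stop word list, the stemmer object, and the function name are illustrative assumptions; the course-provided stemmer code may differ):

import string

STOP_WORDS = {'a', 'an', 'and', 'is', 'of', 'or', 'the', 'to'}   # illustrative subset

def clean_tokens(tokens, stemmer):
    # Filter and stem raw tokens before they are added to the index.
    for token in tokens:
        token = token.lower()
        if not token or token[0] in string.punctuation:
            continue                      # skip terms that begin with punctuation
        if token in STOP_WORDS:
            continue                      # skip stop words
        yield stemmer.stem(token)         # stem whatever survives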

Your indexer process must calculate frequencies and weighted term measures that can be used to process a query for scoring in the
vector space model. As such your index must:

• Determine the term frequency (tft,d) for each unique term in each document in the collection to be included in the inverted index.
• Determine the document frequency (dft), which is a count of the number of documents from within the collection that each unique inverted index term appears in.
• Calculate the inverse document frequency using the following formula: idft = log10(N / dft)
  ◦ The N in this formula refers to the number of documents that exist in the collection. When using the corpus-small document collection, N will be 41 as there are 41 documents in the collection.
• Finally, calculate the tf-idf weighting using the following formula: tf-idft,d = tft,d × idft
• The tf-idf weighting must be maintained as an attribute in the inverted index data structure. The following is an example (but not a
required) structure for the inverted index:
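
One possibility, sketched as a Python dictionary keyed by term (the field names and document IDs here are illustrative assumptions, not requirements):

inverted_index = {
    'mortgag': {                            # stemmed term
        'df': 8,                            # document frequency
        'idf': 1.9608,                      # log10(N / df)
        'postings': {                       # one entry per document containing the term
            'CACM-1410': {'tf': 2, 'tf_idf': 3.9216},
            'CACM-2319': {'tf': 1, 'tf_idf': 1.9608},
        },
    },
}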

Your indexer must report statistics on its processing and must print these statistics as the final output of the program:

• Number of documents processed
• Total number of terms parsed from all documents
• Total number of unique terms found and added to the index
• Total number of terms found that matched one of the stop words in your program’s stop words list

In testing your indexer process you should create a new database to hold the elements of your inverted index.
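
If you do not already have a database in place, a minimal sketch using Python's built-in sqlite3 module might look like the following (the table and column names are assumptions, not requirements):

import sqlite3

conn = sqlite3.connect('inverted_index.db')    # creates the new database file
conn.execute("""CREATE TABLE IF NOT EXISTS postings (
                    term TEXT,
                    doc_id TEXT,
                    tf INTEGER,
                    tf_idf REAL)""")
conn.execute("INSERT INTO postings VALUES (?, ?, ?, ?)",
             ('mortgag', 'CACM-1410', 2, 3.9216))
conn.commit()
conn.close()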

You must post your initial response before being able to review other students’ responses. Once you have made your first response, you will be able to reply to other students’ posts. You are expected to make a minimum of 3 responses to your fellow students’ posts.


Example code for Indexer Part 2 is .


At the end of this assignment, you will be able to find the cosine similarity between documents and use it to recommend similar documents.

Assume that you have designed a search engine and the following three documents exist in the corpus (note: in a real-world situation there exist millions of documents, with words ranging from hundreds to millions in each of them):

Document 1: Earth is round.

Document 2: Moon is round.

Document 3: Day is nice.

It was found programmatically that a user always refers to Document 1. There is a need to recommend one document from the corpus to the user. Your search engine needs to find documents similar to Document 1. To accomplish the task, you need to carry out pre-processing (hint: “is” is a stop word), create the vector for each document, and find the similarity of the vector of each document with the vector of Document 1.

Based on the highest similarity among the vectors, recommend the respective document to the user.
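
As a starting point, here is a minimal Python sketch of one way to set this up; it uses raw term counts after stop word removal as the vector weights, and all names below are illustrative:

import math

# Tokens remaining after lowercasing and removing the stop word 'is'
docs = {
    'Document 1': ['earth', 'round'],
    'Document 2': ['moon', 'round'],
    'Document 3': ['day', 'nice'],
}

vocab = sorted({t for terms in docs.values() for t in terms})

def vector(terms):
    # Raw term-count vector over the shared vocabulary.
    return [terms.count(t) for t in vocab]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

target = vector(docs['Document 1'])
for name in ('Document 2', 'Document 3'):
    print(name, round(cosine(target, vector(docs[name])), 4))
# Document 2 shares 'round' with Document 1 and scores 0.5; Document 3 scores 0.0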

Compute the vector for each of the documents and explain the vectors. Describe in detail the steps that led you to a valid similarity result (see Introduction to Information Retrieval (stanford.edu)). For each calculation and equation used, show the necessary references.

• Submit a document of 500-1000 words (the word count does not include the title and the reference list), double-spaced, using 12-point Times New Roman font.
• Use sources to support your arguments. Use high-quality, credible, relevant sources to develop ideas that are appropriate for the
discipline and genre of the writing. Use APA citations and references to support your work. For assistance with APA formatting, view
the Learning Resource Center: Academic Writing.

• Manning, C. D., Raghavan, P., & Schütze, H. (2009). Introduction to information retrieval (Online ed.). Cambridge University Press. Available at http://nlp.stanford.edu/IR-book/information-retrieval-book.html
• Chapter 6: Scoring, Term Weighting and the Vector Space Model


The Self-Quiz gives you an opportunity to self-assess your knowledge of what you have learned so far.

The results of the Self-Quiz do not count towards your final grade, but the quiz is an important part of the University’s learning process and it is expected that you will take it to ensure understanding of the materials presented. Reviewing and analyzing your results will help you perform better on future Graded Quizzes and the Final Exam.

Please access the Self-Quiz on the main course homepage; it will be listed inside the Unit.


Read the Learning Guide and Reading Assignments

Complete and submit the Programming Assignment

Make entries to the Learning Journal

Take the Self-Quiz
