
Site: University of the People
Course: CS 3308-01 Information Retrieval - AY2025-T2
Book: Learning Guide Unit 4
Printed by: Patrick Rolemodel Asante
Date: Tuesday, 10 December 2024, 12:02 PM


Learning Guide Unit 4


This unit covers the following topics:

• Parametric and Zone indexes
• Weighted Zone Scoring
• Ranked Boolean Retrieval
• Learning weights
• The optimal weight g
• Term frequency
• Tf-idf weighting
• Vector space model for scoring
• Cosine similarity
• Term-document matrix

By the end of this Unit, you will be able to:

1. Identify the differences between parametric and zone indexes.
2. Compare different weighting schemes and be able to implement each scheme.
3. Determine and calculate inverse document frequency, term frequency, document and query vector scores, and cosine similarity between documents and between a document and a query.
4. Identify different tf-idf variants.

• Read the Learning Guide and Reading Assignments
• Complete and submit the Programming Assignment
• Make entries to the Learning Journal
• Take the Self-Quiz


Unit 4 covers the important concept of weighting as it relates to information retrieval. In Units 2 and 3 we learned about the inverted index, which is the most fundamental data structure in information retrieval. The inverted index provides a structure that can be used to search for specific terms in documents in order to find documents that match the information we are looking for.

In the Indexer Part 1 assignment that we constructed and executed against the CACM corpus, we discovered that a term might appear in many documents. Because of this, a search for terms might retrieve many documents, some of which are more relevant than others. In this unit we will explore techniques to determine the relevance of the documents identified by a query. The approach that we will use to determine the relevance of a document is referred to as weighting. In this approach a weight is assigned to each term; documents containing terms with higher weights are assumed to be more relevant than documents containing terms with lower weights, and as such should be presented first in the responses to a query.

Our text describes a few different approaches to assigning term weights, but in our development assignment for this unit we will be using the inverse document frequency and term frequency to calculate the weight of a term, which essentially assumes that a higher frequency of a term indicates that it is more relevant.

The inverse document frequency is calculated using the following formula:

idft = log10(N / dft)
In this formula N refers to the total number of documents in the corpus. In our CACM corpus there are 731 documents, so in this case N would be 731. The dft in the formula refers to the document frequency; the subscripted t refers to the term, so dft is the number of documents that a particular term appears in. For example, the term ‘mortgage’ appears in 8 documents in the collection. In this case the idft for the term ‘mortgage’ would be found by taking the base 10 logarithm of 731 divided by 8, which is 1.9608.
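
As a quick illustration, here is a minimal Python sketch of this calculation, using the corpus statistics quoted above:

import math

N = 731      # total number of documents in the CACM corpus
df_t = 8     # number of documents containing the term 'mortgage'

idf_t = math.log10(N / df_t)
print(round(idf_t, 4))   # prints 1.9608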

The objective of idft is to reduce the weight of a term by a factor that grows with the term's frequency across the collection. This is important because a term that exists in most documents is probably not very relevant. On the other hand, a term that appears frequently in only a few documents is probably very relevant.

To incorporate this concept into our weight, we use the idft, which gets smaller as the frequency of the term across the collection grows, together with the term frequency (tft,d), which is simply defined as the number of times that a particular term occurs in a particular document. The following formula is used to calculate the tf-idf weight:

tf-idft,d = tft,d × idft

The tf-idf weight, which is often represented as tf-idft,d, provides a weight for a particular term in a particular document. This weight can then be used to calculate a vector for the document. The vector score is simply the accumulation of the weights of each query term found in the document.

Imagine, if you would, that you wanted to find information on home mortgages: you would use the two query terms ‘home’ and ‘mortgage’. We already determined that the idft for the term ‘mortgage’ is 1.9608, which is then multiplied by tft,d to determine tf-idft,d. This value is a component of the vector score, which is calculated with the following formula:

Score(q, d) = Σt∈q tf-idft,d
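
As a minimal sketch of this scoring step in Python (the term counts here are illustrative assumptions, not values taken from the CACM corpus):

idf = {'home': 1.6335, 'mortgage': 1.9608}   # idf values from the discussion above
tf = {'home': 1, 'mortgage': 1}              # assumed counts of each term in a document

# accumulate the tf-idf component for each query term into the vector score
score = sum(tf[t] * idf[t] for t in idf)
print(round(score, 4))   # prints 3.5943 when each term occurs once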


What this formula indicates is that the tf-idft,d values for each term in the query must be added together to get a score. In the case of our home mortgage query, the tf-idft,d value for home (1.6335) must be added to the tf-idft,d value for mortgage (1.9608) to obtain the vector score of 3.5943.

This calculation of vector scores must be repeated for every document in the collection that matches each of the search terms in the
query to develop a list of potential documents for retrieval.

When the list of potential documents has been built, the vector score for the query must be determined. The vector score for the query is determined in much the same way as the vector scores for the documents that match the query terms, with the exception that each query term is assumed to occur just once, so each idft value is multiplied by a term frequency of 1.

In the case of our ‘home mortgages’ query, the idft for home was 1.6335 and the idft for mortgage was 1.9608 so the vector score for the
query is:

Vector score = 1.6335 + 1.9608 = 3.5943

Once we have both the vector score for the query and the vector scores for all of the documents, we can apply the weighting. We do this by calculating the cosine similarity between each document vector and the query vector using the following formula:

sim(q, d) = (V(q) · V(d)) / (|V(q)| × |V(d)|)

In this equation two vectors are compared: the vector for the query and the vector for the document. The numerator is their dot product, and the denominator is the product of their Euclidean lengths.

Assume that a particular document has a tf-idft,d value for ‘home’ of 3.9216 and a tf-idft,d value for ‘mortgage’ of 14.7012. Knowing also that for the query ‘home mortgage’ we have a tf-idft,d score for home of 1.6335 and a tf-idft,d score for mortgage of 1.9608, the dot product can be calculated as follows:

V(q) · V(d) = (1.6335 × 3.9216) + (1.9608 × 14.7012) = 6.4059 + 28.8261 = 35.2320

It might be easier to understand this formula if we use our example. The first vector is the query vector, which is composed of the tf-idft,d scores for home (1.6335) and mortgage (1.9608). The second vector is the document vector, which is composed of the tf-idft,d scores for home (3.9216) and mortgage (14.7012). To calculate the Euclidean length of each vector we square each of its scores, add the squares, and take the square root of the sum:

|V(q)| = √(1.6335² + 1.9608²) = √6.5131 ≈ 2.5520
|V(d)| = √(3.9216² + 14.7012²) = √231.5042 ≈ 15.2153

Dividing the dot product by the product of the two lengths gives the cosine similarity: 35.2320 / (2.5520 × 15.2153) ≈ 0.907.

The resulting similarity ranges from −1 meaning exactly opposite, to 1 meaning exactly the same, with 0 usually indicating independence,
and in-between values indicating intermediate similarity or dissimilarity.
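
The whole comparison can be sketched in a few lines of Python using the example numbers above:

import math

query = {'home': 1.6335, 'mortgage': 1.9608}    # query tf-idf vector
doc = {'home': 3.9216, 'mortgage': 14.7012}     # document tf-idf vector

dot = sum(query[t] * doc[t] for t in query)              # dot product
q_len = math.sqrt(sum(w * w for w in query.values()))    # Euclidean length of the query
d_len = math.sqrt(sum(w * w for w in doc.values()))      # Euclidean length of the document

print(round(dot / (q_len * d_len), 4))   # prints 0.9073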

This same calculation must be performed for each document in the collection that contains the query terms, and the resulting cosine similarity scores are sorted from largest to smallest. The result of the query will then return the highest-scoring documents. This process is introduced in Unit 4, but we will be implementing these techniques in a development project assigned in Unit 5. Please take the time to really understand the process of computing the tf-idf weight and how it can be used to calculate vectors and the cosine similarity score.

The recommended reading resources in this unit can be reviewed as they provide more details to help you understand the calculation of
tf-idf, vectors, and cosine similarity.


Manning, C. D., Raghavan, P., & Schütze, H. (2009). Introduction to information retrieval (Online ed.). Cambridge University Press. Available at http://nlp.stanford.edu/IR-book/information-retrieval-book.html

Chapter 6: Scoring, Term Weighting and the Vector Space Model

• Term Vector Calculations: A Fast Track Tutorial by Dr. E. Garcia: http://en.youscribe.com/catalogue/tous/knowledge/term-vector-fast-track-tutorial-521048
• Wikipedia entry for calculating cosine similarity: http://en.wikipedia.org/wiki/Cosine_similarity

• Term frequency
• Bag of words
• Inverse document frequency
• Tf-idf weighting
• Document vector
• Query vector
• Vector space model
• Cosine similarity
• Euclidean Length


What is involved in determining and calculating the inverse document frequency? Also, what are the differences between parametric and zone indexes?

You must post your initial response before being able to review other students’ responses. Once you have made your first response, you will be able to reply to other students’ posts. You are expected to make a minimum of 3 responses to your fellow students’ posts.


Over the past two units (Units 2 and 3) you have developed and refined a basic process to build an inverted index. In this unit you will extend that work by adding new features to your indexer process.

Your Indexer must apply editing to the terms (tokens) extracted from the document collection as follows:

• Develop a routine to identify and remove stop words
• Implement a Porter stemmer to stem the tokens processed by your indexer routine. You do not have to write a stemmer algorithm; you can use the code that is provided and integrate it into your own routine.
• Remove (do not process) any term that begins with a punctuation character (a sketch of this editing step appears after this list)
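
One possible shape for this editing step, sketched in Python (the stop word list, the stemmer object, and the function name are illustrative assumptions; the course-provided stemmer code may differ):

import string

STOP_WORDS = {'a', 'an', 'and', 'is', 'of', 'or', 'the', 'to'}   # illustrative subset

def clean_tokens(tokens, stemmer):
    # Filter and stem raw tokens before they are added to the index.
    for token in tokens:
        token = token.lower()
        if not token or token[0] in string.punctuation:
            continue                      # skip terms that begin with punctuation
        if token in STOP_WORDS:
            continue                      # skip stop words
        yield stemmer.stem(token)         # stem whatever survives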

Your indexer process must calculate frequencies and weighted term measures that can be used to process a query for scoring in the
vector space model. As such your index must:

• Determine the term frequency (tft,d) for each unique term in each document in the collection to be included in the inverted index.
• Determine the document frequency (dft), which is a count of the number of documents from within the collection that each unique inverted index term appears in.
• Calculate the inverse document frequency using the following formula: idft = log10(N / dft)
  ◦ The N in this formula refers to the number of documents that exist in the collection. When using the corpus-small document collection, N will be 41 as there are 41 documents in the collection.
• Finally, calculate the tf-idf weighting using the following formula: tf-idft,d = tft,d × idft
• The tf-idf weighting must be maintained as an attribute in the inverted index data structure. The following is an example (but not a
required) structure for the inverted index:
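
One possibility, sketched as a Python dictionary keyed by term (the field names and document IDs here are illustrative assumptions, not requirements):

inverted_index = {
    'mortgag': {                            # stemmed term
        'df': 8,                            # document frequency
        'idf': 1.9608,                      # log10(N / df)
        'postings': {                       # one entry per document containing the term
            'CACM-1410': {'tf': 2, 'tf_idf': 3.9216},
            'CACM-2319': {'tf': 1, 'tf_idf': 1.9608},
        },
    },
}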

Your indexer must report statistics on its processing and must print these statistics as the final output of the program:

• Number of documents processed
• Total number of terms parsed from all documents
• Total number of unique terms found and added to the index
• Total number of terms found that matched one of the stop words in your program’s stop words list

In testing your indexer process you should create a new database to hold the elements of your inverted index.
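
If you do not already have a database in place, a minimal sketch using Python's built-in sqlite3 module might look like the following (the table and column names are assumptions, not requirements):

import sqlite3

conn = sqlite3.connect('inverted_index.db')    # creates the new database file
conn.execute("""CREATE TABLE IF NOT EXISTS postings (
                    term TEXT,
                    doc_id TEXT,
                    tf INTEGER,
                    tf_idf REAL)""")
conn.execute("INSERT INTO postings VALUES (?, ?, ?, ?)",
             ('mortgag', 'CACM-1410', 2, 3.9216))
conn.commit()
conn.close()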

You must post your initial response before being able to review other students’ responses. Once you have made your first response, you will be able to reply to other students’ posts. You are expected to make a minimum of 3 responses to your fellow students’ posts.


Example code for Indexer Part 2 is .


At the end of this assignment, you will be able to find the cosine similarity between documents and use it to recommend similar documents.

Assume that you have designed a search engine and the following three documents exist in the corpus (note: in a real-world situation there exist millions of documents, with words ranging from hundreds to millions in each of them):

Document 1: Earth is round.

Document 2: Moon is round.

Document 3: Day is nice.

It was found programmatically that a user always refers to Document 1. There is a need to recommend one document from the corpus to the user. Your search engine needs to find documents similar to Document 1. To accomplish the task, you need to carry out pre-processing (hint: “is” is a stop word), create the vector for each document, and find the similarity of the vector of each document with the vector of Document 1.

Based on the highest similarity among the vectors, recommend the respective document to the user.
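
As a starting point, here is a minimal Python sketch of one way to set this up; it uses raw term counts after stop word removal as the vector weights, and all names below are illustrative:

import math

# Tokens remaining after lowercasing and removing the stop word 'is'
docs = {
    'Document 1': ['earth', 'round'],
    'Document 2': ['moon', 'round'],
    'Document 3': ['day', 'nice'],
}

vocab = sorted({t for terms in docs.values() for t in terms})

def vector(terms):
    # Raw term-count vector over the shared vocabulary.
    return [terms.count(t) for t in vocab]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

target = vector(docs['Document 1'])
for name in ('Document 2', 'Document 3'):
    print(name, round(cosine(target, vector(docs[name])), 4))
# Document 2 shares 'round' with Document 1 and scores 0.5; Document 3 scores 0.0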

Compute the vector for each of the documents and explain the vectors. Describe in detail the steps that led you to a valid similarity result (see Introduction to Information Retrieval (stanford.edu)). For each calculation and equation used, show the necessary references.

• Submit a document of 500-1000 words (the word count does not include the title and the reference list), double-spaced, using 12-point Times New Roman font.
• Use sources to support your arguments. Use high-quality, credible, relevant sources to develop ideas that are appropriate for the
discipline and genre of the writing. Use APA citations and references to support your work. For assistance with APA formatting, view
the Learning Resource Center: Academic Writing.

• Manning, C. D., Raghavan, P., & Schütze, H. (2009). Introduction to information retrieval (Online ed.). Cambridge University Press. Available at http://nlp.stanford.edu/IR-book/information-retrieval-book.html
• Chapter 6: Scoring, Term Weighting and the Vector Space Model


The Self-Quiz gives you an opportunity to self-assess your knowledge of what you have learned so far.

The results of the Self-Quiz do not count towards your final grade, but the quiz is an important part of the University’s learning process and it is expected that you will take it to ensure understanding of the materials presented. Reviewing and analyzing your results will help you perform better on future Graded Quizzes and the Final Exam.

Please access the Self-Quiz on the main course homepage; it will be listed inside the Unit.


Read the Learning Guide and Reading Assignments

Complete and submit the Programming Assignment

Make entries to the Learning Journal

Take the Self-Quiz
