Proj1_Fall24
Due date:
October 16th 11:59PM (online submission through Canvas)
Description:
Note: If you decide to use C++, consider using the C++ STL (Standard
Template Library), which has all the necessary classes. Get familiar with the
different types of containers available in the STL and the methods they provide.
FileDictionary: You also need to keep a dictionary that maps each document name
to a unique numerical ID.
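
A minimal sketch of such a mapping, assuming Python and assuming IDs are handed out starting at 1 in first-seen order (the sample output in this handout starts at 1). The class and method names here are illustrative, not required by the assignment:

```python
class FileDictionary:
    """Maps each document name to a unique integer ID (hypothetical helper)."""

    def __init__(self):
        self._ids = {}  # document name -> numeric ID

    def get_id(self, doc_name):
        """Return the ID for doc_name, assigning the next free ID on first use."""
        if doc_name not in self._ids:
            # Assumption: IDs start at 1 and grow by 1 per new document.
            self._ids[doc_name] = len(self._ids) + 1
        return self._ids[doc_name]

    def items(self):
        """Return (doc_name, doc_id) pairs ordered by ID."""
        return sorted(self._ids.items(), key=lambda pair: pair[1])


file_dictionary = FileDictionary()
file_dictionary.get_id("FT911-1")
file_dictionary.get_id("FT911-2")
file_dictionary.get_id("FT911-1")  # seen before, so no new ID is assigned
```

Looking up a name that was already seen returns its existing ID, so the same structure can serve both for assigning IDs while parsing and for printing the final mapping.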
Data: We are using the TREC data, in which each file contains multiple
documents, each marked up with its own tags. You therefore cannot treat each
file as a single document; you need to parse each file into separate documents.
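
One way to split a file into documents can be sketched as follows. This assumes the usual TREC markup, where each document is wrapped in <DOC>...</DOC>, its name is in <DOCNO>, and its body is in <TEXT>; check the actual data files before relying on these tag names. The function name split_documents is illustrative:

```python
import re

# Assumed TREC-style tags; verify them against the provided data files.
DOC_RE = re.compile(r"<DOC>(.*?)</DOC>", re.DOTALL)
DOCNO_RE = re.compile(r"<DOCNO>\s*(.*?)\s*</DOCNO>", re.DOTALL)
TEXT_RE = re.compile(r"<TEXT>(.*?)</TEXT>", re.DOTALL)


def split_documents(raw):
    """Yield (doc_name, text) pairs for every <DOC> block in one file's contents."""
    for match in DOC_RE.finditer(raw):
        block = match.group(1)
        docno = DOCNO_RE.search(block)
        if docno is None:
            continue  # skip blocks without a document name
        text = TEXT_RE.search(block)
        body = text.group(1).strip() if text else ""
        yield docno.group(1), body


sample = (
    "<DOC>\n<DOCNO> FT911-1 </DOCNO>\n<TEXT>\nThe cat sat.\n</TEXT>\n</DOC>\n"
    "<DOC>\n<DOCNO> FT911-2 </DOCNO>\n<TEXT>\nCaesar drew a card.\n</TEXT>\n</DOC>\n"
)
for name, body in split_documents(sample):
    print(name, "|", body)
```

Regular expressions keep the sketch short; a line-by-line scanner that tracks the current open tag would work equally well and uses less memory on large files.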
Testing: You should print out document IDs and token streams to check that you
parse documents properly. Store the output in a file called "parser_output.txt" in
the following form:
caesar 1
card 2 (token: token ID)
cat 3
……….
FT911-1 1
FT911-2 2 (document name: doc ID)
FT911-3 3
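
The output file could be produced along these lines. This is a sketch under two assumptions: the parenthetical notes in the sample above are explanations and not part of the file, and the token lines come first, followed directly by the document lines. The function names and the dictionary arguments are illustrative:

```python
def format_parser_output(token_ids, doc_ids):
    """Build the file contents: token lines first, then document lines, each ordered by ID."""
    by_id = lambda pair: pair[1]
    lines = [f"{token} {tid}" for token, tid in sorted(token_ids.items(), key=by_id)]
    lines += [f"{name} {did}" for name, did in sorted(doc_ids.items(), key=by_id)]
    return "\n".join(lines) + "\n"


def write_parser_output(token_ids, doc_ids, path="parser_output.txt"):
    """Write the formatted mappings to the output file."""
    with open(path, "w", encoding="utf-8") as out:
        out.write(format_parser_output(token_ids, doc_ids))


# Example with the tokens and documents from the sample above.
tokens = {"cat": 3, "caesar": 1, "card": 2}
documents = {"FT911-2": 2, "FT911-1": 1}
print(format_parser_output(tokens, documents), end="")
```

Keeping the formatting separate from the file writing makes it easy to inspect the output on screen before saving it.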
Submit:
Submit your assignment through Canvas as a .zip file containing all the files for
this assignment, including "parser_output.txt". Also provide a Readme file with
instructions on how to run your code. Code should be submitted on time, and you
may be asked to give a demo.