CSC2001F 2024 DSAssignment1
Instructions
Artificial Intelligence systems need to acquire general or common sense knowledge about the
world in order to answer questions from users (e.g. “do Asian beetles have spots?”) or to perform
basic reasoning (e.g. “what weighs more, a ton of bricks or a ton of feathers?”). Large
Language Models such as ChatGPT are “trained” (acquire knowledge) primarily from text on the
web, which is not guaranteed to be factually correct or representative of the real world. Therefore
researchers have developed datasets containing verified general knowledge which can be
used to improve AI systems. One of these datasets is GenericsKB, a knowledge base of
“generic” sentences about the world. Generic statements (not to be confused with generics in
Java) are knowledge about kinds of things (including e.g. animals, objects, or abstract concepts)
that are generally true (although there might be exceptions). The full GenericsKB contains
3.4M+ sentences expressing general truths such as "Dogs bark," and "Trees remove carbon
dioxide from the atmosphere."
The goal of this assignment is to build a proof-of-concept Java program for querying the knowledge
in GenericsKB, as well as to add additional knowledge or to update the knowledge base.
We assume for this assignment that the data is stored in memory while the application performs
multiple functions on demand until the user exits. A more elaborate application could offer additional
functionality but we focus on just the core functionality for this assignment.
Dataset
You will be working with a preprocessed subset of 100 000 statements (the file GenericsKB.txt
is attached), derived from GenericsKB (https://allenai.org/data/genericskb). The order
of statements has been randomized. An additional file (GenericsKB-additional.txt) can be
used to test adding additional statements to the knowledge base
(where some items or statements might overlap with the initial knowledge base).
Each entry in the knowledge base has 3 data fields: The “term” (the thing that the statement is
about, e.g. “tree”), the sentence (e.g. “Trees remove carbon dioxide from the atmosphere.") and
a confidence score (between 0 and 1, where 1 represents complete confidence). The reason for
the confidence score is that the dataset was constructed semi-automatically, so the statements
were not all manually verified. Note that terms may consist of multiple (space-separated) words,
e.g. “cellular mechanism”. The data is given in a text file with one statement per line; the 3 data
fields (term, sentence, score) are separated by tabs. Therefore when reading in the data line by
line, the first step is to split each line by tabs. While the original GenericsKB contains multiple
statements corresponding to the same term (e.g. “Bee pollen has benefits” and “Bee pollen
comes from flowers”), your program only has to be able to store a single statement per term.
The main dataset only contains a single statement per term, but the additional file may have
multiple statements corresponding to the same term, which should be interpreted as update
operations, as explained below.
Study the data carefully. The data loading should be relatively straightforward and you must
write your own code to read in the text file.
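As a rough illustration of the "split each line by tabs" step described above, the sketch below parses one line into its three fields. The class and method names (StatementParsingSketch, Statement, parseLine) are hypothetical and not prescribed by the assignment, the sample scores are invented, and the choice to skip malformed lines is an assumption. It reads from a string so that it runs without the data file.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

// Hypothetical names throughout: the assignment does not prescribe these classes.
class StatementParsingSketch {

    /** One knowledge-base entry: term, sentence and confidence score. */
    static final class Statement {
        final String term;
        final String sentence;
        final double score;

        Statement(String term, String sentence, double score) {
            this.term = term;
            this.sentence = sentence;
            this.score = score;
        }
    }

    /** Splits one line on tabs; returns null unless it has exactly 3 usable fields. */
    static Statement parseLine(String line) {
        String[] fields = line.split("\t");
        if (fields.length != 3) {
            return null; // assumption: malformed lines are skipped, not fatal
        }
        try {
            return new Statement(fields[0].trim(), fields[1].trim(),
                                 Double.parseDouble(fields[2].trim()));
        } catch (NumberFormatException e) {
            return null; // the score field was not a number
        }
    }

    public static void main(String[] args) throws IOException {
        // In the real program the reader would wrap a FileReader for GenericsKB.txt.
        // The scores below are invented for illustration.
        String sample = "tree\tTrees remove carbon dioxide from the atmosphere.\t1.0\n"
                      + "dog\tDogs bark.\t0.9\n";
        try (BufferedReader in = new BufferedReader(new StringReader(sample))) {
            String line;
            while ((line = in.readLine()) != null) {
                Statement s = parseLine(line);
                if (s != null) {
                    System.out.println(s.term + " | " + s.sentence + " | " + s.score);
                }
            }
        }
    }
}
```

Because the split is on tabs only, multi-word terms such as “cellular mechanism” stay in one field.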
Application
Your application must include at least functionality to do the following:
1. Read in an initial knowledge base from a file to populate the (in-memory) knowledge
base.
2. Allow the user to add new statements to the knowledge base. The new statements can
be about items already in the knowledge base or about new items. If a new statement is
about an item already in the knowledge base then the entry for that item should be
updated with the new statement (i.e., update the sentence and confidence score), unless
the new statement has a lower confidence score. This update functionality should be
supported both through the user interface directly and through loading a knowledge base
file (in which the statements are treated as updates executed in the order that they
appear in the file).
3. Display information from the knowledge base: For a given item, display the statement
about that item if there is one in the knowledge base. For a given item and sentence,
check whether the statement is in the knowledge base or not, and return the confidence
score if it is.
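The update rule in item 2 can be sketched independently of the data structure used. The names below (UpdateRuleSketch, Entry, update) are hypothetical, and the sentences and scores are invented. The sketch reads "unless the new statement has a lower confidence score" as: an equal or higher score replaces the stored statement. How ties are handled is an interpretation, not something the brief states explicitly.

```java
// Hypothetical names: a storage-agnostic sketch of the update rule only.
class UpdateRuleSketch {

    /** A mutable knowledge-base entry for one term. */
    static final class Entry {
        final String term;
        String sentence;
        double score;

        Entry(String term, String sentence, double score) {
            this.term = term;
            this.sentence = sentence;
            this.score = score;
        }

        /**
         * Replaces the sentence and score unless the new score is lower.
         * Returns true if the entry was changed.
         */
        boolean update(String newSentence, double newScore) {
            if (newScore < score) {
                return false; // lower confidence: keep the existing statement
            }
            sentence = newSentence;
            score = newScore;
            return true;
        }
    }

    public static void main(String[] args) {
        // Scores are invented for illustration.
        Entry e = new Entry("bee pollen", "Bee pollen has benefits.", 0.6);
        System.out.println(e.update("Bee pollen comes from flowers.", 0.4)); // rejected
        System.out.println(e.update("Bee pollen comes from flowers.", 0.8)); // accepted
        System.out.println(e.sentence + " (Confidence score: " + e.score + ")");
    }
}
```

When a file is loaded as a sequence of updates, the same method would be called once per line, in file order.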
You may use any user interface for the application: at least a text menu is required, but the
interface can also be graphical (GUI-based). Here is an example text-based interaction:
Statement found: Maple syrup is made from the sweet sap that is stored
in the trunk of the sugar maple. (Confidence score: 0.75)
Data structures
You should write two versions of the application: GenericsKbArrayApp and GenericsKbBSTApp.
You have to use the specified data structures to store the statements, with one entry per
item.
In GenericsKbArrayApp you must use a traditional array (a single array of objects) to store
the statements. You may use a fixed-size array or try to determine the size programmatically.
Do not use a LinkedList, ArrayList or other advanced data structure and do not sort the data. In
this version, the functionality to add new statements to the knowledge base does not have to
include adding new items, only to update the statements associated with items in the initial
knowledge base.
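One possible shape for the array version is sketched below, assuming a fixed-capacity array filled in file order and searched with a linear scan. All names (ArrayStoreSketch, Entry, load, find, update) are hypothetical, the capacity and sample data are invented, and the sketch relies on the main dataset having one statement per term, so load does not check for duplicates.

```java
// Hypothetical names: an unsorted, fixed-capacity array of entries.
class ArrayStoreSketch {

    static final class Entry {
        final String term;
        String sentence;
        double score;

        Entry(String term, String sentence, double score) {
            this.term = term;
            this.sentence = sentence;
            this.score = score;
        }
    }

    private final Entry[] entries; // a single traditional array of objects
    private int count = 0;         // number of slots in use

    ArrayStoreSketch(int capacity) {
        entries = new Entry[capacity];
    }

    /** Appends an entry during the initial load; returns false if the array is full. */
    boolean load(String term, String sentence, double score) {
        if (count == entries.length) {
            return false;
        }
        entries[count++] = new Entry(term, sentence, score);
        return true;
    }

    /** Linear search over the used slots; returns null if the term is absent. */
    Entry find(String term) {
        for (int i = 0; i < count; i++) {
            if (entries[i].term.equals(term)) {
                return entries[i];
            }
        }
        return null;
    }

    /** Updates an existing term only; new terms and lower scores are rejected. */
    boolean update(String term, String sentence, double score) {
        Entry e = find(term);
        if (e == null || score < e.score) {
            return false;
        }
        e.sentence = sentence;
        e.score = score;
        return true;
    }

    public static void main(String[] args) {
        ArrayStoreSketch kb = new ArrayStoreSketch(100_000);
        kb.load("dog", "Dogs bark.", 0.5); // invented score
        kb.update("dog", "Dogs bark loudly.", 0.7);
        Entry e = kb.find("dog");
        System.out.println(e.sentence + " (Confidence score: " + e.score + ")");
    }
}
```

Every lookup in this layout costs time proportional to the number of stored entries.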
In GenericsKbBSTApp you must use a Binary Search Tree (BST) instead of an array. Your
BST implementation can be created from scratch or reused from anywhere. You may NOT
replace the BST with a different data structure and you may not use a balanced tree. The same
functionality should be supported, except that the BST has to support adding additional items as
well. Statement deletion does not have to be supported.
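A minimal unbalanced BST keyed on the term might look like the sketch below. The names (BstSketch, Node, put, find, scoreOf) are hypothetical, the sample data is invented, and ordering terms with String.compareTo is an assumption. Insertion and lookup are written iteratively, and put applies the same "replace unless the new score is lower" rule when the term already exists.

```java
// Hypothetical names: a plain (unbalanced) binary search tree keyed on the term.
class BstSketch {

    static final class Node {
        final String term;
        String sentence;
        double score;
        Node left, right;

        Node(String term, String sentence, double score) {
            this.term = term;
            this.sentence = sentence;
            this.score = score;
        }
    }

    private Node root;

    /** Inserts a new term, or updates an existing one unless the new score is lower. */
    void put(String term, String sentence, double score) {
        if (root == null) {
            root = new Node(term, sentence, score);
            return;
        }
        Node cur = root;
        while (true) {
            int cmp = term.compareTo(cur.term);
            if (cmp == 0) {
                if (score >= cur.score) {
                    cur.sentence = sentence;
                    cur.score = score;
                }
                return;
            }
            Node next = (cmp < 0) ? cur.left : cur.right;
            if (next == null) {
                Node added = new Node(term, sentence, score);
                if (cmp < 0) {
                    cur.left = added;
                } else {
                    cur.right = added;
                }
                return;
            }
            cur = next;
        }
    }

    /** Returns the node for a term, or null if the term is not in the tree. */
    Node find(String term) {
        Node cur = root;
        while (cur != null) {
            int cmp = term.compareTo(cur.term);
            if (cmp == 0) {
                return cur;
            }
            cur = (cmp < 0) ? cur.left : cur.right;
        }
        return null;
    }

    /** Returns the score if this exact term and sentence pair is stored, else null. */
    Double scoreOf(String term, String sentence) {
        Node n = find(term);
        return (n != null && n.sentence.equals(sentence)) ? Double.valueOf(n.score) : null;
    }

    public static void main(String[] args) {
        BstSketch kb = new BstSketch();
        kb.put("tree", "Trees remove carbon dioxide from the atmosphere.", 0.9); // invented score
        kb.put("dog", "Dogs bark.", 0.8);                                        // invented score
        Node n = kb.find("dog");
        System.out.println("Statement found: " + n.sentence
                + " (Confidence score: " + n.score + ")");
    }
}
```

Since the dataset order is randomized, the tree built from the file is unlikely to degenerate into a long chain, though nothing in this sketch guarantees that.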
Test that your applications can load 3 versions of the dataset with different sizes.
Hint: Do this assignment incrementally. First create a data structure with only items and get the
related functions working one at a time. Then add in the functions related to full statements.
Finally, write the code to load in the external file.
Report
Write a report that includes the following:
What your OO design is: what classes you created, why, and how they interact (at most
1 page).
What test values you used during testing and what the output was in each case (use output
redirection or cut and paste or take screenshots) (at most 10 pages).
A statement of what you included in your application(s) that constitutes creativity: how
you went beyond the basic requirements of the assignment (at most 1 page). Examples
of creativity include:
◦ Enhancing the search functionality to include matching items containing single whole
words (e.g. “cellular” matching “cellular mechanism” and “cellular respiration” but not
“unicellular”) or other partial matches of items and/or sentences.
Summary statistics from your use of git to demonstrate usage. Print out the first 10 lines
and last 10 lines from "git log" , with line numbers added. You can use a Unix command
such as:
Dev requirements
As a software developer, you are required to make appropriate use of the following tools:
Makefile
src/
◦ all source code
bin/
◦ all class files
doc/
◦ javadoc output
report.pdf
Your report must be in PDF format. Do not submit the git repository.
Marking Guidelines
Your assignment will be marked by tutors, using the following marking guide.
Description of creativity 3
Code (35)
OOP design 8
Interactive interface 3
Reading data from file 3
Populating the knowledge base (array) 3
Updating the knowledge base (array) 3
Searching the knowledge base (array) 3
Populating the knowledge base (BST) 3
Updating the knowledge base (BST) 3
Searching the knowledge base (BST) 3
Implementation of creativity 3
Dev (11)
Git usage log 2
Documentation (javadoc) 6
Makefile: make and clean targets 3
Total (70)
Additional resources for assignment
GenericsKB.txt
GenericsKB-additional.txt