CS 221 Paper

Teaching assistants act as pillars of a college education, helping students grapple with new and challenging topics each week. Many students interact with their teaching assistants through online forums such as Piazza, where they may ask for assistance outside the purview of daily or weekly Office Hours. We propose and develop a TA chatbot to help answer Piazza questions for this class, CS 221. Specifically, we categorize questions into three types - "Policy", "Assignment", and "Conceptual". We scraped Piazza questions and answers from the current Fall 2016 offering of CS 221, as well as previous offerings of CS 124, with permission from Professors Percy Liang and Dan Jurafsky, respectively.

To summarize, our algorithmic approach entails the use of three classifiers to determine a given question's category. If the question is classified as "Policy", we use regular expressions to match the policy question to a specific subcategory and return an appropriate pre-written answer from a representative set of solutions. If it is categorized as "Assignment", we return the closest Piazza answer according to cosine similarity of one of several feature vectors (including tf-idf and others). If the question is classified as "Conceptual", we perform an intelligent information retrieval over several academic sources (including Russell and Norvig's Artificial Intelligence textbook) and return the most appropriate paragraph, again according to cosine similarity of tf-idf vectors.

Our chatbot is able to differentiate "Policy" questions with low precision and high recall, "Assignment" questions with high precision and high recall, and "Conceptual" questions with low precision and moderate recall. We asked approximately 20 fellow students in CS 221 to evaluate the responses of our chatbot to a total of 15 randomly sampled "Policy", "Assignment", and "Conceptual" questions. Ultimately, our chatbot performs exceptionally well at answering "Policy" questions, moderately well at answering "Assignment" questions, and poorly at answering "Conceptual" questions.

Additional Key Words and Phrases: Chatbot, Education, Question Answering, Information Retrieval

1. INTRODUCTION

This project was inspired by news stories regarding a Georgia Tech computer science class that utilized a chatbot to respond to Piazza posts for an entire semester [1]. We were curious about the implementation but could not find any academic material on the subject, so we became interested in designing our own TA chatbot for CS 221.

We hope to alleviate some of this load by introducing our CS 221 TA chatbot, Percy. In an ideal world, our chatbot might not service a student directly, but instead provide a potential answer that would automatically be posted to Piazza or that would go through a TA-approval process.

2. OVERVIEW OF APPROACH

A quick scan of Piazza posts makes it evident that similar questions are often asked multiple times by different people. Additionally, many Piazza questions pertain to predefined course policy. Utilizing these inefficiencies in questions, we have designed and implemented a TA chatbot to answer Piazza questions.

2.1 Designing the Chatbot

A student can ask literally anything on Piazza, so we aimed to broadly delineate the different types of questions. This way the chatbot could differentially construct answers, according to the type of question being asked. Leveraging our prior experience with the platform, we devised three primary categories for the online questions - "Policy", "Assignment", and "Conceptual".

Fig. 1. Chatbot Design
2.2 Data
We scraped Piazza question, answers, tags, followups, and notes Fig. 3. Question Classification Baseline - Assignment Question
from the Autumn 2016 offering of CS 221 as well as the 2013 - - Precision Recall F-Score
2016 offerings of CS 124, with the permission of Professors Percy Unigram Count 0.66 0.81 0.73
Liang and Dan Jurafsky, respectively. We then cleaned this data, Bigram Count 0.51 0.72 0.60
by removing errant HTML and LaTeX symbols. Additionally, Trigram Count 0.56 0.75 0.64
we procured a PDF copy of Artificial Intelligence: A Modern Uni/Bi/Trigrams Count 0.51 0.72 0.60
Approach by Stuart Russel and Peter Norvig. We converted this Unigrams TF-IDF 0.60 0.78 0.68
pdf into a text file and combed the manuscript, removing artifacts Bigrams TF-IDF 0.51 0.72 0.60
left from the conversion. Trigrams TF-IDF 0.57 0.75 0.65
3. QUESTION CLASSIFICATION classifying every inputted question as ”Assignment”. Since the data
3.1 Baseline is so heavily skewed towards ”Assignment” questions, this results
in a decent F-Score – even though this not a classifier we would
For our baseline, we attempted using a Linear SVM along with actually want for a chatbot.
several simple features upon a 80-20 split of the data (Figures 2, 3,
and 4). We leveraged simple features such as unigram, bigram, and 3.2 Oracle
trigram counts.
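As a concrete illustration, the following is a minimal sketch of such a baseline, assuming scikit-learn; the function name and the 80-20 split via train_test_split are illustrative rather than our exact implementation.

```python
# Baseline sketch (assumes scikit-learn): a Linear SVM over n-gram counts,
# evaluated on a held-out 20% of the labeled Piazza questions.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

def train_baseline(questions, labels, ngram_range=(1, 1)):
    """Train the n-gram count + Linear SVM baseline on an 80-20 split.
    ngram_range=(1, 1) gives unigram counts; (2, 2) bigrams; (1, 3) the
    combined uni/bi/trigram feature set."""
    X_train, X_test, y_train, y_test = train_test_split(
        questions, labels, test_size=0.2, random_state=0)
    model = Pipeline([
        ("counts", CountVectorizer(ngram_range=ngram_range)),
        ("svm", LinearSVC()),
    ])
    model.fit(X_train, y_train)
    print(classification_report(y_test, model.predict(X_test)))
    return model
```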
Additionally, we attempted training the SVM upon TF-IDF vector representations of the given questions. Given a set of input training questions, we computed a TF-IDF value for each of the tokens in the training set. TF-IDF is defined as follows:

    tf-idf_t = tf_t * log(N / df_t)

where tf_t is the frequency of term t across the corpus, N is the number of documents in the corpus, and df_t is the number of documents in the corpus containing term t.
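As a worked illustration of this formula (with made-up numbers, not our data): a term that occurs 4 times across a corpus of 100 questions and appears in 5 of them receives a weight of 4 * log(100/5), roughly 12 with the natural log. The sketch below computes weights exactly as defined above; a library implementation such as scikit-learn's TfidfVectorizer, which uses per-document term frequencies, could be substituted in practice.

```python
import math
from collections import Counter

def tfidf_weights(documents):
    """Compute tf-idf as defined above: tf is the term's frequency across the
    whole corpus, N the number of documents, and df_t the number of documents
    containing the term. Returns a {term: weight} dictionary."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    corpus_tf = Counter(token for doc in tokenized for token in doc)
    doc_freq = Counter(token for doc in tokenized for token in set(doc))
    return {t: corpus_tf[t] * math.log(n_docs / doc_freq[t]) for t in corpus_tf}
```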
Fig. 3. Question Classification Baseline - Assignment Question

Features                Precision  Recall  F-Score
Unigram Count           0.66       0.81    0.73
Bigram Count            0.51       0.72    0.60
Trigram Count           0.56       0.75    0.64
Uni/Bi/Trigrams Count   0.51       0.72    0.60
Unigrams TF-IDF         0.60       0.78    0.68
Bigrams TF-IDF          0.51       0.72    0.60
Trigrams TF-IDF         0.57       0.75    0.65

In practice, the baseline SVM largely acted as a majority algorithm, classifying every inputted question as "Assignment". Since the data is so heavily skewed towards "Assignment" questions, this results in a decent F-Score - even though this is not a classifier we would actually want for a chatbot.

3.2 Oracle

For comparison, our oracle was an SVM classifier that utilized Piazza metadata, which includes question tags such as "Other", "Hw1", etc. Most of the tags directly map to one of our three categories, so we believed that this additional information would help provide a much stronger signal as to the different question types.

Examining the results (Figure 5), the F-Scores are higher than our baseline at 0.92, 0.91, and 0.93 for "Policy", "Assignment", and "Conceptual" respectively. Digging deeper into these numbers, we found that the metadata tags were able to differentiate "Policy" from "NOT Policy", unlike our baseline classifier, though with low precision (0.20). The metadata classifier performed exceptionally well at discerning "Assignment" questions from those that were "NOT Assignment", though with low recall for "NOT Assignment" (0.37). Finally, the metadata tags did not help in differentiating the "Conceptual" questions from those that were "NOT Conceptual", with the SVM again acting as a majority algorithm. This is probably due to the presence of very few "Conceptual" questions on Piazza and the fact that there is no single "Conceptual" category on Piazza. Rather, these questions are often tagged as "Other" - a category which also contains miscellaneous questions that fall outside the purview of our "Conceptual" class.

3.3 Advanced Features

Distributional word vector representations have gained tremendous traction over the last few years - first with Google's word2vec [2] and Stanford's GloVe [3]. word2vec develops word vector representations that are intended to highlight contextual similarities - grouping words that may not be associated by definition, e.g. "Barcelona" and "Spain". Meanwhile, GloVe vectors try to capture both local context and larger, global meaning. The goal of these word vector representations is to illuminate similarities to other words and phrases across a large corpus so that one can infer words' definitions and relationships.
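One common way to turn pre-trained GloVe vectors into question-level features is to average the vectors of a question's tokens. The sketch below illustrates that idea and is not our exact feature code; the file path and the 100-dimensional vectors are assumptions.

```python
import numpy as np

def load_glove(path):
    """Load pre-trained GloVe vectors from their whitespace-separated text format."""
    vectors = {}
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

def glove_features(question, vectors, dim=100):
    """Represent a question as the mean of its tokens' GloVe vectors;
    out-of-vocabulary tokens are simply skipped."""
    tokens = [t for t in question.lower().split() if t in vectors]
    if not tokens:
        return np.zeros(dim, dtype=np.float32)
    return np.mean([vectors[t] for t in tokens], axis=0)
```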
3.4 Results

Improving on the baseline, we tried to correct for the tremendous skew towards questions of category "Assignment". Examining our CS 221 and CS 124 data, we found that roughly 70% of questions were "Assignment" focused, while 20% were "Policy" related, and 10% pertained to "Conceptual" ideas from the courses. To counter this intrinsic bias, we tried using a Multinomial Naive Bayes predictor (Figures 6, 7, and 8) with the aforementioned priors and a Random Forest predictor (Figures 9, 10, and 11).
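A minimal sketch of the Naive Bayes variant, assuming scikit-learn: the fixed class priors mirror the rough 70/20/10 split described above (ordered to match scikit-learn's alphabetically sorted class labels), and unigram counts stand in for any of the feature sets we report.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Priors reflecting the observed skew; scikit-learn sorts class labels, so the
# order is ["Assignment", "Conceptual", "Policy"] -> [0.70, 0.10, 0.20].
CLASS_PRIORS = [0.70, 0.10, 0.20]

def train_naive_bayes(train_questions, train_labels):
    """Multinomial Naive Bayes over unigram counts with fixed class priors."""
    model = Pipeline([
        ("counts", CountVectorizer()),
        ("nb", MultinomialNB(class_prior=CLASS_PRIORS)),
    ])
    return model.fit(train_questions, train_labels)
```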
We found that both Multinomial Naive Bayes and Random Forest predictors performed approximately the same on the test set. Additionally, both improved significantly on "Assignment" and "Policy" question classification, in that they maintained relatively high F-Scores without acting like majority algorithms. Though the F-Scores for "Conceptual" classification seem very high (0.96), the classifiers are just acting as majority algorithms, which once again is not helpful within the context of classifying questions for a chatbot.

Examining performance over the various features, it seems like unigrams were the best feature for delineating question types. This simplistic feature nearly universally outperformed the other feature functions across both predictors. Additionally, our attempt to leverage GloVe vectors was somewhat unsuccessful, with the GloVe features consistently scoring lower than unigrams across all three classification tasks.

Fig. 6. Question Classification Multinomial Naive Bayes - Policy Question

Features                Precision  Recall  F-Score
Unigram Count           0.92       0.86    0.88
Bigram Count            0.82       0.68    0.72
Trigram Count           0.81       0.41    0.36
Uni/Bi/Trigrams Count   0.95       0.91    0.92
Unigrams TF-IDF         0.82       0.45    0.43
Bigrams TF-IDF          0.88       0.18    0.12
Trigrams TF-IDF         0.05       0.23    0.08
Regexes                 0.01       0.09    0.02
Oracle                  0.83       0.68    0.68

Fig. 8. Question Classification Multinomial Naive Bayes - Conceptual Question

Features                Precision  Recall  F-Score
Unigram Count           0.92       0.92    0.92
Bigram Count            0.90       0.89    0.89
Trigram Count           0.85       0.87    0.86
Uni/Bi/Trigrams Count   0.92       0.88    0.90
Unigrams TF-IDF         0.87       0.93    0.90
Bigrams TF-IDF          0.87       0.93    0.90
Trigrams TF-IDF         0.86       0.93    0.89
Oracle                  0.90       0.95    0.92

4. QUESTION ANSWERING

Once the chatbot has determined that the question belongs to one or more categories, it then generates a response specific to the type of question. Here we describe the three different methodologies for answering "Policy", "Assignment", and "Conceptual" questions.
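At a high level, the answering stage is a dispatch on the predicted category. The sketch below is an illustration of that flow, not our exact code: classifier is assumed to expose scikit-learn's predict(), and answerers is a hypothetical mapping from category name to the corresponding answer-generation function.

```python
def route_question(question, classifier, answerers):
    """Classify a question and delegate to the matching answer strategy,
    e.g. answerers = {"Policy": ..., "Assignment": ..., "Conceptual": ...}."""
    category = classifier.predict([question])[0]
    return answerers[category](question)
```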
To respond to conceptual questions, we implemented information retrieval on unstructured documents as discussed in the paper by Yan, Zhao, et al. [4]. Their method includes retrieving multiple answers and then ranking those answers according to tf-idf, surrounding context, chapter, topic, etc. Thus, we approached this as an information retrieval task over a textbook - specifically Russell and Norvig's Artificial Intelligence: A Modern Approach.

Originally, our chatbot compared every line of the textbook with the question, but this ran slowly, taking up to 90 seconds, and sometimes returned unrelated material. Now we break the book up into sections, find the most relevant section, and then find the most relevant line within that section. In our chatbot's booting phase, it reads the textbook in, splitting it by its section divisions. It then calculates a tf-idf matrix where each row represents a section of the book, and the idf is based on the entire textbook. It also calculates a separate tf-idf matrix for each section, where each row of the matrix represents a line in that section and the idf is based on that section. Whenever a conceptual question is asked, the chatbot calculates the tf-idf of that question based on the idf of the textbook. It then calculates the cosine distance between the question and every section in the textbook. Once it has found the most similar section, it recalculates the question's tf-idf vector based on the idf of the most relevant section, and then finds the closest line in the section using cosine similarity. The chatbot returns that line and its surrounding paragraph, since conceptual topics usually take more than a single sentence to explain.

Similarly to the assignment questions, we found that our answers were improved by returning the three closest paragraphs, rather than just the best.
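The two-stage lookup described above can be sketched as follows, assuming scikit-learn and assuming the textbook has already been split into sections, each a list of lines; for brevity the sketch returns only the single best line rather than the surrounding paragraph(s).

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def retrieve_conceptual_answer(question, sections):
    """Two-stage tf-idf retrieval: pick the most similar section using idf
    computed over the whole book, then pick the most similar line using idf
    computed over that section alone."""
    section_texts = [" ".join(lines) for lines in sections]

    # Stage 1: one tf-idf row per section, idf based on the entire textbook.
    book_vectorizer = TfidfVectorizer()
    section_matrix = book_vectorizer.fit_transform(section_texts)
    question_vec = book_vectorizer.transform([question])
    best_section = cosine_similarity(question_vec, section_matrix).argmax()

    # Stage 2: one tf-idf row per line, idf based on the chosen section only.
    lines = sections[best_section]
    line_vectorizer = TfidfVectorizer()
    line_matrix = line_vectorizer.fit_transform(lines)
    question_vec = line_vectorizer.transform([question])
    best_line = cosine_similarity(question_vec, line_matrix).argmax()
    return lines[best_line]
```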
4.4 Results

Evaluation of our results for the question classification stage was fairly straightforward, in that metrics for classification problems (e.g. precision, recall, F1-score) are well-established, and are recognized and understood by the broader research community. In contrast, evaluation of question answering is an inherently hard problem with no established evaluation metrics. Given a question, our methodology yields an answer. Our evaluation metric must then answer the question: "to what extent does this response constitute a 'good' answer to this question?". Setting aside the myriad interpretations of what might constitute a "good" answer (e.g. the balance between succinctness and detail), this evaluation metric requires us to develop a meaningful and nuanced interpretation of both the question and the answer and determine in what ways they might be related - an open problem in the artificial intelligence community. Concordantly, we were forced to pursue alternate strategies in the evaluation of our responses to questions.

Initially, we experimented with several automated evaluation metrics for string similarity, most prominently BLEU score and cosine similarity between TF-IDF vectors for question-answer pairs. However, as expected, these methods only allow for the matching of syntactically similar question-answer pairings, not necessarily those with similar intentions.

After these and other initial attempts at evaluation for our question-answering stage, we turned to human evaluation - a technique which, while not scalable, provided an accurate, intelligent evaluation metric. For human evaluation, we curated a set of fifteen question-answer pairings (five conceptual, five policy-based, and five assignment-based) which we felt were representative of the overall performance of the chatbot. Twenty-two students were asked to rate the quality of each answer as a response to the given question, on a scale from one to five. We present in Figure 12 the mean ratings for each question, as well as the standard deviation of the ratings for each question, to illustrate the presence or absence of contention over the quality of an answer (for brevity, we have excluded the actual question-answer pairings from the table; these question-answer pairings can be found in the appendix).

Fig. 12. Question Answering - Human Evaluation

Question  Category    Mean Rating  Standard Dev.
1         Policy      4.18         0.96
2         Policy      4.73         0.55
3         Policy      3.95         1.25
4         Policy      3.55         1.44
5         Policy      2.77         1.72
6         Assignment  4.05         1.00
7         Assignment  4.18         0.85
8         Assignment  2.50         1.22
9         Assignment  1.95         1.36
10        Assignment  3.73         1.24
11        Conceptual  2.00         1.27
12        Conceptual  1.32         0.89
13        Conceptual  2.59         1.10
14        Conceptual  2.14         1.25
15        Conceptual  2.27         1.28

5. ANALYSIS

5.1 Analysis: Question Classification

Our primary conclusion from the results of the question classification sub-problem is that this is inherently a data problem. With our limited dataset of slightly more than one thousand past question-answer pairings, coupled with standard classification algorithms (e.g. SVM, Naive Bayes, Random Forest), we observed moderately successful classification rates, with significant room for improvement.

Interestingly, unigram features proved to be the best delineator of question types, but this may be a result of the small dataset and high word-to-word variance across Piazza questions. Additionally, it was initially surprising that the GloVe features failed to perform at least as well as unigrams. However, upon further consideration, it makes sense that our chosen GloVe representations may not be suitable for this classification task. The Wikipedia/Gigaword-5 dataset that the GloVe vectors were trained upon may not be as applicable to our Piazza question classification dataset, since the questions deal with very assignment-specific terms. GloVe features may have proved more valuable in differentiating "Conceptual" questions from the rest if they had been trained on a corpus of artificial intelligence texts, as these questions feature a wide array of terminology that is both diverse and exclusive to these types of questions.
Furthermore, it is likely that the results of our question classification phase could be improved significantly via the introduction of more data from previous iterations of the course (this data was not included originally due to concerns over protecting student privacy under FERPA). Another potential improvement in this approach would be to incorporate data from other classes for question classification purposes only (and not for question answering purposes). It is possible that question categories in other classes are similar enough to question categories for CS 221 that they could simply be folded into our existing dataset. We did obtain data for CS 124 (Introduction to Natural Language Processing), courtesy of Professor Dan Jurafsky, but ultimately were not able to incorporate it - an omission which paves the path for several obvious next steps.

5.2 Analysis: Question Answering

5.2.1 Analysis: Policy Questions. Policy-based questions were answered simply and effectively using a series of regular expression pattern matches to classify a question as belonging to one of several dozen different categories. Some might be quick to criticize this as an "unintelligent" approach, and indeed by any definition it certainly does not incorporate artificial intelligence. However, we, the authors, cite this particular design choice as an example of the prudent application of artificial intelligence techniques. The number of different topics which fall under the umbrella of "policy questions" is small enough to be completely enumerated by a small team of people familiar with the course, and is unlikely to grow proportionally with the size of this course. Additionally, these questions have strict, predefined answers, which may change with time but do not change rapidly. Thus, it seems most appropriate to use a small, predefined set of answers to field these questions, and in practice this approach proved extremely effective. We cite as evidence of this point our mean ratings for questions one through five in Figure 12. Question-Answer pairs 1 and 2 were consistently rated highly, while Question-Answer pairs 3, 4, and 5 were rated highly in general, with high variance in responses. With further refinement of the techniques employed to answer policy questions, we postulate that these could be automatically answered in a completely satisfactory manner.
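To illustrate this design choice, a policy matcher amounts to a list of regular-expression rules paired with pre-written answers. The patterns and canned answers below are invented examples, not the actual CS 221 policy set.

```python
import re

# Hypothetical policy subcategories; the real set of several dozen patterns and
# pre-written answers is curated by people familiar with the course.
POLICY_RULES = [
    (re.compile(r"\b(late|grace)\s+day", re.IGNORECASE),
     "Each student has a fixed budget of late days; see the course information page."),
    (re.compile(r"\bregrade\b", re.IGNORECASE),
     "Regrade requests should be submitted within one week of grades being released."),
]

def answer_policy_question(question):
    """Return the pre-written answer for the first matching policy pattern,
    or None if the question matches no known policy subcategory."""
    for pattern, canned_answer in POLICY_RULES:
        if pattern.search(question):
            return canned_answer
    return None
```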
5.2.2 Analysis: Assignment Questions. …increases proportionally, and bolsters the strength of this approach.

One caveat to the strength of the current question-answering method for assignment-based questions is the fact that assignments for many classes will exhibit subtle changes year over year, shifting with instructor preferences and the unique and unexpected demands of any given academic quarter. This is particularly true of relatively cutting-edge classes such as CS 221, where advances in academia may be relevant to material taught in the class. The addition of a discretionary layer to our current question-answering model would allow human teaching assistants to evaluate any answer proposed by the chatbot for validity before releasing it to students. This would certainly prevent the chatbot from releasing any erroneous information, but the overhead of human error-checking could be almost as high as the overhead of human question-answering in the first place. In the ideal case, the chatbot will be able to release information without human approval, though such a release will require a high threshold of confidence, which we are unable to achieve for most questions as of this writing.

5.2.3 Analysis: Conceptual Questions. Conceptual questions proved much more difficult to answer, as is evidenced by our results for this category. We note that questions eleven through fifteen in Figure 12 received relatively low ratings, with low variance in responses. However, the goal of this research was to use artificial intelligence techniques to answer repetitive or unnecessary questions, to ensure that teaching assistants could spend more time helping students with difficult, conceptual topics. Currently, the chatbot's method for responding to conceptual questions consists of an intelligent information retrieval approach, designed to guide students to academic resources that may help them arrive at an answer on their own. However, students struggling with conceptual topics often seek the active dialogue and discussion of a topic that only a human teaching assistant can provide. Thus, the fact that we cannot yet answer difficult, conceptual questions is not so much a defeat as it is a recognition of the fact that some topics (for now) can and should be left to human teaching assistants.
…our expectations of the difficulty of the problem were validated.

Ultimately, our chatbot strove to act as a question-answering vehicle, lacking a purely conversational mechanism. Given more time, it would be fascinating to develop a knowledge base of facts that the chatbot could draw on to generate answers. But much like conceptual questions, this is a much harder problem than we could take on given the limited time frame.

REFERENCES

[4] Yan, Zhao, et al. "DocChat: An Information Retrieval Approach for Chatbot Engines Using Unstructured Documents."
[5] Dror, Gideon, et al. "Learning from the Past: Answering New Questions with Past Answers." Proceedings of WWW, 2012.
[6] Brin, Sergey. "Extracting Patterns and Relations from the World Wide Web." International Workshop on The World Wide Web and Databases. Springer Berlin Heidelberg, 1998.
APPENDIX

…call getAction() to compare the policy, and your exploration prob is not 0, with probability epsilon you will get a random policy.

(9) I'm confused about how we are supposed to set up the feature keys. It says that each feature key is (state, action), but does this mean a single feature contains all the (state, action) pairs mapped to something, or is each (state, action) pair a separate feature?
Answer: No, the next state should be (6, None, None) with 0 reward.

(10) What happens if the reward is not immediately apparent to Monte Carlo? For example, in chess or Go, nothing really matters until the final move, so how does reward account for that?
Answer: I think the feedback backpropagates for a game like chess where the only (and big) reward comes from moving a state to the terminal state (checkmate). The Qopt and Vopt recurrence would feed into each other such that you can infer rewards all the way back up to the start of the game.

(11) What does Q-learning do in a terminal state?
Answer: As in (a), but with the terminal state at (5,5). The actions are deterministic moves in the four directions. In each case, compare the results using three-dimensional plots. For each environment, propose additional features (besides x and y) that would improve the approximation and show the results.
(12) Why would people use policy iteration over value iteration?
Answer: A 10 x 10 world with a single +1 terminal state at (10,10). As in (a), but add a -1 terminal state at (10,1). As in (b), but add obstacles in 10 randomly selected squares. As in (b), but place a wall stretching from (5,2) to (5,9). As in (a), but with the terminal state at (5,5).

(13) Beam Search w/ different K selection: Are there any variations on pruning other than picking the top k elements with highest weights? For example, using some kind of random distribution, or picking some elements in separate intervals. Would that generate better results?
Answer: Just one. It begins with k randomly generated states. At each step, all the successors of all k states are generated. If any one is a goal, the algorithm halts. Otherwise, it selects the k best successors from the complete list and repeats.

(14) Motivation of SARSA: I do not understand the motivation behind SARSA and other bootstrapping methods in the context of model-free learning. Why is it important to obtain feedback quickly if it is not used to modify the policy online?
Answer: For this reason, Q-learning is called a model-free method. As with utilities, we can write a constraint equation that must hold at equilibrium when the Q-values are correct. As in the ADP learning agent, we can use this equation directly as an update equation for an iteration process that calculates exact Q-values, given an estimated model. This does, however, require that a model also be learned, because the equation uses P(s' | s, a).

(15) Why do we care if hinge loss puts an upper bound on the 0-1 loss? Why is it an important feature of the hinge loss that it places an upper bound on the 0-1 loss? Couldn't we multiply the expression for the hinge loss by 0.1 arbitrarily (so it does not upper bound the 0-1 loss) and get the same results by using it?
Answer: Is it possible to find an upper bound on the value of C before we have looked at all its children? (Recall that this is what alpha-beta needs in order to prune a node and its subtree.) At first sight, it might seem impossible because the value of C is the average of its children's values, and in order to compute the average of a set of numbers, we must look at all the numbers. But if we put bounds on the possible values of the utility function, then we can arrive at bounds for the average without looking at every number. For example, say that all utility values are between -2 and +2; then the value of leaf nodes is bounded, and in turn we can place an upper bound on the value of a chance node without looking at all its children. An alternative is to do Monte Carlo simulation to evaluate a position.