
Meet Percy: The CS 221 Teaching Assistant Chatbot

SAHIL CHOPRA and RACHEL GIANFORTE and JOHN SHOLAR


Stanford University

Teaching assistants act as pillars of a college education, helping students grapple with new and challenging topics each week. Many students interact with their teaching assistants through online forums such as Piazza, where they may ask for assistance outside the purview of daily or weekly Office Hours. We propose and develop a TA chatbot to help answer Piazza questions for this class, CS 221. Specifically, we categorize questions into three types - "Policy", "Assignment", and "Conceptual". We scraped Piazza questions and answers from the current Fall 2016 offering of CS 221, as well as previous offerings of CS 124, with permission from Professors Percy Liang and Dan Jurafsky, respectively.

To summarize, our algorithmic approach entails the use of three classifiers to determine a given question's category. If the question is classified as "Policy", we use regular expressions to match the policy question to a specific subcategory and return an appropriate pre-written answer from a representative set of solutions. If it is categorized as "Assignment", we return the closest Piazza answer according to cosine similarity of one of several feature vectors (including tf-idf and others). If the question is classified as "Conceptual", we perform an intelligent information retrieval from several academic sources (including Russell and Norvig's Artificial Intelligence textbook) and return the most appropriate paragraph, again according to cosine similarity of tf-idf vectors.

Our chatbot is able to differentiate "Policy" questions with low precision and high recall, "Assignment" questions with high precision and high recall, and "Conceptual" questions with low precision and moderate recall. We asked approximately 20 fellow students in CS 221 to evaluate the responses of our chatbot to a total of 15 randomly sampled "Policy", "Assignment", and "Conceptual" questions. Ultimately, our chatbot performs exceptionally well at answering "Policy" questions, moderately well at answering "Assignment" questions, and poorly at answering "Conceptual" questions.
Additional Key Words and Phrases: Chatbot, Education, Question Answering, Information Retrieval

1. INTRODUCTION
This project was inspired by news stories regarding a Georgia
Tech computer science class that utilized a chatbot to respond to
Piazza posts for an entire semester [1]. We were curious about the
implementation but could not find any academic material on the
subject, so we became interested in designing our own TA chatbot
for CS 221.

Teaching assistants are crucial to guiding students through their educational experience. TAs have a rich understanding of the material that is being taught within their class, and as experts on the course, they can help provide students with both conceptual support and assignment-specific assistance.

Nowadays, most computer science classes at Stanford utilize Piazza, an online forum where students can ask questions at any time of day. With this ease of access, TAs are being inundated with more questions than ever before. With our project, we hope to alleviate some of this load by introducing our CS 221 TA chatbot, Percy. In an ideal world, our chatbot might not service a student directly but instead provide a potential answer that would automatically be posted to Piazza or that would go through a TA-approval process.

2. OVERVIEW OF APPROACH

A quick scan of Piazza posts makes it evident that similar questions are often asked multiple times by different people. Additionally, many Piazza questions pertain to predefined course policy. Exploiting these redundancies, we have designed and implemented a TA chatbot to answer Piazza questions.

2.1 Designing the Chatbot

A student can ask literally anything on Piazza, so we aimed to broadly delineate the different types of questions. This way, the chatbot can construct answers differently according to the type of question being asked. Leveraging our prior experience with the platform, we devised three primary categories for the online questions - "Policy", "Assignment", and "Conceptual".

Fig. 1. Chatbot Design

Below we list a few examples of these three categories:

(1) Policy Questions - The chatbot should be able to answer questions regarding class policy, e.g. office hour timings, assignment due dates, etc. Example: "Where are office hours located?"

(2) Assignment Questions - The chatbot should be able to answer assignment-specific questions. Example: "I am receiving the following output probabilities from my Bayesian network? ... What could my bug be?"

(3) Conceptual Questions - The chatbot should be able to answer conceptual questions pertaining to artificial intelligence. Example: "What is the difference between state-based and variable-based models?"

With this categorization of questions, our chatbot consists of three steps: first, classifying the type of question; second, generating an answer; and third, providing an answer to the user if the prescribed solution meets a defined confidence threshold (Figure 1).
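The following sketch illustrates this three-step flow. It is a simplified illustration rather than our actual implementation: the classifier, the per-category answer generators, and the confidence threshold value are all stand-ins.

# A minimal sketch of the three-step pipeline described above. The function
# names and the threshold value are illustrative placeholders, not the
# actual implementation.
CONFIDENCE_THRESHOLD = 0.6  # hypothetical value

def respond(question, classify_question, answerers):
    """Classify a question, generate a candidate answer for its category,
    and only return the answer if its confidence clears the threshold."""
    category = classify_question(question)       # "Policy", "Assignment", or "Conceptual"
    answer, confidence = answerers[category](question)
    if confidence >= CONFIDENCE_THRESHOLD:
        return answer
    return "I don't know."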

2.2 Data
We scraped Piazza questions, answers, tags, followups, and notes from the Autumn 2016 offering of CS 221 as well as the 2013-2016 offerings of CS 124, with the permission of Professors Percy Liang and Dan Jurafsky, respectively. We then cleaned this data by removing errant HTML and LaTeX symbols. Additionally, we procured a PDF copy of Artificial Intelligence: A Modern Approach by Stuart Russell and Peter Norvig. We converted this PDF into a text file and combed the manuscript, removing artifacts left from the conversion.

Ultimately, we trained our chatbot on both our dataset of approximately 1500 cleaned (question, answer) tuples from the 2016 offering of CS 221 and the cleaned copy of Artificial Intelligence: A Modern Approach. The Piazza data was leveraged to classify questions and to answer questions from the "Assignment" category. The textbook was utilized to answer questions from the "Conceptual" category.

Please note that our data set for CS 221 only extends through November 6, 2016, as we did not have sufficient time to scrape and clean additional data.

3. QUESTION CLASSIFICATION

3.1 Baseline

For our baseline, we attempted using a Linear SVM along with several simple features upon an 80-20 split of the data (Figures 2, 3, and 4). We leveraged simple features such as unigram, bigram, and trigram counts.

Additionally, we attempted training the SVM upon TF-IDF vector representations of the given questions. Given a set of input training questions, we computed a TF-IDF value for each of the tokens in the training set. TF-IDF is defined as follows:

tf-idf = tf * log(N / df_t)

where tf is the term frequency across the corpus, N is the number of documents in the corpus, and df_t is the number of documents containing the term.

Fig. 2. Question Classification Baseline - Policy Question

Features                Precision  Recall  F-Score
Unigram Count           0.67       0.82    0.74
Bigram Count            0.53       0.73    0.61
Trigram Count           0.83       0.91    0.87
Uni/Bi/Trigrams Count   0.67       0.82    0.74
Unigrams TF-IDF         0.83       0.91    0.87
Bigrams TF-IDF          0.60       0.77    0.67
Trigrams TF-IDF         0.75       0.86    0.80

Fig. 3. Question Classification Baseline - Assignment Question

Features                Precision  Recall  F-Score
Unigram Count           0.66       0.81    0.73
Bigram Count            0.51       0.72    0.60
Trigram Count           0.56       0.75    0.64
Uni/Bi/Trigrams Count   0.51       0.72    0.60
Unigrams TF-IDF         0.60       0.78    0.68
Bigrams TF-IDF          0.51       0.72    0.60
Trigrams TF-IDF         0.57       0.75    0.65

Fig. 4. Question Classification Baseline - Conceptual Question

Features                Precision  Recall  F-Score
Unigram Count           0.91       0.95    0.93
Bigram Count            0.90       0.95    0.92
Trigram Count           0.88       0.94    0.91
Uni/Bi/Trigrams Count   0.91       0.95    0.93
Unigrams TF-IDF         0.90       0.95    0.92
Bigrams TF-IDF          0.87       0.93    0.90
Trigrams TF-IDF         0.82       0.91    0.86

At first glance, it seems like the baselines are performing decently well across "Policy", "Assignment", and "Conceptual" classification, with maximum F-Scores of 0.87, 0.73, and 0.93, respectively. Upon further analysis it becomes clear that these high F-Scores are artificial. The SVM is acting as a majority algorithm and classifying every inputted question as "Assignment". Since the data is so heavily skewed towards "Assignment" questions, this results in a decent F-Score - even though this is not a classifier we would actually want for a chatbot.
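As a concrete illustration, the baseline above can be reproduced with a few lines of scikit-learn. This is a minimal sketch under the assumption that the scraped Piazza questions and their category labels are available as parallel Python lists; swapping TfidfVectorizer for CountVectorizer yields the raw n-gram count variants.

# A minimal sketch of the Section 3.1 baseline, assuming scikit-learn.
# `questions` and `labels` stand in for the scraped Piazza data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report

def train_baseline(questions, labels, ngram_range=(1, 1)):
    # 80-20 split, TF-IDF features over the chosen n-gram range, Linear SVM.
    X_train, X_test, y_train, y_test = train_test_split(
        questions, labels, test_size=0.2, random_state=0)
    vectorizer = TfidfVectorizer(ngram_range=ngram_range)
    clf = LinearSVC()
    clf.fit(vectorizer.fit_transform(X_train), y_train)
    predictions = clf.predict(vectorizer.transform(X_test))
    # Per-class precision, recall, and F-score as reported in Figures 2-4.
    print(classification_report(y_test, predictions))
    return vectorizer, clf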
3.2 Oracle

For comparison, our oracle was an SVM classifier that utilized Piazza metadata, which includes question tags such as "Other", "Hw1", etc. Most of the tags directly map to one of our three categories, so we believed that this additional information would help provide a much stronger signal as to the different question types.

Fig. 5. Question Classification Oracle - Piazza Metadata

Category              Precision  Recall  F-Score
Policy Questions      0.90       0.95    0.92
Assignment Questions  0.79       0.80    0.77
Conceptual Questions  0.91       0.95    0.93

Examining the results (Figure 5), the F-Scores are higher than our baseline at 0.92, 0.91, and 0.93 for "Policy", "Assignment", and "Conceptual" respectively. Digging deeper into these numbers, we found that the metadata tags were able to differentiate "Policy" from "NOT Policy", unlike our baseline classifier, though with low precision (0.20). The metadata classifier performed exceptionally well at discerning "Assignment" questions from those that were "NOT Assignment", though with low recall for "NOT Assignment" (0.37). Finally, the metadata tags did not help in differentiating "Conceptual" questions from those that were "NOT Conceptual", with the SVM again acting as a majority algorithm. This is probably due to the presence of very few "Conceptual" questions on Piazza and the fact that there is no single "Conceptual" category on Piazza. Rather, these questions are often tagged as "Other" - a category which also contains miscellaneous questions outside the purview of our "Conceptual" class.

3.3 Advanced Features

Distributional word vector representations have been gaining tremendous traction over the last few years - first with Google's word2vec [2] and Stanford's GloVe [3]. word2vec develops word vector representations that are intended to highlight contextual similarities - grouping words that may not be associated by definition, e.g. "Barcelona" and "Spain". Meanwhile, GloVe vectors try to capture both local context and larger, global meaning. The goal of these word vector representations is to illuminate similarities to other words and phrases across large corpora so that one can infer words' definitions and relationships.

For our question classification task, we attempted leveraging GloVe vectors that had been pre-trained on the combination of the 2014 Wikipedia dump and the Gigaword-5 dataset. For each training and test sample, we iterated through the words of the question or answer, removing stop words and summing together the word vectors to devise a single aggregate vector representation of the entire text sample. Please note that we only ran the Random Forest predictor on this feature function, because the GloVe representations are not guaranteed to be non-negative and the Multinomial Naive Bayes predictor assumes non-negative, count-like features drawn from a multinomial distribution.
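A rough sketch of this aggregate GloVe feature is shown below, assuming pre-trained vectors stored in the standard GloVe text format (one word followed by its coefficients per line); the file name, stop-word set, and dimensionality in the usage comment are illustrative placeholders.

# A rough sketch of the summed-GloVe feature described above.
import numpy as np

def load_glove(path):
    # Read "word v1 v2 ... vd" lines into a {word: vector} dictionary.
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

def embed(text, vectors, stop_words, dim=50):
    # Sum the vectors of all non-stop-word tokens to obtain one fixed-length
    # feature vector for the whole question or answer.
    total = np.zeros(dim, dtype=np.float32)
    for token in text.lower().split():
        if token not in stop_words and token in vectors:
            total += vectors[token]
    return total

# Example usage (path and stop words are illustrative):
# glove = load_glove("glove.6B.50d.txt")
# features = embed("Where are office hours located?", glove, {"are", "the"})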
In addition to our experiments with GloVe vectors, we attempted leveraging the regular expressions we wrote for determining the topic of a "Policy" question, e.g. office hours, as binary features to predict whether a question was of category "Policy".

3.4 Results

Improving on the baseline, we tried to correct for the tremendous skew towards questions of category "Assignment". Examining our CS 221 and CS 124 data, we found that roughly 70% of questions were "Assignment" focused, while 20% were "Policy" related and 10% pertained to "Conceptual" ideas from the courses. To counter this intrinsic bias, we tried using a Multinomial Naive Bayes predictor with the aforementioned priors (Figures 6, 7, and 8) and a Random Forest predictor (Figures 9, 10, and 11).

We found that both the Multinomial Naive Bayes and Random Forest predictors performed approximately the same on the test set. Additionally, both improved significantly on "Assignment" and "Policy" question classification, in that they maintained relatively high F-Scores without acting like majority algorithms. Though the F-Scores for "Conceptual" classification seem very high (0.96), the classifiers are just acting as majority algorithms, which once again is not helpful within the context of classifying questions for a chatbot.

Examining performance over the various features, it seems that unigrams were the best feature for delineating question types. This simplistic feature nearly universally outperformed the other feature functions across both predictors. Additionally, our attempt to leverage GloVe vectors was somewhat unsuccessful, with the GloVe features consistently scoring lower than unigrams across all three classification tasks.
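A minimal sketch of this setup, assuming scikit-learn, is shown below; the explicit class priors mirror the rough 70/20/10 split reported above, and the feature matrix is built from unigram counts, which was our best-performing feature.

# A minimal sketch of countering the class skew with explicit priors.
# `train_questions` and `train_labels` stand in for the scraped Piazza data.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier

def train_skew_aware(train_questions, train_labels):
    vectorizer = CountVectorizer(ngram_range=(1, 1))
    X = vectorizer.fit_transform(train_questions)

    # Multinomial Naive Bayes with fixed class priors. scikit-learn orders
    # classes alphabetically, so the priors below correspond to
    # Assignment (0.70), Conceptual (0.10), Policy (0.20).
    nb = MultinomialNB(class_prior=[0.70, 0.10, 0.20])
    nb.fit(X, train_labels)

    # Random Forest over the same count features for comparison.
    rf = RandomForestClassifier(n_estimators=100)
    rf.fit(X, train_labels)
    return vectorizer, nb, rf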
Fig. 6. Question Classification Multinomial Naive Bayes - Policy Question

Features                Precision  Recall  F-Score
Unigram Count           0.92       0.86    0.88
Bigram Count            0.82       0.68    0.72
Trigram Count           0.81       0.41    0.36
Uni/Bi/Trigrams Count   0.95       0.91    0.92
Unigrams TF-IDF         0.82       0.45    0.43
Bigrams TF-IDF          0.88       0.18    0.12
Trigrams TF-IDF         0.05       0.23    0.08
Regexes                 0.01       0.09    0.02
Oracle                  0.83       0.68    0.68

Fig. 7. Question Classification Multinomial Naive Bayes - Assignment Question

Features                Precision  Recall  F-Score
Unigram Count           0.81       0.80    0.81
Bigram Count            0.78       0.79    0.78
Trigram Count           0.69       0.72    0.70
Uni/Bi/Trigrams Count   0.85       0.84    0.84
Unigrams TF-IDF         0.82       0.82    0.79
Bigrams TF-IDF          0.72       0.73    0.68
Trigrams TF-IDF         0.67       0.71    0.80
Oracle                  0.86       0.85    0.83

Fig. 8. Question Classification Multinomial Naive Bayes - Conceptual Question

Features                Precision  Recall  F-Score
Unigram Count           0.92       0.92    0.92
Bigram Count            0.90       0.89    0.89
Trigram Count           0.85       0.87    0.86
Uni/Bi/Trigrams Count   0.92       0.88    0.90
Unigrams TF-IDF         0.87       0.93    0.90
Bigrams TF-IDF          0.87       0.93    0.90
Trigrams TF-IDF         0.86       0.93    0.89
Oracle                  0.90       0.95    0.92

Fig. 9. Question Classification Random Forest - Policy Question

Features                Precision  Recall  F-Score
Unigram Count           0.86       0.82    0.80
Bigram Count            0.88       0.86    0.84
Trigram Count           0.97       0.95    0.96
Uni/Bi/Trigrams Count   0.75       0.86    0.80
Unigrams TF-IDF         0.91       0.91    0.91
Bigrams TF-IDF          0.68       0.73    0.67
Trigrams TF-IDF         0.75       0.86    0.80
Trigrams TF-IDF         0.05       0.23    0.08
Regexes                 0.60       0.77    0.67
GloVe 50D               0.67       0.82    0.74
GloVe 100D              0.67       0.82    0.74
GloVe 300D              0.66       0.77    0.71
Oracle                  0.96       0.95    0.95

Fig. 10. Question Classification Random Forest - Assignment Question

Features                Precision  Recall  F-Score
Unigram Count           0.79       0.80    0.77
Bigram Count            0.77       0.78    0.77
Trigram Count           0.69       0.72    0.70
Uni/Bi/Trigrams Count   0.78       0.79    0.76
Unigrams TF-IDF         0.78       0.78    0.75
Bigrams TF-IDF          0.70       0.73    0.70
Trigrams TF-IDF         0.63       0.68    0.64
GloVe 50D               0.62       0.72    0.64
GloVe 100D              0.67       0.71    0.64
GloVe 300D              0.69       0.71    0.64
Oracle                  0.84       0.83    0.80

Fig. 11. Question Classification Random Forest - Conceptual Question

Features                Precision  Recall  F-Score
Unigram Count           0.95       0.96    0.96
Bigram Count            0.90       0.93    0.92
Trigram Count           0.89       0.91    0.90
Uni/Bi/Trigrams Count   0.92       0.94    0.92
Unigrams TF-IDF         0.89       0.92    0.88
Bigrams TF-IDF          0.90       0.92    0.89
Trigrams TF-IDF         0.89       0.92    0.90
GloVe 50D               0.91       0.93    0.91
GloVe 100D              0.96       0.99    0.97
GloVe 300D              0.92       0.94    0.93

4. QUESTION ANSWERING

Once the chatbot has determined that the question belongs to one or more categories, it then generates a response specific to the type of question. Here we describe the three different methodologies for answering "Policy", "Assignment", and "Conceptual" questions.

4.1 Policy Questions

To respond to a policy question, we extend the approach described in "Policy Question Classification". During the classification stage, a question is evaluated against a series of regular expressions and placed into one or more policy question subcategories. For any subcategory that a question is classified into, a pre-written answer is selected from a suite of responses. To facilitate fluid user interaction and eliminate the feeling that Percy simply gives "canned answers", the chatbot has multiple, similar responses for each of the categories of class-specific questions. If a question is asked multiple times, the chatbot is able to return variations of the answer so that it does not seem too repetitive. For example, if a question is placed under the "PRACTICE EXAMS" subcategory, we might select the following response: "Past year exams and solutions are listed on the CS 221 website."

Though this method is very straightforward, it works decently well for the fifteen topics the bot must address in regards to class-specific questions. If a question matches multiple subcategories, we provide a relevant answer from each matching subcategory.

To allow the system to be more helpful over the course of a quarter, we implemented a method that allows real TAs to input new information into the knowledge base when things change. For example, a TA may notify students of a time change in office hours. If a student then asks a question related to the new information, both the appropriate pre-seeded answer and the new piece of information will be returned. This new information is also categorized and returned based on regular expression matching.
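A condensed sketch of this regular-expression dispatch is shown below. The subcategory names, patterns, and the office-hours response are illustrative placeholders (only the "PRACTICE EXAMS" response is taken from the example above); the real bot covers roughly fifteen topics with several response variants each.

# A condensed sketch of the "Policy" answering step described above.
# Subcategories, patterns, and most responses are illustrative placeholders.
import random
import re

POLICY_SUBCATEGORIES = {
    "PRACTICE_EXAMS": (re.compile(r"practice (exam|final|midterm)s?", re.I),
                       ["Past year exams and solutions are listed on the CS 221 website."]),
    "OFFICE_HOURS": (re.compile(r"office hours?", re.I),
                     ["Office hour times and locations are listed on the course website."]),
}

def answer_policy(question):
    answers = []
    for pattern, responses in POLICY_SUBCATEGORIES.values():
        if pattern.search(question):
            # Pick one of several pre-written variants so repeated questions
            # do not always receive the identical canned reply.
            answers.append(random.choice(responses))
    return answers or ["I don't know."]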
4.2 Assignment Questions

"Assignment" questions proved a significant challenge, as these are often very detailed and specific. There is no way for a chatbot to generate an answer to questions about why a snippet of code is broken, so questions of this nature must be left to human teaching assistants. However, some questions pertain to the same common bug and are asked multiple times. Our bot's goal is to be able to respond to these repeated questions, leaving human TAs free to answer new questions.

We began by using cosine distance between tf-idf vectors computed across the training set and the test question to return an answer from our existing knowledge base. The algorithm reads in all the questions from Piazza and calculates a tf-idf matrix across the document set. When a new question is posed, the bot creates a vector consisting of each word's tf-idf score as computed during training. It then iterates through all the questions, finds the question that minimizes the cosine distance, and returns the answer of that most similar question. We tested our algorithm both maximizing the dot product of the tf-idf vectors and minimizing the cosine distance between the vectors, and found that both performed similarly, often returning the same answer.

Ultimately, the chatbot could not reliably find the best answer as the top answer; however, it was often among the top few closest answers. To address this problem, the chatbot in practice should return the top three questions produced by either algorithm. This greatly improves the overall helpfulness of the chatbot's answers, since the probability that the best answer is in the top three is much higher than the probability that it is the top answer.

We also curb our chatbot's impulse to answer questions when it does not have a similar question in its knowledge base and has no plausible clue as to the true answer. To accomplish this we implemented a maximum threshold for the cosine distance, such that if the distance between a new question and its closest Piazza question is above the threshold, the chatbot returns a response akin to "I don't know". This method fits with our goal of answering repetitive questions rather than trying to generate answers.
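A minimal sketch of this retrieval step, assuming scikit-learn, is shown below; the class name, variable names, and the numeric distance threshold are placeholders rather than the exact values used in our system.

# A minimal sketch of the "Assignment" retrieval step described above.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_distances

class AssignmentAnswerer:
    def __init__(self, piazza_questions, piazza_answers, max_distance=0.7):
        # Fit the tf-idf matrix over the scraped Piazza questions once.
        self.vectorizer = TfidfVectorizer()
        self.question_matrix = self.vectorizer.fit_transform(piazza_questions)
        self.answers = piazza_answers
        self.max_distance = max_distance

    def answer(self, new_question, k=3):
        # Vectorize the new question with the IDF weights learned from Piazza,
        # then rank the stored questions by cosine distance.
        vec = self.vectorizer.transform([new_question])
        distances = cosine_distances(vec, self.question_matrix)[0]
        ranked = distances.argsort()[:k]
        if distances[ranked[0]] > self.max_distance:
            # Nothing similar enough in the knowledge base.
            return ["I don't know."]
        # Return the answers of the top-k most similar past questions.
        return [self.answers[i] for i in ranked]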
4.3 Conceptual Questions

Conceptual questions are much more open-ended than class-specific questions, so we decided to use a much broader approach.

To respond to conceptual questions, we implemented information retrieval on unstructured documents as discussed in the paper by Yan, Zhao, et al. [4]. Their method includes retrieving multiple answers and then ranking those answers according to tf-idf, surrounding context, chapter, topic, etc. Thus, we approached this as an information retrieval task over a textbook - specifically Russell and Norvig's Artificial Intelligence: A Modern Approach.

Originally, our chatbot compared every line of the textbook with the question, but this ran slowly, taking up to 90 seconds, and sometimes returned unrelated material. Now we break the book up into sections, find the most relevant section, and then find the most relevant line within that section. In our chatbot's booting phase it reads the textbook in, splitting it by its section divisions. It then calculates a tf-idf matrix where each row represents a section of the book and the idf is based on the entire textbook. It also calculates a separate tf-idf matrix for each section, where each row of the matrix represents a line in that section and the idf is based on that section. Whenever a conceptual question is asked, the chatbot calculates the tf-idf of that question based on the idf of the textbook. It then calculates the cosine distance between the question and every section in the textbook. Once it has found the most similar section, it recalculates the question's tf-idf vector based on the idf of the most relevant section, and then finds the closest line in the section using cosine similarity. The chatbot returns that line and its surrounding paragraph, since a concept usually takes more than a single sentence to explain.

Similarly to the assignment questions, we found that our answers were improved by returning the three closest paragraphs, rather than just the best one.
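The two-stage retrieval can be sketched as follows, again assuming scikit-learn. For clarity, this sketch rebuilds the per-section matrix on every query, whereas the actual bot precomputes both levels of tf-idf matrices during its booting phase; the names and structure here are illustrative.

# A rough sketch of the two-stage retrieval over the textbook.
# `sections` is a list of section strings produced by splitting the converted
# textbook at its section divisions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def answer_conceptual(question, sections):
    # Stage 1: score whole sections, with IDF computed over the entire book.
    section_vectorizer = TfidfVectorizer()
    section_matrix = section_vectorizer.fit_transform(sections)
    question_vec = section_vectorizer.transform([question])
    best_section = sections[cosine_similarity(question_vec, section_matrix)[0].argmax()]

    # Stage 2: re-vectorize within the winning section, with IDF computed over
    # that section's lines only, and pick the closest line.
    lines = [line for line in best_section.splitlines() if line.strip()]
    line_vectorizer = TfidfVectorizer()
    line_matrix = line_vectorizer.fit_transform(lines)
    question_vec = line_vectorizer.transform([question])
    best_line = cosine_similarity(question_vec, line_matrix)[0].argmax()

    # Return the best line with a little surrounding context, since a concept
    # rarely fits in a single sentence.
    return " ".join(lines[max(0, best_line - 1): best_line + 2])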

4.4 Results

Evaluation of our results for the question classification stage was fairly straightforward, in that metrics for classification problems (e.g. precision, recall, F1-score) are well-established, and are recognized and understood by the broader research community. In contrast, evaluation of question answering is an inherently hard problem with no established evaluation metrics. Given a question, our methodology yields an answer. Our evaluation metric must then answer the question: "to what extent does this response constitute a 'good' answer to this question?". Setting aside the myriad interpretations of what might constitute a "good" answer (e.g. the balance between succinctness and detail), this evaluation metric requires us to develop a meaningful and nuanced interpretation of both the question and the answer and determine in what ways they might be related - an open problem in the artificial intelligence community. Accordingly, we were forced to pursue alternate strategies in the evaluation of our responses to questions.

Initially, we experimented with several automated evaluation metrics for string similarity, most prominently BLEU score and cosine similarity between TF-IDF vectors for question-answer pairs. However, as expected, these methods only allow for the matching of syntactically similar question-answer pairings, not necessarily those with similar intentions.

After these and other initial attempts at evaluation for our question-answering stage, we turned to human evaluation - a technique which, while not scalable, provided an accurate, intelligent evaluation metric. For human evaluation, we curated a set of fifteen question-answer pairings (five conceptual, five policy-based, and five assignment-based) which we felt were representative of the overall performance of the chatbot. Twenty-two students were asked to rate the quality of each answer as a response to the given question, on a scale from one to five. We present in Figure 12 the mean ratings for each question, as well as the standard deviation of the ratings for each question, to illustrate the presence or absence of contention over the quality of an answer (for brevity, we have excluded the actual question-answer pairings from the table; these pairings can be found in the appendix).

Fig. 12. Question Answering - Human Evaluation

Question  Category    Mean Rating  Standard Dev.
1         Policy      4.18         0.96
2         Policy      4.73         0.55
3         Policy      3.95         1.25
4         Policy      3.55         1.44
5         Policy      2.77         1.72
6         Assignment  4.05         1.00
7         Assignment  4.18         0.85
8         Assignment  2.50         1.22
9         Assignment  1.95         1.36
10        Assignment  3.73         1.24
11        Conceptual  2.00         1.27
12        Conceptual  1.32         0.89
13        Conceptual  2.59         1.10
14        Conceptual  2.14         1.25
15        Conceptual  2.27         1.28

5. ANALYSIS

5.1 Analysis: Question Classification

Our primary conclusion from the results of the question classification sub-problem is that this is inherently a data problem. With our limited dataset of slightly more than one thousand past question-answer pairings, coupled with standard classification algorithms (e.g. SVM, Naive Bayes, Random Forest), we observed moderately successful classification rates, with significant room for improvement.

Interestingly, unigram features proved to be the best delineator of question types, but this may be a result of the small dataset and high word-to-word variance across Piazza questions. Additionally, it was initially surprising that GloVe features failed to perform at least as well as unigrams. However, upon further consideration it makes sense that our chosen GloVe representations may not be suitable for this classification task. The Wikipedia/Gigaword-5 dataset that the GloVe vectors were trained upon may not be as applicable to our Piazza question classification dataset, since the questions deal with very assignment-specific terms. GloVe features may have proved more valuable in differentiating "Conceptual" questions from the rest if they had been trained on a corpus of artificial intelligence texts, as these questions feature a wide array of terminology that is both diverse and exclusive to these types of questions.

Furthermore, it is likely that the results of our question classification phase could be improved significantly via the introduction of more data from previous iterations of the course (this data was not included originally due to concerns over protecting student privacy under FERPA). Another potential improvement in this approach would be to incorporate data from other classes for question classification purposes only (and not for question answering purposes). It is possible that question categories in other classes are similar enough to the question categories for CS 221 that they could simply be folded into our existing dataset. We did obtain data for CS 124 (Introduction to Natural Language Processing) courtesy of Professor Dan Jurafsky, but ultimately were not able to incorporate it - an omission which paves the path for several obvious next steps.

5.2 Analysis: Question Answering

5.2.1 Analysis: Policy Questions. Policy-based questions were answered simply and effectively using a series of regular expression pattern matches to classify a question as belonging to one of several dozen different categories. Some might be quick to criticize this as an "unintelligent" approach, and indeed by any definition it certainly does not incorporate artificial intelligence. However, we, the authors, cite this particular design choice as an example of the prudent application of artificial intelligence techniques. The number of different topics which fall under the umbrella of "policy questions" is small enough to be completely enumerated by a small team of people familiar with the course, and is unlikely to grow proportionally with the size of the course. Additionally, these questions have strict, predefined answers, which may change with time but do not change rapidly. Thus, it seems most appropriate to use a small, predefined set of answers to field these questions, and in practice this approach proved extremely effective. We cite as evidence of this point our mean ratings for questions one through five in Figure 12. Question-answer pairs 1 and 2 were consistently rated highly, while question-answer pairs 3, 4, and 5 were rated highly in general, with high variance in responses. With further refinement of the techniques employed to answer policy questions, we postulate that these could be automatically answered in a completely satisfactory manner.

5.2.2 Analysis: Assignment Questions. Providing answers to questions focused on specific assignments, problem sets, and other course materials comprised the bulk of our research. Analysis of the ratings provided for questions six through ten in Figure 12 reveals that some assignment-based question-answer pairings were received quite well (notably questions six, seven, and ten), while others exhibited room for improvement, or high variance in responses, indicating disagreement among survey participants.

Our approach for question-answer pairing was inspired by that of Dror et al. [5], who pursued a similar approach when they attempted to use past question-answer pairings from Yahoo Answers to respond to new questions, with some success. However, Dror et al. had access to a dataset of approximately one million past question-answer pairings, while our team was working with a dataset of only approximately one thousand pairings. This proved to be one of our primary limitations in responding to assignment-based questions, but, luckily, it will also be one of the easiest obstacles to overcome in future iterations of this research. This research was inherently limited in scope by the fact that it was conducted over the course of two months for a class project, and given a larger project scope, it is likely that considerably more data could be acquired. With more data, the likelihood that any new question can be answered using a past question-answer pairing increases proportionally, which bolsters the strength of this approach.

One caveat to the strength of the current question-answering method for assignment-based questions is the fact that assignments for many classes will exhibit subtle changes year over year, shifting with instructor preferences and the unique and unexpected demands of any given academic quarter. This is particularly true of relatively cutting-edge classes such as CS 221, where advances in academia may be relevant to material taught in the class. The addition of a discretionary layer to our current question-answering model would allow human teaching assistants to evaluate any answer proposed by the chatbot for validity before releasing it to students. This would certainly prevent the chatbot from releasing any erroneous information, but the overhead of human error-checking could be almost as high as the overhead of human question-answering in the first place. In the ideal case, the chatbot will be able to release information without human approval, though such a release will require a high threshold of confidence, which we are unable to achieve for most questions as of this writing.

5.2.3 Analysis: Conceptual Questions. Conceptual questions proved much more difficult to answer, as is evidenced by our results for this category. We note that questions eleven through fifteen in Figure 12 received relatively low ratings, with low variance in responses. However, the goal of this research was to use artificial intelligence techniques to answer repetitive or unnecessary questions, to ensure that teaching assistants could spend more time helping students with difficult, conceptual topics. Currently, the chatbot's method for responding to conceptual questions consists of an intelligent information retrieval approach, designed to guide students to academic resources that may help them arrive at an answer on their own. However, students struggling with conceptual topics often seek the active dialogue and discussion of a topic that only a human teaching assistant can provide. Thus, the fact that we cannot yet answer difficult, conceptual questions is not so much a defeat as it is a recognition of the fact that some topics (for now) can and should be left to human teaching assistants.

6. CONCLUSIONS

Though certainly far from perfect, our chatbot produces viable results in some contexts, and provides a solid foundation for future research in this direction. Our original purpose was to handle repetitive questions on Piazza. Given that mission, the chatbot performs reasonably well. If a similar question has already been asked, the chatbot is generally able to retrieve it. If no similar question has been asked, the bot is able to recognize that and respond, "I don't know." It also performs well on policy questions, which is the other type of question that consumes the time of teaching assistants but does not necessarily require their skills. The limited number of categories makes it easy to determine exactly what a policy question is asking and return the relevant information.

The main issue we faced was responding to conceptual questions. When we began this project we recognized that this was an inherently hard problem, and it is one we believe still requires a human teaching assistant. The process of understanding a concept, understanding a student's question, and then synthesizing the appropriate information to directly respond to the student's question is one that even humans struggle with. We did not expect to be successful on this part of the project, and despite our attempts, our expectations of the difficulty of the problem were validated.

Ultimately, our chatbot strove to act as a question-answering vehicle, lacking a purely conversational mechanism. Given more time, it would be fascinating to develop a knowledge base of facts that the chatbot could draw on to generate answers. But much like conceptual questions, this is a much harder problem than we could take on given the limited time frame.

7. FUTURE WORK

Given that our research was conducted over the course of only two months, several areas stand out as candidates for the continuation and extension of this study.

First, gathering more data from past courses would allow us to improve our classification of questions simply by having more data to train on. More data would also improve our chatbot's ability to retrieve answers to similar questions by increasing the number of question and answer pairs we can match on. Having past data would also allow us to begin handling issues of weighting past quarters' answers against current answers, and ensuring that the chatbot does not return information that is out of date.

Additionally, though we began using GloVe vectors as features for question classification, they performed worse than simple unigram features. As a next step we are considering training GloVe vectors on materials that are reflective of the academic literature surrounding artificial intelligence, rather than just generic materials. This may better capture relationships between words in our specific context.

A possible improvement for assignment questions would be to utilize GloVe vector cosine similarity, rather than tf-idf cosine similarity, for finding questions that have already been asked. This may better account for differences in wording and more effectively compare the similarity of the key points of the questions.

To improve policy question answering, we would expand the chatbot's set of regular expressions. Though we believe that regular expressions are the best method for answering policy questions, the current set of regular expressions was designed by hand and may not generalize entirely. Thus, we need an unsupervised way of finding potentially relevant regular expressions. One candidate is the DIPRE method, first developed by Sergey Brin [6].

REFERENCES

[1] Maderer, Jason. Artificial Intelligence Course Creates AI Teaching Assistant. 9 May 2016.

[2] Mikolov, T., Sutskever, I., Chen, K., Corrado, G., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. Advances in Neural Information Processing Systems, 3111-3119.

[3] Pennington, J., Socher, R., & Manning, C. (2014). GloVe: Global Vectors for Word Representation. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). doi:10.3115/v1/d14-1162.

[4] Yan, Zhao, et al. "DocChat: An Information Retrieval Approach for Chatbot Engines Using Unstructured Documents."

[5] Dror, Gideon, et al. "Learning from the Past: Answering New Questions with Past Answers." Proceedings of WWW, 2012.

[6] Brin, Sergey. "Extracting Patterns and Relations from the World Wide Web." International Workshop on The World Wide Web and Databases. Springer Berlin Heidelberg, 1998.

8. APPENDIX

8.1 User Survey Questions

The questions provided in the user survey are as follows. Users were asked to rate the quality of each answer from one to five as a response to its respective question.

(1) How do sign up for an alternate exam?
Answer: Fill out the form listed on the website, if you want to need an accommodation to take an alternative exam, by Nov. 1

(2) Is the code from lecture posted?
Answer: Slides are posted on the website and lecture videos can be found at https://mvideox.stanford.edu/Course/787. Code can be found at http://web.stanford.edu/class/cs221/live/

(3) Are there practice solutions?
Answer: Practice exams can be found on the CS 221 website.

(4) Do I need an exam monitor?
Answer: SCPD students can the take exam either on campus or at a designated test center. If you choose to take the exam remotely, please schedule the exam for a 3-hour period between Monday, Nov. 28 at 9 a.m. and Tuesday, Nov. 29 at 9 p.m. (Pacific Time).

(5) Are we allowed to import from the python library for our assignments?
Answer: Try running your code on corn. If it works and passes grader.py you should be good

(6) For 4a pacman. I'm having issues with pacman refusing to eat the last piece of food. He will wander around right next to it but never eat it.
Answer 1: Maybe in your settings, the value when paceman is around the food is larger than the value after the pacman eat it. Maybe you put too much weight on the food, which is even larger than the end game bonus. It sounds like your evaluation function returns a larger value when it is sitting right next to the food capsule, than what it would get by the score increase by finishing the game.

(7) For problem 0b It seems like parts 2 and 3 will have the same number of calls, because AC-3 must still be called for every assignment of x the way it is currently written, but the question also asks us to explain why AC-3 reduces the number of calls. What am I missing?
Answer: AC-3 will (very likely) shrink the domain for the remaining unassigned variables. Thus when we consider the next variable, the number of times we call backtrack() is determined by how many values we can assign to it, which should be lower with AC-3.

(8) For blackjack, I'm confused on the information given to use. How do we know the values of the cards that correlate the the counts in the deck we are passed?
Answer: It's specific to this problem. Basically, you first learn some policy, then compare it to another policy. However, if you call getAction() to compare the policy, and your exploration prob is not 0, with probability epsilon you will get a random policy.

(9) I'm confused about how we are supposed to set up the feature keys. It says that each feature key is (state, action), but does this mean a single feature contains all the (state, action) pairs mapped to something or is each (state, action) pair a separate feature?
Answer: No, the next state should be (6, None, None) with 0 reward.

(10) What happens if the reward is not immediately apparent to Monte Carlo? For example, in chess or go, nothing really matter until the final move, how does reward account for that?
Answer: I think the feedback backpropagates for a game like chess where the the only (and big) reward comes from moving a state to the terminal state (checkmate). The Qopt and Vopt recurrence would feed into each other such that you can infer rewards all the way back up to the start of the game.

(11) What does Q-learning do in a terminal state?
Answer: As in (a), but with the terminal state at (5,5) The actions are deterministic moves in the four directions In each case, compare the results using three-dimensional plots For each environment, propose additional features (besides x and y) that would improve the approximation and show the results 21

(12) Why would people use policy iteration over value iteration?
Answer: A 10 10 world with a single +1 terminal state at (10,10) As in (a), but add a 1 terminal state at (10,1) As in (b), but add obstacles in 10 randomly selected squares As in (b), but place a wall stretching from (5,2) to (5,9) As in (a), but with the terminal state at (5,5)

(13) Beam Search w/ different K selection Are there any variations on pruning other than picking top k elements with highest weights? For example, using some kind of random distribution, or picking some elements in separate intervals. Would that generate better results?
Answer: Just one It begins with k randomly generated states At each step, all the successors of all k states are generated If any one is a goal, the algorithm halts Otherwise, it selects the k best successors from the complete list and repeats

(14) Motivation of SARSA I do not understand the motivation behind SARSA and other bootstrapping methods in the context of model-free learning. Why is it important to obtain feedback quickly if it is not used to modify the policy online?
Answer: For this reason, Q-learning is called a model-free method As with utilities, we can write a constraint equation that must hold at equilibrium when the Q-values are correct. As in the ADP learning agent, we can use this equation directly as an update equation for an iteration process that calculates exact Q-values, given an estimated model This does, however, require that a model also be learned, because the equation uses P(s' | s, a)

(15) Why do we care if hinge loss puts upper bound on 0-1 loss? Why is it an important feature of the hinge loss that it places an upper bound on the 0-1 loss? Couldn't we multiply the expression for the hinge loss by .1 arbitrarily (so it does not upper bound the 0-1 loss) and get the same results by using it?
Answer: Is it possible to find an upper bound on the value of C before we have looked at all its children? (Recall that this is what alpha-beta needs in order to prune a node and its subtree). At first sight, it might seem impossible because the value of C is the average of its children's values, and in order to compute the average of a set of numbers, we must look at all the numbers But if we put bounds on the possible values of the utility function, then we can arrive at bounds for the average without looking at every number For example, say that all utility values are between 2 and +2; then the value of leaf nodes is bounded, and in turn we can place an upper bound on the value of a chance node without looking at all its children An alternative is to do Monte Carlo simulation to evaluate a position