Cscubs Dialogues
Abstract
The constantly increasing amount of data on the Internet, together with the availability of
fast GPUs, has enabled rapid progress in AI research. Google DeepMind and self-driving cars
show how fast machine learning is flowing into industry. It helps humans do their jobs, and
chatbots are a good example: we use them to play music, tell us the weather, order a cab,
and much more. Chatbots are all about generating a response given the conversation so
far. In this work we train a dialogue response generation model using neural networks. It
is a sequence-to-sequence model that takes a dialogue as input and produces the next
response as output. The data used is the Ubuntu Dialogue Corpus, a large dataset
for research in unstructured multi-turn dialogue systems. Since the input sequences are
whole dialogues, the inputs are quite long, and in this case the information at the very
beginning of the sequence is often lost while training the model. That is one of the reasons
we use attention: it gives the decoder more direct access to the input and lets the
model itself decide which inputs to weigh more heavily when generating the output. We also
extend the Ubuntu Dialogue Corpus with information from the man pages in order to enrich
the input with technical information. We use only the short descriptions; in this way we do
not overload the input with too much additional information, but still add some background
knowledge.
1 Introduction
An explosion in the number of people having informal, public conversations on social media
websites such as Facebook and Twitter has presented a unique opportunity to build collections of
naturally occurring conversations that are orders of magnitude larger than those previously
available. These corpora, in turn, present new opportunities to apply data-driven techniques
to conversational tasks [13].
The task of response generation is to generate a response that fits the provided
stimulus, without explicit modelling of context, intent, or dialogue state. By not employing rules
or templates, there is hope of creating a system that is both flexible and extensible when
operating in an open domain [13]. Success in open domain response generation could be useful to
social media platforms, providing conversation-aware autocomplete for responses in progress
or a list of suggested responses to a target status.
Researchers have recently observed critical problems applying end-to-end neural network
architectures for dialogue response generation [15, 16, 8]. The neural networks have been
unable to generate meaningful responses taking dialogue context into account, which indicates
that the models have failed to learn useful high-level abstractions of the dialogue.
When we want a chatbot that learns from existing conversations between humans and
answers complex queries, we need intelligent models such as retrieval-based or generative
models. Retrieval-based models pick a response from a collection of candidate responses based
on the query. They do not generate new sentences, so their responses are always grammatically
correct. Generative models are much more intelligent: they generate a response word by word
based on the query. They are computationally expensive to train, require huge amounts of data,
and often make grammatical errors, since they learn sentence structure by themselves. But the
very important advantage of generative models over all other types of models is that
they are able to take the previous conversation into account and handle previously unseen queries.
In this work we train a sequence-to-sequence model on the Ubuntu Dialogue Corpus
[6, 9, 10]. We train a generative model, which generates responses from scratch rather than
picking a response from a predefined set as retrieval-based models do. To our
knowledge, there has been only one paper in which an attention mechanism was used on the Ubuntu
Dialogue Corpus [11]; our work was done independently and in parallel to that paper. We
also visualise the attention weights in order to see which words the model attends to when
generating the output. Training was stopped when the perplexity on the evaluation set,
smoothed with a moving average over the last steps, stopped decreasing. We also extend the input
data with user manuals in order to enrich it with technical information. Extending the input
has already been done for other tasks, such as classification or response selection; we enlarge
the dataset for the response generation task.
2 Related Work
The Ubuntu Dialogue Corpus is the largest publicly available multi-turn dialogue corpus and
is used for the tasks of response selection and generation [9]. Lowe et al. considered the task of
selecting the best next response using TF-IDF, recurrent neural networks (RNNs), and long short-
term memory (LSTM) networks. For the next-utterance ranking task on the Ubuntu Dialogue Corpus,
Kadlec et al. evaluated the performance of LSTMs, Bi-LSTMs, and CNNs on the dataset and created an
ensemble by averaging the predictions of multiple models [6]. Their best classifier was an ensemble
of 11 LSTMs, 7 Bi-LSTMs, and 10 CNNs trained with different meta-parameters.
A more interesting task than response selection is response generation. Generative
models produce system responses that are autonomously generated word by word, which opens
up the possibility of flexible and realistic interactions. Generative models have been used for building
open domain, conversational dialogue systems based on large dialogue corpora [16]. Serban et al.
introduced the multiresolution recurrent neural network (MrRNN) for generatively
modeling sequential data at multiple levels of abstraction [15]. MrRNN was applied to dialogue
response generation on two different tasks, Ubuntu technical support and Twitter conversations,
and was evaluated in a human evaluation study and via automatic evaluation metrics.
To our knowledge, only one paper with an attention model on the Ubuntu Dialogue Corpus
has been published recently ("Coherent Dialogue with Attention-based Language Models") [11].
Mei et al. model coherent conversation continuation via RNN-based dialogue models and
investigate how to improve the performance of a recurrent neural network dialogue model via
an attention mechanism. They evaluate the model on two dialogue datasets, the open domain
MovieTriples dataset and the closed domain Ubuntu Troubleshoot dataset. They also
show that a vanilla RNN with dynamic attention outperforms more complex memory models
(e.g., LSTM and GRU) by allowing for flexible, long-distance memory. The paper "Coherent
Dialogue with Attention-based Language Models" was written independently and in parallel to
our work.
Figure 1 shows a simple representation of an RNN with inputs $x_i$ and outputs $h_i$.
RNNs have loops in them, allowing information to persist. An RNN can be thought of as multiple
copies of the same network, each passing a message to a successor; because of this chain-like
nature, recurrent neural networks are intimately related to sequences and lists.
A sequence-to-sequence architecture consists of two RNNs:
• an encoder (or reader, or input RNN) processes the input sequence. The encoder emits the
context $C$, usually as a simple function of its final hidden state;
• a decoder (or writer, or output RNN) is conditioned on the context $C$ to generate the
output sequence.
In a sequence-to-sequence architecture, the two RNNs are trained jointly to maximize the
average of $\log P(y^{(1)}, \ldots, y^{(n_y)} \mid x^{(1)}, \ldots, x^{(n_x)})$ over all pairs of
$x$ and $y$ sequences in the training set. The last state $h_{n_x}$ of the encoder RNN is
typically used as the representation $C$ of the input sequence that is provided as input to the
decoder RNN.
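The paper does not include an implementation, so the following is a minimal sketch of this encoder-decoder setup in PyTorch; the framework, the layer sizes, and names such as `Seq2Seq` are our own illustrative assumptions, not the authors' code:

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Minimal encoder-decoder sketch; hyperparameters are illustrative only."""
    def __init__(self, vocab_size, emb_dim=256, hidden=1024):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.GRU(emb_dim, hidden, batch_first=True)
        self.decoder = nn.GRU(emb_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, src, tgt):
        # The encoder reads the whole dialogue; its final hidden state
        # serves as the fixed-length context C.
        _, C = self.encoder(self.emb(src))
        # The decoder is initialised with C and predicts the response
        # one token at a time (teacher forcing on the target tgt).
        dec_out, _ = self.decoder(self.emb(tgt), C)
        return self.out(dec_out)  # logits over the vocabulary
```

Training maximizes the log-likelihood above, i.e. it minimizes the cross-entropy between these logits and the shifted target tokens.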
Figure 3: Graphical illustration of the model generating the $t$-th target word $y_t$ given a
source sentence $(x_1, x_2, \ldots, x_T)$
The attention mechanism is illustrated in Figure 4 [1]. The source sentence is
in English and the generated translation is in French. Each pixel shows the weight $\alpha_{ij}$ of
the annotation of the $j$-th source word for the $i$-th target word, in grayscale (0: black, 1:
white). "La destruction" was translated from "Destruction" and "la Syrie" from "Syria", so
the corresponding weights are non-black.
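In Bahdanau-style attention [1], the weight $\alpha_{ij}$ is a softmax over alignment scores between the previous decoder state and each encoder annotation. A minimal sketch of one attention step follows; the parameter names `W1`, `W2`, `v` are our own, and the exact scoring network in our model may differ:

```python
import torch
import torch.nn.functional as F

def attention(dec_state, enc_outputs, W1, W2, v):
    # enc_outputs: (src_len, hidden) annotations h_j of the source tokens
    # dec_state:   (hidden,) previous decoder state s_{i-1}
    # score_j = v^T tanh(W1 h_j + W2 s_{i-1})
    scores = torch.tanh(enc_outputs @ W1 + dec_state @ W2) @ v  # (src_len,)
    alpha = F.softmax(scores, dim=0)   # attention weights alpha_ij
    context = alpha @ enc_outputs      # weighted sum of the annotations
    return context, alpha
```

The decoder then consumes the context vector together with the previously generated token, so each output step can attend to different parts of the input dialogue.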
4 Experimental Results
Due to the very long input sequences and the very large dataset, the maximum batch size we could
fit into a TITAN X with 12 GB for the model with 2 layers of 1024 units each was 30. This is very
small compared to the 0.5 million data instances, which made the training process of such a large
model very slow.
Responses generated by the model were always grammatically correct and varied, whether suitable
as a response or not, but very generic. That is why we did not evaluate them using metrics
such as BLEU or METEOR, but instead used perplexity, visualized the attention, and inspected
the generated responses.
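Perplexity is the exponential of the average negative log-likelihood the model assigns to the target tokens. A small helper illustrating the computation (the function name and interface are ours):

```python
import math

def perplexity(token_log_probs):
    """token_log_probs: log-probability the model assigned to each target token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))
```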
During training, statistics were printed and the model was saved as a checkpoint every 50
steps; in each step the model was trained on one batch. We also calculated a moving average
over the last 3000 steps, since the perplexities at individual checkpoints varied a lot and
were hard to follow.
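The smoothing and the stopping rule can be sketched as follows; this is a minimal illustration assuming perplexity is logged every 50 steps (so 3000 steps cover the last 60 logged values), and the patience value is hypothetical:

```python
from collections import deque

WINDOW = 3000 // 50  # checkpoints covered by a 3000-step moving average

def smooth(perplexities, window=WINDOW):
    """Moving average of the logged evaluation perplexities."""
    buf, out = deque(maxlen=window), []
    for p in perplexities:
        buf.append(p)
        out.append(sum(buf) / len(buf))
    return out

def should_stop(smoothed, patience=10):
    # Stop once the smoothed eval perplexity has not improved for
    # `patience` consecutive checkpoints (patience value is hypothetical).
    best = min(smoothed)
    return smoothed.index(best) <= len(smoothed) - 1 - patience
```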
We stopped the training process when the perplexity on the evaluation set stopped decreasing.
Figure 5 shows an example of how the perplexity on the evaluation set decreases over
time; the X axis shows the number of steps and the Y axis the perplexity. This model
was trained on a TITAN X with 12 GB for 10 days. The training takes that long due to the long
input sequences, the large number of dialogues, and the use of attention, which makes the model
bigger and more difficult to train. This model had 2 layers with 1024 units each.
An interesting observation in Figure 5 is that the perplexity for the smallest bucket is very
close to the perplexity on the training set. The perplexities for bigger buckets are larger: the
smallest perplexity is reached for the smallest bucket and the largest perplexity for
the biggest bucket (the bigger the bucket, the larger the perplexity). This shows that it is more
difficult for the model to memorize longer sequences than shorter ones, as expected.
Figure 5: Decrease of perplexity while training the model
The responses generated by the models were very generic. The reasons may be that
beam search was not used and that "I don't know" and "I am not sure" are indeed
the most common answers in the Ubuntu Dialogue Corpus. We show some examples of the
generated responses (* indicates a response generated by the model and + the real answer
given by the user).
Example
Figure 6: Top 10 most probable tokens for each word in the generated response
The first items of each row in Figure 6 form the output "I don't know, sorry" for the dialogue
above, since it is one of the safest answers.
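A table like Figure 6 can be produced by taking the top-k entries of the softmax distribution at each decoding step; a small sketch (the function name and interface are ours):

```python
import torch

def top_k_tokens(logits, id2word, k=10):
    """Return the k most probable next tokens with their probabilities."""
    probs = torch.softmax(logits, dim=-1)
    topv, topi = torch.topk(probs, k)
    return [(id2word[i.item()], v.item()) for v, i in zip(topv, topi)]
```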
An interesting way to see how the model uses attention is to visualise it. Figure 7 shows the
visualised attention weights (the darker the square, the bigger the weight). The X axis
represents the input dialogue and the Y axis the output response.
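Such a plot can be drawn directly from the matrix of attention weights; a minimal matplotlib sketch, assuming `alpha` is the output-by-input weight matrix collected during decoding:

```python
import matplotlib.pyplot as plt

def plot_attention(alpha, src_tokens, out_tokens):
    # alpha: (len(out_tokens), len(src_tokens)) attention weight matrix
    fig, ax = plt.subplots()
    ax.imshow(alpha, cmap="Greys", aspect="auto")  # darker = bigger weight
    ax.set_xticks(range(len(src_tokens)))
    ax.set_xticklabels(src_tokens, rotation=90)
    ax.set_yticks(range(len(out_tokens)))
    ax.set_yticklabels(out_tokens)
    plt.show()
```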
In Figure 7 we can see that the model was paying more attention to the words "pulse"
and "sound" when generating the output response for the dialogue below. The generated
response is A*.
Example
A: Any idea why empathy’s not playing notification sounds? Even though I have ’em ticked
in preferences
B: restarted it yet?
A: yar
B: check pulse to see if the application is muted for some reason?
B: well Sound settings.
A: Had sound effects turned off in sound settings, didn’t realize that controlled other
applications
B: Ah yea, ive done it a few time it’s annoying
B: My favorite though is recently pulse has been freezing on my desktop and audio will just
not be adjustable for like... 30 seconds or so
A*: I’m not sure, I’m not sure if it’s a problem with the sound card.
In conclusion, we can say that the model tends to give grammatically correct and varied,
but generic responses. We believe that adding beam search would make it possible to see
other candidate responses as well and to choose among them.
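Beam search keeps the `beam_width` highest-scoring partial responses at each step instead of only the single most probable token. A minimal sketch of the idea; the `step_fn` interface is our own assumption about how the decoder would be queried:

```python
import heapq

def beam_search(step_fn, start_token, end_token, beam_width=5, max_len=30):
    # step_fn(prefix) -> list of (token, log_prob) continuations
    beams = [(0.0, [start_token])]
    finished = []
    for _ in range(max_len):
        candidates = []
        for score, prefix in beams:
            for tok, logp in step_fn(prefix):
                candidates.append((score + logp, prefix + [tok]))
        beams = []
        for cand in heapq.nlargest(beam_width, candidates, key=lambda c: c[0]):
            # Hypotheses that produced the end token are set aside as complete.
            (finished if cand[1][-1] == end_token else beams).append(cand)
        if not beams:
            break
    return heapq.nlargest(beam_width, finished + beams, key=lambda c: c[0])
```

Returning several high-scoring hypotheses would let us inspect alternatives to the generic "I don't know"-style responses.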
Consider the next example. At the end, user B gives instructions on what to do. The
model generated the response "thanks, I'll try that", which is suitable in this case. In any
case, there is an enormous number of different possible responses, and every human would answer
differently.
Example
A: how do I create a folder from the command line?
A: ooooookay, how do I restart a process? kill and start? or is it easier? :P
B: depends on the process... who is it owned by? is it a system service?
A: nautilus
A: it seems to randomly lock up, so I was going to assign a keyboard shortcut to restart it,
only to find out I don’t know how to restart a process, or if it’s even possible...
B: Use the ”kill” command to send processes signals telling them to exit. You need the
process id to use ”kill” e.g. kill -TERM 1234. You can also use ”killall some-process-name”
e.g. killall nautilus
A*: thanks, I’ll try that
A+: so there’s no restart?
References
[1] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly
learning to align and translate. CoRR, abs/1409.0473, 2014.
[2] Felix A. Gers, Jürgen Schmidhuber, and Fred Cummins. Learning to forget: Continual prediction
with LSTM. Neural Computation, 12(10):2451–2471, 2000.
[3] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.
http://www.deeplearningbook.org.
[4] Alex Graves, Abdel-rahman Mohamed, and Geoffrey E. Hinton. Speech recognition with deep
recurrent neural networks. CoRR, abs/1303.5778, 2013.
[5] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Comput., 9(8):1735–
1780, November 1997.
[6] Rudolf Kadlec, Martin Schmid, and Jan Kleindienst. Improved deep learning baselines for Ubuntu
corpus dialogs. CoRR, abs/1510.03753, 2015.
[7] Nal Kalchbrenner and Phil Blunsom. Recurrent continuous translation models. In EMNLP,
volume 3, page 413, 2013.
[8] Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. A diversity-promoting
objective function for neural conversation models. CoRR, abs/1510.03055, 2015.
[9] Ryan Lowe, Nissan Pow, Iulian Serban, and Joelle Pineau. The Ubuntu Dialogue Corpus: A large
dataset for research in unstructured multi-turn dialogue systems. CoRR, abs/1506.08909, 2015.
[10] Ryan Lowe, Nissan Pow, Iulian V. Serban, Laurent Charlin, Chia-Wei Liu, and Joelle Pineau.
Training end-to-end dialogue systems with the Ubuntu Dialogue Corpus. Dialogue & Discourse,
8(1):31–65, 2017.
[11] Hongyuan Mei, Mohit Bansal, and Matthew R. Walter. Coherent dialogue with attention-based
language models. CoRR, abs/1611.06997, 2016.
[12] Tomas Mikolov and Geoffrey Zweig. Context dependent recurrent neural network language model.
SLT, 12:234–239, 2012.
[13] Alan Ritter, Colin Cherry, and William B. Dolan. Data-driven response generation in social media.
In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP
’11, pages 583–593, Stroudsburg, PA, USA, 2011. Association for Computational Linguistics.
[14] Hasim Sak, Andrew W. Senior, and Françoise Beaufays. Long short-term memory based recurrent
neural network architectures for large vocabulary speech recognition. CoRR, abs/1402.1128, 2014.
[15] Iulian Vlad Serban, Tim Klinger, Gerald Tesauro, Kartik Talamadupula, Bowen Zhou, Yoshua
Bengio, and Aaron C. Courville. Multiresolution recurrent neural networks: An application to
dialogue response generation. CoRR, abs/1606.00776, 2016.
[16] Iulian Vlad Serban, Alessandro Sordoni, Yoshua Bengio, Aaron C. Courville, and Joelle Pineau.
Hierarchical neural network generative models for movie dialogues. CoRR, abs/1507.04808, 2015.
[17] Wojciech Zaremba, Ilya Sutskever, and Oriol Vinyals. Recurrent neural network regularization.
CoRR, abs/1409.2329, 2014.