Hierarchical Transformers For Task Oriented Dialog Models
[Figure: (a) Model: HIER. Each utterance U1, S1, U2, ..., Ut (and the DB result) is passed through a Shared Encoder; the Context Encoder of the HT-Encoder combines the utterance representations into a dialogue representation, which the Decoder attends to while generating the response tokens y1, y2, y3, ... (e.g., "SOS Can I ..."). The Embedding Layer + Pos. Enc. combines word embeddings (w.e1, w.e2, w.e3), positional encodings (p1, p2, p3), and additional encodings (ea) into the input embeddings (e1, e2, e3).]

... encodings are applied, as explained in Section 2.2. A standard transformer decoder follows the HT-Encoder for generating the response.

The CT-Mask for HIER was obtained experimentally after trying a few other variants. The intuition behind this mask is that the model should reply to the last user utterance in the context. Hence, we design the attention mask to apply cross-attention between all the utterances and the last utterance.
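For illustration only, the sketch below builds one plausible boolean mask with this structure: every token may attend within its own utterance, and all utterances additionally attend to the tokens of the last utterance. The function name build_ct_mask and the exact block layout are assumptions; the precise CT-Mask definition is given earlier in the paper.

```python
import torch

def build_ct_mask(utt_lengths):
    """Illustrative sketch (not the paper's exact CT-Mask): tokens may attend
    within their own utterance and, additionally, to every token of the last
    utterance in the context. Returns a boolean mask of shape (L, L) where
    True marks an allowed attention link; L = sum of utterance lengths."""
    total = sum(utt_lengths)
    mask = torch.zeros(total, total, dtype=torch.bool)
    # Token spans (start, end) of each utterance in the concatenated sequence.
    spans, start = [], 0
    for n in utt_lengths:
        spans.append((start, start + n))
        start += n
    last_s, last_e = spans[-1]
    for s, e in spans:
        mask[s:e, s:e] = True            # attention within the same utterance
        mask[s:e, last_s:last_e] = True  # every utterance attends to the last one
    return mask

# Example: a context of three utterances with 4, 3 and 5 tokens.
print(build_ct_mask([4, 3, 5]).int())
```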
Table 2: Simplest baselines, in the absence of both belief-state and policy / dialog-act annotations.
Table 3: Context-to-response generation with oracle policy. On the superior performance of DAMD: DAMD always receives the oracle Bt−1 annotation as an extra input when predicting Bt or the response Rt, which helps with the NLU of subsequent utterances. This input is not available to any of the other models.
Table 4: Context-to-response generation: for this experiment, only belief states are given. "GPT-2, DC" denotes a second pre-training phase on an extra dialog corpus (DC), starting from the GPT-2 model parameters.
... Marco with an HT-Encoder and rerun the context-to-response generation experiment. Introducing the HT-Encoder into Marco helps improve the inform (minor), success, and combined score metrics. The results of this experiment show that the HT-Encoder is suitable for any model architecture.

Overall, our experiments show how useful the proposed HT-Encoder module can be for dialog systems built upon the transformer encoder-decoder architecture. It is also applicable to tasks where the input sequence can be split into an abstract set of sub-units (e.g., search history in Sordoni's application). We believe that our proposed approach for hierarchical encoding in transformers, together with the algorithm for converting the standard transformer encoder, makes it an invaluable but accessible resource for future researchers working on dialog systems or similar problem statements with transformer-based architectures.
Models        Pretraining    Annotations                  Evaluation Metrics
                             Belief   DB       Policy     BLEU    Entity-F1   Inform   Success   Score
DAMD          -              Gen*     Oracle   Gen        16.60   -           76.40    60.40     85.00
SimpleTOD     GPT-2          Gen      -        Gen        15.01   -           84.4     70.1      92.26
SOLOIST       GPT-2, DC      Gen      Gen      -          16.54   -           85.50    72.90     95.74
HIER-Joint    -              Gen      -        Gen        19.74   53.94       80.5     71.7      95.84

Table 5: End-to-End: the belief state is predicted by the model itself. *Even in the end-to-end setting, DAMD needs the oracle Bt−1 to predict the current belief Bt.
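The Score column in these tables is consistent with the standard MultiWOZ combined score, Score = BLEU + 0.5 × (Inform + Success); the small check below verifies this against the Table 5 rows.

```python
# Sanity check: combined score = BLEU + 0.5 * (Inform + Success) (standard MultiWOZ metric).
rows = {
    "DAMD":       (16.60, 76.40, 60.40, 85.00),
    "SimpleTOD":  (15.01, 84.4, 70.1, 92.26),
    "SOLOIST":    (16.54, 85.50, 72.90, 95.74),
    "HIER-Joint": (19.74, 80.5, 71.7, 95.84),
}
for name, (bleu, inform, success, score) in rows.items():
    assert abs(bleu + 0.5 * (inform + success) - score) < 0.01, name
```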
Table 6: Comparison between the vanilla Marco model and Marco with the proposed HT-Encoder. Bold-faced results denote a statistically significant improvement with p < 0.05. We did not observe any significant improvement in the act-prediction F1-score or in BLEU for response generation. The numbers in the table are means of 10 different runs of each algorithm.
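The excerpt does not name the significance test behind Table 6; as one possible (assumed) protocol, the per-run scores of the two systems could be compared with Welch's t-test, as sketched below.

```python
from typing import Sequence
from scipy import stats

def significant_improvement(runs_a: Sequence[float], runs_b: Sequence[float],
                            alpha: float = 0.05) -> bool:
    """Return True if system B's per-run scores are significantly higher than
    system A's at level alpha, using Welch's t-test (one possible choice; the
    paper excerpt does not specify which test was actually used)."""
    t_stat, p_value = stats.ttest_ind(runs_b, runs_a, equal_var=False)
    return p_value < alpha and t_stat > 0
```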
5 Related Works

Task Oriented Dialog Systems  Researchers identify four different subtasks for any task-oriented dialog system (Wen et al., 2017): natural language understanding (NLU), dialog state tracking (DST), dialog act or policy generation, and natural language generation (NLG). Before the advent of large-scale Seq2Seq models, researchers focused on building feature-rich models with rule-based pipelines for both natural language understanding and generation. This usually required separate utterance-level and dialog-level NLU feature extraction modules. These NLU features decide the next dialog act that the system should follow, and this act is then converted into a natural language response by the NLG module. Young et al. (2013) modeled this problem as a Markov Decision Process whose state comprises various utterance and dialog features detected by an NLU module. However, such models had the usual drawback of pipelined approaches: error propagation. Wen et al. (2017) proposed using neural networks to extract features like intent, belief states, etc., and to train the NLU and NLG modules end-to-end with a single loss function. Marco (Wang et al., 2020) and HDSA (Chen et al., 2019) used a fine-tuned BERT model as their act predictor, as it often outperforms other ways of training the dialog policy network (even joint learning). HDSA is a transformer Seq2Seq model with act-controllable self-attention heads (in the decoder) that disentangle the individual tasks and domains within the network. Marco uses soft attention over the act sequence during the response generation process.

Hierarchical Encoders  The concept of hierarchical encoders has been used in many different contexts in the past. It is best known in the area of dialog response generation through the HRED model, and many open-domain dialog systems have used HRED's hierarchical recurrent encoding scheme for various tasks and architectures. The hierarchical encoder was first proposed by Sordoni et al. (2015a) for use in a query suggestion system. They used it to encode the user history, comprising multiple queries, with a hierarchical LSTM network. Serban et al. (2016) extended this work to open-domain dialog generation and proposed the HRED network. HRED captures the high-level features of the conversation in a context RNN. Several models have adopted this approach later on,
e.g., VHRED (Serban et al., 2017), CVAE (Zhao et al., 2017), DialogWAE (Gu et al., 2018), etc.

Another area in which researchers have proposed the use of hierarchical encoders is the processing of paragraphs or long documents. Li et al. (2015) used a hierarchical LSTM network to train an autoencoder that can encode and decode long paragraphs and documents. Zhang et al. (2019) proposed HIBERT, which introduces hierarchy into the BERT architecture to remove the limitation on input sequence length. HIBERT takes a single vector for each sentence or document segment (usually the contextual embedding of the CLS or EOS token) from the sentence encoder and passes it on to the higher-level transformer encoder. Liu and Lapata (2019) apply a similar approach for encoding documents in a multi-document summarization task.
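As a rough sketch of this segment-pooling idea (not HIBERT's actual implementation), one could take the contextual embedding at a designated position, e.g., a CLS/EOS token, from each segment's encoder output and feed the resulting sequence of segment vectors to a higher-level encoder:

```python
import torch
import torch.nn as nn

class SegmentPoolingEncoder(nn.Module):
    """Sketch of a HIBERT-style hierarchy: a sentence-level encoder produces one
    vector per segment (here, the embedding at a designated CLS-like position),
    and a document-level encoder contextualizes those segment vectors."""

    def __init__(self, d_model=256, nhead=4, num_layers=2, pool_index=0):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.sentence_encoder = nn.TransformerEncoder(layer, num_layers)
        self.document_encoder = nn.TransformerEncoder(layer, num_layers)
        self.pool_index = pool_index  # position of the CLS/EOS token in each segment

    def forward(self, segments):
        # segments: list of (batch, seg_len_i, d_model) tensors, one per sentence/segment.
        pooled = [self.sentence_encoder(s)[:, self.pool_index] for s in segments]
        doc_input = torch.stack(pooled, dim=1)  # (batch, num_segments, d_model)
        return self.document_encoder(doc_input)
```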
6 Conclusion

This paper explored the use of hierarchy in transformer-based models for task-oriented dialog systems. We started by proposing a generalized framework for Hierarchical Transformer Encoders (HT-Encoders). Using that, we implemented two models: a new model called HIER, and HIER-CLS, obtained by adapting the existing HIBERT architecture into our framework. We thoroughly experimented with these models on four different response generation tasks of the MultiWOZ dataset. We compared the proposed models with an exhaustive set of recent state-of-the-art models to analyze the effectiveness of HT-Encoders. We empirically show that the basic transformer seq2seq architecture, when equipped with an HT-Encoder, outperforms many of the state-of-the-art models in each experiment. We further prove its usefulness by applying it to an existing model, Marco. This work opens up a new direction for hierarchical transformers in dialogue systems, where complex dependencies exist between the utterances. It would also be beneficial to explore the effectiveness of the proposed HT-Encoder when applied to various other tasks.

References

Takuya Akiba, Shotaro Sano, Toshihiko Yanase, Takeru Ohta, and Masanori Koyama. 2019. Optuna: A next-generation hyperparameter optimization framework. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD 2019, Anchorage, AK, USA, August 4-8, 2019, pages 2623–2631. ACM.

Paweł Budzianowski and Ivan Vulić. 2019. Hello, it's GPT-2 - how can I help you? Towards the use of pre-trained language models for task-oriented dialogue systems. In Proceedings of the 3rd Workshop on Neural Generation and Translation, pages 15–22, Hong Kong. Association for Computational Linguistics.

Paweł Budzianowski, Tsung-Hsien Wen, Bo-Hsiang Tseng, Iñigo Casanueva, Stefan Ultes, Osman Ramadan, and Milica Gašić. 2018. MultiWOZ - a large-scale multi-domain Wizard-of-Oz dataset for task-oriented dialogue modelling. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 5016–5026, Brussels, Belgium. Association for Computational Linguistics.

Wenhu Chen, Jianshu Chen, Pengda Qin, Xifeng Yan, and William Yang Wang. 2019. Semantically conditioned dialog response generation via hierarchical disentangled self-attention. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3696–3709, Florence, Italy. Association for Computational Linguistics.

Yue Gu, Kangning Yang, Shiyu Fu, Shuhong Chen, Xinyu Li, and Ivan Marsic. 2018. Multimodal affective analysis using hierarchical attention strategy with word-level alignment. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2225–2235, Melbourne, Australia. Association for Computational Linguistics.

Ehsan Hosseini-Asl, Bryan McCann, Chien-Sheng Wu, Semih Yavuz, and Richard Socher. 2020. A simple language model for task-oriented dialogue. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual.

Dan Jurafsky. 2000. Speech & language processing. Pearson Education India.

Jiwei Li, Thang Luong, and Dan Jurafsky. 2015. A hierarchical neural autoencoder for paragraphs and documents. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1106–1115, Beijing, China. Association for Computational Linguistics.

Yang Liu and Mirella Lapata. 2019. Hierarchical transformers for multi-document summarization. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5070–5081, Florence, Italy. Association for Computational Linguistics.
Jiahuan Pei, Pengjie Ren, and Maarten de Rijke. 2019. A modular task-oriented dialogue system using a neural mixture-of-experts. arXiv preprint arXiv:1907.05346.

Baolin Peng, Chunyuan Li, Jinchao Li, Shahin Shayandeh, Lars Liden, and Jianfeng Gao. 2020a. Soloist: Few-shot task-oriented dialog with a single pre-trained auto-regressive model. arXiv preprint arXiv:2005.05298.

Shuke Peng, Xinjing Huang, Zehao Lin, Feng Ji, Haiqing Chen, and Yin Zhang. 2019. Teacher-student framework enhanced multi-domain dialogue generation. arXiv preprint arXiv:1908.07137.

Shuke Peng, Feng Ji, Zehao Lin, Shaobo Cui, Haiqing Chen, and Yin Zhang. 2020b. MTSS: Learn from multiple domain teachers and become a multi-domain dialogue expert. arXiv preprint arXiv:2005.10450.

Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners.

Iulian Vlad Serban, Alessandro Sordoni, Yoshua Bengio, Aaron C. Courville, and Joelle Pineau. 2016. Building end-to-end dialogue systems using generative hierarchical neural network models. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, February 12-17, 2016, Phoenix, Arizona, USA, pages 3776–3784. AAAI Press.

Iulian Vlad Serban, Alessandro Sordoni, Ryan Lowe, Laurent Charlin, Joelle Pineau, Aaron C. Courville, and Yoshua Bengio. 2017. A hierarchical latent variable encoder-decoder model for generating dialogues. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, February 4-9, 2017, San Francisco, California, USA, pages 3295–3301. AAAI Press.

Alessandro Sordoni, Yoshua Bengio, Hossein Vahabi, Christina Lioma, Jakob Grue Simonsen, and Jian-Yun Nie. 2015a. A hierarchical recurrent encoder-decoder for generative context-aware query suggestion. In Proceedings of the 24th ACM International Conference on Information and Knowledge Management, CIKM 2015, Melbourne, VIC, Australia, October 19-23, 2015, pages 553–562. ACM.

Alessandro Sordoni, Michel Galley, Michael Auli, Chris Brockett, Yangfeng Ji, Margaret Mitchell, Jian-Yun Nie, Jianfeng Gao, and Bill Dolan. 2015b. A neural network approach to context-sensitive generation of conversational responses. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 196–205, Denver, Colorado. Association for Computational Linguistics.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pages 5998–6008.

Kai Wang, Junfeng Tian, Rui Wang, Xiaojun Quan, and Jianxing Yu. 2020. Multi-domain dialogue acts and response co-generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7125–7134, Online. Association for Computational Linguistics.

Tsung-Hsien Wen, David Vandyke, Nikola Mrkšić, Milica Gašić, Lina M. Rojas-Barahona, Pei-Hao Su, Stefan Ultes, and Steve Young. 2017. A network-based end-to-end trainable task-oriented dialogue system. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 438–449, Valencia, Spain. Association for Computational Linguistics.

Steve Young, Milica Gašić, Blaise Thomson, and Jason D. Williams. 2013. POMDP-based statistical spoken dialog systems: A review. Proceedings of the IEEE, 101(5):1160–1179.

Xingxing Zhang, Furu Wei, and Ming Zhou. 2019. HIBERT: Document level pre-training of hierarchical bidirectional transformers for document summarization. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5059–5069, Florence, Italy. Association for Computational Linguistics.

Yichi Zhang, Zhijian Ou, and Zhou Yu. 2020a. Task-oriented dialog systems that consider multiple appropriate responses under the same context. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 9604–9611.

Yizhe Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen, Chris Brockett, Xiang Gao, Jianfeng Gao, Jingjing Liu, and Bill Dolan. 2020b. DIALOGPT: Large-scale generative pre-training for conversational response generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 270–278, Online. Association for Computational Linguistics.

Tiancheng Zhao, Ran Zhao, and Maxine Eskenazi. 2017. Learning discourse-level diversity for neural dialog models using conditional variational autoencoders. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 654–664, Vancouver, Canada. Association for Computational Linguistics.