0% found this document useful (0 votes)
15 views

Hierarchical Transformers For Task Oriented Dialog Models

NAACL'21 paper

Uploaded by

Pawan Goyal
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
15 views

Hierarchical Transformers For Task Oriented Dialog Models

NAACL'21 paper

Uploaded by

Pawan Goyal
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 10

Hierarchical Transformer for Task Oriented Dialog Systems

Bishal Santra∗ Potnuru Anusha∗ Pawan Goyal


bsantraigi† anusha.sparkx† pawang‡

Computer Science and Engineering Dept.


Indian Institute of Technology Kharagpur
Kharagpur, W.B., India
{†}@gmail.com, {‡}@cse.iitkgp.ac.in

Abstract context and, optionally, some external information


through knowledge bases (Wen et al., 2017) or an-
Generative models for dialog systems have
notations e.g. belief states, dialog acts etc. (Chen
gained much interest because of the recent
success of RNN and Transformer based mod-
et al., 2019; Zhao et al., 2017).
els in tasks like question answering and sum- A dialog is usually represented as a series of ut-
marization. Although the task of dialog re- terances. However, it is not sufficient to view each
sponse generation is generally seen as a se- utterance independently for engaging in a conver-
quence to sequence (Seq2Seq) problem, re- sation. In a dialogue between humans, the speakers
searchers in the past have found it challeng- communicate both utterance level and dialog level
ing to train dialog systems using the standard
information. E.g., dialog intent often cannot be de-
Seq2Seq models. Therefore, to help the model
learn meaningful utterance and conversation tected by looking at a single utterance, whereas di-
level features, Sordoni et al. (2015b); Serban alog acts are specific to each utterance and change
et al. (2016) proposed Hierarchical RNN ar- throughout a conversation. Intuitively, we can in-
chitecture, which was later adopted by sev- struct the model to achieve both utterance level
eral other RNN based dialog systems. With and dialog level understanding separately through
the transformer-based models dominating the a hierarchical encoder (Serban et al., 2016).
seq2seq problems lately, the natural question
to ask is the applicability of the notion of hi- There has been a lot of interest in the past
erarchy in transformer based dialog systems. towards using the Hierarchical Encoder-Decoder
In this paper, we propose a generalized frame- (HRED) model for encoding utterances in many
work for Hierarchical Transformer Encoders RNN based dialog systems. However, since the
and show how a standard transformer can be rise of Transformers and self-attention (Vaswani
morphed into any hierarchical encoder, includ- et al., 2017), the use of hierarchy has not been ex-
ing HRED and HIBERT like models, by us- plored further for transformer-based dialog models.
ing specially designed attention masks and po-
Past research and user-studies have also shown that
sitional encodings. We demonstrate that Hi-
erarchical Encoding helps achieve better nat- hierarchy is an important aspect of human conver-
ural language understanding of the contexts in sation (Jurafsky, 2000). But, most previous works
transformer-based models for task-oriented di- based on transformer have focused on training mod-
alog systems through a wide range of experi- els either as language models (Budzianowski and
ments. The code and data for all experiments Vulić, 2019; Zhang et al., 2020b) or as standard
in this paper has been open-sourced1 2 . (non-hierarchical) Seq2Seq models (Chen et al.,
2019; Zhang et al., 2020a; Wang et al., 2020) with
1 Introduction
certain task specific extensions. Although arguably,
Dialog systems are concerned with replicating the the self-attention mechanism might automatically
human ability to make conversation. In a genera- learn such a scheme during the training process,
tive dialog system, the model aims at generating our empirical results show that forcing this induc-
coherent and informative responses given a dialog tive bias by manual design as proposed here leads

Equal Contributions to better performing models.
1
Experiments in this paper: https://github.com/ This paper bridges these two popular approaches
bsantraigi/HIER of transformers and hierarchical encoding for di-
2
PyTorch implementation of Hierarchical Transformer
Encoder: https://github.com/bsantraigi/ alogs systems to propose a family of Hierarchical
hier-transformer-pytorch Transformer Encoders. Although arguably, the self-
5649
Proceedings of the 2021 Conference of the North American Chapter of the
Association for Computational Linguistics: Human Language Technologies, pages 5649–5658
June 6–11, 2021. ©2021 Association for Computational Linguistics
attention mechanism of standard encoders might Contextual Embedding for all tokens
in context
automatically learn such a scheme during the train-
ing process, our empirical results show that forcing
Context
this inductive bias by manual design as proposed Encoder

here leads to better performing models. Our contri-


Global
butions in this paper include: Positional
Encoding + + + +
• We propose a generalized framework for hier-
Shared
archical encoders in transformer based mod- Encoders

els that covers a broader range of architec-


tures including existing encoding schemes Local
Positional
like HRED/HIBERT (Zhang et al., 2019) and Encoding

possibly other novel variants. We call mem- + + + + + + +


bers of this family of hierarchical transformer
encoders as an HT-Encoder.
• Then, we formulate a straightforward algo-
rithm for converting an implementation of Figure 1: Detailed architecture for a Hierarchical
standard transformer encoder into an HT- Transformer Encoder or HT-Encoder: The main in-
ductive bias incorporated in this model is to encode the
Encoder by changing the attention mask and
full dialog context hierarchically in two stages. This
the positional encoding. is done by the two encoders, 1) Shared Utterance En-
• Building upon that, we show how an coder (M layers) and 2) Context Encoder (N layers), as
HRED/HIBERT like hierarchical encoder shown in the figure. Shared encoder first encodes each
(HIER-CLS) can be implemented using our utterance (u1 , u2 , . . . , ut ) individually to extract the ut-
HT-Encoder framework. terance level features. The same parameterized Shared
• We also showcase a novel HT-Encoder based Encoder is used for encoding all utterances in the con-
text. In the second Context Encoder the full context
model, called HIER, with a context encod-
is encoded using a single transformer encoder for ex-
ing mechanism different from HRED. We tracting dialog level features. The attention mask in
show that these simple HT-Encoder based context encoder decides how the context encoding is
baselines achieve at par or better performance done and is a choice of the user. This one depicted
than many recent models with more sophis- in the figure is for the HIER model described in Sec-
ticated architectures or training procedures. tion 2.3. Only the final utterance in the Context En-
We make a thorough comparison with many coder gets to attend over all the previous utterances as
recently proposed models in four different ex- shown. This allows the model to have access to both
utterance level features and dialog level features till the
perimental settings for dialog response gener-
last layer of the encoding process. Notation: Utterance
ation task. i, ui = [wi1 , . . . , wi|ui | ], wij is the word embedding
• We further apply HT-Encoder to a state-of-the- for j th word in ith utterance.
art model, Marco (Wang et al., 2020), for task-
oriented dialog systems and obtain improved
results. 2.1 Hierarchical Transformer Encoders
(HT-Encoder)
Like the original HRED architecture, HT-Encoder
2 Models
also has two basic components, a shared utterance
encoder and the context encoder. Shared utterance
Formally, the task of a dialog system is to pre- encoder, or the Shared Encoder in short, is the
dict a coherent response, r, given a dialog con- first phase of the encoding process where each ut-
text c. In case of a goal oriented dialog sys- terance is processed independently to obtain utter-
tem, context c might consist of dialog history, ance level representations. In the second phase, the
Ct = [U1 , S1 , ..., Ut ], and optionally a belief state Context Encoder is used to process the full con-
(dialog act, slot values, intent etc.) bt , when avail- text together. These context level representations
able. Here, Ui , Si represent the user and system are then used for the tasks like dialog state tracking
utterances at turn i, respectively. The actual target or response generation. We propose two different
response following Ct is the system utterance St . types of Hierarchical Encoding schemes for the
5650
transformer model. utterance encoder (local PE) and once more before
context encoder (global PE).
1. HIER-CLS: When Serban et al. (2016) em-
ployed a hierarchical encoder for dialog contexts, CT-Mask (HIER-CLS)

they obtained a single representative embedding,


usually the final hidden state of an RNN, for each
utterance. Similarly, in HIER-CLS, the context en-
coder utilizes only a single utterance embedding
for each utterance. We do this by taking the con-
textual embedding of the first token (often termed
UT-Mask CT-Mask (HIER)
as the “CLS” token in transformer based models)
of each utterance.
2. HIER: Recent works have shown the impor-
tance of contextual word embeddings. In HIER,
we consider contextual embedding of all utterance
tokens as input to the context encoder. We sim-
ply concatenate the whole sequence of contextual
Figure 2: Example of UT-Mask (A for the given CI )
embeddings and forward it to the context encoder. and CT-Masks. Blue cells: 1, White cells: 0. Bottom
2.2 Conversion Algorithm: Standard left is the UT-Mask and on the right are CT-Masks for
HIER-CLS(top) and HIER(bottom). In this example,
Encoder to HT-Encoder the context comprises of three utterances of lengths 0, 1
In this section, we show how the two-step process and 2, respectively. CI indicates which utterance each
of hierarchical encoding can be achieved using a of the tokens belongs to. The entries in PI denotes the
single standard transformer encoder. If we want to relative position of each token with respect to utterance
have an M layer utterance encoder followed by an corresponding to it.
N layer context encoder, we start with an (M + N )
layer standard encoder. Then by applying two sep- UT-Mask and Local Positional Encoding The
arate masks as designed below, we convert the stan- steps for obtaining the UT-Mask and positional en-
dard encoder into an HT-encoder. First, we need coding for the utterance encoder are given below
to encode the utterances independently. Within and is accompanied by Figure 2. C is the dialog
the self-attention mechanism of a transformer en- context to be encoded. wij is the jth token of ith ut-
coder, which token gets to attend to which other terance. In CI , each index i is repeated |ui | (length
tokens is controlled by the attention mask. If we of ui ) times. And CIR is a square matrix created
apply a block-diagonal mask, each block of size by repeating CI . PI has the same dimensions as
same as the length of utterances (as shown in Fig- CI , and it stores the position of each token wij in
ure 2 bottom-left), to the concatenated sequence context C, relative to utterance ui . P : I 7→ Rd is
of tokenized utterances, we effectively achieve the the positional encoding function that takes an in-
same process of utterance encoding. We call this dex (or indices) and returns their d-dim positional
block-diagonal mask for utterance encoding the embedding. A is the UT-Mask for the given con-
UT-mask. text C and their utterance indices CI . An example
Similarly, another attention mask (CT-Mask) instance of this process is given in Figure 2. 1(.) is
can explain the context encoding phase that allows an indicator function that returns true when the in-
tokens to attend beyond the respective utterance put logic holds, and is applied to a matrix or vector
boundaries. See the two matrices on Figure 2’s element-wise.
right for examples of such CT-Masks. From here,
it can be quickly concluded that if we apply the C = [w11 , w12 , ..., wT lT ]
UT-Mask for the first few layers of the encoder and CI = [0, . . . , 0, 1, . . . , 1, . . . , T ]
the CT-Mask in the remaining few layers, we effec- PI = [0, 1, . . . , l1 − 1, 0, . . . , l2 − 1, . . . , lT − 1]
tively have a hierarchical encoder. The CT-Mask
CIR = repeat(CI , len(CI ), 0)
also gives us more freedom on what kind of global
T

attention we want to allow during context encod- A = 1 2CIR == (CIR + CIR )
ing. Positional encoding is applied once before Pc = P[PI , :]
5651
y1 y2 y3
Block Bij is a matrix which, if all ones, means that
HT-ENCODER

U1 S1 U2 Ut Decoder Decoder Decoder


I
utterance i can attend to utterance j’s contextual
U1 Shared Encoder
token embeddings. The local and global positional
S1 Shared Encoder Dialogue

U2 Shared Encoder
Representation
Embedding Layer + Pos. Enc.
encodings are applied, as explained in Section 2.2.
A standard transformer decoder follows the HT-
SOS Can I
Ut Shared Encoder

CONTEXT ENCODER
Encoder for generating the response.
The CT-Mask for HIER was experimentally ob-
(a) Model: HIER
y1 y2 y3 tained after trying a few other variants. The intu-
ition behind this mask was that the model should re-
Decoder Decoder Decoder
HT-ENCODER

U1 S1 U2 Ut
I
ply to the last user utterance in the context. Hence,
DB Shared Encoder
we design the attention mask to apply cross atten-
U1 Shared Encoder Dialogue e1 e2 e3

S1 Shared Encoder
Representation
tion between all the utterances and the last utter-
p1 p2 p3

ance (see Figure 3a).


FFN

UT Shared Encoder ea ea ea

CONTEXT ENCODER
w.e1 w.e2 w.e3

ACT Embedding Layer


HIER++: HIER++ is the extended version of the
SOS Can I
HIER model, as shown in Figure 3b, that also takes
the dialog act label as input. The dialog act repre-
(b) Model: HIER++
sentation proposed in Chen et al. (2019) consists
Figure 3: The proposed architecture for the hierarchical of the domain, act, and slot values. A linear feed-
transformer: (a) HIER: when the belief states are not forward layer (FFN) acts as the embedding layer
available and (b) HIER++: when the belief states are for converting their 44-dimension multi-hot dia-
available.
log act representation. The output embedding is
added to the input token embeddings of the de-
CT-Masks for Models The attention masks for coder in HIER++ model. Similar to HDSA, we
context encoding depends on the choice for model also use ground truth dialog acts during training,
architecture. We provide the details of the architec- and predictions from a fine-tuned BERT model
tures and their attention masks used in our exper- during validation and testing. HIER++ is applied
iments in the subsequent section. There are other to the Context-to-Response generation task of the
masks possible, but these are the ones we found to MultiWOZ dataset.
be working best in their respective settings.
HIER-CLS: As described in Section 2.1, the en-
2.3 Model Architectures coding scheme of HIER-CLS is more akin to the
HRED (Chen et al., 2019) and HIBERT (Zhang
We propose several model architectures to test the et al., 2019) models. It differs from HIER++ only
effectiveness of the proposed HIER-Encoder in var- with respect to the CT-Mask.
ious experimental settings. These architectures are
designed to fit well with the four experimental set- Ablations To understand the individual impact
tings (see Section 3.1) of the response generation of UT-Mask and CT-Mask, we ran the same experi-
task of the MultiWOZ dataset in terms of input and ments with the following model ablations.
output.
The tested model architectures are as follows. 1. SET: HIER without the context encoder. Each
Using the HIER encoding scheme described in utterance is encoded independently. It shows
Section 2.1, we test two model architectures for the importance of context encoding. Effec-
response generation, namely HIER and HIER++. tively, this model is only the shared utterance
encoder (SET) applied to each utterance inde-
HIER: HIER is the most straightforward model pendently.
architecture with an HT-Encoder replacing the en-
coder in a Transformer Seq2Seq. The working of 2. MAT: HIER without the utterance encoder.
the model is shown in Figure 3a. First, in the ut- This model only uses the context encoder as
terance encoding phase, each utterance is encoded per the context attention mask of Figure 3a.
independently with the help of the UT-Mask. In the As this is equivalent to a simple transformer
second half of the encoder, we apply a CT-Mask encoder with a special attention mask, we call
as depicted by the figure’s block attention matrix. it the Masked Attention Transformer or MAT.
5652
3. SET++: An alternative version of SET with Baselines To fully grasp the effectiveness of our
dialog-act input to the decoder similar to proposed approaches, we consider several base-
HIER++. line3 models with varying complexity and archi-
tectures. Token-MoE (Pei et al., 2019) is a token
HIER-Joint: Finally, we propose the HIER- level mixture-of-experts (MoE) model. It builds
Joint model3 suitable for the end-to-end response upon the base architecture of LSTM-Seq2Seq with
generation task of the MultiWOZ dataset. The soft attention. In the decoding phase, they employ
HIER-Joint model comprises an HT-Encoder and k expert decoders and a chair decoder network
three transformer decoders for decoding belief state which combines the outputs from the experts. Attn-
sequence, dialog act sequence, and response. It is LSTM (Budzianowski et al., 2018) uses an LSTM
jointly trained to predict all three sequences simul- Seq2Seq model with attention on encoded context
taneously. As belief state labels can help dialog-act utterance, oracle belief state and DB search results.
generation, and similarly, both belief and act labels HRED (Serban et al., 2017) model is based on
can assist response generation, we pass the token the same idea of hierarchical encoding in RNN
embedding from the belief decoder and act decoder Seq2Seq networks (results source: Peng et al.,
to the response decoder. Act decoder receives mean 2019, 2020b). The transformer based baseline
token embedding from the belief decoder too. (Vaswani et al., 2017) concatenates the utterances
in dialog context to obtain a single source sequence
Model L H A E and treats the task as a sequence transduction prob-
SET 6/-/3 100 4 100 lem. HDSA (Chen et al., 2019) uses a dialog act
MAT -/4/6 200 5 100 graph to control the state of the attention heads of
HIER 3/3/3 100 4 100 a Seq2Seq transformer model. Zhang et al. (2020a)
SET++ 4/-/3 91 7 175 proposes to augment the training dataset by build-
HIER++ 4/6/3 91 7 175 ing up a one-to-many state-to-action map, so that
the system can learn a more balanced distribution
Table 1: Best Hyper-parameters: L: a/b/c = number of
for the action prediction task. Using this method
layers in shared encoder/ Context Encoder / decoder, H
= hidden size, A = attention heads, E = embedding size.
they train a domain-aware multi-decoder (DAMD)
network for predicting belief, action and response,
jointly. As each agent response may cover multi-
ple domains, acts or slots at the same time, Marco
3 Experimenal Framework
(Wang et al., 2020) learns to generate the response
Our implementation is based on the PyTorch li- by attending over the predicted dialog act sequence
brary. All the models use a vocabulary of size at every step of decoding. SimpleTOD (Hosseini-
1,505. We generate responses using beam search4 Asl et al., 2020) and SOLOIST (Peng et al., 2020a)
with beam width 5. The model optimizes a cross are both based on the GPT-2 (Radford et al., 2019)
entropy loss. Full details of model parameters are architecture. The main difference between these
given in suplementary material. two architectures is that SOLOIST further pretrains
the GPT-2 model on two more dialog corpus before
Dataset We use MultiWOZ5 (Budzianowski fine-tuning on MultiWOZ dataset.
et al., 2018), a multi-domain task-oriented dataset.
It contains a total of 10,400 English dialogs di- 3.1 Task Settings:
vided into training (8,400), validation (1,000) and Following the literature (Zhang et al., 2020a; Peng
test (1,000). Each turn in the dialog is considered et al., 2020a), we now consider four different set-
as a prediction problem with all utterances upto tings for evaluating the strength of hierarchical en-
that turn as the context.6 coding.
3
Block diagram for HIER-Joint model has been provided
in supplementary material.
1. No Annotations First, to simply gauge the
4
https://github.com/OpenNMT/ benefit of using a Hierarchical encoder in a Trans-
OpenNMT-py/tree/master/onmt/translate former Seq2Seq model, we compare the perfor-
5
MultiWOZ v2.0 https://github.com/ mance of HIER to other baselines including HRED
budzianowski/multiwoz/blob/master/data/
MultiWOZ_2.0.zip and vanilla Transformer without any belief states
6
See supplementary for more details. and dialog act annotations.
5653
2. Oracle Policy In this setting, several recently oracle is present. By comparing the performance of
proposed model architectures for the response gen- Transformer, SET and MAT baselines against that
eration task of MultiWOZ are compared against of HIER we can see that in each case HIER is able
each other in presence of ground truth belief state to improve in terms of BLEU, Success and over-
and dialog act annotations. This experiment helps all Score. HIER being better than SET and MAT
us understand the models’ capabilities towards gen- implies that only the UT-Mask or the CT-Mask is
erating good responses (BLEU score) when true be- not sufficient, the full scheme of HT-Encoder is
lief state and(or) dialog acts are available to them. necessary for the improvement. The exception in
the improvements is the SET model which has the
3. Context-to-Response The model is given true
highest inform score of 76.80. Although, we ob-
belief states and DB search results in this experi-
serve that it is the combination of the BLEU and
ment, but they need to generate the dialog act and
Inform score that depicts the real quality of the re-
response during inference. Some of the baselines
sponses. As BLEU measures precision of n-grams
generate dialog act as an intermediate step in their
and inform measures recall of task related entities,
architecture whereas others use a fine-tuned BERT
only when both metrics increase we get a better
model.
performing model. This is reflected upto some ex-
4. End-to-End This is the most realistic evalu- tent in Entity-F1 score (H-Mean of entity recall
ation scheme where a model has to predict both and precision), but it too ignores tokens other than
belief states and dialog act (or one of these as per task related entities. So SET only having a higher
the models input requirement) for searching DB or inform score may mean that it is over-predicting
generating response. some entities leading to improved recall.
In the Context-to-Response generation task with
3.2 Evaluation Metrics
oracle policy (Table 3), our HIER++ and HIER-
We used the official evaluation metrics7 re- CLS models show very strong performance and
leased by the authors of the MultiWOZ dataset beat the HDSA model (in terms of Inform and Suc-
(Budzianowski et al., 2018): Delexicalized-BLUE cess rates) and even the GPT-2 based baseline Sim-
score, INFORM rate (measures how often the en- pleTOD (in terms of BLEU and Success rate). This
tities provided by the system are correct), SUC- shows that without the intricacies of the baselines,
CESS rate (reflects how often the system is able just by applying a hierarchical encoder based model
to answer all the requested attributes), Entity- we are able to perform almost at the level of the
F1 score (Wen et al., 2017) (measures the entity state-of-the-art model. Compared to HIER, Sim-
coverage accuracy), and Combined Score (S = pleTOD utilizes GPT-2’s pretraining, and DAMD
BLEU + 0.5 × (Inf orm + Success)) to measure uses attention over previous belief states and action
the overall quality. sequences. Whereas, HIER’s access to oracle pol-
icy is only through the average embedding of its
Training Cross-entropy losses over the ground
tokens.
truth response and/or belief and act sequences are
used for the training the models. We did hyper- Further in Table 5, we compare end-to-end gen-
parameter search using the Optuna library (Akiba eration performance of HIER-Joint with baseline
et al., 2019) by training the model upto 5 epochs. models that can perform belief-state and/or dialog
Final models were trained 8 upto 30 epochs with act generation. In terms of BLEU and combined
early stopping. score HIER-Joint is able to perform better than the
baselines. With respect to inform and success the
4 Results model outperforms the DAMD baseline.
For the four different experimental settings dis- While the above experiments focus on proving
cussed in Section 3.1, we showcase results from the base performance of the proposed response
those experiments in Tables 2 through 5. Table 2 generation models (HIER, HIER++, HIER-CLS,
shows the results from our experiments when no and ablations), HT-Encoder can be applied to any
model that uses a standard transformer encoder.
7
https://github.com/budzianowski/ Hence, in a final experiment (Table 6), we integrate
multiwoz
8
A system with two Tesla P100 GPUs were used for train- HT-Encoder with an existing state-of-the-art model
ing. Marco. We replace the standard transformer in
5654
Evaluation Metrics
Models
BLEU Entity-F1 Inform Success Score
HRED 17.50 - 70.7 60.9 83.3
TokenMoE 16.81 - 75.30 59.70 84.31
Transformer 19.1 55.1 71.1 59.9 84.60
SET 18.67 51.61 76.80 57.69 85.92
MAT 18.86 54.89 71.9 52.5 81.06
HIER 20.91 54.45 73.60 60.10 87.76

Table 2: Simplest Baselines in absence of both Belief or Policy / Dialog Act annotations

Annotations Evaluation Metrics


Models Pretraining
Belief DB Policy BLEU Entity-F1 Inform Success Score
SimpleTOD GPT-2 Oracle Oracle Oracle 17.78 - 93.4 83.2 106.08
SimpleTOD GPT-2 Oracle - Oracle 18.61 - 92.3 85.8 107.66
HDSA - Oracle Oracle Oracle 30.4 86.2 87.9 78.0 113.4
DAMD - Oracle Oracle Oracle 27.3 - 95.4 87.2 118.5
SET++ - - - Oracle 25.56 82.27 85.7 74.3 105.56
HIER++ - - - Oracle 29.54 85.01 88.3 85.4 116.39
HIER-CLS - - - Oracle 29.29 84.23 88.3 85.9 116.39

Table 3: Context-to-Response generation with Oracle Policy. Superior Performance of DAMD: DAMD always
receives an extra input of Bt−1 annotation, while predicting for Bt or response Rt , which helps in NLU of the
subsequent utterances. This is not available in any other models.

Annotations Evaluation Metrics


Models Pretraining
Belief DB Policy BLEU Entity-F1 Inform Success Score
AttLSTM - Oracle Oracle - 18.80 54.8 71.2 60.2 84.50
SimpleTOD GPT-2 Oracle Oracle Gen 16.9 - 84 72.8 94.5
HDSA - Oracle Oracle BERT 23.6 68.9 82.9 68.9 99.50
DAMD - Oracle Oracle Gen 18.60 - 89.20 77.90 102.15
SOLOIST GPT-2, DC Oracle Oracle - 18.03 - 89.60 79.30 102.49
Marco - Oracle Oracle Gen 19.45 - 90.30 75.20 102.20
Marco-BERT - Oracle Oracle BERT 20.02 59.99 92.3 78.6 105.47
SET++ - Oracle Oracle BERT 22.08 65.33 86.2 76.3 103.33
HIER++ - Oracle Oracle BERT 23.04 64.15 86.5 76.6 104.59
HIER-CLS - Oracle Oracle BERT 22.89 64.57 85.2 76.8 103.89

Table 4: Context-to-Response: For this experiment only belief-states are given. GPT-2,DC means a second pre-
training phase using extra dialog corpus (DC) starting from GPT-2 model parameters.

Marco with an HT-Encoder and rerun the context- tems built upon transformer encoder-decoder archi-
to-response generation experiment. Introducing tecture. It is also applicable to tasks where the input
HT-Encoder into Marco helps improve in terms of sequence can be split into an abstract set of sub-
inform (minor), success and the combined score units (e.g., search history in Sordoni’s application).
metric. The results of this experiment show that We believe that our proposed approach for hierar-
HT-Encoder is suitable for any model architecture. chical encoding in transformers and the algorithm
Overall, our experiments show how useful the for converting the standard transformer encoder
proposed HT-Encoder module can be for dialog sys- makes it an invaluable but accessible resource for
5655
Annotations Evaluation Metrics
Models Pretraining
Belief DB Policy BLEU Entity-F1 Inform Success Score
DAMD - Gen* Oracle Gen 16.60 - 76.40 60.40 85.00
SimpleTOD GPT-2 Gen - Gen 15.01 - 84.4 70.1 92.26
SOLOIST GPT-2, DC Gen Gen - 16.54 - 85.50 72.90 95.74
HIER-Joint - Gen - Gen 19.74 53.94 80.5 71.7 95.84

Table 5: End-to-End: Belief State predicted by model itself. *In the End-to-End setting also, DAMD will need to
use the oracle Bt−1 for predicting the current belief Bt .

Act Prediction Response Generation


Models
Precision Recall F1 BLEU Inform Success Score
Marco 72.61 74.98 73.72 19.16 88.45 73.5 100.14
Marco + HTEncoder 73.23 74.11 73.68 19.05 91.72 75.8 102.81
Marco-BERT - - - 19.82 90.86 76.66 103.58
Marco-BERT + HTEncoder - - - 19.53 90.99 78.41 104.23

Table 6: Comparison between vanilla Marco model and Marco + HT-Encoder with proposed HT-Encoder. Bold-
faced results denote statistically significant improvement with p < 0.05. We didn’t observe any significant im-
provement in act-prediction F1-Score or BLEU scores for response generation. The numbers given in the table are
means of 10 different runs of each algorithm.

future researchers working on dialog systems or training the NLU and NLG modules end-to-end
similar problem statements with transformer-based using a single loss function. Marco (Wang et al.,
architectures. 2020) and HDSA (Chen et al., 2019) used a fine-
tuned BERT model as their act predictor as it often
5 Related Works triumphs other ways to train the dialog policy net-
work (even joint learning). HDSA is a transformer
Task Oriented Dialog Systems Researchers Seq2Seq model with act-controllable self-attention
identify four different subtasks for any task- heads (in the decoder) to disentangle the individual
oriented dialog system (Wen et al., 2017), natural tasks and domains within the network. Marco uses
language understanding (NLU), dialog state track- a soft-attention over the act sequence during the
ing (DST), dialog act or policy generation, and response generation process.
Natural Language Generation (NLG). Before the
advent of large scale Seq2Seq models, researchers Hierarchical Encoders The concept of Hierar-
focused on building feature-rich models with rule- chical Encoders have been used in many different
based pipelines for both natural language under- context in the past. It has been most well known in
standing and generation. It usually required sepa- the area of dialog response generation as the HRED
rate utterance-level and dialog-level NLU feature model. Many open domain dialog systems have
extraction modules. These NLU features decide used the hierarchical recurrent encoding scheme
the next dialog act that the system should follow. of HRED for various tasks and architectures. Hi-
This act is then converted into a natural language re- erarchical Encoder was first proposed by (Sordoni
sponse using the NLG module. Young et al. (2013) et al., 2015a) for using in a query suggestion sys-
modeled this problem as a Markov Decision Pro- tem. They used it encode the user history compris-
cess whose state comprised of various utterance ing multiple queries using an Hierarchical LSTM
and dialog features detected by an NLU module. network. Serban et al. (2016) extended this work to
However, such models had the usual drawback of open domain dialog generation problems and pro-
any pipelined approaches, error propagation. Wen posed the HRED network. HRED captures the high
et al. (2017) proposed using neural networks for ex- level features of the conversation in a context RNN.
tracting features like intent, belief states, etc. and Several models have adopted this approach later on,
5656
e.g. VHRED (Serban et al., 2017), CVAE (Zhao Discovery & Data Mining, KDD 2019, Anchor-
et al., 2017), DialogWAE (Gu et al., 2018), etc. age, AK, USA, August 4-8, 2019, pages 2623–2631.
ACM.
Another area in which researchers have proposed
the use of hierarchical encoder is for processing Paweł Budzianowski and Ivan Vulić. 2019. Hello, it’s
of paragraph or long documents. Li et al. (2015) GPT-2 - how can I help you? towards the use of pre-
used a hierarchical LSTM network for training an trained language models for task-oriented dialogue
systems. In Proceedings of the 3rd Workshop on
autoencoder that can encode and decode long para- Neural Generation and Translation, pages 15–22,
graphs and documents. Zhang et al. (2019) pro- Hong Kong. Association for Computational Linguis-
posed HIBERT where they introduced hierarchy tics.
into the BERT architecture to remove the limitation
Paweł Budzianowski, Tsung-Hsien Wen, Bo-Hsiang
on length of input sequence. HIBERT samples a Tseng, Iñigo Casanueva, Stefan Ultes, Osman Ra-
single vector for each sentence or document seg- madan, and Milica Gašić. 2018. MultiWOZ - a
ment (usually contextual embedding of CLS or large-scale multi-domain Wizard-of-Oz dataset for
EOS token) from the sentence encoder to be passed task-oriented dialogue modelling. In Proceedings of
the 2018 Conference on Empirical Methods in Nat-
onto the higher level transformer encoder. Liu and ural Language Processing, pages 5016–5026, Brus-
Lapata (2019) applies a similar approach for encod- sels, Belgium. Association for Computational Lin-
ing documents in a multi-document summarization guistics.
task.
Wenhu Chen, Jianshu Chen, Pengda Qin, Xifeng Yan,
and William Yang Wang. 2019. Semantically con-
6 Conclusion ditioned dialog response generation via hierarchical
disentangled self-attention. In Proceedings of the
This paper explored the use of hierarchy in 57th Annual Meeting of the Association for Com-
transformer-based models for task-oriented dialog putational Linguistics, pages 3696–3709, Florence,
system. We started by proposing a generalized Italy. Association for Computational Linguistics.
framework for Hierarchical Transformer Encoders Yue Gu, Kangning Yang, Shiyu Fu, Shuhong Chen,
(HT-Encoders). Using that, we implemented two Xinyu Li, and Ivan Marsic. 2018. Multimodal af-
models, one new model called HIER, and another fective analysis using hierarchical attention strategy
HIER-CLS model by adapting the existing HIB- with word-level alignment. In Proceedings of the
56th Annual Meeting of the Association for Compu-
ERT architecture into our framework. We thor- tational Linguistics (Volume 1: Long Papers), pages
oughly experimented with these models in four 2225–2235, Melbourne, Australia. Association for
different response generation tasks of the Multi- Computational Linguistics.
WOZ dataset. We compared the proposed mod-
Ehsan Hosseini-Asl, Bryan McCann, Chien-Sheng Wu,
els with an exhaustive set of recent state-of-the-art Semih Yavuz, and Richard Socher. 2020. A sim-
models to thoroughly analyze the effectiveness of ple language model for task-oriented dialogue. In
HT-Encoders. We empirically show that the basic Advances in Neural Information Processing Systems
transformer seq2seq architecture, when equipped 33: Annual Conference on Neural Information Pro-
cessing Systems 2020, NeurIPS 2020, December 6-
with an HT-Encoder, outperforms many of the state- 12, 2020, virtual.
of-the-art models in each experiment. We further
prove its usefulness by applying it to an existing Dan Jurafsky. 2000. Speech & language processing.
model Marco. This work opens up a new direction Pearson Education India.
on hierarchical transformers in dialogue systems Jiwei Li, Thang Luong, and Dan Jurafsky. 2015. A
where complex dependencies exist between the ut- hierarchical neural autoencoder for paragraphs and
terances. It would also be beneficial to explore the documents. In Proceedings of the 53rd Annual Meet-
ing of the Association for Computational Linguistics
effectiveness of the proposed HT-Encoder when
and the 7th International Joint Conference on Natu-
applied for various other tasks. ral Language Processing (Volume 1: Long Papers),
pages 1106–1115, Beijing, China. Association for
Computational Linguistics.
References
Yang Liu and Mirella Lapata. 2019. Hierarchical trans-
Takuya Akiba, Shotaro Sano, Toshihiko Yanase, formers for multi-document summarization. In Pro-
Takeru Ohta, and Masanori Koyama. 2019. Op- ceedings of the 57th Annual Meeting of the Asso-
tuna: A next-generation hyperparameter optimiza- ciation for Computational Linguistics, pages 5070–
tion framework. In Proceedings of the 25th ACM 5081, Florence, Italy. Association for Computa-
SIGKDD International Conference on Knowledge tional Linguistics.
5657
Jiahuan Pei, Pengjie Ren, and Maarten de Rijke. Kaiser, and Illia Polosukhin. 2017. Attention is all
2019. A modular task-oriented dialogue system you need. In Advances in Neural Information Pro-
using a neural mixture-of-experts. arXiv preprint cessing Systems 30: Annual Conference on Neural
arXiv:1907.05346. Information Processing Systems 2017, December 4-
9, 2017, Long Beach, CA, USA, pages 5998–6008.
Baolin Peng, Chunyuan Li, Jinchao Li, Shahin Shayan-
deh, Lars Liden, and Jianfeng Gao. 2020a. Soloist: Kai Wang, Junfeng Tian, Rui Wang, Xiaojun Quan,
Few-shot task-oriented dialog with a single pre- and Jianxing Yu. 2020. Multi-domain dialogue acts
trained auto-regressive model. arXiv preprint and response co-generation. In Proceedings of the
arXiv:2005.05298. 58th Annual Meeting of the Association for Compu-
tational Linguistics, pages 7125–7134, Online. As-
Shuke Peng, Xinjing Huang, Zehao Lin, Feng Ji, sociation for Computational Linguistics.
Haiqing Chen, and Yin Zhang. 2019. Teacher-
student framework enhanced multi-domain dialogue Tsung-Hsien Wen, David Vandyke, Nikola Mrkšić,
generation. arXiv preprint arXiv:1908.07137. Milica Gašić, Lina M. Rojas-Barahona, Pei-Hao Su,
Stefan Ultes, and Steve Young. 2017. A network-
Shuke Peng, Feng Ji, Zehao Lin, Shaobo Cui, based end-to-end trainable task-oriented dialogue
Haiqing Chen, and Yin Zhang. 2020b. Mtss: system. In Proceedings of the 15th Conference of
Learn from multiple domain teachers and become the European Chapter of the Association for Compu-
a multi-domain dialogue expert. arXiv preprint tational Linguistics: Volume 1, Long Papers, pages
arXiv:2005.10450. 438–449, Valencia, Spain. Association for Computa-
tional Linguistics.
Alec Radford, Jeff Wu, Rewon Child, David Luan,
Dario Amodei, and Ilya Sutskever. 2019. Language Steve Young, Milica Gašić, Blaise Thomson, and Ja-
models are unsupervised multitask learners. son D Williams. 2013. Pomdp-based statistical spo-
ken dialog systems: A review. Proceedings of the
Iulian Vlad Serban, Alessandro Sordoni, Yoshua Ben- IEEE, 101(5):1160–1179.
gio, Aaron C. Courville, and Joelle Pineau. 2016.
Building end-to-end dialogue systems using gener- Xingxing Zhang, Furu Wei, and Ming Zhou. 2019. HI-
ative hierarchical neural network models. In Pro- BERT: Document level pre-training of hierarchical
ceedings of the Thirtieth AAAI Conference on Arti- bidirectional transformers for document summariza-
ficial Intelligence, February 12-17, 2016, Phoenix, tion. In Proceedings of the 57th Annual Meeting
Arizona, USA, pages 3776–3784. AAAI Press. of the Association for Computational Linguistics,
pages 5059–5069, Florence, Italy. Association for
Iulian Vlad Serban, Alessandro Sordoni, Ryan Lowe, Computational Linguistics.
Laurent Charlin, Joelle Pineau, Aaron C. Courville,
and Yoshua Bengio. 2017. A hierarchical latent Yichi Zhang, Zhijian Ou, and Zhou Yu. 2020a. Task-
variable encoder-decoder model for generating di- oriented dialog systems that consider multiple ap-
alogues. In Proceedings of the Thirty-First AAAI propriate responses under the same context. In Pro-
Conference on Artificial Intelligence, February 4-9, ceedings of the AAAI Conference on Artificial Intel-
2017, San Francisco, California, USA, pages 3295– ligence, volume 34, pages 9604–9611.
3301. AAAI Press.
Yizhe Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen,
Alessandro Sordoni, Yoshua Bengio, Hossein Vahabi, Chris Brockett, Xiang Gao, Jianfeng Gao, Jingjing
Christina Lioma, Jakob Grue Simonsen, and Jian- Liu, and Bill Dolan. 2020b. DIALOGPT : Large-
Yun Nie. 2015a. A hierarchical recurrent encoder- scale generative pre-training for conversational re-
decoder for generative context-aware query sugges- sponse generation. In Proceedings of the 58th An-
tion. In Proceedings of the 24th ACM International nual Meeting of the Association for Computational
Conference on Information and Knowledge Manage- Linguistics: System Demonstrations, pages 270–
ment, CIKM 2015, Melbourne, VIC, Australia, Octo- 278, Online. Association for Computational Linguis-
ber 19 - 23, 2015, pages 553–562. ACM. tics.

Alessandro Sordoni, Michel Galley, Michael Auli, Tiancheng Zhao, Ran Zhao, and Maxine Eskenazi.
Chris Brockett, Yangfeng Ji, Margaret Mitchell, 2017. Learning discourse-level diversity for neural
Jian-Yun Nie, Jianfeng Gao, and Bill Dolan. 2015b. dialog models using conditional variational autoen-
A neural network approach to context-sensitive gen- coders. In Proceedings of the 55th Annual Meet-
eration of conversational responses. In Proceedings ing of the Association for Computational Linguistics
of the 2015 Conference of the North American Chap- (Volume 1: Long Papers), pages 654–664, Vancou-
ter of the Association for Computational Linguis- ver, Canada. Association for Computational Linguis-
tics: Human Language Technologies, pages 196– tics.
205, Denver, Colorado. Association for Computa-
tional Linguistics.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob


Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz
5658

You might also like