Generating Diverse and Consistent QA pairs from Contexts with Information-Maximizing Hierarchical Conditional VAEs

Dong Bok Lee1*, Seanie Lee1*, Woo Tae Jung2, Donghwan Kim2 and Sung Ju Hwang1,3
1 KAIST, Daejeon, South Korea
2 42MARU, Seoul, South Korea
3 AITRICS, Seoul, South Korea

Data Scarcity in Question Answering (QA)
One of the most crucial challenges in QA is the scarcity of labeled data, since it is costly to obtain QA pairs for a target text domain through human annotation.
[Figure: a human annotator reads a passage from a new domain and writes QA pairs.]
New domain: Coronavirus (COVID-19)
Coronaviruses are a group of related RNA viruses that cause diseases in mammals and birds. In humans, these viruses cause respiratory tract infections that can range from mild to lethal. Mild illnesses include some cases of the common cold (which is also caused by other …
QA annotation
Q: When did COVID-19 occur?
A: 2019

Neural Question Generation (NQG)
Neural question (and answer) generation is a popular approach to overcoming this scarcity.
Neural Networks
Input (paragraph)
Philadelphia has more murals than any other
u.s. city, thanks in part to the 1984 creation
of the department of recreation’s mural arts
program, . . . The program has funded more
than 2,800 murals
Q1: which city has more murals than any other city?
A1: Philadelphia
Q2: why Philadelphia has more murals?
A2: the 1984 creation of the department of
recreation’s mural arts program
Q3: when did the department of recreation’s mural
arts program start?
A3: 1984
Q4: how many murals funded the graffiti arts program
by the department of recreation?
A4: more than 2,800
Seq2Seq + attention + copy

Diversity: One-to-many Problem
Existing systems have overlooked that QA generation (QAG) is essentially a one-to-many problem.
[Figure: a neural network maps one input paragraph to several QA pairs, e.g. A1: Philadelphia, A3: 1984, A4: more than 2,800.]

Consistency: Answerability of Question
Existing models impose no constraint on the consistency between the question and the answer.
[Figure: a neural network generates QA pairs (A1: Philadelphia, A3: 1984, A4: more than 2,800) from an input paragraph.]

Info-Maximizing Hierarchical Conditional VAEs
To overcome these challenges, we propose Info-HCVAE for QA pair generation.
Diversity -> deep latent variable model (HCVAE)
+
Consistency -> mutual information maximization

Derivation of ELBO
Formally, our goal is to learn the conditional joint distribution of a question x and an answer y given a context c:
Question
which city has more
murals than any other city?
Answer
Philadelphia
Context (paragraph)
Philadelphia has more murals than
any other u.s. city, thanks in part to
the 1984 creation of the department
of recreation’s mural arts
program, . . . The program has
funded more than 2,800 murals
𝑥, 𝑦 ~ 𝑝(𝑥, 𝑦|𝑐)

Derivation of ELBO
Here, we introduce separate latent spaces for the question and the answer:
\[
p(x, y \mid c) = \int_{z_x} \sum_{z_y} p(x \mid z_x, y, c)\, p(y \mid z_x, z_y, c)\, p(z_y \mid z_x, c)\, p(z_x \mid c)\, dz_x
\]
z_y (summed over): discrete random variable (e.g., categorical distribution)
z_x (integrated over): continuous random variable (e.g., Gaussian distribution)

Derivation of ELBO
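We then use variational posteriors to maximize the following Evidence Lower Bound (ELBO):
\[
\begin{aligned}
\log p_\theta(x, y \mid c) \ge{} & \mathbb{E}_{z_x \sim q_\phi(z_x \mid x, c)}\left[\log p_\theta(x \mid z_x, y, c)\right] \\
& + \mathbb{E}_{z_y \sim q_\phi(z_y \mid z_x, y, c)}\left[\log p_\theta(y \mid z_y, c)\right] \\
& - D_{\mathrm{KL}}\left[q_\phi(z_y \mid z_x, y, c) \,\|\, p_\psi(z_y \mid z_x, c)\right] \\
& - D_{\mathrm{KL}}\left[q_\phi(z_x \mid x, c) \,\|\, p_\psi(z_x \mid c)\right] \\
=:{} & \mathcal{L}_{\mathrm{HCVAE}}
\end{aligned}
\]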
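As a rough illustration of the two KL terms in the bound, the sketch below assumes a diagonal Gaussian for the question latent and a single categorical distribution for the answer latent, for which both divergences have closed forms. The function names and the single-sample `elbo` helper are hypothetical and are not taken from the authors' code.

```python
import numpy as np

def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """KL[N(mu_q, diag(exp(logvar_q))) || N(mu_p, diag(exp(logvar_p)))]."""
    var_q, var_p = np.exp(logvar_q), np.exp(logvar_p)
    return 0.5 * np.sum(
        logvar_p - logvar_q + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0
    )

def categorical_kl(probs_q, probs_p, eps=1e-12):
    """KL[Cat(probs_q) || Cat(probs_p)] for two probability vectors."""
    q = np.clip(probs_q, eps, 1.0)
    p = np.clip(probs_p, eps, 1.0)
    return np.sum(q * (np.log(q) - np.log(p)))

def elbo(log_px, log_py, kl_zy, kl_zx):
    """Single-sample estimate: reconstruction terms minus both KL terms."""
    return log_px + log_py - kl_zy - kl_zx
```

Both divergences are zero when posterior and prior coincide, in which case the bound reduces to the two reconstruction terms.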

Derivation of ELBO
After training, the generative process of HCVAE is as follows:
1. Sample question L.V.: $z_x \sim p_\psi(z_x \mid c)$
2. Sample answer L.V.: $z_y \sim p_\psi(z_y \mid z_x, c)$
3. Generate an answer: $y \sim p_\theta(y \mid z_y, c)$
4. Generate a question: $x \sim p_\theta(x \mid z_x, y, c)$
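The four sampling steps can be sketched as ancestral sampling. This is a toy illustration, not the authors' implementation: the linear maps in `prior`, the single categorical answer latent, and the two decoder callables are all stand-ins invented for the example.

```python
import numpy as np

def generate_qa(c, prior, answer_decoder, question_decoder, rng):
    """Ancestral sampling: z_x, then z_y, then answer y, then question x."""
    # 1. z_x ~ p(z_x | c): diagonal Gaussian with parameters computed from c.
    mu = prior["W_mu"] @ c
    logvar = prior["W_logvar"] @ c
    z_x = mu + np.exp(0.5 * logvar) * rng.standard_normal(mu.shape)
    # 2. z_y ~ p(z_y | z_x, c): categorical over K classes (softmax of logits).
    logits = prior["W_cat"] @ np.concatenate([z_x, c])
    probs = np.exp(logits - logits.max())
    probs = probs / probs.sum()
    z_y = int(rng.choice(len(probs), p=probs))
    # 3. y ~ p(y | z_y, c): hypothetical answer decoder.
    y = answer_decoder(z_y, c)
    # 4. x ~ p(x | z_x, y, c): hypothetical question decoder.
    x = question_decoder(z_x, y, c)
    return x, y
```

Calling `generate_qa` repeatedly with the same context `c` draws fresh latents each time, which is what yields different QA pairs for one paragraph.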

This is the overall description of our specific implementation for HCVAE.

The conditional posterior and prior networks map the input into the latent space.
Posterior: 3 Bi-LSTMs, 2 MLPs, attention
Prior: 1 Bi-LSTM, 1 MLP
[Figure (a): Conditional Posterior / Prior Networks. Bi-LSTM encoders read the context, the question, and the answer, and MLPs produce the latent variables.]
Example context: At the end of the main drive, is a simple, modern stone statue of mary.
Example question: what is at the end of the main drive?

The answer generation (AG) network generates answer spans from the categorical latent variable.
Answer Generation:
• Heuristic matching layer from NLI [Mou2016]
• Bi-LSTM
• 2 linear layers to predict answer spans
[Figure (b): Answer Generation Network. Heuristic matching followed by a Bi-LSTM decoder; context fragment "statue of mary.", answer "modern stone statue of mary".]
[Mou2016] Lili Mou et al., Natural language inference by tree-based convolution and heuristic matching, ACL 2016
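A minimal sketch of the two ingredients named above, under stated assumptions. `heuristic_match` builds the feature concatenation described by Mou et al. (2016): both vectors, their difference, and their element-wise product. `best_span` is one common way to decode a span from start and end scores. How the actual network combines these, and the `max_len` limit, are assumptions made for illustration.

```python
import numpy as np

def heuristic_match(a, b):
    """Concatenate [a; b; a - b; a * b] along the last axis."""
    return np.concatenate([a, b, a - b, a * b], axis=-1)

def best_span(start_logits, end_logits, max_len=30):
    """Return (start, end) with start <= end that maximizes the summed scores."""
    best, best_score = (0, 0), -np.inf
    for s in range(len(start_logits)):
        # Only consider ends at or after the start, within max_len tokens.
        for e in range(s, min(s + max_len, len(end_logits))):
            score = start_logits[s] + end_logits[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best
```

For example, with start scores `[0, 5, 1]` and end scores `[1, 0, 4]`, the selected span covers tokens 1 to 2.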

The question generation (QG) network generates a question from the answer and the continuous latent variable.
Question Generation:
• Bi-LSTM with gated self-attention [Wang2017]
• Luong attention [Luong2015]
• LSTM
• Linear layer to predict words
• Maxout copy mechanism [Zhao2018]
[Figure (c): Question Generation Network. Bi-LSTM encoder, attention, and LSTM decoder with copy; context fragment "statue of mary.", generated question "what is at the end of the main drive?"]
[Wang2017] Wenhui Wang et al., Gated self-matching networks for reading comprehension and question answering, ACL 2017
[Luong2015] Thang Luong et al., Effective approaches to attention-based neural machine translation, EMNLP 2015
[Zhao2018] Yao Zhao et al., Paragraph-level neural question generation with maxout pointer and gated self-attention networks, EMNLP 2018

Mutual Information Maximization
Semantic consistency of QA pairs is important.
Context: ...during the age of enlightenment, philosophers such as
john locke advocated the principle in their writings, whereas others,
such as thomas hobbes, strongly opposed it.
Ground Truth: who was an advocate of separation of powers?
Generated: who opposed the principle of enlightenment?
[Zhang 2019] Zhang and Bansal, Addressing Semantic Drift in Question Generation for Semi-Supervised Question Answering, EMNLP2019

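Maximizing the mutual information of QA pairs directly is intractable, so we use a lower bound:
\[
\begin{aligned}
I(X; Y) &= D_{\mathrm{KL}}\left(\mathbb{P}_{XY} \,\|\, \mathbb{P}_X \otimes \mathbb{P}_Y\right) \\
&= \int_{\mathcal{X} \times \mathcal{Y}} \log \frac{d\mathbb{P}_{XY}}{d(\mathbb{P}_X \otimes \mathbb{P}_Y)} \, d\mathbb{P}_{XY} \\
&\ge \mathbb{E}_{\mathbb{P}_{XY}}\left[T_\omega(x, y)\right] - \log \mathbb{E}_{\mathbb{P}_X \otimes \mathbb{P}_Y}\left[\exp(T_\omega(x, y))\right]
\end{aligned}
\]
$T_\omega : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$ is a discriminator.
[Belghazi 2018] Belghazi et al., MINE: Mutual Information Neural Estimation, ICML 2018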
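To make the lower bound concrete, the sketch below evaluates it from discriminator outputs on two batches: scores for pairs drawn from the joint distribution and scores for pairs drawn from the product of marginals. The function name and the batch-average estimator are illustrative assumptions. A log-mean-exp is used for numerical stability.

```python
import numpy as np

def dv_lower_bound(joint_scores, marginal_scores):
    """Estimate E_joint[T] - log E_marginal[exp(T)] from discriminator scores."""
    j = np.asarray(joint_scores, dtype=float)
    m = np.asarray(marginal_scores, dtype=float)
    # log(mean(exp(m))) computed stably by factoring out the maximum.
    log_mean_exp = m.max() + np.log(np.mean(np.exp(m - m.max())))
    return j.mean() - log_mean_exp
```

A discriminator that outputs the same score everywhere gives a bound of zero. The bound grows as the discriminator scores joint pairs above mismatched ones.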

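Alternative estimator: Jensen-Shannon divergence
\[
MI(X, Y) \ge \mathbb{E}_{\mathbb{P}_{XY}}\left[\log \sigma(T_\omega(x, y))\right] + \mathbb{E}_{\mathbb{P}_X \otimes \mathbb{P}_Y}\left[\log\left(1 - \sigma(T_\omega(\tilde{x}, y))\right)\right]
\]
\[
\sigma(x) = \frac{1}{1 + \exp(-x)}
\]
[Hjelm 2019] Hjelm et al., Learning deep representations by mutual information estimation and maximization, ICLR 2019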

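Following Yeh and Chen (2019), we maximize the MI of QA pairs as follows:
\[
\begin{aligned}
MI(X, Y) \ge{} & \mathbb{E}_{x, y \sim \mathbb{P}}\left[\log g(x, y)\right] \\
& + \frac{1}{2}\, \mathbb{E}_{\tilde{x}, y \sim \mathbb{N}}\left[\log\left(1 - g(\tilde{x}, y)\right)\right] \\
& + \frac{1}{2}\, \mathbb{E}_{x, \tilde{y} \sim \mathbb{N}}\left[\log\left(1 - g(x, \tilde{y})\right)\right] \\
=:{} & \mathcal{L}_{\mathrm{Info}}
\end{aligned}
\]
\[
g(x, y) = \mathrm{sigmoid}\left(\bar{x}^{\top} W \bar{y}\right)
\]
$\bar{x}$, $\bar{y}$: average pooling of hidden states
$\tilde{x}$, $\tilde{y}$: negative examples
[Yeh 2019] Yeh and Chen, QAInfomax: Learning Robust Question Answering System by Mutual Information Maximization, EMNLP 2019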
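A small numerical sketch of this objective for one QA pair, assuming the question and answer are given as matrices of hidden states (one row per token) and that negatives are supplied by the caller. The names `pool` and `info_objective`, and the way negatives are passed in, are hypothetical.

```python
import numpy as np

def pool(hidden_states):
    """Average pooling over the token axis."""
    return np.mean(hidden_states, axis=0)

def g(x_bar, y_bar, W):
    """Bilinear discriminator: sigmoid(x_bar^T W y_bar)."""
    return 1.0 / (1.0 + np.exp(-(x_bar @ W @ y_bar)))

def info_objective(x_bar, y_bar, x_negs, y_negs, W, eps=1e-12):
    """Positive term plus half-weighted terms for negative questions and answers."""
    pos = np.log(g(x_bar, y_bar, W) + eps)
    neg_x = np.mean([np.log(1.0 - g(xn, y_bar, W) + eps) for xn in x_negs])
    neg_y = np.mean([np.log(1.0 - g(x_bar, yn, W) + eps) for yn in y_negs])
    return pos + 0.5 * neg_x + 0.5 * neg_y
```

With `W` set to zero the discriminator outputs 0.5 for every pair, so the objective equals 2 log 0.5. Training `W` should raise it by separating true pairs from negatives.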

Training Info-HCVAE
We maximize the following objective with Monte Carlo estimators:
\[
\max_{\Theta} \; \mathcal{L}_{\mathrm{HCVAE}} + \lambda \mathcal{L}_{\mathrm{Info}}, \qquad \Theta := \{\phi, \psi, \theta, W\}
\]

Experimental Setup
1) Tasks and Evaluation Metric
• QA generation: QA-based Evaluation, Reverse QA-based Evaluation
• Semi-supervised QA: F1 and Exact Match (EM)
2) Data
• HarvestingQA (Du and Cardie, 2018)
• SQuAD
• Natural Questions
• TriviaQA
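For reference, Exact Match and token-level F1 for extractive QA are usually computed along the following lines. This is a simplified sketch in the style of SQuAD evaluation (lowercasing, stripping punctuation and English articles). It is not the official scoring script and handles only a single gold answer.

```python
import re
import string
from collections import Counter

def normalize(text):
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, gold):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))

def f1_score(prediction, gold):
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

A prediction of "more than 2,800 murals" against the gold answer "more than 2,800" scores 0 on EM but 6/7 on F1, which is why the two metrics are reported together.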

Experimental Setup
3) Baselines
• Harvesting-QG [Du 2018]:
Seq2Seq + attention + copy
• Maxout-QG [Zhao 2018]:
Seq2Seq + attention + Maxout copy + BERT encoder
• Semantic-QG [Zhang 2019]:
Seq2Seq + attention + Maxout copy + BERT encoder + Reinforcement
[Du 2018] Du and Cardie, Harvesting Paragraph-Level Question-Answer Pairs from Wikipedia, ACL 2018
[Zhao 2018] Zhao et al., Paragraph-level Neural Question Generation with Maxout Pointer and Gated Self-attention Networks, EMNLP 2018

QA-based Evaluation (QAE)
Train a QA model on synthetic data and measure its EM and F1 on annotated data.
Train: synthetic data -> Test: annotated data

Reverse QA-based Evaluation (R-QAE)
Accuracy (EM, F1) of a QA model trained on annotated data and evaluated on synthetic QA pairs.
Train: annotated data -> Test: synthetic data

QA Generation on SQuAD
SQuAD (EM / F1)
Method          QAE (↑)         R-QAE (↓)
Harvesting-QG   55.11 / 66.40   66.77 / 77.85
Maxout-QG       56.08 / 67.50   62.49 / 78.24
Semantic-QG     60.49 / 71.91   74.23 / 88.54
HCVAE           69.46 / 80.79   37.57 / 61.24
Info-HCVAE      71.18 / 81.51   38.80 / 60.73

QA Generation on SQuAD
Info-HCVAE is more data-efficient than the other baselines.

QA Generation on Natural Questions
Natural Questions (EM / F1)
Method          QAE (↑)         R-QAE (↓)
Harvesting-QG   27.91 / 41.23   49.89 / 70.01
Maxout-QG       30.98 / 44.96   49.96 / 70.03
Semantic-QG     30.59 / 45.29   58.42 / 79.23
HCVAE           31.45 / 46.77   32.78 / 55.12
Info-HCVAE      37.18 / 51.46   29.39 / 53.04

QA Generation on TriviaQA
TriviaQA (EM / F1)
Method          QAE (↑)         R-QAE (↓)
Harvesting-QG   21.32 / 30.21   29.75 / 47.73
Maxout-QG       24.58 / 34.32   31.56 / 49.92
Semantic-QG     27.54 / 38.25   37.45 / 58.15
HCVAE           30.20 / 40.88   34.41 / 48.16
Info-HCVAE      35.45 / 44.11   21.65 / 37.65

QA Generation Examples (One-to-Many)
Input (paragraph and answer)
The Scotland act 1998 which was
passed by and given royal assent by
queen Elizabeth ii on 19 November
1998, governs functions and role of
the Scottish parliament and delimits
its legislative competence . . .
Q1: which act was passed in 1998?
Q2: which act governs role of the Scottish
parliament?
Q3: which act was passed by queen
Elizabeth ii?
Q4: which act gave the Scottish parliament
the responsibility to determine its
legislative policy?
We sample the question latent variable multiple times with a fixed answer: $z_x \sim p_\psi(z_x \mid c)$

QA Generation Examples (Latent Interpolation)
We generate QA pairs by interpolating between two latent codes.
Atop the main building's gold dome is a golden statue of the virgin mary. ... Next to the main building is the basilica of the sacred heart. Immediately behind the basilica is the grotto, ...a marian place of prayer and reflection. ... At the end of the main drive ..., is a simple, modern stone statue of mary.
Interpolation:
Q1: where is the grotto at?
A1: A marian place of prayer and reflection
Q2: what place is behind the basilica of prayer?
A2: grotto
Q3: what is next to the main building at notre dame?
A3: the basilica of the sacred heart
Q4: what is at the end of the main drive?
A4: stone statue of mary
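Interpolating between two latent codes can be sketched as below. The slides do not say which interpolation scheme is used, so linear interpolation is an assumption here, and `interpolate` is a hypothetical helper. Each intermediate code would then be decoded into a QA pair.

```python
import numpy as np

def interpolate(z_a, z_b, num_steps):
    """Return num_steps codes evenly spaced on the line from z_a to z_b."""
    ts = np.linspace(0.0, 1.0, num_steps)
    return [(1.0 - t) * z_a + t * z_b for t in ts]
```

With `num_steps=4`, the first and last codes are the two endpoints and the two in between are one third and two thirds of the way along.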

Semi-supervised QA
Sample 10 different QA pairs for a single paragraph from the target datasets (+S×10, +N×10, +T×10; S: SQuAD, N: Natural Questions, T: TriviaQA).
[Figure: Info-HCVAE takes an input paragraph and generates ten QA pairs (A1: Philadelphia, ..., A9: 1984, A10: more than 2,800).]

Semi-supervised QA
Generate a QA pair for each paragraph from the HarvestingQA dataset (+H×10% ~ +H×100%).
[Figure: Info-HCVAE takes an input paragraph and generates a QA pair.]
Input (paragraph): … The typical division is into three branches: a legislature, an executive, and a judiciary, which is the trias politica model
Q: what are the three branches of the government?
A: a legislature, an executive, and a judiciary,

Semi-supervised QA on SQuAD
Data                     EM              F1
SQuAD                    80.25           88.23
Semantic-QG (baseline)
+S×10                    81.20 (+0.95)   88.36 (+0.13)
+H×100%                  81.03 (+0.78)   88.79 (+0.56)
+S×10 + H×100%           81.44 (+1.19)   88.72 (+0.49)
Info-HCVAE
+S×10                    82.09 (+1.84)   89.11 (+0.88)
+H×100%                  82.37 (+2.12)   89.63 (+1.40)
+S×10 + H×100%           82.19 (+1.94)   89.84 (+1.59)

Semi-supervised QA on Natural Questions
Data                EM              F1
SQuAD               42.77           57.29
+N×1                46.70 (+3.94)   61.08 (+3.79)
+N×2                46.95 (+4.19)   61.34 (+4.05)
+N×3                47.73 (+4.96)   61.98 (+4.69)
+N×5                48.19 (+5.42)   62.21 (+4.92)
+N×10               48.44 (+5.67)   62.69 (+5.40)
Natural Questions   61.55           73.91

Semi-supervised QA on TriviaQA
Data       EM              F1
SQuAD      48.96           57.98
+T×1       49.65 (+0.69)   59.13 (+1.21)
+T×2       50.01 (+1.05)   59.08 (+1.10)
+T×3       49.71 (+0.75)   59.49 (+1.51)
+T×5       50.14 (+1.18)   59.21 (+1.23)
+T×10      49.65 (+0.69)   59.20 (+1.22)
TriviaQA   64.55           70.42

Conclusion
• We propose a novel probabilistic generative model for one-to-many QA generation.
• By maximizing the mutual information of QA pairs, we improve their semantic consistency.
• Results show that we significantly improve the performance of a BERT QA model by further training it with our generated QA pairs.
• A gap remains between semi-supervised and supervised learning, due to the discrepancy among different domains; we hope future research can close it.
Code available at https://github.com/seanie12/Info-HCVAE
