
Improving Complex Knowledge Base Question Answering via Question-to-Action and Question-to-Question Alignment

Yechun Tang, Xiaoxia Cheng, Weiming Lu*
College of Computer Science and Technology, Zhejiang University
{tangyechun, zjucxx, luwm}@zju.edu.cn
* Corresponding author.

Abstract

Complex knowledge base question answering can be achieved by converting questions into sequences of predefined actions. However, there is a significant semantic and structural gap between natural language and action sequences, which makes this conversion difficult. In this paper, we introduce an alignment-enhanced complex question answering framework, called ALCQA, which mitigates this gap through question-to-action alignment and question-to-question alignment. We train a question rewriting model to align the question and each action, and utilize a pretrained language model to implicitly align the question and KG artifacts. Moreover, considering that similar questions correspond to similar action sequences, we retrieve the top-k similar question-answer pairs at the inference stage through question-to-question alignment and propose a novel reward-guided action sequence selection strategy to select from candidate action sequences. We conduct experiments on the CQA and WQSP datasets, and the results show that our approach outperforms state-of-the-art methods and obtains a 9.88% improvement in the F1 metric on the CQA dataset. Our source code is available at https://github.com/TTTTTTTTy/ALCQA.

1 Introduction

Complex knowledge base question answering (CQA) aims to answer various natural language questions with a large-scale knowledge graph. Compared to simple questions with single or multiple hops of relations, complex questions have more kinds of answer types, such as numeric or boolean types, and require more kinds of aggregation operations, such as min/max or intersection/union, to yield answers. Semantic parsing approaches typically map questions to intermediate logical forms such as query graphs (Yih et al., 2015; Bao et al., 2016; Bhutani et al., 2019; Maheshwari et al., 2019; Lan and Jiang, 2020; Qin et al., 2021), and further transform them into queries in languages such as SPARQL. Recently, many works (Liang et al., 2017; Saha et al., 2019; Ansari et al., 2019; Hua et al., 2020a,b,c) predefine a collection of functions with constrained argument types and represent the intermediate logical form as a sequence of actions that can be generated by a seq2seq model. Sequence-based methods can naturally accomplish more complex operations by simply expanding the function set, thus making some logically complex questions answerable that are difficult to answer with query graphs.

The seq2seq model has been widely used and has achieved good results on many text generation tasks, such as machine translation, text summarization and style transfer. In these tasks, the source and target sequences are both natural language texts and thus share some low-level features. However, semantic parsing aims to transform unstructured texts into structured logical forms, which requires a difficult alignment between them. This problem becomes more serious as the complexity of the question rises. Some works propose to solve this problem by modelling the hierarchical structure of logical forms. Dong and Lapata (2016) introduce a sequence-to-tree model with an attention mechanism. Dong and Lapata (2018) propose to first decode a sketch of the logical form containing a set of functions and then decode low-level details such as arguments. Guo et al. (2021) iteratively segment a span from the question with a segmentation model and parse it with a base parser until the whole query is parsed. Li et al. (2021) use a shift-reduce algorithm to obtain token sequences instead of predicting the start and end positions of each span. However, most of these works require intermediate logical forms or sub-questions to train models, which are usually difficult to obtain. Guo et al. (2021) and Li et al. (2021) propose to first pretrain a base parser and then search for segments whose predicted sub-logical forms are part of, or can be composed into, the gold meaning representation.
Figure 1: An overview of the proposed approach. The question is first converted into a more structured form, then
multiple candidate action sequences are generated by the seq2seq model, and finally the candidate action sequences
are scored based on similar question-answer pairs.

These approaches do not necessarily require training pairs, but they have the limitation that the decomposed utterances must be contiguous segments of the original question.

In this paper, we propose a novel framework to strengthen the alignment between unstructured text and structured logical forms. We decompose the semantic parsing task into three stages: question rewriting, candidate action sequence generation, and action sequence selection. In the question rewriting stage, we utilize a question rewriting model to explicitly transform a query into a set of utterances, each corresponding to a single action, thus reducing the complexity of the question. We propose a two-phase training method to train the rewriting model in the absence of training pairs. In the candidate generation stage, we build a seq2seq model to generate logical forms with a beam search algorithm and treat KG artifacts such as entities as candidate vocabulary in the decoding stage. To further align the question and the action sequence, we concatenate a question and a KG artifact as input and encode them with a pretrained language model (PLM) such as BERT (Devlin et al., 2018). The cross-attention mechanism of the PLM can effectively align the question and KG artifacts implicitly, which makes decoding easier. Moreover, we innovatively propose to improve complex knowledge base question answering via question-to-question alignment. Motivated by the observation that the more similar two questions are, the more similar their corresponding action sequences will be, we build a memory consisting of question-answer pairs and, during the action sequence selection phase, retrieve a set of question-answer pairs as a support set based on their similarity to the current question. We then propose a reward-guided selection strategy that scores each candidate action sequence according to the support set.

Our main contributions are as follows:

• We propose a novel framework that mitigates the gap between natural language questions and structured logical forms through question-to-action alignment and question-to-question alignment.

• We propose a novel question rewriting mechanism that rewrites a question into a more structured form without requiring a dataset or adding any constraints, and employ a reward-guided action sequence selection strategy that utilizes similar question-answer pairs to score candidate action sequences.

• We conduct experiments on several datasets, and the results show that our approach is comparable to the state of the art on the WQSP dataset and obtains a 9.88% improvement in the F1 metric on the CQA dataset.

2 Methodology

2.1 Overview
In this task, given a training set $T = \{(q_1, a_1), ..., (q_s, a_s)\}$, where $(q_i, a_i)$ is a question-answer pair, the objective is to transform complex questions into logical forms, which can be further compiled into KG queries that retrieve the answers. We define a logical form as a sequence of actions, each consisting of a function and multiple arguments. Following NS-CQA (Hua et al., 2020c), we design 16 functions whose arguments are numerical values and KG artifacts, including entities, relations, and entity types; these arguments are recognized in a preprocessing step. Denoting the input question as $q$, the set of predefined functions as $F$, the question-related numerical values and KG artifacts as the argument set $G$, and the model parameters as $\theta$, our goal is to maximize the probability $P(L \mid q; \theta)$, where $L$ is the action sequence that produces the correct answers and every token of $L$ belongs to $F$ or $G$.

As shown in Figure 1, our framework consists of three stages: question rewriting, candidate action sequence generation, and action sequence selection. In the first stage, we rewrite a complex query into a more structured form with a seq2seq model; the training of this model is described in Section 2.2. The rewritten query is then combined with the original question as input, and a second seq2seq model generates multiple candidate action sequences. Finally, we retrieve the $k$ question-answer pairs most similar to the current question from a pre-constructed memory; the candidates are modified according to the KG artifacts in these $k$ questions and scored by comparing their execution results with the respective answers.

2.2 Question Rewriting

An action sequence consists of multiple consecutive actions, and it is difficult for a seq2seq model to decide which part of the question to focus on when generating each action. We therefore train a question rewriting model that transforms a query into a set of utterances concatenated by the symbol "#", where each utterance corresponds to a single action. With the rewritten question, the model can focus on a specific part of the question when generating each action, which reduces the difficulty of decoding.

Training the rewriting model requires an adequate corpus, which is difficult to obtain. In the absence of gold data, we propose a two-phase approach that converts queries into rewritten questions and uses them to train the rewriting model, as shown in Module 1.

Module 1: Question Rewriting Training
  Input: T = {(q1, a1), ..., (qn, an)}
  Output: Mr, the trained question rewriting model
   1: search pseudo action sequences and obtain T' = {(q1, a1, L1), ..., (qn, an, Ln)},
      where Li = {f1; f2; ...; fk} is the pseudo action sequence of (qi, ai)
   2: train Mq, which translates action sequences into questions, on T'
   3: Q <- {}
   4: for (qi, Li) in T' do
   5:     q_ori <- qi
   6:     for j = k down to 1 do
   7:         L' <- {f1; f2; ...; f_{j-1}}
   8:         q_del <- Translate(L', Mq)
   9:         q'_ij <- Compare(q_ori, q_del)
  10:         q_ori <- q_del
  11:     end for
  12:     Q <- Q ∪ {(qi, {q'_i1; q'_i2; ...; q'_ik})}
  13: end for
  14: train Mr on Q

In the first phase (lines 1-2), we employ a breadth-first search algorithm to find pseudo action sequences for some questions, and then train a seq2seq model that translates an action sequence into a query. In the second phase (lines 3-13), we construct a training corpus for question rewriting from the searched question-logical form pairs and the model trained in the first phase. Specifically, given an action sequence L = {f1; f2; ...; fk}, we delete the last action fk, back-translate the shortened action sequence into a new query, and compare it with the original question. The tokens that appear in the original question but not in the currently generated question are exactly the ones the model should focus on when generating the deleted action. For example, the left part of Figure 1 illustrates the decomposition of the question "how many musical instruments can lesser number of people perform with than glockenspiel". We first delete the last action "Count()"; the seq2seq model then translates the shortened sequence "SelectAll(...)LessThan(...)" into the query "which musical instruments can lesser number of people perform with than glockenspiel". The words "how many" should receive more attention because they do not appear in the generated question.
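The second phase of Module 1 is, in essence, a loop of delete, back-translate and compare operations over each pseudo action sequence. The sketch below illustrates this loop in Python; the helper names and the token-level set comparison are our simplifying assumptions, not the released implementation.

```python
def compare(q_ori, q_del):
    """Return the tokens of q_ori that are missing from q_del.

    Under the paper's assumption, these tokens express the deleted action.
    """
    del_tokens = set(q_del.split())
    return " ".join(t for t in q_ori.split() if t not in del_tokens)

def build_rewriting_corpus(pairs, translate):
    """pairs: [(question, [f1, ..., fk])] with BFS-found pseudo action sequences.
    translate: the back-translation model Mq, mapping an action list to a question."""
    corpus = []
    for question, actions in pairs:
        q_ori, utterances = question, []
        for j in range(len(actions), 0, -1):
            q_del = translate(actions[:j - 1])   # back-translate the shortened sequence
            utterances.append(compare(q_ori, q_del))
            q_ori = q_del                        # compare against the shorter question next
        # utterances were produced last-action-first; restore action order, join with "#"
        corpus.append((question, " # ".join(reversed(utterances))))
    return corpus
```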
We iteratively perform the delete, back-translate and compare operations until the action sequence is empty, and concatenate the comparison results of all steps with the symbol "#".

In this way, we construct the question rewriting dataset Q and train the question rewriting model Mr on it. To make the rewriting model learn to output KG artifacts in the rewritten query, we concatenate the original question and its KG artifacts as input, wrapping each KG artifact in markers such as <entity> and </entity>. We initialize the models of both phases with BART (Lewis et al., 2020), an outstanding pretrained seq2seq model that performs well on a wide range of generation tasks, and finetune them on the constructed datasets.

2.3 Encoder-decoder Architecture

We use BERT and a BiLSTM (Hochreiter and Schmidhuber, 1997) to construct the encoder. Given a question $q$ with $n$ tokens and the argument set $G = \{g_1, ..., g_m\}$, where $m$ is the size of the argument set with respect to $q$ and $g_i = \{g_{i1}, ..., g_{il}\}$ is a KG artifact or numerical value with $l$ tokens, we concatenate the question and each argument separately, with [SEP] as the delimiter, to construct the BERT input sequences. We thereby obtain the question embedding $E_q \in \mathbb{R}^{n \times d_e}$, and the argument embedding $g_i \in \mathbb{R}^{d_e}$ by mean pooling over $E_{g_i}$. We then stack the argument embeddings into a matrix $E_G \in \mathbb{R}^{m \times d_e}$ and feed $E_q$ into a BiLSTM encoder to obtain the final question representation $H \in \mathbb{R}^{n \times d_h}$:

$$E = \mathrm{BERT}(\{[\mathrm{CLS}], q, [\mathrm{SEP}], g_i, [\mathrm{SEP}]\})$$
$$H = \mathrm{BiLSTM}(E_q)$$
$$g_i = \mathrm{MeanPooling}(E_{g_i}) \tag{1}$$
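As an illustration of Eq. (1), here is a minimal PyTorch/transformers sketch of the encoder. The span-splitting indices, hidden sizes, and the shortcut of building H from the last joint encoding are our assumptions, not the authors' code.

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")
# bidirectional hidden size 2 x 150 = 300, matching the size reported in Appendix B
bilstm = torch.nn.LSTM(768, 150, bidirectional=True, batch_first=True)

def encode(question, arguments):
    arg_embs, E_q = [], None
    for g in arguments:
        enc = tokenizer(question, g, return_tensors="pt")  # [CLS] q [SEP] g_i [SEP]
        E = bert(**enc).last_hidden_state[0]               # (seq_len, 768)
        n = len(tokenizer.tokenize(question))
        E_q, E_g = E[1:n + 1], E[n + 2:-1]                 # question span / argument span
        arg_embs.append(E_g.mean(dim=0))                   # MeanPooling over E_{g_i}
    H, _ = bilstm(E_q.unsqueeze(0))                        # (1, n, 300) question repr. H
    return H[0], torch.stack(arg_embs)                     # H (n x d_h), E_G (m x d_e)
```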

Decoding is implemented with an LSTM. At each time step, the current hidden state $s_t \in \mathbb{R}^{d_s}$ is updated from the hidden state and output of the previous time step:

$$s_t = \mathrm{LSTM}([o_{t-1}; \tau_{t-1}; c_t],\; s_{t-1})$$
$$c_t = \sum_i \alpha_{ti} h_i$$
$$\alpha_t = \mathrm{Softmax}(e_t)$$
$$e_t = s_{t-1} W_a H^T \tag{2}$$

where $[;]$ denotes vector concatenation. $o_{t-1}$ is the embedding of the previous output, taken from the learnable embedding matrix $W_{func}$ if the output is a function, or from $E_G$ if it is an argument. $\tau_{t-1}$ is a vector taken from the learnable embedding matrix $W_{type}$ according to the type of the last output. $c_t$ is the context vector obtained by the attention-weighted summation of $h_i$, the $i$-th row of the question representation $H$. $W_a \in \mathbb{R}^{d_s \times d_h}$ is a projection matrix.

We then calculate the vocabulary distribution from the hidden state $s_t$. Our vocabulary consists of two parts: a fixed vocabulary containing the collection of predefined functions, and a dynamic vocabulary consisting of the arguments, i.e., the numerical values and KG artifacts related to the question. We feed $s_t$ through a linear layer $W_o$ and a softmax to compute the probability of each word in the fixed vocabulary. To obtain the probabilities of the words in the dynamic vocabulary, we project $s_t$ to the argument embedding dimension through the projection matrix $W_p \in \mathbb{R}^{d_s \times d_e}$ and compute the similarity with each word by a dot product:

$$P_{fix} = \mathrm{Softmax}(W_o s_t)$$
$$P_{dyn} = \mathrm{Softmax}(s_t W_p E_G^T) \tag{3}$$

Next, we calculate the probability $P_t$ of generating from the fixed vocabulary at the current time step through a linear layer followed by a sigmoid activation, and combine the two vocabulary distributions based on $P_t$. Note that if $w$ is a word in the fixed vocabulary, then $P_{dyn}(w)$ is zero; similarly, $P_{fix}(w)$ is zero when $w$ is in the dynamic vocabulary:

$$P(w) = P_t\, P_{fix}(w) + (1 - P_t)\, P_{dyn}(w)$$
$$P_t = \sigma(W_f c_t) \tag{4}$$
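The two distributions of Eq. (3) and the gate of Eq. (4) combine in a few lines; the following sketch (shapes and parameter names are illustrative assumptions) returns one distribution over the concatenated [fixed; dynamic] vocabulary:

```python
import torch

def decode_step(s_t, c_t, E_G, W_o, W_p, w_f):
    # s_t: (d_s,) hidden state; c_t: (d_s,) context vector
    # E_G: (m, d_e) argument embeddings; W_o: (n_func, d_s); W_p: (d_s, d_e); w_f: (d_s,)
    p_fix = torch.softmax(W_o @ s_t, dim=-1)            # Eq. (3), fixed function vocabulary
    p_dyn = torch.softmax(E_G @ (s_t @ W_p), dim=-1)    # Eq. (3), dot product with arguments
    p_t = torch.sigmoid(w_f @ c_t)                      # Eq. (4) gate: generate a function?
    return torch.cat([p_t * p_fix, (1 - p_t) * p_dyn])  # P(w) over [fixed; dynamic] words
```

The first n_func entries of the returned vector cover the predefined functions and the remaining m entries the question's arguments, so the zero-probability convention below Eq. (4) holds by construction.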
2.4 Reward-guided Action Sequence Selection Strategy

To improve accuracy, we generate multiple candidate action sequences with a beam search algorithm and design a reward-guided action sequence selection strategy. In general, the more similar the structure and semantics of two questions are, the more similar their corresponding action sequences will be. We therefore propose using similar questions to help select the correct action sequence. Specifically, we build a memory consisting of the question-answer pairs in the training set; note that we do not require gold logical forms for these questions.

To retrieve similar questions with answers from the memory, we use edit distance to calculate the similarity between two questions. To improve generalization across questions, we replace the entity mentions, type mentions and numerical values in the questions with the symbols [ENTITY], [TYPE] and [CONSTANT], respectively. We do not mask relations because relation mentions are always hard to recognize. In addition, antonym pairs such as atmost/atleast and less/greater can give questions with similar contexts exactly opposite semantics, so we construct a set of antonym pairs and set the similarity to 0 whenever an antonym pair occurs across the two questions. We retrieve the $k$ question-answer pairs with the highest similarity to form the support set $S = \{(q_1, a_1, d_1), ..., (q_k, a_k, d_k)\}$, where $d_i$ is the similarity computed by edit distance.
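A small sketch of this masked similarity (mention spans are assumed to have been recognized in the preprocessing step, and `SequenceMatcher.ratio` stands in for a normalized edit-distance similarity):

```python
from difflib import SequenceMatcher

ANTONYMS = [("atmost", "atleast"), ("less", "greater")]  # partial list from the paper

def mask(question, mentions):
    # mentions maps a surface span to its symbol, e.g. {"glockenspiel": "[ENTITY]"}
    for span, symbol in mentions.items():
        question = question.replace(span, symbol)
    return question

def similarity(q1, q2):
    for a, b in ANTONYMS:
        if (a in q1 and b in q2) or (b in q1 and a in q2):
            return 0.0  # antonym pair: opposite semantics despite similar context
    return SequenceMatcher(None, q1, q2).ratio()
```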
Figure 2: An example of adjusting candidate action sequences. The upper and lower parts of (a) are the original question and a question in the support set, respectively. We first obtain a relation-masked action sequence (the second line of (b)) based on the alignment of entities and types between the two questions, as shown in (a), and then output multiple action sequences according to all possible combinations of relations. (Panel (a) aligns the [TYPE] and [ENTITY] mentions of "Which families are House of Shishman a part of or did Clara Maria of Pomerania belong to ?" with those of the support question "Which geographic locations are Asia a part of or are Bell situated in ?"; panel (b) shows the corresponding Select(...)/Union(...) sequences with masked relation slots.)

We then propose a reward-guided action sequence selection strategy that scores each candidate action sequence according to its fitness to the retrieved support set. Specifically, given a candidate $A_i$ and an item $(q_j, a_j, d_j)$ in the support set, we adjust the arguments in $A_i$ to the arguments of $q_j$ according to their positions in the text, as illustrated in Figure 2, and, lacking gold action sequences, score the result by computing the F1 between $a_j$ and the execution results of the modified sequences. Because the positions of the relations are unknown, we enumerate all possible orders of relations and generate multiple modified action sequences. We then take the highest F1 as the score of $(q_j, a_j, d_j)$ for $A_i$ and denote it $r_i^j$. The overall score of $A_i$ is calculated as

$$s_i = \frac{\sum_{j=1}^{k} d_j\, r_i^j}{\sum_{j=1}^{k} d_j} \tag{5}$$

where $\sum_{j=1}^{k} d_j$ is a normalization term. At inference, we take the candidate action sequence with the highest score as the output sequence.
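In code, the overall score of Eq. (5) is a similarity-weighted average; a minimal sketch follows (the argument adjustment and execution against the KG are assumed to happen elsewhere):

```python
def overall_score(rewards, sims):
    # rewards: [r_i^j], best F1 of the adjusted candidate against each support answer a_j
    # sims:    [d_j],   edit-distance similarities of the support questions
    return sum(d * r for d, r in zip(sims, rewards)) / sum(sims)
```

The candidate with the highest score is emitted as the final action sequence.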
2.5 Training

We use the REINFORCE algorithm (Williams, 1992) to train our model. We take the F1 score of the answers produced by the predicted action sequence, measured against the ground-truth answers, as the original reward. To improve training stability, we use the adaptive reward function of Hua et al. (2020c) to adjust the rewards. Moreover, we run a breadth-first search on a subset of the data to obtain pseudo action sequences and pretrain the model, preventing the cold-start problem.
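A hedged sketch of the resulting training signal; `execute` and `adapt` are placeholders for the KG executor and the NS-CQA adaptive reward function, which the paper reuses rather than redefines:

```python
def f1(pred, gold):
    pred, gold = set(pred), set(gold)
    if not pred or not gold:
        return 0.0
    p = len(pred & gold) / len(pred)
    r = len(pred & gold) / len(gold)
    return 2 * p * r / (p + r) if p + r else 0.0

def reinforce_loss(log_prob, actions, gold_answers, execute, adapt):
    # log_prob: summed token log-probabilities of the sampled action sequence
    reward = adapt(f1(execute(actions), gold_answers))  # adjusted F1 reward
    return -reward * log_prob                           # policy-gradient surrogate loss
```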
3 Experiments

3.1 Experimental Setup

Our method aims to solve various complex questions, and we mainly evaluate it on the ComplexQuestionAnswering (CQA) dataset (Saha et al., 2018), a large-scale KBQA dataset containing seven types of complex questions, as shown in Table 1; details and examples are given in Appendix A. We also conduct experiments on WebQuestionsSP (WQSP) (Yih et al., 2015), which contains 4737 simple questions, and the results show that our method also works well on simple datasets.

We use the standard F1 measure between the predicted entity set and the ground-truth answers as the evaluation metric. For the CQA categories whose answers are boolean values or numbers, we view the answers as single-value sets and compute the corresponding F1 scores. Training details and model parameters can be found in Appendix B.

3.2 Baselines

We compare our framework with seq2seq-based methods. KVmem (Saha et al., 2018) presents a model consisting of a hierarchical encoder and a key-value memory network. CIPITR (Saha et al., 2019) mitigates reward sparsity with auxiliary rewards and restricts the program space to semantically correct programs; it is trained in two ways, one training a single model for all question categories, denoted CIP-All, and the other training a separate model for each category, denoted CIP-Sep.
Question Category | KVmem | CIP-All | CIP-Sep | NSM | MRL-CQA | MARL | NS-CQA | Ours
Simple Question | 41.40% | 41.62% | 94.89% | 88.33% | 88.37% | 88.06% | 88.83% | 88.73%
Logical Reasoning | 37.56% | 21.31% | 85.33% | 81.20% | 80.27% | 79.43% | 81.23% | 88.73%
Quantitative Reasoning | 0.89% | 5.65% | 33.27% | 41.89% | 45.06% | 49.93% | 56.28% | 76.30%
Comparative Reasoning | 1.63% | 1.67% | 9.60% | 64.06% | 62.09% | 64.10% | 65.87% | 83.09%
Verification (Boolean) | 27.28% | 30.86% | 61.39% | 60.38% | 85.62% | 85.83% | 84.66% | 88.18%
Quantitative (Count) | 17.80% | 37.23% | 48.40% | 61.84% | 62.00% | 60.89% | 76.96% | 80.41%
Comparative (Count) | 9.60% | 0.36% | 0.99% | 39.00% | 40.33% | 40.50% | 43.25% | 60.80%
Overall macro F1 | 19.45% | 19.82% | 47.70% | 62.39% | 66.25% | 66.96% | 71.01% | 80.89%
Overall micro F1 | 31.18% | 31.52% | 73.31% | 76.01% | 77.71% | 77.71% | 80.80% | 85.31%

Table 1: The overall performances on the CQA dataset. The best results for each category are bolded in the original and the second-best results are underlined.

NSM (Liang et al., 2017) utilizes a key-variable memory to handle compositionality and helps find good programs by pruning the search space. MRL-CQA (Hua et al., 2020a) and MARL (Hua et al., 2020b) propose meta-reinforcement learning approaches that effectively adapt a meta-learned programmer to new questions in order to tackle potential distributional biases; the former uses an unsupervised retrieval model, while the latter learns the retriever alternately with the programmer from weak supervision. NS-CQA (Hua et al., 2020c) presents a memory buffer that stores high-reward programs and proposes an adaptive reward function to improve training performance. SSRP (Ansari et al., 2019) presents a noise-resilient model that is distantly supervised by the final answer. CBR-KBQA (Das et al., 2021) generates complex logical forms conditioned on similar retrieved questions and their logical forms in order to generalize to unseen relations.

We also compare our method with graph-based methods on the WQSP dataset. STAGG (Yih et al., 2015) proposes a staged query graph generation framework and leverages the knowledge base at an early stage to prune the search space. TEXTRAY (Bhutani et al., 2019) answers complex questions using a novel decompose-execute-join approach. QGG (Lan and Jiang, 2020) modifies STAGG with more flexible ways of handling constraints and multi-hop relations. OQGG (Qin et al., 2021) starts with the entire knowledge base and gradually shrinks it to the desired query graph.

3.3 Overall Performances

The overall performances of our proposed framework against the KBQA baselines are shown in Tables 1 and 2. Our framework significantly outperforms the state-of-the-art model on the CQA dataset while staying competitive on the WQSP dataset. On the CQA dataset, our method achieves the best overall performance, 80.89% macro F1 and 85.31% micro F1, improvements of 9.88% and 4.51%, respectively. Moreover, our method achieves the best result on six of the seven question categories. On Logical Reasoning and Verification (Boolean), which are relatively simple, our model obtains improvements of 3.40% and 2.35% in macro F1, respectively. On Quantitative Reasoning, Comparative Reasoning, Quantitative (Count) and Comparative (Count), whose questions are complex and hard to parse, our model obtains considerable improvements: the macro F1 scores increase by 20.02%, 17.22%, 3.45% and 17.55%, respectively. Our method does not outperform CIP-Sep on Simple Question, for which CIP-Sep trains a dedicated model, but it still achieves a result comparable to the second-best baseline. On the WQSP dataset, our method outperforms all the sequence-based methods and stays competitive with the best graph-based method. The gain here is small because most questions in this dataset are one-hop and simple, whereas our framework aims to handle various question categories. We do not compare with graph-based methods on the CQA dataset because they always start from a topic entity and interact with the KG to add relations to the query graph step by step, which cannot solve most of the question types in this dataset, such as Quantitative Reasoning and Comparative Reasoning.

The experimental results demonstrate the ability of our method to parse complex questions and generate correct action sequences. The main improvement of the proposed method comes from two aspects.
On the one hand, we employ a rewriting model to decompose a complex question into several utterances, allowing the decoder to focus on a shorter span when decoding each action. On the other hand, we make full use of existing question-answer pairs and determine the structure of action sequences indirectly through the alignment between question-question pairs.

Method | F1
NSM | 69.0%
SSRP | 72.6%
NS-CQA | 72.0%
CBR-KBQA† | 72.8%
STAGG | 66.8%
TEXTRAY | 60.3%
QGG | 74.0%
OQGG | 66.0%
Ours | 73.6%

Table 2: The overall performances on the WQSP dataset. † denotes supervised training.

3.4 Ablation Studies

We conduct a series of ablation studies on the CQA dataset to demonstrate the effectiveness of the main modules in our framework. To explore the impact of the question rewriting module, we remove it and use only the original question as input to the seq2seq model; the performance drops by 1.79% in macro F1, as shown in Table 3. To prove the effectiveness of the action sequence selection module, we generate candidate action sequences with beam search and directly use the sequence with the highest probability as the output instead of selecting through the module; the macro F1 drops by 2.49%. To verify that the cross-attention mechanism in BERT aligns the question and KG artifacts and thereby improves generation, we encode the question and KG artifacts separately and find that the performance drops by 0.97%. These results show that every main module of our framework plays an important role in the performance improvement.

Settings | macro F1 | micro F1
Full Model | 80.89% | 85.31%
w/o question rewriting | 79.10% | 84.15%
w/o candidates selection | 78.40% | 83.55%
w/o cross-attention | 79.92% | 84.63%

Table 3: Ablation studies on main components.

To explore the impact of different underlying embeddings, we conduct experiments with two settings: initializing an embedding matrix randomly, and encoding with BERT. In the first setting we finetune the embedding matrix during training, while in the second we freeze the parameters of the BERT model. As shown in Table 4, BERT embedding achieves the best result, improving macro F1 by 4.40% over random embedding. This is reasonable because BERT is pretrained on a large corpus to represent rich semantics and uses a cross-attention mechanism to better align the question and KG artifacts. Note that our method still outperforms state-of-the-art methods without BERT.

Settings | macro F1 | micro F1
Random Embedding | 76.49% | 81.63%
BERT Embedding | 80.89% | 85.31%

Table 4: Ablation studies for different underlying embeddings.

To investigate the effect of the number of candidate action sequences and the size of the support set on action sequence selection, we conduct experiments and plot the results in Figure 3. The macro F1 score first increases with the size of the support set, whatever the number of candidates. The trend then gradually slows, and the macro F1 score peaks when the size is about 6; as the support set grows further, the score decreases slightly. This is mainly caused by the simple and rough method we use to calculate question similarity, under which the assumption that similar questions have similar action sequence structures does not always hold. A moderate number of similar questions alleviates this problem and improves performance, but past a certain point the newly added questions become less similar to the original question and instead introduce noise. In addition, increasing the number of generated candidates also improves performance; if the number is too high, the boost becomes less apparent or even negative because of the lower quality of the newly added candidates.

Figure 3: Trends of macro F1 as the size of the support set increases.

3.5 Case study

We show some examples to illustrate the ability of our modules. Table 5 shows a complex question of category Quantitative (Count).
Question: how many works of art feature approximately 5 fictional taxons or people
Rewritten Question: which works of art contain which fictional taxon # and which common name # approximately 5 people # how many
w/o Module: SelectAll(fictional taxon, present in work, work of art) SelectAll(common name, present in work, work of art) AtLeast(5) Count()
w/ Module: SelectAll(fictional taxon, present in work, work of art) SelectAll(common name, present in work, work of art) Around(5) Count()

Table 5: Test case on the question rewriting module.

We can observe that the model wrongly predicts the third action in the absence of the rewriting module but generates it correctly with the help of the rewritten utterances. This is reasonable because the seq2seq model learns to focus on "approximately 5 people" when predicting the third action. Table 6 shows a query of category Verification (Boolean).

Question: Does Janko Kroner have location of birth at Peraia, Pella and Povazska Bystrica ?
w/o Module: Select(Janko Kroner, place of birth, administrative territorial entity) # Bool(Povazska Bystrica) # Bool(Povazska Bystrica) -- Prob: 0.6085, Selection Score: 0.7333
w/ Module: Select(Janko Kroner, place of birth, administrative territorial entity) # Bool(Peraia, Pella) # Bool(Povazska Bystrica) -- Prob: 0.3519, Selection Score: 1.0000

Table 6: Test case on the action sequence selection module.

Here it is confusing for the model to decide which entity to output, and the correct action sequence is assigned a lower probability. However, it is much easier to choose correctly through the action sequence selection module: the wrong logical form produces an incorrect result in the majority of cases and thus receives a lower selection score, as shown.

4 Related Work

Semantic parsing is the task of translating natural language utterances into executable meaning representations. Recent semantic parsing based KBQA methods can be categorized as graph-based (Yih et al., 2015; Bao et al., 2016; Bhutani et al., 2019; Lan and Jiang, 2020; Qin et al., 2021) and sequence-based (Liang et al., 2017; Saha et al., 2019; Ansari et al., 2019; Hua et al., 2020a,b,c; Das et al., 2021). Graph-based methods build a query graph, a graph-like logical form proposed by Yih et al. (2015). Bao et al. (2016) proposed multi-constraint query graphs to improve performance. Ding et al. (2019) and Bhutani et al. (2019) decomposed complex query graphs into sets of simple queries to overcome the long-tail problem. Lan and Jiang (2020) employed early incorporation of constraints to prune the search space. Chen et al. (2021) leveraged the query structure to constrain the generation of candidate queries. Qin et al. (2021) generated query graphs by shrinking the entire knowledge base. Sequence-based methods define a set of functions and utilize a seq2seq model to generate action sequences. Liang et al. (2017) augmented the standard seq2seq model with a key-variable memory to save and reuse intermediate execution results. Saha et al. (2019) mitigated reward sparsity with auxiliary rewards. Ansari et al. (2019) learned program induction despite heavy noise in the query annotation. Hua et al. (2020a,b) employed meta-learning to quickly adapt the programmer to unseen questions. Hua et al. (2020c) proposed an adaptive reward function to control the exploration-exploitation trade-off in reinforcement learning.

Compared to graph-based methods, sequence-based methods can generate logical forms directly with a seq2seq model, which is easier to implement and can handle more question categories by simply expanding the set of action functions. However, the semantic and structural gap between natural language utterances and action sequences leads to poor translation performance.

5 Conclusion

In this paper, we propose an alignment-enhanced complex question answering framework, which reduces the semantic and structural gap between question and action sequence through question-to-action
and question-to-question alignment. We train a question rewriting model to align the question and sub-action sequences in the absence of training data, and employ a pretrained language model to implicitly align the question and action arguments. Moreover, we utilize similar questions to help select the correct action sequence from multiple candidates. Experiments show that our framework achieves state-of-the-art results on the CQA dataset and performs well on various complex question categories. In future work, we will consider how to better align questions with logical forms.

Limitations

In our method, we view KG artifacts as tokens and generate logical forms with a seq2seq model, which can handle more types of complex questions, e.g., superlative questions without topic entities. However, for single- and multi-hop questions, graph-based methods may perform better: they start from a topic entity and interact with the KG to add relations to the query graph step by step, which prunes the search space more effectively. Moreover, we control the vocabulary size through entity and relation recognition, which makes the preprocessing step more complex.

Acknowledgement

This work is supported by the National Key Research and Development Project of China (No. 2018AAA0101900), the Key Research and Development Program of Zhejiang Province, China (No. 2021C01013), CKCEST, and the MOE Engineering Research Center of Digital Library.

References

Ghulam Ahmed Ansari, Amrita Saha, Vishwajeet Kumar, Mohan Bhambhani, Karthik Sankaranarayanan, and Soumen Chakrabarti. 2019. Neural program induction for KBQA without gold programs or query annotations. In IJCAI, pages 4890–4896.

Junwei Bao, Nan Duan, Zhao Yan, Ming Zhou, and Tiejun Zhao. 2016. Constraint-based question answering with knowledge graph. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 2503–2514.

Nikita Bhutani, Xinyi Zheng, and H. V. Jagadish. 2019. Learning to answer complex questions over knowledge bases with query composition. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, pages 739–748.

Yongrui Chen, Huiying Li, Yuncheng Hua, and Guilin Qi. 2021. Formal query building with query structure prediction for complex question answering over knowledge base. arXiv preprint arXiv:2109.03614.

Rajarshi Das, Manzil Zaheer, Dung Thai, Ameya Godbole, Ethan Perez, Jay Yoon Lee, Lizhen Tan, Lazaros Polymenakos, and Andrew McCallum. 2021. Case-based reasoning for natural language queries over knowledge bases. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 9594–9611.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Jiwei Ding, Wei Hu, Qixin Xu, and Yuzhong Qu. 2019. Leveraging frequent query substructures to generate formal queries for complex question answering. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2614–2622.

Li Dong and Mirella Lapata. 2016. Language to logical form with neural attention. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 33–43.

Li Dong and Mirella Lapata. 2018. Coarse-to-fine decoding for neural semantic parsing. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 731–742.

Yinuo Guo, Zeqi Lin, Jian-Guang Lou, and Dongmei Zhang. 2021. Iterative utterance segmentation for neural semantic parsing. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 12937–12945.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Yuncheng Hua, Yuan-Fang Li, Gholamreza Haffari, Guilin Qi, and Tongtong Wu. 2020a. Few-shot complex knowledge base question answering via meta reinforcement learning. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 5827–5837.

Yuncheng Hua, Yuan-Fang Li, Gholamreza Haffari, Guilin Qi, and Wei Wu. 2020b. Retrieve, program, repeat: Complex knowledge base question answering via alternate meta-learning. In IJCAI.

Yuncheng Hua, Yuan-Fang Li, Guilin Qi, Wei Wu, Jingyao Zhang, and Daiqing Qi. 2020c. Less is more: Data-efficient complex question answering over knowledge bases. Journal of Web Semantics, 65:100612.
Yunshi Lan and Jing Jiang. 2020. Query graph generation for answering multi-hop complex questions from knowledge bases. Association for Computational Linguistics.

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7871–7880.

Yuntao Li, Bei Chen, Qian Liu, Yan Gao, Jian-Guang Lou, Yan Zhang, and Dongmei Zhang. 2021. Keep the structure: A latent shift-reduce parser for semantic parsing. In IJCAI.

Chen Liang, Jonathan Berant, Quoc Le, Kenneth Forbus, and Ni Lao. 2017. Neural symbolic machines: Learning semantic parsers on Freebase with weak supervision. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 23–33.

Gaurav Maheshwari, Priyansh Trivedi, Denis Lukovnikov, Nilesh Chakraborty, Asja Fischer, and Jens Lehmann. 2019. Learning to rank query graphs for complex question answering over knowledge graphs. In International Semantic Web Conference, pages 487–504. Springer.

Kechen Qin, Cheng Li, Virgil Pavlu, and Javed Aslam. 2021. Improving query graph generation for complex question answering over knowledge base. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 4201–4207.

Amrita Saha, Ghulam Ahmed Ansari, Abhishek Laddha, Karthik Sankaranarayanan, and Soumen Chakrabarti. 2019. Complex program induction for querying knowledge bases in the absence of gold programs. Transactions of the Association for Computational Linguistics, 7:185–200.

Amrita Saha, Vardaan Pahuja, Mitesh M. Khapra, Karthik Sankaranarayanan, and Sarath Chandar. 2018. Complex sequential question answering: Towards learning to converse over linked question answer pairs with a knowledge graph. In Thirty-Second AAAI Conference on Artificial Intelligence.

Ronald J. Williams. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3):229–256.

Wen-tau Yih, Ming-Wei Chang, Xiaodong He, and Jianfeng Gao. 2015. Semantic parsing via staged query graph generation: Question answering with knowledge base. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1321–1331.

A CQA Dataset

The ComplexQuestionAnswering (CQA) dataset contains the subset of QA pairs from the Complex Sequential Question Answering (CSQA) dataset whose questions are answerable without the previous dialog context. There are 944K, 100K and 156K question-answer pairs in the training, validation and test sets, respectively. The dataset has seven types of complex questions, making it difficult for a model to answer correctly. We show examples of each question category in Table 7. For simple questions, the corresponding action sequence contains only one action; for some complex questions, the length of the action sequence may be up to 4.

B Training Details

To compare with previous works and reduce training time, we randomly select two small subsets (about 1% each) from the training set to train our models. We use a BFS algorithm to search pseudo action sequences for the first subset, which is used to train the question rewriting model as introduced in Section 2.2 and to pretrain the action sequence generation model. We use the second subset for the subsequent reinforcement learning of the action sequence generation model. We evaluate the trained model on the whole test set.

We initialize the two models of the question rewriting stage with the base version of BART and finetune them with the Adam optimizer at a learning rate of 1e-5. For the action sequence generation model, we adopt the uncased base version of BERT for the underlying embeddings and freeze its parameters to improve training stability. We set the dimension of the type embedding to 100 and the hidden sizes of the one-layer BiLSTM encoder and the LSTM decoder to 300. We train the model for 100 epochs in the pretraining stage and 50 epochs in the reinforcement learning stage, using Adam with learning rates of 1e-4 and 1e-5, respectively, and finally choose the checkpoint with the highest reward on the development set. We generate 5 candidate action sequences with a beam size of 10, and retrieve 3 questions with a similarity greater than the threshold of 0.6 as the support set. If no similar question meets this condition, we directly select the top action sequence generated by beam search as the output.
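For reference, the hyperparameters stated above can be collected into a single configuration; the dictionary below is our summary, not the authors' configuration file:

```python
CONFIG = {
    "rewriting": {"init": "bart-base", "optimizer": "adam", "lr": 1e-5},
    "generation": {
        "embeddings": "bert-base-uncased (frozen)",
        "type_embedding_dim": 100,
        "bilstm_hidden": 300, "lstm_hidden": 300,
        "pretraining": {"epochs": 100, "lr": 1e-4},
        "reinforce": {"epochs": 50, "lr": 1e-5},
    },
    "inference": {"beam_size": 10, "num_candidates": 5,
                  "support_set_k": 3, "similarity_threshold": 0.6},
}
```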
Question Category | Subtype | Example Question
Simple Question (599K) | Direct | Where did the expiration of Brian Hetherston occur ?
Logical Reasoning (138K) | Union | Which people were casted in Cab Number 13 or Hearts of Fire ?
Logical Reasoning (138K) | Intersection | Who have location of birth at Lourdes and the gender as male ?
Logical Reasoning (138K) | Difference | Which people are a native of Grenada but not United Kingdom ?
Verification (63K) | Boolean | Is United Kingdom headed by Jonas Spelveris and Georgius Sebastos ?
Quantitative Reasoning (118K) | Min/Max | Who had an influence on max number of bands and musical ensembles ?
Quantitative Reasoning (118K) | Atleast/Atmost | Which applications are manufactured by atleast 1 business organizations and business enterprises ?
Quantitative Reasoning (118K) | Exactly/Around n | Which films had their voice dubbing done by exactly 20 people ?
Comparative Reasoning (62K) | Less/More/Equal | Which positions preside the jurisdiction over more number of administrative territories and US administrative territories than Minister for Regional Development ?
Quantitative Reasoning (Count) (159K) | Direct | How many nucleic acid sequences encodes Dynein light chain 1, cytoplasmic ?
Quantitative Reasoning (Count) (159K) | Union | How many system software or operating systems are the computing platforms for which Street Fighter IV were specifically designed ?
Quantitative Reasoning (Count) (159K) | Intersection | How many people studied at Harvard University and Ecole nationale superieure des Beaux-Arts ?
Quantitative Reasoning (Count) (159K) | Exactly/Around n | How many musical instruments are played by atmost 7998 people ?
Comparative Reasoning (Count) (63K) | Less/More/Equal | How many administrative territories have less number of cities and mythological Greek characters as their toponym than Bagdad ?

Table 7: Examples of the various question types in the CQA dataset.
