
Retrieval, Re-ranking and Multi-task Learning for Knowledge-Base Question Answering

Zhiguo Wang, Patrick Ng, Ramesh Nallapati, Bing Xiang


AWS AI Labs
{zhiguow, patricng, rnallapa, bxiang}@amazon.com

Abstract

Question answering over knowledge bases (KBQA) usually involves three sub-tasks, namely topic entity detection, entity linking and relation detection. Due to the large number of entities and relations inside knowledge bases (KB), previous work usually utilized sophisticated rules to narrow down the search space and managed only a subset of KBs in memory. In this work, we leverage a retrieve-and-rerank framework to access KBs via a traditional information retrieval (IR) method, and re-rank retrieved candidates with more powerful neural networks such as the pre-trained BERT model. Considering the fact that directly assigning a different BERT model to each sub-task may incur prohibitive costs, we propose to share a BERT encoder across all three sub-tasks and define task-specific layers on top of the shared layer. The unified model is then trained under a multi-task learning framework. Experiments show that: (1) our IR-based retrieval method is able to collect high-quality candidates efficiently, thus enabling our method to adapt to large-scale KBs easily; (2) the BERT model improves the accuracy across all three sub-tasks; and (3) benefiting from multi-task learning, the unified model obtains further improvements with only 1/3 of the original parameters. Our final model achieves competitive results on the SimpleQuestions dataset and superior performance on the FreebaseQA dataset.

1 Introduction

Answering natural language questions by searching over large-scale knowledge bases (KBQA) is highly demanded by real-life applications, such as Google Assistant, Siri, and Alexa. Owing to the availability of large-scale KBs, significant advancements have been made over the years. One main research direction views KBQA as a semantic matching task (Bordes et al., 2014; Dong et al., 2015; Dai et al., 2016; Hao et al., 2017; Mohammed et al., 2018; Yu et al., 2018; Wu et al., 2019; Chen et al., 2019a; Petrochuk and Zettlemoyer, 2018), and finds a relation-chain within KBs that is most similar to the question in a common semantic space, where the relation-chain can be 1-hop, 2-hop or multi-hop (Chen et al., 2019b). Another research direction formulates KBQA as a semantic parsing task (Berant et al., 2013; Bao et al., 2016; Luo et al., 2018), and tackles questions that involve complex reasoning, such as ordinals (e.g. What is the second largest fulfillment center of Amazon?) and aggregation (e.g. How many fulfillment centers does Amazon have?). Most recently, some studies proposed to derive answers from both KBs and free-text corpora to deal with the low-coverage issue of KBs (Xu et al., 2016; Sun et al., 2018; Xiong et al., 2019; Sun et al., 2019). In this paper, we follow the first research direction, since relation-chain questions account for the vast majority of real-life questions (Berant et al., 2013; Bordes et al., 2015; Jiang et al., 2019).

Previous semantic matching methods for KBQA usually decompose the task into sequential sub-tasks consisting of topic entity detection, entity linking, and relation detection. For example, in Figure 1, given the question "Who wrote the book Beau Geste?", a KBQA system first identifies the topic entity "Beau Geste" from the question, then the topic entity is linked to an entity node (m.04wxy8) from a list of candidate nodes, and finally the relation book.written_work.author is selected as the relation-chain leading to the final answer. Previous methods usually worked on a subset of a KB in order to fit the KB into memory. For entity linking, sophisticated heuristics were commonly used to collect entity candidates. For relation detection, previous work usually enumerated all possible 1-hop and 2-hop relation-chains (starting from linked entity nodes) as candidates. All these workarounds may prevent their methods from generalizing well to other datasets and scaling up to bigger KBs.
[Figure 1: workflow diagram showing node candidates for entity linking and relation-chain candidates for relation detection.]

Figure 1: A typical workflow for KBQA. Given a question "Who wrote the book Beau Geste?", the topic entity detection model first identifies a topic entity "Beau Geste" from the question. Then, the entity linking model links the topic entity to an entity node (m.04wxy8) in the KB. Finally, the relation book.written_work.author is selected as the relation-chain leading to the final answer node (m.05f834).

To tackle these issues, we leverage a retrieve-and-rerank strategy to access KBs. In the retrieval step, we ingest KBs into two inverted indices: one that stores all entity nodes for entity linking, and the other that stores all subject-predicate-object triples for relation detection. Then, we use the TF-IDF algorithm to retrieve candidates for both the entity linking and relation detection sub-tasks. This method naturally overcomes the memory overhead when dealing with large-scale KBs, and therefore makes our method scale up easily to large-scale tasks. In the re-ranking step, we leverage the advanced BERT model to re-rank all candidates by fine-grained semantic matching. For the topic entity detection sub-task, we utilize another BERT model to predict the start and end positions of a topic entity within a question. Since assigning a different BERT model to each sub-task may incur prohibitive costs, we propose to share a BERT encoder across sub-tasks and define task-specific layers for each individual sub-task on top of the shared layer. This unified BERT model is then trained under the multi-task learning framework. Experiments on two standard benchmarks show that: (1) our IR-based retrieval method is able to collect high-quality candidates efficiently; (2) the BERT model improves the accuracy across all three sub-tasks; and (3) benefiting from multi-task learning, the unified model obtains further improvements with only 1/3 of the original parameters. Our final model achieves competitive results on the SimpleQuestions dataset and superior performance on the FreebaseQA dataset.

2 Task Definition

Knowledge-base question answering (KBQA) aims to find answers for natural language questions from structured knowledge bases (KB). We assume a KB K is a collection of subject-predicate-object triples ⟨e1, p, e2⟩, where e1, e2 ∈ E are entities, p ∈ P is a relation type between two entities, E is the set of all entities, and P is the set of all relation types. Given a question Q, the goal of KBQA is to find an entity node a ∈ E from the KB as the final answer, which can be formulated as

    â = argmax_{a ∈ E} Pr(a|Q, K)    (1)

where Pr(a|Q, K) is the probability of a being the answer for Q. A general-purpose KB usually contains millions of entities in E and billions of relations in K (Bollacker et al., 2008), so directly modeling Pr(a|Q, K) is challenging. Previous studies usually factorize this model in different ways. One line of research forms KBQA as a semantic parsing task Pr(q|Q, K) that parses a question Q directly into a logical-form query q, and executes the query q over the KB to derive the final answer. Another line of studies views KBQA as a semantic matching task, and finds a relation-chain within the KB that is similar to the question in a common semantic space. The trailing entity of the relation-chain is then taken as the final answer. Following this direction, we decompose the KBQA task into three stages: (1) identify a topic entity t from the question Q, where t is a sub-string of Q; (2) link the topic entity t to a topic node e ∈ E in the KB; and (3) detect a relation-chain r ∈ K starting from the topic node e, where r can be a 1-hop, 2-hop or multi-hop relation-chain within the KB. Correspondingly, we factorize the model as

    Pr(a|Q, K) = Pr(t, e, r|Q, K)
               = Pt(t|Q, K) · Pl(e|t, Q, K) · Pr(r|e, t, Q, K)    (2)

where Pt(t|Q, K) is the model for topic entity detection, Pl(e|t, Q, K) models the entity linking process, and Pr(r|e, t, Q, K) is the component for the relation detection stage. We will discuss how to parameterize these components in Section 4.
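To make the factorization in Equation (2) concrete, the sketch below shows how the three component scores could be composed at inference time. It is a minimal illustration under assumed interfaces: the scoring functions and candidate generators stand in for the trained components described in Section 4 and are hypothetical placeholders, not part of the paper.

```python
def answer_question(question, score_topic, score_link, score_relation,
                    candidate_spans, candidate_nodes, candidate_chains):
    """Return the relation-chain maximizing the factorized score of
    Equation (2); the final answer is the trailing entity of that chain."""
    best_score, best_chain = float("-inf"), None
    for t in candidate_spans(question):           # topic entity spans in Q
        p_t = score_topic(question, t)            # Pt(t|Q, K)
        for e in candidate_nodes(t):              # candidate KB nodes for t
            p_l = score_link(question, t, e)      # Pl(e|t, Q, K)
            for r in candidate_chains(e):         # relation-chains from e
                p_r = score_relation(question, t, e, r)  # Pr(r|e, t, Q, K)
                score = p_t * p_l * p_r
                if score > best_score:
                    best_score, best_chain = score, r
    return best_chain
```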
3 Background

We briefly introduce some background required by the following sections.

BERT: The BERT model (Devlin et al., 2019) follows the multi-head self-attention architecture (Vaswani et al., 2017), and is pre-trained with a masked language modeling objective on a large-scale text corpus. It has achieved state-of-the-art performance on a wide range of textual tasks. Specifically, for semantic matching tasks, BERT simply concatenates two textual sequences together, and encodes the new sequence with multiple self-attention layers. Then, the output vector of the first token is fed into a linear layer to compute the similarity score between the two input textual sequences.

Freebase: We take Freebase (Bollacker et al., 2008) as our back-end KB to answer questions. It contains more than 46 million topic entities and 2.6 billion triples. Each entity has an internal machine identifier (MID) and a set of aliases. Some entities also have properties such as entity types and detailed descriptions. Freebase contains a special entity category called Compound Value Type (CVT), which does not have a name or alias, and is only used to collect multiple fields of an event or a special relationship. In the official Freebase dump (https://developers.google.com/freebase), all facts are formulated as unified subject-predicate-object triples, and there is no explicit split between entities and relations. We partition the facts in Freebase into a set of entities E and a set of relations K by following the pre-processing steps in Chah (2017).

Inverted Index and TF-IDF: An inverted index is an optimized data structure for finding the documents (within a large document collection) in which a query word X occurs. It is commonly used for fast free-text search. Term Frequency-Inverse Document Frequency (TF-IDF) is a ranking function usually used together with an inverted index to estimate the relevance of documents to a given search query (Schütze et al., 2008).

4 Retrieval and Re-ranking for KBQA

In this section, we describe how to parameterize Pt, Pl and Pr in Equation (2).

4.1 Topic Entity Detection Model Pt

The goal of a topic entity detection model Pt(t|Q, K) is to identify a topic entity t that the question Q is asking about, where t is usually a sub-string of Q. Previous approaches for this task can be categorized into two types: (1) rule-based and (2) sequence labeling. The rule-based approaches take all entity names and their aliases from a KB as a gazetteer, and n-grams of the question that exactly match an entry in the gazetteer are taken as topic entities (Yih et al., 2015; Yao, 2015; He and Golub, 2016; Yu et al., 2017). The advantage of this method is that no machine learning models need to be involved. However, the drawbacks include: (1) topic entities need to have exactly the same surface strings as they occur in the KB, and (2) memory-efficient data structures need to be designed to load the massive gazetteer into memory (Yao, 2015). Other approaches leverage a sequence labeling model to tag consecutive tokens in the question Q as topic entities (Dai et al., 2016; Bordes et al., 2015; Mohammed et al., 2018; Wu et al., 2019). This approach is able to predict more precise topic entities, thus pruning some unimportant matched entities.

Inspired by the Start/End prediction method commonly utilized for machine reading comprehension tasks (Wang and Jiang, 2016; Seo et al., 2016), we cast the topic entity detection task as predicting the start and end positions of the topic entity t in the question Q. Formally, we denote ts and te as the start and end positions of a topic entity t, and assume this process is independent of the KB. Thus the model can be further decomposed as Pt(t|Q, K) = Ps(ts|Q) · Pe(te|Q), where Ps(ts|Q) and Pe(te|Q) are the probabilities of ts and te being the start and end positions. This formulation directly models the goal of the topic entity detection task, i.e. finding the best topic entity within a question, and therefore can give a more precise estimation.

We leverage the advanced BERT model to parameterize Ps(ts|Q) and Pe(te|Q). Concretely, we first leverage the BERT encoder to encode the input question Q, then apply two independent linear layers (each with one output neuron) on top of BERT's output for each token. The start/end scores are normalized across all tokens with the softmax function to estimate the probability of each token position being the start/end of the topic entity.
4.2 Entity Linking Model Pl

The purpose of an entity linking model Pl(e|t, Q, K) is to link the recognized topic entity t to an entity node e ∈ E in the KB. A general-purpose KB usually contains millions of nodes in E, which makes it almost impossible to search over the full space. Previous methods usually narrow down the search space based on heuristic rules. For example, Yih et al. (2015) and Wu et al. (2019) used keyword search to collect all nodes that have one alias exactly matching the topic entity, and Yin et al. (2016) collected all nodes that have at least one word overlapping with the topic entity. Once a smaller set of candidates is selected, complicated neural networks can be utilized to compute the similarity between a candidate node and the topic entity in the question context.

Inspired by the recent success of question answering over free-text corpora (Chen et al., 2017; Wang et al., 2018, 2019), we propose a retrieve-and-rerank method to solve the entity linking task in two steps. In the first retrieval step, we create an inverted index for all entity nodes, where each node is represented by all tokens from its aliases and description. Then, we use the topic entity t as a query to retrieve the top-K candidate nodes from the index with the TF-IDF algorithm. (Mohammed et al. (2018) also created an inverted index for all nodes; however, they generated n-grams of each entity name as separate entries, and looked up exactly matched n-gram candidates by keyword searching.) A similar method is also used by Vakulenko et al. (2019) and Nedelchev et al. (2020). This information retrieval (IR) method improves over previous work in the following ways. First, our method can find candidate nodes even if a topic entity does not have an exactly matched entity node. Second, we do not have to maintain all entity nodes inside CPU memory, and can still query candidates efficiently, which enables our method to be easily adapted to large-scale KBs. Third, the relative importance of various matched words is naturally considered by the TF-IDF algorithm.

In the second re-ranking step, we leverage a BERT model to compute the similarity between each candidate node and the topic entity in the given question context. Concretely, we represent each pair of a topic entity t and a candidate node e as a sequence of tokens in the format "[CLS] topic entity [SEP] question pattern [SEP] node name [SEP] node types [SEP] node description [SEP]", where topic entity is the string of the topic entity t, question pattern is the question string with t removed, node name, node types and node description are the name, types and description of the candidate node e, and [SEP] is the delimiter used by the BERT model. We encode this sequence with the BERT model, then feed the hidden vector of the [CLS] token into a linear layer (with one output neuron) to compute the similarity score for each pair of t and e.
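As an illustration, the input sequence for this re-ranker could be assembled as below. The node fields and the tokenizer call are assumptions for the sketch (standard Hugging Face API); the authors' exact serialization pipeline is not published here.

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

def build_linking_input(topic_entity, question, node):
    """Serialize a (topic entity, candidate node) pair in the format of
    Section 4.2; `node` is assumed to carry name, types and description."""
    question_pattern = question.replace(topic_entity, "").strip()
    text = " [SEP] ".join([
        topic_entity,                 # topic entity string t
        question_pattern,             # question with t removed
        node["name"],                 # node name
        " ".join(node["types"]),      # node types
        node["description"],          # node description
    ])
    # The tokenizer adds the leading [CLS] and the trailing [SEP].
    return tokenizer(text, truncation=True, max_length=256, return_tensors="pt")

# Example with the candidate node m.04wxy8 from Figure 1 (fields illustrative):
node = {"name": "Beau Geste", "types": ["book"],
        "description": "adventure novel by P. C. Wren"}
inputs = build_linking_input("Beau Geste", "Who wrote the book Beau Geste?", node)
```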
4.3 Relation Detection Model Pr

The relation detection model Pr(r|e, t, Q, K) traverses relation-chains starting from a linked topic node e, and attempts to detect the correct relation-chain r that answers the question Q. Previous work usually enumerates all possible 1-hop and 2-hop relation-chains starting from a linked topic node e, and leverages deep neural networks to compute the semantic similarity between each candidate relation-chain r and the question Q (Bordes et al., 2014; Yih et al., 2015; Dong et al., 2015; Yu et al., 2017; Wu et al., 2019). In real KBQA systems, a list of linked nodes from the entity linking step is usually considered in order to retain high recall. If we enumerate all relation-chains for all these linked topic nodes, we end up with a large collection of candidate relation-chains. Furthermore, re-ranking so many candidate relation-chains adds considerable run-time latency, especially when a heavy model such as BERT is utilized.

To address this issue, we use the retrieve-and-rerank method for the relation detection task as well, and deal with this task in two stages similar to the entity linking task. In the first retrieval step, we create an inverted index for all subject-predicate-object triples, where each triple is represented by all tokens from the name of the subject entity, the name of the predicate, and the types of the object entity. Then, we use the question Q as a query to retrieve the top-K 1-hop relation-chains, with the constraint that all subject nodes come from the list of linked entity nodes.
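A constrained retrieval query of this shape could be issued with the Elasticsearch Python client as sketched below. The index name, field names, and document layout are assumptions for illustration, not the system's actual schema.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

def retrieve_relation_chains(question, linked_node_mids, k=100):
    """Retrieve the top-k 1-hop relation-chains whose subject node is one
    of the linked entity nodes (hypothetical index/field names)."""
    response = es.search(
        index="kb_triples",   # one document per subject-predicate-object triple
        query={
            "bool": {
                # BM25 match of the question against the triple text
                # (subject name + predicate name + object types).
                "must": {"match": {"triple_text": question}},
                # Hard constraint: the subject must be a linked node.
                "filter": {"terms": {"subject_mid": linked_node_mids}},
            }
        },
        size=k,
    )
    return [hit["_source"] for hit in response["hits"]["hits"]]
```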
If 2-hop relation-chains are required by a target dataset, we perform the same querying step again, but with the constraint list being all object entities from the first retrieval step. We acknowledge that this second retrieval step does not consider the semantics already covered in the first retrieval step. Since the main goal of the retrieval step is to collect a list of high-quality candidates, we perform better semantic matching in the re-ranking step with more powerful neural networks. If multi-hop relation-chains are needed, we can iterate this process until reaching the maximum number of steps. Usually, the maximum number of hops is pre-computed on the target question sets. Another option is to utilize a model to decide when to stop (Chen et al., 2019b); however, we leave this option for future work.

After collecting a list of relation-chains, we leverage another BERT model to compute the similarity between a question Q and each relation-chain r. Each pair of Q and r is represented as a sequence of tokens in the format "[CLS] question [SEP] topic-entity name [SEP] relation chain [SEP] answer name [SEP] answer types [SEP]", where topic-entity name is the name of the linked entity node, relation chain is the word sequence of a candidate relation-chain (a relation-chain is split into a word sequence based on delimiters such as periods, hyphens and underscores), answer name is the name of the trailing node in the relation-chain, and answer types are all types of the trailing node. The hidden vector of the [CLS] token is fed into a linear layer (with one output neuron) to predict the similarity between Q and r.

5 Multi-Task Learning for KBQA

5.1 Training Objectives

For the topic entity detection model, we define the objective function as the cross-entropy loss between the true and predicted distributions. We sum the cross-entropy losses of both the start and end models, and average over all N training instances:

    L(θt) = −(1/N) Σ_{i=1}^{N} [ log(Ps_i) + log(Pe_i) ]    (3)

where θt denotes the trainable parameters of the topic entity detection model, and Ps_i and Pe_i are the predicted probabilities of the gold start and end positions for the i-th instance.

Both the entity linking and relation detection tasks are ranking tasks; therefore we leverage a hinge loss function for both:

    L(θ) = (1/N) Σ_{i=1}^{N} max(0, l + s(Q, c−) − s(Q, c+))    (4)

where θ denotes the trainable parameters, l is a margin, s(Q, c) can be the model Pl or Pr, c+ is a correct candidate, and c− is an incorrect candidate. We set l = 1.0 in this work.
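In PyTorch, the two objectives could be realized as follows; this is a minimal sketch assuming the model outputs described above, not the authors' training code.

```python
import torch

def topic_entity_loss(p_start, p_end, gold_start, gold_end):
    """Equation (3): negative log-probability of the gold start/end
    positions, summed per instance and averaged over the batch."""
    idx = torch.arange(p_start.size(0))
    nll = -(torch.log(p_start[idx, gold_start]) + torch.log(p_end[idx, gold_end]))
    return nll.mean()

def ranking_hinge_loss(pos_scores, neg_scores, margin=1.0):
    """Equation (4): push each correct candidate's score above an
    incorrect candidate's score by at least the margin l."""
    return torch.clamp(margin + neg_scores - pos_scores, min=0.0).mean()
```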
5.2 Multi-Task Learning

A naive approach would be to use three different BERT encoders for the topic entity detection, entity linking and relation detection sub-tasks individually. Since BERT is a very large model, it is expensive to host three BERT models in real applications. To address this, we propose to share a BERT encoder across all three sub-tasks, and define lean layers for each individual sub-task on top of the shared layer. This unified model is then trained under the multi-task learning framework proposed by Liu et al. (2019). First, training instances for each sub-task are packed into mini-batches separately. At the beginning of each training epoch, mini-batches from all three sub-tasks are mixed together and randomly shuffled. During training, a mini-batch is selected, and the model is updated according to the task-specific objective for the selected mini-batch.
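The mixing scheme can be sketched as follows, assuming per-task mini-batches and a `task_loss` helper that routes a batch through the shared encoder and the selected task's head; both are hypothetical names, not the released implementation.

```python
import random

def train_one_epoch(model, optimizer, task_batches):
    """task_batches maps a task name ('topic', 'link', 'relation') to the
    mini-batches packed separately for that task."""
    # Mix mini-batches from all three sub-tasks and shuffle them.
    mixed = [(task, batch)
             for task, batches in task_batches.items()
             for batch in batches]
    random.shuffle(mixed)
    for task, batch in mixed:
        optimizer.zero_grad()
        # The shared BERT encoder runs for every batch; only the selected
        # task's lightweight head and objective drive this update.
        loss = model.task_loss(task, batch)   # hypothetical helper
        loss.backward()
        optimizer.step()
```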
6 Experiments

We evaluate the effectiveness of our model on standard benchmarks in this section. We first conduct experiments on each sub-task with a separate BERT model in Sections 6.2, 6.3 and 6.4, then evaluate the influence of sharing a BERT encoder across all three models in Section 6.5. Finally, we benchmark our method on the full Freebase in Section 6.6.

6.1 Datasets and Basic Settings

We evaluate our proposed model on two large-scale benchmarks: SimpleQuestions and FreebaseQA. Other existing datasets, such as WebQuestions (Berant et al., 2013), Free917 (Cai and Yates, 2013) and WebQSP (Yih et al., 2016), are not considered, because they contain only a few thousand questions, which is even less than the number of relation types in Freebase.
SimpleQuestions: The SimpleQuestions dataset (Bordes et al., 2015) is so far the largest KBQA dataset. It consists of 108,442 English questions written by human annotators, and all questions can be answered by 1-hop relation-chains in Freebase. Each question is annotated with a gold-standard subject-relation-object triple from Freebase. We follow the official train/dev/test split. To compare fairly with previous work, we use the released FB2M subset of Freebase as the back-end KB for this dataset. FB2M includes 2M entities and 5K relation types between these entities.

FreebaseQA: The FreebaseQA dataset (Jiang et al., 2019) is a large-scale dataset with 28K unique open-domain factoid questions, collected from the TriviaQA dataset (Joshi et al., 2017) and other trivia websites. Each question can be answered by a 1-hop or 2-hop relation-chain from Freebase. All questions have been matched to subject-predicate-object triples in Freebase, and verified by human annotators. Compared with other KBQA datasets, FreebaseQA provides more linguistically sophisticated questions, because all questions were created independently of Freebase. FreebaseQA also released a new subset of Freebase, which includes 16M unique entities and 182M triples. We follow the official train/dev/test split, and take this Freebase subset as the back-end KB for this dataset.

Basic Settings: We leverage the pre-trained BERT-base model with default hyper-parameters in our experiments. We create inverted indices for topic nodes and relations with Elasticsearch (https://www.elastic.co/products/elasticsearch), and utilize the BM25 algorithm (a variant of TF-IDF) to query the inverted indices.
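For instance, the entity index could be created along the following lines with the Elasticsearch Python client; Elasticsearch scores text matches with BM25 by default. The index name, mapping, and sample document are illustrative assumptions.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# One document per entity node, searchable by alias and description text.
es.indices.create(
    index="kb_entities",
    mappings={"properties": {
        "mid":         {"type": "keyword"},
        "aliases":     {"type": "text"},     # BM25-scored full-text field
        "description": {"type": "text"},
        "popularity":  {"type": "integer"},  # number of out-going triples
    }},
)

es.index(index="kb_entities", document={
    "mid": "m.04wxy8",
    "aliases": "Beau Geste",
    "description": "adventure novel by P. C. Wren",
    "popularity": 42,  # placeholder value
})
```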
6.2 Topic Entity Detection Experiments

In order to train and evaluate our topic entity detection model, we annotate the ground-truth topic entity for each question with the following steps. First, for each question, all alias names for the annotated topic entity MID are collected from Freebase. Then, we match each alias against the question string. If more than one alias occurs in the question string, the longest matched string is annotated as the ground-truth. Otherwise, the span with the minimum edit distance to an alias is selected as the ground-truth.
the ground-truth.
All candidates are sorted based on their popularity,
We implement a BERT-based sequence labeling
i.e. the number of out-going triples. Table 2 lists
model as a baseline for our Start/End prediction
the results of the baseline as well as our IR-based
model described in Section 4.1. The baseline model
method proposed in Section 4.2. Our IR-based
follows the same architecture for the named en-
method gets better results than the Keyword base-
4
https://www.elastic.co/products/elasticsearch line on both datasets. The main reason is that our
Re-ranking step: We feed the top-100 candidate nodes from the retrieval step into our entity linking model Pl to re-rank all candidates. Table 3 shows results on the SimpleQuestions dataset. The first group of numbers in Table 3 are results from previous state-of-the-art models. We can see that our entity linking model Pl outperforms previous models in terms of Top-1 accuracy, and achieves competitive results in terms of Top-10 and Top-20 accuracy. Table 4 lists the results of our model and previous work on the FreebaseQA dataset. Our entity linking model Pl improves accuracy over previous work (Wu et al., 2019) by a large margin. Since the top-5 predicted topic entities are used for the FreebaseQA dataset, we create another ranker that multiplies together the scores from the topic entity detection model and the entity linking model, and list the results in the row Pt Pl of Table 4. (Accuracy of the Pt Pl model is not given in Table 3, because only the best (top-1) topic entity is used for retrieving entity candidates in the SimpleQuestions dataset.) The Pt Pl ranker gets even better Top-1 accuracy than our entity linking model Pl alone, which verifies that our factorization in Equation (2) is reasonable.

    Models              Top-1   Top-10   Top-20
    Yin et al. (2016)    72.7    86.9     88.4
    Yu et al. (2017)     79.0    89.5     90.9
    Qiu et al. (2018)    81.1    91.7     93.4
    Wu et al. (2019)     82.2    92.5     93.6
    Pl                   84.2    92.1     93.1
    Multi-task Pl        84.3    92.1     93.1
    Full Freebase        79.0    88.9     90.3

    Table 3: Entity linking results on SimpleQuestions.

    Models              Top-1   Top-3   Top-5
    Wu et al. (2019)     52.4    79.6    85.7
    Pl                   69.4    84.8    86.6
    Pt Pl                71.9    84.6    86.3
    Multi-task Pl        68.1    84.2    85.8
    Multi-task Pt Pl     71.7    84.7    86.4
    Full Freebase        68.1    81.6    83.8

    Table 4: Entity linking results on FreebaseQA.

6.4 Relation Detection Experiments

We retrieve a list of relation-chain candidates for each question as follows. For questions in the training sets, we use the correct entity node as the starting point to search for the top-100 candidates. For questions in the dev and testing sets, we use the top-N entity nodes predicted by our entity linking model as starting points to retrieve the top-100 candidates. For candidates with the same subject and relation type, we sort them based on the popularity of the trailing object node (number of in-coming triples), and only keep the top-4 relation-chains in the final list. Based on the results on the dev set, we set N=30 for the SimpleQuestions dataset and N=10 for the FreebaseQA dataset. For the SimpleQuestions dataset, since all questions can be answered with 1-hop relation-chains, we only retrieve 1-hop candidates. For the FreebaseQA dataset, following the method in Jiang et al. (2019), we only expand 1-hop relation-chain candidates into 2-hop candidates if the object node of a 1-hop relation-chain is a CVT node. For the SimpleQuestions dataset, a prediction is correct if both the subject and relation are correctly retrieved. For the FreebaseQA dataset, a prediction is correct if the final answer matches the ground-truth answer.

Retrieval step: We implement a baseline that collects all relation-chains starting from the entity nodes, and sorts all relation-chains based on their popularity, i.e. the number of in-coming triples of the trailing object. Retrieval results from the baseline are listed in the "All" columns of Table 5. The results from our IR-based method (proposed in Section 4.3) are shown in the "IR" columns of Table 5. The last row "Rel/Q" in Table 5 gives the average number of relation-chains per question.
                SimpleQuestions       FreebaseQA
                 All      IR         All      IR
    Top-1       16.5     52.8        0.3     10.9
    Top-5       53.5     80.8        1.4     20.3
    Top-10      65.6     86.1        3.4     26.6
    Top-50      81.9     91.7       22.8     49.8
    Top-100     87.6     92.5       31.9     62.6
    Rel/Q        772      100       3021      100

    Table 5: Qualitative analysis of relation-chain candidates in the retrieval step, where "Rel/Q" is the average number of relation-chains per question.

Comparing the "IR" columns with the "All" columns, our IR-based method retrieves far fewer relation-chains but maintains better recall.

Re-ranking step: We feed the top-100 relation-chain candidates from the retrieval step into our relation detection model Pr to re-rank all candidates. Table 6 shows the results from previous state-of-the-art models as well as our relation detection model Pr. We can see that our Pr model obtains very competitive results on the SimpleQuestions dataset, and outperforms previous models by a large margin on the FreebaseQA dataset. We also create a model Pt Pl Pr that multiplies the scores from our topic entity detection, entity linking and relation detection models. By considering the influence of all three components, our Pt Pl Pr model achieves even better accuracy on the FreebaseQA dataset.

    Models                 SimpleQ.   FreebaseQA
    Dai et al. (2016)        75.7        N/A
    Yin et al. (2016)        76.4        N/A
    Yu et al. (2017)         77.0        N/A
    Wu et al. (2019)         77.3        37.0
    Hao et al. (2018)        80.2        N/A
    Petrochuk (2018)         78.1        N/A
    Pr                       79.4        45.4
    Pt Pl Pr                 79.4        49.1
    Multi-task Pr            79.7        47.9
    Multi-task Pt Pl Pr      79.7        51.7
    Full Freebase            74.1        35.4

    Table 6: Relation detection accuracy in the end-to-end manner.

6.5 Multi-task Learning Experiments

Our method achieves very strong performance by leveraging three separate BERT encoders, one for each model component. In this section, we share one BERT encoder across all three models, and jointly train the unified model with the multi-task learning method described in Section 5.2. Experimental results from this model are shown in the rows with the prefix "Multi-task" in Tables 1, 3, 4, and 6. Although the multi-task model has only about 1/3 of the original parameters, it is able to achieve better end-to-end accuracy in Table 6, and retains similar performance as before on the other two sub-tasks.

6.6 KBQA over Full Freebase

Most previous studies conducted KBQA experiments with a subset of Freebase, because it is hard to fit the full Freebase into memory (Bordes et al., 2014; Dong et al., 2015). Our method ingests Freebase into inverted indices on hard-disk storage, and thus naturally overcomes the memory overhead. This advantage enables us to evaluate our method on the full Freebase. The last rows of Tables 3, 4, and 6 show the results of running our "Multi-task" model over the full Freebase. Significant degradations are observed in the entity linking and relation detection tasks on both datasets. This phenomenon reveals that previous studies may have overestimated the capacity of their KBQA models. We suggest that researchers evaluate their models on the full Freebase in the future.

7 Conclusion

In this work, we proposed a retrieve-and-rerank strategy to access large-scale KBs in two steps. First, we leveraged traditional IR methods to collect high-quality candidates from KBs for entity linking and relation detection. Second, we utilized the advanced BERT model to re-rank candidates by fine-grained semantic matching. We also employed a BERT model to predict the start and end positions of the topic entity in a question. To reduce the model size, we proposed a joint model that shares a BERT encoder across all three sub-tasks, with task-specific layers created on top. We then trained this joint model with multi-task learning. Experimental results show that our method achieves superior results on standard benchmarks, and is able to scale up to large-scale KBs.
References

Junwei Bao, Nan Duan, Zhao Yan, Ming Zhou, and Tiejun Zhao. 2016. Constraint-based question answering with knowledge graph. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 2503-2514, Osaka, Japan.

Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. 2013. Semantic parsing on Freebase from question-answer pairs. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1533-1544.

Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor. 2008. Freebase: a collaboratively created graph database for structuring human knowledge. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, pages 1247-1250. ACM.

Antoine Bordes, Sumit Chopra, and Jason Weston. 2014. Question answering with subgraph embeddings. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 615-620, Doha, Qatar.

Antoine Bordes, Nicolas Usunier, Sumit Chopra, and Jason Weston. 2015. Large-scale simple question answering with memory networks. arXiv preprint arXiv:1506.02075.

Qingqing Cai and Alexander Yates. 2013. Large-scale semantic parsing via schema matching and lexicon extension. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 423-433, Sofia, Bulgaria.

Niel Chah. 2017. Freebase-triples: A methodology for processing the Freebase data dumps. arXiv preprint arXiv:1712.08707.

Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. 2017. Reading Wikipedia to answer open-domain questions. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1870-1879, Vancouver, Canada.

Yu Chen, Lingfei Wu, and Mohammed J. Zaki. 2019a. Bidirectional attentive memory networks for question answering over knowledge bases. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2913-2923, Minneapolis, Minnesota.

Zi-Yuan Chen, Chih-Hung Chang, Yi-Pei Chen, Jijnasa Nayak, and Lun-Wei Ku. 2019b. UHop: An unrestricted-hop relation extraction framework for knowledge-based question answering. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 345-356, Minneapolis, Minnesota.

Zihang Dai, Lei Li, and Wei Xu. 2016. CFO: Conditional focused neural question answering with large-scale knowledge bases. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 800-810, Berlin, Germany.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171-4186, Minneapolis, Minnesota.

Li Dong, Furu Wei, Ming Zhou, and Ke Xu. 2015. Question answering over Freebase with multi-column convolutional neural networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 260-269, Beijing, China.

Yanchao Hao, Hao Liu, Shizhu He, Kang Liu, and Jun Zhao. 2018. Pattern-revising enhanced simple question answering over knowledge bases. In Proceedings of the 27th International Conference on Computational Linguistics, pages 3272-3282.

Yanchao Hao, Yuanzhe Zhang, Kang Liu, Shizhu He, Zhanyi Liu, Hua Wu, and Jun Zhao. 2017. An end-to-end model for question answering over knowledge base with cross-attention combining global knowledge. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 221-231, Vancouver, Canada.

Xiaodong He and David Golub. 2016. Character-level question answering with attention. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1598-1607, Austin, Texas.

Kelvin Jiang, Dekun Wu, and Hui Jiang. 2019. FreebaseQA: A new factoid QA data set matching trivia-style question-answer pairs with Freebase. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 318-323.

Mandar Joshi, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. 2017. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. arXiv preprint arXiv:1705.03551.

Xiaodong Liu, Pengcheng He, Weizhu Chen, and Jianfeng Gao. 2019. Improving multi-task deep neural networks via knowledge distillation for natural language understanding. arXiv preprint arXiv:1904.09482.

Kangqi Luo, Fengli Lin, Xusheng Luo, and Kenny Zhu. 2018. Knowledge base question answering via encoding of complex query graphs. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2185-2194, Brussels, Belgium.

Salman Mohammed, Peng Shi, and Jimmy Lin. 2018. Strong baselines for simple question answering over knowledge graphs with and without neural networks. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 291-296, New Orleans, Louisiana.

Rostislav Nedelchev, Debanjan Chaudhuri, Jens Lehmann, and Asja Fischer. 2020. End-to-end entity linking and disambiguation leveraging word and knowledge graph embeddings. arXiv preprint arXiv:2002.11143.

Michael Petrochuk and Luke Zettlemoyer. 2018. SimpleQuestions nearly solved: A new upperbound and baseline approach. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium.

Yunqi Qiu, Manling Li, Yuanzhuo Wang, Yantao Jia, and Xiaolong Jin. 2018. Hierarchical type constrained topic entity detection for knowledge base question answering. In Companion Proceedings of The Web Conference 2018, pages 35-36.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383-2392, Austin, Texas.

Hinrich Schütze, Christopher D. Manning, and Prabhakar Raghavan. 2008. Introduction to Information Retrieval. Cambridge University Press.

Minjoon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. 2016. Bidirectional attention flow for machine comprehension. arXiv preprint arXiv:1611.01603.

Haitian Sun, Tania Bedrax-Weiss, and William Cohen. 2019. PullNet: Open domain question answering with iterative retrieval on knowledge bases and text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2380-2390, Hong Kong, China.

Haitian Sun, Bhuwan Dhingra, Manzil Zaheer, Kathryn Mazaitis, Ruslan Salakhutdinov, and William Cohen. 2018. Open domain question answering using early fusion of knowledge bases and text. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4231-4242, Brussels, Belgium.

Svitlana Vakulenko, Javier David Fernandez Garcia, Axel Polleres, Maarten de Rijke, and Michael Cochez. 2019. Message passing for complex question answering over knowledge graphs. In CIKM, pages 1431-1440. ACM.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30, pages 5998-6008. Curran Associates, Inc.

Shuohang Wang and Jing Jiang. 2016. Machine comprehension using match-LSTM and answer pointer. arXiv preprint arXiv:1608.07905.

Shuohang Wang, Mo Yu, Xiaoxiao Guo, Zhiguo Wang, Tim Klinger, Wei Zhang, Shiyu Chang, Gerry Tesauro, Bowen Zhou, and Jing Jiang. 2018. R^3: Reinforced ranker-reader for open-domain question answering. In Thirty-Second AAAI Conference on Artificial Intelligence.

Zhiguo Wang, Patrick Ng, Xiaofei Ma, Ramesh Nallapati, and Bing Xiang. 2019. Multi-passage BERT: A globally normalized BERT model for open-domain question answering. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5877-5881, Hong Kong, China.

Dekun Wu, Nana Nosirova, Hui Jiang, and Mingbin Xu. 2019. A general FOFE-net framework for simple and effective question answering over knowledge bases. arXiv preprint arXiv:1903.12356.

Wenhan Xiong, Mo Yu, Shiyu Chang, Xiaoxiao Guo, and William Yang Wang. 2019. Improving question answering over incomplete KBs with knowledge-aware reader. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4258-4264, Florence, Italy.

Kun Xu, Siva Reddy, Yansong Feng, Songfang Huang, and Dongyan Zhao. 2016. Question answering on Freebase via relation extraction and textual evidence. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2326-2336, Berlin, Germany.

Xuchen Yao. 2015. Lean question answering over Freebase from scratch. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations, pages 66-70, Denver, Colorado.

Wen-tau Yih, Ming-Wei Chang, Xiaodong He, and Jianfeng Gao. 2015. Semantic parsing via staged query graph generation: Question answering with knowledge base. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1321-1331, Beijing, China.

Wen-tau Yih, Matthew Richardson, Chris Meek, Ming-Wei Chang, and Jina Suh. 2016. The value of semantic parse labeling for knowledge base question answering. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 201-206, Berlin, Germany.

Wenpeng Yin, Mo Yu, Bing Xiang, Bowen Zhou, and Hinrich Schütze. 2016. Simple question answering by attentive convolutional neural network. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 1746-1756, Osaka, Japan.

Mo Yu, Wenpeng Yin, Kazi Saidul Hasan, Cicero dos Santos, Bing Xiang, and Bowen Zhou. 2017. Improved neural relation detection for knowledge base question answering. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 571-581, Vancouver, Canada.

Yang Yu, Kazi Saidul Hasan, Mo Yu, Wei Zhang, and Zhiguo Wang. 2018. Knowledge base relation detection via multi-view matching. In European Conference on Advances in Databases and Information Systems, pages 286-294. Springer.