Retrieving and Reading - A Comprehensive Survey On Open-Domain Question Answering
Abstract—Open-domain Question Answering (OpenQA) is an important task in Natural Language Processing (NLP), which aims to
answer a question in the form of natural language based on large-scale unstructured documents. Recently, there has been a surge in
the amount of research literature on OpenQA, particularly on techniques that integrate with neural Machine Reading Comprehension
(MRC). While these research works have advanced performance to new heights on benchmark datasets, they have been rarely
covered in existing surveys on QA systems. In this work, we review the latest research trends in OpenQA, with particular attention to
systems that incorporate neural MRC techniques. Specifically, we begin with revisiting the origin and development of OpenQA systems.
We then introduce the modern OpenQA architecture named “Retriever-Reader” and analyze the various systems that follow this
architecture as well as the specific techniques adopted in each of the components. We then discuss key challenges to developing
OpenQA systems and offer an analysis of benchmarks that are commonly used. We hope our work would enable researchers to be
informed of the recent advancement and also the open challenges in OpenQA research, so as to stimulate further progress in this field.
Index Terms—Textual Question Answering, Open Domain Question Answering, Machine Reading Comprehension, Information
Retrieval, Natural Language Understanding, Information Extraction
1 INTRODUCTION
Question Answering (QA) aims to provide precise answers in response to users' questions in natural language. It is a long-standing task dating back to the 1960s [1]. Compared with a search engine, a QA system aims to present the final answer to a question directly instead of returning a list of relevant snippets or hyperlinks, thus offering better user-friendliness and efficiency. Nowadays many web search engines like Google and Bing have been evolving towards higher intelligence by incorporating QA techniques into their search functionalities [2]. Empowered with these techniques, search engines now have the ability to respond precisely to some types of questions, such as:

— Q: “When was Barack Obama born?”
— A: “4 August 1961”.

The whole QA landscape can roughly be divided into two parts, textual QA and Knowledge Base (KB)-QA, according to the type of information source from which answers are derived. Textual QA mines answers from unstructured text documents, while KB-QA extracts them from a predefined structured KB that is often manually constructed. Textual QA is generally more scalable than the latter, since most of the unstructured text resources it exploits to obtain answers are fairly common and easily accessible, such as Wikipedia [3], news articles [4] and science books [5], etc. Specifically, textual QA is studied under two task settings based on the availability of contextual information, i.e. Machine Reading Comprehension (MRC) and Open-domain QA (OpenQA). MRC, which originally took inspiration from language proficiency exams, aims to enable machines to read and comprehend specified context passage(s) for answering a given question. In comparison, OpenQA tries to answer a given question without any specified context. It usually requires the system to first search for the relevant documents as the context w.r.t. the given question, from either a local document repository or the World Wide Web (WWW), and then generate the answer, as illustrated in Fig. 1. OpenQA therefore enjoys a wider scope of application and is more in line with the real-world QA behavior of human beings, while MRC can be considered a step towards OpenQA [6]. In fact, building an OpenQA system that is capable of answering any input question is deemed the ultimate goal of QA research.

In the literature, OpenQA has been studied closely with research in Natural Language Processing (NLP), Information Retrieval (IR), and Information Extraction (IE) [7], [8], [9], [10]. Traditional OpenQA systems mostly follow a pipeline consisting of three stages, i.e. Question Analysis, Document Retrieval and Answer Extraction [6], [9], [11]. Given an input question in natural language, Question Analysis aims to reformulate the question to generate search queries for facilitating subsequent Document Retrieval, and to classify the question to obtain its expected answer type(s), which in turn guides Answer Extraction. In the Document Retrieval stage, the system searches for question-relevant documents or passages with the generated search queries, usually using existing IR techniques like TF-IDF and BM25, or specific techniques developed for Web search engines like Google.com and Bing.com. After that, in the Answer Extraction stage, the final answer is extracted from the related documents received from ...

• *Corresponding author: Wenqiang Lei
• Fengbin Zhu, Wenqiang Lei and Tat-Seng Chua are with National University of Singapore (NUS). E-mail: [email protected], [email protected], [email protected]
• Fengbin Zhu and Chao Wang are with 6ESTATES PTE LTD, Singapore. E-mail: [email protected]
• Jianming Zheng is with National University of Defense Technology, China. E-mail: [email protected]
• Soujanya Poria is with Singapore University of Technology and Design (SUTD). E-mail: [email protected]
Fig. 1: An illustration of OpenQA. Given a natural language question, the system infers the answer from a collection of
unstructured text documents.
... systems are required to return exact short answers to given questions, starting from TREC-11 held in 2002 [42].

The TREC campaign provides a local collection of documents as the information source for generating answers, but the popularity of the World Wide Web (WWW), and especially the increasing maturity of search engines, has inspired researchers to build Web-based OpenQA systems [40], [44], [45], [46] that obtain answers from online resources like Google.com and Ask.com using IR techniques. Web search engines are able to consistently and efficiently collect massive numbers of web pages, and are therefore capable of providing much more information to help find answers in response to user questions. In 2001, a QA system called MULDER [44] was designed to automatically answer open-domain factoid questions with a search engine (e.g., Google.com). It first translates the user's question into multiple search queries with several natural-language parsers and submits them to the search engine to retrieve relevant documents, and then employs an answer extraction component to extract the answer from the returned results. Following this pipeline, a well-known QA system, AskMSR [45], was developed, which mainly depends on data redundancy rather than sophisticated linguistic analysis of either questions or candidate answers. It first translates the user's question into queries using a set of predefined rewriting rules to gather relevant documents from search engines, and then adopts a series of n-gram based algorithms to mine, filter and select the best answer. For such OpenQA systems, search engines provide access to an ocean of information, significantly enlarging the possibility of finding precise answers to user questions. Nevertheless, such an ample information source also brings considerable noisy content that the QA system has to filter out.

... of expected answer types. A simple illustration of this stage is given in the leftmost grey box of Fig. 2.

In Query Formulation, linguistic techniques such as POS tagging [40], [44], stemming [40], parsing [44] and stop word removal [45], [48] are usually utilized to extract keywords for retrieval. However, the terms used in questions are often not the same as those appearing in the documents that contain the correct answers. This problem is called “term mismatch” and is a long-standing and critical issue in IR. To address this problem, query expansion [49], [50] and paraphrasing techniques [51], [52], [53], [54] are often employed to produce additional search words or phrases so as to retrieve more relevant documents.
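To make query expansion a little more concrete, the following is a minimal sketch that expands extracted keywords with WordNet synonyms via NLTK; the choice of WordNet and the simple keyword handling are illustrative assumptions rather than the exact techniques of [49], [50].

```python
# Naive synonym-based query expansion (assumes NLTK is installed and
# the WordNet data has been downloaded with nltk.download('wordnet')).
from nltk.corpus import wordnet

def expand_query(keywords, max_synsets=3):
    """Add a few WordNet synonyms per keyword to mitigate term mismatch."""
    expanded = set(keywords)
    for word in keywords:
        for synset in wordnet.synsets(word)[:max_synsets]:
            for lemma in synset.lemmas():
                expanded.add(lemma.name().replace("_", " "))
    return sorted(expanded)

print(expand_query(["born", "birthplace"]))
```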
Question Classification, the other module often adopted in the Question Analysis stage, aims to identify the type of the given question based on a set of question types (e.g., where, when, who, what) or a taxonomy [55], [56] manually defined by linguistic experts. After obtaining the type of the question, expected answer types can be easily determined using rule-based mapping methods [9]. For example, given the question “When was Barack Obama born?”, the answer type would be inferred as “Date” once the question type is known to be “When”. Identifying the question type provides a constraint on answer extraction and significantly reduces the difficulty of finding correct answers. Question Classification has attracted much interest in the literature [44], [55], [57], [58], [59]. For instance, [59] proposed to extract relevant words from a given question and then classify the question based on rules associating these words with concepts; [57] trained a series of question classifiers using various machine learning techniques such as Support Vector Machines (SVM), Nearest Neighbors and Decision Trees on top of the hierarchical taxonomy proposed by [55].
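As a toy illustration of this rule-based mapping from question words to expected answer types, the snippet below classifies a question by its leading wh-word and looks up the answer type; the keyword list and type table are invented for illustration and are far simpler than the taxonomies in [55], [56].

```python
# Minimal sketch: map a question to a coarse question type, then to an
# expected answer type, via hand-written rules.
QUESTION_TYPES = {"when": "When", "where": "Where", "who": "Who", "what": "What"}
ANSWER_TYPES = {"When": "Date", "Where": "Location", "Who": "Person", "What": "Other"}

def classify(question):
    first_word = question.lower().split()[0]
    q_type = QUESTION_TYPES.get(first_word, "Other")
    return q_type, ANSWER_TYPES.get(q_type, "Other")

print(classify("When was Barack Obama born?"))  # ('When', 'Date')
```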
... constraints posed by the question are only partially met, with precision sacrificed.
• Probabilistic Model: Probabilistic models provide a way of integrating probabilistic relationships between words into a model. Okapi BM25 [61] is a probabilistic model sensitive to term frequency and document length; it is one of the most empirically successful retrieval models and is widely used in current search engines.
• Language Model: Language models [62] are also very popular, among which the Query Likelihood Model [60] is the most widely adopted. It builds a probabilistic language model LM_d for each document d and ranks documents according to the probability P(q | LM_d) of the language model generating the given question q.
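To make the probabilistic-model bullet concrete, here is a minimal sketch of BM25 scoring over a toy corpus; the tokenizer, the parameter values k1 = 1.5 and b = 0.75, and the corpus itself are assumptions for illustration, not part of any particular OpenQA system.

```python
import math
from collections import Counter

# Toy corpus; in a real system these would be retrieved passages.
docs = [
    "barack obama was born in honolulu hawaii".split(),
    "the capital of france is paris".split(),
    "obama served as the 44th president of the united states".split(),
]

N = len(docs)
avg_len = sum(len(d) for d in docs) / N
df = Counter(term for d in docs for term in set(d))  # document frequency per term

def bm25_score(query, doc, k1=1.5, b=0.75):
    """Okapi BM25: rewards term frequency but saturates it, and
    penalizes documents that are longer than average."""
    tf = Counter(doc)
    score = 0.0
    for term in query:
        if term not in tf:
            continue
        idf = math.log(1 + (N - df[term] + 0.5) / (df[term] + 0.5))
        norm = tf[term] * (k1 + 1) / (tf[term] + k1 * (1 - b + b * len(doc) / avg_len))
        score += idf * norm
    return score

query = "when was barack obama born".split()
ranking = sorted(range(N), key=lambda i: bm25_score(query, docs[i]), reverse=True)
print(ranking)  # document 0 should rank first for this query
```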
In practice, the documents received often contain irrelevant ones, or the number of documents is so large that it overwhelms the capacity of the Answer Extraction model. To address these issues, post-processing of the retrieved documents is highly desirable. Widely used approaches to processing retrieved documents include document filtering, document re-ranking and document selection [9], etc. Document filtering is used to identify and remove noise w.r.t. a given question; document re-ranking further sorts the documents in descending order of the plausibility that they contain the correct answer; document selection chooses the top relevant documents. After post-processing, only the most relevant documents are retained and fed to the next stage to extract the final answer.

2.2.3 Answer Extraction
The ultimate goal of an OpenQA system is to successfully answer given questions, and the Answer Extraction stage is responsible for returning the most precise answer to the user. The performance of this stage is determined by the complexity of the question, the expected answer types from the Question Analysis stage, the retrieved documents from the Document Retrieval stage, as well as the extraction method adopted, etc. With so many influential factors, researchers need to take great care with, and place special importance on, this stage.

In traditional OpenQA systems, factoid questions and list questions [63] have been widely studied for a long time. Factoid questions (e.g., When, Where, Who...) are those whose answers are usually a single text span in the documents, such as an entity name, a word token or a noun phrase, while list questions are those whose answers are a set of factoids that appear in the same document or are aggregated from different documents. The answer type received from the Question Analysis stage plays a crucial role, especially for questions whose answers are named entities. Thus, early systems heavily rely on the Named Entity Recognition (NER) technique [40], [46], [64], since comparing the recognised entities with the answer type may easily yield the final answer. In [65], answer extraction is described as a unified process that first uncovers latent or hidden information from the question and the answer respectively, and then uses matching methods to detect answers, such as surface text pattern matching [66], [67], word or phrase matching [44], and syntactic structure matching [40], [48], [68].
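As a rough illustration of this entity-matching style of answer extraction, the sketch below tags candidate sentences with an off-the-shelf NER model and keeps entities whose label matches the expected answer type; the use of spaCy and the mapping from question types to entity labels are assumptions for illustration, not the method of any specific system cited above.

```python
# Assumes spaCy and its small English model are installed:
# pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")

# Hypothetical mapping from question types to spaCy entity labels.
EXPECTED_LABELS = {"When": ["DATE"], "Where": ["GPE", "LOC"], "Who": ["PERSON"]}

def extract_candidates(question_type, sentences):
    """Return entities in retrieved sentences whose NER label matches
    the expected answer type of the question."""
    labels = EXPECTED_LABELS.get(question_type, [])
    candidates = []
    for sent in sentences:
        for ent in nlp(sent).ents:
            if ent.label_ in labels:
                candidates.append(ent.text)
    return candidates

sentences = ["Barack Obama was born on August 4, 1961, in Honolulu, Hawaii."]
print(extract_candidates("When", sentences))  # e.g. ['August 4, 1961']
```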
In practice, the extracted answer sometimes needs to be validated before being presented to end-users, when the system is not confident enough in it. Moreover, in some cases multiple answer candidates may be produced for a question and one of them has to be selected. Answer validation is applied to solve such issues. One widely applied validation method is to adopt an extra information source, like a Web search engine, to validate the confidence of each candidate answer. The principle is that the system should return a sufficiently large number of documents that contain both the question and answer terms: the larger the number of such returned documents, the more likely the candidate is to be the correct answer. This principle has been investigated and demonstrated to be fairly effective, though simple [9].

2.3 Application of Deep Neural Networks in OpenQA
In the recent decade, deep learning techniques have also been successfully applied to OpenQA. In particular, deep learning has been used in almost every stage of an OpenQA system, and moreover, it enables OpenQA systems to be
end-to-end trainable. For Question Analysis, some works develop neural classifiers to determine the question types. For example, [13] and [14] respectively adopt a CNN-based and an LSTM-based model to classify the given questions, both achieving competitive results. For Document Retrieval, dense representation based methods [16], [29], [30], [35] have been proposed to address “term mismatch”, which is a long-standing problem that harms retrieval performance. Unlike traditional methods such as TF-IDF and BM25 that use sparse representations, deep retrieval methods learn to encode questions and documents into a latent vector space in which text semantics beyond term match can be measured. For example, [29] and [35] train their own encoders to encode each document and question independently into dense vectors, and the similarity score between them is computed using the inner product of their vectors. Sublinear Maximum Inner Product Search (MIPS) algorithms [69], [70], [71] are used to improve retrieval efficiency given a question, especially when the document repository is large-scale.
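The following sketch shows the core of such dense retrieval: encode the question and the documents independently, score by inner product, and take the top-k. The random "encoder" is purely an assumption for illustration; a real system would replace it with a trained neural encoder and an approximate MIPS index.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(texts, dim=128):
    # Placeholder encoder returning random unit vectors; in practice this
    # is a trained neural encoder (e.g., a BERT-based dual encoder).
    vecs = rng.normal(size=(len(texts), dim))
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

documents = ["passage one ...", "passage two ...", "passage three ..."]
doc_vecs = encode(documents)            # precomputed and cached offline
query_vec = encode(["when was barack obama born"])[0]

scores = doc_vecs @ query_vec           # inner-product similarity
top_k = np.argsort(-scores)[:2]         # exact search; large corpora use sublinear MIPS
print(top_k, scores[top_k])
```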
For Answer Extraction, a decisive stage for OpenQA systems to arrive at the final answer, neural models can also be applied. Extracting answers from relevant documents for a given question is essentially the task of Machine Reading Comprehension (MRC). In the past few years, with the emergence of large-scale datasets such as CNN/Daily Mail [18], MS MARCO [20], RACE [21] and SQuAD 2.0 [22], research on neural MRC has achieved remarkable progress [24], [25], [26], [27]. For example, BiDAF [24] represents the given document at different levels of granularity via a multi-stage hierarchical structure consisting of a character embedding layer, a word embedding layer, and a contextual embedding layer, and leverages a bidirectional attention flow mechanism to obtain a question-aware document representation without early summarization. QANet [26] adopts CNN and the self-attention mechanism [72] to model local interactions and global interactions respectively, and performs significantly faster than usual recurrent models.

Furthermore, applying deep learning enables OpenQA systems to be end-to-end trainable [15], [30], [37]. For example, [37] argue that it is sub-optimal to incorporate a standalone IR system in an OpenQA system, and they develop ORQA, which treats the document retrieval from the information source as a latent variable and trains the whole system only from question-answer string pairs based on BERT [27]. REALM [30] is a pre-trained language model that contains a knowledge retriever and a knowledge-augmented encoder. Both its retriever and encoder are differentiable neural networks, which are able to compute the gradient w.r.t. the model parameters and back-propagate it all the way through the network. Similar to other pre-trained language models, it also has two stages, i.e., pre-training and fine-tuning: in the pre-training stage, the model is trained in an unsupervised manner, using masked language modeling as the learning signal, while the parameters are fine-tuned using supervised examples in the fine-tuning stage.

In early OpenQA systems, the success of answering a question is highly dependent on the performance of Question Analysis, particularly Question Classification, which provides the expected answer types [47]. However, both the types and the taxonomies of questions are hand-crafted by linguists, which is non-optimal since it is impossible to cover all question types in reality, especially complicated ones. Furthermore, classification errors easily result in the failure of answer extraction, thus severely hurting the overall performance of the system. According to the experiments in [47], about 36.4% of errors in early OpenQA systems are caused by misclassification of question types. Neural models are able to automatically transform questions from natural language into representations that are more recognisable to machines. Moreover, neural MRC models provide an unprecedentedly powerful solution to Answer Extraction in OpenQA, largely removing the need to apply traditional linguistic analytic techniques to questions and bringing revolutions to OpenQA systems [3], [28], [29], [37]. The very first work to incorporate neural MRC models into an OpenQA system is DrQA [3], which evolved into the “Retriever-Reader” architecture. It combines a TF-IDF based IR technique with a neural MRC model to answer open-domain factoid questions over Wikipedia and achieves impressive performance. After [3], many works have been released [28], [30], [33], [34], [37], [73], [74], [75]. Nowadays, building OpenQA systems following the “Retriever-Reader” architecture is widely acknowledged as the most efficient and promising approach, and it is also the main focus of this paper.

3 MODERN OPENQA: RETRIEVING AND READING
In this section, we introduce the “Retriever-Reader” architecture of the OpenQA system, as illustrated in Fig. 3. Retriever is aimed at retrieving relevant documents w.r.t. a given question, and can be regarded as an IR system, while Reader aims at inferring the final answer from the received documents and is usually a neural MRC model. They are the two major components of a modern OpenQA system. In addition, some other auxiliary modules, marked with dashed lines in Fig. 3, can also be incorporated into an OpenQA system, including Document Post-processing, which filters and re-ranks retrieved documents in a fine-grained manner to select the most relevant ones, and Answer Post-processing, which determines the final answer among multiple answer candidates. The systems following this architecture can be classified into two groups, i.e. pipeline systems and end-to-end systems. In the following, we introduce each component with the respective approaches in the pipeline systems, followed by the end-to-end trainable ones. In Fig. 4 we provide a taxonomy of modern OpenQA systems to make our descriptions easier to follow.

3.1 Retriever
Retriever is usually regarded as an IR system, with the goal of retrieving related documents or passages that probably contain the correct answer given a natural language question, as well as ranking them in descending order according to their relevance. Broadly, current approaches to Retriever can be classified into three categories, i.e. Sparse Retriever, Dense Retriever, and Iterative Retriever, which will be detailed in the following.
Fig. 3: An illustration of the “Retriever-Reader” architecture of an OpenQA system. The modules marked with dashed lines are auxiliary.
Fig. 4: A taxonomy of modern OpenQA systems. Retriever: Sparse Retriever (TF-IDF, BM25, DrQA, BERTserini), Dense Retriever (DenSPI, ORQA, REALM, DPR, ColBERT, SPARTA), Iterative Retriever (GOLDEN Retriever, Multi-step Reasoner, Adaptive Retriever, Path Retriever, MUPPET, DDRQA, MDR, Graph Retriever, GAR). Document Post-processing: Supervised Learning (DS-QA, InferSent Re-ranker, Relation-Networks Re-ranker, Paragraph Ranker), Reinforcement Learning, Transfer Learning (Multi-Passage BERT Re-ranker). Reader: Extractive Reader (DrQA, Match-LSTM, BiDAF, S-Norm Reader, BERT Reader, Graph Reader), Generative Reader (BART Reader, T5 Reader). Answer Post-processing: Rule-based (Strength-based Re-Ranker, Coverage-based Re-Ranker), Learning-based (RankQA). Retriever-Reader (Retrieve-and-Read, ORQA, REALM, RAG). Retriever-free (GPT2, GPT3, T5, BART).
... such a method usually requires heavy computation, which is sometimes prohibitively expensive, making it hardly applicable to large-scale document collections.

Representation-interaction Retriever: In order to achieve both high accuracy and efficiency, some recent systems [17], [82], [83] combine representation-based and interaction-based methods. For instance, ColBERT-QA [17] develops its retriever based on ColBERT [84], which extends the dual-encoder architecture by performing a simple token-level interaction step over the question and document representations to calculate the similarity score. Akin to DPR [16], ColBERT-QA first encodes the question and document independently with two BERT encoders. Formally, given a question q and a document d, with corresponding token-level representations denoted as E_q (length n) and E_d (length m), the relevance score between them is computed as follows:

S_{q,d} = \sum_{i=1}^{n} \max_{j=1}^{m} E_{q_i} \cdot E_{d_j}^{T} .    (1)

That is, ColBERT first computes the score of each token embedding of the question against all those of the document, and then sums these maxima as the final relevance score between q and d. As another example, SPARTA [82] develops a neural ranker that calculates a token-level matching score using the dot product between a non-contextualized encoding (e.g., BERT word embeddings) of the question and a contextualized encoding (e.g., a BERT encoder) of the document. Concretely, given the representations of the question and document, the weight of each question token is computed by applying max-pooling, ReLU and log sequentially; the final relevance score is the sum of the question token weights. The representation-interaction method is a promising approach to dense retrieval due to its good trade-off between effectiveness and efficiency, but it still needs to be further explored.
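The MaxSim scoring in Eq. (1) can be written in a few lines; the sketch below uses random token embeddings in place of the BERT encoders of ColBERT-QA [17], so the numbers are meaningless and only the scoring logic of the equation is illustrated.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for BERT token embeddings: n question tokens, m document tokens.
E_q = rng.normal(size=(6, 128))    # question token embeddings (n x dim)
E_d = rng.normal(size=(80, 128))   # document token embeddings (m x dim)

def maxsim_score(E_q, E_d):
    """Eq. (1): for each question token take its best-matching document
    token (max over j of the dot product), then sum over question tokens."""
    sim = E_q @ E_d.T              # n x m matrix of token-level dot products
    return sim.max(axis=1).sum()

print(maxsim_score(E_q, E_d))
```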
Though effective, Dense Retriever often suffers a heavy computational burden when applied to large-scale documents. In order to speed up the computation, some works propose to compute and cache the representations of all documents offline in advance [16], [29], [30], [35], [37]. In this way, these representations are not changed once computed, which means the documents are encoded independently of the question, to some extent sacrificing the effectiveness of retrieval.

3.1.3 Iterative Retriever
Iterative Retriever aims to search for relevant documents from a large collection in multiple steps given a question, and is also called Multi-step Retriever. It has been explored extensively in the past few years [29], [35], [36], [85], [86], [87], [88], [89], especially for answering complex questions like those requiring multi-hop reasoning [90], [91]. In order to obtain a sufficient amount of relevant documents, the search queries need to vary across steps and be reformulated based on the context information from the previous step. In the following, we elaborate on Iterative Retriever based on its workflow: 1) Document Retrieval: the IR techniques used to retrieve documents in every retrieval step; 2) Query Reformulation: the mechanism used to generate a query for each retrieval; 3) Retrieval Stopping Mechanism: the method used to decide when to terminate the retrieval process.

Document Retrieval: We first revisit the IR techniques used to retrieve documents in each retrieval step given a query. Some works [36], [86], [89] apply a Sparse Retriever iteratively, and some [29], [35], [85], [88] use a Dense Retriever iteratively. Among the works using Sparse Retriever, GOLDEN Retriever [36] adopts BM25 retrieval, while Graph Retriever [89] and DDRQA [86] retrieve the top K documents using TF-IDF. For those with Dense Retriever, most prior systems, including Multi-step Reasoner [29], MUPPET [35] and MDR [85], use MIPS retrieval to obtain the most semantically similar documents given a representation of the question; Path Retriever [88] develops a Recurrent Neural Network (RNN) retriever that learns to retrieve reasoning paths for a question over a Wikipedia graph, which is built to model the relationships among paragraphs based on Wikipedia hyperlinks and article structures.

Query Reformulation: In order to obtain a sufficient amount of relevant documents, the search queries used for each step of retrieval are usually varied and generated based on the previous query and the retrieved documents. The generated queries take one of two forms: 1) an explicit form, i.e. a natural language query [36], [86], [87]; and 2) an implicit form, i.e. a dense representation [29], [35], [85].

Some works produce a new query in the form of natural language. For example, GOLDEN Retriever [36] recasts the query reformulation task as an MRC task, because both take a question and some context documents as input and aim to generate a string in natural language. Instead of pursuing an answer as in MRC, the target of query reformulation is a new query that helps obtain more supporting documents in the next retrieval step. GAR [87] develops a query expansion module using the pretrained Seq2Seq model BART [92], which takes the initial question as input and generates new queries. It is trained by taking various
generation targets as output, consisting of the answer, the sentence the answer belongs to, and the title of a passage that contains the answer.

Some other works produce dense representations to be used for searching in a latent space. For example, Multi-step Reasoner [29] adopts a Gated Recurrent Unit (GRU) [93] that takes token-level hidden representations from the Reader and the question as input to generate a new query vector; it is trained with Reinforcement Learning (RL) by measuring how well the answer extracted by the Reader matches the ground truth after reading the new set of paragraphs retrieved with the new query. MUPPET [35] applies a bidirectional attention layer adapted from [94] to produce a new query representation q̃, taking each obtained paragraph P and the initial question Q as input. MDR [85] uses a pre-trained masked language model (such as RoBERTa) as its encoder, which encodes the concatenation of the representation of the question and all the previous passages as a new dense query.

Comparably, the explicit query is easily understandable and controllable by humans but is constrained by the terms in the vocabulary, while the implicit query is generated in a semantic space, which removes the limitation of the vocabulary but lacks interpretability.

Retrieval Stopping Mechanism: The iterative retrieval manner yields greater possibilities of gathering more relevant passages, but retrieval efficiency drops dramatically as the number of iterations increases. Regarding the mechanism for stopping an iterative retrieval, most existing systems choose to specify a fixed number of iterations [29], [36], [85], [86], [89] or a maximum number of retrieved documents [35], [87], which can hardly guarantee retrieval effectiveness. [77] argue that setting a fixed number of documents to be obtained for all input questions is sub-optimal, and instead they develop an Adaptive Retriever based on the Document Retriever in DrQA [3]. They propose two methods to dynamically set the number of retrieved documents for each question, i.e. a simple threshold-based heuristic method as well as a trainable classifier using ordinal ridge regression. Since it is difficult to specify the number of iterations for questions that require an arbitrary number of reasoning hops, Path Retriever [88] terminates its retrieval only when the end-of-evidence token (e.g. [EOE]) is detected by its Recurrent Retriever. This allows it to perform adaptive retrieval steps, but it only obtains one document at each step. To the best of our knowledge, it is still a critical challenge to develop an efficient iterative retriever without sacrificing accuracy.
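Abstracting over the concrete systems above, an iterative retriever can be sketched as the loop below; retrieve, reformulate_query and the stopping test are hypothetical placeholders standing in for, e.g., BM25/MIPS retrieval, GOLDEN Retriever- or GAR-style query generation, and the fixed-iteration or adaptive stopping mechanisms just discussed.

```python
def iterative_retrieve(question, retrieve, reformulate_query, max_steps=3, k_per_step=5):
    """Generic multi-step retrieval loop: retrieve, reformulate, repeat.
    `retrieve(query, k)` returns a list of documents; `reformulate_query(question, docs)`
    returns the next query (a string or a dense vector, depending on the system)."""
    collected, query = [], question
    for _ in range(max_steps):                 # fixed-step stopping; adaptive variants
        docs = retrieve(query, k=k_per_step)   # may instead stop on an [EOE]-style signal
        new_docs = [d for d in docs if d not in collected]
        if not new_docs:                       # nothing new: a simple early-stop heuristic
            break
        collected.extend(new_docs)
        query = reformulate_query(question, collected)
    return collected
```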
In addition, typical IR systems pursue two optimization targets, i.e. precision and recall. The former is the ratio of relevant documents returned to the total number of documents returned, while the latter is the ratio of relevant documents returned to the total number of relevant documents in the underlying repository. However, for OpenQA systems, recall is much more important than precision due to the post-processing usually applied to the returned documents [95], as described below.

3.2 Document Post-processing
Post-processing of the documents retrieved by the Retriever is often needed, since the retrieved documents inevitably contain irrelevant ones and, sometimes, the number of returned documents is so large that it overwhelms the capacity of the Reader. Document Post-processing in the modern OpenQA architecture is similar to that in the traditional one, as introduced in Section 2.2.2. It aims at reducing the number of candidate documents and only allowing the most relevant ones to be passed to the next stage.

In the past few years, this module has been explored with much interest [28], [33], [34], [79], [96], [97], [98], [99]. For example, R3 [28] adopts a neural Passage Ranker and trains it jointly with the Reader through Reinforcement Learning (RL). DS-QA [33] adds a Paragraph Selector to remove noisy paragraphs from the retrieved documents by measuring the probability that each paragraph contains the answer among all candidate paragraphs. [96] explore two different passage rankers that assign scores to retrieved passages based on their likelihood of containing the answer to a given question. One is the InferSent Ranker, a feed-forward neural network that employs InferSent sentence representations [100] to rank passages based on semantic similarity with the question. The other is the Relation-Networks Ranker, which adopts Relation Networks [101] and focuses on measuring word-level relevance between the question and the passages. Their experiments show that word-level relevance matching significantly improves retrieval performance, while semantic similarity is more beneficial to the overall performance. [34] develop a Paragraph Ranker using two separate RNNs following the dual-encoder architecture: each question-passage pair is fed into the Ranker to obtain their representations independently, and the inner product is applied to compute the relevance score. [98] propose a time-aware re-ranking module that incorporates temporal information from different aspects to rank candidate documents over temporal collections of news articles.

The focus of research on this module has been learning to further re-rank the retrieved documents [28], [33], [34], [79]. However, with the development of Dense Retriever, recent OpenQA systems tend to develop a trainable retriever that is capable of learning to rank and retrieve the most relevant documents simultaneously, which would result in the absence of this module.

3.3 Reader
Reader is the other core component of a modern OpenQA system and the main feature that differentiates QA systems from other IR or IE systems. It is usually implemented as a neural MRC model, aimed at inferring the answer to the question from a set of ordered documents, which is more challenging than the original MRC task, where in most cases only a single passage is given [18], [19], [90], [102], [103]. Broadly, existing Readers can be categorised into two types: Extractive Readers, which predict an answer span from the retrieved documents, and Generative Readers, which generate answers in natural language using sequence-to-sequence (Seq2Seq) models. Most prior OpenQA systems are equipped with an Extractive Reader [3], [16], [28], [29], [30], [33], [35], [78], [89], while some recent ones develop a Generative Reader [38], [85], [104].
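For concreteness, a minimal extractive Reader can be run with an off-the-shelf span-prediction model; the sketch below uses the Hugging Face transformers API with a SQuAD-tuned checkpoint as an assumed stand-in for the Readers surveyed here, predicting start and end positions of the answer span in one retrieved passage.

```python
import torch
from transformers import AutoModelForQuestionAnswering, AutoTokenizer

# Assumed checkpoint; any extractive QA model fine-tuned on SQuAD-style data behaves similarly.
name = "distilbert-base-cased-distilled-squad"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForQuestionAnswering.from_pretrained(name)

question = "When was Barack Obama born?"
passage = "Barack Obama was born on August 4, 1961, in Honolulu, Hawaii."

inputs = tokenizer(question, passage, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Pick the most likely start and end token, then decode that span.
start = int(outputs.start_logits.argmax())
end = int(outputs.end_logits.argmax())
answer = tokenizer.decode(inputs["input_ids"][0][start:end + 1])
print(answer)  # expected: "August 4, 1961"
```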
... a feed-forward network or an RNN is used to learn to further re-rank the answers and select the best one as the final answer.

3.5 End-to-end Methods
In recent years, various OpenQA systems [15], [30], [37] have been developed in which Retriever and Reader can be trained in an end-to-end manner. In addition, there are some systems with only a Retriever [73], and also some that are able to answer open questions without a retrieval stage at all, which are mostly pre-trained Seq2Seq language models [92], [108], [112], [113]. In the following, we introduce these three types of systems, i.e. Retriever-Reader, Retriever-only and Retriever-free.

3.5.1 Retriever-Reader
Deep learning techniques enable Retriever and Reader in an OpenQA system to be end-to-end trainable [15], [30], [37], [38]. For example, [15] propose to jointly train Retriever and Reader using multi-task learning based on the BiDAF model [24], simultaneously computing the similarity of a passage to the given question and predicting the start and end positions of an answer span. [37] argue that it is sub-optimal to incorporate a standalone IR system in an OpenQA system and develop ORQA, which jointly trains Retriever and Reader from question-answer pairs, with both developed using BERT [27]. REALM [30] is a pre-trained masked language model including a neural Retriever and a neural Reader, which is able to compute the gradient w.r.t. the model parameters and backpropagate it all the way through the network. Since both modules are developed using neural networks, the response speed to a question is the most critical issue during inference, especially over a large collection of documents.

3.5.2 Retriever-only
To enhance the efficiency of answering questions, some systems are developed by adopting only a Retriever while omitting the Reader, which is usually the most time-consuming stage in other modern OpenQA systems. DenSPI [73] builds a question-agnostic phrase-level embedding index offline given a collection of documents such as Wikipedia articles. In the index, each candidate phrase from the corpus is represented by the concatenation of two vectors, i.e. a sparse vector (e.g., tf-idf) and a dense vector (e.g., from a BERT encoder). At inference time, the given question is encoded in the same way, and FAISS [114] is employed to search for the most similar phrase as the final answer. Experiments show that this approach obtains remarkable efficiency gains and reduces computational cost significantly while maintaining accuracy. However, the system computes the similarity between each phrase and the question independently, which ignores the contextual information that is usually crucial to answering questions.
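A stripped-down version of this phrase-index lookup with FAISS might look as follows; the random phrase vectors stand in for DenSPI's sparse-dense phrase encodings, and the index type and dimensions are assumptions for illustration.

```python
import faiss
import numpy as np

dim = 128
rng = np.random.default_rng(0)

# Offline: encode every candidate phrase in the corpus and index the vectors.
phrases = ["4 August 1961", "Honolulu, Hawaii", "the 44th president"]
phrase_vecs = rng.normal(size=(len(phrases), dim)).astype("float32")
index = faiss.IndexFlatIP(dim)   # exact inner-product search; corpus-scale systems need ANN indexes
index.add(phrase_vecs)

# Online: encode the question the same way and return the best-matching phrase.
question_vec = rng.normal(size=(1, dim)).astype("float32")
scores, ids = index.search(question_vec, k=1)
print(phrases[int(ids[0][0])], float(scores[0][0]))
```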
3.5.3 Retriever-free
Recent advances in pre-trained Seq2Seq language models such as GPT-2 [112], GPT-3 [113], BART [92] and T5 [108] have brought a surge of improvements for downstream NLG tasks, most of which are built on Transformer-based architectures. In particular, GPT-2 and GPT-3 adopt a Transformer left-to-right decoder, while BART and T5 use a Transformer encoder-decoder closely following its original form [72]. Prior studies [115], [116] show that a large amount of knowledge learned from large-scale textual data can be stored in the underlying parameters, and thus these models are capable of answering questions without access to any external knowledge. For example, GPT-2 [112] is able to correctly generate the answer given only a natural language question, without fine-tuning. Afterwards, GPT-3 [113] achieves competitive performance with few-shot learning compared to prior state-of-the-art fine-tuning approaches, in which several demonstrations are given at inference time as conditioning [112] while weight updates are not allowed. Recently, [116] comprehensively evaluate the capability of language models to answer questions without access to any external knowledge. Their experiments demonstrate that pre-trained language models are able to attain impressive performance on various benchmarks, and such Retriever-free methods constitute a fundamentally different approach to building OpenQA systems.

In Table 1, we summarize existing modern OpenQA systems as well as the approaches adopted for their different components.

TABLE 1: Approaches adopted for different components of existing modern OpenQA systems.

4 CHALLENGES AND BENCHMARKS
In this section, we first discuss key challenges to building OpenQA systems, followed by an analysis of existing QA benchmarks that are commonly used not only for OpenQA but also for MRC.

4.1 Challenges to OpenQA
Building an OpenQA system that is capable of answering any input question is regarded as the ultimate goal of QA research. However, the research community still has a long way to go. Here we discuss some salient challenges that need to be addressed on the way. By doing this we hope the research gaps can be made clearer, so as to accelerate progress in this field.

4.1.1 Distant Supervision
In the OpenQA setting, it is almost impossible to create in advance a collection containing “sufficient” high-quality training data for developing OpenQA systems. Distant supervision is therefore popularly utilized, as it is able to label data automatically based on an existing corpus, such as Wikipedia. However, distant supervision inevitably suffers from the wrong-label problem and often leads to a considerable amount of noisy data, significantly increasing the difficulty of modeling and training. Therefore, systems that are able to tolerate such noise are always in demand.

4.1.2 Retrieval Effectiveness and Efficiency
Retrieval effectiveness means the ability of the system to separate relevant documents from irrelevant ones for a given question.
The system often suffers from “term mismatch”, which results in failure to retrieve relevant documents; on the other hand, the system may receive noisy documents that contain the exact terms in the question, or even the correct answer span, but are irrelevant to the question. Both issues increase the difficulty of accurately understanding the context during answer inference. Some neural retrieval methods [15], [16], [30], [37], [73], [117] have been proposed recently to improve retrieval effectiveness. For example, [37] and [30] jointly train the retrieval and reader modules, taking advantage of pre-trained language models and regarding the retrieval model as a latent variable. However, these neural retrieval methods often suffer from low efficiency. Some works [15], [16], [37], [117] propose to pre-compute a question-independent embedding for each document or phrase and construct the embedding index only once. Advanced sub-linear Maximum Inner Product Search (MIPS) algorithms [69], [70], [71] are usually employed to obtain the top K related documents given a question. However, the response speed still has a huge gap compared to that of typical IR techniques when the system faces a massive set of documents.

Retrieval effectiveness and efficiency are both crucial factors for the deployment of an OpenQA system in practice, especially when it comes to real-time scenarios. How to consistently enhance both aspects (also with a good trade-off) will be a long-standing challenge in the advancement of OpenQA.

4.1.3 Knowledge Incorporation
Incorporating knowledge beyond the context documents and given questions is a key enhancement to OpenQA systems [7], e.g. world knowledge, commonsense or domain-specific knowledge. Before making use of such knowledge, we need to first consider how to represent it. There are generally two ways: explicit and implicit.

In the explicit manner, knowledge is usually transformed into the form of triplets and stored in classical KBs such as DBPedia [118], Freebase [119] and Yago2 [120], which are easily understood by humans. Some early QA systems attempt to incorporate knowledge in this way to help find the answer. For example, IBM Watson DeepQA [58] combines a Web search engine and a KB to compete with human champions on the American TV show “Jeopardy”; QuASE [121] searches for a list of the most prominent sentences
from a Web search engine (e.g., Google.com), and then utilizes entity linking over Freebase [119] to detect the correct answer from the selected sentences. In recent years, with the popularity of Graph Neural Networks (GNN), some works [89], [122], [123] propose to gather relevant information not only from a text corpus but also from a KB to facilitate evidence retrieval and question answering. For example, [122] construct a question-specific sub-graph containing sentences from the corpus, and entities and relations from the KB; then, graph CNN based methods [105], [124], [125] are used to infer the final answer over the sub-graph. However, there also exist problems with storing knowledge in an explicit manner, such as incomplete and out-of-date knowledge. Moreover, constructing a KB is both labor-intensive and time-consuming.
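As a minimal illustration of this explicit, triple-based representation (not the storage model of any specific KB named above), facts can be kept as (subject, relation, object) triples and matched against a structured query derived from the question:

```python
# Toy triple store illustrating explicit knowledge representation.
triples = [
    ("Barack Obama", "born_on", "4 August 1961"),
    ("Barack Obama", "born_in", "Honolulu"),
    ("Singapore", "first_prime_minister", "Lee Kuan Yew"),
]

def lookup(subject, relation):
    """Return all objects linked to `subject` by `relation`."""
    return [o for s, r, o in triples if s == subject and r == relation]

# A question like "When was Barack Obama born?" would be mapped
# (by question analysis and entity linking) to this structured query:
print(lookup("Barack Obama", "born_on"))  # ['4 August 1961']
```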
On the other hand, with the implicit approach, a large amount of knowledge [115] can be stored in the underlying parameters learned from massive texts by pre-trained language models such as BERT [27], XLNet [126] and T5 [108], and then applied smoothly in downstream tasks. Recently, pre-trained language models have been widely researched and applied to developing OpenQA systems [16], [30], [32], [37], [78], [87], [88]. For example, [32], [78], [88] develop their Reader using BERT [27], while [16], [37] use BERT to develop both Retriever and Reader. In addition, pre-trained language models like GPT-2 [112] are able to generate the answer given only a natural language question. However, such systems act like a “black box”, and it is nearly impossible to know what knowledge has been exactly stored and used for a particular answer. They lack the interpretability that is crucial especially for real-world applications.

Knowledge-enhanced OpenQA is desired not only because knowledge is helpful for generating the answer but also because it serves as a source for interpreting the obtained answer. How to represent and make full use of knowledge for OpenQA still needs more research effort.

4.1.4 Conversational OpenQA
Non-conversational OpenQA is challenged by several problems that are almost impossible to resolve, such as the lengthy wording of a complex question (e.g. Who is the second son of the first Prime Minister of Singapore?), ambiguity resulting in an incorrect response (e.g. When was Michael Jordan born?), and insufficient background knowledge from the user that leads to unreasonable results (e.g. Why do I have a bad headache today?). These problems could be well addressed under a conversational setting.

Conversational systems [150], [151] are equipped with a dialogue-like interface that enables interaction between human users and the system for information exchange. For the complex question example given above, it can be decomposed into two simple questions asked sequentially: “Who is the first Prime Minister of Singapore?” followed by “Who is the second son of him?”. When ambiguity is detected in the question, the conversational OpenQA system is expected to raise a follow-up question for clarification, such as “Do you mean the basketball player?”. If a question with insufficient background knowledge is given, a follow-up question can also be asked to gather more information from human users for arriving at the final answer. To achieve these goals, three major challenges need to be addressed.

First, conversational OpenQA should have the ability to determine whether a question is unanswerable, such as detecting whether ambiguity exists in the question or whether the current context is sufficient for generating an answer. Research on unanswerable questions has attracted a lot of attention in the development of MRC over the past few years [20], [22], [128], [144], [152], [153]. However, current OpenQA systems rarely incorporate such a mechanism to determine the unanswerability of questions, which is particularly necessary for conversational OpenQA systems.

Second, when the question is classified as unanswerable due to ambiguity or insufficient background knowledge, the conversational OpenQA system needs to generate a follow-up question [154]. Question Generation (QG) can then be considered a sub-task of QA and a crucial module of conversational OpenQA. In the past few years, research on automatic question generation from text passages has received growing attention [155], [156], [157], [158]. Compared to the typical QG task, which targets generating a question based on a given passage where the answer to the generated question can be found, the question generated in conversational OpenQA is to be answered by human users only.

The third challenge is how to better model the conversation history not only in the Reader but also in the Retriever [159]. The recently released conversational MRC datasets like CoQA [133] and QuAC [134] are aimed at enabling a Reader to answer the latest question by comprehending not only the given context passage but also the conversation history so far. As they provide context passages in their task setting, they omit the stage of document retrieval, which is necessary when it comes to OpenQA. Recently, in [159] the QuAC dataset is extended to a new OR-QuAC dataset by adapting it to an open-retrieval setting, and an open-retrieval conversational question answering system (OpenConvQA) is developed, which is able to retrieve relevant passages from a large collection before inferring the answer, taking into account the conversational QA pairs. OpenConvQA tries to answer a given question without any specified context, and thus enjoys a wider scope of application and better accords with the real-world QA behavior of human beings. However, the best performance (F1: 29.4) of the system on OR-QuAC is far lower than the state-of-the-art on QuAC (F1: 74.4, as stated on https://quac.ai/ in June 2020), indicating that the task is much more challenging in an open-retrieval setting.

4.2 Benchmarks
A large number of QA benchmarks have been released in the past decade; they are summarized in Table 2. Here we provide a brief analysis of them, focusing on their respective characteristics and the dataset distributions w.r.t. background information domain, number of questions, and year of release. As mentioned earlier, the success of the MRC task is a crucial step towards more advanced OpenQA, and we believe future advances in MRC methods will significantly promote OpenQA systems.
TABLE 2: Dataset: The name of the dataset. Domain: The domain of background information in the dataset. #Q (k): The
number of questions contained in the dataset, with unit (k) denoting “thousand”. Answer Type: The answer types included
in the dataset. Context in MRC: The context documents or passages that are given to generate answers in MRC tasks.
OpenQA: This column indicates whether the dataset is applicable for developing OpenQA systems, with the tick mark
denoting yes.
Thus, we include not only the datasets for OpenQA but also those solely for MRC to make our survey more comprehensive.

The major criterion for judging the applicability of a QA dataset to developing OpenQA systems is whether it involves a separate document set (usually large-scale) [90], or whether it has relatively easy access to such an information source [18], [22] from which the answers to questions can be inferred. For example, HotpotQA [90] itself provides a full-wiki setting that requires a system to find the answer to a question within the scope of the entire Wikipedia. [3] extend SQuAD [19] to SQuAD-open by using the entire Wikipedia as its information source. We summarize and illustrate the distributions of the datasets listed in Table 2 w.r.t. year of release in Fig. 7, background information domain in Fig. 8, and number of questions in Fig. 6. Also, we summarize the information source types of the datasets that are applicable to developing OpenQA systems in Table 3.

TABLE 3: The information source of the datasets that are applicable for developing OpenQA systems. Source Type: The type of background information source. Source: The background information source in the OpenQA setting.

Fig. 7: Distribution of popular datasets w.r.t. release year.

Fig. 8: Dataset distribution w.r.t. background information domain.

5 CONCLUSION
In this work we presented a comprehensive survey of the latest progress of Open-domain QA (OpenQA) systems. In particular, we first reviewed the development of OpenQA and illustrated the “Retriever-Reader” architecture. Moreover, we reviewed a variety of existing OpenQA systems as well as their different approaches. Finally, we discussed salient challenges to OpenQA, followed by a summary of various QA benchmarks, hoping to reveal the research gaps so as to push further progress in this field.
Based on our review of prior research, we claim that OpenQA will continue to be a research hot-spot. In particular, single-step and multi-step neural retrievers will attract increasing attention due to the demand for more accurate retrieval of related documents. Also, more end-to-end OpenQA systems will be developed with the advancement of deep learning techniques. Knowledge-enhanced OpenQA is very promising, not only because knowledge is helpful for generating the answer but also because it serves as a source for interpreting the obtained answer; however, how to represent and make full use of knowledge for OpenQA still needs more research effort. Furthermore, equipping OpenQA with a dialogue-like interface that enables interaction between human users and the system for information exchange is expected to attract increasing attention, as it aligns well with real-world application scenarios.

6 ACKNOWLEDGEMENTS
This research is supported by the National Research Foundation, Singapore under its International Research Centres in Singapore Funding Initiative and A*STAR under its RIE
2020 Advanced Manufacturing and Engineering (AME) programmatic grant, Award No. A19E2b0098, Project name: K-EMERGE: Knowledge Extraction, Modelling, and Explainable Reasoning for General Expertise. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not reflect the views of the National Research Foundation and A*STAR, Singapore.

REFERENCES
[1] B. F. Green, Jr., A. K. Wolf, C. Chomsky, and K. Laughery, “Baseball: An automatic question-answerer,” in Papers Presented at the May 9-11, 1961, Western Joint IRE-AIEE-ACM Computer Conference. ACM, 1961, pp. 219–224.
[2] J. Falconer, “Google: Our new search strategy is to compute answers, not links,” 2011. [Online]. Available: https://thenextweb.com/google/2011/06/01/google-our-new-search-strategy-is-to-compute-answers-not-links/
[3] D. Chen, A. Fisch, J. Weston, and A. Bordes, “Reading Wikipedia to answer open-domain questions,” in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, 2017, pp. 1870–1879.
[4] E. M. Voorhees, “The TREC-8 question answering track report,” NIST, Tech. Rep., 1999.
[5] T. Mihaylov, P. Clark, T. Khot, and A. Sabharwal, “Can a suit of armor conduct electricity? A new dataset for open book question answering,” CoRR, vol. abs/1809.02789, 2018.
[6] S. M. Harabagiu, S. J. Maiorano, and M. A. Paşca, “Open-domain textual question answering techniques,” Nat. Lang. Eng., vol. 9, no. 3, pp. 231–267, 2003.
[7] J. Burger, C. Cardie et al., “Issues, tasks and program structures to roadmap research in question & answering (Q&A),” NIST, Tech. Rep., 2001.
[8] O. Kolomiyets and M.-F. Moens, “A survey on question answering technology from an information retrieval perspective,” Inf. Sci., vol. 181, no. 24, pp. 5412–5434, 2011.
[9] A. Allam and M. Haggag, “The question answering systems: A survey,” International Journal of Research and Reviews in Information Sciences, pp. 211–221, 2012.
[10] A. Mishra and S. K. Jain, “A survey on question answering systems with classification,” J. King Saud Univ. Comput. Inf. Sci., vol. 28, no. 3, pp. 345–361, 2016.
[11] M. Paşca, “Open-domain question answering from large text collections,” Computational Linguistics, vol. 29, no. 4, pp. 665–667, 2003.
[12] Z. Huang, S. Xu, M. Hu, X. Wang, J. Qiu, Y. Fu, Y. Zhao, Y. Peng, and C. Wang, “Recent trends in deep learning based open-domain textual question answering systems,” IEEE Access, vol. 8, pp. 94 341–94 356, 2020.
[13] T. Lei, Z. Shi, D. Liu, L. Yang, and F. Zhu, “A novel CNN-based method for question classification in intelligent question answering,” in Proceedings of the 2018 International Conference on Algorithms, Computing and Artificial Intelligence. Association for Computing Machinery, 2018.
[14] W. Xia, W. Zhu, B. Liao, M. Chen, L. Cai, and L. Huang, “Novel architecture for long short-term memory used in question classification,” Neurocomputing, vol. 299, pp. 20–31, 2018.
[15] K. Nishida, I. Saito, A. Otsuka, H. Asano, and J. Tomita, “Retrieve-and-read: Multi-task learning of information retrieval and reading comprehension,” in Proceedings of the 27th ACM International Conference on Information and Knowledge Management, ser. CIKM '18. Association for Computing Machinery, 2018, pp. 647–656.
[16] V. Karpukhin, B. Oğuz, S. Min, L. Wu, S. Edunov, D. Chen, and W.-t. Yih, “Dense passage retrieval for open-domain question answering,” arXiv preprint arXiv:2004.04906, 2020.
[17] O. Khattab, C. Potts, and M. Zaharia, “Relevance-guided supervision for OpenQA with ColBERT,” 2020. [Online]. Available: http://arxiv.org/abs/2007.00814
[18] K. M. Hermann, T. Kočiský, E. Grefenstette, L. Espeholt, W. Kay, M. Suleyman, and P. Blunsom, “Teaching machines to read and comprehend,” in Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 1. MIT Press, 2015, pp. 1693–1701.
[19] P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang, “SQuAD: 100,000+ questions for machine comprehension of text,” in Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2016, pp. 2383–2392.
[20] T. Nguyen, M. Rosenberg, X. Song, J. Gao, S. Tiwary, R. Majumder, and L. Deng, “MS MARCO: A human generated machine reading comprehension dataset,” 2016.
[21] G. Lai, Q. Xie, H. Liu, Y. Yang, and E. H. Hovy, “RACE: Large-scale reading comprehension dataset from examinations,” CoRR, vol. abs/1704.04683, 2017.
[22] P. Rajpurkar, R. Jia, and P. Liang, “Know what you don't know: Unanswerable questions for SQuAD,” in Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Association for Computational Linguistics, 2018, pp. 784–789.
[23] J. Li, M. Liu, M.-Y. Kan, Z. Zheng, Z. Wang, W. Lei, T. Liu, and B. Qin, “Molweni: A challenge multiparty dialogues-based machine reading comprehension dataset with discourse structure,” 2020.
[24] M. J. Seo, A. Kembhavi, A. Farhadi, and H. Hajishirzi, “Bidirectional attention flow for machine comprehension,” in 5th International Conference on Learning Representations, ICLR 2017. OpenReview.net, 2017.
[25] W. Wang, N. Yang, F. Wei, B. Chang, and M. Zhou, “Gated self-matching networks for reading comprehension and question answering,” in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL. Association for Computational Linguistics, 2017, pp. 189–198.
[26] A. W. Yu, D. Dohan, M. Luong, R. Zhao, K. Chen, M. Norouzi, and Q. V. Le, “QANet: Combining local convolution with global self-attention for reading comprehension,” in International Conference on Learning Representations, ICLR. OpenReview.net, 2018.
[27] J. Devlin, M. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” CoRR, vol. abs/1810.04805, 2018.
[28] S. Wang, M. Yu, X. Guo, Z. Wang, T. Klinger, W. Zhang, S. Chang, G. Tesauro, B. Zhou, and J. Jiang, “R3: Reinforced ranker-reader for open-domain question answering,” in AAAI, 2018.
[29] R. Das, S. Dhuliawala, M. Zaheer, and A. McCallum, “Multi-step retriever-reader interaction for scalable open-domain question answering,” in International Conference on Learning Representations, 2019.
[30] K. Guu, K. Lee, Z. Tung, P. Pasupat, and M.-W. Chang, “REALM: Retrieval-augmented language model pre-training,” CoRR, 2020.
[31] M. Ding, C. Zhou, Q. Chen, H. Yang, and J. Tang, “Cognitive graph for multi-hop reading comprehension at scale,” in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 2019, pp. 2694–2703.
[32] Y. Nie, S. Wang, and M. Bansal, “Revealing the importance of semantic retrieval for machine reading at scale,” in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Association for Computational Linguistics, 2019, pp. 2553–2566.
[33] Y. Lin, H. Ji, Z. Liu, and M. Sun, “Denoising distantly supervised open-domain question answering,” in Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, 2018, pp. 1736–1745.
[34] J. Lee, S. Yun, H. Kim, M. Ko, and J. Kang, “Ranking paragraphs for improving answer recall in open-domain question answering,” in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2018, pp. 565–569.
[35] Y. Feldman et al., “Multi-hop paragraph retrieval for open-domain question answering,” in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 2019, pp. 2296–2309.
[36] P. Qi, X. Lin, L. Mehr, Z. Wang, and C. D. Manning, “Answering complex open-domain questions through iterative query generation,” in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Association for Computational Linguistics, 2019, pp. 2590–2602.
[37] K. Lee, M.-W. Chang, and K. Toutanova, “Latent retrieval for weakly supervised open domain question answering,” in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 2019, pp. 6086–6096.
[38] P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.-t. Yih, T. Rocktäschel, S. Riedel, and D. Kiela, “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks,” 2020. [Online]. Available: http://arxiv.org/abs/2005.11401
[39] W. A. Woods, “Progress in natural language understanding: An application to lunar geology,” in Proceedings of the June 4-8, 1973, National Computer Conference and Exposition. ACM, 1973, pp. 441–450.
[40] J. Kupiec, “Murax: A robust linguistic approach for question answering using an on-line encyclopedia,” in Proceedings of the 16th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. Association for Computing Machinery, 1993, p. 181–190.
[41] E. M. Voorhees, “Overview of the trec 2001 question answering track,” in Proceedings of TREC-10, 2001, pp. 42–51.
[42] E. M. Voorhees, “Overview of the TREC 2002 question answering track,” in Proceedings of The Eleventh Text REtrieval Conference, TREC 2002, Gaithersburg, Maryland, USA, November 19-22, 2002, ser. NIST Special Publication, vol. 500-251. National Institute of Standards and Technology (NIST), 2002.
[43] E. Voorhees, “Overview of the trec 2003 question answering track,” NIST, Tech. Rep., 2003.
[44] C. Kwok, O. Etzioni, and D. S. Weld, “Scaling question answering to the web,” ACM Transactions on Information Systems, vol. 19, no. 3, pp. 242–262, 2001.
[45] E. Brill, S. Dumais, and M. Banko, “An analysis of the AskMSR question-answering system,” in Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing (EMNLP 2002). Association for Computational Linguistics, 2002, pp. 257–264.
[46] Z. Zheng, “Answerbus question answering system,” in Proceedings of the Second International Conference on Human Language Technology Research, ser. HLT ’02. Morgan Kaufmann Publishers Inc., 2002, p. 399–404.
[47] D. Moldovan, M. Pasca, S. Harabagiu, and M. Surdeanu, “Performance issues and error analysis in an open-domain question answering system,” in Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 2002, pp. 33–40.
[48] R. Sun, J. Jiang, Y. F. Tan, H. Cui, T.-S. Chua, and M.-Y. Kan, “Using syntactic and semantic relation analysis in question answering,” in TREC, 2005.
[49] J. Xu and W. B. Croft, “Query expansion using local and global document analysis,” in Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. Association for Computing Machinery, 1996, p. 4–11.
[50] C. Carpineto and G. Romano, “A survey of automatic query expansion in information retrieval,” ACM Computing Surveys, vol. 44, no. 1, 2012.
[51] C. Quirk, C. Brockett, and W. Dolan, “Monolingual machine translation for paraphrase generation,” in Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2004, pp. 142–149.
[52] C. Bannard and C. Callison-Burch, “Paraphrasing with bilingual parallel corpora,” in Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL’05). Association for Computational Linguistics, 2005, pp. 597–604.
[53] S. Zhao, C. Niu, M. Zhou, T. Liu, and S. Li, “Combining multiple resources to improve SMT-based paraphrasing model,” in Proceedings of ACL-08: HLT. Association for Computational Linguistics, 2008, pp. 1021–1029.
[54] S. Wubben, A. van den Bosch, and E. Krahmer, “Paraphrase generation as monolingual translation: Data and evaluation,” in Proceedings of the 6th International Natural Language Generation Conference, 2010.
[55] X. Li and D. Roth, “Learning question classifiers,” in COLING 2002: The 19th International Conference on Computational Linguistics, 2002.
[56] J. Suzuki, H. Taira, Y. Sasaki, and E. Maeda, “Question classification using HDAG kernel,” in Proceedings of the ACL 2003 Workshop on Multilingual Summarization and Question Answering. Association for Computational Linguistics, 2003, pp. 61–68.
[57] D. Zhang and W. S. Lee, “Question classification using support vector machines,” in Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ser. SIGIR ’03. Association for Computing Machinery, 2003, p. 26–32.
[58] D. Ferrucci, E. Brown, J. Chu-Carroll, J. Fan, D. Gondek, A. A. Kalyanpur, A. Lally, J. W. Murdock, E. Nyberg, J. Prager, N. Schlaefer, and C. Welty, “Building watson: An overview of the deepqa project,” AI Magazine, vol. 31, no. 3, pp. 59–79, 2010.
[59] H. Tayyar Madabushi and M. Lee, “High accuracy rule-based question classification using question syntax and semantics,” in Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers. The COLING 2016 Organizing Committee, 2016, pp. 1220–1230.
[60] C. D. Manning, P. Raghavan, and H. Schütze, Introduction to Information Retrieval. USA: Cambridge University Press, 2008.
[61] S. Robertson, H. Zaragoza et al., “The probabilistic relevance framework: Bm25 and beyond,” Foundations and Trends in Information Retrieval, vol. 3, no. 4, pp. 333–389, 2009.
[62] W. B. Croft and J. Lafferty, Language modeling for information retrieval. Kluwer Academic Publ., 2003.
[63] E. M. Voorhees, “Overview of the trec 2004 question answering track,” in Proceedings of the Thirteenth Text REtrieval Conference (TREC 2004), 2005, pp. 52–62.
[64] D. Mollá, M. van Zaanen, and D. Smith, “Named entity recognition for question answering,” in Proceedings of the Australasian Language Technology Workshop 2006, 2006, pp. 51–58.
[65] M. Wang, “A survey of answer extraction techniques in factoid question answering,” Computational Linguistics, vol. 1, no. 1, pp. 1–14, 2006.
[66] M. M. Soubbotin and S. M. Soubbotin, “Patterns of potential answer expressions as clues to the right answers,” in Proceedings of the 10th Text REtrieval Conference (TREC-10), 2001.
[67] D. Ravichandran and E. Hovy, “Learning surface text patterns for a question answering system,” in Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 2002, pp. 41–47.
[68] D. Shen, G.-J. M. Kruijff, and D. Klakow, “Exploring syntactic relation patterns for question answering,” in Second International Joint Conference on Natural Language Processing: Full Papers, 2005.
[69] P. Ram and A. G. Gray, “Maximum inner-product search using cone trees,” in Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2012, pp. 931–939.
[70] A. Shrivastava and P. Li, “Asymmetric lsh (alsh) for sublinear time maximum inner product search (mips),” in Advances in Neural Information Processing Systems, 2014, pp. 2321–2329.
[71] F. Shen, W. Liu, S. Zhang, Y. Yang, and H. Tao Shen, “Learning binary codes for maximum inner product search,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 4148–4156.
[72] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, 2017, pp. 5998–6008.
[73] M. Seo, J. Lee, T. Kwiatkowski, A. Parikh, A. Farhadi, and H. Hajishirzi, “Real-time open-domain question answering with dense-sparse phrase index,” in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 2019, pp. 4430–4441.
[74] M. Dehghani, H. Azarbonyad, J. Kamps, and M. de Rijke, “Learning to transform, combine, and reason in open-domain question answering,” in Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, ser. WSDM ’19. Association for Computing Machinery, 2019, p. 681–689.
[75] B. Dhingra, M. Zaheer, V. Balachandran, G. Neubig, R. Salakhutdinov, and W. W. Cohen, “Differentiable reasoning over a virtual knowledge base,” in International Conference on Learning Representations, 2020.
[76] B. Kratzwald, A. Eigenmann, and S. Feuerriegel, “RankQA: Neural question answering with answer re-ranking,” in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 2019, pp. 6076–6085.
[77] B. Kratzwald et al., “Adaptive document retrieval for deep question answering,” in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2018, pp. 576–581.
[78] W. Yang, Y. Xie, A. Lin, X. Li, L. Tan, K. Xiong, M. Li, and J. Lin, “End-to-end open-domain question answering with BERTserini,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations). Association for Computational Linguistics, 2019, pp. 72–77.
[79] Z. Wang, P. Ng, X. Ma, R. Nallapati, and B. Xiang, “Multi-passage BERT: A globally normalized BERT model for open-domain question answering,” in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Association for Computational Linguistics, 2019, pp. 5878–5882.
[80] K. Weinberger, A. Dasgupta, J. Langford, A. Smola, and J. Attenberg, “Feature hashing for large scale multitask learning,” in Proceedings of the 26th Annual International Conference on Machine Learning. Association for Computing Machinery, 2009, p. 1113–1120.
[81] P. Yang, H. Fang, and J. Lin, “Anserini: Enabling the use of lucene for information retrieval research,” in Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, ser. SIGIR ’17. Association for Computing Machinery, 2017, p. 1253–1256.
[82] T. Zhao, X. Lu, and K. Lee, “Sparta: Efficient open-domain question answering via sparse transformer matching retrieval,” arXiv preprint arXiv:2009.13013, 2020.
[83] Y. Zhang, P. Nie, X. Geng, A. Ramamurthy, L. Song, and D. Jiang, “Dc-bert: Decoupling question and document for efficient contextual encoding,” 2020.
[84] O. Khattab and M. Zaharia, “Colbert: Efficient and effective passage search via contextualized late interaction over bert,” in Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, ser. SIGIR ’20. Association for Computing Machinery, 2020, p. 39–48.
[85] W. Xiong, X. L. Li, S. Iyer, J. Du, P. Lewis, W. Y. Wang, Y. Mehdad, W.-t. Yih, S. Riedel, D. Kiela et al., “Answering complex open-domain questions with multi-hop dense retrieval,” arXiv preprint arXiv:2009.12756, 2020.
[86] Y. Zhang, P. Nie, A. Ramamurthy, and L. Song, “Ddrqa: Dynamic document reranking for open-domain multi-hop question answering,” arXiv preprint arXiv:2009.07465, 2020.
[87] Y. Mao, P. He, X. Liu, Y. Shen, J. Gao, J. Han, and W. Chen, “Generation-Augmented Retrieval for Open-domain Question Answering,” 2020. [Online]. Available: http://arxiv.org/abs/2009.08553
[88] A. Asai, K. Hashimoto, H. Hajishirzi, R. Socher, and C. Xiong, “Learning to retrieve reasoning paths over wikipedia graph for question answering,” in International Conference on Learning Representations, 2020.
[89] S. Min, D. Chen, L. Zettlemoyer, and H. Hajishirzi, “Knowledge Guided Text Retrieval and Reading for Open Domain Question Answering,” 2019. [Online]. Available: http://arxiv.org/abs/1911.03868
[90] Z. Yang, P. Qi, S. Zhang, Y. Bengio, W. Cohen, R. Salakhutdinov, and C. D. Manning, “HotpotQA: A dataset for diverse, explainable multi-hop question answering,” in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2018, pp. 2369–2380.
[91] J. Welbl, P. Stenetorp, and S. Riedel, “Constructing datasets for multi-hop reading comprehension across documents,” Transactions of the Association for Computational Linguistics, pp. 287–302, 2018. [Online]. Available: https://www.aclweb.org/anthology/Q18-1021
[92] M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, and L. Zettlemoyer, “BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension,” in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 2020, pp. 7871–7880.
[93] K. Cho, B. van Merrienboer, Ç. Gülçehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, “Learning phrase representations using RNN encoder-decoder for statistical machine translation,” in EMNLP. ACL, 2014, pp. 1724–1734.
[94] C. Clark and M. Gardner, “Simple and effective multi-paragraph reading comprehension,” in Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL. Association for Computational Linguistics, 2018, pp. 845–855.
[95] A. Lampert, “A quick introduction to question answering,” Dated December, 2004.
[96] P. M. Htut, S. Bowman, and K. Cho, “Training a ranking function for open-domain question answering,” in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Student Research Workshop. Association for Computational Linguistics, 2018, pp. 120–127.
[97] P. Banerjee, K. K. Pal, A. Mitra, and C. Baral, “Careful selection of knowledge to solve open book question answering,” in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 2019, pp. 6120–6129.
[98] J. Wang, A. Jatowt, M. Färber, and M. Yoshikawa, “Answering event-related questions over long-term news article archives,” in ECIR, ser. Lecture Notes in Computer Science, vol. 12035. Springer, 2020, pp. 774–789.
[99] J. Wang, A. Jatowt, M. Färber, and M. Yoshikawa, “Improving question answering for event-focused questions in temporal collections of news articles,” Information Retrieval Journal, vol. 24, no. 1, pp. 29–54, 2021.
[100] A. Conneau, D. Kiela, H. Schwenk, L. Barrault, and A. Bordes, “Supervised learning of universal sentence representations from natural language inference data,” in Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2017, pp. 670–680.
[101] A. Santoro, D. Raposo, D. G. Barrett, M. Malinowski, R. Pascanu, P. Battaglia, and T. Lillicrap, “A simple neural network module for relational reasoning,” in Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, Eds. Curran Associates, Inc., 2017, pp. 4967–4976.
[102] M. Richardson, C. J. Burges, and E. Renshaw, “MCTest: A challenge dataset for the open-domain machine comprehension of text,” in Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2013, pp. 193–203.
[103] D. Dua, Y. Wang, P. Dasigi, G. Stanovsky, S. Singh, and M. Gardner, “DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs,” in Proc. of NAACL, 2019.
[104] G. Izacard and E. Grave, “Leveraging passage retrieval with generative models for open domain question answering,” arXiv preprint arXiv:2007.01282, 2020.
[105] T. N. Kipf and M. Welling, “Semi-supervised classification with graph convolutional networks,” in ICLR, 2017.
[106] M. Joshi, D. Chen, Y. Liu, D. S. Weld, L. Zettlemoyer, and O. Levy, “SpanBERT: Improving pre-training by representing and predicting spans,” arXiv preprint arXiv:1907.10529, 2019.
[107] C. Tan, F. Wei, N. Yang, B. Du, W. Lv, and M. Zhou, “S-net: From answer extraction to answer synthesis for machine reading comprehension,” in AAAI. AAAI Press, 2018, pp. 5940–5947.
[108] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu, “Exploring the limits of transfer learning with a unified text-to-text transformer,” arXiv e-prints, 2019.
[109] S. Liu, X. Zhang, S. Zhang, H. Wang, and W. Zhang, “Neural machine reading comprehension: Methods and trends,” CoRR, vol. abs/1907.01118, 2019.
[110] S. Wang, M. Yu, J. Jiang, W. Zhang, X. Guo, S. Chang, Z. Wang, T. Klinger, G. Tesauro, and M. Campbell, “Evidence aggregation for answer re-ranking in open-domain question answering,” in 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings. ICLR, 2018.
[111] S. Wang and J. Jiang, “Learning natural language inference with LSTM,” in Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. The Association for Computational Linguistics, 2016, pp. 1442–1451.
[112] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever, “Language models are unsupervised multitask learners,” OpenAI blog, vol. 1, no. 8, p. 9, 2019.
[113] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell et al., “Language models are few-shot learners,” arXiv preprint arXiv:2005.14165, 2020.
[114] J. Johnson, M. Douze, and H. Jégou, “Billion-scale similarity search with gpus,” CoRR, vol. abs/1702.08734, 2017.
[115] F. Petroni, T. Rocktäschel, P. Lewis, A. Bakhtin, Y. Wu, A. H. Miller, and S. Riedel, “Language models as knowledge bases?” arXiv preprint arXiv:1909.01066, 2019.
[116] A. Roberts, C. Raffel, and N. Shazeer, “How much knowledge can you pack into the parameters of a language model?” arXiv preprint arXiv:2002.08910, 2020.
[117] M. Seo, T. Kwiatkowski, A. Parikh, A. Farhadi, and H. Hajishirzi, “Phrase-indexed question answering: A new challenge for scalable document comprehension,” in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2018, pp. 559–564.
[118] S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, and Z. Ives, “Dbpedia: A nucleus for a web of open data,” in The Semantic Web. Springer Berlin Heidelberg, 2007, pp. 722–735.
[119] K. Bollacker, C. Evans, P. Paritosh, T. Sturge, and J. Taylor, “Freebase: A collaboratively created graph database for structuring human knowledge,” in Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data. ACM, 2008, pp. 1247–1250.
[120] J. Hoffart, F. M. Suchanek, K. Berberich, and G. Weikum, “Yago2: A spatially and temporally enhanced knowledge base from wikipedia,” Artif. Intell., vol. 194, pp. 28–61, 2013.
[121] H. Sun, H. Ma, W.-t. Yih, C.-T. Tsai, J. Liu, and M.-W. Chang, “Open domain question answering via semantic enrichment,” in Proceedings of the 24th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 2015, pp. 1045–1055.
[122] H. Sun, B. Dhingra, M. Zaheer, K. Mazaitis, R. Salakhutdinov, and W. Cohen, “Open domain question answering using early fusion of knowledge bases and text,” in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2018, pp. 4231–4242.
[123] H. Sun, T. Bedrax-Weiss, and W. Cohen, “PullNet: Open domain question answering with iterative retrieval on knowledge bases and text,” in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Association for Computational Linguistics, 2019, pp. 2380–2390.
[124] Y. Li, D. Tarlow, M. Brockschmidt, and R. S. Zemel, “Gated graph sequence neural networks,” in ICLR, 2016.
[125] F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini, “The graph neural network model,” IEEE Transactions on Neural Networks, vol. 20, no. 1, pp. 61–80, 2009.
[126] Z. Yang, Z. Dai, Y. Yang, J. G. Carbonell, R. Salakhutdinov, and Q. V. Le, “Xlnet: Generalized autoregressive pretraining for language understanding,” CoRR, vol. abs/1906.08237, 2019.
[127] F. Hill, A. Bordes, S. Chopra, and J. Weston, “The goldilocks principle: Reading children’s books with explicit memory representations,” CoRR, 2015.
[128] A. Trischler, T. Wang, X. Yuan, J. Harris, A. Sordoni, P. Bachman, and K. Suleman, “NewsQA: A machine comprehension dataset,” in Proceedings of the 2nd Workshop on Representation Learning for NLP. Association for Computational Linguistics, 2017, pp. 191–200.
[129] M. Dunn, L. Sagun, M. Higgins, V. U. Güney, V. Cirik, and K. Cho, “Searchqa: A new q&a dataset augmented with context from a search engine,” CoRR, vol. abs/1704.05179, 2017.
[130] M. Joshi, E. Choi, D. Weld, and L. Zettlemoyer, “TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension,” in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, 2017, pp. 1601–1611.
[131] T. Kociský, J. Schwarz, P. Blunsom, C. Dyer, K. M. Hermann, G. Melis, and E. Grefenstette, “The narrativeqa reading comprehension challenge,” CoRR, vol. abs/1712.07040, 2017.
[132] W. He, K. Liu, Y. Lyu, S. Zhao, X. Xiao, Y. Liu, Y. Wang, H. Wu, Q. She, X. Liu, T. Wu, and H. Wang, “Dureader: a chinese machine reading comprehension dataset from real-world applications,” CoRR, vol. abs/1711.05073, 2017.
[133] S. Reddy, D. Chen, and C. D. Manning, “Coqa: A conversational question answering challenge,” CoRR, vol. abs/1808.07042, 2018.
[134] E. Choi, H. He, M. Iyyer, M. Yatskar, W. Yih, Y. Choi, P. Liang, and L. Zettlemoyer, “Quac: Question answering in context,” CoRR, vol. abs/1808.07036, 2018.
[135] P. Clark, I. Cowhey, O. Etzioni, T. Khot, A. Sabharwal, C. Schoenick, and O. Tafjord, “Think you have solved question answering? try arc, the AI2 reasoning challenge,” CoRR, vol. abs/1803.05457, 2018.
[136] M. Saeidi, M. Bartolo, P. Lewis, S. Singh, T. Rocktäschel, M. Sheldon, G. Bouchard, and S. Riedel, “Interpretation of natural language rules in conversational machine reading,” in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2018, pp. 2087–2097.
[137] S. Šuster et al., “CliCR: a dataset of clinical case reports for machine reading comprehension,” in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). Association for Computational Linguistics, 2018, pp. 1551–1563.
[138] D. Khashabi, S. Chaturvedi, M. Roth, S. Upadhyay, and D. Roth, “Looking beyond the surface: A challenge set for reading comprehension over multiple sentences,” in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). Association for Computational Linguistics, 2018, pp. 252–262.
[139] R. Zellers, Y. Bisk, R. Schwartz, and Y. Choi, “SWAG: A large-scale adversarial dataset for grounded commonsense inference,” in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2018, pp. 93–104.
[140] A. Saha, R. Aralikatte, M. M. Khapra, and K. Sankaranarayanan, “Duorc: Towards complex language understanding with paraphrased reading comprehension,” CoRR, vol. abs/1804.07927, 2018.
[141] S. Zhang, X. Liu, J. Liu, J. Gao, K. Duh, and B. V. Durme, “Record: Bridging the gap between human and machine commonsense reading comprehension,” 2018.
[142] A. Talmor, J. Herzig, N. Lourie, and J. Berant, “Commonsenseqa: A question answering challenge targeting commonsense knowledge,” CoRR, vol. abs/1811.00937, 2018.
[143] M. Chen, M. D’Arcy, A. Liu, J. Fernandez, and D. Downey, “CODAH: An adversarially-authored question answering dataset for common sense,” in Proceedings of the 3rd Workshop on Evaluating Vector Space Representations for NLP. Association for Computational Linguistics, 2019, pp. 63–69.
[144] T. Kwiatkowski, J. Palomaki, O. Redfield, M. Collins, A. Parikh, C. Alberti, D. Epstein, I. Polosukhin, M. Kelcey, J. Devlin, K. Lee, K. N. Toutanova, L. Jones, M.-W. Chang, A. Dai, J. Uszkoreit, Q. Le, and S. Petrov, “Natural questions: a benchmark for question answering research,” Transactions of the Association of Computational Linguistics, 2019.
[145] L. Huang, R. Le Bras, C. Bhagavatula, and Y. Choi, “Cosmos QA: Machine reading comprehension with contextual commonsense reasoning,” in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019, pp. 2391–2401.
[146] C. Clark, K. Lee, M.-W. Chang, T. Kwiatkowski, M. Collins, and K. Toutanova, “BoolQ: Exploring the surprising difficulty of natural yes/no questions,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Association for Computational Linguistics, 2019, pp. 2924–2936.
[147] A. Fan, Y. Jernite, E. Perez, D. Grangier, J. Weston, and M. Auli, “ELI5: Long form question answering,” in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 2019, pp. 3558–3567.
[148] W. Xiong, J. Wu, H. Wang, V. Kulkarni, M. Yu, S. Chang, X. Guo, and W. Y. Wang, “TWEETQA: A social media focused question answering dataset,” in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 2019, pp. 5020–5031.
[149] J. Liu, Y. Lin, Z. Liu, and M. Sun, “XQA: A cross-lingual open-domain question answering dataset,” in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 2019, pp. 2358–2368.
[150] J. Gao, M. Galley, and L. Li, “Neural approaches to conversational AI,” Foundations and Trends® in Information Retrieval, vol. 13, no. 2-3, pp. 127–298, 2019.
[151] W. Lei, X. He, M. de Rijke, and T.-S. Chua, “Conversational recommendation: Formulation, methods, and evaluation,” in Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, ser. SIGIR ’20. Association for Computing Machinery, 2020, p. 2425–2428.
[152] H. Zhu, L. Dong, F. Wei, W. Wang, B. Qin, and T. Liu, “Learning to ask unanswerable questions for machine reading comprehension,” in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 2019, pp. 4238–4248.
[153] M. Hu, F. Wei, Y. Peng, Z. Huang, N. Yang, and M. Zhou, “Read + verify: Machine reading comprehension with unanswerable questions,” ArXiv, vol. abs/1808.05759, 2018.
[154] M. Aliannejadi, H. Zamani, F. Crestani, and W. B. Croft, “Asking clarifying questions in open-domain information-seeking conversations,” in Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval. Association for Computing Machinery, 2019, p. 475–484.
[155] X. Du, J. Shao, and C. Cardie, “Learning to ask: Neural question generation for reading comprehension,” in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Vancouver, Canada: Association for Computational Linguistics, 2017, pp. 1342–1352.
[156] N. Duan, D. Tang, P. Chen, and M. Zhou, “Question generation for question answering,” in Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. Copenhagen, Denmark: Association for Computational Linguistics, 2017, pp. 866–874. [Online]. Available: https://www.aclweb.org/anthology/D17-1090
[157] Q. Zhou, N. Yang, F. Wei, C. Tan, H. Bao, and M. Zhou, “Neural question generation from text: A preliminary study,” CoRR, vol. abs/1704.01792, 2017.
[158] L. Pan, W. Lei, T. Chua, and M. Kan, “Recent advances in neural question generation,” CoRR, vol. abs/1905.08949, 2019.
[159] C. Qu, L. Yang et al., “Open-retrieval conversational question answering,” CoRR, vol. abs/2005.11364, 2020.

Fengbin Zhu received his B.E. degree from Shandong University, China. He is currently pursuing his Ph.D. degree at the School of Computing, National University of Singapore (NUS). His research interests include natural language processing, machine reading comprehension and conversational question answering.

Wenqiang Lei is a Research Fellow with the School of Computing, National University of Singapore (NUS). He received his Ph.D. in Computer Science from NUS in 2019. His research interests cover natural language processing and information retrieval, particularly dialogue systems, conversational recommendation and question answering. He has published multiple papers at top conferences such as ACL, IJCAI, AAAI, EMNLP and WSDM, and is a winner of the ACM MM 2020 best paper award. He has served as a (senior) PC member for top-tier conferences including ACL, EMNLP, SIGIR, AAAI and KDD, and is a reviewer for journals such as TOIS, TKDE and TASLP.

Chao Wang holds a PhD in Computer Science from Tsinghua University, where he was advised by Dr. Shaoping Ma and Dr. Yiqun Liu. His work has primarily focused on natural language processing, information retrieval and search engine user behavior analysis. His work has appeared in major journals and conferences such as SIGIR, CIKM, TOIS, and IRJ.

Jianming Zheng is a PhD candidate at the School of System Engineering, National University of Defense Technology, China. His research interests include semantic representation, few-shot learning and their applications in information retrieval. He received the BS and MS degrees from the National University of Defense Technology, China, in 2016 and 2018, respectively. He has several papers published in SIGIR, COLING, IPM, FITEE, Cognitive Computation, etc.

Soujanya Poria is an assistant professor of Information Systems Technology and Design at the Singapore University of Technology and Design (SUTD), Singapore. He holds a Ph.D. degree in Computer Science from the University of Stirling, UK. He is a recipient of the prestigious early career research award “NTU Presidential Postdoctoral Fellowship” in 2018. Soujanya has co-authored more than 100 research papers, published in top-tier conferences and journals such as ACL, EMNLP, AAAI, NAACL, Neurocomputing, Computational Intelligence Magazine, etc. Soujanya has been an area chair at top conferences such as ACL, EMNLP, and NAACL. Soujanya serves or has served on the editorial boards of Cognitive Computation and Information Fusion.