
Retrieving and Reading: A Comprehensive Survey on Open-domain Question Answering

Fengbin Zhu, Wenqiang Lei*, Chao Wang, Jianming Zheng, Soujanya Poria, Tat-Seng Chua

• *Corresponding author: Wenqiang Lei
• Fengbin Zhu, Wenqiang Lei and Tat-Seng Chua are with National University of Singapore (NUS). E-mail: [email protected], [email protected], [email protected]
• Fengbin Zhu and Chao Wang are with 6ESTATES PTE LTD, Singapore. E-mail: [email protected]
• Jianming Zheng is with National University of Defense Technology, China. E-mail: [email protected]
• Soujanya Poria is with Singapore University of Technology and Design (SUTD). E-mail: [email protected]

Abstract—Open-domain Question Answering (OpenQA) is an important task in Natural Language Processing (NLP), which aims to
answer a question in the form of natural language based on large-scale unstructured documents. Recently, there has been a surge in
the amount of research literature on OpenQA, particularly on techniques that integrate with neural Machine Reading Comprehension
(MRC). While these research works have advanced performance to new heights on benchmark datasets, they have been rarely
covered in existing surveys on QA systems. In this work, we review the latest research trends in OpenQA, with particular attention to
systems that incorporate neural MRC techniques. Specifically, we begin with revisiting the origin and development of OpenQA systems.
We then introduce the modern OpenQA architecture named “Retriever-Reader” and analyze the various systems that follow this
architecture as well as the specific techniques adopted in each of the components. We then discuss key challenges to developing
OpenQA systems and offer an analysis of benchmarks that are commonly used. We hope our work would enable researchers to be
informed of the recent advancement and also the open challenges in OpenQA research, so as to stimulate further progress in this field.

Index Terms—Textual Question Answering, Open Domain Question Answering, Machine Reading Comprehension, Information
Retrieval, Natural Language Understanding, Information Extraction

1 INTRODUCTION

Question Answering (QA) aims to provide precise answers to users' questions in natural language. It is a long-standing task dating back to the 1960s [1]. Compared with a search engine, a QA system aims to present the final answer to a question directly instead of returning a list of relevant snippets or hyperlinks, thus offering better user-friendliness and efficiency. Nowadays many web search engines like Google and Bing have been evolving towards higher intelligence by incorporating QA techniques into their search functionalities [2]. Empowered with these techniques, search engines now have the ability to respond precisely to some types of questions, such as:

— Q: “When was Barack Obama born?”
— A: “4 August 1961”.

The whole QA landscape can roughly be divided into two parts, textual QA and Knowledge Base (KB)-QA, according to the type of information source from which answers are derived. Textual QA mines answers from unstructured text documents, while KB-QA does so from a predefined structured KB that is often manually constructed. Textual QA is generally more scalable than the latter, since most of the unstructured text resources it exploits to obtain answers are fairly common and easily accessible, such as Wikipedia [3], news articles [4] and science books [5]. Specifically, textual QA is studied under two task settings based on the availability of contextual information, i.e. Machine Reading Comprehension (MRC) and Open-domain QA (OpenQA). MRC, which originally took inspiration from language proficiency exams, aims to enable machines to read and comprehend specified context passage(s) for answering a given question. In comparison, OpenQA tries to answer a given question without any specified context. It usually requires the system to first search for relevant documents as the context w.r.t. the given question from either a local document repository or the World Wide Web (WWW), and then generate the answer, as illustrated in Fig. 1. OpenQA therefore enjoys a wider scope of application and is more in line with the real-world QA behavior of human beings, while MRC can be considered as a step towards OpenQA [6]. In fact, building an OpenQA system that is capable of answering any input question is deemed the ultimate goal of QA research.

In the literature, OpenQA has been studied closely with research in Natural Language Processing (NLP), Information Retrieval (IR), and Information Extraction (IE) [7], [8], [9], [10]. Traditional OpenQA systems mostly follow a pipeline consisting of three stages, i.e. Question Analysis, Document Retrieval and Answer Extraction [6], [9], [11]. Given an input question in natural language, Question Analysis aims to reformulate the question to generate search queries that facilitate subsequent Document Retrieval, and to classify the question to obtain its expected answer type(s), which guides Answer Extraction. In the Document Retrieval stage, the system searches for question-relevant documents or passages with the generated search queries, usually using existing IR techniques like TF-IDF and BM25, or specific techniques developed for Web search engines like Google.com and Bing.com. After that, in the Answer Extraction stage, the final answer is extracted from the related documents received from the previous stage.

Fig. 1: An illustration of OpenQA. Given a natural language question (e.g., “When was Barack Obama born?”), the system infers the answer (e.g., “4 August 1961”) from a collection of unstructured text documents.

Deep learning techniques, which have driven remarkable progress in many fields, have also been successfully applied to almost every stage of OpenQA systems [12]. For example, [13] and [14] develop question classifiers using a CNN-based model and an LSTM-based model respectively. In [15], [16], [17], neural retrieval models are proposed to search for relevant documents in a latent space. In recent years, with the emergence of large-scale QA datasets [18], [19], [20], [21], [22], [23], neural MRC techniques have been greatly advanced [18], [24], [25], [26], [27]. By adopting popular neural MRC methods to extract the answer to a given question from the relevant document(s), traditional OpenQA systems have been revolutionized [3], [28], [29], [30] and have evolved to a modern “Retriever-Reader” architecture. The Retriever is responsible for retrieving relevant documents w.r.t. a given question, which can be regarded as an IR system, while the Reader aims at inferring the final answer from the received documents and is usually a neural MRC model. A handful of works [3], [31], [32] even rename OpenQA as Machine Reading Comprehension at Scale (MRS). Following this architecture, extensive research has been conducted along various directions, such as re-ranking the retrieved documents before feeding them into a neural MRC model [28], [33], [34], retrieving the relevant documents iteratively given a question [29], [35], [36], and training the entire OpenQA system in an end-to-end manner [15], [30], [37], [38], etc.

Based on the above observations and insights, we believe it is time to provide a comprehensive literature review of OpenQA systems, with particular attention to techniques that incorporate neural MRC models. Our review is expected to acknowledge the advancement made thus far and summarize the current challenges to stimulate further progress in this field. In the rest of this survey, we present the following contents. In Section 2, we review the development of OpenQA systems, including the origin, the traditional architecture, and recent progress in using deep neural networks. In Section 3, we summarize and elaborate a “Retriever-Reader” architecture for OpenQA, followed by a detailed analysis of the various techniques adopted. In Section 4, we first discuss some salient challenges towards OpenQA, identifying the research gaps and hoping to enhance further research in this field, and subsequently provide a summary and analysis of QA benchmarks that are applicable to either MRC or OpenQA. Finally, we draw our conclusions in Section 5.

2 DEVELOPMENT OF OPENQA

In this section, we first briefly introduce the origin of Open-domain Question Answering (OpenQA) and then review the traditional and deep learning approaches to OpenQA sequentially to describe its remarkable advancement over the past two decades.

2.1 Origin of OpenQA

Pioneering research on Question Answering (QA) systems was conducted within the scope of Information Retrieval (IR), with a focus on restricted-domain or closed-domain settings. The earliest QA system, as is widely acknowledged, is Baseball [1], which was designed in 1961 to answer questions about American baseball games, such as game time, location and team name. Within this system, all the relevant information is stored in a well-defined dictionary, and user questions are translated into query statements using linguistic methods to extract final answers from the dictionary. In 1973, another famous QA system, LUNAR [39], was developed as a powerful tool to assist the research work of lunar geologists: chemical analysis data about lunar rocks and soil obtained from the Apollo moon missions were stored in a database file provided by NASA MSC for each scientist to conveniently view and analyze. In 1993, MURAX [40] was designed to answer simple academic questions based on an English academic encyclopedia, mainly employing linguistic analysis and syntactic pattern matching techniques.

In 1999, OpenQA was first defined as extracting the top 5 probable snippets containing the correct answer from a collection of news articles in the QA track launched by the Text REtrieval Conference (TREC) [4]. Compared to previous research on QA, in the open-domain setting a large number of unstructured documents is used as the information source from which the correct answer to a given question is extracted. In later years, a series of TREC QA Tracks remarkably advanced the research progress on OpenQA [41], [42], [43].

It is worth noting that systems have been required to return exact short answers to given questions starting from TREC-11, held in 2002 [42].

The TREC campaign provides a local collection of documents as the information source for generating answers, but the popularity of the World Wide Web (WWW), and especially the increasing maturity of search engines, has inspired researchers to build Web-based OpenQA systems [40], [44], [45], [46] that obtain answers from online resources like Google.com and Ask.com using IR techniques. Web search engines are able to consistently and efficiently collect massive numbers of web pages, and are therefore capable of providing much more information to help find answers in response to user questions. In 2001, a QA system called MULDER [44] was designed to automatically answer open-domain factoid questions with a search engine (e.g., Google.com). It first translates users' questions into multiple search queries with several natural-language parsers and submits them to the search engine to retrieve relevant documents, and then employs an answer extraction component to extract the answer from the returned results. Following this pipeline, a well-known QA system, AskMSR [45], was developed, which mainly depends on data redundancy rather than sophisticated linguistic analysis of either questions or candidate answers. It first translates the user's question into queries relying on a set of predefined rewriting rules to gather relevant documents from search engines, and then adopts a series of n-gram based algorithms to mine, filter and select the best answer. For such OpenQA systems, search engines provide access to an ocean of information, significantly enlarging the possibility of finding precise answers to user questions. Nevertheless, such an ample information source also brings considerable noisy content that the QA system has to filter out.

2.2 Traditional Architecture of OpenQA

The traditional architecture of OpenQA systems is illustrated in Fig. 2, which mainly comprises three stages: Question Analysis, Document Retrieval, and Answer Extraction [6], [11]. Given a natural language question, Question Analysis aims to understand the question first so as to facilitate document retrieval and answer extraction in the following stages. The performance of this stage is found to have a noticeable influence upon that of the following stages, and it is hence important to the final output of the system [47]. Then, the Document Retrieval stage searches for question-relevant documents based on a self-built IR system [4] or a Web search engine [44], [45], using the search queries generated by Question Analysis. Finally, Answer Extraction is responsible for extracting final answers to user questions from the relevant documents received in the preceding step. In the following, we analyze each stage one by one.

2.2.1 Question Analysis

The goals of the Question Analysis stage are two-fold. On one hand, it aims to facilitate the retrieval of question-relevant documents, for which a Query Formulation module is often adopted to generate search queries. On the other hand, it is expected to enhance the performance of the Answer Extraction stage by employing a Question Classification module to predict the type of the given question, which leads to a set of expected answer types. A simple illustration of this stage is given in the leftmost grey box of Fig. 2.

In Query Formulation, linguistic techniques such as POS tagging [40], [44], stemming [40], parsing [44] and stop word removal [45], [48] are usually utilized to extract keywords for retrieval. However, the terms used in questions are often not the same as those appearing in the documents that contain the correct answers. This problem is called "term mismatch" and is a long-standing and critical issue in IR. To address it, query expansion [49], [50] and paraphrasing techniques [51], [52], [53], [54] are often employed to produce additional search words or phrases so as to retrieve more relevant documents.

Question Classification, the other module often adopted in the Question Analysis stage, aims to identify the type of the given question based on a set of question types (e.g., where, when, who, what) or a taxonomy [55], [56] manually defined by linguistic experts. After obtaining the type of the question, expected answer types can be easily determined using rule-based mapping methods [9]. For example, given the question "When was Barack Obama born?", the answer type would be inferred as "Date" once the question type is known to be "When". Identifying the question type provides a constraint upon answer extraction and significantly reduces the difficulty of finding correct answers. Question Classification has attracted much interest in the literature [44], [55], [57], [58], [59]. For instance, [59] proposed to extract relevant words from a given question and then classify the question based on rules associating these words with concepts; [57] trained a series of question classifiers using various machine learning techniques such as Support Vector Machines (SVM), Nearest Neighbors and Decision Trees on top of the hierarchical taxonomy proposed by [55].
Fig. 2: An illustration of the traditional architecture of an OpenQA system: Question Analysis (comprising Question Classification and Query Formulation), Document Retrieval over the generated queries, and Answer Extraction guided by the expected answer types.

2.2.2 Document Retrieval

This stage is aimed at obtaining, from a collection of unstructured documents, a small number of relevant documents that probably contain the correct answer to a given question; it usually relies on an IR engine and can significantly reduce the search space for arriving at the final answer. Over the past decades, various retrieval models have been developed for Document Retrieval, among which some popular ones are the Boolean model, Vector Space Models, Probabilistic Models and Language Models [60], which are briefly revisited as follows.

• Boolean Model: The Boolean Model is one of the simplest retrieval models. The question is transformed into a Boolean expression of terms, which are combined with operators like "AND", "OR" and "NOT" to exactly match the documents, with each document viewed as a set of words.

• Vector Space Model: Vector Space Models represent the question and each document as word vectors in a d-dimensional word space, where d is the number of words in the vocabulary. When searching for documents relevant to a given question, the relevance score of each document is computed via the similarity (e.g., cosine similarity) or distance (e.g., Euclidean distance) between its vector and the question vector. Compared to the Boolean model, this approach returns documents even if the constraints posed by the question are only partially met, with some precision sacrificed.

• Probabilistic Model: Probabilistic Models provide a way of integrating probabilistic relationships between words into a model. Okapi BM25 [61] is a probabilistic model sensitive to term frequency and document length; it is one of the most empirically successful retrieval models and is widely used in current search engines.

• Language Model: Language Models [62] are also very popular, among which the Query Likelihood Model [60] is the most widely adopted. It builds a probabilistic language model LM_d for each document d and ranks documents according to the probability P(q | LM_d) of the language model generating the given question q.
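To ground the vector space model described above, the short sketch below ranks a toy document collection against a question using TF-IDF vectors and cosine similarity. It is only an illustration of the scoring idea: scikit-learn is used purely for brevity and is not what the surveyed systems themselves use, and the documents and question are made-up examples.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "Barack Obama was born on 4 August 1961 in Honolulu, Hawaii.",
    "Paris is the capital and most populous city of France.",
    "The TREC QA track defined the open-domain question answering task.",
]
question = "When was Barack Obama born?"

vectorizer = TfidfVectorizer()
doc_vecs = vectorizer.fit_transform(docs)      # sparse d-dimensional document vectors
q_vec = vectorizer.transform([question])       # question vector in the same space

scores = cosine_similarity(q_vec, doc_vecs)[0] # vector space relevance scores
for score, doc in sorted(zip(scores, docs), reverse=True):
    print(f"{score:.3f}  {doc}")
```

The same loop structure applies to the other models above; only the scoring function (Boolean match, BM25, or query likelihood) changes.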
In practice, the documents received often contain irrelevant ones, or the number of documents is so large that it overwhelms the capacity of the Answer Extraction model. To address these issues, post-processing of the retrieved documents is in high demand. Widely used approaches to processing retrieved documents include document filtering, document re-ranking and document selection [9], etc. Document filtering is used to identify and remove noise w.r.t. a given question; document re-ranking further sorts the documents in descending order of the plausibility of containing the correct answer; document selection chooses the top relevant documents. After post-processing, only the most relevant documents are retained and fed to the next stage to extract the final answer.

2.2.3 Answer Extraction

The ultimate goal of an OpenQA system is to successfully answer given questions, and the Answer Extraction stage is responsible for returning the user the most precise answer to a question. The performance of this stage is determined by the complexity of the question, the expected answer types from the Question Analysis stage, the retrieved documents from the Document Retrieval stage, as well as the extraction method adopted, etc. With so many influential factors, researchers need to take great care of this stage and place special importance on it.

In traditional OpenQA systems, factoid questions and list questions [63] have been widely studied for a long time. Factoid questions (e.g., When, Where, Who...) are those whose answers are usually a single text span in the documents, such as an entity name, a word token or a noun phrase, while list questions are those whose answers are a set of factoids that appear in the same document or are aggregated from different documents. The answer type received from the Question Analysis stage plays a crucial role, especially for questions whose answers are named entities. Thus, early systems rely heavily on Named Entity Recognition (NER) techniques [40], [46], [64], since comparing the recognised entities with the expected answer type may easily yield the final answer. In [65], answer extraction is described as a unified process that first uncovers latent or hidden information from the question and the answer respectively, and then uses matching methods to detect answers, such as surface text pattern matching [66], [67], word or phrase matching [44], and syntactic structure matching [40], [48], [68].

In practice, the extracted answer sometimes needs to be validated when its confidence is not high enough before being presented to end-users. Moreover, in some cases multiple answer candidates may be produced for a question, and one must be selected among them. Answer validation is applied to solve such issues. One widely applied validation method is to adopt an extra information source, like a Web search engine, to validate the confidence of each candidate answer. The principle is that the system should return a sufficiently large number of documents that contain both question and answer terms; the larger the number of such returned documents, the more likely the candidate is the correct answer. This principle has been investigated and demonstrated to be fairly effective, though simple [9].

2.3 Application of Deep Neural Networks in OpenQA

In the recent decade, deep learning techniques have also been successfully applied to OpenQA. In particular, deep learning has been used in almost every stage of an OpenQA system, and moreover, it enables OpenQA systems to be end-to-end trainable.

For Question Analysis, some works develop neural classifiers to determine the question types. For example, [13] and [14] respectively adopt a CNN-based and an LSTM-based model to classify the given questions, both achieving competitive results. For Document Retrieval, dense representation based methods [16], [29], [30], [35] have been proposed to address "term mismatch", a long-standing problem that harms retrieval performance. Unlike traditional methods such as TF-IDF and BM25 that use sparse representations, deep retrieval methods learn to encode questions and documents into a latent vector space in which text semantics beyond term match can be measured. For example, [29] and [35] train their own encoders to encode each document and question independently into dense vectors, and the similarity score between them is computed using the inner product of their vectors. The sublinear Maximum Inner Product Search (MIPS) algorithm [69], [70], [71] is used to improve retrieval efficiency given a question, especially when the document repository is large-scale. For Answer Extraction, as a decisive stage for OpenQA systems to arrive at the final answer, neural models can also be applied. Extracting answers from relevant documents for a given question is essentially the task of Machine Reading Comprehension (MRC). In the past few years, with the emergence of large-scale datasets such as CNN/Daily Mail [18], MS MARCO [20], RACE [21] and SQuAD 2.0 [22], research on neural MRC has achieved remarkable progress [24], [25], [26], [27]. For example, BiDAF [24] represents the given document at different levels of granularity via a multi-stage hierarchical structure consisting of a character embedding layer, a word embedding layer, and a contextual embedding layer, and leverages a bidirectional attention flow mechanism to obtain a question-aware document representation without early summarization. QANet [26] adopts CNNs and the self-attention mechanism [72] to model local interactions and global interactions respectively, and performs significantly faster than the usual recurrent models.

Furthermore, applying deep learning enables OpenQA systems to be end-to-end trainable [15], [30], [37]. For example, [37] argue that it is sub-optimal to incorporate a standalone IR system in an OpenQA system, and they develop ORQA, which treats document retrieval from the information source as a latent variable and trains the whole system only from question-answer string pairs based on BERT [27]. REALM [30] is a pre-trained language model that contains a knowledge retriever and a knowledge-augmented encoder. Both its retriever and encoder are differentiable neural networks, which are able to compute the gradient w.r.t. the model parameters and back-propagate it all the way throughout the network. Similar to other pre-trained language models, it also has two stages, i.e., pre-training and fine-tuning. In the pre-training stage, the model is trained in an unsupervised manner, using masked language modeling as the learning signal, while in the fine-tuning stage the parameters are fine-tuned using supervised examples.

In early OpenQA systems, the success of answering a question is highly dependent on the performance of Question Analysis, particularly Question Classification, which provides the expected answer types [47]. However, both the types and the taxonomies of questions are hand-crafted by linguists, which is non-optimal since it is impossible to cover all question types in reality, especially complicated ones. Furthermore, classification errors easily result in the failure of answer extraction, thus severely hurting the overall performance of the system. According to the experiments in [47], about 36.4% of errors in early OpenQA systems are caused by mis-classification of question types. Neural models are able to automatically transform questions from natural language into representations that are more recognisable to machines. Moreover, neural MRC models provide an unprecedentedly powerful solution to Answer Extraction in OpenQA, largely offsetting the necessity of applying traditional linguistic analytic techniques to questions and bringing revolutions to OpenQA systems [3], [28], [29], [37]. The very first work to incorporate neural MRC models into an OpenQA system is DrQA, proposed by [3], evolving into a "Retriever-Reader" architecture. It combines a TF-IDF based IR technique with a neural MRC model to answer open-domain factoid questions over Wikipedia and achieves impressive performance. After [3], many works have been released [28], [30], [33], [34], [37], [73], [74], [75]. Nowadays, building OpenQA systems following the "Retriever-Reader" architecture is widely acknowledged as the most efficient and promising way, which is also the main focus of this paper.

3 MODERN OPENQA: RETRIEVING AND READING

In this section, we introduce the "Retriever-Reader" architecture of the OpenQA system, as illustrated in Fig. 3. The Retriever is aimed at retrieving relevant documents w.r.t. a given question, which can be regarded as an IR system, while the Reader aims at inferring the final answer from the received documents, which is usually a neural MRC model. They are the two major components of a modern OpenQA system. In addition, some other auxiliary modules, which are marked with dashed lines in Fig. 3, can also be incorporated into an OpenQA system, including Document Post-processing, which filters and re-ranks the retrieved documents in a fine-grained manner to select the most relevant ones, and Answer Post-processing, which determines the final answer among multiple answer candidates. The systems following this architecture can be classified into two groups, i.e. pipeline systems and end-to-end systems. In the following, we introduce each component with the respective approaches in pipeline systems, followed by the end-to-end trainable ones. In Fig. 4 we provide a taxonomy of modern OpenQA systems to make our descriptions more understandable.

3.1 Retriever

The Retriever is usually regarded as an IR system, with the goal of retrieving related documents or passages that probably contain the correct answer given a natural language question, as well as ranking them in descending order of relevancy. Broadly, current approaches to the Retriever can be classified into three categories, i.e. Sparse Retriever, Dense Retriever, and Iterative Retriever, which are detailed in the following.

Fig. 3: An illustration of the "Retriever-Reader" architecture of an OpenQA system, comprising a Retriever and a Reader together with optional Document Post-processing and Answer Post-processing modules; the modules marked with dashed lines are auxiliary.

3.1.1 Sparse Retriever

Sparse Retriever refers to systems that search for relevant documents by adopting classical IR methods as introduced in Section 2.2.2, such as TF-IDF [3], [34], [76], [77] and BM25 [78], [79]. DrQA [3] is the very first approach to modern OpenQA systems, developed by combining classical IR techniques with neural MRC models to answer open-domain factoid questions. In particular, the retriever in DrQA adopts bi-gram hashing [80] and TF-IDF matching to search over Wikipedia given a natural language question. BERTserini [78] employs Anserini [81] as its retriever, which is an open-source IR toolkit based on Lucene. In [78], different granularities of text including document-level, paragraph-level and sentence-level are investigated experimentally, and the results show that a paragraph-level index achieves the best performance. Traditional retrieval methods such as TF-IDF and BM25 use sparse representations to measure term match. However, the terms used in user questions are often not the same as those appearing in the documents. Various methods based on dense representations [16], [29], [30], [35] have been developed in recent years, which learn to encode questions and documents into a latent vector space where text semantics beyond term match can be measured.
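To make the sparse scoring concrete, the sketch below implements the Okapi BM25 ranking function from its standard formulation over a toy corpus. It is a minimal illustration rather than the exact implementation used by DrQA, Anserini or any other system above; the hyperparameters k1 and b are the commonly used defaults, and the corpus is invented.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Score each tokenized document against a tokenized query with Okapi BM25."""
    N = len(docs_tokens)
    avgdl = sum(len(d) for d in docs_tokens) / N
    # document frequency of each query term
    df = {t: sum(1 for d in docs_tokens if t in d) for t in set(query_tokens)}
    scores = []
    for doc in docs_tokens:
        tf = Counter(doc)
        score = 0.0
        for t in query_tokens:
            if t not in tf:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            score += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(doc) / avgdl))
        scores.append(score)
    return scores

corpus = [
    "barack obama was born in honolulu hawaii".split(),
    "the capital of france is paris".split(),
    "obama served as the 44th president".split(),
]
print(bm25_scores("where was barack obama born".split(), corpus))
```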
3.1.2 Dense Retriever

Along with the success of deep learning, which offers remarkable semantic representations, various dense retrieval models have been developed in the past few years, greatly enhancing retrieval effectiveness and thus lifting final QA performance. According to the different ways of encoding the question and document, as well as of scoring their similarity, dense retrievers in existing OpenQA systems can be roughly divided into three types: Representation-based Retriever [16], [30], [37], [73], Interaction-based Retriever [15], [32], and Representation-interaction Retriever [17], [82], [83], as illustrated in Fig. 5.

Representation-based Retriever: A Representation-based Retriever, also called a Dual-encoder or Two-tower retriever, employs two independent encoders like BERT [27] to encode the question and the document respectively, and estimates their relevance by computing a single similarity score between the two representations. For example, ORQA [37] adopts two independent BERT-based encoders to encode a question and a document respectively, and the relevance score between them is computed as the inner product of their vectors. In order to obtain a sufficiently powerful retriever, they pre-train the retriever with the Inverse Cloze Task (ICT), i.e., predicting the context of a given sentence. DPR [16] also employs two independent BERT encoders like ORQA but dispenses with the expensive pre-training stage. Instead, it focuses on learning a strong retriever solely from pairs of questions and answers. DPR carefully designs the ways to select negative samples for a question, including random documents from the corpus, top documents returned by BM25 that do not contain the correct answer, and in-batch negatives, i.e. the gold documents paired with other questions in the same batch. It is worth mentioning that their experiments show the inner product function is optimal for calculating the similarity score of a dual-encoder retriever. Representation-based methods [16], [30], [37] can be very fast since the representations of documents can be computed and indexed offline in advance, but they may sacrifice retrieval effectiveness because the representations of the question and document are obtained independently, leading to only shallow interactions captured between them.
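The representation-based scheme boils down to offline document encoding plus maximum inner product search (MIPS) at query time. The sketch below shows that mechanic with the FAISS library (also used by the Retriever-only system in Section 3.5.2). The random "encoders" are placeholders for trained BERT-style dual encoders, so this illustrates the indexing and scoring flow under stated assumptions, not any particular system.

```python
import numpy as np
import faiss  # similarity-search library commonly used for MIPS over dense vectors

d = 128          # embedding dimension (illustrative)
n_docs = 10000

# Placeholder encoders: real dual-encoder retrievers would produce these vectors
# with two trained neural encoders; random vectors stand in purely for illustration.
rng = np.random.default_rng(0)
doc_vecs = rng.standard_normal((n_docs, d)).astype("float32")
question_vec = rng.standard_normal((1, d)).astype("float32")

# Offline: index all document representations once.
index = faiss.IndexFlatIP(d)   # exact inner-product (MIPS) index
index.add(doc_vecs)

# Online: encode the question and retrieve the top-k documents.
scores, doc_ids = index.search(question_vec, 5)
print(doc_ids[0], scores[0])
```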
Interaction-based Retriever: This kind of retriever takes a question together with a document as input at the same time, and is powerful in that it usually models the token-level interactions between them, for example with a Transformer-based encoder [27], [72]. [15] propose to jointly train the Retriever and Reader using supervised multi-task learning: based on BiDAF [24], a retrieval layer is added to compute the relevance score between question and document, while a comprehension layer is adopted to predict the start and end positions of the answer span. [32] develop a paragraph-level dense Retriever and a sentence-level dense Retriever, both based on BERT [27]. They regard dense retrieval as a binary classification problem: each pair of question and document is taken as input, and the embedding of the [CLS] token is used to determine whether they are relevant. Their experiments show that both paragraph-level and sentence-level retrieval are necessary for good system performance. The interaction-based method [15], [32] is powerful as it allows very rich interactions between question and document. However, such a method usually requires heavy computation, which is sometimes prohibitively expensive, making it hardly applicable to large-scale document collections.

Fig. 4: A taxonomy of "Retriever-Reader" OpenQA systems:
• Retriever: Sparse Retriever (TF-IDF, BM25, DrQA, BERTserini); Dense Retriever (DenSPI, ORQA, REALM, DPR, ColBERT, SPARTA); Iterative Retriever (GOLDEN Retriever, Multi-step Reasoner, Adaptive Retriever, Path Retriever, MUPPET, DDRQA, MDR, Graph Retriever, GAR)
• Document Post-processing: Supervised Learning (DS-QA, InferSent Re-ranker, Relation-Networks Re-ranker, Paragraph Ranker); Reinforcement Learning; Transfer Learning (Multi-Passage BERT Re-ranker)
• Reader: Extractive Reader (DrQA, Match-LSTM, BiDAF, S-Norm Reader, BERT Reader, Graph Reader); Generative Reader (BART Reader, T5 Reader)
• Answer Post-processing: Rule-based (Strength-based Re-ranker); Learning-based (Coverage-based Re-ranker, RankQA)
• End-to-end: Retriever-Reader (Retrieve-and-Read, ORQA, REALM, RAG); Retriever-only (DenSPI); Retriever-free (GPT-2, GPT-3, T5, BART)

Representation-interaction Retriever: In order to achieve both high accuracy and efficiency, some recent systems [17], [82], [83] combine representation-based and interaction-based methods. For instance, ColBERT-QA [17] develops its retriever based on ColBERT [84], which extends the dual-encoder architecture by performing a simple token-level interaction step over the question and document representations to calculate the similarity score. Akin to DPR [16], ColBERT-QA first encodes the question and the document independently with two BERT encoders. Formally, given a question q and a document d, with corresponding token embedding matrices denoted as E_q (length n) and E_d (length m), the relevance score between them is computed as follows:

S_{q,d} = \sum_{i=1}^{n} \max_{j=1}^{m} E_{q_i} \cdot E_{d_j}^{T}.    (1)

That is, ColBERT first computes, for each question token embedding, its maximum similarity over all document token embeddings, and then sums these maximum scores as the final relevance score between q and d. As another example, SPARTA [82] develops a neural ranker to calculate a token-level matching score using the dot product between a non-contextualized encoding of the question (e.g., BERT word embeddings) and a contextualized encoding of the document (e.g., a BERT encoder). Concretely, given the representations of the question and document, the weight of each question token is computed with max-pooling, ReLU and log sequentially; the final relevance score is the sum of the question token weights. The representation-interaction method is a promising approach to dense retrieval due to its good trade-off between effectiveness and efficiency, but it still needs to be further explored.
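Equation (1) can be read as a MaxSim operation: every question token keeps only its best-matching document token, and the per-token maxima are summed. A minimal NumPy sketch of this scoring rule is given below; the embeddings are random placeholders rather than real ColBERT outputs, so it only illustrates the late-interaction computation itself.

```python
import numpy as np

def maxsim_score(E_q, E_d):
    """Late-interaction relevance score of Eq. (1): for each question token,
    take the maximum dot product with any document token, then sum."""
    sim = E_q @ E_d.T            # (n, m) token-to-token similarity matrix
    return sim.max(axis=1).sum() # max over document tokens, sum over question tokens

rng = np.random.default_rng(0)
E_q = rng.standard_normal((6, 128))    # n = 6 question token embeddings
E_d = rng.standard_normal((180, 128))  # m = 180 document token embeddings
print(maxsim_score(E_q, E_d))
```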

Fig. 5: Three types of dense retrievers: (1) Representation-based Retriever, (2) Interaction-based Retriever, and (3) Representation-interaction Retriever; each takes a question and a document and produces a relevance score.

Though effective, Dense Retrievers often suffer a heavy computational burden when applied to large-scale document collections. In order to speed up the computation, some works propose to compute and cache the representations of all documents offline in advance [16], [29], [30], [35], [37]. In this way, the representations are not changed once computed, which means the documents are encoded independently of the question, to some extent sacrificing retrieval effectiveness.

3.1.3 Iterative Retriever

An Iterative Retriever, also called a Multi-step Retriever, aims to search for the relevant documents from a large collection in multiple steps given a question. It has been explored extensively in the past few years [29], [35], [36], [85], [86], [87], [88], [89], especially for answering complex questions like those requiring multi-hop reasoning [90], [91]. In order to obtain a sufficient amount of relevant documents, the search queries need to vary across steps and be reformulated based on the context information from the previous step. In the following, we elaborate on Iterative Retrievers according to their workflow: 1) Document Retrieval: the IR techniques used to retrieve documents in every retrieval step; 2) Query Reformulation: the mechanism used to generate a query for each retrieval; and 3) Retrieval Stopping Mechanism: the method to decide when to terminate the retrieval process.

Document Retrieval: We first revisit the IR techniques used to retrieve documents in each retrieval step given a query. Some works [36], [86], [89] apply a Sparse Retriever (e.g., TF-IDF or BM25) iteratively, and some [29], [35], [85], [88] use a Dense Retriever iteratively. Among the works using a Sparse Retriever, GOLDEN Retriever [36] adopts BM25 retrieval, while Graph Retriever [89] and DDRQA [86] retrieve the top K documents using TF-IDF. For those with a Dense Retriever, most prior systems, including Multi-step Reasoner [29], MUPPET [35] and MDR [85], use MIPS retrieval to obtain the most semantically similar documents given a representation of the question; Path Retriever [88] develops a Recurrent Neural Network (RNN) retriever that learns to retrieve reasoning paths for a question over a Wikipedia graph, which is built to model the relationships among paragraphs based on Wikipedia hyperlinks and article structures.

Query Reformulation: In order to obtain a sufficient amount of relevant documents, the search queries used for each retrieval step are usually varied and generated based on the previous query and the retrieved documents. The generated queries take one of two forms: 1) an explicit form, i.e. a natural language query [36], [86], [87]; and 2) an implicit form, i.e. a dense representation [29], [35], [85].

Some works produce a new query in the form of natural language. For example, GOLDEN Retriever [36] recasts the query reformulation task as an MRC task, because both take a question and some context documents as input and aim to generate a string in natural language. Instead of pursuing an answer as in MRC, the target of query reformulation is a new query that helps obtain more supporting documents in the next retrieval step. GAR [87] develops a query expansion module using a pretrained Seq2Seq model, BART [92], which takes the initial question as input and generates new queries. It is trained by taking various generation targets as output, consisting of the answer, the sentence the answer belongs to, and the title of a passage that contains the answer.
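Putting the three design choices together, an iterative retriever is essentially a loop of retrieve, reformulate, and stop. The pseudocode-style sketch below fixes the number of iterations as its stopping criterion and appends newly retrieved evidence before reformulating; the `retrieve` and `reformulate` callables are hypothetical placeholders for whichever concrete sparse/dense retrieval and query reformulation components a given system adopts.

```python
def iterative_retrieve(question, corpus, retrieve, reformulate, max_steps=3, k=5):
    """Generic multi-step retrieval loop (illustrative sketch).

    retrieve(query, corpus, k) -> list of documents (sparse or dense retriever)
    reformulate(query, docs)   -> new query (natural-language string or dense vector)
    Both callables are placeholders for system-specific components.
    """
    query, collected = question, []
    for _ in range(max_steps):               # fixed-iteration stopping criterion
        docs = retrieve(query, corpus, k)
        new_docs = [d for d in docs if d not in collected]
        if not new_docs:                      # nothing new found: stop early
            break
        collected.extend(new_docs)
        query = reformulate(query, new_docs)  # condition the next query on evidence
    return collected
```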

Some other works produce dense representations to be used for searching in a latent space. For example, Multi-step Reasoner [29] adopts a Gated Recurrent Unit (GRU) [93] that takes token-level hidden representations from the Reader and the question as input to generate a new query vector; it is trained with Reinforcement Learning (RL) by measuring how well the answer extracted by the Reader matches the ground truth after reading the new set of paragraphs retrieved with the new query. MUPPET [35] applies a bidirectional attention layer adapted from [94] to produce a new query representation q̃, taking each obtained paragraph P and the initial question Q as input. MDR [85] uses a pre-trained masked language model (such as RoBERTa) as its encoder, which encodes the concatenation of the representation of the question and all the previous passages as a new dense query.

Comparably, an explicit query is easily understandable and controllable by humans but is constrained by the terms in the vocabulary, while an implicit query is generated in a semantic space, which removes the limitation of the vocabulary but lacks interpretability.

Retrieval Stopping Mechanism: The iterative retrieval manner yields greater possibilities of gathering more relevant passages, but retrieval efficiency drops dramatically as the number of iterations increases. Regarding the mechanism for stopping an iterative retrieval, most existing systems choose to specify a fixed number of iterations [29], [36], [85], [86], [89] or a maximum number of retrieved documents [35], [87], which can hardly guarantee retrieval effectiveness. [77] argue that setting a fixed number of documents to be obtained for all input questions is sub-optimal, and they instead develop an Adaptive Retriever based on the Document Retriever of DrQA [3]. They propose two methods to dynamically set the number of retrieved documents for each question, i.e. a simple threshold-based heuristic as well as a trainable classifier using ordinal ridge regression. Since it is difficult to specify the number of iterations for questions that require an arbitrary number of reasoning hops, Path Retriever [88] terminates its retrieval only when the end-of-evidence token (e.g., [EOE]) is detected by its Recurrent Retriever. This allows it to perform adaptive retrieval steps, but it obtains only one document at each step. To the best of our knowledge, it is still a critical challenge to develop an efficient iterative retriever without sacrificing accuracy.

In addition, typical IR systems pursue two optimization targets, i.e. precision and recall. The former is the ratio of relevant documents returned to the total number of documents returned, while the latter is the number of relevant documents returned out of the total number of relevant documents in the underlying repository. However, for OpenQA systems, recall is much more important than precision due to the post-processing usually applied to the returned documents [95], as described below.

3.2 Document Post-processing

Post-processing of the documents retrieved by the Retriever is often needed, since the retrieved documents inevitably contain irrelevant ones and, sometimes, the number of returned documents is so large that it overwhelms the capability of the Reader. Document Post-processing in the modern OpenQA architecture is similar to that in the traditional one, as introduced in Section 2.2.2. It aims at reducing the number of candidate documents and only allowing the most relevant ones to be passed to the next stage.

In the past few years, this module has been explored with much interest [28], [33], [34], [79], [96], [97], [98], [99]. For example, R3 [28] adopts a neural Passage Ranker and trains it jointly with the Reader through Reinforcement Learning (RL). DS-QA [33] adds a Paragraph Selector to remove noisy paragraphs from the retrieved documents by measuring the probability of each paragraph containing the answer among all candidate paragraphs. [96] explore two different passage rankers that assign scores to retrieved passages based on their likelihood of containing the answer to a given question. One is the InferSent Ranker, a feed-forward neural network that employs InferSent sentence representations [100] to rank passages based on semantic similarity with the question. The other is the Relation-Networks Ranker, which adopts Relation Networks [101] and focuses on measuring word-level relevance between the question and the passages. Their experiments show that word-level relevance matching significantly improves retrieval performance, while semantic similarity is more beneficial to overall performance. [34] develop a Paragraph Ranker using two separate RNNs following the dual-encoder architecture. Each question-passage pair is fed into the Ranker to obtain their representations independently, and the inner product is applied to compute the relevance score. [98] propose a time-aware re-ranking module that incorporates temporal information from different aspects to rank candidate documents over temporal collections of news articles.

The focus of research on this module has been learning to further re-rank the retrieved documents [28], [33], [34], [79]. However, with the development of Dense Retrievers, recent OpenQA systems tend to develop a trainable retriever that is capable of learning to rank and retrieve the most relevant documents simultaneously, which would result in the absence of this module.

3.3 Reader

The Reader is the other core component of a modern OpenQA system and also the main feature that differentiates QA systems from other IR or IE systems; it is usually implemented as a neural MRC model. It is aimed at inferring the answer to the question from a set of ordered documents, which is more challenging than the original MRC task, where in most cases only a single passage is given [18], [19], [90], [102], [103]. Broadly, existing Readers can be categorised into two types: Extractive Readers, which predict an answer span from the retrieved documents, and Generative Readers, which generate answers in natural language using sequence-to-sequence (Seq2Seq) models. Most prior OpenQA systems are equipped with an Extractive Reader [3], [16], [28], [29], [30], [33], [35], [78], [89], while some recent ones develop a Generative Reader [38], [85], [104].
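Before turning to concrete readers, the sketch below shows the span-prediction step that extractive readers share: given per-token start and end scores from an MRC model, pick the highest-scoring valid span. The scores here are random placeholders standing in for neural model outputs; as discussed in Section 3.3.1 below, real systems must additionally make such scores comparable across passages.

```python
import numpy as np

def best_span(start_logits, end_logits, max_len=15):
    """Return (start, end, score) of the highest-scoring span with end >= start."""
    best = (0, 0, -np.inf)
    for i, s in enumerate(start_logits):
        for j in range(i, min(i + max_len, len(end_logits))):
            score = s + end_logits[j]
            if score > best[2]:
                best = (i, j, score)
    return best

rng = np.random.default_rng(0)
n_tokens = 50
print(best_span(rng.standard_normal(n_tokens), rng.standard_normal(n_tokens)))
```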

3.3.1 Extractive Reader

An Extractive Reader is based on the assumption that the correct answer to a given question definitely exists in the context, and it usually focuses on learning to predict the start and end positions of an answer span in the retrieved documents. The approaches can be divided into two types according to whether the retrieved documents are processed independently or jointly for answer extraction.

Many prior systems [16], [33], [86], [88], [89] rank the retrieved documents by the probability of including the answer and extract the answer span from the concatenation of the question and the most probable document(s). For example, DS-QA [33] extracts an answer span from the paragraph selected by a dedicated Paragraph Selector module, which measures the probability of each paragraph containing the answer among all candidate paragraphs. DPR [16] computes the probability of a passage containing the answer and that of a token being the start or end position of an answer span using a BERT Reader, and selects the answer with the highest probability after combining them. Some systems develop graph-based Readers [88], [89] that learn to extract an answer span from a retrieved graph. For example, Graph Reader [89] takes the graph as input, first learns the passage representations mainly using Graph Convolutional Networks [105], and then extracts the answer span from the most probable passage. Path Retriever [88] leverages a BERT Reader to simultaneously re-rank the reasoning paths and extract the answer span from the path with the highest probability of containing the correct answer using multi-task learning. However, with the retrieved documents processed independently, the model fails to take advantage of different evidences from a long narrative document or multiple documents for answer extraction, which harms the performance especially in cases where the given questions require multi-hop reasoning.

In contrast, some systems [3], [78], [82], [94] extract an answer span based on all retrieved documents in a joint manner. For example, DrQA [3] decomposes the retrieved documents into paragraphs and extracts various features consisting of Part-of-Speech (POS), Named Entity (NE) and Term Frequency (TF), etc. Then the DrQA Reader, which is implemented with a multi-layer Bi-LSTM, takes as input the question and the paragraphs and predicts an answer span. In this process, to make answer scores comparable across paragraphs, it adopts the unnormalized exponential function and takes the argmax over all answer spans to obtain the final result. BERTserini [78] develops its Reader based on BERT by removing the softmax layer, enabling answer scores to be compared and aggregated across different paragraphs. [94] argue that using un-normalized scores (e.g., exponential scores or logits) for all answer spans is sub-optimal and propose a Shared-Normalization mechanism, modifying the objective function to normalize the start and end scores across all paragraphs, which achieves consistent performance gains. After that, many OpenQA systems [17], [29], [35], [36], [79], [82] develop their readers by applying this mechanism to original MRC models like BiDAF [24], BERT [27] and SpanBERT [106].

3.3.2 Generative Reader

A Generative Reader aims to generate answers that are as natural as possible instead of extracting answer spans, usually relying on Seq2Seq models. For example, S-Net [107] is developed by combining extraction and generation methods to complement each other. It employs an evidence extraction model to first predict the boundary of the text span serving as evidence for the answer, and then feeds it into a Seq2Seq answer synthesis model to generate the final answer. Recently, some OpenQA systems [38], [85], [104] adopt pretrained Seq2Seq language models such as BART [92] and T5 [108] to develop their Readers. For example, RAG [38] adopts a pretrained BART model as its reader to generate answers, taking as input the question as well as the documents retrieved by DPR [16]. FID [104] first encodes each retrieved document independently using a T5 or BART encoder and then performs attention over all the output representations with the decoder to generate the final answer. However, the current generation results often suffer from syntax errors, incoherence, or illogic [109]. Generative Readers need to be further explored and advanced.

3.4 Answer Post-processing

Neural MRC techniques have been advancing rapidly in recent years, but most existing MRC models still specialise in extracting answers only from a single or several short passages, and they tend to fail in cases where the correct answer comes from various evidences in a narrative document or multiple documents [110]. The Answer Post-processing module is developed to help detect the final answer from a set of answer candidates extracted by the Reader, taking into account their respective supporting facts. The methods adopted in existing systems can be classified into two categories, i.e. rule-based methods [34], [110] and learning-based methods [76], [110]. For example, [110] propose two answer re-rankers, a "strength-based re-ranker" and a "coverage-based re-ranker", to aggregate evidence from different passages to decide the final answer. The "strength-based re-ranker" is a rule-based method that simply performs counting or probability calculation based on the candidate predictions and does not require any training. The "coverage-based re-ranker" is developed using an attention-based match-LSTM model [111]. It first concatenates the passages containing the same answer into a pseudo passage and then measures how well this pseudo passage entails the answer for the given question. The experiments in [110] show that a weighted combination of the outputs of these different re-rankers achieves the best performance on several benchmarks. RankQA [76] develops an answer re-ranking module consisting of three steps: feature extraction, answer aggregation and re-ranking. Firstly, taking the top k answer candidates from the Reader as input, the module extracts a set of features from both the Retriever, such as document-question similarity, question length and paragraph length, and the Reader, such as the original score of the answer candidate, part-of-speech tags of the answer and named entities in the answer. Secondly, the module groups all answer candidates with an identical answer span to generate a set of aggregation features, like the sum, mean, minimum and maximum of the span scores, etc. Based on these features, a re-ranking network such as a feed-forward network or an RNN is used to learn to further re-rank the answers and select the best one as the final answer.
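As a concrete illustration of the rule-based end of this spectrum, the sketch below aggregates span-level candidates in the spirit of a strength-based re-ranker: identical answer strings are grouped, and their counts and summed scores decide the final answer. It is a simplified stand-in for the methods above rather than a re-implementation of any of them, and the candidate list is invented.

```python
from collections import defaultdict

def strength_rerank(candidates):
    """candidates: list of (answer_text, reader_score) from possibly many passages.
    Group identical (normalized) answers and rank groups by (count, summed score)."""
    groups = defaultdict(lambda: {"count": 0, "score": 0.0})
    for text, score in candidates:
        key = text.strip().lower()
        groups[key]["count"] += 1
        groups[key]["score"] += score

    ranked = sorted(groups.items(),
                    key=lambda kv: (kv[1]["count"], kv[1]["score"]),
                    reverse=True)
    return ranked[0][0]  # normalized text of the best answer group

cands = [("4 August 1961", 7.2), ("August 1961", 3.1), ("4 august 1961", 6.5)]
print(strength_rerank(cands))  # the two matching spans are grouped and win
```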

feed-forward network or an RNN is used to learn to further left-to-right decoder while BART and T5 use Transformer
re-rank the answers to select the best one as the final answer. encode-decoder closely following its original form [72].
Prior studies [115], [116] show that a large amount of knowl-
3.5 End-to-end Methods edge learned from large-scale textual data can be stored in
the underlying parameters, and thus these models are capa-
In recent years, various OpenQA systems [15], [30], [37] ble of answering questions without access to any external
have been developed, in which Retriever and Reader can be knowledge. For example, GPT-2 [112] is able to correctly
trained in an end-to-end manner. In addition, there are some generate the answer given only a natural language ques-
systems with only Retriever [73], and also some that are tion without fine-tuning. Afterwards, GPT-3 [113] achieves
able to answer open questions without the stage of retrieval, competitive performance with few-shot learning compared
which are mostly pre-trained Seq2Seq language models [92], to prior state-of-the-art fine-tuning approaches, in which
[108], [112], [113]. In the following, we will introduce these several demonstrations are given at inference as condition-
three types of systems, i.e. Retriever-Reader, Retriever-only ing [112] while weight update is not allowed. Recently,
and Retriever-free. [116] comprehensively evaluate the capability of language
models for answering questions without access to any
3.5.1 Retriever-Reader
external knowledge. Their experiments demonstrate that
Deep learning techniques enable Retriever and Reader in an pre-trained language models are able to gain impressive
OpenQA system to be end-to-end trainable [15], [30], [37], performance on various benchmarks and such Retrieval-
[38]. For example, [15] propose to jointly train Retriever free methods make a fundamentally different approach to
and Reader using multi-task learning based on the BiDAF building OpenQA systems.
model [24], simultaneously computing the similarity of a In Table 1, we summarize existing modern OpenQA
passage to the given question and predicting the start systems as well as the approaches adopted for different
and end position of an answer span. [37] argue that it components.
is sub-optimal to incorporate a standalone IR system in
an OpenQA system and develop ORQA that jointly trains
Retriever and Reader from question-answer pairs, with both 4 C HALLENGES AND B ENCHMARKS
developed using BERT [27]. REALM [30] is a pre-trained In this section, we first discuss key challenges to building
masked language model including a neural Retriever and a OpenQA systems followed by an analysis of existing QA
neural Reader, which is able to compute the gradient w.r.t. benchmarks that are commonly used not only for OpenQA
the model parameters and backpropagate the gradient all but also for MRC.
the way throughout the network. Since both modules are
developed using neural networks, the response speed to a
4.1 Challenges to OpenQA
question is a most critical issue during inference, especially
over a large collection of documents. To build an OpenQA system that is capable of answering
3.5.2 Retriever-only

To enhance the efficiency of answering questions, some systems are developed by adopting only a Retriever while omitting the Reader, which is usually the most time-consuming stage in other modern OpenQA systems. DenSPI [73] builds a question-agnostic phrase-level embedding index offline given a collection of documents such as Wikipedia articles. In the index, each candidate phrase from the corpus is represented by the concatenation of two vectors, i.e. a sparse vector (e.g., tf-idf) and a dense vector (e.g., from a BERT encoder). At inference time, the given question is encoded in the same way, and FAISS [114] is employed to search for the most similar phrase as the final answer. Experiments show that it obtains remarkable efficiency gains and reduces computational cost significantly while maintaining accuracy. However, the system computes the similarity between each phrase and the question independently, which ignores the contextual information that is usually crucial to answering questions.
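The sketch below uses NumPy with random toy vectors to illustrate the core of such a phrase-indexed approach: every candidate phrase is stored as the concatenation of a sparse and a dense vector, and the answer is simply the phrase whose vector has the largest inner product with the identically encoded question. The phrases and vectors are placeholders, not actual tf-idf or BERT representations.

```python
import numpy as np

rng = np.random.default_rng(0)

# Offline: build a question-agnostic phrase index for the whole corpus.
phrases = ["Barack Obama", "Honolulu", "August 4, 1961"]  # candidate answer phrases
dense = rng.normal(size=(len(phrases), 8))                # stand-in for dense (BERT-style) phrase vectors
sparse = rng.random(size=(len(phrases), 20))              # stand-in for sparse (tf-idf-style) vectors
phrase_index = np.concatenate([sparse, dense], axis=1)    # each phrase = sparse and dense concatenated

# Online: encode the question in the same way and score by inner product.
q_sparse = rng.random(size=20)
q_dense = rng.normal(size=8)
q_vec = np.concatenate([q_sparse, q_dense])

scores = phrase_index @ q_vec             # similarity of every indexed phrase to the question
answer = phrases[int(np.argmax(scores))]  # the highest-scoring phrase is returned directly, no Reader
print(answer)
```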
3.5.3 Retriever-free

Recent advances in pre-trained Seq2Seq language models such as GPT-2 [112], GPT-3 [113], BART [92] and T5 [108] have brought a surge of improvements on downstream NLG tasks, and most of these models are built upon Transformer-based architectures. In particular, GPT-2 and GPT-3 adopt a Transformer decoder architecture closely following its original form [72]. Prior studies [115], [116] show that a large amount of knowledge learned from large-scale textual data can be stored in the underlying parameters, and thus these models are capable of answering questions without access to any external knowledge. For example, GPT-2 [112] is able to correctly generate the answer given only a natural language question, without fine-tuning. Afterwards, GPT-3 [113] achieves competitive performance with few-shot learning compared to prior state-of-the-art fine-tuning approaches, in which several demonstrations are given at inference time as conditioning [112] while no weight update is performed. Recently, [116] comprehensively evaluate the capability of language models for answering questions without access to any external knowledge. Their experiments demonstrate that pre-trained language models are able to achieve impressive performance on various benchmarks, and such Retriever-free methods represent a fundamentally different approach to building OpenQA systems.
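As a rough illustration of such Retriever-free (closed-book) answering, the snippet below prompts a pre-trained Seq2Seq model with nothing but the question, assuming a recent version of the Hugging Face transformers library. The "t5-small" name is only a placeholder checkpoint; a model actually fine-tuned for closed-book QA (as studied in [116]) would be needed for reasonable accuracy.

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Placeholder checkpoint; no retrieved context is provided at any point.
model_name = "t5-small"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

question = "question: Where was Barack Obama born?"
inputs = tokenizer(question, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=16)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```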
In Table 1, we summarize existing modern OpenQA systems as well as the approaches adopted for different components.

TABLE 1: Approaches adopted for different components of existing modern OpenQA systems.

System | Category | Retriever | Document Post-processing | Reader | Answer Post-processing
DrQA [3] | Pipeline | Sparse | - | Extractive | -
R3 [28] | Pipeline | Sparse | RL | Extractive | -
DS-QA [33] | Pipeline | Sparse | SL | Extractive | -
[110] | Pipeline | Sparse | - | Extractive | Rule-based, Learning-based
[96] | Pipeline | Sparse | SL | Extractive | -
Paragraph Ranker [34] | Pipeline | Sparse | SL | Extractive | Rule-based
RankQA [76] | Pipeline | Sparse | - | Extractive | Learning-based
BERTserini [78] | Pipeline | Sparse | - | Extractive | -
Multi-Passage BERT [79] | Pipeline | Sparse | TL | Extractive | -
[15] | Pipeline | Dense | - | Extractive | -
DPR [16] | Pipeline | Dense | - | Extractive | -
ColBERT-QA [17] | Pipeline | Dense | - | Extractive | -
SPARTA [82] | Pipeline | Dense | - | Extractive | -
FID [104] | Pipeline | Sparse, Dense | - | Generative | -
Adaptive Retrieval [77] | Pipeline | Iterative | - | Extractive | -
Multi-step Reasoner [29] | Pipeline | Iterative | - | Extractive | -
GOLDEN Retriever [36] | Pipeline | Iterative | - | Extractive | -
MUPPET [35] | Pipeline | Iterative | - | Extractive | -
Path Retriever [88] | Pipeline | Iterative | - | Extractive | -
Graph Retriever [89] | Pipeline | Iterative | - | Extractive | -
DDRQA [86] | Pipeline | Iterative | - | Extractive | -
GAR [87] | Pipeline | Iterative | - | Extractive | Rule-based
MDR [85] | Pipeline | Iterative | - | Extractive, Generative | -
DenSPI [73] | End-to-end | Retriever | - | - | -
Retrieve-and-Read [32] | End-to-end | Dense | - | Extractive | -
ORQA [37] | End-to-end | Dense | - | Extractive | -
REALM [30] | End-to-end | Dense | - | Extractive | -
RAG [38] | End-to-end | Dense | - | Generative | -

RL: Reinforcement Learning; SL: Supervised Learning; TL: Transfer Learning.

4 CHALLENGES AND BENCHMARKS

In this section, we first discuss key challenges to building OpenQA systems, followed by an analysis of existing QA benchmarks that are commonly used not only for OpenQA but also for MRC.

4.1 Challenges to OpenQA

To build an OpenQA system that is capable of answering any input question is regarded as the ultimate goal of QA research. However, the research community still has a long way to go. Here we discuss some salient challenges that need to be addressed on the way. By doing this we hope the research gaps can be made clearer so as to accelerate progress in this field.

4.1.1 Distant Supervision

In the OpenQA setting, it is almost impossible to create in advance a collection containing "sufficient" high-quality training data for developing OpenQA systems. Distant supervision is therefore widely utilized, as it is able to label data automatically based on an existing corpus such as Wikipedia. However, distant supervision inevitably suffers from the wrong-label problem and often leads to a considerable amount of noisy data, significantly increasing the difficulty of modeling and training. Therefore, systems that are able to tolerate such noise are always in demand.
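A minimal sketch of how such distant supervision is typically implemented, and of where its noise comes from, is shown below: any retrieved passage that happens to contain the gold answer string is marked as a positive training example, even if it does not actually support the answer. The names and example passages are illustrative only.

```python
import re

def distantly_label(answer, retrieved_passages):
    """Mark every passage containing the gold answer string as a positive
    example and record the matched character spans; passages without a match
    are treated as negatives. Matches can be spurious (the wrong-label problem)."""
    pattern = re.compile(re.escape(answer), flags=re.IGNORECASE)
    positives = []
    for passage in retrieved_passages:
        spans = [(m.start(), m.end()) for m in pattern.finditer(passage)]
        if spans:
            positives.append((passage, spans))
    return positives

passages = [
    "Barack Obama was born in Honolulu, Hawaii.",                # genuine evidence
    "Honolulu hosted the APEC summit in 2011.",                  # contains the answer but does not support it: noisy label
    "Obama served as the 44th President of the United States.",  # no match: negative
]
print(distantly_label("Honolulu", passages))
```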
4.1.2 Retrieval Effectiveness and Efficiency

Retrieval effectiveness means the ability of the system to separate relevant documents from irrelevant ones for a given question. The system often suffers from "term-mismatch", which results in a failure to retrieve relevant documents; on the other hand, the system may receive noisy documents that contain the exact terms in the question, or even the correct answer span, but are irrelevant to the question. Both issues increase the difficulty of accurately understanding the context during answer inference. Some neural retrieval methods [15], [16], [30], [37], [73], [117] have been proposed recently for improving retrieval effectiveness. For example, [37] and [30] jointly train the retrieval and reader modules, which take advantage of pre-trained language models and regard the retrieval model as a latent variable. However, these neural retrieval methods often suffer from low efficiency. Some works [15], [16], [37], [117] propose to pre-compute a question-independent embedding for each document or phrase and construct the embedding index only once. Advanced sub-linear Maximum Inner Product Search (MIPS) algorithms [69], [70], [71] are usually employed to obtain the top-K related documents given a question. However, the response speed still has a huge gap from that of typical IR techniques when the system faces a massive set of documents.

Retrieval effectiveness and efficiency are both crucial factors for the deployment of an OpenQA system in practice, especially in real-time scenarios. How to consistently enhance both aspects (and with a good trade-off between them) will be a long-standing challenge in the advancement of OpenQA.
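To ground the efficiency point, the sketch below pre-computes question-independent passage embeddings once and then serves queries via approximate inner-product search, assuming the FAISS library is available. The inverted-file index used here is just one concrete example of approximate MIPS; the algorithms cited above ([69], [70], [71]) are different methods, but the usage pattern of an offline index plus fast online search is similar. Random vectors stand in for real encoder outputs.

```python
import numpy as np
import faiss  # e.g. the faiss-cpu package

d, num_passages = 128, 100_000
rng = np.random.default_rng(0)

# Offline: encode the corpus once and build an approximate inner-product index.
passage_vecs = rng.normal(size=(num_passages, d)).astype("float32")
quantizer = faiss.IndexFlatIP(d)  # coarse quantizer for the inverted file
index = faiss.IndexIVFFlat(quantizer, d, 256, faiss.METRIC_INNER_PRODUCT)
index.train(passage_vecs)         # learn the coarse clusters
index.add(passage_vecs)
index.nprobe = 8                  # probe only a few clusters, giving sub-linear search cost

# Online: a single question embedding is matched against the whole index.
question_vec = rng.normal(size=(1, d)).astype("float32")
scores, passage_ids = index.search(question_vec, 10)  # top-10 passages by inner product
print(passage_ids[0])
```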

4.1.3 Knowledge Incorporation

Incorporating knowledge beyond the context documents and the given question, e.g. world knowledge, commonsense or domain-specific knowledge, is a key enhancement to OpenQA systems [7]. Before making use of such knowledge, we first need to consider how to represent it. There are generally two ways: explicit and implicit.

In the explicit manner, knowledge is usually transformed into the form of triplets and stored in classical KBs such as DBPedia [118], Freebase [119] and Yago2 [120], which are easily understood by humans. Some early QA systems attempt to incorporate knowledge to help find the answer in this way. For example, IBM Watson DeepQA [58] combines a Web search engine and a KB to compete with human champions on the American TV show "Jeopardy"; QuASE [121] searches for a list of the most prominent sentences from a Web search engine (e.g., Google.com), and then utilizes entity linking over Freebase [119] to detect the correct answer from the selected sentences. In recent years, with the popularity of Graph Neural Networks (GNN), some works [89], [122], [123] propose to gain relevant information not only from a text corpus but also from a KB to facilitate evidence retrieval and question answering. For example, [122] construct a question-specific sub-graph containing sentences from the corpus, and entities and relations from the KB; then, graph CNN based methods [105], [124], [125] are used to infer the final answer over the sub-graph. However, storing knowledge in an explicit manner also has its problems, such as incomplete and out-of-date knowledge. Moreover, constructing a KB is both labor-intensive and time-consuming.

On the other hand, with the implicit approach, a large amount of knowledge [115] can be stored in the underlying parameters learned from massive texts by pre-trained language models such as BERT [27], XLNet [126] and T5 [108], and then applied smoothly in downstream tasks. Recently, pre-trained language models have been widely researched and applied to developing OpenQA systems [16], [30], [32], [37], [78], [87], [88]. For example, [32], [78], [88] develop their Reader using BERT [27], while [16], [37] use BERT to develop both Retriever and Reader. In addition, pre-trained language models like GPT-2 [112] are able to generate the answer given only a natural language question. However, such systems act like a "black box", and it is nearly impossible to know what knowledge has exactly been stored and used for a particular answer. They lack the interpretability that is especially crucial for real-world applications.
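A quick way to observe such implicitly stored knowledge, in the spirit of the probing study in [115], is to query a masked language model with a cloze-style statement. This sketch assumes the Hugging Face transformers library; bert-base-uncased is used purely as an example checkpoint.

```python
from transformers import pipeline

# The model's parameters are the only "knowledge source" queried here.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# Cloze-style probe: no external document or KB is provided.
for prediction in fill_mask("The capital of France is [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))
```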
Knowledge-enhanced OpenQA is desired not only because knowledge is helpful for generating the answer, but also because it serves as a source for interpreting the obtained answer. How to represent and make full use of knowledge for OpenQA still needs more research effort.

4.1.4 Conversational OpenQA

Non-conversational OpenQA is challenged by several problems that are almost impossible to resolve in a single turn, such as the lengthy wording required by a complex question (e.g. Who is the second son of the first Prime Minister of Singapore?), ambiguity that results in an incorrect response (e.g. When was Michael Jordan born?), and insufficient background knowledge from the user that leads to unreasonable results (e.g. Why do I have a bad headache today?). These problems can be better addressed under a conversational setting.

Conversational systems [150], [151] are equipped with a dialogue-like interface that enables interaction between human users and the system for information exchange. The complex question example given above can be decomposed into two simple questions asked sequentially: "Who is the first Prime Minister of Singapore?" followed by "Who is the second son of him?". When ambiguity is detected in the question, the conversational OpenQA system is expected to raise a follow-up question for clarification, such as "Do you mean the basketball player?". If a question with insufficient background knowledge is given, a follow-up question can also be asked to gather more information from human users for arriving at the final answer. To achieve these goals, three major challenges need to be addressed.

First, conversational OpenQA should have the ability to determine whether a question is unanswerable, such as detecting whether ambiguity exists in the question or whether the current context is sufficient for generating an answer. Research on unanswerable questions has attracted a lot of attention in the development of MRC over the past few years [20], [22], [128], [144], [152], [153]. However, current OpenQA systems rarely incorporate such a mechanism to determine the unanswerability of questions, which is particularly necessary for conversational OpenQA systems.

Second, when the question is classified as unanswerable due to ambiguity or insufficient background knowledge, the conversational OpenQA system needs to generate a follow-up question [154]. Question Generation (QG) can then be considered a sub-task of QA and a crucial module of conversational OpenQA. In the past few years, research on automatic question generation from text passages has received growing attention [155], [156], [157], [158]. Compared to the typical QG task, which targets generating a question based on a given passage in which the answer to the generated question can be found, the question generated in conversational OpenQA is meant to be answered by the human user.

The third challenge is how to better model the conversation history, not only in the Reader but also in the Retriever [159]. The recently released conversational MRC datasets like CoQA [133] and QuAC [134] are aimed at enabling a Reader to answer the latest question by comprehending not only the given context passage but also the conversation history so far. As they provide context passages in their task setting, they omit the stage of document retrieval, which is necessary when it comes to OpenQA. Recently, in [159] the QuAC dataset is extended to a new OR-QuAC dataset by adapting it to an open-retrieval setting, and an open-retrieval conversational question answering system (OpenConvQA) is developed, which is able to retrieve relevant passages from a large collection before inferring the answer, taking the conversational QA pairs into account. OpenConvQA tries to answer a given question without any specified context, and thus enjoys a wider scope of application and better accords with the real-world QA behavior of human beings. However, the best performance (F1: 29.4) of the system on OR-QuAC is far lower than the state-of-the-art (F1: 74.4¹) on QuAC, indicating that it is a much bigger challenge when it comes to an open-retrieval setting.
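As a simple illustration of injecting conversation history into the Retriever, the sketch below merely concatenates previous QA turns with the current question to form the retrieval query. This is only a naive baseline for exposition, not the mechanism used by the OpenConvQA system in [159].

```python
def build_retrieval_query(history, current_question, max_turns=3):
    """Concatenate the last few conversation turns with the current question so
    that the Retriever can resolve references such as pronouns."""
    recent = history[-max_turns:]
    context = " ".join(f"{q} {a}" for q, a in recent)
    return f"{context} {current_question}".strip()

history = [("Who is the first Prime Minister of Singapore?", "Lee Kuan Yew")]
print(build_retrieval_query(history, "Who is his second son?"))
# The retriever now sees the entity from the previous turn, so "his" can be grounded.
```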
4.2 Benchmarks

A large number of QA benchmarks have been released in the past decade, which are summarized in Table 2. Here we provide a brief analysis of them with a focus on their respective characteristics and their distributions w.r.t. background information domain, number of questions and year of release. As aforementioned in this paper, the success of the MRC task is a crucial step towards more advanced OpenQA, and we believe the future advancement of MRC methods will significantly promote OpenQA systems. Thus, we include not only the datasets for OpenQA but also those solely for MRC to make our survey more comprehensive.

1. As stated on https://quac.ai/ in June 2020.

TABLE 2: Dataset: The name of the dataset. Domain: The domain of background information in the dataset. #Q (k): The number of questions contained in the dataset, with unit (k) denoting "thousand". Answer Type: The answer types included in the dataset. Context in MRC: The context documents or passages that are given to generate answers in MRC tasks. OpenQA: This column indicates whether the dataset is applicable for developing OpenQA systems, with the tick mark (✓) denoting yes.

Dataset | Domain | #Q (k) | Answer Type | Context in MRC | OpenQA
MCTest [102] | Children's story | 2.0 | Multiple choices | A children's story |
CNN/Daily Mail [18] | News | 1,384.8 | Entities | A passage from one CNN or Daily Mail news | ✓
CBT [127] | Children's story | 687.3 | Multiple choices | A children's story |
SQuAD [19] | Wikipedia | 108.0 | Spans | A passage from Wikipedia | ✓
MS MARCO [20] | Web search | 1,010.9 | Free-form, Boolean, Unanswerable | Multiple passages from Bing Search | ✓
NewsQA [128] | News | 119.6 | Spans, Unanswerable | A news article from CNN news | ✓
SearchQA [129] | Web search | 140.4 | Spans | Multiple passages from Google Search | ✓
TriviaQA [130] | Trivia | 95.9 | Spans, Free-form | One or multiple passages |
RACE [21] | Science | 97.6 | Multiple choices | A passage from mid/high school exams |
Quasar-T [91] | Reddit | 43.0 | Free-form | Multiple documents from Reddit | ✓
Quasar-S [91] | Technical | 37.0 | Entities | A passage from Stack Overflow | ✓
NarrativeQA [131] | Others | 46.7 | Free-form | A summary and a full story from movie scripts |
DuReader [132] | Web search | 200.0 | Free-form, Boolean | Multiple passages from Baidu Search or Baidu Zhidao | ✓
SQuAD 2.0 [22] | Wikipedia | 158.0 | Spans, Unanswerable | A passage from Wikipedia | ✓
CoQA [133] | Others | 127.0 | Free-form, Boolean, Unanswerable | A passage and conversation history |
QuAC [134] | Wikipedia | 98.4 | Spans, Boolean, Unanswerable | A passage from Wikipedia and conversation history | ✓
ARC [135] | Science | 7.7 | Multiple choices | No additional context |
ShARC [136] | Others | 32.4 | Boolean | A rule text, a scenario and conversation history |
CliCR [137] | Medical | 104.9 | Spans | A passage from clinical case reports |
HotpotQA [90] | Wikipedia | 113.0 | Spans, Boolean, Unanswerable | A pair of paragraphs from Wikipedia | ✓
MultiRC [138] | Others | 6.0 | Multiple choices | Multiple sentences |
SWAG [139] | Commonsense | 113.0 | Multiple choices | A piece of video caption |
DuoRC [140] | Others | 186.0 | Free-form, Spans, Unanswerable | A movie plot story |
WikiHop [91] | Wikipedia | 51.3 | Multiple choices | Multiple passages from Wikipedia | ✓
MedHop [91] | Medical | 2.5 | Multiple choices | Multiple passages from MEDLINE |
ReCoRD [141] | News | 120.7 | Multiple choices | A passage from CNN/Daily Mail News |
OpenBookQA [5] | Science | 5.9 | Multiple choices | Open book |
CommonsenseQA [142] | Commonsense | 12.2 | Multiple choices | No additional context |
CODAH [143] | Commonsense | 2.8 | Multiple choices | No additional context |
DROP [103] | Wikipedia | 96.5 | Free-form | A passage from Wikipedia |
Natural Questions [144] | Wikipedia | 323.0 | Spans, Boolean, Unanswerable | An article from Wikipedia | ✓
Cosmos QA [145] | Commonsense | 35.6 | Multiple choices | A passage |
BoolQ [146] | Wikipedia | 16.0 | Boolean | An article from Wikipedia | ✓
ELI5 [147] | Reddit | 272.0 | Free-form | A set of web documents |
TWEETQA [148] | Social media | 13.7 | Free-form | A tweet from Twitter |
XQA [149] | Wikipedia | 90.6 | Entities | A passage from Wikipedia in a target language | ✓

The major criterion for judging the applicability of a QA dataset to developing OpenQA systems is whether it involves a separate (usually large-scale) document set [90], or whether it has relatively easy access to such an information source [18], [22] from which the answers to questions can be inferred. For example, HotpotQA [90] provides a full-wiki setting that requires a system to find the answer to a question within the scope of the entire Wikipedia. [3] extend SQuAD [19] to SQuADopen by using the entire Wikipedia as its information source. We summarize and illustrate the distributions of the datasets listed in Table 2 w.r.t. year of release in Fig. 7, background information domain in Fig. 8, and number of questions in Fig. 6. Also, we summarize the information source types of the datasets that are applicable to developing OpenQA systems in Table 3.

5 CONCLUSION

In this work we presented a comprehensive survey on the latest progress of Open-domain QA (OpenQA) systems. In particular, we first reviewed the development of OpenQA and illustrated a "Retriever-Reader" architecture. Moreover, we reviewed a variety of existing OpenQA systems as well as their different approaches. Finally, we discussed some salient challenges towards OpenQA, followed by a summary of various QA benchmarks, hoping to reveal the research gaps so as to push further progress in this field.

Fig. 6: Number of questions in each dataset

TABLE 3: The information source of the datasets that are applicable for developing OpenQA systems. Source Type: The type of background information source. Source: The background information source in the OpenQA setting.

Source Type | Source | Dataset
Wikipedia | Full Wikipedia | SQuADopen [3], HotpotQA [90], QuAC [134], WikiHop [91], Natural Questions [144], BoolQ [146], XQA [149]
Search Engine | Bing Search | MS MARCO [20]
Search Engine | Google Search | SearchQA [129]
Search Engine | Baidu Search | DuReader [132]
Online News | News from CNN/Daily Mail | CNN/Daily Mail [18]
Online News | News from CNN | NewsQA [128]
Internet Forum | Reddit | Quasar-T [91]
Internet Forum | Stack Overflow | Quasar-S [91]

Fig. 7: Distribution of popular datasets w.r.t. release year

Based on our review of prior research, we claim that OpenQA will continue to be a research hot-spot. In particular, single-step and multi-step neural retrievers will attract increasing attention due to the demand for more accurate retrieval of related documents. Also, more end-to-end OpenQA systems will be developed with the advancement of deep learning techniques. Knowledge-enhanced OpenQA is very promising, not only because knowledge is helpful for generating the answer but also because it serves as a source for interpreting the obtained answer; however, how to represent and make full use of knowledge for OpenQA still needs more research effort. Furthermore, equipping OpenQA with a dialogue-like interface that enables interaction between human users and the system for information exchange is expected to attract increasing attention, as it aligns well with real-world application scenarios.

6 ACKNOWLEDGEMENTS

This research is supported by the National Research Foundation, Singapore under its International Research Centres in Singapore Funding Initiative and A*STAR under its RIE 2020 Advanced Manufacturing and Engineering (AME) programmatic grant, Award No. A19E2b0098, Project name: K-EMERGE: Knowledge Extraction, Modelling, and Explainable Reasoning for General Expertise. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not reflect the views of the National Research Foundation and A*STAR, Singapore.

Fig. 8: Datasets distribution w.r.t. background information domain

[12] Z. Huang, S. Xu, M. Hu, X. Wang, J. Qiu, Y. Fu, Y. Zhao, Y. Peng, and C. Wang, "Recent trends in deep learning based open-domain textual question answering systems," IEEE Access, vol. 8, pp. 94341–94356, 2020.
[13] T. Lei, Z. Shi, D. Liu, L. Yang, and F. Zhu, “A novel cnn-
based method for question classification in intelligent question
answering,” in Proceedings of the 2018 International Conference on
Algorithms, Computing and Artificial Intelligence. Association for
Computing Machinery, 2018.
[14] W. Xia, W. Zhu, B. Liao, M. Chen, L. Cai, and L. Huang, “Novel
architecture for long short-term memory used in question classi-
fication,” Neurocomputing, vol. 299, pp. 20–31, 2018.
[15] K. Nishida, I. Saito, A. Otsuka, H. Asano, and J. Tomita,
“Retrieve-and-read: Multi-task learning of information retrieval
and reading comprehension,” in Proceedings of the 27th ACM
International Conference on Information and Knowledge Management,
ser. CIKM ’18. Association for Computing Machinery, 2018, p.
647–656.
[16] V. Karpukhin, B. Oğuz, S. Min, L. Wu, S. Edunov, D. Chen, and
W.-t. Yih, “Dense passage retrieval for open-domain question
answering,” arXiv preprint arXiv:2004.04906, 2020.
[17] O. Khattab, C. Potts, and M. Zaharia, “Relevance-guided
Supervision for OpenQA with ColBERT,” 2020. [Online].
Available: http://arxiv.org/abs/2007.00814
[18] K. M. Hermann, T. Kočiský, E. Grefenstette, L. Espeholt, W. Kay,
M. Suleyman, and P. Blunsom, “Teaching machines to read and
comprehend,” in Proceedings of the 28th International Conference on
Neural Information Processing Systems - Volume 1. MIT Press, 2015,
2020 Advanced Manufacturing and Engineering (AME) pro- pp. 1693–1701.
grammatic grant, Award No. - A19E2b0098, Project name [19] P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang, “SQuAD:
100,000+ questions for machine comprehension of text,” in Pro-
- K-EMERGE: Knowledge Extraction, Modelling, and Ex- ceedings of the 2016 Conference on Empirical Methods in Natural
plainable Reasoning for General Expertise. Any opinions, Language Processing. Association for Computational Linguistics,
findings and conclusions or recommendations expressed in 2016, pp. 2383–2392.
[20] T. Nguyen, M. Rosenberg, X. Song, J. Gao, S. Tiwary, R. Ma-
this material are those of the author(s) and do not reflect jumder, and L. Deng, “MS MARCO: A human generated machine
the views of National Research Foundation and A*STAR, reading comprehension dataset,” 2016.
Singapore. [21] G. Lai, Q. Xie, H. Liu, Y. Yang, and E. H. Hovy, “RACE: large-
scale reading comprehension dataset from examinations,” CoRR,
vol. abs/1704.04683, 2017.
R EFERENCES [22] P. Rajpurkar, R. Jia, and P. Liang, “Know what you don’t know:
[1] B. F. Green, Jr., A. K. Wolf, C. Chomsky, and K. Laughery, Unanswerable questions for SQuAD,” in Proceedings of the 56th
“Baseball: An automatic question-answerer,” in Papers Presented Annual Meeting of the Association for Computational Linguistics (Vol-
at the May 9-11, 1961, Western Joint IRE-AIEE-ACM Computer ume 2: Short Papers). Association for Computational Linguistics,
Conference. ACM, 1961, pp. 219–224. 2018, pp. 784–789.
[2] J. Falconer, “Google: Our new search strategy is [23] J. Li, M. Liu, M.-Y. Kan, Z. Zheng, Z. Wang, W. Lei, T. Liu, and
to compute answers, not links,” 2011. [Online]. B. Qin, “Molweni: A challenge multiparty dialogues-based ma-
Available: https://thenextweb.com/google/2011/06/01/ chine reading comprehension dataset with discourse structure,”
google-our-new-search-strategy-is-to-compute-answers-not-links/ 2020.
[3] D. Chen, A. Fisch, J. Weston, and A. Bordes, “Reading Wikipedia [24] M. J. Seo, A. Kembhavi, A. Farhadi, and H. Hajishirzi, “Bidi-
to answer open-domain questions,” in Proceedings of the 55th An- rectional attention flow for machine comprehension,” in 5th
nual Meeting of the Association for Computational Linguistics (Volume International Conference on Learning Representations, ICLR 2017.
1: Long Papers). Association for Computational Linguistics, 2017, OpenReview.net, 2017.
pp. 1870–1879. [25] W. Wang, N. Yang, F. Wei, B. Chang, and M. Zhou, “Gated
[4] E. M. Voorhees, “The trec-8 question answering track report,” self-matching networks for reading comprehension and question
NIST, Tech. Rep., 1999. answering,” in Proceedings of the 55th Annual Meeting of the
[5] T. Mihaylov, P. Clark, T. Khot, and A. Sabharwal, “Can a suit of Association for Computational Linguistics, ACL. Association for
armor conduct electricity? A new dataset for open book question Computational Linguistics, 2017, pp. 189–198.
answering,” CoRR, vol. abs/1809.02789, 2018. [26] A. W. Yu, D. Dohan, M. Luong, R. Zhao, K. Chen, M. Norouzi,
[6] S. M. Harabagiu, S. J. Maiorano, and M. A. Paundefinedca, and Q. V. Le, “Qanet: Combining local convolution with global
“Open-domain textual question answering techniques,” Nat. self-attention for reading comprehension,” in International Confer-
Lang. Eng., vol. 9, no. 3, p. 231–267, 2003. ence on Learning Representations, ICLR. OpenReview.net, 2018.
[7] V. C. John Burger, Claire Cardie et al., “Issues, tasks and program [27] J. Devlin, M. Chang, K. Lee, and K. Toutanova, “BERT: pre-
structures to roadmap research in question & answering (q &a,” training of deep bidirectional transformers for language under-
NIST, Tech. Rep., 2001. standing,” CoRR, vol. abs/1810.04805, 2018.
[8] O. Kolomiyets and M.-F. Moens, “A survey on question answer- [28] S. Wang, M. Yu, X. Guo, Z. Wang, T. Klinger, W. Zhang, S. Chang,
ing technology from an information retrieval perspective,” Inf. G. Tesauro, B. Zhou, and J. Jiang, “R3: Reinforced ranker-reader
Sci., vol. 181, no. 24, pp. 5412–5434, 2011. for open-domain question answering,” in AAAI, 2018.
[9] A. Allam and M. Haggag, “The question answering systems: A [29] R. Das, S. Dhuliawala, M. Zaheer, and A. McCallum, “Multi-step
survey,” International Journal of Research and Reviews in Information retriever-reader interaction for scalable open-domain question
Sciences, pp. 211–221, 2012. answering,” in International Conference on Learning Representations,
[10] A. Mishra and S. K. Jain, “A survey on question answering 2019.
systems with classification,” J. King Saud Univ. Comput. Inf. Sci., [30] K. Guu, K. Lee, Z. Tung, P. Pasupat, and M.-W. Chang, “Realm:
vol. 28, no. 3, p. 345–361, 2016. Retrieval-augmented language model pre-training,” CoRR, 2020.
[11] M. Paşca, “Open-domain question answering from large text [31] M. Ding, C. Zhou, Q. Chen, H. Yang, and J. Tang, “Cognitive
collections,” Computational Linguistics, vol. 29, no. 4, pp. 665–667, graph for multi-hop reading comprehension at scale,” in Proceed-
2003. ings of the 57th Annual Meeting of the Association for Computational

Linguistics. Association for Computational Linguistics, 2019, pp. [49] J. Xu and W. B. Croft, “Query expansion using local and global
2694–2703. document analysis,” in Proceedings of the 19th Annual International
[32] Y. Nie, S. Wang, and M. Bansal, “Revealing the importance of ACM SIGIR Conference on Research and Development in Information
semantic retrieval for machine reading at scale,” in Proceedings Retrieval. Association for Computing Machinery, 1996, p. 4–11.
of the 2019 Conference on Empirical Methods in Natural Language [50] C. Carpineto and G. Romano, “A survey of automatic query
Processing and the 9th International Joint Conference on Natural expansion in information retrieval,” ACM Computing Survey,
Language Processing (EMNLP-IJCNLP). Association for Compu- vol. 44, no. 1, 2012.
tational Linguistics, 2019, pp. 2553–2566. [51] C. Quirk, C. Brockett, and W. Dolan, “Monolingual machine
[33] Y. Lin, H. Ji, Z. Liu, and M. Sun, “Denoising distantly supervised translation for paraphrase generation,” in Proceedings of the 2004
open-domain question answering,” in Proceedings of the 56th An- Conference on Empirical Methods in Natural Language Processing.
nual Meeting of the Association for Computational Linguistics (Volume Association for Computational Linguistics, 2004, pp. 142–149.
1: Long Papers). Association for Computational Linguistics, 2018, [52] C. Bannard and C. Callison-Burch, “Paraphrasing with bilingual
pp. 1736–1745. parallel corpora,” in Proceedings of the 43rd Annual Meeting of the
[34] J. Lee, S. Yun, H. Kim, M. Ko, and J. Kang, “Ranking paragraphs Association for Computational Linguistics (ACL’05). Association
for improving answer recall in open-domain question answer- for Computational Linguistics, 2005, pp. 597–604.
ing,” in Proceedings of the 2018 Conference on Empirical Methods [53] S. Zhao, C. Niu, M. Zhou, T. Liu, and S. Li, “Combining mul-
in Natural Language Processing. Association for Computational tiple resources to improve SMT-based paraphrasing model,” in
Linguistics, 2018, pp. 565–569. Proceedings of ACL-08: HLT. Association for Computational
[35] Y. Feldman et al., “Multi-hop paragraph retrieval for open- Linguistics, 2008, pp. 1021–1029.
domain question answering,” in Proceedings of the 57th Annual [54] S. Wubben, A. van den Bosch, and E. Krahmer, “Paraphrase
Meeting of the Association for Computational Linguistics. Associa- generation as monolingual translation: Data and evaluation,”
tion for Computational Linguistics, 2019, pp. 2296–2309. in Proceedings of the 6th International Natural Language Generation
[36] P. Qi, X. Lin, L. Mehr, Z. Wang, and C. D. Manning, “An- Conference, 2010.
swering complex open-domain questions through iterative query [55] X. Li and D. Roth, “Learning question classifiers,” in COLING
generation,” in Proceedings of the 2019 Conference on Empirical 2002: The 19th International Conference on Computational Linguistics,
Methods in Natural Language Processing and the 9th International 2002.
Joint Conference on Natural Language Processing (EMNLP-IJCNLP). [56] J. Suzuki, H. Taira, Y. Sasaki, and E. Maeda, “Question clas-
Association for Computational Linguistics, 2019, pp. 2590–2602. sification using HDAG kernel,” in Proceedings of the ACL 2003
[37] K. Lee, M.-W. Chang, and K. Toutanova, “Latent retrieval for Workshop on Multilingual Summarization and Question Answering.
weakly supervised open domain question answering,” in Proceed- Association for Computational Linguistics, 2003, pp. 61–68.
ings of the 57th Annual Meeting of the Association for Computational [57] D. Zhang and W. S. Lee, “Question classification using support
Linguistics. Association for Computational Linguistics, 2019, pp. vector machines,” in Proceedings of the 26th Annual International
6086–6096. ACM SIGIR Conference on Research and Development in Informaion
[38] P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, Retrieval, ser. SIGIR ’03. Association for Computing Machinery,
N. Goyal, H. Küttler, M. Lewis, W.-t. Yih, T. Rocktäschel, 2003, p. 26–32.
S. Riedel, and D. Kiela, “Retrieval-Augmented Generation for [58] D. Ferrucci, E. Brown, J. Chu-Carroll, J. Fan, D. Gondek, A. A.
Knowledge-Intensive NLP Tasks,” 2020. [Online]. Available: Kalyanpur, A. Lally, J. W. Murdock, E. Nyberg, J. Prager,
http://arxiv.org/abs/2005.11401 N. Schlaefer, and C. Welty, “Building watson: An overview of
the deepqa project,” AI Magazine, vol. 31, no. 3, pp. 59–79, 2010.
[39] W. A. Woods, “Progress in natural language understanding: An
[59] H. Tayyar Madabushi and M. Lee, “High accuracy rule-based
application to lunar geology,” in Proceedings of the June 4-8, 1973,
question classification using question syntax and semantics,” in
National Computer Conference and Exposition. ACM, 1973, pp.
Proceedings of COLING 2016, the 26th International Conference on
441–450.
Computational Linguistics: Technical Papers. The COLING 2016
[40] J. Kupiec, “Murax: A robust linguistic approach for question Organizing Committee, 2016, pp. 1220–1230.
answering using an on-line encyclopedia,” in Proceedings of the
[60] C. D. Manning, P. Raghavan, and H. Schütze, Introduction to
16th Annual International ACM SIGIR Conference on Research and
Information Retrieval. USA: Cambridge University Press, 2008.
Development in Information Retrieval. Association for Computing
[61] S. Robertson, H. Zaragoza et al., “The probabilistic relevance
Machinery, 1993, p. 181–190.
framework: Bm25 and beyond,” Foundations and Trends in Infor-
[41] E. M. Voorhees, “Overview of the trec 2001 question answering mation Retrieval, vol. 3, no. 4, pp. 333–389, 2009.
track,” in In Proceedings of TREC-10, 2001, pp. 42–51. [62] W. B. Croft and J. Lafferty, Language modeling for information
[42] ——, “Overview of the TREC 2002 question answering track,” in retrieval. Kluwer Academic Publ., 2003.
Proceedings of The Eleventh Text REtrieval Conference, TREC 2002, [63] E. M. Voorhees, “Overview of the trec 2004 question answering
Gaithersburg, Maryland, USA, November 19-22, 2002, ser. NIST track,” in In Proceedings of the Thirteenth Text REtreival Conference
Special Publication, vol. 500-251. National Institute of Standards (TREC 2004), 2005, pp. 52–62.
and Technology (NIST), 2002. [64] D. Mollá, M. van Zaanen, and D. Smith, “Named entity recog-
[43] E. Voorhees, “Overview of the trec 2003 question answering nition for question answering,” in Proceedings of the Australasian
track,” NIST, Tech. Rep., 2003. Language Technology Workshop 2006, 2006, pp. 51–58.
[44] C. Kwok, O. Etzioni, O. Etzioni, and D. S. Weld, “Scaling question [65] M. Wang, “A survey of answer extraction techniques in factoid
answering to the web,” ACM Transactions on Information Systems, question answering,” Computational Linguistics, vol. 1, no. 1, pp.
vol. 19, no. 3, pp. 242–262, 2001. 1–14, 2006.
[45] E. Brill, S. Dumais, and M. Banko, “An analysis of the AskMSR [66] M. M. Soubbotin and S. M. Soubbotin, “Patterns of potential an-
question-answering system,” in Proceedings of the 2002 Conference swer expressions as clues to the right answers,” in In Proceedings
on Empirical Methods in Natural Language Processing (EMNLP of the 10th Text REtrieval Conference (TREC-10), 2001.
2002). Association for Computational Linguistics, 2002, pp. 257– [67] D. Ravichandran and E. Hovy, “Learning surface text patterns
264. for a question answering system,” in Proceedings of the 40th
[46] Z. Zheng, “Answerbus question answering system,” in Proceed- Annual Meeting of the Association for Computational Linguistics.
ings of the Second International Conference on Human Language Association for Computational Linguistics, 2002, pp. 41–47.
Technology Research, ser. HLT ’02. Morgan Kaufmann Publishers [68] D. Shen, G.-J. M. Kruijff, and D. Klakow, “Exploring syntactic
Inc., 2002, p. 399–404. relation patterns for question answering,” in Second International
[47] D. Moldovan, M. Pasca, S. Harabagiu, and M. Surdeanu, “Per- Joint Conference on Natural Language Processing: Full Papers, 2005.
formance issues and error analysis in an open-domain question [69] P. Ram and A. G. Gray, “Maximum inner-product search using
answering system,” in Proceedings of the 40th Annual Meeting cone trees,” in Proceedings of the 18th ACM SIGKDD international
of the Association for Computational Linguistics. Association for conference on Knowledge discovery and data mining, 2012, pp. 931–
Computational Linguistics, 2002, pp. 33–40. 939.
[48] R. Sun, J. Jiang, Y. F. Tan, H. Cui, T.-S. Chua, and M.-Y. Kan, [70] A. Shrivastava and P. Li, “Asymmetric lsh (alsh) for sublinear
“Using syntactic and semantic relation analysis in question an- time maximum inner product search (mips),” in Advances in
swering,” in TREC, 2005. Neural Information Processing Systems, 2014, pp. 2321–2329.

[71] F. Shen, W. Liu, S. Zhang, Y. Yang, and H. Tao Shen, “Learning for question answering,” in International Conference on Learning
binary codes for maximum inner product search,” in Proceedings Representations, 2020.
of the IEEE International Conference on Computer Vision, 2015, pp. [89] S. Min, D. Chen, L. Zettlemoyer, and H. Hajishirzi, “Knowledge
4148–4156. Guided Text Retrieval and Reading for Open Domain Question
[72] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Answering,” 2019. [Online]. Available: http://arxiv.org/abs/
Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” 1911.03868
in Advances in Neural Information Processing Systems 30: Annual [90] Z. Yang, P. Qi, S. Zhang, Y. Bengio, W. Cohen, R. Salakhutdinov,
Conference on Neural Information Processing Systems 2017, 4-9 De- and C. D. Manning, “HotpotQA: A dataset for diverse, explain-
cember 2017, Long Beach, CA, USA, 2017, pp. 5998–6008. able multi-hop question answering,” in Proceedings of the 2018
[73] M. Seo, J. Lee, T. Kwiatkowski, A. Parikh, A. Farhadi, and H. Ha- Conference on Empirical Methods in Natural Language Processing.
jishirzi, “Real-time open-domain question answering with dense- Association for Computational Linguistics, 2018, pp. 2369–2380.
sparse phrase index,” in Proceedings of the 57th Annual Meeting [91] J. Welbl, P. Stenetorp, and S. Riedel, “Constructing datasets
of the Association for Computational Linguistics. Association for for multi-hop reading comprehension across documents,”
Computational Linguistics, 2019, pp. 4430–4441. Transactions of the Association for Computational Linguistics, pp.
[74] M. Dehghani, H. Azarbonyad, J. Kamps, and M. de Rijke, “Learn- 287–302, 2018. [Online]. Available: https://www.aclweb.org/
ing to transform, combine, and reason in open-domain question anthology/Q18-1021
answering,” in Proceedings of the Twelfth ACM International Confer- [92] M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed,
ence on Web Search and Data Mining, ser. WSDM ’19. Association O. Levy, V. Stoyanov, and L. Zettlemoyer, “BART: Denoising
for Computing Machinery, 2019, p. 681–689. sequence-to-sequence pre-training for natural language genera-
[75] B. Dhingra, M. Zaheer, V. Balachandran, G. Neubig, R. Salakhut- tion, translation, and comprehension,” in Proceedings of the 58th
dinov, and W. W. Cohen, “Differentiable reasoning over a virtual Annual Meeting of the Association for Computational Linguistics.
knowledge base,” in International Conference on Learning Represen- Association for Computational Linguistics, 2020, pp. 7871–7880.
tations, 2020. [93] K. Cho, B. van Merrienboer, Ç. Gülçehre, D. Bahdanau,
[76] B. Kratzwald, A. Eigenmann, and S. Feuerriegel, “RankQA: Neu- F. Bougares, H. Schwenk, and Y. Bengio, “Learning phrase rep-
ral question answering with answer re-ranking,” in Proceedings resentations using RNN encoder-decoder for statistical machine
of the 57th Annual Meeting of the Association for Computational translation,” in EMNLP. ACL, 2014, pp. 1724–1734.
Linguistics. Association for Computational Linguistics, 2019, pp. [94] C. Clark and M. Gardner, “Simple and effective multi-paragraph
6076–6085. reading comprehension,” in Proceedings of the 56th Annual Meeting
[77] B. Kratzwald et al., “Adaptive document retrieval for deep of the Association for Computational Linguistics, ACL. Association
question answering,” in Proceedings of the 2018 Conference on for Computational Linguistics, 2018, pp. 845–855.
Empirical Methods in Natural Language Processing. Association [95] A. Lampert, “A quick introduction to question answering,” Dated
for Computational Linguistics, 2018, pp. 576–581. December, 2004.
[78] W. Yang, Y. Xie, A. Lin, X. Li, L. Tan, K. Xiong, M. Li, and J. Lin, [96] P. M. Htut, S. Bowman, and K. Cho, “Training a ranking function
“End-to-end open-domain question answering with BERTserini,” for open-domain question answering,” in Proceedings of the 2018
in Proceedings of the 2019 Conference of the North American Chapter Conference of the North American Chapter of the Association for
of the Association for Computational Linguistics (Demonstrations). Computational Linguistics: Student Research Workshop. Association
Association for Computational Linguistics, 2019, pp. 72–77. for Computational Linguistics, 2018, pp. 120–127.
[79] Z. Wang, P. Ng, X. Ma, R. Nallapati, and B. Xiang, “Multi-passage [97] P. Banerjee, K. K. Pal, A. Mitra, and C. Baral, “Careful selection of
BERT: A globally normalized BERT model for open-domain ques- knowledge to solve open book question answering,” in Proceed-
tion answering,” in Proceedings of the 2019 Conference on Empirical ings of the 57th Annual Meeting of the Association for Computational
Methods in Natural Language Processing and the 9th International Linguistics. Association for Computational Linguistics, 2019, pp.
Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 6120–6129.
Association for Computational Linguistics, 2019, pp. 5878–5882. [98] J. Wang, A. Jatowt, M. Färber, and M. Yoshikawa, “Answering
[80] K. Weinberger, A. Dasgupta, J. Langford, A. Smola, and J. At- event-related questions over long-term news article archives,”
tenberg, “Feature hashing for large scale multitask learning,” in ECIR, ser. Lecture Notes in Computer Science, vol. 12035.
in Proceedings of the 26th Annual International Conference on Ma- Springer, 2020, pp. 774–789.
chine Learning. Association for Computing Machinery, 2009, p. [99] J. Wang, A. Jatowt, M. Färber, and M. Yoshikawa, “Improving
1113–1120. question answering for event-focused questions in temporal col-
[81] P. Yang, H. Fang, and J. Lin, “Anserini: Enabling the use of lucene lections of news articles,” Information Retrieval Journal, vol. 24,
for information retrieval research,” in Proceedings of the 40th no. 1, pp. 29–54, 2021.
International ACM SIGIR Conference on Research and Development in [100] A. Conneau, D. Kiela, H. Schwenk, L. Barrault, and A. Bor-
Information Retrieval, ser. SIGIR ’17. Association for Computing des, “Supervised learning of universal sentence representations
Machinery, 2017, p. 1253–1256. from natural language inference data,” in Proceedings of the 2017
[82] T. Zhao, X. Lu, and K. Lee, “Sparta: Efficient open-domain Conference on Empirical Methods in Natural Language Processing.
question answering via sparse transformer matching retrieval,” Association for Computational Linguistics, 2017, pp. 670–680.
arXiv preprint arXiv:2009.13013, 2020. [101] A. Santoro, D. Raposo, D. G. Barrett, M. Malinowski, R. Pascanu,
[83] Y. Zhang, P. Nie, X. Geng, A. Ramamurthy, L. Song, and D. Jiang, P. Battaglia, and T. Lillicrap, “A simple neural network module
“Dc-bert: Decoupling question and document for efficient con- for relational reasoning,” in Advances in Neural Information Pro-
textual encoding,” 2020. cessing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach,
[84] O. Khattab and M. Zaharia, “Colbert: Efficient and effective R. Fergus, S. Vishwanathan, and R. Garnett, Eds. Curran
passage search via contextualized late interaction over bert,” Associates, Inc., 2017, pp. 4967–4976.
in Proceedings of the 43rd International ACM SIGIR Conference on [102] M. Richardson, C. J. Burges, and E. Renshaw, “MCTest: A chal-
Research and Development in Information Retrieval, ser. SIGIR ’20. lenge dataset for the open-domain machine comprehension of
Association for Computing Machinery, 2020, p. 39–48. text,” in Proceedings of the 2013 Conference on Empirical Methods
[85] W. Xiong, X. L. Li, S. Iyer, J. Du, P. Lewis, W. Y. Wang, Y. Mehdad, in Natural Language Processing. Association for Computational
W.-t. Yih, S. Riedel, D. Kiela et al., “Answering complex open- Linguistics, 2013, pp. 193–203.
domain questions with multi-hop dense retrieval,” arXiv preprint [103] D. Dua, Y. Wang, P. Dasigi, G. Stanovsky, S. Singh, and M. Gard-
arXiv:2009.12756, 2020. ner, “DROP: A reading comprehension benchmark requiring
[86] Y. Zhang, P. Nie, A. Ramamurthy, and L. Song, “Ddrqa: Dy- discrete reasoning over paragraphs,” in Proc. of NAACL, 2019.
namic document reranking for open-domain multi-hop question [104] G. Izacard and E. Grave, “Leveraging passage retrieval with
answering,” arXiv preprint arXiv:2009.07465, 2020. generative models for open domain question answering,” arXiv
[87] Y. Mao, P. He, X. Liu, Y. Shen, J. Gao, J. Han, and W. Chen, preprint arXiv:2007.01282, 2020.
“Generation-Augmented Retrieval for Open-domain Question [105] T. N. Kipf and M. Welling, “Semi-supervised classification with
Answering,” 2020. [Online]. Available: http://arxiv.org/abs/ graph convolutional networks,” in ICLR, 2017.
2009.08553 [106] M. Joshi, D. Chen, Y. Liu, D. S. Weld, L. Zettlemoyer, and
[88] A. Asai, K. Hashimoto, H. Hajishirzi, R. Socher, and C. Xiong, O. Levy, “SpanBERT: Improving pre-training by representing and
“Learning to retrieve reasoning paths over wikipedia graph predicting spans,” arXiv preprint arXiv:1907.10529, 2019.

[107] C. Tan, F. Wei, N. Yang, B. Du, W. Lv, and M. Zhou, “S-net: [127] F. Hill, A. Bordes, S. Chopra, and J. Weston, “The goldilocks
From answer extraction to answer synthesis for machine reading principle: Reading children’s books with explicit memory rep-
comprehension,” in AAAI. AAAI Press, 2018, pp. 5940–5947. resentations,” CoRR, 2015.
[108] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, [128] A. Trischler, T. Wang, X. Yuan, J. Harris, A. Sordoni, P. Bachman,
Y. Zhou, W. Li, and P. J. Liu, “Exploring the limits of transfer and K. Suleman, “NewsQA: A machine comprehension dataset,”
learning with a unified text-to-text transformer,” arXiv e-prints, in Proceedings of the 2nd Workshop on Representation Learning for
2019. NLP. Association for Computational Linguistics, 2017, pp. 191–
[109] S. Liu, X. Zhang, S. Zhang, H. Wang, and W. Zhang, “Neural 200.
machine reading comprehension: Methods and trends,” CoRR, [129] M. Dunn, L. Sagun, M. Higgins, V. U. Güney, V. Cirik, and K. Cho,
vol. abs/1907.01118, 2019. “Searchqa: A new q&a dataset augmented with context from a
[110] S. Wang, M. Yu, J. Jiang, W. Zhang, X. Guo, S. Chang, Z. Wang, search engine,” CoRR, vol. abs/1704.05179, 2017.
T. Klinger, G. Tesauro, and M. Campbell, “Evidence aggregation [130] M. Joshi, E. Choi, D. Weld, and L. Zettlemoyer, “TriviaQA: A
for answer re-ranking in open-domain question answering,” in large scale distantly supervised challenge dataset for reading
6th International Conference on Learning Representations, ICLR 2018, comprehension,” in Proceedings of the 55th Annual Meeting of the
Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Association for Computational Linguistics (Volume 1: Long Papers).
Proceedings. ICLR, 2018. Association for Computational Linguistics, 2017, pp. 1601–1611.
[111] S. Wang and J. Jiang, “Learning natural language inference with [131] T. Kociský, J. Schwarz, P. Blunsom, C. Dyer, K. M. Hermann,
LSTM,” in Conference of the North American Chapter of the Associ- G. Melis, and E. Grefenstette, “The narrativeqa reading compre-
ation for Computational Linguistics: Human Language Technologies. hension challenge,” CoRR, vol. abs/1712.07040, 2017.
The Association for Computational Linguistics, 2016, pp. 1442– [132] W. He, K. Liu, Y. Lyu, S. Zhao, X. Xiao, Y. Liu, Y. Wang, H. Wu,
1451. Q. She, X. Liu, T. Wu, and H. Wang, “Dureader: a chinese machine
[112] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever, reading comprehension dataset from real-world applications,”
“Language models are unsupervised multitask learners,” OpenAI CoRR, vol. abs/1711.05073, 2017.
blog, vol. 1, no. 8, p. 9, 2019. [133] S. Reddy, D. Chen, and C. D. Manning, “Coqa: A conversational
[113] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, question answering challenge,” CoRR, vol. abs/1808.07042, 2018.
P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell [134] E. Choi, H. He, M. Iyyer, M. Yatskar, W. Yih, Y. Choi, P. Liang, and
et al., “Language models are few-shot learners,” arXiv preprint L. Zettlemoyer, “Quac : Question answering in context,” CoRR,
arXiv:2005.14165, 2020. vol. abs/1808.07036, 2018.
[114] J. Johnson, M. Douze, and H. Jégou, “Billion-scale similarity [135] P. Clark, I. Cowhey, O. Etzioni, T. Khot, A. Sabharwal,
search with gpus,” CoRR, vol. abs/1702.08734, 2017. C. Schoenick, and O. Tafjord, “Think you have solved question
[115] F. Petroni, T. Rocktäschel, P. Lewis, A. Bakhtin, Y. Wu, A. H. answering? try arc, the AI2 reasoning challenge,” CoRR, vol.
Miller, and S. Riedel, “Language models as knowledge bases?” abs/1803.05457, 2018.
arXiv preprint arXiv:1909.01066, 2019. [136] M. Saeidi, M. Bartolo, P. Lewis, S. Singh, T. Rocktäschel, M. Shel-
[116] A. Roberts, C. Raffel, and N. Shazeer, “How much knowledge don, G. Bouchard, and S. Riedel, “Interpretation of natural lan-
can you pack into the parameters of a language model?” arXiv guage rules in conversational machine reading,” in Proceedings
preprint arXiv:2002.08910, 2020. of the 2018 Conference on Empirical Methods in Natural Language
Processing. Association for Computational Linguistics, 2018, pp.
[117] M. Seo, T. Kwiatkowski, A. Parikh, A. Farhadi, and H. Ha-
2087–2097.
jishirzi, “Phrase-indexed question answering: A new challenge
[137] S. Šuster et al., “CliCR: a dataset of clinical case reports for
for scalable document comprehension,” in Proceedings of the 2018
machine reading comprehension,” in Proceedings of the 2018
Conference on Empirical Methods in Natural Language Processing.
Conference of the North American Chapter of the Association for
Association for Computational Linguistics, 2018, pp. 559–564.
Computational Linguistics: Human Language Technologies, Volume 1
[118] S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, and (Long Papers). Association for Computational Linguistics, 2018,
Z. Ives, “Dbpedia: A nucleus for a web of open data,” in The pp. 1551–1563.
Semantic Web. Springer Berlin Heidelberg, 2007, pp. 722–735.
[138] D. Khashabi, S. Chaturvedi, M. Roth, S. Upadhyay, and D. Roth,
[119] K. Bollacker, C. Evans, P. Paritosh, T. Sturge, and J. Taylor, “Free- “Looking beyond the surface: A challenge set for reading com-
base: A collaboratively created graph database for structuring prehension over multiple sentences,” in Proceedings of the 2018
human knowledge,” in Proceedings of the 2008 ACM SIGMOD Conference of the North American Chapter of the Association for
International Conference on Management of Data. ACM, 2008, pp. Computational Linguistics: Human Language Technologies, Volume 1
1247–1250. (Long Papers). Association for Computational Linguistics, 2018,
[120] J. Hoffart, F. M. Suchanek, K. Berberich, and G. Weikum, “Yago2: pp. 252–262.
A spatially and temporally enhanced knowledge base from [139] R. Zellers, Y. Bisk, R. Schwartz, and Y. Choi, “SWAG: A large-
wikipedia,” Artif. Intell., vol. 194, pp. 28–61, 2013. scale adversarial dataset for grounded commonsense inference,”
[121] H. Sun, H. Ma, W.-t. Yih, C.-T. Tsai, J. Liu, and M.-W. Chang, in Proceedings of the 2018 Conference on Empirical Methods in Natural
“Open domain question answering via semantic enrichment,” in Language Processing. Association for Computational Linguistics,
Proceedings of the 24th International Conference on World Wide Web. 2018, pp. 93–104.
International World Wide Web Conferences Steering Committee, [140] A. Saha, R. Aralikatte, M. M. Khapra, and K. Sankaranarayanan,
2015, pp. 1045–1055. “Duorc: Towards complex language understanding with para-
[122] H. Sun, B. Dhingra, M. Zaheer, K. Mazaitis, R. Salakhutdinov, and phrased reading comprehension,” CoRR, vol. abs/1804.07927,
W. Cohen, “Open domain question answering using early fusion 2018.
of knowledge bases and text,” in Proceedings of the 2018 Conference [141] S. Zhang, X. Liu, J. Liu, J. Gao, K. Duh, and B. V. Durme, “Record:
on Empirical Methods in Natural Language Processing. Association Bridging the gap between human and machine commonsense
for Computational Linguistics, 2018, pp. 4231–4242. reading comprehension,” 2018.
[123] H. Sun, T. Bedrax-Weiss, and W. Cohen, “PullNet: Open do- [142] A. Talmor, J. Herzig, N. Lourie, and J. Berant, “Commonsenseqa:
main question answering with iterative retrieval on knowledge A question answering challenge targeting commonsense knowl-
bases and text,” in Proceedings of the 2019 Conference on Empirical edge,” CoRR, vol. abs/1811.00937, 2018.
Methods in Natural Language Processing and the 9th International [143] M. Chen, M. D’Arcy, A. Liu, J. Fernandez, and D. Downey, “CO-
Joint Conference on Natural Language Processing (EMNLP-IJCNLP). DAH: An adversarially-authored question answering dataset for
Association for Computational Linguistics, 2019, pp. 2380–2390. common sense,” in Proceedings of the 3rd Workshop on Evaluating
[124] Y. Li, D. Tarlow, M. Brockschmidt, and R. S. Zemel, “Gated graph Vector Space Representations for NLP. Association for Computa-
sequence neural networks,” in ICLR, 2016. tional Linguistics, 2019, pp. 63–69.
[125] F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Mon- [144] T. Kwiatkowski, J. Palomaki, O. Redfield, M. Collins, A. Parikh,
fardini, “The graph neural network model,” IEEE Transactions on C. Alberti, D. Epstein, I. Polosukhin, M. Kelcey, J. Devlin, K. Lee,
Neural Networks, vol. 20, no. 1, pp. 61–80, 2009. K. N. Toutanova, L. Jones, M.-W. Chang, A. Dai, J. Uszkor-
[126] Z. Yang, Z. Dai, Y. Yang, J. G. Carbonell, R. Salakhutdinov, eit, Q. Le, and S. Petrov, “Natural questions: a benchmark for
and Q. V. Le, “Xlnet: Generalized autoregressive pretraining for question answering research,” Transactions of the Association of
language understanding,” CoRR, vol. abs/1906.08237, 2019. Computational Linguistics, 2019.

[145] L. Huang, R. Le Bras, C. Bhagavatula, and Y. Choi, “Cosmos QA: Fengbin Zhu received his B.E. degree from
Machine reading comprehension with contextual commonsense Shandong University, China. He is currently pur-
reasoning,” in Proceedings of the 2019 Conference on Empirical suing his Ph.D degree at the School of Comput-
Methods in Natural Language Processing and the 9th International ing, National University of Singapore (NUS). His
Joint Conference on Natural Language Processing (EMNLP-IJCNLP), research interests include natural language pro-
2019, pp. 2391–2401. cessing, machine reading comprehension and
[146] C. Clark, K. Lee, M.-W. Chang, T. Kwiatkowski, M. Collins, conversational question answering.
and K. Toutanova, “BoolQ: Exploring the surprising difficulty
of natural yes/no questions,” in Proceedings of the 2019 Conference
of the North American Chapter of the Association for Computational
Linguistics: Human Language Technologies, Volume 1 (Long and Short
Papers). Association for Computational Linguistics, 2019, pp.
2924–2936.
[147] A. Fan, Y. Jernite, E. Perez, D. Grangier, J. Weston, and M. Auli,
“ELI5: Long form question answering,” in Proceedings of the 57th
Annual Meeting of the Association for Computational Linguistics. Wenqiang Lei is a Research Fellow with School
Association for Computational Linguistics, 2019, pp. 3558–3567. of Computing, National University of Singapore
[148] W. Xiong, J. Wu, H. Wang, V. Kulkarni, M. Yu, S. Chang, X. Guo, (NUS). He received his Ph.D. in Computer Sci-
and W. Y. Wang, “TWEETQA: A social media focused question ence from NUS in 2019. His research interests
answering dataset,” in Proceedings of the 57th Annual Meeting cover natural language processing and informa-
of the Association for Computational Linguistics. Association for tion retrieval, particularly on dialogue systems,
Computational Linguistics, 2019, pp. 5020–5031. conversational recommendations and question
[149] J. Liu, Y. Lin, Z. Liu, and M. Sun, “XQA: A cross-lingual open- answering. He has published multiple papers at
domain question answering dataset,” in Proceedings of the 57th top conferences like ACL, IJCAI, AAAI, EMNLP
Annual Meeting of the Association for Computational Linguistics. and WSDM and the winner of ACM MM 2020
Association for Computational Linguistics, 2019, pp. 2358–2368. best paper award. He served as (senior) PC
[150] J. Gao, M. Galley, and L. Li, “Neural approaches to conversational members on toptier conferences including ACL, EMNLP, SIGIR, AAAI,
ai,” Foundations and Trends® in Information Retrieval, vol. 13, no. KDD and he is a reviewer for journals like TOIS, TKDE, and TASLP.
2-3, pp. 127–298, 2019.
[151] W. Lei, X. He, M. de Rijke, and T.-S. Chua, “Conversational
recommendation: Formulation, methods, and evaluation,” in
Proceedings of the 43rd International ACM SIGIR Conference on
Research and Development in Information Retrieval, ser. SIGIR ’20. Chao Wang holds a PhD in Computer Science
Association for Computing Machinery, 2020, p. 2425–2428. from Tsinghua University, where he was ad-
[152] H. Zhu, L. Dong, F. Wei, W. Wang, B. Qin, and T. Liu, “Learning vised by Dr. Shaoping Ma and Dr. Yiqun liu. His
to ask unanswerable questions for machine reading comprehen- work has primarily focused on nature language
sion,” in Proceedings of the 57th Annual Meeting of the Association processing, information retrieval, search engine
for Computational Linguistics. Association for Computational user behavior analysis. His work has appeared
Linguistics, 2019, pp. 4238–4248. in major journals and conferences such as SI-
[153] M. Hu, F. Wei, Y. xing Peng, Z. X. Huang, N. Yang, and M. Zhou, GIR, CIKM, TOIS, and IRJ.
“Read + verify: Machine reading comprehension with unanswer-
able questions,” ArXiv, vol. abs/1808.05759, 2018.
[154] M. Aliannejadi, H. Zamani, F. Crestani, and W. B. Croft, “Asking
clarifying questions in open-domain information-seeking con-
versations,” in Proceedings of the 42nd International ACM SIGIR
Conference on Research and Development in Information Retrieval.
Association for Computing Machinery, 2019, p. 475–484.
[155] X. Du, J. Shao, and C. Cardie, “Learning to ask: Neural question Jianming Zheng is a PhD candidate at the
generation for reading comprehension,” in Proceedings of the 55th School of System Engineering, the National Uni-
Annual Meeting of the Association for Computational Linguistics versity of Defense Technology, China. His re-
(Volume 1: Long Papers). Vancouver, Canada: Association for search interests include semantics representa-
Computational Linguistics, 2017, pp. 1342–1352. tion, few-shot learning and its applications in
[156] N. Duan, D. Tang, P. Chen, and M. Zhou, “Question generation information retrieval. He received the BS and MS
for question answering,” in Proceedings of the 2017 Conference on degrees from the National University of Defense
Empirical Methods in Natural Language Processing. Copenhagen, Technology, China, in 2016 and 2018, respec-
Denmark: Association for Computational Linguistics, 2017, tively. He has several papers published in SIGIR,
pp. 866–874. [Online]. Available: https://www.aclweb.org/ COLING, IPM, FITEE, Cognitive Computation,
anthology/D17-1090 etc.
[157] Q. Zhou, N. Yang, F. Wei, C. Tan, H. Bao, and M. Zhou, “Neural
question generation from text: A preliminary study,” CoRR, vol.
abs/1704.01792, 2017.
[158] L. Pan, W. Lei, T. Chua, and M. Kan, “Recent advances in neural
question generation,” CoRR, vol. abs/1905.08949, 2019. Soujanya Poria is an assistant professor of In-
[159] C. C. Chen Qu, Liu Yang et al., “Open-retrieval conversational formation Systems Technology and Design, at
question answering,” CoRR, vol. abs/2005.11364, 2020. the Singapore University of Technology and De-
sign (SUTD), Singapore. He holds a Ph.D. de-
gree in Computer Science from the University of
Stirling, UK. He is a recipient of the prestigious
early career research award called “NTU Pres-
idential Postdoctoral Fellowship” in 2018. Sou-
janya has co-authored more than 100 research
papers, published in top-tier conferences and
journals such as ACL, EMNLP, AAAI, NAACL,
Neurocomputing, Computational Intelligence Magazine, etc. Soujanya
has been an area chair at top conferences such as ACL, EMNLP,
NAACL. Soujanya serves or has served on the editorial boards of the
Cognitive Computation and Information Fusion.

Tat-Seng Chua is the KITHCT Chair Professor


at the School of Computing, National University
of Singapore (NUS). He is also the Distinguished
Visiting Professor of Tsinghua University. Dr.
Chua was the Founding Dean of the School of
Computing from 1998-2000. His main research
interests include heterogeneous data analytics,
multimedia information retrieval, recommenda-
tion and conversation systems, and the emerg-
ing applications in E-commerce, wellness and
Fintech. Dr. Chua is the co-Director of NExT, a
joint research Center between NUS and Tsinghua, focusing on Extreme
Search.
Dr. Chua is the recipient of the 2015 ACM SIGMM Achievements
Award for the Outstanding Technical Contributions to Multimedia Com-
puting, Communications and Applications. He is the Chair of steering
committee of ACM ICMR (2015-19), and Multimedia Modeling (MMM)
conference series. He was the General Co-Chair of ACM Multimedia
2005, ACM CIVR (now ACM ICMR) 2005, ACM SIGIR 2008, and ACM
Web Science 2015. He serves in the editorial boards of three interna-
tional journals. He holds a PhD from the University of Leeds, UK.
