CHAPTER 14

Question Answering and Information Retrieval
The quest for knowledge is deeply human, and so it is not surprising that practically
as soon as there were computers we were asking them questions. By the early 1960s,
systems used the two major paradigms of question answering—retrieval-based and
knowledge-based—to answer questions about baseball statistics or scientific facts.
Even imaginary computers got into the act. Deep Thought, the computer that Dou-
glas Adams invented in The Hitchhiker’s Guide to the Galaxy, managed to answer
“the Ultimate Question Of Life, The Universe, and Everything”.1 In 2011, IBM’s
Watson question-answering system won the TV game-show Jeopardy!, surpassing
humans at answering questions like:
WILLIAM WILKINSON’S “AN ACCOUNT OF THE
PRINCIPALITIES OF WALLACHIA AND MOLDOVIA”
INSPIRED THIS AUTHOR’S MOST FAMOUS NOVEL 2
Question answering systems are designed to fill human information needs that
might arise in situations like talking to a virtual assistant or a chatbot, interacting
with a search engine, or querying a database. Question answering systems often
focus on a particular subset of these information needs: factoid questions, questions
that can be answered with simple facts expressed in short texts, like the following:
(14.1) Where is the Louvre Museum located?
(14.2) What is the average age of the onset of autism?
One way to do question answering is just to directly ask a large language model.
For example, we could use the techniques of Chapter 12, prompting a large pre-
trained causal language model with a string like
Q: Where is the Louvre Museum located? A:
have it do conditional generation given this prefix, and take the response as the
answer. The idea is that huge pretrained language models have read a lot of facts
in their pretraining data, presumably including the location of the Louvre, and have
encoded this information in their parameters.
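For concreteness, here is a minimal sketch of this prompting approach in Python, using the Hugging Face transformers library; the GPT-2 checkpoint and the decoding settings are arbitrary illustrative choices (and a model this small may well answer incorrectly, which previews the problem discussed next):

# Factoid QA by prompting a pretrained causal LM with "Q: ... A:".
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")  # any causal LM checkpoint

prompt = "Q: Where is the Louvre Museum located? A:"
output = generator(prompt, max_new_tokens=10, do_sample=False)[0]["generated_text"]

# The answer is whatever the model generates after the "A:" prefix.
answer = output[len(prompt):].strip()
print(answer)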
For some general factoid questions this can be a useful approach and is used in
practice. But prompting a large language model is not yet a solution for question
answering. The main problem is that large language models often give the wrong
answer! Large language models hallucinate. A hallucination is a response that is
not faithful to the facts of the world. That is, when asked questions, large language
models simply make up answers that sound reasonable. For example, Dahl et al.
(2024) found that when asked questions about the legal domain (like about particular
legal cases), large language models had hallucination rates ranging from 69% to
88%.
1 The answer was 42, but unfortunately the details of the question were never revealed.
2 The answer, of course, is ‘Who is Bram Stoker’, and the novel was Dracula.
Sometimes there are ways to tell that language models are hallucinating, but often
there aren't. One problem is that language models' estimates of their confidence in
their answers aren't well calibrated. In a calibrated system, the confidence of a
system in the correctness of its answer is highly correlated with the probability of an
answer being correct. So if the system is wrong, at least it might hedge its answer
or tell us to go check another source. But since language models are not well-
calibrated, they often give a very wrong answer with complete certainty.
A second problem is that simply prompting a large language model doesn’t allow
us to ask questions about proprietary data. A common use of question-answering is
to query private data, like asking an assistant about our email or private documents,
or asking a question about our own medical records. Or a company may have in-
ternal documents that contain answers for customer service or internal use. Or legal
firms need to ask questions about legal discovery from proprietary documents. Fur-
thermore, the use of internal datasets, or even the web itself, can be especially useful
for rapidly changing or dynamic information; by contrast, large language models
are often only released at long increments of many months and so may not have
up-to-date information.
For this reason the current dominant solution for question-answering is the two-
stage retriever/reader model (Chen et al., 2017), and that is the method we will
focus on in this chapter. In a retriever/reader model, we use information retrieval
techniques to first retrieve documents that are likely to have information that might
help answer the question. Then we either extract an answer from spans of text in
the documents, or use large language models to generate an answer given these
documents, an approach sometimes called retrieval-augmented generation.
Basing our answers on retrieved documents can solve the above-mentioned prob-
lems with using simple prompting to answer questions. First, we can ensure that the
answer is grounded in facts from some curated dataset. And we can give the answer
accompanied by the context of the passage or document the answer came from. This
information can help users have confidence in the accuracy of the answer (or help
them spot when it is wrong!). And we can use our retrieval techniques on any pro-
prietary data we want, such as legal or medical data for those applications.
We’ll begin by introducing information retrieval, the task of choosing the most
relevant document from a document set given a user’s query expressing their infor-
mation need. We’ll see the classic method based on cosines of sparse tf-idf vec-
tors, as well as modern neural IR using dense retrievers, in which we run documents
through BERT or other language models to get neural representations, and use the
cosine between dense representations of the query and document.
We then introduce retriever-based question answering, via the retriever/reader
model. This algorithm most commonly relies on the vast amount of text on the
web, in which case it is sometimes called open domain QA, or on collections of
proprietary data, or scientific papers like PubMed. We’ll go through the two types
of readers, span extractors and retrieval-augmented generation.
14.1 Information Retrieval

Information retrieval (IR) is the task of returning documents to a user based on their
information need as expressed in a query. Readers with more interest in information
retrieval should see the Historical Notes section at the end of the chapter and textbooks
like Manning et al. (2008).
The IR task we consider is called ad hoc retrieval, in which a user poses a
query to a retrieval system, which then returns an ordered set of documents from
some collection. A document refers to whatever unit of text the system indexes and
retrieves (web pages, scientific papers, news articles, or even shorter passages like
paragraphs). A collection refers to a set of documents being used to satisfy user
requests. A term refers to a word in a collection, but it may also include phrases.
Finally, a query represents a user's information need expressed as a set of terms.
The high-level architecture of an ad hoc retrieval engine is shown in Fig. 14.1.
Figure 14.1 The architecture of an ad hoc IR system: the documents in the collection are indexed into an inverted index; an incoming query is processed into a query vector, which is used to search the index and return a ranked set of documents.
The basic IR architecture uses the vector space model we introduced in Chap-
ter 6, in which we map queries and documents to vectors based on unigram word
counts, and use the cosine similarity between the vectors to rank potential documents
(Salton, 1971). This is thus an example of the bag-of-words model introduced in
Chapter 4, since words are considered independently of their positions.
As in Chapter 6, we weight each dimension of these vectors not by the raw count of a term but by its tf-idf value. The term frequency tf_{t,d} of term t in document d is a log-weighted count:

tf_{t,d} = \begin{cases} 1 + \log_{10} \text{count}(t, d) & \text{if count}(t, d) > 0 \\ 0 & \text{otherwise} \end{cases}    (14.3)

If we use log weighting, terms which occur 0 times in a document would have tf = 0, terms occurring once tf = 1, 10 times tf = 2, 100 times tf = 3, and so on.

3 We can also use this alternative formulation, which we have used in earlier editions: tf_{t,d} = \log_{10}(\text{count}(t, d) + 1)
idf_t = \log_{10} \frac{N}{df_t}    (14.4)
where N is the total number of documents in the collection, and dft is the number
of documents in which term t occurs. The fewer documents in which a term occurs,
the higher this weight; the lowest weight of 0 is assigned to terms that occur in every
document.
Here are some idf values for some words in the corpus of Shakespeare plays,
ranging from extremely informative words that occur in only one play like Romeo,
to those that occur in a few like salad or Falstaff, to those that are very common like
fool or so common as to be completely non-discriminative since they occur in all 37
plays like good or sweet.4
Word df idf
Romeo 1 1.57
salad 2 1.27
Falstaff 4 0.967
forest 12 0.489
battle 21 0.246
wit 34 0.037
fool 36 0.012
good 37 0
sweet 37 0
The tf-idf value for word t in document d is then the product of term frequency
tf_{t,d} and IDF:

\text{tf-idf}(t, d) = tf_{t,d} \cdot idf_t    (14.5)

The score of a document d for a query q is the cosine between their tf-idf vectors:
\text{score}(q, d) = \cos(q, d) = \frac{q \cdot d}{|q|\,|d|}    (14.6)
Another way to think of the cosine computation is as the dot product of unit vectors;
we first normalize both the query and document vector to unit vectors, by dividing
by their lengths, and then take the dot product:
\text{score}(q, d) = \cos(q, d) = \frac{q}{|q|} \cdot \frac{d}{|d|}    (14.7)
4 Sweet was one of Shakespeare’s favorite adjectives, a fact probably related to the increased use of
sugar in European recipes around the turn of the 16th century (Jurafsky, 2014, p. 175).
We can spell out Eq. 14.7, using the tf-idf values and spelling out the dot product as
a sum of products:
\text{score}(q, d) = \sum_{t \in q} \frac{\text{tf-idf}(t, q)}{\sqrt{\sum_{q_i \in q} \text{tf-idf}^2(q_i, q)}} \cdot \frac{\text{tf-idf}(t, d)}{\sqrt{\sum_{d_i \in d} \text{tf-idf}^2(d_i, d)}}    (14.8)
Now let’s use (14.8) to walk through an example of a tiny query against a collec-
tion of 4 nano documents, computing tf-idf values and seeing the rank of the docu-
ments. We’ll assume all words in the following query and documents are downcased
and punctuation is removed:
Query: sweet love
Doc 1: Sweet sweet nurse! Love?
Doc 2: Sweet sorrow
Doc 3: How sweet is love?
Doc 4: Nurse!
Fig. 14.2 shows the computation of the tf-idf cosine between the query and Doc-
ument 1, and the query and Document 2. The cosine is the normalized dot product
of tf-idf values, so for the normalization we need to compute the document
vector lengths |q|, |d1 |, and |d2 | for the query and the first two documents using
Eq. 14.3, Eq. 14.4, Eq. 14.5, and Eq. 14.8 (computations for Documents 3 and 4 are
also needed but are left as an exercise for the reader). The dot product between the
vectors is the sum over dimensions of the product, for each dimension, of the values
of the two tf-idf vectors for that dimension. This product is only non-zero where
both the query and document have non-zero values, so for this example, in which
only sweet and love have non-zero values in the query, the dot product will be the
sum of the products of those elements of each vector.
Document 1 has a higher cosine with the query (0.747) than Document 2 has
with the query (0.0779), and so the tf-idf cosine model would rank Document 1
above Document 2. This ranking is intuitive given the vector space model, since
Document 1 has both terms including two instances of sweet, while Document 2 is
missing one of the terms. We leave the computation for Documents 3 and 4 as an
exercise for the reader.
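The computation in Fig. 14.2 can be reproduced in a few lines of Python; the lowercasing and whitespace tokenization below are simplifying assumptions:

# Rank the nano documents for the query "sweet love" by tf-idf cosine.
# tf = 1 + log10(count) for counts > 0 (Eq. 14.3); idf = log10(N/df) (Eq. 14.4).
import math
from collections import Counter

docs = {
    "Doc 1": "sweet sweet nurse love",
    "Doc 2": "sweet sorrow",
    "Doc 3": "how sweet is love",
    "Doc 4": "nurse",
}
query = "sweet love"
N = len(docs)

df = Counter()                       # document frequency of each term
for text in docs.values():
    df.update(set(text.split()))

def tfidf_vector(text):
    counts = Counter(text.split())
    return {t: (1 + math.log10(c)) * math.log10(N / df[t]) for t, c in counts.items()}

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v)

q_vec = tfidf_vector(query)
scores = {name: cosine(q_vec, tfidf_vector(text)) for name, text in docs.items()}
for name, s in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {s:.3f}")        # Doc 1 ≈ 0.747, Doc 3 ≈ 0.357, Doc 2 ≈ 0.078, Doc 4 = 0.000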
In practice, there are many variants and approximations to Eq. 14.8. For exam-
ple, we might choose to simplify processing by removing some terms. To see this,
let’s start by expanding the formula for tf-idf in Eq. 14.8 to explicitly mention the tf
and idf terms from (14.5):
\text{score}(q, d) = \sum_{t \in q} \frac{tf_{t,q} \cdot idf_t}{\sqrt{\sum_{q_i \in q} \text{tf-idf}^2(q_i, q)}} \cdot \frac{tf_{t,d} \cdot idf_t}{\sqrt{\sum_{d_i \in d} \text{tf-idf}^2(d_i, d)}}    (14.9)
In one common variant of tf-idf cosine, for example, we drop the idf term for the
document. Eliminating the second copy of the idf term (since the identical term is
already computed for the query) turns out to sometimes result in better performance:

\text{score}(q, d) = \sum_{t \in q} \frac{tf_{t,q} \cdot idf_t}{\sqrt{\sum_{q_i \in q} \text{tf-idf}^2(q_i, q)}} \cdot \frac{tf_{t,d}}{\sqrt{\sum_{d_i \in d} \text{tf-idf}^2(d_i, d)}}    (14.10)
Query
word     cnt  tf   df  idf    tf-idf  n'lized = tf-idf/|q|
sweet    1    1    3   0.125  0.125   0.383
nurse    0    0    2   0.301  0       0
love     1    1    2   0.301  0.301   0.924
how      0    0    1   0.602  0       0
sorrow   0    0    1   0.602  0       0
is       0    0    1   0.602  0       0
|q| = sqrt(.125^2 + .301^2) = .326

Document 1
word     cnt  tf     tf-idf  n'lized  × q
sweet    2    1.301  0.163   0.357    0.137
nurse    1    1.000  0.301   0.661    0
love     1    1.000  0.301   0.661    0.610
how      0    0      0       0        0
sorrow   0    0      0       0        0
is       0    0      0       0        0
|d1| = sqrt(.163^2 + .301^2 + .301^2) = .456
Cosine (sum of × q column): 0.747

Document 2
word     cnt  tf     tf-idf  n'lized  × q
sweet    1    1.000  0.125   0.203    0.0779
nurse    0    0      0       0        0
love     0    0      0       0        0
how      0    0      0       0        0
sorrow   1    1.000  0.602   0.979    0
is       0    0      0       0        0
|d2| = sqrt(.125^2 + .602^2) = .615
Cosine (sum of × q column): 0.0779

Figure 14.2 Computation of tf-idf cosine score between the query and nano-documents 1 (0.747) and 2
(0.0779), using Eq. 14.3, Eq. 14.4, Eq. 14.5 and Eq. 14.8.
A slightly more complex variant in the tf-idf family is the BM25 weighting
scheme (sometimes called Okapi BM25 after the Okapi IR system in which it was
introduced (Robertson et al., 1995)). BM25 adds two parameters: k, a knob that
adjusts the balance between term frequency and IDF, and b, which controls the im-
portance of document length normalization. The BM25 score of a document d given
a query q is:
\sum_{t \in q} \underbrace{\log \frac{N}{df_t}}_{\text{IDF}} \cdot \underbrace{\frac{tf_{t,d}}{k\left(1 - b + b\,\frac{|d|}{|d_{avg}|}\right) + tf_{t,d}}}_{\text{weighted tf}}    (14.11)
where |davg | is the length of the average document. When k is 0, BM25 reverts to
no use of term frequency, just a binary selection of terms in the query (plus idf).
A large k results in raw term frequency (plus idf). b ranges from 1 (scaling by
document length) to 0 (no length scaling). Manning et al. (2008) suggest reasonable
values are k = [1.2,2] and b = 0.75. Kamphuis et al. (2020) is a useful summary of
the many minor variants of BM25.
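As an illustration, here is a minimal BM25 scorer for a small in-memory collection; the whitespace tokenization and the base-10 log are simplifying assumptions:

# BM25 scoring (Eq. 14.11) of one document for a query, with k and b in the
# ranges suggested by Manning et al. (2008).
import math
from collections import Counter

def bm25_score(query, doc, docs, k=1.2, b=0.75):
    N = len(docs)
    avg_len = sum(len(d.split()) for d in docs) / N
    doc_counts = Counter(doc.split())
    doc_len = len(doc.split())
    score = 0.0
    for t in set(query.split()):
        df = sum(1 for d in docs if t in d.split())
        if df == 0:
            continue                                  # query terms unseen in the collection contribute nothing
        idf = math.log10(N / df)
        tf = doc_counts[t]
        score += idf * tf / (k * (1 - b + b * doc_len / avg_len) + tf)
    return score

docs = ["sweet sweet nurse love", "sweet sorrow", "how sweet is love", "nurse"]
for d in sorted(docs, key=lambda d: -bm25_score("sweet love", d, docs)):
    print(f"{bm25_score('sweet love', d, docs):.3f}  {d}")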
Stop words In the past it was common to remove high-frequency words from both
the query and document before representing them. The list of such high-frequency
words to be removed is called a stop list. The intuition is that high-frequency terms
(often function words like the, a, to) carry little semantic weight and may not help
with retrieval; removing them can also help shrink the inverted index files we describe below.
The downside of using a stop list is that it makes it difficult to search for phrases
that contain words in the stop list. For example, common stop lists would reduce the
phrase to be or not to be to the phrase not. In modern IR systems, the use of stop lists
is much less common, partly due to improved efficiency and partly because much
of their function is already handled by IDF weighting, which downweights function
words that occur in every document. Nonetheless, stop word removal is occasionally
useful in various NLP tasks so is worth keeping in mind.
To compare two ranked retrieval systems, we need a metric that prefers the one that ranks the relevant
documents higher. We need to adapt precision and recall to capture how well a
system does at putting relevant documents higher in the ranking.
Rank  Judgment  Precision_Rank  Recall_Rank
1 R 1.0 .11
2 N .50 .11
3 R .66 .22
4 N .50 .22
5 R .60 .33
6 R .66 .44
7 N .57 .44
8 R .63 .55
9 N .55 .55
10 N .50 .55
11 R .55 .66
12 N .50 .66
13 N .46 .66
14 N .43 .66
15 R .47 .77
16 N .44 .77
17 N .44 .77
18 R .44 .88
19 N .42 .88
20 N .40 .88
21 N .38 .88
22 N .36 .88
23 N .35 .88
24 N .33 .88
25 R .36 1.0
Figure 14.3 Rank-specific precision and recall values calculated as we proceed down
through a set of ranked documents (assuming the collection has 9 relevant documents).
Let’s turn to an example. Assume the table in Fig. 14.3 gives rank-specific pre-
cision and recall values calculated as we proceed down through a set of ranked doc-
uments for a particular query; the precision at a given rank is the fraction of the
documents retrieved so far that are relevant, and the recall is the fraction of all relevant
documents found so far. The recall measures in this example are based on this query having 9 relevant
documents in the collection as a whole.
Note that recall is non-decreasing; when a relevant document is encountered,
recall increases, and when a non-relevant document is found it remains unchanged.
Precision, on the other hand, jumps up and down, increasing when relevant doc-
uments are found, and decreasing otherwise. The most common way to visualize
precision and recall is to plot precision against recall in a precision-recall curve,
like the one shown in Fig. 14.4 for the data in Fig. 14.3.
Fig. 14.4 shows the values for a single query. But we’ll need to combine values
for all the queries, and in a way that lets us compare one system to another. One way
of doing this is to plot averaged precision values at 11 fixed levels of recall (0 to 100,
in steps of 10). Since we’re not likely to have datapoints at these exact levels, we
use interpolated precision values for the 11 recall values from the data points we do
have. We can accomplish this by choosing the maximum precision value achieved
at any level of recall at or above the one we’re calculating. In other words,
\text{IntPrecision}(r) = \max_{i \ge r} \text{Precision}(i)    (14.13)
This interpolation scheme not only lets us average performance over a set of queries,
but also helps smooth over the irregular precision values in the original data. It is
designed to give systems the benefit of the doubt by assigning the maximum preci-
sion value achieved at or above the level of recall being measured. Fig. 14.5
and Fig. 14.6 show the resulting interpolated data points from our example.
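Interpolated precision is simple to compute; a short sketch, assuming the rank-by-rank (precision, recall) pairs of Fig. 14.3 are available as a list:

# Interpolated precision (Eq. 14.13) at the 11 standard recall levels: at each level,
# take the maximum precision achieved at that recall level or any higher one.
def interpolated_precision(prec_recall_points, levels=None):
    if levels is None:
        levels = [i / 10 for i in range(11)]          # 0.0, 0.1, ..., 1.0
    return {r: max((p for p, rec in prec_recall_points if rec >= r), default=0.0)
            for r in levels}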
Given curves such as that in Fig. 14.6 we can compare two systems or approaches
by comparing their curves. Clearly, curves that are higher in precision across all
recall values are preferred. However, these curves can also provide insight into the
overall behavior of a system. Systems that are higher in precision toward the left
may favor precision over recall, while systems that are more geared towards recall
will be higher at higher levels of recall (to the right).
A second way to evaluate ranked retrieval is mean average precision (MAP),
which provides a single metric that can be used to compare competing systems or
which provides a single metric that can be used to compare competing systems or
approaches. In this approach, we again descend through the ranked list of items,
but now we note the precision only at those points where a relevant item has been
encountered (for example at ranks 1, 3, 5, 6 but not 2 or 4 in Fig. 14.3). For a single
query, we average these individual precision measurements over the return set (up
to some fixed cutoff). More formally, if we assume that Rr is the set of relevant
documents at or above r, then the average precision (AP) for a single query is
\text{AP} = \frac{1}{|R_r|} \sum_{d \in R_r} \text{Precision}_r(d)    (14.14)
where Precisionr (d) is the precision measured at the rank at which document d was
found. For an ensemble of queries Q, we then average over these averages, to get
our final MAP measure:
\text{MAP} = \frac{1}{|Q|} \sum_{q \in Q} \text{AP}(q)    (14.15)
The MAP for the single query (hence = AP) in Fig. 14.3 is 0.6.
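The average precision computation can be made concrete with a short sketch; the judgment string below transcribes the R/N column of Fig. 14.3:

# AP: mean of the precisions at the ranks where a relevant document appears;
# MAP: mean of the per-query APs.
judgments = list("RNRNRRNRNNRNNNRNNRNNNNNNR")   # the 25 ranked judgments of Fig. 14.3

def average_precision(judgments):
    precisions, relevant_seen = [], 0
    for rank, j in enumerate(judgments, start=1):
        if j == "R":
            relevant_seen += 1
            precisions.append(relevant_seen / rank)
    return sum(precisions) / len(precisions)

def mean_average_precision(per_query_judgments):
    return sum(average_precision(j) for j in per_query_judgments) / len(per_query_judgments)

print(f"{average_precision(judgments):.2f}")   # 0.60, matching the text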
An alternative to these sparse vector methods is to use dense vectors computed by a
neural encoder like BERT. One approach is to encode the query and the document
jointly with a single encoder, allowing self-attention to see all the tokens of both
the query and the document, and thus building a representation that is sensitive to
the meanings of both query and document. Then a linear layer can be put on top of
the [CLS] token to predict a similarity score for the query/document tuple.
This architecture is shown in Fig. 14.7a. Usually the retrieval step is not done on
an entire document. Instead documents are broken up into smaller passages, such
as non-overlapping fixed-length chunks of say 100 tokens, and the retriever encodes
and retrieves these passages rather than entire documents. The query and document
have to be made to fit in the BERT 512-token window, for example by truncating
the query to 64 tokens and truncating the document if necessary so that it, the query,
[CLS], and [SEP] fit in 512 tokens. The BERT system together with the linear layer
U can then be fine-tuned for the relevance task by gathering a tuning dataset of
relevant and non-relevant passages.
Figure 14.7 Two ways to do dense retrieval, illustrated by using lines between layers to schematically rep-
resent self-attention: (a) Use a single encoder to jointly encode query and document and finetune to produce a
relevance score with a linear layer over the CLS token. This is too compute-expensive to use except in rescoring.
(b) Use separate encoders for query and document, and use the dot product between CLS token outputs for the
query and document as the score. This is less compute-expensive, but not as accurate.
The problem with the full BERT architecture in Fig. 14.7a is the expense in
computation and time. With this architecture, every time we get a query, we have to
pass every single document in our entire collection through a BERT encoder
jointly with the new query! This enormous use of resources is impractical for real
cases.
At the other end of the computational spectrum is a much more efficient archi-
tecture, the bi-encoder. In this architecture we can encode the documents in the
collection only one time by using two separate encoder models, one to encode the
query and one to encode the document. We encode each document, and store all
the encoded document vectors in advance. When a query comes in, we encode just
this query and then use the dot product between the query vector and the precom-
puted document vectors as the score for each candidate document (Fig. 14.7b). For
example, if we used BERT, we would have two encoders BERTQ and BERTD and
we could represent the query and document as the [CLS] token of the respective
encoders (Karpukhin et al., 2020):
z_q = \text{BERT}_Q(q)[\text{CLS}]
z_d = \text{BERT}_D(d)[\text{CLS}]
\text{score}(q, d) = z_q \cdot z_d    (14.17)
The bi-encoder is much cheaper than a full query/document encoder, but is also
less accurate, since its relevance decision can’t take full advantage of all the possi-
ble meaning interactions between all the tokens in the query and the tokens in the
document.
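Here is a minimal sketch of the bi-encoder of Fig. 14.7b using the Hugging Face transformers library; a plain bert-base-uncased checkpoint stands in for the two retrieval-trained encoders (in practice one would fine-tune them or use trained checkpoints such as DPR), and the passages are toy examples:

# Bi-encoder retrieval: precompute [CLS] vectors for all passages once, then
# score an incoming query by dot product (score(q, d) = z_q · z_d).
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")   # stands in for BERT_Q and BERT_D

def cls_vector(text):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        outputs = encoder(**inputs)
    return outputs.last_hidden_state[0, 0]                 # the [CLS] token's output vector

passages = ["The Louvre Museum is located in Paris, France.",
            "Max Roach was an American jazz drummer."]
doc_vecs = torch.stack([cls_vector(p) for p in passages])  # computed once, offline

query_vec = cls_vector("Where is the Louvre Museum located?")
scores = doc_vecs @ query_vec
print(passages[int(scores.argmax())])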
There are numerous approaches that lie in between the full encoder and the bi-
encoder. One intermediate alternative is to use cheaper methods (like BM25) as the
first pass relevance ranking for each document, take the top N ranked documents,
and use expensive methods like the full BERT scoring to rerank only the top N
documents rather than the whole set.
Another intermediate approach is the ColBERT approach of Khattab and Zaharia (2020)
and Khattab et al. (2021), shown in Fig. 14.8. This method separately
encodes the query and document, but rather than encoding the entire query or doc-
ument into one vector, it separately encodes each of them into contextual represen-
tations for each token. These BERT representations of each document word can be
pre-stored for efficiency. The relevance score between a query q and a document d is
a sum of maximum similarity (MaxSim) operators between tokens in q and tokens
in d. Essentially, for each token in q, ColBERT finds the most contextually simi-
lar token in d, and then sums up these similarities. A relevant document will have
tokens that are contextually very similar to the query.
More formally, a question q is tokenized as [q1 , . . . , qn ], prepended with a [CLS]
and a special [Q] token, truncated to N=32 tokens (or padded with [MASK] tokens if
it is shorter), and passed through BERT to get output vectors q = [q1 , . . . , qN ]. The
passage d with tokens [d1 , . . . , dm ], is processed similarly, including a [CLS] and
special [D] token. A linear layer is applied on top of d and q to control the output
dimension, so as to keep the vectors small for storage efficiency, and vectors are
rescaled to unit length, producing the final vector sequences Eq (length N) and Ed
(length m). The ColBERT scoring mechanism is:
\text{score}(q, d) = \sum_{i=1}^{N} \max_{j=1}^{m} E_{q_i} \cdot E_{d_j}    (14.18)
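The MaxSim interaction itself is a few lines of code; in the sketch below, random unit vectors stand in for the encoder outputs E_q and E_d:

# ColBERT-style scoring (Eq. 14.18): for each query token, take the maximum
# similarity to any document token, and sum these maxima.
import torch

def colbert_score(E_q, E_d):
    sim = E_q @ E_d.T                    # (N, m) matrix of dot products E_qi · E_dj
    return sim.max(dim=1).values.sum()   # sum over query tokens of the best match

E_q = torch.nn.functional.normalize(torch.randn(32, 128), dim=-1)    # query token vectors
E_d = torch.nn.functional.normalize(torch.randn(180, 128), dim=-1)   # document token vectors
print(colbert_score(E_q, E_d))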
While the interaction mechanism has no tunable parameters, the ColBERT ar-
chitecture still needs to be trained end-to-end to fine-tune the BERT encoders and
train the linear layers (and the special [Q] and [D] embeddings) from scratch. It
is trained on triples ⟨q, d+, d−⟩ of query q, positive document d+ and negative document d−
to produce a score for each document using (14.18), optimizing model
parameters using a cross-entropy loss.
All the supervised algorithms (like ColBERT or the full-interaction version of
the BERT algorithm applied for reranking) need training data in the form of queries
together with relevant and irrelevant passages or documents (positive and negative
examples). There are various semi-supervised ways to get labels; some datasets (like
MS MARCO Ranking, Section 14.3.1) contain gold positive examples. Negative
examples can be sampled randomly from the top-1000 results from some existing
IR system. If datasets don’t have labeled positive examples, iterative methods like
Figure 14.8 A sketch of the ColBERT algorithm at inference time. The query and docu-
ment are first passed through separate BERT encoders. Similarity between query and doc-
ument is computed by summing a soft alignment between the contextual representations of
tokens in the query and the document. Training is end-to-end. (Various details aren’t de-
picted; for example, the query is prepended by [CLS] and [Q:] tokens, and the document
by [CLS] and [D:] tokens). Figure adapted from Khattab and Zaharia (2020).
relevance-guided supervision can be used (Khattab et al., 2021) which rely on the
fact that many datasets contain short answer strings. In this method, an existing IR
system is used to harvest examples that do contain short answer strings (the top few
are taken as positives) or don’t contain short answer strings (the top few are taken as
negatives), these are used to train a new retriever, and then the process is iterated.
Efficiency is an important issue, since every possible document must be ranked
for its similarity to the query. For sparse word-count vectors, the inverted index
allows this very efficiently. For dense vector algorithms finding the set of dense
document vectors that have the highest dot product with a dense query vector is
an instance of the problem of nearest neighbor search. Modern systems therefore
make use of approximate nearest neighbor vector search algorithms like Faiss
(Johnson et al., 2017).
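A minimal sketch of this kind of vector search with Faiss (using an exact inner-product index for simplicity, and random vectors standing in for encoder outputs):

# Index precomputed document vectors once, then retrieve the top-k by dot product.
import numpy as np
import faiss

dim = 128
doc_vecs = np.random.rand(10000, dim).astype("float32")   # stand-ins for document encodings

index = faiss.IndexFlatIP(dim)        # exact inner-product index
index.add(doc_vecs)

query_vec = np.random.rand(1, dim).astype("float32")
scores, doc_ids = index.search(query_vec, 5)              # top-5 most similar documents
print(doc_ids[0], scores[0])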
Question Answer
Where is the Louvre Museum located? in Paris, France
What are the names of Odin’s ravens? Huginn and Muninn
What kind of nuts are used in marzipan? almonds
What instrument did Max Roach play? drums
What’s the official language of Algeria? Arabic
Figure 14.9 Some factoid questions and their answers.
The reader itself can be either a span extractor or a generator. The first method is span extraction, using a neural reading
comprehension algorithm that passes over each passage and is trained to find spans
of text that answer the question. The second method is also known as retrieval-
augmented generation: we take a large pretrained language model, give it some set
of retrieved passages and other text as its prompt, and autoregressively generate a
new answer token by token.
[Figure 14.10 sketch: the query "When was the premiere of The Magic Flute?" goes to a retriever over the indexed document collection; the relevant documents are passed either to an LLM generator (docs + prompt) or to a BERT span extractor ([CLS] q1 q2 [SEP] d1 d2, predicting start and end), yielding the answer "1791".]
Figure 14.10 Retrieval-based question answering has two stages: retrieval, which returns relevant docu-
ments from the collection, and reading, in which either a neural reading comprehension system extracts answer
spans or a large pretrained language model generates answers autoregressively given the documents as a
prompt.
In the next few sections we’ll describe these two standard reader algorithms.
But first, we’ll introduce some commonly-used question answering datasets.
In the HotpotQA dataset, crowd workers were given multiple context documents and asked to come up with questions
that require reasoning about all of the documents.
The fact that questions in datasets like SQuAD or HotpotQA are created by an-
notators who have first read the passage may make their questions easier to answer,
since the annotator may (subconsciously) make use of words from the answer text.
A solution to this possible bias is to make datasets from questions that were not
written with a passage in mind. The TriviaQA dataset (Joshi et al., 2017) contains
94K questions written by trivia enthusiasts, together with supporting documents
from Wikipedia and the web resulting in 650K question-answer-evidence triples.
MS MARCO (Microsoft Machine Reading Comprehension) is a collection of
datasets, including 1 million real anonymized questions from Microsoft Bing query
logs together with a human generated answer and 9 million passages (Nguyen et al.,
2016), that can be used both to test retrieval ranking and question answering. The
Natural Questions dataset (Kwiatkowski et al., 2019) similarly incorporates real
anonymized queries to the Google search engine. Annotators are presented a query,
along with a Wikipedia page from the top 5 search results, and annotate a paragraph-
length long answer and a short span answer, or mark null if the text doesn’t contain
the answer. For example the question "When are hops added to the brewing
process?" has the short answer the boiling process and a long answer which is the
entire surrounding paragraph from the Wikipedia page on Brewing. In using this
dataset, a reading comprehension model is given a question and a Wikipedia page
and must return a long answer, short answer, or ’no answer’ response.
The above datasets are all in English. The TyDi QA dataset contains 204K
question-answer pairs from 11 typologically diverse languages, including Arabic,
Bengali, Kiswahili, Russian, and Thai (Clark et al., 2020). In the T Y D I QA task,
a system is given a question and the passages from a Wikipedia article and must
(a) select the passage containing the answer (or N ULL if no passage contains the
answer), and (b) mark the minimal answer span (or N ULL). Many questions have
no answer. The various languages in the dataset bring up challenges for QA systems
like morphological variation between the question and the answer, or complex issues
with word segmentation or multiple alphabets.
In the reading comprehension task, a system is given a question and the passage
in which the answer should be found. In the full two-stage QA task, however, sys-
tems are not given a passage, but are required to do their own retrieval from some
document collection. A common way to create open-domain QA datasets is to mod-
ify a reading comprehension dataset. For research purposes this is most commonly
done by using QA datasets that annotate Wikipedia (like SQuAD or HotpotQA). For
training, the entire (question, passage, answer) triple is used to train the reader. But
at inference time, the passages are removed and the system is given only the question,
together with access to the entire Wikipedia corpus. The system must then do IR to
find a set of pages and then read them.
Figure 14.12 An encoder model (using BERT) for span-based question answering from
reading-comprehension-based question answering tasks.
For span-based question answering, we represent the question as the first se-
quence and the passage as the second sequence. We’ll also need to add a linear layer
that will be trained in the fine-tuning phase to predict the start and end position of the
span. We’ll add two new special vectors: a span-start embedding S and a span-end
embedding E, which will be learned in fine-tuning. To get a span-start probability
for each output token p′_i, we compute the dot product between S and p′_i and then use
a softmax to normalize over all tokens p′_j in the passage:
P_{\text{start}_i} = \frac{\exp(S \cdot p'_i)}{\sum_j \exp(S \cdot p'_j)}    (14.19)

P_{\text{end}_i} = \frac{\exp(E \cdot p'_i)}{\sum_j \exp(E \cdot p'_j)}    (14.20)
The score of a candidate span from position i to j is S · p′_i + E · p′_j, and the highest-scoring
span with j ≥ i is chosen as the model prediction.
The training loss for fine-tuning is the negative sum of the log-likelihoods of the
correct start and end positions for each instance: L = −log P_{\text{start}_s} − log P_{\text{end}_e}, where s and e are the gold start and end positions.
Many datasets (like SQuAD 2.0 and Natural Questions) also contain (question,
passage) pairs in which the answer is not contained in the passage. We thus also
need a way to estimate the probability that the answer to a question is not in the
document. This is standardly done by treating questions with no answer as having
the [CLS] token as the answer, and hence the answer span start and end index will
point at [CLS] (Devlin et al., 2019).
For many datasets the annotated documents/passages are longer than the maxi-
mum 512 input tokens BERT allows, such as Natural Questions whose gold passages
are full Wikipedia pages. In such cases, following Alberti et al. (2019), we can cre-
ate multiple pseudo-passage observations from the labeled Wikipedia page. Each
observation is formed by concatenating [CLS], the question, [SEP], and tokens from
the document. We walk through the document, sliding a window of size 512 (or
rather, 512 minus the question length n minus special tokens) and packing the win-
dow of tokens into each next pseudo-passage. The answer span for the observation
is either labeled [CLS] (= no answer in this particular window) or the gold-labeled
span is marked. The same process can be used for inference, breaking up each re-
trieved document into separate observation passages and labeling each observation.
The answer can be chosen as the span with the highest probability (or nil if no span
is more probable than [CLS]).
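The span-selection logic can be sketched as follows, with random scores standing in for the dot products S · p′_i and E · p′_j of a fine-tuned model:

# Choose the best span (i, j) with j >= i, falling back to "no answer" when the
# [CLS] position (index 0) scores at least as high as any real span.
import numpy as np

def best_span(start_scores, end_scores, max_len=30):
    n = len(start_scores)
    best, best_score = None, -np.inf
    for i in range(1, n):                                  # index 0 is [CLS]
        for j in range(i, min(i + max_len, n)):
            if start_scores[i] + end_scores[j] > best_score:
                best, best_score = (i, j), start_scores[i] + end_scores[j]
    no_answer_score = start_scores[0] + end_scores[0]      # the [CLS] "span"
    return None if no_answer_score >= best_score else best

start_scores, end_scores = np.random.randn(128), np.random.randn(128)
print(best_span(start_scores, end_scores))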
In the second, generation-based approach, simple conditional generation for question answering adds a prompt like Q:,
followed by the query q, and A:, all concatenated:
p(x_1, \ldots, x_n) = \prod_{i=1}^{n} p(x_i \mid [\text{Q:}] ; q ; [\text{A:}] ; x_{<i})
In retrieval-augmented generation, the retrieved passages are prepended to this prompt, so that the model generates its answer conditioned on them:

retrieved passage 1
retrieved passage 2
...
retrieved passage n
Q: ⟨question⟩ A:
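The sketch below shows how such a prompt might be assembled; retrieve() and generate() are hypothetical placeholders for an IR system like those of Section 14.1 and for any large language model API:

# Retrieval-augmented generation: condition the LLM on the retrieved passages.
def build_rag_prompt(question, passages):
    context = "\n".join(f"retrieved passage {i + 1}: {p}" for i, p in enumerate(passages))
    return f"{context}\nQ: {question} A:"

def answer(question, retrieve, generate, k=4):
    passages = retrieve(question, k)          # top-k passages from the retriever
    return generate(build_rag_prompt(question, passages))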
If the system returns a ranked set of candidate answers, a common evaluation is mean reciprocal rank (MRR): each query is scored by the reciprocal of the rank of the first correct answer, and MRR averages these scores over the query set Q:

\text{MRR} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\text{rank}_i}    (14.22)
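A small helper for computing MRR from the rank of the first correct answer per query; scoring 0 when no returned answer is correct is a common convention assumed here:

# Mean reciprocal rank (Eq. 14.22); ranks are 1-based, None = no correct answer returned.
def mrr(first_correct_ranks):
    reciprocals = [0.0 if r is None else 1.0 / r for r in first_correct_ranks]
    return sum(reciprocals) / len(reciprocals)

print(mrr([1, 3, None, 2]))   # (1 + 1/3 + 0 + 1/2) / 4 ≈ 0.458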
14.5 Summary
This chapter introduced the tasks of question answering and information retrieval.
• Question answering (QA) is the task of answering a user’s questions.
• We focus in this chapter on the task of retrieval-based question answering,
in which the user’s questions are intended to be answered by the material in
some set of documents.
• Information Retrieval (IR) is the task of returning documents to a user based
on their information need as expressed in a query. In ranked retrieval, the
documents are returned in ranked order.
• The match between a query and a document can be done by first representing
each of them with a sparse vector that represents the frequencies of words,
weighted by tf-idf or BM25. Then the similarity can be measured by cosine.
• Documents or queries can instead be represented by dense vectors, by encod-
ing the question and document with an encoder-only model like BERT, and in
that case computing similarity in embedding space.
• The inverted index is a storage mechanism that makes it very efficient to
find documents that have a particular word.
• Ranked retrieval is generally evaluated by mean average precision or inter-
polated precision.
• Question answering systems generally use the retriever/reader architecture.
In the retriever stage, an IR system is given a query and returns a set of
documents.
• The reader stage can either be a span-based extractor, that predicts a span
of text in the retrieved documents to return as the answer, or a retrieval-
augmented generator, in which a large language model is used to generate a
novel answer after reading the documents and the query.
• QA can be evaluated by exact match with a known answer if only a single
answer is given, or with mean reciprocal rank if a ranked set of answers is
given.
Bibliographical and Historical Notes

One of the earliest question answering systems, Baseball (Green et al., 1961), answered questions about baseball games from a structured database in which each game was stored as a record like the following:

Month = July
Place = Boston
Day = 7
Game Serial No. = 96
(Team = Red Sox, Score = 5)
(Team = Yankees, Score = 3)
Each question was constituency-parsed using the algorithm of Zellig Harris’s
TDAP project at the University of Pennsylvania, essentially a cascade of finite-state
transducers (see the historical discussion in Joshi and Hopely 1999 and Karttunen
1999). Then in a content analysis phase each word or phrase was associated with a
program that computed parts of its meaning. Thus the phrase ‘Where’ had code to
assign the semantics Place = ?, with the result that the question “Where did the
Red Sox play on July 7” was assigned the meaning
Place = ?
Team = Red Sox
Month = July
Day = 7
The question is then matched against the database to return the answer. Simmons
(1965) summarizes other early QA systems.
Another important progenitor of the knowledge-based paradigm for question-
answering is work that used predicate calculus as the meaning representation language.
The LUNAR system (Woods et al. 1972, Woods 1978) was designed to be
a natural language interface to a database of chemical facts about lunar geology. It
could answer questions like Do any samples have greater than 13 percent aluminum
by parsing them into a logical form
(TEST (FOR SOME X16 / (SEQ SAMPLES) : T ; (CONTAIN’ X16
(NPR* X17 / (QUOTE AL203)) (GREATERTHAN 13 PCT))))
By a couple decades later, drawing on new machine learning approaches in NLP,
Zelle and Mooney (1996) proposed to treat knowledge-based QA as a semantic pars-
ing task, by creating the Prolog-based GEOQUERY dataset of questions about US
geography. This model was extended by Zettlemoyer and Collins (2005, 2007).
By a decade later, neural models were applied to semantic parsing (Dong and Lap-
ata 2016, Jia and Liang 2016), and then to knowledge-based question answering by
mapping text to SQL (Iyer et al., 2017).
Meanwhile, the information-retrieval paradigm for question answering was in-
fluenced by the rise of the web in the 1990s. The U.S. government-sponsored TREC
(Text REtrieval Conference) evaluations, run annually since 1992, provide a testbed
for evaluating information-retrieval tasks and techniques (Voorhees and Harman,
2005). TREC added an influential QA track in 1999, which led to a wide variety of
factoid and non-factoid systems competing in annual evaluations.
At that same time, Hirschman et al. (1999) introduced the idea of using chil-
dren’s reading comprehension tests to evaluate machine text comprehension algo-
rithms. They acquired a corpus of 120 passages with 5 questions each designed for
3rd-6th grade children, built an answer extraction system, and measured how well
the answers given by their system corresponded to the answer key from the test’s
publisher. Their algorithm focused on word overlap as a feature; later algorithms
added named entity features and more complex similarity between the question and
the answer span (Riloff and Thelen 2000, Ng et al. 2000).
The DeepQA component of the Watson Jeopardy! system was a large and so-
phisticated feature-based system developed just before neural systems became common.
Exercises
Alberti, C., K. Lee, and M. Collins. 2019. A BERT baseline for the natural questions. http://arxiv.org/abs/1901.08634.

Arora, S., P. Lewis, A. Fan, J. Kahn, and C. Ré. 2023. Reasoning over public and private data in retrieval-based systems. TACL, 11:902–921.

Chen, D., A. Fisch, J. Weston, and A. Bordes. 2017. Reading Wikipedia to answer open-domain questions. ACL.

Clark, J. H., E. Choi, M. Collins, D. Garrette, T. Kwiatkowski, V. Nikolaev, and J. Palomaki. 2020. TyDi QA: A benchmark for information-seeking question answering in typologically diverse languages. TACL, 8:454–470.

Clark, P., I. Cowhey, O. Etzioni, T. Khot, A. Sabharwal, C. Schoenick, and O. Tafjord. 2018. Think you have solved question answering? Try ARC, the AI2 reasoning challenge. ArXiv preprint arXiv:1803.05457.

Dahl, M., V. Magesh, M. Suzgun, and D. E. Ho. 2024. Large legal fictions: Profiling legal hallucinations in large language models. ArXiv preprint.

Deerwester, S. C., S. T. Dumais, T. K. Landauer, G. W. Furnas, and R. A. Harshman. 1990. Indexing by latent semantic analysis. JASIS, 41(6):391–407.

Devlin, J., M.-W. Chang, K. Lee, and K. Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. NAACL HLT.

Dong, L. and M. Lapata. 2016. Language to logical form with neural attention. ACL.

Ferrucci, D. A. 2012. Introduction to "This is Watson". IBM Journal of Research and Development, 56(3/4):1:1–1:15.

Furnas, G. W., T. K. Landauer, L. M. Gomez, and S. T. Dumais. 1987. The vocabulary problem in human-system communication. Communications of the ACM, 30(11):964–971.

Green, B. F., A. K. Wolf, C. Chomsky, and K. Laughery. 1961. Baseball: An automatic question answerer. Proceedings of the Western Joint Computer Conference 19.

Hermann, K. M., T. Kocisky, E. Grefenstette, L. Espeholt, W. Kay, M. Suleyman, and P. Blunsom. 2015. Teaching machines to read and comprehend. NeurIPS.

Hirschman, L., M. Light, E. Breck, and J. D. Burger. 1999. Deep Read: A reading comprehension system. ACL.

Iyer, S., I. Konstas, A. Cheung, J. Krishnamurthy, and L. Zettlemoyer. 2017. Learning a neural semantic parser from user feedback. ACL.

Izacard, G., P. Lewis, M. Lomeli, L. Hosseini, F. Petroni, T. Schick, J. Dwivedi-Yu, A. Joulin, S. Riedel, and E. Grave. 2022. Few-shot learning with retrieval augmented language models. ArXiv preprint.

Jia, R. and P. Liang. 2016. Data recombination for neural semantic parsing. ACL.

Johnson, J., M. Douze, and H. Jégou. 2017. Billion-scale similarity search with GPUs. ArXiv preprint arXiv:1702.08734.

Joshi, A. K. and P. Hopely. 1999. A parser from antiquity. In A. Kornai, editor, Extended Finite State Models of Language, pages 6–15. Cambridge University Press.

Joshi, M., E. Choi, D. S. Weld, and L. Zettlemoyer. 2017. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. ACL.

Jurafsky, D. 2014. The Language of Food. W. W. Norton, New York.

Kamphuis, C., A. P. de Vries, L. Boytsov, and J. Lin. 2020. Which BM25 do you mean? A large-scale reproducibility study of scoring variants. European Conference on Information Retrieval.

Karpukhin, V., B. Oğuz, S. Min, P. Lewis, L. Wu, S. Edunov, D. Chen, and W.-t. Yih. 2020. Dense passage retrieval for open-domain question answering. EMNLP.

Karttunen, L. 1999. Comments on Joshi. In A. Kornai, editor, Extended Finite State Models of Language, pages 16–18. Cambridge University Press.

Khandelwal, U., O. Levy, D. Jurafsky, L. Zettlemoyer, and M. Lewis. 2019. Generalization through memorization: Nearest neighbor language models. ICLR.

Khattab, O., C. Potts, and M. Zaharia. 2021. Relevance-guided supervision for OpenQA with ColBERT. TACL, 9:929–944.

Khattab, O. and M. Zaharia. 2020. ColBERT: Efficient and effective passage search via contextualized late interaction over BERT. SIGIR.

Kwiatkowski, T., J. Palomaki, O. Redfield, M. Collins, A. Parikh, C. Alberti, D. Epstein, I. Polosukhin, J. Devlin, K. Lee, K. Toutanova, L. Jones, M. Kelcey, M.-W. Chang, A. M. Dai, J. Uszkoreit, Q. Le, and S. Petrov. 2019. Natural Questions: A benchmark for question answering research. TACL, 7:452–466.

Lee, K., M.-W. Chang, and K. Toutanova. 2019. Latent retrieval for weakly supervised open domain question answering. ACL.

Manning, C. D., P. Raghavan, and H. Schütze. 2008. Introduction to Information Retrieval. Cambridge.

Ng, H. T., L. H. Teo, and J. L. P. Kwan. 2000. A machine learning approach to answering questions for reading comprehension tests. EMNLP.

Nguyen, T., M. Rosenberg, X. Song, J. Gao, S. Tiwary, R. Majumder, and L. Deng. 2016. MS MARCO: A human generated machine reading comprehension dataset. NeurIPS.

Phillips, A. V. 1960. A question-answering routine. Technical Report 16, MIT AI Lab.

Rajpurkar, P., R. Jia, and P. Liang. 2018. Know what you don't know: Unanswerable questions for SQuAD. ACL.

Rajpurkar, P., J. Zhang, K. Lopyrev, and P. Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. EMNLP.

Ram, O., Y. Levine, I. Dalmedigos, D. Muhlgay, A. Shashua, K. Leyton-Brown, and Y. Shoham. 2023. In-context retrieval-augmented language models. ArXiv preprint.

Riloff, E. and M. Thelen. 2000. A rule-based question answering system for reading comprehension tests. ANLP/NAACL workshop on reading comprehension tests.

Robertson, S., S. Walker, S. Jones, M. M. Hancock-Beaulieu, and M. Gatford. 1995. Okapi at TREC-3. Overview of the Third Text REtrieval Conference (TREC-3).