
Sparse, Dense, and Attentional Representations for Text Retrieval

Yi Luan∗, Jacob Eisenstein∗, Kristina Toutanova∗, Michael Collins

Google Research
{luanyi, jeisenstein, kristout, mjcollins}@google.com

Abstract

Dual encoders perform retrieval by encoding documents and queries into dense low-dimensional vectors, scoring each document by its inner product with the query. We investigate the capacity of this architecture relative to sparse bag-of-words models and attentional neural networks. Using both theoretical and empirical analysis, we establish connections between the encoding dimension, the margin between gold and lower-ranked documents, and the document length, suggesting limitations in the capacity of fixed-length encodings to support precise retrieval of long documents. Building on these insights, we propose a simple neural model that combines the efficiency of dual encoders with some of the expressiveness of more costly attentional architectures, and explore sparse-dense hybrids to capitalize on the precision of sparse retrieval. These models outperform strong alternatives in large-scale retrieval.

1 Introduction

Retrieving relevant documents is a core task for language technology, and is a component of applications such as information extraction and question answering (e.g., Narasimhan et al., 2016; Kwok et al., 2001; Voorhees, 2001). While classical information retrieval has focused on heuristic weights for sparse bag-of-words representations (Spärck Jones, 1972), more recent work has adopted a two-stage retrieval and ranking pipeline, where a large number of documents are retrieved using sparse high-dimensional query/document representations, and are further reranked with learned neural models (Mitra and Craswell, 2018). This two-stage approach has achieved state-of-the-art results on IR benchmarks (Nogueira and Cho, 2019; Yang et al., 2019; Nogueira et al., 2019a), especially since sizable annotated data has become available for training deep neural models (Dietz et al., 2018; Craswell et al., 2020). However, this pipeline suffers from a strict upper bound imposed by any recall errors in the first-stage retrieval model: for example, the recall@1000 for BM25 reported by Yan et al. (2020) is 69.4.

A promising alternative is to perform first-stage retrieval using learned dense low-dimensional encodings of documents and queries (Huang et al., 2013; Reimers and Gurevych, 2019; Gillick et al., 2019; Karpukhin et al., 2020). The dual encoder model scores each document by the inner product between its encoding and that of the query. Unlike full attentional architectures, which require extensive computation on each candidate document, the dual encoder can be easily applied to very large document collections thanks to efficient algorithms for inner product search; unlike untrained sparse retrieval models, it can exploit machine learning to generalize across related terms.

To assess the relevance of a document to an information-seeking query, models must both (i) check for precise term overlap (for example, presence of key entities in the query) and (ii) compute semantic similarity, generalizing across related concepts. Sparse retrieval models excel at the first sub-problem, while learned dual encoders can be better at the second. Recent history in NLP might suggest that learned dense representations should always outperform sparse features overall, but this is not necessarily true: as shown in Figure 1, the BM25 model (Robertson et al., 2009) can outperform a dual encoder based on BERT, particularly on longer documents and on a task that requires precise detection of word overlap.¹

Figure 1: Recall@1 for retrieving the passage containing a query from three million candidates. The figure compares a fine-tuned BERT-based dual encoder (DE-BERT-768), an off-the-shelf BERT-based encoder with average pooling (BERT-init), and sparse term-based retrieval (BM25), while binning passages by length.

∗ Equal contribution.
¹ See §4 for experimental details.

Transactions of the Association for Computational Linguistics, vol. 9, pp. 329–345, 2021. https://doi.org/10.1162/tacl_a_00369
Action Editor: Jimmy Lin. Submission batch: 6/2020; Revision batch: 9/2020; Published 4/2021.
© 2021 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.
This raises questions about the limitations of dual encoders, and the circumstances in which these powerful models do not yet reach the state of the art. Here we explore these questions using both theoretical and empirical tools, and propose a new architecture that leverages the strengths of dual encoders while avoiding some of their weaknesses.

We begin with a theoretical investigation of compressive dual encoders—dense encodings whose dimension is below the vocabulary size—and analyze their ability to preserve distinctions made by sparse bag-of-words retrieval models, which we term their fidelity. Fidelity is important for the sub-problem of detecting precise term overlap, and is a tractable proxy for capacity. Using the theory of dimensionality reduction, we relate fidelity to the normalized margin between the gold retrieval result and its competitors, and show that this margin is in turn related to the length of documents in the collection. We validate the theory with an empirical investigation of the effects of random projection compression on sparse BM25 retrieval using queries and documents from TREC-CAR, a recent IR benchmark (Dietz et al., 2018).

Next, we offer a multi-vector encoding model, which is computationally feasible for retrieval like the dual-encoder architecture and achieves significantly better quality. A simple hybrid that interpolates models based on dense and sparse representations leads to further improvements.

We compare the performance of dual encoders, multi-vector encoders, and their sparse-dense hybrids with classical sparse retrieval models and attentional neural networks, as well as state-of-the-art published results where available. Our evaluations include open retrieval benchmarks (MS MARCO passage and document) and passage retrieval for question answering (Natural Questions). We confirm prior findings that full attentional architectures excel at reranking tasks, but are not efficient enough for large-scale retrieval. Of the more efficient alternatives, the hybridized multi-vector encoder is at or near the top in every evaluation, outperforming state-of-the-art retrieval results in MS MARCO. Our code is publicly available at https://github.com/google-research/language/tree/master/language/multivec.

2 Analyzing Dual Encoder Fidelity

A query or a document is a sequence of words drawn from some vocabulary V. Throughout this section we assume a representation of queries and documents typically used in sparse bag-of-words models: each query q and document d is a vector in R^v, where v is the vocabulary size. We take the inner product ⟨q, d⟩ to be the relevance score of document d for query q. This framework accounts for several well-known ranking models, including Boolean inner product, TF-IDF, and BM25.

We will compare sparse retrieval models with compressive dual encoders, for which we write f(d) and f(q) to indicate compression of d and q to R^k, with k ≪ v, and where k does not vary with the document length. For these models, the relevance score is the inner product ⟨f(q), f(d)⟩. (In §3, we consider encoders that apply to sequences of tokens rather than vectors of counts.)

A fundamental question is how the capacity of dual encoders varies with the embedding size k. In this section we focus on the related, more tractable notion of fidelity: how much can we compress the input while maintaining the ability to mimic the performance of bag-of-words retrieval? We explore this question mainly through the encoding model of random projections, but also discuss more general dimensionality reduction in §2.2.

2.1 Random Projections

To establish baselines on the fidelity of compressive dual encoder retrieval, we now consider encoders based on random projections (Vempala, 2004). The encoder is defined as f(x) = Ax, where A ∈ R^{k×v} is a random matrix.
In Rademacher embeddings, each element a_{i,j} of the matrix A is sampled with equal probability from two possible values: {−1/√k, 1/√k}. In Gaussian embeddings, each a_{i,j} ∼ N(0, k^{−1/2}). A pairwise ranking error occurs when ⟨q, d1⟩ > ⟨q, d2⟩ but ⟨Aq, Ad1⟩ < ⟨Aq, Ad2⟩. Using such random projections, it is possible to bound the probability of any such pairwise error in terms of the embedding size.

Definition 2.1. For a query q and pair of documents (d1, d2) such that ⟨q, d1⟩ ≥ ⟨q, d2⟩, the normalized margin is defined as

\mu(q, d_1, d_2) = \frac{\langle q, d_1 - d_2 \rangle}{\|q\| \, \|d_1 - d_2\|}.

Lemma 1. Define a matrix A ∈ R^{k×v} of Gaussian or Rademacher embeddings. Define vectors q, d1, d2 such that μ(q, d1, d2) = ε > 0. A ranking error occurs when ⟨Aq, Ad2⟩ ≥ ⟨Aq, Ad1⟩. If β is the probability of such an error, then

\beta \le 4 \exp\left( -\frac{k}{2} \left( \epsilon^2/2 - \epsilon^3/3 \right) \right). \qquad (1)

The proof, which builds on well-known results about random projections, is found in §A.1. By solving (1) for k, we can derive an embedding size that guarantees a desired upper bound on the pairwise error probability,

k \ge 2 \left( \epsilon^2/2 - \epsilon^3/3 \right)^{-1} \ln \frac{4}{\beta}. \qquad (2)

It is convenient to derive a simpler but looser quadratic bound (proved in §A.2):

Corollary 1. Define vectors q, d1, d2 such that ε = μ(q, d1, d2) > 0. If A ∈ R^{k×v} is a matrix of random Gaussian or Rademacher embeddings such that k > 12ε^{−2} ln(4/β), then Pr(⟨Aq, Ad1⟩ ≤ ⟨Aq, Ad2⟩) ≤ β.

On the Tightness of the Bound. Let k∗(q, d1, d2) denote the lowest-dimension Gaussian or Rademacher random projection following the definition in Lemma 1, for which Pr(⟨Aq, Ad1⟩ < ⟨Aq, Ad2⟩) ≤ β, for a given document pair (d1, d2) and query q with normalized margin ε. Our lemma places an upper bound on k∗, saying that k∗(q, d1, d2) ≤ 2(ε²/2 − ε³/3)^{−1} ln(4/β). Any k ≥ k∗(q, d1, d2) has sufficiently low probability of error, but lower values of k could potentially also have the desired property. Later in this section we perform empirical evaluation to study the tightness of the bound; although theoretical tightness (up to a constant factor) is suggested by results on the optimality of the distributional Johnson-Lindenstrauss lemma (Johnson and Lindenstrauss, 1984; Jayram and Woodruff, 2013; Kane et al., 2011), here we study the question only empirically.
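To make the pairwise bound concrete, here is a minimal sketch (illustrative code, not from the paper; it assumes only NumPy) that computes the normalized margin of a toy bag-of-words triple, derives a sufficient embedding size from (2), and estimates the empirical error rate under Rademacher projections at that size.

```python
import numpy as np

def normalized_margin(q, d1, d2):
    # mu(q, d1, d2) = <q, d1 - d2> / (||q|| * ||d1 - d2||), as in Definition 2.1.
    diff = d1 - d2
    return q.dot(diff) / (np.linalg.norm(q) * np.linalg.norm(diff))

def rademacher_projection(k, v, rng):
    # Each entry is +1/sqrt(k) or -1/sqrt(k) with equal probability.
    return rng.choice([-1.0, 1.0], size=(k, v)) / np.sqrt(k)

def pairwise_error_rate(q, d1, d2, k, trials=200, seed=0):
    # Fraction of random projections for which the ranking of d1 vs. d2 flips.
    rng = np.random.default_rng(seed)
    errors = 0
    for _ in range(trials):
        A = rademacher_projection(k, q.shape[0], rng)
        if (A @ q).dot(A @ d1) <= (A @ q).dot(A @ d2):
            errors += 1
    return errors / trials

def sufficient_k(eps, beta=0.05):
    # Embedding size from Eq. (2): k >= 2 (eps^2/2 - eps^3/3)^(-1) ln(4/beta).
    return int(np.ceil(2.0 / (eps ** 2 / 2 - eps ** 3 / 3) * np.log(4.0 / beta)))

# Toy Boolean bag-of-words vectors over a small vocabulary (hypothetical data);
# d1 contains all query terms, so <q, d1> > <q, d2> with high probability.
v, rng = 1000, np.random.default_rng(1)
q = (rng.random(v) < 0.01).astype(float)
d1 = np.clip(q + (rng.random(v) < 0.05), 0.0, 1.0)
d2 = (rng.random(v) < 0.05).astype(float)
eps = normalized_margin(q, d1, d2)
k = sufficient_k(eps, beta=0.05)
print(f"margin={eps:.4f}  k from Eq.(2)={k}  "
      f"empirical error at that k={pairwise_error_rate(q, d1, d2, k):.3f}")
```

In such simulations the empirical error at the bound's k is typically far below β, which is consistent with the bound being an upper bound on k∗ rather than its exact value.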
2.1.1 Recall-at-r

In retrieval applications, it is important to return the desired result within the top r search results. For query q, define d1 as the document that maximizes some inner product ranking metric. The probability of returning d1 in the top r results after random projection can be bounded by a function of the embedding size and normalized margin:

Lemma 2. Consider a query q, with target document d1, and document collection D that excludes d1, and such that ∀d2 ∈ D, μ(q, d1, d2) > 0. Define r0 to be any integer such that 1 ≤ r0 ≤ |D|. Define ε to be the r0'th smallest normalized margin μ(q, d1, d2) for any d2 ∈ D, and for simplicity assume that only a single document d2 ∈ D has μ(q, d1, d2) = ε.²

Define a matrix A ∈ R^{k×v} of Gaussian or Rademacher embeddings. Define R to be a random variable such that R = |{d2 ∈ D : ⟨Aq, Ad1⟩ ≤ ⟨Aq, Ad2⟩}|, and let C = 4(|D| − r0 + 1). Then

\Pr(R \ge r_0) \le C \exp\left( -\frac{k}{2} \left( \epsilon^2/2 - \epsilon^3/3 \right) \right).

The proof is in §A.3. A direct consequence of the lemma is that to achieve recall-at-r0 = 1 for a given (q, d1, D) triple with probability ≥ 1 − β, it is sufficient to set

k \ge \frac{2}{\epsilon^2/2 - \epsilon^3/3} \ln \frac{4(|D| - r_0 + 1)}{\beta}, \qquad (3)

where ε is the r0'th smallest normalized margin. As with the bound on pairwise relevance errors in Lemma 1, Lemma 2 implies an upper bound on the minimum random projection dimension k∗(q, d1, D) that recalls d1 in the top r0 results with probability ≥ 1 − β. Due to the application of the union bound and worst-case assumptions about the normalized margins of documents in D, this bound is potentially loose. Later in this section we examine the empirical relationship between maximum document length, the distribution of normalized margins, and k∗.

² The case where multiple documents are tied with normalized margin ε is straightforward but slightly complicates the analysis.
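The sufficient embedding size in (3) grows only logarithmically with the collection size but quadratically with 1/ε; a small sketch (illustrative code, assuming only the Python standard library) makes the scaling easy to inspect:

```python
import math

def sufficient_k_recall(eps, collection_size, r0=10, beta=0.05):
    """Embedding size sufficient for recall-at-r0 with probability >= 1 - beta,
    following Eq. (3); eps is the r0-th smallest normalized margin."""
    return math.ceil(2.0 / (eps ** 2 / 2 - eps ** 3 / 3)
                     * math.log(4.0 * (collection_size - r0 + 1) / beta))

# Hypothetical numbers for a 3-million-document collection: halving the margin
# roughly quadruples the sufficient dimension, while |D| enters only in the log.
for eps in (0.04, 0.02, 0.01):
    print(eps, sufficient_k_recall(eps, collection_size=3_000_000, r0=10))
```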
2.1.2 Application to Boolean Inner Product

Boolean inner product is a retrieval function in which d, q ∈ {0, 1}^v over a vocabulary of size v, with d_i indicating the presence of term i in the document (and analogously for q_i). The relevance score ⟨q, d⟩ is then the number of terms that appear in both q and d. For this simple retrieval function, it is possible to compute an embedding size that guarantees a desired pairwise error probability over an entire dataset of documents.

Corollary 2. For a set of documents D = {d ∈ {0, 1}^v} and a query q ∈ {0, 1}^v, let L_D = max_{d∈D} ‖d‖² and L_Q = ‖q‖². Let A ∈ R^{k×v} be a matrix of random Rademacher or Gaussian embeddings such that k ≥ 24 L_Q L_D ln(4/β). Then for any d1, d2 ∈ D such that ⟨q, d1⟩ > ⟨q, d2⟩, the probability that ⟨Aq, Ad1⟩ ≤ ⟨Aq, Ad2⟩ is ≤ β.

The proof is in §A.4. The corollary shows that for Boolean inner product ranking, we can guarantee any desired error bound β by choosing an embedding size k that grows linearly in L_D, the number of unique terms in the longest document.

2.1.3 Application to TF-IDF and BM25

Both TF-IDF (Spärck Jones, 1972) and BM25 (Robertson et al., 2009) can be written as inner products between bag-of-words representations of the document and query, as described earlier in this section. Set the query representation q̃_i = q_i × IDF_i, where q_i indicates the presence of the term in the query and IDF_i indicates the inverse document frequency of term i. The TF-IDF score is then ⟨q̃, d⟩. For BM25, we define d̃ ∈ R^v, with each d̃_i a function of the count d_i and the document length (and hyperparameters); BM25(q, d) is then ⟨q̃, d̃⟩. Due to its practical utility in retrieval, we now focus on BM25.
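To make the reformulation explicit, here is a minimal sketch (illustrative code assuming NumPy; the BM25 term weight shown is the common Robertson variant with hyperparameters k1 and b, which may differ in detail from the gensim implementation used in §4):

```python
import numpy as np

def tfidf_query_vector(query_terms, idf, vocab_index):
    # q~_i = q_i * IDF_i, with q_i indicating presence of term i in the query.
    q = np.zeros(len(vocab_index))
    for term in set(query_terms):
        q[vocab_index[term]] = idf[term]
    return q

def bm25_doc_vector(doc_terms, vocab_index, avg_doc_len, k1=1.2, b=0.75):
    # d~_i is a function of the term count d_i and the document length.
    d = np.zeros(len(vocab_index))
    counts = {}
    for term in doc_terms:
        counts[term] = counts.get(term, 0) + 1
    length_norm = k1 * (1 - b + b * len(doc_terms) / avg_doc_len)
    for term, count in counts.items():
        d[vocab_index[term]] = count * (k1 + 1) / (count + length_norm)
    return d

# BM25(q, d) is then the inner product <q~, d~>, so the random-projection
# fidelity analysis above applies directly to the (q~, d~) vector pairs:
#   score = tfidf_query_vector(query, idf, vocab).dot(
#       bm25_doc_vector(doc, vocab, avg_doc_len))
```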
Pairwise Accuracy. We use empirical data to test the applicability of Lemma 1 to the BM25 relevance model. We select query-document triples (q, d1, d2) from the TREC-CAR dataset (Dietz et al., 2018) by considering all possible (q, d2), and selecting d1 = arg max_d BM25(q, d). We bin the triples by the normalized margin ε, and compute the quantity (ε²/2 − ε³/3)^{−1}. According to Lemma 1, the minimum embedding size of a random projection k∗ which has ≤ β probability of making an error on a triple with normalized margin ε is upper bounded by a linear function of this quantity. In particular, for β = .05, the lemma entails that k∗ ≤ 8.76(ε²/2 − ε³/3)^{−1}. In this experiment we measure the empirical value of k∗ to evaluate the tightness of the bound.

Figure 2: Minimum k sufficient for Rademacher embeddings to approximate BM25 pairwise rankings on TREC-CAR with error rate β < .05.

The results are shown on the x-axis of Figure 2. For each bin we compute the minimum embedding size required to obtain 95% pairwise accuracy in ranking d1 vs. d2, using a grid of 40 possible values for k between 32 and 9472, shown on the y-axis. (We exclude examples that had higher values of (ε²/2 − ε³/3)^{−1} than the range shown because they did not reach 95% accuracy for the explored range of k.) The figure shows that the theoretical bound is tight up to a constant factor, and that the minimum embedding size that yields the desired fidelity grows linearly with (ε²/2 − ε³/3)^{−1}.

Margins and Document Length. For Boolean inner product, it was possible to express the minimum possible normalized margin (and therefore a sufficient embedding size) in terms of L_Q and L_D, the maximum number of unique terms across all queries and documents, respectively. Unfortunately, it is difficult to analytically derive a minimum normalized margin ε for either TF-IDF or BM25: because each term may have a unique inverse document frequency, the minimum non-zero margin ⟨q, d1 − d2⟩ decreases with the number of terms in the query, as each additional term creates more ways in which two documents can receive nearly the same score. We therefore study empirically how normalized margins vary with maximum document length.
Figure 3: Random projection on BM25 retrieval in the TREC-CAR dataset, with documents binned by length.

Using the TREC-CAR retrieval dataset, we bin documents by length. For each query, we compute the normalized margins between the document with the best BM25 score in the bin and all other documents in the bin, and look at the 10th, 100th, and 1000th smallest normalized margins. The distribution over these normalized margins is shown in Figure 3a, revealing that normalized margins decrease with document length. In practice, the observed minimum normalized margin for a collection of documents and queries is found to be much lower for BM25 compared to Boolean inner product. For example, for the collection used in Figure 2, the minimum normalized margin for BM25 is 6.8e-06, while for Boolean inner product it is 0.0169.

Document Length and Encoding Dimension. Figure 3b shows the growth in the minimum random projection dimension required to reach the desired recall-at-10, using the same document bins as in Figure 3a. As predicted, the required dimension increases with the document length, while the normalized margin shrinks.

2.2 Bounds on General Encoding Functions

We derived upper bounds on the minimum required encoding for random linear projections above, and found the bounds on (q, d1, d2) triples to be empirically tight up to a constant factor. More general non-linear and learned encoders could be more efficient. However, there are general theoretical results showing that it is impossible for any encoder to guarantee an inner product distortion |⟨f(x), f(y)⟩ − ⟨x, y⟩| ≤ ε with an encoding that does not grow as Ω(ε^{−2}) (Larsen and Nelson, 2017; Alon and Klartag, 2017), for vectors x, y with norm ≤ 1. These results suggest more general capacity limitations for fixed-length dual encoders when document length grows.

In our setting, BM25, TF-IDF, and Boolean inner product can all be reformulated equivalently as inner products in a space with vectors of norm at most 1 by L2-normalizing each query vector and rescaling all document vectors by L_D = max_d ‖d‖, a constant factor that grows with the length of the longest document. Now suppose we desire to limit the distortion on the unnormalized inner products to some value ≤ ε̃, which might guarantee a desired performance characteristic. This corresponds to decreasing the maximum normalized inner product distortion ε by a factor of √L_D. According to the general bounds on dimensionality reduction mentioned in the previous paragraph, this could necessitate an increase in the encoding size by a factor of L_D.

However, there are a number of caveats to this theoretical argument. First, the theory states only that there exist vector sets that cannot be encoded into representations that grow more slowly than Ω(ε^{−2}); actual documents and queries might be easier to encode if, for example, they are generated from some simple underlying stochastic process. Second, our construction achieves ‖d‖ ≤ 1 by rescaling all document vectors by a constant factor, but there may be other ways to constrain the norms while using the embedding space more efficiently. Third, in the non-linear case it might be possible to eliminate ranking errors without achieving low inner product distortion. Finally, from a practical perspective, the generalization offered by learned dual encoders might overwhelm any sacrifices in fidelity, when evaluated on real tasks of interest. Lacking theoretical tools to settle these questions, we present a set of empirical investigations in later sections of this paper. But first we explore a lightweight modification to the dual encoder, which offers gains in expressivity at limited additional computational cost.
3 Multi-Vector Encodings

The theoretical analysis suggests that fixed-length vector representations of documents may in general need to be large for long documents, if fidelity with respect to sparse high-dimensional representations is important. Cross-attentional architectures can achieve higher fidelity, but are impractical for large-scale retrieval (Nogueira et al., 2019b; Reimers and Gurevych, 2019; Humeau et al., 2020). We therefore propose a new architecture that represents each document as a fixed-size set of m vectors. Relevance scores are computed as the maximum inner product over this set.

Formally, let x = (x1, ..., xT) represent a sequence of tokens, with x1 equal to the special token [CLS], and define y analogously. Then [h1(x), ..., hT(x)] represents the sequence of contextualized embeddings at the top level of a deep transformer. We define a single-vector representation of the query x as f^(1)(x) = h1(x), and a multi-vector representation of the document y as f^(m)(y) = [h1(y), ..., hm(y)], the first m representation vectors for the sequence of tokens in y, with m < T. The relevance score is defined as ψ^(m)(x, y) = max_{j=1...m} ⟨f^(1)(x), f_j^(m)(y)⟩.

Although this scoring function is not a dual encoder, the search for the highest-scoring document can be implemented efficiently with standard approximate nearest-neighbor search by adding multiple (m) entries for each document to the search index data structure. If some vector f_j^(m)(y) yields the largest inner product with the query vector f^(1)(x), it is easy to show that the corresponding document must be the one that maximizes the relevance score ψ^(m)(x, y). The size of the index must grow by a factor of m, but due to the efficiency of contemporary approximate nearest neighbor and maximum inner product search, the time complexity can be sublinear in the size of the index (Andoni et al., 2019; Guo et al., 2016b). Thus, a model using m vectors of size k to represent documents is more efficient at run-time than a dual encoder that uses a single vector of size mk.
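The reduction of the max-over-inner-products score to a single nearest-neighbor lookup over an expanded index can be sketched as follows (illustrative code, not the paper's implementation; it assumes NumPy, and brute-force search stands in for an approximate MIPS library):

```python
import numpy as np

def build_expanded_index(doc_vectors):
    """doc_vectors: list of (m, k) arrays, one per document.
    Returns a flat (N*m, k) matrix and a map from index row to document id."""
    rows, row_to_doc = [], []
    for doc_id, vectors in enumerate(doc_vectors):
        for vec in vectors:
            rows.append(vec)
            row_to_doc.append(doc_id)
    return np.stack(rows), np.array(row_to_doc)

def retrieve(query_vec, index, row_to_doc, top_k=10):
    # Maximum inner product search over the expanded index; keeping only the
    # best-scoring row per document recovers max_j <f^(1)(x), f_j^(m)(y)>.
    scores = index @ query_vec
    results, seen = [], set()
    for row in np.argsort(-scores):
        doc = int(row_to_doc[row])
        if doc not in seen:
            seen.add(doc)
            results.append((doc, float(scores[row])))
            if len(results) == top_k:
                break
    return results

# Toy usage: m = 8 vectors of size k = 64 per document, random encodings.
rng = np.random.default_rng(0)
docs = [rng.normal(size=(8, 64)) for _ in range(1000)]
index, row_to_doc = build_expanded_index(docs)
print(retrieve(rng.normal(size=64), index, row_to_doc, top_k=5))
```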
This efficiency is a key difference from the POLY-ENCODER (Humeau et al., 2020), which computes a fixed number of vectors per query and aggregates them by softmax attention against document vectors. Yang et al. (2018b) propose a similar architecture for language modeling. Because of the use of softmax in these approaches, it is not possible to decompose the relevance score into a max over inner products, and so fast nearest-neighbor search cannot be applied. In addition, these works did not address retrieval from a large document collection.

Analysis. To see why multi-vector encodings can enable smaller encodings per vector, consider an idealized setting in which each document vector is the sum of m orthogonal segments, such that d = Σ_{i=1}^{m} d^(i), and each query refers to exactly one segment in the gold document.³ An orthogonal segmentation can be obtained by choosing the segments as a partition of the vocabulary.

Theorem 1. Define vectors q, d1, d2 ∈ R^v such that ⟨q, d1⟩ > ⟨q, d2⟩, and assume that both d1 and d2 can be decomposed into m segments such that d1 = Σ_{i=1}^{m} d1^(i), and analogously for d2; all segments across both documents are orthogonal. If there exists an i such that ⟨q, d1^(i)⟩ = ⟨q, d1⟩ and ⟨q, d2⟩ ≥ ⟨q, d2^(i)⟩, then μ(q, d1^(i), d2^(i)) ≥ μ(q, d1, d2). (The proof is in §A.5.)

Remark 1. The BM25 score can be computed from non-negative representations of the document and query; if the segmentation corresponds to a partition of the vocabulary, then the segments will also be non-negative, and thus the condition ⟨q, d2⟩ ≥ ⟨q, d2^(i)⟩ holds for all i.

The relevant case is when the same segment is maximal for both documents, ⟨q, d2^(i)⟩ = max_j ⟨q, d2^(j)⟩, as will hold for "simple" queries that are well-aligned with the segmentation. Then the normalized margin in the multi-vector model will be at least as large as in the equivalent single-vector representation. The relationship to encoding size follows from the theory in the previous section: Theorem 1 implies that if we set f_i^(m)(y) = Ad^(i) (for appropriate A), then an increase in the normalized margin enables the use of a smaller encoding dimension k while still supporting the same pairwise error rate.

³ Here we use (d, q) rather than (x, y) because we describe vector encodings rather than token sequences.
There are now m times more "documents" to evaluate, but Lemma 2 shows that this exerts only a logarithmic increase on the encoding size for a desired recall@r. But while we hope this argument is illuminating, the assumptions of orthogonal segments and perfect segment match against the query are quite strong. We must therefore rely on empirical analysis to validate the efficacy of multi-vector encoding in realistic applications.

Cross-Attention. Cross-attentional architectures can be viewed as a generalization of the multi-vector model: (1) set m = T_max (one vector per token); (2) compute one vector per token in the query; (3) allow more expressive aggregation over vectors than the simple max employed above. Any sparse scoring function (e.g., BM25) can be mimicked by a cross-attention model, which need only compute identity between individual words; this can be achieved by random projection word embeddings whose dimension is proportional to the log of the vocabulary size. By definition, the required representation also grows linearly with the number of tokens in the passage and query. As with the POLY-ENCODER, retrieval in the cross-attention model cannot be performed efficiently at scale using fast nearest-neighbor search. In contemporaneous work, Khattab and Zaharia (2020) propose an approach with T_Y vectors per query and T_X vectors per document, using a simple sum-of-max for aggregation of the inner products. They apply this approach to retrieval via re-ranking the results of T_Y nearest-neighbor searches. Our multi-vector model uses fixed-length representations instead, and a single nearest-neighbor search per query.

4 Experimental Setup

The full IR task requires detection of both precise word overlap and semantic generalization. Our theoretical results focus on the first aspect, and derive theoretical and empirical bounds on the sufficient dimensionality to achieve high fidelity with respect to sparse bag-of-words models as document length grows, for two types of linear random projections. The theoretical setup differs from modeling for realistic information-seeking scenarios in at least two ways.

First, trained non-linear dual encoders might be able to detect precise word overlap with much lower-dimensional encodings, especially for queries and documents with a natural distribution, which may exhibit a low-dimensional subspace structure. Second, the semantic generalization aspect of the IR task may be more important than the first aspect for practical applications, and our theory does not make predictions about how encoder dimensionality relates to such ability to compute general semantic similarity.

We relate the theoretical analysis to text retrieval in practice through experimental studies on three tasks. The first task, described in §5, tests the ability of models to retrieve natural language documents that exactly contain a query, and evaluates both BM25 and deep neural dual encoders on a task of detecting precise word overlap, defined over texts with a natural distribution. The second task, described in §6, is the passage retrieval sub-problem of the open-domain QA version of the Natural Questions (Kwiatkowski et al., 2019; Lee et al., 2019); this benchmark reflects the need to capture graded notions of similarity and has a natural query text distribution. For both of these tasks, we perform controlled experiments varying the maximum length of the documents in the collection, which enables assessing the relationship between encoder dimension and document length.

To evaluate the performance of our best models in comparison to state-of-the-art works on large-scale retrieval and ranking, in §7 we report results on a third group of tasks focusing on passage/document ranking: the passage- and document-level MS MARCO retrieval datasets (Nguyen et al., 2016; Craswell et al., 2020). Here we follow the standard two-stage retrieval and ranking system: a first-stage retrieval from a large document collection, followed by reranking with a cross-attention model. We focus on the impact of the first-stage retrieval model.

4.1 Models

Our experiments compare compressive and sparse dual encoders, cross attention, and hybrid models.

BM25. We use case-insensitive wordpiece tokenizations of texts and default BM25 parameters from the gensim library. We apply either unigram (BM25-uni) or combined unigram+bigram representations (BM25-bi).
Dual Encoders from BERT (DE-BERT). We encode queries and documents using BERT-base, which is a pre-trained transformer network (12 layers, 768 dimensions) (Devlin et al., 2019). We implement dual encoders from BERT as a special case of the multi-vector model formalized in §3, with the number of vectors for the document m = 1: the representations for queries and documents are the top-layer representations at the [CLS] token. This approach is widely used for retrieval (Lee et al., 2019; Reimers and Gurevych, 2019; Humeau et al., 2020; Xiong et al., 2020).⁴ For lower-dimensional encodings, we learn down-projections from d = 768 to k ∈ {32, 64, 128, 512},⁵ implemented as a single feed-forward layer, followed by layer normalization. All parameters are fine-tuned for the retrieval tasks. We refer to these models as DE-BERT-k.

⁴ Based on preliminary experiments with pooling strategies, we use the [CLS] vectors (without the feed-forward projection learned on the next sentence prediction task).
⁵ We experimented with adding a similar layer for d = 768, but this did not offer empirical gains.
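A minimal sketch of this encoder (illustrative code using PyTorch and the Hugging Face transformers API, which differ from the implementation behind the paper; the down-projection and layer normalization follow the description above):

```python
import torch
from torch import nn
from transformers import BertModel, BertTokenizerFast

class DEBert(nn.Module):
    """DE-BERT-k sketch: top-layer [CLS] vector from BERT-base, projected to k."""
    def __init__(self, k=128):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.proj = nn.Linear(self.bert.config.hidden_size, k)  # 768 -> k
        self.norm = nn.LayerNorm(k)

    def encode(self, input_ids, attention_mask):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]        # top-layer [CLS] representation
        return self.norm(self.proj(cls))

    @staticmethod
    def score(query_enc, doc_enc):
        return (query_enc * doc_enc).sum(-1)     # inner-product relevance score

tok = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = DEBert(k=128)
q = tok(["who wrote the iliad"], return_tensors="pt")
d = tok(["The Iliad is an ancient Greek epic poem attributed to Homer."],
        return_tensors="pt")
q_enc = model.encode(q["input_ids"], q["attention_mask"])
d_enc = model.encode(d["input_ids"], d["attention_mask"])
print(DEBert.score(q_enc, d_enc))
```

The multi-vector ME-BERT variant described below differs only in returning the first m top-layer document vectors instead of the single [CLS] vector.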
Cross-Attentional BERT. The most expressive model we consider is cross-attentional BERT, which we implement by applying the BERT encoder to the concatenation of the query and document, with a special [SEP] separator between x and y. The relevance score is a learned linear function of the encoding of the [CLS] token. Due to the computational cost, cross-attentional BERT is applied only in reranking, as in prior work (Nogueira and Cho, 2019; Yang et al., 2019). These models are referred to as CROSS-ATTENTION.
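A sketch of this reranker (again illustrative PyTorch/transformers code rather than the paper's implementation; training details such as the loss are omitted):

```python
import torch
from torch import nn
from transformers import BertModel, BertTokenizerFast

class CrossAttentionScorer(nn.Module):
    """BERT over '[CLS] query [SEP] document [SEP]', scored by a learned linear
    function of the [CLS] encoding."""
    def __init__(self):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.score = nn.Linear(self.bert.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask, token_type_ids):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask,
                        token_type_ids=token_type_ids)
        return self.score(out.last_hidden_state[:, 0]).squeeze(-1)

tok = BertTokenizerFast.from_pretrained("bert-base-uncased")
scorer = CrossAttentionScorer()
batch = tok(["who wrote the iliad"],
            ["The Iliad is an ancient Greek epic poem attributed to Homer."],
            return_tensors="pt", truncation=True)
print(scorer(**batch))  # one relevance score per query-document pair
```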
Multi-Vector Encoding from BERT (ME-BERT). In §3 we introduced a model in which every document is represented by exactly m vectors. We use m = 8 as a good compromise between cost and accuracy in §5 and §6, and find values of 3 to 4 for m more accurate on the datasets in §7. In addition to using BERT output representations directly, we consider down-projected representations, implemented using a feed-forward layer with dimension 768 × k. A model with k-dimensional embeddings is referred to as ME-BERT-k.

Sparse-Dense Hybrids (HYBRID). A natural approach to balancing between the fidelity of sparse representations and the generalization of learned dense ones is to build a hybrid. To do this, we linearly combine a sparse and a dense system's scores using a single trainable weight λ, tuned on a development set. For example, a hybrid model of ME-BERT and BM25-uni is referred to as HYBRID-ME-BERT-uni. We implement approximate search to retrieve using a linear combination of two systems by re-ranking the n-best top-scoring candidates from each system. Prior and concurrent work has also used hybrid sparse-dense models (Guo et al., 2016a; Seo et al., 2019; Karpukhin et al., 2020; Ma et al., 2020; Gao et al., 2020). Our contribution is to assess the impact of sparse-dense hybrids as the document length grows.
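A sketch of this hybrid scoring scheme (illustrative code; how candidates missing from one system's n-best list are handled is an assumption, since it is not specified above):

```python
def hybrid_rerank(dense_scores, sparse_scores, lambda_weight, top_k=1000):
    """Linearly combine a dense and a sparse system's scores.
    dense_scores, sparse_scores: dicts doc_id -> score over each system's n-best
    candidates; lambda_weight is the single weight tuned on development data."""
    candidates = set(dense_scores) | set(sparse_scores)
    # Assumption: a candidate absent from one list backs off to that system's
    # lowest observed score.
    dense_floor = min(dense_scores.values())
    sparse_floor = min(sparse_scores.values())
    combined = {
        doc: dense_scores.get(doc, dense_floor)
             + lambda_weight * sparse_scores.get(doc, sparse_floor)
        for doc in candidates
    }
    return sorted(combined.items(), key=lambda item: -item[1])[:top_k]
```

In practice λ would be selected by a simple sweep on the development set of the target task.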

4.2 Learning and Inference

For the experiments in §5 and §6, all trained models are initialized from BERT-base, and all parameters are fine-tuned using a cross-entropy loss with 7 sampled negatives from a pre-computed 200-document list and additional in-batch negatives (with a total number of 1024 candidates in a batch); the pre-computed candidates include 100 top neighbors from BM25 and 100 random samples. This is similar to the method of Lee et al. (2019), but with additional fixed candidates, as also used in concurrent work (Karpukhin et al., 2020). Given a model trained in this way, for the scalable methods we also applied hard-negative mining as in Gillick et al. (2019), using one iteration when beneficial. More sophisticated negative selection is proposed in concurrent work (Xiong et al., 2020). For retrieval from large document collections with the scalable models, we used ScaNN, an efficient approximate nearest-neighbor search library (Guo et al., 2020); in most experiments we use exact search settings, but we also evaluate approximate search in §7. In §7, the same general approach with slightly different hyperparameters (detailed in that section) was used, to enable more direct comparisons to prior work.
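For concreteness, here is a sketch of building an inner-product ScaNN index over precomputed encodings (following the open-source ScaNN API as documented in its repository; the specific tree and quantization settings shown are illustrative defaults, not the configuration used for these experiments):

```python
import numpy as np
import scann  # https://github.com/google-research/google-research/tree/master/scann

# Hypothetical precomputed encodings: (num_rows, k) float32. With ME-BERT there
# are m rows per document, so a row -> document id map is kept alongside.
doc_encodings = np.load("doc_encodings.npy")
query_encodings = np.load("query_encodings.npy")

searcher = (
    scann.scann_ops_pybind.builder(doc_encodings, 10, "dot_product")
    .tree(num_leaves=2000, num_leaves_to_search=100, training_sample_size=250000)
    .score_ah(2, anisotropic_quantization_threshold=0.2)
    .reorder(100)
    .build()
)
neighbors, scores = searcher.search_batched(query_encodings)
```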
5 Containing Passage ICT Task

We begin with experiments on the task of retrieving a Wikipedia passage y containing a sequence of words x. We create a dataset using Wikipedia, following the Inverse Cloze Task definition by Lee et al. (2019), but adapted to suit the goals of our study. The task is defined by first breaking Wikipedia texts into segments of length at most l.
These form the document collection D. Queries xi are generated by sampling sub-sequences from the documents yi. We use queries of lengths between 5 and 25, and do not remove the queries xi from their corresponding documents yi.

We create a dataset with 1 million queries and evaluate retrieval against four document collections Dl, for l ∈ {50, 100, 200, 400}. Each Dl contains 3 million documents of maximum length l tokens. In addition to original Wikipedia passages, each Dl contains synthetic distractor documents, which contain the large majority of the words in x but differ by one or two tokens. 5K queries are used for evaluation, leaving the rest for training and validation. Although checking containment is a straightforward machine learning task, it is a good testbed for assessing the fidelity of compressive neural models. BM25-bi achieves over 95 MRR@10 across collections for this task.
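A sketch of this construction (our own illustrative code; whitespace tokenization is used for brevity, and the synthetic distractor generation is omitted since it is only described at a high level above):

```python
import random

def make_collection(wiki_texts, max_len):
    """Break Wikipedia texts into segments of length at most max_len tokens."""
    passages = []
    for text in wiki_texts:
        tokens = text.split()
        for start in range(0, len(tokens), max_len):
            passages.append(tokens[start:start + max_len])
    return passages

def sample_query(passage, rng, min_len=5, max_len=25):
    """Sample a contiguous sub-sequence of the passage as the query; the query
    is *not* removed from its source passage. Assumes len(passage) >= min_len."""
    qlen = rng.randint(min_len, min(max_len, len(passage)))
    start = rng.randint(0, len(passage) - qlen)
    return passage[start:start + qlen]

# Usage sketch: build D_200 and pair each sampled query with its source passage.
# collection = make_collection(wikipedia_texts, max_len=200)
# rng = random.Random(0)
# queries = [(sample_query(p, rng), p) for p in collection[:1_000_000]]
```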
Figure 4: Results on the containing passage ICT task as maximum passage length varies (50 to 400 tokens). Left: reranking 200 candidates; Right: retrieval from 3 million candidates. Exact numbers are given in Table A.1.

Figure 4 (left) shows test set results on reranking, where models need to select one of 200 passages (the top 100 from BM25-bi and 100 random candidates). It is interesting to see how strong the sparse models are relative to even a 768-dimensional DE-BERT. As the document length increases, the performance of both the sparse and dense dual encoders worsens; the accuracy of the DE-BERT models falls most rapidly, widening the gap to BM25.

Full cross-attention is nearly perfect and does not degrade with document length. ME-BERT-768, which uses 8 vectors of dimension 768 to represent documents, strongly outperforms the best DE-BERT model. Even ME-BERT-64, which uses 8 vectors of size only 64 instead (thus requiring the same document collection size as DE-BERT-512 while being faster at inference time), outperforms the DE-BERT models by a large margin.

Figure 4 (right) shows results for the much more challenging task of retrieval from 3 million candidates. For the latter setting, we only evaluate models that can efficiently retrieve nearest neighbors from such a large set. We see similar behavior to the reranking setting, with the multi-vector methods exceeding BM25-uni performance for all lengths and DE-BERT models under-performing BM25-uni. The hybrid model outperforms both components in the combination, with the largest improvements over ME-BERT for the longest-document collection.

6 Retrieval for Open-Domain QA

For this task we similarly use English Wikipedia⁶ as four different document collections, of maximum passage length l ∈ {50, 100, 200, 400}, and corresponding approximate sizes of 39 million, 27.3 million, 16.1 million, and 10.2 million documents, respectively. Here we use real user queries contained in the Natural Questions dataset (Kwiatkowski et al., 2019). We follow the setup in Lee et al. (2019). There are 87,925 QA pairs in training and 3,610 QA pairs in the test set. We hold out a subset of training for development.

For document retrieval, a passage is correct for a query x if it contains a string that matches exactly an annotator-provided short answer for the question. We form a reranking task by considering the top 100 results from BM25-uni and 100 random samples, and also consider the full retrieval setting. BM25-uni is used here instead of BM25-bi because it is the stronger model for this task.

Our theoretical results do not make direct predictions for the performance of compressive dual encoder models relative to BM25 on this task. They do tell us that as the document length grows, low-dimensional compressive dual encoders may not be able to measure weighted term overlap precisely, potentially leading to lower performance on the task. Therefore, we would expect that higher-dimensional dual encoders, multi-vector encoders, and hybrid models become more useful for collections with longer documents.

⁶ https://archive.org/download/enwiki-20181220.
Figure 5: Results on NQ passage recall as maximum passage length varies (50 to 400 tokens). Left: reranking of 200 passages; Right: open-domain retrieval over all of (English) Wikipedia. Exact numbers are given in Table A.1.

Figure 5 (left) shows heldout set results on the reranking task. To fairly compare systems that operate over collections of different-sized passages, we allow each model to select approximately the same number of tokens (400) and evaluate whether an answer is contained in them. For example, models retrieving from D50 return their top 8 passages, and ones retrieving from D100 retrieve their top 4. The figure shows this recall@400 tokens across models. The relative performance of BM25-uni and DE-BERT is different from that seen in the ICT task, due to the semantic generalizations needed. Nevertheless, higher-dimensional DE-BERT models generally perform better, and multi-vector models provide further benefits, especially for longer-document collections; ME-BERT-768 outperforms DE-BERT-768 and ME-BERT-64 outperforms DE-BERT-512; CROSS-ATTENTION is still substantially stronger.

Figure 5 (right) shows heldout set results for the task of retrieving from Wikipedia for each of the four document collections Dl. Unlike the reranking setting, only higher-dimensional DE-BERT models outperform BM25 for passages longer than 50. The hybrid models offer large improvements over their components, capturing both precise word overlap and semantic similarity. The gain from adding BM25 to ME-BERT and DE-BERT increases as the length of the documents in the collection grows, which is consistent with our expectations based on the theory.

7 Large-Scale Supervised IR

The previous experimental sections focused on understanding the relationship between compressive encoder representation dimensionality and document length. Here we evaluate whether our newly proposed multi-vector retrieval model ME-BERT, its corresponding dual encoder baseline DE-BERT, and sparse-dense hybrids compare favorably to state-of-the-art models for large-scale supervised retrieval and ranking on IR benchmarks.

Datasets. The MS MARCO passage ranking task focuses on ranking passages from a collection of about 8.8 million passages. About 532k queries paired with relevant passages are provided for training. The MS MARCO document ranking task is on ranking full documents instead. The full collection contains about 3 million documents, and the training set has about 367 thousand queries. We report results on the passage and document development sets, comprising 6,980 and 5,193 queries, respectively, in Table 1. We report MS MARCO and TREC DL 2019 (Craswell et al., 2020) test results in Table 2.

Model Settings. For MS MARCO passage we apply models on the provided passage collection. For MS MARCO document, we follow Yan et al. (2020) and break documents into a set of overlapping passages with length up to 482 tokens, each including the document URL and title. For each task, we train the models on that task's training data only. We initialize the retriever and reranker models with BERT-large. We train dense retrieval models on positive and negative candidates from the 1000-best list of BM25, additionally using one iteration of hard negative mining when beneficial. For ME-BERT, we used m = 3 for the passage task and m = 4 for the document task.
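A sketch of the document-to-passage preprocessing just described (illustrative code; the stride of the sliding window is an assumption, since only the maximum passage length and the URL/title prefix are specified above):

```python
def split_document(url, title, body_tokens, max_len=482, stride=241):
    """Break a document into overlapping token windows of at most max_len tokens;
    each passage carries the document URL and title. The 50% stride is an
    assumption made for illustration."""
    passages, start = [], 0
    while True:
        window = body_tokens[start:start + max_len]
        passages.append({"url": url, "title": title, "tokens": window})
        if start + max_len >= len(body_tokens):
            break
        start += stride
    return passages
```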
            Model          MS-Passage MRR@10   MS-Doc MRR@10
Retrieval   BM25           0.167                0.249
            BM25-E         0.184                0.209
            DOC2QUERY      0.215                -
            DOCT5QUERY     0.278                -
            DEEPCT         0.243                -
            HDCT           -                    0.300
            DE-BERT        0.302                0.288
            ME-BERT        0.334                0.333
            DE-HYBRID      0.304                0.313
            DE-HYBRID-E    0.309                0.315
            ME-HYBRID      0.338                0.346
            ME-HYBRID-E    0.343                0.339
Reranking   MULTI-STAGE    0.390                -
            IDST           0.408                -
            Leaderboard    0.439                -
            DE-BERT        0.391                0.339
            ME-BERT        0.395                0.353
            ME-HYBRID      0.394                0.353

Table 1: Development set results on MS MARCO-Passage (MS-Passage) and MS MARCO-Document (MS-Doc), showing MRR@10.

Model            MRR(MS)   RR      NDCG@10   Holes@10
Passage Retrieval
BM25-Anserini    0.186     0.825   0.506     0.000
DE-BERT          0.295     0.936   0.639     0.165
ME-BERT          0.323     0.968   0.687     0.109
DE-HYBRID-E      0.306     0.951   0.659     0.105
ME-HYBRID-E      0.336     0.977   0.706     0.051
Document Retrieval
Base-Indri       0.192     0.785   0.517     0.002
DE-BERT          -         0.841   0.510     0.188
ME-BERT          -         0.877   0.588     0.109
DE-HYBRID-E      0.287     0.890   0.595     0.084
ME-HYBRID-E      0.310     0.914   0.610     0.063

Table 2: Test set first-pass retrieval results on the passage and document TREC 2019 DL evaluation, as well as MS MARCO eval MRR@10 (passage) and MRR@100 (document) under MRR(MS).

Results. Table 1 comparatively evaluates our models on the dev sets of the two tasks. The state-of-the-art prior work follows the two-stage retrieval and reranking approach, where an efficient first-stage system retrieves a (usually large) list of candidates from the document collection, and a second-stage, more expensive model such as cross-attention BERT reranks the candidates. Our focus is on improving the first stage, and we compare to prior works in two settings: Retrieval, in the top part of Table 1, where only first-stage efficient retrieval systems are used, and Reranking, in the bottom part of the table, where more expensive second-stage models are employed to re-rank candidates. Figure 6 delves into the impact of the first-stage retrieval systems as the number of candidates the second-stage reranker has access to is substantially reduced, improving efficiency.

We report results in comparison to the following systems: 1) MULTI-STAGE (Nogueira and Lin, 2019), which reranks BM25 candidates with a cascade of BERT models; 2) DOC2QUERY (Nogueira et al., 2019b) and DOCT5QUERY (Nogueira and Lin, 2019), which use neural models to expand documents before indexing and scoring with sparse retrieval models; 3) DEEPCT (Dai and Callan, 2020b), which learns to map BERT's contextualized text representations to context-aware term weights; 4) HDCT (Dai and Callan, 2020a), which uses a hierarchical approach that combines passage-level term weights into document-level term weights; 5) IDST, a two-stage cascade ranking pipeline by Yan et al. (2020); and 6) Leaderboard, the best score on the MS MARCO-passage leaderboard as of Sept. 18, 2020.⁷

We also compare our models both to our own BM25 implementation described in §4.1 and to external publicly available sparse model implementations, denoted BM25-E. For the passage task, BM25-E is the Anserini (Yang et al., 2018a) system with default parameters. For the document task, BM25-E is the official IndriQueryLikelihood baseline. We report on dense-sparse hybrids using both our own BM25 and the external sparse systems; the latter hybrids are indicated by the suffix -E.

Looking at the top part of Table 1, we can see that our DE-BERT model already outperforms or is competitive with prior systems. The multi-vector model brings a larger improvement on the dataset containing longer documents (MS MARCO document), and the sparse-dense hybrid models bring improvements over dense-only models on both datasets. According to a Wilcoxon signed rank test for statistical significance, all differences between DE-BERT, ME-BERT, DE-HYBRID-E, and ME-HYBRID-E are statistically significant on both development sets with p-value < .0001.

⁷ https://microsoft.github.io/msmarco/.
Figure 6: MRR@10 when reranking at different retrieval depths (10 to 1000 candidates) for MS MARCO.

When a large number of candidates can be reranked, the impact of the first-stage system decreases. In the bottom part of Table 1 we see that our models are comparable to systems reranking BM25 candidates. The accuracy of the first-stage system is particularly important when the cost of reranking a large set of candidates is prohibitive. Figure 6 shows the performance of systems that rerank a smaller number of candidates. We see that, when a very small number of candidates can be scored with expensive cross-attention models, the multi-vector ME-BERT and hybrid models achieve large improvements compared to prior systems on both MS MARCO tasks.

Table 2 shows test results for dense models, external sparse model baselines, and hybrids of the two (without reranking). In addition to test set (eval) results on the MS MARCO passage task, we report metrics on the manually annotated passage and document retrieval test sets of TREC DL 2019. We report the fraction of unrated items as Holes@10, following Xiong et al. (2020).

Figure 7: Quality/running time tradeoff for DE-BERT and ME-BERT on the MS MARCO passage dev set. Dashed lines show quality with exact search.

Time and Space Analysis. Figure 7 compares the running time/quality trade-off curves for DE-BERT and ME-BERT on the MS MARCO passage task, using the ScaNN (Guo et al., 2020) library on a machine with 160 Intel(R) Xeon(R) CPU cores @ 2.20GHz and 1.88TB of memory. Both models use one vector of size k = 1024 per query; DE-BERT uses one and ME-BERT uses 3 vectors of size k = 1024 per document. The size of the document index for DE-BERT is 34.2GB, and the size of the index for ME-BERT is about 3 times larger. The indexing time was 1.52h and 3.02h for DE-BERT and ME-BERT, respectively. The ScaNN configuration we use is num_leaves=5000; num_leaves_to_search ranges from 25 to 2000 (from less to more exact search), and time per query is measured using parallel inference on all 160 cores. In the higher-quality range of the curves, ME-BERT achieves substantially higher MRR than DE-BERT for the same inference time per query.

8 Related Work

We have mentioned research on improving the accuracy of retrieval models throughout the paper. Here we focus on work related to our central concern: the capacity of dense dual encoder representations relative to sparse bags-of-words.

In compressive sensing it is possible to recover a bag-of-words vector x from the projection Ax for suitable A. Bounds for the sufficient dimensionality of isotropic Gaussian projections (Candes and Tao, 2005; Arora et al., 2018) are more pessimistic than the bound described in §2, but this is unsurprising because the task of recovering bags-of-words from a compressed measurement is strictly harder than recovering inner products.
Subramani et al. (2019) ask whether it is possible to exactly recover sentences (token sequences) from pretrained decoders, using vector embeddings that are added as a bias to the decoder hidden state. Because their decoding model is more expressive (and thus more computationally intensive) than inner product retrieval, the theoretical issues examined here do not apply. Nonetheless, Subramani et al. (2019) empirically observe a similar dependence between sentence length and embedding size. Wieting and Kiela (2019) represent sentences as bags of random projections, finding that high-dimensional projections (k = 4096) perform nearly as well as trained encoding models. These results provide further empirical support for the hypothesis that bag-of-words vectors from real text are "hard to embed" in the sense of Larsen and Nelson (2017). Our contribution is to systematically explore the relationship between document length and encoding dimension, focusing on the case of exact inner product-based retrieval. We leave the combination of representation learning and approximate retrieval for future work.

9 Conclusion

Transformers perform well on an unreasonable range of problems in natural language processing. Yet the computational demands of large-scale retrieval push us to seek other architectures: cross-attention over contextualized embeddings is too slow, but dual encoding into fixed-length vectors may be insufficiently expressive, sometimes failing even to match the performance of sparse bag-of-words competitors. We have used both theoretical and empirical techniques to characterize the fidelity of fixed-length dual encoders, focusing on the role of document length. Based on these observations, we propose hybrid models that yield strong performance while maintaining scalability.

Acknowledgments

We thank Ming-Wei Chang, Jon Clark, William Cohen, Kelvin Guu, Sanjiv Kumar, Kenton Lee, Jimmy Lin, Ankur Parikh, Ice Pasupat, Iulia Turc, William A. Woods, Vincent Zhao, and the anonymous reviewers for helpful discussions of this work.

References

Dimitris Achlioptas. 2003. Database-friendly random projections: Johnson-Lindenstrauss with binary coins. Journal of Computer and System Sciences, 66(4):671–687. DOI: https://doi.org/10.1016/S0022-0000(03)00025-4

Noga Alon and Bo'az Klartag. 2017. Optimal compression of approximate inner products and dimension reduction. In 58th Annual Symposium on Foundations of Computer Science (FOCS). DOI: https://doi.org/10.1109/FOCS.2017.65

Alexandr Andoni, Piotr Indyk, and Ilya Razenshteyn. 2019. Approximate nearest neighbor search in high dimensions. In Proceedings of the International Congress of Mathematicians (ICM 2018).

Sanjeev Arora, Mikhail Khodak, Nikunj Saunshi, and Kiran Vodrahalli. 2018. A compressed sensing view of unsupervised text embeddings, bag-of-n-grams, and LSTMs. In Proceedings of the International Conference on Learning Representations (ICLR). DOI: https://doi.org/10.1142/9789813272880_0182

Shai Ben-David, Nadav Eiron, and Hans Ulrich Simon. 2002. Limitations of learning via embeddings in Euclidean half spaces. Journal of Machine Learning Research, 3(Nov):441–461.

Emmanuel J. Candes and Terence Tao. 2005. Decoding by linear programming. IEEE Transactions on Information Theory, 51(12):4203–4215. DOI: https://doi.org/10.1109/TIT.2005.858979

Nick Craswell, Bhaskar Mitra, Emine Yilmaz, Daniel Campos, and Ellen M. Voorhees. 2020. Overview of the TREC 2019 deep learning track. In Text REtrieval Conference (TREC).

Zhuyun Dai and Jamie Callan. 2020a. Context-aware document term weighting for ad-hoc search. In Proceedings of The Web Conference 2020. DOI: https://doi.org/10.1145/3366423.3380258

Zhuyun Dai and Jamie Callan. 2020b. Context-aware sentence/passage term importance estimation for first stage retrieval. In Proceedings of the ACM SIGIR International Conference on Theory of Information Retrieval.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.
Laura Dietz, Ben Gamari, Jeff Dalton, and Nick Craswell. 2018. TREC complex answer retrieval overview. In Text REtrieval Conference (TREC).

Luyu Gao, Zhuyun Dai, Zhen Fan, and Jamie Callan. 2020. Complementing lexical retrieval with semantic residual embedding. CoRR, abs/2004.13969. Version 1.

Daniel Gillick, Sayali Kulkarni, Larry Lansing, Alessandro Presta, Jason Baldridge, Eugene Ie, and Diego Garcia-Olano. 2019. Learning dense representations for entity retrieval. In Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL). DOI: https://doi.org/10.18653/v1/K19-1049

Jiafeng Guo, Yixing Fan, Qingyao Ai, and W. Bruce Croft. 2016a. A deep relevance matching model for ad-hoc retrieval. In Proceedings of the 25th ACM International on Conference on Information and Knowledge Management.

Ruiqi Guo, Sanjiv Kumar, Krzysztof Choromanski, and David Simcha. 2016b. Quantization based fast inner product search. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS).

Ruiqi Guo, Philip Sun, Erik Lindgren, Quan Geng, David Simcha, Felix Chern, and Sanjiv Kumar. 2020. Accelerating large-scale inference with anisotropic vector quantization. In Proceedings of the 37th International Conference on Machine Learning.

Po-Sen Huang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Acero, and Larry Heck. 2013. Learning deep structured semantic models for web search using clickthrough data. In Proceedings of the International Conference on Information and Knowledge Management (CIKM). DOI: https://doi.org/10.1145/2505515.2505665

Samuel Humeau, Kurt Shuster, Marie-Anne Lachaux, and Jason Weston. 2020. Poly-encoders: Transformer architectures and pre-training strategies for fast and accurate multi-sentence scoring. In Proceedings of the International Conference on Learning Representations (ICLR).

Thathachar S. Jayram and David P. Woodruff. 2013. Optimal bounds for Johnson-Lindenstrauss transforms and streaming problems with subconstant error. ACM Transactions on Algorithms (TALG), 9(3):1–17. DOI: https://doi.org/10.1145/2483699.2483706

William B. Johnson and Joram Lindenstrauss. 1984. Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics, 26(189–206):1.

Daniel Kane, Raghu Meka, and Jelani Nelson. 2011. Almost optimal explicit Johnson-Lindenstrauss families. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques.

Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. 2020. Dense passage retrieval for open-domain question answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). DOI: https://doi.org/10.18653/v1/2020.emnlp-main.550

Omar Khattab and Matei Zaharia. 2020. ColBERT: Efficient and effective passage search via contextualized late interaction over BERT. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval. DOI: https://doi.org/10.1145/3397271.3401075

Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, Kristina Toutanova, Llion Jones, Matthew Kelcey, Ming-Wei Chang, Andrew M. Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. 2019. Natural questions: A benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:453–466. DOI: https://doi.org/10.1162/tacl_a_00276
Cody Kwok, Oren Etzioni, and Daniel S. Weld. 2001. Scaling question answering to the web. ACM Transactions on Information Systems (TOIS), 19(3):242–262. DOI: https://doi.org/10.1145/502115.502117

Kasper Green Larsen and Jelani Nelson. 2017. Optimality of the Johnson-Lindenstrauss lemma. In 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS). DOI: https://doi.org/10.1109/FOCS.2017.64

Kenton Lee, Ming-Wei Chang, and Kristina Toutanova. 2019. Latent retrieval for weakly supervised open domain question answering. In Proceedings of the Association for Computational Linguistics (ACL).

Ji Ma, Ivan Korotkov, Yinfei Yang, Keith Hall, and Ryan T. McDonald. 2020. Zero-shot neural retrieval via domain-targeted synthetic query generation. CoRR, abs/2004.14503.

Bhaskar Mitra and Nick Craswell. 2018. An introduction to neural information retrieval. Foundations and Trends in Information Retrieval, 13(1):1–126. DOI: https://doi.org/10.1561/1500000061

Karthik Narasimhan, Adam Yala, and Regina Barzilay. 2016. Improving information extraction by acquiring external evidence with reinforcement learning. In Proceedings of Empirical Methods in Natural Language Processing (EMNLP). DOI: https://doi.org/10.18653/v1/D16-1261

Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, and Li Deng. 2016. MS MARCO: A human generated machine reading comprehension dataset.

Rodrigo Nogueira and Kyunghyun Cho. 2019. Passage re-ranking with BERT. CoRR, abs/1901.04085.

Rodrigo Nogueira and Jimmy Lin. 2019. From doc2query to docTTTTTquery. https://github.com/castorini/docTTTTTquery

Rodrigo Nogueira, Wei Yang, Kyunghyun Cho, and Jimmy Lin. 2019a. Multi-stage document ranking with BERT. CoRR, abs/1910.14424.

Rodrigo Nogueira, Wei Yang, Jimmy Lin, and Kyunghyun Cho. 2019b. Document expansion by query prediction. CoRR, abs/1904.08375.

Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of Empirical Methods in Natural Language Processing (EMNLP). DOI: https://doi.org/10.18653/v1/D19-1410

Stephen Robertson and Hugo Zaragoza. 2009. The probabilistic relevance framework: BM25 and beyond. Foundations and Trends in Information Retrieval, 3(4):333–389. DOI: https://doi.org/10.1561/1500000019

Minjoon Seo, Jinhyuk Lee, Tom Kwiatkowski, Ankur Parikh, Ali Farhadi, and Hannaneh Hajishirzi. 2019. Real-time open-domain question answering with dense-sparse phrase index. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics.

Karen Spärck Jones. 1972. A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation, 28(1):11–21. DOI: https://doi.org/10.1108/eb026526

Nishant Subramani, Samuel Bowman, and Kyunghyun Cho. 2019. Can unconditional language models recover arbitrary sentences? In Advances in Neural Information Processing Systems.

Santosh S. Vempala. 2004. The Random Projection Method, volume 65. American Mathematical Society.

Ellen M. Voorhees. 2001. The TREC question answering track. Natural Language Engineering, 7(4):361–378. DOI: https://doi.org/10.1017/S1351324901002789

John Wieting and Douwe Kiela. 2019. No training required: Exploring random encoders for sentence classification. In Proceedings of the International Conference on Learning Representations (ICLR).
Lee Xiong, Chenyan Xiong, Ye Li, Kwok-Fung Tang, Jialin Liu, Paul Bennett, Junaid Ahmed, and Arnold Overwijk. 2020. Approximate nearest neighbor negative contrastive learning for dense text retrieval. CoRR, abs/2007.00808. Version 1.

Ming Yan, Chenliang Li, Chen Wu, Bin Bi, Wei Wang, Jiangnan Xia, and Luo Si. 2020. IDST at TREC 2019 deep learning track: Deep cascade ranking with generation-based document expansion and pre-trained language modeling. In Text REtrieval Conference (TREC).

Peilin Yang, Hui Fang, and Jimmy Lin. 2018a. Anserini: Reproducible ranking baselines using Lucene. Journal of Data and Information Quality, 10(4). DOI: https://doi.org/10.1145/3239571

Wei Yang, Haotian Zhang, and Jimmy Lin. 2019. Simple applications of BERT for ad hoc document retrieval. CoRR, abs/1903.10972.

Zhilin Yang, Zihang Dai, Ruslan Salakhutdinov, and William W. Cohen. 2018b. Breaking the softmax bottleneck: A high-rank RNN language model. In Proceedings of the International Conference on Learning Representations (ICLR).

A Proofs

A.1 Lemma 1

Proof. For both distributions of embeddings, the error on the squared norm can be bounded with high probability (Achlioptas, 2003, Lemma 5.1):

$$\Pr\left(\left|\|Ax\|^2 - \|x\|^2\right| > \epsilon\|x\|^2\right) < 2\exp\left(-\tfrac{k}{2}\left(\epsilon^2/2 - \epsilon^3/3\right)\right). \tag{4}$$

This bound implies an analogous bound on the absolute error of the inner product (Ben-David et al., 2002, Corollary 19),

$$\Pr\left(\left|\langle Ax, Ay\rangle - \langle x, y\rangle\right| \geq \tfrac{\epsilon}{2}\left(\|x\|^2 + \|y\|^2\right)\right) \leq 4\exp\left(-\tfrac{k}{2}\left(\epsilon^2/2 - \epsilon^3/3\right)\right). \tag{5}$$

Let $\bar{q} = q/\|q\|$ and $\bar{d} = (d_1 - d_2)/\|d_1 - d_2\|$. Then $\mu(q, d_1, d_2) = \langle\bar{q}, \bar{d}\rangle$. A ranking error occurs if and only if $\langle A\bar{q}, A\bar{d}\rangle \leq 0$, which implies $|\langle A\bar{q}, A\bar{d}\rangle - \langle\bar{q}, \bar{d}\rangle| \geq \epsilon$. By construction $\|\bar{q}\| = \|\bar{d}\| = 1$, so the probability of an inner product distortion $\geq \epsilon$ is bounded by the right-hand side of (5).

A.2 Corollary 1

Proof. We have $\epsilon = \mu(q, d_1, d_2) = \langle\bar{q}, \bar{d}\rangle \leq 1$ by the Cauchy-Schwarz inequality. For $\epsilon \leq 1$, we have $\epsilon^2/6 \leq \epsilon^2/2 - \epsilon^3/3$. We can then loosen the bound in (1) to $\beta \leq 4\exp\left(-\tfrac{k}{2}\cdot\tfrac{\epsilon^2}{6}\right)$. Taking the natural log yields $\ln\beta \leq \ln 4 - \epsilon^2 k/12$, which can be rearranged into $k \geq 12\epsilon^{-2}\ln\tfrac{4}{\beta}$.

A.3 Lemma 2

Proof. For convenience define $\mu(d_2) = \mu(q, d_1, d_2)$. Define $\epsilon$ as in the theorem statement, and $D_\epsilon = \{d_2 \in D : \mu(q, d_1, d_2) \geq \epsilon\}$. We have

$$\begin{aligned}
\Pr(R \geq r_0) &\leq \Pr\left(\exists d_2 \in D_\epsilon : \langle Aq, Ad_1\rangle \leq \langle Aq, Ad_2\rangle\right) \\
&\leq \sum_{d_2 \in D_\epsilon} 4\exp\left(-\tfrac{k}{2}\left(\mu(d_2)^2/2 - \mu(d_2)^3/3\right)\right) \\
&\leq 4|D_\epsilon|\exp\left(-\tfrac{k}{2}\left(\epsilon^2/2 - \epsilon^3/3\right)\right).
\end{aligned}$$

The first inequality follows because the event $R \geq r_0$ implies the event $\exists d_2 \in D_\epsilon : \langle Aq, Ad_1\rangle \leq \langle Aq, Ad_2\rangle$. The second inequality follows by a combination of Lemma 1 and the union bound. The final inequality follows because for any $d_2 \in D_\epsilon$, $\mu(q, d_1, d_2) \geq \epsilon$. The theorem follows because $|D_\epsilon| = |D| - r_0 + 1$.

A.4 Corollary 2

Proof. For the retrieval function $\max_d \langle q, d\rangle$, the minimum non-zero unnormalized margin $\langle q, d_1\rangle - \langle q, d_2\rangle$ is 1 when $q$ and $d$ are Boolean vectors. Therefore the normalized margin has lower bound $\mu(q, d_1, d_2) \geq 1/(\|q\| \times \|d_1 - d_2\|)$. For non-negative $d_1$ and $d_2$ we have $\|d_1 - d_2\| \leq \sqrt{\|d_1\|^2 + \|d_2\|^2} \leq \sqrt{2 L_D}$. Preserving a normalized margin of $\epsilon = (2 L_Q L_D)^{-\frac{1}{2}}$ is therefore sufficient to avoid any pairwise errors. By plugging this value into Corollary 1, we see that setting $k \geq 24 L_Q L_D \ln\tfrac{4}{\beta}$ ensures that the probability of any pairwise error is $\leq \beta$.
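To make these dimension bounds concrete, the following numerical sketch (our illustration, not code or experiments from this paper; the Gaussian projection, the margin values, β = 0.05, and the Boolean lengths L_Q = 16, L_D = 256 are assumptions chosen for the example) estimates how often a random projection with the k prescribed by Corollary 1 flips the sign of an inner product equal to the normalized margin ε, and prints the dimension that Corollary 2 prescribes:

import numpy as np

def ranking_error_rate(eps, k, trials=2000, seed=0):
    # Unit vectors with inner product exactly eps; only the 2-D subspace
    # they span matters for the projected inner product.
    rng = np.random.default_rng(seed)
    q_bar = np.array([1.0, 0.0])
    d_bar = np.array([eps, np.sqrt(1.0 - eps ** 2)])
    # One Gaussian projection A (k x 2) per trial, scaled by 1/sqrt(k).
    A = rng.normal(size=(trials, k, 2)) / np.sqrt(k)
    Aq, Ad = A @ q_bar, A @ d_bar              # each of shape (trials, k)
    # A ranking error is the event <A q_bar, A d_bar> <= 0 (Lemma 1).
    return float(((Aq * Ad).sum(axis=1) <= 0).mean())

beta = 0.05
for eps in (0.2, 0.3):
    k = int(np.ceil(12 * eps ** -2 * np.log(4 / beta)))   # Corollary 1
    rate = ranking_error_rate(eps, k)
    print(f"eps={eps}: k={k}, empirical error rate={rate:.4f}, bound beta={beta}")

L_Q, L_D = 16, 256                                         # illustrative Boolean lengths
print("Corollary 2 bound on k:", int(np.ceil(24 * L_Q * L_D * np.log(4 / beta))))

The empirical error rate should come out well below β, since the bound is loose, while the Corollary 2 line illustrates how the sufficient dimension grows with the product L_Q L_D, the quantity behind the capacity limitation for long documents discussed in the main text.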

Model Reranking Retrieval
Passage length 50 100 200 400 50 100 200 400
ICT task (MRR@10)
CROSS-ATTENTION 99.9 99.9 99.8 99.6 - - - -
HYBRID-ME-BERT-uni - - - - 98.2 97.0 94.4 91.9
HYBRID-ME-BERT-bi - - - - 99.3 99.0 97.3 96.1
ME-BERT-768 98.0 96.7 92.4 89.8 96.8 96.1 91.1 85.2
ME-BERT-64 96.3 94.2 89.0 83.7 92.9 91.7 84.6 72.8
DE-BERT-768 91.7 87.8 79.7 74.1 90.2 85.6 72.9 63.0
DE-BERT-512 91.4 87.2 78.9 73.1 89.4 81.5 66.8 55.8
DE-BERT-128 90.5 85.0 75.0 68.1 85.7 75.4 58.0 47.3
DE-BERT-64 88.8 82.0 70.7 63.8 82.8 68.9 48.5 38.3
DE-BERT-32 83.6 74.9 62.6 55.9 70.1 53.2 34.0 27.6
BM25-uni 92.1 88.6 84.6 81.8 92.1 88.6 84.6 81.8
BM25-bi 98.0 97.1 95.9 94.5 98.0 97.1 95.9 94.5
NQ (Recall@400 tokens)
CROSS-ATTENTION 48.9 55.5 54.2 47.6 - - - -
HYBRID-ME-BERT-uni - - - - 45.7 49.5 48.5 42.9
ME-BERT-768 43.6 49.6 46.5 38.7 42.0 43.3 40.4 34.4
ME-BERT-64 44.4 48.7 44.5 38.2 42.2 43.4 38.9 33.0
DE-BERT-768 42.9 47.7 44.4 36.6 44.2 44.0 40.1 32.2
DE-BERT-512 43.8 48.5 44.1 36.5 43.3 43.2 38.8 32.7
DE-BERT-128 42.8 45.7 41.2 35.7 38.0 36.7 32.8 27.0
DE-BERT-64 42.6 45.7 42.5 35.4 37.4 35.1 32.6 26.6
DE-BERT-32 42.4 45.8 42.1 34.0 36.3 34.7 31.0 24.9
BM25-uni 30.1 35.7 34.1 30.1 30.1 35.7 34.1 30.1

Table A.1: Results on the ICT and NQ tasks (corresponding to Figure 4 and Figure 5).

A.5 Theorem 1
Proof. Recall that $\mu(q, d_1, d_2) = \frac{\langle q, d_1 - d_2\rangle}{\|q\| \times \|d_1 - d_2\|}$. By assumption we have $\langle q, d_1\rangle = \langle q, d_1^{(i)}\rangle$ and $\max_j \langle q, d_2^{(j)}\rangle \leq \langle q, d_2\rangle$, implying that

$$\langle q, d_1^{(i)} - d_2^{(i)}\rangle \geq \langle q, d_1 - d_2\rangle. \tag{6}$$

In the denominator, we expand $\|d_1 - d_2\| = \|(d_1^{(i)} - d_2^{(i)}) + (d_1^{(\neg i)} - d_2^{(\neg i)})\|$, where $d^{(\neg i)} = \sum_{j \neq i} d^{(j)}$. Plugging this into the squared norm,

$$\begin{aligned}
\|d_1 - d_2\|^2 &= \|(d_1^{(i)} - d_2^{(i)}) + (d_1^{(\neg i)} - d_2^{(\neg i)})\|^2 && (7)\\
&= \|d_1^{(i)} - d_2^{(i)}\|^2 + \|d_1^{(\neg i)} - d_2^{(\neg i)}\|^2 + 2\langle d_1^{(i)} - d_2^{(i)},\, d_1^{(\neg i)} - d_2^{(\neg i)}\rangle && (8)\\
&= \|d_1^{(i)} - d_2^{(i)}\|^2 + \|d_1^{(\neg i)} - d_2^{(\neg i)}\|^2 && (9)\\
&\geq \|d_1^{(i)} - d_2^{(i)}\|^2. && (10)
\end{aligned}$$

The inner product $\langle d_1^{(i)} - d_2^{(i)}, d_1^{(\neg i)} - d_2^{(\neg i)}\rangle = 0$ because the segments are orthogonal. The combination of (6) and (10) completes the theorem: restricting to segment $i$ can only increase the numerator of the normalized margin and decrease its denominator.
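As a small numeric sanity check of inequalities (6) and (10) (our illustration, not from the paper; the number of segments, segment width, and non-negative random vectors are assumptions chosen so that the stated conditions hold, with the gold document built to have a positive margin as in the retrieval setting), the sketch below places the query entirely in one block-disjoint segment and verifies that restricting to the matching segment does not decrease the normalized margin:

import numpy as np

def normalized_margin(q, d1, d2):
    return q @ (d1 - d2) / (np.linalg.norm(q) * np.linalg.norm(d1 - d2))

rng = np.random.default_rng(1)
m, width, i = 4, 8, 0                       # m block-disjoint (orthogonal) segments
pad = [np.zeros(width)] * (m - 1)

q_seg = rng.random(width)
q = np.concatenate([q_seg] + pad)           # query is zero outside segment i
d1_segs = [rng.random(width) for _ in range(m)]
d1_segs[i] = d1_segs[i] + q_seg             # give d1 a positive margin over d2
d2_segs = [rng.random(width) for _ in range(m)]
d1, d2 = np.concatenate(d1_segs), np.concatenate(d2_segs)

# The assumptions hold here: <q, d1> = <q, d1^(i)> because q is zero outside
# segment i, and max_j <q, d2^(j)> <= <q, d2> because all entries are non-negative.
d1_i = np.concatenate([d1_segs[i]] + pad)   # d1^(i) embedded in the full space
d2_i = np.concatenate([d2_segs[i]] + pad)
print("document-level margin:", normalized_margin(q, d1, d2))
print("segment-level margin :", normalized_margin(q, d1_i, d2_i))

In this construction the numerator is unchanged while the denominator can only shrink, as in (10), so the segment-level margin is at least the document-level one; by Corollary 1, a larger normalized margin means a smaller encoding dimension suffices.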

