Sparse, Dense, and Attentional Representations For Text Retrieval
Google Research
{luanyi, jeisenstein, kristout, mjcollins}@google.com
Transactions of the Association for Computational Linguistics, vol. 9, pp. 329-345, 2021. https://doi.org/10.1162/tacl_a_00369
Action Editor: Jimmy Lin. Submission batch: 6/2020; Revision batch: 9/2020; Published 4/2021.
© 2021 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.
benchmarks (MS MARCO passage and document), and passage retrieval for question answering (Natural Questions). We confirm prior findings that full attentional architectures excel at reranking tasks, but are not efficient enough for large-scale retrieval. Of the more efficient alternatives, the hybridized multi-vector encoder is at or near the top in every evaluation, outperforming state-of-the-art retrieval results in MS MARCO. Our code is publicly available at https://github.com/google-research/language/tree/master/language/multivec.

Figure 1: Recall@1 for retrieving the passage containing a query from three million candidates. The figure compares a fine-tuned BERT-based dual encoder (DE-BERT-768), an off-the-shelf BERT-based encoder with average pooling (BERT-init), and sparse term-based retrieval (BM25), while binning passages by length.
of the matrix $A$ is sampled with equal probability from two possible values: $\{-\frac{1}{\sqrt{k}}, \frac{1}{\sqrt{k}}\}$. In Gaussian embeddings, each $a_{i,j} \sim \mathcal{N}(0, k^{-1/2})$. A pairwise ranking error occurs when $\langle q, d_1 \rangle > \langle q, d_2 \rangle$ but $\langle Aq, Ad_1 \rangle < \langle Aq, Ad_2 \rangle$. Using such random projections, it is possible to bound the probability of any such pairwise error in terms of the embedding size.

Definition 2.1. For a query $q$ and pair of documents $(d_1, d_2)$ such that $\langle q, d_1 \rangle \geq \langle q, d_2 \rangle$, the normalized margin is defined as
\[ \mu(q, d_1, d_2) = \frac{\langle q, d_1 - d_2 \rangle}{\|q\| \times \|d_1 - d_2\|}. \]

Lemma 1. Define a matrix $A \in \mathbb{R}^{k \times d}$ of Gaussian or Rademacher random embeddings, and let $\epsilon = \mu(q, d_1, d_2) > 0$. Then
\[ \Pr(\langle Aq, Ad_1 \rangle \leq \langle Aq, Ad_2 \rangle) \leq 4 \exp\left(-\frac{k}{2}\left(\frac{\epsilon^2}{2} - \frac{\epsilon^3}{3}\right)\right). \quad (1) \]

Corollary 1. Define vectors $q, d_1, d_2$ such that $\epsilon = \mu(q, d_1, d_2) > 0$. If $A \in \mathbb{R}^{k \times v}$ is a matrix of random Gaussian or Rademacher embeddings such that $k > 12\epsilon^{-2} \ln\frac{4}{\beta}$, then $\Pr(\langle Aq, Ad_1 \rangle \leq \langle Aq, Ad_2 \rangle) \leq \beta$.

On the Tightness of the Bound. Let $k^*(q, d_1, d_2)$ denote the lowest dimension of a Gaussian or Rademacher random projection following the definition in Lemma 1 for which $\Pr(\langle Aq, Ad_1 \rangle < \langle Aq, Ad_2 \rangle) \leq \beta$, for a given document pair $(d_1, d_2)$ and query $q$ with normalized margin $\epsilon$. Our lemma places an upper bound on $k^*$, saying that $k^*(q, d_1, d_2) \leq 2(\epsilon^2/2 - \epsilon^3/3)^{-1} \ln\frac{4}{\beta}$. Any $k \geq k^*(q, d_1, d_2)$ has sufficiently low probability of error, but lower values of $k$ could potentially also have the desired property. Later in this section we perform empirical evaluation to study the tightness of the bound; although theoretical tightness (up to a constant factor) is suggested by results on the optimality of the distributional Johnson-Lindenstrauss lemma (Johnson and Lindenstrauss, 1984; Jayram and Woodruff, 2013; Kane et al., 2011), here we study the question only empirically.

2.1.1 Recall-at-r

In retrieval applications, it is important to return the desired result within the top $r$ search results. For query $q$, define $d_1$ as the document that maximizes some inner product ranking metric. The probability of returning $d_1$ in the top $r$ results can likewise be bounded in terms of the embedding size and the normalized margins (Lemma 2). The proof is in §A.3. A direct consequence of the lemma is that to achieve recall-at-$r_0 = 1$ for a given $(q, d_1, D)$ triple with probability $\geq 1 - \beta$, it is sufficient to set
\[ k \geq \frac{2}{\epsilon^2/2 - \epsilon^3/3} \ln\frac{4(|D| - r_0 + 1)}{\beta}, \quad (3) \]
where $\epsilon$ is the $r_0$'th smallest normalized margin.^2 As with the bound on pairwise relevance errors in Lemma 1, Lemma 2 implies an upper bound on the minimum random projection dimension $k^*(q, d_1, D)$ that recalls $d_1$ in the top $r_0$ results with probability $\geq 1 - \beta$. Due to the application of the union bound and worst-case assumptions about the normalized margins of documents in $D_\epsilon$, this bound is potentially loose. Later in this section we examine the empirical relationship between maximum document length, the distribution of normalized margins, and $k^*$.

^2 The case where multiple documents are tied with normalized margin $\epsilon$ is straightforward but slightly complicates the analysis.
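To make the quantities in Eq. (3) concrete, the following sketch (ours, not from the paper; the vocabulary size, sparsity, and toy collection are illustrative assumptions) computes the $r_0$'th smallest normalized margin for a small Boolean collection and the embedding size that Eq. (3) declares sufficient:

    import numpy as np

    def normalized_margin(q, d1, d2):
        # mu(q, d1, d2) = <q, d1 - d2> / (||q|| * ||d1 - d2||)
        diff = d1 - d2
        return float(np.dot(q, diff) / (np.linalg.norm(q) * np.linalg.norm(diff)))

    def sufficient_k_for_recall(q, d1, collection, r0, beta):
        # Eq. (3): k >= 2 / (eps^2/2 - eps^3/3) * ln(4(|D| - r0 + 1) / beta),
        # where eps is the r0'th smallest normalized margin of d1 against the rest of D.
        margins = sorted(normalized_margin(q, d1, d) for d in collection if d is not d1)
        eps = margins[r0 - 1]
        assert eps > 0, "d1 must outscore all but at most r0 - 1 competitors"
        denom = eps ** 2 / 2 - eps ** 3 / 3
        return int(np.ceil(2.0 / denom * np.log(4 * (len(collection) - r0 + 1) / beta)))

    # Toy Boolean bag-of-words collection; document 0 contains every query term.
    rng = np.random.default_rng(0)
    v, n_docs = 1000, 500
    collection = [(rng.random(v) < 0.05).astype(float) for _ in range(n_docs)]
    q = (rng.random(v) < 0.01).astype(float)
    collection[0] = np.clip(collection[0] + q, 0.0, 1.0)
    print(sufficient_k_for_recall(q, collection[0], collection, r0=1, beta=0.05))

Longer documents tend to shrink the normalized margins in the denominator, which is what drives the sufficient $k$ up; the empirical studies below probe how loose this worst-case prescription is.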
2.1.2 Application to Boolean Inner Product

Boolean inner product is a retrieval function in which $d, q \in \{0, 1\}^v$ over a vocabulary of size $v$, with $d_i$ indicating the presence of term $i$ in the document (and analogously for $q_i$). The relevance score $\langle q, d \rangle$ is then the number of terms that appear in both $q$ and $d$. For this simple retrieval function, it is possible to compute an embedding size that guarantees a desired pairwise error probability.

Figure 2: Minimum k sufficient for Rademacher embeddings to approximate BM25 pairwise rankings on TREC-CAR with error rate β < .05.
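The kind of measurement summarized in Figure 2 can be sketched as follows (our illustration, not the authors' code; BM25 is replaced here by the Boolean inner product above, and all sizes are toy values): estimate the pairwise error rate of Rademacher or Gaussian projections at increasing $k$, and report the smallest $k$ that stays below a target error rate.

    import numpy as np

    def pairwise_error_rate(q, d1, d2, k, n_trials=200, dist="rademacher", seed=0):
        # Estimate Pr(<Aq, Ad1> <= <Aq, Ad2>) over random projections A in R^{k x v}.
        rng = np.random.default_rng(seed)
        v = q.shape[0]
        errors = 0
        for _ in range(n_trials):
            if dist == "rademacher":
                A = rng.choice([-1.0, 1.0], size=(k, v)) / np.sqrt(k)
            else:
                A = rng.normal(0.0, 1.0 / np.sqrt(k), size=(k, v))  # variance 1/k
            errors += np.dot(A @ q, A @ d1) <= np.dot(A @ q, A @ d2)
        return errors / n_trials

    def empirical_min_k(q, d1, d2, beta=0.05, k_grid=(16, 32, 64, 128, 256)):
        # Smallest k on a grid whose estimated error rate is below beta:
        # a Monte Carlo stand-in for k*(q, d1, d2) in the discussion above.
        for k in k_grid:
            if pairwise_error_rate(q, d1, d2, k) <= beta:
                return k
        return None

    # Boolean vectors: d1 contains every query term, d2 only about half of them.
    rng = np.random.default_rng(1)
    v = 1000
    q = (rng.random(v) < 0.01).astype(float)
    d1 = np.clip(q + (rng.random(v) < 0.05), 0.0, 1.0)
    d2 = np.clip(q * (rng.random(v) < 0.5) + (rng.random(v) < 0.05), 0.0, 1.0)
    print(empirical_min_k(q, d1, d2))

With only a few hundred Monte Carlo trials per $k$ the estimated error rate is itself noisy, so a real version of this experiment would use far more trials (and real BM25-weighted vectors).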
Figure 3: Random projection on BM25 retrieval on the TREC-CAR dataset, with documents binned by length.
interest. Lacking theoretical tools to settle these questions, we present a set of empirical investigations in later sections of this paper. But first we explore a lightweight modification to the dual encoder, which offers gains in expressivity at limited additional computational cost.

3 Multi-Vector Encodings

The theoretical analysis suggests that fixed-length vector representations of documents may in general need to be large for long documents, if fidelity with respect to sparse high-dimensional representations is important.

than a dual encoder that uses a single vector of size mk. This efficiency is a key difference from the POLY-ENCODER (Humeau et al., 2020), which computes a fixed number of vectors per query, and aggregates them by softmax attention against document vectors. Yang et al. (2018b) propose a similar architecture for language modeling. Because of the use of softmax in these approaches, it is not possible to decompose the relevance score into a max over inner products, and so fast nearest-neighbor search cannot be applied. In addition, these works did not address retrieval from a large document collection.
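To make the contrast concrete, here is a minimal sketch (ours; shapes, names, and the flat-index bookkeeping are illustrative) of a relevance score that decomposes into a max over inner products, which is what lets every one of a document's m vectors be placed in a single MIPS index, unlike softmax aggregation:

    import numpy as np

    def de_score(q_vec, doc_vec):
        # Single-vector dual encoder: one inner product per document.
        return float(np.dot(q_vec, doc_vec))

    def me_score(q_vec, doc_vecs):
        # Multi-vector encoder: the document is represented by m vectors (m x k);
        # relevance is the maximum inner product over the m vectors.
        return float(np.max(doc_vecs @ q_vec))

    def me_retrieve(q_vec, flat_index, vec_to_doc, top_n=1000):
        # All m vectors of every document live in one flat index; a document's score
        # is the best score among its own vectors in the candidate list.
        scores = flat_index @ q_vec                  # inner product with every stored vector
        order = np.argsort(-scores)[:top_n]
        best = {}
        for idx in order:
            doc_id = int(vec_to_doc[idx])
            if doc_id not in best:                   # first (highest-scoring) hit per document
                best[doc_id] = float(scores[idx])
        return sorted(best.items(), key=lambda kv: -kv[1])

    # Toy example: 3 documents, m = 2 vectors each, k = 4.
    rng = np.random.default_rng(0)
    doc_vecs = rng.normal(size=(3, 2, 4))
    flat_index = doc_vecs.reshape(-1, 4)             # 6 vectors in one index
    vec_to_doc = np.repeat(np.arange(3), 2)
    q_vec = rng.normal(size=4)
    print(me_retrieve(q_vec, flat_index, vec_to_doc, top_n=6))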
an increase in the normalized margin enables the use of a smaller encoding dimension $k$ while still supporting the same pairwise error rate. There are now $m$ times more "documents" to evaluate, but Lemma 2 shows that this exerts only a logarithmic increase on the encoding size for a desired recall@$r$. But while we hope this argument is illuminating, the assumptions of orthogonal segments and perfect segment match against the query are quite strong. We must therefore rely on empirical analysis to validate the efficacy of multi-vector encoding in realistic applications.

Cross-Attention. Cross-attentional architectures

First, trained non-linear dual encoders might be able to detect precise word overlap with much lower-dimensional encodings, especially for queries and documents with a natural distribution, which may exhibit a low-dimensional subspace structure. Second, the semantic generalization aspect of the IR task may be more important than the first aspect for practical applications, and our theory does not make predictions about how encoder dimensionality relates to such ability to compute general semantic similarity.

We relate the theoretical analysis to text retrieval in practice through experimental studies on three tasks. The first task, described in
Dual Encoders from BERT (DE-BERT). We encode queries and documents using BERT-base, which is a pre-trained transformer network (12 layers, 768 dimensions) (Devlin et al., 2019). We implement dual encoders from BERT as a special case of the multi-vector model formalized in §3, with the number of vectors for the document $m = 1$: the representations for queries and documents are the top-layer representations at the [CLS] token. This approach is widely used for retrieval (Lee et al., 2019; Reimers and Gurevych, 2019; Humeau et al., 2020; Xiong et al., 2020).^4 For lower-dimensional encodings, we learn down-projections from $d = 768$ to $k \in$

we linearly combine a sparse and dense system's scores using a single trainable weight $\lambda$, tuned on a development set. For example, a hybrid model of ME-BERT and BM25-uni is referred to as HYBRID-ME-BERT-uni. We implement approximate search to retrieve using a linear combination of two systems by re-ranking the n-best top-scoring candidates from each system. Prior and concurrent work has also used hybrid sparse-dense models (Guo et al., 2016a; Seo et al., 2019; Karpukhin et al., 2020; Ma et al., 2020; Gao et al., 2020). Our contribution is to assess the impact of sparse-dense hybrids as the document length grows.
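A minimal sketch of this kind of sparse-dense combination (ours, not the authors' code; the exact form of the interpolation and the scoring callables are assumptions):

    import numpy as np

    def hybrid_rerank(query, sparse_candidates, dense_candidates,
                      bm25_score, dense_score, lam):
        # Pool the n-best candidates from each first-stage system, then re-rank
        # with a single interpolation weight lam (tuned on a development set).
        # bm25_score / dense_score: callables mapping (query, doc_id) -> float.
        pooled = set(sparse_candidates) | set(dense_candidates)
        scored = [(doc, dense_score(query, doc) + lam * bm25_score(query, doc))
                  for doc in pooled]
        return sorted(scored, key=lambda pair: -pair[1])

    def tune_lambda(dev_queries, evaluate, grid=np.linspace(0.0, 2.0, 21)):
        # Pick the weight maximizing a development-set metric such as MRR@10;
        # evaluate(lam, dev_queries) is assumed to run retrieval and return the metric.
        return max(grid, key=lambda lam: evaluate(lam, dev_queries))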
Figure 4: Results on the containing-passage ICT task as maximum passage length varies (50 to 400 tokens). Left: reranking 200 candidates; right: retrieval from 3 million candidates. Exact numbers are given in Table A.1.
These form the document collection D. Queries $x_i$ are generated by sampling sub-sequences from the

Figure 4 (right) shows results for the much more challenging task of retrieval from 3 million candidates.
Figure 5: Results on NQ passage recall as maximum passage length varies (50 to 400 tokens). Left: reranking of 200 passages; right: open-domain retrieval over all of (English) Wikipedia. Exact numbers are given in Table A.1.
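The ICT queries referenced above are sub-sequences sampled from the passages themselves; a minimal sketch of that sampling step (ours; the length range and whitespace tokenization are illustrative assumptions):

    import random

    def sample_ict_query(passage_tokens, min_len=5, max_len=25, rng=None):
        # Draw a contiguous sub-sequence of the passage to act as a pseudo-query,
        # keeping the full passage in the collection as the retrieval target.
        rng = rng or random.Random(0)
        n = len(passage_tokens)
        length = rng.randint(min_len, min(max_len, n))
        start = rng.randint(0, n - length)
        return passage_tokens[start:start + length]

    passage = "the quick brown fox jumps over the lazy dog near the river bank".split()
    print(sample_ict_query(passage, min_len=3, max_len=6))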
                 MS-Passage   MS-Doc
Model               MRR         MRR
Retrieval
  BM25              0.167       0.249
  BM25-E            0.184       0.209
  DOC2QUERY         0.215       -
  DOCT5QUERY        0.278       -
  DEEPCT            0.243       -
  HDCT              -           0.300
  DE-BERT           0.302       0.288
  ME-BERT           0.334       0.333
  DE-HYBRID         0.304       0.313
  DE-HYBRID-E       0.309       0.315
  ME-HYBRID         0.338       0.346
  ME-HYBRID-E       0.343       0.339
Reranking
  MULTI-STAGE       0.390       -
  IDST              0.408       -
  Leaderboard       0.439       -
  DE-BERT           0.391       0.339
  ME-BERT           0.395       0.353

part of the table, where more expensive second-stage models are employed to re-rank candidates. Figure 6 delves into the impact of the first-stage retrieval systems as the number of candidates the second-stage reranker has access to is substantially reduced, improving efficiency.

We report results in comparison to the following systems: 1) MULTI-STAGE (Nogueira and Lin, 2019), which reranks BM25 candidates with a cascade of BERT models, 2) DOC2QUERY (Nogueira et al., 2019b) and DOCT5QUERY (Nogueira and Lin, 2019), which use neural models to expand documents before indexing and scoring with sparse retrieval models, 3) DEEPCT (Dai and Callan,
Figure 6: MRR@10 when reranking at different retrieval depths (10 to 1000 candidates) for MS MARCO.
Table 2 shows test results for dense models, external sparse model baselines, and hybrids of the two (without reranking). In addition to test set (eval) results on the MS MARCO passage task, we report metrics on the manually annotated passage and document retrieval test set at TREC DL 2019. We report the fraction of unrated items as Holes@10, following Xiong et al. (2020).

Time and Space Analysis. Figure 7 compares the running time/quality trade-off curves for DE-BERT and ME-BERT on the MS MARCO passage task, using the ScaNN (Guo et al., 2020) library on a machine with 160 Intel(R) Xeon(R) CPU cores @ 2.20GHz and 1.88TB of memory. Both models use one vector of size k = 1024 per query; DE-BERT uses one and ME-BERT uses 3 vectors of size k = 1024 per document. The size of the document index for DE-BERT is 34.2GB, and the size of the index for ME-BERT is about 3 times larger. The indexing time was 1.52h and 3.02h for DE-BERT and ME-BERT, respectively. The ScaNN configuration we use is num_leaves=5000; num_leaves_to_search ranges from 25 to 2000 (from less to more exact search), and time per query is measured using parallel inference on all 160 cores. In the higher-quality range of the curves, ME-BERT achieves substantially higher MRR than DE-BERT for the same inference time per query.
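For reference, a configuration like the one above maps onto ScaNN's Python builder API roughly as follows (a sketch based on the library's public README, not the authors' code; the file names, quantization, and reordering settings are illustrative assumptions):

    import numpy as np
    import scann

    # One k = 1024 vector per passage for DE-BERT; for ME-BERT, all 3 vectors per
    # passage are flattened into a single index (hypothetical file names).
    doc_vectors = np.load("doc_vectors.npy")
    query_vectors = np.load("query_vectors.npy")

    searcher = (
        scann.scann_ops_pybind.builder(doc_vectors, 10, "dot_product")
        .tree(num_leaves=5000,              # as in the configuration described above
              num_leaves_to_search=100,     # swept from 25 to 2000 for the trade-off curve
              training_sample_size=250000)
        .score_ah(2, anisotropic_quantization_threshold=0.2)  # illustrative quantization settings
        .reorder(100)
        .build()
    )

    neighbors, scores = searcher.search_batched(query_vectors)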
8 Related Work

We have mentioned research on improving the accuracy of retrieval models throughout the paper. Here we focus on work related to our central question: the capacity of dense dual encoder representations relative to sparse bags-of-words.

In compressive sensing it is possible to recover a bag-of-words vector x from the projection Ax for suitable A. Bounds for the sufficient dimensionality of isotropic Gaussian projections (Candes and Tao, 2005; Arora et al., 2018) are more pessimistic than the bound described in §2, but this is unsurprising because the task of recovering bags-of-words from a compressed measurement is strictly harder than recovering inner products.

Subramani et al. (2019) ask whether it is possible to exactly recover sentences (token sequences) from pretrained decoders, using vector embeddings that are added as a bias to the decoder hidden state. Because their decoding model is more expressive (and thus more computationally intensive) than inner product retrieval, the theoretical
issues examined here do not apply. Nonetheless, Subramani et al. (2019) empirically observe a similar dependence between sentence length and embedding size. Wieting and Kiela (2019) represent sentences as bags of random projections, finding that high-dimensional projections (k = 4096) perform nearly as well as trained encoding models. These results provide further empirical support for the hypothesis that bag-of-words vectors from real text are "hard to embed" in the sense of Larsen and Nelson (2017). Our contribution is to systematically explore the relationship between document length and encoding dimension, focusing on the case of exact inner product search.

Acknowledgments

We thank Ming-Wei Chang, Jon Clark, William Cohen, Kelvin Guu, Sanjiv Kumar, Kenton Lee, Jimmy Lin, Ankur Parikh, Ice Pasupat, Iulia Turc, William A. Woods, Vincent Zhao, and the anonymous reviewers for helpful discussions of this work.

References

Dimitris Achlioptas. 2003. Database-friendly random projections: Johnson-Lindenstrauss with binary coins. Journal of Computer and System Sciences, 66(4):671-687. DOI: https://doi.org/10.1016/S0022-0000(03)00025-4

Noga Alon and Bo'az Klartag. 2017. Optimal compression of approximate inner products and dimension reduction. In 58th Annual Symposium on Foundations of Computer Science (FOCS). DOI: https://doi.org/10.1109/FOCS.2017.65

Alexandr Andoni, Piotr Indyk, and Ilya Razenshteyn. 2019. Approximate nearest neighbor search in high dimensions. Proceedings of the International Congress of Mathematicians (ICM 2018).

Zhuyun Dai and Jamie Callan. 2020a. Context-aware document term weighting for ad-hoc search. In Proceedings of The Web Conference 2020. DOI: https://doi.org/10.1145/3366423.3380258

Zhuyun Dai and Jamie Callan. 2020b. Context-aware sentence/passage term importance estimation for first stage retrieval. Proceedings of the ACM SIGIR International Conference on Theory of Information Retrieval.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training
of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.

Laura Dietz, Ben Gamari, Jeff Dalton, and Nick Craswell. 2018. TREC complex answer retrieval overview. In Text REtrieval Conference (TREC).

Luyu Gao, Zhuyun Dai, Zhen Fan, and Jamie Callan. 2020. Complementing lexical retrieval with semantic residual embedding. CoRR.

Daniel Gillick, Sayali Kulkarni, Larry Lansing, Alessandro Presta, Jason Baldridge, Eugene Ie, and Diego Garcia-Olano. 2019. Learning dense representations for entity retrieval. In Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL). DOI: https://doi.org/10.18653/v1/K19-1049

Jiafeng Guo, Yixing Fan, Qingyao Ai, and W. Bruce Croft. 2016a. A deep relevance matching model for ad-hoc retrieval. In Proceedings of the 25th ACM International on Conference on Information and Knowledge Management.

Ruiqi Guo, Sanjiv Kumar, Krzysztof Choromanski, and David Simcha. 2016b. Quantization based fast inner product search. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS).

Ruiqi Guo, Philip Sun, Erik Lindgren, Quan Geng, David Simcha, Felix Chern, and Sanjiv Kumar. 2020. Accelerating large-scale inference with anisotropic vector quantization. In Proceedings of the 37th International Conference on Machine Learning.

Po-Sen Huang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Acero, and Larry Heck. 2013. Learning deep structured semantic models for web search using clickthrough data. In Proceedings of the International Conference on Information and Knowledge Management (CIKM). DOI: https://doi.org/10.1145/2505515.2505665

Samuel Humeau, Kurt Shuster, Marie-Anne Lachaux, and Jason Weston. 2020. Poly-encoders: Transformer architectures and pre-training strategies for fast and accurate multi-sentence scoring. In Proceedings of the International Conference on Learning Representations (ICLR).

Thathachar S. Jayram and David P. Woodruff. 2013. Optimal bounds for Johnson-Lindenstrauss transforms and streaming problems with subconstant error. ACM Transactions on Algorithms (TALG), 9(3):1-17. DOI: https://doi.org/10.1145

William B. Johnson and Joram Lindenstrauss. 1984. Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics, 26(189-206):1.

Daniel Kane, Raghu Meka, and Jelani Nelson. 2011. Almost optimal explicit Johnson-Lindenstrauss families. Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques.

Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. 2020. Dense passage retrieval for open-domain question answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). DOI: https://doi.org/10.18653/v1/2020.emnlp-main.550

Omar Khattab and Matei Zaharia. 2020. ColBERT: Efficient and effective passage search via contextualized late interaction over BERT. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval. DOI: https://doi.org/10.1145/3397271.3401075

Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, Kristina Toutanova, Llion Jones, Matthew Kelcey, Ming-Wei Chang, Andrew M. Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. 2019. Natural questions: A benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:453-466. DOI: https://doi.org/10.1162/tacl_a_00276

Cody Kwok, Oren Etzioni, and Daniel S. Weld. 2001. Scaling question answering to the web. ACM Transactions on Information Systems (TOIS), 19(3):242-262. DOI: https://doi.org/10.1145/502115.502117

Kasper Green Larsen and Jelani Nelson. 2017. Optimality of the Johnson-Lindenstrauss lemma. In 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS). DOI: https://doi.org/10.1109/FOCS

Rodrigo Nogueira, Wei Yang, Kyunghyun Cho, and Jimmy Lin. 2019a. Multi-stage document ranking with BERT. CoRR, abs/1910.14424.

Rodrigo Nogueira, Wei Yang, Jimmy Lin, and Kyunghyun Cho. 2019b. Document expansion by query prediction. CoRR, abs/1904.08375.

Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of Empirical Methods in Natural Language Processing (EMNLP). DOI: https://doi.org/10.18653/v1/D19-1410
Lee Xiong, Chenyan Xiong, Ye Li, Kwok-Fung Tang, Jialin Liu, Paul Bennett, Junaid Ahmed, and Arnold Overwijk. 2020. Approximate nearest neighbor negative contrastive learning for dense text retrieval. CoRR, abs/2007.00808. Version 1.

Ming Yan, Chenliang Li, Chen Wu, Bin Bi, Wei Wang, Jiangnan Xia, and Luo Si. 2020. IDST at TREC 2019 deep learning track: Deep cascade ranking with generation-based document expansion and pre-trained language modeling. In Text REtrieval Conference (TREC).

Let $\bar{q} = q/\|q\|$ and $\bar{d} = (d_1 - d_2)/\|d_1 - d_2\|$. Then $\mu(q, d_1, d_2) = \langle \bar{q}, \bar{d} \rangle$. A ranking error occurs if and only if $\langle A\bar{q}, A\bar{d} \rangle \leq 0$, which implies $|\langle A\bar{q}, A\bar{d} \rangle - \langle \bar{q}, \bar{d} \rangle| \geq \epsilon$. By construction $\|\bar{q}\| = \|\bar{d}\| = 1$, so the probability of an inner product distortion $\geq \epsilon$ is bounded by the right-hand side of (5).

A.2 Corollary 1

Proof. We have $\epsilon = \mu(q, d_1, d_2) = \langle \bar{q}, \bar{d} \rangle \leq 1$ by the Cauchy-Schwarz inequality. For $\epsilon \leq 1$, we have $\epsilon^2/6 \leq \epsilon^2/2 - \epsilon^3/3$. We can then loosen the bound in (1) to $\beta \leq 4 \exp(-\frac{k}{2} \cdot \frac{\epsilon^2}{6})$. Taking logarithms and rearranging yields the condition in the corollary.
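Spelled out, the omitted algebra is just (our reconstruction, assuming the loosened bound above):
\[
4\exp\left(-\frac{k\epsilon^{2}}{12}\right) \leq \beta
\iff \frac{k\epsilon^{2}}{12} \geq \ln\frac{4}{\beta}
\iff k \geq 12\,\epsilon^{-2}\ln\frac{4}{\beta}.
\]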
Model Reranking Retrieval
Passage length 50 100 200 400 50 100 200 400
ICT task (MRR@10)
CROSS-ATTENTION 99.9 99.9 99.8 99.6 - - - -
HYBRID-ME-BERT-uni - - - - 98.2 97.0 94.4 91.9
HYBRID-ME-BERT-bi - - - - 99.3 99.0 97.3 96.1
ME-BERT-768 98.0 96.7 92.4 89.8 96.8 96.1 91.1 85.2
ME-BERT-64 96.3 94.2 89.0 83.7 92.9 91.7 84.6 72.8
DE-BERT-768 91.7 87.8 79.7 74.1 90.2 85.6 72.9 63.0
DE-BERT-512 91.4 87.2 78.9 73.1 89.4 81.5 66.8 55.8
DE-BERT-128 90.5 85.0 75.0 68.1 85.7 75.4 58.0 47.3
DE-BERT-64 88.8 82.0 70.7 63.8 82.8 68.9 48.5 38.3
DE-BERT-32 83.6 74.9 62.6 55.9 70.1 53.2 34.0 27.6
BM25-uni 92.1 88.6 84.6 81.8 92.1 88.6 84.6 81.8
Table A.1: Results on the ICT task and NQ task (corresponding to Figure 4 and Figure 5).
A.5 Theorem 1

Proof. Recall that $\mu(q, d_1, d_2) = \frac{\langle q, d_1 - d_2 \rangle}{\|q\| \times \|d_1 - d_2\|}$. By assumption we have $\langle q, d_1^{(i)} \rangle = \langle q, d_1 \rangle$ and $\max_j \langle q, d_2^{(j)} \rangle \leq \langle q, d_2 \rangle$, implying that
\[ \langle q, d_1^{(i)} - d_2^{(i)} \rangle \geq \langle q, d_1 - d_2 \rangle. \quad (6) \]
In the denominator, we expand $\|d_1 - d_2\| = \|(d_1^{(i)} - d_2^{(i)}) + (d_1^{(\neg i)} - d_2^{(\neg i)})\|$, where $d^{(\neg i)} = \sum_{j \neq i} d^{(j)}$. Plugging this into the squared norm,
\[
\begin{aligned}
\|d_1 - d_2\|^2 &= \|(d_1^{(i)} - d_2^{(i)}) + (d_1^{(\neg i)} - d_2^{(\neg i)})\|^2 && (7) \\
&= \|d_1^{(i)} - d_2^{(i)}\|^2 + \|d_1^{(\neg i)} - d_2^{(\neg i)}\|^2 + 2\,\langle d_1^{(i)} - d_2^{(i)},\, d_1^{(\neg i)} - d_2^{(\neg i)} \rangle && (8) \\
&= \|d_1^{(i)} - d_2^{(i)}\|^2 + \|d_1^{(\neg i)} - d_2^{(\neg i)}\|^2 && (9) \\
&\geq \|d_1^{(i)} - d_2^{(i)}\|^2. && (10)
\end{aligned}
\]
The inner product $\langle d_1^{(i)} - d_2^{(i)}, d_1^{(\neg i)} - d_2^{(\neg i)} \rangle = 0$ because the segments are orthogonal. The combination of (6) and (10) completes the theorem.
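To spell out that last step (our reconstruction; the theorem statement itself is not included in this excerpt, but the claim being established is that the segment-level normalized margin is at least the document-level one): since $\langle q, d_1 - d_2 \rangle \geq 0$, inequality (6) can only increase the numerator and (10) can only decrease the denominator, so
\[
\mu\big(q, d_1^{(i)}, d_2^{(i)}\big)
= \frac{\langle q,\, d_1^{(i)} - d_2^{(i)} \rangle}{\|q\|\,\|d_1^{(i)} - d_2^{(i)}\|}
\;\geq\; \frac{\langle q,\, d_1 - d_2 \rangle}{\|q\|\,\|d_1 - d_2\|}
= \mu(q, d_1, d_2).
\]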