Towards General Text Embeddings with Multi-stage Contrastive Learning

Zehan Li, Xin Zhang, Yanzhao Zhang, Dingkun Long, Pengjun Xie, Meishan Zhang
Alibaba Group
{lizehan.lzh,linzhang.zx,zhangyanzhao.zyz,dingkun.ldk,pengjun.xpj}@alibaba-inc.com

Abstract

We present GTE, a general-purpose text embedding model trained with multi-stage contrastive learning. In line with recent advancements in unifying various NLP tasks into a single format, we train a unified text embedding model by employing contrastive learning over a diverse mixture of datasets from multiple sources. By significantly increasing the amount of training data during both the unsupervised pre-training and supervised fine-tuning stages, we achieve substantial performance gains over existing embedding models. Notably, even with a relatively modest parameter count of 110M, GTEbase outperforms the black-box embedding API provided by OpenAI and even surpasses 10x larger text embedding models on the massive text embedding benchmark. Furthermore, without additional fine-tuning on each programming language individually, our model outperforms previous best code retrievers of similar size by treating code as text. In summary, our model achieves impressive results by effectively harnessing multi-stage contrastive learning, offering a powerful and efficient text embedding model with broad applicability across various NLP and code-related tasks.1

1 The GTE model is publicly available at https://huggingface.co/thenlper/gte-large

Figure 1: Illustration of the multi-stage contrastive learning pipeline used to train our text embedding model.

1 Introduction

Text embeddings have become an indispensable component in many natural language processing tasks, such as text classification, text retrieval, question answering, and dialogue systems (Karpukhin et al., 2020; Humeau et al., 2020; Choi et al., 2021; Izacard et al., 2022a; Long et al., 2022a; Rajapakse, 2023). These embedding models represent texts using low-dimensional vectors and capture their similarity through vector operations. The emergence of recent large language models (LLMs) (Radford et al., 2018; Touvron et al., 2023; OpenAI, 2023) has generated considerable interest in retrieval-augmented systems based on text embedding models that integrate the reasoning and comprehension capabilities of LLMs (Izacard et al., 2022b; Ram et al., 2023; Shi et al., 2023). Consequently, there has been a growing focus on general text representation in both industry and academia.

The pursuit of developing a unified model to address a multitude of downstream tasks has been long-standing due to the diverse formats, domains, and downstream applications of natural language. The emergence of pre-trained language models has further opened up possibilities for training such a universal model. Nonetheless, within the realm of text representation research, previous text embedding models have primarily focused on specific tasks, and their training strategies or models, tailored to a single task, may not perform optimally in other contexts. For example, the text representation model SimCSE (Gao et al., 2021), trained on symmetric text pairs, demonstrates limitations in text retrieval tasks. Similarly, certain text representation models specifically designed for dense retrieval tasks do not exhibit robust performance in sentence textual similarity tasks. Recently, there has been a shift in research focus towards developing more comprehensive models for text representation, leveraging large quantities of unlabeled web data through unsupervised contrastive pre-training, coupled with task-specific data, prompts, or instructions to mitigate task conflicts during fine-tuning (Ni et al., 2022a,b; Neelakantan et al., 2022;
Wang et al., 2022b; Su et al., 2023). Additionally, the introduction of benchmarks, such as the Massive Text Embedding Benchmark (MTEB) (Muennighoff et al., 2023), has established a robust basis for assessing the universality of text representation models. However, a significant limitation in existing research is the reliance on in-house data for pre-training, creating a bottleneck in the utilization of pre-trained model weights or APIs. Furthermore, the formulation of prompts specifically tailored for each task requires extra human effort during implementation (Su et al., 2023).

This work presents a straightforward approach to constructing a general text embedding (GTE) model solely using contrastive learning on open-source data, as illustrated in Figure 1. Specifically, we first gather a large-scale dataset comprising unsupervised text pairs extracted from various data sources for contrastive pre-training. Surprisingly, our model, pre-trained on this dataset, exhibits remarkable performance, surpassing BM25 and the E5 model (Wang et al., 2022b) in zero-shot text retrieval tasks and surpassing many supervised models on the MTEB benchmark. To further enhance the quality of the learned text representations, we obtain high-quality text pairs with human labels from multiple sources for contrastive fine-tuning. After supervised fine-tuning, our 110M BERT-based (Devlin et al., 2019) model already outperforms the current commercial embedding API of OpenAI and ranks highly on the MTEB benchmark. Furthermore, since our model is trained on code data as well, we evaluate its code search capabilities on the CodeSearchNet benchmark, which encompasses six programming languages. Notably, even without language-specific fine-tuning on each subset, our model significantly outperforms state-of-the-art code retrievers of similar size that have been fine-tuned for each programming language.

In the rest of this paper, we provide a detailed account of the data sources and training configurations employed. Subsequently, we present the evaluation results on widely recognized text embedding benchmarks and compare them with the performance of previous state-of-the-art baselines that were specifically optimized for each individual task. Our model consistently demonstrates superior performance or, at the very least, comparable results to those achieved by larger models, owing to its incorporation of a more diverse mixture of training datasets. We aspire for our model to serve as a robust baseline for the research community investigating text and code embedding.

2 Related Work

Text embeddings serve as low-dimensional vector representations for texts of varying lengths and are essential in numerous natural language processing (NLP) tasks. In contrast to high-dimensional and sparse representations such as TF-IDF, dense text embeddings possess the capacity to address the lexical mismatch problem and enhance the efficiency of text retrieval and matching.

Pre-trained language models, exemplified by BERT (Devlin et al., 2019) and GPT (Radford et al., 2018), have demonstrated remarkable success across various NLP tasks. Nonetheless, extracting a high-quality sentence embedding from pre-trained language models poses a significant challenge due to the anisotropic embedding space resulting from the masked language modeling objective. To address this issue, subsequent studies have proposed different approaches, including supervised fine-tuning (Reimers and Gurevych, 2019), normalizing flow (Li et al., 2020), whitening (Su et al., 2021), and unsupervised contrastive learning (Gao et al., 2021). These investigations primarily concentrate on enhancing performance in semantic textual similarity tasks, wherein two sentences exhibit similar formats.

Another line of research focuses on the text retrieval problem, where the query and document typically exhibit an asymmetric relationship. In this context, the dual-encoder architecture necessitates training with both positive and negative pairs. Lee et al. (2019) propose the Inverse Cloze Task (ICT) as a self-supervised pre-training approach for producing a dense retriever. The ICT method involves cropping a random sentence from a passage to construct pseudo query-document pairs. Additionally, Chang et al. (2020) leverage the link structure within Wikipedia to introduce further supervision signals in the pre-training data. In a similar vein, REALM (Guu et al., 2020) proposes a joint training approach, wherein a dense retriever and a language model are trained concurrently. The learning signal for the language model is derived from masked language modeling, with backpropagation incorporated through the retrieval step. Recent advancements, such as Contriever (Izacard et al., 2022a) and coCondenser (Gao and Callan,
2022), have demonstrated that constructing positive pairs through random passage cropping yields superior results compared to the ICT task. Building upon the ideas presented in Chang et al. (2020), some researchers have also put forth methods for constructing higher-quality positive pairs using the web link topology for retriever pre-training (Zhou et al., 2022), a technique that proves effective in zero-shot scenarios. Furthermore, in the field of dense retrieval, significant research is dedicated to enhancing the text representation capabilities of pre-trained language models through the design of auxiliary pre-training tasks (Gao and Callan, 2021; Xiao et al., 2022; Gao and Callan, 2022; Wang et al., 2022a; Long et al., 2022b; Li et al., 2023).

The previous two lines of research can be generalized as learning a vector representation for a piece of text and are distinguished by the type of downstream tasks. Recently, several studies have explored the construction of unified text representation models through large-scale contrastive learning and prompt-based learning (Neelakantan et al., 2022; Wang et al., 2022b; Su et al., 2023). Additionally, some research efforts have focused on constructing evaluation datasets to better assess the stability of text representation models across different tasks and domains. BEIR (Benchmarking IR) (Thakur et al., 2021) collects a substantial number of retrieval tasks from various domains to evaluate the robustness of dense retriever models in zero-shot scenarios. Meanwhile, MTEB (Massive Text Embedding Benchmark) (Muennighoff et al., 2023) benchmarks over 56 datasets spanning seven categories, providing a comprehensive evaluation of text embedding models.

This study aims to develop a general text embedding model through a multi-stage training approach. In the initial stage of unsupervised contrastive learning, we generate weakly supervised correlated text pairs using publicly available data from various sources. Unlike previous work (Wang et al., 2022b), we exclusively utilize open-source data and do not employ any filtering or cleaning methods. Pre-training on large-scale text pairs can effectively improve the domain generalization of text representation models and bridge the gap between the MLM training objective and the contrastive learning objective of representation models, making the language model more suitable for text representation tasks. In the supervised fine-tuning stage, the mixture of training data in our approach is more varied to further enhance the model's versatility. Moreover, our model does not incorporate task-specific prompts, which enhances reproducibility and ease of use.

3 Approach

The training process of our model consists of two stages: unsupervised pre-training and supervised fine-tuning. Both stages employ a contrastive learning objective. We first introduce the basic framework of the model. Subsequently, we discuss the sources and construction methods of the training data in the two stages. Finally, we present the optimization strategies used to enhance the model's performance during training.

3.1 Model Architecture

The backbone of our embedding model is a deep Transformer encoder (Vaswani et al., 2017), which can be initialized with pre-trained language models such as BERT (Devlin et al., 2019). Our model follows the vanilla dual-encoder architecture with mean pooling on top of the contextualized token representations produced by the language model. Formally, given a piece of text x = (x_1, ..., x_n) consisting of n tokens, an embedding model E converts the text into a low-dimensional dense vector x = E(x) ∈ R^d. To implement E, we first employ a language model to get the deep contextualized token representations

\[ \mathbf{h} = \mathrm{LM}(x) \in \mathbb{R}^{n \times d}. \tag{1} \]

Then we apply lightweight mean pooling across the first dimension to get the text representation,

\[ \mathbf{x} = \frac{1}{n}\sum_{i=1}^{n} \mathbf{h}_i \in \mathbb{R}^{d}. \tag{2} \]

The text representations are learned through a contrastive objective that distinguishes semantically relevant text pairs from irrelevant ones. Such a training procedure requires positive and negative pairs, taking the format (q, d^+, d^-). For a query q, a relevant document d^+, and a set of irrelevant documents D^- = {d^-_1, ..., d^-_n}, one popular contrastive objective is the InfoNCE loss (van den Oord et al., 2018),

\[ \mathcal{L}_{\mathrm{cl}} = -\log \frac{e^{s(q, d^{+})/\tau}}{e^{s(q, d^{+})/\tau} + \sum_{i=1}^{n} e^{s(q, d_{i}^{-})/\tau}}, \tag{3} \]

where s(q, d) estimates the similarity between the two pieces of text q and d via the vector distance between q = E(q) and d = E(d).
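To make Section 3.1 concrete, the following is a minimal sketch, not the released training code, of the mean-pooled dual encoder in Eqs. (1)-(2) and the similarity function s(q, d). Loading the checkpoint named in footnote 1 through the Hugging Face transformers library is an assumption for illustration, and masking out padding tokens in the mean pooling is a standard implementation detail not spelled out in the text; any BERT-style encoder could be substituted.

```python
# A minimal sketch of the mean-pooled dual encoder (Eqs. 1-2) and cosine
# similarity s(q, d). The checkpoint name comes from footnote 1; using the
# transformers library here is an illustrative assumption, not the paper's code.
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("thenlper/gte-large")
model = AutoModel.from_pretrained("thenlper/gte-large")

def embed(texts, max_length=128):
    # Tokenize a list of texts and run the Transformer encoder (Eq. 1).
    batch = tokenizer(texts, padding=True, truncation=True,
                      max_length=max_length, return_tensors="pt")
    with torch.no_grad():
        h = model(**batch).last_hidden_state          # (batch, n, d)
    # Mean pooling over non-padding tokens (Eq. 2).
    mask = batch["attention_mask"].unsqueeze(-1).float()
    return (h * mask).sum(dim=1) / mask.sum(dim=1)    # (batch, d)

queries = embed(["how do text embeddings work"])
docs = embed(["Text embeddings map text to low-dimensional dense vectors."])
# Cosine similarity between L2-normalized embeddings.
scores = F.normalize(queries, dim=-1) @ F.normalize(docs, dim=-1).T
print(scores)
```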
To acquire text embeddings of superior quality that can be applied across a wide range of scenarios, we compile an extensive text pair dataset from multiple formats and domains. The model is then trained on this dataset with an improved contrastive loss in a multi-stage fashion.

3.2 Unsupervised Pre-training Data

Weakly supervised text relevance data is readily available in publicly accessible web sources, such as the inherent connection between queries and answers on QA forums. These data can be extensively collected without the need for manual annotation, thereby efficiently aiding the training of text representation models. Inspired by previous work (Ni et al., 2022a,b; Neelakantan et al., 2022; Wang et al., 2022b), our model is initially pre-trained on naturally occurring text pairs extracted from diverse sources. To ensure the versatility of the embedding model, we explore a range of resources for text pair extraction, including web pages (e.g., CommonCrawl, ClueWeb), scientific papers (e.g., arXiv, SemanticScholar), community QA forums (e.g., StackExchange), social media (e.g., Reddit), knowledge bases (e.g., Wikipedia, DBPedia), and code repositories (e.g., StackOverflow, GitHub). Additionally, we harness the presence of hyperlinks in certain datasets to facilitate text pair extraction. Table 2 shows examples of the text pair formats from different sources. Further details regarding the data collection process can be found in Appendix A. In total, we utilized ~800M text pairs for the unsupervised pre-training stage. Simple statistics and data distributions are illustrated in Table 1.

Source          Datasets  Prop.   Size
Web Page        3         18.7%   147M
Academic Paper  5         5.7%    45M
Hyperlink       4         13.4%   106M
Social Media    2         41.5%   327M
Knowledge Base  2         4.8%    38M
Community QA    7         1.5%    12M
News            5         0.4%    3M
Code            2         2.5%    20M
Others          3         11.6%   91M
Total           33        100%    788M

Table 1: Statistics of pre-training data.

3.3 Supervised Fine-tuning Data

In the supervised fine-tuning stage, we use relatively smaller datasets with human annotations of the relevance between two pieces of text and optional hard negatives mined by an extra retriever to form text triples. To handle both symmetric tasks (e.g., semantic textual similarity) and asymmetric tasks (e.g., passage retrieval), we collect data from a large variety of tasks and domains, including web search (e.g., MS MARCO), open-domain QA (e.g., NQ), NLI (e.g., SNLI), fact verification (e.g., FEVER), and paraphrases (e.g., Quora). In total, we used ~3M pairs for fine-tuning, which is a combination of the training data used by previous research (Gao et al., 2021; Gao and Callan, 2022; Asai et al., 2023; Su et al., 2023; Li et al., 2023). More details can be found in Appendix A.

3.4 Training Details

Data Sampling  In the initial stage of unsupervised pre-training, data sources often differ significantly in terms of the number of training instances. To address this imbalance, we employ a multinomial distribution to sample data batches from different data sources, taking into account their respective sizes. Suppose the whole pre-training dataset D consists of m different subsets {D_1, ..., D_m} and denote the size of each subset as n_i = |D_i|; at each training iteration, the probability of sampling data from the i-th subset D_i is

\[ p_i = \frac{n_i^{\alpha}}{\sum_{j=1}^{m} n_j^{\alpha}}, \tag{4} \]

where we set α = 0.5 in this work. Furthermore, to prevent the model from solely learning task-specific shortcuts for discrimination, we ensure that all training instances within a batch originate from the same task.

Improved Contrastive Loss  When using the contrastive objective, in-batch documents are usually reused as negative candidates to improve training efficiency (Karpukhin et al., 2020). This paper uses an improved contrastive learning objective which is bidirectional and enlarges the set of negative samples with both in-batch queries and documents. This can be viewed as a combination of the loss variants proposed by Radford et al. (2021), Ren et al. (2021), and Moiseev et al. (2023).
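As an illustration of the size-weighted sampling in Eq. (4), the following minimal sketch draws the source of each training batch from the multinomial distribution with α = 0.5; the subset names and sizes are placeholders rather than the actual corpus statistics.

```python
# A sketch of the size-weighted source sampling of Eq. (4) with alpha = 0.5.
# The subset names and sizes below are illustrative, not the actual corpus.
import random

subset_sizes = {"web_page": 147_000_000, "social_media": 327_000_000,
                "community_qa": 12_000_000, "code": 20_000_000}
alpha = 0.5

weights = {name: size ** alpha for name, size in subset_sizes.items()}
total = sum(weights.values())
probs = {name: w / total for name, w in weights.items()}   # p_i in Eq. (4)

def sample_batch_source(rng=random):
    # Each training batch is drawn entirely from one source, so the model
    # cannot rely on cross-task shortcuts to separate positives from negatives.
    names, p = zip(*probs.items())
    return rng.choices(names, weights=p, k=1)[0]

print(probs)
print(sample_batch_source())
```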
Web Page (title, body)
  Query: Providence Real Estate | Providence Homes for Sale
  Doc: Founded by Roger Williams in 1636, Providence is recognized as one of the country's oldest cities. . .

Academic Paper (title, abstract)
  Query: Polymer Quantum Mechanics and its Continuum Limit
  Doc: A rather non-standard quantum representation of the canonical commutation relations of quantum mechanics. . .

Hyperlink (citation, reference)
  Query: After the championship in 1996, the PGA of America raised its stake to 50% and announced that . . .
  Doc: Pebble Beach Golf Links The largest margin of victory ever in a major championship, surpassing the 13-shot . . .

Social Media (post, comment)
  Query: Pretty sure any team with Lebron James will be a playoff contender. Considering UNC would be in the East. . .
  Doc: I was being sarcastic and making fun of the East, but honestly I was really in deep thought about this . . .

Knowledge Base (entity, description)
  Query: Animation
  Doc: Animation is the process of creating the illusion of motion and shape change by means of the rapid display of . . .

Community QA (question, answer)
  Query: How the human species evolved?
  Doc: A tough question as it overlaps science and theology. Since you asked "how the human species evolved?" I'll assume . . .

News (summary, content)
  Query: Nepalese Opposition Welcomes Return of Parliament
  Doc: Nepal's opposition alliance formally calls off weeks of pro-democracy protests after King Gyenandra reinstates . . .

Code (text, code)
  Query: SetMaxRecords sets the MaxRecords field's value.
  Doc: func (s *DescribeSnapshotCopyGrantsInput) SetMaxRecords (v int64) *DescribeSnapshotCopyGrantsInput { s.MaxRecords

Table 2: Examples of text pairs (query, doc) from different data sources used for pre-training.
Consider a batch of positive text pair samples

B = {(q_1, d_1), (q_2, d_2), ..., (q_n, d_n)}.

We use an improved contrastive loss which takes the form

\[ \mathcal{L}_{\mathrm{icl}} = -\frac{1}{n}\sum_{i=1}^{n} \log \frac{e^{s(q_i, d_i)/\tau}}{Z}, \tag{5} \]

with the partition function being

\[ Z = \sum_{j} e^{s(q_i, d_j)/\tau} + \sum_{j \neq i} e^{s(q_i, q_j)/\tau} + \sum_{j} e^{s(q_j, d_i)/\tau} + \sum_{j \neq i} e^{s(d_j, d_i)/\tau}, \tag{6} \]

in which the first two terms are used for query-to-document contrast, whereas the last two terms are used for the inverse. In this work, we use the cosine similarity as the distance metric,

\[ s(q, d) = \frac{\mathbf{q} \cdot \mathbf{d}}{\lVert \mathbf{q} \rVert_2 \cdot \lVert \mathbf{d} \rVert_2}. \tag{7} \]

The temperature τ is fixed to 0.01 in this work.

Training and Evaluation  The training of our embedding model consists of two stages. In the first stage of contrastive pre-training with only in-batch negatives, using a large batch size is crucial for better model performance: it reduces the gap between training and inference by including more negatives and provides a better approximation of the underlying learning objective. To facilitate this, we limit the maximum sequence length to 128 during pre-training and distribute the negatives across all GPUs. Popular techniques such as automatic mixed precision training (Micikevicius et al., 2018) with fp16, DeepSpeed ZeRO (Rajbhandari et al., 2020) stage 1, and gradient checkpointing (Chen et al., 2016) are also jointly used to reduce memory cost and scale the batch size up to over ten thousand. We run the pre-training for 50,000 steps, which roughly corresponds to one epoch over the whole pre-training data. We only tuned the learning rate to ensure the convergence of larger models. We employ the AdamW optimizer with linear learning rate decay and a warm-up period during the initial 5% of training steps. We conducted experiments on three distinct model scales: small, base, and large. These models were initialized from the small-sized MiniLM (Wang et al., 2020) model and the base and large BERT (Devlin et al., 2019) models, respectively. Further details can be found in Table 3.

In the second stage of contrastive fine-tuning with supervised data and hard negatives, a large batch size is unnecessary since hard negatives can already provide a reliable gradient estimation of the learning objective (Xiong et al., 2021; Li et al., 2023). Therefore, a global batch size of 128 and a train group size of 16 are utilized, with one positive example and the rest being either hard negatives or random negatives. We instead increase the maximum sequence length to 512 to better handle longer texts. The learning rate is decreased by a factor of ten during fine-tuning. The model is fine-tuned on the collected dataset for a single epoch. In-batch texts are also incorporated as negative candidates using the improved contrastive loss described in Equation 5.

After training, we directly take the last checkpoint for evaluation. We run model training on up to 8 NVIDIA A100 GPUs with 80GB memory and model evaluation on up to 8 NVIDIA Tesla V100 GPUs with 32GB memory. Models are trained with mixed precision (fp16) and evaluated with half precision (fp16) as well.
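Under the definitions above, Eqs. (5)-(6) can be sketched in PyTorch as follows. This is a simplified illustration only, assuming single-device training: the distributed sharing of negatives across GPUs, mixed precision, and the other engineering details described in this section are omitted.

```python
# A minimal PyTorch sketch of the improved contrastive loss of Eqs. (5)-(6):
# for each query q_i, the negatives include all in-batch documents d_j, the
# other in-batch queries q_j (j != i), and symmetrically for d_i.
import torch
import torch.nn.functional as F

def improved_contrastive_loss(q, d, tau=0.01):
    # q, d: (n, dim) embeddings of the n positive pairs in the batch.
    q = F.normalize(q, dim=-1)
    d = F.normalize(d, dim=-1)
    n = q.size(0)

    qd = q @ d.T / tau                 # s(q_i, d_j)/tau
    qq = q @ q.T / tau                 # s(q_i, q_j)/tau
    dq = d @ q.T / tau                 # s(q_j, d_i)/tau
    dd = d @ d.T / tau                 # s(d_j, d_i)/tau

    # Drop the self-similarity terms (j == i) from the q-q and d-d parts,
    # matching the j != i sums in Eq. (6).
    mask = torch.eye(n, dtype=torch.bool, device=q.device)
    qq = qq.masked_fill(mask, float("-inf"))
    dd = dd.masked_fill(mask, float("-inf"))

    pos = qd.diagonal()                                   # s(q_i, d_i)/tau
    # log Z of Eq. (6), accumulated in log-space for numerical stability.
    logits = torch.cat([qd, qq, dq, dd], dim=1)           # (n, 4n)
    return (torch.logsumexp(logits, dim=1) - pos).mean()  # Eq. (5)

q = torch.randn(8, 768)
d = torch.randn(8, 768)
print(improved_contrastive_loss(q, d))
```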
Model     Params  LR        GPUs  BS     Base LM
GTEsmall  30M     3 × 10−4  2     16384  microsoft/MiniLM-L12-H384-uncased
GTEbase   110M    2 × 10−4  4     16384  bert-base-uncased
GTElarge  330M    5 × 10−5  8     16384  bert-large-uncased

Table 3: Model configurations and hyperparameters used for contrastive pre-training.
Figure 2: Recall@100 of unsupervised text retrieval methods on BEIR benchmark (Thakur et al., 2021). We
compare our model GTEbase (based on BERTbase ) without using any annotated data to SimCSE (Gao et al., 2021)
(based on RoBERTalarge ), Contriever (Izacard et al., 2022a) (based on BERTbase ) and BM25. Baseline results are
borrowed from the Contriever paper (Izacard et al., 2022a) with dot product being the similarity function.
Dataset BM25 SimCSE Contriever CPT-S E5small E5base E5large GTEsmall GTEbase GTElarge
MS MARCO 22.8 9.4 20.6 19.9 25.4 26.0 26.2 31.3 31.8 31.7
Trec-Covid 65.6 26.2 27.4 52.9 52.0 61.0 61.8 61.8 64.0 64.8
NFCorpus 32.5 9.9 31.7 32.0 29.3 35.8 33.7 34.9 36.2 38.1
NQ 32.9 11.7 25.4 - 37.3 39.0 41.7 32.0 35.3 34.5
HotpotQA 60.3 19.8 48.1 51.5 46.0 52.4 52.2 49.3 50.8 49.2
FiQA 23.6 9.8 24.5 34.1 38.3 40.0 43.2 37.0 36.9 40.6
ArguAna 31.5 38.3 37.9 38.7 42.5 42.2 44.4 41.6 41.0 41.3
Touche-2020 36.7 8.9 19.3 21.0 19.9 16.9 19.8 17.7 18.2 18.5
CQADupStack 29.9 13.2 28.4 - 35.0 35.4 38.9 38.1 39.9 39.8
Quora 78.9 78.0 83.5 68.1 85.8 85.7 86.1 86.1 85.0 84.8
DBPedia 31.3 15.0 29.2 27.2 34.5 35.4 37.1 33.5 33.2 33.6
Scidocs 15.8 5.5 14.9 - 19.9 21.1 21.8 21.5 22.5 22.7
Fever 75.3 21.1 68.2 57.1 62.5 63.4 68.6 71.3 72.7 70.5
Climate-Fever 21.3 11.8 15.5 15.8 14.5 15.4 15.7 21.4 21.0 25.4
Scifact 66.5 25.7 64.9 65.4 68.5 73.7 72.3 72.7 74.1 74.1
Average 41.7 20.3 36.0 - 40.8 42.9 44.2 43.4 44.2 44.6
Table 5: nDCG@10 of different unsupervised methods on the BEIR benchmark (Thakur et al., 2021). SimCSE is
based on BERTbase backbone. CPT-S (Neelakantan et al., 2022) is of similar size to BERTlarge . Baseline results are
borrowed from the E5 paper (Wang et al., 2022b). Note that Contriever uses dot product as the similarity metric while
other models use cosine similarity.
For more details on the tasks covered in the MTEB benchmark, please refer to Appendix B.

Two settings are considered for comparison: the unsupervised setting and the supervised setting. In the unsupervised setting, models are trained using unlabeled data, while supervised models are fine-tuned using high-quality datasets with human labels. The results of strong baseline models are presented in Table 6.

In the unsupervised setting, our model outperforms the previous best model, E5, by a significant margin across all considered tasks, without the use of task-specific prompts. This improvement can be attributed to the inclusion of more training data formats and various sources of self-supervision signals. Furthermore, it is worth noting that our unsupervised pre-trained model narrows the gap even further with larger supervised baselines, such as GTR and Sentence-T5. In the supervised setting, our model surpasses the OpenAI results
Params Class. Clust. Pair. Rerank Retr. STS Summ. Avg
# of datasets → 12 11 3 4 15 10 1 56
Unsupervised models
Glove 120M 57.3 27.7 70.9 43.3 21.6 61.9 28.9 42.0
BERT 110M 61.7 30.1 56.3 43.4 10.6 54.4 29.8 38.3
SimCSE 110M 62.5 29.0 70.3 46.5 20.3 74.3 31.2 45.5
E5small 30M 67.0 41.7 78.2 53.1 40.8 68.8 25.2 54.2
E5base 110M 67.9 43.4 79.2 53.5 42.9 69.5 24.3 55.5
E5large 330M 69.0 44.3 80.3 54.4 44.2 69.9 24.8 56.4
GTEsmall 30M 71.0 44.9 82.4 57.5 43.4 77.2 30.4 58.5
GTEbase 110M 71.5 46.0 83.3 58.4 44.2 76.5 29.5 59.0
GTElarge 330M 71.8 46.4 83.3 58.8 44.6 76.3 30.1 59.3
Supervised models
SimCSE 110M 67.3 33.4 73.7 47.5 21.8 79.1 23.3 48.7
Contriever 110M 66.7 41.1 82.5 53.1 41.9 76.5 30.4 56.0
GTRlarge 330M 67.1 41.6 85.3 55.4 47.4 78.2 29.5 58.3
Sentence-T5large 330M 72.3 41.7 85.0 54.0 36.7 81.8 29.6 57.1
E5small 30M 71.7 39.5 85.1 54.5 46.0 80.9 31.4 58.9
E5base 110M 72.6 42.1 85.1 55.7 48.7 81.0 31.0 60.4
E5large 330M 73.1 43.3 85.9 56.5 50.0 82.1 31.0 61.4
InstructORbase 110M 72.6 42.1 85.1 55.7 48.8 81.0 31.0 60.4
InstructORlarge 330M 73.9 45.3 85.9 57.5 47.6 83.2 31.8 61.6
OpenAIada-001 n.a. 70.4 37.5 76.9 49.0 18.4 78.6 26.9 49.5
OpenAIada-002 n.a. 70.9 45.9 84.9 56.3 49.3 81.0 30.8 61.0
GTEsmall 30M 72.3 44.9 83.5 57.7 49.5 82.1 30.4 61.4
GTEbase 110M 73.0 46.1 84.3 58.6 51.2 82.3 30.7 62.4
GTElarge 330M 73.3 46.8 85.0 59.1 52.2 83.4 31.7 63.1
Larger models
InstructORxl 1.5B 73.1 44.7 86.6 57.3 49.3 83.1 32.3 61.8
GTRxxl 4.5B 67.4 42.4 86.1 56.7 48.5 78.4 30.6 59.0
Sentence-T5xxl 4.5B 73.4 43.7 85.1 56.4 42.2 82.6 30.1 59.5
Table 6: Results on the MTEB (Muennighoff et al., 2023) (56 datasets in English subset). Compared models include
SimCSE (Gao et al., 2021), Sentence-T5 (Ni et al., 2022a), GTR (Ni et al., 2022b), Contriever (Izacard et al., 2022a),
OpenAI text embedding API (Neelakantan et al., 2022), E5 (Wang et al., 2022b) and InstructOR (Su et al., 2023).
Exact parameter amount of OpenAI ada model is not available, but is suspected to be ∼300M, comparable to the
BERT large size model.
by a large margin despite using a modest model size. GTEsmall is comparable to E5large while being 10× smaller. GTElarge establishes new state-of-the-art performance on the MTEB benchmark, outperforming the multi-task instruction-finetuned embedding model InstructORlarge by 1.5 points on average.

4.4 Code Search

Programming languages can be regarded as a distinct form of text. To assess the effectiveness of our approach in code search, we conduct a comparative analysis with code-based language models such as CodeBERT (Feng et al., 2020) and GraphCodeBERT (Guo et al., 2021). We also compare our approach with a more recent code language model, UniXcoder (Guo et al., 2022), which aims to integrate various pre-training tasks into a unified model. CodeRetriever (Li et al., 2022) is initialized from GraphCodeBERT and pre-trained on large-scale multi-modal code-text pairs mined and cleaned by heuristics. It is important to note that while the baseline models are individually trained and evaluated for each programming language, our model is directly evaluated across all the languages.
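As a rough illustration of treating code as text, the sketch below reuses the hypothetical embed() helper from the Section 3.1 sketch to encode a natural-language query and candidate code snippets with the same encoder and rank them by cosine similarity; the snippets are invented for illustration and are not CodeSearchNet data.

```python
# Code search by treating code as text: query and code snippets go through the
# same text encoder (the embed() sketch from Section 3.1), and candidates are
# ranked by cosine similarity. Snippets below are made up for illustration.
import torch.nn.functional as F

query = ["sets the MaxRecords field's value"]
candidates = [
    "def set_max_records(self, v):\n    self.max_records = v\n    return self",
    "def parse_config(path):\n    with open(path) as f:\n        return json.load(f)",
]

q_emb = F.normalize(embed(query), dim=-1)
c_emb = F.normalize(embed(candidates), dim=-1)
ranking = (q_emb @ c_emb.T).squeeze(0).argsort(descending=True)
print([candidates[i][:40] for i in ranking.tolist()])
```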
Model Params Ruby JS Go Python Java PHP Avg.
CodeBERT 110M×6 67.9 62.0 88.2 67.2 67.6 62.8 69.3
GraphCodeBERT 110M×6 70.3 64.4 89.7 69.2 69.1 64.9 71.3
UniXcoder 110M×6 74.0 68.4 91.5 72.0 72.6 67.6 74.4
CodeRetriever 110M×6 77.1 71.9 92.4 75.8 76.5 70.8 77.4
GTEbase 110M 76.1 73.6 88.1 95.9 80.1 85.3 83.2
Table 7: Results on CodeSearchNet. Comparison on code search across 6 programming languages (Husain et al.,
2019) with CodeBERT (Feng et al., 2020), GraphCodeBERT (Guo et al., 2021), UniXcoder (Guo et al., 2022) and
CodeRetriever (Li et al., 2022). This setting requires finding the corresponding code candidates from all candidates
from dev and test set.
In line with recent work (Guo et al., 2021, 2022; Li et al., 2022), we mainly evaluate on the challenging setting where the code corpus includes all code from the dev and test sets instead of 1k randomly sampled snippets.2 The results are presented in Table 7. Surprisingly, our model surpasses models that are pre-trained on code and then fine-tuned for each programming language separately. This finding demonstrates that, by scaling the amount of data and computational resources, the language model can acquire high-quality code representations directly from sequences of code tokens, without the need to incorporate human knowledge about the structural information of code (Guo et al., 2021). We observe a significant improvement in Python, likely due to its resemblance to natural language. Our model, pre-trained on extensive text pairs spanning various domains, demonstrates effective cross-task knowledge transfer from text retrieval to code retrieval.

We further analyze the effect of the number of datasets used in pre-training. Model training was carried out by randomly sampling a subset from all available datasets.3 In the pre-training stage, the first group consisted of only the five largest datasets, ranked by size. The second group included an additional 10 randomly sampled datasets, resulting in a mixture of 15 datasets. The third group utilized all 33 datasets in the pre-training process. For fine-tuning, we initially started with the three datasets used in E5 (Wang et al., 2022b) fine-tuning and gradually incorporated datasets from MEDI (Su et al., 2023) and BERRI (Asai et al., 2023) to investigate the potential benefits. The results presented in Figure 3a demonstrate that the inclusion of more diverse data sources consistently enhances model performance during both the pre-training and fine-tuning stages.

3 We use a fixed random seed for data sampling during model training, ensuring that each model encounters the data batches in the same order.

Figure 3: Scaling analysis of different factors during contrastive pre-training and fine-tuning. Model performance is measured by the average performance on MTEB.

MTEB  56.4  59.0  57.8  57.7  59.0

Table 8: Model performance at different training steps during unsupervised contrastive pre-training.

5.4 Training Data Mixture

We study the influence of the mixing ratio used in the sampling distribution over pre-training data on model performance. The performance on two task categories, retrieval and STS, as well as the average performance on MTEB, is reported in Table 10. We observe that neither uniformly sampling from each
pre-training task (α = 0) nor directly combining all data sources (α = 1) is the best choice. Setting α to 0.5 improves results on all tasks.

α    Retrieval  STS   MTEB
0    36.7       73.2  55.4
0.3  44.6       75.9  58.9
0.5  44.2       76.5  59.0
1    42.0       75.5  58.3

Table 10: Influence of the ratio α used in pre-training data sampling.

5.5 Ablation of the Contrastive Objective

This work uses an improved contrastive objective which efficiently enlarges the negative pool under a fixed batch size. We compare it against the vanilla contrastive loss with only in-batch negatives in both the pre-training and fine-tuning stages.

Setting   PT    FT
Vanilla   57.3  61.8
Improved  57.8  62.4

Table 11: Comparison of the vanilla contrastive loss with in-batch negatives and the improved contrastive loss with an enlarged negative pool. For the ablation we run the pre-training (PT) for 30k steps to reduce computational cost. We report the average score on MTEB.

According to Table 11, using the improved contrastive loss consistently improves model performance in both the pre-training and fine-tuning stages.

6 Discussion

Despite the strong performance on English tasks, our current model can only handle text shorter than 512 tokens, as it is initialized from BERT, and it lacks multilingual capabilities. Consequently, longer texts must be truncated or split for encoding. However, with more data engineering and compute resources, the described training approach could easily be extended to a multilingual version and accommodate longer contexts.

Another issue is the problem of data contamination resulting from large-scale pre-training on Internet data. Currently, we only conduct deduplication based on exact matching of text pairs, which is an overly strict filter. This issue has also been highlighted by Brown et al. (2020) during the training of large-scale generative language models. We suspect that this is a common problem that other models also suffer from, but quantifying it without details about the training data sources is even more challenging (Neelakantan et al., 2022).

Furthermore, the models trained in this study are based on a non-causal architecture with bidirectional context attention. It would be intriguing to explore similar pre-training methods for causal or prefix language models, as these models could optimize generation and retrieval jointly and unify them within a single model.

7 Conclusion

This paper presents a multi-stage contrastive learning approach to develop a text embedding model that can be applied to various tasks. Our model benefits from a diverse training data mixture, enabling it to achieve good generalization performance with a single vector embedding. Through extensive evaluation on multiple benchmarks, we demonstrate the effectiveness and versatility of our text embedding model. Our future work will focus on scaling the model to support longer contexts, extending it to multilingual and multi-modal applications, and exploring the benefits of prompts and instructions.

References

Akari Asai, Timo Schick, Patrick Lewis, Xilun Chen, Gautier Izacard, Sebastian Riedel, Hannaneh Hajishirzi, and Wen-tau Yih. 2023. Task-aware retrieval with instructions. In Findings of the Association for Computational Linguistics: ACL 2023, pages 3650–3675, Toronto, Canada. Association for Computational Linguistics.

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. In Advances in Neural Information Processing Systems, volume 33, pages 1877–1901. Curran Associates, Inc.

Wei-Cheng Chang, Felix X. Yu, Yin-Wen Chang, Yiming Yang, and Sanjiv Kumar. 2020. Pre-training tasks for embedding-based large-scale retrieval. In International Conference on Learning Representations.
Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Luyu Gao and Jamie Callan. 2022. Unsupervised cor-
Yuan, Henrique Ponde de Oliveira Pinto, Jared Ka- pus aware language model pre-training for dense pas-
plan, Harri Edwards, Yuri Burda, Nicholas Joseph, sage retrieval. In Proceedings of the 60th Annual
Greg Brockman, Alex Ray, Raul Puri, Gretchen Meeting of the Association for Computational Lin-
Krueger, Michael Petrov, Heidy Khlaaf, Girish Sas- guistics (Volume 1: Long Papers), pages 2843–2853,
try, Pamela Mishkin, Brooke Chan, Scott Gray, Dublin, Ireland. Association for Computational Lin-
Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz guistics.
Kaiser, Mohammad Bavarian, Clemens Winter,
Philippe Tillet, Felipe Petroski Such, Dave Cum- Tianyu Gao, Xingcheng Yao, and Danqi Chen. 2021.
mings, Matthias Plappert, Fotios Chantzis, Eliza- SimCSE: Simple contrastive learning of sentence em-
beth Barnes, Ariel Herbert-Voss, William Hebgen beddings. In Proceedings of the 2021 Conference
Guss, Alex Nichol, Alex Paino, Nikolas Tezak, Jie on Empirical Methods in Natural Language Process-
Tang, Igor Babuschkin, Suchir Balaji, Shantanu Jain, ing, pages 6894–6910, Online and Punta Cana, Do-
William Saunders, Christopher Hesse, Andrew N. minican Republic. Association for Computational
Carr, Jan Leike, Josh Achiam, Vedant Misra, Evan Linguistics.
Morikawa, Alec Radford, Matthew Knight, Miles
Brundage, Mira Murati, Katie Mayer, Peter Welinder, Daya Guo, Shuai Lu, Nan Duan, Yanlin Wang, Ming
Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Zhou, and Jian Yin. 2022. UniXcoder: Unified cross-
Sutskever, and Wojciech Zaremba. 2021. Evaluating modal pre-training for code representation. In Pro-
large language models trained on code. ceedings of the 60th Annual Meeting of the Associa-
tion for Computational Linguistics (Volume 1: Long
Tianqi Chen, Bing Xu, Chiyuan Zhang, and Carlos Papers), pages 7212–7225, Dublin, Ireland. Associa-
Guestrin. 2016. Training deep nets with sublinear tion for Computational Linguistics.
memory cost.
Daya Guo, Shuo Ren, Shuai Lu, Zhangyin Feng,
Duyu Tang, Shujie LIU, Long Zhou, Nan Duan,
Hyunjin Choi, Judong Kim, Seongho Joe, and Alexey Svyatkovskiy, Shengyu Fu, Michele Tufano,
Youngjune Gwon. 2021. Evaluation of bert and albert Shao Kun Deng, Colin Clement, Dawn Drain, Neel
sentence embedding performance on downstream nlp Sundaresan, Jian Yin, Daxin Jiang, and Ming Zhou.
tasks. 2020 25th International Conference on Pattern 2021. Graphcode{bert}: Pre-training code represen-
Recognition (ICPR), pages 5482–5487. tations with data flow. In International Conference
on Learning Representations.
Alexis Conneau, Douwe Kiela, Holger Schwenk, Loïc
Barrault, and Antoine Bordes. 2017. Supervised Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat,
learning of universal sentence representations from and Mingwei Chang. 2020. Retrieval augmented
natural language inference data. In Proceedings of language model pre-training. In Proceedings of the
the 2017 Conference on Empirical Methods in Nat- 37th International Conference on Machine Learning,
ural Language Processing, pages 670–680, Copen- volume 119 of Proceedings of Machine Learning
hagen, Denmark. Association for Computational Lin- Research, pages 3929–3938. PMLR.
guistics.
Samuel Humeau, Kurt Shuster, Marie-Anne Lachaux,
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and and Jason Weston. 2020. Poly-encoders: Architec-
Kristina Toutanova. 2019. BERT: Pre-training of tures and pre-training strategies for fast and accurate
deep bidirectional transformers for language under- multi-sentence scoring. In International Conference
standing. In Proceedings of the 2019 Conference of on Learning Representations.
the North American Chapter of the Association for
Computational Linguistics: Human Language Tech- Hamel Husain, Ho-Hsiang Wu, Tiferet Gazit, Miltiadis
nologies, Volume 1 (Long and Short Papers), pages Allamanis, and Marc Brockschmidt. 2019. Code-
4171–4186, Minneapolis, Minnesota. Association for searchnet challenge: Evaluating the state of semantic
Computational Linguistics. code search. CoRR, abs/1909.09436.
Kenton Lee, Ming-Wei Chang, and Kristina Toutanova. Niklas Muennighoff, Nouamane Tazi, Loic Magne, and
2019. Latent retrieval for weakly supervised open Nils Reimers. 2023. MTEB: Massive text embedding
domain question answering. In Proceedings of the benchmark. In Proceedings of the 17th Conference
57th Annual Meeting of the Association for Computa- of the European Chapter of the Association for Com-
tional Linguistics, pages 6086–6096, Florence, Italy. putational Linguistics, pages 2014–2037, Dubrovnik,
Association for Computational Linguistics. Croatia. Association for Computational Linguistics.
Bohan Li, Hao Zhou, Junxian He, Mingxuan Wang, Arvind Neelakantan, Tao Xu, Raul Puri, Alec Rad-
Yiming Yang, and Lei Li. 2020. On the sentence ford, Jesse Michael Han, Jerry Tworek, Qiming
embeddings from pre-trained language models. In Yuan, Nikolas Tezak, Jong Wook Kim, Chris Hallacy,
Proceedings of the 2020 Conference on Empirical Johannes Heidecke, Pranav Shyam, Boris Power,
Methods in Natural Language Processing (EMNLP), Tyna Eloundou Nekoul, Girish Sastry, Gretchen
pages 9119–9130, Online. Association for Computa- Krueger, David Schnurr, Felipe Petroski Such, Kenny
tional Linguistics. Hsu, Madeleine Thompson, Tabarak Khan, Toki
Sherbakov, Joanne Jang, Peter Welinder, and Lilian
Xiaonan Li, Yeyun Gong, Yelong Shen, Xipeng Qiu, Weng. 2022. Text and code embeddings by con-
Hang Zhang, Bolun Yao, Weizhen Qi, Daxin Jiang, trastive pre-training. CoRR, abs/2201.10005.
Weizhu Chen, and Nan Duan. 2022. CodeRetriever:
A large scale contrastive pre-training method for code Jianmo Ni, Gustavo Hernandez Abrego, Noah Con-
search. In Proceedings of the 2022 Conference on stant, Ji Ma, Keith Hall, Daniel Cer, and Yinfei Yang.
Empirical Methods in Natural Language Processing, 2022a. Sentence-t5: Scalable sentence encoders
pages 2898–2910, Abu Dhabi, United Arab Emirates. from pre-trained text-to-text models. In Findings of
Association for Computational Linguistics. the Association for Computational Linguistics: ACL
2022, pages 1864–1874, Dublin, Ireland. Association
Zehan Li, Yanzhao Zhang, Dingkun Long, and Pengjun for Computational Linguistics.
Xie. 2023. Challenging decoder helps in masked Jianmo Ni, Chen Qu, Jing Lu, Zhuyun Dai, Gustavo Her-
auto-encoder pre-training for dense passage retrieval. nandez Abrego, Ji Ma, Vincent Zhao, Yi Luan, Keith
CoRR, abs/2305.13197. Hall, Ming-Wei Chang, and Yinfei Yang. 2022b.
Kyle Lo, Lucy Lu Wang, Mark Neumann, Rodney Kin- Large dual encoders are generalizable retrievers. In
ney, and Daniel Weld. 2020. S2ORC: The semantic Proceedings of the 2022 Conference on Empirical
scholar open research corpus. In Proceedings of the Methods in Natural Language Processing, pages
58th Annual Meeting of the Association for Compu- 9844–9855, Abu Dhabi, United Arab Emirates. As-
tational Linguistics, pages 4969–4983, Online. Asso- sociation for Computational Linguistics.
ciation for Computational Linguistics. Barlas Oguz, Kushal Lakhotia, Anchit Gupta, Patrick
Lewis, Vladimir Karpukhin, Aleksandra Piktus,
Dingkun Long, Qiong Gao, Kuan Zou, Guangwei Xu, Xilun Chen, Sebastian Riedel, Scott Yih, Sonal
Pengjun Xie, Ruijie Guo, Jianfeng Xu, Guanjun Gupta, and Yashar Mehdad. 2022. Domain-matched
Jiang, Luxi Xing, and Ping Yang. 2022a. Multi-cpr: pre-training tasks for dense retrieval. In Findings
A multi domain chinese dataset for passage retrieval. of the Association for Computational Linguistics:
Proceedings of the 45th International ACM SIGIR NAACL 2022, pages 1524–1534, Seattle, United
Conference on Research and Development in Infor- States. Association for Computational Linguistics.
mation Retrieval.
OpenAI. 2023. Gpt-4 technical report. ArXiv,
Dingkun Long, Yanzhao Zhang, Guangwei Xu, and abs/2303.08774.
Pengjun Xie. 2022b. Retrieval oriented masking pre-
training language model for dense passage retrieval. Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya
ArXiv, abs/2210.15133. Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sas-
try, Amanda Askell, Pamela Mishkin, Jack Clark,
Paulius Micikevicius, Sharan Narang, Jonah Alben, Gre- Gretchen Krueger, and Ilya Sutskever. 2021. Learn-
gory Diamos, Erich Elsen, David Garcia, Boris Gins- ing transferable visual models from natural language
burg, Michael Houston, Oleksii Kuchaiev, Ganesh supervision. In Proceedings of the 38th International
Venkatesh, and Hao Wu. 2018. Mixed precision Conference on Machine Learning, volume 139 of
training. In International Conference on Learning Proceedings of Machine Learning Research, pages
Representations. 8748–8763. PMLR.
Fedor Moiseev, Gustavo Hernandez Abrego, Peter Dorn- Alec Radford, Karthik Narasimhan, Tim Salimans, and
bach, Imed Zitouni, Enrique Alfonseca, and Zhe Ilya Sutskever. 2018. Improving language under-
Dong. 2023. SamToNe: Improving contrastive loss standing by generative pre-training.
Thilina C. Rajapakse. 2023. Dense passage retrieval: A heterogeneous benchmark for zero-shot evaluation
Architectures and augmentation methods. Proceed- of information retrieval models. In Proceedings of
ings of the 46th International ACM SIGIR Confer- the Neural Information Processing Systems Track on
ence on Research and Development in Information Datasets and Benchmarks, volume 1. Curran.
Retrieval.
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier
Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, Martinet, Marie-Anne Lachaux, Timothée Lacroix,
and Yuxiong He. 2020. Zero: Memory optimizations Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal
toward training trillion parameter models. In Pro- Azhar, Aurelien Rodriguez, Armand Joulin, Edouard
ceedings of the International Conference for High Grave, and Guillaume Lample. 2023. Llama: Open
Performance Computing, Networking, Storage and and efficient foundation language models. ArXiv,
Analysis, SC ’20. IEEE Press. abs/2302.13971.
Ori Ram, Yoav Levine, Itay Dalmedigos, Dor Muhlgay, Aäron van den Oord, Yazhe Li, and Oriol Vinyals. 2018.
Amnon Shashua, Kevin Leyton-Brown, and Yoav Representation learning with contrastive predictive
Shoham. 2023. In-context retrieval-augmented lan- coding. CoRR, abs/1807.03748.
guage models. ArXiv, abs/2302.00083. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob
Uszkoreit, Llion Jones, Aidan N Gomez, Ł ukasz
Nils Reimers and Iryna Gurevych. 2019. Sentence- Kaiser, and Illia Polosukhin. 2017. Attention is all
BERT: Sentence embeddings using Siamese BERT- you need. In Advances in Neural Information Pro-
networks. In Proceedings of the 2019 Conference on cessing Systems, volume 30. Curran Associates, Inc.
Empirical Methods in Natural Language Processing
and the 9th International Joint Conference on Natu- Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao,
ral Language Processing (EMNLP-IJCNLP), pages Linjun Yang, Daxin Jiang, Rangan Majumder, and
3982–3992, Hong Kong, China. Association for Com- Furu Wei. 2022a. Simlm: Pre-training with repre-
putational Linguistics. sentation bottleneck for dense passage retrieval. In
Annual Meeting of the Association for Computational
Ruiyang Ren, Shangwen Lv, Yingqi Qu, Jing Liu, Linguistics.
Wayne Xin Zhao, QiaoQiao She, Hua Wu, Haifeng
Wang, and Ji-Rong Wen. 2021. PAIR: Leverag- Liang Wang, Nan Yang, Xiaolong Huang, Binxing
ing passage-centric similarity relation for improving Jiao, Linjun Yang, Daxin Jiang, Rangan Majumder,
dense passage retrieval. In Findings of the Associa- and Furu Wei. 2022b. Text embeddings by weakly-
tion for Computational Linguistics: ACL-IJCNLP supervised contrastive pre-training. arXiv preprint
2021, pages 2173–2183, Online. Association for arXiv:2212.03533.
Computational Linguistics. Wenhui Wang, Furu Wei, Li Dong, Hangbo Bao, Nan
Yang, and Ming Zhou. 2020. Minilm: Deep self-
Andrew Rosenberg and Julia Hirschberg. 2007. V-
attention distillation for task-agnostic compression
measure: A conditional entropy-based external clus-
of pre-trained transformers. In Proceedings of the
ter evaluation measure. In Proceedings of the 2007
34th International Conference on Neural Information
Joint Conference on Empirical Methods in Natural
Processing Systems, NIPS’20, Red Hook, NY, USA.
Language Processing and Computational Natural
Curran Associates Inc.
Language Learning (EMNLP-CoNLL), pages 410–
420, Prague, Czech Republic. Association for Com- Guillaume Wenzek, Marie-Anne Lachaux, Alexis Con-
putational Linguistics. neau, Vishrav Chaudhary, Francisco Guzmán, Ar-
mand Joulin, and Edouard Grave. 2020. CCNet:
Weijia Shi, Sewon Min, Michihiro Yasunaga, Minjoon Extracting high quality monolingual datasets from
Seo, Rich James, Mike Lewis, Luke Zettlemoyer, and web crawl data. In Proceedings of the Twelfth Lan-
Wen tau Yih. 2023. Replug: Retrieval-augmented guage Resources and Evaluation Conference, pages
black-box language models. ArXiv, abs/2301.12652. 4003–4012, Marseille, France. European Language
Resources Association.
Hongjin Su, Weijia Shi, Jungo Kasai, Yizhong Wang,
Yushi Hu, Mari Ostendorf, Wen-tau Yih, Noah A. Fangzhao Wu, Ying Qiao, Jiun-Hung Chen, Chuhan
Smith, Luke Zettlemoyer, and Tao Yu. 2023. One Wu, Tao Qi, Jianxun Lian, Danyang Liu, Xing Xie,
embedder, any task: Instruction-finetuned text em- Jianfeng Gao, Winnie Wu, and Ming Zhou. 2020.
beddings. In Findings of the Association for Compu- MIND: A large-scale dataset for news recommenda-
tational Linguistics: ACL 2023, pages 1102–1121, tion. In Proceedings of the 58th Annual Meeting of
Toronto, Canada. Association for Computational Lin- the Association for Computational Linguistics, pages
guistics. 3597–3606, Online. Association for Computational
Linguistics.
Jianlin Su, Jiarun Cao, Weijie Liu, and Yangyiwen Ou.
2021. Whitening sentence representations for better Shitao Xiao, Zheng Liu, Yingxia Shao, and Zhao Cao.
semantics and faster retrieval. 2022. Retromae: Pre-training retrieval-oriented lan-
guage models via masked auto-encoder. In Confer-
Nandan Thakur, Nils Reimers, Andreas Rücklé, Ab- ence on Empirical Methods in Natural Language
hishek Srivastava, and Iryna Gurevych. 2021. Beir: Processing.
Yiqing Xie, Xiao Liu, and Chenyan Xiong. 2023. Unsupervised dense retrieval training with web anchors. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '23, page 2476–2480, New York, NY, USA. Association for Computing Machinery.

Lee Xiong, Chenyan Xiong, Ye Li, Kwok-Fung Tang, Jialin Liu, Paul N. Bennett, Junaid Ahmed, and Arnold Overwijk. 2021. Approximate nearest neighbor negative contrastive learning for dense text retrieval. In International Conference on Learning Representations.

Jiawei Zhou, Xiaoguang Li, Lifeng Shang, Lan Luo, Ke Zhan, Enrui Hu, Xinyu Zhang, Hao Jiang, Zhao Cao, Fan Yu, Xin Jiang, Qun Liu, and Lei Chen. 2022. Hyperlink-induced pre-training for passage retrieval in open-domain question answering. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7135–7146, Dublin, Ireland. Association for Computational Linguistics.

A More Details about Training Data

A.1 Pre-training Data

Web Page  Within a web page, we use the title as the query and the body text as the document. Resources include CommonCrawl, ClueWeb, and MS MARCO documents. The task can be formatted as: given a short title, find the most relevant body texts from a set of randomly sampled texts.

Academic Paper  Scientific articles are usually of higher quality due to their formal nature. For each paper, we use the title as the query and its abstract as the document to construct text pairs. The articles are mined from different websites (such as arXiv, bioRxiv, medRxiv, PubMed, and Semantic Scholar) to cover a wide range of topics.

Hyperlink  Another important source of information on the internet is hyperlinks with anchor text, also known as web anchors. A hyperlink can provide necessary references for the current argument. We use the citing text and the text from the referenced page as relevant text pairs for contrast. This type of task is more challenging as it usually involves multi-hop reasoning. We used three resources to incorporate link information: ClueWeb, Wikipedia, and Semantic Scholar paper citations.

Community QA  We also use data from community QA websites. The UI of such websites usually follows a structured format, where users write their questions as a summarizing title and a descriptive body. These two fields are usually semantically consistent. In addition, we also consider question-answer pairs from this type of website. The data sources we used include StackExchange, Yahoo Answers, WikiHow, and Amazon QA. Simple heuristics such as text length and voting numbers are used to filter out low-quality data.

Social Media  On social media websites such as Twitter and Reddit, people publish posts about an event and many users leave comments. A post is also structured with a title and a body, which we consider a positive pair. Similar to Community QA, (post, comment) pairs are also regarded as positive pairs for data mining. We mine data from Reddit.

News  News articles are structured as (title, body) pairs, and some articles also contain highlighted sentences. We use this information to construct (query, doc) pairs. We used data from CCNews, MicrosoftNews, NPR, and CNNDaily.

Knowledge Base  A knowledge base usually stores textual descriptions of knowledge about an entity or event. We mine (entity, description) pairs, using Wikipedia and DBPedia for text pair mining in this work.

Code  Code can be viewed as another form of text. Naturally paired text and code can be repurposed as positive pairs. We use GitHub and StackOverflow as two data sources, and reuse the training set from CodeSearchNet, which is mined from GitHub.

Others  In addition, we also use data from various other websites, such as Amazon reviews of goods, arguments from debate websites, and GooAQ question-answer pairs collected by prompting the Google search box with search-log queries.

A.2 Fine-tuning Data

Web Search  We use the MS MARCO passage retrieval benchmark. Hard negatives are mined by sampling from documents ranked highly by a retrieval system, excluding the positives.

Open QA  We consider Natural Questions, Trivia QA, Web Questions, HotpotQA, etc. In the open-domain QA datasets, a question and its supporting evidence passages are provided as positive pairs. Top-ranked passages from a retrieval system that do
not include the answer to the question are regarded as hard negatives.

Natural Language Inference  Prior work (Conneau et al., 2017) has shown that high-quality sentence embeddings can be learned from a supervised natural language inference task. We use entailment pairs as positives and contradiction pairs as negatives to construct training triples. The combination of MNLI and SNLI is used in this work.

Fact Verification  An argument and its supporting source (a Wikipedia document) form a positive pair. We use the training set from FEVER as the data source for this task.

Paraphrase  Two sentences with similar meanings are labeled as a positive pair. This type of data includes Quora and StackExchangeDupQuestions.

Others  In addition to the previous datasets, we also used miscellaneous datasets from different NLP tasks and domains released in MEDI (Su et al., 2023) and BERRI (Asai et al., 2023). By doing so, a sub-sampled version of the pre-training data is also included in fine-tuning to avoid catastrophic forgetting.

A.3 Data Sources

The pre-training data comes mostly from language corpora released by previous work. We use CommonCrawl preprocessed by CCNet at the 2019 snapshot, due to the large computational cost of processing (Wenzek et al., 2020). Since Reddit data is no longer freely available, we use two pre-processed versions, by sentence-transformers4 and by Oguz et al. (2022), for pair mining. Text pairs mined from hyperlinks come from Zhou et al. (2022) and Xie et al. (2023). We also include citation pairs from the S2ORC dataset (Lo et al., 2020). We reuse the DBPedia, debating arguments, and PubMed corpora from BEIR (Thakur et al., 2021). Wikipedia data is taken from Izacard et al. (2022b). Microsoft News data comes from Wu et al. (2020). ArXiv data is downloaded from Kaggle; medRxiv and bioRxiv data are mined by requesting the public APIs from 2013 to 2022. The StackExchange and StackOverflow data come from the pre-processed version maintained by the sentence-transformers team.5 The remaining data comes from embedding-training-data.6 The training data is kept as-is without any specific filtering, except that we use text-pair exact matching to de-duplicate the training data of some datasets.

The fine-tuning data is basically a combination of data from previous research. For the MS MARCO dataset, we use hard negatives mined by the second-stage retriever from Li et al. (2023). For the NQ dataset, we reuse the training data released by coCondenser (Gao and Callan, 2022). We use the NLI data released by SimCSE (Gao et al., 2021). Other data comes from MEDI and BERRI (Su et al., 2023; Asai et al., 2023), but we discard the instructions written for each task and only use the training triples. Some randomly sampled examples can be found in Table 12.

B Massive Text Embedding Benchmark

Classification  This task is evaluated in the linear probing setting. The embedding model is kept frozen and used to extract text embeddings for each example from the train and test sets. The train-set embeddings are used as input features to train a logistic regression classifier with 100 maximum iterations. The accuracy on the test set is reported as the main evaluation metric. In this setting, different classification tasks only need to train an extra classification head with a small amount of labeled training data.

Clustering  A high-quality embedding model should embed semantically similar texts close together in the embedding space. This property is evaluated by running a k-means algorithm on the embeddings produced for each sentence of the test set. A mini-batch k-means model is used with batch size 32 and k equal to the number of labels, partitioning the texts into k clusters. Clustering performance is measured by the v-measure (Rosenberg and Hirschberg, 2007), which is invariant to the permutation of clustering labels.

Reranking  Given a query and a list of relevant and irrelevant reference texts, reranking needs to rank the list of reference texts based on their similarity to the query. The embedding model is invoked to obtain embeddings for each query and reference text, and cosine similarity is used as the ranking score.

4 https://huggingface.co/datasets/sentence-transformers/reddit-title-body
5 https://huggingface.co/flax-sentence-embeddings
6 https://huggingface.co/datasets/sentence-transformers/embedding-training-data
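A minimal sketch of the linear-probing protocol described above for the classification tasks follows, assuming scikit-learn's LogisticRegression as the probe; the random features below stand in for frozen-model embeddings and real task labels.

```python
# Sketch of the MTEB classification linear probe: the embedding model stays
# frozen, and a logistic regression classifier with at most 100 iterations is
# fit on train-set embeddings. Random features stand in for real task data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
train_X, train_y = rng.normal(size=(256, 768)), rng.integers(0, 2, size=256)
test_X, test_y = rng.normal(size=(64, 768)), rng.integers(0, 2, size=64)

clf = LogisticRegression(max_iter=100)
clf.fit(train_X, train_y)
print("accuracy:", accuracy_score(test_y, clf.predict(test_X)))
```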
Web Search (query, passage, negative)
  Query: finger cellulitis symptoms
  Doc: The following are the most common symptoms of cellulitis. However. . .
  Hard neg: Cellulitis usually begins as a small area of pain and . . .

Open QA (question, passage, negative)
  Query: big little lies season 2 how many episodes
  Doc: Big Little Lies (TV series). series garnered several accolades. . .
  Hard neg: Little People, Big World. final minutes of the season two. . .

Natural Language Inference (sentence, entailment, contradiction)
  Query: (Read for Slate 's take on Jackson's findings.)
  Doc: Slate had an opinion on Jackson's findings.
  Hard neg: Slate did not hold any opinion on Jackson's findings.

Fact Verification (argument, evidence, others)
  Query: Roman Atwood is a content creator.
  Doc: Roman Bernard Atwood (born May 28, 1983) is an American YouTube personality. . .
  Hard neg: 6th Streamy Awards Casey Neistat and Jesse Wellens, PrankvsPrank . . .

Paraphrase (sentence, paraphrase, others)
  Query: Lexapro taken with crestor any reaction?
  Doc: Can dayquil be taken with Lexapro?
  Hard neg: Can stopping lexapro cause a longer period?

Table 12: Examples of (query, positive, negative) text triples in fine-tuning data.
This inference setting is quite similar to text retrieval, with the reference set being smaller and harder to distinguish. In line with previous work, the main evaluation metric is MAP (mean average precision).

Retrieval  We omit the description of the text retrieval evaluation since it is similar to that introduced in the previous section.

Pair Classification  This task assigns a label to a pair of texts. Popular tasks include duplicate or paraphrase identification, where the label is binary. The similarity score is the cosine similarity between the embeddings of the two texts. The average precision score, computed using the best binary threshold, is reported as the main evaluation metric.

C Original CodeSearchNet Results

We list the results of the original setting on CodeSearchNet in Table 13, where the retrieval corpus contains 1k randomly sampled code snippets. Compared to previous open-source code language models with similar architecture and size (CodeBERT (Feng et al., 2020) and GraphCodeBERT (Guo et al., 2021)), our model is superior in most programming languages. There is still a performance gap to the code embedding model trained by Neelakantan et al. (2022), which used Codex (Chen et al., 2021) as the backbone and was trained on large-scale (code, text) pairs extracted from open-source code. It is worthwhile to explore how to further close this gap.

Table 13: Results on CodeSearchNet (Husain et al., 2019). We compare with CodeBERT (Feng et al., 2020), GraphCodeBERT (Guo et al., 2021) and cpt-code (Neelakantan et al., 2022). This setting requires finding the relevant code block among 1K candidates for a given natural language query.