Towards General Text Embeddings with Multi-stage Contrastive Learning

Zehan Li, Xin Zhang, Yanzhao Zhang, Dingkun Long, Pengjun Xie, Meishan Zhang
Alibaba Group
{lizehan.lzh,linzhang.zx,zhangyanzhao.zyz,dingkun.ldk,pengjun.xpj}@alibaba-inc.com

Abstract

We present GTE, a general-purpose text embedding model trained with multi-stage contrastive learning. In line with recent advancements in unifying various NLP tasks into a single format, we train a unified text embedding model by employing contrastive learning over a diverse mixture of datasets from multiple sources. By significantly increasing the amount of training data during both the unsupervised pre-training and supervised fine-tuning stages, we achieve substantial performance gains over existing embedding models. Notably, even with a relatively modest parameter count of 110M, GTEbase outperforms the black-box embedding API provided by OpenAI and even surpasses 10x larger text embedding models on the massive text embedding benchmark. Furthermore, without additional fine-tuning on each programming language individually, our model outperforms previous best code retrievers of similar size by treating code as text. In summary, our model achieves impressive results by effectively harnessing multi-stage contrastive learning, offering a powerful and efficient text embedding model with broad applicability across various NLP and code-related tasks.1

1 The GTE model is publicly available at https://huggingface.co/thenlper/gte-large

Figure 1: Illustration of the multi-stage contrastive learning pipeline used to train our text embedding model.

1 Introduction

Text embeddings have become an indispensable component in many natural language processing tasks, such as text classification, text retrieval, question answering, and dialogue systems (Karpukhin et al., 2020; Humeau et al., 2020; Choi et al., 2021; Izacard et al., 2022a; Long et al., 2022a; Rajapakse, 2023). These embedding models represent texts using low-dimensional vectors and capture their similarity through vector operations. The emergence of recent large language models (LLMs) (Radford et al., 2018; Touvron et al., 2023; OpenAI, 2023) has generated considerable interest in retrieval-augmented systems based on text embedding models that integrate the reasoning and comprehension capabilities of LLMs (Izacard et al., 2022b; Ram et al., 2023; Shi et al., 2023). Consequently, there has been a growing focus on general text representation in both industry and academia.

The pursuit of developing a unified model to address a multitude of downstream tasks has been long-standing due to the diverse formats, domains, and downstream applications of natural language. The emergence of pre-trained language models has further opened up possibilities for training such a universal model. Nonetheless, within the realm of text representation research, previous text embedding models have primarily focused on specific tasks, and their training strategies or models, tailored to a single task, may not perform optimally in other contexts. For example, the text representation model SimCSE (Gao et al., 2021), trained on symmetric text pairs, demonstrates limitations in text retrieval tasks. Similarly, certain text representation models specifically designed for dense retrieval tasks do not exhibit robust performance in sentence textual similarity tasks. Recently, there has been a shift in research focus towards developing more comprehensive models for text representation, leveraging large quantities of unlabeled web data through unsupervised contrastive pre-training, coupled with task-specific data, prompts, or instructions to mitigate task conflicts during fine-tuning (Ni et al., 2022a,b; Neelakantan et al., 2022;
Wang et al., 2022b; Su et al., 2023). Additionally, the introduction of benchmarks, such as the Massive Text Embedding Benchmark (MTEB) (Muennighoff et al., 2023), has established a robust basis for assessing the universality of text representation models. However, a significant limitation in existing research is the reliance on in-house data for pre-training, creating a bottleneck in the utilization of pre-trained model weights or APIs. Furthermore, the formulation of prompts specifically tailored for each task requires extra human effort during implementation (Su et al., 2023).

This work presents a straightforward approach to constructing a general text embedding (GTE) model solely using contrastive learning on open-source data, as illustrated in Figure 1. Specifically, we first gather a large-scale dataset comprising unsupervised text pairs extracted from various data sources for contrastive pre-training. Surprisingly, our model, pre-trained on this dataset, exhibits remarkable performance, surpassing BM25 and the E5 model (Wang et al., 2022b) in zero-shot text retrieval tasks and surpassing many supervised models on the MTEB benchmark. To further enhance the quality of the learned text representations, we obtain high-quality text pairs with human labels from multiple sources for contrastive fine-tuning. After supervised fine-tuning, our 110M BERT-based (Devlin et al., 2019) model already outperforms the current commercial embedding API of OpenAI and ranks highly on the MTEB benchmark. Furthermore, since our model is trained on code data as well, we evaluate its code search capabilities on the CodeSearchNet benchmark, which encompasses six programming languages. Notably, even without language-specific fine-tuning on each subset, our model significantly outperforms state-of-the-art code retrievers of similar size that have been fine-tuned for each programming language.

In the rest of this paper, we provide a detailed account of the data sources and training configurations employed. Subsequently, we present the evaluation results on widely recognized text embedding benchmarks and compare them with the performance of previous state-of-the-art baselines that were specifically optimized for each individual task. Our model consistently demonstrates superior performance or, at the very least, comparable results to those achieved by larger models, owing to its incorporation of a more diverse mixture of training datasets. We aspire for our model to serve as a robust baseline for the research community investigating text and code embedding.

2 Related Work

Text embeddings serve as low-dimensional vector representations for texts of varying lengths and are essential in numerous natural language processing (NLP) tasks. In contrast to high-dimensional and sparse representations such as TF-IDF, dense text embeddings possess the capacity to address the lexical mismatch problem and enhance the efficiency of text retrieval and matching.

Pre-trained language models, exemplified by BERT (Devlin et al., 2019) and GPT (Radford et al., 2018), have demonstrated remarkable success across various NLP tasks. Nonetheless, extracting a high-quality sentence embedding from pre-trained language models poses a significant challenge due to the anisotropic embedding space resulting from the masked language modeling objective. To address this issue, subsequent studies have proposed different approaches, including supervised fine-tuning (Reimers and Gurevych, 2019), normalizing flow (Li et al., 2020), whitening (Su et al., 2021), and unsupervised contrastive learning (Gao et al., 2021). These investigations primarily concentrate on enhancing performance in semantic textual similarity tasks, wherein two sentences exhibit similar formats.

Another line of research focuses on the text retrieval problem, where the query and document typically exhibit an asymmetric relationship. In this context, the dual-encoder architecture necessitates training with both positive and negative pairs. Lee et al. (2019) propose the Inverse Cloze Task (ICT) as a self-supervised pre-training approach for producing a dense retriever. The ICT method involves cropping a random sentence from a passage to construct pseudo query-document pairs. Additionally, Chang et al. (2020) leverage the link structure within Wikipedia to introduce further supervision signals in the pre-training data. In a similar vein, REALM (Guu et al., 2020) proposes a joint training approach, wherein a dense retriever and a language model are trained concurrently. The learning signal for the language model is derived from masked language modeling, with backpropagation incorporated through the retrieval step. Recent advancements, such as Contriever (Izacard et al., 2022a) and coCondenser (Gao and Callan,
2022), have demonstrated that constructing positive pairs through random passage cropping yields superior results compared to the ICT task. Building upon the ideas presented in Chang et al. (2020), some researchers have also put forth methods for constructing higher-quality positive pairs using the web link topology for retriever pre-training (Zhou et al., 2022), a technique that proves effective in zero-shot scenarios. Furthermore, in the field of dense retrieval, significant research is dedicated to enhancing the text representation capabilities of pre-trained language models through the design of auxiliary pre-training tasks (Gao and Callan, 2021; Xiao et al., 2022; Gao and Callan, 2022; Wang et al., 2022a; Long et al., 2022b; Li et al., 2023).

The previous two lines of research can be generalized as learning a vector representation for a piece of text and are distinguished by the type of downstream tasks. Recently, several studies have explored the construction of unified text representation models through large-scale contrastive learning and prompt-based learning (Neelakantan et al., 2022; Wang et al., 2022b; Su et al., 2023). Additionally, some research efforts have focused on constructing evaluation datasets to better assess the stability of text representation models across different tasks and domains. BEIR (Benchmarking IR) (Thakur et al., 2021) collects a substantial number of retrieval tasks from various domains to evaluate the robustness of dense retriever models in zero-shot scenarios. Meanwhile, MTEB (Massive Text Embedding Benchmark) (Muennighoff et al., 2023) benchmarks over 56 datasets spanning seven categories, providing a comprehensive evaluation of text embedding models.

This study aims to develop a general text embedding model through a multi-stage training approach. In the initial stage of unsupervised contrastive learning, we generate weakly supervised correlated text pairs using publicly available data from various sources. Unlike previous work (Wang et al., 2022b), we exclusively utilize open-source data and do not employ any filtering or cleaning methods. Pre-training on large-scale text pairs can effectively improve the domain generalization of text representation models and bridge the gap between the MLM training objective and the contrastive learning objective of representation models, making the language model more suitable for text representation tasks. In the supervised fine-tuning stage, the mixture of training data in our approach is more varied to further enhance the model's versatility. Moreover, our model does not incorporate task-specific prompts, which enhances reproducibility and ease of use.

3 Approach

The training process of our model consists of two stages: unsupervised pre-training and supervised fine-tuning. Both stages employ a contrastive learning objective. We first introduce the basic framework of the model. Subsequently, we discuss the sources and construction methods of the training data in the two stages. Finally, we present the optimization strategies used to enhance the model's performance during training.

3.1 Model Architecture

The backbone of our embedding model is a deep Transformer encoder (Vaswani et al., 2017), which can be initialized with pre-trained language models such as BERT (Devlin et al., 2019). Our model follows the vanilla dual-encoder architecture with mean pooling on top of the contextualized token representations produced by the language model. Formally, given a piece of text x = (x_1, ..., x_n) consisting of n tokens, an embedding model E converts the text into a low-dimensional dense vector x = E(x) ∈ R^d. To implement E, we first employ a language model to get the deep contextualized token representations

\[ \mathbf{h} = \mathrm{LM}(x) \in \mathbb{R}^{n \times d}. \tag{1} \]

Then we apply lightweight mean pooling across the first dimension to get the text representation,

\[ \mathbf{x} = \frac{1}{n}\sum_{i=1}^{n} \mathbf{h}_i \in \mathbb{R}^{d}. \tag{2} \]

The text representations are learned through a contrastive objective that distinguishes semantically relevant text pairs from irrelevant ones. Such a training procedure requires positive and negative pairs, taking the format (q, d^+, d^-). For a query q, a relevant document d^+, and a set of irrelevant documents D^- = {d^-_1, ..., d^-_n}, one popular contrastive objective is the InfoNCE loss (van den Oord et al., 2018),

\[ \mathcal{L}_{\mathrm{cl}} = -\log \frac{e^{s(q, d^{+})/\tau}}{e^{s(q, d^{+})/\tau} + \sum_{i=1}^{n} e^{s(q, d_{i}^{-})/\tau}}, \tag{3} \]

where s(q, d) estimates the similarity between the two pieces of text q and d via the vector distance between q = E(q) and d = E(d).
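To make Section 3.1 concrete, the following is a minimal sketch, not the released training code, of the mean-pooled dual encoder in Eqs. (1)-(2) and the similarity function s(q, d). Loading the checkpoint named in footnote 1 through the Hugging Face transformers library is an assumption for illustration, and masking out padding tokens in the mean pooling is a standard implementation detail not spelled out in the text; any BERT-style encoder could be substituted.

```python
# A minimal sketch of the mean-pooled dual encoder (Eqs. 1-2) and cosine
# similarity s(q, d). The checkpoint name comes from footnote 1; using the
# transformers library here is an illustrative assumption, not the paper's code.
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("thenlper/gte-large")
model = AutoModel.from_pretrained("thenlper/gte-large")

def embed(texts, max_length=128):
    # Tokenize a list of texts and run the Transformer encoder (Eq. 1).
    batch = tokenizer(texts, padding=True, truncation=True,
                      max_length=max_length, return_tensors="pt")
    with torch.no_grad():
        h = model(**batch).last_hidden_state          # (batch, n, d)
    # Mean pooling over non-padding tokens (Eq. 2).
    mask = batch["attention_mask"].unsqueeze(-1).float()
    return (h * mask).sum(dim=1) / mask.sum(dim=1)    # (batch, d)

queries = embed(["how do text embeddings work"])
docs = embed(["Text embeddings map text to low-dimensional dense vectors."])
# Cosine similarity between L2-normalized embeddings.
scores = F.normalize(queries, dim=-1) @ F.normalize(docs, dim=-1).T
print(scores)
```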
To acquire text embeddings of superior quality that can be applied across a wide range of scenarios, we compile an extensive text pair dataset from multiple formats and domains. The model is then trained on this dataset with an improved contrastive loss in a multi-stage fashion.

3.2 Unsupervised Pre-training Data

Weakly supervised text relevance data is readily available in publicly accessible web sources, such as the inherent connection between queries and answers on QA forums. These data can be extensively collected without the need for manual annotation, thereby efficiently aiding the training of text representation models. Inspired by previous work (Ni et al., 2022a,b; Neelakantan et al., 2022; Wang et al., 2022b), our model is initially pre-trained on naturally occurring text pairs extracted from diverse sources. To ensure the versatility of the embedding model, we explore a range of resources for text pair extraction, including web pages (e.g., CommonCrawl, ClueWeb), scientific papers (e.g., arXiv, SemanticScholar), community QA forums (e.g., StackExchange), social media (e.g., Reddit), knowledge bases (e.g., Wikipedia, DBPedia), and code repositories (e.g., StackOverflow, GitHub). Additionally, we harness the presence of hyperlinks in certain datasets to facilitate text pair extraction. Table 2 shows examples of the text pair formats from different sources. Further details regarding the data collection process can be found in Appendix A. In total, we utilized ~800M text pairs for the unsupervised pre-training stage. Simple statistics and data distributions are illustrated in Table 1.

Source          Datasets  Prop.   Size
Web Page        3         18.7%   147M
Academic Paper  5         5.7%    45M
Hyperlink       4         13.4%   106M
Social Media    2         41.5%   327M
Knowledge Base  2         4.8%    38M
Community QA    7         1.5%    12M
News            5         0.4%    3M
Code            2         2.5%    20M
Others          3         11.6%   91M
Total           33        100%    788M

Table 1: Statistics of pre-training data.

3.3 Supervised Fine-tuning Data

In the supervised fine-tuning stage, we use relatively smaller datasets with human annotations of the relevance between two pieces of text and optional hard negatives mined by an extra retriever to form text triples. To handle both symmetric tasks (e.g., semantic textual similarity) and asymmetric tasks (e.g., passage retrieval), we collect data from a large variety of tasks and domains, including web search (e.g., MS MARCO), open-domain QA (e.g., NQ), NLI (e.g., SNLI), fact verification (e.g., FEVER), and paraphrases (e.g., Quora). In total, we used ~3M pairs for fine-tuning, which is a combination of the training data used by previous research (Gao et al., 2021; Gao and Callan, 2022; Asai et al., 2023; Su et al., 2023; Li et al., 2023). More details can be found in Appendix A.

3.4 Training Details

Data Sampling  In the initial stage of unsupervised pre-training, data sources often differ significantly in terms of the number of training instances. To address this imbalance, we employ a multinomial distribution to sample data batches from different data sources, taking into account their respective sizes. Suppose the whole pre-training dataset D consists of m different subsets {D_1, ..., D_m} and denote the size of each subset as n_i = |D_i|; at each training iteration, the probability of sampling data from the i-th subset D_i is

\[ p_i = \frac{n_i^{\alpha}}{\sum_{j=1}^{m} n_j^{\alpha}}, \tag{4} \]

where we set α = 0.5 in this work. Furthermore, to prevent the model from solely learning task-specific shortcuts for discrimination, we ensure that all training instances within a batch originate from the same task.

Improved Contrastive Loss  When using the contrastive objective, in-batch documents are usually reused as negative candidates to improve training efficiency (Karpukhin et al., 2020). This paper uses an improved contrastive learning objective which is bidirectional and enlarges the set of negative samples with both in-batch queries and documents. This can be viewed as a combination of the loss variants proposed by Radford et al. (2021), Ren et al. (2021), and Moiseev et al. (2023).
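As an illustration of the size-weighted sampling in Eq. (4), the following minimal sketch draws the source of each training batch from the multinomial distribution with α = 0.5; the subset names and sizes are placeholders rather than the actual corpus statistics.

```python
# A sketch of the size-weighted source sampling of Eq. (4) with alpha = 0.5.
# The subset names and sizes below are illustrative, not the actual corpus.
import random

subset_sizes = {"web_page": 147_000_000, "social_media": 327_000_000,
                "community_qa": 12_000_000, "code": 20_000_000}
alpha = 0.5

weights = {name: size ** alpha for name, size in subset_sizes.items()}
total = sum(weights.values())
probs = {name: w / total for name, w in weights.items()}   # p_i in Eq. (4)

def sample_batch_source(rng=random):
    # Each training batch is drawn entirely from one source, so the model
    # cannot rely on cross-task shortcuts to separate positives from negatives.
    names, p = zip(*probs.items())
    return rng.choices(names, weights=p, k=1)[0]

print(probs)
print(sample_batch_source())
```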
Web Page (title, body)
  Query: Providence Real Estate | Providence Homes for Sale
  Doc: Founded by Roger Williams in 1636, Providence is recognized as one of the country's oldest cities. . .

Academic Paper (title, abstract)
  Query: Polymer Quantum Mechanics and its Continuum Limit
  Doc: A rather non-standard quantum representation of the canonical commutation relations of quantum mechanics. . .

Hyperlink (citation, reference)
  Query: After the championship in 1996, the PGA of America raised its stake to 50% and announced that . . .
  Doc: Pebble Beach Golf Links The largest margin of victory ever in a major championship, surpassing the 13-shot . . .

Social Media (post, comment)
  Query: Pretty sure any team with Lebron James will be a playoff contender. Considering UNC would be in the East. . .
  Doc: I was being sarcastic and making fun of the East, but honestly I was really in deep thought about this . . .

Knowledge Base (entity, description)
  Query: Animation
  Doc: Animation is the process of creating the illusion of motion and shape change by means of the rapid display of . . .

Community QA (question, answer)
  Query: How the human species evolved?
  Doc: A tough question as it overlaps science and theology. Since you asked "how the human species evolved?" I'll assume . . .

News (summary, content)
  Query: Nepalese Opposition Welcomes Return of Parliament
  Doc: Nepal's opposition alliance formally calls off weeks of pro-democracy protests after King Gyenandra reinstates . . .

Code (text, code)
  Query: SetMaxRecords sets the MaxRecords field's value.
  Doc: func (s *DescribeSnapshotCopyGrantsInput) SetMaxRecords (v int64) *DescribeSnapshotCopyGrantsInput { s.MaxRecords

Table 2: Examples of text pairs (query, doc) from different data sources used for pre-training.
Consider a batch of positive text pair samples

B = {(q_1, d_1), (q_2, d_2), ..., (q_n, d_n)}.

We use an improved contrastive loss which takes the form

\[ \mathcal{L}_{\mathrm{icl}} = -\frac{1}{n}\sum_{i=1}^{n} \log \frac{e^{s(q_i, d_i)/\tau}}{Z}, \tag{5} \]

with the partition function being

\[ Z = \sum_{j} e^{s(q_i, d_j)/\tau} + \sum_{j \neq i} e^{s(q_i, q_j)/\tau} + \sum_{j} e^{s(q_j, d_i)/\tau} + \sum_{j \neq i} e^{s(d_j, d_i)/\tau}, \tag{6} \]

in which the first two terms are used for query-to-document contrast, whereas the last two terms are used for the inverse. In this work, we use the cosine similarity as the distance metric,

\[ s(q, d) = \frac{\mathbf{q} \cdot \mathbf{d}}{\lVert \mathbf{q} \rVert_2 \cdot \lVert \mathbf{d} \rVert_2}. \tag{7} \]

The temperature τ is fixed to 0.01 in this work.

Training and Evaluation  The training of our embedding model consists of two stages. In the first stage of contrastive pre-training with only in-batch negatives, using a large batch size is crucial for better model performance: it reduces the gap between training and inference by including more negatives and provides a better approximation of the underlying learning objective. To facilitate this, we limit the maximum sequence length to 128 during pre-training and distribute the negatives across all GPUs. Popular techniques such as automatic mixed precision training (Micikevicius et al., 2018) with fp16, DeepSpeed ZeRO (Rajbhandari et al., 2020) stage 1, and gradient checkpointing (Chen et al., 2016) are also jointly used to reduce memory cost and scale the batch size up to over ten thousand. We run the pre-training for 50,000 steps, which roughly corresponds to one epoch over the whole pre-training data. We only tuned the learning rate to ensure the convergence of larger models. We employ the AdamW optimizer with linear learning rate decay and a warm-up period during the initial 5% of training steps. We conducted experiments on three distinct model scales: small, base, and large. These models were initialized from the small-sized MiniLM (Wang et al., 2020) model and the base and large BERT (Devlin et al., 2019) models, respectively. Further details can be found in Table 3.

In the second stage of contrastive fine-tuning with supervised data and hard negatives, a large batch size is unnecessary since hard negatives can already provide a reliable gradient estimation of the learning objective (Xiong et al., 2021; Li et al., 2023). Therefore, a global batch size of 128 and a train group size of 16 are utilized, with one positive example and the rest being either hard negatives or random negatives. We instead increase the maximum sequence length to 512 to better handle longer texts. The learning rate is decreased by a factor of ten during fine-tuning. The model is fine-tuned on the collected dataset for a single epoch. In-batch texts are also incorporated as negative candidates using the improved contrastive loss described in Equation 5.

After training, we directly take the last checkpoint for evaluation. We run model training on up to 8 NVIDIA A100 GPUs with 80GB memory and model evaluation on up to 8 NVIDIA Tesla V100 GPUs with 32GB memory. Models are trained with mixed precision (fp16) and evaluated with half precision (fp16) as well.
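Under the definitions above, Eqs. (5)-(6) can be sketched in PyTorch as follows. This is a simplified illustration only, assuming single-device training: the distributed sharing of negatives across GPUs, mixed precision, and the other engineering details described in this section are omitted.

```python
# A minimal PyTorch sketch of the improved contrastive loss of Eqs. (5)-(6):
# for each query q_i, the negatives include all in-batch documents d_j, the
# other in-batch queries q_j (j != i), and symmetrically for d_i.
import torch
import torch.nn.functional as F

def improved_contrastive_loss(q, d, tau=0.01):
    # q, d: (n, dim) embeddings of the n positive pairs in the batch.
    q = F.normalize(q, dim=-1)
    d = F.normalize(d, dim=-1)
    n = q.size(0)

    qd = q @ d.T / tau                 # s(q_i, d_j)/tau
    qq = q @ q.T / tau                 # s(q_i, q_j)/tau
    dq = d @ q.T / tau                 # s(q_j, d_i)/tau
    dd = d @ d.T / tau                 # s(d_j, d_i)/tau

    # Drop the self-similarity terms (j == i) from the q-q and d-d parts,
    # matching the j != i sums in Eq. (6).
    mask = torch.eye(n, dtype=torch.bool, device=q.device)
    qq = qq.masked_fill(mask, float("-inf"))
    dd = dd.masked_fill(mask, float("-inf"))

    pos = qd.diagonal()                                   # s(q_i, d_i)/tau
    # log Z of Eq. (6), accumulated in log-space for numerical stability.
    logits = torch.cat([qd, qq, dq, dd], dim=1)           # (n, 4n)
    return (torch.logsumexp(logits, dim=1) - pos).mean()  # Eq. (5)

q = torch.randn(8, 768)
d = torch.randn(8, 768)
print(improved_contrastive_loss(q, d))
```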
Model     Params  LR        GPUs  BS     Base LM
GTEsmall  30M     3 × 10−4  2     16384  microsoft/MiniLM-L12-H384-uncased
GTEbase   110M    2 × 10−4  4     16384  bert-base-uncased
GTElarge  330M    5 × 10−5  8     16384  bert-large-uncased

Table 3: Model configurations and hyperparameters used for contrastive pre-training.
Figure 2: Recall@100 of unsupervised text retrieval methods on BEIR benchmark (Thakur et al., 2021). We
compare our model GTEbase (based on BERTbase ) without using any annotated data to SimCSE (Gao et al., 2021)
(based on RoBERTalarge ), Contriever (Izacard et al., 2022a) (based on BERTbase ) and BM25. Baseline results are
borrowed from the Contriever paper (Izacard et al., 2022a) with dot product being the similarity function.
Dataset BM25 SimCSE Contriever CPT-S E5small E5base E5large GTEsmall GTEbase GTElarge
MS MARCO 22.8 9.4 20.6 19.9 25.4 26.0 26.2 31.3 31.8 31.7
Trec-Covid 65.6 26.2 27.4 52.9 52.0 61.0 61.8 61.8 64.0 64.8
NFCorpus 32.5 9.9 31.7 32.0 29.3 35.8 33.7 34.9 36.2 38.1
NQ 32.9 11.7 25.4 - 37.3 39.0 41.7 32.0 35.3 34.5
HotpotQA 60.3 19.8 48.1 51.5 46.0 52.4 52.2 49.3 50.8 49.2
FiQA 23.6 9.8 24.5 34.1 38.3 40.0 43.2 37.0 36.9 40.6
ArguAna 31.5 38.3 37.9 38.7 42.5 42.2 44.4 41.6 41.0 41.3
Touche-2020 36.7 8.9 19.3 21.0 19.9 16.9 19.8 17.7 18.2 18.5
CQADupStack 29.9 13.2 28.4 - 35.0 35.4 38.9 38.1 39.9 39.8
Quora 78.9 78.0 83.5 68.1 85.8 85.7 86.1 86.1 85.0 84.8
DBPedia 31.3 15.0 29.2 27.2 34.5 35.4 37.1 33.5 33.2 33.6
Scidocs 15.8 5.5 14.9 - 19.9 21.1 21.8 21.5 22.5 22.7
Fever 75.3 21.1 68.2 57.1 62.5 63.4 68.6 71.3 72.7 70.5
Climate-Fever 21.3 11.8 15.5 15.8 14.5 15.4 15.7 21.4 21.0 25.4
Scifact 66.5 25.7 64.9 65.4 68.5 73.7 72.3 72.7 74.1 74.1
Average 41.7 20.3 36.0 - 40.8 42.9 44.2 43.4 44.2 44.6
Table 5: nDCG@10 of different unsupervised methods on the BEIR benchmark (Thakur et al., 2021). SimCSE is
based on BERTbase backbone. CPT-S (Neelakantan et al., 2022) is of similar size to BERTlarge . Baseline results are
borrowed from the E5 paper (Wang et al., 2022b). Note that Contriever uses dot product as the similarity metric while
other models use cosine similarity.
For more details on the tasks covered in the MTEB benchmark, please refer to Appendix B.

Two settings are considered for comparison: the unsupervised setting and the supervised setting. In the unsupervised setting, models are trained using unlabeled data, while supervised models are fine-tuned using high-quality datasets with human labels. The results of strong baseline models are presented in Table 6.

In the unsupervised setting, our model outperforms the previous best model, E5, by a significant margin across all considered tasks, without the use of task-specific prompts. This improvement can be attributed to the inclusion of more training data formats and various sources of self-supervision signals. Furthermore, it is worth noting that our unsupervised pre-trained model narrows the gap even further with larger supervised baselines, such as GTR and Sentence-T5. In the supervised setting, our model surpasses the OpenAI results
Params Class. Clust. Pair. Rerank Retr. STS Summ. Avg
# of datasets → 12 11 3 4 15 10 1 56
Unsupervised models
Glove 120M 57.3 27.7 70.9 43.3 21.6 61.9 28.9 42.0
BERT 110M 61.7 30.1 56.3 43.4 10.6 54.4 29.8 38.3
SimCSE 110M 62.5 29.0 70.3 46.5 20.3 74.3 31.2 45.5
E5small 30M 67.0 41.7 78.2 53.1 40.8 68.8 25.2 54.2
E5base 110M 67.9 43.4 79.2 53.5 42.9 69.5 24.3 55.5
E5large 330M 69.0 44.3 80.3 54.4 44.2 69.9 24.8 56.4
GTEsmall 30M 71.0 44.9 82.4 57.5 43.4 77.2 30.4 58.5
GTEbase 110M 71.5 46.0 83.3 58.4 44.2 76.5 29.5 59.0
GTElarge 330M 71.8 46.4 83.3 58.8 44.6 76.3 30.1 59.3
Supervised models
SimCSE 110M 67.3 33.4 73.7 47.5 21.8 79.1 23.3 48.7
Contriever 110M 66.7 41.1 82.5 53.1 41.9 76.5 30.4 56.0
GTRlarge 330M 67.1 41.6 85.3 55.4 47.4 78.2 29.5 58.3
Sentence-T5large 330M 72.3 41.7 85.0 54.0 36.7 81.8 29.6 57.1
E5small 30M 71.7 39.5 85.1 54.5 46.0 80.9 31.4 58.9
E5base 110M 72.6 42.1 85.1 55.7 48.7 81.0 31.0 60.4
E5large 330M 73.1 43.3 85.9 56.5 50.0 82.1 31.0 61.4
InstructORbase 110M 72.6 42.1 85.1 55.7 48.8 81.0 31.0 60.4
InstructORlarge 330M 73.9 45.3 85.9 57.5 47.6 83.2 31.8 61.6
OpenAIada-001 n.a. 70.4 37.5 76.9 49.0 18.4 78.6 26.9 49.5
OpenAIada-002 n.a. 70.9 45.9 84.9 56.3 49.3 81.0 30.8 61.0
GTEsmall 30M 72.3 44.9 83.5 57.7 49.5 82.1 30.4 61.4
GTEbase 110M 73.0 46.1 84.3 58.6 51.2 82.3 30.7 62.4
GTElarge 330M 73.3 46.8 85.0 59.1 52.2 83.4 31.7 63.1
Larger models
InstructORxl 1.5B 73.1 44.7 86.6 57.3 49.3 83.1 32.3 61.8
GTRxxl 4.5B 67.4 42.4 86.1 56.7 48.5 78.4 30.6 59.0
Sentence-T5xxl 4.5B 73.4 43.7 85.1 56.4 42.2 82.6 30.1 59.5
Table 6: Results on the MTEB (Muennighoff et al., 2023) (56 datasets in English subset). Compared models include
SimCSE (Gao et al., 2021), Sentence-T5 (Ni et al., 2022a), GTR (Ni et al., 2022b), Contriever (Izacard et al., 2022a),
OpenAI text embedding API (Neelakantan et al., 2022), E5 (Wang et al., 2022b) and InstructOR (Su et al., 2023).
Exact parameter amount of OpenAI ada model is not available, but is suspected to be ∼300M, comparable to the
BERT large size model.
by a large margin despite using a modest model size. GTEsmall is comparable to E5large while being 10× smaller. GTElarge establishes new state-of-the-art performance on the MTEB benchmark, outperforming the multi-task instruction-finetuned embedding model InstructORlarge by 1.5 points on average.

4.4 Code Search

Programming languages can be regarded as a distinct form of text. To assess the effectiveness of our approach in code search, we conduct a comparative analysis with code-based language models such as CodeBERT (Feng et al., 2020) and GraphCodeBERT (Guo et al., 2021). We also compare our approach with a more recent code language model, UniXcoder (Guo et al., 2022), which aims to integrate various pre-training tasks into a unified model. CodeRetriever (Li et al., 2022) is initialized from GraphCodeBERT and pre-trained on large-scale multi-modal code-text pairs mined and cleaned by heuristics. It is important to note that while the baseline models are individually trained and evaluated for each programming language, our model is directly evaluated across all the languages.
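As a rough illustration of treating code as text, the sketch below reuses the hypothetical embed() helper from the Section 3.1 sketch to encode a natural-language query and candidate code snippets with the same encoder and rank them by cosine similarity; the snippets are invented for illustration and are not CodeSearchNet data.

```python
# Code search by treating code as text: query and code snippets go through the
# same text encoder (the embed() sketch from Section 3.1), and candidates are
# ranked by cosine similarity. Snippets below are made up for illustration.
import torch.nn.functional as F

query = ["sets the MaxRecords field's value"]
candidates = [
    "def set_max_records(self, v):\n    self.max_records = v\n    return self",
    "def parse_config(path):\n    with open(path) as f:\n        return json.load(f)",
]

q_emb = F.normalize(embed(query), dim=-1)
c_emb = F.normalize(embed(candidates), dim=-1)
ranking = (q_emb @ c_emb.T).squeeze(0).argsort(descending=True)
print([candidates[i][:40] for i in ranking.tolist()])
```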
Model Params Ruby JS Go Python Java PHP Avg.
CodeBERT 110M×6 67.9 62.0 88.2 67.2 67.6 62.8 69.3
GraphCodeBERT 110M×6 70.3 64.4 89.7 69.2 69.1 64.9 71.3
UniXcoder 110M×6 74.0 68.4 91.5 72.0 72.6 67.6 74.4
CodeRetriever 110M×6 77.1 71.9 92.4 75.8 76.5 70.8 77.4
GTEbase 110M 76.1 73.6 88.1 95.9 80.1 85.3 83.2
Table 7: Results on CodeSearchNet. Comparison on code search across 6 programming languages (Husain et al.,
2019) with CodeBERT (Feng et al., 2020), GraphCodeBERT (Guo et al., 2021), UniXcoder (Guo et al., 2022) and
CodeRetriever (Li et al., 2022). This setting requires finding the corresponding code candidates from all candidates
from dev and test set.
In line with recent work (Guo et al., 2021, 2022; Li et al., 2022), we mainly evaluate on the challenging setting where the code corpus includes all code from the dev and test sets instead of 1k randomly sampled snippets.2 The results are presented in Table 7. Surprisingly, our model surpasses models that are pre-trained on code and then fine-tuned for each programming language separately. This finding demonstrates that, by scaling the amount of data and computational resources, the language model can acquire high-quality code representations directly from sequences of code tokens, without the need to incorporate human knowledge about the structural information of code (Guo et al., 2021). We observe a significant improvement in Python, likely due to its resemblance to natural language. Our model, pre-trained on extensive text pairs spanning various domains, demonstrates effective cross-task knowledge transfer from text retrieval to code retrieval.

We further analyze the effect of the number of datasets used in pre-training. Model training was carried out by randomly sampling a subset from all available datasets.3 In the pre-training stage, the first group consisted of only the five largest datasets, ranked by size. The second group included an additional 10 randomly sampled datasets, resulting in a mixture of 15 datasets. The third group utilized all 33 datasets in the pre-training process. For fine-tuning, we initially started with the three datasets used in E5 (Wang et al., 2022b) fine-tuning and gradually incorporated datasets from MEDI (Su et al., 2023) and BERRI (Asai et al., 2023) to investigate the potential benefits. The results presented in Figure 3a demonstrate that the inclusion of more diverse data sources consistently enhances model performance during both the pre-training and fine-tuning stages.

3 We use a fixed random seed for data sampling during model training, ensuring that each model encounters the data batches in the same order.

Figure 3: Scaling analysis of different factors during contrastive pre-training and fine-tuning. Model performance is measured by the average performance on MTEB.

MTEB  56.4  59.0  57.8  57.7  59.0

Table 8: Model performance at different training steps during unsupervised contrastive pre-training.

5.4 Training Data Mixture

We study the influence of the mixing ratio used in the sampling distribution over pre-training data on model performance. The performance on two task categories, retrieval and STS, as well as the average performance on MTEB, is reported in Table 10. We observe that neither uniformly sampling from each
pre-training task (α = 0) nor directly combining all data sources (α = 1) is the best choice. Setting α to 0.5 improves results on all tasks.

α    Retrieval  STS   MTEB
0    36.7       73.2  55.4
0.3  44.6       75.9  58.9
0.5  44.2       76.5  59.0
1    42.0       75.5  58.3

Table 10: Influence of the ratio α used in pre-training data sampling.

5.5 Ablation of the Contrastive Objective

This work uses an improved contrastive objective which efficiently enlarges the negative pool under a fixed batch size. We compare it against the vanilla contrastive loss with only in-batch negatives in both the pre-training and fine-tuning stages.

Setting   PT    FT
Vanilla   57.3  61.8
Improved  57.8  62.4

Table 11: Comparison of the vanilla contrastive loss with in-batch negatives and the improved contrastive loss with an enlarged negative pool. For the ablation we run the pre-training (PT) for 30k steps to reduce computational cost. We report the average score on MTEB.

According to Table 11, using the improved contrastive loss consistently improves model performance in both the pre-training and fine-tuning stages.

6 Discussion

Despite the strong performance on English tasks, our current model can only handle text shorter than 512 tokens, as it is initialized from BERT, and it lacks multilingual capabilities. Consequently, longer texts must be truncated or split for encoding. However, with more data engineering and compute resources, the described training approach could easily be extended to a multilingual version and accommodate longer contexts.

Another issue is the problem of data contamination resulting from large-scale pre-training on Internet data. Currently, we only conduct deduplication based on exact matching of text pairs, which is an overly strict filter. This issue has also been highlighted by Brown et al. (2020) during the training of large-scale generative language models. We suspect that this is a common problem that other models also suffer from, but quantifying it without details about the training data sources is even more challenging (Neelakantan et al., 2022).

Furthermore, the models trained in this study are based on a non-causal architecture with bidirectional context attention. It would be intriguing to explore similar pre-training methods for causal or prefix language models, as these models could optimize generation and retrieval jointly and unify them within a single model.

7 Conclusion

This paper presents a multi-stage contrastive learning approach to develop a text embedding model that can be applied to various tasks. Our model benefits from a diverse training data mixture, enabling it to achieve good generalization performance with a single vector embedding. Through extensive evaluation on multiple benchmarks, we demonstrate the effectiveness and versatility of our text embedding model. Our future work will focus on scaling the model to support longer contexts, extending it to multilingual and multi-modal applications, and exploring the benefits of prompts and instructions.

References

Akari Asai, Timo Schick, Patrick Lewis, Xilun Chen, Gautier Izacard, Sebastian Riedel, Hannaneh Hajishirzi, and Wen-tau Yih. 2023. Task-aware retrieval with instructions. In Findings of the Association for Computational Linguistics: ACL 2023, pages 3650–3675, Toronto, Canada. Association for Computational Linguistics.

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. In Advances in Neural Information Processing Systems, volume 33, pages 1877–1901. Curran Associates, Inc.

Wei-Cheng Chang, Felix X. Yu, Yin-Wen Chang, Yiming Yang, and Sanjiv Kumar. 2020. Pre-training tasks for embedding-based large-scale retrieval. In International Conference on Learning Representations.
Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Luyu Gao and Jamie Callan. 2022. Unsupervised cor-
Yuan, Henrique Ponde de Oliveira Pinto, Jared Ka- pus aware language model pre-training for dense pas-
plan, Harri Edwards, Yuri Burda, Nicholas Joseph, sage retrieval. In Proceedings of the 60th Annual
Greg Brockman, Alex Ray, Raul Puri, Gretchen Meeting of the Association for Computational Lin-
Krueger, Michael Petrov, Heidy Khlaaf, Girish Sas- guistics (Volume 1: Long Papers), pages 2843–2853,
try, Pamela Mishkin, Brooke Chan, Scott Gray, Dublin, Ireland. Association for Computational Lin-
Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz guistics.
Kaiser, Mohammad Bavarian, Clemens Winter,
Philippe Tillet, Felipe Petroski Such, Dave Cum- Tianyu Gao, Xingcheng Yao, and Danqi Chen. 2021.
mings, Matthias Plappert, Fotios Chantzis, Eliza- SimCSE: Simple contrastive learning of sentence em-
beth Barnes, Ariel Herbert-Voss, William Hebgen beddings. In Proceedings of the 2021 Conference
Guss, Alex Nichol, Alex Paino, Nikolas Tezak, Jie on Empirical Methods in Natural Language Process-
Tang, Igor Babuschkin, Suchir Balaji, Shantanu Jain, ing, pages 6894–6910, Online and Punta Cana, Do-
William Saunders, Christopher Hesse, Andrew N. minican Republic. Association for Computational
Carr, Jan Leike, Josh Achiam, Vedant Misra, Evan Linguistics.
Morikawa, Alec Radford, Matthew Knight, Miles
Brundage, Mira Murati, Katie Mayer, Peter Welinder, Daya Guo, Shuai Lu, Nan Duan, Yanlin Wang, Ming
Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Zhou, and Jian Yin. 2022. UniXcoder: Unified cross-
Sutskever, and Wojciech Zaremba. 2021. Evaluating modal pre-training for code representation. In Pro-
large language models trained on code. ceedings of the 60th Annual Meeting of the Associa-
tion for Computational Linguistics (Volume 1: Long
Tianqi Chen, Bing Xu, Chiyuan Zhang, and Carlos Papers), pages 7212–7225, Dublin, Ireland. Associa-
Guestrin. 2016. Training deep nets with sublinear tion for Computational Linguistics.
memory cost.
Daya Guo, Shuo Ren, Shuai Lu, Zhangyin Feng,
Duyu Tang, Shujie LIU, Long Zhou, Nan Duan,
Hyunjin Choi, Judong Kim, Seongho Joe, and Alexey Svyatkovskiy, Shengyu Fu, Michele Tufano,
Youngjune Gwon. 2021. Evaluation of bert and albert Shao Kun Deng, Colin Clement, Dawn Drain, Neel
sentence embedding performance on downstream nlp Sundaresan, Jian Yin, Daxin Jiang, and Ming Zhou.
tasks. 2020 25th International Conference on Pattern 2021. Graphcode{bert}: Pre-training code represen-
Recognition (ICPR), pages 5482–5487. tations with data flow. In International Conference
on Learning Representations.
Alexis Conneau, Douwe Kiela, Holger Schwenk, Loïc
Barrault, and Antoine Bordes. 2017. Supervised Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat,
learning of universal sentence representations from and Mingwei Chang. 2020. Retrieval augmented
natural language inference data. In Proceedings of language model pre-training. In Proceedings of the
the 2017 Conference on Empirical Methods in Nat- 37th International Conference on Machine Learning,
ural Language Processing, pages 670–680, Copen- volume 119 of Proceedings of Machine Learning
hagen, Denmark. Association for Computational Lin- Research, pages 3929–3938. PMLR.
guistics.
Samuel Humeau, Kurt Shuster, Marie-Anne Lachaux,
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and and Jason Weston. 2020. Poly-encoders: Architec-
Kristina Toutanova. 2019. BERT: Pre-training of tures and pre-training strategies for fast and accurate
deep bidirectional transformers for language under- multi-sentence scoring. In International Conference
standing. In Proceedings of the 2019 Conference of on Learning Representations.
the North American Chapter of the Association for
Computational Linguistics: Human Language Tech- Hamel Husain, Ho-Hsiang Wu, Tiferet Gazit, Miltiadis
nologies, Volume 1 (Long and Short Papers), pages Allamanis, and Marc Brockschmidt. 2019. Code-
4171–4186, Minneapolis, Minnesota. Association for searchnet challenge: Evaluating the state of semantic
Computational Linguistics. code search. CoRR, abs/1909.09436.
Kenton Lee, Ming-Wei Chang, and Kristina Toutanova. Niklas Muennighoff, Nouamane Tazi, Loic Magne, and
2019. Latent retrieval for weakly supervised open Nils Reimers. 2023. MTEB: Massive text embedding
domain question answering. In Proceedings of the benchmark. In Proceedings of the 17th Conference
57th Annual Meeting of the Association for Computa- of the European Chapter of the Association for Com-
tional Linguistics, pages 6086–6096, Florence, Italy. putational Linguistics, pages 2014–2037, Dubrovnik,
Association for Computational Linguistics. Croatia. Association for Computational Linguistics.
Bohan Li, Hao Zhou, Junxian He, Mingxuan Wang, Arvind Neelakantan, Tao Xu, Raul Puri, Alec Rad-
Yiming Yang, and Lei Li. 2020. On the sentence ford, Jesse Michael Han, Jerry Tworek, Qiming
embeddings from pre-trained language models. In Yuan, Nikolas Tezak, Jong Wook Kim, Chris Hallacy,
Proceedings of the 2020 Conference on Empirical Johannes Heidecke, Pranav Shyam, Boris Power,
Methods in Natural Language Processing (EMNLP), Tyna Eloundou Nekoul, Girish Sastry, Gretchen
pages 9119–9130, Online. Association for Computa- Krueger, David Schnurr, Felipe Petroski Such, Kenny
tional Linguistics. Hsu, Madeleine Thompson, Tabarak Khan, Toki
Sherbakov, Joanne Jang, Peter Welinder, and Lilian
Xiaonan Li, Yeyun Gong, Yelong Shen, Xipeng Qiu, Weng. 2022. Text and code embeddings by con-
Hang Zhang, Bolun Yao, Weizhen Qi, Daxin Jiang, trastive pre-training. CoRR, abs/2201.10005.
Weizhu Chen, and Nan Duan. 2022. CodeRetriever:
A large scale contrastive pre-training method for code Jianmo Ni, Gustavo Hernandez Abrego, Noah Con-
search. In Proceedings of the 2022 Conference on stant, Ji Ma, Keith Hall, Daniel Cer, and Yinfei Yang.
Empirical Methods in Natural Language Processing, 2022a. Sentence-t5: Scalable sentence encoders
pages 2898–2910, Abu Dhabi, United Arab Emirates. from pre-trained text-to-text models. In Findings of
Association for Computational Linguistics. the Association for Computational Linguistics: ACL
2022, pages 1864–1874, Dublin, Ireland. Association
Zehan Li, Yanzhao Zhang, Dingkun Long, and Pengjun for Computational Linguistics.
Xie. 2023. Challenging decoder helps in masked Jianmo Ni, Chen Qu, Jing Lu, Zhuyun Dai, Gustavo Her-
auto-encoder pre-training for dense passage retrieval. nandez Abrego, Ji Ma, Vincent Zhao, Yi Luan, Keith
CoRR, abs/2305.13197. Hall, Ming-Wei Chang, and Yinfei Yang. 2022b.
Kyle Lo, Lucy Lu Wang, Mark Neumann, Rodney Kin- Large dual encoders are generalizable retrievers. In
ney, and Daniel Weld. 2020. S2ORC: The semantic Proceedings of the 2022 Conference on Empirical
scholar open research corpus. In Proceedings of the Methods in Natural Language Processing, pages
58th Annual Meeting of the Association for Compu- 9844–9855, Abu Dhabi, United Arab Emirates. As-
tational Linguistics, pages 4969–4983, Online. Asso- sociation for Computational Linguistics.
ciation for Computational Linguistics. Barlas Oguz, Kushal Lakhotia, Anchit Gupta, Patrick
Lewis, Vladimir Karpukhin, Aleksandra Piktus,
Dingkun Long, Qiong Gao, Kuan Zou, Guangwei Xu, Xilun Chen, Sebastian Riedel, Scott Yih, Sonal
Pengjun Xie, Ruijie Guo, Jianfeng Xu, Guanjun Gupta, and Yashar Mehdad. 2022. Domain-matched
Jiang, Luxi Xing, and Ping Yang. 2022a. Multi-cpr: pre-training tasks for dense retrieval. In Findings
A multi domain chinese dataset for passage retrieval. of the Association for Computational Linguistics:
Proceedings of the 45th International ACM SIGIR NAACL 2022, pages 1524–1534, Seattle, United
Conference on Research and Development in Infor- States. Association for Computational Linguistics.
mation Retrieval.
OpenAI. 2023. Gpt-4 technical report. ArXiv,
Dingkun Long, Yanzhao Zhang, Guangwei Xu, and abs/2303.08774.
Pengjun Xie. 2022b. Retrieval oriented masking pre-
training language model for dense passage retrieval. Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya
ArXiv, abs/2210.15133. Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sas-
try, Amanda Askell, Pamela Mishkin, Jack Clark,
Paulius Micikevicius, Sharan Narang, Jonah Alben, Gre- Gretchen Krueger, and Ilya Sutskever. 2021. Learn-
gory Diamos, Erich Elsen, David Garcia, Boris Gins- ing transferable visual models from natural language
burg, Michael Houston, Oleksii Kuchaiev, Ganesh supervision. In Proceedings of the 38th International
Venkatesh, and Hao Wu. 2018. Mixed precision Conference on Machine Learning, volume 139 of
training. In International Conference on Learning Proceedings of Machine Learning Research, pages
Representations. 8748–8763. PMLR.
Fedor Moiseev, Gustavo Hernandez Abrego, Peter Dorn- Alec Radford, Karthik Narasimhan, Tim Salimans, and
bach, Imed Zitouni, Enrique Alfonseca, and Zhe Ilya Sutskever. 2018. Improving language under-
Dong. 2023. SamToNe: Improving contrastive loss standing by generative pre-training.
Thilina C. Rajapakse. 2023. Dense passage retrieval: A heterogeneous benchmark for zero-shot evaluation
Architectures and augmentation methods. Proceed- of information retrieval models. In Proceedings of
ings of the 46th International ACM SIGIR Confer- the Neural Information Processing Systems Track on
ence on Research and Development in Information Datasets and Benchmarks, volume 1. Curran.
Retrieval.
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier
Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, Martinet, Marie-Anne Lachaux, Timothée Lacroix,
and Yuxiong He. 2020. Zero: Memory optimizations Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal
toward training trillion parameter models. In Pro- Azhar, Aurelien Rodriguez, Armand Joulin, Edouard
ceedings of the International Conference for High Grave, and Guillaume Lample. 2023. Llama: Open
Performance Computing, Networking, Storage and and efficient foundation language models. ArXiv,
Analysis, SC ’20. IEEE Press. abs/2302.13971.
Ori Ram, Yoav Levine, Itay Dalmedigos, Dor Muhlgay, Aäron van den Oord, Yazhe Li, and Oriol Vinyals. 2018.
Amnon Shashua, Kevin Leyton-Brown, and Yoav Representation learning with contrastive predictive
Shoham. 2023. In-context retrieval-augmented lan- coding. CoRR, abs/1807.03748.
guage models. ArXiv, abs/2302.00083. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob
Uszkoreit, Llion Jones, Aidan N Gomez, Ł ukasz
Nils Reimers and Iryna Gurevych. 2019. Sentence- Kaiser, and Illia Polosukhin. 2017. Attention is all
BERT: Sentence embeddings using Siamese BERT- you need. In Advances in Neural Information Pro-
networks. In Proceedings of the 2019 Conference on cessing Systems, volume 30. Curran Associates, Inc.
Empirical Methods in Natural Language Processing
and the 9th International Joint Conference on Natu- Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao,
ral Language Processing (EMNLP-IJCNLP), pages Linjun Yang, Daxin Jiang, Rangan Majumder, and
3982–3992, Hong Kong, China. Association for Com- Furu Wei. 2022a. Simlm: Pre-training with repre-
putational Linguistics. sentation bottleneck for dense passage retrieval. In
Annual Meeting of the Association for Computational
Ruiyang Ren, Shangwen Lv, Yingqi Qu, Jing Liu, Linguistics.
Wayne Xin Zhao, QiaoQiao She, Hua Wu, Haifeng
Wang, and Ji-Rong Wen. 2021. PAIR: Leverag- Liang Wang, Nan Yang, Xiaolong Huang, Binxing
ing passage-centric similarity relation for improving Jiao, Linjun Yang, Daxin Jiang, Rangan Majumder,
dense passage retrieval. In Findings of the Associa- and Furu Wei. 2022b. Text embeddings by weakly-
tion for Computational Linguistics: ACL-IJCNLP supervised contrastive pre-training. arXiv preprint
2021, pages 2173–2183, Online. Association for arXiv:2212.03533.
Computational Linguistics. Wenhui Wang, Furu Wei, Li Dong, Hangbo Bao, Nan
Yang, and Ming Zhou. 2020. Minilm: Deep self-
Andrew Rosenberg and Julia Hirschberg. 2007. V-
attention distillation for task-agnostic compression
measure: A conditional entropy-based external clus-
of pre-trained transformers. In Proceedings of the
ter evaluation measure. In Proceedings of the 2007
34th International Conference on Neural Information
Joint Conference on Empirical Methods in Natural
Processing Systems, NIPS’20, Red Hook, NY, USA.
Language Processing and Computational Natural
Curran Associates Inc.
Language Learning (EMNLP-CoNLL), pages 410–
420, Prague, Czech Republic. Association for Com- Guillaume Wenzek, Marie-Anne Lachaux, Alexis Con-
putational Linguistics. neau, Vishrav Chaudhary, Francisco Guzmán, Ar-
mand Joulin, and Edouard Grave. 2020. CCNet:
Weijia Shi, Sewon Min, Michihiro Yasunaga, Minjoon Extracting high quality monolingual datasets from
Seo, Rich James, Mike Lewis, Luke Zettlemoyer, and web crawl data. In Proceedings of the Twelfth Lan-
Wen tau Yih. 2023. Replug: Retrieval-augmented guage Resources and Evaluation Conference, pages
black-box language models. ArXiv, abs/2301.12652. 4003–4012, Marseille, France. European Language
Resources Association.
Hongjin Su, Weijia Shi, Jungo Kasai, Yizhong Wang,
Yushi Hu, Mari Ostendorf, Wen-tau Yih, Noah A. Fangzhao Wu, Ying Qiao, Jiun-Hung Chen, Chuhan
Smith, Luke Zettlemoyer, and Tao Yu. 2023. One Wu, Tao Qi, Jianxun Lian, Danyang Liu, Xing Xie,
embedder, any task: Instruction-finetuned text em- Jianfeng Gao, Winnie Wu, and Ming Zhou. 2020.
beddings. In Findings of the Association for Compu- MIND: A large-scale dataset for news recommenda-
tational Linguistics: ACL 2023, pages 1102–1121, tion. In Proceedings of the 58th Annual Meeting of
Toronto, Canada. Association for Computational Lin- the Association for Computational Linguistics, pages
guistics. 3597–3606, Online. Association for Computational
Linguistics.
Jianlin Su, Jiarun Cao, Weijie Liu, and Yangyiwen Ou.
2021. Whitening sentence representations for better Shitao Xiao, Zheng Liu, Yingxia Shao, and Zhao Cao.
semantics and faster retrieval. 2022. Retromae: Pre-training retrieval-oriented lan-
guage models via masked auto-encoder. In Confer-
Nandan Thakur, Nils Reimers, Andreas Rücklé, Ab- ence on Empirical Methods in Natural Language
hishek Srivastava, and Iryna Gurevych. 2021. Beir: Processing.
Yiqing Xie, Xiao Liu, and Chenyan Xiong. 2023. Unsupervised dense retrieval training with web anchors. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '23, page 2476–2480, New York, NY, USA. Association for Computing Machinery.

Lee Xiong, Chenyan Xiong, Ye Li, Kwok-Fung Tang, Jialin Liu, Paul N. Bennett, Junaid Ahmed, and Arnold Overwijk. 2021. Approximate nearest neighbor negative contrastive learning for dense text retrieval. In International Conference on Learning Representations.

Jiawei Zhou, Xiaoguang Li, Lifeng Shang, Lan Luo, Ke Zhan, Enrui Hu, Xinyu Zhang, Hao Jiang, Zhao Cao, Fan Yu, Xin Jiang, Qun Liu, and Lei Chen. 2022. Hyperlink-induced pre-training for passage retrieval in open-domain question answering. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7135–7146, Dublin, Ireland. Association for Computational Linguistics.

A More Details about Training Data

A.1 Pre-training Data

Web Page  Within a web page, we use the title as the query and the body text as the document. Resources include CommonCrawl, ClueWeb, and MS MARCO documents. The task can be formatted as: given a short title, find the most relevant body texts from a set of randomly sampled texts.

Academic Paper  Scientific articles are usually of higher quality due to their formal nature. For each paper, we use the title as the query and its abstract as the document to construct text pairs. The articles are mined from different websites (such as arXiv, bioRxiv, medRxiv, PubMed, and Semantic Scholar) to cover a wide range of topics.

Hyperlink  Another important source of information on the internet is hyperlinks with anchor text, also known as web anchors. A hyperlink can provide necessary references for the current argument. We use the citing text and the text from the referenced page as relevant text pairs for contrast. This type of task is more challenging as it usually involves multi-hop reasoning. We used three resources to incorporate link information: ClueWeb, Wikipedia, and Semantic Scholar paper citations.

Community QA  We also use data from community QA websites. The UI of such websites usually follows a structured format, where users write their questions as a summarizing title and a descriptive body. These two fields are usually semantically consistent. In addition, we also consider question-answer pairs from this type of website. The data sources we used include StackExchange, Yahoo Answers, WikiHow, and Amazon QA. Simple heuristics such as text length and voting numbers are used to filter out low-quality data.

Social Media  On social media websites such as Twitter and Reddit, people publish posts about an event and many users leave comments. A post is also structured with a title and a body, which we consider a positive pair. Similar to Community QA, (post, comment) pairs are also regarded as positive pairs for data mining. We mine data from Reddit.

News  News articles are structured as (title, body) pairs, and some articles also contain highlighted sentences. We use this information to construct (query, doc) pairs. We used data from CCNews, MicrosoftNews, NPR, and CNNDaily.

Knowledge Base  A knowledge base usually stores textual descriptions of knowledge about an entity or event. We mine (entity, description) pairs, using Wikipedia and DBPedia for text pair mining in this work.

Code  Code can be viewed as another form of text. Naturally paired text and code can be repurposed as positive pairs. We use GitHub and StackOverflow as two data sources, and reuse the training set from CodeSearchNet, which is mined from GitHub.

Others  In addition, we also use data from various other websites, such as Amazon reviews of goods, arguments from debate websites, and GooAQ question-answer pairs collected by prompting the Google search box with search-log queries.

A.2 Fine-tuning Data

Web Search  We use the MS MARCO passage retrieval benchmark. Hard negatives are mined by sampling from documents ranked highly by a retrieval system, excluding the positives.

Open QA  We consider Natural Questions, Trivia QA, Web Questions, HotpotQA, etc. In the open-domain QA datasets, a question and its supporting evidence passages are provided as positive pairs. Top-ranked passages from a retrieval system that do
not include the answer to the question are regarded as hard negatives.

Natural Language Inference  Prior work (Conneau et al., 2017) has shown that high-quality sentence embeddings can be learned from a supervised natural language inference task. We use entailment pairs as positives and contradiction pairs as negatives to construct training triples. The combination of MNLI and SNLI is used in this work.

Fact Verification  An argument and its supporting source (a Wikipedia document) form a positive pair. We use the training set from FEVER as the data source for this task.

Paraphrase  Two sentences with similar meanings are labeled as a positive pair. This type of data includes Quora and StackExchangeDupQuestions.

Others  In addition to the previous datasets, we also used miscellaneous datasets from different NLP tasks and domains released in MEDI (Su et al., 2023) and BERRI (Asai et al., 2023). By doing so, a sub-sampled version of the pre-training data is also included in fine-tuning to avoid catastrophic forgetting.

A.3 Data Sources

The pre-training data comes mostly from language corpora released by previous work. We use CommonCrawl preprocessed by CCNet at the 2019 snapshot, due to the large computational cost of processing (Wenzek et al., 2020). Since Reddit data is no longer freely available, we use two pre-processed versions, by sentence-transformers4 and by Oguz et al. (2022), for pair mining. Text pairs mined from hyperlinks come from Zhou et al. (2022) and Xie et al. (2023). We also include citation pairs from the S2ORC dataset (Lo et al., 2020). We reuse the DBPedia, debating arguments, and PubMed corpora from BEIR (Thakur et al., 2021). Wikipedia data is taken from Izacard et al. (2022b). Microsoft News data comes from Wu et al. (2020). ArXiv data is downloaded from Kaggle; medRxiv and bioRxiv data are mined by requesting the public APIs from 2013 to 2022. The StackExchange and StackOverflow data come from the pre-processed version maintained by the sentence-transformers team.5 The remaining data comes from embedding-training-data.6 The training data is kept as-is without any specific filtering, except that we use text-pair exact matching to de-duplicate the training data of some datasets.

The fine-tuning data is basically a combination of data from previous research. For the MS MARCO dataset, we use hard negatives mined by the second-stage retriever from Li et al. (2023). For the NQ dataset, we reuse the training data released by coCondenser (Gao and Callan, 2022). We use the NLI data released by SimCSE (Gao et al., 2021). Other data comes from MEDI and BERRI (Su et al., 2023; Asai et al., 2023), but we discard the instructions written for each task and only use the training triples. Some randomly sampled examples can be found in Table 12.

B Massive Text Embedding Benchmark

Classification  This task is evaluated in the linear probing setting. The embedding model is kept frozen and used to extract text embeddings for each example from the train and test sets. The train-set embeddings are used as input features to train a logistic regression classifier with 100 maximum iterations. The accuracy on the test set is reported as the main evaluation metric. In this setting, different classification tasks only need to train an extra classification head with a small amount of labeled training data.

Clustering  A high-quality embedding model should embed semantically similar texts close together in the embedding space. This property is evaluated by running a k-means algorithm on the embeddings produced for each sentence of the test set. A mini-batch k-means model is used with batch size 32 and k equal to the number of labels, partitioning the texts into k clusters. Clustering performance is measured by the v-measure (Rosenberg and Hirschberg, 2007), which is invariant to the permutation of clustering labels.

Reranking  Given a query and a list of relevant and irrelevant reference texts, reranking needs to rank the list of reference texts based on their similarity to the query. The embedding model is invoked to obtain embeddings for each query and reference text, and cosine similarity is used as the ranking score.

4 https://huggingface.co/datasets/sentence-transformers/reddit-title-body
5 https://huggingface.co/flax-sentence-embeddings
6 https://huggingface.co/datasets/sentence-transformers/embedding-training-data
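A minimal sketch of the linear-probing protocol described above for the classification tasks follows, assuming scikit-learn's LogisticRegression as the probe; the random features below stand in for frozen-model embeddings and real task labels.

```python
# Sketch of the MTEB classification linear probe: the embedding model stays
# frozen, and a logistic regression classifier with at most 100 iterations is
# fit on train-set embeddings. Random features stand in for real task data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
train_X, train_y = rng.normal(size=(256, 768)), rng.integers(0, 2, size=256)
test_X, test_y = rng.normal(size=(64, 768)), rng.integers(0, 2, size=64)

clf = LogisticRegression(max_iter=100)
clf.fit(train_X, train_y)
print("accuracy:", accuracy_score(test_y, clf.predict(test_X)))
```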
Web Search (query, passage, negative)
  Query: finger cellulitis symptoms
  Doc: The following are the most common symptoms of cellulitis. However. . .
  Hard neg: Cellulitis usually begins as a small area of pain and . . .

Open QA (question, passage, negative)
  Query: big little lies season 2 how many episodes
  Doc: Big Little Lies (TV series). series garnered several accolades. . .
  Hard neg: Little People, Big World. final minutes of the season two. . .

Natural Language Inference (sentence, entailment, contradiction)
  Query: (Read for Slate 's take on Jackson's findings.)
  Doc: Slate had an opinion on Jackson's findings.
  Hard neg: Slate did not hold any opinion on Jackson's findings.

Fact Verification (argument, evidence, others)
  Query: Roman Atwood is a content creator.
  Doc: Roman Bernard Atwood (born May 28, 1983) is an American YouTube personality. . .
  Hard neg: 6th Streamy Awards Casey Neistat and Jesse Wellens, PrankvsPrank . . .

Paraphrase (sentence, paraphrase, others)
  Query: Lexapro taken with crestor any reaction?
  Doc: Can dayquil be taken with Lexapro?
  Hard neg: Can stopping lexapro cause a longer period?

Table 12: Examples of (query, positive, negative) text triples in fine-tuning data.
This inference setting is quite similar to text retrieval, with the reference set being smaller and harder to distinguish. In line with previous work, the main evaluation metric is MAP (mean average precision).

Retrieval  We omit the description of the text retrieval evaluation since it is similar to that introduced in the previous section.

Pair Classification  This task assigns a label to a pair of texts. Popular tasks include duplicate or paraphrase identification, where the label is binary. The similarity score is the cosine similarity between the embeddings of the two texts. The average precision score, computed using the best binary threshold, is reported as the main evaluation metric.

C Original CodeSearchNet Results

We list the results of the original setting on CodeSearchNet in Table 13, where the retrieval corpus contains 1k randomly sampled code snippets. Compared to previous open-source code language models with similar architecture and size (CodeBERT (Feng et al., 2020) and GraphCodeBERT (Guo et al., 2021)), our model is superior in most programming languages. There is still a performance gap to the code embedding model trained by Neelakantan et al. (2022), which used Codex (Chen et al., 2021) as the backbone and was trained on large-scale (code, text) pairs extracted from open-source code. It is worthwhile to explore how to further close this gap.

Table 13: Results on CodeSearchNet (Husain et al., 2019). We compare with CodeBERT (Feng et al., 2020), GraphCodeBERT (Guo et al., 2021) and cpt-code (Neelakantan et al., 2022). This setting requires finding the relevant code block among 1K candidates for a given natural language query.