
MTEB: Massive Text Embedding Benchmark

Niklas Muennighoff¹, Nouamane Tazi¹, Loïc Magne¹, Nils Reimers²*

¹Hugging Face   ²cohere.ai
¹[email protected]   ²[email protected]

Abstract

Text embeddings are commonly evaluated on a small set of datasets from a single task, not covering their possible applications to other tasks. It is unclear whether state-of-the-art embeddings on semantic textual similarity (STS) can be equally well applied to other tasks like clustering or reranking. This makes progress in the field difficult to track, as various models are constantly being proposed without proper evaluation. To solve this problem, we introduce the Massive Text Embedding Benchmark (MTEB). MTEB spans 8 embedding tasks covering a total of 58 datasets and 112 languages. Through the benchmarking of 33 models on MTEB, we establish the most comprehensive benchmark of text embeddings to date. We find that no particular text embedding method dominates across all tasks. This suggests that the field has yet to converge on a universal text embedding method and scale it up sufficiently to provide state-of-the-art results on all embedding tasks. MTEB comes with open-source code and a public leaderboard at https://github.com/embeddings-benchmark/mteb.

1 Introduction

Natural language embeddings power a variety of use cases, from clustering and topic representation (Aggarwal and Zhai, 2012; Angelov, 2020) to search systems and text mining (Huang et al., 2020; Zhu et al., 2021; Nayak, 2019) to feature representations for downstream models (Saharia et al., 2022; Borgeaud et al., 2022). Using generative language models or cross-encoders for these applications is often intractable, as they may require exponentially more computations (Reimers and Gurevych, 2019).

However, the evaluation regime of current text embedding models rarely covers the breadth of their possible use cases. For example, SimCSE (Gao et al., 2021b) or SBERT (Reimers and Gurevych, 2019) solely evaluate on STS and classification tasks, leaving open questions about the transferability of the embedding models to search or clustering tasks. STS is known to poorly correlate with other real-world use cases (Neelakantan et al., 2022; Wang et al., 2021). Further, evaluating embedding methods on many tasks requires implementing multiple evaluation pipelines. Implementation details like pre-processing or hyperparameters may influence the results, making it unclear whether performance improvements simply come from a favorable evaluation pipeline. This leads to the “blind” application of these models to new use cases in industry, or requires incremental work to reevaluate them on different tasks.

The Massive Text Embedding Benchmark (MTEB) aims to provide clarity on how models perform on a variety of embedding tasks and thus serves as the gateway to finding universal text embeddings applicable to a variety of tasks. MTEB consists of 58 datasets covering 112 languages from 8 embedding tasks: bitext mining, classification, clustering, pair classification, reranking, retrieval, STS and summarization. MTEB software is available open-source¹, enabling evaluation of any embedding model by adding less than 10 lines of code. Datasets and the MTEB leaderboard are available on the Hugging Face Hub².

We evaluate over 30 models on MTEB with additional speed and memory benchmarking to provide a holistic view of the state of text embedding models. We cover both models available open-source as well as models accessible via APIs, such as the OpenAI Embeddings endpoint. We find there to be no single best solution, with different models dominating different tasks. Our benchmarking sheds light on the weaknesses and strengths of individual models, such as SimCSE’s (Gao et al., 2021b) low performance on clustering and retrieval despite its strong performance on STS. We hope our work makes selecting the right embedding model easier and simplifies future embedding research.

* Most of the work done while at Hugging Face. Correspondence to [email protected].
1 https://github.com/embeddings-benchmark/mteb
2 https://huggingface.co/spaces/mteb/leaderboard

2 Related Work

2.1 Benchmarks

Benchmarks such as (Super)GLUE (Wang et al., 2018, 2019) or Big-BENCH (Srivastava et al., 2022) and evaluation frameworks (Gao et al., 2021a) play a key role in driving NLP progress. Yearly released SemEval datasets (Agirre et al., 2012, 2013, 2014, 2015, 2016) are commonly used as the go-to benchmark for text embeddings. SemEval datasets correspond to the task of semantic textual similarity (STS), requiring models to embed similar sentences with geometrically close embeddings. Due to the limited expressivity of a single SemEval dataset, SentEval (Conneau and Kiela, 2018) aggregates multiple STS datasets. SentEval focuses on fine-tuning classifiers on top of embeddings. It lacks tasks like retrieval or clustering, where embeddings are directly compared without additional classifiers. Further, the toolkit was proposed in 2018 and thus does not provide easy support for recent trends like text embeddings from transformers (Reimers and Gurevych, 2019). Due to the insufficiency of STS benchmarking, USEB (Wang et al., 2021) was introduced, consisting mostly of reranking tasks. Consequently, it does not cover tasks like retrieval or classification. Meanwhile, the recently released BEIR benchmark (Thakur et al., 2021) has become the standard for the evaluation of embeddings for zero-shot information retrieval.

MTEB unifies datasets from different embedding tasks into a common, accessible evaluation framework. MTEB incorporates SemEval datasets (STS11–STS22) and BEIR alongside a variety of other datasets from various tasks to provide a holistic performance review of text embedding models.

2.2 Embedding Models

Text embedding models like Glove (Pennington et al., 2014) lack context awareness and are thus commonly labeled as word embedding models. They consist of a layer mapping each input word to a vector, often followed by an averaging layer to provide a final embedding invariant of input length. Transformers (Vaswani et al., 2017) inject context awareness into language models via self-attention and form the foundation of most recent embedding models. BERT (Devlin et al., 2018) uses the transformer architecture and performs large-scale self-supervised pre-training. The resulting model can directly be used to produce text embeddings via an averaging operation, like Glove. Building on InferSent (Conneau et al., 2017), SBERT (Reimers and Gurevych, 2019) demonstrated it to be beneficial to perform additional fine-tuning of the transformer for competitive embedding performance. Most recent fine-tuned embedding models use a contrastive loss objective to perform supervised fine-tuning on positive and negative text pairs (Gao et al., 2021b; Wang et al., 2021; Ni et al., 2021b; Muennighoff, 2022). Due to the large variety of available pre-trained transformers (Wolf et al., 2020), there is an at least equally large variety of potential text embedding models to be explored. This leads to confusion about which model provides practitioners with the best performance for their embedding use case.

We benchmark both word embedding and transformer models on MTEB, quantifying gains provided by often much slower context-aware models.

3 The MTEB Benchmark

3.1 Desiderata

MTEB is built on a set of desiderata: (a) Diversity: MTEB aims to provide an understanding of the usability of embedding models in various use cases. The benchmark comprises 8 different tasks, with up to 15 datasets each. Of the 58 total datasets in MTEB, 10 are multilingual, covering 112 different languages. Sentence-level and paragraph-level datasets are included to contrast performance on short and long texts. (b) Simplicity: MTEB provides a simple API for plugging in any model that, given a list of texts, can produce a vector for each list item with a consistent shape (see the sketch below). This makes it possible to benchmark a diverse set of models. (c) Extensibility: New datasets for existing tasks can be benchmarked in MTEB via a single file that specifies the task and a Hugging Face dataset name where the data has been uploaded (Lhoest et al., 2021). New tasks require implementing a task interface for loading the data and an evaluator for benchmarking. We welcome dataset, task or metric contributions from the community via pull requests to continue the development of MTEB. (d) Reproducibility: Through versioning at a dataset and software level, we aim to make it easy to reproduce results in MTEB. JSON files corresponding to all results available in this paper have been made available together with the MTEB benchmark³.

3 https://huggingface.co/datasets/mteb/results
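To make the simplicity desideratum concrete, the following is a minimal sketch of how an evaluation run could look with the open-source mteb package and a Sentence-Transformers model. The specific model and task names are example choices, and the exact call signature is an assumption based on the description above rather than a verbatim excerpt of the library.

```python
# Minimal sketch of benchmarking an embedding model with MTEB.
# Assumes the `mteb` and `sentence-transformers` packages are installed;
# the chosen model and tasks are illustrative examples.
from mteb import MTEB
from sentence_transformers import SentenceTransformer

# Any model exposing encode(list_of_texts) -> array of vectors can be plugged in.
model = SentenceTransformer("all-MiniLM-L6-v2")

evaluation = MTEB(tasks=["Banking77Classification", "STSBenchmark"])
evaluation.run(model, output_folder="results/all-MiniLM-L6-v2")
```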
[Figure 1: overview diagram of the Massive Text Embedding Benchmark, listing the 8 tasks (bitext mining, classification, clustering, pair classification, reranking, retrieval, STS, summarization) and their 58 datasets.]

Figure 1: An overview of tasks and datasets in MTEB. Multilingual datasets are marked with a purple shade.

3.2 Tasks and Evaluation

Figure 1 provides an overview of tasks and datasets available in MTEB. Dataset statistics are available in Table 2. The benchmark consists of the following 8 task types:

Bitext Mining Inputs are two sets of sentences from two different languages. For each sentence in the first set, the best match in the second set needs to be found. The matches are commonly translations. The provided model is used to embed each sentence and the closest pairs are found via cosine similarity. F1 serves as the main metric for bitext mining. Accuracy, precision and recall are also computed.

Classification A train and a test set are embedded with the provided model. The train set embeddings are used to train a logistic regression classifier with 100 maximum iterations, which is scored on the test set. The main metric is accuracy, with average precision and F1 additionally provided.
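A sketch of this classification protocol, assuming the train and test embeddings have already been computed as arrays; scikit-learn (Pedregosa et al., 2011) is used here as it is for clustering, and the F1 averaging scheme is an assumption for illustration.

```python
# Sketch of the classification evaluation: a logistic regression classifier
# with at most 100 iterations is fit on train-set embeddings and scored on
# the test set. X_train / X_test are (n_samples, dim) embedding matrices.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score

def classification_score(X_train, y_train, X_test, y_test):
    clf = LogisticRegression(max_iter=100)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    return {
        "accuracy": accuracy_score(y_test, y_pred),       # main metric
        "f1": f1_score(y_test, y_pred, average="macro"),  # averaging is an assumption
    }
```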

Clustering Given a set of sentences or paragraphs, the goal is to group them into meaningful clusters. A mini-batch k-means model with batch size 32 and k equal to the number of different labels (Pedregosa et al., 2011) is trained on the embedded texts. The model is scored using v-measure (Rosenberg and Hirschberg, 2007). V-measure does not depend on the cluster label, thus the permutation of labels does not affect the score.
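The clustering protocol can be sketched as follows, assuming the texts have already been embedded into an array and the gold categories are available; mini-batch k-means and v-measure come from scikit-learn.

```python
# Sketch of the clustering evaluation: mini-batch k-means with batch size 32
# and k set to the number of gold labels, scored with v-measure.
from sklearn.cluster import MiniBatchKMeans
from sklearn.metrics import v_measure_score

def clustering_score(embeddings, labels):
    k = len(set(labels))
    km = MiniBatchKMeans(n_clusters=k, batch_size=32)
    predicted = km.fit_predict(embeddings)
    # v-measure is invariant to permutations of the predicted cluster ids
    return v_measure_score(labels, predicted)
```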
Pair Classification A pair of text inputs is provided and a label needs to be assigned. Labels are typically binary variables denoting duplicate or paraphrase pairs. The two texts are embedded and their distance is computed with various metrics (cosine similarity, dot product, euclidean distance, manhattan distance). Using the best binary threshold, accuracy, average precision, F1, precision and recall are computed. The average precision score based on cosine similarity is the main metric.

Reranking Inputs are a query and a list of relevant and irrelevant reference texts. The aim is to rank the results according to their relevance to the query. The model is used to embed the references, which are then compared to the query using cosine similarity. The resulting ranking is scored for each query and averaged across all queries. Metrics are mean MRR@k and MAP, with the latter being the main metric.
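As an illustration of the per-query reranking score, here is a simplified sketch assuming the query and reference embeddings are L2-normalized NumPy vectors; the helper name is ours, not the benchmark's code, and MAP is obtained by averaging this value over all queries.

```python
# Sketch of scoring one reranking query: references are ranked by cosine
# similarity to the query and average precision is computed over the
# binary relevance labels.
import numpy as np

def average_precision(query_emb, ref_embs, is_relevant):
    # cosine similarity, assuming the embeddings are L2-normalized
    sims = ref_embs @ query_emb
    order = np.argsort(-sims)
    relevant = np.asarray(is_relevant)[order]
    hits = np.cumsum(relevant)
    precision_at_k = hits / (np.arange(len(relevant)) + 1)
    return float((precision_at_k * relevant).sum() / max(relevant.sum(), 1))
```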
[Figure 2: heat map of pairwise similarities between the 56 English MTEB datasets, with datasets on both axes.]

Figure 2: Similarity of MTEB datasets. We use the best model on MTEB STS (ST5-XXL, see Table 1) to embed 100 samples for each dataset. Cosine similarities between the averaged embeddings are computed and visualized.

Retrieval Each dataset consists of a corpus, queries and a mapping for each query to relevant documents from the corpus. The aim is to find these relevant documents. The provided model is used to embed all queries and all corpus documents, and similarity scores are computed using cosine similarity. After ranking the corpus documents for each query based on the scores, nDCG@k, MRR@k, MAP@k, precision@k and recall@k are computed for several values of k. nDCG@10 serves as the main metric. MTEB reuses datasets and evaluation from BEIR (Thakur et al., 2021).

Semantic Textual Similarity (STS) Given a sentence pair, the aim is to determine their similarity. Labels are continuous scores with higher numbers indicating more similar sentences. The provided model is used to embed the sentences and their similarity is computed using various distance metrics. Distances are benchmarked against ground truth similarities using Pearson and Spearman correlations. Spearman correlation based on cosine similarity serves as the main metric (Reimers et al., 2016).
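A sketch of the STS scoring just described, assuming two aligned embedding matrices for the sentence pairs and a list of gold similarity scores; the function name is illustrative.

```python
# Sketch of the STS evaluation: cosine similarities between paired sentence
# embeddings are correlated with the gold similarity scores.
import numpy as np
from scipy.stats import pearsonr, spearmanr

def sts_score(emb1, emb2, gold_scores):
    # cosine similarity of each sentence pair
    sims = np.sum(emb1 * emb2, axis=1) / (
        np.linalg.norm(emb1, axis=1) * np.linalg.norm(emb2, axis=1)
    )
    return {
        "spearman": spearmanr(sims, gold_scores).correlation,  # main metric
        "pearson": pearsonr(sims, gold_scores)[0],
    }
```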
Summarization A set of human-written and machine-generated summaries is provided. The aim is to score the machine summaries. The provided model is first used to embed all summaries. For each machine summary embedding, distances to all human summary embeddings are computed. The closest score (e.g. highest cosine similarity) is kept and used as the model's score of a single machine-generated summary. Pearson and Spearman correlations with ground truth human assessments of the machine-generated summaries are computed. As for STS, Spearman correlation based on cosine similarity serves as the main metric (Reimers et al., 2016).

3.3 Datasets

To further the diversity of MTEB, datasets of varying text lengths are included. All datasets are grouped into three categories:

Sentence to sentence (S2S) A sentence is compared with another sentence. Examples of S2S tasks are all current STS tasks in MTEB, where the similarity between two sentences is assessed.
Class. Clust. PairClass. Rerank. Retr. STS Summ. Avg.
Num. Datasets (→) 12 11 3 4 15 10 1 56
Self-supervised methods
Glove 57.29 27.73 70.92 43.29 21.62 61.85 28.87 41.97
Komninos 57.65 26.57 72.94 44.75 21.22 62.47 30.49 42.06
BERT 61.66 30.12 56.33 43.44 10.59 54.36 29.82 38.33
SimCSE-BERT-unsup 62.50 29.04 70.33 46.47 20.29 74.33 31.15 45.45
Supervised methods
SimCSE-BERT-sup 67.32 33.43 73.68 47.54 21.82 79.12 23.31 48.72
coCondenser-msmarco 64.71 37.64 81.74 51.84 32.96 76.47 29.50 52.35
Contriever 66.68 41.10 82.53 53.14 41.88 76.51 30.36 56.00
SPECTER 52.37 34.06 61.37 48.10 15.88 61.02 27.66 40.28
LaBSE 62.71 29.55 78.87 48.42 18.99 70.80 31.05 45.21
LASER2 53.65 15.28 68.86 41.44 7.93 55.32 26.80 33.63
MiniLM-L6 63.06 42.35 82.37 58.04 41.95 78.90 30.81 56.26
MiniLM-L12 63.21 41.81 82.41 58.44 42.69 79.80 27.90 56.53
MiniLM-L12-multilingual 64.30 37.14 78.45 53.62 32.45 78.92 30.67 52.44
MPNet 65.07 43.69 83.04 59.36 43.81 80.28 27.49 57.78
MPNet-multilingual 67.91 38.40 80.81 53.80 35.34 80.73 31.57 54.71
OpenAI Ada Similarity 70.44 37.52 76.86 49.02 18.36 78.60 26.94 49.52
SGPT-125M-nli 61.46 30.95 71.78 47.56 20.90 74.71 30.26 45.97
SGPT-5.8B-nli 70.14 36.98 77.03 52.33 32.34 80.53 30.38 53.74
SGPT-125M-msmarco 60.72 35.79 75.23 50.58 37.04 73.41 28.90 51.23
SGPT-1.3B-msmarco 66.52 39.92 79.58 54.00 44.49 75.74 25.44 56.11
SGPT-2.7B-msmarco 67.13 39.83 80.65 54.67 46.54 76.83 27.87 57.12
SGPT-5.8B-msmarco 68.13 40.35 82.00 56.56 50.25 78.10 24.75 58.81
SGPT-BLOOM-7.1B-msmarco 66.19 38.93 81.90 55.65 48.21 77.74 24.99 57.44
GTR-Base 65.25 38.63 83.85 54.23 44.67 77.07 29.67 56.19
GTR-Large 67.14 41.60 85.33 55.36 47.42 78.19 29.50 58.28
GTR-XL 67.11 41.51 86.13 55.96 47.96 77.80 30.21 58.42
GTR-XXL 67.41 42.42 86.12 56.65 48.48 78.38 30.64 58.97
ST5-Base 69.81 40.21 85.17 53.09 33.63 81.14 31.39 55.27
ST5-Large 72.31 41.65 84.97 54.00 36.71 81.83 29.64 57.06
ST5-XL 72.84 42.34 86.06 54.71 38.47 81.66 29.91 57.87
ST5-XXL 73.42 43.71 85.06 56.43 42.24 82.63 30.08 59.51

Table 1: Average of the main metric (see Section 3.2) per task per model on MTEB English subsets.

Paragraph to paragraph (P2P) A paragraph is compared with another paragraph. MTEB imposes no limit on the input length, leaving it up to the models to truncate if necessary. Several clustering tasks are framed as both S2S and P2P tasks. The former only compare titles, while the latter include both title and content. For ArxivClustering, for example, abstracts are concatenated to the title in the P2P setting.

Sentence to paragraph (S2P) A few retrieval datasets are mixed in an S2P setting. Here a query is a single sentence, while documents are long paragraphs consisting of multiple sentences.

Similarities across the 56 MTEB datasets are visualized in Figure 2; a sketch of the underlying computation follows below. Several datasets rely on the same corpora, such as ClimateFEVER and FEVER, resulting in a score of 1. Clusters of similar datasets can be seen among CQADupstack variations and STS datasets. S2S and P2P variations of the same dataset tend to also be similar. Scientific datasets, such as SciDocsRR, SciFact and ArxivClustering, show high similarities among each other even when coming from different tasks (Reranking, Retrieval and Clustering in this case).
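The Figure 2 computation can be outlined as follows; this is a sketch under the assumption that dataset_samples maps dataset names to 100 sampled texts each and that model.encode returns one vector per text (both names are ours).

```python
# Sketch of the Figure 2 computation: embed 100 samples per dataset,
# average them, and compare datasets via cosine similarity of the means.
import numpy as np

def dataset_similarity(dataset_samples, model):
    # dataset_samples: dict mapping dataset name -> list of sampled texts
    names = list(dataset_samples)
    means = np.stack([
        np.asarray(model.encode(texts)).mean(axis=0)
        for texts in dataset_samples.values()
    ])
    means /= np.linalg.norm(means, axis=1, keepdims=True)
    return names, means @ means.T  # pairwise cosine similarity matrix
```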


4 Results

4.1 Models

We evaluate on the test splits of all datasets except for MSMARCO, where the dev split is used following Thakur et al. (2021). We benchmark models claiming state-of-the-art results on various embedding tasks, leading to a high representation of transformers (Vaswani et al., 2017). We group models into self-supervised and supervised methods.
[Figure 3: six panels plotting average performance against model parameters for the GTR, ST5 and SGPT model families, one panel per task type (Classification, Clustering, Pair Classification, Reranking, Retrieval, STS).]

Figure 3: MTEB performance scales with model size. The smallest SGPT variant underperforms similar-sized GTR and ST5 variants. This may be due to the bias-only fine-tuning SGPT employs, which catches up with full fine-tuning only as model size and thus the number of bias parameters increases (Muennighoff, 2022).

Self-supervised methods (a) Transformer-based: BERT (Devlin et al., 2018) is trained using self-supervised mask and sentence prediction tasks. By taking the mean across the sequence length (mean pooling), the model can directly be used to produce text embeddings. SimCSE-Unsup (Gao et al., 2021b) uses BERT as a foundation and performs additional self-supervised training. (b) Non-transformer: Komninos (Komninos and Manandhar, 2016) and Glove (Pennington et al., 2014) are two word embedding models that directly map words to vectors. Hence, their embeddings lack context awareness, but provide significant speed-ups.

Supervised methods The original transformer model (Vaswani et al., 2017) consists of an encoder and decoder network. Subsequent transformers often train only encoders like BERT (Devlin et al., 2018) or decoders like GPT (Radford et al., 2019).

(a) Transformer encoder methods coCondenser (Gao and Callan, 2021), Contriever (Izacard et al., 2021), LaBSE (Feng et al., 2020) and SimCSE-BERT-sup (Gao et al., 2021b) are based on the pre-trained BERT model (Devlin et al., 2018). coCondenser and Contriever add a self-supervised stage prior to supervised fine-tuning for a total of three training stages. LaBSE uses BERT to perform additional pre-training on parallel data to produce a competitive bitext mining model. SPECTER (Cohan et al., 2020a) relies on the pre-trained SciBERT (Beltagy et al., 2019) variant instead and fine-tunes on citation graphs. GTR (Ni et al., 2021b) and ST5 (Ni et al., 2021a) are based on the encoder part of the T5 model (Raffel et al., 2020) and only differ in their fine-tuning datasets. After additional self-supervised training, ST5 does contrastive fine-tuning on NLI (Ni et al., 2021a; Gao et al., 2021b), being geared towards STS tasks. Meanwhile, GTR fine-tunes on MSMARCO and focuses on retrieval tasks. MPNet and MiniLM correspond to fine-tuned embedding models (Reimers and Gurevych, 2019) of the pre-trained MPNet (Song et al., 2020) and MiniLM (Wang et al., 2020) models using diverse datasets to target any embedding use case.

(b) Transformer decoder methods SGPT Bi-Encoders (Muennighoff, 2022) perform contrastive fine-tuning of <0.1% of pre-trained parameters using weighted-mean pooling. Similar to ST5 and GTR, SGPT-nli models are geared towards STS, while SGPT-msmarco models are geared towards retrieval. SGPT-msmarco models embed queries and documents for retrieval with different special tokens to help the model distinguish their role. For non-retrieval tasks, we use its query representations. We benchmark publicly available SGPT models based on GPT-NeoX (Andonian et al., 2021), GPT-J (Wang and Komatsuzaki, 2021) and BLOOM (Scao et al., 2022). Alternatively, cpt-text (Neelakantan et al., 2022) passes pre-trained GPT decoders through a two-stage process using last token pooling to provide embeddings from decoders. We benchmark their models via the OpenAI Embeddings API⁴.

(c) Non-transformer LASER (Heffernan et al., 2022) is the only context-aware non-transformer model we benchmark, relying on an LSTM (Hochreiter and Schmidhuber, 1997) instead. Similar to LaBSE, the model trains on parallel data and focuses on bitext mining applications.

4 https://beta.openai.com/docs/guides/embeddings
[Figure 4: scatter plot of MTEB score versus speed (examples per second, log scale) for all benchmarked models, grouped by base architecture (word embeddings, GPT, MiniLM, MPNet, T5, BERT, SciBERT, LASER); circle size indicates the size of the produced embeddings.]

Figure 4: Performance, speed, and size of produced embeddings (size of the circles) of different embedding models. Embedding sizes range from 1.2 kB (Glove / Komninos) to 16.4 kB (SGPT-5.8B) per example. Speed was benchmarked on STS15 using 1x Nvidia A100 80GB with CUDA 11.6.

4.2 Analysis

Based on the results in Table 1, we observe that there is considerable variability between tasks. No model claims the state-of-the-art in all seven English tasks. There is even more variability in the results per dataset present in the appendix. Further, there remains a large gap between self-supervised and supervised methods. Self-supervised large language models have been able to close this gap in many natural language generation tasks (Chowdhery et al., 2022). However, they appear to still require supervised fine-tuning for competitive embedding performance.

We find that performance strongly correlates with model size, see Figure 3. A majority of MTEB tasks are dominated by multi-billion parameter models. However, these come at a significant cost, as we investigate in Section 4.3.

Classification ST5 models dominate the classification task across most datasets, as can be seen in detail in the full results in the appendix. ST5-XXL has the highest average performance, 3% ahead of the best non-ST5 model, OpenAI Ada Similarity.

Clustering Despite being almost 50x smaller, the MPNet embedding model is on par with the ST5-XXL state-of-the-art on clustering. This may be due to the large variety of datasets MPNet (and MiniLM) has been fine-tuned on. Clustering requires coherent distances between a large number of embeddings. Models like SimCSE-sup or SGPT-nli, which are only fine-tuned on a single dataset, NLI, may produce incoherent embeddings when encountering topics unseen during fine-tuning. Relatedly, we find that the query embeddings of SGPT-msmarco and the Ada Search endpoint are competitive with SGPT-nli and the Ada Similarity endpoint, respectively. We refer to the public leaderboard⁵ for Ada Search results. This could be due to the MSMARCO dataset being significantly larger than NLI. Thus, while the OpenAI docs recommend using the similarity embeddings for clustering use cases⁶, the retrieval query embeddings may be the better choice in some cases.

5 https://huggingface.co/spaces/mteb/leaderboard
6 https://beta.openai.com/docs/guides/embeddings/similarity-embeddings
[Figure 5: three panels of multilingual results for LaBSE, LASER2, MiniLM-L12-multilingual, MPNet-multilingual and SGPT-BLOOM-7.1B-msmarco: (a) bitext mining F1 on Tatoeba across languages, (b) multilingual classification accuracy, (c) multi- and cross-lingual STS (Spearman correlation of cosine similarities).]

Figure 5: MTEB multilingual performance. Bitext mining is dominated by LaBSE, while classification and STS results are mixed. SGPT-BLOOM-7B1-msmarco tends to perform well on the languages BLOOM has been pre-trained on, such as Chinese, French and Portuguese.

Pair Classification GTR-XL and GTR-XXL have the strongest performance. Pair classification is closest to STS in its framing, yet models rank significantly differently on the two tasks. This highlights the importance of benchmarking on a diverse set of tasks to avoid blindly reusing a model for a different task.

Reranking MPNet and MiniLM models perform strongly on reranking tasks. On SciDocsRR (Cohan et al., 2020a) they perform far better than bigger models, which is likely due to parts of SciDocsRR being included in their training data. Our scale of experiments and that of model pre-training make controlling for data contamination challenging. Thus, we ignore overlap of MTEB datasets with model training datasets in MTEB scores. As long as enough datasets are averaged, we believe these effects to be insignificant.

Retrieval SGPT-5.8B-msmarco is the best embedding model on the BEIR subset in MTEB as well as on the full BEIR benchmark (Thakur et al., 2021; Muennighoff, 2022). The even larger 7.1B SGPT model making use of BLOOM (Scao et al., 2022) performs significantly weaker, which is likely due to the multilinguality of BLOOM. Models geared towards STS (SimCSE, ST5, SGPT-nli) perform badly on retrieval tasks. Retrieval tasks are unique in that there are two distinct types of texts: queries and documents (“asymmetric”), while other tasks only have a single type of text (“symmetric”). On the QuoraRetrieval dataset, which has been shown to be largely symmetric (Muennighoff, 2022), the playing field is more even, with SGPT-5.8B-nli outperforming SGPT-5.8B-msmarco, see Table 11.

STS & Summarization Retrieval models (GTR, SGPT-msmarco) perform badly on STS, while ST5-XXL has the highest performance. This highlights the bifurcation of the field into separate embedding models for retrieval (asymmetric) and similarity (symmetric) use cases (Muennighoff, 2022).

4.3 Efficiency

We investigate the latency-performance trade-off of models in Figure 4. The graph allows for significant elimination of model candidates in the model selection process. It brings model selection down to three clusters:

Maximum speed Word embedding models offer maximum speed, with Glove taking the lead on both performance and speed, thus making the choice simple in this case.

Maximum performance If latency is less important than performance, the left-hand side of the graph offers a cluster of highly performant but slow models. Depending on the task at hand, GTR-XXL, ST5-XXL or SGPT-5.8B may be the right choice, see Section 4.2. SGPT-5.8B comes with the additional caveat of its high-dimensional embeddings requiring more storage.

Speed and performance The fine-tuned MPNet and MiniLM models lead the middle cluster, making the choice easy.
4.4 Multilinguality

MTEB comes with 10 multilingual datasets across bitext mining, classification and STS tasks. We investigate performance on these in Figure 5. Tabular results can be found in Tables 12, 13 and 14.

Bitext Mining LaBSE (Feng et al., 2020) performs strongly across a wide array of languages in bitext mining. Meanwhile, LASER2 shows high variance across different languages. While there are additional language-specific LASER2 models available for some of the languages we benchmark, we use the default multilingual LASER2 model for all languages. This is to provide a fair one-to-one comparison of models. In practice, however, the high variance of LASER2's performance may be resolved by mixing its model variants. MPNet, MiniLM and SGPT-BLOOM-7B1-msmarco perform poorly on languages they have not been pre-trained on, such as German for the latter.

Classification & STS On multilingual classification and STS, the multilingual MPNet provides the overall strongest performance. It outperforms the slightly faster multilingual MiniLM on almost all languages. Both models have been trained on the same languages, thus bringing decision-making down to performance versus speed. SGPT-BLOOM-7B1-msmarco provides state-of-the-art performance on languages like Hindi, Portuguese, Chinese or French, which the model has seen extensively during pre-training. It also performs competitively on languages like Russian or Japanese that unintentionally leaked into its pre-training data (Muennighoff et al., 2022). However, it is not much ahead of the much cheaper MPNet. LASER2 performs consistently worse than other models.

5 Conclusion

In this work, we presented the Massive Text Embedding Benchmark (MTEB). Consisting of 8 text embedding tasks with up to 15 datasets each and covering 112 languages, MTEB aims to provide reliable embedding performance estimates. By open-sourcing MTEB alongside a leaderboard, we provide a foundation for further pushing the state-of-the-art of available text embeddings.

To introduce MTEB, we have conducted the most comprehensive benchmarking of text embeddings to date. Through the course of close to 5,000 experiments on over 30 different models, we have set up solid baselines for future research to build on. We found model performance on different tasks to vary strongly, with no model claiming state-of-the-art on all tasks. Our studies on scaling behavior, model efficiency and multilinguality revealed various intricacies of models that should ease the decision-making process for future research or industry applications of text embeddings.

We welcome task, dataset or metric contributions to the MTEB codebase⁷ as well as additions to the leaderboard via our automatic submission format⁸.

7 https://github.com/embeddings-benchmark/mteb
8 https://huggingface.co/spaces/mteb/leaderboard
Acknowledgments

This work was granted access to the HPC resources of Institut du développement et des ressources en informatique scientifique (IDRIS) du Centre national de la recherche scientifique (CNRS) under the allocation 2021-A0101012475 made by Grand équipement national de calcul intensif (GENCI). In particular, all the evaluations and data processing ran on the Jean Zay cluster of IDRIS, and we want to thank the IDRIS team for responsive support throughout the project, in particular Rémi Lacroix. We thank Douwe Kiela, Teven Le Scao and Nandan Thakur for feedback and suggestions.

References

Charu C Aggarwal and ChengXiang Zhai. 2012. A survey of text clustering algorithms. In Mining Text Data, pages 77–128. Springer.

Eneko Agirre, Carmen Banea, Claire Cardie, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre, Weiwei Guo, Inigo Lopez-Gazpio, Montse Maritxalar, Rada Mihalcea, et al. 2015. SemEval-2015 task 2: Semantic textual similarity, English, Spanish and pilot on interpretability. In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), pages 252–263.

Eneko Agirre, Carmen Banea, Claire Cardie, Daniel M Cer, Mona T Diab, Aitor Gonzalez-Agirre, Weiwei Guo, Rada Mihalcea, German Rigau, and Janyce Wiebe. 2014. SemEval-2014 task 10: Multilingual semantic textual similarity. In SemEval@COLING, pages 81–91.

Eneko Agirre, Carmen Banea, Daniel Cer, Mona Diab, Aitor Gonzalez Agirre, Rada Mihalcea, German Rigau Claramunt, and Janyce Wiebe. 2016. SemEval-2016 task 1: Semantic textual similarity, monolingual and cross-lingual evaluation. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pages 497–511. Association for Computational Linguistics.

Eneko Agirre, Daniel Cer, Mona Diab, and Aitor Gonzalez-Agirre. 2012. SemEval-2012 task 6: A pilot on semantic textual similarity. In *SEM 2012: The First Joint Conference on Lexical and Computational Semantics – Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation (SemEval 2012), pages 385–393.

Eneko Agirre, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre, and Weiwei Guo. 2013. *SEM 2013 shared task: Semantic textual similarity. In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 1: Proceedings of the Main Conference and the Shared Task: Semantic Textual Similarity, pages 32–43.

Loubna Ben Allal, Raymond Li, Denis Kocetkov, Chenghao Mou, Christopher Akiki, Carlos Munoz Ferrandis, Niklas Muennighoff, Mayank Mishra, Alex Gu, Manan Dey, et al. 2023. SantaCoder: don't reach for the stars! arXiv preprint arXiv:2301.03988.

Alex Andonian, Quentin Anthony, Stella Biderman, Sid Black, Preetham Gali, Leo Gao, Eric Hallahan, Josh Levy-Kramer, Connor Leahy, Lucas Nestler, Kip Parker, Michael Pieler, Shivanshu Purohit, Tri Songz, Phil Wang, and Samuel Weinbach. 2021. GPT-NeoX: Large scale autoregressive language modeling in PyTorch.

Dimo Angelov. 2020. Top2Vec: Distributed representations of topics. arXiv preprint arXiv:2008.09470.

Akari Asai, Jungo Kasai, Jonathan H Clark, Kenton Lee, Eunsol Choi, and Hannaneh Hajishirzi. 2020. XOR QA: Cross-lingual open-retrieval question answering. arXiv preprint arXiv:2010.11856.

Iz Beltagy, Kyle Lo, and Arman Cohan. 2019. SciBERT: A pretrained language model for scientific text. arXiv preprint arXiv:1903.10676.

Sebastian Borgeaud, Arthur Mensch, Jordan Hoffmann, Trevor Cai, Eliza Rutherford, Katie Millican, George Bm Van Den Driessche, Jean-Baptiste Lespiau, Bogdan Damoc, Aidan Clark, et al. 2022. Improving language models by retrieving from trillions of tokens. In International Conference on Machine Learning, pages 2206–2240. PMLR.

Micael Carvalho, Rémi Cadène, David Picard, Laure Soulier, Nicolas Thome, and Matthieu Cord. 2018. Cross-modal retrieval in the cooking context: Learning semantic text-image embeddings. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, pages 35–44.

Iñigo Casanueva, Tadas Temčinas, Daniela Gerz, Matthew Henderson, and Ivan Vulić. 2020. Efficient intent detection with dual sentence encoders.

Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. 2022. PaLM: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311.

Jonathan H Clark, Eunsol Choi, Michael Collins, Dan Garrette, Tom Kwiatkowski, Vitaly Nikolaev, and Jennimaria Palomaki. 2020. TyDi QA: A benchmark for information-seeking question answering in typologically diverse languages. Transactions of the Association for Computational Linguistics, 8:454–470.
Arman Cohan, Sergey Feldman, Iz Beltagy, Doug Downey, and Daniel S Weld. 2020a. SPECTER: Document-level representation learning using citation-informed transformers. arXiv preprint arXiv:2004.07180.

Arman Cohan, Sergey Feldman, Iz Beltagy, Doug Downey, and Daniel S. Weld. 2020b. SPECTER: Document-level representation learning using citation-informed transformers.

Alexis Conneau and Douwe Kiela. 2018. SentEval: An evaluation toolkit for universal sentence representations. arXiv preprint arXiv:1803.05449.

Alexis Conneau, Douwe Kiela, Holger Schwenk, Loic Barrault, and Antoine Bordes. 2017. Supervised learning of universal sentence representations from natural language inference data. arXiv preprint arXiv:1705.02364.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Alexander R. Fabbri, Wojciech Kryściński, Bryan McCann, Caiming Xiong, Richard Socher, and Dragomir Radev. 2020. SummEval: Re-evaluating summarization evaluation.

Fangxiaoyu Feng, Yinfei Yang, Daniel Cer, Naveen Arivazhagan, and Wei Wang. 2020. Language-agnostic BERT sentence embedding. arXiv preprint arXiv:2007.01852.

Jack FitzGerald, Christopher Hench, Charith Peris, Scott Mackie, Kay Rottmann, Ana Sanchez, Aaron Nash, Liam Urbach, Vishesh Kakarala, Richa Singh, Swetha Ranganath, Laurie Crist, Misha Britan, Wouter Leeuwis, Gokhan Tur, and Prem Natarajan. 2022. MASSIVE: A 1M-example multilingual natural language understanding dataset with 51 typologically-diverse languages.

Leo Gao, Jonathan Tow, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Kyle McDonell, Niklas Muennighoff, et al. 2021a. A framework for few-shot language model evaluation. Version v0.0.1, September.

Luyu Gao and Jamie Callan. 2021. Unsupervised corpus aware language model pre-training for dense passage retrieval. arXiv preprint arXiv:2108.05540.

Tianyu Gao, Xingcheng Yao, and Danqi Chen. 2021b. SimCSE: Simple contrastive learning of sentence embeddings. arXiv preprint arXiv:2104.08821.

Gregor Geigle, Nils Reimers, Andreas Rücklé, and Iryna Gurevych. 2021. TWEAC: Transformer with extendable QA agent classifiers.

Kevin Heffernan, Onur Çelebi, and Holger Schwenk. 2022. Bitext mining using distilled sentence representations for low-resource languages. arXiv preprint arXiv:2205.12654.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Jui-Ting Huang, Ashish Sharma, Shuying Sun, Li Xia, David Zhang, Philip Pronin, Janani Padmanabhan, Giuseppe Ottaviano, and Linjun Yang. 2020. Embedding-based retrieval in Facebook search. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 2553–2561.

Hamel Husain, Ho-Hsiang Wu, Tiferet Gazit, Miltiadis Allamanis, and Marc Brockschmidt. 2019. CodeSearchNet challenge: Evaluating the state of semantic code search. arXiv preprint arXiv:1909.09436.

Gautier Izacard, Mathilde Caron, Lucas Hosseini, Sebastian Riedel, Piotr Bojanowski, Armand Joulin, and Edouard Grave. 2021. Towards unsupervised dense information retrieval with contrastive learning. arXiv preprint arXiv:2112.09118.

Alexandros Komninos and Suresh Manandhar. 2016. Dependency based embeddings for sentence classification tasks. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1490–1500.

Wuwei Lan, Siyu Qiu, Hua He, and Wei Xu. 2017. A continuously growing dataset of sentential paraphrases. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1235–1245. Association for Computational Linguistics.

Quentin Lhoest, Albert Villanova del Moral, Yacine Jernite, Abhishek Thakur, Patrick von Platen, Suraj Patil, Julien Chaumond, Mariama Drame, Julien Plu, Lewis Tunstall, et al. 2021. Datasets: A community library for natural language processing. arXiv preprint arXiv:2109.02846.

Haoran Li, Abhinav Arora, Shuohui Chen, Anchit Gupta, Sonal Gupta, and Yashar Mehdad. 2020. MTOP: A comprehensive multilingual task-oriented semantic parsing benchmark.

Xueqing Liu, Chi Wang, Yue Leng, and ChengXiang Zhai. 2018. LinkSO: A dataset for learning to retrieve similar question answer pairs on software development forums. In Proceedings of the 4th ACM SIGSOFT International Workshop on NLP for Software Engineering, pages 2–5.

Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. 2011. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 142–150, Portland, Oregon, USA. Association for Computational Linguistics.
Julian McAuley and Jure Leskovec. 2013. Hidden factors and hidden topics: Understanding rating dimensions with review text. In RecSys '13, New York, NY, USA. Association for Computing Machinery.

Niklas Muennighoff. 2020. Vilio: State-of-the-art visio-linguistic models applied to hateful memes. arXiv preprint arXiv:2012.07788.

Niklas Muennighoff. 2022. SGPT: GPT sentence embeddings for semantic search. arXiv preprint arXiv:2202.08904.

Niklas Muennighoff, Thomas Wang, Lintang Sutawika, Adam Roberts, Stella Biderman, Teven Le Scao, M Saiful Bari, Sheng Shen, Zheng-Xin Yong, Hailey Schoelkopf, et al. 2022. Crosslingual generalization through multitask finetuning. arXiv preprint arXiv:2211.01786.

Pandu Nayak. 2019. Understanding searches better than ever before.

Arvind Neelakantan, Tao Xu, Raul Puri, Alec Radford, Jesse Michael Han, Jerry Tworek, Qiming Yuan, Nikolas Tezak, Jong Wook Kim, Chris Hallacy, et al. 2022. Text and code embeddings by contrastive pre-training. arXiv preprint arXiv:2201.10005.

Jianmo Ni, Gustavo Hernández Ábrego, Noah Constant, Ji Ma, Keith B Hall, Daniel Cer, and Yinfei Yang. 2021a. Sentence-T5: Scalable sentence encoders from pre-trained text-to-text models. arXiv preprint arXiv:2108.08877.

Jianmo Ni, Chen Qu, Jing Lu, Zhuyun Dai, Gustavo Hernández Ábrego, Ji Ma, Vincent Y Zhao, Yi Luan, Keith B Hall, Ming-Wei Chang, et al. 2021b. Large dual encoders are generalizable retrievers. arXiv preprint arXiv:2112.07899.

Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. 2021. GLIDE: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741.

James O'Neill, Polina Rozenshtein, Ryuichi Kiryo, Motoko Kubota, and Danushka Bollegala. 2021. I wish I would have loved this one, but I didn't – a multilingual dataset for counterfactual detection in product reviews.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.

Jeffrey Pennington, Richard Socher, and Christopher D Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. 2019. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J Liu, et al. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67.

Nils Reimers, Philip Beyer, and Iryna Gurevych. 2016. Task-oriented intrinsic evaluation of semantic textual similarity. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 87–96.

Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence embeddings using siamese BERT-networks. arXiv preprint arXiv:1908.10084.

Facebook Research. Tatoeba multilingual test set.

Andrew Rosenberg and Julia Hirschberg. 2007. V-measure: A conditional entropy-based external cluster evaluation measure. Pages 410–420.

Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S Sara Mahdavi, Rapha Gontijo Lopes, et al. 2022. Photorealistic text-to-image diffusion models with deep language understanding. arXiv preprint arXiv:2205.11487.

Elvis Saravia, Hsien-Chi Toby Liu, Yen-Hao Huang, Junlin Wu, and Yi-Shin Chen. 2018. CARER: Contextualized affect representations for emotion recognition. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3687–3697, Brussels, Belgium. Association for Computational Linguistics.

Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilić, Daniel Hesslow, Roman Castagné, Alexandra Sasha Luccioni, François Yvon, Matthias Gallé, et al. 2022. BLOOM: A 176B-parameter open-access multilingual language model. arXiv preprint arXiv:2211.05100.

Darsh Shah, Tao Lei, Alessandro Moschitti, Salvatore Romeo, and Preslav Nakov. 2018. Adversarial domain adaptation for duplicate question detection. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1056–1063, Brussels, Belgium. Association for Computational Linguistics.

Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu. 2020. MPNet: Masked and permuted pre-training for language understanding. Advances in Neural Information Processing Systems, 33:16857–16867.

Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, et al. 2022. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. arXiv preprint arXiv:2206.04615.

Hao Tan and Mohit Bansal. 2019. LXMERT: Learning cross-modality encoder representations from transformers. arXiv preprint arXiv:1908.07490.

Nandan Thakur, Nils Reimers, Andreas Rücklé, Abhishek Srivastava, and Iryna Gurevych. 2021. BEIR: A heterogenous benchmark for zero-shot evaluation of information retrieval models.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in Neural Information Processing Systems, 30.

Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2019. SuperGLUE: A stickier benchmark for general-purpose language understanding systems.

Fangzhao Wu, Ying Qiao, Jiun-Hung Chen, Chuhan Wu, Tao Qi, Jianxun Lian, Danyang Liu, Xing Xie, Jianfeng Gao, Winnie Wu, et al. 2020. MIND: A large-scale dataset for news recommendation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 3597–3606.

Wei Xu, Chris Callison-Burch, and William B Dolan. 2015. SemEval-2015 task 1: Paraphrase and semantic similarity in Twitter (PIT). In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), pages 1–11.

Xinyu Zhang, Nandan Thakur, Odunayo Ogundepo, Ehsan Kamalloo, David Alfonso-Hermelo, Xiaoguang Li, Qun Liu, Mehdi Rezagholizadeh, and Jimmy Lin. 2022. Making a MIRACL: Multilingual information retrieval across a continuum of languages. arXiv preprint arXiv:2210.09984.

Jeffrey Zhu, Mingqin Li, Jason Li, and Cassandra Oduola. 2021. Bing delivers more contextualized search using quantized transformer inference on
derstanding systems. Advances in neural informa- nvidia gpus in azure.
tion processing systems, 32.
Pierre Zweigenbaum, Serge Sharoff, and Reinhard
Alex Wang, Amanpreet Singh, Julian Michael, Felix Rapp. 2016. Towards preparation of the second bucc
Hill, Omer Levy, and Samuel R Bowman. 2018. shared task: Detecting parallel sentences in compa-
Glue: A multi-task benchmark and analysis platform rable corpora. In Proceedings of the Ninth Workshop
for natural language understanding. arXiv preprint on Building and Using Comparable Corpora. Euro-
arXiv:1804.07461. pean Language Resources Association (ELRA), Por-
toroz, Slovenia, pages 38–43.
Ben Wang and Aran Komatsuzaki. 2021. GPT-J-
6B: A 6 Billion Parameter Autoregressive Language Pierre Zweigenbaum, Serge Sharoff, and Reinhard
Model. https://github.com/kingoflol Rapp. 2017. Overview of the second bucc shared
z/mesh-transformer-jax. task: Spotting parallel sentences in comparable cor-
pora. In Proceedings of the 10th Workshop on Build-
Kexin Wang, Nils Reimers, and Iryna Gurevych. 2021. ing and Using Comparable Corpora, pages 60–67.
Tsdae: Using transformer-based sequential denois-
ing auto-encoder for unsupervised sentence embed- Pierre Zweigenbaum, Serge Sharoff, and Reinhard
ding learning. arXiv preprint arXiv:2104.06979. Rapp. 2018. Overview of the third bucc shared task:
Spotting parallel sentences in comparable corpora.
Wenhui Wang, Furu Wei, Li Dong, Hangbo Bao, Nan In Proceedings of 11th workshop on building and
Yang, and Ming Zhou. 2020. Minilm: Deep self- using comparable corpora, pages 39–42.
attention distillation for task-agnostic compression
of pre-trained transformers. Advances in Neural In-
formation Processing Systems, 33:5776–5788.

Samuel Weinbach, Marco Bellagente, Constantin


Eichenberg, Andrew Dai, Robert Baldock,
Souradeep Nanda, Björn Deiseroth, Koen Oost-
ermeijer, Hannah Teufel, and Andres Felipe
Cruz-Salinas. 2022. M-vader: A model for dif-
fusion with multimodal context. arXiv preprint
arXiv:2212.02936.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien


Chaumond, Clement Delangue, Anthony Moi, Pier-
ric Cistac, Tim Rault, Rémi Louf, Morgan Funtow-
icz, et al. 2020. Transformers: State-of-the-art nat-
ural language processing. In Proceedings of the
2020 conference on empirical methods in natural
language processing: system demonstrations, pages
38–45.
A Datasets

Table 2 provides a summary along with statistics of all MTEB tasks. In the following, we give a brief description of each dataset included in MTEB.

A.1 Clustering

ArxivClusteringS2S, ArxivClusteringP2P, BiorxivClusteringS2S, BiorxivClusteringP2P, MedrxivClusteringP2P, MedrxivClusteringS2S  These datasets are custom-made for MTEB using the public APIs from arXiv9 and bioRxiv/medRxiv10. For S2S datasets the input text is simply the title of the paper, while for P2P the input text is the concatenation of the title and the abstract. The cluster labels are generated from the categories given to the papers by humans. For bioRxiv and medRxiv this category is unique, but for arXiv multiple categories can be given to a single paper, so we only use the first one. For bioRxiv and medRxiv there is only one level of category (e.g. biochemistry, genetics, microbiology, etc.), hence we only perform clustering based on that label. For arXiv there is a main category and a secondary category: for example, "cs.AI" means the main category is Computer Science and the sub-category is AI, while "math.AG" means the main category is Mathematics and the sub-category is Algebraic Geometry. Hence, we create three types of splits:

(a) Main category clustering  Articles are clustered based only on the main category (Math, Physics, Computer Science etc.). This split evaluates the coarse clustering capacity of a model.

(b) Secondary category clustering within the same main category  Articles are clustered based on their secondary category, but within a given main category, for example only Math papers that need to be clustered into Algebraic Geometry, Functional Analysis, Numerical Analysis etc. This split evaluates the fine-grained clustering capacity of a model, as differentiating some sub-categories can be very difficult.

(c) Secondary category clustering  Articles are clustered based on their secondary category across all main categories, so the labels can be Number Theory, Computational Complexity, Astrophysics of Galaxies etc. These splits evaluate fine-grained clustering capacity as well as multi-scale capacity, i.e. whether a model is able to both separate Maths from Physics and Probability from Algebraic Topology at the same time.

For every dataset, split and strategy, we select subsets of all labels and then sample articles from those labels. This yields splits with a varying number and size of clusters.

RedditClustering (Geigle et al., 2021)  Clustering of titles from 199 subreddits. Clustering of 25 splits, each with 10-50 classes, and each class with 100-1000 sentences.

RedditClusteringP2P  Dataset created for MTEB using available data from Reddit posts11. The task consists of clustering the concatenation of title and post according to their subreddit. It contains 10 splits, with 10 to 100 clusters per split and 1,000 to 100,000 posts.

StackExchangeClustering (Geigle et al., 2021)  Clustering of titles from 121 StackExchanges. Clustering of 25 splits, each with 10-50 classes, and each class with 100-1000 sentences.

StackExchangeClusteringP2P  Dataset created for MTEB using available data from StackExchange posts12. The task consists of clustering the concatenation of title and post according to the StackExchange they were posted to. It contains 10 splits, with 10 to 100 clusters and 5,000 to 10,000 posts per split.

TwentyNewsgroupsClustering13  Clustering of the 20 Newsgroups dataset: given the title of an article, the goal is to find the newsgroup it was posted to (20 in total). Contains 10 splits, each with 20 classes, with each split containing between 1,000 and 10,000 titles.

A.2 Classification

AmazonCounterfactual (O'Neill et al., 2021)  A collection of Amazon customer reviews annotated for counterfactual detection pair classification. For each review the label is either "counterfactual" or "not-counterfactual". This is a multilingual dataset with 4 available languages.

9 https://arxiv.org/help/api/
10 https://api.biorxiv.org/
11 https://huggingface.co/datasets/sentence-transformers/reddit-title-body
12 https://huggingface.co/datasets/flax-sentence-embeddings/stackexchange_title_body_jsonl
13 https://scikit-learn.org/0.19/datasets/twenty_newsgroups.html
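To make the clustering split construction described in Appendix A.1 above concrete, the following is a minimal sketch of the label-subset sampling. The function name, field names and default ranges are illustrative assumptions rather than the exact MTEB build script.

```python
import random

def make_clustering_splits(records, n_splits=10, min_labels=10, max_labels=50,
                           min_per_label=100, max_per_label=1000, seed=42):
    """Sample label subsets, then articles per label, as described in A.1.

    `records` maps a category label (e.g. "math.AG") to the list of texts
    (titles for S2S, title + abstract for P2P) carrying that label.
    All names and default ranges are illustrative, not the exact MTEB script.
    """
    rng = random.Random(seed)
    splits = []
    for _ in range(n_splits):
        # 1) Pick a random subset of the available labels.
        n_labels = rng.randint(min(min_labels, len(records)),
                               min(max_labels, len(records)))
        labels = rng.sample(sorted(records), n_labels)
        sentences, assignments = [], []
        for label in labels:
            pool = records[label]
            # 2) Sample a varying number of articles for each chosen label.
            n_articles = min(len(pool), rng.randint(min_per_label, max_per_label))
            for text in rng.sample(pool, n_articles):
                sentences.append(text)
                assignments.append(label)
        splits.append({"sentences": sentences, "labels": assignments})
    return splits
```

Because a different label subset and a different number of articles are drawn for every split, the resulting splits naturally vary in both the number and the size of their clusters.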
Name  Type  Categ.  #Lang.  Train Samples  Dev Samples  Test Samples  Train avg. chars  Dev avg. chars  Test avg. chars

BUCC BitextMining s2s 4 0 0 641684 0 0 101.3


Tatoeba BitextMining s2s 112 0 0 2000 0 0 39.4
AmazonCounterfactualClassification Classification s2s 4 4018 335 670 107.3 109.2 106.1
AmazonPolarityClassification Classification p2p 1 3600000 0 400000 431.6 0 431.4
AmazonReviewsClassification Classification s2s 6 1200000 30000 30000 160.5 159.2 160.4
Banking77Classification Classification s2s 1 10003 0 3080 59.5 0 54.2
EmotionClassification Classification s2s 1 16000 2000 2000 96.8 95.3 96.6
ImdbClassification Classification p2p 1 25000 0 25000 1325.1 0 1293.8
MassiveIntentClassification Classification s2s 51 11514 2033 2974 35.0 34.8 34.6
MassiveScenarioClassification Classification s2s 51 11514 2033 2974 35.0 34.8 34.6
MTOPDomainClassification Classification s2s 6 15667 2235 4386 36.6 36.5 36.8
MTOPIntentClassification Classification s2s 6 15667 2235 4386 36.6 36.5 36.8
ToxicConversationsClassification Classification s2s 1 50000 0 50000 298.8 0 296.6
TweetSentimentExtractionClassification Classification s2s 1 27481 0 3534 68.3 0 67.8
ArxivClusteringP2P Clustering p2p 1 0 0 732723 0 0 1009.9
ArxivClusteringS2S Clustering s2s 1 0 0 732723 0 0 74.0
BiorxivClusteringP2P Clustering p2p 1 0 0 75000 0 0 1666.2
BiorxivClusteringS2S Clustering s2s 1 0 0 75000 0 0 101.6
MedrxivClusteringP2P Clustering p2p 1 0 0 37500 0 0 1981.2
MedrxivClusteringS2S Clustering s2s 1 0 0 37500 0 0 114.7
RedditClustering Clustering s2s 1 0 420464 420464 0 64.7 64.7
RedditClusteringP2P Clustering p2p 1 0 0 459399 0 0 727.7
StackExchangeClustering Clustering s2s 1 0 417060 373850 0 56.8 57.0
StackExchangeClusteringP2P Clustering p2p 1 0 0 75000 0 0 1090.7
TwentyNewsgroupsClustering Clustering s2s 1 0 0 59545 0 0 32.0
SprintDuplicateQuestions PairClassification s2s 1 0 101000 101000 0 65.2 67.9
TwitterSemEval2015 PairClassification s2s 1 0 0 16777 0 0 38.3
TwitterURLCorpus PairClassification s2s 1 0 0 51534 0 0 79.5
AskUbuntuDupQuestions Reranking s2s 1 0 0 2255 0 0 52.5
MindSmallReranking Reranking s2s 1 231530 0 107968 69.0 0 70.9
SciDocsRR Reranking s2s 1 0 19594 19599 0 69.4 69.0
StackOverflowDupQuestions Reranking s2s 1 23018 3467 3467 49.6 49.8 49.8
ArguAna Retrieval p2p 1 0 0 10080 0 0 1052.9
ClimateFEVER Retrieval s2p 1 0 0 5418128 0 0 539.1
CQADupstackAndroidRetrieval Retrieval s2p 1 0 0 23697 0 0 578.7
CQADupstackEnglishRetrieval Retrieval s2p 1 0 0 41791 0 0 467.1
CQADupstackGamingRetrieval Retrieval s2p 1 0 0 46896 0 0 474.7
CQADupstackGisRetrieval Retrieval s2p 1 0 0 38522 0 0 991.1
CQADupstackMathematicaRetrieval Retrieval s2p 1 0 0 17509 0 0 1103.7
CQADupstackPhysicsRetrieval Retrieval s2p 1 0 0 39355 0 0 799.4
CQADupstackProgrammersRetrieval Retrieval s2p 1 0 0 33052 0 0 1030.2
CQADupstackStatsRetrieval Retrieval s2p 1 0 0 42921 0 0 1041.0
CQADupstackTexRetrieval Retrieval s2p 1 0 0 71090 0 0 1246.9
CQADupstackUnixRetrieval Retrieval s2p 1 0 0 48454 0 0 984.7
CQADupstackWebmastersRetrieval Retrieval s2p 1 0 0 17911 0 0 689.8
CQADupstackWordpressRetrieval Retrieval s2p 1 0 0 49146 0 0 1111.9
DBPedia Retrieval s2p 1 0 4635989 4636322 0 310.2 310.1
FEVER Retrieval s2p 1 0 0 5423234 0 0 538.6
FiQA2018 Retrieval s2p 1 0 0 58286 0 0 760.4
HotpotQA Retrieval s2p 1 0 0 5240734 0 0 288.6
MSMARCO Retrieval s2p 1 0 8848803 8841866 0 336.6 336.8
MSMARCOv2 Retrieval s2p 1 138641342 138368101 0 341.4 342.0 0
NFCorpus Retrieval s2p 1 0 0 3956 0 0 1462.7
NQ Retrieval s2p 1 0 0 2684920 0 0 492.7
QuoraRetrieval Retrieval s2s 1 0 0 532931 0 0 62.9
SCIDOCS Retrieval s2p 1 0 0 26657 0 0 1161.9
SciFact Retrieval s2p 1 0 0 5483 0 0 1422.3
Touche2020 Retrieval s2p 1 0 0 382594 0 0 1720.1
TRECCOVID Retrieval s2p 1 0 0 171382 0 0 1117.4
BIOSSES STS s2s 1 200 200 200 156.6 156.6 156.6
SICK-R STS s2s 1 19854 19854 19854 46.1 46.1 46.1
STS12 STS s2s 1 4468 0 6216 100.7 0 64.7
STS13 STS s2s 1 0 0 3000 0 0 54.0
STS14 STS s2s 1 0 0 7500 0 0 54.3
STS15 STS s2s 1 0 0 6000 0 0 57.7
STS16 STS s2s 1 0 0 2372 0 0 65.3
STS17 STS s2s 11 0 0 500 0 0 43.3
STS22 STS p2p 18 0 0 8060 0 0 1992.8
STSBenchmark STS s2s 1 11498 3000 2758 57.6 64.0 53.6
SummEval Summarization p2p 1 0 0 2800 0 0 359.8

Table 2: Tasks in MTEB

AmazonPolarity (McAuley and Leskovec, 2013)  A collection of Amazon customer reviews annotated for polarity classification. For each review the label is either "positive" or "negative".

AmazonReviews (McAuley and Leskovec, 2013)  A collection of Amazon reviews designed to aid research in multilingual text classification. For each review the label is the score given by the review, between 0 and 4 (1-5 stars). This is a multilingual dataset with 6 available languages.

Banking77 (Casanueva et al., 2020)  Dataset composed of online banking queries annotated with their corresponding intents. For each user query the label is an intent among 77 intents like 'activate_my_card', 'apple_pay', 'bank_transfer', etc.

Emotion (Saravia et al., 2018)  Dataset of English Twitter messages with six basic emotions: anger, fear, joy, love, sadness, and surprise.

Imdb (Maas et al., 2011)  Large movie review dataset with labels being positive or negative.

MassiveIntent (FitzGerald et al., 2022)  A collection of Amazon Alexa virtual assistant utterances annotated with the associated intent. For each user utterance the label is one of 60 intents like 'play_music', 'alarm_set', etc. This is a multilingual dataset with 51 available languages.

MassiveScenario (FitzGerald et al., 2022)  A collection of Amazon Alexa virtual assistant utterances annotated with the associated scenario. For each user utterance the label is a theme among 60 scenarios like 'music', 'weather', etc. This is a multilingual dataset with 51 available languages.

MTOPDomain / MTOPIntent  Multilingual sentence datasets from the MTOP (Li et al., 2020) benchmark. We refer to their paper for details.

ToxicConversations  Dataset from a Kaggle competition14. Collection of comments from the Civil Comments platform together with annotations of whether the comment is toxic or not.

TweetSentimentExtraction  Dataset from a Kaggle competition15. Sentiment classification of tweets as neutral, positive or negative.

A.3 Pair Classification

SprintDuplicateQuestions (Shah et al., 2018)  Collection of questions from the Sprint community. The goal is to classify a pair of sentences as duplicates or not.

TwitterSemEval2015 (Xu et al., 2015)  Paraphrase pairs of tweets from the SemEval 2015 workshop. The goal is to classify a pair of tweets as paraphrases or not.

TwitterURLCorpus (Lan et al., 2017)  Paraphrase pairs of tweets. The goal is to classify a pair of tweets as paraphrases or not.

A.4 Bitext Mining

BUCC (Zweigenbaum et al., 2016, 2017, 2018)  BUCC provides large sets of sentences (∼10-70k each) for English, French, Russian, German and Chinese, along with pair annotations. Each annotated pair consists of a sentence and its translation in the other language.

Tatoeba (Research)  Tatoeba provides sets of sentences (1,000 sentences each) for 112 languages with annotated associated pairs. Each pair is one sentence and its translation in another language.

A.5 Reranking

AskUbuntuDupQuestions16  Questions from AskUbuntu with manual annotations marking pairs of questions as similar or dissimilar.

MindSmall (Wu et al., 2020)  Large-scale English dataset for news recommendation research. The task is to rank news article titles given the title of the article the user is currently reading, i.e. to recommend related news.

SciDocsRR (Cohan et al., 2020b)  Ranking of related scientific papers based on their title.

StackOverflowDupQuestions (Liu et al., 2018)  Stack Overflow duplicate questions task for questions with the tags Java, JavaScript and Python, ranking questions as duplicates or not.

A.6 Semantic Textual Similarity (STS)

STS12, STS13, STS14, STS15, STS16, STS17, STS22, STSBenchmark (Agirre et al., 2012, 2013)17 18 19 20  Original STS benchmarks, with scores from 0 to 5. The selection of sentences includes text from image captions, news headlines and user forums. In total they contain between 1,000 and 20,000 sentences. STS12 - STS16 and STSBenchmark are monolingual English benchmarks. STS17 and STS22 contain crosslingual pairs of sentences, where the goal is to assess the similarity of two sentences in different languages. STS17 has 11 language pairs (among Korean, Arabic, English, French, German, Turkish, Spanish, Italian and Dutch) and STS22 has 18 language pairs (among Arabic, English, French, German, Turkish, Spanish, Polish, Italian, Russian and Chinese).

BIOSSES21  Contains 100 sentence pairs from the biomedical field.

SICK-R (Agirre et al., 2014)  Sentences Involving Compositional Knowledge (SICK) contains a large number of sentence pairs (10,000) that are lexically, syntactically and semantically rich.

A.7 Summarization

SummEval (Fabbri et al., 2020)  Summaries generated by recent summarization models trained on CNN or DailyMail, alongside human annotations.

A.8 Retrieval

We refer to the BEIR paper (Thakur et al., 2021), which contains a description of each dataset. For MTEB, we include all publicly available datasets: ArguAna, ClimateFEVER, CQADupstack, DBPedia, FEVER, FiQA2018, HotpotQA, MSMARCO, NFCorpus, NQ, Quora, SCIDOCS, SciFact, Touche2020, TRECCOVID.

B Limitations of MTEB

While MTEB aims to be a diverse benchmark that provides holistic performance reviews, the benchmark has its limitations. We list them here:

1. Long document datasets  MTEB covers multiple text lengths (S2S, P2P, S2P), but very long documents are still missing. The longest datasets in MTEB have a few hundred words per text, and longer text sizes could be relevant for use cases like retrieval.

2. Task imbalance  Tasks in MTEB have a different number of datasets, with summarization consisting of only a single dataset. This means MTEB average scores, which are computed over all datasets, are biased towards tasks with many datasets, notably retrieval, classification and clustering. As MTEB grows, we hope to add more datasets to currently underrepresented tasks like summarization or pair classification.

3. Multilinguality  MTEB contains multilingual classification, STS and bitext mining datasets. However, retrieval and clustering are English-only. SGPT-BLOOM-7B1-msmarco is geared towards multilingual retrieval datasets and, due to the lack thereof, cannot be comprehensively benchmarked in MTEB. Further, MTEB does not contain any code datasets that could be used to benchmark code models (Neelakantan et al., 2022; Allal et al., 2023). It should be easy to extend MTEB with datasets such as CodeSearchNet (Husain et al., 2019), TyDI QA (Clark et al., 2020), XOR QA (Asai et al., 2020) or MIRACL (Zhang et al., 2022).

4. Additional modalities  Text embeddings are commonly used as input features for downstream models, such as in our classification task. This can involve other modalities, notably image content (Carvalho et al., 2018; Tan and Bansal, 2019; Muennighoff, 2020; Nichol et al., 2021; Saharia et al., 2022; Weinbach et al., 2022). We have focused solely on natural language applications and leave extensive benchmarking of text embeddings as inputs for other modalities to future work.

C Examples

Tables 3-9 provide examples for each dataset of each task. For retrieval datasets, we refer to the BEIR paper (Thakur et al., 2021).

D Correlations

Figure 6 provides correlation heatmaps for model performance and MTEB tasks.

E Models

Table 10 provides the publicly available model checkpoints used for MTEB evaluation.

F Additional results

Table 11 and the following tables provide results on the individual datasets of MTEB. The results are additionally available in JSON format on the Hugging Face Hub22 and can be inspected on the leaderboard23.

14 https://www.kaggle.com/competitions/jigsaw-unintended-bias-in-toxicity-classification
15 https://www.kaggle.com/competitions/tweet-sentiment-extraction
16 https://github.com/taolei87/askubuntu
17 https://alt.qcri.org/semeval2014/task10/
18 https://alt.qcri.org/semeval2015/task2/
19 https://alt.qcri.org/semeval2016/task1/
20 https://competitions.codalab.org/competitions/33835
21 https://tabilab.cmpe.boun.edu.tr/BIOSSES/DataSet.html
22 https://huggingface.co/datasets/mteb/results
23 https://huggingface.co/spaces/mteb/leaderboard
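As a convenience, the per-dataset result files referenced in Appendix F can be pulled locally from the Hub. The sketch below assumes the huggingface_hub package and that the repository stores one JSON file per model and task; the exact file layout is an assumption and should be checked against the repository itself.

```python
import json
from pathlib import Path

from huggingface_hub import snapshot_download

# Download the full results repository (a dataset repo on the Hugging Face Hub).
local_dir = snapshot_download(repo_id="mteb/results", repo_type="dataset")

# Walk the downloaded JSON files and collect the reported scores.
# The per-file schema is an assumption; inspect one file to confirm the keys.
results = {}
for path in Path(local_dir).rglob("*.json"):
    with open(path) as f:
        results[str(path.relative_to(local_dir))] = json.load(f)

print(f"Loaded {len(results)} result files from {local_dir}")
```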
Dataset Text Label

AmazonCounterfactualClassification In person it looks as though it would have cost a lot more. counterfactual

AmazonPolarityClassification an absolute masterpiece I am quite sure any of you actually taking the time to read this have played the game at least positive
once, and heard at least a few of the tracks here. And whether you were aware of it or not, Mitsuda’s music contributed
greatly to the...

AmazonReviewsClassification solo llega una unidad cuando te obligan a comprar dos Te obligan a comprar dos unidades y te llega solo una y no hay 0
forma de reclamar, una autentica estafa, no compreis!!

Banking77Classification What currencies is an exchange rate calculated in? exchange_rate

EmotionClassification i feel so inhibited in someone elses kitchen like im painting on someone elses picture sadness

ImdbClassification When I first saw a glimpse of this movie, I quickly noticed the actress who was playing the role of Lucille Ball. Rachel negative
York’s portrayal of Lucy is absolutely awful. Lucille Ball was an astounding comedian with incredible talent. To think
about a legend like Lucille Ball being portrayed the way she was in the movie is horrendous. I cannot believe...

MassiveIntentClassification réveille-moi à neuf heures du matin le vendredi alarm_set

MassiveScenarioClassification tell me the artist of this song music

MTOPDomainClassification Maricopa County weather forecast for this week weather

MTOPIntentClassification what ingredients do is have left GET_INFO_RECIPES

ToxicConversationsClassification The guy’s a damn cop, so what do you expect? toxic

TweetSentimentExtractionClassification I really really like the song Love Story by Taylor Swift positive

Table 3: Classification examples
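The classification datasets illustrated above are meant to be solved on top of frozen embeddings. As a rough illustration only, and not the exact MTEB protocol (which fixes its own classifier, sample sizes and metrics), one can embed the texts and fit a simple scikit-learn classifier; the texts and labels below are toy stand-ins rather than real dataset entries.

```python
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Toy stand-ins for the train/test split of a classification dataset.
train_texts = ["this movie was a delight from start to finish",
               "a dull, lifeless film that I could not finish"]
train_labels = ["positive", "negative"]
test_texts = ["one of the most enjoyable films of the year",
              "a tedious and forgettable movie"]
test_labels = ["positive", "negative"]

# Any sentence-transformers checkpoint from Table 10 can be used here.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
X_train = model.encode(train_texts)
X_test = model.encode(test_texts)

# Fit a linear classifier on the frozen embeddings and score it.
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, train_labels)
print("accuracy:", accuracy_score(test_labels, clf.predict(X_test)))
```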

Dataset Text Cluster

ArxivClusteringP2P Finite groups of rank two which do not involve Qd(p). Let p > 3 be a prime. We show that if G is a finite group math
with p-rank equal to 2, then G involves Qd(p) if and only if G p0 -involves Qd(p). This allows us to use a version
of Glauberman’s ZJ-theorem to give a more direct construction of finite group actions on mod-p homotopy spheres.
We give an example to illustrate that the above conclusion does not hold for p ≤ 3.

ArxivClusteringS2S Vertical shift and simultaneous Diophantine approximation on polynomial curves math

BiorxivClusteringP2P Innate Immune sensing of Influenza A viral RNA through IFI16 promotes pyroptotic cell death Programmed cell death immunology
pathways are triggered by various stresses or stimuli, including viral infections. The mechanism underlying the regula-
tion of these pathways upon Influenza A virus IAV infection is not well characterized. We report that a cytosolic DNA
sensor IFI16 is...

BiorxivClusteringS2S Association of CDH11 with ASD revealed by matched-gene co-expression analysis and mouse behavioral neuroscience

MedrxivClusteringP2P Temporal trends in the incidence of haemophagocytic lymphohistiocytosis: a nationwide cohort study from England infectious diseases
2003-2018. Haemophagocytic lymphohistiocytosis (HLH) is rare, results in high mortality and is increasingly being
diagnosed. Little is known about what is driving the apparent rise in the incidence of this disease. Using national linked
electronic health data from hospital admissions and death certification cases of HLH that were diagnosed in England
between 1/1/2003 and 31/12/2018 were identified using a previously validated approach. We calculated incidence...

MedrxivClusteringS2S Current and Lifetime Somatic Symptom Burden Among Transition-aged Young Adults on the Autism Spectrum psychiatry and clinical psychology

RedditClustering Could anyone tell me what breed my bicolor kitten is? r/cats

RedditClusteringP2P Headaches after working out? Hey guys! I’ve been diagnosed with adhd since I was seven. I just recently got rediag- r/ADHD
nosed (22f) and I’ve been out on a different medication, adderall I was normally taking vyvanse but because of cost and
no insurance adderall was more affordable. I’ve noticed that if I take adderall and workout...

StackExchangeClustering Does this property characterize a space as Hausdorff? math.stackexchange.com

StackExchangeClusteringP2P Google play services error DEBUG: Application is pausing, which disconnects the RTMP client. I am having this issue unity
from past day with Google Play Services Unity. What happens is, when I install app directly ot device via Unity, the
Google Play Services work fine but when I upload it as beta to play store console and install it via that then it starts to
give " DEBUG: Application is pausing, which disconnects the RTMP client" error. I have a proper SHA1 key.

TwentyNewsgroupsClustering Commercial mining activities on the moon 14

Table 4: Clustering examples

Dataset Sentence 1 Sentence 2 Label

SprintDuplicateQuestions Franklin U722 USB modem signal strength How do I know if my Franklin U772 USB Modem has a 1
weak signal ?

TwitterSemEval2015 All the home alones watching 8 mile","All the home alones The last rap battle in 8 Mile nevr gets old ahah 0
watching 8 mile

TwitterURLCorpus How the metaphors we use to describe discovery affect men Light Bulbs or Seeds ? How Metaphors for Ideas Influence 0
and women in the sciences Judgments About Genius

Table 5: Pair classification examples. Labels are binary.


Dataset Query Positive Negative

AskUbuntuDupQuestions change the application icon theme but not changing the change folder icons in ubuntu-mono-dark theme change steam tray icon back to default
panel icons

MindSmallReranking Man accused in probe of Giuliani associates is freed on bail Studies show these are the best and worst states for your There are 14 cheap days to fly left in 2019: When are they
retirement and what deals can you score?

SciDocsRR Discovering social circles in ego networks Benchmarks for testing community detection algorithms on Improving www proxies performance with greedy-dual-
directed and weighted graphs with overlapping communi- size-frequency caching policy
ties.

StackOverflowDupQuestions Java launch error selection does not contain a main type Error: Selection does not contain a main type Selection Sort in Java

Table 6: Reranking examples

Dataset Sentence 1 Sentence 2 Score

BIOSSES It has recently been shown that Craf is essential for Kras It has recently become evident that Craf is essential for the 4.0
G12D-induced NSCLC. onset of Kras-driven non-small cell lung cancer.

SICK-R A group of children is playing in the house and there is no A group of kids is playing in a yard and an old man is stand- 3.2
man standing in the background ing in the background

STS12 Nationally, the federal Centers for Disease Control and Pre- There were 293 human cases of West Nile in Indiana in 1.7
vention recorded 4,156 cases of West Nile, including 284 2002, including 11 deaths statewide.
deaths.

STS13 this frame has to do with people ( the residents ) residing in inhabit or live in ; be an inhabitant of ; 2.8
locations , sometimes with a co-resident .

STS14 then the captain was gone. then the captain came back. 0.8

STS15 you ’ll need to check the particular policies of each pub- if you need to publish the book and you have found one 3.0
lisher to see what is allowed and what is not allowed. publisher that allows it.

STS16 you do not need to worry. you don ’t have to worry. 5.0

STS17 La gente muestra su afecto el uno por el otro. A women giving something to other lady. 1.4

STS22 El secretario general de la Asociación Gremial de los Tra- En diálogo con el servicio informativo de la Radio Pública, 1
bajadores del Subte y Premetro de Metrodelegados, Beto el ministro de Salud de la Nación, Ginés González García,
Pianelli, dijo que el Gobierno porteño debe convocar “in- habló sobre el avance del coronavirus en la Argentina y se
mediatamente” a licitación para la compra de nuevos trenes manifestó a favor de prorrogar la cuarentena obligatoria dis-
y retirar los que quedan en circulación... puesta por...

STSBenchmark A man is playing the cello. A man seated is playing the cello. 4.25

Table 7: STS examples. Scores are continuous between 0 and 5 (included).
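The STS examples above come with continuous gold scores, and embedding models are typically judged by how well the similarities of their embeddings correlate with those scores. Below is a minimal sketch of that computation, assuming sentence-transformers, NumPy and SciPy; MTEB's exact metric choices are described in the main paper, so treat this as illustrative rather than the benchmark implementation. The example pairs are taken from Table 7.

```python
import numpy as np
from scipy.stats import spearmanr
from sentence_transformers import SentenceTransformer

# Example pairs with gold similarity scores on a 0-5 scale (cf. Table 7).
sentences1 = ["A man is playing the cello.",
              "then the captain was gone.",
              "you do not need to worry."]
sentences2 = ["A man seated is playing the cello.",
              "then the captain came back.",
              "you don't have to worry."]
gold_scores = [4.25, 0.8, 5.0]

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
emb1 = model.encode(sentences1)
emb2 = model.encode(sentences2)

# Cosine similarity of each embedded pair.
cosine = np.sum(emb1 * emb2, axis=1) / (
    np.linalg.norm(emb1, axis=1) * np.linalg.norm(emb2, axis=1))

# Rank correlation between model similarities and the gold scores.
correlation, _ = spearmanr(cosine, gold_scores)
print(f"Spearman correlation: {correlation:.3f}")
```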

Dataset First set sentence Second set sentence

BUCC Morales remporte l’élection présidentielle de 2005 à la ma- Morales went on to win the 2005 presidential election with
jorité absolue. an absolute majority.

Tatoeba Chi le ha detto che Tom l’ha fatto? Who told you that Tom did that?

Table 8: Bitext mining examples

Dataset Human Summary Machine Summary Relevance

SummEval V. Stiviano must pay back $2.6 million in gifts from Donald donald sterling , nba team last year . sterling ’s wife sued 1.7
Sterling. Sterling’s wife claimed the ex-Clippers used the for $ 2.6 million in gifts . sterling says he is the former
couple’s money for the gifts. The items included a Ferrari, female companion who has lost the . sterling has ordered
two Bentleys and a Range Rover. v. stiviano to pay back $ 2.6 m in gifts after his wife sued .
sterling also includes a $ 391 easter bunny costume , $ 299
and a $ 299 .

Table 9: Summarization example

(a) Model correlation based on all results    (b) Task correlation based on average task results
Figure 6: Pearson correlations across model and task results. Left: Size variants of the same architecture show
high correlations. Right: Performance on clustering and reranking correlates strongest, while summarization and
classification show weaker correlation with other tasks.

Model Public Checkpoint

Glove https://huggingface.co/sentence-transformers/average_word_embeddings_glove.6B.300d
Komninos https://huggingface.co/sentence-transformers/average_word_embeddings_komninos
BERT https://huggingface.co/bert-base-uncased
SimCSE-BERT-unsup https://huggingface.co/princeton-nlp/unsup-simcse-bert-base-uncased
SimCSE-BERT-sup https://huggingface.co/princeton-nlp/sup-simcse-bert-base-uncased
coCondenser-msmarco https://huggingface.co/sentence-transformers/msmarco-bert-co-condensor
Contriever https://huggingface.co/nthakur/contriever-base-msmarco
SPECTER https://huggingface.co/sentence-transformers/allenai-specter
LaBSE https://huggingface.co/sentence-transformers/LaBSE
LASER2 https://github.com/facebookresearch/LASER
MiniLM-L6 https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2
MiniLM-L12 https://huggingface.co/sentence-transformers/all-MiniLM-L12-v2
MiniLM-L12-multilingual https://huggingface.co/sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
MPNet https://huggingface.co/sentence-transformers/all-mpnet-base-v2
MPNet-multilingual https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2
SGPT-125M-nli https://huggingface.co/Muennighoff/SGPT-125M-weightedmean-nli-bitfit
SGPT-5.8B-nli https://huggingface.co/Muennighoff/SGPT-5.8B-weightedmean-nli-bitfit
SGPT-125M-msmarco https://huggingface.co/Muennighoff/SGPT-125M-weightedmean-msmarco-specb-bitfit
SGPT-1.3B-msmarco https://huggingface.co/Muennighoff/SGPT-1.3B-weightedmean-msmarco-specb-bitfit
SGPT-2.7B-msmarco https://huggingface.co/Muennighoff/SGPT-2.7B-weightedmean-msmarco-specb-bitfit
SGPT-5.8B-msmarco https://huggingface.co/Muennighoff/SGPT-5.8B-weightedmean-msmarco-specb-bitfit
SGPT-BLOOM-7.1B-msmarco https://huggingface.co/bigscience/sgpt-bloom-7b1-msmarco
SGPT-BLOOM-1.7B-nli https://huggingface.co/bigscience-data/sgpt-bloom-1b7-nli
GTR-Base https://huggingface.co/sentence-transformers/gtr-t5-base
GTR-Large https://huggingface.co/sentence-transformers/gtr-t5-large
GTR-XL https://huggingface.co/sentence-transformers/gtr-t5-xl
GTR-XXL https://huggingface.co/sentence-transformers/gtr-t5-xxl
ST5-Base https://huggingface.co/sentence-transformers/sentence-t5-base
ST5-Large https://huggingface.co/sentence-transformers/sentence-t5-large
ST5-XL https://huggingface.co/sentence-transformers/sentence-t5-xl
ST5-XXL https://huggingface.co/sentence-transformers/sentence-t5-xxl

Table 10: Publicly available model links used for evaluation
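For reference, the checkpoints in Table 10 can be plugged directly into an evaluation run. The sketch below follows the public mteb package's documented usage at the time of writing; treat the exact class and argument names as assumptions and check the library's README if they have changed.

```python
from mteb import MTEB
from sentence_transformers import SentenceTransformer

# Any sentence-transformers checkpoint from Table 10 works out of the box.
model_name = "sentence-transformers/all-MiniLM-L6-v2"
model = SentenceTransformer(model_name)

# Restrict the run to a single task for a quick smoke test;
# omitting `tasks` would run the full benchmark.
evaluation = MTEB(tasks=["Banking77Classification"])
evaluation.run(model, output_folder=f"results/{model_name.split('/')[-1]}")
```

Restricting `tasks` keeps the run cheap; the resulting JSON files follow the same format as the published results.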


Dataset Glove Komninos BERT SimCSE-BERT-unsup SimCSE-BERT-sup coCondenser-msmarco Contriever SPECTER LaBSE LASER2 MiniLM-L6 MiniLM-L12 MiniLM-L12-multilingual MPNet MPNet-multilingual OpenAI-Ada-Similarity SGPT-125M-nli SGPT-5.8B-nli SGPT-125M-msmarco SGPT-1.3B-msmarco SGPT-2.7B-msmarco SGPT-5.8B-msmarco SGPT-BLOOM-7.1B-msmarco GTR-Base GTR-Large GTR-XL GTR-XXL ST5-Base ST5-Large ST5-XL ST5-XXL

AmazonCounterfactualClassification 56.91 60.54 74.25 67.09 75.75 64.06 72.19 58.70 75.93 76.84 64.15 65.28 71.57 65.27 75.81 76.40 65.88 74.07 61.24 65.21 67.57 69.22 68.06 69.33 70.03 68.60 67.30 75.82 75.51 76.01 77.07
AmazonPolarityClassification 60.32 59.59 71.33 74.48 82.47 66.88 68.63 57.77 68.95 61.01 62.58 62.98 69.21 67.13 76.41 92.83 74.94 82.31 65.40 73.21 71.44 71.26 68.97 67.82 73.92 74.58 75.05 85.12 92.87 93.17 92.79
AmazonReviewsClassification 29.67 31.01 33.56 33.85 39.60 34.85 37.42 26.26 35.80 28.71 31.79 30.79 35.11 31.92 38.51 47.45 35.10 41.58 31.17 34.96 35.75 39.19 33.86 38.48 37.21 38.20 37.30 44.94 47.12 48.18 48.93
Banking77Classification 67.69 67.05 63.41 73.55 75.76 82.35 80.02 66.66 69.85 57.76 79.75 80.40 79.77 81.86 81.07 68.04 74.68 81.74 77.70 82.06 83.22 84.49 84.33 79.26 81.21 82.22 82.32 76.48 78.46 80.88 82.31
EmotionClassification 36.93 33.18 35.28 42.22 44.81 41.91 44.77 24.82 37.22 24.83 38.43 41.17 42.37 39.73 45.84 50.32 42.23 49.92 39.08 46.39 49.21 49.66 44.87 42.20 46.32 45.55 43.19 51.36 51.73 51.95 48.57
ImdbClassification 62.57 63.98 65.35 69.63 73.53 60.17 67.04 56.35 62.04 57.58 60.66 59.76 60.46 70.72 64.57 89.38 62.90 74.33 58.67 64.05 63.53 66.64 61.77 65.99 70.86 68.15 70.8 77.34 87.01 87.54 90.23
MassiveIntentClassification 56.19 57.21 59.88 59.84 65.95 70.40 67.78 51.73 61.46 47.91 67.40 67.15 66.84 69.57 69.32 65.17 58.08 70.0 61.41 68.65 69.01 70.39 69.67 67.05 70.06 70.23 70.61 69.74 71.78 72.09 73.44
MassiveScenarioClassification 66.03 66.11 64.28 66.25 70.78 73.73 76.00 58.58 66.41 55.92 75.76 74.58 71.51 76.01 75.35 67.67 66.34 75.03 69.74 76.04 75.90 76.28 75.34 75.40 75.49 75.94 77.77 72.32 73.16 73.26 74.82
MTOPDomainClassification 79.11 78.57 82.63 81.71 84.29 91.34 93.18 74.53 86.06 75.36 91.56 91.90 87.06 92.08 89.24 89.89 81.52 89.64 86.96 92.08 92.56 93.47 93.68 92.42 94.01 93.60 93.84 90.34 90.99 90.73 92.49
MTOPIntentClassification 55.85 57.07 68.14 59.23 63.14 71.07 69.31 50.05 63.03 49.47 62.18 62.84 65.52 70.21 68.69 64.80 58.24 70.68 62.25 71.19 71.85 72.42 71.34 62.44 63.86 65.93 67.71 63.32 64.98 68.15 68.33
ToxicConversationsClassification 65.40 67.76 70.0 68.82 72.04 64.01 67.77 57.44 66.90 54.05 66.99 67.47 66.07 60.86 71.02 70.00 62.79 69.93 62.66 68.73 68.84 67.71 66.55 66.60 68.65 67.56 68.48 68.20 71.73 70.95 70.04
TweetSentimentExtractionClassification 50.80 49.68 51.81 53.36 59.73 55.74 56.10 45.52 58.82 48.73 55.41 54.25 56.12 55.46 59.03 63.35 54.82 62.44 52.41 55.67 56.69 56.85 55.85 56.02 54.09 54.77 54.54 62.71 62.33 61.21 62.01

ArxivClusteringP2P 32.56 34.73 35.19 32.61 35.18 36.94 42.61 44.75 32.13 17.77 46.55 46.07 38.33 48.38 37.78 41.49 34.74 40.55 39.71 43.38 44.72 45.59 44.59 35.49 37.50 37.90 37.90 39.28 41.62 41.62 42.89
ArxivClusteringS2S 23.14 26.01 27.51 24.68 27.54 29.03 32.32 35.27 22.05 12.39 37.86 37.50 31.55 39.72 31.68 28.47 24.68 32.49 28.24 33.71 35.08 38.86 38.03 27.18 30.55 30.45 32.39 27.26 29.44 31.17 33.47
BiorxivClusteringP2P 29.27 29.76 30.12 24.90 30.15 32.35 34.97 39.52 29.84 12.40 38.48 36.99 33.49 39.62 33.09 36.86 28.93 33.59 33.63 35.06 34.41 36.55 36.03 27.66 29.59 30.52 30.48 33.99 35.99 36.43 36.53
BiorxivClusteringS2S 19.18 20.71 24.77 19.55 24.67 28.16 29.08 34.53 20.57 8.83 33.17 33.21 29.44 35.02 29.60 27.55 23.08 29.13 27.04 30.71 30.53 33.70 32.48 23.25 25.72 26.06 27.50 22.92 24.02 26.47 28.66
MedrxivClusteringP2P 26.12 26.65 26.09 23.60 26.25 30.23 31.19 35.04 30.13 17.91 34.41 34.25 31.52 35.58 31.96 31.09 28.30 30.33 31.37 32.08 31.35 31.51 31.05 27.57 28.72 28.69 29.12 33.20 32.40 32.30 32.09
MedrxivClusteringS2S 20.38 21.50 23.60 21.97 24.12 27.01 27.27 31.66 24.82 16.63 32.29 32.24 30.87 32.87 31.70 26.50 24.93 28.02 26.87 29.45 28.77 28.76 29.26 25.13 27.39 26.69 27.56 26.13 26.33 26.93 26.82
RedditClustering 28.46 28.84 27.24 32.18 40.23 48.04 54.89 24.13 28.79 9.96 50.67 51.18 42.02 54.82 45.24 42.47 33.76 42.17 40.23 48.23 46.47 40.45 35.53 56.13 61.69 61.34 64.13 52.93 54.53 57.03 58.99
RedditClusteringP2P 35.82 7.37 43.32 45.14 47.74 53.53 57.58 35.06 49.14 26.42 54.15 54.80 50.73 56.77 51.31 58.10 41.01 48.02 49.09 53.18 54.17 55.75 54.52 58.53 61.67 61.11 62.84 59.67 62.50 62.34 64.46
StackExchangeClustering 35.80 39.04 43.58 43.07 47.55 59.54 63.15 39.01 35.43 15.79 53.36 53.05 49.60 53.80 52.98 53.52 44.59 54.13 52.74 60.86 59.19 59.21 55.13 64.21 69.93 69.95 71.43 63.13 65.11 67.13 70.78
StackExchangeClusteringP2P 28.51 30.23 26.55 28.50 29.45 30.48 32.25 31.46 28.83 18.63 38.00 33.13 31.69 34.28 32.94 30.43 28.23 31.12 32.66 32.36 32.57 33.95 34.31 33.01 33.21 32.73 32.85 35.68 36.86 34.79 35.25
TwentyNewsgroupsClustering 25.83 27.42 23.35 23.21 34.86 38.68 46.82 24.22 23.28 11.38 46.86 47.47 39.28 49.74 44.10 36.26 28.24 37.20 32.13 40.06 40.89 39.46 37.28 46.72 51.64 51.15 50.44 48.10 49.33 49.53 50.93

SprintDuplicateQuestions 86.96 85.55 36.81 69.41 69.39 96.09 95.55 71.63 89.26 65.54 94.55 92.45 89.46 90.15 90.55 77.85 77.73 80.54 89.89 92.58 93.47 93.84 94.93 94.55 95.05 95.45 95.68 91.23 89.01 91.44 88.89
TwitterSemEval2015 48.45 53.85 55.90 60.21 67.75 65.95 66.85 43.25 62.78 59.57 67.86 70.02 62.06 73.85 66.75 69.04 57.09 66.00 54.75 62.37 63.68 66.87 65.31 72.23 76.03 77.81 77.54 78.25 79.75 80.89 80.28
TwitterURLCorpus 77.35 79.41 76.29 81.37 83.89 83.17 85.21 69.22 84.58 81.47 84.70 84.77 83.83 85.11 85.14 83.69 80.51 84.54 81.06 83.79 84.80 85.29 85.46 84.77 84.89 85.14 85.13 86.05 86.14 85.86 86.01

AskUbuntuDupQuestions 49.57 50.88 45.84 51.57 51.80 58.99 56.69 50.07 52.75 48.99 63.48 64.06 60.49 65.85 60.16 53.49 52.63 55.90 55.84 58.13 59.63 61.63 59.97 60.86 61.64 63.08 63.23 59.73 61.51 62.86 66.16
MindSmallReranking 27.01 28.92 28.37 28.62 29.30 27.13 31.58 24.80 29.81 24.79 30.80 31.02 30.37 30.97 30.15 30.71 29.27 31.11 30.40 31.34 31.72 32.29 31.79 31.33 31.84 31.50 31.93 30.20 30.27 29.77 30.60
SciDocsRR 62.56 63.55 64.94 66.33 70.14 72.78 76.51 81.31 68.72 54.99 87.12 87.20 77.78 88.65 78.09 71.04 68.36 77.54 71.34 77.21 77.72 80.79 79.77 73.71 76.39 76.49 77.96 73.96 74.88 75.16 76.09
StackOverflowDupQuestions 34.03 35.65 34.62 39.35 38.90 48.48 47.78 36.22 42.42 36.98 50.76 51.47 45.85 51.98 46.79 40.85 39.97 44.77 44.74 49.32 49.61 51.53 51.07 51.01 51.58 52.79 53.50 48.46 49.34 51.05 52.85

ArguAna 36.30 30.96 28.29 38.34 38.33 45.15 48.32 32.67 34.18 12.86 50.17 47.13 44.88 46.52 48.91 39.65 31.04 35.07 45.42 49.68 50.49 51.38 47.28 50.83 52.09 52.81 53.77 44.85 39.27 39.40 39.85
ClimateFEVER 14.44 14.87 5.41 11.80 11.98 16.96 24.79 6.86 3.83 0.36 20.27 21.57 18.49 21.97 15.27 2.83 11.01 17.57 21.86 26.6 27.11 30.46 29.39 24.88 26.90 27.01 27.21 10.37 11.36 10.61 14.63
CQADupstackRetrieval 15.47 16.79 5.51 13.22 14.50 27.72 33.67 14.60 18.75 4.12 41.32 42.53 30.71 44.96 31.32 10.17 20.29 29.98 27.25 33.33 36.53 39.40 39.62 34.55 36.62 37.35 38.56 35.23 38.96 40.78 44.65
DBPedia 18.29 15.88 4.13 15.04 19.73 27.86 38.10 4.14 15.57 1.53 32.33 33.36 22.63 32.09 26.22 3.48 10.87 26.10 22.72 31.51 34.70 39.87 39.03 35.24 39.55 39.74 41.28 27.77 31.55 33.65 39.19
FEVER 14.99 15.56 3.30 21.05 20.41 45.68 59.29 5.45 12.17 0.77 51.93 55.91 52.66 50.86 56.76 4.45 18.40 38.64 60.45 68.12 72.73 78.24 73.97 68.93 72.66 72.18 74.08 26.16 36.21 36.12 51.20
FiQA2018 10.09 10.49 2.19 9.84 10.41 15.62 27.42 5.64 7.00 1.73 36.87 37.27 20.33 49.96 22.96 7.54 8.94 18.59 21.12 29.99 33.29 37.20 35.84 35.15 42.79 44.19 46.78 34.83 43.55 44.71 46.68
HotpotQA 19.18 20.77 8.26 19.75 22.89 35.61 56.81 5.46 18.75 5.50 46.51 44.59 30.01 39.29 37.03 12.6 17.73 33.99 40.88 49.93 52.84 59.26 57.26 54.93 57.85 58.91 59.67 33.20 33.95 37.17 42.14
MSMARCO 9.60 9.75 1.91 9.35 11.00 29.57 36.77 5.58 7.60 1.09 36.54 39.03 23.72 39.75 26.60 10.53 6.27 15.83 27.98 36.05 38.83 39.91 41.12 41.16 42.73 43.52 44.05 20.71 23.96 25.17 27.68
NFCorpus 13.87 11.79 4.30 9.88 12.42 22.29 31.31 0.84 16.54 2.44 31.59 32.25 23.45 33.29 25.49 20.59 11.80 28.26 22.79 32.08 33.89 36.21 35.78 30.22 32.63 33.34 34.18 28.64 31.10 33.18 35.08
NQ 12.87 12.75 2.61 11.69 16.08 29.85 41.83 5.99 8.42 0.64 43.87 46.47 29.80 50.45 33.60 2.02 7.63 24.63 29.73 42.94 46.70 52.41 53.15 50.47 55.09 56.16 57.24 36.32 42.02 46.29 52.87
QuoraRetrieval 71.32 71.58 61.03 78.03 79.62 86.51 86.72 64.65 77.03 71.14 87.56 87.75 86.55 87.46 86.41 82.18 78.96 84.68 72.98 85.28 85.60 84.58 74.71 87.98 88.47 88.91 89.09 85.49 85.73 85.85 85.96
SCIDOCS 8.04 8.47 2.81 5.50 7.53 10.13 17.12 0.00 5.63 0.78 21.64 21.82 0.03 23.77 13.96 6.28 7.13 13.55 12.21 16.18 16.57 19.87 18.62 14.00 15.51 15.71 15.88 14.16 15.38 15.97 17.17
SciFact 29.58 29.53 13.34 25.72 29.59 52.31 65.51 47.88 38.20 4.04 64.51 62.64 48.37 65.57 50.30 45.46 31.79 46.66 56.90 68.29 70.17 74.70 72.11 59.74 63.42 64.20 66.77 45.76 49.91 50.91 55.38
Touche2020 13.99 13.17 0.97 8.90 9.89 8.57 15.79 8.46 4.88 1.06 16.90 17.22 16.06 19.93 17.40 3.1 12.27 16.18 22.97 24.45 23.44 25.43 23.98 25.89 28.29 25.26 26.76 20.30 21.63 22.51 21.65
TRECCOVID 36.22 35.92 14.74 26.2 22.93 40.54 44.77 29.91 16.34 10.97 47.25 50.82 39.12 51.33 37.87 24.56 39.31 55.35 70.30 72.98 75.17 84.88 81.37 56.05 56.68 60.09 51.90 40.70 46.11 54.77 59.48

BIOSSES 44.93 50.25 54.70 72.31 68.38 77.32 83.32 64.95 78.70 62.01 81.64 83.57 74.18 80.43 76.27 78.04 70.93 79.50 75.21 83.02 84.84 86.25 85.31 79.00 84.86 78.94 81.91 75.89 78.93 73.12 80.43
SICK-R 55.43 55.49 58.65 72.24 80.77 72.00 70.20 56.39 69.99 62.86 77.58 79.32 79.61 80.59 79.62 77.48 74.57 79.59 65.93 67.23 68.20 69.63 69.82 71.45 73.39 73.63 74.29 80.18 80.34 79.98 80.47
STS12 54.64 53.51 30.87 66.05 75.30 68.19 64.34 62.49 65.08 62.60 72.37 73.08 76.02 72.63 77.90 72.30 69.17 74.29 66.53 66.59 66.99 67.50 69.66 68.59 70.33 69.11 70.12 78.05 79.11 79.02 78.85
STS13 69.16 70.80 59.89 81.49 84.67 80.40 80.03 58.70 67.98 59.62 80.60 82.13 80.70 83.48 85.11 81.49 77.23 85.35 76.17 77.33 77.58 79.16 79.67 79.09 82.19 81.82 82.72 85.85 87.33 88.80 88.94
STS14 60.81 63.56 47.73 73.61 80.19 74.02 74.51 54.87 64.03 57.03 75.59 76.73 78.85 78.00 80.81 74.74 70.99 79.21 69.05 71.83 72.78 74.46 74.61 74.64 77.16 77.07 78.24 82.19 83.17 84.33 84.86
STS15 72.31 74.08 60.29 79.72 85.40 82.57 83.30 62.54 76.59 71.57 85.39 85.58 85.84 85.66 87.48 84.28 79.74 85.52 79.24 80.66 82.62 84.47 83.81 84.85 86.31 86.01 86.26 87.46 88.28 88.89 89.32
STS16 65.34 64.60 63.73 78.12 80.82 79.78 79.67 64.27 72.98 70.75 78.99 80.23 81.05 80.03 83.20 82.06 77.93 82.54 76.07 78.91 80.10 80.96 80.40 81.57 81.85 82.23 81.61 84.03 84.36 85.31 84.67
STS17 77.95 76.91 64.10 83.58 89.44 85.94 86.32 69.63 79.45 76.73 87.59 88.63 86.87 90.60 86.99 87.08 87.33 90.44 84.95 86.99 87.25 87.78 87.07 85.80 83.93 84.90 85.18 89.57 88.99 88.91 89.46
STS22 56.35 53.89 56.37 59.65 61.96 67.54 64.64 55.06 60.97 39.75 67.21 65.67 61.72 67.95 63.06 64.71 59.64 63.20 65.66 67.30 68.75 69.35 66.13 66.17 64.30 66.61 65.76 62.66 62.39 64.32 65.33
STSBenchmark 61.54 61.55 47.29 76.52 84.25 76.97 78.81 61.26 72.25 69.77 82.03 83.09 84.42 83.42 86.82 83.78 79.54 85.67 75.34 77.59 79.21 81.39 80.90 79.58 77.60 77.65 77.73 85.52 85.36 83.93 84.01

SummEval 28.87 30.49 29.82 31.15 23.31 29.50 30.36 27.66 31.05 26.8 30.81 27.9 30.67 27.49 31.57 26.94 30.26 30.38 28.90 25.44 27.87 24.75 24.99 29.67 29.50 30.21 30.64 31.39 29.64 29.91 30.08

Average 41.97 42.06 38.33 45.45 48.72 52.35 56.00 40.28 45.21 34.95 56.26 56.53 52.44 57.78 54.71 49.52 45.97 53.74 51.23 56.11 57.12 58.81 57.44 56.19 58.28 58.42 58.97 55.27 57.06 57.87 59.51

Table 11: All English results. The main score for each task is reported as described in Section 3.2.
Dataset Language LASER2 LaBSE MiniLM-L12-multilingual MPNet-multilingual SGPT-BLOOM-7.1B-msmarco
BUCC de-en 99.21 99.35 97.11 98.59 54.00
BUCC fr-en 98.39 98.72 94.99 96.89 97.06
BUCC ru-en 97.62 97.78 95.06 96.44 45.30
BUCC zh-en 97.70 99.16 95.63 97.56 97.96
Tatoeba sqi-eng 97.22 96.76 98.17 98.57 10.38
Tatoeba fry-eng 42.07 89.31 31.13 43.54 24.62
Tatoeba kur-eng 19.09 83.59 46.94 61.44 8.26
Tatoeba tur-eng 98.03 98.00 95.08 96.17 6.15
Tatoeba deu-eng 99.07 99.20 97.02 97.73 70.10
Tatoeba nld-eng 95.35 96.07 94.58 95.50 29.74
Tatoeba ron-eng 96.52 96.92 95.30 96.43 27.23
Tatoeba ang-eng 25.22 59.28 10.24 16.72 28.76
Tatoeba ido-eng 80.86 89.42 40.25 43.91 43.91
Tatoeba jav-eng 9.95 79.77 17.04 23.39 15.02
Tatoeba isl-eng 94.32 94.75 24.07 59.25 6.29
Tatoeba slv-eng 95.40 96.03 96.92 97.08 10.14
Tatoeba cym-eng 5.85 92.00 13.25 22.31 6.97
Tatoeba kaz-eng 53.30 87.49 34.89 61.49 3.32
Tatoeba est-eng 96.43 96.55 97.33 98.40 4.76
Tatoeba heb-eng 0.00 91.53 86.88 88.26 1.69
Tatoeba gla-eng 1.52 85.66 3.61 4.72 2.09
Tatoeba mar-eng 92.93 92.65 92.38 93.83 45.53
Tatoeba lat-eng 64.81 80.07 19.47 24.25 28.76
Tatoeba bel-eng 79.54 95.00 67.73 79.94 8.03
Tatoeba pms-eng 36.23 64.57 30.70 34.19 31.94
Tatoeba gle-eng 4.20 93.80 11.62 16.85 3.26
Tatoeba pes-eng 93.13 94.70 92.59 93.47 12.13
Tatoeba nob-eng 95.77 98.40 97.73 98.53 21.07
Tatoeba bul-eng 93.57 94.58 92.65 93.52 20.09
Tatoeba cbk-eng 77.17 79.44 55.37 58.68 64.63
Tatoeba hun-eng 95.20 96.55 91.58 94.18 5.07
Tatoeba uig-eng 56.49 92.40 24.39 48.35 1.27
Tatoeba rus-eng 92.58 93.75 91.87 92.92 59.84
Tatoeba spa-eng 97.33 98.40 95.42 97.00 94.48
Tatoeba hye-eng 88.72 94.09 93.28 94.38 0.50
Tatoeba tel-eng 96.72 97.86 36.40 79.73 64.62
Tatoeba afr-eng 92.59 96.18 58.22 72.96 16.62
Tatoeba mon-eng 3.42 95.91 95.04 96.14 2.85
Tatoeba arz-eng 66.16 76.00 51.26 55.69 70.66
Tatoeba hrv-eng 96.72 96.95 95.98 97.00 12.79
Tatoeba nov-eng 60.02 74.38 47.99 50.23 52.23
Tatoeba gsw-eng 27.52 46.50 25.74 25.12 21.03
Tatoeba nds-eng 77.13 79.42 32.16 38.88 23.92
Tatoeba ukr-eng 93.52 93.97 92.82 92.67 22.06
Tatoeba uzb-eng 23.20 84.23 17.14 23.19 4.71
Tatoeba lit-eng 96.20 96.47 93.16 95.37 4.49
Tatoeba ina-eng 93.93 95.37 79.13 84.32 73.67
Tatoeba lfn-eng 63.39 67.54 47.02 49.56 44.85
Tatoeba zsm-eng 95.41 95.62 95.31 95.80 79.95
Tatoeba ita-eng 94.32 92.72 93.05 93.76 65.04
Tatoeba cmn-eng 85.62 95.10 94.93 95.83 91.45
Tatoeba lvs-eng 95.33 95.88 97.87 97.53 6.55
Tatoeba glg-eng 96.14 96.82 94.00 95.32 79.86
Tatoeba ceb-eng 9.93 64.42 8.05 7.39 6.64
Tatoeba bre-eng 31.2 15.07 5.56 6.42 4.67
Tatoeba ben-eng 89.43 88.55 36.48 64.90 75.98
Tatoeba swg-eng 33.10 59.36 26.31 22.80 16.89
Tatoeba arq-eng 26.63 42.69 18.60 19.84 27.75
Tatoeba kab-eng 65.88 4.31 1.16 1.41 1.69
Tatoeba fra-eng 94.28 94.86 91.72 93.12 91.44
Tatoeba por-eng 94.54 94.14 92.13 93.02 92.62
Tatoeba tat-eng 34.74 85.92 10.25 10.89 3.59
Tatoeba oci-eng 58.13 65.81 38.57 43.49 40.17
Tatoeba pol-eng 97.32 97.22 94.28 96.95 14.09
Tatoeba war-eng 8.25 60.29 7.25 7.42 10.38
Tatoeba aze-eng 82.41 94.93 62.10 76.36 6.32
Tatoeba vie-eng 96.73 97.20 95.12 97.23 94.20
Tatoeba nno-eng 72.75 94.48 76.34 81.41 16.28
Tatoeba cha-eng 14.86 31.77 15.98 12.59 23.26
Tatoeba mhr-eng 6.86 15.74 6.89 7.57 1.56
Tatoeba dan-eng 95.22 95.71 94.80 96.17 23.52
Tatoeba ell-eng 96.20 95.35 95.43 94.93 5.34
Tatoeba amh-eng 80.82 91.47 36.21 53.49 0.03
Tatoeba pam-eng 3.24 10.73 5.41 5.39 5.85
Tatoeba hsb-eng 45.75 67.11 36.10 44.32 9.68
Tatoeba srp-eng 93.64 94.43 92.24 94.12 11.69
Tatoeba epo-eng 96.61 98.20 41.73 55.12 26.20
Tatoeba kzj-eng 4.46 11.33 6.24 5.88 5.17
Tatoeba awa-eng 33.74 71.70 33.43 42.83 35.01
Tatoeba fao-eng 57.04 87.40 27.51 38.24 12.61
Tatoeba mal-eng 98.16 98.45 32.20 88.46 83.30
Tatoeba ile-eng 87.88 85.58 57.71 60.36 59.59
Tatoeba bos-eng 95.86 94.92 93.27 94.02 13.65
Tatoeba cor-eng 4.45 10.11 3.42 3.53 2.83
Tatoeba cat-eng 95.80 95.38 94.42 96.05 88.31
Tatoeba eus-eng 93.32 95.01 23.18 31.33 53.38
Tatoeba yue-eng 87.75 89.58 71.45 77.58 77.03
Tatoeba swe-eng 95.31 95.63 94.42 95.45 19.53
Tatoeba dtp-eng 7.39 10.85 5.69 5.03 3.41
Tatoeba kat-eng 81.16 95.02 95.44 95.46 0.42
Tatoeba jpn-eng 93.78 95.38 90.41 92.51 71.36
Tatoeba csb-eng 27.03 52.57 21.56 23.73 10.03
Tatoeba xho-eng 4.68 91.55 4.52 6.53 5.51
Tatoeba orv-eng 23.24 38.93 15.10 23.77 5.79
Tatoeba ind-eng 92.98 93.66 92.74 93.50 88.04
Tatoeba tuk-eng 16.35 75.27 15.16 14.91 5.48
Tatoeba max-eng 36.96 63.26 45.25 48.77 36.14
Tatoeba swh-eng 55.66 84.50 14.48 16.02 16.74
Tatoeba hin-eng 95.32 96.87 97.62 97.75 85.23
Tatoeba dsb-eng 42.34 64.81 33.43 36.85 8.78
Tatoeba ber-eng 77.63 8.40 4.43 4.88 4.92
Tatoeba tam-eng 87.32 89.0 24.64 73.60 72.76
Tatoeba slk-eng 95.82 96.5 95.15 96.62 9.98
Tatoeba tgl-eng 63.19 96.02 13.09 17.67 10.70
Tatoeba ast-eng 76.35 90.68 62.17 70.08 71.13
Tatoeba mkd-eng 93.63 93.6 91.00 93.02 10.47
Tatoeba khm-eng 74.19 78.37 32.11 58.80 0.37
Tatoeba ces-eng 95.52 96.68 95.12 95.73 9.55
Tatoeba tzl-eng 36.56 58.88 25.46 34.21 27.82
Tatoeba urd-eng 84.23 93.22 94.57 95.12 70.10
Tatoeba ara-eng 90.14 88.80 87.93 90.19 85.37
Tatoeba kor-eng 87.97 90.95 92.52 93.07 22.39
Tatoeba yid-eng 2.49 88.79 14.38 30.73 0.16
Tatoeba fin-eng 96.98 96.37 93.10 95.92 3.41
Tatoeba tha-eng 96.38 96.14 96.72 95.99 2.22
Tatoeba wuu-eng 75.09 90.18 76.00 78.25 79.58
Average mix 67.42 81.75 57.98 63.38 31.08

Table 12: Multilingual bitext mining results. Scores are f1.
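To connect the F1 scores in Table 12 to the task definition in Appendix A.4: bitext mining retrieves, for every source sentence, the most similar sentence in the target language and compares the retrieved pairs against the gold alignments. The sketch below is a toy illustration under those assumptions; MTEB's own implementation may use a more refined matching criterion.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Toy parallel data: english[i] is the gold translation of italian[i]
# (the first pair is taken from Table 8, the second is made up).
italian = ["Chi le ha detto che Tom l'ha fatto?",
           "Il gatto dorme sul divano."]
english = ["Who told you that Tom did that?",
           "The cat is sleeping on the sofa."]

model = SentenceTransformer(
    "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")
src = model.encode(italian)
tgt = model.encode(english)

# Cosine similarity matrix between all source and target sentences.
src = src / np.linalg.norm(src, axis=1, keepdims=True)
tgt = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
similarity = src @ tgt.T

# Predict, for each source sentence, the most similar target sentence.
predictions = similarity.argmax(axis=1)
gold = np.arange(len(italian))

# With one gold pair and one prediction per sentence, precision,
# recall and F1 all reduce to the fraction of correctly matched pairs.
f1 = float((predictions == gold).mean())
print(f"F1: {f1:.2f}")
```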


Dataset Language LASER2 LaBSE MiniLM-L12-multilingual MPNet-multilingual SGPT-BLOOM-7.1B-msmarco
AmazonCounterfactualClassification de 67.82 73.17 68.35 69.95 61.35
AmazonCounterfactualClassification ja 68.76 76.42 63.45 69.79 58.23
AmazonReviewsClassification de 31.07 39.92 35.91 39.52 29.70
AmazonReviewsClassification es 32.72 39.39 37.49 39.99 35.97
AmazonReviewsClassification fr 31.12 38.52 35.30 39.00 35.92
AmazonReviewsClassification ja 28.94 36.44 33.24 36.64 27.64
AmazonReviewsClassification zh 30.89 36.45 35.26 37.74 32.63
MassiveIntentClassification af 38.01 56.12 45.88 52.32 47.85
MassiveIntentClassification am 12.70 55.71 36.75 41.55 33.30
MassiveIntentClassification ar 37.16 50.86 45.14 51.43 59.25
MassiveIntentClassification az 19.98 58.97 47.42 56.98 45.24
MassiveIntentClassification bn 42.51 58.22 35.34 48.79 61.59
MassiveIntentClassification cy 17.33 50.16 26.12 27.87 44.92
MassiveIntentClassification da 45.61 58.25 57.73 62.77 51.23
MassiveIntentClassification de 44.79 56.21 50.71 59.57 56.10
MassiveIntentClassification el 46.71 57.03 58.70 62.62 46.13
MassiveIntentClassification es 45.44 58.32 59.66 64.43 66.35
MassiveIntentClassification fa 45.01 62.33 61.02 65.34 51.20
MassiveIntentClassification fi 45.94 60.12 57.54 62.28 45.33
MassiveIntentClassification fr 46.13 60.47 60.25 64.82 66.95
MassiveIntentClassification he 42.55 56.55 52.51 58.21 43.18
MassiveIntentClassification hi 40.20 59.40 58.37 62.77 63.54
MassiveIntentClassification hu 42.77 59.52 60.41 63.87 44.73
MassiveIntentClassification hy 28.07 56.20 51.60 57.74 38.13
MassiveIntentClassification id 45.81 61.12 59.85 65.43 64.06
MassiveIntentClassification is 39.86 54.90 30.83 37.05 44.35
MassiveIntentClassification it 48.25 59.83 59.61 64.68 60.77
MassiveIntentClassification ja 45.30 63.11 60.89 63.74 61.22
MassiveIntentClassification jv 24.30 50.98 32.37 36.49 50.94
MassiveIntentClassification ka 22.70 48.35 43.03 49.85 33.84
MassiveIntentClassification km 22.48 48.55 40.04 45.47 37.34
MassiveIntentClassification kn 4.32 56.24 40.98 50.63 53.54
MassiveIntentClassification ko 44.26 60.99 50.30 61.82 53.36
MassiveIntentClassification lv 39.75 57.10 54.68 61.29 46.50
MassiveIntentClassification ml 41.33 57.91 42.41 54.34 58.27
MassiveIntentClassification mn 16.20 58.50 51.77 56.59 40.28
MassiveIntentClassification ms 43.23 58.60 54.76 60.70 59.65
MassiveIntentClassification my 25.37 57.35 52.01 57.09 37.42
MassiveIntentClassification nb 37.74 57.91 55.50 62.60 49.41
MassiveIntentClassification nl 45.00 59.37 59.51 63.57 52.09
MassiveIntentClassification pl 44.99 59.71 59.43 64.30 50.48
MassiveIntentClassification pt 48.55 60.16 61.27 64.89 66.69
MassiveIntentClassification ro 44.30 57.92 58.39 62.80 50.53
MassiveIntentClassification ru 44.29 60.67 59.04 63.26 58.32
MassiveIntentClassification sl 44.72 59.37 57.36 63.51 47.74
MassiveIntentClassification sq 46.12 58.03 56.59 62.49 48.94
MassiveIntentClassification sv 45.95 59.66 59.43 64.73 50.79
MassiveIntentClassification sw 31.89 51.62 29.57 31.95 49.81
MassiveIntentClassification ta 29.63 55.04 36.77 50.17 56.40
MassiveIntentClassification te 36.03 58.32 40.72 52.82 54.71
MassiveIntentClassification th 43.39 56.58 58.97 61.11 44.43
MassiveIntentClassification tl 29.73 55.28 33.67 38.83 50.21
MassiveIntentClassification tr 43.93 60.91 59.90 64.54 46.56
MassiveIntentClassification ur 26.11 56.70 52.80 56.37 56.75
MassiveIntentClassification vi 44.33 56.67 56.61 59.68 64.53
MassiveIntentClassification zh-CN 40.62 63.86 61.99 65.33 67.07
MassiveIntentClassification zh-TW 32.93 59.51 58.77 62.35 62.89
MassiveScenarioClassification af 47.10 63.39 53.64 59.67 51.47
MassiveScenarioClassification am 17.70 62.02 41.89 48.97 34.87
MassiveScenarioClassification ar 45.21 57.72 51.74 57.78 65.21
MassiveScenarioClassification az 28.21 63.48 52.06 61.53 45.58
MassiveScenarioClassification bn 50.52 61.84 41.17 54.53 67.30
MassiveScenarioClassification cy 22.58 56.13 31.72 35.26 46.29
MassiveScenarioClassification da 54.87 65.24 66.87 71.00 53.52
MassiveScenarioClassification de 54.34 62.39 57.40 67.34 61.74
MassiveScenarioClassification el 55.47 64.58 66.14 68.81 48.96
MassiveScenarioClassification es 52.77 63.61 65.04 70.42 73.34
MassiveScenarioClassification fa 52.50 67.46 65.86 69.88 53.17
MassiveScenarioClassification fi 52.63 64.58 63.75 67.60 44.69
MassiveScenarioClassification fr 54.32 65.10 66.06 70.69 72.91
MassiveScenarioClassification he 52.41 63.53 59.20 65.16 43.10
MassiveScenarioClassification hi 47.37 64.40 65.21 67.92 69.27
MassiveScenarioClassification hu 53.43 65.82 66.56 70.30 45.16
MassiveScenarioClassification hy 33.57 61.25 56.11 63.02 38.73
MassiveScenarioClassification id 54.38 65.84 66.16 70.73 70.13
MassiveScenarioClassification is 49.78 61.94 37.52 44.16 44.21
MassiveScenarioClassification it 54.84 64.09 65.00 69.73 65.57
MassiveScenarioClassification ja 54.12 67.72 66.50 69.69 65.76
MassiveScenarioClassification jv 32.71 58.29 38.60 44.20 54.79
MassiveScenarioClassification ka 26.92 53.38 50.66 57.30 32.99
MassiveScenarioClassification km 27.23 56.18 46.96 53.14 39.34
MassiveScenarioClassification kn 10.06 61.74 45.73 56.08 60.50
MassiveScenarioClassification ko 52.01 67.26 55.66 68.52 55.69
MassiveScenarioClassification lv 44.82 61.87 59.80 66.28 44.35
MassiveScenarioClassification ml 49.10 62.26 47.69 60.13 65.53
MassiveScenarioClassification mn 21.51 62.60 57.07 60.85 38.72
MassiveScenarioClassification ms 53.60 65.63 61.71 65.81 64.99
MassiveScenarioClassification my 29.72 62.94 59.10 63.03 36.84
MassiveScenarioClassification nb 43.90 64.29 64.25 70.24 51.80
MassiveScenarioClassification nl 53.33 65.16 65.52 70.37 56.32
MassiveScenarioClassification pl 52.92 64.56 65.04 68.99 49.98
MassiveScenarioClassification pt 53.41 63.28 65.79 70.09 71.46
MassiveScenarioClassification ro 50.48 62.41 64.17 67.95 53.69
MassiveScenarioClassification ru 51.84 65.25 65.24 69.92 61.60
MassiveScenarioClassification sl 51.29 64.25 64.01 70.81 48.04
MassiveScenarioClassification sq 55.65 64.54 64.31 69.63 50.06
MassiveScenarioClassification sv 54.64 66.01 67.14 71.60 51.73
MassiveScenarioClassification sw 42.04 58.36 34.86 37.29 54.22
MassiveScenarioClassification ta 36.72 59.08 42.62 55.96 62.77
MassiveScenarioClassification te 42.08 64.13 46.46 58.81 62.59
MassiveScenarioClassification th 52.15 64.34 67.01 69.44 45.18
MassiveScenarioClassification tl 37.34 60.23 37.37 43.99 52.06
MassiveScenarioClassification tr 52.56 65.43 66.55 70.40 47.21
MassiveScenarioClassification ur 32.60 61.52 60.43 62.90 64.26
MassiveScenarioClassification vi 50.97 61.05 60.72 65.71 70.61
MassiveScenarioClassification zh-CN 50.22 70.85 67.44 71.23 73.95
MassiveScenarioClassification zh-TW 42.32 67.08 65.70 68.73 70.30
MTOPDomainClassification de 74.08 86.95 79.20 85.73 82.05
MTOPDomainClassification es 73.47 84.07 83.04 86.96 93.55
MTOPDomainClassification fr 72.26 84.14 78.63 81.21 90.98
MTOPDomainClassification hi 72.95 85.11 81.36 84.76 89.33
MTOPDomainClassification th 72.68 81.24 79.99 82.51 60.49
MTOPIntentClassification de 51.62 63.42 54.23 61.27 61.92
MTOPIntentClassification es 52.75 64.44 60.28 66.59 74.49
MTOPIntentClassification fr 50.12 62.01 54.05 59.76 69.12
MTOPIntentClassification hi 45.55 62.58 59.90 62.37 64.85
MTOPIntentClassification th 50.07 64.61 61.96 64.80 49.36
Average mix 42.85 60.77 54.87 60.39 54.40

Table 13: Multilingual classification results. Scores are accuracy.
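The accuracies above are obtained by training a lightweight classifier on top of frozen text embeddings and evaluating it on held-out examples. The following sketch shows one way such a number can be produced, assuming a model.encode interface; the logistic-regression probe and its hyperparameters are illustrative and are not guaranteed to match the benchmark's exact settings.

# Minimal sketch of classification accuracy from frozen embeddings.
# model.encode is assumed; the probe below mirrors common practice
# but is not necessarily identical to MTEB's evaluation code.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def classification_accuracy(model, train_texts, train_labels, test_texts, test_labels):
    # Embed train and test splits with the frozen embedding model.
    X_train = np.asarray(model.encode(train_texts))
    X_test = np.asarray(model.encode(test_texts))

    # Fit a simple linear probe on the training embeddings.
    clf = LogisticRegression(max_iter=100)
    clf.fit(X_train, train_labels)

    # Report accuracy of the probe on the test embeddings.
    return accuracy_score(test_labels, clf.predict(X_test))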


Dataset Language Komninos LASER2 LaBSE MiniLM-L12-multilingual MPNet-multilingual SGPT-BLOOM-7.1B-msmarco
STS17 ko-ko 2.54 70.52 71.32 77.03 83.41 66.89
STS17 ar-ar 13.78 67.47 69.07 79.16 79.10 76.42
STS17 en-ar 9.08 65.05 74.51 81.22 80.85 78.07
STS17 en-de -3.11 66.66 73.85 84.22 83.28 59.10
STS17 en-tr -0.45 70.05 72.07 76.74 74.90 11.80
STS17 es-en -8.18 55.30 65.71 84.44 86.11 78.22
STS17 es-es 48.23 79.67 80.83 85.56 85.14 86.00
STS17 fr-en 5.81 70.82 76.98 76.59 81.17 80.46
STS17 it-en 3.64 70.98 76.99 82.35 84.24 51.58
STS17 nl-en -0.44 68.12 75.22 81.71 82.51 45.85
STS22 de 33.04 25.69 48.58 44.64 46.70 30.05
STS22 es 48.53 54.92 63.18 56.56 59.91 65.41
STS22 pl 12.47 18.34 39.30 33.74 33.65 31.13
STS22 tr 47.38 36.97 58.15 53.39 56.30 47.14
STS22 ar 32.42 42.57 57.67 46.20 52.19 58.67
STS22 ru 19.44 39.24 57.49 57.08 58.74 43.36
STS22 zh 4.78 49.41 63.02 58.75 61.75 66.78
STS22 fr 49.43 58.61 77.95 70.55 74.30 80.38
STS22 de-en 28.65 32.35 50.14 52.65 50.81 51.16
STS22 es-en 26.97 54.34 71.86 67.33 70.26 75.06
STS22 it 57.77 60.31 72.22 55.22 60.65 65.65
STS22 pl-en 45.55 53.63 69.41 69.02 73.07 53.31
STS22 zh-en 14.05 46.19 64.02 65.71 67.96 68.45
STS22 es-it 41.10 42.21 69.69 47.67 53.70 65.50
STS22 de-fr 14.77 37.41 53.28 51.73 62.34 53.28
STS22 de-pl 11.21 15.67 58.69 44.22 40.53 43.05
STS22 fr-pl 39.44 39.44 61.98 50.71 84.52 28.17
Average mix 22.14 51.55 65.67 64.23 67.71 57.81

Table 14: Multilingual STS results. Scores are Spearman correlations of cosine similarities.
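The STS scores above are Spearman rank correlations between human similarity annotations and the cosine similarity of the two sentence embeddings in each pair. A minimal sketch of that computation follows, again assuming a model.encode interface; it is an illustration of the reported metric, not the benchmark's exact code.

# Minimal sketch of the STS metric: Spearman correlation between gold
# similarity scores and cosine similarities of sentence embeddings.
# model.encode is assumed to return one vector per input sentence.
import numpy as np
from scipy.stats import spearmanr

def sts_spearman(model, sentences1, sentences2, gold_scores):
    emb1 = np.asarray(model.encode(sentences1))
    emb2 = np.asarray(model.encode(sentences2))

    # Row-wise cosine similarity between the two sides of each pair.
    cos_sims = np.sum(emb1 * emb2, axis=1) / (
        np.linalg.norm(emb1, axis=1) * np.linalg.norm(emb2, axis=1)
    )

    # Rank correlation against the human-annotated similarity scores.
    return spearmanr(gold_scores, cos_sims).correlation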
