MTEB: Massive Text Embedding Benchmark
Figure 1: An overview of tasks and datasets in MTEB. Multilingual datasets are marked with a purple shade.
Reproducibility: Through versioning at a dataset and software level, we aim to make it easy to reproduce results in MTEB. JSON files corresponding to all results available in this paper have been made available together with the MTEB benchmark3.

3 https://huggingface.co/datasets/mteb/results

3.2 Tasks and Evaluation

Figure 1 provides an overview of tasks and datasets available in MTEB. Dataset statistics are available in Table 2. The benchmark consists of the following 8 task types:
Bitext Mining Inputs are two sets of sentences from two different languages. For each sentence in the first set, the best match in the second set needs to be found. The matches are commonly translations. The provided model is used to embed each sentence and the closest pairs are found via cosine similarity. F1 serves as the main metric for bitext mining. Accuracy, precision and recall are also computed.
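As a concrete illustration, a minimal sketch of this matching step (not the MTEB implementation; `model.encode` is a stand-in for any embedding model that maps a list of sentences to a matrix of embeddings):

```python
# Embed both sets and, for every sentence in the first set, predict its
# nearest neighbour in the second set by cosine similarity.
import numpy as np

def mine_bitext(model, src_sentences, tgt_sentences):
    src = np.asarray(model.encode(src_sentences), dtype=float)
    tgt = np.asarray(model.encode(tgt_sentences), dtype=float)
    src /= np.linalg.norm(src, axis=1, keepdims=True)
    tgt /= np.linalg.norm(tgt, axis=1, keepdims=True)
    sims = src @ tgt.T                      # cosine similarity matrix
    return sims.argmax(axis=1)              # index of closest target per source sentence
```

The predicted indices are then compared against the gold translation pairs to compute F1, accuracy, precision and recall.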
Classification A train and test set are embedded with the provided model. The train set embeddings are used to train a logistic regression classifier with 100 maximum iterations, which is scored on the test set. The main metric is accuracy, with average precision and f1 additionally provided.
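A minimal sketch of this linear-probing protocol, again assuming a generic `model.encode` (illustrative only):

```python
# Fit a logistic regression probe (at most 100 iterations) on train
# embeddings and score it on test embeddings.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score

def evaluate_classification(model, x_train, y_train, x_test, y_test):
    train_emb = model.encode(x_train)
    test_emb = model.encode(x_test)
    clf = LogisticRegression(max_iter=100)
    clf.fit(train_emb, y_train)
    preds = clf.predict(test_emb)
    return {
        "accuracy": accuracy_score(y_test, preds),      # main metric
        "f1": f1_score(y_test, preds, average="macro"),
    }
```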
Clustering Given a set of sentences or paragraphs, the goal is to group them into meaningful clusters. A mini-batch k-means model with batch size 32 and k equal to the number of different labels (Pedregosa et al., 2011) is trained on the embedded texts. The model is scored using v-measure (Rosenberg and Hirschberg, 2007). V-measure does not depend on the cluster label, thus the permutation of labels does not affect the score.
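A minimal sketch of this clustering evaluation with scikit-learn (`model.encode` assumed as above; not the exact MTEB code):

```python
# Mini-batch k-means with batch size 32 and k equal to the number of gold
# label classes, scored with V-measure against the gold labels.
from sklearn.cluster import MiniBatchKMeans
from sklearn.metrics import v_measure_score

def evaluate_clustering(model, texts, labels):
    embeddings = model.encode(texts)
    kmeans = MiniBatchKMeans(n_clusters=len(set(labels)), batch_size=32)
    assignments = kmeans.fit_predict(embeddings)
    # V-measure compares the two partitions, so permuting cluster ids
    # does not change the score.
    return v_measure_score(labels, assignments)
```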
Pair Classification A pair of text inputs is provided and a label needs to be assigned. Labels are typically binary variables denoting duplicate or paraphrase pairs. The two texts are embedded and their distance is computed with various metrics (cosine similarity, dot product, euclidean distance, manhattan distance). Using the best binary threshold, accuracy, average precision, f1, precision and recall are computed. The average precision score based on cosine similarity is the main metric.
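For the cosine-similarity case, the scoring can be sketched as follows (illustrative only; `model.encode` assumed, binary labels in {0, 1}):

```python
# Average precision over the raw cosine similarities is the main metric;
# accuracy is reported at the best binary threshold over observed similarities.
import numpy as np
from sklearn.metrics import average_precision_score

def evaluate_pairs(model, texts_a, texts_b, labels):
    a = np.asarray(model.encode(texts_a), dtype=float)
    b = np.asarray(model.encode(texts_b), dtype=float)
    sims = (a * b).sum(axis=1) / (
        np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    )
    labels = np.asarray(labels)
    ap = average_precision_score(labels, sims)                 # main metric
    best_acc = max(((sims >= t) == labels).mean() for t in sims)
    return {"cosine_ap": ap, "best_threshold_accuracy": best_acc}
```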
Reranking Inputs are a query and a list of relevant and irrelevant reference texts. The aim is to rank the results according to their relevance to the query. The model is used to embed the references, which are then compared to the query using cosine similarity. The resulting ranking is scored for each query and averaged across all queries. Metrics are mean MRR@k and MAP, with the latter being the main metric.
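A minimal sketch of this per-query scoring (illustrative, not the MTEB implementation; `model.encode` assumed; MRR@k additionally cuts the ranking off at rank k):

```python
# Rank a query's candidates by cosine similarity, then compute average
# precision (MAP when averaged over queries) and reciprocal rank over the
# relevance labels in ranked order.
import numpy as np

def rank_by_cosine(model, query, candidates, relevance):
    q = np.asarray(model.encode([query]), dtype=float)[0]
    d = np.asarray(model.encode(candidates), dtype=float)
    sims = d @ q / (np.linalg.norm(d, axis=1) * np.linalg.norm(q))
    order = np.argsort(-sims)
    return np.asarray(relevance)[order]        # relevance labels in ranked order

def average_precision(ranked_relevance):
    hits, precisions = 0, []
    for rank, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)
    return float(np.mean(precisions)) if precisions else 0.0

def reciprocal_rank(ranked_relevance):
    for rank, rel in enumerate(ranked_relevance, start=1):
        if rel:
            return 1.0 / rank
    return 0.0
```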
Figure 2: Similarity of MTEB datasets. We use the best model on MTEB STS (ST5-XXL, see Table 1) to embed
100 samples for each dataset. Cosine similarities between the averaged embeddings are computed and visualized.
Table 1: Average of the main metric (see Section 3.2) per task per model on MTEB English subsets.
Paragraph to paragraph (P2P) A paragraph is compared with another paragraph. MTEB imposes no limit on the input length, leaving it up to the models to truncate if necessary. Several clustering tasks are framed as both S2S and P2P tasks. The former only compare titles, while the latter include both title and content. For ArxivClustering, for example, abstracts are concatenated to the title in the P2P setting.

Some datasets are derived from the same corpora, such as ClimateFEVER and FEVER, resulting in a score of 1. Clusters of similar datasets can be seen among CQADupstack variations and STS datasets. S2S and P2P variations of the same dataset tend to also be similar. Scientific datasets, such as SciDocsRR, SciFact and ArxivClustering, show high similarities among each other even when coming from different tasks (Reranking, Retrieval and Clustering in this case).
Figure 3: MTEB performance scales with model size. The smallest SGPT variant underperforms similar-sized GTR and ST5 variants. This may be due to the bias-only fine-tuning SGPT employs, which catches up with full fine-tuning only as model size and thus the number of bias parameters increases (Muennighoff, 2022).

4.1 Models

…embedding tasks, leading to a high representation of transformers (Vaswani et al., 2017). We group models into self-supervised and supervised methods.

Self-supervised methods (a) Transformer-based: BERT (Devlin et al., 2018) is trained using self-supervised mask and sentence prediction tasks. By taking the mean across the sequence length (mean-pooling), the model can directly be used to produce text embeddings. SimCSE-Unsup (Gao et al., 2021b) uses BERT as a foundation and performs additional self-supervised training. (b) Non-transformer: Komninos (Komninos and Manandhar, 2016) and Glove (Pennington et al., 2014) are two word embedding models that directly map words to vectors. Hence, their embeddings lack context awareness, but provide significant speed-ups.
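To illustrate how a pre-trained encoder can be used directly as an embedder, a minimal mean-pooling sketch with the Hugging Face transformers library (illustrative; the exact setup used in the paper may differ):

```python
# Mean-pool the final hidden states of bert-base-uncased over real
# (non-padding) tokens to obtain one embedding per input text.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embed(texts):
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state          # (batch, seq, dim)
    mask = batch["attention_mask"].unsqueeze(-1).float()   # (batch, seq, 1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)    # mean over real tokens
```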
Supervised methods The original transformer model (Vaswani et al., 2017) consists of an encoder and decoder network. Subsequent transformers often train only encoders like BERT (Devlin et al., 2018).

(a) Transformer encoder methods: coCondenser, Contriever (Izacard et al., 2021), LaBSE (Feng et al., 2020) and SimCSE-BERT-sup (Gao et al., 2021b) are based on the pre-trained BERT model (Devlin et al., 2018). coCondenser and Contriever add a self-supervised stage prior to supervised fine-tuning for a total of three training stages. LaBSE uses BERT to perform additional pre-training on parallel data to produce a competitive bitext mining model. SPECTER (Cohan et al., 2020a) relies on the pre-trained SciBERT (Beltagy et al., 2019) variant instead and fine-tunes on citation graphs. GTR (Ni et al., 2021b) and ST5 (Ni et al., 2021a) are based on the encoder part of the T5 model (Raffel et al., 2020) and only differ in their fine-tuning datasets. After additional self-supervised training, ST5 does contrastive fine-tuning on NLI (Ni et al., 2021a; Gao et al., 2021b), being geared towards STS tasks. Meanwhile, GTR fine-tunes on MSMARCO and focuses on retrieval tasks. MPNet and MiniLM correspond to fine-tuned embedding models (Reimers and Gurevych, 2019) of the pre-trained MPNet (Song et al., 2020) and MiniLM (Wang et al., 2020) models using diverse datasets to target any embedding use case.

(b) Transformer decoder methods: SGPT Bi-Encoders (Muennighoff, 2022) perform contrastive fine-tuning of <0.1% of pre-trained parameters using weighted-mean pooling. Similar to ST5 and GTR, SGPT-nli models are geared towards STS, while SGPT-msmarco models towards retrieval. SGPT-msmarco models embed queries and documents for retrieval with different special tokens to help the model distinguish their role. For non-retrieval tasks, we use its query representations. We benchmark publicly available SGPT models based on GPT-NeoX (Andonian et al., 2021), GPT-J (Wang and Komatsuzaki, 2021) and BLOOM (Scao et al., 2022). Alternatively, cpt-text (Neelakantan et al., 2022) passes pre-trained GPT decoders through a two-stage process using last token pooling to provide embeddings from decoders. We benchmark their models via the OpenAI Embeddings API4.

(c) Non-transformer: LASER (Heffernan et al., 2022) is the only context-aware non-transformer model we benchmark, relying on an LSTM (Hochreiter and Schmidhuber, 1997) instead. Similar to LaBSE, the model trains on parallel data and focuses on bitext mining applications.

4 https://beta.openai.com/docs/guides/embeddings
Figure 4: Performance, speed, and size of produced embeddings (size of the circles) of different embedding
models. Embedding sizes range from 1.2 kB (Glove / Komninos) to 16.4 kB (SGPT-5.8B) per example. Speed
was benchmarked on STS15 using 1x Nvidia A100 80GB with CUDA 11.6.
4.2 Analysis

Based on the results in Table 1, we observe that there is considerable variability between tasks. No model claims the state-of-the-art in all seven English tasks. There is even more variability in the results per dataset present in the appendix. Further, there remains a large gap between self-supervised and supervised methods. Self-supervised large language models have been able to close this gap in many natural language generation tasks (Chowdhery et al., 2022). However, they appear to still require supervised fine-tuning for competitive embedding performance.

We find that performance strongly correlates with model size, see Figure 3. A majority of MTEB tasks are dominated by multi-billion parameter models. However, these come at a significant cost, as we investigate in Section 4.3.

Classification ST5 models dominate the classification task across most datasets, as can be seen in detail in the full results in the appendix. ST5-XXL has the highest average performance, 3% ahead of the best non-ST5 model, OpenAI Ada Similarity.

Clustering Despite being almost 50x smaller, the MPNet embedding model is on par with the ST5-XXL state-of-the-art on Clustering. This may be due to the large variety of datasets MPNet (and MiniLM) has been fine-tuned on. Clustering requires coherent distances between a large number of embeddings. Models like SimCSE-sup or SGPT-nli, which are only fine-tuned on a single dataset, NLI, may produce incoherent embeddings when encountering topics unseen during fine-tuning. Relatedly, we find that the query embeddings of SGPT-msmarco and the Ada Search endpoint are competitive with SGPT-nli and the Ada Similarity endpoint, respectively. We refer to the public leaderboard5 for Ada Search results. This could be due to the MSMARCO dataset being significantly larger than NLI. Thus, while the OpenAI docs recommend using the similarity embeddings for clustering use cases6, the retrieval query embeddings may be the better choice in some cases.

5 https://huggingface.co/spaces/mteb/leaderboard
6 https://beta.openai.com/docs/guides/embeddings/similarity-embeddings
(b) Multilingual Classification (c) Multi- and Crosslingual STS
Figure 5: MTEB multilingual performance. Bitext mining is dominated by LaBSE, while classification and STS
results are mixed. SGPT-BLOOM-7B1-msmarco tends to perform well on the languages BLOOM has been pre-
trained on, such as Chinese, French and Portuguese.
Pair Classification GTR-XL and GTR-XXL have the strongest performance. Pair classification is closest to STS in its framing, yet models rank significantly differently on the two tasks. This highlights the importance of benchmarking on a diverse set of tasks to avoid blindly reusing a model for a different task.

Reranking MPNet and MiniLM models perform strongly on reranking tasks. On SciDocsRR (Cohan et al., 2020a) they perform far better than bigger models, which is likely due to parts of SciDocsRR being included in their training data. Our scale of experiments and that of model pre-training make controlling for data contamination challenging. Thus, we ignore overlap of MTEB datasets with model training datasets in MTEB scores. As long as enough datasets are averaged, we believe these effects to be insignificant.

Retrieval SGPT-5.8B-msmarco is the best embedding model on the BEIR subset in MTEB as well as on the full BEIR benchmark (Thakur et al., 2021; Muennighoff, 2022). The even larger 7.1B SGPT model making use of BLOOM (Scao et al., 2022) performs significantly weaker, which is likely due to the multilinguality of BLOOM. Models geared towards STS (SimCSE, ST5, SGPT-nli) perform badly on retrieval tasks. Retrieval tasks are unique in that there are two distinct types of texts: queries and documents ("asymmetric"), while other tasks only have a single type of text ("symmetric"). On the QuoraRetrieval dataset, which has been shown to be largely symmetric (Muennighoff, 2022), the playing field is more even, with SGPT-5.8B-nli outperforming SGPT-5.8B-msmarco, see Table 11.

STS & Summarization Retrieval models (GTR, SGPT-msmarco) perform badly on STS, while ST5-XXL has the highest performance. This highlights the bifurcation of the field into separate embedding models for retrieval (asymmetric) and similarity (symmetric) use cases (Muennighoff, 2022).

4.3 Efficiency

We investigate the latency-performance trade-off of models in Figure 4. The graph allows for significant elimination of model candidates in the model selection process. It brings model selection down to three clusters:

Maximum speed Word embedding models offer maximum speed, with Glove taking the lead on both performance and speed, thus making the choice simple in this case.

Maximum performance If latency is less important than performance, the left-hand side of the graph offers a cluster of highly performant, but slow models. Depending on the task at hand, GTR-XXL, ST5-XXL or SGPT-5.8B may be the right choice, see Section 4.2. SGPT-5.8B comes with the additional caveat of its high-dimensional embeddings requiring more storage.

Speed and performance The fine-tuned MPNet and MiniLM models lead the middle cluster, making the choice easy.
4.4 Multilinguality

MTEB comes with 10 multilingual datasets across bitext mining, classification and STS tasks. We investigate performance on these in Figure 5. Tabular results can be found in Tables 12, 13 and 14.

Bitext Mining LaBSE (Feng et al., 2020) performs strongly across a wide array of languages in bitext mining. Meanwhile, LASER2 shows high variance across different languages. While there are additional language-specific LASER2 models available for some of the languages we benchmark, we use the default multilingual LASER2 model for all languages. This is to provide a fair one-to-one comparison of models. In practice, however, the high variance of LASER2's performance may be resolved by mixing its model variants. MPNet, MiniLM and SGPT-BLOOM-7B1-msmarco perform poorly on languages they have not been pre-trained on, such as German for the latter.

Classification & STS On multilingual classification and STS, the multilingual MPNet provides the overall strongest performance. It outperforms the slightly faster multilingual MiniLM on almost all languages. Both models have been trained on the same languages, thus bringing decision-making down to performance vs speed. SGPT-BLOOM-7B1-msmarco provides state-of-the-art performance on languages like Hindi, Portuguese, Chinese or French, which the model has seen extensively during pre-training. It also performs competitively on languages like Russian or Japanese that unintentionally leaked into its pre-training data (Muennighoff et al., 2022). However, it is not much ahead of the much cheaper MPNet. LASER2 performs consistently worse than other models.

5 Conclusion

In this work, we presented the Massive Text Embedding Benchmark (MTEB). Consisting of 8 text embedding tasks with up to 15 datasets each and covering 112 languages, MTEB aims to provide reliable embedding performance estimates. By open-sourcing MTEB alongside a leaderboard, we provide a foundation for further pushing the state-of-the-art of available text embeddings.

To introduce MTEB, we have conducted the most comprehensive benchmarking of text embeddings to date. Through the course of close to 5,000 experiments on over 30 different models, we have set up solid baselines for future research to build on. We found model performance on different tasks to vary strongly, with no model claiming state-of-the-art on all tasks. Our studies on scaling behavior, model efficiency and multilinguality revealed various intricacies of models that should ease the decision-making process for future research or industry applications of text embeddings.

We welcome task, dataset or metric contributions to the MTEB codebase7 as well as additions to the leaderboard via our automatic submission format8.

7 https://github.com/embeddings-benchmark/mteb
8 https://huggingface.co/spaces/mteb/leaderboard
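For reference, evaluating a model and producing leaderboard-ready result files looks roughly like the following sketch. It assumes the `MTEB(tasks=...)` constructor and `run(model, output_folder=...)` method exposed by the mteb Python package at the time of writing (the exact API may differ between versions); the model checkpoint, task names and output folder are illustrative.

```python
# Hedged sketch of running two MTEB tasks with a sentence-transformers model;
# results are written as JSON files that can be submitted to the leaderboard.
from mteb import MTEB
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
evaluation = MTEB(tasks=["Banking77Classification", "STSBenchmark"])
evaluation.run(model, output_folder="results/all-MiniLM-L6-v2")
```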
Acknowledgments

This work was granted access to the HPC resources of Institut du développement et des ressources en informatique scientifique (IDRIS) du Centre national de la recherche scientifique (CNRS) under the allocation 2021-A0101012475 made by Grand équipement national de calcul intensif (GENCI). In particular, all the evaluations and data processing ran on the Jean Zay cluster of IDRIS, and we want to thank the IDRIS team for responsive support throughout the project, in particular Rémi Lacroix. We thank Douwe Kiela, Teven Le Scao and Nandan Thakur for feedback and suggestions.

References

Eneko Agirre, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre, and Weiwei Guo. 2013. *sem 2013 shared task: Semantic textual similarity. In Second Joint Conference on Lexical and Computational Semantics (*SEM), volume 1: proceedings of the Main conference and the shared task: semantic textual similarity, pages 32–43.

Loubna Ben Allal, Raymond Li, Denis Kocetkov, Chenghao Mou, Christopher Akiki, Carlos Munoz Ferrandis, Niklas Muennighoff, Mayank Mishra, Alex Gu, Manan Dey, et al. 2023. Santacoder: don't reach for the stars! arXiv preprint arXiv:2301.03988.

Alex Andonian, Quentin Anthony, Stella Biderman, Sid Black, Preetham Gali, Leo Gao, Eric Hallahan, Josh Levy-Kramer, Connor Leahy, Lucas Nestler, Kip Parker, Michael Pieler, Shivanshu Purohit, Tri Songz, Phil Wang, and Samuel Weinbach. 2021. GPT-NeoX: Large scale autoregressive language modeling in pytorch.
Jianmo Ni, Chen Qu, Jing Lu, Zhuyun Dai, Gustavo Hernández Ábrego, Ji Ma, Vincent Y Zhao, Yi Luan, Keith B Hall, Ming-Wei Chang, et al. 2021b. Large dual encoders are generalizable retrievers. arXiv preprint arXiv:2112.07899.

Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. 2021. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741.

James O'Neill, Polina Rozenshtein, Ryuichi Kiryo, Motoko Kubota, and Danushka Bollegala. 2021. I wish i would have loved this one, but i didn't – a multilingual dataset for counterfactual detection in product reviews.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.

Jeffrey Pennington, Richard Socher, and Christopher D Manning. 2014. Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pages 1532–1543.

Elvis Saravia, Hsien-Chi Toby Liu, Yen-Hao Huang, Junlin Wu, and Yi-Shin Chen. 2018. CARER: Contextualized affect representations for emotion recognition. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3687–3697, Brussels, Belgium. Association for Computational Linguistics.

Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilić, Daniel Hesslow, Roman Castagné, Alexandra Sasha Luccioni, François Yvon, Matthias Gallé, et al. 2022. Bloom: A 176b-parameter open-access multilingual language model. arXiv preprint arXiv:2211.05100.

Darsh Shah, Tao Lei, Alessandro Moschitti, Salvatore Romeo, and Preslav Nakov. 2018. Adversarial domain adaptation for duplicate question detection. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1056–1063, Brussels, Belgium. Association for Computational Linguistics.

Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu. 2020. Mpnet: Masked and permuted pre-training for language understanding. Advances in Neural Information Processing Systems, 33:16857–16867.

Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, et al. 2022. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. arXiv preprint arXiv:2206.04615.

Hao Tan and Mohit Bansal. 2019. Lxmert: Learning cross-modality encoder representations from transformers. arXiv preprint arXiv:1908.07490.

Nandan Thakur, Nils Reimers, Andreas Rücklé, Abhishek Srivastava, and Iryna Gurevych. 2021. Beir: A heterogenous benchmark for zero-shot evaluation of information retrieval models.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems, 30.

Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2019. Superglue: A stickier benchmark for general-purpose language understanding systems. Advances in neural information processing systems, 32.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R Bowman. 2018. Glue: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461.

Ben Wang and Aran Komatsuzaki. 2021. GPT-J-6B: A 6 Billion Parameter Autoregressive Language Model. https://github.com/kingoflolz/mesh-transformer-jax.

Kexin Wang, Nils Reimers, and Iryna Gurevych. 2021. Tsdae: Using transformer-based sequential denoising auto-encoder for unsupervised sentence embedding learning. arXiv preprint arXiv:2104.06979.

Wenhui Wang, Furu Wei, Li Dong, Hangbo Bao, Nan Yang, and Ming Zhou. 2020. Minilm: Deep self-attention distillation for task-agnostic compression of pre-trained transformers. Advances in Neural Information Processing Systems, 33:5776–5788.

Fangzhao Wu, Ying Qiao, Jiun-Hung Chen, Chuhan Wu, Tao Qi, Jianxun Lian, Danyang Liu, Xing Xie, Jianfeng Gao, Winnie Wu, et al. 2020. Mind: A large-scale dataset for news recommendation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 3597–3606.

Wei Xu, Chris Callison-Burch, and William B Dolan. 2015. Semeval-2015 task 1: Paraphrase and semantic similarity in twitter (pit). In Proceedings of the 9th international workshop on semantic evaluation (SemEval 2015), pages 1–11.

Xinyu Zhang, Nandan Thakur, Odunayo Ogundepo, Ehsan Kamalloo, David Alfonso-Hermelo, Xiaoguang Li, Qun Liu, Mehdi Rezagholizadeh, and Jimmy Lin. 2022. Making a miracl: Multilingual information retrieval across a continuum of languages. arXiv preprint arXiv:2210.09984.

Jeffrey Zhu, Mingqin Li, Jason Li, and Cassandra Oduola. 2021. Bing delivers more contextualized search using quantized transformer inference on nvidia gpus in azure.

Pierre Zweigenbaum, Serge Sharoff, and Reinhard Rapp. 2016. Towards preparation of the second bucc shared task: Detecting parallel sentences in comparable corpora. In Proceedings of the Ninth Workshop on Building and Using Comparable Corpora. European Language Resources Association (ELRA), Portoroz, Slovenia, pages 38–43.

Pierre Zweigenbaum, Serge Sharoff, and Reinhard Rapp. 2017. Overview of the second bucc shared task: Spotting parallel sentences in comparable corpora. In Proceedings of the 10th Workshop on Building and Using Comparable Corpora, pages 60–67.

Pierre Zweigenbaum, Serge Sharoff, and Reinhard Rapp. 2018. Overview of the third bucc shared task: Spotting parallel sentences in comparable corpora. In Proceedings of 11th workshop on building and using comparable corpora, pages 39–42.
MassiveIntent (FitzGerald et al., 2022) A collection of Amazon Alexa virtual assistant utterances annotated with the associated intent. For each user utterance the label is one of 60 intents like 'play_music', 'alarm_set', etc. This is a multilingual dataset with 51 available languages.

MassiveScenario (FitzGerald et al., 2022) A collection of Amazon Alexa virtual assistant utterances annotated with the associated intent. For each user utterance the label is a theme among 60 scenarios like 'music', 'weather', etc. This is a multilingual dataset with 51 available languages.

MTOPDomain / MTOPIntent Multilingual sentence datasets from the MTOP (Li et al., 2020) benchmark. We refer to their paper for details.

ToxicConversations Dataset from a Kaggle competition14. Collection of comments from the Civil Comments platform together with annotations of whether the comment is toxic or not.

TweetSentimentExtraction Dataset from a Kaggle competition15. Sentiment classification of tweets as neutral, positive or negative.

A.3 Pair Classification

SprintDuplicateQuestions (Shah et al., 2018) Collection of questions from the Sprint community. The goal is to classify a pair of sentences as duplicates or not.

TwitterSemEval2015 (Xu et al., 2015) Paraphrase pairs of tweets from the SemEval 2015 workshop. The goal is to classify a pair of tweets as paraphrases or not.

Tatoeba (Research) Tatoeba provides sets of sentences (1,000 sentences each) for 112 languages with annotated associated pairs. Each pair is one sentence and its translation in another language.

A.5 Reranking

AskUbuntuDupQuestions16 Questions from AskUbuntu with manual annotations marking pairs of questions as similar or dissimilar.

MindSmall (Wu et al., 2020) Large-scale English dataset for news recommendation research. Ranking news article titles given the title of a news article. The idea is to recommend other news from the one you are reading.

SciDocsRR (Cohan et al., 2020b) Ranking of related scientific papers based on their title.

StackOverflowDupQuestions (Liu et al., 2018) Stack Overflow duplicate questions task for questions with the tags Java, JavaScript and Python, ranking questions as duplicates or not.

A.6 Semantic Textual Similarity (STS)

STS12, STS13, STS14, STS15, STS16, STS17, STS22, STSBenchmark (Agirre et al., 2012, 2013)17,18,19,20 Original STS benchmark, with scores from 0 to 5. The selection of sentences includes text from image captions, news headlines and user forums. In total they contain between 1,000 and 20,000 sentences. STS12 - STS16 and STSBenchmark are monolingual English benchmarks. STS17 and STS22 contain crosslingual pairs of sentences, where the goal is to assess the similarity of two sentences in different languages. STS17 has 11 language pairs (among Korean, Arabic, English, French, German, Turkish, Spanish, Italian and Dutch) and STS22 has 18 language pairs (among Arabic, English, French, German, Turkish, Spanish, Polish, Italian, Russian and Chinese).

14 https://www.kaggle.com/competitions/jigsaw-unintended-bias-in-toxicity-classification
15 https://www.kaggle.com/competitions/tweet-sentiment-extraction
16 https://github.com/taolei87/askubuntu
17 https://alt.qcri.org/semeval2014/task10/
18 https://alt.qcri.org/semeval2015/task2/
19 https://alt.qcri.org/semeval2016/task1/
20 https://competitions.codalab.org/competitions/33835
BIOSSES21 Contains 100 sentence pairs from the biomedical field.

SICK-R (Agirre et al., 2014) Sentences Involving Compositional Knowledge (SICK) contains a large number of sentence pairs (10,000) that are lexically, syntactically and semantically rich.

A.7 Summarization

SummEval (Fabbri et al., 2020) Summaries generated by recent summarization models trained on CNN or DailyMail, alongside human annotations.

A.8 Retrieval

We refer to the BEIR paper (Thakur et al., 2021), which contains descriptions of each dataset. For MTEB, we include all publicly available datasets: ArguAna, ClimateFEVER, CQADupstack, DBPedia, FEVER, FiQA2018, HotpotQA, MSMARCO, NFCorpus, NQ, Quora, SCIDOCS, SciFact, Touche2020, TRECCOVID.

B Limitations of MTEB

While MTEB aims to be a diverse benchmark to provide holistic performance reviews, the benchmark has its limitations. We list them here:

1. Long document datasets MTEB covers multiple text lengths (S2S, P2P, S2P), but very long documents are still missing. The longest datasets in MTEB have a few hundred words, and longer text sizes could be relevant for use cases like retrieval.

2. Task imbalance Tasks in MTEB have different numbers of datasets, with summarization consisting of only a single dataset. This means MTEB average scores, which are computed over all datasets, are biased towards tasks with many datasets, notably retrieval, classification and clustering. As MTEB grows, we hope to add more datasets to currently underrepresented tasks like summarization or pair classification.

3. Multilinguality MTEB contains multilingual classification, STS and bitext mining datasets. However, retrieval and clustering are English-only. SGPT-BLOOM-7B1-msmarco is geared towards multilingual retrieval datasets and, due to the lack thereof, cannot be comprehensively benchmarked in MTEB. Further, MTEB does not contain any code datasets that could be used to benchmark code models (Neelakantan et al., 2022; Allal et al., 2023). It should be easy to extend MTEB with datasets such as CodeSearchNet (Husain et al., 2019), TyDI QA (Clark et al., 2020), XOR QA (Asai et al., 2020) or MIRACL (Zhang et al., 2022).

4. Additional modalities Text embeddings are commonly used as input features for downstream models, such as in our classification task. This can involve other modalities, notably image content (Carvalho et al., 2018; Tan and Bansal, 2019; Muennighoff, 2020; Nichol et al., 2021; Saharia et al., 2022; Weinbach et al., 2022). We have focused solely on natural language applications and leave extensive benchmarking of text embeddings as inputs for other modalities to future work.

C Examples

Tables 3-9 provide examples for each dataset for each task. For retrieval datasets, we refer to the BEIR paper (Thakur et al., 2021).

D Correlations

Figure 6 provides correlation heatmaps for model performance and MTEB tasks.

E Models

Table 10 provides publicly available model checkpoints used for MTEB evaluation.

F Additional results

Tables 11 onwards provide results on individual datasets of MTEB. The results are additionally available in JSON format on the Hugging Face Hub22 and can be inspected on the leaderboard23.

21 https://tabilab.cmpe.boun.edu.tr/BIOSSES/DataSet.html
22 https://huggingface.co/datasets/mteb/results
23 https://huggingface.co/spaces/mteb/leaderboard
Dataset Text Label
AmazonCounterfactualClassification In person it looks as though it would have cost a lot more. counterfactual
AmazonPolarityClassification an absolute masterpiece I am quite sure any of you actually taking the time to read this have played the game at least positive
once, and heard at least a few of the tracks here. And whether you were aware of it or not, Mitsuda’s music contributed
greatly to the...
AmazonReviewsClassification solo llega una unidad cuando te obligan a comprar dos Te obligan a comprar dos unidades y te llega solo una y no hay 0
forma de reclamar, una autentica estafa, no compreis!!
EmotionClassification i feel so inhibited in someone elses kitchen like im painting on someone elses picture sadness
ImdbClassification When I first saw a glimpse of this movie, I quickly noticed the actress who was playing the role of Lucille Ball. Rachel negative
York’s portrayal of Lucy is absolutely awful. Lucille Ball was an astounding comedian with incredible talent. To think
about a legend like Lucille Ball being portrayed the way she was in the movie is horrendous. I cannot believe...
TweetSentimentExtractionClassification I really really like the song Love Story by Taylor Swift positive
ArxivClusteringP2P Finite groups of rank two which do not involve Qd(p). Let p > 3 be a prime. We show that if G is a finite group math
with p-rank equal to 2, then G involves Qd(p) if and only if G p0 -involves Qd(p). This allows us to use a version
of Glauberman’s ZJ-theorem to give a more direct construction of finite group actions on mod-p homotopy spheres.
We give an example to illustrate that the above conclusion does not hold for p ≤ 3.
ArxivClusteringS2S Vertical shift and simultaneous Diophantine approximation on polynomial curves math
BiorxivClusteringP2P Innate Immune sensing of Influenza A viral RNA through IFI16 promotes pyroptotic cell death Programmed cell death immunology
pathways are triggered by various stresses or stimuli, including viral infections. The mechanism underlying the regula-
tion of these pathways upon Influenza A virus IAV infection is not well characterized. We report that a cytosolic DNA
sensor IFI16 is...
BiorxivClusteringS2S Association of CDH11 with ASD revealed by matched-gene co-expression analysis and mouse behavioral neuroscience
MedrxivClusteringP2P Temporal trends in the incidence of haemophagocytic lymphohistiocytosis: a nationwide cohort study from England infectious diseases
2003-2018. Haemophagocytic lymphohistiocytosis (HLH) is rare, results in high mortality and is increasingly being
diagnosed. Little is known about what is driving the apparent rise in the incidence of this disease. Using national linked
electronic health data from hospital admissions and death certification cases of HLH that were diagnosed in England
between 1/1/2003 and 31/12/2018 were identified using a previously validated approach. We calculated incidence...
MedrxivClusteringS2S Current and Lifetime Somatic Symptom Burden Among Transition-aged Young Adults on the Autism Spectrum psychiatry and clinical psychology
RedditClustering Could anyone tell me what breed my bicolor kitten is? r/cats
RedditClusteringP2P Headaches after working out? Hey guys! I’ve been diagnosed with adhd since I was seven. I just recently got rediag- r/ADHD
nosed (22f) and I’ve been out on a different medication, adderall I was normally taking vyvanse but because of cost and
no insurance adderall was more affordable. I’ve noticed that if I take adderall and workout...
StackExchangeClusteringP2P Google play services error DEBUG: Application is pausing, which disconnects the RTMP client. I am having this issue unity
from past day with Google Play Services Unity. What happens is, when I install app directly ot device via Unity, the
Google Play Services work fine but when I upload it as beta to play store console and install it via that then it starts to
give " DEBUG: Application is pausing, which disconnects the RTMP client" error. I have a proper SHA1 key.
SprintDuplicateQuestions Franklin U722 USB modem signal strength How do I know if my Franklin U772 USB Modem has a 1
weak signal ?
TwitterSemEval2015 All the home alones watching 8 mile","All the home alones The last rap battle in 8 Mile nevr gets old ahah 0
watching 8 mile
TwitterURLCorpus How the metaphors we use to describe discovery affect men Light Bulbs or Seeds ? How Metaphors for Ideas Influence 0
and women in the sciences Judgments About Genius
AskUbuntuDupQuestions change the application icon theme but not changing the change folder icons in ubuntu-mono-dark theme change steam tray icon back to default
panel icons
MindSmallReranking Man accused in probe of Giuliani associates is freed on bail Studies show these are the best and worst states for your There are 14 cheap days to fly left in 2019: When are they
retirement and what deals can you score?
SciDocsRR Discovering social circles in ego networks Benchmarks for testing community detection algorithms on Improving www proxies performance with greedy-dual-
directed and weighted graphs with overlapping communi- size-frequency caching policy
ties.
StackOverflowDupQuestions Java launch error selection does not contain a main type Error: Selection does not contain a main type Selection Sort in Java
BIOSSES It has recently been shown that Craf is essential for Kras It has recently become evident that Craf is essential for the 4.0
G12D-induced NSCLC. onset of Kras-driven non-small cell lung cancer.
SICK-R A group of children is playing in the house and there is no A group of kids is playing in a yard and an old man is stand- 3.2
man standing in the background ing in the background
STS12 Nationally, the federal Centers for Disease Control and Pre- There were 293 human cases of West Nile in Indiana in 1.7
vention recorded 4,156 cases of West Nile, including 284 2002, including 11 deaths statewide.
deaths.
STS13 this frame has to do with people ( the residents ) residing in inhabit or live in ; be an inhabitant of ; 2.8
locations , sometimes with a co-resident .
STS14 then the captain was gone. then the captain came back. 0.8
STS15 you ’ll need to check the particular policies of each pub- if you need to publish the book and you have found one 3.0
lisher to see what is allowed and what is not allowed. publisher that allows it.
STS16 you do not need to worry. you don ’t have to worry. 5.0
STS17 La gente muestra su afecto el uno por el otro. A women giving something to other lady. 1.4
STS22 El secretario general de la Asociación Gremial de los Tra- En diálogo con el servicio informativo de la Radio Pública, 1
bajadores del Subte y Premetro de Metrodelegados, Beto el ministro de Salud de la Nación, Ginés González García,
Pianelli, dijo que el Gobierno porteño debe convocar “in- habló sobre el avance del coronavirus en la Argentina y se
mediatamente” a licitación para la compra de nuevos trenes manifestó a favor de prorrogar la cuarentena obligatoria dis-
y retirar los que quedan en circulación... puesta por...
STSBenchmark A man is playing the cello. A man seated is playing the cello. 4.25
BUCC Morales remporte l’élection présidentielle de 2005 à la ma- Morales went on to win the 2005 presidential election with
jorité absolue. an absolute majority.
Tatoeba Chi le ha detto che Tom l’ha fatto? Who told you that Tom did that?
SummEval V. Stiviano must pay back $2.6 million in gifts from Donald donald sterling , nba team last year . sterling ’s wife sued 1.7
Sterling. Sterling’s wife claimed the ex-Clippers used the for $ 2.6 million in gifts . sterling says he is the former
couple’s money for the gifts. The items included a Ferrari, female companion who has lost the . sterling has ordered
two Bentleys and a Range Rover. v. stiviano to pay back $ 2.6 m in gifts after his wife sued .
sterling also includes a $ 391 easter bunny costume , $ 299
and a $ 299 .
(a) Model correlation based on all results (b) Task correlation based on average task results
Figure 6: Pearson correlations across model and task results. Left: Size variants of the same architecture show
high correlations. Right: Performance on clustering and reranking correlates strongest, while summarization and
classification show weaker correlation with other tasks.
Glove https://huggingface.co/sentence-transformers/average_word_embeddings_glove.6B.300d
Komninos https://huggingface.co/sentence-transformers/average_word_embeddings_komninos
BERT https://huggingface.co/bert-base-uncased
SimCSE-BERT-unsup https://huggingface.co/princeton-nlp/unsup-simcse-bert-base-uncased
SimCSE-BERT-sup https://huggingface.co/princeton-nlp/sup-simcse-bert-base-uncased
coCondenser-msmarco https://huggingface.co/sentence-transformers/msmarco-bert-co-condensor
Contriever https://huggingface.co/nthakur/contriever-base-msmarco
SPECTER https://huggingface.co/sentence-transformers/allenai-specter
LaBSE https://huggingface.co/sentence-transformers/LaBSE
LASER2 https://github.com/facebookresearch/LASER
MiniLM-L6 https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2
MiniLM-L12 https://huggingface.co/sentence-transformers/all-MiniLM-L12-v2
MiniLM-L12-multilingual https://huggingface.co/sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
MPNet https://huggingface.co/sentence-transformers/all-mpnet-base-v2
MPNet-multilingual https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2
SGPT-125M-nli https://huggingface.co/Muennighoff/SGPT-125M-weightedmean-nli-bitfit
SGPT-5.8B-nli https://huggingface.co/Muennighoff/SGPT-5.8B-weightedmean-nli-bitfit
SGPT-125M-msmarco https://huggingface.co/Muennighoff/SGPT-125M-weightedmean-msmarco-specb-bitfit
SGPT-1.3B-msmarco https://huggingface.co/Muennighoff/SGPT-1.3B-weightedmean-msmarco-specb-bitfit
SGPT-2.7B-msmarco https://huggingface.co/Muennighoff/SGPT-2.7B-weightedmean-msmarco-specb-bitfit
SGPT-5.8B-msmarco https://huggingface.co/Muennighoff/SGPT-5.8B-weightedmean-msmarco-specb-bitfit
SGPT-BLOOM-7.1B-msmarco https://huggingface.co/bigscience/sgpt-bloom-7b1-msmarco
SGPT-BLOOM-1.7B-nli https://huggingface.co/bigscience-data/sgpt-bloom-1b7-nli
GTR-Base https://huggingface.co/sentence-transformers/gtr-t5-base
GTR-Large https://huggingface.co/sentence-transformers/gtr-t5-large
GTR-XL https://huggingface.co/sentence-transformers/gtr-t5-xl
GTR-XXL https://huggingface.co/sentence-transformers/gtr-t5-xxl
ST5-Base https://huggingface.co/sentence-transformers/sentence-t5-base
ST5-Large https://huggingface.co/sentence-transformers/sentence-t5-large
ST5-XL https://huggingface.co/sentence-transformers/sentence-t5-xl
ST5-XXL https://huggingface.co/sentence-transformers/sentence-t5-xxl
TwentyNewsgroupsClustering 25.83 27.42 23.35 23.21 34.86 38.68 46.82 24.22 23.28 11.38 46.86 47.47 39.28 49.74 44.10 36.26 28.24 37.20 32.13 40.06 40.89 39.46 37.28 46.72 51.64 51.15 50.44 48.10 49.33 49.53 50.93
SprintDuplicateQuestions 86.96 85.55 36.81 69.41 69.39 96.09 95.55 71.63 89.26 65.54 94.55 92.45 89.46 90.15 90.55 77.85 77.73 80.54 89.89 92.58 93.47 93.84 94.93 94.55 95.05 95.45 95.68 91.23 89.01 91.44 88.89
TwitterSemEval2015 48.45 53.85 55.90 60.21 67.75 65.95 66.85 43.25 62.78 59.57 67.86 70.02 62.06 73.85 66.75 69.04 57.09 66.00 54.75 62.37 63.68 66.87 65.31 72.23 76.03 77.81 77.54 78.25 79.75 80.89 80.28
TwitterURLCorpus 77.35 79.41 76.29 81.37 83.89 83.17 85.21 69.22 84.58 81.47 84.70 84.77 83.83 85.11 85.14 83.69 80.51 84.54 81.06 83.79 84.80 85.29 85.46 84.77 84.89 85.14 85.13 86.05 86.14 85.86 86.01
AskUbuntuDupQuestions 49.57 50.88 45.84 51.57 51.80 58.99 56.69 50.07 52.75 48.99 63.48 64.06 60.49 65.85 60.16 53.49 52.63 55.90 55.84 58.13 59.63 61.63 59.97 60.86 61.64 63.08 63.23 59.73 61.51 62.86 66.16
MindSmallReranking 27.01 28.92 28.37 28.62 29.30 27.13 31.58 24.80 29.81 24.79 30.80 31.02 30.37 30.97 30.15 30.71 29.27 31.11 30.40 31.34 31.72 32.29 31.79 31.33 31.84 31.50 31.93 30.20 30.27 29.77 30.60
SciDocsRR 62.56 63.55 64.94 66.33 70.14 72.78 76.51 81.31 68.72 54.99 87.12 87.20 77.78 88.65 78.09 71.04 68.36 77.54 71.34 77.21 77.72 80.79 79.77 73.71 76.39 76.49 77.96 73.96 74.88 75.16 76.09
StackOverflowDupQuestions 34.03 35.65 34.62 39.35 38.90 48.48 47.78 36.22 42.42 36.98 50.76 51.47 45.85 51.98 46.79 40.85 39.97 44.77 44.74 49.32 49.61 51.53 51.07 51.01 51.58 52.79 53.50 48.46 49.34 51.05 52.85
ArguAna 36.30 30.96 28.29 38.34 38.33 45.15 48.32 32.67 34.18 12.86 50.17 47.13 44.88 46.52 48.91 39.65 31.04 35.07 45.42 49.68 50.49 51.38 47.28 50.83 52.09 52.81 53.77 44.85 39.27 39.40 39.85
ClimateFEVER 14.44 14.87 5.41 11.80 11.98 16.96 24.79 6.86 3.83 0.36 20.27 21.57 18.49 21.97 15.27 2.83 11.01 17.57 21.86 26.6 27.11 30.46 29.39 24.88 26.90 27.01 27.21 10.37 11.36 10.61 14.63
CQADupstackRetrieval 15.47 16.79 5.51 13.22 14.50 27.72 33.67 14.60 18.75 4.12 41.32 42.53 30.71 44.96 31.32 10.17 20.29 29.98 27.25 33.33 36.53 39.40 39.62 34.55 36.62 37.35 38.56 35.23 38.96 40.78 44.65
DBPedia 18.29 15.88 4.13 15.04 19.73 27.86 38.10 4.14 15.57 1.53 32.33 33.36 22.63 32.09 26.22 3.48 10.87 26.10 22.72 31.51 34.70 39.87 39.03 35.24 39.55 39.74 41.28 27.77 31.55 33.65 39.19
FEVER 14.99 15.56 3.30 21.05 20.41 45.68 59.29 5.45 12.17 0.77 51.93 55.91 52.66 50.86 56.76 4.45 18.40 38.64 60.45 68.12 72.73 78.24 73.97 68.93 72.66 72.18 74.08 26.16 36.21 36.12 51.20
FiQA2018 10.09 10.49 2.19 9.84 10.41 15.62 27.42 5.64 7.00 1.73 36.87 37.27 20.33 49.96 22.96 7.54 8.94 18.59 21.12 29.99 33.29 37.20 35.84 35.15 42.79 44.19 46.78 34.83 43.55 44.71 46.68
HotpotQA 19.18 20.77 8.26 19.75 22.89 35.61 56.81 5.46 18.75 5.50 46.51 44.59 30.01 39.29 37.03 12.6 17.73 33.99 40.88 49.93 52.84 59.26 57.26 54.93 57.85 58.91 59.67 33.20 33.95 37.17 42.14
MSMARCO 9.60 9.75 1.91 9.35 11.00 29.57 36.77 5.58 7.60 1.09 36.54 39.03 23.72 39.75 26.60 10.53 6.27 15.83 27.98 36.05 38.83 39.91 41.12 41.16 42.73 43.52 44.05 20.71 23.96 25.17 27.68
NFCorpus 13.87 11.79 4.30 9.88 12.42 22.29 31.31 0.84 16.54 2.44 31.59 32.25 23.45 33.29 25.49 20.59 11.80 28.26 22.79 32.08 33.89 36.21 35.78 30.22 32.63 33.34 34.18 28.64 31.10 33.18 35.08
NQ 12.87 12.75 2.61 11.69 16.08 29.85 41.83 5.99 8.42 0.64 43.87 46.47 29.80 50.45 33.60 2.02 7.63 24.63 29.73 42.94 46.70 52.41 53.15 50.47 55.09 56.16 57.24 36.32 42.02 46.29 52.87
QuoraRetrieval 71.32 71.58 61.03 78.03 79.62 86.51 86.72 64.65 77.03 71.14 87.56 87.75 86.55 87.46 86.41 82.18 78.96 84.68 72.98 85.28 85.60 84.58 74.71 87.98 88.47 88.91 89.09 85.49 85.73 85.85 85.96
SCIDOCS 8.04 8.47 2.81 5.50 7.53 10.13 17.12 0.00 5.63 0.78 21.64 21.82 0.03 23.77 13.96 6.28 7.13 13.55 12.21 16.18 16.57 19.87 18.62 14.00 15.51 15.71 15.88 14.16 15.38 15.97 17.17
SciFact 29.58 29.53 13.34 25.72 29.59 52.31 65.51 47.88 38.20 4.04 64.51 62.64 48.37 65.57 50.30 45.46 31.79 46.66 56.90 68.29 70.17 74.70 72.11 59.74 63.42 64.20 66.77 45.76 49.91 50.91 55.38
Touche2020 13.99 13.17 0.97 8.90 9.89 8.57 15.79 8.46 4.88 1.06 16.90 17.22 16.06 19.93 17.40 3.1 12.27 16.18 22.97 24.45 23.44 25.43 23.98 25.89 28.29 25.26 26.76 20.30 21.63 22.51 21.65
TRECCOVID 36.22 35.92 14.74 26.2 22.93 40.54 44.77 29.91 16.34 10.97 47.25 50.82 39.12 51.33 37.87 24.56 39.31 55.35 70.30 72.98 75.17 84.88 81.37 56.05 56.68 60.09 51.90 40.70 46.11 54.77 59.48
BIOSSES 44.93 50.25 54.70 72.31 68.38 77.32 83.32 64.95 78.70 62.01 81.64 83.57 74.18 80.43 76.27 78.04 70.93 79.50 75.21 83.02 84.84 86.25 85.31 79.00 84.86 78.94 81.91 75.89 78.93 73.12 80.43
SICK-R 55.43 55.49 58.65 72.24 80.77 72.00 70.20 56.39 69.99 62.86 77.58 79.32 79.61 80.59 79.62 77.48 74.57 79.59 65.93 67.23 68.20 69.63 69.82 71.45 73.39 73.63 74.29 80.18 80.34 79.98 80.47
STS12 54.64 53.51 30.87 66.05 75.30 68.19 64.34 62.49 65.08 62.60 72.37 73.08 76.02 72.63 77.90 72.30 69.17 74.29 66.53 66.59 66.99 67.50 69.66 68.59 70.33 69.11 70.12 78.05 79.11 79.02 78.85
STS13 69.16 70.80 59.89 81.49 84.67 80.40 80.03 58.70 67.98 59.62 80.60 82.13 80.70 83.48 85.11 81.49 77.23 85.35 76.17 77.33 77.58 79.16 79.67 79.09 82.19 81.82 82.72 85.85 87.33 88.80 88.94
STS14 60.81 63.56 47.73 73.61 80.19 74.02 74.51 54.87 64.03 57.03 75.59 76.73 78.85 78.00 80.81 74.74 70.99 79.21 69.05 71.83 72.78 74.46 74.61 74.64 77.16 77.07 78.24 82.19 83.17 84.33 84.86
STS15 72.31 74.08 60.29 79.72 85.40 82.57 83.30 62.54 76.59 71.57 85.39 85.58 85.84 85.66 87.48 84.28 79.74 85.52 79.24 80.66 82.62 84.47 83.81 84.85 86.31 86.01 86.26 87.46 88.28 88.89 89.32
STS16 65.34 64.60 63.73 78.12 80.82 79.78 79.67 64.27 72.98 70.75 78.99 80.23 81.05 80.03 83.20 82.06 77.93 82.54 76.07 78.91 80.10 80.96 80.40 81.57 81.85 82.23 81.61 84.03 84.36 85.31 84.67
STS17 77.95 76.91 64.10 83.58 89.44 85.94 86.32 69.63 79.45 76.73 87.59 88.63 86.87 90.60 86.99 87.08 87.33 90.44 84.95 86.99 87.25 87.78 87.07 85.80 83.93 84.90 85.18 89.57 88.99 88.91 89.46
STS22 56.35 53.89 56.37 59.65 61.96 67.54 64.64 55.06 60.97 39.75 67.21 65.67 61.72 67.95 63.06 64.71 59.64 63.20 65.66 67.30 68.75 69.35 66.13 66.17 64.30 66.61 65.76 62.66 62.39 64.32 65.33
STSBenchmark 61.54 61.55 47.29 76.52 84.25 76.97 78.81 61.26 72.25 69.77 82.03 83.09 84.42 83.42 86.82 83.78 79.54 85.67 75.34 77.59 79.21 81.39 80.90 79.58 77.60 77.65 77.73 85.52 85.36 83.93 84.01
SummEval 28.87 30.49 29.82 31.15 23.31 29.50 30.36 27.66 31.05 26.8 30.81 27.9 30.67 27.49 31.57 26.94 30.26 30.38 28.90 25.44 27.87 24.75 24.99 29.67 29.50 30.21 30.64 31.39 29.64 29.91 30.08
Average 41.97 42.06 38.33 45.45 48.72 52.35 56.00 40.28 45.21 34.95 56.26 56.53 52.44 57.78 54.71 49.52 45.97 53.74 51.23 56.11 57.12 58.81 57.44 56.19 58.28 58.42 58.97 55.27 57.06 57.87 59.51
Table 11: All English results. The main score for each task is reported as described in Section 3.2.
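The Average row at the bottom of Table 11 is taken here to be the unweighted mean of the per-task main scores; the following sketch makes that aggregation explicit (the three example values are drawn from one column of the table, and the unweighted averaging is an assumption, not a statement of the official aggregation):

    # Minimal sketch: aggregate per-task main scores into a single average,
    # assuming the overall score is an unweighted mean over all tasks.
    def overall_score(task_scores: dict[str, float]) -> float:
        return sum(task_scores.values()) / len(task_scores)

    # Illustrative example with three task scores from one column of Table 11.
    scores = {
        "Banking77Classification": 82.35,
        "SciFact": 52.31,
        "STSBenchmark": 76.97,
    }
    print(round(overall_score(scores), 2))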
Dataset Language LASER2 LaBSE MiniLM-L12-multilingual MPNet-multilingual SGPT-BLOOM-7.1B-msmarco
BUCC de-en 99.21 99.35 97.11 98.59 54.00
BUCC fr-en 98.39 98.72 94.99 96.89 97.06
BUCC ru-en 97.62 97.78 95.06 96.44 45.30
BUCC zh-en 97.70 99.16 95.63 97.56 97.96
Tatoeba sqi-eng 97.22 96.76 98.17 98.57 10.38
Tatoeba fry-eng 42.07 89.31 31.13 43.54 24.62
Tatoeba kur-eng 19.09 83.59 46.94 61.44 8.26
Tatoeba tur-eng 98.03 98.00 95.08 96.17 6.15
Tatoeba deu-eng 99.07 99.20 97.02 97.73 70.10
Tatoeba nld-eng 95.35 96.07 94.58 95.50 29.74
Tatoeba ron-eng 96.52 96.92 95.30 96.43 27.23
Tatoeba ang-eng 25.22 59.28 10.24 16.72 28.76
Tatoeba ido-eng 80.86 89.42 40.25 43.91 43.91
Tatoeba jav-eng 9.95 79.77 17.04 23.39 15.02
Tatoeba isl-eng 94.32 94.75 24.07 59.25 6.29
Tatoeba slv-eng 95.40 96.03 96.92 97.08 10.14
Tatoeba cym-eng 5.85 92.00 13.25 22.31 6.97
Tatoeba kaz-eng 53.30 87.49 34.89 61.49 3.32
Tatoeba est-eng 96.43 96.55 97.33 98.40 4.76
Tatoeba heb-eng 0.00 91.53 86.88 88.26 1.69
Tatoeba gla-eng 1.52 85.66 3.61 4.72 2.09
Tatoeba mar-eng 92.93 92.65 92.38 93.83 45.53
Tatoeba lat-eng 64.81 80.07 19.47 24.25 28.76
Tatoeba bel-eng 79.54 95.00 67.73 79.94 8.03
Tatoeba pms-eng 36.23 64.57 30.70 34.19 31.94
Tatoeba gle-eng 4.20 93.80 11.62 16.85 3.26
Tatoeba pes-eng 93.13 94.70 92.59 93.47 12.13
Tatoeba nob-eng 95.77 98.40 97.73 98.53 21.07
Tatoeba bul-eng 93.57 94.58 92.65 93.52 20.09
Tatoeba cbk-eng 77.17 79.44 55.37 58.68 64.63
Tatoeba hun-eng 95.20 96.55 91.58 94.18 5.07
Tatoeba uig-eng 56.49 92.40 24.39 48.35 1.27
Tatoeba rus-eng 92.58 93.75 91.87 92.92 59.84
Tatoeba spa-eng 97.33 98.40 95.42 97.00 94.48
Tatoeba hye-eng 88.72 94.09 93.28 94.38 0.50
Tatoeba tel-eng 96.72 97.86 36.40 79.73 64.62
Tatoeba afr-eng 92.59 96.18 58.22 72.96 16.62
Tatoeba mon-eng 3.42 95.91 95.04 96.14 2.85
Tatoeba arz-eng 66.16 76.00 51.26 55.69 70.66
Tatoeba hrv-eng 96.72 96.95 95.98 97.00 12.79
Tatoeba nov-eng 60.02 74.38 47.99 50.23 52.23
Tatoeba gsw-eng 27.52 46.50 25.74 25.12 21.03
Tatoeba nds-eng 77.13 79.42 32.16 38.88 23.92
Tatoeba ukr-eng 93.52 93.97 92.82 92.67 22.06
Tatoeba uzb-eng 23.20 84.23 17.14 23.19 4.71
Tatoeba lit-eng 96.20 96.47 93.16 95.37 4.49
Tatoeba ina-eng 93.93 95.37 79.13 84.32 73.67
Tatoeba lfn-eng 63.39 67.54 47.02 49.56 44.85
Tatoeba zsm-eng 95.41 95.62 95.31 95.80 79.95
Tatoeba ita-eng 94.32 92.72 93.05 93.76 65.04
Tatoeba cmn-eng 85.62 95.10 94.93 95.83 91.45
Tatoeba lvs-eng 95.33 95.88 97.87 97.53 6.55
Tatoeba glg-eng 96.14 96.82 94.00 95.32 79.86
Tatoeba ceb-eng 9.93 64.42 8.05 7.39 6.64
Tatoeba bre-eng 31.2 15.07 5.56 6.42 4.67
Tatoeba ben-eng 89.43 88.55 36.48 64.90 75.98
Tatoeba swg-eng 33.10 59.36 26.31 22.80 16.89
Tatoeba arq-eng 26.63 42.69 18.60 19.84 27.75
Tatoeba kab-eng 65.88 4.31 1.16 1.41 1.69
Tatoeba fra-eng 94.28 94.86 91.72 93.12 91.44
Tatoeba por-eng 94.54 94.14 92.13 93.02 92.62
Tatoeba tat-eng 34.74 85.92 10.25 10.89 3.59
Tatoeba oci-eng 58.13 65.81 38.57 43.49 40.17
Tatoeba pol-eng 97.32 97.22 94.28 96.95 14.09
Tatoeba war-eng 8.25 60.29 7.25 7.42 10.38
Tatoeba aze-eng 82.41 94.93 62.10 76.36 6.32
Tatoeba vie-eng 96.73 97.20 95.12 97.23 94.20
Tatoeba nno-eng 72.75 94.48 76.34 81.41 16.28
Tatoeba cha-eng 14.86 31.77 15.98 12.59 23.26
Tatoeba mhr-eng 6.86 15.74 6.89 7.57 1.56
Tatoeba dan-eng 95.22 95.71 94.80 96.17 23.52
Tatoeba ell-eng 96.20 95.35 95.43 94.93 5.34
Tatoeba amh-eng 80.82 91.47 36.21 53.49 0.03
Tatoeba pam-eng 3.24 10.73 5.41 5.39 5.85
Tatoeba hsb-eng 45.75 67.11 36.10 44.32 9.68
Tatoeba srp-eng 93.64 94.43 92.24 94.12 11.69
Tatoeba epo-eng 96.61 98.20 41.73 55.12 26.20
Tatoeba kzj-eng 4.46 11.33 6.24 5.88 5.17
Tatoeba awa-eng 33.74 71.70 33.43 42.83 35.01
Tatoeba fao-eng 57.04 87.40 27.51 38.24 12.61
Tatoeba mal-eng 98.16 98.45 32.20 88.46 83.30
Tatoeba ile-eng 87.88 85.58 57.71 60.36 59.59
Tatoeba bos-eng 95.86 94.92 93.27 94.02 13.65
Tatoeba cor-eng 4.45 10.11 3.42 3.53 2.83
Tatoeba cat-eng 95.80 95.38 94.42 96.05 88.31
Tatoeba eus-eng 93.32 95.01 23.18 31.33 53.38
Tatoeba yue-eng 87.75 89.58 71.45 77.58 77.03
Tatoeba swe-eng 95.31 95.63 94.42 95.45 19.53
Tatoeba dtp-eng 7.39 10.85 5.69 5.03 3.41
Tatoeba kat-eng 81.16 95.02 95.44 95.46 0.42
Tatoeba jpn-eng 93.78 95.38 90.41 92.51 71.36
Tatoeba csb-eng 27.03 52.57 21.56 23.73 10.03
Tatoeba xho-eng 4.68 91.55 4.52 6.53 5.51
Tatoeba orv-eng 23.24 38.93 15.10 23.77 5.79
Tatoeba ind-eng 92.98 93.66 92.74 93.50 88.04
Tatoeba tuk-eng 16.35 75.27 15.16 14.91 5.48
Tatoeba max-eng 36.96 63.26 45.25 48.77 36.14
Tatoeba swh-eng 55.66 84.50 14.48 16.02 16.74
Tatoeba hin-eng 95.32 96.87 97.62 97.75 85.23
Tatoeba dsb-eng 42.34 64.81 33.43 36.85 8.78
Tatoeba ber-eng 77.63 8.40 4.43 4.88 4.92
Tatoeba tam-eng 87.32 89.0 24.64 73.60 72.76
Tatoeba slk-eng 95.82 96.5 95.15 96.62 9.98
Tatoeba tgl-eng 63.19 96.02 13.09 17.67 10.70
Tatoeba ast-eng 76.35 90.68 62.17 70.08 71.13
Tatoeba mkd-eng 93.63 93.6 91.00 93.02 10.47
Tatoeba khm-eng 74.19 78.37 32.11 58.80 0.37
Tatoeba ces-eng 95.52 96.68 95.12 95.73 9.55
Tatoeba tzl-eng 36.56 58.88 25.46 34.21 27.82
Tatoeba urd-eng 84.23 93.22 94.57 95.12 70.10
Tatoeba ara-eng 90.14 88.80 87.93 90.19 85.37
Tatoeba kor-eng 87.97 90.95 92.52 93.07 22.39
Tatoeba yid-eng 2.49 88.79 14.38 30.73 0.16
Tatoeba fin-eng 96.98 96.37 93.10 95.92 3.41
Tatoeba tha-eng 96.38 96.14 96.72 95.99 2.22
Tatoeba wuu-eng 75.09 90.18 76.00 78.25 79.58
Average mix 67.42 81.75 57.98 63.38 31.08
Table 14: Multilingual bitext mining results on BUCC and Tatoeba. Scores are F1 scores based on cosine similarity, as described in Section 3.2.
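For reference, the following is a minimal sketch of the nearest-neighbour matching that underlies such scores (the model name, the assumption that the gold alignment pairs src[i] with tgt[i], and the simplification that every source sentence receives exactly one prediction are illustrative; the scoring described in Section 3.2 is authoritative):

    import numpy as np
    from sentence_transformers import SentenceTransformer

    def bitext_f1(model_name: str, src: list[str], tgt: list[str]) -> float:
        # Embed both sides, match each source sentence to its most similar
        # target sentence by cosine similarity, and score against a gold
        # alignment assumed to pair src[i] with tgt[i].
        model = SentenceTransformer(model_name)
        src_emb = model.encode(src, normalize_embeddings=True)
        tgt_emb = model.encode(tgt, normalize_embeddings=True)
        pred = np.argmax(src_emb @ tgt_emb.T, axis=1)  # nearest neighbour per source
        correct = int((pred == np.arange(len(src))).sum())
        if correct == 0:
            return 0.0
        # With exactly one prediction per source sentence, precision and
        # recall coincide, so F1 reduces to the matching accuracy.
        precision = recall = correct / len(src)
        return 2 * precision * recall / (precision + recall)

In this simplified setting F1 equals accuracy; the benchmark's full scoring additionally reports accuracy, precision, and recall per language pair.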