MultiLegalPile: A 689GB Multilingual Legal Corpus
Niklaus et al., 2023 (arXiv:2306.02069v2 [cs.CL], 6 Jun 2023)

Abstract
Large, high-quality datasets are crucial for training Large Language Models (LLMs). However, […] language-specific SotA in five. On LexGLUE, our English models reach SotA in five out of seven tasks, with the large model achieving the highest aggregate score.

In the spirit of open science, we provide the dataset under a CC BY-NC-SA 4.0 license, with some subsets licensed more permissively. Dataset creation scripts, models, and pretraining code are public under Apache 2.0 licenses. This open-source approach encourages further research and advancements in the field of legal text analysis and understanding using large language models.

Contributions
The contributions of this paper are three-fold:

1. We curate and release a large-scale multilingual legal text corpus, dubbed MultiLegalPile,¹ covering 24 languages and 17 legal systems (jurisdictions).

2. We release 2 multilingual and 24 monolingual new legal-oriented PLMs, dubbed LegalXLMs, warm-started from the XLM-R (Conneau & Lample, 2019) models and further pretrained on the MultiLegalPile. Additionally, we pretrain a Longformer (Beltagy et al., 2020) based on our multilingual base-size model on context lengths of up to 4096 tokens.

3. We benchmark the newly released models on the LEXTREME and LexGLUE benchmarks, achieving new SotA for base- and large-size models and drastically increasing performance on Greek legal code. Our Longformer model reaches SotA in four tasks and the highest dataset aggregate score. Our monolingual models set language-specific SotA in five languages.

¹ https://huggingface.co/datasets/joelito/Multi_Legal_Pile

2. Related Work

2.1. General Pretraining Corpora

The use of pretrained Language Models (PLMs) has become increasingly popular in NLP tasks, particularly with the advent of models such as BERT (Devlin et al., 2019) that can be finetuned for specific applications. One key factor in the success of pretraining is the availability of large and diverse text corpora, which help the model learn the nuances of natural language. In the following, we discuss large-scale general-purpose text corpora used for pretraining.

Wikipedia is a commonly used multilingual dataset for pretraining language models and has been used to pretrain BERT (Devlin et al., 2019), MegatronBERT (Shoeybi et al., 2020), T5 (Raffel et al., 2020a), and GPT-3 (Brown et al., 2020b), among others.

Based on Wikipedia, Merity et al. (2016) created WikiText by selecting articles fitting the Good or Featured article criteria. The dataset contains 103M words and has two versions: WikiText2 and the larger WikiText103. It has been used to pretrain models like MegatronBERT (Shoeybi et al., 2020) and GPT-2 (Radford et al., 2019).

The BookCorpus (Zhu et al., 2015), also known as the Toronto Books Corpus, is an English dataset used for pretraining BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019), and T5 (Raffel et al., 2020a). It consists of almost 1B words from over 11K books collected from the web.

The Common Crawl corpus is a publicly available multilingual dataset of scraped web pages, regularly updated with new "snapshots". It has been used to pretrain GPT-3 (Brown et al., 2020b) as well as XLM-R (Conneau et al., 2020a). One significant drawback of Common Crawl is the presence of uncleaned data, which includes a considerable amount of "gibberish or boiler-plate text like menus, error messages, or duplicate text" (Raffel et al., 2020a). As a result, utilizing the Common Crawl dataset necessitates additional post-filtering and cleaning procedures. To address this issue, Raffel et al. (2020a) performed several cleaning steps on the April 2019 snapshot of Common Crawl, resulting in the Colossal Clean Crawled Corpus (C4), comprising 750 GB of English-language text. It was used for pretraining models such as T5 (Raffel et al., 2020a) and Switch Transformer (Fedus et al., 2022).

OpenWebText (Gokaslan & Cohen, 2019) openly replicates OpenAI's closed English WebText dataset (Radford et al., 2019), used to pretrain GPT-2 (Radford et al., 2019). WebText comprises over 8M documents with a combined text size of 40 GB. To ensure data uniqueness, any documents sourced from Wikipedia were excluded from WebText, as they are commonly utilized in other datasets. OpenWebText, on the other hand, consists of 38 GB of text data from 8M documents and was used for pretraining RoBERTa (Liu et al., 2019) and MegatronBERT (Shoeybi et al., 2020).

News articles are also a common source for pretraining corpora. The RealNews dataset (Zellers et al., 2019) is a large corpus extracted from Common Crawl, containing news articles from December 2016 to March 2019 (training) and April 2019 (evaluation), totaling 120 GB. It was used for pretraining MegatronBERT (Shoeybi et al., 2020). For pretraining RoBERTa, Liu et al. (2019) used an English subset of RealNews, comprising 63M English news articles crawled from September 2016 to February 2019.

The rise of LLMs brought about the creation of ever larger training datasets. The Pile (Gao et al., 2020b) combines 22 distinct, well-curated datasets, such as Wikipedia (English), OpenWebText2 (Gokaslan & Cohen, 2019), OpenSubtitles (Tiedemann, 2016), etc., encompassing 825 GB of data. Besides general-purpose textual datasets, it also contains domain-specific datasets, such as ArXiv (Science),
FreeLaw (Legal), PubMed Abstracts (Biomedicine), and GitHub data (to improve code-related task performance (Gao et al., 2020b)). GPT-2 (Radford et al., 2019) and GPT-3 (Brown et al., 2020b) were evaluated on this dataset.

In their work, Touvron et al. (2023) compiled a substantial dataset from various publicly available sources, including CommonCrawl, C4, Github, Wikipedia, etc., totaling 1.4T tokens. They trained the 13B-parameter LLaMA model using this dataset, surpassing the performance of the 175B-parameter GPT-3 on most benchmark tasks. However, the dataset itself is not publicly available. To address this, a collaborative effort resulted in the creation of the RedPajama-Data-1T dataset, replicating LLaMA's dataset with a similar size of 1.2T tokens.

Some of the aforementioned datasets, such as Common Crawl, are used to pretrain multilingual versions of BERT, DistilBERT, RoBERTa, etc. These models were pretrained on datasets covering approximately 100 languages, thereby neglecting low-resource languages. ImaniGooghari et al. (2023) addressed this by compiling Glot500, a 700 GB dataset covering 500 diverse languages, with a focus on low-resource ones. The Glot500-m model, pretrained on this dataset, outperformed the XLM-RoBERTa base model on six out of seven tasks.

2.2. Domain Specific Corpora

While pretraining on general-purpose text like Wikipedia and news articles shows promise, evidence suggests that pretraining on domain-specific text can enhance language model performance on related tasks (Beltagy et al., 2019; Gu et al., 2021; Chalkidis et al., 2020b; Niklaus & Giofré, 2022). Domain-specific text corpora include texts specific to fields like medicine, law, or science.

Several studies have examined pretraining on scientific text corpora. Beltagy et al. (2019) pretrained SciBERT, a BERT-based model, on a random subset of 1.14M papers sourced from Semantic Scholar. This collection comprises 18% computer science papers and 82% papers from the broader biomedical field. Similarly, PubMed and PubMedCentral are common sources for biomedical datasets. Gu et al. (2021) trained PubMedBERT using PubMed abstracts and PubMedCentral articles; BioBERT (Lee et al., 2020) was pretrained similarly. Johnson et al. (2016) compiled the Medical Information Mart for Intensive Care III (MIMIC-III) dataset, a large single-center database of critical care patients. Huang et al. (2019) used over 2 million de-identified clinical notes from this dataset to pretrain ClinicalBERT. These models outperformed general-purpose models on biomedical NLP tasks.

In the legal domain, similar strategies are observed. Chalkidis et al. (2020a) collected 12 GB of diverse English legal texts, including legislation, court cases, and contracts. They pretrained LegalBERT on this dataset, showing state-of-the-art performance, especially in tasks requiring domain knowledge. Another study by Zheng et al. (2021) used the entire English Harvard Law case corpus (1965-2021), comprising 37 GB of text, to pretrain CaseLaw-BERT.

Recently, Chalkidis et al. (2023) released LexFiles, an English legal corpus with 11 sub-corpora covering legislation and case law from six English-speaking legal systems (EU, Council of Europe, Canada, US, UK, India). The corpus contains approx. 6M documents or approx. 19B tokens. They trained two new legal English PLMs, showing improved performance in legal probing and classification tasks.

Efforts to pretrain legal language models also exist for Italian (Licari & Comandè, 2022), Romanian (Masala et al., 2021), and Spanish (Gutiérrez-Fandiño et al., 2021). However, English dominates, underscoring the importance of compiling multilingual legal corpora.

Model | Domain | Languages | Size in # Words
SciBERT (Beltagy et al., 2019) | scientific | English | 2.38B (3.17B tokens)
Galactica (Taylor et al., 2022) | scientific | English | 79.5B (106B tokens)
BioBERT (Lee et al., 2019) | biomedical | English | 18B
LegalBERT (Chalkidis et al., 2020b) | legal | English | 1.44B (11.5GB)
CaselawBERT (Zheng et al., 2021) | legal | English | 4.63B (37GB)
LegalXLMs (ours) | legal | 24 EU langs | 87B (689GB)

Table 1. Previous domain-specific pretraining corpora. For some corpora only GB or tokens were available. We converted 8 GB into 1B words and 1 token to 0.75 words.

Table 1 compares previous domain-specific corpora, all in English. In terms of size, none reach the MultiLegalPile proposed here.

3. MultiLegalPile

3.1. Construction

We transformed all datasets into xz-compressed JSON Lines (JSONL) format. The combination of xz compression and JSONL is ideal for streaming large datasets due to reduced file size and efficient decompression and reading.

Filtering mC4
We employed the vast multilingual web crawl corpus, mC4 (Xue et al., 2021), as our foundation. To effectively filter this corpus for legal content, we utilized regular expressions to identify documents with legal references. We found that detecting legal citations, such as references to laws and rulings, served as a reliable indicator of legal-specific documents in the corpus.

Iteration | German | English | Spanish | French | Italian
1st | 100% | 20% | 100% | 65% | 80%
2nd | 100% | 85% | 100% | 100% | 95%

Table 2. Precision of investigated languages in legal mC4 (n=20)
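The citation-based filtering described above can be sketched as follows. The two patterns here (a German-style section reference and a US-style reporter citation) are illustrative stand-ins, not the authors' actual expressions, and the document threshold is likewise an assumption:

```python
import re

# Illustrative citation patterns (stand-ins for the paper's actual regexes):
LEGAL_CITATION_PATTERNS = [
    re.compile(r"§\s*\d+[a-z]?\s+[A-Z][A-Za-z]+"),   # e.g. "§ 823 BGB"
    re.compile(r"\b\d{1,4}\s+U\.S\.\s+\d{1,4}\b"),   # e.g. "410 U.S. 113"
]

def looks_legal(text: str, min_hits: int = 2) -> bool:
    """Keep a document if it contains at least `min_hits` citation matches."""
    hits = sum(len(p.findall(text)) for p in LEGAL_CITATION_PATTERNS)
    return hits >= min_hits

docs = [
    "The court applied § 823 BGB and § 31 BGB to the facts.",
    "Best pizza recipes for your next party.",
]
kept = [d for d in docs if looks_legal(d)]
```

Because compiled regexes run in a single pass over each document with no model inference, this style of filter scales to a web-crawl corpus far more cheaply than a learned classifier, at the cost of the iterative precision tuning described below.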
In order to ensure the accuracy of our filtering, we engaged legal experts to aid in identifying citations to laws and rulings across different jurisdictions and languages. We manually reviewed the precision of the retrieved documents for five languages, namely German, English, Spanish, French, and Italian, as shown in Table 2. The proficiency levels of the evaluators included native German, fluent English and Spanish, intermediate French, and basic Italian.

Subsequent to the initial review, we performed a second round of precision evaluation, during which we refined our regular expressions based on our findings from the first iteration. This iterative process not only enhanced the precision of the legal content detection, but also reduced the corpus size from 133GB to 106GB. Although the overall volume of data was reduced, this process significantly improved the quality and specificity of the corpus by focusing on legal content with a higher degree of precision.

A major reason for utilizing regexes instead of a Machine Learning (ML) based classifier was speed. Even with regexes, filtering such a huge corpus as mC4 (27TB in total, of which 10.4TB are in English) took several days. An ML model based on Bag-of-Words, word vectors, or even contextualized embeddings would a) need an annotated dataset and b) likely be much slower.

[…] we converted it into a unified format, such as jsonl. The post-processing steps involved performing various tasks depending on the initial data format. For example, in the case of CASS, we extracted the textual data from XML tags.

Curating Eurlex Resources
To curate the Eurlex resources, we utilized the eurlex R package to generate SPARQL queries and download the data. Subsequently, we converted the data into a format more amenable to handling large datasets using Python.

Integrating Pile of Law
Henderson et al. (2022) released a large corpus of diverse legal text in English, mainly originating from the US. We integrated the latest version with additional data (from January 8, 2023) into our corpus.

3.2. Description

MultiLegalPile consists of four large subsets: a) Native Multi Legal Pile (112 GB), b) Eurlex Resources² (179 GB), c) Legal mC4³ (106 GB) and d) Pile of Law (Henderson et al., 2022) (292 GB).

Figure 3 details the distribution of languages. Note that due to the integration of the Pile of Law, English is by far the most dominant language, representing over half of the words. In Figure 2 we show the distribution across text types. Caselaw makes up over half of the corpus, due to the good public access to court rulings, especially in common law countries. Note that even in civil law countries – where legislation is much more important – caselaw is usually more plentiful than legislation (as can be seen in the Swiss case in Table 9). It is hard to find publicly available contracts, leading to their relatively low percentage of the total corpus (< 10%), even though they could potentially make up most of the legal texts in existence (from the private sector). Note that most of the contracts in our corpus are from the US or international treaties with the EU. Table 9 in Appendix C provides additional details of the MultiLegalPile, including sources and licenses.
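The unified xz-compressed JSONL layout described in Section 3.1 can be read and written in a streaming fashion with the Python standard library alone; the shard name and record fields below are illustrative, not the corpus' actual schema:

```python
import json
import lzma
import os
import tempfile

def write_jsonl_xz(path, records):
    """Stream records into an xz-compressed JSON Lines file."""
    with lzma.open(path, "wt", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")

def read_jsonl_xz(path):
    """Lazily yield records; the whole file is never held in memory."""
    with lzma.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)

# Illustrative shard name and fields:
path = os.path.join(tempfile.gettempdir(), "shard-00000.jsonl.xz")
write_jsonl_xz(path, [{"language": "de", "text": "§ 1 BGB ..."}])
first = next(read_jsonl_xz(path))
```

One JSON object per line means a reader can decompress and parse record by record, which is what makes this format convenient for streaming multi-hundred-GB subsets.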
Figure 3. MultiLegalPile Language Distribution (Note the log-scaled y-axis)
[…] do not explicitly state the license used for the available data. We assume that such data sources allow pretraining usage, since the creators are usually public agencies such as courts and administrations. Such legislation and caselaw is usually not protected by copyright law. Table 9 provides an overview of the license or copyright situation for each of the 29 sources in the Native Multi Legal Pile.

Second, the Eurlex Resources are licensed under CC BY 4.0 by the European Union.⁴ Thus, including this corpus does not pose legal issues for pretraining.

Third, the Legal mC4 corpus was created by filtering multilingual C4 (Xue et al., 2021) for legal content as described above. As mC4 is licensed under ODC-BY, we also release the filtered Legal mC4 corpus under the same license.

Finally, the Pile of Law (Henderson et al., 2022) is published under CC BY-NC-SA 4.0 and the dataset is not altered, therefore the license remains the same.

Usage of the MultiLegalPile corpus is presumably possible for pretraining of NLP models. In general, we assume that the fair use doctrine allows employing the data for legal NLP models because the results are rather transformative (Henderson et al., 2023). Nevertheless, copyright issues in generative AI remain an unresolved problem for the moment. Several court cases are currently pending, such as Getty Images suing Stability AI for intellectual property infringement (Sag, 2023).

4. Pretraining Legal Models

As part of this study, we release 2 new multilingual legal-oriented PLMs, dubbed Legal-XLM-Rs, trained on the newly introduced MultiLegalPile corpus (Section 3). For the newly released Legal-XLM-Rs we followed a series of best practices in the language model development literature:

(a) We warm-start (initialize) our models from the original XLM-R checkpoints (base or large) of Conneau & Lample (2019). Model recycling is a standard process followed by many (Wei et al., 2021; Ouyang et al., 2022) to benefit from starting from an available "well-trained" PLM, rather than from scratch (random). XLM-R was trained on 2.5TB of cleaned CommonCrawl data in 100 languages.

(b) We train a new tokenizer of 128K BPEs on the training subsets of MultiLegalPile to better cover legal language across all available legal systems and languages. However, we reuse the original XLM-R embeddings for all lexically overlapping tokens (Pfeiffer et al., 2021), i.e., we warm-start word embeddings for tokens that already exist in the original XLM-R vocabulary, and use random ones for the rest.

(c) We continue pretraining our models on the diverse MultiLegalPile corpus with batches of 512 samples for an additional 1M/500K steps for the base/large model. We do initial warm-up steps for the first 5% of the total training steps with a linearly increasing learning rate up to 1e−4, and then follow a cosine decay schedule, following recent trends. For half of the warm-up phase (2.5%), the Transformer encoder is frozen, and only the embeddings, shared between input and output (MLM), are updated. We also use an increased 20/30% masking rate for base/large models respectively, where also 100% of the predictions are based on masked tokens, compared to Devlin et al. (2019),⁵ based on the findings of Wettig et al. (2023).

(d) For both training the tokenizer and our legal models, we use a sentence sampler with exponential smoothing of the sub-corpora sampling rate, following Conneau & Lample (2019) and Raffel et al. (2020b), since there is a disparate proportion of tokens across sub-corpora and languages (Figures 1 and 3) and we aim to preserve per-corpus and language capacity, i.e., avoid overfitting to the majority (approx. 50% of the total number of tokens) US-origin English texts.

(e) We consider mixed-cased models, i.e., both upper- and lowercase letters covered, similar to all recently developed large PLMs (Conneau & Lample, 2019; Raffel et al., 2020b; Brown et al., 2020a).

To better account for long contexts often found in legal documents, we continue training the base-size multilingual model on long contexts (4096 tokens) with windowed attention (128 tokens window size) (Beltagy et al., 2020) for 50K steps, dubbing it Legal-XLM-LF-base. We use the standard 15% masking probability and increase the learning rate to 3e−5 before decaying, but otherwise use the same settings as for training the small-context models.

⁴ EUR-Lex Legal notice
⁵ Devlin et al. – and many other follow-up works – used a 15% masking ratio, and a recipe of 80/10/10% of predictions made across masked/randomly-replaced/original tokens.
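The learning-rate schedule of step (c), linear warm-up over the first 5% of steps followed by cosine decay, can be sketched as below; the decay-to-zero endpoint is an assumption, since the final learning rate is not stated:

```python
import math

def lr_at_step(step, total_steps, peak_lr=1e-4, warmup_frac=0.05):
    """Linear warm-up to peak_lr over the first warmup_frac of training,
    then cosine decay towards zero (a sketch of the schedule in step (c))."""
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# Base model: 1M total steps, so warm-up covers the first 50K steps.
lrs = [lr_at_step(s, 1_000_000) for s in (0, 50_000, 1_000_000)]
```

The rate peaks exactly at the end of warm-up and falls smoothly afterwards, which is the behavior the "linearly increasing, then cosine decay" description implies.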
Model Source Params Vocab Specs Corpus # Langs
MiniLM Wang et al. (2020) 118M 250K 1M steps / BS 256 2.5TB CC100 100
DistilBERT Sanh et al. (2020) 135M 120K BS up to 4000 Wikipedia 104
mDeBERTa-v3 He et al. (2021b;a) 278M 128K 500K steps / BS 8192 2.5TB CC100 100
XLM-R base Conneau et al. (2020b) 278M 250K 1.5M steps / BS 8192 2.5TB CC100 100
XLM-R large Conneau et al. (2020b) 560M 250K 1.5M steps / BS 8192 2.5TB CC100 100
Legal-XLM-R-base ours 184M 128K 1M steps / BS 512 689GB MLP 24
Legal-XLM-R-large ours 435M 128K 500K steps / BS 512 689GB MLP 24
Legal-XLM-LF-base ours 208M 128K 50K steps / BS 512 689GB MLP 24
Legal-mono-R-base ours 111M 32K 200K steps / BS 512 689GB MLP 1
Legal-mono-R-large ours 337M 32K 500K steps / BS 512 689GB MLP 1
Table 3. Models: All models can process up to 512 tokens, except Legal-XLM-LF-base, which can process up to 4096 tokens. BS is short for batch size. MLP is short for MultiLegalPile. Params is the total parameter count (including the embedding layer).
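The exponentially smoothed sub-corpus sampling of step (d) can be sketched as below; the exponent alpha and the per-language word counts are illustrative, not the authors' exact settings:

```python
def smoothed_sampling_probs(word_counts, alpha=0.3):
    """Sampling probability p_i proportional to n_i ** alpha.

    With alpha < 1 the distribution is flattened, so low-resource
    sub-corpora/languages are sampled more often than their raw share.
    """
    weights = {k: n ** alpha for k, n in word_counts.items()}
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}

# Illustrative word counts loosely reflecting the corpus skew
# (English dominates; Dutch has roughly 810M words).
counts = {"en": 44_000_000_000, "de": 4_000_000_000, "nl": 810_000_000}
probs = smoothed_sampling_probs(counts)
```

Under raw-frequency sampling, Dutch would receive under 2% of English's share; with smoothing it rises to roughly a third, which is how such samplers avoid overfitting to the majority English texts while still respecting corpus size.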
In addition to the multilingual models, we also train 24 monolingual models on each of the language-specific subsets of the corpus. Except for choosing a smaller vocab size of 32K tokens, we use the same settings as for the multilingual models. Due to resource constraints, we only train base-size models and stop training at 200K steps. Due to limited data available in some low-resource languages, these models sometimes do multiple passes over the data. Because of plenty of data and to achieve a better comparison on LexGLUE, we continued training the English model for 1M steps and also trained a large-size model for 500K steps. See Table 7 in Appendix A for an overview.

We make all our models publicly available alongside all intermediate checkpoints (every 50K/10K training steps for RoBERTa/Longformer models) on the Hugging Face Hub.⁶

⁶ https://huggingface.co/joelito

5. Evaluating on LEXTREME and LexGLUE

5.1. Benchmark Description

Below we briefly describe each dataset. We refer the interested reader to the original papers for more details.

LEXTREME (Niklaus et al., 2023) is a multilingual legal benchmark. It includes five single-label text classification datasets, three multi-label text classification datasets and four Named Entity Recognition (NER) datasets.

Brazilian Court Decisions (BCD) (Lage-Freitas et al., 2022) is from the State Supreme Court of Alagoas (Brazil) and involves predicting case outcomes and judges' unanimity on decisions. German Argument Mining (GAM) (Urchs et al., 2021) contains 200 German court decisions for classifying sentences according to their argumentative function. Greek Legal Code (GLC) (Papaloukas et al., 2021) tackles topic classification of Greek legislation documents. Tasks involve predicting topic categories at volume, chapter, and subject levels. Swiss Judgment Prediction (SJP) (Niklaus et al., 2021) focuses on predicting the judgment outcome from 85K cases from the Swiss Federal Supreme Court. Online Terms of Service (OTS) (Drawzeski et al., 2021) contains 100 contracts for detecting unfair clauses, with the tasks of classifying sentence unfairness levels and identifying clause topics. COVID19 Emergency Event (C19) (Tziafas et al., 2021) consists of legal documents from several European countries related to COVID-19 measures, where models identify the type of measure described in a sentence. MultiEURLEX (MEU) (Chalkidis et al., 2021b) is a corpus of 65K EU laws annotated with EUROVOC taxonomy labels. The task involves identifying labels for each document. Greek Legal NER (GLN) (Angelidis et al., 2018) is a dataset for NER in Greek legal documents. LegalNERo (LNR) (Pais et al., 2021) tackles NER in Romanian legal documents. LeNER BR (LNB) (Luz de Araujo et al., 2018) addresses NER in Brazilian legal documents. MAPA (MAP) (Baisa et al., 2016) is a multilingual corpus based on EUR-Lex for NER, annotated at a coarse-grained and a fine-grained level.

LexGLUE (Chalkidis et al., 2021d) is a legal benchmark covering two single-label text classification datasets, four multi-label text classification datasets and a multiple-choice question answering dataset.

ECtHR Tasks A & B (Chalkidis et al., 2019a; 2021c) contain approx. 11K cases from the European Court of Human Rights (ECtHR) public database. Based on case facts, Task A involves predicting violated articles of the European Convention on Human Rights (ECHR) and Task B involves predicting allegedly violated articles. SCOTUS (Spaeth et al.) combines information from US Supreme Court (SCOTUS) opinions with the Supreme Court DataBase (SCDB). The task is to classify court opinions into 14 issue areas. EUR-LEX (Chalkidis et al., 2021a) contains 65K EU laws from the EUR-Lex portal, annotated with EuroVoc concepts. The task is to predict EuroVoc labels for a given document. LEDGAR (Tuggener et al., 2020) contains approx. 850K contract provisions from US Securities and Exchange Commission (SEC) filings. The task is to classify contract provisions into categories. UNFAIR-ToS (Lippi et al., 2019)
Model BCD GAM GLC SJP OTS C19 MEU GLN LNR LNB MAP Agg.
MiniLM 53.0 73.3 42.1 67.7 44.1 5.0 29.7 74.0 84.5 93.6 57.8 56.8
DistilBERT 54.5 69.5 62.8 66.8 56.1 25.9 36.4 71.0 85.3 89.6 60.8 61.7
mDeBERTa-v3 60.2 71.3 52.2 69.1 66.5 29.7 37.4 73.3 85.1 94.8 67.2 64.3
XLM-R-base 63.5 72.0 57.4 69.3 67.8 26.4 33.3 74.6 85.8 94.1 62.0 64.2
XLM-R-large 58.7 73.1 57.4 69.0 75.0 29.0 42.2 74.1 85.0 95.3 68.0 66.1
Legal-XLM-R-base 62.5 72.4 68.9 70.2 70.8 30.7 38.6 73.6 84.1 94.1 69.2 66.8
Legal-XLM-R-large 63.3 73.9 59.3 70.1 74.9 34.6 39.7 73.1 83.9 94.6 67.3 66.8
Legal-XLM-LF-base 72.4 74.6 70.2 72.9 69.8 26.3 33.1 72.1 84.7 93.3 66.2 66.9
Table 4. Dataset aggregate scores for multilingual models on LEXTREME. We report macro-F1 and the best scores in bold.
Model bg cs da de el en es et fi fr ga hr hu it lt lv mt nl pl pt ro sk sl sv Agg.
MiniLM 52.7 48.6 42.8 54.6 50.3 34.3 40.1 46.3 42.2 39.0 42.8 29.7 29.6 40.5 44.2 40.8 40.8 29.5 22.7 61.6 59.6 44.3 30.0 43.4 40.5
DistilBERT 54.2 48.6 46.0 60.1 58.8 48.0 50.0 48.8 49.6 47.9 51.4 35.9 31.2 50.1 51.9 41.5 44.4 34.6 34.5 63.2 63.8 51.3 36.2 50.1 46.7
mDeBERTa-v3 54.1 51.3 51.7 63.6 57.7 50.7 53.3 50.8 54.6 49.2 54.9 37.4 37.5 55.1 53.9 47.0 52.5 42.1 41.0 65.7 65.3 55.4 37.5 56.1 50.5
XLM-R-base 56.4 48.3 48.3 60.6 57.6 50.1 47.2 46.7 48.6 49.4 50.1 33.6 32.8 53.4 50.0 44.1 43.8 35.2 41.3 66.1 63.7 45.3 33.7 50.0 47.1
XLM-R-large 59.9 56.0 56.3 65.4 60.8 56.2 56.6 56.5 56.9 51.4 55.4 42.5 38.1 58.5 58.1 49.9 53.9 39.5 46.4 68.6 66.8 57.9 42.4 59.1 53.7
Legal-XLM-R-base 55.6 58.8 50.4 63.6 63.7 66.8 56.3 57.0 52.6 50.1 56.6 38.7 56.5 56.1 57.2 49.1 56.0 41.6 43.9 68.2 66.1 55.6 38.6 54.9 53.5
Legal-XLM-R-large 57.8 55.6 50.4 65.7 60.7 69.3 55.7 54.5 56.6 53.3 57.2 39.7 39.1 58.1 60.6 48.4 57.2 39.4 45.5 67.3 65.5 49.3 39.7 56.4 53.6
Legal-XLM-LF-base 54.4 49.3 48.1 64.0 60.5 52.8 49.2 52.2 48.2 48.5 55.4 33.0 34.7 54.6 54.8 45.2 52.5 40.1 40.6 68.3 64.1 48.4 33.0 51.3 48.9
NativeLegalBERT - - - - - 53.1 46.9 - - - - - - 45.3 - - - - - - 59.0 - - - 51.1
NativeBERT 54.8 57.3 51.2 63.0 62.3 52.0 42.6 47.2 52.4 49.4 50.1 - 37.4 47.1 - - - 37.0 40.5 66.5 63.1 44.8 - 55.1 50.2
Legal-mono-R-base 55.9 49.5 51.5 61.3 61.3 50.5 52.1 53.5 53.6 51.1 52.2 44.1 54.1 51.8 55.5 50.0 59.1 54.3 34.4 67.1 61.5 48.8 53.4 58 53.5
Table 5. Language aggregate scores on LEXTREME. We report macro-F1 and best scores in bold. For each language, we also list the
best-performing monolingual legal model under NativeLegalBERT, the best-performing monolingual non-legal model under NativeBERT
and our monolingual legal models under Legal-mono-R-base. Missing values indicate that no suitable models were found.
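A minimal sketch of the two scoring ingredients used throughout the evaluation: macro-F1 (the unweighted mean of per-class F1) and the harmonic-mean aggregation over per-dataset or per-language scores:

```python
from statistics import harmonic_mean

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over the classes present in y_true."""
    f1s = []
    for c in sorted(set(y_true)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

def aggregate(scores):
    """Harmonic-mean aggregation over per-dataset (or per-language) scores."""
    return harmonic_mean(scores)

score = macro_f1([0, 0, 1, 1], [0, 1, 1, 1])  # per-class F1: 2/3 and 0.8
agg = aggregate([80.0, 40.0])                  # well below the arithmetic mean of 60.0
```

The harmonic mean penalizes weak outliers more than the arithmetic mean, so a single poorly handled dataset or language drags the aggregate down noticeably.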
contains 50 Terms of Service (ToS) from online platforms, annotated with types of unfair contractual terms. The task is to predict unfair types for a given sentence. CaseHOLD (Zheng et al., 2021) contains approx. 53K multiple-choice questions about holdings of US court cases. The task is to identify the correct holding statement from a selection of five choices.

5.2. Experimental Setup

To ensure comparability, we followed the experimental setups described in the original papers (Niklaus et al., 2023; Chalkidis et al., 2021d), using hierarchical transformers for datasets where the sequence length of most documents exceeds the maximum sequence length of the model (Aletras et al., 2016; Niklaus et al., 2022). The hyperparameters used for running experiments on each dataset are provided in Table 8 in the appendix. To obtain Table 6, we followed Chalkidis et al. (2021d), running five repetitions with different random seeds (1-5) and reporting the test scores based on the seed that yielded the best scores on the development data. For values in Tables 4 and 5, we followed the procedure in Niklaus et al. (2023), taking the mean of the results of 3 random seeds (1-3). We show an overview of the evaluated models in Table 3.

Table 6. Results on LexGLUE. We report macro-F1 and best scores in bold. Results from models marked with * are from Chalkidis et al. (2021d). Similar to LEXTREME, we computed the aggregate score as the harmonic mean of individual dataset results.

5.3. Evaluation on LEXTREME

We evaluate our models on LEXTREME (Niklaus et al., 2023) and show results across datasets in Table 4 and across languages in Table 5.

We notice that our Legal-XLM-R-base model is on par with XLM-R large even though it only contains 33% of the parameters (184M vs 560M). All our models outperform XLM-R large on the dataset aggregate score. Our base model sets a new SotA on MAPA (MAP), the large model on COVID19 Emergency Event (C19), and the Longformer on Brazilian Court Decisions (BCD), German Argument Mining (GAM), Greek Legal Code (GLC) and Swiss Judgment Prediction (SJP). Surprisingly, the legal models slightly underperform in three NER tasks (GLN, LNR, and LNB). Sensitivity to hyperparameter choice could be a reason for this underperformance (we used the same hyperparameters for all models without tuning due to limited compute resources). We see the largest improvements over prior art in Brazilian Court Decisions (72.4 vs. 63.5) and in Greek Legal Code (70.2 vs. 62.8). Maybe these tasks are particularly hard and therefore legal in-domain pretraining helps more. For BCD especially, the large amount of Brazilian caselaw in the pretraining corpus may offer an additional explanation.

The monolingual models underperform their base model XLM-R base only in Italian, Polish, and Romanian. In some languages the monolingual model even outperforms XLM-R base clearly (Croatian, Hungarian, Latvian, Maltese, Dutch, Slovakian, and Swedish), and in five of them even sets the new SotA for the language, sometimes clearly outperforming all other models (the Dutch model even outperforms its closest competitor mDeBERTa-v3 by 11.2 macro F1 and
its base model XLM-R by almost 20 macro F1). These languages are all at the lower end of data availability in the MultiLegalPile, with the richest language (Dutch) containing only 810M words (see Figure 3). Pretraining a monolingual model on in-domain data may therefore be worth it, especially in low-resource languages.

Even though our legal Longformer model performs best on the dataset level, it performs much worse on the language level, possibly due to its lower scores in the most multilingual tasks MEU, MAP and C19 (24, 24 and 6 languages, respectively). Our legal base and large models achieve SotA in some languages, and are on aggregate almost as robust across languages as XLM-R.

Computing the final LEXTREME scores (harmonic mean of the dataset aggregate and language aggregate scores), we find that Legal-XLM-R-large is the new SotA on LEXTREME with a score of 59.5, vs. 59.4 for Legal-XLM-R-base and 59.3 for XLM-R-large. The legal Longformer's LEXTREME score of 56.5 is not competitive due to its low language aggregate score.

5.4. Evaluation on LexGLUE

We evaluate our English and multilingual models on LexGLUE (Chalkidis et al., 2021e) and compare against baselines (see Table 6). Our models excel on the ECtHR, SCOTUS, EUR-LEX, and CaseHOLD tasks, achieving new SotA. On the other two tasks our models match general-purpose models such as RoBERTa. A reason for the slight underperformance of the legal models on the LEDGAR and especially the Unfair ToS tasks may be the relatively low availability of contracts in the MultiLegalPile.

6. Conclusions and Future Work

Limitations We did not perform deduplication, so data from the legal mC4 part might be present in other parts. However, recent work (Muennighoff et al., 2023) suggests that data duplication does not degrade performance during pretraining for up to four epochs. Overlap between the other parts is highly unlikely, since they come from completely different jurisdictions.

Conclusions Due to a general lack of multilingual pretraining data, especially in specialized domains such as law, we curate a large-scale, high-quality corpus in 24 languages from 17 jurisdictions. We continue pretraining XLM-R checkpoints on our data, achieving a new SotA for base and large models on the LEXTREME benchmark and vastly outperforming previous methods on Greek Legal Code. We turn our XLM-R base model into a Longformer and continue pretraining on long documents. It reaches a new SotA on four LEXTREME datasets and achieves the overall highest dataset aggregate score. Monolingual models achieve huge gains over their base model XLM-R in some languages and even set language-specific SotA in five languages, outperforming other models by as much as 11 macro F1. On LexGLUE, our English models reach SotA in five out of seven tasks, with the large model achieving the highest aggregate score.

Future Work We leave the pretraining of a large generative multilingual legal language model for future work. Here, we limited the corpus to the EU languages due to resource constraints, but in the future we would like to expand the corpus in terms of languages and jurisdictions covered. Especially in China there exist many accessible sources suitable to extend the corpus. Finally, it would be very interesting to study the specific contents of the MultiLegalPile in more detail.
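The final LEXTREME score used in Section 5 is the harmonic mean of the dataset aggregate and the language aggregate scores; a minimal sketch of that computation follows. The aggregate values passed in below are illustrative placeholders, not the paper's reported per-axis numbers.

```python
from statistics import harmonic_mean

def lextreme_score(dataset_aggregate: float, language_aggregate: float) -> float:
    """Final LEXTREME score: harmonic mean of the two aggregate scores."""
    return harmonic_mean([dataset_aggregate, language_aggregate])

# Illustrative aggregates only; for reference, the paper reports final scores
# of 59.5 (Legal-XLM-R-large), 59.4 (Legal-XLM-R-base) and 59.3 (XLM-R-large).
print(round(lextreme_score(61.0, 58.0), 2))
```

The harmonic mean penalizes imbalance between the two axes, which is why the legal Longformer's low language aggregate drags its final score down despite the highest dataset aggregate.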
A benchmark dataset for legal language understanding in English, 2021d. URL https://arxiv.org/abs/2110.00976.

Chalkidis, I., Jana, A., Hartung, D., Bommarito, M. J., Androutsopoulos, I., Katz, D. M., and Aletras, N. LexGLUE: A Benchmark Dataset for Legal Language Understanding in English. SSRN Scholarly Paper ID 3936759, Social Science Research Network, Rochester, NY, October 2021e. URL https://papers.ssrn.com/abstract=3936759.

Chalkidis*, I., Garneau*, N., Goanta, C., Katz, D. M., and Søgaard, A. LeXFiles and LegalLAMA: Facilitating English multinational legal language model development, 2023.

Conneau, A. and Lample, G. Cross-lingual Language Model Pretraining. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL https://proceedings.neurips.cc/paper/2019/hash/c04c19c2c2474dbf5f7ac4372c5b9af1-Abstract.html.

Conneau, A., Khandelwal, K., Goyal, N., Chaudhary, V., Wenzek, G., Guzmán, F., Grave, E., Ott, M., Zettlemoyer, L., and Stoyanov, V. Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 8440–8451, Online, July 2020a. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.747. URL https://aclanthology.org/2020.acl-main.747.

Conneau, A., Khandelwal, K., Goyal, N., Chaudhary, V., Wenzek, G., Guzmán, F., Grave, E., Ott, M., Zettlemoyer, L., and Stoyanov, V. Unsupervised Cross-lingual Representation Learning at Scale. arXiv:1911.02116 [cs], April 2020b. URL http://arxiv.org/abs/1911.02116.

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. doi: 10.18653/v1/N19-1423. URL https://aclanthology.org/N19-1423.

Drawzeski, K., Galassi, A., Jablonowska, A., Lagioia, F., Lippi, M., Micklitz, H. W., Sartor, G., Tagiuri, G., and Torroni, P. A Corpus for Multilingual Analysis of Online Terms of Service. In Proceedings of the Natural Legal Language Processing Workshop 2021, pp. 1–8, Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics. URL https://aclanthology.org/2021.nllp-1.1.

Fedus, W., Zoph, B., and Shazeer, N. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research, 23:1–39, 2022. URL http://jmlr.org/papers/v23/21-0998.html.

Gao, L., Biderman, S., Black, S., Golding, L., Hoppe, T., Foster, C., Phang, J., He, H., Thite, A., Nabeshima, N., Presser, S., and Leahy, C. The Pile: An 800GB Dataset of Diverse Text for Language Modeling. arXiv:2101.00027 [cs], December 2020a. URL http://arxiv.org/abs/2101.00027.

Gao, L., Biderman, S. R., Black, S., Golding, L., Hoppe, T., Foster, C., Phang, J., He, H., Thite, A., Nabeshima, N., Presser, S., and Leahy, C. The Pile: An 800GB dataset of diverse text for language modeling. ArXiv, abs/2101.00027, 2020b.

Gokaslan, A. and Cohen, V. OpenWebText corpus, 2019. URL http://Skylion007.github.io/OpenWebTextCorpus.

Gu, Y., Tinn, R., Cheng, H., Lucas, M., Usuyama, N., Liu, X., Naumann, T., Gao, J., and Poon, H. Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing. ACM Trans. Comput. Healthcare, 3(1), October 2021. ISSN 2691-1957. doi: 10.1145/3458754. URL https://doi.org/10.1145/3458754.

Gutiérrez-Fandiño, A., Armengol-Estapé, J., Gonzalez-Agirre, A., and Villegas, M. Spanish Legalese Language Model and Corpora. October 2021. URL http://arxiv.org/abs/2110.12201.

He, P., Gao, J., and Chen, W. DeBERTaV3: Improving DeBERTa using ELECTRA-Style Pre-Training with Gradient-Disentangled Embedding Sharing. arXiv:2111.09543 [cs], December 2021a. URL http://arxiv.org/abs/2111.09543.

He, P., Liu, X., Gao, J., and Chen, W. DeBERTa: Decoding-enhanced BERT with Disentangled Attention. arXiv:2006.03654 [cs], October 2021b. URL http://arxiv.org/abs/2006.03654.

Henderson, P., Krass, M. S., Zheng, L., Guha, N., Manning, C. D., Jurafsky, D., and Ho, D. E. Pile of Law: Learning Responsible Data Filtering from the Law and a 256GB Open-Source Legal Dataset, July 2022. URL http://arxiv.org/abs/2207.00220. arXiv:2207.00220 [cs].

Henderson*, P., Krass*, M. S., Zheng, L., Guha, N., Manning, C. D., Jurafsky, D., and Ho, D. E. Pile of Law: Learning responsible data filtering from the law and a 256GB open-source legal dataset, 2022. URL https://arxiv.org/abs/2207.00220.

Henderson, P., Li, X., Jurafsky, D., Hashimoto, T., Lemley, M. A., and Liang, P. Foundation models and fair use, 2023.

Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., and Steinhardt, J. Measuring Massive Multitask Language Understanding, January 2021. URL http://arxiv.org/abs/2009.03300. arXiv:2009.03300 [cs].

Huang, K., Altosaar, J., and Ranganath, R. ClinicalBERT: Modeling Clinical Notes and Predicting Hospital Readmission. 2019. URL http://arxiv.org/abs/1904.05342.

ImaniGooghari, A., Lin, P., Kargaran, A. H., Severini, S., Sabet, M. J., Kassner, N., Ma, C., Schmid, H., Martins, A. F. T., Yvon, F., and Schütze, H. Glot500: Scaling multilingual corpora and language models to 500 languages, 2023.

Johnson, A. E. W., Pollard, T. J., Shen, L., Lehman, L.-w. H., Feng, M., Ghassemi, M., Moody, B., Szolovits, P., Anthony Celi, L., and Mark, R. G. MIMIC-III, a freely accessible critical care database. Scientific Data, 3(1):160035, 2016. ISSN 2052-4463. doi: 10.1038/sdata.2016.35. URL https://doi.org/10.1038/sdata.2016.35.

Katz, D. M., Bommarito, M. J., Gao, S., and Arredondo, P. GPT-4 Passes the Bar Exam, March 2023. URL https://papers.ssrn.com/abstract=4389233.

Lage-Freitas, A., Allende-Cid, H., Santana, O., and Oliveira-Lage, L. Predicting Brazilian Court Decisions. PeerJ Computer Science, 8:e904, March 2022. ISSN 2376-5992. doi: 10.7717/peerj-cs.904. URL https://peerj.com/articles/cs-904. Publisher: PeerJ Inc.

Lee, J., Yoon, W., Kim, S., Kim, D., Kim, S., So, C. H., and Kang, J. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics, pp. btz682, September 2019. ISSN 1367-4803, 1460-2059. doi: 10.1093/bioinformatics/btz682. URL http://arxiv.org/abs/1901.08746.

Lee, J., Yoon, W., Kim, S., Kim, D., Kim, S., So, C. H., and Kang, J. BioBERT: A pre-trained biomedical language representation model for biomedical text mining. Bioinformatics, 36(4):1234–1240, 2020. ISSN 14602059. doi: 10.1093/bioinformatics/btz682.

Licari, D. and Comandè, G. ITALIAN-LEGAL-BERT: A Pre-trained Transformer Language Model for Italian Law. Technical report, 2022. URL http://ceur-ws.org.

Lippi, M., Pałka, P., Contissa, G., Lagioia, F., Micklitz, H.-W., Sartor, G., and Torroni, P. CLAUDETTE: an automated detector of potentially unfair clauses in online terms of service. Artificial Intelligence and Law, 27(2):117–139, 2019. ISSN 1572-8382. doi: 10.1007/s10506-019-09243-2. URL https://doi.org/10.1007/s10506-019-09243-2.

Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., and Stoyanov, V. RoBERTa: A robustly optimized BERT pretraining approach. 2019. URL http://arxiv.org/abs/1907.11692.

Luz de Araujo, P. H., Campos, T. E. d., de Oliveira, R. R., Stauffer, M., Couto, S., and Bermejo, P. LeNER-Br: a dataset for named entity recognition in Brazilian legal text. In International Conference on Computational Processing of the Portuguese Language, pp. 313–323. Springer, 2018. Dataset URL: https://huggingface.co/datasets/lener_br.

Masala, M., Iacob, R. C. A., Uban, A. S., Cidota, M., Velicu, H., Rebedea, T., and Popescu, M. jurBERT: A Romanian BERT model for legal judgement prediction. In Proceedings of the Natural Legal Language Processing Workshop 2021, pp. 86–94, Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.nllp-1.8. URL https://aclanthology.org/2021.nllp-1.8.

Merity, S., Xiong, C., Bradbury, J., and Socher, R. Pointer sentinel mixture models, 2016.

Muennighoff, N., Rush, A. M., Barak, B., Scao, T. L., Piktus, A., Tazi, N., Pyysalo, S., Wolf, T., and Raffel, C. Scaling Data-Constrained Language Models, May 2023. URL http://arxiv.org/abs/2305.16264. arXiv:2305.16264 [cs].

Niklaus, J. and Giofré, D. BudgetLongformer: Can we Cheaply Pretrain a SotA Legal Language Model From Scratch?, November 2022. URL http://arxiv.org/abs/2211.17135. arXiv:2211.17135 [cs].

Niklaus, J., Chalkidis, I., and Stürmer, M. Swiss-Judgment-Prediction: A Multilingual Legal Judgment Prediction Benchmark. In Proceedings of the Natural Legal Language Processing Workshop 2021, pp. 19–35, Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics. URL https://aclanthology.org/2021.nllp-1.3.

Niklaus, J., Stürmer, M., and Chalkidis, I. An Empirical Study on Cross-X Transfer for Legal Judgment Prediction. In Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 32–46, Online only, November 2022. Association for Computational Linguistics. URL https://aclanthology.org/2022.aacl-main.3.

Niklaus, J., Matoshi, V., Rani, P., Galassi, A., Stürmer, M., and Chalkidis, I. LEXTREME: A multi-lingual and multi-task benchmark for the legal domain, 2023. URL https://arxiv.org/abs/2301.13126.

OpenAI. GPT-4 Technical Report, March 2023. URL http://arxiv.org/abs/2303.08774. arXiv:2303.08774 [cs].

Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C. L., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., Schulman, J., Hilton, J., Kelton, F., Miller, L., Simens, M., Askell, A., Welinder, P., Christiano, P., Leike, J., and Lowe, R. Training language models to follow instructions with human feedback, 2022. URL https://arxiv.org/abs/2203.02155.

Pais, V., Mitrofan, M., Gasan, C. L., Coneschi, V., and Ianov, A. Named entity recognition in the Romanian legal domain. In Proceedings of the Natural Legal Language Processing Workshop 2021, pp. 9–18, Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.nllp-1.2. URL https://aclanthology.org/2021.nllp-1.2.

Papaloukas, C., Chalkidis, I., Athinaios, K., Pantazi, D.-A., and Koubarakis, M. Multi-granular legal topic classification on Greek legislation. arXiv preprint arXiv:2109.15298, 2021. Dataset URL: https://huggingface.co/datasets/greek_legal_code.

Pfeiffer, J., Vulić, I., Gurevych, I., and Ruder, S. UNKs everywhere: Adapting multilingual language models to new scripts. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 10186–10203, Online and Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.emnlp-main.800. URL https://aclanthology.org/2021.emnlp-main.800.

Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., and Sutskever, I. Language models are unsupervised multitask learners. 2019.

Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., and Liu, P. J. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67, 2020a. URL http://jmlr.org/papers/v21/20-074.html.

Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., and Liu, P. J. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. Journal of Machine Learning Research, 21(140):1–67, 2020b. ISSN 1533-7928. URL http://jmlr.org/papers/v21/20-074.html.

Sag, M. Copyright safety for generative AI. Forthcoming in the Houston Law Review, 2023. URL https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4438593.

Sanh, V., Debut, L., Chaumond, J., and Wolf, T. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv:1910.01108 [cs], February 2020. URL http://arxiv.org/abs/1910.01108.

Shoeybi, M., Patwary, M., Puri, R., LeGresley, P., Casper, J., and Catanzaro, B. Megatron-LM: Training multi-billion parameter language models using model parallelism, 2020.

Spaeth, H. J., Epstein, L., Martin, A. D., Segal, J. A., Ruger, T. J., and Benesh, S. C. Supreme Court Database, Version 2020 Release 01.

Taylor, R., Kardas, M., Cucurull, G., Scialom, T., Hartshorn, A., Saravia, E., Poulton, A., Kerkez, V., and Stojnic, R. Galactica: A Large Language Model for Science, November 2022. URL http://arxiv.org/abs/2211.09085. arXiv:2211.09085 [cs, stat].

Tiedemann, J. Finding alternative translations in a large corpus of movie subtitle. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pp. 3518–3522, Portorož, Slovenia, May 2016. European Language Resources Association (ELRA). URL https://aclanthology.org/L16-1559.

Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., Rodriguez, A., Joulin, A., Grave, E., and Lample, G. LLaMA: Open and Efficient Foundation Language Models. ArXiv, abs/2302.1, 2023.

Tuggener, D., von Däniken, P., Peetz, T., and Cieliebak, M. LEDGAR: A large-scale multi-label corpus for text classification of legal provisions in contracts. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pp. 1235–1241, Marseille, France, May 2020. European Language Resources Association. ISBN 979-10-95546-34-4. URL https://aclanthology.org/2020.lrec-1.155.

Tziafas, G., de Saint-Phalle, E., de Vries, W., Egger, C., and Caselli, T. A multilingual approach to identify and classify exceptional measures against COVID-19. In Proceedings of the Natural Legal Language Processing Workshop 2021, pp. 46–62, 2021. Dataset URL: https://tinyurl.com/ycysvtbm.

Urchs, S., Mitrović, J., and Granitzer, M. Design and Implementation of German Legal Decision Corpora. In Proceedings of the 13th International Conference on Agents and Artificial Intelligence, pp. 515–521, Online Streaming, 2021. SCITEPRESS - Science and Technology Publications. ISBN 978-989-758-484-8. doi: 10.5220/0010187305150521. URL https://www.scitepress.org/DigitalLibrary/Link.aspx?doi=10.5220/0010187305150521.

Wang, W., Wei, F., Dong, L., Bao, H., Yang, N., and Zhou, M. MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers. In Advances in Neural Information Processing Systems, volume 33, pp. 5776–5788. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper/2020/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html.

Wei, J., Bosma, M., Zhao, V. Y., Guu, K., Yu, A. W., Lester, B., Du, N., Dai, A. M., and Le, Q. V. Finetuned language models are zero-shot learners. CoRR, abs/2109.01652, 2021. URL https://arxiv.org/abs/2109.01652.

Xue, L., Constant, N., Roberts, A., Kale, M., Al-Rfou, R., Siddhant, A., Barua, A., and Raffel, C. mT5: A massively multilingual pre-trained text-to-text transformer. arXiv:2010.11934 [cs], March 2021. URL http://arxiv.org/abs/2010.11934.

Zellers, R., Holtzman, A., Rashkin, H., Bisk, Y., Farhadi, A., Roesner, F., and Choi, Y. Defending against Neural Fake News. Curran Associates Inc., Red Hook, NY, USA, 2019.

Zheng, L., Guha, N., Anderson, B. R., Henderson, P., and Ho, D. E. When Does Pretraining Help? Assessing Self-Supervised Learning for Law and the CaseHOLD Dataset of 53,000+ Legal Holdings. In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Law, ICAIL '21, pp. 159–168, New York, NY, USA, 2021. Association for Computing Machinery. ISBN 9781450385268. doi: 10.1145/3462757.3466088. URL https://doi.org/10.1145/3462757.3466088.

Zhu, Y., Kiros, R., Zemel, R., Salakhutdinov, R., Urtasun, R., Torralba, A., and Fidler, S. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books, 2015.
A. Training Details
B. Hyperparameter Details
Source Dataset Task Task type Hierarchical Seeds Lower case Batch size Metric for best model Evaluation strategy Epochs Early stopping patience Learning rate
(Niklaus et al., 2023) GLN GLN NER False 1,2,3 True 64 evaluation loss epoch 50 5 1e-5
(Niklaus et al., 2023) LNR LNR NER False 1,2,3 True 64 evaluation loss epoch 50 5 1e-5
(Niklaus et al., 2023) LNB LNB NER False 1,2,3 True 64 evaluation loss epoch 50 5 1e-5
(Niklaus et al., 2023) MAP MAP-F NER False 1,2,3 True 64 evaluation loss epoch 50 5 1e-5
(Niklaus et al., 2023) MAP MAP-C NER False 1,2,3 True 64 evaluation loss epoch 50 5 1e-5
(Niklaus et al., 2023) BCD BCD-J SLTC True 1,2,3 True 64 evaluation loss epoch 50 5 1e-5
(Niklaus et al., 2023) BCD BCD-U SLTC True 1,2,3 True 64 evaluation loss epoch 50 5 1e-5
(Niklaus et al., 2023) GAM GAM SLTC False 1,2,3 True 64 evaluation loss epoch 50 5 1e-5
(Niklaus et al., 2023) GLC GLC-C SLTC True 1,2,3 True 64 evaluation loss epoch 50 5 1e-5
(Niklaus et al., 2023) GLC GLC-S SLTC True 1,2,3 True 64 evaluation loss epoch 50 5 1e-5
(Niklaus et al., 2023) GLC GLC-V SLTC True 1,2,3 True 64 evaluation loss epoch 50 5 1e-5
(Niklaus et al., 2023) SJP SJP SLTC True 1,2,3 True 64 evaluation loss epoch 50 5 1e-5
(Niklaus et al., 2023) OTS OTS-UL SLTC False 1,2,3 True 64 evaluation loss epoch 50 5 1e-5
(Niklaus et al., 2023) OTS OTS-CT MLTC False 1,2,3 True 64 evaluation loss epoch 50 5 1e-5
(Niklaus et al., 2023) C19 C19 MLTC False 1,2,3 True 64 evaluation loss epoch 50 5 1e-5
(Niklaus et al., 2023) MEU MEU-1 MLTC True 1,2,3 True 64 evaluation loss 5 1e-5
(Niklaus et al., 2023) MEU MEU-2 MLTC True 1,2,3 True 64 evaluation loss 5 1e-5
(Niklaus et al., 2023) MEU MEU-3 MLTC True 1,2,3 True 64 evaluation loss 5 1e-5
(Chalkidis et al., 2021d) ECtHR ECtHR-A MLTC True 1,2,3,4,5 True 8 micro-f1 epoch 20 3 3e-5
(Chalkidis et al., 2021d) ECtHR ECtHR-B MLTC True 1,2,3,4,5 True 8 micro-f1 epoch 20 3 3e-5
(Chalkidis et al., 2021d) EUR-LEX EUR-LEX MLTC False 1,2,3,4,5 True 8 micro-f1 epoch 20 3 3e-5
(Chalkidis et al., 2021d) SCOTUS SCOTUS SLTC True 1,2,3,4,5 True 8 micro-f1 epoch 20 3 3e-5
(Chalkidis et al., 2021d) LEDGAR LEDGAR SLTC False 1,2,3,4,5 True 8 micro-f1 epoch 20 3 3e-5
(Chalkidis et al., 2021d) UnfairToS UnfairToS MLTC False 1,2,3,4,5 True 8 micro-f1 epoch 20 3 3e-5
(Chalkidis et al., 2021d) CaseHOLD CaseHOLD MCQA False 1,2,3,4,5 True 8 micro-f1 epoch 20 3 3e-5
Table 8. Hyperparameters for each dataset and task. There were a few exceptions. For the multilingual MEU tasks, given the dataset's size, we trained multilingual models for only 1 epoch, evaluating every 1000 steps. When using monolingual models, we trained for 50 epochs with an epoch-based evaluation strategy, since we used only the language-specific subset of the dataset. For LexGLUE, we followed the guidelines of Chalkidis et al. (2021d) for RoBERTa-based large language models, which require a maximum learning rate of 1e-5, a warm-up ratio of 0.1, and a weight decay rate of 0.06.
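The settings in Table 8 fall into two configuration families, one per benchmark. The sketch below summarizes the table as plain dictionaries; the key names mirror common trainer arguments and are our own naming, not the authors' actual training scripts.

```python
# Fine-tuning configurations summarizing Table 8 (illustrative naming).
LEXTREME_CONFIG = {
    "seeds": [1, 2, 3],
    "lower_case": True,
    "batch_size": 64,
    "metric_for_best_model": "evaluation loss",
    "evaluation_strategy": "epoch",  # MEU with multilingual models: 1 epoch, eval every 1000 steps
    "epochs": 50,
    "early_stopping_patience": 5,
    "learning_rate": 1e-5,
}

LEXGLUE_CONFIG = {
    "seeds": [1, 2, 3, 4, 5],
    "lower_case": True,
    "batch_size": 8,
    "metric_for_best_model": "micro-f1",
    "evaluation_strategy": "epoch",
    "epochs": 20,
    "early_stopping_patience": 3,
    "learning_rate": 3e-5,  # RoBERTa-based large models: max 1e-5 per Chalkidis et al. (2021d)
}

def config_for(benchmark: str) -> dict:
    """Return the default fine-tuning configuration for a benchmark family."""
    return {"lextreme": LEXTREME_CONFIG, "lexglue": LEXGLUE_CONFIG}[benchmark]
```

Keeping the per-benchmark defaults in one place makes the documented exceptions (MEU evaluation strategy, large-model learning rate) easy to apply as overrides.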
C. Dataset Details
Language Text Type Words Documents Words per Document Jurisdiction Source License/Copyright
Native Multi Legal Pile
bg legislation 309M 262k 1178 Bulgaria MARCELL CC0-1.0
Czechia CzCDC Constitutional Court CC BY-NC 4.0
cs caselaw 571M 342k 1667 Czechia CzCDC Supreme Administrative Court CC BY-NC 4.0
Czechia CzCDC Supreme Court CC BY-NC 4.0
da caselaw 211M 92k 2275 Denmark DDSC CC BY 4.0 and other, depending on the dataset
da legislation 653M 296k 2201 Denmark DDSC CC BY 4.0 and other, depending on the dataset
de caselaw 1786M 614k 2905 Germany openlegaldata ODbL-1.0
Switzerland entscheidsuche similar to CC BY
de legislation 513M 302k 1698 Germany openlegaldata ODbL-1.0
Switzerland lexfind not protected by copyright law
en legislation 2539M 713k 3557 Switzerland lexfind not protected by copyright law
UK uk-lex CC BY 4.0
fr caselaw 1172M 495k 2363 Belgium jurportal not protected by copyright law
France CASS Open Licence 2.0
Luxembourg judoc not protected by copyright law
Switzerland entscheidsuche similar to CC BY
fr legislation 600M 253k 2365 Switzerland lexfind not protected by copyright law
Belgium ejustice not protected by copyright law
hu legislation 265M 259k 1019 Hungary MARCELL CC0-1.0
it caselaw 407M 159k 2554 Switzerland entscheidsuche similar to CC BY
it legislation 543M 238k 2278 Switzerland lexfind not protected by copyright law
nl legislation 551M 243k 2263 Belgium ejustice not protected by copyright law
pl legislation 299M 260k 1148 Poland MARCELL CC0-1.0
pt caselaw 12613M 17M 728 Brazil RulingBR not protected by copyright law
Brazil CRETA CC BY-NC-SA 4.0
Brazil CJPG not protected by copyright law
ro legislation 559M 396k 1410 Romania MARCELL CC0-1.0
sk legislation 280M 246k 1137 Slovakia MARCELL CC0-1.0
sl legislation 366M 257k 1418 Slovenia MARCELL CC-BY-4.0
total 24236M 23M 1065 Native Multi Legal Pile
Overall statistics for the remaining subsets
total 12107M 8M 1457 EU Eurlex Resources CC BY 4.0
total 43376M 18M 2454 US (99%), Canada, and EU Pile of Law CC BY-NC-SA 4.0; See Henderson* et al. (2022) for details
total 28599M 10M 2454 Legal mC4 ODC-BY
Table 9. Words, documents, and average words per document for the Native Multi Legal Pile, broken down by language and text type. For the remaining subsets of Multi Legal Pile we provide aggregate statistics.
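The "Words per Document" column in Table 9 is the ratio of the words and documents columns. A quick sanity check, noting that the table's inputs are themselves rounded (309M, 262k), so recomputed ratios can differ slightly from the printed figures:

```python
def words_per_document(words: int, documents: int) -> int:
    """Average document length, as in the statistics columns of Table 9."""
    return round(words / documents)

# Bulgarian legislation row: 309M words over 262k documents.
# The table prints 1178; the rounded inputs give a nearby value.
print(words_per_document(309_000_000, 262_000))
```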