MultiLegalPile: A 689GB Multilingual Legal Corpus
Niklaus et al., 2023 (arXiv:2306.02069v2 [cs.CL], 6 Jun 2023)

Abstract
Large, high-quality datasets are crucial for training Large Language Models (LLMs). However, […] language-specific SotA in five. On LexGLUE, our English models reach SotA in five out of seven tasks, with the large model achieving the highest aggregate score.

In the spirit of open science, we provide the dataset under a CC BY-NC-SA 4.0 license, with some subsets licensed more permissively. Dataset creation scripts, models, and pretraining code are public under Apache 2.0 licenses. This open-source approach encourages further research and advancements in the field of legal text analysis and understanding using large language models.

Contributions
The contributions of this paper are three-fold:

1. We curate and release a large-scale multilingual legal text corpus, dubbed MultiLegalPile,¹ covering 24 languages and 17 legal systems (jurisdictions).

2. We release 2 multilingual and 24 monolingual new legal-oriented PLMs, dubbed LegalXLMs, warm-started from the XLM-R (Conneau & Lample, 2019) models and further pretrained on the MultiLegalPile. Additionally, we pretrain a Longformer (Beltagy et al., 2020) based on our multilingual base-size model on context lengths of up to 4096 tokens.

3. We benchmark the newly released models on the LEXTREME and LexGLUE benchmarks, achieving new SotA for base- and large-size models and drastically increasing performance on Greek legal code. Our Longformer model reaches SotA in four tasks and the highest dataset aggregate score. Our monolingual models set language-specific SotA in five languages.

¹ https://huggingface.co/datasets/joelito/Multi_Legal_Pile

2. Related Work

2.1. General Pretraining Corpora

The use of pretrained Language Models (PLMs) has become increasingly popular in NLP tasks, particularly with the advent of models such as BERT (Devlin et al., 2019) that can be finetuned for specific applications. One key factor in the success of pretraining is the availability of large and diverse text corpora, which help the model learn the nuances of natural language. In the following, we discuss large-scale general-purpose text corpora used for pretraining.

Wikipedia is a commonly used multilingual dataset for pretraining language models and has been used to pretrain BERT (Devlin et al., 2019), MegatronBERT (Shoeybi et al., 2020), T5 (Raffel et al., 2020a), and GPT-3 (Brown et al., 2020b), among others.

Based on Wikipedia, Merity et al. (2016) created WikiText by selecting articles fitting the Good or Featured article criteria. The dataset contains 103M words and has two versions: WikiText2 and the larger WikiText103. It has been used to pretrain models like MegatronBERT (Shoeybi et al., 2020) and GPT-2 (Radford et al., 2019).

The BookCorpus (Zhu et al., 2015), also known as the Toronto Books Corpus, is an English dataset used for pretraining BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019), and T5 (Raffel et al., 2020a). It consists of almost 1B words from over 11K books collected from the web.

The Common Crawl corpus is a publicly available multilingual dataset of scraped web pages, regularly updated with new "snapshots". It has been used to pretrain GPT-3 (Brown et al., 2020b) as well as XLM-R (Conneau et al., 2020a). One significant drawback of Common Crawl is the presence of uncleaned data, which includes a considerable amount of "gibberish or boiler-plate text like menus, error messages, or duplicate text" (Raffel et al., 2020a). As a result, utilizing the Common Crawl dataset necessitates additional post-filtering and cleaning procedures. To address this issue, Raffel et al. (2020a) performed several cleaning steps on the April 2019 snapshot of Common Crawl, resulting in the Colossal Clean Crawled Corpus (C4), comprising 750 GB of English-language text. It was used for pretraining models such as T5 (Raffel et al., 2020a) and Switch Transformer (Fedus et al., 2022).

OpenWebText (Gokaslan & Cohen, 2019) openly replicates OpenAI's closed English WebText dataset (Radford et al., 2019), used to pretrain GPT-2 (Radford et al., 2019). WebText comprises over 8M documents with a combined text size of 40 GB. To ensure data uniqueness, any documents sourced from Wikipedia were excluded from WebText, as they are commonly utilized in other datasets. OpenWebText, on the other hand, consists of 38 GB of text data from 8M documents and was used for pretraining RoBERTa (Liu et al., 2019) and MegatronBERT (Shoeybi et al., 2020).

News articles are also a common source for pretraining corpora. The RealNews dataset (Zellers et al., 2019) is a large corpus extracted from Common Crawl, containing news articles from December 2016 to March 2019 (training) and April 2019 (evaluation), totaling 120 GB. It was used for pretraining MegatronBERT (Shoeybi et al., 2020). For pretraining RoBERTa, Liu et al. (2019) used an English subset of RealNews, comprising 63M English news articles crawled from September 2016 to February 2019.

The rise of LLMs brought about the creation of ever larger training datasets. The Pile (Gao et al., 2020b) combines 22 distinct, well-curated datasets, such as Wikipedia (English), OpenWebText2 (Gokaslan & Cohen, 2019), OpenSubtitles (Tiedemann, 2016), etc., encompassing 825 GB of data. Besides general-purpose textual datasets, it also contains domain-specific datasets, such as ArXiv (Science),
FreeLaw (Legal), PubMed Abstracts (Biomedicine), and GitHub data (to improve code-related task performance (Gao et al., 2020b)). GPT-2 (Radford et al., 2019) and GPT-3 (Brown et al., 2020b) were evaluated on this dataset.

In their work, Touvron et al. (2023) compiled a substantial dataset from various publicly available sources, including CommonCrawl, C4, Github, Wikipedia, etc., totaling 1.4T tokens. They trained the 13B-parameter LLaMA model using this dataset, surpassing the performance of the 175B-parameter GPT-3 on most benchmark tasks. However, the dataset itself is not publicly available. To address this, a collaborative effort resulted in the creation of the RedPajama-Data-1T dataset, replicating LLaMA's dataset with a similar size of 1.2T tokens.

Some of the aforementioned datasets, such as Common Crawl, are used to pretrain multilingual versions of BERT, DistilBERT, RoBERTa, etc. These models were pretrained on datasets covering approximately 100 languages, thereby neglecting low-resource languages. ImaniGooghari et al. (2023) addressed this by compiling Glot500, a 700 GB dataset covering 500 diverse languages, with a focus on low-resource ones. The Glot500-m model, pretrained on this dataset, outperformed the XLM-RoBERTa base model on six out of seven tasks.

2.2. Domain Specific Corpora

While pretraining on general-purpose text like Wikipedia and news articles shows promise, evidence suggests that pretraining on domain-specific text can enhance language model performance on related tasks (Beltagy et al., 2019; Gu et al., 2021; Chalkidis et al., 2020b; Niklaus & Giofré, 2022). Domain-specific text corpora include texts specific to fields like medicine, law, or science.

Several studies have examined pretraining on scientific text corpora. Beltagy et al. (2019) pretrained SciBERT, a BERT-based model, on a random subset of 1.14M papers sourced from Semantic Scholar. This collection comprises 18% computer science papers and 82% papers from the broader biomedical field. Similarly, PubMed and PubMedCentral are common sources for biomedical datasets. Gu et al. (2021) trained PubMedBERT using PubMed abstracts and PubMedCentral articles; BioBERT (Lee et al., 2020) was pretrained similarly. Johnson et al. (2016) compiled the Medical Information Mart for Intensive Care III (MIMIC-III) dataset, a large single-center database of critical care patients. Huang et al. (2019) used over 2 million de-identified clinical notes from this dataset to pretrain ClinicalBERT. These models outperformed general-purpose models on biomedical NLP tasks.

In the legal domain, similar strategies are observed. Chalkidis et al. (2020a) collected 12 GB of diverse English legal texts, including legislation, court cases, and contracts. They pretrained LegalBERT on this dataset, showing state-of-the-art performance, especially in tasks requiring domain knowledge. Another study by Zheng et al. (2021) used the entire English Harvard Law case corpus (1965-2021), comprising 37 GB of text, to pretrain CaseLaw-BERT.

Recently, Chalkidis et al. (2023) released LexFiles, an English legal corpus with 11 sub-corpora covering legislation and case law from six English-speaking legal systems (EU, Council of Europe, Canada, US, UK, India). The corpus contains approx. 6M documents or approx. 19B tokens. They trained two new legal English PLMs, showing improved performance in legal probing and classification tasks.

Efforts to pretrain legal language models also exist for Italian (Licari & Comandè, 2022), Romanian (Masala et al., 2021), and Spanish (Gutiérrez-Fandiño et al., 2021). However, English dominates, underscoring the importance of compiling multilingual legal corpora.

Model | Domain | Languages | Size in # Words
SciBERT (Beltagy et al., 2019) | scientific | English | 2.38B (3.17B tokens)
Galactica (Taylor et al., 2022) | scientific | English | 79.5B (106B tokens)
BioBERT (Lee et al., 2019) | biomedical | English | 18B
LegalBERT (Chalkidis et al., 2020b) | legal | English | 1.44B (11.5GB)
CaselawBERT (Zheng et al., 2021) | legal | English | 4.63B (37GB)
LegalXLMs (ours) | legal | 24 EU langs | 87B (689GB)

Table 1. Previous domain-specific pretraining corpora. For some corpora only GB or tokens were available. We converted 8 GB into 1B words and 1 token to 0.75 words.

Table 1 compares previous domain-specific corpora, all in English. In terms of size, none reach the MultiLegalPile proposed here.

3. MultiLegalPile

3.1. Construction

We transformed all datasets into xz-compressed JSON Lines (JSONL) format. The combination of xz compression and JSONL is ideal for streaming large datasets due to reduced file size and efficient decompression and reading.

Filtering mC4
We employed the vast multilingual web crawl corpus, mC4 (Xue et al., 2021), as our foundation. To effectively filter this corpus for legal content, we utilized regular expressions to identify documents with legal references. We found that detecting legal citations, such as references to laws and rulings, served as a reliable indicator of legal-specific documents in the corpus.

Iteration | German | English | Spanish | French | Italian
1st | 100% | 20% | 100% | 65% | 80%
2nd | 100% | 85% | 100% | 100% | 95%

Table 2. Precision of investigated languages in legal mC4 (n=20)
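The citation-based filtering described above can be sketched as follows. The two patterns here (a German-style section reference and a US-style reporter citation) are illustrative stand-ins, not the authors' actual expressions, and the document threshold is likewise an assumption:

```python
import re

# Illustrative citation patterns (stand-ins for the paper's actual regexes):
LEGAL_CITATION_PATTERNS = [
    re.compile(r"§\s*\d+[a-z]?\s+[A-Z][A-Za-z]+"),   # e.g. "§ 823 BGB"
    re.compile(r"\b\d{1,4}\s+U\.S\.\s+\d{1,4}\b"),   # e.g. "410 U.S. 113"
]

def looks_legal(text: str, min_hits: int = 2) -> bool:
    """Keep a document if it contains at least `min_hits` citation matches."""
    hits = sum(len(p.findall(text)) for p in LEGAL_CITATION_PATTERNS)
    return hits >= min_hits

docs = [
    "The court applied § 823 BGB and § 31 BGB to the facts.",
    "Best pizza recipes for your next party.",
]
kept = [d for d in docs if looks_legal(d)]
```

Because compiled regexes run in a single pass over each document with no model inference, this style of filter scales to a web-crawl corpus far more cheaply than a learned classifier, at the cost of the iterative precision tuning described below.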
In order to ensure the accuracy of our filtering, we engaged legal experts to aid in identifying citations to laws and rulings across different jurisdictions and languages. We manually reviewed the precision of the retrieved documents for five languages, namely German, English, Spanish, French, and Italian, as shown in Table 2. The proficiency levels of the evaluators included native German, fluent English and Spanish, intermediate French, and basic Italian.

Subsequent to the initial review, we performed a second round of precision evaluation, during which we refined our regular expressions based on our findings from the first iteration. This iterative process not only enhanced the precision of the legal content detection, but also reduced the corpus size from 133GB to 106GB. Although the overall volume of data was reduced, this process significantly improved the quality and specificity of the corpus by focusing on legal content with a higher degree of precision.

A major reason for utilizing regexes instead of a Machine Learning (ML) based classifier was speed. Even with regexes, filtering such a huge corpus as mC4 (27TB in total, of which 10.4TB are in English) took several days. An ML model based on Bag-of-Words, word vectors, or even contextualized embeddings would a) need an annotated dataset and b) likely be much slower.

[…] we converted it into a unified format, such as jsonl. The post-processing steps involved performing various tasks depending on the initial data format. For example, in the case of CASS, we extracted the textual data from XML tags.

Curating Eurlex Resources
To curate the Eurlex resources, we utilized the eurlex R package to generate SPARQL queries and download the data. Subsequently, we converted the data into a format more amenable to handling large datasets using Python.

Integrating Pile of Law
Henderson et al. (2022) released a large corpus of diverse legal text in English, mainly originating from the US. We integrated the latest version with additional data (from January 8, 2023) into our corpus.

3.2. Description

MultiLegalPile consists of four large subsets: a) Native Multi Legal Pile (112 GB), b) Eurlex Resources² (179 GB), c) Legal mC4³ (106 GB) and d) Pile of Law (Henderson et al., 2022) (292 GB).

Figure 3 details the distribution of languages. Note that due to the integration of the Pile of Law, English is by far the most dominant language, representing over half of the words. In Figure 2 we show the distribution across text types. Caselaw makes up over half of the corpus, due to the good public access to court rulings, especially in common law countries. Note that even in civil law countries – where legislation is much more important – caselaw is usually more plentiful than legislation (as can be seen in the Swiss case in Table 9). It is hard to find publicly available contracts, leading to their relatively low percentage of the total corpus (< 10%), even though they could potentially make up most of the legal texts in existence (from the private sector). Note that most of the contracts in our corpus are from the US or international treaties with the EU. Table 9 in Appendix C provides additional details of the MultiLegalPile, including sources and licenses.
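The unified xz-compressed JSONL layout described in Section 3.1 can be read and written in a streaming fashion with the Python standard library alone; the shard name and record fields below are illustrative, not the corpus' actual schema:

```python
import json
import lzma
import os
import tempfile

def write_jsonl_xz(path, records):
    """Stream records into an xz-compressed JSON Lines file."""
    with lzma.open(path, "wt", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")

def read_jsonl_xz(path):
    """Lazily yield records; the whole file is never held in memory."""
    with lzma.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)

# Illustrative shard name and fields:
path = os.path.join(tempfile.gettempdir(), "shard-00000.jsonl.xz")
write_jsonl_xz(path, [{"language": "de", "text": "§ 1 BGB ..."}])
first = next(read_jsonl_xz(path))
```

One JSON object per line means a reader can decompress and parse record by record, which is what makes this format convenient for streaming multi-hundred-GB subsets.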
Figure 3. MultiLegalPile Language Distribution (Note the log-scaled y-axis)
[…] do not explicitly state the license used for the available data. We assume that such data sources allow pretraining usage, since the creators are usually public agencies such as courts and administrations. Such legislation and caselaw is usually not protected by copyright law. Table 9 provides an overview of the license or copyright situation for each of the 29 sources in the Native Multi Legal Pile.

Second, the Eurlex Resources are licensed under CC BY 4.0 by the European Union.⁴ Thus, including this corpus does not pose legal issues for pretraining.

Third, the Legal mC4 corpus was created by filtering multilingual C4 (Xue et al., 2021) for legal content as described above. As mC4 is licensed under ODC-BY, we also release the filtered Legal mC4 corpus under the same license.

Finally, the Pile of Law (Henderson et al., 2022) is published under CC BY-NC-SA 4.0 and the dataset is not altered, therefore the license remains the same.

Usage of the MultiLegalPile corpus is presumably possible for pretraining of NLP models. In general, we assume that the fair use doctrine allows employing the data for legal NLP models because the results are rather transformative (Henderson et al., 2023). Nevertheless, copyright issues in generative AI remain an unresolved problem for the moment. Several court cases are currently pending, such as Getty Images suing Stability AI for intellectual property infringement (Sag, 2023).

4. Pretraining Legal Models

As part of this study, we release 2 new multilingual legal-oriented PLMs, dubbed Legal-XLM-Rs, trained on the newly introduced MultiLegalPile corpus (Section 3). For the newly released Legal-XLM-Rs we followed a series of best practices in the language model development literature:

(a) We warm-start (initialize) our models from the original XLM-R checkpoints (base or large) of Conneau & Lample (2019). Model recycling is a standard process followed by many (Wei et al., 2021; Ouyang et al., 2022) to benefit from starting from an available "well-trained" PLM, rather than from scratch (random). XLM-R was trained on 2.5TB of cleaned CommonCrawl data in 100 languages.

(b) We train a new tokenizer of 128K BPEs on the training subsets of MultiLegalPile to better cover legal language across all available legal systems and languages. However, we reuse the original XLM-R embeddings for all lexically overlapping tokens (Pfeiffer et al., 2021), i.e., we warm-start word embeddings for tokens that already exist in the original XLM-R vocabulary, and use random ones for the rest.

(c) We continue pretraining our models on the diverse MultiLegalPile corpus with batches of 512 samples for an additional 1M/500K steps for the base/large model. We do initial warm-up steps for the first 5% of the total training steps with a linearly increasing learning rate up to 1e−4, and then follow a cosine decay schedule, following recent trends. For half of the warm-up phase (2.5%), the Transformer encoder is frozen, and only the embeddings, shared between input and output (MLM), are updated. We also use an increased 20/30% masking rate for base/large models respectively, where also 100% of the predictions are based on masked tokens, compared to Devlin et al. (2019),⁵ based on the findings of Wettig et al. (2023).

(d) For both training the tokenizer and our legal models, we use a sentence sampler with exponential smoothing of the sub-corpora sampling rate, following Conneau & Lample (2019) and Raffel et al. (2020b), since there is a disparate proportion of tokens across sub-corpora and languages (Figures 1 and 3) and we aim to preserve per-corpus and language capacity, i.e., avoid overfitting to the majority (approx. 50% of the total number of tokens) US-origin English texts.

(e) We consider mixed-cased models, i.e., both upper- and lowercase letters covered, similar to all recently developed large PLMs (Conneau & Lample, 2019; Raffel et al., 2020b; Brown et al., 2020a).

To better account for long contexts often found in legal documents, we continue training the base-size multilingual model on long contexts (4096 tokens) with windowed attention (128 tokens window size) (Beltagy et al., 2020) for 50K steps, dubbing it Legal-XLM-LF-base. We use the standard 15% masking probability and increase the learning rate to 3e−5 before decaying, but otherwise use the same settings as for training the small-context models.

⁴ EUR-Lex Legal notice
⁵ Devlin et al. – and many other follow-up works – used a 15% masking ratio, and a recipe of 80/10/10% of predictions made across masked/randomly-replaced/original tokens.
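The learning-rate schedule of step (c), linear warm-up over the first 5% of steps followed by cosine decay, can be sketched as below; the decay-to-zero endpoint is an assumption, since the final learning rate is not stated:

```python
import math

def lr_at_step(step, total_steps, peak_lr=1e-4, warmup_frac=0.05):
    """Linear warm-up to peak_lr over the first warmup_frac of training,
    then cosine decay towards zero (a sketch of the schedule in step (c))."""
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# Base model: 1M total steps, so warm-up covers the first 50K steps.
lrs = [lr_at_step(s, 1_000_000) for s in (0, 50_000, 1_000_000)]
```

The rate peaks exactly at the end of warm-up and falls smoothly afterwards, which is the behavior the "linearly increasing, then cosine decay" description implies.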
Model Source Params Vocab Specs Corpus # Langs
MiniLM Wang et al. (2020) 118M 250K 1M steps / BS 256 2.5TB CC100 100
DistilBERT Sanh et al. (2020) 135M 120K BS up to 4000 Wikipedia 104
mDeBERTa-v3 He et al. (2021b;a) 278M 128K 500K steps / BS 8192 2.5TB CC100 100
XLM-R base Conneau et al. (2020b) 278M 250K 1.5M steps / BS 8192 2.5TB CC100 100
XLM-R large Conneau et al. (2020b) 560M 250K 1.5M steps / BS 8192 2.5TB CC100 100
Legal-XLM-R-base ours 184M 128K 1M steps / BS 512 689GB MLP 24
Legal-XLM-R-large ours 435M 128K 500K steps / BS 512 689GB MLP 24
Legal-XLM-LF-base ours 208M 128K 50K steps / BS 512 689GB MLP 24
Legal-mono-R-base ours 111M 32K 200K steps / BS 512 689GB MLP 1
Legal-mono-R-large ours 337M 32K 500K steps / BS 512 689GB MLP 1
Table 3. Models: All models can process up to 512 tokens, except Legal-XLM-LF-base, which can process up to 4096 tokens. BS is short for batch size. MLP is short for MultiLegalPile. Params is the total parameter count (including the embedding layer).
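The exponentially smoothed sub-corpus sampling of step (d) can be sketched as below; the exponent alpha and the per-language word counts are illustrative, not the authors' exact settings:

```python
def smoothed_sampling_probs(word_counts, alpha=0.3):
    """Sampling probability p_i proportional to n_i ** alpha.

    With alpha < 1 the distribution is flattened, so low-resource
    sub-corpora/languages are sampled more often than their raw share.
    """
    weights = {k: n ** alpha for k, n in word_counts.items()}
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}

# Illustrative word counts loosely reflecting the corpus skew
# (English dominates; Dutch has roughly 810M words).
counts = {"en": 44_000_000_000, "de": 4_000_000_000, "nl": 810_000_000}
probs = smoothed_sampling_probs(counts)
```

Under raw-frequency sampling, Dutch would receive under 2% of English's share; with smoothing it rises to roughly a third, which is how such samplers avoid overfitting to the majority English texts while still respecting corpus size.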
In addition to the multilingual models, we also train 24 monolingual models on each of the language-specific subsets of the corpus. Except for choosing a smaller vocab size of 32K tokens, we use the same settings as for the multilingual models. Due to resource constraints, we only train base-size models and stop training at 200K steps. Due to limited data available in some low-resource languages, these models sometimes do multiple passes over the data. Because of plenty of data and to achieve a better comparison on LexGLUE, we continued training the English model for 1M steps and also trained a large-size model for 500K steps. See Table 7 in Appendix A for an overview.

We make all our models publicly available alongside all intermediate checkpoints (every 50K/10K training steps for RoBERTa/Longformer models) on the Hugging Face Hub.⁶

⁶ https://huggingface.co/joelito

5. Evaluating on LEXTREME and LexGLUE

5.1. Benchmark Description

Below we briefly describe each dataset. We refer the interested reader to the original papers for more details.

LEXTREME (Niklaus et al., 2023) is a multilingual legal benchmark. It includes five single-label text classification datasets, three multi-label text classification datasets and four Named Entity Recognition (NER) datasets.

Brazilian Court Decisions (BCD) (Lage-Freitas et al., 2022) is from the State Supreme Court of Alagoas (Brazil) and involves predicting case outcomes and judges' unanimity on decisions. German Argument Mining (GAM) (Urchs et al., 2021) contains 200 German court decisions for classifying sentences according to their argumentative function. Greek Legal Code (GLC) (Papaloukas et al., 2021) tackles topic classification of Greek legislation documents. Tasks involve predicting topic categories at volume, chapter, and subject levels. Swiss Judgment Prediction (SJP) (Niklaus et al., 2021) focuses on predicting the judgment outcome from 85K cases from the Swiss Federal Supreme Court. Online Terms of Service (OTS) (Drawzeski et al., 2021) contains 100 contracts for detecting unfair clauses, with the tasks of classifying sentence unfairness levels and identifying clause topics. COVID19 Emergency Event (C19) (Tziafas et al., 2021) consists of legal documents from several European countries related to COVID-19 measures, where models identify the type of measure described in a sentence. MultiEURLEX (MEU) (Chalkidis et al., 2021b) is a corpus of 65K EU laws annotated with EUROVOC taxonomy labels. The task involves identifying labels for each document. Greek Legal NER (GLN) (Angelidis et al., 2018) is a dataset for NER in Greek legal documents. LegalNERo (LNR) (Pais et al., 2021) tackles NER in Romanian legal documents. LeNER BR (LNB) (Luz de Araujo et al., 2018) addresses NER in Brazilian legal documents. MAPA (MAP) (Baisa et al., 2016) is a multilingual corpus based on EUR-Lex for NER, annotated at a coarse-grained and a fine-grained level.

LexGLUE (Chalkidis et al., 2021d) is a legal benchmark covering two single-label text classification datasets, four multi-label text classification datasets and a multiple-choice question answering dataset.

ECtHR Tasks A & B (Chalkidis et al., 2019a; 2021c) contain approx. 11K cases from the European Court of Human Rights (ECtHR) public database. Based on case facts, Task A involves predicting violated articles of the European Convention on Human Rights (ECHR) and Task B involves predicting allegedly violated articles. SCOTUS (Spaeth et al.) combines information from US Supreme Court (SCOTUS) opinions with the Supreme Court DataBase (SCDB). The task is to classify court opinions into 14 issue areas. EUR-LEX (Chalkidis et al., 2021a) contains 65K EU laws from the EUR-Lex portal, annotated with EuroVoc concepts. The task is to predict EuroVoc labels for a given document. LEDGAR (Tuggener et al., 2020) contains approx. 850K contract provisions from US Securities and Exchange Commission (SEC) filings. The task is to classify contract provisions into categories. UNFAIR-ToS (Lippi et al., 2019)
Model BCD GAM GLC SJP OTS C19 MEU GLN LNR LNB MAP Agg.
MiniLM 53.0 73.3 42.1 67.7 44.1 5.0 29.7 74.0 84.5 93.6 57.8 56.8
DistilBERT 54.5 69.5 62.8 66.8 56.1 25.9 36.4 71.0 85.3 89.6 60.8 61.7
mDeBERTa-v3 60.2 71.3 52.2 69.1 66.5 29.7 37.4 73.3 85.1 94.8 67.2 64.3
XLM-R-base 63.5 72.0 57.4 69.3 67.8 26.4 33.3 74.6 85.8 94.1 62.0 64.2
XLM-R-large 58.7 73.1 57.4 69.0 75.0 29.0 42.2 74.1 85.0 95.3 68.0 66.1
Legal-XLM-R-base 62.5 72.4 68.9 70.2 70.8 30.7 38.6 73.6 84.1 94.1 69.2 66.8
Legal-XLM-R-large 63.3 73.9 59.3 70.1 74.9 34.6 39.7 73.1 83.9 94.6 67.3 66.8
Legal-XLM-LF-base 72.4 74.6 70.2 72.9 69.8 26.3 33.1 72.1 84.7 93.3 66.2 66.9
Table 4. Dataset aggregate scores for multilingual models on LEXTREME. We report macro-F1 and the best scores in bold.
Model bg cs da de el en es et fi fr ga hr hu it lt lv mt nl pl pt ro sk sl sv Agg.
MiniLM 52.7 48.6 42.8 54.6 50.3 34.3 40.1 46.3 42.2 39.0 42.8 29.7 29.6 40.5 44.2 40.8 40.8 29.5 22.7 61.6 59.6 44.3 30.0 43.4 40.5
DistilBERT 54.2 48.6 46.0 60.1 58.8 48.0 50.0 48.8 49.6 47.9 51.4 35.9 31.2 50.1 51.9 41.5 44.4 34.6 34.5 63.2 63.8 51.3 36.2 50.1 46.7
mDeBERTa-v3 54.1 51.3 51.7 63.6 57.7 50.7 53.3 50.8 54.6 49.2 54.9 37.4 37.5 55.1 53.9 47.0 52.5 42.1 41.0 65.7 65.3 55.4 37.5 56.1 50.5
XLM-R-base 56.4 48.3 48.3 60.6 57.6 50.1 47.2 46.7 48.6 49.4 50.1 33.6 32.8 53.4 50.0 44.1 43.8 35.2 41.3 66.1 63.7 45.3 33.7 50.0 47.1
XLM-R-large 59.9 56.0 56.3 65.4 60.8 56.2 56.6 56.5 56.9 51.4 55.4 42.5 38.1 58.5 58.1 49.9 53.9 39.5 46.4 68.6 66.8 57.9 42.4 59.1 53.7
Legal-XLM-R-base 55.6 58.8 50.4 63.6 63.7 66.8 56.3 57.0 52.6 50.1 56.6 38.7 56.5 56.1 57.2 49.1 56.0 41.6 43.9 68.2 66.1 55.6 38.6 54.9 53.5
Legal-XLM-R-large 57.8 55.6 50.4 65.7 60.7 69.3 55.7 54.5 56.6 53.3 57.2 39.7 39.1 58.1 60.6 48.4 57.2 39.4 45.5 67.3 65.5 49.3 39.7 56.4 53.6
Legal-XLM-LF-base 54.4 49.3 48.1 64.0 60.5 52.8 49.2 52.2 48.2 48.5 55.4 33.0 34.7 54.6 54.8 45.2 52.5 40.1 40.6 68.3 64.1 48.4 33.0 51.3 48.9
NativeLegalBERT - - - - - 53.1 46.9 - - - - - - 45.3 - - - - - - 59.0 - - - 51.1
NativeBERT 54.8 57.3 51.2 63.0 62.3 52.0 42.6 47.2 52.4 49.4 50.1 - 37.4 47.1 - - - 37.0 40.5 66.5 63.1 44.8 - 55.1 50.2
Legal-mono-R-base 55.9 49.5 51.5 61.3 61.3 50.5 52.1 53.5 53.6 51.1 52.2 44.1 54.1 51.8 55.5 50.0 59.1 54.3 34.4 67.1 61.5 48.8 53.4 58 53.5
Table 5. Language aggregate scores on LEXTREME. We report macro-F1 and best scores in bold. For each language, we also list the
best-performing monolingual legal model under NativeLegalBERT, the best-performing monolingual non-legal model under NativeBERT
and our monolingual legal models under Legal-mono-R-base. Missing values indicate that no suitable models were found.
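A minimal sketch of the two scoring ingredients used throughout the evaluation: macro-F1 (the unweighted mean of per-class F1) and the harmonic-mean aggregation over per-dataset or per-language scores:

```python
from statistics import harmonic_mean

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over the classes present in y_true."""
    f1s = []
    for c in sorted(set(y_true)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

def aggregate(scores):
    """Harmonic-mean aggregation over per-dataset (or per-language) scores."""
    return harmonic_mean(scores)

score = macro_f1([0, 0, 1, 1], [0, 1, 1, 1])  # per-class F1: 2/3 and 0.8
agg = aggregate([80.0, 40.0])                  # well below the arithmetic mean of 60.0
```

The harmonic mean penalizes weak outliers more than the arithmetic mean, so a single poorly handled dataset or language drags the aggregate down noticeably.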
contains 50 Terms of Service (ToS) from online platforms, annotated with types of unfair contractual terms. The task is to predict unfair types for a given sentence. CaseHOLD (Zheng et al., 2021) contains approx. 53K multiple-choice questions about holdings of US court cases. The task is to identify the correct holding statement from a selection of five choices.

5.2. Experimental Setup

To ensure comparability, we followed the experimental setups described in the original papers (Niklaus et al., 2023; Chalkidis et al., 2021d), using hierarchical transformers for datasets where the sequence length of most documents exceeds the maximum sequence length of the model (Aletras et al., 2016; Niklaus et al., 2022). The hyperparameters used for running experiments on each dataset are provided in Table 8 in the appendix. To obtain Table 6, we followed Chalkidis et al. (2021d), running five repetitions with different random seeds (1-5) and reporting the test scores based on the seed that yielded the best scores on the development data. For values in Tables 4 and 5, we followed the procedure in Niklaus et al. (2023), taking the mean of the results of 3 random seeds (1-3). We show an overview of the evaluated models in Table 3.

Table 6. Results on LexGLUE. We report macro-F1 and best scores in bold. Results from models marked with * are from Chalkidis et al. (2021d). Similar to LEXTREME, we computed the aggregate score as the harmonic mean of individual dataset results.

5.3. Evaluation on LEXTREME

We evaluate our models on LEXTREME (Niklaus et al., 2023) and show results across datasets in Table 4 and across languages in Table 5.

We notice that our Legal-XLM-R-base model is on par with XLM-R large even though it only contains 33% of the parameters (184M vs 560M). All our models outperform XLM-R large on the dataset aggregate score. Our base model sets a new SotA on MAPA (MAP), the large model on COVID19 Emergency Event (C19), and the Longformer on Brazilian Court Decisions (BCD), German Argument Mining (GAM), Greek Legal Code (GLC) and Swiss Judgment Prediction (SJP). Surprisingly, the legal models slightly underperform in three NER tasks (GLN, LNR, and LNB). Sensitivity to hyperparameter choice could be a reason for this underperformance (we used the same hyperparameters for all models without tuning due to limited compute resources). We see the largest improvements over prior art in Brazilian Court Decisions (72.4 vs. 63.5) and in Greek Legal Code (70.2 vs. 62.8). Maybe these tasks are particularly hard and therefore legal in-domain pretraining helps more. For BCD especially, the large amount of Brazilian caselaw in the pretraining corpus may offer an additional explanation.

The monolingual models underperform their base model XLM-R base only in Italian, Polish, and Romanian. In some languages the monolingual model even outperforms XLM-R base clearly (Croatian, Hungarian, Latvian, Maltese, Dutch, Slovakian, and Swedish), and in five of them even sets the new SotA for the language, sometimes clearly outperforming all other models (the Dutch model even outperforms its closest competitor mDeBERTa-v3 by 11.2 macro F1 and
its base model XLM-R by almost 20 macro F1). These languages are all at the lower end of data availability in the MultiLegalPile, with the richest language (Dutch) containing only 810M words (see Figure 3). Pretraining a monolingual model on in-domain data may therefore be worth it, especially in low-resource languages.

Even though our legal Longformer model performs best on the dataset level, it performs much worse on the language level, possibly due to its lower scores in the most multilingual tasks MEU, MAP and C19 (24, 24 and 6 languages, respectively). Our legal base and large models achieve SotA in some languages, and are on aggregate almost as robust across languages as XLM-R.

Computing the final LEXTREME scores (harmonic mean of the dataset aggregate and language aggregate scores), we find that Legal-XLM-R-large is the new SotA on LEXTREME with a score of 59.5, vs. 59.4 for Legal-XLM-R-base and 59.3 for XLM-R-large. The legal Longformer's LEXTREME score of 56.5 is not competitive due to its low language aggregate score.

5.4. Evaluation on LexGLUE

We evaluate our English and multilingual models on LexGLUE (Chalkidis et al., 2021e) and compare against baselines (see Table 6). Our models excel on the ECtHR, SCOTUS, EUR-LEX, and CaseHOLD tasks, achieving new SotA. On the other two tasks our models match general-purpose models such as RoBERTa. A reason for the slight underperformance of the legal models on the LEDGAR and especially the Unfair ToS tasks may be the relatively low availability of contracts in the MultiLegalPile.

6. Conclusions and Future Work

Limitations We did not perform deduplication, so data from the legal mC4 part might be present in other parts. However, recent work (Muennighoff et al., 2023) suggests that data duplication does not degrade performance during pretraining for up to four epochs. Overlap between the other parts is highly unlikely, since they come from completely different jurisdictions.

Conclusions Due to a general lack of multilingual pretraining data, especially in specialized domains such as law, we curate a large-scale, high-quality corpus in 24 languages from 17 jurisdictions. We continue pretraining XLM-R checkpoints on our data, achieving a new SotA for base and large models on the LEXTREME benchmark and vastly outperforming previous methods on Greek Legal Code. We turn our XLM-R base model into a Longformer and continue pretraining on long documents. It reaches a new SotA on four LEXTREME datasets and achieves the overall highest dataset aggregate score. Monolingual models achieve huge gains over their base model XLM-R in some languages and even set language-specific SotA in five languages, outperforming other models by as much as 11 macro F1. On LexGLUE, our English models reach SotA in five out of seven tasks, with the large model achieving the highest aggregate score.

Future Work We leave the pretraining of a large generative multilingual legal language model for future work. Here, we limited the corpus to the EU languages due to resource constraints, but in the future we would like to expand the corpus in terms of languages and jurisdictions covered. Especially in China there exist many accessible sources suitable to extend the corpus. Finally, it would be very interesting to study the specific contents of the MultiLegalPile in more detail.
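The final LEXTREME score used in Section 5 is the harmonic mean of the dataset aggregate and the language aggregate scores; a minimal sketch of that computation follows. The aggregate values passed in below are illustrative placeholders, not the paper's reported per-axis numbers.

```python
from statistics import harmonic_mean

def lextreme_score(dataset_aggregate: float, language_aggregate: float) -> float:
    """Final LEXTREME score: harmonic mean of the two aggregate scores."""
    return harmonic_mean([dataset_aggregate, language_aggregate])

# Illustrative aggregates only; for reference, the paper reports final scores
# of 59.5 (Legal-XLM-R-large), 59.4 (Legal-XLM-R-base) and 59.3 (XLM-R-large).
print(round(lextreme_score(61.0, 58.0), 2))
```

The harmonic mean penalizes imbalance between the two axes, which is why the legal Longformer's low language aggregate drags its final score down despite the highest dataset aggregate.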
A benchmark dataset for legal language understanding in English, 2021d. URL https://arxiv.org/abs/2110.00976.

Chalkidis, I., Jana, A., Hartung, D., Bommarito, M. J., Androutsopoulos, I., Katz, D. M., and Aletras, N. LexGLUE: A Benchmark Dataset for Legal Language Understanding in English. SSRN Scholarly Paper ID 3936759, Social Science Research Network, Rochester, NY, October 2021e. URL https://papers.ssrn.com/abstract=3936759.

Chalkidis*, I., Garneau*, N., Goanta, C., Katz, D. M., and Søgaard, A. LeXFiles and LegalLAMA: Facilitating English multinational legal language model development, 2023.

Conneau, A. and Lample, G. Cross-lingual Language Model Pretraining. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL https://proceedings.neurips.cc/paper/2019/hash/c04c19c2c2474dbf5f7ac4372c5b9af1-Abstract.html.

Conneau, A., Khandelwal, K., Goyal, N., Chaudhary, V., Wenzek, G., Guzmán, F., Grave, E., Ott, M., Zettlemoyer, L., and Stoyanov, V. Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 8440–8451, Online, July 2020a. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.747. URL https://aclanthology.org/2020.acl-main.747.

Conneau, A., Khandelwal, K., Goyal, N., Chaudhary, V., Wenzek, G., Guzmán, F., Grave, E., Ott, M., Zettlemoyer, L., and Stoyanov, V. Unsupervised Cross-lingual Representation Learning at Scale. arXiv:1911.02116 [cs], April 2020b. URL http://arxiv.org/abs/1911.02116.

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. doi: 10.18653/v1/N19-1423. URL https://aclanthology.org/N19-1423.

Drawzeski, K., Galassi, A., Jablonowska, A., Lagioia, F., Lippi, M., Micklitz, H. W., Sartor, G., Tagiuri, G., and Torroni, P. A Corpus for Multilingual Analysis of Online Terms of Service. In Proceedings of the Natural Legal Language Processing Workshop 2021, pp. 1–8, Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics. URL https://aclanthology.org/2021.nllp-1.1.

Fedus, W., Zoph, B., and Shazeer, N. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research, 23:1–39, 2022. URL http://jmlr.org/papers/v23/21-0998.html.

Gao, L., Biderman, S., Black, S., Golding, L., Hoppe, T., Foster, C., Phang, J., He, H., Thite, A., Nabeshima, N., Presser, S., and Leahy, C. The Pile: An 800GB Dataset of Diverse Text for Language Modeling. arXiv:2101.00027 [cs], December 2020a. URL http://arxiv.org/abs/2101.00027.

Gao, L., Biderman, S. R., Black, S., Golding, L., Hoppe, T., Foster, C., Phang, J., He, H., Thite, A., Nabeshima, N., Presser, S., and Leahy, C. The Pile: An 800GB dataset of diverse text for language modeling. ArXiv, abs/2101.00027, 2020b.

Gokaslan, A. and Cohen, V. OpenWebText corpus, 2019. URL http://Skylion007.github.io/OpenWebTextCorpus.

Gu, Y., Tinn, R., Cheng, H., Lucas, M., Usuyama, N., Liu, X., Naumann, T., Gao, J., and Poon, H. Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing. ACM Trans. Comput. Healthcare, 3(1), October 2021. ISSN 2691-1957. doi: 10.1145/3458754. URL https://doi.org/10.1145/3458754.

Gutiérrez-Fandiño, A., Armengol-Estapé, J., Gonzalez-Agirre, A., and Villegas, M. Spanish Legalese Language Model and Corpora. October 2021. URL http://arxiv.org/abs/2110.12201.

He, P., Gao, J., and Chen, W. DeBERTaV3: Improving DeBERTa using ELECTRA-Style Pre-Training with Gradient-Disentangled Embedding Sharing. arXiv:2111.09543 [cs], December 2021a. URL http://arxiv.org/abs/2111.09543.

He, P., Liu, X., Gao, J., and Chen, W. DeBERTa: Decoding-enhanced BERT with Disentangled Attention. arXiv:2006.03654 [cs], October 2021b. URL http://arxiv.org/abs/2006.03654.

Henderson, P., Krass, M. S., Zheng, L., Guha, N., Manning, C. D., Jurafsky, D., and Ho, D. E. Pile of Law: Learning Responsible Data Filtering from the Law and a 256GB Open-Source Legal Dataset, July 2022. URL http://arxiv.org/abs/2207.00220. arXiv:2207.00220 [cs].

Henderson*, P., Krass*, M. S., Zheng, L., Guha, N., Manning, C. D., Jurafsky, D., and Ho, D. E. Pile of Law: Learning responsible data filtering from the law and a 256GB open-source legal dataset, 2022. URL https://arxiv.org/abs/2207.00220.

Henderson, P., Li, X., Jurafsky, D., Hashimoto, T., Lemley, M. A., and Liang, P. Foundation models and fair use, 2023.

Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., and Steinhardt, J. Measuring Massive Multitask Language Understanding, January 2021. URL http://arxiv.org/abs/2009.03300. arXiv:2009.03300 [cs].

Huang, K., Altosaar, J., and Ranganath, R. ClinicalBERT: Modeling Clinical Notes and Predicting Hospital Readmission. 2019. URL http://arxiv.org/abs/1904.05342.

ImaniGooghari, A., Lin, P., Kargaran, A. H., Severini, S., Sabet, M. J., Kassner, N., Ma, C., Schmid, H., Martins, A. F. T., Yvon, F., and Schütze, H. Glot500: Scaling multilingual corpora and language models to 500 languages, 2023.

Johnson, A. E. W., Pollard, T. J., Shen, L., Lehman, L.-w. H., Feng, M., Ghassemi, M., Moody, B., Szolovits, P., Anthony Celi, L., and Mark, R. G. MIMIC-III, a freely accessible critical care database. Scientific Data, 3(1):160035, 2016. ISSN 2052-4463. doi: 10.1038/sdata.2016.35. URL https://doi.org/10.1038/sdata.2016.35.

Katz, D. M., Bommarito, M. J., Gao, S., and Arredondo, P. GPT-4 Passes the Bar Exam, March 2023. URL https://papers.ssrn.com/abstract=4389233.

Lage-Freitas, A., Allende-Cid, H., Santana, O., and Oliveira-Lage, L. Predicting Brazilian Court Decisions. PeerJ Computer Science, 8:e904, March 2022. ISSN 2376-5992. doi: 10.7717/peerj-cs.904. URL https://peerj.com/articles/cs-904. Publisher: PeerJ Inc.

Lee, J., Yoon, W., Kim, S., Kim, D., Kim, S., So, C. H., and Kang, J. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics, pp. btz682, September 2019. ISSN 1367-4803, 1460-2059. doi: 10.1093/bioinformatics/btz682. URL http://arxiv.org/abs/1901.08746.

Lee, J., Yoon, W., Kim, S., Kim, D., Kim, S., So, C. H., and Kang, J. BioBERT: A pre-trained biomedical language representation model for biomedical text mining. Bioinformatics, 36(4):1234–1240, 2020. ISSN 14602059. doi: 10.1093/bioinformatics/btz682.

Licari, D. and Comandè, G. ITALIAN-LEGAL-BERT: A Pre-trained Transformer Language Model for Italian Law. Technical report, 2022. URL http://ceur-ws.org.

Lippi, M., Pałka, P., Contissa, G., Lagioia, F., Micklitz, H.-W., Sartor, G., and Torroni, P. CLAUDETTE: an automated detector of potentially unfair clauses in online terms of service. Artificial Intelligence and Law, 27(2):117–139, 2019. ISSN 1572-8382. doi: 10.1007/s10506-019-09243-2. URL https://doi.org/10.1007/s10506-019-09243-2.

Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., and Stoyanov, V. RoBERTa: A robustly optimized BERT pretraining approach. 2019. URL http://arxiv.org/abs/1907.11692.

Luz de Araujo, P. H., Campos, T. E. d., de Oliveira, R. R., Stauffer, M., Couto, S., and Bermejo, P. LeNER-Br: a dataset for named entity recognition in Brazilian legal text. In International Conference on Computational Processing of the Portuguese Language, pp. 313–323. Springer, 2018. Dataset URL: https://huggingface.co/datasets/lener_br.

Masala, M., Iacob, R. C. A., Uban, A. S., Cidota, M., Velicu, H., Rebedea, T., and Popescu, M. jurBERT: A Romanian BERT model for legal judgement prediction. In Proceedings of the Natural Legal Language Processing Workshop 2021, pp. 86–94, Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.nllp-1.8. URL https://aclanthology.org/2021.nllp-1.8.

Merity, S., Xiong, C., Bradbury, J., and Socher, R. Pointer sentinel mixture models, 2016.

Muennighoff, N., Rush, A. M., Barak, B., Scao, T. L., Piktus, A., Tazi, N., Pyysalo, S., Wolf, T., and Raffel, C. Scaling Data-Constrained Language Models, May 2023. URL http://arxiv.org/abs/2305.16264. arXiv:2305.16264 [cs].

Niklaus, J. and Giofré, D. BudgetLongformer: Can we Cheaply Pretrain a SotA Legal Language Model From Scratch?, November 2022. URL http://arxiv.org/abs/2211.17135. arXiv:2211.17135 [cs].

Niklaus, J., Chalkidis, I., and Stürmer, M. Swiss-Judgment-Prediction: A Multilingual Legal Judgment Prediction Benchmark. In Proceedings of the Natural Legal Language Processing Workshop 2021, pp. 19–35, Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics. URL https://aclanthology.org/2021.nllp-1.3.

Niklaus, J., Stürmer, M., and Chalkidis, I. An Empirical Study on Cross-X Transfer for Legal Judgment Prediction. In Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 32–46, Online only, November 2022. Association for Computational Linguistics. URL https://aclanthology.org/2022.aacl-main.3.

Niklaus, J., Matoshi, V., Rani, P., Galassi, A., Stürmer, M., and Chalkidis, I. LEXTREME: A multi-lingual and multi-task benchmark for the legal domain, 2023. URL https://arxiv.org/abs/2301.13126.

OpenAI. GPT-4 Technical Report, March 2023. URL http://arxiv.org/abs/2303.08774. arXiv:2303.08774 [cs].

Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C. L., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., Schulman, J., Hilton, J., Kelton, F., Miller, L., Simens, M., Askell, A., Welinder, P., Christiano, P., Leike, J., and Lowe, R. Training language models to follow instructions with human feedback, 2022. URL https://arxiv.org/abs/2203.02155.

Pais, V., Mitrofan, M., Gasan, C. L., Coneschi, V., and Ianov, A. Named entity recognition in the Romanian legal domain. In Proceedings of the Natural Legal Language Processing Workshop 2021, pp. 9–18, Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.nllp-1.2. URL https://aclanthology.org/2021.nllp-1.2.

Papaloukas, C., Chalkidis, I., Athinaios, K., Pantazi, D.-A., and Koubarakis, M. Multi-granular legal topic classification on Greek legislation. arXiv preprint arXiv:2109.15298, 2021. Dataset URL: https://huggingface.co/datasets/greek_legal_code.

Pfeiffer, J., Vulić, I., Gurevych, I., and Ruder, S. UNKs everywhere: Adapting multilingual language models to new scripts. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 10186–10203, Online and Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.emnlp-main.800. URL https://aclanthology.org/2021.emnlp-main.800.

Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., and Sutskever, I. Language models are unsupervised multitask learners. 2019.

Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., and Liu, P. J. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67, 2020a. URL http://jmlr.org/papers/v21/20-074.html.

Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., and Liu, P. J. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. Journal of Machine Learning Research, 21(140):1–67, 2020b. ISSN 1533-7928. URL http://jmlr.org/papers/v21/20-074.html.

Sag, M. Copyright safety for generative AI. Forthcoming in the Houston Law Review, 2023. URL https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4438593.

Sanh, V., Debut, L., Chaumond, J., and Wolf, T. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv:1910.01108 [cs], February 2020. URL http://arxiv.org/abs/1910.01108.

Shoeybi, M., Patwary, M., Puri, R., LeGresley, P., Casper, J., and Catanzaro, B. Megatron-LM: Training multi-billion parameter language models using model parallelism, 2020.

Spaeth, H. J., Epstein, L., Martin, A. D., Segal, J. A., Ruger, T. J., and Benesh, S. C. Supreme Court Database, Version 2020 Release 01.

Taylor, R., Kardas, M., Cucurull, G., Scialom, T., Hartshorn, A., Saravia, E., Poulton, A., Kerkez, V., and Stojnic, R. Galactica: A Large Language Model for Science, November 2022. URL http://arxiv.org/abs/2211.09085. arXiv:2211.09085 [cs, stat].

Tiedemann, J. Finding alternative translations in a large corpus of movie subtitle. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pp. 3518–3522, Portorož, Slovenia, May 2016. European Language Resources Association (ELRA). URL https://aclanthology.org/L16-1559.

Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., Rodriguez, A., Joulin, A., Grave, E., and Lample, G. LLaMA: Open and Efficient Foundation Language Models. ArXiv, abs/2302.1, 2023.

Tuggener, D., von Däniken, P., Peetz, T., and Cieliebak, M. LEDGAR: A large-scale multi-label corpus for text classification of legal provisions in contracts. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pp. 1235–1241, Marseille, France, May 2020. European Language Resources Association. ISBN 979-10-95546-34-4. URL https://aclanthology.org/2020.lrec-1.155.

Tziafas, G., de Saint-Phalle, E., de Vries, W., Egger, C., and Caselli, T. A multilingual approach to identify and classify exceptional measures against COVID-19. In Proceedings of the Natural Legal Language Processing Workshop 2021, pp. 46–62, 2021. Dataset URL: https://tinyurl.com/ycysvtbm.

Urchs, S., Mitrović, J., and Granitzer, M. Design and Implementation of German Legal Decision Corpora. In Proceedings of the 13th International Conference on Agents and Artificial Intelligence, pp. 515–521, Online Streaming, 2021. SCITEPRESS - Science and Technology Publications. ISBN 978-989-758-484-8. doi: 10.5220/0010187305150521. URL https://www.scitepress.org/DigitalLibrary/Link.aspx?doi=10.5220/0010187305150521.

Wang, W., Wei, F., Dong, L., Bao, H., Yang, N., and Zhou, M. MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers. In Advances in Neural Information Processing Systems, volume 33, pp. 5776–5788. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper/2020/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html.

Wei, J., Bosma, M., Zhao, V. Y., Guu, K., Yu, A. W., Lester, B., Du, N., Dai, A. M., and Le, Q. V. Finetuned language models are zero-shot learners. CoRR, abs/2109.01652, 2021. URL https://arxiv.org/abs/2109.01652.

Xue, L., Constant, N., Roberts, A., Kale, M., Al-Rfou, R., Siddhant, A., Barua, A., and Raffel, C. mT5: A massively multilingual pre-trained text-to-text transformer. arXiv:2010.11934 [cs], March 2021. URL http://arxiv.org/abs/2010.11934.

Zellers, R., Holtzman, A., Rashkin, H., Bisk, Y., Farhadi, A., Roesner, F., and Choi, Y. Defending against Neural Fake News. Curran Associates Inc., Red Hook, NY, USA, 2019.

Zheng, L., Guha, N., Anderson, B. R., Henderson, P., and Ho, D. E. When Does Pretraining Help? Assessing Self-Supervised Learning for Law and the CaseHOLD Dataset of 53,000+ Legal Holdings. In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Law, ICAIL '21, pp. 159–168, New York, NY, USA, 2021. Association for Computing Machinery. ISBN 9781450385268. doi: 10.1145/3462757.3466088. URL https://doi.org/10.1145/3462757.3466088.

Zhu, Y., Kiros, R., Zemel, R., Salakhutdinov, R., Urtasun, R., Torralba, A., and Fidler, S. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books, 2015.
A. Training Details
B. Hyperparameter Details
Source Dataset Task Task type Hierarchical Seeds Lower case Batch size Metric for best model Evaluation strategy Epochs Early stopping patience Learning rate
(Niklaus et al., 2023) GLN GLN NER False 1,2,3 True 64 evaluation loss epoch 50 5 1e-5
(Niklaus et al., 2023) LNR LNR NER False 1,2,3 True 64 evaluation loss epoch 50 5 1e-5
(Niklaus et al., 2023) LNB LNB NER False 1,2,3 True 64 evaluation loss epoch 50 5 1e-5
(Niklaus et al., 2023) MAP MAP-F NER False 1,2,3 True 64 evaluation loss epoch 50 5 1e-5
(Niklaus et al., 2023) MAP MAP-C NER False 1,2,3 True 64 evaluation loss epoch 50 5 1e-5
(Niklaus et al., 2023) BCD BCD-J SLTC True 1,2,3 True 64 evaluation loss epoch 50 5 1e-5
(Niklaus et al., 2023) BCD BCD-U SLTC True 1,2,3 True 64 evaluation loss epoch 50 5 1e-5
(Niklaus et al., 2023) GAM GAM SLTC False 1,2,3 True 64 evaluation loss epoch 50 5 1e-5
(Niklaus et al., 2023) GLC GLC-C SLTC True 1,2,3 True 64 evaluation loss epoch 50 5 1e-5
(Niklaus et al., 2023) GLC GLC-S SLTC True 1,2,3 True 64 evaluation loss epoch 50 5 1e-5
(Niklaus et al., 2023) GLC GLC-V SLTC True 1,2,3 True 64 evaluation loss epoch 50 5 1e-5
(Niklaus et al., 2023) SJP SJP SLTC True 1,2,3 True 64 evaluation loss epoch 50 5 1e-5
(Niklaus et al., 2023) OTS OTS-UL SLTC False 1,2,3 True 64 evaluation loss epoch 50 5 1e-5
(Niklaus et al., 2023) OTS OTS-CT MLTC False 1,2,3 True 64 evaluation loss epoch 50 5 1e-5
(Niklaus et al., 2023) C19 C19 MLTC False 1,2,3 True 64 evaluation loss epoch 50 5 1e-5
(Niklaus et al., 2023) MEU MEU-1 MLTC True 1,2,3 True 64 evaluation loss 5 1e-5
(Niklaus et al., 2023) MEU MEU-2 MLTC True 1,2,3 True 64 evaluation loss 5 1e-5
(Niklaus et al., 2023) MEU MEU-3 MLTC True 1,2,3 True 64 evaluation loss 5 1e-5
(Chalkidis et al., 2021d) ECtHR ECtHR-A MLTC True 1,2,3,4,5 True 8 micro-f1 epoch 20 3 3e-5
(Chalkidis et al., 2021d) ECtHR ECtHR-B MLTC True 1,2,3,4,5 True 8 micro-f1 epoch 20 3 3e-5
(Chalkidis et al., 2021d) EUR-LEX EUR-LEX MLTC False 1,2,3,4,5 True 8 micro-f1 epoch 20 3 3e-5
(Chalkidis et al., 2021d) SCOTUS SCOTUS SLTC True 1,2,3,4,5 True 8 micro-f1 epoch 20 3 3e-5
(Chalkidis et al., 2021d) LEDGAR LEDGAR SLTC False 1,2,3,4,5 True 8 micro-f1 epoch 20 3 3e-5
(Chalkidis et al., 2021d) UnfairToS UnfairToS MLTC False 1,2,3,4,5 True 8 micro-f1 epoch 20 3 3e-5
(Chalkidis et al., 2021d) CaseHOLD CaseHOLD MCQA False 1,2,3,4,5 True 8 micro-f1 epoch 20 3 3e-5
Table 8. Hyperparameters for each dataset and task. There were a few exceptions. For the multilingual MEU tasks, given the dataset's size, we trained multilingual models for only 1 epoch, evaluating every 1000 steps. When using monolingual models, we trained for 50 epochs with an epoch-based evaluation strategy, since we used only the language-specific subset of the dataset. For LexGLUE, we followed the guidelines of Chalkidis et al. (2021d) for RoBERTa-based large language models, which require a maximum learning rate of 1e-5, a warm-up ratio of 0.1, and a weight decay rate of 0.06.
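The settings in Table 8 fall into two configuration families, one per benchmark. The sketch below summarizes the table as plain dictionaries; the key names mirror common trainer arguments and are our own naming, not the authors' actual training scripts.

```python
# Fine-tuning configurations summarizing Table 8 (illustrative naming).
LEXTREME_CONFIG = {
    "seeds": [1, 2, 3],
    "lower_case": True,
    "batch_size": 64,
    "metric_for_best_model": "evaluation loss",
    "evaluation_strategy": "epoch",  # MEU with multilingual models: 1 epoch, eval every 1000 steps
    "epochs": 50,
    "early_stopping_patience": 5,
    "learning_rate": 1e-5,
}

LEXGLUE_CONFIG = {
    "seeds": [1, 2, 3, 4, 5],
    "lower_case": True,
    "batch_size": 8,
    "metric_for_best_model": "micro-f1",
    "evaluation_strategy": "epoch",
    "epochs": 20,
    "early_stopping_patience": 3,
    "learning_rate": 3e-5,  # RoBERTa-based large models: max 1e-5 per Chalkidis et al. (2021d)
}

def config_for(benchmark: str) -> dict:
    """Return the default fine-tuning configuration for a benchmark family."""
    return {"lextreme": LEXTREME_CONFIG, "lexglue": LEXGLUE_CONFIG}[benchmark]
```

Keeping the per-benchmark defaults in one place makes the documented exceptions (MEU evaluation strategy, large-model learning rate) easy to apply as overrides.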
C. Dataset Details
Language Text Type Words Documents Words per Document Jurisdiction Source License/Copyright
Native Multi Legal Pile
bg legislation 309M 262k 1178 Bulgaria MARCELL CC0-1.0
Czechia CzCDC Constitutional Court CC BY-NC 4.0
cs caselaw 571M 342k 1667 Czechia CzCDC Supreme Administrative Court CC BY-NC 4.0
Czechia CzCDC Supreme Court CC BY-NC 4.0
da caselaw 211M 92k 2275 Denmark DDSC CC BY 4.0 and other, depending on the dataset
da legislation 653M 296k 2201 Denmark DDSC CC BY 4.0 and other, depending on the dataset
de caselaw 1786M 614k 2905 Germany openlegaldata ODbL-1.0
Switzerland entscheidsuche similar to CC BY
de legislation 513M 302k 1698 Germany openlegaldata ODbL-1.0
Switzerland lexfind not protected by copyright law
en legislation 2539M 713k 3557 Switzerland lexfind not protected by copyright law
UK uk-lex CC BY 4.0
fr caselaw 1172M 495k 2363 Belgium jurportal not protected by copyright law
France CASS Open Licence 2.0
Luxembourg judoc not protected by copyright law
Switzerland entscheidsuche similar to CC BY
fr legislation 600M 253k 2365 Switzerland lexfind not protected by copyright law
Belgium ejustice not protected by copyright law
hu legislation 265M 259k 1019 Hungary MARCELL CC0-1.0
it caselaw 407M 159k 2554 Switzerland entscheidsuche similar to CC BY
it legislation 543M 238k 2278 Switzerland lexfind not protected by copyright law
nl legislation 551M 243k 2263 Belgium ejustice not protected by copyright law
pl legislation 299M 260k 1148 Poland MARCELL CC0-1.0
pt caselaw 12613M 17M 728 Brazil RulingBR not protected by copyright law
Brazil CRETA CC BY-NC-SA 4.0
Brazil CJPG not protected by copyright law
ro legislation 559M 396k 1410 Romania MARCELL CC0-1.0
sk legislation 280M 246k 1137 Slovakia MARCELL CC0-1.0
sl legislation 366M 257k 1418 Slovenia MARCELL CC-BY-4.0
total 24236M 23M 1065 Native Multi Legal Pile
Overall statistics for the remaining subsets
total 12107M 8M 1457 EU Eurlex Resources CC BY 4.0
total 43376M 18M 2454 US (99%), Canada, and EU Pile of Law CC BY-NC-SA 4.0; See Henderson* et al. (2022) for details
total 28599M 10M 2454 Legal mC4 ODC-BY
Table 9. Words, documents, and average words per document for the Native Multi Legal Pile, broken down by language and text type. For the remaining subsets of Multi Legal Pile we provide aggregate statistics.
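The "Words per Document" column in Table 9 is the ratio of the words and documents columns. A quick sanity check, noting that the table's inputs are themselves rounded (309M, 262k), so recomputed ratios can differ slightly from the printed figures:

```python
def words_per_document(words: int, documents: int) -> int:
    """Average document length, as in the statistics columns of Table 9."""
    return round(words / documents)

# Bulgarian legislation row: 309M words over 262k documents.
# The table prints 1178; the rounded inputs give a nearby value.
print(words_per_document(309_000_000, 262_000))
```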