Natural Language Processing for the Legal Domain: A Survey of Tasks, Datasets, Models, and Challenges
the unique aspects and challenges of processing legal texts, such as extensive document length, complex language, and
limited open legal datasets. We provide an overview of Natural Language Processing tasks specific to legal text, such as Legal
Document Summarization, legal Named Entity Recognition, Legal Question Answering, Legal Text Classification, and Legal
Judgment Prediction. In the section on legal Language Models, we analyze both developed Language Models and approaches
for adapting general Language Models to the legal domain. Additionally, we identify 15 Open Research Challenges, including
bias in Artificial Intelligence applications, the need for more robust and interpretable models, and improving explainability to
handle the complexities of legal language and reasoning.
CCS Concepts: • General and reference → Surveys and overviews; • Applied computing → Law; • Computing
methodologies → Natural language processing; Natural language generation; Knowledge representation and reasoning;
Supervised learning; Unsupervised learning; Reinforcement learning; Multi-task learning; Machine learning approaches;
Artificial intelligence; • Information systems → Language models; Retrieval tasks and goals; Question answering;
Clustering and classification; Summarization.
ACM Reference Format:
Farid Ariai and Gianluca Demartini. 2024. Natural Language Processing for the Legal Domain: A Survey of Tasks, Datasets,
Models, and Challenges. ACM Comput. Surv. 1, 1 (October 2024), 35 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn
1 INTRODUCTION
In recent years, advancements in Natural Language Processing (NLP) have significantly impacted the legal domain
by simplifying complex tasks such as Legal Document Summarization (LDS), enhancing legal text comprehension
for laypersons, and improving Legal Question Answering (LQA) and Legal Judgment Prediction (LJP) [24, 42, 50,
52, 63, 93, 98]. These improvements are primarily attributed to advancements in neural network architectures,
such as transformer models [118]. NLP techniques now enable machines to generate text, answer legal questions,
draft regulations, and simulate legal reasoning, which can revolutionize legal practices [50]. Applications like
contract review [45, 76, 77, 117] and case prediction [85, 120] have been automated to a large extent, speeding
up processes, reducing human error, and cutting operational costs. Additionally, the use of NLP allows lawyers
and legal professionals to reduce their workload, enhance efficiency, and minimize errors in decision-making
processes [98]. Despite the rapid development of NLP, challenges remain in processing lengthy documents,
Authors’ Contact Information: Farid Ariai, [email protected]; Gianluca Demartini, [email protected], The University of Queensland,
Brisbane, Queensland, Australia.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that
copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page.
Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy
otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from
[email protected].
© 2024 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ACM 1557-7341/2024/10-ART
https://doi.org/10.1145/nnnnnnn.nnnnnnn
ACM Comput. Surv., Vol. 1, No. 1, Article . Publication date: October 2024.
2 • Ariai and Demartini
understanding complex language, and navigating complicated document structures [12, 39, 52, 84, 107, 120, 132],
yet Large Language Models (LLMs) promise to enhance the efficiency, accessibility, and precision of legal
services.
Despite these advantages, the integration of NLP in the legal domain is not without challenges, such as bias,
unfairness, and explainability issues [28, 103, 113]. The use of Artificial Intelligence (AI) in legal applications
must follow strict standards of accuracy, fairness, and transparency, given the potential impact on clients’ lives
and rights.
This survey article explores the current landscape of NLP applications within the legal domain and discusses
their potential benefits and practical challenges. NLP is a broad field covering a wide range of techniques
for processing, analyzing, and understanding human language. By examining the latest advancements and
applications of NLP in law, this article provides a comprehensive overview of the field. Figure 1 summarizes the
scope of the survey and categorizes the research into several areas: LQA, LJP, Legal Text Classification (LTC),
LDS, legal Named Entity Recognition (NER), legal corpora, and legal Language Models (LMs). Each category lists
relevant projects and papers and shows the work being done in each sub-field. Notably, there is comparatively
less research in NER and legal corpora, whereas LDS and LQA have seen extensive research activity, with a
substantial number of datasets and research contributions. This summary provides an overview of how NLP
techniques are applied to various challenges in the legal domain and offers insights into future directions of AI in
legal practice.
To provide a comprehensive understanding of the existing research on integrating AI within the legal domain,
we present an overview of recent literature reviews, as summarized in Table 1. Most survey papers on intelligent
legal systems focus either on traditional NLP technologies for specific tasks like LJP and LDS or take a broader
approach but still miss certain aspects. As illustrated in Table 1, there seems to be a gap in survey papers that
thoroughly examine all facets of this multidisciplinary field. Our current work aims to bridge this gap by offering
a comprehensive survey of all NLP tasks, existing datasets and corpora, and LMs in the legal domain. We use ✓
to indicate papers that cover most of the existing research on each subject in legal NLP. Papers that do not
address a subject receive ✗, and those that partially cover specific subjects along with their datasets or LMs are
marked with –.
The main difference between this work and previous surveys is that this survey aims to provide a more general
description of all aspects of NLP tasks in the legal domain, rather than focusing solely on specific applications.
The main contributions of this survey are summarized as follows: (1) This article extends beyond previous
surveys by examining a broad spectrum of studies and applications of legal NLP. By discussing datasets and
large corpora in 24 languages and exploring the popular legal LMs, this survey establishes itself as an important
resource in the field of legal NLP; (2) The survey offers an in-depth look at the challenges of integrating NLP with
legal applications, with detailed discussions on technical solutions that tackle these issues, thereby enhancing
understanding and encouraging further research in this evolving field; (3) This survey also highlights the existing
research gaps in legal NLP, identifying areas that require further exploration and development, and providing a
road-map for future research efforts in the legal NLP domain.
This document is organized as follows. In Section 2, Background and Foundational Concepts, we provide a
detailed overview of legal language and the basic principles of NLP as they apply to the legal domain. In Section
3, we briefly explain the research methodology of this work and how we extracted the resources. In Section 4, we
explore various NLP tasks that are tailored for legal applications and show their unique requirements and the
methodologies employed to address them. These specialized tasks leverage advanced NLP techniques to process,
analyze, and extract meaningful information from legal texts, thereby facilitating more efficient and accurate legal
research and decision-making. Additionally, we delve into the datasets available for training and evaluating legal
NLP tasks, emphasizing their characteristics and the implications they have for model performance. Following this,
in Section 5, we explore the development of LMs that have been specifically adapted for the legal field. Finally, in
Natural Language Processing for the Legal Domain: A Survey of Tasks, Datasets, Models, and Challenges • 3
Table 1. Comparison of existing surveys with this work, which shows the covered topics of each survey.
Columns: NLP Tasks in Legal | Legal Dataset | Legal LM | Large Legal Corpus | Year
Dias et al. [30] – – – ✗ 2022 In this work, the researchers did not concentrate solely on the legal domain. Instead, they elaborated
on foundational concepts of AI and NLP. They briefly explored the applications of NLP within
the legal field but did not delve deeply into legal datasets, specific NLP tasks in the legal domain,
or Legal LLMs. In contrast, our work provides a comprehensive analysis of all NLP tasks within
the legal domain, including LQA, LDS, and LTC. Additionally, our study covers large legal corpora
and thoroughly examines the datasets available for each legal NLP task, areas that were not
fully addressed in this work.
Sun [112] ✓ – ✓ ✗ 2023 Sun explored a limited number of research projects at the intersection of LLMs and the legal domain.
The study focused on two key NLP tasks within this field: LJP and statutory reasoning. Additionally, it
examined three datasets and two LLMs relevant to this domain. The difference is that our
work provides a comprehensive overview of all NLP tasks in the legal domain, as well as large
legal corpora and datasets for each task, which were not fully covered in Sun’s study.
Cui et al. [26] ✓ – ✓ ✗ 2023 This paper comprehensively reviewed the LJP task. The authors analyzed 43 LJP datasets in 9
different languages. They summarized 16 evaluation metrics used to evaluate three NLP tasks (text
classification, text generation, and text regression) in LJP. For LMs, they explored existing
Pre-trained Language Models (PLMs). Unlike this paper, our work provides comprehensive
coverage of all NLP tasks in the legal domain and includes an exploration of large legal corpora.
Anh et al. [4] ✗ ✓ ✓ ✗ 2023 This survey gives a short explanation of the challenges in legal language and how LLMs try to
overcome them. The authors then summarized six NLP tasks in the legal domain that
can be addressed by LLMs. In terms of ethical and legal considerations, they discussed ‘bias and
fairness’, ‘privacy and confidentiality’, ‘intellectual property’, ‘explainability and transparency’,
and ‘responsible use’. The difference is that our work focuses on existing methods for all Legal
NLP tasks, including LJP and LDS, available datasets for each task, and large legal corpora for
pre-training and fine-tuning, whereas they concentrated on ethical and legal considerations
and the impact of LLMs on NLP in legal texts.
Ganguly et al. [39] – – ✓ ✗ 2023 Ganguly and colleagues presented a comprehensive review at ECIR 2023, discussing several
key areas including the processing challenges of legal text, such as NER and sentence boundary
detection. They traced the historical evolution of AI and law research from the 1980s, highlighted
recent developments in NLP and IR techniques with a focus on the architectures of PLMs, and
conducted a detailed survey of current issues and advancements in legal IR and NLP tasks
like LDS and LJP. The review also included perspectives from the industry. In contrast to their
survey, we also explore LQA and LTC tasks, which they did not cover, and examine large legal
corpora, an area they did not address.
Chen et al. [23] ✓ ✓ ✓ ✗ 2024 This survey focuses on LLMs in three distinct fields: finance, healthcare, and law. Although
it attempts to cover all aspects of the legal domain and LLMs, the broad scope of addressing
three expansive topics has resulted in a less thorough examination of many specific research
cases within the legal field. In addition, it did not cover the large legal corpora that are used for
pre-training and fine-tuning purposes. Also, unlike that survey, we explore methods for improving the efficiency
of LLMs in the legal domain.
Krasadakis et al. [58] – – – – 2024 This survey focuses on challenges and advances in selected NLP tasks, such as NER and
Relation Extraction. Unlike that survey, our study covers all legal NLP tasks, along with their
corresponding datasets and large legal corpora. Additionally, we examine existing LLMs tailored
for the legal domain.
Legal NER: Dozier et al. [32], Păis et al. [94], Smădu et al. [105], Au et al. [7], Kalamkar et al. [54], Leitner et al. [61]
LDS: Farzindar [36], Gelbart and Smith [40, 41], Moens et al. [80], Polsley et al. [92], Schraagen et al. [99], Zhong et al. [140], Jain et al. [49], Moro et al. [82], Zhong and Litman [141], Liu et al. [68], Shen et al. [102]
LTC: Elnaggar et al. [34], Lee and Lee [60], Wei et al. [122], Bambroo and Awasthi [8], Song et al. [106], Fragkogiannis et al. [38], Mamooler et al. [76], Wang et al. [121], Chalkidis et al. [16, 17], Tuggener et al. [117], Graham et al. [45], Nguyen et al. [83], Papaloukas et al. [90]
LQA: Huang et al. [48], Khazaeli et al. [57], Sovrano et al. [109], Askari et al. [5], Zhong et al. [139], Louis et al. [72], Zhang et al. [134], Askari et al. [6], Sovrano et al. [108], Yuan et al. [131], Büttner and Habernal [14], Chen et al. [21], Sovrano et al. [107]
Fig. 1. An overview of the research areas in Legal NLP and the key publications in the survey.
Section 6, we address the key challenges associated with deploying NLP technologies in legal settings, discussing
both current issues and potential solutions. Since this survey contains many acronyms, Table 2 provides the list
of acronyms and their meanings to make it easier to follow.
to the functioning of the legal system. They include court filings, judgments, legislation, treaties, contracts,
and legal correspondence, each serving a specific purpose and adhering to specific formatting and content
standards that reflect legal logic and hierarchy. Legal documents are fundamental tools for lawyers, judges,
and legal scholars, facilitating case analysis, legislative review, and contract drafting. They are also essential
in legal education and practice and provide the basis for legal arguments and decisions. Common examples of
legal documents include case law repositories, statutory databases, and collections of legal agreements. These
documents are utilized in various legal processes such as drafting legal arguments, performing legal analysis, and
ensuring regulatory compliance.
2.1.1 Legal language and its characteristics. Legal language is characterized by its unique features that set it
apart from everyday language, primarily due to its function in the legal system. One prominent feature is its
formality, where legal language often employs a more formal vocabulary and syntax to ensure precision and
avoid ambiguity [43]. This formality is critical, as the precise meaning of terms can have significant legal effects.
Legal texts also typically utilize passive constructions and complex sentence structures to provide detailed and
comprehensive descriptions [43]. These constructions help clarify responsibilities and outcomes without directly
attributing actions or intentions to specific parties.
Another distinctive aspect of legal language is its reliance on specialized words and phrases. This includes
terms that have specific meanings within legal contexts, archaic words that are not commonly used in everyday
language, and standardized phrases that have been historically embedded in legal tradition [43]. This can make
legal documents less accessible to non-specialists, requiring legal professionals to interpret the content accurately.
Furthermore, legal language is heavily intertextual, meaning it frequently references other legal texts, such
as statutes, regulations, and case law. This characteristic ensures that legal arguments are grounded in and
supported by existing legal frameworks and previous cases. The dense use of citations and references in legal
documents not only supports the arguments made but also connects the document to a broader legal discourse.
Such intertextuality demands that legal professionals not only understand the texts themselves but also the
Fig. 2. A sample page from the Code of Federal Regulations, illustrating the structured and referenced nature of legal
documents.
broader legal context in which they operate. To illustrate the intertextuality of legal language, Figure 2 shows
a sample page from the Code of Federal Regulations of the United States, extracted from the WestLaw [124]
website. Figure 2 displays § 40.51, Labor Certification, from 22 C.F.R. § 40.51, which is part of the Code of Federal
Regulations of the United States. This section falls under Title 22, governing regulations related to foreign
relations, specifically detailing the requirements and procedures for labor certification. Notably, the text includes
highlighted references to other legal sources, such as INA 212(a)(5). This citation refers to the Immigration and
Nationality Act, specifically section 212, which outlines various conditions for inadmissibility into the United
States, under subsection (a), paragraph (5). In the “Credits” section, the reference “56 FR 30422, July
2, 1991,” points to a publication in the Federal Register. Here, “56 FR” indicates the volume number, and
“30422” is the page number where the document begins. The date “July 2, 1991,” marks the publication date in
the Federal Register. Additionally, a sentence in subsection (b), paragraph (1) consists of 68 words, showing the
length and complexity typical of legal texts.
The specific characteristics of legal language, such as its formal vocabulary, complex syntax, and extensive
use of references, present many challenges for NLP. Title disambiguation and nested entities are further issues in
legal contexts [58]. Disambiguating a title such as ‘the President of the USA’ requires precise identification based
on contextual details like time and location. Nested entities, where titles of legislative articles refer to other
laws, introduce further complexity. Moreover, legal documents are frequently provided in non-machine-readable
PDF formats, complicating data extraction and processing. These challenges highlight the need for
advanced and specialized NLP solutions tailored to the legal domain.
• Transformers: Transformers [118] are a neural network architecture designed to convert input sequences
into output sequences by understanding the context and relationships among the elements of the sequence.
For instance, given the input “What is the color of the sky?”, a transformer model internally processes
and identifies connections among the words ‘color’, ‘sky’, and ‘blue’. This understanding enables it to
produce the output: “The sky is blue” [3]. Transformer models enhance this process through a self-attention
mechanism. This mechanism allows the model to analyze different parts of the sequence simultaneously
instead of sequentially, helping it identify which parts are most significant.
• PLMs: PLMs are trained on extensive corpora in a self-supervised manner, which involves tasks like
recovering incomplete input sentences or auto-regressive language modeling. These models, such as Bidirectional
Encoder Representations from Transformers (BERT) [29] and RoBERTa [70], are initially trained on large-
scale general text datasets. After pre-training, they can be fine-tuned for specific downstream tasks in the
legal domain, adapting them to comprehend and process legal language for applications like document
classification and information extraction.
• Question Answering (QA): A QA system is a type of NLP solution designed to answer questions posed in
natural language. These systems take a user’s query and, by extracting relevant information from a dataset,
provide an informative response.
• NER: NER is the task of identifying mentions of specific entities within a text that belong to predefined
categories such as persons, locations, organizations, and more. NER is a fundamental component for many
NLP applications, including question answering, text summarization, and machine translation [64].
• Information Retrieval (IR): IR involves the process of obtaining relevant information from large
collections of unstructured legal texts, such as case laws, statutes, contracts, and regulations. The goal of IR
is to provide users with the most relevant documents and data in response to specific queries.
• Multi-task Learning (MTL): MTL is an approach in ML where a model is trained on multiple related
tasks simultaneously or utilizes auxiliary tasks to enhance performance on a specific task. By learning
from diverse tasks, MTL enables models to capture generalized and complementary knowledge, improving
robustness and addressing data scarcity, particularly for low-resource tasks. MTL’s ability to share implicit
knowledge across tasks often leads to performance gains and more efficient models, making it a valuable
strategy for building robust and adaptable systems in NLP and other domains [22].
• Parameter-Efficient Fine-Tuning (PEFT): PEFT is a method for adapting PLMs that involves freezing
the majority of the model’s parameters and only updating a small subset. This approach significantly
reduces the computational resources and time required for fine-tuning, making it particularly effective in
resource-limited scenarios, while still achieving competitive performance in tasks like text generation [65].
• Retrieval-Augmented Generation (RAG): RAG is an advanced AI framework that enhances text generation
by merging traditional information retrieval systems with the generative power of LLMs. This integration
allows the AI to access additional knowledge sources while utilizing its advanced language capabilities.
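To make the self-attention mechanism mentioned in the Transformers entry concrete, the following is a minimal NumPy sketch of single-head scaled dot-product attention. The sequence length, dimensions, and random (untrained) projection matrices are purely illustrative, not part of any model discussed above.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention.

    X: (seq_len, d_model) token embeddings.
    Wq, Wk, Wv: (d_model, d_k) query/key/value projection matrices.
    Returns the context vectors and the attention weight matrix.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])         # pairwise token relevance
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over all positions
    return weights @ V, weights

# Toy sequence: 3 tokens, d_model = 4, d_k = 2, random weights.
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))
Wq, Wk, Wv = (rng.normal(size=(4, 2)) for _ in range(3))
context, weights = self_attention(X, Wq, Wk, Wv)
```

Each row of `weights` shows how strongly one token attends to every position in the sequence simultaneously; stacking many such heads and layers yields the architecture of [118].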
2.2.4 Key Publications and Conferences in Legal NLP. This section highlights the key journals, conferences, and
workshops that serve as platforms for sharing advancements and insights in the intersection of NLP and the
legal domain. These resources provide valuable opportunities for researchers to engage with cutting-edge research
in Legal NLP.
Several leading journals focus on the intersection of AI, NLP, and the legal domain. “Artificial Intelligence and
Law”, published by Springer, is a leading journal that features research articles on legal reasoning, legal IR, and
legal knowledge representation. Additionally, the “Journal of Law and Information Technology”, published by
Taylor & Francis, focuses on the application of information technology in law, including research on AI.
Conferences significantly advance research and promote collaboration in Legal NLP. The International Confer-
ence on Artificial Intelligence and Law (ICAIL) is a biennial event that showcases advances in AI applications in
the legal domain, including NLP and ML. The Conference on Legal Knowledge and Information Systems (JURIX)
is an annual event that focuses on legal informatics and NLP technologies.
In Legal NLP, notable work is also presented in workshops. The Workshop on Automated
Semantic Analysis of Information in Legal Texts delves into NLP and semantic analysis of legal texts. The Natural
Legal Language Processing (NLLP) Workshop provides a platform for discussing NLP technologies tailored for
legal texts and is often part of major NLP conferences. The EXplainable AI in Law (XAILA) Workshop focuses on
the explainability of AI systems in legal contexts, aiming to improve transparency and trust in AI applications. The
Competition on Legal Information Extraction/Entailment (COLIEE) is an annual event that challenges participants
to develop innovative solutions for legal information extraction and entailment tasks.
3 METHODOLOGY
This survey follows the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA)
framework [89]. It ensures a transparent and comprehensive assessment of research on NLP tasks within the legal
sector.
Huang et al. [48] introduce the Artificial Intelligence Law Assistant, the first Chinese LQA system that integrates
a legal knowledge graph (KG) to enhance query comprehension and answer ranking. The system collects a
large-scale QA corpus from an online legal forum and constructs a legal KG with over 42,000 legal concepts.
It employs a knowledge-enhanced interactive attention network using Bi-LSTM and co-attention mechanisms
to enrich semantic representations of QA pairs with legal domain knowledge. Additionally, it provides visual
explanations for selected answers, offering users a clear understanding of the QA process.
An example of an answer retrieval system specific to Private International Law is proposed by Sovrano et al.
[109]. This system integrates Term Frequency Inverse Document Frequency (TF-IDF) with deep LMs to retrieve
relevant answers from an automatically generated KG of contextualized grammatical sub-trees. The KG aligns
with a legal ontology based on Ontology Design Patterns, such as agent, role, event, temporal parameter, and
action, to reflect the legal significance of the relationships within and between provisions.
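As a simplified illustration of the sparse component in such retrievers (a toy sketch, not the authors' system: the corpus, query, and tokenization below are hypothetical), a TF-IDF retriever over a few invented provisions can be written as:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """TF-IDF vectors for a tiny corpus of (hypothetical) legal provisions."""
    tokenized = [d.lower().split() for d in docs]
    df = Counter(t for toks in tokenized for t in set(toks))       # document frequency
    idf = {t: math.log(len(docs) / df[t]) for t in df}
    vecs = [{t: tf * idf[t] for t, tf in Counter(toks).items()} for toks in tokenized]
    return vecs, idf

def cosine(a, b):
    """Cosine similarity between two sparse term-weight dictionaries."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

provisions = [
    "the contract is governed by the law chosen by the parties",
    "an agent acts on behalf of the principal",
    "the court of the place of domicile has jurisdiction",
]
vecs, idf = tfidf_vectors(provisions)
q = Counter("which law governs the contract".lower().split())
qvec = {t: tf * idf.get(t, 0.0) for t, tf in q.items()}
best = max(range(len(provisions)), key=lambda i: cosine(qvec, vecs[i]))
# `best` indexes the provision about the law governing the contract
```

Note how the word "the", which occurs in every provision, receives zero IDF weight and so does not influence the ranking; a deep LM, as in the system above, would additionally capture semantic matches such as "governs"/"governed" that this purely lexical sketch misses.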
Khazaeli et al. [57] develop an IR-based QA system tailored to the legal domain, combining sparse vector search
(BM25) and dense vector techniques (semantic embeddings) as input to a BERT-based [29] answer re-ranking
system. The system utilizes Legal GloVe and Legal Siamese BERT embeddings to enhance retrieval effectiveness.
An answer finder component computes the probability of a passage answering the question using a BERT
sequence binary classifier fine-tuned on question-answer pairs, improving the model’s ability to discriminate
good answers.
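A minimal sketch of fusing sparse and dense candidate scores of the kind this first stage produces: the min-max normalisation, the mixing weight, and the score values are all hypothetical simplifications (the actual system feeds candidates to a learned BERT re-ranker rather than a fixed linear fusion).

```python
def hybrid_rank(sparse_scores, dense_scores, alpha=0.5):
    """Rank candidate passages by a convex combination of min-max-normalised
    sparse (lexical, e.g. BM25) and dense (embedding-similarity) scores.
    `alpha` is an illustrative mixing weight, not a value from the paper."""
    def norm(xs):
        lo, hi = min(xs), max(xs)
        return [(x - lo) / (hi - lo) if hi > lo else 0.0 for x in xs]
    s, d = norm(sparse_scores), norm(dense_scores)
    fused = [alpha * si + (1 - alpha) * di for si, di in zip(s, d)]
    # indices of passages, best fused score first
    return sorted(range(len(fused)), key=lambda i: -fused[i])

# Passage 2 scores highest lexically, passage 0 highest semantically;
# the fused ranking balances the two signals.
order = hybrid_rank([1.2, 0.4, 3.1], [0.92, 0.10, 0.35])
```

Normalisation matters here because raw BM25 scores and cosine similarities live on incompatible scales; without it, one signal would dominate the combination.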
Li et al. [66] introduce a retrieving-then-answering framework featuring the Graph-Based Evidence Retrieval
and Aggregation Network (GESAN) to enhance QA on the Judicial Examination of Chinese Question Answering
(JEC-QA) dataset [139]. The framework leverages relevant legal knowledge by predicting question topics and
retrieving legal paragraphs using BM25. GESAN aggregates the evidence and processes it along with the question
and options to make accurate predictions, demonstrating improved accuracy and reasoning abilities in LQA.
Askari et al. [5] propose a method for generating query-dependent textual profiles for lawyers to improve legal
expert finding on QA platforms. Using data from the Avvo1 QA forum, they focus on aspects such as sentiment,
comments, and recency to create profiles. These profiles are fine-tuned using BERT models [29], and the final
aggregated score is calculated as a linear combination of the profile-trained BERT models’ scores. This approach
improves retrieval performance in the legal expert finding task.
Zhang et al. [134] propose a generation-based method for LQA, modeling it as a generation task to produce
new, relevant answers tailored to each question. The system incorporates laws as external knowledge into the
answer generation process, using a retriever to fetch applicable law articles and a generator to create answers
using this knowledge. Both components are integrated into a single T5 [96] model using MTL. This integration
enhances the model’s understanding and generation abilities while ensuring answers are accurate and informative.
Louis et al. [72] present an end-to-end methodology for generating long-form answers to statutory law
questions using a “retrieve-then-read” pipeline. The approach involves a retriever component that uses a bi-
encoder model to fetch relevant legal provisions, followed by a generator that formulates comprehensive answers
based on these provisions. The generator, an autoregressive LLM based on the Transformer architecture, employs
in-context learning and PEFT to generate detailed answers. The model’s interpretability is enhanced by an
extractive rationale generation strategy, ensuring responses are accompanied by verifiable justifications.
Sovrano et al. [108] propose DiscoLQA, a discourse-based LQA system that focuses on important discourse
elements like Elementary Discourse Units (EDUs) and Abstract Meaning Representations (AMRs). This approach
helps the answer retriever identify the most relevant parts of the discourse, enhancing retrieval accuracy. They
introduce the Q4EU dataset, containing over 70 questions and 200 answers on six European norms, demonstrating
improved performance in legal QA even without domain-specific training.
1 https://www.avvo.com
Yuan et al. [131] present a three-step approach to bridge the legal knowledge gap by creating CLIC-pages—snippets
that explain technical legal concepts in layperson’s terms. They construct a legal question bank containing legal
questions answered by CLIC-pages, using large-scale PLMs like GPT-3 [13] to generate machine-generated
questions. The study demonstrates that machine-generated questions are more scalable and diversified, aiding in
improving accessibility to legal information for non-experts.
Askari et al. [6] propose a cross-encoder re-ranker (𝐶𝐸 𝐹𝑆 ) for legal answer retrieval, incorporating fine-grained
structured inputs from community QA data to enhance retrieval effectiveness. They introduce the LegalQA
dataset containing 9,846 questions and 33,670 lawyer-curated answers. The approach involves a two-stage ranking
pipeline with a BM25 retriever followed by a re-ranker, showing that integrating question tags into the input
structure can bridge the knowledge gap and improve retrieval in the legal domain.
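The idea of structured cross-encoder inputs can be sketched as a simple serialisation step; the separator token, field order, and example texts below are hypothetical illustrations, not the exact format used in the paper.

```python
def cross_encoder_input(question, tags, answer, sep="[SEP]"):
    """Serialise a (question, tags, answer) triple into a single string for a
    cross-encoder re-ranker. Inserting community-assigned question tags gives
    the model explicit topical metadata alongside the raw text."""
    return f"{question} {sep} {' '.join(tags)} {sep} {answer}"

pair = cross_encoder_input(
    "Can my landlord raise the rent during a fixed-term lease?",
    ["tenancy", "contract-law"],
    "Generally not, unless the lease itself contains an escalation clause.",
)
```

Each such string would then be scored by the cross-encoder, re-ordering the BM25 candidate list from the first stage.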
4.1.2 Datasets. The LQA datasets are specialized resources designed to facilitate research in the domain of
legal NLP. They consist of collections of legal questions and corresponding answers, drawn from various legal
documents and case law. Most questions in the LQA datasets fall into two main categories: knowledge-driven
questions (KD-questions) and case-analysis questions (CA-questions) [139]. KD-questions are centered around
the understanding of specific legal concepts, whereas CA-questions involve the analysis of actual legal cases.
Both types demand advanced reasoning skills and a deep comprehension of the text, making LQA a particularly
challenging task in the field of NLP.
Zhong et al. [139] introduce JEC-QA, a dataset with 26,365 multiple-choice questions from the National Judicial
Examination of China and related websites. Each question offers four possible answers and is labeled with the
type of reasoning required, such as word matching, concept understanding, numerical analysis, multi-paragraph
reading, and multi-hop reasoning. This dataset poses significant challenges for QA models, highlighting the gap
between machine performance and human expertise in complex legal reasoning.
Sovrano et al. [107] present a dataset designed to evaluate automated QA systems within the domain of Private
International Law. It includes 17 carefully selected questions based on key EU regulations—Rome I, Rome II, and
Brussels I bis—with answers derived directly from these regulations. The questions are classified based on their
specificity, allowing for nuanced analysis of context-dependency in legal reasoning. This dataset aids in assessing
the performance of QA systems intended for legal professionals navigating complex cross-border legal issues.
EQUALS [21] is a large-scale annotated LQA dataset in Chinese law, containing 6,914 question-answer pairs
with answers based on specific law articles. Curated by senior law students, it covers 10 collections of Chinese
laws and includes annotations indicating the type of reasoning required for each question. The dataset ensures
that answers are precise excerpts from relevant law articles, making it valuable for developing advanced LQA
systems that can aid in legal research and decision-making.
Büttner and Habernal [14] introduce GerLayQA, a dataset supporting LQA for laypersons in Germany, focusing
on the civil-law system. It contains 21,538 real-world questions posed by laypersons, paired with expert answers
from lawyers grounded in specific paragraphs of German law books. The dataset was constructed through filtering
and quality assurance to ensure accuracy and relevance, making it a valuable resource for developing LQA
systems that can interpret and apply German law to everyday legal inquiries.
Despite its promise, LJP is a highly demanding and challenging task. It requires careful handling of natural
biases in historical legal data, which can create feedback loops that amplify discrimination [58]. Therefore,
ensuring the impartiality of predicted rulings is crucial [58]. Currently, LJP is primarily performed by legal experts
who undergo extensive specialized training to manage the complex steps involved, such as identifying relevant
law articles, defining charge ranges, and deciding penalty terms [26]. Nevertheless, LJP provides substantial
benefits, streamlining legal decision-making processes for both practitioners and ordinary citizens [26].
Luo et al. [73] propose an attention-based neural network to enhance charge prediction by jointly modeling
charge prediction and relevant law article extraction. They used Bidirectional Gated Recurrent Units (Bi-GRUs)
to encode fact descriptions and an article extractor to identify top relevant law articles. The model employs an
attention mechanism guided by context vectors to combine embeddings for prediction. Evaluations on Chinese
judgment documents showed improved accuracy in predicting charges and providing relevant legal articles.
Zhong et al. [136] introduce TopJudge, a topological MTL framework that models dependencies among subtasks
in LJP, such as law article prediction, charge prediction, and penalty terms. Using a Directed Acyclic Graph
(DAG), TopJudge processes subtasks in a topological order reflecting real-world legal decision-making. Evaluated
on large-scale Chinese criminal case datasets, it outperformed previous models in predicting legal outcomes.
Ye et al. [130] address the problem of Court View Generation from fact descriptions in criminal cases to
enhance the interpretability of charge prediction systems and aid in automatic legal document generation.
They formulated this as a text-to-text Natural Language Generation (NLG) problem, using a label-conditioned
Seq2Seq model with attention to decode court views based on encoded charge labels. Their model outperformed
basic Seq2Seq models in generating accurate and natural court views. This work contributes to automatic legal
document generation by providing justifications for charge decisions.
Yang et al. [129] propose a Multi-Perspective Bi-Feedback Network (MPBFN) with a Word Collocation Attention
mechanism to improve LJP. MPBFN addresses the challenges of multiple subtasks and their dependencies by
using a bi-feedback mechanism for forward prediction and backward verification among subtasks. The Word
Collocation Attention integrates word collocation features and numerical semantics to better predict penalties.
Evaluated on the CAIL-small and CAIL-big datasets [127], their model outperformed baselines in predicting law
articles, charges, and penalty terms.
Chalkidis et al. [15] introduce an English LJP dataset containing approximately 11.5k cases from the European
Court of Human Rights (ECHR). They evaluated various neural models on this dataset, including a hierarchical
version of BERT [29] (HIER-BERT) to handle long legal documents. Their models outperformed previous feature-
based approaches in tasks like violation classification and case importance prediction. They also explored potential
biases in legal predictive models using data anonymization.
Medvedeva et al. [79] investigate using NLP tools to predict judicial decisions of the ECHR based on court
proceeding texts. They employed an SVM linear classifier to predict violations of articles, achieving an average
accuracy of 75%. However, when predicting future decisions based on past cases, accuracy decreased to 58–68%.
The study also found that predicting outcomes based solely on judges’ surnames could achieve an average
accuracy of 65%, highlighting potential biases.
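The surname-only probe reported above essentially amounts to memorizing each judge's majority outcome. A minimal sketch of such a baseline is shown below; the field names and toy data are hypothetical, for illustration only.

```python
from collections import Counter, defaultdict

def surname_baseline(train, test_cases):
    """Predict each case's outcome from the presiding judge's surname alone,
    using the majority outcome observed for that surname in training."""
    by_judge = defaultdict(Counter)
    for case in train:
        by_judge[case["judge"]][case["outcome"]] += 1
    # Fallback for unseen judges: overall majority outcome.
    overall = Counter(c["outcome"] for c in train).most_common(1)[0][0]
    preds = []
    for case in test_cases:
        seen = by_judge.get(case["judge"])
        preds.append(seen.most_common(1)[0][0] if seen else overall)
    return preds

# Toy data: judge A mostly finds violations, judge B does not.
train = ([{"judge": "A", "outcome": "violation"}] * 3
         + [{"judge": "A", "outcome": "no_violation"}]
         + [{"judge": "B", "outcome": "no_violation"}])
test_cases = [{"judge": "A"}, {"judge": "B"}, {"judge": "C"}]
preds = surname_baseline(train, test_cases)
```

That a baseline this crude reaches non-trivial accuracy is precisely what flags a potential bias: the model exploits who decides rather than what the case says.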
Zhong et al. [137] introduce QAjudge, a reinforcement learning-based model designed to provide interpretable
legal judgments by visualizing the prediction process. QAjudge uses a Question Net to iteratively select relevant
yes-no questions about case facts, an Answer Net to provide answers, and a Predict Net to generate the final
judgment. The model aims to minimize the number of questions asked while focusing on crucial elements to ensure
fairness and interpretability. Evaluated on real-world datasets, QAjudge demonstrated potential in providing
reliable and transparent legal judgments.
Xu et al. [128] propose the Law Article Distillation based Attention Network (LADAN), an end-to-end model
addressing the issue of confusing charges in LJP by distinguishing similar law articles. The model uses a novel
graph neural network to learn differences between confusing law articles and an attention mechanism to extract
discriminative features from fact descriptions. Experiments on real-world datasets showed that LADAN improved
performance over previous methods in law article prediction, charge prediction, and penalty term prediction.
Strickson and De La Iglesia [111] present the first LJP model for UK court cases, creating a labeled dataset of
UK court judgments spanning 100 years. They evaluated various ML models and feature representations, with
their best model achieving an accuracy of 69%. The study demonstrated the potential of LJP for UK courts, though
challenges remain due to the complexity of legal language and lack of structured public datasets.
Ma et al. [74] introduce MSJudge, an MTL framework designed to predict legal judgments by leveraging multi-
stage judicial data, including pre-trial claims and court debates. MSJudge consists of components to encode
multi-stage context, model interactions among claims, facts, and debates, and predict judgments. Evaluated
on a large civil trial dataset2 , MSJudge outperformed state-of-the-art baselines, enhancing trial efficiency and
judgment quality.
Almuslim and Inkpen [2] focus on LJP for Canadian appeal court cases, employing various NLP and ML
methods to predict binary outcomes (‘Allow’ or ‘Dismiss’) based on case descriptions. Deep learning models
using custom Word2Vec embeddings achieved the highest accuracy of 93%, significantly outperforming classical
ML models. The study highlights the potential of predictive models to aid legal professionals and establishes a
foundation for future research in the Canadian legal system.
Feng et al. [37] address limitations of state-of-the-art LJP models by proposing an event-based prediction
model with constraints to improve performance. The model extracts fine-grained key events from case facts
and predicts judgments based on these events rather than the entire fact statement. They manually annotated a
legal event dataset and introduced output constraints to guide learning. Their method effectively leverages event
information and cross-task consistency constraints, enhancing LJP accuracy.
Tong et al. [116] introduce GJudge, a graph boosting framework incorporating constraints to address short-
comings of traditional LJP methods. GJudge features a multi-perspective interactive encoder and a Multi-Graph
Attention Network (MGAT) consistency expert module. The encoder merges fact descriptions with label similarity
connections, while the expert module distinguishes similar labels and maintains task consistency. Testing on
datasets showed that GJudge outperformed other models, including the state-of-the-art RLJP [125] model, with
higher F1 scores.
Previous works mainly focus on creating accurate representations of a case’s fact description to enhance
judgment prediction performance. However, these methods often overlook the practical judicial process, where
human judges compare similar law articles or potential charges before making a decision. To address this gap,
Zhang et al. [133] propose CL4LJP, a supervised contrastive learning framework to improve LJP by capturing
fine-grained differences between similar law articles and charges. The framework includes contrastive learning
tasks at the article, charge, and label levels, enhancing the model’s ability to model relationships between fact
descriptions and labels. Experiments demonstrated that CL4LJP outperformed previous methods, proving its
effectiveness and robustness.
Liu et al. [71] propose ML-LJP, a Multi-Law aware LJP method that expands law article prediction into a
multi-label classification task incorporating both charge-related and term-related articles. The approach uses label-
specific representations and contrastive learning to distinguish similar definitions. A Graph Attention Network
(GAT) is employed to learn interactions among multiple law articles for prison term prediction. Experiments
showed that ML-LJP outperformed state-of-the-art models, particularly in prison term prediction.
4.2.2 Datasets. The LJP datasets are specialized resources designed to advance research in predicting judicial
outcomes within the domain of legal NLP. These datasets are categorized into four main types: court view
generation, law article prediction, charge prediction, and prison term prediction. Court view generation datasets
involve court opinions and summaries. Law article datasets involve the prediction of legal outcomes based on
2 https://github.com/mly-nlp/LJP-MSJudge
specific statutes or regulations. Charge prediction datasets are concerned with predicting the charges that should
be brought against a defendant based on the case details, while prison term prediction datasets aim to estimate
the likely sentence duration given the nature of the crime and the legal context. Each type of dataset presents
unique challenges, demanding not only text comprehension but also the ability to apply complex legal reasoning,
making LJP a particularly complex task in the field of NLP.
Court View Gen [130] is an innovative dataset containing 171,981 Chinese legal cases, each involving a single
defendant and a corresponding charge, covering a total of 51 charge categories. This dataset is specifically curated
to support the generation of court opinions based on charge labels. The data was sourced from publicly available
legal documents within the CJO3 repository.
Niklaus et al. [85] introduce a multilingual LJP dataset from the Federal Supreme Court of Switzerland (FSCS),
containing over 85,000 cases in German, French, and Italian. The dataset is annotated with publication years,
legal areas, and cantons of origin, making it suitable for NLP applications in judgment prediction.
Semo et al. [100] introduce the first LJP dataset focused on class action cases in the United States. The dataset
targets predicting outcomes based on plaintiffs’ complaints rather than court-written fact summaries, involving a
rule-based extraction system to identify relevant text spans from complaints.
Almuslim and Inkpen [2] construct a dataset for LJP within the Saskatchewan Court of Appeal. They collected
and labeled 3,670 documents with case outcomes (‘allow’ or ‘dismiss’) using a two-step pattern matching and
keyword-based validation, ensuring label accuracy through manual annotation. This dataset supports research in
the Canadian legal system and aids in developing predictive models for legal judgments.
3 http://wenshu.court.gov.cn
challenges such as sequence length limitations and the need for explainability in predictive analysis, suggesting
improvements like integrating annotated sentences with full text to enhance sentence relevance identification.
Lee and Lee [60] focus on legal document classification in the Korean language and compare three different DL
approaches: CNN with ASCII encoding, CNN with Word2Vec embeddings, and RNN with Word2Vec embeddings.
The classification models are used to classify case data into civil, criminal, and administrative categories. Using a dataset of
nearly 60,000 past case documents, the study finds that the RNN model with Word2Vec embedding achieves the
highest classification accuracy.
Bambroo and Awasthi [8] introduce an architecture that integrates long attention mechanisms with a distilled
BERT model pre-trained on legal domain-specific corpora. Their model employs a combination of local windowed
attention and task-motivated global attention to handle inputs up to eight times longer than standard BERT
models. The architecture, based on DistilBERT [97] and incorporating LongformerSelf-Attention, is optimized for
legal document classification, outperforming fine-tuned BERT and other transformer-based models in both speed
and effectiveness.
Song et al. [106] present a deep learning-based system built on top of RoBERTa [70] for multi-label legal
document classification. They enhance the model with domain-specific pre-training, a label-attention mechanism,
and MTL to improve classification accuracy, particularly for low-frequency classes. The label-attention mechanism
uses label embeddings to bridge the semantic gap between samples and class labels, addressing class imbalance
issues.
Fragkogiannis et al. [38] propose a method to improve classification of pages within lengthy documents
by leveraging sequential context from preceding pages. They enhance the input to pre-trained models like
BERT [29] by appending special tokens representing the predicted page type of the previous page, enabling
more context-aware classification without modifying the model architecture. Experiments on legal datasets
demonstrate improvements compared to non-recurrent setups.
Wang et al. [121] introduce a Document-to-Graph Classifier to classify legal documents based on facts and
reasons rather than topics. They extract key entities and represent legal documents using four distinct relation
graphs capturing different aspects of entity relationships. A Graph Attention Network (GAT) [119] is used to learn document representations
from the combined graph, improving classification by focusing on factual content.
Mamooler et al. [76] propose an active learning pipeline for fine-tuning PLMs for LTC. They address chal-
lenges of specialized vocabulary and high annotation costs. Their method involves continued pre-training of
RoBERTa [70] on legal texts, knowledge distillation using a pre-trained sentence transformer, and an efficient
initial sampling strategy by clustering unlabeled data. This approach reduces the number of labeling actions
required and improves efficiency in adapting models to LTC tasks.
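The intuition behind the clustering-based initial sampling, namely seeding annotation with a diverse subset of the unlabeled pool, can be illustrated with a simple diversity heuristic. The greedy farthest-point selection below is a stand-in for the authors' clustering of sentence-transformer embeddings, and the toy 2-D vectors are assumptions for illustration.

```python
def farthest_point_sample(vectors, k):
    """Greedily pick k diverse points: start from the first vector, then
    repeatedly add the point farthest from the current selection."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    selected = [0]
    while len(selected) < k:
        best, best_d = None, -1.0
        for i in range(len(vectors)):
            if i in selected:
                continue
            # Distance to the nearest already-selected point.
            d = min(dist(vectors[i], vectors[j]) for j in selected)
            if d > best_d:
                best, best_d = i, d
        selected.append(best)
    return selected

# Toy "embeddings": two tight clusters; a diverse seed set should span both.
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
seed = farthest_point_sample(points, 2)
```

A seed set spread across the embedding space gives the first fine-tuning round coverage of distinct legal topics before any labels have been paid for.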
4.3.2 Datasets. LTC datasets are characterized by their domain-specific vocabulary and multi-label nature,
requiring models to interpret complex legal texts and categorize them into single or multiple legal themes.
Chalkidis et al. [16] release EURLEX57K, a dataset containing 57,000 EU legislative documents from the EUR-
LEX portal4 , annotated with EUROVOC5 concepts. This dataset facilitates research in LTC, including extreme
multi-label text classification, few-shot, and zero-shot learning, with documents tagged with an expansive set of
descriptors.
Tuggener et al. [117] introduce LEDGAR, a multi-label corpus of legal provisions from contracts scraped from
the U.S. Securities and Exchange Commission’s website. The dataset includes over 846,000 provisions across
60,540 contracts, with an extensive label set suitable for text classification and legal studies, aiding in developing
advanced legal NLP models.
4 lex.europa.eu/
5 http://eurovoc.europa.eu/
Chalkidis et al. [17] present MULTI-EURLEX, a multilingual dataset containing 65,000 EU laws translated into
23 official EU languages, annotated with EUROVOC labels. The dataset emphasizes temporal concept drift by
adopting chronological splits, enhancing its utility for sophisticated LTC tasks requiring understanding nuanced
legal terms across different time periods.
Papaloukas et al. [90] introduce the Greek Legal Code dataset, categorizing approximately 47,000 Greek
legislative documents into a detailed multi-level classification system. The dataset is structured into volumes,
chapters, and subjects, each containing diverse legal documents from Greek legislation history, supporting LTC
in the Greek legal domain.
Song et al. [106] introduce POSTURE50K, a legal dataset containing 50,000 U.S. legal opinions annotated
with Legal Procedural Postures ranging from common to rare motions. The dataset includes an innovative split
strategy to support supervised and zero-shot learning evaluations, ensuring infrequent categories are adequately
represented, enhancing model generalizability and testing accuracy.
Graham et al. [45] develop a domain-specific dataset for LTC focusing on deontic modalities in contract
sentences. They manually annotated contract sentences to train models for identifying deontic sentences like
permissions, obligations, and prohibitions. The corpus, derived from the Contract Understanding Atticus Dataset
(CUAD) [47], provides a resource for studying functional categories crucial for legal analysis.
Building on previous developments in LDS, Polsley et al. [92] introduce Casesummarizer, a tool designed for
the legal domain that pre-processes legal texts into sentences, scores them using a TF-IDF matrix from extensive
legal case reports, and enhances sentence scoring by identifying entities, dates, and section headings. The tool
provides a user-friendly interface with scalable summaries, lists of entities and abbreviations, and a significance
heat map.
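The scoring recipe used by CaseSummarizer (TF-IDF sentence scores plus boosts for salient surface cues) can be sketched as below. The whitespace tokenization, the year-matching regex standing in for entity and date detection, and the boost weight are illustrative assumptions, not the tool's actual implementation.

```python
import math
import re
from collections import Counter

def summarize(sentences, corpus_sentences, top_n=2, boost=1.0):
    """Score sentences by mean TF-IDF over a background corpus, with a
    bonus for sentences mentioning a year (stand-in for entity/date cues)."""
    docs = [s.lower().split() for s in corpus_sentences]
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))  # document frequencies

    def score(sent):
        toks = sent.lower().split()
        tfidf = sum(math.log(1 + n / (1 + df[t])) for t in toks) / max(len(toks), 1)
        bonus = boost if re.search(r"\b(19|20)\d{2}\b", sent) else 0.0
        return tfidf + bonus

    return sorted(sentences, key=score, reverse=True)[:top_n]

case = [
    "The claimant filed suit on 12 March 2019 in the district court.",
    "It is a truth universally acknowledged that courts are busy.",
    "Damages of $5,000 were awarded for breach of contract.",
]
summary = summarize(case, case, top_n=2, boost=10.0)
```

In the real tool, the background statistics come from a large collection of case reports, and the boost is driven by recognized entities, dates, and section headings rather than a single regex.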
Zhong et al. [140] propose an automatic extractive summarization system for legal cases concerning Post-
traumatic Stress Disorder from the US Board of Veterans’ Appeals. It employs a train-attribute-mask pipeline
using a CNN classifier to iteratively select predictive sentences from case texts.
Nguyen et al. [83] propose an RL framework to enhance deep summarization models for the legal domain,
utilizing Proximal Policy Optimization with a reward function that integrates both lexical and semantic criteria.
They fine-tune an extractive summarization backbone based on BERTSUM [69], employing a reward model
that includes lexical, sentence, and keyword-level semantics to produce better legal summaries. Schraagen et al.
[99] apply an RL approach with a Bi-LSTM and a deep learning approach based on the BART transformer
model to abstractive summarization of the Dutch case verdict database Rechtspraak.nl, combining extractive and
abstractive summarization to retain core facts while creating concise summaries.
Zhong and Litman [141] focus on extractive summarization of legal case decisions, proposing an unsupervised
graph-based ranking model that exploits document structure properties. Their reweighting algorithm improves
sentence selection in the HipoRank model [31], aiming to reduce redundancy and enhance the selection of
argumentative sentences from underrepresented sections.
Moro et al. [82] introduce a transfer learning approach that combines extractive and abstractive summarization
techniques to address the lack of labeled legal summarization datasets, outperforming previous results on the
Australian Legal Case Reports dataset and establishing a new baseline for abstractive summarization.
Jain et al. [49] propose a sentence scoring approach, DCESumm, which combines supervised sentence-level
summary relevance prediction with unsupervised clustering-based document-level score enhancement. They
utilize a Legal BERT-based Multi-Layer Perceptron (MLP) model to predict the summary relevance of each sentence,
refining scores through deep embedded sentence clustering to enhance the selection process by considering the
global context of the document.
Liu et al. [68] present Common Law Court Judgment Summarization (CLSum), a pioneering dataset for
summarizing multi-jurisdictional common law court judgments, leveraging large language models for data
augmentation, summary generation, and evaluation. They employ a two-stage summarization process with
techniques like sparse attention mechanisms and efficient training methods to process lengthy legal documents
within limited computational resources.
4.4.2 Datasets. LDS datasets are largely built from structured court proceedings and decisions, providing rich
sources for both extractive and abstractive summarization methods. These datasets often emphasize abstractive
summarization to achieve concise, readable summaries that transform the original legal language into more
accessible forms [102].
Zhong et al. [140] develop a dataset from 972,522 Board of Veterans’ Appeals decisions, focusing on single-issue
cases related to Post-traumatic Stress Disorder. The dataset consists of 112 carefully sampled decisions, annotated
by legal experts to capture key information such as ‘Issue’, ‘Procedural History’, ‘Service History’, ‘Outcome’,
‘Reasoning’, and ‘Evidential Support’.
Shen et al. [102] introduce Multi-LexSum, an abstractive summarization dataset tailored for U.S. federal civil
rights lawsuits, containing 40,000 source documents and 9,000 expert-written summaries of diverse lengths,
providing a rich resource for testing advanced summarization models.
Liu et al. [68] publish CLSum, a dataset designed for summarizing multi-jurisdictional common law court
judgments from Australia, Hong Kong, the United Kingdom, and Canada. This dataset leverages large language
models for data augmentation and incorporates legal knowledge to enhance summary generation and evaluation.
This dataset addresses the challenge of sparse labeled data in legal domains. CLSum includes a comprehensive
collection of judgments and summaries from prominent court websites. It employs novel techniques to enrich
training sets and improve model performance in few-shot and zero-shot learning scenarios.
like legal norms and case-by-case regulations, distinguishing between different types of legal acts and literature.
This dataset’s comprehensive annotation process involved multiple cycles to refine the tagging guidelines and
enhance annotation quality.
Păis et al. [94] introduce the LegalNERo corpus, a manually annotated resource for NER in the Romanian legal
domain, featuring 370 legal documents annotated with five general entity types: person, location, organization,
time expressions, and legal references. This corpus was developed to support both specific legal domain NER
tasks and more general NER applications by enabling compatibility with existing general-purpose NER systems.
The corpus includes rich entity annotations, with legal references showing the highest token count per entity,
indicating their complexity and length. The detailed annotation process, including inter-annotator agreement
assessed by Cohen’s Kappa, and the subsequent mapping of entities to RDF format, highlights the corpus’s utility
and precision for advancing NER research and applications within the legal domain.
Au et al. [7] introduce the E-NER dataset, an annotated collection derived from the US SEC’s EDGAR filings,
designed for legal NER. This dataset contains filings that are rich in text, such as quarterly reports (Form 10-Q)
and significant event announcements (Form 8-K), from which sentences were extracted and annotated with seven
named entity classes more tailored to legal content than those in the standard CoNLL dataset [115]. The entities
include Person, Location, Organization, Government, Court, Business, and Legislation/Act, adjusting the CoNLL
classes to better suit legal documents. E-NER contains significantly longer sentences compared to CoNLL and
includes detailed annotations of financial entities from legal company filings.
Kalamkar et al. [54] present a comprehensive corpus aimed at enhancing legal NER, containing 46,545 entities
across 14 types identified in Indian High Court and Supreme Court judgments. This corpus, split into preamble and
judgment sections, includes diverse entity types detailed in their legal NER taxonomy, such as court, petitioner,
respondent, and statute, among others. The training set, drawn from judgments between 1950 and 2017, features
29,964 entities, while the development and test sets span judgments from 2018 to 2022. This dataset not only facilitates training
and evaluation of NER models specific to the legal domain but also provides a structured framework for assessing
the performance of NER systems on legal texts. Their approach leverages a combination of manual annotation
and ML techniques to ensure the precision of entity recognition in legal judgments.
automated systems. It utilizes a format where a cited text serves as a prompt with five answer options—one
correct holding and four closely related incorrect holdings—to refine the models’ abilities to accurately reflect
legal reasoning.
Chalkidis et al. [19] introduce the Legal General Language Understanding Evaluation (LexGLUE) benchmark,
a comprehensive suite of datasets aimed at assessing the capabilities of NLP models across various legal tasks.
The benchmark covers datasets such as ECtHR [15], SCOTUS7 , EUR-LEX, LEDGAR [117], UNFAIR-ToS [67],
and CaseHOLD [135], each chosen for its complexity, relevance, and need for legal expertise. These datasets
cover a range of tasks from multi-label and multi-class classification to multiple-choice questions and are split
chronologically into training, development, and test sets to provide standardized evaluation metrics. For instance,
ECtHR datasets focus on violations of European Convention on Human Rights provisions, the SCOTUS dataset
classifies U.S. Supreme Court opinions by legal issues, the EUR-LEX dataset involves labeling EU laws with EuroVoc
concepts, LEDGAR classifies provisions of U.S. contracts, UNFAIR-ToS identifies unfair terms in online service
agreements, and CaseHOLD involves answering questions about legal rulings. This benchmark facilitates the
testing of NLP models, addressing the challenges of legal text comprehension and understanding required for
effective application in the legal domain.
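The chronological splitting that LexGLUE adopts can be reproduced in a few lines: sort by decision date and cut, so evaluation always happens on documents newer than anything seen in training. The field names and split ratios below are illustrative assumptions.

```python
def chronological_split(cases, train_frac=0.8, dev_frac=0.1):
    """Sort cases by decision date, then cut into train/dev/test so the
    test set contains only documents newer than anything in training."""
    ordered = sorted(cases, key=lambda c: c["date"])
    n = len(ordered)
    i = int(n * train_frac)
    j = int(n * (train_frac + dev_frac))
    return ordered[:i], ordered[i:j], ordered[j:]

# Toy corpus: two cases per year from 2010 to 2019.
cases = [{"id": k, "date": f"20{10 + k % 10:02d}-01-01"} for k in range(20)]
train, dev, test = chronological_split(cases)
```

Compared with a random split, this protocol exposes temporal concept drift: a model must generalize to legal language and doctrine it has never seen, which is the realistic deployment condition.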
Chalkidis et al. [20] present FairLex, a benchmark suite consisting of four legal datasets—ECtHR [15], SCOTUS,
FSCS, and CAIL [127]—that address the fairness of NLP applications across diverse legal jurisdictions and lan-
guages, including English, German, French, Italian, and Chinese. These datasets, curated from European Council,
USA, Switzerland, and China, cover various legal tasks such as judgment prediction, issue area classification,
and crime severity prediction, aiming to test the performance and fairness of LMs in recognizing and classifying
legal texts. FairLex focuses on ensuring demographic, regional, and legal topic fairness by analyzing attributes
like gender, age, region of origin, and legal areas within cases. Each dataset in FairLex provides a substantial
number of cases, systematically divided into training, development, and test sets, and includes detailed attributes
like the defendant state in ECtHR, decision direction in SCOTUS, legal areas in FSCS, and demographic details
in CAIL. The CAIL dataset from China contains over a million cases focusing on criminal law, annotated with
demographics and regional classifications, which are used to explore the crime severity prediction task.
Henderson et al. [46] introduce the ‘Pile of Law’, the first large corpus in the legal domain: a 256GB dataset
of open-source English-language legal and administrative data. This dataset includes
contracts, court opinions, legislative records, and administrative rules, curated to explore data sanitization norms
across legal and administrative settings and serve as a tool for pre-training legal-domain LMs. They emphasize
the legal norms governing privacy and toxicity filtering, detailing how the dataset reflects these norms through
built-in filtering mechanisms in the collected data, which include court filings, legal analyses, and government
publications. By analyzing how legal and administrative entities handle sensitive information and potentially
offensive content, the paper provides actionable insights for researchers to improve content filtering practices
before pre-training LLMs, thereby enhancing the ethical use of NLP in legal applications.
Rabelo et al. [95] summarize the 8th Competition on Legal Information Extraction and Entailment (COLIEE
2021), which featured five tasks across case and statute law, engaging participants from various teams to apply
diverse NLP approaches. The competition tasks included case law retrieval and entailment, as well as statute law
retrieval and entailment with and without prior retrieved data. Specifically, Task 1 focused on extracting relevant
supporting cases from a corpus, while Task 2 involved identifying paragraphs from cases that entail a given new
case fragment. For statute law, Tasks 3 and 4 entailed retrieving and answering questions based on civil code
statutes, with Task 5 challenging participants to answer without pre-retrieved statutes. The datasets used varied
in complexity, from the 4,415 case files in Task 1, where noticed cases had to be identified without relying on citations,
to the civil code-based Tasks 3, 4, and 5 which adapted to recent legal revisions in Japanese law and excluded
untranslated parts, reflecting the ongoing evolution and challenge in legal NLP applications.
Barale et al. [9] present AsyLex, a pioneering dataset tailored for Refugee Law applications, featuring 59,112
documents from Canadian refugee status determinations spanning from 1996 to 2022. This dataset is designed to
enhance the capabilities of NLP models in legal research by providing 19,115 gold-standard human-annotated and
30,944 inferred labels for entity extraction and LJP. Key contributions include anonymizing decision documents,
employing a robust annotation methodology, and creating datasets for specific NLP tasks like entity extraction
and judgment prediction. This rich corpus, with detailed annotations across 22 categories, supports complex legal
NLP tasks, thereby filling the gap in resources for the legal domain.
Niklaus et al. [86] introduce LEXTREME, a multilingual benchmark specifically designed to evaluate LMs on
legal NLP tasks, a critical step given the unique challenges of legal language. Surveying legal NLP literature from
2010 to 2022, they curate 11 datasets spanning 24 languages and cover a variety of legal domains, employing
datasets that only involve human-annotated texts or those with annotations derived through clear methodological
frameworks. They introduce two aggregate scores to facilitate fair comparison across models: the dataset aggregate
score and the language aggregate score, revealing a performance correlation with model size on LEXTREME. The
benchmark consists of three task types: Single Label Text Classification, Multi Label Text Classification, and NER,
using existing splits for training, validation, and testing when available, or creating random splits otherwise. This
effort marks a significant advancement in testing NLP capabilities across a diverse range of legal documents and
languages.
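The two aggregate scores can be sketched as successive harmonic means, first within and then across groups; the dataset names and macro-F1 values below are hypothetical, and the exact aggregation in [86] may differ in detail:

```python
from statistics import harmonic_mean

# Hypothetical per-dataset, per-language scores for one model; the
# real LEXTREME benchmark spans 11 datasets and 24 languages.
scores = {
    "greek_legal_code": {"el": 0.62},
    "swiss_judgment":   {"de": 0.70, "fr": 0.68, "it": 0.66},
    "multi_eurlex":     {"de": 0.55, "fr": 0.57},
}

def dataset_aggregate(scores):
    """Aggregate each dataset over its languages, then aggregate across
    datasets; the harmonic mean penalizes weak spots more than the mean."""
    per_dataset = [harmonic_mean(list(langs.values())) for langs in scores.values()]
    return harmonic_mean(per_dataset)

def language_aggregate(scores):
    """Regroup the same scores by language, then aggregate across languages."""
    by_lang = {}
    for langs in scores.values():
        for lang, s in langs.items():
            by_lang.setdefault(lang, []).append(s)
    return harmonic_mean([harmonic_mean(v) for v in by_lang.values()])
```

The two views reward different strengths: a model must be consistently good across every dataset for one score and across every language for the other.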
Park and James [91] explore the creation of a Natural Language Inference dataset within the legal domain,
focusing on criminal court verdicts in Korean. Their methodology includes the innovative use of adversarial
hypothesis generation to challenge annotators and enhance the robustness of the dataset, supported by visual
tools for hypothesis network construction. The data collection involves extracting context from verdicts and
augmenting it using Easy Data Augmentation [123] techniques and round-trip translation to generate a dataset
for training and testing Natural Language Inference models. The study highlights issues such as annotators’
limited domain knowledge and challenges in handling long contexts but provides solutions like targeted data
collection and the use of gamification to boost annotator engagement and productivity.
Goebel et al. [44] summarize the 10th Competition on Legal Information Extraction and Entailment (COLIEE
2023), featuring four tasks across case and statute law with participation from ten different teams engaging in
multiple tasks. Task 1 involves legal case retrieval, requiring participants to extract supporting cases from a
corpus, and Task 2 focuses on legal case entailment, identifying paragraphs that entail aspects of a new case. Tasks
3 and 4, based on Japanese Civil Code statutes from the bar exam, involve retrieving relevant articles and verifying
statements, respectively. The competition leverages a dataset of over 5,700 case law files and introduces new
query cases and test questions sourced from recent bar exams, testing the efficacy of different teams’ approaches
in handling complex legal texts and hypotheses in a controlled competitive environment.
Östling et al. [142] introduce the Cambridge Law Corpus (CLC), a comprehensive legal dataset featuring
258,146 cases from UK courts, dating from the 16th century to the present. The corpus includes raw text and
metadata across various court types, and is structured in XML format for ease of use and annotated for case
outcomes in a subset of 638 cases. Additionally, the CLC is supported by a Python library for data manipulation
and ML applications.
Niklaus et al. [87] present the MultiLegalPile, the largest open-source multilingual legal corpus available,
totaling 689GB and spanning 17 jurisdictions across 24 languages. This extensive dataset is designed to facilitate
training of LLMs within the legal domain, featuring diverse legal text types including case law, legislation, and
contracts, predominantly in English due to the integration of the ‘Pile of Law’ [46] dataset. Through careful
regex-based filtering from the mC4 corpus and manual reviews, the team ensures high precision in legal content
selection. The corpus, efficiently compressed using XZ and formatted in JSONL, supports comprehensive NLP
research and modeling, emphasizing its broad applicability in advancing legal AI technologies.
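The two steps described above, regex-based selection of likely-legal documents and storage as XZ-compressed JSONL, can be sketched as follows; the filter patterns and the `looks_legal` threshold are illustrative assumptions, not the authors' actual per-language rules:

```python
import json
import lzma
import re

# Illustrative patterns for spotting likely legal text; the actual
# MultiLegalPile filters use curated, per-language legal terms.
LEGAL_PATTERNS = re.compile(
    r"\b(plaintiff|defendant|pursuant to|hereinafter|court of appeal)\b",
    re.IGNORECASE,
)

def looks_legal(text: str, min_hits: int = 2) -> bool:
    """Keep a document only if it contains enough legal-term matches."""
    return len(LEGAL_PATTERNS.findall(text)) >= min_hits

def write_corpus(docs, path):
    """Store the filtered documents as XZ-compressed JSONL,
    one JSON object per line, as the corpus release does."""
    with lzma.open(path, "wt", encoding="utf-8") as f:
        for doc in docs:
            if looks_legal(doc["text"]):
                f.write(json.dumps(doc, ensure_ascii=False) + "\n")
```

The JSONL-plus-XZ combination keeps the release streamable (one document per line) while achieving a high compression ratio on the highly repetitive language of legal text.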
Shi et al. [104] develop Legal-LM, a specialized language model tailored for Chinese legal consulting, enhanced
with a KG to address domain-specific challenges such as data veracity and non-expert user interaction. The
framework involves several steps: extensive pre-training on a rich corpus of legal texts integrated with a legal
KG, keyword extraction and Direct Preference Optimization to refine responses, and the use of an external legal
knowledge base for data retrieval and response validation. This multi-faceted approach ensures that Legal-LM
not only comprehends complex legal language but also generates precise and user-aligned legal advice.
but fail to provide details about the annotators’ backgrounds or the specific annotation processes used. To develop
unbiased legal NLP systems, it is necessary to document the dataset creation process thoroughly. This includes
providing detailed descriptions of the annotation procedures and the qualifications of the annotators involved,
which is essential for ensuring the reliability and fairness of the systems trained on these datasets. Researchers
need to prioritize transparency to build trust and allow for effective evaluation in the legal NLP field.
ORC4: Multilingual Capabilities. In legal NLP, enhancing multilingual capabilities remains an underdeveloped
area. While efforts like MultiLegalPile [87] have begun to address this, there remains a gap in research for
many languages, including but not limited to Persian and Arabic. This limitation significantly restricts the
application of legal NLP across diverse legal systems worldwide, hindering broader accessibility.
Multilingual capabilities introduce unique challenges for legal NLP models, primarily due to the distinct linguistic
structures of each language, which often require extensive fine-tuning to ensure accuracy and relevancy in legal
contexts. Furthermore, each legal system possesses its own set of terms and document standards, which can vary
dramatically from one language to another. Therefore, expanding research into these and other underserved
languages is essential for making NLP tools universally applicable and effective.
ORC5: Ontology. The use of ontologies in the legal domain is relatively sparse, yet it holds considerable potential
to enhance the robustness of AI methodologies. An ontology or KG can also enable AI models to draw accurate
inferences about the relationships between terms and thereby better understand and process complicated
legal texts. This approach could advance the capability of AI systems to handle complex legal reasoning and
decision-making processes. However, utilizing ontology in legal NLP faces unique challenges. The complexity of
legal language and the concept of ‘open texture’, where the meaning of legal terms can evolve over time, complicate
the creation of static ontological models [81]. Legal ontologies must be dynamic, reflecting changes in law and its
interpretation over time. Additionally, the integration of real-world and legal concepts within ontologies presents
further complexity, as it requires accommodating both legal terms and their relevant real-world contexts [81].
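As a minimal illustration of such inference, the sketch below represents a toy legal ontology as subject-relation-object triples and computes the transitive closure of ‘is_a’ links; the terms and relations are invented for illustration and are not drawn from any real legal ontology:

```python
# Toy legal ontology as (subject, relation, object) triples.
TRIPLES = [
    ("burglary", "is_a", "theft_offence"),
    ("theft_offence", "is_a", "criminal_offence"),
    ("criminal_offence", "tried_in", "criminal_court"),
]

def is_a_closure(term, triples):
    """Follow 'is_a' edges transitively, so a system can infer, e.g.,
    that burglary is a criminal offence even though no single triple
    states it directly."""
    ancestors, frontier = set(), {term}
    while frontier:
        nxt = {o for s, r, o in triples if r == "is_a" and s in frontier}
        frontier = nxt - ancestors
        ancestors |= nxt
    return ancestors
```

A dynamic legal ontology would additionally version these triples over time, so that ‘open texture’ shifts in the meaning of a term are reflected in the inferences drawn at a given date.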
ORC6: Pre-processing Legal Text. Pre-processing legal texts is challenging due to the distinct nature of legal
documents. Existing legal corpora often consist of raw texts that require extensive cleaning and transformation
before they are suitable for ML models. Additionally, legal documents can include complex nested structures,
such as clauses within clauses, and cross-references to other cases, statutes, or provisions, making it difficult to
break them into coherent units for analysis. These complexities make it impractical to fine-tune LMs directly
on raw legal datasets and complicate the conversion of large legal corpora into formats that NLP models can
readily process and learn from, limiting the effectiveness of legal NLP applications.
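A minimal sketch of such pre-processing might look as follows; the cleaning rules and the cross-reference pattern are simplified assumptions rather than a complete pipeline:

```python
import re

# Simplified pattern for statute/provision references such as
# "Section 12(a)" or "Article 5"; real citation grammars are richer.
CROSS_REF = re.compile(r"(?:[Ss]ection|[Aa]rticle|§)\s*\d+(?:\(\w+\))*")

def clean(raw: str) -> str:
    """Collapse the irregular whitespace and hyphenated line breaks
    that raw legal corpora typically contain."""
    text = re.sub(r"-\n(\w)", r"\1", raw)  # re-join words split across lines
    text = re.sub(r"\s+", " ", text)       # collapse whitespace
    return text.strip()

def extract_cross_references(text: str):
    """List statute/provision references so they can be resolved or
    masked before fine-tuning."""
    return CROSS_REF.findall(text)
```

Isolating cross-references in this way lets a pipeline either resolve them (by in-lining the cited provision) or mask them, so that chunked documents remain coherent units for a model.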
ORC7: Reinforcement Learning from Human Feedback (RLHF). The use of RLHF within the legal domain is notably
scarce. Currently, there is only one peer-reviewed work [83] available that explores this approach. This indicates
a significant opportunity for research and development in this area, as RLHF could potentially enhance NLP’s
capability to learn and make decisions based on complex legal data under human guidance. Further exploration
into this method could lead to more responsive and adaptable legal NLP systems. However, due to the complex
nature of legal reasoning and the need for accurate legal knowledge in the human feedback phase, RLHF’s integration
into legal NLP poses some challenges. Therefore, on the human feedback side, legal experts such as lawyers and
judges must provide guidance to ensure the AI accurately interprets and applies complex legal concepts.
ORC8: Expanding Legal Domain Coverage. There is a noticeable gap in the research across various areas of the legal
domain, including Intellectual Property, Criminal Law, Banking Law, Family Law, and Human Rights Law. These
fields have seen limited exploration across all legal NLP tasks, such as LQA and other applications. Expanding
research into these areas is essential for developing comprehensive legal automated systems that can provide
tailored solutions and insights highly relevant to these sectors of law.
ORC9: Small Language Models (SLMs). Research into SLMs specific to the legal domain is notably absent. Ad-
dressing this gap could lead to more efficient, resource-conscious solutions that still maintain high performance
in legal text processing and analysis. The development of SLMs tailored for legal applications could revolutionize
the accessibility and scalability of legal NLP tools.
ORC10: Domain-Specific Efficient Fine-Tuning. Domain-specific efficient fine-tuning within the legal field is an
underexplored area, with only two known studies [63, 72] addressing it. Legal texts consist of complex structures
and specialized words that standard LMs may not capture without significant adaptation. Additionally, the legal
domain covers a vast array of document types, such as case law, statutes, and contracts, each requiring tailored
approaches for effective model application. This diversity makes it imperative to develop fine-tuning strategies
that not only adapt a model generally but also tailor it to understand the differences between these document
types. The majority of existing approaches involve fine-tuning the entire model, which can be resource-intensive.
More focused research could enable fine-tuning of legal LMs using fewer resources, enhancing the efficiency of
deploying these models in practice.
ORC11: Legal Logical Reasoning. Complex legal logical reasoning remains a significant challenge in LJP, particu-
larly in predicting prison terms. Current state-of-the-art methods struggle to achieve high accuracy in this area.
This highlights a clear need for enhanced approaches that can effectively handle the complexity of legal reasoning.
ORC12: Legal Named Entity Recognition. Legal NER focuses on specific challenges such as disambiguating titles,
resolving nested entities, addressing co-references, managing lengthy texts, and processing machine-inaccessible
PDFs. Despite its critical role in understanding and structuring legal documents, there is limited research in this
area, as observed from Fig. 1.
ORC13: Stochastic Parrots. The concept of “Stochastic Parrots” pertains particularly to LLMs. It captures the concern
that these models do not truly understand language but merely mimic patterns in human text. This mimicry can
lead to unreliable outcomes, especially in critical legal situations, if the models are not trained on high-quality,
unbiased datasets. The risk is notably significant in LJP, where training on biased or unfair data could lead to
irreversible outcomes, as discussed in Bender et al. [11]’s work on the limitations of LLMs. This underscores the
importance of ensuring that LLMs are trained responsibly to avoid perpetuating or amplifying existing biases in
legal decisions.
ORC14: Retrieval-Augmented Generation. In the legal domain, where documents are usually lengthy, often contain
cross-references, and present a variety of complex linguistic structures, LLMs can sometimes generate hallucinated
responses when asked to produce accurate answers. RAG systems offer a promising solution
to these challenges. RAG can mitigate issues such as the natural limitations of LLMs concerning maximum input
lengths, where even extended limits may fall short due to the excessive length of many legal documents. This
approach not only improves the model’s response quality but also its relevance and contextual appropriateness by
incorporating more of the document’s content into the decision-making process. However, the integration of RAG
into the legal domain introduces unique challenges, such as managing documents from multiple jurisdictions,
ensuring temporal relevance, addressing multilingual issues, and overcoming biases in the retrieval phase. These
challenges must be addressed in future research when applying RAG in the legal domain.
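A minimal sketch of the retrieval step of such a system appears below, using a pure-Python BM25 scorer to rank statute chunks and pack the best ones into a bounded prompt; the corpus, query, and character budget are hypothetical, and production systems would use proper tokenization and token-based limits:

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Naive word tokenizer; real systems use legal-aware tokenization."""
    return re.findall(r"\w+", text.lower())

def bm25_rank(query, chunks, k1=1.5, b=0.75):
    """Score each chunk against the query with BM25 and return the
    chunks sorted best-first."""
    docs = [tokenize(c) for c in chunks]
    avgdl = sum(len(d) for d in docs) / len(docs)
    n = len(docs)
    df = Counter()
    for d in docs:
        df.update(set(d))  # document frequency per term
    ranked = []
    for d, chunk in zip(docs, chunks):
        tf = Counter(d)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        ranked.append((score, chunk))
    return [c for _, c in sorted(ranked, key=lambda x: -x[0])]

def build_prompt(question, chunks, max_chars=1000):
    """Pack the top-ranked chunks into the prompt until a stand-in
    context budget (characters here, tokens in practice) is spent."""
    picked, used = [], 0
    for chunk in bm25_rank(question, chunks):
        if used + len(chunk) > max_chars:
            break
        picked.append(chunk)
        used += len(chunk)
    return "Context:\n" + "\n".join(picked) + f"\n\nQuestion: {question}"
```

Because only the highest-scoring chunks enter the prompt, documents far longer than the model's maximum input length can still ground the generated answer.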
ORC15: Automated Legal Assistance System. To decrease human error and the costs of legal services, there is a
need for a comprehensive automated legal assistance system. This system should span all tasks within the legal
NLP field, from question answering to judgment prediction, and cater to different legal specializations, such as
civil and financial law, across multiple languages from English to Persian. Developing an accurate LLM pre-trained
on a vast, diverse dataset free from biases and unfairness is crucial; this ensures that the automated legal services
can reliably and equitably address a wide range of legal issues.

Table 3. Open Research Challenges (ORCs) and the areas of legal NLP they relate to.

Open Research Challenges                        LQA  LJP  LTC  LDS  Legal NER  LLMs  Corpora
ORC1: Bias and Fairness                          ✓    ✓    ✓    –       –       ✓      ✓
ORC2: Interpretability and Explainability        ✓    ✓    ✓    –       –       ✓      –
ORC3: Transparent Annotation                     ✓    ✓    ✓    ✓       ✓       –      ✓
ORC4: Multilingual Capabilities                  ✓    ✓    ✓    ✓       ✓       ✓      ✓
ORC5: Ontology                                   ✓    ✓    ✓    ✓       ✓       ✓      –
ORC6: Pre-processing Legal Text                  ✓    ✓    ✓    ✓       ✓       ✓      ✓
ORC7: RLHF                                       ✓    ✓    ✓    ✓       –       ✓      –
ORC8: Expanding Legal Domain Coverage            ✓    ✓    ✓    ✓       ✓       ✓      ✓
ORC9: SLMs                                       –    –    –    –       –       ✓      –
ORC10: Domain-Specific Efficient Fine-Tuning     ✓    ✓    ✓    ✓       –       ✓      –
ORC11: Legal Logical Reasoning                   ✓    ✓    ✓    –       –       ✓      –
ORC12: Legal NER                                 –    –    –    –       ✓       –      –
ORC13: Stochastic Parrots                        –    –    –    –       –       ✓      –
ORC14: RAG                                       ✓    ✓    –    –       –       ✓      –
ORC15: Automated Legal Assistance System         ✓    ✓    ✓    ✓       ✓       ✓      ✓

Summary. Table 3 illustrates the connections between the ORCs and the areas discussed above. A direct relationship
is marked with a ✓, and its absence with a –. As shown, most ORCs relate to LQA, LJP, LTC, and LLMs, indicating
that these are the most extensively researched areas.
7 CONCLUSION
Advances in AI and NLP have improved Legal NLP techniques and models. These improvements help better meet
the needs of laypersons in legal matters and ease the workload of legal professionals. This survey provides a
comprehensive overview of the advancements in NLP techniques used in the legal domain. Additionally, we
discussed the unique characteristics of legal documents. We also reviewed existing datasets and LLMs tailored for
the legal domain. Legal NER research spans multiple languages and utilizes diverse methods, from rule-based
to BERT-based models. LDS has largely focused on extractive and abstractive methods, including TF-IDF and
transformer-based models. In LTC, multi-class classification tasks dominate, with deep learning architectures like
CNNs and Bi-LSTMs widely used. LJP primarily focuses on Chinese datasets with deep learning approaches like
CNNs. LQA often leverages information retrieval techniques such as BM25, with a significant focus on statutory
law. Finally, we explored key ORCs, such as the need for domain-specific fine-tuning strategies, addressing bias
and fairness in legal datasets, and the importance of interpretability and explainability. Other challenges include
the development of more robust pre-processing techniques, handling multilingual capabilities, and integrating
ontology-based methods for more accurate legal reasoning.
REFERENCES
[1] Muhammad Al-qurishi, Sarah Alqaseemi, and Riad Souissi. 2022. AraLegal-BERT: A pretrained language model for Arabic Legal text.
In Proceedings of the Natural Legal Language Processing Workshop 2022. Association for Computational Linguistics, Abu Dhabi, United
Arab Emirates (Hybrid), 338–344. https://doi.org/10.18653/v1/2022.nllp-1.31
[2] Intisar Almuslim and Diana Inkpen. 2022. Legal Judgment Prediction for Canadian Appeal Cases. In 2022 7th International Conference
on Data Science and Machine Learning Applications (CDMA). IEEE, Riyadh, Saudi Arabia, 163–168. https://doi.org/10.1109/CDMA54072.2022.00032
[3] AWS Amazon. [n. d.]. What are Transformers in Artificial Intelligence? Retrieved July 24, 2024 from https://aws.amazon.com/what-is/transformers-in-artificial-intelligence
[4] Dang Hoang Anh, Dinh-Truong Do, Vu Tran, and Nguyen Le Minh. 2023. The Impact of Large Language Modeling on Natural Language
Processing in Legal Texts: A Comprehensive Survey. In 2023 15th International Conference on Knowledge and Systems Engineering (KSE).
IEEE, Hanoi, Vietnam, 1–7. https://doi.org/10.1109/KSE59128.2023.10299488
[5] Arian Askari, Suzan Verberne, and Gabriella Pasi. 2022. Expert Finding in Legal Community Question Answering. In Advances in
Information Retrieval: 44th European Conference on IR Research (ECIR 2022), Matthias Hagen, Suzan Verberne, Craig Macdonald, Christin
Seifert, Krisztian Balog, Kjetil Nørvåg, and Vinay Setty (Eds.). Springer International Publishing, Berlin, Heidelberg, 22–30.
[6] Arian Askari, Zihui Yang, Zhaochun Ren, and Suzan Verberne. 2024. Answer Retrieval in Legal Community Question Answering. In
Advances in Information Retrieval: 46th European Conference on Information Retrieval (ECIR 2024), Nazli Goharian, Nicola Tonellotto,
Yulan He, Aldo Lipani, Graham McDonald, Craig Macdonald, and Iadh Ounis (Eds.). Springer Nature Switzerland, Berlin, Heidelberg,
477–485.
[7] Ting Wai Terence Au, Vasileios Lampos, and Ingemar Cox. 2022. E-NER — An Annotated Named Entity Recognition Corpus of Legal
Text. In Proceedings of the Natural Legal Language Processing Workshop 2022. Association for Computational Linguistics, Abu Dhabi,
United Arab Emirates (Hybrid), 246–255. https://doi.org/10.18653/v1/2022.nllp-1.22
[8] Purbid Bambroo and Aditi Awasthi. 2021. LegalDB: Long DistilBERT for Legal Document Classification. In 2021 International Conference
on Advances in Electrical, Computing, Communication and Sustainable Technologies (ICAECT). IEEE, 1–4. https://doi.org/10.1109/
ICAECT49130.2021.9392558
[9] Claire Barale, Mark Klaisoongnoen, Pasquale Minervini, Michael Rovatsos, and Nehal Bhuta. 2023. AsyLex: A Dataset for Legal
Language Processing of Refugee Claims. In Proceedings of the Natural Legal Language Processing Workshop 2023, Daniel Preotiuc-Pietro,
Catalina Goanta, Ilias Chalkidis, Leslie Barrett, Gerasimos Spanakis, and Nikolaos Aletras (Eds.). Association for Computational
Linguistics, Singapore, 244–257. https://doi.org/10.18653/v1/2023.nllp-1.24
[10] Iz Beltagy, Matthew E Peters, and Arman Cohan. 2020. Longformer: The long-document transformer. arXiv:2004.05150 [cs.CL]
[11] Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. 2021. On the Dangers of Stochastic Parrots: Can
Language Models Be Too Big?. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency (Virtual Event,
Canada) (FAccT ’21). Association for Computing Machinery, New York, NY, USA, 610–623. https://doi.org/10.1145/3442188.3445922
[12] Paheli Bhattacharya, Kaustubh Hiware, Subham Rajgaria, Nilay Pochhi, Kripabandhu Ghosh, and Saptarshi Ghosh. 2019. A Comparative
Study of Summarization Algorithms Applied to Legal Case Judgments. Advances in Information Retrieval (2019), 413–428.
[13] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam,
Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh,
Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin
Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are
few-shot learners. In Proceedings of the 34th International Conference on Neural Information Processing Systems (Vancouver, BC, Canada)
(NIPS ’20). Curran Associates Inc., Red Hook, NY, USA, Article 159, 25 pages.
[14] Marius Büttner and Ivan Habernal. 2024. Answering legal questions from laymen in German civil law system. In Proceedings of the 18th
Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), Yvette Graham and Matthew
Purver (Eds.). Association for Computational Linguistics, St. Julian’s, Malta, 2015–2027. https://aclanthology.org/2024.eacl-long.122
[15] Ilias Chalkidis, Ion Androutsopoulos, and Nikolaos Aletras. 2019. Neural Legal Judgment Prediction in English. In Proceedings of
the 57th Annual Meeting of the Association for Computational Linguistics, Anna Korhonen, David Traum, and Lluís Màrquez (Eds.).
Association for Computational Linguistics, Florence, Italy, 4317–4323. https://doi.org/10.18653/v1/P19-1424
[16] Ilias Chalkidis, Emmanouil Fergadiotis, Prodromos Malakasiotis, Nikolaos Aletras, and Ion Androutsopoulos. 2019. Extreme Multi-Label
Legal Text Classification: A Case Study in EU Legislation. In Proceedings of the Natural Legal Language Processing Workshop 2019,
Nikolaos Aletras, Elliott Ash, Leslie Barrett, Daniel Chen, Adam Meyers, Daniel Preotiuc-Pietro, David Rosenberg, and Amanda Stent
(Eds.). Association for Computational Linguistics, Minneapolis, Minnesota, 78–87. https://doi.org/10.18653/v1/W19-2209
[17] Ilias Chalkidis, Manos Fergadiotis, and Ion Androutsopoulos. 2021. MultiEURLEX - A multi-lingual and multi-label legal document
classification dataset for zero-shot cross-lingual transfer. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language
Processing, Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih (Eds.). Association for Computational
Linguistics, Online and Punta Cana, Dominican Republic, 6974–6996. https://doi.org/10.18653/v1/2021.emnlp-main.559
[18] Ilias Chalkidis, Manos Fergadiotis, Prodromos Malakasiotis, Nikolaos Aletras, and Ion Androutsopoulos. 2020. LEGAL-BERT: The
Muppets straight out of Law School. In Findings of the Association for Computational Linguistics: EMNLP 2020, Trevor Cohn, Yulan He,
and Yang Liu (Eds.). Association for Computational Linguistics, Online, 2898–2904. https://doi.org/10.18653/v1/2020.findings-emnlp.261
[19] Ilias Chalkidis, Abhik Jana, Dirk Hartung, Michael Bommarito, Ion Androutsopoulos, Daniel Katz, and Nikolaos Aletras. 2022. LexGLUE:
A Benchmark Dataset for Legal Language Understanding in English. In Proceedings of the 60th Annual Meeting of the Association for
Computational Linguistics (Volume 1: Long Papers), Smaranda Muresan, Preslav Nakov, and Aline Villavicencio (Eds.). Association for
[38] Pavlos Fragkogiannis, Martina Forster, Grace E. Lee, and Dell Zhang. 2023. Context-Aware Classification of Legal Document Pages.
In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval (Taipei, Taiwan)
(SIGIR ’23). Association for Computing Machinery, New York, NY, USA, 3285–3289. https://doi.org/10.1145/3539618.3591839
[39] Debasis Ganguly, Jack G. Conrad, Kripabandhu Ghosh, Saptarshi Ghosh, Pawan Goyal, Paheli Bhattacharya, Shubham Kumar Nigam,
and Shounak Paul. 2023. Legal IR and NLP: The History, Challenges, and State-of-the-Art. In European Conference on Information Retrieval
(ECIR) (Advances in Information Retrieval). Springer-Verlag, Berlin, Heidelberg, 331–340. https://doi.org/10.1007/978-3-031-28241-6_34
[40] Daphne Gelbart and JC Smith. 1991. Flexicon, a new legal information retrieval system. Can. L. Libr. 16 (1991), 9.
[41] Daphne Gelbart and J. C. Smith. 1991. Beyond boolean search: FLEXICON, a legal text-based intelligent system. In Proceedings of the 3rd
International Conference on Artificial Intelligence and Law (Oxford, England) (ICAIL ’91). Association for Computing Machinery, New
York, NY, USA, 225–234. https://doi.org/10.1145/112646.112674
[42] Joseph Gesnouin, Yannis Tannier, Christophe Gomes Da Silva, Hatim Tapory, Camille Brier, Hugo Simon, Raphael Rozenberg, Hermann
Woehrel, Mehdi El Yakaabi, Thomas Binder, Guillaume Marie, Emilie Caron, Mathile Nogueira, Thomas Fontas, Laure Puydebois,
Marie Theophile, Stephane Morandi, Mael Petit, David Creissac, Pauline Ennouchy, Elise Valetoux, Celine Visade, Severine Balloux,
Emmanuel Cortes, Pierre-Etienne Devineau, Ulrich Tan, Esther Mac Namara, and Su Yang. 2024. LLaMandement: Large Language
Models for Summarization of French Legislative Proposals. arXiv:2401.16182 [cs.CL]
[43] John Gibbons and M. Teresa Turell. 2008. Dimensions of Forensic Linguistics (1 ed.). AILA Applied Linguistics Series, Vol. 5. John
Benjamins Publishing Company, Netherlands. 1–317 pages.
[44] Randy Goebel, Yoshinobu Kano, Mi-Young Kim, Juliano Rabelo, Ken Satoh, and Masaharu Yoshioka. 2024. Overview and Discussion of
the Competition on Legal Information Extraction/Entailment (COLIEE) 2023. The Review of Socionetwork Strategies 18, 1 (2024), 27–47.
[45] S. Georgette Graham, Hamidreza Soltani, and Olufemi Isiaq. 2023. Natural language processing for legal document review: categorising
deontic modalities in contracts. Artificial Intelligence and Law (2023). https://doi.org/10.1007/s10506-023-09379-2
[46] Peter Henderson, Mark Krass, Lucia Zheng, Neel Guha, Christopher D Manning, Dan Jurafsky, and Daniel Ho. 2022. Pile of Law:
Learning Responsible Data Filtering from the Law and a 256GB Open-Source Legal Dataset. In Advances in Neural Information
Processing Systems, S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh (Eds.), Vol. 35. Curran Associates, Inc., Red
Hook, NY, USA, 29217–29234. https://proceedings.neurips.cc/paper_files/paper/2022/file/bc218a0c656e49d4b086975a9c785f47-Paper-
Datasets_and_Benchmarks.pdf
[47] Dan Hendrycks, Collin Burns, Anya Chen, and Spencer Ball. 2021. CUAD: An expert-annotated NLP dataset for legal contract review.
arXiv:2103.06268 [cs.CL]
[48] Weiyi Huang, Jiahao Jiang, Qiang Qu, and Min Yang. 2020. AILA: A Question Answering System in the Legal Domain. In Proceedings
of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI-20 (Yokohama, Yokohama, Japan) (IJCAI’20), Christian
Bessiere (Ed.). International Joint Conferences on Artificial Intelligence Organization, Article 762, 3 pages. https://doi.org/10.24963/
ijcai.2020/762
[49] Deepali Jain, Malaya Dutta Borah, and Anupam Biswas. 2024. A sentence is known by the company it keeps: Improving Legal Document
Summarization Using Deep Clustering. Artificial Intelligence and Law 32, 1 (2024), 165–200.
[50] Samyar Janatian, Hannes Westermann, Jinzhe Tan, Jaromir Savelka, and Karim Benyekhlef. 2023. From Text to Structure: Using Large
Language Models to Support the Development of Legal Expert Systems. arXiv:2311.04911 [cs.CL]
[51] Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand,
Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. 2023. Mistral 7B. arXiv:2310.06825 [cs.CL]
[52] Hang Jiang, Xiajie Zhang, Robert Mahari, Daniel Kessler, Eric Ma, Tal August, Irene Li, Alex ’Sandy’ Pentland, Yoon Kim, Jad Kabbara, and
Deb Roy. 2024. Leveraging Large Language Models for Learning Complex Legal Concepts through Storytelling. arXiv:2402.17019 [cs.CL]
[53] Lukasz Kaiser, Aidan N Gomez, Noam Shazeer, Ashish Vaswani, Niki Parmar, Llion Jones, and Jakob Uszkoreit. 2017. One model to
learn them all. arXiv:1706.05137 [cs.LG]
[54] Prathamesh Kalamkar, Astha Agarwal, Aman Tiwari, Smita Gupta, Saurabh Karn, and Vivek Raghavan. 2022. Named Entity Recognition
in Indian court judgments. In Proceedings of the Natural Legal Language Processing Workshop 2022. Association for Computational
Linguistics, Abu Dhabi, United Arab Emirates (Hybrid), 184–193. https://doi.org/10.18653/v1/2022.nllp-1.15
[55] Ambedkar Kanapala, Sukomal Pal, and Rajendra Pamula. 2019. Text summarization from legal documents: a survey. Artificial Intelligence
Review 51 (2019), 371–402.
[56] Daniel Martin Katz, Michael James Bommarito, Shang Gao, and Pablo Arredondo. 2024. GPT-4 passes the bar exam. Philosophical
Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences 382, 2270 (2024), 20230254. https://doi.org/10.1098/
rsta.2023.0254
[57] Soha Khazaeli, Janardhana Punuru, Chad Morris, Sanjay Sharma, Bert Staub, Michael Cole, Sunny Chiu-Webster, and Dhruv Sakalley.
2021. A Free Format Legal Question Answering System. In Proceedings of the Natural Legal Language Processing Workshop 2021,
Nikolaos Aletras, Ion Androutsopoulos, Leslie Barrett, Catalina Goanta, and Daniel Preotiuc-Pietro (Eds.). Association for Computational
Linguistics, Punta Cana, Dominican Republic, 107–113. https://doi.org/10.18653/v1/2021.nllp-1.11
ACM Comput. Surv., Vol. 1, No. 1, Article . Publication date: October 2024.
Natural Language Processing for the Legal Domain: A Survey of Tasks, Datasets, Models, and Challenges • 31
[58] Panteleimon Krasadakis, Evangelos Sakkopoulos, and Vassilios S. Verykios. 2024. A Survey on Challenges and Advances in Natural
Language Processing with a Focus on Legal Informatics and Low-Resource Languages. Electronics 13, 3 (2024). https://doi.org/10.3390/
electronics13030648
[59] Amirhossein Layegh, Amir H. Payberah, Ahmet Soylu, Dumitru Roman, and Mihhail Matskin. 2023. ContrastNER: Contrastive-based
Prompt Tuning for Few-shot NER. In 2023 IEEE 47th Annual Computers, Software, and Applications Conference (COMPSAC). IEEE,
241–249. https://doi.org/10.1109/COMPSAC57700.2023.00038
[60] Jihoon Lee and Hyukjoon Lee. 2019. A Comparison Study on Legal Document Classification Using Deep Neural Networks. In
2019 International Conference on Information and Communication Technology Convergence (ICTC). 926–928. https://doi.org/10.1109/
ICTC46691.2019.8939926
[61] Elena Leitner, Georg Rehm, and Julian Moreno-Schneider. 2020. A Dataset of German Legal Documents for Named Entity Recognition.
In Proceedings of the Twelfth Language Resources and Evaluation Conference, Nicoletta Calzolari, Frédéric Béchet, Philippe Blache,
Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo,
Asuncion Moreno, Jan Odijk, and Stelios Piperidis (Eds.). European Language Resources Association, Marseille, France, 4478–4485.
https://aclanthology.org/2020.lrec-1.551
[62] LexisNexis. [n. d.]. International Legal Generative AI Report. Retrieved July 22, 2024 from https://www.lexisnexis.com/community/
pressroom/b/news/posts/lexisnexis-international-legal-generative-ai-survey-shows-nearly-half-of-the-legal-profession-believe-
generative-ai-will-transform-the-practice-of-law
[63] Jonathan Li, Rohan Bhambhoria, and Xiaodan Zhu. 2022. Parameter-Efficient Legal Domain Adaptation. In Proceedings of the Natural
Legal Language Processing Workshop 2022. Association for Computational Linguistics, Abu Dhabi, United Arab Emirates (Hybrid),
119–129. https://doi.org/10.18653/v1/2022.nllp-1.10
[64] Jing Li, Aixin Sun, Jianglei Han, and Chenliang Li. 2022. A Survey on Deep Learning for Named Entity Recognition. IEEE Transactions
on Knowledge and Data Engineering 34, 1 (Jan 2022), 50–70. https://doi.org/10.1109/TKDE.2020.2981314
[65] Junyi Li, Tianyi Tang, Wayne Xin Zhao, Jian-Yun Nie, and Ji-Rong Wen. 2024. Pre-Trained Language Models for Text Generation: A
Survey. ACM Comput. Surv. 56, 9 (Apr 2024), 1–39. https://doi.org/10.1145/3649449
[66] Yanling Li, Jiaye Wu, and Xudong Luo. 2024. BERT-CNN based evidence retrieval and aggregation for Chinese legal multi-choice
question answering. Neural Computing and Applications 36, 11 (2024), 5909–5925. https://doi.org/10.1007/s00521-023-09380-5
[67] Marco Lippi, Przemysław Pałka, Giuseppe Contissa, Francesca Lagioia, Hans-Wolfgang Micklitz, Giovanni Sartor, and Paolo Torroni.
2019. CLAUDETTE: an automated detector of potentially unfair clauses in online terms of service. Artificial Intelligence and Law 27
(2019), 117–139.
[68] Shuaiqi Liu, Jiannong Cao, Yicong Li, Ruosong Yang, and Zhiyuan Wen. 2024. Low-resource court judgment summarization for
common law systems. Information Processing and Management 61, 5 (2024), 103796. https://doi.org/10.1016/j.ipm.2024.103796
[69] Yang Liu. 2019. Fine-tune BERT for extractive summarization. arXiv:1903.10318 [cs.CL]
[70] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin
Stoyanov. 2019. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv:1907.11692 [cs.CL]
[71] Yifei Liu, Yiquan Wu, Yating Zhang, Changlong Sun, Weiming Lu, Fei Wu, and Kun Kuang. 2023. ML-LJP: Multi-Law Aware Legal
Judgment Prediction. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval
(Taipei, Taiwan) (SIGIR ’23). Association for Computing Machinery, New York, NY, USA, 1023–1034. https://doi.org/10.1145/3539618.
3591731
[72] Antoine Louis, Gijs van Dijck, and Gerasimos Spanakis. 2024. Interpretable Long-Form Legal Question Answering with Retrieval-
Augmented Large Language Models. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 38. AAAI Press, 22266–22275.
https://doi.org/10.1609/aaai.v38i20.30232
[73] Bingfeng Luo, Yansong Feng, Jianbo Xu, Xiang Zhang, and Dongyan Zhao. 2017. Learning to Predict Charges for Criminal Cases with
Legal Basis. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Martha Palmer, Rebecca Hwa, and
Sebastian Riedel (Eds.). Association for Computational Linguistics, Copenhagen, Denmark, 2727–2736. https://doi.org/10.18653/v1/D17-
1289
[74] Luyao Ma, Yating Zhang, Tianyi Wang, Xiaozhong Liu, Wei Ye, Changlong Sun, and Shikun Zhang. 2021. Legal Judgment Prediction
with Multi-Stage Case Representation Learning in the Real Court Setting. In Proceedings of the 44th International ACM SIGIR Conference
on Research and Development in Information Retrieval (Virtual Event, Canada) (SIGIR ’21). Association for Computing Machinery, New
York, NY, USA, 993–1002. https://doi.org/10.1145/3404835.3462945
[75] Dimitris Mamakas, Petros Tsotsi, Ion Androutsopoulos, and Ilias Chalkidis. 2022. Processing Long Legal Documents with Pre-trained
Transformers: Modding LegalBERT and Longformer. In Proceedings of the Natural Legal Language Processing Workshop 2022, Nikolaos
Aletras, Ilias Chalkidis, Leslie Barrett, Cătălina Goanță, and Daniel Preoțiuc-Pietro (Eds.). Association for Computational Linguistics,
Abu Dhabi, United Arab Emirates (Hybrid), 130–142. https://doi.org/10.18653/v1/2022.nllp-1.11
[76] Sepideh Mamooler, Rémi Lebret, Stephane Massonnet, and Karl Aberer. 2022. An Efficient Active Learning Pipeline for Legal Text
Classification. In Proceedings of the Natural Legal Language Processing Workshop 2022, Nikolaos Aletras, Ilias Chalkidis, Leslie Barrett,
Cătălina Goanță, and Daniel Preoțiuc-Pietro (Eds.). Association for Computational Linguistics, Abu Dhabi, United Arab Emirates
(Hybrid), 345–358. https://doi.org/10.18653/v1/2022.nllp-1.32
[77] Stelios Maroudas, Sotiris Legkas, Prodromos Malakasiotis, and Ilias Chalkidis. 2022. Legal-Tech Open Diaries: Lesson learned on how
to develop and deploy light-weight models in the era of humongous Language Models. In Proceedings of the Natural Legal Language
Processing Workshop 2022, Nikolaos Aletras, Ilias Chalkidis, Leslie Barrett, Cătălina Goanță, and Daniel Preoțiuc-Pietro (Eds.). Association
for Computational Linguistics, Abu Dhabi, United Arab Emirates (Hybrid), 88–110. https://doi.org/10.18653/v1/2022.nllp-1.8
[78] Suzanne McGee. [n. d.]. Generative AI and the Law. Retrieved July 22, 2024 from https://www.lexisnexis.com/html/lexisnexis-
generative-ai-story
[79] Masha Medvedeva, Michel Vols, and Martijn Wieling. 2020. Using machine learning to predict decisions of the European Court of
Human Rights. Artificial Intelligence and Law 28, 2 (2020), 237–266. https://doi.org/10.1007/s10506-019-09255-y
[80] Marie-Francine Moens, Caroline Uyttendaele, and Jos Dumortier. 1999. Abstracting of legal cases: the potential of clustering based on
the selection of representative objects. Journal of the American Society for Information Science 50, 2 (1999), 151–161.
[81] Laurens Mommers. 2010. Ontologies in the Legal Domain. Springer Netherlands, Dordrecht, 265–276. https://doi.org/10.1007/978-90-
481-8845-1_12
[82] Gianluca Moro, Nicola Piscaglia, Luca Ragazzi, and Paolo Italiani. 2023. Multi-language transfer learning for low-resource legal case
summarization. Artificial Intelligence and Law (2023). https://doi.org/10.1007/s10506-023-09373-8
[83] Duy-Hung Nguyen, Bao-Sinh Nguyen, Nguyen Viet Dung Nghiem, Dung Tien Le, Mim Amina Khatun, Minh-Tien Nguyen, and Hung
Le. 2021. Robust Deep Reinforcement Learning for Extractive Legal Summarization. In Neural Information Processing, Teddy Mantoro,
Minho Lee, Media Anugerah Ayu, Kok Wai Wong, and Achmad Nizar Hidayanto (Eds.). Springer International Publishing, Cham,
597–604.
[84] Ha-Thanh Nguyen, Manh-Kien Phi, Xuan-Bach Ngo, Vu Tran, Le-Minh Nguyen, and Minh-Phuong Tu. 2024. Attentive deep neural
networks for legal document retrieval. Artificial Intelligence and Law 32, 1 (2024), 57–86. https://doi.org/10.1007/s10506-022-09341-8
[85] Joel Niklaus, Ilias Chalkidis, and Matthias Stürmer. 2021. Swiss-Judgment-Prediction: A Multilingual Legal Judgment Prediction
Benchmark. In Proceedings of the Natural Legal Language Processing Workshop 2021, Nikolaos Aletras, Ion Androutsopoulos, Leslie
Barrett, Catalina Goanta, and Daniel Preotiuc-Pietro (Eds.). Association for Computational Linguistics, Punta Cana, Dominican Republic,
19–35. https://doi.org/10.18653/v1/2021.nllp-1.3
[86] Joel Niklaus, Veton Matoshi, Pooja Rani, Andrea Galassi, Matthias Stürmer, and Ilias Chalkidis. 2023. LEXTREME: A Multi-Lingual
and Multi-Task Benchmark for the Legal Domain. In Findings of the Association for Computational Linguistics: EMNLP 2023, Houda
Bouamor, Juan Pino, and Kalika Bali (Eds.). Association for Computational Linguistics, Singapore, 3016–3054. https://doi.org/10.18653/
v1/2023.findings-emnlp.200
[87] Joel Niklaus, Veton Matoshi, Matthias Stürmer, Ilias Chalkidis, and Daniel E. Ho. 2024. MultiLegalPile: A 689GB Multilingual Legal
Corpus. arXiv:2306.02069 [cs.CL]
[88] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina
Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F
Christiano, Jan Leike, and Ryan Lowe. 2022. Training language models to follow instructions with human feedback. In Advances in Neural
Information Processing Systems, S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh (Eds.), Vol. 35. Curran Associates, Inc.,
27730–27744. https://proceedings.neurips.cc/paper_files/paper/2022/file/b1efde53be364a73914f58805a001731-Paper-Conference.pdf
[89] Matthew J. Page, Joanne E. McKenzie, Patrick M. Bossuyt, Isabelle Boutron, Tammy C. Hoffmann, Cynthia D. Mulrow, Larissa Shamseer,
Jennifer M. Tetzlaff, Elie A. Akl, Sue E. Brennan, Roger Chou, Julie Glanville, Jeremy M. Grimshaw, Asbjørn Hróbjartsson, Manoj M.
Lalu, Tianjing Li, Elizabeth W. Loder, Evan Mayo-Wilson, Steve McDonald, Luke A. McGuinness, Lesley A. Stewart, James Thomas,
Andrea C. Tricco, Vivian A. Welch, Penny Whiting, and David Moher. 2021. The PRISMA 2020 statement: an updated guideline for
reporting systematic reviews. Systematic Reviews 10, 1 (2021), 89. https://doi.org/10.1186/s13643-021-01626-4
[90] Christos Papaloukas, Ilias Chalkidis, Konstantinos Athinaios, Despina Pantazi, and Manolis Koubarakis. 2021. Multi-granular Legal
Topic Classification on Greek Legislation. In Proceedings of the Natural Legal Language Processing Workshop 2021, Nikolaos Aletras, Ion
Androutsopoulos, Leslie Barrett, Catalina Goanta, and Daniel Preotiuc-Pietro (Eds.). Association for Computational Linguistics, Punta
Cana, Dominican Republic, 63–75. https://doi.org/10.18653/v1/2021.nllp-1.6
[91] Sungmi Park and Joshua I. James. 2023. Lessons learned building a legal inference dataset. Artificial Intelligence and Law (2023).
https://doi.org/10.1007/s10506-023-09370-x
[92] Seth Polsley, Pooja Jhunjhunwala, and Ruihong Huang. 2016. CaseSummarizer: A System for Automated Summarization of Legal Texts.
In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: System Demonstrations, Hideo Watanabe
(Ed.). The COLING 2016 Organizing Committee, Osaka, Japan, 258–262. https://aclanthology.org/C16-2054
[93] Thiago Dal Pont, Federico Galli, Andrea Loreggia, Giuseppe Pisano, Riccardo Rovatti, and Giovanni Sartor. 2023. Legal Summarisation
through LLMs: The PRODIGIT Project. arXiv:2308.04416 [cs.CL]
[94] Vasile Păiș, Maria Mitrofan, Carol Luca Gasan, Vlad Coneschi, and Alexandru Ianov. 2021. Named Entity Recognition in the Romanian
Legal Domain. In Proceedings of the Natural Legal Language Processing Workshop 2021, Nikolaos Aletras, Ion Androutsopoulos, Leslie
Barrett, Catalina Goanta, and Daniel Preotiuc-Pietro (Eds.). Association for Computational Linguistics, Punta Cana, Dominican Republic,
9–18. https://doi.org/10.18653/v1/2021.nllp-1.2
[95] Juliano Rabelo, Randy Goebel, Mi-Young Kim, Yoshinobu Kano, Masaharu Yoshioka, and Ken Satoh. 2022. Overview and discussion of
the competition on legal information extraction/entailment (COLIEE) 2021. The Review of Socionetwork Strategies 16, 1 (2022), 111–133.
[96] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu.
2020. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res. 21, 1, Article 140 (Jan 2020),
67 pages.
[97] Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2020. DistilBERT, a distilled version of BERT: smaller, faster, cheaper
and lighter. arXiv:1910.01108 [cs.CL]
[98] Jaromir Savelka, Kevin D. Ashley, Morgan A. Gray, Hannes Westermann, and Huihui Xu. 2023. Explaining Legal Concepts with
Augmented Large Language Models (GPT-4). arXiv:2306.09525 [cs.CL]
[99] Marijn Schraagen, Floris Bex, Nick Van De Luijtgaarden, and Daniël Prijs. 2022. Abstractive Summarization of Dutch Court Verdicts
Using Sequence-to-sequence Models. In Proceedings of the Natural Legal Language Processing Workshop 2022, Nikolaos Aletras, Ilias
Chalkidis, Leslie Barrett, Cătălina Goanță, and Daniel Preoțiuc-Pietro (Eds.). Association for Computational Linguistics, Abu Dhabi,
United Arab Emirates (Hybrid), 76–87. https://doi.org/10.18653/v1/2022.nllp-1.7
[100] Gil Semo, Dor Bernsohn, Ben Hagag, Gila Hayat, and Joel Niklaus. 2022. ClassActionPrediction: A Challenging Benchmark for Legal
Judgment Prediction of Class Action Cases in the US. In Proceedings of the Natural Legal Language Processing Workshop 2022, Nikolaos
Aletras, Ilias Chalkidis, Leslie Barrett, Cătălina Goanță, and Daniel Preoțiuc-Pietro (Eds.). Association for Computational Linguistics,
Abu Dhabi, United Arab Emirates (Hybrid), 31–46. https://doi.org/10.18653/v1/2022.nllp-1.3
[101] Zein Shaheen, Gerhard Wohlgenannt, and Erwin Filtz. 2020. Large scale legal text classification using transformer models.
arXiv:2010.12871 [cs.CL]
[102] Zejiang Shen, Kyle Lo, Lauren Yu, Nathan Dahlberg, Margo Schlanger, and Doug Downey. 2022. Multi-LexSum: Real-world Summaries
of Civil Rights Lawsuits at Multiple Granularities. In Advances in Neural Information Processing Systems, S. Koyejo, S. Mohamed,
A. Agarwal, D. Belgrave, K. Cho, and A. Oh (Eds.), Vol. 35. Curran Associates, Inc., 13158–13173. https://proceedings.neurips.cc/paper_
files/paper/2022/file/552ef803bef9368c29e53c167de34b55-Paper-Datasets_and_Benchmarks.pdf
[103] Emily Sheng, Kai-Wei Chang, Premkumar Natarajan, and Nanyun Peng. 2019. The Woman Worked as a Babysitter: On Biases in
Language Generation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International
Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Kentaro Inui, Jing Jiang, Vincent Ng, and Xiaojun Wan (Eds.).
Association for Computational Linguistics, Hong Kong, China, 3407–3412. https://doi.org/10.18653/v1/D19-1339
[104] Juanming Shi, Qinglang Guo, Yong Liao, Yuxing Wang, Shijia Chen, and Shenglin Liang. 2024. Legal-LM: Knowledge Graph Enhanced
Large Language Models for Law Consulting. In Advanced Intelligent Computing Technology and Applications, De-Shuang Huang,
Zhanjun Si, and Chuanlei Zhang (Eds.). Springer Nature Singapore, Singapore, 175–186.
[105] Răzvan-Alexandru Smădu, Ion-Robert Dinică, Andrei-Marius Avram, Dumitru-Clementin Cercel, Florin Pop, and Mihaela-Claudia
Cercel. 2022. Legal Named Entity Recognition with Multi-Task Domain Adaptation. In Proceedings of the Natural Legal Language
Processing Workshop 2022, Nikolaos Aletras, Ilias Chalkidis, Leslie Barrett, Cătălina Goanță, and Daniel Preoțiuc-Pietro (Eds.). Association
for Computational Linguistics, Abu Dhabi, United Arab Emirates (Hybrid), 305–321. https://doi.org/10.18653/v1/2022.nllp-1.29
[106] Dezhao Song, Andrew Vold, Kanika Madan, and Frank Schilder. 2022. Multi-label legal document classification: A deep learning-based
approach with label-attention and domain-specific pre-training. Information Systems 106 (2022), 101718. https://doi.org/10.1016/j.is.
2021.101718
[107] Francesco Sovrano, Monica Palmirani, Biagio Distefano, Salvatore Sapienza, and Fabio Vitali. 2021. A dataset for evaluating legal question
answering on private international law. In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Law (São
Paulo, Brazil) (ICAIL ’21). Association for Computing Machinery, New York, NY, USA, 230–234. https://doi.org/10.1145/3462757.3466094
[108] Francesco Sovrano, Monica Palmirani, Salvatore Sapienza, and Vittoria Pistone. 2024. DiscoLQA: zero-shot discourse-based legal
question answering on European Legislation. Artificial Intelligence and Law (2024). https://doi.org/10.1007/s10506-023-09387-2
[109] Francesco Sovrano, Monica Palmirani, and Fabio Vitali. 2020. Legal knowledge extraction for knowledge graph based question-
answering. In Legal knowledge and information systems. IOS Press, 143–153. https://doi.org/10.3233/FAIA200858
[110] Ralf Steinberger, Bruno Pouliquen, Mijail Kabadjov, Jenya Belyaeva, and Erik van der Goot. 2011. JRC-NAMES: A Freely Available,
Highly Multilingual Named Entity Resource. In Proceedings of the International Conference Recent Advances in Natural Language
Processing 2011, Ruslan Mitkov and Galia Angelova (Eds.). Association for Computational Linguistics, Hissar, Bulgaria, 104–110.
https://aclanthology.org/R11-1015
[111] Benjamin Strickson and Beatriz De La Iglesia. 2020. Legal Judgement Prediction for UK Courts. In Proceedings of the 3rd International
Conference on Information Science and Systems (Cambridge, United Kingdom) (ICISS ’20). Association for Computing Machinery, New
York, NY, USA, 204–209. https://doi.org/10.1145/3388176.3388183
[112] Zhongxiang Sun. 2023. A Short Survey of Viewing Large Language Models in Legal Aspect. arXiv:2303.09136 [cs.CL]
[113] Alex Tamkin, Miles Brundage, Jack Clark, and Deep Ganguli. 2021. Understanding the Capabilities, Limitations, and Societal Impact of
Large Language Models. arXiv:2102.02503 [cs.CL]
[114] Doron Teichman, Eyal Zamir, and Ilana Ritov. 2023. Biases in legal decision-making: Comparing prosecutors, defense attorneys, law
students, and laypersons. Journal of Empirical Legal Studies 20, 4 (2023), 852–894.
[115] Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 Shared Task: Language-Independent Named
Entity Recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003. 142–147. https:
//www.aclweb.org/anthology/W03-0419
[116] Suxin Tong, Jingling Yuan, Peiliang Zhang, and Lin Li. 2024. Legal Judgment Prediction via graph boosting with constraints. Information
Processing & Management 61, 3 (2024), 103663. https://doi.org/10.1016/j.ipm.2024.103663
[117] Don Tuggener, Pius von Däniken, Thomas Peetz, and Mark Cieliebak. 2020. LEDGAR: A Large-Scale Multi-label Corpus for Text
Classification of Legal Provisions in Contracts. In Proceedings of the Twelfth Language Resources and Evaluation Conference, Nicoletta
Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente
Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, and Stelios Piperidis (Eds.). European Language Resources
Association, Marseille, France, 1235–1241. https://aclanthology.org/2020.lrec-1.155
[118] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin.
2017. Attention is All you Need. In Advances in Neural Information Processing Systems (Long Beach, California, USA) (NIPS’17, Vol. 30),
I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.). Curran Associates, Inc., Red Hook,
NY, USA, 6000–6010. https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf
[119] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. 2018. Graph attention networks.
arXiv:1710.10903 [stat.ML]
[120] Daniela Vianna, Edleno Silva de Moura, and Altigran Soares da Silva. 2023. A topic discovery approach for unsupervised organization
of legal document collections. Artificial Intelligence and Law (2023). https://doi.org/10.1007/s10506-023-09371-w
[121] Qiqi Wang, Kaiqi Zhao, Robert Amor, Benjamin Liu, and Ruofan Wang. 2022. D2GCLF: Document-to-Graph Classifier for Legal
Document Classification. In Findings of the Association for Computational Linguistics: NAACL 2022, Marine Carpuat, Marie-Catherine
de Marneffe, and Ivan Vladimir Meza Ruiz (Eds.). Association for Computational Linguistics, Seattle, United States, 2208–2221.
https://doi.org/10.18653/v1/2022.findings-naacl.170
[122] Fusheng Wei, Han Qin, Shi Ye, and Haozhen Zhao. 2018. Empirical Study of Deep Learning for Text Classification in Legal Document
Review. In 2018 IEEE International Conference on Big Data (Big Data). IEEE, 3317–3320. https://doi.org/10.1109/BigData.2018.8622157
[123] Jason Wei and Kai Zou. 2019. EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks. In
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference
on Natural Language Processing (EMNLP-IJCNLP), Kentaro Inui, Jing Jiang, Vincent Ng, and Xiaojun Wan (Eds.). Association for
Computational Linguistics, Hong Kong, China, 6382–6388. https://doi.org/10.18653/v1/D19-1670
[124] Westlaw. [n. d.]. Westlaw. Retrieved May 23, 2024 from https://anzlaw.thomsonreuters.com/Browse/Home/Australia160?comp=wlau&
__lrTS=20240523040153004&transitionType=Default&contextData=(sc.Default)
[125] Yiquan Wu, Yifei Liu, Weiming Lu, Yating Zhang, Jun Feng, Changlong Sun, Fei Wu, and Kun Kuang. 2022. Towards Interactivity and
Interpretability: A Rationale-based Legal Judgment Prediction Framework. In Proceedings of the 2022 Conference on Empirical Methods
in Natural Language Processing, Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang (Eds.). Association for Computational Linguistics,
Abu Dhabi, United Arab Emirates, 4787–4799. https://doi.org/10.18653/v1/2022.emnlp-main.316
[126] Chaojun Xiao, Xueyu Hu, Zhiyuan Liu, Cunchao Tu, and Maosong Sun. 2021. Lawformer: A pre-trained language model for Chinese
legal long documents. AI Open 2 (2021), 79–84. https://doi.org/10.1016/j.aiopen.2021.06.003
[127] Chaojun Xiao, Haoxi Zhong, Zhipeng Guo, Cunchao Tu, Zhiyuan Liu, Maosong Sun, Yansong Feng, Xianpei Han, Zhen Hu, Heng
Wang, and Jianfeng Xu. 2018. CAIL2018: A Large-Scale Legal Dataset for Judgment Prediction. arXiv:1807.02478 [cs.CL] https:
//arxiv.org/abs/1807.02478
[128] Nuo Xu, Pinghui Wang, Long Chen, Li Pan, Xiaoyan Wang, and Junzhou Zhao. 2020. Distinguish Confusing Law Articles for Legal
Judgment Prediction. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Dan Jurafsky, Joyce
Chai, Natalie Schluter, and Joel Tetreault (Eds.). Association for Computational Linguistics, Online, 3086–3095. https://doi.org/10.
18653/v1/2020.acl-main.280
[129] Wenmian Yang, Weijia Jia, Xiaojie Zhou, and Yutao Luo. 2019. Legal judgment prediction via multi-perspective bi-feedback network.
In Proceedings of the 28th International Joint Conference on Artificial Intelligence (Macao, China) (IJCAI’19). AAAI Press, 4085–4091.
[130] Hai Ye, Xin Jiang, Zhunchen Luo, and Wenhan Chao. 2018. Interpretable Charge Predictions for Criminal Cases: Learning to Generate
Court Views from Fact Descriptions. In Proceedings of the 2018 Conference of the North American Chapter of the Association for
Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), Marilyn Walker, Heng Ji, and Amanda Stent (Eds.).
Association for Computational Linguistics, New Orleans, Louisiana, 1854–1864. https://doi.org/10.18653/v1/N18-1168
[131] Mingruo Yuan, Ben Kao, Tien-Hsuan Wu, Michael M. K. Cheung, Henry W. H. Chan, Anne S. Y. Cheung, Felix W. H. Chan, and Yongxi
Chen. 2023. Bringing legal knowledge to the public by constructing a legal question bank using large-scale pre-trained language model.