Natural Language Processing for the Legal Domain: A Survey of Tasks, Datasets, Models, and Challenges

FARID ARIAI and GIANLUCA DEMARTINI, The University of Queensland, Australia
arXiv:2410.21306v1 [cs.CL] 25 Oct 2024

Natural Language Processing is revolutionizing the way legal professionals and laypersons operate in the legal field. The
considerable potential for Natural Language Processing in the legal sector, especially in developing computational tools
for various legal processes, has captured the interest of researchers for years. This survey follows the Preferred Reporting
Items for Systematic Reviews and Meta-Analyses framework [89], reviewing 148 studies, with a final selection of 127 after
manual filtering. It explores foundational concepts related to Natural Language Processing in the legal domain, illustrating
the unique aspects and challenges of processing legal texts, such as extensive document length, complex language, and
limited open legal datasets. We provide an overview of Natural Language Processing tasks specific to legal text, such as Legal
Document Summarization, legal Named Entity Recognition, Legal Question Answering, Legal Text Classification, and Legal
Judgment Prediction. In the section on legal Language Models, we analyze both developed Language Models and approaches
for adapting general Language Models to the legal domain. Additionally, we identify 15 Open Research Challenges, including
bias in Artificial Intelligence applications, the need for more robust and interpretable models, and improving explainability to
handle the complexities of legal language and reasoning.
CCS Concepts: • General and reference → Surveys and overviews; • Applied computing → Law; • Computing
methodologies → Natural language processing; Natural language generation; Knowledge representation and reasoning;
Supervised learning; Unsupervised learning; Reinforcement learning; Multi-task learning; Machine learning approaches;
Artificial intelligence; • Information systems → Language models; Retrieval tasks and goals; Question answering;
Clustering and classification; Summarization.
ACM Reference Format:
Farid Ariai and Gianluca Demartini. 2024. Natural Language Processing for the Legal Domain: A Survey of Tasks, Datasets,
Models, and Challenges. ACM Comput. Surv. 1, 1 (October 2024), 35 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn

Authors’ Contact Information: Farid Ariai, [email protected]; Gianluca Demartini, [email protected], The University of Queensland,
Brisbane, Queensland, Australia.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that
copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page.
Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy
otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from
[email protected].
© 2024 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ACM 1557-7341/2024/10-ART
https://doi.org/10.1145/nnnnnnn.nnnnnnn

ACM Comput. Surv., Vol. 1, No. 1, Article . Publication date: October 2024.

1 INTRODUCTION
In recent years, advancements in Natural Language Processing (NLP) have significantly impacted the legal domain
by simplifying complex tasks such as Legal Document Summarization (LDS), enhancing legal text comprehension
for laypersons, and improving Legal Question Answering (LQA) and Legal Judgment Prediction (LJP) [24, 42, 50,
52, 63, 93, 98]. These improvements are primarily attributed to advancements in neural network architectures,
such as transformer models [118]. NLP techniques now enable machines to generate text, answer legal questions,
draft regulations, and simulate legal reasoning, which can revolutionize legal practices [50]. Applications like
contract review [45, 76, 77, 117] and case prediction [85, 120] have been automated to a large extent, speeding
up processes, reducing human error, and cutting operational costs. Additionally, the use of NLP allows lawyers
and legal professionals to reduce their workload, enhance efficiency, and minimize errors in decision-making
processes [98]. Despite the rapid development of NLP, challenges remain in processing lengthy documents,
understanding complex language, and navigating complicated document structures [12, 39, 52, 84, 107, 120, 132],
yet Large Language Models (LLMs) promise to enhance the efficiency, accessibility, and precision of legal
services.
Despite these advantages, the integration of NLP in the legal domain is not without challenges, such as bias,
unfairness, and explainability issues [28, 103, 113]. The use of Artificial Intelligence (AI) in legal applications
must follow strict standards of accuracy, fairness, and transparency, given the potential impact on clients’ lives
and rights.
This survey article explores the current landscape of NLP applications within the legal domain. It discusses its
potential benefits and the practical challenges it poses. NLP is a broad field covering a wide range of techniques
for processing, analyzing, and understanding human language. By examining the latest advancements and
applications of NLP in law, this article provides a comprehensive overview of the field. Figure 1 summarizes the
scope of the survey and categorizes the research into several areas: LQA, LJP, Legal Text Classification (LTC),
LDS, legal Named Entity Recognition (NER), legal corpora, and legal Language Models (LMs). Each category lists
relevant projects and papers and shows the work being done in each sub-field. Notably, there is comparatively
less research in NER and legal corpora, whereas LDS and LQA have seen extensive research activity, with a
substantial number of datasets and research contributions. This summary provides an overview of how NLP
techniques are applied to various challenges in the legal domain and offers insights into future directions of AI in
legal practice.
To provide a comprehensive understanding of the existing research on integrating AI within the legal domain,
we present an overview of recent literature reviews, as summarized in Table 1. Most survey papers on intelligent
legal systems focus either on traditional NLP technologies for specific tasks like LJP and LDS or take a broader
approach but still miss certain aspects. As illustrated in Table 1, there seems to be a gap in survey papers that
thoroughly examine all facets of this multidisciplinary field. Our current work aims to bridge this gap by offering
a comprehensive survey of all NLP tasks, existing datasets and corpora, and LMs in the legal domain. In Table 1, we use ✓
to indicate papers that cover most of the existing research on a subject in legal NLP. Papers that do not
address a subject receive ✗, and those that only partially cover a subject, along with its datasets or LMs, are
marked with –.
The main difference between this work and previous surveys is that this survey aims to provide a more general
description of all aspects of NLP tasks in the legal domain, rather than focusing solely on specific applications.
The main contributions of this survey are summarized as follows: (1) This article extends beyond previous
surveys by examining a broad spectrum of studies and applications of legal NLP. By discussing datasets and
large corpora in 24 languages and exploring the popular legal LMs, this survey establishes itself as an important
resource in the field of legal NLP; (2) The survey offers an in-depth look at the challenges of integrating NLP with
legal applications, with detailed discussions on technical solutions that tackle these issues, thereby enhancing
understanding and encouraging further research in this evolving field; (3) This survey also highlights the existing
research gaps in legal NLP, identifying areas that require further exploration and development, and providing a
road-map for future research efforts in the legal NLP domain.
This document is organized as follows. In Section 2, Background and Foundational Concepts, we provide a
detailed overview of legal language and the basic principles of NLP as they apply to the legal domain. In Section
3, we briefly explain the research methodology of this work and how we extracted the resources. In Section 4, we
explore various NLP tasks that are tailored for legal applications and show their unique requirements and the
methodologies employed to address them. These specialized tasks leverage advanced NLP techniques to process,
analyze, and extract meaningful information from legal texts, thereby facilitating more efficient and accurate legal
research and decision-making. Additionally, we delve into the datasets available for training and evaluating legal
NLP tasks, emphasizing their characteristics and the implications they have for model performance. Following this,
in Section 5, we explore the development of LMs that have been specifically adapted for the legal field. Finally, in
Section 6, we address the key challenges associated with deploying NLP technologies in legal settings, discussing
both current issues and potential solutions. Since this survey contains many acronyms, Table 2 provides the list
of acronyms and their meanings to make it easier to follow.

Table 1. Comparison of existing surveys with this work, which shows the covered topics of each survey.

Reference | Legal Dataset | NLP Tasks in Legal | Legal LM | Large Legal Corpus | Published year | Description
Dias et al. [30] | – | – | – | ✗ | 2022 | In this work, the researchers did not concentrate only on the legal domain. Instead, they elaborated on foundational concepts of AI and NLP. They briefly explored the applications of NLP within the legal field but did not delve deeply into legal datasets, specific NLP tasks in the legal domain, or Legal LLMs. In contrast, our work provides a comprehensive analysis of all NLP tasks within the legal domain, including LQA, LDS, and LTC. Additionally, our study covers large legal corpora and thoroughly examines the datasets available for each legal NLP task, areas that were not fully addressed in this work.
Sun [112] | ✓ | – | ✓ | ✗ | 2023 | Sun explored a limited number of research projects in the field of LLMs and the legal domain. It focused on two key NLP tasks within this field: LJP and statutory reasoning. Additionally, it examined three datasets and two LLMs relevant to this domain. The difference is that our work provides a comprehensive overview of all NLP tasks in the legal domain, as well as large legal corpora and datasets for each task, which were not fully covered in Sun’s study.
Cui et al. [26] | ✓ | – | ✓ | ✗ | 2023 | This paper comprehensively reviewed the LJP task. The authors analyzed 43 LJP datasets in 9 different languages. They summarized 16 evaluation metrics to evaluate three NLP tasks (text classification, text generation, and text regression) in LJP. For LMs, they explored existing Pre-trained Language Models (PLMs). Unlike this paper, our work provides comprehensive coverage of all NLP tasks in the legal domain and includes an exploration of large legal corpora.
Anh et al. [4] | ✗ | ✓ | ✓ | ✗ | 2023 | This survey gives a short explanation of the challenges in legal language and how LLMs try to overcome them. Then, the authors summarized six NLP tasks in the legal domain that can be addressed by LLMs. In terms of ethical and legal considerations, they discussed ‘bias and fairness’, ‘privacy and confidentiality’, ‘intellectual property’, ‘explainability and transparency’, and ‘responsible use’. The difference is that our work focuses on existing methods for all Legal NLP tasks, including LJP and LDS, available datasets for each task, and large legal corpora for pre-training and fine-tuning, whereas they concentrated on ethical and legal considerations and the impact of LLMs on NLP in legal texts.
Ganguly et al. [39] | – | – | ✓ | ✗ | 2023 | Ganguly and his colleagues presented a comprehensive review at ECIR 2023 and discussed several key areas, including the processing challenges of legal text, such as NER and sentence boundary detection. They traced the historical evolution of AI and law research from the 1980s, highlighted recent developments in NLP and IR techniques with a focus on the architectures of PLMs, and conducted a detailed survey of current issues and advancements in legal IR and NLP tasks like LDS and LJP. The review also included perspectives from the industry. In contrast to their survey, we also explore LQA and LTC tasks, which they did not cover, and examine large legal corpora, an area they did not address.
Chen et al. [23] | ✓ | ✓ | ✓ | ✗ | 2024 | This survey study focuses on LLMs in three distinct fields: finance, healthcare, and law. Although it attempts to cover all aspects of the legal domain and LLMs, the broad scope of addressing three expansive topics has resulted in a less thorough examination of many specific research cases within the legal field. In addition, it did not cover large legal corpora, which are used for pre-training and fine-tuning purposes. Also, we explore methods for improving the efficiency of LLMs in the legal domain.
Krasadakis et al. [58] | – | – | – | – | 2024 | This survey study focuses on challenges and advances in some NLP tasks, such as NER and Relation Extraction. Unlike this survey, our study covers all legal NLP tasks, along with their corresponding datasets and large legal corpora. Additionally, we examine existing LLMs tailored for the legal domain.

Legal NLP
  Legal LMs
    Methods: Li et al. [63], Mamakas et al. [75]
    Models: Chalkidis et al. [18], Xiao et al. [126], Colombo et al. [25], Shi et al. [104], Al-qurishi et al. [1]
  Legal corpora: Chalkidis et al. [19, 20], Zheng et al. [135], Henderson et al. [46], Rabelo et al. [95], Xiao et al. [127], Barale et al. [9], Niklaus et al. [86], Park and James [91], Goebel et al. [44], Niklaus et al. [87], Östling et al. [142]
  Legal NER: Dozier et al. [32], Păis et al. [94], Smădu et al. [105], Au et al. [7], Kalamkar et al. [54], Leitner et al. [61]
  LDS: Farzindar [36], Gelbart and Smith [40, 41], Moens et al. [80], Polsley et al. [92], Schraagen et al. [99], Zhong et al. [140], Jain et al. [49], Moro et al. [82], Zhong and Litman [141], Liu et al. [68], Shen et al. [102]
  LTC: Elnaggar et al. [34], Lee and Lee [60], Wei et al. [122], Bambroo and Awasthi [8], Song et al. [106], Fragkogiannis et al. [38], Mamooler et al. [76], Wang et al. [121], Chalkidis et al. [16, 17], Tuggener et al. [117], Graham et al. [45], Nguyen et al. [83], Papaloukas et al. [90]
  LJP: Luo et al. [73], Ye et al. [130], Zhong et al. [136], Chalkidis et al. [15], Medvedeva et al. [79], Yang et al. [129], Xu et al. [128], Zhong et al. [137], Almuslim and Inkpen [2], Feng et al. [37], Ma et al. [74], Liu et al. [71], Tong et al. [116], Zhang et al. [133], Niklaus et al. [85], Strickson and De La Iglesia [111], Semo et al. [100]
  LQA: Huang et al. [48], Khazaeli et al. [57], Sovrano et al. [109], Askari et al. [5], Zhong et al. [139], Louis et al. [72], Zhang et al. [134], Askari et al. [6], Sovrano et al. [108], Yuan et al. [131], Büttner and Habernal [14], Chen et al. [21], Sovrano et al. [107]

Fig. 1. An overview of the research areas in Legal NLP and the key publications in the survey.

2 BACKGROUND AND FOUNDATIONAL CONCEPTS


In this section, we will establish an essential understanding of how NLP intersects with the legal domain. Initially,
we explore the characteristics of legal documents, which are fundamental to this intersection. Legal texts are
known for their complex structure and specialized vocabulary. Recognizing these attributes is essential as they significantly
influence the development and application of AI technologies in legal practices.

2.1 Legal Documents


Legal documents are typically written in a descriptive language and presented in an unstructured text format. They
have unique features that set them apart from texts in other fields. These documents cover a broad array of texts essential
to the functioning of the legal system. They include court filings, judgments, legislation, treaties, contracts,
and legal correspondence, each serving a specific purpose and adhering to specific formatting and content
standards that reflect legal logic and hierarchy. Legal documents are fundamental tools for lawyers, judges,
and legal scholars, facilitating case analysis, legislative review, and contract drafting. They are also essential
in legal education and practice and provide the basis for legal arguments and decisions. Common examples of
legal documents include case law repositories, statutory databases, and collections of legal agreements. These
documents are utilized in various legal processes such as drafting legal arguments, performing legal analysis, and
ensuring regulatory compliance.

Table 2. List of Acronyms Used in the Survey.

Acronym | Meaning | Acronym | Meaning
AMR | Abstract Meaning Representation | LJP | Legal Judgment Prediction
AL | Active Learning | LQA | Legal Question Answering
AI | Artificial Intelligence | LTC | Legal Text Classification
BERT | Bidirectional Encoder Representations from Transformers | LSTM | Long Short-Term Memory
Bi-GRU | Bidirectional Gated Recurrent Units | ML | Machine Learning
Bi-LSTM | Bidirectional LSTM | MLM | Masked Language Modeling
CanLII | Canadian Legal Information Institute | MGAT | Multi-Graph Attention Network
CA | Case Analysis | ML-LJP | Multi-Law aware LJP
CRF | Conditional Random Field | MLP | Multi-Layer Perceptron
CNN | Convolutional Neural Networks | MPBFN | Multi-Perspective Bi-Feedback Network
DL | Deep Learning | MTL | Multi-task Learning
DAG | Directed Acyclic Graph | NER | Named Entity Recognition
EDUs | Elementary Discourse Units | ORC | Open Research Challenge
ECtHR | European Court of Human Rights | PEFT | Parameter-Efficient Fine-Tuning
FSCS | Federal Supreme Court of Switzerland | PLM | Pre-trained Language Model
GRU | Gated Recurrent Unit | PRISMA | Preferred Reporting Items for Systematic Reviews and Meta-Analyses
GAT | Graph Attention Networks | QA | Question Answering
GESAN | Graph-Based Evidence Retrieval and Aggregation Network | RNN | Recurrent Neural Networks
IR | Information Retrieval | RL | Reinforcement Learning
JEC-QA | Judicial Examination of Chinese Question Answering | RLHF | Reinforcement Learning from Human Feedback
KG | Knowledge Graph | RAG | Retrieval-Augmented Generation
KD | Knowledge-Driven | SLM | Small Language Model
LLM | Large Language Model | SVM | Support Vector Machine
LM | Language Model | TF-IDF | Term Frequency Inverse Document Frequency
LDS | Legal Document Summarization | |

2.1.1 Legal language and its characteristics. Legal language is characterized by its unique features that set it
apart from everyday language, primarily due to its function in the legal system. One prominent feature is its
formality, where legal language often employs a more formal vocabulary and syntax to ensure precision and
avoid ambiguity [43]. This formality is critical, as the precise meaning of terms can have significant legal effects.
Legal texts also typically utilize passive constructions and complex sentence structures to provide detailed and
comprehensive descriptions [43]. These constructions help clarify responsibilities and outcomes without directly
attributing actions or intentions to specific parties.
Another distinctive aspect of legal language is its reliance on specialized words and phrases. This includes
terms that have specific meanings within legal contexts, archaic words that are not commonly used in everyday
language, and standardized phrases that have been historically embedded in legal tradition [43]. This can make
legal documents less accessible to non-specialists, requiring legal professionals to interpret the content accurately.
Furthermore, legal language is heavily intertextual, meaning it frequently references other legal texts, such
as statutes, regulations, and case law. This characteristic ensures that legal arguments are grounded in and
supported by existing legal frameworks and previous cases. The dense use of citations and references in legal
documents not only supports the arguments made but also connects the document to a broader legal discourse.
Such intertextuality demands that legal professionals understand not only the texts themselves but also the
broader legal context in which they operate. To illustrate the intertextuality of legal language, Figure 2 shows
a sample page from the Code of Federal Regulations of the United States, extracted from the WestLaw [124]
website. The page displays 22 C.F.R. § 40.51, Labor Certification. This section falls under Title 22, which governs
regulations related to foreign relations, and details the requirements and procedures for labor certification. Notably, the text includes
highlighted references to other legal sources, such as INA 212(a)(5). This citation refers to the Immigration and
Nationality Act, specifically section 212, which outlines various conditions for inadmissibility into the United
States, under subsection (a), paragraph (5). The “Credits” part contains the reference “56 FR 30422, July
2, 1991,” which points to a publication in the Federal Register. Here, “56 FR” indicates the volume number, and
“30422” is the page number where the document begins. The date “July 2, 1991,” marks the publication date in
the Federal Register. Additionally, a sentence in subsection (b), paragraph (1) consists of 68 words, showing the
length and complexity typical of legal texts.

Fig. 2. A sample page from the Code of Federal Regulations, illustrating the structured and referenced nature of legal
documents.

The specific characteristics of legal language, such as its formal vocabulary, complex syntax, and extensive
use of references, present many challenges for NLP. Titles requiring disambiguation and nested entities are further issues in
legal contexts [58]. A title such as ‘The President of the USA’ requires precise identification based
on contextual details like time and location. Nested entities, where titles of legislative articles refer to other
laws, introduce further complexity. In addition, legal documents are frequently provided in non-machine-readable
PDF formats, complicating data extraction and processing. These challenges highlight the need for
advanced and specialized NLP solutions tailored to the legal domain.
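
The citation structure described above (title and section for the Code of Federal Regulations, volume and page for the Federal Register, section and subdivisions for the INA) can be made concrete with a small sketch. The regular expressions and the function name `extract_citations` below are illustrative assumptions, not part of any surveyed system, and real citation formats vary far more than these three patterns allow.

```python
import re

# Hypothetical, simplified patterns for the three citation styles discussed in the text.
CFR_PATTERN = re.compile(r"(?P<title>\d+)\s+C\.F\.R\.\s+§\s*(?P<section>\d+(?:\.\d+)?)")
FR_PATTERN = re.compile(r"(?P<volume>\d+)\s+FR\s+(?P<page>\d+)")
INA_PATTERN = re.compile(r"INA\s+(?P<section>\d+)(?P<subdivisions>(?:\([0-9a-zA-Z]+\))*)")


def extract_citations(text):
    """Return (kind, fields) tuples for every citation matched in the text."""
    found = []
    for kind, pattern in (("CFR", CFR_PATTERN), ("FR", FR_PATTERN), ("INA", INA_PATTERN)):
        for match in pattern.finditer(text):
            found.append((kind, match.groupdict()))
    return found


# Sample sentence assembled from the citations quoted in the text.
sample = "See 22 C.F.R. § 40.51 and INA 212(a)(5); credits: 56 FR 30422, July 2, 1991."
citations = extract_citations(sample)
for kind, fields in citations:
    print(kind, fields)
```

Even this toy version shows why intertextuality is demanding for NLP: each cited source has its own conventions, so a practical system would likely need many more patterns or a learned extractor.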

2.2 Legal NLP


2.2.1 Introduction to Legal NLP. Legal NLP involves the application of NLP techniques to legal texts. This field
is crucial as it helps automate and enhance the analysis of complex legal documents, improving efficiency and
accuracy in legal research, compliance, and decision-making processes. The foundation of NLP is text, and the
legal domain primarily consists of textual data [2], including statutes, case law, contracts, and regulations. Given
the text-intensive nature of the legal field, NLP offers significant potential to transform how legal professionals
interact with and utilize this vast amount of information. By leveraging advanced algorithms and Machine
Learning (ML) models, Legal NLP aims to make legal texts more accessible, interpretable, and actionable.
2.2.2 Why NLP is a game-changer for the legal sector. LLM-based NLP applications such as ChatGPT [88]
have had a significant impact since ChatGPT’s public debut in November 2022. The legal sector, however, has been
exploring the potential of AI, and applying it in practice, for a longer period. NLP applications in the legal field are
extensive, ranging from drafting client briefs and producing complex analyses from large document sets to
enabling smaller firms to compete with larger ones [78]. NLP also plays an important role in due diligence during
company mergers and greatly helps in legal education and learning in fast-changing fields [78].
A notable demonstration of the capabilities of NLP applications in the legal sector occurred when GPT-4
passed the Uniform Bar Exam, underscoring the technology’s potential [56]. Lawyers and law students are keenly
aware of this potential, as evidenced by a LexisNexis survey released in August 2023 [62]. The survey revealed
that about half of all lawyers believe that LLMs will significantly transform legal practice, with nearly all (92
percent) expecting some impact. Additionally, 77 percent of respondents believe LLMs will increase the efficiency
of lawyers, paralegals, and law clerks, while 63 percent foresee changes in how law is taught and studied. Moving
forward, we will introduce specific NLP tasks in the legal domain, exploring their applications and the impacts they
are expected to have on the legal profession.
2.2.3 Basic foundations of NLP. The integration of NLP in the legal domain relies on several foundational
techniques that enable the effective analysis and manipulation of legal texts. These fundamental methods provide
the building blocks for more complex applications, transforming unstructured legal documents into structured,
actionable information. By leveraging these NLP techniques, legal professionals can enhance their efficiency and
accuracy, making it easier to manage and interpret vast amounts of legal data. The following section provides
essential definitions of terms related to the basic foundations of NLP:
• Tokenization: Tokenization is the process of breaking down a sequence of text into smaller units, typically
words or sub-words, known as tokens. This is a fundamental step in NLP as it allows for the structured
analysis of text. In the legal domain, tokenization helps in processing and understanding lengthy documents
by segmenting them into manageable pieces.
• Word Embeddings: Word Embeddings are continuous vector representations that encode the semantic
meanings of words or tokens in a high-dimensional space. These representations allow models to convert
individual tokens into a format suitable for neural network processing. During training, LMs develop
embeddings that capture the relationships between words, such as synonyms.
• Transformers: Transformers [118] are a neural network architecture designed to convert input sequences
into output sequences by understanding the context and relationships among the elements of the sequence.
For instance, given the input “What is the color of the sky?”, a transformer model internally processes
and identifies connections among the words ‘color’, ‘sky’, and ‘blue’. This understanding enables it to
produce the output: “The sky is blue” [3]. Transformer models enhance this process through a self-attention
mechanism. This mechanism allows the model to analyze different parts of the sequence simultaneously
instead of sequentially, helping it identify which parts are most significant.
• PLMs: PLMs are trained on extensive corpora in a self-supervised manner, which involves tasks like recover-
ing incomplete input sentences or auto-regressive language modeling. These models, such as Bidirectional
Encoder Representations from Transformers (BERT) [29] and RoBERTa [70], are initially trained on large-
scale general text datasets. After pre-training, they can be fine-tuned for specific downstream tasks in the
legal domain, adapting them to comprehend and process legal language for applications like document
classification and information extraction.
• Question Answering (QA): A QA system is a type of NLP solution designed to answer questions posed in
natural language. These systems take a user’s query and, by extracting relevant information from a dataset,
provide a relevant and informative response.
• NER: NER is the task of identifying mentions of specific entities within a text that belong to predefined
categories such as persons, locations, organizations, and more. NER is a fundamental component for many
NLP applications, including question answering, text summarization, and machine translation [64].
• Information Retrieval (IR): IR involves the process of obtaining relevant information from large
collections of unstructured legal texts, such as case laws, statutes, contracts, and regulations. The goal of IR
is to provide users with the most relevant documents and data in response to specific queries.
• Multi-task Learning (MTL): MTL is an approach in ML where a model is trained on multiple related
tasks simultaneously or utilizes auxiliary tasks to enhance performance on a specific task. By learning
from diverse tasks, MTL enables models to capture generalized and complementary knowledge, improving
robustness and addressing data scarcity, particularly for low-resource tasks. MTL’s ability to share implicit
knowledge across tasks often leads to performance gains and more efficient models, making it a valuable
strategy for building robust and adaptable systems in NLP and other domains [22].
• Parameter-Efficient Fine-Tuning (PEFT): PEFT is a method for adapting PLMs that involves freezing
the majority of the model’s parameters and only updating a small subset. This approach significantly
reduces the computational resources and time required for fine-tuning, making it particularly effective in
resource-limited scenarios, while still achieving competitive performance in tasks like text generation [65].
• Retrieval-Augmented Generation (RAG): RAG is an advanced AI framework that enhances text generation
by merging traditional information retrieval systems with the generative power of LLMs. This integration
allows the AI to access additional knowledge sources while utilizing its advanced language capabilities.
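
Two of the definitions above, tokenization and IR, can be sketched together in a few lines. The example below is a minimal illustration under stated assumptions: it uses word-level tokenization and a hand-written TF-IDF ranking rather than the sub-word tokenizers and neural retrievers used by the models surveyed here, and the function names and the three-sentence corpus are made up for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase word-level tokenization; LMs typically use sub-word tokenizers instead."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tf_idf_vectors(token_lists):
    """Build one sparse TF-IDF vector (dict: term -> weight) per tokenized document."""
    n_docs = len(token_lists)
    doc_freq = Counter(term for tokens in token_lists for term in set(tokens))
    # Smoothed IDF so that terms occurring in every document keep a small positive weight.
    idf = {term: math.log((1 + n_docs) / (1 + df)) + 1.0 for term, df in doc_freq.items()}
    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        vectors.append({t: (c / len(tokens)) * idf[t] for t, c in counts.items()})
    return vectors, idf


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def retrieve(query, documents, top_k=1):
    """Rank documents by TF-IDF cosine similarity to the query and return the top_k."""
    vectors, idf = tf_idf_vectors([tokenize(d) for d in documents])
    q_counts = Counter(tokenize(query))
    q_total = sum(q_counts.values())
    # Query terms that never occur in the corpus are ignored.
    q_vec = {t: (c / q_total) * idf[t] for t, c in q_counts.items() if t in idf}
    ranked = sorted(range(len(documents)), key=lambda i: cosine(q_vec, vectors[i]), reverse=True)
    return [documents[i] for i in ranked[:top_k]]


# Made-up toy corpus for illustration only.
corpus = [
    "The tenant shall pay rent on the first day of each month.",
    "The court dismissed the appeal for lack of jurisdiction.",
    "Either party may terminate this contract with thirty days notice.",
]
print(tokenize("The tenant shall pay rent."))
print(retrieve("Which court has jurisdiction over the appeal?", corpus))
```

In a RAG setting, the passages returned by a retrieval step like `retrieve` would be added to the prompt of an LLM, which then generates the final answer; that generation step is omitted here.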

2.2.4 Key Publications and Conferences in Legal NLP. This section highlights the key journals, conferences, and
workshops that serve as platforms for sharing advancements and insights in the intersection of NLP and the
legal domain. These resources provide good opportunities for researchers to engage with cutting-edge research
in Legal NLP.
Several leading journals focus on the intersection of AI, NLP, and the legal domain. “Artificial Intelligence and
Law”, published by Springer, is a leading journal that features research articles on legal reasoning, legal IR, and
legal knowledge representation. Additionally, the “Journal of Law and Information Technology”, published by
Taylor & Francis, focuses on the application of information technology in law, including AI research.
Conferences significantly advance research and promote collaboration in Legal NLP. The International Conference
on Artificial Intelligence and Law (ICAIL) is a biennial event that showcases advances in AI applications in
the legal domain, including NLP and ML. The Conference on Legal Knowledge and Information Systems (JURIX)
is an annual event that focuses on legal informatics and NLP technologies.

ACM Comput. Surv., Vol. 1, No. 1, Article . Publication date: October 2024.

Several workshops also present notable Legal NLP work. The Workshop on Automated
Semantic Analysis of Information in Legal Texts delves into NLP and semantic analysis of legal texts. The Natural
Legal Language Processing (NLLP) Workshop provides a platform for discussing NLP technologies tailored for
legal texts and is often part of major NLP conferences. The EXplainable AI in Law (XAILA) Workshop focuses on
the explainability of AI systems in legal contexts, aiming to improve transparency and trust in AI applications. The
Competition on Legal Information Extraction/Entailment (COLIEE) is an annual event that challenges participants
to develop innovative solutions for legal information extraction and entailment tasks.

3 METHODOLOGY
This survey follows the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) framework
[89], which supports a transparent and comprehensive assessment of research on NLP tasks within the legal
sector.

3.1 Search Strategy


We performed a systematic search across two academic databases, Google Scholar and IEEE Xplore, to identify
relevant studies. Search queries were crafted to capture studies that focused on the application of
NLP to legal tasks. The search was defined by the following queries:
• Query 1: (“Natural Language Processing” OR “NLP”) AND (“Legal” OR “Law”)
• Query 2: (“Legal” AND (“Named Entity Recognition” OR “NER” OR “Document Summarization” OR “Text
Classification” OR “Document Classification” OR “Judgment Prediction” OR “Question Answering” OR
“Corpus” OR “Language Model”))
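
To make the Boolean structure of Query 1 concrete, the following minimal sketch evaluates it against a title or abstract. The whole-word, case-insensitive matching and the helper names are assumptions; Google Scholar and IEEE Xplore apply their own matching rules, which may differ.

```python
import re


def has_term(text, term):
    """Case-insensitive whole-word (or whole-phrase) match."""
    return re.search(r"\b" + re.escape(term) + r"\b", text, re.IGNORECASE) is not None


def matches_query1(text):
    """Query 1: ("Natural Language Processing" OR "NLP") AND ("Legal" OR "Law")."""
    nlp_side = has_term(text, "Natural Language Processing") or has_term(text, "NLP")
    legal_side = has_term(text, "Legal") or has_term(text, "Law")
    return nlp_side and legal_side
```

Under these assumptions a title must satisfy both sides of the conjunction, so a general NLP paper with no legal term is not matched.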
Our search covered publications with the following date ranges for each specific NLP task: LQA from 2020-2024,
LJP from 2017-2024, LTC from 2018-2023, LDS from 2016-2024, Legal NER from 2010-2022, legal corpora from
2021-2024, and Legal LMs from 2020-2024. This ensured that we included recent advancements. Peer-reviewed
journal articles and high-quality conference proceedings were prioritized, along with a secondary consideration
of non-peer-reviewed sources, such as arXiv articles, where relevant.

3.2 Study Selection


A total of 148 studies were initially identified from the database search. To refine this list, we applied a manual
review and filtering process. This process involved:
(1) Title and Abstract Screening: We reviewed the titles and abstracts of all retrieved studies to assess their
relevance to the predefined legal NLP tasks. Studies unrelated to the core legal NLP and its tasks were
excluded at this stage.
(2) Full-Text Review: Articles that passed the initial screening went through a more detailed full-text review.
This was done to confirm their relevance, quality, and alignment with the inclusion criteria. During this
phase, we carefully examined the literature review sections of each included research paper and survey.
This helped ensure that the studies not only contributed original findings but also reflected a comprehensive
understanding of the existing research landscape in legal NLP.
(3) Final Selection: From the original 148 studies, 127 were included in the final list, selected based on their
direct relevance to key NLP tasks within the legal sector and their methodological quality, as well as their
engagement with existing literature in the field.

3.3 Eligibility Criteria


To determine which studies were included in the final synthesis, we established the following criteria:
• Inclusion Criteria:
– The study must focus on at least one of the following NLP tasks: Legal Question Answering, Legal Named
Entity Recognition, Legal Judgment Prediction, Legal Document Summarization, Legal Text Classification,
Legal Language Models, or legal corpora.
– The study must present empirical research or significant methodological contributions to legal NLP.
– Both peer-reviewed and non-peer-reviewed (e.g., arXiv) studies were considered if they provided valuable
insights.
• Exclusion Criteria:
– Studies focused exclusively on unrelated areas such as information retrieval methods, pattern mining,
information extraction, or similarity detection without a clear application to the specific legal NLP tasks
mentioned.
– General NLP studies without a focus on legal applications.
– Editorials, opinion pieces, or other non-research articles.
– Papers that did not meet basic methodological standards.

4 NLP TASKS, DATASETS, AND LARGE CORPORA IN LEGAL DOMAINS


NLP tasks in the legal domain cover a range of specialized applications that address the unique challenges
and requirements of legal texts. These tasks leverage advanced NLP techniques to process, analyze, and extract
meaningful information from legal documents. Legal NLP tasks help make legal research and decision-making
more efficient and accurate. In this section, we explore various NLP tasks that are tailored to the legal domain
and show their impact on the legal sector. Additionally, we discuss existing works and research related to
each task and provide an overview of the current state of the art in this field.
Furthermore, the success of these NLP tasks heavily depends on the availability and quality of domain-specific
datasets. Therefore, we will also examine the datasets commonly used in legal NLP research, exploring their
characteristics and the role they play in advancing the development of NLP models for the legal domain.

4.1 Legal Question Answering


4.1.1 Task. LQA involves responding to queries about law. This task is typically done by legal professionals skilled
in the relevant legal field. This process requires a comprehensive review of existing laws, careful interpretation
of statutes and regulations, and the application of legal principles and past cases to the relevant facts. LQA seeks
to offer precise advice on legal matters. It helps people and businesses in navigating the legal landscape and
addressing legal challenges. Recently, DL has been leveraged in LQA to employ neural network models that train
on extensive datasets to identify complex patterns and relationships. These models evaluate the questions posed,
identify relevant legal topics, and produce suitable answers based on the patterns they have learned.
Modern ML approaches for LQA, particularly DL, rely on neural networks to understand natural language.
Popular architectures include Recurrent Neural Networks (RNN), Long Short-Term Memory (LSTM), and Convolutional
Neural Networks (CNN), which can be fine-tuned for specific tasks such as QA. These models generate
accurate responses, adapt to new patterns, understand context, and provide relevant answers. Transformer-based
models, such as BERT [29] and ChatGPT [88], have proven particularly effective in NLP tasks. These models
use the transformer architecture and self-attention mechanisms to learn the context of the text and understand
the meaning of words. This allows them to provide relevant answers by effectively weighing the importance of
different parts of the input. In the following, we review existing LQA work in the legal domain.
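
The weighting performed by self-attention can be sketched in a few lines. The example below is a minimal single-head version in plain Python, assuming that queries, keys, and values are all the raw input vectors; the learned projection matrices and multiple heads used by actual transformers such as BERT are omitted.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Single-head attention with queries = keys = values = the input vectors.

    Real transformers apply learned projections first; they are omitted here.
    """
    d = len(vectors[0])
    outputs, all_weights = [], []
    for q in vectors:
        # Scaled dot-product score between this token and every token.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in vectors]
        weights = softmax(scores)
        # Output is the attention-weighted average of the value vectors.
        outputs.append([sum(w * v[j] for w, v in zip(weights, vectors)) for j in range(d)])
        all_weights.append(weights)
    return outputs, all_weights
```

Each row of the returned weights sums to one, so every output vector is a convex combination of the inputs, weighted by similarity to the querying token.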

Huang et al. [48] introduce the Artificial Intelligence Law Assistant, the first Chinese LQA system that integrates
a legal knowledge graph (KG) to enhance query comprehension and answer ranking. The system collects a
large-scale QA corpus from an online legal forum and constructs a legal KG with over 42,000 legal concepts.
It employs a knowledge-enhanced interactive attention network using Bi-LSTM and co-attention mechanisms
to enrich semantic representations of QA pairs with legal domain knowledge. Additionally, it provides visual
explanations for selected answers, offering users a clear understanding of the QA process.
An example of an answer retrieval system specific to Private International Law is proposed by Sovrano et al.
[109]. This system integrates Term Frequency-Inverse Document Frequency (TF-IDF) with deep LMs to retrieve
relevant answers from an automatically generated KG of contextualized grammatical sub-trees. The KG aligns
with a legal ontology based on Ontology Design Patterns, such as agent, role, event, temporal parameter, and
action, to reflect the legal significance of the relationships within and between provisions.
Khazaeli et al. [57] develop an IR-based QA system tailored to the legal domain, combining sparse vector search
(BM25) and dense vector techniques (semantic embeddings) as input to a BERT-based [29] answer re-ranking
system. The system utilizes Legal GloVe and Legal Siamese BERT embeddings to enhance retrieval effectiveness.
An answer finder component computes the probability of a passage answering the question using a BERT
sequence binary classifier fine-tuned on question-answer pairs, improving the model’s ability to discriminate
good answers.
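
BM25 appears as the sparse first-stage retriever in several of the systems reviewed in this section. The sketch below is a minimal Okapi BM25 scorer over pre-tokenized documents; the parameter values `k1=1.5` and `b=0.75` and the smoothed IDF variant are common choices assumed here, not settings reported by the cited works.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized document against a tokenized query with Okapi BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Smoothed IDF variant that stays non-negative.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores
```

In a two-stage pipeline, the top-scoring passages from such a scorer would be passed to a neural re-ranker rather than returned directly.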
Li et al. [66] introduce a retrieving-then-answering framework featuring the Graph-Based Evidence Retrieval
and Aggregation Network (GESAN) to enhance QA on the Judicial Examination of Chinese Question Answering
(JEC-QA) dataset [139]. The framework leverages relevant legal knowledge by predicting question topics and
retrieving legal paragraphs using BM25. GESAN aggregates the evidence and processes it along with the question
and options to make accurate predictions, demonstrating improved accuracy and reasoning abilities in LQA.
Askari et al. [5] propose a method for generating query-dependent textual profiles for lawyers to improve legal
expert finding on QA platforms. Using data from the Avvo¹ QA forum, they focus on aspects such as sentiment,
comments, and recency to create profiles. BERT models [29] are fine-tuned on these profiles, and the final
aggregated score is calculated as a linear combination of the scores from the profile-trained BERT models. This
approach improves retrieval performance in the legal expert finding task.
Zhang et al. [134] propose a generation-based method for LQA, modeling it as a generation task to produce
new, relevant answers tailored to each question. The system incorporates laws as external knowledge into the
answer generation process, using a retriever to fetch applicable law articles and a generator to create answers
using this knowledge. Both components are integrated into a single T5 [96] model using MTL, which can enhance the
model’s understanding and generation abilities while helping keep answers accurate and informative.
Louis et al. [72] present an end-to-end methodology for generating long-form answers to statutory law
questions using a “retrieve-then-read” pipeline. The approach involves a retriever component that uses a bi-
encoder model to fetch relevant legal provisions, followed by a generator that formulates comprehensive answers
based on these provisions. The generator, an autoregressive LLM based on the Transformer architecture, employs
in-context learning and PEFT to generate detailed answers. The model’s interpretability is enhanced by an
extractive rationale generation strategy, ensuring responses are accompanied by verifiable justifications.
Sovrano et al. [108] propose DiscoLQA, a discourse-based LQA system that focuses on important discourse
elements like Elementary Discourse Units (EDUs) and Abstract Meaning Representations (AMRs). This approach
helps the answer retriever identify the most relevant parts of the discourse, enhancing retrieval accuracy. They
introduce the Q4EU dataset, containing over 70 questions and 200 answers on six European norms, demonstrating
improved performance in legal QA even without domain-specific training.

1 https://www.avvo.com

Yuan et al. [131] present a three-step approach to bridge the legal knowledge gap by creating CLIC-pages—snippets
that explain technical legal concepts in layperson’s terms. They construct a legal question bank containing legal
questions answered by CLIC-pages, using large-scale PLMs like GPT-3 [13] to generate machine-generated
questions. The study demonstrates that machine-generated questions are more scalable and diversified, aiding in
improving accessibility to legal information for non-experts.
Askari et al. [6] propose a cross-encoder re-ranker ($CE_{FS}$) for legal answer retrieval, incorporating fine-grained
structured inputs from community QA data to enhance retrieval effectiveness. They introduce the LegalQA
dataset containing 9,846 questions and 33,670 lawyer-curated answers. The approach involves a two-stage ranking
pipeline with a BM25 retriever followed by a re-ranker, showing that integrating question tags into the input
structure can bridge the knowledge gap and improve retrieval in the legal domain.

4.1.2 Datasets. LQA datasets are specialized resources designed to facilitate research in the domain of
legal NLP. They consist of collections of legal questions and corresponding answers, drawn from various legal
documents and case law. Most questions in the LQA datasets fall into two main categories: knowledge-driven
questions (KD-questions) and case-analysis questions (CA-questions) [139]. KD-questions are centered around
the understanding of specific legal concepts, whereas CA-questions involve the analysis of actual legal cases.
Both types demand advanced reasoning skills and a deep comprehension of the text, making LQA a particularly
challenging task in the field of NLP.
Zhong et al. [139] introduce JEC-QA, a dataset with 26,365 multiple-choice questions from the National Judicial
Examination of China and related websites. Each question offers four possible answers and is labeled with the
type of reasoning required, such as word matching, concept understanding, numerical analysis, multi-paragraph
reading, and multi-hop reasoning. This dataset poses significant challenges for QA models, highlighting the gap
between machine performance and human expertise in complex legal reasoning.
Sovrano et al. [107] present a dataset designed to evaluate automated QA systems within the domain of Private
International Law. It includes 17 carefully selected questions based on key EU regulations—Rome I, Rome II, and
Brussels I bis—with answers derived directly from these regulations. The questions are classified based on their
specificity, allowing for nuanced analysis of context-dependency in legal reasoning. This dataset aids in assessing
the performance of QA systems intended for legal professionals navigating complex cross-border legal issues.
EQUALS [21] is a large-scale annotated LQA dataset in Chinese law, containing 6,914 question-answer pairs
with answers based on specific law articles. Curated by senior law students, it covers 10 collections of Chinese
laws and includes annotations indicating the type of reasoning required for each question. The dataset ensures
that answers are precise excerpts from relevant law articles, making it valuable for developing advanced LQA
systems that can aid in legal research and decision-making.
Büttner and Habernal [14] introduce GerLayQA, a dataset supporting LQA for laypersons in Germany, focusing
on the civil-law system. It contains 21,538 real-world questions posed by laypersons, paired with expert answers
from lawyers grounded in specific paragraphs of German law books. The dataset is constructed through filtering
and quality assurance to ensure accuracy and relevance, making it a valuable resource for developing LQA
systems that can interpret and apply German law to everyday legal inquiries.

4.2 Legal Judgment Prediction


4.2.1 Task. LJP is an important task within legal NLP, especially prevalent in civil law systems where judgments
are determined based on case facts and statutory articles [138]. This task involves predicting legal outcomes from
the descriptions of cases and relevant legal statutes, and is essential in countries like France, Germany, Japan,
and China [138]. LJP has garnered significant attention from AI researchers and legal professionals due to its
potential to assist judges, lawyers, and legal scholars in predicting case outcomes based on historical data [23].

Despite its promise, LJP is a highly demanding and challenging task. It requires careful handling of natural
biases in historical legal data, which can create feedback loops that amplify discrimination [58]. Therefore,
ensuring the impartiality of predicted rulings is crucial [58]. Currently, LJP is primarily performed by legal experts
who undergo extensive specialized training to manage the complex steps involved, such as identifying relevant
law articles, defining charge ranges, and deciding penalty terms [26]. Nevertheless, LJP provides substantial
benefits, streamlining legal decision-making processes for both practitioners and ordinary citizens [26].
Luo et al. [73] propose an attention-based neural network to enhance charge prediction by jointly modeling
charge prediction and relevant law article extraction. They used Bidirectional Gated Recurrent Units (Bi-GRUs)
to encode fact descriptions and an article extractor to identify top relevant law articles. The model employs an
attention mechanism guided by context vectors to combine embeddings for prediction. Evaluations on Chinese
judgment documents showed improved accuracy in predicting charges and providing relevant legal articles.
Zhong et al. [136] introduce TopJudge, a topological MTL framework that models dependencies among subtasks
in LJP, such as law article prediction, charge prediction, and penalty terms. Using a Directed Acyclic Graph
(DAG), TopJudge processes subtasks in a topological order reflecting real-world legal decision-making. Evaluated
on large-scale Chinese criminal case datasets, it outperformed previous models in predicting legal outcomes.
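
The idea of processing subtasks in topological order can be illustrated with the standard library. The dependency graph below (law article before charge, both before penalty term) is an assumed example for illustration rather than the exact graph used by TopJudge; `graphlib` requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency graph: each subtask maps to the subtasks it depends on.
DEPENDENCIES = {
    "law_article": set(),
    "charge": {"law_article"},
    "penalty_term": {"law_article", "charge"},
}


def subtask_order(dependencies):
    """Return subtasks so that every subtask follows its prerequisites."""
    return list(TopologicalSorter(dependencies).static_order())
```

A multi-task model following this scheme would predict each subtask in the returned order, conditioning later predictions on the outputs of earlier ones.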
Ye et al. [130] address the problem of Court View Generation from fact descriptions in criminal cases to
enhance the interpretability of charge prediction systems and aid in automatic legal document generation.
They formulated this as a text-to-text Natural Language Generation (NLG) problem, using a label-conditioned
Seq2Seq model with attention to decode court views based on encoded charge labels. Their model outperformed
basic Seq2Seq models in generating accurate and natural court views. This work contributes to automatic legal
document generation by providing justifications for charge decisions.
Yang et al. [129] propose a Multi-Perspective Bi-Feedback Network (MPBFN) with a Word Collocation Attention
mechanism to improve LJP. MPBFN addresses the challenges of multiple subtasks and their dependencies by
using a bi-feedback mechanism for forward prediction and backward verification among subtasks. The Word
Collocation Attention integrates word collocation features and numerical semantics to better predict penalties.
Evaluated on the CAIL-small and CAIL-big datasets [127], their model outperformed baselines in predicting law
articles, charges, and penalty terms.
Chalkidis et al. [15] introduce an English LJP dataset containing approximately 11.5k cases from the European
Court of Human Rights (ECHR). They evaluated various neural models on this dataset, including a hierarchical
version of BERT [29] (HIER-BERT) to handle long legal documents. Their models outperformed previous feature-
based approaches in tasks like violation classification and case importance prediction. They also explored potential
biases in legal predictive models using data anonymization.
Medvedeva et al. [79] investigate using NLP tools to predict judicial decisions of the ECHR based on court
proceeding texts. They employed an SVM linear classifier to predict violations of articles, achieving an average
accuracy of 75%. However, when predicting future decisions based on past cases, accuracy decreased to 58–68%.
The study also found that predicting outcomes based solely on judges’ surnames could achieve an average
accuracy of 65%, highlighting potential biases.
Zhong et al. [137] introduce QAjudge, a reinforcement learning-based model designed to provide interpretable
legal judgments by visualizing the prediction process. QAjudge uses a Question Net to iteratively select relevant
yes-no questions about case facts, an Answer Net to provide answers, and a Predict Net to generate the final
judgment. The model aims to minimize the number of questions asked. It focuses on crucial elements to ensure
fairness and interpretability. Evaluated on real-world datasets, QAjudge demonstrated potential in providing
reliable and transparent legal judgments.
Xu et al. [128] propose the Law Article Distillation based Attention Network (LADAN), an end-to-end model
addressing the issue of confusing charges in LJP by distinguishing similar law articles. The model uses a novel
graph neural network to learn differences between confusing law articles and an attention mechanism to extract
discriminative features from fact descriptions. Experiments on real-world datasets showed that LADAN improved
performance over previous methods in law article prediction, charge prediction, and penalty term prediction.
Strickson and De La Iglesia [111] present the first LJP model for UK court cases, creating a labeled dataset of
UK court judgments spanning 100 years. They evaluated various ML models and feature representations, with
their best model achieving an accuracy of 69%. The study demonstrated the potential of LJP for UK courts, though
challenges remain due to the complexity of legal language and lack of structured public datasets.
Ma et al. [74] introduce MSJudge, an MTL framework designed to predict legal judgments by leveraging multi-
stage judicial data, including pre-trial claims and court debates. MSJudge consists of components to encode
multi-stage context, model interactions among claims, facts, and debates, and predict judgments. Evaluated
on a large civil trial dataset², MSJudge outperformed state-of-the-art baselines, enhancing trial efficiency and
judgment quality.
Almuslim and Inkpen [2] focus on LJP for Canadian appeal court cases, employing various NLP and ML
methods to predict binary outcomes (‘Allow’ or ‘Dismiss’) based on case descriptions. Deep learning models
using custom Word2Vec embeddings achieved the highest accuracy of 93%, significantly outperforming classical
ML models. The study highlights the potential of predictive models to aid legal professionals and establishes a
foundation for future research in the Canadian legal system.
Feng et al. [37] address limitations of state-of-the-art LJP models by proposing an event-based prediction
model with constraints to improve performance. The model extracts fine-grained key events from case facts
and predicts judgments based on these events rather than the entire fact statement. They manually annotated a
legal event dataset and introduced output constraints to guide learning. Their method effectively leverages event
information and cross-task consistency constraints, enhancing LJP accuracy.
Tong et al. [116] introduce GJudge, a graph boosting framework incorporating constraints to address shortcomings
of traditional LJP methods. GJudge features a multi-perspective interactive encoder and a Multi-Graph
Attention Network (MGAT) consistency expert module. The encoder merges fact descriptions with label similarity
connections, while the expert module distinguishes similar labels and maintains task consistency. Testing on
datasets showed that GJudge outperformed other models, including the state-of-the-art RLJP [125] model, with
higher F1 scores.
Previous works mainly focus on creating accurate representations of a case’s fact description to enhance
judgment prediction performance. However, these methods often overlook the practical judicial process, where
human judges compare similar law articles or potential charges before making a decision. To address this gap,
Zhang et al. [133] propose CL4LJP, a supervised contrastive learning framework to improve LJP by capturing
fine-grained differences between similar law articles and charges. The framework includes contrastive learning
tasks at the article, charge, and label levels, enhancing the model’s ability to model relationships between fact
descriptions and labels. Experiments demonstrated that CL4LJP outperformed previous methods, proving its
effectiveness and robustness.
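
For readers unfamiliar with supervised contrastive learning, the sketch below implements the generic supervised contrastive loss in plain Python: embeddings with the same label are pulled together and all others are pushed apart. This is the standard formulation, assumed here for illustration, and not the exact multi-level objective of CL4LJP; the temperature value is likewise an assumption.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def supervised_contrastive_loss(embeddings, labels, temperature=0.1):
    """Mean supervised contrastive loss over anchors that have a positive.

    Positives for an anchor are the other samples sharing its label.
    """
    z = [normalize(e) for e in embeddings]
    n = len(z)
    losses = []
    for i in range(n):
        positives = [p for p in range(n) if p != i and labels[p] == labels[i]]
        if not positives:
            continue
        logits = {a: dot(z[i], z[a]) / temperature for a in range(n) if a != i}
        # Log of the softmax denominator over all other samples (log-sum-exp).
        m = max(logits.values())
        log_denom = m + math.log(sum(math.exp(v - m) for v in logits.values()))
        losses.append(-sum(logits[p] - log_denom for p in positives) / len(positives))
    return sum(losses) / len(losses) if losses else 0.0
```

The loss is small when same-label fact representations are close to each other and far from those of confusable labels, which is the behavior such frameworks aim to encourage.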
Liu et al. [71] propose ML-LJP, a Multi-Law aware LJP method that expands law article prediction into a
multi-label classification task incorporating both charge-related and term-related articles. The approach uses label-
specific representations and contrastive learning to distinguish similar definitions. A Graph Attention Network
(GAT) is employed to learn interactions among multiple law articles for prison term prediction. Experiments
showed that ML-LJP outperformed state-of-the-art models, particularly in prison term prediction.

2 https://github.com/mly-nlp/LJP-MSJudge

4.2.2 Datasets. The LJP datasets are specialized resources designed to advance research in predicting judicial
outcomes within the domain of legal NLP. These datasets are categorized into four main types: court view
generation, law article, charge prediction, and prison term prediction datasets. Court view generation datasets
involve court opinions and summaries. Law article datasets involve the prediction of legal outcomes based on
specific statutes or regulations. Charge prediction datasets are concerned with predicting the charges that should
be brought against a defendant based on the case details, while prison term prediction datasets aim to estimate
the likely sentence duration given the nature of the crime and the legal context. Each type of dataset presents
unique challenges, demanding not only text comprehension but also the ability to apply complex legal reasoning,
making LJP a particularly complex task in the field of NLP.
Court View Gen [130] is an innovative dataset containing 171,981 Chinese legal cases, each involving a single
defendant and a corresponding charge, covering a total of 51 charge categories. This dataset is specifically curated
to support the generation of court opinions based on charge labels. The data was sourced from publicly available
legal documents within the CJO³ repository.
Niklaus et al. [85] introduce a multilingual LJP dataset from the Federal Supreme Court of Switzerland (FSCS),
containing over 85,000 cases in German, French, and Italian. The dataset is annotated with publication years,
legal areas, and cantons of origin, making it suitable for NLP applications in judgment prediction.
Semo et al. [100] introduce the first LJP dataset focused on class action cases in the United States. The dataset
targets predicting outcomes based on plaintiffs’ complaints rather than court-written fact summaries, involving a
rule-based extraction system to identify relevant text spans from complaints.
Almuslim and Inkpen [2] construct a dataset for LJP within the Saskatchewan Court of Appeal. They collected
and labeled 3,670 documents with case outcomes (‘allow’ or ‘dismiss’) using a two-step pattern matching and
keyword-based validation, ensuring label accuracy through manual annotation. This dataset supports research in
the Canadian legal system and aids in developing predictive models for legal judgments.
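
A two-step labeling procedure of this kind might look like the sketch below: a regular-expression match proposes a label, a keyword check validates it, and anything ambiguous is left unlabeled for manual annotation. The patterns and the validation rule are hypothetical and are not taken from Almuslim and Inkpen [2].

```python
import re

# Hypothetical patterns; the wording used in real judgments will differ.
ALLOW = re.compile(r"\bappeal\s+(?:is|was)\s+allowed\b", re.IGNORECASE)
DISMISS = re.compile(r"\bappeal\s+(?:is|was)\s+dismissed\b", re.IGNORECASE)


def label_outcome(text):
    """Step 1: pattern match. Step 2: keyword check. Ambiguous cases return None."""
    allow, dismiss = bool(ALLOW.search(text)), bool(DISMISS.search(text))
    if allow == dismiss:
        return None  # no match or conflicting matches: leave for manual annotation
    label = "allow" if allow else "dismiss"
    # Keyword validation: the opposite outcome's keyword must not appear at all.
    opposite = "dismiss" if label == "allow" else "allow"
    if opposite in text.lower():
        return None
    return label
```

Returning `None` instead of guessing keeps noisy labels out of the training data, at the cost of a larger manual annotation queue.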

4.3 Legal Text Classification


4.3.1 Task. LTC is an important task within the domain of NLP that involves categorizing legal documents
based on their content, a foundational aspect of building intelligent legal systems. With the exponential growth
of legal documents, it has become increasingly challenging for legal professionals to efficiently locate relevant
rulings in similar cases for argumentation. LTC addresses this challenge by automatically associating legal texts
with predefined categories, such as criminal, civil, or administrative cases, thereby simplifying legal research and
decision-making processes.
In the legal field, this process is often referred to as predictive coding, where ML algorithms are trained through
supervised learning to classify documents into specific categories. The broader task of text classification in
NLP involves assigning one or multiple categories to a document from a set of predefined options, and it can
take various forms, including binary classification, multi-class classification, and multi-label classification. Legal
document classification often falls under large multi-label text classification, where the label space can consist of
thousands of potential categories, adding significant complexity to the task [101]. This subsection explores the
methodologies and advancements in this area.
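
To illustrate the multi-label setting, the sketch below encodes label sets as multi-hot vectors and computes micro-averaged F1, a metric commonly used for this kind of task. The three-label space is a toy assumption; real legal label spaces can contain thousands of categories.

```python
def multi_hot(label_sets, label_space):
    """Encode each document's label set as a 0/1 vector over the label space."""
    index = {label: i for i, label in enumerate(label_space)}
    vectors = []
    for labels in label_sets:
        row = [0] * len(label_space)
        for label in labels:
            row[index[label]] = 1
        vectors.append(row)
    return vectors


def micro_f1(y_true, y_pred):
    """Micro-averaged F1 over all (document, label) decisions."""
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            if t == 1 and p == 1:
                tp += 1
            elif t == 0 and p == 1:
                fp += 1
            elif t == 1 and p == 0:
                fn += 1
    if tp == 0:
        return 0.0
    return 2 * tp / (2 * tp + fp + fn)
```

Binary and multi-class classification are special cases of this encoding in which each row contains exactly one positive entry.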
DL typically requires extensive data to yield effective results, but MTL offers a potential solution to the data
scarcity problem. Elnaggar et al. [34] leverage transfer learning and MTL to perform tasks like translation and
multi-label classification within legal document corpora. They utilize the MultiModel algorithm [53], which
uses a fully convolutional sequence-to-sequence architecture integrating different modality nets. Their model
processes legal texts through a unified embedding, enabling efficient task switching and promoting generalization
across different legal tasks, effectively tackling data scarcity in the legal field.
Wei et al. [122] investigate the application of CNNs for text classification in legal document review, comparing
their performance with SVMs on four datasets from real legal cases. They found that CNNs perform better
with larger training datasets and offer a more stable growth trend compared to SVMs. The study highlighted
challenges such as sequence length limitations and the need for explainability in predictive analysis, suggesting
improvements like integrating annotated sentences with full text to enhance sentence relevance identification.

3 http://wenshu.court.gov.cn

Lee and Lee [60] focus on legal document classification in the Korean language and compare three different DL
approaches: CNN with ASCII encoding, CNN with Word2Vec embeddings, and RNN with Word2Vec embeddings.
The classification models are used to classify case data into civil, criminal, and administrative. Using a dataset of
nearly 60,000 past case documents, the study finds that the RNN model with Word2Vec embedding achieves the
highest classification accuracy.
Bambroo and Awasthi [8] introduce an architecture that integrates long attention mechanisms with a distilled
BERT model pre-trained on legal domain-specific corpora. Their model employs a combination of local windowed
attention and task-motivated global attention to handle inputs up to eight times longer than standard BERT
models. The architecture, based on DistilBERT [97] and incorporating LongformerSelf-Attention, is optimized for
legal document classification, outperforming fine-tuned BERT and other transformer-based models in both speed
and effectiveness.
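The combination of local windowed attention and global attention described above can be illustrated with a small attention mask. The sketch below is not the implementation of Bambroo and Awasthi [8]; the function names, the window size, and the choice of position 0 as the global token are assumptions made for illustration only.

```python
def build_attention_mask(seq_len, window, global_positions):
    """Return mask[i][j] == True when token i may attend to token j."""
    half = window // 2
    global_set = set(global_positions)
    mask = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            in_window = abs(i - j) <= half                   # local windowed attention
            is_global = i in global_set or j in global_set   # global attention
            row.append(in_window or is_global)
        mask.append(row)
    return mask


def count_allowed_pairs(mask):
    """Count the token pairs that the mask allows to interact."""
    return sum(sum(row) for row in mask)


# Hypothetical toy setting: 8 tokens, a window of 2, and one global token at position 0.
mask = build_attention_mask(seq_len=8, window=2, global_positions=[0])
print(count_allowed_pairs(mask), "of", 8 * 8, "token pairs are attended")
```

For a fixed window size, the number of attended pairs grows roughly linearly with sequence length rather than quadratically, which is what makes longer legal inputs tractable.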
Song et al. [106] present a deep learning-based system built on top of RoBERTa [70] for multi-label legal
document classification. They enhance the model with domain-specific pre-training, a label-attention mechanism,
and MTL to improve classification accuracy, particularly for low-frequency classes. The label-attention mechanism
uses label embeddings to bridge the semantic gap between samples and class labels, addressing class imbalance
issues.
Fragkogiannis et al. [38] propose a method to improve classification of pages within lengthy documents
by leveraging sequential context from preceding pages. They enhance the input to pre-trained models like
BERT [29] by appending special tokens representing the predicted page type of the previous page, enabling
more context-aware classification without modifying the model architecture. Experiments on legal datasets
demonstrate improvements compared to non-recurrent setups.
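The idea of feeding the previous page's predicted type back into the input can be sketched as follows. This is a minimal illustration, not the method of Fragkogiannis et al. [38] as implemented: the token format `[PREV_...]`, the label names, and the rule-based `toy_classifier` standing in for a fine-tuned model are all hypothetical.

```python
def classify_pages(pages, classify, start_label="NONE"):
    """Classify pages in order, appending a special token for the previous prediction."""
    labels = []
    previous = start_label
    for text in pages:
        # The model architecture is untouched; only the input string changes.
        model_input = f"{text} [PREV_{previous.upper()}]"
        previous = classify(model_input)
        labels.append(previous)
    return labels


def toy_classifier(model_input):
    """Hypothetical stand-in for a fine-tuned page classifier."""
    lowered = model_input.lower()
    if "invoice" in lowered:
        return "invoice"
    if "agreement" in lowered:
        return "contract"
    if "[PREV_CONTRACT]" in model_input:
        return "contract"  # a page with no strong cue continues the previous type
    return "other"


pages = [
    "Lease Agreement between A and B",
    "Section 2. Term of the lease",
    "Invoice 42",
]
labels = classify_pages(pages, toy_classifier)
print(labels)
```

In this toy setting the second page carries no cue of its own, so it is only labeled as part of the contract because of the appended context token.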
Wang et al. [121] introduce a Document-to-Graph Classifier to classify legal documents based on facts and
reasons rather than topics. They extract key entities and represent legal documents using four distinct relation
graphs capturing different aspects of entity relationships. A GAT [119] is used to learn document representations
from the combined graph, improving classification by focusing on factual content.
Mamooler et al. [76] propose an active learning pipeline for fine-tuning PLMs for LTC. They address chal-
lenges of specialized vocabulary and high annotation costs. Their method involves continued pre-training of
RoBERTa [70] on legal texts, knowledge distillation using a pre-trained sentence transformer, and an efficient
initial sampling strategy by clustering unlabeled data. This approach reduces the number of labeling actions
required and improves efficiency in adapting models to LTC tasks.
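One way to picture an initial sampling strategy based on clustering unlabeled data is to cluster document embeddings and send the document closest to each centroid for annotation. The sketch below is an assumption about how such a step could look, not the pipeline of Mamooler et al. [76]; the plain k-means routine, the evenly spaced initialization, and the two-dimensional toy embeddings are illustrative choices.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=10):
    """Plain k-means with a deterministic, evenly spaced initialization."""
    centroids = [list(points[i * len(points) // k]) for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for point in points:
            nearest = min(range(k), key=lambda c: squared_distance(point, centroids[c]))
            clusters[nearest].append(point)
        for c, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids


def select_initial_samples(points, k):
    """Return the index of the point closest to each centroid."""
    chosen = []
    for centroid in kmeans(points, k):
        index = min(range(len(points)), key=lambda i: squared_distance(points[i], centroid))
        chosen.append(index)
    return chosen


# Hypothetical 2-D embeddings of six unlabeled documents forming two groups.
embeddings = [(0.0, 0.1), (0.1, 0.0), (0.05, 0.05), (5.0, 5.1), (5.1, 5.0), (5.05, 5.05)]
to_label = select_initial_samples(embeddings, k=2)
print(to_label)
```

Selecting one representative per cluster spreads the first annotation budget over distinct regions of the data instead of spending it on near-duplicates.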

4.3.2 Datasets. LTC datasets are characterized by their domain-specific vocabulary and multi-label nature,
requiring models to interpret complex legal texts and categorize them into single or multiple legal themes.
Chalkidis et al. [16] release EURLEX57K, a dataset containing 57,000 EU legislative documents from the
EUR-LEX portal4, annotated with EUROVOC5 concepts. This dataset facilitates research in LTC, including extreme
multi-label text classification, few-shot, and zero-shot learning, with documents tagged with an expansive set of
descriptors.
Tuggener et al. [117] introduce LEDGAR, a multi-label corpus of legal provisions from contracts scraped from
the U.S. Securities and Exchange Commission’s website. The dataset includes over 846,000 provisions across
60,540 contracts, with an extensive label set suitable for text classification and legal studies, aiding in developing
advanced legal NLP models.

4 lex.europa.eu/
5 http://eurovoc.europa.eu/

Chalkidis et al. [17] present MULTI-EURLEX, a multilingual dataset containing 65,000 EU laws translated into
23 official EU languages, annotated with EUROVOC labels. The dataset emphasizes temporal concept drift by
adopting chronological splits, enhancing its utility for sophisticated LTC tasks requiring understanding nuanced
legal terms across different time periods.
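A chronological split can be sketched in a few lines. The example below is only an illustration of the general idea; the field names `id` and `date` and the 80/10/10 proportions are assumptions, not the actual MULTI-EURLEX configuration.

```python
from datetime import date


def chronological_split(documents, train_frac=0.8, dev_frac=0.1):
    """Sort by date, then cut so that later documents land in dev and test."""
    ordered = sorted(documents, key=lambda doc: doc["date"])
    train_end = int(len(ordered) * train_frac)
    dev_end = train_end + int(len(ordered) * dev_frac)
    return ordered[:train_end], ordered[train_end:dev_end], ordered[dev_end:]


# Hypothetical documents, deliberately listed out of order.
documents = [
    {"id": year, "date": date(year, 1, 1)}
    for year in (2005, 2001, 2009, 2003, 2000, 2007, 2002, 2008, 2004, 2006)
]
train, dev, test = chronological_split(documents)
print(len(train), len(dev), len(test))
```

Because every test document is newer than every training document, a model evaluated this way is exposed to any drift in how concepts are used over time, which a random split would hide.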
Papaloukas et al. [90] introduce the Greek Legal Code dataset, categorizing approximately 47,000 Greek
legislative documents into a detailed multi-level classification system. The dataset is structured into volumes,
chapters, and subjects, each containing diverse legal documents from Greek legislation history, supporting LTC
in the Greek legal domain.
Song et al. [106] introduce POSTURE50K, a legal dataset containing 50,000 U.S. legal opinions annotated
with Legal Procedural Postures ranging from common to rare motions. The dataset includes an innovative split
strategy to support supervised and zero-shot learning evaluations, ensuring infrequent categories are adequately
represented, enhancing model generalizability and testing accuracy.
Graham et al. [45] develop a domain-specific dataset for LTC focusing on deontic modalities in contract
sentences. They manually annotated contract sentences to train models for identifying deontic sentences like
permissions, obligations, and prohibitions. The corpus, derived from the Contract Understanding Atticus Dataset
(CUAD) [47], provides a resource for studying functional categories crucial for legal analysis.

4.4 Legal Document Summarization


4.4.1 Task. LDS is a specialized branch of automatic summarization that focuses on condensing legal texts, such
as court judgments, into clear and informative summaries. Unlike general text summarization, which extracts key
details without following specific formatting rules, LDS must account for the distinct structure and specialized
content of legal documents. These documents often include complex details like article numbers, statutory
language, and citations that are critical for presenting the legal arguments and decisions accurately. The inherent
complexity of legal texts, characterized by their extensive length and detailed internal structures (such as sections,
articles, and paragraphs in statutes), demands tailored summarization techniques. This need is further emphasized
by the hierarchical importance of documents based on their judicial origin, where the interpretation of texts can
vary significantly between higher and lower court opinions [55].
LDS can be approached through extractive and abstractive methods. Extractive summarization techniques in
LDS focus on identifying and extracting the most critical sentences or phrases directly from the text, maintaining
the original wording and meaning. In contrast, abstractive summarization involves generating new sentences
that paraphrase the most important information, aiming for conciseness and coherence while ensuring that the
essence of the legal text is preserved. This subsection explores existing approaches to LDS, highlights the
subtle differences that set it apart from more general summarization methods, and discusses the challenges and
solutions specific to the legal domain.
Several systems have been specifically designed to summarize legal documents. One of the first systems in this
field was the Fast Legal EXpert CONsultant (FLEXICON), created by Gelbart and Smith in 1991 [40]. FLEXICON
utilizes a keyword-based approach [41], scanning a comprehensive database of terms to pinpoint crucial segments
of text. Following this, Moens et al. [80] introduced the SALOMON system in 1999, which employs cosine
similarity to cluster similar text regions, aiming to highlight relevant topics within the documents. This method
aligns with other similarity-based extractive techniques, such as the work of Erkan and Radev [35]. Another system,
LetSum, devised by Farzindar and Lapalme in 2004 [36], also adopts a keyword-centric strategy but uses “cue
phrases” to identify text related to specific themes such as ‘Introduction’, ‘Context’, and ‘Conclusion’. Although
LetSum was fairly successful in mimicking human-generated summaries, it tended to produce summaries that
were excessively lengthy.

Building on previous developments in LDS, Polsley et al. [92] introduce CaseSummarizer, a tool designed for
the legal domain that pre-processes legal texts into sentences, scores them using a TF-IDF matrix from extensive
legal case reports, and enhances sentence scoring by identifying entities, dates, and section headings. The tool
provides a user-friendly interface with scalable summaries, lists of entities and abbreviations, and a significance
heat map.
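TF-IDF-based sentence scoring of this kind can be sketched with the standard library alone. The code below is an illustrative approximation and not the CaseSummarizer implementation: the smoothed IDF formula, the length normalization, and the 1.2 multiplier for sentences containing a year are assumptions, and the background documents are invented toy data.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


def build_idf(background_docs):
    """Return a smoothed IDF function estimated from background documents."""
    n_docs = len(background_docs)
    doc_freq = Counter()
    for doc in background_docs:
        doc_freq.update(set(tokenize(doc)))
    return lambda term: math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0


def score_sentence(sentence, idf, date_bonus=1.2):
    """Length-normalized TF-IDF score, boosted when the sentence mentions a year."""
    tokens = tokenize(sentence)
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    score = sum(tf * idf(term) for term, tf in counts.items()) / len(tokens)
    if re.search(r"\b(1[89]|20)\d{2}\b", sentence):  # assumed heuristic for dates
        score *= date_bonus
    return score


def summarize(sentences, idf, k=1):
    """Keep the k best-scoring sentences, in their original order."""
    ranked = sorted(range(len(sentences)),
                    key=lambda i: score_sentence(sentences[i], idf), reverse=True)
    return [sentences[i] for i in sorted(ranked[:k])]


background = ["the court heard the appeal", "the court dismissed the claim",
              "the tenant breached the lease"]
sentences = ["The court heard the matter.", "The tenant breached the lease in 2015."]
idf = build_idf(background)
summary = summarize(sentences, idf, k=1)
print(summary)
```

Varying `k` gives summaries of different lengths from the same ranking, which is one simple way to obtain the scalable summaries mentioned above.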
Zhong et al. [140] propose an automatic extractive summarization system for legal cases concerning Post-
traumatic Stress Disorder from the US Board of Veterans’ Appeals. It employs a train-attribute-mask pipeline
using a CNN classifier to iteratively select predictive sentences from case texts.
Nguyen et al. [83] propose an RL framework to enhance deep summarization models for the legal domain,
utilizing Proximal Policy Optimization with a reward function that integrates both lexical and semantic criteria.
They fine-tune an extractive summarization backbone based on BERTSUM [69], employing a reward model
that includes lexical, sentence, and keyword-level semantics to produce better legal summaries. Schraagen et al.
[99] apply an RL approach with a Bi-LSTM and a deep learning approach based on the BART transformer
model to abstractive summarization of the Dutch case verdict database Rechtspraak.nl, combining extractive and
abstractive summarization to retain core facts while creating concise summaries.
Zhong and Litman [141] focus on extractive summarization of legal case decisions, proposing an unsupervised
graph-based ranking model that exploits document structure properties. They add a reweighting algorithm to the
HipoRank model [31] to improve sentence selection, aiming to reduce redundancy and to favor argumentative
sentences from underrepresented sections.
Moro et al. [82] introduce a transfer learning approach that combines extractive and abstractive summarization
techniques to address the lack of labeled legal summarization datasets, outperforming previous results on the
Australian Legal Case Reports dataset and establishing a new baseline for abstractive summarization.
Jain et al. [49] propose a sentence scoring approach, DCESumm, which combines supervised sentence-level
summary relevance prediction with unsupervised clustering-based document-level score enhancement. They
utilize a Legal BERT-based Multi-Layer Perceptron (MLP) model to predict the summary relevance of each sentence,
refining scores through deep embedded sentence clustering to enhance the selection process by considering the
global context of the document.
Liu et al. [68] present Common Law Court Judgment Summarization (CLSum), a pioneering dataset for
summarizing multi-jurisdictional common law court judgments, leveraging large language models for data
augmentation, summary generation, and evaluation. They employ a two-stage summarization process with
techniques like sparse attention mechanisms and efficient training methods to process lengthy legal documents
within limited computational resources.

4.4.2 Datasets. LDS datasets are largely built from structured court proceedings and decisions, providing rich
sources for both extractive and abstractive summarization methods. These datasets often emphasize abstractive
summarization to achieve concise, readable summaries that transform the original legal language into more
accessible forms [102].
Zhong et al. [140] develop a dataset from 972,522 Board of Veterans’ Appeals decisions, focusing on single-issue
cases related to Post-traumatic Stress Disorder. The dataset consists of 112 carefully sampled decisions, annotated
by legal experts to capture key information such as ‘Issue’, ‘Procedural History’, ‘Service History’, ‘Outcome’,
‘Reasoning’, and ‘Evidential Support’.
Shen et al. [102] introduce Multi-LexSum, an abstractive summarization dataset tailored for U.S. federal civil
rights lawsuits, containing 40,000 source documents and 9,000 expert-written summaries of diverse lengths,
providing a rich resource for testing advanced summarization models.
Liu et al. [68] publish CLSum, a dataset designed for summarizing multi-jurisdictional common law court
judgments from Australia, Hong Kong, the United Kingdom, and Canada. To address the scarcity of labeled data
in the legal domain, the dataset leverages large language models for data augmentation and incorporates legal
knowledge to enhance summary generation and evaluation. CLSum includes a comprehensive collection of
judgments and summaries from prominent court websites, and its authors employ novel techniques to enrich
training sets and improve model performance in few-shot and zero-shot learning scenarios.

4.5 Legal Named Entity Recognition


4.5.1 Task. NER is a fundamental task in NLP that involves identifying specific segments of text and classifying
them into predefined categories such as ‘organization’, ‘person’, and ‘location’ [59]. In the legal domain, NER
extends to specialized recognition tasks that focus on extracting entities unique to legal texts, such as laws,
legal norms, and procedural terms. This specialized form of NER is crucial for structuring legal documents and
enhancing legal IR systems. Unlike general NER systems that handle common entity types, legal NER must
navigate the complex language and structured format of legal documents, underscoring the need for systems and
methodologies specifically tailored to the legal context.
Dozier et al. [32] present pioneering work in NER within legal documents such as US case law and pleadings,
employing three methodologies: lookup, contextual rules, and statistical models to detect entities like judges,
attorneys, and legal terms. Their system adapts these approaches to the specialized context of legal texts,
processing various types of documents and extracting legal entities. This work highlights the challenges and
necessary adaptations for deploying NER in the legal domain, where specialized language must be handled with
the high accuracy that successful legal analysis requires.
Păis et al. [94] develop a NER model for the legal domain that integrates Bi-LSTM cells and a Conditional
Random Fields (CRFs) layer, utilizing multiple data sources and embedding types. Their system architecture
combines word embeddings from pre-trained models, character embeddings, gazetteer resources from the
GeoNames6 database and JRC-Names [110], and known affixes to enrich the model’s understanding of legal
text. Their training process involved fine-tuning the word embeddings, while dynamically learning character
embeddings with subsequent Bi-LSTM layers, enhancing the model’s capability to generalize across unseen
texts. They implemented the system using a modified version of NeuroNER [27], allowing for online model
serving and incorporating features like dropout for regularization and gradient clipping to handle exploding
gradients. They also explored ensemble methods to improve model accuracy by combining results from multiple
model configurations, measuring the effectiveness through precision, recall, and F1 scores against a gold standard
corpus.
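Entity-level precision, recall, and F1 against a gold standard can be computed as in the sketch below. It assumes exact-match scoring, where an entity counts as correct only if its span and type both match; the `(start, end, type)` tuple representation and the example entities are illustrative assumptions, not details taken from Păis et al. [94].

```python
def precision_recall_f1(gold, predicted):
    """Exact-match scores over sets of (start, end, entity_type) tuples."""
    gold_set, predicted_set = set(gold), set(predicted)
    true_positives = len(gold_set & predicted_set)
    precision = true_positives / len(predicted_set) if predicted_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical annotations: one wrong type and one spurious entity in the prediction.
gold = [(0, 2, "PER"), (5, 7, "ORG"), (10, 12, "LOC"), (15, 18, "LEGAL")]
predicted = [(0, 2, "PER"), (5, 7, "ORG"), (10, 12, "ORG"), (15, 18, "LEGAL"), (20, 21, "TIME")]
precision, recall, f1 = precision_recall_f1(gold, predicted)
print(round(precision, 3), round(recall, 3), round(f1, 3))
```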
Smădu et al. [105] explore domain adaptation in Legal NER, focusing on the Romanian and German languages.
They utilize a model architecture that integrates a pre-trained BERT [29] layer for feature extraction with Bi-LSTM
networks to handle sequence dependencies and CRFs for sequence tagging. Their approach employs domain
adaptation techniques through a gradient reversal layer connected to a domain discriminator, aimed at reducing
domain-specific biases and enhancing feature transferability across domains. This model trains on both legal and
general domains simultaneously, adapting to the peculiarities of each through adversarial learning that modifies
the learning process dynamically. This setup allows for improved generalization of the NER system across varied
linguistic and domain contexts, though the actual performance varied with minimal improvements noted for
German and decreased performance for the Romanian legal dataset.
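The gradient reversal layer mentioned above can be stated compactly. The formulation below follows the standard description of domain-adversarial training; the symbols ($G_f$ for the feature extractor, $G_y$ for the tagger, $G_d$ for the domain discriminator, $\lambda$ for the reversal strength, $\mu$ for the learning rate) are notation introduced here for illustration and do not come from Smădu et al. [105].

```latex
% Forward pass: identity. Backward pass: the gradient is negated and scaled.
R_\lambda(\mathbf{x}) = \mathbf{x},
\qquad
\frac{\partial R_\lambda}{\partial \mathbf{x}} = -\lambda \mathbf{I}

% Tagging loss L_y and domain loss L_d share the features f = G_f(x; \theta_f).
L(\theta_f, \theta_y, \theta_d)
  = L_y\bigl(G_y(\mathbf{f}; \theta_y),\, y\bigr)
  + L_d\bigl(G_d(R_\lambda(\mathbf{f}); \theta_d),\, d\bigr)

% Resulting update of the feature extractor: it descends on L_y but ascends on L_d.
\theta_f \leftarrow \theta_f - \mu \left(
  \frac{\partial L_y}{\partial \theta_f} - \lambda \frac{\partial L_d}{\partial \theta_f}
\right)
```

The discriminator itself is still trained to minimize $L_d$, so the feature extractor is pushed toward representations from which the domain (legal or general) is hard to recover.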

4.5.2 Datasets. Leitner et al. [61] present a dataset for NER focused on German federal court decisions, containing
approximately 67,000 sentences and over two million tokens. This dataset features 54,000 manually annotated
entities distributed across 19 fine-grained semantic classes, specifically tailored to the legal domain, such as
court, judge, lawyer, law, person, and legal literature, along with over 35,000 TimeML-based time expressions.
The annotations cover both broad categories like location, person, and organization, and more specialized ones
like legal norms and case-by-case regulations, distinguishing between different types of legal acts and literature.
This dataset’s comprehensive annotation process involved multiple cycles to refine the tagging guidelines and
enhance annotation quality.

6 https://www.geonames.org/

Păis et al. [94] introduce the LegalNERo corpus, a manually annotated resource for NER in the Romanian legal
domain, featuring 370 legal documents annotated with five general entity types: person, location, organization,
time expressions, and legal references. This corpus was developed to support both specific legal domain NER
tasks and more general NER applications by enabling compatibility with existing general-purpose NER systems.
The corpus includes rich entity annotations, with legal references showing the highest token count per entity,
indicating their complexity and length. The detailed annotation process, including inter-annotator agreement
assessed by Cohen’s Kappa, and the subsequent mapping of entities to RDF format, highlights the corpus’s utility
and precision for advancing NER research and applications within the legal domain.
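Cohen's Kappa corrects the raw agreement between two annotators for the agreement expected by chance. The sketch below shows the computation on invented token-level labels; the label names and the two annotation sequences are hypothetical and are not taken from the LegalNERo corpus.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty set of items")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability that both pick the same label independently.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical annotations of ten tokens by two annotators.
annotator_a = ["PER", "PER", "ORG", "ORG", "LOC", "LOC", "O", "O", "O", "O"]
annotator_b = ["PER", "ORG", "ORG", "ORG", "LOC", "O", "O", "O", "O", "O"]
kappa = cohens_kappa(annotator_a, annotator_b)
print(round(kappa, 3))
```

Here the annotators agree on 8 of 10 tokens, but part of that agreement is expected by chance given how often each uses the label `O`, so the kappa value is lower than the raw 0.8.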
Au et al. [7] introduce the E-NER dataset, an annotated collection derived from the US SEC’s EDGAR filings,
designed for legal NER. This dataset contains filings that are rich in text, such as quarterly reports (Form 10-Q)
and significant event announcements (Form 8-K), from which sentences were extracted and annotated with seven
named entity classes more tailored to legal content than those in the standard CoNLL dataset [115]. The entities
include Person, Location, Organization, Government, Court, Business, and Legislation/Act, adjusting the CoNLL
classes to better suit legal documents. E-NER contains significantly longer sentences compared to CoNLL and
includes detailed annotations of financial entities from legal company filings.
Kalamkar et al. [54] present a comprehensive corpus aimed at enhancing legal NER, containing 46,545 entities
across 14 types identified in Indian High Court and Supreme Court judgments. This corpus, split into preamble and
judgment sections, includes diverse entity types detailed in their legal NER taxonomy, such as court, petitioner,
respondent, and statute, among others. The training set, drawn from judgments between 1950 and 2017, features
29,964 entities, while the development and test sets span 2018 to 2022. This dataset not only facilitates training
and evaluation of NER models specific to the legal domain but also provides a structured framework for assessing
the performance of NER systems on legal texts. Their approach leverages a combination of manual annotation
and ML techniques to ensure the precision of entity recognition in legal judgments.

4.6 Large Legal Corpora


The foundational step in training an LLM is the use of extensive corpora. For the development of a sophisticated
LLM that effectively addresses a wide range of legal NLP tasks, it is crucial to have access to large-scale legal
corpora. These corpora must meet several critical criteria to ensure their effectiveness and ethical utility. First,
they should be transparent in their sourcing and composition, allowing users to understand the origins and types
of included data. Additionally, the privacy of individuals should be safeguarded, preventing any potential invasion
of personal data. It is also imperative to minimize toxicity and bias within the corpora to promote fairness and
accuracy in model outcomes. By following these principles, we can enhance the capabilities of LLMs in the legal
domain. This ensures they are both powerful and reliable tools for legal analysis. In this section, we will explore
the existing large legal corpora and online databases.
Zheng et al. [135] introduce the CaseHOLD dataset, a novel benchmark for evaluating NLP models in the
legal domain, designed to address the challenge of identifying the legal holdings from case texts. The dataset
contains over 53,000 multiple choice questions derived from U.S. case law citations, where each question requires
the identification of the correct holding from a set of potential answers. This task, simulating a fundamental
lawyering skill taught in law school, involves contextual understanding and application of legal rules to factual
situations. CaseHOLD is aimed at enhancing model training by focusing on semantic matching and the ability
to discern nuanced legal principles. The dataset is structured to provide a challenging yet accessible resource
for NLP researchers, with a clear focus on promoting deeper understanding and application of legal rules in
automated systems. It utilizes a format where a cited text serves as a prompt with five answer options (one
correct holding and four closely related incorrect holdings) to refine the models’ abilities to accurately reflect
legal reasoning.
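The multiple-choice format can be made concrete with a small data structure and an accuracy loop. This is a sketch under stated assumptions: the class and field names are hypothetical rather than the dataset's actual schema, the example question is invented, and the word-overlap baseline is a deliberately naive stand-in for a real model.

```python
from dataclasses import dataclass


@dataclass
class HoldingQuestion:
    citing_prompt: str   # text surrounding the citation
    options: list        # five candidate holdings
    answer_index: int    # position of the correct holding

    def __post_init__(self):
        if len(self.options) != 5:
            raise ValueError("expected exactly five candidate holdings")
        if not 0 <= self.answer_index < 5:
            raise ValueError("answer_index must point at one of the options")


def overlap_baseline(prompt, options):
    """Pick the option sharing the most words with the prompt."""
    prompt_words = set(prompt.lower().split())
    return max(range(len(options)),
               key=lambda i: len(prompt_words & set(options[i].lower().split())))


def accuracy(questions, choose):
    correct = sum(choose(q.citing_prompt, q.options) == q.answer_index for q in questions)
    return correct / len(questions)


# Invented example, for illustration only.
question = HoldingQuestion(
    citing_prompt="the lease was void because the landlord failed to disclose defects",
    options=[
        "holding that a contract requires consideration",
        "holding that failure to disclose defects voids the lease",
        "holding that the statute of limitations had expired",
        "holding that the landlord may evict a tenant without notice",
        "holding that damages were limited to the deposit",
    ],
    answer_index=1,
)
print(accuracy([question], overlap_baseline))
```

Because the incorrect options in the real dataset are closely related to the correct one, surface overlap of this kind would be expected to be a weak baseline there, which is the point of the task design.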
Chalkidis et al. [19] introduce the Legal General Language Understanding Evaluation (LexGLUE) benchmark,
a comprehensive suite of datasets aimed at assessing the capabilities of NLP models across various legal tasks.
The benchmark includes datasets such as ECtHR [15], SCOTUS7, EUR-LEX, LEDGAR [117], UNFAIR-ToS [67],
and CaseHOLD [135], each chosen for its complexity, relevance, and need for legal expertise. These datasets
cover a range of tasks from multi-label and multi-class classification to multiple-choice questions and are split
chronologically into training, development, and test sets to provide standardized evaluation metrics. For instance,
the ECtHR datasets focus on violations of European Convention on Human Rights provisions, the SCOTUS dataset
classifies U.S. Supreme Court opinions by legal issues, the EUR-LEX dataset involves labeling EU laws with EuroVoc
concepts, LEDGAR classifies provisions of U.S. contracts, UNFAIR-ToS identifies unfair terms in online service
agreements, and CaseHOLD involves answering questions about legal rulings. This benchmark facilitates the
testing of NLP models, addressing the challenges of legal text comprehension and understanding required for
effective application in the legal domain.
Chalkidis et al. [20] present FairLex, a benchmark suite consisting of four legal datasets (ECtHR [15], SCOTUS,
FSCS, and CAIL [127]) that address the fairness of NLP applications across diverse legal jurisdictions and
languages, including English, German, French, Italian, and Chinese. These datasets, curated from the Council of Europe,
USA, Switzerland, and China, cover various legal tasks such as judgment prediction, issue area classification,
and crime severity prediction, aiming to test the performance and fairness of LMs in recognizing and classifying
legal texts. FairLex focuses on ensuring demographic, regional, and legal topic fairness by analyzing attributes
like gender, age, region of origin, and legal areas within cases. Each dataset in FairLex provides a substantial
number of cases, systematically divided into training, development, and test sets, and includes detailed attributes
like the defendant state in ECtHR, decision direction in SCOTUS, legal areas in FSCS, and demographic details
in CAIL. The CAIL dataset from China contains over a million cases focusing on criminal law, annotated with
demographics and regional classifications, which are used to explore the crime severity prediction task.
Henderson et al. [46] introduce the ‘Pile of Law’, the first large-scale corpus in the legal domain, a 256GB
collection of open-source English-language legal and administrative data. This dataset includes
contracts, court opinions, legislative records, and administrative rules, curated to explore data sanitization norms
across legal and administrative settings and serve as a tool for pre-training legal-domain LMs. They emphasize
the legal norms governing privacy and toxicity filtering, detailing how the dataset reflects these norms through
built-in filtering mechanisms in the collected data, which include court filings, legal analyses, and government
publications. By analyzing how legal and administrative entities handle sensitive information and potentially
offensive content, the paper provides actionable insights for researchers to improve content filtering practices
before pre-training LLMs, thereby enhancing the ethical use of NLP in legal applications.
Rabelo et al. [95] summarize the 8th Competition on Legal Information Extraction and Entailment (COLIEE
2021), which featured five tasks across case and statute law, engaging participants from various teams to apply
diverse NLP approaches. The competition tasks included case law retrieval and entailment, as well as statute law
retrieval and entailment with and without prior retrieved data. Specifically, Task 1 focused on extracting relevant
supporting cases from a corpus, while Task 2 involved identifying paragraphs from cases that entail a given new
case fragment. For statute law, Tasks 3 and 4 entailed retrieving and answering questions based on civil code
statutes, with Task 5 challenging participants to answer without pre-retrieved statutes. The datasets used varied
in complexity, from 4,415 case files in Task 1 with a need to identify noticed cases without relying on citations,
to the civil code-based Tasks 3, 4, and 5 which adapted to recent legal revisions in Japanese law and excluded
untranslated parts, reflecting the ongoing evolution and challenge in legal NLP applications.

7 https://www.supremecourt.gov

Barale et al. [9] present AsyLex, a pioneering dataset tailored for Refugee Law applications, featuring 59,112
documents from Canadian refugee status determinations spanning from 1996 to 2022. This dataset is designed to
enhance the capabilities of NLP models in legal research by providing 19,115 gold-standard human-annotated and
30,944 inferred labels for entity extraction and LJP. Key contributions include anonymizing decision documents,
employing a robust annotation methodology, and creating datasets for specific NLP tasks like entity extraction
and judgment prediction. This rich corpus, with detailed annotations across 22 categories, supports complex legal
NLP tasks, thereby filling the gap in resources for the legal domain.
Niklaus et al. [86] introduce LEXTREME, a multilingual benchmark specifically designed to evaluate LMs on
legal NLP tasks, a critical step given the unique challenges of legal language. Surveying legal NLP literature from
2010 to 2022, they curate 11 datasets spanning 24 languages and cover a variety of legal domains, employing
datasets that only involve human-annotated texts or those with annotations derived through clear methodological
frameworks. They introduce two aggregate scores to facilitate fair comparison across models: the dataset aggregate
score and the language aggregate score, revealing a performance correlation with model size on LEXTREME. The
benchmark consists of three task types: Single Label Text Classification, Multi Label Text Classification, and NER,
using existing splits for training, validation, and testing when available, or creating random splits otherwise. This
effort marks a significant advancement in testing NLP capabilities across a diverse range of legal documents and
languages.
Park and James [91] explore the creation of a Natural Language Inference dataset within the legal domain,
focusing on criminal court verdicts in Korean. Their methodology includes the innovative use of adversarial
hypothesis generation to challenge annotators and enhance the robustness of the dataset, supported by visual
tools for hypothesis network construction. The data collection involves extracting context from verdicts and
augmenting it using Easy Data Augmentation [123] techniques and round-trip translation to generate a dataset
for training and testing Natural Language Inference models. The study highlights issues such as annotators’
limited domain knowledge and challenges in handling long contexts but provides solutions like targeted data
collection and the use of gamification to boost annotator engagement and productivity.
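Two of the Easy Data Augmentation operations, random swap and random deletion, need no external resources and can be sketched directly. The snippet is an illustration on an English toy sentence, whereas the study itself works with Korean verdicts; the function signatures, the seed, and the deletion probability are assumptions made here, and the synonym-based operations of EDA are omitted because they require a thesaurus.

```python
import random


def random_swap(tokens, n_swaps, rng):
    """Swap two randomly chosen positions, n_swaps times."""
    tokens = list(tokens)
    if len(tokens) < 2:
        return tokens
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(tokens)), 2)
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens


def random_deletion(tokens, p, rng):
    """Drop each token with probability p, always keeping at least one."""
    if not tokens:
        return []
    kept = [token for token in tokens if rng.random() >= p]
    return kept if kept else [rng.choice(tokens)]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
tokens = "the defendant was found guilty of fraud".split()
swapped = random_swap(tokens, n_swaps=2, rng=rng)
shortened = random_deletion(tokens, p=0.2, rng=rng)
print(" ".join(swapped))
print(" ".join(shortened))
```

Such perturbations can change the meaning of a legal sentence (for example, by deleting a negation), so augmented examples for an inference dataset would typically need their labels checked.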
Goebel et al. [44] summarize the 10th Competition on Legal Information Extraction and Entailment (COLIEE
2023), featuring four tasks across case and statute law with participation from ten different teams engaging in
multiple tasks. Task 1 involves legal case retrieval, requiring participants to extract supporting cases from a
corpus, and Task 2 focuses on legal case entailment, identifying paragraphs that entail aspects of a new case. Tasks
3 and 4, based on Japanese Civil Code statutes from the bar exam, involve retrieving relevant articles and verifying
statements, respectively. The competition leverages a dataset of over 5,700 case law files and introduces new
query cases and test questions sourced from recent bar exams, testing the efficacy of different teams’ approaches
in handling complex legal texts and hypotheses in a controlled competitive environment.
Östling et al. [142] introduce the Cambridge Law Corpus (CLC), a comprehensive legal dataset featuring
258,146 cases from UK courts, dating from the 16th century to the present. The corpus includes raw text and
metadata across various court types, and is structured in XML format for ease of use and annotated for case
outcomes in a subset of 638 cases. Additionally, the CLC is supported by a Python library for data manipulation
and ML applications.
Niklaus et al. [87] present the MultiLegalPile, the largest open-source multilingual legal corpus available,
totaling 689GB and spanning 17 jurisdictions across 24 languages. This extensive dataset is designed to facilitate
training of LLMs within the legal domain, featuring diverse legal text types including case law, legislation, and
contracts, predominantly in English due to the integration of the ‘Pile of Law’ [46] dataset. Through careful
regex-based filtering from the mC4 corpus and manual reviews, the team ensures high precision in legal content
selection. The corpus, efficiently compressed using XZ and formatted in JSONL, supports comprehensive NLP
research and modeling, emphasizing its broad applicability in advancing legal AI technologies.
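Reading and writing XZ-compressed JSONL requires only the Python standard library. The sketch below round-trips a few records through an in-memory buffer; the record fields (`language`, `type`, `text`) are hypothetical and are not claimed to match the actual MultiLegalPile schema.

```python
import io
import json
import lzma


def write_jsonl_xz(records, fileobj):
    """Write one JSON object per line into an XZ-compressed stream."""
    with lzma.open(fileobj, "wt", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl_xz(fileobj):
    """Stream records back one at a time without loading the whole file."""
    with lzma.open(fileobj, "rt", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield json.loads(line)


# Hypothetical records; a file path could be passed instead of the buffer.
records = [
    {"language": "de", "type": "legislation", "text": "Die Würde des Menschen ist unantastbar."},
    {"language": "en", "type": "caselaw", "text": "The appeal is dismissed."},
]
buffer = io.BytesIO()
write_jsonl_xz(records, buffer)
buffer.seek(0)
restored = list(read_jsonl_xz(buffer))
print(len(restored), "records restored")
```

Streaming line by line matters at this scale, since a corpus of several hundred gigabytes cannot be decompressed into memory at once.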

5 LEGAL LANGUAGE MODELS AND METHODS FOR LEGAL DOMAIN ADAPTATION


In the fast-moving field of NLP, LLMs have become a key tool for processing and understanding large amounts of
unstructured text data. These models, initially trained on broad datasets like Wikipedia, have shown great skill
across various language tasks. Building on this success, the legal technology community is increasingly interested
in using these powerful models for Legal NLP applications. This involves adapting these general-domain models
to legal texts and further training them on specialized legal documents. Such efforts aim to reduce the domain
gap and customize the models to better understand the complex language used in legal documents. In this section,
we will explore how these models are being adapted and applied within the legal domain to enhance Legal NLP
applications.
Following the methodology of this survey, we studied all peer-reviewed LMs and methods. However, owing to the
significant challenges present in the legal domain, many legal LMs have not undergone peer review. Given the
scarcity of adequate peer-reviewed resources, we prioritized peer-reviewed sources first and then considered the
most well-known and widely used non-peer-reviewed legal LMs. Despite their lack of formal peer review, these
models have gained considerable attention and usage in the field.

5.1 Language Models


Chalkidis et al. [18] present an in-depth analysis of applying BERT [29], a pre-trained language model, in the legal
domain, showcasing the need for domain-specific adaptation to enhance performance on legal NLP tasks. They
explore three strategies: using standard BERT directly, further pre-training on legal corpora, and pre-training
from scratch with legal-specific data. Their study found that both further pre-training and pre-training from
scratch generally outperform the use of BERT directly. They introduce legal-bert, a specialized family of
models optimized for legal text, which includes versions for varied computational capacities and demonstrates
competitive performance with a lower environmental impact.
Xiao et al. [126] introduce Lawformer, a Longformer-based [10] language model adapted for Chinese legal texts,
designed to handle extensive document lengths common in legal data. Recognizing the limitation of standard
PLMs with shorter token capacities, Lawformer employs a unique combination of sliding window, dilated sliding
window, and global attention mechanisms to efficiently process long texts, making it suitable for legal AI tasks like
judgment prediction and LQA. Pre-trained on a vast corpus of Chinese legal documents segmented into criminal
and civil cases, Lawformer integrates complex sequential dependencies across tokens using these attention
techniques, enhancing model performance for legal-specific tasks.
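The three attention patterns named above can be illustrated with a small mask-building sketch. This is not Lawformer's implementation; it only shows, under simplifying assumptions (a single attention head, a symmetric window, and hypothetical function names), which token pairs are allowed to attend to each other.

```python
def build_attention_mask(seq_len, window, dilation=1, global_positions=()):
    """Return a seq_len x seq_len boolean matrix.

    mask[i][j] is True if token i may attend to token j.
    """
    half = window // 2
    mask = [[False] * seq_len for _ in range(seq_len)]
    for i in range(seq_len):
        # (Dilated) sliding window: offsets 0, +/-dilation, ..., +/-half*dilation.
        for step in range(-half, half + 1):
            j = i + step * dilation
            if 0 <= j < seq_len:
                mask[i][j] = True
    for g in global_positions:
        # Global attention is symmetric: g sees every token, every token sees g.
        for k in range(seq_len):
            mask[g][k] = True
            mask[k][g] = True
    return mask


def attended_pairs(mask):
    """Count the token pairs that are allowed to attend to each other."""
    return sum(sum(1 for allowed in row if allowed) for row in mask)


def attended_positions(mask, i):
    """List the positions that token i may attend to."""
    return [j for j, allowed in enumerate(mask[i]) if allowed]
```

With a fixed window, the number of allowed pairs grows linearly with the sequence length rather than quadratically, which is what makes long legal documents tractable; dilation widens the receptive field at no extra cost, and a few global positions let information flow across the whole text.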
Al-qurishi et al. [1] introduce AraLegal-BERT, a model tailored to the unique linguistic features of Arabic legal
texts. The model enhances NLP applications within the legal field by adapting BERT [29] to Arabic legal content,
which involves pre-training BERT from scratch on a broad range of legal documents, including legislative materials
and contracts.
Colombo et al. [25] introduce SaulLM-7B, a novel LLM specifically designed for legal text comprehension
and generation, built on the 7 billion parameter Mistral [51] architecture. This model is trained on an extensive
English legal corpus, designed to meet the unique challenges of legal syntax and terms. SaulLM-7B uses a two-tier
training approach: continued pre-training on a carefully curated 30 billion token legal dataset and an innovative
instruction fine-tuning method, incorporating both generic and legal-specific instructions to enhance the model’s
performance on legal tasks.
Shi et al. [104] develop Legal-LM, a specialized language model tailored for Chinese legal consulting, enhanced
with a KG to address domain-specific challenges such as data veracity and non-expert user interaction. The
framework involves several steps: extensive pre-training on a rich corpus of legal texts integrated with a legal
KG, keyword extraction and Direct Preference Optimization to refine responses, and the use of an external legal
knowledge base for data retrieval and response validation. This multi-faceted approach ensures that Legal-LM
not only comprehends complex legal language but also generates precise and user-aligned legal advice.

5.2 Methods for Improving In-Domain Adaptability of Legal Language Models


Li et al. [63] explore a novel adaptation of LMs for the legal domain by integrating domain-specific unsupervised
data from public legal forums to optimize prefix domain adaptation, a parameter-efficient learning approach that
trains only about 0.1% of the model’s parameters. They introduce a training methodology where a deep prompt
is specifically tuned using a domain-adapted prefix from legal forums and then utilized in various legal tasks,
demonstrating improved few-shot performance compared to full model tuning methods like legal-bert [18].
This approach significantly reduces computational overhead while maintaining or exceeding performance metrics
across multiple legal tasks, suggesting an efficient and scalable model for legal NLP applications.
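A back-of-the-envelope calculation shows how a prefix-based method can end up training roughly 0.1% of a model's parameters. The sketch below assumes a BERT-base-like encoder (12 layers, hidden size 768, about 110 million parameters) and a prefix that contributes one key vector and one value vector per layer and per prefix position; the prefix length of 6 is a hypothetical value, and any reparameterization network or task head is ignored. None of these numbers are taken from Li et al. [63].

```python
def prefix_parameter_count(num_layers, hidden_size, prefix_length):
    """Trainable parameters when a key prefix and a value prefix are learned per layer."""
    return 2 * num_layers * hidden_size * prefix_length


def trainable_fraction(trainable, total):
    """Share of the model's parameters that receive gradient updates."""
    return trainable / total


# Assumed BERT-base-like configuration (illustrative, not from the cited paper).
NUM_LAYERS = 12
HIDDEN_SIZE = 768
TOTAL_PARAMETERS = 110_000_000  # approximate size of a BERT-base encoder
PREFIX_LENGTH = 6               # hypothetical prefix length

prefix_params = prefix_parameter_count(NUM_LAYERS, HIDDEN_SIZE, PREFIX_LENGTH)
fraction = trainable_fraction(prefix_params, TOTAL_PARAMETERS)
print(f"{prefix_params} trainable parameters, {fraction:.2%} of the model")
# prints "110592 trainable parameters, 0.10% of the model"
```

Under these assumptions the frozen backbone can be shared across tasks, and only the small prefix needs to be stored per task.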
Mamakas et al. [75] explore strategies for adapting pre-trained transformers to cope with the challenges of long
legal texts within the LexGLUE benchmark, focusing on extending input capabilities and enhancing efficiency.
They modify Longformer [10], originally extending up to 4,096 sub-words, to process up to 8,192 sub-words by
reducing local attention window size and incorporating a global token at the end of each paragraph to facilitate
information flow across longer texts. Additionally, they adapt legal-bert [18] to employ TF-IDF representations
to manage longer documents effectively, introducing variants like TF-IDF-SRT-LegalBERT, which deduplicates
and sorts sub-words by TF-IDF scores, and TF-IDF-EMB-LegalBERT, which incorporates a TF-IDF embedding
layer. These adaptations aim to combine the robust capabilities of transformers with the practical requirements of
handling extensive legal documents, surpassing the performance of traditional linear classifiers while maintaining
computational efficiency.
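The deduplicate-and-sort step described for TF-IDF-SRT-LegalBERT can be sketched as follows. This is a simplified illustration rather than the authors' code: the smoothed IDF formula, the alphabetical tie-breaking, the toy corpus, and the function names are all assumptions, and the step that feeds the selected sub-words into the transformer is omitted.

```python
import math
from collections import Counter


def inverse_document_frequency(corpus):
    """Smoothed IDF for every sub-word seen in a corpus of tokenized documents."""
    num_docs = len(corpus)
    doc_freq = Counter(token for doc in corpus for token in set(doc))
    return {tok: math.log((1 + num_docs) / (1 + df)) + 1.0 for tok, df in doc_freq.items()}


def tfidf_sorted_subwords(doc, idf, max_len):
    """Deduplicate a document's sub-words, sort by descending TF-IDF, truncate to max_len."""
    term_freq = Counter(doc)
    scores = {tok: (count / len(doc)) * idf.get(tok, 0.0) for tok, count in term_freq.items()}
    # Ties are broken alphabetically so the output is deterministic.
    ranked = sorted(scores, key=lambda tok: (-scores[tok], tok))
    return ranked[:max_len]


# Toy corpus of already tokenized documents (hypothetical data).
CORPUS = [
    ["the", "lessor", "shall", "repair", "the", "premises", "lessor"],
    ["the", "court", "dismissed", "the", "appeal"],
    ["the", "appeal", "was", "allowed"],
]
IDF = inverse_document_frequency(CORPUS)
```

Because each sub-word appears at most once in the output, a long document collapses to its vocabulary, so far more of its content fits within a fixed input budget; the price is that word order is discarded.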

6 OPEN RESEARCH CHALLENGES


Despite researchers’ efforts in this interdisciplinary field and extensive advancements in AI techniques, Open
Research Challenges (ORCs) still exist. In this section, we identify the ORCs and provide advice to overcome
these challenges.
ORC1: Bias and Fairness. Bias and fairness are crucial concerns in the field of AI, especially at the intersection
with the legal domain where decisions can deeply impact individuals’ lives. The scarcity of unbiased data in
legal domains such as case law complicates the training of AI models, as these models often learn from historical
decisions that may reflect existing human biases [33, 114]. This reliance on biased datasets can lead to unfair and
biased outcomes in classification and prediction tasks. Addressing these issues is critical to ensure that AI-driven
legal decisions uphold the standards of impartiality and fairness required for justice.
ORC2: Interpretability and Explainability. Interpretability and explainability are crucial across various applications
in legal NLP, yet these aspects remain underexplored in many studies. The ability to trace and comprehend
the decision-making process of AI systems is essential for identifying and mitigating biases. Transparent and
understandable AI systems help build trust and ensure they are used responsibly, which is particularly important
in legal contexts where decisions can significantly impact people’s lives. Improving these aspects of AI models is
necessary for their ethical use, ensuring they meet the high standards of fairness required in legal proceedings.
ORC3: Transparent Annotation. Transparently annotated datasets are rare in the field of legal NLP. Often, studies
mention the involvement of expert annotators for tasks such as classification, question answering, or prediction,
but fail to provide details about the annotators’ backgrounds or the specific annotation processes used. To develop
unbiased legal NLP systems, it is necessary to document the dataset creation process thoroughly. This includes
providing detailed descriptions of the annotation procedures and the qualifications of the annotators involved,
which is essential for ensuring the reliability and fairness of the systems trained on these datasets. Researchers
need to prioritize transparency to build trust and allow for effective evaluation in the legal NLP field.
ORC4: Multilingual Capabilities. In legal NLP, enhancing multilingual capabilities remains an underdeveloped
area. While efforts like MultiLegalPile [87] have begun to address this, there remains a gap in research for
many languages, including but not limited to Persian and Arabic. This gap significantly restricts the
application of legal NLP across diverse legal systems worldwide, which is important for broader accessibility.
Multilingual capabilities introduce unique challenges for legal NLP models, primarily due to the distinct linguistic
structures of each language, which often require extensive fine-tuning to ensure accuracy and relevancy in legal
contexts. Furthermore, each legal system possesses its own set of terms and document standards, which can vary
dramatically from one language to another. Therefore, expanding research into these and other underserved
languages is essential for making NLP tools universally applicable and effective.
ORC5: Ontology. The use of ontologies in the legal domain is relatively sparse, yet it holds considerable potential
to enhance the robustness of AI methodologies. An ontology or KG can also enable AI models to draw accurate
inferences regarding the relationship between the terms and thereby better understand and process complicated
legal texts. This approach could advance the capability of AI systems to handle complex legal reasoning and
decision-making processes. However, utilizing ontology in legal NLP faces unique challenges. The complexity of
legal language and the concept of ‘open texture’, where the meaning of legal terms can evolve over time, complicate
the creation of static ontological models [81]. Legal ontologies must be dynamic, reflecting changes in law and its
interpretation over time. Additionally, the integration of real-world and legal concepts within ontologies presents
further complexity, as it requires accommodating both legal terms and their relevant real-world contexts [81].
ORC6: Pre-processing Legal Text. Pre-processing legal texts is challenging due to the distinct nature of legal
documents. Existing legal corpora often consist of raw texts that require extensive cleaning and transformation
to become suitable for ML models. Additionally, legal documents can include complex nested structures, like
clauses within clauses, and cross-references to other legal cases, statutes, or provisions, making it difficult to
break them into coherent units for analysis. These characteristics make it impractical to fine-tune LMs directly
on raw datasets without substantial pre-processing. This requirement complicates the use of large legal datasets,
which are hard to convert into formats that NLP models can readily process and learn from, and it limits the
effectiveness of legal NLP applications.
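As a small illustration of the nested structures and cross-references mentioned above, the sketch below splits a numbered text into clauses with their nesting depth and extracts simple cross-references. The regular expressions and the sample text are hypothetical; real numbering schemes and citation formats vary widely across jurisdictions and document types, so this should be read as a starting point rather than a general solution.

```python
import re

# Hypothetical patterns for illustration; real formats vary by jurisdiction.
CLAUSE_MARKER = re.compile(r"^\s*(\d+(?:\.\d+)*)\s+", re.MULTILINE)
CROSS_REFERENCE = re.compile(
    r"\b(?:Section|Article|Clause)\s+\d+(?:\.\d+)*(?:\([a-z0-9]+\))*"
)


def split_clauses(text):
    """Split text into (clause_number, nesting_depth, body) tuples using numbered markers."""
    matches = list(CLAUSE_MARKER.finditer(text))
    clauses = []
    for index, match in enumerate(matches):
        end = matches[index + 1].start() if index + 1 < len(matches) else len(text)
        number = match.group(1)
        body = " ".join(text[match.end():end].split())
        clauses.append((number, number.count(".") + 1, body))
    return clauses


def find_cross_references(text):
    """Return every cross-reference that matches the illustrative pattern."""
    return CROSS_REFERENCE.findall(text)


SAMPLE = """1 Payment
1.1 The Tenant shall pay the rent as set out in Clause 4.2.
1.2 Late payment is governed by Section 12(b) of the Act.
2 Termination
2.1 Either party may terminate under Article 7."""
```

Even this toy version has obvious failure modes, for example a wrapped line that happens to begin with a number, which hints at why robust pre-processing of legal text remains an open problem.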
ORC7: Reinforcement Learning from Human Feedback (RLHF). The use of RLHF within the legal domain is notably
scarce. Currently, there is only one peer-reviewed work [83] available that explores this approach. This indicates
a significant opportunity for research and development in this area, as RLHF could potentially enhance NLP’s
capability to learn and make decisions based on complex legal data under human guidance. Further exploration
into this method could lead to more responsive and adaptable legal NLP systems. However, due to the complex
nature of legal reasoning and the need for accurate legal knowledge in the human feedback phase, RLHF’s integration
into legal NLP poses some challenges. Therefore, on the human feedback side, legal experts such as lawyers and
judges must provide guidance to ensure the AI accurately interprets and applies complex legal concepts.
ORC8: Expanding Legal Domain Coverage. There is a noticeable gap in the research across various areas of the legal
domain, including Intellectual Property, Criminal Law, Banking Law, Family Law, and Human Rights Law. These
fields have seen limited exploration across all legal NLP tasks, such as LQA and other applications. Expanding
research into these areas is essential for developing comprehensive legal automated systems that can provide
tailored solutions and insights highly relevant to these sectors of law.
ORC9: Small Language Models (SLMs). Research into SLMs specific to the legal domain is notably absent. Addressing
this gap could lead to more efficient, resource-conscious solutions that still maintain high performance
in legal text processing and analysis. The development of SLMs tailored for legal applications could revolutionize
the accessibility and scalability of legal NLP tools.
ORC10: Domain-Specific Efficient Fine-Tuning. Domain-specific efficient fine-tuning within the legal field is an
underexplored area, with only two known studies [63, 72] addressing it. Legal texts consist of complex structures
and specialized words that standard LMs may not capture without significant adaptation. Additionally, the legal
domain covers a vast array of document types, such as case law, statutes, and contracts, each requiring tailored
approaches for effective model application. This diversity makes it imperative to develop fine-tuning strategies
that not only adapt a model generally but also tailor it to the differences between these document
types. The majority of existing approaches involve fine-tuning the entire model, which can be resource-intensive.
More focused research could enable fine-tuning of legal LMs using fewer resources, enhancing the efficiency of
deploying these models in practice.
ORC11: Legal Logical Reasoning. Complex legal logical reasoning remains a significant challenge in LJP, particularly
in predicting prison terms. Current state-of-the-art methods struggle to achieve high accuracy in this area.
This highlights a clear need for enhanced approaches that can effectively handle the complexity of legal reasoning.
ORC12: Legal Named Entity Recognition. Legal NER focuses on specific challenges such as disambiguating titles,
resolving nested entities, addressing co-references, managing lengthy texts, and processing machine-inaccessible
PDFs. Despite its critical role in understanding and structuring legal documents, there is limited research in this
area, as observed from Fig. 1.
ORC13: Stochastic Parrots. The concept of “Stochastic Parrots” pertains particularly to LLMs. It captures the concern
that these models often do not truly understand language but merely mimic human patterns. This mimicry can
lead to unreliable outcomes, especially in critical legal situations, if the models are not trained on high-quality,
unbiased datasets. The risk is particularly significant in LJP, where training on biased or unfair data could lead to
irreversible outcomes, as discussed by Bender et al. [11] in their work on the limitations of LLMs. This underscores the
importance of ensuring that LLMs are trained responsibly to avoid perpetuating or amplifying existing biases in
legal decisions.
ORC14: Retrieval-Augmented Generation. In the legal domain, where documents are usually lengthy, often contain
cross-references, and present a variety of complex linguistic structures, LLMs can sometimes generate hallucinated
responses when asked to produce accurate answers. RAG systems offer a promising solution
to these challenges. RAG can mitigate issues such as the natural limitations of LLMs concerning maximum input
lengths, where even extended limits may fall short due to the excessive length of many legal documents. This
approach not only improves the model’s response quality but also its relevance and contextual appropriateness by
incorporating more of the document’s content into the decision-making process. However, the integration of RAG
into the legal domain introduces unique challenges, such as managing documents from multiple jurisdictions,
ensuring temporal relevance, addressing multilingual issues, and overcoming biases in the retrieval phase. These
challenges must be addressed in future research when applying RAG in the legal domain.
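The retrieval half of a RAG pipeline can be sketched in a few lines. The example below chunks a long text into overlapping word windows, ranks passages against a question with a basic Okapi BM25 scorer, and assembles a prompt from the top passages. BM25 is our choice for illustration; the passages, parameter values, and prompt wording are hypothetical, and the call to a generation model is deliberately left out.

```python
import math
import string
from collections import Counter


def chunk_words(text, size, overlap):
    """Split text into overlapping chunks of at most `size` words (size > overlap)."""
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, max(len(words) - overlap, 1), step)]


def tokenize(text):
    """Lowercase whitespace tokens with surrounding punctuation removed."""
    tokens = (w.strip(string.punctuation).lower() for w in text.split())
    return [t for t in tokens if t]


def bm25_scores(query, chunks, k1=1.5, b=0.75):
    """Score every chunk against the query with a basic Okapi BM25 formula."""
    tokenized = [tokenize(c) for c in chunks]
    avg_len = sum(len(t) for t in tokenized) / len(tokenized)
    doc_freq = Counter(tok for t in tokenized for tok in set(t))
    query_terms = set(tokenize(query))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (len(chunks) - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def retrieve(query, chunks, top_k=2):
    """Return the top_k chunks by BM25 score, best first."""
    scores = bm25_scores(query, chunks)
    ranked = sorted(range(len(chunks)), key=lambda i: -scores[i])
    return [chunks[i] for i in ranked[:top_k]]


def build_prompt(query, passages):
    """Assemble a prompt that restricts the answer to the retrieved passages."""
    context = "\n".join(f"[{n}] {p}" for n, p in enumerate(passages, start=1))
    return f"Answer using only the passages below.\n{context}\nQuestion: {query}"


# Hypothetical passages standing in for chunks of a long lease agreement.
PASSAGES = [
    "The tenant shall pay rent on the first day of each month.",
    "The landlord must return the deposit within thirty days of termination.",
    "Either party may terminate the lease with sixty days written notice.",
]
QUESTION = "When must the landlord return the deposit?"
```

Only the retrieved passages enter the prompt, so the model's input stays within its length limit regardless of how long the source document is. The jurisdictional, temporal, and multilingual concerns raised above would have to be handled in the retrieval step, for example by filtering passages on metadata before scoring.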
ORC15: Automated Legal Assistance System. To decrease human error and the costs of legal services, there’s a
need for a comprehensive automated legal assistance system. This system should span all tasks within the legal
NLP field, from question answering to judgment prediction, and cater to different legal specializations like civil
and financial law, across multiple languages from English to Persian. Developing an accurate LLM pre-trained on
a vast, diverse dataset free from biases and unfairness is crucial. This ensures that the automated legal services
can reliably and equitably address a wide range of legal issues.

Table 3. Summary of existing ORCs in each area.

Open Research Challenges                       LQA  LJP  LTC  LDS  Legal NER  LLMs  Corpora
ORC1: Bias and Fairness                         ✓    ✓    ✓    –       –       ✓       ✓
ORC2: Interpretability and Explainability       ✓    ✓    ✓    –       –       ✓       –
ORC3: Transparent Annotation                    ✓    ✓    ✓    ✓       ✓       –       ✓
ORC4: Multilingual Capabilities                 ✓    ✓    ✓    ✓       ✓       ✓       ✓
ORC5: Ontology                                  ✓    ✓    ✓    ✓       ✓       ✓       –
ORC6: Pre-processing Legal Text                 ✓    ✓    ✓    ✓       ✓       ✓       ✓
ORC7: RLHF                                      ✓    ✓    ✓    ✓       –       ✓       –
ORC8: Expanding Legal Domain Coverage           ✓    ✓    ✓    ✓       ✓       ✓       ✓
ORC9: SLMs                                      –    –    –    –       –       ✓       –
ORC10: Domain-Specific Efficient Fine-Tuning    ✓    ✓    ✓    ✓       –       ✓       –
ORC11: Legal Logical Reasoning                  ✓    ✓    ✓    –       –       ✓       –
ORC12: Legal NER                                –    –    –    –       ✓       –       –
ORC13: Stochastic Parrots                       –    –    –    –       –       ✓       –
ORC14: RAG                                      ✓    ✓    –    –       –       ✓       –
ORC15: Automated Legal Assistance System        ✓    ✓    ✓    ✓       ✓       ✓       ✓

Summary. Table 3 illustrates the connections between ORCs and the discussed areas. A direct relationship is marked
with a ✓, and its absence with a –. As shown, most ORCs relate to LJP, LQA, LTC, and LLMs, indicating that these
areas offer the most extensive scope for further research.

7 CONCLUSION
Advances in AI and NLP have improved Legal NLP techniques and models. These improvements help better meet
the needs of laypersons in legal matters and ease the workload of legal professionals. This survey provides a
comprehensive overview of the advancements in NLP techniques used in the legal domain. Additionally, we
discussed the unique characteristics of legal documents. We also reviewed existing datasets and LLMs tailored for
the legal domain. Legal NER research spans multiple languages and utilizes diverse methods, from rule-based
to BERT-based models. LDS has largely focused on extractive and abstractive methods, including TF-IDF and
transformer-based models. In LTC, multi-class classification tasks dominate, with deep learning architectures like
CNNs and Bi-LSTMs widely used. LJP primarily focuses on Chinese datasets with deep learning approaches like
CNNs. LQA often leverages information retrieval techniques such as BM25, with a significant focus on statutory
law. Finally, we explored key ORCs, such as the need for domain-specific fine-tuning strategies, addressing bias
and fairness in legal datasets, and the importance of interpretability and explainability. Other challenges include
the development of more robust pre-processing techniques, handling multilingual capabilities, and integrating
ontology-based methods for more accurate legal reasoning.

REFERENCES
[1] Muhammad Al-qurishi, Sarah Alqaseemi, and Riad Souissi. 2022. AraLegal-BERT: A pretrained language model for Arabic Legal text.
In Proceedings of the Natural Legal Language Processing Workshop 2022. Association for Computational Linguistics, Abu Dhabi, United
Arab Emirates (Hybrid), 338–344. https://doi.org/10.18653/v1/2022.nllp-1.31
[2] Intisar Almuslim and Diana Inkpen. 2022. Legal Judgment Prediction for Canadian Appeal Cases. In 2022 7th International Conference
on Data Science and Machine Learning Applications (CDMA). IEEE, Riyadh, Saudi Arabia, 163–168. https://doi.org/10.1109/CDMA54072.2022.00032
[3] AWS Amazon. [n. d.]. What are Transformers in Artificial Intelligence? Retrieved July 24, 2024 from
https://aws.amazon.com/what-is/transformers-in-artificial-intelligence
[4] Dang Hoang Anh, Dinh-Truong Do, Vu Tran, and Nguyen Le Minh. 2023. The Impact of Large Language Modeling on Natural Language
Processing in Legal Texts: A Comprehensive Survey. In 2023 15th International Conference on Knowledge and Systems Engineering (KSE).
IEEE, Hanoi, Vietnam, 1–7. https://doi.org/10.1109/KSE59128.2023.10299488
[5] Arian Askari, Suzan Verberne, and Gabriella Pasi. 2022. Expert Finding in Legal Community Question Answering. In Advances in
Information Retrieval: 44th European Conference on IR Research (ECIR 2022), Matthias Hagen, Suzan Verberne, Craig Macdonald, Christin
Seifert, Krisztian Balog, Kjetil Nørvåg, and Vinay Setty (Eds.). Springer International Publishing, Berlin, Heidelberg, 22–30.
[6] Arian Askari, Zihui Yang, Zhaochun Ren, and Suzan Verberne. 2024. Answer Retrieval in Legal Community Question Answering. In
Advances in Information Retrieval: 46th European Conference on Information Retrieval (ECIR 2024), Nazli Goharian, Nicola Tonellotto,
Yulan He, Aldo Lipani, Graham McDonald, Craig Macdonald, and Iadh Ounis (Eds.). Springer Nature Switzerland, Berlin, Heidelberg,
477–485.
[7] Ting Wai Terence Au, Vasileios Lampos, and Ingemar Cox. 2022. E-NER — An Annotated Named Entity Recognition Corpus of Legal
Text. In Proceedings of the Natural Legal Language Processing Workshop 2022. Association for Computational Linguistics, Abu Dhabi,
United Arab Emirates (Hybrid), 246–255. https://doi.org/10.18653/v1/2022.nllp-1.22
[8] Purbid Bambroo and Aditi Awasthi. 2021. LegalDB: Long DistilBERT for Legal Document Classification. In 2021 International Conference
on Advances in Electrical, Computing, Communication and Sustainable Technologies (ICAECT). IEEE, 1–4.
https://doi.org/10.1109/ICAECT49130.2021.9392558
[9] Claire Barale, Mark Klaisoongnoen, Pasquale Minervini, Michael Rovatsos, and Nehal Bhuta. 2023. AsyLex: A Dataset for Legal
Language Processing of Refugee Claims. In Proceedings of the Natural Legal Language Processing Workshop 2023, Daniel Preoțiuc-Pietro,
Catalina Goanta, Ilias Chalkidis, Leslie Barrett, Gerasimos Spanakis, and Nikolaos Aletras (Eds.). Association for Computational
Linguistics, Singapore, 244–257. https://doi.org/10.18653/v1/2023.nllp-1.24
[10] Iz Beltagy, Matthew E Peters, and Arman Cohan. 2020. Longformer: The long-document transformer. arXiv:2004.05150 [cs.CL]
[11] Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. 2021. On the Dangers of Stochastic Parrots: Can
Language Models Be Too Big?. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency (Virtual Event,
Canada) (FAccT ’21). Association for Computing Machinery, New York, NY, USA, 610–623. https://doi.org/10.1145/3442188.3445922
[12] Paheli Bhattacharya, Kaustubh Hiware, Subham Rajgaria, Nilay Pochhi, Kripabandhu Ghosh, and Saptarshi Ghosh. 2019. A Comparative
Study of Summarization Algorithms Applied to Legal Case Judgments. Advances in Information Retrieval (2019), 413–428.
[13] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam,
Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh,
Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin
Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are
few-shot learners. In Proceedings of the 34th International Conference on Neural Information Processing Systems (Vancouver, BC, Canada)
(NIPS ’20). Curran Associates Inc., Red Hook, NY, USA, Article 159, 25 pages.
[14] Marius Büttner and Ivan Habernal. 2024. Answering legal questions from laymen in German civil law system. In Proceedings of the 18th
Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), Yvette Graham and Matthew
Purver (Eds.). Association for Computational Linguistics, St. Julian’s, Malta, 2015–2027. https://aclanthology.org/2024.eacl-long.122
[15] Ilias Chalkidis, Ion Androutsopoulos, and Nikolaos Aletras. 2019. Neural Legal Judgment Prediction in English. In Proceedings of
the 57th Annual Meeting of the Association for Computational Linguistics, Anna Korhonen, David Traum, and Lluís Màrquez (Eds.).
Association for Computational Linguistics, Florence, Italy, 4317–4323. https://doi.org/10.18653/v1/P19-1424
[16] Ilias Chalkidis, Emmanouil Fergadiotis, Prodromos Malakasiotis, Nikolaos Aletras, and Ion Androutsopoulos. 2019. Extreme Multi-Label
Legal Text Classification: A Case Study in EU Legislation. In Proceedings of the Natural Legal Language Processing Workshop 2019,
Nikolaos Aletras, Elliott Ash, Leslie Barrett, Daniel Chen, Adam Meyers, Daniel Preotiuc-Pietro, David Rosenberg, and Amanda Stent
(Eds.). Association for Computational Linguistics, Minneapolis, Minnesota, 78–87. https://doi.org/10.18653/v1/W19-2209
[17] Ilias Chalkidis, Manos Fergadiotis, and Ion Androutsopoulos. 2021. MultiEURLEX - A multi-lingual and multi-label legal document
classification dataset for zero-shot cross-lingual transfer. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language
Processing, Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih (Eds.). Association for Computational
Linguistics, Online and Punta Cana, Dominican Republic, 6974–6996. https://doi.org/10.18653/v1/2021.emnlp-main.559
[18] Ilias Chalkidis, Manos Fergadiotis, Prodromos Malakasiotis, Nikolaos Aletras, and Ion Androutsopoulos. 2020. LEGAL-BERT: The
Muppets straight out of Law School. In Findings of the Association for Computational Linguistics: EMNLP 2020, Trevor Cohn, Yulan He,
and Yang Liu (Eds.). Association for Computational Linguistics, Online, 2898–2904. https://doi.org/10.18653/v1/2020.findings-emnlp.261
[19] Ilias Chalkidis, Abhik Jana, Dirk Hartung, Michael Bommarito, Ion Androutsopoulos, Daniel Katz, and Nikolaos Aletras. 2022. LexGLUE:
A Benchmark Dataset for Legal Language Understanding in English. In Proceedings of the 60th Annual Meeting of the Association for
Computational Linguistics (Volume 1: Long Papers), Smaranda Muresan, Preslav Nakov, and Aline Villavicencio (Eds.). Association for
Computational Linguistics, Dublin, Ireland, 4310–4330. https://doi.org/10.18653/v1/2022.acl-long.297
[20] Ilias Chalkidis, Tommaso Pasini, Sheng Zhang, Letizia Tomada, Sebastian Schwemer, and Anders Søgaard. 2022. FairLex: A Multilingual
Benchmark for Evaluating Fairness in Legal Text Processing. In Proceedings of the 60th Annual Meeting of the Association for
Computational Linguistics (Volume 1: Long Papers), Smaranda Muresan, Preslav Nakov, and Aline Villavicencio (Eds.). Association for
Computational Linguistics, Dublin, Ireland, 4389–4406. https://doi.org/10.18653/v1/2022.acl-long.301
[21] Andong Chen, Feng Yao, Xinyan Zhao, Yating Zhang, Changlong Sun, Yun Liu, and Weixing Shen. 2023. EQUALS: A Real-world
Dataset for Legal Question Answering via Reading Chinese Laws. In Proceedings of the Nineteenth International Conference on
Artificial Intelligence and Law (Braga, Portugal) (ICAIL ’23). Association for Computing Machinery, New York, NY, USA, 71–80.
https://doi.org/10.1145/3594536.3595159
[22] Shijie Chen, Yu Zhang, and Qiang Yang. 2024. Multi-Task Learning in Natural Language Processing: An Overview. ACM Comput. Surv.
56, 12, Article 295 (jul 2024), 32 pages. https://doi.org/10.1145/3663363
[23] Zhiyu Zoey Chen, Jing Ma, Xinlu Zhang, Nan Hao, An Yan, Armineh Nourbakhsh, Xianjun Yang, Julian McAuley, Linda Petzold,
and William Yang Wang. 2024. A Survey on Large Language Models for Critical Societal Domains: Finance, Healthcare, and Law.
arXiv:2405.01769 [cs.CL]
[24] Odysseas S. Chlapanis, Ion Androutsopoulos, and Dimitrios Galanis. 2024. Archimedes-AUEB at SemEval-2024 Task 5: LLM explains
Civil Procedure. arXiv:2405.08502 [cs.CL]
[25] Pierre Colombo, Telmo Pessoa Pires, Malik Boudiaf, Dominic Culver, Rui Melo, Caio Corro, Andre F. T. Martins, Fabrizio Esposito,
Vera Lúcia Raposo, Sofia Morgado, and Michael Desa. 2024. SaulLM-7B: A pioneering Large Language Model for Law. arXiv:2403.03883 [cs.CL]
[26] Junyun Cui, Xiaoyu Shen, and Shaochun Wen. 2023. A Survey on Legal Judgment Prediction: Datasets, Metrics, Models and Challenges.
IEEE Access 11 (2023), 102050–102071. https://doi.org/10.1109/ACCESS.2023.3317083
[27] Franck Dernoncourt, Ji Young Lee, and Peter Szolovits. 2017. NeuroNER: an easy-to-use program for named-entity recognition based
on neural networks. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing: System Demonstrations,
Lucia Specia, Matt Post, and Michael Paul (Eds.). Association for Computational Linguistics, Copenhagen, Denmark, 97–102.
https://doi.org/10.18653/v1/D17-2017
[28] Aniket Deroy and Subhankar Maity. 2023. Questioning Biases in Case Judgment Summaries: Legal Datasets or Large Language Models?
arXiv:2312.00554 [cs.CL]
[29] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for
Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational
Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Jill Burstein, Christy Doran, and Thamar Solorio (Eds.).
Association for Computational Linguistics, Minneapolis, Minnesota, 4171–4186. https://doi.org/10.18653/v1/N19-1423
[30] João Dias, Pedro A. Santos, Nuno Cordeiro, Ana Antunes, Bruno Martins, Jorge Baptista, and Carlos Gonçalves. 2022. State of the Art
in Artificial Intelligence applied to the Legal Domain. arXiv:2204.07047 [cs.CL]
[31] Yue Dong, Andrei Mircea, and Jackie Chi Kit Cheung. 2021. Discourse-Aware Unsupervised Summarization for Long Scientific
Documents. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main
Volume, Paola Merlo, Jorg Tiedemann, and Reut Tsarfaty (Eds.). Association for Computational Linguistics, Online, 1089–1102.
https://doi.org/10.18653/v1/2021.eacl-main.93
[32] Christopher Dozier, Ravikumar Kondadadi, Marc Light, Arun Vachher, Sriharsha Veeramachaneni, and Ramdev Wudali. 2010. Named
Entity Recognition and Resolution in Legal Text. In Semantic Processing of Legal Texts: Where the Language of Law Meets the Law of
Language, Enrico Francesconi, Simonetta Montemagni, Wim Peters, and Daniela Tiscornia (Eds.). Springer Berlin Heidelberg, Berlin,
Heidelberg, 27–43. https://doi.org/10.1007/978-3-642-12837-0_2
[33] Gary Edmond and Kristy A Martire. 2019. Just cognition: scientific research on bias and some implications for legal procedure and
decision-making. The modern law review 82, 4 (2019), 633–664.
[34] Ahmed Elnaggar, Christoph Gebendorfer, Ingo Glaser, and Florian Matthes. 2018. Multi-Task Deep Learning for Legal Document
Translation, Summarization and Multi-Label Classification. In Proceedings of the 2018 Artificial Intelligence and Cloud Computing
Conference (Tokyo, Japan) (AICCC ’18). Association for Computing Machinery, New York, NY, USA, 9–15.
https://doi.org/10.1145/3299819.3299844
[35] Güneş Erkan and Dragomir R. Radev. 2004. LexRank: Graph-based lexical centrality as salience in text summarization. Journal of
artificial intelligence research 22 (2004), 457–479.
[36] Atefeh Farzindar and Guy Lapalme. 2004. LetSum, an automatic Legal Text Summarizing system. In Legal Knowledge and Information
Systems. JURIX 2004: The Seventeenth Annual Conference, T. Gordon (Ed.), Vol. 120. IOS Press, Amsterdam, 11–18.
[37] Yi Feng, Chuanyi Li, and Vincent Ng. 2022. Legal Judgment Prediction via Event Extraction with Constraints. In Proceedings of the 60th
Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Smaranda Muresan, Preslav Nakov, and Aline
Villavicencio (Eds.). Association for Computational Linguistics, Dublin, Ireland, 648–664. https://doi.org/10.18653/v1/2022.acl-long.48

ACM Comput. Surv., Vol. 1, No. 1, Article . Publication date: October 2024.

[38] Pavlos Fragkogiannis, Martina Forster, Grace E. Lee, and Dell Zhang. 2023. Context-Aware Classification of Legal Document Pages.
In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval (Taipei, Taiwan)
(SIGIR ’23). Association for Computing Machinery, New York, NY, USA, 3285–3289. https://doi.org/10.1145/3539618.3591839
[39] Debasis Ganguly, Jack G. Conrad, Kripabandhu Ghosh, Saptarshi Ghosh, Pawan Goyal, Paheli Bhattacharya, Shubham Kumar Nigam,
and Shounak Paul. 2023. Legal IR and NLP: The History, Challenges, and State-of-the-Art. In European Conference on Information Retrieval
(ECIR) (Advances in Information Retrieval). Springer-Verlag, Berlin, Heidelberg, 331–340. https://doi.org/10.1007/978-3-031-28241-6_34
[40] Daphne Gelbart and JC Smith. 1991. Flexicon, a new legal information retrieval system. Can. L. Libr. 16 (1991), 9.
[41] Daphne Gelbart and J. C. Smith. 1991. Beyond boolean search: FLEXICON, a legal text-based intelligent system. In Proceedings of the 3rd
International Conference on Artificial Intelligence and Law (Oxford, England) (ICAIL ’91). Association for Computing Machinery, New
York, NY, USA, 225–234. https://doi.org/10.1145/112646.112674
[42] Joseph Gesnouin, Yannis Tannier, Christophe Gomes Da Silva, Hatim Tapory, Camille Brier, Hugo Simon, Raphael Rozenberg, Hermann
Woehrel, Mehdi El Yakaabi, Thomas Binder, Guillaume Marie, Emilie Caron, Mathile Nogueira, Thomas Fontas, Laure Puydebois,
Marie Theophile, Stephane Morandi, Mael Petit, David Creissac, Pauline Ennouchy, Elise Valetoux, Celine Visade, Severine Balloux,
Emmanuel Cortes, Pierre-Etienne Devineau, Ulrich Tan, Esther Mac Namara, and Su Yang. 2024. LLaMandement: Large Language
Models for Summarization of French Legislative Proposals. arXiv:2401.16182 [cs.CL]
[43] John Gibbons and M. Teresa Turell. 2008. Dimensions of Forensic Linguistics (1 ed.). AILA Applied Linguistics Series, Vol. 5. John
Benjamins Publishing Company, Netherlands. 1–317 pages.
[44] Randy Goebel, Yoshinobu Kano, Mi-Young Kim, Juliano Rabelo, Ken Satoh, and Masaharu Yoshioka. 2024. Overview and Discussion of
the Competition on Legal Information, Extraction/Entailment (COLIEE) 2023. The Review of Socionetwork Strategies 18, 1 (2024), 27–47.
[45] S. Georgette Graham, Hamidreza Soltani, and Olufemi Isiaq. 2023. Natural language processing for legal document review: categorising
deontic modalities in contracts. Artificial Intelligence and Law (2023). https://doi.org/10.1007/s10506-023-09379-2
[46] Peter Henderson, Mark Krass, Lucia Zheng, Neel Guha, Christopher D Manning, Dan Jurafsky, and Daniel Ho. 2022. Pile of Law:
Learning Responsible Data Filtering from the Law and a 256GB Open-Source Legal Dataset. In Advances in Neural Information
Processing Systems, S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh (Eds.), Vol. 35. Curran Associates, Inc., Red
Hook, NY, USA, 29217–29234.
https://proceedings.neurips.cc/paper_files/paper/2022/file/bc218a0c656e49d4b086975a9c785f47-Paper-Datasets_and_Benchmarks.pdf
[47] Dan Hendrycks, Collin Burns, Anya Chen, and Spencer Ball. 2021. CUAD: An expert-annotated NLP dataset for legal contract review.
arXiv:2103.06268 [cs.CL]
[48] Weiyi Huang, Jiahao Jiang, Qiang Qu, and Min Yang. 2020. AILA: A Question Answering System in the Legal Domain. In Proceedings
of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI-20 (Yokohama, Japan) (IJCAI’20), Christian
Bessiere (Ed.). International Joint Conferences on Artificial Intelligence Organization, Article 762, 3 pages.
https://doi.org/10.24963/ijcai.2020/762
[49] Deepali Jain, Malaya Dutta Borah, and Anupam Biswas. 2024. A sentence is known by the company it keeps: Improving Legal Document
Summarization Using Deep Clustering. Artificial Intelligence and Law 32, 1 (2024), 165–200.
[50] Samyar Janatian, Hannes Westermann, Jinzhe Tan, Jaromir Savelka, and Karim Benyekhlef. 2023. From Text to Structure: Using Large
Language Models to Support the Development of Legal Expert Systems. arXiv:2311.04911 [cs.CL]
[51] Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand,
Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. 2023. Mistral 7B. arXiv:2310.06825 [cs.CL]
[52] Hang Jiang, Xiajie Zhang, Robert Mahari, Daniel Kessler, Eric Ma, Tal August, Irene Li, Alex ’Sandy’ Pentland, Yoon Kim, Jad Kabbara, and
Deb Roy. 2024. Leveraging Large Language Models for Learning Complex Legal Concepts through Storytelling. arXiv:2402.17019 [cs.CL]
[53] Lukasz Kaiser, Aidan N Gomez, Noam Shazeer, Ashish Vaswani, Niki Parmar, Llion Jones, and Jakob Uszkoreit. 2017. One model to
learn them all. arXiv:1706.05137 [cs.LG]
[54] Prathamesh Kalamkar, Astha Agarwal, Aman Tiwari, Smita Gupta, Saurabh Karn, and Vivek Raghavan. 2022. Named Entity Recognition
in Indian court judgments. In Proceedings of the Natural Legal Language Processing Workshop 2022. Association for Computational
Linguistics, Abu Dhabi, United Arab Emirates (Hybrid), 184–193. https://doi.org/10.18653/v1/2022.nllp-1.15
[55] Ambedkar Kanapala, Sukomal Pal, and Rajendra Pamula. 2019. Text summarization from legal documents: a survey. Artificial Intelligence
Review 51 (2019), 371–402.
[56] Daniel Martin Katz, Michael James Bommarito, Shang Gao, and Pablo Arredondo. 2024. GPT-4 passes the bar exam. Philosophical
Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences 382, 2270 (2024), 20230254.
https://doi.org/10.1098/rsta.2023.0254
[57] Soha Khazaeli, Janardhana Punuru, Chad Morris, Sanjay Sharma, Bert Staub, Michael Cole, Sunny Chiu-Webster, and Dhruv Sakalley.
2021. A Free Format Legal Question Answering System. In Proceedings of the Natural Legal Language Processing Workshop 2021,
Nikolaos Aletras, Ion Androutsopoulos, Leslie Barrett, Catalina Goanta, and Daniel Preotiuc-Pietro (Eds.). Association for Computational
Linguistics, Punta Cana, Dominican Republic, 107–113. https://doi.org/10.18653/v1/2021.nllp-1.11

[58] Panteleimon Krasadakis, Evangelos Sakkopoulos, and Vassilios S. Verykios. 2024. A Survey on Challenges and Advances in Natural
Language Processing with a Focus on Legal Informatics and Low-Resource Languages. Electronics 13, 3 (2024).
https://doi.org/10.3390/electronics13030648
[59] Amirhossein Layegh, Amir H. Payberah, Ahmet Soylu, Dumitru Roman, and Mihhail Matskin. 2023. ContrastNER: Contrastive-based
Prompt Tuning for Few-shot NER. In 2023 IEEE 47th Annual Computers, Software, and Applications Conference (COMPSAC). IEEE,
241–249. https://doi.org/10.1109/COMPSAC57700.2023.00038
[60] Jihoon Lee and Hyukjoon Lee. 2019. A Comparison Study on Legal Document Classification Using Deep Neural Networks. In
2019 International Conference on Information and Communication Technology Convergence (ICTC). 926–928.
https://doi.org/10.1109/ICTC46691.2019.8939926
[61] Elena Leitner, Georg Rehm, and Julian Moreno-Schneider. 2020. A Dataset of German Legal Documents for Named Entity Recognition.
In Proceedings of the Twelfth Language Resources and Evaluation Conference, Nicoletta Calzolari, Frédéric Béchet, Philippe Blache,
Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo,
Asuncion Moreno, Jan Odijk, and Stelios Piperidis (Eds.). European Language Resources Association, Marseille, France, 4478–4485.
https://aclanthology.org/2020.lrec-1.551
[62] LexisNexis. [n. d.]. International Legal Generative AI Report. Retrieved July 22, 2024 from
https://www.lexisnexis.com/community/pressroom/b/news/posts/lexisnexis-international-legal-generative-ai-survey-shows-nearly-half-of-the-legal-profession-believe-generative-ai-will-transform-the-practice-of-law
[63] Jonathan Li, Rohan Bhambhoria, and Xiaodan Zhu. 2022. Parameter-Efficient Legal Domain Adaptation. In Proceedings of the Natural
Legal Language Processing Workshop 2022. Association for Computational Linguistics, Abu Dhabi, United Arab Emirates (Hybrid),
119–129. https://doi.org/10.18653/v1/2022.nllp-1.10
[64] Jing Li, Aixin Sun, Jianglei Han, and Chenliang Li. 2022. A Survey on Deep Learning for Named Entity Recognition. IEEE Transactions
on Knowledge and Data Engineering 34, 1 (Jan 2022), 50–70. https://doi.org/10.1109/TKDE.2020.2981314
[65] Junyi Li, Tianyi Tang, Wayne Xin Zhao, Jian-Yun Nie, and Ji-Rong Wen. 2024. Pre-Trained Language Models for Text Generation: A
Survey. ACM Comput. Surv. 56, 9 (apr 2024), 1–39. https://doi.org/10.1145/3649449
[66] Yanling Li, Jiaye Wu, and Xudong Luo. 2024. BERT-CNN based evidence retrieval and aggregation for Chinese legal multi-choice
question answering. Neural Computing and Applications 36, 11 (2024), 5909–5925. https://doi.org/10.1007/s00521-023-09380-5
[67] Marco Lippi, Przemysław Pałka, Giuseppe Contissa, Francesca Lagioia, Hans-Wolfgang Micklitz, Giovanni Sartor, and Paolo Torroni.
2019. CLAUDETTE: an automated detector of potentially unfair clauses in online terms of service. Artificial Intelligence and Law 27
(2019), 117–139.
[68] Shuaiqi Liu, Jiannong Cao, Yicong Li, Ruosong Yang, and Zhiyuan Wen. 2024. Low-resource court judgment summarization for
common law systems. Information Processing and Management 61, 5 (2024), 103796. https://doi.org/10.1016/j.ipm.2024.103796
[69] Yang Liu. 2019. Fine-tune BERT for extractive summarization. arXiv:1903.10318 [cs.CL]
[70] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin
Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv:1907.11692 [cs.CL]
[71] Yifei Liu, Yiquan Wu, Yating Zhang, Changlong Sun, Weiming Lu, Fei Wu, and Kun Kuang. 2023. ML-LJP: Multi-Law Aware Legal
Judgment Prediction. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval
(Taipei, Taiwan) (SIGIR ’23). Association for Computing Machinery, New York, NY, USA, 1023–1034.
https://doi.org/10.1145/3539618.3591731
[72] Antoine Louis, Gijs van Dijck, and Gerasimos Spanakis. 2024. Interpretable Long-Form Legal Question Answering with Retrieval-
Augmented Large Language Models. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 38. AAAI Press, 22266–22275.
https://doi.org/10.1609/aaai.v38i20.30232
[73] Bingfeng Luo, Yansong Feng, Jianbo Xu, Xiang Zhang, and Dongyan Zhao. 2017. Learning to Predict Charges for Criminal Cases with
Legal Basis. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Martha Palmer, Rebecca Hwa, and
Sebastian Riedel (Eds.). Association for Computational Linguistics, Copenhagen, Denmark, 2727–2736.
https://doi.org/10.18653/v1/D17-1289
[74] Luyao Ma, Yating Zhang, Tianyi Wang, Xiaozhong Liu, Wei Ye, Changlong Sun, and Shikun Zhang. 2021. Legal Judgment Prediction
with Multi-Stage Case Representation Learning in the Real Court Setting. In Proceedings of the 44th International ACM SIGIR Conference
on Research and Development in Information Retrieval (Virtual Event, Canada) (SIGIR ’21). Association for Computing Machinery, New
York, NY, USA, 993–1002. https://doi.org/10.1145/3404835.3462945
[75] Dimitris Mamakas, Petros Tsotsi, Ion Androutsopoulos, and Ilias Chalkidis. 2022. Processing Long Legal Documents with Pre-trained
Transformers: Modding LegalBERT and Longformer. In Proceedings of the Natural Legal Language Processing Workshop 2022, Nikolaos
Aletras, Ilias Chalkidis, Leslie Barrett, Cătălina Goanță, and Daniel Preoțiuc-Pietro (Eds.). Association for Computational Linguistics,
Abu Dhabi, United Arab Emirates (Hybrid), 130–142. https://doi.org/10.18653/v1/2022.nllp-1.11
[76] Sepideh Mamooler, Rémi Lebret, Stephane Massonnet, and Karl Aberer. 2022. An Efficient Active Learning Pipeline for Legal Text
Classification. In Proceedings of the Natural Legal Language Processing Workshop 2022, Nikolaos Aletras, Ilias Chalkidis, Leslie Barrett,
Cătălina Goanță, and Daniel Preoțiuc-Pietro (Eds.). Association for Computational Linguistics, Abu Dhabi, United Arab Emirates
(Hybrid), 345–358. https://doi.org/10.18653/v1/2022.nllp-1.32
[77] Stelios Maroudas, Sotiris Legkas, Prodromos Malakasiotis, and Ilias Chalkidis. 2022. Legal-Tech Open Diaries: Lesson learned on how
to develop and deploy light-weight models in the era of humongous Language Models. In Proceedings of the Natural Legal Language
Processing Workshop 2022, Nikolaos Aletras, Ilias Chalkidis, Leslie Barrett, Cătălina Goanță, and Daniel Preoțiuc-Pietro (Eds.). Association
for Computational Linguistics, Abu Dhabi, United Arab Emirates (Hybrid), 88–110. https://doi.org/10.18653/v1/2022.nllp-1.8
[78] Suzanne McGee. [n. d.]. Generative AI and the Law. Retrieved July 22, 2024 from
https://www.lexisnexis.com/html/lexisnexis-generative-ai-story
[79] Masha Medvedeva, Michel Vols, and Martijn Wieling. 2020. Using machine learning to predict decisions of the European Court of
Human Rights. Artificial Intelligence and Law 28, 2 (2020), 237–266. https://doi.org/10.1007/s10506-019-09255-y
[80] Marie-Francine Moens, Caroline Uyttendaele, and Jos Dumortier. 1999. Abstracting of legal cases: the potential of clustering based on
the selection of representative objects. Journal of the American Society for Information Science 50, 2 (1999), 151–161.
[81] Laurens Mommers. 2010. Ontologies in the Legal Domain. Springer Netherlands, Dordrecht, 265–276.
https://doi.org/10.1007/978-90-481-8845-1_12
[82] Gianluca Moro, Nicola Piscaglia, Luca Ragazzi, and Paolo Italiani. 2023. Multi-language transfer learning for low-resource legal case
summarization. Artificial Intelligence and Law (2023). https://doi.org/10.1007/s10506-023-09373-8
[83] Duy-Hung Nguyen, Bao-Sinh Nguyen, Nguyen Viet Dung Nghiem, Dung Tien Le, Mim Amina Khatun, Minh-Tien Nguyen, and Hung
Le. 2021. Robust Deep Reinforcement Learning for Extractive Legal Summarization. In Neural Information Processing, Teddy Mantoro,
Minho Lee, Media Anugerah Ayu, Kok Wai Wong, and Achmad Nizar Hidayanto (Eds.). Springer International Publishing, Cham,
597–604.
[84] Ha-Thanh Nguyen, Manh-Kien Phi, Xuan-Bach Ngo, Vu Tran, Le-Minh Nguyen, and Minh-Phuong Tu. 2024. Attentive deep neural
networks for legal document retrieval. Artificial Intelligence and Law 32, 1 (2024), 57–86. https://doi.org/10.1007/s10506-022-09341-8
[85] Joel Niklaus, Ilias Chalkidis, and Matthias Stürmer. 2021. Swiss-Judgment-Prediction: A Multilingual Legal Judgment Prediction
Benchmark. In Proceedings of the Natural Legal Language Processing Workshop 2021, Nikolaos Aletras, Ion Androutsopoulos, Leslie
Barrett, Catalina Goanta, and Daniel Preotiuc-Pietro (Eds.). Association for Computational Linguistics, Punta Cana, Dominican Republic,
19–35. https://doi.org/10.18653/v1/2021.nllp-1.3
[86] Joel Niklaus, Veton Matoshi, Pooja Rani, Andrea Galassi, Matthias Stürmer, and Ilias Chalkidis. 2023. LEXTREME: A Multi-Lingual
and Multi-Task Benchmark for the Legal Domain. In Findings of the Association for Computational Linguistics: EMNLP 2023, Houda
Bouamor, Juan Pino, and Kalika Bali (Eds.). Association for Computational Linguistics, Singapore, 3016–3054.
https://doi.org/10.18653/v1/2023.findings-emnlp.200
[87] Joel Niklaus, Veton Matoshi, Matthias Stürmer, Ilias Chalkidis, and Daniel E. Ho. 2024. MultiLegalPile: A 689GB Multilingual Legal
Corpus. arXiv:2306.02069 [cs.CL]
[88] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina
Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F
Christiano, Jan Leike, and Ryan Lowe. 2022. Training language models to follow instructions with human feedback. In Advances in Neural
Information Processing Systems, S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh (Eds.), Vol. 35. Curran Associates, Inc.,
27730–27744. https://proceedings.neurips.cc/paper_files/paper/2022/file/b1efde53be364a73914f58805a001731-Paper-Conference.pdf
[89] Matthew J. Page, Joanne E. McKenzie, Patrick M. Bossuyt, Isabelle Boutron, Tammy C. Hoffmann, Cynthia D. Mulrow, Larissa Shamseer,
Jennifer M. Tetzlaff, Elie A. Akl, Sue E. Brennan, Roger Chou, Julie Glanville, Jeremy M. Grimshaw, Asbjørn Hróbjartsson, Manoj M.
Lalu, Tianjing Li, Elizabeth W. Loder, Evan Mayo-Wilson, Steve McDonald, Luke A. McGuinness, Lesley A. Stewart, James Thomas,
Andrea C. Tricco, Vivian A. Welch, Penny Whiting, and David Moher. 2021. The PRISMA 2020 statement: an updated guideline for
reporting systematic reviews. Systematic Reviews 10, 1 (2021), 89. https://doi.org/10.1186/s13643-021-01626-4
[90] Christos Papaloukas, Ilias Chalkidis, Konstantinos Athinaios, Despina Pantazi, and Manolis Koubarakis. 2021. Multi-granular Legal
Topic Classification on Greek Legislation. In Proceedings of the Natural Legal Language Processing Workshop 2021, Nikolaos Aletras, Ion
Androutsopoulos, Leslie Barrett, Catalina Goanta, and Daniel Preotiuc-Pietro (Eds.). Association for Computational Linguistics, Punta
Cana, Dominican Republic, 63–75. https://doi.org/10.18653/v1/2021.nllp-1.6
[91] Sungmi Park and Joshua I. James. 2023. Lessons learned building a legal inference dataset. Artificial Intelligence and Law (2023).
https://doi.org/10.1007/s10506-023-09370-x
[92] Seth Polsley, Pooja Jhunjhunwala, and Ruihong Huang. 2016. CaseSummarizer: A System for Automated Summarization of Legal Texts.
In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: System Demonstrations, Hideo Watanabe
(Ed.). The COLING 2016 Organizing Committee, Osaka, Japan, 258–262. https://aclanthology.org/C16-2054
[93] Thiago Dal Pont, Federico Galli, Andrea Loreggia, Giuseppe Pisano, Riccardo Rovatti, and Giovanni Sartor. 2023. Legal Summarisation
through LLMs: The PRODIGIT Project. arXiv:2308.04416 [cs.CL]
[94] Vasile Păis, Maria Mitrofan, Carol Luca Gasan, Vlad Coneschi, and Alexandru Ianov. 2021. Named Entity Recognition in the Romanian
Legal Domain. In Proceedings of the Natural Legal Language Processing Workshop 2021, Nikolaos Aletras, Ion Androutsopoulos, Leslie
Barrett, Catalina Goanta, and Daniel Preotiuc-Pietro (Eds.). Association for Computational Linguistics, Punta Cana, Dominican Republic,
9–18. https://doi.org/10.18653/v1/2021.nllp-1.2
[95] Juliano Rabelo, Randy Goebel, Mi-Young Kim, Yoshinobu Kano, Masaharu Yoshioka, and Ken Satoh. 2022. Overview and discussion of
the competition on legal information extraction/entailment (COLIEE) 2021. The Review of Socionetwork Strategies 16, 1 (2022), 111–133.
[96] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu.
2020. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res. 21, 1, Article 140 (jan 2020),
67 pages.
[97] Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2020. DistilBERT, a distilled version of BERT: smaller, faster, cheaper
and lighter. arXiv:1910.01108 [cs.CL]
[98] Jaromir Savelka, Kevin D. Ashley, Morgan A. Gray, Hannes Westermann, and Huihui Xu. 2023. Explaining Legal Concepts with
Augmented Large Language Models (GPT-4). arXiv:2306.09525 [cs.CL]
[99] Marijn Schraagen, Floris Bex, Nick Van De Luijtgaarden, and Daniël Prijs. 2022. Abstractive Summarization of Dutch Court Verdicts
Using Sequence-to-sequence Models. In Proceedings of the Natural Legal Language Processing Workshop 2022, Nikolaos Aletras, Ilias
Chalkidis, Leslie Barrett, Cătălina Goanță, and Daniel Preoțiuc-Pietro (Eds.). Association for Computational Linguistics, Abu Dhabi,
United Arab Emirates (Hybrid), 76–87. https://doi.org/10.18653/v1/2022.nllp-1.7
[100] Gil Semo, Dor Bernsohn, Ben Hagag, Gila Hayat, and Joel Niklaus. 2022. ClassActionPrediction: A Challenging Benchmark for Legal
Judgment Prediction of Class Action Cases in the US. In Proceedings of the Natural Legal Language Processing Workshop 2022, Nikolaos
Aletras, Ilias Chalkidis, Leslie Barrett, Cătălina Goanță, and Daniel Preoțiuc-Pietro (Eds.). Association for Computational Linguistics,
Abu Dhabi, United Arab Emirates (Hybrid), 31–46. https://doi.org/10.18653/v1/2022.nllp-1.3
[101] Zein Shaheen, Gerhard Wohlgenannt, and Erwin Filtz. 2020. Large scale legal text classification using transformer models.
arXiv:2010.12871 [cs.CL]
[102] Zejiang Shen, Kyle Lo, Lauren Yu, Nathan Dahlberg, Margo Schlanger, and Doug Downey. 2022. Multi-LexSum: Real-world Summaries
of Civil Rights Lawsuits at Multiple Granularities. In Advances in Neural Information Processing Systems, S. Koyejo, S. Mohamed,
A. Agarwal, D. Belgrave, K. Cho, and A. Oh (Eds.), Vol. 35. Curran Associates, Inc., 13158–13173.
https://proceedings.neurips.cc/paper_files/paper/2022/file/552ef803bef9368c29e53c167de34b55-Paper-Datasets_and_Benchmarks.pdf
[103] Emily Sheng, Kai-Wei Chang, Premkumar Natarajan, and Nanyun Peng. 2019. The Woman Worked as a Babysitter: On Biases in
Language Generation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International
Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Kentaro Inui, Jing Jiang, Vincent Ng, and Xiaojun Wan (Eds.).
Association for Computational Linguistics, Hong Kong, China, 3407–3412. https://doi.org/10.18653/v1/D19-1339
[104] Juanming Shi, Qinglang Guo, Yong Liao, Yuxing Wang, Shijia Chen, and Shenglin Liang. 2024. Legal-LM: Knowledge Graph Enhanced
Large Language Models for Law Consulting. In Advanced Intelligent Computing Technology and Applications, De-Shuang Huang,
Zhanjun Si, and Chuanlei Zhang (Eds.). Springer Nature Singapore, Singapore, 175–186.
[105] Răzvan-Alexandru Smădu, Ion-Robert Dinică, Andrei-Marius Avram, Dumitru-Clementin Cercel, Florin Pop, and Mihaela-Claudia
Cercel. 2022. Legal Named Entity Recognition with Multi-Task Domain Adaptation. In Proceedings of the Natural Legal Language
Processing Workshop 2022, Nikolaos Aletras, Ilias Chalkidis, Leslie Barrett, Cătălina Goanță, and Daniel Preoțiuc-Pietro (Eds.). Association
for Computational Linguistics, Abu Dhabi, United Arab Emirates (Hybrid), 305–321. https://doi.org/10.18653/v1/2022.nllp-1.29
[106] Dezhao Song, Andrew Vold, Kanika Madan, and Frank Schilder. 2022. Multi-label legal document classification: A deep learning-based
approach with label-attention and domain-specific pre-training. Information Systems 106 (2022), 101718.
https://doi.org/10.1016/j.is.2021.101718
[107] Francesco Sovrano, Monica Palmirani, Biagio Distefano, Salvatore Sapienza, and Fabio Vitali. 2021. A dataset for evaluating legal question
answering on private international law. In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Law (São
Paulo, Brazil) (ICAIL ’21). Association for Computing Machinery, New York, NY, USA, 230–234. https://doi.org/10.1145/3462757.3466094
[108] Francesco Sovrano, Monica Palmirani, Salvatore Sapienza, and Vittoria Pistone. 2024. DiscoLQA: zero-shot discourse-based legal
question answering on European Legislation. Artificial Intelligence and Law (2024). https://doi.org/10.1007/s10506-023-09387-2
[109] Francesco Sovrano, Monica Palmirani, and Fabio Vitali. 2020. Legal knowledge extraction for knowledge graph based question-
answering. In Legal knowledge and information systems. IOS Press, 143–153. https://doi.org/10.3233/FAIA200858
[110] Ralf Steinberger, Bruno Pouliquen, Mijail Kabadjov, Jenya Belyaeva, and Erik van der Goot. 2011. JRC-NAMES: A Freely Available,
Highly Multilingual Named Entity Resource. In Proceedings of the International Conference Recent Advances in Natural Language
Processing 2011, Ruslan Mitkov and Galia Angelova (Eds.). Association for Computational Linguistics, Hissar, Bulgaria, 104–110.
https://aclanthology.org/R11-1015
[111] Benjamin Strickson and Beatriz De La Iglesia. 2020. Legal Judgement Prediction for UK Courts. In Proceedings of the 3rd International
Conference on Information Science and Systems (Cambridge, United Kingdom) (ICISS ’20). Association for Computing Machinery, New
York, NY, USA, 204–209. https://doi.org/10.1145/3388176.3388183
[112] Zhongxiang Sun. 2023. A Short Survey of Viewing Large Language Models in Legal Aspect. arXiv:2303.09136 [cs.CL]


[113] Alex Tamkin, Miles Brundage, Jack Clark, and Deep Ganguli. 2021. Understanding the Capabilities, Limitations, and Societal Impact of
Large Language Models. arXiv:2102.02503 [cs.CL]
[114] Doron Teichman, Eyal Zamir, and Ilana Ritov. 2023. Biases in legal decision-making: Comparing prosecutors, defense attorneys, law
students, and laypersons. Journal of empirical legal studies 20, 4 (2023), 852–894.
[115] Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 Shared Task: Language-Independent Named
Entity Recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003. 142–147.
https://www.aclweb.org/anthology/W03-0419
[116] Suxin Tong, Jingling Yuan, Peiliang Zhang, and Lin Li. 2024. Legal Judgment Prediction via graph boosting with constraints. Information
Processing & Management 61, 3 (2024), 103663. https://doi.org/10.1016/j.ipm.2024.103663
[117] Don Tuggener, Pius von Däniken, Thomas Peetz, and Mark Cieliebak. 2020. LEDGAR: A Large-Scale Multi-label Corpus for Text
Classification of Legal Provisions in Contracts. In Proceedings of the Twelfth Language Resources and Evaluation Conference, Nicoletta
Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente
Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, and Stelios Piperidis (Eds.). European Language Resources
Association, Marseille, France, 1235–1241. https://aclanthology.org/2020.lrec-1.155
[118] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin.
2017. Attention is All you Need. In Advances in Neural Information Processing Systems (Long Beach, California, USA) (NIPS’17, Vol. 30),
I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.). Curran Associates, Inc., Red Hook,
NY, USA, 6000–6010. https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf
[119] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. 2018. Graph attention networks.
arXiv:1710.10903 [stat.ML]
[120] Daniela Vianna, Edleno Silva de Moura, and Altigran Soares da Silva. 2023. A topic discovery approach for unsupervised organization
of legal document collections. Artificial Intelligence and Law (2023). https://doi.org/10.1007/s10506-023-09371-w
[121] Qiqi Wang, Kaiqi Zhao, Robert Amor, Benjamin Liu, and Ruofan Wang. 2022. D2GCLF: Document-to-Graph Classifier for Legal
Document Classification. In Findings of the Association for Computational Linguistics: NAACL 2022, Marine Carpuat, Marie-Catherine
de Marneffe, and Ivan Vladimir Meza Ruiz (Eds.). Association for Computational Linguistics, Seattle, United States, 2208–2221.
https://doi.org/10.18653/v1/2022.findings-naacl.170
[122] Fusheng Wei, Han Qin, Shi Ye, and Haozhen Zhao. 2018. Empirical Study of Deep Learning for Text Classification in Legal Document
Review. In 2018 IEEE International Conference on Big Data (Big Data). IEEE, 3317–3320. https://doi.org/10.1109/BigData.2018.8622157
[123] Jason Wei and Kai Zou. 2019. EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks. In
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference
on Natural Language Processing (EMNLP-IJCNLP), Kentaro Inui, Jing Jiang, Vincent Ng, and Xiaojun Wan (Eds.). Association for
Computational Linguistics, Hong Kong, China, 6382–6388. https://doi.org/10.18653/v1/D19-1670
[124] Westlaw. [n. d.]. Westlaw. Retrieved May 23, 2024 from
https://anzlaw.thomsonreuters.com/Browse/Home/Australia160?comp=wlau&__lrTS=20240523040153004&transitionType=Default&contextData=(sc.Default)
[125] Yiquan Wu, Yifei Liu, Weiming Lu, Yating Zhang, Jun Feng, Changlong Sun, Fei Wu, and Kun Kuang. 2022. Towards Interactivity and
Interpretability: A Rationale-based Legal Judgment Prediction Framework. In Proceedings of the 2022 Conference on Empirical Methods
in Natural Language Processing, Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang (Eds.). Association for Computational Linguistics,
Abu Dhabi, United Arab Emirates, 4787–4799. https://doi.org/10.18653/v1/2022.emnlp-main.316
[126] Chaojun Xiao, Xueyu Hu, Zhiyuan Liu, Cunchao Tu, and Maosong Sun. 2021. Lawformer: A pre-trained language model for Chinese
legal long documents. AI Open 2 (2021), 79–84. https://doi.org/10.1016/j.aiopen.2021.06.003
[127] Chaojun Xiao, Haoxi Zhong, Zhipeng Guo, Cunchao Tu, Zhiyuan Liu, Maosong Sun, Yansong Feng, Xianpei Han, Zhen Hu, Heng
Wang, and Jianfeng Xu. 2018. CAIL2018: A Large-Scale Legal Dataset for Judgment Prediction. arXiv:1807.02478 [cs.CL]
https://arxiv.org/abs/1807.02478
[128] Nuo Xu, Pinghui Wang, Long Chen, Li Pan, Xiaoyan Wang, and Junzhou Zhao. 2020. Distinguish Confusing Law Articles for Legal
Judgment Prediction. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Dan Jurafsky, Joyce
Chai, Natalie Schluter, and Joel Tetreault (Eds.). Association for Computational Linguistics, Online, 3086–3095.
https://doi.org/10.18653/v1/2020.acl-main.280
[129] Wenmian Yang, Weijia Jia, Xiaojie Zhou, and Yutao Luo. 2019. Legal judgment prediction via multi-perspective bi-feedback network.
In Proceedings of the 28th International Joint Conference on Artificial Intelligence (Macao, China) (IJCAI’19). AAAI Press, 4085–4091.
[130] Hai Ye, Xin Jiang, Zhunchen Luo, and Wenhan Chao. 2018. Interpretable Charge Predictions for Criminal Cases: Learning to Generate
Court Views from Fact Descriptions. In Proceedings of the 2018 Conference of the North American Chapter of the Association for
Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), Marilyn Walker, Heng Ji, and Amanda Stent (Eds.).
Association for Computational Linguistics, New Orleans, Louisiana, 1854–1864. https://doi.org/10.18653/v1/N18-1168
[131] Mingruo Yuan, Ben Kao, Tien-Hsuan Wu, Michael M. K. Cheung, Henry W. H. Chan, Anne S. Y. Cheung, Felix W. H. Chan, and Yongxi
Chen. 2023. Bringing legal knowledge to the public by constructing a legal question bank using large-scale pre-trained language model.
Artificial Intelligence and Law (2023). https://doi.org/10.1007/s10506-023-09367-6
[132] Kwan Yuen Iu and Vanessa Man-Yi Wong. 2023. ChatGPT by OpenAI: The End of Litigation Lawyers.
https://doi.org/10.2139/ssrn.4339839
[133] Han Zhang, Zhicheng Dou, Yutao Zhu, and Ji-Rong Wen. 2023. Contrastive Learning for Legal Judgment Prediction. ACM Transactions
on Information Systems 41, 4, Article 113 (apr 2023), 25 pages. https://doi.org/10.1145/3580489
[134] Weiqi Zhang, Hechuan Shen, Tianyi Lei, Qian Wang, Dezhong Peng, and Xu Wang. 2023. GLQA: A Generation-based Method for Legal
Question Answering. In 2023 International Joint Conference on Neural Networks (IJCNN). 1–8.
https://doi.org/10.1109/IJCNN54540.2023.10191483
[135] Lucia Zheng, Neel Guha, Brandon R. Anderson, Peter Henderson, and Daniel E. Ho. 2021. When does pretraining help? assessing
self-supervised learning for law and the CaseHOLD dataset of 53,000+ legal holdings. In Proceedings of the Eighteenth International
Conference on Artificial Intelligence and Law (São Paulo, Brazil) (ICAIL ’21). Association for Computing Machinery, New York, NY, USA,
159–168. https://doi.org/10.1145/3462757.3466088
[136] Haoxi Zhong, Zhipeng Guo, Cunchao Tu, Chaojun Xiao, Zhiyuan Liu, and Maosong Sun. 2018. Legal Judgment Prediction via
Topological Learning. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Ellen Riloff, David
Chiang, Julia Hockenmaier, and Jun’ichi Tsujii (Eds.). Association for Computational Linguistics, Brussels, Belgium, 3540–3549.
https://doi.org/10.18653/v1/D18-1390
[137] Haoxi Zhong, Yuzhong Wang, Cunchao Tu, Tianyang Zhang, Zhiyuan Liu, and Maosong Sun. 2020. Iteratively Questioning and
Answering for Interpretable Legal Judgment Prediction. Proceedings of the AAAI Conference on Artificial Intelligence 34, 01 (Apr 2020),
1250–1257. https://doi.org/10.1609/aaai.v34i01.5479
[138] Haoxi Zhong, Chaojun Xiao, Cunchao Tu, Tianyang Zhang, Zhiyuan Liu, and Maosong Sun. 2020. How Does NLP Benefit Legal
System: A Summary of Legal Artificial Intelligence. In Proceedings of the 58th Annual Meeting of the Association for Computational
Linguistics, Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel Tetreault (Eds.). Association for Computational Linguistics, Online,
5218–5230. https://doi.org/10.18653/v1/2020.acl-main.466
[139] Haoxi Zhong, Chaojun Xiao, Cunchao Tu, Tianyang Zhang, Zhiyuan Liu, and Maosong Sun. 2020. JEC-QA: A Legal-Domain Question
Answering Dataset. In Proceedings of the AAAI conference on artificial intelligence, Vol. 34. 9701–9708.
https://doi.org/10.48550/arXiv.1911.12011
[140] Linwu Zhong, Ziyi Zhong, Zinian Zhao, Siyuan Wang, Kevin D. Ashley, and Matthias Grabmair. 2019. Automatic Summarization
of Legal Decisions using Iterative Masking of Predictive Sentences. In Proceedings of the Seventeenth International Conference on
Artificial Intelligence and Law (Montreal, QC, Canada) (ICAIL ’19). Association for Computing Machinery, New York, NY, USA, 163–172.
https://doi.org/10.1145/3322640.3326728
[141] Yang Zhong and Diane Litman. 2022. Computing and Exploiting Document Structure to Improve Unsupervised Extractive Summarization
of Legal Case Decisions. In Proceedings of the Natural Legal Language Processing Workshop 2022, Nikolaos Aletras, Ilias Chalkidis,
Leslie Barrett, Cătălina Goanță, and Daniel Preoțiuc-Pietro (Eds.). Association for Computational Linguistics, Abu Dhabi, United Arab
Emirates (Hybrid), 322–337.
[142] Andreas Östling, Holli Sargeant, Huiyuan Xie, Ludwig Bull, Alexander Terenin, Leif Jonsson, Måns Magnusson, and Felix Steffek. 2024.
The cambridge law corpus: a dataset for legal AI research. In Proceedings of the 37th International Conference on Neural Information
Processing Systems (New Orleans, LA, USA) (NIPS ’23). Curran Associates Inc., Red Hook, NY, USA, Article 1793, 31 pages.

ACM Comput. Surv., Vol. 1, No. 1, Article . Publication date: October 2024.
