
Large Language Models are legal but they are not: Making the case for a powerful LegalLLM

Thanmay Jayakumar Fauzan Farooqui∗ Luqman Farooqui∗


Visvesvaraya National Institute of Technology, Nagpur, India
{thanmayjayakumar, fauzanfarooqui7, luqmanfarooqui99}@gmail.com

arXiv:2311.08890v1 [cs.CL] 15 Nov 2023

∗ These authors contributed equally.

Abstract

Realizing the recent advances in Natural Language Processing (NLP) in the legal sector poses challenging problems such as extremely long sequence lengths, specialized vocabulary that is usually only understood by legal professionals, and high amounts of data imbalance. The recent surge of Large Language Models (LLMs) has begun to provide new opportunities to apply NLP in the legal domain due to their ability to handle lengthy, complex sequences. Moreover, the emergence of domain-specific LLMs has displayed extremely promising results on various tasks. In this study, we aim to quantify how general LLMs perform in comparison to legal-domain models (be it an LLM or otherwise). Specifically, we compare the zero-shot performance of three general-purpose LLMs (ChatGPT-20b, LLaMA-2-70b, and Falcon-180b) on the LEDGAR subset of the LexGLUE benchmark for contract provision classification. Although the LLMs were not explicitly trained on legal data, we observe that they are still able to classify the theme correctly in most cases. However, we find that their mic-F1/mac-F1 performance is up to 19.2/26.8% lower than smaller models fine-tuned on the legal domain, thus underscoring the need for more powerful legal-domain LLMs.

1 Introduction

Legal professionals typically deal with large amounts of textual information on a daily basis to make well-informed decisions in their practice. This can become very tedious and demanding due to the overwhelming amount of data they must manage and the meticulous attention to detail necessary to maintain the required precision in their work. Thanks to the rise of LLMs, many tasks such as sentiment analysis, named entity recognition, information retrieval, etc. can now be handled by neural models. Though this holds true for the legal domain as well (Sun, 2023), they aren't used to make direct decisions. Nevertheless, these automated systems that produce legal predictions and generations are predominantly useful as advisory tools for legal practitioners that can augment their decision-making process.

Transformers (Vaswani et al., 2017) have become the de facto method for many text classification and multiple choice question answering tasks. BERT (Devlin et al., 2019), a transformer encoder, and its derived models like RoBERTa (Liu et al., 2019) are commonly employed in legal NLP tasks. Pre-training such models on legal corpora can help a model adapt to a specific domain by fine-tuning it with domain-specific data. LegalBERT (Chalkidis et al., 2020) is one such BERT model that was trained on legal-oriented data. CaseLaw-BERT (Zheng et al., 2021), PoL-BERT (Henderson et al., 2022), and LexLM (Chalkidis et al., 2023) are a few more BERT-based variants pre-trained for the legal domain. Although they show remarkable performance on various legal tasks in comparison with general-purpose BERT models, one limit of these models is that BERT's input size can only incorporate a maximum of 512 tokens. For short sequences this may seem enough, but in the case of long documents commonly found in the legal domain, where input texts can go over 5000 tokens (and few-shot settings require even more), it can be a severe drawback as a lot of important information will get truncated.

Due to this limit, BERT-based models aren't employed as-is in long-document tasks. Typically, methods like hierarchical attention are utilized, where the long document is split into segments of maximum length (512 in the case of BERT models) and these segments are independently encoded. These segment embeddings are then aggregated with stacked transformers to get the overall encoding of the entire document.
Similarly, recurrent transformers (Dai et al., 2019; Yang et al., 2019; Ding et al., 2021) were proposed to process long documents by encoding representations of individual segments in a recurrent fashion. Sparse attention is another method that has been proposed to tackle long sequence inputs (Ainslie et al., 2020; Zaheer et al., 2020). Longformer (Beltagy et al., 2020) uses a combination of local and global attention mechanisms to save on computational complexity and enables the processing of up to 4096 tokens. A number of other works (Dai et al., 2022; Mamakas et al., 2022) show that transformer-based architectures that can capture longer text boast major benefits, even more so when augmented with strategies like sparse attention and hierarchical networks. This again underlines an important direction for verbose legal datasets.
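To make the hierarchical strategy concrete, the sketch below splits a long provision into BERT-sized segments, encodes each segment independently, and pools the segment embeddings. It is a minimal illustration under our own assumptions (base encoder, segment size, and mean-pooling aggregator), not the implementation of any of the cited models:

```python
# Minimal sketch of hierarchical encoding for long documents (illustrative only;
# the base encoder, segment size, and mean-pooling aggregator are assumptions,
# not the exact setup of the cited works).
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

def encode_long_document(text: str, seg_len: int = 510) -> torch.Tensor:
    # Tokenize without truncation, then cut the ids into segments that
    # respect BERT's 512-token limit once [CLS] and [SEP] are added.
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    cls_id, sep_id = tokenizer.cls_token_id, tokenizer.sep_token_id
    seg_embeddings = []
    for i in range(0, len(ids), seg_len):
        seg = [cls_id] + ids[i:i + seg_len] + [sep_id]
        with torch.no_grad():
            out = encoder(input_ids=torch.tensor([seg]))
        seg_embeddings.append(out.last_hidden_state[:, 0])  # [CLS] per segment
    # A small stacked transformer would normally aggregate the segment
    # embeddings; mean pooling is a simple stand-in to keep the sketch short.
    return torch.cat(seg_embeddings, dim=0).mean(dim=0)
```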
Our contributions can be summarized as follows:

• We conduct experiments to compare and analyze the zero-shot performance of three general LLMs to that of state-of-the-art in-domain models on the LEDGAR subset of LexGLUE (Chalkidis et al., 2022). We analyze our results, quantify whether LLMs conform to expected advantages, and provide insights for further research.
• We provide an overview of the most recent LLM research, the benchmarks and datasets developed for legal NLP, the challenges faced when applying them to legal tasks, and popular approaches that solve them. We believe this to be a useful primer for anyone looking to get a bird's eye view of the field.

2 Related Work

In this section, we outline the relevant research on LLMs, efforts in using them for legal domain tasks, and finally the benchmarks and datasets.

2.1 Large Language Models

OpenAI GPT: GPT (Generative Pre-trained Transformer) (Radford et al., 2019; Brown et al., 2020) and the popular ChatGPT variant developed by OpenAI are a family of large-scale proprietary transformer-decoder models pretrained to perform generative and language modeling tasks, and allow a reasonable context length sufficient to carry out long-document processing. For instance, GPT-3.5 supports a maximum of 4096 tokens, and GPT-4 allows a stunning maximum of 32,768, ideal for data consisting of long sequences.

Google PaLM: PaLM (Pathways Language Model) (Chowdhery et al., 2022; Anil et al., 2023) is a proprietary LLM having 540 billion parameters that was trained on the Pathways architecture. Although PaLM was initially trained to handle sequence lengths of up to 2048 tokens, this was increased to 8096 in the 340-billion-parameter PaLM 2 for longer comprehension of the input.

BigScience BLOOM: BLOOM (BigScience Large Open-science Open-access Multilingual Language Model) (Scao et al., 2022) is a group of open-source multilingual LLMs, the largest having 176 billion parameters. It encompasses 46 natural and 13 programming languages, facilitating sequence lengths of up to 2048 tokens.

Meta LLaMA: LLaMA (Large Language Model Meta AI) (Touvron et al., 2023) is a collection of open-source foundation language models ranging from 7 billion to 70 billion parameters. It was pre-trained natively on 2048 input tokens, but recent research has shown that the context length of LLMs can be extended efficiently with minimal training steps (Peng et al., 2023), leading to the release of two variations of LLaMA that have an astounding context length of 64k and 128k.

TII Falcon¹: This work by the Technology Innovation Institute (TII) boasts of being the largest open-source model to date, while also ranking highest on the HuggingFace Leaderboard. It includes models with 180B, 40B, 7.5B, and 1.3B parameters (context window size of 2048) trained on TII's RefinedWeb dataset (Penedo et al., 2023).

¹ https://falconllm.tii.ae/falcon.html

2.2 LLMs on the legal domain

LexGPT: (Lee, 2023) fine-tune GPT-J models on the Pile of Law dataset (Henderson et al., 2022); the result is the best-performing LLM fine-tuned for legal use cases (a LegalLLM) at the time of writing. They experiment with generative models for legal classification tasks and observe that fine-tuning such out-of-the-box GPTs still results in low performance when compared to discriminative models. This insightfully shows the need to bridge the gap between powerful LLMs and the legal domain.
PolicyGPT: (Tang et al., 2023) demonstrate that many LLMs in zero-shot settings perform remarkably well when tasked with text classification of privacy policies. Though specific, this shows how a LegalLLM may hold promise in enhancing performance on other general tasks.

Zero-and-Few-shot GPT: (Chalkidis, 2023) conduct experiments most similar to our study. They evaluate the performance of ChatGPT on the LexGLUE benchmark in both zero-shot and few-shot settings (for the latter, examples were given in the instruction prompt, which seems to benefit the model when the number of examples and labels are around the same). They find that ChatGPT performs very well, but severely lacks in performance compared to smaller models trained on in-domain datasets.

Resonating with these findings, the work of (Savelka, 2023) investigates how an LLM (a GPT model) performs on a semantic annotation task in zero-shot settings, without being fine-tuned on legal-domain datasets. The LLM is primed with a short sentence description of each annotation label and is tasked with labeling a short span of text. They observe that while the LLM performs surprisingly well given the zero-shot setting, its performance was still far off from the model that was trained on the in-domain data. In summary, both studies highlight the potential fine-tuned LLMs can bring to the legal domain.

2.3 Datasets and Benchmarks

LexGLUE: (Chalkidis et al., 2022) present a unified evaluation framework for legal tasks to benchmark models. The datasets and tasks were curated from other sources of data considering various factors such as availability, size, difficulty, etc. They present scores for various Pre-trained Language Models (PLMs) on their benchmark. They point out interesting results that suggest that PLMs fine-tuned on general legal datasets and tasks do perform better, although PLMs fine-tuned on only one sub-domain don't improve performance on that same sub-domain. Put together, their observations point out the need for a general LegalLLM (powerful enough to outperform other models on all criteria of the benchmark).

LegalBench: (Guha et al., 2023) This benchmark comprises 162 tasks representing six distinct forms of legal reasoning and outlines an empirical evaluation of 20 LLMs. They demonstrate how LegalBench supports easing communication between legal professionals and LLM developers by using the IRAC framework in the case of American law. They observe that LLMs typically perform better on classification tasks than application-based ones. They also find that for some tasks, in-context examples are not required, or only marginally improve performance. They thus conclude that the task performance in LLMs is mostly driven by the task description used in the prompt.

Pile of Law: (Henderson et al., 2022) The surge in LLM development emphasizes the need for responsible practices in filtering out biased, explicit, copyrighted, and confidential content during pre-training. Present methodologies are ad hoc and do not account for context. To address this, Pile of Law, a growing 256GB dataset of open-source English legal and administrative data, was introduced to aid in legal tasks. This paper outlines a method for filtering legal-domain text while handling associated trade-offs. It aids in understanding government-established content filtering guidelines and illustrates various ways to learn responsible data filtering from the law.

MultiLegalPile: (Chalkidis et al., 2021) The MultiLegalPile is a substantial 689 GB dataset that spans 24 EU languages across 17 jurisdictions. It addresses the scarce availability of multilingual pre-training data in the legal domain, encompassing diverse legal data sources with varying licenses. In certain languages, monolingual models substantially outperform the base model, achieving language-specific SotA in five languages. In LexGLUE, English models secure SotA in five of seven tasks.

3 Experimental Setup and Results

In this section, we describe our experimental approach, along with specifics of our evaluations.

3.1 Dataset and Metrics

We use the LEDGAR (Tuggener et al., 2020) subset of the LexGLUE benchmark for our experiments due to its readiness to work with LLMs (for example, the other datasets have label indices alone, not the actual label names). The dataset was loaded through the HuggingFace Datasets library (Lhoest et al., 2021).
Figure 1: The frequency distributions of the 100 LEDGAR labels in the original LEDGAR test set from LexGLUE (left) and in our sampled test set of 1,000 examples (right).

In this benchmark, given a contract provision, the model is tasked with classifying it into one of 100 EDGAR theme labels. As mentioned, there is a high imbalance of data in datasets containing legal corpora. Figure 1 shows the label distribution in the LEDGAR subset benchmark. This could result in difficulties such as biased models and poor classification scores. To better report model evaluations in such settings, the F1-score is usually reported instead of accuracy. Moreover, both macro-F1 and micro-F1 scores are usually reported. For imbalanced datasets, the former more accurately reflects the classifier's performance, as the latter skews the metric towards the classes with larger proportions, which is why the macro-F1 scores are typically lower than the micro-F1 ones in these scenarios.
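As a concrete reference for these metrics, the following sketch loads the LEDGAR test split through the HuggingFace Datasets library and scores predictions with micro- and macro-averaged F1 via scikit-learn; the dataset identifiers and the placeholder predictions are assumptions for illustration, not our evaluation code:

```python
# Sketch: load LEDGAR and score predictions with micro/macro F1.
# The dataset identifiers and the dummy predictions are illustrative only.
from datasets import load_dataset
from sklearn.metrics import f1_score

ledgar_test = load_dataset("lex_glue", "ledgar", split="test")
label_names = ledgar_test.features["label"].names  # the 100 EDGAR theme labels

y_true = ledgar_test["label"]
y_pred = y_true  # placeholder: replace with model predictions (label indices)

# Micro-F1 is dominated by frequent classes; macro-F1 weights all 100 classes equally.
print("mic-F1:", f1_score(y_true, y_pred, average="micro"))
print("mac-F1:", f1_score(y_true, y_pred, average="macro"))
```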
As for the sequence lengths, (Chalkidis, 2023) report the average token length of the instruction-following examples in all the LexGLUE subsets, the highest being 3.6k tokens. This restricts LLM performance due to truncation as noted earlier, and this is also highlighted in their study: few-shot settings could not be evaluated for datasets having an average token length of more than 2k for a single example, and in many cases, the prompts were already truncated at 4k tokens (the maximum limit of ChatGPT). The average token length of the LEDGAR subset is 0.6k.
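Sequence lengths like these can be checked directly with a tokenizer. The sketch below measures LEDGAR provision lengths against BERT's 512-token limit; the choice of the bert-base-uncased tokenizer is an assumption, and counts vary slightly across tokenizers:

```python
# Sketch: measure LEDGAR provision lengths against the 512-token BERT limit.
# The tokenizer choice is illustrative; token counts differ by tokenizer.
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
provisions = load_dataset("lex_glue", "ledgar", split="test")["text"]

lengths = [len(tokenizer(t, add_special_tokens=False)["input_ids"]) for t in provisions]
print("average tokens per provision:", sum(lengths) / len(lengths))
print("share exceeding 512 tokens:", sum(l > 512 for l in lengths) / len(lengths))
```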
3.2 Setup

As baselines, we take three LLMs: ChatGPT (20b), LLaMA-2 (70b), and Falcon (180b). Since the models are very large, we use HuggingFace Chat for LLaMA and Falcon. Due to this constraint, we only evaluated on a subset of 1,000 examples. However, we made sure that the subset had a label frequency distribution close to the original dataset (see Figure 1) so that the evaluations remain generalizable.
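A label-stratified subsample of this kind can be drawn in a few lines. The sketch below is one minimal way to obtain roughly 1,000 examples whose label frequencies track the full test set; the exact procedure and seed are assumptions, not necessarily the sampling we used:

```python
# Sketch: draw ~1,000 test examples whose label distribution tracks the full
# LEDGAR test set, keeping at least one example per label. Illustrative only.
import random
from collections import defaultdict
from datasets import load_dataset

test = load_dataset("lex_glue", "ledgar", split="test")

by_label = defaultdict(list)
for idx, lab in enumerate(test["label"]):
    by_label[lab].append(idx)

random.seed(42)
frac = 1000 / len(test)
keep = []
for lab, idxs in by_label.items():
    k = max(1, round(frac * len(idxs)))  # proportional, but never drop a label entirely
    keep.extend(random.sample(idxs, k))

subset = test.select(keep)
print(len(subset), "examples sampled")
```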
We use zero-shot prompting to evaluate the above-mentioned LLMs, building on the benefits explained earlier by other works (Tang et al., 2023; Guha et al., 2023). Further, in the custom instructions (ChatGPT) and system instructions (HuggingChat), we enter the list of EDGAR theme classes that the model should choose from. In the same fashion, to ensure that the model does not generate anything out of the list, we explicitly mention this constraint as an instruction. The exact instructions that we use are provided in Appendix A.
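The constrained instruction can also be assembled programmatically from the label list. The sketch below mirrors the wording in Appendix A purely as an illustration of how the output space is restricted; we entered the instructions through the ChatGPT and HuggingChat interfaces rather than through code:

```python
# Sketch: build the constrained zero-shot instruction from the 100 EDGAR labels.
# We used the ChatGPT / HuggingChat UIs; this only illustrates the prompt structure.
from datasets import load_dataset

label_names = load_dataset("lex_glue", "ledgar", split="test").features["label"].names

system_instruction = (
    "I want you to be an EDGAR contract provision classifier. "
    "Given a contract provision, you should correctly identify the EDGAR theme. "
    "Do not give any explanations. "
    f"One answer from the following list: [{', '.join(label_names)}]. "
    "Do not give an option that is not in the list."
)

def build_prompt(provision: str) -> str:
    # The provision text follows the instruction as the user message.
    return f"{system_instruction}\n\nProvision: {provision}\nTheme:"
```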
3.3 Results and Discussion

For our experiments, we use three baseline general-purpose chat variants - ChatGPT (20b), Falcon (180b), and LLaMA-2 (70b) - and present the results in Table 1. General-purpose LLMs perform worse than smaller in-domain models. The best general LLM, Falcon-Chat, performs 19.2% mic-F1 and 26.8% mac-F1 lower than the best in-domain model, LegalBERT, which itself is much smaller than LexGPT, the current LegalLLM. Our findings echo those of (Chalkidis, 2023).

Model        mic. F1   mac. F1   # params.
Falcon-Chat  70.9      60.7      180b
LLaMA-Chat   70.4      59.6      70b
ChatGPT      70.6      58.7      20b
LexGPT       83.9      74.0      6b
LegalBERT    88.2      83.0      0.11b

Table 1: Comparison of general LLMs (first three models, tested in a zero-shot setting by us) to models fine-tuned on legal-domain datasets (last two). The current LegalLLM is LexGPT, but the much smaller LegalBERT shows state-of-the-art performance on LEDGAR.

Notably, for class labels with only one example in our sampled test set, the three chat variants surprisingly show the same results: they fail to predict them correctly, except for the Qualification label (the others being Assigns, Books, Powers, and Sanctions). Similarly, Indemnity is always misclassified as Indemnifications (three examples in total). Further, labels that are semantically similar are frequently mislabeled by the models (like Indemnity and Indemnifications as pointed out earlier). For example, (Taxes, Tax Withholdings and Withholdings) is almost always labeled as Tax Withholdings by all the models. (Jurisdictions, Submission To Jurisdiction, Consent To Jurisdiction) is almost always labeled as Submission To Jurisdiction in the case of ChatGPT and Jurisdiction in the case of Falcon and LLaMA. As for (Applicable Laws, Governing Laws, Compliance With Laws), we observe that Governing Laws was easiest to predict with an average accuracy of 90%, and Compliance With Laws with 80%, but Applicable Laws performs very poorly with 0% accuracy for LLaMA and Falcon and 20% for ChatGPT - predicting only one from a total of 5 samples correctly. However, in the case of (Payments, Fees, Interests), the models seem to predict them correctly in about 60% of the cases, with Payments appearing at least once for Fees and Interests. On average, only 95 of the 100 classes in the reference labels are present in the predictions.
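Label-wise breakdowns like these can be produced from a simple confusion count. The sketch below computes per-class accuracy and the most frequent wrong prediction for each gold label; the parallel lists of gold and predicted label names are hypothetical inputs, not our stored outputs:

```python
# Sketch: per-label accuracy and the most common confusion for each gold label.
# `gold` and `pred` are assumed to be parallel lists of label-name strings.
from collections import Counter, defaultdict

def label_report(gold: list[str], pred: list[str]) -> dict[str, dict]:
    confusions = defaultdict(Counter)
    for g, p in zip(gold, pred):
        confusions[g][p] += 1
    report = {}
    for label, counts in confusions.items():
        total = sum(counts.values())
        wrong = Counter({p: c for p, c in counts.items() if p != label})
        report[label] = {
            "accuracy": counts[label] / total,
            "support": total,
            "top_confusion": wrong.most_common(1)[0][0] if wrong else None,
        }
    return report
```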
3.4 Subjective Analysis

Our findings highlight that the perceived advantages LLMs have over BERT-based models (such as the sheer number of parameters, extended context length, and the amount of pre-training knowledge) cannot substitute for the obvious edge in-domain data gives to the much smaller models. Even when the LLM is trained on such data (LexGPT), it couldn't perform as well as the discriminative model (LegalBERT). This could be expected, as the latter is more naturally suited for the benchmark's classification tasks than generative models, which are prone to issues like hallucination. Our label-wise findings reflect this too.

However, the current legal benchmarks are limited to NLU tasks. In general, it would be ideal to have a powerful LegalLLM that can perform both generative and discriminative tasks. Our findings show that there is a unique challenge in the legal domain: if we have to build a better LegalLLM, we need to find better methods to take advantage of the in-domain legal data for LLMs, as simply fine-tuning on it doesn't seem to be enough. As the authors of LexGPT mention, reinforcement learning from human feedback could be extremely helpful in improving LexGPT, providing ways for the first LegalLLM to produce state-of-the-art results.

However, if we limit the application of legal models to NLU tasks, our findings turn optimistic. The results show that the LLMs' ability to process large context may not be necessary for classification - we hypothesize this could be because verbose legal text could turn out to have very similar semantic content, so the additional context may not be as useful as expected. This hypothesis could be echoed by findings from (Shaikh et al., 2020), who show that a careful selection of a handful of textual features in a verbose dataset is strong enough to help statistical models achieve high accuracies for binary classification.

This in fact should be good news for NLU, as it means legal practitioners can avoid having to use or train unnecessarily large or expensive models (both carbon-wise and cost-wise). Much smaller in-domain models like LegalBERT are nevertheless superior and should be used for practical applications, as suggested by (Chalkidis, 2023).

4 Conclusion

In this work, we examine three general-purpose LLMs' zero-shot performance on a multi-class contract provision classification task using the LEDGAR dataset of LexGLUE. Our study shows that these LLMs, even though not explicitly trained on legal data, can still demonstrate respectable theme classification performance but are easily overshadowed by smaller in-domain models. The results highlight the need for better LegalLLMs. In light of this, we also present a review of related datasets and models, which we hope will help readers get an overview of the field.
References

Joshua Ainslie, Santiago Ontanon, Chris Alberti, Vaclav Cvicek, Zachary Fisher, Philip Pham, Anirudh Ravula, Sumit Sanghai, Qifan Wang, and Li Yang. 2020. ETC: Encoding long and structured inputs in transformers. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 268–284, Online. Association for Computational Linguistics.

Rohan Anil, Andrew M Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, et al. 2023. PaLM 2 technical report. arXiv preprint arXiv:2305.10403.

Iz Beltagy, Matthew E Peters, and Arman Cohan. 2020. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150.

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901.

Ilias Chalkidis. 2023. ChatGPT may pass the bar exam soon, but has a long way to go for the LexGLUE benchmark. arXiv preprint arXiv:2304.12202.

Ilias Chalkidis, Manos Fergadiotis, and Ion Androutsopoulos. 2021. MultiEURLEX - a multi-lingual and multi-label legal document classification dataset for zero-shot cross-lingual transfer. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 6974–6996, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.

Ilias Chalkidis, Manos Fergadiotis, Prodromos Malakasiotis, Nikolaos Aletras, and Ion Androutsopoulos. 2020. LEGAL-BERT: The muppets straight out of law school. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 2898–2904, Online. Association for Computational Linguistics.

Ilias Chalkidis, Nicolas Garneau, Catalina Goanta, Daniel Katz, and Anders Søgaard. 2023. LeXFiles and LegalLAMA: Facilitating English multinational legal language model development. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15513–15535, Toronto, Canada. Association for Computational Linguistics.

Ilias Chalkidis, Abhik Jana, Dirk Hartung, Michael Bommarito, Ion Androutsopoulos, Daniel Katz, and Nikolaos Aletras. 2022. LexGLUE: A benchmark dataset for legal language understanding in English. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 4310–4330, Dublin, Ireland. Association for Computational Linguistics.

Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. 2022. PaLM: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311.

Xiang Dai, Ilias Chalkidis, Sune Darkner, and Desmond Elliott. 2022. Revisiting transformer-based models for long document classification. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 7212–7230, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.

Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc Le, and Ruslan Salakhutdinov. 2019. Transformer-XL: Attentive language models beyond a fixed-length context. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2978–2988, Florence, Italy. Association for Computational Linguistics.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

SiYu Ding, Junyuan Shang, Shuohuan Wang, Yu Sun, Hao Tian, Hua Wu, and Haifeng Wang. 2021. ERNIE-Doc: A retrospective long-document modeling transformer. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 2914–2927, Online. Association for Computational Linguistics.

Neel Guha, Julian Nyarko, Daniel E Ho, Christopher Ré, Adam Chilton, Aditya Narayana, Alex Chohlas-Wood, Austin Peters, Brandon Waldon, Daniel N Rockmore, et al. 2023. LegalBench: A collaboratively built benchmark for measuring legal reasoning in large language models. arXiv preprint arXiv:2308.11462.

Peter Henderson, Mark Krass, Lucia Zheng, Neel Guha, Christopher D Manning, Dan Jurafsky, and Daniel Ho. 2022. Pile of Law: Learning responsible data filtering from the law and a 256GB open-source legal dataset. Advances in Neural Information Processing Systems, 35:29217–29234.

Jieh-Sheng Lee. 2023. LexGPT 0.1: Pre-trained GPT-J models with Pile of Law. arXiv preprint arXiv:2306.05431.

Quentin Lhoest, Albert Villanova del Moral, Yacine Jernite, Abhishek Thakur, Patrick von Platen, Suraj Patil, Julien Chaumond, Mariama Drame, Julien Plu, Lewis Tunstall, et al. 2021. Datasets: A community library for natural language processing. arXiv preprint arXiv:2109.02846.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Dimitris Mamakas, Petros Tsotsi, Ion Androutsopoulos, and Ilias Chalkidis. 2022. Processing long legal documents with pre-trained transformers: Modding LegalBERT and Longformer. In Proceedings of the Natural Legal Language Processing Workshop 2022, pages 130–142, Abu Dhabi, United Arab Emirates (Hybrid). Association for Computational Linguistics.

Guilherme Penedo, Quentin Malartic, Daniel Hesslow, Ruxandra Cojocaru, Alessandro Cappelli, Hamza Alobeidli, Baptiste Pannier, Ebtesam Almazrouei, and Julien Launay. 2023. The RefinedWeb dataset for Falcon LLM: Outperforming curated corpora with web data, and web data only. arXiv preprint arXiv:2306.01116.

Bowen Peng, Jeffrey Quesnelle, Honglu Fan, and Enrico Shippole. 2023. YaRN: Efficient context window extension of large language models. arXiv preprint arXiv:2309.00071.

Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners.

Jaromir Savelka. 2023. Unlocking practical applications in legal domain: Evaluation of GPT for zero-shot semantic annotation of legal texts. arXiv preprint arXiv:2305.04417.

Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilić, Daniel Hesslow, Roman Castagné, Alexandra Sasha Luccioni, François Yvon, et al. 2022. BLOOM: A 176B-parameter open-access multilingual language model. arXiv preprint arXiv:2211.05100.

Rafe Athar Shaikh, Tirath Prasad Sahu, and Veena Anand. 2020. Predicting outcomes of legal cases based on legal factors using classifiers. Procedia Computer Science, 167:2393–2402.

Zhongxiang Sun. 2023. A short survey of viewing large language models in legal aspect. arXiv preprint arXiv:2303.09136.

Chenhao Tang, Zhengliang Liu, Chong Ma, Zihao Wu, Yiwei Li, Wei Liu, Dajiang Zhu, Quanzheng Li, Xiang Li, Tianming Liu, et al. 2023. PolicyGPT: Automated analysis of privacy policies with large language models. arXiv preprint arXiv:2309.10238.

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.

Don Tuggener, Pius von Däniken, Thomas Peetz, and Mark Cieliebak. 2020. LEDGAR: A large-scale multi-label corpus for text classification of legal provisions in contracts. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 1235–1241, Marseille, France. European Language Resources Association.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in Neural Information Processing Systems, 30.

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R Salakhutdinov, and Quoc V Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. Advances in Neural Information Processing Systems, 32.

Manzil Zaheer, Guru Guruganesh, Kumar Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, et al. 2020. Big Bird: Transformers for longer sequences. Advances in Neural Information Processing Systems, 33:17283–17297.

Lucia Zheng, Neel Guha, Brandon R Anderson, Peter Henderson, and Daniel E Ho. 2021. When does pretraining help? Assessing self-supervised learning for law and the CaseHOLD dataset of 53,000+ legal holdings. In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Law, pages 159–168.

A Custom Prompt

For reproducibility, we present the prompts that we use for all our experiments. The following is the entry to the Custom Instructions setting of ChatGPT. For HuggingChat, we simply provide both instructions in the Custom System Prompt box.

What would you like ChatGPT to know about you to provide better responses? I want you to be an EDGAR contract provision classifier. Given a contract provision, you should correctly identify the EDGAR theme. Do not give any explanations.

How would you like ChatGPT to respond? One answer from the following list: [ {{paste the list here}} ]. Do not give an option that is not in the list.
