
Large Language Models are legal but they are not: Making the case for a powerful LegalLLM

Thanmay Jayakumar Fauzan Farooqui∗ Luqman Farooqui∗


Visvesvaraya National Institute of Technology, Nagpur, India
{thanmayjayakumar, fauzanfarooqui7, luqmanfarooqui99}@gmail.com

arXiv:2311.08890v1 [cs.CL] 15 Nov 2023

∗ These authors contributed equally.

Abstract

Realizing the recent advances in Natural Language Processing (NLP) in the legal sector poses challenging problems such as extremely long sequence lengths, specialized vocabulary that is usually only understood by legal professionals, and high amounts of data imbalance. The recent surge of Large Language Models (LLMs) has begun to provide new opportunities to apply NLP in the legal domain due to their ability to handle lengthy, complex sequences. Moreover, the emergence of domain-specific LLMs has displayed extremely promising results on various tasks. In this study, we aim to quantify how general LLMs perform in comparison to legal-domain models (be it an LLM or otherwise). Specifically, we compare the zero-shot performance of three general-purpose LLMs (ChatGPT-20b, LLaMA-2-70b, and Falcon-180b) on the LEDGAR subset of the LexGLUE benchmark for contract provision classification. Although the LLMs were not explicitly trained on legal data, we observe that they are still able to classify the theme correctly in most cases. However, we find that their mic-F1/mac-F1 performance is up to 19.2/26.8% lower than smaller models fine-tuned on the legal domain, thus underscoring the need for more powerful legal-domain LLMs.

1 Introduction

Legal professionals typically deal with large amounts of textual information on a daily basis to make well-informed decisions in their practice. This can become very tedious and demanding due to the overwhelming amount of data they must manage and the meticulous attention to detail necessary to maintain the required precision in their work. Thanks to the rise of LLMs, many tasks such as sentiment analysis, named entity recognition, information retrieval, etc. can now be handled by neural models. Though this holds true for the legal domain as well (Sun, 2023), they aren't used to make direct decisions. Nevertheless, these automated systems that produce legal predictions and generations are predominantly useful as advisory tools for legal practitioners that can augment their decision-making process.

Transformers (Vaswani et al., 2017) have become the de facto method for many text classification and multiple choice question answering tasks. BERT (Devlin et al., 2019), a transformer encoder, and its derived models like RoBERTa (Liu et al., 2019) are commonly employed in legal NLP tasks. Pre-training such models on legal corpora can help a model adapt to a specific domain by fine-tuning it with domain-specific data. LegalBERT (Chalkidis et al., 2020) is one such BERT model that was trained on legal-oriented data. CaseLaw-BERT (Zheng et al., 2021), PoL-BERT (Henderson et al., 2022), and LexLM (Chalkidis et al., 2023) are a few more BERT-based variants pre-trained for the legal domain. Although they show remarkable performance on various legal tasks in comparison with general-purpose BERT models, one limit of these models is that BERT's input size can only incorporate a maximum of 512 tokens. For short sequences this may seem enough, but in the case of long documents commonly found in the legal domain, where input texts can go over 5000 tokens (and few-shot settings require even more), it can be a severe drawback as a lot of important information will get truncated.

Due to this limit, BERT-based models aren't employed as-is in long-document tasks. Typically, methods like hierarchical attention are utilized, where the long document is split into segments of maximum length (512 in the case of BERT models) and these segments are independently encoded. These segment embeddings are then aggregated with stacked transformers to get the overall encoding of the entire document.
Similarly, recurrent transformers (Dai et al., 2019; Yang et al., 2019; Ding et al., 2021) were proposed to process long documents by encoding representations of individual segments in a recurrent fashion. Sparse attention is another method that has been proposed to tackle long sequence inputs (Ainslie et al., 2020; Zaheer et al., 2020). Longformer (Beltagy et al., 2020) uses a combination of local and global attention mechanisms to save on computational complexity and enables the processing of up to 4096 tokens. A number of other works (Dai et al., 2022; Mamakas et al., 2022) show that transformer-based architectures that can capture longer text boast major benefits, even more so when augmented with strategies like sparse attention and hierarchical networks. This again underlines an important direction for verbose legal datasets.
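To make the hierarchical strategy concrete, the sketch below splits a long provision into BERT-sized segments, encodes each segment independently, and pools the segment embeddings. It is a minimal illustration under our own assumptions (base encoder, segment size, and mean-pooling aggregator), not the implementation of any of the cited models:

```python
# Minimal sketch of hierarchical encoding for long documents (illustrative only;
# the base encoder, segment size, and mean-pooling aggregator are assumptions,
# not the exact setup of the cited works).
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

def encode_long_document(text: str, seg_len: int = 510) -> torch.Tensor:
    # Tokenize without truncation, then cut the ids into segments that
    # respect BERT's 512-token limit once [CLS] and [SEP] are added.
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    cls_id, sep_id = tokenizer.cls_token_id, tokenizer.sep_token_id
    seg_embeddings = []
    for i in range(0, len(ids), seg_len):
        seg = [cls_id] + ids[i:i + seg_len] + [sep_id]
        with torch.no_grad():
            out = encoder(input_ids=torch.tensor([seg]))
        seg_embeddings.append(out.last_hidden_state[:, 0])  # [CLS] per segment
    # A small stacked transformer would normally aggregate the segment
    # embeddings; mean pooling is a simple stand-in to keep the sketch short.
    return torch.cat(seg_embeddings, dim=0).mean(dim=0)
```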
Our contributions can be summarized as follows:

• We conduct experiments to compare and analyze the zero-shot performance of three general LLMs to that of state-of-the-art in-domain models on the LEDGAR subset of LexGLUE (Chalkidis et al., 2022). We analyze our results, quantify whether LLMs conform to expected advantages, and provide insights for further research.
• We provide an overview of the most recent LLM research, the benchmarks and datasets developed for legal NLP, the challenges faced when applying them to legal tasks, and popular approaches that solve them. We believe this to be a useful primer for anyone looking to get a bird's eye view of the field.

2 Related Work

In this section, we outline the relevant research on LLMs, efforts in using them for legal domain tasks, and finally the benchmarks and datasets.

2.1 Large Language Models

OpenAI GPT: GPT (Generative Pre-trained Transformer) (Radford et al., 2019; Brown et al., 2020) and the popular ChatGPT variant developed by OpenAI are a family of large-scale proprietary transformer-decoder models pretrained to perform generative and language modeling tasks, and allow a reasonable context length sufficient to carry out long-document processing. For instance, GPT-3.5 supports a maximum of 4096 tokens, and GPT-4 allows a stunning maximum of 32,768, ideal for data consisting of long sequences.

Google PaLM: PaLM (Pathways Language Model) (Chowdhery et al., 2022; Anil et al., 2023) is a proprietary LLM having 540 billion parameters that was trained on the Pathways architecture. Although PaLM was initially trained to handle sequence lengths of up to 2048 tokens, this was increased to 8096 in the 340-billion-parameter PaLM 2 for longer comprehension of the input.

BigScience BLOOM: BLOOM (BigScience Large Open-science Open-access Multilingual Language Model) (Scao et al., 2022) is a group of open-source multilingual LLMs, the largest having 176 billion parameters. It encompasses 46 natural and 13 programming languages, facilitating sequence lengths of up to 2048 tokens.

Meta LLaMA: LLaMA (Large Language Model Meta AI) (Touvron et al., 2023) is a collection of open-source foundation language models ranging from 7 billion to 70 billion parameters. It was pre-trained natively on 2048 input tokens, but recent research has shown that the context length of LLMs can be extended efficiently with minimal training steps (Peng et al., 2023), leading to the release of two variations of LLaMA that have an astounding context length of 64k and 128k.

TII Falcon¹: This work by the Technology Innovation Institute (TII) boasts of being the largest open-source model to date, while also ranking highest on the HuggingFace Leaderboard. It includes models with 180B, 40B, 7.5B, and 1.3B parameters (context window size of 2048) trained on TII's RefinedWeb dataset (Penedo et al., 2023).

¹ https://falconllm.tii.ae/falcon.html

2.2 LLMs on the legal domain

LexGPT: (Lee, 2023) fine-tune GPT-J models on the Pile of Law dataset (Henderson et al., 2022); the result is the best-performing LLM fine-tuned for legal use cases (a LegalLLM) at the time of writing. They experiment with generative models for legal classification tasks and observe that fine-tuning such out-of-the-box GPTs still results in low performance when compared to discriminative models. This insightfully shows the need to bridge the gap between powerful LLMs and the legal domain.
PolicyGPT: (Tang et al., 2023) demonstrate that many LLMs in zero-shot settings perform remarkably well when tasked with text classification of privacy policies. Though specific, this shows how a LegalLLM may hold promise in enhancing performance on other general tasks.

Zero-and-Few-shot GPT: (Chalkidis, 2023) conduct experiments most similar to our study. They evaluate the performance of ChatGPT on the LexGLUE benchmark in both zero-shot and few-shot settings (for the latter, examples were given in the instruction prompt, which seems to benefit the model when the number of examples and labels are around the same). They find that ChatGPT performs very well, but severely lacks in performance compared to smaller models trained on in-domain datasets.

Resonating with these findings, the work of (Savelka, 2023) investigates how an LLM (a GPT model) performs on a semantic annotation task in zero-shot settings, without being fine-tuned on legal-domain datasets. The LLM is primed with a short sentence description of each annotation label and is tasked with labeling a short span of text. They observe that while the LLM performs surprisingly well given the zero-shot setting, its performance was still far off from the model that was trained on the in-domain data. In summary, both studies highlight the potential fine-tuned LLMs can bring to the legal domain.

2.3 Datasets and Benchmarks

LexGLUE: (Chalkidis et al., 2022) present a unified evaluation framework for legal tasks to benchmark models. The datasets and tasks were curated from other sources of data considering various factors such as availability, size, difficulty, etc. They present scores for various Pre-trained Language Models (PLMs) on their benchmark. They point out interesting results that suggest that PLMs fine-tuned on general legal datasets and tasks do perform better, although PLMs fine-tuned on only one sub-domain don't improve performance on that same sub-domain. Put together, their observations point out the need for a general LegalLLM (powerful enough to outperform other models on all criteria of the benchmark).

LegalBench: (Guha et al., 2023) This benchmark comprises 162 tasks representing six distinct forms of legal reasoning and outlines an empirical evaluation of 20 LLMs. They demonstrate how LegalBench supports easing communication between legal professionals and LLM developers by using the IRAC framework in the case of American law. They observe that LLMs typically perform better on classification tasks than application-based ones. They also find that for some tasks, in-context examples are not required, or only marginally improve performance. They thus conclude that the task performance in LLMs is mostly driven by the task description used in the prompt.

Pile of Law: (Henderson et al., 2022) The surge in LLM development emphasizes the need for responsible practices in filtering out biased, explicit, copyrighted, and confidential content during pre-training. Present methodologies are ad hoc and do not account for context. To address this, Pile of Law, a growing 256GB dataset of open-source English legal and administrative data, was introduced to aid in legal tasks. This paper outlines a method for filtering legal-domain text while handling associated trade-offs. It aids in understanding government-established content filtering guidelines and illustrates various ways to learn responsible data filtering from the law.

MultiLegalPile: (Chalkidis et al., 2021) The MultiLegalPile is a substantial 689 GB dataset that spans 24 EU languages across 17 jurisdictions. It addresses the scarce availability of multilingual pre-training data in the legal domain, encompassing diverse legal data sources with varying licenses. In certain languages, monolingual models substantially outperform the base model, achieving language-specific SotA in five languages. In LexGLUE, English models secure SotA in five of seven tasks.

3 Experimental Setup and Results

In this section, we describe our experimental approach, along with specifics of our evaluations.

3.1 Dataset and Metrics

We use the LEDGAR (Tuggener et al., 2020) subset of the LexGLUE benchmark for our experiments due to its readiness to work with LLMs (for example, the other datasets have label indices alone, not the actual label names). The dataset was loaded through the HuggingFace Datasets library (Lhoest et al., 2021).
Figure 1: The frequency distributions of the 100 LEDGAR labels in the original LEDGAR test set from LexGLUE (left) and in our sampled test set of 1,000 examples (right).

In this benchmark, given a contract provision, the model is tasked with classifying it into one of 100 EDGAR theme labels. As mentioned, there is a high imbalance of data in datasets containing legal corpora. Figure 1 shows the label distribution in the LEDGAR subset benchmark. This could result in difficulties such as biased models and poor classification scores. To better report model evaluations in such settings, the F1-score is usually reported instead of accuracy. Moreover, both macro-F1 and micro-F1 scores are usually reported. For imbalanced datasets, the former more accurately reflects the classifier's performance, as the latter skews the metric towards the classes with larger proportions, which is why the macro-F1 scores are typically lower than the micro-F1 ones in these scenarios.
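As a concrete reference for these metrics, the following sketch loads the LEDGAR test split through the HuggingFace Datasets library and scores predictions with micro- and macro-averaged F1 via scikit-learn; the dataset identifiers and the placeholder predictions are assumptions for illustration, not our evaluation code:

```python
# Sketch: load LEDGAR and score predictions with micro/macro F1.
# The dataset identifiers and the dummy predictions are illustrative only.
from datasets import load_dataset
from sklearn.metrics import f1_score

ledgar_test = load_dataset("lex_glue", "ledgar", split="test")
label_names = ledgar_test.features["label"].names  # the 100 EDGAR theme labels

y_true = ledgar_test["label"]
y_pred = y_true  # placeholder: replace with model predictions (label indices)

# Micro-F1 is dominated by frequent classes; macro-F1 weights all 100 classes equally.
print("mic-F1:", f1_score(y_true, y_pred, average="micro"))
print("mac-F1:", f1_score(y_true, y_pred, average="macro"))
```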
As for the sequence lengths, (Chalkidis, 2023) report the average token length of the instruction-following examples in all the LexGLUE subsets, the highest being 3.6k tokens. This restricts LLM performance due to truncation as noted earlier, and this is also highlighted in their study: few-shot settings could not be evaluated for datasets having an average token length of more than 2k for a single example, and in many cases, the prompts were already truncated at 4k tokens (the maximum limit of ChatGPT). The average token length of the LEDGAR subset is 0.6k.
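Sequence lengths like these can be checked directly with a tokenizer. The sketch below measures LEDGAR provision lengths against BERT's 512-token limit; the choice of the bert-base-uncased tokenizer is an assumption, and counts vary slightly across tokenizers:

```python
# Sketch: measure LEDGAR provision lengths against the 512-token BERT limit.
# The tokenizer choice is illustrative; token counts differ by tokenizer.
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
provisions = load_dataset("lex_glue", "ledgar", split="test")["text"]

lengths = [len(tokenizer(t, add_special_tokens=False)["input_ids"]) for t in provisions]
print("average tokens per provision:", sum(lengths) / len(lengths))
print("share exceeding 512 tokens:", sum(l > 512 for l in lengths) / len(lengths))
```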
3.2 Setup

As baselines, we take three LLMs: ChatGPT (20b), LLaMA-2 (70b), and Falcon (180b). Since the models are very large, we use HuggingFace Chat for LLaMA and Falcon. Due to this constraint, we only evaluated on a subset of 1,000 examples. However, we made sure that the subset had a label frequency distribution close to the original dataset (see Figure 1) so that the evaluations remain generalizable.
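A label-stratified subsample of this kind can be drawn in a few lines. The sketch below is one minimal way to obtain roughly 1,000 examples whose label frequencies track the full test set; the exact procedure and seed are assumptions, not necessarily the sampling we used:

```python
# Sketch: draw ~1,000 test examples whose label distribution tracks the full
# LEDGAR test set, keeping at least one example per label. Illustrative only.
import random
from collections import defaultdict
from datasets import load_dataset

test = load_dataset("lex_glue", "ledgar", split="test")

by_label = defaultdict(list)
for idx, lab in enumerate(test["label"]):
    by_label[lab].append(idx)

random.seed(42)
frac = 1000 / len(test)
keep = []
for lab, idxs in by_label.items():
    k = max(1, round(frac * len(idxs)))  # proportional, but never drop a label entirely
    keep.extend(random.sample(idxs, k))

subset = test.select(keep)
print(len(subset), "examples sampled")
```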
We use zero-shot prompting to evaluate the above-mentioned LLMs, building on the benefits explained earlier by other works (Tang et al., 2023; Guha et al., 2023). Further, in the custom instructions (ChatGPT) and system instructions (HuggingChat), we enter the list of EDGAR theme classes that the model should choose from. In the same fashion, to ensure that the model does not generate anything out of the list, we explicitly mention this constraint as an instruction. The exact instructions that we use are provided in Appendix A.
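The constrained instruction can also be assembled programmatically from the label list. The sketch below mirrors the wording in Appendix A purely as an illustration of how the output space is restricted; we entered the instructions through the ChatGPT and HuggingChat interfaces rather than through code:

```python
# Sketch: build the constrained zero-shot instruction from the 100 EDGAR labels.
# We used the ChatGPT / HuggingChat UIs; this only illustrates the prompt structure.
from datasets import load_dataset

label_names = load_dataset("lex_glue", "ledgar", split="test").features["label"].names

system_instruction = (
    "I want you to be an EDGAR contract provision classifier. "
    "Given a contract provision, you should correctly identify the EDGAR theme. "
    "Do not give any explanations. "
    f"One answer from the following list: [{', '.join(label_names)}]. "
    "Do not give an option that is not in the list."
)

def build_prompt(provision: str) -> str:
    # The provision text follows the instruction as the user message.
    return f"{system_instruction}\n\nProvision: {provision}\nTheme:"
```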
3.3 Results and Discussion

For our experiments, we use three baseline general-purpose chat variants - ChatGPT (20b), Falcon (180b), and LLaMA-2 (70b) - and present the results in Table 1. General-purpose LLMs perform worse than smaller in-domain models. The best general LLM, Falcon-Chat, performs 19.2% mic-F1 and 26.8% mac-F1 lower than the best in-domain model, LegalBERT, which itself is much smaller than LexGPT, the current LegalLLM. Our findings echo those of (Chalkidis, 2023).

Model        mic. F1   mac. F1   # params.
Falcon-Chat  70.9      60.7      180b
LLaMA-Chat   70.4      59.6      70b
ChatGPT      70.6      58.7      20b
LexGPT       83.9      74.0      6b
LegalBERT    88.2      83.0      0.11b

Table 1: Comparison of general LLMs (first three models, tested in a zero-shot setting by us) to models fine-tuned on legal-domain datasets (last two). The current LegalLLM is LexGPT, but the much smaller LegalBERT shows state-of-the-art performance on LEDGAR.

Notably, for class labels with only one example in our sampled test set, the three chat variants surprisingly show the same results: they fail to predict them correctly, except for the Qualification label (the others being Assigns, Books, Powers, and Sanctions). Similarly, Indemnity is always misclassified as Indemnifications (three examples in total). Further, labels that are semantically similar are frequently mislabeled by the models (like Indemnity and Indemnifications as pointed out earlier). For example, (Taxes, Tax Withholdings and Withholdings) is almost always labeled as Tax Withholdings by all the models. (Jurisdictions, Submission To Jurisdiction, Consent To Jurisdiction) is almost always labeled as Submission To Jurisdiction in the case of ChatGPT and Jurisdiction in the case of Falcon and LLaMA. As for (Applicable Laws, Governing Laws, Compliance With Laws), we observe that Governing Laws was easiest to predict with an average accuracy of 90%, and Compliance With Laws with 80%, but Applicable Laws performs very poorly with 0% accuracy for LLaMA and Falcon and 20% for ChatGPT - predicting only one from a total of 5 samples correctly. However, in the case of (Payments, Fees, Interests), the models seem to predict them correctly in about 60% of the cases, with Payments appearing at least once for Fees and Interests. On average, only 95 of the 100 classes in the reference labels are present in the predictions.
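Label-wise breakdowns like these can be produced from a simple confusion count. The sketch below computes per-class accuracy and the most frequent wrong prediction for each gold label; the parallel lists of gold and predicted label names are hypothetical inputs, not our stored outputs:

```python
# Sketch: per-label accuracy and the most common confusion for each gold label.
# `gold` and `pred` are assumed to be parallel lists of label-name strings.
from collections import Counter, defaultdict

def label_report(gold: list[str], pred: list[str]) -> dict[str, dict]:
    confusions = defaultdict(Counter)
    for g, p in zip(gold, pred):
        confusions[g][p] += 1
    report = {}
    for label, counts in confusions.items():
        total = sum(counts.values())
        wrong = Counter({p: c for p, c in counts.items() if p != label})
        report[label] = {
            "accuracy": counts[label] / total,
            "support": total,
            "top_confusion": wrong.most_common(1)[0][0] if wrong else None,
        }
    return report
```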
3.4 Subjective Analysis

Our findings highlight that the perceived advantages LLMs have over BERT-based models (such as the sheer number of parameters, extended context length, and the amount of pre-training knowledge) cannot substitute for the obvious edge in-domain data gives to the much smaller models. Even when the LLM is trained on such data (LexGPT), it couldn't perform as well as the discriminative model (LegalBERT). This could be expected, as the latter is more naturally suited for the benchmark's classification tasks than generative models, which are prone to issues like hallucination. Our label-wise findings reflect this too.

However, the current legal benchmarks are limited to NLU tasks. In general, it would be ideal to have a powerful LegalLLM that can perform both generative and discriminative tasks. Our findings show that there is a unique challenge in the legal domain: if we have to build a better LegalLLM, we need to find better methods to take advantage of the in-domain legal data for LLMs, as simply fine-tuning on it doesn't seem to be enough. As the authors of LexGPT mention, reinforcement learning from human feedback could be extremely helpful in improving LexGPT, providing ways for the first LegalLLM to produce state-of-the-art results.

However, if we limit the application of legal models to NLU tasks, our findings turn optimistic. The results show that the LLMs' ability to process large context may not be necessary for classification - we hypothesize this could be because verbose legal text could turn out to have very similar semantic content, so the additional context may not be as useful as expected. This hypothesis could be echoed by findings from (Shaikh et al., 2020), who show that a careful selection of a handful of textual features in a verbose dataset is strong enough to help statistical models achieve high accuracies for binary classification.

This in fact should be good news for NLU, as it means legal practitioners can avoid having to use or train unnecessarily large or expensive models (both carbon-wise and cost-wise). Much smaller in-domain models like LegalBERT are nevertheless superior and should be used for practical applications, as suggested by (Chalkidis, 2023).

4 Conclusion

In this work, we examine three general-purpose LLMs' zero-shot performance on a multi-class contract provision classification task using the LEDGAR dataset of LexGLUE. Our study shows that these LLMs, even though not explicitly trained on legal data, can still demonstrate respectable theme classification performance but are easily overshadowed by smaller in-domain models. The results highlight the need for better LegalLLMs. In light of this, we also present a review of related datasets and models, which we hope will help readers get an overview of the field.
References

Joshua Ainslie, Santiago Ontanon, Chris Alberti, Vaclav Cvicek, Zachary Fisher, Philip Pham, Anirudh Ravula, Sumit Sanghai, Qifan Wang, and Li Yang. 2020. ETC: Encoding long and structured inputs in transformers. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 268–284, Online. Association for Computational Linguistics.

Rohan Anil, Andrew M Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, et al. 2023. PaLM 2 technical report. arXiv preprint arXiv:2305.10403.

Iz Beltagy, Matthew E Peters, and Arman Cohan. 2020. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150.

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901.

Ilias Chalkidis. 2023. ChatGPT may pass the bar exam soon, but has a long way to go for the LexGLUE benchmark. arXiv preprint arXiv:2304.12202.

Ilias Chalkidis, Manos Fergadiotis, and Ion Androutsopoulos. 2021. MultiEURLEX - a multi-lingual and multi-label legal document classification dataset for zero-shot cross-lingual transfer. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 6974–6996, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.

Ilias Chalkidis, Manos Fergadiotis, Prodromos Malakasiotis, Nikolaos Aletras, and Ion Androutsopoulos. 2020. LEGAL-BERT: The muppets straight out of law school. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 2898–2904, Online. Association for Computational Linguistics.

Ilias Chalkidis, Nicolas Garneau, Catalina Goanta, Daniel Katz, and Anders Søgaard. 2023. LeXFiles and LegalLAMA: Facilitating English multinational legal language model development. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15513–15535, Toronto, Canada. Association for Computational Linguistics.

Ilias Chalkidis, Abhik Jana, Dirk Hartung, Michael Bommarito, Ion Androutsopoulos, Daniel Katz, and Nikolaos Aletras. 2022. LexGLUE: A benchmark dataset for legal language understanding in English. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 4310–4330, Dublin, Ireland. Association for Computational Linguistics.

Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. 2022. PaLM: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311.

Xiang Dai, Ilias Chalkidis, Sune Darkner, and Desmond Elliott. 2022. Revisiting transformer-based models for long document classification. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 7212–7230, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.

Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc Le, and Ruslan Salakhutdinov. 2019. Transformer-XL: Attentive language models beyond a fixed-length context. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2978–2988, Florence, Italy. Association for Computational Linguistics.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

SiYu Ding, Junyuan Shang, Shuohuan Wang, Yu Sun, Hao Tian, Hua Wu, and Haifeng Wang. 2021. ERNIE-Doc: A retrospective long-document modeling transformer. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 2914–2927, Online. Association for Computational Linguistics.

Neel Guha, Julian Nyarko, Daniel E Ho, Christopher Ré, Adam Chilton, Aditya Narayana, Alex Chohlas-Wood, Austin Peters, Brandon Waldon, Daniel N Rockmore, et al. 2023. LegalBench: A collaboratively built benchmark for measuring legal reasoning in large language models. arXiv preprint arXiv:2308.11462.

Peter Henderson, Mark Krass, Lucia Zheng, Neel Guha, Christopher D Manning, Dan Jurafsky, and Daniel Ho. 2022. Pile of Law: Learning responsible data filtering from the law and a 256GB open-source legal dataset. Advances in Neural Information Processing Systems, 35:29217–29234.

Jieh-Sheng Lee. 2023. LexGPT 0.1: Pre-trained GPT-J models with Pile of Law. arXiv preprint arXiv:2306.05431.

Quentin Lhoest, Albert Villanova del Moral, Yacine Jernite, Abhishek Thakur, Patrick von Platen, Suraj Patil, Julien Chaumond, Mariama Drame, Julien Plu, Lewis Tunstall, et al. 2021. Datasets: A community library for natural language processing. arXiv preprint arXiv:2109.02846.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Dimitris Mamakas, Petros Tsotsi, Ion Androutsopoulos, and Ilias Chalkidis. 2022. Processing long legal documents with pre-trained transformers: Modding LegalBERT and Longformer. In Proceedings of the Natural Legal Language Processing Workshop 2022, pages 130–142, Abu Dhabi, United Arab Emirates (Hybrid). Association for Computational Linguistics.

Guilherme Penedo, Quentin Malartic, Daniel Hesslow, Ruxandra Cojocaru, Alessandro Cappelli, Hamza Alobeidli, Baptiste Pannier, Ebtesam Almazrouei, and Julien Launay. 2023. The RefinedWeb dataset for Falcon LLM: Outperforming curated corpora with web data, and web data only. arXiv preprint arXiv:2306.01116.

Bowen Peng, Jeffrey Quesnelle, Honglu Fan, and Enrico Shippole. 2023. YaRN: Efficient context window extension of large language models. arXiv preprint arXiv:2309.00071.

Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners.

Jaromir Savelka. 2023. Unlocking practical applications in legal domain: Evaluation of GPT for zero-shot semantic annotation of legal texts. arXiv preprint arXiv:2305.04417.

Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilić, Daniel Hesslow, Roman Castagné, Alexandra Sasha Luccioni, François Yvon, et al. 2022. BLOOM: A 176B-parameter open-access multilingual language model. arXiv preprint arXiv:2211.05100.

Rafe Athar Shaikh, Tirath Prasad Sahu, and Veena Anand. 2020. Predicting outcomes of legal cases based on legal factors using classifiers. Procedia Computer Science, 167:2393–2402.

Zhongxiang Sun. 2023. A short survey of viewing large language models in legal aspect. arXiv preprint arXiv:2303.09136.

Chenhao Tang, Zhengliang Liu, Chong Ma, Zihao Wu, Yiwei Li, Wei Liu, Dajiang Zhu, Quanzheng Li, Xiang Li, Tianming Liu, et al. 2023. PolicyGPT: Automated analysis of privacy policies with large language models. arXiv preprint arXiv:2309.10238.

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.

Don Tuggener, Pius von Däniken, Thomas Peetz, and Mark Cieliebak. 2020. LEDGAR: A large-scale multi-label corpus for text classification of legal provisions in contracts. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 1235–1241, Marseille, France. European Language Resources Association.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in Neural Information Processing Systems, 30.

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R Salakhutdinov, and Quoc V Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. Advances in Neural Information Processing Systems, 32.

Manzil Zaheer, Guru Guruganesh, Kumar Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, et al. 2020. Big Bird: Transformers for longer sequences. Advances in Neural Information Processing Systems, 33:17283–17297.

Lucia Zheng, Neel Guha, Brandon R Anderson, Peter Henderson, and Daniel E Ho. 2021. When does pretraining help? Assessing self-supervised learning for law and the CaseHOLD dataset of 53,000+ legal holdings. In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Law, pages 159–168.

A Custom Prompt

For reproducibility, we present the prompts that we use for all our experiments. The following is the entry to the Custom Instructions setting of ChatGPT. For HuggingChat, we simply provide both instructions in the Custom System Prompt box.

What would you like ChatGPT to know about you to provide better responses? I want you to be an EDGAR contract provision classifier. Given a contract provision, you should correctly identify the EDGAR theme. Do not give any explanations.

How would you like ChatGPT to respond? One answer from the following list: [ {{paste the list here}} ]. Do not give an option that is not in the list.
