VNLawBERT: A Vietnamese Legal Answer Selection Approach Using BERT Language Model
Son Nguyen
Ho Chi Minh City University of Science
Abstract— Recently, with the development of NLP (Natural Language Processing) methods and Deep Learning, several solutions to problems in question-answering systems have achieved superior results. However, there are few solutions for question-answering systems in the Vietnamese legal domain. In this research, we propose an answer selection approach by fine-tuning the BERT language model on our Vietnamese legal question-answer pair corpus, achieving an 87% F1-score. We further pre-train the original BERT model on a Vietnamese legal domain-specific corpus and achieve a higher F1-score than the original BERT, 90.6%, on the same task, which reveals the potential of a new pre-trained language model in the legal area.

Keywords— natural language processing, question answering, answer selection, language model, legal document.

I. INTRODUCTION

Asking questions about laws is a clear need in any country, but answering them is not easy: an enormous number of laws have been enacted over the last decades, and understanding them requires knowledge of the legal domain. Building a question-answering system for the legal domain is therefore essential. It not only helps laypeople find answers to their questions based on current legal documents but also supports lawyers in their work.

A question-answering system consists of several parts; one of them is answer selection, which aims to choose the most relevant candidates among retrieved documents by measuring the relevance between a question and each retrieved document. Modern language models have proved to give a great contextual representation of words in sentences. Their impressive results on downstream tasks like sentence-pair classification suggest a promising approach to measuring relevance in the answer selection task.

BERT[1] is a language model that was pre-trained on a large general corpus and achieved state-of-the-art results in several NLP tasks. It is interesting to investigate the effect of applying BERT with a sentence-pair classification task to the answer selection problem in a Q-A system, especially in Vietnamese. Therefore, we introduce VNLawBERT, an approach to selecting relevant candidates by fine-tuning BERT on our question-answer pair dataset. Additionally, we further pre-train BERT on a legal domain-specific corpus to achieve higher performance. Our contributions are:

• We construct a training dataset and a hand-annotated testing dataset for the Vietnamese answer selection task.

• We propose a solution to the answer selection problem in the Vietnamese legal question-answering system.

• We compile a large corpus of text that represents Vietnamese legal documents.

• We achieve a higher-performance model (VNLawBERT) and evaluate and compare it with the original BERT model on the Vietnamese answer selection task.

II. RELATED WORKS

In this section, we present some existing Q-A (Question-Answering) systems, especially for Vietnamese, and answer selection methods.

Q-A systems are divided into two types: knowledge-based and retrieval-based.

A knowledge-based system typically builds a huge graph of linked entities. Dai Quoc Nguyen, Dat Quoc Nguyen, and Son Bao Pham[2] introduced an ontology-based Q-A system for the Vietnamese language; it includes a question analysis module and an answer extraction module. Their experimental results were promising: they achieved an accuracy of 95% in the Question Analysis module and 70% in the Answer Retrieval module.

On the other hand, retrieval-based systems try to retrieve relevant documents and extract the answer from those documents. Huu-Thanh Duong and Bao-Quoc Ho[3] proposed a Q-A system for Vietnamese legal documents. They applied similarity calculation to select and extract the answer from relevant documents retrieved from Lucene, achieving a precision of approximately 70% in their experiment. However, the answer selection method in this system relies on calculating tf-idf similarity scores, so it cannot capture the contextual relationship between words; our approach addresses this problem using a contextual language model like BERT.

Jamshid Mozafari, Afsaneh Fatemi, and Mohammad Ali Nematbakhsh[4] made use of the BERT language model in their proposed answer selection method. Their strong results showed that a pre-trained language model is an essential tool in NLP tasks such as answer selection.

In the evolution of NLP, traditional language models like word2vec[5][6] convert word tokens into vectors in a non-contextual way, in which a word is represented by a single vector in the vocabulary. This is not suitable in some cases; for example, the word "bank" in the sentence "My bank was robbed" and in "I am sitting at the bank of the river" has the same vector representation. Recent unsupervised pre-trained language models like BERT, ELMo[7], and XLNet[8] address this problem by contextually embedding each word token based on its surroundings. In this paper, we focus on BERT, a language model pre-trained on the general-text BooksCorpus and English Wikipedia, which significantly improves the performance of many NLP tasks such as sentence-pair classification, question answering, and language inference. Many studies have also shown that further pre-training BERT, or completely pre-training the model from scratch, on domain-specific corpora can significantly improve its performance. SciBERT[9], BioBERT[10], ClinicalBERT[11], and FinBERT[12] are examples; they all performed better than the original BERT on domain-specific tasks.

To the best of our knowledge, our work is the first to propose an answer selection approach using the BERT language model for the Vietnamese legal Q-A system.

III. BACKGROUND AND DATASETS

A. Background & Problem statement

1) Vietnamese legal documents: In this section, we present an overview of Vietnamese legal documents and their structure. Legal documents in Vietnam are divided into the following categories:

• Constitution
• Code (of Law)
• Ordinance
• Order
• Resolution
• Joint Resolution
• Decree
• Decision
• Circular
• Joint Circular
• Directive

The content of a legal document has different levels: chapter, section, article, paragraph, and point. Each legal document has its own validity; when the government enacts an update of a certain document, the existing one becomes expired or partially expired. Lawyers give their advice based on articles of valid or partially valid documents: they typically quote some articles from legal documents and conclude an answer to the question or situation.

2) Question-Answering system and Problem Statement: In this research, we focus on the retrieval-based question-answering system, whose architecture consists of four parts: Question Processing, Document Retrieval, Answer Selection, and Answer Extraction. The question processing part detects the question's type and generates a query from the question. The document retrieval part takes that query and retrieves relevant documents. Those documents are then evaluated by the answer selection part to pick the most relevant candidates. Finally, the answer extraction part processes the candidates to find the exact answer to the question. Fig. 1 shows the general architecture of a retrieval-based Q-A system.

Fig. 1 The general architecture of a retrieval-based question-answering system.

To address the answer selection problem, we aim to select relevant candidates among the retrieved documents; a retrieved document can be an article, passage, or sentence of a legal document. To this end, we formulate a classification task whose input consists of two sequences representing a question and a retrieved document, and whose output is a confirmation of whether the document is a candidate answer for the question or not.

B. Question-Answering Classification Dataset

To build the dataset for our answer selection task, we need real-world legal questions and relevant answers. We chose Thu Ky Luat's website[13] to extract the data; Thu Ky Luat is a law consulting company that provides lawyers' advice on users' questions. We ran an extractor using Scrapy[14] and acquired about 250,000 question-answer pairs across the 27 domains shown in Table I.

Table I. QUESTION-ANSWER DOMAINS LIST

No. | Name of domain | English name of domain
1 | Doanh nghiệp | Enterprise
2 | Đầu tư | Investment
3 | Thương mại | Commerce
4 | Xuất nhập khẩu | Import and export
5 | Tiền tệ-Ngân hàng | Monetary-Banking
6 | Thuế-Phí-Lệ Phí | Tax-Fee-Charge
7 | Chứng khoán | Stock
8 | Bảo hiểm | Insurance
9 | Kế toán-Kiểm toán | Accounting-Auditing
10 | Lao động-Tiền lương | Labor-Salary
11 | Bất động sản | Real estate
12 | Dịch vụ pháp lý | Legal service
13 | Sở hữu trí tuệ | Intellectual property
14 | Bộ máy hành chính | Bureaucracy
15 | Vi phạm hành chính | Administrative violation
16 | Trách nhiệm hình sự | Criminal responsibility
17 | Thủ tục Tố tụng | Procedures
18 | Tài chính nhà nước | State finance
19 | Xây dựng-Đô thị | Construction-Urban
20 | Giáo dục | Education
21 | Tài nguyên-Môi trường | Resources-Environment
22 | Thể thao-Y tế | Sports-Health
23 | Quyền dân sự | Civil rights
24 | Văn hóa-Xã hội | Sociocultural
25 | Công nghệ thông tin | Information technology
26 | Giao thông-Vận tải | Transportation
27 | Lĩnh vực khác | Other domains
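The sentence-pair framing described above, a question paired with a retrieved document and labeled 1 (candidate) or 0 (non-candidate), can be sketched in a few lines of Python. This is only a toy illustration: the data below is made up, and whitespace splitting stands in for BERT's WordPiece tokenizer.

```python
# Toy sketch: turn question-answer pairs into labeled sentence-pair
# examples for answer selection. Whitespace "tokens" approximate the
# real WordPiece tokenization; the data is invented for illustration.

MAX_TOKENS = 512  # BERT-Base maximum sequence length

def make_example(question: str, document: str, label: int) -> dict:
    """Pair a question with a retrieved document; truncate the document
    so that question + document fit within MAX_TOKENS tokens."""
    q_tokens = question.split()
    d_tokens = document.split()
    budget = MAX_TOKENS - len(q_tokens) - 3  # room for [CLS] and 2x [SEP]
    return {
        "text_a": question,
        "text_b": " ".join(d_tokens[:budget]),
        "label": label,  # 1 = candidate answer, 0 = non-candidate
    }

qa_pairs = [
    ("What is a computer virus?",
     "A computer virus is a program capable of spreading itself..."),
    ("What does HIV positive mean?",
     "HIV positive means a confirmed test result of infection..."),
]

examples = []
for i, (question, answer) in enumerate(qa_pairs):
    examples.append(make_example(question, answer, 1))  # true answer
    for j, (_, other_answer) in enumerate(qa_pairs):
        if j != i:  # answers belonging to other questions
            examples.append(make_example(question, other_answer, 0))

print(len(examples))  # 4 examples: 2 positive, 2 negative
```

In the paper's actual datasets, negatives are not arbitrary: training non-candidates are drawn by Elasticsearch as similar-content sequences, and testing non-candidates are hand-checked.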
Those questions and answers are used to create the training and testing datasets for our answer selection task. For each question, we pair it with the correct answer as a candidate (labeled "1") and with other answers as non-candidates (labeled "0"). The total number of tokens in a question and a candidate/non-candidate is at most 512.

The training dataset is generated automatically. For non-candidate examples, we use Elasticsearch[15] to find sequences from our question-answer data whose content is similar to the candidate. In this dataset, each question has one candidate and two non-candidates.

To make the evaluation accurate, the testing dataset is handpicked by us to make sure non-candidates are correctly labeled. In our testing dataset, each question has one candidate and four non-candidates.

The sizes of the training and testing datasets are described in Table II. An example from our datasets is shown below:

1) Candidate example

• Question: Vi rút máy tính là gì? (What is a computer virus?)

• Candidate: Căn cứ pháp lý: Điều 4 Luật Công nghệ thông tin 2006 Vi rút máy tính là chương trình máy tính có khả năng lây lan, gây ra hoạt động không bình thường cho thiết bị số hoặc sao chép, sửa đổi, xóa bỏ thông tin lưu trữ trong thiết bị số. (Legal basis: Article 4 of the Law on Information Technology 2006. Computer virus means a computer program capable of spreading, causing abnormal operation of digital equipment, or copying, modifying, or deleting information stored in digital equipment.)

• Label: 1

2) Non-candidate example

• Question: Vi rút máy tính là gì? (What is a computer virus?)

• Non-candidate: Căn cứ pháp lý: Điều 2 Luật phòng, chống nhiễm vi rút gây ra hội chứng suy giảm miễn dịch mắc phải ở người (HIV/AIDS) 2006 HIV dương tính là kết quả xét nghiệm mẫu máu, mẫu dịch sinh học của cơ thể người đã được xác định nhiễm HIV. (Legal basis: Article 2 of the Law on Prevention and Control of HIV/AIDS 2006. HIV positive means the test result of a human blood sample or biological fluid sample confirmed to be infected with HIV.)

• Label: 0

Table II. QUESTION-ANSWER DATASET SIZE

| Number of questions | Number of domains | Number of examples | Disk size
Training | 68,174 | 27 | 204,522 | 266 MB
Testing | 350 | 27 | 1,750 | 2.8 MB

C. Pre-train Dataset

We also prepare a dataset used to further pre-train BERT in order to give the model more legal domain-specific knowledge. Legal documents should be accurate and come from a trustworthy source; to this end, we chose the Vietnam Legal Documents National Database's website[16] to extract legal documents. We obtained 23,254 valid or partially valid legal documents and created a 320 MB cased dataset.

IV. METHODOLOGY

In this section, we describe the methods applied in fine-tuning BERT on our answer selection task and in further pre-training it using the datasets from Section III. We use a Colab Pro instance along with a Cloud TPU from Google to run our experiments.

A. Fine-tuning BERT for the answer selection task

The answer selection task's goal is to decide whether a sequence is, or contains, the answer to the question. We build a sentence-pair classifier that uses BERT as the initial checkpoint and fine-tune it on our dataset.

We use the last checkpoint of BERT-Base, Multilingual Cased. We set a maximum sequence length of 512 tokens, a batch size of 128, and a learning rate of 2e-5, and train for three epochs.

We found small randomness in the results, so we run the process three times with the same model configuration and report the average result.

B. Pre-training BERT with legal data

We further pre-trained BERT with our legal documents dataset to give the model more knowledge of the Vietnamese legal domain. Starting from the last checkpoint of Multilingual Cased BERT-Base, we use BERT's original MLM (Masked Language Model) and NSP (Next Sentence Prediction) tasks to further pre-train the model on our legal documents dataset.

We set a maximum sequence length of 512 tokens, a batch size of 128, and a learning rate of 2e-5 (recommended by the BERT documentation), and trained for 20 epochs (181,765 steps) over 2 days to obtain a new pre-trained model called VNLawBERT. We then compare the results of our new model with BERT-Base on our answer selection task.

V. EXPERIMENTS AND RESULTS

We present the results of the models in this section. We use precision, recall, and F1-score as our metrics; all metrics are calculated on the positive class.

A. Results

The results in Table III indicate that BERT can classify candidates and non-candidates quite well, with an F1-score of about 87%.

However, Table III also shows that the legal domain-specific model VNLawBERT performs better than BERT-Base on all metrics, improving the F1-score by 3.5% over BERT-Base. This is because the context of words in legal documents differs considerably from the context in a general-domain corpus like Wikipedia; the model needs a further pre-training process to understand the contexts of legal problems.

Table III. PERFORMANCE OF FINE-TUNED BERT-Base AND VNLawBERT

| Precision | Recall | F1-Score
BERT-Base | 0.804 | 0.952 | 0.872
VNLawBERT | 0.860 | 0.958 | 0.906
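The positive-class precision, recall, and F1-score used as metrics here can be computed as in the following minimal sketch; the predictions below are toy values, not the paper's actual outputs.

```python
# Minimal sketch: precision, recall, and F1 on the positive class
# (label 1 = candidate), as used to evaluate the answer selector.
# The gold labels and predictions below are invented toy values.

def positive_metrics(gold, pred):
    """Precision, recall, and F1 where label 1 is the positive class."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Each test question has 1 candidate (1) and 4 non-candidates (0).
gold = [1, 0, 0, 0, 0, 1, 0, 0, 0, 0]
pred = [1, 0, 1, 0, 0, 1, 0, 0, 0, 0]

p, r, f1 = positive_metrics(gold, pred)
print(round(p, 3), round(r, 3), round(f1, 3))  # -> 0.667 1.0 0.8
```

Computing the metrics on the positive class only is the natural choice here: with one candidate and four non-candidates per test question, always predicting 0 would score high overall accuracy yet zero positive recall.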
B. Additional experiments

We also fine-tune VNLawBERT models on different training datasets to examine the effect of the training size and of the questions' domains on the results.

1) Effect of the question's domain: Since our testing dataset consists of questions from several domains, we hypothesize that training on a multi-domain dataset makes the model perform better than training on a dataset covering only one or two domains. We test this hypothesis in this section.

We fine-tune a model on a training dataset consisting only of examples from specific domains. In this case, we use the "Thủ tục tố tụng" (Procedures) and "Thuế-Phí-Lệ Phí" (Tax-Fee-Charge) domains, since they contribute the fewest examples to our testing dataset; this yields 40,000 examples. We fine-tune another model on a dataset of the same size constructed from questions in all domains. The performance on our answer selection task in Table IV shows that the model with multi-domain knowledge performs better.

Table IV. PERFORMANCE OF 2-DOMAINS VNLawBERT AND MULTI-DOMAINS VNLawBERT

| Precision | Recall | F1-Score
2-domains VNLawBERT | 0.709 | 0.960 | 0.816
VNLawBERT | 0.749 | 0.960 | 0.841

2) Effect of the training size: In this experiment, we evaluate the model with different dataset sizes to explore the number of examples needed to train the model sufficiently. We use the methods described in Section III to build three datasets of different sizes: 20%, 52%, and 100% of our training dataset, respectively.

The results shown in Table V indicate that the more examples we have, the more accurate the model is. In our experience, using more than 204,000 examples does not improve the performance of the model; therefore, a training dataset of 204,000 examples (8,000 examples from each domain) is sufficient for the model to perform at its best.

Table V. PERFORMANCE OF VNLawBERT WITH DIFFERENT TRAINING SIZES

| % of our training dataset | Precision | Recall | F1-Score
40,500 examples | 20 | 0.749 | 0.960 | 0.841
107,000 examples | 52 | 0.818 | 0.976 | 0.890
204,000 examples | 100 | 0.860 | 0.958 | 0.906

VI. CONCLUSION

In this paper, we address the answer selection problem by fine-tuning the BERT language model on our question-answer dataset. We also reveal the potential of a new domain-specific model for the legal area, since our VNLawBERT model outperforms the original BERT model on our answer selection task. With this research, we hope researchers will experiment with the model on other tasks in the legal domain, such as Named Entity Recognition, Reference Extraction, and Question Classification, to build a legal domain-specific language model, which is also our future work.

ACKNOWLEDGMENT

This research is funded by the University of Science, VNU-HCM under grant number CNTT 2020-14.

REFERENCES

[1] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova, "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding", in Proceedings of NAACL, pages 4171-4186, 2019.
[2] Dai Quoc Nguyen, Dat Quoc Nguyen, Son Bao Pham, "A Vietnamese Question Answering System", International Conference on Knowledge and Systems Engineering, 2009.
[3] Huu-Thanh Duong, Bao-Quoc Ho, "A Vietnamese Question Answering System in Vietnam's Legal Documents", IFIP International Conference on Computer Information Systems and Industrial Management, 2016.
[4] Jamshid Mozafari, Afsaneh Fatemi, Mohammad Ali Nematbakhsh, "BAS: An Answer Selection Method Using BERT Language Model", 2019.
[5] Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean, "Efficient Estimation of Word Representations in Vector Space", Proceedings of the International Conference on Learning Representations (ICLR 2013), 2013.
[6] Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean, "Distributed Representations of Words and Phrases and their Compositionality", Advances in Neural Information Processing Systems 26, 2013.
[7] Matthew Peters et al., "Deep Contextualized Word Representations", in Proceedings of NAACL, 2018.
[8] Zhilin Yang et al., "XLNet: Generalized Autoregressive Pretraining for Language Understanding", Conference on Neural Information Processing Systems (NeurIPS 2019), 2019.
[9] Iz Beltagy, Kyle Lo, Arman Cohan, "SciBERT: A Pretrained Language Model for Scientific Text", Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, 2019.
[10] Jinhyuk Lee et al., "BioBERT: a pre-trained biomedical language representation model for biomedical text mining", Bioinformatics, Volume 36, Issue 4, pages 1234-1240, 2020.
[11] Kexin Huang, Jaan Altosaar, Rajesh Ranganath, "ClinicalBERT: Modeling Clinical Notes and Predicting Hospital Readmission", ACM Conference on Health, Inference, and Learning, 2020.
[12] Yi Yang, Mark Christopher Siy UY, Allen Huang, "FinBERT: A Pretrained Language Model for Financial Communications", 2020.
[13] Thu Ky Luat's website [Online] https://nganhangphapluat.thukyluat.vn/
[14] Scrapy [Online] https://scrapy.org/
[15] Elasticsearch [Online] https://www.elastic.co/
[16] Vietnam Legal Documents National Database's website [Online] http://vbpl.vn/pages/portal.aspx