
On the Vietnamese Name Entity Recognition: A Deep Learning Method Approach


Ngoc C. Lê
School of Applied Mathematics and Informatics, Hanoi University of Science and Technology
Institute of Mathematics, Vietnam Academy of Science and Technology
[email protected]

Ngoc-Yen Nguyen
School of Applied Mathematics and Informatics, Hanoi University of Science and Technology
[email protected]

Anh-Duong Trinh
iCOMM Media & Tech, Jsc
[email protected]

arXiv:1912.01109v1 [cs.CL] 18 Nov 2019

Abstract—Named entity recognition (NER) plays an important role in text-based information retrieval. In this paper, we combine Bidirectional Long Short-Term Memory (Bi-LSTM) [7], [27] with Conditional Random Fields (CRF) [9] to create a novel deep learning model for the NER problem. Each word given as input to the deep learning model is represented by a Word2vec-trained vector. The word embedding set was trained on about one million articles collected in 2018 from a Vietnamese news portal (baomoi.com). In addition, we concatenate the Word2Vec [18]-trained vector with a semantic feature vector (Part-Of-Speech (POS) tag, chunk tag) and a hidden syntactic feature vector (extracted by a Bi-LSTM network) to achieve the best result so far for a Vietnamese NER system. The evaluation was conducted on the data set of the VLSP 2016 (Vietnamese Language and Speech Processing 2016 [29]) competition.

Index Terms—Vietnamese, Named Entity Recognition, Long Short-Term Memory, Conditional Random Field, Word Embedding

I. INTRODUCTION

Named-entity recognition (NER) (also known as entity identification, entity chunking and entity extraction) is a subtask of information extraction that seeks to locate and classify named entity mentions in unstructured text into pre-defined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, percentages, etc. It is a fundamental NLP research problem that has been studied for years. It is also considered one of the most basic and important tasks within larger problems such as information extraction, question answering, entity linking, or machine translation. Recently, there have been many novel ideas for the NER task, such as Cross-View Training (CVT) [4], a semi-supervised learning algorithm that improves the representations of a Bi-LSTM sentence encoder using a mix of labeled and unlabeled data, deep contextualized word representations [24], and contextual string embeddings, a recent type of contextualized word embedding that was shown to yield state-of-the-art results [1], [2]. These studies have established new state-of-the-art F1 scores on the NER task.

For the Vietnamese language, NER systems at VLSP 2016 adopted conventional feature-based sequence labeling models such as recurrent neural networks (RNNs), Bidirectional Long Short-Term Memory (Bi-LSTM) [25], Maximum-Entropy Markov Models (MEMMs) [14], [21], and Conditional Random Fields (CRFs) [11], [22]. For the VLSP 2016 data set, the first Vietnamese NER system applied MEMMs with specific features [25]. However, these systems did not achieve accuracy far beyond that of classical machine learning methods. Most of the above models depend heavily on specific resources and hand-crafted features, which makes it difficult to apply them to new domains and other tasks.

In [19], [20], the authors used word identity, word shapes, part-of-speech tags, and chunking tags as hand-crafted features for a CRF to label entity tags [23]. Over the past few years, many deep learning models have been proposed to overcome these limitations. Some NER models have used LSTM and CRF layers to predict named entities [8], [12]. In addition, the benefits of combining word-level and character-level representations with CNN and CRF are presented in [17], [28].

In this study, we introduce a deep neural network for Vietnamese NER that automatically extracts morphological features through a character-level Bi-LSTM network, combined with POS-tag and chunk-tag features. The model includes two Bi-LSTM hidden layers and a CRF output layer. For the Vietnamese language, we use the data set from the 2016 VLSP contest. The results show that our model outperforms the best previous systems for Vietnamese NER [23], with an F1 score of 95.61% on the test set.

The remainder of this paper is structured as follows. Section II reviews related work on named entity recognition. Section III describes the implementation method. Section IV gives experimental results and discussions. Finally, the conclusion is given in Section V.

II. RELATED WORK

The approaches to the NER task can be divided into two lines: (1) statistical learning approaches and (2) deep learning methods.

In the first line, the authors used traditional labeling models such as CRFs, hidden Markov models, support vector machines, and maximum entropy models, which are heavily dependent on hand-crafted features. Sentences are expressed in the form of a set of features such as word, POS, and chunk; these features are then put into a linear model for labeling. Some examples following this line are [6], [13], [15], [16]. These models were shown to work quite well for low-resource languages such as Vietnamese. However, such NER systems rely heavily on the chosen feature set and on hand-crafted features, which are expensive to construct and difficult to reuse [23].

In the second line, the appearance of deep learning models with superior computational performance has improved the accuracy of the NER task. The performance of deep learning models has also been shown to be much better than that of statistics-based methods. In particular, convolutional neural networks (CNNs) [30], recurrent neural networks (RNNs), and LSTM networks are in popular use; they can exploit syntactic features through character embeddings combined with word embeddings [26], [28]. Other information such as POS tags and chunk tags is also used to provide more semantic information [3], [20], [25]. The word vectors are combined in different ways and then fed into a Bi-LSTM network with a CRF at the output. For Vietnamese, there are many NER systems using LSTM networks. In [25], the authors introduced a model that uses two Bi-LSTM layers with softmax layers at the output and syntax-specific input vectors, reaching an F1 score of 92.05%. A model using a single Bi-LSTM layer combined with a CRF at the output, achieving an F1 score of 83.25%, was given in [22]. A number of high-precision models were introduced in [3], with a Bi-LSTM-CRF model whose input is a vector extracted from word characters reaching an F1 score of 94.88%. Most recently, a combination of a Bi-LSTM, an attention layer, and a CRF with an F1 score of 95.33% was given in [23].

III. METHODOLOGY

A. Feature engineering

Word embedding. To build the word embedding set, we use the skip-gram neural network model, trained on one million articles collected in 2018 from a Vietnamese news portal (baomoi.com). Skip-gram is used to predict the context words for a given target word. We choose a sliding window of size 2; therefore, there are normally four context words corresponding to a target word. For words that were not seen during training, a vector called the unknown (UNK) embedding is used instead. The UNK embedding is created as a random vector whose components are sampled uniformly from the range [−√(3/dim), +√(3/dim)], where the dimension (dim) is the dimension of the word embeddings [3].

To improve the performance of our system, we also use semantic features to vectorize words for the model (part-of-speech tags and chunk tags). The POS-tag vectors and chunk-tag vectors are represented as one-hot vectors.
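As a concrete illustration of this lookup step, the following sketch (illustrative only; the variable names and the toy vocabulary are ours, not the authors' code) maps tokens to pre-trained vectors and falls back to a shared UNK vector drawn uniformly from [−√(3/dim), +√(3/dim)]:

```python
import numpy as np

def build_lookup(word2vec, dim=300, seed=0):
    """Map tokens to embeddings; unseen tokens share one random UNK vector.

    word2vec : dict of word -> 1-D numpy array of length dim (pre-trained).
    The UNK vector is sampled uniformly from [-sqrt(3/dim), +sqrt(3/dim)].
    """
    rng = np.random.default_rng(seed)
    bound = np.sqrt(3.0 / dim)
    unk = rng.uniform(-bound, bound, size=dim)
    return lambda token: word2vec.get(token, unk)

# Toy usage with a two-word vocabulary (real vectors would come from word2vec).
w2v = {"Hà_Nội": np.ones(300), "Việt_Nam": np.zeros(300)}
embed = build_lookup(w2v)
sentence = ["Hà_Nội", "là", "thủ_đô"]                # the last two tokens map to UNK
matrix = np.stack([embed(tok) for tok in sentence])   # shape (3, 300)
```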
1) Character Embedding: Recently, the automatic extraction of hidden features with neural networks (LSTM, CNN) has been used in many works and has proved effective for the NER task [10]. In this research, we use a Bi-LSTM network to extract hidden patterns that characterize the syntax of words, as shown in Fig. 1.

Fig. 1. Character-level Embedding
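To make the character-level input concrete, the sketch below (a simplified assumption about preprocessing; the character inventory and padding length are hypothetical and not taken from the paper) turns each word into a fixed-length sequence of character ids that a character Bi-LSTM can consume:

```python
import numpy as np

# Hypothetical character inventory; in practice it would be collected from the corpus.
CHARS = list("aáàăâbcdđeéèêghiíìklmnoóòôơpqrstuúùưvxy")
CHAR2ID = {c: i + 1 for i, c in enumerate(CHARS)}   # id 0 is reserved for padding
UNK_CHAR = len(CHAR2ID) + 1                          # id for characters outside the inventory

def encode_word(word, max_len=16):
    """Return a fixed-length (zero-padded) vector of character ids for one word."""
    ids = [CHAR2ID.get(c, UNK_CHAR) for c in word.lower()[:max_len]]
    return np.array(ids + [0] * (max_len - len(ids)), dtype=np.int64)

print(encode_word("Việt"))   # 16 ids: the word's characters followed by padding zeros
```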
B. LSTM

LSTM networks are a type of recurrent neural network (RNN) that uses special units in addition to standard units. LSTM units contain a memory cell that can maintain information in memory for controlled periods of time. A cell in the LSTM network consists of three control gates: the forget gate (determining which information is ignored and which is retained), the update gate (deciding how much of the memorized information is added to the current state) and the output gate (deciding which part of the current cell makes it to the output). At time t, the cell updates are given as follows:

f_t = σ(W_f h_{t-1} + U_f x_t + b_f)       (1)
i_t = σ(W_i h_{t-1} + U_i x_t + b_i)       (2)
o_t = σ(W_o h_{t-1} + U_o x_t + b_o)       (3)
c̃_t = tanh(W_c h_{t-1} + U_c x_t + b_c)    (4)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t            (5)
h_t = o_t ⊙ tanh(c_t),                      (6)

where σ is the sigmoid function and ⊙ is a pointwise operator (elementwise multiplication, addition, or the tanh function), x_t is the input vector at time t, and h_t is the hidden state vector that holds information from the beginning up to the present time. The gates f, i, o and the vector c̃ are the forget gate, input gate, output gate and cell candidate vector, respectively. The matrices U_f, U_i, U_o, U_c are weight matrices that connect the input and the gates; the matrices W_f, W_i, W_o, W_c are weight matrices that connect the gates and the hidden state.

C. Bi-LSTM

In a sequence labeling task, the context of a word is represented more effectively by its companion words, i.e. the words to its left and right in the sentence. To capture this information, a candidate model is the bidirectional LSTM network, as shown in Fig. 2. The output of the Bi-LSTM network is obtained by concatenating its left and right context representations.

Fig. 2. Bidirectional - LSTM
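The cell update in Eqs. (1)-(6), with ⊙ read as elementwise multiplication, and the bidirectional reading used for the character feature can be sketched in plain numpy as follows (weight shapes and the random initialization are illustrative, not the trained parameters):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM cell update following Eqs. (1)-(6)."""
    f = sigmoid(W["f"] @ h_prev + U["f"] @ x_t + b["f"])        # forget gate, Eq. (1)
    i = sigmoid(W["i"] @ h_prev + U["i"] @ x_t + b["i"])        # input gate,  Eq. (2)
    o = sigmoid(W["o"] @ h_prev + U["o"] @ x_t + b["o"])        # output gate, Eq. (3)
    c_tilde = np.tanh(W["c"] @ h_prev + U["c"] @ x_t + b["c"])  # candidate,   Eq. (4)
    c_t = f * c_prev + i * c_tilde                               # cell state,  Eq. (5)
    h_t = o * np.tanh(c_t)                                       # hidden state, Eq. (6)
    return h_t, c_t

def init_params(input_dim, hidden, rng):
    """Random illustrative parameters for one LSTM direction."""
    W = {g: rng.normal(0.0, 0.1, (hidden, hidden)) for g in "fioc"}
    U = {g: rng.normal(0.0, 0.1, (hidden, input_dim)) for g in "fioc"}
    b = {g: np.zeros(hidden) for g in "fioc"}
    return W, U, b

def run_lstm(xs, hidden, params):
    """Run the cell over a sequence and return the final hidden state."""
    W, U, b = params
    h, c = np.zeros(hidden), np.zeros(hidden)
    for x_t in xs:
        h, c = lstm_step(x_t, h, c, W, U, b)
    return h

def bi_lstm_feature(xs, hidden, fw_params, bw_params):
    """Concatenate final forward and backward states (e.g. 30 + 30 = 60 dims)."""
    return np.concatenate([run_lstm(xs, hidden, fw_params),
                           run_lstm(xs[::-1], hidden, bw_params)])

rng = np.random.default_rng(0)
char_vecs = [rng.normal(size=25) for _ in range(6)]     # 6 characters, 25-dim embeddings
fw, bw = init_params(25, 30, rng), init_params(25, 30, rng)
print(bi_lstm_feature(char_vecs, 30, fw, bw).shape)      # (60,)
```

A framework implementation would of course vectorize over batches and sentences; this sketch is only meant to mirror the equations.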



D. Conditional Random Fields (CRFs)

For many sequence labeling tasks, a simple but effective approach is to consider the correlation between neighboring labels and to decode the best label sequence jointly. The conditional random field is essentially a probabilistic model which can predict labels with predefined structures. Instead of decoding labels independently, CRFs learn label sequences from the training data and decode the output labels at the same time. In a CRF, given a word sequence x = (x_1, x_2, ..., x_m), the conditional probability of a tag sequence y = (y_1, y_2, ..., y_m) is defined as in [19]:

P(y|x) = exp(w · F(y, x)) / Σ_{y'∈Y} exp(w · F(y', x)),      (7)

where w is the parameter vector estimated from the training data. The feature function F(y, x) ∈ ℝ^d is defined globally on an entire input sequence and an entire tag sequence. The space Y is the space of all possible tag sequences. The feature function F(y, x) is calculated by summing local feature functions:

F_j(y, x) = Σ_{i=1}^{n} f_j(y_{i-1}, y_i, x, i)      (8)
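For small examples, Eqs. (7)-(8) can be evaluated directly by enumerating every tag sequence; the sketch below does exactly that (brute force is used only for illustration, since real CRF implementations rely on dynamic programming such as the forward algorithm; the local feature function is supplied by the caller):

```python
import itertools
import numpy as np

def crf_log_prob(w, f_local, x, y, tags):
    """Log P(y | x) from Eqs. (7)-(8) by brute-force enumeration of Y.

    w       : parameter vector of shape (d,)
    f_local : callable f_local(y_prev, y_i, x, i) -> numpy array of shape (d,)
    x       : observed word sequence (list of tokens)
    y       : candidate tag sequence (same length as x)
    tags    : the tag inventory, e.g. ["B-PER", "I-PER", "B-LOC", "O"]
    """
    def score(seq):
        # w . F(seq, x), with F summed over positions as in Eq. (8)
        F = sum(f_local(seq[i - 1] if i > 0 else "<s>", seq[i], x, i)
                for i in range(len(x)))
        return float(w @ F)

    log_numerator = score(y)
    all_scores = [score(cand) for cand in itertools.product(tags, repeat=len(x))]
    log_partition = np.log(np.sum(np.exp(all_scores)))
    return log_numerator - log_partition              # log of Eq. (7)

# Tiny usage: two indicator features (current tag is "O"; previous tag equals current tag).
def f_local(prev, cur, x, i):
    return np.array([1.0 if cur == "O" else 0.0, 1.0 if prev == cur else 0.0])

w = np.array([0.5, 1.0])
print(np.exp(crf_log_prob(w, f_local, ["Hà_Nội", "là"], ["B-LOC", "O"], ["B-LOC", "O"])))
```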
E. Our Deep Learning Model

For the Vietnamese NER labeling task, we use multiple Bi-LSTM layers with a CRF layer at the top to detect the named entities in the sequence [3], as shown in Fig. 3. The architecture operates in the following sequence:

• The input of our neural network is a sequence of word representations.
• Each word representation is encoded through two Bi-LSTM layers.
• The CRF layer at the top decodes the hidden feature vectors from the previous (Bi-LSTM) layer.

Fig. 3. Our Deep Learning model
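The per-token input used by this architecture (a 300-dimensional word2vec vector, one-hot POS and chunk vectors, and a 60-dimensional character vector) can be assembled as in the following sketch; the POS and chunk inventories here are toy stand-ins, and the downstream Bi-LSTM and CRF layers are only indicated in comments:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy tag inventories; the real ones come from the VLSP 2016 annotations.
POS_TAGS = ["N", "V", "A", "P", "E"]
CHUNK_TAGS = ["B-NP", "I-NP", "B-VP", "O"]

def one_hot(tag, inventory):
    v = np.zeros(len(inventory))
    v[inventory.index(tag)] = 1.0
    return v

def token_representation(word_vec, pos, chunk, char_vec):
    """Concatenate all features of one token into a single input vector."""
    return np.concatenate([word_vec,
                           one_hot(pos, POS_TAGS),
                           one_hot(chunk, CHUNK_TAGS),
                           char_vec])

# One token: 300 + 5 + 4 + 60 = 369 dimensions with these toy inventories.
x = token_representation(rng.normal(size=300), "N", "B-NP", rng.normal(size=60))
print(x.shape)

# A sentence would be stacked into a (n_tokens, 369) matrix and passed through
# two Bi-LSTM layers and a CRF output layer, as in Fig. 3.
```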
IV. EXPERIMENTS

A. Datasets

To evaluate the model, we use the dataset from the VLSP 2016 NER task [29], with four types of entities: person (PER), location (LOC), organization (ORG) and miscellaneous (MISC). In addition, the VLSP 2016 dataset provides information about word segmentation, part-of-speech tags, and chunking tags. The number of sentences in the train set, validation set and test set is shown in Table I.

TABLE I
NUMBER OF SENTENCES

Dataset       Number of sentences
Train         14861
Validation    2000
Test          2831

The number of entities included in the train set and test set is shown in the following table:

TABLE II
NUMBER OF LABELS IN DATASET

Type                   Train    Test
Location               6245     1379
Organization           1213     274
Person                 7480     1294
Miscellaneous names    282      49
Total                  15220    2996

B. Hyper-parameters

Table III summarizes the hyper-parameters that we have chosen for our NER model. In order to have a more efficient training process, the parameters are optimized using the Nesterov-accelerated Adaptive Moment Estimation (Nadam) optimizer [5] with a batch size of 64. The word representation is the concatenation of a 300-dimensional word2vec vector (pre-trained from baomoi.com), two one-hot vectors representing the POS tag and the chunk tag, respectively, and a 60-dimensional character vector (generated by a Bi-LSTM network with a dropout rate of 0.3, as shown in Fig. 1). To prevent overfitting, we fix the dropout rate to 0.5 for both Bi-LSTM layers (as shown in Fig. 3). The NER model is trained for 40 epochs. For the first 20 epochs, the initial learning rate is set to 0.004; in the remaining epochs, it is fixed to 0.0004. The best model is obtained when the value of the loss function on the validation set is minimal.
TABLE III
THE MODEL HYPER-PARAMETERS

Hyper-parameter                     Value
Character dimension                 60
Word dimension                      300
Hidden size (char)                  30
Hidden size (word)                  64
Update function                     Nadam
Learning rate (first 20 epochs)     0.004
Learning rate (last 20 epochs)      0.0004
Dropout (character embedding)       0.3
Dropout (two Bi-LSTM layers)        0.5
Batch size                          64
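Read as a training configuration, Table III corresponds roughly to the following schedule (a schematic only; the training step itself is a placeholder, since the paper does not publish its implementation):

```python
CONFIG = {
    "char_dim": 60, "word_dim": 300,
    "char_hidden": 30, "word_hidden": 64,
    "optimizer": "Nadam", "batch_size": 64,
    "dropout_char": 0.3, "dropout_bilstm": 0.5,
    "epochs": 40,
}

def learning_rate(epoch):
    """0.004 for the first 20 epochs, 0.0004 for the remaining 20."""
    return 0.004 if epoch < 20 else 0.0004

best_val_loss = float("inf")
best_epoch = None
for epoch in range(CONFIG["epochs"]):
    lr = learning_rate(epoch)
    # ... one pass over the training batches with the Nadam optimizer would go here ...
    val_loss = 1.0 / (epoch + 1)          # placeholder validation loss
    if val_loss < best_val_loss:          # keep the checkpoint with minimal validation loss
        best_val_loss, best_epoch = val_loss, epoch

print(best_epoch, best_val_loss)
```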
C. Experimental Results

The experiment is conducted by combining all input features, including the POS-tag feature, the chunk feature and the character feature. The results are shown in Table IV.

TABLE IV
RESULTS ON VLSP 2016 TEST-SET

            Precision   Recall   F1-Score
LOC         95.43       96.95    96.18
PER         95.53       97.53    96.52
ORG         87.32       90.51    88.89
MISC        100.0       87.76    93.48
Avg/total   95.32       95.93    95.61

With the VLSP 2016 dataset, the experiment achieved state-of-the-art performance on the Vietnamese NER task with a 95.61% F1 score. Table V shows the performance of our deep learning model and of several published systems on the NER task.

TABLE V
PERFORMANCES ON VLSP 2016 DATASET

Model                       F1-Score
VNER [12]                   95.33
Feature-based CRF [10]      93.93
NNVLP [9]                   92.91
Nguyen et al. 2018 [21]     94.88
Our NER model               95.61

The general difference from the other systems in Table V is that we trained a new word embedding set with the word2vec model, as described in Section III-A. Moreover, we use two Bi-LSTM layers in order to encode word representations.

V. CONCLUSIONS

In this paper, we presented a neural network model for the Vietnamese named entity recognition task which obtains state-of-the-art performance. Experiments on recognizing Vietnamese entities in a sequence labeling setting showed the effectiveness of training a new word embedding set and of using two Bi-LSTM layers to extract hidden features from word representations. Our results outperform those of the best previous systems for Vietnamese named entity recognition.

ACKNOWLEDGMENT

The first author has also received support from the Institute of Mathematics, Vietnam Academy of Science and Technology, in 2019. This work is also supported by iCOMM Media & Tech, Jsc. We would like to thank the iCOMM R&D team for the resources and text data that we used during the training and evaluation of our model.

REFERENCES

[1] A. Akbik, D. Blythe, and R. Vollgraf, Contextual String Embeddings for Sequence Labeling, COLING 2018, 27th International Conference on Computational Linguistics, pp. 1638–1649, 2018.
[2] A. Akbik, T. Bergmann, and R. Vollgraf, Pooled Contextualized Embeddings for Named Entity Recognition, Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1, pp. 724–728, June 2019.
[3] D. N. Anh, H. N. Kiem, and V. N. Van, Neural sequence labeling for Vietnamese POS Tagging and NER, 2019 IEEE-RIVF International Conference on Computing and Communication Technologies (RIVF), March 2019.
[4] K. Clark, M-T. Luong, C. D. Manning, and Q. V. Le, Semi-Supervised Sequence Modeling with Cross-View Training, ICLR 2018, Feb. 16th 2018.
[5] T. Dozat, Incorporating Nesterov Momentum into Adam, International Conference on Learning Representations 2016 (ICLR 2016).
[6] R. Florian, A. Ittycheriah, H. Jing, and T. Zhang, Named entity recognition through classifier combination, In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003 - Volume 4, pp. 168–171, 2003.
[7] S. Hochreiter and J. Schmidhuber, "Long short-term memory", Neural Computation, Vol. 9:8, pp. 1735–1780, 1997.
[8] Z. Huang, W. Xu, and K. Yu, Bidirectional LSTM-CRF Models for Sequence Tagging, arXiv:1508.01991, Aug. 2015.
[9] J. Lafferty, A. McCallum, and F. Pereira, "Conditional random fields: Probabilistic models for segmenting and labeling sequence data", In Proceedings of the 18th International Conference on Machine Learning, pp. 282–289, 2001.
[10] G. Lample, M. Ballesteros, S. Subramanian, K. Kawakami, and C. Dyer, Neural Architectures for Named Entity Recognition, In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 260–270, June 2016.
[11] T. H. Le, T. T. T. Nguyen, T. H. Do, and X. T. Nguyen, Named Entity Recognition in Vietnamese Text, The Fourth International Workshop on Vietnamese Language and Speech Processing (VLSP 2016), 2016.
[12] T. A. Le, M. Y. Arkhipov, and M. S. Burtsev, "Application of a Hybrid Bi-LSTM-CRF Model to the Task of Russian Named Entity Recognition", In: Artificial Intelligence and Natural Language, AINL 2017, Communications in Computer and Information Science, vol. 789, pp. 91–103, 2017.
[13] P. Le-Hong, A. Roussanaly, T. M. H. Nguyen, and M. Rossignol, An empirical study of maximum entropy approach for part-of-speech tagging of Vietnamese texts, Traitement Automatique des Langues Naturelles - TALN 2010, ATALA (Association pour le Traitement Automatique des Langues), Montréal, Canada, pp. 12–23, Jul 2010.
[14] P. Le-Hong, Vietnamese Named Entity Recognition using Token Regular Expressions and Bidirectional Inference, In The Fourth International Workshop on Vietnamese Language and Speech Processing (VLSP 2016), 2016.
[15] D. Lin and X. Wu, Phrase clustering for discriminative learning, ACL '09 Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2, pp. 1030–1038, 2009.
[16] G. Luo, X. Huang, C-Y. Lin, and Z. Nie, Joint Named Entity Recognition and Disambiguation, Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 879–888, September 2015.
[17] X. Ma and E. Hovy, End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF, Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, pp. 1064–1074, March 2016.
[18] T. Mikolov, K. Chen, G. Corrado, and J. D. Tomas, ”Efficient Estimation
of Word Representations in Vector Space”, arXiv:1301.3781, 2013.
[19] P. Q. N. Minh, A Feature-Rich Vietnamese Named-Entity Recognition
Model, arXiv:1803.04375, 12 Mar 2018.
[20] P. Q. N. Minh, A Feature-Based Model for Nested Named Entity
Recognition at VLSP-2018 NER Evaluation Campaign, In Proceedings
of Vietnamese Speech and Language Processing (VLSP), 2018.
[21] T. C. V. Nguyen, T. S. Pham, T. H. Vuong, N. V. Nguyen, and M. V.
Tran, Dsktlab-ner: Nested named entity recognition in vietnamese text,
In The Fourth International Workshop on Vietnamese Language and
Speech Processing (VLSP 2016), 2016.
[22] T. S. Nguyen, L. M. Nguyen, and X. C. Tran, Vietnamese named entity
recognition at vlsp 2016 evaluation campaign, In Proceedings of The
Fourth International Workshop on Vietnamese Language and Speech
Processing, 2016.
[23] K. A. Nguyen, N. Dong, and C-T. Nguyen, Attentive Neural Network
for Named Entity Recognition in Vietnamese, 2019 IEEE-RIVF Inter-
national Conference on Computing and Communication Technologies
(RIVF), March 2019.
[24] M. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee,
and Luke Zettlemoyer, Deep contextualized word representations, Pro-
ceedings of the 2018 Conference of the North American Chapter
of the Association for Computational Linguistics: Human Language
Technologies, Volume 1, June 2018.
[25] T-H. Pham and P. Le-Hong, The Importance of Automatic Syntactic
Features in Vietnamese Named Entity Recognition, The 31st Pacific
Asia Conference on Language, Information and Computation, November
2017.
[26] T-H. Pham, X-K. Pham, T-A. Nguyen, and P. Le-Hong, NNVLP: A
Neural Network-Based Vietnamese Language Processing Toolkit, In
The Companion Volume of the IJCNLP 2017 Proceedings: System
Demonstrations, pp. 37-40, 2017.
[27] M. Schuster and K. K. Paliwal. ”Bidirectional recurrent neural net-
works”, Signal Processing, IEEE Transactions, Vol. 45.11, pp. 2673–
2681, 1997.
[28] F. Wu, J. Liu, C. Wu, Y. Huang, and X. Xie, Neural Chinese Named
Entity Recognition via CNN-LSTM-CRF and Joint Training with Word
Segmentation, The World Wide Web Conference, pp. 3342–3348, Apr.
2019.
[29] http://vlsp.org.vn/resources-vlsp2016
[30] W. Zhang. ”Shift-invariant pattern recognition neural network and its
optical architecture”, Proceedings of Annual Conference of the Japan
Society of Applied Physics, pp. 734, Sept. 1988.
