Final Project – Discriminative Approach for Sequence Labelling Through the Use of CRFs and RNNs
• CRF++: an open-source and customizable implementation of Conditional Random Fields for segmenting/labelling sequential data [2].

… find the template that I used to train the conditional random fields. Table 3.5 shows the results obtained.
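The template file itself is not reproduced in this report. Purely as an illustrative sketch (not necessarily the template used for the reported results), a CRF++ feature template that looks at the current word and a small window around it could be written as follows:

    # Unigram features: the word in column 0, in a window of +/-2 positions
    U00:%x[-2,0]
    U01:%x[-1,0]
    U02:%x[0,0]
    U03:%x[1,0]
    U04:%x[2,0]
    # Word bigrams around the current position
    U05:%x[-1,0]/%x[0,0]
    U06:%x[0,0]/%x[1,0]

    # Bigram template: combines the previous and the current output tag
    B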
… it can only access contextual information in one direction (typically the past).

4.1 General idea

A recurrent neural network can be seen as a multi-layer perceptron (MLP) in which we relax the condition that connections must not form cycles, and allow cyclical connections as well. An MLP can only map from input to output vectors, whereas an RNN can map from the entire history of previous inputs to each output. RNNs are called recurrent because they perform the same task for every element of a sequence, with the output depending on the previous computations. They are able to memorize previous inputs in the network's internal state, which thereby influences the network output.

Here is what a typical RNN looks like:

Figure 4.1: A recurrent neural network and the unfolding in time of the computation involved in its forward computation.

The diagram above shows an RNN being unrolled into a full network. By unrolling we simply mean that we write out the network for the complete sequence.
• Xt is the input at time step t.
• St is the hidden state at time step t. It is the "memory" of the network.
• Ot is the output at time step t.
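To make the roles of Xt, St and Ot concrete, the following Python sketch shows one forward pass of a simple Elman-style RNN. It is illustrative code with my own variable names (U, W, V for the input-to-hidden, hidden-to-hidden and hidden-to-output weights), not the course implementation:

    import numpy as np

    def rnn_forward(x_seq, U, W, V):
        """Forward pass of a simple Elman RNN.
        x_seq: list of input vectors x_t; U, W, V: input-to-hidden,
        hidden-to-hidden and hidden-to-output weight matrices."""
        s = np.zeros(W.shape[0])            # initial hidden state s_0
        outputs = []
        for x in x_seq:
            # the hidden state depends on the current input and the previous state
            s = np.tanh(U @ x + W @ s)
            # output scores over the classes (concept-tags) for the current word
            z = V @ s
            o = np.exp(z - z.max())
            o /= o.sum()                    # softmax over the classes
            outputs.append(o)
        return outputs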
During the course we have seen two types of RNNs that use recurrent connections: the Elman type, which feeds the activation of the hidden layer at the previous time step back in together with the input, and the Jordan type, which feeds the activation of the output layer at the previous time step back in together with the input. For both of them, during training the network is unrolled backwards in time and backpropagation is applied; this is usually called Backpropagation Through Time (BPTT). Words are fed into the neural network using 1-of-n encoding. To make the model more accurate, it is possible to use word embeddings, where the network maps the input words onto a d-dimensional continuous space and learns distributed representations for them (similar words are mapped to nearby positions in that space). In our task the output of the network is a probability for each class (concept-tag) for the current word.
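As an illustration of how the 1-of-n word indices, the context window and the embedding matrix fit together, here is a small Python sketch under my own assumptions (in particular the use of index -1 as a padding symbol); it is not the course code:

    import numpy as np

    def context_windows(word_indices, win):
        """Build a context window of size `win` (odd) around each word,
        padding the sentence borders with a special index (-1)."""
        assert win % 2 == 1
        pad = [-1] * (win // 2)
        padded = pad + list(word_indices) + pad
        return [padded[i:i + win] for i in range(len(word_indices))]

    # hypothetical sizes: vocabulary of 1000 words, 100-dimensional embeddings;
    # one extra row is reserved for the padding index -1
    vocab_size, emb_dimension = 1000, 100
    embeddings = 0.2 * np.random.uniform(-1.0, 1.0, (vocab_size + 1, emb_dimension))

    sentence = [42, 7, 913]                 # 1-of-n indices of a 3-word sentence
    inputs = []
    for window in context_windows(sentence, win=3):
        # each window of indices becomes a concatenated vector of embeddings,
        # which is the actual input x_t fed to the recurrent layer
        inputs.append(embeddings[window].reshape(-1))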
4.2 Implementation

The code for the implementation of the RNNs was given by the course teaching assistant and is based on two papers by Grégoire Mesnil, Xiaodong He and colleagues [5][6]. Using this implementation, it is possible to train and test RNNs of the Jordan and Elman types. It also implements word embeddings. There is a configuration file that allows us to play with the hyper-parameters; I report all of them in Table 4.1.

Table 4.1: Hyper-parameters exposed by the configuration file.

Parameter        Definition
lr               Learning rate value
win              Number of words in the context window
bs               Number of BPTT steps
nhidden          Number of hidden units
seed             Seed for the random number generator
emb_dimension    Dimension of the word embeddings
nepochs          Maximum number of training epochs
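The exact format of the configuration file is not reproduced here. Purely as an illustration, the hyper-parameters could be collected in a Python dictionary such as the hypothetical one below, filled with the values that gave the best Elman results (reported in Figure 4.2 below):

    # Hypothetical grouping of the hyper-parameters; the actual configuration
    # file of the course code may use a different format.
    config = {
        'lr': 0.1,             # learning rate
        'win': 9,              # context window size (words)
        'bs': 5,               # number of BPTT steps
        'nhidden': 100,        # hidden units
        'seed': 38428,         # random seed
        'emb_dimension': 100,  # word embedding dimension
        'nepochs': 25,         # maximum number of training epochs
    }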
4.3 Methodology and results

Also for the RNNs I used the NL-SPARQL dataset. I took 10% of the shuffled training set (300 sentences) to create a validation set, which is needed during training for the stopping criterion.

Standard Recurrent Neural Networks

In this section I report the results obtained using the basic implementation of the RNNs given by the course. I tried different parameter settings and the best performance is reported in the following table.

Table 4.2: Results obtained using Elman-type RNNs.

These results were achieved using Elman-type RNNs. The hyper-parameters used are reported in the following table.

Parameter        Value
lr               0.1
win              9
bs               5
nhidden          100
seed             38428
emb_dimension    100
nepochs          25

Figure 4.2: Configuration that gave me the best performance with the Elman-type RNNs.
LSTM Recurrent Neural Networks

Suppose, for instance, that we are trying to predict the last word in the sentence "the clouds are in the sky": we do not need any further context, as it is pretty obvious that the next word is going to be "sky". In such cases, where the gap between the relevant information and the place where it is needed is small, RNNs can learn to use the past information. But there are also cases where we need more context. Consider trying to predict the last word in the text "I grew up in Italy… I speak fluent Italian." Recent information suggests that the next word is probably the name of a language, but if we want to narrow down which language, we need the context of Italy, from further back. It is entirely possible for the gap between the relevant information and the point where it is needed to become very large. Unfortunately, as that gap grows, RNNs become unable to learn to connect the information. Long Short-Term Memory (LSTM) networks are a special kind of RNN capable of learning these long-term dependencies [7]. They were introduced by Hochreiter & Schmidhuber in 1997. LSTMs are explicitly designed to combat vanishing gradients through a gating mechanism.
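To make the gating mechanism concrete, the sketch below shows a single LSTM time step in plain numpy. It is my own illustrative code (the names and the packing of the parameters are assumptions), not the implementation that was integrated into the course code:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def lstm_step(x, h_prev, c_prev, W, U, b):
        """One LSTM time step. W, U, b hold the stacked parameters of the
        input (i), forget (f), output (o) gates and the candidate update (g)."""
        n = h_prev.shape[0]
        z = W @ x + U @ h_prev + b          # pre-activations, shape (4n,)
        i = sigmoid(z[0:n])                 # input gate: what enters the cell
        f = sigmoid(z[n:2 * n])             # forget gate: what is kept in the cell
        o = sigmoid(z[2 * n:3 * n])         # output gate: what is read from the cell
        g = np.tanh(z[3 * n:4 * n])         # candidate cell update
        c = f * c_prev + i * g              # new cell state
        h = o * np.tanh(c)                  # new hidden state
        return h, c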
I found a simple implementation of LSTMs and integrated the LSTM model into the RNN code given by the course (I edited the Elman model provided). In the following table it is possible to see the results that I obtained.

5 Evaluation and Conclusion

During the spring project and the final project I adopted discriminative models (CRFs and RNNs) and generative models (SFSTs). While generative approaches model the joint distribution p(x, y), discriminative approaches focus solely on the posterior distribution p(y | x). The main difference is that discriminative models only have to model the conditional distribution and completely disregard the prior distribution of the training samples, p(x); this gives discriminative models more freedom to fit the training data, because they only have to tune the parameters that maximize p(y | x). Generally, in a classification task, discriminative models work better than generative ones, but many factors can influence them, such as the size of the training data or the errors it contains. In such cases generative models have the advantage of being able to use the prior probability to identify outlier samples and assign them lower probability.
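The textbook decomposition behind this comparison makes the difference explicit (this is standard probability, not something specific to these experiments):

    p(x, y) = p(y \mid x)\, p(x)
    \quad\Longrightarrow\quad
    \arg\max_y p(y \mid x) = \arg\max_y p(x, y)

Both model families lead to the same decision rule, but a generative model that estimates p(x, y) also has to account for p(x), whereas a discriminative model spends all of its capacity on p(y | x) directly.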
The next table reports the best results obtained during these two projects.

Figure 5.1: Final comparison between all models.
Appendices
The project can be found on GitHub at the following link: https://github.com/feedmari/LUS-Spring-Project
References
[1] C. Sutton, An Introduction to Conditional Random Fields, arXiv:1011.4088.
[2] T. Kudo, CRF++: Yet Another CRF Toolkit, taku910.github.io/crfpp.
[3] C. Raymond and G. Riccardi, Discriminative and Generative Algorithms for Spoken Language Understanding, Proc. Interspeech, Antwerp, 2007.
[4] A. Graves, Supervised Sequence Labelling with Recurrent Neural Networks.
[5] G. Mesnil, X. He, L. Deng and Y. Bengio, Investigation of Recurrent Neural Network Architectures and Learning Methods for Spoken Language Understanding.
[6] G. Mesnil, Y. Dauphin, K. Yao, Y. Bengio, L. Deng, D. Hakkani-Tur, X. He, L. Heck, G. Tur, D. Yu and G. Zweig, Using Recurrent Neural Networks for Slot Filling in Spoken Language Understanding.
[7] C. Olah, Understanding LSTM Networks, http://colah.github.io/posts/2015-08-Understanding-LSTMs/.
[8] A. O. Bayer, Recurrent Neural Network for Language Model and SLU, 2017.