Deep Learning for Efficient Discriminative Parsing

Ronan Collobert†
IDIAP Research Institute
Martigny, Switzerland
[email protected]
Figure 1: Parse Tree representations. As in Penn Treebank (a), and after concatenating nodes spanning same words (b). In (c) we show our definition of “levels”.
    Level 4   B-S    I-S     I-S    E-S
    Level 3   O      O       B-VP   E-VP
    Level 2   O      O       O      S-S
    Level 1   O      S-NP    O      S-VP
    Words     But    stocks  kept   falling

Figure 2: The parse tree shown in Figure 1a, rewritten as four levels of tagging tasks.
2.2 From Tagging Levels To Parse Trees

Even though it has had success with partial parsing (Abney, 1997), the simplest scheme, where one would have a different tagger for each level of the parse tree, is not attractive in a full parsing setting. The maximum number of levels would have to be chosen at train time, which limits the maximum sentence length at test time. Instead, we propose to have a unique tagger for all parse tree levels:

1. Our tagger starts by predicting Level 1.
2. We then predict the next level according to a history of previous levels, with the same tagger.
3. We update the history of levels and go to 2.

This setup fits naturally into the recursive definition of the levels. However, we must ensure the predicted tags correspond to a parse tree. In a tree, a parent node fully includes its child nodes. Without constraints during the level predictions, one could face a chunk partially spanning another chunk at a lower level, which would break this tree constraint.

We can guarantee that the tagging process corresponds to a valid tree by adding a constraint enforcing higher level chunks to fully include lower level chunks. This iterative process might however never end, as it can be subject to loops: for instance, the constraint is still satisfied if the tagger predicts the same tags for two consecutive levels. We propose to tackle this problem by (a) modifying the training parse trees such that nodes grow strictly as we go up in the tree and (b) enforcing the corresponding constraints in the tagging process.

Tree nodes spanning the same words for several consecutive levels are first replaced by one node in the whole training set. The label of this new node is the concatenation of the replaced node labels (see Figure 1b). At test time, the inverse operation is performed on nodes having concatenated labels. Considering all possible label combinations would be intractable [3]. We kept in the training set the concatenated labels which were occurring at least 30 times (corresponding to the lowest number of occurrences of the least common non-concatenated tag). This added 14 extra labels to the 26 we already had. Adding the extra O tag and using the IOBES tagging scheme [4] led us to 161 ((26 + 14) × 4 + 1) different tags produced by our tagger. We denote by T this ensemble of tags.

With this additional pre-processing, any tree node is strictly larger (in terms of words it spans) than each of its children. We enforce the corresponding Constraint 1 during the iterative tagging process.

Constraint 1. Any chunk at level i overlapping a chunk at level j < i must span at least this overlapped chunk, and be larger.

As a result, the iterative tagging process described above will generate a chunk of size N in at most N levels, given a sentence of N words. At this point, the iterative loop is stopped, and the full tree can be deduced. The process might also be stopped if no new chunks were found (all tags were O). Assuming our simple tree pre-processing has been done, this generic algorithm could be used with any tagger able to handle a history of labels and tagging constraints. Even though the tagging process is greedy, because there is no global inference of the tree, we will see in Section 5 that it can perform surprisingly well. We propose in the next section a tagger based on a convolutional Graph Transformer Network (GTN) architecture. We will see in Section 4 how we keep track of the history and how we implement Constraint 1 for that tagger.
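This iterative procedure can be summarized in a few lines of code. The sketch below is ours and not the paper's implementation: tag_level stands for any tagger that returns one IOBES tag per word while respecting Constraint 1, and the exact form of the history it receives is refined in Section 4.

    def greedy_parse_levels(words, tag_level):
        """Iterative level-by-level tagging (Section 2.2, sketch).

        `tag_level(words, history)` is any tagger returning one IOBES tag per
        word while respecting Constraint 1; `history` holds the tag sequences
        predicted at the previous levels."""
        history = []
        for _ in range(len(words)):          # at most N levels for N words
            tags = tag_level(words, history)
            if all(t == "O" for t in tags):  # no new chunk was found: stop
                break
            history.append(tags)             # update the history, go to step 2
        return history                       # the full tree is deduced from these levels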
3 Architecture

We chose to use a variant of the versatile convolutional neural network architecture first proposed by Bengio et al. (2001) for language modeling, and reintroduced later by Collobert and Weston (2008) for various NLP tasks involving tagging. Our network outputs a graph over which inference is achieved with a Viterbi algorithm. In that respect, one can see the whole architecture (see Figure 3) as an instance of GTNs (Bottou et al., 1997; Le Cun et al., 1998). In the NLP field, this type of architecture has been used with success by Collobert et al. (2011) for “flat” tagging tasks. All network and graph parameters are trained in an end-to-end way, with stochastic gradient maximizing a graph likelihood. We first describe in this section how we adapt neural networks to text data, and then we introduce the GTN training procedure. More details on the [...]
[3] Note that more than two labels might be concatenated. E.g., the tag SBAR#S#VP is quite common in the training set.
[4] With the IOBES tagging scheme, each label (e.g. VP) is expanded into 4 different tags (e.g. B-VP, I-VP, E-VP, S-VP), as described in Section 2.1.
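As an illustration of this tag set (the sketch below is ours, not part of the paper), a chunked level can be converted to IOBES tags as follows; applied to Level 1 of Figure 2 it produces the tags O S-NP O S-VP. With the 26 + 14 = 40 labels above, the IOBES expansion plus the O tag indeed gives (40 × 4) + 1 = 161 tags.

    def iobes_encode(n_words, chunks):
        """chunks: list of (start, end, label) with inclusive word indices."""
        tags = ["O"] * n_words
        for start, end, label in chunks:
            if start == end:
                tags[start] = "S-" + label          # single-word chunk
            else:
                tags[start] = "B-" + label          # beginning of the chunk
                for i in range(start + 1, end):
                    tags[i] = "I-" + label          # inside the chunk
                tags[end] = "E-" + label            # end of the chunk
        return tags

    # Level 1 of Figure 2, for "But stocks kept falling":
    print(iobes_encode(4, [(1, 1, "NP"), (3, 3, "VP")]))
    # ['O', 'S-NP', 'O', 'S-VP']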
Figure 3: Our neural network architecture. Words and other desired discrete features (caps, tree history, ...) are given as input. The lookup-tables embed each feature in a vector space, for each word. This is fed into a convolutional network which outputs a score for each tag and each word. Finally, a graph is output with [...]

[...] word in our dictionary W. Having in mind the matrix-vector notation in (1), the lookup-table applied over the sentence can be seen as an efficient implementation of a convolution with a kernel width of size 1. Parameters W are thus initialized randomly and trained as any other neural network layer. However, we show in the experiments that one can obtain a significant performance boost by initializing these embeddings with the word representations found by Collobert and Weston (2008). These representations have been trained on a large unlabeled corpus (Wikipedia), using a language modeling task. They contain useful syntactic and semantic information, which appears to be useful for parsing. This corroborates improvements obtained in the same way by Collobert & Weston on various NLP tagging tasks.

In practice, it is common that one wants to represent a word with more than one feature. In our experiments we always took at least the low-caps words and a “caps” feature: w_n = (w_n^{lowcaps}, w_n^{caps}). In this case, we apply a different lookup-table for each discrete feature (LT_{W^{lowcaps}} and LT_{W^{caps}}), and the word embedding becomes the concatenation of the output of all these lookup-tables:

    LT_{W^{words}}(w_n) = ( LT_{W^{lowcaps}}(w_n^{lowcaps})^T , LT_{W^{caps}}(w_n^{caps})^T )^T .    (2)
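Concretely, equation (2) just indexes one trainable matrix per discrete feature and concatenates the selected columns. The following NumPy sketch is ours, not the paper's code; the table sizes follow the values reported later in Section 5 (a 100,000-word dictionary, 50 dimensions for low-caps words, 5 for the 4-valued caps feature).

    import numpy as np

    class LookupTable:
        """Embeds a discrete feature value (an index) into a trainable vector."""
        def __init__(self, n_values, dim, rng):
            self.W = 0.01 * rng.standard_normal((dim, n_values))  # trained like any layer

        def __call__(self, index):
            return self.W[:, index]

    rng = np.random.default_rng(0)
    lt_lowcaps = LookupTable(n_values=100_000, dim=50, rng=rng)  # low-caps word feature
    lt_caps = LookupTable(n_values=4, dim=5, rng=rng)            # caps feature

    def embed_word(w_lowcaps, w_caps):
        # word embedding = concatenation of the per-feature embeddings, as in (2)
        return np.concatenate([lt_lowcaps(w_lowcaps), lt_caps(w_caps)])

    x = embed_word(w_lowcaps=1234, w_caps=2)   # a (50 + 5)-dimensional vector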
[...] with some non-linear transfer function h(·). It outputs a vector of size |T| for each word at position n, interpreted as a score for each tag in T and each word w_n in the sentence:

    s(x_n) = M^2 h(M^1 x_n) ,    (3)

where the matrices M^1 ∈ R^{H×(KD)} and M^2 ∈ R^{|T|×H} are the trained parameters of the network. The number of hidden units H is a hyper-parameter to be tuned. As transfer function, we chose in our experiments a (fast) “hard” version of the hyperbolic tangent:

    h(x) = -1 if x < -1;   h(x) = x if -1 ≤ x ≤ 1;   h(x) = 1 if x > 1 .    (4)
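Equations (3) and (4) amount to a two-layer network with a hard-tanh non-linearity, applied to the concatenated embeddings of a window of K words. The NumPy sketch below is ours; H, K and |T| follow the values used later in the experiments, while the per-word dimension D is only illustrative.

    import numpy as np

    def hard_tanh(x):
        # the "hard" hyperbolic tangent of equation (4)
        return np.clip(x, -1.0, 1.0)

    def score_window(x_n, M1, M2):
        """x_n: concatenation of the K word embeddings around position n (size K*D).
        M1 has shape (H, K*D), M2 has shape (|T|, H); returns s(x_n) as in (3)."""
        return M2 @ hard_tanh(M1 @ x_n)

    # Illustrative shapes: K = 5 and H = 300 as in Section 5, |T| = 161 tags,
    # D = 70 per-word features (50 + 5 + 5 + 10, the sizes given in Section 5).
    K, D, H, n_tags = 5, 70, 300, 161
    rng = np.random.default_rng(0)
    M1 = 0.01 * rng.standard_normal((H, K * D))
    M2 = 0.01 * rng.standard_normal((n_tags, H))
    scores = score_window(rng.standard_normal(K * D), M1, M2)   # shape (161,)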
3.3 Long-Range Dependencies

The “window” approach proposed above assumes that the tag of a word is solely determined by the surrounding words in the window. As we will see in our experiments, this approach falls short on long sentences. Inspired by Collobert and Weston (2008), we consider a variant of this architecture, where all words {w_1, w_2, ..., w_N} are considered for tagging a given word w_n. To indicate to the network that we want to tag the word w_n, we introduce an additional lookup-table in (2), which embeds the relative distance (m − n) of each word w_m in the sentence with respect to w_n. At each position 1 ≤ m ≤ N, the outputs of all the lookup-tables (2) (low-caps word, caps, relative distance, ...) LT_{W^{words}}(w_m) are first combined together by applying a mapping M^0. We then extract a fixed-size “global” feature vector [7] x_n by performing a max over the sentence:

    [x_n]_i = max_{1 ≤ m ≤ N} [ M^0 LT_{W^{words}}(w_m) ]_i   ∀i .    (5)

This feature vector is then fed to the scoring layers (3). The matrix M^0 is trained by back-propagation, as any other network parameter. We will refer to this approach as the “sentence approach” in the following.

[7] Here, the concatenation of lookup-table outputs LT_{W^{words}} includes relative position embeddings with respect to word n. Because of this notation shortcut, the right-hand side of (5) depends on n implicitly.
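In other words, (5) projects each word's feature vector with M^0 and then takes a component-wise max over all positions of the sentence. A minimal sketch of this pooling step (ours, under the same notation):

    import numpy as np

    def sentence_feature(embeddings, M0):
        """embeddings: (N, F) matrix, one row per word w_m, already including the
        relative-distance embedding with respect to the word of interest w_n.
        M0: (F_out, F) trainable matrix.  Returns the fixed-size vector x_n of (5)."""
        projected = embeddings @ M0.T          # (N, F_out): M0 LT_{W^words}(w_m)
        return projected.max(axis=0)           # component-wise max over 1 <= m <= N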
3.4 Structured Tag Inference

We know that there are strong dependencies between parsing tags in a sentence: not only are tags organized in chunks, but some tags cannot follow other tags. It is thus natural to infer tags from the scores in (3) using a structured output approach. We introduce a transition score A_{tu} for jumping from tag t ∈ T to u ∈ T in successive words, and an initial score A_{t0} for starting from the t-th tag. The last layer of our network outputs a graph with |T| × N nodes G_{tn} (see Figure 3). Each node G_{tn} is assigned a score s(x_n)_t from the previous layer (3) of our architecture. Given a pair of nodes G_{tn} and G_{u(n+1)}, we add an edge with transition score A_{tu} on the graph. For compactness, we use the sequence notation [t]_1^N ≜ {t_1, ..., t_N} in what follows. We score a tag path [t]_1^N in the graph G as the sum of scores along [t]_1^N in G:

    S([w]_1^N, [t]_1^N, θ) = Σ_{n=1}^{N} ( A_{t_{n-1} t_n} + s(x_n)_{t_n} ) ,    (6)

where θ represents all the trainable parameters of our complete architecture (W, M^1, M^2 and A). The sentence tags [t*]_1^N are then inferred by finding the path which leads to the maximal score:

    [t*]_1^N = argmax_{[t]_1^N ∈ T^N} S([w]_1^N, [t]_1^N, θ) .    (7)

The Viterbi (1967) algorithm is the natural choice for this inference. We now show how to train all the parameters of the network θ in an end-to-end way.
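Inference (7) over this graph is the textbook Viterbi recursion. In the sketch below (ours, not the paper's code), scores[n, t] stands for s(x_n)_t, A[t, u] for the transition scores A_{tu}, and A0[t] for the initial scores A_{t0}.

    import numpy as np

    def viterbi(scores, A, A0):
        """scores: (N, T) network scores s(x_n)_t; A: (T, T) transitions A_tu;
        A0: (T,) initial scores.  Returns the best-scoring tag path of (7)."""
        N, T = scores.shape
        delta = A0 + scores[0]                    # best score of paths ending in each tag
        back = np.zeros((N, T), dtype=int)
        for n in range(1, N):
            cand = delta[:, None] + A             # cand[t, u] = delta[t] + A_tu
            back[n] = cand.argmax(axis=0)         # best previous tag for each tag u
            delta = cand.max(axis=0) + scores[n]
        path = [int(delta.argmax())]
        for n in range(N - 1, 0, -1):             # track back the optimal path
            path.append(int(back[n, path[-1]]))
        return path[::-1]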
3.5 Training Likelihood

Following the GTN training method introduced in (Bottou et al., 1997; Le Cun et al., 1998), we consider a probabilistic framework, where we maximize a likelihood over all the sentences [w]_1^N in our training set, with respect to θ. The score (6) can be interpreted as a conditional probability over a path by taking its exponential (making it positive) and normalizing with respect to all possible paths (summing to 1 over all paths). Taking the log(·) leads to the following conditional log-probability:

    log p([t]_1^N | [w]_1^N, θ) = S([w]_1^N, [t]_1^N, θ) − logadd_{[u]_1^N ∈ T^N} S([w]_1^N, [u]_1^N, θ) ,    (8)

where we adopt the notation logadd_i z_i = log(Σ_i e^{z_i}). This likelihood is the same as the one found in Conditional Random Fields (CRFs) (Lafferty et al., 2001) over temporal sequences. The CRF model is however linear (which would correspond in our case to a linear neural network, with fixed word embeddings).

Computing the log-likelihood (8) efficiently is not straightforward, as the number of terms in the logadd grows exponentially with the length of the sentence. Fortunately, in the same spirit as the Viterbi algorithm, one can compute it in linear time with the following classical recursion over n:
    δ_n(v) ≜ logadd_{{[u]_1^n : u_n = v}} S([w]_1^n, [u]_1^n, θ)   ∀v ∈ T
           = s(x_n)_v + logadd_t ( δ_{n−1}(t) + A_{tv} ) ,    (9)

followed by the termination

    logadd_{[u]_1^N} S([w]_1^N, [u]_1^N, θ) = logadd_u δ_N(u) .

As a comparison, the Viterbi algorithm used to perform the inference (7) is achieved with the same recursion, where the logadd is replaced by a max, followed by tracking back the optimal path through each max.
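In code, recursion (9) is the forward pass of a linear-chain model, identical to the Viterbi sketch above except that the max is replaced by a log-sum-exp. The following sketch (ours, same conventions as before) computes the normalization term and the log-likelihood (8):

    import numpy as np
    from scipy.special import logsumexp

    def log_partition(scores, A, A0):
        """logadd over all paths of S([w], [u], theta), via recursion (9)."""
        N, T = scores.shape
        delta = A0 + scores[0]                       # delta_1(v)
        for n in range(1, N):
            # delta_n(v) = s(x_n)_v + logadd_t (delta_{n-1}(t) + A_tv)
            delta = scores[n] + logsumexp(delta[:, None] + A, axis=0)
        return logsumexp(delta)                      # termination

    def log_likelihood(scores, A, A0, tags):
        """log p([t] | [w], theta) of equation (8) for a gold tag sequence `tags`."""
        path_score = A0[tags[0]] + scores[0, tags[0]]
        for n in range(1, len(tags)):
            path_score += A[tags[n - 1], tags[n]] + scores[n, tags[n]]
        return path_score - log_partition(scores, A, A0)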
The log-likelihood (8) is maximized over the training set with a stochastic gradient procedure, iterating the update

    θ ← θ + λ ∂ log p([t]_1^N | [w]_1^N, θ) / ∂θ ,    (10)

where λ is a chosen learning rate. The gradient in (10) is efficiently computed via a classical backpropagation (Rumelhart et al., 1986): the differentiation chain rule is applied to the recursion (9), and then to all network layers (3), including the word embedding layers (1). Derivations are simple (but fastidious) algebra which can be found in the supplementary material of this paper.
4 Chunk History and Tree Constraints

The neural network architecture we presented in Section 3 is made “recursive” by adding an additional feature (and its corresponding lookup-table (1)) describing a history of previous tree levels. For that purpose, we gather all chunks which were discovered in previous tree levels. If several chunks were overlapping at different levels, we consider only the largest one. Assuming Constraint 1 holds, a word can be in at most one of the remaining chunks. This is our history [8] C. The corresponding IOBES tags of each word are fed as a feature to the GTN. For instance, assuming the labeling in Figure 2 was found up to Level 3, the chunks we would consider in C for tagging Level 4 would be only the NP around “stocks” and the VP around “kept falling”. We would discard the S and VP around “falling”, as they are included in the larger VP chunk.

[8] Some other kind of history could have been chosen (e.g. a feature for each of an arbitrarily chosen number L ∈ N of previous levels). However, we still need to “compute” the proposed history for implementing Constraint 1.
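One possible way to compute this history (our own sketch, one of several reasonable implementations) is to scan the chunks level by level and keep, for every word, only the largest chunk covering it; the result can then be re-encoded with the IOBES scheme and fed to the network as the history feature.

    def build_history(chunks_per_level):
        """chunks_per_level: chunks found so far, one list of (start, end, label)
        per tree level, in the order the levels were predicted.
        Returns the history C: only the largest chunk covering any given word."""
        history = []
        for level_chunks in chunks_per_level:
            for (start, end, label) in level_chunks:
                # by Constraint 1, a chunk at a higher level covers (and thus
                # replaces) the smaller chunks it overlaps
                history = [c for c in history
                           if not (start <= c[0] and c[1] <= end)]
                history.append((start, end, label))
        return history

    # The feature fed to the GTN is then the IOBES encoding of this history,
    # e.g. iobes_encode(n_words, build_history(chunks_per_level)).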
We now implement Constraint 1 by constraining the inference graph introduced in Section 3.4 using the chunk history C. For each chunk c ∈ C, we adapt the graph output by our network in Figure 3 such that any new candidate chunk c̃ overlapping c includes c and is strictly larger than c. Because the chunk history C includes the largest chunks up to the last predicted tree level, the new candidate chunk c̃ will be strictly larger than any chunk predicted in previous tree levels. Constraint 1 is then always satisfied.

Using an IOBES tagging scheme, we implement these conditions (see Figure 4) by allowing only three possible paths for c̃ in the inference graph, for each candidate label (e.g. VP):

• The first tag of c̃ is B-VP, and the remaining tags overlapping with c are maintained at I-VP.
• The last tag of c̃ is E-VP, and the previous tags overlapping with c are maintained at I-VP.
• The path c̃ is maintained on I-VP while overlapping c.

In addition to these 3 × |L| possible paths overlapping c, there is an additional path where no chunk is found over c, in which case all tags stay in O while overlapping c. Finally, as c̃ must be strictly larger than c, any S- tag is discarded for the duration of c. Parts of the graph not overlapping with the chunk history C remain fully connected, as previously described in Section 3.
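For a history chunk of a given length, these conditions leave exactly 3 × |L| + 1 admissible tag sequences over the chunk. The sketch below (ours, not the paper's code) enumerates them; in the GTN itself the same effect is obtained by pruning nodes and edges of the inference graph rather than by explicit enumeration.

    def allowed_paths_over_chunk(length, labels):
        """Enumerate the tag sequences permitted over a history chunk of `length`
        words: three paths per candidate label, plus the all-O path (3 x |L| + 1)."""
        paths = [["O"] * length]                       # no new chunk over c
        for X in labels:
            b, i, e = "B-" + X, "I-" + X, "E-" + X
            paths.append([b] + [i] * (length - 1))     # c~ starts at the first word of c
            paths.append([i] * (length - 1) + [e])     # c~ ends at the last word of c
            paths.append([i] * length)                 # c~ strictly contains c
        return paths

    # Every other tag combination over the chunk (in particular any S- tag) is
    # removed from the graph; the rest of the graph stays fully connected.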
5 Experiments

We conducted our experiments on the standard English Penn Treebank benchmark (Marcus et al., 1993). Sections 02–21 were used for training, section 22 for validation, and section 23 for testing. Standard pre-processing as described in Section 2 was performed. In addition, the training set trees were transformed such that two nodes spanning the same words were concatenated, as described in Section 2.2. We report results on the test set in terms of recall (R), precision (P) and F1 score. Scores were obtained using the Evalb implementation [9].
Figure 4: Implementing tree constraints: the chunk history (bottom) contains the NP “the black cat”. The top inference graph is unconstrained (as it would be if no chunk were found in the history) and is given only for comparison. The bottom graph is constrained such that new overlapping chunks strictly include the existing chunk “the black cat”.

Table 1: Comparison of parsers trained and tested on Penn Treebank, on sentences ≤ 15 words, against our GTN parser (window approach).

    Model                        R     P     F1
    Collins (1999)               88.2  89.2  88.7
    Taskar et al. (2004)         89.1  89.1  89.1
    Turian and Melamed (2006)    89.3  89.6  89.4
    GTN Parser                   82.4  82.8  82.6
    GTN Parser (LM)              86.1  87.2  86.6
    GTN Parser (POS)             87.1  86.2  86.7
    GTN Parser (LM+POS)          89.2  89.0  89.1

Our architecture (see Section 3) was trained on all possible parse tree levels (see Section 2.1), for all sentences available in the training set. Random levels in random sentences were presented to the network until convergence on the validation set. We fed our network with (1) lower-cap words (to limit the number of words), (2) a capital-letter feature (is low caps, is all caps, had first letter capital, or had one capital) to keep the upper-case information, (3) the relative distance to the word of interest (only for the “sentence approach”), (4) a POS feature [10] (unless otherwise mentioned), and (5) the history of previous levels (see Section 4). During training, the true history was given. During testing, the history and the tags were obtained recursively from the network outputs, starting from Level 1 (see Section 2.2). All features had a corresponding lookup-table (1) in the network.

Only a few hyper-parameters were tried in our models (chosen according to the validation set). Lookup-table sizes for the low-cap words, caps, POS, relative distance (in the “sentence approach”) and history features were respectively 50, 5, 5, 5 and 10. The window size of our convolutional network was K = 5. The word dictionary size was 100,000. We used the word embeddings obtained from the language model (LM) (Collobert and Weston, 2008) to initialize the word lookup-table. Finally, we fixed the learning rate λ = 0.01 during the stochastic gradient procedure (10). The only neural network “tricks” we used were (1) the initialization of the parameters was done according to the fan-in, and (2) the learning rate was divided by the fan-in (Plaut and Hinton, 1987).
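For convenience, the hyper-parameters reported in this section can be gathered in a single configuration sketch (ours; the field names are illustrative):

    # Hyper-parameters reported in Section 5 (GTN parser)
    config = {
        "embedding_sizes": {          # lookup-table output sizes
            "lowcaps_word": 50,
            "caps": 5,
            "pos": 5,
            "relative_distance": 5,   # sentence approach only
            "history": 10,
        },
        "window_size_K": 5,
        "dictionary_size": 100_000,
        "hidden_units_H": 300,        # window approach, Section 5.1
        "learning_rate": 0.01,        # divided by the fan-in of each layer
        "init": "fan-in scaled",      # Plaut and Hinton (1987)
    }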
5.1 Small Scale Experiments

Early discriminative parsers were very computationally expensive to train. Taskar et al. (2004) proposed a comparison setup for discriminative parsers, limited to Penn Treebank sentences with ≤ 15 words. Turian and Melamed (2006) report almost 5 days of training for their own parser, using parallelization, on this setup. They also report several months of training for Taskar et al.'s parser. In comparison, our parser takes only a few hours to train (on a single CPU) on this setup. We report in Table 1 the test performance of our window-approach system (“GTN Parser”, with H = 300 hidden units) against the discriminative parsers of Taskar et al. and of Turian and Melamed. We also report the performance of the Collins (1999) parser, a reference among non-discriminative parsers. Not initializing the word lookup-table with the language model (LM) and not using POS features performed poorly, similarly to the experiments reported by Collobert and Weston (2008). It is known that POS is a fundamental feature for all existing parsers. The LM is crucial for the performance of the architecture, as most of the capacity of the network lies in the word lookup-table (100,000 words × dimension 50). Without the LM, rare words cannot be [...]

[9] Available at http://cs.nyu.edu/cs/projects/proteus/evalb.
[10] Obtained with our own tagger, available at http://ml.nec-labs.com/senna.
Table 2: Comparison of parsers trained on the full Penn Treebank, and tested on sentences with ≤ 40 and ≤ 100 words. We also report testing time on the test set (Section 23).
J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Eighteenth International Conference on Machine Learning, ICML, 2001.

K. J. Lang and G. E. Hinton. The development of the time-delay neural network architecture for speech recognition. Technical Report CMU-CS-88-152, Carnegie Mellon University, 1988.

Y. Le Cun, L. Bottou, Y. Bengio, and P. Haffner. Gradient based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

Y. LeCun. A learning scheme for asymmetric threshold networks. In Proceedings of Cognitiva 85, pages 599–604, Paris, France, 1985.

D. M. Magerman. Statistical decision-tree models for parsing. In Proceedings of the 33rd Annual Meeting of the ACL, pages 276–283, 1995.

M. P. Marcus, M. A. Marcinkiewicz, and B. Santorini. Building a large annotated corpus of English: the Penn Treebank. Computational Linguistics, 19(2):313–330, 1993.

A. McCallum and W. Li. Early results for named entity recognition with conditional random fields, feature induction and web-enhanced lexicons. In Proceedings of HLT-NAACL, pages 188–191, 2003.

S. Petrov and D. Klein. Sparse multi-scale grammars for discriminative latent variable parsing. In EMNLP ’08, pages 867–876. ACL, 2008.

D. C. Plaut and G. E. Hinton. Learning sets of filters using back-propagation. Computer Speech and Language, 2:35–61, 1987.

A. Ratnaparkhi. Learning to parse natural language with maximum entropy models. Machine Learning, 34(1-3):151–175, 1999.

D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning internal representations by back-propagating errors. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition, volume 1, pages 318–362. MIT Press, 1986.

F. Sha and F. Pereira. Shallow parsing with conditional random fields. In NAACL 2003, pages 134–141, 2003.

B. Taskar, D. Klein, M. Collins, D. Koller, and C. Manning. Max-margin parsing. In Proceedings of EMNLP, 2004.

J. Turian and I. D. Melamed. Advances in discriminative parsing. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the ACL, pages 873–880, 2006.

A. J. Viterbi. Error bounds for convolutional codes and an asymptotically optimal decoding algorithm. IEEE Transactions on Information Theory, 13:260–269, 1967.