
Deep Learning for Efficient Discriminative Parsing

Ronan Collobert†
IDIAP Research Institute
Martigny, Switzerland

Abstract

We propose a new fast purely discriminative algorithm for natural language parsing, based on a “deep” recurrent convolutional graph transformer network (GTN). Assuming a decomposition of a parse tree into a stack of “levels”, the network predicts a level of the tree taking into account predictions of previous levels. Using only a few basic text features which leverage word representations from Collobert and Weston (2008), we show similar performance (in F1 score) to existing pure discriminative parsers and existing “benchmark” parsers (like the Collins parser, based on probabilistic context-free grammars), with a huge speed advantage.

1 Introduction

Parsing has been pursued with tremendous efforts in the Natural Language Processing (NLP) community. Since the introduction of lexicalized¹ probabilistic context-free grammar (PCFG) parsers (Magerman, 1995; Collins, 1996), improvements have been achieved over the years, but generative PCFG parsers of the last decade from Collins (1999) and Charniak (2000) still remain standard benchmarks. Given the success of discriminative learning algorithms for classical NLP tasks (Part-Of-Speech (POS) tagging, Named Entity Recognition, Chunking...), the generative nature of such parsers has been questioned. The first discriminative parsing algorithms (Ratnaparkhi, 1999; Henderson, 2004) did not reach standard PCFG-based generative parsers. Henderson (2004) outperforms the Collins parser only by using a generative model and performing re-ranking. Charniak and Johnson (2005) also successfully leveraged re-ranking. Pure discriminative parsers from Taskar et al. (2004) and Turian and Melamed (2006) finally reached Collins’ parser performance, with various simple template features. However, these parsers were slow to train and were both limited to sentences with less than 15 words. Most recent discriminative parsers (Finkel et al., 2008; Petrov and Klein, 2008) are based on Conditional Random Fields (CRFs) with PCFG-like features. In the same spirit, Carreras et al. (2008) use a global linear model (instead of a CRF), with PCFG and dependency features.

We motivate our work with the fundamental question: how far can we go with discriminative parsing, with as little task-specific prior information as possible? We propose a fast new discriminative parser which not only does not rely on information extracted from PCFGs, but also does not rely on most classical parsing features. In fact, with only a few basic text features and Part-Of-Speech (POS) tags, it performs similarly to Taskar’s and Turian’s parsers on small sentences, and similarly to Collins’ parser on long sentences.

There are two main achievements in this paper. (1) We trade the reduction of features for a “deeper” architecture, i.e. a particular deep neural network, which takes advantage of word representations from Collobert and Weston (2008) trained on a large unlabeled corpus. (2) We show that the task of parsing can be efficiently implemented by seeing it as a recursive tagging task. We convert parse trees into a stack of levels, and then train a single neural network which predicts a “level” of the tree based on predictions of previous levels. This approach shares some similarity with the finite-state parsing cascades from Abney (1997). However, Abney’s algorithm was limited to partial parsing, because each level of the tree was predicted by its own tagger: the maximum depth of the tree had to be chosen beforehand.

† Part of this work was achieved while Ronan Collobert was at NEC Laboratories America.
¹ Which leverage head words of parsing constituents.

Appearing in Proceedings of the 14th International Conference on Artificial Intelligence and Statistics (AISTATS) 2011, Fort Lauderdale, FL, USA. Volume 15 of JMLR: W&CP 15. Copyright 2011 by the authors.

[Figure 1: Parse Tree representations. As in Penn Treebank (a), and after concatenating nodes spanning the same words (b). In (c) we show our definition of “levels”.]

We acknowledge that training a neural network is a task which requires some experience, which differs from the experience required for choosing good parsing features in more classical approaches. From our perspective, this knowledge however allows flexible and generic architectures. Indeed, from a deep learning point of view, our approach is quite conventional, based on a convolutional neural network (CNN) adapted for text. CNNs were successful very early for tasks involving sequential data (Lang and Hinton, 1988). They have also been applied to NLP (Bengio et al., 2001; Collobert and Weston, 2008; Collobert et al., 2011), but limited to “flat” tagging problems. We combine CNNs with structured tag inference in a graph, the resulting model being called a Graph Transformer Network (GTN) (Bottou et al., 1997). Again, this is not a surprising architecture: GTNs are for deep models what CRFs are for linear models (Lafferty et al., 2001), and CRFs have had great success in NLP (Sha and Pereira, 2003; McCallum and Li, 2003; Cohn and Blunsom, 2005). We show how GTNs can be adapted to parsing, by simply constraining the inference graph at each parsing level prediction.

In Section 2 we describe how we convert trees to (and from) a stack of levels. Section 3 describes our GTN architecture for text. Section 4 shows how to implement the constraints necessary to get a valid tree from a level decomposition. Evaluation of our system on standard benchmarks is given in Section 5.

2 Parse Trees

We consider linguistic parse trees as described in Figure 1a. The root spans all of the sentence, and is recursively decomposed into sub-constituents (the nodes of the tree) with labels like NP (noun phrase), VP (verb phrase), S (sentence), etc. The tree leaves contain the sentence words. All our experiments were performed using the Penn Treebank dataset (Marcus et al., 1993), on which we applied several standard pre-processing steps: (1) functional labels as well as traces were removed, (2) the label PRT was converted into ADVP (see Magerman, 1995), and (3) duplicate constituents (spanning the same words and with the same label) were removed. The resulting dataset contains 26 different labels, which we will denote L in the rest of the paper.

2.1 Parse Tree Levels

Many NLP tasks involve finding chunks of words in a sentence, which can be viewed as a tagging task. For instance, “Chunking” is a task related to parsing, where one wants to obtain the label of the lowest parse tree node in which each word ends up. For the tree in Figure 1a, the word/chunking-tag pairs could be written as: But/O stocks/S-NP kept/B-VP falling/E-VP. We chose here to adopt the IOBES tagging scheme to mark chunk boundaries. The tag “S-NP” is used to mark a noun phrase containing a single word. Otherwise the tags “B-NP”, “I-NP”, and “E-NP” are used to mark the first, intermediate and last words of the noun phrase. An additional tag “O” marks words that are not members of a chunk.

As illustrated in Figure 1c and Figure 2, one can rewrite a parse tree as a stack of tag levels. We achieve this tree conversion by first transforming the lowest nodes of the parse tree into chunk tags (“Level 1”). Tree nodes which contain sub-nodes are ignored at this stage². Words not in one of the lowest nodes are tagged as “O”. We then strip the lowest nodes off the tree, and apply the same principle for “Level 2”. We repeat the process until one level contains the root node. We chose a bottom-up approach because one can rely very well on lower-level predictions: the chunking task, which describes in another way the lowest parse tree nodes, has a very good performance record (Sha and Pereira, 2003).

² E.g. in Figure 1a, “kept” is not tagged as “S-VP” in Level 1, as the node “VP” still contains sub-nodes “S” and “VP” above “falling”.
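To make the level encoding concrete, the following minimal sketch (not from the paper; the helper name is hypothetical) converts the chunks of one level into IOBES tags, reproducing Level 1 of Figure 2:

    def chunks_to_iobes(n_words, chunks):
        """Encode one tree level as IOBES tags.

        chunks: list of (label, start, end) with inclusive, 0-based word indices,
        assumed non-overlapping. Words outside every chunk get "O".
        """
        tags = ["O"] * n_words
        for label, start, end in chunks:
            if start == end:                      # single-word chunk
                tags[start] = "S-" + label
            else:
                tags[start] = "B-" + label        # first word of the chunk
                for i in range(start + 1, end):   # intermediate words
                    tags[i] = "I-" + label
                tags[end] = "E-" + label          # last word of the chunk
        return tags

    # Level 1 of Figure 2 for "But stocks kept falling":
    # the lowest nodes are the NP over "stocks" and the VP over "falling"
    print(chunks_to_iobes(4, [("NP", 1, 1), ("VP", 3, 3)]))
    # -> ['O', 'S-NP', 'O', 'S-VP']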


    Level 4    B-S    I-S     I-S    E-S
    Level 3    O      O       B-VP   E-VP
    Level 2    O      O       O      S-S
    Level 1    O      S-NP    O      S-VP
    Words      But    stocks  kept   falling

Figure 2: The parse tree shown in Figure 1a, rewritten as four levels of tagging tasks.

2.2 From Tagging Levels To Parse Trees

Even if it had success with partial parsing (Abney, 1997), the simplest scheme, where one would have a different tagger for each level of the parse tree, is not attractive in a full parsing setting. The maximum number of levels would have to be chosen at train time, which limits the maximum sentence length at test time. Instead, we propose to have a unique tagger for all parse tree levels:

1. Our tagger starts by predicting Level 1.
2. We then predict the next level according to a history of previous levels, with the same tagger.
3. We update the history of levels and go to 2.

This setup fits naturally into the recursive definition of the levels. However, we must ensure that the predicted tags correspond to a parse tree. In a tree, a parent node fully includes its child nodes. Without constraints during the level predictions, one could face a chunk partially spanning another chunk at a lower level, which would break this tree constraint.

We can guarantee that the tagging process corresponds to a valid tree by adding a constraint enforcing higher-level chunks to fully include lower-level chunks. This iterative process might however never end, as it can be subject to loops: for instance, the constraint is still satisfied if the tagger predicts the same tags for two consecutive levels. We propose to tackle this problem by (a) modifying the training parse trees such that nodes grow strictly as we go up in the tree and (b) enforcing the corresponding constraints in the tagging process.

Tree nodes spanning the same words for several consecutive levels are first replaced by one node in the whole training set. The label of this new node is the concatenation of the replaced node labels (see Figure 1b). At test time, the inverse operation is performed on nodes having concatenated labels. Considering all possible label combinations would be intractable³. We kept in the training set concatenated labels which were occurring at least 30 times (corresponding to the lowest number of occurrences of the least common non-concatenated tag). This added 14 extra labels to the 26 we already had. Adding the extra O tag and using the IOBES tagging scheme⁴ led us to 161 ((26 + 14) × 4 + 1) different tags produced by our tagger. We denote by T this ensemble of tags.

With this additional pre-processing, any tree node is strictly larger (in terms of the words it spans) than each of its children. We enforce the corresponding Constraint 1 during the iterative tagging process.

Constraint 1: Any chunk at level i overlapping a chunk at level j < i must span at least this overlapped chunk, and be larger.

As a result, the iterative tagging process described above will generate a chunk of size N in at most N levels, given a sentence of N words. At this point, the iterative loop is stopped, and the full tree can be deduced. The process might also be stopped if no new chunks were found (all tags were O). Assuming our simple tree pre-processing has been done, this generic algorithm could be used with any tagger which can handle a history of labels and tagging constraints. Even though the tagging process is greedy, because there is no global inference of the tree, we will see in Section 5 that it can perform surprisingly well. We propose in the next section a tagger based on a convolutional Graph Transformer Network (GTN) architecture. We will see in Section 4 how we keep track of the history and how we implement Constraint 1 for that tagger.

3 Architecture

We chose to use a variant of the versatile convolutional neural network architecture first proposed by Bengio et al. (2001) for language modeling, and reintroduced later by Collobert and Weston (2008) for various NLP tasks involving tagging. Our network outputs a graph over which inference is achieved with a Viterbi algorithm. In that respect, one can see the whole architecture (see Figure 3) as an instance of GTNs (Bottou et al., 1997; Le Cun et al., 1998). In the NLP field, this type of architecture has been used with success by Collobert et al. (2011) for “flat” tagging tasks. All network and graph parameters are trained in an end-to-end way, with stochastic gradient maximizing a graph likelihood. We first describe in this section how we adapt neural networks to text data, and then we introduce the GTN training procedure. More details on the derivations are provided in the supplementary material attached to this paper. We will show in Section 4 how one can further adapt this architecture for parsing, by introducing a tree history feature and a few graph constraints.

³ Note that more than two labels might be concatenated. E.g., the tag SBAR#S#VP is quite common in the training set.
⁴ With the IOBES tagging scheme, each label (e.g. VP) is expanded into 4 different tags (e.g. B-VP, I-VP, E-VP, S-VP), as described in Section 2.1.
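Before detailing the network, here is a minimal sketch of the recursive tagging loop of Section 2.2, including its stopping rules. The tag_level(words, history) tagger and the chunks(tags) helper (returning (label, start, end) spans) are hypothetical interfaces used for illustration only, not the paper's implementation:

    def parse(words, tag_level, chunks):
        """Greedy bottom-up parsing by recursive tagging (Section 2.2 sketch)."""
        n = len(words)
        history = []                      # predicted levels, bottom-up
        levels = []
        for _ in range(n):                # at most N levels for a sentence of N words
            tags = tag_level(words, history)
            spans = chunks(tags)
            if not spans:                 # all tags were O: no new chunk was found
                break
            levels.append(tags)
            history.append(tags)
            # stop once a single chunk spans the whole sentence (root reached)
            if any(start == 0 and end == n - 1 for _, start, end in spans):
                break
        return levels                     # stack of levels, to be folded back into a tree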


[Figure 3: Our neural network architecture. Words and other desired discrete features (caps, tree history, ...) are given as input. The lookup-tables embed each feature in a vector space, for each word. This is fed into a convolutional network which outputs a score for each tag and each word. Finally, a graph is output with network scores on the nodes and additional transition scores on the edges. A Viterbi algorithm can be performed to infer the word tags.]

3.1 Word Embeddings

We consider a fixed-sized word dictionary⁵ W. Given a sentence of N words {w_1, w_2, ..., w_N}, each word w_n ∈ W is first embedded into a D-dimensional vector space, by applying a lookup-table operation:

    LT_W(w_n) = W [0, ..., 0, 1, 0, ..., 0]^T = W_{w_n},    (1)

where the single 1 sits at index w_n, and the matrix W ∈ R^{D×|W|} represents the parameters to be trained in this lookup layer. Each column W_n ∈ R^D corresponds to the embedding of the n-th word in our dictionary W.

Having in mind the matrix-vector notation in (1), the lookup-table applied over the sentence can be seen as an efficient implementation of a convolution with a kernel width of size 1. The parameters W are thus initialized randomly and trained as any other neural network layer. However, we show in the experiments that one can obtain a significant performance boost by initializing⁶ these embeddings with the word representations found by Collobert and Weston (2008). These representations were trained on a large unlabeled corpus (Wikipedia), using a language modeling task. They contain useful syntactic and semantic information, which appears to be useful for parsing. This corroborates improvements obtained in the same way by Collobert & Weston on various NLP tagging tasks.

In practice, it is common that one wants to represent a word with more than one feature. In our experiments we always took at least the lowercased words and a “caps” feature: w_n = (w_n^lowcaps, w_n^caps). In this case, we apply a different lookup-table for each discrete feature (LT_{W^lowcaps} and LT_{W^caps}), and the word embedding becomes the concatenation of the outputs of all these lookup-tables:

    LT_{W^words}(w_n) = (LT_{W^lowcaps}(w_n^lowcaps)^T, LT_{W^caps}(w_n^caps)^T)^T.    (2)

For simplicity, we consider only one lookup-table in the rest of the architecture description.

⁵ Unknown words are mapped to a special unknown word. Also, we map numbers to a number word.
⁶ Only the initialization differs. The parameters are trained in any case.
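As an illustration of the lookup-table operation (1) and the feature concatenation (2), here is a minimal NumPy sketch; the dimension values and all variable names are illustrative, not taken from the paper:

    import numpy as np

    rng = np.random.default_rng(0)
    D_WORD, D_CAPS = 50, 5            # embedding sizes for the two features
    V_WORD, V_CAPS = 100_000, 4       # dictionary sizes (words, caps feature values)

    W_word = rng.normal(scale=0.01, size=(D_WORD, V_WORD))   # one column per word, eq. (1)
    W_caps = rng.normal(scale=0.01, size=(D_CAPS, V_CAPS))

    def lt(W, index):
        """LT_W(w): multiplying W by a one-hot vector selects column `index`."""
        return W[:, index]

    def embed(word_idx, caps_idx):
        """Word embedding as the concatenation of all feature lookup-tables, eq. (2)."""
        return np.concatenate([lt(W_word, word_idx), lt(W_caps, caps_idx)])

    # embedding of one (hypothetical) word: 50 + 5 = 55 dimensions
    print(embed(1234, 2).shape)        # (55,)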


3.2 Word Scoring

Scores for all tags in T and all words in the sentence are produced by applying a classical convolutional neural network over the lookup-table embeddings (1). More precisely, we consider all successive windows of text (of size K), sliding over the sentence, from position 1 to N. At position n, the network is fed with the vector x_n resulting from the concatenation of the embeddings:

    x_n = (W_{w_{n−(K−1)/2}}^T, ..., W_{w_{n+(K−1)/2}}^T)^T.

The words with indices exceeding the sentence boundaries (n − (K−1)/2 < 1 or n + (K−1)/2 > N) are mapped to a special padding word. As any classical neural network, our architecture performs several matrix-vector operations on its inputs, interleaved with some non-linear transfer function h(·). It outputs a vector of size |T| for each word at position n, interpreted as a score for each tag in T and each word w_n in the sentence:

    s(x_n) = M^2 h(M^1 x_n),    (3)

where the matrices M^1 ∈ R^{H×(KD)} and M^2 ∈ R^{|T|×H} are the trained parameters of the network. The number of hidden units H is a hyper-parameter to be tuned. As transfer function, we chose in our experiments a (fast) “hard” version of the hyperbolic tangent:

    h(x) = −1 if x < −1,   x if −1 ≤ x ≤ 1,   1 if x > 1.    (4)

3.3 Long-Range Dependencies

The “window” approach proposed above assumes that the tag of a word is solely determined by the surrounding words in the window. As we will see in our experiments, this approach falls short on long sentences. Inspired by Collobert and Weston (2008), we consider a variant of this architecture, where all words {w_1, w_2, ..., w_N} are considered for tagging a given word w_n. To indicate to the network that we want to tag the word w_n, we introduce an additional lookup-table in (2), which embeds the relative distance (m − n) of each word w_m in the sentence with respect to w_n. At each position 1 ≤ m ≤ N, the outputs of all the lookup-tables (2) (lowercased word, caps, relative distance...) LT_{W^words}(w_m) are first combined together by applying a mapping M^0. We then extract a fixed-size “global” feature vector⁷ x_n by performing a max over the sentence:

    [x_n]_i = max_{1≤m≤N} [M^0 LT_{W^words}(w_m)]_i    ∀i.    (5)

This feature vector is then fed to the scoring layers (3). The matrix M^0 is trained by back-propagation, as any other network parameter. We will refer to this approach as the “sentence approach” in the following.

3.4 Structured Tag Inference

We know that there are strong dependencies between parsing tags in a sentence: not only are tags organized in chunks, but some tags cannot follow other tags. It is thus natural to infer tags from the scores in (3) using a structured output approach. We introduce a transition score A_{tu} for jumping from tag t ∈ T to u ∈ T in successive words, and an initial score A_{t0} for starting from the t-th tag. The last layer of our network outputs a graph with |T| × N nodes G_{tn} (see Figure 3). Each node G_{tn} is assigned a score s(x_n)_t from the previous layer (3) of our architecture. Given a pair of nodes G_{tn} and G_{u(n+1)}, we add an edge with transition score A_{tu} to the graph. For compactness, we use the sequence notation [t]_1^N ≜ {t_1, ..., t_N} from now on. We score a tag path [t]_1^N in the graph G as the sum of scores along [t]_1^N in G:

    S([w]_1^N, [t]_1^N, θ) = Σ_{n=1}^N (A_{t_{n−1} t_n} + s(x_n)_{t_n}),    (6)

where θ represents all the trainable parameters of our complete architecture (W, M^1, M^2 and A). The sentence tags [t*]_1^N are then inferred by finding the path which leads to the maximal score:

    [t*]_1^N = argmax_{[t]_1^N ∈ T^N} S([w]_1^N, [t]_1^N, θ).    (7)

The Viterbi (1967) algorithm is the natural choice for this inference. We now show how to train all the parameters θ of the network in an end-to-end way.

⁷ Here, the concatenation of lookup-table outputs LT_{W^words} includes relative position embeddings with respect to word n. Because of this notation shortcut, the right-hand side of (5) depends on n implicitly.
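The window construction, the scoring layers (3)–(4) and the path score (6) can be sketched as follows. K = 5, D = 50, H = 300 and |T| = 161 follow the text; everything else (names, random initialization, example indices) is illustrative and not taken from the paper:

    import numpy as np

    K, D, H, T = 5, 50, 300, 161
    rng = np.random.default_rng(0)
    E = rng.normal(scale=0.01, size=(D, 100_000))   # word embedding columns, as in eq. (1)
    M1 = rng.normal(scale=0.01, size=(H, K * D))
    M2 = rng.normal(scale=0.01, size=(T, H))
    A = np.zeros((T, T))                            # transition scores A[t, u]
    A0 = np.zeros(T)                                # initial scores A_{t0}
    PAD = 0                                         # index of the special padding word

    def hardtanh(x):
        """Hard hyperbolic tangent, eq. (4)."""
        return np.clip(x, -1.0, 1.0)

    def score_word(word_indices, n):
        """s(x_n): concatenate the K window embeddings around word n, then apply eq. (3)."""
        half = (K - 1) // 2
        idx = [word_indices[m] if 0 <= m < len(word_indices) else PAD
               for m in range(n - half, n + half + 1)]
        x_n = E[:, idx].T.reshape(-1)               # shape (K * D,)
        return M2 @ hardtanh(M1 @ x_n)              # one score per tag, shape (|T|,)

    def path_score(scores, tags):
        """S([w], [t], theta): network plus transition scores along a tag path, eq. (6)."""
        total = A0[tags[0]] + scores[0][tags[0]]
        for n in range(1, len(tags)):
            total += A[tags[n - 1], tags[n]] + scores[n][tags[n]]
        return total

    sentence = [17, 4, 532, 9]                      # hypothetical word indices
    scores = [score_word(sentence, n) for n in range(len(sentence))]
    print(path_score(scores, [3, 1, 1, 2]))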


3.5 Training Likelihood

Following the GTN training method introduced in (Bottou et al., 1997; Le Cun et al., 1998), we consider a probabilistic framework, where we maximize a likelihood over all the sentences [w]_1^N in our training set, with respect to θ. The score (6) can be interpreted as a conditional probability over a path by taking its exponential (making it positive) and normalizing with respect to all possible paths (summing to 1 over all paths). Taking the log(·) leads to the following conditional log-probability:

    log p([t]_1^N | [w]_1^N, θ) = S([w]_1^N, [t]_1^N, θ) − logadd_{[u]_1^N ∈ T^N} S([w]_1^N, [u]_1^N, θ),    (8)

where we adopt the notation logadd_i z_i = log(Σ_i e^{z_i}). This likelihood is the same as the one found in Conditional Random Fields (CRFs) (Lafferty et al., 2001) over temporal sequences. The CRF model is however linear (which would correspond in our case to a linear neural network, with fixed word embeddings).

Computing the log-likelihood (8) efficiently is not straightforward, as the number of terms in the logadd grows exponentially with the length of the sentence. Fortunately, in the same spirit as the Viterbi algorithm, one can compute it in linear time with the following classical recursion over n:

    δ_n(v) = logadd_{{[u]_1^n : u_n = v}} S([w]_1^n, [u]_1^n, θ)
           = s(x_n)_v + logadd_t (δ_{n−1}(t) + A_{tv})    ∀v ∈ T,    (9)

followed by the termination

    logadd_{[u]_1^N} S([w]_1^N, [u]_1^N, θ) = logadd_u δ_N(u).

As a comparison, the Viterbi algorithm used to perform the inference (7) is achieved with the same recursion, but where the logadd is replaced by a max, followed by tracking back the optimal path through each max.

3.6 Stochastic Gradient

We maximize the log-likelihood (8) using stochastic gradient ascent, which has the main advantage of being extremely scalable (Bottou, 1991). Random training sentences [w]_1^N and their associated tag labeling [t]_1^N are iteratively selected. The following gradient step is then performed:

    θ ← θ + λ ∂ log p([t]_1^N | [w]_1^N, θ) / ∂θ,    (10)

where λ is a chosen learning rate. The gradient in (10) is efficiently computed via classical backpropagation (Rumelhart et al., 1986): the differentiation chain rule is applied to the recursion (9), and then to all network layers (3), including the word embedding layers (1). The derivations are simple (but fastidious) algebra which can be found in the supplementary material of this paper.

4 Chunk History and Tree Constraints

The neural network architecture we presented in Section 3 is made “recursive” by adding an additional feature (and its corresponding lookup-table (1)) describing a history of previous tree levels. For that purpose, we gather all chunks which were discovered in previous tree levels. If several chunks were overlapping at different levels, we consider only the largest one. Assuming Constraint 1 holds, a word can be in at most one of the remaining chunks. This is our history⁸ C. The corresponding IOBES tags of each word are fed as a feature to the GTN. For instance, assuming the labeling in Figure 2 was found up to Level 3, the chunks we would consider in C for tagging Level 4 would be only the NP around “stocks” and the VP around “kept falling”. We would discard the S and VP around “falling” as they are included in the larger VP chunk.

We now implement Constraint 1 by constraining the inference graph introduced in Section 3.4 using the chunk history C. For each chunk c ∈ C, we adapt the graph output by our network in Figure 3 such that any new candidate chunk c̃ overlapping c includes c, and is strictly larger than c. Because the chunk history C includes the largest chunks up to the last predicted tree level, the new candidate chunk c̃ will be strictly larger than any chunk predicted in previous tree levels. Constraint 1 is then always satisfied.

Constraining the inference graph can be achieved by noticing that the condition “c̃ strictly includes c” is equivalent to saying that the new chunk c̃ satisfies one of the following conditions:
• It starts at the same position as c but ends after it;
• It starts before c, and ends at the same position;
• It starts before and ends after c.

Using the IOBES tagging scheme, we implement (see Figure 4) these three conditions by allowing only three corresponding possible paths c̃ in the inference graph, for each candidate label (e.g. VP):
• The first tag of c̃ is B-VP, and the remaining tags overlapping with c are maintained at I-VP;
• The last tag of c̃ is E-VP, and the previous tags overlapping with c are maintained at I-VP;
• The path c̃ is maintained on I-VP while overlapping c.

In addition to these 3 × |L| possible paths overlapping c, there is an additional path where no chunk is found over c, in which case all tags stay at O while overlapping c. Finally, as c̃ must be strictly larger than c, any S- tag is discarded for the duration of c. Parts of the graph not overlapping with the chunk history C remain fully connected, as previously described in Section 3.

5 Experiments

We conducted our experiments on the standard English Penn Treebank benchmark (Marcus et al., 1993). Sections 02–21 were used for training, section 22 for validation, and section 23 for testing. Standard pre-processing as described in Section 2 was performed. In addition, the training set trees were transformed such that two nodes spanning the same words were concatenated, as described in Section 2.2. We report results on the test set in terms of recall (R), precision (P) and F1 score. Scores were obtained using the Evalb implementation⁹.

⁸ Some other kind of history could have been chosen (e.g. a feature for each of an arbitrarily chosen L ∈ N previous levels). However, we still need to “compute” the proposed history for implementing Constraint 1.
⁹ Available at http://cs.nyu.edu/cs/projects/proteus/evalb.
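The forward recursion (9) with its termination, and the Viterbi decoding of (7) described in Section 3 (same recursion with logadd replaced by max, plus backtracking), can be sketched in NumPy as follows. Here scores is the list of per-word tag score vectors from (3), A the transition matrix and A0 the initial scores; names are illustrative and this is not the paper's code:

    import numpy as np

    def logadd(x, axis):
        """Stable log(sum(exp(x))) along an axis: the logadd of eqs. (8)-(9)."""
        m = np.max(x, axis=axis, keepdims=True)
        return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(x - m), axis=axis))

    def log_partition(scores, A, A0):
        """Forward recursion (9) followed by its termination: logadd over all tag paths."""
        delta = A0 + scores[0]                          # delta_1(v) for every tag v
        for s_n in scores[1:]:
            # delta_n(v) = s(x_n)_v + logadd_t(delta_{n-1}(t) + A[t, v])
            delta = s_n + logadd(delta[:, None] + A, axis=0)
        return logadd(delta, axis=0)

    def viterbi(scores, A, A0):
        """Same recursion with max instead of logadd, then backtrack the best path, eq. (7)."""
        delta = A0 + scores[0]
        back = []
        for s_n in scores[1:]:
            trans = delta[:, None] + A                  # trans[t, v] = delta(t) + A[t, v]
            back.append(np.argmax(trans, axis=0))       # best previous tag for each v
            delta = s_n + np.max(trans, axis=0)
        path = [int(np.argmax(delta))]
        for b in reversed(back):
            path.append(int(b[path[-1]]))
        return list(reversed(path))

    # log-probability of a tag path, eq. (8):
    # path_score(scores, tags) - log_partition(scores, A, A0)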


[Figure 4: Implementing tree constraints: the chunk history (bottom) contains the NP “the black cat”. The top inference graph is unconstrained (as it would be if no chunk were found in the history) and is given only for comparison. The bottom graph is constrained such that new overlapping chunks strictly include the existing chunk “the black cat”.]

Our architecture (see Section 3) was trained on all possible parse tree levels (see Section 2.1), for all sentences available in the training set. Random levels in random sentences were presented to the network until convergence on the validation set. We fed our network with (1) lowercased words (to limit the number of words), (2) a capital-letter feature (is lowercase, is all caps, had the first letter capitalized, or had one capital letter) to keep the upper-case information, (3) the relative distance to the word of interest (only for the “sentence approach”), (4) a POS feature¹⁰ (unless otherwise mentioned), and (5) the history of previous levels (see Section 4). During training, the true history was given. During testing, the history and the tags were obtained recursively from the network outputs, starting from Level 1 (see Section 2.2). All features had a corresponding lookup-table (1) in the network.

Only a few hyper-parameters were tried in our models (chosen according to the validation set). Lookup-table sizes for the lowercased words, caps, POS, relative distance (in the “sentence approach”) and history features were respectively 50, 5, 5, 5 and 10. The window size of our convolutional network was K = 5. The word dictionary size was 100,000. We used the word embeddings obtained from the language model (LM) (Collobert and Weston, 2008) to initialize the word lookup-table. Finally, we fixed the learning rate to λ = 0.01 during the stochastic gradient procedure (10). The only neural network “tricks” we used were (1) the initialization of the parameters was done according to the fan-in, and (2) the learning rate was divided by the fan-in (Plaut and Hinton, 1987).

Table 1: Comparison of parsers trained and tested on the Penn Treebank, on sentences ≤ 15 words, against our GTN parser (window approach).

    Model                        R      P      F1
    Collins (1999)               88.2   89.2   88.7
    Taskar et al. (2004)         89.1   89.1   89.1
    Turian and Melamed (2006)    89.3   89.6   89.4
    GTN Parser                   82.4   82.8   82.6
    GTN Parser (LM)              86.1   87.2   86.6
    GTN Parser (POS)             87.1   86.2   86.7
    GTN Parser (LM+POS)          89.2   89.0   89.1

5.1 Small Scale Experiments

The first discriminative parsers were very computationally expensive to train. Taskar et al. (2004) proposed a comparison setup for discriminative parsers limited to Penn Treebank sentences with ≤ 15 words. Turian and Melamed (2006) report almost 5 days of training for their own parser, using parallelization, on this setup. They also report several months of training for Taskar et al.’s parser. In comparison, our parser takes only a few hours to train (on a single CPU) on this setup. We report in Table 1 the test performance of our window approach system (“GTN Parser”, with H = 300 hidden units) against Taskar’s and Turian’s discriminative parsers. We also report the performance of the Collins (1999) parser, a reference among non-discriminative parsers. Not initializing the word lookup-table with the language model (LM) and not using POS features performed poorly, similarly to experiments reported by Collobert and Weston (2008). It is known that POS is a fundamental feature for all existing parsers. The LM is crucial for the performance of the architecture, as most of the capacity of the network lies in the word lookup-table (100,000 words × dimension 50). Without the LM, rare words cannot be properly trained.¹¹

¹⁰ Obtained with our own tagger, available at http://ml.nec-labs.com/senna.
¹¹ About 15% of the most common words appear 90% of the time, so many words are rare.


Initializing with the LM but not using POS, or using POS but not the LM, gave similar improvements in performance. Combining LM and POS compares well with the other parsers.

5.2 Large Scale Experiments

We also trained (Table 2) our GTN parsers (both the “window” and the “sentence” approach) on the full Penn Treebank dataset. Both take a few days to train on a single CPU in this setup. The number of hidden units was set to H = 700. The size of the embedding space obtained with M^0 in the “sentence approach” was 300. Our “window approach” parser compares well against the first lexicalized PCFG parsers: Magerman (1995) and Collins (1996). The “sentence approach” (leveraging long-range dependencies) provides a clear boost and compares well against the Collins (1999) parser¹², a standard benchmark in NLP. More refined parsers like Charniak & Johnson (2005) (which takes advantage of re-ranking) or recent discriminative parsers (which are based on PCFG features) have higher F1 scores. Our parser performs comparatively well, considering we only used simple text features. Finally, we report some timing results on the Penn Treebank test set (many implementations are not available). The GTN parser was an order of magnitude faster than the other available parsers¹³.

Table 2: Comparison of parsers trained on the full Penn Treebank, and tested on sentences with ≤ 40 and ≤ 100 words. We also report testing time on the test set (section 23).

                                        ≤ 40 words             ≤ 100 words          Test time
    Model                            R      P      F1       R      P      F1        (sec.)
    Magerman (1995)                  84.6   84.9   84.8
    Collins (1996)                   85.8   86.3   86.1     85.3   85.7   85.5
    Collins (1999)                   88.5   88.7   88.6     88.1   88.3   88.2       2640
    Charniak (2000)                  90.1   90.1   90.1     89.6   89.5   89.6       1020
    Charniak and Johnson (2005)                    92.0                   91.4
    Finkel et al. (2008)             88.8   89.2   89.0     87.8   88.2   88.0
    Petrov and Klein (2008)                        90.0                   89.4
    Carreras et al. (2008)           89.9   91.1   90.5
    GTN Parser (window)              81.3   81.9   81.6     80.3   81.0   80.6
    GTN Parser (window, LM)          84.2   85.7   84.9     83.5   85.1   84.3
    GTN Parser (window, LM+POS)      85.6   86.8   86.2     84.8   86.2   85.5
    GTN Parser (sentence, LM+POS)    88.1   88.8   88.5     87.5   88.3   87.9         76

6 Conclusion

We proposed a new fast and scalable purely discriminative parsing algorithm based on Graph Transformer Networks. With only a few basic text features (thanks to word representations from Collobert and Weston (2008)), it performs similarly to existing pure discriminative algorithms, and similarly to the Collins (1999) “benchmark” parser. Many paths remain to be explored: richer features (in particular head words, as used by lexicalized PCFGs), combination with generative parsers, less greedy bottom-up inference (e.g. using K-best decoding), or other alternatives to describe trees.

Acknowledgments

The author would like to thank NEC Laboratories America for its support, as well as Léon Bottou and Vincent Etter for their comments.

¹² We picked Bikel’s implementation, available at http://www.cis.upenn.edu/~dbikel.
¹³ Available at http://ml.nec-labs.com/senna.


References

S. Abney. Partial parsing via finite-state cascades. Natural Language Engineering, 23(4):337–344, 1997.

Y. Bengio, R. Ducharme, and P. Vincent. A neural probabilistic language model. In NIPS 13, 2001.

L. Bottou. Stochastic gradient learning in neural networks. In Proceedings of Neuro-Nîmes 91, Nîmes, France, 1991. EC2.

L. Bottou, Y. LeCun, and Y. Bengio. Global training of document processing systems using graph transformer networks. In Proceedings of CVPR, pages 489–493, 1997.

X. Carreras, M. Collins, and T. Koo. TAG, dynamic programming, and the perceptron for efficient, feature-rich parsing. In CoNLL ’08, pages 9–16. ACL, 2008.

E. Charniak. A maximum-entropy-inspired parser. In Proceedings of the First Conference of the North American Chapter of the Association for Computational Linguistics, pages 132–139, 2000.

E. Charniak and M. Johnson. Coarse-to-fine n-best parsing and MaxEnt discriminative reranking. In Proceedings of the 43rd Annual Meeting of the ACL, pages 173–180, 2005.

T. Cohn and P. Blunsom. Semantic role labelling with tree conditional random fields. In Ninth Conference on Computational Natural Language Learning (CoNLL), 2005.

M. Collins. A new statistical parser based on bigram lexical dependencies. In Proceedings of the 34th Annual Meeting of the ACL, pages 184–191, 1996.

M. Collins. Head-Driven Statistical Models for Natural Language Parsing. PhD thesis, University of Pennsylvania, 1999.

R. Collobert and J. Weston. A unified architecture for natural language processing: Deep neural networks with multitask learning. In ICML, 2008.

R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa. Natural language processing (almost) from scratch. JMLR, to appear, 2011.

J. R. Finkel, A. Kleeman, and C. D. Manning. Efficient, feature-based, conditional random field parsing. In Proceedings of ACL-08: HLT, pages 959–967. ACL, June 2008.

J. Henderson. Discriminative training of a neural network statistical parser. In Proceedings of the 42nd Annual Meeting of the ACL, 2004.

J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Eighteenth International Conference on Machine Learning (ICML), 2001.

K. J. Lang and G. E. Hinton. The development of the time-delay neural network architecture for speech recognition. Technical Report CMU-CS-88-152, Carnegie Mellon University, 1988.

Y. Le Cun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

Y. LeCun. A learning scheme for asymmetric threshold networks. In Proceedings of Cognitiva 85, pages 599–604, Paris, France, 1985.

D. M. Magerman. Statistical decision-tree models for parsing. In Proceedings of the 33rd Annual Meeting of the ACL, pages 276–283, 1995.

M. P. Marcus, M. A. Marcinkiewicz, and B. Santorini. Building a large annotated corpus of English: the Penn Treebank. Computational Linguistics, 19(2):313–330, 1993.

A. McCallum and W. Li. Early results for named entity recognition with conditional random fields, feature induction and web-enhanced lexicons. In Proceedings of HLT-NAACL, pages 188–191, 2003.

S. Petrov and D. Klein. Sparse multi-scale grammars for discriminative latent variable parsing. In EMNLP ’08, pages 867–876. ACL, 2008.

D. C. Plaut and G. E. Hinton. Learning sets of filters using back-propagation. Computer Speech and Language, 2:35–61, 1987.

A. Ratnaparkhi. Learning to parse natural language with maximum entropy models. Machine Learning, 34(1-3):151–175, 1999.

D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning internal representations by back-propagating errors. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition, volume 1, pages 318–362. MIT Press, 1986.

F. Sha and F. Pereira. Shallow parsing with conditional random fields. In NAACL 2003, pages 134–141, 2003.

B. Taskar, D. Klein, M. Collins, D. Koller, and C. Manning. Max-margin parsing. In Proceedings of EMNLP, 2004.

J. Turian and I. D. Melamed. Advances in discriminative parsing. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the ACL, pages 873–880, 2006.

A. J. Viterbi. Error bounds for convolutional codes and an asymptotically optimal decoding algorithm. IEEE Transactions on Information Theory, 13:260–269, 1967.
