Harnessing Deep Neural Networks With Logic Rules
Zhiting Hu, Xuezhe Ma, Zhengzhong Liu, Eduard Hovy, Eric P. Xing
School of Computer Science
Carnegie Mellon University
{zhitingh,xuezhem,liu,epxing}@cs.cmu.edu, [email protected]
…ning over multiple examples), requiring joint inference. In contrast, as mentioned above, p is more lightweight and efficient, and useful when rule evaluation is expensive or impossible at prediction time. Our experiments compare the performance of p and q extensively.

Imitation Strength π  The imitation parameter π in Eq.(2) balances between emulating the teacher's soft predictions and predicting the true hard labels. Since the teacher network is constructed from pθ, which produces low-quality predictions at the beginning of training, we favor predicting the true labels more in the initial stage. As training goes on, we gradually bias towards emulating the teacher predictions to effectively distill the structured knowledge. Specifically, we define π(t) = min{π0, 1 − α^t} at iteration t ≥ 0, where α ≤ 1 specifies the speed of decay and π0 < 1 caps the imitation strength so that the true hard labels always retain some weight.
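As a concrete illustration of this schedule and of how it enters the training objective, the sketch below assumes Python/NumPy, a cross-entropy instantiation of the loss ℓ in Eq.(2), and the weighting (1 − π) on the true-label term and π on the teacher-imitation term; α = 0.9 and π0 = 0.9 mirror the settings reported in Section 5. It is a minimal sketch, not the authors' implementation.

```python
import numpy as np

def imitation_strength(t, alpha=0.9, pi_0=0.9):
    """pi(t) = min{pi_0, 1 - alpha**t}: 0 at t = 0 (trust the true labels),
    then growing toward the cap pi_0 as training proceeds."""
    return min(pi_0, 1.0 - alpha ** t)

def cross_entropy(target, pred, eps=1e-12):
    """Mean cross-entropy between target distributions (or one-hot labels)
    and predicted distributions."""
    return float(-np.mean(np.sum(target * np.log(pred + eps), axis=-1)))

def distillation_objective(y_true, student_pred, teacher_soft, t):
    """(1 - pi) * loss on the hard labels + pi * loss on the teacher's soft
    predictions, with pi following the schedule above (Eq.(2)-style mixing)."""
    pi = imitation_strength(t)
    return ((1.0 - pi) * cross_entropy(y_true, student_pred)
            + pi * cross_entropy(teacher_soft, student_pred))
```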
4 Applications

We have presented our framework, which is general enough to improve various types of neural networks with rules, and easy to use in that users are allowed to impose their knowledge and intentions through declarative first-order logic. In this section we illustrate the versatility of our approach by applying it to two workhorse network architectures, i.e., the convolutional network and the recurrent network, on two representative applications: sentence-level sentiment analysis, which is a classification problem, and named entity recognition, which is a sequence learning problem.

For each task, we first briefly describe the base neural network. Since we are not focusing on tuning network architectures, we largely use the same or similar networks as previous successful neural models. We then design the linguistically motivated rules to be integrated.

4.1 Sentiment Classification

Sentence-level sentiment analysis aims to identify the sentiment (e.g., positive or negative) underlying an individual sentence. The task is crucial for many opinion mining applications. One challenging aspect of the task is to capture the contrastive sense (e.g., signaled by the conjunction “but”) within a sentence.

Base Network  We use the single-channel convolutional network proposed in (Kim, 2014). This simple model has achieved compelling performance on various sentiment classification benchmarks. The network contains a convolutional layer on top of the word vectors of a given sentence, followed by a max-over-time pooling layer and then a fully-connected layer with softmax output activation. A convolution operation applies a filter to word windows, and multiple filters with varying window sizes are used to obtain multiple features. Figure 2 shows the network architecture.

Figure 2: The CNN architecture for sentence-level sentiment analysis. The sentence representation vector is followed by a fully-connected layer with softmax output activation, to output sentiment predictions.
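For concreteness, a minimal sketch of such a single-channel CNN classifier is shown below. The paper's implementation uses Theano; PyTorch is used here purely for illustration, and the filter sizes (3, 4, 5), 100 feature maps per size, and the dropout rate are illustrative defaults rather than settings taken from this paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class KimStyleCNN(nn.Module):
    """Single-channel CNN: convolution over word vectors, max-over-time
    pooling, then a fully-connected layer with softmax output."""
    def __init__(self, vocab_size, emb_dim=300, num_classes=2,
                 filter_sizes=(3, 4, 5), num_filters=100, dropout=0.5):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.convs = nn.ModuleList(
            nn.Conv1d(emb_dim, num_filters, k) for k in filter_sizes)
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(num_filters * len(filter_sizes), num_classes)

    def forward(self, token_ids):              # (batch, seq_len)
        x = self.embedding(token_ids)          # (batch, seq_len, emb_dim)
        x = x.transpose(1, 2)                  # (batch, emb_dim, seq_len)
        # One feature map per filter, max-over-time pooled to a scalar each.
        feats = [F.relu(conv(x)).max(dim=2).values for conv in self.convs]
        sent_vec = self.dropout(torch.cat(feats, dim=1))
        # Sentiment distribution sigma_theta over the classes.
        return F.softmax(self.fc(sent_vec), dim=-1)
```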
Logic Rules  One difficulty for the plain neural network is to identify the contrastive sense in order to capture the dominant sentiment precisely. The conjunction word “but” is one of the strong indicators of such sentiment changes in a sentence, where the sentiment of the clause following “but” generally dominates. We thus consider sentences S with an “A-but-B” structure, and expect the sentiment of the whole sentence to be consistent with the sentiment of clause B. The logic rule is written as:

has-‘A-but-B’-structure(S) ⇒
    (1(y = +) ⇒ σθ(B)+ ∧ σθ(B)+ ⇒ 1(y = +)),      (5)
where 1(·) is an indicator function that takes 1 when its argument is true, and 0 otherwise; class ‘+’ represents ‘positive’; and σθ(B)+ is the element of σθ(B) for class ‘+’. By Eq.(1), when S has the ‘A-but-B’ structure, the truth value of the above logic …
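To illustrate how the soft truth value of the rule body in Eq.(5) can be computed, the sketch below assumes the Łukasiewicz-style soft logic of probabilistic soft logic (Bach et al., 2015), with implication a ⇒ b = min{1, 1 − a + b} and the conjunction averaged over its conjuncts; the keyword-based “A-but-B” detector is a naive stand-in, not the paper's detector.

```python
def soft_implies(a, b):
    """Lukasiewicz implication on truth values in [0, 1]."""
    return min(1.0, 1.0 - a + b)

def but_rule_truth(y_is_positive, sigma_B_pos):
    """Soft truth of (1(y=+) => sigma_theta(B)+) AND (sigma_theta(B)+ => 1(y=+)),
    taking the conjunction as the average of its two conjuncts."""
    y = 1.0 if y_is_positive else 0.0
    return 0.5 * (soft_implies(y, sigma_B_pos) + soft_implies(sigma_B_pos, y))

def has_A_but_B(sentence):
    """Naive detector for the 'A-but-B' structure (illustrative only)."""
    return " but " in f" {sentence.lower()} "

# e.g., with sigma_theta(B)+ = 0.8:
#   truth for y=+ is (1 + 0.8) / 2 = 0.9, for y=- it is (2 - 0.8) / 2 = 0.6.
```

Under these (assumed) semantics, a confidently positive clause B pushes the rule's truth value toward 1 for y = + and penalizes y = −, which is the kind of preference the teacher network then imposes on the student.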
Table 1: Accuracy (%) of Sentiment Classification. Row 1, CNN (Kim, 2014) is the base network
corresponding to the “CNN-non-static” model in (Kim, 2014). Rows 2-3 are the networks enhanced by
our framework: CNN-Rule-p is the student network and CNN-Rule-q is the teacher network. For MR and
CR, we report the average accuracy±one standard deviation using 10-fold cross validation.
…the base networks, we obtain substantial improvements on both tasks and achieve state-of-the-art or comparable results to previous best-performing systems. Comparison with a diverse set of other rule integration methods demonstrates the unique effectiveness of our framework. Our approach also shows promising potential in the semi-supervised learning and sparse-data settings.

Throughout the experiments we set the regularization parameter to C = 400. In sentiment classification we set the imitation parameter to π(t) = 1 − 0.9^t, while in NER we use π(t) = min{0.9, 1 − 0.9^t} to downplay the noisy listing rule. The confidence levels of rules are set to λl = 1, except for hard constraints whose confidence is ∞. For the neural network configurations, we largely followed the reference work, as specified in the following respective sections. All experiments were performed on a Linux machine with eight 4.0GHz CPU cores, one Tesla K40c GPU, and 32GB RAM. We implemented the neural networks using Theano², a popular deep learning platform.

² http://deeplearning.net/software/theano
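For reference, these settings can be collected into a small configuration sketch; the field names below are arbitrary, and the values simply restate the numbers above.

```python
EXPERIMENT_CONFIG = {
    "C": 400,                                     # rule regularization strength in Eq.(3)
    "pi_schedule": {                              # imitation strength per task
        "sentiment": lambda t: 1 - 0.9 ** t,
        "ner": lambda t: min(0.9, 1 - 0.9 ** t),  # cap 0.9 downplays the noisy list rule
    },
    "rule_confidence": 1.0,                       # lambda_l for soft rules
    "hard_rule_confidence": float("inf"),         # hard constraints
}
```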
5.1 Sentiment Classification

5.1.1 Setup

We test our method on a number of commonly used benchmarks, including: 1) SST2, the Stanford Sentiment Treebank (Socher et al., 2013), which contains 2 classes (negative and positive) and 6920/872/1821 sentences in the train/dev/test sets, respectively. Following (Kim, 2014) we train models on both sentences and phrases since all labels are provided. 2) MR (Pang and Lee, 2005), a set of 10,662 one-sentence movie reviews with negative or positive sentiment. 3) CR (Hu and Liu, 2004), customer reviews of various products, containing 2 classes and 3,775 instances. For MR and CR, we use 10-fold cross validation as in previous work. In each of the three datasets, around 15% of the sentences contain the word “but”.

For the base neural network we use the “non-static” version in (Kim, 2014) with the exact same configurations. Specifically, word vectors are initialized using word2vec (Mikolov et al., 2013) and fine-tuned throughout training, and the neural parameters are trained using SGD with the Adadelta update rule (Zeiler, 2012).

5.1.2 Results

Table 1 shows the sentiment classification performance. Rows 1-3 compare the base neural model with the models enhanced by our framework with the “but”-rule (Eq.(5)). We see that our method provides a strong boost in accuracy over all three datasets. The teacher network q further improves over the student network p, though the student network is more widely applicable in certain contexts, as discussed in sections 3.2 and 3.4. Rows 4-10 show the accuracy of recent top-performing methods. On the MR and CR datasets, our model outperforms all the baselines. On SST2, MVCNN (Yin and Schutze, 2015) (Row 5) is the only system that shows a slightly better result than ours. Their neural network combines diverse sets of pre-trained word embeddings (while we use only word2vec) and contains more neural layers and parameters than our model.

To further investigate the effectiveness of our framework in integrating structured rule knowledge, we compare with an extensive array of other
possible integration approaches. Table 2 lists these methods and their performance on the SST2 task. We see that: 1) Although all methods lead to different degrees of improvement, our framework outperforms all other competitors by a large margin. 2) In particular, compared to the pipelined method in Row 6, which is analogous to the structure compilation work (Liang et al., 2008), our iterative distillation (section 3.2) provides better performance. Another advantage of our method is that we only train one set of neural parameters, as opposed to two separate sets as in the pipelined approach. 3) The distilled student network “-Rule-p” achieves substantially higher accuracy than the base CNN, as well as “-project” and “-opt-project”, which explicitly project the CNN to the rule-constrained subspace. This validates that our distillation procedure transfers the structured knowledge into the neural parameters effectively. The inferior accuracy of “-opt-project” can be partially attributed to the poor performance of its neural network part, which achieves only 85.1% accuracy and leads to inaccurate evaluation of the “but”-rule in Eq.(5).

  Model                   Accuracy (%)
1 CNN (Kim, 2014)         87.2
2 -but-clause             87.3
3 -ℓ2-reg                 87.5
4 -project                87.9
5 -opt-project            88.3
6 -pipeline               87.9
7 -Rule-p                 88.8
8 -Rule-q                 89.3

Table 2: Performance of different rule integration methods on SST2. 1) CNN is the base network; 2) “-but-clause” takes the clause after “but” as input; 3) “-ℓ2-reg” imposes a regularization term γ‖σθ(S) − σθ(Y)‖₂ to the CNN objective, with the strength γ selected on the dev set; 4) “-project” projects the trained base CNN to the rule-regularized subspace with Eq.(3); 5) “-opt-project” directly optimizes the projected CNN; 6) “-pipeline” distills the pre-trained “-opt-project” to a plain CNN; 7-8) “-Rule-p” and “-Rule-q” are our models with p being the distilled student network and q the teacher network. Note that “-but-clause” and “-ℓ2-reg” are ad-hoc methods applicable specifically to the “but”-rule.

We next explore the performance of our framework with varying numbers of labeled instances as well as the effect of exploiting unlabeled data. Intuitively, with fewer labeled examples we expect the general rules to contribute more to the performance, and unlabeled data should help the model learn better from the rules. This can be a useful property especially when data are sparse and labels are expensive to obtain. Table 3 shows the results. The subsampling is conducted at the sentence level; that is, in “5%”, for instance, we first select 5% of the training sentences uniformly at random, then train the models on these sentences as well as their phrases (a sketch of this split is given after Table 3). The results verify our expectations. 1) Rows 1-3 give the accuracy of using only the data-label subsets for training. In every setting our methods consistently outperform the base CNN. 2) “-Rule-q” provides a larger improvement on 5% data (with a margin of 2.6%) than on larger data (e.g., 2.3% on 10% data and 2.0% on 30% data), showing promising potential in the sparse-data context. 3) By adding unlabeled instances for semi-supervised learning, as in Rows 5-6, we obtain further improved accuracy. 4) Row 4, “-semi-PR”, is posterior regularization (Ganchev et al., 2010), which imposes the rule constraint through only the unlabeled data during training. Our distillation framework consistently provides substantially better results.

  Data size       5%     10%    30%    100%
1 CNN             79.9   81.6   83.6   87.2
2 -Rule-p         81.5   83.2   84.5   88.8
3 -Rule-q         82.5   83.9   85.6   89.3
4 -semi-PR        81.5   83.1   84.6   –
5 -semi-Rule-p    81.7   83.3   84.7   –
6 -semi-Rule-q    82.7   84.2   85.7   –

Table 3: Accuracy (%) on SST2 with varying sizes of labeled data and semi-supervised learning. The header row is the percentage of labeled examples used for training. Rows 1-3 use only the supervised data. Rows 4-6 use semi-supervised learning, where the remaining training data are used as unlabeled examples. For “-semi-PR” we only report its projected solution (analogous to q), which performs better than the non-projected one (analogous to p).
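The sentence-level split described above can be sketched as follows; `phrases_of` is a hypothetical accessor returning the labeled phrases of a sentence (available for SST2), and the function name and seed handling are illustrative rather than the paper's code.

```python
import random

def subsample_split(train_sentences, phrases_of, fraction=0.05, seed=0):
    """Sentence-level subsampling: keep `fraction` of the training sentences
    (together with their labeled phrases) as the supervised set; the remaining
    sentences serve as unlabeled examples in the semi-supervised runs."""
    rng = random.Random(seed)
    k = max(1, int(len(train_sentences) * fraction))
    chosen = set(rng.sample(range(len(train_sentences)), k))
    labeled, unlabeled = [], []
    for i, sent in enumerate(train_sentences):
        if i in chosen:
            labeled.append(sent)
            labeled.extend(phrases_of(sent))   # phrase-level labels kept for SST2
        else:
            unlabeled.append(sent)
    return labeled, unlabeled
```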
ral parameters effectively. The inferior accuracy 5.2 Named Entity Recognition
of “-opt-project” can be partially attributed to the 5.2.1 Setup
poor performance of its neural network part which We evaluate on the well-established CoNLL-2003
achieves only 85.1% accuracy and leads to inaccu- NER benchmark (Tjong Kim Sang and De Meul-
rate evaluation of the “but”-rule in Eq.(5). der, 2003), which contains 14,987/3,466/3,684
We next explore the performance of our frame- sentences and 204,567/51,578/46,666 tokens in
work with varying numbers of labeled instances as train/dev/test sets, respectively. The dataset in-
well as the effect of exploiting unlabeled data. In- cludes 4 categories, i.e., person, location, orga-
tuitively, with less labeled examples we expect the nization, and misc. BIOES tagging scheme is used.
[Figure: LSTM-based NER architecture with Char+Word representation; example input “NYC locates in USA”.]

We use mostly the same configurations for the base BLSTM network as in (Chiu and Nichols, 2015), except that, besides a slight architecture difference (section 4.2), we apply Adadelta for parameter updating. GloVe (Pennington et al., 2014) word vectors are used to initialize the word features.
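A minimal sketch of such a BLSTM tagger is given below, assuming PyTorch (the actual implementation follows (Chiu and Nichols, 2015) and is not reproduced here); the character-level features are passed in as a precomputed tensor, and all dimensions are illustrative.

```python
import torch
import torch.nn as nn

class BLSTMTagger(nn.Module):
    """Bidirectional LSTM over word embeddings concatenated with
    character-level features, with a per-token softmax over NER tags."""
    def __init__(self, vocab_size, num_tags, word_dim=100, char_dim=30,
                 hidden_dim=200):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, word_dim)
        self.blstm = nn.LSTM(word_dim + char_dim, hidden_dim,
                             batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden_dim, num_tags)

    def forward(self, word_ids, char_feats):
        # word_ids: (batch, seq_len); char_feats: (batch, seq_len, char_dim)
        x = torch.cat([self.word_emb(word_ids), char_feats], dim=-1)
        h, _ = self.blstm(x)
        return torch.softmax(self.out(h), dim=-1)  # per-token tag distribution
```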
5.2.2 Results

  Model                                   F1
1 BLSTM                                   89.55
2 BLSTM-Rule-trans                        p: 89.80, q: 91.11
3 BLSTM-Rules                             p: 89.93, q: 91.18
4 NN-lex (Collobert et al., 2011)         89.59
5 S-LSTM (Lample et al., 2016)            90.33
6 BLSTM-lex (Chiu and Nichols, 2015)      90.77
7 BLSTM-CRF1 (Lample et al., 2016)        90.94
8 Joint-NER-EL (Luo et al., 2015)         91.20
9 BLSTM-CRF2 (Ma and Hovy, 2016)          91.21

Table 4: Performance of NER on CoNLL-2003. Row 2, BLSTM-Rule-trans imposes the transition rules (Eq.(6)) on the base BLSTM. Row 3, BLSTM-Rules further incorporates the list rule (Eq.(7)). We report the performance of both the student model p and the teacher model q.

Table 4 presents the performance on the NER task. By incorporating the bi-gram transition rules (Row 2), the joint teacher model q achieves a 1.56-point improvement in F1 score, outperforming most previous neural-based methods (Rows 4-7), including the BLSTM-CRF model (Lample et al., 2016), which applies a conditional random field (CRF) on top of a BLSTM in order to capture the transition patterns and encourage valid sequences. In contrast, our method implements the desired constraints in a more straightforward way by using the declarative logic rule language, and at the same time does not introduce extra model parameters to learn. Further integration of the list rule (Row 3) provides a second boost in performance, achieving an F1 score very close to the best-performing systems, including Joint-NER-EL (Luo et al., 2015) (Row 8), a probabilistic graphical model that optimizes NER and entity linking jointly with massive external resources, and BLSTM-CRF (Ma and Hovy, 2016), a combination of BLSTM and CRF with more parameters than our rule-enhanced neural networks.

From the table we see that the accuracy gap between the joint teacher model q and the distilled student p is relatively larger than in the sentiment classification task (Table 1). This is because in the NER task we have used logic rules that introduce extra dependencies between adjacent tag positions as well as across multiple instances, making the explicit joint inference of q useful for fulfilling these structured constraints.

6 Discussion and Future Work

We have developed a framework which combines deep neural networks with first-order logic rules to allow integrating human knowledge and intentions into the neural models. In particular, we proposed an iterative distillation procedure that transfers the structured information of logic rules into the weights of neural networks. The transfer is done via a teacher network constructed using the posterior regularization principle. Our framework is general and applicable to various types of neural architectures. With a few intuitive rules, our framework significantly improves base networks on sentiment analysis and named entity recognition, demonstrating the practical significance of our approach.

Though we have focused on first-order logic rules, we leveraged a soft logic formulation which can be easily extended to general probabilistic models for expressing structured distributions and performing inference and reasoning (Lake et al., 2015). We plan to explore these diverse knowledge representations to guide DNN learning. The proposed iterative distillation procedure also reveals connections to recent neural autoencoders (Kingma and Welling, 2014; Rezende et al., 2014), where generative models encode probabilistic structures and neural recognition models distill the information through iterative optimization (Rezende et al., 2016; Johnson et al., 2016; Karaletsos et al., 2016).

The encouraging empirical results indicate a strong potential of our approach for improving other application domains such as vision tasks, which we plan to explore in the future. Finally, we would also like to generalize our framework to automatically learn the confidence of different rules, and to derive new rules from data.

Acknowledgments

We thank the anonymous reviewers for their valuable comments. This work is supported by NSF IIS1218282, NSF IIS1447676, Air Force FA8721-05-C-0003, and FA8750-12-2-0342.
References

Stephen H Bach, Matthias Broecheler, Bert Huang, and Lise Getoor. 2015. Hinge-loss Markov random fields and probabilistic soft logic. arXiv preprint arXiv:1505.04406.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. Proc. of ICLR.

Cristian Buciluǎ, Rich Caruana, and Alexandru Niculescu-Mizil. 2006. Model compression. In Proc. of KDD, pages 535–541. ACM.

Jason PC Chiu and Eric Nichols. 2015. Named entity recognition with bidirectional LSTM-CNNs. arXiv preprint arXiv:1511.08308.

Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. JMLR, 12:2493–2537.

James Foulds, Shachi Kumar, and Lise Getoor. 2015. Latent topic networks: A versatile probabilistic programming framework for topic models. In Proc. of ICML, pages 777–786.

Manoel VM França, Gerson Zaverucha, and Artur S d'Avila Garcez. 2014. Fast relational learning using bottom clause propositionalization with artificial neural networks. Machine Learning, 94(1):81–104.

Kuzman Ganchev, Joao Graça, Jennifer Gillenwater, and Ben Taskar. 2010. Posterior regularization for structured latent variable models. JMLR, 11:2001–2049.

Artur S d'Avila Garcez, Krysia Broda, and Dov M Gabbay. 2012. Neural-symbolic learning systems: foundations and applications. Springer Science & Business Media.

Geoffrey Hinton, Li Deng, Dong Yu, George E Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N Sainath, et al. 2012. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6):82–97.

Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.

Minqing Hu and Bing Liu. 2004. Mining and summarizing customer reviews. In Proc. of KDD, pages 168–177. ACM.

Matthew J Johnson, David Duvenaud, Alexander B Wiltschko, Sandeep R Datta, and Ryan P Adams. 2016. Structured VAEs: Composing probabilistic graphical models and variational autoencoders. arXiv preprint arXiv:1603.06277.

Theofanis Karaletsos, Serge Belongie, and Gunnar Rätsch. 2016. Bayesian representation learning with oracle constraints. In Proc. of ICLR.

Yoon Kim. 2014. Convolutional neural networks for sentence classification. Proc. of EMNLP.

Diederik P Kingma and Max Welling. 2014. Auto-encoding variational Bayes. In Proc. of ICLR.

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012. ImageNet classification with deep convolutional neural networks. In Proc. of NIPS, pages 1097–1105.

Tejas D Kulkarni, William F Whitney, Pushmeet Kohli, and Josh Tenenbaum. 2015. Deep convolutional inverse graphics network. In Proc. of NIPS, pages 2530–2538.

Brenden M Lake, Ruslan Salakhutdinov, and Joshua B Tenenbaum. 2015. Human-level concept learning through probabilistic program induction. Science, 350(6266):1332–1338.

Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016. Neural architectures for named entity recognition. In Proc. of NAACL.

Quoc V Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. Proc. of ICML.

Percy Liang, Hal Daumé III, and Dan Klein. 2008. Structure compilation: trading structure for features. In Proc. of ICML, pages 592–599. ACM.

Percy Liang, Michael I Jordan, and Dan Klein. 2009. Learning from measurements in exponential families. In Proc. of ICML, pages 641–648. ACM.

David Lopez-Paz, Léon Bottou, Bernhard Schölkopf, and Vladimir Vapnik. 2016. Unifying distillation and privileged information. Proc. of ICLR.

Gang Luo, Xiaojiang Huang, Chin-Yew Lin, and Zaiqing Nie. 2015. Joint named entity recognition and disambiguation. In Proc. of EMNLP.

Xuezhe Ma and Eduard Hovy. 2016. End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. In Proc. of ACL.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Proc. of NIPS, pages 3111–3119.

Marvin Minsky. 1980. Learning meaning. Technical Report, AI Lab Memo, Project MAC, MIT.

Anh Nguyen, Jason Yosinski, and Jeff Clune. 2015. Deep neural networks are easily fooled: High confidence predictions for unrecognizable images. In Proc. of CVPR, pages 427–436. IEEE.

Bo Pang and Lillian Lee. 2005. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proc. of ACL, pages 115–124.

Jeffrey Pennington, Richard Socher, and Christopher D Manning. 2014. GloVe: Global vectors for word representation. In Proc. of EMNLP, volume 14, pages 1532–1543.

Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. 2014. Stochastic backpropagation and approximate inference in deep generative models. Proc. of ICML.

Danilo Jimenez Rezende, Shakir Mohamed, Ivo Danihelka, Karol Gregor, and Daan Wierstra. 2016. One-shot generalization in deep generative models. arXiv preprint arXiv:1603.05106.

Matthew Richardson and Pedro Domingos. 2006. Markov logic networks. Machine Learning, 62(1-2):107–136.

David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. 2016. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489.

Richard Socher, Alex Perelygin, Jean Y Wu, Jason Chuang, Christopher D Manning, Andrew Y Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proc. of EMNLP, volume 1631, page 1642. Citeseer.

Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. 2014. Intriguing properties of neural networks. Proc. of ICLR.

Erik F Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proc. of CoNLL, pages 142–147. Association for Computational Linguistics.

Geoffrey G Towell, Jude W Shavlik, and Michiel O Noordewier. 1990. Refinement of approximate domain theories by knowledge-based neural networks. In Proceedings of the Eighth National Conference on Artificial Intelligence, pages 861–866. Boston, MA.

Sida Wang and Christopher Manning. 2013. Fast dropout training. In Proc. of ICML, pages 118–126.

Bishan Yang and Claire Cardie. 2014. Context-aware learning for sentence-level sentiment analysis with posterior regularization. In Proc. of ACL, pages 325–335.

Wenpeng Yin and Hinrich Schutze. 2015. Multichannel variable-size convolution for sentence classification. Proc. of CoNLL.

Matthew D Zeiler. 2012. ADADELTA: An adaptive learning rate method. arXiv preprint arXiv:1212.5701.

Ye Zhang, Stephen Roller, and Byron Wallace. 2016. MGNC-CNN: A simple approach to exploiting multiple word embeddings for sentence classification. Proc. of NAACL.

Jun Zhu, Ning Chen, and Eric P Xing. 2014. Bayesian inference with posterior regularization and applications to infinite latent SVMs. JMLR, 15(1):1799–1847.
Harnessing Deep Neural Networks with Logic Rules:
Supplementary Material
We provide the detailed derivation for solving the problem in Eq.(3), Section 3, which we
repeat here:
  min_{q, ξ≥0}   KL( q(Y|X) ‖ p_θ(Y|X) ) + C Σ_{l,g_l} ξ_{l,g_l}
  s.t.  λ_l ( 1 − E_q[ r_{l,g_l}(X, Y) ] ) ≤ ξ_{l,g_l},    g_l = 1, …, G_l,   l = 1, …, L.        (A.1)
The following derivation is largely adapted from (Ganchev et al., 2010) to the logic rule constraint setting, with some reformulation that produces a closed-form solution.

The Lagrangian, with multipliers µ_{l,g_l} ≥ 0 (for ξ_{l,g_l} ≥ 0), η_{l,g_l} ≥ 0 (for the rule constraints), and α (for the normalization of q), is
  L = KL( q(Y|X) ‖ p_θ(Y|X) ) + Σ_{l,g_l} (C − µ_{l,g_l}) ξ_{l,g_l}
      + Σ_{l,g_l} η_{l,g_l} ( E_q[ λ_l (1 − r_{l,g_l}(X, Y)) ] − ξ_{l,g_l} ) + α ( Σ_Y q(Y|X) − 1 )        (A.3)
  ∇_α L = Σ_Y  p_θ(Y|X) exp{ −Σ_{l,g_l} η_{l,g_l} λ_l (1 − r_{l,g_l}(X, Y)) } / ( e · exp(α) )  −  1 = 0

  ⟹  α = log(  Σ_Y p_θ(Y|X) exp{ −Σ_{l,g_l} η_{l,g_l} λ_l (1 − r_{l,g_l}(X, Y)) }  /  e  )        (A.6)

Let Z_η = Σ_Y p_θ(Y|X) exp{ −Σ_{l,g_l} η_{l,g_l} λ_l (1 − r_{l,g_l}(X, Y)) }. Plugging α into L,
  L = −log Z_η + Σ_{l,g_l} (C − µ_{l,g_l}) ξ_{l,g_l} − Σ_{l,g_l} η_{l,g_l} ξ_{l,g_l}
    = −log Z_η,        (A.7)

where the second equality uses the optimality condition ∂L/∂ξ_{l,g_l} = C − µ_{l,g_l} − η_{l,g_l} = 0.
Since Z_η monotonically decreases as η increases, and the above optimality condition with µ_{l,g_l} ≥ 0 gives η_{l,g_l} ≤ C, we have

  max_{C ≥ η ≥ 0}  −log Z_η   ⟹   η*_{l,g_l} = C.        (A.8)

Plugging Eqs.(A.6) and (A.8) into Eq.(A.5) we obtain the solution of q as in Eq.(4).
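Numerically, the resulting teacher distribution q*(Y|X) ∝ p_θ(Y|X) exp{ −C Σ_{l,g_l} λ_l (1 − r_{l,g_l}(X, Y)) } (Eq.(4) of the main text, with η* = C) can be sketched as follows for a single instance with a small label set; `rule_penalties` is a hypothetical per-label array holding Σ_{l,g_l} λ_l (1 − r_{l,g_l}(x, y)), and the function is an illustration rather than the authors' code.

```python
import numpy as np

def teacher_distribution(student_probs, rule_penalties, C=400.0):
    """Reweight the student's predicted distribution toward rule-consistent
    labels: q(y|x) is proportional to p_theta(y|x) * exp(-C * rule_penalties[y]).
    C = 400 matches the setting reported in Section 5; with a value this large
    the teacher is sharply peaked on rule-satisfying labels."""
    logits = np.log(np.asarray(student_probs) + 1e-12) - C * np.asarray(rule_penalties)
    logits -= logits.max()            # subtract the max for numerical stability
    q = np.exp(logits)
    return q / q.sum()
```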
We design a simple pattern-matching based method to identify lists and their counterparts in the NER task. We aim for high precision and do not expect high recall. In particular, we only retrieve lists with the pattern “1. ... 2. ... 3. ...” (i.e., items indexed by numbers) or “- ... - ... - ...” (i.e., each item marked with “-”), and we require at least 3 items to form a list. We further require that the text of each item follow certain patterns which ensure the text is highly likely to consist of named entities, ruling out lists whose item text is largely free text. Specifically, we require that 1) all words of the item text start with capital letters; and 2) referring to the text between punctuation marks as a “block”, each block include no more than 3 words.

We detect both intra-sentence lists and inter-sentence lists in documents. We found the above patterns to be effective in identifying true lists. A better list detection method would be expected to further improve our NER results.
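A rough sketch of such a matcher is shown below; the regular expressions and the capitalization test are illustrative approximations of the patterns described above, not the authors' implementation.

```python
import re

# Illustrative patterns for "1. ... 2. ... 3. ..." and "- ... - ... - ..." lists.
NUMBERED = re.compile(r"(?:^|\s)\d+\.\s+(.+?)(?=\s\d+\.\s|$)")
DASHED = re.compile(r"(?:^|\s)-\s+(.+?)(?=\s-\s|$)")

def entity_like(item):
    """Every word starts with a capital letter, and each 'block' between
    punctuation marks contains at most 3 words."""
    words = item.split()
    if not words or not all(w[0].isupper() for w in words if w[0].isalpha()):
        return False
    return all(len(block.split()) <= 3 for block in re.split(r"[,;:]", item))

def detect_lists(text):
    """Return candidate entity lists: at least 3 items matching a list
    pattern, with every item passing the entity-likeness test."""
    candidates = []
    for pattern in (NUMBERED, DASHED):
        items = [i.strip() for i in pattern.findall(text)]
        if len(items) >= 3 and all(entity_like(i) for i in items):
            candidates.append(items)
    return candidates

# e.g. detect_lists("1. New York 2. Los Angeles 3. San Francisco")
# -> [['New York', 'Los Angeles', 'San Francisco']]
```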
References

Kuzman Ganchev, Joao Graça, Jennifer Gillenwater, and Ben Taskar. 2010. Posterior regularization for structured latent variable models. JMLR, 11:2001–2049.