
Harnessing Deep Neural Networks with Logic Rules

Zhiting Hu, Xuezhe Ma, Zhengzhong Liu, Eduard Hovy, Eric P. Xing
School of Computer Science
Carnegie Mellon University
{zhitingh,xuezhem,liu,epxing}@cs.cmu.edu, [email protected]

Abstract

Combining deep neural networks with structured logic rules is desirable to harness flexibility and reduce uninterpretability of the neural models. We propose a general framework capable of enhancing various types of neural networks (e.g., CNNs and RNNs) with declarative first-order logic rules. Specifically, we develop an iterative distillation method that transfers the structured information of logic rules into the weights of neural networks. We deploy the framework on a CNN for sentiment analysis, and an RNN for named entity recognition. With a few highly intuitive rules, we obtain substantial improvements and achieve state-of-the-art or comparable results to previous best-performing systems.

1 Introduction

Deep neural networks provide a powerful mechanism for learning patterns from massive data, achieving new levels of performance on image classification (Krizhevsky et al., 2012), speech recognition (Hinton et al., 2012), machine translation (Bahdanau et al., 2014), playing strategic board games (Silver et al., 2016), and so forth.

Despite the impressive advances, the widely-used DNN methods still have limitations. The high predictive accuracy relies heavily on large amounts of labeled data, and the purely data-driven learning can lead to uninterpretable and sometimes counter-intuitive results (Szegedy et al., 2014; Nguyen et al., 2015). It is also difficult to encode human intention to guide the models to capture desired patterns, without expensive direct supervision or ad-hoc initialization.

On the other hand, the cognitive process of human beings indicates that people learn not only from concrete examples (as DNNs do) but also from different forms of general knowledge and rich experiences (Minsky, 1980; Lake et al., 2015). Logic rules provide a flexible declarative language for communicating high-level cognition and expressing structured knowledge. It is therefore desirable to integrate logic rules into DNNs, to transfer human intention and domain knowledge to neural models, and to regulate the learning process.

In this paper, we present a framework capable of enhancing general types of neural networks, such as convolutional networks (CNNs) and recurrent networks (RNNs), on various tasks, with logic rule knowledge. Combining symbolic representations with neural methods has been considered in different contexts. Neural-symbolic systems (Garcez et al., 2012) construct a network from a given rule set to execute reasoning. To exploit a priori knowledge in general neural architectures, recent work augments each raw data instance with useful features (Collobert et al., 2011); network training, however, is still limited to instance-label supervision and suffers from the same issues mentioned above. Besides, a large variety of structural knowledge cannot be naturally encoded in the feature-label form.

Our framework enables a neural network to learn simultaneously from labeled instances as well as logic rules, through an iterative rule knowledge distillation procedure that transfers the structured information encoded in the logic rules into the network parameters. Since the general logic rules are complementary to the specific data labels, a natural "side-product" of the integration is support for semi-supervised learning, where unlabeled data is used to better absorb the logical knowledge. Methodologically, our approach can be seen as a combination of knowledge distillation (Hinton et al., 2015; Bucilu et al., 2006) and the posterior regularization (PR) method (Ganchev et al., 2010).
In particular, at each iteration we adapt the posterior constraint principle from PR to construct a rule-regularized teacher, and train the student network of interest to imitate the predictions of the teacher network. We leverage soft logic to support flexible rule encoding.

We apply the proposed framework to both a CNN and an RNN, deployed on the tasks of sentiment analysis (SA) and named entity recognition (NER), respectively. With only a few (one or two) very intuitive rules, both the distilled networks and the joint teacher networks strongly improve over their basic forms (without rules), and achieve better or comparable performance to state-of-the-art models which typically have more parameters and more complicated architectures.

To the best of our knowledge, this is the first work to integrate logic rules with general workhorse types of deep neural networks in a principled framework. The encouraging results indicate that our method can be potentially useful for incorporating richer types of human knowledge and for improving other application domains.

2 Related Work

Combination of logic rules and neural networks has been considered in different contexts. Neural-symbolic systems (Garcez et al., 2012), such as KBANN (Towell et al., 1990) and CILP++ (França et al., 2014), construct network architectures from given rules to perform reasoning and knowledge acquisition. A related line of research, such as Markov logic networks (Richardson and Domingos, 2006), derives probabilistic graphical models (rather than neural networks) from the rule set.

With the recent success of deep neural networks in a vast variety of application domains, it is increasingly desirable to incorporate structured logic knowledge into general types of networks to harness flexibility and reduce uninterpretability. Recent work that trains on extra features from domain knowledge (Collobert et al., 2011), while producing improved results, does not go beyond the data-label paradigm. Kulkarni et al. (2015) uses a specialized training procedure with careful ordering of training instances to obtain an interpretable neural layer of an image network. Karaletsos et al. (2016) develops a generative model jointly over data-labels and similarity knowledge expressed in triplet format to learn improved disentangled representations.

Though there do exist general frameworks that allow encoding various structured constraints on latent variable models (Ganchev et al., 2010; Zhu et al., 2014; Liang et al., 2009), they either are not directly applicable to the NN case, or could yield inferior performance as in our empirical study. Liang et al. (2008) transfers the predictive power of pre-trained structured models to unstructured ones in a pipelined fashion.

Our proposed approach is distinct in that we use an iterative rule distillation process to effectively transfer rich structured knowledge, expressed in the declarative first-order logic language, into the parameters of general neural networks. We show that the proposed approach strongly outperforms an extensive array of other either ad-hoc or general integration methods.

3 Method

In this section we present our framework, which encapsulates the logical structured knowledge into a neural network. This is achieved by forcing the network to emulate the predictions of a rule-regularized teacher, and evolving both models iteratively throughout training (section 3.2). The process is agnostic to the network architecture, and thus applicable to general types of neural models including CNNs and RNNs. We construct the teacher network in each iteration by adapting the posterior regularization principle in our logical constraint setting (section 3.3), where our formulation provides a closed-form solution. Figure 1 shows an overview of the proposed framework.

[Figure 1: schematic of teacher network construction (the student pθ(y|x) is projected onto a rule-regularized distribution q(y|x) using the logic rules) and rule knowledge distillation (the student is trained by back-propagation on labeled and unlabeled data to match the teacher).]

Figure 1: Framework Overview. At each iteration, the teacher network is obtained by projecting the student network to a rule-regularized subspace (red dashed arrow); and the student network is updated to balance between emulating the teacher's output and predicting the true labels (black/blue solid arrows).
3.1 Learning Resources: Instances and Rules

Our approach allows neural networks to learn from both specific examples and general rules. Here we give the settings of these "learning resources".

Assume we have input variable x ∈ X and target variable y ∈ Y. For clarity, we focus on K-way classification, where Y = ∆K is the K-dimensional probability simplex and y ∈ {0, 1}^K ⊂ Y is a one-hot encoding of the class label. However, our method specification can straightforwardly be applied to other contexts such as regression and sequence learning (e.g., NER tagging, which is a sequence of classification decisions). The training data D = {(xn, yn)}_{n=1}^N is a set of instantiations of (x, y).

Further consider a set of first-order logic (FOL) rules with confidences, denoted as R = {(Rl, λl)}_{l=1}^L, where Rl is the lth rule over the input-target space (X, Y), and λl ∈ [0, ∞] is the confidence level, with λl = ∞ indicating a hard rule, i.e., all groundings are required to be true (=1). Here a grounding is the logic expression with all variables being instantiated. Given a set of examples (X, Y) ⊂ (X, Y) (e.g., a minibatch from D), the set of groundings of Rl is denoted as {r_{lg}(X, Y)}_{g=1}^{Gl}. In practice a rule grounding is typically relevant to only a single example or a subset of examples, though here we give the most general form over the entire set.

We encode the FOL rules using soft logic (Bach et al., 2015) for flexible encoding and stable optimization. Specifically, soft logic allows continuous truth values from the interval [0, 1] instead of {0, 1}, and the Boolean logic operators are reformulated as:

  A & B = max{A + B − 1, 0}
  A ∨ B = min{A + B, 1}
  A1 ∧ · · · ∧ AN = (Σ_i Ai) / N          (1)
  ¬A = 1 − A

Here & and ∧ are two different approximations to logical conjunction (Foulds et al., 2015): & is useful as a selection operator (e.g., A & B = B when A = 1, and A & B = 0 when A = 0), while ∧ is an averaging operator.
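To make the soft logic encoding concrete, the following is a minimal NumPy sketch of the operators in Eq.(1). It is our illustration under the assumption that all truth values lie in [0, 1]; it is not code from the paper's implementation.

```python
import numpy as np

# Soft logic operators of Eq.(1); all inputs are truth values in [0, 1].
def soft_and_select(a, b):
    # '&' (Lukasiewicz-style conjunction), useful as a selection operator.
    return np.maximum(a + b - 1.0, 0.0)

def soft_or(a, b):
    return np.minimum(a + b, 1.0)

def soft_and_avg(*conjuncts):
    # '∧' over N conjuncts is the average of their truth values.
    return np.mean(np.stack(conjuncts), axis=0)

def soft_not(a):
    return 1.0 - a

# Selection behaviour of '&': A & B equals B when A = 1, and 0 when A = 0.
assert np.isclose(soft_and_select(1.0, 0.7), 0.7)
assert np.isclose(soft_and_select(0.0, 0.7), 0.0)
```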
3.2 Rule Knowledge Distillation

A neural network defines a conditional probability pθ(y|x) by using a softmax output layer that produces a K-dimensional soft prediction vector, denoted as σθ(x). The network is parameterized by weights θ. Standard neural network training iteratively updates θ to produce the correct labels of the training instances. To integrate the information encoded in the rules, we propose to train the network to also imitate the outputs of a rule-regularized projection of pθ(y|x), denoted as q(y|x), which explicitly includes the rule constraints as regularization terms. In each iteration q is constructed by projecting pθ into a subspace constrained by the rules, and thus has desirable properties. We present the construction in the next section. The prediction behavior of q reveals the information of the regularized subspace and the structured rules. Emulating the q outputs serves to transfer this knowledge into pθ. The new objective is then formulated as a balance between imitating the soft predictions of q and predicting the true hard labels:

  θ^(t+1) = argmin_{θ∈Θ} (1/N) Σ_{n=1}^N [ (1 − π) ℓ(yn, σθ(xn)) + π ℓ(s_n^(t), σθ(xn)) ],          (2)

where ℓ denotes the loss function selected according to the specific application (e.g., the cross-entropy loss for classification); s_n^(t) is the soft prediction vector of q on xn at iteration t; and π is the imitation parameter calibrating the relative importance of the two objectives.

A similar imitation procedure has been used in other settings such as model compression (Bucilu et al., 2006; Hinton et al., 2015), where the process is termed distillation. Following them, we call pθ(y|x) the "student" and q(y|x) the "teacher", which can be intuitively explained by analogy to human education, where a teacher who is aware of systematic general rules instructs students by providing her solutions to particular questions (i.e., the soft predictions). An important difference from previous distillation work, where the teacher is obtained beforehand and the student is trained thereafter, is that our teacher and student are learned simultaneously during training.
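As a concrete illustration of the objective in Eq.(2), the sketch below computes the balanced loss for one minibatch with cross-entropy as ℓ. It assumes the teacher's soft predictions have already been obtained and is only an illustrative sketch, not the authors' implementation.

```python
import numpy as np

def cross_entropy(target_probs, predicted_probs, eps=1e-12):
    # ℓ(t, p) = -Σ_k t_k log p_k, averaged over the minibatch.
    return -np.mean(np.sum(target_probs * np.log(predicted_probs + eps), axis=1))

def distillation_objective(y_onehot, teacher_soft, student_probs, pi):
    # Eq.(2): (1 - π) ℓ(y_n, σθ(x_n)) + π ℓ(s_n, σθ(x_n)).
    hard_term = cross_entropy(y_onehot, student_probs)
    soft_term = cross_entropy(teacher_soft, student_probs)
    return (1.0 - pi) * hard_term + pi * soft_term
```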
Though it is possible to combine a neural network with rule constraints by projecting the network to the rule-regularized subspace after it has been fully trained as before with only data-label instances, or by optimizing the projected network directly, we found that our iterative teacher-student distillation approach provides much superior performance, as shown in the experiments. Moreover, since pθ distills the rule information into the weights θ instead of relying on explicit rule representations, we can use pθ for predicting new examples at test time when the rule assessment is expensive or even unavailable (i.e., the privileged information setting (Lopez-Paz et al., 2016)) while still enjoying the benefit of the integration. Besides, the second loss term in Eq.(2) can be augmented with rich unlabeled data in addition to the labeled examples, which enables semi-supervised learning for better absorbing the rule knowledge.

3.3 Teacher Network Construction

We now proceed to construct the teacher network q(y|x) at each iteration from pθ(y|x). The iteration index t is omitted for clarity. We adapt the posterior regularization principle to our logic constraint setting. Our formulation ensures a closed-form solution for q and thus avoids any significant increase in computational overhead.

Recall the set of FOL rules R = {(Rl, λl)}_{l=1}^L. Our goal is to find the optimal q that fits the rules while at the same time staying close to pθ. For the first property, we apply a commonly-used strategy that imposes the rule constraints on q through an expectation operator. That is, for each rule (indexed by l) and each of its groundings (indexed by g) on (X, Y), we expect E_{q(Y|X)}[r_{lg}(X, Y)] = 1, with confidence λl. The constraints define a rule-regularized space of all valid distributions. For the second property, we measure the closeness between q and pθ with the KL-divergence, and wish to minimize it. Combining the two factors together and further allowing slackness for the constraints, we finally get the following optimization problem:

  min_{q,ξ≥0}  KL(q(Y|X) ∥ pθ(Y|X)) + C Σ_{l,gl} ξ_{l,gl}
  s.t.  λl (1 − E_q[r_{l,gl}(X, Y)]) ≤ ξ_{l,gl},
        gl = 1, . . . , Gl,  l = 1, . . . , L,          (3)

where ξ_{l,gl} ≥ 0 is the slack variable for the respective logic constraint, and C is the regularization parameter. The problem can be seen as projecting pθ into the constrained subspace. The problem is convex and can be efficiently solved in its dual form with a closed-form solution. We provide the detailed derivation in the supplementary materials and directly give the solution here:

  q*(Y|X) ∝ pθ(Y|X) exp{ − Σ_{l,gl} C λl (1 − r_{l,gl}(X, Y)) }          (4)

Intuitively, a strong rule with large λl will lead to low probabilities for predictions that fail to meet the constraints. We discuss the computation of the normalization factor in section 3.4.

Our framework is related to the posterior regularization (PR) method (Ganchev et al., 2010), which places constraints over the model posterior in an unsupervised setting. In classification, our optimization procedure is analogous to the modified EM algorithm for PR: by using the cross-entropy loss in Eq.(2) and evaluating the second loss term on unlabeled data differing from D, Eq.(4) corresponds to the E-step and Eq.(2) is analogous to the M-step. This sheds light from another perspective on why our framework would work. However, we found in our experiments (section 5) that to produce strong performance it is crucial to use the same labeled data xn in the two losses of Eq.(2), so as to form a direct trade-off between imitating soft predictions and predicting correct hard labels.
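The projection in Eq.(4) has a particularly simple form when every rule grounding concerns a single instance and factors over the candidate labels (as with the "but"-rule of section 4.1). The NumPy sketch below, with hypothetical argument names, illustrates that case; it is our illustration, not the authors' code.

```python
import numpy as np

def construct_teacher(student_probs, rule_truth, C=400.0, lam=1.0):
    """Closed-form teacher of Eq.(4) for per-instance, per-label groundings.

    student_probs : (batch, K) array of softmax outputs p_theta(y|x).
    rule_truth    : (batch, K) array of soft truth values r(x, y) for each
                    candidate label (use 1.0 where the rule imposes nothing).
    """
    # q(y|x) ∝ p_theta(y|x) · exp(-C · λ · (1 - r(x, y)))
    unnormalized = student_probs * np.exp(-C * lam * (1.0 - rule_truth))
    return unnormalized / unnormalized.sum(axis=1, keepdims=True)
```

With λ = ∞ (a hard rule), the exponential factor becomes a 0/1 mask that removes label configurations violating the rule.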
3.4 Implementations

The procedure of iterative distillation optimization of our framework is summarized in Algorithm 1.

Algorithm 1 Harnessing NN with Rules
Input: the training data D = {(xn, yn)}_{n=1}^N, the rule set R = {(Rl, λl)}_{l=1}^L
Parameters: π – imitation parameter; C – regularization strength
1: Initialize the neural network parameters θ
2: repeat
3:   Sample a minibatch (X, Y) ⊂ D
4:   Construct the teacher network q with Eq.(4)
5:   Transfer knowledge into pθ by updating θ with Eq.(2)
6: until convergence
Output: the distilled student network pθ and the teacher network q

During training we need to compute the soft predictions of q at each iteration, which is straightforward through direct enumeration if the rule constraints in Eq.(4) are factored in the same way as the base neural model pθ (e.g., the "but"-rule of sentiment classification in section 4.1). If the constraints introduce additional dependencies, e.g., the bigram dependency of the transition rule in the NER task (section 4.2), we can use dynamic programming for efficient computation. For higher-order constraints (e.g., the listing rule in NER), we approximate through Gibbs sampling that iteratively samples from q(yi | y−i, x) for each position i. If the constraints span multiple instances, we group the relevant instances in minibatches for joint inference (and randomly break some dependencies when a group is too large). Note that calculating the soft predictions is efficient since only one NN forward pass is required to compute the base distribution pθ(y|x) (and a few more, if needed, for calculating the truth values of the relevant rules).

p vs. q at Test Time  At test time we can use either the distilled student network p, or the teacher network q after a final projection. Our empirical results show that both models substantially improve over the base network trained with only data-label instances. In general q performs better than p. In particular, q is more suitable when the logic rules introduce additional dependencies (e.g., spanning over multiple examples) requiring joint inference. In contrast, as mentioned above, p is more lightweight and efficient, and useful when rule evaluation is expensive or impossible at prediction time. Our experiments compare the performance of p and q extensively.

Imitation Strength π  The imitation parameter π in Eq.(2) balances between emulating the teacher's soft predictions and predicting the true hard labels. Since the teacher network is constructed from pθ, which at the beginning of training produces low-quality predictions, we favor predicting the true labels more in the initial stage. As training goes on, we gradually bias towards emulating the teacher predictions to effectively distill the structured knowledge. Specifically, we define π(t) = min{π0, 1 − α^t} at iteration t ≥ 0, where α ≤ 1 specifies the speed of decay and π0 < 1 caps the imitation strength.
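Putting Algorithm 1, the teacher projection of Eq.(4), and the objective of Eq.(2) together, a minimal training-loop sketch might look as follows. The `model`, `rule_truth_fn`, and `update_fn` interfaces are hypothetical stand-ins for the network, the rule evaluation, and the optimizer step, and the default hyperparameter values are only indicative; this is a sketch under those assumptions, not the released implementation.

```python
def train_with_rule_distillation(minibatches, model, rule_truth_fn, update_fn,
                                 pi0=0.95, alpha=0.9, C=400.0, lam=1.0):
    """Iterative rule knowledge distillation (sketch of Algorithm 1).

    minibatches   : iterable of (x_batch, y_onehot) pairs sampled from D
    model         : object exposing predict_proba(x) -> (batch, K) softmax outputs
    rule_truth_fn : callable x -> (batch, K) soft truth values of the rule(s)
    update_fn     : callable (model, x, target_probs) taking one gradient step
                    that pushes the softmax outputs towards target_probs
    """
    for t, (x_batch, y_batch) in enumerate(minibatches):
        student_probs = model.predict_proba(x_batch)              # p_theta(y|x)
        teacher_probs = construct_teacher(student_probs,          # Eq.(4)
                                          rule_truth_fn(x_batch), C, lam)
        pi = min(pi0, 1.0 - alpha ** t)                           # imitation schedule
        # Cross-entropy is linear in the target distribution, so the two terms
        # of Eq.(2) can be folded into a single mixed target.
        targets = (1.0 - pi) * y_batch + pi * teacher_probs
        update_fn(model, x_batch, targets)                        # Eq.(2) update
    return model  # distilled student; the teacher is a projection of it
```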
4 Applications

We have presented our framework, which is general enough to improve various types of neural networks with rules, and easy to use in that users are allowed to impose their knowledge and intentions through declarative first-order logic. In this section we illustrate the versatility of our approach by applying it to two workhorse network architectures, i.e., a convolutional network and a recurrent network, on two representative applications, i.e., sentence-level sentiment analysis, which is a classification problem, and named entity recognition, which is a sequence learning problem.

For each task, we first briefly describe the base neural network. Since we are not focusing on tuning network architectures, we largely use the same or similar networks to previous successful neural models. We then design the linguistically-motivated rules to be integrated.

4.1 Sentiment Classification

Sentence-level sentiment analysis aims to identify the sentiment (e.g., positive or negative) underlying an individual sentence. The task is crucial for many opinion mining applications. One challenging point of the task is to capture the contrastive sense (e.g., signaled by the conjunction "but") within a sentence.

Base Network  We use the single-channel convolutional network proposed in (Kim, 2014). This simple model has achieved compelling performance on various sentiment classification benchmarks. The network contains a convolutional layer on top of the word vectors of a given sentence, followed by a max-over-time pooling layer and then a fully-connected layer with softmax output activation. A convolution operation applies a filter to word windows, and multiple filters with varying window sizes are used to obtain multiple features. Figure 2 shows the network architecture.

[Figure 2: example input "Padding I like this book store a lot Padding" → Word Embedding → Convolution → Max Pooling → Sentence Representation.]

Figure 2: The CNN architecture for sentence-level sentiment analysis. The sentence representation vector is followed by a fully-connected layer with softmax output activation to output sentiment predictions.

Logic Rules  One difficulty for the plain neural network is to identify the contrastive sense in order to capture the dominant sentiment precisely. The conjunction word "but" is one of the strong indicators of such sentiment changes in a sentence, where the sentiment of the clause following "but" generally dominates. We thus consider sentences S with an "A-but-B" structure, and expect the sentiment of the whole sentence to be consistent with the sentiment of clause B. The logic rule is written as:

  has-'A-but-B'-structure(S) ⇒ (1(y = +) ⇒ σθ(B)+ ∧ σθ(B)+ ⇒ 1(y = +)),          (5)
where 1(·) is an indicator function that takes the value 1 when its argument is true, and 0 otherwise; class '+' represents 'positive'; and σθ(B)+ is the element of σθ(B) for class '+'. By Eq.(1), when S has the 'A-but-B' structure, the truth value of the above logic rule equals (1 + σθ(B)+)/2 when y = +, and (2 − σθ(B)+)/2 otherwise.¹ Note that here we assume two-way classification (i.e., positive and negative), though it is straightforward to design rules for finer-grained sentiment classification.

¹ Replacing ∧ with & in Eq.(5) leads to a probably more intuitive rule which takes the value σθ(B)+ when y = +, and 1 − σθ(B)+ otherwise.
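To illustrate how these truth values feed into the projection of Eq.(4), the sketch below assembles the per-label truth matrix of the "A-but-B" rule for a minibatch. The clause detection (simply checking for the token "but") and the separate forward pass that yields σθ(B)+ are simplifying assumptions on our part, not the paper's exact preprocessing.

```python
import numpy as np

def but_rule_truth(token_lists, clause_b_pos_probs):
    """Soft truth values of the 'A-but-B' rule (Eq.(5)) for 2-way sentiment.

    token_lists        : list of tokenized sentences
    clause_b_pos_probs : (batch,) array with σθ(B)+, the predicted positive
                         probability of the clause after 'but'
    Returns a (batch, 2) truth matrix with columns [negative, positive].
    """
    truth = np.ones((len(token_lists), 2))    # no constraint by default
    for i, tokens in enumerate(token_lists):
        if "but" in tokens:                   # naive A-but-B detection
            sb = float(clause_b_pos_probs[i])
            truth[i, 1] = (1.0 + sb) / 2.0    # truth value when y = +
            truth[i, 0] = (2.0 - sb) / 2.0    # truth value when y = -
    return truth
```

Because this rule factors over individual sentences and labels, the resulting matrix can be passed directly to the `construct_teacher` sketch above.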

4.2 Named Entity Recognition

NER aims to locate and classify elements in text into entity categories such as "persons" and "organizations". It is an essential first step for downstream language understanding applications. The task assigns to each word a named entity tag in an "X-Y" format, where X is one of BIEOS (Beginning, Inside, End, Outside, and Singleton) and Y is the entity category. A valid tag sequence has to follow certain constraints imposed by the definition of the tagging scheme. Besides, text with structures (e.g., lists) within or across sentences can usually expose some consistency patterns.

Base Network  The base network has a similar architecture to the bi-directional LSTM recurrent network (called BLSTM-CNN) proposed in (Chiu and Nichols, 2015) for NER, which has outperformed most previous neural models. The model uses a CNN and pre-trained word vectors to capture character- and word-level information, respectively. These features are then fed into a bi-directional RNN with LSTM units for sequence tagging. Compared to (Chiu and Nichols, 2015) we omit the character type and capitalization features, as well as the additive transition matrix in the output layer. Figure 3 shows the network architecture.

[Figure 3: example input "NYC locates in USA" → Char+Word Representation → Forward LSTM and Backward LSTM → Output Representation.]

Figure 3: The architecture of the bidirectional LSTM recurrent network for NER. The CNN for extracting character representations is omitted.

Logic Rules  The base network largely makes independent tagging decisions at each position, ignoring the constraints on successive labels that define a valid tag sequence (e.g., I-ORG cannot follow B-PER). In contrast to recent work (Lample et al., 2016), which adds a conditional random field (CRF) to capture bi-gram dependencies between outputs, we instead apply logic rules, which do not introduce extra parameters to learn. An example rule is:

  equal(yi−1, I-ORG) ⇒ ¬ equal(yi, B-PER)          (6)

The confidence levels are set to ∞ to prevent any violation.

We further leverage the list structures within and across sentences of the same document. Specifically, named entities at corresponding positions in a list are likely to be in the same categories. For instance, in "1. Juventus, 2. Barcelona, 3. ..." we know "Barcelona" must be an organization rather than a location, since its counterpart entity "Juventus" is an organization. We describe our simple procedure for identifying lists and counterparts in the supplementary materials. The logic rule is encoded as:

  is-counterpart(X, A) ⇒ 1 − ∥c(ey) − c(σθ(A))∥2,          (7)

where ey is the one-hot encoding of y (the class prediction of X), and c(·) collapses the probability mass on the labels with the same categories into a single probability, yielding a vector whose length equals the number of categories. We use the ℓ2 distance as a measure of the closeness between the predictions of X and its counterpart A. Note that the distance takes values in [0, 1], which makes it a proper soft truth value. The list rule can span multiple sentences (within the same document). We found that the teacher network q, which enables explicit joint inference, provides much better performance than the distilled student network p (section 5).
5 Experiments

We validate our framework by applying it to sentiment classification and named entity recognition on a variety of public benchmarks. By integrating the simple yet effective rules with the base networks, we obtain substantial improvements on both tasks and achieve state-of-the-art or comparable results to previous best-performing systems. Comparison with a diverse set of other rule integration methods demonstrates the unique effectiveness of our framework. Our approach also shows promising potential in the semi-supervised learning and sparse data contexts.

Throughout the experiments we set the regularization parameter to C = 400. In sentiment classification we set the imitation parameter to π(t) = 1 − 0.9^t, while in NER we use π(t) = min{0.9, 1 − 0.9^t} to downplay the noisy listing rule. The confidence levels of the rules are set to λl = 1, except for hard constraints whose confidence is ∞. For the neural network configurations, we largely followed the reference work, as specified in the following respective sections. All experiments were performed on a Linux machine with eight 4.0GHz CPU cores, one Tesla K40c GPU, and 32GB RAM. We implemented the neural networks using Theano², a popular deep learning platform.

² http://deeplearning.net/software/theano

5.1 Sentiment Classification

5.1.1 Setup

We test our method on a number of commonly used benchmarks, including 1) SST2, the Stanford Sentiment Treebank (Socher et al., 2013), which contains 2 classes (negative and positive) and 6920/872/1821 sentences in the train/dev/test sets, respectively; following (Kim, 2014) we train models on both sentences and phrases since all labels are provided. 2) MR (Pang and Lee, 2005), a set of 10,662 one-sentence movie reviews with negative or positive sentiment. 3) CR (Hu and Liu, 2004), customer reviews of various products, containing 2 classes and 3,775 instances. For MR and CR, we use 10-fold cross validation as in previous work. In each of the three datasets, around 15% of the sentences contain the word "but".

For the base neural network we use the "non-static" version in (Kim, 2014) with the exact same configurations. Specifically, word vectors are initialized using word2vec (Mikolov et al., 2013) and fine-tuned throughout training, and the neural parameters are trained using SGD with the Adadelta update rule (Zeiler, 2012).

5.1.2 Results

  Model                                      SST2    MR          CR
  1   CNN (Kim, 2014)                        87.2    81.3±0.1    84.3±0.2
  2   CNN-Rule-p                             88.8    81.6±0.1    85.0±0.3
  3   CNN-Rule-q                             89.3    81.7±0.1    85.3±0.3
  4   MGNC-CNN (Zhang et al., 2016)          88.4    –           –
  5   MVCNN (Yin and Schutze, 2015)          89.4    –           –
  6   CNN-multichannel (Kim, 2014)           88.1    81.1        85.0
  7   Paragraph-Vec (Le and Mikolov, 2014)   87.8    –           –
  8   CRF-PR (Yang and Cardie, 2014)         –       –           82.7
  9   RNTN (Socher et al., 2013)             85.4    –           –
  10  G-Dropout (Wang and Manning, 2013)     –       79.0        82.1

Table 1: Accuracy (%) of Sentiment Classification. Row 1, CNN (Kim, 2014), is the base network, corresponding to the "CNN-non-static" model in (Kim, 2014). Rows 2-3 are the networks enhanced by our framework: CNN-Rule-p is the student network and CNN-Rule-q is the teacher network. For MR and CR, we report the average accuracy ± one standard deviation using 10-fold cross validation.

Table 1 shows the sentiment classification performance. Rows 1-3 compare the base neural model with the models enhanced by our framework with the "but"-rule (Eq.(5)). We see that our method provides a strong boost in accuracy on all three datasets. The teacher network q further improves over the student network p, though the student network is more widely applicable in certain contexts, as discussed in sections 3.2 and 3.4. Rows 4-10 show the accuracy of recent top-performing methods. On the MR and CR datasets, our model outperforms all the baselines. On SST2, MVCNN (Yin and Schutze, 2015) (Row 5) is the only system that shows a slightly better result than ours. Their neural network combines diverse sets of pre-trained word embeddings (while we use only word2vec) and contains more neural layers and parameters than our model.
To further investigate the effectiveness of our framework in integrating structured rule knowledge, we compare it with an extensive array of other possible integration approaches. Table 2 lists these methods and their performance on the SST2 task.

  Model               Accuracy (%)
  1  CNN (Kim, 2014)      87.2
  2  -but-clause          87.3
  3  -ℓ2-reg              87.5
  4  -project             87.9
  5  -opt-project         88.3
  6  -pipeline            87.9
  7  -Rule-p              88.8
  8  -Rule-q              89.3

Table 2: Performance of different rule integration methods on SST2. 1) CNN is the base network; 2) "-but-clause" takes the clause after "but" as input; 3) "-ℓ2-reg" imposes a regularization term γ∥σθ(S) − σθ(Y)∥2 on the CNN objective, with the strength γ selected on the dev set; 4) "-project" projects the trained base CNN to the rule-regularized subspace with Eq.(3); 5) "-opt-project" directly optimizes the projected CNN; 6) "-pipeline" distills the pre-trained "-opt-project" to a plain CNN; 7-8) "-Rule-p" and "-Rule-q" are our models, with p being the distilled student network and q the teacher network. Note that "-but-clause" and "-ℓ2-reg" are ad-hoc methods applicable specifically to the "but"-rule.

We see that: 1) Although all methods lead to different degrees of improvement, our framework outperforms all other competitors by a large margin. 2) In particular, compared to the pipelined method in Row 6, which is analogous to the structure compilation work (Liang et al., 2008), our iterative distillation (section 3.2) provides better performance. Another advantage of our method is that we only train one set of neural parameters, as opposed to two separate sets as in the pipelined approach. 3) The distilled student network "-Rule-p" achieves much higher accuracy than the base CNN, as well as than "-project" and "-opt-project", which explicitly project the CNN to the rule-constrained subspace. This validates that our distillation procedure transfers the structured knowledge into the neural parameters effectively. The inferior accuracy of "-opt-project" can be partially attributed to the poor performance of its neural network part, which achieves only 85.1% accuracy and leads to inaccurate evaluation of the "but"-rule in Eq.(5).

We next explore the performance of our framework with varying numbers of labeled instances, as well as the effect of exploiting unlabeled data. Intuitively, with fewer labeled examples we expect the general rules to contribute more to the performance, and unlabeled data should help the model better learn from the rules. This can be a useful property especially when data are sparse and labels are expensive to obtain. Table 3 shows the results.

  Data size        5%     10%    30%    100%
  1  CNN           79.9   81.6   83.6   87.2
  2  -Rule-p       81.5   83.2   84.5   88.8
  3  -Rule-q       82.5   83.9   85.6   89.3
  4  -semi-PR      81.5   83.1   84.6   –
  5  -semi-Rule-p  81.7   83.3   84.7   –
  6  -semi-Rule-q  82.7   84.2   85.7   –

Table 3: Accuracy (%) on SST2 with varying sizes of labeled data and semi-supervised learning. The header row is the percentage of labeled examples used for training. Rows 1-3 use only the supervised data. Rows 4-6 use semi-supervised learning, where the remaining training data are used as unlabeled examples. For "-semi-PR" we only report its projected solution (analogous to q), which performs better than the non-projected one (analogous to p).

The subsampling is conducted on the sentence level. That is, for instance, in "5%" we first selected 5% of the training sentences uniformly at random, then trained the models on these sentences as well as their phrases. The results verify our expectations. 1) Rows 1-3 give the accuracy of using only data-label subsets for training. In every setting our methods consistently outperform the base CNN. 2) "-Rule-q" provides a larger improvement on 5% data (with a margin of 2.6%) than on larger data (e.g., 2.3% on 10% data, and 2.0% on 30% data), showing promising potential in the sparse data context. 3) By adding unlabeled instances for semi-supervised learning as in Rows 5-6, we get further improved accuracy. 4) Row 4, "-semi-PR", is posterior regularization (Ganchev et al., 2010), which imposes the rule constraint through only unlabeled data during training. Our distillation framework consistently provides substantially better results.
5.2 Named Entity Recognition

5.2.1 Setup

We evaluate on the well-established CoNLL-2003 NER benchmark (Tjong Kim Sang and De Meulder, 2003), which contains 14,987/3,466/3,684 sentences and 204,567/51,578/46,666 tokens in the train/dev/test sets, respectively. The dataset includes 4 categories, i.e., person, location, organization, and misc. The BIOES tagging scheme is used. Around 1.7% of the named entities occur in lists.

We use mostly the same configurations for the base BLSTM network as in (Chiu and Nichols, 2015), except that, besides the slight architecture difference (section 4.2), we apply Adadelta for parameter updating. GloVe (Pennington et al., 2014) word vectors are used to initialize the word features.

5.2.2 Results

  Model                                    F1
  1  BLSTM                                 89.55
  2  BLSTM-Rule-trans                      p: 89.80, q: 91.11
  3  BLSTM-Rules                           p: 89.93, q: 91.18
  4  NN-lex (Collobert et al., 2011)       89.59
  5  S-LSTM (Lample et al., 2016)          90.33
  6  BLSTM-lex (Chiu and Nichols, 2015)    90.77
  7  BLSTM-CRF1 (Lample et al., 2016)      90.94
  8  Joint-NER-EL (Luo et al., 2015)       91.20
  9  BLSTM-CRF2 (Ma and Hovy, 2016)        91.21

Table 4: Performance of NER on CoNLL-2003. Row 2, BLSTM-Rule-trans, imposes the transition rules (Eq.(6)) on the base BLSTM. Row 3, BLSTM-Rules, further incorporates the list rule (Eq.(7)). We report the performance of both the student model p and the teacher model q.

Table 4 presents the performance on the NER task. By incorporating the bi-gram transition rules (Row 2), the joint teacher model q achieves a 1.56-point improvement in F1 score, outperforming most previous neural-based methods (Rows 4-7), including the BLSTM-CRF model (Lample et al., 2016), which applies a conditional random field (CRF) on top of a BLSTM in order to capture the transition patterns and encourage valid sequences. In contrast, our method implements the desired constraints in a more straightforward way by using the declarative logic rule language, and at the same time does not introduce extra model parameters to learn. Further integration of the list rule (Row 3) provides a second boost in performance, achieving an F1 score very close to the best-performing systems, including Joint-NER-EL (Luo et al., 2015) (Row 8), a probabilistic graphical model optimizing NER and entity linking jointly with massive external resources, and BLSTM-CRF (Ma and Hovy, 2016), a combination of BLSTM and CRF with more parameters than our rule-enhanced neural networks.

From the table we see that the accuracy gap between the joint teacher model q and the distilled student p is relatively larger than in the sentiment classification task (Table 1). This is because in the NER task we have used logic rules that introduce extra dependencies between adjacent tag positions as well as across multiple instances, making the explicit joint inference of q useful for fulfilling these structured constraints.

6 Discussion and Future Work

We have developed a framework that combines deep neural networks with first-order logic rules to allow integrating human knowledge and intentions into the neural models. In particular, we proposed an iterative distillation procedure that transfers the structured information of logic rules into the weights of neural networks. The transfer is done via a teacher network constructed using the posterior regularization principle. Our framework is general and applicable to various types of neural architectures. With a few intuitive rules, our framework significantly improves base networks on sentiment analysis and named entity recognition, demonstrating the practical significance of our approach.

Though we have focused on first-order logic rules, we leveraged a soft logic formulation which can be easily extended to general probabilistic models for expressing structured distributions and performing inference and reasoning (Lake et al., 2015). We plan to explore these diverse knowledge representations to guide the DNN learning. The proposed iterative distillation procedure also reveals connections to recent neural autoencoders (Kingma and Welling, 2014; Rezende et al., 2014), where generative models encode probabilistic structures and neural recognition models distill the information through iterative optimization (Rezende et al., 2016; Johnson et al., 2016; Karaletsos et al., 2016).

The encouraging empirical results indicate a strong potential of our approach for improving other application domains such as vision tasks, which we plan to explore in the future. Finally, we would also like to generalize our framework to automatically learn the confidence of different rules, and to derive new rules from data.

Acknowledgments

We thank the anonymous reviewers for their valuable comments. This work is supported by NSF IIS1218282, NSF IIS1447676, Air Force FA8721-05-C-0003, and FA8750-12-2-0342.
References

Stephen H Bach, Matthias Broecheler, Bert Huang, and Lise Getoor. 2015. Hinge-loss Markov random fields and probabilistic soft logic. arXiv preprint arXiv:1505.04406.
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. Proc. of ICLR.
Cristian Bucilu, Rich Caruana, and Alexandru Niculescu-Mizil. 2006. Model compression. In Proc. of KDD, pages 535–541. ACM.
Jason PC Chiu and Eric Nichols. 2015. Named entity recognition with bidirectional LSTM-CNNs. arXiv preprint arXiv:1511.08308.
Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. JMLR, 12:2493–2537.
James Foulds, Shachi Kumar, and Lise Getoor. 2015. Latent topic networks: A versatile probabilistic programming framework for topic models. In Proc. of ICML, pages 777–786.
Manoel VM França, Gerson Zaverucha, and Artur S d'Avila Garcez. 2014. Fast relational learning using bottom clause propositionalization with artificial neural networks. Machine Learning, 94(1):81–104.
Kuzman Ganchev, Joao Graça, Jennifer Gillenwater, and Ben Taskar. 2010. Posterior regularization for structured latent variable models. JMLR, 11:2001–2049.
Artur S d'Avila Garcez, Krysia Broda, and Dov M Gabbay. 2012. Neural-symbolic learning systems: foundations and applications. Springer Science & Business Media.
Geoffrey Hinton, Li Deng, Dong Yu, George E Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N Sainath, et al. 2012. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. Signal Processing Magazine, IEEE, 29(6):82–97.
Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.
Minqing Hu and Bing Liu. 2004. Mining and summarizing customer reviews. In Proc. of KDD, pages 168–177. ACM.
Matthew J Johnson, David Duvenaud, Alexander B Wiltschko, Sandeep R Datta, and Ryan P Adams. 2016. Structured VAEs: Composing probabilistic graphical models and variational autoencoders. arXiv preprint arXiv:1603.06277.
Theofanis Karaletsos, Serge Belongie, Cornell Tech, and Gunnar Rätsch. 2016. Bayesian representation learning with oracle constraints. In Proc. of ICLR.
Yoon Kim. 2014. Convolutional neural networks for sentence classification. Proc. of EMNLP.
Diederik P Kingma and Max Welling. 2014. Auto-encoding variational Bayes. In Proc. of ICLR.
Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012. ImageNet classification with deep convolutional neural networks. In Proc. of NIPS, pages 1097–1105.
Tejas D Kulkarni, William F Whitney, Pushmeet Kohli, and Josh Tenenbaum. 2015. Deep convolutional inverse graphics network. In Proc. of NIPS, pages 2530–2538.
Brenden M Lake, Ruslan Salakhutdinov, and Joshua B Tenenbaum. 2015. Human-level concept learning through probabilistic program induction. Science, 350(6266):1332–1338.
Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016. Neural architectures for named entity recognition. In Proc. of NAACL.
Quoc V Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. Proc. of ICML.
Percy Liang, Hal Daumé III, and Dan Klein. 2008. Structure compilation: trading structure for features. In Proc. of ICML, pages 592–599. ACM.
Percy Liang, Michael I Jordan, and Dan Klein. 2009. Learning from measurements in exponential families. In Proc. of ICML, pages 641–648. ACM.
David Lopez-Paz, Léon Bottou, Bernhard Schölkopf, and Vladimir Vapnik. 2016. Unifying distillation and privileged information. Proc. of ICLR.
Gang Luo, Xiaojiang Huang, Chin-Yew Lin, and Zaiqing Nie. 2015. Joint named entity recognition and disambiguation. In Proc. of EMNLP.
Xuezhe Ma and Eduard Hovy. 2016. End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. In Proc. of ACL.
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Proc. of NIPS, pages 3111–3119.
Marvin Minsky. 1980. Learning meaning. Technical Report, AI Lab Memo, Project MAC, MIT.
Anh Nguyen, Jason Yosinski, and Jeff Clune. 2015. Deep neural networks are easily fooled: High confidence predictions for unrecognizable images. In Proc. of CVPR, pages 427–436. IEEE.
Bo Pang and Lillian Lee. 2005. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proc. of ACL, pages 115–124.
Jeffrey Pennington, Richard Socher, and Christopher D Manning. 2014. GloVe: Global vectors for word representation. In Proc. of EMNLP, volume 14, pages 1532–1543.
Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. 2014. Stochastic backpropagation and approximate inference in deep generative models. Proc. of ICML.
Danilo Jimenez Rezende, Shakir Mohamed, Ivo Danihelka, Karol Gregor, and Daan Wierstra. 2016. One-shot generalization in deep generative models. arXiv preprint arXiv:1603.05106.
Matthew Richardson and Pedro Domingos. 2006. Markov logic networks. Machine Learning, 62(1-2):107–136.
David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. 2016. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489.
Richard Socher, Alex Perelygin, Jean Y Wu, Jason Chuang, Christopher D Manning, Andrew Y Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proc. of EMNLP, volume 1631, page 1642. Citeseer.
Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. 2014. Intriguing properties of neural networks. Proc. of ICLR.
Erik F Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proc. of CoNLL, pages 142–147. Association for Computational Linguistics.
Geoffrey G Towell, Jude W Shavlik, and Michiel O Noordewier. 1990. Refinement of approximate domain theories by knowledge-based neural networks. In Proceedings of the Eighth National Conference on Artificial Intelligence, pages 861–866. Boston, MA.
Sida Wang and Christopher Manning. 2013. Fast dropout training. In Proc. of ICML, pages 118–126.
Bishan Yang and Claire Cardie. 2014. Context-aware learning for sentence-level sentiment analysis with posterior regularization. In Proc. of ACL, pages 325–335.
Wenpeng Yin and Hinrich Schutze. 2015. Multichannel variable-size convolution for sentence classification. Proc. of CoNLL.
Matthew D Zeiler. 2012. Adadelta: an adaptive learning rate method. arXiv preprint arXiv:1212.5701.
Ye Zhang, Stephen Roller, and Byron Wallace. 2016. MGNC-CNN: A simple approach to exploiting multiple word embeddings for sentence classification. Proc. of NAACL.
Jun Zhu, Ning Chen, and Eric P Xing. 2014. Bayesian inference with posterior regularization and applications to infinite latent SVMs. JMLR, 15(1):1799–1847.
Harnessing Deep Neural Networks with Logic Rules:
Supplementary Material

A Solving Problem Eq.(3), Section 3

We provide the detailed derivation for solving the problem in Eq.(3), Section 3, which we repeat here:

  min_{q,ξ≥0}  KL(q(Y|X) ∥ pθ(Y|X)) + C Σ_{l,gl} ξ_{l,gl}
  s.t.  λl (1 − E_q[r_{l,gl}(X, Y)]) ≤ ξ_{l,gl},
        gl = 1, . . . , Gl,  l = 1, . . . , L.          (A.1)

The following derivation is largely adapted from (Ganchev et al., 2010) to the logic rule constraint setting, with some reformulation that produces a closed-form solution.

The Lagrangian is

  max_{µ≥0, η≥0, α≥0}  min_{q(Y), ξ}  L,          (A.2)

where

  L = KL(q(Y|X) ∥ pθ(Y|X)) + Σ_{l,gl} (C − µ_{l,gl}) ξ_{l,gl}
      + Σ_{l,gl} η_{l,gl} ( E_q[λl (1 − r_{l,gl}(X, Y))] − ξ_{l,gl} ) + α ( Σ_Y q(Y|X) − 1 ).          (A.3)

Setting the derivatives of L to zero, we obtain

  ∇_q L = log q(Y|X) + 1 − log pθ(Y|X) + Σ_{l,gl} η_{l,gl} λl (1 − r_{l,gl}(X, Y)) + α = 0
  ⟹ q(Y|X) = pθ(Y|X) exp{ − Σ_{l,gl} η_{l,gl} λl (1 − r_{l,gl}(X, Y)) } / ( e · exp(α) ),          (A.4)

  ∇_{ξ_{l,gl}} L = C − µ_{l,gl} − η_{l,gl} = 0  ⟹  µ_{l,gl} = C − η_{l,gl},          (A.5)

  ∇_α L = Σ_Y pθ(Y|X) exp{ − Σ_{l,gl} η_{l,gl} λl (1 − r_{l,gl}(X, Y)) } / ( e · exp(α) ) − 1 = 0
  ⟹ α = log( Σ_Y pθ(Y|X) exp{ − Σ_{l,gl} η_{l,gl} λl (1 − r_{l,gl}(X, Y)) } / e ).          (A.6)

Let Zη = Σ_Y pθ(Y|X) exp{ − Σ_{l,gl} η_{l,gl} λl (1 − r_{l,gl}(X, Y)) }. Plugging α into L gives

  L = − log Zη + Σ_{l,gl} (C − µ_{l,gl}) ξ_{l,gl} − Σ_{l,gl} η_{l,gl} ξ_{l,gl} = − log Zη,          (A.7)

where the slack terms vanish by Eq.(A.5). Since Zη monotonically decreases as η increases, and since Eq.(A.5) together with µ ≥ 0 implies η_{l,gl} ≤ C, the dual problem

  max_{0 ≤ η ≤ C}  − log Zη          (A.8)

is solved by η*_{l,gl} = C. Plugging Eq.(A.8) into Eq.(A.4) (with the normalization absorbing the α term), we obtain the solution of q given in Eq.(4).

B Identifying Lists for NER

We design a simple pattern-matching method to identify lists and counterparts in the NER task. We aim for high precision and do not expect high recall. In particular, we only retrieve lists with the pattern "1. ... 2. ... 3. ..." (i.e., items indexed by numbers) or "- ... - ... - ..." (i.e., each item marked with "-"). We require at least 3 items to form a list.

We further require the text of each item to follow certain patterns, to ensure that the text is highly likely to consist of named entities, and we rule out lists whose item text is largely free text. Specifically, we require that 1) all words of the item text start with capital letters; and 2) referring to the text between punctuation marks as a "block", each block includes no more than 3 words.

We detect both intra-sentence lists and inter-sentence lists in documents. We found the above patterns to be effective for identifying true lists. A better list detection method is expected to further improve our NER results.
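For illustration, a rough Python sketch of this kind of pattern matching is given below. The exact patterns and thresholds beyond what is described above are our assumptions, so this should be read as an approximation of the procedure rather than the authors' implementation.

```python
import re

def is_entity_like(item_text, max_block_words=3):
    """Heuristic filter from Section B: every word capitalized, and each
    punctuation-separated block has at most 3 words."""
    words = item_text.split()
    if not words or not all(w[0].isupper() for w in words):
        return False
    blocks = re.split(r"[,;:()\-]", item_text)
    return all(len(b.split()) <= max_block_words for b in blocks if b.strip())

def find_numbered_lists(text, min_items=3):
    """Retrieve lists of the form '1. ... 2. ... 3. ...' and keep only those
    whose items look like named entities."""
    items = re.findall(r"\d+\.\s*([^\d]+?)(?=\s*\d+\.|$)", text)
    items = [it.strip().rstrip(",.") for it in items]
    if len(items) >= min_items and all(is_entity_like(it) for it in items):
        return items
    return []

print(find_numbered_lists("1. Juventus, 2. Barcelona, 3. Real Madrid"))
# -> ['Juventus', 'Barcelona', 'Real Madrid']
```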

References
Ganchev, K., Graça, J., Gillenwater, J., and Taskar, B. (2010). Posterior regularization for
structured latent variable models. JMLR, 11:2001–2049.
