Learning Algorithms via Neural Logic Networks

Ali Payani 1 Faramarz Fekri 1

arXiv:1904.01554v1 [cs.LG] 2 Apr 2019

Abstract

We propose a novel learning paradigm for Deep Neural Networks (DNN) by using Boolean logic algebra. We first present the basic differentiable operators of a Boolean system, such as conjunction, disjunction and exclusive-OR, and show how these elementary operators can be combined in a simple and meaningful way to form Neural Logic Networks (NLNs). We examine the effectiveness of the proposed NLN framework in learning Boolean functions and discrete-algorithmic tasks. We demonstrate that, in contrast to the implicit learning of an MLP, the proposed neural logic networks learn logical functions explicitly, in a form that can be verified and interpreted by humans. In particular, we propose a new framework for learning inductive logic programming (ILP) problems by exploiting the explicit representational power of NLN. We show that the proposed neural ILP solver is capable of feats such as predicate invention and recursion and can outperform the current state-of-the-art neural ILP solvers on a variety of benchmark tasks such as decimal addition and multiplication, and sorting of ordered lists.

1 Department of Electrical and Computer Engineering, Georgia Institute of Technology. Correspondence to: Ali Payani <[email protected]>, Faramarz Fekri <[email protected]>.

1. Introduction

Deep Neural Networks (DNNs) based on Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) have improved the state of the art in various areas such as natural language processing (Collobert & Weston, 2008), image and video processing (Krizhevsky et al., 2012), and speech recognition (Dahl et al., 2012), just to name a few. However, while in theory it is known that DNNs, and specifically RNNs, can be Turing complete and capable of learning any program (Siegelmann & Sontag, 1992), there has been limited success in using DNNs for learning algorithmic problems. Even a rather simple decimal multiplication problem is very difficult to learn just by providing the model with input/output pairs of examples (Kaiser & Sutskever, 2015). In particular, MLP-based models come with some limitations. These models, in general, do not construct any explicit and symbolic representation of the algorithm they have learned; the algorithm is implicitly stored in thousands or even millions of weights, which is typically impossible for a human to decipher and verify. Further, MLP networks are usually suitable for cases where there are many training examples, and they usually do not generalize well when only limited training examples are available. One of the most successful machine learning approaches that addresses these shortcomings for learning discrete algorithmic tasks is Inductive Logic Programming (ILP). In ILP, explicit rules and symbolic logical representations can be learned using only a few training examples, and the resulting models are usually able to generalize well. Further, the explicit symbolic representation obtained via ILP can be understood and verified by humans, and can also be used to write programs in any conventional programming language. Recently there have been some attempts to bridge the gap between the two disciplines and to use deep learning methods for solving ILP problems. These works usually rely on some form of transforming the ILP satisfiability problem into a differentiable problem, which in turn can be solved by gradient descent algorithms (Hölldobler et al., 1999; Bader et al., 2008; França et al., 2014; Serafini & Garcez, 2016; Evans & Grefenstette, 2018).

In this paper we present an alternative approach to the traditional MLP design for learning Boolean functions and aim to address some of the shortcomings of the MLP for learning discrete-algorithmic tasks. Our key idea is to define a set of differentiable Boolean operators that can be combined in a multi-layer cascade design, like an MLP, and are capable of computing and learning Boolean functions. Unlike an MLP, our proposed model provides an explicit symbolic representation which can be tested and verified by humans. We further demonstrate that the proposed approach can be used to transform ILP into a differentiable problem and solve it using gradient optimizers more efficiently than the existing neural ILP solvers.

The general idea of representing and learning Boolean functions using neural networks is not new. There is a significant body of research from the early days of machine learning with neural networks that focuses on the theoretical aspects of this problem.
Some special Boolean functions, such as parity-N and XOR, have been the subject of particular interest as benchmark tasks for theoretical analysis. Minsky and Papert (Minsky & Papert, 2017; Wasserman, 1989), for example, showed the impossibility of representing all functional dependences with a single-layer network and proved this for the XOR logic function, while other works demonstrated the possibility of doing so by adding hidden layers. From the practical standpoint, however, as suggested by many works (for example (Steinbach & Kohut, 2002)), any Boolean function can be learned by a proper multi-layer design equipped with proper activation functions. However, as we will show in the following sections, there are scenarios where such networks do not perform well. Moreover, even if they learn successfully, it is notoriously difficult to decipher the actual learned Boolean function. In contrast, in this paper, we propose a new design for the logical operators, namely Neural Logic Networks (NLN), that uses membership weights without adjustable bias terms. The NLN network has an explicit representational power which separates the proposed models from previous works. In this paper, first, we introduce general-purpose conjunction, disjunction and exclusive-OR neurons as the basic elements of the NLN. We then demonstrate the properties and characteristics of the proposed model in three areas:

• Learning Boolean functions efficiently: In Section 3, we demonstrate how the NLN compares to the MLP in learning Boolean functions.

• Generalization: In Section 4, we compare the generalization performance of NLN and MLP in learning a message passing decoding algorithm for Low Density Parity Check (LDPC) codes over erasure channels.

• Explicit symbolic representation: In Section 5, we propose a new algorithm for solving ILP problems by exploiting the explicit representational power of NLN.

2. Neural Logic Layers

2.1. Neural Conjunction and Disjunction Layers

Throughout this paper, we use the extension of Boolean values to real values in the range [0, 1], and we use 1 (True) and 0 (False) as the two states of a binary variable. We also define the fuzzy unary and dual Boolean functions of two Boolean variables x and y as:

    x̄ = 1 − x ,   x ∧ y = x y                                           (1a)
    x ∨ y = 1 − (1 − x)(1 − y)                                           (1b)

Figure 1: Truth tables of F_c(·) (a) and F_d(·) (b).

    (a)  x_i  m_i | F_c           (b)  x_i  m_i | F_d
          0    0  |  1                  0    0  |  0
          0    1  |  0                  0    1  |  0
          1    0  |  1                  1    0  |  0
          1    1  |  1                  1    1  |  1

This algebraic representation of Boolean logic allows us to manipulate logical expressions via algebra. Let x^n ∈ {0, 1}^n be the input vector of a typical logical neuron. To implement the conjunction function, we would like to select a subset of x^n and apply the fuzzy conjunction (i.e. multiplication) to the selected elements. One way to accomplish this is to use a softmax function and select the elements that belong to the conjunction function, similar to the concept of pointer networks (Vinyals et al., 2015). This requires knowing the number of items in the subset (i.e. the number of terms in the conjunction function) in advance. Moreover, in our experiments we found that the convergence of a model using this approach is very slow for larger input vectors. Alternatively, we associate a trainable Boolean membership weight m_i with each input element x_i of the vector x^n. Further, we define a Boolean function F_c(x_i, m_i) with the truth table given in Fig. 1a, which is able to include (or exclude) each element in (or out of) the conjunction function. This design ensures the incorporation of each element x_i in the conjunction function only when the corresponding membership weight is 1. Consequently, the neural conjunction function can be defined as:

    O_conj(x) = Π_{i=1}^{n} F_c(x_i, m_i) ,
    where  F_c(x_i, m_i) = 1 − m_i (1 − x_i)   (i.e. the complement of x̄_i ∧ m_i)          (2)

where O_conj is the output of the conjunction neuron. To ensure that the trainable membership weights remain in the range [0, 1] we use a sigmoid function, i.e. m_i = sigmoid(c w_i), where c ≥ 1 is a constant. Similar to perceptron layers, we can stack m neural conjunction neurons to create a conjunction layer of size m. This layer has the same complexity as a typical perceptron layer, without incorporating any bias term. More importantly, this way of implementing the conjunction layer makes it possible to interpret the learned Boolean function directly from the values of the membership weights.

The disjunction neuron can be defined similarly by introducing membership weights, but using the function F_d with the truth table depicted in Fig. 1b. This function ensures an output of 0 for each element whose membership is zero, which corresponds to excluding the element x_i from the neuron outcome. Therefore, the neural disjunction function can be expressed as:

    O_disj(x) = ∨_{i=1}^{n} F_d(x_i, m_i) = 1 − Π_{i=1}^{n} (1 − F_d(x_i, m_i)) ,
    where  F_d(x_i, m_i) = x_i m_i                                                          (3)
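For concreteness, the following minimal NumPy sketch implements the fuzzy operators of Eq. (1) and the conjunction and disjunction neurons of Eqs. (2) and (3). It is an illustration only, not the released implementation: the function names, the scaling constant c and the example weights are our own choices.

```python
import numpy as np

def membership(w, c=2.0):
    # m_i = sigmoid(c * w_i) keeps the membership weights in [0, 1].
    return 1.0 / (1.0 + np.exp(-c * w))

def neural_conjunction(x, w, c=2.0):
    # Eq. (2): O_conj(x) = prod_i (1 - m_i * (1 - x_i))
    m = membership(w, c)
    return np.prod(1.0 - m * (1.0 - x))

def neural_disjunction(x, w, c=2.0):
    # Eq. (3): O_disj(x) = 1 - prod_i (1 - m_i * x_i)
    m = membership(w, c)
    return 1.0 - np.prod(1.0 - m * x)

# With near-binary memberships the neurons reduce to ordinary AND/OR over the
# selected inputs, e.g. selecting x1 and x3 out of a 4-bit input:
x = np.array([1.0, 0.0, 1.0, 0.0])
w = np.array([10.0, -10.0, 10.0, -10.0])   # memberships roughly [1, 0, 1, 0]
print(neural_conjunction(x, w))            # close to 1.0  (x1 AND x3)
print(neural_disjunction(x, w))            # close to 1.0  (x1 OR x3)
```

Because both neurons are smooth in the underlying weights w, they can be stacked into layers and trained with standard gradient-based optimizers, which is what the DNF and CNF cascades discussed next rely on.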
By cascading a conjunction layer with a disjunction layer, we can create a multi-layer structure which is able to learn and represent Boolean functions in Disjunctive Normal Form (DNF). Similarly, we can construct the Conjunctive Normal Form (CNF) by cascading the two layers in the reverse order. The total number of possible logical functions over a Boolean input vector x ∈ {0, 1}^n is very large (i.e. 2^(2^n)). Further, in some cases, a simple clause in one of these standard forms can lead to an exponential number of clauses when expressed in the other form. For example, it is easy to verify that converting (x_1 ∨ x_2) ∧ (x_3 ∨ x_4) ∧ · · · ∧ (x_{n−1} ∨ x_n) to DNF leads to 2^(n/2) clauses. As such, using only one single form of Boolean network for learning all possible Boolean functions is not always the best approach. The general-purpose design of the proposed conjunction and disjunction layers allows us to define the appropriate Boolean structure for the problem at hand.

2.2. Convergence and Initialization

For a single Boolean layer, it can easily be shown that, using a small enough learning rate and given counter-examples in each training batch, the membership weights are guaranteed to converge. For example, by examining the conjunction function in (2), it is easy to verify that if m_i is supposed to be 1, we need a training example with x_i = 0 and O_conj = 1 to obtain the negative gradient necessary for adjusting m_i towards 1. This can easily be verified considering that ∂O_conj/∂m_i ∝ (x_i − 1).

The only parameter which we need to adjust for training these layers is the initial values of the membership weights m_i (or the corresponding w_i). During the experiments, we realized that while the speed of convergence somewhat depends on the initial values of the weights, in moderate-size problems the network is able to find the optimal setting and converges to the desired output. As such, we usually initialize all the weights randomly using a normal distribution with zero mean. However, in cases where the dimension of the input vector is very large, this type of initialization may result in very slow convergence in the beginning. Due to the multiplicative design of these layers, when many of the membership variables have values which are not zero or one, the gradient can become extremely small. To avoid this situation, we must ensure that most of the membership variables are almost zero in the beginning. In our experiments we usually initialize the membership weights by randomly setting a small subset of inputs to values close to 1, and we initialize the rest of the membership variables to very small constants (e.g. 1e−3). Alternatively, we can initialize the weights by a normal distribution with a negative mean, which needs to be adjusted correctly depending on the size of the layer.

2.3. Neural XOR Layer

Exclusive OR (XOR) is another important Boolean function which has been the subject of much research over the years, especially in the context of the parity-N learning problem. It is easy to verify that expressing the XOR of an n-dimensional input vector in DNF form requires 2^(n−1) clauses. Although it is known that XOR cannot be implemented using a single perceptron layer (Minsky & Papert, 2017; Duch, 2006), it can be implemented, for example, using a multilayer perceptron or multiplicative networks combined with small threshold values and sign activation functions (Iyoda et al., 2003). However, none of these approaches allows for an explicit representation of the learned XOR function. Here we propose a new algorithm for learning the XOR function (or equivalently the parity-N problem). To form the logical XOR neuron, we first define k functions of the form:

    f_1(x) = x_1 + x_2 + · · · + x_k − x_{k+1} − · · · − x_n
    f_2(x) = x_1 + x_2 + · · · − x_k − · · · − x_{n−1} + x_n
    ...
    f_k(x) = x_1 − x_2 − · · · − x_k − x_{k+1} + · · · + x_n                                (4)

where k = n/2 (assuming n is even). Then, we define the XOR function as in Theorem 1.

Theorem 1. Given the set of k functions defined in (4), we have:

    XOR(x) = g_1(x) ∧ g_2(x) ∧ · · · ∧ g_k(x) ,                                             (5a)
    where  g_i(x) = 0 if f_i(x) = 0, and 1 otherwise.                                       (5b)

Proof. See Appendix A.

Inspired by Theorem 1, we design our XOR neuron as:

    O_XOR(x) = Π_{i=1}^{k} hs( x × (M_i ⊙ w)^T )                                            (6)

Here, hs(·) is the hard-sigmoid function, and × and ⊙ denote matrix and element-wise multiplication, respectively. Further, the vector M_i ∈ {−1, 1}^n holds the coefficients used in f_i(x), and w is the vector of membership weights. The resulting XOR logical neuron uses only one weight variable per input element. However, its complexity is k times higher than that of the conjunction and disjunction neurons for an input vector of length n = 2k.
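The combinatorial statement behind the XOR neuron can be checked by brute force. The sketch below (our own illustrative code) builds coefficient vectors M_i following the sliding block of negative signs in Eq. (4) and verifies Theorem 1 exhaustively for a small even n; it only tests the exact indicator g_i of Eq. (5b), whereas the trainable neuron of Eq. (6) replaces that indicator with a hard-sigmoid of the membership-weighted sum.

```python
import itertools
import numpy as np

def coefficient_vectors(n):
    # M_i in {-1, +1}^n: a block of k negative signs that shifts left by one
    # position from f_1 to f_k, following Eq. (4).
    k = n // 2
    Ms = []
    for i in range(1, k + 1):
        M = np.ones(n)
        M[k + 1 - i: 2 * k + 1 - i] = -1.0
        Ms.append(M)
    return Ms

def xor_via_theorem1(x, Ms):
    # Theorem 1: XOR(x) = 1 exactly when no f_i(x) = <x, M_i> vanishes.
    return int(all(np.dot(x, M) != 0 for M in Ms))

n = 6
Ms = coefficient_vectors(n)
for bits in itertools.product([0, 1], repeat=n):
    x = np.array(bits, dtype=float)
    assert xor_via_theorem1(x, Ms) == sum(bits) % 2
print("Theorem 1 holds for n =", n)
```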
3. NLN vs MLP

We now compare the performance of NLN and MLP on the task of learning Boolean functions, using two synthetic experiments.

Figure 2: Comparing MLP vs NLN for learning Boolean functions. (a) DNF task: number of test errors vs. number of training samples for the DNF (NLN) and MLP models. (b) XOR 50 task: bit error rate (BER) vs. number of training samples for the XOR logical layer and a multi-layer perceptron.

3.1. Learning DNF form

For this experiment, we randomly generate Boolean functions over 10-bit input vectors and use randomly generated batches of 50 samples as training data. We train two models: one designed via our proposed DNF network (with 200 disjunction functions), and another designed as a two-layer MLP network with a hidden layer of size 1000 and 'relu' activation, using a 'sigmoid' activation function for the output layer. We use the ADAM optimizer (Kingma & Ba, 2014) with a learning rate of 0.001 for both models and count the number of errors on 1000 randomly generated test samples. When we used a Bernoulli distribution with parameter p = 0.5 (i.e. a fair coin toss) for generating the bits of each training sample, both models quickly converged and the number of test errors dropped to zero for both. However, in many realistic discrete problems, the 0's and 1's are not usually equiprobable. As such, we next use a Bernoulli distribution with parameter p = 0.75. Fig. 2a depicts the comparative performance of the two models. The proposed DNF model converges fast and remains at 0 errors. On the contrary, the MLP model continues to generate errors. In our experiments, while the number of errors decreases as training continues for an hour, the MLP model never fully converges to the true logical function and occasionally generates some errors. While for some tasks this may be a negligible error, in some logical applications, such as the ILP task (Section 5), this behavior prevents the model from learning.

3.2. Learning XOR function

Next, we compare the two models on the much more complex task of learning the XOR logic. We use a multi-layer MLP with 'relu' activation functions in the hidden layers and a sigmoid function in the output layer, as usual. For the NLN, we use a single XOR neuron as described in Section 2.3. For small-size inputs both models quickly converge. However, for larger input vectors (n > 30) the MLP model fails to converge at all. Fig. 2b shows the average bit error over the number of training samples. The error rate for the MLP was around 0.5, which indicates that it failed to learn the XOR function. On the contrary, the XOR logic layer was able to converge and learn the objective in most of the runs. This is significant considering the fact that the number of parameters in our proposed XOR layer is equal to the input length, i.e., one membership weight per input variable.

4. Generalization

To evaluate generalization, we consider learning an iterative decoding algorithm for LDPC codes. LDPC codes are linear error-correcting codes that are widely used due to their capacity-achieving performance (Richardson et al., 2001). One popular problem in coding research is decoding these codes over the Binary Erasure Channel (BEC), where a subset of the bits in the received codeword (from the channel output) is marked as erased due to channel corruption. For the BEC, decoding of the received LDPC codeword can be performed using an iterative Message Passing (MP) algorithm that enforces the parity checks in the parity check matrix. To compare the performance of MLP vs NLN in learning a discrete-algorithmic task, we use the deep recurrent model that was introduced in (Payani & Fekri, 2018) to learn the iterative decoding using MLP and NLN.

Simply put, in message passing decoding of LDPC codes, each iteration involves a forward and a backward path. In the forward path, the content of each check node is updated via a function F which takes all the connected variable nodes as input. In the backward path, the content of each variable node is then updated via a function B which takes the signals from all the connected check nodes as input.

To compare the performance of MLP and NLN, we design the forward-backward functions (i.e., F and B) for the first model using an MLP architecture (LDPC-MLP) and for the second model using NLN (LDPC-NLN). We use a comparable number of parameters in each model (e.g. for an LDPC(3,6) code of length 48 we use a hidden dimension of size 200). In both models, we use randomly generated codewords of a regular LDPC(3,6) code of length 48 as training data and set the number of message passing iterations in training to 3 (t_max = 3). In the testing phase, we run the trained models for many more iterations to see how much each model has generalized and learned the iterative algorithm.
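The forward/backward structure described above can be summarized schematically as follows. This sketch only illustrates the unrolled message-passing loop; it is not the actual recurrent architecture of (Payani & Fekri, 2018), and the update functions F and B stand in for the learned MLP or NLN blocks.

```python
def iterative_decode(y, checks, F, B, t_max=3):
    """Schematic message-passing decoder.

    y:       received word with erasures (list of channel values)
    checks:  list of index sets, one per check node (the parity-check matrix)
    F, B:    learned check-node / variable-node update functions (MLP or NLN)
    """
    v = list(y)                                             # variable-node beliefs
    for _ in range(t_max):                                  # unrolled iterations
        c = [F([v[j] for j in row]) for row in checks]      # forward path
        v = [B([c[i] for i, row in enumerate(checks) if j in row])
             for j in range(len(v))]                        # backward path
    return v
```

Training unrolls this loop for t_max = 3 iterations; at test time the same loop is simply run longer, which is exactly the generalization behavior examined in Fig. 3 below.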
Figure 3: LDPC decoding over erasure channels: BER vs. number of message-passing iterations t for the MLP and NLN decoders.

Fig. 3 depicts the performance of the two models in terms of bit error rate (BER). As one may expect, the model based on MLP converges faster and yields a lower BER for the setup used in training, i.e., t = 3. However, increasing the number of iterations at test time not only does not improve the accuracy of LDPC-MLP, it even degrades its performance for t > 3. On the other hand, as the number of iterations increases, the performance of the LDPC-NLN model improves significantly. Arguably, there are ways to improve the performance of the MLP in such tasks (e.g., by significantly increasing the number of training iterations and forcing the network to generate a valid outcome at the end of each iteration by adding a penalty term). However, the NLN model provides a more natural way of learning such discrete-algorithmic tasks.

5. Inductive Logic Programming via NLN

One of the recent breakthroughs in solving ILP problems (especially for recursive and algorithmic tasks) is due to works such as (Cropper & Muggleton, 2015), which led to the invention of Metagol (Cropper & Muggleton, 2016), the state-of-the-art ILP solver capable of learning via predicate invention and recursion. Very recently, in (Evans & Grefenstette, 2018) the authors proposed a differentiable ILP solver (dILP) which also supports those features, but within a neural network framework. While there are some other notable works on neural ILP solvers, we mainly compare our proposed model to Metagol and dILP, since the other alternatives (for instance (Hölldobler et al., 1999; Bader et al., 2008; França et al., 2014; Serafini & Garcez, 2016)) do not support both of these important features (i.e., recursion and predicate invention) and are therefore not well suited to recursive algorithmic problems.

In this section, we introduce a new differentiable ILP solver by exploiting the explicit representational power of our NLN, which we believe is a significant improvement over dILP and is more flexible than Metagol in terms of the need for expert input.

For a more complete reference on ILP we refer the reader to (Muggleton & De Raedt, 1994; Dzeroski, 2007). Here, we give a brief background relevant to our proposed algorithm using an example problem. Logic programming is a programming paradigm in which we use formal logic (usually first-order logic) to describe relations between the facts and rules of a program domain. In this framework, rules are usually written as clauses of the form:

    H ← B_1, B_2, . . . , B_m                                                               (7)

where H is called the head of the clause and B_1, B_2, . . . , B_m is called the body of the clause. A clause of this form expresses that if all the Boolean terms in the body are true, the head is necessarily true. We assume each of the terms H and B_i is made of atoms. Each atom is created by applying an n-ary Boolean function, called a predicate, to some constants or variables. A predicate states a relation between some variables or constants in the logic program. Throughout this paper we use lowercase letters for constants and capital letters (A, B, C, ...) for variables. In most ILP systems, each predicate can be defined via several clauses of the form stated in (7), which is equivalent to the DNF logical form.

Let us consider the logic program that defines the lessThan predicate over natural numbers:

    lessThan(A, B) ← inc(A, B)
    lessThan(A, B) ← lessThan(A, C), inc(C, B)                                              (8)

and assume that our constants are the set C = {0, 1, 2, 3, 4} and that the ordering of the natural numbers is defined using the predicate inc (which defines increments of 1). The set of background atoms which describes the known facts about this problem is B = {inc(0, 1), inc(1, 2), inc(2, 3), inc(3, 4)}. We associate with each predicate two scalar functions, arity(p) and var(p), corresponding to the number of input arguments of the predicate and the number of variables that can be used in defining it. Further, we associate a Boolean function F_p with each (intensional) predicate, which defines the Boolean function corresponding to the predicate p. In the above example, arity(lessThan) = 2 and var(lessThan) = 3, and the predicate function F_lessThan can be defined over all possible atoms involving the three variables A, B, C (e.g., in (8) it is defined as F_lessThan = inc(A, B) ∨ (lessThan(A, C) ∧ inc(C, B))).

We also distinguish between extensional and intensional predicates. The former is entirely defined by the ground facts (e.g. the inc predicate in the above example), while the latter is defined using the other predicate functions (e.g. the lessThan predicate in the above example).
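To make the notation concrete, the sketch below hand-codes the predicate function F_lessThan of program (8) as a fuzzy Boolean expression over grounded atoms, using the conjunction and disjunction of Eq. (1); the dictionary representation and helper names are our own. In the proposed solver this function is not written by hand but learned by an NLN.

```python
# Fuzzy valuations of ground atoms for C = {0, 1, 2, 3, 4}: background facts
# get value 1.0, everything else starts at 0.0.
C = range(5)
inc = {(a, b): 1.0 if b == a + 1 else 0.0 for a in C for b in C}
lt = {(a, b): 0.0 for a in C for b in C}           # lessThan, to be deduced

def f_lessthan(a, b, lt, inc):
    # F_lessThan(A, B) = inc(A, B) OR (exists C: lessThan(A, C) AND inc(C, B)),
    # written with the fuzzy AND/OR of Eq. (1).
    body = inc[(a, b)]
    for c in C:
        clause = lt[(a, c)] * inc[(c, b)]           # fuzzy conjunction
        body = 1.0 - (1.0 - body) * (1.0 - clause)  # fuzzy disjunction
    return body

print(f_lessthan(0, 1, lt, inc))   # 1.0: follows directly from inc(0, 1)
print(f_lessthan(0, 2, lt, inc))   # 0.0 until lessThan facts are deduced
```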
Once we have the predicate formula (8), which describes our target predicate, we can use rules of deduction and infer all the consequences of the program using a forward chain of reasoning, i.e., we apply the target predicate rules to the constants in the program iteratively. Let P_i be the set of intensional predicates and X^(t) be the set of deduced facts at time stamp t. We infer X^(T), where T is the number of time stamps, using the recursive formula:

    X^(i) = X^(i−1) ∪ { p(a_1, . . . , a_m) | F_p(a_1, . . . , a_n) = True, a_k ∈ C, p ∈ P_i, n = var(p), m = arity(p) },

where X^(0) consists of the background facts. As an example, for the logic program lessThan we have:

    X^(0) = B = {inc(0, 1), inc(1, 2), inc(2, 3), inc(3, 4)}
    X^(1) = X^(0) ∪ {lt(0, 1), lt(1, 2), lt(2, 3), lt(3, 4)}
    X^(2) = X^(1) ∪ {lt(0, 2), lt(1, 3), lt(2, 4)}
    X^(3) = X^(2) ∪ {lt(0, 3), lt(1, 4)}
    X^(4) = X^(3) ∪ {lt(0, 4)},

where we use lt as shorthand for lessThan. Here, applying the predicate rules beyond t = 4 does not yield any new ground atom.

Given the background facts (B) and a set of positive and negative examples (P and N, respectively), the goal of ILP is to learn a program (including the target predicate and a number of possible auxiliary predicates) such that it entails all the positive examples and rejects all the negative ones. The predicate function defined in (8) is one such solution to the ILP problem, satisfying all the examples.

We use the simple lessThan logic program above as an example to explain the basics of the proposed algorithm. Assume we consider a solution for the predicate function F_lt containing at most three variables, i.e. (A, B, C). We define the function Perm(S, n) to return all the tuples of length n formed from the elements of a set S. For example, Perm({A, B}, 2) would give the set {(A, A), (A, B), (B, A), (B, B)}. Further, for any predicate p and set of variables V we define the set Terms(p, V) as:

    Terms(p, V) = {p(arg) | arg ∈ Perm(V, arity(p))}                                        (9)

For now, if we exclude the use of functions in defining predicates, the set of all the atoms that can be used in defining the target predicate can be expressed as:

    InputList(F_lt) = Terms(inc, {A, B, C}) ∪ Terms(lt, {A, B, C})                          (10)

This corresponds to the set {inc(A, A), . . . , inc(C, C), lt(A, A), . . . , lt(C, C)}.

Most proposed ILP solvers examine only a very limited subset of possible combinations to find a solution. Metagol (Cropper & Muggleton, 2016), the state-of-the-art ILP solver based on meta-interpretive learning (Cropper & Muggleton, 2015), employs user-defined meta rules to reduce the set of possible combinations of terms. This requires expert knowledge with regard to the possible forms of the solution, which is a restrictive approach. Further, this approach may require many trials to find a suitable set of meta rules.

Among the neural ILP solvers, the current state-of-the-art solver proposed by (Evans & Grefenstette, 2018) limits the possible terms to combinations containing only two atoms, which significantly reduces the space of possible solutions, and uses a softmax network to find the combination corresponding to the answer among all such two-atom combinations. While, in principle, this limitation can be alleviated by introducing more and more auxiliary predicates, this approach is not practical and requires a huge amount of memory. The sheer number of possible combinations that these algorithms need to consider makes them inviable candidates for larger-scale problems, especially when recursion and multiple steps of forward chaining are required. Consequently, in (Evans & Grefenstette, 2018) the experiments were limited to predicates with an arity of 2.

Our key idea is to employ our NLN framework to define the predicate functions corresponding to each intensional predicate (instead of limiting the possible terms using search trees or considering limited combinations, as in previous approaches). This allows for framing the ILP problem as an end-to-end neural network which can be trained via typical gradient-based optimization techniques. Further, this in general eliminates the restrictions on defining the predicate functions. In particular, in NLN we are not limited to using the DNF form for defining a predicate; we can employ any other Boolean network, such as a CNF form or XOR logic, to learn the predicate functions.

5.1. NLN based neural ILP Solver

We present our algorithm using the running lessThan example. First, we define the predicate functions for each intensional predicate (only lt here) using NLN. For example, we may use the DNF structure in NLN with a hidden layer of 4 (four disjunction terms) to define F_lt.

Next, we define the valuation vector for each predicate at time stamp t as Y_p^(t), which consists of the (fuzzy) Boolean values of all the ground atoms involving that predicate. For example, the vector Y_inc includes the Boolean values for the atoms in {inc(0, 0), inc(0, 1), . . . , inc(4, 4)}. Here we drop the t superscript, since the values of atoms from extensional predicates do not change over time.
Algorithm 1: Outline of the NLN-based neural ILP solver
    Result: Y_target^(Tmax)
    for t ∈ {1, . . . , Tmax} do
        for p ∈ P_i do
            for arg ∈ C^var(p) do
                θ = {arg_0/A, arg_1/B, arg_2/C, . . . }
                x_i = InputList_p |θ
                arg_p = (arg_0, . . . , arg_{arity(p)−1})
                Y_p[arg_p] ← Y_p[arg_p] ∨ F_p(x_i)

Algorithm 1 shows the outline of the Tmax steps of the forward chain of reasoning in the proposed ILP solver. Here θ defines a substitution (replacing variables with constants) and InputList_p |θ is a fuzzy Boolean vector formed by gathering the corresponding elements of InputList_p (after substitution of variables with constants) from the contents of the valuation vectors Y_p. In the actual TensorFlow implementation (Abadi et al., 2015) we reformulate the problem in matrix form, and before the start of training we calculate the contents of the valuation vectors belonging to all extensional predicates to speed up training. Also, while the algorithm is described sequentially, we compute all the disjunction operations in the inner-most for-loop in parallel as a batch operation, since they do not depend on each other. All the conjunction and disjunction operations in our algorithm are implemented as defined in (1).

We use the cross-entropy loss between Ŷ_target (the ground truth provided by the positive and negative examples) and Y_target^(Tmax), which is the output of Algorithm 1.

5.2. Training

We train the model using the ADAM optimizer (Kingma & Ba, 2014) with a learning rate of 0.001, and we initialize the membership weights of the NLN using the approach described in Section 2.2. After training is completed, a zero cross-entropy loss indicates that the model has been able to satisfy all the examples in the positive and negative sets. However, there can be some terms with membership weights of '1' in the definition of each predicate which are not necessary for the satisfiability of the solution. Since there is no gradient at this point, we cannot directly remove those terms during gradient descent unless we include some penalty terms. In practice we use a simpler approach: in the final stage of the algorithm we remove these terms via a simple satisfiability check, i.e., if setting any membership variable with value '1' to '0' does not change the loss function, we remove that term from the outcome. Further, because the gradient can become small at the end of convergence, to speed up the final stage of convergence we push the membership variables m_i toward binary values once the loss function falls below some threshold, by multiplying the corresponding weights w_i by a positive constant larger than 1 (e.g. 1.2 in our experiments).

5.3. Benchmark tasks

We tested the proposed algorithm on the 20 symbolic tasks described in (Evans & Grefenstette, 2018); the details of these experiments can be found in Appendices G.1 to G.20 of that paper. In Table 1 we list, for each task, the percentage of runs that resulted in a correct solution for the proposed algorithm, and compare it to the baseline methods, dILP and Metagol. Although these are rather simple tasks, as shown in Table 1, dILP cannot always find a solution for many of the problems. This can be due to the fact that the algorithm depends on the initial weights, and therefore many of the simulations may result in poor performance. Metagol, however, is a deterministic approach: it either finds a solution or fails entirely. In general, if not provided with carefully tuned meta rules, which define templates for the auxiliary and target predicates, Metagol cannot learn many of the tasks involving recursion (e.g. the Relatedness and Connectedness tasks in Table 1). In contrast, our proposed model always finds the correct solution for these tasks.

Table 1: NLN solver vs dILP and Metagol on benchmark tasks (percentage of runs with a correct solution)

    Domain/Task                  dILP   Metagol   NLN
    Arithmetic/Predecessor        100     100     100
    Arithmetic/Even               100     100     100
    Arithmetic/Even-Odd            49     100     100
    Arithmetic/Less than          100     100     100
    Arithmetic/Fizz                10     100     100
    Arithmetic/Buzz                35     100     100
    List/Member                   100     100     100
    List/Length                    93     100     100
    Family Tree/Son               100     100     100
    Family Tree/GrandParent        97     100     100
    Family Tree/Husband           100     100     100
    Family Tree/Uncle              70     100     100
    Family Tree/Relatedness       100       0     100
    Family Tree/Father            100     100     100
    Graph/Undirected Edge         100     100     100
    Graph/Adjacent to Red          51     100     100
    Graph/Two Children             95     100     100
    Graph/Graph Colouring          95     100     100
    Graph/Connectedness           100       0     100
    Graph/Cyclic                  100       0     100

5.4. Learning Decimal Arithmetic

The 20 tasks in the previous section are rather simple. For more complex tasks, for which the predicate definition requires more atoms and the arity of the predicates is higher
than two, methods such as dILP cannot be used. Even Metagol can only learn such tasks requiring recursion when the appropriate rule templates are provided by an expert. Here, we apply our method to learn more complex recursive arithmetic tasks. We first describe the addition problem for the natural number domain, and then, in a second task, we use the addition predicate as background knowledge to learn multiplication.

5.4.1. Addition Task

We use C = {0, 1, 2, 3, 4, 5} as constants, and our background knowledge consists of B = {zero(0), eq(0, 0), . . . , eq(4, 4), inc(0, 1), . . . , inc(3, 4)}, where inc defines increment and eq tests for equality. The target predicate is add(A, B, C) and we allow for the use of two additional variables (i.e., var(add) = 3 + 2 = 5). As usual, we use a DNF network for learning F_add. One of the solutions that our model finds is:

    add(A, B, C) ← zero(B), eq(A, C)
    add(A, B, C) ← add(A, D, E), inc(D, B), inc(C, E)

5.4.2. Multiplication Task

Next, we add the learned addition predicate to the background knowledge of the previous experiment and then try to learn the mul(A, B, C) predicate. One of the obtained solutions is:

    mul(A, B, C) ← zero(B), zero(C)
    mul(A, B, C) ← mul(B, A, C)
    mul(A, B, C) ← mul(A, D, E), inc(D, B), plus(E, A, C)

It is worth noting that, to the best of our knowledge, learning recursive algorithmic tasks like this using only positive and negative examples, and without using any template for defining the viable options (other than assuming the DNF form), is beyond the power of any current ILP solver. Indeed, most neural ILP solvers are either incapable of learning recursion or, like dILP, have limited scope and cannot be used to learn complex predicates. While tasks such as decimal and binary addition and multiplication can be learned with very sophisticated neural algorithms such as (Kaiser & Sutskever, 2015), these lack the generalization power of ILP and their performance drops significantly when the size of the problem grows. Furthermore, the learned algorithm is not explicit in nature, and the acquired knowledge cannot easily be transferred to another problem.

5.5. Sorting an ordered list

The sorting task is more complex than the previous tasks since it requires list semantics. We implement the list semantics by allowing the use of functions in defining predicates. For data of type list, we define two functions, H(X) and t(X), which decompose a list into its head and tail elements, i.e. A = [H(A)|t(A)]. We use the elements of {a, b, c, d} and all the lists made from permutations of up to three elements as constants in the program. We use extensional predicates such as gt (greater than), eq (equals) and lte (less than or equal) to define the ordering between the elements as part of the background knowledge. We allow for using two additional variables in defining the predicate sort(A, B). One of the solutions that our model finds is:

    sort(A, B) ← sort(H(A), C), lte(t(C), t(A)), eq(H(B), C), eq(t(A), t(B))
    sort(A, B) ← sort(H(A), C), gt(t(C), t(A)), eq(t(B), t(C)), eq(H(D), H(C)), eq(t(A), t(D)), sort(D, H(B))

To the best of our knowledge, learning a recursive solution like this, which involves clauses with 6 atoms and includes 4 variables (and their functions), is beyond the power of any existing neural ILP solver.

6. Conclusion

We have introduced NLN as a new paradigm of neural networks designed for explicit learning and representation of Boolean functions. Using various experiments we showed their effectiveness in learning logical representations. Further, we demonstrated their superior generalization compared to the traditional MLP on a discrete iterative algorithmic task. Finally, by proposing a new algorithm for learning ILP problems, we demonstrated the importance of the explicit logical representation that is achieved using NLN.

A. Proof of Theorem 1

Proof. First consider the case where the number of 1's in x is odd (i.e. XOR(x) = 1). None of the functions f_i(x) can be equal to zero, since the sum of an odd number of elements from the set {−1, 1} cannot be zero. Therefore, the statement in (5a) is true due to (5b). Now consider the case where XOR(x) is zero. We must show that at least one of the k functions f_i(x) is equal to zero in this case. Let M_i ∈ {−1, 1}^n be the vector of coefficients for f_i(x) and s be the number of ones in the input vector x. Further, for any f_i, let n_i^(1) and n_i^(−1) be the numbers of +1 and −1 coefficients, respectively, that match the positions of the '1' elements of x. We notice that the signs of exactly two elements change between M_i and M_{i+1} when we go from f_i(x) to f_{i+1}(x), and those signs remain unchanged in the subsequent functions. As we have k functions and s ≤ 2k, this guarantees that the sign of the coefficients corresponding to the '1' elements changes exactly once within the set of k functions. Thus, in one of the functions, let us say the one corresponding
to the j-th coefficient vector, we have n_j^(1) = n_1^(−1) and n_j^(−1) = n_1^(1), which means f_j(x) = −f_1(x). Since the difference between consecutive f_i can be zero or ±2, this guarantees that at some point one of the f_i's (1 ≤ i ≤ j) must be equal to zero.

In the above arguments we assumed that n is an even number. However, if n is odd, we can convert the problem to the n + 1 case by appending an extra 0 entry to the input vector x. Since 0 has no effect on the result of XOR, the above arguments still hold.

References

Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G. S., Davis, A., Dean, J., Devin, M., Ghemawat, S., Goodfellow, I., Harp, A., Irving, G., Isard, M., Jia, Y., Jozefowicz, R., Kaiser, L., Kudlur, M., Levenberg, J., Mané, D., Monga, R., Moore, S., Murray, D., Olah, C., Schuster, M., Shlens, J., Steiner, B., Sutskever, I., Talwar, K., Tucker, P., Vanhoucke, V., Vasudevan, V., Viégas, F., Vinyals, O., Warden, P., Wattenberg, M., Wicke, M., Yu, Y., and Zheng, X. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. URL https://www.tensorflow.org/. Software available from tensorflow.org.

Bader, S., Hitzler, P., and Hölldobler, S. Connectionist model generation: A first-order approach. Neurocomputing, 71(13-15):2420–2432, 2008.

Collobert, R. and Weston, J. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th International Conference on Machine Learning, pp. 160–167. ACM, 2008.

Cropper, A. and Muggleton, S. H. Logical minimisation of meta-rules within meta-interpretive learning. In Inductive Logic Programming, pp. 62–75. Springer, 2015.

Cropper, A. and Muggleton, S. H. Metagol system. https://github.com/metagol/metagol, 2016. URL https://github.com/metagol/metagol.

Dahl, G. E., Yu, D., Deng, L., and Acero, A. Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition. IEEE Transactions on Audio, Speech, and Language Processing, 20(1):30–42, 2012.

Duch, W. K-separability. In International Conference on Artificial Neural Networks, pp. 188–197. Springer, 2006.

Dzeroski, S. Inductive logic programming in a nutshell. Introduction to Statistical Relational Learning [16], 2007.

Evans, R. and Grefenstette, E. Learning explanatory rules from noisy data. Journal of Artificial Intelligence Research, 61:1–64, 2018.

França, M. V., Zaverucha, G., and Garcez, A. S. d. Fast relational learning using bottom clause propositionalization with artificial neural networks. Machine Learning, 94(1):81–104, 2014.

Hölldobler, S., Kalinke, Y., and Störr, H.-P. Approximating the semantics of logic programs by recurrent neural networks. Applied Intelligence, 11(1):45–58, 1999.

Iyoda, E. M., Nobuhara, H., and Hirota, K. A solution for the n-bit parity problem using a single translated multiplicative neuron. Neural Processing Letters, 18(3):233–238, 2003.

Kaiser, Ł. and Sutskever, I. Neural GPUs learn algorithms. arXiv preprint arXiv:1511.08228, 2015.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.

Krizhevsky, A., Sutskever, I., and Hinton, G. E. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105, 2012.

Minsky, M. and Papert, S. A. Perceptrons: An Introduction to Computational Geometry. MIT Press, 2017.

Muggleton, S. and De Raedt, L. Inductive logic programming: Theory and methods. The Journal of Logic Programming, 19:629–679, 1994.

Payani, A. and Fekri, F. Decoding LDPC codes on binary erasure channels using deep recurrent neural-logic layers. In Turbo Codes and Iterative Information Processing (ISTC), 2018 International Symposium On. IEEE, 2018.

Richardson, T. J., Shokrollahi, M. A., and Urbanke, R. L. Design of capacity-approaching irregular low-density parity-check codes. IEEE Transactions on Information Theory, 47(2):619–637, 2001.

Serafini, L. and Garcez, A. d. Logic tensor networks: Deep learning and logical reasoning from data and knowledge. arXiv preprint arXiv:1606.04422, 2016.

Siegelmann, H. T. and Sontag, E. D. On the computational power of neural nets. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory, pp. 440–449. ACM, 1992.

Steinbach, B. and Kohut, R. Neural networks - a model of Boolean functions. In Boolean Problems, Proceedings of the 5th International Workshop on Boolean Problems, pp. 223–240, 2002.
Vinyals, O., Fortunato, M., and Jaitly, N. Pointer networks. In Advances in Neural Information Processing Systems, pp. 2692–2700, 2015.

Wasserman, P. D. Neural Computing: Theory and Practice. Van Nostrand Reinhold Co., 1989.
