Figure 2: Comparing MLP vs. NLN for learning Boolean functions (test errors vs. number of training samples). (a) DNF task. (b) XOR-50 task.
For this experiment, we randomly generate Boolean functions over 10-bit input vectors and use randomly generated batches of 50 samples as training data. We train two models: one designed via our proposed DNF network (with 200 disjunction functions) and another designed as a two-layer MLP with a hidden layer of size 1000, 'relu' activation in the hidden layer, and a 'sigmoid' activation in the output layer. We use the ADAM optimizer (Kingma & Ba, 2014) with a learning rate of 0.001 for both models and count the number of errors on 1000 randomly generated test samples. When we used a Bernoulli distribution with parameter p = 0.5 (i.e., a fair coin toss) to generate the bits of each training sample, both models quickly converged and the number of test errors dropped to zero. However, in many realistic discrete problems the 0's and 1's are not equiprobable. As such, we next use a Bernoulli distribution with parameter p = 0.75. Fig. 2a depicts the comparative performance of the two models. The proposed DNF model converges fast and remains at zero error. On the contrary, the MLP model continues to generate errors. In our experiments, while the number of errors decreases as training continues for an hour, the MLP model never fully converges to the true logical function and occasionally generates errors. While for some tasks this may be a negligible error, in logical applications such as the ILP task in Section 5 this behavior prevents the model from learning.
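For concreteness, the sketch below shows one way the training data for this experiment could be generated: a random DNF target over 10-bit inputs sampled from a biased Bernoulli distribution. The helper names (dnf_label, make_batch) and the specific clause construction are ours; the paper does not specify the random functions further.

    import numpy as np

    rng = np.random.default_rng(0)
    n_bits, n_clauses = 10, 5

    # A random DNF target: each clause is a conjunction of a few literals (variable, polarity).
    clauses = [[(int(v), int(rng.integers(2)))
                for v in rng.choice(n_bits, size=3, replace=False)]
               for _ in range(n_clauses)]

    def dnf_label(x):
        # Evaluate the random DNF on a 0/1 matrix x of shape (batch, n_bits).
        clause_vals = []
        for clause in clauses:
            lits = [x[:, v] if pol else 1 - x[:, v] for v, pol in clause]
            clause_vals.append(np.prod(lits, axis=0))                  # conjunction of literals
        return (np.sum(clause_vals, axis=0) > 0).astype(np.float32)    # disjunction of clauses

    def make_batch(batch_size=50, p=0.75):
        # Biased Bernoulli(p) inputs, i.e., the non-equiprobable setting above.
        x = (rng.random((batch_size, n_bits)) < p).astype(np.float32)
        return x, dnf_label(x)

    x_train, y_train = make_batch()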
3.2. Learning the XOR function

Next, we compare the two models on the much more complex task of learning the XOR logic. We use a multi-layer MLP with 'relu' activations in the hidden layers and, as usual, a sigmoid in the output layer. For NLN, we use a single XOR neuron as described in Section 2.3. For small input sizes, both models quickly converge. However, for larger input vectors (n > 30) the MLP model fails to converge at all. Fig. 2b shows the average bit error over the number of training samples. The error rate for the MLP was around 0.5, which indicates that it failed to learn the XOR function. On the contrary, the XOR logic layer was able to converge and learn the objective in most of the runs. This is significant considering that the number of parameters in our proposed XOR layer equals the input length, i.e., one membership per input variable.
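Since Section 2.3 is not reproduced in this excerpt, the following is only a minimal sketch of one way an XOR layer with a single membership weight per input could be parameterized; the product-form soft parity below is our illustrative choice, not necessarily the formulation used in the paper.

    import numpy as np

    def soft_xor(x, m):
        # Soft parity of the inputs selected by memberships m in [0, 1].
        # x: inputs in [0, 1], shape (batch, n); m: memberships, shape (n,).
        # An input with m_i = 0 is ignored; with m_i = 1 it fully participates.
        signs = 1.0 - 2.0 * (m * x)                  # selected bits map to -1/+1
        return 0.5 * (1.0 - np.prod(signs, axis=-1))

    # With hard 0/1 inputs and memberships this reduces to the exact parity:
    x = np.array([[1, 0, 1, 1]], dtype=float)
    m = np.array([1, 1, 1, 0], dtype=float)          # the last input is ignored
    print(soft_xor(x, m))                            # -> [0.], since 1 XOR 0 XOR 1 = 0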
4. Generalization

To evaluate generalization, we consider learning an iterative decoding algorithm for LDPC codes. LDPC codes are linear error-correcting codes that are widely used due to their capacity-achieving performance (Richardson et al., 2001). One popular problem in coding research is decoding these codes over the Binary Erasure Channel (BEC), where a subset of the bits in the received codeword (the channel output) is marked as erased due to channel corruption. For the BEC, decoding of the received LDPC codeword can be performed by an iterative Message Passing (MP) algorithm that enforces the parity checks of the parity check matrix. To compare the performance of MLP vs. NLN in learning a discrete-algorithmic task, we use the deep recurrent model introduced in (Payani & Fekri, 2018) to learn this iterative decoding using MLP and NLN.

Simply put, in message passing decoding of LDPC codes, each iteration involves a forward and a backward path. In the forward path, the content of each check node is updated via a function F which takes all the connected variable nodes as input. In the backward path, the content of each variable node is updated via a function B which takes the signals from all the connected check nodes as input.

To compare the performance of MLP and NLN, we design the forward-backward functions (i.e., F and B) of the first model using an MLP architecture (LDPC-MLP) and of the second model using NLN (LDPC-NLN). We use a comparable number of parameters in each model (e.g., for an LDPC(3,6) code of length 48 we use a hidden dimension of size 200). In both models, we use randomly generated codewords of a regular LDPC(3,6) code of length 48 as training data and set the number of message passing iterations during training to 3 (t_max = 3). In the testing phase, we run the trained models for many more iterations to see how well each model has generalized and learned the iterative algorithm.
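The skeleton below illustrates the recurrent structure shared by the two decoders: a forward check-node update F and a backward variable-node update B applied over the parity check matrix for t_max iterations. The function names, the toy stand-ins for F and B, and the erasure convention (erased bits set to 0.5) are our assumptions for illustration, not the exact architecture of (Payani & Fekri, 2018).

    import numpy as np

    def decode(y, H, F, B, t_max=3):
        # y: received word in [0, 1]^n, erased positions set to 0.5
        # H: (m, n) binary parity-check matrix; F, B: learnable update functions
        m, n = H.shape
        x = y.copy()
        for _ in range(t_max):
            # forward path: each check node aggregates its connected variable nodes
            c = np.array([F(x[H[j] == 1]) for j in range(m)])
            # backward path: each variable node aggregates its connected check nodes
            x = np.array([B(y[i], c[H[:, i] == 1]) for i in range(n)])
        return x

    # Toy stand-ins for the learnable blocks (an MLP or an NLN block in the two models):
    F = lambda v: 0.5 - 0.5 * np.prod(1.0 - 2.0 * v)                  # soft parity of the neighborhood
    B = lambda y_i, msgs: np.clip(0.5 * y_i + 0.5 * msgs.mean(), 0.0, 1.0)

    H = np.array([[1, 1, 0, 1], [0, 1, 1, 1]])
    y = np.array([1.0, 0.5, 0.0, 0.5])          # two erased positions
    x_hat = decode(y, H, F, B)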
Figure 3: LDPC decoding over the Binary Erasure Channel: BER vs. number of iterations t for the MLP and NLN decoders.

Fig. 3 depicts the performance of the two models in terms of bit error rate (BER). As one may expect, the model based on MLP converges faster and achieves a lower BER for the setup used in training, i.e., t = 3. However, increasing the number of iterations at test time not only does not improve the accuracy of LDPC-MLP, it even degrades its performance for t > 3. On the other hand, as the number of iterations increases, the performance of the LDPC-NLN model improves significantly. Arguably, there are ways to improve the performance of the MLP in such tasks (e.g., by significantly increasing the number of training iterations and by enforcing the network to generate a valid outcome at the end of each iteration through a penalty term). However, the NLN model provides a more natural way of learning such discrete-algorithmic tasks.
5. Inductive Logic Programming via NLN
One of the recent breakthroughs in solving ILP problems (especially for recursive and algorithmic tasks) is due to works such as (Cropper & Muggleton, 2015), which led to the invention of Metagol (Cropper & Muggleton, 2016), the state-of-the-art ILP solver capable of learning via predicate invention and recursion. Very recently, the authors of (Evans & Grefenstette, 2018) proposed a differentiable ILP (dILP) which also supports these features, but within a neural network framework. While there are other notable works on neural ILP solvers, we mainly compare our proposed model to Metagol and dILP, since the other alternatives (for instance (Hölldobler et al., 1999; Bader et al., 2008; França et al., 2014; Serafini & Garcez, 2016)) do not support both of these important features (i.e., recursion and predicate invention) and are therefore not well suited to recursive algorithmic problems.

In this section, we introduce a new differentiable ILP solver that exploits the explicit representational power of our NLN, which we believe is a significant improvement over dILP and is more flexible than Metagol in terms of the need for expert input.

For a more complete reference on ILP we refer the reader to (Muggleton & De Raedt, 1994; Dzeroski, 2007). Here, we give a brief background relevant to our proposed algorithm using an example problem. Logic programming is a programming paradigm in which we use formal logic (usually first-order logic) to describe relations between facts and rules of a program domain. In this framework, rules are usually written as clauses of the form:

H ← B_1, B_2, ..., B_m    (7)

where H is called the head of the clause and B_1, B_2, ..., B_m is called the body of the clause. A clause of this form expresses that if all the Boolean terms in the body are true, the head is necessarily true. We assume that the head H and the body terms B_i are made of atoms. Each atom is created by applying an n-ary Boolean function, called a predicate, to some constants or variables. A predicate states the relation between some variables or constants in the logic program. Throughout this paper we will use lowercase letters for constants and capital letters (A, B, C, ...) for variables. In most ILP systems, each predicate can be defined via several clauses of the form stated in (7), which is equivalent to the DNF logical form.

Let's consider the logic program that defines the lessThan predicate over natural numbers:

lessThan(A, B) ← inc(A, B)
lessThan(A, B) ← lessThan(A, C), inc(C, B)    (8)

and assume that our constants are the set C = {0, 1, 2, 3, 4} and that the ordering of the natural numbers is defined via the predicate inc (which defines increments of 1). The set of background atoms which describe the known facts about this problem is B = {inc(0, 1), inc(1, 2), inc(2, 3), inc(3, 4)}. We associate two scalar functions arity(p) and var(p) with each predicate p, corresponding to the number of its input arguments and the number of variables that can be used in defining it. Further, we associate a Boolean function F_p with each (intensional) predicate, which defines the Boolean function corresponding to the predicate p. In the above example, arity(lessThan) = 2 and var(lessThan) = 3, and the predicate function F_lessThan can be defined over all possible atoms which involve the three variables A, B, C (e.g., in (8) it is defined as F_lessThan = inc(A, B) ∨ (lessThan(A, C) ∧ inc(C, B))).

We also distinguish between extensional and intensional predicates. The former is entirely defined by the ground facts (e.g., the inc predicate in the above example), while the latter is defined using the other predicate functions (e.g., the lessThan predicate in the above example). Once we have the predicate formula (8) which describes our target predicate, we can use rules of deduction and infer all the consequences of the program using the forward chain of reasoning, i.e., we apply the target predicate rules to the constants in the program iteratively. Let P_i be the set of intensional predicates and X^(t) the set of deduced facts at time stamp t.
We infer X^(T), where T is the number of time stamps, using the recursive formula:

X^(t) = X^(t-1) ∪ { p(a_1, ..., a_m) | F_p(a_1, ..., a_n) = True, a_k ∈ C, p ∈ P_i, n = var(p), m = arity(p) },

where X^(0) consists of the background facts. As an example, for the logic program lessThan we will have:

X^(0) = B = {inc(0, 1), inc(1, 2), inc(2, 3), inc(3, 4)}
X^(1) = X^(0) ∪ {lt(0, 1), lt(1, 2), lt(2, 3), lt(3, 4)}
X^(2) = X^(1) ∪ {lt(0, 2), lt(1, 3), lt(2, 4)}
X^(3) = X^(2) ∪ {lt(0, 3), lt(1, 4)}
X^(4) = X^(3) ∪ {lt(0, 4)},

where we use lt as shorthand for lessThan. Here, applying the predicate rules beyond t = 4 does not yield any new ground atoms.
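For concreteness, a plain Python rendering of this forward-chaining deduction for the lessThan program (our own illustrative helper, separate from the differentiable solver described below):

    # Naive forward chaining for the lessThan program in (8).
    C = [0, 1, 2, 3, 4]
    background = {("inc", (a, a + 1)) for a in range(4)}

    def step(facts):
        new = set(facts)
        for a in C:
            for b in C:
                # lessThan(A, B) <- inc(A, B)
                if ("inc", (a, b)) in facts:
                    new.add(("lt", (a, b)))
                # lessThan(A, B) <- lessThan(A, C), inc(C, B)
                for c in C:
                    if ("lt", (a, c)) in facts and ("inc", (c, b)) in facts:
                        new.add(("lt", (a, b)))
        return new

    X = background                       # X(0)
    for t in range(4):                   # X(1) ... X(4), the fixed point
        X = step(X)
    print(sorted(atom for atom in X if atom[0] == "lt"))    # all lt(a, b) with a < b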
Given the background facts B and a set of positive and negative examples (P and N, respectively), the goal of ILP is to learn a program (including the target predicate and a number of possible auxiliary predicates) such that it entails all the positive examples and rejects all the negative ones. The predicate function defined in (8) is one such solution to this ILP problem, satisfying all the examples.

We use the simple lessThan logic program above as an example to explain the basics of the proposed algorithm. Assume we consider a solution for the predicate function F_lt containing at most three variables, i.e., (A, B, C). We define the function Perm(S, n) to return all tuples of length n over the elements of a set S. For example, Perm({A, B}, 2) gives the set {(A, A), (A, B), (B, A), (B, B)}. Further, for any predicate p and set of variables V we define the set Terms(p, V) as:

Terms(p, V) = { p(arg) | arg ∈ Perm(V, arity(p)) }    (9)

For now, if we exclude the use of functions in defining predicates, the set of all atoms that can be used in defining the target predicate can be expressed as:

InputList(F_lt) = Terms(inc, {A, B, C}) ∪ Terms(lt, {A, B, C})    (10)

This corresponds to the set {inc(A, A), ..., inc(C, C), lt(A, A), ..., lt(C, C)}.
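The two set-builders above translate directly into code; the helper names perm and terms below mirror Perm and Terms in (9)-(10) and are only an illustration:

    from itertools import product

    def perm(S, n):
        # Perm(S, n): all length-n tuples over the elements of S (with repetition).
        return list(product(S, repeat=n))

    def terms(pred_name, arity, variables):
        # Terms(p, V): every atom formed by applying p to a tuple of variables.
        return [f"{pred_name}({', '.join(args)})" for args in perm(variables, arity)]

    V = ["A", "B", "C"]
    input_list = terms("inc", 2, V) + terms("lt", 2, V)
    print(len(input_list))    # 2 * 3**2 = 18 candidate atoms for defining F_lt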
Most proposed ILP solvers examine only a very limited subset of possible combinations to find a solution. Metagol (Cropper & Muggleton, 2016), the state-of-the-art ILP solver based on meta-interpretive learning (Cropper & Muggleton, 2015), employs user-defined meta rules to reduce the set of possible combinations of terms. This requires expert knowledge regarding the possible forms of the solution, which is restrictive. Further, this approach may require many trials to find a suitable set of meta rules. Among the neural ILP solvers, the current state-of-the-art solver proposed by (Evans & Grefenstette, 2018) limits the possible terms to the combinations containing only two atoms, which significantly reduces the space of possible solutions, and uses a softmax network to find the set of combinations corresponding to the answer among all combinations containing only two atoms. While, in principle, this limitation can be alleviated by introducing more and more auxiliary predicates, that approach is not practical and requires a huge amount of memory. The sheer number of possible combinations that these algorithms need to consider makes them unviable candidates for larger-scale problems, especially when recursion and multiple steps of the forward chain of reasoning are required. Consequently, the experiments in (Evans & Grefenstette, 2018) were limited to predicates with arity of 2.

Our key idea is to employ our NLN framework to define the predicate function corresponding to each intensional predicate, instead of limiting the possible terms using search trees or restricted combinations as in previous approaches. This allows framing the ILP problem as an end-to-end neural network which can be trained via typical gradient-based optimization techniques. Further, it in general removes the restrictions on how predicate functions are defined. In particular, in NLN we are not limited to the DNF form for defining a predicate; we can employ any other Boolean network, such as a CNF form or XOR logic, to learn the predicate functions.

5.1. NLN based neural ILP Solver

We present our algorithm using the running lessThan example. First, we define the predicate function for each intensional predicate (only lt here) using NLN. For example, we may use the DNF structure in NLN with a hidden layer of size 4 (four disjunction terms) to define F_lt.

Next, we define the valuation vector of each predicate at time stamp t as Y_p^(t), which consists of the (fuzzy) Boolean values of all the ground atoms involving that predicate. For example, the vector Y_inc includes the Boolean values of the atoms in {inc(0, 0), inc(0, 1), ..., inc(4, 4)}. Here we drop the t superscript since the values of atoms of an extensional predicate do not change over time.
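The exact fuzzy conjunction and disjunction neurons are defined in (1) of the paper and are not reproduced in this excerpt; the sketch below uses a standard product-based soft AND/OR with one membership weight per input, which conveys the general shape of such a DNF block for F_lt (18 candidate atoms in, four conjunction terms, one disjunction out).

    import numpy as np

    def soft_and(x, m):
        # Fuzzy conjunction: an input with m_i = 0 is ignored, with m_i = 1 it must hold.
        return np.prod(1.0 - m * (1.0 - x), axis=-1)

    def soft_or(x, m):
        # Fuzzy disjunction over membership-selected inputs.
        return 1.0 - np.prod(1.0 - m * x, axis=-1)

    def dnf_predicate(x, m_and, m_or):
        # x: fuzzy values of the InputList atoms, shape (18,)
        # m_and: (4, 18) memberships of the conjunction layer; m_or: (4,) memberships
        conj = soft_and(x[None, :], m_and)       # four conjunction terms
        return soft_or(conj, m_or)               # their disjunction, i.e. F_lt(x)

    rng = np.random.default_rng(0)
    out = dnf_predicate(rng.random(18), rng.random((4, 18)), rng.random(4))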
Algorithm 1 Outline of the NLN based neural ILP solver
Result: Y_target^(T_max)
for t ∈ {1, ..., T_max} do
    for p ∈ P_i do
        for arg ∈ C^var(p) do
            θ = {arg_0/A, arg_1/B, arg_2/C, ...}
            x_i = InputList_p |_θ
            arg_p = (arg_0, ..., arg_{arity(p)-1})
            Y_p[arg_p] ← Y_p[arg_p] ∨ F_p(x_i)
Algorithm 1 shows the outline of the T_max steps of the forward chain of reasoning in the proposed ILP solver. Here, θ defines a substitution (replacing variables with constants) and InputList_p|_θ is a fuzzy Boolean vector formed by gathering the corresponding elements of InputList_p (after substituting variables with constants) from the contents of the valuation vectors Y_p. In the actual TensorFlow implementation (Abadi et al., 2015), we reformulate the problem in matrix form and, before training starts, we compute the contents of the valuation vectors of all extensional predicates to speed up training. Also, while the algorithm is described sequentially, we compute all the disjunction operations of the inner-most for-loop in parallel as a batch operation, since they do not depend on each other. All the conjunction and disjunction operations in our algorithm are implemented as defined in (1).

We use the cross-entropy loss between Ŷ_target (the ground truth provided by the positive and negative examples) and Y_target^(T_max), the output of Algorithm 1.
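A toy rendering of the batched update and the loss (the gather step that builds InputList_p|_θ for every grounding is omitted, and all values below are synthetic placeholders of our own):

    import numpy as np

    def fuzzy_or(a, b):
        # Batched disjunction used for Y_p[arg_p] <- Y_p[arg_p] OR F_p(x_i)
        return 1.0 - (1.0 - a) * (1.0 - b)

    def cross_entropy(y_hat, y_true, eps=1e-7):
        y_hat = np.clip(y_hat, eps, 1.0 - eps)
        return -np.mean(y_true * np.log(y_hat) + (1.0 - y_true) * np.log(1.0 - y_hat))

    # 25 ground atoms lt(a, b) over C = {0, ..., 4}; fp_scores stands in for F_lt
    # evaluated on the gathered InputList of every grounding.
    Y_lt = np.zeros(25)
    fp_scores = np.random.default_rng(0).random(25)
    Y_lt = fuzzy_or(Y_lt, fp_scores)        # one deduction step, all groundings at once

    Y_true = (np.arange(25) // 5 < np.arange(25) % 5).astype(float)   # labels for lt(a, b): a < b
    loss = cross_entropy(Y_lt, Y_true)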
5.2. Training

We train the model using the ADAM optimizer (Kingma & Ba, 2014) with a learning rate of 0.001 and initialize the membership weights of the NLN using the approach described in Section 2.2. After training is completed, a zero cross-entropy loss indicates that the model has been able to satisfy all the examples in the positive and negative sets. However, there can be terms with membership weights of '1' in the definition of a predicate which are not necessary for the satisfiability of the solution. Since there is no gradient at this point, we cannot remove those terms during gradient descent unless we include some penalty terms. In practice, we use a simpler approach. In the final stage of the algorithm we remove these terms by a simple satisfiability check, i.e., if setting a membership variable with value '1' to '0' does not change the loss function, we remove that term from the outcome. Further, because the gradient can become small near convergence, to speed up the final stage of convergence, once the loss function falls below some threshold we push the membership variables m_i toward binary values by multiplying the corresponding weights w_i by a positive constant larger than 1 (e.g., 1.2 in our experiments).
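The two post-processing steps just described might look as follows; the helper names, the saturation threshold, and the assumption that memberships come from raw weights via a sigmoid are ours (the exact parameterization is given in Section 2.2 of the paper):

    import numpy as np

    def prune_memberships(m, loss_fn, tol=1e-6):
        # Satisfiability check: drop any saturated membership whose removal
        # leaves the training loss unchanged.
        m = m.copy()
        base = loss_fn(m)
        for i in np.flatnonzero(m > 0.99):
            trial = m.copy()
            trial[i] = 0.0
            if loss_fn(trial) <= base + tol:
                m = trial            # the term was not needed to satisfy the examples
        return m

    def push_to_binary(w, factor=1.2):
        # Late-stage sharpening: scaling the raw weights w pushes m = sigmoid(w)
        # toward 0/1 once the loss is already below the chosen threshold.
        return w * factor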
5.3. Benchmark tasks

We tested the proposed algorithm on the 20 symbolic tasks described in (Evans & Grefenstette, 2018); the details of these experiments can be found in Appendices G.1 to G.20 of that paper. In Table 1 we list, for each task, the percentage of runs that resulted in a correct solution for the proposed algorithm, and compare it to the baseline methods dILP and Metagol. Although these are rather simple tasks, as shown in Table 1, dILP cannot always find a solution for many of the problems. This can be attributed to the fact that the algorithm depends on the initial weights, so many of the runs may result in poor performance. Metagol, however, is a deterministic approach: it either finds a solution or fails entirely. In general, if not provided with carefully tuned meta rules, which define templates for the auxiliary and target predicates, Metagol cannot learn many of the tasks involving recursion (e.g., the Relatedness and Connectedness tasks in Table 1). In contrast, our proposed model always finds the correct solution for these tasks.

Table 1: NLN solver vs. dILP and Metagol in benchmark tasks (percentage of runs that found a correct solution)

Domain/Task                  dILP   Metagol   NLN
Arithmetic/Predecessor        100     100     100
Arithmetic/Even               100     100     100
Arithmetic/Even-Odd            49     100     100
Arithmetic/Less than          100     100     100
Arithmetic/Fizz                10     100     100
Arithmetic/Buzz                35     100     100
List/Member                   100     100     100
List/Length                    93     100     100
Family Tree/Son               100     100     100
Family Tree/GrandParent        97     100     100
Family Tree/Husband           100     100     100
Family Tree/Uncle              70     100     100
Family Tree/Relatedness       100       0     100
Family Tree/Father            100     100     100
Graph/Undirected Edge         100     100     100
Graph/Adjacent to Red          51     100     100
Graph/Two Children             95     100     100
Graph/Graph Colouring          95     100     100
Graph/Connectedness           100       0     100
Graph/Cyclic                  100       0     100

5.4. Learning Decimal Arithmetic

The 20 tasks in the previous section are rather simple. For more complex tasks, for which the predicate definition requires more atoms and the arity of the predicates is higher
than two, methods such as dILP cannot be used. Even Metagol can only learn such tasks requiring recursion when the appropriate rule templates are provided by an expert. Here, we apply our method to learning more complex recursive arithmetic tasks. We first describe the addition problem for the natural-number domain and then use the addition predicate as background knowledge in a second task to learn multiplication.

5.4.1. Addition Task

We use C = {0, 1, 2, 3, 4, 5} as constants and our background knowledge consists of B = {zero(0), eq(0, 0), ..., eq(4, 4), inc(0, 1), ..., inc(3, 4)}, where inc defines increment and eq tests for equality. The target predicate is add(A, B, C) and we allow for the use of two additional variables (i.e., var(add) = 3 + 2 = 5). As usual, we use a DNF network for learning F_add. One of the solutions that our model finds is:

add(A, B, C) ← zero(B), eq(A, C)
add(A, B, C) ← add(A, D, E), inc(D, B), inc(C, E)

In the sorting task, the functions H(X) and t(X) allow for decomposing a list into its head and tail elements, i.e., A = [H(A)|t(A)]. We use the elements of {a, b, c, d} and all the lists made from permutations of up to three elements as constants in the program. We use extensional predicates such as gt (greater than), eq (equals) and lte (less than or equal) to define the ordering between the elements as part of the background knowledge. We allow for using two additional variables in defining the predicate sort(A, B). One of the solutions that our model finds is:

sort(A, B) ← sort(H(A), C), lte(t(C), t(A)), eq(H(B), C), eq(t(A), t(B))
sort(A, B) ← sort(H(A), C), gt(t(C), t(A)), eq(t(B), t(C)), eq(H(D), H(C)), eq(t(A), t(D)), sort(D, H(B))

To the best of our knowledge, learning a recursive solution like this, which involves clauses with 6 atoms and includes 4 variables (and their functions), is beyond the power of any existing neural ILP solver.
References

Bader, S., Hitzler, P., and Hölldobler, S. Connectionist model generation: A first-order approach. Neurocomputing, 71(13-15):2420-2432, 2008.

Collobert, R. and Weston, J. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th International Conference on Machine Learning, pp. 160-167. ACM, 2008.

Cropper, A. and Muggleton, S. H. Logical minimisation of meta-rules within meta-interpretive learning. In Inductive Logic Programming, pp. 62-75. Springer, 2015.

Cropper, A. and Muggleton, S. H. Metagol system. https://github.com/metagol/metagol, 2016.

Dahl, G. E., Yu, D., Deng, L., and Acero, A. Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition. IEEE Transactions on Audio, Speech, and Language Processing, 20(1):30-42, 2012.

Duch, W. K-separability. In International Conference on Artificial Neural Networks, pp. 188-197. Springer, 2006.

Dzeroski, S. Inductive logic programming in a nutshell. In Introduction to Statistical Relational Learning, 2007.

Minsky, M. and Papert, S. A. Perceptrons: An Introduction to Computational Geometry. MIT Press, 2017.

Muggleton, S. and De Raedt, L. Inductive logic programming: Theory and methods. The Journal of Logic Programming, 19:629-679, 1994.

Payani, A. and Fekri, F. Decoding LDPC codes on binary erasure channels using deep recurrent neural-logic layers. In 2018 International Symposium on Turbo Codes and Iterative Information Processing (ISTC). IEEE, 2018.

Richardson, T. J., Shokrollahi, M. A., and Urbanke, R. L. Design of capacity-approaching irregular low-density parity-check codes. IEEE Transactions on Information Theory, 47(2):619-637, 2001.

Serafini, L. and Garcez, A. d. Logic tensor networks: Deep learning and logical reasoning from data and knowledge. arXiv preprint arXiv:1606.04422, 2016.

Siegelmann, H. T. and Sontag, E. D. On the computational power of neural nets. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory, pp. 440-449. ACM, 1992.

Steinbach, B. and Kohut, R. Neural networks - a model of Boolean functions. In Boolean Problems, Proceedings of the 5th International Workshop on Boolean Problems, pp. 223-240, 2002.