2021 - Inductive Logic Programming at 30
1 Introduction
Inductive logic programming (ILP) [75, 78] is a form of machine learning (ML). As with
other forms of ML, the goal of ILP is to induce a hypothesis that generalises training
examples. However, whereas most forms of ML use vectors/tensors to represent data
(examples and hypotheses), ILP uses logic programs (sets of logical rules). Moreover,
whereas most forms of ML learn functions, ILP learns relations.
To illustrate ILP,¹ suppose you want to learn a string transformation program from
the following examples.
A. Cropper
University of Oxford
E-mail: [email protected]
S. Dumančić
KU Leuven
E-mail: [email protected]
R. Evans
Imperial College London
E-mail: [email protected]
S. H. Muggleton
Imperial College London
E-mail: [email protected]
¹ We do not introduce ILP in detail and refer the reader to the introductory paper of Cropper and Dumančić [17] or the textbooks of Nienhuys-Cheng and Wolf [86] and De Raedt [28].
Input         Output
inductive     e
logic         c
programming   g
Most forms of ML would represent these examples as a table, where each row would
be an example and each column would be a feature, such as a one-hot-encoding repre-
sentation of the string. By contrast, in ILP, we would represent these examples as logical
atoms, such as f([i,n,d,u,c,t,i,v,e], e), where f is the target predicate that we
want to learn (the relation to generalise). We would also provide auxiliary information
(features) in the form of background knowledge (BK), also represented as a logical the-
ory (a logic program). For instance, for the string transformation problem, we could pro-
vide BK that contains logical definitions for string operations, such as empty(A), which
holds when the list A is empty; head(A,B), which holds when B is the head of the list A;
and tail(A,B), which holds when B is the tail of the list A. Given the aforementioned
examples and BK, an ILP system could induce the hypothesis (a logic program):
f(A,B):- tail(A,C),empty(C),head(A,B).
f(A,B):- tail(A,C),f(C,B).
Each line of the program is a rule. The first rule says that the relation f(A,B) holds
when the three literals tail(A,C), empty(C), and head(A,B) hold. In other words, the
first rule says that B is the last element of A when the tail of A is empty and B is the head
of A. The second rule is recursive and says that the relation f(A,B) holds when the two
literals tail(A,C) and f(C,B) hold. In other words, the second rule says that f(A,B)
holds when the same relation holds for the tail of A.
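For concreteness, the BK relations used above could be defined by the following Prolog program (one possible definition, shown for illustration; an ILP system could equally be given the corresponding ground facts):
empty([]).
head([H|_],H).
tail([_|T],T).
With this BK, the induced program computes the last element of a list: the query f([i,n,d,u,c,t,i,v,e],X) succeeds with X = e.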
Compared to most ML approaches, ILP has several attractive features [25, 17]:
Data efficiency. Many forms of ML are notorious for their inability to generalise from
small numbers of training examples, notably deep learning [70, 13]. As Evans and Grefen-
stette [39] point out, if we train a neural system to add numbers with 10 digits, it might
generalise to numbers with 20 digits, but when tested on numbers with 100 digits, the
predictive accuracy drastically decreases [91, 53]. By contrast, ILP can induce hypothe-
ses from small numbers of examples, often from a single example [69, 82].
Background knowledge. ILP learns using BK represented as a logic program. Using logic
programs to represent data allows ILP to learn with complex relational information,
such as constraints about causal networks [50], the axioms of the event calculus when
learning to recognise events [55, 56], and using a theory of light to understand images
[82]. Moreover, because hypotheses are symbolic, hypotheses can be added to the BK,
and thus ILP systems naturally support lifelong and transfer learning [69, 15, 16].
Expressivity. Because of the expressivity of logic programs, ILP can learn complex rela-
tional theories, such as cellular automata [51, 40], event calculus theories [55, 56], Petri
nets [5], and general algorithms [19]. Because of the symbolic nature of logic programs,
ILP can reason about hypotheses, which allows it to learn optimal programs, such as
minimal time-complexity programs [22] and secure access control policies [65].
Some of the aforementioned advantages come from recent developments, which we sur-
vey in this paper.² To aid the reader, we coarsely compare old and new ILP systems, where
new represents systems from the past decade. We use FOIL [89], Progol [76], TILDE [9],
and HYPER [12] as representative old systems and ILASP [62], Metagol [21], ∂ILP [39],
and Popper [19] as representative new systems. This comparison, shown in Table 1, is,
of course, vastly oversimplified, and there are many exceptions. In the rest of this paper,
we survey these developments (each row in the table) in turn. After discussing these new
ideas, we discuss recent application areas (Section 5.2) before concluding by proposing
directions for future research.
The fundamental ILP problem is to efficiently search a large hypothesis space. Most
older ILP approaches search in either a top-down or bottom-up fashion. These methods
rely on notions of generality (typically using theta-subsumption [88]), where one pro-
gram is more general or more specific than another. A third new search approach has
recently emerged called meta-level ILP [50, 84, 49, 66, 19]. We discuss these approaches
in turn.
Top-down approaches [89, 9, 12] start with a general hypothesis and then specialise it.
HYPER, for instance, searches a tree in which the nodes correspond to hypotheses and
each child of a hypothesis in the tree is more specific than or equal to its predecessor
in terms of theta-subsumption. An advantage of top-down approaches is that they can
often learn recursive programs (although not all do). A disadvantage is that they can be
² This paper extends the paper of Cropper et al [25].
prohibitively inefficient because they can generate many hypotheses that do not cover
the examples.
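For instance, starting from the most general definition of f/2, a top-down system might reach the first (base case) rule from the introduction through a chain of specialisations, each step adding a body literal (one possible refinement path, shown for illustration):
f(A,B).
f(A,B):- tail(A,C).
f(A,B):- tail(A,C),empty(C).
f(A,B):- tail(A,C),empty(C),head(A,B).
Each clause in this chain theta-subsumes, and is therefore at least as general as, the clause below it.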
Bottom-up approaches, by contrast, start with the examples and generalise them
[74, 77, 79, 51]. For instance, Golem [79] generalises pairs of examples based on rela-
tive least-general generalisation [86]. Bottom-up approaches can be seen as being data-
or example-driven. An advantage of these approaches is that they are typically fast. As
Bratko [12] points out, disadvantages include (i) they typically use unnecessarily long
hypotheses with many clauses, (ii) it is difficult for them to learn recursive hypotheses
and multiple predicates simultaneously, and (iii) they do not easily support predicate
invention.
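To illustrate the idea with plain (non-relative) least-general generalisation, the least general generalisation of the two atoms head([a,b],a) and head([c,d],c) is head([X,Y],X): the pair of terms a and c is mapped to the variable X wherever it occurs, and the pair b and d to the variable Y.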
Progol [76], which inspired many other ILP approaches [102, 90, 1, 100], combines
both top-down and bottom-up approaches. Starting with an empty program, Progol
picks an uncovered positive example to generalise. To generalise an example, Progol
uses mode declarations to build the bottom clause [76], the logically most-specific clause
that explains the example. The bottom clause bounds the search from below (the bottom
clause) and above (the empty set). Progol then uses an A* algorithm to generalise the
bottom clause in a top-down (general-to-specific) manner and uses the other examples
to guide the search.
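For concreteness, the following is an Aleph-style sketch of mode declarations for the last-element problem from the introduction (the syntax and type names vary between mode-based systems), followed by the bottom clause that would be built for the example f([a,b],b):
:- modeh(1,f(+list,-elem)).
:- modeb(1,head(+list,-elem)).
:- modeb(1,tail(+list,-list)).
:- modeb(1,empty(+list)).
f(A,B):- head(A,C),tail(A,D),head(D,B),tail(D,E),empty(E).
Progol then searches the clauses that generalise (theta-subsume) this bottom clause, such as f(A,B):- tail(A,C),head(C,B).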
1.3.1 Meta-level
Top-down and bottom-up approaches refine and revise a single hypothesis. A third ap-
proach has recently emerged called meta-level ILP [50, 84, 49, 66, 19]. There is no stan-
dard definition for meta-level ILP. Most approaches encode the ILP problem as a meta-
level logic program, i.e. a program that reasons about programs. Meta-level approaches
then often delegate the search for a hypothesis to an off-the-shelf solver [14, 21, 62,
54, 100, 40, 19] after which the meta-level solution is translated back to a standard
solution for the ILP task. In other words, instead of writing a procedure to search in
a top-down or bottom-up manner, most meta-level approaches formulate the learning
problem as a declarative search problem. For instance, ASPAL [14] translates an ILP task
into a meta-level ASP program which describes every example and every possible rule in
the hypothesis space. ASPAL then delegates the search to an ASP system to find a subset
of the rules that covers all the positive but none of the negative examples.
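To give a flavour of such an encoding, the following is a much-simplified sketch (not ASPAL's actual encoding) of rule selection written as an ASP program, where choosing in_h(R) corresponds to including candidate rule R in the hypothesis:
% candidate rules  r1: flies(X):- bird(X).   r2: flies(X):- bird(X), not penguin(X).
rule(r1). rule(r2).
{ in_h(R) : rule(R) }.                         % choose a subset of the candidate rules
bird(tweety). bird(pingu). penguin(pingu).     % background knowledge
flies(X) :- in_h(r1), bird(X).                 % r1 is active only if chosen
flies(X) :- in_h(r2), bird(X), not penguin(X). % r2 is active only if chosen
pos(tweety). neg(pingu).                       % examples of the target relation flies/1
:- pos(X), not flies(X).                       % every positive example must be covered
:- neg(X), flies(X).                           % no negative example may be covered
#minimize { 1,R : in_h(R) }.                   % prefer the smallest hypothesis
An ASP solver returns the optimal answer set containing in_h(r2), which corresponds to the hypothesis flies(X):- bird(X), not penguin(X).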
The main advantage of meta-level approaches is that they can more easily learn re-
cursive programs and optimal programs [14, 62, 21, 54, 40, 19], which we discuss in
Sections 2 and 4 respectively. Moreover, whereas classical ILP systems were almost en-
tirely based on Prolog, meta-level approaches use diverse techniques and technologies,
such as ASP solvers [14, 62, 54, 19, 40], which we expand on in Section 5. The devel-
opment of meta-level ILP approaches has, therefore, diversified ILP from the standard
clause refinement approach of earlier ILP systems.
Most meta-level approaches encode the ILP learning task as a single static meta-level
program [14, 62, 54, 40]. A major issue with this approach is that the meta-level program
can be very large so these approaches can struggle to scale to problems with non-trivial
domains and to programs with large clauses. Two related approaches try to overcome
this limitation by continually revising the meta-level program.
ILASP3 [61] employs a counter-example-driven select-and-constrain loop. ILASP3
first pre-computes every clause in the hypothesis space defined by a set of given mode
declarations [76]. ILASP3 then starts its select-and-constrain loop. With each iteration,
ILASP3 uses an ASP solver to find the best hypothesis (a subset of the rules) it can.
If the hypothesis does not cover one of the examples, ILASP3 finds a reason why and
then generates constraints (boolean formulas over the rules) which it adds to the meta-
level program to guide subsequent search. Another way of viewing ILASP3 is that it
uses a counter-example-guided approach and translates an uncovered example e into a
constraint that is satisfied if and only if e is covered.
Popper [19] adopts a similar approach but differs in that it (i) does not precompute
every possible rule in the hypothesis space, and (ii) translates a hypothesis into a set of
constraints, rather than an uncovered example. Popper works in three repeating stages:
generate, test, and constrain. Popper first constructs a meta-level logic program whose
models correspond to hypotheses. In the generate stage, Popper asks an ASP solver
to find a model (a hypothesis). In the test stage, Popper tests the hypothesis against the
examples. A hypothesis fails when it is incomplete (does not entail all the positive ex-
amples) or inconsistent (entails a negative example). If a hypothesis fails, Popper learns
constraints from the failure, which it then uses to restrict subsequent generate stages. For
instance, if a hypothesis is inconsistent, then Popper generates a generalisation constraint
to prune all generalisations of the hypothesis and adds the constraint to the meta-level
program, which eliminates models and thus prunes the hypothesis space. This process
repeats until Popper finds a complete and consistent program.
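The overall control loop can be sketched in Prolog as follows (a simplified illustration rather than Popper's implementation; generate/2, test/3, and constraints_from_failure/3 are assumed helper predicates):
% learn(+Examples, -Hypothesis)
learn(Examples, Hypothesis) :-
    learn_loop(Examples, [], Hypothesis).
learn_loop(Examples, Constraints, Hypothesis) :-
    generate(Constraints, Candidate),            % ask the solver for a model (a hypothesis)
    test(Candidate, Examples, Outcome),          % test the hypothesis against the examples
    (   Outcome = complete_and_consistent
    ->  Hypothesis = Candidate                   % done: return the hypothesis
    ;   constraints_from_failure(Candidate, Outcome, New),
        append(New, Constraints, Constraints1),  % constraints prune the hypothesis space
        learn_loop(Examples, Constraints1, Hypothesis)
    ).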
For more information about meta-level learning, we suggest the work of Inoue [49]
and Law et al [66].
2 Recursion
Learning recursive programs has long been considered a difficult problem for ILP [81,
17]. The power of recursion is that an infinite number of computations can be described
by a finite recursive program [107]. To illustrate the importance of recursion, reconsider
the string transformation problem from the introduction. Without recursion, an ILP sys-
tem would need to learn a separate clause to find the last element for each list of length
n, such as this program for when n = 3:
f(A,B):- tail(A,C),empty(C),head(A,B).
f(A,B):- tail(A,C),tail(C,D),empty(D),head(C,B).
f(A,B):- tail(A,C),tail(C,D),tail(D,E),empty(E),head(D,B).
This program does not generalise to lists of arbitrary lengths. Moreover, most ILP systems
would need examples of lists of each length to learn such a program. By contrast, an ILP
system that supports recursion can learn the compact program:
f(A,B):- tail(A,C),empty(C),head(A,B).
f(A,B):- tail(A,C),f(C,B).
Because of its symbolic representation and recursive nature, this program gener-
alises to lists of arbitrary length that contain arbitrary elements (e.g. integers and
characters). In general, without recursion, it can be difficult for an ILP system to gener-
alise from small numbers of examples [24].
Older ILP systems struggle to learn recursive programs, especially from small num-
bers of training examples. A common limitation with existing approaches is that they
rely on bottom clause construction [76]. In this approach, for each example, an ILP sys-
tem creates the most specific clause that entails the example, and then tries to generalise
the clause to entail other examples. However, this sequential covering approach requires
examples of both the base and inductive cases.
Interest in recursion has resurged with the introduction of meta-interpretive learn-
ing (MIL) [83, 84, 27] and the MIL system Metagol [21]. The key idea of MIL is to use
metarules [23], or program templates, to restrict the form of inducible programs, and
thus the hypothesis space.³ A metarule is a higher-order clause. For instance, the chain
metarule is P(A, B) ← Q(A, C), R(C, B), where the letters P, Q, and R denote higher-order
variables and A, B and C denote first-order variables. The goal of a MIL system, such as
Metagol, is to find substitutions for the higher-order variables. For instance, the chain
metarule allows Metagol to induce programs such as f(A,B):- tail(A,C),head(C,B).⁴
Metagol induces recursive programs using recursive metarules, such as the tailrec metarule
P(A,B) ← Q(A,C), P(C,B).
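As a sketch of how metarules are supplied to Metagol (the exact format differs between versions), the chain and tailrec metarules can be written as Prolog terms whose first argument lists the higher-order variables to be substituted:
metarule([P,Q,R], [P,A,B], [[Q,A,C],[R,C,B]]).  % chain:   P(A,B) <- Q(A,C), R(C,B)
metarule([P,Q],   [P,A,B], [[Q,A,C],[P,C,B]]).  % tailrec: P(A,B) <- Q(A,C), P(C,B)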
Following MIL, many meta-level ILP systems can learn recursive programs [62, 39,
54, 19]. With recursion, ILP systems can now generalise from small numbers of exam-
ples, often a single example [69]. Moreover, the ability to learn recursive programs has
opened up ILP to new application areas, including learning string transformation pro-
grams [69], answer set grammars [64], and general algorithms [19].
3 Predicate invention
A key characteristic of ILP is the use of BK. BK is similar to features used in most forms of
ML. However, whereas features are tables, BK contains facts and rules (extensional and
intensional definitions) in the form of a logic program. For instance, when learning string
transformation programs, we may provide helper background relations, such as head/2
and tail/2. For other domains, we may supply more complex BK, such as a theory of
light to understand images [82] or higher-order operations, such as map/3, filter/3,
and fold/4, to solve programming puzzles [27].
As with choosing appropriate features, choosing appropriate BK is crucial for good
learning performance. ILP has traditionally relied on hand-crafted BK, often designed by
domain experts. This approach is limited because obtaining suitable BK can be difficult
and expensive. Indeed, the over-reliance on hand-crafted BK is a common criticism of
ILP [39].
Rather than expecting a user to provide all the necessary BK, the goal of predicate
invention (PI) [77, 104] is for an ILP system to automatically invent new auxiliary pred-
icate symbols. This idea is similar to how humans create new functions when manually
writing programs, such as to reduce code duplication or to improve readability. Whilst PI has
attracted interest since the beginnings of ILP [77], and has subsequently been repeatedly
stated as a major challenge [58, 81, 60], most ILP systems do not support it.
A key challenge faced by early ILP systems was deciding when and how to invent a
new symbol. As Kramer [59] points out, PI is difficult because it is unclear how many
arguments an invented predicate should have, how the arguments should be ordered,
etc. Several PI approaches try to address this challenge, which we discuss in turn.
³ The idea of using metarules to restrict the hypothesis space has been widely adopted by many approaches [106, 3, 96, 39, 5, 54]. However, despite their now widespread use, there is little work determining which metarules to use for a given learning task ([23] is an exception), which future work must address.
⁴ Metagol can induce longer clauses through predicate invention, which is described in Section 3.
3.1 Placeholders
3.2 Metarules
Interest in automatic PI (where a user does not need to predefine an invented symbol) has
resurged with the introduction of MIL. MIL avoids the issues of older ILP systems by using
metarules to define the hypothesis space and in turn reduce the complexity of inventing
a new predicate symbol. For instance, the chain metarule (P(A, B) ← Q(A, C), R(C, B))
allows Metagol to induce programs such as f(A,B):- tail(A,C),tail(C,D), which
would drop the first two elements from a list. To induce longer clauses, such as to drop the
first three elements from a list, Metagol uses the same metarule but invents a predicate
symbol to chain its application, such as to induce the program:
f(A,B):- tail(A,C),inv1(C,B).
inv1(A,B):- tail(A,C),tail(C,B).
To learn this program, Metagol invents the predicate symbol inv1 and induces a def-
inition for it using the chain metarule. Metagol uses this new predicate symbol in the
definition for the target predicate f.
A side-effect of this metarule-driven approach is that problems are forced to be de-
composed into reusable solutions. For instance, to learn a program that drops the first
four elements of a list, Metagol learns the following program, where the invented pred-
icate symbol inv1 is used twice:
f(A,B):- inv1(A,C),inv1(C,B).
inv1(A,B):- tail(A,C),tail(C,B).
PI has been shown to help reduce the size of target programs, which in turn reduces
sample complexity and improves predictive accuracy [15]. Several new ILP systems sup-
port PI using a metarule-guided approach [39, 54, 47].
3.3 Pre/post-processing
ALPs [37] perform PI using an auto-encoding principle: they learn an encoding logic
program that maps the provided data to a new, compressive latent representation (de-
fined in terms of the invented predicates), and a decoding logic program that can recon-
struct the provided data from its latent representation. This approach shows improved
performance on supervised tasks, even though the PI step is task-agnostic.
Knorf [35] pushes the idea of ALPs even further. Knorf compresses a program by
removing redundancies in it. If the learnt program contains invented predicates, Knorf
revises them and introduces new ones that would lead to a smaller program. The refac-
tored program is smaller in size and contains less redundancy in clauses, both of which
lead to improved performance. The authors experimentally demonstrate that refactoring
improves learning performance in lifelong learning and that Knorf substantially reduces
the size of the BK program, reducing the number of literals in a program by 50% or more.
3.5 Limitations
The aforementioned techniques have improved the ability of ILP to invent high-level
concepts. However, PI is still difficult and there are many challenges to overcome. The
challenges are that (i) many systems struggle to perform PI at all, and (ii) those that do
support PI mostly require substantial user guidance, such as metarules to restrict the space of
invented symbols or user-specified arities and argument types for invented symbols.
By developing better approaches for PI, we can make progress on existing challeng-
ing problems. For instance, in inductive general game playing [26], the task is to learn
the symbolic rules of games from observations of gameplay, such as learning the rules
of connect four. The target solutions, which come from the general game playing com-
petition [44], often contain auxiliary predicates. For instance, the rules for connect four
are defined in terms of definitions for lines which are themselves defined in terms of
columns, rows, and diagonals. Although these auxiliary predicates are not strictly nec-
essary to learn the target solution, inventing such predicates significantly reduces the
size of the solution, which in turn makes it easier to learn. Although new methods
for PI can invent high-level concepts, they are not yet powerful enough to
perform well on the IGGP dataset. Making progress in this area would constitute a major
advancement in ILP.
ILP systems have traditionally induced definite and normal logic programs, typically
represented as Prolog programs. A recent development has been to use different hypoth-
esis representations.
3.7 Answer set programs
ASP [42] is a logic programming paradigm based on the stable model semantics of nor-
mal logic programs that can be implemented using the latest advances in SAT solving
technology. Law et al [63] discuss some of the advantages of learning ASP programs,
rather than Prolog programs, which we reiterate. When learning Prolog programs, the
procedural aspect of SLD-resolution must be taken into account. For instance, when
learning Prolog programs with negation, programs must be stratified; otherwise the program
may loop under certain conditions. By contrast, as ASP is a truly declarative language, no
such consideration need be taken into account when learning ASP programs. Compared
to Datalog and Prolog, ASP supports additional language constructs, such as disjunction
in the head of a clause, choice rules, and hard and weak constraints. A key difference
between ASP and Prolog is semantics. A definite logic program has only one model (the
least Herbrand model). By contrast, an ASP program can have one, many, or even no
stable models (answer sets). Due to its non-monotonicity, ASP is particularly useful for
expressing common-sense reasoning [61].
To illustrate the benefits of learning ASP programs, we reuse an example from Law
et al [66]. Given sufficient examples of Hamiltonian graphs, ILASP [62] can learn a
program that defines them:
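The learned program is along the following lines (a sketch in the style of the program reported by Law et al [66]):
0 { in(V0,V1) } 1 :- edge(V0,V1).
reach(V) :- in(1,V).
reach(V1) :- reach(V0), in(V0,V1).
:- node(V), not reach(V).
:- in(V0,V1), in(V0,V2), V1 != V2.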
This program illustrates useful language features of ASP. The first rule is a choice rule
and the last two rules are hard constraints.
Approaches to learning ASP programs can mostly be divided into two categories:
brave learners, which aim to learn a program such that at least one answer set covers the
examples, and cautious learners, which aim to find a program which covers the examples
in all answer sets. ILASP is notable because it supports both brave and cautious learning,
which are both needed to learn some ASP programs [63]. Moreover, ILASP differs from
most Prolog-based ILP systems because it learns unstratified ASP programs, including
programs with normal rules, choice rules, and both hard and weak constraints, which
classical ILP systems cannot. Learning ASP programs allows for ILP to be used for new
problems, such as inducing answer set grammars [64].
Imagine learning a droplasts program, which removes the last element of each sublist in
a list, e.g. [alice,bob,carol] ↦ [alic,bo,caro]. Given suitable input data, Metagol can learn
this first-order recursive program:
f(A,B):- empty(A),empty(B).
f(A,B):- head(A,C),tail(A,D),head(B,E),tail(B,F),f1(C,E),f(D,F).
f1(A,B):- reverse(A,C),tail(C,D),reverse(D,B).
Although semantically correct, the program is verbose. To learn smaller programs, Metagolho
[27] extends Metagol to support learning higher-order programs, where predicate sym-
bols can be used as terms. For instance, for the same droplasts problem, Metagolho learns
the higher-order program:
f(A,B):- map(A,B,f1).
f1(A,B):- reverse(A,C),tail(C,D),reverse(D,B).
To learn this program, Metagolho invents the predicate symbol f1, which is used twice in
the program: as a term in the map(A,B,f1) literal and as a predicate symbol in the f1(A,B)
literal. Compared to the first-order program, this higher-order program is smaller be-
cause it uses map/3 (predefined in the BK) to abstract away the manipulation of the list
and to avoid the need to learn an explicitly recursive program (recursion is implicit in
map/3). Metagolho has been shown to reduce sample complexity and learning times and
improve predictive accuracies [27].
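For reference, map/3 could be defined in the BK along the following lines (a sketch; Metagolho treats such higher-order predicates as interpreted BK rather than ordinary Prolog clauses):
map([],[],_F).
map([A|As],[B|Bs],F) :- call(F,A,B), map(As,Bs,F).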
A major limitation of logical representations, such as Prolog and its derivatives, is the
implicit assumption that the BK is perfect. That is, most ILP systems assume that atoms
are true or false, leaving no room for uncertainty. This assumption is problematic if data
is noisy, which is often the case.
Integrating probabilistic reasoning into logical representations is a principled way to
handle such uncertainty in data. This integration is the focus of statistical relational arti-
ficial intelligence (StarAI) [29, 32]. In essence, StarAI hypothesis representations extend
BK with probabilities or weights indicating the degree of confidence in the correctness
of parts of BK. Generally, StarAI techniques can be divided into two groups: distribution
semantics approaches and maximum entropy approaches.
Distribution semantics approaches [98], including Problog [30] and PRISM [99], ex-
plicitly annotate uncertainties in BK. To allow such annotation, they extend Prolog with
two primitives for stochastic execution: probabilistic facts and annotated disjunctions.
Probabilistic facts are the most basic stochastic primitive and they take the form of logical
facts labelled with a probability p. Each probabilistic fact represents a Boolean random
variable that is true with probability p and false with probability 1 − p. For instance,
the following probabilistic fact states that there is a 1% chance of an earthquake in Naples.
0.01::earthquake(naples).
Annotated disjunctions are a more expressive primitive: they allow a rule to choose between several outcomes, each annotated with a probability. For instance, the following annotated disjunction states that a ball is equally likely to be green, red, or blue:
1/3::colour(B,green); 1/3::colour(B,red); 1/3::colour(B,blue) :- ball(B).
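Probabilistic facts and annotated disjunctions can then be used by ordinary rules, and queries return probabilities rather than truth values. For instance, in the following ProbLog-style sketch (the probability values are invented for illustration), the alarm goes off if there is an earthquake or a burglary, and the query returns P(alarm(naples)) = 1 − (1 − 0.01)(1 − 0.2) = 0.208:
0.01::earthquake(naples).
0.2::burglary(naples).
alarm(X) :- earthquake(X).
alarm(X) :- burglary(X).
query(alarm(naples)).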
4 Optimality
There are often multiple (sometimes infinitely many) hypotheses that explain the data.
Deciding which hypothesis to choose has long been a difficult problem. Older ILP systems
were not guaranteed to induce optimal programs, where optimal typically means with
respect to the size of the induced program or the coverage of examples. A key reason for
this limitation was that most search techniques learned a single clause at a time, leading
to the construction of sub-programs which were sub-optimal in terms of program size
and coverage. For instance, programs induced by Aleph offer no guarantee of optimality
with respect to the program size and coverage.
Newer ILP systems try to address this limitation. As with the ability to learn recursive
programs, the main development is to take a global view of the induction task by using
meta-level search techniques. In other words, rather than induce a single clause at a time
from a single example, the idea is to induce multiple clauses from multiple examples.
For instance, ILASP uses ASP’s optimisation abilities to provably learn the program with
the fewest literals.
The ability to learn optimal programs opens up ILP to new problems. For instance,
learning efficient logic programs has long been considered a difficult problem in ILP
[78, 81], mainly because there is no declarative difference between an efficient program,
such as mergesort, and an inefficient program, such as bubble sort. To address this issue,
Metaopt [22] extends Metagol to support learning efficient programs. Metaopt maintains
a cost during the hypothesis search and uses this cost to prune the hypothesis space.
To learn minimal time complexity logic programs, Metaopt minimises the number of
resolution steps. For instance, imagine trying to learn a find duplicate program, which
finds any duplicate element in a list, e.g. [p,r,o,g,r,a,m] ↦ r and [i,n,d,u,c,t,i,o,n] ↦ i.
Given suitable input data, Metagol can induce the program:
f(A,B):- head(A,B),tail(A,C),element(C,B).
f(A,B):- tail(A,C),f(C,B).
This program goes through the elements of the list checking whether the same element
exists in the rest of the list. Given the same input, Metaopt induces the program:
f(A,B):- mergesort(A,C),f1(C,B).
f1(A,B):- head(A,B),tail(A,C),head(C,B).
f1(A,B):- tail(A,C),f1(C,B).
This program first sorts the input list and then goes through the sorted list to check
for duplicate adjacent elements. Although larger, both in terms of clauses and literals,
the program learned by Metaopt is more efficient (O(n log n)) than the program learned
by Metagol (O(n²)). Metaopt has been shown to learn efficient robot strategies, efficient
time complexity logic programs, and even efficient string transformation programs.
FastLAS [65] is an ASP-based ILP system that takes as input a custom scoring function
and computes an optimal solution with respect to the given scoring function. The authors
show that this approach allows a user to optimise domain-specific performance metrics
on real-world datasets, such as access control policies.
5 Technologies
Older ILP systems mostly use Prolog for reasoning. Recent work considers using different
technologies.
There have been tremendous recent advances in SAT [46]. To leverage these advances,
much recent work in ILP uses related techniques, notably ASP [14, 83, 62, 55, 56, 100,
54, 40, 19]. The main motivations for using ASP are to leverage (i) the language benefits
of ASP (Section 3.7), and (ii) the efficiency and optimisation techniques of modern ASP
solvers, such as CLASP [43], which supports conflict propagation and learning. With
similar motivations, other approaches encode the ILP problem as SAT [1] or SMT [3]
problems. These approaches have been shown to reduce learning times compared to
standard Prolog-based approaches. However, some unresolved issues remain. A key issue
is that most approaches encode an ILP problem as a single (often very large) satisfiability
problem. These approaches therefore often struggle to scale to very large problems [27],
although preliminary work attempts to tackle this issue [19].
With the rise of deep learning, several approaches have explored using gradient-based
methods to learn logic programs. These approaches all replace discrete logical reasoning
with a relaxed version that yields continuous values reflecting the confidence of the
conclusion.
The various neural approaches can be characterised along four orthogonal dimen-
sions. The first dimension is whether the neural network implements forward or back-
ward inference. While some [96] use backward (goal-directed) chaining with a neural
implementation of unification, most approaches [39, 108, 33] use forward chaining. The
second dimension is whether the network is designed for big data problems [108, 96]
or for data-efficient learning from a handful of data items [39]. Few neural systems to
date are capable of handling both big data and small data, with the notable exception
of [33]. The third dimension is whether the neural system jointly learns embeddings
(mapping symbolic constants to continuous vectors) along with the logical rules [96].
The advantage of jointly learning embeddings is that it enables fuzzy unification be-
tween constants that are similar but not identical. The challenge for these approaches
that jointly learn embeddings is how to generalise appropriately to constants that have
not been seen at training time. The fourth dimension is whether or not the neural sys-
tem is designed to allow explicit human-readable logical rules to be extracted from the
weights of the network. While most neural ILP systems [108, 96, 39] do produce explicit
logic programs, some [33] do not. It is perhaps moot whether implicit systems that do
not produce explicit programs count as ILP systems at all – but note that even in the im-
plicit neural systems, the weight sharing of the neural net is designed to achieve strong
generalisation by performing the same computation on all tuples of objects.
Currently, most neural approaches to ILP require the use of metarules or templates
to make the search space tractable. This severely limits the applicability of these ap-
proaches, as the user cannot always be expected to provide suitable metarules for a new
problem. The only approach that avoids the use of metarules or templates is Neural Logic
Machines [33].
We now survey recent application areas for ILP.
Scientific discovery. Perhaps the most prominent application of ILP is in scientific discov-
ery. ILP has, for instance, been used to identify and predict ligands (substructures re-
sponsible for medicinal activity) [52] and infer missing pathways in protein signalling net-
works [50]. There has been much recent work on applying ILP in ecology [10, 105, 11].
For instance, Bohan et al [10] use ILP to generate plausible and testable hypotheses for
trophic relations (‘who eats whom’) from ecological data.
Games. Inducing game rules has a long history in ILP, where chess has often been the
focus [80]. Legras et al [68] show that Aleph and TILDE can outperform an SVM learner
in the game of Bridge. Law et al [62] use ILASP to induce the rules for Sudoku and show
that this more expressive formalism allows for game rules to be expressed more com-
pactly. Cropper et al [26] introduce the ILP problem of inductive general game playing:
the problem of inducing game rules from observations, such as Checkers, Sokoban, and
Connect Four.
Data curation and transformation. Another successful application of ILP is in data cura-
tion and transformation, which is again largely because ILP can learn executable pro-
grams. The most prominent examples of such tasks are string transformations, such as
the example given in the introduction. There is much interest in this topic, largely due to
success in synthesising programs for end-user problems, such as string transformations
in Microsoft Excel [45]. String transformations have become a standard benchmark for
recent ILP papers [69, 27, 15, 18]. Other transformation tasks include extracting values
from semi-structured data (e.g. XML files or medical records), extracting relations from
ecological papers, and spreadsheet manipulation [24].
Learning from trajectories. Learning from interpretation transitions (LFIT) [51] auto-
matically constructs a model of the dynamics of a system from the observation of its
state transitions. Given time-series data of discrete gene expression, it can learn gene
interactions, thus allowing it to explain and predict state changes over time [94]. LFIT
has been applied to learn biological models, like Boolean Networks, under several se-
mantics: memory-less deterministic systems [51, 92], and their multi-valued extensions
[93, 71]. Martínez et al [71] combine LFIT with a reinforcement learning algorithm to
learn probabilistic models with exogenous effects (effects not related to any action) from
scratch. The learner was notably integrated into a robot to perform the task of clearing the
tableware on a table. In this task, external agents interacted: people continuously brought
new tableware, and the manipulator robot had to cooperate with mobile robots to
take the tableware to the kitchen. The learner was able to learn a usable model in just
five episodes of 30 action executions. Evans et al [40] apply the Apperception Engine
to explain sequential data, such as cellular automata traces, rhythms and simple nurs-
ery tunes, image occlusion tasks, game dynamics, and sequence induction intelligence
tests. Surprisingly, they show that their system can achieve human-level performance on
the sequence induction intelligence tests in the zero-shot setting (without having been
trained on lots of other examples of such tests, and without hand-engineered knowledge
of the particular setting). At a high level, these systems take the unique selling point of
ILP systems (the ability to strongly generalise from a handful of data), and apply it to
the self-supervised setting, producing an explicit human-readable theory that explains
the observed state transitions.
In a survey paper from a decade ago, Muggleton et al [81] proposed directions for fu-
ture research. In the decade since, there have been major advances on many of the topics,
notably in predicate invention (Section 3), using higher-order logic as a representation
language (Section 3.2) and to represent hypotheses (Section 3.8), and applications in
learning actions and strategies (Section 5.2). Despite the advances, there are still many
limitations in ILP that future work should address.
Better systems. Muggleton et al [81] argue that a problem with ILP is the lack of well-
engineered tools. They state that whilst over 100 ILP systems have been built, less than a
handful of systems can be meaningfully used by ILP researchers. In the decade since the
authors highlighted this problem, little progress has been made: most ILP systems are not
easy to use. In other words, ILP systems are still notoriously difficult to use and you often
need a PhD in ILP to use any of the tools. Even then, it is still often only the developers
of a system that know how to properly use it. By contrast, driven by industry, other
forms of ML now have reliable and well-maintained implementations, such as PyTorch
and TensorFlow, which have helped drive research. A frustrating issue with ILP systems
is that they use many different language biases or even different syntax for the same
biases. For instance, the way of specifying a learning task in Progol, Aleph, TILDE, and
ILASP varies considerably despite them all using mode declarations. If it is difficult for
ILP researchers to use ILP tools, then what hope do non-ILP researchers have? For ILP
to be more widely adopted both inside and outside of academia, we must develop more
standardised, user-friendly, and better-engineered tools.
Language biases. As Cropper et al [25] state, one major issue with ILP is choosing an
appropriate language bias. For instance, Metagol uses metarules (Section 3.2) to restrict
the syntax of hypotheses and thus the hypothesis space. If a user can provide suitable
metarules, then Metagol is extremely efficient. However, if a user cannot provide suitable
metarules (which is often the case), then Metagol is almost useless. This same brittle-
ness applies to ILP systems that employ mode declarations [76]. In theory, a user can
provide very general mode declarations, such as only using a single type and allowing
unlimited recall. In practice, however, weak mode declarations often lead to very poor
performance. For good performance, users of mode-based systems often need to manu-
ally analyse a given learning task to tweak the mode declarations, often through a process
of trial and error. Moreover, if a user makes a small mistake with a mode declaration,
such as giving the wrong argument type, then the ILP system is unlikely to find a good
solution. Even for ILP experts, determining a suitable language bias is often a frustrating
and time-consuming process. We think the need for an almost perfect language bias is
severely holding back ILP from being widely adopted. We think that an important direc-
tion for future work in ILP is to develop techniques for automatically identifying suitable
language biases. Although there is some work on mode learning [72, 41, 87] and work
on identifying suitable metarules [23], this area of research is largely under-explored.
Better datasets. Interesting problems, alongside usable systems, drive research and at-
tract interest in a research field. This relationship is most evident in the deep learn-
ing community which has, over a decade, grown into the largest AI community. This
community growth has been supported by the constant introduction of new problems,
datasets, and well-engineered tools. Challenging problems that push the state-of-the-art
to its limits are essential to sustain progress in the field; otherwise, the field risks stag-
nation through only small incremental progress. ILP has, unfortunately, failed to deliver
on this front: most research is still evaluated on 20-year-old datasets. Most new datasets
that have been introduced come from toy domains and are designed to test specific
properties of the introduced technique. To an outsider, this sends a message that ILP is
not applicable to real-world problems. We think that the ILP community should learn
from the experiences of other AI communities and put significant efforts into develop-
ing datasets that identify limitations of existing methods as well as showcase potential
applications of ILP.
Relevance. New methods for predicate invention (Section 3) have improved the abilities
of ILP systems to learn large programs. Moreover, these techniques raise the potential
for ILP to be used in lifelong learning settings. However, inventing and acquiring new BK
could lead to a problem of too much BK, which can overwhelm an ILP system [103, 16].
On this issue, a key under-explored topic is that of relevancy. Given a new induction
problem with large amounts of BK, how does an ILP system decide which BK is relevant?
One emerging technique is to train a neural network to score how relevant the programs
in the BK are and to then use only the highest-scoring BK to learn programs [6, 38].
However, the empirical efficacy of this approach has yet to be demonstrated. Moreover,
these approaches have only been demonstrated on small amounts of BK and it is unclear
how they scale to BK with thousands of relations. Without efficient methods of relevance
identification, it is unclear how efficient lifelong learning can be achieved.
Handling mislabelled and ambiguous data. A major open question in ILP is how best to
handle noisy and ambiguous data. Neural ILP systems [96, 39] are designed from the
start to robustly handle mislabelled data. Although there has been work in recent years
on designing ILP systems that can handle noisy mislabelled data, there is much less work
on the even harder and more fundamental problem of designing ILP systems that can
handle raw ambiguous data. ILP systems typically assume that the input has already been
preprocessed into symbolic declarative form (typically, a set of ground atoms represent-
ing positive and negative examples). But real-world input does not arrive in symbolic
form. Consider e.g. a robot with a video camera, where the raw input is a sequence
of pixel images. Converting each pixel image into a set of ground atoms is a challenging,
non-trivial problem that should not be taken for granted. For ILP systems to be
widely applicable in the real world, they need to be redesigned so they can handle raw
ambiguous input from the outset [39, 34].
Probabilistic ILP. Real-world data is often noisy and uncertain. Extending ILP to deal with
such uncertainty substantially broadens its applicability. While StarAI is receiving grow-
ing attention, learning probabilistic programs from data is still largely under-investigated
due to the complexity of joint probabilistic and logical inference. When working with
probabilistic programs, we are interested in the probability that a program covers an
example, not only whether the program covers the example. Consequently, probabilis-
tic programs need to compute all possible derivations of an example, not just a single
one. Despite the added complexity, probabilistic ILP opens many new challenges. Most of the
existing work on probabilistic ILP considers the minimal extension of ILP to the prob-
abilistic setting, by assuming that either (i) BK facts are uncertain, or (ii) that learned
clauses need to model uncertainty. These assumptions make it possible to separate struc-
ture from uncertainty and simply reuse existing ILP techniques. Following this minimal
extension, the existing work focuses on discriminative learning in which the goal is to
learn a program for a single target relation. However, a grand challenge in probabilistic
programming is generative learning. That is, learning a program describing a genera-
tive process behind the data, not a single target relation. Learning generative programs
is a significantly more challenging problem, which has received very little attention in
probabilistic ILP.
5.4 Summary
As ILP approaches 30, we think that the recent advances surveyed in this paper have
opened up new areas of research for ILP to explore. Moreover, we hope that the next
decade sees developments on the numerous limitations we have discussed so that ILP
can have a significant impact on AI.
References
58. Kok S, Domingos PM (2007) Statistical predicate invention. In: Machine Learning,
Proceedings of the Twenty-Fourth International Conference (ICML 2007), ACM,
ACM International Conference Proceeding Series, vol 227, pp 433–440
59. Kramer S (1995) Predicate invention: A comprehensive view. Rapport technique
OFAI-TR-95-32, Austrian Research Institute for Artificial Intelligence, Vienna
60. Kramer S (2020) A brief history of learning symbolic higher-level representations
from data (and a curious look forward). In: Proceedings of the Twenty-Ninth In-
ternational Joint Conference on Artificial Intelligence, IJCAI 2020, ijcai.org, pp
4868–4876
61. Law M (2018) Inductive learning of answer set programs. PhD thesis, Imperial
College London, UK
62. Law M, Russo A, Broda K (2014) Inductive learning of answer set programs. In:
Logics in Artificial Intelligence - 14th European Conference, JELIA 2014, Springer,
Lecture Notes in Computer Science, vol 8761, pp 311–325
63. Law M, Russo A, Broda K (2018) The complexity and generality of learning answer
set programs. Artif Intell 259:110–146
64. Law M, Russo A, Bertino E, Broda K, Lobo J (2019) Representing and learning
grammars in answer set programming. In: The Thirty-Third AAAI Conference on
Artificial Intelligence, AAAI 2019, AAAI Press, pp 2919–2928
65. Law M, Russo A, Bertino E, Broda K, Lobo J (2020) Fastlas: Scalable inductive
logic programming incorporating domain-specific optimisation criteria. In: The
Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, AAAI Press,
pp 2877–2885
66. Law M, Russo A, Broda K (2020) The ilasp system for inductive learning of answer
set programs. The Association for Logic Programming Newsletter
67. Leban G, Zabkar J, Bratko I (2008) An experiment in robot discovery with ILP. In:
Inductive Logic Programming, 18th International Conference, ILP 2008, Springer,
Lecture Notes in Computer Science, vol 5194, pp 77–90
68. Legras S, Rouveirol C, Ventos V (2018) The game of bridge: A challenge for ILP. In:
Inductive Logic Programming - 28th International Conference, ILP 2018, Springer,
Lecture Notes in Computer Science, vol 11105, pp 72–87
69. Lin D, Dechter E, Ellis K, Tenenbaum JB, Muggleton S (2014) Bias reformulation for
one-shot function induction. In: ECAI 2014 - 21st European Conference on Artificial
Intelligence, 18-22 August 2014, IOS Press, Frontiers in Artificial Intelligence and
Applications, vol 263, pp 525–530
70. Marcus G (2018) Deep learning: A critical appraisal. CoRR abs/1801.00631
71. Martínez D, Alenyà G, Torras C, Ribeiro T, Inoue K (2016) Learning relational dy-
namics of stochastic domains for planning. In: Proceedings of the Twenty-Sixth
International Conference on Automated Planning and Scheduling, ICAPS 2016,
AAAI Press, pp 235–243
72. McCreath E, Sharma A (1995) Extraction of meta-knowledge to restrict the hy-
pothesis space for ilp systems. In: Eighth Australian Joint Conference on Artificial
Intelligence, pp 75–82
73. Michie D (1988) Machine learning in the next five years. In: Sleeman DH (ed) Pro-
ceedings of the Third European Working Session on Learning, EWSL 1988, Turing
Institute, Pitman Publishing, pp 107–122
74. Muggleton S (1987) Duce, an oracle-based approach to constructive induction. In:
Proceedings of the 10th International Joint Conference on Artificial Intelligence.,
94. Ribeiro T, Folschette M, Magnin M, Inoue K (2020) Learning any semantics for
dynamical systems represented by logic programs, working paper or preprint
95. Richardson M, Domingos PM (2006) Markov logic networks. Ma-
chine Learning 62(1-2):107–136, DOI 10.1007/s10994-006-5833-1, URL
https://doi.org/10.1007/s10994-006-5833-1
96. Rocktäschel T, Riedel S (2017) End-to-end differentiable proving. In: Advances in
Neural Information Processing Systems 30: Annual Conference on Neural Infor-
mation Processing Systems 2017, 4-9 December 2017, pp 3788–3800
97. Sammut C, Sheh R, Haber A, Wicaksono H (2015) The robot engineer. In: Late
Breaking Papers of the 25th International Conference on Inductive Logic Program-
ming, CEUR-WS.org, CEUR Workshop Proceedings, vol 1636, pp 101–106
98. Sato T (1995) A statistical learning method for logic programs with distribution
semantics. In: Sterling L (ed) Logic Programming, Proceedings of the Twelfth In-
ternational Conference on Logic Programming, Tokyo, Japan, June 13-16, 1995,
MIT Press, pp 715–729
99. Sato T, Kameya Y (2001) Parameter learning of logic programs for symbolic-
statistical modeling. J Artif Intell Res 15:391–454, DOI 10.1613/jair.912, URL
https://doi.org/10.1613/jair.912
100. Schüller P, Benz M (2018) Best-effort inductive logic programming via fine-grained
cost-based hypothesis generation - the inspire system at the inductive logic pro-
gramming competition. Machine Learning 107(7):1141–1169
101. Sivaraman A, Zhang T, den Broeck GV, Kim M (2019) Active inductive logic pro-
gramming for code search. In: Proceedings of the 41st International Conference
on Software Engineering, ICSE 2019, IEEE / ACM, pp 292–303
102. Srinivasan A (2001) The ALEPH manual. Machine Learning at the Computing Lab-
oratory, Oxford University
103. Srinivasan A, King RD, Bain M (2003) An empirical study of the use of relevance
information in inductive logic programming. J Machine Learning Res 4:369–383
104. Stahl I (1995) The appropriateness of predicate invention as bias shift operation
in ILP. Machine Learning 20(1-2):95–117
105. Tamaddoni-Nezhad A, Bohan D, Raybould A, Muggleton S (2014) Towards ma-
chine learning of predictive models from ecological data. In: Inductive Logic Pro-
gramming - 24th International Conference, ILP 2014, Springer, Lecture Notes in
Computer Science, vol 9046, pp 154–167
106. Wang WY, Mazaitis K, Cohen WW (2014) Structure learning via parameter learn-
ing. In: Proceedings of the 23rd ACM International Conference on Conference on
Information and Knowledge Management, CIKM 2014, ACM, pp 1199–1208
107. Wirth N (1985) Algorithms and data structures. Prentice Hall
108. Yang F, Yang Z, Cohen WW (2017) Differentiable learning of logical rules for knowl-
edge base reasoning. In: NIPS 2017