
INVITED PAPER

A Review of Relational Machine Learning for Knowledge Graphs

This paper reviews how statistical models can be "trained" on large knowledge graphs and then used to predict new facts about the world.

By Maximilian Nickel, Kevin Murphy, Volker Tresp, and Evgeniy Gabrilovich

ABSTRACT | Relational machine learning studies methods for the statistical analysis of relational, or graph-structured, data. In this paper, we provide a review of how such statistical models can be "trained" on large knowledge graphs, and then used to predict new facts about the world (which is equivalent to predicting new edges in the graph). In particular, we discuss two fundamentally different kinds of statistical relational models, both of which can scale to massive data sets. The first is based on latent feature models such as tensor factorization and multiway neural networks. The second is based on mining observable patterns in the graph. We also show how to combine these latent and observable models to get improved modeling power at decreased computational cost. Finally, we discuss how such statistical models of graphs can be combined with text-based information extraction methods for automatically constructing knowledge graphs from the Web. To this end, we also discuss Google's knowledge vault project as an example of such a combination.

KEYWORDS | Graph-based models; knowledge extraction; knowledge graphs; latent feature models; statistical relational learning

I. INTRODUCTION

"I am convinced that the crux of the problem of learning is recognizing relationships and being able to use them" (Christopher Strachey in a letter to Alan Turing, 1954).

Traditional machine learning algorithms take as input a feature vector, which represents an object in terms of numeric or categorical attributes. The main learning task is to learn a mapping from this feature vector to an output prediction of some form. This could be class labels, a regression score, or an unsupervised cluster id or latent vector (embedding). In statistical relational learning (SRL), the representation of an object can contain its relationships to other objects. Thus the data is in the form of a graph, consisting of nodes (entities) and labeled edges (relationships between entities). The main goals of SRL include prediction of missing edges, prediction of properties of nodes, and clustering of nodes based on their connectivity patterns. These tasks arise in many settings, such as the analysis of social networks and biological pathways. For further information on SRL, see [1]–[3].

In this paper, we review a variety of techniques from the SRL community and explain how they can be applied to large-scale knowledge graphs (KGs), i.e., graph-structured knowledge bases (KBs) that store factual information in the form of relationships between entities. Recently, a large number of knowledge graphs have been created, including YAGO [4], DBpedia [5], NELL [6], Freebase [7], and the Google Knowledge Graph [8]. As we discuss in Section II, these graphs contain millions of nodes and billions of edges. This causes us to focus on scalable SRL techniques, which take time that is (at most) linear in the size of the graph.

We can apply SRL methods to existing KGs to learn a model that can predict new facts (edges) given existing facts. We can then combine this approach with information extraction methods that extract "noisy" facts from the Web (see, e.g., [9] and [10]). For example, suppose an information extraction method returns a fact claiming that Barack Obama was born in Kenya, and suppose (for illustration purposes) that the true place of birth of Obama was not already stored in the knowledge graph. An SRL model can use related facts about Obama (such as his

profession being U.S. President) to infer that this new fact is unlikely to be true and should be discarded. This provides us a way to "grow" a KG automatically, as we explain in more detail in Section IX.

The remainder of this paper is structured as follows. In Section II, we introduce knowledge graphs and some of their properties. Section III discusses SRL and how it can be applied to knowledge graphs. There are two main classes of SRL techniques: those that capture the correlation between the nodes/edges using latent variables, and those that capture the correlation directly using statistical models based on the observable properties of the graph. We discuss these two families in Sections IV and V, respectively. Section VI describes methods for combining these two approaches in order to get the best of both worlds. Section VII discusses how such models can be trained on KGs. In Section VIII, we discuss relational learning using Markov random fields. In Section IX, we describe how SRL can be used in automated knowledge base construction projects. In Section X, we discuss extensions of the presented methods, and Section XI presents our conclusions.

II. KNOWLEDGE GRAPHS

In this section, we introduce knowledge graphs and discuss how they are represented, constructed, and used.

A. Knowledge Representation

Knowledge graphs model information in the form of entities and relationships between them. This kind of relational knowledge representation has a long history in logic and artificial intelligence [11], for example, in semantic networks [12] and frames [13]. More recently, it has been used in the Semantic Web community with the purpose of creating a "web of data" that is readable by machines [14]. While this vision of the Semantic Web remains to be fully realized, parts of it have been achieved. In particular, the concept of linked data [15], [16] has gained traction, as it facilitates publishing and interlinking data on the Web in relational form using the W3C Resource Description Framework (RDF) [17], [18]. (For an introduction to knowledge representation, see, e.g., [11], [19], and [20].)

In this paper, we will loosely follow the RDF standard and represent facts in the form of binary relationships, in particular (subject, predicate, object) (SPO) triples, where subject and object are entities and predicate is the relation between them. (We discuss how to represent higher arity relations in Section X-A.) The existence of a particular SPO triple indicates an existing fact, i.e., that the respective entities are in a relationship of the given type. For instance, the information

"Leonard Nimoy was an actor who played the character Spock in the science-fiction movie Star Trek"

can be expressed via the following set of SPO triples:

(LeonardNimoy, profession, Actor)
(LeonardNimoy, starredIn, StarTrek)
(LeonardNimoy, played, Spock)
(Spock, characterIn, StarTrek)
(StarTrek, genre, ScienceFiction)

We can combine all the SPO triples together to form a multigraph, where nodes represent entities (all subjects and objects) and directed edges represent relationships. The direction of an edge indicates whether entities occur as subjects or objects, i.e., an edge points from the subject to the object. Different relations are represented via different types of edges (also called edge labels). This construction is called a knowledge graph (KG), or sometimes a heterogeneous information network [21]. See Fig. 1 for an example.

Fig. 1. Sample knowledge graph. Nodes represent entities, edge labels represent types of relations, and edges represent existing relationships.

In addition to being a collection of facts, knowledge graphs often provide type hierarchies (Leonard Nimoy is an actor, which is a person, which is a living thing) and type constraints (e.g., a person can only marry another person, not a thing).

B. Open Versus Closed World Assumption

While existing triples always encode known true relationships (facts), there are different paradigms for the interpretation of nonexisting triples.
• Under the closed world assumption (CWA), nonexisting triples indicate false relationships. For example, the fact that in Fig. 1 there is no starredIn edge from Leonard Nimoy to Star Wars is interpreted to mean that Nimoy definitely did not star in this movie.
• Under the open world assumption (OWA), a nonexisting triple is interpreted as unknown, i.e., the corresponding relationship can be either true or false. Continuing with the above example, the missing edge is not interpreted to mean that Nimoy did not star in Star Wars. This more cautious approach is justified, since KGs are known to be very incomplete. For example, sometimes just the main actors in a movie are listed, not the complete cast. As another example, note that even the place of birth attribute, which you might think would be typically known, is missing for 71% of all people included in Freebase [22].

RDF and the Semantic Web make the open-world assumption. In Section VII-B, we also discuss the local closed world assumption (LCWA), which is often used for training relational models.
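To make the two assumptions concrete, here is a minimal Python sketch (ours, not from the paper) that stores the example triples from above and interprets a missing triple under both paradigms; the entity and relation names are the illustrative ones used in this section.

```python
# Minimal sketch: SPO triples as a set, queried under CWA vs. OWA.
kg = {
    ("LeonardNimoy", "profession", "Actor"),
    ("LeonardNimoy", "starredIn", "StarTrek"),
    ("LeonardNimoy", "played", "Spock"),
    ("Spock", "characterIn", "StarTrek"),
    ("StarTrek", "genre", "ScienceFiction"),
}

def truth_value(triple, kg, assumption="OWA"):
    """Return True, False, or None (unknown) for a candidate triple."""
    if triple in kg:
        return True      # existing triples always encode known facts
    if assumption == "CWA":
        return False     # closed world: absence means false
    return None          # open world: absence means unknown

query = ("LeonardNimoy", "starredIn", "StarWars")
print(truth_value(query, kg, "CWA"))  # False
print(truth_value(query, kg, "OWA"))  # None
```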

C. Knowledge Base Construction

Completeness, accuracy, and data quality are important parameters that determine the usefulness of knowledge bases and are influenced by the way knowledge bases are constructed. We can classify KB construction methods into four main groups:
• in curated approaches, triples are created manually by a closed group of experts;
• in collaborative approaches, triples are created manually by an open group of volunteers;
• in automated semistructured approaches, triples are extracted automatically from semistructured text (e.g., infoboxes in Wikipedia) via hand-crafted rules, learned rules, or regular expressions;
• in automated unstructured approaches, triples are extracted automatically from unstructured text via machine learning and natural language processing techniques (see, e.g., [9] for a review).

Construction of curated knowledge bases typically leads to highly accurate results, but this technique does not scale well due to its dependence on human experts. Collaborative knowledge base construction, which was used to build Wikipedia and Freebase, scales better but still has some limitations. For instance, as mentioned previously, the place of birth attribute is missing for 71% of all people included in Freebase, even though this is a mandatory property of the schema [22]. Also, a recent study [35] found that the growth of Wikipedia has been slowing down. Consequently, automatic knowledge base construction methods have been gaining more attention.

Such methods can be grouped into two main approaches. The first approach exploits semistructured data, such as Wikipedia infoboxes, which has led to large, highly accurate knowledge graphs such as YAGO [4], [27] and DBpedia [5]. The accuracy (trustworthiness) of facts in such automatically created KGs is often still very high. For instance, the accuracy of YAGO2 has been estimated¹ to be over 95% through manual inspection of sample facts [36], and the accuracy of Freebase [7] was estimated to be 99%.² However, semistructured text still covers only a small fraction of the information stored on the Web, and completeness (or coverage) is another important aspect of KGs. Hence, the second approach tries to "read the Web," extracting facts from the natural language text of Web pages. Example projects in this category include NELL [6] and the knowledge vault [28]. In Section IX, we show how we can reduce the level of "noise" in such automatically extracted facts by using the knowledge from existing, high-quality repositories.

¹For detailed statistics, see http://www.mpi-inf.mpg.de/departments/databases-and-information-systems/research/yago-naga/yago/statistics/.
²http://thenoisychannel.com/2011/11/15/cikm-2011-industry-event-john-giannandrea-on-freebase-a-rosetta-stone-for-entities

KGs, and more generally KBs, can also be classified based on whether they employ a fixed or open lexicon of entities and relations. In particular, we distinguish two main types of KBs.
• In schema-based approaches, entities and relations are represented via globally unique identifiers and all possible relations are predefined in a fixed vocabulary. For example, Freebase might represent the fact that Barack Obama was born in Hawaii using the triple (/m/02mjmr, /people/person/born-in, /m/03gh4), where /m/02mjmr is the unique machine ID for Barack Obama.
• In schema-free approaches, entities and relations are identified using open information extraction (OpenIE) techniques [37] and represented via normalized but not disambiguated strings (also referred to as surface names). For example, an OpenIE system may contain triples such as ("Obama," "born in," "Hawaii"), ("Barack Obama," "place of birth," "Honolulu"), etc. Note that it is not clear from this representation whether the first triple refers to the same person as the second triple, nor whether "born in" means the same thing as "place of birth." This is the main disadvantage of OpenIE systems.

Table 1 lists current knowledge base construction projects classified by their creation method and data schema. In this paper, we will only focus on schema-based KBs. Table 2 shows a selection of such KBs and their sizes.

Table 1. Knowledge Base Construction Projects.
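The contrast between the two kinds of KBs can be illustrated with a small sketch (ours); the machine IDs are the Freebase-style identifiers from the example above, and the helper logic is only meant to show why schema-free triples need an extra resolution step.

```python
# Schema-based: globally unique IDs, fixed relation vocabulary.
schema_based = {
    ("/m/02mjmr", "/people/person/born-in", "/m/03gh4"),
}

# Schema-free (OpenIE): normalized but not disambiguated surface strings.
schema_free = {
    ("Obama", "born in", "Hawaii"),
    ("Barack Obama", "place of birth", "Honolulu"),
}

# With surface names, neither entities nor relations are canonical:
# deciding that "Obama" == "Barack Obama", or that "born in" means the
# same as "place of birth", requires entity and relation resolution.
subjects = {s for s, p, o in schema_free}
print(subjects)  # {'Obama', 'Barack Obama'} -- two names, one person
```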

Table 2. Size of Some Schema-Based Knowledge Bases.

D. Uses of Knowledge Graphs

Knowledge graphs provide semantically structured information that is interpretable by computers, a property that is regarded as an important ingredient to build more intelligent machines [38]. Consequently, knowledge graphs are already powering multiple "Big Data" applications in a variety of commercial and scientific domains. A prime example is the integration of Google's Knowledge Graph, which currently stores 18 billion facts about 570 million entities, into the results of Google's search engine [8]. The Google Knowledge Graph is used to identify and disambiguate entities in text, to enrich search results with semantically structured summaries, and to provide links to related entities in exploratory search. (Microsoft has a similar KB, called Satori, integrated with its Bing search engine [39].)

Enhancing search results with semantic information from knowledge graphs can be seen as an important step to transform text-based search engines into semantically aware question answering services. Another prominent example demonstrating the value of knowledge graphs is IBM's question answering system Watson, which was able to beat human experts in the game of Jeopardy!. Among others, this system used YAGO, DBpedia, and Freebase as its sources of information [40]. Repositories of structured knowledge are also an indispensable component of digital assistants such as Siri, Cortana, or Google Now.

Knowledge graphs are also used in several specialized domains. For instance, Bio2RDF [41], Neurocommons [42], and LinkedLifeData [43] are knowledge graphs that integrate multiple sources of biomedical information. These have been used for question answering and decision support in the life sciences.

E. Main Tasks in Knowledge Graph Construction and Curation

In this section, we review a number of typical KG tasks. Link prediction is concerned with predicting the existence (or probability of correctness) of (typed) edges in the graph (i.e., triples). This is important since existing knowledge graphs are often missing many facts, and some of the edges they contain are incorrect [44]. In the context of knowledge graphs, link prediction is also referred to as knowledge graph completion. For example, in Fig. 1, suppose the characterIn edge from Obi-Wan Kenobi to Star Wars were missing; we might be able to predict this missing edge based on the structural similarity between this part of the graph and the part involving Spock and Star Trek. It has been shown that relational models that take the relationships of entities into account can significantly outperform nonrelational machine learning methods for this task (e.g., see [45] and [46]).

Entity resolution (also known as record linkage [47], object identification [48], instance matching [49], and deduplication [50]) is the problem of identifying which objects in relational data refer to the same underlying entities. See Fig. 2 for a small example. In a relational setting, the decisions about which objects are assumed to be identical can propagate through the graph, so that matching decisions are made collectively for all objects in a domain rather than independently for each object pair (see, for example, [51]–[53]). In schema-based automated knowledge base construction, entity resolution can be used to match the extracted surface names to entities stored in the knowledge graph.

Fig. 2. Example of entity resolution in a toy knowledge graph. In this example, nodes 1 and 3 refer to the identical entity, the actor Alec Guinness. Node 2, on the other hand, refers to Arthur Guinness, the founder of the Guinness brewery. The surface name of node 2 ("A. Guinness") alone would not be sufficient to perform a correct matching, as it could refer to both Alec Guinness and Arthur Guinness. However, since links in the graph reveal the occupations of the persons, a relational approach can perform the correct matching.
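As a rough illustration of how relational context can disambiguate a surface name, in the spirit of Fig. 2 (a toy sketch of ours; the entity names and the overlap score are made up for the example):

```python
# Toy sketch: disambiguate a surface name using relational context.
# Candidate entities with their known relations in the KG.
candidates = {
    "AlecGuinness":   {("profession", "Actor"), ("starredIn", "StarWars")},
    "ArthurGuinness": {("profession", "Brewer"), ("founded", "Guinness")},
}

# Relations observed for the ambiguous mention "A. Guinness".
mention_context = {("profession", "Actor")}

def best_match(context, candidates):
    """Score candidates by overlap between mention context and KG relations."""
    scores = {name: len(context & rels) for name, rels in candidates.items()}
    return max(scores, key=scores.get)

print(best_match(mention_context, candidates))  # AlecGuinness
```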

Link-based clustering extends feature-based clustering to a relational learning setting and groups entities in relational data based on their similarity. However, in link-based clustering, entities are not only grouped by the similarity of their features but also by the similarity of their links. As in entity resolution, the similarity of entities can propagate through the knowledge graph, such that relational modeling can add important information for this task. In social network analysis, link-based clustering is also known as community detection [54].

III. STATISTICAL RELATIONAL LEARNING FOR KNOWLEDGE GRAPHS

Statistical relational learning is concerned with the creation of statistical models for relational data. In the following sections, we discuss how statistical relational learning can be applied to knowledge graphs. We will assume that all the entities and (types of) relations in a knowledge graph are known. (We discuss extensions of this assumption in Section X-C.) However, triples are assumed to be incomplete and noisy; entities and relation types may contain duplicates.

Notation: Before proceeding, let us define our mathematical notation. (Variable names will be introduced later in the appropriate sections.) We denote scalars by lowercase letters, such as $a$; column vectors (of size $N$) by bold lowercase letters, such as $\mathbf{a}$; matrices (of size $N_1 \times N_2$) by bold uppercase letters, such as $\mathbf{A}$; and tensors (of size $N_1 \times N_2 \times N_3$) by bold uppercase letters with an underscore, such as $\underline{\mathbf{A}}$. We denote the $k$th "frontal slice" of a tensor $\underline{\mathbf{A}}$ by $\mathbf{A}_k$ (which is a matrix of size $N_1 \times N_2$), and the $(i,j,k)$th element by $a_{ijk}$ (which is a scalar). We use $[\mathbf{a}; \mathbf{b}]$ to denote the vertical stacking of the vectors $\mathbf{a}$ and $\mathbf{b}$. We can convert a matrix $\mathbf{A}$ of size $N_1 \times N_2$ into a vector $\mathbf{a}$ of size $N_1 N_2$ by stacking all columns of $\mathbf{A}$, denoted $\mathbf{a} = \mathrm{vec}(\mathbf{A})$. The inner (scalar) product of two vectors (both of size $N$) is defined by $\mathbf{a}^\top \mathbf{b} = \sum_{i=1}^{N} a_i b_i$. The tensor (Kronecker) product of two vectors (of size $N_1$ and $N_2$) is a vector of size $N_1 N_2$ with entries $\mathbf{a} \otimes \mathbf{b} = [a_1 \mathbf{b}; \ldots; a_{N_1} \mathbf{b}]$. Matrix multiplication is denoted by $\mathbf{A}\mathbf{B}$ as usual. We denote the $L_2$ norm of a vector by $\|\mathbf{a}\|_2 = \sqrt{\sum_i a_i^2}$, and the Frobenius norm of a matrix by $\|\mathbf{A}\|_F = \sqrt{\sum_i \sum_j a_{ij}^2}$. We denote the vector of all ones by $\mathbf{1}$, and the identity matrix by $\mathbf{I}$.

A. Probabilistic Knowledge Graphs

We now introduce some mathematical background so we can more formally define statistical models for knowledge graphs.

Let $\mathcal{E} = \{e_1, \ldots, e_{N_e}\}$ be the set of all entities and $\mathcal{R} = \{r_1, \ldots, r_{N_r}\}$ be the set of all relation types in a knowledge graph. We model each possible triple $x_{ijk} = (e_i, r_k, e_j)$ over this set of entities and relations as a binary random variable $y_{ijk} \in \{0, 1\}$ that indicates its existence. All possible triples in $\mathcal{E} \times \mathcal{R} \times \mathcal{E}$ can be grouped naturally in a third-order tensor (three-way array) $\underline{\mathbf{Y}} \in \{0,1\}^{N_e \times N_e \times N_r}$, whose entries are set such that

$$y_{ijk} = \begin{cases} 1, & \text{if the triple } (e_i, r_k, e_j) \text{ exists} \\ 0, & \text{otherwise.} \end{cases}$$

We will refer to this construction as an adjacency tensor (cf. Fig. 3). Each possible realization of $\underline{\mathbf{Y}}$ can be interpreted as a possible world. To derive a model for the entire knowledge graph, we are then interested in estimating the joint distribution $P(\underline{\mathbf{Y}})$ from a subset $\mathcal{D} \subseteq \mathcal{E} \times \mathcal{R} \times \mathcal{E} \times \{0,1\}$ of observed triples. In doing so, we are estimating a probability distribution over possible worlds, which allows us to predict the probability of triples based on the state of the entire knowledge graph. While $y_{ijk} = 1$ in adjacency tensors indicates the existence of a triple, the interpretation of $y_{ijk} = 0$ depends on whether the open world, closed world, or local-closed world assumption is made. For details, see Section VII-B.

Fig. 3. Tensor representation of binary relational data.

Note that the size of $\underline{\mathbf{Y}}$ can be enormous for large knowledge graphs. For instance, in the case of Freebase, which currently consists of over 40 million entities and 35,000 relations, the number of possible triples $|\mathcal{E} \times \mathcal{R} \times \mathcal{E}|$ exceeds $10^{19}$ elements. Of course, type constraints reduce this number considerably.

Even amongst the syntactically valid triples, only a tiny fraction are likely to be true. For example, there are over 450,000 actors and over 250,000 movies stored in Freebase. But each actor stars only in a small number of movies. Therefore, an important issue for SRL on knowledge graphs is how to deal with the large number of possible relationships while efficiently exploiting the sparsity of relationships. Ideally, a relational model for large-scale knowledge graphs should scale at most linearly with the data size, i.e., linearly in the number of entities $N_e$, linearly in the number of relations $N_r$, and linearly in the number of observed triples $|\mathcal{D}| = N_d$.
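Returning to the adjacency tensor, here is a minimal numpy sketch (ours) that builds $\underline{\mathbf{Y}}$ from a list of observed triples, using toy entities and relations from Fig. 1:

```python
import numpy as np

entities = ["LeonardNimoy", "Spock", "StarTrek"]
relations = ["played", "characterIn", "starredIn"]
e_idx = {e: i for i, e in enumerate(entities)}
r_idx = {r: k for k, r in enumerate(relations)}

triples = [
    ("LeonardNimoy", "played", "Spock"),
    ("Spock", "characterIn", "StarTrek"),
    ("LeonardNimoy", "starredIn", "StarTrek"),
]

# Adjacency tensor Y with shape (N_e, N_e, N_r): y_ijk = 1 iff (e_i, r_k, e_j) holds.
Y = np.zeros((len(entities), len(entities), len(relations)), dtype=np.int8)
for s, p, o in triples:
    Y[e_idx[s], e_idx[o], r_idx[p]] = 1

print(Y[:, :, r_idx["starredIn"]])  # frontal slice: adjacency matrix of one relation
```

At the scale of Freebase, a dense array is of course infeasible; practical implementations store only the nonzero entries, e.g., as a coordinate list or as one sparse matrix per frontal slice.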

B. Statistical Properties of Knowledge Graphs

Knowledge graphs typically adhere to some deterministic rules, such as type constraints and transitivity (e.g., if Leonard Nimoy was born in Boston, and Boston is located in the United States, then we can infer that Leonard Nimoy was born in the United States). However, KGs typically also have various "softer" statistical patterns or regularities, which are not universally true but nevertheless have useful predictive power.

One example of such a statistical pattern is known as homophily, that is, the tendency of entities to be related to other entities with similar characteristics. This has been widely observed in various social networks [55], [56]. For example, U.S.-born actors are more likely to star in U.S.-made movies. For multirelational data (graphs with more than one kind of link), homophily has also been referred to as autocorrelation [57].

Another statistical pattern is known as block structure. This refers to the property where entities can be divided into distinct groups (blocks), such that all the members of a group have similar relationships to members of other groups [58]–[60]. For example, we can group some actors, such as Leonard Nimoy and Alec Guinness, into a science fiction actor block, and some movies, such as Star Trek and Star Wars, into a science fiction movie block, since there is a high density of links from the scifi actor block to the scifi movie block.

Graphs can also exhibit global and long-range statistical dependencies, i.e., dependencies that can span over chains of triples and involve different types of relations. For example, the citizenship of Leonard Nimoy (USA) depends statistically on the city where he was born (Boston), and this dependency involves a path over multiple entities (Leonard Nimoy, Boston, USA) and relations (bornIn, locatedIn, citizenOf). A distinctive feature of relational learning is that it is able to exploit such patterns to create richer and more accurate models of relational domains.

When applying statistical models to incomplete knowledge graphs, it should be noted that the distribution of facts in such KGs can be skewed. For instance, KGs that are derived from Wikipedia will inherit the skew that exists in the distribution of facts in Wikipedia itself.⁸ Statistical models as discussed in the following sections can be affected by such biases in the input data and need to be interpreted accordingly.

⁸As an example, there are currently 10,306 male and 7,586 female American actors listed in Wikipedia, while there are only 1,268 male and 1,354 female Indian, and 77 male and no female Nigerian actors. India and Nigeria, however, are the largest and second largest film industries in the world.

C. Types of SRL Models

As we discussed, the presence or absence of certain triples in relational data is correlated with (i.e., predictive of) the presence or absence of certain other triples. In other words, the random variables $y_{ijk}$ are correlated with each other. We will discuss three main ways to model these correlations.
1) Assume all $y_{ijk}$ are conditionally independent given latent features associated with subject, object, and relation type and additional parameters (latent feature models).
2) Assume all $y_{ijk}$ are conditionally independent given observed graph features and additional parameters (graph feature models).
3) Assume all $y_{ijk}$ have local interactions (Markov random fields).
In what follows, we will mainly focus on M1 and M2 and their combination; M3 will be the topic of Section VIII.

The model classes M1 and M2 predict the existence of a triple $x_{ijk}$ via a score function $f(x_{ijk}; \Theta)$, which represents the model's confidence that a triple exists given the parameters $\Theta$. The conditional independence assumptions of M1 and M2 allow the probability model to be written as follows:

$$P(\underline{\mathbf{Y}} \mid \mathcal{D}, \Theta) = \prod_{i=1}^{N_e} \prod_{j=1}^{N_e} \prod_{k=1}^{N_r} \mathrm{Ber}\big(y_{ijk} \mid \sigma(f(x_{ijk}; \Theta))\big) \qquad (1)$$

where $\sigma(u) = 1/(1 + e^{-u})$ is the sigmoid (logistic) function, and

$$\mathrm{Ber}(y \mid p) = \begin{cases} p, & \text{if } y = 1 \\ 1 - p, & \text{if } y = 0 \end{cases} \qquad (2)$$

is the Bernoulli distribution.

We will refer to models of the form (1) as probabilistic models. In addition to probabilistic models, we will also discuss models which optimize $f(\cdot)$ under other criteria, for instance, models which maximize the margin between existing and nonexisting triples. We will refer to such models as score-based models. If desired, we can derive probabilities for score-based models via Platt scaling [61].

There are many different methods for defining $f(\cdot)$. In Sections IV–VI and VIII, we will discuss different options for all model classes. In Section VII, we will furthermore discuss aspects of how to train these models on knowledge graphs.
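For concreteness, a small sketch (ours) of how generic scores induce triple probabilities and a Bernoulli log-likelihood as in (1) and (2); `sigmoid` and `log_likelihood` are our own helper names, and the scores stand in for $f(x_{ijk}; \Theta)$ produced by any of the models discussed below:

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def log_likelihood(scores, labels):
    """Bernoulli log-likelihood of observed triples, cf. Eq. (1).

    scores: f(x_ijk; theta) for each observed triple
    labels: y_ijk in {0, 1}
    """
    p = sigmoid(scores)
    return np.sum(labels * np.log(p) + (1 - labels) * np.log(1 - p))

# Example: three observed triples with model scores and labels.
scores = np.array([2.1, -0.4, 3.0])
labels = np.array([1, 0, 1])
print(sigmoid(scores))                 # per-triple probabilities
print(log_likelihood(scores, labels))  # objective to maximize during training
```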

IV. LATENT FEATURE MODELS

In this section, we assume that the variables $y_{ijk}$ are conditionally independent given a set of global latent features and parameters, as in (1). We discuss various possible forms for the score function $f(x; \Theta)$ below. What all models have in common is that they explain triples via latent features of entities (this is justified via various theoretical arguments [62]). For instance, a possible explanation for the fact that Alec Guinness received the Academy Award is that he is a good actor. This explanation uses latent features of entities (being a good actor) to explain observable facts (Guinness receiving the Academy Award). We call these features "latent" because they are not directly observed in the data. One task of all latent feature models is therefore to infer these features automatically from the data.

In the following, we will denote the latent feature representation of an entity $e_i$ by the vector $\mathbf{e}_i \in \mathbb{R}^{H_e}$, where $H_e$ denotes the number of latent features in the model. For instance, we could model that Alec Guinness is a good actor and that the Academy Award is a prestigious award via the vectors

$$\mathbf{e}_{\text{Guinness}} = \begin{bmatrix} 0.9 \\ 0.2 \end{bmatrix}, \qquad \mathbf{e}_{\text{AcademyAward}} = \begin{bmatrix} 0.2 \\ 0.8 \end{bmatrix}$$

where the component $e_{i1}$ corresponds to the latent feature Good Actor and $e_{i2}$ corresponds to Prestigious Award. (Note that, unlike this example, the latent features that are inferred by the following models are typically hard to interpret.)

The key intuition behind relational latent feature models is that the relationships between entities can be derived from interactions of their latent features. However, there are many possible ways to model these interactions, and many ways to derive the existence of a relationship from them. We discuss several possibilities below. See Table 3 for a summary of the notation.

Table 3. Summary of the Notation.

A. RESCAL: A Bilinear Model

RESCAL [63]–[65] is a relational latent feature model which explains triples via pairwise interactions of latent features. In particular, we model the score of a triple $x_{ijk}$ as

$$f_{ijk}^{\text{RESCAL}} := \mathbf{e}_i^\top \mathbf{W}_k \mathbf{e}_j = \sum_{a=1}^{H_e} \sum_{b=1}^{H_e} w_{abk}\, e_{ia}\, e_{jb} \qquad (3)$$

where $\mathbf{W}_k \in \mathbb{R}^{H_e \times H_e}$ is a weight matrix whose entries $w_{abk}$ specify how much the latent features $a$ and $b$ interact in the $k$th relation. We call this a bilinear model, since it captures the interactions between the two entity vectors using multiplicative terms. For instance, we could model the pattern that good actors are likely to receive prestigious awards via a weight matrix such as

$$\mathbf{W}_{\text{receivedAward}} = \begin{bmatrix} 0.1 & 0.9 \\ 0.1 & 0.1 \end{bmatrix}.$$

In general, we can model block structure patterns via the magnitude of entries in $\mathbf{W}_k$, while we can model homophily patterns via the magnitude of its diagonal entries. Anticorrelations in these patterns can be modeled via negative entries in $\mathbf{W}_k$.

Hence, in (3) we compute the score of a triple $x_{ijk}$ via the weighted sum of all pairwise interactions between the latent features of the entities $e_i$ and $e_j$. The parameters of the model are $\Theta = \{\{\mathbf{e}_i\}_{i=1}^{N_e}, \{\mathbf{W}_k\}_{k=1}^{N_r}\}$. During training, we jointly learn the latent representations of entities and how the latent features interact for particular relation types.

In the following, we will discuss further important properties of the model for learning from knowledge graphs.

Relational Learning Via Shared Representations: In (3), entities have the same latent representation regardless of whether they occur as subjects or objects in a relationship. Furthermore, they have the same representation over all different relation types. For instance, the $i$th entity occurs in the triple $x_{ijk}$ as the subject of a relationship of type $k$, while it occurs in the triple $x_{piq}$ as the object of a relationship of type $q$. However, the predictions $f_{ijk} = \mathbf{e}_i^\top \mathbf{W}_k \mathbf{e}_j$ and $f_{piq} = \mathbf{e}_p^\top \mathbf{W}_q \mathbf{e}_i$ both use the same latent representation $\mathbf{e}_i$ of the $i$th entity. Since all parameters are learned jointly, these shared representations permit us to propagate information between triples via the latent representations of entities and the weights of relations. This allows the model to capture global dependencies in the data.
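A minimal numpy sketch (ours) of the bilinear score (3), using the two-dimensional toy embeddings and the illustrative weight matrix from above:

```python
import numpy as np

# Toy latent representations (H_e = 2): [Good Actor, Prestigious Award].
e_guinness = np.array([0.9, 0.2])
e_academy_award = np.array([0.2, 0.8])

# Relation-specific weight matrix: entry (a, b) couples subject feature a
# with object feature b; the large (0, 1) entry encodes the pattern that
# good actors receive prestigious awards.
W_received_award = np.array([[0.1, 0.9],
                             [0.1, 0.1]])

# Bilinear score of Eq. (3): e_i^T W_k e_j.
score = e_guinness @ W_received_award @ e_academy_award
print(score)  # 0.686 -> a comparatively high score for this triple
```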

Semantic Embeddings: The shared entity representations in RESCAL also capture the similarity of entities in the relational domain, i.e., that entities are similar if they are connected to similar entities via similar relations [65]. For instance, if the representations of $\mathbf{e}_i$ and $\mathbf{e}_p$ are similar, the predictions $f_{ijk}$ and $f_{pjk}$ will have similar values. In return, entities with many similar observed relationships will have similar latent representations. This property can be exploited for entity resolution and has also enabled large-scale hierarchical clustering on relational data [63], [64]. Moreover, since relational similarity is expressed via the similarity of vectors, the latent representations $\mathbf{e}_i$ can act as proxies to give nonrelational machine learning algorithms, such as k-means or kernel methods, access to the relational similarity of entities.

Connection to Tensor Factorization: RESCAL is similar to methods used in recommendation systems [66] and to traditional tensor factorization methods [67]. In matrix notation, (3) can be written compactly as $\mathbf{F}_k = \mathbf{E} \mathbf{W}_k \mathbf{E}^\top$, where $\mathbf{F}_k \in \mathbb{R}^{N_e \times N_e}$ is the matrix holding all scores for the $k$th relation and the $i$th row of $\mathbf{E} \in \mathbb{R}^{N_e \times H_e}$ holds the latent representation of $e_i$. See Fig. 4 for an illustration. In the following, we will use this tensor representation to derive a very efficient algorithm for parameter estimation.

Fig. 4. RESCAL as a tensor factorization of the adjacency tensor Y. Figure adapted from [147].

Fitting the Model: If we want to compute a probabilistic model, the parameters of RESCAL can be estimated by minimizing the log-loss using gradient-based methods such as stochastic gradient descent [68]. RESCAL can also be computed as a score-based model, which has the main advantage that we can estimate the parameters $\Theta$ very efficiently: due to its tensor structure and due to the sparsity of the data, it has been shown that the RESCAL model can be computed via a sequence of efficient closed-form updates when using the squared loss [63], [64]. In this setting, it has been shown analytically that a single update of $\mathbf{E}$ and $\mathbf{W}_k$ scales linearly with the number of entities $N_e$, linearly with the number of relations $N_r$, and linearly with the number of observed triples, i.e., the number of nonzero entries in $\underline{\mathbf{Y}}$ [64]. We call this algorithm RESCAL-ALS.⁹ In practice, a small number (say 30 to 50) of iterated updates are often sufficient for RESCAL-ALS to arrive at stable estimates of the parameters. Given a current estimate of $\mathbf{E}$, the updates for each $\mathbf{W}_k$ can be computed in parallel to improve the scalability on knowledge graphs with a large number of relations. Furthermore, by exploiting the special tensor structure of RESCAL, we can derive improved updates for RESCAL-ALS that compute the estimates for the parameters with a runtime complexity of $O(H_e^3)$ for a single update (as opposed to a runtime complexity of $O(H_e^5)$ for naive updates) [65], [69]. In summary, for relational domains that can be explained via a moderate number of latent features, RESCAL-ALS is highly scalable and very fast to compute. For more detail on RESCAL-ALS, see also (26) in Section VII.

⁹ALS stands for alternating least-squares.

Decoupled Prediction: In (3), the probability of a single relationship is computed via simple matrix–vector products in $O(H_e^2)$ time. Hence, once the parameters have been estimated, the computational complexity to predict the score of a triple depends only on the number of latent features and is independent of the size of the graph. However, during parameter estimation, the model can capture global dependencies due to the shared latent representations.

Relational Learning Results: RESCAL has been shown to achieve state-of-the-art results on a number of relational learning tasks. For instance, [63] showed that RESCAL provides comparable or better relationship prediction results on a number of small benchmark data sets compared to Markov logic networks (with structure learning) [70], the infinite (hidden) relational model [71], [72], and Bayesian clustered tensor factorization [73]. Moreover, RESCAL has been used for link prediction on entire knowledge graphs such as YAGO and DBpedia [64], [74]. Aside from link prediction, RESCAL has also successfully been applied to SRL tasks such as entity resolution and link-based clustering. For instance, RESCAL has shown state-of-the-art results in predicting which authors, publications, or publication venues are likely to be identical in publication databases [63], [65]. Furthermore, the semantic embedding of entities computed by RESCAL has been exploited to create taxonomies for uncategorized data via hierarchical clusterings of entities in the embedding space [75].
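The flavor of one ALS update can be conveyed with a rough sketch (ours; it omits the regularization and the efficiency refinements of [64], [65], and `update_Wk` is our own helper name):

```python
import numpy as np

def update_Wk(Yk, E):
    """Closed-form least-squares update of W_k given E (simplified sketch).

    Solves min_W ||Y_k - E W E^T||_F^2, whose normal equations give
    W = (E^T E)^+ E^T Y_k E (E^T E)^+.
    """
    G = np.linalg.pinv(E.T @ E)   # (H_e x H_e) Gram inverse, reused twice
    return G @ E.T @ Yk @ E @ G

# Toy data: 4 entities, H_e = 2 latent features, one relation slice.
rng = np.random.default_rng(0)
E = rng.normal(size=(4, 2))
Yk = (rng.random((4, 4)) < 0.3).astype(float)  # sparse binary adjacency slice

Wk = update_Wk(Yk, E)
scores = E @ Wk @ E.T             # reconstructed score matrix F_k
print(np.round(scores, 2))
```

In the full algorithm, this update alternates with a corresponding update of E and includes regularization terms; the $O(H_e^3)$ updates of [65], [69] additionally avoid forming the large intermediate products explicitly.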

B. Other Tensor Factorization Models

Various other tensor factorization methods have been explored for learning from knowledge graphs and multirelational data. Kolda et al. [76] and Franz et al. [77] factorized adjacency tensors using the CP tensor decomposition to analyze the link structure of Web pages and Semantic Web data, respectively. Drumond et al. [78] applied pairwise interaction tensor factorization [79] to predict triples in knowledge graphs. Rendle [80] applied factorization machines to large unirelational data sets in recommendation settings. Jenatton et al. [81] proposed a tensor factorization model for knowledge graphs with a very large number of different relations.

It is also possible to use discrete latent factors. Miettinen [82] proposed Boolean tensor factorization to disambiguate facts extracted with OpenIE methods and applied it to large data sets [83]. In contrast to the previously discussed factorizations, Boolean tensor factorizations are discrete models, where adjacency tensors are decomposed into binary factors based on Boolean algebra.

C. Matrix Factorization Methods

Another approach for learning from knowledge graphs is based on matrix factorization, where, prior to the factorization, the adjacency tensor $\underline{\mathbf{Y}} \in \mathbb{R}^{N_e \times N_e \times N_r}$ is reshaped into a matrix $\mathbf{Y} \in \mathbb{R}^{N_e^2 \times N_r}$ by associating rows with subject–object pairs $(e_i, e_j)$ and columns with relations $r_k$ (cf. [84] and [85]), or into a matrix $\mathbf{Y} \in \mathbb{R}^{N_e \times N_e N_r}$ by associating rows with subjects $e_i$ and columns with relation/objects $(r_k, e_j)$ (cf. [86] and [87]). Unfortunately, both of these formulations lose information compared to tensor factorization. For instance, if each subject–object pair is modeled via a different latent representation, the information that the relationships $y_{ijk}$ and $y_{pjq}$ share the same object is lost. It also leads to an increased memory complexity, since a separate latent representation is computed for each pair of entities, requiring $O(N_e^2 H_e + N_r H_e)$ parameters (compared to $O(N_e H_e + N_r H_e^2)$ parameters for RESCAL).

D. Multilayer Perceptrons

We can interpret RESCAL as creating composite representations of triples and predicting their existence from this representation. In particular, we can rewrite RESCAL as

$$f_{ijk}^{\text{RESCAL}} := \mathbf{w}_k^\top \boldsymbol{\phi}_{ij}^{\text{RESCAL}} \qquad (4)$$
$$\boldsymbol{\phi}_{ij}^{\text{RESCAL}} := \mathbf{e}_j \otimes \mathbf{e}_i \qquad (5)$$

where $\mathbf{w}_k = \mathrm{vec}(\mathbf{W}_k)$. Equation (4) follows from (3) via the equality $\mathrm{vec}(\mathbf{A}\mathbf{X}\mathbf{B}) = (\mathbf{B}^\top \otimes \mathbf{A})\,\mathrm{vec}(\mathbf{X})$. Hence, RESCAL represents pairs of entities $(e_i, e_j)$ via the tensor product of their latent feature representations (5) and predicts the existence of the triple $x_{ijk}$ from $\boldsymbol{\phi}_{ij}$ via $\mathbf{w}_k$ (4). See also Fig. 5(a). For a further discussion of the tensor product to create composite latent representations, see [88]–[90].

Since the tensor product explicitly models all pairwise interactions, RESCAL can require a lot of parameters when the number of latent features is large (each matrix $\mathbf{W}_k$ has $H_e^2$ entries). This can, for instance, lead to scalability problems on knowledge graphs with a large number of relations.

In the following, we will discuss models based on multilayer perceptrons (MLPs), also known as feedforward neural networks. In the context of multidimensional data, they can be referred to as multiway neural networks. This approach allows us to consider alternative ways to create composite triple representations and to use nonlinear functions to predict their existence.

In particular, let us define the following E-MLP model (E for entity):

$$f_{ijk}^{\text{E-MLP}} := \mathbf{w}_k^\top g\big(\mathbf{h}_{ijk}^{a}\big) \qquad (6)$$
$$\mathbf{h}_{ijk}^{a} := \mathbf{A}_k^\top \boldsymbol{\phi}_{ij}^{\text{E-MLP}} \qquad (7)$$
$$\boldsymbol{\phi}_{ij}^{\text{E-MLP}} := [\mathbf{e}_i; \mathbf{e}_j] \qquad (8)$$

where $g(\mathbf{u}) = [g(u_1), g(u_2), \ldots]$ is the function $g$ applied element-wise to the vector $\mathbf{u}$; one often uses the nonlinear function $g(u) = \tanh(u)$.

Fig. 5. Visualization of RESCAL and the ER-MLP model as neural networks. Here, $H_e = H_r = 3$ and $H_a = 3$. Note that the inputs are latent features. The symbol $g$ denotes the application of the function $g(\cdot)$. (a) RESCAL. (b) ER-MLP.
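A small sketch (ours) of the E-MLP score (6)–(8); the dimensions are toy-sized and the weights random, since the point is only the shape of the computation:

```python
import numpy as np

rng = np.random.default_rng(1)
H_e, H_a = 3, 4                        # latent features per entity, hidden units

e_i = rng.normal(size=H_e)             # subject embedding
e_j = rng.normal(size=H_e)             # object embedding
A_k = rng.normal(size=(2 * H_e, H_a))  # per-relation projection, Eq. (7)
w_k = rng.normal(size=H_a)             # per-relation output weights, Eq. (6)

phi = np.concatenate([e_i, e_j])       # composite representation, Eq. (8)
h_a = A_k.T @ phi                      # additive hidden layer, Eq. (7)
score = w_k @ np.tanh(h_a)             # E-MLP score, Eq. (6)
print(score)
```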

Here $\mathbf{h}^{a}$ is an additive hidden layer, which is derived by adding together different weighted components of the entity representations. In particular, we create a composite representation $\boldsymbol{\phi}_{ij}^{\text{E-MLP}} = [\mathbf{e}_i; \mathbf{e}_j] \in \mathbb{R}^{2H_e}$ via the concatenation of $\mathbf{e}_i$ and $\mathbf{e}_j$. However, concatenation alone does not consider any interactions between the latent features of $\mathbf{e}_i$ and $\mathbf{e}_j$. For this reason, we add a (vector-valued) hidden layer $\mathbf{h}^{a}$ of size $H_a$, from which the final prediction is derived via $\mathbf{w}_k^\top g(\mathbf{h}^{a})$. The important difference to tensor-product models like RESCAL is that we learn the interactions of latent features via the matrix $\mathbf{A}_k$ (7), while the tensor product always considers all possible interactions between latent features. This adaptive approach can reduce the number of required parameters significantly, especially on data sets with a large number of relations.

One disadvantage of the E-MLP is that it has to define a vector $\mathbf{w}_k$ and a matrix $\mathbf{A}_k$ for every possible relation, which requires $H_a + (H_a \times 2H_e)$ parameters per relation. An alternative is to embed the relation itself, using an $H_r$-dimensional vector $\mathbf{r}_k$. We can then define

$$f_{ijk}^{\text{ER-MLP}} := \mathbf{w}^\top g\big(\mathbf{h}_{ijk}^{c}\big) \qquad (9)$$
$$\mathbf{h}_{ijk}^{c} := \mathbf{C}^\top \boldsymbol{\phi}_{ijk}^{\text{ER-MLP}} \qquad (10)$$
$$\boldsymbol{\phi}_{ijk}^{\text{ER-MLP}} := [\mathbf{e}_i; \mathbf{e}_j; \mathbf{r}_k]. \qquad (11)$$

We call this model the ER-MLP, since it applies an MLP to an embedding of the entities and relations. Please note that ER-MLP uses a global weight vector for all relations. This model was used in the KV project (see Section IX), since it has many fewer parameters than the E-MLP (see Table 5); the reason is that $\mathbf{C}$ is independent of the relation $k$.

It has been shown in [91] that MLPs can learn to put "semantically similar" words close by in the embedding space, even if they are not explicitly trained to do so. In [28], they show a similar result for the semantic embedding of relations using ER-MLP. For example, Table 4 shows the nearest neighbors of latent representations of selected relations that have been computed with a 60-dimensional model on Freebase. Numbers in parentheses represent squared Euclidean distances. It can be seen that ER-MLP puts semantically related relations near each other. For instance, the closest relations to the children relation are parents, spouse, and birthplace.

Table 4. Semantic Embeddings of KV-MLP on Freebase.

E. Neural Tensor Networks

We can combine traditional MLPs with bilinear models, resulting in what [92] calls a "neural tensor network" (NTN). More precisely, we can define the NTN model as follows:

$$f_{ijk}^{\text{NTN}} := \mathbf{w}_k^\top g\big(\big[\mathbf{h}_{ijk}^{a}; \mathbf{h}_{ijk}^{b}\big]\big) \qquad (12)$$
$$\mathbf{h}_{ijk}^{a} := \mathbf{A}_k^\top [\mathbf{e}_i; \mathbf{e}_j] \qquad (13)$$
$$\mathbf{h}_{ijk}^{b} := \big[\mathbf{e}_i^\top \mathbf{B}_k^{1} \mathbf{e}_j; \ldots; \mathbf{e}_i^\top \mathbf{B}_k^{H_b} \mathbf{e}_j\big]. \qquad (14)$$

Here $\underline{\mathbf{B}}_k$ is a tensor, where the $\ell$th slice $\mathbf{B}_k^{\ell}$ has size $H_e \times H_e$, and there are $H_b$ slices. We call $\mathbf{h}_{ijk}^{b}$ a bilinear hidden layer, since it is derived from a weighted combination of multiplicative terms.

NTN is a generalization of the RESCAL approach, as we explain in Section XII-A. Also, it uses the additive layer from the E-MLP model. However, it has many more parameters than the E-MLP or RESCAL models. Indeed, the results in [95] and [28] both show that it tends to overfit, at least on the (relatively small) data sets used in those papers.

F. Latent Distance Models

Another class of models are latent distance models (also known as latent space models in social network analysis), which derive the probability of relationships from the distance between latent representations of entities: entities are likely to be in a relationship if their latent representations are close according to some distance measure. For unirelational data, Hoff et al. [96] first proposed this approach in the context of social networks by modeling the probability of a relationship $x_{ij}$ via the score function $f(e_i, e_j) = -d(\mathbf{e}_i, \mathbf{e}_j)$, where $d(\cdot, \cdot)$ refers to an arbitrary distance measure such as the Euclidean distance.

The structured embedding (SE) model [93] extends this idea to multirelational data by modeling the score of a triple $x_{ijk}$ as

$$f_{ijk}^{\text{SE}} := -\big\|\mathbf{A}_k^{s}\mathbf{e}_i - \mathbf{A}_k^{o}\mathbf{e}_j\big\|_1 = -\big\|\mathbf{h}_{ijk}^{a}\big\|_1 \qquad (15)$$

where $\mathbf{A}_k = [\mathbf{A}_k^{s}; -\mathbf{A}_k^{o}]$. In (15), the matrices $\mathbf{A}_k^{s}$, $\mathbf{A}_k^{o}$ transform the global latent feature representations of entities to model relationships specifically for the $k$th relation. The transformations are learned using the ranking loss in a way such that pairs of entities in existing relationships are closer to each other than entities in nonexisting relationships.
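A corresponding toy sketch (ours) of ER-MLP scoring, Eqs. (9)–(11); note the single global C and w shared across all relations, with relation-specific information entering only through the embedding r_k:

```python
import numpy as np

rng = np.random.default_rng(2)
H_e, H_r, H_c = 3, 3, 5              # entity dim, relation dim, hidden units

e_i = rng.normal(size=H_e)
e_j = rng.normal(size=H_e)
r_k = rng.normal(size=H_r)           # relation embedding, learned like e_i, e_j

C = rng.normal(size=(2 * H_e + H_r, H_c))  # global projection, Eq. (10)
w = rng.normal(size=H_c)                   # global output weights, Eq. (9)

phi = np.concatenate([e_i, e_j, r_k])      # composite representation, Eq. (11)
score = w @ np.tanh(C.T @ phi)             # ER-MLP score, Eq. (9)
print(score)
```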

To reduce the number of parameters over the SE model, the TransE model [94] translates the latent feature representations via a relation-specific offset instead of transforming them via matrix multiplications. In particular, the score of a triple $x_{ijk}$ is defined as

$$f_{ijk}^{\text{TransE}} := -d(\mathbf{e}_i + \mathbf{r}_k, \mathbf{e}_j). \qquad (16)$$

This model is inspired by the results in [91], who showed that some relationships between words could be computed by their vector difference in the embedding space. As noted in [95], under unit-norm constraints on $\mathbf{e}_i$, $\mathbf{e}_j$ and using the squared Euclidean distance, we can rewrite (16) as follows:

$$f_{ijk}^{\text{TransE}} = -\big(2\mathbf{r}_k^\top(\mathbf{e}_i - \mathbf{e}_j) - 2\mathbf{e}_i^\top\mathbf{e}_j + \|\mathbf{r}_k\|_2^2\big). \qquad (17)$$

Furthermore, if we assume $\mathbf{A}_k = [\mathbf{r}_k; -\mathbf{r}_k]$, so that $\mathbf{h}_{ijk}^{a} = [\mathbf{r}_k; -\mathbf{r}_k]^\top[\mathbf{e}_i; \mathbf{e}_j] = \mathbf{r}_k^\top(\mathbf{e}_i - \mathbf{e}_j)$, and $\mathbf{B}_k = \mathbf{I}$, so that $\mathbf{h}_{ijk}^{b} = \mathbf{e}_i^\top\mathbf{e}_j$, then we can rewrite this model as follows:

$$f_{ijk}^{\text{TransE}} = -\big(2\mathbf{h}_{ijk}^{a} - 2\mathbf{h}_{ijk}^{b} + \|\mathbf{r}_k\|_2^2\big). \qquad (18)$$

G. Comparison of Models

Table 5 summarizes the different models we have discussed. A natural question is: which model is best? [28] showed that the ER-MLP model outperformed the NTN model on their particular data set. Reference [95] performed a more extensive experimental comparison of these models and found that RESCAL (called the bilinear model there) worked best on two link prediction tasks. However, clearly the best model will be data set dependent.

Table 5. Summary of the Latent Feature Models. $\mathbf{h}^{a}$, $\mathbf{h}^{b}$, and $\mathbf{h}^{c}$ are hidden layers of the neural network; see text for details.

V. GRAPH FEATURE MODELS

In this section, we assume that the existence of an edge can be predicted by extracting features from the observed edges in the graph. For example, due to social conventions, parents of a person are often married, so we could predict the triple (John, marriedTo, Mary) from the existence of the path John $\xrightarrow{\text{parentOf}}$ Anne $\xleftarrow{\text{parentOf}}$ Mary, representing a common child. In contrast to latent feature models, this kind of reasoning explains triples directly from the observed triples in the knowledge graph. We will now discuss some models of this kind.

A. Similarity Measures for Unirelational Data

Observable graph feature models are widely used for link prediction in graphs that consist only of a single relation, e.g., in social network analysis (friendships between people), biology (interactions of proteins), and Web mining (hyperlinks between Web sites). The intuition behind these methods is that similar entities are likely to be related (homophily) and that the similarity of entities can be derived from the neighborhood of nodes or from the existence of paths between nodes. For this purpose, various indices have been proposed to measure the similarity of entities, which can be classified into local, global, and quasi-local approaches [97].

Local similarity indices, such as Common Neighbors, the Adamic–Adar index [98], or Preferential Attachment [99], derive the similarity of entities from their number of common neighbors or their absolute number of neighbors. Local similarity indices are fast to compute for single relationships and scale well to large knowledge graphs, as their computation depends only on the direct neighborhood of the involved entities. However, they can be too localized to capture important patterns in relational data and cannot model long-range or global dependencies.

Global similarity indices, such as the Katz index [100] and the Leicht–Holme–Newman index [101], derive the similarity of entities from the ensemble of all paths between entities, while indices like Hitting Time, Commute Time, and PageRank [102] derive the similarity of entities from random walks on the graph. Global similarity indices often provide significantly better predictions than local indices but are also computationally more expensive [56], [97].

Quasi-local similarity indices, like the Local Katz index [56] or Local Random Walks [103], try to balance predictive accuracy and computational complexity by deriving the similarity of entities from paths and random walks of bounded length.

In Section V-C, we will discuss an approach that extends this idea of quasi-local similarity indices for unirelational networks to learn from large multirelational knowledge graphs.
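For illustration, a minimal sketch (ours) of two of the local indices named above, computed on an undirected unirelational graph stored as adjacency sets:

```python
import math

# Undirected, unirelational graph as adjacency sets.
neighbors = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"a", "c"},
}

def common_neighbors(x, y):
    return len(neighbors[x] & neighbors[y])

def adamic_adar(x, y):
    # Shared neighbors count more if they are themselves low-degree.
    return sum(1.0 / math.log(len(neighbors[z]))
               for z in neighbors[x] & neighbors[y])

print(common_neighbors("b", "d"))        # 2 (shared neighbors a and c)
print(round(adamic_adar("b", "d"), 3))   # 1.821
```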

B. Rule Mining and Inductive Logic Programming

Another class of models that works on the observed variables of a knowledge graph extracts rules via mining methods and uses these extracted rules to infer new links. The extracted rules can also be used as a basis for Markov logic, as discussed in Section VIII. For instance, ALEPH is an inductive logic programming (ILP) system that attempts to learn rules from relational data via inverse entailment [104]. (For more information on ILP, see, e.g., [3], [105], and [106].) AMIE is a rule mining system that extracts logical rules (in particular, Horn clauses) based on their support in a knowledge graph [107], [108]. In contrast to ALEPH, AMIE can handle the open-world assumption of knowledge graphs and has been shown to be up to three orders of magnitude faster on large knowledge graphs [108]. The basis for the Semantic Web is description logic, and [109]–[111] describe logic-oriented machine learning approaches in this context. Also worth mentioning are data mining approaches for knowledge graphs, as described in [112]–[114]. An advantage of rule-based systems is that they are easily interpretable, as the model is given as a set of logical rules. However, rules over observed variables usually cover only a subset of the patterns in knowledge graphs (or relational data), and useful rules can be challenging to learn.

C. Path Ranking Algorithm (PRA)

The path ranking algorithm (PRA) [115], [116] extends the idea of using random walks of bounded lengths for predicting links in multirelational knowledge graphs. In particular, let $\pi_L(i, j, k, t)$ denote a path of length $L$ of the form $e_i \xrightarrow{r_1} e_2 \xrightarrow{r_2} e_3 \cdots \xrightarrow{r_L} e_j$, where $t$ represents the sequence of edge types $t = (r_1, r_2, \ldots, r_L)$. We also require there to be a direct arc $e_i \xrightarrow{r_k} e_j$, representing the existence of a relationship of type $k$ from $e_i$ to $e_j$. Let $\Pi_L(i, j, k)$ represent the set of all such paths of length $L$, ranging over path types $t$. (We can discover such paths by enumerating all (type-consistent) paths from entities of type $e_i$ to entities of type $e_j$. If there are too many relations to make this feasible, we can perform random sampling.)

We can compute the probability of following such a path by assuming that at each step, we follow an outgoing link uniformly at random. Let $P(\pi_L(i, j, k, t))$ be the probability of this particular path; this can be computed recursively by a sampling procedure, similar to PageRank (see [116] for details). The key idea in PRA is to use these path probabilities as features for predicting the probability of missing edges. More precisely, define the feature vector

$$\boldsymbol{\phi}_{ijk}^{\text{PRA}} = \big[P(\pi) : \pi \in \Pi_L(i, j, k)\big]. \qquad (19)$$

We can then predict the edge probabilities using logistic regression

$$f_{ijk}^{\text{PRA}} := \mathbf{w}_k^\top \boldsymbol{\phi}_{ijk}^{\text{PRA}}. \qquad (20)$$

Interpretability: A useful property of PRA is that its model is easily interpretable. In particular, relation paths can be regarded as bodies of weighted rules (more precisely, Horn clauses), where the weight specifies how predictive the body of the rule is for the head. For instance, Table 6 shows some relation paths along with their weights that have been learned by PRA in the KV project (see Section IX) to predict which college a person attended, i.e., to predict triples of the form (p, college, c). The first relation path in Table 6 can be interpreted as follows: it is likely that a person attended a college if the sports team that drafted the person is from the same college. This can be written in the form of a Horn clause as follows:

$$(p, \text{college}, c) \leftarrow (p, \text{draftedBy}, t) \wedge (t, \text{school}, c).$$

By using a sparsity-promoting prior on $\mathbf{w}_k$, we can perform feature selection, which is equivalent to rule learning.

Table 6. Examples of Paths Learned by PRA on Freebase to Predict Which College a Person Attended.

Relational Learning Results: PRA has been shown to outperform the ILP method FOIL [106] for link prediction in NELL [116]. It has also been shown to have comparable performance to ER-MLP on link prediction in KV: PRA obtained a result of 0.884 for the area under the ROC curve, as compared to 0.882 for ER-MLP [28].
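A simplified sketch (ours) of the PRA pipeline: enumerate bounded-length relation paths between a candidate pair, use their uniform random-walk probabilities as features (19), and score them with per-relation weights (20). Real implementations estimate the path probabilities over the full graph instead of this toy recursion, and the marriedTo example and its weight are made up:

```python
import numpy as np

# Toy multirelational graph: (subject, relation) -> list of objects.
edges = {
    ("john", "parentOf"): ["anne"],
    ("mary", "parentOf"): ["anne"],
    ("anne", "childOf"):  ["john", "mary"],
}

def path_prob(start, path, goal):
    """Probability of reaching `goal` from `start` by following the relation
    sequence `path`, choosing among outgoing edges uniformly at random."""
    if not path:
        return 1.0 if start == goal else 0.0
    succ = edges.get((start, path[0]), [])
    if not succ:
        return 0.0
    return sum(path_prob(n, path[1:], goal) for n in succ) / len(succ)

# Features for the candidate triple (john, marriedTo, mary): one random-walk
# probability per relation path type, cf. Eq. (19).
path_types = [("parentOf", "childOf")]
phi = np.array([path_prob("john", t, "mary") for t in path_types])

w_marriedTo = np.array([1.7])      # learned weight per path type, Eq. (20)
print(phi, w_marriedTo @ phi)      # feature value 0.5 and the PRA score
```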

superior for learning from knowledge graphs. Instead, the B. Other Combined Models
strengths of latent and graph-based models are often In addition to ARE, further models have been explored
complementary (see, e.g., [117]), as both families focus on to learn jointly from latent and observable patterns on
different aspects of relational data. relational data. Jiang et al. [84] and Riedel et al. [85]
• Latent feature models are well-suited for modeling combined a latent feature model with an additive term to
global relational patterns via newly introduced learn from latent and neighborhood-based information on
latent variables. They are computationally efficient multirelational data, as follows11:
if triples can be explained with a small number of
latent variables.
ð1Þ> ð2Þ> ð3Þ>
• Graph feature models are well-suited for modeling ADD
fijk :¼ wk;j FSUB
i þ wk;i FOBJ
j þ wk FN
ijk (22)
local and quasi-local graphs patterns. They are
FN
ijk
0
:¼ ½yijk0 : k 6¼ k: (23)
computationally efficient if triples can be ex-
plained from the neighborhood of entities or from
short paths in the graph.
Here, FSUBi is the latent representation of entity ei as a
There has also been some theoretical work comparing
subject and FOBJ j is the latent representation of entity ej as
these two approaches [118]. In particular, it has been
an object. The term FN ijk captures patterns efficiently where
shown that tensor factorization can be inefficient when
the existence of a triple yijk0 is predictive of another triple
relational data consists of a large number of strongly
yijk between the same pair of entities (but of a different
connected components. Fortunately, such ‘‘problematic’’
relation type). For instance, if Leonard Nimoy was born in
relations can often be handled efficiently via graph-based
Boston, it is also likely that he lived in Boston. This
models. A good example is the marriedTo relation: One
dependency between the relation types bornIn and livedIn
marriage corresponds to a single strongly connected
can be modeled in (23) by assigning a large weight to
component, so data with a large number of marriages
wbornIn;livedIn .
would be difficult to model with RLFMs. However,
ARE and the models of [84] and [85] are similar in
predicting marriedTo links via graph-based models is
spirit to the model of [119], which augments SVD (i.e.,
easy: the existence of the triple (John, marriedTo, Mary)
matrix factorization) of a rating matrix with additive terms
can be simply predicted from the existence of (Mary,
to include local neighborhood information. Similarly,
marriedTo, John), by exploiting the symmetry of the
factorization machines [120] allow to combine latent and
relation. If the (Mary, marriedTo, John) edge is unknown,
observable patterns, by modeling higher order interactions
we can use statistical patterns, such as the existence of
between input variables via low-rank factorizations [78].
shared children.
An alternative way to combine different prediction
Combining the strengths of latent and graph-based
systems is to fit them separately, and use their outputs as
models is therefore a promising approach to increase the
inputs to another ‘‘fusion’’ system. This is called stacking
predictive performance of graph models. It typically also
[121]. For instance, Dong et al. [28] used the output of PRA
speeds up the training. We now discuss some ways of
and ER-MLP as scalar features, and learned a final ‘‘fusion’’
combining these two kinds of models.
layer by training a binary classifier. Stacking has the
advantage that it is very flexible in the kinds of models that
A. Additive Relational Effects Model can be combined. However, it has the disadvantage that
Reference [118] proposed the additive relational effects the individual models cannot cooperate, and thus any
(ARE), which is a way to combine RLFMs with observable individual model needs to be more complex than in a
graph models. In particular, if we combine RESCAL with combined model which is trained jointly. For example, if
PRA, we get we fit RESCAL separately from PRA, we will need a larger
number of latent features than if we fit them jointly.

ð1Þ> ð2Þ>
RESCALþPRA
fijk ¼ wk FRESCAL
ij þ wk FPRA
ijk : (21)
VI I. T RA I NI N G S RL MO D E L S ON
KNOWLEDGE GRAPHS
ARE models can be trained by alternately optimizing the In this section, we discuss aspects of training the
RESCAL parameters with the PRA parameters. The key previously discussed models that are specific to knowledge
benefit is now RESCAL only has to model the ‘‘residual graphs, such as how to handle the open-world assumption
errors’’ that cannot be modelled by the observable graph of knowledge graphs, how to exploit sparsity, and how to
patterns. This allows the method to use much lower latent perform model selection.
dimensionality, which significantly speeds up training 11 UNI ADD
Reference [85] considered an additional term fijk :¼ fijk þ
time. The resulting combined model also has increased w> F SUBþOBJ
, where F SUBþOBJ
is a (noncomposite) latent feature
k ij ij
accuracy [118]. representation of subject–object pairs.

Vol. 104, No. 1, January 2016 | Proceedings of the IEEE 23


Nickel et al.: A Review of Relational Machine Learning for Knowledge Graphs

A. Penalized Maximum-Likelihood Training One way around this is as to make a closed world
Let us assume we have a set of Nd observed triples and assumption and assume that all (type consistent) triples
let the nth triple be denoted by xn . Each observed triple that are not in Dþ are false. We will denote this negative
is either true (denoted yn ¼ 1) or false (denoted yn ¼ 0). set as D ¼ fxn 2 Djyn ¼ 0g. However, for incomplete
Let this labeled data set be denoted by D ¼ fðxn ; yn Þj knowledge graphs this assumption will be violated.
n ¼ 1; . . . ; Nd g. Given this, a natural way to estimate the Moreover, D might be very large, since the number of
parameters 0 is to compute the maximum a posteriori false facts is much larger than the number of true facts.
(MAP) estimate This can lead to scalability issues in training methods that
have to consider all negative examples.
An alternative approach to generate negative exam-
X
Nd
ples is to exploit known constraints on the structure of a
max log Berðyn jð f ðxn ; 0ÞÞÞ þ log pð0jÞ (24) knowledge graph: Type constraints for predicates (per-
0
n¼1
sons are only married to persons), valid value ranges for
attributes (the height of humans is below 3 m), or
where  controls the strength of the prior. (If the prior is functional constraints such as mutual exclusion (a person
uniform, this is equivalent to maximum-likelihood train- is born exactly in one city) can all be used for this
ing.) We can equivalently state this as a regularized loss purpose. Since such examples are based on the violation
minimization problem: of hard constraints, it is certain that they are indeed
negative examples. Unfortunately, functional constraints
are scarce and negative examples based on type
X
N constraints and valid value ranges are usually not
min Lðð f ðxn ; 0ÞÞ; yn Þ þ regð0Þ (25) sufficient to train useful models: While it is relatively
0
n¼1 easy to predict that a person is married to another
person, it is difficult to predict to which person in
particular. For the latter, examples based on type
where Lðp; yÞ ¼  log BerðyjpÞ is the log loss function. constraints alone are not very informative. A better way
Another possible loss function is the squared loss, to generate negative examples is to ‘‘perturb’’ true triples.
Lðp; yÞ ¼ ðp  yÞ2 . Using the squared loss can be espe- In particular, let us define
cially efficient in combination with a closed-world
assumption (CWA). For instance, using the squared loss
and the CWA, the minimization problem for RESCAL 
D ¼ ðe‘ ; rk ; ej Þ j ei 6¼ e‘ ^ ðei ; rk ; ej Þ 2 Dþ
becomes 
[ ðei ; rk ; e‘ Þ j ej 6¼ e‘ ^ ðei ; rk ; ej Þ 2 Dþ :

X 2
min kYk  EWk E> kF þ 1 kEk2F To understand the difference between this approach and
E;fWk g
k
X the CWA (where we assumed all valid unknown triples
þ 2 kWk k2F (26) were false), let us consider the example in Fig. 1. The CWA
k would generate ‘‘good’’ negative triples such as (Leonard-
Nimoy, starredIn, StarWars), (AlecGuinness, starredIn, Star-
Trek), etc., but also type-consistent but ‘‘irrelevant’’
where 1 ; 2 [0 control the degree of regularization. The
negative triples such as (BarackObama, starredIn, Star-
main advantage of (26) is that it can be optimized via
Trek), etc. (We are assuming (for the sake of this example)
RESCAL-ALS, which consists of a sequence of very
there is a type Person but not a type Actor.) The second
efficient, closed-form updates whose computational com-
approach (based on perturbation) would not generate
plexity depends only on the nonzero entries in Y [63],
negative triples such as (BarackObama, starredIn, Star-
[64]. We discuss some other loss functions below.
Trek), since BarackObama does not participate in any
starredIn events. This reduces the size of D , and
B. Where do the Negative Examples Come From? encourages it to focus on ‘‘plausible’’ negatives. (An even
One important question is where the labels yn come better method, used in Section IX, is to generate the
from. The problem is that most knowledge graphs only candidate triples from text extraction methods run on the
contain positive training examples, since, usually, they do Web. Many of these triples will be false, due to extraction
not encode false facts. Hence, yn ¼ 1 for all ðxn ; yn Þ 2 D. errors, but they define a good set of ‘‘plausible’’ negatives.)
To emphasize this, we shall use the notation Dþ to Another option to generate negative examples for
represent the observed positive (true) triples: Dþ ¼ training is to make a local-closed world assumption
fxn 2 Djyn ¼ 1g. Training on all-positive data is tricky, (LCWA) [28], [107], in which we assume that a KG is
because the model might easily over generalize. only locally complete. More precisely, if we have observed

24 Proceedings of the IEEE | Vol. 104, No. 1, January 2016


Nickel et al.: A Review of Relational Machine Learning for Knowledge Graphs

any triple for a particular subject–predicate pair ei ; rk , then (AUC-ROC) or the area under the precision-recall curve
we will assume that any nonexisting triple ðei ; rk ; Þ is (AUC-PR) are good evaluation criteria. For data with a
indeed false and include them in D . (The assumption is large number of negative examples (as it is typically the
valid for functional relations, such as bornIn, but not for case for knowledge graphs), it has been shown that
set-valued relations, such as starredIn.) However, if we AUC-PR can give a clearer picture of an algorithm’s
have not observed any triple at all for the pair ei ; rk , we will performance than AUC-ROC [124]. For entity resolution,
assume that all triples ðei ; rk ; Þ are unknown and not the mean reciprocal rank (MRR) of the correct entity is an
include them in D . alternative evaluation measure.

C. Pairwise Loss Training


Given that the negative training examples are not VI II . MARKOV RANDOM FIELDS
always really negative, an alternative approach to likeli- In this section, we drop the assumption that the random
hood training is to try to make the probability (or in variables yijk in Y are conditionally independent. Howev-
general, some scoring function) to be larger for true triples er, in the case of relational data and without the
than for assumed-to-be-false triples. That is, we can define conditional independence assumption, each yijk can
the following objective function: depend on any of the other Ne  Ne  Nr  1 random
variables in Y. Due to this enormous number of possible
X X dependencies, it becomes quickly intractable to estimate
min Lð f ðxþ ; 0Þ; f ðx ; 0ÞÞ þ  regð0Þ (27)
0 the joint distribution PðYÞ without further constraints,
xþ 2Dþ x 2D
even for very small knowledge graphs. To reduce the
number of potential dependencies and arrive at tractable
where Lðf ; f 0 Þ is a margin-based ranking loss function models, in this section we develop template-based
such as graphical models that only consider a small fraction of
all possible dependencies. (See [125] for an introduction to
graphical models.)
Lðf ; f 0 Þ ¼ maxð1 þ f 0  f ; 0Þ: (28)
A. Representation
Graphical models use graphs to encode dependencies
This approach has several advantages. First, it does not
between random variables. Each random variable (in our
assume that negative examples are necessarily negative,
case, a possible fact yijk ) is represented as a node in the
just that they are ‘‘more negative’’ than the positive ones.
graph, while each dependency between random variables
Second, it allows the f ðÞ function to be any function, not
is represented as an edge. To distinguish such graphs from
just a probability (but we do assume that larger f values
knowledge graphs, we will refer to them as dependency
mean the triple is more likely to be correct).
graphs. It is important to be aware of their key difference:
This kind of objective function is easily optimized by
while knowledge graphs encode the existence of facts,
stochastic gradient descent (SGD) [122]: at each iteration,
dependency graphs encode statistical dependencies be-
we just sample one positive and one negative example.
tween random variables.
SGD also scales well to large data sets. However, it can
To avoid problems with cyclical dependencies, it is
take a long time to converge. On the other hand, as
common to use undirected graphical models, also called
discussed previously, some models, when combined with
Markov random fields (MRFs).12 A MRF has the following
the squared loss objective, can be optimized using
form:
alternating least squares (ALS), which is typically much
faster.
1Y
PðYjQÞ ¼ ðyc jQÞ (29)
D. Model Selection Z c
Almost all models discussed in previous sections
include one or more user-given parameters that are
influential for the model’s performance (e.g., dimension- where ðyc jQÞ[0 is a potential function on the cth subset
ality of latent feature models, length of relation paths for of variables, in particular
P Q the cth clique in the dependency
PRA, regularization parameter for penalized maximum graph, and Z ¼ y c ðyc jQÞ is the partition function,
likelihood training). Typically, cross-validation over ran- which ensures that the distribution sums to one. The
dom splits of D into training, validation, and test sets is potential functions capture local correlations between
used to find good values for such parameters without
overfitting (for more information on model selection in 12
Technically, since we are conditioning on some observed features x,
machine learning, see, e.g., [123]). For link prediction this is a conditional random field (CRF), but we will ignore this
and entity resolution, the area under the ROC curve distinction.

Vol. 104, No. 1, January 2016 | Proceedings of the IEEE 25


Nickel et al.: A Review of Relational Machine Learning for Knowledge Graphs

variables in each clique c in the dependency graph. (Note represent these dependencies using formulae such as
that in undirected graphical models, the local potentials
do not have any probabilistic interpretation, unlike in
directed graphical models.) This equation again defines a F1: ðx; parentOf ; zÞ^ðy; parentOf ; zÞ )ðx; marriedTo; yÞ
probability distribution over ‘‘possible worlds,’’ i.e., over F2: ðx; marriedTo; yÞ ) :ðy; parentOf ; xÞ:
joint distribution assigned to the random variables Y.
The structure of the dependency graph [which defines
the cliques in (29)] is derived from a template mechanism
Rather than encoding the rule that adults cannot marry
that can be defined in a number of ways. A common
their own children using a formula, we will encode this as
approach is to use Markov logic [126], which is a template
a hard constraint into the type system. Similarly, we only
language based on logical formulae:
allow adults to be parents of children. Thus, there are six
Given a set of formulae F ¼ fFi gLi¼1 , we create an edge
possible facts in the knowledge graph. To create a
between nodes in the dependency graph if the
dependency graph for this KG and for this set of logical
corresponding facts occur in at least one grounded
formulae F , we assign a binary random variable to each
formula. A grounding of a formula Fi is given by the
possible fact, represented by a diamond in Fig. 6(b), and
(type consistent) assignment of entities to the variables in
create edges between these nodes if the corresponding
Fi . Furthermore, we define ðyc jÞ such that
facts occur in grounded formulae F1 or F2 . For instance,
grounding F1 with x ¼ a1 , y ¼ a3 , and z ¼ c, creates the
1Y edges m13 ! p1 c, m13 ! p3 c, and p1 c ! p3 c. The full
PðYjQÞ ¼ expðc xc Þ (30) dependency graph is shown in Fig. 6(c).
Z c
The process of generating the MRF graph by applying
templated rules to a set of entities is known as grounding
or instantiation. We note that the topology of the resulting
where xc denotes the number of true groundings of Fc in
graph is quite different from the original KG. In particular,
Y, and c denotes the weight for formula Fc . If c > 0, we
we have one node per possible KG edge, and these nodes
prefer worlds where formula Fc is satisfied; if c G 0, we
are densely connected. This can cause computational
prefer worlds where formula Fc is violated. If c ¼ 0, then
difficulties, as we discuss below.
formula Fc is ignored.
To explain this further, consider a KG involving two
types of entities, adults and children, and two types of B. Inference
relations, parentOf and marriedTo. Fig. 6(a) depicts a The inference problem consists of estimating the most
sample KG with three adults and one child. Obviously, probable configuration, y ¼ arg maxy pðyjQÞ, or the
these relations (edges) are correlated, since people who posterior marginals pðyi jQÞ. In general, both of these
share a common child are often married, while people problems are computationally intractable [125], so heuris-
rarely marry their own children. In Markov logic, we tic approximations must be used.

Fig. 6. (a) A small KG. There are four entities (circles): three adults (a1 , a2 , and a3 ) and one child c. There are two types of edges: adults may or may
not be married to each other, as indicated by the red dashed edges, and the adults may or may not be parents of the child, as indicated by the blue
dotted edges. (b) We add binary random variables (represented by diamonds) to each KG edge. (c) We drop the entity nodes, and add edges
between the random variables that belong to the same clique potential, resulting in a standard MRF.

26 Proceedings of the IEEE | Vol. 104, No. 1, January 2016


Nickel et al.: A Review of Relational Machine Learning for Knowledge Graphs

One approach for computing posterior marginals is to


use Gibbs sampling (see, or example, [31] and [127]) or
MC-SAT [128]. One approach for computing the MAP
estimate is to use the MPLP (max product linear
programming) method [129]. See [125] for more details.
If one restricts the class of potential functions to be just
disjunctions (using or and not, but no and), then one
obtains a (special case of) hinge loss MRF (HL-MRFs)
[130], for which efficient convex algorithms can be applied,
based on a continuous relaxation of the binary random
variables. Probabilistic soft logic (PSL) [131] provides a
convenient form of ‘‘syntactic sugar’’ for defining HL-
MRFs, just as MLNs provide a form of syntactic sugar for
regular (boolean) MRFs. HL-MRFs have been shown to
scale to fairly large knowledge bases [132].

C. Learning
Fig. 7. Architecture of the knowledge vault.
The ‘‘learning’’ problem for MRFs deals with specifying
the form of the potential functions (sometimes called
‘‘structure learning’’) as well as the values for the
numerical parameters Q. In the case of MRFs for KGs, edges. Finally, the confidence in the automatically
the potential functions are often specified in the form of extracted facts is evaluated using both the extraction
logical rules, as illustrated above. In this case, structure scores and the prior SRL model.
learning is equivalent to rule learning, which has been The knowledge vault uses a combination of latent and
studied in a number of published works (see Section V-C observable models to predict links in a knowledge graph.
and [95] and [107]). In particular, it employs the ER-MLP model (see
The parameter estimation problem (which is usually Section IV-D) as a latent feature model and PRA
cast as maximum likelihood or MAP estimation), although (Section V-C) as a graph feature model. In order to combine
convex, is in general quite expensive, since it needs to call the two models, KV uses stacking (see Section VI-B). To
inference as a subroutine. Therefore, various faster evaluate the link prediction performance, these models
approximations, such as pseudo likelihood, have been were applied to a subset of Freebase. The ER-MLP system
developed (cf., relational dependency networks [133]). achieved an area under the ROC curve (AUC-ROC) of
0.882, and the PRA approach achieved an almost identical
D. Discussion AUC-ROC of 0.884. The combination of both methods
Although approaches based on MRFs are very flexible, it further increased the AUC-ROC to 0.911. To predict the
is in general harder to make scalable inference and devise final score of a triple, the scores from the combined link-
learning algorithms for this model class, compared to prediction model are further combined with various
methods based on observable or even latent feature models. features derived from the extracted triples. These include,
In this paper, we have chosen to focus primarily on latent for instance, the confidence of the extractors and the
and graph feature models because we have more experience number of (deduplicated) Web pages from which the
with such methods in the context of KGs. However, all triples were extracted. Fig. 7 provides a high level overview
three kinds of approaches to KG modeling are useful. of the knowledge vault architecture.
Let us give a qualitative example of the benefits of
combining the prior with the extractors (i.e., the fusion
IX. KNOWLEDGE VAULT: RELATIONAL layer in Fig. 7). Consider an extracted triple corresponding
LEARNING FOR KNOWLEDGE BASE to the following relation13:
CONSTRUCTION
The knowledge vault (KV) [28] is a very large-scale (Barry Richter, attended, University of Wisconsin-
automatically constructed knowledge base, which follows Madison).
the Freebase schema (KV uses the 4469 most common
predicates). It is constructed in three steps. In the first The extraction confidence for this triple (obtained by
step, facts are extracted from a host of Web sources such as fusing multiple extraction techniques) is just 0.14, since it
natural language text, tabular data, page structure, and
human annotations (the extractors are described in detail 13
For clarity of presentation we show a simplified triple. Please see
in [28]). Second, an SRL model is trained on Freebase to [28] for the actually extracted triples including compound value types
serve as a ‘‘prior’’ for computing the probability of (new) (CVT).

Vol. 104, No. 1, January 2016 | Proceedings of the IEEE 27


Nickel et al.: A Review of Relational Machine Learning for Knowledge Graphs

was based on the following two rather indirect state- relations. In Section II, we expressed the ternary
ments14: relationship playedCharacterIn(LeonardNimoy, Spock, Star-
Trek-1) via two binary relationships (LeonardNimoy, played,
In the fall of 1989, Richter accepted a scholarship Spock) and (Spock, characterIn, StarTrek-1). However,
to the University of Wisconsin, where he played there are multiple actors who played Spock in different
for four years and earned numerous individual Star Trek movies, so we have lost the correspondence
accolades. . . between Leonard Nimoy and StarTrek-1. To model this
using binary relations without loss of information, we can
and15 use auxiliary nodes to identify the respective relationship.
For instance, to model the relationship playedCharacter-
The Polar Caps’ cause has been helped by the In(LeonardNimoy, Spock, StarTrek-1), we can write
impact of knowledgable coaches such as Andringa,
Byce and former UW teammates Chris Tancill and
Barry Richter.

However, we know from Freebase that Barry Richter was


born and raised in Madison, WI, USA. According to the where we used the auxiliary entity MovieRole-1 to uniquely
prior model, people who were born and raised in a identify this particular relationship. In most applications
particular city often tend to study in the same city. This auxiliary entities get an identifier; if not they are referred
increases our prior belief that Richter went to school there, to as blank nodes. In Freebase auxiliary nodes are called
resulting in a final fused belief of 0.61. compound value types (CVT).
Combining the prior model (learned using SRL Since higher arity relations involving time and location
methods) with the information extraction model improved are relatively common, the YAGO2 project extended the
performance significantly, increasing the number of high SPO triple format to the (subject, predicate, object, time,
confidence triples16 from 100 millions (based on extractors location) (SPOTL) format to model temporal and spatial
alone) to 271 millions (based on extractors plus prior). The information about relationships explicitly, without trans-
knowledge vault is one of the largest applications of SRL to forming them to binary relations [27]. Furthermore, there
knowledge base construction to date. See [28] for further has also been work on extracting higher-arity relations
details. directly from natural language [135].
A related issue is that the truth-value of a fact can
change over time. For example, Google’s current CEO is
X. EXTENSIONS AND FUTURE W ORK Larry Page, but from 2001 to 2011 it was Eric Schmidt.
Both facts are correct, but only during the specified time
A. Nonbinary Relations interval. For this reason, Freebase allows some facts to be
So far, we completely focussed on binary relations; annotated with beginning and end dates, using CVT
here we discuss how relations of other cardinalities can be constructs, which represent n-ary relations via auxiliary
handled. nodes. In the future, it is planned to extend the KV system
to model such temporal facts. However, this is nontrivial,
Unary Relations: Unary relations refer to statements on since it is not always easy to infer the duration of a fact
properties of entities, e.g., the height of a person. Such data from text, since it is not necessarily related to the
can naturally be represented by a matrix, in which rows timestamp of the corresponding source (cf., [136]).
represent entities, and columns represent attributes. [64] As an alternative to the usage of auxiliary nodes, a set
proposed a joint tensor-matrix factorization approach to of nth-arity relations can be represented by a single
learn simultaneously from binary and unary relations via a ðn þ 1Þth-order tensor. RESCAL can easily be generalized
shared latent representation of entities. In this case, we to higher arity relations and can be solved by higher
may also need to modify the likelihood function, so it is order tensor factorization or by neural network models
Bernoulli for binary edge variables, and Gaussian (say) for with the corresponding number of entity representations as
numeric features and Poisson for count data (see [134]). inputs [134].

Higher Arity Relations: In knowledge graphs, higher arity B. Hard Constraints: Types, Functional Constraints,
relations are typically expressed via multiple binary and Others
14
Source: http://www.legendsofhockey.net/LegendsOfHockey/jsp/ Imposing hard constraints on the allowed triples in
SearchPlayer.jsp?player=11377. knowledge graphs can be useful. Powerful ontology
15
Source: http://host.madison.com/sports/high-school/hockey/ languages such as the web ontology language (OWL)
numbers-dwindling-for-once-mighty-madison-high-school-hockey-
programs/article_95843e00-ec34-11df-9da9-001cc4c002e0.html. [137] have been developed, in which complex constraints
16
Triples with the calibrated probability of correctness above 90%. can be formulated. However, reasoning with ontologies is

28 Proceedings of the IEEE | Vol. 104, No. 1, January 2016


Nickel et al.: A Review of Relational Machine Learning for Knowledge Graphs

computationally demanding, and hard constraints are the current model. Similarly, it has been shown that the
often violated in real-world data [138], [139]. Fortunately, relation-specific weights Wk in the RESCAL model can be
machine learning methods can be robust in the face of calculated efficiently for new relation types given already
contradictory evidence. derived latent representations of entities [140].

Deterministic Dependencies: Triples in relations such as D. Querying Probabilistic Knowledge Graphs


subClassOf and isLocatedIn follow clear deterministic RESCAL and KV can be viewed as probabilistic
dependencies such as transitivity. For example, if Leonard databases (see, e.g., [141] and [142]). In the knowledge
Nimoy was born in Boston, we can conclude that he was vault, only the probabilities of triples are queried. Some
born in Massachusetts, that he was born in the United applications might require more complex queries such as:
States, that he was born in North America, etc. One way to Who is born in Rome and likes someone who is a child of
consider such ontological constraints is to precompute all Albert Einstein. It is known that queries involving joins
true triples that can be derived from the constraints and to (existentially quantified variables) are expensive to calcu-
add them to the knowledge graph prior to learning. The late in probabilistic databases [141]. In [140], it was shown
precomputation of triples according to ontological con- how some queries involving joins can be efficiently
straints is also called materialization. However, on large handled within the RESCAL framework.
knowledge graphs, full materialization can be computa-
tionally demanding. E. Trustworthiness of Knowledge Graphs
Automatically constructed knowledge bases are only as
Type Constraints: Often relations only make sense when good as the sources from which the facts are extracted.
applied to entities of the right type. For example, the Prior studies in the field of data fusion have developed
domain and the range of marriedTo is limited to entities numerous approaches for modelling the correctness of
which are persons. Modelling type constraints explicitly information supplied by multiple sources in the presence
requires complex manual work. An alternative is to learn of possible data conflicts (see [143] and [144] for
approximate type constraints by simply considering the recent surveys). However, the key assumption in data
observed types of subjects and objects in a relation. The fusionVnamely, that the facts provided by the sources are
standard RESCAL model has been extended by [69] and indeed stated by themVis often violated when the
[74] to handle type constraints of relations efficiently. As a information is extracted automatically. If a given source
result, the rank required for a good RESCAL model can be contains a mistake, it could be because the source actually
greatly reduced. Furthermore, [85] considered learning contains a false fact, or because the fact has been extracted
latent representations for the argument slots in a relation incorrectly. A recent study [145] has formulated the
to learn the correct types from data. problem of knowledge fusion, where the above assumption
is no longer made, and the correctness of information
Functional Constraints and Mutual Exclusiveness: Al- extractors is modeled explicitly. A follow-up study by the
though the methods discussed in Sections IV and V can authors [146] developed several approaches for solving
model long-range and global dependencies between the knowledge fusion problem, and applied them to
triples, they do not explicitly enforce functional con- estimate the trustworthiness of facts in the knowledge
straints that induce mutual exclusivity between possible vault (cf., Section IX).
values. For instance, a person is born in exactly one city,
etc. If one of the these values is observed, then observable
graph models can prevent other values from being XI . CONCLUDING RE MARKS
asserted, but if all the values are unknown, the resulting Knowledge graphs (KGs) have found important applica-
mutual exclusion constraint can be hard to deal with tions in question answering, structured search, explor-
computationally. atory search, and digital assistants. We provided a review
of state-of-the-art statistical relational learning (SRL)
C. Generalizing to New Entities and Relations methods applied to very large knowledge graphs. We also
In addition to missing facts, there are many entities demonstrated how statistical relational learning can be
that are mentioned on the Web but are currently missing used in conjunction with machine reading and informa-
in knowledge graphs like Freebase and YAGO. If new tion extraction methods to automatically build such
entities or predicates are added to a KG, one might want to knowledge repositories. As a result, we showed how to
avoid retraining the model due to runtime considerations. create a truly massive, machine-interpretable ‘‘semantic
Given the current model and a set of newly observed memory’’ of facts, which is already empowering numer-
relationships, latent representations of new entities can be ous practical applications. However, although these KGs
calculated approximately in both tensor factorization are impressive in their size, they still fall short of
models and in neural networks, by finding representations representing many kinds of knowledge that humans
that explain the newly observed relationships relative to possess. Notably missing are representations of ‘‘common

Vol. 104, No. 1, January 2016 | Proceedings of the IEEE 29


Nickel et al.: A Review of Relational Machine Learning for Knowledge Graphs

sense’’ facts (such as the fact that water is wet, and wet corresponding entries from the u and v matrices. For
things can be slippery), as well as ‘‘procedural’’ or how-to example
knowledge (such as how to drive a car or how to send an
email). Representing, learning, and reasoning with these   " >   
kinds of knowledge remains the next frontier for AI and u1 u1 1 0 v1
ð v1 v2 Þ ¼ ;
machine learning. h u2 u2 0 0 v2
 >   #
u1 0 0 v1
APPENDIX ...; : (32)
u2 0 1 v2
A. RESCAL is a Special Case of NTN
Here we show how the RESCAL model of Section IV-A In general, define ij as a matrix of all 0s except for entry ði; jÞ
is a special case of the neural tensor model (NTN) which is 1. Then, if we define Bk ¼ ½½1;1 ; . . . ; He ;He , we have
of Section IV-E. To see this, note that RESCAL has the
form


hbijk ¼ e> 1 > Hb
i Bk ej ; . . . ; ei Bk ej ¼ ej  ei :
RESCAL
fijk ¼ e> >
i Wk ej ¼ wk ½ej  ei : (31)
Finally, if we define Ak as the empty matrix (so haijk is
undefined), and gðuÞ ¼ u as the identity function, then the
Next, note that NTN equation

v  u ¼ vecðuv> Þ ¼ ½u> B1 v; . . . ; u> Bn v h i


NTN
fijk ¼ w>
kg haijk ; hbijk

where n ¼ jujjvj, and Bk is a matrix of all 0 s except for a


single 1 element in the kth position, which ‘‘plucks out’’ the matches (31).

REFERENCES [10] J. Fan et al., ‘‘AKBC-WEKEX 2012: Foundations. Pacific Grove, CA, USA:
The knowledge extraction workshop at Brooks/Cole, 2000.
[1] L. Getoor and B. Taskar, Eds., Introduction to NAACL-HLT,’’ 2012. [Online]. Available:
Statistical Relational Learning. [21] Y. Sun and J. Han, ‘‘Mining heterogeneous
https://akbcwekex2012.wordpress.com/. information networks: Principles and
Cambride, MA, USA: MIT Press, 2007.
[11] R. Davis, H. Shrobe, and P. Szolovits, ‘‘What methodologies,’’ Synthesis Lectures Data
[2] S. Dzeroski and N. Lavrač, Relational is a knowledge representation?’’ AI Mag., Mining Knowl. Disc., vol. 3, no. 2, pp. 1–159,
Data Mining. New York, NY, USA: vol. 14, no. 1, pp. 17–33, 1993. 2012.
Springer-Verlag, 2001.
[12] J. F. Sowa, ‘‘Semantic networks,’’ [22] R. West et al., ‘‘Knowledge base completion
[3] L. De Raedt, Logical and Relational Encyclopedia Cogn. Sci., 2006. via search-based question answering,’’ in
Learning. New York, NY, USA: Proc. 23rd Int. Conf. World Wide Web, 2014,
Springer-Verlag, 2008. [13] M. Minsky, ‘‘A framework for representing
knowledge,’’ MIT-AI Lab. Memo 306, 1974. pp. 515–526.
[4] F. M. Suchanek, G. Kasneci, and G. Weikum, [23] D. B. Lenat, ‘‘CYC: A large-scale investment
‘‘Yago: A core of semantic knowledge,’’ in [14] T. Berners-Lee, J. Hendler, and O. Lassila,
‘‘The semantic web,’’ 2001. [Online]. in knowledge infrastructure,’’ Commun.
Proc. 16th Int. Conf. World Wide Web, 2007, ACM, vol. 38, no. 11, pp. 33–38, Nov. 1995.
pp. 697–706. Available: http://www.scientificamerican.
com/article/the-semantic-web/. [24] G. A. Miller, ‘‘WordNet: A lexical database
[5] S. Auer et al., ‘‘DBpedia: A nucleus for a for english,’’ Commun. ACM, vol. 38, no. 11,
web of open data,’’ in The Semantic Web, [15] T. Berners-Lee, ‘‘Linked dataVDesign
issues,’’ Jul. 2006. [Online]. Available: pp. 39–41, Nov. 1995.
vol. 4825. Berlin, Germany: Springer-
Verlag, 2007, pp. 722–735. http://www.w3.org/DesignIssues/ [25] O. Bodenreider, ‘‘The Unified Medical
LinkedData.html. Language System (UMLS): Integrating
[6] A. Carlson et al., ‘‘Toward an architecture biomedical terminology,’’ Nucleic Acids Res.,
for never-ending language learning,’’ in [16] C.Bizer,T.Heath,andT.Berners-Lee,‘‘Linked
data-the story so far,’’ Int.J.Semantic Web Inf. vol. 32, no. Database issue, pp. D267–270,
Proc. 24th Conf. Artif. Intell., 2010, Jan. 2004.
pp. 1306–1313. Syst., vol. 5, no. 3, pp. 1–22, 2009.
[17] G. Klyne and J. J. Carroll, ‘‘Resource [26] D. Vrandečić and M. Krötzsch, ‘‘Wikidata: A
[7] K. Bollacker, C. Evans, P. Paritosh, T. Sturge, free collaborative knowledgebase,’’ Commun.
and J. Taylor, ‘‘Freebase: A collaboratively description framework (RDF): Concepts and
abstract syntax,’’ Feb. 2004. [Online]. ACM, vol. 57, no. 10, pp. 78–85, 2014.
created graph database for structuring
human knowledge,’’ in Proc. ACM SIGMOD Available: http://www.w3.org/TR/2004/ [27] J. Hoffart, F. M. Suchanek, K. Berberich, and
Int. Conf. Manage. Data, 2008, REC-rdf-concepts-20040210/. G. Weikum, ‘‘YAGO2: A spatially and
pp. 1247–1250. [18] R. Cyganiak, D. Wood, and M. Lanthaler, temporally enhanced knowledge base from
‘‘RDF 1.1 Concepts and abstract syntax,’’ Wikipedia,’’ Artif. Intell., vol. 194, pp. 28–61,
[8] A. Singhal, ‘‘Introducing the knowledge 2013.
graph: Things, not strings,’’ May 2012. Feb. 2014. [Online]. Available: http://www.
[Online]. Available: http://googleblog. w3.org/TR/2014/REC-rdf11-concepts- [28] X. Dong et al., ‘‘Knowledge vault: A web-
blogspot.com/2012/05/introducing- 20140225/. scale approach to probabilistic knowledge
knowledge-graph-things-not.html. [19] R. Brachman and H. Levesque, fusion,’’ in Proc. 20th ACM SIGKDD Int. Conf.
Knowledge Representation and Reasoning. Knowl. Disc. Data Mining, 2014, pp. 601–610.
[9] G. Weikum and M. Theobald, ‘‘From
information to knowledge: Harvesting San Francisco, CA, USA: Morgan Kaufmann, [29] N. Nakashole, G. Weikum, and F. Suchanek,
entities and relationships from web sources,’’ 2004. ‘‘PATTY: A taxonomy of relational patterns
in Proc. 29th ACM SIGMOD-SIGACT-SIGART [20] J. F. Sowa, Knowledge Representation: Logical, with semantic types,’’ in Proc. Joint Conf.
Symp. Principles Database Syst., 2010, Philosophical and Computational Empirical Meth. Natural Language Process.
pp. 65–76. Computat. Natural Language Learning, 2012,
pp. 1135–1145.

30 Proceedings of the IEEE | Vol. 104, No. 1, January 2016


Nickel et al.: A Review of Relational Machine Learning for Knowledge Graphs

[30] N. Nakashole, M. Theobald, and G. Weikum, [46] L. Getoor and C. P. Diehl, ‘‘Link mining: Ludwig-Maximilians-Universität München,
‘‘Scalable knowledge harvesting with high A survey,’’ ACM SIGKDD Explor. Newslett., Munich, Germany, Aug. 2013.
precision and high recall,’’ in Proc. 4th ACM vol. 7, no. 2, pp. 3–12, 2005. [66] Y. Koren, R. Bell, and C. Volinsky, ‘‘Matrix
Int. Conf. Web Search Data Mining, 2011, [47] H. B. Newcombe, J. M. Kennedy, S. J. Axford, factorization techniques for recommender
pp. 227–236. and A. P. James, ‘‘Automatic linkage of vital systems,’’ IEEE Computer, vol. 42, no. 8,
[31] F. Niu, C. Zhang, C. Ré, and J. Shavlik, records computers can be used to extract pp. 30–37, 2009.
‘‘Elementary: Large-scale knowledge-base ‘‘follow-up’’ statistics of families from files of [67] T. G. Kolda and B. W. Bader, ‘‘Tensor
construction via machine learning and routine records,’’ Science, vol. 130, no. 3381, decompositions and applications,’’ SIAM
statistical inference,’’ Int. J. Semantic Web Inf. pp. 954–959, Oct. 1959. Rev., vol. 51, no. 3, pp. 455–500, 2009.
Syst. (IJSWIS), vol. 8, no. 3, pp. 42–73, 2012. [48] S. Tejada, C. A. Knoblock, and S. Minton, [68] M. Nickel and V. Tresp, ‘‘Logistic
[32] A. Fader, S. Soderland, and O. Etzioni, ‘‘Learning object identification rules for tensor-factorization for multi-relational
‘‘Identifying relations for open information information integration,’’ Inf. Syst., vol. 26, data,’’ in Proc. Structured Learn.: Inferring
extraction,’’ in Proc. Conf. Empir. Meth. no. 8, pp. 607–633, 2001. Graphs Structured Unstructured Inputs
Natural Language Process., Stroudsburg, PA, [49] E. Rahm and P. A. Bernstein, ‘‘A survey of (SLG 2013) Workshop, 2013.
USA, 2011, pp. 1535–1545. approaches to automatic schema matching,’’ [69] K.-W. Chang, W.-T. Yih, B. Yang, and
[33] M. Schmitz, R. Bart, S. Soderland, and VLDB J., vol. 10, no. 4, pp. 334–350, 2001. C. Meek, ‘‘Typed tensor decomposition of
O. Etzioni, ‘‘Open language learning for [50] A. Culotta and A. McCallum, ‘‘Joint knowledge bases for relation extraction,’’ in
information extraction,’’ in Proc. Joint Conf. deduplication of multiple record types in Proc. 2014 Conf. Empir. Meth. Natural Lang.
Empirical Meth. Natural Language Process. relational data,’’ in Proc. 14th ACM Int. Conf. Process., Oct. 2014.
Computat. Natural Language Learning, 2012, Inf. Knowl. Manage., 2005, pp. 257–258. [70] S. Kok and P. Domingos, ‘‘Statistical
pp. 523–534.
[51] P. Singla and P. Domingos, ‘‘Entity predicate invention,’’ in Proc. 24th Int. Conf.
[34] J. Fan, D. Ferrucci, D. Gondek, and resolution with Markov logic,’’ in Proc. 6th Mach. Learn., New York, NY, USA, 2007,
A. Kalyanpur, ‘‘Prismatic: Inducing Int. Conf. Data Mining, Dec. 2006, pp. 433–440.
knowledge from a large scale lexicalized pp. 572–582. [71] Z. Xu, V. Tresp, K. Yu, and H.-P. Kriegel,
relation resource,’’ in Proc. NAACL HLT 1st
[52] I. Bhattacharya and L. Getoor, ‘‘Collective ‘‘Infinite hidden relational models,’’ in Proc.
Int. Workshop Formalisms Method. Learning
entity resolution in relational data,’’ ACM 22nd Int. Conf. Uncertainty Artif. Intell.,
Reading, 2010, pp. 122–127.
Trans. Knowl. Discov. Data, vol. 1, no. 1, 2006, pp. 544–551.
[35] B. Suh, G. Convertino, E. H. Chi, and Mar. 2007. [72] C. Kemp, J. B. Tenenbaum, T. L. Griffiths,
P. Pirolli, ‘‘The singularity is not near:
[53] S. E. Whang and H. Garcia-Molina, ‘‘Joint T. Yamada, and N. Ueda, ‘‘Learning systems
Slowing growth of Wikipedia,’’ in Proc. ACM
entity resolution,’’ in Proc. IEEE 28th Int. of concepts with an infinite relational
5th Int. Symp. Wikis Open Collab., 2009,
Conf. Data Eng., Washington, DC, USA, 2012, model,’’ in Proc. 21st Nat. Conf. Artif. Intell.,
pp. 8:1–8:10.
pp. 294–305. 2006, vol. 3, p. 5.
[36] J. Biega, E. Kuzey, and F. M. Suchanek,
[54] S. Fortunato, ‘‘Community detection in [73] I. Sutskever, J. B. Tenenbaum, and
‘‘Inside YAGO2s: A transparent information
graphs,’’ Phys. Rep., vol. 486, no. 3, R. R. Salakhutdinov, ‘‘Modelling relational
extraction architecture,’’ in Proc. 22nd Int.
pp. 75–174, 2010. data using Bayesian clustered tensor
Conf. World Wide Web, Republic and Canton
[55] M. E. J. Newman, ‘‘The structure of scientific factorization Adv. Neural Inf. Process. Syst.,
of Geneva, Switzerland, 2013, pp. 325–328.
collaboration networks,’’ Proc. Nat. Acad. Sci., vol. 22, pp. 1821–1828, 2009.
[37] O. Etzioni, A. Fader, J. Christensen,
vol. 98, no. 2, pp. 404–409, Jan. 2001. [74] D. KrompaQ, M. Nickel, and V. Tresp,
S. Soderland, and M. Mausam, ‘‘Open
[56] D. Liben-Nowell and J. Kleinberg, ‘‘The ‘‘Large-scale factorization of type-
information extraction: The second
link-prediction problem for social networks,’’ constrained multi-relational data,’’ in Proc.
generation,’’ in Proc. 22nd Int. Joint Conf.
J. Amer. Soc. Inf. Sci. Technol., vol. 58, no. 7, Int. Conf. Data Sci. Adv. Anal., 2014.
Artif. Intell., Barcelona, Catalonia, Spain,
2011, vol. 1, pp. 3–10. pp. 1019–1031, 2007. [75] M. Nickel and V. Tresp, ‘‘Learning
[57] D. Jensen and J. Neville, ‘‘Linkage and taxonomies from multi-relational data via
[38] D. B. Lenat and E. A. Feigenbaum, ‘‘On
autocorrelation cause feature selection bias hierarchical link-based clustering,’’ in Proc.
the thresholds of knowledge,’’ Artif. Intell.,
in relational learning,’’ in Proc. 19th Int. Learn. Semant. Workshop, Granada, Spain,
vol. 47, no. 1, pp. 185–250, 1991.
Conf. Mach. Learn., San Francisco, CA, USA, 2011.
[39] R. Qian, ‘‘Understand your world with Bing,’’
2002, pp. 259–266. [76] T. G. Kolda, B. W. Bader, and J. P. Kenny,
Bing search blog, Mar. 2013. [Online].
[58] P. W. Holland, K. B. Laskey, and S. Leinhardt, ‘‘Higher-order web link analysis using
Available: http://blogs.bing.com/search/
‘‘Stochastic blockmodels: First steps,’’ Social multilinear algebra,’’ in Proc. Fifth IEEE Int.
2013/03/21/understand-your-world-with-
Netw., vol. 5, no. 2, pp. 109–137, 1983. Conf. Data Mining, Washington, DC, USA,
bing/.
2005, pp. 242–249.
[40] D. Ferrucci et al., ‘‘Building Watson: An [59] C. J. Anderson, S. Wasserman, and K. Faust,
‘‘Building stochastic blockmodels,’’ Social [77] T. Franz, A. Schultz, S. Sizov, and S. Staab,
overview of the DeepQA project,’’ AI Mag.,
Netw., vol. 14, Special Issue on Blockmodels, ‘‘Triplerank: Ranking semantic web data by
vol. 31, no. 3, pp. 59–79, 2010.
no. 1–2, pp. 137–161, 1992. tensor decomposition,’’ Proc. Semant. Web,
[41] F. Belleau, M.-A. Nolin, N. Tourigny, 2009, pp. 213–228.
P. Rigault, and J. Morissette, ‘‘Bio2RDF: [60] P. Hoff, ‘‘Modeling homophily and stochastic
equivalence in symmetric relational data,’’ in [78] L. Drumond, S. Rendle, and
Towards a mashup to build bioinformatics
Advances in Neural Information Processing L. Schmidt-Thieme, ‘‘Predicting RDF triples
knowledge systems,’’ J. Biomed. Inf., vol. 41,
Systems 20. Red Hook, NY, USA: Curran, in incomplete knowledge bases with tensor
no. 5, pp. 706–716, 2008.
2008, pp. 657–664. factorization,’’ in Proc. 27th Annu. ACM Symp.
[42] A. Ruttenberg, J. A. Rees, M. Samwald, and Appl. Comput., Riva del Garda, Italy, 2012,
M. S. Marshall, ‘‘Life sciences on the [61] J. C. Platt, ‘‘Probabilities for SV machines,’’
pp. 326–331.
semantic web: The neurocommons and in Advances in Large Margin Classifiers.
Cambridge, MA, USA: MIT Press, 1999, [79] S. Rendle and L. Schmidt-Thieme, ‘‘Pairwise
beyond,’’ Brief. Bioinf., vol. 10, no. 2,
pp. 61–74. interaction tensor factorization for
pp. 193–204, Mar. 2009.
personalized tag recommendation,’’ in Proc.
[43] V. Momtchev, D. Peychev, T. Primov, and [62] P. Orbanz and D. M. Roy, ‘‘Bayesian models
Third ACM Int. Conf. Web Search Data Mining,
G. Georgiev, ‘‘Expanding the pathway and of graphs, arrays and other exchangeable
2010, pp. 81–90.
interaction knowledge in linked life data,’’ in random structures,’’ IEEE Trans. Pattern Anal.
Machine Intell., 2015. [80] S. Rendle, ‘‘Scaling factorization machines to
Proc. Int. Semantic Web Challenge, 2009.
relational data,’’ in Proc. 39th Int. Conf. Very
[44] G. Angeli and C. Manning, ‘‘Philosophers are [63] M. Nickel, V. Tresp, and H.-P. Kriegel,
Large Data Bases, Trento, Italy, 2013,
mortal: Inferring the truth of unseen facts,’’ ‘‘A three-way model for collective
pp. 337–348.
in Proc. 17th Conf. Comput. Natural Language learning on multi-relational data,’’ in
Proc. 28th Int. Conf. Mach. Learn., 2011, [81] R. Jenatton, N. L. Roux, A. Bordes, and
Learn., Sofia, Bulgaria, Aug. 2013,
pp. 809–816. G. R. Obozinski, ‘‘A latent factor model for
pp. 133–142.
highly multi-relational data,’’ in Advances in
[45] B. Taskar, M.-F. Wong, P. Abbeel, and [64] M. Nickel, ‘‘Factorizing YAGO: Scalable
Neural Information Processing Systems 25.
D. Koller, ‘‘Link prediction in relational machine learning for linked data,’’ in Proc.
Red Hook, NY, USA: Curran, 2012,
data,’’ in Adv. Neural Inf. Process. Syst., vol. 16, 21st Int. Conf. World Wide Web, 2012,
pp. 3167–3175.
S. Thrun, L. Saul, and B. Schölkopf, Eds. pp. 271–280.
[82] P. Miettinen, ‘‘Boolean tensor
Cambridge, MA, USA: MIT Press, 2004. [65] M. Nickel, ‘‘Tensor factorization for
factorizations,’’ in Proc. IEEE 11th Int. Conf.
relational learning,’’ Ph.D. dissertation,
Data Mining, Dec. 2011, pp. 447–456.

Vol. 104, No. 1, January 2016 | Proceedings of the IEEE 31


Nickel et al.: A Review of Relational Machine Learning for Knowledge Graphs

[83] D. Erdos and P. Miettinen, ‘‘Discovering [100] L. Katz, ‘‘A new status index derived from Int. Conf. Knowl. Discov. Data Mining,
facts with Boolean tensor tucker sociometric analysis,’’ Psychometrika, vol. 18, 2008, pp. 426–434.
decomposition,’’ in Proc. 22nd ACM Int. Conf. no. 1, pp. 39–43, 1953. [120] S. Rendle, ‘‘Factorization machines with
Inf. Knowl. Manage., New York, NY, USA, [101] E. A. Leicht, P. Holme, and M. E. Newman, libFM,’’ ACM Trans. Intell. Syst. Technol.,
2013, pp. 1569–1572. ‘‘Vertex similarity in networks,’’ Phys. Rev. E, vol. 3, no. 3, p. 57, 2012.
[84] X. Jiang, V. Tresp, Y. Huang, and M. Nickel, vol. 73, no. 2, p. 026120, 2006. [121] D. H. Wolpert, ‘‘Stacked generalization,’’
‘‘Link prediction in multi-relational graphs [102] S. Brin and L. Page, ‘‘The anatomy of a Neural Netw., vol. 5, no. 2, pp. 241–259,
using additive models,’’ in Proc. Int. large-scale hypertextual web search engine,’’ 1992.
Workshop Semant. Technol. Recomm. Sys. Big Comput. Netw. ISDN Syst., vol. 30, no. 1, [122] L. Bottou, ‘‘Large-scale machine learning
Data ISWC, M. de Gemmis, T. D. Noia, pp. 107–117, 1998. with stochastic gradient descent,’’ in Proc.
P. Lops, T. Lukasiewicz, and G. Semeraro,
[103] W. Liu and L. Lü, ‘‘Link prediction based on COMPSTAT, 2010, pp. 177–186.
Eds., 2012, vol. 919, pp. 1–12.
local random walk,’’ Europhys. Lett., vol. 89, [123] K. P. Murphy, Machine Learning: A
[85] S. Riedel, L. Yao, B. M. Marlin, and no. 5, p. 58007, 2010. Probabilistic Perspective. Cambridge, MA,
A. McCallum, ‘‘Relation extraction with
[104] S. Muggleton, ‘‘Inverse entailment and USA: MIT Press, 2012.
matrix factorization and universal schemas,’’
progol,’’ New Gen. Comput., vol. 13, no. 3/4, [124] J. Davis and M. Goadrich, ‘‘The relationship
Joint Human Language Technol. Conf./
pp. 245–286, 1995. between precision-recall and ROC curves,’’
Annu. Meet. North Amer. Chapter Assoc.
Comput. Linguistics, Jun. 2013. [105] J. R. Quinlan, ‘‘Inductive logic in Proc. 23rd Int. Conf. Mach. Learn., 2006,
programming,’’ New Gen. Computi., vol. 8, pp. 233–240, ACM.
[86] V. Tresp, Y. Huang, M. Bundschus, and
no. 4, pp. 295–318, 1991. [125] D. Koller and N. Friedman, Probabilistic
A. Rettinger, ‘‘Materializing and querying
learned knowledge,’’ Proc. IRMLeS, [106] J. R. Quinlan, ‘‘Learning logical definitions Graphical Models: Principles and
vol. 2009, 2009. from relations,’’ Mach. Learn., vol. 5, Techniques. Cambridge, MA, USA: MIT
pp. 239–266, 1990. Press, 2009.
[87] Y. Huang, V. Tresp, M. Nickel, A. Rettinger,
and H.-P. Kriegel, ‘‘A scalable approach for [107] L. A. Galárraga, C. Teflioudi, K. Hose, and [126] M. Richardson and P. Domingos, ‘‘Markov
statistical learning in semantic graphs,’’ F. Suchanek, ‘‘AMIE: Association rule logic networks,’’ Mach. Learn., vol. 62, no. 1,
Semant. Web J., 2013. mining under incomplete evidence in pp. 107–136, 2006.
ontological knowledge bases,’’ in Proc. 22nd [127] C. Zhang and C. Ré, ‘‘Towards
[88] P. Smolensky, ‘‘Tensor product variable
Int. Conf. World Wide Web, 2013, high-throughput Gibbs sampling at scale:
binding and the representation of symbolic
pp. 413–422. A study across storage managers,’’ in Proc.
structures in connectionist systems,’’ Artif.
Intell., vol. 46, no. 1, pp. 159–216, 1990. [108] L. Galárraga, C. Teflioudi, K. Hose, and ACM SIGMOD Int. Conf. Manage. Data, 2013,
F. Suchanek, ‘‘Fast rule mining in ontological pp. 397–408.
[89] G. S. Halford, W. H. Wilson, and S. Phillips,
knowledge bases with AMIE+,’’ VLDB J., [128] H. Poon and P. Domingos, ‘‘Sound and
‘‘Processing capacity defined by relational
pp. 1–24, 2015. efficient inference with probabilistic and
complexity: Implications for comparative,
… developmental, cognitive psychology,” Behav. Brain Sci., vol. 21, no. 6, pp. 803–831, 1998.
[90] T. Plate, “A common framework for distributed representation schemes for compositional structure,” Connect. Syst. Knowl. Represent. Deduct., pp. 15–34, 1997.
[91] T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient estimation of word representations in vector space,” in Proc. Workshop ICLR, 2013.
[92] R. Socher, D. Chen, C. D. Manning, and A. Ng, “Reasoning with neural tensor networks for knowledge base completion,” in Advances in Neural Information Processing Systems 26. Red Hook, NY, USA: Curran, 2013, pp. 926–934.
[93] A. Bordes, J. Weston, R. Collobert, and Y. Bengio, “Learning structured embeddings of knowledge bases,” in Proc. 25th AAAI Conf. Artif. Intell., San Francisco, CA, USA, 2011.
[94] A. Bordes, N. Usunier, A. Garcia-Duran, J. Weston, and O. Yakhnenko, “Translating embeddings for modeling multi-relational data,” in Advances in Neural Information Processing Systems 26. Red Hook, NY, USA: Curran, 2013, pp. 2787–2795.
[95] B. Yang, W.-T. Yih, X. He, J. Gao, and L. Deng, “Embedding entities and relations for learning and inference in knowledge bases,” CoRR, vol. abs/1412.6575, 2014.
[96] P. D. Hoff, A. E. Raftery, and M. S. Handcock, “Latent space approaches to social network analysis,” J. Amer. Stat. Assoc., vol. 97, no. 460, pp. 1090–1098, 2002.
[97] L. Lü and T. Zhou, “Link prediction in complex networks: A survey,” Physica A, Stat. Mechan. Appl., vol. 390, no. 6, pp. 1150–1170, Mar. 2011.
[98] L. A. Adamic and E. Adar, “Friends and neighbors on the web,” Social Netw., vol. 25, no. 3, pp. 211–230, 2003.
[99] A.-L. Barabási and R. Albert, “Emergence of scaling in random networks,” Science, vol. 286, no. 5439, pp. 509–512, 1999.
[109] F. A. Lisi, “Inductive logic programming in databases: From Datalog to DL+log,” TPLP, vol. 10, no. 3, pp. 331–359, 2010.
[110] C. d’Amato, N. Fanizzi, and F. Esposito, “Reasoning by analogy in description logics through instance-based learning,” in Proc. Semant. Web Appl. Perspect., 2006.
[111] J. Lehmann, “DL-Learner: Learning concepts in description logics,” J. Mach. Learn. Res., vol. 10, pp. 2639–2642, 2009.
[112] A. Rettinger, U. Lösch, V. Tresp, C. d’Amato, and N. Fanizzi, “Mining the semantic web: Statistical learning for next generation knowledge bases,” Data Min. Knowl. Discov., vol. 24, no. 3, pp. 613–662, 2012.
[113] U. Lösch, S. Bloehdorn, and A. Rettinger, “Graph kernels for RDF data,” in Proc. 9th Int. Conf. Semant. Web: Res. Appl., 2012, pp. 134–148.
[114] P. Minervini, N. Fanizzi, and V. Tresp, “Learning to propagate knowledge in web ontologies,” in Proc. 10th Int. Workshop Uncertain. Reason. Semant. Web, 2014, pp. 13–24.
[115] N. Lao and W. W. Cohen, “Relational retrieval using a combination of path-constrained random walks,” Mach. Learn., vol. 81, no. 1, pp. 53–67, 2010.
[116] N. Lao, T. Mitchell, and W. W. Cohen, “Random walk inference and learning in a large scale knowledge base,” in Proc. Conf. Empir. Meth. Nat. Lang. Process., 2011, pp. 529–539.
[117] K. Toutanova and D. Chen, “Observed versus latent features for knowledge base and text inference,” in Proc. 3rd Workshop Continuous Vector Space Models Compositionality, 2015.
[118] M. Nickel, X. Jiang, and V. Tresp, “Reducing the rank in relational factorization models by including observable patterns,” in Advances in Neural Information Processing Systems 27. Red Hook, NY, USA: Curran, 2014, pp. 1179–1187.
[119] Y. Koren, “Factorization meets the neighborhood: A multifaceted collaborative filtering model,” in Proc. 14th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2008, pp. 426–434.
… deterministic dependencies,” in Proc. AAAI, 2006.
[129] A. Globerson and T. S. Jaakkola, “Fixing max-product: Convergent message passing algorithms for MAP LP-relaxations,” in NIPS, 2007.
[130] S. H. Bach, M. Broecheler, B. Huang, and L. Getoor, “Hinge-loss Markov random fields and probabilistic soft logic,” arXiv:1505.04406 [cs.LG], 2015.
[131] A. Kimmig, S. H. Bach, M. Broecheler, B. Huang, and L. Getoor, “A short introduction to probabilistic soft logic,” in Proc. NIPS Workshop Probab. Programming: Found. Appl., 2012.
[132] J. Pujara, H. Miao, L. Getoor, and W. W. Cohen, “Using semantics and statistics to turn data into knowledge,” AI Mag., 2015.
[133] J. Neville and D. Jensen, “Relational dependency networks,” J. Mach. Learn. Res., vol. 8, pp. 637–652, May 2007.
[134] D. Krompaß, X. Jiang, M. Nickel, and V. Tresp, “Probabilistic latent-factor database models,” in Proc. 1st Workshop Linked Data Knowl. Discovery, European Conf. Mach. Learn. Principles Practice Knowl. Discovery Databases, 2014.
[135] H. Li et al., “Improvement of n-ary relation extraction by adding lexical semantics to distant-supervision rule learning,” in Proc. 7th Int. Conf. Agents Artif. Intell., Lisbon, Portugal, Jan. 10–12, 2015, pp. 317–324.
[136] H. Ji, T. Cassidy, Q. Li, and S. Tamang, “Tackling representation, annotation and classification challenges for temporal knowledge base population,” Knowl. Inf. Syst., pp. 1–36, Aug. 2013.
[137] D. L. McGuinness and F. van Harmelen, “OWL web ontology language overview,” W3C Recommend., Feb. 10, 2004.
[138] A. Hogan, A. Harth, A. Passant, S. Decker, and A. Polleres, “Weaving the pedantic web,” in Proc. 3rd Int. Workshop Linked Data Web (LDOW2010)/19th Int. World Wide Web Conf., Raleigh, NC, USA, 2010.

[139] H. Halpin, P. Hayes, J. McCusker, D. McGuinness, and H. Thompson, “When owl:sameAs isn’t the same: An analysis of identity in linked data,” in Proc. Semant. Web, 2010, pp. 305–320.
[140] D. Krompaß, M. Nickel, and V. Tresp, “Querying factorized probabilistic triple databases,” in Proc. Semant. Web, 2014, pp. 114–129.
[141] D. Suciu, D. Olteanu, C. Ré, and C. Koch, Probabilistic Databases. San Rafael, CA, USA: Morgan & Claypool, 2011.
[142] D. Z. Wang, E. Michelakis, M. Garofalakis, and J. M. Hellerstein, “BayesStore: Managing large, uncertain data repositories with probabilistic graphical models,” Proc. VLDB Endow., vol. 1, no. 1, pp. 340–351, 2008.
[143] J. Bleiholder and F. Naumann, “Data fusion,” ACM Comput. Surv., vol. 41, no. 1, pp. 1:1–1:41, Jan. 2009.
[144] X. Li, X. L. Dong, K. Lyons, W. Meng, and D. Srivastava, “Truth finding on the deep web: Is the problem solved?” Proc. VLDB Endow., vol. 6, no. 2, pp. 97–108, Dec. 2012.
[145] X. L. Dong et al., “From data fusion to knowledge fusion,” Proc. VLDB Endow., vol. 7, no. 10, pp. 881–892, Jun. 2014.
[146] X. L. Dong et al., “Knowledge-based trust: Estimating the trustworthiness of web sources,” Proc. VLDB Endow., vol. 8, no. 9, pp. 938–949, May 2015.
[147] M. Nickel and V. Tresp, “Tensor factorization for multi-relational learning,” in Machine Learning and Knowledge Discovery in Databases, ser. Lecture Notes in Computer Science, vol. 8190. Berlin, Germany: Springer-Verlag, 2013, pp. 617–621.

ABOUT THE AUTHORS
Maximilian Nickel received the Ph.D. degree (summa cum laude) from the Ludwig Maximilian University, Munich, Germany, in 2013.

He is a Postdoctoral Fellow with the Laboratory for Computational and Statistical Learning, Massachusetts Institute of Technology (MIT), Cambridge, MA, USA, and the Istituto Italiano di Tecnologia, Genova, Italy. He is also with the Center for Brains, Minds, and Machines at MIT. From 2010 to 2013, he worked as a Research Assistant at Siemens Corporate Technology, Munich, Germany. His research centers on machine learning from relational knowledge representations and graph-structured data, as well as its applications in artificial intelligence and cognitive science.

Kevin Murphy received the B.A. degree from the University of Cambridge, Cambridge, U.K., the M.Eng. degree from the University of Pennsylvania, Philadelphia, PA, USA, and the Ph.D. degree from the University of California at Berkeley, Berkeley, CA, USA.

Currently, he is a Research Scientist at Google, Mountain View, CA, USA, where he works on AI, machine learning, computer vision, knowledge base construction, and natural language processing. Before joining Google in 2011, he was an Associate Professor of Computer Science and Statistics at the University of British Columbia (UBC), Vancouver, BC, Canada. Before starting at UBC in 2004, he was a Postdoctoral Researcher at the Massachusetts Institute of Technology (MIT), Cambridge, MA, USA. He has published over 80 papers in refereed conferences and journals, as well as the 1100-page textbook Machine Learning: A Probabilistic Perspective (Cambridge, MA, USA: MIT Press, 2012), which was awarded the 2013 DeGroot Prize for best book in the field of statistical science.

Dr. Murphy is the Co-Editor-in-Chief of the Journal of Machine Learning Research.

Volker Tresp received the Diploma degree from the University of Goettingen, Germany, in 1984 and the M.Sc. and Ph.D. degrees from Yale University, New Haven, CT, USA, in 1986 and 1989, respectively.

Since 1989, he has headed various machine learning research teams at Siemens, Research and Technology, Munich, Germany. He has filed more than 70 patent applications and was named Siemens Inventor of the Year in 1996. He has published more than 100 scientific articles and supervised over 20 Ph.D. dissertations. The company Panoratio is a spin-off of his team. His research focus in recent years has been machine learning in information networks for modeling knowledge graphs, medical decision processes, and sensor networks. He is the coordinator of one of the first nationally funded big data projects for the realization of precision medicine. In 2011, he became an Honorary Professor at the Ludwig Maximilian University of Munich, Germany, where he teaches an annual course on machine learning.

Evgeniy Gabrilovich received the Ph.D. degree in computer science from the Technion-Israel Institute of Technology, Haifa, Israel.

He is a Senior Staff Research Scientist at Google, Mountain View, CA, USA, where he works on knowledge discovery from the web. Prior to joining Google in 2012, he was a Director of Research and Head of the Natural Language Processing and Information Retrieval Group at Yahoo! Research.

Dr. Gabrilovich is an ACM Distinguished Scientist and a recipient of the 2014 IJCAI-JAIR Best Paper Prize. He is also a recipient of the 2010 Karen Spärck Jones Award for his contributions to natural language processing and information retrieval.
