
IntelliGraphs: Datasets for Benchmarking

Knowledge Graph Generation

Thiviyan Thanapalasingam Emile van Krieken Peter Bloem


University of Amsterdam Vrije Universiteit Amsterdam Vrije Universiteit Amsterdam
[email protected]
arXiv:2307.06698v2 [cs.AI] 19 Jul 2023

Paul Groth
University of Amsterdam

Abstract

Knowledge Graph Embedding (KGE) models are used to learn continuous representations
of entities and relations. A key task in the literature is predicting
missing links between entities. However, Knowledge Graphs are not just sets of
links but also have semantics underlying their structure. Semantics is crucial in
several downstream tasks, such as query answering or reasoning. We introduce the
subgraph inference task, where a model has to generate likely and semantically
valid subgraphs. We propose IntelliGraphs, a set of five new Knowledge Graph
datasets. The IntelliGraphs datasets contain subgraphs with semantics expressed
in logical rules for evaluating subgraph inference. We also present the dataset
generator that produced the synthetic datasets. We designed four novel baseline
models, which include three models based on traditional KGEs. We evaluate their
expressiveness and show that these models cannot capture the semantics. We believe
this benchmark will encourage the development of machine learning models
that emphasize semantic understanding.

1 Introduction

Knowledge Graphs (KGs) contain knowledge about the world structured as graphs with entities
connected through different relations [Hogan et al., 2021]. Large-scale KGs are widely used in a
range of applications, such as query answering [Arakelyan et al., 2020] and information retrieval
[Noy et al., 2019].
To address the problem of incompleteness in KGs, Knowledge Graph Embedding (KGE) models
were developed. These learn continuous representations for entities and relations [Bordes et al.,
2013, Yang et al., 2014] through link prediction, the task of predicting missing links in large KGs,
by learning scoring functions that rank entities [Ruffinelli et al., 2019]. These approaches implicitly
assume that each link (also known as a triple) in a Knowledge Graph can be predicted independently.
In this view of Knowledge Graphs, each triple is seen as a kind of “atomic fact” which is true or false
independent of other triples.
However, in modern Knowledge Graphs, triples depend on each other. For example, the triples
value(temperature_NY, 77) and unit(temperature_NY, Fahrenheit) together describe
that the temperature in New York is 77 °F. In this case, the truth of the first triple depends on
the content of the second. Figure 1 provides another example: the fact that "Barack Obama lives
in the White House" highly depends on the fact that "Barack Obama is the president of the United
States" and on the temporal context 2009 – 2017.

Preprint. Under review.


Existing KGE models cannot capture such interdependencies between triples [Wen et al., 2016]. In this
paper, we introduce a new task that can be used to evaluate models that generate several connected
triples together, modelling their interdependencies. We call this task subgraph inference. The idea is
that where KGE models can be used to predict missing links, subgraph inference models could be used
to predict missing subgraphs: sets of interdependent links.

[Figure 1: The statement “US President Barack Obama lives in the White House" expressed as a
subgraph: Barack Obama and White House connected by has_name and lives_in, with a has_role
edge to US_president and a has_time_interval edge to an interval with start 2009 and end 2017.]

To simplify the problem of subgraph inference, we assume that a set of true subgraphs is provided,
so that the problem reduces to training a generative model on small knowledge graphs over a shared
set of entities and relations.1 Such predictions must not only capture the general structure of the
graph, but they must also allow us to generalize effectively to graphs not explicitly shown in the data.
So far, the lack of datasets with well-understood semantics has hampered studying how effectively
KGE models capture semantics. Existing datasets commonly used for benchmarking KGE models,
such as FB15k-237 and WN18RR, lack sufficient logical constraints to investigate semantics thoroughly.
Logical constraints play an important role as they help maintain the logical consistency of
facts in a structured knowledge base.
Thus, the three main contributions of our work are as follows:

1. Subgraph Inference. We define a new task, where the goal is to generate, from a set of
examples, novel subgraphs that follow certain logical rules. We specified new evaluation
metrics that help empirically assess generated graphs’ semantic validity and novelty.
2. IntelliGraphs
(a) Synthetic Datasets. We propose three synthetic datasets, each designed to capture
different levels of semantics. We also describe the underlying semantics using First-Order
Logic.
(b) Real-world Datasets. We extract subgraphs from Wikidata 2 according to simple basic
patterns to generate two real-world datasets.
3. Data Generator. We developed a Python package that randomly generates and verifies
subgraphs using pre-defined logical constraints.

The datasets and generators are publicly available on: https://github.com/thiviyanT/IntelliGraphs.
The generator is available as a Python package which can be installed through the
PyPI 3 and Conda 4 package managers. To ensure long-term preservation and easy access, we made
the datasets available on Zenodo 5.
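Assuming the package name shown in the PyPI and Conda URLs above, installation would look like:

```shell
# Install from PyPI (package name taken from the PyPI URL above)
pip install intelligraphs

# Or from the author's Conda channel (channel name from the Anaconda URL above)
conda install -c thiv intelligraphs
```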

2 Benchmark Task
2.1 Limitations of Link Predictors

Binary relations KGE models exploit structural regularities to perform Knowledge Graph completion.
The last decade has seen the development of several KGE models [Ruffinelli et al., 2019],
which predict the likelihood that a pair of entities are related by a given binary relation. However,
a set of binary relations cannot represent an N-ary relation because the links depend on each other.
1 That is, for a true generalization of link prediction to subgraph prediction, we would provide a single, large
knowledge graph and require a model to predict missing subgraphs. In the interest of separating concerns, we
ask here only if generative models over small knowledge graphs are feasible.
2 https://www.wikidata.org/
3 https://pypi.org/project/intelligraphs
4 https://anaconda.org/thiv/intelligraphs
5 https://doi.org/10.5281/zenodo.7824818

Regardless of the context, KGE models assign a set of probabilities on links, and those probabilities
are independent of each other.
N-ary relations Link prediction has been extended to cover N-ary relations, where the goal is
to predict a missing link in an N-ary fact. An N-ary relation can operate on any arbitrary number of
entities. Modelling N-ary relations as triples and treating them as entities in binary relations results
in a loss of structural information [Wen et al., 2016]. Wen et al. [2016] define N-ary relations as the
mappings from the attribute sequences to the attribute values, such that each N-ary fact is an instance
of the corresponding N-ary relation. GRAN is a graph-based approach which uses a Transformer
decoder to score N-ary facts [Wang et al., 2021]. NeuInfer uses fully-connected neural networks to
embed N-ary relations and score candidate triples [Guan et al., 2020]. These models were evaluated
by inferring an element in an N-ary fact. Because a single N-ary relation can be represented as a set
of binary relations (i.e. triples), subgraphs can be used to represent N-ary relations. This means that
subgraph models could be used to solve N-ary relation prediction, but the subgraph inference task is
strictly broader than that: every single N-ary relation can be represented as a subgraph, but not every
subgraph can be represented as a single N-ary relation.
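To make the reduction above concrete, here is a sketch of encoding an N-ary fact as a set of binary triples via an auxiliary node. The helper, relation names, and example fact are hypothetical and only illustrate the idea, not any dataset's actual encoding:

```python
def nary_to_triples(fact_id, relation, attributes):
    """Reify an N-ary fact as binary triples: one auxiliary node for the
    fact itself, plus one triple per attribute (illustrative encoding)."""
    triples = [(fact_id, "instance_of", relation)]
    for attr, value in attributes.items():
        triples.append((fact_id, attr, value))
    return triples

# The 3-ary "temperature measurement" fact from the introduction becomes
# a small subgraph of binary triples around the node temperature_NY.
triples = nary_to_triples(
    "temperature_NY", "temperature_measurement",
    {"value": "77", "unit": "Fahrenheit"},
)
print(triples)
```

Note that the converse direction does not hold: an arbitrary subgraph generally has no single auxiliary node tying all its edges together, which is why subgraph inference is strictly broader.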
Link prediction evaluation The standard link prediction evaluation framework [Ruffinelli et al.,
2019] uses ranking-based evaluation metrics which do not explicitly check for the semantics of
the predicted links. Instead, the evaluation protocol assumes that the underlying semantics can
be indirectly validated if a missing link has been correctly predicted. Therefore, the standard link
prediction framework is not suitable for evaluating Subgraph Inference. In our work, we set out to
explicitly check the semantics during evaluation.

2.2 Subgraph Inference

A Knowledge Graph G is a tuple G = (V, E, R, S), where V is a set of nodes, E is a set of edges
with E ⊆ V × R × V, and R is the set of relations. V is drawn from 𝓔, the set of all possible
entities. S is a set of functions that define the semantics of G by determining which structures are
permissible or not in G.
Given a Knowledge Graph G, we call a subgraph F a tuple (V_f, E_f, R), where V_f ⊂ V and
E_f = {(u, r, v) | u ∈ V_f, v ∈ V_f, r ∈ R and (u, r, v) ∈ E}. We require subgraphs to be connected
graphs. A subgraph adheres to the semantics of the Knowledge Graph.
For example, using the example from the introduction, the statement “The temperature is 77°F.”
can be expressed as a simple two-triple subgraph F of some larger graph6 with the triples
value(temperature, 77), unit(temperature, Fahrenheit). The meaning of the entity
temperature is dependent on both triples. Notably, the unit “Fahrenheit” gives value “77”
additional context. Here, the semantics could be captured by a function that checks that
is_instance(value, integer) ∧ is_unit(unit, units_of_temperature).
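The semantic-check function sketched above can be written out directly. The unit table and the flattening of the two predicates into one Python function below are illustrative assumptions, not part of IntelliGraphs:

```python
# Illustrative semantic check for the two-triple temperature subgraph:
# is_instance(value, integer) ∧ is_unit(unit, units_of_temperature).
UNITS_OF_TEMPERATURE = {"Fahrenheit", "Celsius", "Kelvin"}  # assumed table

def is_semantically_valid(subgraph):
    """Return True iff the value triple carries an integer and the
    unit triple carries a known temperature unit."""
    value = unit = None
    for s, r, o in subgraph:
        if r == "value":
            value = o
        elif r == "unit":
            unit = o
    return isinstance(value, int) and unit in UNITS_OF_TEMPERATURE

F = [("temperature", "value", 77), ("temperature", "unit", "Fahrenheit")]
print(is_semantically_valid(F))  # → True
```

Dropping either triple makes the check fail, which is exactly the interdependence between triples that link predictors cannot express.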
Problem Statement Subgraph Inference is the task of generating novel subgraphs given a set
of existing subgraphs from a given KG. The new set of triples that are inferred 7 must adhere to the
semantics of the original KG. We define the task as follows: given a set of subgraphs S_G^t from a
given graph G, generate subgraphs which should follow the same logical rules L as S_G^t.

These subgraphs can be added back to the KG; therefore, this task can be seen as an extension of link
prediction. To make this extension complete, we should also specify how the training subgraphs are
extracted from G. However, to isolate the question of generative modelling of knowledge graphs,
we take this process as given in our tasks. For instance, in the two real-world datasets, we extract
subgraphs from Wikidata according to a hand-designed pattern. In the synthetic datasets, we simply
provide a set of small knowledge graphs over a shared set of entities and relations, leaving the larger
graph G entirely implicit. With this choice, the task reduces to training a generative model over small
knowledge graphs with a shared set of entities and relations.
Key Challenge. The subgraphs need to adhere to specific semantics which, in a learning setting,
have to be inferred from a limited set of examples, such as learning the types of entities.
6 We do not specify what larger graph this graph is a subgraph of. In most of our tasks, only the set of entities
and relations of the larger graph is given, and the rest of the graph is left implicit.
7 In symbolic AI, the term inference means formal reasoning. Here we use it to mean estimating the probability
distribution over the model’s unobserved variables given observed data (i.e. probabilistic inference).

Table 1: The size of the training, validation and test split for the five datasets used in this work. The
number of edges is fixed for the synthetic datasets and is variable for the Wikidata-based graphs.

Dataset Split (train/val/test) Entities Relations Triples


syn-paths 60000/20000/20000 49 3 3
syn-types 60000/20000/20000 30 3 3
syn-tipr 50000/10000/10000 130 5 5
wd-movies 38267/15698/15796 24093 3 2 – 21
wd-articles 54163/22922/22915 60932 6 4 – 212

3 IntelliGraphs
Motivated by the aforementioned limitations of link prediction datasets and the new task of subgraph
inference, we introduce five new benchmark datasets where each dataset tests different semantics.
Table 1 shows key statistics about the synthetic and real-world graphs. The appendix (see Section
7.3.1) describes the algorithm used to generate the datasets.
Data Generator The sampler D samples subgraphs according to a probability distribution P
defined in the Python implementation of IntelliGraphs. For each dataset’s logical constraints L,
the sampler samples a graph F from the probability distribution P, ensuring that F satisfies all the
constraints in L.
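A minimal sketch of this sampler as rejection sampling: draw a candidate subgraph from P and keep it only if every constraint in L holds. The toy distribution and constraint below stand in for the package's actual P and L:

```python
import random

def sample_subgraph(sample_candidate, constraints, max_tries=10_000):
    """Rejection sampling: draw F ~ P until all constraints in L hold."""
    for _ in range(max_tries):
        candidate = sample_candidate()
        if all(check(candidate) for check in constraints):
            return candidate
    raise RuntimeError("no valid subgraph found within the retry budget")

# Toy stand-in for P: a 3-edge chain over a shared entity vocabulary.
entities = [f"e{i}" for i in range(10)]

def sample_candidate():
    nodes = random.sample(entities, 4)
    return [(nodes[i], "connected_to", nodes[i + 1]) for i in range(3)]

# Toy stand-in for L: "the subgraph has exactly three edges".
constraints = [lambda g: len(g) == 3]

graph = sample_subgraph(sample_candidate, constraints)
print(graph)
```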
Existential Nodes In some settings, it is necessary to have nodes that refer to entities that only
occur in one instance. For example, in the wd-movies dataset introduced below, each subgraph
in the data represents a movie. Its actors, directors and genres are entities that occur in multiple
instances, so a model can learn representations by observing the different contexts in which they
occur. However, each instance also contains one node representing the movie the graph describes.
These only occur in one instance, so a model cannot learn a representation for the specific instance,
only a general representation which expresses that some movie exists for which this subgraph is
true. We call such nodes existential nodes (in analogy to existentially quantied variables in logical
formulas) and use a special label, such as _movie, to refer to them in all instances.8 Strictly speaking,
this turns the predicted subgraphs into subgraph patterns of the Knowledge Graph G, but we refer to
them as subgraphs to keep the terminology simple.

3.1 Semantics

We use First-Order Logic (FOL) to express the underlying logical rules of the datasets. Section 7.4 in
the appendix provides a complete set of logical constraints for each IntelliGraphs dataset.
Logical Constraint Verifier. The Logical Constraint Verifier v is a function that verifies whether
the logical constraints L hold in a generated subgraph F . We built a constraint verier within the
IntelliGraphs Python package.9 The constraint verier v(F, L) returns true if and only if the subgraph
F is consistent with all logical rules L.

3.2 Synthetic Datasets

Synthetic datasets allow complete control over the problem setup and provide a convenient testbed for
developing new machine learning models. The dataset is generated by the generator, D. We checked
if the generated subgraphs satisfy the logical rules L.10 Here is a brief description of the synthetic
datasets:
8 For most models, the difference will only be in the interpretation. For example, our baseline models will
learn one embedding vector for the node labelled _movie, which we use wherever movies occur. As such, we
do not treat it differently from the node labelled Antonio_Banderas, although when we interpret the graph,
these nodes mean different things.
9 A reasoning engine could also be used for checking the subgraphs for logical consistency. We wrote a set of
functions in Python for constraint verification and embedded it into the IntelliGraphs Python package to easily
verify graphs without loading them into a reasoning engine.
10 It is important to note that logical consistency does not equate to factual accuracy. We simply want to ensure
that the generated dataset is consistent with the logical rules.

• syn-paths is a dataset with path graphs. Path graphs have simple semantics that can be
algorithmically veried in linear time. Path graphs have a single directed path of length 3
and no other edges.
• syn-types contains entities with types Language, Country and City. These are connected
by three relations according to the relation’s type constraints: same_type_as can
only exist between the same entity types, could_be_part_of between a capital city and
country, and could_be_spoken_in between a language and a country. The connections
are otherwise random.
• syn-tipr contains subgraphs based on the Time-indexed Person Role (tipr) ontology
pattern.11 Here, the semantics are defined by the tipr graph pattern. The semantics include
the fact that the start of an interval must precede its end.
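The claim that path-graph semantics (as in syn-paths) can be verified algorithmically in linear time can be made concrete. This checker is our own sketch, not the package's verifier:

```python
def is_directed_path(triples, length=3):
    """Return True iff the triples form a single directed path with
    `length` edges and no other edges (linear in the number of triples)."""
    if len(triples) != length:
        return False
    heads = {s for s, _, _ in triples}
    tails = {o for _, _, o in triples}
    # A path has exactly one start (never a tail) and one end (never a head).
    starts, ends = heads - tails, tails - heads
    if len(starts) != 1 or len(ends) != 1:
        return False
    nxt = {s: o for s, _, o in triples}
    if len(nxt) != length:  # a duplicate head would mean branching
        return False
    # Walk from the start; a valid path uses every edge exactly once.
    node, steps = next(iter(starts)), 0
    while node in nxt and steps < length:
        node, steps = nxt[node], steps + 1
    return steps == length and node == next(iter(ends))

print(is_directed_path([("a", "r", "b"), ("b", "r", "c"), ("c", "r", "d")]))  # → True
print(is_directed_path([("a", "r", "b"), ("b", "r", "c"), ("c", "r", "a")]))  # → False (cycle)
```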

3.3 Real-World Datasets

Wikidata 12 [Vrandečić and Krötzsch, 2014] is a large graph-structured knowledge base which consists
of crowdsourced factual knowledge on various topics. We created two datasets from Wikidata using
specic graph patterns to extract subgraphs about movies and research articles. Here is a brief
description of the two datasets:

• wd-movies contains small graphs extracted from Wikidata that describe movies. Each graph
contains one existential node representing the movie, entity nodes for the movie’s director(s)
connected by a has_director relation, entity nodes for the movie’s cast connected by a
has_actor relation and an entity for the movie’s genre connected by a has_genre relation.
• wd-articles contains small graphs that describe research articles extracted from Wikidata.
Each article is annotated by an ordered list of authors, implemented by a blank node for
each author linked to a node representing the author and to a node representing the order in
the author list. We add a list of the other articles that the current article references, and a list
of subjects, together with selected superclasses of those subjects. In this dataset, most node
types, including the article’s node, may be existential or entity nodes.
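For concreteness, a wd-movies-style subgraph built from the relations named above might look as follows. The specific people and genre are invented examples; only the relation names and the _movie existential node come from the dataset description:

```python
# One wd-movies-style subgraph: an existential node `_movie` connected to
# its director(s), cast, and genre via the dataset's three relations.
subgraph = [
    ("_movie", "has_director", "Jane_Doe"),       # hypothetical director
    ("_movie", "has_actor", "John_Smith"),        # hypothetical cast member
    ("_movie", "has_actor", "Mary_Major"),        # hypothetical cast member
    ("_movie", "has_genre", "drama"),             # hypothetical genre
]

relations = {r for _, r, _ in subgraph}
print(sorted(relations))  # → ['has_actor', 'has_director', 'has_genre']
```

Every edge shares the _movie node, so the instance asserts "some movie exists with these people and this genre" rather than a fact about one named movie entity.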

4 Evaluation
4.1 Evaluation by bits-per-graph

The most common objective for a generative model is probably maximum likelihood: a graph from
the test data should have maximal probability under the model or, equivalently,
minimal negative log probability. When base 2 logarithms are used, the latter quantity, − log2 p(S, E),
can be interpreted as the number of bits required to compress the graph [Rissanen, 1978, Grünwald,
2007].
Averaging over all graphs, we arrive at a metric of bits-per-graph to evaluate how well our model
satises the maximum likelihood objective. Moreover, each of the terms in Equation 1 can be
read as separate codelengths: − log2 P (E) describes the bits required to encode the entities, and
− log2 P (S | E) describes the bits required to encode the structure once the entities are known.
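Concretely, the bits-per-graph metric is an average of − log2 p(F) over the test graphs; a minimal sketch:

```python
import math

def bits_per_graph(probabilities):
    """Average codelength -log2 p(F) over a collection of graph
    probabilities, where p(F) = p(S | E) * p(E)."""
    return sum(-math.log2(p) for p in probabilities) / len(probabilities)

# A graph assigned probability 2^-10 by the model costs 10 bits to encode.
print(bits_per_graph([2 ** -10]))   # → 10.0
print(bits_per_graph([0.5, 0.25]))  # → 1.5
```

Because the log of a product is a sum of logs, the entity and structure terms can be reported as separate codelengths, as in Table 2.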

4.2 Semantics

We evaluate the semantics of graphs generated by our baseline models using the following evaluation
metrics: 1) % Valid Graphs is the probability of sampling graphs that are logically valid according
to the logical constraints for each dataset, 2) % Novel Graphs is the probability of sampling graphs
that are not in the training data, 3) % Novel & Valid Graphs is the probability of sampling graphs
that are logically valid and are not in the training data, and 4) % Empty Graphs is the probability
that sampling did not yield any graph, due to either p(E) or p(S | E) being too low. An ideal
model gives a high probability of sampling logically valid graphs while using a minimal codelength
to compress graphs.
11 http://ontologydesignpatterns.org/wiki/Submissions:Time_indexed_person_role
12 https://www.wikidata.org
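The four metrics of Section 4.2 can be computed as simple proportions over a batch of sampled graphs. This is our own sketch, with a placeholder validity check and toy data rather than the actual constraint verifier:

```python
def evaluate_samples(samples, train_set, is_valid):
    """Compute % valid, % novel, % novel-and-valid, and % empty over a
    list of sampled graphs (each graph a tuple of triples, () if empty)."""
    n = len(samples)
    pct = lambda k: 100.0 * k / n
    valid = sum(1 for g in samples if g and is_valid(g))
    novel = sum(1 for g in samples if g and g not in train_set)
    novel_valid = sum(1 for g in samples if g and is_valid(g) and g not in train_set)
    empty = sum(1 for g in samples if not g)
    return {"valid": pct(valid), "novel": pct(novel),
            "novel_valid": pct(novel_valid), "empty": pct(empty)}

train = {(("a", "r", "b"),)}                          # toy training set
samples = [(("a", "r", "b"),), (("b", "r", "c"),), ()]  # seen, novel, empty
report = evaluate_samples(samples, train, is_valid=lambda g: True)
print(report)  # valid ≈ 66.67, novel ≈ 33.33, novel_valid ≈ 33.33, empty ≈ 33.33
```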

4.3 Baseline Models

To the best of our knowledge, no probabilistic models in the literature can infer new subgraphs for
knowledge graphs. Therefore, we developed a set of simple baselines inspired by traditional KGE
models: ComplEx [Trouillon et al., 2016], DistMult [Yang et al., 2014] and TransE [Bordes et al.,
2013]. Traditional KGE models are trained to rank all possible triples to give the correct triple
the highest score [Ruffinelli et al., 2019]. ComplEx, DistMult and TransE all use different scoring
functions. TransE represents relations as translation between entities, whereas DistMult models
relations as bilinear interactions. ComplEx extends DistMult using complex-valued embeddings.
We model a subgraph F by decomposing it into its entities and structure F = (E, S), that is,
p(F ) = p(S | E) p(E). Unlike traditional KGE models, we train our baseline models with a
maximum likelihood objective.
We decompose the objective function as
− log2 p(F ) = − log2 p(S|E) − log2 p(E). (1)

We model p(E) = ∏_{e∈E} p(e), with p(e) estimated as the relative frequency of e in the training data
(the proportion of training subgraphs it occurs in). We train KGE models to estimate p(S | E). We
use

p(S | E) = ∏_{(s,p,o)∈S_T} p((s, p, o) | E) · ∏_{(s,p,o)∈S_N} (1 − p((s, p, o) | E)),

where S_T represents the triples in the subgraph F, and S_N represents all possible triples that are not
in the subgraph (i.e. all possible negatives).
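As a sketch, the entity term of Equation 1 can be estimated from relative frequencies as described above. The training subgraphs below are toy data:

```python
import math
from collections import Counter

def entity_probabilities(train_graphs):
    """p(e): the fraction of training subgraphs in which entity e occurs."""
    counts = Counter()
    for g in train_graphs:
        for e in {node for s, _, o in g for node in (s, o)}:
            counts[e] += 1
    return {e: c / len(train_graphs) for e, c in counts.items()}

def log2_p_entities(entities, p):
    """log2 p(E) = sum over e of log2 p(e), the entity part of Equation 1."""
    return sum(math.log2(p[e]) for e in entities)

train = [[("a", "r", "b")], [("a", "r", "c")]]  # toy training subgraphs
p = entity_probabilities(train)
print(p["a"], p["b"])                   # → 1.0 0.5
print(-log2_p_entities({"a", "b"}, p))  # → 1.0 (a is free; b costs one bit)
```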
Our random baseline model generates a random graph prediction by sampling p(E) and p(S|E)
from a uniform distribution. It then computes the exact number of bits required to represent these
probabilities, using − log2(p) to determine the entropy of each probability value. This model does
not need to be trained.
Table 2 shows that the KGE baselines learn more compact representations than the random baseline.
The ComplEx baseline is most effective at compressing the structure of these graphs p(S|E), despite
requiring real and imaginary parts. The scale of complexity, represented by code length, seems
to increase rapidly from synthetic to real-world datasets. For instance, the highest code length among
the KGE baselines on the synthetic datasets is 69.51 (TransE on syn-tipr), while the lowest code length for wd-movies is 202.68
(for ComplEx). wd-movies and wd-articles have many more entities to sample, making them
more challenging to compress.

4.4 Subgraph Inference

Table 3 shows the probabilities of sampling graphs that are logically consistent. We perform subgraph
inference under two different settings:

• Sampling P (E) and P (S | E). Here, the baseline models sample both the entities that
are relevant for a subgraph and infer their edge connectivity. Our results indicate that the
probability of sampling valid graphs is consistently at or near 0%. Selecting the incorrect entities
negatively impacts the structure prediction. Our results indicate that this task is challenging,
especially for the random baseline, as it consistently fails to infer valid subgraphs.
• Sampling only P (S | E). In this setup, the model is given an advantage by having access
to the correct set of entities (i.e., we provide E), such that it only needs to predict the edge
connections between the given entities. It is worth noting that under this setting, the baseline
model collapses into a link predictor, as it just predicts the edge connections between the
given entities. Despite this advantage, the baseline models could not generate many
logically consistent subgraphs. Interestingly, this also reveals the complexity of the datasets
and what semantics these KGE models can learn. Most KGE models are able to generate
some valid path graphs, while syn-tipr, which requires some temporal reasoning,
seems more challenging for all baseline models. Inferring the correct entity types from
syn-types was possible for a few graphs.

Table 2: Estimate of the codelengths, − log2 p(F), (the number of bits) required to compress a graph
using the four baseline models for IntelliGraphs datasets. We used the test split for this. We rounded
the numbers to two decimal places.

Dataset      Model     − log2 p(S|E)   − log2 p(E)   C(S, E)

syn-paths    random    95.94           98.04         193.98
             TransE    16.19           33.69         49.89
             DistMult  14.90           33.69         48.58
             ComplEx   20.71           33.69         54.39
syn-tipr     random    360.27          259.80        620.08
             TransE    28.70           40.81         69.51
             DistMult  26.70           40.81         67.51
             ComplEx   23.15           40.81         63.96
syn-types    random    187.11          59.99         247.10
             TransE    19.05           29.21         48.26
             DistMult  18.24           29.21         47.46
             ComplEx   18.48           29.21         47.69
wd-movies    random    483.81          48185.97      48669.78
             TransE    51.39           157.21        208.60
             DistMult  51.29           157.21        208.50
             ComplEx   45.46           157.21        202.68
wd-articles  random    12623.20        122366.51     134989.71
             TransE    280.67          629.98        910.65
             DistMult  271.94          629.98        901.91
             ComplEx   257.33          629.98        887.30

5 Related Work

Datasets for Query Embedding. Query Embedding (QE) involves interpreting complex logical
queries, commonly represented as small graphs, and is evaluated on QE datasets such as GQE
[Hamilton et al., 2018], Query2Box [Ren et al., 2020], and BetaE [Ren and Leskovec, 2020]. Ren
et al. [2023] present a comprehensive comparison of datasets. As Ren et al. [2023] highlight, query
embedding datasets lack rules and types. Although the datasets in IntelliGraphs are similar to query
embedding datasets, there is a difference in the purpose and applications. Our datasets can be used
for learning distributions to infer new logically consistent subgraphs. In contrast, QE datasets are
concerned with reasoning using logical rules to find a missing entity.
Datasets for n-ary Relations. N-ary relations are relations involving more than two entities. Various
methods have been studied in the literature that embed complex N-ary relations, often in
non-Euclidean spaces [Wang et al., 2021, Wen et al., 2016]. The difference between N-ary relations and
subgraphs is explained in Section 2.1.
Datasets for Neurosymbolic methods. If we interpret knowledge graphs as a set of logical statements,
we can see that the task of subgraph prediction is a neurosymbolic method: it combines symbolic
systems with neural networks. Datasets have been proposed to test various aspects of such systems:
interpretability, reasoning, and generalization capabilities. Several datasets were proposed to evaluate
the understanding and reasoning of complex rules and abstract concepts. Table 4 (in the appendix)
compares different datasets for Neurosymbolic AI from the literature. Existing datasets focus
primarily on the image and text modalities, neglecting background knowledge expressed in graphs.

6 Conclusion

Existing KG datasets used for representation learning lack well-understood semantics, which limits
studying how well KGE models capture semantics. In our work, we propose Subgraph Inference
as a new research problem and IntelliGraphs, a collection of five new datasets for benchmarking
models. Furthermore, we used baseline models inspired by traditional KGE models to estimate the
code lengths of these graphs and sample logically valid subgraphs. Our findings show that traditional
KGE models show a limited understanding of semantics after training. We observed a rapid increase

Table 3: Semantic validity of the graphs produced by our baseline models. We have tested subgraph
inference under two settings: 1) Sampling from both P (E) and P (S | E), and 2) Sampling from
P (S | E) only, taking E from the test data. We check the novelty of the sampled graphs by comparing
them against the training and validation set. We used the same hyperparameters from the model
compression experiments here.

Setting 1: Sampling from P (E) and P (S | E)

Dataset      Model     % Valid   % Novel & Valid   % Novel   % Empty
syn-paths    random    0         0                 100       0
             TransE    0.25      0.25              23.45     76.55
             DistMult  0.69      0.69              14.59     85.41
             ComplEx   0.71      0.71              14.27     85.73
syn-tipr     random    0         0                 100       0
             TransE    0         0                 5.58      94.42
             DistMult  0         0                 13.34     86.66
             ComplEx   0         0                 4.95      96.05
syn-types    random    0         0                 100       0
             TransE    0.21      0.21              15.44     84.56
             DistMult  0.13      0.13              12.46     87.53
             ComplEx   0.07      0.07              10.25     89.75
wd-movies    random    0         0                 100       0
             TransE    0         0                 14.61     85.39
             DistMult  0         0                 12.93     87.07
             ComplEx   0         0                 1.87      98.13
wd-articles  random    0         0                 100       0
             TransE    0         0                 4.58      95.42
             DistMult  0         0                 0         100.00
             ComplEx   0         0                 2.46      97.54

Setting 2: Sampling from P (S | E) only

Dataset      Model     % Valid   % Novel & Valid   % Novel   % Empty
syn-paths    random    0         0                 100       0
             TransE    5.25      5.25              95.52     4.48
             DistMult  9.69      9.69              95.28     4.71
             ComplEx   10.10     10.10             95.58     4.42
syn-tipr     random    0         0                 100       0
             TransE    0         0                 99.45     0.55
             DistMult  0         0                 99.43     0.57
             ComplEx   0         0                 99.64     0.36
syn-types    random    0         0                 100       0
             TransE    1.43      1.43              95.42     4.58
             DistMult  1.44      1.44              96.19     4.81
             ComplEx   1.01      1.01              94.17     5.83
wd-movies    random    0         0                 100       0
             TransE    0.07      0.07              97.01     2.99
             DistMult  0.10      0.10              95.86     4.17
             ComplEx   0.41      0.41              93.04     6.96
wd-articles  random    0         0                 100       0
             TransE    0         0                 98.35     1.65
             DistMult  0         0                 98.77     1.23
             ComplEx   0         0                 100.00    0.00

in complexity, represented by code lengths, from synthetic to real-world datasets. This complexity
makes real-world datasets more challenging to compress, which is an essential consideration for
future research in graph compression. We found that the probability of sampling valid graphs was
consistently low, emphasizing the complexity and difficulty of the task.
Limitations. Subgraph inference assumes that the semantics of a KG is known. However, in some
cases, this assumption may not hold. Furthermore, our datasets assume we test the machine learning

models in a transductive setting; entities and relations not seen during training will not be handled
well.
Ethics Statement. Our synthetic graphs are based on the logical rules we constructed and should not
be used for applications where factual accuracy matters. However, wd-movies and wd-articles
are based on real-world factual knowledge retrieved from Wikidata. Therefore, certain biases may be
inherited from Wikidata. Since these datasets are likely unsuitable for training production models
or for pretraining, we do not expect that these biases will ever affect systems making real-world
decisions. Transparency about dataset creation and maintenance is critical for adopting new machine
learning datasets [Gebru et al., 2021]. In the appendix, we provide a data card for IntelliGraphs to
provide further information about the datasets.
Applications of IntelliGraphs. It is imperative to have guarantees for safety-critical applications to
prevent machine learning models from making fatal mistakes. To develop such systems, datasets
with logical constraints are helpful. In some problem domains, there is little or no data available,
such as when training machine learning models on sensitive data for medical or industrial use cases.
In these cases, IntelliGraphs’ dataset generation framework can be used to generate synthetic datasets
using background knowledge about the problem domain.

Acknowledgements We would like to thank Frank van Harmelen and Patrick Koopmann for their
feedback on this work.

References
Erik Arakelyan, Daniel Daza, Pasquale Minervini, and Michael Cochez. Complex query answering
with neural link predictors. In International Conference on Learning Representations (ICLR),
2020.
Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, and Oksana Yakhnenko.
Translating embeddings for modeling multi-relational data. In Neural Information Processing
Systems (NIPS), pages 1–9, 2013.
Samuel R Bowman, Gabor Angeli, Christopher Potts, and Christopher D Manning. A large annotated
corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical
Methods in Natural Language Processing (EMNLP), pages 632–642, 2015.
Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and
Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge.
arXiv preprint arXiv:1803.05457, 2018.
Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach,
Hal Daumé III, and Kate Crawford. Datasheets for datasets. Communications of the ACM, 64(12):
86–92, 2021.
Eleonora Giunchiglia, Mihaela Cătălina Stoian, Salman Khan, Fabio Cuzzolin, and Thomas
Lukasiewicz. Road-r: The autonomous driving dataset with logical requirements. arXiv preprint
arXiv:2210.01597, 2022.
Peter D Grünwald. The minimum description length principle. MIT press, 2007.
Saiping Guan, Xiaolong Jin, Jiafeng Guo, Yuanzhuo Wang, and Xueqi Cheng. NeuInfer: Knowledge
inference on N-ary facts. In Proceedings of the 58th Annual Meeting of the Association for
Computational Linguistics, pages 6141–6151, Online, July 2020. Association for Computational
Linguistics. doi: 10.18653/v1/2020.acl-main.546. URL https://aclanthology.org/2020.acl-main.546.
Will Hamilton, Payal Bajaj, Marinka Zitnik, Dan Jurafsky, and Jure Leskovec. Embedding logical
queries on knowledge graphs. Advances in neural information processing systems, 31, 2018.
Aidan Hogan, Eva Blomqvist, Michael Cochez, Claudia d’Amato, Gerard De Melo, Claudio Gutierrez,
Sabrina Kirrane, José Emilio Labra Gayo, Roberto Navigli, Sebastian Neumaier, et al. Knowledge
graphs. ACM Computing Surveys (CSUR), 54(4):1–37, 2021.

Drew A Hudson and Christopher D Manning. Gqa: A new dataset for real-world visual reasoning and
compositional question answering. In Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition (CVPR), pages 6700–6709, 2019.
Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Li Fei-Fei, C. Lawrence Zitnick, and
Ross Girshick. Clevr: A diagnostic dataset for compositional language and elementary visual
reasoning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
(CVPR), 2017.
Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint
arXiv:1412.6980, 2014.
Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen,
Yannis Kalantidis, Li-Jia Li, David A Shamma, et al. Visual genome: Connecting language and
vision using crowdsourced dense image annotations. In Proceedings of the 30th AAAI Conference
on Articial Intelligence, pages 4088–4095, 2017.
Brenden Lake and Marco Baroni. Generalization without systematicity: On the compositional skills
of sequence-to-sequence recurrent networks. In International conference on machine learning,
pages 2873–2882. PMLR, 2018.
Natasha Noy, Yuqing Gao, Anshu Jain, Anant Narayanan, Alan Patterson, and Jamie Taylor. Industry-
scale knowledge graphs: Lessons and challenges: Five diverse technology companies show how
it’s done. Queue, 17(2):48–75, 2019.
Hongyu Ren and Jure Leskovec. Beta embeddings for multi-hop logical reasoning in knowledge
graphs. Advances in Neural Information Processing Systems, 33:19716–19726, 2020.
Hongyu Ren, Weihua Hu, and Jure Leskovec. Query2box: Reasoning over knowledge graphs in
vector space using box embeddings. arXiv preprint arXiv:2002.05969, 2020.
Hongyu Ren, Mikhail Galkin, Michael Cochez, Zhaocheng Zhu, and Jure Leskovec. Neural graph rea-
soning: Complex logical query answering meets graph databases. arXiv preprint arXiv:2303.14617,
2023.
Jorma Rissanen. Modeling by shortest data description. Automatica, 14(5):465–471, 1978.
Daniel Rufnelli, Samuel Broscheit, and Rainer Gemulla. You can teach an old dog new tricks! on
training knowledge graph embeddings. In International Conference on Learning Representations,
2019.
Adam Santoro, David Raposo, David GT Barrett, Mateusz Malinowski, Razvan Pascanu, Peter
Battaglia, and Timothy Lillicrap. A simple neural network module for relational reasoning.
Advances in Neural Information Processing Systems, 30, 2017.
David Saxton, Edward Grefenstette, Felix Hill, and Pushmeet Kohli. Analysing mathematical
reasoning abilities of neural models. arXiv preprint arXiv:1904.01557, 2019.
Alane Suhr and Yoav Artzi. A corpus of natural language for visual reasoning. In Proceedings of the
55th Annual Meeting of the Association for Computational Linguistics (ACL), volume 1, pages
217–231, 2017.
Théo Trouillon, Johannes Welbl, Sebastian Riedel, Éric Gaussier, and Guillaume Bouchard. Complex
embeddings for simple link prediction. In International conference on machine learning, pages
2071–2080. PMLR, 2016.
Denny Vrandečić and Markus Krötzsch. Wikidata: a free collaborative knowledgebase. Communica-
tions of the ACM, 57(10):78–85, 2014.
Quan Wang, Haifeng Wang, Yajuan Lyu, and Yong Zhu. Link prediction on n-ary relational facts: A
graph-based approach. In Findings of the Association for Computational Linguistics: ACL-IJCNLP
2021, pages 396–407, Online, August 2021. Association for Computational Linguistics. doi:
10.18653/v1/2021.findings-acl.35. URL https://aclanthology.org/2021.findings-acl.35.

Jianfeng Wen, Jianxin Li, Yongyi Mao, Shini Chen, and Richong Zhang. On the representation and
embedding of knowledge bases beyond binary relations. arXiv preprint arXiv:1604.08642, 2016.
Jason Weston, Antoine Bordes, Sumit Chopra, Alexander M. Rush, Bart van Merriënboer, Armand
Joulin, and Tomas Mikolov. Towards ai-complete question answering: A set of prerequisite toy
tasks. arXiv preprint arXiv:1502.05698, 2015.
Bishan Yang, Wen-tau Yih, Xiaodong He, Jianfeng Gao, and Li Deng. Embedding entities and
relations for learning and inference in knowledge bases. arXiv preprint arXiv:1412.6575, 2014.
Jiale Yang, Jinnan Li, and Yuke Zhu. A dataset and architecture for visual reasoning with a working
memory. arXiv preprint arXiv:1803.06092, 2018.
Xianda Zhang, Yifan Zhang, and Mirella Lapata. Metaqa: A dataset of metaphorically annotated
movieqa questions. In Proceedings of the 27th International Conference on Computational
Linguistics, pages 1954–1964, 2018.

7 Supplementary Material
Contents

7.1 Datasets for Neurosymbolic Methods
7.2 Reproducibility Statement
7.3 Experimental Details
7.3.1 Hyperparameters
7.4 Semantics of IntelliGraphs
7.4.1 Logical Rules of syn-paths
7.4.2 Logical Rules of syn-types
7.4.3 Logical Rules of syn-tipr
7.4.4 Logical Rules of wd-movies
7.4.5 Logical Rules of wd-articles
7.5 Synthetic Dataset Generation
7.5.1 syn-paths
7.5.2 syn-types
7.5.3 syn-tipr
7.6 Wikidata Dataset Generation
7.6.1 wd-movies
7.6.2 wd-articles
7.6.3 Example subgraphs

7.1 Datasets for Neurosymbolic Methods

Neurosymbolic methods aim to combine neural networks with symbolic representations. As
mentioned in Section 5, several datasets already exist in the literature for evaluating the performance
of neurosymbolic methods. Table 4 highlights widely used datasets for benchmarking neurosymbolic
systems.

7.2 Reproducibility Statement

To make our work fully reproducible, we make the codebase of our experiments public and open. Our
code is available at https://github.com/thiviyanT/IntelliGraphs. For each experiment, we also provide
the hyperparameter configurations we used. Furthermore, we have released a new Python package for
interacting with the IntelliGraphs datasets through the following software package repositories:
conda (https://anaconda.org/thiv/intelligraphs) and PyPI (https://pypi.org/project/intelligraphs). To
ensure long-term preservation and easy access, we made the datasets available on Zenodo
(https://doi.org/10.5281/zenodo.7824818). Experimental details can be found in the next section.

7.3 Experimental Details

We used the PyTorch library 13 to develop and test the models. All experiments were performed
on a single-node machine with an Intel(R) Xeon(R) Gold 5118 (2.30GHz, 12 cores) CPU and
64GB of RAM, with four NVIDIA RTX A4000 GPUs (16GB of VRAM). We used PyTorch’s
GPU acceleration for training the models. We used the Adam optimiser with variable learning rates
[Kingma and Ba, 2014].
13 https://pytorch.org/

Table 4: Brief comparison of commonly used datasets for benchmarking neurosymbolic methods,
listed in ascending order of publication year. For each dataset, we provide an overview of the task,
domain, modality, key characteristics, and whether the dataset is synthetic.

Dataset | Task | Domain | Modality | Key Characteristics | Synthetic
bAbI [Weston et al., 2015] | Language Reasoning | Natural Language | Text | Basic reasoning, generalization | Yes
SNLI [Bowman et al., 2015] | Logical Reasoning | Natural Language | Text | Entailment, contradiction, neutral relationships | No
CLEVR [Johnson et al., 2017] | Visual Reasoning | Computer Vision | Images & Text | Object counting, comparison, querying attributes | Yes
NLVR [Suhr and Artzi, 2017] | Visual Reasoning | Computer Vision | Images & Text | Visual reasoning, natural language understanding | Yes
Sort-of-CLEVR [Santoro et al., 2017] | Relational Reasoning | Computer Vision | Images & Text | Spatial and relational reasoning | Yes
Visual Genome [Krishna et al., 2017] | Visual Reasoning | Computer Vision | Images & Text | Object recognition, relationships, attributes | No
Aristo [Clark et al., 2018] | Science Reasoning | Natural Language | Text | Natural language understanding, applying knowledge | No
COG [Yang et al., 2018] | Cognitive Capabilities | Computer Vision | Images & Text | Temporal and logical reasoning | Yes
MetaQA [Zhang et al., 2018] | Multi-hop Reasoning | Graph | Knowledge Graph | Multi-step reasoning, knowledge base | No
SCAN [Lake and Baroni, 2018] | Compositional Generalization | Command-based Language | Text | Understanding and generating novel commands | No
Math Dataset [Saxton et al., 2019] | Math Reasoning | Natural Language | Text | Language understanding, symbolic reasoning | No
GQA [Hudson and Manning, 2019] | Visual Reasoning | Computer Vision | Images & Text | Spatial and relational reasoning | No
ROAD-R [Giunchiglia et al., 2022] | Visual Reasoning | Computer Vision | Videos & (handcrafted) Logical Rules | Logical reasoning | No

7.3.1 Hyperparameters
For each dataset, we performed hyperparameter sweeps with each baseline model (TransE, DistMult,
ComplEx) using Weights & Biases 14. For this, we used a random search strategy with the goal of
finding the hyperparameter configurations that yield the minimum compression bits on the validation
set. We do not include the reciprocal relation model, and we used the highest batch size that we could
fit in memory. Table 5 shows the hyperparameter values we obtained via the sweeps. The random
baseline did not require hyperparameter fine-tuning. We also used Weights & Biases for monitoring
our experiments.
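As an illustration of this strategy, the sketch below implements a minimal random search in plain Python. The search space, trial count, and the `objective` callable are hypothetical stand-ins (our actual sweeps ran through Weights & Biases), with ranges loosely mirroring Table 5.

```python
import math
import random

# Hypothetical search space, loosely mirroring the ranges seen in Table 5.
SEARCH_SPACE = {
    "batch_size": [32, 2048, 4096],
    "emb_size": (16, 1600),         # integer, sampled uniformly
    "learning_rate": (1e-5, 1e-1),  # sampled log-uniformly
    "biases": [True, False],
    "init": ["uniform", "normal"],
}

def sample_config(rng):
    """Draw one random hyperparameter configuration."""
    lo, hi = SEARCH_SPACE["learning_rate"]
    return {
        "batch_size": rng.choice(SEARCH_SPACE["batch_size"]),
        "emb_size": rng.randint(*SEARCH_SPACE["emb_size"]),
        "learning_rate": math.exp(rng.uniform(math.log(lo), math.log(hi))),
        "biases": rng.choice(SEARCH_SPACE["biases"]),
        "init": rng.choice(SEARCH_SPACE["init"]),
    }

def random_search(objective, n_trials=20, seed=42):
    """Return the sampled config that minimises `objective`
    (in our setting: compression bits on the validation set)."""
    rng = random.Random(seed)
    best_cfg, best_val = None, float("inf")
    for _ in range(n_trials):
        cfg = sample_config(rng)
        val = objective(cfg)
        if val < best_val:
            best_cfg, best_val = cfg, val
    return best_cfg, best_val
```

Sampling the learning rate log-uniformly reflects that the learning rates in Table 5 span several orders of magnitude.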

7.4 Semantics of IntelliGraphs

Logical rules provide a formal framework for expressing and reasoning about the semantics of a
system. In this section, we discuss the logical rules we use to verify the semantics of the IntelliGraphs
datasets. We express each logical rule in First-Order Logic (FOL) unless otherwise stated. We opted
for FOL as the formal language for communicating logical constraints due to

14 https://wandb.ai/

Table 5: The results of a random hyperparameter search, presenting the chosen hyperparameters
for different datasets and baseline models. The hyperparameters include batch size, embedding
size, learning rate, biases usage, and initialization method. The batch size indicates the number of
training subgraphs processed together before updating the model. The embedding size represents the
dimensionality of the entity and relation embeddings. The learning rate controls the step size taken
during model optimization. The biases denote whether bias terms are included in the model, and the
initialization method refers to the technique used to initialize the model’s parameters.
Dataset Model Batch Size Emb. Learning Rate Biases Init.
syn-paths transe 4096 1531 7.029817939842623e-05 False uniform
syn-paths distmult 4096 158 0.0697979730927795 False uniform
syn-paths complex 4096 587 5.264944612887405e-05 False uniform
syn-tipr transe 2048 147 0.0008716274682049251 True normal
syn-tipr distmult 2048 168 0.005497983171450242 True normal
syn-tipr complex 2048 350 0.0015597556675205502 True normal
syn-types transe 2048 376 0.003017403610019781 True uniform
syn-types distmult 2048 273 0.0006013105272716594 True uniform
syn-types complex 2048 996 5.603405855158606e-05 False uniform
wd-movies transe 4096 68 0.000638003263107625 False normal
wd-movies distmult 4096 181 0.00307853821840767 True uniform
wd-movies complex 4096 102 0.019520125878695407 False uniform
wd-articles transe 32 888 6.094053758340765e-05 True normal
wd-articles distmult 32 65 0.03833121378755901 False uniform
wd-articles complex 32 283 0.002251396972378282 False normal

its ability to effectively express the necessary constraints and its widespread understanding within the
machine learning community 15 .
Although we provide general FOL rules for checking the semantics of graphs of arbitrary length,
we apply a size constraint (i.e. checking for graphs with a fixed number of triples) for the synthetic
datasets. This is because the synthetic data generator produces graphs of fixed length, and we
defined this as part of our semantics. The size constraint can also be expressed in FOL, but we specify
this constraint in natural language for brevity.
Traditionally, a reasoning engine is used to check logical consistency in knowledge bases. Instead,
we wrote a semantic checker in Python, which was more convenient to use within our framework:
graphs can be evaluated without manually loading each one into a reasoning engine. The semantic
checker closely follows the logical rules, and it is accessible through the IntelliGraphs Python
package.
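To illustrate how such a checker can closely follow the logical rules, the sketch below encodes one syn-types rule as a Python predicate over a set of triples. This is a simplified stand-in rather than the package's actual code; the `types` mapping from entities to their types is a hypothetical helper.

```python
# A subgraph is a list of (subject, relation, object) triples; `types` maps
# each entity to its type (a hypothetical helper, for illustration only).

def check_spoken_in(triples, types):
    """Encodes: forall x, y : spoken_in(x, y) => language(x) and country(y)."""
    return all(
        types.get(s) == "language" and types.get(o) == "country"
        for s, r, o in triples
        if r == "spoken_in"
    )

def check_graph(triples, types, rules):
    """A subgraph is semantically valid iff every rule function holds."""
    return all(rule(triples, types) for rule in rules)
```

Each FOL rule becomes one predicate, and the checker is the conjunction of all predicates, mirroring how the rules are listed in the following subsections.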

7.4.1 Logical Rules of syn-paths

∀x, y, z : connected(x, y) ∧ connected(y, z) ⇒ connected(x, z)


∀x, y : edge(x, y) ⇒ connected(x, y)
∃x : root(x)
∀a, b : root(a) ∧ root(b) ⇒ a = b
∀x : root(x) ⇔ ∀y : ¬edge(y, x)
∀x, y : connected(x, y) ⇒ x ̸= y
∀x : root(x) ⇒ ∀y : (connected(x, y) ∨ x = y)
∀x, y, z : edge(y, x) ∧ edge(z, x) ⇒ y = z
∀x, y, z : edge(x, y) ∧ edge(x, z) ⇒ y = z
∀x, y : edge(x, y) ⇔ cycle_to(x, y) ∨ drive_to(x, y) ∨ train_to(x, y)
Number of edges : 3

15 These FOL logical constraints can also be rewritten into data specification languages, such as DataLog.
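As an illustration of how these rules translate into code, the following sketch (not the package's actual implementation) checks that a subgraph is a single directed path of three edges over the allowed relations, with at most one outgoing and one incoming edge per node, exactly one root, and no cycles:

```python
RELATIONS = {"cycle_to", "drive_to", "train_to"}

def is_valid_path(triples):
    """Check the syn-paths constraints on a list of (head, relation, tail)
    triples: three edges over the allowed relations forming one directed,
    acyclic path with a unique root."""
    if len(triples) != 3:
        return False
    if any(r not in RELATIONS for _, r, _ in triples):
        return False
    heads = [h for h, _, _ in triples]
    tails = [t for _, _, t in triples]
    # Functional edges: no node has two outgoing or two incoming edges.
    if len(set(heads)) != 3 or len(set(tails)) != 3:
        return False
    # Exactly one root (a head that never appears as a tail) and four distinct
    # nodes: together with the degree constraints, this rules out cycles and
    # disconnected edges, leaving a single path.
    roots = [h for h in heads if h not in tails]
    return len(roots) == 1 and len(set(heads) | set(tails)) == 4
```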

7.4.2 Logical Rules of syn-types

∀x, y : spoken_in(x, y) ⇒ language(x) ∧ country(y)


∀x, y : could_be_part_of (x, y) ⇒ city(x) ∧ country(y)
∀x, y : same_type_as(x, y) ⇒ (language(x) ∧ language(y))
∨ (city(x) ∧ city(y)) ∨ (country(x) ∧ country(y))
∀x : language(x) ⇒ ¬country(x) ∧ ¬city(x)
∀x : country(x) ⇒ ¬language(x) ∧ ¬city(x)
∀x : city(x) ⇒ ¬language(x) ∧ ¬country(x)
Number of edges : 3

7.4.3 Logical Rules of syn-tipr

∀x, y : has_role(x, y) ⇒ academic(x) ∧ role(y)


∀x, y : has_name(x, y) ⇒ academic(x) ∧ name(y)
∀x, y : has_time(x, y) ⇒ academic(x) ∧ time(y)
∀x, y : start_year(x, y) ⇒ time(x) ∧ year(y)
∀x, y : end_year(x, y) ⇒ time(x) ∧ year(y)
∀x, y, z : start_year(x, y) ∧ end_year(x, z) ⇒ before(y, z)
∀x : ¬has_role(x, x)
∀x : ¬has_name(x, x)
∀x : ¬has_time(x, x)
∀x : ¬start_year(x, x)
∀x : ¬end_year(x, x)
∀x : academic(x) ⇒ ¬role(x) ∧ ¬time(x) ∧ ¬name(x) ∧ ¬year(x)
∀x : role(x) ⇒ ¬academic(x) ∧ ¬time(x) ∧ ¬name(x) ∧ ¬year(x)
∀x : time(x) ⇒ ¬academic(x) ∧ ¬role(x) ∧ ¬name(x) ∧ ¬year(x)
∀x : year(x) ⇒ ¬academic(x) ∧ ¬role(x) ∧ ¬name(x) ∧ ¬time(x)
∀x : name(x) ⇒ ¬academic(x) ∧ ¬role(x) ∧ ¬year(x) ∧ ¬time(x)
Number of edges : 5

7.4.4 Logical Rules of wd-movies

∀x, y : connected(x, y) ⇔ has_director(x, y) ∨ has_actor(x, y) ∨ has_genre(x, y)


∃x : has_director(x, existential_node)
∃x : has_actor(x, existential_node)
∃x : has_genre(x, existential_node)
∀x : x ̸= existential_node ⇒ connected(existential_node, x)
∀x, y : x ̸= existential_node ∧ y ̸= existential_node ⇒ ¬connected(x, y)
∀x : ¬connected(x, existential_node)
∀x, y : has_director(x, y) ∨ has_actor(x, y) ⇒ person(y)
∀x : ¬person(x) ∨ ¬genre(x)
∀x, y : has_genre(x, y) ⇒ genre(y)

7.4.5 Logical Rules of wd-articles

∃x : has_author(article_node, x)
∀x, y : connected(x, y) ⇔ has_author(x, y) ∨ has_name(x, y) ∨ has_order(x, y)∨
cites(x, y) ∨ has_subject(x, y) ∨ subclass_of (x, y)
∀x, y : connected(x, y) ⇒ ¬connected(y, x) ∨ cites(y, x)
∀x : ¬connected(x, x)
∀x, y : has_author(x, y) ⇒ x = article_node
article(article_node) ∨ iri(article_node)
∀x : has_author(article_node, x) ⇒ authorpos(x)
∀x : authorpos(x) ⇔ ∃y : has_order(x, y) ∧ ∃y : has_name(x, y)
∀x, y : has_order(x, y) ⇒ authorpos(x) ∧ ordinal(y)
∀x, y : has_name(x, y) ⇒ authorpos(x) ∧ name(y) ∨ iri(y)
∀x, y, z : has_order(x, y) ∧ has_order(x, z) ⇒ y = z
∀x, y, z : has_name(x, y) ∧ has_name(x, z) ⇒ y = z
∀x : author(x) ⇒ ¬subject(x) ∧ ¬iri(x) ∧ ¬name(x) ∧ ¬ordinal(x) ∧ ¬authorpos(x)
∀x : subject(x) ⇒ ¬author(x) ∧ ¬iri(x) ∧ ¬name(x) ∧ ¬ordinal(x) ∧ ¬authorpos(x)
∀x : iri(x) ⇒ ¬author(x) ∧ ¬subject(x) ∧ ¬name(x) ∧ ¬ordinal(x) ∧ ¬authorpos(x)
∀x : name(x) ⇒ ¬subject(x) ∧ ¬iri(x) ∧ ¬author(x) ∧ ¬ordinal(x) ∧ ¬authorpos(x)
∀x : ordinal(x) ⇒ ¬subject(x) ∧ ¬iri(x) ∧ ¬name(x) ∧ ¬author(x) ∧ ¬authorpos(x)
∀x : authorpos(x) ⇒ ¬subject(x) ∧ ¬iri(x) ∧ ¬name(x) ∧ ¬author(x) ∧ ¬ordinal(x)
∀x, y, z : subclass_trans(x, y) ∧ subclass_trans(y, z) ⇒ subclass_trans(x, z)
∀x, y : subclass_of (x, y) ⇒ subclass_trans(x, y)∧
(iri(x) ∨ subject(x)) ∧ (iri(y) ∨ subject(y))
∀x, y : subclass_of (x, y) ⇒ ∃z : subclass_trans(x, z) ∧ has_subject(article_node, z)
∀x, y : cites(x, y) ⇒ iri(y) ∧ x = article_node
∀x, y : has_subject(x, y) ⇒ (subject(y) ∨ iri(y)) ∧ x = article_node
In addition to the aforementioned rules for wd-articles, our semantic checker checks the ordinals
of the author positions to make sure that they form a complete list of consecutive numbers (i.e.
ordinal_000, ordinal_001, ordinal_002, etc.), but we leave this out of the rules for brevity.
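A minimal sketch of this extra check might look as follows; `ordinals_complete` is a hypothetical helper, not the package's code, and it accepts any run of consecutive ordinals regardless of the starting index:

```python
def ordinals_complete(triples):
    """Check that the has_order objects form a complete list of consecutive
    ordinals (e.g. ordinal_001, ordinal_002, ...), one per author position."""
    orders = sorted(int(o.rsplit("_", 1)[1]) for _, r, o in triples if r == "has_order")
    if not orders:
        return False
    return orders == list(range(orders[0], orders[0] + len(orders)))
```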

7.5 Synthetic Dataset Generation

The synthetic dataset generator contains two main modules: 1) a triple sampler, which samples new
triples one by one, and 2) a triple verifier, which checks each triple for semantic validity before it is
added to a subgraph. The generator builds a subgraph by sampling one triple at a time and verifying
it. If the triple passes the semantic check, it is added to the subgraph. To avoid duplicate triples
within the same subgraph, we check whether a triple already exists before adding it. This is repeated
until a certain number of valid triples have been sampled. For reproducibility, we use the same seed
for all random data generation (seed=42). For each dataset, we generate training, validation and test
sets. To avoid data leakage, we check that these graphs are unique before splitting the dataset. In this
section, we briefly describe how IntelliGraphs efficiently samples valid subgraphs.

7.5.1 syn-paths
The entities are labelled after 49 Dutch cities and the relations are different modes of transport
(train_to, drive_to, cycle_to). This dataset primarily checks whether baseline models can
do structure learning.
We denote by Πk (G) the set of all paths of G on k vertices (k ≥ 1). A path graph Pk (G) of a graph
G has vertex set Πk (G), with edges joining pairs of vertices that represent two paths Pk whose union
forms a path Pk+1 . We randomly sample n edges from Πk (G) to generate each path graph.

To generate a path graph, we begin by selecting a head (i.e. source node) by randomly choosing a
Dutch city, and then sample a relation and a tail (i.e. target node). For the next triple in the subgraph,
we use the previous target node as the source node and again sample a relation and a target node. We
repeat the last step k − 2 times to build a path graph with k edges. We ensure that
each subgraph includes all three different relations. We avoid generating cyclic path graphs.
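A sketch of this procedure is given below, with a truncated city list (the real dataset uses 49 Dutch cities) and without the extra bookkeeping that ensures all three relations appear. Avoiding revisited cities makes the sampled path acyclic by construction.

```python
import random

# Truncated list; the real dataset uses 49 Dutch cities.
CITIES = ["Nieuwegein", "Lelystad", "IJmuiden", "Zaanstad", "Hilversum", "Emmen"]
RELATIONS = ["train_to", "drive_to", "cycle_to"]

def sample_path(k, rng):
    """Sample a path graph with k edges: each new triple reuses the previous
    tail as its head, and cities are never revisited, so the path is acyclic."""
    head = rng.choice(CITIES)
    visited = {head}
    triples = []
    for _ in range(k):
        tail = rng.choice([c for c in CITIES if c not in visited])
        triples.append((head, rng.choice(RELATIONS), tail))
        visited.add(tail)
        head = tail
    return triples
```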

7.5.2 syn-types
This dataset contains three types of entities (cities, countries & languages), 30 entities in
total (10 instances of each entity type), and three relations (same_type_as, could_be_part_of &
could_be_spoken_in). This dataset primarily checks whether baseline models can learn the types
of entities correctly.
For each relation, we sample a head and a tail entity of the corresponding type. For instance, for the
relation could_be_spoken_in we sample a language for the head of a triple and a country for the
tail. Similarly, we sample other triples to be added to the same subgraph, until a certain number of
valid triples have been sampled.
It is important to note that the syn-types dataset is not meant to be factually accurate but rather
serves as a way to study the type semantics learned by machine learning models.

7.5.3 syn-tipr
This dataset contains three entity types (names, roles, years) and the relations has_role, has_name,
has_time, start_year and end_year. We used a random name generator to generate 50 names. For
simplicity, we treat years as entities rather than literals. In each subgraph, there are two existential
nodes: _academic and _time. The main purpose of this dataset is to check structure learning and
basic temporal reasoning (in this case, whether end_year appears after start_year).
The subgraphs in this dataset were modelled after the time-indexed person role (tipr) pattern from
the Semantic Web. To generate these subgraphs, we take the tipr pattern as a template and randomly
sample entities of the correct entity type. For instance, the relation has_role always has the
academic node in the head position of a triple and a role as the tail. Similarly, we sample triples for
the other relations (has_name, has_time, start_year, end_year). A valid triple containing each
relation is sampled. In total, every subgraph contains five triples.
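A sketch of this template instantiation is shown below; the name and role lists are truncated, and the year range is an illustrative assumption:

```python
import random

# Truncated lists; the real dataset uses 50 generated names.
NAMES = ["Cleophas Erős", "Romana Sitk", "Drusus Krejči"]
ROLES = ["masters researcher", "assistant professor", "professor"]

def sample_tipr(rng):
    """Instantiate the tipr template; the year range is an illustrative
    assumption, with start_year strictly before end_year."""
    start = rng.randint(1950, 2015)
    end = rng.randint(start + 1, 2020)
    return [
        ("_academic", "has_name", rng.choice(NAMES)),
        ("_academic", "has_role", rng.choice(ROLES)),
        ("_academic", "has_time", "_time"),
        ("_time", "start_year", str(start)),
        ("_time", "end_year", str(end)),
    ]
```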

7.6 Wikidata Dataset Generation

For reproducibility, we use a specic Wikidata dump to extract the data, rather than the live version.
For both datasets, we use the Wikidata HDT dump from 3 March 2021, available from the HDT
website 16 .
In both cases, we rst extract all data that ts the template of the graph, for instance, for every movie
we extract all actors, directors and genres. We then prune this data to ensure that every entity occurs in
enough instances to allow a model to learn a representation for it. Depending on the dataset we either
remove the infrequent nodes or replace them by existential nodes. We set the minimal frequency to 6
in both datasets.
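The pruning loop can be sketched as follows for wd-movies-style data. This is a simplified stand-in: each instance is a dict of entity lists, and infrequent entities are removed rather than replaced by existential nodes.

```python
from collections import Counter

def prune(movies, min_freq=6):
    """Iteratively drop entities that occur in fewer than `min_freq` instances,
    then drop movies left without actors or directors (empty genres are fine),
    until the dataset stabilises."""
    movies = [dict(m) for m in movies]
    while True:
        counts = Counter(
            e for m in movies
            for e in m["actors"] + m["directors"] + m["genres"]
        )
        pruned = []
        for m in movies:
            kept = {k: [e for e in m[k] if counts[e] >= min_freq]
                    for k in ("actors", "directors", "genres")}
            if kept["actors"] and kept["directors"]:
                pruned.append(kept)
        if pruned == movies:
            return movies
        movies = pruned
```

Dropping a movie can push other entities below the frequency threshold, which is why the two steps are iterated until a fixed point is reached.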
To avoid the situation where certain entity nodes are only present in the validation or test data, we
must make our splits carefully. Ideally, we would like each entity to be present in all three splits of
the data and, where this is not possible, for it to be present in at least the training data.
To achieve this, we use the following algorithm: for each instance, we collect “votes” among all its
entities for which of the three splits it should be part of. Simultaneously, for each entity, we collect
the splits of which it is a member. The aim is to have all entities in each instance vote for the same
split, and for each entity to be represented in all splits. We alternate between fixing one of these two
problems: we unify the votes by choosing a random entity and setting the votes of the other entities
in the instance to that vote. After all votes have been fixed, we fix the split memberships by, for each
entity that is not represented in all splits, taking the most frequent split and changing the vote of one
of its instances to the missing split, repeating until all splits are represented.
15 https://www.behindthename.com/random/
16 https://www.rdfhdt.org/datasets/

We alternate these steps for 50 iterations. Then, in the first step, we move any instance with conflicting
votes to the training data and repeat the iteration in this fashion for another 20 steps. For both datasets,
this leads to all entities being represented in the training data, with only a small number of entities
present in the training data alone.
For both datasets, the labels are Wikidata IRIs, but a mapping to human-readable labels is provided.
In this paper, we replace IRIs with these labels for readability.

7.6.1 wd-movies

We collect all entities that are labeled as “instance of” the class “film”. For each, we extract all entities
connected by the relations “cast member”, “director” and “genre” as its actors, directors and genres
respectively.
We then prune the data by removing all actors, directors and genres that do not appear in at least 6
instances. We then remove any movies that are left with no actors or no directors. We allow movies
with no genres. We iterate these two steps until no changes are made. Finally, we make a test, train
and validation split by the process described above. The following Wikidata properties and entities
were used:
label Wikidata IRI
instance of http://www.wikidata.org/prop/P31
film http://www.wikidata.org/entity/Q11424
cast member http://www.wikidata.org/prop/P161
director http://www.wikidata.org/prop/P57
genre http://www.wikidata.org/prop/P136

7.6.2 wd-articles

We collect all entities from wikidata that are the object of a triple with the relation “cites”.
For each article we collect the full list of authors, using the relations “author” and “author name
string”. The former is used to refer to authors that are represented in Wikidata as an entity, and the
latter is used for authors represented only by their name as a string literal. We require at least one of
the authors to be represented by an entity. If not, the article is filtered out.
Such statements are commonly annotated in Wikidata with an ordinal, representing the order of the
author in the author list. We extract these as well. If any author does not have an ordinal or if the
collection of these ordinals does not coincide exactly with the sequence 1, . . . , n, with n the number
of authors, the article is ltered out.
We then collect all articles that, as recorded in Wikidata, the current article cites. If there are no such
references, the article is ltered out.
Finally, we collect the article's subjects and, for each subject, every superclass (and its superclasses)
that is an instance of “academic discipline”. We do not filter based on the subjects (having no subjects
or superclasses is allowed).
We collect the rst 100 000 such articles for the dataset wd-articles, and all such articles for the
dataset wd-articles-large.
As with wd-movies, we prune the data to eliminate any entities that occur in fewer than 6 instances.
For the authors, the article itself and the subjects, we replace these with existential nodes. These have
node labels specific to the role they play in the graph: _article, _author001, and _subject001.
Any references to infrequent entities are removed. As before, this removal process is iterated until
the dataset stabilizes.
Splits are then made using the algorithm described above. In the construction of the dataset, we
add authors by introducing a blank node (using label _authorpos and the relation has_author),
to which the author identity (has_name) and the ordinal (has_order) are connected. References
are added by a single edge with the relation cites, and subjects and superclasses with the relations
has_subject and subclass_of.

7.6.3 Example subgraphs

syn-paths
[Nieuwegein drive_to Lelystad, Lelystad drive_to IJmuiden, IJmuiden cycle_to Zaanstad]
[IJmuiden cycle_to Maastricht, Maastricht train_to Roermond, Roermond drive_to Groningen]
[Hilversum cycle_to Emmen, Emmen drive_to Spijkenisse, Spijkenisse train_to Sittard]

syn-tipr
[_academic has_name Cleophas Erős, _academic has_role masters researcher, _academic has_time _time, _time start_year
2016, _time end_year 2018]
[_academic has_name Romana Sitk, _academic has_role professor, _academic has_time _time, _time start_year 1982, _time
end_year 2009]
[_academic has_name Drusus Krejči, _academic has_role assistant professor, _academic has_time _time, _time start_year
1996, _time end_year 2000]
[_academic has_name Božidar Bullard, _academic has_role professor, _academic has_time _time, _time start_year 1973, _time
end_year 1988]

syn-types
[Dutch same_type_as English, Budapest could_be_part_of United Kingdom, Czech spoken_in Serbia]
[Serbia same_type_as Spain, Paris could_be_part_of Norway, Dutch spoken_in Greece]
[Greek same_type_as Italian, Budapest could_be_part_of Ireland, French spoken_in Serbia]

wd-movies
[_movie has_director P. Pullaiah, _movie has_actor Gummadi Venkateswara Rao, _movie has_actor Akkineni Nageswara Rao,
_movie has_actor Anjali Devi, _movie has_actor Chittoor Nagaiah, _movie has_actor Ramana Reddy, _movie has_actor Relangi
Venkata Ramaiah, _movie has_actor S. V. Ranga Rao, _movie has_actor Santha Kumari, _movie has_genre historical film,
_movie has_genre biographical film]
[_movie has_director Albert Brooks, _movie has_actor Kathryn Harrold, _movie has_actor Albert Brooks, _movie has_actor
Bruno Kirby, _movie has_genre comedy film]
[_movie has_director Dragoslav Lazić, _movie has_actor Vesna Malohodžić, _movie has_actor Snežana Savić, _movie
has_genre comedy film]
[_movie has_director Balu Mahendra, _movie has_actor Silk Smitha, _movie has_actor Sridevi, _movie has_actor Kamal
Haasan, _movie has_genre romance film]

wd-articles
[_article has_author _authorpos000, _authorpos000 has_name _author000, _authorpos000 has_order ordinal_001,
_article has_author _authorpos001, _authorpos001 has_name _author001, _authorpos001 has_order ordinal_002, _article
has_author _authorpos002, _authorpos002 has_name _author002, _authorpos002 has_order ordinal_003, _article
has_author _authorpos003, _authorpos003 has_name _author003, _authorpos003 has_order ordinal_004, _article
has_author _authorpos004, _authorpos004 has_name _author004, _authorpos004 has_order ordinal_005, _article has_author
_authorpos005, _authorpos005 has_name _author005, _authorpos005 has_order ordinal_006, _article has_author _authorpos006,
_authorpos006 has_name _author006, _authorpos006 has_order ordinal_007, _article has_author _authorpos007, _authorpos007
has_name _author007, _authorpos007 has_order ordinal_008, _article cites http://www.wikidata.org/entity/Q25938995,
_article cites http://www.wikidata.org/entity/Q28242060, _article cites http://www.wikidata.org/entity/Q28286732,
_article cites http://www.wikidata.org/entity/Q34453213, _article cites http://www.wikidata.org/entity/Q34541710,
_article cites http://www.wikidata.org/entity/Q35758845, _article cites http://www.wikidata.org/entity/Q37942996,
_article cites http://www.wikidata.org/entity/Q37972005, _article cites http://www.wikidata.org/entity/Q42642132,
_article has_subject http://www.wikidata.org/entity/Q214781, http://www.wikidata.org/entity/Q214781 subclass_of
http://www.wikidata.org/entity/Q413]
[_article has_author _authorpos000, _authorpos000 has_name _author000, _authorpos000
has_order ordinal_001, _article has_author _authorpos001, _authorpos001 has_name _author001, _authorpos001 has_order
ordinal_002, _article has_author _authorpos002, _authorpos002 has_name _author002, _authorpos002 has_order ordinal_003,
_article has_author _authorpos003, _authorpos003 has_name http://www.wikidata.org/entity/Q41896189, _authorpos003
has_order ordinal_004, _article has_author _authorpos004, _authorpos004 has_name _author003, _authorpos004 has_order
ordinal_005, _article has_author _authorpos005, _authorpos005 has_name _author004, _authorpos005 has_order ordinal_006,
_article cites http://www.wikidata.org/entity/Q29547376, _article cites http://www.wikidata.org/entity/Q30655427,
_article cites http://www.wikidata.org/entity/Q53584979, _article has_subject http://www.wikidata.org/entity/Q13100823,
_article has_subject http://www.wikidata.org/entity/Q16, _article has_subject http://www.wikidata.org/entity/Q180507,
_article has_subject http://www.wikidata.org/entity/Q183]

Figure 2: IntelliGraphs contains ve datasets: syn-paths, syn-tipr, syn-types, wd-movies, and
wd-articles. Here we showcase a few example subgraphs from each dataset. The subgraphs are
presented as a list of triples, where each list item represents a subgraph.
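Programmatically, such a subgraph can be held as a list of (subject, predicate, object) triples. The following sketch mirrors one of the wd-movies examples above; the variable names are ours, not part of the dataset's API:

```python
# A subgraph as a list of (subject, predicate, object) triples,
# copied from one of the wd-movies examples in Figure 2.
subgraph = [
    ("_movie", "has_director", "Albert Brooks"),
    ("_movie", "has_actor", "Kathryn Harrold"),
    ("_movie", "has_actor", "Albert Brooks"),
    ("_movie", "has_actor", "Bruno Kirby"),
    ("_movie", "has_genre", "comedy film"),
]

# Collect the entities and relations occurring in the subgraph.
entities = {s for s, _, _ in subgraph} | {o for _, _, o in subgraph}
relations = {p for _, p, _ in subgraph}

print(sorted(relations))  # ['has_actor', 'has_director', 'has_genre']
```

Note that "Albert Brooks" appears both as a director and as an actor, but as a single entity node.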

The following Wikidata properties and entities were used.


label                 Wikidata IRI
cites http://www.wikidata.org/prop/P2860
author http://www.wikidata.org/prop/P50
author name string http://www.wikidata.org/prop/P2093
main subject http://www.wikidata.org/prop/P921
subclass of http://www.wikidata.org/prop/P279
academic discipline http://www.wikidata.org/entity/Q11862829

7.6.3 Example subgraphs

Figure 2 showcases a selection of example subgraphs from each dataset: syn-paths, syn-tipr,
syn-types, wd-movies, and wd-articles.
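As an illustration of the kind of structural regularity these subgraphs exhibit, each wd-movies example connects a single _movie node to exactly one director and at least one actor and one genre. The checker below encodes that pattern; it is our reading of the examples, not the dataset's formal constraint specification:

```python
from collections import Counter

def looks_like_valid_movie(triples):
    """Heuristic check for the wd-movies pattern: exactly one director,
    at least one actor and one genre, all attached to the _movie node.
    An illustrative reading of the examples, not the official rule set."""
    counts = Counter(p for s, p, o in triples if s == "_movie")
    return (
        counts["has_director"] == 1
        and counts["has_actor"] >= 1
        and counts["has_genre"] >= 1
    )

good = [
    ("_movie", "has_director", "Balu Mahendra"),
    ("_movie", "has_actor", "Sridevi"),
    ("_movie", "has_genre", "romance film"),
]
bad = [("_movie", "has_actor", "Sridevi")]  # no director, no genre

print(looks_like_valid_movie(good), looks_like_valid_movie(bad))  # True False
```

A generative model that produces subgraphs violating such patterns would fail the semantic-validity criterion the benchmark targets.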

8 Datacard
An up-to-date version of the data card can be found at https://github.com/thiviyanT/IntelliGraphs/blob/main/Datacard.md.

