TOWARDS FOUNDATION MODELS FOR KNOWLEDGE GRAPH REASONING
ABSTRACT
Foundation models in language and vision have the ability to run inference on any textual and visual inputs thanks to transferable representations such as a vocabulary of tokens in language.
Figure 1: Zero-shot and fine-tuned MRR (higher is better) of ULTRA pre-trained on three graphs (FB15k-237, WN18RR, CoDEx-Medium). On average, zero-shot performance is better than the best reported baselines trained on specific graphs (0.395 vs 0.344). More results in Figure 4 and Table 1.
1 INTRODUCTION
Modern machine learning applications increasingly rely on the pre-training and fine-tuning
paradigm. In this paradigm, a backbone model, often trained on large datasets in a self-supervised fashion, is commonly known as a foundation model (FM) (Bommasani et al., 2021). After pre-training, FMs can be fine-tuned on smaller downstream tasks. In order to transfer to a broad set of
downstream tasks, FMs leverage certain invariances pertaining to a domain of interest, e.g., large
language models like BERT (Devlin et al., 2019), GPT-4 (OpenAI, 2023), Llama-2 (Touvron et al.,
2023) operate on a fixed vocabulary of tokens; vision models operate on raw pixels (He et al., 2016;
Radford et al., 2021) or image patches (Dosovitskiy et al., 2021); chemistry models (Ying et al.,
2021; Zheng et al., 2023) learn a vocabulary of atoms from the periodic table.
Representation learning on knowledge graphs (KGs), however, has not yet witnessed the benefits of
transfer learning despite a wide range of downstream applications such as precision medicine (Chan-
dak et al., 2023), materials science (Venugopal et al., 2022; Statt et al., 2023), virtual assistants (Ilyas
et al., 2022), or product graphs in e-commerce (Dong, 2018). The key problem is that different KGs
typically have different entity and relation vocabularies. Classic transductive KG embedding mod-
els (Ali et al., 2021) learn entity and relation embeddings tailored for each specific vocabulary and
cannot generalize even to new nodes within the same graph. More recent efforts towards generaliza-
tion across the vocabularies are known as inductive learning methods (Chen et al., 2023). Most of
the inductive methods (Teru et al., 2020; Zhu et al., 2021; Galkin et al., 2022b; Zhang & Yao, 2022)
generalize to new entities at inference time but require a fixed relation vocabulary to learn entity
representations as a function of the relations. Such inductive methods still cannot transfer to KGs
with a different set of relations, e.g., training on Freebase and inference on Wikidata.
The main research goal of this work is finding the invariances transferable across graphs with ar-
bitrary entity and relation vocabularies. Leveraging and learning such invariances would enable
the pre-train and fine-tune paradigm of foundation models for KG reasoning where a single model
trained on one graph (or several graphs) with one set of relations would be able to zero-shot transfer
to any new, unseen graph with a completely different set of relations and relational patterns. Our
approach to the problem is based on two key observations: (1) even if relations vary across datasets, the interactions between the relations may be similar and transferable; (2) initial relation representations may be conditioned on these interactions, bypassing the need for any input features. To this end, we propose ULTRA, a method for unified, learnable, and transferable KG representations that leverages the invariance of the relational structure and employs relative relation representations on top of this structure to parameterize any unseen relation. Given any multi-relational graph, ULTRA first constructs a graph of relations (where each node is a relation from the original graph) capturing their interactions. Applying a graph neural network (GNN) with a labeling trick (Zhang et al., 2021) over the graph of relations, ULTRA obtains a unique relative representation of each relation. The relation representations can then be used by any inductive learning method for downstream applications like KG completion. Since the method neither learns any graph-specific entity or relation embeddings nor requires any input entity or relation features, ULTRA enables zero-shot generalization to any other KG of any size and any relational vocabulary.
Experimentally, we show that ULTRA paired with the NBFNet (Zhu et al., 2021) link predictor and pre-trained on three KGs (FB15k-237, WN18RR, and CoDEx-M derived from Freebase, WordNet, and Wikidata, respectively) generalizes to 50+ different KGs with sizes of 1K–120K nodes and 5K–1M edges. ULTRA demonstrates promising transfer learning capabilities: the zero-shot inference performance on those unseen graphs might exceed strong supervised baselines by up to 300%. Subsequent short fine-tuning of ULTRA often boosts the performance even further.
2 RELATED WORK
Inductive Link Prediction. In contrast to transductive methods that only support a fixed set of
entities and relations during training, inductive methods (Chen et al., 2023) aim at generalizing to
graphs with unseen nodes (with the same set of relations) or to both new entities and relations. The
majority of existing inductive methods such as GraIL (Teru et al., 2020), NBFNet (Zhu et al., 2021),
NodePiece (Galkin et al., 2022b), RED-GNN (Zhang & Yao, 2022) can generalize to graphs only
with new nodes, but not to new relation types since the node representations are constructed as a
function of the fixed relational vocabulary.
The first approaches to support unseen relations at inference resorted to meta-learning and few-shot
learning (Chen et al., 2019; Zhang et al., 2020; Huang et al., 2022). Meta-learning is computationally
expensive and is hardly scalable to large graphs. Few-shot learning methods do not work on the
whole new unseen inference graph but instead mine many support sets akin to subgraph sampling.
Both RMPI (Geng et al., 2023) and InGram (Lee et al., 2023) employ graphs of relations to general-
ize to unseen domains. However, RMPI suffers from the same computational and scalability issues
as subgraph sampling methods. InGram is more scalable but its featurization strategy relies on the
discretization of node degrees that only transfers to graphs of a similar relational distribution and
does not transfer to arbitrary KGs. Gao et al. (2023) introduce the notion of double equivariance,
i.e., relation exchangeability in multi-relational graphs, as a general theoretical framework for in-
ductive reasoning that transfers to any relations at inference. ISDEA (Gao et al., 2023) is the first
approach to design doubly equivariant GNNs and MTDEA (Zhou et al., 2023) further extends the
theory to partial equivariance. However, ISDEA and MTDEA are computationally expensive and
cannot scale to the graphs considered in this work. Similarly to RMPI, InGram, ISDEA, and MTDEA, ULTRA transfers to any unseen KG in a zero-shot fashion, but exhibits better generalization capabilities, scales to graphs with millions of edges, and introduces only a marginal inference overhead (a one-step pre-computation) to any inductive link predictor.
Text-based methods. A line of inductive link prediction methods like BLP (Daza et al., 2021),
KEPLER (Wang et al., 2021), StATIK (Markowitz et al., 2022), RAILD (Gesese et al., 2022)
rely on textual descriptions of entities and relations and use language models to encode them.
PRODIGY (Huang et al., 2023a) uses text features for few-shot node classification tasks. We deem this family of methods orthogonal to ULTRA, as we assume the graphs do not have any input features and leverage only the structural information encoded in the graph. Furthermore, the zero-shot inductive
transfer to an arbitrary KG studied in this work implies running inference on graphs from different
domains that might need different language encoders, e.g., models trained on general English data
are unlikely to transfer to graphs with descriptions in other languages or domain-specific graphs.
3 PRELIMINARIES
Knowledge Graph and Inductive Learning. Given a finite set of entities V (nodes), a finite set of relations R (edge types), and a set of triples (edges) E ⊆ (V × R × V), a knowledge graph G is a tuple G = (V, R, E). In the transductive setup, the graph at training time Gtrain = (Vtrain, Rtrain, Etrain) and the graph at inference (validation or test) time Ginf = (Vinf, Rinf, Einf) are the same, i.e., Gtrain = Ginf.
In the inductive setup, in the general case, the training and inference graphs are different, Gtrain ̸=
Ginf . In the easier setup tackled by most of the literature, the relation set R is fixed and shared
between training and inference graphs, i.e., Gtrain = (Vtrain , R, Etrain ) and Ginf = (Vinf , R, Einf ). The
inference graph can be an extension of the training graph if Vtrain ⊆ Vinf or be a separate disjoint
graph (with the same set of relations) if Vtrain ∩ Vinf = ∅. In the hardest inductive case, both the entity and relation sets are different, i.e., Vtrain ∩ Vinf = ∅ and Rtrain ∩ Rinf = ∅. In this work, we
tackle this harder inductive (also known as fully-inductive) case with both new, unseen entities and
relation types at inference time. Since the harder inductive case (with new relations at inference) is
strictly a superset of the easier inductive scenario (with the fixed relation set), any model capable of
fully-inductive inference is by design applicable in easier inductive scenarios as well.
Problem Formulation. Each triple (h, r, t) ∈ (V × R × V) denotes a head entity h connected to a
tail entity t by relation r. The knowledge graph reasoning task answers queries (h, r, ?) or (?, r, t).
It is common to rewrite the head-query (?, r, t) as (t, r−1 , ?) where r−1 is the inverse relation of r.
The set of target triples Epred is predicted based on the incomplete inference graph Ginf which is a
part of the unobservable complete graph Ĝinf = (Vinf , Rinf , Êinf ) where Êinf = Einf ∪ Epred .
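As a concrete illustration of this convention, the following minimal Python sketch (helper names are ours, not taken from any released code) augments a triple list with inverse relations and rewrites a head query (?, r, t) as a tail query (t, r⁻¹, ?):

```python
# Minimal sketch (not the authors' code): inverse relations and query rewriting.

def add_inverse_relations(triples, num_relations):
    """Return triples plus their inverses; relation r^-1 gets id r + num_relations."""
    augmented = list(triples)
    for h, r, t in triples:
        augmented.append((t, r + num_relations, h))
    return augmented

def rewrite_head_query(r, t, num_relations):
    """Turn a head query (?, r, t) into an equivalent tail query (t, r^-1, ?)."""
    return (t, r + num_relations)

# Toy graph with 2 relations (ids 0 and 1) over entities 0..3.
triples = [(0, 0, 1), (1, 1, 2), (0, 0, 3)]
augmented = add_inverse_relations(triples, num_relations=2)
print(len(augmented))                              # 6 edges after adding inverses
print(rewrite_head_query(0, 2, num_relations=2))   # head query (?, 0, 2) -> (2, 2)
```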
Link Prediction and Labeling Trick GNNs. Standard GNN encoders (Kipf & Welling, 2017;
Veličković et al., 2018) including those for multi-relational graphs (Vashishth et al., 2020) underper-
form in link prediction tasks due to neighborhood symmetries (automorphisms) that assign different
(but automorphic) nodes the same features making them indistinguishable. To break those sym-
metries, labeling tricks (Zhang et al., 2021) were introduced that assign each node a unique feature
vector based on its structural properties. Most link predictors that use the labeling tricks (Teru et al.,
2020; Zhang et al., 2021; Chamberlain et al., 2023) mine numerical features like Double Radius
Node Labeling (Zhang & Chen, 2018) or Distance Encoding (Li et al., 2020). In contrast, multi-
relational models like NBFNet (Zhu et al., 2021) leverage an indicator function INDICATOR(h, v, r) and label (initialize) the head node h with the query vector r that can be learned, while all other nodes v are initialized with zeros. In other words, final node representations are conditioned on the query relation, and NBFNet learns conditional node representations. Conditional representations were shown to be provably more expressive (Huang et al., 2023b) and practically more effective (Zhu et al., 2022a; Galkin et al., 2022c) than standard unconditional GNN encoders.
(a) Relative entity representations transfer to new entities (NBFNet, RED-GNN). (b) Relative relation representations transfer to new relations (ULTRA).
Figure 2: (a) relative entity representations used in inductive models generalize to new entities; (b) relative relation representations based on a graph of relations generalize to both new relations and entities. The graph of relations captures four fundamental interactions (t2h, h2h, h2t, t2t) that are independent of any graph-specific relation vocabulary and whose representations can be learned.
4 METHOD
The key challenge of inductive inference with different entity and relation vocabularies is finding
transferable invariances that would produce entity and relation representations conditioned on the
new graph (as learning entity and relation embedding matrices from the training graph is useless
and not transferable). Most inductive GNN methods that transfer to new entities (Zhu et al., 2021;
Zhang & Yao, 2022) learn relative entity representations conditioned on the graph structure as
shown in Fig. 2 (a). For example, given variable entities a, b, c, d and a as a root node labeled with INDICATOR(), a structure a −authored→ b −genre→ c ∧ a −collab→ d −genre→ c might imply the existence of the edge a −genre→ c. A structure learned on a training set with entities Michael Jackson −authored→ Thriller −genre→ disco seamlessly transfers to new entities Beatles −authored→ Let It Be −genre→ rock at inference time without learning entity embeddings, thanks to the same relational structure and relative entity representations. As training and inference relations are the same (Rtrain = Rinf), such approaches learn relation embedding matrices and use relations as invariants.
In ULTRA, we generalize KG reasoning to both new entities and relations (where Rtrain ≠ Rinf) by leveraging a graph of relations, i.e., a graph where each node corresponds to a distinct relation type¹
in the original graph. While relations at inference time are different, their interactions remain the
same and are captured by the graph of relations. For example, in Fig. 2 (b), a tail node of the authored relation is also a head node of the genre relation. Hence, the authored and genre nodes are connected by a tail-to-head edge in the relation graph. Similarly, authored and collab share the same head node in the entity graph and thus are connected by a head-to-head edge in the relation graph. Overall, we distinguish four such core, fundamental relation-to-relation interactions²: tail-to-head (t2h), head-to-head (h2h), head-to-tail (h2t), and tail-to-tail (t2t). Although the relations in the inference graph in Fig. 2 (b) are different, their graph of relations and relation interactions resemble those of
the training graph. Hence, we could leverage the invariance of the relational structure and four
fundamental relations to obtain relational representations of the unseen inference graph. As a typical
KG reasoning task (h, q, ?) is conditioned on a query relation q, it is possible to build representations
of all relations relative to the query q by using a labeling trick on top of the graph of relations. Such
relative relation representations do not need any input features and naturally generalize to any
multi-relational graph.
Practically (Fig. 3), given a query (h, q, ?) over a graph G, ULTRA employs a three-step algorithm
that we describe in the following subsections. (1) Lift the original graph G to the graph of relations
Gr – Section 4.1; (2) Obtain relative relation representations Rq |(q, Gr ) conditioned on the query
relation q in the relation graph Gr – Section 4.2; (3) Using the relation representations Rq as starting
relation features, run inductive link prediction on the original graph G – Section 4.3.
¹ We also add inverse relations as nodes to the relation graph.
² Other strategies for capturing relation-to-relation interactions might exist besides those four types, and we leave their exploration for future work.
Figure 3 panels: (1) Knowledge Graph & Query; (2) Learn Relative Relation Representations; (3) Learn Relative Entity Representations.
Figure 3: Given a query (h, q, ?) on graph G, ULTRA (1) builds a graph of relations Gr with four interactions Rfund (Sec. 4.1); (2) builds relation representations Rq conditioned on the query relation q and Gr (Sec. 4.2); (3) runs any inductive link predictor on G using representations Rq (Sec. 4.3).
4.1 RELATION GRAPH CONSTRUCTION
Given a graph G = (V, R, E), we first apply the lifting function Gr = LIFT(G) to build a graph of relations Gr = (R, Rfund, Er) where each node is a distinct relation type³ in G. Edges Er ⊆ (R × Rfund × R) in the relation graph Gr denote interactions between relations in the original graph G, and we distinguish four such fundamental relation interactions Rfund: tail-to-head (t2h) edges, head-to-head (h2h) edges, head-to-tail (h2t) edges, and tail-to-tail (t2t) edges. The full adjacency tensor of the relation graph is Ar ∈ R^{|R|×|R|×4}. Each of the four adjacency matrices can be efficiently obtained with one sparse matrix multiplication (Appendix B).
4.2 CONDITIONAL RELATION REPRESENTATIONS
Given a query (h, q, ?) and a relation graph Gr, we then obtain d-dimensional node representations Rq ∈ R^{|R|×d} of Gr (corresponding to all edge types R in the original graph G) conditioned on the query relation q. Practically, we implement the conditioning by applying a labeling trick that initializes the node q in Gr through the INDICATOR_r function, and employ a message passing GNN over Gr:

h⁰_{v|q} = INDICATOR_r(v, q) = 1_{v=q} * 1^d,   v ∈ Gr
h^{t+1}_{v|q} = UPDATE( h^t_{v|q}, AGGREGATE( MESSAGE(h^t_{w|q}, r) | w ∈ N_r(v), r ∈ Rfund ) )
The indicator function is implemented as INDICATOR_r(v, q) = 1_{v=q} * 1^d, i.e., it simply puts a vector of ones on the node v corresponding to the query relation q and zeros on all other nodes. Following Huang et al. (2023b), we found that the all-ones labeling with 1^d generalizes better to unseen graphs of various sizes than a learnable vector. The GNN architecture (denoted GNN_r as it operates on the relation graph Gr) follows NBFNet (Zhu et al., 2021) with a non-parametric DistMult (Yang et al., 2015) message function and sum aggregation. The only learnable parameters in each layer are the embeddings of the four fundamental interactions Rfund ∈ R^{4×d}, a linear layer for the UPDATE function, and an optional layer normalization. Note that our general setup (Section 3) assumes no given input entity or relation features, so our parameterization strategy can be used to obtain relational representations of any multi-relational graph.
To sum up, each unique relation q ∈ R in the query has its own matrix of conditional relation
representations Rq ∈ R|R|×d used by the entity-level reasoner for downstream applications.
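To make this concrete, below is a minimal PyTorch-style sketch of such a conditional relation encoder: the query relation node is labeled with an all-ones vector, and each layer aggregates DistMult-style (element-wise product) messages over the four fundamental edge types with sum aggregation. Class names, tensor shapes, and the simplified update (concatenation + linear + ReLU) are our assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class RelationGraphGNN(nn.Module):
    """Sketch of GNN_r: conditional relation representations over the relation graph.

    rel_edge_index: [2, E_r] edges between relation-nodes,
    rel_edge_type:  [E_r] values in {0..3} indexing the four fundamental interactions.
    """
    def __init__(self, dim=64, num_layers=6, num_fund=4):
        super().__init__()
        self.dim = dim
        # Only learnable parameters: fundamental interaction embeddings + linear updates.
        self.fund_rel = nn.ParameterList(
            [nn.Parameter(torch.randn(num_fund, dim)) for _ in range(num_layers)])
        self.update = nn.ModuleList(
            [nn.Linear(2 * dim, dim) for _ in range(num_layers)])

    def forward(self, num_relations, query_rel, rel_edge_index, rel_edge_type):
        # INDICATOR_r: all-ones vector on the query relation node, zeros elsewhere.
        h = torch.zeros(num_relations, self.dim)
        h[query_rel] = 1.0
        src, dst = rel_edge_index
        for fund_rel, update in zip(self.fund_rel, self.update):
            # DistMult message: element-wise product of source state and edge-type embedding.
            msg = h[src] * fund_rel[rel_edge_type]
            agg = torch.zeros_like(h).index_add_(0, dst, msg)  # sum aggregation
            h = torch.relu(update(torch.cat([h, agg], dim=-1)))
        return h  # R_q: one d-dim vector per relation, conditioned on query_rel

# Toy relation graph: 3 relations, edges (0 -t2h-> 1) and (0 -h2h-> 2).
edge_index = torch.tensor([[0, 0], [1, 2]])
edge_type = torch.tensor([2, 0])
gnn_r = RelationGraphGNN(dim=8, num_layers=2)
R_q = gnn_r(num_relations=3, query_rel=0,
            rel_edge_index=edge_index, rel_edge_type=edge_type)
print(R_q.shape)  # torch.Size([3, 8])
```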
4.3 ENTITY-LEVEL LINK PREDICTION
Given a query (h, q, ?) over a graph G and the conditional relation representations Rq from the previous step, it is now possible to adapt any off-the-shelf inductive link predictor that only needs relational features (Zhu et al., 2021; Zhang & Yao, 2022; Zhu et al., 2023; Zhang et al., 2023) to balance between performance and scalability. We modify another instance of NBFNet (GNN_e, as it operates on the entity level) to account for separate relation representations per query:

h⁰_{v|u} = INDICATOR_e(u, v, q) = 1_{u=v} * Rq[q],   v ∈ G
h^{t+1}_{v|u} = UPDATE( h^t_{v|u}, AGGREGATE( MESSAGE(h^t_{w|u}, g^{t+1}(r)) | w ∈ N_r(v), r ∈ R ) )
³ 2|R| nodes after adding inverse relations to the original graph.
That is, we first initialize the head node h with the query relation vector Rq[q], whereas all other nodes are initialized with zeros. Each t-th GNN layer applies a non-linear function g^t(·) to transform the original relation representations into layer-specific relation representations R^t = g^t(Rq), from which the edge features are taken for the MESSAGE function. g(·) is implemented as a 2-layer MLP with ReLU. As in GNN_r (Section 4.2), we use sum aggregation and a linear layer for the UPDATE function. After message passing, a final MLP s: R^d → R maps the node states to logits p(h, q, v) denoting the score of node v being a tail of the initial query (h, q, ?).
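Analogously, here is a minimal sketch of the entity-level predictor (again our simplification, not the released code): the head node is initialized with the query relation's vector from R_q, each layer transforms R_q with its own 2-layer MLP g^t before using the rows as edge features, and a final MLP scores every node as a candidate tail.

```python
import torch
import torch.nn as nn

class EntityGraphGNN(nn.Module):
    """Sketch of GNN_e: NBFNet-style entity encoder driven by relation features R_q."""
    def __init__(self, dim=64, num_layers=6):
        super().__init__()
        # g^t: per-layer 2-layer MLP producing layer-specific relation features.
        self.g = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
            for _ in range(num_layers)])
        self.update = nn.ModuleList([nn.Linear(2 * dim, dim) for _ in range(num_layers)])
        self.score = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, num_entities, head, query_rel, edge_index, edge_type, R_q):
        # INDICATOR_e: label the head node with the query relation vector R_q[q].
        h = torch.zeros(num_entities, R_q.shape[-1])
        h[head] = R_q[query_rel]
        src, dst = edge_index
        for g_t, update in zip(self.g, self.update):
            rel_feat = g_t(R_q)                                # layer-specific relation features
            msg = h[src] * rel_feat[edge_type]                 # DistMult message
            agg = torch.zeros_like(h).index_add_(0, dst, msg)  # sum aggregation
            h = torch.relu(update(torch.cat([h, agg], dim=-1)))
        return self.score(h).squeeze(-1)  # logits p(h, q, v) for every candidate tail v

# Toy usage; R_q is a stand-in for the output of the relation-level encoder above.
R_q = torch.randn(3, 8)
edge_index = torch.tensor([[0, 1, 2], [1, 2, 3]])
edge_type = torch.tensor([0, 1, 0])
gnn_e = EntityGraphGNN(dim=8, num_layers=2)
logits = gnn_e(num_entities=4, head=0, query_rel=0,
               edge_index=edge_index, edge_type=edge_type, R_q=R_q)
print(logits.shape)  # torch.Size([4])
```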
Training. ULTRA can be trained on any multi-relational graph or mixture of graphs thanks to the inductive and conditional relation representations. Following standard practice in the literature (Sun et al., 2019; Zhu et al., 2021), ULTRA is trained by minimizing the binary cross-entropy loss over positive and negative triples:

L = − log p(u, q, v) − Σ_{i=1}^{n} (1/n) log(1 − p(u′_i, q, v′_i))

where (u, q, v) is a positive triple in the graph and {(u′_i, q, v′_i)}_{i=1}^{n} are negative samples obtained by corrupting either the head u or the tail v of the positive sample.
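A small sketch of this objective, assuming the model outputs probabilities in (0, 1) (e.g., sigmoids of the logits above); function and variable names are ours:

```python
import torch

def bce_link_loss(pos_prob, neg_probs):
    """Binary cross-entropy over one positive triple and n corrupted negatives.

    pos_prob:  scalar tensor, p(u, q, v) for the positive triple.
    neg_probs: tensor of shape [n], p(u'_i, q, v'_i) for the negatives.
    """
    eps = 1e-9  # numerical safety, our addition
    pos_term = -torch.log(pos_prob + eps)
    neg_term = -torch.log(1.0 - neg_probs + eps).mean()  # the 1/n factor in the loss
    return pos_term + neg_term

# Toy example: one positive scored 0.9, three negatives scored low.
loss = bce_link_loss(torch.tensor(0.9), torch.tensor([0.1, 0.2, 0.05]))
print(float(loss))
```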
5 EXPERIMENTS
To evaluate the quality of ULTRA as a foundation model for KG reasoning, we explore the following questions: (1) Is pre-trained ULTRA able to inductively generalize to unseen KGs in a zero-shot manner? (2) Are there any benefits from fine-tuning ULTRA on a specific dataset? (3) How does a single pre-trained ULTRA model compare to models trained from scratch on each target dataset? (4) Do more graphs in the pre-training mix correspond to better performance?
5.1 SETUP AND DATASETS
Datasets. We conduct a broad evaluation on 57 different KGs with reported, non-saturated results on the KG completion task. The datasets can be categorized into three groups:
• Transductive datasets (16 graphs) with the fixed set of entities and relations at training and
inference time (Gtrain = Ginf ): FB15k-237 (Toutanova & Chen, 2015), WN18RR (Dettmers
et al., 2018), YAGO3-10 (Mahdisoltani et al., 2014), NELL-995 (Xiong et al., 2017),
CoDEx (Small, Medium, and Large) (Safavi & Koutra, 2020), WDsinger, NELL23k,
FB15k237(10), FB15k237(20), FB15k237(50) (Lv et al., 2020), AristoV4 (Chen et al.,
2021), DBpedia100k (Ding et al., 2018), ConceptNet100k (Malaviya et al., 2020), Het-
ionet (Himmelstein et al., 2017)
• Inductive entity (e) datasets (18 graphs) with new entities at inference time but with the
fixed set of relations (Vtrain ̸= Vinf , Rtrain = Rinf ): 12 datasets from GraIL (Teru et al.,
2020), 4 graphs from INDIGO (Liu et al., 2021; Hamaguchi et al., 2017), and 2 ILPC 2022
datasets (Small and Large) (Galkin et al., 2022a).
• Inductive entity and relation (e, r) datasets (23 graphs) where both entities and relations at
inference are new (Vtrain ̸= Vinf , Rtrain ̸= Rinf ): 13 graphs from I N G RAM (Lee et al., 2023)
and 10 graphs from MTDEA (Zhou et al., 2023).
In practice, however, a pre-trained ULTRA operates in the inductive (e, r) mode on all datasets (apart from those in the training mixture), as their sets of entities, relations, and relational structures are different from the training set. The dataset sizes vary from 1k to 120k entities and 1k–2M edges in the inference graph. We provide more details on the datasets in Appendix A.
Pretraining and Fine-tuning. ULTRA is pre-trained on a mixture of 3 standard KGs (WN18RR, CoDEx-Medium, FB15k237) to capture a variety of possible relational structures and sparsities in the respective relation graphs Gr. ULTRA is relatively small (177k parameters in total, with 60k parameters in GNN_r and 117k parameters in GNN_e) and is trained for 200,000 steps with a batch size of 64 using the AdamW optimizer on 2 A100 (40 GB) GPUs. All fine-tuning experiments were done on a single RTX 3090 GPU. More details on hyperparameters and training are in Appendix C.
Figure 4: ULTRA performance on 14 inductive datasets from MTDEA (Zhou et al., 2023) and INDIGO (Liu et al., 2021), for 8 of which only an approximate metric, Hits@10 (50 negs), is available (center). We also report full MRR (left) and Hits@10 (right) computed on the entire entity sets, demonstrating that Hits@10 (50 negs) overestimates the real performance.
Table 1: Zero-shot and fine-tuned performance of ULTRA compared to the published supervised SOTA on 51 datasets (as in Fig. 1 and Fig. 4). The zero-shot ULTRA outperforms supervised baselines on average and on inductive datasets. Fine-tuning improves the performance even further. We report pre-training performance alongside the fine-tuned version. More detailed results are in Appendix D.
Model            | Inductive (e)+(e,r), 27 graphs | Transductive, 13 graphs | Total Avg, 40 graphs | Pretraining, 3 graphs | Inductive (e)+(e,r), 8 graphs
                 | MRR     H@10                   | MRR     H@10            | MRR     H@10         | MRR     H@10          | Hits@10 (50 negs)
Supervised SOTA  | 0.342   0.482                  | 0.348   0.494           | 0.344   0.486        | 0.439   0.585         | 0.731
ULTRA 0-shot     | 0.435   0.603                  | 0.312   0.458           | 0.395   0.556        | -       -             | 0.859
ULTRA fine-tuned | 0.443   0.615                  | 0.379   0.543           | 0.422   0.592        | 0.407   0.568         | 0.896
Evaluation Protocol. We report Mean Reciprocal Rank (MRR) and Hits@10 (H@10) as the main
performance metrics evaluated against the full entity set of the inference graph. For each triple, we
report the results of predicting both heads and tails. Only on the three datasets from Lv et al. (2020) do we report tail-only metrics, following the baselines. In the zero-shot inference scenario, we run a pre-trained model on the inference graph and its test set of triples. In the fine-tuning case, we further train the model on the training split of each dataset, retaining the checkpoint with the best validation set MRR. We run zero-shot inference experiments once, as the results are deterministic, and report the average of 5 runs for fine-tuning on each dataset.
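For concreteness, a minimal sketch of how these ranking metrics can be computed (helper names are ours; the usual filtering of other known true triples is omitted for brevity):

```python
import torch

def rank_against_all_entities(scores, true_entity):
    """Rank of the true entity among all candidates (higher score = better)."""
    return int((scores > scores[true_entity]).sum().item()) + 1

def mrr_and_hits_at_10(ranks):
    """Aggregate a list of per-query ranks into MRR and Hits@10."""
    ranks = torch.tensor(ranks, dtype=torch.float)
    return (1.0 / ranks).mean().item(), (ranks <= 10).float().mean().item()

# Toy example: one query where the true entity is ranked 1st among 3 candidates ...
r = rank_against_all_entities(torch.tensor([0.1, 0.9, 0.3]), true_entity=1)
# ... and three queries whose true entities were ranked 1, 4, and 20.
print(r, mrr_and_hits_at_10([1, 4, 20]))  # 1, (MRR ~0.433, Hits@10 ~0.667)
```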
Baselines. On each graph, we compare ULTRA against the reported state-of-the-art model (we list the SOTA for all 57 graphs in Appendix A). To date, all of the reported SOTA models are trained end-to-end specifically on each target dataset. Due to the computational complexity of the baselines, the only existing results on 4 MTDEA datasets (Zhou et al., 2023) and 4 INDIGO datasets (Liu et al., 2021) report Hits@10 against 50 randomly chosen negatives. We compare ULTRA against those baselines using this Hits@10 (50 negs) metric as well as report the full performance on the whole entity sets.
5.2 MAIN RESULTS: ZERO-SHOT INFERENCE AND FINE-TUNING OF ULTRA
The main experiment reports how ULTRA pre-trained on 3 graphs inductively generalizes to 54 other graphs, both in the zero-shot (0-shot) and fine-tuned cases. Fig. 1 compares ULTRA with supervised SOTA baselines on 43 graphs that report MRR on the full entity set. Fig. 4 presents the comparison on the remaining 14 graphs, including 8 graphs for which the baselines report Hits@10 (50 negs). The aggregated results on 51 graphs with available baseline results are presented in Table 1, and the complete evaluation on 57 graphs grouped into three families according to Section 5.1 is in Table 2. Full per-dataset results with standard deviations can be found in Appendix D.
On average, ULTRA outperforms the baselines even in the 0-shot inference scenario, both in MRR and Hits@10. The largest gains are achieved on smaller inductive graphs, e.g., on FB-25 and FB-50 the 0-shot ULTRA yields almost 3× better performance (291% and 289%, respectively). During pre-training, ULTRA does not reach the baseline performance (0.407 vs 0.439 average MRR), which we link to the lower 0-shot inference results on larger transductive graphs. However, fine-tuning ULTRA effectively bridges this gap and surpasses the baselines. We hypothesize that on larger transductive graphs fine-tuning helps to adapt to different graph sizes (training graphs have 15-40k nodes while larger inference graphs grow up to 123k nodes).
Table 2: Zero-shot and fine-tuned ULTRA results on the complete set of 57 graphs grouped by the dataset category. Fine-tuning especially helps on larger transductive datasets and boosts the total average MRR by 10%. Additionally, we report as (train e2e) the average performance of dataset-specific ULTRA models trained from scratch on each graph. More detailed results are in Appendix D.
Figure 5: Comparison of zero-shot and fine-tuned ULTRA per-dataset performance against training a model from scratch on each dataset (Train e2e). Zero-shot performance of a single pre-trained model is on par with training from scratch, while fine-tuning yields the overall best results.
Following the sample efficiency and fast convergence of NBFNet (Zhu et al., 2021), we find that 1000-2000 steps are enough for fine-tuning ULTRA. In some cases (see Appendix D), fine-tuning brings marginal improvements or has marginal negative effects. Averaged across 54 graphs (Table 2), fine-tuned ULTRA brings a further 10% relative improvement over the zero-shot version.
5.3 ABLATION STUDIES
We performed several experiments to better understand the pre-training quality of ULTRA and to measure the impact of conditional relation representations on the performance.
Positive transfer from pre-training. We first study how a single pre-trained ULTRA model compares to training instances of the same model separately on each graph end-to-end. For that, for each of the 57 graphs, we train 3 ULTRA instances of the same configuration with different random seeds until convergence and report the averaged results in Table 2, with a per-dataset comparison in Fig. 5. We find that, on average, a single pre-trained ULTRA model in the zero-shot regime performs almost on par with the separately trained models: it lags behind them on larger transductive graphs but exhibits better performance on inductive datasets. Fine-tuning a pre-trained ULTRA shows the overall best performance and requires significantly fewer computational resources than training a model from scratch on every target graph.
Number of graphs in the pre-training mix. We then study how inductive inference performance depends on the training mixture. While the main ULTRA model was trained on a mixture of three graphs, here we train more models, varying the number of KGs in the training set from a single FB15k237 to a combination of 8 transductive KGs (more details in Appendix C). For a fair comparison, we evaluate pre-trained models in the zero-shot regime only on inductive datasets (41 graphs overall). The results are presented in Fig. 6, where we observe a saturation of performance with more than three graphs in the mixture. We hypothesize that achieving higher inference performance is tied to model capacity, scale, and optimization. We leave that study, along with more principled approaches to selecting a pre-training mix, for future work.
Table 3: Ablation study: pre-training and zero-shot inference results of the main ULTRA, ULTRA without edge types in the relation graph (no etypes), and ULTRA without edge types and with an InGram-like (Lee et al., 2023) unconditional GNN over the relation graph where nodes are initialized with all ones (ones) or with Glorot initialization (random). Averaged results over 3 categories of datasets.
Conditional vs unconditional relation graph encoding. To measure the impact of the graph of relations and conditional relation representations, we pre-train three more models on the same mixture of three graphs, varying several components: (1) we exclude the four fundamental relation interactions (h2h, h2t, t2h, t2t) from the relation graph, making it homogeneous and single-relational; (2) a homogeneous relation graph with an unconditional GNN encoder following the R-GATv2 architecture from the previous SOTA approach, InGram (Lee et al., 2023). The unconditional GNN needs input node features, and we probed two strategies: Glorot initialization used in Lee et al. (2023) and initializing all nodes with a vector of ones 1^d.
The results are presented in Table 3 and indicate that the ablated models struggle to reach the same pre-training performance and exhibit poor zero-shot generalization across all groups of graphs, e.g., up to a 48% relative MRR drop (0.192 vs 0.366) for the model with a homogeneous relation graph and randomly initialized node states with the unconditional R-GATv2 encoder. We therefore posit that conditional representations (both on the relation and entity levels) are crucial for transferable representations for link prediction tasks, which often require pairwise representations to break neighborhood symmetries.
Figure 6: Averaged 0-shot performance on inductive datasets and # graphs in pre-training.
ETHICS STATEMENT
Foundation models can be run on tasks and datasets that were originally not envisioned by authors.
Due to the ubiquitous nature of graph data, foundation graph models might be used for malicious
activities like searching for patterns in anonymized data. On the other, more positive side, foundation
models reduce the computational burden and carbon footprint of training many non-transferable
graph-specific models. Having a single model with zero-shot transfer capabilities to any graph
renders tailored graph-specific models unnecessary, and fine-tuning costs are still lower than training
any model from scratch.
REPRODUCIBILITY STATEMENT
The list of datasets and evaluation protocol are presented in Section 5.1. More comments and de-
tails on the dataset statistics are available in Appendix A. All hyperparameters can be found in
Appendix C, full MRR and Hits@10 results with standard deviations are in Appendix D. The source
code is available in the supplementary materials.
ACKNOWLEDGMENTS
This project is supported by the Intel-Mila partnership program, the Natural Sciences and Engineering Research Council (NSERC) Discovery Grant, the Canada CIFAR AI Chair Program, collaboration grants between Microsoft Research and Mila, Samsung Electronics Co., Ltd., an Amazon Faculty Research Award, the Tencent AI Lab Rhino-Bird Gift Fund, and an NRC Collaborative R&D Project (AI4D-CORE-06). This project was also partially funded by IVADO Fundamental Research Project grant PRF-2019-3583139727. The computation resources of this project are supported by Mila (https://mila.quebec/), Calcul Québec (https://www.calculquebec.ca/), and the Digital Research Alliance of Canada (https://alliancecan.ca/).
REFERENCES
Mehdi Ali, Max Berrendorf, Charles Tapley Hoyt, Laurent Vermue, Mikhail Galkin, Sahand Shar-
ifzadeh, Asja Fischer, Volker Tresp, and Jens Lehmann. Bringing light into the dark: A large-scale
evaluation of knowledge graph embedding models under a unified framework. IEEE Transactions
on Pattern Analysis and Machine Intelligence, 2021. doi: 10.1109/TPAMI.2021.3124805.
Rishi Bommasani, Drew A. Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx,
Michael S. Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, Erik Brynjolfsson,
S. Buch, Dallas Card, Rodrigo Castellon, Niladri S. Chatterji, Annie S. Chen, Kathleen A. Creel,
Jared Davis, Dora Demszky, Chris Donahue, Moussa Doumbouya, Esin Durmus, Stefano Ermon,
John Etchemendy, Kawin Ethayarajh, Li Fei-Fei, Chelsea Finn, Trevor Gale, Lauren E. Gillespie,
Karan Goel, Noah D. Goodman, Shelby Grossman, Neel Guha, Tatsunori Hashimoto, Peter Hen-
derson, John Hewitt, Daniel E. Ho, Jenny Hong, Kyle Hsu, Jing Huang, Thomas F. Icard, Saahil
Jain, Dan Jurafsky, Pratyusha Kalluri, Siddharth Karamcheti, Geoff Keeling, Fereshte Khani,
O. Khattab, Pang Wei Koh, Mark S. Krass, Ranjay Krishna, Rohith Kuditipudi, Ananya Kumar,
Faisal Ladhak, Mina Lee, Tony Lee, Jure Leskovec, Isabelle Levent, Xiang Lisa Li, Xuechen
Li, Tengyu Ma, Ali Malik, Christopher D. Manning, Suvir P. Mirchandani, Eric Mitchell, Zanele
Munyikwa, Suraj Nair, Avanika Narayan, Deepak Narayanan, Benjamin Newman, Allen Nie,
Juan Carlos Niebles, Hamed Nilforoshan, J. F. Nyarko, Giray Ogut, Laurel Orr, Isabel Papadim-
itriou, Joon Sung Park, Chris Piech, Eva Portelance, Christopher Potts, Aditi Raghunathan, Robert
Reich, Hongyu Ren, Frieda Rong, Yusuf H. Roohani, Camilo Ruiz, Jack Ryan, Christopher
R’e, Dorsa Sadigh, Shiori Sagawa, Keshav Santhanam, Andy Shih, Krishna Parasuram Srini-
vasan, Alex Tamkin, Rohan Taori, Armin W. Thomas, Florian Tramèr, Rose E. Wang, William
Wang, Bohan Wu, Jiajun Wu, Yuhuai Wu, Sang Michael Xie, Michihiro Yasunaga, Jiaxuan You,
Matei A. Zaharia, Michael Zhang, Tianyi Zhang, Xikun Zhang, Yuhui Zhang, Lucia Zheng, Kait-
lyn Zhou, and Percy Liang. On the opportunities and risks of foundation models. ArXiv, 2021.
URL https://crfm.stanford.edu/assets/report.pdf.
Benjamin Paul Chamberlain, Sergey Shirobokov, Emanuele Rossi, Fabrizio Frasca, Thomas
Markovich, Nils Yannick Hammerla, Michael M. Bronstein, and Max Hansmire. Graph neu-
ral networks for link prediction with subgraph sketching. In The Eleventh International Confer-
ence on Learning Representations, 2023. URL https://openreview.net/forum?id=
m1oqEOAozQU.
Payal Chandak, Kexin Huang, and Marinka Zitnik. Building a knowledge graph to enable precision
medicine. Nature Scientific Data, 2023. doi: https://doi.org/10.1038/s41597-023-01960-3. URL
https://www.nature.com/articles/s41597-023-01960-3.
Mingyang Chen, Wen Zhang, Wei Zhang, Qiang Chen, and Huajun Chen. Meta relational learning
for few-shot link prediction in knowledge graphs. In EMNLP, pp. 4217–4226, 2019.
Mingyang Chen, Wen Zhang, Yuxia Geng, Zezhong Xu, Jeff Z Pan, and Huajun Chen. Generalizing
to unseen elements: A survey on knowledge extrapolation for knowledge graphs. arXiv preprint
arXiv:2302.01859, 2023.
Yihong Chen, Pasquale Minervini, Sebastian Riedel, and Pontus Stenetorp. Relation prediction as
an auxiliary training objective for improving multi-relational graph representations. In 3rd Con-
ference on Automated Knowledge Base Construction, 2021. URL https://openreview.
net/forum?id=Qa3uS3H7-Le.
Yihong Chen, Pushkar Mishra, Luca Franceschi, Pasquale Minervini, Pontus Stenetorp, and Se-
bastian Riedel. Refactor GNNs: Revisiting factorisation-based models from a message-passing
perspective. In Advances in Neural Information Processing Systems, 2022. URL https:
//openreview.net/forum?id=81LQV4k7a7X.
Daniel Daza, Michael Cochez, and Paul Groth. Inductive entity representations from text via link
prediction. In Proceedings of the Web Conference 2021, pp. 798–808, 2021.
Tim Dettmers, Pasquale Minervini, Pontus Stenetorp, and Sebastian Riedel. Convolutional 2d
knowledge graph embeddings. In Proceedings of the AAAI conference on artificial intelligence,
volume 32, 2018.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep
bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of
the North American Chapter of the Association for Computational Linguistics: Human Language
Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186. Association for Computational
Linguistics, 2019. URL https://aclanthology.org/N19-1423.
Francesco Di Giovanni, Lorenzo Giusti, Federico Barbero, Giulia Luise, Pietro Lio, and Michael M.
Bronstein. On over-squashing in message passing neural networks: The impact of width,
depth, and topology. In Proceedings of the 40th International Conference on Machine Learn-
ing, pp. 7865–7885. PMLR, 2023. URL https://proceedings.mlr.press/v202/
di-giovanni23a.html.
Boyang Ding, Quan Wang, Bin Wang, and Li Guo. Improving knowledge graph embedding using
simple constraints. In Proceedings of the 56th Annual Meeting of the Association for Computa-
tional Linguistics (Volume 1: Long Papers), pp. 110–121. Association for Computational Linguis-
tics, 2018. doi: 10.18653/v1/P18-1011. URL https://aclanthology.org/P18-1011.
Xin Luna Dong. Challenges and innovations in building a product knowledge graph. In Proceedings
of the 24th ACM SIGKDD International conference on knowledge discovery & data mining, pp.
2869–2869, 2018.
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas
Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszko-
reit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recogni-
tion at scale. In International Conference on Learning Representations, 2021. URL https:
//openreview.net/forum?id=YicbFdNTTy.
Mikhail Galkin, Max Berrendorf, and Charles Tapley Hoyt. An Open Challenge for Inductive Link
Prediction on Knowledge Graphs. arXiv preprint arXiv:2203.01520, 2022a. URL http://
arxiv.org/abs/2203.01520.
Mikhail Galkin, Etienne Denis, Jiapeng Wu, and William L. Hamilton. Nodepiece: Composi-
tional and parameter-efficient representations of large knowledge graphs. In International Con-
ference on Learning Representations, 2022b. URL https://openreview.net/forum?
id=xMJWUKJnFSw.
Mikhail Galkin, Zhaocheng Zhu, Hongyu Ren, and Jian Tang. Inductive logical query answering
in knowledge graphs. In Advances in Neural Information Processing Systems, 2022c. URL
https://openreview.net/forum?id=-vXEN5rIABY.
Jianfei Gao, Yangze Zhou, and Bruno Ribeiro. Double permutation equivariance for knowledge
graph completion. arXiv preprint arXiv:2302.01313, 2023.
Yuxia Geng, Jiaoyan Chen, Jeff Z Pan, Mingyang Chen, Song Jiang, Wen Zhang, and Huajun Chen.
Relational message passing for fully inductive knowledge graph completion. In 2023 IEEE 39th
International Conference on Data Engineering (ICDE), pp. 1221–1233. IEEE, 2023.
Genet Asefa Gesese, Harald Sack, and Mehwish Alam. Raild: Towards leveraging relation features
for inductive link prediction in knowledge graphs. In Proceedings of the 11th International Joint
Conference on Knowledge Graphs, pp. 82–90, 2022.
Jia Guo and Stanley Kok. BiQUE: Biquaternionic embeddings of knowledge graphs. In Proceedings
of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 8338–8351.
Association for Computational Linguistics, 2021. doi: 10.18653/v1/2021.emnlp-main.657. URL
https://aclanthology.org/2021.emnlp-main.657.
Takuo Hamaguchi, Hidekazu Oiwa, Masashi Shimbo, and Yuji Matsumoto. Knowledge transfer for
out-of-knowledge-base entities : A graph neural network approach. In Proceedings of the Twenty-
Sixth International Joint Conference on Artificial Intelligence, IJCAI-17, pp. 1802–1808, 2017.
doi: 10.24963/ijcai.2017/250. URL https://doi.org/10.24963/ijcai.2017/250.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recog-
nition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.
770–778, 2016.
Tao He, Ming Liu, Yixin Cao, Zekun Wang, Zihao Zheng, Zheng Chu, and Bing Qin. Exploring
& exploiting high-order graph structure for sparse knowledge graph completion. arXiv preprint
arXiv:2306.17034, 2023.
Daniel Scott Himmelstein, Antoine Lizee, Christine Hessler, Leo Brueggeman, Sabrina L Chen,
Dexter Hadley, Ari Green, Pouya Khankhanian, and Sergio E Baranzini. Systematic integration
of biomedical knowledge prioritizes drugs for repurposing. Elife, 6:e26726, 2017.
Qian Huang, Hongyu Ren, and Jure Leskovec. Few-shot relational reasoning via connection
subgraph pretraining. In Advances in Neural Information Processing Systems, 2022. URL
https://openreview.net/forum?id=LvW71lgly25.
Qian Huang, Hongyu Ren, Peng Chen, Gregor Kržmanc, Daniel Zeng, Percy Liang, and Jure
Leskovec. PRODIGY: Enabling in-context learning over graphs. In Thirty-seventh Conference on
Neural Information Processing Systems, 2023a. URL https://openreview.net/forum?
id=pLwYhNNnoR.
Xingyue Huang, Miguel Romero Orth, İsmail İlkan Ceylan, and Pablo Barceló. A theory of link
prediction via relational weisfeiler-leman. In Advances in Neural Information Processing Systems,
2023b. URL https://openreview.net/forum?id=7hLlZNrkt5.
Ihab F Ilyas, Theodoros Rekatsinas, Vishnu Konda, Jeffrey Pound, Xiaoguang Qi, and Mohamed
Soliman. Saga: A platform for continuous construction and serving of knowledge at scale. In
Proceedings of the 2022 International Conference on Management of Data, pp. 2259–2272, 2022.
Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional
networks. In International Conference on Learning Representations, 2017. URL https:
//openreview.net/forum?id=SJU4ayYgl.
Jaejun Lee, Chanyoung Chung, and Joyce Jiyoung Whang. InGram: Inductive knowledge graph
embedding via relation graphs. In Proceedings of the 40th International Conference on Ma-
chine Learning, volume 202, pp. 18796–18809. PMLR, 23–29 Jul 2023. URL https://
proceedings.mlr.press/v202/lee23c.html.
Pan Li, Yanbang Wang, Hongwei Wang, and Jure Leskovec. Distance encoding: Design provably
more powerful neural networks for graph representation learning. In Advances in Neural Infor-
mation Processing Systems, volume 33, pp. 4465–4478, 2020.
Shuwen Liu, Bernardo Grau, Ian Horrocks, and Egor Kostylev. Indigo: Gnn-based in-
ductive knowledge graph completion using pair-wise encoding. In Advances in Neu-
ral Information Processing Systems, volume 34, pp. 2034–2045. Curran Associates, Inc.,
2021. URL https://proceedings.neurips.cc/paper_files/paper/2021/
file/0fd600c953cde8121262e322ef09f70e-Paper.pdf.
Xin Lv, Xu Han, Lei Hou, Juanzi Li, Zhiyuan Liu, Wei Zhang, Yichi Zhang, Hao Kong, and Suhui
Wu. Dynamic anticipation and completion for multi-hop reasoning over sparse knowledge graph.
In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing
(EMNLP), pp. 5694–5703. Association for Computational Linguistics, 2020. doi: 10.18653/v1/
2020.emnlp-main.459. URL https://aclanthology.org/2020.emnlp-main.459.
Farzaneh Mahdisoltani, Joanna Biega, and Fabian Suchanek. Yago3: A knowledge base from mul-
tilingual wikipedias. In 7th biennial conference on innovative data systems research. CIDR Con-
ference, 2014.
Chaitanya Malaviya, Chandra Bhagavatula, Antoine Bosselut, and Yejin Choi. Commonsense
knowledge base completion with structural and semantic context. In Proceedings of the AAAI
conference on artificial intelligence, volume 34, pp. 2925–2933, 2020.
Elan Markowitz, Keshav Balasubramanian, Mehrnoosh Mirtaheri, Murali Annavaram, Aram Gal-
styan, and Greg Ver Steeg. StATIK: Structure and text for inductive knowledge graph com-
pletion. In Findings of the Association for Computational Linguistics: NAACL 2022, pp. 604–
615, 2022. doi: 10.18653/v1/2022.findings-naacl.46. URL https://aclanthology.org/
2022.findings-naacl.46.
OpenAI. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agar-
wal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya
Sutskever. Learning transferable visual models from natural language supervision. In Proceed-
ings of the 38th International Conference on Machine Learning, Proceedings of Machine Learn-
ing Research, pp. 8748–8763. PMLR, 18–24 Jul 2021. URL https://proceedings.mlr.
press/v139/radford21a.html.
Tara Safavi and Danai Koutra. CoDEx: A Comprehensive Knowledge Graph Completion Bench-
mark. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Pro-
cessing (EMNLP), pp. 8328–8350, Online, 2020. Association for Computational Linguistics.
doi: 10.18653/v1/2020.emnlp-main.669. URL https://www.aclweb.org/anthology/
2020.emnlp-main.669.
Michael J. Statt, Brian A. Rohr, Dan Guevarra, Ja’Nya Breeden, Santosh K. Suram, and John M.
Gregoire. The materials experiment knowledge graph. Digital Discovery, 2:909–914, 2023. doi:
10.1039/D3DD00067B. URL http://dx.doi.org/10.1039/D3DD00067B.
Zhiqing Sun, Zhi-Hong Deng, Jian-Yun Nie, and Jian Tang. Rotate: Knowledge graph embedding by
relational rotation in complex space. In International Conference on Learning Representations,
2019. URL https://openreview.net/forum?id=HkgEQnRqYQ.
Komal Teru, Etienne Denis, and Will Hamilton. Inductive relation prediction by subgraph reasoning.
In International Conference on Machine Learning, pp. 9448–9457. PMLR, 2020.
Kristina Toutanova and Danqi Chen. Observed versus latent features for knowledge base and text
inference. In Proceedings of the 3rd workshop on continuous vector space models and their
compositionality, pp. 57–66, 2015.
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Niko-
lay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher,
Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy
Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn,
Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel
Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee,
Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra,
Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi,
Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh
Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen
Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic,
Sergey Edunov, and Thomas Scialom. Llama 2: Open foundation and fine-tuned chat models.
arXiv preprint arXiv:2307.09288, 2023.
Shikhar Vashishth, Soumya Sanyal, Vikram Nitin, and Partha Talukdar. Composition-based multi-
relational graph convolutional networks. In International Conference on Learning Representa-
tions, 2020. URL https://openreview.net/forum?id=BylA_C4tPr.
Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua
Bengio. Graph attention networks. In International Conference on Learning Representations,
2018.
Vineeth Venugopal, Sumit Pai, and Elsa Olivetti. Matkg: The largest knowledge graph in materials
science – entities, relations, and link prediction through graph representation learning. arXiv
preprint arXiv:2210.17340, 2022.
Xiaozhi Wang, Tianyu Gao, Zhaocheng Zhu, Zhengyan Zhang, Zhiyuan Liu, Juanzi Li, and Jian
Tang. Kepler: A unified model for knowledge embedding and pre-trained language representation.
Transactions of the Association for Computational Linguistics, 9:176–194, 2021.
Wenhan Xiong, Thien Hoang, and William Yang Wang. DeepPath: A reinforcement learning
method for knowledge graph reasoning. In Proceedings of the 2017 Conference on Empirical
Methods in Natural Language Processing, pp. 564–573. Association for Computational Linguis-
tics, 2017. URL https://aclanthology.org/D17-1060.
Bishan Yang, Wen-tau Yih, Xiaodong He, Jianfeng Gao, and Li Deng. Embedding entities and
relations for learning and inference in knowledge bases. International Conference on Learning
Representations, 2015.
Gilad Yehudai, Ethan Fetaya, Eli A. Meirom, Gal Chechik, and Haggai Maron. From local struc-
tures to size generalization in graph neural networks. In Proceedings of the 38th International
Conference on Machine Learning, ICML 2021, volume 139, pp. 11975–11986. PMLR, 2021.
Chengxuan Ying, Tianle Cai, Shengjie Luo, Shuxin Zheng, Guolin Ke, Di He, Yanming Shen, and
Tie-Yan Liu. Do transformers really perform badly for graph representation? In Thirty-Fifth
Conference on Neural Information Processing Systems, 2021. URL https://openreview.
net/forum?id=OeWooOxFwDa.
Chuxu Zhang, Huaxiu Yao, Chao Huang, Meng Jiang, Zhenhui Li, and Nitesh V Chawla. Few-shot
knowledge graph completion. In Proceedings of the AAAI Conference on Artificial Intelligence,
pp. 3041–3048, 2020.
Muhan Zhang and Yixin Chen. Link prediction based on graph neural networks. Advances in neural
information processing systems, 31, 2018.
Muhan Zhang, Pan Li, Yinglong Xia, Kai Wang, and Long Jin. Labeling trick: A theory of using
graph neural networks for multi-node representation learning. In Advances in Neural Information
Processing Systems, pp. 9061–9073, 2021.
Yongqi Zhang and Quanming Yao. Knowledge graph reasoning with relational digraph. In Proceed-
ings of the ACM Web Conference 2022, pp. 912–924, 2022.
Yongqi Zhang, Zhanke Zhou, Quanming Yao, Xiaowen Chu, and Bo Han. Adaprop: Learning
adaptive propagation for graph neural network based knowledge graph reasoning. In KDD, 2023.
Shuxin Zheng, Jiyan He, Chang Liu, Yu Shi, Ziheng Lu, Weitao Feng, Fusong Ju, Jiaxi Wang,
Jianwei Zhu, Yaosen Min, He Zhang, Shidi Tang, Hongxia Hao, Peiran Jin, Chi Chen, Frank
Noé, Haiguang Liu, and Tie-Yan Liu. Towards predicting equilibrium distributions for molecular
systems with deep learning. arXiv preprint arXiv:2306.05445, 2023.
Jincheng Zhou, Beatrice Bevilacqua, and Bruno Ribeiro. An ood multi-task perspective for link
prediction with new relation types and nodes. arXiv preprint arXiv:2307.06046, 2023.
Yangze Zhou, Gitta Kutyniok, and Bruno Ribeiro. Ood link prediction generalization capabilities
of message-passing gnns in larger test graphs. In Advances in Neural Information Processing
Systems, 2022.
Zhaocheng Zhu, Zuobai Zhang, Louis-Pascal Xhonneux, and Jian Tang. Neural bellman-ford net-
works: A general graph neural network framework for link prediction. Advances in Neural Infor-
mation Processing Systems, 34:29476–29490, 2021.
Zhaocheng Zhu, Mikhail Galkin, Zuobai Zhang, and Jian Tang. Neural-symbolic models for logical
queries on knowledge graphs. In International Conference on Machine Learning, ICML 2022.
PMLR, 2022a.
Zhaocheng Zhu, Chence Shi, Zuobai Zhang, Shengchao Liu, Minghao Xu, Xinyu Yuan, Yangtian
Zhang, Junkun Chen, Huiyu Cai, Jiarui Lu, et al. Torchdrug: A powerful and flexible machine
learning platform for drug discovery. arXiv preprint arXiv:2202.08320, 2022b.
Zhaocheng Zhu, Xinyu Yuan, Mikhail Galkin, Sophie Xhonneux, Ming Zhang, Maxime Gazeau,
and Jian Tang. A*net: A scalable path-based reasoning approach for knowledge graphs. In
Advances in Neural Information Processing Systems, 2023.
A DATASETS
We conduct evaluation on 57 openly available KGs of various sizes falling into three groups, i.e., transductive, inductive with new entities, and inductive with both new entities and relations at inference time. The statistics for the 16 transductive datasets are presented in Table 4, the 18 inductive entity datasets in Table 5, and the 23 inductive entity and relation datasets in Table 6. For each dataset, we also list a currently published state-of-the-art model which, at the moment, is in every case a model trained specifically on that target graph. The performance of those SOTA models is aggregated as Supervised SOTA in the results reported in the tables and figures. We omit smaller datasets (Kinships, UMLS, Countries, Family) with saturated performance as non-representative.
For the inductive datasets HM 1k, HM 3k, and HM 5k used in Hamaguchi et al. (2017) and Liu et al. (2021), we report the performance of predicting both heads and tails (denoted as b-1K, b-3K, b-5K in Liu et al. (2021)) and compare against the respective baselines. Some inductive datasets (MT2, MT3, MT4) from MTDEA (Zhou et al., 2023) do not have reported entity-only KG completion performance. For Hetionet, we use the splits available in TorchDrug (Zhu et al., 2022b) and compare against the RotatE baseline reported by TorchDrug.
B SPARSE IMPLEMENTATION OF THE GRAPH OF RELATIONS
The graph of relations Gr can be efficiently computed from the original multi-relational graph G with sparse matrix multiplications (spmm). Four spmm operations correspond to the four fundamental relation types Rfund = {h2t, h2h, t2h, t2t}.
Given the original graph G with |V| nodes and |R| relation types, its adjacency tensor is A ∈ R^{|V|×|R|×|V|}. For clarity, A can be rewritten with heads H and tails T as A ∈ R^{|H|×|R|×|T|}. From A we first build two sparse matrices E_h ∈ R^{|H|×|R|} and E_t ∈ R^{|T|×|R|} that capture the head-relation and tail-relation pairs, respectively. Computing interactions between relations is then equivalent to one spmm operation between the relevant adjacencies:

A_h2h = E_h^T E_h,   A_h2t = E_h^T E_t,   A_t2h = E_t^T E_h,   A_t2t = E_t^T E_t,   each in R^{|R|×|R|}.

For each of the four sparse matrices, the respective edge index is extracted from all non-zero values (or, equivalently, by setting all non-zero values in the sparse matrix to ones). The final adjacency tensor A_r of the graph of relations Gr with four fundamental edge types is obtained by stacking all four adjacencies ([·, ·] denotes stacking):

A_r = [A_h2h, A_h2t, A_t2h, A_t2t] ∈ R^{|R|×|R|×4}.
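For concreteness, the construction above can be sketched in a few lines with scipy.sparse. This is a minimal illustration, not the released implementation: the function name, the triple-array input format, and the exact naming of the four edge types are our assumptions, following the equations above.

```python
# Minimal sketch (not the authors' code): build the four relation-relation
# adjacencies of the graph of relations with one spmm each.
import numpy as np
import scipy.sparse as sp


def build_relation_graph(heads, rels, tails, num_entities, num_relations):
    ones = np.ones(len(rels))
    # E_h[v, r] != 0 iff entity v appears as the head of some triple with relation r.
    E_h = sp.coo_matrix((ones, (heads, rels)), shape=(num_entities, num_relations)).tocsr()
    # E_t[v, r] != 0 iff entity v appears as the tail of some triple with relation r.
    E_t = sp.coo_matrix((ones, (tails, rels)), shape=(num_entities, num_relations)).tocsr()

    # One sparse matrix multiplication per fundamental relation type.
    fundamental = {
        "h2h": E_h.T @ E_h,
        "h2t": E_h.T @ E_t,
        "t2h": E_t.T @ E_h,
        "t2t": E_t.T @ E_t,
    }
    # Keep only the non-zero pattern: each adjacency becomes a 2 x num_edges index array.
    return {name: np.stack([adj.tocoo().row, adj.tocoo().col])
            for name, adj in fundamental.items()}


# Toy usage: triples (0, r0, 1) and (1, r1, 2) produce a t2h edge r0 -> r1,
# since the tail of the first triple is the head of the second.
edges = build_relation_graph(np.array([0, 1]), np.array([0, 1]), np.array([1, 2]),
                             num_entities=3, num_relations=2)
print(edges["t2h"])  # edge from relation 0 to relation 1
```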
C HYPERPARAMETERS
Main results. The hyperparameters for the pre-trained ULTRA model reported in Section 5.2, including Table 1, Table 2, Figure 1, and Figure 4, are presented in Table 7. Both GNNs, over the relation graph Gr and over the main graph G, are 6-layer GNNs with a hidden dimension of 64, the DistMult message function, and sum aggregation, roughly following the NBFNet setup. Each layer of GNNe (the inductive link predictor over the main entity graph) features a 2-layer MLP as the function g(·) that transforms conditional relation representations into layer-specific relation representations. The model is trained on the mixture of FB15k237, WN18RR, and CoDEx-Medium for 200,000 steps with a batch size of 64, using the AdamW optimizer and a learning rate of 0.0005. Each batch contains training samples from a single graph; the probability of sampling a graph from the mixture is proportional to the number of edges in its training set.
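A small sketch of this per-batch graph sampling is shown below. It is illustrative only, assuming each training set is a plain list of (head, relation, tail) triples; the function and variable names are ours.

```python
# Minimal sketch of the pre-training batching scheme: each batch is drawn from a
# single graph, and graphs are picked with probability proportional to their
# number of training edges.
import random


def sample_batch(train_triples_per_graph, batch_size=64, rng=None):
    rng = rng or random.Random(0)
    names = list(train_triples_per_graph)
    sizes = [len(train_triples_per_graph[name]) for name in names]
    # Pick a graph proportionally to its edge count, then sample triples from it.
    graph = rng.choices(names, weights=sizes, k=1)[0]
    triples = train_triples_per_graph[graph]
    batch = [triples[rng.randrange(len(triples))] for _ in range(batch_size)]
    return graph, batch


# Toy usage: the 300-edge graph is sampled roughly three times as often as the 100-edge one.
toy = {"graph_a": [(0, 0, 1)] * 300, "graph_b": [(0, 0, 1)] * 100}
graph, batch = sample_batch(toy, batch_size=4)
```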
Table 4: Transductive datasets (16) used in the experiments. Train, Valid, Test denote triples in the
respective set. Task denotes the prediction task: h/t is predicting both heads and tails, tails is only
predicting tails. SOTA points to the best reported result.
Table 5: Inductive entity (e) datasets (18) used in the experiments. Triples denotes the number of edges in the graph provided at training, validation, or test time. Valid and Test denote the triples to be predicted in the validation and test sets of the respective validation and test graphs.
Table 6: Inductive entity and relation (e, r) datasets (23) used in the experiments. Triples denotes the number of edges in the graph provided at training, validation, or test time. Valid and Test denote the triples to be predicted in the validation and test sets of the respective validation and test graphs.
Table 7: ULTRA hyperparameters for pre-training. GNNr denotes a GNN over the graph of relations Gr; GNNe is a GNN over the original entity graph G.
Fine-tuning and training from scratch. Table 8 reports training durations for fine-tuning the pre-trained ULTRA and for training models from scratch on each dataset (for the ablation study in Figure 5 and Section 5.3). In fine-tuning, if the number of fine-tuning epochs k is more than one, we use the best checkpoint (out of k) evaluated on the validation set of the respective graph. Each fine-tuning run was repeated 5 times with different random seeds; each model trained from scratch was trained 3 times with different random seeds.
Ablation: graphs in the training mixture. For the ablation experiments reported in Figure 6, Table 9 describes the mixtures of graphs used in the pre-trained models. The mixtures of 5 or more graphs include large graphs with 100k+ entities each, so we reduced the number of training steps to complete training within 3 days (6 GPU-days in total, as each model was trained on 2 A100 GPUs).
D FULL RESULTS
The full, per-dataset MRR and Hits@10 results of the zero-shot inference of the pre-trained ULTRA model, the fine-tuned model, and the best reported supervised SOTA baselines are presented in Table 10 and Table 11. The zero-shot results are deterministic, whereas for fine-tuning we report the average of 5 different seeds with standard deviations.
Table 10 corresponds to Figure 1 and contains results on 43 graphs where published SOTA baselines are available, that is, on 3 pre-training graphs, 14 inductive entity (e) graphs, 13 inductive entity and relation (e, r) graphs, and 13 transductive graphs. Table 11 contains results on the remaining 14 graphs for which published SOTA exists only partially, that is, in terms of the Hits@10 (50 neg) metric computed against 50 randomly chosen negatives. We show that this metric greatly overestimates the real performance and encourage future work to report full MRR and Hits@k metrics computed against the whole entity set.
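To illustrate why ranking against 50 random negatives is overly optimistic, here is a toy sketch with synthetic scores (not the evaluation code used in the paper); the entity count and score distribution are arbitrary assumptions.

```python
# Toy illustration: a prediction that ranks poorly against the whole entity set
# can still easily land in the top 10 when compared with only 50 random negatives.
import random


def rank_against(true_score, negative_scores):
    # Rank of the true entity: 1 + number of negatives scored at least as high.
    return 1 + sum(score >= true_score for score in negative_scores)


rng = random.Random(0)
num_entities = 40_000                                      # assumed KG size
negatives = [rng.gauss(0.0, 1.0) for _ in range(num_entities - 1)]
true_score = 1.5                                           # a moderately good prediction

full_rank = rank_against(true_score, negatives)            # in the thousands -> Hits@10 = 0
sampled_rank = rank_against(true_score, rng.sample(negatives, 50))  # almost surely <= 10

print(full_rank <= 10, sampled_rank <= 10)
```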
The results in Table 2 on 57 graphs (Section 5.2) are aggregated from Table 10 and Table 11.
Table 8: Hyperparameters for fine-tuning ULTRA and training from scratch, in the format (# epochs, steps per epoch); e.g., (1, full) means one full epoch over the training set of the respective graph, while (1, 1000) means 1 epoch of 1000 steps over the training set.
Table 9: Pre-training mixtures of 1 to 8 graphs used in the ablation study, with the batch size and number of training steps of each pre-trained model.

Graphs in mixture   1        2        3        4        5        6        8
FB15k237            ✓        ✓        ✓        ✓        ✓        ✓        ✓
WN18RR                       ✓        ✓        ✓        ✓        ✓        ✓
CoDEx-M                               ✓        ✓        ✓        ✓        ✓
NELL995                                        ✓        ✓        ✓        ✓
YAGO3-10                                                ✓        ✓        ✓
ConceptNet100k                                                   ✓        ✓
DBpedia100k                                                               ✓
AristoV4                                                                  ✓
Batch size          32       16       64       16       16       16       16
# steps             200,000  400,000  200,000  400,000  200,000  200,000  200,000

Commenting on the performance of the pre-trained ULTRA model on larger transductive graphs, we attribute the performance difference to the following factors:
• Training data mixture and OOD generalization: the model reported in Table 1 was trained on 3 medium-sized KGs (15k-40k nodes, 80k-270k edges), while the biggest gaps are on larger graphs with many more nodes and edges (up to 120k nodes and 1M edges for YAGO3-10), with many more relation types (1600+ in AristoV4), or that are very sparse (as in ConceptNet100k with 100k edges over 78k nodes). Size generalization issues are common for GNNs, as found in Yehudai et al. (2021); Zhou et al. (2022). However, if we take the ULTRA checkpoint pre-trained on 8 graphs (Table 9) and run evaluation on all 16 transductive graphs, the average performance is better than that of supervised SOTA models, i.e., 0.377 MRR / 0.537 Hits@10 for ULTRA against 0.371 MRR / 0.511 Hits@10 for the baselines.
• Transductive models have the privilege of memorizing target data distributions in entity- and relation-specific vectors, amounting to many millions of parameters overall, e.g., 80M parameters for the supervised SOTA BiQUE on ConceptNet100k. This performance, however, comes at the cost of having no transferability across KGs. In contrast, all pre-trained ULTRA checkpoints are rather small (about 170k parameters) but generalize to any KG. We acknowledge the scaling behavior in Section 6 and consider it a very promising avenue for future work. In particular, scaling laws for GNNs and common graph learning tasks (like link prediction) have not been derived yet, so we can only hypothesize whether there is any connection between GNN size, dataset size, graph topology, and expected performance. Generally, there is no consensus in the graph learning community on whether deep or wide (non-geometric) GNNs bring immediate benefits, mostly due to the rising issues of oversmoothing and oversquashing (some initial results were recently presented in Di Giovanni et al. (2023)). In our experiments, we observe that the diversity of graphs in the pre-training mixture plays an important role as well. Therefore, we believe that a brute-force increase of the model size is unlikely to bring benefits unless it is paired with more diverse training data and more intricate mechanisms for capturing relational interactions.
F COMPUTATIONAL COMPLEXITY
The time complexity of ULTRA is upper-bounded by the entity-level GNNe: the GNNr on the graph of relations has negligible overhead, since the number of nodes in this graph equals the number of unique relation types |R|, and |R| ≪ |V|, that is, the number of relation types is usually orders of magnitude smaller than the number of nodes. In our case, the main entity-level GNNe is NBFNet, so we mainly refer to Appendix C of Zhu et al. (2021) for the necessary derivations. The time complexity of a single layer is O(|E|d + |V|d^2), i.e., generally linear in the number of edges. With T layers, the overall complexity of a single forward pass is O(T(|E|d + |V|d^2)), but T is usually a small constant (6 layers), so the complexity remains essentially linear in the number of edges. However, due to their sparsity, GNNs are usually bounded by memory rather than compute. The memory complexity of the basic NBFNet implementation is O(T|E|d), linear in the number of edges; thanks to the efficient kernelized implementation of relational message passing (already provided by NBFNet), it is reduced to O(T|V|d), linear in the number of nodes. Moreover, the complexity can be further reduced by applying more scalable and optimized versions of entity-level GNNs such as AdaProp (Zhang et al., 2023) or A*Net (Zhu et al., 2023).
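For intuition, a back-of-the-envelope estimate with the pre-training hyperparameters T = 6 and d = 64 on a graph of roughly FB15k-237's size (about 15k nodes and 270k training edges), assuming fp32 activations (the actual footprint also depends on the framework and intermediate buffers):

```latex
% Edge-based memory of the basic implementation, O(T|E|d), versus the
% node-based memory of the kernelized implementation, O(T|V|d).
\[
T\,|\mathcal{E}|\,d = 6 \times 270{,}000 \times 64 \approx 1.0 \times 10^{8}
\ \text{floats} \approx 415\ \text{MB (fp32)},
\]
\[
T\,|\mathcal{V}|\,d = 6 \times 15{,}000 \times 64 \approx 5.8 \times 10^{6}
\ \text{floats} \approx 23\ \text{MB (fp32)}.
\]
```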
Table 10: Full results (MRR, Hits@10) of ULTRA in the zero-shot inference and fine-tuning regimes on 43 graphs, compared to the best reported Supervised SOTA. The numbers correspond to Figure 1.
Table 11: Full results (MRR, Hits@10) of ULTRA in the zero-shot inference and fine-tuning regimes on 14 graphs where Supervised SOTA reports an estimated Hits@10 (50 negs) metric (where available). The numbers correspond to Figure 4.