Synthesis Lectures on Data, Semantics, and Knowledge

Embedding Knowledge Graphs with RDF2vec

Synthesis Lectures on Data, Semantics, and Knowledge
Series Editors
Ying Ding, The University of Texas at Austin, Austin, USA
Paul Groth, Amsterdam, Noord-Holland, The Netherlands
This series focuses on the pivotal role that data on the web, and the emergent technologies that surround it, play both in the evolution of the World Wide Web and in applications in domains requiring data integration and semantic analysis. The large-scale availability of both structured and unstructured data on the Web has enabled radically new technologies to develop. It has impacted developments in a variety of areas, including machine learning, deep learning, semantic search, and natural language processing. Knowledge and semantics are a critical foundation for the sharing, utilization, and organization of this data. The series aims both to provide pathways into the field of research and to convey an understanding of the principles underlying these technologies for an audience of scientists, engineers, and practitioners.
Heiko Paulheim · Petar Ristoski · Jan Portisch
Embedding Knowledge
Graphs with RDF2vec
Heiko Paulheim
University of Mannheim
Mannheim, Germany

Petar Ristoski
eBay
San Jose, CA, USA

Jan Portisch
SAP SE
Walldorf, Germany
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature
Switzerland AG 2023
This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole
or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation,
broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage
and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or
hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does
not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective
laws and regulations and therefore free for general use.
The publisher, the authors, and the editors are safe to assume that the advice and information in this book are
believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give
a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that
may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and
institutional affiliations.
This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Preface
my partner Tine for bearing many unsolicited dinner table monologues on graphs, vec-
tors, and stuff, and my daughter Antonia for luring me away from graphs, vectors, and
stuff every once in a while.
Jan would like to thank all researchers from the University of Mannheim who par-
ticipated in lengthy discussions on RDF2vec and graph embeddings, particularly Sven
Hertling, Nicolas Heist, and Andreea Iana. In addition, Jan is grateful for interesting exchanges at SAP, especially with Michael Hladik, Guilherme Costa, and Michael Monych. Lastly, Jan would like to thank his partner, Isabella, and his best friend, Sophia, for their continued support in his private life.
Petar would like to thank Heiko Paulheim, Christian Bizer, and Simone Paolo Ponzetto for making this work possible in the first place by forming an ideal environment for conducting research at the University of Mannheim; Michael Cochez for the insightful discussions and collaboration on extending RDF2vec in many directions; and Anna Lisa Gentile for the collaboration on applying RDF2vec in several research projects during his time at IBM Research. Finally, he thanks his wife, Könül, and his daughter, Ada, for the non-latent, high-dimensional support and love.
Moreover, the authors would like to thank the developers of the tools used for the examples and evaluations in this book: pyRDF2vec and the Python KG extension, especially Gilles Vandewiele for quick responses on all issues around pyRDF2vec, and GEval, which has been used countless times in evaluations. We would also like to thank all people involved in the experiments shown in this book: Ahmad Al Taweel, Andreea Iana, Michael Cochez, and all people who got their hands dirty with RDF2vec.
Finally, we would like to thank the series editors, Paul Groth and Ying Ding, for
inviting us to create this book and providing us with this unique opportunity, and Ambrose
Berkumans, Ben Ingraham, Charles Glaser, and Susanne Filler at Springer Nature for their
support throughout the production of this book.
1 Introduction
   1.1 What is a Knowledge Graph?
       1.1.1 A Short Bit of History
       1.1.2 Definitions
       1.1.3 General-Purpose Knowledge Graphs
   1.2 Feature Extraction from Knowledge Graphs
   1.3 Node Classification in RDF
   1.4 Conclusion
   References
2 From Word Embeddings to Knowledge Graph Embeddings
   2.1 Word Embeddings with word2vec
   2.2 Representing Graphs as Sequences
   2.3 Learning Representations from Graph Walks
   2.4 Software Libraries
   2.5 Node Classification with RDF2vec
   2.6 Conclusion
   References
3 Benchmarking Knowledge Graph Embeddings
   3.1 Node Classification with Internal Labels—SW4ML
   3.2 Machine Learning with External Labels—GEval
   3.3 Benchmarking Expressivity of Embeddings—DLCC
       3.3.1 DLCC Gold Standard based on DBpedia
       3.3.2 DLCC Gold Standard based on Synthetic Data
   3.4 Conclusion
   References
4 Tweaking RDF2vec
   4.1 Introducing Edge Weights
       4.1.1 Graph Internal Weighting Approaches
       4.1.2 Graph External Weighting Approaches
Abstract

In this chapter, the basic concept of a knowledge graph is introduced. We discuss why knowledge graphs are important for machine learning and data mining tasks, present classic feature extraction or propositionalization techniques, which are the historical predecessors of knowledge graph embeddings, and show how these techniques are used for basic node classification tasks.
1.1 What is a Knowledge Graph?

1.1.1 A Short Bit of History

The term knowledge graph (or KG for short) was popularized by Google in 2012, when they announced in a blog post that their search would, in the future, be based on structured knowledge representations, not only on string similarity and keyword overlap, as done until then.1 Generally, a knowledge graph is a means of knowledge representation in which things in the world (e.g., persons, places, or events) are represented as nodes, while their relations (e.g., a person taking part in an event, an event happening at a place) are represented as labeled edges between those nodes.
While Google popularized the term, the idea of knowledge graphs is much older than that.
Earlier works usually used terms like knowledge base or semantic network, among others
(Ji et al. 2021). Although the exact origin of the term knowledge graph is not fully known,
1 https://blog.google/products/search/introducing-knowledge-graph-things-not/.
Hogan et al. (2021) have traced the term back to a paper from the 1970s (Schneider 1973).
In the Semantic Web community and the Linked Open Data (Bizer et al. 2011) movement,
researchers have been producing datasets that would follow the idea of a knowledge graph
for decades.
In addition to open knowledge graphs created by the research community, and the already mentioned knowledge graph used by Google, other major companies nowadays also use knowledge graphs as a central means to represent corporate knowledge. Notable examples include, but are not limited to, eBay, Facebook, IBM, and Microsoft (Noy et al. 2019).
1.1.2 Definitions
While a lot of researchers and practitioners claim to use knowledge graphs, the field has
long lacked a common definition of the term knowledge graph. Ehrlinger and Wöß (2016)
have collected a few of the most common definitions of knowledge graphs. In particular,
they list the following definitions:
1. A knowledge graph (1) mainly describes real-world entities and their interrelations,
organized in a graph, (2) defines possible classes and relations of entities in a schema,
(3) allows for potentially interrelating arbitrary entities with each other, and (4) covers
various topical domains. (Paulheim 2017)
2. Knowledge graphs are large networks of entities, their semantic types, properties, and
relationships between entities. (Journal of Web Semantics 2014)
3. Knowledge graphs could be envisaged as a network of all kinds of things which are
relevant to a specific domain or to an organization. (Semantic Web Company 2014)
4. A Knowledge Graph [is] an RDF graph. An RDF graph consists of a set of RDF triples
where each RDF triple (s, p, o) is an ordered set of the following RDF terms: a subject
s ∈ U ∪ B, a predicate p ∈ U , and an object o ∈ U ∪ B ∪ L. An RDF term is either a
URI u ∈ U , a blank node b ∈ B, or a literal l ∈ L. (Färber et al. 2018)
5. Knowledge, in the form of facts, [which] are interrelated, and hence, recently this
extracted knowledge has been referred to as a knowledge graph. (Pujara et al. 2013)
6. A knowledge graph acquires and integrates information into an ontology and applies a
reasoner to derive new knowledge. (Ehrlinger and Wöß 2016)
In the course of this book, we will use a very minimalistic definition of a knowledge graph. We consider a knowledge graph a graph G = (V, E) consisting of a set of entities V (i.e., vertices in the graph) and a set of labeled edges E ⊆ V × R × (V ∪ L), where R defines the set of possible relation types (which can be considered edge labels), and L is a set of literals (e.g., numbers or string values). Moreover, each entity in V can have one or more classes assigned, where C defines the set of possible classes. Further ontological constructs, such as defining a class hierarchy or describing relations with domains and ranges, are not considered here.
While most of the definitions above focus more on the contents of the knowledge graph, we, in this book, look at knowledge graphs from a more technical perspective, since the methods discussed in this book are not bound to a particular domain. Our definition is therefore purely technical and does not constrain the contents of the knowledge graph in any way.
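This minimalistic definition maps directly onto a simple data structure. The following sketch (all entity, relation, and class names are illustrative examples, not taken from the book) represents a knowledge graph as a set of (subject, relation, object) triples and derives the sets V and R from it:

```python
# A tiny knowledge graph following the definition G = (V, E), with
# E ⊆ V × R × (V ∪ L): each edge is a (subject, relation, object) triple.
# Objects may be entities from V or literals from L (here: the number).
edges = {
    ("Berlin", "capitalOf", "Germany"),
    ("Berlin", "population", 3_700_000),   # literal-valued edge
    ("Angela_Merkel", "bornIn", "Hamburg"),
    ("Hamburg", "locatedIn", "Germany"),
}

# Class assignments: each entity in V can have one or more classes from C.
classes = {
    "Berlin": {"City"},
    "Hamburg": {"City"},
    "Germany": {"Country"},
    "Angela_Merkel": {"Person"},
}

# Derive the entity set V (subjects are always entities; objects only when
# they are not literals -- here we use "is a string" as a crude proxy)
# and the relation set R.
V = {s for s, _, _ in edges} | {o for _, _, o in edges if isinstance(o, str)}
R = {p for _, p, _ in edges}

print(sorted(R))  # → ['bornIn', 'capitalOf', 'locatedIn', 'population']
```

Note how the definition deliberately stops here: nothing in this structure encodes a class hierarchy or domain/range restrictions, mirroring the purely technical view taken in the book.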
1.1.3 General-Purpose Knowledge Graphs

While the research community has come up with a large number of knowledge graphs, there are a few which are open, large-scale, general-purpose knowledge graphs covering a lot of domains in reasonable depth. They are therefore interesting ingredients for artificial intelligence applications, since they are ready to use and contain background knowledge for many different tasks at hand.
One of the earliest attempts to build a general-purpose knowledge graph was Cyc, a project started in the 1980s (Lenat 1995). The project was initiated to build a machine-processable collection of the essence of the world's knowledge, using a proprietary language called CycL. After an investment of more than 2,000 person-years, the project, in the end, encompassed almost 25M axioms and rules2 – which is most likely still just a tiny fraction of the world's knowledge.
The example of Cyc shows that having knowledge graphs built manually by modeling
experts does not really scale (Paulheim 2018). Therefore, modern approaches usually utilize
different techniques, such as crowdsourcing and/or heuristic extraction.
Crowdsourcing knowledge graphs was first explored with Freebase (Pellissier-Tanon
et al. 2016), with the goal of establishing a large community of volunteers, comparable to
Wikipedia. To that end, the schema of Freebase was kept fairly simple to lower the entrance
barrier as much as possible. Freebase was acquired by Google in 2010 and shut down in
2014.
Wikidata (Vrandečić and Krötzsch 2014) also uses a crowd editing approach. In contrast to Cyc and Freebase, Wikidata also imports entire large datasets, such as several national libraries' bibliographies. Porting the data from Freebase to Wikidata is also a long-standing goal (Pellissier-Tanon et al. 2016).
A more efficient way of knowledge graph creation is the use of structured or semi-
structured sources. Wikipedia is a commonly used starting point for knowledge graphs such
as DBpedia (Lehmann et al. 2013) and YAGO (Suchanek et al. 2007). In these approaches,
2 https://files.gotocon.com/uploads/slides/conference_13/724/original/AI_GOTO%20Lenat%20keynote%2030%20April%202019%20hc.pdf.
an entity in the knowledge graph is created per page in Wikipedia, and additional axioms
are extracted from the respective Wikipedia pages using different means.
DBpedia mainly uses infoboxes in Wikipedia. Those are manually mapped to a predefined ontology; both the ontology and the mapping are crowd-sourced using a Wiki and a community of volunteers. Given those mappings, the DBpedia Extraction Framework creates a graph in which each page in Wikipedia becomes an entity, and all values and links in an infobox become attributes and edges in the graph.
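The extraction principle just described can be sketched in a few lines (a strong simplification for illustration, not the actual DBpedia Extraction Framework; the infobox keys, property names, and prefixes below are assumptions):

```python
# Simplified sketch of infobox-style extraction: one entity per page,
# one edge per mapped infobox field. Purely illustrative.
infobox = {
    "birth_place": "Ulm",          # a link → an edge to another entity
    "birth_date": "1879-03-14",    # a plain value → a literal attribute
    "field": "Physics",
}

# Crowd-sourced mapping from raw infobox keys to ontology properties
# (illustrative property names, loosely styled after DBpedia's dbo: prefix).
mapping = {
    "birth_place": "dbo:birthPlace",
    "birth_date": "dbo:birthDate",
    "field": "dbo:field",
}

def extract_triples(page_title, infobox, mapping):
    """Turn one Wikipedia page plus its infobox into graph edges."""
    subject = f"dbr:{page_title}"
    return [(subject, mapping[k], v) for k, v in infobox.items() if k in mapping]

triples = extract_triples("Albert_Einstein", infobox, mapping)
print(triples[0])  # → ('dbr:Albert_Einstein', 'dbo:birthPlace', 'Ulm')
```

Unmapped infobox keys are simply dropped, which reflects the real setting: only fields covered by the community-maintained mapping contribute edges to the graph.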
YAGO uses a similar process but classifies instances based on the category structure and
WordNet (Miller 1995) instead of infoboxes. YAGO integrates various language editions of
Wikipedia into a single graph and represents temporal facts with meta-level statements, i.e.,
RDF reification.
CaLiGraph also uses information in categories but aims at converting them into formal axioms using DBpedia as supervision (Heist and Paulheim 2019). Moreover, instances from Wikipedia list pages are considered for populating the knowledge graph (Kuhn et al. 2016, Paulheim and Ponzetto 2013). The result is a knowledge graph that is not only richly populated on the instance level but also has a large number of defining axioms for classes (Heist and Paulheim 2020).
A similar approach to YAGO, i.e., the combination of information in Wikipedia and WordNet, is used by BabelNet (Navigli and Ponzetto 2012). The main purpose of BabelNet is the collection of synonyms and translations in various languages, so that this knowledge graph is particularly well suited for supporting multi-language applications. Similarly, ConceptNet (Speer and Havasi 2012) collects synonyms and translations in various languages, integrating multiple third-party knowledge graphs itself.
DBkWik (Hertling and Paulheim 2018) uses the same codebase as DBpedia, but applies it
to a multitude of Wikis. This leads to a graph that has larger coverage and level of detail for
many long-tail entities and is highly complementary to DBpedia. However, the absence of
a central ontology and mappings, as well as the existence of duplicates across Wikis, which
might not be trivial to detect, imposes a number of data integration challenges not present
in DBpedia (Hertling and Paulheim 2022).
Another source of structured data is structured annotations in Web pages using techniques such as RDFa, Microdata, and Microformats (Meusel et al. 2014). While the pure collection of those could, in theory, already be considered a knowledge graph, that graph would be rather disconnected, consist of a plethora of small, unconnected components (Paulheim 2015), and require additional cleanup for compensating irregular use of the underlying schemas and shortcomings in the extraction (Meusel and Paulheim 2015). A consolidated version of this data into a more connected knowledge graph has been published under the name VoldemortKG (Tonon et al. 2016).
The extraction of a knowledge graph from semi-structured sources is considered easier
than the extraction from unstructured sources. However, the amount of unstructured data
exceeds the amount of structured data by far.3 Therefore, extracting knowledge from unstructured sources has also been proposed.
NELL (Carlson et al. 2010) is an example of extracting a knowledge graph from free text.
NELL was originally trained with a few seed examples and continuously runs an iterative
coupled learning process. In each iteration, facts are used to learn textual patterns to detect
those facts, and patterns learned in previous iterations are used to extract new facts, which
serve as training examples in later iterations. To improve the quality, NELL has introduced
a feedback loop incorporating occasional human feedback (Pedro and Hruschka 2012).
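The coupled learning loop described above can be sketched as a toy bootstrap (a heavily simplified illustration of the idea, not NELL's actual implementation; the corpus, the seed fact, and the pattern format are all made up for this example):

```python
# Toy sketch of coupled pattern/fact bootstrapping in the spirit of NELL:
# known facts yield textual patterns, patterns yield new candidate facts.
corpus = [
    "Paris is the capital of France",
    "Berlin is the capital of Germany",
    "Madrid is the capital of Spain",
]

facts = {("Paris", "France")}  # seed fact for an illustrative capitalOf relation

for _ in range(2):  # a couple of bootstrap iterations
    # 1. Use the known facts to learn textual patterns.
    patterns = set()
    for city, country in facts:
        for sentence in corpus:
            if city in sentence and country in sentence:
                patterns.add(sentence.replace(city, "{X}").replace(country, "{Y}"))

    # 2. Use the learned patterns to extract new candidate facts,
    #    which serve as training examples in the next iteration.
    for pattern in patterns:
        prefix, rest = pattern.split("{X}")
        middle, suffix = rest.split("{Y}")
        for sentence in corpus:
            if middle in sentence:
                x = sentence.split(middle)[0].removeprefix(prefix)
                y = sentence.split(middle)[1].removesuffix(suffix)
                facts.add((x.strip(), y.strip()))

print(sorted(facts))  # → [('Berlin', 'Germany'), ('Madrid', 'Spain'), ('Paris', 'France')]
```

The sketch also shows why the real system needs the quality safeguards mentioned above: any noisy pattern would immediately propagate wrong facts into later iterations, which is exactly what NELL's human feedback loop counteracts.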
WebIsA (Seitner et al. 2016) also extracts facts from natural language text but focuses on the creation of a large-scale taxonomy. For each extracted fact, rich metadata are collected, including the sources, the original sentences, and the patterns used in the extraction of a particular fact. That metadata is exploited for computing a confidence score for each fact (Hertling and Paulheim 2017).
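One simple way such provenance metadata can be turned into a confidence score is to count the distinct sources and extraction patterns supporting each fact (a hypothetical scheme for illustration only; WebIsA's actual scoring is more involved):

```python
# Hypothetical frequency-based confidence scoring over extraction metadata:
# the more distinct sources and patterns support a fact, the higher its score.
from collections import defaultdict

# (fact, source, pattern_id) records, as collected during extraction (made up).
records = [
    (("dog", "animal"), "siteA", "p1"),
    (("dog", "animal"), "siteB", "p1"),
    (("dog", "animal"), "siteB", "p2"),
    (("car", "animal"), "siteC", "p1"),  # a noisy extraction
]

support = defaultdict(lambda: {"sources": set(), "patterns": set()})
for fact, source, pattern in records:
    support[fact]["sources"].add(source)
    support[fact]["patterns"].add(pattern)

def confidence(fact, max_sources=10, max_patterns=5):
    """Score in [0, 1] combining source diversity and pattern diversity."""
    s = len(support[fact]["sources"]) / max_sources
    p = len(support[fact]["patterns"]) / max_patterns
    return min(1.0, 0.5 * s + 0.5 * p)

print(round(confidence(("dog", "animal")), 2))  # → 0.3
print(round(confidence(("car", "animal")), 2))  # → 0.15
```

The well-supported fact scores higher than the one seen only once, which is the basic effect any metadata-driven confidence measure is after.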
Table 1.1 depicts an overview of some of the knowledge graphs discussed above. ConceptNet and WebIsA are not included, since they do not distinguish a schema and an instance level (i.e., there is no specific distinction between a class and an instance), which does not allow for computing those metrics meaningfully. For Cyc, which is only available as a commercial product today, we used the free version OpenCyc, which had been available until 2017.4

From those metrics, it can be observed that the KGs differ in size by several orders of magnitude. The sizes range from 50,000 instances (for VoldemortKG) to 50 million instances (for Wikidata), so the latter is larger by a factor of 1,000. The same holds for assertions. Concerning the linkage degree, YAGO is much more richly linked than the other graphs.
1.2 Feature Extraction from Knowledge Graphs

When using knowledge graphs in the context of intelligent applications, they are often combined with some machine learning or data mining based processing (van Bekkum et al. 2021). The corresponding algorithms, however, mostly expect tabular or propositional data as input, not graphs; hence, information from the graphs is often transformed into a propositional form first, a process called propositionalization or feature extraction (Lavrač et al. 2020, Ristoski and Paulheim 2014a).
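A minimal form of such propositionalization can be sketched as follows (an illustrative scheme with made-up movie data; the actual techniques surveyed in the cited works are richer): each entity becomes one row, and binary features indicate the presence of particular relation-object combinations in the graph.

```python
# Minimal propositionalization sketch: turn graph neighborhoods into a
# binary feature table. Entities and relations are illustrative examples.
triples = [
    ("Inception", "director", "Nolan"),
    ("Inception", "genre", "SciFi"),
    ("Interstellar", "director", "Nolan"),
    ("Interstellar", "genre", "SciFi"),
    ("Titanic", "director", "Cameron"),
    ("Titanic", "genre", "Romance"),
]

entities = sorted({s for s, _, _ in triples})
# One binary feature per (relation, object) pair observed in the graph.
features = sorted({(p, o) for _, p, o in triples})

triple_set = set(triples)
table = {
    e: [int((e, p, o) in triple_set) for (p, o) in features]
    for e in entities
}

for e in entities:
    print(e, table[e])
```

In this toy table, the two movies sharing a director and a genre end up with identical feature vectors, while the third differs in every feature: exactly the property, proximity in feature space reflecting entity similarity, that the following discussion asks of a good propositional representation.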
Particularly for the combination with machine learning algorithms, it is not only important to have entities in a particular propositional form; this propositional form should also fulfill some additional criteria. In particular, proximity in the feature space should – ideally – reflect the similarity of entities.5 For example, when building a movie recommendation system, we
3 Although it is hard to trace down the provenance of that number, many sources state that 80% of all data is unstructured, such as Das and Kumar (2013).
4 It is still available, e.g., at https://github.com/asanchez75/opencyc.
5 We will discuss this in detail in Chap. 3.