

Synthesis Lectures on Data, Semantics, and Knowledge

Heiko Paulheim · Petar Ristoski · Jan Portisch

Embedding Knowledge Graphs with RDF2vec

Series Editors
Ying Ding, The University of Texas at Austin, Austin, USA
Paul Groth, Amsterdam, Noord-Holland, The Netherlands
This series focuses on the pivotal role that data on the web and the emergent technologies
that surround it play both in the evolution of the World Wide Web as well as applications
in domains requiring data integration and semantic analysis. The large-scale availability of
both structured and unstructured data on the Web has enabled radically new technologies
to develop. It has impacted developments in a variety of areas including machine learning,
deep learning, semantic search, and natural language processing. Knowledge and semantics
are a critical foundation for the sharing, utilization, and organization of this data. The
series aims both to provide pathways into the field of research and an understanding of
the principles underlying these technologies for an audience of scientists, engineers, and
practitioners.
Heiko Paulheim
University of Mannheim
Mannheim, Germany

Petar Ristoski
eBay (United States)
San Jose, CA, USA

Jan Portisch
SAP SE
Walldorf, Germany

ISSN 2691-2023 ISSN 2691-2031 (electronic)


Synthesis Lectures on Data, Semantics, and Knowledge
ISBN 978-3-031-30386-9 ISBN 978-3-031-30387-6 (eBook)
https://doi.org/10.1007/978-3-031-30387-6

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature
Switzerland AG 2023
This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole
or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation,
broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage
and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or
hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does
not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective
laws and regulations and therefore free for general use.
The publisher, the authors, and the editors are safe to assume that the advice and information in this book are
believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give
a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that
may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and
institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Preface

Knowledge graphs are an important ingredient in today’s artificial intelligence systems.
They provide a means to encode arbitrary knowledge to be processed in those AI systems,
allowing an interpretation of that knowledge both for humans and machines. Today, there
are large-scale open knowledge graphs, like Wikidata or DBpedia, as well as privately
owned knowledge graphs in organizations, e.g., the Google knowledge graph used in the
Google search engine.
Knowledge graph embedding is a technique which projects entities and relations in a
knowledge graph into a continuous vector space. Many other components of AI systems,
especially machine learning components, can work better with those continuous representations
than with the graph itself, and often yield superior result quality compared
to components that try to extract non-continuous features from a graph.
RDF2vec is a knowledge graph embedding approach which was invented in the scope
of the Mine@LOD project¹ and has evolved since then, which has led to numerous variants
of the original approach. There exist different implementations of the approach.
Moreover, the Web page rdf2vec.org² collects far more than 60 applications of
RDF2vec to a large variety of problems in a number of domains, ranging from NLP applications
like information retrieval to improving computer security by utilizing a knowledge
graph of security threats.
With this book, we want to give a gentle introduction to the idea of knowledge graph
embeddings with RDF2vec. We discuss the different variants that exist, including their
advantages and disadvantages, and give examples for using RDF2vec in practice.
Heiko would like to thank all the researchers in his team at the University of
Mannheim, i.e., Andreea Iana, Antonis Klironomos, Franz Krause, Martin Böckling,
Michael Schlechtinger, Nicolas Heist, Sven Hertling, and Tobias Weller, as well as Rita
Sousa from Universidade de Lisboa, who worked on a few interesting extensions for
RDF2vec during her research stay in Mannheim. Moreover, all students who worked with
RDF2vec and provided valuable input and feedback, i.e., Alexander Lütke, Niclas Heilig,
Michael Voit, Angelos Loucas, Rouven Grenz, and Siraj Sheikh Afham Uddin. Finally,
my partner Tine for bearing many unsolicited dinner table monologues on graphs, vectors,
and stuff, and my daughter Antonia for luring me away from graphs, vectors, and
stuff every once in a while.

1 https://gepris.dfg.de/gepris/projekt/238007641.
2 http://www.rdf2vec.org/.

Jan would like to thank all researchers from the University of Mannheim who participated
in lengthy discussions on RDF2vec and graph embeddings, particularly Sven
Hertling, Nicolas Heist, and Andreea Iana. In addition, Jan is grateful for interesting
exchanges at SAP, especially with Michael Hladik, Guilherme Costa, and Michael
Monych. Lastly, Jan would like to thank his partner, Isabella, and his best friend, Sophia,
for their continued support in his private life.
Petar would like to thank Heiko Paulheim, Christian Bizer, and Simone Paolo Ponzetto
for making this work possible in the first place, by forming an ideal environment for
conducting research at the University of Mannheim. Michael Cochez for the insightful
discussions and collaboration on extending RDF2vec in many directions. Anna Lisa
Gentile for the collaboration on applying RDF2vec in several research projects during my
time at IBM Research. Finally, my wife, Könül, and my daughter, Ada, for the non-latent,
high-dimensional support and love.
Moreover, the authors would like to thank the developers of pyRDF2vec and the Python
KG extension, which we used for examples in this book, especially Gilles Vandewielle
for quick responses on all issues around pyRDF2vec, and GEval, which has been used
countless times in evaluations. Moreover, we would like to thank all people involved in
the experiments shown in this book: Ahmad Al Taweel, Andreea Iana, Michael Cochez,
and all people who got their hands dirty with RDF2vec.
Finally, we would like to thank the series editors, Paul Groth and Ying Ding, for
inviting us to create this book and providing us with this unique opportunity, and Ambrose
Berkumans, Ben Ingraham, Charles Glaser, and Susanne Filler at Springer Nature for their
support throughout the production of this book.

Mannheim, Germany Heiko Paulheim


Walldorf, Germany Jan Portisch
San Jose, USA Petar Ristoski
February 2023
Contents

1 Introduction
1.1 What is a Knowledge Graph?
1.1.1 A Short Bit of History
1.1.2 Definitions
1.1.3 General-Purpose Knowledge Graphs
1.2 Feature Extraction from Knowledge Graphs
1.3 Node Classification in RDF
1.4 Conclusion
References

2 From Word Embeddings to Knowledge Graph Embeddings
2.1 Word Embeddings with word2vec
2.2 Representing Graphs as Sequences
2.3 Learning Representations from Graph Walks
2.4 Software Libraries
2.5 Node Classification with RDF2vec
2.6 Conclusion
References

3 Benchmarking Knowledge Graph Embeddings
3.1 Node Classification with Internal Labels – SW4ML
3.2 Machine Learning with External Labels – GEval
3.3 Benchmarking Expressivity of Embeddings – DLCC
3.3.1 DLCC Gold Standard based on DBpedia
3.3.2 DLCC Gold Standard based on Synthetic Data
3.4 Conclusion
References

4 Tweaking RDF2vec
4.1 Introducing Edge Weights
4.1.1 Graph Internal Weighting Approaches
4.1.2 Graph External Weighting Approaches
4.2 Order-Aware RDF2vec
4.2.1 Motivation and Definition
4.2.2 Evaluation
4.2.3 Order-Aware RDF2vec in Action
4.3 Alternative Walk Strategies
4.3.1 Entity Walks and Property Walks
4.3.2 Further Walk Extraction Strategies
4.4 RDF2vec with Materialized Knowledge Graphs
4.4.1 Idea
4.4.2 Experiments
4.4.3 RDF2vec on Materialized Graphs in Action
4.5 Conclusion
References

5 RDF2vec at Scale
5.1 Using Pre-trained Embeddings
5.1.1 The KGvec2Go Service
5.1.2 KGvec2Go in Action
5.2 Training Partial RDF2vec Models with RDF2vec Light
5.2.1 Approach
5.2.2 RDF2vec Light in Action
5.3 Conclusion
References

6 Link Prediction in Knowledge Graphs (and its Relation to RDF2vec)
6.1 A Brief Survey on the Knowledge Graph Embedding Landscape
6.2 Knowledge Graph Embedding for Data Mining
6.2.1 Data Mining is Based on Similarity
6.2.2 How RDF2vec Projects Similar Instances Close to Each Other
6.2.3 Using RDF2vec for Link Prediction
6.2.4 Link Prediction with RDF2vec in Action
6.3 Knowledge Graph Embedding Methods for Link Prediction
6.3.1 Link Prediction is Based on Vector Operations
6.3.2 Usage for Data Mining
6.3.3 Comparing the Two Notions of Similarity
6.3.4 Link Prediction Embeddings for Data Mining in Action
6.4 Experiments
6.4.1 Experiments on Data Mining Tasks
6.4.2 Experiments on Link Prediction Tasks
6.5 Conclusion
References

7 Example Applications Beyond Node Classification
7.1 Recommender Systems with RDF2vec
7.1.1 An RDF2vec-Based Movie Recommender in Less than 20 Lines of Code
7.1.2 Combining Knowledge Graph Embeddings with Other Information
7.2 Ontology Matching
7.2.1 Ontology Matching by Embedding Input Ontologies
7.2.2 Ontology Matching by Embedding External Knowledge Graphs
7.3 Further Use Cases
7.3.1 Knowledge Graph Refinement
7.3.2 Natural Language Processing
7.3.3 Information Retrieval
7.3.4 Applications in the Biomedical Domain
7.4 Conclusion
References

8 Future Directions for RDF2vec
8.1 Incorporating Information in Literals
8.2 Exploiting Complex Patterns
8.3 Exploiting Ontologies
8.4 Dynamic and Temporal Knowledge Graphs
8.5 Extension to other Knowledge Graph Representations
8.6 Standards and Protocols
8.7 Embeddings and Explainability
References

Appendix A: Datasets and Code Examples


1 Introduction

Abstract

In this chapter, the basic concept of a knowledge graph is introduced. We discuss why
knowledge graphs are important for machine learning and data mining tasks. We then
present classic feature extraction (or propositionalization) techniques, which are the historical
predecessors of knowledge graph embeddings, and show how these techniques are used
for basic node classification tasks.

1.1 What is a Knowledge Graph?

The term knowledge graph (or KG for short) was popularized by Google in 2012, when
they announced in a blog post that their search would be based on structured knowledge
representations in the future, not only on string similarity and keyword overlap, as had been
done until then.¹ Generally, a knowledge graph is a mechanism in knowledge representation, where
things in the world (e.g., persons, places, or events) are represented as nodes, while their
relations (e.g., a person taking part in an event, an event happening at a place) are represented
as labeled edges between those nodes.

1.1.1 A Short Bit of History

While Google popularized the term, the idea of knowledge graphs is much older than that.
Earlier works usually used terms like knowledge base or semantic network, among others
(Ji et al. 2021). Although the exact origin of the term knowledge graph is not fully known,

1 https://blog.google/products/search/introducing-knowledge-graph-things-not/.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
H. Paulheim et al., Embedding Knowledge Graphs with RDF2vec, Synthesis Lectures on Data, Semantics, and Knowledge, https://doi.org/10.1007/978-3-031-30387-6_1

Hogan et al. (2021) have traced the term back to a paper from the 1970s (Schneider 1973).
In the Semantic Web community and the Linked Open Data (Bizer et al. 2011) movement,
researchers have been producing datasets that would follow the idea of a knowledge graph
for decades.
In addition to open knowledge graphs created by the research community and the already
mentioned knowledge graph used by Google, other major companies nowadays also use
knowledge graphs as a central means to represent corporate knowledge. Notable examples
include, but are not limited to, eBay, Facebook, IBM, and Microsoft (Noy et al. 2019).

1.1.2 Definitions

While a lot of researchers and practitioners claim to use knowledge graphs, the field has
long lacked a common definition of the term knowledge graph. Ehrlinger and Wöß (2016)
have collected a few of the most common definitions of knowledge graphs. In particular,
they list the following definitions:

1. A knowledge graph (1) mainly describes real-world entities and their interrelations,
organized in a graph, (2) defines possible classes and relations of entities in a schema,
(3) allows for potentially interrelating arbitrary entities with each other, and (4) covers
various topical domains. (Paulheim 2017)
2. Knowledge graphs are large networks of entities, their semantic types, properties, and
relationships between entities. (Journal of Web Semantics 2014)
3. Knowledge graphs could be envisaged as a network of all kinds of things which are
relevant to a specific domain or to an organization. (Semantic Web Company 2014)
4. A Knowledge Graph [is] an RDF graph. An RDF graph consists of a set of RDF triples
where each RDF triple (s, p, o) is an ordered set of the following RDF terms: a subject
s ∈ U ∪ B, a predicate p ∈ U , and an object o ∈ U ∪ B ∪ L. An RDF term is either a
URI u ∈ U , a blank node b ∈ B, or a literal l ∈ L. (Färber et al. 2018)
5. Knowledge, in the form of facts, [which] are interrelated, and hence, recently this
extracted knowledge has been referred to as a knowledge graph. (Pujara et al. 2013)

In addition, they synthesize their own definition, i.e.:

6. A knowledge graph acquires and integrates information into an ontology and applies a
reasoner to derive new knowledge. (Ehrlinger and Wöß 2016)
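
Definition 4 is the most formal one in this list, and its positional constraints translate directly into a type check. The following is a minimal sketch in plain Python; the classes URI, BlankNode, and Literal as well as the example.org identifiers are illustrative assumptions and not part of any RDF library. It only verifies that s ∈ U ∪ B, p ∈ U, and o ∈ U ∪ B ∪ L.

```python
from dataclasses import dataclass


# Illustrative term types (hypothetical names, not an RDF library API).
@dataclass(frozen=True)
class URI:
    value: str


@dataclass(frozen=True)
class BlankNode:
    label: str


@dataclass(frozen=True)
class Literal:
    value: object


def is_valid_triple(s, p, o) -> bool:
    """Check the positional constraints of definition 4."""
    return (
        isinstance(s, (URI, BlankNode))                  # s in U ∪ B
        and isinstance(p, URI)                           # p in U
        and isinstance(o, (URI, BlankNode, Literal))     # o in U ∪ B ∪ L
    )


# Made-up example terms.
mannheim = URI("http://example.org/Mannheim")
located_in = URI("http://example.org/locatedIn")
germany = URI("http://example.org/Germany")
name = URI("http://example.org/name")

uri_object = is_valid_triple(mannheim, located_in, germany)
literal_subject = is_valid_triple(Literal("Mannheim"), located_in, germany)
literal_object = is_valid_triple(mannheim, name, Literal("Mannheim"))
print(uri_object, literal_subject, literal_object)
```

A literal is accepted in object position only, which is why the second check fails while the other two succeed.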

In the course of this book, we will use a very minimalistic definition of a knowledge
graph. We consider a knowledge graph a graph G = (V, E) consisting of a set of entities V
(i.e., vertices in the graph) and a set E ⊆ V × R × (V ∪ L) of labeled edges, where R defines
the set of possible relation types (which can be considered edge labels), and L is a set of
literals (e.g., numbers or string values). Moreover, each entity in V can have one or more
classes assigned, where C defines the set of possible classes. Further ontological constructs,
such as defining a class hierarchy or describing relations with domains and ranges, are not
considered here.
While most of the definitions above focus on the contents of the knowledge
graph, we look at knowledge graphs from a more technical perspective in this book, since
the methods discussed here are not bound to a particular domain. Our definition is
therefore purely technical and does not constrain the contents of the knowledge graph in
any way.
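
To make the minimalistic definition more tangible, the following sketch shows one possible in-memory representation of G = (V, E) with relation types R, literals L, and classes C. It is an illustration under our own assumptions, not the data structure of any RDF2vec implementation; the class names and the example entities are made up.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Literal:
    """A literal value from L, e.g., a number or a string."""
    value: object


@dataclass
class KnowledgeGraph:
    entities: set = field(default_factory=set)     # V
    relations: set = field(default_factory=set)    # R
    edges: set = field(default_factory=set)        # E, a subset of V x R x (V ∪ L)
    classes: dict = field(default_factory=dict)    # entity -> assigned classes from C

    def add_edge(self, subject, relation, obj):
        # Subjects must be entities; only objects may be literals.
        if isinstance(subject, Literal):
            raise ValueError("literals may only appear in object position")
        self.entities.add(subject)
        self.relations.add(relation)
        if not isinstance(obj, Literal):
            self.entities.add(obj)
        self.edges.add((subject, relation, obj))

    def add_class(self, entity, cls):
        self.entities.add(entity)
        self.classes.setdefault(entity, set()).add(cls)


# A person taking part in an event, and an event happening at a place.
kg = KnowledgeGraph()
kg.add_edge("alice", "takesPartIn", "workshop")
kg.add_edge("workshop", "happensAt", "mannheim")
kg.add_edge("alice", "name", Literal("Alice"))
kg.add_class("alice", "Person")
print(sorted(kg.entities), sorted(kg.relations), len(kg.edges))
```

Note that class hierarchies, domains, and ranges are deliberately absent, in line with the definition.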

1.1.3 General-Purpose Knowledge Graphs

While the research community has come up with a large number of knowledge graphs,
only a few are open, large-scale, general-purpose knowledge graphs covering a
lot of domains in reasonable depth. They are therefore interesting ingredients for artificial
intelligence applications since they are ready to use and contain background knowledge for
many different tasks at hand.
One of the earliest attempts to build a general-purpose knowledge graph was Cyc, a project
started in the 1980s (Lenat 1995). The project was initiated to build a machine-processable
collection of the essence of the world’s knowledge, using a proprietary language called CycL.
After an investment of more than 2,000 person-years, the project, in the end, encompassed
almost 25M axioms and rules² – which is most likely still just a tiny fraction of the world’s
knowledge.
The example of Cyc shows that having knowledge graphs built manually by modeling
experts does not really scale (Paulheim 2018). Therefore, modern approaches usually utilize
different techniques, such as crowdsourcing and/or heuristic extraction.
Crowdsourcing knowledge graphs was first explored with Freebase (Pellissier-Tanon
et al. 2016), with the goal of establishing a large community of volunteers, comparable to
Wikipedia. To that end, the schema of Freebase was kept fairly simple to lower the entrance
barrier as much as possible. Freebase was acquired by Google in 2010 and shut down in
2014.
Wikidata (Vrandečić and Krötzsch 2014) also uses a crowd editing approach. In contrast
to Cyc and Freebase, Wikidata also imports entire large datasets, such as several
national libraries’ bibliographies. Porting the data from Freebase to Wikidata is also a
long-standing goal (Pellissier-Tanon et al. 2016).
A more efficient way of knowledge graph creation is the use of structured or semi-structured
sources. Wikipedia is a commonly used starting point for knowledge graphs such
as DBpedia (Lehmann et al. 2013) and YAGO (Suchanek et al. 2007). In these approaches,
an entity in the knowledge graph is created per page in Wikipedia, and additional axioms
are extracted from the respective Wikipedia pages using different means.

2 https://files.gotocon.com/uploads/slides/conference_13/724/original/AI_GOTO%20Lenat%20keynote%2030%20April%202019%20hc.pdf.

DBpedia mainly uses infoboxes in Wikipedia. Those are manually mapped to a pre-defined
ontology; both the ontology and the mapping are crowd-sourced using a Wiki and
a community of volunteers. Given those mappings, the DBpedia Extraction Framework
creates a graph in which each page in Wikipedia becomes an entity, and all values and links
in an infobox become attributes and edges in the graph.
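
The principle of mapping-based infobox extraction can be sketched in a few lines. The code below is a toy illustration and not the DBpedia Extraction Framework: the mapping table, the infobox keys, the ex: prefix, and all values are hypothetical. It only shows how each page becomes an entity, how linked values become edges, and how other values become literal attributes.

```python
# Hypothetical, hand-written mapping from infobox keys to ontology properties.
MAPPING = {
    "country": "ex:country",
    "population_total": "ex:populationTotal",
}


def to_entity(page_title):
    """One entity per page: derive an identifier from the page title."""
    return "ex:" + page_title.replace(" ", "_")


def extract_triples(page_title, infobox, page_titles):
    """Turn one infobox into (subject, property, object, is_edge) tuples."""
    subject = to_entity(page_title)
    triples = []
    for key, value in infobox.items():
        prop = MAPPING.get(key)
        if prop is None:
            continue  # unmapped infobox keys are skipped in this sketch
        if value in page_titles:
            # The value links to another page: emit an edge between entities.
            triples.append((subject, prop, to_entity(value), True))
        else:
            # Otherwise keep the value as a literal attribute.
            triples.append((subject, prop, value, False))
    return triples


# Made-up pages and infobox content.
pages = {"Example City", "Example Country"}
infobox = {
    "country": "Example Country",
    "population_total": "12345",
    "unmapped_key": "ignored",
}
triples = extract_triples("Example City", infobox, pages)
for triple in triples:
    print(triple)
```

Keeping the mapping separate from the extraction code mirrors the division of labor described above, where the mapping is maintained by a community rather than hard-coded.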
YAGO uses a similar process but classifies instances based on the category structure and
WordNet (Miller 1995) instead of infoboxes. YAGO integrates various language editions of
Wikipedia into a single graph and represents temporal facts with meta-level statements, i.e.,
RDF reification.
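
RDF reification describes a statement by a node of its own, to which further statements can be attached. The sketch below builds such a description with the standard rdf:Statement, rdf:subject, rdf:predicate, and rdf:object terms. The helper function, the ex: properties for temporal validity, and the example fact are assumptions for illustration and do not reproduce YAGO's actual schema.

```python
from itertools import count

_ids = count(1)  # generates fresh labels for statement nodes


def reify(subject, predicate, obj, metadata):
    """Return triples describing the statement (subject, predicate, obj) itself."""
    stmt = f"_:stmt{next(_ids)}"  # blank node standing for the statement
    triples = [
        (stmt, "rdf:type", "rdf:Statement"),
        (stmt, "rdf:subject", subject),
        (stmt, "rdf:predicate", predicate),
        (stmt, "rdf:object", obj),
    ]
    # Meta-level statements, e.g., temporal validity, attach to the statement node.
    triples += [(stmt, key, value) for key, value in metadata.items()]
    return triples


# Made-up fact with a hypothetical validity interval.
triples = reify(
    "ex:Alice", "ex:worksFor", "ex:ACME",
    {"ex:validFrom": "2010", "ex:validUntil": "2015"},
)
for triple in triples:
    print(triple)
```

One fact with two pieces of metadata thus expands into six triples, which illustrates why reification increases the size of a graph considerably.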
CaLiGraph also uses information in categories but aims at converting them into formal
axioms using DBpedia as supervision (Heist and Paulheim 2019). Moreover, instances from
Wikipedia list pages are considered for populating the knowledge graph (Kuhn et al. 2016;
Paulheim and Ponzetto 2013). The result is a knowledge graph that is not only richly populated
on the instance level but also has a large number of defining axioms for classes (Heist
and Paulheim 2020).
A similar approach to YAGO, i.e., the combination of information in Wikipedia and
WordNet, is used by BabelNet (Navigli and Ponzetto 2012). The main purpose of BabelNet
is the collection of synonyms and translations in various languages, which makes this knowledge
graph particularly well suited for supporting multi-language applications. Similarly, ConceptNet
(Speer and Havasi 2012) collects synonyms and translations in various languages,
integrating multiple third-party knowledge graphs itself.
DBkWik (Hertling and Paulheim 2018) uses the same codebase as DBpedia, but applies it
to a multitude of Wikis. This leads to a graph that has larger coverage and level of detail for
many long-tail entities and is highly complementary to DBpedia. However, the absence of
a central ontology and mappings, as well as the existence of duplicates across Wikis, which
might not be trivial to detect, imposes a number of data integration challenges not present
in DBpedia (Hertling and Paulheim 2022).
Another source of structured data is structured annotations in Web pages using techniques
such as RDFa, Microdata, and Microformats (Meusel et al. 2014). While the pure
collection of those could, in theory, already be considered a knowledge graph, that graph
would be rather disconnected and consist of a plethora of small, unconnected components
(Paulheim 2015), and it would require additional cleanup to compensate for irregular use of
the underlying schemas and shortcomings in the extraction (Meusel and Paulheim 2015). A
consolidated version of this data into a more connected knowledge graph has been published
under the name VoldemortKG (Tonon et al. 2016).
The extraction of a knowledge graph from semi-structured sources is considered easier
than the extraction from unstructured sources. However, the amount of unstructured data
exceeds the amount of structured data by far.3 Therefore, extracting knowledge from
unstructured sources has also been proposed.
NELL (Carlson et al. 2010) is an example of extracting a knowledge graph from free text.
NELL was originally trained with a few seed examples and continuously runs an iterative
coupled learning process. In each iteration, facts are used to learn textual patterns to detect
those facts, and patterns learned in previous iterations are used to extract new facts, which
serve as training examples in later iterations. To improve the quality, NELL has introduced
a feedback loop incorporating occasional human feedback (Pedro and Hruschka 2012).
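The alternation between learning patterns from facts and extracting facts with patterns can be sketched with a toy example. This is not NELL's implementation: the function names, the corpus, and the simplification of a pattern to the text between two entity mentions are assumptions made purely for illustration.

```python
import re

def learn_patterns(corpus, facts):
    """Collect the text found between subject and object of each known fact."""
    patterns = set()
    for sentence in corpus:
        for subj, obj in facts:
            match = re.search(re.escape(subj) + r"(.+?)" + re.escape(obj), sentence)
            if match:
                patterns.add(match.group(1))
    return patterns

def extract_facts(corpus, patterns):
    """Apply each pattern to find new (subject, object) pairs of single words."""
    found = set()
    for sentence in corpus:
        for middle in patterns:
            for match in re.finditer(r"(\w+)" + re.escape(middle) + r"(\w+)", sentence):
                found.add((match.group(1), match.group(2)))
    return found

def bootstrap(corpus, seed_facts, iterations=2):
    """Alternate pattern learning and fact extraction, starting from seed facts."""
    facts, patterns = set(seed_facts), set()
    for _ in range(iterations):
        patterns |= learn_patterns(corpus, facts)
        facts |= extract_facts(corpus, patterns)
    return facts, patterns

# Hypothetical toy corpus and a single seed fact
corpus = [
    "Paris is the capital of France.",
    "Berlin is the capital of Germany.",
    "Berlin, the largest city in Germany, hosts many museums.",
    "Vienna, the largest city in Austria, is known for its music.",
]
facts, patterns = bootstrap(corpus, {("Paris", "France")}, iterations=2)
print(sorted(facts))
print(sorted(patterns))
```

In this toy run, the second iteration picks up a pattern about largest cities from a fact that was originally about capitals. Such drift in what a pattern expresses illustrates why a feedback loop of the kind described above can help to improve quality.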
WebIsA (Seitner et al. 2016) also extracts facts from natural language text but focuses on
the creation of a large-scale taxonomy. For each extracted fact, rich metadata are collected,
including the sources, the original sentences, and the patterns used in the extraction of a
particular fact. That metadata is exploited for computing a confidence score for each fact
(Hertling and Paulheim 2017).
Table 1.1 gives an overview of some of the knowledge graphs discussed above. ConceptNet
and WebIsA are not included, since they do not distinguish between a schema and an instance
level (i.e., there is no specific distinction between a class and an instance), which does not
allow for computing those metrics meaningfully. For Cyc, which is only available as a commercial
product today, we used the free version OpenCyc, which was available until
2017.4
From those metrics, it can be observed that the KGs differ in size by several orders of
magnitude. The sizes range from 50,000 instances (for Voldemort) to 50 million instances
(for Wikidata), so the latter is larger by a factor of 1,000. The same holds for assertions.
Concerning the linkage degree, YAGO is much more richly linked than the other graphs.

1.2 Feature Extraction from Knowledge Graphs

When using knowledge graphs in the context of intelligent applications, they are often com-
bined with some machine learning or data mining based processing (van Bekkum et al.
2021). The corresponding algorithms, however, mostly expect tabular or propositional data
as input, not graphs. Hence, information from the graphs is often transformed into a propositional
form first, a process called propositionalization or feature extraction (Lavrač et al.
2020, Ristoski and Paulheim 2014a).
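As a minimal sketch of what such a transformation can look like, the following example turns a few triples into a binary feature table with one column per (predicate, object) pair. This is only one simple strategy among many, chosen here for illustration; the function propositionalize and the example triples are hypothetical and not taken from the cited works.

```python
def propositionalize(triples, entities):
    """Build one binary feature per (predicate, object) pair observed for the given entities."""
    features = sorted({(p, o) for s, p, o in triples if s in entities})
    table = {}
    for entity in entities:
        present = {(p, o) for s, p, o in triples if s == entity}
        # 1 if the entity has this predicate-object pair, 0 otherwise
        table[entity] = [1 if feature in present else 0 for feature in features]
    return features, table

# Hypothetical movie triples
triples = [
    ("Inception", "director", "Christopher_Nolan"),
    ("Inception", "genre", "SciFi"),
    ("Interstellar", "director", "Christopher_Nolan"),
    ("Interstellar", "genre", "SciFi"),
    ("Titanic", "director", "James_Cameron"),
    ("Titanic", "genre", "Romance"),
]
entities = ["Inception", "Interstellar", "Titanic"]

features, table = propositionalize(triples, entities)
print(features)
for entity, row in table.items():
    print(entity, row)
```

Each entity becomes one row of a table that standard machine learning algorithms can consume. In this toy data, the two movies sharing director and genre receive identical rows, while the third one differs in every column.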
Particularly for the combination with machine learning algorithms, it is important not only
to have entities in a propositional form, but also that this form fulfills
some additional criteria. In particular, proximity in the feature space should – ideally – reflect
the similarity of entities.5 For example, when building a movie recommendation system, we

3 Although it is hard to track down the provenance of that number, many sources state that 80% of
all data is unstructured, such as Das and Kumar (2013).
4 It is still available, e.g., at https://github.com/asanchez75/opencyc.
5 We will discuss this in detail in Chap. 3.