We make the following contributions:

• We initiate a study to analyze the geometry of various Knowledge Graph (KG) embeddings. To the best of our knowledge, this is the first study of its kind. We also formalize various metrics which can be used to study the geometry of a set of vectors.

• Through extensive analysis, we discover several interesting insights about the geometry of KG embeddings. For example, we find systematic differences between the geometries of embeddings learned by additive and multiplicative KG embedding methods.

• We also study the relationship between geometric attributes and predictive performance of the embeddings, resulting in several new insights. For example, in case of multiplicative models, we observe that for entity vectors generated with a fixed number of negative samples, lower conicity (as defined in Section 4) or higher average vector length leads to higher performance.

Source code of all the analysis tools developed as part of this paper is available at https://github.com/malllabiisc/kg-geometry. We hope that these resources will enable one to quickly analyze the geometry of any KG embedding, and potentially other embeddings as well.

2 Related Work

In spite of the extensive and growing literature on both KG and non-KG embedding methods, very little attention has been paid to understanding the geometry of the learned embeddings. A recent work (Mimno and Thompson, 2017) is an exception: it addresses this problem in the context of word vectors. That work revealed a surprising correlation between word vector geometry and the number of negative samples used during training. Instead of word vectors, in this paper we focus on understanding the geometry of KG embeddings. In spite of this difference, the insights we discover in this paper generalize some of the observations in the work of Mimno and Thompson (2017). Please see Section 6.2 for more details.

Since KGs contain only positive triples, negative sampling has been used for training KG embeddings. The effect of the number of negative samples on KG embedding performance was studied by Toutanova et al. (2015). In this paper, we study the effect of the number of negative samples on KG embedding geometry as well as performance.

In addition to the additive and multiplicative KG embedding methods already mentioned in Section 1, there is another set of methods where the entity and relation vectors interact via a neural network. Examples of methods in this category include NTN (Socher et al., 2013), CONV (Toutanova et al., 2015), ConvE (Dettmers et al., 2017), R-GCN (Schlichtkrull et al., 2017), ER-MLP (Dong et al., 2014) and ER-MLP-2n (Ravishankar et al., 2017). Due to space limitations, in this paper we restrict our scope to the analysis of the geometry of additive and multiplicative KG embedding models only, and leave the analysis of the geometry of neural network-based methods as part of future work.

3 Overview of KG Embedding Methods

For our analysis, we consider six representative KG embedding methods: TransE (Bordes et al., 2013), TransR (Lin et al., 2015), STransE (Nguyen et al., 2016), DistMult (Yang et al., 2014), HolE (Nickel et al., 2016) and ComplEx (Trouillon et al., 2016). We refer to TransE, TransR and STransE as additive methods because they learn embeddings by modeling relations as translation vectors from one entity to another, which results in vectors interacting via the addition operation during training. On the other hand, we refer to DistMult, HolE and ComplEx as multiplicative methods, as they quantify the likelihood of a triple belonging to the KG through a multiplicative score function. The score functions optimized by these methods are summarized in Table 1.
Type            Model                               Score Function σ(h, r, t)
Additive        TransE (Bordes et al., 2013)        −‖h + r − t‖_1
                TransR (Lin et al., 2015)           −‖M_r h + r − M_r t‖_1
                STransE (Nguyen et al., 2016)       −‖M_r^1 h + r − M_r^2 t‖_1
Multiplicative  DistMult (Yang et al., 2014)        r⊤(h ⊙ t)
                HolE (Nickel et al., 2016)          r⊤(h ⋆ t)
                ComplEx (Trouillon et al., 2016)    Re(r⊤(h ⊙ t̄))

Table 1: Summary of various Knowledge Graph (KG) embedding methods used in the paper. Please see Section 3 for more details.
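As a concrete reference, the following is a minimal NumPy sketch of the score functions in Table 1, assuming already-trained embedding vectors (and, for TransR/STransE, relation-specific projection matrices). The function and argument names are ours, not the paper's; for ComplEx, h, r and t are complex-valued arrays.

```python
import numpy as np

def score_transe(h, r, t):
    # TransE: -||h + r - t||_1
    return -np.linalg.norm(h + r - t, ord=1)

def score_transr(h, r, t, M_r):
    # TransR: -||M_r h + r - M_r t||_1, with a per-relation projection matrix
    return -np.linalg.norm(M_r @ h + r - M_r @ t, ord=1)

def score_stranse(h, r, t, M_r1, M_r2):
    # STransE: -||M_r^1 h + r - M_r^2 t||_1, separate head/tail projections
    return -np.linalg.norm(M_r1 @ h + r - M_r2 @ t, ord=1)

def score_distmult(h, r, t):
    # DistMult: r^T (h ⊙ t), entry-wise product followed by a dot product
    return np.dot(r, h * t)

def score_hole(h, r, t):
    # HolE: r^T (h ⋆ t), circular correlation computed via FFT in O(d log d)
    corr = np.fft.ifft(np.conj(np.fft.fft(h)) * np.fft.fft(t)).real
    return np.dot(r, corr)

def score_complex(h, r, t):
    # ComplEx: Re(r^T (h ⊙ conj(t))) with complex-valued h, r, t
    return np.real(np.dot(r, h * np.conj(t)))
```

In the additive models a perfect match yields a score of 0 (the translated, projected head coincides with the projected tail), while the multiplicative scores grow with the compatibility of head and tail under the relation.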
Notation: Let G = (E, R, T) be a Knowledge Graph (KG), where E is the set of entities, R is the set of relations and T ⊂ E × R × E is the set of triples stored in the graph. Most KG embedding methods learn vectors e ∈ R^{d_e} for e ∈ E, and r ∈ R^{d_r} for r ∈ R. Some methods also learn projection matrices M_r ∈ R^{d_r × d_e} for relations. The correctness of a triple is evaluated using a model-specific score function σ : E × R × E → R. For learning the embeddings, a loss function L(T, T′; θ), defined over the set of positive triples T, a set of (sampled) negative triples T′, and the parameters θ, is optimized.

We use lowercase italic characters (e.g., h, r) to represent entities and relations, and corresponding bold characters to represent their vector embeddings (e.g., h, r). We use bold capital characters (e.g., V) to represent a set of vectors. Matrices are represented by capital italic characters (e.g., M).

3.1 Additive KG Embedding Methods

This is the set of methods where entity and relation vectors interact via additive operations. The score function for these models can be expressed as below

σ(h, r, t) = −‖M_r^1 h + r − M_r^2 t‖_1    (1)

where M_r^1 and M_r^2 are relation-specific projection matrices. TransE and TransR can be obtained as special cases of STransE with M_r^1 = M_r^2 = I_d and M_r^1 = M_r^2 = M_r, respectively.

3.2 Multiplicative KG Embedding Methods

This is the set of methods where the vectors interact via multiplicative operations (usually dot product). The score function for these models can be expressed as

σ(h, r, t) = r⊤ f(h, t)    (2)
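Training minimizes a loss L(T, T′; θ) over observed triples T and sampled negative triples T′. Different models optimize different instantiations of this loss (e.g., logistic loss or pairwise ranking loss); the sketch below is one concrete illustration, not the exact objective of any particular model here: a margin-based pairwise ranking loss with uniform negative sampling. The margin gamma, the number of negatives per positive, and the helper names are illustrative.

```python
import random

def sample_negatives(triple, entities, n_neg):
    # Corrupt either the head or the tail with a uniformly sampled entity.
    h, r, t = triple
    negatives = []
    for _ in range(n_neg):
        e = random.choice(entities)
        negatives.append((e, r, t) if random.random() < 0.5 else (h, r, e))
    return negatives

def pairwise_ranking_loss(score, positives, entities, n_neg=1, gamma=1.0):
    # L(T, T'; theta): sum over positive triples and their sampled negatives of
    # max(0, gamma - score(positive) + score(negative)).
    # `score` maps a triple of identifiers to its model score (embedding
    # lookup is assumed to happen inside the callable).
    loss = 0.0
    for triple in positives:
        for neg in sample_negatives(triple, entities, n_neg):
            loss += max(0.0, gamma - score(*triple) + score(*neg))
    return loss
```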
Figure 1: Comparison of high vs. low Conicity. Randomly generated vectors are shown in blue, with their sample mean vector M in black. The figure on the left shows the case where the vectors lie in a narrow cone, resulting in a high Conicity value. The figure on the right shows the case where the vectors are spread out, having a relatively lower Conicity value. We do not show very low values of Conicity, as they are difficult to visualize. The points are sampled from a 3-dimensional spherical Gaussian with mean (1, 1, 1) and standard deviation 0.1 (left) and 1.3 (right). Please refer to Section 4 for more details.
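The sampling setup in the Figure 1 caption is easy to reproduce. The snippet below is our own illustration, with an inline helper for Conicity (defined formally in Section 4); the sample size and random seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two point clouds from a 3-d spherical Gaussian centered at (1, 1, 1), as in
# Figure 1: a tight cloud (std 0.1) and a spread-out cloud (std 1.3).
tight = rng.normal(loc=1.0, scale=0.1, size=(500, 3))
spread = rng.normal(loc=1.0, scale=1.3, size=(500, 3))

def conicity(vectors):
    # Mean cosine similarity between each vector and the mean vector
    # (the Conicity metric, formally defined in Section 4).
    mean = vectors.mean(axis=0)
    cos = vectors @ mean / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(mean))
    return cos.mean()

print(conicity(tight))   # close to 1: the vectors lie in a narrow cone
print(conicity(spread))  # noticeably lower: the vectors are spread out
```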
The circular correlation in HolE can be computed efficiently using the Fast Fourier Transform (O(d log d)). For training, we use pairwise ranking loss.

ComplEx (Trouillon et al., 2016) represents entities and relations as vectors in C^d. The compatibility of entity pairs is measured using the entry-wise product between the head and the complex conjugate of the tail entity vectors.

σ_ComplEx(h, r, t) = Re(r⊤(h ⊙ t̄))    (5)

In contrast to (3), using complex vectors in (5) allows ComplEx to handle symmetric, asymmetric and anti-symmetric relations using the same score function. Similar to DistMult, logistic loss is used for training the model.

Dataset                   FB15k      WN18
#Relations                1,345      18
#Entities                 14,541     40,943
#Triples   Train          483,142    141,440
           Validation     50,000     5,000
           Test           59,071     5,000

Table 2: Summary of datasets used in the paper.

4 Metrics

For our geometrical analysis, we first define the 'alignment to mean' (ATM) of a vector v belonging to a set of vectors V as the cosine similarity between v and the mean of all vectors in V, where cosine(u, v) = u⊤v / (‖u‖ ‖v‖):

ATM(v, V) = cosine(v, (1/|V|) Σ_{x∈V} x)

We also define the 'conicity' of a set V as the mean ATM of all vectors in V:

Conicity(V) = (1/|V|) Σ_{v∈V} ATM(v, V)

By this definition, a high value of Conicity(V) would imply that the vectors in V lie in a narrow cone centered at the origin. In other words, the vectors in the set V are highly aligned with each other. In addition, we define the variance of ATM across all vectors in V as the 'vector spread' (VS) of the set V:

VS(V) = (1/|V|) Σ_{v∈V} (ATM(v, V) − Conicity(V))²

Figure 1 visually demonstrates these metrics for randomly generated 3-dimensional points. The left figure shows high Conicity and low vector spread, while the right figure shows low Conicity and high vector spread.

We define the length of a vector v as its L2-norm, ‖v‖_2, and the 'average vector length' (AVL) for the set of vectors V as

AVL(V) = (1/|V|) Σ_{v∈V} ‖v‖_2
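The four metrics above can be computed directly from a matrix whose rows are the embedding vectors. The following is a minimal NumPy sketch (our own naming, not the released kg-geometry code) of ATM, Conicity, vector spread (VS) and average vector length (AVL).

```python
import numpy as np

def cosine(u, v):
    # cosine(u, v) = u^T v / (||u|| ||v||)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def atm(v, vectors):
    # Alignment to mean: cosine similarity between v and the mean of the set.
    return cosine(v, vectors.mean(axis=0))

def conicity(vectors):
    # Mean ATM over all vectors in the set.
    return float(np.mean([atm(v, vectors) for v in vectors]))

def vector_spread(vectors):
    # Variance of ATM across the set (mean squared deviation from Conicity).
    atms = np.array([atm(v, vectors) for v in vectors])
    return float(np.mean((atms - atms.mean()) ** 2))

def average_vector_length(vectors):
    # Mean L2 norm of the vectors in the set.
    return float(np.mean(np.linalg.norm(vectors, axis=1)))
```

Applied to the entity (or relation) embedding matrix of a trained model, these functions yield the kinds of quantities plotted in Figures 2–6.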
Figure 2: Alignment to Mean (ATM) vs Density plots for entity embeddings learned by various additive
(top row) and multiplicative (bottom row) KG embedding methods. For each method, a plot averaged
across entity frequency bins is shown. From these plots, we conclude that entity embeddings from
additive models tend to have low (positive as well as negative) ATM and thereby low Conicity and high
vector spread. Interestingly, this is reversed in case of multiplicative methods. Please see Section 6.1 for
more details.
Figure 3: Alignment to Mean (ATM) vs Density plots for relation embeddings learned by various additive (top row) and multiplicative (bottom row) KG embedding methods. For each method, a plot averaged across relation frequency bins is shown. Trends in these plots are similar to those in Figure 2. Main findings from these plots are summarized in Section 6.1.
Figure 4: Conicity (left) and Average Vector Length (right) vs Number of negative samples for entity vectors learned using various KG embedding methods. In each bar group, the first three models are additive, while the last three are multiplicative. Main findings from these plots are summarized in Section 6.2.
We observed that the conicity was consistently similar across frequency bins. For clarity, we have not shown different plots for individual frequency bins.

Relation Embeddings: As with entity embeddings, we observe a similar trend when we look at the distribution of ATMs for relation vectors in Figure 3. The conicity of relation vectors generated using additive models is almost zero across frequency bands. This, coupled with the high vector spread observed, suggests that these vectors are scattered throughout the vector space. Relation vectors from multiplicative models exhibit high conicity and low vector spread, suggesting that they lie in a narrow cone centered at the origin, like their entity counterparts.

6.2 Effect of Number of Negative Samples on Geometry

Summary of Findings:
Additive: Conicity and average length are invariant to changes in #NegativeSamples for both entities and relations.
Multiplicative: Conicity increases while average vector length decreases with increasing #NegativeSamples for entities. Conicity decreases, while average vector length remains constant (except HolE), for relations.

For experiments in this section, we keep the vector dimension constant at 100.

Entity Embeddings: As seen in Figure 4 (left), the conicity of entity vectors increases as the number of negative samples is increased for multiplicative models. In contrast, the conicity of the entity vectors generated by additive models is unaffected by changes in the number of negative samples, and they continue to be dispersed throughout the vector space. From Figure 4 (right), we observe that the average length of entity vectors produced by additive models is also invariant to any changes in the number of negative samples. On the other hand, an increase in negative sampling decreases the average entity vector length for all multiplicative models except HolE. The average entity vector length for HolE is nearly 1 for any number of negative samples, which is understandable considering it constrains the entity vectors to lie inside a unit ball (Nickel et al., 2016). This constraint is also enforced by the additive models: TransE, TransR, and STransE.

Relation Embeddings: Similar to entity embeddings, in the case of relation vectors trained using additive models, the average length and conicity do not change while varying the number of negative samples. However, the conicity of relation vectors from multiplicative models decreases with an increase in negative sampling. The average relation vector length is invariant for all multiplicative methods, except for HolE. We see a surprisingly big jump in the average relation vector length for HolE going from 1 to 50 negative samples, but it does not change after that. Due to space constraints, we refer the reader to the Supplementary Section for plots discussing the effect of the number of negative samples on the geometry of relation vectors.

We note that the multiplicative score between two vectors may be increased either by increasing the alignment between the two vectors (i.e., increasing Conicity and reducing vector spread between them), or by increasing their lengths. It is interesting to note that we see exactly these effects in the geometry of multiplicative methods.
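This effect is easy to check numerically. The snippet below is our own illustration using the DistMult score r⊤(h ⊙ t) with arbitrary example vectors: scaling a vector scales the score proportionally, and aligning the tail with the head while keeping its length fixed also raises the score.

```python
import numpy as np

r = np.array([0.5, 0.5, 0.5])
h = np.array([1.0, 0.2, 0.1])
t = np.array([0.9, 0.1, 0.3])

def distmult(h, r, t):
    return np.dot(r, h * t)

# Increasing vector length: scaling h by 2 scales the score by 2.
print(distmult(h, r, t), distmult(2 * h, r, t))

# Increasing alignment: point t in the direction of h, keeping its length fixed.
t_aligned = h / np.linalg.norm(h) * np.linalg.norm(t)
print(distmult(h, r, t), distmult(h, r, t_aligned))
```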
Figure 5: Conicity (left) and Average Vector Length (right) vs Number of Dimensions for entity vectors learned using various KG embedding methods. In each bar group, the first three models are additive, while the last three are multiplicative. Main findings from these plots are summarized in Section 6.3.
Figure 6: Relationship between Performance (HITS@10) on a link prediction task vs Conicity (left) and
Avg. Vector Length (right). For each point, N represents the number of negative samples used. Main
findings are summarized in Section 6.4.
6.4 Relating Geometry to Performance

Summary of Findings:
Additive: Neither entities nor relations exhibit correlation between geometry and performance.
Multiplicative: Keeping negative samples fixed, lower conicity or higher average vector length for entities leads to improved performance. No relationship for relations.

In this section, we analyze the relationship between geometry and performance on the link prediction task, using the same setting as in (Bordes et al., 2013). Figure 6 (left) presents the effect of the conicity of entity vectors on performance, while Figure 6 (right) shows the effect of the average entity vector length. (A more focused analysis for multiplicative models is presented in Section 3 of the Supplementary material.)
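Performance in Figure 6 is reported as HITS@10 under the link prediction protocol of Bordes et al. (2013): each test triple is corrupted by replacing its head (and, separately, its tail) with every entity, candidates are ranked by the model score, and we measure how often the correct entity lands in the top 10. The sketch below is a simplified, raw-ranking version (it omits the 'filtered' correction that discards other known true triples); `score` stands for any model's score function over entity/relation identifiers.

```python
def hits_at_10(test_triples, entities, score):
    # Fraction of head/tail prediction queries whose correct entity is ranked
    # in the top 10 by the model score (higher score = more plausible).
    hits, queries = 0, 0
    for h, r, t in test_triples:
        for corrupt_head in (True, False):
            if corrupt_head:
                scores = {e: score(e, r, t) for e in entities}
                correct = h
            else:
                scores = {e: score(h, r, e) for e in entities}
                correct = t
            ranked = sorted(scores, key=scores.get, reverse=True)
            hits += ranked.index(correct) < 10
            queries += 1
    return hits / queries
```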
As we see from Figure 6 (left), for a fixed number of negative samples, the multiplicative model with lower conicity of entity vectors achieves better performance. This performance gain is larger for higher numbers of negative samples (N). Additive models don't exhibit any relationship between performance and conicity, as they are all clustered around zero conicity, which is in line with our observations in previous sections. In Figure 6 (right), for all multiplicative models except HolE, a higher average entity vector length translates to better performance, while the number of negative samples is kept fixed. Additive models and HolE don't exhibit any such patterns, as they are all clustered just below unit average entity vector length.

The above two observations for multiplicative models make intuitive sense, as lower conicity and higher average vector length would both translate to vectors being more dispersed in the space.

We see another interesting observation regarding the high sensitivity of HolE to the number of negative samples used during training. Using a large number of negative examples (e.g., N = 50 or 100) leads to very high conicity in the case of HolE. Figure 6 (right) shows that the average entity vector length of HolE is always one. These two observations point towards HolE's entity vectors lying in a tiny part of the space. This translates to HolE performing poorer than all other models in the case of high numbers of negative samples.

We also did a similar study for relation vectors, but did not see any discernible patterns.

7 Conclusion

In this paper, we have initiated a systematic study into the important but unexplored problem of analyzing the geometry of various Knowledge Graph (KG) embedding methods. To the best of our knowledge, this is the first study of its kind. Through extensive experiments on multiple real-world datasets, we are able to identify several insights into the geometry of KG embeddings. We have also explored the relationship between KG embedding geometry and its task performance. We have shared all our source code to foster further research in this area.

Acknowledgements

We thank the anonymous reviewers for their constructive comments. This work is supported in part by the Ministry of Human Resources Development (Government of India), Intel, Intuit, and by gifts from Google and Accenture.
References

Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor. 2008. Freebase: a collaboratively created graph database for structuring human knowledge. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data. ACM, pages 1247–1250.

Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, and Oksana Yakhnenko. 2013. Translating embeddings for modeling multi-relational data. In Advances in Neural Information Processing Systems. pages 2787–2795.

T. Dettmers, P. Minervini, P. Stenetorp, and S. Riedel. 2017. Convolutional 2D knowledge graph embeddings. ArXiv e-prints.

Xin Dong, Evgeniy Gabrilovich, Geremy Heitz, Wilko Horn, Ni Lao, Kevin Murphy, Thomas Strohmann, Shaohua Sun, and Wei Zhang. 2014. Knowledge vault: A web-scale approach to probabilistic knowledge fusion. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, pages 601–610.

Yankai Lin, Zhiyuan Liu, Maosong Sun, Yang Liu, and Xuan Zhu. 2015. Learning entity and relation embeddings for knowledge graph completion. In AAAI. pages 2181–2187.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems. pages 3111–3119.

George A. Miller. 1995. WordNet: a lexical database for English. Communications of the ACM 38(11):39–41.

David Mimno and Laure Thompson. 2017. The strange geometry of skip-gram with negative sampling. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. pages 2863–2868.

T. Mitchell, W. Cohen, E. Hruschka, P. Talukdar, J. Betteridge, A. Carlson, B. Dalvi, M. Gardner, B. Kisiel, J. Krishnamurthy, N. Lao, K. Mazaitis, T. Mohamed, N. Nakashole, E. Platanios, A. Ritter, M. Samadi, B. Settles, R. Wang, D. Wijaya, A. Gupta, X. Chen, A. Saparov, M. Greaves, and J. Welling. 2015. Never-ending learning. In Proceedings of AAAI.

Dat Quoc Nguyen, Kairit Sirts, Lizhen Qu, and Mark Johnson. 2016. STransE: a novel embedding model of entities and relationships in knowledge bases. In Proceedings of NAACL-HLT. pages 460–466.

Maximilian Nickel, Lorenzo Rosasco, and Tomaso A. Poggio. 2016. Holographic embeddings of knowledge graphs. In AAAI.

Srinivas Ravishankar, Chandrahas, and Partha Pratim Talukdar. 2017. Revisiting simple neural networks for learning representations of knowledge graphs. In 6th Workshop on Automated Knowledge Base Construction (AKBC) at NIPS 2017.

M. Schlichtkrull, T. N. Kipf, P. Bloem, R. van den Berg, I. Titov, and M. Welling. 2017. Modeling relational data with graph convolutional networks. ArXiv e-prints.

Richard Socher, Danqi Chen, Christopher D. Manning, and Andrew Ng. 2013. Reasoning with neural tensor networks for knowledge base completion. In Advances in Neural Information Processing Systems. pages 926–934.

Fabian M. Suchanek, Gjergji Kasneci, and Gerhard Weikum. 2007. Yago: a core of semantic knowledge. In WWW.

Kristina Toutanova, Danqi Chen, Patrick Pantel, Hoifung Poon, Pallavi Choudhury, and Michael Gamon. 2015. Representing text for joint embedding of text and knowledge bases. In Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics.

Théo Trouillon, Johannes Welbl, Sebastian Riedel, Éric Gaussier, and Guillaume Bouchard. 2016. Complex embeddings for simple link prediction. In ICML.

Zhen Wang, Jianwen Zhang, Jianlin Feng, and Zheng Chen. 2014. Knowledge graph embedding by translating on hyperplanes. In AAAI. pages 1112–1119.

Bishan Yang, Wen-tau Yih, Xiaodong He, Jianfeng Gao, and Li Deng. 2014. Embedding entities and relations for learning and inference in knowledge bases. arXiv preprint arXiv:1412.6575.