
Matching Knowledge Graphs in Entity Embedding Spaces: An Experimental Study

Weixin Zeng, Xiang Zhao, Zhen Tan, Jiuyang Tang, and Xueqi Cheng, Senior Member, IEEE

Abstract—Entity alignment (EA) identifies equivalent entities located in different knowledge graphs (KGs), and has attracted growing research interest over the last few years with the advancement of KG embedding techniques. Although a number of embedding-based EA frameworks have been developed, they mainly focus on improving the performance of entity representation learning, while largely overlooking the subsequent stage that matches KGs in entity embedding spaces. Nevertheless, accurately matching entities based on learned entity representations is crucial to the overall alignment performance, as it coordinates individual alignment decisions and determines the global matching result. Hence, it is essential to understand how well existing solutions for matching KGs in entity embedding spaces perform on present benchmarks, as well as their strengths and weaknesses. To this end, in this article we provide a comprehensive survey and evaluation of matching algorithms for KGs in entity embedding spaces, in terms of both effectiveness and efficiency, on classic settings and on new scenarios that better mirror real-life challenges. Based on in-depth analysis, we provide useful insights into the design trade-offs and good paradigms of existing works, and suggest promising directions for future development.

Index Terms—Entity alignment, entity matching, knowledge graph, knowledge graph alignment.

Manuscript received 24 April 2022; revised 18 March 2023; accepted 20 April 2023. Date of publication 3 May 2023; date of current version 8 November 2023. The work of Weixin Zeng, Xiang Zhao, and Jiuyang Tang was supported in part by the National Key R&D Program of China under Grant 2020AAA0108800, and in part by NSFC under Grants 62272469 and 71971212. Recommended for acceptance by X. Yi. (Corresponding author: Xiang Zhao.)
Weixin Zeng, Xiang Zhao, and Jiuyang Tang are with the Laboratory for Big Data and Decision, National University of Defense Technology, Changsha, Hunan 410073, China (e-mail: [email protected]; [email protected]; [email protected]).
Zhen Tan is with the Science and Technology on Information Systems Engineering Laboratory, National University of Defense Technology, Changsha, Hunan 410073, China (e-mail: [email protected]).
Xueqi Cheng is with the Institute of Computing Technology, CAS, Beijing 100045, China (e-mail: [email protected]).
This article has supplementary downloadable material available at https://doi.org/10.1109/TKDE.2023.3272584, provided by the authors.
Digital Object Identifier 10.1109/TKDE.2023.3272584
© 2023 The Authors. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/

I. INTRODUCTION

Matching data instances that refer to the same real-world entity is a long-standing problem. It establishes the connections among multiple data sources, and is critical to data integration and cleaning [39]. The task has therefore been actively studied; for instance, in the database community, various entity matching (EM) (and entity resolution (ER)) strategies have been proposed to train a (supervised) classifier to predict whether a pair of data records match [10], [39].

Recently, with the emergence and proliferation of knowledge graphs (KGs), matching entities in KGs has drawn much attention from both academia and industry. Distinct from traditional data matching, it brings its own challenges. In particular, it emphasizes the use of KG structures for matching, and its data manifest unique characteristics, e.g., imbalanced class distribution and little attributive textual information. Consequently, although following the traditional EM pipeline is viable, it is hard to train an effective classifier that can infer the equivalence between entities. Thus, much effort has been dedicated to specifically addressing the matching of entities in KGs, which is also referred to as entity alignment (EA).

Early solutions to EA are mainly unsupervised [25], [48], i.e., no labeled data is assumed. They utilize discriminative features of entities (e.g., entity descriptions and relational structures) to infer equivalent entity pairs, but are hampered by the heterogeneity of independently constructed KGs [50].

To mitigate this issue, recent solutions to EA employ a few labeled pairs as seeds to guide the learning and prediction [9], [16], [31], [43], [54]. In short, they embed the symbolic representations of KGs as low-dimensional vectors in a way such that the semantic relatedness of entities is captured by the geometrical structures of the embedding spaces [4], where the seed pairs are leveraged to produce unified entity representations. In the testing stage, they match entities based on the unified entity embeddings. These are coined embedding-based EA methods, and they have exhibited state-of-the-art performance on existing benchmarks.

To be more specific, the embedding-based EA¹ pipeline can be roughly divided into two major stages, i.e., representation learning and matching KGs in entity embedding spaces (or embedding matching for short). While the former encodes the KG structures into low-dimensional vectors and establishes connections between independent KGs via the calibration or transformation of (seed) entity embeddings [50], the latter computes pairwise scores between source and target entities based on such embeddings and then makes alignment decisions according to the pairwise scores. Although this field has been actively explored, existing efforts are mainly devoted to the representation learning stage [19], [30], [70], while embedding matching did not receive much attention until very recently [35], [62]. The majority of existing EA solutions adopt a simple algorithm to realize this stage, i.e., DInf, which first leverages common similarity metrics such as cosine similarity to calculate the pairwise similarity scores between entity embeddings, and then matches a source entity to its most similar target entity according to the pairwise scores [54]. Nevertheless, such an intuitive strategy merely reaches local optimums for individual entities and completely overlooks the (global) interdependence among the matching decisions for different entities [64].

¹ In the rest of the paper, we use EA to refer to embedding-based EA solutions, and conventional EA for the early solutions.

To address the shortcomings of DInf, advanced strategies have been devised [13], [50], [57], [62], [64], [65]. While some inject the modeling of global interdependence into the computation of pairwise scores [13], [50], [62], others directly improve the alignment decision-making process by imposing collective matching constraints [57], [64], [65]. These efforts demonstrate the significance of matching KGs in entity embedding spaces from at least three major aspects: 1) it is an indispensable step of EA, which takes as input the entity embeddings (generated by the representation learning stage) and outputs matched entity pairs; 2) its performance is crucial to the overall EA results, e.g., an effective algorithm can improve the alignment results by up to 88% [62]; and 3) it empowers EA with explainability, as it unveils the decision-making process of alignment. We use Example 1 to further illustrate the significance of the embedding matching process.

Example 1: Fig. 1 presents three representative cases of EA. The KG pairs to be aligned are first encoded into embeddings via the representation learning models. Next, the embedding matching algorithms produce the matched entity pairs based on the embeddings. In the most ideal case where the two KGs are identical, e.g., case (a), an ideal representation learning model would embed equivalent entities into exactly the same place in the low-dimensional space, and the simple DInf algorithm would attain perfect results. Nevertheless, in the majority of practical scenarios, e.g., cases (b) and (c), the two KGs exhibit high structural heterogeneity. As thus, even an ideal representation learning model might generate different embeddings for equivalent entities. In this case, adopting the simple DInf strategy is likely to produce false entity pairs, such as (u5, v3) in case (b).

Fig. 1. Three cases of EA. Dashed lines between KGs denote the seed entity pairs. Entities with the same subscripts are equivalent. In the embedding space, circles with two colors represent that the corresponding entities in the two KGs have the same embeddings.

Worse still, as pointed out in previous works [50], [68], existing representation learning methods for EA cannot fully capture the structural information (possibly due to their inner design mechanisms, or their incapability of dealing with scarce supervision signals). Under these settings, e.g., case (c), the distribution of entity embeddings in the low-dimensional space becomes irregular, where the simple embedding matching algorithm DInf falls short, i.e., it produces the incorrect entity pairs (u3, v1) and (u5, v1). As thus, in these practical cases, an effective embedding matching algorithm is crucial to inferring the correct matches. For instance, by exploiting a collective embedding matching algorithm that imposes the 1-to-1 alignment constraint, the correct matches, i.e., (u3, v3) and (u5, v5), are likely to be restored.

While the study of matching KGs in entity embedding spaces is rapidly progressing, there is no systematic survey or comparison of these solutions [50]. We do notice several survey papers covering embedding-based EA frameworks [50], [61], [66], [67], [68], whereas they all only briefly introduce the embedding matching module (mostly mentioning only the DInf algorithm). In this article, we aim to fill this gap by surveying current solutions for matching KGs in entity embedding spaces and providing a comprehensive evaluation of these methods with the following features:

1) Systematic survey and fair comparison: Albeit essential to the alignment performance, existing embedding matching strategies have not yet been compared directly. Instead, they are integrated with representation learning models, and then evaluated and compared with each other (as a whole). This, however, cannot provide a fair comparison of the embedding matching strategies themselves, since the differences among them can be offset by other influential factors, such as the choices of representation learning models or input features. Therefore, in this work, we exclude irrelevant factors and provide a fair comparison of current matching algorithms for KGs in entity embedding spaces at both theoretical and empirical levels.

2) Comprehensive evaluation and detailed discussion: To fully appreciate the effectiveness of embedding matching strategies, we conduct extensive experiments on a wide range of EA settings, i.e., with different representation learning models, with various input features, and on datasets at different scales. We also analyze the complexity of these algorithms and evaluate their efficiency/scalability under each experimental setting. Based on the empirical results, we discuss and reveal their strengths and weaknesses.

3) New experimental settings and insights: Through empirical evaluation and analysis, we discover that the current mainstream evaluation setting, i.e., 1-to-1 constrained EA, oversimplifies real-life alignment scenarios. As thus, we identify two experimental settings that better reflect the challenges in practice, i.e., alignment with unmatchable entities, as well as a new setting of non 1-to-1 alignment. We compare the embedding matching algorithms under these challenging settings to provide further insights.

Contributions: We make the following contributions:
• We systematically and comprehensively survey and compare state-of-the-art algorithms for matching KGs in entity embedding spaces (Section III).
• We evaluate and compare the state-of-the-art embedding matching algorithms on a wide range of EA datasets and settings, and reveal their strengths and weaknesses. The codes of these algorithms are organized and integrated into an open-source library, EntMatcher, publicly available at https://github.com/DexterZeng/EntMatcher (Section IV).
• We identify experimental settings that better mirror real-life challenges and construct a new benchmark dataset, where deeper insights into the algorithms are obtained via empirical evaluations (Section V).
• Based on our evaluation and analysis, we provide useful insights into the design trade-offs of existing works, and suggest promising directions for the future development of matching KGs in entity embedding spaces (Section VI).

II. PRELIMINARIES

In this section, we first present the task formulation of EA and its general framework. Next, we introduce the studies related to the topic of this article, i.e., matching KGs in entity embedding spaces, and clarify the scope of this study. Finally, we present the key assumptions of embedding-based EA.

A. Task Formulation and Framework

Task formulation: A KG G is composed of triples {(s, p, o)}, where s, o ∈ E represent entities and p ∈ P denotes the predicate (relation). Given a source KG Gs and a target KG Gt, the task of EA is formulated as discovering new (equivalent) entity pairs M = {(u, v) | u ∈ Es, v ∈ Et, u ⇔ v} by using pre-annotated (seed) entity pairs S as anchors, where ⇔ represents the equivalence between entities, and Es and Et denote the entity sets in Gs and Gt, respectively.

General framework: The pipeline of state-of-the-art embedding-based EA solutions can be divided into two stages, i.e., representation learning and embedding matching, as shown in Fig. 2. The general algorithm can be found in Algorithm 1.

Fig. 2. The pipeline of embedding-based EA. Dashed lines denote the pre-annotated alignment links.

Algorithm 1: General Algorithm of Embedding-Based EA.
Input: Source and target KGs: Gs, Gt; Seed pairs: S
Output: Aligned entity pairs: M
1: E ← Representation_Learning(Gs, Gt, S)
2: M ← Embedding_Matching(Es, Et, E)
3: return M;

The majority of studies on EA are devoted to the representation learning stage. They first utilize KG embedding techniques such as TransE [4] and GCN [23] to capture the KG structure information and generate entity structural representations. Next, based on the assumption that equivalent entities from different KGs possess similar neighboring KG structures (and in turn similar embeddings), they leverage the seed entity pairs as anchors and progressively project individual KG embeddings into a unified space through training, resulting in the unified entity representations E.² There have already been several survey papers concentrating on representation learning approaches for EA, and we refer the interested readers to these works [2], [50], [66], [68].

² Indeed, there are a few exceptions, which instead learn a mapping function between individual embedding spaces [50]. However, the subsequent steps still require mapping between spaces and operate on a "unified" one, e.g., the target entity embeddings.

Next, we introduce the embedding matching process, the focus of this article, as well as its related works.

B. Related Work and Scope

Matching KGs in entity embedding spaces: After obtaining the unified entity representations E, where equivalent entities from different KGs are assumed to have similar embeddings, the embedding matching stage (also frequently referred to as the alignment inference stage [50]) produces alignment results by comparing the embeddings of entities from different KGs. Concretely, it first calculates the pairwise scores between source and target entity embeddings according to a specific metric.³ The pairwise scores are then organized in matrix form as S. Next, according to the pairwise scores, various matching algorithms are put forward to align entities. The most common algorithm is Greedy, described in Algorithm 2, which directly matches a source entity to the target entity that possesses the highest pairwise score according to S. Over the last few years, advanced solutions [13], [17], [34], [35], [40], [50], [57], [60], [62], [64], [65], [69] have been devised to improve the embedding matching performance, and in this work, we focus on surveying and comparing these algorithms for matching KGs in entity embedding spaces.

³ Under certain metrics such as cosine similarity (resp., euclidean distance), the larger (resp., smaller) the pairwise scores, the higher the probability that two entities are equivalent. In this work, w.l.o.g., we adopt the former expression and consider that higher pairwise scores are preferred.

Algorithm 2: Greedy(Es, Et, S).
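The listing of Algorithm 2 is not legible in this copy of the paper; below is a minimal NumPy sketch of Greedy as just described. The function name greedy_match and the list-based inputs are our own choices, not taken from the paper or from EntMatcher:

```python
import numpy as np

def greedy_match(src_ids, tgt_ids, sim):
    """Greedy: pair each source entity with the target entity holding the
    highest pairwise score in S (rows: source entities, cols: targets)."""
    best = np.argmax(sim, axis=1)  # top-scoring target index per source
    return [(src_ids[i], tgt_ids[int(j)]) for i, j in enumerate(best)]
```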
Matching KGs in Symbolic Spaces: Before the emergence of embedding-based EA, there were already many conventional frameworks that match KGs in symbolic spaces [20], [47], [48]. While some are based on equivalence reasoning mandated by OWL semantics [20], others leverage similarity computation to compare the symbolic features of entities [48]. However, these solutions are not comparable to algorithms for matching KGs in entity embedding spaces, as 1) they cover both the representation learning and embedding matching stages in embedding-based EA; and 2) their inputs are different from those of embedding matching algorithms. Thus, we do not include them in our experimental evaluation; they have already been compared in the survey papers covering overall embedding-based EA frameworks [50], [68].

The matching of relations (or ontologies) between KGs has also been studied by prior symbolic works [47], [48]. Nevertheless, compared with entities, relations are usually smaller in number, of various granularities [42], and under-explored in embedding-based approaches [59]. Hence, in this work, we exclude relevant studies on this topic and focus on the matching of entities.

The task of entity resolution (ER) [10], [18], [41], also known as entity matching, deduplication, or record linkage, can be regarded as the general case of EA [68]. It assumes that the input is relational data, and each data object usually has a large amount of textual information described in multiple attributes. Nevertheless, in this article, we focus on EA approaches, which strive to align KGs and mainly rely on graph representation learning techniques to model the KG structure and generate entity structural embeddings for alignment. Therefore, the discussion and comparison with ER solutions is beyond the scope of this work.

Matching Data Instances Via Deep Learning: Entity matching (EM) between databases has also been greatly advanced by utilizing pre-trained language models for expressive contextualization of database records [11], [39]. These deep learning (DL) based EM solutions devise end-to-end neural models that learn to classify an entity pair as matching or non-matching, and then feed the test entity pairs into the trained models to obtain classification results [5], [29], [39]. Nevertheless, this procedure is different from the focus of our study, as both its training and testing stages involve representation learning and matching. Besides, these solutions are not suitable for matching KGs in entity embedding spaces, since (1) they require adequate labeled data to train the neural classification models, but the training data in EA is much smaller than the testing data, which could result in overfitting; (2) they would suffer from severe class imbalance in EA, where an entity and all of its nonequivalent entities in the other KG constitute many negative samples, while there is usually only one positive sample for this entity; and (3) they depend on the attributive text information of data records for training, while EA underlines the use of KG structure, which provides much less useful features for model training. In the experiment, we adapt DL-based EM models to tackle EA, and the results are not promising. This will be further discussed in Section IV-C.

Existing Surveys on EA: There are several survey papers covering EA frameworks [50], [61], [66], [67], [68], which are summarized in Table I. Some articles provide a high-level discussion of embedding-based EA frameworks, experimentally evaluate and compare these works, and offer guidelines for potential practitioners [50], [67], [68]. Specifically, Zhao et al. propose a general EA framework to encompass existing works, and then evaluate them under a wide range of settings; nevertheless, they only briefly mention DInf and SMat in the embedding matching stage [68]. Sun et al. survey EA approaches and develop an open-source library to evaluate existing works; however, they merely introduce DInf, SMat, and CSLS, and overlook the comparison among these algorithms. Besides, they point out that current approaches put their main efforts into learning expressive embeddings to capture entity features while ignoring the alignment inference (i.e., embedding matching) stage [50]. Zhang et al. empirically evaluate state-of-the-art embedding-based EA methods in an industrial context, and particularly investigate the influence of the sizes and biases of seed mappings; they evaluate each method as a whole and do not mention the embedding matching process [67].

TABLE I: COMPARISON WITH EXISTING SURVEYS ON EA. THE FOCUS OF EACH WORK IS DENOTED WITH ✓.

Two recent survey papers include the latest efforts on embedding-based EA and give a more self-contained explanation of each technique. Zhang et al. provide a tutorial-type survey, while for embedding matching, they merely introduce the nearest neighbor search strategy, i.e., DInf [66]. Zeng et al. mainly introduce representation learning methods and their applications to EA, while neglecting the embedding matching stage [61].

In all, existing EA survey articles focus on the representation learning process and only briefly introduce the embedding matching module (mostly mentioning only the DInf algorithm), while in this work we systematically survey and empirically evaluate the algorithms designed for the embedding matching process in KG alignment, and present comprehensive results and insightful discussions.

Scope of this Work: This study aims to survey and empirically compare the algorithms for matching KGs in entity embedding spaces, i.e., various implementations of Embedding_Matching() in Algorithm 1, on a wide range of EA experimental settings.

C. Key Assumptions

Notably, existing embedding-based EA solutions have a fundamental assumption: the equivalent entities in different KGs possess similar (ideally, isomorphic) neighboring structures. Under such an assumption, effective representation learning models would transform the structures of equivalent entities into similar entity embeddings. As thus, based on the entity embeddings, the embedding matching stage would assign higher (resp., lower) pairwise similarity scores to the equivalent (resp., nonequivalent) entity pairs, and finally make accurate alignment decisions via coordination according to the pairwise scores.

Besides, current EA evaluation settings assume that the entities in different KGs conform to the 1-to-1 constraint. That is, each u ∈ Es has one and only one equivalent entity v ∈ Et, and vice versa. However, we contend that this assumption is in fact impractical, and we provide detailed experiments and discussions in Section V-B.
III. ALGORITHMS FOR MATCHING KGS IN ENTITY EMBEDDING SPACES

In this section, we introduce the algorithms for matching KGs in entity embedding spaces, i.e., the implementations of Embedding_Matching() in Algorithm 1.

A. Overview

We first provide an overview and comparison of matching algorithms for KGs in entity embedding spaces in Table II.

TABLE II: OVERVIEW AND COMPARISON OF STATE-OF-THE-ART ALGORITHMS FOR MATCHING KGS IN ENTITY EMBEDDING SPACES. NOTE THAT WE ESTIMATE THE ORDER OF MAGNITUDE OF THE TIME AND SPACE COMPLEXITY.

As mentioned in Section II, embedding matching comprises two stages, i.e., pairwise score computation and matching. The baseline approach DInf adopts existing similarity metrics to calculate the similarity between entity embeddings and generate the pairwise scores in the first stage, and then leverages Greedy for matching. In pursuit of better alignment performance, more advanced embedding matching strategies have been put forward. While some (i.e., CSLS, RInf, and Sink.) optimize the pairwise score computation process and produce more accurate pairwise scores, others (i.e., Hun., SMat, and RL) take into account the global alignment dynamics during the matching process, rather than greedily pursuing the local optimum for each entity, so that more correct matches can be generated via coordination under the global constraint.

We further identify two notable characteristics of matching KGs in entity embedding spaces, i.e., whether the matching leverages the 1-to-1 constraint, and the direction of the matching. Regarding the former, Hun. and SMat explicitly exert the 1-to-1 constraint on the matching process, while RL relaxes the strict 1-to-1 constraint by allowing non 1-to-1 matches. The greedy strategies normally do not take this constraint into consideration, except for Sink., which implicitly implements the 1-to-1 constraint in a progressive manner when calculating the pairwise scores. As for the direction of matching, Greedy only considers a single direction at a time and overlooks the influence of the reverse direction; as thus, the resultant source-to-target alignment results are not necessarily equal to the target-to-source ones. By improving the pairwise score computation, CSLS, RInf, and Sink. actually model and integrate the bidirectional alignments, whereas they still adopt Greedy to produce the final results. Among the non-greedy methods, Hun. and SMat fully consider the bidirectional alignments and produce a matching agreed by both directions, while RL is unidirectional.

Next, we describe these methods in detail.⁴

⁴ We omit the algorithmic descriptions of the classical algorithms (e.g., Hungarian [24] and Gale-Shapley [46]) and the neural model (i.e., RL [38]) in the interest of space.

B. Simple Embedding Matching

DInf is the most common implementation of Embedding_Matching(), described in Algorithm 3. Assuming both KGs contain n entities, the time and space complexity of DInf are O(n²).

Algorithm 3: DInf(Es, Et, E).
Input: Source and target entity sets: Es, Et; Unified entity embeddings: E
Output: Matched entity pairs: M
1: Derive similarity matrix S based on E;
2: M ← Greedy(Es, Et, S);
3: return M;
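As a concrete illustration, DInf reduces to a few lines on top of the Greedy sketch above; the cosine-similarity derivation is one of the metric choices discussed in Section IV-B. This is a sketch under our own naming, not the paper's implementation:

```python
import numpy as np

def dinf(src_ids, tgt_ids, emb_s, emb_t):
    """DInf sketch (Algorithm 3): cosine-similarity matrix S, then Greedy."""
    s = emb_s / np.linalg.norm(emb_s, axis=1, keepdims=True)
    t = emb_t / np.linalg.norm(emb_t, axis=1, keepdims=True)
    sim = s @ t.T  # S: |Es| x |Et| cosine similarities
    return greedy_match(src_ids, tgt_ids, sim)  # greedy_match from the sketch above
```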
C. CSLS Algorithm

The cross-domain similarity local scaling (CSLS) algorithm [26] was introduced to mitigate the hubness and isolation issues of entity embeddings in EA [50]. The hubness issue refers to the phenomenon where some entities (known as hubs) frequently appear as the top-1 most similar entities of other entities in the vector space, while the isolation issue means that some outliers are isolated from any point clusters. As thus, CSLS increases the similarity associated with isolated entity embeddings, and conversely decreases the similarities of vectors lying in dense areas [26]. Formally, the CSLS pairwise score between source entity u and target entity v is:

CSLS(u, v) = 2 S(u, v) − φ(u) − φ(v),   (1)

where S is the similarity matrix derived from E using similarity metrics, φ(u) = (1/k) Σ_{v′ ∈ N_u} S(u, v′) is the mean similarity score between the source entity u and its top-k most similar entities N_u in the target KG, and φ(v) is defined similarly. The mean similarity scores of all source and target entities are denoted in vector form as φ_s and φ_t, respectively. To generate the matched entity pairs, CSLS further applies Greedy on the CSLS matrix (i.e., S^CSLS).
Algorithm 4: CSLS(Es, Et, E, k).
Input: Source and target entity sets: Es, Et; Unified entity embeddings: E; Hyper-parameter: k
Output: Matched entity pairs: M
1: Derive similarity matrix S based on E;
2: Calculate the mean values of the top-k similarity scores of entities in Es and Et, resulting in φ_s and φ_t, respectively;
3: S^CSLS = 2S − φ_s − φ_t;
4: M ← Greedy(Es, Et, S^CSLS);
5: return M;

Algorithm 4 describes the detailed procedure of CSLS. Notably, Li et al. put forward Graph Interactive Divergence (GID) to compute the similarity score, which in essence works in the same way as CSLS according to its code implementation [28].

Complexity. The time and space complexity are O(n²). In practice, CSLS requires more time and space than DInf, as it needs to generate the additional CSLS matrix.
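A minimal NumPy sketch of the CSLS rescoring in Equation (1); the helper name is ours, and this is not EntMatcher's actual API:

```python
import numpy as np

def csls_rescore(sim, k=10):
    """CSLS (Equation (1)): penalize hub entities, reward isolated ones.
    sim is the raw similarity matrix S; returns S_CSLS for Greedy matching."""
    phi_s = np.sort(sim, axis=1)[:, -k:].mean(axis=1)  # mean top-k score per source
    phi_t = np.sort(sim, axis=0)[-k:, :].mean(axis=0)  # mean top-k score per target
    return 2 * sim - phi_s[:, None] - phi_t[None, :]
```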
D. Reciprocal Embedding Matching

Zeng et al. [62] formulate the EA task as a reciprocal recommendation process [44] and offer a reciprocal embedding matching strategy, RInf, to model and integrate the bidirectional preferences of entities when inferring the matching results. Formally, it defines the pairwise score of source entity u towards target entity v as:

p_{u,v} = S(u, v) − max_{u′ ∈ Es} S(v, u′) + 1,   (2)

where S is the similarity matrix derived from E, 0 ≤ p_{u,v} ≤ 1, and a larger p_{u,v} denotes a higher degree of preference. As such, the matrix forms of the source-to-target and target-to-source preference scores are denoted as P^{s,t} and P^{t,s}, respectively. Next, RInf converts each preference matrix P into a ranking matrix R, and then averages the two ranking matrices, resulting in the reciprocal preference matrix P^{s↔t} that encodes the bidirectional alignment information. Finally, it adopts Greedy to generate the matched entity pairs.

Algorithm 5: RInf(Es, Et, E).

Complexity. Algorithm 5 describes the detailed procedure of RInf. The time complexity is O(n² lg n) [62], and the space complexity is O(n²). In practice, RInf requires more space than DInf and CSLS, due to the computation of the similarity, preference, and ranking matrices. Noteworthily, two variant methods, RInf-wr and RInf-pb, have been proposed to reduce the memory and time consumption brought by the reciprocal modeling; more details can be found in [62].
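Since the listing of Algorithm 5 is likewise not legible here, the following sketch captures the reciprocal modeling described above: preference matrices via Equation (2), rank conversion, and rank averaging. The exact rank-aggregation details are our reading of the description; the function name is ours:

```python
import numpy as np
from scipy.stats import rankdata

def rinf_scores(sim):
    """RInf sketch: build reciprocal preference scores; feed the output to
    Greedy (argmax per row) to obtain the matched entity pairs."""
    p_st = sim - sim.max(axis=0, keepdims=True) + 1  # Eq. (2): source-to-target
    p_ts = sim - sim.max(axis=1, keepdims=True) + 1  # symmetric target-to-source
    r_st = rankdata(-p_st, axis=1)  # each source's ranking of the targets
    r_ts = rankdata(-p_ts, axis=0)  # each target's ranking of the sources
    # average the two ranking matrices; negate so that a higher score is better
    return -(r_st + r_ts) / 2
```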
E. Embedding Matching as Assignment

Some very recent studies [35], [57] propose to model the embedding matching process as the linear assignment problem. They first use similarity metrics to calculate pairwise similarity scores based on E. Then they adopt the Hungarian algorithm [24] to solve the task of assigning source entities to target entities according to the pairwise scores. The objective is to maximize the sum of the pairwise similarity scores of the final matched entity pairs while observing the 1-to-1 assignment constraint. In this work, we use the Hungarian algorithm implemented by Jonker and Volgenant [21] and denote it as Hun.(Es, Et, E).

Besides, the Sinkhorn operation [37] (or Sink. for short) is also adopted to solve the assignment problem [13], [17], [35], which converts the similarity matrix S into a doubly stochastic matrix S^sinkhorn that encodes the entity correspondence information. Specifically,

Sinkhorn^l(S) = Γ_c(Γ_r(Sinkhorn^{l−1}(S)));
S^sinkhorn = lim_{l→∞} Sinkhorn^l(S),   (3)

where Sinkhorn^0(S) = exp(S), and Γ_c and Γ_r refer to the column- and row-wise normalization operators of a matrix. Since the number of iterations l is limited, the Sinkhorn operation can only obtain an approximate 1-to-1 assignment solution in practice [35]. S^sinkhorn is then forwarded to Greedy to obtain the alignment results.

Algorithm 6: Sink.(Es, Et, E, l).
Input: Source and target entity sets: Es, Et; Unified entity embeddings: E; Hyper-parameter: l
Output: Matched entity pairs: M
1: Derive similarity matrix S based on E;
2: S^sinkhorn = Sinkhorn^l(S) (cf. (3));
3: M ← Greedy(Es, Et, S^sinkhorn);
4: return M;

Complexity. For Hun., the time complexity is O(n³) and the space complexity is O(n²). Algorithm 6 describes the procedure of Sink.; its time complexity is O(ln²) [35], and its space complexity is O(n²). In practice, both algorithms require more space than DInf, since they need to store intermediate results.
F. Stable Embedding Matching

In order to consider the interdependence among alignment decisions, the embedding matching process is formulated as the stable matching problem [14] by [64], [69]. It has been proved that for any two sets of members of the same size, each of whom provides a ranking of the members in the opposing set, there exists a bijection of the two sets such that no pair of members from opposite sides would prefer to be matched to each other rather than to their assigned partners [12]. Specifically, these works first produce the similarity matrix S based on E using similarity metrics. Next, they generate the rankings of members in the opposing set according to the pairwise similarity scores. Finally, they use the Gale-Shapley algorithm [46] to solve the stable matching problem. This procedure is denoted as SMat(Es, Et, E).

Complexity. SMat has a time complexity of O(n² lg n) (since for each entity, the ranking of entities on the opposite side needs to be computed) and a space complexity of O(n²).
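The Gale-Shapley procedure used by SMat is classical; below is a compact sketch with sources proposing, assuming equal-sized entity sets as in the original formulation (names ours):

```python
import numpy as np
from collections import deque

def stable_match(sim):
    """SMat sketch: Gale-Shapley stable matching on preferences derived
    from the similarity matrix (sources propose, targets accept/reject)."""
    n = sim.shape[0]                  # assumes |Es| == |Et| == n
    pref = np.argsort(-sim, axis=1)   # each source's targets, best first
    # rank[u, v]: position of source u in target v's preference list
    rank = np.argsort(np.argsort(-sim, axis=0), axis=0)
    nxt = [0] * n                     # next proposal index per source
    partner = [-1] * n                # current partner of each target
    free = deque(range(n))
    while free:
        u = free.popleft()
        v = pref[u, nxt[u]]
        nxt[u] += 1
        if partner[v] == -1:          # target v is free: accept
            partner[v] = u
        elif rank[u, v] < rank[partner[v], v]:
            free.append(partner[v])   # v prefers u: replace current partner
            partner[v] = u
        else:
            free.append(u)            # v rejects u; u proposes again later
    return [(partner[v], v) for v in range(n)]
```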
G. RL-Based Embedding Matching

The embedding matching process is cast as a classic sequence decision problem by [65]. Given a sequence of source entities (and their embeddings), the goal of the sequence decision problem is to decide to which target entity each source entity aligns. The work devises a reinforcement learning (RL)–based framework to learn to optimize the decision-making for all entities, rather than optimizing every single decision separately. Under the RL-based framework, a new coordination strategy involving coherence and exclusiveness constraints is implemented. While coherence aims to keep the EA decisions coherent for closely-related entities, exclusiveness aims to avoid assigning the same target entity to multiple source entities; that is, if an entity is already matched, it is less likely to be matched to other entities. The general procedure is shown in algorithmic form in Appendix A, available online due to the limit of space, and more details can be found in the original paper [65].

Complexity. It is difficult to deduce the time complexity of this neural RL model; instead, we provide the empirical time costs in the experiments. The space complexity is O(n²).
IV. MAIN EXPERIMENTS

In this section, we compare the algorithms for matching KGs in entity embedding spaces under the mainstream EA evaluation setting (1-to-1 alignment).

A. EntMatcher: An Open-Source Library

To ensure comparability, we re-implemented all compared algorithms in Python under a unified framework and established an open-source library, EntMatcher.⁵ The architecture of the EntMatcher library is presented in the blue block of Fig. 3; it takes as input the unified entity embeddings E and produces the matched entity pairs. It has the following three major features:

Fig. 3. Architecture of the EntMatcher library and additional modules required by the experimental evaluation.

Loosely-Coupled Design. There are three independent modules in EntMatcher, and we have implemented the representative methods in each module. Users are free to combine the techniques in each module to develop new approaches, or to implement their own designs by following the templates in the modules.

Reproduction of Existing Approaches. To support our experimental study, we did our best to re-implement all existing algorithms using EntMatcher. For instance, the combination of cosine similarity, CSLS, and Greedy reproduces the CSLS algorithm in Section III-C, and the combination of cosine similarity, None, and Hun. reproduces the Hun. algorithm in Section III-E. The specific hyper-parameter settings are elaborated in Section IV-B.

Flexible Integration With Other Modules in EA. EntMatcher is highly flexible and can be directly called during the development of standalone EA approaches. Besides, users may also use EntMatcher as the backbone and call other modules. For instance, to conduct the experimental evaluations in this work, we implemented the representation learning and auxiliary information modules to generate the unified entity embeddings E, as shown in the white blocks of Fig. 3. More details are elaborated in the next subsection. Finally, EntMatcher is also compatible with existing open-source EA libraries (that mainly focus on representation learning), such as OpenEA⁶ and EAkit.⁷

⁵ The codes are publicly available at https://github.com/DexterZeng/EntMatcher
⁶ [Online]. Available: https://github.com/nju-websoft/OpenEA
⁷ [Online]. Available: https://github.com/THU-KEG/EAkit
B. Experimental Settings

The current EA evaluation setting assumes that the entities in the source and target KGs are 1-to-1 matched (cf. Section II-C). Although this assumption simplifies the real-world scenarios, where some entities are unmatchable or some might be aligned to multiple entities on the other side, it indeed reflects the core challenge of EA. Therefore, following existing literature, we mainly compare the embedding matching algorithms under this setting, and postpone the evaluation on the challenging real-life scenarios to Section V.

Datasets. We used popular EA benchmarks for evaluation: (1) DBP15K, which comprises three multilingual KG pairs extracted from DBpedia [1]: English to Chinese (D-Z), English to Japanese (D-J), and English to French (D-F); (2) SRPRS, a sparser dataset that follows the real-life entity distribution, including two multilingual KG pairs extracted from DBpedia, English to French (S-F) and English to German (S-D), and two mono-lingual KG pairs, DBpedia to Wikidata [53] (S-W) and DBpedia to YAGO [49] (S-Y); and (3) DWY100K, a larger dataset consisting of two mono-lingual KG pairs: DBpedia to Wikidata (D-W) and DBpedia to YAGO (D-Y). The detailed statistics can be found in Table III, where the numbers of entities, relations, triples, and gold links, and the average entity degree are reported. Regarding the gold alignment links, we adopted 70% as the test set, 20% for training, and 10% for validation.

TABLE III: DATASET STATISTICS.

Evaluation metric. We utilized the F1 score as the evaluation metric, which is the harmonic mean of precision and recall, where precision is computed as the number of correct matches divided by the number of matches found by a method, and recall is computed as the number of correct matches found by a method divided by the number of gold matches. Note that recall is equivalent to the Hits@1 metric used in some previous works.

Similarity Metric. After obtaining the unified entity representations E, a similarity metric is required to produce pairwise scores and generate the similarity matrix S. Frequent choices include cosine similarity [7], [36], [52], euclidean distance [8], [27], and Manhattan distance [55], [58]. In this work, we followed mainstream works and adopted cosine similarity.

Notably, we omit more detailed experimental settings in the interest of space; they can be found in Appendix B, available online.
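For concreteness, the metric computation reduces to a few lines; this is our sketch, where predicted and gold are assumed to be Python sets of (source, target) entity pairs:

```python
def evaluate(predicted, gold):
    """Precision, recall, and F1 as defined above."""
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold)
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1
```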
C. Main Results and Comparison

We first evaluate with only structural information and report the results in Table IV, where R- and G- refer to using RREA and GCN to generate the structural embeddings, respectively, and DBP and SRP denote DBP15K and SRPRS, respectively. Next, we supplement with name embeddings and report the results in Table V, where N- and NR- refer to using only the name embeddings and to fusing name embeddings with RREA structural representations, respectively. Note that, on existing datasets, all the entities in the test set can be matched, and all the algorithms are devised to find a target entity for each test source entity. Hence, the number of matches found by a method equals the number of gold matches, and consequently the precision value is equal to the recall value and the F1 score [65].

TABLE IV: THE F1 SCORES OF ONLY USING STRUCTURAL INFORMATION.

TABLE V: THE F1 SCORES OF USING AUXILIARY INFORMATION.

Overall Performance. First, we do not delve into the embedding matching algorithms and directly analyze the general results. Specifically, using RREA to learn structural representations brings better performance than using GCN, showcasing that representation learning strategies are crucial to the overall alignment performance. When introducing the entity name information, we observe that this auxiliary signal alone can already provide a very accurate signal for alignment; this is because the equivalent entities in different KGs of current datasets share very similar or even identical names. After fusing the semantic and structural information, the alignment performance is further lifted, with most of the approaches reaching over 0.9 in terms of the F1 score.

Effectiveness Comparison of Embedding Matching Algorithms. From the tables, it is evident that: (1) Overall, Hun. and Sink. attain much better results than the other strategies. Specifically, Hun. takes full account of the global matching constraints and strives to reach a globally optimal matching given the objective of maximizing the sum of pairwise similarity scores. Moreover, the 1-to-1 constraint it exerts aligns with the present evaluation setting, where the source and target entities are 1-to-1 matched. Sink., on the other hand, implicitly implements the 1-to-1 constraint during pairwise score computation and still adopts Greedy to produce the final results, where there might exist non 1-to-1 matches;
(2) DInf attains the worst performance. This is because it directly adopts the similarity scores, which suffer from the hubness and isolation issues [50]; besides, it leverages Greedy, which merely reaches the local optimum for each entity. (3) The performance of RInf, CSLS, SMat, and RL are well matched. RInf and CSLS improve upon DInf by mitigating the hubness issue and enhancing the quality of the pairwise scores. SMat and RL, on the other hand, improve upon DInf by modeling the interactions among the matching decisions for different entities.

Furthermore, we conduct a deeper analysis of these approaches, and identify the following patterns:

Pattern 1. If the highest pairwise similarity scores of source entities are close, RInf and CSLS (resp., SMat and RL) attain relatively better (resp., worse) performance. Specifically, in Table IV, where RInf consistently (and CSLS sometimes) attains superior results to SMat and RL, the average standard deviation (STD) values of the top-5 pairwise similarity scores of source entities (cf. Fig. 4) are very small, unveiling that the top scores are close and difficult to differentiate. In contrast, in Table V, where SMat and RL outperform RInf and CSLS, the corresponding STD values are relatively large. This is because RInf and CSLS aim to make the scores more distinguishable, and hence they are more effective in cases where the top similarity scores are very close (i.e., low STD values). On the contrary, when the top similarity scores are already discriminating (e.g., Table V), RInf and CSLS become less useful, while SMat and RL can still make improvements by using the global constraints to enforce deviation from local optimums.

Fig. 4. The statistics of pairwise similarity scores (i.e., top-5 STD), where the name of each setting is abbreviated, e.g., R-D stands for R-DBP.
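A plausible computation of the statistic behind Fig. 4, under our reading of "top-5 STD": for each source entity, take the standard deviation of its five highest pairwise scores, then average over all source entities:

```python
import numpy as np

def topk_std(sim, k=5):
    """Average standard deviation of each source entity's top-k scores."""
    topk = np.sort(sim, axis=1)[:, -k:]  # k highest scores per source row
    return float(topk.std(axis=1).mean())
```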
Pattern 2. On sparser datasets, the superiority of Sink. and Hun. over the rest of the methods becomes less significant. This is based on the observation that, on SRPRS, other matching algorithms (RInf in particular) attain much closer performance to Sink. and Hun.. Such a pattern could be attributed to the fact that, on sparser datasets, entities normally have fewer connections with others, i.e., a lower average entity degree (in Table III), where representation learning strategies might fail to fully capture the structural signals for alignment, and the resultant pairwise scores become less accurate. These inaccurate scores could mislead the matching process and hence limit the effectiveness of the top-performing methods, i.e., Sink. and Hun.. In other words, sparser KG structures are more likely to (partially) break the fundamental assumption on KG structure similarity (cf. Section II-C).

Efficiency Analysis. We compare the time and space efficiency of these methods on the medium-sized datasets in Fig. 5. Since the costs on KG pairs from the same dataset are very similar, we report the average time and space costs under each setting in the interest of space.

Fig. 5. Efficiency comparison. Shapes in blue denote methods that improve pairwise scores, while shapes in black denote those exerting global constraints (except for DInf).

Specifically, we observe that: (1) The simple algorithm DInf is the most efficient approach; (2) Among the advanced approaches, CSLS is the most efficient one, closely following DInf; (3) The efficiency of RInf and Hun. are equally matched. While Hun. consumes relatively less memory space than RInf, its time efficiency is less stable, and it tends to run slower on datasets with less accurate pairwise scores; (4) The space efficiency of Sink. is close to RInf and Hun., whereas it has much higher time costs, which largely depend on the value of l; (5) RL is the least time-efficient approach, while SMat is the least space-efficient algorithm. RL requires more time on datasets with less accurate pairwise scores, where its pre-processing module fails to produce promising results [65]. The memory space consumption of SMat is high, as it needs to store a large amount of intermediate matching results. In all, we conclude that, generally, advanced embedding matching algorithms require more time and memory space, among which the methods incorporating global matching constraints tend to be less efficient.
Comparison With DL-Based EM Approaches. We utilize the deepmatcher Python package [39], which provides built-in neural networks and utilities for training and applying state-of-the-art deep learning models for entity matching, to address EA. Specifically, we use the structural and name embeddings to replace the attributive text inputs of deepmatcher, respectively, and then train the neural model with the labeled data. For each positive entity pair, we randomly sample 10 negative ones. In the testing stage, for each source entity, we feed the entity pairs constituting it and all the target entities into the trained classifier, and regard the entity pair with the highest predicted score as the result. In the final results, only a handful of entities are correctly aligned, showing that DL-based EM approaches cannot handle EA well, which can be ascribed to the insufficient labeled data, the imbalanced class distribution, and the lack of attributive text information, as discussed in Section II-B.

D. Results on Large-Scale Datasets

Next, we provide the results on the relatively larger dataset, DWY100K, which can also reflect the scalability of these algorithms. The results are presented in Table VI.⁸ The general pattern is similar to that on G-DBP (i.e., using GCN on DBP15K), where Sink. and Hun. obtain the best results, followed by RInf. The performance of CSLS and RL are close, outperforming DInf by over 20%.

TABLE VI: THE F1 SCORES ON DWY100K USING GCN.

We compare the efficiency of these algorithms in Table VI, where T̄ refers to the average time cost and Mem. denotes whether the memory space required by the model can be covered by our experimental environment.⁹ We observe that, given larger datasets, most of the performant algorithms have poor efficiency and scalability (e.g., RInf, Sink., and Hun.). Note that in [62], two variants of RInf, i.e., RInf-wr and RInf-pb, are proposed to improve its scalability at the cost of a small performance drop, which is empirically validated in Table VI. This also reveals that more scalable matching algorithms for KGs in entity embedding spaces should be devised.

⁸ We cannot provide the results of SMat, as it requires extremely large memory space and cannot work under our experimental environment.
⁹ Note that for algorithms with memory space costs exceeding our experimental environment (except for SMat), there is additional swap area on the hard drive for them to finish the program (which usually takes much longer).

E. Analysis and Insights

We provide further experiments and discussions in this subsection. Due to the limitation of space, more experiments and the case study can be found in Appendices C and D, available online.

On Efficiency and Scalability. The simple algorithm DInf is the most efficient and scalable one, as it merely involves the most basic computation and matching operations. CSLS is slightly less efficient than DInf due to the update of the pairwise similarity scores; it also has good scalability. Although RInf adopts a similar idea to CSLS, it involves an additional ranking process, which brings much more time and memory consumption, making it less scalable. Sink. repeatedly conducts the normalization operation, and thus its time efficiency mainly depends on the l value; its scalability is also limited by its memory space consumption, since it needs to store intermediate results, as revealed in Table VI.

Regarding the methods that exert global constraints, Hun. is efficient on medium-sized datasets, while it is not scalable due to its high time complexity and memory space consumption. SMat is space-inefficient even on the medium-sized datasets, making it not scalable. In comparison, RL has more stable time and space costs and can scale to large datasets, and the main influencing factor is the accuracy of the pairwise scores. This is because RL has a pre-processing step that filters out confident matched entity pairs and excludes them from the time-consuming RL learning process [65]; more confident matched entity pairs are filtered out if the pairwise scores are more accurate.

On Effectiveness of Improving Pairwise Score Computation. We now compare and discuss the strategies for improving the pairwise score computation, i.e., CSLS, RInf, and Sink..

Both CSLS and RInf aim to mitigate the hubness and isolation issues in the raw pairwise scores (from different starting points). Particularly, we observe that, by setting k (in Equation (1)) of CSLS to 1, the difference between RInf and CSLS is reduced to the extra ranking process of RInf, and the results in Tables IV and V validate that this ranking process consistently brings better performance. This is because the ranking operation can amplify the differences among the scores and prevent such information from being lost after the bidirectional aggregation [62]. However, it is noteworthy that the ranking process brings much more time and memory consumption, as can be observed from the empirical results.

Then we analyze the influence of the k value in CSLS. As shown in Fig. 6, a larger k leads to worse performance. This is because a larger k implies a smaller φ value in Equation (1) (where the top-k highest scores are considered and averaged), and the resultant pairwise scores become less distinctive. This also validates the effectiveness of the design of RInf (cf. Equation (2)), where only the maximum value is considered to compute the preference score. Nevertheless, in Section V-B, we reveal that setting k to 1 is only useful in the 1-to-1 alignment setting.

Fig. 6. F1 scores of CSLS with varying k value.
As for Sink., it adopts an extreme approach to optimize the pairwise scores, which encourages each source (resp., target) entity to have only one positive pairwise score with a target (resp., source) entity and 0's with the rest of the target (resp., source) entities. Thus, it is in fact progressively and implicitly implementing the 1-to-1 alignment constraint during the pairwise score computation process as l increases, and is particularly useful in the present 1-to-1 evaluation settings of EA. In Fig. 7, we further examine the influence of l in Equation (3) on the alignment results of Sink., which meets our expectation that the larger the l value, the better the distribution of the resultant pairwise scores fits the 1-to-1 constraint, and thus the higher the alignment performance. Nevertheless, a larger l also implies longer processing time. Therefore, by tuning on the validation set, we set l to 100 to reach a balance between effectiveness and efficiency.

Fig. 7. F1 scores of Sink. with varying l value.

On Effectiveness of Exerting Global Constraints. Next, we compare and discuss the methods that exert global constraints on the embedding matching process, i.e., Hun., SMat, and RL.

It is evident that Hun. is the most performant approach, as it fits well with the present EA setting and can secure an optimal solution towards maximizing the sum of pairwise scores. Specifically, the current EA setting has two notable assumptions (cf. Section II-C). With these two assumptions, EA can be transformed into the linear assignment problem, which aims to maximize the sum of pairwise scores under the 1-to-1 constraint [35]. As thus, the algorithms for solving the linear assignment problem, e.g., Hun., can attain remarkably high performance on EA. However, these two assumptions do not necessarily hold on all occasions, which could influence the effectiveness of Hun.. For instance, as revealed in Pattern 2, on sparse datasets (e.g., SRPRS), the neighboring structures of some equivalent entities are likely to be different, where the effectiveness of Hun. is limited. In addition, the 1-to-1 alignment constraint is not necessarily true in practice, which will be discussed in Section V.

In comparison, SMat merely aims to attain a stable matching, where the resultant entity pairing could be sub-optimal under the present evaluation setting. RL, on the other hand, relaxes the 1-to-1 constraint and only deviates slightly from greedy matching, and hence its results are not very promising.

Overall Comparison and Conclusion. Finally, we compare the algorithms all together and draw the following conclusions under the 1-to-1 alignment setting: (1) The best-performing methods are Hun. and Sink.; nevertheless, they have low scalability. (2) CSLS and RInf achieve the best balance between effectiveness and efficiency; while CSLS is more efficient, RInf is more effective. (3) SMat and RL tend to attain better results when the accuracy of the pairwise scores is high; nevertheless, they require relatively more time.
V. NEW EVALUATION SETTINGS

In this section, we conduct experiments under settings that better reflect real-life challenges.

A. Unmatchable Entities

The current EA literature largely overlooks the unmatchable issue, where a KG contains entities that the other KG does not contain. For instance, when aligning YAGO 4 and IMDB, only 1% of the entities in YAGO 4 are film-related and possibly have equivalent entities in IMDB, while the other 99% of entities in YAGO 4 necessarily have no match in IMDB [68]. Hence, we aim to evaluate the embedding matching algorithms in terms of dealing with unmatchable entities.

Datasets and Evaluation Settings. Following [63], we adapt the KG pairs in DBP15K to include unmatchable entities, resulting in DBP15K+; the specific construction procedure can be found in [63]. As for the evaluation metric, we follow the main experimental setting and adopt the F1 score. Unlike in 1-to-1 alignment, there exist unmatchable entities in this adapted dataset, and the precision and recall values are not necessarily equivalent, since some methods would also align unmatchable entities. Noteworthily, the original setting of SMat and Hun. requires that the numbers of entities on the two sides be equal. Thus, we add dummy nodes on the side with fewer entities to restore such a setting, and then apply SMat and Hun. (a sketch of this padding is given below). The corresponding results are reported in Table VII.

TABLE VII: F1 SCORES ON DBP15K+.
of some equivalent entities are likely to be different, where the corresponding results are reported in Table VII.
effectiveness of Hun. is limited. In addition, the 1-to-1 alignment Alignment Results. It reads that Hun. attains the best results,
constraint is not necessarily true in practice, which will be followed by SMat. The superior results are partially due to the
discussed in Section V. addition of dummy nodes, which could mitigate the unmatchable
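As a point of reference, the linear assignment formulation above can be solved with an off-the-shelf routine. The sketch below uses SciPy's linear_sum_assignment (a solver in the spirit of [21], [24]) on a toy similarity matrix; it illustrates the formulation rather than the exact implementation of Hun. evaluated here.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def hungarian_match(sim: np.ndarray):
    """Return the 1-to-1 matching that maximizes the sum of pairwise
    scores, i.e., the solution to the linear assignment problem."""
    rows, cols = linear_sum_assignment(sim, maximize=True)
    return list(zip(rows.tolist(), cols.tolist()))

# Toy example: a 3 x 3 similarity matrix between source and target entities.
sim = np.array([[0.9, 0.1, 0.0],
                [0.2, 0.8, 0.3],
                [0.1, 0.4, 0.7]])
print(hungarian_match(sim))  # [(0, 0), (1, 1), (2, 2)]
```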
In comparison, SMat merely aims to attain a stable matching, where the resultant entity pairing could be sub-optimal under the present evaluation setting. RL, on the other hand, relaxes the 1-to-1 constraint and only deviates slightly from the greedy matching, and hence its results are not very promising.
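For intuition, a stable matching of the kind SMat targets can be produced by the classic deferred-acceptance procedure of Gale and Shapley [14]. The sketch below, which derives both sides' preferences from the same pairwise score matrix, is a simplified illustration under that assumption, not SMat's actual code.

```python
import numpy as np

def stable_match(sim: np.ndarray):
    """Deferred acceptance: each source proposes to targets in order of
    decreasing similarity; each target keeps its best proposer so far."""
    n = sim.shape[0]                      # assumes a square score matrix
    prefs = np.argsort(-sim, axis=1)      # each source's ranked targets
    next_choice = [0] * n                 # next target each source proposes to
    engaged_to = {}                       # target -> currently matched source
    free = list(range(n))                 # sources without a partner
    while free:
        s = free.pop()
        t = int(prefs[s][next_choice[s]])
        next_choice[s] += 1
        if t not in engaged_to:
            engaged_to[t] = s                  # t accepts its first proposer
        elif sim[s, t] > sim[engaged_to[t], t]:
            free.append(engaged_to[t])         # previous partner becomes free
            engaged_to[t] = s                  # t trades up to a better proposer
        else:
            free.append(s)                     # s is rejected; proposes again
    return [(s, t) for t, s in engaged_to.items()]
```

The result is stable (no source-target pair prefers each other over their assigned partners) but, as noted above, it need not maximize the sum of pairwise scores.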
Overall Comparison and Conclusion. Finally, we compare all the algorithms together and draw the following conclusions under the 1-to-1 alignment setting: (1) The best performing methods are Hun. and Sink.. Nevertheless, they have low scalability; (2) CSLS and RInf achieve the best balance between effectiveness and efficiency. While CSLS is more efficient, RInf is more effective; (3) SMat and RL tend to attain better results when the accuracy of the pairwise scores is high. Nevertheless, they require relatively more time.
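For completeness, the CSLS adjustment referenced throughout this comparison rescales each raw similarity by the average similarity of the two entities to their k nearest cross-KG neighbors, following [26]. The sketch below assumes a cosine-similarity matrix (with k no larger than either dimension) and is our own simplified rendering.

```python
import numpy as np

def csls(sim: np.ndarray, k: int = 10) -> np.ndarray:
    """CSLS rescaling of a similarity matrix [26]:
    csls(x, y) = 2 * sim(x, y) - r_T(x) - r_S(y),
    where r_T(x) (resp., r_S(y)) is the mean similarity of x (resp., y)
    to its k nearest neighbors on the other side."""
    r_src = np.sort(sim, axis=1)[:, -k:].mean(axis=1)   # shape (n_source,)
    r_tgt = np.sort(sim, axis=0)[-k:, :].mean(axis=0)   # shape (n_target,)
    return 2 * sim - r_src[:, None] - r_tgt[None, :]
```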
V. NEW EVALUATION SETTINGS

In this section, we conduct experiments on settings that better reflect real-life challenges.

A. Unmatchable Entities

The current EA literature largely overlooks the unmatchable issue, where one KG contains entities that the other KG does not. For instance, when aligning YAGO 4 and IMDB, only 1% of the entities in YAGO 4 are film-related and could possibly have equivalent entities in IMDB, while the other 99% of the entities in YAGO 4 necessarily have no match in IMDB [68]. Hence, we aim to evaluate the embedding matching algorithms in terms of their ability to deal with unmatchable entities.

Datasets and Evaluation Settings. Following [63], we adapt the KG pairs in DBP15K to include unmatchable entities, resulting in DBP15K+. The specific construction procedure can be found in [63]. As for the evaluation metric, we follow the main experimental setting and adopt the F1 score. Unlike under 1-to-1 alignment, there exist unmatchable entities in this adapted dataset, and the precision and recall values are not necessarily equal, since some methods would also align unmatchable entities. Notably, the original setting of SMat and Hun. requires that the numbers of entities on the two sides be equal. Thus, we add dummy nodes on the side with fewer entities to restore such a setting, and then apply SMat and Hun., as sketched below. The corresponding results are reported in Table VII.
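A minimal sketch of this dummy-padding step follows; padding with a constant low score is our own assumption about how the dummies are scored, not a detail taken from the evaluated implementations.

```python
import numpy as np

def pad_with_dummies(sim: np.ndarray, dummy_score: float = -1.0) -> np.ndarray:
    """Pad a non-square similarity matrix with dummy rows or columns so
    that 1-to-1 matchers such as Hun. and SMat become applicable.
    Entities matched to a dummy are then predicted as unmatchable."""
    n_src, n_tgt = sim.shape
    if n_src < n_tgt:     # add dummy source entities (extra rows)
        pad = np.full((n_tgt - n_src, n_tgt), dummy_score)
        return np.vstack([sim, pad])
    if n_tgt < n_src:     # add dummy target entities (extra columns)
        pad = np.full((n_src, n_src - n_tgt), dummy_score)
        return np.hstack([sim, pad])
    return sim
```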
TABLE VII
F1 SCORES ON DBP15K+

Alignment Results. Table VII shows that Hun. attains the best results, followed by SMat. Their superior results are partially due to the addition of dummy nodes, which mitigates the unmatchable issue to a certain degree. The results of RInf and Sink. are close, outperforming CSLS and RL. DInf still achieves the worst performance.
Besides, by comparing the results on DBP15K+ with those on the original dataset DBP15K (cf. Table IV), we observe that: (1) After including the unmatchable entities, the F1 scores of all methods drop. This is because most of the current embedding matching algorithms are greedy, i.e., they retrieve a target entity for each source entity (including the unmatchable ones), which leads to a very low precision. For the rest of the methods, e.g., Hun. and SMat, the unmatchable entities also mislead the matching process and thus affect the final results; (2) Unlike on DBP15K, where the performance of Sink. and Hun. is close, on DBP15K+ Hun. largely outperforms Sink., as Hun. does not necessarily align a target entity to each source entity and hence has a higher precision; (3) Overall, existing algorithms for matching KGs in entity embedding spaces lack the capability of dealing with unmatchable entities.
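The greedy behavior criticized in observation (1) amounts to a row-wise argmax over the score matrix. The following sketch of such a DInf-style baseline is our own illustration; it makes plain why every source entity, matchable or not, receives a prediction.

```python
import numpy as np

def greedy_match(sim: np.ndarray):
    """Greedy inference: align every source entity to its highest-scoring
    target. Unmatchable sources still produce (spurious) predictions,
    which depresses precision on datasets such as DBP15K+."""
    best = sim.argmax(axis=1)             # one target per source, always
    return [(s, int(t)) for s, t in enumerate(best)]
```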
B. Non 1-to-1 Alignment

Next, we study the setting where the source and target entities do not strictly conform to the 1-to-1 constraint, so as to better appreciate these matching algorithms for KGs in entity embedding spaces. Non 1-to-1 alignment is common in practice, especially when two KGs contain entities of different granularity, or when one KG is noisy and involves duplicate entities. To the best of our knowledge, we are among the first to identify and investigate this issue.

Dataset Construction. Present EA benchmarks are constructed according to the 1-to-1 constraint. Thus, in this work, we establish a new dataset that involves non 1-to-1 alignment relationships. Specifically, we obtain the pre-annotated links between Freebase [3] and DBpedia [1] (available at https://www.dbpedia.org/blog/dbpedia-is-now-interlinked-with-freebase-links-to-opencyc-updated/), and preserve the entities that are involved in 1-to-many, many-to-1, and many-to-many alignment relationships. Then, we retrieve the relational triples that contain these entities from the respective KGs, which also introduces new entities. Next, we detect the links among the newly added entities and add them to the alignment links. Finally, the resultant dataset, FB_DBP_MUL, contains 44,716 entities, 164,882 triples, and 22,117 gold links, among which 20,353 are non 1-to-1 links and 1,764 are 1-to-1 links; it is publicly available at https://github.com/DexterZeng/EntMatcher. The specific statistics are also presented in Table III.

Evaluation Settings. To keep the integrity of the links among entities, we sample the training, validation and test sets from the gold links according to the principle that links involving the same entity should not be distributed across different sets. The sizes of the final training, validation and test sets follow a ratio of approximately 7:1:2. We compare the entity pairs produced by the embedding matching algorithms against the gold test links, and report the precision (P), recall (R) and F1 values.
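Since the predictions here form a set of pairs rather than a ranking, these metrics reduce to simple set operations. A minimal sketch (with hypothetical entity identifiers) follows.

```python
def precision_recall_f1(predicted: set, gold: set):
    """Evaluate predicted entity pairs against the gold test links.
    Both arguments are sets of (source, target) pairs."""
    tp = len(predicted & gold)                       # true positives
    p = tp / len(predicted) if predicted else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r > 0 else 0.0
    return p, r, f1

# Example: source s1 is involved in two gold links (non 1-to-1), but a
# greedy matcher predicts only one of them, capping its recall.
gold = {("s1", "t1"), ("s1", "t2"), ("s2", "t3")}
pred = {("s1", "t1"), ("s2", "t3")}
print(precision_recall_f1(pred, gold))  # (1.0, 0.666..., 0.8)
```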
TABLE VIII
THE RESULTS ON NON 1-TO-1 ALIGNMENT DATASET

Alignment Results. It is evident from Table VIII that, compared with 1-to-1 alignment, the results change significantly on the new dataset. Specifically: (1) RInf and CSLS attain the best F1 scores, though the results are not very promising (e.g., with F1 scores lower than 0.1 when using GCN); (2) Sink. and Hun. achieve much worse results compared with their performance on the 1-to-1 alignment datasets; (3) The results of SMat and RL are even inferior to those of the simple baseline DInf. The main reason for these changes is that the non 1-to-1 alignment links pose great challenges to existing embedding matching algorithms. Specifically, DInf, CSLS, RInf, Sink. and RL only align one target entity (the one with the highest score) to a given source entity, but fail to discover other alignment links that also involve this source entity. SMat and Hun., in turn, impose the 1-to-1 constraint during matching, which falls short in the non 1-to-1 setting, thus leading to inferior results. Therefore, this calls for the study of embedding matching algorithms targeted at non 1-to-1 alignment. We also discuss the k value in CSLS and RInf under the non 1-to-1 setting, which can be found in Appendix C, available online.
VI. SUMMARY AND FUTURE DIRECTION

In this section, we summarize the observations and insights made from our evaluation, and provide possible future research directions.

(1) The investigation into matching KGs in embedding spaces has not yet made substantial progress. Although there are a few algorithms tailored for matching KGs in embedding spaces, e.g., CSLS, RInf and RL, under the most popular EA evaluation setting (with the 1-to-1 alignment constraint), they are outperformed by a classic general matching algorithm, i.e., Hun.. Hence, there is still much room for improving the matching of KGs in embedding spaces.

(2) No existing embedding matching algorithm prevails under all experimental settings. The strategies designed to solve the linear assignment problem attain the best performance under the 1-to-1 setting, while they fall short in more practical and challenging scenarios, since the new settings (e.g., non 1-to-1 alignment) no longer align with the conditions of these optimization algorithms. Similarly, although the methods for improving the computation of pairwise scores achieve superior results in the non 1-to-1 alignment scenario, they are outperformed by other solutions under the unmatchable setting. Therefore, each evaluation setting poses its own challenge to the embedding matching process, and currently there is no consistent winner.

(3) The adaptation of general matching algorithms requires careful design. Among the embedding matching algorithms, Hun. and SMat are general matching algorithms that have been applied to many other related tasks. Although directly adopting these general strategies to tackle EA is simple and effective, they might well fall short in some scenarios, as alignment on KGs poses its own challenges, e.g., the matching is not necessarily 1-to-1 constrained, or the pairwise scores are inaccurate. Thus, it is suggested to take full account of the characteristics of the alignment settings when adapting other general matching algorithms to cope with matching KGs in entity embedding spaces.

(4) Scalability and efficiency deserve more attention. Existing advanced embedding matching algorithms have poor scalability, due to the additional resource-consuming operations that contribute to the alignment performance, such as the ranking process in RInf and the 1-to-1 constraint exerted by Hun. and SMat. Besides, space efficiency is also a critical issue. As shown in Section IV-D, most of the approaches incur rather high memory costs on large-scale datasets. Therefore, considering that in practice there are many more entities, scalability and efficiency should be taken into account during algorithm design. A preliminary exploration has been conducted by [15].

(5) The practical evaluation settings are worth further investigation. Under the unmatchable and non 1-to-1 alignment settings, the performance of existing algorithms is not promising. A possible future direction is to introduce the notion of probability and leverage probabilistic reasoning frameworks [22], [45], which have higher flexibility, to produce the alignment results.

(6) Integrating relation embeddings might help. Two recent studies propose to use relation embeddings to help induce aligned entity pairs [33], [56]. Different from existing methods that regard EA as a matrix (second-order tensor) isomorphism problem, they express the isomorphism of KGs in the form of third-order tensors to better describe the structural information of KGs [33]. Thus, it might be interesting to study the matching between KGs in the joint entity and relation embedding space.

We also provide some actionable insights:
1) In 1-to-1 constrained scenarios, it is preferable to use the Hungarian algorithm or the Sinkhorn operation to conduct the matching, as they explicitly or implicitly implement the 1-to-1 constraint during execution, take full account of the global matching constraints, and strive to reach a globally optimal matching given the objective of maximizing the sum of pairwise similarity scores. On large-scale datasets, using the Hungarian algorithm would be more time-efficient, as the Sinkhorn operation needs to run for multiple rounds to achieve convergence. Besides, while the Hungarian algorithm depends mainly on the CPU, the Sinkhorn operation relies on the GPU.
2) Given datasets with unmatchable entities, it is suggested to add dummy nodes to make the numbers of entities on both sides equal, and then use the Hungarian algorithm. In this scenario, there is still much room for improvement.
3) Non 1-to-1 alignment is a realistic and frequently observed scenario that has not received much research attention. Among existing algorithms, RInf and CSLS are preferred, since they take into account the global influence on the local matching while not strictly enforcing the 1-to-1 constraint. More practical solutions need to be put forward to effectively address non 1-to-1 alignment.
4) Currently, the most performant embedding matching algorithms are not scalable. Among them, the Hungarian algorithm requires approximately one hour on the DWY100K dataset. Hence, in this case, it might be better to utilize RInf and its variant algorithms, which save 2/3 of the time cost at the expense of a < 10% performance drop compared with the Hungarian algorithm.

VII. CONCLUSION

This paper conducts a comprehensive survey and evaluation of matching algorithms for KGs in entity embedding spaces. We evaluate seven state-of-the-art strategies in terms of effectiveness and efficiency on a wide range of datasets, including two experimental settings that better mirror real-life challenges. We identify the strengths and weaknesses of these algorithms under different settings. We hope the experimental results will be valuable for researchers in putting forward more effective and scalable embedding matching algorithms.

REFERENCES

[1] S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, and Z. G. Ives, “DBpedia: A nucleus for a web of open data,” in Proc. Int. Semantic Web Conf., 2007, pp. 722–735.
[2] M. Berrendorf, E. Faerman, V. Melnychuk, V. Tresp, and T. Seidl, “Knowledge graph entity alignment with graph convolutional networks: Lessons learned,” in Proc. Eur. Conf. IR Res., 2020, pp. 3–11.
[3] K. D. Bollacker, C. Evans, P. Paritosh, T. Sturge, and J. Taylor, “Freebase: A collaboratively created graph database for structuring human knowledge,” in Proc. ACM SIGMOD Int. Conf. Manage. Data, 2008, pp. 1247–1250.
[4] A. Bordes, N. Usunier, A. García-Durán, J. Weston, and O. Yakhnenko, “Translating embeddings for modeling multi-relational data,” in Proc. Int. Conf. Neural Inf. Process. Syst., 2013, pp. 2787–2795.
[5] U. Brunner and K. Stockinger, “Entity matching with transformer architectures - A step forward in data integration,” in Proc. 23rd Int. Conf. Extending Database Technol., Copenhagen, Denmark, Mar. 30 - Apr. 02, 2020, pp. 463–473.
[6] W. Cai, W. Ma, J. Zhan, and Y. Jiang, “Entity alignment with reliable path reasoning and relation-aware heterogeneous graph transformer,” in Proc. 31st Int. Joint Conf. Artif. Intell., Vienna, Austria, Jul. 23–29, 2022, pp. 1930–1937.
[7] Y. Cao, Z. Liu, C. Li, Z. Liu, J. Li, and T. Chua, “Multi-channel graph neural network for entity alignment,” in Proc. Assoc. Comput. Linguistics, 2019, pp. 1452–1461.
[8] M. Chen, Y. Tian, K. Chang, S. Skiena, and C. Zaniolo, “Co-training embeddings of knowledge graphs and entity descriptions for cross-lingual entity alignment,” in Proc. Int. Joint Conf. Artif. Intell., 2018, pp. 3998–4004.
[9] M. Chen, Y. Tian, M. Yang, and C. Zaniolo, “Multilingual knowledge graph embeddings for cross-lingual knowledge alignment,” in Proc. Int. Joint Conf. Artif. Intell., 2017, pp. 1511–1517.
[10] V. Christophides, V. Efthymiou, T. Palpanas, G. Papadakis, and K. Stefanidis, “An overview of end-to-end entity resolution for Big Data,” ACM Comput. Surv., vol. 53, no. 6, pp. 127:1–127:42, 2021.
[11] A. Doan et al., “Magellan: Toward building ecosystems of entity matching solutions,” Commun. ACM, vol. 63, no. 8, pp. 83–91, 2020.
[12] J. Doerner, D. Evans, and A. Shelat, “Secure stable matching at scale,” in Proc. ACM SIGSAC Conf. Comput. Commun. Secur., 2016, pp. 1602–1613.
[13] M. Fey, J. E. Lenssen, C. Morris, J. Masci, and N. M. Kriege, “Deep graph matching consensus,” in Proc. Int. Conf. Learn. Representations, 2020.
[14] D. Gale and L. S. Shapley, “College admissions and the stability of marriage,” The Amer. Math. Monthly, vol. 69, no. 1, pp. 9–15, 1962.
[15] Y. Gao, X. Liu, J. Wu, T. Li, P. Wang, and L. Chen, “ClusterEA: Scalable entity alignment with stochastic training and normalized mini-batch similarities,” in Proc. 28th ACM SIGKDD Conf. Knowl. Discov. Data Mining, Washington, DC, USA, Aug. 14–18, 2022, pp. 421–431.
[16] C. Ge, X. Liu, L. Chen, B. Zheng, and Y. Gao, “LargeEA: Aligning entities for large-scale knowledge graphs,” 2021, arXiv:2108.05211.
[17] C. Ge, X. Liu, L. Chen, B. Zheng, and Y. Gao, “Make it easy: An effective end-to-end entity alignment framework,” in Proc. Int. ACM SIGIR Conf. Res. Develop. Inf. Retrieval, 2021, pp. 777–786.
[18] C. Ge, P. Wang, L. Chen, X. Liu, B. Zheng, and Y. Gao, “CollaborEM: A self-supervised entity matching framework using multi-features collaboration,” IEEE Trans. Knowl. Data Eng., early access, Dec. 13, 2021, doi: 10.1109/TKDE.2021.3134806.
[19] L. Guo, Q. Zhang, Z. Sun, M. Chen, W. Hu, and H. Chen, “Understanding and improving knowledge graph embedding for entity alignment,” in Proc. Int. Conf. Mach. Learn., Baltimore, MD, USA, Jul. 17–23, 2022, pp. 8145–8156.
[20] E. Jiménez-Ruiz and B. C. Grau, “LogMap: Logic-based and scalable ontology matching,” in Proc. Int. Semantic Web Conf., 2011, pp. 273–288.
[21] R. Jonker and A. Volgenant, “A shortest augmenting path algorithm for dense and sparse linear assignment problems,” Computing, vol. 38, no. 4, pp. 325–340, 1987.
[22] A. Kimmig, A. Memory, R. J. Miller, and L. Getoor, “A collective, probabilistic approach to schema mapping,” in Proc. IEEE Int. Conf. Data Eng., 2017, pp. 921–932.
[23] T. N. Kipf and M. Welling, “Semi-supervised classification with graph convolutional networks,” in Proc. Int. Conf. Learn. Representations, 2017.
[24] H. W. Kuhn, “The Hungarian method for the assignment problem,” Nav. Res. Logistics Quart., vol. 2, no. 1/2, pp. 83–97, 1955.
[25] S. Lacoste-Julien, K. Palla, A. Davies, G. Kasneci, T. Graepel, and Z. Ghahramani, “SiGMa: Simple greedy matching for aligning large knowledge bases,” in Proc. ACM SIGKDD Int. Conf. Knowl. Discov. Data Mining, 2013, pp. 572–580.
[26] G. Lample, A. Conneau, M. Ranzato, L. Denoyer, and H. Jégou, “Word translation without parallel data,” in Proc. Int. Conf. Learn. Representations, 2018.
[27] C. Li, Y. Cao, L. Hou, J. Shi, J. Li, and T. Chua, “Semi-supervised entity alignment via joint knowledge embedding model and cross-graph model,” in Proc. Conf. Empir. Methods Natural Lang. Process., 2019, pp. 2723–2732.
[28] J. Li and D. Song, “Uncertainty-aware pseudo label refinery for entity alignment,” in Proc. ACM Web Conf., Virtual Event / Lyon, France, Apr. 25–29, 2022, pp. 829–837.
[29] Y. Li, J. Li, Y. Suhara, A. Doan, and W. Tan, “Deep entity matching with pre-trained language models,” Proc. VLDB Endow., vol. 14, no. 1, pp. 50–60, 2020.
[30] X. Lin, H. Yang, J. Wu, C. Zhou, and B. Wang, “Guiding cross-lingual entity alignment via adversarial knowledge embedding,” in Proc. IEEE Int. Conf. Data Mining, 2019, pp. 429–438.
[31] B. Liu, H. Scells, G. Zuccon, W. Hua, and G. Zhao, “ActiveEA: Active learning for neural entity alignment,” in Proc. Conf. Empir. Methods Natural Lang. Process., 2021, pp. 3364–3374.
[32] X. Liu et al., “SelfKG: Self-supervised entity alignment in knowledge graphs,” in Proc. ACM Web Conf., Virtual Event / Lyon, France, Apr. 25–29, 2022, pp. 860–870.
[33] X. Mao et al., “An effective and efficient entity alignment decoding algorithm via third-order tensor isomorphism,” in Proc. 60th Annu. Meeting Assoc. Comput. Linguistics, Dublin, Ireland, May 22–27, 2022, pp. 5888–5898.
[34] X. Mao, W. Wang, Y. Wu, and M. Lan, “Boosting the speed of entity alignment 10×: Dual attention matching network with normalized hard sample mining,” in Proc. Int. World Wide Web Conf., 2021, pp. 821–832.
[35] X. Mao, W. Wang, Y. Wu, and M. Lan, “From alignment to assignment: Frustratingly simple unsupervised entity alignment,” in Proc. Conf. Empir. Methods Natural Lang. Process., 2021, pp. 2843–2853.
[36] X. Mao, W. Wang, H. Xu, Y. Wu, and M. Lan, “Relational reflection entity alignment,” in Proc. Conf. Inf. Knowl. Manage., 2020, pp. 1095–1104.
[37] G. E. Mena, D. Belanger, S. W. Linderman, and J. Snoek, “Learning latent permutations with Gumbel-Sinkhorn networks,” in Proc. Int. Conf. Learn. Representations, 2018.
[38] V. Mnih et al., “Asynchronous methods for deep reinforcement learning,” in Proc. Int. Conf. Mach. Learn., 2016, pp. 1928–1937.
[39] S. Mudgal et al., “Deep learning for entity matching: A design space exploration,” in Proc. Int. Conf. Manage. Data, Houston, TX, USA, Jun. 10–15, 2018, pp. 19–34.
[40] T. T. Nguyen et al., “Entity alignment for knowledge graphs with multi-order convolutional networks,” IEEE Trans. Knowl. Data Eng., vol. 34, no. 9, pp. 4201–4214, Sep. 2022.
[41] G. Papadakis, D. Skoutas, E. Thanos, and T. Palpanas, “Blocking and filtering techniques for entity resolution: A survey,” ACM Comput. Surv., vol. 53, no. 2, pp. 31:1–31:42, 2020.
[42] H. Paulheim, “Knowledge graph refinement: A survey of approaches and evaluation methods,” Semantic Web, vol. 8, no. 3, pp. 489–508, 2017.
[43] S. Pei, L. Yu, and X. Zhang, “Improving cross-lingual entity alignment via optimal transport,” in Proc. Int. Joint Conf. Artif. Intell., 2019, pp. 3231–3237.
[44] L. A. S. Pizzato, T. Rej, T. Chung, I. Koprinska, and J. Kay, “RECON: A reciprocal recommender for online dating,” in Proc. ACM Conf. Recommender Syst., 2010, pp. 207–214.
[45] J. Pujara, H. Miao, L. Getoor, and W. W. Cohen, “Large-scale knowledge graph identification using PSL,” in Proc. Conf. Assoc. Advance. Artif. Intell., 2013.
[46] A. E. Roth, “Deferred acceptance algorithms: History, theory, practice, and open questions,” Int. J. Game Theory, vol. 36, no. 3/4, pp. 537–569, 2008.
[47] P. Shvaiko and J. Euzenat, “Ontology matching: State of the art and future challenges,” IEEE Trans. Knowl. Data Eng., vol. 25, no. 1, pp. 158–176, Jan. 2013.
[48] F. M. Suchanek, S. Abiteboul, and P. Senellart, “PARIS: Probabilistic alignment of relations, instances, and schema,” Proc. VLDB Endow., vol. 5, no. 3, pp. 157–168, 2011.
[49] F. M. Suchanek, G. Kasneci, and G. Weikum, “Yago: A core of semantic knowledge,” in Proc. Int. World Wide Web Conf., 2007, pp. 697–706.
[50] Z. Sun et al., “A benchmarking study of embedding-based entity alignment for knowledge graphs,” Proc. VLDB Endow., vol. 13, no. 11, pp. 2326–2340, 2020.
[51] A. Surisetty et al., “RePS: Relation, position and structure aware entity alignment,” in Proc. Web Conf., Virtual Event / Lyon, France, Apr. 25–29, 2022, pp. 1083–1091.
[52] B. D. Trisedya, J. Qi, and R. Zhang, “Entity alignment between knowledge graphs using attribute embeddings,” in Proc. Conf. Assoc. Advance. Artif. Intell., 2019, pp. 297–304.
[53] D. Vrandecic and M. Krötzsch, “Wikidata: A free collaborative knowledgebase,” Commun. ACM, vol. 57, no. 10, pp. 78–85, 2014.
[54] Z. Wang, Q. Lv, X. Lan, and Y. Zhang, “Cross-lingual knowledge graph alignment via graph convolutional networks,” in Proc. Conf. Empir. Methods Natural Lang. Process., 2018, pp. 349–357.
[55] Y. Wu, X. Liu, Y. Feng, Z. Wang, R. Yan, and D. Zhao, “Relation-aware entity alignment for heterogeneous knowledge graphs,” in Proc. Int. Joint Conf. Artif. Intell., 2019, pp. 5278–5284.
[56] K. Xin, Z. Sun, W. Hua, W. Hu, and X. Zhou, “Informed multi-context entity alignment,” in Proc. ACM Int. Conf. Web Search Data Mining, Tempe, AZ, USA, Feb. 21–25, 2022, pp. 1197–1205.
[57] K. Xu, L. Song, Y. Feng, Y. Song, and D. Yu, “Coordinated reasoning for cross-lingual knowledge graph alignment,” in Proc. Conf. Assoc. Advance. Artif. Intell., 2020, pp. 9354–9361.
[58] H. Yang, Y. Zou, P. Shi, W. Lu, J. Lin, and X. Sun, “Aligning cross-lingual entities with multi-aspect information,” in Proc. Conf. Empir. Methods Natural Lang. Process., 2019, pp. 4430–4440.
[59] J. Yang et al., “Entity and relation matching consensus for entity alignment,” in Proc. Conf. Inf. Knowl. Manage., 2021, pp. 2331–2341.
[60] K. Zeng et al., “Interactive contrastive learning for self-supervised entity alignment,” in Proc. 31st ACM Int. Conf. Inf. Knowl. Manage., Atlanta, GA, USA, Oct. 17–21, 2022, pp. 2465–2475.
[61] K. Zeng, C. Li, L. Hou, J. Li, and L. Feng, “A comprehensive survey of entity alignment for knowledge graphs,” AI Open, vol. 2, pp. 1–13, 2021.
[62] W. Zeng, X. Zhao, X. Li, J. Tang, and W. Wang, “On entity alignment at scale,” VLDB J., vol. 31, pp. 1009–1033, 2021.
[63] W. Zeng, X. Zhao, J. Tang, X. Li, M. Luo, and Q. Zheng, “Towards entity alignment in the open world: An unsupervised approach,” in Proc. 26th Int. Conf. Database Syst. Adv. Appl., 2021, pp. 272–289.
[64] W. Zeng, X. Zhao, J. Tang, and X. Lin, “Collective entity alignment via adaptive features,” in Proc. IEEE Int. Conf. Data Eng., 2020, pp. 1870–1873.
[65] W. Zeng, X. Zhao, J. Tang, X. Lin, and P. Groth, “Reinforcement learning-based collective entity alignment with adaptive features,” ACM Trans. Inf. Syst., vol. 39, no. 3, pp. 26:1–26:31, 2021.
[66] R. Zhang, B. D. Trisedya, M. Li, Y. Jiang, and J. Qi, “A benchmark and comprehensive survey on knowledge graph entity alignment via representation learning,” VLDB J., vol. 31, no. 5, pp. 1143–1168, 2022.
[67] Z. Zhang et al., “An industry evaluation of embedding-based entity alignment,” in Proc. 28th Int. Conf. Comput. Linguistics, 2020, pp. 179–189.
[68] X. Zhao, W. Zeng, J. Tang, W. Wang, and F. Suchanek, “An experimental study of state-of-the-art entity alignment approaches,” IEEE Trans. Knowl. Data Eng., vol. 34, no. 6, pp. 2610–2625, Jun. 2022.
[69] R. Zhu, M. Ma, and P. Wang, “RAGA: Relation-aware graph attention networks for global entity alignment,” in Proc. Pacific-Asia Conf. Adv. Knowl. Discov. Data Mining, 2021, pp. 501–513.
[70] Y. Zhu, H. Liu, Z. Wu, and Y. Du, “Relation-aware neighborhood matching model for entity alignment,” in Proc. Conf. Assoc. Advance. Artif. Intell., 2021, pp. 4749–4756.

Weixin Zeng received the PhD degree from the National University of Defense Technology (NUDT), China, in 2022. He is a lecturer with NUDT, and his research mainly focuses on knowledge graphs and data mining.

Xiang Zhao received the PhD degree from The University of New South Wales, Australia, in 2013. He is currently a professor with the National University of Defense Technology, China. His research interests include graph data management and mining, with a special focus on knowledge graphs.

Zhen Tan received the PhD degree from the National University of Defense Technology (NUDT), China, in 2018. He is currently an associate professor with NUDT. His research interests include knowledge graphs and advanced data analytics.

Jiuyang Tang received the PhD degree from the National University of Defense Technology (NUDT), China, in 2004. He is currently a professor with NUDT. His research interests include knowledge graphs and advanced data analytics.

Xueqi Cheng (Senior Member, IEEE) is currently a professor with the Institute of Computing Technology, Chinese Academy of Sciences. His research interests include network science, web search and data mining, Big Data processing, and distributed computing architecture.