X-GOAL: Multiplex Heterogeneous Graph Prototypical Contrastive Learning

Baoyu Jing (University of Illinois at Urbana-Champaign), Shengyu Feng (Language Technology Institute, Carnegie Mellon University), Yuejia Xiang (Platform and Content Group, Tencent), Xi Chen (Platform and Content Group, Tencent), Yu Chen (Platform and Content Group, Tencent), Hanghang Tong (University of Illinois at Urbana-Champaign)

arXiv:2109.03560v5 [cs.LG] 18 Oct 2022
ABSTRACT

Graphs are powerful representations for relations among objects, which have attracted plenty of attention in both academia and industry. A fundamental challenge for graph learning is how to train an effective Graph Neural Network (GNN) encoder without labels, which are expensive and time-consuming to obtain. Contrastive Learning (CL) is one of the most popular paradigms to address this challenge, which trains GNNs by discriminating positive and negative node pairs. Despite the success of recent CL methods, there are still two under-explored problems. First, how to reduce the semantic error introduced by random topology-based data augmentations: traditional CL defines positive and negative node pairs via node-level topological proximity, which is based solely on the graph topology and disregards the semantic information of node attributes, so some semantically similar nodes could be wrongly treated as negative pairs. Second, how to effectively model the multiplexity of real-world graphs, where nodes are connected by various relations and each relation could form a homogeneous graph layer. To solve these problems, we propose a novel multiplex heterogeneous graph prototypical contrastive learning (X-GOAL) framework to extract node embeddings. X-GOAL is comprised of two components: the GOAL framework, which learns node embeddings for each homogeneous graph layer, and an alignment regularization, which jointly models different layers by aligning the layer-specific node embeddings. Specifically, the GOAL framework captures the node-level information by a succinct graph transformation technique, and captures the cluster-level information by pulling nodes within the same semantic cluster closer in the embedding space. The alignment regularization aligns embeddings across layers at both the node level and the cluster level. We evaluate the proposed X-GOAL on a variety of real-world datasets and downstream tasks to demonstrate its effectiveness.

CCS CONCEPTS

• Information systems → Data mining; • Computing methodologies → Unsupervised learning; • Networks.

KEYWORDS

Prototypical Contrastive Learning, Multiplex Heterogeneous Graphs

ACM Reference Format:
Baoyu Jing, Shengyu Feng, Yuejia Xiang, Xi Chen, Yu Chen, and Hanghang Tong. 2022. X-GOAL: Multiplex Heterogeneous Graph Prototypical Contrastive Learning. In Proceedings of the 31st ACM International Conference on Information and Knowledge Management (CIKM '22), October 17–21, 2022, Atlanta, GA, USA. ACM, New York, NY, USA, 11 pages. https://doi.org/10.1145/3511808.3557490

1 INTRODUCTION

Graphs are powerful representation formalisms and have been widely used to model relations among various objects [13, 23, 50, 65, 66, 74, 75], such as the citation relation and the same-author relation among papers. One of the primary challenges for graph representation learning is how to effectively encode nodes into informative embeddings such that they can be easily used in downstream tasks for extracting useful knowledge [13]. Traditional methods, such as the Graph Convolutional Network (GCN) [23], leverage human labels to train the graph encoders. However, human labeling is usually time-consuming and expensive, and labels might be unavailable in practice [6, 29, 60, 72, 73]. Self-supervised learning [29, 60], which aims to train graph encoders without external labels, has thus attracted plenty of attention in both academia and industry.

One of the predominant self-supervised learning paradigms in recent years is Contrastive Learning (CL), which aims to learn an effective Graph Neural Network (GNN) encoder such that positive node pairs are pulled together and negative node pairs are pushed apart in the embedding space [60]. Early methods, such as DeepWalk [42] and node2vec [12], sample positive node pairs based on their local proximity in graphs. Recent methods rely on graph transformation or augmentation [60] to generate positive and negative pairs, such as random permutation [16, 18, 53], structure-based augmentation [14, 67], sampling-based augmentation [17, 45], as well as adaptive augmentation [76].
Albeit the success of these methods, they define positive and negative node pairs based upon the node-level information (or local topological proximity), but have not fully explored the cluster-level (or semantic cluster/prototype) information. For example, in an academic graph, two papers about different sub-areas of graph learning (e.g., social network analysis and drug discovery) might not be topologically close to each other since they do not have a direct citation relation or same-author relation. Without considering their semantic information, such as keywords and topics, these two papers could be treated as a negative pair by most of the existing methods. Such a practice will inevitably induce semantic errors in the node embeddings, which will have a negative impact on the performance of machine learning models on downstream tasks such as classification and clustering. To address this problem, inspired by [27], we introduce a graph prototypical contrastive learning (GOAL) framework to simultaneously capture both node-level and cluster-level information. At the node level, GOAL trains an encoder by distinguishing positive and negative node pairs, which are sampled by a succinct graph transformation technique. At the cluster level, GOAL employs a clustering algorithm to obtain the semantic clusters/prototypes, and it pulls nodes within the same cluster closer to each other in the embedding space.

Furthermore, most of the aforementioned methods ignore the multiplexity [18, 39] of real-world graphs, where nodes are connected by multiple types of relations and each relation forms a layer of the multiplex heterogeneous graph. For example, in an academic graph, papers are connected via the same authors or the citation relation; in an entertainment graph, movies are linked through shared directors or actors/actresses; in a product graph, items have relations such as also-bought and also-viewed. Different layers could convey different and complementary information, so jointly considering them could produce more informative embeddings than treating the layers separately and then applying average pooling over them to obtain the final embeddings [18, 39]. Most prior deep learning methods use attention mechanisms [4, 18, 30, 31, 38, 58] to combine embeddings from different layers. However, attention modules usually require extra tasks or loss functions to train, such as node classification [58] and consensus loss [38]. Besides, some attention modules are complex and require a significant amount of extra effort to design and tune, such as hierarchical structures [58] and complex within-layer and cross-layer interactions [31]. Different from the prior methods, we propose an alternative nimble alignment regularization to jointly model and propagate information across different layers by aligning the layer-specific embeddings without extra neural network modules, and the final node embeddings are obtained by simply average pooling over these layer-specific embeddings. The key assumption of the alignment regularization is that layer-specific embeddings of the same node should be close to each other in the embedding space and should also be semantically similar. We also theoretically prove that the proposed alignment regularization effectively maximizes the mutual information across layers.

We comprehensively evaluate X-GOAL on a variety of real-world attributed multiplex heterogeneous graphs. The experimental results show that the embeddings learned by GOAL and X-GOAL outperform state-of-the-art methods for homogeneous graphs and multiplex heterogeneous graphs on various downstream tasks. The main contributions are summarized as follows:

• Method. We propose a novel X-GOAL framework to learn node embeddings for multiplex heterogeneous graphs, which is comprised of a GOAL framework for each single layer and an alignment regularization to propagate information across different layers. GOAL reduces semantic errors, and the alignment regularization is nimbler than attention modules for combining layer-specific node embeddings.
• Theoretical Analysis. We theoretically prove that the proposed alignment regularization can effectively maximize the mutual information across layers.
• Empirical Evaluation. We comprehensively evaluate the proposed methods on various real-world datasets and downstream tasks. The experimental results show that GOAL and X-GOAL outperform the state-of-the-art methods for homogeneous and multiplex heterogeneous graphs, respectively.

2 PRELIMINARY

Definition 2.1 (Attributed Multiplex Heterogeneous Graph). An attributed multiplex heterogeneous graph with V layers and N nodes is denoted as G^M = {G^v}_{v=1}^V, where G^v = (A^v, X) is the v-th homogeneous graph layer, A^v ∈ R^{N×N} and X ∈ R^{N×d_x} are the adjacency matrix and the attribute matrix, and d_x is the dimension of the attributes. An illustration is shown in Figure 1.

Figure 1: Illustration of the multiplex heterogeneous graph G^M, which can be decomposed into homogeneous graph layers G^1 and G^2 according to the types of relations. Different colors represent different relations.

Problem Statement. The task is to learn an encoder E for G^M, which maps the node attribute matrix X ∈ R^{N×d_x} to the node embedding matrix H^M ∈ R^{N×d} without external labels, where N is the number of nodes and d_x and d are the dimension sizes.
3 METHODOLOGY

We present the X-GOAL framework for multiplex heterogeneous graphs G^M, which is comprised of a GOAL framework and an alignment regularization. In Section 3.1, we present the GOAL framework, which simultaneously captures the node-level and the cluster-level information for each layer G = (A, X) of G^M. In Section 3.2, we introduce a novel alignment regularization to align node embeddings across layers at both the node and the cluster level. In Section 3.3, we provide a theoretical analysis of the alignment regularization.

3.1 The GOAL Framework

Node-level, graph topology based transformation techniques might introduce semantic errors, since they ignore the hidden semantics and will inevitably pair two semantically similar but topologically far nodes as a negative pair. To solve this issue, we introduce a GOAL framework for each homogeneous graph layer G = (A, X) to capture both node-level and cluster-level information (for clarity, we drop the superscript v of G^v, A^v and H^v in this subsection). An illustration of GOAL is shown in Figure 2. Given a homogeneous graph G and an encoder E, GOAL alternately performs semantic clustering and parameter updating. In the semantic clustering step, a clustering algorithm C is applied over the embeddings H to obtain the hidden semantic clusters. In the parameter updating step, GOAL updates the parameters of E with the loss L given in Equation (4), which pulls topologically similar nodes closer via the node-level loss and pulls nodes within the same semantic cluster closer via the cluster-level loss.

Figure 2: Illustration of GOAL. E and C are the encoder and the clustering algorithm. G is a homogeneous graph layer and H is the embedding matrix. L is given in Equation (4). The circles and diamonds denote nodes and cluster centers. Blue and orange denote different hidden semantics. The green line is the cluster boundary. "Back Prop." means back propagation. The node-level topology based negative sampling treats the semantically similar nodes 0 and 2 as a negative pair. The cluster-level loss reduces this semantic error by pulling nodes 0 and 2 closer to their cluster center.

A - Node-Level Loss. To capture the node-level information, we propose a graph transformation technique T = {T+, T−}, where T+ and T− denote the positive and negative transformations, along with a contrastive loss similar to InfoNCE [36].

Given an original homogeneous graph G = (A, X), the positive transformation T+ applies the dropout operation [48] over A and X with a pre-defined probability p_drop ∈ (0, 1). We choose the dropout operation rather than the masking operation since dropout re-scales the outputs by 1/(1 − p_drop) during training, which improves the training results. The negative transformation T− is a random shuffle of the rows of X [53]. The transformed positive and negative graphs are denoted by G+ = T+(G) and G− = T−(G), respectively. The node embedding matrices of G, G+ and G− are thus H = E(G), H+ = E(G+) and H− = E(G−). We define the node-level contrastive loss as:

$$\mathcal{L}_N = -\frac{1}{N}\sum_{n=1}^{N} \log \frac{e^{\cos(\mathbf{h}_n, \mathbf{h}_n^+)}}{e^{\cos(\mathbf{h}_n, \mathbf{h}_n^+)} + e^{\cos(\mathbf{h}_n, \mathbf{h}_n^-)}} \qquad (1)$$

where cos(·,·) denotes the cosine similarity, and h_n, h_n^+ and h_n^- are the n-th rows of H, H+ and H−.
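To make the transformation T = {T+, T−} and Equation (1) concrete, below is a minimal PyTorch sketch under stated assumptions: dense tensors for A and X, F.dropout as the dropout operator (which performs the 1/(1 − p_drop) re-scaling mentioned above), and a generic encoder(A, X) interface. The function names are ours, not the paper's.

```python
import torch
import torch.nn.functional as F

def positive_transform(adj, x, p_drop=0.5):
    # T+: dropout over both the adjacency and the attribute matrix; F.dropout
    # re-scales surviving entries by 1 / (1 - p_drop) during training.
    return F.dropout(adj, p=p_drop), F.dropout(x, p=p_drop)

def negative_transform(adj, x):
    # T-: randomly shuffle the rows of X so every node keeps its topology
    # but is paired with another node's attributes.
    return adj, x[torch.randperm(x.shape[0])]

def node_level_loss(h, h_pos, h_neg):
    # Equation (1): per-node two-way softmax over cosine similarities with the
    # positive embedding h_n^+ and the negative embedding h_n^-.
    pos = torch.exp(F.cosine_similarity(h, h_pos, dim=-1))
    neg = torch.exp(F.cosine_similarity(h, h_neg, dim=-1))
    return -torch.log(pos / (pos + neg)).mean()

# Usage sketch: h = encoder(adj, x); h_pos = encoder(*positive_transform(adj, x));
# h_neg = encoder(*negative_transform(adj, x)); loss_n = node_level_loss(h, h_pos, h_neg)
```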
B - Cluster-Level Loss. We use a clustering algorithm C to obtain the semantic clusters of nodes {c_k}_{k=1}^K, where c_k ∈ R^d is a cluster center, and K and d are the number of clusters and the dimension of the embedding space. We capture the cluster-level semantic information to reduce the semantic errors by pulling nodes within the same cluster closer to their assigned cluster center. For clarity, the derivations of the cluster-level loss are provided in the Appendix. We define the probability that h_n belongs to cluster k by:

$$p(k \mid \mathbf{h}_n) = \frac{e^{\mathbf{c}_k^T \mathbf{h}_n / \tau}}{\sum_{k'=1}^{K} e^{\mathbf{c}_{k'}^T \mathbf{h}_n / \tau}} \qquad (2)$$

where τ > 0 is the temperature parameter that re-scales the values. The cluster-level loss is defined as the negative log-likelihood of the assigned cluster k_n for h_n:

$$\mathcal{L}_C = -\frac{1}{N}\sum_{n=1}^{N} \log \frac{e^{\mathbf{c}_{k_n}^T \mathbf{h}_n / \tau}}{\sum_{k=1}^{K} e^{\mathbf{c}_k^T \mathbf{h}_n / \tau}} \qquad (3)$$

where k_n ∈ [1, . . . , K] is the cluster index assigned to the n-th node.
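A possible PyTorch implementation of Equations (2)-(3) is sketched below. The paper specifies K-means as the clustering algorithm C (Section 4.1) but not the library or the temperature value, so the scikit-learn K-means call and tau=0.5 here are assumptions.

```python
import torch
import torch.nn.functional as F
from sklearn.cluster import KMeans

def semantic_clustering(h, num_clusters):
    # Semantic clustering step: run K-means over the current embeddings H to get
    # the cluster centers {c_k} and the per-node cluster assignments k_n.
    km = KMeans(n_clusters=num_clusters, n_init=10).fit(h.detach().cpu().numpy())
    centers = torch.as_tensor(km.cluster_centers_, dtype=h.dtype, device=h.device)
    assignments = torch.as_tensor(km.labels_, dtype=torch.long, device=h.device)
    return centers, assignments

def cluster_level_loss(h, centers, assignments, tau=0.5):
    # Equations (2)-(3): softmax over (c_k^T h_n / tau) followed by the negative
    # log-likelihood of each node's assigned cluster.
    logits = h @ centers.t() / tau        # shape (N, K)
    return F.cross_entropy(logits, assignments)
```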
C - Overall Loss. Combining the node-level loss in Equation (1) and the cluster-level loss in Equation (3), we have:

$$\mathcal{L} = \lambda_N \mathcal{L}_N + \lambda_C \mathcal{L}_C \qquad (4)$$

where λ_N and λ_C are tunable hyper-parameters.
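Putting the pieces together, the sketch below shows how GOAL's alternation between semantic clustering and parameter updating could look for a single layer, reusing the helper functions sketched above and an encoder with signature encoder(A, X). The epoch count and the loss weights are placeholders, the clustering interval of 5 epochs and the learning rate follow Section 4.1, and the warm-up stage described there is omitted for brevity.

```python
import torch

def train_goal(encoder, adj, x, num_clusters, epochs=500, cluster_every=5,
               lam_n=1.0, lam_c=1.0, lr=1e-3, p_drop=0.5, tau=0.5):
    # Alternate the semantic clustering step (every `cluster_every` epochs) with
    # parameter updates on the combined loss of Equation (4).
    opt = torch.optim.Adam(encoder.parameters(), lr=lr)
    centers = assignments = None
    for epoch in range(epochs):
        if epoch % cluster_every == 0:
            with torch.no_grad():
                centers, assignments = semantic_clustering(encoder(adj, x), num_clusters)
        h = encoder(adj, x)
        h_pos = encoder(*positive_transform(adj, x, p_drop))
        h_neg = encoder(*negative_transform(adj, x))
        loss = lam_n * node_level_loss(h, h_pos, h_neg) \
             + lam_c * cluster_level_loss(h, centers, assignments, tau)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return encoder
```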
3.2 Alignment Regularization

Real-world graphs are often multiplex in nature and can be decomposed into multiple homogeneous graph layers G^M = {G^v}_{v=1}^V. The simplest way to extract the embedding of a node x_n in G^M is to separately extract the embeddings {h_n^v}_{v=1}^V from the different layers and then combine them via average pooling. However, it has been empirically shown that jointly modeling the layers usually produces better embeddings for downstream tasks [18]. Most prior studies use attention modules to jointly learn embeddings from different layers, which are clumsy as they usually require extra effort to design and train [18, 30, 39, 58]. Alternatively, we propose a nimble alignment regularization that jointly learns embeddings by aligning the layer-specific {h_n^v}_{v=1}^V without introducing extra neural network modules; the final node embedding of x_n is obtained by simply averaging the layer-specific embeddings, h_n^M = (1/V) Σ_{v=1}^V h_n^v. The underlying assumption of the alignment is that h_n^v should be close to, and reflect the semantics of, {h_n^{v'}}_{v'≠v}. The proposed alignment regularization is comprised of both node-level and cluster-level alignments.

Given G^M = {G^v}_{v=1}^V with encoders {E^v}_{v=1}^V, we first apply GOAL to each layer G^v and obtain the original and negative node embeddings {H^v}_{v=1}^V and {H^{v-}}_{v=1}^V, as well as the cluster centers {C^v}_{v=1}^V, where C^v ∈ R^{K^v × d} is the concatenation of the cluster centers for the v-th layer and K^v is the number of clusters for the v-th layer. The node-level alignment is applied over {H^v}_{v=1}^V and {H^{v-}}_{v=1}^V. The cluster-level alignment is applied to {C^v}_{v=1}^V and {H^v}_{v=1}^V.

Figure 3: Cluster-level alignment. x_n is the node attribute. h_n^v and h_n^{v'} are the layer-specific embeddings. C^v is the anchor cluster center matrix. p_n^v and q_n^v are the anchor and recovered semantic distributions. R_C^v is given in Equation (6).

A - Node-Level Alignment. For a node x_n, its embedding h_n^v should be close to the embeddings {h_n^{v'}}_{v'≠v} and far away from the negative embedding h_n^{v-}. Analogous to Equation (1), we define the node-level alignment regularization as:

$$\mathcal{R}_N = -\frac{1}{Z}\sum_{n=1}^{N}\sum_{v=1}^{V}\sum_{v' \neq v} \log \frac{e^{\cos(\mathbf{h}_n^v, \mathbf{h}_n^{v'})}}{e^{\cos(\mathbf{h}_n^v, \mathbf{h}_n^{v'})} + e^{\cos(\mathbf{h}_n^v, \mathbf{h}_n^{v-})}} \qquad (5)$$

where Z = NV(V − 1) is the normalization factor.
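A minimal sketch of Equation (5) in PyTorch is given below, assuming layer_embs and layer_neg_embs are lists holding H^v and H^{v-} for the V layers; the function name is ours.

```python
import torch
import torch.nn.functional as F

def node_level_alignment(layer_embs, layer_neg_embs):
    # Equation (5): for every ordered pair of distinct layers (v, v'), treat
    # (h_n^v, h_n^{v'}) as a positive pair and (h_n^v, h_n^{v-}) as the negative pair.
    V = len(layer_embs)
    total, terms = 0.0, 0
    for v in range(V):
        neg = torch.exp(F.cosine_similarity(layer_embs[v], layer_neg_embs[v], dim=-1))
        for v2 in range(V):
            if v2 == v:
                continue
            pos = torch.exp(F.cosine_similarity(layer_embs[v], layer_embs[v2], dim=-1))
            total = total - torch.log(pos / (pos + neg)).sum()
            terms += layer_embs[v].shape[0]
    return total / terms    # terms = N * V * (V - 1), i.e. the factor Z
```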
B - Cluster-Level Alignment. Similar to the node-level loss in Equation (1), the node-level alignment in Equation (5) could also introduce semantic errors, since h_n^{v-} might be topologically far from, but semantically similar to, h_n^v. To reduce this semantic error, we also align the layer-specific embeddings {h_n^v}_{v=1}^V at the cluster level.

Let the v-th layer be the anchor layer and its semantic cluster centers C^v ∈ R^{K^v × d} be the anchor cluster centers. For a node x_n, we call its layer-specific embedding h_n^v the anchor embedding, and its semantic distribution p_n^v ∈ R^{K^v} the anchor semantics, which is obtained via Equation (2) based on h_n^v and C^v. The key idea of the cluster-level alignment is to recover the anchor semantics p_n^v from the embeddings {h_n^{v'}}_{v'≠v} of the other layers based on C^v.

This idea can be justified from two perspectives. First, {h_n^v}_{v=1}^V reflect the information of x_n from different aspects; if we can recover the anchor semantics p_n^v from the embedding h_n^{v'} of another layer v' ≠ v, this indicates that h_n^v and h_n^{v'} share hidden semantics to a certain degree. Second, it is impractical to directly align p_n^v and p_n^{v'}, since their dimensions might differ (K^v ≠ K^{v'}), and even if K^v = K^{v'}, the cluster center matrices C^v and C^{v'} are distributed at different positions in the embedding space.

An illustration of the cluster-level alignment is presented in Figure 3. Given a node x_n, on the anchor layer v we have the anchor cluster centers C^v, the anchor embedding h_n^v, and the anchor semantic distribution p_n^v. Next, we use the embedding h_n^{v'} from a layer v' ≠ v to obtain the recovered semantic distribution q_n^v based on C^v via Equation (2). Then we align the semantics of h_n^v and h_n^{v'} by minimizing the KL-divergence of p_n^v and q_n^v:

$$\mathcal{R}_C^v = \frac{1}{N(V-1)}\sum_{n=1}^{N}\sum_{v' \neq v} KL(\mathbf{p}_n^v \,\|\, \mathbf{q}_n^v) \qquad (6)$$

where p_n^v is treated as the ground-truth and gradients are not allowed to pass through p_n^v during training.

Finally, we alternately use all V layers as anchor layers and use the averaged KL-divergence as the final semantic regularization:

$$\mathcal{R}_C = \frac{1}{V}\sum_{v=1}^{V} \mathcal{R}_C^v \qquad (7)$$

C - Overall Loss. Combining the node-level and cluster-level regularization losses, we have:

$$\mathcal{R} = \mu_N \mathcal{R}_N + \mu_C \mathcal{R}_C \qquad (8)$$

where μ_N and μ_C are tunable hyper-parameters.

The final training objective of the X-GOAL framework is the combination of the contrastive loss L in Equation (4) and the alignment regularization R in Equation (8):

$$\mathcal{L}_X = \sum_{v=1}^{V} \mathcal{L}^v + \mathcal{R} \qquad (9)$$

where L^v is the loss of layer v.
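The cluster-level alignment of Equations (6)-(7) could be sketched in PyTorch as follows, with the stop-gradient on p_n^v implemented via .detach(); layer_embs and layer_centers are assumed to hold {H^v} and {C^v}, and tau is the same assumed temperature as before. In the full objective of Equations (8)-(9), this term would be weighted by μ_C and added to μ_N·R_N and the per-layer GOAL losses.

```python
import torch
import torch.nn.functional as F

def cluster_level_alignment(layer_embs, layer_centers, tau=0.5):
    # Equations (6)-(7): for each anchor layer v, compute the anchor semantics
    # p_n^v from (h_n^v, C^v) with gradients blocked, recover q_n^v from every
    # other layer's embedding using the same centers C^v, and average the
    # KL divergences KL(p_n^v || q_n^v) over the other layers and anchor choices.
    V = len(layer_embs)
    reg = 0.0
    for v in range(V):
        p = F.softmax(layer_embs[v] @ layer_centers[v].t() / tau, dim=-1).detach()
        r_v = 0.0
        for v2 in range(V):
            if v2 == v:
                continue
            q_log = F.log_softmax(layer_embs[v2] @ layer_centers[v].t() / tau, dim=-1)
            r_v = r_v + F.kl_div(q_log, p, reduction="batchmean")
        reg = reg + r_v / (V - 1)
    return reg / V
```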
3.3 Theoretical Analysis

We provide a theoretical analysis of the proposed alignment regularizations. In Theorem 3.1, we prove that the node-level alignment maximizes the mutual information between the embeddings H^v ∈ {h_n^v}_{n=1}^N of the anchor layer v and the embeddings H^{v'} ∈ {h_n^{v'}}_{n=1}^N of another layer v'. In Theorem 3.2, we prove that the cluster-level alignment maximizes the mutual information between the semantic cluster assignments C^v ∈ [1, · · · , K^v] for the embeddings {h_n^v}_{n=1}^N of the anchor layer v and the embeddings H^{v'} ∈ {h_n^{v'}}_{n=1}^N of the layer v'.

Theorem 3.1 (Maximization of MI of Embeddings from Different Layers). Let H^v ∈ {h_n^v}_{n=1}^N and H^{v'} ∈ {h_n^{v'}}_{n=1}^N be the random variables for the node embeddings of the v-th and v'-th layers; then the node-level alignment maximizes I(H^v; H^{v'}).

Proof. According to [36, 43], the following inequality holds:

$$I(X;Y) \geq \mathbb{E}\left[\frac{1}{K_1}\sum_{i=1}^{K_1} \log \frac{e^{f(x_i, y_i)}}{\frac{1}{K_2}\sum_{j=1}^{K_2} e^{f(x_i, y_j)}}\right] \qquad (10)$$

Let K_1 = 1, K_2 = 2, f(·) = cos(·), x_1 = h_n^v, y_1 = h_n^{v'} and y_2 = h_n^{v-}; then:

$$I(H^v; H^{v'}) \geq \mathbb{E}\left[\log \frac{e^{\cos(\mathbf{h}_n^v, \mathbf{h}_n^{v'})}}{e^{\cos(\mathbf{h}_n^v, \mathbf{h}_n^{v'})} + e^{\cos(\mathbf{h}_n^v, \mathbf{h}_n^{v-})}}\right] \qquad (11)$$

The expectation E is taken over all the N nodes and all the pairs of V layers, and thus we have:

$$I(H^v; H^{v'}) \geq \frac{1}{Z}\sum_{n=1}^{N}\sum_{v=1}^{V}\sum_{v' \neq v} \log \frac{e^{\cos(\mathbf{h}_n^v, \mathbf{h}_n^{v'})}}{e^{\cos(\mathbf{h}_n^v, \mathbf{h}_n^{v'})} + e^{\cos(\mathbf{h}_n^v, \mathbf{h}_n^{v-})}} \qquad (12)$$

where Z = NV(V − 1) is the normalization factor, and the right side is the negative of R_N in Equation (5); hence minimizing R_N maximizes this lower bound on I(H^v; H^{v'}). □

Theorem 3.2 (Maximization of MI between Embeddings and Semantic Cluster Assignments). Let C^v ∈ [1, · · · , K^v] be the random variable for the cluster assignments of {h_n^v}_{n=1}^N of the anchor layer v, and H^{v'} ∈ {h_n^{v'}}_{n=1}^N be the random variable for the node embeddings of the v'-th layer; then the cluster-level alignment maximizes the mutual information of C^v and H^{v'}: I(C^v; H^{v'}).
Proof. In the cluster-level alignment, the anchor distribution p_n^v is regarded as the ground-truth for the n-th node, and q_n^v = f(h_n^{v'}) is the recovered distribution from the v'-th layer, where f(·) is a K^v-dimensional function defined by Equation (2). Specifically,

$$f(\mathbf{h}_n^{v'})[k] = p(k \mid \mathbf{h}_n^{v'}) = \frac{e^{\mathbf{c}_k^T \mathbf{h}_n^{v'} / \tau}}{\sum_{k'=1}^{K^v} e^{\mathbf{c}_{k'}^T \mathbf{h}_n^{v'} / \tau}} \qquad (13)$$

where {c_k}_{k=1}^{K^v} is the set of cluster centers for the v-th layer.

Since p_n^v is the ground-truth, its entropy H(p_n^v) is a constant. As a result, the KL divergence in Equation (6) is equivalent to the cross-entropy H(p_n^v, q_n^v) = KL(p_n^v || q_n^v) + H(p_n^v). Therefore, minimizing the KL-divergence will minimize H(p_n^v, q_n^v).

On the other hand, according to [33, 44], we have the following variational lower bound for I(C^v; H^{v'}):

$$I(C^v; H^{v'}) \geq \mathbb{E}\left[\log \frac{e^{g(\mathbf{h}_n^{v'}, k)}}{\sum_{k'=1}^{K^v} e^{g(\mathbf{h}_n^{v'}, k')}}\right] \qquad (14)$$

where g(·) is any function of h_n^{v'} and k. In our case, we let

$$g(\mathbf{h}_n^{v'}, k) = \frac{1}{\tau}\,\mathbf{c}_k^T \mathbf{h}_n^{v'} \qquad (15)$$

where c_k is the k-th semantic cluster center of the v-th layer and τ is the temperature parameter. As a result, we have

$$\frac{e^{g(\mathbf{h}_n^{v'}, k)}}{\sum_{k'=1}^{K^v} e^{g(\mathbf{h}_n^{v'}, k')}} = f(\mathbf{h}_n^{v'})[k] = \mathbf{q}_n^v[k] \qquad (16)$$

The expectation E is taken over the ground-truth distribution of the cluster assignments for the anchor layer v:

$$p_{gt}(\mathbf{h}_n^{v'}, k) = p_{gt}(\mathbf{h}_n^{v'})\, p_{gt}(k \mid \mathbf{h}_n^{v'}) = \frac{1}{N}\,\mathbf{p}_n^v[k] \qquad (17)$$

where p_gt(k | h_n^{v'}) = p_n^v[k] is the ground-truth semantic distribution for h_n^{v'} on the anchor layer v, which is different from the recovered distribution p(k | h_n^{v'}) = q_n^v[k] shown in Equation (13). Therefore, we have

$$I(C^v; H^{v'}) \geq \frac{1}{Z}\sum_{n=1}^{N}\sum_{k=1}^{K^v} \mathbf{p}_n^v[k] \log \mathbf{q}_n^v[k] = -\frac{1}{Z}\sum_{n=1}^{N} H(\mathbf{p}_n^v, \mathbf{q}_n^v) \qquad (18)$$

where Z = N K^v is the normalization factor. Thus, minimizing H(p_n^v, q_n^v) will maximize I(C^v; H^{v'}). □

4 EXPERIMENTS

4.1 Experimental Setups

Datasets. We use publicly available multiplex heterogeneous graph datasets [18, 39]: ACM, IMDB, DBLP and Amazon, to evaluate the proposed methods. The statistics are summarized in Table 1.

Table 1: Statistics of the datasets

| Graph | # Nodes | # Attributes | # Labeled Data | # Classes | Layer | # Edges |
|---|---|---|---|---|---|---|
| ACM | 3,025 | 1,830 (paper abstract) | 600 | 3 | Paper-Subject-Paper (PSP) | 2,210,761 |
| | | | | | Paper-Author-Paper (PAP) | 29,281 |
| IMDB | 3,550 | 1,007 (movie plot) | 300 | 3 | Movie-Actor-Movie (MAM) | 66,428 |
| | | | | | Movie-Director-Movie (MDM) | 13,788 |
| DBLP | 7,907 | 2,000 (paper abstract) | 80 | 4 | Paper-Author-Paper (PAP) | 144,783 |
| | | | | | Paper-Paper-Paper (PPP) | 90,145 |
| | | | | | Paper-Author-Term-Author-Paper (PATAP) | 57,137,515 |
| Amazon | 7,621 | 2,000 (item description) | 80 | 4 | Item-AlsoView-Item (IVI) | 266,237 |
| | | | | | Item-AlsoBought-Item (IBI) | 1,104,257 |
| | | | | | Item-BoughtTogether-Item (IOI) | 16,305 |

Comparison Methods. We compare with methods for (1) attributed graphs, including methods disregarding node attributes: DeepWalk [42] and node2vec [12], and methods considering attributes: GCN [23], GAT [52], DGI [53], ANRL [71], CAN [34], DGCN [77], HDI [18], GCA [76] and GraphCL [67]; and (2) attributed multiplex heterogeneous graphs, including methods disregarding node attributes: CMNA [5] and MNE [68], and methods considering attributes: mGCN [31], HAN [58], MvAGC [28], DMGI, DMGIattn [39] and HDMI [18].

Evaluation Metrics. Following [18], we first extract embeddings from the trained encoder. Then we train downstream models with the extracted embeddings and evaluate their performance on the following tasks: (1) a supervised task: node classification; and (2) unsupervised tasks: node clustering and similarity search. For the node classification task, we train a logistic regression model and evaluate its performance with Macro-F1 (MaF1) and Micro-F1 (MiF1). For the node clustering task, we train the K-means algorithm and evaluate it with Normalized Mutual Information (NMI). For the similarity search task, we first calculate the cosine similarity for each pair of nodes, and for each node we compute the fraction of nodes sharing its label among its 5 most similar nodes (Sim@5).
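As a reference for the similarity search metric just described, here is one possible Sim@k computation in PyTorch; the function name and the exclusion of the node itself from its own neighbor list are our assumptions.

```python
import torch
import torch.nn.functional as F

def sim_at_k(embeddings, labels, k=5):
    # For every node, rank all other nodes by cosine similarity and measure the
    # fraction of its top-k neighbors that share the node's label (Sim@k).
    h = F.normalize(embeddings, dim=-1)
    sim = h @ h.t()
    sim.fill_diagonal_(float("-inf"))             # do not count the node itself
    topk = sim.topk(k, dim=-1).indices            # shape (N, k)
    hits = (labels[topk] == labels.unsqueeze(-1)).float()
    return hits.mean().item()
```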
Implementation Details. We use a one-layer 1st-order GCN [23] with hyperbolic tangent activation as the encoder, E^v(X) = tanh(A^v X W + X W' + b). We set the embedding dimension d = 128 and p_drop = 0.5. The models are implemented in PyTorch [40] and trained on an NVIDIA Tesla V-100 GPU. During training, we first warm up the encoders by training them with the node-level losses L_N and R_N. Then we apply the overall loss L_X with a learning rate of 0.005 for IMDB and 0.001 for the other datasets. We use K-means as the clustering algorithm, and the semantic clustering step is performed every 5 epochs of parameter updating. We adopt early stopping with a patience of 100 to prevent overfitting.
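The encoder just described could be written as below; this is only a sketch of E^v(X) = tanh(A^v X W + X W' + b), and whether (and how) A^v is normalized is not stated in the excerpt, so the raw adjacency is used here as an assumption.

```python
import torch
import torch.nn as nn

class FirstOrderGCN(nn.Module):
    """One-layer 1st-order GCN encoder with tanh activation:
    E(A, X) = tanh(A X W + X W' + b)."""

    def __init__(self, in_dim, out_dim=128):
        super().__init__()
        self.neighbor = nn.Linear(in_dim, out_dim, bias=False)   # W  (aggregation term)
        self.self_loop = nn.Linear(in_dim, out_dim, bias=True)   # W' and the bias b

    def forward(self, adj, x):
        # adj: dense (N, N) adjacency of one layer; x: (N, d_x) shared attribute matrix.
        return torch.tanh(adj @ self.neighbor(x) + self.self_loop(x))

# One encoder per homogeneous layer, e.g. encoders = [FirstOrderGCN(d_x) for _ in range(V)].
```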
4.2 Overall Performance

X-GOAL on Multiplex Heterogeneous Graphs. The overall performance of all the methods is presented in Tables 2-3, where the upper and middle parts are the methods for homogeneous graphs and multiplex heterogeneous graphs, respectively. "OOM" means out-of-memory. Among all the baselines, HDMI has the best overall performance. The proposed X-GOAL further outperforms HDMI, with average improvements of 0.023/0.019/0.041/0.021 over the second best scores on Macro-F1/Micro-F1/NMI/Sim@5. For Macro-F1 and Micro-F1 in Table 2, X-GOAL improves the most on the Amazon dataset (0.050/0.044). For NMI and Sim@5 in Table 3, X-GOAL improves the most on the ACM (0.071) and Amazon (0.050) datasets, respectively. The superior overall performance of X-GOAL demonstrates that the proposed approach can effectively extract informative node embeddings for multiplex heterogeneous graphs.

Table 2: Overall performance of X-GOAL on the supervised task: node classification.

| Method | ACM MaF1 | ACM MiF1 | IMDB MaF1 | IMDB MiF1 | DBLP MaF1 | DBLP MiF1 | Amazon MaF1 | Amazon MiF1 |
|---|---|---|---|---|---|---|---|---|
| DeepWalk | 0.739 | 0.748 | 0.532 | 0.550 | 0.533 | 0.537 | 0.663 | 0.671 |
| node2vec | 0.741 | 0.749 | 0.533 | 0.550 | 0.543 | 0.547 | 0.662 | 0.669 |
| GCN/GAT | 0.869 | 0.870 | 0.603 | 0.611 | 0.734 | 0.717 | 0.646 | 0.649 |
| DGI | 0.881 | 0.881 | 0.598 | 0.606 | 0.723 | 0.720 | 0.403 | 0.418 |
| ANRL | 0.819 | 0.820 | 0.573 | 0.576 | 0.770 | 0.699 | 0.692 | 0.690 |
| CAN | 0.590 | 0.636 | 0.577 | 0.588 | 0.702 | 0.694 | 0.498 | 0.499 |
| DGCN | 0.888 | 0.888 | 0.582 | 0.592 | 0.707 | 0.698 | 0.478 | 0.509 |
| GraphCL | 0.884 | 0.883 | 0.619 | 0.623 | 0.814 | 0.806 | 0.461 | 0.472 |
| GCA | 0.798 | 0.797 | 0.523 | 0.533 | OOM | OOM | 0.408 | 0.398 |
| HDI | 0.901 | 0.900 | 0.634 | 0.638 | 0.814 | 0.800 | 0.804 | 0.806 |
| CMNA | 0.782 | 0.788 | 0.549 | 0.566 | 0.566 | 0.561 | 0.657 | 0.665 |
| MNE | 0.792 | 0.797 | 0.552 | 0.574 | 0.566 | 0.562 | 0.556 | 0.567 |
| mGCN | 0.858 | 0.860 | 0.623 | 0.630 | 0.725 | 0.713 | 0.660 | 0.661 |
| HAN | 0.878 | 0.879 | 0.599 | 0.607 | 0.716 | 0.708 | 0.501 | 0.509 |
| DMGI | 0.898 | 0.898 | 0.648 | 0.648 | 0.771 | 0.766 | 0.746 | 0.748 |
| DMGIattn | 0.887 | 0.887 | 0.602 | 0.606 | 0.778 | 0.770 | 0.758 | 0.758 |
| MvAGC | 0.778 | 0.791 | 0.598 | 0.615 | 0.509 | 0.542 | 0.395 | 0.414 |
| HDMI | 0.901 | 0.901 | 0.650 | 0.658 | 0.820 | 0.811 | 0.808 | 0.812 |
| X-GOAL | 0.922 | 0.921 | 0.661 | 0.663 | 0.830 | 0.819 | 0.858 | 0.857 |

Table 3: Overall performance of X-GOAL on the unsupervised tasks: node clustering and similarity search.

| Method | ACM NMI | ACM Sim@5 | IMDB NMI | IMDB Sim@5 | DBLP NMI | DBLP Sim@5 | Amazon NMI | Amazon Sim@5 |
|---|---|---|---|---|---|---|---|---|
| DeepWalk | 0.310 | 0.710 | 0.117 | 0.490 | 0.348 | 0.629 | 0.083 | 0.726 |
| node2vec | 0.309 | 0.710 | 0.123 | 0.487 | 0.382 | 0.629 | 0.074 | 0.738 |
| GCN/GAT | 0.671 | 0.867 | 0.176 | 0.565 | 0.465 | 0.724 | 0.287 | 0.624 |
| DGI | 0.640 | 0.889 | 0.182 | 0.578 | 0.551 | 0.786 | 0.007 | 0.558 |
| ANRL | 0.515 | 0.814 | 0.163 | 0.527 | 0.332 | 0.720 | 0.166 | 0.763 |
| CAN | 0.504 | 0.836 | 0.074 | 0.544 | 0.323 | 0.792 | 0.001 | 0.537 |
| DGCN | 0.691 | 0.690 | 0.143 | 0.179 | 0.462 | 0.491 | 0.143 | 0.194 |
| GraphCL | 0.673 | 0.890 | 0.149 | 0.565 | 0.545 | 0.803 | 0.002 | 0.360 |
| GCA | 0.443 | 0.791 | 0.007 | 0.496 | OOM | OOM | 0.002 | 0.478 |
| HDI | 0.650 | 0.900 | 0.194 | 0.605 | 0.570 | 0.799 | 0.487 | 0.856 |
| CMNA | 0.498 | 0.363 | 0.152 | 0.069 | 0.420 | 0.511 | 0.070 | 0.435 |
| MNE | 0.545 | 0.791 | 0.013 | 0.482 | 0.136 | 0.711 | 0.001 | 0.395 |
| mGCN | 0.668 | 0.873 | 0.183 | 0.550 | 0.468 | 0.726 | 0.301 | 0.630 |
| HAN | 0.658 | 0.872 | 0.164 | 0.561 | 0.472 | 0.779 | 0.029 | 0.495 |
| DMGI | 0.687 | 0.898 | 0.196 | 0.605 | 0.409 | 0.766 | 0.425 | 0.816 |
| DMGIattn | 0.702 | 0.901 | 0.185 | 0.586 | 0.554 | 0.798 | 0.412 | 0.825 |
| MvAGC | 0.665 | 0.824 | 0.219 | 0.525 | 0.281 | 0.437 | 0.082 | 0.237 |
| HDMI | 0.695 | 0.898 | 0.198 | 0.607 | 0.582 | 0.809 | 0.500 | 0.857 |
| X-GOAL | 0.773 | 0.924 | 0.221 | 0.613 | 0.615 | 0.809 | 0.556 | 0.907 |
GOAL on Homogeneous Graph Layers. We compare the proposed GOAL framework with recent infomax-based methods (DGI and HDI) and graph augmentation based methods (GraphCL and GCA). The experimental results for each single homogeneous graph layer are presented in Tables 4-5. It is evident that GOAL significantly outperforms the baseline methods on all single homogeneous graph layers. On average, GOAL has 0.137/0.129/0.151/0.119 improvements on Macro-F1/Micro-F1/NMI/Sim@5. For node classification in Table 4, GOAL improves the most on the PATAP layer of DBLP: 0.514/0.459 on Macro-F1/Micro-F1. For node clustering and similarity search in Table 5, GOAL improves the most on the IBI layer of Amazon: 0.391 on NMI and 0.378 on Sim@5. The superior performance of GOAL indicates that the proposed prototypical contrastive learning strategy is better than the infomax-based and graph augmentation based instance-wise contrastive learning strategies. We believe this is because prototypical contrastive learning can effectively reduce the semantic errors.

Table 4: Overall performance of GOAL on each layer: node classification. Each cell reports MaF1 / MiF1.

| Method | ACM-PSP | ACM-PAP | IMDB-MDM | IMDB-MAM | DBLP-PAP | DBLP-PPP | DBLP-PATAP | Amazon-IVI | Amazon-IBI | Amazon-IOI |
|---|---|---|---|---|---|---|---|---|---|---|
| DGI | 0.663 / 0.668 | 0.855 / 0.853 | 0.573 / 0.586 | 0.558 / 0.564 | 0.804 / 0.796 | 0.728 / 0.717 | 0.240 / 0.272 | 0.380 / 0.388 | 0.386 / 0.410 | 0.569 / 0.574 |
| GraphCL | 0.649 / 0.658 | 0.833 / 0.824 | 0.551 / 0.566 | 0.554 / 0.562 | 0.806 / 0.779 | 0.678 / 0.675 | 0.236 / 0.286 | 0.290 / 0.305 | 0.335 / 0.348 | 0.506 / 0.516 |
| GCA | 0.645 / 0.656 | 0.748 / 0.749 | 0.534 / 0.537 | 0.489 / 0.500 | 0.716 / 0.710 | 0.679 / 0.665 | OOM | 0.300 / 0.312 | 0.289 / 0.304 | 0.532 / 0.526 |
| HDI | 0.742 / 0.744 | 0.889 / 0.888 | 0.626 / 0.631 | 0.600 / 0.606 | 0.812 / 0.803 | 0.751 / 0.745 | 0.241 / 0.284 | 0.581 / 0.583 | 0.524 / 0.529 | 0.796 / 0.799 |
| GOAL | 0.833 / 0.836 | 0.908 / 0.908 | 0.649 / 0.653 | 0.653 / 0.652 | 0.817 / 0.804 | 0.765 / 0.755 | 0.755 / 0.745 | 0.849 / 0.848 | 0.850 / 0.848 | 0.851 / 0.851 |

Table 5: Overall performance of GOAL on each layer: node clustering and similarity search. Each cell reports NMI / Sim@5.

| Method | ACM-PSP | ACM-PAP | IMDB-MDM | IMDB-MAM | DBLP-PAP | DBLP-PPP | DBLP-PATAP | Amazon-IVI | Amazon-IBI | Amazon-IOI |
|---|---|---|---|---|---|---|---|---|---|---|
| DGI | 0.526 / 0.698 | 0.651 / 0.872 | 0.145 / 0.549 | 0.089 / 0.495 | 0.547 / 0.800 | 0.404 / 0.741 | 0.054 / 0.583 | 0.002 / 0.395 | 0.003 / 0.414 | 0.038 / 0.701 |
| GraphCL | 0.524 / 0.735 | 0.675 / 0.874 | 0.128 / 0.554 | 0.060 / 0.485 | 0.539 / 0.794 | 0.347 / 0.702 | 0.052 / 0.595 | 0.001 / 0.334 | 0.002 / 0.360 | 0.036 / 0.630 |
| GCA | 0.389 / 0.662 | 0.062 / 0.764 | 0.008 / 0.491 | 0.008 / 0.463 | 0.076 / 0.775 | 0.223 / 0.683 | OOM | 0.002 / 0.315 | 0.007 / 0.329 | 0.008 / 0.588 |
| HDI | 0.528 / 0.716 | 0.662 / 0.886 | 0.194 / 0.592 | 0.143 / 0.527 | 0.562 / 0.805 | 0.408 / 0.742 | 0.054 / 0.591 | 0.169 / 0.544 | 0.153 / 0.525 | 0.407 / 0.826 |
| GOAL | 0.600 / 0.851 | 0.735 / 0.917 | 0.210 / 0.602 | 0.180 / 0.585 | 0.589 / 0.809 | 0.447 / 0.757 | 0.412 / 0.733 | 0.551 / 0.901 | 0.544 / 0.903 | 0.536 / 0.905 |

4.3 Ablation Study

Multiplex Heterogeneous Graph Level. In Table 6, we study the impact of the node-level and semantic-level alignments. The results in Table 6 indicate that both the node-level alignment (R_N) and the semantic-level alignment (R_C) improve the performance.

Table 6: Ablation study of X-GOAL at the multiplex heterogeneous graph level. Each cell reports MaF1 / MiF1 / NMI / Sim@5.

| Variant | ACM | IMDB | DBLP | Amazon |
|---|---|---|---|---|
| X-GOAL | 0.922 / 0.921 / 0.773 / 0.924 | 0.661 / 0.663 / 0.221 / 0.613 | 0.830 / 0.819 / 0.615 / 0.809 | 0.858 / 0.857 / 0.556 / 0.907 |
| w/o R_C | 0.919 / 0.917 / 0.770 / 0.922 | 0.658 / 0.661 / 0.211 / 0.606 | 0.817 / 0.807 / 0.611 / 0.804 | 0.856 / 0.856 / 0.555 / 0.906 |
| w/o R_N, R_C | 0.893 / 0.893 / 0.724 / 0.912 | 0.651 / 0.658 / 0.194 / 0.606 | 0.803 / 0.791 / 0.590 / 0.801 | 0.835 / 0.834 / 0.506 / 0.904 |

Homogeneous Graph Layer Level. The results for different configurations of GOAL on the PAP layer of ACM are shown in Table 7. First, the warm-up, the semantic-level loss L_C and the node-level loss L_N are all critical. Second, comparing GOAL (1st-order GCN with tanh activation) with other GCN variants: (1) with the same activation function, the 1st-order GCN performs better than the original GCN; and (2) tanh is better than relu. We believe this is because the 1st-order GCN has a better capability for capturing the attribute information, and tanh provides a better normalization for the node embeddings. Finally, for the configurations of the graph transformation, replacing dropout with masking degrades the performance. This is because dropout re-scales the outputs by 1/(1 − p_drop), which improves the performance. Besides, dropout on both the attributes and the adjacency matrix is important.

Table 7: Ablation study of GOAL on the PAP layer of ACM.

| Variant | MaF1 | MiF1 | NMI | Sim@5 |
|---|---|---|---|---|
| GOAL | 0.908 | 0.908 | 0.735 | 0.917 |
| w/o warm-up | 0.863 | 0.865 | 0.721 | 0.903 |
| w/o L_C | 0.865 | 0.867 | 0.693 | 0.899 |
| w/o L_N | 0.878 | 0.880 | 0.678 | 0.881 |
| 1st-ord. GCN (relu) | 0.865 | 0.866 | 0.559 | 0.859 |
| GCN (tanh) | 0.881 | 0.881 | 0.486 | 0.886 |
| GCN (relu) | 0.831 | 0.831 | 0.410 | 0.837 |
| dropout → masking | 0.888 | 0.890 | 0.716 | 0.903 |
| w/o attribute drop | 0.843 | 0.845 | 0.568 | 0.869 |
| w/o adj. matrix drop | 0.888 | 0.888 | 0.715 | 0.903 |

4.4 Number of Clusters

Figure 4 shows the Macro-F1 and NMI scores on the PSP and PAP layers of ACM w.r.t. the number of clusters K ∈ {3, 4, 5, 10, 20, 30, 50}. For PSP and PAP, the best Macro-F1 and NMI scores are obtained when K = 30 and K = 5, respectively. The number of ground-truth classes for ACM is 3, and the results in Figure 4 indicate that over-clustering is beneficial. We believe this is because there are many sub-clusters in the embedding space, which is consistent with prior findings on image data [27].

Figure 4: Effect of the number of clusters K on the PSP and PAP layers of ACM. (a) Macro-F1 vs. K; (b) NMI vs. K.
Figure 5: Visualization of the embeddings for the PSP and PAP layers of the ACM graph. (a) L_N on PSP; (b) L_N + L_C on PSP; (c) L_N on PAP; (d) L_N + L_C on PAP.

Figure 6: Visualization of the combined embeddings for the ACM graph. (a) L_N; (b) L_N + L_C; (c) L_N + L_C + R_N; (d) L_N + L_C + R_N + R_C.

4.5 Visualization

Homogeneous Graph Layer Level. The t-SNE [32] visualizations of the embeddings for the PSP and PAP layers of ACM are presented in Figure 5. L_N, L_C, R_N and R_C are the node-level loss, cluster-level loss, node-level alignment and cluster-level alignment. The embeddings extracted by the full GOAL framework (L_N + L_C) are better separated than those trained with the node-level loss L_N only. For GOAL, the numbers of clusters for PSP and PAP are 30 and 5, since they give the best performance, as shown in Figure 4.

Multiplex Heterogeneous Graph Level. The visualizations of the combined embeddings are shown in Figure 6. The embeddings in Figures 6a-6b are the average pooling of the layer-specific embeddings in Figure 5. Figures 6c and 6d are X-GOAL without the cluster-level alignment and the full X-GOAL, respectively. Generally, the full X-GOAL best separates the different clusters.
5 RELATED WORK

5.1 Contrastive Learning for Graphs

The goal of CL is to pull similar nodes into close positions and push dissimilar nodes far apart in the embedding space. Inspired by word2vec [35], early methods, such as DeepWalk [42] and node2vec [12], use random walks to sample positive pairs of nodes. LINE [50] and SDNE [56] determine the positive node pairs by their first- and second-order structural proximity. Recent methods leverage graph transformation to generate node pairs. DGI [53], GMI [41], HDI [18] and CommDGI [69] obtain negative samples by randomly shuffling the node attributes. MVGRL [14] transforms graphs via techniques such as graph diffusion [24]. The objective of the above methods is to maximize the mutual information of the positive embedding pairs. GraphCL [67] uses various graph augmentations to obtain positive nodes. GCA [76] generates positive and negative pairs based on their importance. gCooL [25] introduces graph communal contrastive learning. ARIEL [8, 9] proposes an information-regularized adversarial graph contrastive learning. These methods use contrastive losses similar to InfoNCE [36].

For multiplex heterogeneous graphs, MNE [68], MVN2VEC [47] and GATNE [4] sample node pairs based on random walks. DMGI [39] and HDMI [18] use random attribute shuffling to sample negative nodes. HeCo [59] decides positive and negative pairs based on the connectivity between nodes. The above methods mainly rely on the topological structures to pair nodes, yet do not fully explore the semantic information, which could introduce semantic errors.

5.2 Deep Clustering and Contrastive Learning

Clustering algorithms [2, 62] can capture the semantic clusters of instances. DeepCluster [2] is one of the earliest works that use cluster assignments as "pseudo-labels" to update the parameters of the encoder. DEC [62] learns a mapping from the data space to a lower-dimensional feature space in which it iteratively optimizes a clustering objective. Inspired by these works, SwAV [3] and PCL [27] combine deep clustering with CL. SwAV compares the cluster assignments rather than the embeddings of two images. PCL is the closest to our work: it alternately performs clustering to obtain the latent prototypes and trains the encoder by contrasting positive and negative pairs of nodes and prototypes. However, PCL has some limitations compared with the proposed X-GOAL: it is designed for single-view image data; it heavily relies on data augmentations and momentum contrast [15]; and it makes some complex assumptions about the cluster distributions and embeddings.
5.3 Multiplex Heterogeneous Graph Neural Networks

The multiplex heterogeneous graph [4] considers multiple relations among nodes, and it is also known as the multiplex graph [18, 39], multi-view graph [46], multi-layer graph [26] and multi-dimensional graph [30]. MVE [46] and HAN [58] use attention mechanisms to combine embeddings from different views. mGCN [31] models both within-view and across-view interactions. VANE [11] uses adversarial training to improve the comprehensiveness and robustness of the embeddings. Multiplex graph neural networks have been used in many applications [7], such as time series [19], text summarization [21], temporal graphs [10], graph alignment [63], abstract reasoning [57], global poverty [22] and bipartite graphs [64].

5.4 Deep Graph Clustering

Graph clustering aims at discovering groups in graphs. SAE [51] and MGAE [55] first train a GNN and then run a clustering algorithm over the node embeddings to obtain the clusters. DAEGC [54] and SDCN [1] jointly optimize clustering algorithms and the graph reconstruction loss. AGC [70] adaptively finds the optimal order for graph filters based on the intrinsic clustering scores. M3S [49] uses clustering to enlarge the labeled data with pseudo labels. SDCN [1] proposes a structural deep clustering network to integrate the structural information into deep clustering. COIN [20] co-clusters the two types of nodes in bipartite graphs. MvAGC [28] extends AGC [70] to multi-view settings. However, MvAGC is not a neural network based method, and thus might not exploit the attribute and non-linearity information. Recent methods combine CL with clustering to further improve the performance. SCAGC [61] treats nodes within the same cluster as positive pairs. MCGC [37] combines CL with MvAGC [28] and treats each node with its neighbors as positive pairs. Different from SCAGC and MCGC, the proposed GOAL and X-GOAL capture the semantic information by treating a node and its corresponding cluster center as a positive pair.

6 CONCLUSION

In this paper, we introduce a novel X-GOAL framework for multiplex heterogeneous graphs, which is comprised of a GOAL framework for each homogeneous graph layer and an alignment regularization to jointly model different layers. The GOAL framework captures both node-level and cluster-level information. The alignment regularization is a nimble technique to jointly model and propagate information across different layers, which could maximize the mutual information of different layers. The experimental results on real-world multiplex heterogeneous graphs demonstrate the effectiveness of the proposed X-GOAL framework.

A DERIVATION OF CLUSTER-LEVEL LOSS

The node-level contrastive loss is usually noisy and could introduce semantic errors by treating two semantically similar nodes as a negative pair. To tackle this issue, we use a clustering algorithm C (e.g., K-means) to obtain the semantic clusters of nodes, and we use the EM algorithm to update the parameters of E to pull node embeddings closer to their assigned clusters (or prototypes).

Following [27], we maximize the following log-likelihood:

$$\sum_{n=1}^{N} \log p(\mathbf{h}_n \mid \Theta, \mathbf{C}) = \sum_{n=1}^{N} \log \sum_{k=1}^{K} p(\mathbf{h}_n, k \mid \Theta, \mathbf{C}) \qquad (19)$$

where h_n is the n-th row of H, Θ and C are the parameters of E and of the K-means algorithm C, k ∈ [1, · · · , K] is the cluster index, and K is the number of clusters. Directly optimizing this objective is impracticable since the cluster index is a latent variable.

The Evidence Lower Bound (ELBO) of Equation (19) is given by:

$$\mathrm{ELBO} = \sum_{n=1}^{N}\sum_{k=1}^{K} Q(k \mid \mathbf{h}_n) \log \frac{p(\mathbf{h}_n, k \mid \Theta, \mathbf{C})}{Q(k \mid \mathbf{h}_n)} \qquad (20)$$

where Q(k | h_n) = p(k | h_n, Θ, C) is the auxiliary function.

In the E-step, we fix Θ and estimate the cluster centers Ĉ and the cluster assignments Q̂(k | h_n) by running the K-means algorithm over the embeddings of the original graph, H = E(G). If a node h_n belongs to cluster k, then its auxiliary function is an indicator function satisfying Q̂(k | h_n) = 1 and Q̂(k' | h_n) = 0 for all k' ≠ k.

In the M-step, based on Ĉ and Q̂(k | h_n) obtained in the E-step, we update Θ by maximizing the ELBO:

$$\mathrm{ELBO} = \sum_{n=1}^{N}\sum_{k=1}^{K} \hat{Q}(k \mid \mathbf{h}_n) \log p(\mathbf{h}_n, k \mid \Theta, \hat{\mathbf{C}}) - \sum_{n=1}^{N}\sum_{k=1}^{K} \hat{Q}(k \mid \mathbf{h}_n) \log \hat{Q}(k \mid \mathbf{h}_n) \qquad (21)$$

Dropping the second term of the above equation, which is a constant, we minimize the following loss function:

$$\mathcal{L}_C = -\sum_{n=1}^{N}\sum_{k=1}^{K} \hat{Q}(k \mid \mathbf{h}_n) \log p(\mathbf{h}_n, k \mid \Theta, \hat{\mathbf{C}}) \qquad (22)$$

Assuming a uniform prior distribution over h_n, we have:

$$p(\mathbf{h}_n, k \mid \Theta, \hat{\mathbf{C}}) \propto p(k \mid \mathbf{h}_n, \Theta, \hat{\mathbf{C}}) \qquad (23)$$

We define p(k | h_n, Θ, Ĉ) by:

$$p(k \mid \mathbf{h}_n, \Theta, \hat{\mathbf{C}}) = \frac{e^{\hat{\mathbf{c}}_k^T \mathbf{h}_n / \tau}}{\sum_{k'=1}^{K} e^{\hat{\mathbf{c}}_{k'}^T \mathbf{h}_n / \tau}} \qquad (24)$$

where h_n ∈ R^d is the embedding of the node x_n, ĉ_k ∈ R^d is the vector of the k-th cluster center, and τ is the temperature parameter. Using k_n to denote the cluster assignment of h_n and normalizing the loss by 1/N, Equation (22) can be rewritten as:

$$\mathcal{L}_C = -\frac{1}{N}\sum_{n=1}^{N} \log \frac{e^{\mathbf{c}_{k_n}^T \mathbf{h}_n / \tau}}{\sum_{k=1}^{K} e^{\mathbf{c}_k^T \mathbf{h}_n / \tau}} \qquad (25)$$

The above loss function captures the semantic similarities between nodes by pulling nodes within the same cluster closer to their assigned cluster center.

ACKNOWLEDGMENTS

BJ and HT are partially supported by NSF (1947135, 2134079 and 1939725), and NIFA (2020-67021-32799).
REFERENCES
[1] Deyu Bo, Xiao Wang, Chuan Shi, Meiqi Zhu, Emiao Lu, and Peng Cui. 2020. Structural deep clustering network. In Proceedings of The Web Conference 2020.
[2] Mathilde Caron, Piotr Bojanowski, Armand Joulin, and Matthijs Douze. 2018. Deep clustering for unsupervised learning of visual features. In ECCV.
[3] Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski, and Armand Joulin. 2020. Unsupervised learning of visual features by contrasting cluster assignments. NeurIPS (2020).
[4] Yukuo Cen, Xu Zou, Jianwei Zhang, Hongxia Yang, Jingren Zhou, and Jie Tang. 2019. Representation learning for attributed multiplex heterogeneous network. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 1358–1368.
[5] Xiaokai Chu, Xinxin Fan, Di Yao, Zhihua Zhu, Jianhui Huang, and Jingping Bi. 2019. Cross-network embedding for multi-network alignment. In The World Wide Web Conference. 273–284.
[6] Boxin Du, Changhe Yuan, Robert Barton, Tal Neiman, and Hanghang Tong. 2021. Hypergraph Pre-training with Graph Neural Networks. arXiv preprint arXiv:2105.10862 (2021).
[7] Boxin Du, Si Zhang, Yuchen Yan, and Hanghang Tong. 2021. New Frontiers of Multi-Network Mining: Recent Developments and Future Trend. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining. 4038–4039.
[8] Shengyu Feng, Baoyu Jing, Yada Zhu, and Hanghang Tong. 2022. Adversarial graph contrastive learning with information regularization. In Proceedings of the ACM Web Conference 2022. 1362–1371.
[9] Shengyu Feng, Baoyu Jing, Yada Zhu, and Hanghang Tong. 2022. ARIEL: Adversarial Graph Contrastive Learning. https://doi.org/10.48550/ARXIV.2208.06956
[10] Dongqi Fu, Liri Fang, Ross Maciejewski, Vetle I. Torvik, and Jingrui He. 2022. Meta-Learned Metrics over Multi-Evolution Temporal Graphs. In KDD 2022.
[11] Dongqi Fu, Zhe Xu, Bo Li, Hanghang Tong, and Jingrui He. 2020. A View-Adversarial Framework for Multi-View Network Embedding. In CIKM.
[12] Aditya Grover and Jure Leskovec. 2016. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD. 855–864.
[13] William L Hamilton, Rex Ying, and Jure Leskovec. 2017. Representation learning on graphs: Methods and applications. arXiv preprint arXiv:1709.05584 (2017).
[14] Kaveh Hassani and Amir Hosein Khasahmadi. 2020. Contrastive Multi-View Representation Learning on Graphs. arXiv preprint arXiv:2006.05582 (2020).
[15] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. 2020. Momentum contrast for unsupervised visual representation learning. In CVPR.
[16] Weihua Hu, Bowen Liu, Joseph Gomes, Marinka Zitnik, Percy Liang, Vijay Pande, and Jure Leskovec. 2019. Strategies for pre-training graph neural networks. arXiv preprint arXiv:1905.12265 (2019).
[17] Yizhu Jiao, Yun Xiong, Jiawei Zhang, Yao Zhang, Tianqi Zhang, and Yangyong Zhu. 2020. Sub-Graph Contrast for Scalable Self-Supervised Graph Representation Learning. In 2020 IEEE International Conference on Data Mining (ICDM).
[18] Baoyu Jing, Chanyoung Park, and Hanghang Tong. 2021. Hdmi: High-order deep multiplex infomax. In Proceedings of the Web Conference 2021. 2414–2424.
[19] Baoyu Jing, Hanghang Tong, and Yada Zhu. 2021. Network of Tensor Time Series. In The World Wide Web Conference. https://doi.org/10.1145/3442381.3449969
[20] Baoyu Jing, Yuchen Yan, Yada Zhu, and Hanghang Tong. 2022. COIN: Co-Cluster Infomax for Bipartite Graphs. arXiv preprint arXiv:2206.00006 (2022).
[21] Baoyu Jing, Zeyu You, Tao Yang, Wei Fan, and Hanghang Tong. 2021. Multiplex Graph Neural Network for Extractive Text Summarization. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. 133–139.
[22] Muhammad Raza Khan and Joshua E Blumenstock. 2019. Multi-gcn: Graph convolutional networks for multi-view networks, with applications to global poverty. In AAAI, Vol. 33. 606–613.
[23] Thomas N Kipf and Max Welling. 2016. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907 (2016).
[24] Johannes Klicpera, Stefan Weißenberger, and Stephan Günnemann. 2019. Diffusion improves graph learning. NeurIPS (2019).
[25] Bolian Li, Baoyu Jing, and Hanghang Tong. 2022. Graph Communal Contrastive Learning. In Proceedings of the ACM Web Conference 2022. 1203–1213.
[26] Jundong Li, Chen Chen, Hanghang Tong, and Huan Liu. 2018. Multi-layered network embedding. In SDM. 684–692.
[27] Junnan Li, Pan Zhou, Caiming Xiong, and Steven CH Hoi. 2021. Prototypical contrastive learning of unsupervised representations. ICLR (2021).
[28] Zhiping Lin and Zhao Kang. 2021. Graph filter-based multi-view attributed graph clustering. In IJCAI. 19–26.
[29] Yixin Liu, Shirui Pan, Ming Jin, Chuan Zhou, Feng Xia, and Philip S Yu. 2021. Graph self-supervised learning: A survey. arXiv preprint arXiv:2103.00111 (2021).
[30] Yao Ma, Zhaochun Ren, Ziheng Jiang, Jiliang Tang, and Dawei Yin. 2018. Multi-dimensional network embedding with hierarchical structure. In WSDM. 387–395.
[31] Yao Ma, Suhang Wang, Chara C Aggarwal, Dawei Yin, and Jiliang Tang. 2019. Multi-dimensional graph convolutional networks. In SDM. 657–665.
[32] Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research 9, Nov (2008), 2579–2605.
[33] David McAllester and Karl Stratos. 2020. Formal limitations on the measurement of mutual information. In International Conference on Artificial Intelligence and Statistics. PMLR, 875–884.
[34] Zaiqiao Meng, Shangsong Liang, Hongyan Bao, and Xiangliang Zhang. 2019. Co-embedding attributed networks. In WSDM.
[35] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems. 3111–3119.
[36] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. 2018. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748 (2018).
[37] Erlin Pan and Zhao Kang. 2021. Multi-view Contrastive Graph Clustering. Advances in Neural Information Processing Systems 34 (2021).
[38] Chanyoung Park, Jiawei Han, and Hwanjo Yu. 2020. Deep multiplex graph infomax: Attentive multiplex network embedding using global information. Knowledge-Based Systems 197 (2020), 105861.
[39] Chanyoung Park, Donghyun Kim, Jiawei Han, and Hwanjo Yu. 2020. Unsupervised Attributed Multiplex Network Embedding. In AAAI. 5371–5378.
[40] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. 2019. PyTorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems 32 (2019), 8026–8037.
[41] Zhen Peng, Wenbing Huang, Minnan Luo, Qinghua Zheng, Yu Rong, Tingyang Xu, and Junzhou Huang. 2020. Graph Representation Learning via Graphical Mutual Information Maximization. In Proceedings of The Web Conference 2020.
[42] Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. 2014. Deepwalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD. 701–710.
[43] Ben Poole, Sherjil Ozair, Aaron Van Den Oord, Alex Alemi, and George Tucker. 2019. On variational bounds of mutual information. In ICML.
[44] Zhenyue Qin, Dongwoo Kim, and Tom Gedeon. 2019. Rethinking softmax with cross-entropy: Neural network classifier as mutual information estimator. arXiv preprint arXiv:1911.10688 (2019).
[45] Jiezhong Qiu, Qibin Chen, Yuxiao Dong, Jing Zhang, Hongxia Yang, Ming Ding, Kuansan Wang, and Jie Tang. 2020. GCC: Graph contrastive coding for graph neural network pre-training. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 1150–1160.
[46] Meng Qu, Jian Tang, Jingbo Shang, Xiang Ren, Ming Zhang, and Jiawei Han. 2017. An attention-based collaboration framework for multi-view network representation learning. In CIKM.
[47] Yu Shi, Fangqiu Han, Xinwei He, Xinran He, Carl Yang, Jie Luo, and Jiawei Han. 2018. mvn2vec: Preservation and collaboration in multi-view network embedding. arXiv preprint arXiv:1801.06597 (2018).
[48] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15, 1 (2014), 1929–1958.
[49] Ke Sun, Zhouchen Lin, and Zhanxing Zhu. 2020. Multi-stage self-supervised learning for graph convolutional networks on graphs with few labeled nodes. In AAAI, Vol. 34. 5892–5899.
[50] Jian Tang, Meng Qu, Mingzhe Wang, Ming Zhang, Jun Yan, and Qiaozhu Mei. 2015. LINE: Large-scale information network embedding. In Proceedings of the 24th International Conference on World Wide Web. 1067–1077.
[51] Fei Tian, Bin Gao, Qing Cui, Enhong Chen, and Tie-Yan Liu. 2014. Learning deep representations for graph clustering. In AAAI, Vol. 28.
[52] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. 2018. Graph attention networks. ICLR (2018).
[53] Petar Veličković, William Fedus, William L Hamilton, Pietro Liò, Yoshua Bengio, and R Devon Hjelm. 2019. Deep graph infomax. ICLR (2019).
[54] Chun Wang, Shirui Pan, Ruiqi Hu, Guodong Long, Jing Jiang, and Chengqi Zhang. 2019. Attributed graph clustering: A deep attentional embedding approach. arXiv preprint arXiv:1906.06532 (2019).
[55] Chun Wang, Shirui Pan, Guodong Long, Xingquan Zhu, and Jing Jiang. 2017. MGAE: Marginalized graph autoencoder for graph clustering. In CIKM. 889–898.
[56] Daixin Wang, Peng Cui, and Wenwu Zhu. 2016. Structural deep network embedding. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 1225–1234.
[57] Duo Wang, Mateja Jamnik, and Pietro Lio. 2020. Abstract Diagrammatic Reasoning with Multiplex Graph Networks. In ICLR.
[58] Xiao Wang, Houye Ji, Chuan Shi, Bai Wang, Yanfang Ye, Peng Cui, and Philip S Yu. 2019. Heterogeneous graph attention network. In TheWebConf.
[59] Xiao Wang, Nian Liu, Hui Han, and Chuan Shi. 2021. Self-supervised heterogeneous graph neural network with co-contrastive learning. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining. 1726–1736.
[60] Lirong Wu, Haitao Lin, Zhangyang Gao, Cheng Tan, Stan Li, et al. 2021. Self-supervised on Graphs: Contrastive, Generative, or Predictive. arXiv preprint arXiv:2105.07342 (2021).
[61] Wei Xia, Quanxue Gao, Ming Yang, and Xinbo Gao. 2021. Self-supervised Contrastive Attributed Graph Clustering. NeurIPS (2021).
[62] Junyuan Xie, Ross Girshick, and Ali Farhadi. 2016. Unsupervised deep embedding for clustering analysis. In ICML.
[63] Hao Xiong, Junchi Yan, and Li Pan. 2021. Contrastive Multi-View Multiplex Network Embedding with Applications to Robust Network Alignment. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining. 1913–1923.
[64] Hansheng Xue, Luwei Yang, Vaibhav Rajan, Wen Jiang, Yi Wei, and Yu Lin. 2021. Multiplex Bipartite Network Embedding using Dual Hypergraph Convolutional Networks. In Proceedings of the Web Conference 2021. 1649–1660.
[65] Yuchen Yan, Lihui Liu, Yikun Ban, Baoyu Jing, and Hanghang Tong. 2021. Dynamic Knowledge Alignment. In AAAI.
[66] Yuchen Yan, Si Zhang, and Hanghang Tong. 2021. Bright: A bridging algorithm for network alignment. In Proceedings of the Web Conference 2021. 3907–3917.
[67] Yuning You, Tianlong Chen, Yongduo Sui, Ting Chen, Zhangyang Wang, and Yang Shen. 2020. Graph contrastive learning with augmentations. Advances in Neural Information Processing Systems 33 (2020), 5812–5823.
[68] Hongming Zhang, Liwei Qiu, Lingling Yi, and Yangqiu Song. 2018. Scalable Multiplex Network Embedding. In IJCAI, Vol. 18. 3082–3088.
[69] Tianqi Zhang, Yun Xiong, Jiawei Zhang, Yao Zhang, Yizhu Jiao, and Yangyong Zhu. 2020. CommDGI: community detection oriented deep graph infomax. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management. 1843–1852.
[70] Xiaotong Zhang, Han Liu, Qimai Li, and Xiao-Ming Wu. 2019. Attributed graph clustering via adaptive graph convolution. arXiv preprint arXiv:1906.01210 (2019).
[71] Zhen Zhang, Hongxia Yang, Jiajun Bu, Sheng Zhou, Pinggang Yu, Jianwei Zhang, Martin Ester, and Can Wang. 2018. ANRL: Attributed Network Representation Learning via Deep Neural Networks. In IJCAI, Vol. 18. 3155–3161.
[72] Lecheng Zheng, Dongqi Fu, and Jingrui He. 2021. Tackling oversmoothing of GNNs with contrastive learning. arXiv preprint arXiv:2110.13798 (2021).
[73] Lecheng Zheng, Yada Zhu, Jingrui He, and Jinjun Xiong. 2021. Heterogeneous Contrastive Learning. arXiv preprint arXiv:2105.09401 (2021).
[74] Dawei Zhou, Lecheng Zheng, Jiawei Han, and Jingrui He. 2020. A data-driven graph generative model for temporal interaction networks. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 401–411.
[75] Dawei Zhou, Lecheng Zheng, Jiejun Xu, and Jingrui He. 2019. Misc-GAN: A multi-scale generative model for graphs. Frontiers in Big Data 2 (2019), 3.
[76] Yanqiao Zhu, Yichen Xu, Feng Yu, Qiang Liu, Shu Wu, and Liang Wang. 2021. Graph contrastive learning with adaptive augmentation. In Proceedings of the Web Conference 2021. 2069–2080.
[77] Chenyi Zhuang and Qiang Ma. 2018. Dual graph convolutional networks for graph-based semi-supervised classification. In Proceedings of the 2018 World Wide Web Conference. 499–508.
