0% found this document useful (0 votes)
6 views14 pages

Esgnn

The document presents ES-GNN, a novel framework for Graph Neural Networks (GNNs) that addresses the limitations of existing models by distinguishing between relevant and irrelevant edges in both homophilic and heterophilic graphs. This approach enhances the generalization capabilities of GNNs and improves robustness against adversarial graphs while alleviating the over-smoothing problem. Extensive experiments demonstrate that ES-GNN outperforms state-of-the-art methods across various datasets, providing a significant reduction in classification errors.
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
6 views14 pages

Esgnn

The document presents ES-GNN, a novel framework for Graph Neural Networks (GNNs) that addresses the limitations of existing models by distinguishing between relevant and irrelevant edges in both homophilic and heterophilic graphs. This approach enhances the generalization capabilities of GNNs and improves robustness against adversarial graphs while alleviating the over-smoothing problem. Extensive experiments demonstrate that ES-GNN outperforms state-of-the-art methods across various datasets, providing a significant reduction in classification errors.
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 14

1

ES-GNN: Generalizing Graph Neural Networks


Beyond Homophily with Edge Splitting
Jingwei Guo, Kaizhu Huang*, Rui Zhang, and Xinping Yi

Abstract—While Graph Neural Networks (GNNs) have achieved enormous success in multiple graph analytical tasks, modern variants
mostly rely on the strong inductive bias of homophily. However, real-world networks typically exhibit both homophilic and heterophilic
linking patterns, wherein adjacent nodes may share dissimilar attributes and distinct labels. Therefore, GNNs smoothing node proximity
holistically may aggregate both task-relevant and irrelevant (even harmful) information, limiting their ability to generalize to heterophilic
graphs and potentially causing non-robustness. In this work, we propose a novel edge splitting GNN (ES-GNN) framework to
adaptively distinguish between graph edges either relevant or irrelevant to learning tasks. This essentially transfers the original graph
into two subgraphs with the same node set but exclusive edge sets dynamically. Given that, information propagation separately on
these subgraphs and edge splitting are alternatively conducted, thus disentangling the task-relevant and irrelevant features.
Theoretically, we show that our ES-GNN can be regarded as a solution to a disentangled graph denoising problem, which further
illustrates our motivations and interprets the improved generalization beyond homophily. Extensive experiments over 11 benchmark
and 1 synthetic datasets demonstrate that ES-GNN not only outperforms the state-of-the-arts, but also can be more robust to
adversarial graphs and alleviate the over-smoothing problem.

Index Terms—Graph Neural Networks, Heterophilic Graphs, Disentangled Representation Learning, Graph Mining.

1 I NTRODUCTION

A S a ubiquitous data structure, graph can symbolize


complex relationships between entities in different do-
mains. For example, knowledge graphs describe the inter-
tive) properties whereby the opposite objects are attracted to
each other [17]. For instance, different types of amino acids
are mostly interacted in many protein structures [10], and
connections between real-world events, and social networks most people in heterosexual dating networks prefer to link
store the online interactions between users. With the flour- with others of the opposite gender. Recent studies [10], [11],
ishing of deep learning models on graph-structured data, [12], [13], [14], [15], [18], [19], [20], [21], [22] have shown that
graph neural networks (GNNs) emerge as one of the most the conventional neighborhood aggregation strategy may
powerful techniques in recent years. Owing to their re- not only cause the over-smoothing problem [23] but also
markable performance, GNNs have been widely adopted in severely hinder the generalization performance of GNNs
multiple graph-based learning tasks, such as link prediction, beyond homophily.
node classification, and recommendation [1], [2], [3], [4]. One reason why current GNNs perform poorly on het-
Modern GNNs are mainly built upon a message passing erophilic graphs, could be the mismatch between the la-
framework [5], where nodes’ representations are learned by beling rules of nodes and their linking mechanism. The
aggregating their transformed neighbors iteratively. From former is the target that GNNs are expected to learn for
the graph signal denoising viewpoint, this mechanism could classification tasks, while the latter specifies how messages
be seen as a low-pass filter [6], [7], [8], [9] that smooths the pass among nodes for attaining this goal. In homophilic sce-
signals between adjacent nodes. Several works [8], [10], [11], narios, both of them are similar in the sense that most nodes
[12], [13], [14], [15] refer this to smoothness or homophily as- are linked because of their commonality which therefore
sumption in GNNs. Notably, they work well on homophilic leads to identical labels. In heterophilic scenarios, however,
(assortative) graphs, from which the proximity information the motivation underlying why two nodes get connected
of nodes can be utilized to predict their labels [16]. However, may be ambiguous to the classification task. Let us take the
real-world networks are typically abstracted from complex social network within a university as an example, where
systems, and sometimes display heterophilic (disassorta- students from different clubs can be linked usually due
to taking the same classes and/or being roommates but
not sharing the same hobbies. Namely, the task-relevant
• J. Guo is with University of Liverpool, Liverpool, UK and irrelevant (or even harmful) information is typically
E-mail: [email protected]
• K. Huang is with Duke Kunshan University, Suzhou, China
mixed into node neighborhood under heterophily. However,
E-mail: [email protected] current methods usually fail to recognize and differentiate
• R. Zhang is with Xi’an Jiaotong-Liverpool University, Suzhou, China these two types of information within nodes’ proximity, as
E-mail: [email protected] illustrated in Fig. 1. As a consequence, the learned repre-
• X. Yi is with University of Liverpool, Liverpool, UK
E-mail: [email protected] sentations are prone to be entangled with false information,
leading to non-robustness and sub-optimal performance.
*Corresponding author: Kaizhu Huang. This work has been submitted to the
IEEE for possible publication. Copyright may be transferred without notice, Once the issue of GNNs’ learning beyond homophily
after which this version may no longer be accessible. is identified, a natural question arises: Can we design a
2

Heterophilic Graph (H = 0.27)


new type of GNNs that is adaptive to both homophilic and label-1 label-2 label-3
heterophilic scenarios? Well formed designs should be able to
recognize the node connections irrelevant to learning tasks, local non-discriminative hi disentangled discriminative zR,i

and substantially extract the most correlated information smoothness smoothness


for prediction. However, the assortativity of real-world net-
works is usually agnostic. Even worse, the features of nodes
are typically full of noises, where similarity or dissimilar-
ity between connected ones may not actually reflect their
class relations. Existing techniques including [18], [24], [25]
usually parameterize graph edges with node similarity or
dissimilarity, and cannot well assess the correlation between
node connections and the downstream target. (a) Conventional GNNs (b) Our ES-GNN
In this paper, we propose ES-GNN, an end-to-end graph
learning framework that generalizes GNNs on graphs with
Fig. 1. A toy example to show differences between conventional GNNs
either homophily or heterophily. Without loss of generality, and our ES-GNN in aggregating node features. Conventional GNNs with
we make an assumption that two nodes get connected local smoothness tend to produce non-discriminative representations
mainly because they share some similar features, which are on heterophilic graphs, while our ES-GNN is able to disentangle and
however unnecessarily just relevant to the learning task. In exclude the task-harmful features from the final predictive target.
other words, nodes may be linked due to similar features,
either relevant or irrelevant to the task. This implicitly
performs the state-of-the-art GNNs on graphs with
divides the original graph edges into two exclusive sets,
various homophily levels, and gives the largest error
each of which represents a latent relation between nodes.
reduction 17.4% on average.
Thanks to the proximity smoothness, aggregating node
• Importantly, ES-GNN is able to alleviate the over-
features individually on each edge set should disentangle
smoothing problem, and enjoys remarkable robust-
the task-relevant and irrelevant features. Meanwhile, these
ness against adversarial graphs. This shows that ES-
disentangled representations potentially reflect node simi-
GNN could still lead to excellent performance even if
larity in two aspects (task-relevant and irrelevant). As such,
the disentangled smoothness assumption may not hold
they can be better utilized to split the original graph edges
practically.
more precisely. Motivated by this, the proposed framework
integrates GNNs with an interpretable edge splitting (ES),
to jointly partition network topology and disentangle node
features.
2 P RELIMINARIES
Technically, we design a residual scoring mechanism, ex- Let G = (V, E) represents an undirected graph with adja-
ecuted within each ES-layer, to distinguish the task-relevant cency matrix A ∈ R|V|×|V| , where V denotes the node set,
and irrelevant graph edges. The node features are then E denotes the edge set, and |V| is the number of nodes.
aggregated separately on these connections to produce dis- We define (vi , vj ) ∈ E , vi , vj ∈ V , i ̸= j if vi and vj are
entangled representations, based on which graph edges can connected, and Ni = {vj |(vi , vj ) ∈ E} as the neighborhood
be classified more accurately in the next ES-layer. Finally, of node vi . The nodes are associated with a feature matrix
the task-relevant representations are granted for predic- X ∈ R|V|×f where f is the number of raw features, and
tion. Meanwhile, an Irrelevant Consistency Regularization we use X[i,:] ∈ Rf to denote the ith row of X. We consider
(ICR) is developed to regulate the task-irrelevant represen- the standard node classification task on undirected graphs,
tations with the potential label-disagreement between adja- where each node vi has a ground truth vector yi ∈ RC in
cent nodes, for further reducing the classification-harmful one-hot encoding where yi (ci ) = 1 and ci is the assigned
information from the final predictive target. To interpret label out of C ≤ |V| classes. As our ES-GNN disentangles
our new algorithm theoretically, generalizing the standard the original graph into the task-relevant and irrelevant sub-
smoothness assumption [8], we also conduct some analysis graphs, we will denote their adjacency matrixes respectively
on ES-GNN and establish its connection with a disentangled as AR and AIR in this paper.
graph signal denoising problem.
To summarize, the main contributions of this work are 3 BACKGROUND AND R ELATED W ORK
four-fold:
3.1 Homophily and Heterophily on Graphs
• We propose a novel framework called ES-GNN for On graphs, homophily and heterophliy (or low homophily)
node classification tasks with one plausible hypoth- typically refer to the similarity and dissimilarity between
esis, which enables GNNs to go beyond the strong adjacent nodes, including but not limited to labels and
homophily assumption on graphs. features. In this work, we study node classification tasks,
• We theoretically prove that our ES-GNN is equiv- thereby focusing on homophily and heterophily in class
alent to solving a graph denoising problem with a labels. According to different homophily ratios, real-world
disentangled smoothness assumption, which interprets networks in the literature, such as citation networks [26], so-
its good performance on different types of networks. cial networks [27], [28], community networks [29], [30], web-
• Extensive evaluations on 11 benchmark and 1 syn- page networks [27], [31], and co-occurrence networks [32]
thetic datasets show that ES-GNN consistently out- can be categorized into homophilic and heterophilc ones.
3

Several metrics have been proposed to estimate the graph weights in both positive and negative signs, so as to extract
homophily level, e.g., edge homophily [10], as the most both low- and high-frequency information.
popular one, is defined as the percentage of edges linking However, none of them analyzes the motivations why
nodes of the same label: two nodes get connected, nor do they associate them with
|{(vi , vj )|(vi , vj ) ∈ E, yi = yj }| learning tasks, which is analyzed as one of the keys to gen-
H= . (1) eralize GNNs beyond homophily in this paper. In contrast,
|E|
ES-GNN distinguishes graph edges as either relevant or
However, this metric may give inaccurate estimation in irrelevant to the task. Such information acts as a guide to
case of datasets with the class-imbalance problem [28]. To disentangle and exclude classification-harmful information
alleviate this, a new metric is proposed by [28]: from the final predictive target, and thus boosts GNNs’ per-
C−1 formance under heterophily. Meanwhile, detailed analyses
1 X |Ck |
Ĥ = max(hk − , 0), (2) on the limited performance of the existing state-of-the-arts
C − 1 k=0 |V| are provided in Section 6.3.
where Ck is the set of nodes from class k ∈ {0, 1, ..., C − 1},
and hk is the class-wise homophily ratio computed as: 3.3 Disentangled Representation Learning
P
|{vj |yi = yj , vj ∈ Ni }| Disentangled representation learning is to learn decom-
hk = vi ∈Ck P .
vi ∈Ck |Ni |
posed vector representations which disentangle the explana-
tory latent variables underlying the observed data and
All these two indexes in Eq.(1) and Eq. (2) range from 0 to 1, encode them as separate dimensions [49], [50]. Existing
of which the higher values suggest higher homophily (lower efforts concerning that topic are mainly made on computer
heterophily), and otherwise. vision [51], [52], [53], [54], while a couple of works recently
emerge to explore the potential of disentangled learning in
3.2 Graph Neural Networks graph-structured domains [35], [55], [56], [57]. For example,
The central idea of most GNNs is to utilize nodes’ proximity DisenGCN [55] employs a neighborhood routing mecha-
information for building their representations for tasks, nism to iteratively partition node neighborhood into mul-
based on which great effort has been made in developing tiple separated parts. FactorGCN [35] factorizes the original
different variants [6], [24], [33], [34], [35], [36], [37], [38], [39], graph into multiple subgraphs by clipping edges so as to
[40], and understanding the nature of GNNs [8], [9], [41], capture different graph aspects.
[42], [43], [44]. Several works have proved that GNNs essen- We notice that our work shares a similarity with Fac-
tially behave as a low pass filter that smooths information torGCN [35]: to learn multiple subgraphs from the original
within node surrounding [6], [7], [16], [45]. In line with this network topology for disentangling features. Nevertheless,
view, [9] and [8] further show that a number of GNN mod- there are three main differences. First, FactorGCN could
els, such as GCN [33], SGC [6], GAT [24], and APPNP [46], assign one edge to multiple groups, i.e., the factorized
can be seen as different optimization solvers to a graph subgraphs may share overlapped edges, while our ES-GNN
signal denoising problem with a smoothness assumption upon employs an edge splitting to partition the original net-
connected nodes. All these results indicate that GNNs are work topology into two mutually exclusive ones satisfying
mostly tailored for the strong homophily hypothesis on the AR + AIR = A. Second, despite the disentangling prop-
observed graphs while largely overlooking the important erty, FactorGCN merely interprets the inferred subgraphs
setting of heterophily, where node features and labels vary as different graph aspects without providing any concrete
unsmoothly on graphs. Recent studies [20], [47] also connect meanings, and the predefined number of latent factors re-
this to the over-smoothing problem [23]. quires to be tuned differently across graphs. Differently, our
To extend GNNs on heterophilic graphs, several works model adaptively produces two interptable task-relevant
leverage the long-range information beyond nodes’ proxim- and irrelevant topolgies for all kinds of input graphs. Last,
ity. Geom-GCN [31] extends the standard message passing FactorGCN models all disentangled parts towards final pre-
with geometric aggregation in latent space. H2GCN [10] diction, while we target at decoupling the task-relevant and
directly models the higher order neighborhoods for cap- task-irrelevant features whereby the classification-harmful
turing the homophily-dominant information. WRGAT [41] information can be excluded from the final predictive target
transforms the input graph into a multi-relational graph, and disentangled in the task-irrelevant parts. Experimental
for modeling structural information and enhancing the as- results also validate that our proposed model substantially
sortativity level. GEN [13] estimates a suitable graph for outperform FactorGCN on all the datasets used in the paper
GNNs’ learning with multi-order neighborhood informa- (see Section 6).
tion and Bayesian inference as guide. Another line of work
emphasizes the proper utilization of node neighbors. The
most common works employ attention mechanism [24], [48], 4 F RAMEWORK : ES-GNN
however, they are still imposing smoothness within nodes’ In this section, we propose an end-to-end graph learning
neighborhood albeit on the important members only [7], framework, ES-GNN, generalizing Graph Neural Networks
[8], [9]. Compared to that, FAGCN [18] adaptively models (GNNs) to arbitrary graph-structured data with either ho-
both similarities and dissimilarities between adjacent nodes. mophilic or heterophilic properties. An overview of ES-
GPR-GNN [11] introduces a universal polynomial graph GNN is given in Fig. 2. The central idea is to integrate
filter, by associating different hop neighbors with learnable GNNs with an interpretable edge splitting (ES) layer that
4

Next Layer ZR ← Z′R , ZIR ← Z′IR

Task-Relevant Topology

GNN ( ZR , AR )
{A , ZR , ZIR } Z′R Prediction
Task

λICR LICR + Lpred → L


Feature
A, X Irrelevant
Consistency
Projection Task-Irrelevant Topology

GNN ( ZIR , AIR )


Z′IR

Edge Splitting Layer Aggregation Layer

Fig. 2. Illustration of our ES-GNN framework where A and X denote the adjacency matrix and feature matrix of nodes, respectively. First, X
is projected onto different latent subspaces via different channels R and IR. An edge splitting is then performed to divide the original graph
edges into two exclusive sets. After that, the node information can be aggregated individually and separately on different edge sets to produce
disentangled representations, which are further utilized to make an more accurate edge splitting in the next layer. The task-relevant representation

ZR is reasonably granted for prediction. Meanwhile, an Irrelevant Consistency Regularization (ICR) is developed to further reduce the potential
task-harmful information from the final predictive target.

adaptively partitions the network topology as guide to This hypothesis is assumed without losing generality to
disentangle the task-relevant and irrelevant node features. both homophilic and heterophilic graphs. For a homophilic
scenario, e.g., in citation networks, scientific papers tend to
cite or be cited by others from the same area, and both of
4.1 Edge Splitting Layer them usually possess the common keywords uniquely ap-
pearing in their topics. For a heterophilic scenario, students
The goal of this layer is to infer the latent relations underly- having different interests are likely be connected because of
ing adjacent nodes on the observed graph, and distinguish the same classes and/or dormitory they take and/or live in,
between graph edges which could be relevant or irrelevant but neither has direct relation to the clubs they have joined.
to learning tasks. Given a simple graph with an adjacency This inspires us to classify graph edges by measuring the
matrix A and node feature matrix X, an ES-layer splits the similarity between adjacent nodes in two different aspects,
original graph edges into two exclusive sets, and thereby i.e., a graph edge is more relevant to classification task if the
produces two partial network topologies with adjacency connected nodes are more similar in their task-relevant fea-
matrices AR , AIR ∈ R|V|×|V| satisfying AR + AIR = A. We tures, or otherwise. Our experimental analysis in Section 6.6
would expect AR storing the most correlated graph edges further provides evidences that even when our Hypothesis 1
to the classification task, of which the rest is excluded and may not hold, most adversarial edges (considered as the
disentangled in AIR . Therefore, analyzing the correlation task-irrelevant ones) can still be recognized though neither
between node connections and learning tasks comes into types of node similarity exists.
the first step. It is worthy mentioning that our hypothesis is not in
However, existing techniques [18], [24], [25] mainly pa- contradiction to the “opposites attract”, which could be in-
rameterize graph edges with node similarity or dissimi- tuitively explained by linking due to different but matching
larity, while failing to explicitly correlate them with the attributes. We believe the inherent cause to connection even
prediction target. Even worse, as the assortativiy of real- in “opposites attract” may still be certain commonalities. For
world networks is usually agnostic and node features are example, in heterosexual dating networks, people of the op-
typically full of noises, the captured similarity/dissimilarity posite sex are most likely connected because of their similar
may not truly reflect the label-agreement/disagreement be- life values. Although these similarities may be inappropriate
tween nearby nodes. Consequently, the harmful-similarity (or even harmful) in distinguishing genders, modeling and
between pairwise nodes from different classes could be mis- disentangling them from the final predictive target might be
takenly preserved for prediction. To this end, we present one still of great importance.
plausible hypothesis below, whereby the explicit correlation An ES-layer consists of two channels to respectively
between node connections and learning tasks is established extract the task-relevant and irrelevant information from
automatically. nodes. As only the raw feature matrix X is provided in the
beginning, we will project them into two different subspaces
Hypothesis 1. Two nodes get connected in a graph mainly due before the first ES-layer:
to their similarity in some features, which could be either relevant
or irrelevant (even harmful) to the learning task. Z(0) T
s = σ(Ws X + bs ), (3)
5

f× d d
Algorithm 1 Framework of ES-GNN
where Ws ∈ R 2and bs ∈ R are the learnable parame-
2

ters in channel s ∈ {R, IR}, d is the number of node hidden Input: nodes set: V , edge set: E , adjacency matrix: A ∈
states, and σ is a nonlinear activation function. R|V|×|V| , node feature matrix: X ∈ R|V |×f , the number
Given Hypothesis 1, a graph edge should be classified of layers: K , scaling parameters: {ϵR , ϵIR }, irrelevant
into the task-relevant set if the connected nodes display a consistency coefficient: λICR , and ground truth labels on
higher similarity in the corresponding channel, and other- the training set: {yi ∈ RC |∀vi ∈ Vtrn }.
wise. However, introducing metrics between nearby nodes Param: WR , WIR ∈ Rf ×d , WF ∈ Rd×C , bF ∈ RC , {g(k) ∈
to learn AR and AIR independently may fail to model the R1×2d |k = 0, 1, ..., K − 1}
complex interaction between different channels, and also 1: // Project node features into two subspaces.
lose emphasis on topology difference. Therefore, in case of 2: for s ∈ {R, IR} do
A(i,j) = 1, we parameterize the residual between AR(i,j) (0)
3: Zs ← σ(WsT X + bs ).
and AIR(i,j) , and solving the linear equation: 4:
(0) (0)
Zs ← Dropout(Zs ) // Enabled only for training.
(
AR(i,j) − AIR(i,j) = αi,j 5: end for
. 6: // Stack Edge Splitting and Aggregation Layers.
AR(i,j) + AIR(i,j) = 1
7: for layer number k = 0, 1, ..., K − 1 do
1+α 1−α
This gives us AR(i,j) = i,j
and AIR(i,j) = i,j
with 8: // Edge Splitting Layer.
2 2
αi,j ∈ (−1, 1). To effectively incorporate all the channel 9: Initialize AR , AIR ∈ R|V|×|V| with zeros.
information into the coefficient αi,j , we propose a residual 10: for (vi , vj ) ∈ E do
h iT
scoring mechanism: (k) (k) (k) (k)
11: αi,j ← tanh(g(k) ZR[i,:] ⊕ ZIR[i,:] ⊕ ZR[j,:] ⊕ ZIR,[j,:] ).
 T
αi,j = tanh(g ZR[i,:] ⊕ ZIR[i,:] ⊕ ZR[j,:] ⊕ ZIR[j,:] ). (4) 12: αi,j ← Dropout(αi,j ) // Enabled only for training.
1+α 1−α
Here, both of the task-relevant and irrelevant node fea- 13: AR(i,j) ← 2 i,j , AIR(i,j) ← 2 i,j .
tures are first concatenated and convoluted by learnable 14: end for
g ∈ R1×2d , and then passed to the tangent activation 15: // Aggregation Layer.
function to produce a scalar value within (−1, 1). To further 16: for s ∈ {R, IR} do
(k+1) (0) −1 − 1 (k)
strengthen the discreteness property of (or exclusiveness 17: Zs ← ϵs Zs + (1 − ϵs )Ds 2 As Ds 2 Zs .
between) AR and AIR , one can apply techniques, such as 18: end for
softmax with temperature in Eq. (5), Gumbel-Softmax [58], 19: end for
[59] in Eq. (6), or threholding in Eq. (7). 20: // Prediction.
(K)
′ exp(As(i,j) /τ ) 21: ŷi = softmax(WFT ZR[i,:] + bF ), ∀vi ∈ V .
As(i,j) = P (5) 22: // Optimization with Irrelevant Consistency Regularization.
κ∈{R,IR} exp(Aκ(i,j) /τ )
′ exp((log(As(i,j) ) + γ)/τ ) 23: LICR = (vi ,vj )∈E (1 − δ(ŷi , ŷj ))∥ZIR[i,:] − ZIR[j,:] ∥22 .
P
As(i,j) =P (6)
κ∈{R,IR} exp((log(Aκ(i,j) ) + γ)/τ ) 24: Lpred = − |V1 | T
P
( trn i∈Vtrn yi log(ŷi ).
′ 1 As(i,j) > 0.5 25: Minimize Lpred + λICR LICR .
As(i,j) = (7)
0 otherwise
where s ∈ {R, IR}, τ is a hyper-parameter mediating dis-
creteness degree, and γ ∼ Gumbel(0, 1) is a Gumbel random 4.3 Irrelevant Consistency Regularization
variable. However, in this work, we find good results with-
out adding any additional discretization techniques, and
will leave this investigation to the future work. Stacking ES-layer and aggregation layer iteratively lends
itself to disentangling different features of nodes into the
4.2 Aggregation Layer task-relevant and irrelevant representations, denoted by ZR
As the split network topologies disclose the partial relations and ZR respectively. First, ZR is granted for prediction
among nodes in different latent spaces, they can be utilized and gradually trained by the supervision signals from the
to aggregate information for learning different node aspects. classification loss. However, only supervising one channel
Specifically, we leverage a simple low-pass filter with scal- (R) may not also guarantee the meaningfulness of the other
ing parameters {ϵR , ϵIR } for both task-relevant and irrelevant (IR), which possibly results in inaccurate disentanglement.
channels, from the k th to k + 1th layer: The confounding and erroneous information could then
−1 −1 be mistakenly preserved for prediction. To this end, we
Z(k+1)
s = ϵs Z(0) 2 2 (k)
s + (1 − ϵs )Ds As Ds Zs . (8) propose to regulate ZIR for modeling the opposite of ZR ,
s ∈ {R, IR} denotes the task-relevant or irrelevant channel, i.e., the classification-harmful information hidden in the
and Ds is the degree matrix associated with the adjacency observed graph.
matrix As . Derivation of Eq. (8) is detailed in our theoretical To attain this, we develop Irrelevant Consistency Regu-
analysis. Importantly, by incorporating proximity informa- larization (ICR) that imposes a concrete meaning on ZIR . The
tion in different structural spaces, the task-relevant and rationale is to incorporate the potential label-disagreement
(k+1)
irrelevant information can be better disentangled in ZR between adjacent nodes into ZIR . Given any two connected
(k+1)
and ZIR , based on which the next ES-layer can make a nodes vi and vj , we would expect ZIR[i,:] and ZIR[j,:] to be
more precise partition on the raw topology. similar if they share a distinct label. Specifically, our ICR can
6

TABLE 1 5 T HEORETICAL A NALYSIS


Time complexity of the comparison models with one hidden layer as an
example. Ne denotes the number of graph aspects assumed in In this section, we investigate two important problems: (1)
FactorGCN [35], Dmax represents the maximum node degree, and |E2 | what limits the generalization power of the conventional
is the total number of neighbors in the second hop of nodes. Other GNNs on graphs beyond homophily, and (2) how the pro-
symbols are earlier defined in the texts.
posed ES-GNN breaks this limit and performs well on dif-
ferent types of networks. We will answer these questions by
Models Complexity first analyzing the typical GNNs as graph signal denoising
GCN [33] O((f + C)|E|d) from a more generalized viewpoint, and then impose our
GAT [24] O(((2 + f )|V| + (4 + C)|E|)d) Hypothesis 1 to derive ES-GNN.
FactorGCN [35] O(Ne |V| + (|V|f + (3 + C)|E|)d)
H2GCN [10] O(f d + |E|Dmax + (|E| + |E2 |)d)
FAGCN [18] O(((1 + C + f )|V| + |E|)d) 5.1 Limited Generalization of Conventional GNNs
GPR-GNN [11] O((f |V| + |E|C)d) Recent studies [8], [9] have proved that most GNNs can be
ES-GNN (Ours) O(((1 + C + f )|V| + |E|)d) regarded as solving a graph signal denoising problem:

arg min ∥Z − X∥22 + ξ · tr(ZT LZ), (10)


Z
be formulated as:
X where X ∈ R|V|×f is the input signal, L = D−A ∈ R|V|×|V|
LICR = (1 − δ(yi , yj ))∥ZIR[i,:] − ZIR[j,:] ∥22 , is the graph laplacian matrix, and ξ is a constant coefficient.
(vi ,vj )∈E The first term guides Z to be close to X, while the sec-
where δ is a Kronecker function returning 1 if yi = yj ond term tr(ZT LZ) is the laplacian regularization, which
and 0 otherwise, and ∥ · ∥2 denotes l2 norm. By doing so, enforces smoothness between connected nodes. One funda-
ZIR is constrained with a local consistency between adjacent mental assumption made here is that similar nodes should
nodes from different classes. As a benefit, the classification- have a higher tendency to connect each other, and we refer
harmful similarity between nodes can be further excluded it as standard smoothness assumption on graphs. However,
from ZR , and disentangled in ZIR . real-world networks typically exhibit diverse linking pat-
terns of both assortativity and disassortativity. Constrain-
Several powerful techniques [25], [60] have been de-
ing smoothness on each node pair is prone to mistakenly
veloped to measure the label-agreement between pairwise
preserve both of the task-relevant and irrelevant (or even
nodes. In this work, however, we find that using directly the
harmful) information for prediction. Given that, we divide
joint probability from model prediction works well, which
the original graph into two subgraphs with the same nodes
also offers advantages in low computational complexity as
sets but exclusive edge sets, and reformulate Eq. (10) as:
no additional trainable parameters are required.
arg min ∥Z − X∥22 + ξ · tr(ZT LR Z) + ξ · tr(ZT LIR Z).
Z
4.4 Overall Algorithm
Here, LR = DR − AR , and LIR = DIR − AIR , where the
The overall pipeline of ES-GNN is detailed in Algorithm 1. task-relevant and irrelevant node relations are separately
Specifically, we adopt ReLU activation function in Eq. (3) captured in AR and AIR . Clearly, emphasizing the common-
to first map node features into two different channels, and ality between adjacent nodes in AR is beneficial for keeping
then pass them with the adjacency matrix to an ES-layer for task-correlated information only. However, smoothing node
splitting the raw network topology into two exclusive parts. pairs in AIR simultaneously may preserve classification-
After that, these two partial network topologies are utilized harmful similarity between nodes, thus limiting the predic-
to aggregate information in different structural spaces. Al- tion performance of GNNs.
ternatively stacking ES-layer and aggregation layer not only
enables more accurate disentanglement but also explores
the graph information beyond local neighborhood. Finally, 5.2 Disentangled Smoothness Assumption in ES-GNN
a fully connected layer is appended to project the learned Our Hypothesis 1 suggests that the original graph topol-
representations into class space RC . We integrate LICR into ogy can be partitioned into two exclusive ones, wherein
the optimization process with a irrelevant consistency co- connected nodes displays high similarity with either task-
efficient λICR toPhave final objective function below, where relevant or irrelevant features only. We further interpret this
Lpred = − |V1trn | vi ∈Vtrn yiT log(ŷi ). result as disentangled smoothness assumption, based on which
the conventional graph signal denoising problem in Eq. (10)
L = Lpred + λICR LICR . (9) can be generalized as:
Finally, we also report in Table 1 the complexity of the arg min ∥ZR − XIR ∥22 + ∥ZIR − XIR ∥22
proposed ES-GNN method in comparison with the state- ZR ,ZIR
of-the-arts which will be evaluated in the experimental + ξ · tr(ZTR LR ZR ) + ξ · tr(ZTIR LIR ZIR )
section. Clearly, our model displays the same complexity
to FAGCN [18] while being slightly overhead compared to where LR = DR − AR , LIR = DIR − AIR (11)
GPR-GNN [11]. Here, we omit the related works, GEN [13]
s.t. AR + AIR = A
and WRGAT [14], as their complexity is obviously higher
than others by involving reconstructing the whole graph. AR(i,j) , AIR(i,j) ∈ [0, 1].
7
(K)
Here, AR(i,j) and AIR(i,j) measure the degree to which the and ZIR ,we minimize the prediction loss Lpred and the
node connection (vi , vj ) are relevant and irrelevant to the Irrelevant Consistency Regularization LICR in Eq. (9) with
learning task, respectively. We further name this optimiza- Adam [61] algorithm, which imposes concrete meanings on
tion as disentangled graph denoising problem, and finally derive different channels, and simultaneously ensures the conver-
the following theorem: gence of our described alternative learning.
Theorem 1. The proposed ES-GNN is equivalent to the solution Lemma 1. When adopting the normalized laplacian matrix LR =
of the disentangled graph denoising problem in Eq. (11). −1 −1
I−DR 2 AR DR 2 , the feature aggregation operator in Eq. (8) with
d d
Proof. Let XR ∈ R 2 and XIR ∈ R 2 be the results of mapping channel s = R can be regarded as solving Eq. (14) using iterative
1
(0) gradient descent with stepsize β = 2+2ξ and ξ = ϵ1R − 1.
X into different channels in Eq. (3), i.e., XR = ZR and
(0)
XIR = ZIR . Hypothesis 1 motivates us to define AR(i,j) and Proof. We take iterative gradient descent with the stepsize β
AIR(i,j) as node similarity in two aspects. Combining above to solve the denoising problem in Eq. (14) (referred as LR )
constraints, we have a linear system in case of A(i,j) = 1: as follows:
(k+1) (k) ∂LR
( ZR = ZR − β · | (k)
AR(i,j) + AIR(i,j) = 1 ∂Z∗R Z∗R =ZR
, −1 −1
AR(i,j) − AIR(i,j) = ϕres (ZR[i,:] , ZIR[i,:] , ZR[j,:] , ZIR[j,:] ) (0) (k)
= 2βZR + 2βξ(DR 2 AR DR 2 )ZR + (1 − 2β − 2βξ)ZR .
(k)

where ϕres (·) outputs the residual between AR(i,j) and


AIR(i,j) considering both task-relevant and irrelevant node Setting β as 1
gives us:
2+2ξ
information, and can be formulated with our residual scor-
ing mechanism in Eq. (4). Solving above equations, we can (k+1) 1 (0) ξ −1 −1 (k)
ZR = ZR + (DR 2 AR DR 2 )ZR ,
express both AR and AIR in terms of ZR and ZIR , i.e., 1+ξ 1+ξ
1
1 + ϕres (ZR[i,:] , ZIR[i,:] , ZR[j,:] , ZIR[j,:] ) which is equivalent to Eq. (8) while choosing ξ = ϵR − 1, i.e.,
AR(i,j) = (12)
2 (k+1) (0) − 21 − 12 (k)
1 − ϕres (ZR[i,:] , ZIR[i,:] , ZR[j,:] , ZIR[j,:] ) ZR = ϵR ZR + (1 − ϵR )(DR AR DR )ZR .
AIR(i,j) = . (13)
2
So far, the optimization problem in Eq. (11) is only made up
As the possible classification-harmful similarity between
of variables XR , XIR , ZR , and ZIR . Directly solving it is still
nodes (hidden in AIR ) can be excluded from ZR and dis-
however not easy, as the mixing variables of ZR and ZIR ,
entangled in ZIR while optimizing Eq. (11), our ES-GNN
and the introduced non-linear operator in ϕres(·) result in a
presents a universal approach that theoretically guarantees
complicated differentiation process.
good performance on different types of networks.
Instead, we can approach this problem by decoupling
the learning of AR , AIR from the optimization target, and
employ an alternative learning between stages. Suppose 6 E XPERIMENTS
we have attained the task-relevant and irrelevant node We empirically evaluate our ES-GNN for node classification
(k) (k)
features in the k th round, i.e., ZR and ZIR . In the using both synthetic and real-world datasets in this section.
(k+1) (k+1)
first stage, we can compute AR(i,j) and AIR(i,j) using
(k) (k) (k) (k)
{ZR[i,:] , ZIR[i,:] , ZR,[j,:] , ZIR[j,:] } with Eq. (12) and Eq. (13), 6.1 Datasets & Experimental Setup
which in fact turns out to be our ES-layer in Section 4.1. 6.1.1 Real-World Datasets
In the second stage, injecting the computed values of
(k+1) (k+1) We consider 11 widely used benchmark datasets includ-
AR(i,j) and AIR(i,j) relaxes the mixture of variables ZR ing both seven heterophilc graphs, i.e., Chameleon, Squir-
and ZIR , and the original optimization problem can then rel [27], Wisconsin, Cornell, Texas [31] (webpage networks),
be disentangled into two independent targets (as all four Actor [32] (co-occurrence network), and Twitch-DE [27],
penalized terms are positive): [28] (social network), as well as four homophilic graphs
including Cora, Citeseer, Pubmed [26] (citation networks),
(0)
arg min ∥Z∗R − ZR ∥22 + ξ · tr(Z∗R T LR Z∗R )
(k)
(14) and Polblogs [29], [30] (community network) with statistics
Z∗
R
shown in Table 2.
(0) (k)
arg min ∥Z∗IR − ZIR ∥22 + ξ · tr(Z∗IR T LIR Z∗IR ) (15)
Z∗ 6.1.2 Synthetic Data
IR

(k) (k) (k) (k) (k) (k)


To investigate the behavior of GNNs on graphs with ar-
where LR = DR − AR and LIR = DIR − AIR are fixed bitrary levels of homophily and heterophily, we construct
values. Lemma 1, on the R channel as an example, further synthetic graphs with our Hypothesis 1. The central idea is
shows that our aggregation layer, on the task-relevant and to define links among nodes under two conditions indepen-
irrelevant topologies, in Section 4.2 is approximately solving dently, of which only one is correlated with the classification
these two optimization problems in Eq. (14) and Eq. (15). task. We consider 1,200 nodes, 3 equal-size classes, and
Therefore, stacking ES- and aggregation layers itera- 500 node features made up of both explicit and implicit
tively is equivalent to the above alternative learning for attributes. The explicit attributes depend on the label as-
solving the disentangled graph denoising problem in Eq. (11) signment, while the implicit ones model dependency across
(0) (0) (K)
with XR = ZR and XIR = ZIR . Finally, given ZR different classes. Fig. 3 further illustrates their allocation to
8

TABLE 2
Statistics of real-world datasets, where H and Ĥ (considering class-imbalance problem) provide indexes of graph homophily ratio as respectively
defined in Eq. (1) and Eq. (2). It can be observed that, despite the relative high homophily level measured by H = 0.632, the Twitch-DE dataset
with class-imbalance problem is essentially a heterophilic graph [28] as suggested by Ĥ = 0.139. For Polblogs dataset, since node features are
not provided, we directly use the rows of the adjacency matrix.

Heterophilic Graphs Homophilic Graphs


Datasets
Squirrel Chameleon Wisconsin Cornell Texas Twitch-DE Actor Cora Citeseer Pubmed Polblogs
H 0.222 0.230 0.178 0.296 0.061 0.632 0.217 0.810 0.735 0.802 0.906
Ĥ 0.025 0.062 0.094 0.047 0.001 0.139 0.011 0.766 0.627 0.664 0.811
# Nodes 5,201 2,277 251 183 183 9,498 7,600 2,708 3,327 19,717 1,222
# Edges 217,073 36,101 499 295 309 153,138 33,544 5,429 4,732 44,338 16,714
# Features 2,089 2,325 1,703 1,703 1,703 2,514 931 1,433 3,703 500 /
# Classes 5 5 5 5 5 2 5 7 6 3 2

Step-1 Step-2 Step-3 PE TABLE 3


PI Parameter setting for constructing synthetic graphs with different
...

...
...

...
...
...

homophily ratios Hsyn .


... ... ... ... ... ...

... ... ... ... ... ... Hsyn 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
PE 0.02 0.06 0.1 0.2 0.4 0.4 0.6 0.7 0.8 0.9 0.96
Homophilic Pattern pE ≫ pI PI 0.72 0.81 0.6 0.7 0.9 0.6 0.6 0.45 0.3 0.15 0.045
Heterophilic Pattern pE ≪ pI
Explicit Attributes Implicit Attributes ω 0.1 0.084 0.1 0.075 0.05 0.062 0.05 0.05 0.05 0.05 0.051

Fig. 3. Constructing synthetic graphs with arbitrary levels of homophily


and heterophily. Shape and color of nodes respectively illustrate the with n being the total number of nodes. Clearly, we have
explicit and implicit node attributes. Nodes with the same shape or color Hsyn → 0 while PI ≫ PE , and Hsyn → 1 while PI ≪ PE .
are connected with a probability of PE or PI , independently, while they
are only classified by their shapes (the explicit attributes) into three To avoid possible computational overhead, we also need to
categories. Obviously, we can observe heterophilic graph pattern given control the average node degree of our synthetic graphs.
PE ≪ PI , and strong homophily otherwise. Similarly, we can approximately derive it as the function of
PI and PE :
n−3 4n
nodes in the step-2 with shape and color as an example. T (PE , PI ) = PE + PI . (17)
3 9
Notably, all these attributes in six types (three explicit and
From Eq. (16) and Eq. (17), we have that Hsyn (·) is a function
three implicit ones) are randomly sampled from different
of the fraction between PE and PI with fixed n, and T (·)
Gaussian distributions, each pair of them are combined
is linearly correlated with PE and PI . As such, given fixed
via element-wise addition to attain the final node features.
PE and PI attaining certain Hsyn , we can almost attain the
For instance, the features of a node (from class-i) with
average node degree in any values with a scaling parameter
explicit attribute-i and implicit attribute-j are defined as the
ω , i.e., average degree = ω · T (PE , PI ) = T (ω · PE , ω · PI )
addition of two random vectors respectively sampled from
without changing Hsyn . In this work, we tune all these
N (µE,i , σ E,i ) and N (µI,j , σ I,j ), where µE,i , µI,j ∈ Rfsyn are
parameters such that the average degree is around 20, and
means, σ E,i , σ I,j ∈ Rfsyn ×fsyn are the associated covariance
list the tested values in Table 3.
matrixes, and fsyn = 500 is the feature dimensions.
After that, inspired by the Erdős-Rényi random graphs, 6.1.3 Data Splitting
we connect nodes with probability PE if they are from the For heterophilic graphs and our synthetic graphs, we di-
same class (the task-relevant condition), with probability PI vides each dataset into 60%/20%/20% corresponding to
if they share different labels but posses implicit attributes training/validation/testing to follow [10], [11], [31]. For ho-
from the same distribution (the task-irrelevant condition), mophilic graphs, we adopt the popular sparse splitting [6],
as shown in Fig. 3 (see step-3). For all other cases, we [24], [33], i.e., 20 nodes per class, 500 nodes, and 1,000
connect nodes with probability q in a small value, 1e−5 in nodes to train, validate, and test models. For each dataset,
this work for ensuring a connected graph. Since no class- 10 random splits are created for evaluation.
imbalance problem exists here, the homophily ratios of our
generated graphs are measured with Eq. (1). Intuitively, 6.1.4 Baselines
we could anticipate heterophilic connecting pattern when We compare our ES-GNN with 9 baselines including the
setting PE ≪ PI , and strong homophily otherwise. Quanti- state-of-the-art GNNs: 1) GCN [33] adopts Chebyshev ex-
tatively, the relationship between the homophily ratio Hsyn pansion to approximate the graph laplacian efficiently; 2)
and parameters PE , PI can be derived with the simple SGC [6] simplifies GCN [33] by removing non-linearity; 3)
knowledge on combinatorics and statistics while omitting GAT [24] employs an attention mechanism to adaptively
the small value of q : utilize neighborhood information; 4) FactorGCN [35]; 5)
GEN [13]; 6) WRGAT [14]; 7) H2GCN [10]; 8) FAGCN [18];
3(n − 3) 9) GPR-GNN [11], of which baselines from 4) to 9) have been
Hsyn (PE , PI ) = , (16)
3(n − 3) + 2n PPEI briefly introduced in the Section 3.2.
9

TABLE 4
Node classification accuracies (%) over 100 runs. Error Reduction gives the average improvement of our ES-GNN upon the second place models,
which are explicitly designed for heterophilic graphs.

Heterophilic Graphs Homophilic Graphs


Datasets
Squirrel Chameleon Wisconsin Cornell Texas Twitch-DE Actor Cora Citeseer Pubmed Polblogs
GCN [33] 55.2±1.5 67.6±2.0 59.5±3.6 52.8±6.0 61.7±3.7 74.0±1.2 31.2±1.3 79.7±1.2 69.5±1.7 78.7±1.6 89.4±0.9
SGC [6] 50.7±1.3 61.9±2.6 53.7±3.9 51.2±0.9 51.4±2.2 73.9±1.3 30.9±0.6 79.1±1.0 69.9±2.0 76.6±1.3 89.0±1.5
GAT [24] 54.8±2.2 67.3±2.2 57.9±4.5 50.4±5.9 55.4±5.9 73.7±1.3 30.5±1.2 82.0±1.1 69.9±1.7 78.6±2.0 87.4±1.1
FactorGCN [35] 56.6±2.4 69.8±2.0 64.2±4.8 50.6±1.8 69.5±6.5 73.1±1.4 29.0±1.4 75.2±1.6 61.6±2.0 72.9±2.3 87.9±1.7
GEN [13] 36.0±4.0 57.6±3.1 83.3±3.6 81.0±3.9 78.3±8.0 74.1±1.4 37.3±1.4 79.8±1.3 69.7±1.6 78.9±1.7 89.6±1.4
WRGAT [14] 39.6±1.4 57.7±1.6 82.9±4.5 79.2±3.5 80.5±6.1 70.0±1.3 38.6±1.1 71.7±1.5 64.1±1.9 73.3±2.1 88.2±1.2
H2GCN [10] 45.1±1.9 62.9±1.9 82.6±4.0 79.6±4.9 79.8±7.3 73.1±1.5 38.4±1.0 81.4±1.4 68.7±2.0 78.0±2.0 89.0±1.0
FAGCN [18] 50.4±2.6 68.9±1.8 82.3±4.4 79.4±5.5 80.3±5.5 74.1±1.4 37.9±1.0 82.6±1.3 70.3±1.6 80.0±1.7 89.3±1.1
GPR-GNN [11] 54.1±1.6 69.6±1.7 82.7±4.1 79.9±5.3 81.7±4.9 74.0±1.6 38.0±1.1 81.5±1.5 69.6±1.7 79.8±1.3 89.5±0.8
ES-GNN (ours) 62.4±1.4 72.3±2.1 85.3±4.6 82.2±4.0 82.3±5.7 74.7±1.1 38.9±0.8 83.0±1.1 70.7±1.7 80.7±1.4 89.7±0.9
Error Reduction 17.4% 9.0% 2.5% 2.4% 2.2% 1.6% 0.9% 3.6% 2.2% 2.7% 0.6%

TABLE 5
Edge Analysis of our ES-GNN on synthetic graphs with various
homophily ratios. Removed Het. gives the percentage (%) of
heterophilic node connections excluded from the task-relevant topology
and disentangled in the task-irrelevant topology. The last two rows give
the corresponding node classification accuracies (%) of ES-GNN and
its variant while ablating ES-layer.

Hsyn 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 Avg.
Removed Het. 41.9 53.2 60.8 70.4 74.2 80.7 86.7 87.8 89.9 71.7
ES-GNN 90.0 69.6 62.1 69.6 85.4 93.8 98.3 99.2 100.0 85.3
ES-GNN w/o ES 84.6 57.9 53.3 53.8 74.2 81.7 86.3 90.4 96.7 75.4

state-of-the-art performance on all the eleven datasets, and


Fig. 4. Results of different models on synthetic graphs with varied consistently outperforms all the baselines including four
homophily ratios, where ES-GNN constantly outperform all the base- popular graph neural network models and five recent state-
lines including conventional GNNs and the state-of-the-arts explicitly of-the-arts which explicitly considers heterophily on graphs.
designed for heterophilic graphs.
Specifically, compared to the second place models, our
method achieves significant performance gains by 17.4%
6.1.5 Implementation Details and 9.0% separately on the heterophilic graphs Squirrel and
Chameleon, and we have the relative error reductions of
For all the baselines and our model, we set d = 64 2.5%, 2.4%, and 2.2% on Wisconsin, Cornell, and Texas,
as the number of hidden states for fair comparison, and respectively. For Actor and Twitch-DE, ES-GNN wins by
tune the hyper-parameters on the validation split of each an average margin of 1.3%. We notice that FactorGCN
dataset using Optuna [62] for 200 trials. With the best surprisingly has relative good performance on Squirrel and
hyper-parameters, we train models in 1,000 epochs using Chameleon datasets. This phenomenon can be explained
the early-stopping strategy with a patience of 100 epochs. by its ability on separating channels for learning disentan-
We then report the average performance in 10 runs on gled graph aspects, which further verifies our speculation
the test set for each random split. For reproducibility, we in Section 1, i.e., the different types of information are
provide the searching space of our hyper-parameters: learn- typically mixed and entangled in the node neighborhood
ing rate ∼ [1e−2, 1e−1], weight decay ∼ [1e−6, 1e−3], under heterophily. However, FactorGCN dose not continue
dropout ∼ {0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8}, the num- to perform well on other five heterophilic graphs, mainly
ber of layers K ∼ {1, 2, 3, 4, 5, 6, 7, 8}, scaling parameter because it fails to distinguish between the useful and useless
ϵR , ϵIR ∼ {0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1}, and irrel- (even harmful) channels, and takes the whole parts for
evant consistency coefficient λICR ∼ [0, 1] for Cora, Citeseer, classification. For graphs with strong homophily, ES-GNN
Pubmed, and Twitch-DE, [5e−8, 5e−6] for Chameleon, Wis- maintains competitiveness and exhibits averagely 2.3% su-
consin, Cornell, and Texas, [5e−5, 5e−3] for Squirrel, and periority upon the state-of-the-art models. We will show
[5e−3, 5e−2] for Actor. Our implementation can be found at our ES-GNN could demonstrate remarkable robustness on
https://github.com/jingweio/ES-GNN. homophilic graphs in case of perturbation or noisy links in
Section 6.6.
6.2 Results on Real-World Graphs
Table 4 summaries node classification accuracies on real- 6.3 Results on Synthetic Graphs
world datasets in 100 runs with multiple random splits and We examine the learning ability of various models on graphs
different model initializations. In general, ES-GNN achieves across the homophily or heterophily spectrum. From Fig. 4,
10

(a) Chameleon (b) Cora (c) Hsyn = 0.1 (d) Hsyn = 0.5 (e) Hsyn = 0.9

Fig. 5. Feature correlation analysis. Two distinct patterns (task-relevant and task-irrelevant topologies) can be learned on Chameleon with H = 0.23,
while almost all information is retained in the task-relevant channel (0-31) on Cora with H = 0.81. On synthetic graphs in (c), (d), and (e), block-
wise pattern in the task-irrelevant channel (32-63) is gradually attenuated with the incremental homophily ratios across 0.1, 0.5, and 0.9. ES-GNN
presents one general framework which can be adaptive for both heterophilic and homophilic graphs.

(a) Cora (b) Citeseer (c) Pubmed (d) Polblogs

Fig. 6. Results of different models on perturbed homophilic graphs. ES-GNN is able to identify the falsely injected (the task-irrelevant) graph edges,
and exclude these connections from the final predictive learning, thereby displaying relative robust performance against adversarial edge attacks.

we have the following observations: (1) Looking through graph links, and makes prediction with the most correlated
the overall trend, we obtain a “U” pattern on graphs from features only. We further provide detailed analyses in the
the lowest to the highest homophily ratios. That suggests following sections.
GNNs’ prediction performance is not monotonically cor-
related with graph homophily levels in a strict manner. 6.4 Edge Analysis
When it comes to the extreme heterophilic scenario, GNNs
We analyze the split edges from our ES-layer using syn-
tend to alternate node features completely between dif-
thetic graphs as an example in this section. According to
ferent classes, thereby still making nodes distinguishable
Section 6.1.2, the synthetic edges are defined as the task-
w.r.t. their labels, which coincides with the findings in [63].
relevant connections if they link nodes from the same
(2) Despite the attention mechanism for adaptively utilizing
class, and the task-irrelevant ones otherwise. Therefore, we
relevant neighborhood information, GAT turns out to be
calculate the percentages of heterophilic node connections,
the least robust method to arbitrary graphs. The entangled
which are excluded from our task-relevant topology and
information in the mixed assortativity and disassortativity
disentangled in the task-irrelevant one, so as to investigate
provides weak supervision signals for learning the attention
the discerning ability of ES-GNN between edges in different
weights. FactorGCN employs a graph factorization to disen-
types. As can be observed in Table 5, 71.7% task-irrelevant
tangle different graph aspects but still adopts all of them for
edges are identified on average across various homophily
prediction without judgement, thereby performing poorly
ratios. On the other hand, we also report the classification
especially on the tough cases of Hsyn = 0.3, 0.4, and 0.5.
accuracies of ES-GNN and its variant while ablating ES-
(3) Both FAGCN and GPR-GNN model the dissimilarity
layer, from which approximately 10% degradation can be
between nearby nodes to go beyond the smoothness as-
observed. All of these strongly validate the effectiveness
sumption in conventional GNNs, and display some superi-
of our ES-layer and reasonably interprets the good perfor-
ority under heterophily. However, the correlation between
mance of ES-GNN.
graph edges and classification tasks is not explicitly de-
fined and emphasized in their designs. In other words, the
classification-harmful information still could be preserved 6.5 Correlation Analysis
in their node dissimilarity. Experimental results also show To better understand our proposed method, we investigate
that these methods are constantly beaten by our disen- the disentangled features on Chameleon, Cora, and three
tangled approach. (4) The proposed ES-GNN consistently synthetic graphs as typical examples in Fig. 5. Clearly, on
outperforms, or matches, others across different graphs the strong heterophilic graph Chameleon with H = 0.23,
with different homophily levels, especially in the hardest correlation analysis of learned latent features displays two
case with Hsyn = 0.3 where some baselines even perform clear block-wise patterns, each of which represents task-
worse than MLP. This is mainly because our ES-GNN is relevant or task-irrelevant aspect respectively. In contrast,
able to distinguish between task-relevant and irrelevant on the citation network Cora with H = 0.81, the node
11

Fig. 7. Ablation study of ES-GNN on eight datasets in node classification.

(a) Squirrel (b) Chameleon (c) Twitch-DE (d) Actor

(e) Cora (f) Citeseer (g) Pubmed (h) Polblogs

Fig. 8. Sensitivity analysis of coefficient λICR .

could be disentangled in the task-irrelevant topology (see


Fig. 5b). On the other hand, the results on synthetic graphs
from Fig. 5c to 5e display an attenuating trend on the second
block-wise pattern with the incremental homophily ratios
across 0.1, 0.5, and 0.9. This correlation analysis empirically
verifies that our ES-GNN successfully disentangles the task-
relevant and irrelevant features, and also demonstrates its
universal adaptivity on different types of networks.

6.6 Robustness Analysis

By splitting the original graph edge set into task-relevant and task-irrelevant subsets, our proposed ES-GNN enjoys strong robustness, particularly on homophilic graphs, since perturbed or noisy aspects of nodes can be purified from the task-relevant topology and disentangled into the task-irrelevant topology. To examine this, we randomly inject fake edges into graphs with perturbation rates from 0% to 100% in steps of 20%. Adversarially perturbed examples are generated from graphs with strong homophily, namely Cora, Citeseer, Pubmed, and Polblogs. As shown in Fig. 6, models that consider graphs beyond homophily, i.e., H2GCN, FAGCN, GPR-GNN, and our model, consistently display more robust behavior than GCN and GAT. That is mainly because fake edges may connect nodes with different labels, and consequently cause erroneous information sharing in the conventional methods.
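The injection protocol itself is straightforward. A minimal sketch (random node pairs appended to the edge list; the actual sampling details may differ, e.g., in how undirected edges and duplicates are handled) is:

```python
import torch

def inject_fake_edges(edge_index: torch.Tensor, num_nodes: int,
                      rate: float, seed: int = 0) -> torch.Tensor:
    """Append `rate * num_edges` random fake edges to a graph.

    edge_index: LongTensor [2, num_edges]; returns the perturbed edge list.
    """
    g = torch.Generator().manual_seed(seed)
    num_fake = int(rate * edge_index.shape[1])
    src = torch.randint(0, num_nodes, (num_fake,), generator=g)
    dst = torch.randint(0, num_nodes, (num_fake,), generator=g)
    keep = src != dst                                # discard self-loops
    fake = torch.stack([src[keep], dst[keep]])
    return torch.cat([edge_index, fake], dim=1)

# perturbation rates used in this experiment:
# for rate in (0.0, 0.2, 0.4, 0.6, 0.8, 1.0): ...
```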

On the other hand, our ES-GNN beats all the state-of-the-arts by an average margin of 2% to 3% on Citeseer, Pubmed, and Polblogs, while displaying roughly the same results on Cora. We attribute this to the capability of our model in associating node connections with the learning task. Take the Pubmed dataset as an example: we investigate the learned task-relevant topologies and find that 81.0%, 73.0%, 82.1%, 83.0%, and 82.6% of the fake links get removed on adversarial graphs with perturbation rates from 20% to 100%. This offers further evidence that our ES-layer is able to distinguish between task-relevant and irrelevant node connections. Therefore, despite a large number of false edge injections, the proximity information of nodes can still be reasonably mined by our model to predict their labels. Importantly, these empirical results also indicate that ES-GNN can identify most of the task-irrelevant edges even though no clear similarity or association between the connected nodes exists in the adversarial setting.
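The removal rates above can be measured by checking which injected edges survive in the learned task-relevant topology. A sketch (assuming that topology is available as an edge_index-style tensor, and treating edges as directed pairs) is:

```python
import torch

def fake_edge_removal_rate(fake_edges: torch.Tensor,
                           relevant_edges: torch.Tensor) -> float:
    """Fraction of injected fake edges absent from the task-relevant topology.

    Both inputs are LongTensors of shape [2, num_edges]. Each (src, dst)
    pair is encoded as one integer key so membership is a single isin call.
    """
    n = int(max(fake_edges.max(), relevant_edges.max())) + 1
    fake_keys = fake_edges[0] * n + fake_edges[1]
    rel_keys = relevant_edges[0] * n + relevant_edges[1]
    removed = ~torch.isin(fake_keys, rel_keys)
    return removed.float().mean().item()
```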
Fig. 9. Classification accuracy vs. model depths: (a) Cora, (b) Citeseer, (c) Pubmed, (d) Polblogs.

6.7 Alleviating the Over-smoothing Problem

In order to verify whether ES-GNN alleviates the over-smoothing problem, we compare it with GCN and GAT by varying the number of layers in Fig. 9. It can be observed that these two baselines attain their highest results when the number of layers is around two. As the layers go deeper, the accuracies of both GCN and GAT gradually drop to a lower point. On the contrary, our ES-GNN presents a stable curve. In spite of starting from a relatively lower point, the performance of ES-GNN keeps improving as the model depth increases, and it eventually outperforms both GCN and GAT. The main reason is that our ES-GNN can adaptively utilize the proper graph edges in different layers to attain task-optimal results with enlarged receptive fields. In other words, once an edge stops passing useful information or starts passing harmful messages, ES-GNN tends to identify it and remove it from learning the task-correlated representations, thereby mitigating the over-smoothing problem.
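This per-layer filtering behavior can be pictured with a simple learned edge gate. The following is only an illustrative sketch of the idea, not the paper's exact ES-layer: each layer scores every edge from its endpoint features and softly assigns it to the task-relevant or task-irrelevant subset:

```python
import torch
import torch.nn as nn

class EdgeGate(nn.Module):
    """Illustrative per-layer edge scorer (a stand-in, not the exact ES-layer)."""

    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(2 * dim, 1)

    def forward(self, x: torch.Tensor, edge_index: torch.Tensor):
        src, dst = edge_index                        # [2, num_edges]
        pair = torch.cat([x[src], x[dst]], dim=-1)   # endpoint features
        w_rel = torch.sigmoid(self.score(pair)).squeeze(-1)
        return w_rel, 1.0 - w_rel                    # relevant / irrelevant weights
```

Under this view, an edge whose relevant weight decays toward zero at some depth is effectively excluded from task-correlated message passing at that layer, which is the behavior the depth experiment reflects.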

6.8 Channel Analysis and Ablation Study

In this section, we compare ES-GNN with its variant ES-GNN-d, which takes dual (both the task-relevant and the task-irrelevant) channels for prediction, and perform an ablation study. Fig. 7 provides the comparison on eight real-world datasets. Here, we first specify two annotations: 1) "w/o ICR": without the regularization loss LICR, and 2) "w/o ES": without the edge splitting (ES-) layer. Overall, two conclusions can be drawn from Fig. 7. First, ES-GNN is consistently better than ES-GNN-d, implying that the task-irrelevant channel indeed captures some false information: model performance degrades even with the doubled feature dimensions. Second, removing either ICR or the ES-layer from either ES-GNN or ES-GNN-d leads to a clear accuracy drop, which validates the effectiveness of our model designs.
6.9 Sensitivity Analysis of Coefficient λICR

We test the effect of the irrelevant consistency coefficient λICR by varying its value and plotting the learning performance of our model on eight real-world datasets in Fig. 8. For example, the classification accuracy on Squirrel in Fig. 8a first goes up and then gradually drops. Promising results can be attained by choosing λICR from [5e−5, 5e−3]. Similar trends can also be observed on the other datasets, where λICR is relatively robust within a wide, albeit dataset-specific, interval.
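A sweep of this kind can be scripted as below; train_and_eval is a hypothetical stand-in for one full training run of the pipeline and must be replaced accordingly:

```python
import numpy as np

def train_and_eval(lambda_icr: float) -> float:
    """Hypothetical helper: train ES-GNN once with the given coefficient
    and return validation accuracy. Replace this stub with the real run."""
    return 0.0  # placeholder

# Log-spaced sweep; the sweet spot reported above is roughly [5e-5, 5e-3].
for lam in np.logspace(-6, -1, num=6):
    acc = train_and_eval(float(lam))
    print(f"lambda_ICR={lam:.0e}  val_acc={acc:.3f}")
```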
7 CONCLUSION

In this paper, we develop a novel graph learning framework which enables GNNs to go beyond the strong homophily assumption on graphs. We establish a correlation between node connections and learning tasks through one plausible hypothesis, based on which ES-GNN is derived with an interpretable edge splitting. Our ES-GNN essentially partitions the original graph structure into the task-relevant and task-irrelevant topologies as a guide to disentangle node features, whereby the classification-harmful information can be disentangled and excluded from the final prediction target.

Theoretical analysis illustrates our motivation and offers interpretations of the expressive power of ES-GNN on different types of networks. To provide empirical verification, we conduct extensive experiments over 11 benchmark and 1 synthetic datasets. The node classification results show that ES-GNN consistently outperforms 9 competitive GNNs (including 5 state-of-the-art models explicitly designed for heterophily) on graphs with either homophily or heterophily. In particular, we also conduct analyses of the split edges, the correlation among disentangled features, model robustness, and the ablated variants. All of these results demonstrate the success of ES-GNN in distinguishing between different types of graph edges, which also validates the effectiveness of our interpretable edge splitting.

In future work, we will further explore more sophisticated designs for the edge splitting layer. Another interesting direction would be extending our learning paradigm to graph-level tasks.

ACKNOWLEDGMENTS

The work was partially supported by the following: National Natural Science Foundation of China under no. 61876155; Jiangsu Science and Technology Programme (Natural Science Foundation of Jiangsu Province) under no. BK20181189, BE2020006-4; and Key Program Special Fund in XJTLU under no. KSF-A-10, KSF-T-06, KSF-E-26, KSF-P-02, and KSF-A-01.

Jingwei Guo received the First-class (Hons) degree in Applied Mathematics from the University of Liverpool, UK, in 2018. After finishing his undergraduate study, he worked as a Research Associate at Xi'an Jiaotong-Liverpool University, China, for a year. He is currently pursuing his PhD degree at the University of Liverpool, UK. His research focuses on developing new graph neural networks and applying the techniques in various domains.

Kaizhu Huang (corresponding author) works on machine learning, neural information processing, and pattern recognition. He is currently a tenured Professor of ECE at Duke Kunshan University (DKU). Prof. Huang obtained his PhD degree from the Chinese University of Hong Kong (CUHK) in 2004. He worked at Fujitsu Research Centre, CUHK, the University of Bristol, the National Laboratory of Pattern Recognition at the Chinese Academy of Sciences, and Xi'an Jiaotong-Liverpool University from 2004 to 2022. He was the recipient of the 2011 Asia Pacific Neural Network Society Young Researcher Award. He has received best paper or book awards five times and has published extensively in journals (JMLR, Neural Computation, IEEE T-PAMI, IEEE T-NNLS, IEEE T-BME, IEEE T-Cybernetics) and conferences (NeurIPS, IJCAI, SIGIR, UAI, CIKM, ICDM, ICML, ECML, CVPR). He serves as an associate editor or advisory board member for a number of journals and book series, and has been invited as a keynote speaker at more than 30 international conferences or workshops.

Rui Zhang received the First-class (Hons) degree in Telecommunication Engineering from Jilin University of China in 2001 and the Ph.D. degree in Computer Science and Mathematics from the University of Ulster, UK, in 2007. After finishing her PhD study, she worked as a Research Associate at the University of Bradford and the University of Bristol in the UK for 5 years. She joined Xi'an Jiaotong-Liverpool University in 2012 and currently holds the position of Associate Professor. Her research interests include machine learning, data mining, and statistical analysis.

Xinping Yi received the Ph.D. degree in electronics and communications from Télécom ParisTech, Paris, France, in 2015. He is currently a Lecturer (Assistant Professor) with the Department of Electrical Engineering and Electronics, University of Liverpool, U.K. Prior to Liverpool, he was a Research Associate with Technische Universität Berlin, Berlin, Germany, from 2014 to 2017, a Research Assistant with EURECOM, Sophia Antipolis, France, from 2011 to 2014, and a Research Engineer with Huawei Technologies, Shenzhen, China, from 2009 to 2011. His main research interests include information theory, graph theory, and machine learning, and their applications in wireless communications and artificial intelligence.
