ES-GNN
Abstract—While Graph Neural Networks (GNNs) have achieved enormous success in multiple graph analytical tasks, modern variants mostly rely on the strong inductive bias of homophily. However, real-world networks typically exhibit both homophilic and heterophilic linking patterns, wherein adjacent nodes may share dissimilar attributes and distinct labels. Therefore, GNNs that smooth node proximity holistically may aggregate both task-relevant and irrelevant (even harmful) information, limiting their ability to generalize to heterophilic graphs and potentially causing non-robustness. In this work, we propose a novel edge splitting GNN (ES-GNN) framework to adaptively distinguish between graph edges either relevant or irrelevant to the learning task. This essentially transforms the original graph, dynamically, into two subgraphs with the same node set but mutually exclusive edge sets. Given these, information propagation on each subgraph and edge splitting are conducted alternately, thus disentangling the task-relevant and task-irrelevant features. Theoretically, we show that our ES-GNN can be regarded as a solution to a disentangled graph denoising problem, which further illustrates our motivation and interprets the improved generalization beyond homophily. Extensive experiments over 11 benchmark datasets and 1 synthetic dataset demonstrate that ES-GNN not only outperforms state-of-the-art methods, but is also more robust to adversarial graphs and alleviates the over-smoothing problem.
Index Terms—Graph Neural Networks, Heterophilic Graphs, Disentangled Representation Learning, Graph Mining.
1 INTRODUCTION

Several metrics have been proposed to estimate the graph homophily level. Edge homophily [10], the most popular one, is defined as the percentage of edges linking nodes of the same label:

\[ \mathcal{H} = \frac{|\{(v_i, v_j) \mid (v_i, v_j) \in \mathcal{E},\; y_i = y_j\}|}{|\mathcal{E}|}. \tag{1} \]

However, this metric may give an inaccurate estimate on datasets with the class-imbalance problem [28]. To alleviate this, a new metric is proposed by [28]:

\[ \hat{\mathcal{H}} = \frac{1}{C-1} \sum_{k=0}^{C-1} \max\!\left(h_k - \frac{|\mathcal{C}_k|}{|\mathcal{V}|},\; 0\right), \tag{2} \]

where Ck is the set of nodes from class k ∈ {0, 1, ..., C−1}, and hk is the class-wise homophily ratio computed as

\[ h_k = \frac{\sum_{v_i \in \mathcal{C}_k} |\{v_j \mid y_i = y_j,\; v_j \in \mathcal{N}_i\}|}{\sum_{v_i \in \mathcal{C}_k} |\mathcal{N}_i|}. \]

Both indexes in Eq. (1) and Eq. (2) range from 0 to 1, where higher values suggest higher homophily (lower heterophily), and vice versa.
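For concreteness, both metrics can be computed directly from an edge list and node labels. The sketch below is our own illustrative Python, not code from the paper; it assumes an undirected graph whose edges are each listed once, and integer class labels.

```python
import numpy as np

def edge_homophily(edges, y):
    """Eq. (1): fraction of edges whose endpoints share a label."""
    same = sum(1 for i, j in edges if y[i] == y[j])
    return same / len(edges)

def class_insensitive_homophily(edges, y, num_classes):
    """Eq. (2): class-imbalance-aware homophily of Lim et al. [28]."""
    y = np.asarray(y)
    n = len(y)
    # Per-node neighbor lists from the (undirected) edge list.
    nbrs = [[] for _ in range(n)]
    for i, j in edges:
        nbrs[i].append(j)
        nbrs[j].append(i)
    h_hat = 0.0
    for k in range(num_classes):
        idx = np.where(y == k)[0]
        # h_k: same-class neighbors over all neighbors, summed over class k.
        same = sum(sum(y[v] == k for v in nbrs[u]) for u in idx)
        total = sum(len(nbrs[u]) for u in idx)
        h_k = same / total if total > 0 else 0.0
        h_hat += max(h_k - len(idx) / n, 0.0)
    return h_hat / (num_classes - 1)
```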
3.2 Graph Neural Networks

The central idea of most GNNs is to utilize nodes' proximity information to build their representations for downstream tasks. Based on this idea, great effort has been made in developing different variants [6], [24], [33], [34], [35], [36], [37], [38], [39], [40], and in understanding the nature of GNNs [8], [9], [41], [42], [43], [44]. Several works have proved that GNNs essentially behave as a low-pass filter that smooths information within a node's surroundings [6], [7], [16], [45]. In line with this view, [9] and [8] further show that a number of GNN models, such as GCN [33], SGC [6], GAT [24], and APPNP [46], can be seen as different optimization solvers to a graph signal denoising problem with a smoothness assumption upon connected nodes. All these results indicate that GNNs are mostly tailored to the strong homophily hypothesis on the observed graphs while largely overlooking the important setting of heterophily, where node features and labels vary unsmoothly on graphs. Recent studies [20], [47] also connect this to the over-smoothing problem [23].

To extend GNNs to heterophilic graphs, several works leverage long-range information beyond nodes' proximity. Geom-GCN [31] extends the standard message passing with geometric aggregation in a latent space. H2GCN [10] directly models the higher-order neighborhoods to capture the homophily-dominant information. WRGAT [41] transforms the input graph into a multi-relational graph for modeling structural information and enhancing the assortativity level. GEN [13] estimates a suitable graph for GNNs' learning with multi-order neighborhood information and Bayesian inference as a guide. Another line of work emphasizes the proper utilization of node neighbors. The most common works employ attention mechanisms [24], [48]; however, they still impose smoothness within nodes' neighborhoods, albeit on the important members only [7], [8], [9]. In comparison, FAGCN [18] adaptively models both similarities and dissimilarities between adjacent nodes. GPR-GNN [11] introduces a universal polynomial graph filter, associating different hop neighbors with learnable weights of both positive and negative signs, so as to extract both low- and high-frequency information.

However, none of these methods analyzes the motivation why two nodes get connected, nor do they associate the edges with learning tasks, which is analyzed as one of the keys to generalizing GNNs beyond homophily in this paper. In contrast, ES-GNN distinguishes graph edges as either relevant or irrelevant to the task. Such information acts as a guide to disentangle and exclude classification-harmful information from the final predictive target, and thus boosts GNNs' performance under heterophily. Meanwhile, detailed analyses of the limited performance of the existing state-of-the-art methods are provided in Section 6.3.

3.3 Disentangled Representation Learning

Disentangled representation learning aims to learn decomposed vector representations which disentangle the explanatory latent variables underlying the observed data and encode them as separate dimensions [49], [50]. Existing efforts on this topic are mainly made in computer vision [51], [52], [53], [54], while a couple of works have recently emerged to explore the potential of disentangled learning in graph-structured domains [35], [55], [56], [57]. For example, DisenGCN [55] employs a neighborhood routing mechanism to iteratively partition a node's neighborhood into multiple separated parts. FactorGCN [35] factorizes the original graph into multiple subgraphs by clipping edges so as to capture different graph aspects.

We notice that our work shares a similarity with FactorGCN [35]: learning multiple subgraphs from the original network topology for disentangling features. Nevertheless, there are three main differences. First, FactorGCN could assign one edge to multiple groups, i.e., the factorized subgraphs may share overlapping edges, while our ES-GNN employs an edge splitting to partition the original network topology into two mutually exclusive ones satisfying AR + AIR = A. Second, despite the disentangling property, FactorGCN merely interprets the inferred subgraphs as different graph aspects without providing any concrete meanings, and the predefined number of latent factors requires tuning differently across graphs. In contrast, our model adaptively produces two interpretable task-relevant and task-irrelevant topologies for all kinds of input graphs. Last, FactorGCN models all disentangled parts towards the final prediction, while we target decoupling the task-relevant and task-irrelevant features, whereby the classification-harmful information can be excluded from the final predictive target and disentangled into the task-irrelevant parts. Experimental results also validate that our proposed model substantially outperforms FactorGCN on all the datasets used in this paper (see Section 6).

4 FRAMEWORK: ES-GNN

In this section, we propose an end-to-end graph learning framework, ES-GNN, generalizing Graph Neural Networks (GNNs) to arbitrary graph-structured data with either homophilic or heterophilic properties. An overview of ES-GNN is given in Fig. 2. The central idea is to integrate GNNs with an interpretable edge splitting (ES) layer that adaptively partitions the network topology as a guide to disentangle the task-relevant and irrelevant node features.
Fig. 2. Illustration of our ES-GNN framework, where A and X denote the adjacency matrix and the feature matrix of nodes, respectively. First, X is projected onto different latent subspaces via the different channels R and IR. An edge splitting is then performed to divide the original graph edges into two exclusive sets. After that, node information can be aggregated individually and separately on the different edge sets to produce disentangled representations, which are further utilized to make a more accurate edge splitting in the next layer. The task-relevant representation Z′R is reasonably granted for prediction. Meanwhile, an Irrelevant Consistency Regularization (ICR) is developed to further reduce the potential task-harmful information in the final predictive target.
4.1 Edge Splitting Layer

The goal of this layer is to infer the latent relations underlying adjacent nodes on the observed graph, and to distinguish between graph edges which could be relevant or irrelevant to the learning task. Given a simple graph with an adjacency matrix A and node feature matrix X, an ES-layer splits the original graph edges into two exclusive sets, and thereby produces two partial network topologies with adjacency matrices AR, AIR ∈ R^{|V|×|V|} satisfying AR + AIR = A. We would expect AR to store the graph edges most correlated with the classification task, while the rest are excluded and disentangled in AIR. Therefore, analyzing the correlation between node connections and learning tasks comes as the first step.

However, existing techniques [18], [24], [25] mainly parameterize graph edges with node similarity or dissimilarity, while failing to explicitly correlate them with the prediction target. Even worse, as the assortativity of real-world networks is usually agnostic and node features are typically full of noise, the captured similarity/dissimilarity may not truly reflect the label agreement/disagreement between nearby nodes. Consequently, the harmful similarity between pairwise nodes from different classes could be mistakenly preserved for prediction. To this end, we present one plausible hypothesis below, whereby the explicit correlation between node connections and learning tasks is established automatically.

Hypothesis 1. Two nodes get connected in a graph mainly due to their similarity in some features, which could be either relevant or irrelevant (even harmful) to the learning task.

This hypothesis is assumed without loss of generality for both homophilic and heterophilic graphs. For a homophilic scenario, e.g., in citation networks, scientific papers tend to cite or be cited by others from the same area, and both of them usually possess the common keywords uniquely appearing in their topics. For a heterophilic scenario, students having different interests are likely to be connected because of the same classes they take and/or the same dormitory they live in, but neither has a direct relation to the clubs they have joined. This inspires us to classify graph edges by measuring the similarity between adjacent nodes in two different aspects, i.e., a graph edge is more relevant to the classification task if the connected nodes are more similar in their task-relevant features, and vice versa. Our experimental analysis in Section 6.6 further provides evidence that, even when Hypothesis 1 may not hold, most adversarial edges (considered as task-irrelevant ones) can still be recognized though neither type of node similarity exists.

It is worth mentioning that our hypothesis does not contradict "opposites attract", which could be intuitively explained by linking due to different but matching attributes. We believe the inherent cause of connection even in "opposites attract" may still be certain commonalities. For example, in heterosexual dating networks, people of the opposite sex are most likely connected because of their similar life values. Although these similarities may be inappropriate (or even harmful) in distinguishing genders, modeling them and disentangling them from the final predictive target might still be of great importance.

An ES-layer consists of two channels to respectively extract the task-relevant and irrelevant information from nodes. As only the raw feature matrix X is provided in the beginning, we project it into two different subspaces before the first ES-layer:

\[ \mathbf{Z}^{(0)}_{s} = \sigma(\mathbf{W}_{s}^{T}\mathbf{X} + \mathbf{b}_{s}), \tag{3} \]
where Ws ∈ R^{f×(d/2)} and bs ∈ R^{d/2} are the learnable parameters in channel s ∈ {R, IR}, d is the number of node hidden states, and σ is a nonlinear activation function.

Given Hypothesis 1, a graph edge should be classified into the task-relevant set if the connected nodes display a higher similarity in the corresponding channel, and otherwise into the task-irrelevant set. However, introducing metrics between nearby nodes to learn AR and AIR independently may fail to model the complex interaction between different channels, and also loses emphasis on the topology difference. Therefore, in case of A(i,j) = 1, we parameterize the residual between AR(i,j) and AIR(i,j), solving the linear system:

\[ \begin{cases} A_{R(i,j)} - A_{IR(i,j)} = \alpha_{i,j} \\ A_{R(i,j)} + A_{IR(i,j)} = 1 \end{cases} \]

This gives us AR(i,j) = (1 + αi,j)/2 and AIR(i,j) = (1 − αi,j)/2 with αi,j ∈ (−1, 1). To effectively incorporate all the channel information into the coefficient αi,j, we propose a residual scoring mechanism:

\[ \alpha_{i,j} = \tanh\!\big(\mathbf{g}\,[\mathbf{Z}_{R[i,:]} \oplus \mathbf{Z}_{IR[i,:]} \oplus \mathbf{Z}_{R[j,:]} \oplus \mathbf{Z}_{IR[j,:]}]^{T}\big). \tag{4} \]

Here, the task-relevant and irrelevant node features are first concatenated and convolved with the learnable g ∈ R^{1×2d}, and then passed through the tangent activation function to produce a scalar value within (−1, 1). To further strengthen the discreteness property of (or exclusiveness between) AR and AIR, one can apply techniques such as softmax with temperature in Eq. (5), Gumbel-Softmax [58], [59] in Eq. (6), or thresholding in Eq. (7):

\[ A'_{s(i,j)} = \frac{\exp(A_{s(i,j)}/\tau)}{\sum_{\kappa \in \{R, IR\}} \exp(A_{\kappa(i,j)}/\tau)} \tag{5} \]

\[ A'_{s(i,j)} = \frac{\exp\big((\log(A_{s(i,j)}) + \gamma)/\tau\big)}{\sum_{\kappa \in \{R, IR\}} \exp\big((\log(A_{\kappa(i,j)}) + \gamma)/\tau\big)} \tag{6} \]

\[ A'_{s(i,j)} = \begin{cases} 1 & A_{s(i,j)} > 0.5 \\ 0 & \text{otherwise} \end{cases} \tag{7} \]

where s ∈ {R, IR}, τ is a hyper-parameter mediating the discreteness degree, and γ ∼ Gumbel(0, 1) is a Gumbel random variable. However, in this work, we find good results without adding any additional discretization techniques, and leave this investigation to future work.

Algorithm 1 Framework of ES-GNN
Input: node set V, edge set E, adjacency matrix A ∈ R^{|V|×|V|}, node feature matrix X ∈ R^{|V|×f}, number of layers K, scaling parameters {ϵR, ϵIR}, irrelevant consistency coefficient λICR, and ground-truth labels on the training set {yi ∈ R^C | ∀vi ∈ Vtrn}.
Param: WR, WIR ∈ R^{f×d}, WF ∈ R^{d×C}, bF ∈ R^C, {g^(k) ∈ R^{1×2d} | k = 0, 1, ..., K−1}.
1: // Project node features into two subspaces.
2: for s ∈ {R, IR} do
3:   Zs^(0) ← σ(Ws^T X + bs)
4:   Zs^(0) ← Dropout(Zs^(0))  // Enabled only for training.
5: end for
6: // Stack edge splitting and aggregation layers.
7: for layer number k = 0, 1, ..., K−1 do
8:   // Edge splitting layer.
9:   Initialize AR, AIR ∈ R^{|V|×|V|} with zeros.
10:  for (vi, vj) ∈ E do
11:    αi,j ← tanh(g^(k) [ZR[i,:]^(k) ⊕ ZIR[i,:]^(k) ⊕ ZR[j,:]^(k) ⊕ ZIR[j,:]^(k)]^T)
12:    αi,j ← Dropout(αi,j)  // Enabled only for training.
13:    AR(i,j) ← (1 + αi,j)/2,  AIR(i,j) ← (1 − αi,j)/2
14:  end for
15:  // Aggregation layer.
16:  for s ∈ {R, IR} do
17:    Zs^(k+1) ← ϵs Zs^(0) + (1 − ϵs) Ds^(−1/2) As Ds^(−1/2) Zs^(k)
18:  end for
19: end for
20: // Prediction.
21: ŷi ← softmax(WF^T ZR[i,:]^(K) + bF), ∀vi ∈ V
22: // Optimization with Irrelevant Consistency Regularization.
23: LICR ← Σ_{(vi,vj)∈E} (1 − δ(ŷi, ŷj)) ||ZIR[i,:] − ZIR[j,:]||_2^2
24: Lpred ← −(1/|Vtrn|) Σ_{i∈Vtrn} yi^T log(ŷi)
25: Minimize Lpred + λICR · LICR
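As a concrete illustration of Eq. (4) and lines 10-14 of Algorithm 1, below is a minimal PyTorch-style sketch of the residual scoring and edge splitting. It is our own reading of the formulas rather than the authors' released code; the class name EdgeSplitting and the tensor layout (an edge_index of source/target node ids) are assumptions.

```python
import torch
import torch.nn as nn

class EdgeSplitting(nn.Module):
    """Sketch of the ES-layer residual scoring (Eq. (4)). Each channel
    feature has d/2 dims, so the 4-way concatenation has 2d dims and the
    learnable g maps it to a scalar score alpha in (-1, 1)."""

    def __init__(self, d_half: int):
        super().__init__()
        self.g = nn.Linear(4 * d_half, 1, bias=False)  # g in R^{1 x 2d}

    def forward(self, z_r, z_ir, edge_index):
        # z_r, z_ir: [N, d/2] task-relevant / task-irrelevant features.
        # edge_index: [2, E] source/target node ids of the observed edges.
        src, dst = edge_index
        h = torch.cat([z_r[src], z_ir[src], z_r[dst], z_ir[dst]], dim=-1)
        alpha = torch.tanh(self.g(h)).squeeze(-1)  # [E], alpha in (-1, 1)
        a_r = (1.0 + alpha) / 2.0    # edge weight kept in A_R
        a_ir = (1.0 - alpha) / 2.0   # edge weight moved to A_IR
        return a_r, a_ir             # a_r + a_ir == 1 on every edge
```

By construction, every observed edge carries a unit weight that is divided, not duplicated, between the two topologies, which is exactly the mutual-exclusiveness constraint AR + AIR = A.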
4.2 Aggregation Layer

As the split network topologies disclose the partial relations among nodes in different latent spaces, they can be utilized to aggregate information for learning different node aspects. Specifically, we leverage a simple low-pass filter with scaling parameters {ϵR, ϵIR} for both the task-relevant and irrelevant channels, from the k-th to the (k+1)-th layer:

\[ \mathbf{Z}^{(k+1)}_{s} = \epsilon_{s}\,\mathbf{Z}^{(0)}_{s} + (1-\epsilon_{s})\,\mathbf{D}_{s}^{-\frac{1}{2}}\mathbf{A}_{s}\mathbf{D}_{s}^{-\frac{1}{2}}\mathbf{Z}^{(k)}_{s}, \tag{8} \]

where s ∈ {R, IR} denotes the task-relevant or irrelevant channel, and Ds is the degree matrix associated with the adjacency matrix As. The derivation of Eq. (8) is detailed in our theoretical analysis. Importantly, by incorporating proximity information in different structural spaces, the task-relevant and irrelevant information can be better disentangled in ZR^(k+1) and ZIR^(k+1), based on which the next ES-layer can make a more precise partition of the raw topology.

4.3 Irrelevant Consistency Regularization

Stacking the ES-layer and aggregation layer iteratively lends itself to disentangling different features of nodes into the task-relevant and irrelevant representations, denoted by ZR and ZIR respectively. First, ZR is granted for prediction and gradually trained by the supervision signals from the classification loss. However, only supervising one channel (R) may not also guarantee the meaningfulness of the other (IR), which possibly results in inaccurate disentanglement. The confounding and erroneous information could then be mistakenly preserved for prediction. To this end, we propose to regulate ZIR to model the opposite of ZR, i.e., the classification-harmful information hidden in the observed graph.

To attain this, we develop the Irrelevant Consistency Regularization (ICR), which imposes a concrete meaning on ZIR. The rationale is to incorporate the potential label-disagreement between adjacent nodes into ZIR. Given any two connected nodes vi and vj, we would expect ZIR[i,:] and ZIR[j,:] to be similar if the two nodes carry distinct labels. Specifically, our ICR can be written as (cf. line 23 of Algorithm 1):

\[ \mathcal{L}_{ICR} = \sum_{(v_i, v_j) \in \mathcal{E}} \big(1 - \delta(\hat{y}_i, \hat{y}_j)\big)\, \big\|\mathbf{Z}_{IR[i,:]} - \mathbf{Z}_{IR[j,:]}\big\|_2^2, \]

where δ(·, ·) equals 1 if the two predicted labels agree and 0 otherwise.
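Similarly, the aggregation step of Eq. (8) and the ICR objective admit a compact sparse implementation. The sketch below is illustrative only, under our assumptions: edges are stored in both directions for undirected graphs, and y_hat holds predicted label indices.

```python
import torch

def aggregate(z0, z_prev, a_w, edge_index, eps, num_nodes):
    """One aggregation step, Eq. (8):
    Z^(k+1) = eps * Z^(0) + (1 - eps) * D^-1/2 A D^-1/2 Z^(k)."""
    src, dst = edge_index
    # Weighted degrees of the split topology (a_w = A_R or A_IR per edge).
    deg = torch.zeros(num_nodes).scatter_add_(0, dst, a_w).clamp(min=1e-12)
    norm = a_w * deg[src].rsqrt() * deg[dst].rsqrt()  # symmetric normalization
    out = torch.zeros_like(z_prev)
    out.index_add_(0, dst, norm.unsqueeze(-1) * z_prev[src])
    return eps * z0 + (1.0 - eps) * out

def icr_loss(z_ir, y_hat, edge_index):
    """Irrelevant Consistency Regularization: pull together the Z_IR
    embeddings of connected nodes whose predicted labels disagree."""
    src, dst = edge_index
    disagree = (y_hat[src] != y_hat[dst]).float()  # 1 - delta(y_i, y_j)
    sq_dist = (z_ir[src] - z_ir[dst]).pow(2).sum(dim=-1)
    return (disagree * sq_dist).sum()
```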
TABLE 2
Statistics of real-world datasets, where H and Ĥ (considering the class-imbalance problem) provide indexes of the graph homophily ratio as respectively defined in Eq. (1) and Eq. (2). It can be observed that, despite the relatively high homophily level measured by H = 0.632, the Twitch-DE dataset with a class-imbalance problem is essentially a heterophilic graph [28], as suggested by Ĥ = 0.139. For the Polblogs dataset, since node features are not provided, we directly use the rows of the adjacency matrix.

(Figure: homophilic pattern pE ≫ pI vs. heterophilic pattern pE ≪ pI; explicit vs. implicit attributes.)

Parameters of the synthetic graphs:
Hsyn 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
pE 0.02 0.06 0.1 0.2 0.4 0.4 0.6 0.7 0.8 0.9 0.96
pI 0.72 0.81 0.6 0.7 0.9 0.6 0.6 0.45 0.3 0.15 0.045
ω 0.1 0.084 0.1 0.075 0.05 0.062 0.05 0.05 0.05 0.05 0.051
TABLE 4
Node classification accuracies (%) over 100 runs. Error Reduction gives the average improvement of our ES-GNN over the second-place models, which are explicitly designed for heterophilic graphs.
TABLE 5
Edge analysis of our ES-GNN on synthetic graphs with various homophily ratios. Removed Het. gives the percentage (%) of heterophilic node connections excluded from the task-relevant topology and disentangled in the task-irrelevant topology. The last two rows give the corresponding node classification accuracies (%) of ES-GNN and of its variant with the ES-layer ablated.
Hsyn 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 Avg.
Removed Het. 41.9 53.2 60.8 70.4 74.2 80.7 86.7 87.8 89.9 71.7
ES-GNN 90.0 69.6 62.1 69.6 85.4 93.8 98.3 99.2 100.0 85.3
ES-GNN w/o ES 84.6 57.9 53.3 53.8 74.2 81.7 86.3 90.4 96.7 75.4
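The Removed Het. row can be reproduced with a check of the following shape. The decision rule, counting an edge as removed when more than half of its unit weight goes to AIR, is our assumption for illustration; a_ir maps each edge to its learned task-irrelevant weight.

```python
def removed_het_ratio(edges, y, a_ir):
    """Removed Het. (Table 5): share of heterophilic edges (y_i != y_j)
    whose weight mass ends up mostly in the task-irrelevant topology."""
    het = [(i, j) for (i, j) in edges if y[i] != y[j]]
    removed = sum(1 for (i, j) in het if a_ir[(i, j)] > 0.5)
    return 100.0 * removed / max(len(het), 1)
```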
(a) Chameleon (b) Cora (c) Hsyn = 0.1 (d) Hsyn = 0.5 (e) Hsyn = 0.9
Fig. 5. Feature correlation analysis. Two distinct patterns (task-relevant and task-irrelevant topologies) can be learned on Chameleon with H = 0.23, while almost all information is retained in the task-relevant channel (dimensions 0-31) on Cora with H = 0.81. On the synthetic graphs in (c), (d), and (e), the block-wise pattern in the task-irrelevant channel (dimensions 32-63) is gradually attenuated as the homophily ratio increases across 0.1, 0.5, and 0.9. ES-GNN thus presents one general framework which can adapt to both heterophilic and homophilic graphs.
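Heatmaps like those in Fig. 5 can be produced by correlating the hidden dimensions of the concatenated disentangled embedding. The NumPy sketch below is ours, assuming a 64-dimensional embedding split between the two channels as in the caption.

```python
import numpy as np

def feature_correlation(z_r, z_ir):
    """Absolute Pearson correlation between all hidden dimensions of the
    concatenated embedding [Z_R | Z_IR]; with d = 64, dims 0-31 are the
    task-relevant channel and dims 32-63 the task-irrelevant one."""
    z = np.concatenate([z_r, z_ir], axis=1)        # [N, d]
    return np.abs(np.corrcoef(z, rowvar=False))    # [d, d] heatmap matrix
```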
Fig. 6. Results of different models on perturbed homophilic graphs. ES-GNN is able to identify the falsely injected (task-irrelevant) graph edges and exclude these connections from the final predictive learning, thereby displaying relatively robust performance against adversarial edge attacks.
we have the following observations: (1) Looking at the overall trend, we obtain a "U" pattern on graphs from the lowest to the highest homophily ratios. This suggests that GNNs' prediction performance is not monotonically correlated with graph homophily levels in a strict manner. When it comes to the extreme heterophilic scenario, GNNs tend to alternate node features completely between different classes, thereby still making nodes distinguishable w.r.t. their labels, which coincides with the findings in [63]. (2) Despite the attention mechanism for adaptively utilizing relevant neighborhood information, GAT turns out to be the least robust method on arbitrary graphs. The entangled information in the mixed assortativity and disassortativity provides weak supervision signals for learning the attention weights. FactorGCN employs a graph factorization to disentangle different graph aspects but still adopts all of them for prediction without judgement, thereby performing poorly especially in the tough cases of Hsyn = 0.3, 0.4, and 0.5. (3) Both FAGCN and GPR-GNN model the dissimilarity between nearby nodes to go beyond the smoothness assumption in conventional GNNs, and display some superiority under heterophily. However, the correlation between graph edges and classification tasks is not explicitly defined and emphasized in their designs. In other words, the classification-harmful information could still be preserved in their node dissimilarity. Experimental results also show that these methods are consistently beaten by our disentangled approach. (4) The proposed ES-GNN consistently outperforms, or matches, others across different graphs with different homophily levels, especially in the hardest case with Hsyn = 0.3, where some baselines even perform worse than MLP. This is mainly because our ES-GNN is able to distinguish between task-relevant and irrelevant graph links, and makes predictions with the most correlated features only. We further provide detailed analyses in the following sections.

6.4 Edge Analysis

In this section, we analyze the split edges from our ES-layer using synthetic graphs as an example. According to Section 6.1.2, the synthetic edges are defined as task-relevant connections if they link nodes from the same class, and task-irrelevant ones otherwise. Therefore, we calculate the percentages of heterophilic node connections which are excluded from our task-relevant topology and disentangled in the task-irrelevant one, so as to investigate the discerning ability of ES-GNN between edges of different types. As can be observed in Table 5, 71.7% of the task-irrelevant edges are identified on average across various homophily ratios. On the other hand, we also report the classification accuracies of ES-GNN and of its variant with the ES-layer ablated, from which approximately 10% degradation can be observed. All of these results strongly validate the effectiveness of our ES-layer and reasonably interpret the good performance of ES-GNN.

6.5 Correlation Analysis

To better understand our proposed method, we investigate the disentangled features on Chameleon, Cora, and three synthetic graphs as typical examples in Fig. 5. Clearly, on the strongly heterophilic graph Chameleon with H = 0.23, the correlation analysis of the learned latent features displays two clear block-wise patterns, each of which represents the task-relevant or task-irrelevant aspect respectively. In contrast, on the citation network Cora with H = 0.81, the node features are almost entirely retained in the task-relevant channel, as shown in Fig. 5.
Pubmed, and Polblogs, while displaying relatively the same results on Cora. We attribute this to the capability of our model in associating node connections with learning tasks. Take the Pubmed dataset as an example. We investigate the learned task-relevant topologies and find that 81.0%, 73.0%, 82.1%, 83.0%, and 82.6% of fake links get removed on adversarial graphs with perturbation rates from 20% to 100%. This also offers evidence that our ES-layer is able to distinguish between task-relevant and irrelevant node connections. Therefore, despite a large number of false edge injections, the proximity information of nodes can still be reasonably mined in our model to predict their labels. Importantly, these empirical results also indicate that ES-GNN can still identify most of the task-irrelevant edges even though no clear similarity or association between the connected nodes exists in the adversarial setting.
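For reference, one plausible way to construct such perturbed graphs is random injection of fake links between previously unconnected node pairs, sketched below. The paper's exact attack may differ; inject_fake_edges and its uniform sampling are our assumptions. The removal percentages quoted above can then be computed over the injected set, analogously to the Removed Het. statistic in Section 6.4.

```python
import random

def inject_fake_edges(edges, num_nodes, ptb_rate, seed=0):
    """Sketch of an adversarial setup: add ptb_rate * |E| fake edges
    between node pairs that are not connected in the clean graph."""
    rng = random.Random(seed)
    existing = {(i, j) for i, j in edges} | {(j, i) for i, j in edges}
    fake = set()
    target = int(ptb_rate * len(edges))  # assumes the graph is sparse enough
    while len(fake) < target:
        i, j = rng.randrange(num_nodes), rng.randrange(num_nodes)
        if i != j and (i, j) not in existing and (i, j) not in fake and (j, i) not in fake:
            fake.add((i, j))
    return sorted(fake)
```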
6.7 Alleviating the Over-smoothing Problem

To verify whether ES-GNN alleviates the over-smoothing problem, we compare it with GCN and GAT by varying the layer number in Fig. 9. It can be observed that these two baselines attain their highest results when the number of layers reaches around two. As the layers go deeper, the accuracies of both GCN and GAT gradually drop to a lower point. On the contrary, our ES-GNN presents a stable curve. In spite of starting from a relatively lower point, the performance of ES-GNN keeps improving as the model depth increases, and eventually outperforms both GCN and GAT. The main reason is that our ES-GNN can adaptively utilize the proper graph edges in different layers to attain task-optimal results with enlarged receptive fields. In other words, once an edge stops passing useful information or starts passing harmful messages, ES-GNN tends to identify it and remove it from learning the task-correlated representations, thereby mitigating the over-smoothing problem.

As λICR varies, the accuracy on Squirrel in Fig. 8a first goes up and then gradually drops. Promising results can be attained by choosing λICR from [5e−5, 5e−3]. Similar trends can also be observed on the other datasets, where λICR is relatively robust within a wide albeit distinct interval.

7 CONCLUSION

In this paper, we develop a novel graph learning framework which enables GNNs to go beyond the strong homophily assumption on graphs. We manage to establish the correlation between node connections and learning tasks through one plausible hypothesis, based on which ES-GNN is derived with an interpretable edge splitting. Our ES-GNN essentially partitions the original graph structure into the task-relevant and task-irrelevant topologies as a guide to disentangle node features, whereby the classification-harmful information can be disentangled and excluded from the final prediction target.

Theoretical analysis illustrates our motivation and offers interpretations of the expressive power of ES-GNN on different types of networks. To provide empirical verification, we conduct extensive experiments over 11 benchmark datasets and 1 synthetic dataset. The node classification results show that ES-GNN consistently outperforms the other 9 competitive GNNs (including 5 state-of-the-art models explicitly designed for heterophily) on graphs with either homophily or heterophily. In particular, we also conduct analyses on the split edges, the correlation among disentangled features, model robustness, and the ablated variants. All of these results demonstrate the success of ES-GNN in identifying graph edges of different types, which also validates the effectiveness of our interpretable edge splitting.

In future work, we will further explore more sophisticated designs of the edge splitting layer. Another interesting direction would be how to extend our learning paradigm to graph-level tasks.
[5] J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl, "Neural message passing for quantum chemistry," in International Conference on Machine Learning. PMLR, 2017, pp. 1263–1272.
[6] F. Wu, A. Souza, T. Zhang, C. Fifty, T. Yu, and K. Weinberger, "Simplifying graph convolutional networks," in International Conference on Machine Learning. PMLR, 2019, pp. 6861–6871.
[7] M. Balcilar, G. Renton, P. Héroux, B. Gaüzère, S. Adam, and P. Honeine, "Analyzing the expressive power of graph neural networks in a spectral perspective," in International Conference on Learning Representations, 2020.
[8] Y. Ma, X. Liu, T. Zhao, Y. Liu, J. Tang, and N. Shah, "A unified view on graph neural networks as graph signal denoising," in Proceedings of the 30th ACM International Conference on Information & Knowledge Management, 2021, pp. 1202–1211.
[9] M. Zhu, X. Wang, C. Shi, H. Ji, and P. Cui, "Interpreting and unifying graph neural networks with an optimization framework," in Proceedings of the Web Conference 2021, 2021, pp. 1215–1226.
[10] J. Zhu, Y. Yan, L. Zhao, M. Heimann, L. Akoglu, and D. Koutra, "Beyond homophily in graph neural networks: current limitations and effective designs," Advances in Neural Information Processing Systems, vol. 33, 2020.
[11] E. Chien, J. Peng, P. Li, and O. Milenkovic, "Adaptive universal generalized pagerank graph neural network," in International Conference on Learning Representations, 2021. [Online]. Available: https://openreview.net/forum?id=n6jl7fLxrP
[12] J. Zhu, R. A. Rossi, A. Rao, T. Mai, N. Lipka, N. K. Ahmed, and D. Koutra, "Graph neural networks with heterophily," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, no. 12, 2021, pp. 11168–11176.
[13] R. Wang, S. Mou, X. Wang, W. Xiao, Q. Ju, C. Shi, and X. Xie, "Graph structure estimation neural networks," in Proceedings of the Web Conference 2021, 2021, pp. 342–353.
[14] S. Suresh, V. Budde, J. Neville, P. Li, and J. Ma, "Breaking the limit of graph neural networks by improving the assortativity of graphs with local mixing patterns," in Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, 2021.
[15] L. Yang, W. Zhou, W. Peng, B. Niu, J. Gu, C. Wang, X. Cao, and D. He, "Graph neural networks beyond compromise between attribute and topology," in Proceedings of the ACM Web Conference 2022, 2022, pp. 1127–1135.
[16] H. Nt and T. Maehara, "Revisiting graph neural networks: All we have is low-pass filters," arXiv preprint arXiv:1905.09550, 2019.
[17] M. McPherson, L. Smith-Lovin, and J. M. Cook, "Birds of a feather: Homophily in social networks," Annual Review of Sociology, vol. 27, no. 1, pp. 415–444, 2001.
[18] D. Bo, X. Wang, C. Shi, and H. Shen, "Beyond low-frequency information in graph convolutional networks," in AAAI. AAAI Press, 2021.
[19] M. Liu, Z. Wang, and S. Ji, "Non-local graph neural networks," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021.
[20] Y. Yan, M. Hashemi, K. Swersky, Y. Yang, and D. Koutra, "Two sides of the same coin: Heterophily and oversmoothing in graph convolutional neural networks," arXiv preprint arXiv:2102.06462, 2021.
[21] Z. Fang, L. Xu, G. Song, Q. Long, and Y. Zhang, "Polarized graph neural networks," in Proceedings of the ACM Web Conference 2022, 2022, pp. 1404–1413.
[22] X. Li, R. Zhu, Y. Cheng, C. Shan, S. Luo, D. Li, and W. Qian, "Finding global homophily in graph neural networks when meeting heterophily," arXiv preprint arXiv:2205.07308, 2022.
[23] K. Oono and T. Suzuki, "Graph neural networks exponentially lose expressive power for node classification," in International Conference on Learning Representations, 2020. [Online]. Available: https://openreview.net/forum?id=S1ldO2EFPr
[24] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio, "Graph attention networks," in International Conference on Learning Representations, 2018. [Online]. Available: https://openreview.net/forum?id=rJXMpikCZ
[25] D. Kim and A. Oh, "How to find your friendly neighborhood: Graph attention design with self-supervision," in International Conference on Learning Representations, 2021. [Online]. Available: https://openreview.net/forum?id=Wi5KUNlqWty
[26] P. Sen, G. Namata, M. Bilgic, L. Getoor, B. Gallagher, and T. Eliassi-Rad, "Collective classification in network data," AI Magazine, vol. 29, pp. 93–106, 2008.
[27] B. Rozemberczki, C. Allen, and R. Sarkar, "Multi-scale attributed node embedding," Journal of Complex Networks, vol. 9, no. 2, p. cnab014, 2021.
[28] D. Lim, X. Li, F. Hohne, and S.-N. Lim, "New benchmarks for learning on non-homophilous graphs," arXiv preprint arXiv:2104.01404, 2021.
[29] L. A. Adamic and N. Glance, "The political blogosphere and the 2004 US election: Divided they blog," in Proceedings of the 3rd International Workshop on Link Discovery, 2005, pp. 36–43.
[30] W. Jin, Y. Ma, X. Liu, X. Tang, S. Wang, and J. Tang, "Graph structure learning for robust graph neural networks," in Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2020, pp. 66–74.
[31] H. Pei, B. Wei, K. C.-C. Chang, Y. Lei, and B. Yang, "Geom-GCN: Geometric graph convolutional networks," arXiv preprint arXiv:2002.05287, 2020.
[32] J. Tang, J. Sun, C. Wang, and Z. Yang, "Social influence analysis in large-scale networks," in Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2009, pp. 807–816.
[33] T. N. Kipf and M. Welling, "Semi-supervised classification with graph convolutional networks," in International Conference on Learning Representations (ICLR), 2017.
[34] K. Xu, C. Li, Y. Tian, T. Sonobe, K.-i. Kawarabayashi, and S. Jegelka, "Representation learning on graphs with jumping knowledge networks," in International Conference on Machine Learning. PMLR, 2018, pp. 5453–5462.
[35] Y. Yang, Z. Feng, M. Song, and X. Wang, "Factorizable graph convolutional networks," Advances in Neural Information Processing Systems, vol. 33, 2020.
[36] K.-H. Lai, D. Zha, K. Zhou, and X. Hu, "Policy-GNN: Aggregation optimization for graph neural networks," in Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD), 2020, pp. 461–471.
[37] E. Isufi, F. Gama, and A. Ribeiro, "EdgeNets: Edge varying graph neural networks," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021.
[38] F. M. Bianchi, D. Grattarola, L. Livi, and C. Alippi, "Graph neural networks with convolutional ARMA filters," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021.
[39] Y. Gao, Y. Feng, S. Ji, and R. Ji, "HGNN+: General hypergraph neural networks," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 3, pp. 3181–3199, 2022.
[40] G. Bouritsas, F. Frasca, S. P. Zafeiriou, and M. Bronstein, "Improving graph neural network expressivity via subgraph isomorphism counting," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022.
[41] M. Balcilar, P. Héroux, B. Gauzere, P. Vasseur, S. Adam, and P. Honeine, "Breaking the limits of message passing graph neural networks," in International Conference on Machine Learning. PMLR, 2021, pp. 599–608.
[42] L. Faber, A. K. Moghaddam, and R. Wattenhofer, "When comparing to ground truth is wrong: On evaluating GNN explanation methods," in Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, 2021, pp. 332–341.
[43] X. Wang, Y. Wu, A. Zhang, F. Feng, X. He, and T.-S. Chua, "Reinforced causal explainer for graph neural networks," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022.
[44] T. Schnake, O. Eberle, J. Lederer, S. Nakajima, K. T. Schütt, K.-R. Müller, and G. Montavon, "Higher-order explanations of graph neural networks via relevant walks," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. PP, 2021.
[45] Y. Min, F. Wenkel, and G. Wolf, "Scattering GCN: Overcoming oversmoothness in graph convolutional networks," Advances in Neural Information Processing Systems, vol. 33, pp. 14498–14508, 2020.
[46] J. Klicpera, A. Bojchevski, and S. Günnemann, "Predict then propagate: Graph neural networks meet personalized pagerank," in ICLR, 2019.
[47] M. Chen, Z. Wei, Z. Huang, B. Ding, and Y. Li, "Simple and deep graph convolutional networks," in International Conference on Machine Learning. PMLR, 2020, pp. 1725–1735.
[48] Y. Hou, J. Zhang, J. Cheng, K. Ma, R. T. Ma, H. Chen, and M.-C. Yang, "Measuring and improving the use of graph information in graph neural networks," in International Conference on Learning Representations (ICLR), 2019.
[49] Y. Bengio, A. Courville, and P. Vincent, "Representation learning: A review and new perspectives," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 8, pp. 1798–1828, 2013.
[50] I. Higgins, D. Amos, D. Pfau, S. Racaniere, L. Matthey, D. Rezende, and A. Lerchner, "Towards a definition of disentangled representations," arXiv preprint arXiv:1812.02230, 2018.
[51] R. Lopez, J. Regier, M. I. Jordan, and N. Yosef, "Information constraints on auto-encoding variational bayes," Advances in Neural Information Processing Systems, vol. 31, 2018.
[52] L. Ma, Q. Sun, S. Georgoulis, L. Van Gool, B. Schiele, and M. Fritz, "Disentangled person image generation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 99–108.
[53] Z. Zhang, L. Tran, F. Liu, and X. Liu, "On learning disentangled representations for gait recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.
[54] C. Eom, W. Lee, G. Lee, and B. Ham, "IS-GAN: Learning disentangled representation for robust person re-identification," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021.
[55] J. Ma, P. Cui, K. Kuang, X. Wang, and W. Zhu, "Disentangled graph convolutional networks," in International Conference on Machine Learning. PMLR, 2019, pp. 4212–4221.
[56] Y. Liu, X. Wang, S. Wu, and Z. Xiao, "Independence promoted graph disentangled networks," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 04, 2020, pp. 4916–4923.
[57] H. Li, X. Wang, Z. Zhang, Z. Yuan, H. Li, and W. Zhu, "Disentangled contrastive learning on graphs," Advances in Neural Information Processing Systems, vol. 34, pp. 21872–21884, 2021.
[58] E. Jang, S. Gu, and B. Poole, "Categorical reparameterization with gumbel-softmax," arXiv preprint arXiv:1611.01144, 2016.
[59] C. J. Maddison, A. Mnih, and Y. W. Teh, "The concrete distribution: A continuous relaxation of discrete random variables," arXiv preprint arXiv:1611.00712, 2016.
[60] O. Stretcu, K. Viswanathan, D. Movshovitz-Attias, E. A. Platanios, S. Ravi, and A. Tomkins, "Graph agreement models for semi-supervised learning," in NeurIPS, 2019.
[61] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[62] T. Akiba, S. Sano, T. Yanase, T. Ohta, and M. Koyama, "Optuna: A next-generation hyperparameter optimization framework," in Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2019, pp. 2623–2631.
[63] Y. Ma, X. Liu, N. Shah, and J. Tang, "Is homophily a necessity for graph neural networks?" arXiv preprint arXiv:2106.06134, 2021.

Rui Zhang received the First-class (Hons) degree in Telecommunication Engineering from Jilin University of China in 2001 and the Ph.D. degree in Computer Science and Mathematics from University of Ulster, UK in 2007. After finishing her PhD study, she worked as a Research Associate at University of Bradford and University of Bristol in the UK for 5 years. She joined Xi'an Jiaotong-Liverpool University in 2012 and currently holds the position of Associate Professor. Her research interests include machine learning, data mining and statistical analysis.

Xinping Yi received the Ph.D. degree in electronics and communications from Télécom ParisTech, Paris, France, in 2015. He is currently a Lecturer (Assistant Professor) with the Department of Electrical Engineering and Electronics, University of Liverpool, U.K. Prior to Liverpool, he was a Research Associate with Technische Universität Berlin, Berlin, Germany, from 2014 to 2017, a Research Assistant with EURECOM, Sophia Antipolis, France, from 2011 to 2014, and a Research Engineer with Huawei Technologies, Shenzhen, China, from 2009 to 2011. His main research interests include information theory, graph theory, and machine learning, and their applications in wireless communications and artificial intelligence.