Introduction to Graph Neural Networks: A Starting
Abstract
Graph neural networks are deep neural networks designed for graphs
with attributes attached to nodes or edges. The number of research
papers in the literature concerning these models is growing rapidly
due to their impressive performance on a broad range of tasks. This
survey introduces graph neural networks through the encoder-decoder
framework and provides examples of decoders for a range of graph ana-
lytic tasks. It uses theory and numerous experiments on homogeneous
graphs to illustrate the behavior of graph neural networks for different
training sizes and degrees of graph complexity.
Contents
1 Introduction
2 Common applications
5 Experiments
  5.1 Baseline node classification performance
  5.2 Hyperparameters and node classification accuracy
    5.2.1 Adjusting the number of hidden dimensions
    5.2.2 Adjusting the number of training epochs
    5.2.3 Adjusting the number of layers and other hyperparameters
  5.3 Qualitative description of GNN learning
6 Conclusion
1 Introduction
Relationships within data are important for everyday tasks like internet search and
road map navigation as well as for scientific research in fields like bioinformatics.
Such relationships can be described using graphs with real vectors as attributes
associated with the graph’s nodes or edges; however, traditional machine learning
models operate on arrays, so they cannot directly exploit the relationships. This
report surveys Graph Neural Networks (GNNs), which jointly learn from both edge
and node feature information, and often produce more accurate models. These
architectures have become popular due to their impressive performance on graph
analysis tasks. Consequently, the number of research papers on GNNs is growing
rapidly, and many surveys exist.
Some surveys discuss graph neural networks in the context of broad families
such as graph networks, graph representation learning and geometric deep learning
[1, 2, 3, 4, 5]. Other surveys categorize GNNs by abstracting their distinguishing
properties into functional relationships [6, 7, 8, 3, 9]. Although useful for organiza-
tional purposes, generality and abstraction can be difficult to understand for those
new to the field. Other surveys have a narrow focus, for example to discuss efforts
to improve a specific weakness in GNN architectures [10], or to survey GNN work on
a particular task, such as fake news detection or product recommendation [11, 12].
While valuable for those interested in the task, they provide little background in
GNNs and therefore assume the reader already has that knowledge.
For this reason, a concrete and concise introduction to GNNs is missing. We
begin by introducing GNNs as encoder-decoder architectures. To provide perspec-
tive on the ways GNNs are used, we discuss common GNN applications along with
examples of task-specific decoders for turning features into predictions. We think
that studying a few important examples of GNNs well will help the reader develop
a feeling for the subject that would be difficult to achieve otherwise. We there-
fore focus on three convolutional and attentional networks, GCN, GraphSAGE,
and GATv2, which are commonly used both as benchmarks and as components in
other GNN architectures. We conduct numerous experiments with these GNNs at
two training sizes and on thirteen datasets of both high and low complexity. The
experiments have three goals:
• Compare benchmark GNNs with other graph models.
• Demonstrate how hyperparameter adjustments affect GNN performance.
• Provide a qualitative picture of what happens when GNNs learn.
We hope these experiments combined with the theoretical sections will enable new-
comers to use GNNs more effectively and to improve GNN performance on their
problems. We also hope that experts will gain new insights from our experiments.
2 Common applications
Graph neural networks are suited to a variety of graph tasks.
1. Node classification
This task concerns categorizing nodes of a graph. There are several appli-
cations within the space of social networks, such as assigning roles or in-
terests to individuals or predicting whether individuals are members of a
group [13, 14]. Node classification tasks also include classifying documents,
videos or webpages into different categories [15, 16]. There are also important
applications in bioinformatics, such as classifying the biological function of
proteins (nodes) and their interactions (edges) with other proteins [17].
2. Link prediction
Link prediction is a classification task on pairs of nodes in a graph. Most
often, this is a binary classification problem, where the task is to predict
whether an edge exists between two nodes, e.g. one to predict that an edge
is present and zero to predict that it is absent. Link prediction also exists for
graphs with multiple edge types, so edges are predicted to be one of several
types [18].
Link prediction can predict the presence of a relationship (edge) between two
individuals (nodes) in a social network, either presently or in the near future
[19]. Recommendation systems try to recommend products to customers; this
task is a link prediction problem, where one seeks edges between two differ-
ent types of nodes, the product nodes and the customer nodes [20, 21]. Link
prediction for entity resolution predicts links between different records in a
dataset that refer to the same object [22, 23]. For example, we want to link
a record describing "John Smith" with another record for the same person written "Smith, John". In bioinformatics, link prediction can predict rela-
tionships between drugs and diseases [24] and the similarity between diseases
[25]. Link prediction also includes finding new relationships between nodes
in knowledge graphs, a task called knowledge graph completion [26, 27].
3. Community detection
Community detection algorithms cluster graph nodes by using some prob-
lem dependent similarity measure. They are typically not machine learn-
ing based [28, 29], but some algorithms may be trained in an unsupervised
or semi-supervised manner [30, 31]. Applications include identifying social
groups within a social network [32], entity resolution [33], fraud detection
[34], text clustering (e.g. grouping Reddit posts into similar topics) [17] and
visualization [35, 36].
4. Node regression and edge regression
The traffic prediction literature tries to predict traffic conditions, like traf-
fic speed, volume, etc., in the near future from sensors on the road, which
supports tasks such as travel time estimation and route recommendations
[37, 38]. The road network has intersections as nodes and road segments as
edges. The sensors are additional nodes on the road network, so estimating
the numeric descriptors of traffic conditions at these sensors is a node regres-
sion problem. Less often, edge regression models support traffic prediction by
predicting edge weights that represent traffic flow or count [39]. Other node
regression applications include predicting house prices and weather charac-
teristics [40], and predicting the amount of internet traffic to web pages [41].
5. Graph classification and graph regression
Conventionally, time consuming and expensive laboratory experiments es-
tablish a molecule’s properties. Molecule property prediction is foundational
for the development of new materials with industrial applications and new
drugs to treat diseases, and consequently, significant resources have been de-
voted to developing a model that can accurately predict molecule properties
quickly and cheaply. Graphs naturally represent molecules, with nodes as atoms and edges as chemical bonds between two atoms, and GNNs,
which operate directly on graphs, quickly proved to be well suited to this task
[42, 43, 44, 45]. The accuracy of GNN predictions matches or exceeds that of
conventional models with expert features when enough labeled data is avail-
able for training [43], but labeled data is often limited in the target domain,
so prediction accuracy suffers [46]. To meet this challenge, self-supervised ap-
proaches that leverage large amounts of unlabeled data are being developed
[46, 47, 48].
An attributed graph has a set of nodes, N , as well as edges that define how the
nodes relate to each other. To simplify the discussion, we restrict our attention to
undirected graphs, so the edges are represented by a weighted, symmetric adjacency
matrix, A = (Aij ) where i, j ∈ N . An entry Aij is non-zero if an edge connects
node i to node j and zero otherwise. Each node i ∈ N has an attribute x_i ∈ R^ℓ for some ℓ ∈ N. Encoder-decoder models on graphs are a class of machine learning models. Machine learning on graphs presents challenges that do not arise in con-
ventional machine learning on vectors, because graphs are irregular data structures
and do not have a natural coordinate system. In particular, standard convolu-
tional neural networks for image arrays do not work on graphs, because the k-hop
neighborhoods may be different for every node. Nonetheless, a typical first step for
machine learning on graphs is to obtain a low-dimensional feature vector for every
node that contains all the information that is needed to complete the desired task.
These feature vectors are real vectors that often contain the information needed to
represent the local edge structure about each node.
A feature vector of a node is also called a node embedding or a node represen-
tation, and collectively the feature vectors can be used for tasks on nodes, tasks
on edges, or tasks on the entire graph. At the graph level, applying principal com-
ponent analysis (PCA) or t-distributed stochastic neighbor embedding (t-SNE) to
node embeddings can produce lower dimensional representations that enable visu-
alizations to help understand how the algorithms are performing [49]. In addition,
community detection algorithms use node embeddings to define the communities,
either in an end-to-end fashion [50] or as part of a two-step process by applying the
k-means algorithm to the node embeddings [51, 52]. Node embeddings also support
graph classification, where in the simplest case, a mean activation over all node em-
beddings of the graph determines the graph class. More sophisticated approaches
are described in [53, 54]. Not surprisingly, node features are also used for node level
tasks like node classification and regression, [55, 15], as well as for edge level tasks
like link prediction [56], edge classification [57, 58] or edge regression [59].
Due to the importance of node embeddings, there are many techniques to ob-
tain them for a range of goals and data conditions. Perhaps the simplest exam-
ple of a node embedding is given by the rows of an adjacency matrix: the map i ↦ (A_{ij})_{j∈N} defines node embeddings in R^{|N|}. However, it is difficult to use
these vector representations in machine learning due to their sparsity or high di-
mension, which tends to lead to overfitting. The row vectors are also poor features,
because they do not provide any structural information beyond each node’s 1-hop
neighborhood nor do they account for any node attributes.
Instead, researchers may use rule-based descriptions of nodes, like centrality or
clustering measurements, to produce low dimensional node representations that are
more information dense, which may subsequently be applied to a downstream task
with a traditional machine learning algorithm. The disadvantage of this approach
is that hand-crafted features are not part of the algorithm’s training process, so the
features are not fine-tuned to minimize the loss function. To do this, researchers
use an encoder-decoder approach. The encoder

Enc : N \to \mathbb{R}^\ell \tag{1}

maps each node i \in N to a node embedding z_i = \mathrm{Enc}(i) \in \mathbb{R}^\ell, and the decoder

Dec : \mathbb{R}^m \to \mathbb{R}^k , \tag{2}

converts those node embeddings into predictions, where m \ge \ell and k is the dimension of the model predictions.

Figure 1. The encoder-decoder framework: the encoder maps the graph to node embeddings, the decoder turns those embeddings into predictions, and the loss / evaluation compares the predictions with the ground truth function.
simply invert the encoder. It is instead a kind of interpreter that “decodes” abstract
node embeddings into predictions in order to solve the given task. The decoders for
common tasks like those described in Section 2 are usually simple functions with
few parameters, such as an inner product followed by a softmax function. Hence,
the majority of the model’s learnable parameters are usually in the encoder.
We introduce the term ground truth function
Gt : G → Rk , (3)
that provides reference information that is known about the graph, such as a
node’s class for node classification, which the loss function compares with the k-
dimensional model predictions. The training and evaluation algorithms use it to
assess the quality of model predictions. There does not seem to be an accepted
term in the literature that accounts for all contexts that occur. Hamilton, et al.
[55] consider the case of a relationship between two nodes and call it a pairwise
similarity function. This occurs in link prediction, where the ground truth function
may be a map : N × N → {0, 1} that says whether or not an edge exists between
two nodes. In node classification, however, the ground truth function typically pro-
vides the node’s class. In all cases, its role in the encoder-decoder framework is the
same, so we refer to it by a single name.
The loss functions and evaluation metrics link the ground truth function and
the model prediction. Most algorithms learn model parameters by some form of
gradient descent, where the loss functions are fairly smooth. Common examples of
loss functions are cross entropy loss for classification and L1 or L2 loss for regression
tasks. For evaluation metrics, common examples are accuracy, F1 and AUC (Area
Under the receiver operator characteristic Curve) for classification and RMSE (Root
Mean Square Error) and MAE (Mean Absolute Error) for regression tasks.
3.2 Shallow embedding examples
We now present several representative examples of models that produce embedding
lookups for nodes that were seen during the training process. These examples
will illustrate the encoder-decoder framework and at the end we will note their
shortcomings, which Hamilton et al. [55, 2] describe. This will lead us to more
complicated encoder-decoder models called GNNs in the next section.
For each example, the input is a fixed matrix that provides a similarity statistic
between any two nodes in N such as a weighted adjacency matrix. The output
of these algorithms is a real vector (a feature vector) for each node describing the
node’s neighborhood structure, and taken together, they support some downstream
machine learning task.
The Laplacian eigenmaps algorithm is an early and successful nonlinear dimen-
sionality reduction algorithm [60]. Given a user-defined parameter t > 0, a weighted
adjacency matrix, W = (Wij )i,j∈N , can be defined by
W_{ij} = W_{ij}(t) =
\begin{cases}
\exp\!\big( - \| x_i - x_j \|^2 / t \big) & \text{if } A_{ij} = 1 ,\\
0 & \text{otherwise} .
\end{cases}
\tag{4}
In practice, the above weighted adjacency matrix is typically the input to the Lapla-
cian eigenmaps algorithm, but a simple adjacency matrix or a k-nearest neighbor
matrix may alternatively be inputs.
The Laplacian eigenmaps algorithm can be reformulated in terms of the encoder-
decoder framework [55, 2]. Define the ground truth, decoder and loss functions by Gt(i, j) = W_{ij}, Dec(z_i, z_j) = \|z_i - z_j\|^2 and L(q, r) = q\,r. The algorithm seeks an encoder, with node embeddings z_i = Enc(i), that minimizes the model's loss L \in \mathbb{R}_+ up to a scaling factor, where that loss is

L = \sum_{i,j \in N} L\big( Gt(i, j), \mathrm{Dec}(z_i, z_j) \big) = \sum_{i,j \in N} W_{ij} \, \| z_i - z_j \|^2 , \tag{9}
where the minimization is subject to a constraint that prevents the solution from
collapsing to a lower dimension (i.e. Z^T D Z = I, where Z = (z_i)_{i∈N}). Notice that W_{ij} ≥ 0 is larger when i and j are adjacent, so the above loss penalizes the model during training when the embeddings of adjacent nodes are far apart. (Note that the constant encoder Enc(i) = 1 satisfies L = 0, but this is not useful.)
Belkin et al. [60] provide an optimal solution based on generalized eigenvectors of the graph Laplacian. The graph Laplacian is the matrix \Delta = D - W, where D is the diagonal matrix defined by D_{ii} = \sum_j W_{ji} [61]. The generalized eigenvectors (f_k)_{1 \le k \le |N|} \subset \mathbb{R}^{|N|} are the solutions to the equation

\Delta f = \lambda D f , \tag{10}

where they are labeled so that the corresponding eigenvalues are in increasing order, 0 = \lambda_1 \le \lambda_2 \le \dots \le \lambda_{|N|}. The optimal encoder discards the trivial constant eigenvector f_1 and maps node i to the embedding z_i = Enc(i) = (f_2(i), \dots, f_{\ell+1}(i)).
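A compact sketch of this procedure in NumPy/SciPy follows; the function name and the toy graph are ours, the matrices are dense for brevity, and practical implementations use sparse solvers.

    import numpy as np
    from scipy.linalg import eigh

    def laplacian_eigenmaps(A, X, t=1.0, dim=2):
        """Laplacian eigenmaps sketch: A is a binary adjacency matrix (n x n),
        X holds node attributes (n x l), and t is the heat-kernel parameter of Eq. (4)."""
        sq_dists = np.square(X[:, None, :] - X[None, :, :]).sum(-1)
        W = np.where(A == 1, np.exp(-sq_dists / t), 0.0)   # weighted adjacency, Eq. (4)
        D = np.diag(W.sum(axis=0))                         # degree matrix D_ii = sum_j W_ji
        L = D - W                                          # graph Laplacian
        # Generalized eigenproblem L f = lambda D f, eigenvalues in increasing order.
        _, eigvecs = eigh(L, D)
        # Drop the trivial constant eigenvector and keep the next `dim` ones.
        return eigvecs[:, 1:dim + 1]

    # Toy usage: a 4-node path graph with one-dimensional attributes.
    A = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], dtype=float)
    X = np.array([[0.0], [1.0], [2.0], [3.0]])
    Z = laplacian_eigenmaps(A, X, t=2.0, dim=2)   # one 2-dimensional embedding per node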
A second family of shallow embedding methods, the matrix factorization methods, define the ground truth, decoder and loss functions by

Gt : N \times N \to \mathbb{R}_+ , \quad Gt(i, j) \in \mathbb{R}_+ , \tag{13}

Dec : \mathbb{R}^\ell \times \mathbb{R}^\ell \to \mathbb{R} , \quad Dec(w, z) = w^T z , \tag{14}

L : \mathbb{R}_+ \times \mathbb{R} \to \mathbb{R}_+ , \quad L(q, r) = \tfrac{1}{2} (q - r)^2 . \tag{15}
Given an encoder Enc with node embeddings z_i = Enc(i), the loss L \in \mathbb{R}_+ is

L = \sum_{i,j \in N} L\big( Gt(i, j), \mathrm{Dec}(z_i, z_j) \big) = \frac{1}{2} \sum_{i,j \in N} \big( z_i^T z_j - Gt(i, j) \big)^2 . \tag{17}
Notice that if Z = (zi ) is the matrix of features in Rℓ×|N | , then the above loss
satisfies
L = \frac{1}{2} \, \| Z^T Z - S \|^2 , \tag{18}
where S is the matrix with entries Sij = Gt(i, j). Minimizing L means finding a
matrix Z that factors the ground truth of matrix S as shown in Equation (18),
which is why the methods are called matrix factorization methods.
Ahmed et al. [62] define the ground truth function by Gt(i, j) = Ai,j , where
(Aij )i,j∈N are the coefficients of the adjacency matrix. Hence, their goal is to find
a solution that minimizes the loss
L = \frac{1}{2} \sum_{i,j \in N} \big( z_i^T z_j - A_{ij} \big)^2 . \tag{19}
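For illustration only, this loss can be minimized directly with gradient descent; the short NumPy sketch below uses our own function name and hyperparameters and is not the method of [62].

    import numpy as np

    def factorize_adjacency(A, dim=8, lr=0.01, epochs=500, seed=0):
        """Learn embeddings Z (dim x |N|) by minimizing L = 0.5 * ||Z^T Z - A||^2."""
        rng = np.random.default_rng(seed)
        Z = 0.1 * rng.standard_normal((dim, A.shape[0]))
        for _ in range(epochs):
            R = Z.T @ Z - A        # residual matrix (symmetric when A is symmetric)
            Z -= lr * (2 * Z @ R)  # gradient of the loss with respect to Z
        return Z                   # column i is the embedding z_i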
In 2014, Perozzi et al. [16] introduced random walks on a graph as a tool to learn node embeddings (the DeepWalk algorithm) that capture the edge structure of larger node neighborhoods in a computationally efficient manner. After a random initialization
of node features, a stochastic gradient descent algorithm updates features to opti-
mize the information necessary to estimate the probability that two nodes co-occur
on the same random walk of a fixed length. Two years later, Grover et al. [65] improved on this approach with node2vec, adding explore and return parameters that respectively determine the algorithm's tendency to explore new nodes and return to the
starting node. Later, [66] showed that these random walk methods are essentially
matrix factorization techniques.
Matrix factorization methods have the advantage of being applicable on graphs
without attributes. On attributed graphs, however, all of the examples of shallow
models share several shortcomings [55]:
1. They make insubstantial use of the node attributes during training, so they
do not use all available information. Moreover, these models tend to define
similarity in terms of proximity, and consequently they usually produce poor
results when adjacent nodes in a graph tend to be dissimilar [67].
2. Trained models cannot be applied to unseen nodes without further training.
This is impractical for dynamic graphs and for graphs that are so large that
they cannot fit in memory. It also means that a model trained in a setting
with a lot of labeled data is not transferrable to an unseen graph in a related
domain with sparsely labeled data.
3. The information is not efficiently stored in the model. Each trained model is
the collection of node features for the graph, which means model parameters
are not shared across nodes. In particular, the number of parameters grows
linearly with |N |, which can create memory challenges for processing on large
graphs.
The next section discusses more powerful encoder-decoder approaches called graph
neural networks, which resolve these shortcomings.
4 Graph neural networks

This section focuses on GNNs that have so-called message-passing layers (described below). The vast majority of GNNs in the literature have message-passing layers.
Figure: the GNN encoder pipeline, Input → Pre-processing Layers → Message-Passing Layers → Post-processing Layers → Output.
The pre-processing layers collectively form a fully connected feedforward neural network

\text{Pre-Proc} : \mathbb{R}^m \to \mathbb{R}^{\tilde m} , \quad \text{Pre-Proc}(x_i) = \tilde{x}_i , \tag{21}
that maps each node attribute vector xi to a node feature vector x̃i in a computation
that does not involve the edges of the graph.
These node features feed into the message-passing layers, which are the most important layers for a GNN's performance [68]. If A is a graph with a matrix of node features \tilde{X} = (\tilde{x}_i)_{i \in N}, then a message-passing layer is a map

\text{Message-Passing} : (\tilde{X}, A) \to (H, A) \tag{22}

from the graph A with node features \tilde{X} to the graph A with node features H = (h_i)_{i \in N}, where the node feature vectors h_i \in \mathbb{R}^{\tilde\ell} are obtained by aggregating information from each node's neighborhood. Then node features from each successive
message-passing layer contain information that has been aggregated over a wider
set of nodes than the previous layers. At the end, an encoder of a k-layer GNN
that aggregates node features over a 1-hop neighborhood produces low dimensional
node embeddings that summarize information in each node’s k-hop neighborhood.
In this way, message-passing layers resemble the highly successful convolutional
neural networks for image classification.
Node features from the message-passing layers subsequently feed into the final
layers of the GNN encoder called the post-processing layers. They are collectively,
like the pre-processing layers, fully connected feedforward neural networks that map each node feature vector h_i produced by the message-passing layers to the node embedding z_i. The encoder of the GNN therefore maps each node, with its node attribute x_i, to a node embedding z_i = Enc(i).
Each message-passing layer of the encoder computes its output using the same
process. Consider a K-layer message-passing network. For each node i, define
h_i^{(0)} = \tilde{x}_i, and for integers 0 < k < K, let h_i^{(k)} \in \mathbb{R}^{\tilde\ell_k} be the node feature vector that is the output of the kth message-passing layer. Starting from the output of the kth message-passing layer, for each node i, the (k+1)st message-passing layer computes the vector h_i^{(k+1)} by

h_i^{(k+1)} = \phi\Big( h_i^{(k)} \bigwedge \bigoplus_{j \in N_i} \mu_{ij} \Big) , \tag{25}

where N_i is the neighborhood of node i, \mu_{ij} is the message passed from node j to node i, \bigoplus is a permutation-invariant aggregation over the neighborhood, \bigwedge combines a node's own features with the aggregated messages (for example by summation or concatenation, see Table 1), and \phi is a learnable update function.
The way the message \mu_{ij} is computed defines a GNN's type. We denote the message-passing category by its initials, MP, to help distinguish it from message-passing layers. Our description of each GNN
category follows Bronstein et al. [4]. This discussion is intended to capture key
ideas rather than all subtle similarities and differences between individual models.
An architecture is in the convolutional category,

h_i^{(k+1)} = \phi\Big( h_i^{(k)} \bigwedge \bigoplus_{j \in N_i} w_{ij} \, \psi\big(h_j^{(k)}\big) \Big) , \tag{26}

if the value \mu_{ij} from (25) is defined by \mu_{ij} = w_{ij} \psi(h_j^{(k)}), where \psi is a differentiable function that can have trainable parameters, such as an affine linear transformation \psi : \mathbb{R}^{\tilde\ell_k} \to \mathbb{R}^{\tilde\ell_k},

\psi\big(h_j^{(k)}\big) = W h_j^{(k)} + b , \tag{27}

where W \in \mathbb{R}^{\tilde\ell_k \times \tilde\ell_k} is a matrix and b \in \mathbb{R}^{\tilde\ell_k} is a vector. The coefficients w_{ij} are unlearned weights, usually depending only on the local graph topology, which encode the connection strength between pairs of nodes [69, 15, 17, 68, 70]. If the
graph exhibits homophily, meaning that nodes with similar features or the same
class label tend to be linked [71], then in principle, the fixed weights w_{ij} make
these models a good choice due to their scalability and regularization. This occurs,
for example, in a social network with users connected by friendship [72]. On the
downside, the rigidness of fixed weights may inhibit their ability to represent the
complex relationships that arise in low homophily graphs.
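To make the convolutional category concrete, the following rough PyTorch sketch implements a single layer with fixed GCN-style weights w_ij = 1/sqrt(d_i d_j) [15]; the dense-matrix formulation, class name and activation are ours and are meant only to illustrate Equations (26) and (27).

    import torch

    class SimpleConvLayer(torch.nn.Module):
        """Convolutional message-passing layer, Equation (26):
        h_i' = ReLU( sum_j w_ij * (W h_j + b) ), with fixed weights
        w_ij = 1 / sqrt(d_i * d_j) and self-loops included."""
        def __init__(self, in_dim, out_dim):
            super().__init__()
            self.psi = torch.nn.Linear(in_dim, out_dim)   # affine map of Equation (27)

        def forward(self, H, A):
            A_hat = A + torch.eye(A.shape[0])             # add self-loops
            deg = A_hat.sum(dim=1)
            w = A_hat / torch.sqrt(deg[:, None] * deg[None, :])   # unlearned weights w_ij
            return torch.relu(w @ self.psi(H))            # aggregate neighbors, then update

    # Toy usage: 5 nodes with 8-dimensional features.
    A = (torch.rand(5, 5) > 0.5).float()
    A = ((A + A.T) > 0).float()                           # symmetric toy adjacency matrix
    layer = SimpleConvLayer(8, 16)
    out = layer(torch.randn(5, 8), A)                     # shape (5, 16)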
An architecture is in the MP (message-passing) category if the message is a learned, generally vector-valued function of both endpoints, \mu_{ij} = \psi\big(h_i^{(k)}, h_j^{(k)}\big), so that

h_i^{(k+1)} = \phi\Big( h_i^{(k)} \bigwedge \bigoplus_{j \in N_i} \psi\big(h_i^{(k)}, h_j^{(k)}\big) \Big) . \tag{28}

An architecture is in the attentional category if the message has the form \mu_{ij} = a\big(h_i^{(k)}, h_j^{(k)}\big)\, \psi\big(h_j^{(k)}\big), where the algorithm learns the scalar-valued attention function a and possibly also the function \psi [75, 76, 77, 78, 79]. For example, in GATv2 [78] the function a is computed by a small neural network applied to the pair \big(h_i^{(k)}, h_j^{(k)}\big) and normalized over each neighborhood with a softmax. GraphSAGE, by contrast, is a convolutional network whose \bigwedge function is concatenation rather than summation (see Equation (31) and Table 1). This enables GraphSAGE to better preserve the information of each node when mixing it with that of its neighbors hurts performance. Section 5.2 presents results from numerous experiments showing that GraphSAGE tends to outperform the attentional networks GAT and GATv2 on low homophily graphs; see also [17, 77, 78].
Table 1. ⋀ versus GNN category for some common GNNs.

                     Convolutional    Attentional
    Sum              GCN              GAT, GATv2
    Concatenation    GraphSAGE
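A similarly rough single-head sketch in the spirit of the attentional category and of GAT/GATv2 [77, 78]; it uses dense tensors and our own class name, and it is not the reference implementation.

    import torch

    class SimpleAttentionLayer(torch.nn.Module):
        """Attentional message-passing layer: a scalar score for each neighboring pair
        (i, j) comes from a small neural network and is normalized with a softmax
        over the neighborhood of node i."""
        def __init__(self, in_dim, out_dim):
            super().__init__()
            self.psi = torch.nn.Linear(in_dim, out_dim)   # feature transformation psi
            self.att = torch.nn.Linear(2 * out_dim, 1)    # scoring network for a(h_i, h_j)

        def forward(self, H, A):
            n = A.shape[0]
            A_hat = A + torch.eye(n)                      # include self-loops
            Hp = self.psi(H)
            pairs = torch.cat([Hp[:, None, :].expand(n, n, -1),
                               Hp[None, :, :].expand(n, n, -1)], dim=-1)
            scores = self.att(torch.nn.functional.leaky_relu(pairs)).squeeze(-1)
            scores = scores.masked_fill(A_hat == 0, float('-inf'))  # only neighbors
            alpha = torch.softmax(scores, dim=1)          # learned weights a(h_i, h_j)
            return torch.relu(alpha @ Hp)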
Lastly, we remark that a version of convolutional GNNs also exists for the spectral domain, where an aggregation function operates on the eigenvectors of the graph Laplacian [81]. Compared with the convolutional GNNs described above, the spectral version may provide richer features, but it is also more memory intensive and does not readily extend to directed graphs nor allow predictions on unseen
nodes [82, 83].
We now describe decoders for several of the common tasks of Section 2. For an integer K > 0, the softmax function, \text{softmax} : \mathbb{R}^K \to \mathbb{R}^K, is defined along each coordinate by

\text{softmax}(s)_j = \frac{\exp(s_j)}{\sum_{k=0}^{K-1} \exp(s_k)} , \tag{32}

where s = (s_k)_{k=0}^{K-1} \in \mathbb{R}^K. Notice that \sum_{j=0}^{K-1} \text{softmax}(s)_j = 1.
As usual, define z_i = Enc(i) and let A = (A_{ij})_{i,j \in N} be the graph's adjacency matrix.
1. Node classification
Let K be the number of class labels, and let y_i = (y_i(c))_{c=1}^{K} \in \{0, 1\}^K be the ground truth vector with entries y_i(c) = 1 if node i is in class c and y_i(c) = 0 otherwise. For a matrix \Theta \in \mathbb{R}^{\ell \times K} with trainable parameters, the ground truth function is Gt(i) = y_i, the decoder is Dec(z_i) = \text{softmax}(z_i^T \Theta), and the loss is the cross entropy

L = - \sum_{i \in N} y_i^T \log \text{softmax}\big( z_i^T \Theta \big) .
When stochastic gradient descent is used, the sum is over a batch of nodes
B ⊂ N . (This same comment also applies to the losses in the examples
below). Also see [15, 84, 68, 77].
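A minimal PyTorch sketch of this decoder and loss; the tensors below are random stand-ins for the encoder output and the class labels.

    import torch
    import torch.nn.functional as F

    num_nodes, embed_dim, num_classes = 100, 16, 7
    z = torch.randn(num_nodes, embed_dim)                 # stand-in for Enc output
    y = torch.randint(0, num_classes, (num_nodes,))       # stand-in class labels

    theta = torch.nn.Parameter(torch.randn(embed_dim, num_classes))
    logits = z @ theta                       # z_i^T Theta for every node
    loss = F.cross_entropy(logits, y)        # cross entropy applies log-softmax internally
    loss.backward()                          # gradients reach theta (and the encoder in practice)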
2. Link prediction
The sigmoid function, \text{sigmoid} : \mathbb{R} \to \mathbb{R}, is defined for t \in \mathbb{R} by

\text{sigmoid}(t) = \frac{1}{1 + \exp(-t)} . \tag{37}

The ground truth function is Gt(i, j) = A_{ij}, and the decoder Dec(z_i, z_j) = \text{sigmoid}(z_i^T z_j) estimates the probability that an edge connects nodes i and j. Then the loss is the binary cross entropy

L = \sum_{(i,j)} L\big( Gt(i, j), \mathrm{Dec}(z_i, z_j) \big) = - \sum_{(i,j)} \Big[ A_{ij} \log \text{sigmoid}(z_i^T z_j) + (1 - A_{ij}) \log\big( 1 - \text{sigmoid}(z_i^T z_j) \big) \Big] . \tag{41}
See [56, 85, 86] for more sophisticated examples.
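A corresponding sketch of the inner-product link prediction decoder with the binary cross entropy of Equation (41) (averaged rather than summed); the tensors are again random stand-ins.

    import torch
    import torch.nn.functional as F

    num_nodes, embed_dim = 50, 16
    z = torch.randn(num_nodes, embed_dim, requires_grad=True)   # stand-in node embeddings
    A = (torch.rand(num_nodes, num_nodes) > 0.9).float()        # stand-in adjacency matrix

    scores = z @ z.T                                        # z_i^T z_j for every pair (i, j)
    loss = F.binary_cross_entropy_with_logits(scores, A)    # Equation (41), averaged over pairs
    loss.backward()
    edge_probs = torch.sigmoid(scores)                      # Dec(z_i, z_j), predicted edge probabilities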
3. Graph classification
Graph classification can be done like the node classification example but
with one additional step. After the encoder produces the node embeddings,
apply a global aggregator (e.g. entry-wise addition), which combines all node
embeddings produced by the encoder into a single feature vector. This feature
vector represents the graph and can be converted into a prediction, as done
in the node classification example.
Specifically, consider a set of graphs \mathcal{G}, and for notational convenience, for any graph G \in \mathcal{G}, include its number of nodes, n, as a subscript, G_n = G. Let K be the number of class labels, and let y_{G_n} = (y_{G_n}(c))_{c=1}^{K} be the ground truth vector for G_n, so y_{G_n}(c) = 1 if G_n is in class c and y_{G_n}(c) = 0 otherwise. Then for a matrix \Theta \in \mathbb{R}^{\ell \times K} with trainable parameters, the loss is

L = - \sum_{G_n \in \mathcal{G}} y_{G_n}^T \log \text{softmax}\!\left( \sum_{i=1}^{n} z_i^T \Theta \right) , \tag{45}

where the inner sum over i aggregates the node embeddings of G_n into a single graph-level feature vector.
4. Community detection

Suppose we want to partition the nodes of the graph into K > 1 clusters. Define

\delta(i, j) =
\begin{cases}
1 & \text{if } i \text{ and } j \text{ belong to the same cluster} ,\\
0 & \text{otherwise} .
\end{cases}
\tag{46}
Let E denote the number of edges of the graph, and let di be the degree of
node i. Then the modularity metric is
Q = \frac{1}{2E} \sum_{i,j \in N} \left( A_{ij} - \frac{d_i d_j}{2E} \right) \delta(i, j) . \tag{47}
Let d \in \mathbb{R}^{|N|} denote the vector of node degrees and define the modularity matrix

B = A - \frac{d \, d^T}{2E} . \tag{48}

Then

Q = \frac{1}{2E} \, \mathrm{Tr}\big( C^T B C \big) , \tag{49}

where C \in \{0, 1\}^{|N| \times K} is the cluster assignment matrix (i.e. C_{ik} = 1 if node i belongs to cluster k, and C_{ik} = 0 otherwise).
Next, relax the entries of C by allowing them to take values in the interval
[0, 1]. This way we can apply continuous optimization methods to Q, which
is differentiable with respect to the entries of C. Specifically, let \Theta \in \mathbb{R}^{\ell \times K} be a learnable parameter matrix and define the decoder by Dec(z_i) = \text{softmax}(z_i^T \Theta), so that the ith row of the relaxed assignment matrix C is C_i = Dec(z_i).
Next we will define the ground truth function. This is somewhat of a mis-
nomer for community detection problems because these problems have no
ground truth, but it will serve the same purpose: guiding training. Define
Gt : N \times N \to \mathbb{R} , \quad (i, j) \mapsto B_{ij} = A_{ij} - \frac{d_i d_j}{2E} . \tag{52}
Then the decoder outputs probability estimates that a node is in a given
cluster, which determine meaningful communities when the loss
L = - \frac{1}{2E} \, \mathrm{Tr}\big( C^T B C \big) \tag{53}
has a large negative value. Notice that L is differentiable, so the graph neural
network can be trained in an end-to-end fashion. See [50] for implementation
details, which include a regularization term not included here; see the sketch after this item for a minimal version of this loss.
We just discussed an unsupervised approach to community detection with
graph neural networks, but there is also semi-supervised community detec-
tion. Here, the modeler incorporates knowledge that some nodes must be
in the same class (must-link constraints) and some nodes cannot be in the
same class (cannot-link constraints), [91]. Supervised community detection
also exists, but this is not common and may be regarded as a type of node
classification problem [92].
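The sketch referred to above: it follows Equations (48)-(53) with soft assignments C = softmax(zΘ), and it is an illustration rather than the implementation of [50].

    import torch

    def soft_modularity_loss(z, A, theta):
        """L = -Tr(C^T B C) / (2E) with soft cluster assignments C = softmax(z @ theta)."""
        C = torch.softmax(z @ theta, dim=1)          # relaxed assignment matrix
        d = A.sum(dim=1)                             # node degrees
        two_E = A.sum()                              # 2E for a symmetric 0/1 adjacency matrix
        B = A - torch.outer(d, d) / two_E            # modularity matrix, Equation (48)
        return -torch.trace(C.T @ B @ C) / two_E     # Equation (53)

    num_nodes, embed_dim, K = 30, 16, 4
    z = torch.randn(num_nodes, embed_dim, requires_grad=True)   # stand-in node embeddings
    A = (torch.rand(num_nodes, num_nodes) > 0.8).float()
    A = ((A + A.T) > 0).float()                                  # symmetrize the toy graph
    theta = torch.nn.Parameter(torch.randn(embed_dim, K))
    loss = soft_modularity_loss(z, A, theta)
    loss.backward()                                              # end-to-end trainable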
5. Node regression
Our final example illustrates that problems with a time variable can fit within
the same framework. A common node regression problem is to predict numer-
ical values of traffic speed and volume at sensors located on a road network.
These models can be complex, but a relatively simple one appears in [93].
For each time t \in \mathbb{N}, their ground truth, decoder and loss functions are defined over the set N_s \subset N of nodes with sensors. A loss function defined by mean absolute error is used by the top performing models [94, 95].
In inductive learning, the test data are not available during training, whereas in transductive learning, all test data except the test labels are
available during training. This means that inductive learning for node classification
is the usual supervised learning. For an example with the node classification task,
consider a coauthor network, where each node is an author, an edge between two
nodes indicates the two authors have worked together, and node features represent
key words from their papers [96]. In inductive learning, we may have a test graph
that covers the years 2000-2007 and a separate training graph that covers the years
2000-2004. The goal is to predict the most active fields of study for authors in
the test graph who are absent from the training graph. In transductive learning,
the training graph is the same graph from 2000-2007, but the labels of the test set
would be withheld during training.
5 Experiments
This section complements the previous theoretical sections with experimental re-
sults. The goal is to describe the behavior of GNNs under several training and
dataset conditions. Our experiments focus on GCN, GATv2, and GraphSAGE be-
cause they are commonly used as benchmarks and many GNN architectures are
built on top of them, for example [97, 98, 99, 56, 50, 100]. Table 1 summarizes
important properties of these GNNs. Our experiments include two other graph
models: Multilayer Perceptron (MLP), which only uses node features, and Deep-
Walk, which only uses edges. All experiments are in the transductive setting. Two
limitations are that none of our datasets are large and we only consider the node
classification task.
Thirteen open-source datasets are used: seven high homophily datasets and six
low homophily ones. The high homophily datasets are citation networks (Cora,
PubMed, CiteSeer, DBLP [101, 102]), co-purchase networks (AmazonComputers,
AmazonPhoto [96]), and a coauthor network (CoauthorCS [96]). The low ho-
mophily datasets are webpage-webpage networks (WikipediaSquirrel, Wikipedi-
aChameleon, WikipediaCrocodile, Cornell, Wisconsin [41, 103]) and a co-occurrence
network (Actor [103]). All datasets are homogeneous graphs, which means they
have a single node type (e.g. "article") and a single edge type (e.g. "is cited by").
The Squirrel, Chameleon and Crocodile datasets are node regression datasets, so we
transform them into node classification networks by partitioning the range of values
into five parts, where each part defines a class. The remaining ten are natively node
classification datasets.
Edge homophily and the signal-to-noise ratio (SNR) of node features are two
measures of complexity in an attributed graph. Edge homophily is the fraction of
edges that connects two nodes of the same class [71]. SNR is a measure of node
features, where roughly speaking, it is the squared distance between the mean of
each class compared to the variance within each class. Specifically, let C be the
node classes, and for each class i \in C, let F_i be the node features of class i. Define the signal S = \{\mathrm{Mean}(F_i) - \mathrm{Mean}(F_j)\}_{i,j \in C}. Then

\mathrm{SNR} = \frac{1}{|C|} \, \frac{\|S\|^2}{\sum_{j \in C} \mathrm{Var}(F_j)} , \tag{58}

where \|\cdot\| is the \ell_2 norm, and the factor 1/|C| comes from averaging: divide the numerator by |C|^2 and the denominator by |C|. Table 2 records these characteristics
for each dataset.
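Both measures are straightforward to compute; the sketch below follows the definition of edge homophily in [71] and Equation (58), with our own function names and with Var(F_j) taken as the total variance across feature dimensions.

    import numpy as np

    def edge_homophily(edges, labels):
        """Fraction of edges whose endpoints share a class label [71].
        `edges` is a list of (i, j) pairs and `labels` is an array of node classes."""
        same = sum(labels[i] == labels[j] for i, j in edges)
        return same / len(edges)

    def feature_snr(X, labels):
        """Signal-to-noise ratio of node features, Equation (58)."""
        classes = np.unique(labels)
        means = {c: X[labels == c].mean(axis=0) for c in classes}
        signal = sum(np.sum((means[a] - means[b]) ** 2) for a in classes for b in classes)
        noise = sum(X[labels == c].var(axis=0).sum() for c in classes)
        return signal / (len(classes) * noise)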
Unlike in computer vision or language modeling, GNNs are usually trained from randomly initialized parameters rather than fine-tuned from a pre-trained model, which is feasible because they are relatively small. We take the same approach. PyTorch
Geometric is the framework for all experiments. In Section 5.1, the models are
run in an off-the-shelf manner without tuning any hyperparameters except the
number of training epochs. In Section 5.2, the hyperparameters (e.g. the number
of message-passing layers) of GNNs are tuned to each dataset, where a GitHub repository called GraphGym (https://github.com/snap-stanford/GraphGym.git) manages the experiments [104]. All experiments are
run 25 times.
The baseline GCN, GATv2 and GraphSAGE architectures are two-layer message-passing networks with 16 hidden dimensions and no pre-processing or post-processing layers. DeepWalk also has 16 hidden dimensions, and MLP is a three-layer fully connected network whose first layer has 128 hidden dimensions and whose second layer has 64. In the literature, GAT is a benchmark more often than GATv2, but we
use GATv2 because it consistently outperforms GAT in our experiments [78]. In
all cases, the models are trained for at most 200 epochs with a learning rate of 0.1
and a train/val/test split. We use two training sizes, 80% or 1%, and the labels not
used for training are divided evenly among the validation and test sets.
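For orientation, a minimal PyTorch Geometric definition of such a baseline (a two-layer GCN with 16 hidden dimensions on Cora); the training loop is omitted, and the exact configuration used in our experiments may differ from this sketch.

    import torch
    import torch.nn.functional as F
    from torch_geometric.datasets import Planetoid
    from torch_geometric.nn import GCNConv

    dataset = Planetoid(root='data/Cora', name='Cora')      # citation network benchmark [101]
    data = dataset[0]

    class BaselineGCN(torch.nn.Module):
        def __init__(self):
            super().__init__()
            self.conv1 = GCNConv(dataset.num_node_features, 16)   # 16 hidden dimensions
            self.conv2 = GCNConv(16, dataset.num_classes)

        def forward(self, x, edge_index):
            x = F.relu(self.conv1(x, edge_index))
            return self.conv2(x, edge_index)                      # class logits per node

    model = BaselineGCN()
    out = model(data.x, data.edge_index)
    loss = F.cross_entropy(out[data.train_mask], data.y[data.train_mask])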
Table 3 lists CPU computation times for each model on the Cora dataset [101],
which is a dataset often used for benchmarking. The graph convolutional models
have processing times comparable to that of MLP, while the attentional network, GATv2,
is somewhat slower. The shallow embedding model, DeepWalk, is by far the slowest.
The theory in Section 4.1 indicates that more flexible models should do better
on low homophily graphs and the more rigid ones should outperform on high ho-
mophily graphs. This is illustrated in Table 6, which provides node classification
accuracy scores for each model under each training condition. GATv2 and GCN use similar addition-based aggregation functions ⊕, as described in Equation (25),
but GATv2 is more flexible than GCN because it is attentional rather than convo-
lutional. Accordingly, we see that GATv2 outperforms GCN on the low homophily
(i.e. high edge complexity) graphs. As noted in Section 4.1, GraphSAGE is a convolutional network that is more flexible than GATv2 in the ⋀ function (see Equation 31 and Table 1). This results in GraphSAGE outperforming GATv2 on low homophily graphs. In contrast, GraphSAGE's ⋀ function hurts performance
on high homophily (i.e. low edge complexity) graphs, as shown by the improved
performance of GATv2 and GCN in these settings. In fact, the top performing
model on high homophily graphs is the most rigid one, GCN. The greatest advan-
tage of GNNs over MLP is on high homophily datasets with little training data,
which suggests that GNNs make effective use of edge information in this setting
(because MLP does not use edge information). DeepWalk performs almost as well
on high homophily datasets as GNNs, but because it relies entirely on edge information, it performs the worst on low homophily datasets.
At first sight, it seems odd that MLP, which does not use edge information,
tends to do better on high homophily graphs than on low homophily ones. This reflects the fact that the SNR values of the datasets are largely correlated with their homophily values. The only exceptions are the Cornell and Wisconsin datasets, which have low homophily and high SNR. The unusually small size of these datasets hurts the accuracy of MLP despite their high SNR values.
The tendency of low homophily datasets to also have low node-feature SNR presents an additional challenge when working with them. However, the literature largely focuses on creating GNNs that effectively handle edge complexity (i.e. low homophily) while rarely mentioning that the node features of these datasets are often poorly separated between classes as well.
5.2.1 Adjusting the number of hidden dimensions

In principle, increasing the number of hidden dimensions should help GNN performance on the more complex, low homophily datasets, but the low homophily graphs showed the least improvement.
In Section 5.2.3, we see that in the 80% training size regime, there is significant
improvement by tuning the other hyperparameters such as the number of layers
and the skip connections. This indicates that increasing the number of hidden di-
mensions needs to be paired with other structural improvements to see significant
improvement in node classification accuracy.
5.2.2 Adjusting the number of training epochs
Figure 3. These figures provide the test set performance for the tuned GCN
model on each dataset for medium difficulty graph complexity and training
conditions. Plots for GATv2 and GraphSAGE look similar.
Message-passing layers aggregate information over node neighborhoods, so for low homophily graphs, they
tend to aggregate conflicting information, which can hurt performance. Additional
layers to process node features without sharing neighborhood information could
help. To this end, in Section 5.2.3, we add pre- and post-processing layers and
tune other hyperparameters, and we recover a training plot for the low homophily
graphs that resembles Figure 3b (see Figure 3a).
5.2.3 Adjusting the number of layers and other hyperparameters
Table 8. Hyperparameter values to tune

    Parameter                Tuning Order   Starting Value   Options
    Message-Passing Layers   1              2                1, 2, 3, 4, 5, 6, 7, 8
    Post-Processing Layers   2              1                1, 2, 3
    Pre-Processing Layers    3              1                1, 2, 3
    Layer Connectivity       4              Skip Sum         None, Skip Sum, Skip Concatenate
    Aggregation Function     5              Mean             Add, Mean, Max
    Learning Rate            6              0.01             0.005, 0.01, 0.0125, 0.015
We choose each option from Table 8 in a greedy fashion, by first finding the best
option for the number of message-passing layers and then proceeding according to
the tuning order in the table. All models are trained for 400 epochs. Every dataset
is partitioned between training and test sets, and the best hyperparameter selection
is the one with the highest average test set accuracy from 25 experiments. Following
You et al. [104], we adjust the number of hidden layers to make the size of each
design comparable, and thus enable a fair comparison of them.
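Schematically, the greedy search proceeds as in the outline below; `evaluate` stands in for training a design 25 times and returning its mean test accuracy, and the code is a sketch rather than the GraphGym implementation.

    from copy import deepcopy

    search_space = {                       # hyperparameters and options of Table 8, in tuning order
        "mp_layers": [1, 2, 3, 4, 5, 6, 7, 8],
        "post_layers": [1, 2, 3],
        "pre_layers": [1, 2, 3],
        "connectivity": ["none", "skip_sum", "skip_concat"],
        "aggregation": ["add", "mean", "max"],
        "lr": [0.005, 0.01, 0.0125, 0.015],
    }
    start = {"mp_layers": 2, "post_layers": 1, "pre_layers": 1,
             "connectivity": "skip_sum", "aggregation": "mean", "lr": 0.01}

    def greedy_tune(evaluate):
        """Tune one hyperparameter at a time, keeping the best option before moving on."""
        best = deepcopy(start)
        for name, options in search_space.items():
            scores = {value: evaluate({**best, name: value}) for value in options}
            best[name] = max(scores, key=scores.get)
        return best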
Table 9. Average node classification accuracy improvement over default with 128 hidden dimensions.

                     80% Training      80% Training      1% Training       1% Training
    Model Name       High Homophily    Low Homophily     High Homophily    Low Homophily
    (Difficulty)     (Easy)            (Medium)          (Medium)          (Hard)
    GCN              +0.57             +22.98            +0.93             −0.22
    GATv2            +1.53             +16.39            +4.57             −0.92
    GraphSAGE        +0.15             +6.18             +4.02             −3.89
Table 9 shows there is little or no value in tuning the structure of the GNN to
the dataset in the easy or hard conditions. In fact, the off-the-shelf models outper-
form the hyperparameter-tuned ones in the hard condition. This may be because the tuning starts from a design that is suboptimal compared to the one chosen by the architecture creators, and because tuning any individual hyperparameter provides little additional benefit beyond that starting point.
The most benefit for tuning the GNN design occurs when the training and
dataset conditions are of medium difficulty. Although the tuned GraphSAGE per-
forms best on low homophily graphs and the tuned GCN performs best on high
homophily ones, once tuned, all models perform comparably. Table 10 indicates
that most of the gain on the low homophily graphs with plenty of training data
comes from the Cornell and Wisconsin datasets. Cornell and Wisconsin are special
in that they are small datasets with low homophily and a high SNR among their
node features. Having fewer nodes may have made the GNN performance more sen-
sitive to improvements, and the datasets having a high SNR may have enabled a
reasonably high node classification accuracy, with the appropriate hyperparameter
configuration.
Table 10. The average improvement of node classification accuracy for tuned
designs over default ones with 128 hidden dimensions (see Section 5.2.1) on
the collection of Cornell and Wisconsin datasets versus the collection of other
low homophily datasets, with 80% of node labels used for training. The hyperparameters
for each algorithm have been tuned to each dataset.
Table 11. The mean value and p-value of the number of layers and the
learning rates of the hyperparameter-tuned models. The hyperparameters
were tuned for each model on the low and high homophily dataset collections
with 1% and 80% of node labels for training. For each hyperparameter, the
p-value is for the null hypothesis that the selections over the collection of
datasets are drawn from a random distribution.
1% Training
High Homophily
GCN 1.57 0.12 6.86 0.0 1.43 0.05 0.010 0.37
GATv2 1.71 0.25 5.86 0.05 1.43 0.05 0.010 0.46
GraphSAGE 2.29 0.13 7.71 0.0 1.71 0.25 0.0075 0.02
Table 12. The most common and % occurring values of the skip connections
and aggregation functions on the low and high homophily dataset collections,
with 1% and 80% of nodes training. The hyperparameters for the algorithms
have been tuned to each dataset. The % occurring field says how often the
most common selection occurred.
1% Training, High Homophily

    Model        Skip connection   % occurring   Aggregation   % occurring
    GCN          skip sum          71.43         max           71.43
    GATv2        skip sum          100.0         max           57.14
    GraphSAGE    none              71.43         add           42.86
Figure 5. Test set performance for the tuned GCN model on each dataset under the medium difficulty conditions. Plots for the GATv2 and GraphSAGE models look similar.
and skip sum is the best option for GATv2. On the other hand, Figure 4 and
Table 12 show GraphSAGE performance tends to be best without skip connections
on high homophily graphs. GraphSAGE already has a skip connection from its ⋀ function being concatenation, so it apparently does not need another one.
Tuning the remaining hyperparameters appears to have relatively little effect.
train/val/test split. Table 13 shows that the hyperparameter-tuned GNNs perform comparably to or outperform the off-the-shelf RevGNNs.
5.3 Qualitative description of GNN learning
Figure 6. These figures show the energy of the signal and the noise for each
model, averaged over all datasets in each medium difficulty case.
This makes node features from different classes more similar. Consistent with this, the energy of the signal in GCN and GATv2 is smaller after the message-passing step in Figure 7b. Only the energy of the signal for GraphSAGE is larger following the message-passing layers. GraphSAGE's use of concatenation for the ⋀ function explains this. Concatenation allows it to learn on each node's feature
vector directly instead of first mixing with the aggregate of its neighbors’ vectors,
so the node features can more easily separate by class.
Message-passing layers in a GNN provide node features with information from
the neighbors, and then post-processing layers further refine the embeddings, lead-
ing to further separation of the classes. Notice from Figures 7c and 7d that MLP
models do not benefit from having more than three layers.
Figure 7. These figures show the energy of the signal and the noise in the
final hidden layer of each layer type. For some of the low homophily datasets,
the tuned GraphSAGE design has no hidden post-processing layers, so the
plot does not include the post-processing layers for this model.
6 Conclusion
A decade ago, deep convolutional neural networks for image classification initiated a revolution in which feature learning was integrated into the training process of a neural network, and this was subsequently extended to irregular data structures such as graphs. The encoder-decoder framework neatly describes these models, and the shortcomings of simpler encoder-decoder models motivate the use of more complicated Graph Neural Networks (GNNs). Graph neural networks have attracted
considerable attention due to state-of-the-art results on a range of graph analysis
tasks and datasets, but because of the great variety of graphs and graph analysis tasks, they can be difficult to use for those new to the field. As such, we hope our overview of GNNs, their construction, and their behavior on a variety of datasets and training conditions has prepared the reader to solve diverse graph problems and to understand the technical aspects of the literature.
• PyTorch Geometric
This library is built on PyTorch and its design aims to stay close to usual Py-
Torch [108]. It provides well-documented examples and benchmark datasets, implements many state-of-the-art GNN models from the literature, and supports multi-GPU processing.
• Deep Graph Library
This library is sponsored by AWS, NSF, NVIDIA and Intel [109]. It supports
multi-GPU processing and the PyTorch, TensorFlow and Apache MXNet
frameworks. It has well-documented examples and example code for many state-of-the-art models.
• GeometricFlux.jl
This is a Julia library for geometric deep learning [110], as described in [4].
It supports deep learning in a range of settings: Graphs and sets; grids and
Euclidean spaces; groups and homogeneous spaces; geodesics and manifolds;
gauges and bundles. It also offers GPU support and has integration with
GNN benchmark datasets. It supports both graph network architectures,
which are more general graph models than graph neural networks [1], and
message-passing architectures.
• Spektral
This library is built on TensorFlow 2 and Keras [111]. It intends to feel close
to the Keras API and to be flexible and easy to use. It provides code for the
standard components of GNNs as well as example implementations of GNNs
on specific datasets.
• Jraph
This is a library written in JAX, a Python library that enables automatic differentiation of Python and NumPy code. It was created by DeepMind and inherits
some design properties from its earlier library, Graph Nets. Like Graph Nets,
it supports building graph networks and is a lightweight library with utilities
for working with graphs. Unlike Graph Nets, it has a model zoo of graph
neural network models.
Table 14. The node classification accuracy of default designs (see Table 4).
• Graph Nets
This is a DeepMind library built on TensorFlow and Sonnet for building graph networks, as described in [1]. It supports both CPU and GPU process-
ing, but as of this writing, it is not actively maintained.
• Stellar Graph
This library is built on TensorFlow 2 and uses the Keras API. It supports
a variety of graph machine learning tasks including node classification, link
prediction and graph classification on homogeneous graphs, heterogeneous
graphs and other graph types. As of this writing, it is not actively maintained.
• PyTorch GNN
This Microsoft library is written in PyTorch and is primarily engineered to be
fast on sparse graphs. Graph neural network models from several papers and
graph analysis tasks are implemented. This library is not actively maintained
as of this writing.
Table 15. The node classification accuracy of default designs (see Table 4).
Table 16. The node classification accuracy of default designs (see Table 6).
Table 17. The node classification accuracy of default designs (see Table 6).
Table 18. The node classification accuracy of default designs (see Table 6).
Table 19. The node classification accuracy of default designs (see Table 6).
Table 20. The node classification accuracy of default designs (see Table 6).
Table 21. The node classification accuracy of default designs (see Table 6).
Table 22. The node classification accuracy of default designs (see Table 6).
Table 23. The node classification accuracy of default designs (see Table 6).
Table 24. The node classification accuracy of the tuned designs with 80%
of nodes labeled for training (see Table 9).
Table 25. The node classification accuracy of tuned designs with 1% of
nodes labeled for training (see Table 9).
Table 26. The node classification accuracy of RevGNNs with 80% of node
labels for training on low homophily graphs (see Table 13).
References
[1] P. W. Battaglia, J. B. Hamrick, V. Bapst, A. Sanchez-Gonzalez, V. Zambaldi,
M. Malinowski, A. Tacchetti, D. Raposo, A. Santoro, R. Faulkner et al., “Re-
lational inductive biases, deep learning, and graph networks,” arXiv preprint
arXiv:1806.01261, 2018.
[5] L. Wu, P. Cui, J. Pei, L. Zhao, and X. Guo, “Graph neural networks: foun-
dation, frontiers and applications,” in Proceedings of the 28th ACM SIGKDD
Conference on Knowledge Discovery and Data Mining, 2022, pp. 4840–4841.
[7] J. Zhou, G. Cui, S. Hu, Z. Zhang, C. Yang, Z. Liu, L. Wang, C. Li, and
M. Sun, “Graph neural networks: A review of methods and applications,” AI
Open, vol. 1, pp. 57–81, 2020.
[8] Z. Zhang, P. Cui, and W. Zhu, “Deep learning on graphs: A survey,” IEEE
Transactions on Knowledge and Data Engineering, 2020.
[9] Y. Zhou, H. Zheng, X. Huang, S. Hao, D. Li, and J. Zhao, “Graph neural net-
works: Taxonomy, advances, and trends,” ACM Transactions on Intelligent
Systems and Technology (TIST), vol. 13, no. 1, pp. 1–54, 2022.
[12] C. Gao, Y. Zheng, N. Li, Y. Li, Y. Qin, J. Piao, Y. Quan, J. Chang, D. Jin,
X. He et al., “A survey of graph neural networks for recommender systems:
Challenges, methods, and directions,” ACM Transactions on Recommender
Systems, vol. 1, no. 1, pp. 1–51, 2023.
[13] S. Bhagat, G. Cormode, and S. Muthukrishnan, “Node classification in social
networks,” in Social network data analytics. Springer, 2011, pp. 115–148.
[20] Y. Koren, R. Bell, and C. Volinsky, “Matrix factorization techniques for rec-
ommender systems,” Computer, vol. 42, no. 8, pp. 30–37, 2009.
[21] S. Wu, F. Sun, W. Zhang, X. Xie, and B. Cui, “Graph neural networks in
recommender systems: a survey,” ACM Computing Surveys, vol. 55, no. 5,
pp. 1–37, 2022.
[24] Z. Yu, F. Huang, X. Zhao, W. Xiao, and W. Zhang, “Predicting drug–disease
associations through layer attention graph convolutional network,” Briefings
in Bioinformatics, vol. 22, no. 4, p. bbaa243, 2021.
[25] J. Gao, X. Zhang, L. Tian, Y. Liu, J. Wang, Z. Li, and X. Hu, “Mtgnn:
Multi-task graph neural network based few-shot learning for disease similarity
measurement,” Methods, 2021.
[26] M. Nickel, K. Murphy, V. Tresp, and E. Gabrilovich, “A review of relational
machine learning for knowledge graphs,” Proceedings of the IEEE, vol. 104,
no. 1, pp. 11–33, 2015.
[27] S. Arora, “A survey on graph neural networks for knowledge graph comple-
tion,” arXiv preprint arXiv:2007.12374, 2020.
[28] N. R. Smith, P. N. Zivich, L. M. Frerichs, J. Moody, and A. E. Aiello, “A
guide for choosing community detection algorithms in social network studies:
The question alignment approach,” American journal of preventive medicine,
vol. 59, no. 4, pp. 597–605, 2020.
[29] Z. Yang, R. Algesheimer, and C. J. Tessone, “A comparative analysis of
community detection algorithms on artificial networks,” Scientific reports,
vol. 6, no. 1, pp. 1–18, 2016.
[30] S. Bandyopadhyay and V. Peter, “Unsupervised constrained community de-
tection via self-expressive graph neural network,” in Uncertainty in Artificial
Intelligence. PMLR, 2021, pp. 1078–1088.
[31] D. Jin, Z. Liu, W. Li, D. He, and W. Zhang, “Graph convolutional net-
works meet markov random fields: Semi-supervised community detection in
attribute networks,” in Proceedings of the AAAI conference on artificial in-
telligence, vol. 33, no. 01, 2019, pp. 152–159.
[32] C. Wang, C. Hao, and X. Guan, “Hierarchical and overlapping social circle
identification in ego networks based on link clustering,” Neurocomputing, vol.
381, pp. 322–335, 2020.
[33] G. Tauer, K. Date, R. Nagi, and M. Sudit, “An incremental graph-
partitioning algorithm for entity resolution,” Information Fusion, vol. 46,
pp. 171–183, 2019.
[34] S. Maddila, S. Ramasubbareddy, and K. Govinda, “Crime and fraud de-
tection using clustering techniques,” Innovations in Computer Science and
Engineering, pp. 135–143, 2020.
[35] K. Wongsuphasawat, D. Smilkov, J. Wexler, J. Wilson, D. Mane, D. Fritz,
D. Krishnan, F. B. Viégas, and M. Wattenberg, “Visualizing dataflow graphs
of deep learning models in tensorflow,” IEEE transactions on visualization
and computer graphics, vol. 24, no. 1, pp. 1–12, 2017.
[36] M. Burch, M. Hlawatsch, and D. Weiskopf, “Visualizing a sequence of a
thousand graphs (or even more),” in Computer Graphics Forum, vol. 36,
no. 3. Wiley Online Library, 2017, pp. 261–271.
[37] X. Yin, G. Wu, J. Wei, Y. Shen, H. Qi, and B. Yin, “A comprehensive survey
on traffic prediction,” arXiv preprint arXiv:2004.08555, 2020.
[39] M. T. Schaub and S. Segarra, “Flow smoothing and denoising: Graph signal
processing in the edge-space,” in 2018 IEEE Global Conference on Signal and
Information Processing (GlobalSIP). IEEE, 2018, pp. 735–739.
[44] K. Xu, W. Hu, J. Leskovec, and S. Jegelka, “How powerful are graph neural
networks?” in International Conference on Learning Representations, 2019.
[Online]. Available: https://openreview.net/forum?id=ryGs6iA5Km
[47] Y. Rong, Y. Bian, T. Xu, W. Xie, Y. Wei, W. Huang, and J. Huang, “Self-
supervised graph transformer on large-scale molecular data,” Advances in
Neural Information Processing Systems, vol. 33, pp. 12 559–12 571, 2020.
[48] P. Li, J. Wang, Y. Qiao, H. Chen, Y. Yu, X. Yao, P. Gao, G. Xie, and S. Song,
“An effective self-supervised framework for learning expressive molecular
global representations to drug discovery,” Briefings in Bioinformatics, vol. 22,
no. 6, p. bbab109, 2021.
[49] L. Van der Maaten and G. Hinton, “Visualizing data using t-sne.” Journal of
machine learning research, vol. 9, no. 11, 2008.
[51] A. Y. Ng, M. I. Jordan, and Y. Weiss, “On spectral clustering: Analysis and
an algorithm,” in Advances in neural information processing systems, 2002,
pp. 849–856.
[53] T. Pham, T. Tran, H. Dam, and S. Venkatesh, “Graph classification via deep
learning with virtual nodes,” arXiv preprint arXiv:1708.04357, 2017.
[54] M. Zhang, Z. Cui, M. Neumann, and Y. Chen, “An end-to-end deep learning
architecture for graph classification,” in Thirty-Second AAAI Conference on
Artificial Intelligence, 2018.
[56] M. Zhang and Y. Chen, “Link prediction based on graph neural networks,”
Advances in Neural Information Processing Systems, vol. 31, pp. 5165–5175,
2018.
[57] J. Kim, T. Kim, S. Kim, and C. D. Yoo, “Edge-labeling graph neural net-
work for few-shot learning,” in Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition, 2019, pp. 11–20.
[58] T. Kipf, E. Fetaya, K.-C. Wang, M. Welling, and R. Zemel, “Neural relational
inference for interacting systems,” in International Conference on Machine
Learning. PMLR, 2018, pp. 2688–2697.
[59] Y. Li, X. Sun, H. Zhang, Z. Li, L. Qin, C. Sun, and Z. Ji, “Cellular traffic
prediction via a deep multi-reservoir regression learning network for multi-
access edge computing,” IEEE Wireless Communications, vol. 28, no. 5, pp.
13–19, 2021.
[61] R. Merris, “Laplacian matrices of graphs: a survey,” Linear algebra and its
applications, vol. 197, pp. 143–176, 1994.
[63] S. Cao, W. Lu, and Q. Xu, “Grarep: Learning graph representations with
global structural information,” in Proceedings of the 24th ACM international
on conference on information and knowledge management, 2015, pp. 891–900.
[65] A. Grover and J. Leskovec, “node2vec: Scalable feature learning for net-
works,” in Proceedings of the 22nd ACM SIGKDD international conference
on Knowledge discovery and data mining, 2016, pp. 855–864.
[66] J. Qiu, Y. Dong, H. Ma, J. Li, K. Wang, and J. Tang, “Network embed-
ding as matrix factorization: Unifying deepwalk, line, pte, and node2vec,” in
Proceedings of the eleventh ACM international conference on web search and
data mining, 2018, pp. 459–467.
[71] J. Zhu, Y. Yan, L. Zhao, M. Heimann, L. Akoglu, and D. Koutra, “Beyond ho-
mophily in graph neural networks: Current limitations and effective designs,”
Advances in neural information processing systems, vol. 33, pp. 7793–7804,
2020.
[72] L. M. Aiello, A. Barrat, R. Schifanella, C. Cattuto, B. Markines, and
F. Menczer, “Friendship prediction and homophily in social media,” ACM
Transactions on the Web (TWEB), vol. 6, no. 2, pp. 1–33, 2012.
[73] P. Battaglia, R. Pascanu, M. Lai, D. J. Rezende, and K. kavukcuoglu, “Inter-
action networks for learning about objects, relations and physics,” in Proceed-
ings of the 30th International Conference on Neural Information Processing
Systems, ser. NIPS’16. Red Hook, NY, USA: Curran Associates Inc., 2016,
p. 4509–4517.
[74] J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl,
“Neural message passing for quantum chemistry,” in Proceedings of the
34th International Conference on Machine Learning, ser. Proceedings
of Machine Learning Research, D. Precup and Y. W. Teh, Eds.,
vol. 70. PMLR, 06–11 Aug 2017, pp. 1263–1272. [Online]. Available:
https://proceedings.mlr.press/v70/gilmer17a.html
[75] F. Monti, D. Boscaini, J. Masci, E. Rodolà, J. Svoboda, and M. M. Bron-
stein, “Geometric deep learning on graphs and manifolds using mixture model
cnns,” in CVPR, 2017.
[76] J. Zhang, X. Shi, J. Xie, H. Ma, I. King, and D. Y. Yeung, “Gaan: Gated
attention networks for learning on large and spatiotemporal graphs,” in 34th
Conference on Uncertainty in Artificial Intelligence 2018, UAI 2018, 2018.
[77] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio,
“Graph attention networks,” in International Conference on Learning
Representations, 2018. [Online]. Available: https://openreview.net/forum?
id=rJXMpikCZ
[78] S. Brody, U. Alon, and E. Yahav, “How attentive are graph attention
networks?” in International Conference on Learning Representations, 2022.
[Online]. Available: https://openreview.net/forum?id=F72ximsx7C1
[79] D. Kim and A. Oh, “How to find your friendly neighborhood: Graph
attention design with self-supervision,” in International Conference on
Learning Representations, 2021. [Online]. Available: https://openreview.
net/forum?id=Wi5KUNlqWty
[80] X. Zheng, Y. Liu, S. Pan, M. Zhang, D. Jin, and P. S. Yu, “Graph
neural networks for graphs with heterophily: A survey,” arXiv preprint
arXiv:2202.07082, 2022.
[81] J. Bruna, W. Zaremba, A. Szlam, and Y. LeCun, “Spectral networks and
locally connected networks on graphs,” arXiv preprint arXiv:1312.6203, 2013.
[82] S. Zhang, H. Tong, J. Xu, and R. Maciejewski, “Graph convolutional net-
works: a comprehensive review,” Computational Social Networks, vol. 6,
no. 1, pp. 1–23, 2019.
[83] Y. Ma, J. Hao, Y. Yang, H. Li, J. Jin, and G. Chen, “Spectral-based graph
convolutional network for directed graphs,” arXiv preprint arXiv:1907.08990,
2019.
[86] L. Cai, J. Li, J. Wang, and S. Ji, “Line graph neural networks for link pre-
diction,” IEEE Transactions on Pattern Analysis and Machine Intelligence,
2021.
[87] S. A. Tailor, F. Opolka, P. Lio, and N. D. Lane, “Do we need anisotropic graph
neural networks?” in International Conference on Learning Representations,
2021.
[90] R. Van Der Hofstad, "Random graphs and complex networks," Available on http://www.win.tue.nl/rhofstad/NotesRGCN.pdf, vol. 11, p. 60, 2009.
[91] Y. Ren, K. Hu, X. Dai, L. Pan, S. C. Hoi, and Z. Xu, “Semi-supervised deep
embedded clustering,” Neurocomputing, vol. 325, pp. 121–130, 2019.
[92] Z. Chen, L. Li, and J. Bruna, “Supervised community detection with line
graph neural networks,” in International conference on learning representa-
tions, 2020.
[96] O. Shchur, M. Mumme, A. Bojchevski, and S. Günnemann, “Pitfalls of graph
neural network evaluation,” arXiv preprint arXiv:1811.05868, 2018.
[99] G. Li, M. Müller, B. Ghanem, and V. Koltun, “Training graph neural net-
works with 1000 layers,” in International conference on machine learning.
PMLR, 2021, pp. 6437–6449.
[103] H. Pei, B. Wei, K. C.-C. Chang, Y. Lei, and B. Yang, “Geom-gcn: Geometric
graph convolutional networks,” in International Conference on Learning
Representations, 2020. [Online]. Available: https://openreview.net/forum?
id=S1e2agrFvS
[104] J. You, Z. Ying, and J. Leskovec, “Design space for graph neural networks,”
Advances in Neural Information Processing Systems, vol. 33, 2020.
[107] F. D. Giovanni, J. Rowbottom, B. P. Chamberlain, T. Markovich,
and M. M. Bronstein, “Graph neural networks as gradient flows:
understanding graph convolutions via energy,” 2023. [Online]. Available:
https://openreview.net/forum?id=M3GzgrA7U4
[108] M. Fey and J. E. Lenssen, “Fast graph representation learning with pytorch
geometric,” arXiv preprint arXiv:1903.02428, 2019.
[109] M. Wang, D. Zheng, Z. Ye, Q. Gan, M. Li, X. Song, J. Zhou, C. Ma, L. Yu,
Y. Gai et al., “Deep graph library: A graph-centric, highly-performant pack-
age for graph neural networks,” arXiv preprint arXiv:1909.01315, 2019.
[110] Y.-H. Tu, "GeometricFlux.jl: a geometric deep learning library in Julia,"
Proceedings of JuliaCon, vol. 1, p. 1, 2020.
[111] D. Grattarola and C. Alippi, “Graph neural networks in tensorflow and keras
with spektral [application notes],” IEEE Computational Intelligence Maga-
zine, vol. 16, no. 1, pp. 99–106, 2021.