Chapter 6 Graph Neural Networks: Scalability
Abstract Over the past decade, Graph Neural Networks have achieved remarkable
success in modeling complex graph data. Nowadays, graph data is growing exponentially
in both magnitude and volume; for example, a social network can consist of
billions of users and relationships. Such circumstances lead to a crucial question:
how to properly extend the scalability of Graph Neural Networks? Two major challenges
remain when scaling the original implementation of GNN to large graphs.
First, most GNN models compute the entire adjacency matrix and the node embeddings
of the graph, which demands a huge amount of memory. Second, training a GNN
requires recursively updating each node in the graph, which becomes infeasible and
ineffective for large graphs. Current studies tackle these obstacles mainly through
three sampling paradigms: node-wise sampling, which is executed based on the
target nodes in the graph; layer-wise sampling, which is implemented on the
convolutional layers; and graph-wise sampling, which constructs sub-graphs for the
model inference. In this chapter, we introduce several representative studies accordingly.
Hehuan Ma
Department of CSE, University of Texas at Arlington, e-mail: [email protected]
Yu Rong
Tencent AI Lab, e-mail: [email protected]
Junzhou Huang
Department of CSE, University of Texas at Arlington, e-mail: [email protected]
6.1 Introduction
Graph Neural Networks (GNNs) have gained increasing popularity and achieved remarkable
results in many fields, including social networks (Freeman, 2000;
Perozzi et al, 2014; Hamilton et al, 2017b; Kipf and Welling, 2017b), bioinformatics
(Gilmer et al, 2017; Yang et al, 2019b; Ma et al, 2020a), knowledge
graphs (Liben-Nowell and Kleinberg, 2007; Hamaguchi et al, 2017; Schlichtkrull
et al, 2018), etc. GNN models are powerful in capturing accurate graph structure
information as well as the underlying connections and interactions between nodes (Li
et al, 2016b; Veličković et al, 2018; Xu et al, 2018a, 2019d). Generally, GNN models
are constructed based on the features of the nodes and edges, as well as the adjacency
matrix of the whole graph. However, since graph data is growing rapidly
nowadays, graph sizes are increasing exponentially as well. The recently published graph
benchmark, Open Graph Benchmark (OGB), collects several commonly
used datasets for machine learning on graphs (Hu et al, 2020). Table 6.1 reports the
statistics of the node classification datasets. As observed, the large-scale
dataset ogbn-papers100M contains over one hundred million nodes and one billion
edges. Even the relatively small dataset ogbn-arxiv still contains a fairly large number
of nodes and edges.
Table 6.1: The statistics of node classification datasets from OGB (Hu et al, 2020).
For such large graphs, the original implementation of GNN is not suitable. There
are two main obstacles: 1) the large memory requirement, and 2) the inefficient gradient
update. First, most GNN models need to store the entire adjacency matrices
and feature matrices in memory, which demands huge memory consumption.
Moreover, the memory may not be adequate for handling very large graphs. Therefore,
GNN cannot be applied to large graphs directly. Second, during the training
phase of most GNN models, the gradient of each node is updated in every iteration,
which is inefficient and infeasible for large graphs. This scenario is analogous to
gradient descent versus stochastic gradient descent: gradient descent may
take too long to converge on a large dataset, and stochastic gradient descent is introduced
to speed up the progress towards an optimum.
In order to tackle these obstacles, recent studies propose to design proper sampling
algorithms on large graphs to reduce the computational cost as well as increase the scalability of GNNs.
6.2 Preliminary
We first briefly introduce some concepts and notations that are used in this chapter.
Given a graph $\mathcal{G}(\mathcal{V}, \mathcal{E})$, $\mathcal{V}$ denotes the set of $n = |\mathcal{V}|$ nodes and $\mathcal{E}$ denotes the
set of $m = |\mathcal{E}|$ edges. Node $u \in \mathcal{N}(v)$ is a neighbor of node $v$, where $v \in \mathcal{V}$ and
$(u, v) \in \mathcal{E}$. The vanilla GNN architecture can be summarized as:
$$h^{(l+1)} = \sigma\left(A h^{(l)} W^{(l)}\right),$$
where $A$ is the normalized adjacency matrix, $h^{(l)}$ represents the node embeddings of
the graph at layer/depth $l$, $W^{(l)}$ is the weight matrix of the neural network,
and $\sigma$ denotes the activation function.
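To make the propagation rule concrete, here is a minimal NumPy sketch of one such layer; the symmetric normalization with self-loops, the ReLU activation, and the toy shapes are illustrative assumptions rather than part of the chapter's formulation.

```python
import numpy as np

def normalize_adjacency(adj):
    """Symmetrically normalize A with self-loops: D^{-1/2} (A + I) D^{-1/2}."""
    adj = adj + np.eye(adj.shape[0])
    deg = adj.sum(axis=1)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
    return d_inv_sqrt @ adj @ d_inv_sqrt

def gnn_layer(a_norm, h, w):
    """One vanilla GNN layer: h^{(l+1)} = sigma(A h^{(l)} W^{(l)}), with ReLU as sigma."""
    return np.maximum(0.0, a_norm @ h @ w)

# Toy example: 4 nodes, 3-dimensional features, 2 hidden units.
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 1],
                [0, 1, 0, 1],
                [0, 1, 1, 0]], dtype=float)
h0 = np.random.rand(4, 3)          # node features h^{(0)}
w0 = np.random.rand(3, 2)          # weight matrix W^{(0)}
h1 = gnn_layer(normalize_adjacency(adj), h0, w0)
print(h1.shape)                    # (4, 2)
```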
For large-scale graph learning, the problem is often referred to as node classification,
where each node $v$ is associated with a label $y$, and the goal is to learn from
the graph and predict the labels of unseen nodes.
6.3 Sampling Paradigms

The concept of sampling aims at selecting a subset of all the samples to represent
the entire sample distribution. Accordingly, a sampling algorithm on large graphs
refers to an approach that uses a partial graph instead of the full graph to address
the target problem. In this chapter, we categorize the different sampling algorithms into
three major groups: node-wise sampling, layer-wise sampling, and graph-wise
sampling.
Node-wise sampling plays a dominant role in the early stage of implementing
GCN on large graphs, such as Graph SAmple and aggreGatE (GraphSAGE)
(Hamilton et al, 2017b) and Variance Reduction Graph Convolutional
Networks (VR-GCN) (Chen et al, 2018d). Later, layer-wise sampling algorithms
are proposed to address the neighborhood expansion problem that occurs during
node-wise sampling, e.g., Fast Learning Graph Convolutional Networks
(FastGCN) (Chen et al, 2018c) and Adaptive Sampling Graph Convolutional Networks
(ASGCN) (Huang et al, 2018). Moreover, graph-wise sampling paradigms are designed
to further improve efficiency and scalability, e.g., Cluster Graph Convolutional
Networks (Cluster-GCN) (Chiang et al, 2019) and Graph SAmpling based
INductive learning meThod (GraphSAINT) (Zeng et al, 2020a). Fig. 6.1 illustrates
a comparison among the three sampling paradigms. In node-wise sampling, the
nodes are sampled based on the target node in the graph, while in layer-wise
sampling, the nodes are sampled based on the convolutional layers in the GNN
model. In graph-wise sampling, sub-graphs are sampled from the original
graph and used for model inference.

Fig. 6.1: Comparison among the three sampling paradigms: (a) node-wise sampling; (b) layer-wise sampling; (c) graph-wise sampling.
According to these paradigms, two main issues should be addressed when constructing
large-scale GNNs: 1) how to design efficient sampling algorithms, and 2)
how to guarantee the sampling quality. In recent years, many works have studied
how to construct large-scale GNNs and how to address the above issues properly.
Fig. 6.2 displays a timeline of representative works in this area from 2017 to the
present. Each work will be introduced accordingly in this chapter.
Other than these major sampling paradigms, more recent works have attempted
to improve the scalability of GNNs on large graphs from various perspectives as well. For
example, heterogeneous graphs have attracted more and more attention with
the rapid growth of data. Large graphs not only include millions of nodes but
also various data types, and how to train GNNs on such graphs has become a new
domain of interest. For instance, Li et al (2019a) proposes a GCN-based Anti-Spam (GAS) model
for spam detection on a large-scale heterogeneous graph.
6.3.1 Node-wise Sampling
Rather than using all the nodes in the graph, node-wise sampling selects certain nodes
through various sampling algorithms to construct large-scale GNNs. GraphSAGE
(Hamilton et al, 2017b) and VR-GCN (Chen et al, 2018d) are two pivotal studies that utilize
such a method.
6.3.1.1 GraphSAGE
At the early stage of GNN development, most works target transductive learning
on a fixed graph (Kipf and Welling, 2017b, 2016), while the inductive
setting is more practical in many cases. Yang et al (2016b) develops an inductive
learning approach for graph embeddings, and GraphSAGE (Hamilton et al, 2017b) extends the
study to large graphs. The overall architecture is illustrated in Fig. 6.3.
Fig. 6.3: Overview of the GraphSAGE architecture. Step 1: sample the neighbor-
hoods of the target node; step 2: aggregate feature information from the neighbors;
step 3: utilize the aggregated information to predict the graph context or label. Fig-
ure excerpted from (Hamilton et al, 2017b).
Different from the original mean aggregator in GCN, GraphSAGE proposes the LSTM
aggregator and the pooling aggregator to aggregate the information from the neighbors.
The second extension is that GraphSAGE applies the concatenation function,
instead of the summation function, to combine the information of the target node and its
neighborhood:
$$h_v^{(l+1)} = \sigma\left(W^{(l+1)} \cdot \text{CONCAT}\left(h_v^{(l)}, h_{\mathcal{N}(v)}^{(l+1)}\right)\right),$$
where $W^{(l+1)}$ is the weight matrix and $\sigma$ is the activation function.
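Below is a minimal sketch of this update for a single node, assuming a mean aggregator over the sampled neighbors (the LSTM and pooling aggregators would replace the mean step) and a ReLU activation; shapes and names are illustrative.

```python
import numpy as np

def sage_update(h_v, h_neighbors, w):
    """GraphSAGE-style update: concatenate the target embedding with the
    aggregated neighbor embedding, then apply a linear map and ReLU."""
    h_agg = h_neighbors.mean(axis=0)            # mean aggregator over sampled neighbors
    z = np.concatenate([h_v, h_agg])            # CONCAT(h_v, h_N(v))
    return np.maximum(0.0, w @ z)               # sigma(W . CONCAT(...))

h_v = np.random.rand(8)                         # target node embedding, d = 8
h_neighbors = np.random.rand(5, 8)              # 5 sampled neighbor embeddings
w = np.random.rand(16, 16)                      # maps the 16-dim concatenation to 16 dims
print(sage_update(h_v, h_neighbors, w).shape)   # (16,)
```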
In order to make GNNs suitable for large-scale graphs, GraphSAGE introduces
a mini-batch training strategy to reduce the computation cost during the
training phase. Specifically, in each training iteration, only the nodes that are used
for computing the representations in the batch are considered, which significantly
reduces the number of sampled nodes. Take layer 2 in Fig. 6.4(a) as an example:
unlike full-batch training, which takes all 11 nodes into consideration, only 6
nodes are involved in mini-batch training. However, the naive implementation of the
mini-batch training strategy suffers from the neighborhood expansion problem. As shown
in layer 1 of Fig. 6.4(a), most of the nodes are sampled, since the number of sampled
nodes grows exponentially when all the neighbors are sampled at each layer. Thus, all
the nodes are eventually selected if the model contains many layers.
Fig. 6.4: Visual comparison between mini-batch training and fixed-size neighbor
sampling.
To further improve the training efficiency and eliminate the neighborhood expansion
problem, GraphSAGE adopts a fixed-size neighbor sampling strategy. Specifically,
a fixed-size set of neighbor nodes is sampled at each layer for computing, instead
of using the entire neighborhood set. For example, one can set the fixed size to
two nodes, as illustrated in Fig. 6.4(b), where the yellow nodes represent the sampled
nodes and the blue nodes are the candidate nodes. It can be observed that the number of
sampled nodes is significantly reduced, especially for layer 1.
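The following sketch illustrates the fixed-size neighbor sampling idea: starting from a mini-batch of target nodes, each hop keeps at most a preset number of sampled neighbors per node (here sampled uniformly without replacement); the adjacency-list representation and the fanout value are illustrative assumptions, not GraphSAGE's exact implementation.

```python
import random

def sample_neighbors(adj_list, batch_nodes, num_layers, fanout):
    """Return, per hop, the cumulative set of nodes needed to compute the batch
    embeddings, keeping at most `fanout` sampled neighbors per node at every hop."""
    layer_nodes = [set(batch_nodes)]
    frontier = set(batch_nodes)
    for _ in range(num_layers):
        sampled = set()
        for v in frontier:
            neighbors = adj_list[v]
            k = min(fanout, len(neighbors))
            sampled.update(random.sample(neighbors, k))
        frontier = sampled
        layer_nodes.append(frontier | layer_nodes[-1])
    return layer_nodes

# Toy graph as an adjacency list.
adj_list = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1, 4], 3: [1, 4], 4: [2, 3]}
print(sample_neighbors(adj_list, batch_nodes=[0], num_layers=2, fanout=2))
```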
6.3.1.2 VR-GCN
In order to further reduce the number of sampled nodes, as well as conduct a comprehensive
theoretical analysis, VR-GCN (Chen et al, 2018d) proposes a Control
Variate Based Estimator. It only samples an arbitrarily small number of neighbor
nodes by employing the historical activations of the nodes. Fig. 6.5 compares the receptive
field of one target node under different sampling strategies. For the original implementation
of GCN (Kipf and Welling, 2017b), the number of sampled nodes
increases exponentially with the number of layers. With neighbor sampling, the size
of the receptive field is reduced randomly according to the preset sampling number.
Compared with them, VR-GCN utilizes the historical node activations as a control
variate to keep the receptive field small.
Fig. 6.5: Illustration of the receptive field of a single node utilizing different sam-
pling strategies with a two-layer graph convolutional neural network. The red circle
represents the latest activation, and the blue circle indicates the historical activation.
Figure excerpted from (Chen et al, 2018d).
The neighbor sampling (NS) estimator of the neighborhood aggregation can be formulated as,
$$\text{NS}_v^{(l)} := R \sum_{u \in \hat{\mathcal{N}}^{(l)}(v)} A_{vu} h_u^{(l)}, \quad R = |\mathcal{N}(v)| / d^{(l)},$$
where $\mathcal{N}(v)$ represents the neighbor set of node $v$, $d^{(l)}$ is the number of neighbors
sampled at layer $l$, $\hat{\mathcal{N}}^{(l)}(v) \subset \mathcal{N}(v)$ is the sampled neighbor set of node $v$ at
layer $l$, and $A$ represents the normalized adjacency matrix. Such a method has been
proved to be a biased sampling and would cause a larger variance; the detailed proof
can be found in (Chen et al, 2018d). These properties result in a larger required sample
size $\hat{\mathcal{N}}^{(l)}(v) \subset \mathcal{N}(v)$.
To address these issues, VR-GCN proposes the Control Variate Based Estimator
(CV sampler), which maintains the historical hidden embedding $\bar{h}_v^{(l)}$ of every
participating node. The intuition is that the difference between $\bar{h}_v^{(l)}$ and $h_v^{(l)}$ should
be small if the model weights do not change too fast, so the CV sampler is capable of
reducing the variance and eventually obtaining a smaller sample size $\hat{\mathcal{N}}^{(l)}(v)$.
the feed-forward layer of VR-GCN can be defined as,
$$H^{(l+1)} = \sigma\left(\left(A^{(l)}\left(H^{(l)} - \bar{H}^{(l)}\right) + A\bar{H}^{(l)}\right)W^{(l)}\right).$$
where $A^{(l)}$ is the sampled normalized adjacency matrix at layer $l$, $\bar{H}^{(l)} = \{\bar{h}_1^{(l)}, \cdots, \bar{h}_n^{(l)}\}$
is the stack of the historical hidden embeddings $\bar{h}^{(l)}$, $H^{(l+1)} = \{h_1^{(l+1)}, \cdots, h_n^{(l+1)}\}$ is
the embedding of the graph nodes in the $(l+1)$-th layer, and $W^{(l)}$ is the learnable
weight matrix. In such a manner, the sampled size of $A^{(l)}$ is greatly reduced compared
with GraphSAGE by utilizing the historical hidden embeddings $\bar{h}^{(l)}$, which
introduces a more efficient computing method. Moreover, VR-GCN also studies
how to apply the Control Variate Estimator on the dropout model. More details can
be found in the paper.
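The sketch below illustrates the control-variate idea for a single node with dense NumPy arrays: the cheap exact term is computed over the historical embeddings, and only the (small) difference between current and historical embeddings is estimated from a few sampled nodes. Uniform sampling over all nodes is assumed here for simplicity, whereas VR-GCN samples from the neighbor set; all names are illustrative.

```python
import numpy as np

def cv_aggregate(a_row, h_current, h_history, sampled_idx):
    """Control-variate estimate of sum_u A[v,u] h_u:
    exact term over historical embeddings + sampled correction term."""
    n = len(a_row)
    exact_part = a_row @ h_history                       # A[v,:] @ H_bar, cheap to maintain
    diff = h_current[sampled_idx] - h_history[sampled_idx]
    scale = n / len(sampled_idx)                         # rescale the Monte Carlo estimate
    correction = scale * (a_row[sampled_idx] @ diff)     # sampled estimate of A[v,:] @ (H - H_bar)
    return exact_part + correction

a_row = np.random.rand(10)          # normalized adjacency row of node v
h_history = np.random.rand(10, 4)   # historical embeddings H_bar^{(l)}
h_current = h_history + 0.01 * np.random.randn(10, 4)   # current embeddings, close to history
sampled_idx = np.random.choice(10, size=2, replace=False)
print(cv_aggregate(a_row, h_current, h_history, sampled_idx).shape)  # (4,)
```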
In summary, VR-GCN first analyzes the variance reduction on node-wise sam-
pling, and successfully reduces the size of the samples. However, the trade-off is
that the additional memory consumption for storing the historical hidden embed-
dings would be very large. Recall that one limitation of applying GNNs to large-scale
graphs is that it is not realistic to store the full adjacency matrices or feature matrices;
in VR-GCN, the storage of the historical hidden embeddings actually increases the
memory cost, which does not help from this perspective.
6.3.2 Layer-wise Sampling

Since node-wise sampling can only alleviate, but not completely solve, the neighborhood
expansion problem, layer-wise sampling has been studied to address this
obstacle.
6.3.2.1 FastGCN
In order to solve the neighborhood expansion problem, FastGCN (Chen et al, 2018c)
first proposes to understand the GNN from the functional generalization perspective.
The authors point out that training algorithms such as stochastic gradient descent are
implemented according to the additivity of the loss function for independent data
samples. However, the loss terms of GNN models generally lack such sample independence. To solve
this problem, FastGCN converts the common graph convolution view to an integral
transform view by introducing a probability measure for each node. Fig. 6.6 shows
the conversion between the traditional graph convolution view and the integral transform
view. In the graph convolution view, a fixed number of nodes are sampled in
a bootstrapping manner in each layer, and they are connected if a connection
exists in the original graph. Each convolutional layer is responsible for integrating the node embeddings.
The integral transform view is visualized according to the probability measure, and
the integral transform (illustrated as the yellow triangle) is used to compute
the embedding function of the next layer. More details can be found in (Chen et al,
2018c).
Fig. 6.6: Two views of GCN. The circles represent the nodes in the graph, while
the yellow circles indicate the sampled nodes. The lines represent the connection
between nodes.
Moreover, consider sampling $t_l$ i.i.d. samples $u_1^{(l)}, \ldots, u_{t_l}^{(l)} \sim P$ for each layer $l$,
$l = 0, \ldots, K-1$; a layer-wise estimation of the loss function is admitted as,
$$L_{t_0, t_1, \ldots, t_K} := \frac{1}{t_K} \sum_{i=1}^{t_K} g\left(h^{(K)}\left(u_i^{(K)}\right)\right),$$
which shows that FastGCN samples a fixed number of nodes at each layer.
Furthermore, in order to reduce the sampling variance, FastGCN adopts the im-
portance sampling with respect to the weights in the normalized adjacency matrix.
$$q(u) = \frac{\|A(:, u)\|^2}{\sum_{u' \in \mathcal{V}} \|A(:, u')\|^2}, \quad u \in \mathcal{V}, \qquad (6.1)$$
where $A$ is the normalized adjacency matrix of the graph. Detailed proofs can be
found in (Chen et al, 2018c). According to Equation 6.1, the sampling process
is independent for each layer, and the sampling probability remains the same across layers.
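A small sketch of the layer-wise importance sampling of Equation 6.1, assuming a dense normalized adjacency matrix: each layer independently draws a fixed number of nodes with probability proportional to the squared column norms, and the aggregation is rescaled by 1/(t q(u)) so that the estimate of A h W remains unbiased; the toy graph and shapes are illustrative.

```python
import numpy as np

def fastgcn_probabilities(a_norm):
    """Importance distribution q(u) proportional to the squared norm of column A(:, u)."""
    col_sq_norms = (a_norm ** 2).sum(axis=0)
    return col_sq_norms / col_sq_norms.sum()

def fastgcn_layer(a_norm, h, w, num_samples, rng):
    """One layer-wise sampled propagation: draw `num_samples` nodes from q,
    then form an importance-weighted estimate of A h W."""
    q = fastgcn_probabilities(a_norm)
    idx = rng.choice(len(q), size=num_samples, replace=True, p=q)
    # Monte Carlo estimate of A @ h: average of A[:, u] h[u] / q(u) over sampled u.
    est = (a_norm[:, idx] / q[idx]) @ h[idx] / num_samples
    return np.maximum(0.0, est @ w)

rng = np.random.default_rng(0)
a_norm = rng.random((100, 100)) * (rng.random((100, 100)) < 0.05)   # sparse-ish toy graph
h = rng.random((100, 16))
w = rng.random((16, 8))
print(fastgcn_layer(a_norm, h, w, num_samples=10, rng=rng).shape)   # (100, 8)
```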
6.3.2.2 ASGCN
To better capture the between-layer correlations, ASGCN (Huang et al, 2018) proposes
an adaptive layer-wise sampling strategy. Specifically, the sampling probability
of the lower layers depends on the upper ones. As shown in Fig. 6.8(a), ASGCN only
samples nodes from the neighbors of the already sampled nodes (yellow nodes) to better capture
the between-layer correlations, while FastGCN applies importance sampling
among all the nodes.
Fig. 6.9: Network construction example: (a) node-wise sampling; (b) layer-wise
sampling; (c) skip connection implementation. Figure excerpted from (Huang et al,
2018).
To further reduce the sampling variance, ASGCN introduces explicit variance
reduction, which optimizes the sampling variance as part of the final objective. Considering
$x(u_j)$ as the node feature of node $u_j$, the optimal sampling probability $q^*(u_j)$ can
be formulated as,
$$q^*(u_j) = \frac{\sum_{i=1}^{n'} p(u_j \mid v_i)\, |g(x(u_j))|}{\sum_{j=1}^{n} \sum_{i=1}^{n'} p(u_j \mid v_i)\, |g(x(u_j))|}, \quad g(x(u_j)) = W_g\, x(u_j). \qquad (6.2)$$
However, simply utilizing the sampler given by Equation 6.2 is not sufficient
to secure a minimal variance. Thus, ASGCN designs a hybrid loss by adding the
variance to the classification loss $L_c$, as shown in Equation 6.3. In such a manner,
the variance can be minimized during training.
$$L = \frac{1}{n'} \sum_{i=1}^{n'} L_c\left(y_i, \bar{y}\left(\hat{\mu}_q(v_i)\right)\right) + \lambda\, \text{Var}_q\left(\hat{\mu}_q(v_i)\right), \qquad (6.3)$$
where $y_i$ is the ground-truth label, $\hat{\mu}_q(v_i)$ represents the output hidden embedding
of node $v_i$, and $\bar{y}(\hat{\mu}_q(v_i))$ is the prediction. $\lambda$ is a trade-off parameter.
The variance reduction term $\lambda \text{Var}_q(\hat{\mu}_q(v_i))$ can also be viewed as a regularizer
over the sampled instances.
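The sketch below mirrors the structure of Equation 6.3 for one mini-batch: a standard classification loss plus λ times the variance of the sampled node estimates. The softmax cross-entropy form of L_c and the way the variance is estimated from several independent sampling draws are illustrative assumptions.

```python
import numpy as np

def hybrid_loss(logits, labels, per_sample_estimates, lam):
    """Classification loss + lambda * sampling variance (Equation 6.3 style).

    logits:               (n, num_classes) predictions for the batch nodes
    labels:               (n,) ground-truth class indices
    per_sample_estimates: (n, num_draws, dim) hidden embeddings of each batch node
                          obtained from independent sampling draws
    """
    # Softmax cross-entropy as the classification loss L_c.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    ce = -log_probs[np.arange(len(labels)), labels].mean()

    # Empirical variance of the sampled embedding estimates, averaged over nodes.
    variance = per_sample_estimates.var(axis=1).mean()

    return ce + lam * variance

rng = np.random.default_rng(0)
logits = rng.standard_normal((32, 5))
labels = rng.integers(0, 5, size=32)
estimates = rng.standard_normal((32, 4, 16))   # 4 independent sampling draws per node
print(hybrid_loss(logits, labels, estimates, lam=0.5))
```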
ASGCN also proposes a skip connection method to obtain the information across
distant nodes. As shown in Fig. 6.9 (c), the nodes in the (l-1)-th layer theoretically
preserve the second-order proximity (Tang et al, 2015b), which are the 2-hop neigh-
bors for the nodes in the (l+1)-th layer. The sampled nodes will include both 1-hop
and 2-hop neighbors by adding a skip connection between the (l-1)-th layer and the
(l+1)-th layer, which captures the information between distant nodes and facilitates
the model training.
In summary, by introducing the adaptive sampling strategy, ASGCN gains
better performance as well as better variance control. However, it also
brings additional dependency into the sampling process. FastGCN, for example,
can perform parallel sampling to accelerate the process since each layer
is sampled independently, whereas in ASGCN the sampling of each layer depends on
the upper layer, so parallel processing is not applicable.
6.3.3 Graph-wise Sampling
6.3.3.1 Cluster-GCN
Cluster-GCN (Chiang et al, 2019) first proposes to extract small graph clusters based
on efficient graph clustering algorithms. The intuition is that the efficiency of the
mini-batch algorithm is correlated with the number of links between the nodes in one batch. Hence,
Cluster-GCN constructs mini-batches at the sub-graph level, while previous studies
usually construct mini-batches based on nodes.
Cluster-GCN extracts the small clusters as follows. A graph $\mathcal{G}(\mathcal{V}, \mathcal{E})$ can be divided
into $c$ portions by grouping its nodes, where $\mathcal{V} = [\mathcal{V}_1, \cdots, \mathcal{V}_c]$. The extracted sub-graphs
can then be defined as,
$$\bar{\mathcal{G}} = [\mathcal{G}_1, \cdots, \mathcal{G}_c] = [\{\mathcal{V}_1, \mathcal{E}_1\}, \cdots, \{\mathcal{V}_c, \mathcal{E}_c\}],$$
where $(\mathcal{V}_t, \mathcal{E}_t)$ contains the nodes and the links within the $t$-th portion, $t \in \{1, \cdots, c\}$. The
re-ordered adjacency matrix can accordingly be written as,
$$A = \bar{A} + \Delta = \begin{bmatrix} A_{11} & \cdots & A_{1c} \\ \vdots & \ddots & \vdots \\ A_{c1} & \cdots & A_{cc} \end{bmatrix}; \quad
\bar{A} = \begin{bmatrix} A_{11} & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & A_{cc} \end{bmatrix}, \quad
\Delta = \begin{bmatrix} 0 & \cdots & A_{1c} \\ \vdots & \ddots & \vdots \\ A_{c1} & \cdots & 0 \end{bmatrix}.$$
Different graph clustering algorithms can be used to partition the graph such that there are
more links between the nodes within a cluster than between clusters. The motivation of treating
a sub-graph as a batch also follows the nature of graphs, namely that neighboring nodes usually
stay close to each other.
Obviously, this strategy can avoid the neighborhood expansion problem since it only
samples the nodes within the clusters, as shown in Fig. 6.11. In Cluster-GCN, since
there is no connection between the sub-graphs, the nodes in other sub-graphs will
not be sampled as the number of layers increases. In such a manner, the sampling process
controls the neighborhood expansion by sampling over sub-graphs, while in
layer-wise sampling this control is implemented by fixing the neighbor sampling size.
However, there still remain two concerns with the vanilla Cluster-GCN. The first
is that the links between sub-graphs are discarded, which may fail to capture
important correlations. The second is that the clustering algorithm may change
the original distribution of the dataset and introduce some bias. To address these
concerns, the authors propose a stochastic multiple partitions scheme that randomly
combines clusters into a batch. Specifically, the graph is first clustered into $p$ sub-graphs;
then, in each training epoch, a new batch is formed by randomly combining $q$ clusters
($q < p$), and the interactions between the chosen clusters are included as well. Fig. 6.12 visualizes
an example where $q$ equals 2: the new batch is formed by two randomly selected
clusters, along with the retained connections between them.
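A sketch of the stochastic multiple partitions scheme, assuming a pre-computed partition from any graph clustering algorithm: in each epoch, q of the p clusters are drawn at random, their node sets are merged, and the batch adjacency is the sub-matrix induced by the merged nodes, so the links between the chosen clusters are kept.

```python
import numpy as np

def stochastic_partition_batches(adj, clusters, q, rng):
    """Yield (node_ids, sub_adjacency) batches, each formed by merging q random clusters.
    `clusters` is a list of p node-index arrays produced by any graph clustering algorithm."""
    p = len(clusters)
    order = rng.permutation(p)
    for start in range(0, p, q):
        chosen = order[start:start + q]
        nodes = np.concatenate([clusters[c] for c in chosen])
        yield nodes, adj[np.ix_(nodes, nodes)]   # keeps links between the chosen clusters

rng = np.random.default_rng(0)
adj = (rng.random((12, 12)) < 0.3).astype(float)
clusters = [np.array([0, 1, 2]), np.array([3, 4, 5]),
            np.array([6, 7, 8]), np.array([9, 10, 11])]   # p = 4 toy clusters
for nodes, sub_adj in stochastic_partition_batches(adj, clusters, q=2, rng=rng):
    print(nodes, sub_adj.shape)
```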
6.3.3.2 GraphSAINT
Instead of using clustering algorithms to generate the sub-graphs, which may introduce
certain biases or noise, GraphSAINT (Zeng et al, 2020a) proposes to directly sample a
sub-graph for mini-batch training according to a sub-graph sampler, and to employ a full
GCN on the sub-graph to generate the node embeddings as well as back-propagate
the loss for each node. As shown in Fig. 6.13, the sub-graph $\mathcal{G}_s$ is constructed from the
original graph $\mathcal{G}$ with nodes 0, 1, 2, 3, 4, and 7 included. Next, a full GCN is applied
on these 6 nodes along with the corresponding connections.
Fig. 6.13: An illustration of GraphSAINT training algorithm. The yellow circle in-
dicates the sampled node.
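Below is a minimal sketch of sub-graph based mini-batch construction in the spirit of GraphSAINT, assuming a simple uniform node sampler; GraphSAINT additionally proposes edge and random-walk samplers and applies normalization coefficients to de-bias the estimates, all of which are omitted here.

```python
import numpy as np

def sample_subgraph(adj, budget, rng):
    """Uniformly sample `budget` nodes and return the induced sub-graph,
    on which a full GCN can be trained as one mini-batch."""
    n = adj.shape[0]
    nodes = rng.choice(n, size=budget, replace=False)
    return nodes, adj[np.ix_(nodes, nodes)]

rng = np.random.default_rng(0)
adj = (rng.random((50, 50)) < 0.1).astype(float)
nodes, sub_adj = sample_subgraph(adj, budget=6, rng=rng)
print(nodes, sub_adj.shape)   # 6 sampled nodes and the 6 x 6 induced adjacency
```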
PinSage (Ying et al, 2018b) is one of the successful early applications of large-scale GNNs
to item-item recommendation systems, and it is deployed at Pinterest¹. Pinterest is a social
media application for sharing and discovering various content. Users mark content of interest
with pins and organize them on boards. When users browse the website, Pinterest recommends
potentially interesting content to them. By the year 2018, the Pinterest graph contained 2
billion pins, 1 billion boards, and over 18 billion edges between pins and boards.
In order to scale the training to such a large graph, Ying et al (2018b)
proposes PinSage, a random-walk-based GCN, to implement node-wise sampling
on the Pinterest graph. Specifically, short random walks are used to select a fixed-size
neighborhood of the target node. Fig. 6.15 demonstrates the overall architecture of
PinSage. Take node A as an example: a 2-depth convolution is constructed to generate
the node embedding $h_A^{(2)}$. The embedding vector $h_{\mathcal{N}(A)}^{(1)}$ of node A's neighbors
is aggregated from nodes B, C, and D. A similar process is applied to obtain the 1-hop
neighbors' embeddings $h_B^{(1)}$, $h_C^{(1)}$, and $h_D^{(1)}$. An illustration of all the participating nodes
for each node of the input graph is shown at the bottom of Fig. 6.15. In addition,
an L1-normalization is computed to sort the neighbors by their importance (Eksombatchai
et al, 2018), and a curriculum training strategy is used to further improve the
prediction performance by feeding harder-and-harder examples.
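The following sketch mimics random-walk-based neighbor selection in the spirit of PinSage: short random walks are simulated from the target node, the visit counts are L1-normalized into importance weights, and the top-T visited nodes form the fixed-size neighborhood. The walk length, the number of walks, and T are illustrative parameters, not the values used by PinSage.

```python
import random
from collections import Counter

def random_walk_neighborhood(adj_list, target, num_walks=20, walk_len=3, top_t=2):
    """Select an importance-weighted, fixed-size neighborhood for `target`
    by counting visits of short random walks started at the target node."""
    counts = Counter()
    for _ in range(num_walks):
        node = target
        for _ in range(walk_len):
            node = random.choice(adj_list[node])
            if node != target:
                counts[node] += 1
    total = sum(counts.values())
    importance = {u: c / total for u, c in counts.items()}   # L1-normalized visit counts
    top = sorted(importance, key=importance.get, reverse=True)[:top_t]
    return {u: importance[u] for u in top}

adj_list = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1, 3], 3: [0, 2]}
print(random_walk_neighborhood(adj_list, target=0))
```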
A series of comprehensive experiments conducted on Pinterest data, e.g.,
offline evaluations, production A/B tests, and user studies, demonstrates the
effectiveness of the proposed method. Moreover, with the adoption of a highly efficient
MapReduce inference pipeline, the entire inference process on the whole graph can be
finished within one day.
1 https://www.pinterest.com/
Fig. 6.15: Overview of the PinSage architecture. Colored nodes are used to illustrate
the construction of the graph convolutions.
In IntentGC, the ordinary graph convolution combines the neighborhood
embeddings $h_{\mathcal{N}(v)}^{(l)}$ and the target node itself $h_v^{(l)}$. Such an operation is able to capture two
types of information: the interactions between the target node and its neighborhood,
and the interactions between different dimensions of the embedding space. However,
in user-item networks, learning the interactions between different feature dimensions
may be less informative and unnecessary. Therefore, IntentNet designs a
vector-wise convolution operation as follows:
$$g_v^{(l)}(i) = \sigma\left(W_v^{(l)}(i, 1) \cdot h_v^{(l)} + W_v^{(l)}(i, 2) \cdot h_{\mathcal{N}(v)}^{(l)}\right),$$
$$h_v^{(l+1)} = \sigma\left(\sum_{i=1}^{L} \theta_i \cdot g_v^{(l)}(i)\right),$$
where $W_v^{(l)}(i, 1)$ and $W_v^{(l)}(i, 2)$ are the associated weight matrices for the $i$-th local
filter. $g_v^{(l)}(i)$ represents the operation that learns the interactions between the target
node and its neighbor nodes in a vector-wise manner. Another vector-wise layer is
applied to gather the final embedding vector of the target node for the next convolu-
tional layer. Moreover, the output vector of the last convolutional layer is fed into a
three-layer fully-connected network to further learn the node-level combinatory fea-
tures. Such an operation significantly promotes the training efficiency and reduces
the time complexity.
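A sketch of the vector-wise convolution above, assuming L local filters that each mix the target embedding and the aggregated neighbor embedding through the two weight matrices, followed by a θ-weighted sum over the filters; the ReLU nonlinearity and the toy shapes are illustrative assumptions.

```python
import numpy as np

def vector_wise_conv(h_v, h_neigh, w1, w2, theta):
    """IntentNet-style vector-wise convolution:
    g_i = sigma(W(i,1) h_v + W(i,2) h_N(v)),  h' = sigma(sum_i theta_i * g_i)."""
    relu = lambda x: np.maximum(0.0, x)
    g = np.stack([relu(w1[i] @ h_v + w2[i] @ h_neigh) for i in range(len(theta))])
    return relu(np.tensordot(theta, g, axes=1))   # weighted sum over the L local filters

rng = np.random.default_rng(0)
d, L = 8, 3                                  # embedding size and number of local filters
h_v = rng.random(d)                          # target node embedding
h_neigh = rng.random(d)                      # aggregated neighbor embedding
w1 = rng.random((L, d, d))                   # W(i, 1) for each local filter
w2 = rng.random((L, d, d))                   # W(i, 2) for each local filter
theta = rng.random(L)
print(vector_wise_conv(h_v, h_neigh, w1, w2, theta).shape)   # (8,)
```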
Extensive experiments are conducted on Taobao and Amazon datasets, which
contain millions to billions of users and items. IntentGC outperforms other baseline
methods, and reduces the training time by about two days compared with GraphSAGE.
Overall, in recent years, the scalability of GNNs has been extensively studied and
has achieved fruitful results. Fig. 6.18 summarizes the development towards large-
scale GNNs.
Editor’s Notes: For graphs of large scale or with rapid growth, such
as dynamic graphs (chapter 15) and heterogeneous graphs (chapter 16), the
scalability of GNNs is of vital importance in determining
whether an algorithm is superior in practice. For example, graph sampling
strategies are especially necessary to ensure computational efficiency in
industrial scenarios, such as recommender systems (chapter 19) and urban
intelligence (chapter 27). With the increasing complexity and scale of
real problems, the limitation in scalability has been considered almost
everywhere in the study of GNNs. Researchers devoted to graph embedding
(chapter 2), graph structure learning (chapter 14) and self-supervised learning
(chapter 18) have put forward remarkable works to overcome it.