Chapter 6 Graph Neural Networks: Scalability
Abstract Over the past decade, Graph Neural Networks have achieved remarkable
success in modeling complex graph data. Nowadays, graph data is growing exponentially
in both magnitude and volume; for example, a social network can consist of
billions of users and relationships. Such circumstances lead to a crucial question:
how to properly extend the scalability of Graph Neural Networks? Two major challenges
remain when scaling the original implementation of GNN to large graphs.
First, most GNN models compute the entire adjacency matrix and the node embeddings
of the graph, which demands a huge amount of memory. Second, training a GNN
requires recursively updating each node in the graph, which becomes infeasible and
ineffective for large graphs. Current studies tackle these obstacles mainly through
three sampling paradigms: node-wise sampling, which is executed based on the
target nodes in the graph; layer-wise sampling, which is implemented on the
convolutional layers; and graph-wise sampling, which constructs sub-graphs for the
model inference. In this chapter, we introduce several representative studies accordingly.
Hehuan Ma
Department of CSE, University of Texas at Arlington, e-mail: [email protected]
Yu Rong
Tencent AI Lab, e-mail: [email protected]
Junzhou Huang
Department of CSE, University of Texas at Arlington, e-mail: [email protected]
6.1 Introduction
Graph Neural Networks (GNNs) have gained increasing popularity and achieved remarkable
results in many fields, including social networks (Freeman, 2000;
Perozzi et al, 2014; Hamilton et al, 2017b; Kipf and Welling, 2017b), bioinformatics
(Gilmer et al, 2017; Yang et al, 2019b; Ma et al, 2020a), knowledge
graphs (Liben-Nowell and Kleinberg, 2007; Hamaguchi et al, 2017; Schlichtkrull
et al, 2018), etc. GNN models are powerful in capturing accurate graph structure
information as well as the underlying connections and interactions between nodes (Li
et al, 2016b; Veličković et al, 2018; Xu et al, 2018a, 2019d). Generally, GNN models
are constructed based on the features of the nodes and edges, as well as the adjacency
matrix of the whole graph. However, since graph data is growing rapidly
nowadays, graph sizes are increasing exponentially as well. The recently published graph
benchmark, Open Graph Benchmark (OGB), collects several commonly
used datasets for machine learning on graphs (Hu et al, 2020). Table 6.1 reports the
statistics of the node classification datasets. As observed, the large-scale
dataset ogbn-papers100M contains over one hundred million nodes and one billion
edges. Even the relatively small dataset ogbn-arxiv still contains a fairly large number
of nodes and edges.
Table 6.1: The statistics of node classification datasets from OGB (Hu et al, 2020).
For such large graphs, the original implementation of GNN is not suitable. There
are two main obstacles: 1) the large memory requirement, and 2) the inefficient gradient
update. First, most GNN models need to store the entire adjacency matrices
and feature matrices in memory, which demands huge memory consumption.
Moreover, the memory may not be adequate for handling very large graphs. Therefore,
GNN cannot be applied to large graphs directly. Second, during the training
phase of most GNN models, the gradient of each node is updated in every iteration,
which is inefficient and infeasible for large graphs. This scenario is analogous to
gradient descent versus stochastic gradient descent: gradient descent may
take too long to converge on a large dataset, and stochastic gradient descent is introduced
to speed up the progress towards an optimum.
In order to tackle these obstacles, recent studies propose to design proper sampling
algorithms on large graphs to reduce the computational cost as well as increase the scalability of GNNs.
6.2 Preliminary
We first briefly introduce some concepts and notations that are used in this chapter.
Given a graph $\mathcal{G}(\mathcal{V}, \mathcal{E})$, $\mathcal{V}$ denotes the set of $n = |\mathcal{V}|$ nodes and $\mathcal{E}$ denotes the
set of $m = |\mathcal{E}|$ edges. Node $u \in \mathcal{N}(v)$ is a neighbor of node $v$, where $v \in \mathcal{V}$ and
$(u, v) \in \mathcal{E}$. The vanilla GNN architecture can be summarized as:
$$h^{(l+1)} = \sigma\left(A h^{(l)} W^{(l)}\right),$$
where $A$ is the normalized adjacency matrix, $h^{(l)}$ represents the node embeddings of
the graph at layer/depth $l$, $W^{(l)}$ is the weight matrix of the neural network,
and $\sigma$ denotes the activation function.
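To make the propagation rule concrete, here is a minimal NumPy sketch of one such layer; the symmetric normalization with self-loops, the ReLU activation, and the toy shapes are illustrative assumptions rather than part of the chapter's formulation.

```python
import numpy as np

def normalize_adjacency(adj):
    """Symmetrically normalize A with self-loops: D^{-1/2} (A + I) D^{-1/2}."""
    adj = adj + np.eye(adj.shape[0])
    deg = adj.sum(axis=1)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
    return d_inv_sqrt @ adj @ d_inv_sqrt

def gnn_layer(a_norm, h, w):
    """One vanilla GNN layer: h^{(l+1)} = sigma(A h^{(l)} W^{(l)}), with ReLU as sigma."""
    return np.maximum(0.0, a_norm @ h @ w)

# Toy example: 4 nodes, 3-dimensional features, 2 hidden units.
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 1],
                [0, 1, 0, 1],
                [0, 1, 1, 0]], dtype=float)
h0 = np.random.rand(4, 3)          # node features h^{(0)}
w0 = np.random.rand(3, 2)          # weight matrix W^{(0)}
h1 = gnn_layer(normalize_adjacency(adj), h0, w0)
print(h1.shape)                    # (4, 2)
```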
For large-scale graph learning, the problem is often referred to as node classification,
where each node $v$ is associated with a label $y$, and the goal is to learn from
the graph and predict the labels of unseen nodes.
6.3 Sampling Paradigms

The concept of sampling aims at selecting a subset of all the samples to represent
the entire sample distribution. Accordingly, a sampling algorithm on large graphs
refers to an approach that uses a partial graph instead of the full graph to address
the target problem. In this chapter, we categorize the different sampling algorithms into
three major groups: node-wise sampling, layer-wise sampling, and graph-wise
sampling.
Node-wise sampling plays a dominant role in the early stage of implementing
GCN on large graphs, such as Graph SAmple and aggreGatE (GraphSAGE)
(Hamilton et al, 2017b) and Variance Reduction Graph Convolutional
Networks (VR-GCN) (Chen et al, 2018d). Later, layer-wise sampling algorithms
are proposed to address the neighborhood expansion problem that occurs during
node-wise sampling, e.g., Fast Learning Graph Convolutional Networks
(FastGCN) (Chen et al, 2018c) and Adaptive Sampling Graph Convolutional Networks
(ASGCN) (Huang et al, 2018). Moreover, graph-wise sampling paradigms are designed
to further improve efficiency and scalability, e.g., Cluster Graph Convolutional
Networks (Cluster-GCN) (Chiang et al, 2019) and Graph SAmpling based
INductive learning meThod (GraphSAINT) (Zeng et al, 2020a). Fig. 6.1 illustrates
a comparison among the three sampling paradigms. In node-wise sampling, the
nodes are sampled based on the target node in the graph, while in layer-wise
sampling, the nodes are sampled based on the convolutional layers in the GNN
model. In graph-wise sampling, sub-graphs are sampled from the original
graph and used for model inference.

Fig. 6.1: Comparison among the three sampling paradigms: (a) node-wise sampling; (b) layer-wise sampling; (c) graph-wise sampling.
According to these paradigms, two main issues should be addressed when constructing
large-scale GNNs: 1) how to design efficient sampling algorithms, and 2)
how to guarantee the sampling quality. In recent years, many works have studied
how to construct large-scale GNNs and how to address the above issues properly.
Fig. 6.2 displays a timeline of representative works in this area from 2017 to the
present. Each work will be introduced accordingly in this chapter.
Other than these major sampling paradigms, more recent works have attempted
to improve the scalability of GNNs on large graphs from various perspectives as well. For
example, heterogeneous graphs have attracted more and more attention with
the rapid growth of data. Large graphs not only include millions of nodes but
also various data types, and how to train GNNs on such graphs has become a new
domain of interest. For instance, Li et al (2019a) proposes a GCN-based Anti-Spam (GAS) model
for spam detection on a large-scale heterogeneous graph.
6.3.1 Node-wise Sampling
Rather than using all the nodes in the graph, node-wise sampling selects certain nodes
through various sampling algorithms to construct large-scale GNNs. GraphSAGE
(Hamilton et al, 2017b) and VR-GCN (Chen et al, 2018d) are two pivotal studies that utilize
such a method.
6.3.1.1 GraphSAGE
At the early stage of GNN development, most works target transductive learning
on a fixed graph (Kipf and Welling, 2017b, 2016), while the inductive
setting is more practical in many cases. Yang et al (2016b) develops an inductive
learning approach for graph embeddings, and GraphSAGE (Hamilton et al, 2017b) extends the
study to large graphs. The overall architecture is illustrated in Fig. 6.3.
Fig. 6.3: Overview of the GraphSAGE architecture. Step 1: sample the neighbor-
hoods of the target node; step 2: aggregate feature information from the neighbors;
step 3: utilize the aggregated information to predict the graph context or label. Fig-
ure excerpted from (Hamilton et al, 2017b).
Different from the original mean aggregator in GCN, GraphSAGE proposes the LSTM
aggregator and the pooling aggregator to aggregate the information from the neighbors.
The second extension is that GraphSAGE applies the concatenation function,
instead of the summation function, to combine the information of the target node and its
neighborhood:
$$h_v^{(l+1)} = \sigma\left(W^{(l+1)} \cdot \text{CONCAT}\left(h_v^{(l)}, h_{\mathcal{N}(v)}^{(l+1)}\right)\right),$$
where $W^{(l+1)}$ is the weight matrix and $\sigma$ is the activation function.
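Below is a minimal sketch of this update for a single node, assuming a mean aggregator over the sampled neighbors (the LSTM and pooling aggregators would replace the mean step) and a ReLU activation; shapes and names are illustrative.

```python
import numpy as np

def sage_update(h_v, h_neighbors, w):
    """GraphSAGE-style update: concatenate the target embedding with the
    aggregated neighbor embedding, then apply a linear map and ReLU."""
    h_agg = h_neighbors.mean(axis=0)            # mean aggregator over sampled neighbors
    z = np.concatenate([h_v, h_agg])            # CONCAT(h_v, h_N(v))
    return np.maximum(0.0, w @ z)               # sigma(W . CONCAT(...))

h_v = np.random.rand(8)                         # target node embedding, d = 8
h_neighbors = np.random.rand(5, 8)              # 5 sampled neighbor embeddings
w = np.random.rand(16, 16)                      # maps the 16-dim concatenation to 16 dims
print(sage_update(h_v, h_neighbors, w).shape)   # (16,)
```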
In order to make GNNs suitable for large-scale graphs, GraphSAGE introduces
a mini-batch training strategy to reduce the computation cost during the
training phase. Specifically, in each training iteration, only the nodes that are used
for computing the representations in the batch are considered, which significantly
reduces the number of sampled nodes. Take layer 2 in Fig. 6.4(a) as an example:
unlike full-batch training, which takes all 11 nodes into consideration, only 6
nodes are involved in mini-batch training. However, the naive implementation of the
mini-batch training strategy suffers from the neighborhood expansion problem. As shown
in layer 1 of Fig. 6.4(a), most of the nodes are sampled, since the number of sampled
nodes grows exponentially when all the neighbors are sampled at each layer. Thus, all
the nodes are eventually selected if the model contains many layers.
Fig. 6.4: Visual comparison between mini-batch training and fixed-size neighbor
sampling.
To further improve the training efficiency and eliminate the neighborhood expansion
problem, GraphSAGE adopts a fixed-size neighbor sampling strategy. Specifically,
a fixed-size set of neighbor nodes is sampled at each layer for computing, instead
of using the entire neighborhood set. For example, one can set the fixed size to
two nodes, as illustrated in Fig. 6.4(b), where the yellow nodes represent the sampled
nodes and the blue nodes are the candidate nodes. It can be observed that the number of
sampled nodes is significantly reduced, especially for layer 1.
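The following sketch illustrates the fixed-size neighbor sampling idea: starting from a mini-batch of target nodes, each hop keeps at most a preset number of sampled neighbors per node (here sampled uniformly without replacement); the adjacency-list representation and the fanout value are illustrative assumptions, not GraphSAGE's exact implementation.

```python
import random

def sample_neighbors(adj_list, batch_nodes, num_layers, fanout):
    """Return, per hop, the cumulative set of nodes needed to compute the batch
    embeddings, keeping at most `fanout` sampled neighbors per node at every hop."""
    layer_nodes = [set(batch_nodes)]
    frontier = set(batch_nodes)
    for _ in range(num_layers):
        sampled = set()
        for v in frontier:
            neighbors = adj_list[v]
            k = min(fanout, len(neighbors))
            sampled.update(random.sample(neighbors, k))
        frontier = sampled
        layer_nodes.append(frontier | layer_nodes[-1])
    return layer_nodes

# Toy graph as an adjacency list.
adj_list = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1, 4], 3: [1, 4], 4: [2, 3]}
print(sample_neighbors(adj_list, batch_nodes=[0], num_layers=2, fanout=2))
```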
6.3.1.2 VR-GCN
In order to further reduce the number of sampled nodes, as well as conduct a comprehensive
theoretical analysis, VR-GCN (Chen et al, 2018d) proposes a Control
Variate Based Estimator. It only samples an arbitrarily small number of neighbor
nodes by employing the historical activations of the nodes. Fig. 6.5 compares the receptive
field of one target node under different sampling strategies. For the original implementation
of GCN (Kipf and Welling, 2017b), the number of sampled nodes
increases exponentially with the number of layers. With neighbor sampling, the size
of the receptive field is reduced randomly according to the preset sampling number.
Compared with them, VR-GCN utilizes the historical node activations as a control
variate to keep the receptive field small.
Fig. 6.5: Illustration of the receptive field of a single node utilizing different sam-
pling strategies with a two-layer graph convolutional neural network. The red circle
represents the latest activation, and the blue circle indicates the historical activation.
Figure excerpted from (Chen et al, 2018d).
The neighbor sampling (NS) estimator of the neighborhood aggregation can be formulated as,
$$\text{NS}_v^{(l)} := R \sum_{u \in \hat{\mathcal{N}}^{(l)}(v)} A_{vu} h_u^{(l)}, \quad R = |\mathcal{N}(v)| / d^{(l)},$$
where $\mathcal{N}(v)$ represents the neighbor set of node $v$, $d^{(l)}$ is the number of neighbors
sampled at layer $l$, $\hat{\mathcal{N}}^{(l)}(v) \subset \mathcal{N}(v)$ is the sampled neighbor set of node $v$ at
layer $l$, and $A$ represents the normalized adjacency matrix. Such a method has been
proved to be a biased sampling and would cause a larger variance; the detailed proof
can be found in (Chen et al, 2018d). These properties result in a larger required sample
size $\hat{\mathcal{N}}^{(l)}(v) \subset \mathcal{N}(v)$.
To address these issues, VR-GCN proposes the Control Variate Based Estimator
(CV sampler), which maintains the historical hidden embedding $\bar{h}_v^{(l)}$ of every
participating node. The intuition is that the difference between $\bar{h}_v^{(l)}$ and $h_v^{(l)}$ should
be small if the model weights do not change too fast, so the CV sampler is capable of
reducing the variance and eventually obtaining a smaller sample size $\hat{\mathcal{N}}^{(l)}(v)$.
the feed-forward layer of VR-GCN can be defined as,
$$H^{(l+1)} = \sigma\left(\left(A^{(l)}\left(H^{(l)} - \bar{H}^{(l)}\right) + A\bar{H}^{(l)}\right)W^{(l)}\right).$$
where $A^{(l)}$ is the sampled normalized adjacency matrix at layer $l$, $\bar{H}^{(l)} = \{\bar{h}_1^{(l)}, \cdots, \bar{h}_n^{(l)}\}$
is the stack of the historical hidden embeddings $\bar{h}^{(l)}$, $H^{(l+1)} = \{h_1^{(l+1)}, \cdots, h_n^{(l+1)}\}$ is
the embedding of the graph nodes in the $(l+1)$-th layer, and $W^{(l)}$ is the learnable
weight matrix. In such a manner, the sampled size of $A^{(l)}$ is greatly reduced compared
with GraphSAGE by utilizing the historical hidden embeddings $\bar{h}^{(l)}$, which
introduces a more efficient computing method. Moreover, VR-GCN also studies
how to apply the Control Variate Estimator on the dropout model. More details can
be found in the paper.
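The sketch below illustrates the control-variate idea for a single node with dense NumPy arrays: the cheap exact term is computed over the historical embeddings, and only the (small) difference between current and historical embeddings is estimated from a few sampled nodes. Uniform sampling over all nodes is assumed here for simplicity, whereas VR-GCN samples from the neighbor set; all names are illustrative.

```python
import numpy as np

def cv_aggregate(a_row, h_current, h_history, sampled_idx):
    """Control-variate estimate of sum_u A[v,u] h_u:
    exact term over historical embeddings + sampled correction term."""
    n = len(a_row)
    exact_part = a_row @ h_history                       # A[v,:] @ H_bar, cheap to maintain
    diff = h_current[sampled_idx] - h_history[sampled_idx]
    scale = n / len(sampled_idx)                         # rescale the Monte Carlo estimate
    correction = scale * (a_row[sampled_idx] @ diff)     # sampled estimate of A[v,:] @ (H - H_bar)
    return exact_part + correction

a_row = np.random.rand(10)          # normalized adjacency row of node v
h_history = np.random.rand(10, 4)   # historical embeddings H_bar^{(l)}
h_current = h_history + 0.01 * np.random.randn(10, 4)   # current embeddings, close to history
sampled_idx = np.random.choice(10, size=2, replace=False)
print(cv_aggregate(a_row, h_current, h_history, sampled_idx).shape)  # (4,)
```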
In summary, VR-GCN first analyzes the variance reduction on node-wise sam-
pling, and successfully reduces the size of the samples. However, the trade-off is
that the additional memory consumption for storing the historical hidden embed-
dings would be very large. Recall that one limitation of applying GNNs to large-scale
graphs is that it is not realistic to store the full adjacency matrices or feature matrices;
in VR-GCN, the storage of the historical hidden embeddings actually increases the
memory cost, which does not help from this perspective.
6.3.2 Layer-wise Sampling

Since node-wise sampling can only alleviate, but not completely solve, the neighborhood
expansion problem, layer-wise sampling has been studied to address this
obstacle.
6.3.2.1 FastGCN
In order to solve the neighborhood expansion problem, FastGCN (Chen et al, 2018c)
first proposes to understand the GNN from the functional generalization perspective.
The authors point out that training algorithms such as stochastic gradient descent are
implemented according to the additivity of the loss function for independent data
samples. However, the loss terms of GNN models generally lack such sample independence. To solve
this problem, FastGCN converts the common graph convolution view to an integral
transform view by introducing a probability measure for each node. Fig. 6.6 shows
the conversion between the traditional graph convolution view and the integral transform
view. In the graph convolution view, a fixed number of nodes are sampled in
a bootstrapping manner in each layer, and they are connected if a connection
exists in the original graph. Each convolutional layer is responsible for integrating the node embeddings.
The integral transform view is visualized according to the probability measure, and
the integral transform (illustrated as the yellow triangle) is used to compute
the embedding function of the next layer. More details can be found in (Chen et al,
2018c).
Fig. 6.6: Two views of GCN. The circles represent the nodes in the graph, while
the yellow circles indicate the sampled nodes. The lines represent the connection
between nodes.
Moreover, consider sampling $t_l$ i.i.d. samples $u_1^{(l)}, \ldots, u_{t_l}^{(l)} \sim P$ for each layer $l$,
$l = 0, \ldots, K-1$; a layer-wise estimation of the loss function is admitted as,
$$L_{t_0, t_1, \ldots, t_K} := \frac{1}{t_K} \sum_{i=1}^{t_K} g\left(h^{(K)}\left(u_i^{(K)}\right)\right),$$
which shows that FastGCN samples a fixed number of nodes at each layer.
Furthermore, in order to reduce the sampling variance, FastGCN adopts the im-
portance sampling with respect to the weights in the normalized adjacency matrix.
$$q(u) = \frac{\|A(:, u)\|^2}{\sum_{u' \in \mathcal{V}} \|A(:, u')\|^2}, \quad u \in \mathcal{V}, \qquad (6.1)$$
where $A$ is the normalized adjacency matrix of the graph. Detailed proofs can be
found in (Chen et al, 2018c). According to Equation 6.1, the sampling process
is independent for each layer, and the sampling probability remains the same across layers.
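A small sketch of the layer-wise importance sampling of Equation 6.1, assuming a dense normalized adjacency matrix: each layer independently draws a fixed number of nodes with probability proportional to the squared column norms, and the aggregation is rescaled by 1/(t q(u)) so that the estimate of A h W remains unbiased; the toy graph and shapes are illustrative.

```python
import numpy as np

def fastgcn_probabilities(a_norm):
    """Importance distribution q(u) proportional to the squared norm of column A(:, u)."""
    col_sq_norms = (a_norm ** 2).sum(axis=0)
    return col_sq_norms / col_sq_norms.sum()

def fastgcn_layer(a_norm, h, w, num_samples, rng):
    """One layer-wise sampled propagation: draw `num_samples` nodes from q,
    then form an importance-weighted estimate of A h W."""
    q = fastgcn_probabilities(a_norm)
    idx = rng.choice(len(q), size=num_samples, replace=True, p=q)
    # Monte Carlo estimate of A @ h: average of A[:, u] h[u] / q(u) over sampled u.
    est = (a_norm[:, idx] / q[idx]) @ h[idx] / num_samples
    return np.maximum(0.0, est @ w)

rng = np.random.default_rng(0)
a_norm = rng.random((100, 100)) * (rng.random((100, 100)) < 0.05)   # sparse-ish toy graph
h = rng.random((100, 16))
w = rng.random((16, 8))
print(fastgcn_layer(a_norm, h, w, num_samples=10, rng=rng).shape)   # (100, 8)
```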
6.3.2.2 ASGCN
To better capture the between-layer correlations, ASGCN (Huang et al, 2018) proposes
an adaptive layer-wise sampling strategy. Specifically, the sampling probability
of the lower layers depends on the upper ones. As shown in Fig. 6.8(a), ASGCN only
samples nodes from the neighbors of the already sampled nodes (yellow nodes) to better capture
the between-layer correlations, while FastGCN applies importance sampling
among all the nodes.
Fig. 6.9: Network construction example: (a) node-wise sampling; (b) layer-wise
sampling; (c) skip connection implementation. Figure excerpted from (Huang et al,
2018).
To further reduce the sampling variance, ASGCN introduces explicit variance
reduction, which optimizes the sampling variance as part of the final objective. Considering
$x(u_j)$ as the node feature of node $u_j$, the optimal sampling probability $q^*(u_j)$ can
be formulated as,
$$q^*(u_j) = \frac{\sum_{i=1}^{n'} p(u_j \mid v_i)\, |g(x(u_j))|}{\sum_{j=1}^{n} \sum_{i=1}^{n'} p(u_j \mid v_i)\, |g(x(u_j))|}, \quad g(x(u_j)) = W_g\, x(u_j). \qquad (6.2)$$
However, simply utilizing the sampler given by Equation 6.2 is not sufficient
to secure a minimal variance. Thus, ASGCN designs a hybrid loss by adding the
variance to the classification loss $L_c$, as shown in Equation 6.3. In such a manner,
the variance can be minimized during training.
$$L = \frac{1}{n'} \sum_{i=1}^{n'} L_c\left(y_i, \bar{y}\left(\hat{\mu}_q(v_i)\right)\right) + \lambda\, \text{Var}_q\left(\hat{\mu}_q(v_i)\right), \qquad (6.3)$$
where $y_i$ is the ground-truth label, $\hat{\mu}_q(v_i)$ represents the output hidden embedding
of node $v_i$, and $\bar{y}(\hat{\mu}_q(v_i))$ is the prediction. $\lambda$ is a trade-off parameter.
The variance reduction term $\lambda \text{Var}_q(\hat{\mu}_q(v_i))$ can also be viewed as a regularizer
over the sampled instances.
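The sketch below mirrors the structure of Equation 6.3 for one mini-batch: a standard classification loss plus λ times the variance of the sampled node estimates. The softmax cross-entropy form of L_c and the way the variance is estimated from several independent sampling draws are illustrative assumptions.

```python
import numpy as np

def hybrid_loss(logits, labels, per_sample_estimates, lam):
    """Classification loss + lambda * sampling variance (Equation 6.3 style).

    logits:               (n, num_classes) predictions for the batch nodes
    labels:               (n,) ground-truth class indices
    per_sample_estimates: (n, num_draws, dim) hidden embeddings of each batch node
                          obtained from independent sampling draws
    """
    # Softmax cross-entropy as the classification loss L_c.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    ce = -log_probs[np.arange(len(labels)), labels].mean()

    # Empirical variance of the sampled embedding estimates, averaged over nodes.
    variance = per_sample_estimates.var(axis=1).mean()

    return ce + lam * variance

rng = np.random.default_rng(0)
logits = rng.standard_normal((32, 5))
labels = rng.integers(0, 5, size=32)
estimates = rng.standard_normal((32, 4, 16))   # 4 independent sampling draws per node
print(hybrid_loss(logits, labels, estimates, lam=0.5))
```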
ASGCN also proposes a skip connection method to obtain the information across
distant nodes. As shown in Fig. 6.9 (c), the nodes in the (l-1)-th layer theoretically
preserve the second-order proximity (Tang et al, 2015b), which are the 2-hop neigh-
bors for the nodes in the (l+1)-th layer. The sampled nodes will include both 1-hop
and 2-hop neighbors by adding a skip connection between the (l-1)-th layer and the
(l+1)-th layer, which captures the information between distant nodes and facilitates
the model training.
In summary, by introducing the adaptive sampling strategy, ASGCN gains
better performance as well as better variance control. However, it also
brings additional dependency into the sampling process. FastGCN, for example,
can perform parallel sampling to accelerate the process since each layer
is sampled independently, whereas in ASGCN the sampling of each layer depends on
the upper layer, so parallel processing is not applicable.
6.3.3 Graph-wise Sampling
6.3.3.1 Cluster-GCN
Cluster-GCN (Chiang et al, 2019) first proposes to extract small graph clusters based
on efficient graph clustering algorithms. The intuition is that the efficiency of the
mini-batch algorithm is correlated with the number of links between the nodes in one batch. Hence,
Cluster-GCN constructs mini-batches at the sub-graph level, while previous studies
usually construct mini-batches based on nodes.
Cluster-GCN extracts the small clusters as follows. A graph $\mathcal{G}(\mathcal{V}, \mathcal{E})$ can be divided
into $c$ portions by grouping its nodes, where $\mathcal{V} = [\mathcal{V}_1, \cdots, \mathcal{V}_c]$. The extracted sub-graphs
can then be defined as,
$$\bar{\mathcal{G}} = [\mathcal{G}_1, \cdots, \mathcal{G}_c] = [\{\mathcal{V}_1, \mathcal{E}_1\}, \cdots, \{\mathcal{V}_c, \mathcal{E}_c\}],$$
where $(\mathcal{V}_t, \mathcal{E}_t)$ contains the nodes and the links within the $t$-th portion, $t \in \{1, \cdots, c\}$. The
re-ordered adjacency matrix can accordingly be written as,
$$A = \bar{A} + \Delta = \begin{bmatrix} A_{11} & \cdots & A_{1c} \\ \vdots & \ddots & \vdots \\ A_{c1} & \cdots & A_{cc} \end{bmatrix}; \quad
\bar{A} = \begin{bmatrix} A_{11} & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & A_{cc} \end{bmatrix}, \quad
\Delta = \begin{bmatrix} 0 & \cdots & A_{1c} \\ \vdots & \ddots & \vdots \\ A_{c1} & \cdots & 0 \end{bmatrix}.$$
Different graph clustering algorithms can be used to partition the graph such that there are
more links between the nodes within a cluster than between clusters. The motivation of treating
a sub-graph as a batch also follows the nature of graphs, namely that neighboring nodes usually
stay close to each other.
Obviously, this strategy can avoid the neighborhood expansion problem since it only
samples the nodes within the clusters, as shown in Fig. 6.11. In Cluster-GCN, since
there is no connection between the sub-graphs, the nodes in other sub-graphs will
not be sampled as the number of layers increases. In such a manner, the sampling process
controls the neighborhood expansion by sampling over sub-graphs, while in
layer-wise sampling this control is implemented by fixing the neighbor sampling size.
However, there still remain two concerns with the vanilla Cluster-GCN. The first
is that the links between sub-graphs are discarded, which may fail to capture
important correlations. The second is that the clustering algorithm may change
the original distribution of the dataset and introduce some bias. To address these
concerns, the authors propose a stochastic multiple partitions scheme that randomly
combines clusters into a batch. Specifically, the graph is first clustered into $p$ sub-graphs;
then, in each training epoch, a new batch is formed by randomly combining $q$ clusters
($q < p$), and the interactions between the chosen clusters are included as well. Fig. 6.12 visualizes
an example where $q$ equals 2: the new batch is formed by two randomly selected
clusters, along with the retained connections between them.
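A sketch of the stochastic multiple partitions scheme, assuming a pre-computed partition from any graph clustering algorithm: in each epoch, q of the p clusters are drawn at random, their node sets are merged, and the batch adjacency is the sub-matrix induced by the merged nodes, so the links between the chosen clusters are kept.

```python
import numpy as np

def stochastic_partition_batches(adj, clusters, q, rng):
    """Yield (node_ids, sub_adjacency) batches, each formed by merging q random clusters.
    `clusters` is a list of p node-index arrays produced by any graph clustering algorithm."""
    p = len(clusters)
    order = rng.permutation(p)
    for start in range(0, p, q):
        chosen = order[start:start + q]
        nodes = np.concatenate([clusters[c] for c in chosen])
        yield nodes, adj[np.ix_(nodes, nodes)]   # keeps links between the chosen clusters

rng = np.random.default_rng(0)
adj = (rng.random((12, 12)) < 0.3).astype(float)
clusters = [np.array([0, 1, 2]), np.array([3, 4, 5]),
            np.array([6, 7, 8]), np.array([9, 10, 11])]   # p = 4 toy clusters
for nodes, sub_adj in stochastic_partition_batches(adj, clusters, q=2, rng=rng):
    print(nodes, sub_adj.shape)
```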
6.3.3.2 GraphSAINT
Instead of using clustering algorithms to generate the sub-graphs, which may introduce
certain biases or noise, GraphSAINT (Zeng et al, 2020a) proposes to directly sample a
sub-graph for mini-batch training according to a sub-graph sampler, and to employ a full
GCN on the sub-graph to generate the node embeddings as well as back-propagate
the loss for each node. As shown in Fig. 6.13, the sub-graph $\mathcal{G}_s$ is constructed from the
original graph $\mathcal{G}$ with nodes 0, 1, 2, 3, 4, and 7 included. Next, a full GCN is applied
on these 6 nodes along with the corresponding connections.
Fig. 6.13: An illustration of GraphSAINT training algorithm. The yellow circle in-
dicates the sampled node.
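Below is a minimal sketch of sub-graph based mini-batch construction in the spirit of GraphSAINT, assuming a simple uniform node sampler; GraphSAINT additionally proposes edge and random-walk samplers and applies normalization coefficients to de-bias the estimates, all of which are omitted here.

```python
import numpy as np

def sample_subgraph(adj, budget, rng):
    """Uniformly sample `budget` nodes and return the induced sub-graph,
    on which a full GCN can be trained as one mini-batch."""
    n = adj.shape[0]
    nodes = rng.choice(n, size=budget, replace=False)
    return nodes, adj[np.ix_(nodes, nodes)]

rng = np.random.default_rng(0)
adj = (rng.random((50, 50)) < 0.1).astype(float)
nodes, sub_adj = sample_subgraph(adj, budget=6, rng=rng)
print(nodes, sub_adj.shape)   # 6 sampled nodes and the 6 x 6 induced adjacency
```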
PinSage (Ying et al, 2018b) is one of the successful early applications of large-scale GNNs
to item-item recommendation systems, and it is deployed at Pinterest¹. Pinterest is a social
media application for sharing and discovering various content. Users mark content of interest
with pins and organize them on boards. When users browse the website, Pinterest recommends
potentially interesting content to them. By the year 2018, the Pinterest graph contained 2
billion pins, 1 billion boards, and over 18 billion edges between pins and boards.
In order to scale the training to such a large graph, Ying et al (2018b)
proposes PinSage, a random-walk-based GCN, to implement node-wise sampling
on the Pinterest graph. Specifically, short random walks are used to select a fixed-size
neighborhood of the target node. Fig. 6.15 demonstrates the overall architecture of
PinSage. Take node A as an example: a 2-depth convolution is constructed to generate
the node embedding $h_A^{(2)}$. The embedding vector $h_{\mathcal{N}(A)}^{(1)}$ of node A's neighbors
is aggregated from nodes B, C, and D. A similar process is applied to obtain the 1-hop
neighbors' embeddings $h_B^{(1)}$, $h_C^{(1)}$, and $h_D^{(1)}$. An illustration of all the participating nodes
for each node of the input graph is shown at the bottom of Fig. 6.15. In addition,
an L1-normalization is computed to sort the neighbors by their importance (Eksombatchai
et al, 2018), and a curriculum training strategy is used to further improve the
prediction performance by feeding harder-and-harder examples.
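The following sketch mimics random-walk-based neighbor selection in the spirit of PinSage: short random walks are simulated from the target node, the visit counts are L1-normalized into importance weights, and the top-T visited nodes form the fixed-size neighborhood. The walk length, the number of walks, and T are illustrative parameters, not the values used by PinSage.

```python
import random
from collections import Counter

def random_walk_neighborhood(adj_list, target, num_walks=20, walk_len=3, top_t=2):
    """Select an importance-weighted, fixed-size neighborhood for `target`
    by counting visits of short random walks started at the target node."""
    counts = Counter()
    for _ in range(num_walks):
        node = target
        for _ in range(walk_len):
            node = random.choice(adj_list[node])
            if node != target:
                counts[node] += 1
    total = sum(counts.values())
    importance = {u: c / total for u, c in counts.items()}   # L1-normalized visit counts
    top = sorted(importance, key=importance.get, reverse=True)[:top_t]
    return {u: importance[u] for u in top}

adj_list = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1, 3], 3: [0, 2]}
print(random_walk_neighborhood(adj_list, target=0))
```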
A series of comprehensive experiments conducted on Pinterest data, e.g.,
offline evaluations, production A/B tests, and user studies, demonstrates the
effectiveness of the proposed method. Moreover, with the adoption of a highly efficient
MapReduce inference pipeline, the entire inference process on the whole graph can be
finished within one day.
1 https://www.pinterest.com/
Fig. 6.15: Overview of the PinSage architecture. Colored nodes are used to illustrate
the construction of the graph convolutions.
In IntentGC, the ordinary graph convolution combines the neighborhood
embeddings $h_{\mathcal{N}(v)}^{(l)}$ and the target node itself $h_v^{(l)}$. Such an operation is able to capture two
types of information: the interactions between the target node and its neighborhood,
and the interactions between different dimensions of the embedding space. However,
in user-item networks, learning the interactions between different feature dimensions
may be less informative and unnecessary. Therefore, IntentNet designs a
vector-wise convolution operation as follows:
$$g_v^{(l)}(i) = \sigma\left(W_v^{(l)}(i, 1) \cdot h_v^{(l)} + W_v^{(l)}(i, 2) \cdot h_{\mathcal{N}(v)}^{(l)}\right),$$
$$h_v^{(l+1)} = \sigma\left(\sum_{i=1}^{L} \theta_i \cdot g_v^{(l)}(i)\right),$$
where $W_v^{(l)}(i, 1)$ and $W_v^{(l)}(i, 2)$ are the associated weight matrices for the $i$-th local
filter. $g_v^{(l)}(i)$ represents the operation that learns the interactions between the target
node and its neighbor nodes in a vector-wise manner. Another vector-wise layer is
applied to gather the final embedding vector of the target node for the next convolu-
tional layer. Moreover, the output vector of the last convolutional layer is fed into a
three-layer fully-connected network to further learn the node-level combinatory fea-
tures. Such an operation significantly promotes the training efficiency and reduces
the time complexity.
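A sketch of the vector-wise convolution above, assuming L local filters that each mix the target embedding and the aggregated neighbor embedding through the two weight matrices, followed by a θ-weighted sum over the filters; the ReLU nonlinearity and the toy shapes are illustrative assumptions.

```python
import numpy as np

def vector_wise_conv(h_v, h_neigh, w1, w2, theta):
    """IntentNet-style vector-wise convolution:
    g_i = sigma(W(i,1) h_v + W(i,2) h_N(v)),  h' = sigma(sum_i theta_i * g_i)."""
    relu = lambda x: np.maximum(0.0, x)
    g = np.stack([relu(w1[i] @ h_v + w2[i] @ h_neigh) for i in range(len(theta))])
    return relu(np.tensordot(theta, g, axes=1))   # weighted sum over the L local filters

rng = np.random.default_rng(0)
d, L = 8, 3                                  # embedding size and number of local filters
h_v = rng.random(d)                          # target node embedding
h_neigh = rng.random(d)                      # aggregated neighbor embedding
w1 = rng.random((L, d, d))                   # W(i, 1) for each local filter
w2 = rng.random((L, d, d))                   # W(i, 2) for each local filter
theta = rng.random(L)
print(vector_wise_conv(h_v, h_neigh, w1, w2, theta).shape)   # (8,)
```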
Extensive experiments are conducted on Taobao and Amazon datasets, which
contain millions to billions of users and items. IntentGC outperforms other baseline
methods, and reduces the training time by about two days compared with GraphSAGE.
Overall, in recent years, the scalability of GNNs has been extensively studied and
has achieved fruitful results. Fig. 6.18 summarizes the development towards large-
scale GNNs.
Editor’s Notes: For graphs of large scale or with rapid growth, such
as dynamic graphs (chapter 15) and heterogeneous graphs (chapter 16), the
scalability of GNNs is of vital importance in determining
whether an algorithm is superior in practice. For example, graph sampling
strategies are especially necessary to ensure computational efficiency in
industrial scenarios, such as recommender systems (chapter 19) and urban
intelligence (chapter 27). With the increasing complexity and scale of
real problems, the limitation in scalability has been considered almost
everywhere in the study of GNNs. Researchers devoted to graph embedding
(chapter 2), graph structure learning (chapter 14) and self-supervised learning
(chapter 18) have put forward remarkable works to overcome it.