
GraphPrompt: Unifying Pre-Training and Downstream Tasks for Graph Neural Networks

Zemin Liu1∗ (National University of Singapore, Singapore)
Xingtong Yu2∗ (University of Science and Technology of China, China)
Yuan Fang3† (Singapore Management University, Singapore)
Xinming Zhang2† (University of Science and Technology of China, China)

arXiv:2302.08043v3 [cs.LG] 25 Feb 2023

ABSTRACT
Graphs can model complex relationships between objects, enabling a myriad of Web applications such as online page/article classification and social recommendation. While graph neural networks (GNNs) have emerged as a powerful tool for graph representation learning, in an end-to-end supervised setting, their performance heavily relies on a large amount of task-specific supervision. To reduce the labeling requirement, the “pre-train, fine-tune” and “pre-train, prompt” paradigms have become increasingly common. In particular, prompting is a popular alternative to fine-tuning in natural language processing, which is designed to narrow the gap between pre-training and downstream objectives in a task-specific manner. However, existing studies of prompting on graphs are still limited, lacking a universal treatment to appeal to different downstream tasks. In this paper, we propose GraphPrompt, a novel pre-training and prompting framework on graphs. GraphPrompt not only unifies pre-training and downstream tasks into a common task template, but also employs a learnable prompt to assist a downstream task in locating the most relevant knowledge from the pre-trained model in a task-specific manner. Finally, we conduct extensive experiments on five public datasets to evaluate and analyze GraphPrompt.

CCS CONCEPTS
• Computing methodologies → Learning latent representations; • Information systems → Data mining.

KEYWORDS
Graph neural networks, pre-training, prompt, few-shot learning.

ACM Reference Format:
Zemin Liu, Xingtong Yu, Yuan Fang, and Xinming Zhang. 2023. GraphPrompt: Unifying Pre-Training and Downstream Tasks for Graph Neural Networks. In Proceedings of the ACM Web Conference 2023 (WWW ’23), May 1–5, 2023, Austin, TX, USA. ACM, New York, NY, USA, 12 pages. https://doi.org/10.1145/3543507.3583386

∗ Co-first authors with equal contribution. Part of the work was done while at Singapore Management University.
† Corresponding authors.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
WWW ’23, May 1–5, 2023, Austin, TX, USA
© 2023 Copyright held by the owner/author(s).
ACM ISBN 978-1-4503-9416-1/23/04.
https://doi.org/10.1145/3543507.3583386

1 INTRODUCTION
The ubiquitous Web is becoming the ultimate data repository, capable of linking a broad spectrum of objects to form gigantic and complex graphs. The prevalence of graph data enables a series of downstream tasks for Web applications, ranging from online page/article classification to friend recommendation in social networks. Modern approaches for graph analysis generally resort to graph representation learning, including graph embedding and graph neural networks (GNNs). Earlier graph embedding approaches [12, 33, 41] usually embed nodes on the graph into a low-dimensional space, in which the structural information such as the proximity between nodes can be captured [5]. More recently, GNNs [13, 20, 43, 50] have emerged as the state of the art for graph representation learning. Their key idea boils down to a message-passing framework, in which each node derives its representation by receiving and aggregating messages from its neighboring nodes recursively [48].

Graph pre-training. Typically, GNNs work in an end-to-end manner, and their performance depends heavily on the availability of large-scale, task-specific labeled data as supervision. This supervised paradigm presents two problems. First, task-specific supervision is often difficult or costly to obtain. Second, to deal with a new task, the weights of GNN models need to be retrained from scratch, even if the task is on the same graph. To address these issues, pre-training GNNs [15, 16, 30, 34] has become increasingly popular, inspired by pre-training techniques in language and vision applications [1, 7]. The pre-training of GNNs leverages self-supervised learning on more readily available label-free graphs (i.e., graphs without task-specific labels), and learns intrinsic graph properties that are intended to be general across tasks and graphs in a domain. In other words, the pre-training extracts a task-agnostic prior, and can be used to initialize model weights for a new task. Subsequently, the initial weights can be quickly updated through a lightweight fine-tuning step on a smaller number of task-specific labels.

However, the “pre-train, fine-tune” paradigm suffers from the problem of inconsistent objectives between pre-training and downstream tasks, resulting in suboptimal performance [23]. On one hand, the pre-training step aims to preserve various intrinsic graph properties such as node/edge features [15, 16], node connectivity/links [13, 16, 30], and local/global patterns [15, 30, 34].
On the other hand, the fine-tuning step aims to reduce the task loss, i.e., to fit the ground truth of the downstream task. The discrepancy between the two steps can be quite large. For example, pre-training may focus on learning the connectivity pattern between two nodes (i.e., related to link prediction), whereas fine-tuning could be dealing with a node or graph property (i.e., a node classification or graph classification task).

Figure 1: Illustration of the motivation. (a) Pre-training on graphs. (b/c) Downstream node/graph classification.

Prior work. To narrow the gap between pre-training and downstream tasks, prompting [4] has first been proposed for language models: a natural language instruction designed for a specific downstream task to “prompt out” the semantic relevance between the task and the language model. Meanwhile, the parameters of the pre-trained language model are frozen without any fine-tuning, as the prompt can “pull” the task toward the pre-trained model. Thus, prompting is also more efficient than fine-tuning, especially when the pre-trained model is huge. Recently, prompting has also been introduced to graph pre-training in the GPPT approach [39]. While the pioneering work has proposed a sophisticated design of pre-training and prompting, it can only be employed for the node classification task, lacking a universal treatment that appeals to different downstream tasks such as both node classification and graph classification.

Research problem and challenges. To address the divergence between graph pre-training and various downstream tasks, in this paper we investigate the design of pre-training and prompting for graph neural networks. In particular, we aim for a unified design that can suit different downstream tasks flexibly. This problem is non-trivial due to the following two challenges.

Firstly, to enable effective knowledge transfer from pre-training to a downstream task, it is desirable that the pre-training step preserves graph properties that are compatible with the given task. However, since different downstream tasks often have different objectives, how do we unify pre-training with various downstream tasks on graphs, so that a single pre-trained model can universally support different tasks? That is, we try to convert the pre-training task and downstream tasks to follow the same “template”. Using pre-trained language models as an analogy, both their pre-training and downstream tasks can be formulated as masked language modeling.

Secondly, under the unification framework, it is still important to identify the distinction between different downstream tasks, in order to attain task-specific optima. For pre-trained language models, prompts in the form of natural language tokens or learnable word vectors have been designed to give different hints to different tasks, but it is less apparent what form prompts on graphs should take. Hence, how do we design prompts on graphs, so that they can guide different downstream tasks to effectively make use of the pre-trained model?

Present work. To address these challenges, we propose a novel graph pre-training and prompting framework, called GraphPrompt, aiming to unify the pre-training and downstream tasks for GNNs. Drawing inspiration from the prompting strategy for pre-trained language models, GraphPrompt capitalizes on a unified template to define the objectives for both pre-training and downstream tasks, thus bridging their gap. We further equip GraphPrompt with task-specific learnable prompts, which guide the downstream tasks to exploit relevant knowledge from the pre-trained GNN model. The unified approach endows GraphPrompt with the ability to work with limited supervision such as few-shot learning tasks.

More specifically, to address the first challenge of unification, we focus on graph topology, which is a key enabler of graph models. In particular, the subgraph is a universal structure that can be leveraged for both node- and graph-level tasks. At the node level, the information of a node can be enriched and represented by its contextual subgraph, i.e., a subgraph in which the node resides [17, 55]; at the graph level, the information of a graph is naturally represented by the maximum subgraph (i.e., the graph itself). Consequently, we unify both the node- and graph-level tasks, whether in pre-training or downstream, into the same template: the similarity calculation of (sub)graph¹ representations. In this work, we adopt link prediction as the self-supervised pre-training task, given that links are readily available in any graph without additional annotation cost. Meanwhile, we focus on the popular node classification and graph classification as downstream tasks, which are node- and graph-level tasks, respectively. All these tasks can be cast as instances of learning subgraph similarity. On one hand, the link prediction task in pre-training boils down to the similarity between the contextual subgraphs of two nodes, as shown in Fig. 1(a). On the other hand, the downstream node or graph classification task boils down to the similarity between the target instance (a node's contextual subgraph or the whole graph, resp.) and the class prototypical subgraphs constructed from labeled data, as illustrated in Figs. 1(b) and (c). The unified template bridges the gap between the pre-training and different downstream tasks.

¹ As a graph is a subgraph of itself, we may simply use subgraph to refer to a graph too.

Toward the second challenge, we distinguish different downstream tasks by way of the ReadOut operation on subgraphs. The ReadOut operation is essentially an aggregation function to fuse node representations in the subgraph into a single subgraph representation.
For instance, sum pooling, which sums the representations of all nodes in the subgraph, is a practical and popular scheme for ReadOut. However, different downstream tasks can benefit from different aggregation schemes for their ReadOut. In particular, node classification tends to focus on features that can contribute to the representation of the target node, while graph classification tends to focus on features associated with the graph class. Motivated by such differences, we propose a novel task-specific learnable prompt to guide the ReadOut operation of each downstream task with an appropriate aggregation scheme. As shown in Fig. 1, the learnable prompt serves as the parameters of the ReadOut operation of downstream tasks, and thus enables different aggregation functions on the subgraphs of different tasks. Hence, GraphPrompt not only unifies the pre-training and downstream tasks into the same template based on subgraph similarity, but also recognizes the differences between various downstream tasks to guide task-specific objectives.

Contributions. To summarize, our contributions are three-fold. (1) We recognize the gap between graph pre-training and downstream tasks, and propose a unification framework GraphPrompt based on subgraph similarity for both pre-training and downstream tasks, including both node and graph classification. (2) We propose a novel prompting strategy for GraphPrompt, hinging on a learnable prompt to actively guide downstream tasks using task-specific aggregation in ReadOut, in order to drive the downstream tasks to exploit the pre-trained model in a task-specific manner. (3) We conduct extensive experiments on five public datasets, and the results demonstrate the superior performance of GraphPrompt in comparison to the state-of-the-art approaches.

2 RELATED WORK

Graph representation learning. The rise of graph representation learning, including earlier graph embedding [12, 33, 41] and recent GNNs [13, 20, 43, 50], opens up great opportunities for various downstream tasks at node and graph levels. Note that learning graph-level representations requires an additional ReadOut operation, which summarizes the global information of a graph by aggregating node representations through a flat [8, 11, 50, 56] or hierarchical [10, 21, 31, 51] pooling algorithm. We refer the readers to two comprehensive surveys [5, 48] for more details.

Graph pre-training. Inspired by the application of pre-training models in language [2, 7] and vision [1, 29] domains, graph pre-training [49] emerges as a powerful paradigm that leverages self-supervision on label-free graphs to learn intrinsic graph properties. While the pre-training learns a task-agnostic prior, a relatively light-weight fine-tuning step is further employed to update the pre-trained weights to fit a given downstream task. Different pre-training approaches design different self-supervised tasks based on various graph properties such as node features [15, 16], links [13, 16, 19, 30], local or global patterns [15, 30, 34], local-global consistency [14, 32, 37, 44], and their combinations [40, 52, 53]. However, the above approaches do not consider the gap between pre-training and downstream objectives, which limits their generalization ability to handle different tasks. Some recent studies recognize the importance of narrowing this gap. L2P-GNN [30] capitalizes on meta-learning [9] to simulate the fine-tuning step during pre-training. However, since the downstream tasks can still differ from the simulation task, the problem is not fundamentally addressed. In other fields, as an alternative to fine-tuning, researchers turn to prompting [4], in which a task-specific prompt is used to cue the downstream tasks. Prompts can be either handcrafted [4] or learnable [22, 24]. On graph data, the study of prompting is still limited. One recent work called GPPT [39] capitalizes on a sophisticated design of learnable prompts on graphs, but it only works with node classification, lacking a unification effort to accommodate other downstream tasks like graph classification. Besides, there is a model also named GraphPrompt [54], but it considers an NLP task (biomedical entity normalization) on text data, where the graph is only auxiliary. It employs the standard text prompt unified by masked language modeling, assisted by a relational graph to generate text templates, which is distinct from our work.

Comparison to other settings. Our few-shot setting is different from other paradigms that also deal with label scarcity, including semi-supervised learning [20] and meta-learning [9]. In particular, semi-supervised learning cannot cope with novel classes not seen in training, while meta-learning requires a large volume of labeled data in its base classes for a meta-training phase, before it can handle few-shot tasks in testing.

3 PRELIMINARIES
In this section, we give the problem definition and introduce the background of GNNs.

3.1 Problem Definition

Graph. A graph can be defined as G = (V, E), where V is the set of nodes and E is the set of edges. We also assume an input feature matrix of the nodes, X ∈ R^{|V| × d}, is available. Let x_i ∈ R^d denote the feature vector of node v_i ∈ V. In addition, we denote a set of graphs as 𝒢 = {G_1, G_2, . . . , G_N}.

Problem. In this paper, we investigate the problem of graph pre-training and prompting. For the downstream tasks, we consider the popular node classification and graph classification tasks. For node classification on a graph G = (V, E), let C be the set of node classes with ℓ_i ∈ C denoting the class label of node v_i ∈ V. For graph classification on a set of graphs 𝒢, let 𝒞 be the set of graph labels with L_i ∈ 𝒞 denoting the class label of graph G_i ∈ 𝒢. In particular, the downstream tasks are given limited supervision in a few-shot setting: for each class in the two tasks, only k labeled samples (i.e., nodes or graphs) are provided, known as k-shot classification.

3.2 Graph Neural Networks
The success of GNNs boils down to the message-passing mechanism [48], in which each node receives and aggregates messages (i.e., features or embeddings) from its neighboring nodes to generate its own representation. This operation of neighborhood aggregation can be stacked in multiple layers to enable recursive message passing. Formally, in the l-th GNN layer, the embedding of node v, denoted by h_v^l, is calculated based on the embeddings in the previous layer, as follows.
h_v^l = Aggr(h_v^{l-1}, {h_u^{l-1} : u ∈ N_v}; θ^l),  (1)

where N_v is the set of neighboring nodes of v, and θ^l denotes the learnable GNN parameters in layer l. Aggr(·) is the neighborhood aggregation function and can take various forms, ranging from simple mean pooling [13, 20] to advanced neural networks such as neural attention [43] or multi-layer perceptrons [50]. Note that in the first layer, the input node embedding h_v^0 can be initialized as the node features in X. The total learnable GNN parameters can be denoted as Θ = {θ^1, θ^2, . . .}. For brevity, we simply denote the output node representations of the last layer as h_v.

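As a concrete reading of Eq. (1), the Python sketch below implements a single message-passing layer that instantiates Aggr(·) with mean pooling over the neighborhood (plus a self-loop) followed by a learnable linear map; the class name MeanAggrLayer, the dense-adjacency interface, and the ReLU activation are illustrative assumptions rather than the exact configuration used in the paper.

import torch
import torch.nn as nn

class MeanAggrLayer(nn.Module):
    # One GNN layer h_v^l = Aggr(h_v^{l-1}, {h_u^{l-1} : u in N_v}; theta^l),
    # instantiated with mean pooling over {v} ∪ N_v and a linear map (theta^l).
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, h, adj):
        # h:   |V| x in_dim node embeddings from layer l-1
        # adj: |V| x |V| dense adjacency matrix (1 if edge, else 0)
        adj_hat = adj + torch.eye(adj.size(0))     # add self-loops
        deg = adj_hat.sum(dim=1, keepdim=True)     # neighborhood sizes
        return torch.relu(self.linear(adj_hat @ h / deg))

# Stacking layers and feeding the feature matrix X as h^0 yields the final h_v.
X = torch.randn(5, 8)
A = torch.tensor([[0, 1, 1, 0, 0], [1, 0, 1, 0, 0], [1, 1, 0, 1, 0],
                  [0, 0, 1, 0, 1], [0, 0, 0, 1, 0]], dtype=torch.float)
h = X
for layer in (MeanAggrLayer(8, 16), MeanAggrLayer(16, 16)):
    h = layer(h, A)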
4 PROPOSED APPROACH
In this section, we present our proposed approach GraphPrompt.

4.1 Unification Framework
We first introduce the overall framework of GraphPrompt in Fig. 2. Our framework is deployed on a set of label-free graphs shown in Fig. 2(a), for pre-training in Fig. 2(b). The pre-training adopts a link prediction task, which is self-supervised without requiring extra annotation. Afterward, in Fig. 2(c), we capitalize on a learnable prompt to guide each downstream task, namely, node classification or graph classification, for task-specific exploitation of the pre-trained model. In the following, we explain how the framework supports a unified view of pre-training and downstream tasks.

Instances as subgraphs. The key to the unification of pre-training and downstream tasks lies in finding a common template for the tasks. The task-specific prompt can then be further fused with the template of each downstream task, to distinguish the varying characteristics of different tasks.

In comparison to other fields such as visual and language processing, graph learning is uniquely characterized by the exploitation of graph topology. In particular, the subgraph is a universal structure capable of expressing both node- and graph-level instances. On one hand, at the node level, every node resides in a local neighborhood, which in turn contextualizes the node [25, 27, 28]. The local neighborhood of a node v on a graph G = (V, E) is usually defined by a contextual subgraph S_v = (V(S_v), E(S_v)), where its sets of nodes and edges are respectively given by

V(S_v) = {u ∈ V | d(u, v) ≤ δ}, and  (2)
E(S_v) = {(u, u′) ∈ E | u ∈ V(S_v), u′ ∈ V(S_v)},  (3)

where d(u, v) gives the shortest distance between nodes u and v on the graph G, and δ is a predetermined threshold. That is, S_v consists of the nodes within δ hops of the node v, and the edges between those nodes. Thus, the contextual subgraph S_v embodies not only the self-information of the node v, but also rich contextual information to complement the self-information [17, 55]. On the other hand, at the graph level, the maximum subgraph of a graph G, denoted S_G, is the graph itself, i.e., S_G = G. The maximum subgraph S_G spontaneously embodies all information of G. In summary, subgraphs can be used to represent both node- and graph-level instances: given an instance x which can either be a node or a graph (e.g., x = v or x = G), the subgraph S_x offers unified access to the information associated with x.

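To illustrate Eqs. (2)-(3), here is a minimal, dependency-free sketch that extracts the δ-hop contextual subgraph S_v with a breadth-first search; the function name contextual_subgraph and the adjacency-dictionary input format are assumptions made for the example, not part of the paper's implementation.

from collections import deque

def contextual_subgraph(adj, v, delta):
    # Eq. (2): nodes within delta hops of v; Eq. (3): edges of G between them.
    # adj maps each node to the set of its neighbors.
    dist = {v: 0}
    queue = deque([v])
    while queue:
        u = queue.popleft()
        if dist[u] == delta:
            continue
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    nodes = set(dist)                                              # V(S_v)
    edges = {(u, w) for u in nodes for w in adj[u] if w in nodes}  # E(S_v)
    return nodes, edges

# A path graph 0-1-2-3 with delta = 2 around node 0 keeps nodes {0, 1, 2}.
adj = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
print(contextual_subgraph(adj, 0, delta=2))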
Unified task template. Based on the above subgraph definitions for both node- and graph-level instances, we are ready to unify different tasks to follow a common template. Specifically, the link prediction task in pre-training and the downstream node and graph classification tasks can all be redefined as subgraph similarity learning. Let s_x be the vector representation of the subgraph S_x, and sim(·, ·) be the cosine similarity function. As illustrated in Figs. 2(b) and (c), the three tasks can be mapped to the computation of subgraph similarity, which is formalized below.

• Link prediction: This is a node-level task. Given a graph G = (V, E) and a triplet of nodes (v, a, b) such that (v, a) ∈ E and (v, b) ∉ E, we shall have
sim(s_v, s_a) > sim(s_v, s_b).  (4)
Intuitively, the contextual subgraph of v shall be more similar to that of a node linked to v than to that of another, unlinked node.

• Node classification: This is also a node-level task. Consider a graph G = (V, E) with a set of node classes C, and a set of labeled nodes D = {(v_1, ℓ_1), (v_2, ℓ_2), . . .} where v_i ∈ V and ℓ_i is the corresponding label of v_i. As we adopt a k-shot setting, there are exactly k pairs of (v_i, ℓ_i = c) ∈ D for every class c ∈ C. For each class c ∈ C, we further define a node class prototypical subgraph, represented by a vector s̃_c given by
s̃_c = (1/k) Σ_{(v_i, ℓ_i) ∈ D, ℓ_i = c} s_{v_i}.  (5)
Note that the class prototypical subgraph is a “virtual” subgraph with a latent representation in the same embedding space as the node contextual subgraphs. Basically, it is constructed as the mean representation of the contextual subgraphs of the labeled nodes in a given class. Then, given a node v_j not in the labeled set D, its class label ℓ_j shall be
ℓ_j = arg max_{c ∈ C} sim(s_{v_j}, s̃_c).  (6)
Intuitively, a node shall belong to the class whose prototypical subgraph is the most similar to the node's contextual subgraph.

• Graph classification: This is a graph-level task. Consider a set of graphs 𝒢 with a set of graph classes 𝒞, and a set of labeled graphs 𝒟 = {(G_1, L_1), (G_2, L_2), . . .} where G_i ∈ 𝒢 and L_i is the corresponding label of G_i. In the k-shot setting, there are exactly k pairs of (G_i, L_i = c) ∈ 𝒟 for every class c ∈ 𝒞. Similar to node classification, for each class c ∈ 𝒞, we define a graph class prototypical subgraph, also represented by the mean embedding vector of the (sub)graphs in c:
s̃_c = (1/k) Σ_{(G_i, L_i) ∈ 𝒟, L_i = c} s_{G_i}.  (7)
Then, given a graph G_j not in the labeled set 𝒟, its class label L_j shall be
L_j = arg max_{c ∈ 𝒞} sim(s_{G_j}, s̃_c).  (8)
Intuitively, a graph shall belong to the class whose prototypical subgraph is the most similar to the graph itself.

It is worth noting that node and graph classification can be further condensed into a single set of notations. Let (x, y) be an annotated instance of graph data, i.e., x is either a node or a graph, and y ∈ Y is the class label of x among the set of classes Y. Then,
y = arg max_{c ∈ Y} sim(s_x, s̃_c).  (9)
Figure 2: Overall framework of GraphPrompt. (a) Toy graphs. (b) Pre-training. (c) Prompting for node classification (left) or graph classification (right).

Finally, to materialize the common task template, we discuss how to learn the subgraph embedding vector s_x for the subgraph S_x. Given the node representations h_v generated by a GNN (see Sect. 3.2), a standard approach of computing s_x is to employ a ReadOut operation that aggregates the representations of the nodes in the subgraph S_x. That is,
s_x = ReadOut({h_v : v ∈ V(S_x)}).  (10)
The choice of the aggregation scheme for ReadOut is flexible, including sum pooling and more advanced techniques [50, 51]. In our implementation, we simply use sum pooling.

In summary, the unification framework is enabled by the common task template of subgraph similarity learning, which lays the foundation of our pre-training and prompting strategies, as we will introduce in the following parts.
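Putting the template together, the following sketch computes subgraph embeddings with the sum-pooling ReadOut of Eq. (10), builds class prototypical subgraph vectors as in Eqs. (5) and (7), and predicts the class with the highest cosine similarity as in Eqs. (6), (8) and (9); the function names and toy dimensions are assumptions for illustration only.

import torch
import torch.nn.functional as F

def readout(node_embs):
    # Eq. (10): sum pooling over the node embeddings of a subgraph.
    return node_embs.sum(dim=0)

def classify_by_prototype(query_emb, support):
    # support: {class label: (k, d) tensor of labelled subgraph embeddings}.
    classes = sorted(support)
    protos = torch.stack([support[c].mean(dim=0) for c in classes])   # Eqs. (5)/(7)
    sims = F.cosine_similarity(query_emb.unsqueeze(0), protos, dim=1)
    return classes[int(sims.argmax())]                                # Eqs. (6)/(8)/(9)

# Toy 2-shot task with two classes and 4-dimensional embeddings.
support = {0: torch.randn(2, 4), 1: torch.randn(2, 4)}
query = readout(torch.randn(6, 4))        # a query subgraph with 6 nodes
print(classify_by_prototype(query, support))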
4.2 Pre-Training Phase
As discussed earlier, our pre-training phase employs the link prediction task. Using link prediction/generation is a popular and natural way [13, 16, 18, 30], as a vast number of links are readily available on large-scale graph data without extra annotation. In other words, the link prediction objective can be optimized on label-free graphs, such as those shown in Fig. 2(a), in a self-supervised manner.

Based on the common template defined in Sect. 4.1, the link prediction task is anchored on the similarity of the contextual subgraphs of two candidate nodes. Generally, the subgraphs of two positive (i.e., linked) candidates shall be more similar than those of negative (i.e., non-linked) candidates, as illustrated in Fig. 2(b). Subsequently, the pre-trained prior on subgraph similarity can be naturally transferred to node classification downstream, which shares a similar intuition: the subgraphs of nodes in the same class shall be more similar than those of nodes from different classes. On the other hand, the prior can also support graph classification downstream, as graph similarity is consistent with subgraph similarity not only in letter (as a graph is technically always a subgraph of itself), but also in spirit. The “spirit” here refers to the tendency that graphs sharing similar subgraphs are likely to be similar themselves, which means graph similarity can be translated into the similarity of their constituent subgraphs [36, 42, 56].

Formally, given a node v on graph G, we randomly sample one positive node a from v's neighbors, and a negative node b from the graph that does not link to v, forming a triplet (v, a, b). Our objective is to increase the similarity between the contextual subgraphs S_v and S_a, while decreasing that between S_v and S_b. More generally, on a set of label-free graphs 𝒢, we sample a number of triplets from each graph to construct an overall training set 𝒯_pre. Then, we define the following pre-training loss:

L_pre(Θ) = − Σ_{(v,a,b) ∈ 𝒯_pre} ln [ exp(sim(s_v, s_a)/τ) / Σ_{u ∈ {a,b}} exp(sim(s_v, s_u)/τ) ],  (11)

where τ is a temperature hyperparameter to control the shape of the output distribution. Note that the loss is parameterized by Θ, which represents the GNN model weights.

The output of the pre-training phase is the optimal model parameters Θ_0 = arg min_Θ L_pre(Θ). Θ_0 can be used to initialize the GNN weights for downstream tasks, thus enabling the transfer of prior knowledge downstream.
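As a sketch of how one term of the pre-training objective in Eq. (11) can be evaluated for a sampled triplet (v, a, b), assuming the contextual-subgraph embeddings s_v, s_a, s_b have already been produced by the GNN and ReadOut; the function name and the temperature value are illustrative choices, not the paper's settings.

import torch
import torch.nn.functional as F

def pretrain_loss_term(s_v, s_a, s_b, tau=0.1):
    # -ln( exp(sim(s_v,s_a)/tau) / sum over u in {a,b} of exp(sim(s_v,s_u)/tau) )
    pos = F.cosine_similarity(s_v, s_a, dim=0) / tau
    neg = F.cosine_similarity(s_v, s_b, dim=0) / tau
    return -F.log_softmax(torch.stack([pos, neg]), dim=0)[0]

# The full loss of Eq. (11) sums this term over all triplets in T_pre and is
# minimized with respect to the GNN weights Θ.
triplets = [tuple(torch.randn(4) for _ in range(3)) for _ in range(3)]
loss = sum(pretrain_loss_term(s_v, s_a, s_b) for s_v, s_a, s_b in triplets)
print(loss.item())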
on large-scale graph data without extra annotation. In other words,
the link prediction objective can be optimized on label-free graphs,
such as those shown in Fig. 2(a), in a self-supervised manner. 4.3 Prompting for Downstream Tasks
Based on the common template defined in Sect. 4.1, the link The unification of pre-training and downstream tasks enables more
prediction task is anchored on the similarity of the contextual sub- effective knowledge transfer as the tasks in the two phases are made
graphs of two candidate nodes. Generally, the subgraphs of two more compatible by following a common template. However, it is
positive (i.e., linked) candidates shall be more similar than those still important to distinguish different downstream tasks, in order
of negative (i.e., non-linked) candidates, as illustrated in Fig. 2(b). to capture task individuality and achieve task-specific optimum.
Subsequently, the pre-trained prior on subgraph similarity can be To cope with this challenge, we propose a novel task-specific
naturally transferred to node classification downstream, which learnable prompt on graphs, inspired by prompting in natural lan-
shares a similar intuition: the subgraphs of nodes in the same class guage processing [4]. In language contexts, a prompt is initially
shall be more similar than those of nodes from different classes. a handcrafted instruction to guide the downstream task, which
On the other hand, the prior can also support graph classification provides task-specific cues to extract relevant prior knowledge
More recently, learnable prompts [22, 24] have been proposed as an alternative to handcrafted prompts, to alleviate the high engineering cost of the latter.

Prompt design. Nevertheless, our proposal is distinct from language-based prompting for two reasons. Firstly, we have a different task template from masked language modeling. Secondly, since our prompts are designed for graph structures, they are more abstract and cannot take the form of language-based instructions. Thus, they are virtually impossible to handcraft. Instead, they should be topology-related to align with the core of graph learning. In particular, under the same task template of subgraph similarity learning, the ReadOut operation (used to generate the subgraph representation) can be “prompted” differently for different downstream tasks. Intuitively, different tasks can benefit from different aggregation schemes for their ReadOut. For instance, node classification pays more attention to features that are topically more relevant to the target node. In contrast, graph classification tends to focus on features that are correlated with the graph class. Moreover, the important features may also vary given different sets of instances or classes in a task.

Formally, let p_t denote a learnable prompt vector for a downstream task t, as shown in Fig. 2(c). The prompt-assisted ReadOut operation on a subgraph S_x for task t is
s_{t,x} = ReadOut({p_t ⊙ h_v : v ∈ V(S_x)}),  (12)
where s_{t,x} is the task t-specific subgraph representation, and ⊙ denotes element-wise multiplication. That is, we perform a feature-weighted summation of the node representations from the subgraph, where the prompt vector p_t is a dimension-wise reweighting that extracts the most relevant prior knowledge for the task t.

Note that other prompt designs are also possible. For example, we could consider a learnable prompt matrix P_t, which applies a linear transformation to the node representations:
s_{t,x} = ReadOut({P_t h_v : v ∈ V(S_x)}).  (13)
More complex prompts, such as an attention layer, are another alternative. However, one of the main motivations of prompting instead of fine-tuning is to reduce the reliance on labeled data. In few-shot settings, given very limited supervision, prompts with fewer parameters are preferred to mitigate the risk of overfitting. Hence, the feature weighting scheme in Eq. (12) is adopted for our prompting, as the prompt is a single vector of the same length as the node representation, which is typically a small number (e.g., 128).
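A minimal sketch of the prompt-assisted ReadOut in Eq. (12): a single learnable vector p_t reweights each embedding dimension before sum pooling, so only dim parameters are tuned per downstream task while the GNN stays frozen. The class name PromptReadout and the all-ones initialization are assumptions of this sketch (all-ones simply recovers the plain sum pooling of Eq. (10) before any tuning).

import torch
import torch.nn as nn

class PromptReadout(nn.Module):
    # Eq. (12): s_{t,x} = sum over v in V(S_x) of p_t ⊙ h_v.
    def __init__(self, dim):
        super().__init__()
        self.p_t = nn.Parameter(torch.ones(dim))   # one prompt vector per task

    def forward(self, node_embs):                  # node_embs: (|V(S_x)|, dim)
        return (self.p_t * node_embs).sum(dim=0)

readout_t = PromptReadout(dim=128)
s_tx = readout_t(torch.randn(10, 128))             # task-specific subgraph embedding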
Prompt tuning. To optimize the learnable prompt, also known as prompt tuning, we formulate the loss based on the common template of subgraph similarity, using the prompt-assisted task-specific subgraph representations. Formally, consider a task t with a labeled training set 𝒯_t = {(x_1, y_1), (x_2, y_2), . . .}, where x_i is an instance (i.e., a node or a graph), and y_i ∈ Y is the class label of x_i among the set of classes Y. The loss for prompt tuning is defined as

L_prompt(p_t) = − Σ_{(x_i, y_i) ∈ 𝒯_t} ln [ exp(sim(s_{t,x_i}, s̃_{t,y_i})/τ) / Σ_{c ∈ Y} exp(sim(s_{t,x_i}, s̃_{t,c})/τ) ],  (14)

where the class prototypical subgraph for class c is represented by s̃_{t,c}, which is also generated by the prompt-assisted, task-specific ReadOut.

Note that the prompt tuning loss is only parameterized by the learnable prompt vector p_t, without the GNN weights. Instead, the pre-trained GNN weights Θ_0 are frozen for downstream tasks, as no fine-tuning is necessary. This significantly decreases the number of parameters to be updated downstream, thus not only improving the computational efficiency of task learning and inference, but also reducing the reliance on labeled data.
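The prompt tuning step of Eq. (14) can then be sketched as a temperature-scaled cross-entropy over cosine similarities between the prompt-assisted subgraph embeddings and the class prototypes, with gradients flowing only into p_t; reduction="sum" mirrors the summation in Eq. (14), and the optimizer usage shown in the comments is an assumed pattern, not the authors' code.

import torch
import torch.nn.functional as F

def prompt_tuning_loss(task_embs, labels, prototypes, tau=0.1):
    # task_embs: (n, d) prompt-assisted embeddings s_{t,x_i};
    # labels: (n,) class indices; prototypes: (|Y|, d) vectors s~_{t,c}.
    sims = F.cosine_similarity(task_embs.unsqueeze(1), prototypes.unsqueeze(0), dim=2)
    return F.cross_entropy(sims / tau, labels, reduction="sum")

# Typical usage: the pre-trained GNN weights Θ0 stay frozen, and an optimizer
# holding only [p_t] takes steps on this loss, e.g.
#   optimizer = torch.optim.Adam([readout_t.p_t], lr=1e-3)
#   loss = prompt_tuning_loss(embs, labels, protos); loss.backward(); optimizer.step()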
Table 1: Summary of datasets.
Dataset | Graphs | Graph classes | Avg. nodes | Avg. edges | Node features | Node classes | Task (N/G)
Flickr | 1 | - | 89,250 | 899,756 | 500 | 7 | N
PROTEINS | 1,113 | 2 | 39.06 | 72.82 | 1 | 3 | N, G
COX2 | 467 | 2 | 41.22 | 43.45 | 3 | - | G
ENZYMES | 600 | 6 | 32.63 | 62.14 | 18 | 3 | N, G
BZR | 405 | 2 | 35.75 | 38.36 | 3 | - | G

5 EXPERIMENTS
In this section, we conduct extensive experiments with node classification and graph classification as downstream tasks on five benchmark datasets to evaluate the proposed GraphPrompt.

5.1 Experimental Setup

Datasets. We employ five benchmark datasets for evaluation. (1) Flickr [47] is an image sharing network. (2) PROTEINS [3] is a collection of protein graphs which include the amino acid sequence, conformation, structure, and features such as active sites of the proteins. (3) COX2 [35] is a dataset of molecular structures including 467 cyclooxygenase-2 inhibitors. (4) ENZYMES [46] is a dataset of 600 enzymes collected from the BRENDA enzyme database. (5) BZR [35] is a collection of 405 ligands for the benzodiazepine receptor. We summarize these datasets in Table 1, and present further details in Appendix B. Note that the “Task” column indicates the type of downstream task performed on each dataset: “N” for node classification and “G” for graph classification.

Baselines. We evaluate GraphPrompt against state-of-the-art approaches from three main categories. (1) End-to-end graph neural networks: GCN [20], GraphSAGE [13], GAT [43] and GIN [50]. They capitalize on the key operation of neighborhood aggregation to recursively aggregate messages from the neighbors, and work in an end-to-end manner. (2) Graph pre-training models: DGI [44], InfoGraph [38], and GraphCL [53]. They work in the “pre-train, fine-tune” paradigm. In particular, they pre-train the GNN models to preserve intrinsic graph properties, and fine-tune the pre-trained weights on downstream tasks to fit the task labels. (3) Graph prompt models: GPPT [39]. GPPT utilizes a link prediction task for pre-training, and resorts to a learnable prompt for the node classification task, which is mapped to a link prediction task.

Note that other few-shot learning methods on graphs, such as Meta-GNN [57] and RALE [26], adopt a meta-learning paradigm [9]. Thus, they cannot be used in our setting, as they require labeled data in their base classes for the meta-training phase.
In our approach, only label-free graphs are utilized for pre-training.

Settings and parameters. To evaluate the goal of GraphPrompt in realizing a unified design that can suit different downstream tasks flexibly, we consider two typical types of downstream tasks, i.e., node classification and graph classification. In particular, for the datasets which are suitable for both of these two tasks, i.e., PROTEINS and ENZYMES, we only pre-train the GNN model once on each dataset, and utilize the same pre-trained model for the two downstream tasks with their task-specific prompting.

The downstream tasks follow a k-shot classification setting. For each type of downstream task, we construct a series of k-shot classification tasks; the details of task construction are elaborated when reporting the results in Sect. 5.2. For task evaluation, as the k-shot tasks are balanced classification, we employ accuracy as the evaluation metric, following earlier work [26, 45].

For all the baselines, based on the authors' code and default settings, we further tune their hyper-parameters to optimize their performance. We present more implementation details of the baselines and our GraphPrompt in Appendix D.

5.2 Performance Evaluation
As discussed, we perform two types of downstream tasks different from the link prediction task in pre-training, namely, node classification and graph classification in few-shot settings. We first evaluate on a fixed-shot setting, and then vary the number of shots to see the performance trend.

Few-shot node classification. We conduct this node-level task on three datasets, i.e., Flickr, PROTEINS, and ENZYMES. Following a typical k-shot setup [26, 45, 57], we generate a series of few-shot tasks for model training and validation. In particular, for PROTEINS and ENZYMES, on each graph we randomly generate ten 1-shot node classification tasks (i.e., in each task, we randomly sample 1 node per class) for training and validation, respectively. Each training task is paired with a validation task, and the remaining nodes not sampled by the pair of training and validation tasks are used for testing. For Flickr, as it contains a large number of very sparse node features, selecting very few shots for training may result in inferior performance for all the methods. Therefore, we randomly generate ten 50-shot node classification tasks for training and validation, respectively. On Flickr, 50 shots are still considered few, accounting for less than 0.06% of all nodes on the graph.
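The k-shot task construction described above can be sketched generically as follows; sample_k_shot_task is a hypothetical helper that draws k labelled instances per class for one task and treats the remainder as the test pool (the actual protocol additionally pairs each training task with a separate validation task).

import random
from collections import defaultdict

def sample_k_shot_task(labels, k, seed=None):
    # labels: dict mapping an instance id (node or graph) to its class label.
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, c in labels.items():
        by_class[c].append(idx)
    support = {c: rng.sample(ids, k) for c, ids in by_class.items()}
    picked = {i for ids in support.values() for i in ids}
    test = [i for i in labels if i not in picked]
    return support, test

# e.g. ten independent 1-shot tasks over a toy 3-class labelling:
labels = {i: i % 3 for i in range(60)}
tasks = [sample_k_shot_task(labels, k=1, seed=s) for s in range(10)]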
Table 2: Accuracy evaluation on node classification. All tabular results are in percent.
Methods | Flickr (50-shot) | PROTEINS (1-shot) | ENZYMES (1-shot)
GCN | 9.22 ± 9.49 | 59.60 ± 12.44 | 61.49 ± 12.87
GraphSAGE | 13.52 ± 11.28 | 59.12 ± 12.14 | 61.81 ± 13.19
GAT | 16.02 ± 12.72 | 58.14 ± 12.05 | 60.77 ± 13.21
GIN | 10.18 ± 5.41 | 60.53 ± 12.19 | 63.81 ± 11.28
DGI | 17.71 ± 1.09 | 54.92 ± 18.46 | 63.33 ± 18.13
GraphCL | 18.37 ± 1.72 | 52.00 ± 15.83 | 58.73 ± 16.47
GPPT | 18.95 ± 1.92 | 50.83 ± 16.56 | 53.79 ± 17.46
GraphPrompt | 20.21 ± 11.52 | 63.03 ± 12.14 | 67.04 ± 11.48

Table 2 illustrates the results of few-shot node classification. We have the following observations. First, our proposed GraphPrompt outperforms all the baselines across the three datasets, demonstrating the effectiveness of GraphPrompt in transferring knowledge from pre-training to the downstream tasks. In particular, by virtue of the unification framework and prompt-based task-specific aggregation in the ReadOut function, GraphPrompt is able to narrow the gap between pre-training and downstream tasks, and guide the downstream tasks to exploit the pre-trained model in a task-specific manner. Second, compared to the graph pre-training models, the end-to-end GNN models can sometimes achieve comparable or even better performance. This implies that the discrepancy between the pre-training and downstream tasks in these pre-training approaches obstructs the knowledge transfer from the former to the latter. In such a case, even with sophisticated pre-training, they cannot effectively promote the performance of the downstream tasks. Third, the graph prompt model GPPT is only comparable to or even worse than the other baselines, despite also using prompts. A potential reason is that GPPT requires many more learnable parameters in its prompts than ours, which may not work well given very few shots (e.g., 1-shot).

Few-shot graph classification. We further conduct few-shot graph classification on four datasets, i.e., PROTEINS, COX2, ENZYMES, and BZR. For each dataset, we randomly generate 100 5-shot classification tasks for training and validation, following a process similar to that for the node classification tasks.

Table 3: Accuracy evaluation on graph classification.
Methods | PROTEINS (5-shot) | COX2 (5-shot) | ENZYMES (5-shot) | BZR (5-shot)
GCN | 54.87 ± 11.20 | 51.37 ± 11.06 | 20.37 ± 5.24 | 56.16 ± 11.07
GraphSAGE | 52.99 ± 10.57 | 52.87 ± 11.46 | 18.31 ± 6.22 | 57.23 ± 10.95
GAT | 48.78 ± 18.46 | 51.20 ± 27.93 | 15.90 ± 4.13 | 53.19 ± 20.61
GIN | 58.17 ± 8.58 | 51.89 ± 8.71 | 20.34 ± 5.01 | 57.45 ± 10.54
InfoGraph | 54.12 ± 8.20 | 54.04 ± 9.45 | 20.90 ± 3.32 | 57.57 ± 9.93
GraphCL | 56.38 ± 7.24 | 55.40 ± 12.04 | 28.11 ± 4.00 | 59.22 ± 7.42
GraphPrompt | 64.42 ± 4.37 | 59.21 ± 6.82 | 31.45 ± 4.32 | 61.63 ± 7.68

We illustrate the results of few-shot graph classification in Table 3, and have the following observations. First, our proposed GraphPrompt significantly outperforms the baselines on these four datasets. This again demonstrates the necessity of unifying pre-training and downstream tasks, and the effectiveness of prompt-assisted task-specific aggregation for ReadOut. Second, as both node and graph classification tasks share the same pre-trained model on PROTEINS and ENZYMES, the superior performance of GraphPrompt on both types of task further demonstrates that the gap between different tasks is well addressed by virtue of our unification framework. Third, the graph pre-training models generally achieve better performance than the end-to-end GNN models. This is because both InfoGraph and GraphCL capitalize on graph-level tasks for pre-training, which are naturally closer to downstream graph classification.
Figure 3: Impact of shots on few-shot node classification.
Figure 4: Impact of shots on few-shot graph classification.

Performance with different shots. We study the impact of the number of shots on the PROTEINS and ENZYMES datasets. For node classification, we vary the number of shots between 1 and 10, and compare with several competitive baselines (i.e., GIN, DGI, GraphCL, and GPPT) in Fig. 3. For few-shot graph classification, we vary the number of shots between 1 and 30, and compare with competitive baselines (i.e., GIN, InfoGraph, and GraphCL) in Fig. 4. The task settings are identical to those stated earlier.

In general, our proposed GraphPrompt consistently outperforms the baselines, especially with lower shots. For node classification, as the number of nodes in each graph is relatively small, 10 shots per class might be sufficient for semi-supervised node classification. Nevertheless, GraphPrompt is competitive even with 10 shots. For graph classification, GraphPrompt can be surpassed by some baselines when given more shots (e.g., 20 or more), especially on ENZYMES. On this dataset, 30 shots per class implies that 30% of the 600 graphs are used for training, which is not our target scenario.

5.3 Model Analysis
We further analyse several aspects of our model. Due to the space constraint, we only report the ablation and parameter efficiency studies, and leave the rest to Appendix E.

Ablation study. To evaluate the contribution of each component, we conduct an ablation study by comparing GraphPrompt with different prompting strategies: (1) no prompt: for downstream tasks, we remove the prompt vector, and conduct classification by employing a classifier on the subgraph representations obtained by a direct sum-based ReadOut; (2) lin. prompt: we replace the prompt vector with a linear transformation matrix as in Eq. (13).

We conduct the ablation study on three datasets for node classification (Flickr, PROTEINS, and ENZYMES) and graph classification (COX2, ENZYMES, and BZR), respectively, and illustrate the comparison in Fig. 5. We have the following observations. (1) Without the prompt vector, no prompt usually performs the worst among the variants, showing the necessity of prompting the ReadOut operation differently for different downstream tasks. (2) Converting the prompt vector into a linear transformation matrix also hurts the performance, as the matrix involves more parameters, thus increasing the reliance on labeled data.

Figure 5: Ablation study. (a) Node classification. (b) Graph classification.

Parameter efficiency. We also compare the number of parameters that need to be updated in a downstream node classification task for a few representative models, as well as their number of floating point operations (FLOPs), in Table 4.

Table 4: Study of parameter efficiency on node classification.
Methods | Flickr Params | Flickr FLOPs | PROTEINS Params | PROTEINS FLOPs | ENZYMES Params | ENZYMES FLOPs
GIN | 22,183 | 240,100 | 5,730 | 12,380 | 6,280 | 11,030
GPPT | 4,096 | 4,582 | 1,536 | 1,659 | 1,536 | 1,659
GraphPrompt | 96 | 96 | 96 | 96 | 96 | 96
GraphPrompt+ft | 21,600 | 235,200 | 6,176 | 13,440 | 6,176 | 10,944

In particular, as GIN works in an end-to-end manner, it obviously involves the largest number of parameters for updating. GPPT requires a separate learnable vector for each class as its representation, and an attention module to weigh the neighbors for aggregation in the structure token generation. Therefore, GPPT needs to update more parameters than GraphPrompt, which is one factor that impairs its performance in downstream tasks. Our proposed GraphPrompt not only outperforms the baselines GIN and GPPT, as we have seen earlier, but also requires the fewest parameters and FLOPs for downstream tasks. For illustration, in addition to prompt tuning, if we also fine-tune the pre-trained weights instead of freezing them (denoted GraphPrompt+ft), there will be significantly more parameters to update.
6 CONCLUSIONS
In this paper, we studied the research problem of prompting on graphs and proposed GraphPrompt, in order to overcome the limitations of graph neural networks in the supervised or “pre-train, fine-tune” paradigms. In particular, to narrow the gap between pre-training and downstream objectives on graphs, we introduced a unification framework by mapping different tasks to a common task template. Moreover, to distinguish task individuality and achieve task-specific optima, we proposed a learnable task-specific prompt vector that guides each downstream task to make full use of the pre-trained model. Finally, we conducted extensive experiments on five public datasets, and showed that GraphPrompt significantly outperforms various state-of-the-art baselines.

ACKNOWLEDGMENTS
This research / project is supported by the Ministry of Education, Singapore, under its Academic Research Fund Tier 2 (MOE-T2EP20122-0041). Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not reflect the views of the Ministry of Education, Singapore.

REFERENCES
[1] Hangbo Bao, Li Dong, Songhao Piao, and Furu Wei. 2022. BEiT: BERT Pre-Training of Image Transformers. In International Conference on Learning Representations.
[2] Iz Beltagy, Kyle Lo, and Arman Cohan. 2019. SciBERT: A Pretrained Language Model for Scientific Text. In Conference on Empirical Methods in Natural Language Processing and the International Joint Conference on Natural Language Processing. 3615–3620.
[3] Karsten M Borgwardt, Cheng Soon Ong, Stefan Schönauer, SVN Vishwanathan, Alex J Smola, and Hans-Peter Kriegel. 2005. Protein function prediction via graph kernels. Bioinformatics 21, suppl_1 (2005), i47–i56.
[4] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in Neural Information Processing Systems 33 (2020), 1877–1901.
[5] Hongyun Cai, Vincent W Zheng, and Kevin Chen-Chuan Chang. 2018. A comprehensive survey of graph embedding: Problems, techniques, and applications. IEEE Transactions on Knowledge and Data Engineering 30, 9 (2018), 1616–1637.
[6] Deli Chen, Yankai Lin, Wei Li, Peng Li, Jie Zhou, and Xu Sun. 2020. Measuring and relieving the over-smoothing problem for graph neural networks from the topological view. In AAAI Conference on Artificial Intelligence. 3438–3445.
[7] Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, and Hsiao-Wuen Hon. 2019. Unified language model pre-training for natural language understanding and generation. Advances in Neural Information Processing Systems 32 (2019).
[8] David K Duvenaud, Dougal Maclaurin, Jorge Iparraguirre, Rafael Bombarell, Timothy Hirzel, Alán Aspuru-Guzik, and Ryan P Adams. 2015. Convolutional networks on graphs for learning molecular fingerprints. Advances in Neural Information Processing Systems 28 (2015).
[9] Chelsea Finn, Pieter Abbeel, and Sergey Levine. 2017. Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning. 1126–1135.
[10] Hongyang Gao and Shuiwang Ji. 2019. Graph u-nets. In International Conference on Machine Learning. 2083–2092.
[11] Justin Gilmer, Samuel S Schoenholz, Patrick F Riley, Oriol Vinyals, and George E Dahl. 2017. Neural message passing for quantum chemistry. In International Conference on Machine Learning. 1263–1272.
[12] Aditya Grover and Jure Leskovec. 2016. node2vec: Scalable feature learning for networks. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 855–864.
[13] Will Hamilton, Zhitao Ying, and Jure Leskovec. 2017. Inductive representation learning on large graphs. Advances in Neural Information Processing Systems 30 (2017), 1025–1035.
[14] Kaveh Hassani and Amir Hosein Khasahmadi. 2020. Contrastive multi-view representation learning on graphs. In International Conference on Machine Learning. 4116–4126.
[15] Weihua Hu, Bowen Liu, Joseph Gomes, Marinka Zitnik, Percy Liang, Vijay Pande, and Jure Leskovec. 2020. Strategies for Pre-training Graph Neural Networks. In International Conference on Learning Representations.
[16] Ziniu Hu, Yuxiao Dong, Kuansan Wang, Kai-Wei Chang, and Yizhou Sun. 2020. GPT-GNN: Generative pre-training of graph neural networks. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 1857–1867.
[17] Kexin Huang and Marinka Zitnik. 2020. Graph meta learning via local subgraphs. Advances in Neural Information Processing Systems 33 (2020), 5862–5874.
[18] Dasol Hwang, Jinyoung Park, Sunyoung Kwon, KyungMin Kim, Jung-Woo Ha, and Hyunwoo J Kim. 2020. Self-supervised auxiliary learning with meta-paths for heterogeneous graphs. Advances in Neural Information Processing Systems 33 (2020), 10294–10305.
[19] Thomas N Kipf and Max Welling. 2016. Variational graph auto-encoders. In Bayesian Deep Learning Workshop.
[20] Thomas N Kipf and Max Welling. 2017. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations.
[21] Junhyun Lee, Inyeop Lee, and Jaewoo Kang. 2019. Self-attention graph pooling. In International Conference on Machine Learning. 3734–3743.
[22] Brian Lester, Rami Al-Rfou, and Noah Constant. 2021. The Power of Scale for Parameter-Efficient Prompt Tuning. In Conference on Empirical Methods in Natural Language Processing. 3045–3059.
[23] Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. 2021. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. arXiv preprint arXiv:2107.13586 (2021).
[24] Xiao Liu, Yanan Zheng, Zhengxiao Du, Ming Ding, Yujie Qian, Zhilin Yang, and Jie Tang. 2021. GPT understands, too. arXiv preprint arXiv:2103.10385 (2021).
[25] Zemin Liu, Yuan Fang, Chenghao Liu, and Steven C.H. Hoi. 2021. Node-wise Localization of Graph Neural Networks. In International Joint Conference on Artificial Intelligence. 1520–1526.
[26] Zemin Liu, Yuan Fang, Chenghao Liu, and Steven CH Hoi. 2021. Relative and absolute location embedding for few-shot node classification on graph. In AAAI Conference on Artificial Intelligence. 4267–4275.
[27] Zemin Liu, Trung-Kien Nguyen, and Yuan Fang. 2021. Tail-GNN: Tail-Node Graph Neural Networks. In ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 1109–1119.
[28] Zemin Liu, Wentao Zhang, Yuan Fang, Xinming Zhang, and Steven C. H. Hoi. 2020. Towards Locality-Aware Meta-Learning of Tail Node Embeddings on Networks. In Conference on Information and Knowledge Management. 975–984.
[29] Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. 2019. ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Advances in Neural Information Processing Systems 32 (2019).
[30] Yuanfu Lu, Xunqiang Jiang, Yuan Fang, and Chuan Shi. 2021. Learning to pre-train graph neural networks. In AAAI Conference on Artificial Intelligence. 4276–4284.
[31] Yao Ma, Suhang Wang, Charu C Aggarwal, and Jiliang Tang. 2019. Graph convolutional networks with eigenpooling. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 723–731.
[32] Zhen Peng, Wenbing Huang, Minnan Luo, Qinghua Zheng, Yu Rong, Tingyang Xu, and Junzhou Huang. 2020. Graph representation learning via graphical mutual information maximization. In The Web Conference. 259–270.
[33] Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. 2014. DeepWalk: Online learning of social representations. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 701–710.
[34] Jiezhong Qiu, Qibin Chen, Yuxiao Dong, Jing Zhang, Hongxia Yang, Ming Ding, Kuansan Wang, and Jie Tang. 2020. GCC: Graph contrastive coding for graph neural network pre-training. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 1150–1160.
[35] Ryan A. Rossi and Nesreen K. Ahmed. [n. d.]. The Network Data Repository with Interactive Graph Analytics and Visualization. In AAAI Conference on Artificial Intelligence. 4292–4293.
[36] Nino Shervashidze, Pascal Schweitzer, Erik Jan Van Leeuwen, Kurt Mehlhorn, and Karsten M Borgwardt. 2011. Weisfeiler-Lehman graph kernels. Journal of Machine Learning Research 12, 9 (2011).
[37] Fan-Yun Sun, Jordan Hoffman, Vikas Verma, and Jian Tang. 2020. InfoGraph: Unsupervised and Semi-supervised Graph-Level Representation Learning via Mutual Information Maximization. In International Conference on Learning Representations.
[38] Fan-Yun Sun, Jordan Hoffmann, Vikas Verma, and Jian Tang. 2020. InfoGraph: Unsupervised and semi-supervised graph-level representation learning via mutual information maximization. In International Conference on Learning Representations.
[39] Mingchen Sun, Kaixiong Zhou, Xin He, Ying Wang, and Xin Wang. 2022. GPPT: Graph Pre-training and Prompt Tuning to Generalize Graph Neural Networks. In ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 1717–1727.
[40] Susheel Suresh, Pan Li, Cong Hao, and Jennifer Neville. 2021. Adversarial graph augmentation to improve graph contrastive learning. Advances in Neural Information Processing Systems 34 (2021), 15920–15933.
[41] Jian Tang, Meng Qu, Mingzhe Wang, Ming Zhang, Jun Yan, and Qiaozhu Mei. 2015. LINE: Large-scale information network embedding. In The Web Conference. 1067–1077.
[42] Matteo Togninalli, Elisabetta Ghisu, Felipe Llinares-López, Bastian Rieck, and Karsten Borgwardt. 2019. Wasserstein Weisfeiler-Lehman graph kernels. Advances in Neural Information Processing Systems 32 (2019).
[43] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. 2018. Graph attention networks. In International Conference on Learning Representations.
[44] Petar Velickovic, William Fedus, William L Hamilton, Pietro Liò, Yoshua Bengio, and R Devon Hjelm. 2019. Deep Graph Infomax. In International Conference on Learning Representations.
[45] Ning Wang, Minnan Luo, Kaize Ding, Lingling Zhang, Jundong Li, and Qinghua Zheng. 2020. Graph few-shot learning with attribute matching. In ACM International Conference on Information and Knowledge Management. 1545–1554.
[46] Song Wang, Yushun Dong, Xiao Huang, Chen Chen, and Jundong Li. 2022. FAITH: Few-Shot Graph Classification with Hierarchical Task Graphs. In International Joint Conference on Artificial Intelligence.
[47] Zhihao Wen, Yuan Fang, and Zemin Liu. 2021. Meta-inductive node classification across graphs. In International ACM SIGIR Conference on Research and Development in Information Retrieval. 1219–1228.
[48] Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and S Yu Philip. 2020. A comprehensive survey on graph neural networks. IEEE Transactions on Neural Networks and Learning Systems 32, 1 (2020), 4–24.
[49] Jun Xia, Yanqiao Zhu, Yuanqi Du, and Stan Z Li. 2022. A survey of pretraining on graphs: Taxonomy, methods, and applications. arXiv preprint arXiv:2202.07893 (2022).
[50] Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. 2019. How powerful are graph neural networks? In International Conference on Learning Representations.
[51] Zhitao Ying, Jiaxuan You, Christopher Morris, Xiang Ren, Will Hamilton, and Jure Leskovec. 2018. Hierarchical graph representation learning with differentiable pooling. Advances in Neural Information Processing Systems 31 (2018), 4805–4815.
[52] Yuning You, Tianlong Chen, Yang Shen, and Zhangyang Wang. 2021. Graph contrastive learning automated. In International Conference on Machine Learning. 12121–12132.
[53] Yuning You, Tianlong Chen, Yongduo Sui, Ting Chen, Zhangyang Wang, and Yang Shen. 2020. Graph contrastive learning with augmentations. Advances in Neural Information Processing Systems 33 (2020), 5812–5823.
[54] Jiayou Zhang, Zhirui Wang, Shizhuo Zhang, Megh Manoj Bhalerao, Yucong Liu, Dawei Zhu, and Sheng Wang. 2021. GraphPrompt: Biomedical Entity Normalization Using Graph-based Prompt Templates. arXiv preprint arXiv:2112.03002 (2021).
[55] Muhan Zhang and Yixin Chen. 2018. Link prediction based on graph neural networks. Advances in Neural Information Processing Systems 31 (2018).
[56] Muhan Zhang, Zhicheng Cui, Marion Neumann, and Yixin Chen. 2018. An end-to-end deep learning architecture for graph classification. In AAAI Conference on Artificial Intelligence, Vol. 32.
[57] Fan Zhou, Chengtai Cao, Kunpeng Zhang, Goce Trajcevski, Ting Zhong, and Ji Geng. 2019. Meta-GNN: On few-shot node classification in graph meta-learning. In ACM International Conference on Information and Knowledge Management. 2357–2360.
APPENDICES

A Algorithm and Complexity Analysis

Algorithm. We present the algorithm for prompt design and tuning of GraphPrompt in Alg. 1. In line 1, we initialize the prompt vector and the objective L_prompt. In lines 2-3, we obtain the node embeddings of the input graphs based on the pre-trained GNN. In lines 5-13, we accumulate the loss for the given tuning samples. In particular, in lines 5-6, we design the prompt for the specific task t. In lines 7-8, we calculate the subgraph representation for each class prototype. Then, in lines 9-13, we calculate and accumulate the loss to obtain the overall objective. Finally, in line 14 we optimize the prompt vector by minimizing the objective L_prompt.

Algorithm 1 Prompt Design and Tuning
Input: Graph set G = {G_j | j = 1, 2, ...}, task t-specific subgraph set S = {S_{t,x} | x = 1, 2, ...}, labeled set D = {(x_i, y_i) | i = 1, 2, ...}, class set Y, pre-trained GNN model f_Θ0 which takes in a graph and outputs its node embedding vectors.
Output: Prompt vector p_t.
1: p_t ← prompt vector initialization, L_prompt ← 0;
2: for each graph G_j ∈ G do            ⊲ Load pre-trained GNN
3:     H_j ← f_Θ0(G_j)
4: while not converged do               ⊲ Tuning iteration
5:     for each subgraph S_{t,x} ∈ S do ⊲ Prompt design, Eq. (12)
6:         s_{t,x} ← ReadOut({p_t ⊙ h_v : v ∈ V(S_{t,x})})
7:     for each class c ∈ Y do          ⊲ Class prototypical subgraph
8:         s̃_{t,c} ← mean of the node/graph embedding vectors of class c
9:     for each labeled pair (x_i, y_i) ∈ D do  ⊲ Accumulate loss, Eq. (14)
10:        Z_i ← 0
11:        for each class c ∈ Y do
12:            Z_i ← Z_i + exp(sim(s_{t,x_i}, s̃_{t,c})/τ)
13:        L_prompt ← L_prompt − ln(exp(sim(s_{t,x_i}, s̃_{t,y_i})/τ) / Z_i)
14:    Update p_t by minimizing L_prompt;
15: return p_t.
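To make the tuning procedure concrete, below is a minimal PyTorch-style sketch of Alg. 1. It is only a sketch under our own assumptions: the names (tune_prompt, pretrained_gnn, subgraph_nodes) are illustrative rather than taken from a released implementation, ReadOut is instantiated as sum pooling, and sim(·,·) as cosine similarity.

```python
import torch
import torch.nn.functional as F

def tune_prompt(pretrained_gnn, graphs, subgraph_nodes, labeled_pairs, classes,
                hidden_dim, tau=1.0, lr=0.01, epochs=100):
    """Sketch of Alg. 1: learn a task-specific prompt vector p_t.

    pretrained_gnn : frozen GNN mapping a graph to node embeddings [n, hidden_dim]
    graphs         : list of input graphs
    subgraph_nodes : dict x -> (graph index, node indices of subgraph S_{t,x})
    labeled_pairs  : list of (x, y) tuning examples
    classes        : list of class labels
    """
    # Line 1: the prompt vector is the only trainable parameter.
    p_t = torch.nn.Parameter(torch.ones(hidden_dim))
    optimizer = torch.optim.Adam([p_t], lr=lr)

    # Lines 2-3: node embeddings from the frozen pre-trained GNN.
    with torch.no_grad():
        node_emb = [pretrained_gnn(g) for g in graphs]

    for _ in range(epochs):                               # line 4: tuning iterations
        # Lines 5-6: prompt-modulated ReadOut (sum pooling assumed) per subgraph.
        s = {x: (p_t * node_emb[g_idx][nodes]).sum(dim=0)
             for x, (g_idx, nodes) in subgraph_nodes.items()}

        # Lines 7-8: class prototypes as the mean embedding of labeled examples.
        proto = {c: torch.stack([s[x] for x, y in labeled_pairs if y == c]).mean(dim=0)
                 for c in classes}

        # Lines 9-13: prototype-based loss, accumulated over the labeled pairs.
        loss = 0.0
        for x, y in labeled_pairs:
            logits = torch.stack([F.cosine_similarity(s[x], proto[c], dim=0) / tau
                                  for c in classes])
            loss = loss - torch.log_softmax(logits, dim=0)[classes.index(y)]

        # Line 14: update p_t by minimizing the objective.
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return p_t
```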
Complexity analysis. For a node v, with average degree d̄, k GNN layers, δ hops for subgraph extraction, and D hidden dimensions, the complexity of the GNN-based embedding calculation is O(D · d̄^k), and the complexity of subgraph extraction is O(d̄^δ). Thus, the embedding calculation of v's subgraph with ReadOut is O(D · d̄^k · d̄^δ), where k, δ are small constants. Furthermore, if some neighborhood sampling [13] is adopted during GNN aggregation, d̄ is a relatively small constant too.
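As a quick illustration of these bounds (with made-up numbers, not values reported in the paper), the snippet below plugs d̄ = 10, k = 3, δ = 1 and D = 32 into the three terms.

```python
# Illustrative only: substitute hypothetical values into the complexity bounds above.
d_bar, k, delta, D = 10, 3, 1, 32                    # avg. degree, GNN layers, hops, hidden dim

gnn_cost      = D * d_bar ** k                       # O(D * d^k): per-node embedding cost
subgraph_cost = d_bar ** delta                       # O(d^delta): subgraph extraction cost
readout_cost  = D * d_bar ** k * d_bar ** delta      # O(D * d^k * d^delta): subgraph embedding

print(gnn_cost, subgraph_cost, readout_cost)         # 32000 10 320000
```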
B Further Descriptions of Datasets

We provide further details of the datasets.
(1) Flickr [47] is an image sharing network collected by SNAP (https://snap.stanford.edu/data/). In particular, each node is an image, and there exists an edge between two images if they share some common properties, such as being commented on by the same user or coming from the same location. Each image belongs to one of 7 categories.
(2) PROTEINS [3] is a collection of protein graphs which include the amino acid sequence, conformation, structure, and features such as active sites of the proteins. The nodes represent the secondary structures, and each edge depicts the neighboring relation in the amino-acid sequence or in 3D space. The nodes belong to three categories, and the graphs belong to two classes.
(3) COX2 [35] is a dataset of molecular structures including 467 cyclooxygenase-2 inhibitors, in which each node is an atom, and each edge represents the chemical bond between atoms, such as single, double, triple or aromatic. All the molecules belong to two categories.
(4) ENZYMES [46] is a dataset of 600 enzymes collected from the BRENDA enzyme database. These enzymes are labeled into 6 categories according to their top-level EC enzyme classes.
(5) BZR [35] is a collection of 405 ligands for the benzodiazepine receptor, in which each ligand is represented by a graph. All these ligands belong to 2 categories.
Note that we conduct node classification on Flickr, PROTEINS and ENZYMES, since their node labels generally appear on all the graphs, which is suitable for the setting of few-shot node classification on each graph. We only choose graphs that consist of more than 50 nodes for downstream node classification, to ensure there exist sufficient labeled nodes for testing. Additionally, graph classification is conducted on PROTEINS, COX2, ENZYMES and BZR. We use the given node features in the cited datasets to initialize the input feature vectors, without additional processing.
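For readers who wish to reproduce this setup, the snippet below shows one possible way to load the four graph-classification datasets and apply the more-than-50-node filter; the use of PyTorch Geometric's TUDataset loader and the root path are our own assumptions, not part of the paper.

```python
from torch_geometric.datasets import TUDataset

ROOT = "./data"  # hypothetical download/cache directory

# Graph-classification datasets described above (Flickr comes from SNAP instead).
datasets = {name: TUDataset(root=ROOT, name=name)
            for name in ["PROTEINS", "COX2", "ENZYMES", "BZR"]}

for name, dataset in datasets.items():
    print(name, len(dataset), "graphs,", dataset.num_classes, "classes")

# For few-shot node classification, keep only graphs with more than 50 nodes,
# so that sufficient labeled nodes remain for testing.
large_proteins = [data for data in datasets["PROTEINS"] if data.num_nodes > 50]
print("PROTEINS graphs with >50 nodes:", len(large_proteins))
```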
C Further Descriptions of Baselines
In this section, we present more details of the baselines, which are chosen from three main categories.
(1) End-to-end graph neural networks.
• GCN [20]: GCN resorts to mean-pooling based neighborhood aggregation to receive messages from the neighboring nodes for node representation learning in an end-to-end manner.
• GraphSAGE [13]: GraphSAGE has a similar neighborhood aggregation mechanism to GCN, while it focuses more on the information from the node itself.
• GAT [43]: GAT also depends on neighborhood aggregation for node representation learning in an end-to-end manner, while it can assign different weights to neighbors to reweigh their contributions.
• GIN [50]: GIN employs a sum-based aggregator to replace the mean-pooling method in GCN, which is more powerful in expressing the graph structures (see the sketch after this list).
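To illustrate the difference between mean-based and sum-based aggregation mentioned in the GIN bullet, here is a small, framework-free sketch of one message-passing step. It deliberately simplifies both models (e.g., it omits GCN's degree normalization and GIN's learnable ε and MLP), so it should be read as intuition rather than a faithful implementation.

```python
import torch

def mean_aggregate(h, neighbors, v):
    """Simplified GCN-style step: average the neighbors' embeddings."""
    return h[neighbors[v]].mean(dim=0)

def sum_aggregate(h, neighbors, v):
    """Simplified GIN-style step: sum the neighbors' embeddings, which keeps
    multiset information such as how many neighbors a node has."""
    return h[v] + h[neighbors[v]].sum(dim=0)

# Toy example: nodes 3 and 4 have different numbers of identical neighbors.
h = torch.tensor([[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
neighbors = {3: torch.tensor([0]), 4: torch.tensor([0, 1, 2])}

print(mean_aggregate(h, neighbors, 3), mean_aggregate(h, neighbors, 4))  # identical
print(sum_aggregate(h, neighbors, 3), sum_aggregate(h, neighbors, 4))    # different
```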
(2) Graph pre-training models.
• DGI [44]: DGI capitalizes on a self-supervised method for pre-training, which is based on the concept of mutual information (MI). It maximizes the MI between the local augmented instances and the global representation.
• InfoGraph [38]: InfoGraph learns a graph-level representation, which maximizes the MI between the graph-level representation and substructure representations at various scales.
• GraphCL [53]: GraphCL applies different graph augmentations to exploit the structural information on the graphs, and aims to maximize the agreement between different augmentations for graph pre-training.
(3) Graph prompt models.
• GPPT [39]: GPPT pre-trains a GNN model based on the link prediction task, and employs a learnable prompt to reformulate the downstream node classification task into the same format as link prediction.
D Further Implementation Details
For the baseline GCN [20], we employ a 3-layer architecture and set the hidden dimension to 32. For GraphSAGE [13], we utilize the mean aggregator and employ a 3-layer architecture; the hidden dimension is also set to 32. For GAT [43], we employ a 2-layer architecture, set the hidden dimension to 32, and apply 4 attention heads in the first GAT layer. Similarly, for GIN [50], we also employ a 3-layer architecture and set the hidden dimension to 32. For the pre-training and prompting approaches, we use the backbones in their original papers. Specifically, for DGI [44], we use a 1-layer GCN as the backbone, set the hidden dimension to 512, and utilize PReLU as the activation function. For InfoGraph [38], we use a 3-layer GIN as the backbone and set its hidden dimension to 32. For GraphCL [53], we also employ a 3-layer GIN as its backbone and set the hidden dimension to 32; in particular, we choose the augmentations of node dropping and subgraph, with a default augmentation ratio of 0.2. For GPPT [39], we utilize a 2-layer GraphSAGE as its backbone, set its hidden dimension to 128, and utilize the mean aggregator. For our proposed GraphPrompt, we employ a 3-layer GIN as the backbone and set the hidden dimension to 32. In addition, we set δ = 1 to construct 1-hop subgraphs for the nodes.
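For reference, these settings can be gathered into a single configuration object; the sketch below does so as a plain Python dictionary whose key names are our own and purely illustrative.

```python
# Hyperparameters as stated in Appendix D; field names do not mirror any released codebase.
MODEL_CONFIGS = {
    "GCN":         {"layers": 3, "hidden_dim": 32},
    "GraphSAGE":   {"layers": 3, "hidden_dim": 32, "aggregator": "mean"},
    "GAT":         {"layers": 2, "hidden_dim": 32, "heads_first_layer": 4},
    "GIN":         {"layers": 3, "hidden_dim": 32},
    "DGI":         {"backbone": "GCN", "layers": 1, "hidden_dim": 512, "activation": "PReLU"},
    "InfoGraph":   {"backbone": "GIN", "layers": 3, "hidden_dim": 32},
    "GraphCL":     {"backbone": "GIN", "layers": 3, "hidden_dim": 32,
                    "augmentations": ["node_dropping", "subgraph"], "aug_ratio": 0.2},
    "GPPT":        {"backbone": "GraphSAGE", "layers": 2, "hidden_dim": 128, "aggregator": "mean"},
    "GraphPrompt": {"backbone": "GIN", "layers": 3, "hidden_dim": 32, "num_hops_delta": 1},
}
```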
E Further Experimental Results
Scalability study. We investigate the scalability of GraphPrompt on the dataset PROTEINS for graph classification. We divide the graphs into six groups based on their size (i.e., number of nodes); the size of the graphs in each group is approximately 50, 60, ..., 100 nodes. We sample 10 graphs from each group, and record the prompt tuning time on the 10 graphs in each epoch. The results are presented in Fig. 6. Note that we also report the tuning time for GraphPrompt-ft, a variant of GraphPrompt which fine-tunes all the parameters including the pre-trained GNN weights. We first observe that the tuning time of our GraphPrompt increases linearly as the graph size increases, demonstrating the scalability of GraphPrompt on larger graphs. In addition, compared to GraphPrompt, GraphPrompt-ft needs more tuning time, showing the inefficiency of the fine-tuning paradigm.

Figure 6: Scalability study.
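A minimal sketch of how such per-epoch tuning times could be measured is shown below; the bucketing rule, the group size, and the tune_prompt_one_epoch routine are all assumptions made for illustration rather than details taken from the paper.

```python
import time
from collections import defaultdict

def measure_tuning_time(graphs, tune_prompt_one_epoch, group_size=10):
    """Hypothetical timing harness: group graphs by size and time one tuning epoch.

    `tune_prompt_one_epoch(graphs)` is assumed to run a single epoch of prompt
    tuning on a list of graphs; it is not part of the paper's released artifacts.
    """
    groups = defaultdict(list)
    for g in graphs:
        # Bucket by rounding the node count to the nearest ten (50, 60, ..., 100).
        bucket = round(g.num_nodes / 10) * 10
        if 50 <= bucket <= 100:
            groups[bucket].append(g)

    timings = {}
    for bucket, members in sorted(groups.items()):
        sample = members[:group_size]          # 10 graphs per group, as in the study
        start = time.perf_counter()
        tune_prompt_one_epoch(sample)
        timings[bucket] = time.perf_counter() - start
    return timings
```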
Parameter sensitivity. We evaluate the sensitivity of two important hyperparameters in GraphPrompt, and show the impact in Figs. 7 and 8 for node classification and graph classification, respectively.
For the number of hops (δ) in subgraph construction, the performance on node classification gradually decreases as the number of hops increases. This is because a larger subgraph tends to bring in irrelevant information for the target node, and may suffer from the over-smoothing issue [6]. On the other hand, for graph classification, the number of hops only affects the pre-training stage as the whole graph is used in downstream classification. In this case, the number of hops does not show a clear trend, implying less impact on graph classification, since both small and large subgraphs are helpful in capturing substructure information at different scales.
For the hidden dimension, a smaller dimension is better for node classification, such as 32 and 64. For graph classification, a slightly larger dimension might be better, such as 64 and 128. Overall, 32 or 64 appears to be robust for both node and graph classification.

Figure 7: Parameter sensitivity on node classification. (a) Number of hops; (b) hidden dimension.
Figure 8: Parameter sensitivity on graph classification. (a) Number of hops; (b) hidden dimension.
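One way to reproduce such a sensitivity study is a simple grid sweep, sketched below; train_and_evaluate is a placeholder for the full pre-train, prompt-tune and evaluate pipeline, and the value grids are illustrative rather than the exact settings behind Figs. 7 and 8.

```python
from itertools import product

# Illustrative grids; not the exact values evaluated in the paper.
HOP_VALUES = [1, 2, 3, 4]
HIDDEN_DIMS = [16, 32, 64, 128, 256]

def sensitivity_sweep(train_and_evaluate, task="node_classification"):
    """Grid sweep over the two hyperparameters studied above.

    `train_and_evaluate(task, num_hops, hidden_dim)` is a hypothetical function
    returning mean accuracy over the few-shot tasks; it stands in for the full
    pre-training and prompt-tuning pipeline.
    """
    results = {}
    for num_hops, hidden_dim in product(HOP_VALUES, HIDDEN_DIMS):
        results[(num_hops, hidden_dim)] = train_and_evaluate(
            task, num_hops=num_hops, hidden_dim=hidden_dim)
    return results
```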

F Data Ethics Statement
To evaluate the efficacy of this work, we conducted experiments that only use publicly available datasets, namely, Flickr (https://snap.stanford.edu/data/web-flickr.html), PROTEINS, COX2, ENZYMES and BZR (https://chrsmrrs.github.io/datasets/), in accordance with their usage terms and conditions, if any. We further declare that no personally identifiable information was used, and no human or animal subject was involved in this research.