A Graph is Worth K Words: Euclideanizing Graph using Pure Transformer
Zhangyang Gao*1,2, Daize Dong*1, Cheng Tan1,2, Jun Xia1,2, Bozhen Hu1,2, Stan Z. Li1

*Equal contribution. 1 Westlake University, Hangzhou, China. 2 Zhejiang University, Hangzhou, China. Correspondence to: Stan Z. Li <[email protected]>.

Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

arXiv:2402.02464v3 [cs.LG] 29 May 2024

Abstract

Can we model Non-Euclidean graphs as pure language or even Euclidean vectors while retaining their inherent information? The Non-Euclidean property has posed a long-term challenge in graph modeling. Despite recent efforts by graph neural networks and graph transformers to encode graphs as Euclidean vectors, recovering the original graph from vectors remains a challenge. In this paper, we introduce GraphsGPT, featuring a Graph2Seq encoder that transforms Non-Euclidean graphs into learnable Graph Words in the Euclidean space, along with a GraphGPT decoder that reconstructs the original graph from Graph Words to ensure information equivalence. We pretrain GraphsGPT on 100M molecules and yield some interesting findings: (1) The pretrained Graph2Seq excels in graph representation learning, achieving state-of-the-art results on 8/9 graph classification and regression tasks. (2) The pretrained GraphGPT serves as a strong graph generator, demonstrated by its ability to perform both few-shot and conditional graph generation. (3) Graph2Seq+GraphGPT enables effective graph mixup in the Euclidean space, overcoming previously known Non-Euclidean challenges. (4) The edge-centric pretraining framework GraphsGPT demonstrates its efficacy in graph domain tasks, excelling in both representation and generation. Code is available at GitHub.

1. Introduction

Graphs, inherent to Non-Euclidean data, are extensively applied in scientific fields such as molecular design, social network analysis, recommendation systems, and meshed 3D surfaces (Shakibajahromi et al., 2024; Zhou et al., 2020a; Huang et al., 2022; Tan et al., 2023; Li et al., 2023a; Liu et al., 2023a; Xia et al., 2022b; Gao et al., 2022a; Wu et al., 2024a;b; Tan et al., 2023; Gao et al., 2022a;b; 2023; Lin et al., 2022a). The Non-Euclidean nature of graphs has inspired sophisticated model designs, including graph neural networks (Kipf & Welling, 2016a; Veličković et al., 2017) and graph transformers (Ying et al., 2021; Min et al., 2022). These models excel in encoding graph structures through attention maps. However, the structural encoding strategies limit the usage of the auto-regressive mechanism, thereby hindering the pure transformer from revolutionizing graph fields, akin to the success of Vision Transformers (ViT) (Dosovitskiy et al., 2020) in computer vision. We employ the pure transformer for graph modeling and address the following open questions: (1) How to eliminate the Non-Euclidean nature to facilitate graph representation? (2) How to generate Non-Euclidean graphs from Euclidean representations? (3) Could the combined graph representation and generation framework benefit from self-supervised pretraining?

We present Graph2Seq, a pure transformer encoder designed to compress the Non-Euclidean graph into a sequence of learnable tokens called Graph Words in a Euclidean form, where all nodes and edges serve as the inputs and undergo an initial transformation to form Graph Words. Different from graph transformers (Ying et al., 2021), our approach doesn't necessitate explicit encoding of the adjacency matrix and edge features in the attention map. Unlike TokenGT (Kim et al., 2022), we introduce a Codebook featuring learnable vectors for graph position encoding, leading to improved training stability and accelerated convergence. In addition, we employ a random shuffle of the position Codebook, implicitly augmenting different input orders for the same graph and offering each position vector the same opportunity of optimization to generalize to larger graphs.

We introduce GraphGPT, a groundbreaking GPT-style transformer model for graph generation. To recover the Non-Euclidean graph structure, we propose an edge-centric generation strategy that utilizes block-wise causal attention to sequentially generate the graph. Contrary to previous methods (Hu et al., 2020a; Shi et al., 2019; Peng et al., 2022) that generate nodes before predicting edges, the edge-centric technique jointly generates edges and their corresponding endpoint nodes, greatly simplifying the generative space. To align graph generation with language generation, we implement auto-regressive generation using block-wise causal attention, which enables the effective translation of Euclidean
representations into Non-Euclidean graph structures.

Leveraging the Graph2Seq encoder and GraphGPT decoder, we present GraphsGPT, an integrated end-to-end framework. This framework facilitates a natural self-supervised task to optimize the representation and generation tasks, enabling the transformation between Non-Euclidean and Euclidean data structures. We pretrain GraphsGPT on 100M molecule graphs and comprehensively evaluate it from three perspectives: Encoder, Decoder, and Encoder-Decoder. The pretrained Graph2Seq encoder is a strong graph learner for property prediction, outperforming baselines of sophisticated methodologies on 8/9 molecular classification and regression tasks. The pretrained GraphGPT decoder serves as a powerful structure prior, showcasing both few-shot and conditional generation capabilities. The GraphsGPT framework seamlessly connects the Non-Euclidean graph space to the Euclidean vector space while preserving information, facilitating tasks that are known to be challenging in the original graph space, such as graph mixup. The good performance of pretrained GraphsGPT demonstrates that our edge-centric GPT-style pretraining task offers a simple yet powerful solution for graph learning. In summary, we tame the pure transformer to convert a Non-Euclidean graph into K learnable Graph Words, showing the capabilities of the Graph2Seq encoder and GraphGPT decoder pretrained through self-supervised tasks, while also paving the way for various Non-Euclidean challenges like graph manipulation and graph mixing in the Euclidean latent space.

2. Related Work

Graph2Vec. Graph2Vec methods create the graph embedding by aggregating node embeddings via graph pooling (Lee et al., 2019; Ma et al., 2019; Diehl, 2019; Ying et al., 2018). The node embeddings could be learned by either traditional algorithms (Ahmed et al., 2013; Grover & Leskovec, 2016; Perozzi et al., 2014; Kipf & Welling, 2016b; Chanpuriya & Musco, 2020; Xiao et al., 2020), deep-learning-based graph neural networks (GNNs) (Kipf & Welling, 2016a; Hamilton et al., 2017; Wu et al., 2019; Chiang et al., 2019; Chen et al., 2018; Xu et al., 2018), or graph transformers (Ying et al., 2021; Hu et al., 2020c; Dwivedi & Bresson, 2020; Rampášek et al., 2022; Chen et al., 2022). These methods are usually designed for specific downstream tasks and cannot be used for general pretraining.

Graph Transformers. The success of extending transformer architectures from natural language processing (NLP) to computer vision (CV) has inspired recent works to apply transformer models in the field of graph learning (Ying et al., 2021; Hu et al., 2020c; Dwivedi & Bresson, 2020; Rampášek et al., 2022; Chen et al., 2022; Wu et al., 2021b; Kreuzer et al., 2021; Min et al., 2022). To encode the graph prior, these approaches introduce structure-inspired position embeddings and attention mechanisms. For instance, Dwivedi & Bresson (2020); Hussain et al. (2021) adopt Laplacian eigenvectors and SVD vectors of the adjacency matrix as position encoding vectors. Dwivedi & Bresson (2020); Mialon et al. (2021); Ying et al. (2021); Zhao et al. (2021) enhance the attention computation based on the adjacency matrix. Recently, Kim et al. (2022) introduced a decoupled position encoding method that empowers the pure transformer as a strong graph learner without the need for expensive eigenvector computation or modifications to the attention computation.

Graph Self-Supervised Learning. The exploration of self-supervised pretext tasks for learning expressive graph representations has garnered significant research interest (Wu et al., 2021a; Liu et al., 2022; 2021c; Xie et al., 2022). Contrastive (You et al., 2020; Zeng & Xie, 2021; Qiu et al., 2020; Zhu et al., 2020; 2021; Peng et al., 2020b; Liu et al., 2023c;b; Lin et al., 2022b; Xia et al., 2022a; Zou et al., 2022) and predictive (Peng et al., 2020a; Jin et al., 2020; Hou et al., 2022; Tian et al., 2023; Hwang et al., 2020; Wang et al., 2021) objectives have been extensively explored, leveraging strategies from the fields of NLP and CV. However, the discussion around generative pretext tasks (Hu et al., 2020a; Zhang et al., 2021) for graphs is limited, particularly due to the Non-Euclidean nature of graph data, which has led to few instances of pure transformer utilization in graph generation. This paper introduces an innovative approach by framing graph generation as analogous to language generation, thus enabling the use of a pure transformer to generate graphs as a novel self-supervised pretext task.

Motivation. The pure transformer has revolutionized the modeling of texts (Devlin et al., 2018; Brown et al., 2020; Achiam et al., 2023), images (Dosovitskiy et al., 2020; Alayrac et al., 2022; Dehghani et al., 2023; Liu et al., 2021d), and point clouds (Li et al., 2023b; Yu et al., 2022; Pang et al., 2022) in both representation and generation tasks. However, due to the Non-Euclidean nature, extending transformers to graphs typically necessitates the explicit incorporation of structural information into the attention computation. Such a constraint results in the following challenges:

1. Generation Challenge. When generating new nodes or bonds, the underlying graph structure changes, which forces full-attention mechanisms to recompute all graph embeddings from scratch. Moreover, an additional link predictor is required to predict potential edges from a |V| × |V| search space.

2. Non-Euclidean Challenge. Previous methods do not provide Euclidean prototypes to fully describe graphs. The inherent Non-Euclidean nature poses challenges for tasks like graph manipulation and mixing.

3. Representation Challenge. Limited by the generation challenge, traditional graph self-supervised learning
Figure 1: The overall framework of GraphsGPT. The Graph2Seq encoder transforms the Non-Euclidean graph into Euclidean Graph Words, which are further fed into the GraphGPT decoder to auto-regressively generate the original Non-Euclidean graph. Both Graph2Seq and GraphGPT employ the pure transformer as their structure.
methods have typically focused on reconstructing corrupted sub-features and sub-structures. Overlooking learning from the entire graph potentially limits the ability to capture the global topology.

To tackle these challenges, we propose GraphsGPT, which uses a pure transformer to convert the Non-Euclidean graph into a sequence of Euclidean vectors (Graph2Seq) while ensuring informative equivalence (GraphGPT). For the first time, we bridge the gap between graph and sequence modeling in both representation and generation tasks.

3. Method

3.1. Overall Framework

Figure 1 outlines the comprehensive architecture of GraphsGPT, which consists of a Graph2Seq encoder and a GraphGPT decoder. The Graph2Seq converts Non-Euclidean graphs into a series of learnable feature vectors, named Graph Words. Following this, the GraphGPT utilizes these Graph Words to auto-regressively reconstruct the original Non-Euclidean graph. Both components, Graph2Seq and GraphGPT, incorporate the pure transformer structure and are pretrained via a GPT-style pretext task.

3.2. Graph2Seq Encoder

Flexible Token Sequence (FTSeq). Denote G = (V, E) as the input graph, where V = {v_1, · · · , v_n} and E = {e_1, · · · , e_{n'}} are sets of nodes and edges associated with features X_V ∈ R^{n×C} and X_E ∈ R^{n'×C}, respectively. With a slight abuse of notation, we use e^l_i and e^r_i to represent the left and right endpoint nodes of edge e_i. For example, we have e_1 = (e^l_1, e^r_1) = (v_1, v_2) in Figure 2. Inspired by (Kim et al., 2022), we flatten the nodes and edges in a graph into a Flexible Token Sequence (FTSeq) consisting of:

1. Graph Tokens. The stacked node and edge features are represented by X = [X_V; X_E] ∈ R^{(n+n')×C}. We utilize a token Codebook B_t to generate node and edge features, incorporating 118+92 learnable vectors. Specifically, we consider the atom type and bond type, deferring the exploration of other properties, such as the electric charge and chirality, for simplicity.

2. Graph Position Encodings (GPE). The graph structure is implicitly encoded through decoupled position encodings, utilizing a position Codebook B_p comprising m learnable embeddings {o_1, o_2, · · · , o_m} ∈ R^{m×d_p}. The position encodings of node v_i and edge e_i are expressed as g_{v_i} = [o_{v_i}, o_{v_i}] and g_{e_i} = [o_{e^l_i}, o_{e^r_i}], respectively. Notably, g^l_{v_i} = g^r_{v_i} = o_{v_i}, g^l_{e_i} = o_{e^l_i}, and g^r_{e_i} = o_{e^r_i}. To learn permutation-invariant features and generalize to larger, unseen graphs, we randomly shuffle the position Codebook, giving each vector an equal optimization opportunity.

3. Segment Encodings (Seg). We introduce two learnable segment tokens, namely [node] and [edge], to designate the token types within the FTSeq.
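The sketch below is a minimal, illustrative PyTorch rendering of this token construction (it is not the released implementation; the class name `FTSeqEmbedder`, the concatenation-then-projection of token and position features, the maximum codebook size, and the node-then-edge stacking order are all our assumptions):

```python
import torch
import torch.nn as nn

NUM_ATOM_TYPES, NUM_BOND_TYPES = 118, 92   # token Codebook sizes quoted in the paper
HIDDEN, GPE_DIM, MAX_NODES = 512, 128, 256  # hidden/GPE widths; MAX_NODES (m) is an assumption


class FTSeqEmbedder(nn.Module):
    """Illustrative sketch of Section 3.2: graph tokens + GPE + segment encodings."""

    def __init__(self):
        super().__init__()
        self.token_book = nn.Embedding(NUM_ATOM_TYPES + NUM_BOND_TYPES, HIDDEN)  # B_t
        self.pos_book = nn.Embedding(MAX_NODES, GPE_DIM)                          # B_p
        self.segment = nn.Embedding(2, HIDDEN)                                    # [node]=0, [edge]=1
        # project [token ; o_left ; o_right] to the model width (an assumed fusion scheme)
        self.proj = nn.Linear(HIDDEN + 2 * GPE_DIM, HIDDEN)

    def forward(self, atom_types, bond_types, bond_index):
        # atom_types: (n,) atom-type ids; bond_types: (n',) bond-type ids
        # bond_index: (n', 2) endpoint node ids (e^l_i, e^r_i)
        n = atom_types.numel()
        # randomly shuffle the position Codebook so every position vector gets optimized equally
        perm = torch.randperm(MAX_NODES)
        node_pos = self.pos_book(perm[:n])                      # o_{v_i}
        tok_nodes = self.token_book(atom_types)
        tok_edges = self.token_book(NUM_ATOM_TYPES + bond_types)
        gpe_nodes = torch.cat([node_pos, node_pos], dim=-1)     # g_{v_i} = [o_{v_i}, o_{v_i}]
        gpe_edges = torch.cat([node_pos[bond_index[:, 0]],      # g_{e_i} = [o_{e^l_i}, o_{e^r_i}]
                               node_pos[bond_index[:, 1]]], dim=-1)
        nodes = self.proj(torch.cat([tok_nodes, gpe_nodes], -1)) + self.segment.weight[0]
        edges = self.proj(torch.cat([tok_edges, gpe_edges], -1)) + self.segment.weight[1]
        # the paper interleaves node/edge tokens in the FTSeq order; here we simply stack them
        return torch.cat([nodes, edges], dim=0)                 # (n + n', HIDDEN)


# usage on the C-C-C(=O)-O toy graph of Figure 1 (atom/bond ids are placeholders)
emb = FTSeqEmbedder()
x = emb(torch.tensor([5, 5, 5, 7]), torch.tensor([0, 0, 1]),
        torch.tensor([[0, 1], [1, 2], [2, 3]]))
print(x.shape)  # torch.Size([7, 512])
```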
(1) Edge Generation; (2) Left Node Attachment; (3) Right Node Placement.

We provide a brief illustration of the three steps in Figure 3. The step-wise classification complexity of generating an edge is O(|D_e|). Once the edge is obtained, the model automatically infers the left node attachment and right node placement, relieving the generation from the additional burden. Generation will stop if e_{i+1} = [EOS]. Note that the edge position encoding [o_{e^l_{i+1}}, o_{e^r_{i+1}}] remains undetermined. This information will affect the connection of the generated edge to the existing graph, as well as the determination of new atoms, i.e., left atom attachment and right atom placement.

Training Token Generation. The first node and next edge prediction tasks are optimized by the cross-entropy loss:

L_token = − Σ_i y_i · log p_i.   (6)
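As a minimal illustration of Eq. (6), the snippet below (a sketch with placeholder shapes and vocabulary size, not the authors' code) applies the standard cross-entropy to step-wise decoder logits over the first-node / next-edge vocabulary:

```python
import torch
import torch.nn.functional as F

vocab_size, batch, steps = 256, 4, 16                    # all sizes are illustrative placeholders
logits = torch.randn(batch, steps, vocab_size)           # decoder predictions p_i (pre-softmax)
targets = torch.randint(0, vocab_size, (batch, steps))   # ground-truth token ids y_i

# Eq. (6): L_token = -sum_i y_i * log p_i, averaged over all prediction steps
loss = F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
print(loss.item())
```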
Figure 4: Block-wise causal attention with grey cells indicating masked positions. Graph Words contribute to the generation through full attention, serving as prefix prompts.
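To make the mask in Figure 4 concrete, here is a sketch (our own reconstruction, not the released code; the block assignment of [BOS] to the prefix and the exact grouping of tokens into blocks are assumptions): the K Graph Words act as fully visible prefix prompts, while generated FTSeq tokens may only attend to blocks up to and including their own.

```python
import torch

def block_wise_causal_mask(num_prefix: int, block_ids: torch.Tensor) -> torch.Tensor:
    """Return a boolean mask where True means 'query may attend to key'.

    num_prefix: number of prefix tokens (Graph Words and [BOS]) visible to everyone.
    block_ids:  (L_gen,) block index of each generated token; tokens produced in the same
                edge-centric step (an edge plus its new endpoint atoms) share a block id.
    """
    L = num_prefix + block_ids.numel()
    mask = torch.zeros(L, L, dtype=torch.bool)
    mask[:, :num_prefix] = True                # every token sees the prefix prompts
    b = block_ids.unsqueeze(0)                 # (1, L_gen)
    mask[num_prefix:, num_prefix:] = b.T >= b  # causality at block granularity
    return mask

# toy example loosely matching Figure 4: 3 Graph Words + [BOS] as prefix, then blocks
# [node C] [edge C-C, node C] [edge C=O, node O] [edge C-C]
blocks = torch.tensor([0, 1, 1, 2, 2, 3])
print(block_wise_causal_mask(num_prefix=4, block_ids=blocks).int())
```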
4. Experiments

4.1. Experiment Settings

We extensively conduct experiments to assess GraphsGPT, delving into the following questions:

• Representation (Q1): Can Graph2Seq effectively learn expressive graph representations through pretraining?

• Generation (Q2): Could pretrained GraphGPT serve as a strong structural prior model for graph generation?

MOSES & ZINC-C (Generation). For few-shot generation, we evaluate GraphsGPT on the MOSES (Polykovskiy et al., 2020) dataset, which is designed for benchmarking generative models. Following MOSES, we compute molecular properties (LogP, SA, QED) and scaffolds for molecules collected from ZINC, obtaining ZINC-C. The dataset provides a standardized set of molecules in SMILES format.

Model Configurations. We adopt the transformer as our model structure. Both the Graph2Seq encoder and the GraphGPT decoder consist of 8 transformer blocks with 8 attention heads. For all layers, we use Swish (Ramachandran et al., 2017) as the activation function and RMSNorm (Zhang & Sennrich, 2019) as the normalizing function. The hidden size is set to 512, and the length of the Graph Position Encoding (GPE) is 128. The total number of parameters of the model is 50M. Denoting K as the number of Graph Words, we pretrained multiple versions of GraphsGPT, referred to as GraphsGPT-KW. We mainly use GraphsGPT-1W, while we find that GraphsGPT-8W has better encoding-decoding consistency (Section 6, Q2).

Training Details. The GraphsGPT model undergoes training for 100K steps with a global batch size of 1024 on 8 NVIDIA A100s, utilizing the AdamW optimizer with 0.1 weight decay, where β1 = 0.9 and β2 = 0.95. The maximum learning rate is 1e−4 with 5K warmup steps, and the final learning rate decays to 1e−5 with cosine scheduling.
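A sketch of the reported optimization recipe (AdamW with β = (0.9, 0.95) and weight decay 0.1, linear warmup over 5K steps to a 1e−4 peak, cosine decay to 1e−5 over 100K steps); the model variable is a stand-in and the exact schedule form is our assumption:

```python
import math
import torch

model = torch.nn.Linear(512, 512)          # stand-in for the 50M-parameter GraphsGPT
max_lr, min_lr = 1e-4, 1e-5
warmup, total_steps = 5_000, 100_000

optimizer = torch.optim.AdamW(model.parameters(), lr=max_lr,
                              betas=(0.9, 0.95), weight_decay=0.1)

def lr_lambda(step: int) -> float:
    # linear warmup followed by cosine decay from max_lr down to min_lr
    if step < warmup:
        return step / max(1, warmup)
    progress = (step - warmup) / max(1, total_steps - warmup)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return (min_lr + (max_lr - min_lr) * cosine) / max_lr

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```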
Table 1: Results of molecular property prediction. We report the mean (standard deviation) metrics of 10 runs with standard
scaffold splitting (not random scaffold splitting). The best results and the second best are highlighted.
Metrics: ROC-AUC ↑ (Tox21, ToxCast, Sider, HIV, BBBP, Bace); RMSD ↓ (ESOL, FreeSolv, Lipo).
Columns: Tox21 | ToxCast | Sider | HIV | BBBP | Bace | ESOL | FreeSolv | Lipo
# Molecules: 7,831 | 8,575 | 1,427 | 41,127 | 2,039 | 1,513 | 1,128 | 642 | 4,200
# Tasks: 12 | 617 | 27 | 1 | 1 | 1 | 1 | 1 | 1
No pretrain
GINs 74.6 (0.4) 61.7 (0.5) 58.2 (1.7) 75.5 (0.8) 65.7 (3.3) 72.4 (3.8) 1.050 (0.008) 2.082 (0.082) 0.683 (0.016)
Graph2Seq-1W 74.0 (0.4) 62.6 (0.3) 66.6 (1.1) 73.6 (3.4) 68.3 (1.4) 77.3 (1.2) 0.953 (0.025) 1.936 (0.246) 0.907 (0.021)
Relative gain to GIN -0.8% +1.4% +12.6% -2.6% +3.8% +6.3% +10.2% +7.5% -24.7%

Pretrain (baselines)
InfoGraph (Sun et al., 2019) 73.3 (0.6) 61.8 (0.4) 58.7 (0.6) 75.4 (4.3) 68.7 (0.6) 74.3 (2.6)
GPT-GNN (Hu et al., 2020b) 74.9 (0.3) 62.5 (0.4) 58.1 (0.3) 58.3 (5.2) 64.5 (1.4) 77.9 (3.2)
EdgePred (Hamilton et al., 2017) 76.0 (0.6) 64.1 (0.6) 60.4 (0.7) 64.1 (3.7) 67.3 (2.4) 77.3 (3.5)
ContextPred (Hu et al., 2019) 73.6 (0.3) 62.6 (0.6) 59.7 (1.8) 74.0 (3.4) 70.6 (1.5) 78.8 (1.2)
GraphLoG (Xu et al., 2021) 75.0 (0.6) 63.4 (0.6) 59.6 (1.9) 75.7 (2.4) 68.7 (1.6) 78.6 (1.0)
G-Contextual (Rong et al., 2020) 75.0 (0.6) 62.8 (0.7) 58.7 (1.0) 60.6 (5.2) 69.9 (2.1) 79.3 (1.1)
G-Motif (Rong et al., 2020) 73.6 (0.7) 62.3 (0.6) 61.0 (1.5) 77.7 (2.7) 66.9 (3.1) 73.0 (3.3)
AD-GCL (Suresh et al., 2021) 74.9 (0.4) 63.4 (0.7) 61.5 (0.9) 77.2 (2.7) 70.7 (0.3) 76.6 (1.5)
JOAO (You et al., 2021) 74.8 (0.6) 62.8 (0.7) 60.4 (1.5) 66.6 (3.1) 66.4 (1.0) 73.2 (1.6) 1.120 (0.003) 0.708 (0.004)
SimGRACE (Xia et al., 2022a) 74.4 (0.3) 62.6 (0.7) 60.2 (0.9) 75.5 (2.0) 71.2 (1.1) 74.9 (2.0)
GraphCL (You et al., 2020) 75.1 (0.7) 63.0 (0.4) 59.8 (1.3) 77.5 (3.8) 67.8 (2.4) 74.6 (2.1) 0.947 (0.038) 2.233 (0.261) 0.739 (0.009)
GraphMAE (Hou et al., 2022) 75.2 (0.9) 63.6 (0.3) 60.5 (1.2) 76.5 (3.0) 71.2 (1.0) 78.2 (1.5)
3D InfoMax (Stärk et al., 2022) 74.5 (0.7) 63.5 (0.8) 56.8 (2.1) 62.7 (3.3) 69.1 (1.2) 78.6 (1.9) 0.894 (0.028) 2.337 (0.227) 0.695 (0.012)
GraphMVP (Liu et al., 2021b) 74.9 (0.8) 63.1 (0.2) 60.2 (1.1) 79.1 (2.8) 70.8 (0.5) 79.3 (1.5) 1.029 (0.033) 0.681 (0.010)
MGSSL (Zhang et al., 2021) 75.2 (0.6) 63.3 (0.5) 61.6 (1.0) 77.1 (4.5) 68.8 (0.6) 78.8 (0.9)
AttrMask (Hu et al., 2019) 75.1 (0.9) 63.3 (0.6) 60.5 (0.9) 73.5 (4.3) 65.2 (1.4) 77.8 (1.8) 1.100 (0.006) 2.764 (0.002) 0.739 (0.003)
MolCLR (Wang et al., 2022) 75.0 (0.2) 58.9 (1.4) 78.1 (0.5) 72.2 (2.1) 82.4 (0.9) 1.271 (0.040) 2.594 (0.249) 0.691 (0.004)
Graphformer (Rong et al., 2020) 74.3 (0.1) 65.4 (0.4) 64.8 (0.6) 62.5 (0.9) 70.0 (0.1) 82.6 (0.7) 0.983 (0.090) 2.176 (0.052) 0.817 (0.008)
Mole-BERT (Xia et al., 2023) 76.8 (0.5) 64.3 (0.2) 62.8 (1.1) 78.9 (3.0) 71.9 (1.6) 80.8 (1.4) 1.015 (0.030) 0.676 (0.017)
Relative gain to GIN (best pretrained baseline) +2.9% +6.0% +11.3% +4.8% +9.9% +14.1% +14.9% -4.5% +1.0%

Pretrain (ours)
Graph2Seq-1W 76.9 (0.3) 65.4 (0.5) 68.2 (0.9) 79.4 (3.9) 72.8 (1.5) 83.4 (1.0) 0.860 (0.024) 1.797 (0.237) 0.716 (0.019)
Relative gain to GIN +3.1% +6.0% +17.2% +5.2% +10.8% +15.2% +18.1% +13.7% -4.8%
Relative gain to Graph2Seq-1W (no pretrain) +3.9% +4.5% +2.4% +7.9% +6.6% +7.9% +9.8% +7.2% +21.1%
Could pretrained GraphGPT serve as a strong structural prior model for graph generation?

GraphGPT Generates Novel Molecules with High Validity. We assess pretrained GraphGPT-1W on the MOSES dataset through few-shot generation without finetuning. By extracting Graph Word embeddings {h_i}_{i=1}^M from M training molecules, we construct a mixture Gaussian distribution p(h, s) = Σ_{i=1}^M N(h_i, sI), where s is the standard variance. We sample M molecules from p(h, s) and report the validity, uniqueness, novelty and IntDiv in Table 2. We observe that GraphGPT generates novel molecules with high validity. Without any finetuning, GraphGPT outperforms MolGPT on validity, uniqueness, novelty, and diversity. Definitions of the metrics can be found in Appendix B.
Table 2 (excerpt):
GraphGPT-1W (s=0.5) 0.993 0.996 0.334 0.856 0.848
GraphGPT-1W (s=1.0) 0.978 0.997 0.871 0.860 0.857
GraphGPT-1W (s=2.0) 0.972 1.0 1.0 0.850 0.847

GraphGPT-C is a Controllable Molecule Generator. Following (Bagal et al., 2021), we finetune GraphsGPT-1W on 100M molecules from ZINC-C with properties and scaffolds as prefix inputs, obtaining GraphsGPT-1W-C. We assess whether the model could generate molecules satisfying specified properties. We present summarized results in Figure 5 and Table 3, while providing the full results in the appendix due to the space limit. The evaluation is conducted using the scaffold “c1ccccc1”, demonstrating that GraphGPT can effectively control the properties of generated molecules. Table 3 further confirms that unsupervised pretraining enhances the controllability and validity of GraphGPT. More details can be found in Appendix B.2.

Figure 5: Property distribution of generated molecules on different conditions using GraphsGPT-1W-C. “Dataset” denotes the distribution of the training dataset (ZINC-C). Panels: (a) QED, (b) logP.

Table 3: Comparison with MolGPT on different properties. “MAD” denotes the Mean Absolute Deviation in generated molecule properties compared to the oracle value. “SD” denotes the Standard Deviation of the generated property.
Pretrain | Metric | QED=0.5 | SA=0.7 | logP=0.0 | Avg.
MolGPT (pretrain ✗):
  MAD ↓ 0.081 0.024 0.304 0.136
  SD ↓ 0.065 0.022 0.295 0.127
  Validity ↑ 0.985 0.975 0.982 0.981
GraphGPT-1W-C (pretrain ✗):
  MAD ↓ 0.041 0.012 0.103 0.052
  SD ↓ 0.079 0.055 0.460 0.198
  Validity ↑ 0.988 0.995 0.980 0.988
GraphGPT-1W-C (pretrain ✓):
  MAD ↓ 0.032 0.002 0.017 0.017
  SD ↓ 0.080 0.042 0.404 0.175
  Validity ↑ 0.996 0.995 0.994 0.995

Several prior works (Zhou et al., 2020b; Park et al., 2022; Wu et al., 2022; Zhang et al., 2023; Guo & Mao, 2023) have attempted to address this challenge by introducing complex hand-crafted rules. Additionally, G-mixup (Han et al., 2022) leverages estimated graphons for generating mixed graphs. To the best of our knowledge, there are currently no learnable models for mixing in Euclidean space while generating new graphs.

Table 4: Graph mixup results. We compare Graph2Seq with G-mixup on multiple tasks from MoleculeNet.
mixup | HIV ↑ | BBBP ↑ | Bace ↑ | Tox21 ↑ | ToxCast ↑ | Sider ↑
G-Mix (✗) 77.1 68.4 75.9
G-Mix (✓) 77.1 70.2 77.8
G-Mix gain +0.0 +1.8 +1.9
Ours (✗) 79.4 72.8 83.4 76.9 65.4 68.2
Ours (✓) 79.8 73.4 85.4 77.2 65.5 68.9
Ours gain +0.4 +0.6 +2.0 +0.3 +0.1 +0.7

GraphsGPT is a Competitive Graph Mixer. We mixup the learned Graph Words encoded by Graph2Seq-1W, then
generate the mixed graph using GraphGPT-1W. Formally, the Graph Words of G_i and G_j are W_i = Graph2Seq(G_i) and W_j = Graph2Seq(G_j), and the mixed graph is G_mix = GraphGPT(λW_i + (1 − λ)W_j). We conduct experiments on MoleculeNet and show the results in Table 4. We observe that the straightforward latent mixup outperforms the elaborately designed G-mixup proposed in the ICML'22 outstanding paper (Han et al., 2022).

Table 6: Self-consistency of decoded sequences. “C@N” denotes that the decoded results of N out of the total 1024 permutations for each molecule are consistent. “Avg.” denotes the average consistency over all test data.
Models | C@256 | C@512 | C@768 | C@1024 | Avg.
GraphsGPT-1W 100% 99.2% 94.1% 77.3% 96.1%
GraphsGPT-8W 100% 99.4% 96.5% 85.3% 97.9%

Due to the page limit, more results are moved to the appendix.
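A sketch of the latent mixup described above, treating the pretrained encoder and decoder as black-box callables (the `graph2seq` and `graphgpt` parameter names are placeholders for the pretrained models):

```python
def graph_mixup(graph_i, graph_j, lam: float, graph2seq, graphgpt):
    """Mix two graphs in the Euclidean Graph Word space and decode the result.

    graph2seq(graph) -> (K, d) Graph Words; graphgpt(words) -> generated graph.
    """
    w_i = graph2seq(graph_i)            # W_i = Graph2Seq(G_i)
    w_j = graph2seq(graph_j)            # W_j = Graph2Seq(G_j)
    w_mix = lam * w_i + (1.0 - lam) * w_j
    return graphgpt(w_mix)              # G_mix = GraphGPT(λ W_i + (1-λ) W_j)
```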
Impact Statement

This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here. GraphsGPT provides a new paradigm for graph representation, generation and manipulation. The Non-Euclidean to Euclidean transformation may affect broader downstream graph applications, such as graph translation and optimization. The methodology could be extended to other modalities, such as image and sequence.

References

Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.

Ahmed, A., Shervashidze, N., Narayanamurthy, S., Josifovski, V., and Smola, A. J. Distributed large-scale natural graph factorization. In Proceedings of the 22nd international conference on World Wide Web, pp. 37–48, 2013.

Alayrac, J.-B., Donahue, J., Luc, P., Miech, A., Barr, I., Hasson, Y., Lenc, K., Mensch, A., Millican, K., Reynolds, M., et al. Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35:23716–23736, 2022.

Bagal, V., Aggarwal, R., Vinod, P., and Priyakumar, U. D. Molgpt: molecular generation using a transformer-decoder model. Journal of Chemical Information and Modeling, 62(9):2064–2076, 2021.

Brown, N., Fiscato, M., Segler, M. H., and Vaucher, A. C. Guacamol: benchmarking models for de novo molecular design. Journal of chemical information and modeling, 59(3):1096–1108, 2019.

Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.

Chanpuriya, S. and Musco, C. Infinitewalk: Deep network embeddings as laplacian embeddings with a nonlinearity. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1325–1333, 2020.

Chen, D., O'Bray, L., and Borgwardt, K. Structure-aware transformer for graph representation learning. In International Conference on Machine Learning, pp. 3469–3489. PMLR, 2022.

Chen, J., Ma, T., and Xiao, C. Fastgcn: fast learning with graph convolutional networks via importance sampling. arXiv preprint arXiv:1801.10247, 2018.

Chiang, W.-L., Liu, X., Si, S., Li, Y., Bengio, S., and Hsieh, C.-J. Cluster-gcn: An efficient algorithm for training deep and large graph convolutional networks. In Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining, pp. 257–266, 2019.

Dehghani, M., Djolonga, J., Mustafa, B., Padlewski, P., Heek, J., Gilmer, J., Steiner, A. P., et al. Scaling vision transformers to 22 billion parameters. In ICML, pp. 7480–7512. PMLR, 2023.

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805, 2018.

Diehl, F. Edge contraction pooling for graph neural networks. arXiv preprint arXiv:1905.10990, 2019.

Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.

Dwivedi, V. P. and Bresson, X. A generalization of transformer networks to graphs. arXiv preprint arXiv:2012.09699, 2020.

Gao, Z., Tan, C., and Li, S. Z. Pifold: Toward effective and efficient protein inverse folding. In The Eleventh International Conference on Learning Representations, 2022a.

Gao, Z., Tan, C., Wu, L., and Li, S. Z. Simvp: Simpler yet better video prediction. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 3170–3180, 2022b.

Gao, Z., Tan, C., Chen, X., Zhang, Y., Xia, J., Li, S., and Li, S. Z. Kw-design: Pushing the limit of protein design via knowledge refinement. In The Twelfth International Conference on Learning Representations, 2023.

Grover, A. and Leskovec, J. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 855–864, 2016.

Guo, H. and Mao, Y. Interpolating graph pair to regularize graph classification. In AAAI, volume 37, pp. 7766–7774, 2023.

Hamilton, W., Ying, Z., and Leskovec, J. Inductive representation learning on large graphs. Advances in neural information processing systems, 30, 2017.

Han, X., Jiang, Z., Liu, N., and Hu, X. G-mixup: Graph data augmentation for graph classification. In ICML, pp. 8230–8248. PMLR, 2022.

Hou, Z., Liu, X., Cen, Y., Dong, Y., Yang, H., Wang, C., and Tang, J. Graphmae: Self-supervised masked graph autoencoders. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp. 594–604, 2022.

Hu, W., Liu, B., Gomes, J., Zitnik, M., Liang, P., Pande, V., and Leskovec, J. Strategies for pre-training graph neural networks. arXiv preprint arXiv:1905.12265, 2019.

Hu, Z., Dong, Y., Wang, K., Chang, K.-W., and Sun, Y. Gpt-gnn: Generative pre-training of graph neural networks. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1857–1867, 2020a.

Hu, Z., Dong, Y., Wang, K., Chang, K.-W., and Sun, Y. Gpt-gnn: Generative pre-training of graph neural networks. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1857–1867, 2020b.

Hu, Z., Dong, Y., Wang, K., and Sun, Y. Heterogeneous graph transformer. In Proceedings of the web conference 2020, pp. 2704–2710, 2020c.

Huang, Y., Peng, X., Ma, J., and Zhang, M. 3dlinker: an e(3) equivariant variational autoencoder for molecular linker design. arXiv preprint arXiv:2205.07309, 2022.

Hussain, M. S., Zaki, M. J., and Subramanian, D. Edge-augmented graph transformers: Global self-attention is enough for graphs. arXiv preprint arXiv:2108.03348, 2021.

Hwang, D., Park, J., Kwon, S., Kim, K., Ha, J.-W., and Kim, H. J. Self-supervised auxiliary learning with meta-paths for heterogeneous graphs. Advances in Neural Information Processing Systems, 33:10294–10305, 2020.

Irwin, J. J. and Shoichet, B. K. Zinc - a free database of commercially available compounds for virtual screening. Journal of chemical information and modeling, 45(1):177–182, 2005.

Jin, W., Derr, T., Liu, H., Wang, Y., Wang, S., Liu, Z., and Tang, J. Self-supervised learning on graphs: Deep insights and new direction. arXiv preprint arXiv:2006.10141, 2020.

Jo, J., Lee, S., and Hwang, S. J. Score-based generative modeling of graphs via the system of stochastic differential equations. In International Conference on Machine Learning, pp. 10362–10383. PMLR, 2022.

Kim, J., Nguyen, D., Min, S., Cho, S., Lee, M., Lee, H., and Hong, S. Pure transformers are powerful graph learners. Advances in Neural Information Processing Systems, 35:14582–14595, 2022.

Kipf, T. N. and Welling, M. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016a.

Kipf, T. N. and Welling, M. Variational graph auto-encoders. arXiv preprint arXiv:1611.07308, 2016b.

Kong, L., Cui, J., Sun, H., Zhuang, Y., Prakash, B. A., and Zhang, C. Autoregressive diffusion model for graph generation. In International conference on machine learning, pp. 17391–17408. PMLR, 2023.

Kreuzer, D., Beaini, D., Hamilton, W., Létourneau, V., and Tossou, P. Rethinking graph transformers with spectral attention. Advances in Neural Information Processing Systems, 34:21618–21629, 2021.

Lee, J., Lee, I., and Kang, J. Self-attention graph pooling. In International conference on machine learning, pp. 3734–3743. PMLR, 2019.

Li, X., Sun, L., Ling, M., and Peng, Y. A survey of graph neural network based recommendation in social networks. Neurocomputing, pp. 126441, 2023a.

Li, Z., Gao, Z., Tan, C., Li, S. Z., and Yang, L. T. General point model with autoencoding and autoregressive. arXiv preprint arXiv:2310.16861, 2023b.

Lin, H., Gao, Z., Xu, Y., Wu, L., Li, L., and Li, S. Z. Conditional local convolution for spatio-temporal meteorological forecasting. In Proceedings of the AAAI conference on artificial intelligence, volume 36, pp. 7470–7478, 2022a.

Lin, Z., Tian, C., Hou, Y., and Zhao, W. X. Improving graph collaborative filtering with neighborhood-enriched contrastive learning. In Proceedings of the ACM Web Conference 2022, pp. 2320–2329, 2022b.

Liu, C., Li, Y., Lin, H., and Zhang, C. Gnnrec: Gated graph neural network for session-based social recommendation model. Journal of Intelligent Information Systems, 60(1):137–156, 2023a.

Liu, M., Yan, K., Oztekin, B., and Ji, S. Graphebm: Molecular graph generation with energy-based models. arXiv preprint arXiv:2102.00546, 2021a.

Liu, S., Wang, H., Liu, W., Lasenby, J., Guo, H., and Tang, J. Pre-training molecular graph representation with 3d geometry. arXiv preprint arXiv:2110.07728, 2021b.

Liu, X., Zhang, F., Hou, Z., Mian, L., Wang, Z., Zhang, J., and Tang, J. Self-supervised learning: Generative or contrastive. IEEE transactions on knowledge and data engineering, 35(1):857–876, 2021c.

Liu, Y., Jin, M., Pan, S., Zhou, C., Zheng, Y., Xia, F., and Philip, S. Y. Graph self-supervised learning: A survey. IEEE Transactions on Knowledge and Data Engineering, 35(6):5879–5900, 2022.

Liu, Y., Yang, X., Zhou, S., Liu, X., Wang, S., Liang, K., Tu, W., and Li, L. Simple contrastive graph clustering. IEEE Transactions on Neural Networks and Learning Systems, 2023b.

Liu, Y., Yang, X., Zhou, S., Liu, X., Wang, Z., Liang, K., Tu, W., Li, L., Duan, J., and Chen, C. Hard sample aware network for contrastive deep graph clustering. In Proceedings of the AAAI conference on artificial intelligence, volume 37, pp. 8914–8922, 2023c.

Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., and Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In ICCV, pp. 10012–10022, 2021d.

Luo, Y., Yan, K., and Ji, S. Graphdf: A discrete flow model for molecular graph generation. In International conference on machine learning, pp. 7192–7203. PMLR, 2021.

Ma, Y., Wang, S., Aggarwal, C. C., and Tang, J. Graph convolutional networks with eigenpooling. In Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining, pp. 723–731, 2019.

Martinkus, K., Loukas, A., Perraudin, N., and Wattenhofer, R. Spectre: Spectral conditioning helps to overcome the expressivity limits of one-shot graph generators. In International Conference on Machine Learning, pp. 15159–15179. PMLR, 2022.

McInnes, L. and Healy, J. Accelerated hierarchical density based clustering. In Data Mining Workshops (ICDMW), 2017 IEEE International Conference on, pp. 33–42. IEEE, 2017.

McInnes, L., Healy, J., and Melville, J. Umap: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426, 2018.

Mialon, G., Chen, D., Selosse, M., and Mairal, J. Graphit: Encoding graph structure in transformers. arXiv preprint arXiv:2106.05667, 2021.

Min, E., Chen, R., Bian, Y., Xu, T., Zhao, K., Huang, W., Zhao, P., Huang, J., Ananiadou, S., and Rong, Y. Transformer for graphs: An overview from architecture perspective. arXiv preprint arXiv:2202.08455, 2022.

Niu, C., Song, Y., Song, J., Zhao, S., Grover, A., and Ermon, S. Permutation invariant graph generation via score-based generative modeling. In International Conference on Artificial Intelligence and Statistics, pp. 4474–4484. PMLR, 2020.

Pang, Y., Wang, W., Tay, F. E., Liu, W., Tian, Y., and Yuan, L. Masked autoencoders for point cloud self-supervised learning. In ECCV, pp. 604–621. Springer, 2022.

Park, J., Shim, H., and Yang, E. Graph transplant: Node saliency-guided graph mixup with local structure preservation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pp. 7966–7974, 2022.

Peng, X., Luo, S., Guan, J., Xie, Q., Peng, J., and Ma, J. Pocket2mol: Efficient molecular sampling based on 3d protein pockets. In International Conference on Machine Learning, pp. 17644–17655. PMLR, 2022.

Peng, Z., Dong, Y., Luo, M., Wu, X.-M., and Zheng, Q. Self-supervised graph representation learning via global context prediction. arXiv:2003.01604, 2020a.

Peng, Z., Huang, W., Luo, M., Zheng, Q., Rong, Y., Xu, T., and Huang, J. Graph representation learning via graphical mutual information maximization. In Proceedings of The Web Conference 2020, pp. 259–270, 2020b.

Perozzi, B., Al-Rfou, R., and Skiena, S. Deepwalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 701–710, 2014.

Polykovskiy, D., Zhebrak, A., Sanchez-Lengeling, B., Golovanov, S., Tatanov, O., Belyaev, S., Kurbanov, R., Artamonov, A., Aladinskiy, V., Veselov, M., et al. Molecular sets (moses): a benchmarking platform for molecular generation models. Frontiers in pharmacology, 11:565644, 2020.

Qiu, J., Chen, Q., Dong, Y., Zhang, J., Yang, H., Ding, M., Wang, K., and Tang, J. Gcc: Graph contrastive coding for graph neural network pre-training. In Proceedings of the 26th ACM SIGKDD international conference on knowledge discovery & data mining, pp. 1150–1160, 2020.

Ramachandran, P., Zoph, B., and Le, Q. V. Searching for activation functions. arXiv:1710.05941, 2017.

Rampášek, L., Galkin, M., Dwivedi, V. P., Luu, A. T., Wolf, G., and Beaini, D. Recipe for a general, powerful, scalable graph transformer. Advances in Neural Information Processing Systems, 35:14501–14515, 2022.

Rong, Y., Bian, Y., Xu, T., Xie, W., Wei, Y., Huang, W., and Huang, J. Self-supervised graph transformer on large-scale molecular data. Advances in Neural Information Processing Systems, 33:12559–12571, 2020.

Shakibajahromi, B., Kim, E., and Breen, D. E. Rimeshgnn: A rotation-invariant graph neural network for mesh classification. In WACV, pp. 3150–3160, 2024.

Shi, C., Xu, M., Zhu, Z., Zhang, W., Zhang, M., and Tang, J. Graphaf: a flow-based autoregressive model for molecular graph generation. In International Conference on Learning Representations, 2019.

Shi, C., Xu, M., Zhu, Z., Zhang, W., Zhang, M., and Tang, J. Graphaf: a flow-based autoregressive model for molecular graph generation. arXiv preprint arXiv:2001.09382, 2020.

Stärk, H., Beaini, D., Corso, G., Tossou, P., Dallago, C., Günnemann, S., and Liò, P. 3d infomax improves gnns for molecular property prediction. In ICML, pp. 20479–20502. PMLR, 2022.

Sun, F.-Y., Hoffmann, J., Verma, V., and Tang, J. Infograph: Unsupervised and semi-supervised graph-level representation learning via mutual information maximization. arXiv preprint arXiv:1908.01000, 2019.

Suresh, S., Li, P., Hao, C., and Neville, J. Adversarial graph augmentation to improve graph contrastive learning. Advances in Neural Information Processing Systems, 34:15920–15933, 2021.

Tan, C., Gao, Z., and Li, S. Z. Target-aware molecular graph generation. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 410–427. Springer, 2023.

Tian, Y., Dong, K., Zhang, C., Zhang, C., and Chawla, N. V. Heterogeneous graph masked autoencoders. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pp. 9997–10005, 2023.

Veličković, P., Cucurull, G., Casanova, A., Romero, A., Lio, P., and Bengio, Y. Graph attention networks. arXiv preprint arXiv:1710.10903, 2017.

Vignac, C., Krawczuk, I., Siraudin, A., Wang, B., Cevher, V., and Frossard, P. Digress: Discrete denoising diffusion for graph generation. arXiv preprint arXiv:2209.14734, 2022.

Wang, P., Agarwal, K., Ham, C., Choudhury, S., and Reddy, C. K. Self-supervised learning of contextual embeddings for link prediction in heterogeneous networks. In Proceedings of the web conference 2021, pp. 2946–2957, 2021.

Wang, Y., Wang, J., Cao, Z., and Barati Farimani, A. Molecular contrastive learning of representations via graph neural networks. NMI, 4(3):279–287, 2022.

Wu, F., Souza, A., Zhang, T., Fifty, C., Yu, T., and Weinberger, K. Simplifying graph convolutional networks. In ICML, pp. 6861–6871. PMLR, 2019.

Wu, L., Lin, H., Tan, C., Gao, Z., and Li, S. Z. Self-supervised learning on graphs: Contrastive, generative, or predictive. IEEE Transactions on Knowledge and Data Engineering, 2021a.

Wu, L., Xia, J., Gao, Z., et al. Graphmixup: Improving class-imbalanced node classification by reinforcement mixup and self-supervised context prediction. In ECML-PKDD, pp. 519–535. Springer, 2022.

Wu, L., Huang, Y., Tan, C., Gao, Z., Hu, B., Lin, H., Liu, Z., and Li, S. Z. Psc-cpi: Multi-scale protein sequence-structure contrasting for efficient and generalizable compound-protein interaction prediction. arXiv preprint arXiv:2402.08198, 2024a.

Wu, L., Tian, Y., Huang, Y., Li, S., Lin, H., Chawla, N. V., and Li, S. Z. Mape-ppi: Towards effective and efficient protein-protein interaction prediction via microenvironment-aware protein embedding. arXiv preprint arXiv:2402.14391, 2024b.

Wu, Z., Ramsundar, B., Feinberg, E. N., Gomes, J., Geniesse, C., Pappu, A. S., Leswing, K., and Pande, V. Moleculenet: a benchmark for molecular machine learning. Chemical science, 9(2):513–530, 2018.

Wu, Z., Jain, P., Wright, M., Mirhoseini, A., Gonzalez, J. E., and Stoica, I. Representing long-range context for graph neural networks with global attention. NeurIPS, 34:13266–13279, 2021b.

Xia, J., Wu, L., Chen, J., Hu, B., and Li, S. Z. Simgrace: A simple framework for graph contrastive learning without data augmentation. In Proceedings of the ACM Web Conference 2022, pp. 1070–1079, 2022a.

Xia, J., Zhao, C., Hu, B., Gao, Z., Tan, C., Liu, Y., Li, S., and Li, S. Z. Mole-bert: Rethinking pre-training graph neural networks for molecules. In The Eleventh International Conference on Learning Representations, 2022b.

Xia, J., Zhao, C., Hu, B., Gao, Z., Tan, C., Liu, Y., Li, S., and Li, S. Z. Mole-bert: Rethinking pre-training graph neural networks for molecules. In The Eleventh International Conference on Learning Representations, 2023.

Xiao, W., Zhao, H., Zheng, V. W., and Song, Y. Vertex-reinforced random walk for network embedding. In Proceedings of the 2020 SIAM International Conference on Data Mining, pp. 595–603. SIAM, 2020.

Xie, Y., Xu, Z., Zhang, J., Wang, Z., and Ji, S. Self-supervised learning of graph neural networks: A unified review. IEEE transactions on pattern analysis and machine intelligence, 45(2):2412–2429, 2022.

Xu, K., Hu, W., Leskovec, J., and Jegelka, S. How powerful are graph neural networks? arXiv preprint arXiv:1810.00826, 2018.

Xu, M., Wang, H., Ni, B., Guo, H., and Tang, J. Self-supervised graph-level representation learning with local and global structure. In International Conference on Machine Learning, pp. 11548–11558. PMLR, 2021.

Ying, C., Cai, T., Luo, S., Zheng, S., Ke, G., He, D., Shen, Y., and Liu, T.-Y. Do transformers really perform badly for graph representation? Advances in Neural Information Processing Systems, 34:28877–28888, 2021.

Ying, Z., You, J., Morris, C., Ren, X., Hamilton, W., and Leskovec, J. Hierarchical graph representation learning with differentiable pooling. Advances in neural information processing systems, 31, 2018.

You, Y., Chen, T., Sui, Y., Chen, T., Wang, Z., and Shen, Y. Graph contrastive learning with augmentations. NeurIPS, 33:5812–5823, 2020.

You, Y., Chen, T., Shen, Y., and Wang, Z. Graph contrastive learning automated. In International Conference on Machine Learning, pp. 12121–12132. PMLR, 2021.

Zhang, Z., Liu, Q., Wang, H., Lu, C., and Lee, C.-K. Motif-based graph self-supervised learning for molecular property prediction. Advances in Neural Information Processing Systems, 34:15870–15882, 2021.

Zhao, J., Li, C., Wen, Q., Wang, Y., Liu, Y., Sun, H., Xie, X., and Ye, Y. Gophormer: Ego-graph transformer for node classification. arXiv preprint arXiv:2110.13094, 2021.

Zhou, J., Cui, G., Hu, S., Zhang, Z., Yang, C., Liu, Z., Wang, L., Li, C., and Sun, M. Graph neural networks: A review of methods and applications. AI open, 1:57–81, 2020a.

Zhou, J., Shen, J., and Xuan, Q. Data augmentation for graph classification. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management, pp. 2341–2344, 2020b.

Zhu, Y., Xu, Y., Yu, F., Liu, Q., Wu, S., and Wang, L. Deep graph contrastive representation learning. arXiv preprint arXiv:2006.04131, 2020.

Zhu, Y., Xu, Y., Yu, F., et al. Graph contrastive learning with adaptive augmentation. In Proceedings of the Web Conference 2021, pp. 2069–2080, 2021.

Zou, D., Wei, W., Mao, X.-L., et al. Multi-level cross-view contrastive learning for knowledge-aware recommender system. In SIGIR, pp. 1358–1368, 2022.
A. Representation
When applying graph mixup, the training samples are drawn from the original data with probability p_self and from mixed data with probability (1 − p_self). The mixup hyperparameters α and p_self are shown in Table 7.
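A sketch of this sampling rule (the Beta(α, α) draw for the mixing coefficient is our assumption, by analogy with standard mixup; it is not stated explicitly in the text):

```python
import random

def sample_training_pair(dataset, p_self: float, alpha: float):
    """Return (graph_a, graph_b, lam): lam = 1 keeps the original sample,
    otherwise two graphs are mixed with lam ~ Beta(alpha, alpha) (assumption)."""
    g_a = random.choice(dataset)
    if random.random() < p_self:
        return g_a, g_a, 1.0                      # draw from the original data
    g_b = random.choice(dataset)
    lam = random.betavariate(alpha, alpha)        # mixing coefficient
    return g_a, g_b, lam
```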
B. Generation
B.1. Few-Shot Generation
We introduce the metrics (Bagal et al., 2021) of few-shot generation as follows:
• Validity: the fraction of generated molecules that are valid. We use RDKit to check the validity of molecules. Validity
measures how well the model has learned the SMILES grammar and the valency of atoms.
• Uniqueness: the fraction of valid generated molecules that are unique. Low uniqueness highlights repetitive molecule
generation and a low level of distribution learning by the model.
• Novelty: the fraction of valid unique generated molecules that are not in the training set. Low novelty is a sign of
overfitting. We do not want the model to memorize the training data.
• Internal Diversity (IntDiv_p): measures the diversity of the generated molecules, which is a metric specially designed to check for mode collapse or whether the model keeps generating similar structures. It uses the power (p) mean of the Tanimoto similarity (T) between the fingerprints of all pairs of molecules (s1, s2) in the generated set (S); see the sketch after this list.

  IntDiv_p(S) = 1 − ( (1/|S|²) Σ_{s1,s2∈S} T(s1, s2)^p )^{1/p}   (14)
• QED (Quantitative Estimate of Drug-likeness): a measure that quantifies the “drug-likeness” of a molecule based on
its pharmacokinetic profile, ranging from 0 to 1.
• SA (Synthetic Accessibility): a score that predicts the difficulty of synthesizing a molecule based on multiple factors.
Lower SA scores indicate easier synthesis.
• logP (Partition Coefficient): a key parameter in studies of drug absorption and distribution in the body that measures
a molecule's hydrophobicity.
• Scaffold: the core structure of a molecule, which typically includes rings and the atoms that connect them. It provides
a framework upon which different functional groups can be added to create new molecules.
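A sketch of computing IntDiv_p from Eq. (14) with RDKit Morgan fingerprints (the fingerprint radius and bit size are our own choices, not prescribed by the metric definition above):

```python
from rdkit import Chem
from rdkit.Chem import AllChem, DataStructs

def internal_diversity(smiles_list, p: int = 1) -> float:
    """IntDiv_p(S) = 1 - ((1/|S|^2) * sum_{s1,s2 in S} T(s1, s2)^p)^(1/p)."""
    mols = [Chem.MolFromSmiles(s) for s in smiles_list]
    fps = [AllChem.GetMorganFingerprintAsBitVect(m, 2, nBits=2048) for m in mols if m]
    n = len(fps)
    total = 0.0
    for i in range(n):
        for j in range(n):
            total += DataStructs.TanimotoSimilarity(fps[i], fps[j]) ** p
    return 1.0 - (total / (n * n)) ** (1.0 / p)

print(internal_diversity(["CCO", "c1ccccc1", "CC(=O)O"]))
```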
In order to integrate conditional information into our model, we set aside an additional 100M molecules from the ZINC
database for finetuning, which we denote as the dataset DG . For each molecule G ∈ DG , we compute its property values
vQED , vSA and vlogP and normalize them to 0 mean and 1.0 variance, yielding v̄QED , v̄SA and v̄logP .
The Graph2Seq model takes all properties and scaffolds as inputs and transforms them into the Graph Word sequence
W = [w1 , w2 , · · · , wk ]. The additional property and scaffold information enables Graph2Seq to encode Graph Words with
conditions. The Graph Words are then decoded by GraphGPT following the same implementation as in Section
3.3. In summary, the inputs of the Graph2Seq encoder comprise:
1. Graph Word Prompts [[GW 1], · · · , [GW k]], which are identical to the word prompts discussed in Section 3.2.
2. Property Token Sequence [[QED], [SA], [logP]], which is encoded from the normalized property values v̄QED ,
v̄SA and v̄logP .
3. Scaffold Flexible Token Sequence FTSeqScaf , representing the sequence of the scaffold for the molecule.
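A schematic of how these conditional encoder inputs could be assembled (the class name `ConditionalPrefix`, the value-plus-type embedding of the property tokens, and all shapes are our own assumptions; the released implementation may differ):

```python
import torch
import torch.nn as nn

class ConditionalPrefix(nn.Module):
    """Builds [GW 1..k] + [QED][SA][logP] + FTSeq(scaffold) for conditional Graph2Seq (sketch)."""

    def __init__(self, k: int = 1, hidden: int = 512):
        super().__init__()
        self.graph_word_prompts = nn.Parameter(torch.randn(k, hidden))   # [GW 1..k]
        self.property_proj = nn.Linear(1, hidden)                        # embeds a normalized value
        self.property_type = nn.Embedding(3, hidden)                     # QED / SA / logP slots

    def forward(self, qed: float, sa: float, logp: float, scaffold_ftseq: torch.Tensor):
        # property values are assumed pre-normalized to zero mean and unit variance
        values = torch.tensor([[qed], [sa], [logp]], dtype=torch.float32)
        prop_tokens = self.property_proj(values) + self.property_type.weight
        return torch.cat([self.graph_word_prompts, prop_tokens, scaffold_ftseq], dim=0)

prefix = ConditionalPrefix(k=1)
seq = prefix(0.2, -1.1, 0.5, scaffold_ftseq=torch.randn(6, 512))
print(seq.shape)  # torch.Size([10, 512])
```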
For the sake of comparison, we followed Bagal et al. (2021) and trained a MolGPT model on the GuacaMol dataset (Brown
et al., 2019) using QED, SA, logP, and scaffolds as conditions for 10 epochs. We compare the conditional generation ability
by measuring the MAD (Mean Absolute Deviation), SD (Standard Deviation), validity and uniqueness. Table 8 presents the
full results, underscoring the superior control of GraphGPT-1W-C over molecular properties.
Pretrain | Metric | QED=0.5 | QED=0.7 | QED=0.9 | SA=0.7 | SA=0.8 | SA=0.9 | logP=0.0 | logP=2.0 | logP=4.0 | Avg.
MolGPT (pretrain ✗):
  MAD ↓ 0.081 0.082 0.097 0.024 0.019 0.013 0.304 0.239 0.286 0.127
  SD ↓ 0.065 0.066 0.092 0.022 0.016 0.013 0.295 0.232 0.258 0.118
  Validity ↑ 0.985 0.985 0.984 0.975 0.988 0.995 0.982 0.983 0.982 0.984
GraphGPT-1W-C (pretrain ✗):
  MAD ↓ 0.041 0.031 0.077 0.012 0.028 0.031 0.103 0.189 0.201 0.079
  SD ↓ 0.079 0.077 0.121 0.055 0.062 0.070 0.460 0.656 0.485 0.229
  Validity ↑ 0.988 0.995 0.991 0.995 0.991 0.998 0.980 0.992 0.991 0.991
GraphGPT-1W-C (pretrain ✓):
  MAD ↓ 0.032 0.033 0.051 0.002 0.009 0.022 0.017 0.190 0.268 0.069
  SD ↓ 0.080 0.075 0.090 0.042 0.037 0.062 0.463 0.701 0.796 0.261
  Validity ↑ 0.996 0.998 0.999 0.995 0.999 0.996 0.994 0.990 0.992 0.995
Table 8: Overall comparison between GraphGPT-1W-C and MolGPT on different properties with scaffold SMILES
“c1ccccc1”. “MAD” denotes the Mean Absolute Deviation of the property value in generated molecules compared to the
oracle value. “SD” denotes the Standard Deviation of the generated property.
Panels: (a) QED, (b) SA, (c) logP.
Figure 6: Property distribution of generated molecules on different conditions using GraphGPT-1W-C.
C. Graph Words
C.1. Clustering
The efficacy of the Graph2Seq encoder hinges on its ability to effectively map Non-Euclidean graphs into Euclidean latent
features in a structured manner. To investigate this, we visualize the latent Graph Words space using sampled features,
encoding 32,768 molecules with Graph2Seq-1W and employing HDBSCAN (McInnes & Healy, 2017) for clustering the
Graph Words.
Figures 7 and 8 respectively illustrate the clustering results and the molecules within each cluster. An intriguing observation
emerges from these results: the Graph2Seq model exhibits a propensity to cluster molecules with similar properties (e.g.,
identical functional groups in clusters 0, 1, 4, 5; similar structures in clusters 2, 3, 7; or similar Halogen atoms in cluster 3)
within the latent Graph Words space. This insight could potentially inform and inspire future research.
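A sketch of this visualization pipeline (assuming an (N, d) array of encoded Graph Word vectors; the random stand-in array and the HDBSCAN/UMAP hyperparameters are our own choices, while the paper encodes 32,768 molecules):

```python
import numpy as np
import hdbscan
import umap

graph_words = np.random.randn(8192, 512).astype(np.float32)    # stand-in for encoded Graph Words

clusterer = hdbscan.HDBSCAN(min_cluster_size=200)               # density-based clustering in latent space
labels = clusterer.fit_predict(graph_words)

reducer = umap.UMAP(n_neighbors=15, min_dist=0.1)               # 2-D embedding for plotting
coords = reducer.fit_transform(graph_words)
print(coords.shape, np.unique(labels)[:10])
```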
Figure 7: UMAP (McInnes et al., 2018) visualization of the clustering result on the Graph Words of Graph2Seq-1W.
Figure 8: Molecules within each cluster of Figure 7. Panels (a)–(h) correspond to clusters 0–7.
Figure 9: Graph interpolation results with different source and target molecules using GraphsGPT-1W. The numbers denote
the values of α for corresponding results.
Graph Hybridization. With Graph2Seq, a graph G can be transformed into a fixed-length Graph Word sequence
W = [w1 , · · · , wk ], where each Graph Word is expected to encapsulate distinct semantic information. We investigate the
representation of Graph Words by hybridizing them among different inputs.
Specifically, consider a source molecule Gs and a target molecule Gt , along with their Graph Words Ws = [ws1 , · · · , wsk ]
and W_t = [w_t^1, · · · , w_t^k]. Given the index set I, we replace a subset of source Graph Words with the corresponding target Graph Words, w_s^i ← w_t^i for i ∈ I, yielding the hybrid Graph Words W_h = [w_h^1, · · · , w_h^k], where:

w_h^i = w_t^i if i ∈ I, and w_h^i = w_s^i if i ∉ I.   (15)
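A minimal sketch of Eq. (15); the `graph2seq` and `graphgpt_decode` callables in the usage comment are placeholders for the pretrained encoder and decoder:

```python
import torch

def hybridize(words_source: torch.Tensor, words_target: torch.Tensor, index_set) -> torch.Tensor:
    """Eq. (15): copy the target Graph Words at positions in index_set, keep the source elsewhere."""
    words_hybrid = words_source.clone()
    idx = torch.as_tensor(sorted(index_set), dtype=torch.long)
    words_hybrid[idx] = words_target[idx]
    return words_hybrid

# e.g. with GraphsGPT-8W: replace Graph Words 2 and 5 of the source with those of the target
# g_hybrid = graphgpt_decode(hybridize(graph2seq(g_source), graph2seq(g_target), {2, 5}))
```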
We then decode W_h using GraphGPT back into the graph and observe the changes in the molecules. The results are depicted
in Figure 10. From these results, we observe that hybridizing specific Graph Words can lead to the introduction of certain
features from the target molecule into the source molecule, such as the Sulfhydryl functional group. This suggests that
Graph Words could potentially be used as a tool for manipulating specific features in molecular structures, which could have
significant implications for molecular design and optimization tasks.
Figure 10: Hybridization results of Graph Words. The figure shows the changes in the source molecule after hybridizing
specific Graph Words from the target molecule. We use GraphsGPT-8W which has 8 Graph Words in total.